This analytical computational platform provides practical guidance for microbiologists, ecologists, and epidemiologists interested in bacterial population genomics. Specifically, the work presented here demonstrated how to perform: i) phylogeny-guided mapping of hierarchical genotypes; ii) frequency-based analysis of genotypes; iii) kinship and clonality analyses; iv) identification of lineage differentiating accessory loci.
Routine and systematic use of bacterial whole-genome sequencing (WGS) is enhancing the accuracy and resolution of epidemiological investigations carried out by Public Health laboratories and regulatory agencies. Large volumes of publicly available WGS data can be used to study pathogenic populations at a large scale. Recently, a freely available computational platform called ProkEvo was published to enable reproducible, automated, and scalable hierarchical-based population genomic analyses using bacterial WGS data. This implementation of ProkEvo demonstrated the importance of combining standard genotypic mapping of populations with mining of accessory genomic content for ecological inference. In particular, the work highlighted here used ProkEvo-derived outputs for population-scaled hierarchical analyses using the R programming language. The main objective was to provide a practical guide for microbiologists, ecologists, and epidemiologists by showing how to: i) use a phylogeny-guided mapping of hierarchical genotypes; ii) assess frequency distributions of genotypes as a proxy for ecological fitness; iii) determine kinship relationships and genetic diversity using specific genotypic classifications; and iv) map lineage differentiating accessory loci. To enhance reproducibility and portability, R markdown files were used to demonstrate the entire analytical approach. The example dataset contained genomic data from 2,365 isolates of the zoonotic foodborne pathogen Salmonella Newport. Phylogeny-anchored mapping of hierarchical genotypes (Serovar -> BAPS1 -> ST -> cgMLST) revealed the population genetic structure, highlighting sequence types (STs) as the keystone differentiating genotype. Across the three most dominant lineages, ST5 and ST118 shared a common ancestor more recently than with the highly clonal ST45 phylotype. ST-based differences were further highlighted by the distribution of accessory antimicrobial resistance (AMR) loci. Lastly, a phylogeny-anchored visualization was used to combine hierarchical genotypes and AMR content to reveal the kinship structure and lineage-specific genomic signatures. Combined, this analytical approach provides some guidelines for conducting heuristic bacterial population genomic analyses using pan-genomic information.
The increasing use of bacterial whole-genome sequencing (WGS) as a basis for routine surveillance and epidemiological inquiry by Public Health laboratories and regulatory agencies has substantially enhanced pathogen outbreak investigations1,2,3,4. As a consequence, large volumes of de-identified WGS data are now publicly available and can be used to study aspects of the population biology of pathogenic species at an unprecedented scale, including studies based on: population structures, genotype frequencies, and gene/allele frequencies across multiple reservoirs, geographical regions, and types of environments5. The most commonly used WGS-guided epidemiological inquiries are based on analyses using only the shared core-genomic content, where the shared (conserved) content alone is used for genotypic classification (e.g., variant calling), and these variants become the basis for epidemiological analysis and tracing1,2,6,7. Typically, bacterial core-genome-based genotyping is carried out with multi-locus sequence typing (MLST) approaches using seven to a few thousand loci8,9,10. These MLST-based strategies encompass mapping of pre-assembled or assembled genomic sequences onto highly curated databases, thereby combining allelic information into reproducible genotypic units for epidemiological and ecological analysis11,12. For instance, this MLST-based classification can generate genotypic information at two levels of resolution: lower-level sequence types (STs) or ST lineages (7 loci), and higher-level core-genome MLST (cgMLST) variants (~ 300-3,000 loci)10.
MLST-based genotypic classification is computationally portable and highly reproducible between laboratories, making it widely accepted as an accurate sub-typing approach beneath the bacterial species level13,14. However, bacterial populations are structured with species-specific varying degrees of clonality (i.e., genotypic homogeneity), complex patterns of hierarchical kinship between genotypes15,16,17, and a wide range of variation in the distribution of accessory genomic content18,19. Thus, a more holistic approach goes beyond discrete classifications into MLST genotypes and incorporates the hierarchical relationships of genotypes at different scales of resolution, along with mapping of accessory genomic content onto genotypic classifications, which facilitates population-based inference18,20,21. Moreover, analyses can also focus on shared patterns of inheritance of accessory genomic loci among even distantly-related genotypes21,22. Overall, the combined approach enables agnostic interrogation of relationships between population structure and the distribution of specific genomic compositions (e.g., loci) among geospatial or environmental gradients. Such an approach can yield both fundamental and practical information about the ecological characteristics of specific populations that may, in turn, explain their tropism and dispersion patterns across reservoirs, such as food animals or humans.
This systems-based hierarchical population-oriented approach demands large volumes of WGS data for sufficient statistical power to predict distinguishable genomic signatures. Consequently, the approach requires a computational platform capable of processing many thousands of bacterial genomes at once. Recently, ProkEvo was developed and is a freely available, automated, portable, and scalable bioinformatics platform that allows for integrative hierarchical-based bacterial population analyses, including pan-genomic mapping20. ProkEvo allows for the study of moderate-to-large scale bacterial datasets while providing a framework to generate testable and inferable epidemiological and ecological hypotheses and phenotypic predictions that can be customized by the user. This work complements that pipeline in providing a guide on how to utilize ProkEvo-derived output files as input for analyses and interpretation of hierarchical population classifications and accessory genomic mining. The case study presented here utilized the population of Salmonella enterica lineage I zoonotic serovar S. Newport as an example and was specifically aimed at providing practical guidelines for microbiologists, ecologists, and epidemiologists on how to: i) use an automated phylogeny-dependent approach to map hierarchical genotypes; ii) assess the frequency distribution of genotypes as a proxy for evaluating ecological fitness; iii) determine lineage-specific degrees of clonality using independent statistical approaches; and iv) map lineage-differentiating AMR loci as an example of how to mine accessory genomic content in the context of the population structure. More broadly, this analytical approach provides a generalizable framework to perform a population-based genomic analysis at a scale that can be used to infer evolutionary and ecological patterns regardless of the targeted species.
1. Prepare input files
NOTE: The protocol is available here – https://github.com/jcgneto/jove_bacterial_population_genomics/tree/main/code. The protocol assumes that the researcher has specifically used ProkEvo (or a comparable pipeline) to get the necessary outputs available in this Figshare repository (https://figshare.com/account/projects/116625/articles/15097503 – login credentials are required – The user must create a free account to have file access!). Of note, ProkEvo automatically downloads genomic sequences from the NCBI-SRA repository and only requires a .txt file containing a list of genome identifications as an input20, and the one used for this work on S. Newport USA isolates is provided here (https://figshare.com/account/projects/116625/articles/15097503?file=29025729). Detailed information on how to install and use this bacterial genomics platform is available here (https://github.com/npavlovikj/ProkEvo/wiki/2.-Quick-start)20
2. Download and install the statistical software and integrated development environment (IDE) application
3. Install and activate data science libraries
4. Data entry and analysis
NOTE: A detailed information on each step of this analysis can be found in the available script (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/data_analysis_R_code.Rmd). However, here are some important points to be considered:
5. Conduct analyses and generate visualizations
NOTE: A detailed description of each step needed to produce all the analysis and visualizations can be found in the markdown file for this paper (https://github.com/jcgneto/jove_bacterial_population_genomics/tree/main/code). Code for each figure is separated in chunks and the entire script should be run sequentially. Additionally, the code for each main and supplementary figure is provided as a separate file (see Supplementary File 1 and Supplementary File 2). Here are some essential points (with snippets of code) to be considered while generating each main and supplementary figures.
By utilizing the computational platform ProkEvo for population genomics analyses, the first step in bacterial WGS data mining is comprised of examining the hierarchical population structure in the context of a core-genome phylogeny (Figure 1). In the case of S. enterica lineage I, as exemplified by the S. Newport dataset, the population is hierarchically structured as follows: serovar (lowest level of resolution), BAPS1 sub-groups or haplotypes, ST lineages, and cgMLST variants (highest level of resolution)20. This phylogeny-guided analysis of the hierarchical population structure specifically allows for the following points to be examined: i) phylogenetic distribution of SISTR-based misclassified genomes into other serovars in the case of Salmonella; ii) genetic or kinship structure of the population; iii) pattern of diversification at different levels of genotypic resolution; iv) identification of major genotypic unit(s) underlying an evolutionary, ecological, or epidemiological patterns; v) ancestral relationships between ST lineages through BAPS1 sub-groups or haplotype composition, and across cgMLST variants within ST lineages; and vi) partial view of the degree of genotypic homogeneity of an ST lineage by the cgMLST variant composition.
Figure 1: Phylogeny-guided mapping of hierarchical genotypes for the S. Newport population. A core-genome phylogeny (black centered circle) was used to map hierarchical genotypes, including serovar (lowest level of resolution – innermost colored circle), BAPS level 1 (BAPS1) sub-groups or haplotypes, ST lineages, and cgMLST variants (highest level of resolution – outermost colored circle). Serovars were grouped into Newport (S. Newport) or "Other serovars" based on the SISTR algorithmic classification of genomes, which utilized core-genome MLST information, and ran as part of the computational platform ProkEvo. BAPS1 agnostically stratifies the population into sub-groups or clusters of related haplotypes using core-genomic data within ProkEvo. BAPS1 is hierarchically placed between serovar and ST lineages because it accurately captured the ancestral relationships between STs. ST lineages are formed based on canonical MLST analysis using seven genome-scattered loci. Only major or most frequent STs (proportion >1%) were depicted in the graph. Lastly, only the most frequent cgMLST variants (proportion >3.5%) were used to show the entire hierarchical structure for the S. Newport population (n = 2,365 USA isolates only). The category "Other STs" or "Other cgMLSTs" comprised of minor or low-frequency lineages or variants, respectively, with thresholding done arbitrarily that should be set empirically or statistically based on the dataset. Please click here to view a larger version of this figure.
Relative frequencies of all hierarchical genotypes were then used to evaluate the overall distribution and most frequently observed classifications (i.e., genotypes) (Figure 2). In Figure 2C–D, less frequent (minor) ST lineages or cgMLST variants were aggregated as "Other STs" or "Other cgMLSTs", respectively, in order to facilitate data visualization (dimensionality reduction). If sampling is systematically done across environments and/or hosts and is appropriately statistically powered, frequency distribution can become a proxy for ecological fitness. That is, the most frequent lineages or variants could then be predicted to have higher fitness, ensuing further investigation to determine the causative genetic determinants underlying such a quantitative trait6,30.
Figure 2: Proportion of S. Newport hierarchical genotypes at different levels of resolution. (A) Serovars are phenotypes of the S. enterica lineage I population that can be predicted solely from core-genomic data due to the inheritable high linkage disequilibrium between core-loci and O and H antigenic-coding loci (surface proteins). When using ProkEvo, Salmonella genomes are automatically classified to serovars using the SISTR program. Although only S. Newport (Newport) genomes from NCBI were putatively downloaded, some have been classified as "Other serovars" within ProkEvo. Approximately 2% (48 out of 2,365) of all genomes were classified as other than S. Newport serovar. (B) The proportion of BAPS level 1 (BAPS1) sub-groups or haplotypes. BAPS1 is inserted between serovar and ST lineages in the hierarchical scheme because it accurately and agnostically captured the ancestral relationships between STs. (C) The proportion of major ST lineages depicted only STs that were > 1% in relative frequency. Minor STs were grouped as "Other STs". (D)The proportion of major cgMLST variants showed only four predominant cgMLSTs that were >3% in relative frequency. The remainder cgMLSTs were grouped as "Other cgMLSTs". (B-D) Genomes classified by SISTR as "Other serovars" (2.03%) were filtered out of the data prior to plotting BAPS1, ST, and cgMLST relative frequencies. (C-D) Thresholds used to plot both ST and cgMLST data were arbitrarily defined and should be established empirically on a case-by-case basis. Please click here to view a larger version of this figure.
Alternatively, a scatter-plot can be used to assess the distribution and proportion of both ST lineages or cgMLST variants, without any data aggregation (Supplementary Figure 1). This use of a scatter-plot is particularly useful for ST lineages and cgMLST variants because of the typical occurrence of hundredths, if not thousands, classifications for both genotypes. This sparse distribution commonly does not occur for the serovar and BAPS1 levels of resolution, because they are at a lower level of resolution with sequences inheritably collapsing into a few sub-groups or categories.
Next, the ancestral relationships between STs were examined using a nested approach which encompasses assessing the relative frequency of ST lineages by BAPS1 sub-groups or haplotypes (Figure 3). ST lineages that belonged to the same BAPS1 sub-group were more likely to have shared a common ancestor more recently than with other STs (i.e., ST5 and ST118 vs. ST45). Similarly, by examining the distribution of cgMLST variants within ST lineages, the degree of genotypic heterogeneity across STs can be captured, while assessing their genetic composition and revealing the ancestral relationship between cgMLSTs (i.e., closely related cgMLST variants belong to the same ST lineage or clonal complex) (Supplementary Figure 2).
Figure 3: Distribution of ST lineages nested within BAPS1 sub-groups for the S. Newport population. This plot depicts the ST lineage distribution within each BAPS level 1 sub-group or haplotype, excluding genomes classified as "Other serovars" (2.03% of the entire data). Major STs (proportion >1%) for each BAPS1 sub-group are highlighted in each graph. The larger the circle diameter, the higher the proportion for the particular ST lineage. Please click here to view a larger version of this figure.
Given that the pattern of S. Newport population diversification appeared to be mostly driven by ST composition (Figure 1), two statistical approaches were used to assess the ST-based degree of clonality (i.e., genetic homogeneity), including Simpson's D index of diversity (Supplementary Figure 3), and the distribution of BAPS sub-groups or haplotypes using BAPS levels 1-6 (Supplementary Figure 4). Assessing the degree of clonality of a population can elucidate the following aspects: i) a better understanding of genetic diversity and population structure; ii) fine-tuning analysis of patterns of diversification across major genotypic units such as ST lineages; and iii) be an indicator of the necessity of using accessory genome mining to find cryptic genotypic units that may reveal novel sub-clusters present in the population. The more clonal a population is at the core-genome level, the harder it becomes to differentiate between variants, and the more likely the accessory genome content will be informative to stratify the population into meaningful genotypic units associated with unique ecological distributions18,19,21.
The relative frequency of ST lineage differentiating AMR loci was assessed to identify unique accessory genomic signatures linked to the S. Newport population structure (Figure 4). This step of the analysis was focused on AMR distribution because it is a Public Health-associated trait, but the same approach can be applied in a supervised (targeted) or agnostic fashion to examine other components of the accessory genome, including metabolic pathways, virulence factors, etc. Noticeably, mdf(A)_1 and aac(6')-Iaa_1 loci appear to be ancestrally-acquired by the S. Newport population; whereas, ST45 is predicted to be multi-drug resistant. Strikingly, these data also suggest that the other major ST lineages, ST5 and ST118, are more likely to be multi-drug susceptible when compared to ST45. These points have to be carefully considered because of the biases present in the dataset; however, this represents a potential epidemiological inference that could be made from more robust WGS data collections.
In general, here are some points to be considered when conducting an accessory genome mapping onto hierarchical genotypes: i) consider the frequency distribution as a quantitative trait but be aware that the allelic composition of a locus can alter trait variance. Moreover, the presence of a locus or loci should be indicative of function but not causal, because the phenotype may be polygenic, or vary according to the allelic composition for the causative locus (e.g., a non-synonymous mutation on the active site of a protein is more likely to affect function); ii) loci distribution can demonstrate genes that are fixed in the population (e.g., found in high frequency across all ST lineages) or recently-acquired by specific ST lineages and cgMLST variants, and may reflect the ecological or epidemiological pattern; iii) multi-drug resistance can be predicted from genomics data. And if the distribution of AMR loci, or other pathways, is strongly linked or commonly inherited by specific lineages, then phenotypes can be predicted by inference from hierarchical genotypes, such as in the case of ST lineages45,46; and iv) measuring phenotypes in the laboratory is still deterministic to validate computational predictions.
Figure 4: Distribution of AMR loci across major ST lineages of the S. Newport population. Relative frequency-based distribution of a selected number of AMR loci across major ST lineages (>1% of the population). Minor STs were grouped as "Other STs". Only genomes classified as S. Newport by the SISTR algorithm were kept in the analysis. AMR loci with a relative frequency greater than or equal to 10% were selected for data visualization. This is an arbitrary threshold that should be determined for each dataset. The proportions were calculated using a binary matrix composed of gene presence or absence. Please click here to view a larger version of this figure.
Lastly, a phylogeny-anchored visualization was used to systematically integrate the hierarchical population structure data along with ST lineage differentiating AMR loci distribution based on gene occurrence (Figure 5). By combining the population structure along with accessory genomic composition, the following set of questions can be addressed in any given dataset: 1) How is the population structured? How do STs relate to each other and ancestrally through BAPS1 sub-groups? How variable is the cgMLST composition across STs? 2) What is the phylogenetic branching pattern and overall tree topology? and 3) How is the accessory genome distributed? Is the accessory genomic composition mostly likely ancestrally-acquired or recently-derived? What is the lineage or variant-specific pattern? What is the phenotypic prediction and ecological inference? Is there niche-transcending vs. niche-specifying genes? How does the observed pattern relate or inform the epidemiology in the case of pathogens? Can lineages or variants be informatively sub-clustered based on accessory genomic content?
Figure 5: Phylogeny-guided mapping of hierarchical genotypes and accessory AMR loci differentiating between major ST lineages within the S. Newport population. A core-genome phylogeny (black centered circle) was used to map hierarchical genotypes, including serovar (lowest level of resolution – innermost colored circle), BAPS level 1 (BAPS1) sub-groups or haplotypes, ST lineages, and cgMLST variants (highest level of resolution – outermost colored circle), along with AMR loci colored as dark-blue if present or gray if absent. Serovars were grouped into Newport (S. Newport) or "Other serovars" based on the SISTR algorithmic classification. BAPS1 is hierarchically placed between serovar and ST lineages because it accurately and agnostically captured the ancestral relationships between STs. ST lineages are formed based on canonical MLST analysis using seven genome-scattered loci. Only major or most frequent STs (proportion >1%) were depicted in the graph. Also, only the most dominant cgMLST variants (proportion >3.5%) were used to show the entire hierarchical structure for the S. Newport population (n = 2,365 USA isolates only). The category "Other STs" or "Other cgMLSTs" comprised of minor or low-frequency lineages or variants, respectively, and thresholding was done arbitrarily and should be set based on the dataset. AMR loci with a relative frequency greater than or equal to 10% were selected for data visualization. This specific graph shows a unique distribution of AMR loci predominantly occurring in ST31, ST45, and ST132 lineages. Please click here to view a larger version of this figure.
Supplementary Figure 1: Sparse distribution of ST lineages and cgMLST variants for the S. Newport population. (A) The proportion of ST lineages without aggregating low-frequency STs. STs with proportion >1% are highlighted in the plot. (B) The proportion of cgMLST variants without aggregating low-frequency cgMLSTs. cgMLSTs with proportion > 3% are highlighted in the plot. (A-B) Thresholds used to plot both ST and cgMLST data were arbitrarily defined and should be established based on the dataset. Genomes classified by SISTR as "Other serovars" (2.03%) were filtered out of the data prior to plotting both ST and cgMLST relative frequencies. The larger the circle diameter, the higher the proportion for either the ST lineage or cgMLST variant. Please click here to download this File.
Supplementary Figure 2: Distribution of cgMLST variants nested within ST lineages for the S. Newport population. This plot depicts the cgMLST variant distribution across ST lineages, excluding genomes classified as "Other serovars" (2.03% of the entire data). Major cgMLSTs (proportion >15%) for each ST lineage are highlighted in each graph. The larger the circle diameter, the higher the proportion for the specific cgMSLT variant. Low-frequency STs were grouped as "Other STs". Please click here to download this File.
Supplementary Figure 3: Simpson's D-based degree of genetic diversity across ST lineages using BAPS levels 1-6 haplotypes or cgMLST genotypes as input data for the S. Newport population. The degree of clonality or genetic diversity of each ST lineage was calculated across different genotypic layers of resolution, including BAPS levels 1 (lowest level of resolution) to 6 (highest level of resolution) sub-groups or haplotypes, and by additionally using the cgMLST-based distribution of variants. The higher the index value, the higher the degree of genetic diversity. Highly diverse ST lineages have higher index values going from BAPS1 to BAPS6 (i.e., typically the index increases and eventually plateaus when going from BAPS1 to BAPS6). Only genomes classified as S. Newport by the SISTR program were kept in the analysis. Low-frequency STs were grouped as "Other STs". Please click here to download this File.
Supplementary Figure 4: Distribution of BAPS levels 1-6 sub-groups or haplotypes across major ST lineages of the S. Newport population. Relative frequency-based distribution of BAPS sub-groups or haplotypes, across major ST lineages, from the lowest (BAPS1) to the highest level of resolution (BAPS6). Major STs were selected based on having a proportion >1%. Only genomes classified as S. Newport by the SISTR program were kept in the analysis. The higher the degree of clonality, the less sparse or spread the distribution of BAPS sub-groups or haplotypes becomes when going from BAPS1 to BAPS6. In other words, a more genetically diverse ST lineage has a wider range of BAPS sub-groups at the BAPS level 6 (highest degree of resolution). Low-frequency STs were grouped as "Other STs". Please click here to download this File.
Supplementary File 1: Links to material list and genomes list Please click here to download this File.
Supplementary File 2: Hierarchical-based bacterial population genomics analysis using R Please click here to download this File.
The utilization of a systems-based heuristic and hierarchical population structure analysis provides a framework to identify novel genomic signatures in bacterial datasets that have the potential to explain unique ecological and epidemiological patterns20. Additionally, the mapping of accessory genome data onto the population structure can be used to infer ancestrally-acquired and/or recently-derived traits that facilitate the spread of ST lineages or cgMLST variants across reservoirs6,20,21,45,46. More broadly, a global assessment of pan-genomic content distribution in bacterial populations can reveal patterns of diversification that underly the ecological tropisms or geospatial/temporal bottlenecks that a population might have recently withstood18,21. In the case of pathogenic species, by mining the population structure of clinical vs. environmental isolates, genetic determinants associated with zoonotic events may be identified and used to improve diagnostics and surveillance33,34. The same approach can be applied to non-pathogenic species to identify genotypes with desirable niche-specific engrafting properties, as in the case of gastrointestinal probiotic strains used to improve human health49,50,51. Yet, the utilization of bacterial WGS data for population-based inquiries requires the use of reproducible, automated, and scalable computational platforms like ProkEvo20. Any computational approach comes with its caveats and nuances, but in general, freely available, well-documented, portable, and user-friendly platforms such as ProkEvo can facilitate the work of microbiologists, ecologists, and epidemiologists doing heuristic bacterial population-based genomics.
In the present work, it was demonstrated how to use ProkEvo-derived outputs to conduct a hierarchical population structure analysis that can be used to map and track genotypes of interest at different levels of resolution, along with predicting useful traits from WGS data. This computational protocol was written using the R programming language, but the framework or conceptual approach is generalizable to other languages such as Python through the utilization of the Pandas library, for instance. The input data is generated by ProkEvo20, which prevents some hurdles to be faced in terms of standardizing outputs and data formats for subsequent analysis. With the exception of phylogenies, all other input datasets come in a tabular format that can easily be quality-controlled, aggregated, parsed, and integrated to generate useful reports for data interpretation. However, it is important to highlight a few critical steps to enhance reproducibility while using this protocol: i) make sure the software versions are always updated and tracked; ii) track the versions of the data science libraries being used, and preferably update them over time; iii) quality-control the data using domain knowledge expertise to make sense of the outputs generated by ProkEvo, or a similar pipeline, in light of what is understood for the targeted bacterial population; iv) conduct an exploratory data analysis prior to using any modeling approach; v) aggregate the data based on empirical knowledge and/or statistical assessments; vi) define a strategy to deal with missing values a priori and be consistent and completely transparent about it; vii) if using R, try to use all the packages provided by Tidyverse, because this collection facilitates functional programming, portability, optimization, and is freely available; and viii) be aware that visualization approaches can be difficult because it takes some trial and error to get the right type of plot and coloring scheme that is most appropriately applicable for the question being asked and the data being portrayed.
Of note, this protocol comes with some limitations that can be further improved. For instance, ProkEvo has an intrinsic limit to how many genomes can be used for pan-genomic analysis, if the core-genome alignment step is generated concomitantly, while utilizing the Roary program (~ 2,000-3,000 genomes)24. That is a very specific bottleneck in the pipeline that will affect the number of genomes that can be classified into BAPS haplotypes since it depends on core-genome alignment (i.e., highly computationally demanding step). However, core-genome alignment can be done with other programs52, and such algorithms, in theory, could be easily incorporated into ProkEvo. Otherwise, datasets can be strategically split into random subsets, or in another basis such as by considering the population structure of the organism in question. Alternatively, ProkEvo can be run with a single genome to get ST-based annotation, antibiotic resistance and virulence gene composition, and mapping of plasmids, but the pipeline was designed for population-based genomics. Noteworthy, if the BAPS1-6 classifications are not needed, then the core-genome alignment option of Roary can be turned off, and in that case, ProkEvo can be used with many hundredths of thousands of genomes – it is only limited based on the number of computer cores available. An example of how to implement a new program or how to turn off the core-genome alignment option in Roary within ProkEvo can be found in the following GitHub links (https://github.com/npavlovikj/ProkEvo/wiki/4.1.-Add-new-bioinformatics-tool-to-ProkEvo) and (https://github.com/npavlovikj/ProkEvo/wiki/4.3.-Change-running-options-for-existing-tool-in-ProkEvo), respectively. In the case of accessory genomic mining, an agnostic analysis depends on the utilization of the pan-genomic .Rtab file generated by Roary24, which was not specifically used here, but instead, it was strategically demonstrated how to map AMR loci with ABRicate using the Resfinder database (https://github.com/tseemann/abricate). Nonetheless, there is an option to expand the scope of the accessory genomic mapping by using a pan-genomic file instead, which can be practically viewed as an expansion of the current approach (e.g., more loci included in the tabular dataset as new columns). It is important to mention that the pan-genomic mapping done by ProkEvo only provided binary information in terms of loci composition, and currently, cannot be used for the identification of single nucleotide polymorphisms across genes.
Another limitation of this protocol is the visualization of the phylogenetic tree. Currently, ggtree is the program of choice, but that comes at the cost of being unable to accurately inspect branch lengths and becomes cumbersome when many layers of data need to be added onto the phylogeny. Alternatively, phandango41 is a user-friendly, scalable web-page formatted GUI (https://jameshadfield.github.io/phandango/#/)41 that could easily be used to achieve the same goal, and further detailed information on how to use it with ProkEvo outputs is recently published20. Other tools like iTOL could also be used for phylogeny-dependent visualization of data53, but they require using a GUI and cannot be incorporated in automated scripts. Also, accurate core-genome phylogenies can be difficult to estimate due to the cryptic dataset-dependent impact of horizontal gene transfer. Programs such as Gubbins54 can be used for that purpose, but they also come with certain limitations such as the need of using whole-genome alignment and ST lineage-specific datasets for the correct estimation of phylogenies. Instead, other phylogeny-independent approaches can be deployed, which then end up requiring other types of visualizations to integrate metadata or accessory genomic information, as in the case of multi-dimensional analysis55,56. Lastly, an empirical and arbitrary approach was used to aggregate minor ST lineages and cgMLST variants, in addition to filtering the most important AMR loci to be quantified. This type of data aggregation can be done empirically using domain knowledge expertise, but could also be achieved statistically by defining a priori criterion of the proportion of the distribution that should be displayed, or by using distribution-related metrics such as interquartile range, standard deviation, or skewness, to ultimately define a threshold. Importantly, the definition for minor genotypes is directly influenced by the nature of the data since sample size, and bias in the types of environmental samples can directly influence the genotypic composition. Regardless, the main consideration is that the mapping of accessory genome content onto the population structure allows for potential genetic determinants of ecological diversification to be identified, such as niche-transcending or niche-specifying genes57,58,59.
Although the available R scripts were designed for automation of the present work, all provided scripts would need to be further developed to become an abstract and deployable data science library, that could for instance be an integral part of the ProkEvo pipeline. Nonetheless, there are some specific advantages of utilizing this approach such as the use of the BAPS level 1 genotyping or clustering scheme. The placement of BAPS level 1 sub-groups or haplotypes between serovar and ST lineages was defined empirically based on the genetic structure of the Salmonella population, but it seems to be applicable to other species such as Campylobacter jejuni and Staphylococcus aureus20. Moreover, BAPS1 accurately captures the ancestral relationship between ST lineages and provides a scalable approach for evolutionary analysis, especially when phylogenetic applications are limited20. Furthermore, the use of a nested approach for examining hierarchical relationships and patterns of diversification facilitates the identification of ancestry between ST lineages using BAPS1 sub-groups, and across cgMLST variants using ST lineages, successively going from lower to higher genotypic resolution in assessing the population structure. It is important to reiterate that the frequency distribution of ST lineages and cgMLST variants, if drawn from a systematically collected and statistically powered sample, can become a proxy for ecological fitness1,6,43. Consequently, dominant ST lineages and cgMLST variants are likely to contain unique genomic features that may be the basis of the biological mechanism for their dominance in the population in that particular environment or host.
Herein, two independent statistical metrics were used to assess the degree of clonality of the population, which allows for an auxiliary understanding of the population genetic diversity, which may indicate the past occurrence of sample bias, population bottlenecks, or founder effect. In particular, the agnostic assessment of BAPS levels 1-6 sub-groups across ST lineages can refine the understanding of genetic diversity that cannot typically be resolved by simply looking at Salmonella cgMLST variant level generated by SISTR. As mentioned beforehand, other features of the pan-genome can be mapped onto the population structure and files containing plasmid and virulence gene composition, in addition to the utilization of other AMR databases along with agnostic pan-genome dataset, are automatically generated by ProkEvo20. Of note, ProkEvo does not currently allow for differentiation between AMR loci present in the bacterial chromosome vs. plasmids. Ecological and epidemiological metadata can also be easily integrated into this analytical approach by incorporation of other variables in a .csv file containing all the genomic information. In particular, the work presented here specifically complements the utilization of the scalable and portable computational platform ProkEvo, which was designed to be used by researchers focused on heuristic population genomics analyses that facilitate data mining and customization by the user. Other platforms can be used for genotyping, population structure analysis, and/or mapping of accessory genomes such as Enterobase5, PATRIC60, andBacWGSTdb61. The latter are excellent resources that facilitate genomics data mining for researchers who are not seeking to customize and utilize cluster computing for scalable and complex analysis. The analytical approach presented here is specifically tailored for researchers who want to have the flexibility to carry out a population genomics analysis using reproducible scripts on their local machine or by using a cloud- or high-performance computational platform.
In conclusion, the analytical R-based platform presented in this work was targeted at providing a practical guide for microbiologists, ecologists, and epidemiologists on how to: i) use phylogeny-dependent approaches to map hierarchical genotypes; ii) assess the frequency distribution of genotypes as a proxy for evaluating ecological fitness; iii) determine lineage-specific degrees of clonality using independent statistical approaches; and iv) map lineage-differentiating AMR loci as an example of how to mine accessory genomic content in the context of the population structure. The scripts provided here can be used on either a local machine or a high-performance computational platform. For experimental and environmental microbiologists, this approach facilitates studies of datasets aimed at identifying unique traits and candidate pathways for further mechanistic studies that ultimately can be contextualized at the population level. Ecologists can benefit from this approach by being able to analyze moderate-to-large datasets, that in theory, increase the statistical power needed to find signatures of selection in a population while considering kinship relationships and patterns of diversification. Lastly, epidemiologists can harness unique practical information for diagnostics and surveillance by defining genotypic units of interest and predicting Public Health-associated traits such as AMR. More broadly, this analytical guidance provides a generalizable framework to utilize ProkEvo to perform a population-based genomic analysis that can be used to infer evolutionary and ecological patterns for pathogenic and non-pathogenic species since the approach is generalizable to other bacterial species.
The authors have nothing to disclose.
This work was supported by funding provided by the UNL-IANR Agricultural Research Division and the National Institute for Antimicrobial Resistance Research and Education and by the Nebraska Food for Health Center at the Food Science and Technology Department (UNL). This research could only be completed by utilizing the Holland Computing Center (HCC) at UNL, which receives support from the Nebraska Research Initiative. We are also thankful for having access, through the HCC, to resources provided by the Open Science Grid (OSG), which is supported by the National Science Foundation and the U.S. Department of Energy's Office of Science. This work used the Pegasus Workflow Management Software which is funded by the National Science Foundation (grant #1664162).
amr_data_filtered | https://figshare.com/account/projects/116625/articles/14829225?file=28758762 | ||
amr_data_raw | https://figshare.com/account/projects/116625/articles/14829225?file=28547994 | ||
baps_output | https://figshare.com/account/projects/116625/articles/14829225?file=28548003 | ||
Core-genome phylogeny | https://figshare.com/account/projects/116625/articles/14829225?file=28548006 | ||
genome_sra | https://figshare.com/account/projects/116625/articles/14829225?file=28639209 | ||
Linux, Mac, or PC | any high-performance platform | ||
mlst_output | https://figshare.com/account/projects/116625/articles/14829225?file=28547997 | ||
sistr_output | https://figshare.com/account/projects/116625/articles/14829225?file=28548000 | ||
figshare credentials are required for login and have access to the files |