Heuristic Mining of Hierarchical Genotypes and Accessory Genome Loci in Bacterial Populations

Natasha Pavlovikj; Joao Carlos Gomes-Neto; Andrew K. Benson

doi:10.3791/63115

JoVE Journal > Genetics

Genética

Heuristic Mining of Hierarchical Genotypes and Accessory Genome Loci in Bacterial Populations

Published: December 07, 2021

doi:

10.3791/63115

Natasha Pavlovikj*¹, Joao Carlos Gomes-Neto*^2,3, Andrew K. Benson^2,3

¹Department of Computer Science and Engineering,University of Nebraska-Lincoln, ²Department of Food Science and Technology,University of Nebraska-Lincoln, ³Nebraska Food for Health Center,University of Nebraska-Lincoln

Summary

This analytical computational platform provides practical guidance for microbiologists, ecologists, and epidemiologists interested in bacterial population genomics. Specifically, the work presented here demonstrated how to perform: i) phylogeny-guided mapping of hierarchical genotypes; ii) frequency-based analysis of genotypes; iii) kinship and clonality analyses; iv) identification of lineage differentiating accessory loci.

Abstract

Routine and systematic use of bacterial whole-genome sequencing (WGS) is enhancing the accuracy and resolution of epidemiological investigations carried out by Public Health laboratories and regulatory agencies. Large volumes of publicly available WGS data can be used to study pathogenic populations at a large scale. Recently, a freely available computational platform called ProkEvo was published to enable reproducible, automated, and scalable hierarchical-based population genomic analyses using bacterial WGS data. This implementation of ProkEvo demonstrated the importance of combining standard genotypic mapping of populations with mining of accessory genomic content for ecological inference. In particular, the work highlighted here used ProkEvo-derived outputs for population-scaled hierarchical analyses using the R programming language. The main objective was to provide a practical guide for microbiologists, ecologists, and epidemiologists by showing how to: i) use a phylogeny-guided mapping of hierarchical genotypes; ii) assess frequency distributions of genotypes as a proxy for ecological fitness; iii) determine kinship relationships and genetic diversity using specific genotypic classifications; and iv) map lineage differentiating accessory loci. To enhance reproducibility and portability, R markdown files were used to demonstrate the entire analytical approach. The example dataset contained genomic data from 2,365 isolates of the zoonotic foodborne pathogen Salmonella Newport. Phylogeny-anchored mapping of hierarchical genotypes (Serovar -> BAPS1 -> ST -> cgMLST) revealed the population genetic structure, highlighting sequence types (STs) as the keystone differentiating genotype. Across the three most dominant lineages, ST5 and ST118 shared a common ancestor more recently than with the highly clonal ST45 phylotype. ST-based differences were further highlighted by the distribution of accessory antimicrobial resistance (AMR) loci. Lastly, a phylogeny-anchored visualization was used to combine hierarchical genotypes and AMR content to reveal the kinship structure and lineage-specific genomic signatures. Combined, this analytical approach provides some guidelines for conducting heuristic bacterial population genomic analyses using pan-genomic information.

Introduction

The increasing use of bacterial whole-genome sequencing (WGS) as a basis for routine surveillance and epidemiological inquiry by Public Health laboratories and regulatory agencies has substantially enhanced pathogen outbreak investigations¹^,²^,³^,⁴. As a consequence, large volumes of de-identified WGS data are now publicly available and can be used to study aspects of the population biology of pathogenic species at an unprecedented scale, including studies based on: population structures, genotype frequencies, and gene/allele frequencies across multiple reservoirs, geographical regions, and types of environments⁵. The most commonly used WGS-guided epidemiological inquiries are based on analyses using only the shared core-genomic content, where the shared (conserved) content alone is used for genotypic classification (e.g., variant calling), and these variants become the basis for epidemiological analysis and tracing¹^,²^,⁶^,⁷. Typically, bacterial core-genome-based genotyping is carried out with multi-locus sequence typing (MLST) approaches using seven to a few thousand loci⁸^,⁹^,¹⁰. These MLST-based strategies encompass mapping of pre-assembled or assembled genomic sequences onto highly curated databases, thereby combining allelic information into reproducible genotypic units for epidemiological and ecological analysis¹¹^,¹². For instance, this MLST-based classification can generate genotypic information at two levels of resolution: lower-level sequence types (STs) or ST lineages (7 loci), and higher-level core-genome MLST (cgMLST) variants (~ 300-3,000 loci)¹⁰.

MLST-based genotypic classification is computationally portable and highly reproducible between laboratories, making it widely accepted as an accurate sub-typing approach beneath the bacterial species level¹³^,¹⁴. However, bacterial populations are structured with species-specific varying degrees of clonality (i.e., genotypic homogeneity), complex patterns of hierarchical kinship between genotypes¹⁵^,¹⁶^,¹⁷, and a wide range of variation in the distribution of accessory genomic content¹⁸^,¹⁹. Thus, a more holistic approach goes beyond discrete classifications into MLST genotypes and incorporates the hierarchical relationships of genotypes at different scales of resolution, along with mapping of accessory genomic content onto genotypic classifications, which facilitates population-based inference¹⁸^,²⁰^,²¹. Moreover, analyses can also focus on shared patterns of inheritance of accessory genomic loci among even distantly-related genotypes²¹^,²². Overall, the combined approach enables agnostic interrogation of relationships between population structure and the distribution of specific genomic compositions (e.g., loci) among geospatial or environmental gradients. Such an approach can yield both fundamental and practical information about the ecological characteristics of specific populations that may, in turn, explain their tropism and dispersion patterns across reservoirs, such as food animals or humans.

This systems-based hierarchical population-oriented approach demands large volumes of WGS data for sufficient statistical power to predict distinguishable genomic signatures. Consequently, the approach requires a computational platform capable of processing many thousands of bacterial genomes at once. Recently, ProkEvo was developed and is a freely available, automated, portable, and scalable bioinformatics platform that allows for integrative hierarchical-based bacterial population analyses, including pan-genomic mapping²⁰. ProkEvo allows for the study of moderate-to-large scale bacterial datasets while providing a framework to generate testable and inferable epidemiological and ecological hypotheses and phenotypic predictions that can be customized by the user. This work complements that pipeline in providing a guide on how to utilize ProkEvo-derived output files as input for analyses and interpretation of hierarchical population classifications and accessory genomic mining. The case study presented here utilized the population of Salmonella enterica lineage I zoonotic serovar S. Newport as an example and was specifically aimed at providing practical guidelines for microbiologists, ecologists, and epidemiologists on how to: i) use an automated phylogeny-dependent approach to map hierarchical genotypes; ii) assess the frequency distribution of genotypes as a proxy for evaluating ecological fitness; iii) determine lineage-specific degrees of clonality using independent statistical approaches; and iv) map lineage-differentiating AMR loci as an example of how to mine accessory genomic content in the context of the population structure. More broadly, this analytical approach provides a generalizable framework to perform a population-based genomic analysis at a scale that can be used to infer evolutionary and ecological patterns regardless of the targeted species.

Protocol

1. Prepare input files

NOTE: The protocol is available here – https://github.com/jcgneto/jove_bacterial_population_genomics/tree/main/code. The protocol assumes that the researcher has specifically used ProkEvo (or a comparable pipeline) to get the necessary outputs available in this Figshare repository (https://figshare.com/account/projects/116625/articles/15097503 – login credentials are required – The user must create a free account to have file access!). Of note, ProkEvo automatically downloads genomic sequences from the NCBI-SRA repository and only requires a .txt file containing a list of genome identifications as an input²⁰, and the one used for this work on S. Newport USA isolates is provided here (https://figshare.com/account/projects/116625/articles/15097503?file=29025729). Detailed information on how to install and use this bacterial genomics platform is available here (https://github.com/npavlovikj/ProkEvo/wiki/2.-Quick-start)²⁰

Generate core-genome phylogeny using FastTree²³ as previously described²⁰, which is not part of the bioinformatics platform²⁰. FastTree requires the Roary²⁴ core-genome alignment as an input file. The phylogeny file is named newport_phylogeny.tree (https://figshare.com/account/projects/116625/articles/15097503?file=29025690).
Generate SISTR²⁵ output containing the information regarding serovars classifications for Salmonella and cgMLST variant calling data (sistr_output.csv - https://figshare.com/account/projects/116625/articles/15097503?file=29025699).
Generate BAPS file by fastbaps²⁶^,²⁷ containing the BAPS levels 1-6 classification of genomes into sub-groups or haplotypes (fastbaps_partition_baps_prior_l6.csv - https://figshare.com/account/projects/116625/articles/15097503?file=29025684).
Generate MLST-based classification of genomes into STs using the MLST program (https://github.com/tseemann/mlst)²⁸ (salmonellast_output.csv – https://figshare.com/account/projects/116625/articles/15097503?file=29025696).
Generate ABRicate (https://github.com/tseemann/abricate)²⁹ output as a .csv file containing AMR loci mapped per genome (sabricate_resfinder_output.csv – https://figshare.com/account/projects/116625/articles/15097503?file=29025693).
NOTE: The user can turn off specific parts of the ProkEvo bioinformatics pipeline (check here for more information - https://github.com/npavlovikj/ProkEvo/wiki/4.2.-Remove-existing-bioinformatics-tool-from-ProkEvo). The analytical approach presented here provides guidelines for how to conduct a population-based analysis after the bioinformatics pipeline has been run.

2. Download and install the statistical software and integrated development environment (IDE) application

Download the most up-to-date freely available version of the R software for Linux, Mac, or PC³⁰. Follow the default installation steps.
Download the most up-to-date freely available version of the RStudio desktop IDE here³¹. Follow the default steps for installation.
NOTE: The next steps are included in the available script, including detailed information of code utilization, and should be run sequentially to generate the outputs and figures presented in this work (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/data_analysis_R_code.Rmd). The user may decide to use another programming language to conduct this analytical/statistical analysis such as Python. In that case, use the steps in the scripts as a framework to carry out the analysis.

3. Install and activate data science libraries

Install all data science libraries at once as a first step in the analysis. Avoid installing the libraries every time the script needs to be re-run. Use the function install.packages() for library installation. Alternatively, the user may click on the Packages tab inside of the IDE and automatically install the packages. The code used to install all needed libraries is presented here:
# Install Tidyverse
install.packages("tidyverse")
# Install skimr
install.packages("skimr")
# Install vegan
install.packages("vegan")
# Install forcats
install.packages("forcats")
# Install naniar
install.packages("naniar")
# Install ggpubr
install.packages("ggpubr")
# Install ggrepel
install.packages("ggrepel")
# Install reshape2
install.packages("reshape2")
# Install RColorBrewer
install.packages("RColorBrewer")
# Install ggtree
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ggtree")
# Installation of ggtree will prompt a question about installation – answer is "a" to install/update all dependencies
Activate all the libraries or packages using the library() function at the beginning of the script, right after installation. Here is a demonstration on how to activate all necessary packages:
# Activate the libraries and packages
library(tidyverse)
library(skimr)
library(vegan)
library(forcats)
library(naniar)
library(ggtree)
library(ggpubr)
library(ggrepel)
library(reshape2)
library(RColorBrewer)
Suppress outputting the code used for library and package installation and activation by using {r, include = FALSE} in the code chuck, as follows:
“` {r, include = FALSE}
# Install Tidyverse
install.packages("tidyverse")
“`
NOTE: This step is optional but avoids showing chunks of unnecessary code in the final html, doc, or pdf report.
For a brief description of the specific functions of all libraries along with some useful links to gather further information, refer to steps 3.4.1-3.4.11.
1. Tidyverse – use this collection of packages used for data science, including data entry, visualization, parsing and aggregation, and statistical modeling. Typically, ggplot2 (data visualization) and dplyr (data wrangling and modeling) are practical packages present in this library³².
2. skimr – use this package for generating summary statistics of data frames, including identification of missing values³³.
3. vegan – use this package for community ecology statistical analyses, such as calculating diversity-based statistics (e.g., alpha and beta-diversity)³⁴.
4. forcats – use this package to work with categorical variables such as re-ordering classifications. This package is part of the Tidyverse library³².
5. naniar – use this package to visualize the distribution of missing values across variables in a data frame, by using the viss_miss() function³⁵.
6. ggtree – use this package for the visualization of phylogenetic trees³⁶.
7. ggpubr – use this package to improve the quality of ggplot2-based visualizations³⁷.
8. ggrepel – use this package for text labeling inside of graphs³⁸.
9. reshape2 – use the melt() function from this package for the transformation of data frames from wide to long format³⁹.
10. RColorBrewer – use this package to manage colors in ggplot2-based visualizations⁴⁰.
11. Use the following basic functions for exploratory data analysis: head() to check the first observations in a data frame, tail() to check the last observations of a data frame, is.na() to count the number of rows with missing values across a data frame, dim() to check the number of rows and columns in a dataset, table() to count observations across a variable, and sum() to count the total number of observations or instances.

4. Data entry and analysis

NOTE: A detailed information on each step of this analysis can be found in the available script (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/data_analysis_R_code.Rmd). However, here are some important points to be considered:

Do all genomic data entry, including all genotypic classifications (serovar, BAPS, ST, and cgMLST) using the read_csv() function.
Rename, create new variables, and select columns of interest from each dataset before multi-dataset aggregation.
Don't remove missing values from any independent dataset. Wait until all datasets are aggregated to modify or exclude missing values. If new variables are created for each dataset, then missing values are by default categorized into one of the newly generated classifications.
Check for erroneous characters such as hyphens or interrogations marks and replace them with NA (Not applicable). Do the same for missing values.
Aggregate data based on the hierarchical order of genotypes (serovar -> BAPS1 -> ST -> cgMLST), and by grouping based on the individual genome identifications.
Check for missing values using multiple strategies and deal with such inconsistencies explicitly. Only remove a genome or isolate from the data if the classification is unreliable. Otherwise, consider the analysis being done and remove NAs on a case-by-case basis.
NOTE: It is highly recommended to establish a strategy to deal with such values a priori. Avoid removing all genomes or isolates with missing values across any variables. For instance, a genome may have ST classification without having cgMLST variant number. In that case, the genome can still be used for the ST-based analysis.
Once all datasets are aggregated, assign them to a data frame name or object that can be used in multiple locations in the follow-up analysis, to avoid having to generate the same metadata file for every figure in the paper.

5. Conduct analyses and generate visualizations

NOTE: A detailed description of each step needed to produce all the analysis and visualizations can be found in the markdown file for this paper (https://github.com/jcgneto/jove_bacterial_population_genomics/tree/main/code). Code for each figure is separated in chunks and the entire script should be run sequentially. Additionally, the code for each main and supplementary figure is provided as a separate file (see Supplementary File 1 and Supplementary File 2). Here are some essential points (with snippets of code) to be considered while generating each main and supplementary figures.

Use ggtree to plot a phylogenetic tree along with genotypic information (Figure 1).
1. Optimize the ggtree figure size, including diameter and width of rings, by changing the numerical values inside of the xlim() and gheatmap(width = ) functions, respectively (see example code below).
  tree_plot <- ggtree(tree, layout = "circular") + xlim(-250, NA)
  figure_1 <- gheatmap(tree_plot, d4, offset=.0, width=20, colnames = FALSE)
  NOTE: For a more detailed comparison of programs that can be used for phylogenetic plotting, check this work²⁰. The work highlighted an attempt made to identify strategies to improve ggtree-based visualizations such as decreasing the dataset size, but branch lengths and tree topology were not as clearly discriminating as compared to phandango⁴¹.
2. Aggregate all metadata into as few categories as possible to facilitate the choice of coloring panel when plotting multiple layers of data with the phylogenetic tree (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_1.Rmd). Conduct the data aggregation based on the question of interest and domain knowledge.
Use a bar plot to assess relative frequencies (Figure 2).
1. Aggregate data for both ST lineages and cgMLST variants to facilitate visualizations. Choose an empirical or statistical threshold used for data aggregation, while considering the question being asked.
2. For an example code that can be used to inspect the frequency distribution of ST lineages to determine the cut-off see below:
  st_dist <- d2 %>% group_by(ST) %>% # group by the ST column
  count() %>% # count the number of observations
  arrange(desc(n)) # arrange the counts in decreasing order
3. For an example code showing how minor (low-frequency) STs can be aggregated refer below. As demonstrated below, STs that are not numbered as 5, 31, 45, 46, 118, 132, or 350, are grouped together as "Other STs". Use a similar code for cgMLST variants (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_2.Rmd).
  d2$st <- ifelse(d2$ST == 5, "ST5", # create a new ST column for which minor S Ts are aggregated as Others
  ifelse(d2$ST == 31, "ST31",
  ifelse(d2$ST == 45, "ST45",
  ifelse(d2$ST == 46, "ST46",
  ifelse(d2$ST == 118, "ST118",
  ifelse(d2$ST == 132, "ST132", ifelse(d2$ST == 350, "ST350", "Other STs")))))))
Use a nested approach to calculate the proportion of each ST lineage within each BAPS1 sub-group to identify STs that are ancestrally related (belong to the same BAPS1 sub-group) (Figure 3). The code below exemplifies how the ST-based proportion can be calculated across BAPS1 sub-groups (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_3.Rmd):
baps <- d2b %>% filter(serovar == "Newport") %>% # filter Newport serovars
select(baps_1, ST) %>% # select baps_1 and ST columns
mutate(ST = as.numeric(ST)) %>% # change ST column to numeric
drop_na(baps_1, ST) %>% # drop NAs
group_by(baps_1, ST) %>% # group by baps_1 and ST
summarise(n = n()) %>% # count observations
mutate(prop = n/sum(n)*100) # calculate proportions
Plot the distribution of AMR loci across ST lineages using the Resfinder-based gene annotation results (Figure 4).
NOTE: Resfinder has been widely used in ecological and epidemiological studies⁴². Annotation of protein-coding genes can vary depending on how often databases are curated and updated. If using the suggested bioinformatics pipeline, the researcher can compare AMR-based loci classifications across different databases²⁰. Be sure to check which databases are continually being updated. Do not use out-of-date or poorly curated databases, in order to avoid miscalls.
1. Use an empirical or statistical threshold to filter out the most important AMR loci to facilitate visualizations. Provide a raw .csv file containing the calculated proportions of all AMR loci across all ST lineages, such as shown here (https://figshare.com/account/projects/116625/articles/15097503?file=29025687).
2. Calculate the AMR proportion for each ST using the following code (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_4.Rmd):
  # Calculations for ST45
  d2c <- data6 %>% filter(st == "ST45") # filter ST45 data first
  # for ST45, calculate the proportion of AMR loci and only keep proportion greater than 10%
  d3c <- d2c %>% select(id, gene) %>% # select columns
  group_by(id, gene) %>% # group by id and gene
  summarize(count = n()) %>% # count observations
  mutate(count = replace(count, count == 2, 1)) %>% # replace counts equal to 2 with 1 to only consider one copy of each gene (duplications may not be reliable), but the researcher can decide to exclude or keep them. If the researcher wants to exclude them, then use the filter(count != 2) function or else leave as is
  filter(count <= 1) # filter counts below or equal to 1
  d4c <- d3c %>% group_by(gene) %>% # group by gene
  summarize(value = n()) %>% # count observations
  mutate(total = table(data1$st)[6]) %>% # get the total counts of st mutate(prop = (value/total)*100) # calculate proportions
  d5c <- d4c %>% mutate(st = "ST45") # create a st column and add ST information
3. After calculations are done for all STs, combine datasets as one data frame, using the following code:
  # Combine datasets
  d6 <- rbind(d5a, d5b, d5c, d5d, d5e, d5f, d5g, d5h) # row bind datasets
4. To export the .csv file containing the calculated proportions, use the code:
  # Export data table containing ST and AMR loci information
  abx_newport_st <- d6 write.csv(abx_newport_st,"abx_newport_st.csv", row.names = FALSE)
5. Before plotting the AMR-based distribution across ST lineages, filter the data based on a threshold to facilitate visualizations, as shown below:
  # Filter AMR loci with proportion higher than or equal to 10%
  d7 <- d6 %>% filter(prop >= 10) # determine the threshold empirically or statistically
Plot the core-genome phylogeny along with the hierarchical genotypic classifications and AMR data in a single plot using ggtree (Figure 5).
1. Optimize the figure size inside ggtree using the abovementioned parameters (see step 5.1.1.).
2. Optimize visualizations by aggregating variables, or using binary classification such as gene presence or absence. The more features are added to the plot, the harder the coloring selection process becomes (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_5.Rmd).
  NOTE: Supplementary figures – detailed description of the entire code can be found here (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/data_analysis_R_code.Rmd).
Use a scatter plot in ggplot2, without data aggregation, to display the distribution of ST lineages or cgMLST variants while highlighting the most frequent genotypes (Supplementary Figure 1) (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/supplementary_figure_s1.Rmd).
Do a nested analysis to assess the composition of ST lineages through the proportion of cgMLST variants in order to get a glimpse of the ST-based genetic diversity, while identifying the most frequent variants and their genetic relationships (i.e., cgMLST variants that belong to the same ST shared an ancestor more recently than those belonging to distinct STs) (Supplementary Figure 2) (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/supplementary_figure_s2.Rmd).
Use community ecology metric, namely Simpson's D index of diversity, to measure the degree of clonality or genotypic diversity of each of the major ST lineages⁴³ (Supplementary Figure 3).
1. Calculate the index of diversity across ST lineages at different levels of genotypic resolution including BAPS level 1 through 6 and cgMLST. Below is the code example on how to do this calculation at the BAPS level 1 (BAPS1) of genotypic resolution:
  # BAPS level 1 (BAPS1)
  # drop the STs and BAPS1 with NAs, group by ST and BAPS1 and then calculate Simpson's index
  baps1 <- data6 %>%
  select(st, BAPS1) %>% # select columns
  drop_na(st, BAPS1) %>% # drop NAs
  group_by(st, BAPS1) %>% # group by columns
  summarise(n = n()) %>% # count observations
  mutate(simpson = diversity(n, "simpson")) %>% # calculate diversity
  group_by(st) %>% # group by column
  summarise(simpson = mean(simpson)) %>% # calculate the mean of the index
  melt(id.vars=c("st"), measure.vars="simpson",
  variable.name="index", value.name="value") %>% # covert into long format
  mutate(strat = "BAPS1") # create a strat column
  NOTE: A more genetically diverse population (i.e., more variants at different layers of genotypic resolution) has a higher index at the cgMLST level and produces an increasing index-based values going from BAPS level 2 to 6 (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/supplementary_figure_s3.Rmd).
Examine the degree of genotypic diversity of ST lineages by plotting the relative frequency of BAPS sub-groups at all levels of resolution (BAPS1-6) (Supplementary Figure 4). The more diverse the population is, the sparser the distribution of BAPS sub-groups (haplotypes) becomes going from BAPS1 (lower level of resolution) to BAPS6 (higher level of resolution) (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/supplementary_figure_s4.Rmd).

Representative Results

By utilizing the computational platform ProkEvo for population genomics analyses, the first step in bacterial WGS data mining is comprised of examining the hierarchical population structure in the context of a core-genome phylogeny (Figure 1). In the case of S. enterica lineage I, as exemplified by the S. Newport dataset, the population is hierarchically structured as follows: serovar (lowest level of resolution), BAPS1 sub-groups or haplotypes, ST lineages, and cgMLST variants (highest level of resolution)²⁰. This phylogeny-guided analysis of the hierarchical population structure specifically allows for the following points to be examined: i) phylogenetic distribution of SISTR-based misclassified genomes into other serovars in the case of Salmonella; ii) genetic or kinship structure of the population; iii) pattern of diversification at different levels of genotypic resolution; iv) identification of major genotypic unit(s) underlying an evolutionary, ecological, or epidemiological patterns; v) ancestral relationships between ST lineages through BAPS1 sub-groups or haplotype composition, and across cgMLST variants within ST lineages; and vi) partial view of the degree of genotypic homogeneity of an ST lineage by the cgMLST variant composition.

Figure 1: Phylogeny-guided mapping of hierarchical genotypes for the S. Newport population. A core-genome phylogeny (black centered circle) was used to map hierarchical genotypes, including serovar (lowest level of resolution – innermost colored circle), BAPS level 1 (BAPS1) sub-groups or haplotypes, ST lineages, and cgMLST variants (highest level of resolution – outermost colored circle). Serovars were grouped into Newport (S. Newport) or "Other serovars" based on the SISTR algorithmic classification of genomes, which utilized core-genome MLST information, and ran as part of the computational platform ProkEvo. BAPS1 agnostically stratifies the population into sub-groups or clusters of related haplotypes using core-genomic data within ProkEvo. BAPS1 is hierarchically placed between serovar and ST lineages because it accurately captured the ancestral relationships between STs. ST lineages are formed based on canonical MLST analysis using seven genome-scattered loci. Only major or most frequent STs (proportion >1%) were depicted in the graph. Lastly, only the most frequent cgMLST variants (proportion >3.5%) were used to show the entire hierarchical structure for the S. Newport population (n = 2,365 USA isolates only). The category "Other STs" or "Other cgMLSTs" comprised of minor or low-frequency lineages or variants, respectively, with thresholding done arbitrarily that should be set empirically or statistically based on the dataset. Please click here to view a larger version of this figure.

Relative frequencies of all hierarchical genotypes were then used to evaluate the overall distribution and most frequently observed classifications (i.e., genotypes) (Figure 2). In Figure 2C–D, less frequent (minor) ST lineages or cgMLST variants were aggregated as "Other STs" or "Other cgMLSTs", respectively, in order to facilitate data visualization (dimensionality reduction). If sampling is systematically done across environments and/or hosts and is appropriately statistically powered, frequency distribution can become a proxy for ecological fitness. That is, the most frequent lineages or variants could then be predicted to have higher fitness, ensuing further investigation to determine the causative genetic determinants underlying such a quantitative trait⁶^,³⁰.

Figure 2: Proportion of S. Newport hierarchical genotypes at different levels of resolution. (A) Serovars are phenotypes of the S. enterica lineage I population that can be predicted solely from core-genomic data due to the inheritable high linkage disequilibrium between core-loci and O and H antigenic-coding loci (surface proteins). When using ProkEvo, Salmonella genomes are automatically classified to serovars using the SISTR program. Although only S. Newport (Newport) genomes from NCBI were putatively downloaded, some have been classified as "Other serovars" within ProkEvo. Approximately 2% (48 out of 2,365) of all genomes were classified as other than S. Newport serovar. (B) The proportion of BAPS level 1 (BAPS1) sub-groups or haplotypes. BAPS1 is inserted between serovar and ST lineages in the hierarchical scheme because it accurately and agnostically captured the ancestral relationships between STs. (C) The proportion of major ST lineages depicted only STs that were > 1% in relative frequency. Minor STs were grouped as "Other STs". (D)The proportion of major cgMLST variants showed only four predominant cgMLSTs that were >3% in relative frequency. The remainder cgMLSTs were grouped as "Other cgMLSTs". (B-D) Genomes classified by SISTR as "Other serovars" (2.03%) were filtered out of the data prior to plotting BAPS1, ST, and cgMLST relative frequencies. (C-D) Thresholds used to plot both ST and cgMLST data were arbitrarily defined and should be established empirically on a case-by-case basis. Please click here to view a larger version of this figure.

Alternatively, a scatter-plot can be used to assess the distribution and proportion of both ST lineages or cgMLST variants, without any data aggregation (Supplementary Figure 1). This use of a scatter-plot is particularly useful for ST lineages and cgMLST variants because of the typical occurrence of hundredths, if not thousands, classifications for both genotypes. This sparse distribution commonly does not occur for the serovar and BAPS1 levels of resolution, because they are at a lower level of resolution with sequences inheritably collapsing into a few sub-groups or categories.

Next, the ancestral relationships between STs were examined using a nested approach which encompasses assessing the relative frequency of ST lineages by BAPS1 sub-groups or haplotypes (Figure 3). ST lineages that belonged to the same BAPS1 sub-group were more likely to have shared a common ancestor more recently than with other STs (i.e., ST5 and ST118 vs. ST45). Similarly, by examining the distribution of cgMLST variants within ST lineages, the degree of genotypic heterogeneity across STs can be captured, while assessing their genetic composition and revealing the ancestral relationship between cgMLSTs (i.e., closely related cgMLST variants belong to the same ST lineage or clonal complex) (Supplementary Figure 2).

Figure 3: Distribution of ST lineages nested within BAPS1 sub-groups for the S. Newport population. This plot depicts the ST lineage distribution within each BAPS level 1 sub-group or haplotype, excluding genomes classified as "Other serovars" (2.03% of the entire data). Major STs (proportion >1%) for each BAPS1 sub-group are highlighted in each graph. The larger the circle diameter, the higher the proportion for the particular ST lineage. Please click here to view a larger version of this figure.

Given that the pattern of S. Newport population diversification appeared to be mostly driven by ST composition (Figure 1), two statistical approaches were used to assess the ST-based degree of clonality (i.e., genetic homogeneity), including Simpson's D index of diversity (Supplementary Figure 3), and the distribution of BAPS sub-groups or haplotypes using BAPS levels 1-6 (Supplementary Figure 4). Assessing the degree of clonality of a population can elucidate the following aspects: i) a better understanding of genetic diversity and population structure; ii) fine-tuning analysis of patterns of diversification across major genotypic units such as ST lineages; and iii) be an indicator of the necessity of using accessory genome mining to find cryptic genotypic units that may reveal novel sub-clusters present in the population. The more clonal a population is at the core-genome level, the harder it becomes to differentiate between variants, and the more likely the accessory genome content will be informative to stratify the population into meaningful genotypic units associated with unique ecological distributions¹⁸^,¹⁹^,²¹.

The relative frequency of ST lineage differentiating AMR loci was assessed to identify unique accessory genomic signatures linked to the S. Newport population structure (Figure 4). This step of the analysis was focused on AMR distribution because it is a Public Health-associated trait, but the same approach can be applied in a supervised (targeted) or agnostic fashion to examine other components of the accessory genome, including metabolic pathways, virulence factors, etc. Noticeably, mdf(A)_1 and aac(6')-Iaa_1 loci appear to be ancestrally-acquired by the S. Newport population; whereas, ST45 is predicted to be multi-drug resistant. Strikingly, these data also suggest that the other major ST lineages, ST5 and ST118, are more likely to be multi-drug susceptible when compared to ST45. These points have to be carefully considered because of the biases present in the dataset; however, this represents a potential epidemiological inference that could be made from more robust WGS data collections.

In general, here are some points to be considered when conducting an accessory genome mapping onto hierarchical genotypes: i) consider the frequency distribution as a quantitative trait but be aware that the allelic composition of a locus can alter trait variance. Moreover, the presence of a locus or loci should be indicative of function but not causal, because the phenotype may be polygenic, or vary according to the allelic composition for the causative locus (e.g., a non-synonymous mutation on the active site of a protein is more likely to affect function); ii) loci distribution can demonstrate genes that are fixed in the population (e.g., found in high frequency across all ST lineages) or recently-acquired by specific ST lineages and cgMLST variants, and may reflect the ecological or epidemiological pattern; iii) multi-drug resistance can be predicted from genomics data. And if the distribution of AMR loci, or other pathways, is strongly linked or commonly inherited by specific lineages, then phenotypes can be predicted by inference from hierarchical genotypes, such as in the case of ST lineages⁴⁵^,⁴⁶; and iv) measuring phenotypes in the laboratory is still deterministic to validate computational predictions.

Figure 4: Distribution of AMR loci across major ST lineages of the S. Newport population. Relative frequency-based distribution of a selected number of AMR loci across major ST lineages (>1% of the population). Minor STs were grouped as "Other STs". Only genomes classified as S. Newport by the SISTR algorithm were kept in the analysis. AMR loci with a relative frequency greater than or equal to 10% were selected for data visualization. This is an arbitrary threshold that should be determined for each dataset. The proportions were calculated using a binary matrix composed of gene presence or absence. Please click here to view a larger version of this figure.

Lastly, a phylogeny-anchored visualization was used to systematically integrate the hierarchical population structure data along with ST lineage differentiating AMR loci distribution based on gene occurrence (Figure 5). By combining the population structure along with accessory genomic composition, the following set of questions can be addressed in any given dataset: 1) How is the population structured? How do STs relate to each other and ancestrally through BAPS1 sub-groups? How variable is the cgMLST composition across STs? 2) What is the phylogenetic branching pattern and overall tree topology? and 3) How is the accessory genome distributed? Is the accessory genomic composition mostly likely ancestrally-acquired or recently-derived? What is the lineage or variant-specific pattern? What is the phenotypic prediction and ecological inference? Is there niche-transcending vs. niche-specifying genes? How does the observed pattern relate or inform the epidemiology in the case of pathogens? Can lineages or variants be informatively sub-clustered based on accessory genomic content?

Figure 5: Phylogeny-guided mapping of hierarchical genotypes and accessory AMR loci differentiating between major ST lineages within the S. Newport population. A core-genome phylogeny (black centered circle) was used to map hierarchical genotypes, including serovar (lowest level of resolution – innermost colored circle), BAPS level 1 (BAPS1) sub-groups or haplotypes, ST lineages, and cgMLST variants (highest level of resolution – outermost colored circle), along with AMR loci colored as dark-blue if present or gray if absent. Serovars were grouped into Newport (S. Newport) or "Other serovars" based on the SISTR algorithmic classification. BAPS1 is hierarchically placed between serovar and ST lineages because it accurately and agnostically captured the ancestral relationships between STs. ST lineages are formed based on canonical MLST analysis using seven genome-scattered loci. Only major or most frequent STs (proportion >1%) were depicted in the graph. Also, only the most dominant cgMLST variants (proportion >3.5%) were used to show the entire hierarchical structure for the S. Newport population (n = 2,365 USA isolates only). The category "Other STs" or "Other cgMLSTs" comprised of minor or low-frequency lineages or variants, respectively, and thresholding was done arbitrarily and should be set based on the dataset. AMR loci with a relative frequency greater than or equal to 10% were selected for data visualization. This specific graph shows a unique distribution of AMR loci predominantly occurring in ST31, ST45, and ST132 lineages. Please click here to view a larger version of this figure.

Supplementary Figure 1: Sparse distribution of ST lineages and cgMLST variants for the S. Newport population. (A) The proportion of ST lineages without aggregating low-frequency STs. STs with proportion >1% are highlighted in the plot. (B) The proportion of cgMLST variants without aggregating low-frequency cgMLSTs. cgMLSTs with proportion > 3% are highlighted in the plot. (A-B) Thresholds used to plot both ST and cgMLST data were arbitrarily defined and should be established based on the dataset. Genomes classified by SISTR as "Other serovars" (2.03%) were filtered out of the data prior to plotting both ST and cgMLST relative frequencies. The larger the circle diameter, the higher the proportion for either the ST lineage or cgMLST variant. Please click here to download this File.

Supplementary Figure 2: Distribution of cgMLST variants nested within ST lineages for the S. Newport population. This plot depicts the cgMLST variant distribution across ST lineages, excluding genomes classified as "Other serovars" (2.03% of the entire data). Major cgMLSTs (proportion >15%) for each ST lineage are highlighted in each graph. The larger the circle diameter, the higher the proportion for the specific cgMSLT variant. Low-frequency STs were grouped as "Other STs". Please click here to download this File.

Supplementary Figure 3: Simpson's D-based degree of genetic diversity across ST lineages using BAPS levels 1-6 haplotypes or cgMLST genotypes as input data for the S. Newport population. The degree of clonality or genetic diversity of each ST lineage was calculated across different genotypic layers of resolution, including BAPS levels 1 (lowest level of resolution) to 6 (highest level of resolution) sub-groups or haplotypes, and by additionally using the cgMLST-based distribution of variants. The higher the index value, the higher the degree of genetic diversity. Highly diverse ST lineages have higher index values going from BAPS1 to BAPS6 (i.e., typically the index increases and eventually plateaus when going from BAPS1 to BAPS6). Only genomes classified as S. Newport by the SISTR program were kept in the analysis. Low-frequency STs were grouped as "Other STs". Please click here to download this File.

Supplementary Figure 4: Distribution of BAPS levels 1-6 sub-groups or haplotypes across major ST lineages of the S. Newport population. Relative frequency-based distribution of BAPS sub-groups or haplotypes, across major ST lineages, from the lowest (BAPS1) to the highest level of resolution (BAPS6). Major STs were selected based on having a proportion >1%. Only genomes classified as S. Newport by the SISTR program were kept in the analysis. The higher the degree of clonality, the less sparse or spread the distribution of BAPS sub-groups or haplotypes becomes when going from BAPS1 to BAPS6. In other words, a more genetically diverse ST lineage has a wider range of BAPS sub-groups at the BAPS level 6 (highest degree of resolution). Low-frequency STs were grouped as "Other STs". Please click here to download this File.

Supplementary File 1: Links to material list and genomes list Please click here to download this File.

Supplementary File 2: Hierarchical-based bacterial population genomics analysis using R Please click here to download this File.

Discussion

The utilization of a systems-based heuristic and hierarchical population structure analysis provides a framework to identify novel genomic signatures in bacterial datasets that have the potential to explain unique ecological and epidemiological patterns²⁰. Additionally, the mapping of accessory genome data onto the population structure can be used to infer ancestrally-acquired and/or recently-derived traits that facilitate the spread of ST lineages or cgMLST variants across reservoirs⁶^,²⁰^,²¹^,⁴⁵^,⁴⁶. More broadly, a global assessment of pan-genomic content distribution in bacterial populations can reveal patterns of diversification that underly the ecological tropisms or geospatial/temporal bottlenecks that a population might have recently withstood¹⁸^,²¹. In the case of pathogenic species, by mining the population structure of clinical vs. environmental isolates, genetic determinants associated with zoonotic events may be identified and used to improve diagnostics and surveillance³³^,³⁴. The same approach can be applied to non-pathogenic species to identify genotypes with desirable niche-specific engrafting properties, as in the case of gastrointestinal probiotic strains used to improve human health⁴⁹^,⁵⁰^,⁵¹. Yet, the utilization of bacterial WGS data for population-based inquiries requires the use of reproducible, automated, and scalable computational platforms like ProkEvo²⁰. Any computational approach comes with its caveats and nuances, but in general, freely available, well-documented, portable, and user-friendly platforms such as ProkEvo can facilitate the work of microbiologists, ecologists, and epidemiologists doing heuristic bacterial population-based genomics.

In the present work, it was demonstrated how to use ProkEvo-derived outputs to conduct a hierarchical population structure analysis that can be used to map and track genotypes of interest at different levels of resolution, along with predicting useful traits from WGS data. This computational protocol was written using the R programming language, but the framework or conceptual approach is generalizable to other languages such as Python through the utilization of the Pandas library, for instance. The input data is generated by ProkEvo²⁰, which prevents some hurdles to be faced in terms of standardizing outputs and data formats for subsequent analysis. With the exception of phylogenies, all other input datasets come in a tabular format that can easily be quality-controlled, aggregated, parsed, and integrated to generate useful reports for data interpretation. However, it is important to highlight a few critical steps to enhance reproducibility while using this protocol: i) make sure the software versions are always updated and tracked; ii) track the versions of the data science libraries being used, and preferably update them over time; iii) quality-control the data using domain knowledge expertise to make sense of the outputs generated by ProkEvo, or a similar pipeline, in light of what is understood for the targeted bacterial population; iv) conduct an exploratory data analysis prior to using any modeling approach; v) aggregate the data based on empirical knowledge and/or statistical assessments; vi) define a strategy to deal with missing values a priori and be consistent and completely transparent about it; vii) if using R, try to use all the packages provided by Tidyverse, because this collection facilitates functional programming, portability, optimization, and is freely available; and viii) be aware that visualization approaches can be difficult because it takes some trial and error to get the right type of plot and coloring scheme that is most appropriately applicable for the question being asked and the data being portrayed.

Of note, this protocol comes with some limitations that can be further improved. For instance, ProkEvo has an intrinsic limit to how many genomes can be used for pan-genomic analysis, if the core-genome alignment step is generated concomitantly, while utilizing the Roary program (~ 2,000-3,000 genomes)²⁴. That is a very specific bottleneck in the pipeline that will affect the number of genomes that can be classified into BAPS haplotypes since it depends on core-genome alignment (i.e., highly computationally demanding step). However, core-genome alignment can be done with other programs⁵², and such algorithms, in theory, could be easily incorporated into ProkEvo. Otherwise, datasets can be strategically split into random subsets, or in another basis such as by considering the population structure of the organism in question. Alternatively, ProkEvo can be run with a single genome to get ST-based annotation, antibiotic resistance and virulence gene composition, and mapping of plasmids, but the pipeline was designed for population-based genomics. Noteworthy, if the BAPS1-6 classifications are not needed, then the core-genome alignment option of Roary can be turned off, and in that case, ProkEvo can be used with many hundredths of thousands of genomes – it is only limited based on the number of computer cores available. An example of how to implement a new program or how to turn off the core-genome alignment option in Roary within ProkEvo can be found in the following GitHub links (https://github.com/npavlovikj/ProkEvo/wiki/4.1.-Add-new-bioinformatics-tool-to-ProkEvo) and (https://github.com/npavlovikj/ProkEvo/wiki/4.3.-Change-running-options-for-existing-tool-in-ProkEvo), respectively. In the case of accessory genomic mining, an agnostic analysis depends on the utilization of the pan-genomic .Rtab file generated by Roary²⁴, which was not specifically used here, but instead, it was strategically demonstrated how to map AMR loci with ABRicate using the Resfinder database (https://github.com/tseemann/abricate). Nonetheless, there is an option to expand the scope of the accessory genomic mapping by using a pan-genomic file instead, which can be practically viewed as an expansion of the current approach (e.g., more loci included in the tabular dataset as new columns). It is important to mention that the pan-genomic mapping done by ProkEvo only provided binary information in terms of loci composition, and currently, cannot be used for the identification of single nucleotide polymorphisms across genes.

Another limitation of this protocol is the visualization of the phylogenetic tree. Currently, ggtree is the program of choice, but that comes at the cost of being unable to accurately inspect branch lengths and becomes cumbersome when many layers of data need to be added onto the phylogeny. Alternatively, phandango⁴¹ is a user-friendly, scalable web-page formatted GUI (https://jameshadfield.github.io/phandango/#/)⁴¹ that could easily be used to achieve the same goal, and further detailed information on how to use it with ProkEvo outputs is recently published²⁰. Other tools like iTOL could also be used for phylogeny-dependent visualization of data⁵³, but they require using a GUI and cannot be incorporated in automated scripts. Also, accurate core-genome phylogenies can be difficult to estimate due to the cryptic dataset-dependent impact of horizontal gene transfer. Programs such as Gubbins⁵⁴ can be used for that purpose, but they also come with certain limitations such as the need of using whole-genome alignment and ST lineage-specific datasets for the correct estimation of phylogenies. Instead, other phylogeny-independent approaches can be deployed, which then end up requiring other types of visualizations to integrate metadata or accessory genomic information, as in the case of multi-dimensional analysis⁵⁵^,⁵⁶. Lastly, an empirical and arbitrary approach was used to aggregate minor ST lineages and cgMLST variants, in addition to filtering the most important AMR loci to be quantified. This type of data aggregation can be done empirically using domain knowledge expertise, but could also be achieved statistically by defining a priori criterion of the proportion of the distribution that should be displayed, or by using distribution-related metrics such as interquartile range, standard deviation, or skewness, to ultimately define a threshold. Importantly, the definition for minor genotypes is directly influenced by the nature of the data since sample size, and bias in the types of environmental samples can directly influence the genotypic composition. Regardless, the main consideration is that the mapping of accessory genome content onto the population structure allows for potential genetic determinants of ecological diversification to be identified, such as niche-transcending or niche-specifying genes⁵⁷^,⁵⁸^,⁵⁹.

Although the available R scripts were designed for automation of the present work, all provided scripts would need to be further developed to become an abstract and deployable data science library, that could for instance be an integral part of the ProkEvo pipeline. Nonetheless, there are some specific advantages of utilizing this approach such as the use of the BAPS level 1 genotyping or clustering scheme. The placement of BAPS level 1 sub-groups or haplotypes between serovar and ST lineages was defined empirically based on the genetic structure of the Salmonella population, but it seems to be applicable to other species such as Campylobacter jejuni and Staphylococcus aureus²⁰. Moreover, BAPS1 accurately captures the ancestral relationship between ST lineages and provides a scalable approach for evolutionary analysis, especially when phylogenetic applications are limited²⁰. Furthermore, the use of a nested approach for examining hierarchical relationships and patterns of diversification facilitates the identification of ancestry between ST lineages using BAPS1 sub-groups, and across cgMLST variants using ST lineages, successively going from lower to higher genotypic resolution in assessing the population structure. It is important to reiterate that the frequency distribution of ST lineages and cgMLST variants, if drawn from a systematically collected and statistically powered sample, can become a proxy for ecological fitness¹^,⁶^,⁴³. Consequently, dominant ST lineages and cgMLST variants are likely to contain unique genomic features that may be the basis of the biological mechanism for their dominance in the population in that particular environment or host.

Herein, two independent statistical metrics were used to assess the degree of clonality of the population, which allows for an auxiliary understanding of the population genetic diversity, which may indicate the past occurrence of sample bias, population bottlenecks, or founder effect. In particular, the agnostic assessment of BAPS levels 1-6 sub-groups across ST lineages can refine the understanding of genetic diversity that cannot typically be resolved by simply looking at Salmonella cgMLST variant level generated by SISTR. As mentioned beforehand, other features of the pan-genome can be mapped onto the population structure and files containing plasmid and virulence gene composition, in addition to the utilization of other AMR databases along with agnostic pan-genome dataset, are automatically generated by ProkEvo²⁰. Of note, ProkEvo does not currently allow for differentiation between AMR loci present in the bacterial chromosome vs. plasmids. Ecological and epidemiological metadata can also be easily integrated into this analytical approach by incorporation of other variables in a .csv file containing all the genomic information. In particular, the work presented here specifically complements the utilization of the scalable and portable computational platform ProkEvo, which was designed to be used by researchers focused on heuristic population genomics analyses that facilitate data mining and customization by the user. Other platforms can be used for genotyping, population structure analysis, and/or mapping of accessory genomes such as Enterobase⁵, PATRIC⁶⁰, andBacWGSTdb⁶¹. The latter are excellent resources that facilitate genomics data mining for researchers who are not seeking to customize and utilize cluster computing for scalable and complex analysis. The analytical approach presented here is specifically tailored for researchers who want to have the flexibility to carry out a population genomics analysis using reproducible scripts on their local machine or by using a cloud- or high-performance computational platform.

In conclusion, the analytical R-based platform presented in this work was targeted at providing a practical guide for microbiologists, ecologists, and epidemiologists on how to: i) use phylogeny-dependent approaches to map hierarchical genotypes; ii) assess the frequency distribution of genotypes as a proxy for evaluating ecological fitness; iii) determine lineage-specific degrees of clonality using independent statistical approaches; and iv) map lineage-differentiating AMR loci as an example of how to mine accessory genomic content in the context of the population structure. The scripts provided here can be used on either a local machine or a high-performance computational platform. For experimental and environmental microbiologists, this approach facilitates studies of datasets aimed at identifying unique traits and candidate pathways for further mechanistic studies that ultimately can be contextualized at the population level. Ecologists can benefit from this approach by being able to analyze moderate-to-large datasets, that in theory, increase the statistical power needed to find signatures of selection in a population while considering kinship relationships and patterns of diversification. Lastly, epidemiologists can harness unique practical information for diagnostics and surveillance by defining genotypic units of interest and predicting Public Health-associated traits such as AMR. More broadly, this analytical guidance provides a generalizable framework to utilize ProkEvo to perform a population-based genomic analysis that can be used to infer evolutionary and ecological patterns for pathogenic and non-pathogenic species since the approach is generalizable to other bacterial species.

Divulgaciones

The authors have nothing to disclose.

Acknowledgements

This work was supported by funding provided by the UNL-IANR Agricultural Research Division and the National Institute for Antimicrobial Resistance Research and Education and by the Nebraska Food for Health Center at the Food Science and Technology Department (UNL). This research could only be completed by utilizing the Holland Computing Center (HCC) at UNL, which receives support from the Nebraska Research Initiative. We are also thankful for having access, through the HCC, to resources provided by the Open Science Grid (OSG), which is supported by the National Science Foundation and the U.S. Department of Energy's Office of Science. This work used the Pegasus Workflow Management Software which is funded by the National Science Foundation (grant #1664162).

Materials

amr_data_filtered			https://figshare.com/account/projects/116625/articles/14829225?file=28758762
amr_data_raw			https://figshare.com/account/projects/116625/articles/14829225?file=28547994
baps_output			https://figshare.com/account/projects/116625/articles/14829225?file=28548003
Core-genome phylogeny			https://figshare.com/account/projects/116625/articles/14829225?file=28548006
genome_sra			https://figshare.com/account/projects/116625/articles/14829225?file=28639209
Linux, Mac, or PC			any high-performance platform
mlst_output			https://figshare.com/account/projects/116625/articles/14829225?file=28547997
sistr_output			https://figshare.com/account/projects/116625/articles/14829225?file=28548000
figshare credentials are required for login and have access to the files

Referencias

Grad, Y. H., et al. Genomic epidemiology of the Escherichia coli O104:H4 outbreaks in Europe, 2011. Proceedings of the National Academy of Sciences of the United States of America. 109 (8), 3065-3070 (2012).
Worby, C. J., Chang, H. -. H., Hanage, W. P., Lipsitch, M. The distribution of pairwise genetic distances: a tool for investigating disease transmission. Genética. 198 (4), 1395-1404 (2014).
Leekitcharoenphon, P., et al. Global genomic epidemiology of Salmonella enterica serovar Typhimurium DT104. Applied and Environmental Microbiology. 82 (8), 2516-2526 (2016).
Alba, P., et al. Molecular epidemiology of Salmonella Infantis in Europe: insights into the success of the bacterial host and its parasitic pESI-like megaplasmid. Microbial Genomics. 6 (5), (2020).
Zhou, Z., Alikhan, N. -. F., Mohamed, K., Fan, Y. the Agama Study Group, Achtman, M. The EnteroBase user’s guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny, and Escherichia core genomic diversity. Genome Research. 30 (1), 138-152 (2020).
Azarian, T., et al. Global emergence and population dynamics of divergent serotype 3 CC180 pneumococci. PLOS Pathogens. 14 (11), 1007438 (2018).
Saltykova, A., et al. Comparison of SNP-based subtyping workflows for bacterial isolates using WGS data, applied to Salmonella enterica serotype Typhimurium and serotype 1,4,[5],12:i. PLOS ONE. 13 (2), 0192504 (2018).
Achtman, M., et al. Multi-locus sequence typing as a replacement for serotyping in Salmonella enterica. PLoS Pathogens. 8 (6), 1002776 (2012).
Maiden, M. C. J., et al. Multi-locus sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganisms. Proceedings of the National Academy of Sciences of the United States of America. 95 (6), 3140-3145 (1998).
Alikhan, N. -. F., Zhou, Z., Sergeant, M. J., Achtman, M. A genomic overview of the population structure of Salmonella. PLOS Genetics. 14 (4), 1007261 (2018).
Gupta, A., Jordan, I. K., Rishishwar, L. stringMLST: a fast k-mer based tool for multi-locus sequence typing. Bioinformatics. 33 (1), 119-121 (2017).
Jolley, K. A., Maiden, M. C. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics. 11 (1), 595 (2010).
Maiden, M. C. J., et al. MLST revisited: the gene-by-gene approach to bacterial genomics. Nature Reviews Microbiology. 11 (10), 728-736 (2013).
Maiden, M. C. J. Multilocus sequence typing of bacteria. Annual Review of Microbiology. 60 (1), 561-588 (2006).
Shapiro, B. J., Polz, M. F. Ordering microbial diversity into ecologically and genetically cohesive units. Trends in Microbiology. 22 (5), 235-247 (2014).
Cordero, O. X., Polz, M. F. Explaining microbial genomic diversity in light of evolutionary ecology. Nature Reviews Microbiology. 12 (4), 263-273 (2014).
Achtman, M., Wagner, M. Microbial diversity and the genetic nature of microbial species. Nature Reviews Microbiology. 6 (6), 431-440 (2008).
Abudahab, K., et al. PANINI: Pangenome neighbour identification for bacterial populations. Microbial Genomics. 5 (4), (2019).
Laing, C. R., Whiteside, M. D., Gannon, V. P. J. Pan-genome analyses of the species Salmonella enterica, and identification of genomic markers predictive for species, subspecies, and serovar. Frontiers in Microbiology. 8, 1345 (2017).
Pavlovikj, N., Gomes-Neto, J. C., Deogun, J. S., Benson, A. K. ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses. PeerJ. 9, 11376 (2021).
McNally, A., et al. Combined analysis of variation in core, accessory and regulatory genome regions provides a super-resolution view into the evolution of bacterial populations. PLOS Genetics. 12 (9), 1006280 (2016).
Langridge, G. C., et al. Patterns of genome evolution that have accompanied host adaptation in Salmonella. Proceedings of the National Academy of Sciences of the United States of America. 112 (3), 863-868 (2015).
Price, M. N., Dehal, P. S., Arkin, A. P. FastTree 2 – Approximately maximum-likelihood trees for large alignments. PLoS ONE. 5 (3), 9490 (2010).
Page, A. J., et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 31 (22), 3691-3693 (2015).
Yoshida, C. E., et al. The Salmonella In silico typing resource (SISTR): An open web-accessible tool for rapidly typing and subtyping draft Salmonella genome assemblies. PLOS ONE. 11 (1), 0147101 (2016).
Cheng, L., Connor, T. R., Siren, J., Aanensen, D. M., Corander, J. Hierarchical and spatially explicit clustering of DNA sequences with BAPS software. Molecular Biology and Evolution. 30 (5), 1224-1228 (2013).
Tonkin-Hill, G., Lees, J. A., Bentley, S. D., Frost, S. D. W., Corander, J. Fast hierarchical Bayesian analysis of population structure. Nucleic Acids Research. 47 (11), 5539-5549 (2019).
MLST. GitHub Available from: https://github.com/tseemann/mist (2020)
ABRicate. GitHub Available from: https://github.com/tseemann/abricate (2020)
R: A language and environment for statistical computing. R Foundation for Statistical Computing Available from: https://cran.r-project.org (2021)
Wickham, H., et al. Welcome to the Tidyverse. Journal of Open Source Software. 4 (43), 1686 (2019).
rOpenSci: The skimr package. GitHub Available from: https://github.com/ropensci/skimr/ (2021)
. vegan: Community ecology package. R package version 2.5-5 Available from: https://CRAN.R-project.org/package=vegan (2019)
Yu, G. Using ggtree to visualize data on tree-like structures. Current Protocols in Bioinformatics. 69 (1), (2020).
. ggpubr: “ggplot2” Based Publication Ready Plots. R package version 0.4.0 Available from: https://CRAN.R-project.org/package=ggpubr (2020)
. ggrepel: Automatically Position Non-Overlapping Text Labels with “ggplot2”. R package version 0.9.1 Available from: https://CRAN.R-project.org/package=ggrepel (2021)
Wickham, H. Reshaping Data with the reshape Package. Journal of Statistical Software. 21 (12), (2007).
. RColorBrewer: ColorBrewer Palettes. R package version 1.1-2 Available from: https://CRAN.R-project.org/package=RColorBrewer (2014)
Hadfield, J., Croucher, N. J., Goater, R. J., Abudahab, K., Aanensen, D. M., Harris, S. R. Phandango: an interactive viewer for bacterial population genomics. Bioinformatics. 34 (2), 292-293 (2018).
Perron, G. G., et al. Functional characterization of bacteria isolated from ancient arctic soil exposes diverse resistance mechanisms to modern antibiotics. PLOS ONE. 10 (3), 0069533 (2015).
Mitchell, P. K., et al. Population genomics of pneumococcal carriage in Massachusetts children following introduction of PCV-13. Microbial Genomics. 5 (2), (2019).
Klemm, E. J., et al. Emergence of host-adapted Salmonella Enteritidis through rapid evolution in an immunocompromised host. Nature Microbiology. 1 (3), 15023 (2016).
Břinda, K., et al. Rapid inference of antibiotic resistance and susceptibility by genomic neighbour typing. Nature Microbiology. 5 (3), 455-464 (2020).
MacFadden, D. R., et al. Using genetic distance from archived samples for the prediction of antibiotic resistance in Escherichia coli. Antimicrobial Agents and Chemotherapy. 64 (5), (2020).
Mageiros, L., et al. Genome evolution and the emergence of pathogenicity in avian Escherichia coli. Nature Communications. 12 (1), 765 (2021).
Yahara, K., et al. Genome-wide association of functional traits linked with Campylobacter jejuni survival from farm to fork. Environmental Microbiology. 19 (1), 361-380 (2017).
Walter, J., Maldonado-Gómez, M. X., Martínez, I. To engraft or not to engraft: an ecological framework for gut microbiome modulation with live microbes. Current Opinion in Biotechnology. 49, 129-139 (2018).
Maldonado-Gómez, M. X., et al. Stable engraftment of Bifidobacterium longum AH1206 in the human gut depends on individualized features of the resident microbiome. Cell Host & Microbe. 20 (4), 515-526 (2016).
Zhao, S., et al. Adaptive evolution within gut microbiomes of healthy people. Cell Host & Microbe. 25 (5), 656-667 (2019).
Treangen, T. J., Ondov, B. D., Koren, S., Phillippy, A. M. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biology. 15 (11), 524 (2014).
Letunic, I., Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Research. 49, 293-296 (2021).
Croucher, N. J., et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Research. 43 (3), 15 (2015).
Fenske, G. J., Thachil, A., McDonough, P. L., Glaser, A., Scaria, J. Geography shapes the population genomics of Salmonella enterica Dublin. Genome Biology and Evolution. 11 (8), 2220-2231 (2019).
Lees, J. A., et al. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Research. 29 (2), 304-316 (2019).
Cohan, F. M. Towards a conceptual and operational union of bacterial systematics, ecology, and evolution. Philosophical Transactions of the Royal Society B: Biological Sciences. 361 (1475), 1985-1996 (2006).
Cohan, F. M., Koeppel, A. F. The origins of ecological diversity in prokaryotes. Current Biology. 18 (21), 1024-1034 (2008).
Cohan, F. M. Transmission in the origins of bacterial diversity, from ecotypes to phyla. Microbial Transmission. 5 (5), 311-343 (2019).
Davis, J. J., et al. The PATRIC bioinformatics resource center: expanding data and analysis capabilities. Nucleic Acids Research. 48, 606-612 (2019).
Feng, Y., Zou, S., Chen, H., Yu, Y., Ruan, Z. BacWGSTdb 2.0: a one-stop repository for bacterial whole-genome sequence typing and source tracking. Nucleic Acids Research. 49, 644-650 (2021).

Play Video

PDF

DOI

DOWNLOAD MATERIALS LIST

Citar este artículo

Pavlovikj, N., Gomes-Neto, J. C., Benson, A. K. Heuristic Mining of Hierarchical Genotypes and Accessory Genome Loci in Bacterial Populations. J. Vis. Exp. (178), e63115, doi:10.3791/63115 (2021).