A mass spectrometry-guided genome mining protocol is established and described here. It is based on genome sequence information and LC-MS/MS analysis and aims to facilitate identification of molecules from complex microbial and plant extracts.
The chemical space covered by natural products is immense and largely unrecognized. Therefore, convenient methodologies for wide-ranging evaluation of their functions in nature and their potential human benefits (e.g., for drug discovery applications) are desired. This protocol describes the combination of genome mining (GM) and molecular networking (MN), two contemporary approaches that match gene cluster annotations from whole genome sequencing (WGS) with chemical structure signatures from crude metabolic extracts. This is the first step towards the discovery of new natural entities. These concepts, when applied together, are defined here as MS-guided genome mining. In this method, the main components are first identified (using MN), and structurally related new candidates are then associated with genome sequence annotations (using GM). Combining GM and MN is a productive strategy to target new molecular backbones or to mine metabolic profiles in order to identify analogues of already known compounds.
Investigations of secondary metabolism often consist of screening crude extracts for specific biological activities, followed by purification, identification, and characterization of the constituents of active fractions. This process has proved efficient, promoting the isolation of several chemical entities. However, it is now considered impractical, mainly because of high rediscovery rates. Because the pharmaceutical industry developed without knowledge of the roles and functions of specialized metabolites, these compounds were identified under laboratory conditions that did not accurately represent nature1. Today, there is a better understanding of the influence of natural signaling, of secretion, and of the fact that most targets are present at undetectably low concentrations. Additionally, understanding how these processes are regulated will help the academic community and the pharmaceutical industry take advantage of this knowledge. It will also benefit research involving the direct isolation of metabolites related to silent biosynthetic gene clusters (BGCs)2.
In this context, advances in genomic sequencing have renewed interest in screening microbial metabolites. This is because analyzing the genomic information of newly uncovered biosynthetic clusters can reveal genes encoding novel compounds that are not observed or produced under laboratory conditions. Many complete or draft microbial genomes are available today, and the number grows every year, providing vast prospects for uncovering novel bioactive molecules through genome mining3,4.
The Atlas of Biosynthetic Gene Clusters, a component of the Integrated Microbial Genomes Platform of the Joint Genome Institute (JGI IMG-ABC), is currently the largest collection of automatically mined gene clusters2. More recently, the Minimum Information for Biosynthetic Gene Clusters (MIBiG) Standardization Initiative has promoted the manual reannotation of BGCs, providing a highly curated reference dataset5. Many tools are now available for computational mining of genetic data and for connecting these data to known secondary metabolites. Different strategies have also been developed to access new bioactive natural products (e.g., heterologous expression, target gene deletion, in vitro reconstitution, genomic sequence- and isotope-guided screening [genomisotopic approach], manipulation of local and global regulators, resistance target-based mining, culture-independent mining, and, more recently, MS-guided/code approaches2,6,7,8,9,10,11,12,13,14,15).
Genome mining as a stand-alone strategy requires considerable effort to annotate a single molecule or a small group of molecules; thus, gaps remain in the process by which new compounds are prioritized for isolation and structure elucidation. In principle, these approaches target only one biosynthetic pathway per experiment, resulting in a slow discovery rate. In this sense, using GM together with a molecular networking approach represents an important advance for natural product research14,15.
The versatility, accuracy, and high sensitivity of liquid chromatography-mass spectrometry (LC-MS) make it a good method for compound identification. Currently, several platforms provide algorithms and software suites for untargeted metabolomics16,17,18,19,20. The core of these programs includes feature detection (peak picking)21 and peak alignment, which allow identical features to be matched across a batch of samples and searched for patterns. MS pattern-based algorithms22,23 compare characteristic fragmentation patterns and match MS2 similarities, generating molecular families that share structural features. These features can then be highlighted and clustered, making it possible to rapidly discover known and unknown molecules in a complex biological extract by tandem MS2,24,25. Therefore, tandem MS is a versatile method for gaining structural information on several chemotypes contained in a large amount of data simultaneously.
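The peak alignment step mentioned above can be illustrated with a minimal Python sketch. It is not the algorithm of any particular software suite: the function name `align_features`, the tolerance values, and the example features are assumptions chosen for illustration. Features from different samples are placed in the same row when their m/z and retention time agree within the tolerances.

```python
def align_features(samples, mz_tol=0.01, rt_tol=0.2):
    """Greedy alignment sketch.

    samples: {sample_name: [(mz, rt_min, intensity), ...]}
    Returns a list of rows, one per aligned feature.
    """
    rows = []  # each row: {"mz": ..., "rt": ..., "intensities": {sample: intensity}}
    for name, features in samples.items():
        for mz, rt, intensity in features:
            for row in rows:
                same_feature = (abs(row["mz"] - mz) <= mz_tol
                                and abs(row["rt"] - rt) <= rt_tol)
                # a sample contributes at most one intensity per row
                if same_feature and name not in row["intensities"]:
                    row["intensities"][name] = intensity
                    break
            else:
                # no existing row matched: start a new aligned feature
                rows.append({"mz": mz, "rt": rt, "intensities": {name: intensity}})
    return rows


# Hypothetical features from two extracts (values are illustrative only)
example = {
    "wild_type": [(1111.640, 12.30, 5e5), (300.100, 3.00, 1e4)],
    "heterologous": [(1111.645, 12.35, 8e5)],
}
aligned = align_features(example)
```

In this sketch the first feature of each sample lands in the same row, which is the kind of cross-sample match that later allows patterns (for example, ions present only when a BGC is expressed) to be searched.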
The Global Natural Products Social Molecular Networking (GNPS)26 algorithm uses normalized fragment ion intensities to construct multidimensional vectors, whose similarity is compared using a cosine function. The relationships between different parent ions are plotted as a diagram in which each fragmentation spectrum is visualized as a node (circle), and the relatedness between nodes is defined by an edge (line). The global visualization of molecules from a single source is defined as a molecular network. Structurally divergent molecules that fragment uniquely form their own specific cluster or constellation, whereas related molecules cluster together. Clustering chemotypes allows similar structural features to be hypothetically connected to their biosynthetic origins.
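The cosine comparison described above can be sketched in a few lines of Python. This is a simplified illustration, not the GNPS implementation: the bin width, the square-root intensity scaling, and the function names are assumptions, and the real scoring additionally handles fragment shifts caused by precursor mass differences.

```python
import math
from collections import defaultdict


def spectrum_to_vector(peaks, bin_width=0.5):
    """Bin (m/z, intensity) pairs and scale the vector to unit length."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        # square-root scaling damps dominant peaks (an assumed, common choice)
        binned[int(mz / bin_width)] += math.sqrt(intensity)
    norm = math.sqrt(sum(v * v for v in binned.values()))
    return {k: v / norm for k, v in binned.items()} if norm else {}


def cosine_score(peaks_a, peaks_b, bin_width=0.5):
    """Cosine of the angle between two binned, normalized spectra (0 to 1)."""
    vec_a = spectrum_to_vector(peaks_a, bin_width)
    vec_b = spectrum_to_vector(peaks_b, bin_width)
    shared_bins = vec_a.keys() & vec_b.keys()
    return sum(vec_a[k] * vec_b[k] for k in shared_bins)


# Hypothetical fragment lists (m/z, intensity), for illustration only
spectrum_1 = [(100.1, 50.0), (200.2, 100.0), (300.3, 25.0)]
spectrum_2 = [(100.1, 40.0), (200.2, 90.0), (314.3, 30.0)]
score = cosine_score(spectrum_1, spectrum_2)
```

Two identical spectra score 1, and spectra with no fragments in common score 0; values in between reflect partially shared fragmentation, which is what places related molecules in the same cluster.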
Combining both chemotype-to-genotype and genotype-to-chemotype approaches is powerful when creating bioinformatic links between BGCs and their small molecule products27. Therefore, MS-guided genome mining is a rapid strategy that consumes little material, and it helps bridge parent ions and biosynthetic pathways revealed by WGS of one or more strains under diverse metabolic and environmental conditions.
The workflow of this protocol (Figure 1) consists of feeding WGS data into a biosynthetic gene cluster annotation platform such as antiSMASH28,29,30. This helps estimate the variety and classes of compounds encoded by the genome. A strategy to target a biosynthetic gene cluster encoding a chemical entity of interest must then be adopted, and culture extracts from a wild type strain and/or a heterologous strain containing the BGC can be analyzed to generate ions clustered by similarity using GNPS26,31. Consequently, it is possible to identify new molecules that are associated with the targeted BGC and are not available in the database (mainly unknown analogues, sometimes produced in low titers). It is worth noting that users can contribute to these platforms and that the availability of bioinformatics and MS/MS data is increasing rapidly, driving the constant development and upgrading of effective computational tools and algorithms that guide efficient connections between complex extracts and molecules.
Figure 1: Overview of the entire workflow. Shown is an illustration of the bioinformatic, cloning, and molecular networking steps involved in the described MS-guided genome mining approach to identify new metabolites.
This protocol describes a rapid and efficient workflow that combines genome mining and molecular networking as a starting point for the natural product discovery pipeline. Although many applications are able to visualize the composition and relatedness of MS-detectable molecules in one network, several are adopted here to visualize structurally similar clustered molecules. Using this strategy, novel cyclodepsipeptide products observed in metabolic extracts of Streptomyces sp. CBMAI 2042 are successfully identified. Guided by genome mining, the whole biosynthetic gene cluster encoding valinomycins is recognized and cloned into the heterologous host strain Streptomyces coelicolor M1146. Finally, following MS pattern-based molecular networking, the molecules detected by MS are correlated with the BGCs responsible for their biogenesis32.
1. Genome mining for biosynthetic gene clusters
Figure 2: Output from antiSMASH platform. Secondary metabolism in silico analysis from whole genome sequence annotation.
2. Heterologous expression of whole biosynthetic gene cluster from the ESAC library
3. Streptomyces/E. coli conjugation
4. Strain cultivation
5. Acquiring mass spectra and preparation for GNPS analysis
Figure 3: Using MSConvert to convert MS files to the mzXML format. The correct parameters for GNPS analysis are displayed. The instructions are as follows: add all MS files in box 1 and add the Peak Picking filter in box 2; for this filter, use the vendor algorithm; press Start, and the conversion will proceed.
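The same conversion can, in principle, be run from the command line with the `msconvert` tool that ships with ProteoWizard. The following Python sketch only assembles the command; the helper name `build_msconvert_command`, the input file names, and the output directory are hypothetical, and the options shown (`--mzXML`, `--filter "peakPicking vendor msLevel=1-"`, `-o`) should be checked against the installed ProteoWizard version before use.

```python
def build_msconvert_command(raw_files, out_dir="mzxml_out"):
    """Assemble an msconvert call mirroring the GUI settings in Figure 3.

    raw_files: paths to vendor MS files (hypothetical names in the example).
    out_dir:   assumed output directory for the converted .mzXML files.
    """
    return [
        "msconvert",
        *[str(path) for path in raw_files],
        "--mzXML",                                   # output format used for GNPS here
        "--filter", "peakPicking vendor msLevel=1-",  # vendor peak picking, all MS levels
        "-o", str(out_dir),
    ]


command = build_msconvert_command(["extract_wt_1.d", "extract_het_1.d"])
# On a machine with ProteoWizard installed, the command could be executed with:
#   import subprocess; subprocess.run(command, check=True)
```

Building the argument list separately from running it makes the settings easy to review and to record alongside the data.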
6. GNPS analysis
Figure 4: Using online GNPS platform to perform molecular network analysis. Selection of mzXML files is done by clicking in box 1. In the open dialog box, the files can be selected from personal folder (box 2) or be uploaded in the second tab using the drag-and-drop file uploader (less than 20 MB). The files can be grouped into up to six groups.
Figure 5: Using GNPS to perform molecular network analysis (advanced options). Min Pair Cos directly influences the size of clusters: high values link only closely related compounds, whereas low values also link distantly related compounds. Values that are too low should be avoided. Minimum Matched Fragment Ions is the number of fragments that two fragmentation spectra must share to be linked in the network. Together, both parameters shape the network; lower values cluster more distantly related compounds and vice versa. Choosing proper values greatly helps compound elucidation.
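To make the effect of these two parameters concrete, the following Python sketch keeps an edge between two spectra only when both thresholds are met. It is an illustration under stated assumptions rather than the GNPS code: the greedy fragment matching, the fragment tolerance, the function names, and the threshold values used as defaults are all chosen for the example.

```python
import math
from itertools import combinations


def match_peaks(spec_a, spec_b, tol=0.02):
    """Return (cosine, n_matched) using greedy fragment matching within tol."""
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    # all fragment pairs within tolerance, strongest intensity product first
    pairs = sorted(
        ((ia * ib, x, y)
         for x, (ma, ia) in enumerate(spec_a)
         for y, (mb, ib) in enumerate(spec_b)
         if abs(ma - mb) <= tol),
        reverse=True,
    )
    used_a, used_b = set(), set()
    dot, matched = 0.0, 0
    for product, x, y in pairs:
        if x not in used_a and y not in used_b:  # each fragment is used once
            used_a.add(x)
            used_b.add(y)
            dot += product
            matched += 1
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return cosine, matched


def build_edges(spectra, min_pair_cos=0.7, min_matched_peaks=6):
    """Link two nodes only if both the cosine and matched-fragment thresholds pass."""
    edges = []
    for (id_a, a), (id_b, b) in combinations(sorted(spectra.items()), 2):
        cosine, matched = match_peaks(a, b)
        if cosine >= min_pair_cos and matched >= min_matched_peaks:
            edges.append((id_a, id_b, round(cosine, 3)))
    return edges
```

Lowering either threshold in `build_edges` admits more edges and therefore larger, looser clusters; raising them leaves only closely related spectra connected.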
7. Analysis of GNPS results
Figure 6: Using GNPS to visualize molecular network results. All related compound clusters can be seen in view spectral families (red box). To visualize only library hits, "view all library hits" (blue box) should be selected. For better graphical representation of molecular network results, "Direct Cytoscape Preview" (yellow box) should be downloaded, and the latest version of Cytoscape should be used.
Figure 7: Using GNPS to visualize molecular cluster results. After opening the molecular clusters for better data visualization, the following should be chosen: "Parent mass" as node labels (red box); "DeltaMZ" as edge labels (yellow box); and "Draw pies" as node coloring (blue box). Navigate through the molecular cluster and try to annotate all nodes.
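When annotating nodes, the "DeltaMZ" edge labels can be compared against the mass shifts of common structural modifications. The Python sketch below illustrates the idea; the short list of shifts (standard monoisotopic values), the tolerance, the function name, and the example m/z values are assumptions for illustration, and a match is only a hypothesis that must be checked against the MS/MS spectra.

```python
# Monoisotopic mass shifts (Da) for a few common modifications
KNOWN_SHIFTS = {
    "CH2": 14.01565,   # e.g., one additional methylene group
    "O": 15.99491,     # e.g., hydroxylation
    "H2O": 18.01056,   # e.g., hydration or dehydration
    "C2H4": 28.03130,  # e.g., two additional methylene groups
}


def annotate_delta(parent_mz_a, parent_mz_b, charge=1, tol=0.01):
    """Suggest modifications that could explain the mass difference between two nodes."""
    delta_mass = abs(parent_mz_a - parent_mz_b) * charge
    return [name for name, shift in KNOWN_SHIFTS.items()
            if abs(delta_mass - shift) <= tol]


# Hypothetical pair of connected nodes (m/z values are illustrative only)
suggestions = annotate_delta(1128.665, 1142.681)
```

A difference close to 14.016 between two connected singly charged ions, for example, is consistent with analogues that differ by one methylene unit.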
The protocol was successfully exemplified using a combination of genome mining, heterologous expression, and MS-guided/code approaches to access new specialized valinomycin analogues. The genome-to-molecule workflow for the target, valinomycin (VLM), is represented in Figure 8. The Streptomyces sp. CBMAI 2042 draft genome was analyzed in silico, and the VLM gene cluster was then identified and transferred to a heterologous host. Heterologous and wild type strains were cultivated in triplicate under proper fermentation conditions, and the cultures were partitioned with ethyl acetate and concentrated to generate the crude extracts. From these extracts, MS/MS data were acquired to generate a tandem MS metabolite profile for molecular networking. Figure 9 represents the clustered ions obtained from MS/MS data of the Streptomyces sp. CBMAI 2042 crude extract, in which characteristic fragmentation patterns and corresponding MS similarities suggest the occurrence of a molecular family sharing structural features2. Following known biosynthetic logic and bioinformatic insights, and supported by pattern-based MS/MS spectra, the structures of four newly reported cyclodepsipeptides were elucidated, and their origins were correlated with the same biosynthetic gene cluster responsible for VLM assembly32.
Molecular networking data (found at <https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=6f97aa4addfa4d20b505fdb4328b088c>) were processed on the GNPS platform and deposited in the MassIVE repository (MSV000083709). For dereplication, two strategies were selected to populate the network with previously described compounds: 1) Dereplicator (found at <https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=1a55e768d02649aaa09d78d0d4778ef3>) and 2) a peptide natural product identification tool called VarQuest. Our previous publication provides further details32.
Figure 8: Workflow from in silico genome sequence analysis to MS data acquisition. (A) A draft genome of Streptomyces sp. CBMAI 2042 is obtained by Illumina MiSeq sequencing. (B) Valinomycin BGC identification and annotation. (C) After transferring the whole gene cluster to an appropriate host, the strain is cultivated. The ethyl acetate extract from the culture is analyzed by LC to obtain a profile of the secondary metabolites produced. The chromatogram shows that valinomycin, montanastatin, and five analogues are produced by VLM BGC expression in a heterologous host.
Figure 9: Molecular networking results. (A) Molecular networking from Streptomyces sp. CBMAI 2042 extract. Ions corresponding to valinomycin, an already known compound whose BGC is annotated in the Streptomyces sp. CBMAI 2042 genome, are clustered with ions related to analogues first described for the VLM BGC. (B) MS spectra and chemical structures for valinomycin and related analogues are shown.
The strongest advantage of this protocol is its ability to rapidly dereplicate metabolic profiles and bridge genomic information with MS data in order to elucidate the structures of new molecules, especially structural analogues2. Based on genomic information, different natural product chemotypes can be investigated, such as polyketides (PK), nonribosomal peptides (NRP), and glycosylated natural products (GNP), as well as cryptic BGCs. Metabolomic screening provides evidence of activated BGC profiles and of the chemical diversity produced by a specific strain under laboratory conditions. Thus, a BGC can be cloned to direct production of a new compound or of unknown analogues related to an already known BGC, facilitated by similarities discovered by molecular networking. Therefore, this procedure helps to distinguish valuable compounds produced by natural sources and can be used as a guide for the subsequent isolation steps that are common in natural product pipelines.
MS-guided genome mining was first described in the fields of peptidogenomics41 and glycogenomics42. To estimate the extent of peptide natural product chemical diversity, Dorrestein and colleagues developed an automated method using MS and genomics to visualize the connection between expressed natural products (chemotype) and their gene clusters (genotype). The concept of MS-guided genome mining was thus first described for peptide specialized metabolites. Subsequently, in glycogenomics, a GM approach combined with tandem MS was applied as a tool to rapidly connect microbial glycosylated natural product (GNP) chemotypes (from microbial metabolomes) with their corresponding biosynthetic genotypes by following sugar footprints.
The concept of peptidogenomics has been applied to reveal stenothricin gene clusters in Streptomyces roseosporus, providing the first insights into the broad utility of GNPS as a platform43. Pattern-based genome mining and molecular networking were then combined with the GNPS platform26 to facilitate the identification of new compounds, dereplication of known compounds, detection of new analogs, and structure elucidation in a study of 35 Salinispora strains. This led to the isolation and characterization of retimycin A, a quinomycin-type depsipeptide44. Since the introduction of GNPS, integrated metabolomics and genome mining approaches have become the most versatile avenue to connect molecular networks with biosynthetic capabilities45,46,47,48,49,50.
This protocol reinforces the feasibility of using genomic and metabolomic analyses to investigate the production of known and unknown chemically analogous compounds in a few steps while consuming little material. The model presented here concerns the identification of valinomycin analogues from crude extracts through molecular networking dereplication. The structures of the analogues are deduced from MS/MS fragmentation and follow the biosynthetic logic of the cloned VLM BGC.
Different software packages are available for mining secondary metabolite biosynthetic gene clusters51 and for metabolite elucidation, but open source options have the advantage of being constantly updated and open to the scientific community. In this sense, antiSMASH and the GNPS platform are the most popular choices.
This general procedure can be modified for other extraction methodologies depending on the natural source explored. More than one extraction method can also be combined according to metabolite properties (e.g., polarity, hydrophobicity, the capability to form micelles), and even for metabolites with similar properties, different solvents or resins can give better results. Usually, extracts are prepared from cultures grown in liquid medium, but a plethora of extraction methods is available to obtain enriched extracts and to screen any biological sample of interest.
When acquiring MS data, data-dependent acquisition (DDA) should be used. This is important when a large number of compounds are evaluated in a single injection. While performing DDA, the maximum number of MS/MS spectra acquired for each precursor ion and the maximum number of different precursor ions should be balanced. With fast-scanning instruments, this can be achieved with higher scan rates (~6–10 MS/MS scans per cycle). With slower-scanning instruments, however, MN performance can only be improved with better chromatographic resolution. The aim is to obtain the most comprehensive data possible to populate the molecular network. For MS data acquisition, a fixed collision energy is possible, but ramped energies tend to yield improved results. No single set of conditions works perfectly for all samples. Achieving sufficient MS analysis is crucial to the following steps. From there, the molecular network clusters should be generated and dereplicated according to the procedure.
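The balance between the number of different precursors and repeated MS/MS of the same precursor can be illustrated with a toy model of top-N precursor selection. The Python sketch below is not instrument control code: the function name, the `top_n` default, the m/z tolerance, and the simple exclusion list are assumptions (real instruments typically apply a time-based dynamic exclusion, which is not modeled here).

```python
def select_precursors(ms1_peaks, top_n=8, exclusion=None, mz_tol=0.01):
    """Pick up to top_n of the most intense MS1 ions that are not excluded.

    ms1_peaks: [(mz, intensity), ...] from one survey scan.
    exclusion: m/z values fragmented in recent cycles (assumed simple list).
    """
    exclusion = exclusion if exclusion is not None else []
    chosen = []
    for mz, _intensity in sorted(ms1_peaks, key=lambda p: p[1], reverse=True):
        if any(abs(mz - excluded) <= mz_tol for excluded in exclusion):
            continue  # already fragmented recently: leave room for other ions
        chosen.append(mz)
        if len(chosen) == top_n:
            break
    return chosen


# Hypothetical survey scan: two abundant ions and two minor ones
survey = [(500.0, 1000.0), (600.0, 800.0), (700.0, 50.0), (800.0, 10.0)]
cycle_1 = select_precursors(survey, top_n=2)
cycle_2 = select_precursors(survey, top_n=2, exclusion=cycle_1)
```

In this toy example, the second cycle reaches the minor ions only because the abundant ones are excluded; with a larger `top_n` (more MS/MS scans per cycle), the minor ions would already be sampled in the first cycle.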
A frequent problem is missing intensities for some masses. Normally, this can be solved by applying a higher collision energy during analysis. Occasionally, no correlations are observed between the spectra and the GNPS library, although this is very uncommon. In this case, ensure that the data open properly in MS data visualization software, as errors can sometimes be introduced during the conversion to .mzXML files.
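One quick way to check a converted file before uploading it is to confirm that it parses as XML and actually contains MS/MS scans. The following Python sketch uses only the standard library; the function name is an assumption, and the check relies on mzXML storing each spectrum as a `scan` element with an `msLevel` attribute. It does not validate the file against the full mzXML schema.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_scans_by_level(source):
    """Count <scan> elements per msLevel in an mzXML file.

    source: a file path or a binary file-like object.
    Raises xml.etree.ElementTree.ParseError if the XML is truncated or malformed.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        # mzXML elements are namespaced, so compare the local tag name only
        if elem.tag.rsplit("}", 1)[-1] == "scan":
            counts[elem.get("msLevel", "unknown")] += 1
            elem.clear()  # release memory; large files hold many scans
    return counts


# Tiny hand-written stand-in for a converted file (illustrative only)
sample = (b'<mzXML xmlns="http://sashimi.sourceforge.net/schema_revision/mzXML_3.2">'
          b'<msRun><scan num="1" msLevel="1"><scan num="2" msLevel="2"/></scan>'
          b'<scan num="3" msLevel="2"/></msRun></mzXML>')
levels = count_scans_by_level(io.BytesIO(sample))
```

A file that raises a parse error was likely damaged during conversion, and a file with no `msLevel="2"` scans cannot contribute fragmentation spectra to a molecular network.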
Regarding genome mining, gene cluster annotation platforms give the most precise output when supplied with high-quality whole genome sequencing data, for both single-strain and culture-independent mining. High-quality sequencing generates high-quality bioinformatic insights for the dereplication of biosynthetic pathways. However, although bioinformatic software for BGC prediction has been developing rapidly, exact prediction of gene functions and putative products is still difficult, especially when investigating novel biosynthetic pathways and features that cannot be predicted in silico. Also, some biosynthetic machinery is strikingly conserved, while the enzymology involved in hybrid systems, trans-AT modular PKSs, and NRPSs is recognized as an exception to the colinearity rule. In this sense, heterologous expression and refinements in bioinformatic software output can help elucidate unpredictable enzyme functions and unusual biochemistry52,53,54. The enrichment of public databases will lead to more precise predictions and to the discovery of novel specialized metabolites, now that the cost of WGS is no longer a limiting factor for genome mining.
Finally, the strongest advantages of integrated metabolomic and genome mining approaches lie in their ability to perform genotype and chemotype dereplication via automated, high-throughput analysis that links genomic, transcriptomic, and metabolomic data to efficiently connect genes with molecules.
The authors have nothing to disclose.
The financial support for this study was provided by São Paulo Research Foundation – FAPESP (2019/10564-5, 2014/12727-5 and 2014/50249-8 to L.G.O; 2013/12598-8 and 2015/01013-4 to R.S.; and 2019/08853-9 to C.F.F.A). B.S.P, C.F.F.A., and L.G.O. received fellowships from the National Council for Scientific and Technological Development – CNPq (205729/2018-5, 162191/2015-4, and 313492/2017-4). L.G.O. is also grateful for the grant support provided by the program For Women in Science (2008, Brazilian Edition). All authors acknowledge CAPES (Coordination for the Improvement of Higher Education Personnel) for supporting the post-graduation programs in Brazil.
Name | Company | Catalog Number | Comments |
Acetonitrile | Tedia | AA1120-048 | HPLC grade |
Agar | Oxoid | LP0011 | NA |
Apramycin | Sigma Aldrich | A2024 | NA |
Carbenicillin | Sigma Aldrich | C9231 | NA |
Centrifuge | Eppendorf | NA | 5804 |
Chloramphenicol | Sigma Aldrich | C3175 | NA |
Column C18 | Agilent Technologies | NA | ZORBAX RRHD Extend-C18, 80Å, 2.1 x 50 mm, 1.8 µm, 1200 bar pressure limit P/N 757700-902 |
Kanamycin | Sigma Aldrich | K1377 | NA |
Mannitol P.A.- A.C.S. | Synth | NA | NA |
Microcentrifuge | Eppendorf | NA | 5418 |
Nalidixic acid | Sigma Aldrich | N4382 | NA |
Phusion Flash High-Fidelity PCR Master Mix | ThermoFisher Scientific | F548S | NA |
Q-TOF mass spectrometer | Agilent Technologies | NA | 6550 iFunnel Q-TOF LC/MS |
Sucrose P.A.- A.C.S. | Synth | NA | NA |
Shaker/Incubator | Marconi | MA420 | NA |
Sodium Chloride | Synth | NA | P. A. – ACS |
Soy extract | NA | NA | NA |
Sucrose | Synth | NA | P. A. – ACS |
Thermal Cycler | Eppendorf | NA | Mastercycler Nexus Gradient |
Thiostrepton | Sigma Aldrich | T8902 | NA |
Tryptone | Oxoid | LP0042 | NA |
Tryptone Soy Broth | Oxoid | CM0129 | NA |
UPLC | Agilent Technologies | NA | 1290 Infinity LC System |
Yeast extract | Oxoid | LP0021 | NA |