Here, we present phenomic approaches for the functional characterization of putative phage genes. Techniques include a newly developed assay capable of monitoring host anabolic metabolism, the Multi-phenotype Assay Plates (MAPs), in addition to the established method of metabolomics, capable of measuring effects on catabolic metabolism.
Current investigations into phage-host interactions are dependent on extrapolating knowledge from (meta)genomes. Interestingly, 60 – 95% of all phage sequences share no homology to currently annotated proteins. As a result, a large proportion of phage genes are annotated as hypothetical. This reality heavily affects the annotation of both structural and auxiliary metabolic genes. Here we present phenomic methods designed to capture the physiological response(s) of a selected host during expression of one of these unknown phage genes. Multi-phenotype Assay Plates (MAPs) are used to monitor the diversity of host substrate utilization and subsequent biomass formation, while metabolomics provides byproduct analysis by monitoring metabolite abundance and diversity. Both tools are used simultaneously to provide a phenotypic profile associated with expression of a single putative phage open reading frame (ORF). Representative results for both methods are compared, highlighting the phenotypic profile differences of a host carrying either putative structural or metabolic phage genes. In addition, the visualization techniques and high throughput computational pipelines that facilitated experimental analysis are presented.
Viruses that infect bacteria (bacteriophage, or phage) are estimated to exist at more than 10^31 virus-like particles (VLPs) globally and outnumber all other organisms in an environment1,2. The first metagenomic study investigating the viral communities associated with marine environments focused on quantifying the diversity seen within the viral fraction3. Additionally, Breitbart and colleagues found that over 65% of the viral community sequences shared no homology to any sequences available in public databases. Subsequent metagenomic studies found similar evidence: metagenomes from marine sediments in San Diego, California contain 75% unknown viral sequences4; metagenomes from hypersaline lakes of the Salton Sea contain 98% unknown viral sequences5; and coral-associated metagenomes contain 95 – 98% unknown viral sequences6. This accumulation of unannotated information has resulted in phage genetic material being described as “the dark matter of the biological universe”7.
Genomic characterization of phage relies on identifying sequence similarity through comparison against existing nucleic acid and protein databases. Because phage-encoded genetic information is predominantly unknown, homology-based methods are ineffective. Within their genome, phages typically encode three major gene types: transcription and replication genes, metabolic genes, and structural genes. The transcription and replication genes (class I/II genes8) include polymerases, primases, endo/exo-nucleases, and kinases. These genes are highly conserved due to their importance in phage infection, transcribing and replicating phage genetic material. Phage polymerases are readily identified using traditional sequence homology methods due to their global conservation9 and have been shown to serve as effective phylogenetic markers10. In contrast, phage metabolic and structural genes (class II/III genes8) are increasingly divergent and often annotated as hypothetical genes.
Phage metabolic genes affect the metabolic capacity of the host and are not necessarily required for viral replication. These genes, often referred to as auxiliary metabolic genes11 (AMGs), appear to modulate host metabolism and allow optimal progression of infection and success of virion maturation. AMGs have been associated with the utilization and uptake of limiting nutrients or with energy production pathways. Some examples include photosystem genes found in the genomes of various cyanophage12-16, genes connected to and regulated by phosphate metabolism17,18, and utilization of the pentose phosphate pathway for phage dNTP biosynthesis18,19. In comparison, structural genes are among the mid to late genes produced during infection and vary across different phage-host systems. The production of structural proteins is dependent on the availability of viral dNTP and energy pools for their transcription, translation, and assembly8. The capsid and tail fiber structural proteins are deemed the most divergent of all viral protein-encoding genes and are required for successful virion production. Their divergence is typically attributed to the active role they play in shaping virus-host coevolution20. Divergent proteins, regardless of the gene class, are readily overlooked when using traditional homology and sequence alignment techniques. An effort to correct for the limitations seen with strict sequence comparisons has resulted in bioinformatics tools capable of using sequence characteristics to determine association, such as artificial neural nets21. Artificial neural nets (ANNs) allow for the prediction of structural and metabolic genes but require downstream experimental validation to directly characterize gene function.
The objective of this manuscript is to provide phenomic protocols capable of monitoring both catabolic and anabolic metabolism of a host bacterium during the expression of a novel phage gene, functionally predicted through ANNs. The field of phenomics, the biology associated with cellular phenotypes, is well established in systems biology to aid in the investigation of proteins with unknown or pleiotropic function. Phenomic tools are used to link phenotypic information to genotypic information. We hypothesize for putative phage genes that their function(s) can be determined through observing host physiological effects during phage gene expression. To investigate this hypothesis, two quantitative methods were chosen. Multi-phenotype Assay Plates (MAPs) were used to monitor host substrate utilization and the subsequent biomass formation while metabolomics measured host metabolite diversity and relative abundance during growth in specific environmental conditions. Putative structural and metabolic proteins were overexpressed in Escherichia coli and representative results from both experiments are compared. Numerous visual techniques and high throughput processing pipelines are presented to facilitate experimental replication. Lastly, the reproducibility and accuracy of the presented methods are discussed in the context of expected physiological effects for an annotated capsid protein and phage metabolic protein, thioredoxin, plus two putative AMGs.
1. Preparation of Multi-phenotype Assay Plate (MAP) Substrates, Basal Media, Pre-growth Media, and Buffer
2. Preparation of Bacterial Cell-suspension
3. Preparation of Multi-phenotype Assay Plates (MAPs)
4. Multi-phenotype Assay Plates (MAPs) Processing and Parameterization
5. Phenotype Analysis of MAPs
6. Building the Continuous Culture Apparatus
7. Running Continuous Cultures for Metabolomics
8. Running Serial Passage Batch Cultures for Metabolomics
9. Metabolite Analysis & Processing
All samples used to determine the open reading frames (ORFs) selected in this study were collected from Starbuck Island, Site 7 (STAR7) and Caroline Atoll, Site 9 (CAR9) of the southern Line Islands. An estimated 100 L of seawater from these sites was collected below the coral boundary layer using bilge pumps, as described previously5. Contents of the pumps were subject to fractionation through large-pored filters to remove small eukaryotes and subsequently concentrated using 100 kDa tangential flow filters, leaving only microbes and virus-like particles (VLPs). To separate the VLPs, the remaining seawater was passed through 0.45 µm filters, resulting in the virome. Chloroform was introduced to this viral fraction to arrest growth of any remaining cells, and the fraction was stored at 4 °C.
VLPs were purified using the cesium chloride method, in which density gradients separate through centrifugation and allow for the recovery of virions at ~1.35 g/ml to 1.5 g/ml 3. Viral DNA was extracted using a CTAB/phenol:chloroform protocol and amplified via multiple displacement amplification using Phi29 reagents. Sequencing of the virome was accomplished with commercially available pyrosequencing technology.
Bioinformatics used in the processing and selection of viral ORFs for this study are as follows. Three pre-processing steps were used on the CAR9 and STAR7 viral metagenomes. First, public software was used to remove tag sequences that resulted from amplification of the viral DNA prior to sequencing27. Second, common sequencing artifacts such as sequence duplicates and low copy number were filtered out of the dataset via an additional bioinformatics program28. Lastly, removal of foreign sequence contamination29 was performed for those sequences that had ≥ 90% coverage and ≥ 94% identity with sequences in the following databases: RefSeq virus genomes; Human — Reference GRCh37; Human — Celera Genomics; Human — Craig Venter (HuRef); Human — Seong-Jin Kim (Korean); Human — Chromosome 7 version 2 (TCAG); and Human — James Watson, YanHuang (YH; Asian), Yoruba (NA18507; African) reference sequences21. Following these processes, sequences from the CAR9 samples totaled 591,600 and STAR7 sequences totaled 939,311. These sequences were uploaded onto MGRAST and assembled via assembler software using default settings. Contigs were translated into 6 reading frames and putative open reading frames (pORFs) were identified using scripts, as previously described21.
To identify unknown ORFs a number of similarity based searches were performed to remove ORFs of known function. Briefly, the following searches were performed with their corresponding searching criteria21:
The resulting pORFs were designed for expression in E. coli using publicly available gene design software. Back-translation of the amino-acid sequences employed a Universal Codon Usage Table designed to accommodate expression in E. coli with a minimum usage threshold of 2%. Restriction enzyme recognition sequences for BamHI and HindIII were excluded from the sequences to facilitate cloning. An outside company synthesized the engineered gene sequences30 and then ORFs were cloned into a medium-copy number pBAD promoter vector, pEMB11, via standard restriction enzyme cloning. All clones were transformed into E. coli K-12 strain BW 2778423.
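The exclusion of BamHI and HindIII recognition sequences can be checked with a short script. The following is a minimal sketch, not the gene design software used in the study: the function name and the two example sequences are hypothetical, and only the recognition motifs themselves (GGATCC for BamHI, AAGCTT for HindIII) are standard.

```python
# Recognition sequences of the two cloning enzymes named in the text.
# Both motifs are palindromic, so scanning one strand is sufficient.
RESTRICTION_SITES = {"BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_restriction_sites(dna, sites=RESTRICTION_SITES):
    """Return {enzyme: [0-based start positions]} for each motif present in dna."""
    dna = dna.upper()
    hits = {}
    for enzyme, motif in sites.items():
        positions = []
        start = dna.find(motif)
        while start != -1:
            positions.append(start)
            start = dna.find(motif, start + 1)  # step by one to catch overlaps
        if positions:
            hits[enzyme] = positions
    return hits


# Hypothetical sequences, for illustration only.
clean_orf = "ATGGCTAAAGGTGAACTGTAA"
flagged_orf = "ATGGGATCCAAAGCTTTAA"

print(find_restriction_sites(clean_orf))
print(find_restriction_sites(flagged_orf))
```

A back-translated sequence that returns an empty dictionary carries neither site; any reported position marks a codon choice that would need to be swapped for a synonymous one before synthesis.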
Multi-phenotype Assay Plates (MAPs)
A high throughput and robust software pipeline was leveraged for analysis of the MAPs, the PMAnalyzer24. The pipeline was developed in a Linux server environment and performs various steps including: parsing of optical density files, formatting data into readable text files, pre-processing of growth curves for quality assurance (QA), and performing mathematical modeling techniques to analyze growth curves. The primary modeling scripts were developed in Python version 2.7.5 to make use of the PyLab module.
MAPs reproducibility was assessed using the standard error (SE) for replicate data (Figure 2A). Raw growth curves were compared to logistic growth curves to determine whether the program PMAnalyzer accurately parameterized and modeled clone growth during experimentation (data not shown). For further information on the accuracy and validity of the MAPs and PMAnalyzer see Cuevas et al.24
Following validation of the method, MAPs data was analyzed using the multiple parameters, such as maximum growth rate (µmax) and growth level (GL), provided by the processing pipeline. Comparative visualization of growth curves is often used for growth data interpretation; however, only a limited number of curves can be compared in a single plot. To analyze numerous growth curves simultaneously, heat map derived plots were employed to compare tens of clones grown on a single substrate against that condition's average response (Figure 2B). The influence of overexpressing a novel phage protein is observed through changes in curve parameters, specifically: lag phase, exponential phase, and the maximum biomass yield (asymptote). As an example, the steep climb from lag phase into exponential phase modeled in the growth curve for the Capsid protein (Figure 2A) is reproduced by a quick change in color intensity from black to white for the same clone in the dynamic plot of Figure 2B.
To obtain a global picture of clone distribution across substrates, phenotypic classifications derived from the GL were used (Figure 3). Here, the four phenotypes are separated into four charts where the height of each bar represents the number of clones displaying that phenotype for a specific substrate. Outliers in the data are recognized as clones falling into the “gain of function” or “loss of function” category. Outliers can then be sought out individually and investigated more closely experimentally. In addition, global analysis recognizes substrate biases in the assay. Substrates such as phenylalanine, malic acid, and glycine resulted in a “no growth” classification. Substrates that consistently fall into the no growth classification, across all clones, are not heavily weighted in downstream functional characterization.
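The classification step can be illustrated with a short sketch. The exact decision rule is not spelled out in this section, so the logic below is an assumption built from two details given elsewhere in the text: the growth cutoff of GL ≥ 0.4 and the two-standard-deviation bounds drawn in Figure 1C. The function name, the label "growth" for the fourth phenotype, and the example GL values are all hypothetical.

```python
import statistics

GROWTH_CUTOFF = 0.4  # GL cutoff for growth reported for this study


def classify_substrate(gl_by_clone, cutoff=GROWTH_CUTOFF, n_sd=2.0):
    """Assign one phenotype label per clone for a single substrate.

    Assumed rule: clones more than n_sd sample standard deviations from
    the substrate mean are gain/loss of function; the remaining clones
    are split into growth / no growth by the GL cutoff.
    """
    values = list(gl_by_clone.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    labels = {}
    for clone, gl in gl_by_clone.items():
        if sd > 0 and gl > mean + n_sd * sd:
            labels[clone] = "gain of function"
        elif sd > 0 and gl < mean - n_sd * sd:
            labels[clone] = "loss of function"
        elif gl >= cutoff:
            labels[clone] = "growth"
        else:
            labels[clone] = "no growth"
    return labels


# Invented GL values for ten clones on one substrate.
example = {"clone%d" % i: gl for i, gl in enumerate(
    [0.78, 0.80, 0.82, 0.79, 0.81, 0.80, 0.80, 0.79, 0.81, 1.60], start=1)}

for clone, label in classify_substrate(example).items():
    print(clone, label)
```

Counting the labels per substrate across all clones would give bar heights of the kind summarized in Figure 3.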
Metabolomics
The catabolic products from clones expressing unknown phage genes were identified using metabolomics. Clones were grown under either a continuous culture or a serial passage batch culture environment prior to being sent out for GC-TOFMS analysis at a metabolomics core facility. For details on sample processing, analysis, and normalization for GC-TOFMS implemented by the chosen core facility see Fiehn et al.31 Briefly, 1 ml of cold extraction solvent is added to each sample, after which samples are vortexed and sonicated in a cold bath for 5 min. Samples are finally centrifuged and half of the sample is decanted and dried down for analysis. Extracts are purified and spiked with internal retention index markers before being loaded onto the gas chromatograph and then subsequently transferred to the mass spectrometer. Data from each sample is analyzed such that signal intensities for all detected signals in the chromatogram are reported. For normalization, the abundance of peaks for each sample is summed and the total peak abundances are averaged between all samples in the set. Metabolite abundances per sample are divided by the sample's total peak abundance and then multiplied by the average total peak abundance of the sample set. The resulting data is used for the analysis of metabolomics in the research discussed.
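The normalization described above amounts to rescaling every sample so that its total peak abundance equals the average total of the sample set. A minimal sketch follows; the function name, data layout, and example values are illustrative assumptions and not the core facility's code.

```python
def normalize_to_average_total(samples):
    """Scale each sample so its summed peak abundance equals the set average.

    samples: {sample_name: {metabolite: raw_abundance}}
    """
    totals = {name: sum(peaks.values()) for name, peaks in samples.items()}
    average_total = sum(totals.values()) / len(totals)
    normalized = {}
    for name, peaks in samples.items():
        if totals[name] == 0:
            raise ValueError("sample %r has no signal to normalize" % name)
        # Divide by the sample's own total, multiply by the set average.
        scale = average_total / totals[name]
        normalized[name] = {m: value * scale for m, value in peaks.items()}
    return normalized


# Invented abundances for two samples and two metabolites.
raw = {
    "sample_A": {"putrescine": 100.0, "uracil": 300.0},
    "sample_B": {"putrescine": 300.0, "uracil": 300.0},
}
for name, peaks in normalize_to_average_total(raw).items():
    print(name, peaks)
```

After scaling, every sample has the same total abundance, so differences between samples reflect relative metabolite composition rather than the amount of material loaded.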
Validation of metabolomics reproducibility was required to determine the appropriate sample size for each culturing method. To assess the precision within samples and the variation across sample sizes, the standard error of the mean (SM) was reviewed for both the n = 3 and n = 6 datasets (Figure 5B). Regardless of the continuous culture (CC) sample size, less than 1% of the data had an SM ≤ 1.5. Median SMs were 221 and 300, and values ranged from 0 to 7.55 × 10^5 and from 0 to 3.74 × 10^5, for n = 3 and n = 6 respectively. SMs were also calculated for each set of sample replicates in the serial culture (SC) method. Again, less than 1% of the data had an SM ≤ 1.5; the median SM was 137, and values ranged from 0 to 3.51 × 10^5. To compare the SM distributions between each sample set (CC n = 3 vs. CC n = 6, CC n = 3 vs. SC n = 3, and CC n = 6 vs. SC n = 3) a permutation test was performed. The distribution of either continuous culture SM values dataset was not significantly different from the distribution of the serial culture SM values (p-value = 0.0). However, the distribution of SM values for continuous culture n = 3 data was significantly different from that of the SM values for the continuous culture n = 6 data (p-value = 1.908804 × 10^-49). Lastly, the coefficient of variation per metabolite was compared before and after implementation of a quality assurance (QA) step (Figure 5C). In total, 210 metabolites were removed after implementation of the QA pipeline (40% of the data). Less than 1% of the removed data had a metabolite abundance of zero, ~2% was internal standard data, ~5% was data from metabolites never before observed in E. coli, and the remaining metabolites (> 30%) had a coefficient of variation greater than 1.
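The two replicate statistics used here, the standard error of the mean and the coefficient of variation, can be computed as in the following sketch. The function names are hypothetical, and the QA check covers only the two criteria that can be evaluated from abundances alone (zero abundance and a coefficient of variation greater than 1); removing internal standards or metabolites not previously observed in E. coli would require annotation data that is not modeled here.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def coefficient_of_variation(values):
    """Sample SD divided by the mean; infinite when the mean is zero."""
    mean = statistics.mean(values)
    if mean == 0:
        return float("inf")
    return statistics.stdev(values) / mean


def passes_qa(values, max_cv=1.0):
    """Keep a metabolite only if it has signal and its CV is at most max_cv."""
    if all(v == 0 for v in values):
        return False
    return coefficient_of_variation(values) <= max_cv


# Invented replicate abundances (n = 3) for three metabolites.
replicates = {
    "stable": [10.0, 12.0, 14.0],
    "noisy": [0.0, 0.0, 30.0],
    "absent": [0.0, 0.0, 0.0],
}
for name, values in replicates.items():
    print(name, "keep" if passes_qa(values) else "remove")
```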
As with the MAPs analysis, global observations provided an initial understanding of the depth of information metabolomics offers. To obtain a global picture, clones were hierarchically clustered based on their relative metabolite abundances, providing information on clone-metabolite profiles, potential clones with related functions, and clone-metabolite outliers (Figure 6). To highlight protein functions, metabolites are separated and grouped based on common metabolic pathways. Using this analysis with preliminary results, it was evident that metabolomics is capable of separating genes from different classes (Figure 6, highlighted clones). In addition, outliers in the metabolomics data were identified by calculating standard scores (z scores) for each clone-metabolite pair. To ensure statistical significance, outliers were defined as clone-metabolite pairs with a z score magnitude of 2 or greater, accounting for only 5 percent of the data (data not shown).
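Outlier identification by standard score can be sketched as follows. Two details are assumptions, because the text does not state them: z scores are computed per metabolite across clones, and the population standard deviation is used. The threshold is applied to the absolute z score. The function name and the abundances are invented for illustration.

```python
import statistics


def zscore_outliers(abundance, threshold=2.0):
    """Return (clone, metabolite, z) for pairs with |z| >= threshold.

    abundance: {metabolite: {clone: value}}
    """
    outliers = []
    for metabolite, by_clone in abundance.items():
        values = list(by_clone.values())
        mean = statistics.mean(values)
        sd = statistics.pstdev(values)  # population SD (assumed choice)
        if sd == 0:
            continue  # no variation across clones, so no outliers
        for clone, value in by_clone.items():
            z = (value - mean) / sd
            if abs(z) >= threshold:
                outliers.append((clone, metabolite, z))
    return outliers


# Invented data: nine clones at 100 and one clone at 400.
data = {"putrescine": {"clone%d" % i: 100.0 for i in range(1, 10)}}
data["putrescine"]["clone10"] = 400.0

for clone, metabolite, z in zscore_outliers(data):
    print(clone, metabolite, round(z, 2))
```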
Figure 1. Definitions of phenotype classifications. (A) Relationship between the growth level (GL) and maximum growth rate. Data points circled in red represent growth curves showing little to no substrate utilization. (B) Boxplot representation defining growth threshold based on the distribution of growth curves with a minimal growth rate (< 0.15 OD/hr). (C) The variance and standard deviation of GL is calculated for substrate D-galactose. Short dashed lines represent two standard deviations away from the mean.
Figure 2. MAPs validation via precision and differentiation. (A) Growth curves for annotated structural (Capsid) and metabolic clones (Thioredoxin), two novel metabolic clones (EDT2440, EDT2441), and the average response of clones grown on sucrose, D-galactose and D-mannose in the MAPs. Blue lines indicate the standard error seen between replicate data (n = 3). (B) The growth curves for 47 different clones are represented as heat maps for sucrose, D-galactose and D-mannose. The annotated structural (green circle) and metabolic clones (orange circle), two novel metabolic clones (dark and light blue circles), and the average response (red circle) are highlighted.
Figure 3. Clone distribution for each phenotype across multiple substrates. The phenotype-clone count for 47 clones across 72 carbon-specific growth conditions. Table provides direct counts for each phenotype.
Figure 4. Diagram detailing the construction of the continuous culture apparatus. (A) Steps used to build ports α-γ of the continuous culture reactor, (B) steps used to build the out flow port of the continuous culture reactor, and (C) the steps to build ports δ and ε of the continuous culture feeding bottle.
Figure 5. Comparison of the phenomic methods presented. (A) Workflow for the preparation of the Multi-phenotype Assay Plates (MAPs), continuous cultures and serial cultures. (B) The percentage of Standard error of mean (SM) counts for both n = 3 and n = 6 sample sizes for the continuous culture (CC) and serial culture (SC) preparation methods for metabolomics. The y-axis is on a log scale. (C) The distributions of coefficients of variation (CV) per metabolite, before and after implementation of the QA pipeline for the continuous culture (CC) method.
Figure 6. Metabolomic profiles of clones grown in continuous culture. The median metabolite abundances for a set of metabolites are plotted for 84 clones grown in continuous cultures. Metabolite profiles for annotated structural (Capsid) and metabolic clones (Thioredoxin), two novel metabolic clones (EDT2440, EDT2441), and the average metabolic response are highlighted in red.
Compound | Carbon | Nitrogen | Sulfur | Phosphorus |
Glycerol | − | 0.40% | 0.40% | 0.40% |
Ammonium chloride | 9.5 mM | − | 9.5 mM | 9.5 mM |
Sodium sulfate | 0.250 mM | 0.250 mM | − | 0.250 mM |
Magnesium sulfate | 1.0 mM | 1.0 mM | − | 1.0 mM |
Potassium phosphate | 1.32 mM | 1.32 mM | 1.32 mM | − |
Magnesium chloride | − | − | * | − |
Potassium chloride | 10 mM | 10 mM | 10 mM | 10 mM |
Calcium chloride | 0.5 µM | 0.5 µM | 0.5 µM | 0.5 µM |
Sodium chloride | 5 mM | 5 mM | 5 mM | 5 mM |
Ferric chloride | 6 µM | 6 µM | 6 µM | 6 µM |
L-arabinose | 0.10% | 0.10% | 0.10% | 0.10% |
MOPS pH 7.4 | 1x | 1x | 1x | 1x |
Table 1. The compounds and concentrations of the different basal media used in the MAPs. *1.0 mM of magnesium chloride is substituted. 1x MOPS = 40 mM MOPS, 4 mM Tricine.
Carbon substrates | Nitrogen substrates | Sulfur substrates | Phosphorus substrates |
2-deoxy-D-ribose | 2-deoxy-D-ribose | 1-butane-sulfonic acid | adenosine-5-monophosphate |
4-hydroxy-phenyl acetic acid | acetamide | acetyl cysteine | beta-glycerophosphate |
acetic acid | adenine | D-cysteine | creatine phosphate |
adenosine-5-monophosphate | adenosine | D-methionine | D-glucose-6-phosphate |
adonitol | allantoin | diethyl-dithiophosphate | diethyl-dithiophosphate |
alpha-D-glucose | beta-phenylethylamine | DL-ethionine | DL-alpha-glycerophosphate |
alpha-D-lactose | biuret | glutathione | potassium phosphate |
alpha-D-melibiose | cytidine | isethionic acid | sodium pyrophosphate |
citric acid | cytosine | L-cysteic acid | sodium thiophosphate |
D-alanine | D-alanine | L-Cysteine | |
D-arabinose | D-asparagine | L-djenkolic acid | |
D-arabitol | D-aspartate | L-methionine | |
D-asparagine | D-cysteine | magnesium sulfate | |
D-aspartate | D-glucosamine | methane sulfonic acid | |
D-cellobiose | D-glutamic acid | N-acetyl-DL-methionine | |
D-cysteine | DL-alpha-amino-N-butyric acid | N-acetyl-L-cysteine | |
D-fructose | D-methionine | potassium-tetra-thionate | |
D-galactose | D-serine | sodium thiosulfate | |
D-glucosamine | D-valine | sulfanic acid | |
D-glucose | gamma-amino-N-butyric acid | taurine | |
D-glucose-6-phosphate | glycine | taurocholic acid | |
D-glutamate | guanidine | thiourea | |
D-mannose | histamine | ||
D-raffinose | inosine | ||
D-ribose | L-alanine | ||
D-salicin | L-arginine | ||
D-serine | L-asparagine | ||
D-trehalose | L-citrulline | ||
D-xylose | L-cysteine | ||
dulcitol | L-glutamic acid | ||
glycerol | L-glutamine | ||
glycine | L-glutathione | ||
i-erythritol | L-histidine | ||
inosine | L-isoleucine | ||
L-alanine | L-leucine | ||
L-arabinose | L-lysine | ||
L-arabitol | L-methionine | ||
L-asparagine | L-ornithine | ||
L-aspartate | L-phenyl-alanine | ||
L-cysteic acid | L-proline | ||
L-cysteine | L-pyro-glutamic acid | ||
L-fucose | L-serine; L-threonine | ||
L-glutamic Acid | L-tryptophan | ||
L-glutamine | L-valine | ||
L-isoleucine | N-acetyl-D-glucosamine | ||
L-leucine | putrescine | ||
L-lysine | thiourea | ||
L-methionine | thymidine | ||
L-phenylalanine | thymine | ||
L-pyro-glutamic acid | tyramine | ||
L-rhamnose | tyrosine | ||
L-serine | uridine | ||
L-sorbose | |||
L-threonine | |||
L-tryptophan | |||
L-valine | |||
L-xylose | |||
lactate | |||
lactulose | |||
malate | |||
myo-inositol | |||
oxalic acid | |||
potassium sorbate | |||
propionic acid | |||
putrescine | |||
quinic acid | |||
sodium pyruvate | |||
sodium succinate | |||
sucrose | |||
thymidine | |||
xylitol |
Table 2. List of substrates used in the MAP experiments.
Here, we present phenomic approaches for the functional characterization of putative phage genes. Techniques include a newly developed assay capable of monitoring host anabolic metabolism, the Multi-phenotype Assay Plates (MAPs), in addition to the established method of metabolomics, capable of measuring effects on catabolic metabolism. We provide additional tools to manage the large data sets resulting from these technologies, allowing for high throughput processing and analysis24. Lastly, through the comparison of an annotated phage capsid protein, phage thioredoxin, two putative metabolic phage genes, and the average experimental response, we propose various strategies to interpret both datasets and gene classes, with emphasis on identification of phenotypic trends and identification of outliers.
As mentioned, each approach quantitatively measures only half of host metabolism. To interpret the relative function of any of the novel proteins under investigation, data from both methods are required to provide evidence of function. While this is not a focus of our current manuscript, data outputs from each phenomic method are put through combinatory analyses that focus on clustering techniques such as random forest and principal component analysis. Furthermore, hypotheses resulting from the combined analysis must be subsequently validated by traditional genetic methodologies.
Finally, the methods presented are heavily influenced by bacterial physiology and therefore follow the same standards. When undertaking either method, care must be taken to ensure that independent, clonal groups are used; contamination is prevented; a single variable is tested; and appropriate controls are run simultaneously. Failure to account for these points will produce unclear results, as with any physiological assay.
Multi-phenotype Assay Plates (MAPs)
The development of MAPs provides a high throughput and adaptable assay compared to technologies currently available (Figure 5A and Tables 1,2). The assay uses supplies, equipment, and fundamental techniques available in all microbiology labs. The incorporation of a computational pipeline, PMAnalyzer24, for subsequent data processing and analysis ensures rapid data interpretation. In addition, both experimental and analytical aspects of the approach can be readily adjusted or tuned for customized purposes. For example, if a large proportion of the data fails to pass filtering outlined in section 4, one can manually sift through the growth curves to identify issues. If the problem arises due to stringent filter parameters, adjustments to the script can be made. Alternatively, if problems are associated with the experimental process (e.g., prolonged condensation or improper transferring of bacterial cells), then additional replicates can be readily repeated.
As described in Cuevas et al.24, the PMAnalyzer is a single bash program written as a wrapper script that executes the parsing and analysis scripts as a cohesive, automated pipeline. All scripts are freely accessible from a Git repository25. The pipeline takes the median value for each time point across triplicate data and subsequently parameterizes the logistic curve to obtain the lag time, maximum growth rate, asymptote, and a novel term, Growth Level. The median value was chosen over the mean in our study to reduce the effect of large outliers; however, the script can be readily adapted to calculate the mean of replicate data. Due to the reduced variation (SE) seen across replicate data (Figure 2A), we maintained the use of the median in the PMAnalyzer for fitting a logistic curve. Additionally, the cutoff for growth in this study (GL ≥ 0.4) was determined by comparing how data separated across Growth Level and maximum growth rate (Figure 1A,B). Depending on the instruments and model system used, this value may vary, requiring redefinition of the cutoff.
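The two processing steps named above, collapsing replicates to a per-time-point median and describing the result with a logistic curve, can be sketched in a few lines. This is not the PMAnalyzer code. The logistic form shown is a commonly used lag/rate/asymptote parameterization (Zwietering-style) and is an assumption about the model family; the function names, the starting-OD term y0, and the example readings are hypothetical. Fitting the parameters to data would additionally require an optimizer, which is omitted here.

```python
import math
import statistics


def median_curve(replicates):
    """Collapse replicate OD curves to the per-time-point median."""
    if len({len(curve) for curve in replicates}) != 1:
        raise ValueError("replicate curves must share the same time points")
    return [statistics.median(point) for point in zip(*replicates)]


def logistic_od(t, y0, lag, mu_max, asymptote):
    """Logistic growth curve expressed in lag time, max rate, and asymptote.

    The maximum slope equals mu_max, and the tangent at the inflection
    point crosses the starting OD (y0) at t = lag.
    """
    amplitude = asymptote - y0
    exponent = (4.0 * mu_max / amplitude) * (lag - t) + 2.0
    return y0 + amplitude / (1.0 + math.exp(exponent))


# Invented triplicate OD readings at three time points.
triplicate = [
    [0.10, 0.20, 0.50],
    [0.12, 0.18, 0.90],  # one replicate with a large outlier
    [0.11, 0.25, 0.55],
]
print(median_curve(triplicate))

# Evaluate the model over 0-10 h for one hypothetical parameter set.
print([round(logistic_od(t, 0.1, 2.0, 0.3, 1.0), 3) for t in range(0, 11, 2)])
```

In the example, the outlying reading of 0.90 does not affect the median at that time point, which is the behavior the text cites as the reason for preferring the median over the mean.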
A major advantage of our assay is the ability to compare phenotypes using a single parameter characterizing overall microbial growth, which we define as Growth Level (GL). GL is a harmonic mean, and therefore mitigates the effects of large outliers in the data. The use of a harmonic mean with shifted logistic-fitted values to provide a summary of growth was arrived at through trial and error. Other methods attempted to differentiate growth included: the time taken to reach specific curve parameters (half µmax, µmax, and carrying capacity), the coefficient of determination (R^2), and combinations of the R^2 multiplied by specific curve parameters. Using a harmonic mean with shifted logistic-fit values for the GL provided the greatest range in evaluating growth, thus it became the method of choice. One consideration to note is that dynamic growth curve patterns have the potential of being lost when using a single parameter or a fitted model. For instance, the individual curve parameters of the logistic curve and the GL are incapable of representing biphasic growth. In a single carbon environment, this effect on growth implies mediation of the viral protein on either conversion of the substrate or shift in substrate utilization. Additional effects potentially lost when not considering multiple growth parameters include: prolonged lag time, proposing an increased burden of viral machinery or products; rapidly accelerating exponential phase, suggesting viral proteins coupled to host energy production pathways; or higher levels of biomass formation, implying viral support in host nutrient uptake and anabolism (data not shown). Thus, plotting nascent growth curves (Figure 2A,B) provides information regarding trends over time whereas the GL takes into account the major variables of the logistic model, providing a single quantitative number to represent overall success of a clone.
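The idea of summarizing a fitted curve with a harmonic mean of shifted values can be illustrated as follows. The exact GL formula is not given in this section, so the details here are assumptions: the fitted values are shifted by a constant to keep them positive, the harmonic mean is taken, and the shift is subtracted again. The function name and the default shift of 1.0 are hypothetical.

```python
def growth_level(fitted_values, shift=1.0):
    """Harmonic mean of shifted logistic-fitted OD values, minus the shift.

    The shift keeps every term positive so the harmonic mean is defined;
    its value and its later subtraction are assumptions of this sketch.
    """
    if not fitted_values:
        raise ValueError("need at least one fitted value")
    shifted = [value + shift for value in fitted_values]
    if any(s <= 0 for s in shifted):
        raise ValueError("shift too small to make all values positive")
    harmonic = len(shifted) / sum(1.0 / s for s in shifted)
    return harmonic - shift


# Invented fitted OD values for a curve with growth and one without.
growing = [0.1, 0.5, 0.9]
flat = [0.1, 0.1, 0.1]

print(round(growth_level(growing), 3))
print(round(growth_level(flat), 3))
```

Because a harmonic mean never exceeds the arithmetic mean of the same values, a few large fitted values pull this summary up less than they would pull up a simple average, consistent with the outlier-dampening behavior described in the text.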
When considering the different responses contributed by structural and metabolic genes in the MAPs, it is observed that the different substrate classes in question provide the greatest evidence for protein function. For example, metabolic proteins are often associated with acquisition of limiting nutrients, which are unspecific to host central metabolism16,32. Preliminary MAP experiments reveal that clones harboring putative metabolic phage genes have an increased lag phase when grown on central metabolism carbon sources (Figure 2A). Conversely, clones carrying putative structural genes, which require large proportions of host energy and dNTP pools, result in a false positive response on growth for central and amino acid metabolism carbon substrates. This is likely due to the accumulation of insoluble proteins resulting in host filamentation and/or inclusion bodies, as observed via microscopy (Figure 2A and data not shown). While further analysis is required to validate these preliminary results, the MAPs are capable of retrieving phenotypic responses that correlate to hypothesized functions of specific phage gene classes.
In addition to the elucidation of unknown viral proteins, the MAPs are a novel resource to investigate the functional and metabolic diversity of an individual bacterium or a community of bacteria. MAP components are designed for easy alteration to support the growth of a range of bacteria, including marine, auxotrophic, and anaerobic microbes. To facilitate these efforts, the defined basal and pre-growth media require additional or adjusted chemical species before a different bacterial genus can be supported in the MAPs. One requirement in this use of the MAPs is to maintain defined media, which prohibits the use of ingredients such as tryptone, yeast extract, and peptone.
Metabolomics
The field of metabolomics is dependent on metabolite databases, which include isolated metabolites identified by mass spectrometry. The core facility chosen here has one of the largest metabolomics databases. Interestingly, more than half of the metabolites resulting from our experiments were unidentifiable (~65%), while others had never before been recorded in our host, Escherichia coli (examples include indole-3-acetic acid33, salicylic acid34, and dihydroabietic acid35). This could be attributed either to a strong bias of the database towards plant metabolites or to the specific proteins under investigation. Regardless, the result is a limited number of known metabolites available for data representation and analysis. In the future, multiple metabolomics methods using various databases would allow for greater metabolite coverage.
Presently, both known and unknown metabolites are used when comparing and contrasting our novel viral proteins. Using this approach, we hypothesize that clones harboring functionally similar proteins will share an increased similarity in their complete metabolomic profile. Preliminary metabolomics analysis revealed that while structural and metabolic genes do not clearly separate from one another, those genes exhibiting similar effects on the host when overexpressed do correlate (Figure 6). For example, the annotated Capsid gene clusters closely with the putative metabolic genes highlighted in this study, EDT2440 and EDT2441. Investigations using a publicly available transmembrane topology and signal peptide predictor program showed evidence that both putative metabolic genes harbor a single transmembrane domain. Interestingly 5 out of the 9 clones in the first cluster group (most left portion of the dendrogram) have predicted transmembrane domains using the same topology program. Further investigations are needed, however, it is likely that the metabolites present during the overexpression of these clones are associated with cellular stress response resulting from membrane or structural burdens. This evidence supports that while the metabolomics data possesses an increased amount of noise, the method is capable of highlighting signals that differentiate general effects of genes, both within and across a gene class. To determine whether the method is capable of extracting out specific information of gene function, metabolites were grouped into specific metabolic pathways. The hypothesis being, if a clone affects metabolites specific to a single pathway, then the overexpressed gene is active in that pathway. Prior to the establishment of our metabolomics quality assurance pipeline, preliminary data revealed that over and underrepresented metabolites were typically “unknown”, providing little information on the pathways they are associated with (data not shown). 
Preprocessed metabolomics data, however, reveals that the majority of the metabolite profiles are similar and only a select number of unknown and known metabolite abundances vary across clones, for instance putrescine and uracil (Figure 6). To provide greater resolution of protein function efforts are being made to experimentally compare the novel phage genes against known phage genes, which can be used to fill in the “holes” of metabolite based functional characterization. Using this technique, the assigned function of known viral genes provides a reference for the function of the unknown genes. Nonetheless, the limiting factor of metabolomic analysis is the size and relevance of the database. To correct for these limitations, metabolomic databases relatable to this research need to be developed; such as a database of metabolites and their abundances specific to the ASKA collection of E. coli clones in which a single ORF is overexpressed36. Evidence for the need of such databases was provided in 2013 when researchers at the Lawerence Berkeley National Laboratory compiled the first comprehensive database of metabolites specific to entire mutant libraries of model bacteria37. This research provided novel insight into genes required for utilization of specific metabolites, revealing the clear connection between phenotype and genotype.
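The idea of using known phage genes as references for unknown ones can be sketched as a nearest-profile lookup. This is a minimal illustration rather than the authors' pipeline. It assumes each clone is represented by a list of normalized metabolite abundances in a shared metabolite order, and it uses Pearson correlation as the similarity measure. The function names, clone labels, and values below are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def closest_reference(unknown_profiles, reference_profiles):
    """Map each unknown clone to its most correlated reference clone.

    Both arguments are dicts of {clone_name: [abundance, ...]} that share
    one metabolite order. Returns {unknown: (reference, correlation)}.
    """
    matches = {}
    for name, profile in unknown_profiles.items():
        scores = {
            ref: pearson(profile, ref_profile)
            for ref, ref_profile in reference_profiles.items()
        }
        best = max(scores, key=scores.get)
        matches[name] = (best, scores[best])
    return matches


# Made-up profiles over four metabolites, for illustration only.
references = {
    "known_structural": [1.0, 2.0, 3.0, 4.0],
    "known_metabolic": [4.0, 3.0, 2.0, 1.0],
}
unknowns = {
    "unknown_orf_a": [2.0, 4.0, 6.0, 9.0],
    "unknown_orf_b": [9.0, 5.0, 4.0, 1.0],
}
matches = closest_reference(unknowns, references)
```

A high correlation with a reference clone only suggests a similar effect on the host. As the clustering results above indicate, clones can group by shared burden on the cell (for example, membrane stress) rather than by shared function.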
When considering metabolomics as a tool, it is important to define the processing regime followed at the core facility. An artifact of most experimental procedures is the day-to-day variance associated with the instruments of use. To date all GC-MS analysis implements the use of internal standards that are included in each analytical run; however, addition of project specific internal samples ran each day of experimentation removes additional variance. These considerations must be addressed early to avoid normalization problems and biases. Another solution is to process all samples at a core facility on the same machine and as a single batch, an option available at any core facility.
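The effect of a project-specific reference sample run on each day can be sketched as follows. This is one simple way to apply such a sample, assuming the same reference material is measured every day and that day-to-day drift acts as a per-metabolite scaling factor. It is not a description of the core facility's procedure. The function name, the `"reference"` key, and the abundance values are hypothetical.

```python
def normalize_by_daily_reference(runs, reference_key="reference"):
    """Rescale each day's samples using that day's reference sample.

    `runs` has the layout {day: {sample_name: {metabolite: abundance}}}.
    For every metabolite, each day's values are multiplied by
    (mean reference abundance across days) / (that day's reference value).
    Assumes every metabolite is present and non-zero in each reference.
    """
    days = list(runs)
    metabolites = list(runs[days[0]][reference_key])

    # Target level: mean reference abundance across all days.
    target = {
        m: sum(runs[d][reference_key][m] for d in days) / len(days)
        for m in metabolites
    }

    normalized = {}
    for d in days:
        day_ref = runs[d][reference_key]
        factors = {m: target[m] / day_ref[m] for m in metabolites}
        normalized[d] = {
            sample: {m: value * factors[m] for m, value in profile.items()}
            for sample, profile in runs[d].items()
        }
    return normalized


# Made-up example: the instrument reads twice as high on day 2.
runs = {
    "day1": {
        "reference": {"putrescine": 100.0, "uracil": 50.0},
        "clone_a": {"putrescine": 200.0, "uracil": 50.0},
    },
    "day2": {
        "reference": {"putrescine": 200.0, "uracil": 100.0},
        "clone_b": {"putrescine": 400.0, "uracil": 100.0},
    },
}
normalized = normalize_by_daily_reference(runs)
```

In this toy case the two clones differ in raw abundance only because of the day they were run, and after rescaling their profiles coincide. Processing all samples as a single batch avoids the need for this correction altogether.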
The various tools both introduced and re-explored in this manuscript provide novel means to screen and characterize functionally unknown phage genes. The simplicity and adaptability of the experimental techniques with the streamline use of computational pipelines assures these methods are applicable to a broad range of research endeavors and fields. Our goal is that the phenomic approaches presented here will aid further investigations of novel phage proteins in addition to systems that are equally functionally undefined.
The authors have nothing to disclose.
We thank Benjamin Knowles, Yan Wei Lim, Andreas Haas, and members of the Viral Dark Matter consortium for their help and constructive input on this manuscript. This research is funded by the National Science Foundation (DEB-1046413) and is part of the Dimensions: Shedding Light on Viral Dark Matter project.
Name | Company | Catalog Number | Comments |
0.22 µm Sterivex Filter | Fisher Scientific | SVGP01050 | Millipore |
0.22 µm Millex Filters | Fisher Scientific | SLGV033RS | Millipore |
0.22 µm SteriCap Filter | Fisher Scientific | SCGPS02RE | Millipore |
0.22 µm Omnipore membrane filters | Millipore | JHWP02500 | Millipore |
96 well micro-titer plates | VWR | 82050-764 | Standard F-Bottom 96 well Microplates |
2 mL 96 well plate | Fisher Scientific | | |
Adhesive Seal Plate Film | Sigma-Aldrich | Z369667 | |
2 L Nalgene square bottles | Cole Parmer | T-06040-70 | |
125 mL Nalgene square bottles | Cole Parmer | T-06040-50 | |
1/4 inch Panel Mount Lock Nut, black nylon | Cole Parmer | EW-45509-04 | |
Female Luer Thread Style Panel Mount to 200 Series Barb, 1/16 inch | Cole Parmer | EW-45500-30 | |
Female Luer Thread Style Panel Mount to 200 Series Barb, 1/8 inch | Cole Parmer | EW-45500-34 | |
Male Luer Integral Lock Ring to 500 Series Barb, 1/16 inch ID tubing | Cole Parmer | EW-45505-31 | |
Male Luer with Lock Ring x Female Luer Coupler | Cole Parmer | T-45508-80 | |
Barbed Bulkhead Fittings 1/4 inch OD | Fisher Scientific | 6149-0002 | |
Sanipure Tubing 1/16 inch ID x 1/8 inch OD | SaniPure | AR400002 | |
Sanipure Tubing 1/4 inch OD x 1/8 inch ID | SaniPure | AR400007 | |
Variable Flow Mini Pump (Peristaltic pump) | Fisher Scientific | 13-876-1 | |
Magnetic Stirrer | Velp Scientifica | F203A0160 | |
Forceps | Fisher Scientific | 14-512-141 | Millipore* Filter Forceps |
Multi-plate spectrophotometer plate reader | Molecular Devices Analyst GT | | |
Filter manifold | Fisher Scientific | XX10 025 02 | |
Software: | | | |
Python version 2.7.5 | http://www.python.org/ | | |
PyLab module | http://wiki.scipy.org/PyLab | | |
R version 3.0.1 | http://www.r-project.org/ | | |
reshape2 library | http://had.co.nz/reshape | | |
ggplot2 library | http://ggplot2.org/ | | |
Gene Composer | PSI Tech Portal | http://www.genecomposer.net | |
Services: | | | |
West Coast Metabolomics Center | UC Davis | http://metabolomics.ucdavis.edu | |
DNA 2.0 | https://www.dna20.com | | |