Here we present the proteogenomic tool PoGo and protocols for fast, quantitative, post-translational modification and variant enabled mapping of peptides identified through mass spectrometry onto reference genomes. This tool is of use to integrate and visualize proteogenomic and personal proteomic studies interfacing with orthogonal genomics data.
Cross-talk between genes, transcripts, and proteins is the key to cellular responses; hence, analysis of molecular levels as distinct entities is slowly being extended to integrative studies to enhance the understanding of molecular dynamics within cells. Current tools for the visualization and integration of proteomics with other omics datasets are inadequate for large-scale studies. Furthermore, they only capture basic sequence identity, discarding post-translational modifications and quantitation. To address these issues, we developed PoGo to map peptides with associated post-translational modifications and quantification to reference genome annotation. In addition, the tool was developed to enable the mapping of peptides identified from customized sequence databases incorporating single amino acid variants. While PoGo is a command line tool, the graphical interface PoGoGUI enables non-bioinformatics researchers to easily map peptides to 25 species supported by Ensembl genome annotation. The generated output borrows file formats from the genomics field and, therefore, visualization is supported in most genome browsers. For large-scale studies, PoGo is supported by TrackHubGenerator to create web-accessible repositories of data mapped to genomes that also enable an easy sharing of proteogenomics data. With little effort, this tool can map millions of peptides to reference genomes within only a few minutes, outperforming other available sequence-identity based tools. This protocol demonstrates the best approaches for proteogenomics mapping through PoGo with publicly available datasets of quantitative and phosphoproteomics, as well as large-scale studies.
In cells, genome, transcriptome, and proteome affect each other to modulate a response to internal and external stimuli and interact with each other to carry out specific functions leading to health and disease. Therefore, characterizing and quantifying genes, transcripts, and proteins is crucial for fully understanding cellular processes. Next-generation sequencing (NGS) is one of the most commonly applied strategies to identify and quantify gene and transcript expression. However, protein expression is commonly assessed by mass spectrometry (MS). Significant advancements in MS technology over the last decade have enabled a more complete identification and quantification of proteomes, making the data comparable with transcriptomics1. Proteogenomics and multi-omics as ways to integrate NGS and MS data have become powerful approaches to assess cellular processes across multiple molecular levels, identifying subtypes of cancer and leading to novel potential drug targets in cancer2,3. It is important to note that proteogenomics was initially used to provide proteomic evidence for gene and transcript annotations4. Several genes previously thought to be non-coding have recently undergone reevaluation considering large-scale human tissue datasets5,6,7. In addition, proteomic data are successfully used to support annotation efforts in non-model organisms8,9. However, proteogenomic data integration can be exploited further to highlight protein expression in relation to genomic features and elucidate cross-talk between transcripts and proteins by providing a combined reference system and methods for co-visualization.
In order to provide a common reference for proteomics, transcriptomics, and genomics data, numerous tools have been implemented for mapping peptides identified through MS onto genome coordinates10,11,12,13,14,15,16,17. Approaches differ in aspects such as mapping reference, support of genome browsers, and degree of integration with other proteomics tools as shown in Figure 1. While some tools map reverse translated peptides onto a genome16, others use a search engine annotated position within a protein and gene annotation to reconstruct the nucleotide sequence of the peptide15. Still others use a 3- or 6-frame translation of the genome to map peptides against11,13. Lastly, several tools skip the nucleotide sequences and use amino acid sequence translations from RNA-sequencing mapped transcripts as an intermediate to map peptides to the associated genome coordinates10,12,14,17. However, the translation of nucleotide sequences is a slow process and custom databases are prone to errors that propagate to the peptide mapping. For fast and high-throughput mapping, a small and comprehensive reference is crucial. Therefore, a standardized protein reference with associated genome coordinates is essential for accurate peptide-to-genome mapping. Novel aspects in proteogenomics, such as the incorporation of variants and post-translational modifications (PTMs)2,3, are gaining momentum through recent studies. However, these are generally not supported by current proteogenomic mapping tools as shown in Figure 1. To improve the speed and quality of mapping, we developed PoGo, a tool that allows the fast and quantitative mapping of peptides to genomes18. In addition, PoGo enables the mapping of peptides with up to two sequence variants and annotated post-translational modifications.
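To illustrate the annotation-based idea described above, in which a peptide's position within a protein is combined with exon coordinates to recover genome coordinates, the following minimal Python sketch may help. It is not PoGo's implementation. The function name, the 0-based half-open coordinate convention, and the restriction to forward-strand transcripts are assumptions made purely for the example.

```python
def peptide_to_genomic_blocks(cds_exons, peptide_start_aa, peptide_length_aa):
    """Project a peptide onto genome coordinates via coding exon annotation.

    cds_exons: list of (start, end) genomic intervals, 0-based half-open,
               in transcript order on the forward strand (assumption).
    peptide_start_aa: 0-based index of the peptide's first residue in the protein.
    peptide_length_aa: number of residues in the peptide.
    Returns a list of (start, end) genomic blocks covered by the peptide.
    """
    # Each amino acid corresponds to three coding nucleotides.
    nt_start = peptide_start_aa * 3
    nt_end = nt_start + peptide_length_aa * 3

    blocks = []
    offset = 0  # number of coding nucleotides consumed by previous exons
    for exon_start, exon_end in cds_exons:
        exon_length = exon_end - exon_start
        # Overlap between the peptide and this exon in coding-sequence space.
        lo = max(nt_start, offset)
        hi = min(nt_end, offset + exon_length)
        if lo < hi:
            blocks.append((exon_start + lo - offset, exon_start + hi - offset))
        offset += exon_length
    return blocks


# Hypothetical transcript with two coding exons of 30 and 60 nucleotides.
exons = [(100, 130), (200, 260)]
within_exon = peptide_to_genomic_blocks(exons, 0, 5)
across_junction = peptide_to_genomic_blocks(exons, 8, 4)
```

In this toy case the second peptide starts in the first exon and ends in the second, so it yields two blocks separated by the intron. A real implementation would also need to handle reverse-strand transcripts and incomplete coding sequences.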
PoGo has been developed to cope with the rapid increase of quantitative high-resolution datasets capturing proteomes and global modifications and provides a central utility for large-scale analyses such as personal variation and precision medicine. This article describes the application of this tool to visualize the presence of post-translational modification in relation to genomic features. Furthermore, this article highlights the identification of alternative splicing events through mapped peptides and the mapping of peptides identified through custom variant databases to a reference genome. This protocol employs publicly available datasets downloaded from the PRIDE archive19 to demonstrate these functionalities of PoGo. In addition, this protocol describes the application of TrackHubGenerator for the creation of online accessible hubs of peptides mapped to genomes for large-scale proteogenomics studies.
1. Preparation, Download, and Setup
NOTE: The file and folder path examples are shown in a Windows format for the ease of access for standard users. PoGo and PoGoGUI are also available for macOS and Linux operating systems.
2. Mapping Peptides with Annotated Post-translational Modifications and Visualization Including Quantitation
NOTE: The resulting output file can be loaded in any genome browser supporting the Browser Extensible Data (BED) format. Examples include the Integrative Genomics Viewer (IGV)24 (which is used in the following), the UCSC Genome Browser25, and the Ensembl Genome Browser20. It is important to ensure that the annotation GTF and protein FASTA versions used for PoGo mapping match the version of the genome in the genome browser. For the human Ensembl releases 57-75 and GENCODE versions 3d-19, use GRCh37/hg19; for the Ensembl versions 76 or higher and GENCODE 20 or higher, use GRCh38/hg38. For the mouse Ensembl versions 74 or higher and GENCODE M2 or higher, use GRCm38.
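The version-matching rule in this note can be written down as a small lookup. The sketch below is only an illustration: the function name is hypothetical, it covers the Ensembl release numbers given in the note and not the GENCODE versions, and releases published after this protocol was written may use newer assemblies.

```python
def assembly_for_ensembl_release(species, release):
    """Return the genome assembly to select in the browser.

    Encodes only the ranges stated in the note above; anything outside
    those ranges raises an error instead of guessing.
    """
    if species == "human":
        if 57 <= release <= 75:
            return "GRCh37/hg19"
        if release >= 76:
            return "GRCh38/hg38"
    elif species == "mouse":
        if release >= 74:
            return "GRCm38"
    raise ValueError(
        f"No assembly rule stated for {species} Ensembl release {release}"
    )


human_old = assembly_for_ensembl_release("human", 75)
human_new = assembly_for_ensembl_release("human", 76)
mouse = assembly_for_ensembl_release("mouse", 74)
```

Raising an error for uncovered combinations is a deliberate choice here, since loading peptides against a mismatched assembly would silently shift every mapping.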
3. Mapping Peptides Identified Through a Custom Variant Database to a Reference Genome
NOTE: PoGo mapping can be carried out using the graphical user interface (GUI) or through the command line interface; the two are interchangeable. In this part of the protocol, the command line interface is used to highlight this interchangeability. The second part of this protocol section requires the software tool R26. Please ensure that R is installed.
4. Mapping Using Multiple Files and Generating Track Hubs for Large Datasets
A graphical depiction highlighting at which stage of a regular proteomic workflow PoGo18 is applied, as well as downstream options of visualization, is shown in Figure 5. Shotgun proteomics (i.e., the proteolytic digestion of proteins followed by liquid chromatography coupled with tandem mass spectrometry) is one precursory step of proteogenomic mapping. The resulting tandem mass spectra are commonly compared to theoretical spectra derived from protein sequence databases. Proteogenomics studies introduce translation sequences of novel transcripts with coding potential and non-synonymous single nucleotide variants (SNVs) into the database, making it hard to easily relate these back to the reference genome8. The graphical user interface of PoGo (PoGoGUI) supports file formats for the standardized reporting of peptide identifications from mass spectrometry experiments and converts them into the simplified 4-column pogo format. PoGoGUI wraps the command line tool PoGo and thus enables the mapping of peptides onto genome coordinates utilizing the reference annotation of protein-coding genes commonly provided in GTF format and the translated transcript sequences in FASTA format. Different output formats are generated by PoGo to enable the visualization of different aspects of the peptides identified through mass spectrometry, including post-translational modifications and peptide level quantification. Output files in BED format can further be converted and combined into online accessible directories called track hubs. Single output files, as well as track hubs, can then be visualized in browsers such as the UCSC Genome Browser25, Ensembl Genome Browser20, IGV24, and Biodalliance28 (see Figure 5 bottom).
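For readers preparing input by hand, the 4-column format (tab-separated, with the header 'Experiment', 'Peptide', 'PSMs', and 'Quant' as described in the legend of Figure 2) can be produced with a few lines of Python. This is a minimal sketch; the function name, sample identifiers, peptide sequences, and values are placeholders and do not come from any dataset used in this protocol.

```python
import csv
import io

# Column headers as listed in the legend of Figure 2.
POGO_HEADER = ["Experiment", "Peptide", "PSMs", "Quant"]


def format_pogo_input(records):
    """Render (experiment, peptide, psm_count, quant) tuples as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(POGO_HEADER)
    for experiment, peptide, psm_count, quant in records:
        writer.writerow([experiment, peptide, psm_count, quant])
    return buffer.getvalue()


# Placeholder records for illustration only.
example_text = format_pogo_input([
    ("sample1", "PEPTIDEK", 3, 1250.5),
    ("sample2", "PEPTIDEK", 1, 830.0),
])
```

The returned text can then be saved with one of the supported extensions (*.txt, *.tsv, or *.pogo), for example with `open("peptides.pogo", "w").write(example_text)`, where the file name is again just an example.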
We applied PoGo to the reanalysis of the draft human proteome maps filtered at high significance as described in Wright et al.7 and compared it to two other tools for proteogenomic mapping, namely PGx10 and iPiG14. The dataset comprised 233,055 unique peptides across 59 adult and fetal tissues resulting in a total of over 3 million sequences. PoGo outperformed these tools both in runtime (6.9x and 96.4x faster, respectively) and memory usage (20% and 60% less memory, respectively), as previously reported18 and shown in Figure 6. An example of a successfully mapped peptide is shown in Figure 7.
While PoGo significantly outperformed the other tools in speed and memory, it is also capable of mapping post-translational modifications and quantitative information associated with peptides onto the genome. Figure 8A schematically depicts the visualization of the BED format in a genome browser for peptides mapping to one exon and across splice junctions. PoGo utilizes the coloring option to provide an easy visual aid with respect to the uniqueness of the peptide mapping within the genome. Mappings in red indicate uniqueness to a single transcript, while black indicates mapping to a single gene in which the peptide is shared between different transcripts. Grey mappings show a peptide shared between multiple genes. Such peptides are, for example, less reliable for the quantification of a gene and should not be trusted to call the expression of a gene. The PTM BED option of PoGo redefines the color code to accommodate different types of post-translational modifications as shown in Figure 8B. Additionally, PTMs are indicated by thick blocks (see Figure 8B). A single PTM of a type is highlighted by a thick block at the position of the modified amino acid residue, while multiple PTMs of the same type are spanned by a thick block from the first modified amino acid to the last.
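The 3-tier color rule and the thick-block convention described above can be summarized in a short Python sketch. It is an illustration only, not PoGo's code: the function names are hypothetical, colors are returned as names because the exact RGB values are not stated here, and the thick-block helper assumes a peptide lying within a single forward-strand exon with 0-based coordinates.

```python
def uniqueness_tier(gene_count, transcript_count):
    """Return the color tier for a peptide given how many genes and transcripts it maps to."""
    if gene_count < 1 or transcript_count < 1:
        raise ValueError("a mapped peptide needs at least one gene and one transcript")
    if gene_count > 1:
        return "grey"   # shared between multiple genes
    if transcript_count > 1:
        return "black"  # one gene, shared between its transcripts
    return "red"        # unique to a single transcript


def ptm_thick_span(peptide_genomic_start, modified_residues):
    """Return (thick_start, thick_end) covering all PTM sites of one type.

    peptide_genomic_start: 0-based genomic start of the peptide (assumption:
                           single exon, forward strand).
    modified_residues: 0-based residue indices within the peptide.
    """
    first = min(modified_residues)
    last = max(modified_residues)
    # Three nucleotides per residue; the end coordinate is exclusive.
    return (peptide_genomic_start + 3 * first,
            peptide_genomic_start + 3 * (last + 1))


single_site = ptm_thick_span(100, [2])
two_sites = ptm_thick_span(100, [2, 5])
```

With a single modified residue the thick block covers one codon; with two sites of the same type it stretches from the first site to the last, which is why residues between the sites appear highlighted even though they are not modified.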
We applied PoGo and subsequently TrackHubGenerator to a dataset of 50 colorectal cancer cell lines including whole proteome and phosphoproteome29. While the track hub loaded in the UCSC Genome Browser shows the peptides mapped to the genome and highlights the uniqueness of the mappings and the phosphorylation sites (see Figure 9), additional data are provided in the supplemental folder. The GCT files then enable the visualization of the peptide and phosphopeptide quantitation in a genomic context. However, GCT files do not provide an easy visualization of peptides spanning across splice junctions (see Figure 10 top). The peptides across splice junctions are split into their respective parts mapping to the exons. While it is possible to identify splice peptides through the same quantitative values of exon mappings, loading sequence-based mapping files such as BED or GTF, which connect the exons by a thin intron-spanning line, supports the interpretation (see Figure 10 bottom).
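The thin intron-spanning line follows from how the 12-column BED format stores a feature as a list of blocks. As a rough sketch (not PoGo's own code), the Python function below assembles such a line from exon-level blocks. The function name, the score, the color value, and the example peptide name are placeholders; the field order follows the standard BED specification.

```python
def bed12_line(chrom, blocks, name, strand="+", score=1000, item_rgb="255,0,0"):
    """Format genomic blocks (0-based, half-open) as one 12-column BED line."""
    blocks = sorted(blocks)
    chrom_start = blocks[0][0]
    chrom_end = blocks[-1][1]
    block_sizes = ",".join(str(end - start) for start, end in blocks)
    # Block starts are given relative to chromStart.
    block_starts = ",".join(str(start - chrom_start) for start, _ in blocks)
    fields = [
        chrom, chrom_start, chrom_end, name, score, strand,
        chrom_start, chrom_end,  # thickStart and thickEnd: whole feature drawn thick
        item_rgb, len(blocks), block_sizes, block_starts,
    ]
    return "\t".join(str(field) for field in fields)


# A hypothetical peptide split across two exons by a splice junction.
line = bed12_line("chr1", [(124, 130), (200, 206)], "PEPTIDEK")
columns = line.split("\t")
```

Because both exon parts belong to one BED record, a browser draws them as two blocks joined by a line across the intron, whereas per-exon quantitative rows carry no such link.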
To highlight the utility of variant-enabled mapping, we applied PoGo in two configurations to a dataset of human testis proteome searched against neXtProt to hunt for missing proteins using a multi-enzyme strategy22. In addition to reference protein sequences, neXtProt comprises over 5 million single amino acid variants30. Mapping peptides identified with a single amino acid variant is not supported by other mapping tools. A total of 177,012 unique peptides were identified. Of these, 99.8% (176,694 peptides) were first successfully mapped without allowing mismatches. Removing those from the identified peptide list left 0.2% (318 peptides), which were subsequently mapped allowing one amino acid substitution. This resulted in 3,446 mappings of 162 peptides that would not have been mapped to the reference genome with any other available tool. While the average number of mappings including a mismatch is high, 62 peptides were mapped to only a single locus, indicating true variant sequences. An example of a peptide mapped with a single amino acid substitution is highlighted with its sequence and the translated genomic sequence in Figure 11.
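The figures in this paragraph can be checked with simple arithmetic, which also makes the "high average number of mappings" concrete. The short Python snippet below uses only the counts reported above; the variable names are chosen for the example.

```python
total_peptides = 177_012
mapped_without_mismatch = 176_694

# Peptides left over after the first, mismatch-free mapping stage.
remaining = total_peptides - mapped_without_mismatch

percent_exact = 100 * mapped_without_mismatch / total_peptides
percent_remaining = 100 * remaining / total_peptides

# Second stage: mapping with one amino acid substitution allowed.
variant_mappings = 3_446
variant_peptides = 162
single_locus_peptides = 62

mean_mappings_per_variant_peptide = variant_mappings / variant_peptides
percent_rescued = 100 * variant_peptides / remaining
percent_single_locus = 100 * single_locus_peptides / variant_peptides
```

Rounded to one decimal place, this gives roughly 21.3 mappings per variant peptide on average, with about 51% of the 318 remaining peptides mapped in the second stage and about 38% of those mapped to a single locus.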
Figure 1. Visual comparison of different peptide-to-genome mapping tools. The comparison is shown with regard to various aspects, including the mapping reference, the level of integration into frameworks, and the support of online and offline browsers. Additionally, support for novel aspects of proteogenomics is highlighted separately. Compared to other tools, PoGo only lacks the capability to map directly to a genome sequence. However, it supports all novel features that most of the other tools do not support.
Figure 2. Example input file for mapping peptides. PoGo accepts input data in a tab-separated format with 4 columns. Column headers in the first line are 'Experiment', 'Peptide', 'PSMs', and 'Quant', indicating in the following lines the experiment or sample identifier, the peptide sequence, the number of peptide-spectrum matches, and a quantitative value for the peptide, respectively. File name extensions supported are *.txt, *.tsv, and *.pogo.
Figure 3. PoGoGUI interface with highlighted steps for file selections and parameter options. The figure shows the steps for selecting and uploading all required files and the selection of options for mapping peptides with post-translational modifications onto the human reference genome.
Figure 4. Screenshot of the Integrative Genomics Viewer (IGV) data upload procedure. The figure highlights the steps for uploading PoGo output files in the IGV browser. Furthermore, it shows the option of expanding the track of mapped peptides to highlight the mapping and sequence.
Figure 5. Simplified workflow of steps from LC-MS/MS to visualization in genome browsers. PoGo mapping follows the identification of peptides from tandem mass spectra. To achieve the mapping to the genome, PoGo utilizes reference annotation provided as genome annotation (GTF) and transcript translation sequences (FASTA). Different output formats are generated that can be loaded separately in genome browsers. Additionally, files in BED format can be combined into track hubs supporting visualization of large-scale datasets.
Figure 6. Benchmarking PoGo against PGx and iPiG. PoGo outperforms the other tools on benchmarking. Mapping 233,055 unique peptides across 59 adult and fetal tissues resulting in over 3 million sequences, PoGo was 6.9x and 96.4x faster than PGx and iPiG, respectively. Furthermore, PoGo required 20% and 60% less memory compared to PGx and iPiG, respectively. While PoGo and PGx finished successfully, iPiG resulted in a memory error at 16 GB.
Figure 7. UCSC Genome browser example view of mapped peptides. The figure shows peptides mapped to the gene mTOR. While the combined track shows the peptides spanning across splice junctions and mapping only to one exon with the associated sequences, the tissue-specific tracks only highlight the mapping in a condensed format.
Figure 8. Schematic of mapping visualization and color coding. (A) In the standard BED output file, peptides mapping to an exon are shown as single blocks (left), while peptides mapping across multiple exons highlight the exon-covering parts as blocks (right). Introns are shown as thin connecting lines. PoGo color-codes the uniqueness of the mapping of peptides to genes and transcripts using a 3-tier system. (B) In addition to the block structure of the BED format, PTM BED output highlights the position of post-translational modifications as thick blocks. The presence of a single PTM of a type highlights the modified amino acid residue with a thick block, while multiple sites of the same PTM are combined into long blocks spanning from the first to the last modification site. Peptide mappings are further divided by PTM type and color-coded based on the modification.
Figure 9. Track hub view in the UCSC genome browser of colorectal cancer proteome and phosphoproteome data. The track hub comprises whole proteome data as well as phosphoproteome data. While the red color in the proteome and phosphoproteome tracks indicates the uniqueness of the mapping to the single transcript of SFN, tracks ending in _ptm show the phosphorylation sites within peptides. Here, the red color indicates the type of modification as phosphorylation. Only two peptides have been identified, each showing a single phosphorylation (thick blocks).
Figure 10. View of colorectal cancer phosphopeptides and associated quantitation in IGV. The figure shows a subset of the 50 cancer cell lines. It furthermore shows four columns of blocks in differing shades of light red. The color indicates the relative abundance from low (white) to high (red). While the four columns might initially suggest that there are 4 peptides, it becomes clear with the associated sequence-based GTF output file that these are in fact two peptides, each spanning a splice junction.
Figure 11. View of peptide with amino acid variant in IGV. The figure shows a peptide with a single amino acid variant mapped to the reference genome at the translation start of the gene GPSM1. The variant is positioned at amino acid residue 8 and results in the substitution of alanine to valine (A→V). The translation sequences of the annotated transcripts (blue) highlight the variant in comparison to the peptide sequence.
This protocol describes how the software tool PoGo and its graphical user interface PoGoGUI enable a fast mapping of peptides onto genome coordinates. The tool offers unique features such as quantitative, post-translational modification and variant-enabled mapping to genomes using reference annotation. This article demonstrates the method on a large-scale proteogenomic study and highlights its speed and memory efficiency compared to other available tools18. In combination with the tool TrackHubGenerator, which creates online accessible hubs of genomic and genome linked data, PoGo, with its graphical user interface, enables large-scale proteogenomics studies to quickly visualize their data in genomic context. Furthermore, we demonstrate the unique features of PoGo with datasets searched against variant databases and quantitative phosphoproteomics22,29.
Single files, such as the GCT file, provide valuable visualization and links between peptide features and genomic loci. However, it is important to note that an interpretation based on these alone may be difficult or misleading due to their limitation to single aspects of proteogenomics such as uniqueness, post-translational modifications, and quantitative values. Therefore, it is important to carefully choose which output files, options, and combinations are appropriate for the proteogenomic question at hand and to adjust the combinations accordingly. For example, information about the uniqueness of the mapping to a specific genomic locus might be of great value for the annotation of a genomic feature7, while the quantification across different samples might be more appropriate for studies relating genomic features to changes in protein abundance29. Output should be generated by PoGo for each setting. In case no output is generated, or empty files are shown in the output folder, it is recommended to check the input files for the desired content and the required file format. In cases where the file format or content does not follow the expectations of PoGo (e.g., the FASTA file supposedly containing the transcript translation sequences contains the nucleotide sequences of the transcripts), error messages will ask the user to check the input files.
Restrictions of the protocol and the tool are mostly based on the reuse of file formats commonly used in genomics. Repurposing file formats used in genomics for proteogenomic applications is accompanied by specific limitations. These are due to the differing sets of requirements for genome-centered visualization of genomic and proteogenomic data, such as the need to visualize post-translational modifications from proteomics data. This is restricted in the genomics file formats by single feature usage. Many approaches and tools have been developed for proteomics to confidently localize post-translational modifications within peptide sequences31,32,33,34. However, the visualization of multiple modifications in a unique and discernible manner on the genome is hindered by the structure of genomic file formats. Therefore, the single block visualization of multiple PTMs of the same type does not indicate any ambiguity of the modification sites but is the consequence of the differing requirement from the genomics community to only visualize single features at a time. Nonetheless, PoGo has the advantage of mapping post-translational modifications onto genomic coordinates to enable studies focused on the effect of genomic features such as single nucleotide variants on post-translational modifications. Using PoGo, variant mapping increases the number of total mappings. However, the color coding of mapped peptides distinguishes reliable mappings from unreliable ones. The mapping of variant peptides identified from known single nucleotide variants can be accompanied by visualizing the mapped peptides alongside the variants in VCF format. This way, the color code indicating an unreliable mapping of a variant peptide is overruled by the presence of the known nucleotide variant.
A critical step for using PoGo is the use of the correct files and formats. The use of translated transcript sequences as protein sequences to accompany the annotation in GTF format is the main criterion. Another critical element when considering using PoGo to map peptides with amino acid mismatches is memory. While PoGo is highly memory-efficient for a standard application, the exponentially increasing number of possible mappings with one or two mismatches leads to a similarly exponential increase in memory usage18. We propose a staged mapping as described in this protocol to first map the peptides without mismatches and remove them from the set. The remaining unmapped peptides can then be mapped using one mismatch, and the procedure can be repeated with two mismatches for the peptides that still remain unmapped.
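The filtering step between stages of this staged mapping can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the input is the 4-column tab-separated text with the peptide in the second column, and the set of already mapped peptide sequences is assumed to have been collected beforehand from the output of the previous stage. No PoGo options are shown, since the mapping runs themselves are outside this sketch.

```python
def keep_unmapped(input_text, mapped_peptides):
    """Return the 4-column input restricted to peptides that are not yet mapped.

    input_text: contents of a tab-separated PoGo input file, header included.
    mapped_peptides: set of peptide sequences mapped in earlier stages
                     (how this set is obtained is left to the user).
    """
    lines = input_text.splitlines()
    header, rows = lines[0], lines[1:]
    kept = [header]
    for row in rows:
        if not row.strip():
            continue  # skip blank lines
        peptide = row.split("\t")[1]
        if peptide not in mapped_peptides:
            kept.append(row)
    return "\n".join(kept) + "\n"


# Placeholder data: two peptides, one of which mapped without mismatches.
stage0_input = (
    "Experiment\tPeptide\tPSMs\tQuant\n"
    "s1\tAAAK\t2\t10\n"
    "s1\tCCCR\t1\t5\n"
)
stage1_input = keep_unmapped(stage0_input, {"AAAK"})
```

The same function can be applied again after the one-mismatch stage to prepare the input for the two-mismatch stage, so that the memory-intensive settings only ever see the small remainder of peptides.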
Since the throughput of mass spectrometry has significantly increased and studies interfacing genomic and proteomic data have become more frequent in recent years, tools that readily enable interfacing these types of data in the same coordinate system are increasingly indispensable. The tool presented here addresses the need to combine genomic and proteomic data and supports a better understanding of integrative studies across small and large datasets by mapping peptides onto a reference annotation. Encouragingly, PoGo has been applied to map peptides to gene candidates provided in the same format as the reference annotation to support annotation efforts of novel genes expressed in human testis35. The approach presented here is independent of databases used for peptide identification. The protocol might aid in the identification and visualization of novel translation products by using adapted input files from translation sequences and associated GTF files from RNA-seq experiments.
Several approaches and tools with a wide range of special application scenarios to map peptides to genomic coordinates, ranging from mapping peptides directly to the genome sequence to RNA-sequencing guided mapping, have been introduced10,11,12,13,14,15,16,17. However, these can result in a failure to correctly map peptides when post-translational modifications are present and errors in the underlying mapping of RNA-sequencing reads may be propagated to the peptide level. PoGo has been developed to specifically overcome those obstacles and to cope with the rapid increase of quantitative high-resolution proteomic datasets to integrate with orthogonal genomics platforms. The tool described here can be integrated into high-throughput workflows. Through the graphical interface PoGoGUI, the tool is simple to use and requires no specialist bioinformatics training.
The authors have nothing to disclose.
This work was funded by the Wellcome Trust (WT098051) and the NIH grant (U41HG007234) to the GENCODE project.
Name | Company | Catalog Number | Comments |
PoGo (software) | NA | NA | https://github.com/cschlaffner/PoGo |
PoGoGUI (software) | NA | NA | https://github.com/cschlaffner/PoGoGUI |
TrackHubGenerator (software) | NA | NA | https://github.com/cschlaffner/TrackHubGenerator |
Integrative Genomics Viewer (software) | NA | NA | http://software.broadinstitute.org/software/igv/ |
UCSC genome browser (website) | NA | NA | https://genome.ucsc.edu/ |
GENCODE (website) | NA | NA | http://gencodegenes.org |
Ensembl (website) | NA | NA | http://ensembl.org |
bedToBigBed (software) | NA | NA | http://hgdownload.soe.ucsc.edu/admin/exe/ |
fetchChromSizes.sh (software) | NA | NA | http://hgdownload.soe.ucsc.edu/admin/exe/ |