OpenProt is a freely accessible database that enforces a polycistronic model of eukaryotic genomes. Here, we present a protocol for the use of OpenProt databases when interrogating mass spectrometry datasets. Using the OpenProt database for the analysis of proteomic experiments allows the discovery of novel and previously undetectable proteins.
Genome annotation is central to today’s proteomic research as it draws the outlines of the proteomic landscape. Traditional models of open reading frame (ORF) annotation impose two arbitrary criteria: a minimum length of 100 codons and a single ORF per transcript. However, a growing number of studies report expression of proteins from allegedly non-coding regions, challenging the accuracy of current genome annotations. These novel proteins were found encoded within non-coding RNAs, within the 5′ or 3′ untranslated regions (UTRs) of mRNAs, or overlapping a known coding sequence (CDS) in an alternative ORF. OpenProt is the first database that enforces a polycistronic model for eukaryotic genomes, allowing annotation of multiple ORFs per transcript. OpenProt is freely accessible and offers custom downloads of protein sequences across 10 species. Using the OpenProt database for proteomic experiments enables the discovery of novel proteins and highlights the polycistronic nature of eukaryotic genes. The size of the OpenProt database (all predicted proteins) is substantial and needs to be taken into account in the analysis. However, with appropriate false discovery rate (FDR) settings or the use of a restricted OpenProt database, users will gain a more realistic view of the proteomic landscape. Overall, OpenProt is a freely available tool that will foster proteomic discoveries.
Over the past decades, mass spectrometry (MS)-based proteomics has become the gold-standard technique to decipher proteomes of eukaryotic cells1,2,3,4,5. This method relies on current genome annotations to generate a reference protein sequence database that outlines the scope of possibilities6,7,8. However, genome annotations apply arbitrary criteria for ORF annotation, such as a minimum length of 100 codons and a single ORF per transcript9,10. An increasing number of studies challenge the current annotation model and report discoveries of unannotated functional ORFs in eukaryotic genomes8,11,12,13,14. These novel proteins are found encoded in allegedly non-coding RNAs, in the 5' or 3' untranslated regions (UTRs) of mRNAs, or overlapping the canonical coding sequence (cCDS) in an alternative frame. Although most of these discoveries have been serendipitous, they demonstrate the caveats of current genome annotations and the polycistronic nature of eukaryotic genes8.
Here, we highlight the use of OpenProt databases for MS-based proteomics. OpenProt is the first database to hold a polycistronic annotation model for eukaryotic transcriptomes. It is freely available at www.openprot.org15. A proportion of the predicted ORFs would be random and non-functional, which is why OpenProt accumulates experimental and functional evidence to increase confidence. Experimental evidence includes protein expression (by MS) and translation evidence (by ribosome profiling)15. Functional evidence includes protein orthology (with an InParanoid-like approach) and functional domain prediction15.
OpenProt offers the possibility to download several databases, ranging from those containing only well-supported proteins to custom-made databases. Here, we will present a pipeline for the use of OpenProt databases and will offer insights into which database to choose considering the experimental aim. The proteomics analysis pipeline presented here is supported by the Galaxy framework as it is open-access and easy to use, but the databases can work with any workflow16,17,18. We will also present how to use the OpenProt website for gathering further information on novel proteins detected by MS. Using OpenProt databases will provide a more exhaustive view of the proteomic landscape and will foster proteomic and biomarker discoveries in a more systematic way than current methods.
This protocol highlights the use of OpenProt databases15 when interrogating MS datasets; it will not review the design of the experiment itself, which has been thoroughly reviewed elsewhere20,21,22. In an effort to remain fully open-source, the protocol is freely available (Supplementary Material S1–S4). For easier reading, all terms used in OpenProt and throughout this protocol are defined in Table 1.
1. OpenProt database download
NOTE: Custom databases, based on RNA-seq data for example, can also be obtained; the procedure is detailed in the second section of this protocol. If a custom database is needed, please skip to the next section.
2. Custom OpenProt database download
NOTE: This section details how to obtain a custom database. If no custom database is needed, skip to the next section.
3. Database handling
NOTE: From now on, the Galaxy platform will be used, but the same principles can be applied to other proteomic software.
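As a rough illustration of what the database handling workflow does (Supplementary Material S1 appends contaminant and reversed decoy sequences to the input database), the following Python sketch builds reversed decoys for a FASTA file held in memory. It is not the Galaxy tool itself: the function names, the `DECOY_` prefix and the example sequences are assumptions made for illustration only. Contaminant sequences, such as those from the CRAPome, could be concatenated to the record list in the same way before the decoys are generated.

```python
from typing import List, Tuple

FastaRecord = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(text: str) -> List[FastaRecord]:
    """Parse FASTA-formatted text into (header, sequence) records."""
    records: List[FastaRecord] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def add_reversed_decoys(records: List[FastaRecord],
                        decoy_prefix: str = "DECOY_") -> List[FastaRecord]:
    """Return the target records followed by one reversed decoy per target.

    The prefix is a hypothetical choice; use whatever your search
    engine is configured to recognize as a decoy.
    """
    decoys = [(decoy_prefix + header, seq[::-1]) for header, seq in records]
    return list(records) + decoys


def to_fasta(records: List[FastaRecord], width: int = 60) -> str:
    """Serialize records back to FASTA text with wrapped sequence lines."""
    lines: List[str] = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


# Hypothetical two-entry database, for illustration only.
example = ">IP_000001 hypothetical AltProt\nMDNLWAK\nPEPTIDE\n>REF_1 hypothetical RefProt\nMKR\n"
combined = add_reversed_decoys(parse_fasta(example))
print(to_fasta(combined))
```

Reversed decoys keep the amino acid composition and length distribution of the target database, which is why they are commonly used to estimate the FDR in the later search steps.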
4. Mass spectrometry file preparation
NOTE: Most of the proteomics tools available on Galaxy instances use the mzML format, and peptide search engines prefer data in centroid mode.
5. Peptide and protein identification/quantification
NOTE: This part of the pipeline uses tools from the OpenMS suite, a versatile and easy-to-use framework18.
6. Quality control
NOTE: Because MS-based proteomics is the result of a complex process where each step needs to be optimized to produce reproducible results, quality control is a necessary procedure in the workflow33.
7. OpenProt database mining
NOTE: Once a confident identification of a novel protein predicted by OpenProt (accession numbers starting with IP_ for AltProts and II_ for novel Isoforms) has been made, more biological information can be gathered from the OpenProt website15.
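When screening a list of identified proteins for candidates to look up on the OpenProt website, the accession prefixes mentioned above can be used as a simple filter. The sketch below assumes only the two prefixes stated in this protocol (IP_ and II_); the function names and the example accessions other than IP_637643 are hypothetical, and anything without these prefixes is simply labeled as "other" (reference proteins, contaminants, or decoys, depending on the database used).

```python
from collections import Counter
from typing import Dict, Iterable


def classify_accession(accession: str) -> str:
    """Classify an accession using the OpenProt prefixes given in the protocol."""
    if accession.startswith("IP_"):
        return "AltProt"
    if accession.startswith("II_"):
        return "Novel Isoform"
    return "other"


def summarize(accessions: Iterable[str]) -> Dict[str, int]:
    """Count how many identifications fall into each class."""
    return dict(Counter(classify_accession(acc) for acc in accessions))


# IP_637643 is the AltProt discussed in the case study;
# the other accessions are made up for illustration.
hits = ["IP_637643", "II_000001", "REF_000001", "IP_000002"]
novel = [acc for acc in hits if classify_accession(acc) != "other"]
print(novel)
print(summarize(hits))
```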
The workflow described above was applied to an MS dataset available on the PRIDE repository38,39. The original study developed a method (iMixPro), using stable isotope labeling of amino acids in cell culture (SILAC), to eliminate false positives from affinity-purification MS (AP-MS) experiments38. In brief, an AP-MS experiment consists of using bead-bound antibodies to capture a protein of interest (bait) and its interactors (preys). The collected proteins are then digested and prepared for MS. The sample preparation method and the instrument settings are described in the original study and on the PRIDE repository (PXD004246). A challenge in such experiments is the abundance of false positives, notably from proteins binding to the beads but not the bait. The iMixPro design uses SILAC to generate different isotope ratios between true preys and false positives: 3 control samples (no bait) cultured in light medium, 1 sample expressing the bait cultured in light medium, and 1 sample expressing the bait cultured in heavy medium are processed with the beads and then analyzed by mass spectrometry. With such a design38, non-specific proteins binding to the beads will have a heavy-to-light ratio of 1:4, whereas true preys will have a ratio of 1:1.
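The expected ratios follow directly from counting which samples contribute each protein. The short Python sketch below reproduces this arithmetic under the simplifying assumption that every sample contributes an equal amount of a given protein; the `DESIGN` list and the function name are illustrative and not part of the original study.

```python
from fractions import Fraction
from typing import List, Tuple

# Each sample is (isotope label, expresses the bait?), as described in the text:
# 3 light controls without bait, 1 light sample with bait, 1 heavy sample with bait.
DESIGN: List[Tuple[str, bool]] = [
    ("light", False),
    ("light", False),
    ("light", False),
    ("light", True),
    ("heavy", True),
]


def expected_heavy_to_light(design: List[Tuple[str, bool]],
                            binds_beads: bool) -> Fraction:
    """Expected heavy-to-light ratio for one protein.

    binds_beads=True  : bead contaminant, recovered from every sample.
    binds_beads=False : true prey, recovered only when the bait is present.
    Assumes equal protein amounts per contributing sample.
    """
    heavy = sum(1 for label, bait in design
                if label == "heavy" and (binds_beads or bait))
    light = sum(1 for label, bait in design
                if label == "light" and (binds_beads or bait))
    return Fraction(heavy, light)


print(expected_heavy_to_light(DESIGN, binds_beads=True))   # bead contaminant
print(expected_heavy_to_light(DESIGN, binds_beads=False))  # true prey
```

A measured ratio close to 1, such as the one reported for the novel RAF1 interactor in the case study, is therefore consistent with a true prey rather than a bead contaminant.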
We re-analyzed their AP-MS data using the OpenProt database; the baits included three endogenous proteins (PTPN14, JIP3 and IQGAP1), and two over-expressed proteins (RAF1 and RNF41). Since the experiments used SILAC, the Galaxy workflow for protein quantification was used (Supplementary Material S3, Figure 2). The workflow was run using the whole OpenProt database (OpenProt_all) or a restricted OpenProt database (OpenProt_2pep, including only proteins previously detected with a minimum of two unique peptides).
Protein identification and quantification were reproducible across the databases used. As shown in Figure 3, most proteins identified in the original paper were also identified using either the OpenProt_2pep or OpenProt_all database (a detailed list is available in Supplementary Material S5). This result shows that the pipeline described here and the OpenProt databases are able to produce protein identification and quantification comparable to that of current procedures based on the UniProtKB databases40. However, the use of OpenProt databases has the unique advantage of allowing detection of novel and previously undetectable proteins, as demonstrated in this case study.
Eleven well-supported proteins (1 Isoform and 10 AltProts) that are not currently annotated in reference databases were identified across all datasets, with confident peptides, using the OpenProt_2pep database (all protein accessions, along with the number of supporting peptides, are available in Supplementary Material S5). This database allows the use of a traditional 1% FDR as the search space increase remains moderate. These 11 proteins were not identified in the original study as they were absent from the database.
Twenty-nine novel proteins (16 isoforms and 13 AltProts) were discovered across all datasets, with confident peptides, using the OpenProt_all database (all protein accessions, along with the number of supporting peptides, are available in Supplementary Material S6). As shown in Figure 3, the recommended stringent FDR did not affect the most confident protein identifications, although it did decrease the total number of identified proteins. Compared with the OpenProt_2pep database, a higher number of novel proteins can be confidently identified. All of these novel proteins are absent from the OpenProt_2pep database. This highlights the crucial role of the chosen database for MS-based proteomics.
One novel protein was discovered as an interactor of the RAF1 protein (IP_637643). Using the OpenProt website, one can see this protein had not been detected by MS or ribosome profiling until now (OpenProt v1.3). The protein is 46 amino acids long and can only give two unique peptides upon tryptic digestion. The peptide detected in the RAF1 AP-MS dataset (fraction 18) had a good quality spectrum, as shown in Figure 4, and displayed a heavy-to-light ratio of 1.09. The protein is encoded in the NANOGNBP1 gene, which is a pseudogene of NANOGNB. The transcript (ENST00000448444), currently annotated as non-coding, was detected across several tissues according to the GTEx portal40. The protein contains a predicted functional domain associated with DNA binding (Gene Ontology GO:0003677)41.
Figure 1: Database choice for proteomics analyses chart. Analyses of MS data, notably the database choice, depend on the research objectives. Three common objectives are outlined in blue (classic proteomic pipeline), green (exhaustive proteomic search) and orange (proteomic discovery). Each objective depends on an appropriate database and pipeline. A single identification tool may be used for the exhaustive and classic proteomic pipelines. For the proteomic discovery pipeline, we strongly recommend using multiple identification engines. Recommended FDRs are indicated in red, and protein database sizes are indicated in grey boxes.
Figure 2: Graphical representation of the Galaxy workflow used. Step-by-step representation of the proteomic analysis workflow used for re-analysis of Eyckerman et al. data38. Input files, peptide search, and protein quantification are indicated by orange boxes. Blue boxes correspond to the tools used and grey boxes correspond to the output files generated. The different search engines (MS-GF+ and X!Tandem) are indicated by different colors (respectively red and purple) as well as the arrows indicating their necessary inputs and outputs. The green box highlights the tool generating a list of protein identifications. When multiple outputs are generated, the one used for downstream steps is indicated as the closest to the arrow. This workflow is freely available in Supplementary Material S2. The X!Tandem default parameters configuration file is available in Supplementary Material S4.
Figure 3: Comparison of interactor identification per bait using different databases. Venn diagrams of protein identifications using the most confident OpenProt database (in orange, supporting evidence of minimum 2 unique peptides, OpenProt_2pep) with a 1% FDR, or the whole OpenProt database (in blue, OpenProt_all) with a 0.001% FDR, or as reported in the original paper (in grey)38. Each diagram corresponds to identified interactors for the mentioned bait: RAF1, RNF41, PTPN14, JIP3 and IQGAP1.
Figure 4: MS/MS spectrum of identified MDNLWAK(13C6) peptide from novel protein IP_637643. Intensity is relative (0 to 100%). Selected peaks are indicated in red, y ions annotations are in dark red and b ions annotations in green. Extracted from the TOPPview software34. Precursor Error = 2.70 ppm, PEP score = 0.12.
Term | Definition | Reference |
Alternative ORF (AltORF) | non-canonical ORF currently not annotated in genome annotations, but annotated in OpenProt. | 15 |
Reference ORF (RefORF) | canonical ORF annotated in genome annotations and OpenProt. | 15 |
Alternative protein (AltProt) | novel protein coded by an AltORF, with no significant similarity with a RefProt. Accession prefix: IP_. | 15 |
Reference protein (RefProt) | protein currently annotated in protein sequence databases such as UniProtKB, Ensembl or NCBI RefSeq, and also in OpenProt. | 15 |
Novel Isoform | novel protein coded by an AltORF, with a significant similarity with a RefProt. Accession prefix: II_. | 15 |
OpenProt_2pep database | contains the sequence of all RefProts and novel proteins predicted by OpenProt, already detected with a minimum of 2 unique peptides. | 15 |
OpenProt_1pep database | contains the sequence of all RefProts and novel proteins predicted by OpenProt, already detected with a minimum of 1 unique peptide. | 15 |
OpenProt_all database | contains the sequence of all RefProts and novel proteins predicted by OpenProt. | 15 |
Table 1: Definition of terms used in OpenProt and throughout the protocol
Supplementary Material S1: Galaxy workflow for database handling. This will append the CRAPome and decoy sequences (reverse) to the input database. Output is a Fasta file.
Supplementary Material S2: Galaxy workflow for protein identification. This will identify proteins from a mass spectrometry data file using two search engines (MS-GF+ and X!Tandem). Each parameter can be tuned as desired before running the workflow.
Supplementary Material S3: Galaxy workflow for protein quantification using stable isotope labeling (SIL). This will identify and quantify proteins from a mass spectrometry data file using two search engines (MS-GF+ and X!Tandem). Each parameter can be tuned as desired before running the workflow.
Supplementary Material S4: X!Tandem default parameters configuration file. This XML file is necessary for running the X!TandemAdapter tool on the Galaxy platform.
Supplementary Material S5: Quantified proteins from iMixPro datasets. Data files from Eyckerman et al. 201638 were processed using OpenProt databases and quantified proteins are listed for each condition. Baits are PTPN14, JIP3, IQGAP1, RAF1 and RNF41. Gene names indicated in green correspond to proteins also identified in the original paper38. Gene names indicated in orange correspond to known interactors according to BioGrid that were not reported in the original paper. Gene names indicated in light blue correspond to novel proteins identified as interactors (the corresponding protein accession number is indicated in brackets). Gene names indicated in light grey and italics correspond to likely contaminants (keratin proteins).
Supplementary Material S6: Identified novel proteins from iMixPro datasets. Data files from Eyckerman et al. 201638 were processed using OpenProt databases and novel identified proteins are listed for each condition. Baits are PTPN14, JIP3, IQGAP1, RAF1 and RNF41. Protein accession numbers are listed, starting with II_ for novel isoforms of a known protein, and with IP_ for novel proteins from an alternative ORF (AltProt). The number of supporting peptides is indicated in brackets.
When analyzing data from mass spectrometers, the quality of protein identification partly relies on the accuracy of the database used6,20. Current approaches traditionally use UniProtKB databases, yet these support the genome annotation model of a single ORF per transcript and a minimum length of 100 codons (with the exception of previously demonstrated examples)40. Multiple studies have exposed the shortcomings of such databases by discovering functional ORFs in allegedly non-coding regions8,11,12,13. OpenProt now allows for more exhaustive protein identification as it draws protein sequences from multiple transcriptome annotations. OpenProt retrieves NCBI RefSeq (GRCh38.p7) and Ensembl (GRCh38.83) transcriptomes and UniProtKB annotations (UniProtKB-SwissProt, 2017-09-27)40,42,43. As current annotations present little overlap, OpenProt displays a more exhaustive view of the potential proteomic landscape than any single annotation15.
Furthermore, as OpenProt enforces a polycistronic model, it allows for multiple protein annotations per transcript. For statistical and computational reasons, OpenProt still holds a minimum length threshold of 30 codons15. Yet, it predicts thousands of novel protein sequences, thereby widening the scope of possibilities for protein identification. With this approach, OpenProt supports proteomic discoveries in a more systematic manner.
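To make the polycistronic model concrete, the sketch below scans the three forward reading frames of a single transcript and reports every ORF that reaches a minimum length, so that one transcript can yield several ORFs. This is a simplified illustration and not the OpenProt prediction pipeline: it assumes ATG start codons, the three standard stop codons, the first in-frame ATG after the previous stop as the start, and a length counted in codons excluding the stop codon. The function name and the synthetic transcript are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(transcript: str, min_codons: int = 30) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) for every ATG-initiated ORF in a transcript.

    start/end are 0-based, end-exclusive, and include the stop codon.
    ORFs shorter than min_codons (stop codon excluded) are discarded,
    as are ORFs lacking an in-frame stop codon.
    """
    seq = transcript.upper().replace("U", "T")
    orfs: List[Tuple[int, int, int]] = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3  # stop codon not counted
                if n_codons >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return sorted(orfs)


# Synthetic transcript carrying two ORFs in different frames.
orf_a = "ATG" + "GCT" * 29 + "TAA"   # 30 codons + stop, frame 0
orf_b = "ATG" + "GCC" * 30 + "TGA"   # 31 codons + stop
transcript = orf_a + "C" + orf_b      # the extra base shifts orf_b to frame 1
print(find_orfs(transcript))
```

Under a one-ORF-per-transcript model only one of these two ORFs would be annotated, whereas both are retained here.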
The quality of protein identification can also be affected by the parameters that are used. MS-based proteomics analyses typically hold a 1% protein FDR. However, the whole OpenProt database contains about 6 times more entries (Figure 1). To account for this substantial increase in the search space, we recommend using a more stringent FDR of 0.001%. This parameter was optimized using benchmark studies and manual evaluation of randomly selected spectra15. False positives are still a possibility, though, and we encourage thorough inspection and validation of supporting evidence for a novel protein. A recommended standard could be the identification of a protein from two different MS runs, as background data and false positives vary between datasets15.
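The effect of a more stringent FDR can be illustrated with a simple target-decoy estimate, in which the FDR at a given score cut-off is approximated by the number of decoy hits divided by the number of target hits above that cut-off. The Python sketch below is a generic, simplified version of this idea and is not the implementation used by the OpenMS tools in the workflow; the function name and the toy scores are assumptions for illustration.

```python
from typing import Iterable, Tuple


def accepted_targets(hits: Iterable[Tuple[float, bool]], max_fdr: float) -> int:
    """Number of target hits kept at the most permissive cut-off
    whose estimated FDR (decoys / targets) does not exceed max_fdr.

    hits: (score, is_decoy) pairs; a higher score means a better match.
    """
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    targets = 0
    decoys = 0
    best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best = targets
    return best


# Toy example: five target hits and one decoy hit (made-up scores).
toy_hits = [(10.0, False), (9.0, False), (8.0, False),
            (7.0, True), (6.0, False), (5.0, False)]

print(accepted_targets(toy_hits, max_fdr=0.25))   # permissive threshold
print(accepted_targets(toy_hits, max_fdr=0.01))   # 1% FDR
print(accepted_targets(toy_hits, max_fdr=1e-5))   # 0.001% FDR
```

With this estimate, a 0.001% FDR tolerates at most one decoy hit per 100,000 accepted target hits, which illustrates why stringent settings retain the highest-scoring identifications but reduce the total number of accepted proteins.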
The pipeline provided here and used for the case study can be modified as needed to fit the experimental design and parameters. We recommend using multiple search engines as it increases the sensitivity of peptide identification32. Furthermore, we encourage using the database that best corresponds to the experimental aim (Figure 1). As using the whole OpenProt database comes with a stringent FDR, true identifications may be lost. Thus, the whole database is intended for the discovery of novel proteins, whereas classical proteomic profiling should use the smaller OpenProt databases (such as OpenProt_2pep used in the case study above).
OpenProt currently predicts sequences starting with an ATG codon, whereas several studies highlighted translation initiation at other codons44,45. When a novel protein is identified by one or several unique peptides, it is possible the true initiation codon is not the presumed ATG. Users can look for translation evidence on the OpenProt website. Currently, OpenProt only reports translation events if they concern the entire predicted protein sequence (100% overlap)15. Thus, absence of translation evidence would not mean the protein is not translated, but that the start codon may not be the alleged ATG.
Despite its current limitations, OpenProt offers a more exhaustive view of the coding potential of eukaryotic genomes. OpenProt databases foster proteomic discoveries and the understanding of protein functions and interactions. Future developments of the OpenProt database will include annotation of other species, translation evidence from non-ATG start codons and development of a pipeline to include novel proteins in whole genome and exome sequencing studies.
The authors have nothing to disclose.
We thank Vivian Delcourt for his help, discussions and advice on this work. X.R. is a member of the Fonds de Recherche du Québec Santé (FRQS)-supported Centre de Recherche du Centre Hospitalier Universitaire de Sherbrooke. This research was supported by a Canada Research Chair in Functional Proteomics and Discovery of Novel Proteins to X.R. and CIHR grant MOP-137056. We thank the team at Calcul Québec and Compute Canada for their support with the use of the supercomputer mp2 from Université de Sherbrooke. Operation of the mp2 supercomputer is funded by the Canada Foundation of Innovation (CFI), le ministère de l'Économie, de la science et de l'innovation du Québec (MESI) and les Fonds de Recherche du Québec – Nature et technologies (FRQ-NT). The Galaxy server that was used for some proteomics calculations is in part funded by Collaborative Research Centre 992 Medical Epigenetics (DFG grant SFB 992/1 2012) and German Federal Ministry of Education and Research (BMBF grants 031 A538A/A538C RBC, 031L0101B/031L0101C de.NBI-epi, 031L0106 de.STAIR (de.NBI)).
OpenProt website | open source | n/a | www.openprot.org |
Galaxy Server | open source | n/a | https://usegalaxy.eu/ |
TOPPview software | open source | n/a | www.openms.de |