We present a systems biology tool JUMPn to perform and visualize network analysis for quantitative proteomics data, with a detailed protocol including data pre-processing, co-expression clustering, pathway enrichment, and protein-protein interaction network analysis.
With recent advances in mass spectrometry-based proteomics technologies, deep profiling of hundreds of proteomes has become increasingly feasible. However, deriving biological insights from such valuable datasets is challenging. Here we introduce a systems biology-based software JUMPn, and its associated protocol to organize the proteome into protein co-expression clusters across samples and protein-protein interaction (PPI) networks connected by modules (e.g., protein complexes). Using the R/Shiny platform, the JUMPn software streamlines the analysis of co-expression clustering, pathway enrichment, and PPI module detection, with integrated data visualization and a user-friendly interface. The main steps of the protocol include installation of the JUMPn software, the definition of differentially expressed proteins or the (dys)regulated proteome, determination of meaningful co-expression clusters and PPI modules, and result visualization. While the protocol is demonstrated using an isobaric labeling-based proteome profile, JUMPn is generally applicable to a wide range of quantitative datasets (e.g., label-free proteomics). The JUMPn software and protocol thus provide a powerful tool to facilitate biological interpretation in quantitative proteomics.
Mass spectrometry-based shotgun proteomics has become the key approach for analyzing proteome diversity of complex samples1. With recent advances in mass spectrometry instrumentation2,3, chromatography4,5, ion mobility detection6, acquisition methods (data-independent7 and data-dependent acquisition8), quantification approaches (multi-plex isobaric peptide labeling method, e.g., TMT9,10, and label-free quantification11,12) and data analysis strategies/software development13,14,15,16,17,18, quantification of the whole proteome (e.g., over 10,000 proteins) is now routine19,20,21. However, how to gain mechanistic insights from such deep quantitative datasets is still challenging22. Initial attempts for investigating these datasets relied predominantly upon the annotation of individual elements of the data, treating each component (protein) independently. However, biological systems and their behavior cannot be solely explained by examining individual components23. Therefore, a systems approach that places the quantified biomolecules in the context of interaction networks is essential for the understanding of complex systems and the associated processes such as embryogenesis, immune response, and pathogenesis of human diseases24.
Network-based systems biology has emerged as a powerful paradigm for analyzing large-scale quantitative proteomics data25,26,27,28,29,30,31,32,33. Conceptually, complex systems such as mammalian cells could be modeled as a hierarchical network34,35, in which the whole system is represented in tiers: first by a number of large components, each of which is then iteratively modeled by smaller subsystems. Technically, the structure of proteome dynamics can be represented by inter-connected networks of co-expressed protein clusters (because co-expressed genes/proteins often share similar biological functions or mechanisms of regulation36) and physically interacting PPI modules37. As a recent example25, we generated temporal profiles of the whole proteome and phosphoproteome during T cell activation and used integrative co-expression networks with PPIs to identify functional modules that mediate T cell quiescence exit. Multiple bioenergetic-related modules were highlighted and experimentally validated (e.g., the mitoribosome and complex IV modules25, and the one-carbon module38). In another example26, we further extended our approach to study the pathogenesis of Alzheimer's disease, and successfully prioritized disease progression-associated protein modules and molecules. Importantly, many of our unbiased discoveries were validated by independent patient cohorts26,29 and/or disease mouse models26. These examples illustrate the power of the systems biology approach for dissecting molecular mechanisms with quantitative proteomics and other omics integrations.
Here we introduce JUMPn, a streamlined software tool that explores quantitative proteomics data using network-based systems biology approaches. JUMPn serves as the downstream component of the established JUMP proteomics software suite13,14,39, and aims to fill the gap from individual protein quantifications to biologically meaningful pathways and protein modules using the systems biology approach. By taking the quantification matrix of differentially expressed (or the most variable) proteins as input, JUMPn aims to organize the proteome into a tiered hierarchy of protein clusters co-expressed across samples and densely connected PPI modules (e.g., protein complexes), which are further annotated with public pathway databases by over-representation (or enrichment) analysis (Figure 1). JUMPn is developed with the R/Shiny platform40 for a user-friendly interface and integrates three major functional modules: co-expression clustering analysis, pathway enrichment analysis, and PPI network analysis (Figure 1). After each analysis, results are automatically visualized, are adjustable via the R/Shiny widget functions, and are readily downloadable as publication tables in Microsoft Excel format. In the following protocol, we use quantitative whole proteome data as an example and describe the major steps of using JUMPn, including installation of the JUMPn software, the definition of differentially expressed proteins or the (dys)regulated proteome, co-expression network analysis, PPI module analysis, result visualization and interpretation, and troubleshooting. JUMPn software is freely available on GitHub41.
NOTE: In this protocol, the usage of JUMPn is illustrated by utilizing a published dataset of whole proteome profiling during B cell differentiation quantified by TMT isobaric label reagent27.
1. Setup of JUMPn software
NOTE: Two options are provided for setting up the JUMPn software: (i) installation on a local computer for personal use; and (ii) deployment of JUMPn on a remote Shiny Server for multiple users. For local installation, a personal computer with Internet access and ≥4 GB of RAM is sufficient to run JUMPn analysis for a dataset with a small sample size (n < 30); more RAM (e.g., 16 GB) is needed for large-cohort analysis (e.g., n = 200 samples).
2. Demo run using an example dataset
NOTE: JUMPn offers a demo run using the published B cell proteomics dataset. The demo run illustrates a streamlined workflow that takes the quantification matrix of differentially expressed proteins as input and performs co-expression clustering, pathway enrichment, and PPI network analysis sequentially.
3. Preparation of the input file and upload to JUMPn
NOTE: JUMPn takes as input the quantification matrix of either the differentially expressed proteins (supervised method) or the most variable proteins (unsupervised method). If the goal of the project is to understand proteins changed across multiple conditions (e.g., different disease groups, or time-series analysis of a biological process), the supervised method of performing DE analysis is preferred; otherwise, an unsupervised approach of selecting the most variable proteins may be used for exploratory purposes.
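As a rough illustration of the unsupervised option, the sketch below ranks proteins by the variance of their quantification values across samples and keeps the top fraction. The function name `top_variable_proteins`, the dictionary layout, and the 50% fraction are hypothetical choices made for this example and are not part of JUMPn; the values are assumed to be log-transformed with no missing entries.

```python
from statistics import pvariance


def top_variable_proteins(matrix, fraction=0.5):
    """Return the accessions of the most variable proteins.

    matrix: dict mapping protein accession -> list of quantification values
            (one value per sample; assumed log-transformed, no missing values).
    fraction: share of proteins to keep (hypothetical default).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Rank proteins by variance across samples, most variable first.
    ranked = sorted(matrix, key=lambda acc: pvariance(matrix[acc]), reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


# Toy quantification matrix (made-up accessions and values).
example = {
    "P1": [10.0, 10.1, 9.9, 10.0],   # nearly constant
    "P2": [8.0, 9.0, 10.0, 11.0],    # strongly changing
    "P3": [7.0, 7.5, 7.0, 7.5],      # mildly changing
    "P4": [12.0, 9.0, 12.0, 9.0],    # strongly changing
}
selected = top_variable_proteins(example, fraction=0.5)
print(selected)
```

The selected accessions would then be used to subset the quantification matrix before upload; for the supervised route, the ranking key would instead come from a DE statistic.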
4. Co-expression clustering analysis
NOTE: Our group25,26,27 and others28,29,31 have shown WGCNA49 to be an effective method for co-expression clustering analysis of quantitative proteomics. JUMPn follows a 3-step procedure for WGCNA analysis25,50: (i) initial definition of co-expression gene/protein clusters by dynamic tree cutting51 based on the topological overlap matrix (TOM; determined by quantification similarities among genes/proteins); (ii) merging of similar clusters to reduce redundancy (based on the dendrogram of eigengene similarities); and (iii) final assignment of genes/proteins that exceed the minimal Pearson correlation cutoff to each cluster.
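Step (iii) can be pictured with the following sketch, which assigns each protein to the cluster whose representative profile it correlates with best, provided the Pearson correlation exceeds a cutoff. This only approximates the idea: the cluster profiles are plain lists standing in for eigengenes, and the names `assign_to_clusters` and `min_r` as well as the 0.7 cutoff are hypothetical rather than actual JUMPn or WGCNA parameters.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sx * sy)


def assign_to_clusters(proteins, profiles, min_r=0.7):
    """Map each protein to its best-correlated cluster, or None if no r exceeds min_r."""
    assignment = {}
    for acc, values in proteins.items():
        best_name, best_r = None, min_r
        for name, profile in profiles.items():
            r = pearson(values, profile)
            if r > best_r:
                best_name, best_r = name, r
        assignment[acc] = best_name
    return assignment


# Toy cluster profiles and protein abundances (made-up values).
profiles = {"Cluster1": [1, 2, 3, 4], "Cluster2": [4, 3, 2, 1]}
proteins = {
    "P1": [10, 11, 12, 13],   # rises with Cluster1
    "P2": [9, 7, 5, 3],       # falls with Cluster2
    "P3": [5, 6, 5, 6],       # weakly correlated with both, stays unassigned
}
result = assign_to_clusters(proteins, profiles)
print(result)
```

Proteins mapped to `None` correspond to the unassigned proteins; lowering the cutoff would assign more of them at the cost of looser clusters.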
5. Protein-protein interaction network analysis
NOTE: By superimposing co-expression clusters onto the PPI network, each co-expression cluster is further stratified into smaller PPI modules. The analysis is performed for each co-expression cluster and includes two stages: in the first stage, JUMPn superimposes proteins from the co-expression cluster onto the PPI network and finds all connected components (i.e., multiple clusters of connected nodes/proteins; as an example, see Figure 6A); in the second stage, communities or modules (of densely connected nodes) are detected for each connected component iteratively using the topological overlap matrix (TOM) method52.
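The first stage can be sketched as a breadth-first search over the PPI edges restricted to the proteins of one co-expression cluster. The edge list, the protein names, and the function `connected_components` are made up for illustration and do not reflect JUMPn's internal code; the second stage (TOM-based module detection) is not shown.

```python
from collections import defaultdict, deque


def connected_components(cluster_proteins, ppi_edges):
    """Return connected components (sorted lists) of the cluster-induced PPI subnetwork."""
    members = set(cluster_proteins)
    # Keep only interactions whose two partners both belong to the cluster.
    neighbors = defaultdict(set)
    for a, b in ppi_edges:
        if a in members and b in members and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, components = set(), []
    for start in sorted(neighbors):  # proteins without any edge are skipped
        if start in seen:
            continue
        queue, component = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(component))
    return components


# Toy PPI edges: "X" is outside the cluster and ("F", "F") is a self-interaction.
edges = [("A", "B"), ("B", "C"), ("D", "E"), ("E", "X"), ("F", "F")]
cluster = ["A", "B", "C", "D", "E", "F"]
components = connected_components(cluster, edges)
print(components)
```

In this sketch, proteins with no interaction partner inside the cluster do not form a component, which is one way a positive control protein can drop out of the PPI modules.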
6. Pathway enrichment analysis
NOTE: The JUMPn-derived hierarchical structures of both co-expression clusters and PPI modules within are automatically annotated with over-represented pathways using Fisher's exact test. The pathway/topology databases used include Gene Ontology (GO), KEGG, Hallmark, and Reactome. Users may use advanced options to upload customized databases for the analysis (e.g., in the case of analyzing data from non-human species).
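For readers unfamiliar with over-representation testing, the sketch below computes a one-sided Fisher's exact test P-value as the upper tail of the hypergeometric distribution. It is a minimal illustration using only the Python standard library; whether JUMPn uses this exact one-sided formulation and how it defines the background gene set are assumptions here, and the function name and toy counts are hypothetical.

```python
from math import comb


def enrichment_p_value(overlap, cluster_size, pathway_size, background_size):
    """One-sided Fisher's exact test: P(X >= overlap) under the hypergeometric null.

    overlap:         genes shared by the cluster and the pathway
    cluster_size:    genes in the co-expression cluster (or PPI module)
    pathway_size:    pathway genes present in the background
    background_size: all genes considered (assumed definition)
    """
    max_overlap = min(cluster_size, pathway_size)
    total = comb(background_size, cluster_size)
    # math.comb returns 0 when k > n, so impossible tables contribute nothing.
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, cluster_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy example: 3 of 5 cluster genes fall in a 5-gene pathway, background of 20 genes.
p = enrichment_p_value(overlap=3, cluster_size=5, pathway_size=5, background_size=20)
print(f"P = {p:.4f}")
```

The same function applies unchanged to a customized database, since only the gene set sizes and the overlap enter the calculation.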
7. Analysis of a dataset with a large sample size
NOTE: JUMPn supports the analysis of datasets with a large sample size (up to 200 samples tested). To facilitate the display of co-expression clustering results for a large sample size, an additional file (named "meta file") that specifies the sample group is needed.
We used our published deep proteomics datasets25,26,27,30 (Figure 5 and Figure 6) as well as data simulations57 (Table 1) to optimize and evaluate JUMPn performance. For co-expression protein clustering analysis via WGCNA, we recommend utilizing proteins significantly changed across samples as the input (e.g., differentially expressed (DE) proteins detected by statistical analysis). While including non-DE proteins for the analysis may result in more co-expression clusters returned by the program (due to larger input size), we hypothesize that mixing the real signal (e.g., the DE proteins) with the background (the remaining non-DE proteins) for systems-level analysis may dilute the signal and mask the underlying network structure. To test this, simulation analysis was performed under two different conditions: (i) highly dynamic proteome (e.g., 50% altered in T cell activation25) and (ii) relatively stable proteome (e.g., 2% of the proteome changed in AD26). For the highly dynamic proteome, six co-expression clusters were simulated from 50% of the proteome following the same cluster size and expression patterns (i.e., eigengenes) of our published results25. Similarly, for a relatively stable proteome, we simulated three clusters from 2% of the proteome following our recent AD proteomics study26. As expected, increasing the input number of proteins increases the number of detected clusters (Table 1). For the highly dynamic proteome, using all proteins as input can capture most of the true clusters (5 out of the 6 simulated bona fide clusters; 83% recall) with 63% precision (5 out of the 8 returned clusters are true positives; i.e., the remaining 3 clusters are false positives). However, for the relatively stable proteome, increasing the input size with non-DE proteins dramatically reduces precision (Table 1). For instance, using the whole proteome as input, 169 modules are detected, of which only 2 are correct (1.2% precision; the remaining 98.8% of detected modules are false positives).
These results thus indicate that choosing only the changed proteome as input will increase the precision of co-expression analysis, especially for a relatively stable proteome.
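The precision and recall values quoted above follow directly from the module counts in Table 1, using the definitions given in its footnotes. The short sketch below reproduces the arithmetic for the two whole-proteome cases; the function name `precision_recall` is only an illustrative helper.

```python
def precision_recall(n_simulated, n_detected, n_recaptured):
    """Precision and recall as defined in the footnotes of Table 1."""
    if n_detected == 0 or n_simulated == 0:
        raise ValueError("counts of simulated and detected modules must be positive")
    precision = n_recaptured / n_detected   # recaptured / detected modules
    recall = n_recaptured / n_simulated     # recaptured / simulated modules
    return precision, recall


# Counts taken from Table 1 with the whole proteome (100% of proteins) as input.
dynamic = precision_recall(n_simulated=6, n_detected=8, n_recaptured=5)
stable = precision_recall(n_simulated=3, n_detected=169, n_recaptured=2)

for label, (precision, recall) in (("highly dynamic", dynamic), ("relatively stable", stable)):
    print(f"{label}: precision = {precision:.3f}, recall = {recall:.3f}")
```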
Following the detection of co-expression protein clusters, each cluster will be annotated by JUMPn using the pathway enrichment analysis (Figure 1). The current version includes four commonly used pathway databases: Gene Ontology (GO), KEGG, Hallmark, and Reactome. Users may also compile their own database in GMT format54, which can be uploaded into JUMPn. Integrating multiple databases for pathway enrichment analysis may provide more comprehensive views; however, the sizes of different pathway databases vary significantly, which may introduce unwanted bias toward certain (especially large) databases. Two solutions are provided within JUMPn. First, using a statistical approach, nominal p-values are adjusted (or penalized) for multiple-hypothesis testing by the Benjamini-Hochberg method58, with a larger database requiring a more significant nominal p-value than a small database to reach the same adjusted p-value. Second, JUMPn highlights the top significantly enriched pathways for each database separately, so that database-specific top enriched pathways are always displayed.
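To make the database size effect concrete, the sketch below implements the standard Benjamini-Hochberg step-up adjustment and applies it to the same nominal p-value (0.001) tested within a small and a large hypothetical database. The p-value lists are invented for illustration, and the code is a generic implementation rather than the one used inside JUMPn.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Same best nominal p-value, different numbers of tested pathways (made-up values).
small_db = [0.001] + [0.5] * 9      # 10 pathways tested
large_db = [0.001] + [0.5] * 999    # 1,000 pathways tested
adj_small = benjamini_hochberg(small_db)[0]
adj_large = benjamini_hochberg(large_db)[0]
print(f"adjusted p (10 pathways) = {adj_small:.3g}")
print(f"adjusted p (1,000 pathways) = {adj_large:.3g}")
```

In this toy case the pathway remains significant after adjustment in the small database but not in the large one, which is why reporting the top pathways per database is a useful complement.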
Similar to pathway enrichment analysis, a composite PPI network was compiled by combining STRING59,60, BioPlex61,62, and InWeb_IM63 databases. The BioPlex database was created using affinity purification followed by mass spectrometry in human cell lines, whereas STRING and InWeb contain information from various sources. Therefore, the STRING and InWeb databases were further filtered by the edge score to ensure high quality, with the cutoff determined by best fitting the scale-free criteria24. The final merged PPI network covers more than 20,000 human genes with ~1,100,000 edges (Table 2). This comprehensive interactome is included and published in a bundle with our JUMPn software for sensitive PPI analysis.
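The merging logic can be sketched as a union of undirected edges, with a score filter applied only to the sources that need it. Everything in the example is a placeholder: the edges, the scores, and the 0.7 cutoff are invented, and the scale-free fitting used to choose the real cutoffs is not shown.

```python
def build_composite_network(sources, score_cutoffs):
    """Union of undirected edges from several PPI sources.

    sources:       dict name -> list of (protein_a, protein_b, score) tuples
    score_cutoffs: dict name -> minimal edge score; sources without an entry
                   are not filtered (assumed behavior for this sketch)
    """
    composite = set()
    for name, edges in sources.items():
        cutoff = score_cutoffs.get(name)
        for a, b, score in edges:
            if a == b:
                continue  # ignore self-interactions
            if cutoff is not None and score < cutoff:
                continue  # drop low-confidence edges
            composite.add(tuple(sorted((a, b))))  # A-B and B-A are the same edge
    return composite


# Toy inputs; source names are used only as labels.
sources = {
    "BioPlex": [("A", "B", 1.0), ("B", "C", 1.0)],
    "STRING": [("B", "A", 0.9), ("C", "D", 0.3), ("D", "E", 0.8)],
}
network = build_composite_network(sources, score_cutoffs={"STRING": 0.7})
nodes = {protein for edge in network for protein in edge}
print(f"{len(nodes)} nodes, {len(network)} edges")
```

Storing each edge as a sorted pair means that an interaction reported by two databases is counted once, which is why the composite edge count in Table 2 is smaller than the sum of the individual networks.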
After the analysis is finished, JUMPn generates the publication table spreadsheet file ComprehensiveSummaryTables.xlsx, consisting of three individual sheets. The first sheet contains results of co-expression protein clusters with one protein per row: the first column indicates the cluster membership of each input protein, and the remaining columns are copied from the user-input file, which contains the protein accession, gene names, protein description, and quantification of individual samples. The second sheet contains results of pathway enrichment analysis, displaying significant pathways enriched in each co-expression cluster. This table is first organized by different pathway databases, then sorted by co-expression clusters, functional pathways, the total number of pathway genes, the total number of genes in the individual cluster, the overlapped gene numbers and names, enrichment fold, Fisher's exact test-derived P-values, and Benjamini-Hochberg false discovery rate. The third sheet contains results of PPI module analysis with one PPI module per row; its columns include the module name (defined by its co-expression membership and module ID, for example, Cluster1_Module1), the mapped proteins and numbers, as well as functional pathways that are defined by searching the module proteins against the pathway databases.
Figure 1: Workflow of JUMPn. The quantification matrix of the top variable or differentially expressed (DE) proteins is taken as input, and proteins are grouped into co-expression clusters by the WGCNA algorithm. Each co-expression cluster is then annotated by pathway enrichment analysis and further superimposed onto the protein-protein interaction (PPI) network for the identification of densely connected protein modules.
Figure 2: JUMPn welcome page.
Figure 3: Input page of JUMPn. The page includes the input file upload panel and parameter configuration panels for co-expression clustering and PPI network analysis, respectively.
Figure 4: Example input file of quantification matrix. Columns include protein accession (or any unique IDs), GN (official gene symbols), protein description (or any user-provided information), followed by protein quantification of individual samples.
Figure 5: Co-expression cluster results reported by JUMPn. The co-expression clustering patterns (A), top enriched pathway heatmap across clusters (B), and detailed protein abundance for each cluster (C) are shown. Users may select various display options and navigate between different clusters via the selection box.
Figure 6: PPI network analysis results reported by JUMPn. The global inter-module network is shown (A), followed by a subnetwork of individual modules (B) and its significantly enriched pathways (C). Users may select various display options and navigate between different clusters and modules via the selection box.
% top proteins for analysis | # simulated modules | # detected modules | # recaptured modules (1) | precision (2) | recall (3) |
Highly dynamic proteome (e.g., during T cell activation): 6 simulated modules from 50% proteome
2 | 6 | 2 | 2 | 1 | 0.33 |
5 | 6 | 2 | 2 | 1 | 0.33 |
10 | 6 | 3 | 3 | 1 | 0.5 |
20 | 6 | 4 | 4 | 1 | 0.67 |
50 | 6 | 6 | 6 | 1 | 1 |
100 | 6 | 8 | 5 | 0.63 | 0.83 |
Relatively stable proteome (e.g., during pathogenesis of AD): 3 simulated modules from 2% proteome
1 | 3 | 1 | 1 | 1 | 0.33 |
2 | 3 | 3 | 3 | 1 | 1 |
5 | 3 | 8 | 3 | 0.38 | 1 |
10 | 3 | 13 | 3 | 0.23 | 1 |
20 | 3 | 19 | 3 | 0.16 | 1 |
50 | 3 | 71 | 2 | 0.03 | 0.67 |
100 | 3 | 169 | 2 | 0.01 | 0.67 |
(1) A recaptured module is a detected module whose eigengene highly correlates (Pearson R > 0.95) with one of the simulated eigengenes.
(2) precision = # recaptured modules / # detected modules
(3) recall = # recaptured modules / # simulated modules
Table 1: Simulation studies of co-expression cluster detection.
PPI networks | No. of Nodes | No. of Edges |
BioPlex 3.0 combined (293T+HCT116) | 14,551 | 167,399 |
InBio_Map_core_2016_09_12 | 17,429 | 608,166 |
STRING (v11.0) | 18,954 | 587,482 |
Composite PPI network | 20,485 | 1,152,607 |
Table 2: Statistics of human protein-protein interaction (PPI) networks. PPI networks are filtered by edge score to ensure high quality, with the score cutoff determined by best fitting the scale-free criteria.
Here we introduced our JUMPn software and its protocol, which have been applied in multiple projects for dissecting molecular mechanisms using deep quantitative proteomics data25,26,27,30,64. The JUMPn software and protocol have been fully optimized, including consideration of DE proteins for co-expression network analysis, compilation of a comprehensive and high-quality PPI network, and stringent statistical analysis (e.g., by consideration of multiple hypothesis testing), with a streamlined and user-friendly interface. Multiple protein modules identified by JUMPn have been validated by functional experimental studies25,27 or independent patient cohorts26, exemplifying JUMPn as an effective tool for identifying key molecules and pathways underlying diverse biological processes.
Critical steps of this protocol include the generation of optimal results of co-expression clusters and PPI modules, which may require multiple iterations of parameter tuning, as well as the upload of a customized PPI network. In our protocol, we discussed common practical scenarios, including how to handle missing important clusters, a high percentage of unassigned proteins, the merging of two redundant clusters, and important proteins missing from PPI modules. We recommend that users prepare several positive control proteins and confirm their presence in the final co-expression clusters. Sometimes a positive control will never be included in the final PPI modules due to an incomplete PPI network database. To partially alleviate this, we have updated our PPI network with the latest versions of BioPlex (V3)62 and STRING (V11)60. In addition, JUMPn allows users to upload customized PPI networks. For example, novel interactions derived from affinity purification-mass spectrometry (AP-MS) experiments using an important positive control protein as bait may be integrated with the current composite PPI network for more customized analysis.
By using the framework of pathway enrichment analysis for each co-expression protein cluster, JUMPn can be extended for inferring transcription factor (TF) activity. The assumption is that if there exists an over-representation of target genes of a specific TF in a co-expression cluster (i.e., these targets are differentially expressed and follow the same expression pattern), the activity of that TF is potentially altered across experimental conditions because its target protein abundance is changed consistently. Technically, this can be simply achieved via JUMPn by replacing the current pathway database with the TF-target database (e.g., from the ENCODE project65). Similarly, kinase activity may also be inferred by leveraging the kinase-substrate database, taking deep phosphoproteomics as input. As an example, we successfully identified dysregulated TFs and kinases underlying brain tumor pathogenesis64. Indeed, using the network approach for activity inference has emerged as a powerful approach for identifying dysregulated drivers for human diseases66,67.
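As a sketch of how a TF-target resource could be swapped in, the example below parses GMT-formatted text (one set per line: name, description, then member genes, tab-separated) and reports which targets of each TF fall inside a co-expression cluster. It assumes the TF-target database has already been converted to GMT; the TF names, gene names, and helper functions are hypothetical, and the overlaps would still need to be scored with the enrichment statistics used for pathways.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into a dict: set name -> set of member genes."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def target_overlaps(cluster_genes, gene_sets):
    """Targets of each TF that are present in the co-expression cluster."""
    cluster = set(cluster_genes)
    return {tf: sorted(targets & cluster) for tf, targets in gene_sets.items()}


# Toy TF-target database in GMT layout (made-up TFs and genes).
gmt_text = "TF_A\tna\tG1\tG2\tG3\nTF_B\tna\tG4\tG5\n"
tf_targets = parse_gmt(gmt_text)
overlaps = target_overlaps(["G1", "G2", "G3", "G6"], tf_targets)
print(overlaps)
```

A kinase-substrate database could be handled in the same way, with phosphosite identifiers taking the place of gene symbols.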
The JUMPn software can be readily applied to a wide range of data types. Even though an isobaric labeling-quantified proteome was used as an illustrative example, the same protocol is also applicable to label-free quantified proteomics data, as well as genome-wide expression profiles (e.g., quantified by RNA-seq or microarray; see our recent example of applying JUMPn for both gene and protein expression profiles27). Phosphoproteomics data could also be taken by JUMPn to identify co-expressed phosphosites, followed by kinase activity inference25. In addition, interactome data generated by the AP-MS approach will also be appropriate: prey proteins that follow similar bait interaction strength and stoichiometry will form co-expression clusters, which can be further overlapped with known PPIs for data interpretation68.
Limitations exist for the current version of JUMPn. First, the installation procedure is command line-based and requires basic knowledge of computer science. This hinders wider usage of JUMPn, especially by biologists without a computational background. A more ideal implementation is to publish JUMPn on an online server. Second, the current databases are human-centric because of our focus on human disease studies. Note that proteomics data generated from mice have also been analyzed by JUMPn using such human-centric databases25,27, assuming that most PPIs are conserved across both species69,70. Mouse-specific signaling will not be captured by this approach but was not of interest in those human-focused studies. However, for non-mammalian model systems (e.g., zebrafish, fly, or yeast), species-specific databases should be prepared and uploaded to JUMPn using the advanced options. Resources for additional species may be provided in future JUMPn releases. Third, the current step of ontology/pathway analysis takes significant time, which could be reduced by parallel computing.
In conclusion, we present the JUMPn software and protocol for exploring quantitative proteomics data to identify and visualize co-expressed and potentially physically interacting protein modules using a systems biology approach. The key features that distinguish JUMPn from others53,71,72 include: (i) JUMPn integrates and streamlines four major components of the pathway and network analysis (Figure 1); (ii) unlike most pathway analysis software that takes a simple gene list as input, JUMPn starts from the quantification matrix, by which quantitative information can be seamlessly integrated with literature-documented pathways and networks; (iii) both co-expression protein clusters and interaction modules are automatically annotated by known pathways, and visualized via the R/Shiny interactive platform in a user-friendly web browser; (iv) final results are organized into three tables that are readily publishable in Excel format. Thus, we expect that JUMPn and this protocol will be widely applicable to many studies for dissecting mechanisms using quantitative proteomics data.
The authors have nothing to disclose.
Funding support was provided by the National Institutes of Health (NIH) (R01AG047928, R01AG053987, RF1AG064909, RF1AG068581, and U54NS110435) and ALSAC (American Lebanese Syrian Associated Charities). The MS analysis was carried out in St. Jude Children's Research Hospital's Center of Proteomics and Metabolomics, which was partially supported by NIH Cancer Center Support Grant (P30CA021765). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
MacBook Pro with a 2.3 GHz Quad-Core Processor running OS 10.15.7. | Apple Inc. | MacBook Pro 13'' | Hardware used for software development and testing |
Anaconda | Anaconda, Inc. | version 4.9.2 | https://docs.anaconda.com/anaconda/install/ |
miniconda | Anaconda, Inc. | version 4.9.2 | https://docs.conda.io/en/latest/miniconda.html |
RStudio | RStudio Public-benefit corporation | version 4.0.3 | https://www.rstudio.com/products/rstudio/download/ |
Shiny Server | RStudio Public-benefit corporation | https://shiny.rstudio.com/articles/shinyapps.html |