We present a flexible, extendible Jupyter-lab-based workflow for the unsupervised analysis of complex multi-omics datasets that combines different pre-processing steps, estimation of the multi-omics factor analysis model, and several downstream analyses.
Disease mechanisms are usually complex and governed by the interaction of several distinct molecular processes. Complex, multidimensional datasets are a valuable resource to generate more insights into those processes, but the analysis of such datasets can be challenging due to the high dimensionality resulting, for example, from different disease conditions, timepoints, and omics capturing the process at different resolutions.
Here, we showcase an approach to analyze and explore such a complex multiomics dataset in an unsupervised way by applying multi-omics factor analysis (MOFA) to a dataset generated from blood samples that capture the immune response in acute and chronic coronary syndromes. The dataset consists of several assays at differing resolutions, including sample-level cytokine data, plasma-proteomics and neutrophil prime-seq, and single-cell RNA-seq (scRNA-seq) data. Further complexity is added by having several different time points measured per patient and several patient subgroups.
The analysis workflow outlines how to integrate and analyze the data in several steps: (1) Data pre-processing and harmonization, (2) Estimation of the MOFA model, (3) Downstream analysis. Step 1 outlines how to process the features of the different data types, filter out low-quality features, and normalize them to harmonize their distributions for further analysis. Step 2 shows how to apply the MOFA model and explore the major sources of variance within the dataset across all omics and features. Step 3 presents several strategies for the downstream analysis of the captured patterns, linking them to the disease conditions and potential molecular processes governing those conditions.
Overall, we present a workflow for unsupervised data exploration of complex multi-omics datasets to enable the identification of major axes of variation composed of differing molecular features that can also be applied to other contexts and multi-omics datasets (including other assays as presented in the exemplary use case).
Disease mechanisms are usually complex and governed by the interaction of several distinct molecular processes. Deciphering the complex molecular mechanisms that lead to specific diseases or govern the evolution of a disease is a task with high medical relevance as it might reveal new insights for understanding and treating diseases.
Recent technological advances enable to measure those processes on a higher resolution (e.g., on the single-cell level) and at various biological layers (e.g., DNA, mRNA, chromatin accessibility, DNA methylation, proteomics) at the same time. This leads to the increasing generation of large multidimensional biological datasets, which can be jointly analyzed to generate more insights into the underlying processes. At the same time, combining and analyzing the different data sources in a biologically meaningful way remains a challenging task1.
Different technological limits, noises, and ranges of variability between different omics pose one challenge. For example, single-cell RNA-sequencing (scRNA-seq) data is very sparse and often influenced by large technical or batch effects. Additionally, the feature space is often very large, ranging across several thousand measured genes or proteins, while sample sizes are limited. This is further complicated by complex designs, which might include several disease conditions, confounding factors, time points, and resolutions. For example, in the presented use case, different data types were available either on the single-cell or sample (bulk) level. Besides that, the data may be incomplete, and not all measurements might be available for all analyzed subjects.
Due to these challenges, different omics and included features are still often analyzed only separately2 even though performing an integrated analysis cannot only provide a complete picture of the process but biological and technical noises from one omic might also be compensated by other omics3,4. Several different methods have been proposed to perform an integrated analysis of multi-omics data, including Bayesian methods, network-based methods5,6, multimodal deep learning7, and dimensionality reduction methods via matrix factorization8,9. For the latter, the results of a large benchmarking study10 have shown the MOFA9 (multi-omic factor analysis) method to be one of the better-suited tools when data should be linked to clinical annotations.
Especially in complex settings, unsupervised matrix factorization methods are a useful approach to reduce complexity and extract shared and complementary signals from different data sources and features. By decomposing the complex space into lower-rank latent representations, the major sources of variance within the data can be quickly explored and linked to known covariates. In case the same pattern of variation is shared across multiple features (e.g., genes or proteins), this may be aggregated to a few factors while noise is reduced. Regularization can be used to increase the sparsity of model coefficients, which makes the approach well-suited in settings where the feature space is large while the number of samples is limited9.
This protocol presents a flexible analysis workflow that uses the MOFA model to showcase how to quickly explore a complex multi-omics dataset and distill the main patterns of variation that characterize this dataset. The workflow consists of three main steps. In the first step, Data pre-processing and harmonization, different strategies for data preprocessing based on different input data types (scRNA-seq, proteomics, cytokine, clinical data) are presented. The protocol elaborates on how to process the features of the different input datasets, filter out low-quality features, and normalize them to harmonize their distributions. We also show how those pre-processing decisions might affect downstream results. In the second step, the MOFA model is applied to the data, and the resulting variance decomposition can be used to evaluate the integration of the different datasets. The third step shows how to link the captured factors to covariates and uncover the molecular programs that define those factors. With the presented workflow, we were able to extract several latent factors linked to clinical covariates in a dataset of patients suffering from coronary syndromes and identify potential underlying multicellular immune programs from a previous project11. We will use this dataset here, but the protocol can easily be applied to other contexts, including other omics.
The dataset consists of samples from patients with stable chronic coronary syndromes (CCS), acute coronary syndromes (ACS), and a control group with healthy coronaries (non-CCS) (Figure 1). ACS is caused by plaque rupture in preexisting CCS, leading to an acute disruption of blood flow to the myocardium and a subsequent ischaemic injury of the heart. This injury causes an inflammatory response by the immune system followed by a reparative phase, which lasts until several days after the acute event12. To be able to characterize this immune response for ACS patients, blood samples were taken at four different timepoints: acute (TP1); after recanalization (14 [± 8] h) (TP2); 60 [± 12] h later (TP3); before discharge (6.5 [±1.5] days) (TP4) (Figure 1A). For CCS and patients with healthy coronaries, only one time point was available- (TP0). For all patients and time points different assays based on the blood samples were measured: clinical markers of inflammation (Creatine-Kinase (CK), CK-MB, Troponin, C-reactive protein (CRP)), scRNA-seq of peripheral blood mononuclear cells (PBMCs), cytokine analysis, plasma proteomics and prime-seq13 data of neutrophils.
Figure 1: Myocardial infarction multi-omic input dataset. Input dataset: The analyzed data includes blood samples from patients (n = 62) with acute coronary syndrome (ACS), chronic coronary syndromes (CCS), and patients with healthy coronaries (non-CCS). For ACS patients blood samples were included at four different time points (TP1-4), for CCS and non-CCS patients at a single time point (TP0). Each patient and time point combination is treated as a separate sample in the analysis. Different omic assays were measured on the samples: clinical blood tests (n = 125), scRNA-seq (n = 121), plasma-proteomics (n = 119), cytokine assay (n = 127) and neutrophil prime-seq (n = 121). Subsequently, the described protocol was applied to integrate the data across all omics and explore it using the MOFA model and further downstream analysis (factor analysis, pathway enrichment). Please click here to view a larger version of this figure.
As input for the workflow as presented here, we take raw counts from scRNA-seq data after processing with cellranger and quality control (QC) as, for example, outlined in the scanpy14 preprocessing tutorial. For cell-type annotation, we used the automated Azimuth15 pipeline. The counts are then aggregated on the sample level for each cell-type by taking the mean across all cells for each sample and cell-type (pseudobulk aggregation). Plasma-proteomics is included as normalized and median-centered intensities, and for neutrophils, we take the umi unique molecular identifier (UMI) exon counts from the prime-seq. On cytokine and clinical values, no previous preprocessing has been applied. Further details on the (experimental) data generation are outlined in the corresponding manuscript11. As the results presented here are based on using the automated Azimuth annotation for cell-types in the scRNA-seq data compared to the marker based strategy that was used in the referenced publication the results presented here are similar but not exactly the same as presented in the publication. In the manuscript it could be shown that the cell-type annotation strategy does not change the main patterns and biological interpretations of the analysis but small changes in the exact values resulting from the model might vary. Overall the input data was a complex multi-dimensional dataset including different time points and measurement levels (single-cells vs. bulk) of more than 10,000 different features (genes, proteins, clinical values). A strict pre-processing and data harmonization strategy followed by MOFA analysis has been shown to be a useful and quick tool for exploring the data and extracting relevant immune program. Each time point and patient combination is treated as an independent sample in the MOFA analysis. Each data type and cell-type is considered a separate view in the MOFA analysis.
This protocol provides instructions for preparing the input data for the workflow, executing the different workflow steps, customizing configurations, interpreting the resulting figures, and iteratively adjusting the configurations based on the interpretations. An overview of the different steps of the protocol, the required input datasets at each step, and resulting figures and datasets is given by the technical workflow overview (Figure 2).
Figure 2: Technical workflow overview. Outline of the workflow for the analysis of the multi-omics dataset. The different elements are highlighted by different colors and symbols. Jupyter Notebooks belonging to the Data Preprocessing and Harmonization (1) step are colored in blue. Jupyter Notebooks belonging to the 'MOFA Model' (2) step are colored in orange. Jupyter Notebooks belonging to the 'Downstream Analysis' (3) step are colored in green. One Jupyter Notebook to be used for comparison of the results is colored in yellow. Configuration files where parameters for the execution of the workflow can be modified are highlighted in purple. Input datasets that are required to run the workflow are indicated by the dataset symbol and highlighted in grey. All figure outputs that are generated during the workflow execution are indicated by the magnifying glass symbol. Datasets generated during workflow execution are indicated as tables. In general, the workflow is executed sequentially: (1) Data Preprocessing and Harmonization consists of two steps: first generation of a pseudobulk table based on the scRNA-seq input data (01_Prepare_Pseudobulk) and subsequent integration and normalization of this data together with all the other sample-level (bulk) inputs (02_Integrate_and_Normalize_Data). Within this step via the configuration files, it is possible to configure for each dataset separately which of the indicated pre-processing and normalization steps (e.g., Sample Filter) should be applied. (2) 'MOFA Model': runs the MOFA model on the generated input of the first step with the configurations specified in the configuration file (03_MOFA_configs.csv) (3) 'Downstream Analysis': consists of three different notebooks which may be run independently of each other to generate insights into the generated MOFA results and associate them with sample meta-data (covariates) provided as input via the 'Sample Meta Data.csv' file. (4) 'Model Comparison': is a small separate step that may be used to compare different models generated in step 2. Please click here to view a larger version of this figure.
The workflow consists of several Jupyter Notebooks written in R and Python (knowledge of the R and Python language is not required to run the workflow but might be helpful in case errors appear). In various steps of the protocol, parameters are changed via configuration files ('.csv' files containing the postfix '_Configs' in the name). Within the protocol, we only outline the parameters that need to be changed starting from the default configuration.
Several other parameters may also be changed, for example to customize the pre-processing. A documentation of these parameters and explanations is given in the file 'Documentation_Config_Parameter', which is included in the downloaded repository.
With the outlined protocol, a modular and extendible Jupyter-notebook-based workflow that can be used to quickly explore a complex multi-omics dataset is presented. The major parts of the workflow consist of the pre-processing and data harmonization part (offering different standard steps for filtering and normalization of the data), estimation of the MOFA9 model and some exemplary downstream analysis. One of the main critical steps is to pre-process and integrate, and harmonize the different omic…
The authors have nothing to disclose.
C.L. is supported by the Helmholtz Association under the joint research school "Munich School for Data Science – MUDS".
Apptainer | NA | NA | https://apptainer.org/docs/admin/main/installation.html |
Compute server or workstation or cloud (Linux, Mac or Windows environment). Depending on the size of the different input datasets we recommend running the workflow on a suitable machine (in our setting we use: 16 CPU, 64GB Memory) |
Any manufacturer | 16 CPU, 64GB Memory | Large Memory is only required for the processing of the raw single cell data. After preprocessing the later analysis steps can also be performed on regular desktop or laptop computers |
git | NA | NA | https://git-scm.com/book/en/v2/Getting-Started-Installing-Git |
GitHub | GitHub | NA | https://github.com/heiniglab/mofa_workflow |
.