We present a flexible, extendible Jupyter-lab-based workflow for the unsupervised analysis of complex multi-omics datasets that combines different pre-processing steps, estimation of the multi-omics factor analysis model, and several downstream analyses.
Disease mechanisms are usually complex and governed by the interaction of several distinct molecular processes. Complex, multidimensional datasets are a valuable resource to generate more insights into those processes, but the analysis of such datasets can be challenging due to the high dimensionality resulting, for example, from different disease conditions, timepoints, and omics capturing the process at different resolutions.
Here, we showcase an approach to analyze and explore such a complex multi-omics dataset in an unsupervised way by applying multi-omics factor analysis (MOFA) to a dataset generated from blood samples that capture the immune response in acute and chronic coronary syndromes. The dataset consists of several assays at differing resolutions, including sample-level cytokine data, plasma proteomics, and neutrophil prime-seq, as well as single-cell RNA-seq (scRNA-seq) data. Further complexity is added by several measured time points per patient and several patient subgroups.
The analysis workflow outlines how to integrate and analyze the data in several steps: (1) Data pre-processing and harmonization, (2) Estimation of the MOFA model, (3) Downstream analysis. Step 1 outlines how to process the features of the different data types, filter out low-quality features, and normalize them to harmonize their distributions for further analysis. Step 2 shows how to apply the MOFA model and explore the major sources of variance within the dataset across all omics and features. Step 3 presents several strategies for the downstream analysis of the captured patterns, linking them to the disease conditions and potential molecular processes governing those conditions.
Overall, we present a workflow for the unsupervised exploration of complex multi-omics datasets that enables the identification of major axes of variation composed of different molecular features and that can also be applied to other contexts and multi-omics datasets (including assays other than those presented in the exemplary use case).
Disease mechanisms are usually complex and governed by the interaction of several distinct molecular processes. Deciphering the complex molecular mechanisms that lead to specific diseases or govern the evolution of a disease is a task with high medical relevance as it might reveal new insights for understanding and treating diseases.
Recent technological advances make it possible to measure these processes at higher resolution (e.g., at the single-cell level) and across various biological layers (e.g., DNA, mRNA, chromatin accessibility, DNA methylation, proteomics) at the same time. This leads to the increasing generation of large multidimensional biological datasets, which can be jointly analyzed to generate more insights into the underlying processes. At the same time, combining and analyzing the different data sources in a biologically meaningful way remains a challenging task1.
Different technological limits, noise levels, and ranges of variability between omics pose one challenge. For example, single-cell RNA-sequencing (scRNA-seq) data is very sparse and often influenced by large technical or batch effects. Additionally, the feature space is often very large, spanning several thousand measured genes or proteins, while sample sizes are limited. This is further complicated by complex designs, which might include several disease conditions, confounding factors, time points, and resolutions. For example, in the presented use case, different data types were available either on the single-cell or sample (bulk) level. Besides that, the data may be incomplete, and not all measurements might be available for all analyzed subjects.
Due to these challenges, different omics and the features they contain are still often analyzed only separately2, even though an integrated analysis can not only provide a more complete picture of the process, but biological and technical noise in one omic may also be compensated by other omics3,4. Several different methods have been proposed for the integrated analysis of multi-omics data, including Bayesian methods, network-based methods5,6, multimodal deep learning7, and dimensionality reduction methods via matrix factorization8,9. For the latter, the results of a large benchmarking study10 have shown the multi-omics factor analysis (MOFA9) method to be one of the better-suited tools when data should be linked to clinical annotations.
Especially in complex settings, unsupervised matrix factorization methods are a useful approach to reduce complexity and extract shared and complementary signals from different data sources and features. By decomposing the complex space into lower-rank latent representations, the major sources of variance within the data can be quickly explored and linked to known covariates. If the same pattern of variation is shared across multiple features (e.g., genes or proteins), it can be aggregated into a few factors while noise is reduced. Regularization can be used to increase the sparsity of model coefficients, which makes the approach well-suited to settings where the feature space is large while the number of samples is limited9.
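For orientation, the factorization underlying the MOFA9 model can be sketched as follows (a simplified view that omits the full hierarchical prior specification of the original publication):

$$\mathbf{Y}^{m} \;\approx\; \mathbf{Z}\,(\mathbf{W}^{m})^{\top} + \boldsymbol{\varepsilon}^{m}, \qquad m = 1, \dots, M,$$

where $\mathbf{Y}^{m}$ is the $N \times D_m$ data matrix of view $m$ (samples x features), $\mathbf{Z}$ is the $N \times K$ matrix of latent factor values shared across all views, $\mathbf{W}^{m}$ is the $D_m \times K$ matrix of feature weights (loadings) for view $m$, and $\boldsymbol{\varepsilon}^{m}$ is view-specific noise. Sparsity-inducing priors push many entries of $\mathbf{W}^{m}$ toward zero, which is what allows each factor to be interpreted through a limited set of driving features.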
This protocol presents a flexible analysis workflow that uses the MOFA model to showcase how to quickly explore a complex multi-omics dataset and distill the main patterns of variation that characterize it. The workflow consists of three main steps. In the first step, data pre-processing and harmonization, different strategies for pre-processing the different input data types (scRNA-seq, proteomics, cytokine, and clinical data) are presented. The protocol elaborates on how to process the features of the different input datasets, filter out low-quality features, and normalize them to harmonize their distributions. We also show how those pre-processing decisions might affect downstream results. In the second step, the MOFA model is applied to the data, and the resulting variance decomposition can be used to evaluate the integration of the different datasets. The third step shows how to link the captured factors to covariates and uncover the molecular programs that define those factors. With the presented workflow, we were able to extract several latent factors linked to clinical covariates in a dataset of patients with coronary syndromes from a previous project11 and to identify potential underlying multicellular immune programs. We will use this dataset here, but the protocol can easily be applied to other contexts, including other omics.
The dataset consists of samples from patients with stable chronic coronary syndromes (CCS), acute coronary syndromes (ACS), and a control group with healthy coronaries (non-CCS) (Figure 1). ACS is caused by plaque rupture in preexisting CCS, leading to an acute disruption of blood flow to the myocardium and a subsequent ischaemic injury of the heart. This injury causes an inflammatory response by the immune system, followed by a reparative phase, which lasts until several days after the acute event12. To characterize this immune response in ACS patients, blood samples were taken at four different time points: acute (TP1); after recanalization (14 [± 8] h) (TP2); 60 [± 12] h later (TP3); and before discharge (6.5 [± 1.5] days) (TP4) (Figure 1A). For CCS patients and patients with healthy coronaries, only one time point was available (TP0). For all patients and time points, different assays were measured on the blood samples: clinical markers of inflammation and myocardial damage (creatine kinase (CK), CK-MB, troponin, C-reactive protein (CRP)), scRNA-seq of peripheral blood mononuclear cells (PBMCs), cytokine analysis, plasma proteomics, and prime-seq13 data of neutrophils.
Figure 1: Myocardial infarction multi-omic input dataset. The analyzed data include blood samples from patients (n = 62) with acute coronary syndrome (ACS), chronic coronary syndromes (CCS), and patients with healthy coronaries (non-CCS). For ACS patients, blood samples were included at four different time points (TP1-4); for CCS and non-CCS patients, at a single time point (TP0). Each patient and time point combination is treated as a separate sample in the analysis. Different omic assays were measured on the samples: clinical blood tests (n = 125), scRNA-seq (n = 121), plasma proteomics (n = 119), cytokine assay (n = 127), and neutrophil prime-seq (n = 121). Subsequently, the described protocol was applied to integrate the data across all omics and explore it using the MOFA model and further downstream analysis (factor analysis, pathway enrichment).
As input for the workflow presented here, we take raw counts from scRNA-seq data after processing with Cell Ranger and quality control (QC), as outlined, for example, in the scanpy14 preprocessing tutorial. For cell-type annotation, we used the automated Azimuth15 pipeline. The counts are then aggregated at the sample level for each cell type by taking the mean across all cells of a given sample and cell type (pseudobulk aggregation). Plasma proteomics is included as normalized and median-centered intensities, and for neutrophils, we take the unique molecular identifier (UMI) exon counts from prime-seq. No prior preprocessing was applied to the cytokine and clinical values. Further details on the (experimental) data generation are outlined in the corresponding manuscript11. Because the results presented here are based on the automated Azimuth annotation of cell types in the scRNA-seq data, rather than the marker-based strategy used in the referenced publication, they are similar but not identical to the published results. In the manuscript, it was shown that the cell-type annotation strategy does not change the main patterns and biological interpretations of the analysis, although the exact values resulting from the model may vary slightly. Overall, the input data formed a complex, multidimensional dataset including different time points and measurement levels (single-cell vs. bulk) with more than 10,000 different features (genes, proteins, clinical values). A strict pre-processing and data harmonization strategy followed by MOFA analysis has been shown to be a useful and quick tool for exploring the data and extracting relevant immune programs. Each time point and patient combination is treated as an independent sample in the MOFA analysis. Each data type and cell type is considered a separate view in the MOFA analysis.
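As a concrete illustration of the pseudobulk aggregation described above, the sketch below computes the mean expression per sample and cell type from an annotated .h5ad file. This is not the workflow's own notebook code; the file paths are hypothetical, and only the sample_id and cluster_id columns are taken from the input specification (Figure 3).

```python
# Minimal sketch of pseudobulk aggregation (mean expression per sample and cell type);
# not the workflow's own notebook code, file names are hypothetical.
import anndata as ad
import pandas as pd

adata = ad.read_h5ad("input_data/single_cell_data.h5ad")  # hypothetical file name

# Dense data frame of counts (cells x genes); for very large data, process per cell type instead.
counts = pd.DataFrame(
    adata.X.toarray() if hasattr(adata.X, "toarray") else adata.X,
    index=adata.obs_names,
    columns=adata.var_names,
)

# Mean across all cells of each sample_id / cluster_id (cell type) combination.
pseudobulk = (
    counts
    .assign(sample_id=adata.obs["sample_id"].values,
            cluster_id=adata.obs["cluster_id"].values)
    .groupby(["sample_id", "cluster_id"], observed=True)
    .mean()
)

# One wide .csv per cell type, matching the sample-level input format (Figure 3).
for cell_type, table in pseudobulk.groupby(level="cluster_id"):
    table.droplevel("cluster_id").to_csv(f"results/pseudobulk_{cell_type}.csv")
```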
This protocol provides instructions for preparing the input data for the workflow, executing the different workflow steps, customizing configurations, interpreting the resulting figures, and iteratively adjusting the configurations based on the interpretations. An overview of the different steps of the protocol, the required input datasets at each step, and resulting figures and datasets is given by the technical workflow overview (Figure 2).
Figure 2: Technical workflow overview. Outline of the workflow for the analysis of the multi-omics dataset. The different elements are highlighted by different colors and symbols. Jupyter Notebooks belonging to the 'Data Preprocessing and Harmonization' (1) step are colored in blue, Jupyter Notebooks belonging to the 'MOFA Model' (2) step in orange, and Jupyter Notebooks belonging to the 'Downstream Analysis' (3) step in green. One Jupyter Notebook used for comparing results is colored in yellow. Configuration files in which parameters for the execution of the workflow can be modified are highlighted in purple. Input datasets required to run the workflow are indicated by the dataset symbol and highlighted in grey. Figure outputs generated during workflow execution are indicated by the magnifying glass symbol, and datasets generated during workflow execution are indicated as tables. In general, the workflow is executed sequentially. (1) 'Data Preprocessing and Harmonization' consists of two steps: first, generation of a pseudobulk table based on the scRNA-seq input data (01_Prepare_Pseudobulk), and second, integration and normalization of this data together with all other sample-level (bulk) inputs (02_Integrate_and_Normalize_Data). Within this step, the configuration files allow specifying separately for each dataset which of the indicated pre-processing and normalization steps (e.g., Sample Filter) should be applied. (2) 'MOFA Model' runs the MOFA model on the output of the first step with the configurations specified in the configuration file (03_MOFA_configs.csv). (3) 'Downstream Analysis' consists of three different notebooks, which may be run independently of each other, to generate insights into the MOFA results and associate them with sample metadata (covariates) provided as input via the 'Sample Meta Data.csv' file. (4) 'Model Comparison' is a small separate step that may be used to compare different models generated in step 2.
The workflow consists of several Jupyter Notebooks written in R and Python (knowledge of R and Python is not required to run the workflow but may be helpful in case errors appear). In various steps of the protocol, parameters are changed via configuration files ('.csv' files containing the suffix '_Configs' in their name). Within the protocol, we only outline the parameters that need to be changed starting from the default configuration.
Several other parameters may also be changed, for example, to customize the pre-processing. Documentation and explanations of these parameters are given in the file 'Documentation_Config_Parameter', which is included in the downloaded repository.
1. Preparations: Technical setup and installation
NOTE: To run this workflow, wget, git, and Apptainer must be preinstalled on the device. A guide for installing Apptainer on different systems (Linux, Windows, Mac) is given here: https://apptainer.org/docs/admin/main/installation.html. Installation information for git can be found here: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git. Depending on the size of the different input datasets, running the workflow on a suitable machine (16 CPUs, 64 GB memory) is recommended. A smoke test with the provided example data can be executed on a local machine. Instructions and expected outputs from running the protocol on the example data are given in Supplementary File 1. Refer to Supplementary Video File 1 for the important steps of the protocol, executed on the dataset outlined above.
2. Initialization and data preparation
Figure 3: Data input and setup. For the execution of the workflow, all data need to be stored in a specified input_data folder, with a separate file for each input dataset. Single-cell data should be given as an .h5ad file containing cell annotations in a cluster_id column (resulting, e.g., from previous cell-type annotation steps) and a sample_id column (uniquely identifying each separate sample to be analyzed). All other input datasets should be given in '.csv' format, including one column specifying the sample_id (matching the corresponding column of the single-cell data) and the features to be used in the MOFA analysis in all other columns.
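As a hedged illustration of this input layout, the sketch below reshapes a bulk assay table into the expected sample-by-feature '.csv' format. The file names and the features-in-rows orientation of the raw table are assumptions; only the sample_id column is prescribed by the workflow.

```python
# Hedged sketch: bringing a bulk assay table into the expected .csv layout
# (one sample_id column plus one column per feature); file names are hypothetical.
import pandas as pd

# Example: a proteomics table with proteins in rows and samples in columns.
prot = pd.read_csv("raw/proteomics_intensities.csv", index_col=0)

# Transpose so that rows are samples, then add the sample_id column expected by the workflow.
prot_wide = prot.T
prot_wide.insert(0, "sample_id", prot_wide.index)

# sample_id values must match the sample_id column of the single-cell .h5ad file.
prot_wide.to_csv("input_data/plasma_proteomics.csv", index=False)
```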
Figure 4: Jupyter-lab configuration files. During the execution of the workflow, changes in parameters (e.g., adjusting filtering options) are specified via '.csv' configuration files. Within the cloned repository, default configuration files for each step are included. They may be edited directly in the JupyterLab console, similarly to a spreadsheet.
Figure 5: Jupyter Notebook scripts. The complete workflow consists of a series of Jupyter Notebooks that are executed sequentially after the corresponding configuration files have been modified. Double-clicking a Jupyter Notebook on the left side opens the corresponding file on the right side. The complete execution of the file can be started with the button highlighted at the top.
3. Data pre-processing and harmonization
Figure 6: Data pre-processing and harmonization. One output of the '01_Prepare_Pseudobulk' step is the plot 'Fig01_Amount_of_Cells_Overview'. Here, for each cluster_id (y-axis, indicating the cell type from previous cell-type annotation steps), the number of cells per sample ('sample_id') is given. Within the presented results, cell types with a low number of cells per sample are excluded from the subsequent analysis (indicated by the strikethrough).
4. Running MOFA
5. Downstream analysis
6. Comparing different configurations and versions (Supplementary Figure 1, Supplementary Figure 2, Supplementary Figure 3, Supplementary Figure 4)
7. Extending the workflow: Adding other parameters and configurations
NOTE: Besides the parameters that are currently configurable in the configuration files, other adjustments to the code or additional parameters may be included. For example, the MOFA model itself offers several other training parameters17 that can either be modified directly in the code or made adjustable via the configuration files. The next section of the protocol outlines an example of how to do this for additional MOFA model training parameters. For this part, R programming knowledge is required.
Following the successful execution of the workflow, several tables and figures are generated as indicated by Figure 2. Figures are placed in the /figures folder (Figure 6, Figure 7, Figure 8, Supplementary Figure 1, Supplementary Figure 2, Supplementary Figure 3, Supplementary Figure 4), and tables will be placed in the specified /results folder.
In case the workflow execution is not successful, this is mainly due to: technical errors caused, for example, by insufficient memory (especially in the first step, where a large single-cell dataset is loaded); incorrectly formatted data (e.g., non-matching sample_id columns between datasets); or incorrect specifications in the configuration files (e.g., excluding too many features). In this case, an error message usually appears within the Jupyter Notebook script during execution, and no plots or data are generated. It is recommended to use the default configuration files as generated during the script execution and only modify specific parameters as described in the protocol.
A successful execution is indicated by the generation of the resulting plots and tables, and each step will reveal additional information about the data and the main patterns of variance inherent in it. Still, not every execution will necessarily produce biologically useful and interpretable results. Often, the data are characterized by large technical effects and different distributions, which need to be accounted for in the 'Data Pre-Processing and Harmonization' step or in the MOFA9 model (which also allows specifying different distributions for the input data types) to be able to extract the variation in the data that reflects the underlying biological processes.
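To illustrate how such model-level choices (number of factors, per-view likelihoods) are typically specified, the sketch below uses the Python interface of MOFA (mofapy2). Note that the workflow itself configures this step via 03_MOFA_configs.csv and its own notebook, so the file paths, column layout, and parameter values shown here are illustrative assumptions only.

```python
# Hedged sketch of fitting a MOFA model via the Python interface (mofapy2);
# the workflow drives this step through its configuration files, so the paths,
# likelihoods, and factor numbers below are illustrative assumptions.
import pandas as pd
from mofapy2.run.entry_point import entry_point

# Long-format table assumed to have columns: sample, group, feature, view, value
# (e.g., assembled from the harmonized per-view .csv files; path is hypothetical).
data = pd.read_csv("results/harmonized_long_format.csv")

ent = entry_point()
ent.set_data_options(scale_views=False)
ent.set_data_df(data, likelihoods=["gaussian"] * data["view"].nunique())
ent.set_model_options(factors=15, spikeslab_weights=True)   # number of factors is a choice
ent.set_train_options(convergence_mode="fast", seed=42)
ent.build()
ent.run()
ent.save(outfile="results/mofa_model.hdf5")
```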
Within the presented workflow, different multi-omics datasets may be used as input. Currently, the workflow accepts the popular .h5ad file format for single-cell data and a very general .csv file format for all other datasets (Figure 3). It is common for different omics datasets to have very different file formats. To not limit the execution of the workflow to specific file formats, .csv is used as a very general format. Therefore, all kinds of omics datasets can be used as input for the workflow but first need to be converted to the corresponding .csv format as indicated in Figure 3. This conversion can be done either with a spreadsheet or with omics-specific software. To preprocess the different omics datasets, several options are available within the workflow to apply different pre-processing and normalization steps (e.g., library size adjustment, log transformation, sample quantile normalization) to the different input datasets by configuring the 02_Pre_Processing_Configs.csv and 02_Pre_Processing_Configs_SC.csv files (Figure 2). Nevertheless, the available options are mainly based on the specific input data available in the dataset presented here (scRNA-seq, cytokine assay, proteomics, prime-seq). In case other omics/data types are used, it might be necessary to apply additional omics-specific normalization steps according to existing best practices. In this case, the data can be handed to the workflow in an already pre-processed form and will be integrated along with the other datasets without applying further pre-processing steps. In many cases, applying the feature-wise quantile normalization step is useful for aligning the distribution of all data types to a normal distribution, making the downstream analysis of the different input features more comparable and compatible with the model specification of Gaussian noise.
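The feature-wise quantile normalization mentioned above can be implemented, for example, as a rank-based inverse normal transform. The sketch below shows one common way to do this; the workflow's own implementation may differ in details such as tie handling, and the file name is hypothetical.

```python
# One common way to implement a feature-wise quantile normalization to a Gaussian
# (rank-based inverse normal transform); the workflow's implementation may differ.
import numpy as np
import pandas as pd
from scipy.stats import rankdata, norm

def quantile_normalize_to_gaussian(df: pd.DataFrame) -> pd.DataFrame:
    """Map each feature (column) to standard-normal quantiles based on its ranks."""
    out = {}
    for col in df.columns:
        x = df[col].to_numpy(dtype=float)
        mask = ~np.isnan(x)
        ranks = rankdata(x[mask])                 # average ranks for ties
        quantiles = (ranks - 0.5) / mask.sum()    # map ranks into (0, 1)
        z = np.full_like(x, np.nan)
        z[mask] = norm.ppf(quantiles)             # inverse normal CDF
        out[col] = z
    return pd.DataFrame(out, index=df.index)

# Example usage on one harmonized sample-level table (samples x features).
cytokines = pd.read_csv("input_data/cytokine_data.csv", index_col="sample_id")
cytokines_qn = quantile_normalize_to_gaussian(cytokines)
```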
During the workflow execution, several plots and outputs are generated that support the process of data integration and the subsequent biological downstream interpretation. For scRNA-seq data, the plot FIG01_Amount_of_Cells_Overview (Figure 6) indicates which cell types might include too few cells per sample to reliably measure a gene expression signal, as the mean value across all cells of a cell type per sample is used as an expression estimate in the subsequent analyses (pseudobulk approach). In this use case, we exclude cell types that have fewer than three cells in the majority of samples.
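A minimal sketch of this filtering criterion is shown below, assuming the annotated .h5ad input with sample_id and cluster_id columns. The three-cell threshold follows the text, while the file path and the interpretation of "majority of samples" as more than 50% are assumptions.

```python
# Hedged sketch of the cell-type filter described above: drop cell types with
# fewer than 3 cells in the majority of samples.
import anndata as ad

adata = ad.read_h5ad("input_data/single_cell_data.h5ad")  # hypothetical file name

# Number of cells per cluster_id (cell type) x sample_id combination.
cell_counts = (
    adata.obs.groupby(["cluster_id", "sample_id"], observed=True)
    .size()
    .unstack(fill_value=0)
)

# Keep cell types that have >= 3 cells in more than half of the samples.
fraction_ok = (cell_counts >= 3).mean(axis=1)
keep_cell_types = fraction_ok[fraction_ok > 0.5].index.tolist()
print("Cell types retained for pseudobulk aggregation:", keep_cell_types)
```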
The variance decomposition plot FIG03_Overview_Variance_Decomposition (Figure 7, Supplementary Figure 1) can indicate how well the different data sources integrate and how much of the variance in the different data sources is shared or unique to each data source. For example, testing different pre-processing strategies on the dataset used here shows that removing the feature-wise quantile normalization step from the pre-processing leads to latent factors that are more focused on specific data views and reduces the integration of the proteomics data with the other data sources. This can be seen in the reduced amount of explained variance (Supplementary Figure 1B). Running the MOFA model without any filtering of features or without normalization leads to less shared variance between the different views captured by the latent factors (Supplementary Figure 1C), indicating that the latent factors predominantly reflect data-type-specific technical effects. Besides that, the MOFA9 model itself might also return warnings in case of poorly pre-processed data. An example of such a warning is shown in Supplementary Figure 1 for the alternative pre-processing configurations MI_v2 and MI_v3 (the specific example configuration files are stored in the cloned GitHub repository in the config_examples folder).
Additionally, after running the MOFA model, the results can be evaluated in several downstream analyses by associating the factors with known biological meta-information about the samples, as well as with technical and other confounding covariates (04_Downstream_Factor_Analysis), to identify the likely cause of the variation captured by the factors. For example, if one of the factors of the MOFA model associates strongly with a technical covariate (such as batch information), this might indicate that the factor captures technical rather than biological variation within the data.
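Such associations with numeric covariates amount to a simple Pearson correlation between factor values and metadata columns, as sketched below. The factor table name (03_Factor_Data.csv) and its column layout are assumptions; 'Sample Meta Data.csv', CRP, and CK are taken from the text.

```python
# Hedged sketch of associating factor values with numeric sample covariates via
# Pearson correlation; file and column names are assumptions.
import pandas as pd
from scipy.stats import pearsonr

factors = pd.read_csv("results/03_Factor_Data.csv", index_col="sample_id")      # hypothetical file name
metadata = pd.read_csv("input_data/Sample Meta Data.csv", index_col="sample_id")
metadata = metadata.loc[factors.index]  # align samples

for factor in factors.columns:
    for covariate in ["CRP", "CK"]:      # numeric covariates from the metadata
        paired = pd.concat([factors[factor], metadata[covariate]], axis=1).dropna()
        r, p = pearsonr(paired.iloc[:, 0], paired.iloc[:, 1])
        print(f"{factor} vs {covariate}: r = {r:.2f}, p = {p:.1e}")
```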
To narrow down the biological interpretation in the downstream analysis part, a few findings based on the input dataset are outlined here (a more refined interpretation can be found in the original publication11). In the first step, we observe that, with the applied pre-processing strategy, several factors capture variance across multiple cell types as well as across other omics data types (Figure 7A). For example, Factor 2 captures variance in the clinical input features and in several cell types of the scRNA-seq dataset. Associating the first three factors with relevant clinical covariates such as 'CRP' and 'CK' (Figure 7B) and investigating the differences in factor values between the patient subgroups 'Control' (including CCS and non-CCS) and 'ACS' measured at the different time points (TP1-TP4) (Figure 7C), we also find that Factor 2 associates significantly with the 'CK' value and Factor 3 with the 'CRP' value. At the same time, 'ACS' samples at TP1 and TP2 (which reflect the acute phase of the immune response to myocardial infarction (MI)) show an increase in factor values compared to 'Control' samples and later time points (TP3/TP4). CK is a known marker of myocardial damage and is typically characterized by increased values at TP1/TP2, similar to the pattern captured by Factor 2.
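Such group differences in factor values (as visualized in Figure 7C) can, for example, be quantified with a nonparametric test. The sketch below compares ACS samples at TP1 against Control samples per factor; the factor file name and the 'group' and 'timepoint' column names are assumptions, and the workflow's own test may differ.

```python
# Hedged sketch of comparing factor values between sample groups
# (as in FIG04_Factor_Association_Categorical_Features).
import pandas as pd
from scipy.stats import mannwhitneyu

factors = pd.read_csv("results/03_Factor_Data.csv", index_col="sample_id")       # hypothetical file name
metadata = pd.read_csv("input_data/Sample Meta Data.csv", index_col="sample_id")
merged = factors.join(metadata[["group", "timepoint"]])                           # assumed column names

acs_tp1 = merged[(merged["group"] == "ACS") & (merged["timepoint"] == "TP1")]
control = merged[merged["group"] == "Control"]

for factor in factors.columns:
    stat, p = mannwhitneyu(acs_tp1[factor], control[factor], alternative="two-sided")
    print(f"{factor}: ACS TP1 vs. Control, Mann-Whitney U p = {p:.1e}")
```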
To generate insights into the biological processes shaping Factor 2, we evaluate the top-ranking features of the factor by looking at the feature weights table generated by the model (03_Weight_Data.csv). Analyzing the top 1% of features with the highest absolute weights on the factor, we find that mainly CD4.TCM- and CD14.Mono-derived features are overrepresented compared to their overall number of input features (Figure 8A), indicating that these cell types are highly relevant in the inflammatory process after MI (NOTE: in case no feature-wise quantile normalization was applied in the pre-processing, differing feature distributions might also affect this result, and the evaluation should be done separately by data type). Analyzing the top-ranking features of the CD4.TCM cell type on the factor, we find several interesting genes such as EIF3E18, which is required for robust T-cell activation, and HMGB119, which promotes expansion and activation of T-cells (Figure 8B). Next, we run a pathway enrichment analysis using immune pathways from the REACTOME20 database as the pathway set (Prepared_Pathway_Data.csv). We find enrichment for several 'Interleukin' pathways, including 'Interleukin-6' signaling. The expression levels of several genes in different cell types of the scRNA-seq data and the 'IL6' cytokine values measured by the cytokine assay contributed to this result (Figure 8C). The identification of such shared patterns across data types highlights the added value of an integrated analysis. Overall, this approach can also identify several other factors that reflect disease state or associate with treatment outcome, together with the underlying multicellular immune programs, as described in more detail in the corresponding publication11.
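As an illustration of the top-feature selection and a simple over-representation test, the sketch below selects the top 1% of features by absolute weight for Factor 2 from 03_Weight_Data.csv and tests an Interleukin-6 gene set with Fisher's exact test. The column names, the long-format layout of the weight table, and the structure of Prepared_Pathway_Data.csv are assumptions, and the workflow's own enrichment implementation may differ.

```python
# Hedged sketch: top 1% of features by absolute factor weight plus a Fisher's
# exact over-representation test against one pathway gene set.
import pandas as pd
from scipy.stats import fisher_exact

weights = pd.read_csv("results/03_Weight_Data.csv")          # columns assumed: feature, view, factor, weight
w2 = weights.query("factor == 'Factor2'").copy()

# Top 1% of features by absolute weight on the factor.
cutoff = w2["weight"].abs().quantile(0.99)
top_features = set(w2.loc[w2["weight"].abs() >= cutoff, "feature"])
background = set(w2["feature"])

# One pathway gene set from the prepared pathway table (columns assumed: pathway, gene).
# NOTE: view-prefixed feature names may first need mapping to gene symbols per view.
pathways = pd.read_csv("input_data/Prepared_Pathway_Data.csv")
genes_il6 = set(pathways.loc[pathways["pathway"].str.contains("Interleukin-6"), "gene"])

# 2x2 contingency table: in top set vs. not, in pathway vs. not.
a = len(top_features & genes_il6)
b = len(top_features - genes_il6)
c = len((background - top_features) & genes_il6)
d = len((background - top_features) - genes_il6)
odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
print(f"IL-6 signaling: odds ratio = {odds_ratio:.2f}, p = {p_value:.1e}")
```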
To further stress the advantage of integrated analyses across multiple omics, the same workflow was also run including only the proteomics input data (Supplementary Figure 4). Analyzing the resulting factors, we find, similarly to the integrated analysis, a factor (Factor 1) that strongly correlates with the 'CRP' value. This pattern describes the main source of variation within the proteomics data and is also aligned with some of the variation in the other datasets, as captured by Factor 3 in the integrated analysis (Figure 7C). However, a pattern like the one indicated by Factor 2, which captures the time course of inflammation in the integrated analysis, cannot be identified based on the proteomics data alone.
The introduced workflow and the MOFA9 model itself are highly customizable, with many adjustable parameters. Therefore, it is important to visualize and systematically compare the results produced by different configurations. To facilitate this task, the final output that can be generated by the workflow is a comparison of different named runs of the pipeline with different parameters for pre-processing and model estimation. For example, the MOFA model can be estimated with different numbers of latent factors (Supplementary Figure 2A), or views with a lower number of features can be weighted (Supplementary Figure 3A). Configuring and running the last script of the workflow, '07_Compare_Models', produces several plots to assess the similarity between different pipeline runs. FIG07_Variance_Model_Comparison (Supplementary Figure 2B, Supplementary Figure 3B) shows a comparison of the total explained variance for each view across the different runs. The correlation of the factor values and feature factor weights between the different runs can indicate how much the results change when a certain parameter is modified (Supplementary Figure 2C, Supplementary Figure 3C). Here, modifying the number of factors only causes minor changes in the estimated factor values and feature weights (Supplementary Figure 2C). Modifying the weighting of the data views results in a much higher explained variance in the views with a lower number of features, e.g., the 'clinical' view (Supplementary Figure 3B). Nevertheless, the relevant features within the first three factors remain highly correlated with those inferred with the unweighted version (Supplementary Figure 3C).
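The run-to-run comparison of factor values (as in FIG07_Factor_Correlations) essentially amounts to correlating the factor matrices of two runs. A minimal sketch is given below; the results folder layout and the factor table name are assumptions, while the run names MI_v1_MOFA and MI_v1_MOFA_weighted follow Supplementary Figure 3.

```python
# Hedged sketch of comparing two pipeline runs by correlating their per-sample
# factor values; paths and file names are assumptions.
import pandas as pd

run_a = pd.read_csv("results/MI_v1_MOFA/03_Factor_Data.csv", index_col="sample_id")
run_b = pd.read_csv("results/MI_v1_MOFA_weighted/03_Factor_Data.csv", index_col="sample_id")
run_b = run_b.loc[run_a.index]  # align samples between the two runs

# Pearson correlation of each factor in run A with each factor in run B.
correlations = pd.DataFrame({fb: run_a.corrwith(run_b[fb]) for fb in run_b.columns})
print(correlations.round(2))
```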
With the generated model output .csv files in the results folder (e.g., the estimated factor values and feature weights), further individual downstream analyses can be conducted. All code and necessary configuration files (including documentation) are available on GitHub at https://github.com/heiniglab/mofa_workflow. The Singularity image that was created to enable an easy installation of the required conda packages for the analysis can be downloaded from https://doi.org/10.5281/zenodo.10815146. A small example dataset that can be used to perform an initial test of the pipeline can be downloaded from the same Zenodo record.
Figure 7: MOFA output analysis. After running the MOFA model (03_Run_MOFA.ipynb) and the downstream analysis of factor values (04_Downstream_Factor_Analysis.ipynb), several plots are generated. (A) FIG03_Overview_Variance_Decomposition: visualization of the explained variance of the estimated MOFA factors within the different views. Heatmap (left): shows, for each view, the percentage of total variance of the view captured by each factor. Bar plot (right): shows, for each view, the total percentage of variance captured by all factors. (B) FIG04_Factor_Association_Numerical_Features: shows the Pearson correlation of the factor values with chosen numeric sample covariates, here clinical variables (CRP, CK). (C) FIG04_Factor_Association_Categorical_Features: shows the differences in factor values for categorical sample covariates as boxplots. Here, the factor values of Factors 1-3 are compared for each time point of ACS and Control patients.
Figure 8: MOFA feature analysis. After running the downstream analyses (04_Downstream_Factor_Analysis.ipynb, 05_Downstream_Investigate_Features.ipynb), several plots are generated. All plots here visualize MOFA Factor 2. (A) FIG04_Top_Feature_Overview_per_Factor: The heatmap (left) shows, for each view, the percentage of variance captured by the selected factor. The bar plots (right) indicate the relevance of the features of the different views for the factor: the left bar plot gives the number of features of each view among the top 1% highest-ranking features on the factor across all views; the right bar plot gives this number divided by the total number of features of that view (as a percentage). (B) FIG05_Heatmap_Feature_Overview: The heatmap (left) shows, for the highest-ranking 1% of features of the CD4.TCM cell type, the normalized expression values of each sample, comparing the 'Control' group patients (CCS and non-CCS) to the different time points for 'ACS' patients. The bar plot (right) shows the weights of the features. The sign of the weight is indicated on the left before the cell-type names: '+' positive factor weight; '-' negative factor weight. (C) FIG06_Pathway_and_Genes: shows the weights of the top 25% highest-ranking genes for the factor that belong to enriched Interleukin pathways. In the heatmap at the top, they are averaged across views; in the heatmap at the bottom, they are shown per view.
Supplementary Figure 1: Data harmonization effects. The figure shows FIG03_Overview_Variance_Decomposition for several different data pre-processing configurations: visualization of the explained variance of the estimated MOFA factors within the different views. Heatmap (left): shows, for each view, the percentage of total variance of the view that is captured by each factor. Bar plot (right): shows, for each view, the total percentage of variance that is captured by all factors. (A) The configuration ('MI_v1') on which the biological downstream results in the previous figures are based (parameters set as in the default configuration files in the cloned repository). (B) The same pre-processing configuration as in 'MI_v1', except that no feature-wise quantile normalization is applied (parameters set as in the exemplary configuration files in the 'config_examples' folder of the repository). A screenshot of the MOFA model output warning for this configuration is added below the plot. (C) The resulting variance decomposition when all data are used as input without any pre-processing or filtering of features (parameters set as in the exemplary configuration file in the 'config_examples' folder of the repository). A screenshot of the MOFA model output warning for this configuration is added below the plot.
Supplementary Figure 2: MOFA configuration – Factor amount effect. The figures generated by the '07_Compare_Models.ipynb' script using several different configurations to run the MOFA model. (A) '03_MOFA_configs.csv': Example of the different configurations used to run the '03_Run_MOFA.ipynb' script, specifying different numbers of factors (10, 15, 20, 25). '07_Comparison_configs.csv': Example of how to specify the configuration input file for the execution of the script '07_Compare_Models.ipynb'. (B) 'FIG07_Variance_Model_Comparison' showing the total explained variance for each view (y-axis) for the different models across all factors specified in the model. (C) 'FIG07_Factor_Correlations' showing the correlation of the factor sample values between the different configurations.
Supplementary Figure 3: MOFA configuration – Effect of weighting views. The figures generated by the '07_Compare_Models.ipynb' script using several different configurations to run the MOFA model. (A) '03_MOFA_configs.csv': Example of the different configurations used to run the '03_Run_MOFA.ipynb' script, specifying the 'weighting_of_views' parameter as either 'TRUE' (MI_v1_MOFA_weighted) or 'FALSE' (MI_v1_MOFA). '07_Comparison_configs.csv': Example of how to specify the configuration input file for the execution of the script '07_Compare_Models.ipynb'. (B) 'FIG07_Variance_Model_Comparison' showing the total explained variance for each view (y-axis) for the different models across all factors specified in the model. (C) 'FIG07_Feature_Correlations' showing the correlation of the feature factor weights between the different configurations.
Supplementary Figure 4: Multi-omic integration effect – using only proteomic data. The patterns captured by the latent factors when only proteomics data are used as input. (A) FIG04_Factor_Association_Numerical_Features: Pearson correlation of the factor values with clinical variables (CRP, CK). (B) FIG04_Factor_Association_Categorical_Features: Boxplot comparison of the factor values for each time point of ACS and Control patients.
Supplementary File 1: Supplementary_File_Running_Pipeline_with_Exemplary_Data. Descriptions of how to run the pipeline on the example data and the expected outputs are provided in this file.
Supplementary Video File 1: Screen capture video of the protocol.
With the outlined protocol, a modular and extendible Jupyter-notebook-based workflow that can be used to quickly explore a complex multi-omics dataset is presented. The major parts of the workflow consist of the pre-processing and data harmonization part (offering different standard steps for filtering and normalization of the data), the estimation of the MOFA9 model, and some exemplary downstream analyses. One of the main critical steps is to pre-process, integrate, and harmonize the different omics datasets. Here, we present a strategy for a dataset that comprises scRNA-seq data, prime-seq bulk RNA-seq data, a cytokine assay, plasma proteomics, and clinical values, which resulted in an integrated and harmonized dataset that could be used to identify relevant biological processes in MI11. In case other omics datasets require additional data pre-processing strategies, these should be executed prior to the MOFA analysis. The pre-processed data can then be used as input to the current workflow. The model's output can be used to evaluate the quality of the integration of the different datasets and the effects of the pre-processing in order to identify potential technical effects. For example, one could start by applying only a minimum of pre-processing and normalization steps and subsequently evaluate the effect of further normalization. In this application, it was observed that adding feature-wise quantile normalization leads to a better integration of the proteomics data with the remaining assays.
Another large part of the workflow is the estimation of the MOFA9 model on data that is aggregated at the sample level. Several extensions of the MOFA model exist, such as MOFA+21 (specifically for single-cell data), MEFISTO22 (specifically for data including a time component), and MuVI23 (integrating domain knowledge in a factor analysis approach). These are very useful extensions of the method for specific types of datasets or settings, but applying them comes with specific requirements. For example, using MOFA+21, which is tailored to single-cell data, would in our case mean that one could not easily make use of the other omics data that are available only at the sample level. The MEFISTO22 method makes it possible to explicitly model the time course, but it is not applicable in settings where multiple time points are not available or, as in our case, when the 'Control' samples, which were measured at only one time point, need to be combined with the time-resolved data of the 'ACS' samples. Therefore, in this very general workflow, we opt for the MOFA9 method, as it is the most flexible method for quickly exploring all kinds of datasets without too many requirements. Of note, the pre-processed data generated in the workflow can alternatively also be analyzed with these methods, or the current scripts can be extended to integrate them, provided the dataset is compatible with the methods' requirements. For additional use cases, tutorials, and documentation, refer to the GitHub repository of the MOFA developers24.
In comparison to even more general dimensionality reduction methods such as principal component analysis (PCA), the MOFA9 model offers several advantages, especially in the multi-omics setting. For example, the variance decomposition can easily be analyzed per input view, and weights can be assigned to different views to account for different numbers of features. Furthermore, samples with missing data in one of the omics do not need to be excluded, and the model encourages sparse feature factor weights, which increases the interpretability of the results. The MOFA model also offers several settings to integrate data characterized by different distributions (modeling 'Gaussian', 'Bernoulli', or 'Poisson' likelihoods). In this workflow, we only integrate continuous data, which we normalize to follow a Gaussian distribution, but the existing code can also be extended to integrate other data types if needed. Other decomposition-based methods, such as scITD25, exist specifically for scRNA-seq data, but these come with the limitation that they are not designed to work with other omics.
The presented workflow shows a protocol for an unsupervised exploration of large and complex multi-omics datasets to identify potential biological processes and other features that drive variation within the data. It can be applied in nearly any setting where variation in the data, e.g., caused by disease or other biological or technical perturbations, should be investigated. The resulting downstream analyses and feature sets driving the variance across the different omics can reveal relevant biological processes to investigate further in the specific context. For example, in the context of disease, it might enable the identification of new diagnostic markers or treatment targets.
The authors have nothing to disclose.
C.L. is supported by the Helmholtz Association under the joint research school "Munich School for Data Science – MUDS".
Name | Company | Catalog number | Comments
Apptainer | NA | NA | https://apptainer.org/docs/admin/main/installation.html
Compute server, workstation, or cloud (Linux, Mac, or Windows environment); depending on the size of the different input datasets, we recommend running the workflow on a suitable machine (in our setting: 16 CPU, 64 GB memory) | Any manufacturer | 16 CPU, 64 GB memory | Large memory is only required for the processing of the raw single-cell data; after preprocessing, the later analysis steps can also be performed on regular desktop or laptop computers
git | NA | NA | https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
GitHub | GitHub | NA | https://github.com/heiniglab/mofa_workflow