Application of Unsupervised Multi-Omic Factor Analysis to Uncover Patterns of Variation and Molecular Processes Linked to Cardiovascular Disease

Corinna Losert; Kami Pekayvaz; Viktoria Knottenberg; Leo Nicolai; Konstantin Stark; Matthias Heinig

doi:10.3791/66659

JoVE Journal > Genetics

Please note that all translations are automatically generated. Click here for the English version.

Genética

Application of Unsupervised Multi-Omic Factor Analysis to Uncover Patterns of Variation and Molecular Processes Linked to Cardiovascular Disease

Published: September 20, 2024

doi:

10.3791/66659

Corinna Losert², Kami Pekayvaz⁴, Viktoria Knottenberg⁴, Leo Nicolai⁴, Konstantin Stark⁴, Matthias Heinig^2,4

¹Institute of Computational Biology, German Research Center for Environmental Health,Helmholtz Zentrum München, ²Department of Computer Science, TUM School of Computation, Information and Technology,Technical University of Munich, ³Medizinische Klinik und Poliklinik I University Hospital Ludwig-Maximilian University, ⁴German Centre for Cardiovascular Research (DZHK), partner site Munich Heart Alliance

Summary

We present a flexible, extendible Jupyter-lab-based workflow for the unsupervised analysis of complex multi-omics datasets that combines different pre-processing steps, estimation of the multi-omics factor analysis model, and several downstream analyses.

Abstract

Disease mechanisms are usually complex and governed by the interaction of several distinct molecular processes. Complex, multidimensional datasets are a valuable resource to generate more insights into those processes, but the analysis of such datasets can be challenging due to the high dimensionality resulting, for example, from different disease conditions, timepoints, and omics capturing the process at different resolutions.

Here, we showcase an approach to analyze and explore such a complex multiomics dataset in an unsupervised way by applying multi-omics factor analysis (MOFA) to a dataset generated from blood samples that capture the immune response in acute and chronic coronary syndromes. The dataset consists of several assays at differing resolutions, including sample-level cytokine data, plasma-proteomics and neutrophil prime-seq, and single-cell RNA-seq (scRNA-seq) data. Further complexity is added by having several different time points measured per patient and several patient subgroups.

The analysis workflow outlines how to integrate and analyze the data in several steps: (1) Data pre-processing and harmonization, (2) Estimation of the MOFA model, (3) Downstream analysis. Step 1 outlines how to process the features of the different data types, filter out low-quality features, and normalize them to harmonize their distributions for further analysis. Step 2 shows how to apply the MOFA model and explore the major sources of variance within the dataset across all omics and features. Step 3 presents several strategies for the downstream analysis of the captured patterns, linking them to the disease conditions and potential molecular processes governing those conditions.

Overall, we present a workflow for unsupervised data exploration of complex multi-omics datasets to enable the identification of major axes of variation composed of differing molecular features that can also be applied to other contexts and multi-omics datasets (including other assays as presented in the exemplary use case).

Introduction

Disease mechanisms are usually complex and governed by the interaction of several distinct molecular processes. Deciphering the complex molecular mechanisms that lead to specific diseases or govern the evolution of a disease is a task with high medical relevance as it might reveal new insights for understanding and treating diseases.

Recent technological advances enable to measure those processes on a higher resolution (e.g., on the single-cell level) and at various biological layers (e.g., DNA, mRNA, chromatin accessibility, DNA methylation, proteomics) at the same time. This leads to the increasing generation of large multidimensional biological datasets, which can be jointly analyzed to generate more insights into the underlying processes. At the same time, combining and analyzing the different data sources in a biologically meaningful way remains a challenging task¹.

Different technological limits, noises, and ranges of variability between different omics pose one challenge. For example, single-cell RNA-sequencing (scRNA-seq) data is very sparse and often influenced by large technical or batch effects. Additionally, the feature space is often very large, ranging across several thousand measured genes or proteins, while sample sizes are limited. This is further complicated by complex designs, which might include several disease conditions, confounding factors, time points, and resolutions. For example, in the presented use case, different data types were available either on the single-cell or sample (bulk) level. Besides that, the data may be incomplete, and not all measurements might be available for all analyzed subjects.

Due to these challenges, different omics and included features are still often analyzed only separately² even though performing an integrated analysis cannot only provide a complete picture of the process but biological and technical noises from one omic might also be compensated by other omics³^,⁴. Several different methods have been proposed to perform an integrated analysis of multi-omics data, including Bayesian methods, network-based methods⁵^,⁶, multimodal deep learning⁷, and dimensionality reduction methods via matrix factorization⁸^,⁹. For the latter, the results of a large benchmarking study¹⁰ have shown the MOFA⁹ (multi-omic factor analysis) method to be one of the better-suited tools when data should be linked to clinical annotations.

Especially in complex settings, unsupervised matrix factorization methods are a useful approach to reduce complexity and extract shared and complementary signals from different data sources and features. By decomposing the complex space into lower-rank latent representations, the major sources of variance within the data can be quickly explored and linked to known covariates. In case the same pattern of variation is shared across multiple features (e.g., genes or proteins), this may be aggregated to a few factors while noise is reduced. Regularization can be used to increase the sparsity of model coefficients, which makes the approach well-suited in settings where the feature space is large while the number of samples is limited⁹.

This protocol presents a flexible analysis workflow that uses the MOFA model to showcase how to quickly explore a complex multi-omics dataset and distill the main patterns of variation that characterize this dataset. The workflow consists of three main steps. In the first step, Data pre-processing and harmonization, different strategies for data preprocessing based on different input data types (scRNA-seq, proteomics, cytokine, clinical data) are presented. The protocol elaborates on how to process the features of the different input datasets, filter out low-quality features, and normalize them to harmonize their distributions. We also show how those pre-processing decisions might affect downstream results. In the second step, the MOFA model is applied to the data, and the resulting variance decomposition can be used to evaluate the integration of the different datasets. The third step shows how to link the captured factors to covariates and uncover the molecular programs that define those factors. With the presented workflow, we were able to extract several latent factors linked to clinical covariates in a dataset of patients suffering from coronary syndromes and identify potential underlying multicellular immune programs from a previous project¹¹. We will use this dataset here, but the protocol can easily be applied to other contexts, including other omics.

The dataset consists of samples from patients with stable chronic coronary syndromes (CCS), acute coronary syndromes (ACS), and a control group with healthy coronaries (non-CCS) (Figure 1). ACS is caused by plaque rupture in preexisting CCS, leading to an acute disruption of blood flow to the myocardium and a subsequent ischaemic injury of the heart. This injury causes an inflammatory response by the immune system followed by a reparative phase, which lasts until several days after the acute event¹². To be able to characterize this immune response for ACS patients, blood samples were taken at four different timepoints: acute (TP1); after recanalization (14 [± 8] h) (TP2); 60 [± 12] h later (TP3); before discharge (6.5 [±1.5] days) (TP4) (Figure 1A). For CCS and patients with healthy coronaries, only one time point was available- (TP0). For all patients and time points different assays based on the blood samples were measured: clinical markers of inflammation (Creatine-Kinase (CK), CK-MB, Troponin, C-reactive protein (CRP)), scRNA-seq of peripheral blood mononuclear cells (PBMCs), cytokine analysis, plasma proteomics and prime-seq¹³ data of neutrophils.

Figure 1: Myocardial infarction multi-omic input dataset. Input dataset: The analyzed data includes blood samples from patients (n = 62) with acute coronary syndrome (ACS), chronic coronary syndromes (CCS), and patients with healthy coronaries (non-CCS). For ACS patients blood samples were included at four different time points (TP1-4), for CCS and non-CCS patients at a single time point (TP0). Each patient and time point combination is treated as a separate sample in the analysis. Different omic assays were measured on the samples: clinical blood tests (n = 125), scRNA-seq (n = 121), plasma-proteomics (n = 119), cytokine assay (n = 127) and neutrophil prime-seq (n = 121). Subsequently, the described protocol was applied to integrate the data across all omics and explore it using the MOFA model and further downstream analysis (factor analysis, pathway enrichment). Please click here to view a larger version of this figure.

As input for the workflow as presented here, we take raw counts from scRNA-seq data after processing with cellranger and quality control (QC) as, for example, outlined in the scanpy¹⁴ preprocessing tutorial. For cell-type annotation, we used the automated Azimuth¹⁵ pipeline. The counts are then aggregated on the sample level for each cell-type by taking the mean across all cells for each sample and cell-type (pseudobulk aggregation). Plasma-proteomics is included as normalized and median-centered intensities, and for neutrophils, we take the umi unique molecular identifier (UMI) exon counts from the prime-seq. On cytokine and clinical values, no previous preprocessing has been applied. Further details on the (experimental) data generation are outlined in the corresponding manuscript¹¹. As the results presented here are based on using the automated Azimuth annotation for cell-types in the scRNA-seq data compared to the marker based strategy that was used in the referenced publication the results presented here are similar but not exactly the same as presented in the publication. In the manuscript it could be shown that the cell-type annotation strategy does not change the main patterns and biological interpretations of the analysis but small changes in the exact values resulting from the model might vary. Overall the input data was a complex multi-dimensional dataset including different time points and measurement levels (single-cells vs. bulk) of more than 10,000 different features (genes, proteins, clinical values). A strict pre-processing and data harmonization strategy followed by MOFA analysis has been shown to be a useful and quick tool for exploring the data and extracting relevant immune program. Each time point and patient combination is treated as an independent sample in the MOFA analysis. Each data type and cell-type is considered a separate view in the MOFA analysis.

This protocol provides instructions for preparing the input data for the workflow, executing the different workflow steps, customizing configurations, interpreting the resulting figures, and iteratively adjusting the configurations based on the interpretations. An overview of the different steps of the protocol, the required input datasets at each step, and resulting figures and datasets is given by the technical workflow overview (Figure 2).

Figure 2: Technical workflow overview. Outline of the workflow for the analysis of the multi-omics dataset. The different elements are highlighted by different colors and symbols. Jupyter Notebooks belonging to the Data Preprocessing and Harmonization (1) step are colored in blue. Jupyter Notebooks belonging to the 'MOFA Model' (2) step are colored in orange. Jupyter Notebooks belonging to the 'Downstream Analysis' (3) step are colored in green. One Jupyter Notebook to be used for comparison of the results is colored in yellow. Configuration files where parameters for the execution of the workflow can be modified are highlighted in purple. Input datasets that are required to run the workflow are indicated by the dataset symbol and highlighted in grey. All figure outputs that are generated during the workflow execution are indicated by the magnifying glass symbol. Datasets generated during workflow execution are indicated as tables. In general, the workflow is executed sequentially: (1) Data Preprocessing and Harmonization consists of two steps: first generation of a pseudobulk table based on the scRNA-seq input data (01_Prepare_Pseudobulk) and subsequent integration and normalization of this data together with all the other sample-level (bulk) inputs (02_Integrate_and_Normalize_Data). Within this step via the configuration files, it is possible to configure for each dataset separately which of the indicated pre-processing and normalization steps (e.g., Sample Filter) should be applied. (2) 'MOFA Model': runs the MOFA model on the generated input of the first step with the configurations specified in the configuration file (03_MOFA_configs.csv) (3) 'Downstream Analysis': consists of three different notebooks which may be run independently of each other to generate insights into the generated MOFA results and associate them with sample meta-data (covariates) provided as input via the 'Sample Meta Data.csv' file. (4) 'Model Comparison': is a small separate step that may be used to compare different models generated in step 2. Please click here to view a larger version of this figure.

The workflow consists of several Jupyter Notebooks written in R and Python (knowledge of the R and Python language is not required to run the workflow but might be helpful in case errors appear). In various steps of the protocol, parameters are changed via configuration files ('.csv' files containing the postfix '_Configs' in the name). Within the protocol, we only outline the parameters that need to be changed starting from the default configuration.

Several other parameters may also be changed, for example to customize the pre-processing. A documentation of these parameters and explanations is given in the file 'Documentation_Config_Parameter', which is included in the downloaded repository.

Protocol

1. Preparations: Technical setup and installation NOTE: To run this program, have wget, git, and Apptainer preinstalled on the device. A guide for installing Apptainer on different systems (Linux, Windows, Mac) is given here: https://apptainer.org/docs/admin/main/installation.html. Installation information on git can be found here: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git. Depending on the size of the different input datasets, running the workflow on a suita…

Representative Results

Following the successful execution of the workflow, several tables and figures are generated as indicated by Figure 2. Figures are placed in the /figures folder (Figure 6, Figure 7, Figure 8, Supplementary Figure 1, Supplementary Figure 2, Supplementary Figure 3, Supplementary Figure 4), and tables will be placed in the…

Discussion

With the outlined protocol, a modular and extendible Jupyter-notebook-based workflow that can be used to quickly explore a complex multi-omics dataset is presented. The major parts of the workflow consist of the pre-processing and data harmonization part (offering different standard steps for filtering and normalization of the data), estimation of the MOFA⁹ model and some exemplary downstream analysis. One of the main critical steps is to pre-process and integrate, and harmonize the different omic…

Divulgaciones

The authors have nothing to disclose.

Acknowledgements

C.L. is supported by the Helmholtz Association under the joint research school "Munich School for Data Science – MUDS".

Materials

Apptainer	NA	NA	https://apptainer.org/docs/admin/main/installation.html
Compute server or workstation or cloud (Linux, Mac or Windows environment). Depending on the size of the different input datasets we recommend running the workflow on a suitable machine (in our setting we use: 16 CPU, 64GB Memory)	Any manufacturer	16 CPU, 64GB Memory	Large Memory is only required for the processing of the raw single cell data. After preprocessing the later analysis steps can also be performed on regular desktop or laptop computers
git	NA	NA	https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
GitHub	GitHub	NA	https://github.com/heiniglab/mofa_workflow

Referencias

Lähnemann, D., et al. Eleven grand challenges in single-cell data science. Genome Biol. 21 (1), 31 (2020).
Colomé-Tatché, M., Theis, F. J. Statistical single cell multi-omics integration. Curr Opin Syst Biol. 7, 54-59 (2018).
Hawe, J., Theis, F., Heinig, M. Inferring interaction networks from multi-omics data. Front Genet. 10, 535 (2019).
Hawe, J. S., et al. Network reconstruction for trans acting genetic loci using multi-omics data and prior information. Genome Med. 14 (1), 125 (2022).
Koh, H. W. L., Fermin, D., Vogel, C., Choi, K. P., Ewing, R. M., Choi, H. iOmicsPASS: network-based integration of multiomics data for predictive subnetwork discovery. NPJ Syst Biol Appl. 5, 22 (2019).
Ogris, C., Hu, Y., Arloth, J., Müller, N. S. Versatile knowledge guided network inference method for prioritizing key regulatory factors in multi-omics data. Sci Rep. 11, 6806 (2021).
Lee, C., van der Schaar, M. A variational information bottleneck approach to multi-omics data integration. 130, 1513-1521 (2021).
Singh, A., et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 35 (17), 3055-3062 (2019).
Argelaguet, R., et al. Multi-omics factor analysis-a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol. 14 (6), e8124 (2018).
Cantini, L., et al. Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer. Nature Commun. 12 (1), 124 (2021).
Pekayvaz, K., et al. Multiomic analyses uncover immunological signatures in acute and chronic coronary syndromes. Nature Medicine. 30 (6), 1696-1710 (2024).
Swirski, F. K., Nahrendorf, M. Cardioimmunology: the immune system in cardiac homeostasis and disease. Nat Rev Immunol. 18 (12), 733-744 (2018).
Janjic, A., et al. Prime-seq, efficient and powerful bulk RNA sequencing. Genome Biol. 23 (1), 88 (2022).
Wolf, F. A., Angerer, P., Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19 (1), 15 (2018).
Cao, Y., et al. Integrated analysis of multimodal single-cell data with structural similarity. Nucleic Acids Res. 50 (21), e121 (2022).
. Get Started – JupyterLab 4.1.0a4 documentation Available from: https://jupyterlab.readthedocs.io/en/latest/getting_started/overview.html (2024)
. MOFA2: training a model in R Available from: https://raw.githack.com/bioFAM/MOFA2_tutorials/master/R_tutorials/getting_started_R.html (2020)
De Silva, D., et al. Robust T cell activation requires an eIF3-driven burst in T cell receptor translation. eLife. 10, e74272 (2021).
Li, G., Liang, X., Lotze, M. HMGB1: The central cytokine for all lymphoid cells. Front Immunol. 4, 68 (2013).
Jassal, B., et al. The reactome pathway knowledgebase. Nucleic Acids Res. 48 (D1), D498-D503 (2020).
Argelaguet, R., et al. MOFA+: a statistical framework for comprehensive integration of multimodal single-cell data. Genome Biol. 21 (1), 111 (2020).
Velten, B., et al. Identifying temporal and spatial patterns of variation from multimodal data using MEFISTO. Nat Methods. 19 (2), 179-186 (2022).
Qoku, A., Buettner, F. Encoding domain knowledge in multi-view latent variable models: A Bayesian approach with structured sparsity. 206, 11545-11562 (2022).
. Multi-Omics Factor Analysis Available from: https://biofam.github.io/MOFA2/ (2024)
Mitchel, J., et al. Tensor decomposition reveals coordinated multicellular patterns of transcriptional variation that distinguish and stratify disease individuals. bioRxiv. , (2023).

This article has been published

Video Coming Soon

Keep me updated:

.

PDF

DOI

DOWNLOAD MATERIALS LIST

Citar este artículo

Losert, C., Pekayvaz, K., Knottenberg, V., Nicolai, L., Stark, K., Heinig, M. Application of Unsupervised Multi-Omic Factor Analysis to Uncover Patterns of Variation and Molecular Processes Linked to Cardiovascular Disease. J. Vis. Exp. (211), e66659, doi:10.3791/66659 (2024).

Application of Unsupervised Multi-Omic Factor Analysis to Uncover Patterns of Variation and Molecular Processes Linked to Cardiovascular Disease

Summary

Abstract

Introduction

Protocol

Representative Results

Discussion

Divulgaciones

Acknowledgements

Materials

Referencias

Citar este artículo

View Video

✖

To prove you're not a robot, please enter the text in the image below