I4: Omics-LongInt

Integrative analysis of longitudinal omics data

In project I4, statistical methods will be developed for the integrative analysis of omics data (e.g., genomic, transcriptomic, and proteomic data) collected at several time points after the administration of a toxicological compound to tissues.

At IfADo, for example, genome-wide RNA and protein expression was measured in human hepatocytes from six donors at eight time points (from 24 hours to 14 days) after administration of a slightly toxic but in vivo relevant concentration of paracetamol (1 mM). These expression data will be analysed integratively in this project.

Our starting point for the development of statistical methods for the integrative analysis of such longitudinal omics data is the externally centred correlation coefficient (Schäfer et al., 2009), which makes it possible to identify DNA regions that show consistent changes in different data types (e.g., underexpressed genes with a loss of DNA material). Other methods for integrative analysis typically consider the different data types individually first and combine the results of the individual analyses afterwards. In contrast, the externally centred correlation coefficient allows the influence of the different data types to be analysed directly and jointly in a multivariate way.
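The idea behind such a coefficient can be illustrated with a minimal sketch. It assumes that "externally centred" means the case-group measurements are centred on the means of an external reference group instead of their own means. The function name, argument names, and toy values are hypothetical, and the sketch omits any weighting or standardisation used in the published method.

```python
import numpy as np

def externally_centred_correlation(x_case, y_case, x_ref, y_ref):
    """Correlation-type coefficient for two data types measured in a case group.

    Assumption: both variables are centred on the reference-group means
    rather than on the case-group means (illustrative sketch only).
    """
    dx = np.asarray(x_case, dtype=float) - np.mean(x_ref)
    dy = np.asarray(y_case, dtype=float) - np.mean(y_ref)
    denom = np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    if denom == 0.0:
        return 0.0  # no deviation from the reference in at least one data type
    return float(np.sum(dx * dy) / denom)

# Toy values (hypothetical): all cases show a DNA loss and underexpression
# relative to the reference, but the two variables move in opposite
# directions within the case group.
x_case = [-1.0, -0.8, -1.2]   # e.g., copy number
y_case = [-2.0, -2.5, -1.5]   # e.g., expression
x_ref = [0.0, 0.1, -0.1]
y_ref = [0.0, 0.2, -0.2]

r_ec = externally_centred_correlation(x_case, y_case, x_ref, y_ref)
r_pearson = float(np.corrcoef(x_case, y_case)[0, 1])
# r_ec is large and positive (equally directed deviations from the reference),
# whereas the ordinary Pearson correlation within the case group is negative.
```

In this toy setting, centring on the reference group rewards a common shift of both data types in the same direction, which an ordinary within-group correlation does not capture.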

This coefficient, which was originally developed for the analysis of two data types, was embedded into a Bayesian mixture model so that the relationship between the different data types can be investigated and genetic variables can be classified into different groups (Klein et al., 2014). Subsequently, the coefficient was extended to the integrative analysis of several types of omics data and embedded into a hierarchical Bayesian model (Schäfer et al., 2017). By employing a conditional autoregressive prior distribution, this model also makes it possible to include a functional gene network in the integrative modelling. In this way, the analysis of genetic variables can be strengthened by the exchange of information between genes, as it is assumed that functionally related genes are correlated with respect to the degree of agreement of the differences that different data types show between two groups (e.g., cases and controls).
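How a conditional autoregressive prior lets neighbouring genes share information can be sketched with the common "proper CAR" parameterisation, in which the prior precision matrix is Q = tau * (D - rho * W), with W the adjacency matrix of the gene network and D the diagonal matrix of node degrees. This parameterisation, the values of rho and tau, and the four-gene network are assumptions made for illustration and are not taken from the cited model.

```python
import numpy as np

def car_precision(adjacency, rho=0.9, tau=1.0):
    """Precision matrix tau * (D - rho * W) of a proper CAR prior.

    Assumptions: symmetric 0/1 adjacency matrix, no isolated genes,
    and |rho| < 1 so that the matrix is positive definite.
    """
    W = np.asarray(adjacency, dtype=float)
    if not np.allclose(W, W.T):
        raise ValueError("adjacency matrix must be symmetric")
    degrees = W.sum(axis=1)
    if np.any(degrees == 0):
        raise ValueError("every gene needs at least one neighbour")
    return tau * (np.diag(degrees) - rho * W)

# Hypothetical network of four genes connected in a chain: 0 - 1 - 2 - 3
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

Q = car_precision(W)
prior_cov = np.linalg.inv(Q)  # implied prior covariance between gene-level effects
# Directly linked genes (0 and 1) have a larger prior covariance than
# genes that are far apart in the network (0 and 3).
```

Under this prior, the gene-level parameters of functionally related genes are positively correlated a priori, which is the mechanism by which information is exchanged between genes.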

These Bayesian methods were developed for experiments with few or no replicates and therefore do not take into account the heterogeneity between biological replicates. However, we recently modified the externally centred correlation coefficient and the Bayesian hierarchical model in which the coefficient is embedded, so that they can also be applied to data from large case-control studies (Klein et al., 2020). In a detailed comparison with other methods for the integrative analysis of different omics data types, this newly developed method showed the best results.

Important findings from the development of this method will also be considered in this project and verified for the integrative analysis of longitudinal omics data. These findings include that the different data types should be modelled additively rather than multiplicatively and that, instead of a general gene network, a more specific subnetwork adapted to the biological or toxicological problem at hand should be generated and used in the conditional autoregressive prior distribution.

In the development of the methods in this project, several approaches to the integrative analysis of longitudinal omics data will be considered, investigated, and compared. First, the externally centred correlation coefficient will be extended to the integrative analysis of several measurement points and, afterwards, to the additional consideration of several data types. Second, the correlation structure generated by the longitudinal data will be taken into account in the Bayesian modelling. In particular, the use of a conditional autoregressive prior distribution will be compared with modelling the longitudinal data based on a linear mixed model.

References

Klein HU, Schäfer M, Bennett DA, Schwender H, De Jager PL (2020). Bayesian integrative analysis of epigenomic and transcriptomic data identifies Alzheimer's disease candidate genes and networks. To appear in: PLOS Computational Biology.
Klein HU, Schäfer M, Porse BT, Hasemann MS, Ickstadt K, Dugas M (2014). Integrative analysis of histone ChIP-seq and transcription data using Bayesian mixture models. Bioinformatics 30(8), 1154-62, doi: 10.1093/bioinformatics/btu003.
Schäfer M, Klein HU, Schwender H (2017). Integrative analysis of multiple genomic variables using a hierarchical Bayesian model. Bioinformatics 33, 3220-7, doi: 10.1093/bioinformatics/btx356.
Schäfer M, Schwender H, Merk S, Haferlach C, Ickstadt K, Dugas M (2009). Integrated analysis of copy number alterations and gene expression: A bivariate assessment of equally directed abnormalities. Bioinformatics 25, 3228-35, doi: 10.1093/bioinformatics/btp592.