Fakultät Statistik

I5: Omics-RegInt

Integration of different omics data with regression methods

In project I5 (Integration of different omics data with regression methods), the effect of toxicological compounds on tissues or organisms will be investigated using genomic, transcriptomic and proteomic data. As omics data are typically high-dimensional, the following concept is applied.

In a first step, during the first phase of the RTG, multi-omics interactions between a transcriptomics and a proteomics data set were analysed. The data originated from mice that were exposed to carbon tetrachloride (CCl4). It is known that the amounts of RNA and protein of a gene are positively correlated, but in real data, RNA expression often explains only a relatively small fraction of the variance of protein levels. This motivates applying conventional regression models to investigate the single-pair relationship between RNA expression and protein levels (Heiner et al., 2023). This approach can be extended by grouping genes based on co-expression and differentiation-pattern plot (DiPa plot; Nell et al., 2022) analysis. The resulting gene groups are used to build high-dimensional, regularised regression models. In these models, genes from the same group serve as additional covariates for modelling protein levels, with the goal of improving the prediction of protein levels from transcriptomics data.
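The single-pair relationship described above can be illustrated with a minimal sketch: an ordinary least-squares fit of the protein level of one gene on its RNA expression, together with the fraction of variance explained (R²). The data values and the function name `fit_simple_regression` are made up for illustration and are not taken from the project.

```python
# Hypothetical toy data: RNA expression and protein level of a single gene
# across eight samples (values are invented for illustration only).
rna = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
protein = [2.1, 2.0, 3.5, 2.9, 4.8, 4.1, 5.9, 5.2]


def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared), where r_squared is the
    fraction of the variance of y explained by x.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centred sums of squares and cross-products
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


intercept, slope, r_squared = fit_simple_regression(rna, protein)
print(f"slope = {slope:.3f}, R^2 = {r_squared:.3f}")
```

In a real analysis, such a fit would be repeated per gene, and a low R² for a gene would indicate that RNA expression alone explains little of the variation in its protein level.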

As a next step, the impact of omics baseline levels will be investigated in order to explain the bivariate patterns observed in the aforementioned DiPa plot. Genes and proteins with rather low baseline expression at time zero potentially have higher ratios, whereas high baseline expression is associated with rather low ratios. This observation motivated a deeper analysis of the impact of baseline expression on the bivariate pattern. The initial approach was a generalised additive model with the section of the DiPa plot as a multinomial target variable. We plan to refine this regression approach, e.g. by adapting copula regression models.
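To make this model class concrete, a multinomial-logit additive model of this kind could be written as follows. This is only a sketch under assumptions that are not taken from the project description: the DiPa plot is assumed to be divided into K sections with section K as the reference category, and the symbols for the baseline RNA and protein levels of gene i as well as the smooth functions f_k and g_k are illustrative notation.

```latex
\[
P\bigl(Y_i = k \mid b_i^{\mathrm{RNA}}, b_i^{\mathrm{Prot}}\bigr)
  = \frac{\exp(\eta_{ik})}{1 + \sum_{l=1}^{K-1} \exp(\eta_{il})},
  \qquad k = 1, \dots, K-1,
\]
\[
P\bigl(Y_i = K \mid b_i^{\mathrm{RNA}}, b_i^{\mathrm{Prot}}\bigr)
  = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\eta_{il})},
\]
\[
\eta_{ik} = \beta_{0k} + f_k\bigl(b_i^{\mathrm{RNA}}\bigr) + g_k\bigl(b_i^{\mathrm{Prot}}\bigr).
\]
```

Here Y_i denotes the section into which gene i falls, so each non-reference section has its own pair of smooth baseline effects.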

Further, in this project different data sources will be considered and modelled. Targets in these different data types can be binary, categorical, continuous, or counts; hence the classes of Generalised Linear Models (GLMs) and Generalised Additive Models (GAMs) are useful, covering, e.g., logistic, multinomial, Poisson, and ordinary linear regression. Depending on the dimensionality of the predictor space, different modelling strategies can be suitable. For p ≪ n, i.e. far fewer predictors than observations, conventional, unregularised regression models can be used. To determine which genes are relevant, standard significance-based approaches or subset selection approaches based on test statistics can be employed. However, if many predictors are present, multicollinearity issues become relevant and the usual stability problems of forward-backward selection algorithms occur, which are due to the inherent discreteness of the method.
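For orientation, the common structure of these model classes can be summarised in standard textbook notation. The symbols (response y_i, predictor vector x_i with p components, link function g) are generic notation introduced here, not taken from the project description.

```latex
\[
g(\mu_i) = \eta_i = \beta_0 + x_i^{\top}\beta,
\qquad \mu_i = \mathrm{E}(y_i \mid x_i),
\]
\[
\begin{array}{lll}
\text{linear regression:}   & y_i \sim \mathcal{N}(\mu_i, \sigma^2),   & g(\mu) = \mu, \\[2pt]
\text{logistic regression:} & y_i \sim \mathrm{Bernoulli}(\mu_i),      & g(\mu) = \log\dfrac{\mu}{1-\mu}, \\[6pt]
\text{Poisson regression:}  & y_i \sim \mathrm{Poisson}(\mu_i),        & g(\mu) = \log \mu.
\end{array}
\]
\[
\text{GAM:}\qquad \eta_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}).
\]
```

The multinomial case generalises logistic regression by using one linear predictor per non-reference category, and a GAM replaces the linear terms by smooth functions f_j.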

Moreover, variable selection can be desirable. In this situation, it is therefore preferable to use suitable regularisation methods such as ridge, lasso, or component-wise boosting. Depending on the outcome distribution and on the predictor structure (metric vs. categorical covariates, linear vs. nonlinear effects), the researcher has to decide which statistical modelling approach to choose. For example, if categorical predictors are to be included, group lasso (Meier et al., 2008) or fused lasso approaches (Gertheiss and Tutz, 2010) should be used instead of the ordinary lasso. If selection between linear and nonlinear effects is desired, specific boosting algorithms are suitable (see, e.g., Groll and Tutz, 2012). For some combinations of outcome distribution and predictor structure, methodological extensions might also become necessary.
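How a lasso penalty performs variable selection can be sketched with a small cyclic coordinate-descent routine for the ordinary (linear, Gaussian) lasso. This is an illustrative toy implementation, not the software used in the project; the function names, the penalty value and the tiny orthogonal design are assumptions chosen so that the effect is easy to see.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows. Columns of X and the response y are assumed to be
    centred, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta because beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant (all-zero after centring) column carries no information
            # Covariance of column j with the partial residual that excludes j's own fit
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Hypothetical toy design: three centred, mutually orthogonal predictors.
X = [[1.0, 1.0, 1.0],
     [1.0, -1.0, -1.0],
     [-1.0, 1.0, -1.0],
     [-1.0, -1.0, 1.0]]
# The response depends strongly on predictor 0, weakly on predictor 1, not on predictor 2.
y = [2.0 * row[0] + 0.1 * row[1] for row in X]

beta_lasso = lasso_coordinate_descent(X, y, lam=0.5)
beta_unpenalised = lasso_coordinate_descent(X, y, lam=0.0)
print("lasso:", beta_lasso)
print("unpenalised:", beta_unpenalised)
```

With the penalty switched on, the weak effect of predictor 1 is set exactly to zero while the strong effect of predictor 0 is shrunk but retained. Group lasso and fused lasso replace the plain L1 penalty with penalties on groups of coefficients or on coefficient differences, which this sketch does not cover.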

References