Fakultät Statistik

R4: RegImage-Tox

Statistical assessment of gene-exposure and gene-gene interactions

In project R4 (High-dimensional regression for image data in toxicology), we will develop regression methods for image data to answer toxicological questions, using high-dimensional molecular data and other categorical and continuous variables as regressors.

The Leibniz institute IfADo (Jan Hengstler) has recently coined the term tissue cartography for the characterisation of tissue by combining various image and molecular data; see project I6 for more details. In project I6, the focus is on combining data from different imaging techniques for classification problems. In project R4, we will tackle regression problems, where the response variable is continuous rather than categorical. A major application will be modelling properties of tumours that mice developed in toxicological experiments; see project I6 for an exemplary scenario. Typical response variables in such a case are the tumour size or tumour weight of the corresponding subject (mouse). An important toxicological challenge is to identify molecular variables that influence the response variable of interest, conditional on further variables that characterise the underlying subjects, such as mouse body weight, age, or sex.

For simplicity, basic classical regression models often assume that the effects of predictors are linear. However, this is a very strong and restrictive assumption. In reality, certain metric covariates often have more complex, not necessarily linear effects on the response. For example, body weight might have an important influence only for extreme values, i.e. for very small or very large values. Since prior knowledge of the exact form of such a non-linear effect is often not available in toxicology, it is desirable that the form of the effect is determined automatically and in a data-driven way within the framework of statistical estimation.

The additive model based on smoothing splines is particularly suitable for this purpose (see, e.g., Wood, 2017). If response distributions other than the normal distribution are used, e.g., for count data, such a model is more generally called a Generalised Additive Model (GAM). In this situation, a typical approach is to expand a metric covariate that potentially exhibits a non-linear effect in a number m of basis functions. A frequently used class of basis functions is the so-called B-spline basis. To ensure sufficient flexibility, a relatively large number of basis functions is usually selected (e.g. m=20). To avoid overfitting, the roughness of the spline is penalised. A classic example is penalised B-splines (called P-splines; see Eilers and Marx, 2021). In this framework, complex non-linear interactions can also be modelled using bivariate tensor product splines.
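The basis expansion and roughness penalty described above can be sketched in a few lines of NumPy. This is a minimal illustration for a single covariate with a normal response, not the project's implementation: the function names `bspline_basis` and `fit_pspline`, the cubic degree, the second-order difference penalty, the simulated data and the two values of the smoothing parameter are all assumptions made for the example; only m=20 is taken from the text.

```python
import numpy as np


def bspline_basis(x, n_basis=20, degree=3, lo=0.0, hi=1.0):
    """Evaluate n_basis B-spline basis functions on equally spaced knots."""
    x = np.asarray(x, dtype=float)
    n_inner = n_basis - degree              # number of knot intervals in [lo, hi]
    step = (hi - lo) / n_inner
    # Equally spaced knots, extended by `degree` extra knots on each side.
    knots = lo + step * np.arange(-degree, n_inner + degree + 1)
    # Degree 0: indicator functions of the half-open knot intervals.
    B = ((x[:, None] >= knots[None, :-1]) & (x[:, None] < knots[None, 1:])).astype(float)
    # Cox-de Boor recursion; with equidistant knots every denominator is d * step.
    for d in range(1, degree + 1):
        left = (x[:, None] - knots[None, :-d - 1]) / (d * step)
        right = (knots[None, d + 1:] - x[:, None]) / (d * step)
        B = left * B[:, :-1] + right * B[:, 1:]
    return B


def fit_pspline(x, y, n_basis=20, degree=3, lam=1.0, diff_order=2):
    """Penalised least squares: minimise ||y - B a||^2 + lam * ||D a||^2."""
    B = bspline_basis(x, n_basis, degree, lo=np.min(x), hi=np.max(x))
    # D takes differences of adjacent coefficients (the P-spline roughness penalty).
    D = np.diff(np.eye(n_basis), n=diff_order, axis=0)
    coef = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
    return coef, B


# Simulated example (assumed data): a smooth non-linear effect plus noise.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=200)

coef_wiggly, B = fit_pspline(x, y, lam=0.01)   # weak penalty, flexible fit
coef_smooth, _ = fit_pspline(x, y, lam=100.0)  # strong penalty, smoother fit
```

In practice the smoothing parameter would be chosen from the data, for example by cross-validation or a likelihood-based criterion, rather than fixed by hand as in this sketch.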

Since it is usually difficult for a toxicologist to decide which metric covariates should be modelled with linear and which with non-linear effects, questions of effect selection arise. A useful extension in this context is component-wise boosting for GAMs, which allows for automatic effect selection, so that individual covariate effects are included either linearly or non-linearly in the model, or are excluded completely (see, e.g., Groll and Tutz, 2012).
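The idea of effect selection by component-wise boosting can be illustrated with a simplified sketch. It is not the likelihood-based algorithm of Groll and Tutz (2012): it uses plain least-squares (L2) boosting, and the non-linear base-learner is a polynomial stand-in for a penalised spline. The names `make_learners` and `componentwise_boost`, the step size, the number of iterations and the simulated data are assumptions for illustration only.

```python
import numpy as np


def make_learners(X):
    """Two candidate base-learners per covariate: linear and non-linear deviation."""
    n = X.shape[0]
    learners = {}
    for j in range(X.shape[1]):
        x = X[:, j] - X[:, j].mean()
        learners[(j, "linear")] = x[:, None]
        # Polynomial stand-in for a spline: x^2 and x^3 with their constant and
        # linear parts projected out, so that only the curvature remains.
        lin = np.column_stack([np.ones(n), x])
        poly = np.column_stack([x**2, x**3])
        proj = lin @ np.linalg.lstsq(lin, poly, rcond=None)[0]
        learners[(j, "nonlinear")] = poly - proj
    return learners


def componentwise_boost(X, y, n_iter=50, nu=0.1):
    """In each iteration, update only the base-learner that fits the residuals best."""
    learners = make_learners(X)
    fit = np.full(len(y), y.mean())
    counts = {key: 0 for key in learners}
    for _ in range(n_iter):
        resid = y - fit
        best_key, best_rss, best_update = None, np.inf, None
        for key, Z in learners.items():
            beta = np.linalg.lstsq(Z, resid, rcond=None)[0]
            update = Z @ beta
            rss = np.sum((resid - update) ** 2)
            if rss < best_rss:
                best_key, best_rss, best_update = key, rss, update
        fit = fit + nu * best_update      # small step towards the best learner
        counts[best_key] += 1
    return fit, counts


# Simulated example (assumed data): covariate 0 acts linearly, covariate 1
# non-linearly, covariate 2 has no effect.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 3))
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

fit, counts = componentwise_boost(X, y)
selected = {key for key, c in counts.items() if c > 0}
```

Base-learners that are never picked drop out of the model, so `selected` shows which covariates enter linearly, non-linearly, or not at all. Stopping the iterations early is what keeps uninformative learners out; the number of iterations would normally be tuned rather than fixed.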

If the response distribution depends on several parameters (e.g., mean µ and variance σ² for the normal distribution), the model class of GAMs can be extended so that not only the mean, as usual, but also other distribution parameters are linked to covariates. This yields the model class of Generalised Additive Models for Location, Scale and Shape (GAMLSS).
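For the normal example, such a model might be written as follows. The smooth functions f_j and g_j, the intercepts β_0 and γ_0, and the log link for the standard deviation are illustrative choices, not specifications taken from the project description.

```latex
\begin{align*}
y_i &\sim \mathcal{N}(\mu_i, \sigma_i^2), \\
\mu_i &= \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}), \\
\log \sigma_i &= \gamma_0 + \sum_{j=1}^{p} g_j(x_{ij}).
\end{align*}
```

The log link keeps σ_i positive, so that covariates such as body weight can influence the variability of the response as well as its mean.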

If a large number of covariates is available, we will also apply classical regularisation approaches such as the least absolute shrinkage and selection operator (LASSO; Friedman et al., 2010) and its extensions (e.g. relaxed LASSO; adaptive LASSO) to stabilise the estimation and to identify the most relevant predictors. To account for complex combinations of non-linearities and interaction effects, we will also consider trees and random forests. Such machine learning algorithms are often well capable of modelling complex relationships, but this naturally comes with a loss of interpretability. However, modern methods from the field of ‘interpretable machine learning’ help to interpret complex non-linear algorithms (Lu et al., 2023), and we will apply such methods to better understand the relevance of the predictors in our toxicological regression models.
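How the LASSO identifies relevant predictors can be illustrated with a simplified cyclic coordinate descent in the spirit of Friedman et al. (2010). This is a sketch under stated assumptions, not the glmnet implementation: the names `soft_threshold` and `lasso_cd`, the fixed number of sweeps, the penalty value and the simulated data are chosen for illustration, and the covariates are assumed to be standardised and the response centred.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z towards zero by t and set it to exactly zero if |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes each column of X has mean 0 and mean square 1, and y has mean 0.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()                      # residual y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            # Covariance of column j with the partial residual (column j left out).
            rho = X[:, j] @ resid / n + beta[j]
            new = soft_threshold(rho, lam)
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


# Simulated example (assumed data): 50 covariates, only the first three matter.
rng = np.random.default_rng(0)
n, p = 100, 50
X_raw = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]
y_raw = X_raw @ true_beta + rng.normal(scale=0.5, size=n)

Xs = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # standardise covariates
yc = y_raw - y_raw.mean()                               # centre response

beta_hat = lasso_cd(Xs, yc, lam=0.2)
support = np.flatnonzero(beta_hat)                      # indices of selected covariates

# Smallest penalty for which every coefficient is zero.
lam_max = np.max(np.abs(Xs.T @ yc)) / n
beta_null = lasso_cd(Xs, yc, lam=lam_max)
```

The soft-thresholding step is what produces exact zeros, so that the selected covariates can be read off `support`. In applications the penalty would be chosen by cross-validation over a grid of values below `lam_max`.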

References

  • Eilers PHC, Marx BD (2021). Practical smoothing: The joys of P-splines. Cambridge University Press. https://psplines.bitbucket.io/
  • Friedman J, Tibshirani R, Hastie T (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. doi: 10.18637/jss.v033.i01
  • Groll A, Tutz G (2012). Regularization for generalized additive mixed models by likelihood-based boosting. Methods of Information in Medicine, 51(2), 168-177. doi: 10.3414/ME11-02-0021
  • Lu S, Swisher CL, Chung C, Jaffray D, Sidey-Gibbons C (2023). On the importance of interpretable machine learning predictions to inform clinical decision making in oncology. Frontiers in Oncology, 13, 1129380. doi: 10.3389/fonc.2023.1129380
  • Wood SN (2017). Generalized additive models: an introduction with R. Chapman and Hall/CRC. doi: 10.1201/9781315370279