R1: GenEnv-HDReg
High-dimensional regression for screening of important genetic and environmental factors
In project R1 (High-dimensional regression for screening of important genetic and environmental factors), we are developing variable selection methods for regression models. The methods can be used to screen for important genetic and environmental risk factors and their interactions. Although the NAKO, SALIA, and GINIplus cohort studies already have a relatively high number of observations n, this number is still far below the number of variables p (n ≪ p). The screening and variable selection procedures in this project will be applied to data from these studies, particularly to data from the NAKO study, which is currently the largest epidemiological study in Germany, with 200,000 participants across different study centres. Since some outcomes are binary and others are continuous, e.g., measured blood pressure or lung function data, we will consider logistic and linear regression models.
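As a minimal illustration of this setting (toy data only; the penalty strengths and the `screen` helper below are ad-hoc choices for this sketch, not the project's procedures), an L1-penalised linear or logistic model can serve as a baseline screening step depending on the outcome type:

```python
# Minimal illustration with simulated data: screen p >> n candidate variables
# with an L1-penalised linear model (continuous outcome, e.g. blood pressure)
# or logistic model (binary outcome). Penalty strengths are chosen ad hoc.
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 2000                                   # n << p
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 1.5                                     # five truly relevant variables
y_cont = X @ beta + rng.standard_normal(n)         # continuous outcome
y_bin = (y_cont > 0).astype(int)                   # binary outcome

def screen(X, y):
    """Return indices of variables with non-zero coefficients."""
    if set(np.unique(y)) <= {0, 1}:                # binary -> L1 logistic regression
        model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1,
                                   max_iter=5000)
    else:                                          # continuous -> LASSO
        model = Lasso(alpha=0.4, max_iter=10_000)
    model.fit(X, y)
    return np.flatnonzero(np.ravel(model.coef_))

print("selected (continuous):", screen(X, y_cont))
print("selected (binary):    ", screen(X, y_bin))
```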
The NAKO health study is a multi-centre study investigating the causes of common diseases such as cancer, diabetes, and infectious or cardiovascular diseases. It also investigates the effects of environmental factors such as air pollution, climate, and land use on health. The difficulty in assessing the impact of multiple environmental factors on health lies in the high collinearity between air pollutants, climate factors, and land use indices. Some of these influencing factors are highly correlated with one another, and their effects cannot be distinguished in simple regression models. To date, epidemiological studies still mostly investigate one air pollutant at a time, although its effect does not occur in isolation from other air pollutants or environmental stressors. In the second phase of the RTG, high-dimensional omics data will additionally be available for the NAKO study, for which variable selection is also essential.
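A small simulated example (illustrative stand-ins only, not NAKO data) shows how such shared variation translates into large variance inflation factors, and hence imprecise single-pollutant effect estimates, in a joint regression:

```python
# Toy illustration of collinearity: three "pollutant" variables driven by a
# common source have large variance inflation factors, VIF_j = 1 / (1 - R_j^2),
# so a joint regression cannot attribute an effect to any single one of them
# with precision. Simulated stand-ins only, not NAKO data.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
common = rng.standard_normal(n)                       # shared source, e.g. traffic
X = np.column_stack([common + 0.2 * rng.standard_normal(n) for _ in range(3)])

def vif(X, j):
    """Regress column j on the remaining columns and return 1 / (1 - R^2)."""
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ coef
    r2 = 1.0 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

print([round(vif(X, j), 1) for j in range(X.shape[1])])  # VIFs far above 5
```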
In the first phase of the RTG, we developed variable selection methods for screening variables in regression models involving large genetic data sets. The methods are intended for investigating the influence of single-nucleotide polymorphisms (SNPs) and their interactions on health outcomes. This is a p ≫ n problem. We introduced cross leverage scores (CLS) to detect interactions of variables while maintaining interpretability. We calculate the CLS as a measure of importance for each variable. The key idea for scaling to large data sets is to divide the data into smaller random batches or consecutive windows of variables. This avoids complex and time-consuming computations on high-dimensional matrices, since all computations are carried out only on small subsets of the data. We compare these methods with provable approximations of the CLS based on sketching, which aims to summarise the data succinctly (see, e.g., Geppert et al., 2017). In a simulation study, we show that the CLS are directly linked to the importance of a variable in the sense of an interaction effect. We further show that the approximation approaches allow the calculations to be performed efficiently on arbitrarily large data while preserving the interaction-detection capability of the CLS. This underlines their scalability to genome-wide data. Details are provided in a manuscript by Teschke et al. (2024) that is under revision. The methods are currently being generalised to include environmental factors and are being applied to the SALIA cohort.
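The following toy sketch illustrates only the windowing idea on simulated data. The per-variable score computed here (the cross-leverage between each variable, treated as a row, and an appended response row within each window's hat matrix) is an assumption made for this illustration and is not necessarily the CLS definition or aggregation used in Teschke et al. (2024):

```python
# Sketch of the windowing idea for scaling the computation to large p.
# The score below -- the cross-leverage between each variable (as a row) and the
# appended response row in the window's hat matrix -- is an illustrative
# stand-in; the actual CLS definition is that of Teschke et al. (2024).
import numpy as np

def windowed_scores(X, y, window_size=500):
    """Cross-leverage-type score per variable, computed one window of variables
    at a time, so that no matrix over all p variables is ever formed."""
    n, p = X.shape
    scores = np.zeros(p)
    for start in range(0, p, window_size):
        W = X[:, start:start + window_size]
        A = np.vstack([W.T, y])                 # rows: the window's variables + y
        U, _, _ = np.linalg.svd(A, full_matrices=False)
        H = U @ U.T                             # hat matrix of A (rows as points)
        scores[start:start + W.shape[1]] = np.abs(H[:-1, -1])  # cross leverage with y
    return scores

rng = np.random.default_rng(2)
n, p = 200, 10_000                              # p >> n, as in genome-wide data
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.standard_normal(n)
print(np.argsort(windowed_scores(X, y))[::-1][:5])   # top-ranked variables
```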
In the second phase of the RTG, we will pursue two goals. First, we will extend our CLS-based screening procedures to other types of variables, including continuous variables such as gene expression or metabolome data. Second, we will develop an efficient variable selection procedure as part of a regression analysis. In principle, this variable selection problem cannot be solved efficiently (see Foster et al., 2015), but efficient heuristics such as LASSO regression can be formulated. Previous sketching approaches aiming at optimal dimensionality reduction counteract the computational benefits of convex relaxations and efficient heuristics, since the resulting estimators become non-convex again in the sketch space. We will therefore relax the tight results of Mai et al. (2023) to sketches of sub-optimal target dimension that admit convex estimators, making them valuable in practical applications. This enables us to combine the sketching idea with variable selection methods that have proved effective in high dimensions, as suggested in Bommert et al. (2020), and makes the approach even more effective for variable selection in large-scale regression problems such as the NAKO health study, with its large number of observations n and an even higher number of omics and environmental variables (p ≫ n).
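The following toy example shows the generic sketch-and-solve pattern this builds on: a CountSketch-style compression of the observations followed by the convex LASSO on the compressed data. The sketch type, sketch size m, and penalty are ad-hoc choices for illustration and do not reproduce the estimators or bounds of Mai et al. (2023):

```python
# Generic sketch-and-solve illustration: compress the n observations with a
# CountSketch-style random projection S, then run the convex LASSO on the much
# smaller sketched problem. All sizes and penalties are ad hoc toy choices.
import numpy as np
from scipy import sparse
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, m = 4000, 2000, 800            # m: number of rows after compression
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0                       # five truly relevant variables
y = X @ beta + 0.5 * rng.standard_normal(n)

# CountSketch matrix S (m x n): each observation lands in one random bucket
# with a random sign, so applying S costs only one pass over the data.
rows = rng.integers(0, m, size=n)
signs = rng.choice([-1.0, 1.0], size=n)
S = sparse.csr_matrix((signs, (rows, np.arange(n))), shape=(m, n))

X_sketch, y_sketch = S @ X, S @ y    # sketched regression problem (m x p)
lasso = Lasso(alpha=5.0, max_iter=10_000)   # penalty ad hoc for unstandardised data
lasso.fit(X_sketch, y_sketch)
print("selected on the sketch:", np.flatnonzero(lasso.coef_))  # ideally 0..4
```

Because the compression is applied once to the observations, other selection methods that work well in high dimensions, such as those benchmarked in Bommert et al. (2020), can in principle be run on the same sketched data.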
References
- Bommert A, Sun X, Bischl B, Rahnenführer J, Lang M (2020). Benchmark for filter methods for feature selection in high-dimensional classification data. Computational Statistics & Data Analysis 143, 106839. doi: 10.1016/j.csda.2019.106839
- Foster DP, Karloff HJ, Thaler J (2015). Variable Selection is Hard. In: Proceedings of the 28th Conference on Learning Theory (COLT), p. 696-709. https://proceedings.mlr.press/v40/Foster15.html
- Geppert L, Ickstadt K, Munteanu A, Quedenfeld J, Sohler C (2017). Random projections for Bayesian regression. Statistics and Computing 27(1). doi: 10.1007/s11222-015-9608-z
- Mai T, Munteanu A, Musco C, Rao AB, Schwiegelshohn C, Woodruff DP (2023). Optimal Sketching Bounds for Sparse Linear Regression. In: Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS), p. 11288-11316. https://proceedings.mlr.press/v206/mai23a.html