R3: Gen-InterAna
Statistical assessment of gene-exposure and gene-gene interactions
In project R3 (Statistical assessment of gene-gene interactions in the construction of genetic risk scores), procedures based on Random Forests are developed that allow the direct identification of interactions that are important for, e.g., risk prediction, and the specification of the importance of these interactions for prediction and thus of their contribution to genetic risk scores. These and other procedures will then be employed to investigate whether a single whole-genome genetic risk score or an ensemble of several gene- or pathway-wise risk scores provides better risk prediction.
In recent years, genetic risk scores have frequently been generated using, e.g., regularised regression methods to summarise the effect of all measured (and imputed) SNPs from a certain gene, a pathway, or the whole genome (Lewis and Vassos, 2017). Considering such genetic risk scores instead of individual SNPs, which are often highly correlated and typically have only a slight influence on the outcome of interest, can improve the detection of associations of genes or pathways with this outcome (Hüls et al., 2017).
It is well known that interactions of several SNPs, rather than individual SNPs alone, are important for the risk of developing a (complex) disease. In the first phase of the RTG 2624, we investigated in project R3 the performance of tree-based regression and classification methods in the construction of genetic risk scores. Methods such as logic regression (Ruczinski et al., 2003) and Random Forests (Breiman, 2001) naturally consider interactions when generating logic trees or CART trees, respectively (Lau et al., 2022). Comparisons of these tree-based methods with, e.g., the elastic net, which is often employed to construct genetic risk scores (Hüls et al., 2017), were performed on both simulated and real data, e.g., from the SALIA study. The results indicate that it can be advantageous to use such tree-based methods in the construction of genetic risk scores, in particular when interactions are relevant risk factors, but also when there are only main effects (Lau et al., 2022).
To overcome the drawbacks of these tree-based procedures, such as the limited interpretability of predictions and difficulties with negligible marginal effects, we have developed a new tree-based regression and classification method called logicDT (Lau et al., 2024). Similar to the logic regression-based logic bagging (Schwender and Ickstadt, 2008), but in an improved way, logicDT enables the direct detection of interactions associated with the outcome of interest and the specification of the importance of these interactions for the prediction of this outcome, and hence for the construction of the genetic risk score.
While Random Forests, arguably the most popular tree-based ensemble method, also provide measures of importance for individual variables, the interactions composing the hundreds or thousands of CART trees generated in an application of Random Forests can, due to the hierarchical nature of these trees, only be identified and assessed in an indirect fashion. Moreover, since a greedy search is used in the generation of CART trees, it is likely that the interactions comprised by these trees are too large, i.e., that they are composed of the truly influential interaction and additional variables that improve the prediction only slightly and by chance.
In the second phase of the RTG, one goal of project R3 is to develop a procedure based on Random Forests that enables the direct identification of the interactions composing the CART trees built by Random Forests and the direct measurement of the importance of these interactions for risk prediction. In this way, their contribution to the genetic risk score generated by Random Forests can be specified directly. We will exploit the fact that each CART tree can be transformed into a logic tree generated by logic regression, and vice versa (see Ruczinski et al., 2003). Thus, all CART trees will be transformed into logic trees, making the contained interactions easily identifiable. The importance measures used in logic bagging as well as in logicDT will be adapted to measure the contribution of these SNPs to the genetic risk score. Moreover, SNPs with a low contribution to the predictive power of an interaction will be removed from this interaction using approaches similar to the ones considered by Lau et al. (2024) and Tietz et al. (2019). Simulated data and real data, e.g., from the SALIA study, will be employed to identify the most powerful procedure for measuring the importance. This procedure will then be compared to other Random Forests importance measures for interactions as well as to the measures of logic bagging and logicDT in applications to simulated and real data.
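The transformation of a CART tree into a logic expression rests on a simple observation: for binary predictors (e.g., dominant or recessive codings of SNPs), every root-to-leaf path that ends in a leaf predicting class 1 is a conjunction of possibly negated variables, and the tree as a whole is equivalent to the disjunction of these conjunctions. The following is a minimal Python sketch of this idea only. The nested-tuple tree encoding, the function names, and the SNP names are assumptions made for illustration; they are not taken from the project, from logicDT, or from any Random Forests implementation.

```python
# Hypothetical tree encoding (an assumption for illustration only):
#   a leaf is an int (predicted class 0 or 1);
#   an inner node is a tuple (variable, subtree_if_0, subtree_if_1).

def tree_to_dnf(tree, path=()):
    """Collect the conjunctions (tuples of literals) that lead to class-1 leaves."""
    if isinstance(tree, int):
        return [path] if tree == 1 else []
    var, if_zero, if_one = tree
    return (tree_to_dnf(if_zero, path + ((var, False),))
            + tree_to_dnf(if_one, path + ((var, True),)))

def predict_tree(tree, x):
    """Follow the splits of the tree for one observation x (dict: variable -> 0/1)."""
    while not isinstance(tree, int):
        var, if_zero, if_one = tree
        tree = if_one if x[var] else if_zero
    return tree

def predict_dnf(dnf, x):
    """Evaluate the disjunction of conjunctions for one observation x."""
    return int(any(all(bool(x[var]) == positive for var, positive in conj)
                   for conj in dnf))

def format_dnf(dnf):
    """Render the logic expression; '!' marks a negated variable."""
    if not dnf:
        return "FALSE"
    terms = []
    for conj in dnf:
        literals = [var if positive else "!" + var for var, positive in conj]
        terms.append("(" + " & ".join(literals) + ")" if literals else "TRUE")
    return " | ".join(terms)

# Toy tree: split on SNP1 first, then on SNP2 (left branch) or SNP3 (right branch).
example_tree = ("SNP1", ("SNP2", 0, 1), ("SNP3", 1, 0))
dnf = tree_to_dnf(example_tree)
print(format_dnf(dnf))  # (!SNP1 & SNP2) | (SNP1 & !SNP3)
```

In this toy example, the two conjunctions are the candidate interactions contained in the tree. Their importance, and the pruning of SNPs that contribute little to an interaction, would have to be assessed separately, which the sketch does not attempt.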
A second goal of this project is to investigate whether combining all SNPs measured in a genome-wide association study into one genetic risk score or considering an ensemble of gene-wise genetic risk scores, each constructed using all available SNPs from the respective gene, leads to better risk prediction. For this, whole-genome and gene-wise genetic risk scores will be constructed using both tree-based methods and regularised regression procedures, and the performances of the whole-genome genetic risk score and the ensemble of genetic risk scores will be compared in applications to simulated and real data. To investigate the best strategy for combining the ensemble of genetic risk scores into a single genetic risk score, approaches such as averaging over the gene-wise genetic risk scores, considering the genetic risk scores in a regularised regression method, or applying Random Forests to the genetic risk scores will be considered. The latter approach would allow us to consider interactions of gene-wise genetic risk scores without the necessity to specify the interactions beforehand. Moreover, the importance measure devised in the first part of this project can then be used to identify the interactions of genetic risk scores composing the CART trees and to measure the contribution of the individual genetic risk scores as well as the contribution of the identified interactions. In this way, the importance of the genes and their identified interactions for the risk prediction can be quantified.
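The simplest of the combination strategies named above, averaging over gene-wise genetic risk scores, can be sketched as follows. This is an illustration under stated assumptions, not the procedure to be developed in the project: each score is taken to be a weighted sum of allele counts (0, 1, or 2), and the gene names, genotypes, and weights are made up. In practice, the weights would come from a fitted model such as the elastic net, and gene-wise scores may need to be brought to a common scale before averaging.

```python
def risk_score(genotypes, weights):
    """Weighted sum of allele counts for one individual (assumed score form)."""
    return sum(w * g for g, w in zip(genotypes, weights))

def gene_wise_scores(genotypes_by_gene, weights_by_gene):
    """One risk score per gene, using only the SNPs assigned to that gene."""
    return {gene: risk_score(genotypes_by_gene[gene], weights_by_gene[gene])
            for gene in genotypes_by_gene}

def average_score(scores):
    """Combine the gene-wise scores into a single score by simple averaging."""
    return sum(scores.values()) / len(scores)

# Made-up data for one individual: allele counts per SNP, grouped by gene.
genotypes_by_gene = {"geneA": [0, 1, 2], "geneB": [2, 1]}
# Made-up weights (hypothetical; not estimated from any data).
weights_by_gene = {"geneA": [0.5, -0.25, 0.25], "geneB": [0.5, 1.0]}

scores = gene_wise_scores(genotypes_by_gene, weights_by_gene)
combined = average_score(scores)
print(scores)    # {'geneA': 0.25, 'geneB': 2.0}
print(combined)  # 1.125
```

The other two strategies would replace `average_score` by a model fitted on the gene-wise scores, i.e., a regularised regression or a Random Forest with the scores as predictors.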
References
- Breiman L (2001): Random Forests. Machine Learning 45, 5-32. doi: 10.1023/A:1010933404324
- Hüls A, Ickstadt K, Schikowski T, Krämer U (2017): Detection of gene-environment interactions in the presence of linkage disequilibrium and noise by using genetic risk scores with internal weights from elastic net regression. BMC Genetics 18, 55. doi: 10.1186/s12863-017-0519-1
- Lau M, Wigmann C, Kress S, Schikowski T, Schwender H (2022): Evaluation of tree-based statistical learning methods for constructing genetic risk scores. BMC Bioinformatics 23, 97. doi: 10.1186/s12859-022-04634-w
- Lau M, Schikowski T, Schwender H (2024): logicDT: a procedure for identifying response-associated interactions between binary predictors. Machine Learning 113, 933-992. doi: 10.1007/s10994-023-06488-6
- Lewis CM, Vassos E (2017): Prospects for using risk scores in polygenic medicine. Genome Medicine 9, 96. doi: 10.1186/s13073-017-0489-y
- Ruczinski I, Kooperberg C, LeBlanc M (2003): Logic Regression. Journal of Computational and Graphical Statistics 12, 475-511. doi: 10.1198/1061860032238
- Schwender H, Ickstadt K (2008): Identification of SNP interactions using logic regression. Biostatistics 9, 187-198. doi: 10.1093/biostatistics/kxm024
- Tietz T, Selinski S, Golka K, Hengstler JG, Gripp S, Ickstadt K, Ruczinski I, Schwender H (2019): Identification of interactions of binary variables associated with survival time using survivalFS. Archives of Toxicology 93, 585-602. doi: 10.1007/s00204-019-02398-6