I7: SparsePolygenMods
Integration of different omics data with regression methods
In project I7 (Sparse polygenic models for cross-population generalizability and causal environmental analysis), we will leverage large-scale genotype data from recent cohort studies to develop and evaluate polygenic risk score (PRS) models that predict phenotypes related to environmental health.
In particular, we aim to improve the generalizability across different populations by including fewer genetic variants than commonly used models. Additionally, selected variants from sparse PRS models will be incorporated as instruments in Mendelian randomization analyses to estimate causal effects of environmental exposures.
PRS models for predicting phenotypes and health outcomes are traditionally built from additive, univariately estimated effects of many genetic variants (single nucleotide polymorphisms, SNPs). Training PRS models poses computational and statistical challenges because genotype data are both large and high-dimensional: this concerns the sample sizes available in large cohort studies (e.g., n > 200,000 subjects) as well as the number of common SNPs that could potentially be included in the PRS models (e.g., p > 1,000,000 SNPs). While univariate effect estimates, publicly available as summary statistics from genome-wide association studies (GWAS), facilitate computations, high correlations between SNPs within the same genomic region (linkage disequilibrium, LD) complicate the accurate identification of causal variants for phenotypes. Furthermore, current PRS models, which are largely trained on UK Biobank data from individuals of predominantly British ancestry, show limited generalizability to other ancestries, with substantial reductions in prediction accuracy observed for populations not represented in the training data (e.g., Cavazos and Witte, 2021; Maj et al., 2022).
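The additive structure described above can be illustrated with a minimal sketch. It assumes genotypes coded as allele counts (0, 1 or 2) and uses a hypothetical helper `polygenic_score` with made-up toy effect sizes; none of these names or numbers come from the project itself.

```python
import numpy as np

def polygenic_score(genotypes, effects, intercept=0.0):
    """Additive PRS: intercept + sum_j effects[j] * genotypes[i, j].

    genotypes: (n_individuals, n_snps) array of allele counts (0, 1, 2).
    effects:   (n_snps,) array of per-allele effect estimates.
    """
    return intercept + genotypes @ effects

# Toy data (purely illustrative): 3 individuals, 4 SNPs.
G = np.array([[0, 1, 2, 0],
              [2, 0, 1, 1],
              [1, 1, 0, 2]])

# A sparse effect vector: only SNPs 0 and 2 contribute to the score.
beta = np.array([0.5, 0.0, -0.2, 0.0])

prs = polygenic_score(G, beta)
print(prs)
```

In a sparse model most entries of `beta` are exactly zero, so only the selected SNPs need to be genotyped and stored to apply the score in a new cohort.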
Using recent data from both the UK Biobank and from the German National Cohort (NAKO) studies, we will develop and evaluate sparse PRS models that base the SNP selection and effect estimation on multivariable regression models fitted on individual-level genotype data, to directly account for the correlation structure between variants, instead of relying on GWAS summary statistics. The NAKO health study is a large, multi-centre, prospective cohort study in Germany, investigating the causes of widespread diseases such as cancer, diabetes, infectious and cardiovascular diseases. Including about 207,000 participants at baseline, the study examines chronic diseases as well as the impact of genetic and environmental factors on health, such as air pollution, climate and land use.
We will employ and further develop recent scalable frameworks for fitting sparse PRS models via regularised regression such as the Lasso (snpnet, Qian et al., 2020) and statistical boosting approaches (snpboost, Klinkhammer et al., 2023), which select SNPs via L1-regularisation and early stopping of the boosting algorithm, respectively. In this context, we will also investigate new methods from causal inference to further pinpoint relevant SNPs for environmental health-related phenotypes. In particular, we aim to identify genetic variants with stable effect estimates across different populations, which can help in understanding the interplay between invariance, generalizability and causality (e.g., Bühlmann, 2020). Data from the ongoing NAKO study will not only allow us to externally validate PRS models developed on the UK Biobank, but will also provide the unique opportunity to train new PRS models using combined data from British and German populations. Recent work indicates that the cross-population generalizability of PRS models can be improved by including genetic variants discovered in multiple populations (e.g., Cavazos and Witte, 2021). Thus, deriving sparse PRS models based on multiple European cohort studies can enhance their robustness and generalizability, potentially also extending their applicability to non-European populations, which is crucial for equitable and effective PRS implementation in practice.
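To make the idea of SNP selection via L1-regularisation concrete, the following is a minimal sketch of a Lasso fitted by plain coordinate descent on simulated individual-level genotypes. It is not the snpnet or snpboost implementation and does not use their APIs; the function `lasso_cd`, the simulation settings (two causal SNPs, one proxy SNP in LD with the first) and the penalty value are all assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)              # centre genotypes
    resid = y - y.mean()                 # residual for b = 0
    col_var = (Xc ** 2).mean(axis=0)     # (1/n) * x_j' x_j
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            if col_var[j] == 0.0:
                continue                 # skip monomorphic SNPs
            # Covariance of SNP j with the partial residual (SNP j removed).
            rho = Xc[:, j] @ resid / n + col_var[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_var[j]
            resid += Xc[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

# Simulated data (assumption): 1000 individuals, 50 SNPs, allele frequency 0.3.
rng = np.random.default_rng(0)
n, p = 1000, 50
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)

# SNP 1 is a non-causal proxy in strong LD with causal SNP 0.
copy_mask = rng.random(n) < 0.9
X[copy_mask, 1] = X[copy_mask, 0]

causal = [0, 10]
y = X[:, causal].sum(axis=1) + rng.normal(0.0, 0.5, n)

# Univariate (GWAS-style) slopes versus the multivariable Lasso fit.
Xc = X - X.mean(axis=0)
marginal = (Xc.T @ (y - y.mean())) / (Xc ** 2).sum(axis=0)
beta_hat = lasso_cd(X, y, lam=0.1)
selected = np.flatnonzero(beta_hat)

print("marginal effect of proxy SNP 1:", round(marginal[1], 2))
print("SNPs selected by the Lasso:", selected)
```

In this toy setting the proxy SNP shows a large univariate effect purely because of LD, whereas the multivariable fit attributes the signal mainly to the causal SNPs and sets most other coefficients to zero. The Lasso coefficients are shrunken towards zero, which is the price of the penalty that produces the sparsity.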
Mendelian randomization (MR) is a powerful tool for estimating causal effects of environmental exposures on health outcomes. However, selecting SNPs as valid instruments in MR analyses is complicated by LD, which can lead to the selection of SNPs with pleiotropic effects, i.e., effects on the outcome of interest that do not act through the exposure (cf. Dudbridge, 2021). We will integrate the newly developed sparse PRS models in MR analyses, where the identified SNPs may serve as more robust instruments, reducing the risk of pleiotropy by focusing on fewer, more relevant variants. Specifically, we will investigate the impact of environmental exposures such as air pollution on chronic health conditions related to ageing. This approach has the potential not only to reveal causal environmental risk factors for various health conditions, but also to identify new prevention strategies for public health and personalised medicine.
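The logic of using a genetic score as an instrument can be sketched with a simple ratio (Wald) estimator on simulated data. This is an illustrative single-sample example under the assumption that the instrument is valid (no pleiotropy, independent of the confounder); the variable names, effect sizes and the five-SNP score are hypothetical and not taken from the project.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
true_effect = 0.5                      # assumed causal effect of exposure on outcome

# Instrument: a weighted allele score built from 5 independent SNPs.
snps = rng.binomial(2, 0.3, size=(n, 5)).astype(float)
weights = np.full(5, 0.3)
score = snps @ weights

# An unobserved confounder affects both exposure and outcome.
confounder = rng.normal(size=n)
exposure = score + confounder + rng.normal(size=n)
outcome = true_effect * exposure + confounder + rng.normal(size=n)

def slope(x, y):
    """Simple linear regression slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Naive regression of outcome on exposure is biased by the confounder.
ols_estimate = slope(exposure, outcome)

# Ratio estimator: (score -> outcome effect) / (score -> exposure effect).
wald_estimate = slope(score, outcome) / slope(score, exposure)

print("naive OLS estimate:", round(ols_estimate, 3))
print("MR ratio estimate: ", round(wald_estimate, 3))
```

If some SNPs in the score affected the outcome directly, the ratio estimate would be biased as well, which is why the choice of instruments matters.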
References
- Bühlmann P (2020). Invariance, causality and robustness. Stat Sci, 35(3), 404-426. https://www.jstor.org/stable/26997912
- Cavazos TB, Witte JS (2021). Inclusion of variants discovered from diverse populations improves polygenic risk score transferability. HGG Adv, 2(1), 100017. doi: 10.1016/j.xhgg.2020.100017
- Dudbridge F (2021). Polygenic Mendelian randomization. Cold Spring Harb Perspect Med, 11(2), a039586. doi: 10.1101/cshperspect.a039586
- Klinkhammer H, Staerk C, Maj C, Krawitz PM, Mayr A (2023). A statistical boosting framework for polygenic risk scores based on large-scale genotype data. Front Genet, 13, 1076440. doi: 10.3389/fgene.2022.1076440
- Maj C, Staerk C, Borisov O, Klinkhammer H, Wai Yeung M, Krawitz P, Mayr A (2022). Statistical learning for sparser fine-mapped polygenic models: The prediction of LDL-cholesterol. Genet Epidemiol, 46(8), 589-603. doi: 10.1002/gepi.22495
- Qian J, Tanigawa Y, Du W, Aguirre M, Chang C, Tibshirani R, ..., Hastie T (2020). A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genet, 16(10), e1009141. doi: 10.1371/journal.pgen.1009141