I8: Clust-DiffExp-RNA

Integrative analysis of clustering and differential expression for single-cell RNA data

In project I8, we will develop integrative statistical methods for clustering cells and identifying differentially expressed genes based on single-cell RNA sequencing (scRNA-seq) data. A special focus will be on uncertainty quantification within a Bayesian framework.

Single-cell RNA sequencing promises insight into gene expression patterns across individual cells and into how complex tissues coordinate their response to toxic compounds or environmental threats. In particular, it enables data-driven clustering of cells based on gene expression measurements and the identification of differentially expressed genes across cell types, tissues, and environmental conditions. Specifically, we are interested in the scarcely described interplay between the two prominent UVB-responsive environmental sensor proteins HIF-1a (Faßbender et al., 2022) and AhR (Pollet et al., 2018) in epidermal keratinocytes, the major target cells of UVB irradiation. Our own preliminary experiments revealed that deletion of these key proteins from keratinocytes leads to an unexpected multifactorial “UVB-exposed-like” phenotype in naïve mice, which involves several other cell types. Therefore, we aim to identify aberrant pathways and cellular interactions in order to uncover the mechanism orchestrating this phenotype.

Traditional scRNA-seq data analysis pipelines treat cell clustering and differential gene expression analysis as two separate steps, e.g., by first clustering the cells and subsequently testing for gene expression differences between clusters to estimate up- or down-regulation of genes and annotate cell types. However, recent work has highlighted the selective inference (“double-dipping”) problem that arises when differential expression analysis is performed without accounting for the preceding data-driven clustering of the same data (e.g., Chen and Witten, 2023). In this project, we will instead focus on Bayesian approaches for the integrative analysis of cell clustering and differential gene expression for scRNA-seq data from mouse models. The approaches will allow for the incorporation of prior biological knowledge (e.g., on specific cell types and genes) and will provide uncertainty quantification regarding clustering (e.g., Ickstadt et al., 2018). In particular, we will develop scalable extensions of flexible Bayesian modelling approaches, such as mixture models with an unknown number of components (clusters), to simultaneously cluster observations (cells) and select important discriminating variables (genes) via the introduction of latent variable selection indicators (Tadesse et al., 2005; Klein et al., 2014). We will compare our approaches to recent post-selection inference methods (e.g., Chen and Witten, 2023).
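To illustrate the double-dipping problem described above, the following minimal sketch simulates a single gene with no true group structure, clusters the cells on that gene, and then tests the same gene between the resulting clusters. It is only a toy example under stated assumptions: the 1-D 2-means step stands in for a generic clustering method, and the function names `two_means_1d` and `welch_t` are hypothetical helpers introduced here, not part of the project's methods.

```python
import random
import statistics


def two_means_1d(values, n_iter=20):
    """1-D 2-means (Lloyd's algorithm); returns a 0/1 cluster label per value."""
    c0, c1 = min(values), max(values)  # start the two centres at the extremes
    labels = [0] * len(values)
    for _ in range(n_iter):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        c0 = statistics.fmean([v for v, lab in zip(values, labels) if lab == 0])
        c1 = statistics.fmean([v for v, lab in zip(values, labels) if lab == 1])
    return labels


def welch_t(a, b):
    """Welch two-sample t statistic for the difference in means of a and b."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se


rng = random.Random(1)
# One "null" gene: all 200 cells come from the same distribution.
expr = [rng.gauss(0.0, 1.0) for _ in range(200)]

# Double dipping: cluster on the gene, then test the same gene between clusters.
labels = two_means_1d(expr)
high = [v for v, lab in zip(expr, labels) if lab == 1]
low = [v for v, lab in zip(expr, labels) if lab == 0]
t_double_dip = welch_t(high, low)

# Reference: groups fixed before looking at the data (first vs. second half).
t_independent = welch_t(expr[:100], expr[100:])

print(f"t after clustering on the same data: {t_double_dip:.1f}")
print(f"t with data-independent groups:      {t_independent:.1f}")
```

Although the simulated gene carries no real signal, the statistic computed after clustering on the same data is far larger in magnitude than the one based on data-independent groups, which is why a naive test applied after clustering cannot be interpreted at face value.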

High-dimensional scRNA-seq data, comprising thousands (or more) of cells and genes, pose computational challenges, especially for the efficient implementation of fully Bayesian approaches. We will address this challenge by developing efficient Markov chain Monte Carlo (MCMC) algorithms to sample from the targeted posterior distributions. In particular, we will combine ideas from recent adaptive MCMC approaches for Bayesian variable selection (Staerk et al., 2024) with sampling-based strategies employing leverage scores for dimension reduction.
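As a rough illustration of what leverage-score-based subsampling can look like, the sketch below computes row leverage scores of a small simulated design matrix and draws a subsample of rows with probability proportional to these scores. Everything here is an assumption made for illustration: the simulated matrix `X`, the helper `leverage_scores` (a plain modified Gram-Schmidt QR that assumes full column rank), and the subsample size `m`. It is not the sampling strategy that will be developed in the project.

```python
import math
import random


def leverage_scores(X):
    """Row leverage scores of X: squared row norms of Q from a thin QR.

    Q is obtained by modified Gram-Schmidt; X must have full column rank.
    """
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    q_cols = []
    for col in cols:
        v = col[:]
        for q in q_cols:  # remove the components along earlier directions
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        q_cols.append([a / norm for a in v])
    return [sum(q[i] ** 2 for q in q_cols) for i in range(n)]


rng = random.Random(0)
n, p, m = 500, 3, 50  # rows (cells), columns (covariates), subsample size
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

scores = leverage_scores(X)
probs = [s / sum(scores) for s in scores]

# Sample m rows with replacement, proportional to leverage, and attach
# importance weights 1 / (m * prob) so that weighted sums stay unbiased.
idx = rng.choices(range(n), weights=probs, k=m)
weights = [1.0 / (m * probs[i]) for i in idx]

print(f"sum of leverage scores: {sum(scores):.6f} (number of columns: {p})")
```

The leverage scores sum to the number of columns of the design matrix, so rows with unusually large scores are the ones a uniform subsample would be most likely to miss.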

As raw scRNA-seq data consist of integer-valued counts with typically a large number of zeros (including “drop-outs”), we will also employ generalised linear models (GLMs) with Poisson, negative-binomial, or multinomial distributions, potentially incorporating zero-inflation. In addition, we will develop extensions to multi-donor settings under various conditions by using hierarchical mixed models to account for individual variability. Specifically, we will apply the methods to analyse gene expression changes in double and single gene knock-out models vs. wild-type models, assessing their impact on cellular responses to environmental stressors.
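For concreteness, one common way to combine a negative-binomial count model with zero-inflation is sketched below. The parametrisation (mean `mu`, size `r`, zero-inflation probability `pi`) and the function names are assumptions chosen for this illustration; the models used in the project may be parametrised differently and will additionally involve covariates and random effects.

```python
import math


def nb_logpmf(y, mu, r):
    """Log pmf of a negative binomial with mean mu and size (dispersion) r."""
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def zinb_logpmf(y, mu, r, pi):
    """Log pmf of a zero-inflated negative binomial, with 0 <= pi < 1.

    With probability pi the count is a structural zero; otherwise it is
    drawn from the negative binomial with mean mu and size r.
    """
    if y == 0:
        return math.log(pi + (1.0 - pi) * math.exp(nb_logpmf(0, mu, r)))
    return math.log1p(-pi) + nb_logpmf(y, mu, r)


mu, r, pi = 2.0, 1.5, 0.3  # illustrative values only
p_zero = math.exp(zinb_logpmf(0, mu, r, pi))
total = sum(math.exp(zinb_logpmf(y, mu, r, pi)) for y in range(500))

print(f"P(Y = 0) = {p_zero:.3f} (zero-inflation probability: {pi})")
print(f"sum of probabilities over y = 0..499: {total:.6f}")
```

The probability of a zero count is always at least `pi`, which is how the zero-inflation component accommodates an excess of zeros beyond what the negative binomial alone would produce.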

References

Chen YT, Witten DM (2023). Selective inference for k-means clustering. J Mach Learn Res, 24(152), 1-41. https://www.jmlr.org/papers/volume24/22-0371/22-0371.pdf
Faßbender S, Sondenheimer K, Majora M, …, Krutmann J, Weighardt H (2022). Keratinocytes counteract UVB-induced immunosuppression in mice through HIF-1a signaling. J Invest Dermatol, 142(4), 1183-1193. doi: 10.1016/j.jid.2021.07.185
Ickstadt K, Schäfer M, Zucknick M (2018). Toward integrative Bayesian analysis in molecular biology. Annual Review of Statistics and its Application, 5, 141-167. doi: 10.1146/annurev-statistics-031017-100438
Klein HU, Schäfer M, Porse BT, Hasemann MS, Ickstadt K, Dugas M (2014). Integrative analysis of histone ChIP-seq and transcription data using Bayesian mixture models. Bioinformatics, 30(8), 1154-1162. doi: 10.1093/bioinformatics/btu003
Pollet M, Shaik S, Mescher M, …, Krutmann J, Haarmann-Stemmann T (2018). The AHR represses nucleotide excision repair and apoptosis and contributes to UV-induced skin carcinogenesis. Cell Death Differ, 25(10), 1823-1836. doi: 10.1038/s41418-018-0160-1
Staerk C, Kateri M, Ntzoufras I (2024). A Metropolized adaptive subspace algorithm for high-dimensional Bayesian variable selection. Bayesian Anal, 19(1), 261-291. doi: 10.1214/22-BA1351
Tadesse MG, Sha N, Vannucci M (2005). Bayesian variable selection in clustering high-dimensional data. J Am Stat Assoc, 100(470), 602-617. doi: 10.1198/016214504000001565