I3: Enrich

Enrichment analyses for multiple data sources and complex statistical designs

In many statistical analyses of high-dimensional omics data, statistically significant gene groups are identified in addition to the most significant individual genes. The gene groups are often defined by a given, usually functional, context. Frequently examined gene groups are Gene Ontology (GO) groups, in which genes are grouped according to their involvement in biological processes or molecular functions. If many genes of a biologically defined gene group are differentially expressed between two conditions, e.g. measurements at two different concentrations, this suggests that the corresponding biological function might be intensified or weakened by the higher concentration (Goeman and Mansmann, 2008). Another application related to the research in this RTG is the comparison of two cell types, e.g. fibroblasts from intrinsically and from extrinsically aged skin, or hepatocyte-like cells derived from primary hepatocytes and from iPS cells.
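As a minimal illustration of the basic idea, the over-representation of significant genes in a single gene group can be assessed with an upper-tail hypergeometric test (equivalent to a one-sided Fisher's exact test). The sketch below is a generic textbook test, not the method developed in this project; the function name, the parameter names and the counts in the usage example are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_group, n_significant, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe    : number of genes measured in total
    n_group       : number of genes annotated to the gene group
    n_significant : number of genes called significant
    n_overlap     : number of significant genes inside the group
    """
    total = comb(n_universe, n_significant)
    upper = min(n_group, n_significant)
    # Sum the probabilities of observing n_overlap or more group members
    # among the significant genes when they are drawn at random.
    tail = sum(
        comb(n_group, k) * comb(n_universe - n_group, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts, chosen only for illustration.
p = enrichment_p_value(n_universe=10000, n_group=100,
                       n_significant=500, n_overlap=15)
print(f"enrichment p-value: {p:.3g}")
```

A small p-value indicates that the group contains more significant genes than expected by chance. The test treats genes as independent and each group in isolation, which is exactly the simplification that breaks down for overlapping groups.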

In addition, calculating the statistical significance of gene groups provides a global molecular profile with high biological interpretability, also with regard to structural, regulatory or enzymatic properties of the associated proteins. Since many gene groups share members, the group-level test statistics are highly correlated, which leads to many false negative results after ordinary adjustment for multiple testing. GO groups in particular are hierarchically linked in a complex structure. For this situation, methods have been developed that heuristically decorrelate the test statistics (Alexa et al., 2006; Mansmann and Meister, 2005). Our own method for this task, topGO (Alexa et al., 2006), is widely cited and applied.

This project deals with two extensions of this basic approach. First, methods of enrichment analysis will also be applied to protein measurements and SNP data, as well as to combinations of different data sources (SNPs, gene expression, protein expression), by defining corresponding gene sets. Proteins can be mapped to genes before the corresponding gene groups are evaluated (Laukens et al., 2015). Groups of SNPs in genomic regions can also be mapped to genes. Associations between, and averages over, the gene groups identified from the different sources then allow an integrative analysis of the different genomic data types.
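The mapping of SNPs to genes mentioned above can be sketched as a simple positional lookup. This is only one possible convention, assuming that a SNP is assigned to a gene when it lies within the gene's coordinates extended by a flanking window; the function name, the `flank` parameter, and all identifiers and coordinates are hypothetical toy values.

```python
from collections import defaultdict


def map_snps_to_genes(snps, genes, flank=0):
    """Assign SNPs to genes by genomic position.

    snps  : {snp_id: (chromosome, position)}
    genes : {gene_id: (chromosome, start, end)}
    flank : number of bases added on both sides of each gene (assumption)
    Returns {gene_id: set of snp_ids}; genes without SNPs are omitted.
    """
    hits = defaultdict(set)
    for snp_id, (chrom, pos) in snps.items():
        for gene_id, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start - flank <= pos <= end + flank:
                hits[gene_id].add(snp_id)
    return dict(hits)


# Toy coordinates, invented for illustration only.
genes = {"GENE_A": ("chr1", 1000, 2000), "GENE_B": ("chr1", 5000, 6000)}
snps = {"snp1": ("chr1", 1500), "snp2": ("chr1", 2050), "snp3": ("chr2", 1500)}

print(map_snps_to_genes(snps, genes, flank=100))
```

Once SNPs (and, analogously, proteins) are expressed as gene identifiers, the same gene-group definitions can be reused for every data source. The nested loop is quadratic and only suitable for toy data; an interval index would be needed at genome scale.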

Second, statistical designs that are more complex than two-group comparisons will be used in this project. On the one hand, two-group comparisons will be generalised to multivariate regression models using additional covariates (toxicological, epidemiological, or experimental). An example is an interaction effect between exposure time and concentration (cf. project P2), for which the underlying biological processes should be identified. On the other hand, covariates will be directly used as target variables to evaluate their molecular relevance. Finally, molecular data from different sources will be combined for these designs in order to identify relevant molecular signalling pathways.
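To make the interaction example concrete, consider the simplest case of two exposure times and two concentrations, each coded 0/1. In a linear model with main effects and a product term, the interaction coefficient for a gene then equals the difference of differences of the four cell means. The sketch below assumes this 2x2 layout; the function name and the expression values are hypothetical, and the designs in this project may be considerably more general.

```python
from statistics import mean


def interaction_effect(expr):
    """Interaction estimate for one gene in a 2x2 design.

    expr : {(time, conc): [replicate expression values]},
           with time and conc each coded as 0 or 1.
    Returns (m11 - m10) - (m01 - m00), where m_tc is the cell mean.
    """
    m = {cell: mean(values) for cell, values in expr.items()}
    return (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])


# Hypothetical log-expression values for a single gene.
toy_gene = {
    (0, 0): [1.0, 1.2],  # short exposure, low concentration
    (0, 1): [1.5, 1.7],  # short exposure, high concentration
    (1, 0): [2.0, 2.2],  # long exposure, low concentration
    (1, 1): [4.0, 4.2],  # long exposure, high concentration
}

print(interaction_effect(toy_gene))
```

A value far from zero means that the concentration effect differs between the two exposure times. Per-gene interaction statistics of this kind, rather than two-group statistics, would then enter the gene-group analysis.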

The methods will be applied to various data sets available to and generated in the RTG. One major application, in collaboration with project I2, will be gene expression data (RNA-seq, possibly also proteomics) generated from liver samples collected at specific time points during the development of liver steatosis in mice subjected to a 30-week feeding period with a Western diet. This time course covers the initiation of fat accumulation in the liver (steatosis), the development of an inflammatory condition (steatohepatitis) and the progression to fibrosis. These three crucial stages are confirmed by recorded pathophysiological parameters such as quantification of fat, immune cell infiltration and collagen deposition.

References

Alexa A, Rahnenführer J, Lengauer T (2006). Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22(13), 1600-7, doi: 10.1093/bioinformatics/btl140.
Goeman JJ, Mansmann U (2008). Multiple testing on the directed acyclic graph of Gene Ontology. Bioinformatics 24(4), 537-44, doi: 10.1093/bioinformatics/btm628.
Laukens K, Naulaerts S, Berghe WV (2015). Bioinformatics approaches for the functional interpretation of protein lists: from ontology term enrichment to network analysis. Proteomics 15(5-6), 981-96, doi: 10.1002/pmic.201400296.
Mansmann U, Meister R (2005). Testing differential gene expression in functional groups. Goeman's Global Test versus an ANCOVA Approach. Methods of Information in Medicine 44(3), 449-53, doi: 10.1055/s-0038-1633992.