P12: ValuesBelowLimit
Statistical methods for addressing values below a lower limit of quantification
In project P12 (Statistical methods for addressing values below a lower limit of quantification), we will develop and extend statistical methods that address values below a lower limit of quantification. This challenge frequently arises with proteomic, secretomic and transcriptomic data, as generated and collected in toxicological studies.
At IUF, omics data are collected in several projects, for example secretomic and transcriptomic (RNA-seq) data for the comparison of intrinsically and extrinsically aged cells in project P4. Missing values are frequent in these data sets. Before statistical methods such as multiple imputation are applied to fill in missing values, the reason for the absence of data is evaluated and classified as “missing completely at random”, “missing at random”, or “missing not at random”. Values that are missing because of measurement limitations, i.e. because they lie below a lower or above an upper limit of quantification, usually receive too little attention.
A very frequent problem is that the value zero is stored by default for values below the lower limit of quantification. Then, for those zero entries one cannot distinguish whether a certain feature (e.g. an analyte) was actually absent (biological reason) or whether its value is below the lower limit of quantification or detection (technical reason). This problem is particularly prevalent in proteome and secretome data sets. While “ordinary” types of missing values can be addressed partially by implementing preventative strategies from the start of a study, the same cannot be done when it comes to values below a lower limit of quantification or detection.
Different methods have been established for addressing such zero-inflated intensity values for two-group comparisons, see, e.g., Gleiss et al. (2015) or van Reenen et al. (2017) for overviews in the context of metabolomic or proteomic data. A distinction is primarily made between one-part tests, where the values below the detection limit are considered part of the full distribution, and two-part tests. The latter describe with one part the occurrence of values below the detection limit and with another part the continuous sub-distribution. All groups of models are known to have their limitations, such as a lack of power, or that they focus only on the biological or technical reason for potential zeros (van Reenen et al., 2017).
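To make the two-part idea concrete, the following is a minimal sketch, not the procedure of any of the cited papers. It assumes that zeros encode values below the detection limit, uses a two-proportion z-statistic for the share of zeros and a Wilcoxon rank-sum z-statistic (normal approximation, midranks, no tie correction of the variance) for the positive values, and adds the two squared statistics. The function names `two_part_test` and `chi2_sf` are hypothetical and chosen for illustration only.

```python
import math


def chi2_sf(stat, df):
    """Upper tail of the chi-square distribution for df in {0, 1, 2}."""
    if df == 2:
        return math.exp(-stat / 2)
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    return 1.0  # neither part is defined: no evidence against the null


def two_part_test(x, y):
    """Two-part test for two groups of non-negative intensities.

    Assumption: an entry of 0 marks a value below the detection limit.
    Returns (statistic, p_value).
    """
    n1, n2 = len(x), len(y)
    x_pos = [v for v in x if v > 0]
    y_pos = [v for v in y if v > 0]
    m1, m2 = len(x_pos), len(y_pos)
    df = 0

    # Part 1: compare the proportions of zeros (two-proportion z-statistic).
    p_pooled = ((n1 - m1) + (n2 - m2)) / (n1 + n2)
    var_b = p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2)
    z_b = 0.0
    if var_b > 0:
        z_b = ((1 - m1 / n1) - (1 - m2 / n2)) / math.sqrt(var_b)
        df += 1

    # Part 2: compare the positive values (Wilcoxon rank-sum, normal approx.).
    z_w = 0.0
    if m1 > 0 and m2 > 0:
        pooled = sorted(x_pos + y_pos)
        rank = {}
        i = 0
        while i < len(pooled):
            j = i
            while j < len(pooled) and pooled[j] == pooled[i]:
                j += 1
            rank[pooled[i]] = (i + 1 + j) / 2  # midrank of ranks i+1..j
            i = j
        w = sum(rank[v] for v in x_pos)
        mean_w = m1 * (m1 + m2 + 1) / 2
        var_w = m1 * m2 * (m1 + m2 + 1) / 12
        z_w = (w - mean_w) / math.sqrt(var_w)
        df += 1

    stat = z_b ** 2 + z_w ** 2
    return stat, chi2_sf(stat, df)
```

A one-part test would instead apply a single test, e.g. the rank-sum test, to all values including the zeros. The sketch reduces the degrees of freedom when one part is undefined (for example, when no zeros occur at all), which is one common convention and an assumption here.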
In this project, we initially focus on extending and analysing methods that address the zero inflation described above for proteomic, secretomic or transcriptomic data. To this end, we will evaluate Bayesian mixture models, with a special focus on the underlying prior distributions, as well as censored regression models such as the Tobit model, for which different underlying distributions will be evaluated. Moreover, we will develop a new two-step approach. In the first step, the k-NN (k-nearest neighbours) algorithm is applied to a specific feature with a missing value. If a predefined proportion of the identified k nearest neighbours also has a zero value, the feature of interest is classified as biologically absent and represented by an actual zero. If the predefined proportion of zeros is not attained, one of the following three methods is applied at random in the second step: imputation by zero, imputation by the detection limit divided by two, or imputation by the average of the values of the k nearest neighbours. This method will be applied to and evaluated on data of the IUF, in particular for the comparison of the proteome and the secretome between intrinsically aged skin cells and extrinsically aged skin cells, and for the analysis of their mutual dependence regarding their development.
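The two-step approach described above could be sketched as follows. This is only an illustration under several assumptions that the project description does not fix: rows are samples and columns are features, neighbours are other samples, distances are Euclidean and computed on all remaining features, and the neighbour average includes zero values. The function name `two_step_impute`, its parameters and the label strings are hypothetical.

```python
import math
import random


def two_step_impute(data, k, zero_share, detection_limit, rng=None):
    """Classify and impute zero entries of a samples-by-features matrix.

    Returns (imputed_matrix, labels), where labels maps (row, column) of
    every zero entry to the decision that was taken for it.
    """
    rng = rng or random.Random(0)
    n_rows, n_cols = len(data), len(data[0])
    imputed = [row[:] for row in data]
    labels = {}

    def distance(a, b, skip):
        # Euclidean distance between samples a and b, ignoring feature `skip`.
        return math.sqrt(sum((data[a][c] - data[b][c]) ** 2
                             for c in range(n_cols) if c != skip))

    for i in range(n_rows):
        for j in range(n_cols):
            if data[i][j] != 0:
                continue
            # Step 1: find the k nearest samples and inspect feature j there.
            others = sorted((r for r in range(n_rows) if r != i),
                            key=lambda r: distance(i, r, j))
            neighbours = [data[r][j] for r in others[:k]]
            share = sum(v == 0 for v in neighbours) / len(neighbours)
            if share >= zero_share:
                labels[(i, j)] = "biological_zero"
                imputed[i][j] = 0.0
                continue
            # Step 2: pick one of the three imputation rules at random.
            method = rng.choice(("zero", "half_dl", "knn_mean"))
            if method == "zero":
                value = 0.0
            elif method == "half_dl":
                value = detection_limit / 2
            else:
                value = sum(neighbours) / len(neighbours)
            labels[(i, j)] = "technical:" + method
            imputed[i][j] = value
    return imputed, labels
```

Passing a seeded `random.Random` instance as `rng` keeps the random choice in the second step reproducible, which matters when the method is compared across simulation runs.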
In the second part of this project, we will evaluate the developed method in extensive simulation studies in which the sample size, the number of influential variables, and the percentage of zero values are varied. The distributions of the variables from the proteome, the secretome and the transcriptome data sets of the IUF are simulated in a manner similar to the approach used by Hafermann et al. (2021). In the data sets of the IUF, the intrinsic and extrinsic aging of the skin is compared between three age groups. Hence, we will extend the methods for addressing zero-inflated intensities from two-group comparisons to multiple-group comparisons.
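As a rough illustration of how one simulated feature with a controlled percentage of zeros might be generated, consider the sketch below. It is not the simulation design of Hafermann et al. (2021) or of this project. It assumes log-normally distributed intensities, three groups that differ only in their log-scale mean, and purely technical zeros produced by a detection limit set at an empirical quantile. The function name `simulate_feature` and all parameters are hypothetical.

```python
import random


def simulate_feature(n_per_group, group_means, sigma, zero_fraction, seed=0):
    """Simulate one intensity feature with values below a detection limit.

    Assumptions: log-normal intensities, one log-scale mean per group,
    and zeros that arise only from censoring at the detection limit.
    Returns (values, groups, detection_limit).
    """
    rng = random.Random(seed)
    values, groups = [], []
    for g, mu in enumerate(group_means):
        for _ in range(n_per_group):
            values.append(rng.lognormvariate(mu, sigma))
            groups.append(g)

    # Choose the detection limit so that the requested share of all
    # observations falls strictly below it.
    n_zero = round(zero_fraction * len(values))
    ranked = sorted(values)
    limit = ranked[n_zero] if n_zero < len(values) else float("inf")

    # Store zero for every value below the limit, as is common in practice.
    censored = [v if v >= limit else 0.0 for v in values]
    return censored, groups, limit
```

Because one common limit is applied to all groups, groups with lower means end up with more zeros, so the zero pattern itself carries group information. Biological zeros would have to be added as a separate mechanism.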
References
- Gleiss A, Dakna M, Mischak H, Heinze G (2015). Two-group comparisons of zero-inflated intensity values: the choice of test statistic matters. Bioinformatics, 31(14), 2310-2317. doi: 10.1093/bioinformatics/btv154
- van Reenen M, Westerhuis JA, Reinecke CJ, Venter JH (2017). Metabolomics variable selection and classification in the presence of observations below the detection limit using an extension of ERp. BMC Bioinformatics, 18, 1-13. doi: 10.1186/s12859-017-1480-8
- Hafermann L, Becher H, Herrmann C, Klein N, Heinze G, Rauch G (2021). Statistical model building: Background “knowledge” based on inappropriate preselection causes misspecification. BMC Medical Research Methodology, 21, 1-12. doi: 10.1186/s12874-021-01373-z


