P6: ConcExp-Param

Design and analysis of concentration-exposure curves with common parameters

In project P6 (Design and analysis of concentration-exposure curves with common parameters), we use regression models with shared parameters to estimate concentration-response curves for different genes that belong to the same gene group.

Modelling concentration-response curves for different genes in the same group is typically done separately for each gene. If different genes in the same group have similar properties and concentration-response modelling is performed with parametric models, one might assume that some parameters of the curves for the different genes are shared, resulting in a more precise estimation of these parameters.
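The idea of shared parameters can be illustrated with a minimal sketch; it is not the project's actual implementation. It assumes a four-parameter log-logistic curve in which the slope b and the EC₅₀ e are common to all genes, while each gene keeps its own asymptotes c and d. The function names, the toy data and the grid search are all hypothetical choices made for illustration. In practice, a nonlinear optimiser would normally replace the grid.

```python
import math

def shape(x, b, e):
    """Shape term g(x) = 1 / (1 + (x / e)**b); equals 1 at x = 0 for b > 0."""
    return 1.0 if x == 0 else 1.0 / (1.0 + (x / e) ** b)

def curve(x, b, c, d, e):
    """Four-parameter log-logistic curve: d at x = 0, c for large x, EC50 = e."""
    return c + (d - c) * shape(x, b, e)

def fit_asymptotes(xs, ys, b, e):
    """For fixed (b, e) the model is linear in (c, d): ordinary least squares."""
    g = [shape(x, b, e) for x in xs]
    n = len(xs)
    g_bar, y_bar = sum(g) / n, sum(ys) / n
    sxx = sum((gi - g_bar) ** 2 for gi in g)
    sxy = sum((gi - g_bar) * (yi - y_bar) for gi, yi in zip(g, ys))
    slope = sxy / sxx if sxx > 0 else 0.0  # slope corresponds to d - c
    c = y_bar - slope * g_bar
    rss = sum((yi - c - slope * gi) ** 2 for gi, yi in zip(g, ys))
    return rss, c, c + slope

def fit_shared(genes, b_grid, e_grid):
    """Grid search over a shared (b, e); each gene keeps its own (c, d)."""
    best = None
    for b in b_grid:
        for e in e_grid:
            fits = [fit_asymptotes(xs, ys, b, e) for xs, ys in genes]
            total = sum(f[0] for f in fits)  # pooled residual sum of squares
            if best is None or total < best["rss"]:
                best = {"rss": total, "b": b, "e": e,
                        "asymptotes": [(f[1], f[2]) for f in fits]}
    return best

# Noise-free toy data (an assumption, chosen so the result is easy to check):
# two genes with the same slope and EC50 but different asymptotes.
conc = [0.0, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0]
true_b, true_e = 1.5, 2.0
gene_a = (conc, [curve(x, true_b, 0.2, 1.0, true_e) for x in conc])
gene_b = (conc, [curve(x, true_b, 0.5, 2.0, true_e) for x in conc])

b_grid = [0.5 * k for k in range(1, 9)]    # 0.5, 1.0, ..., 4.0
e_grid = [0.25 * k for k in range(1, 41)]  # 0.25, 0.5, ..., 10.0
fit = fit_shared([gene_a, gene_b], b_grid, e_grid)
print(fit["b"], fit["e"], fit["asymptotes"])
```

Because both genes contribute to the pooled residual sum of squares, the shared slope and EC₅₀ are estimated from all observations at once, which is where the gain in precision is expected to come from.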

In the first phase of the RTG, we developed such an approach. Motivated by applications in toxicology, curves with some common parameters should be preferred over individual parametric curves if they estimate the scope of the experiment more precisely. The scope can be, e.g., a specific alert concentration such as the EC₅₀, i.e. the concentration at which 50% of the maximal response is expected. Model selection criteria can be used to choose a specific model. However, established criteria, such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), select the model based only on the best overall model fit. We therefore developed a new model selection criterion that also takes the scope of the experiment into account. Following an approach of Claeskens and Hjort (2003), we derived an approximation of the mean squared error (MSE) of the scope estimator for each of the candidate curves (which differ in the number of common parameters). In an extensive simulation study, we compared the performance of the new criterion with that of AIC and BIC. The curves selected by the new criterion resulted in smaller MSEs of the scope estimator than those selected by AIC and BIC.
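For orientation, the quantities contrasted here can be written out in their standard textbook forms. The notation is an assumption introduced for illustration: M denotes a candidate model, ℓ_M its maximised log-likelihood, k_M its number of free parameters, n the number of observations, μ the scope (e.g. the EC₅₀) and μ̂_M its estimator under model M. The project's own approximation of the MSE is not reproduced here.

```latex
\begin{align*}
\mathrm{AIC}(M) &= -2\,\ell_M(\hat\theta_M) + 2\,k_M, \\
\mathrm{BIC}(M) &= -2\,\ell_M(\hat\theta_M) + k_M \log n, \\
\mathrm{MSE}(\hat\mu_M) &= \bigl(\mathbb{E}[\hat\mu_M] - \mu\bigr)^2 + \mathrm{Var}(\hat\mu_M).
\end{align*}
```

Sharing parameters across genes lowers k_M and typically reduces the variance term, but it can introduce bias if the true parameters differ between genes. A scope-focused criterion weighs these two effects for μ specifically, whereas AIC and BIC assess the fit of the whole curve.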

However, the calculation of the newly developed criterion quickly becomes computationally intensive as the size of the gene group increases. Moreover, applying the criterion to gene groups with very different concentration-response curves is not reasonable. Therefore, in the second phase of the RTG, we will extend our approach by first clustering the curves of the different genes in the gene group of interest with suitable methods.

In a first clustering step, following an approach suggested by Feldman and Langberg (2011), we will generate a rough initial clustering of the curves with a rather large number of clusters. As clustering methods, we will employ different algorithms, including classic ones like the k-means algorithm and the computationally efficient information criterion-based clustering algorithm ORICC, which was specifically developed for time-course gene expression data. For each of the obtained clusters, we then determine the mean or median curve and calculate the distance of each concentration-response curve in the cluster to this mean or median curve, based on suitable distance measures. These distances serve as importance measures for the concentration-response curves within each cluster. Next, following the suggestion of Schwiegelshohn and Sheikh-Omar (2022), we reduce the number of curves in each cluster by subsampling them according to their importance values. In a second clustering step, we then obtain a good approximation of the true clustering, with the actual number of clusters, that comes with a theoretical guarantee. Because the number of curves has been reduced, this step is computationally efficient.
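The two-step procedure can be sketched as follows, under several assumptions that are ours rather than the project's. Curves are represented as response vectors on a common concentration grid, k-means with cluster means is used in both steps, and the distance is squared Euclidean. The importance score (a curve's share of the total clustering cost plus a uniform term per cluster) is one common sensitivity-style choice, not necessarily the one that will be used. All function names and the toy data are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two curves on a common grid."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda j: sq_dist(p, centres[j]))
            for p in points]

def weighted_kmeans(points, weights, k, rng, n_iter=25):
    """Lloyd iterations on weighted points; returns (centres, labels)."""
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centres)
        for j in range(k):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == j]
            if members:  # an empty cluster keeps its previous centre
                total = sum(w for _, w in members)
                centres[j] = [sum(w * p[d] for p, w in members) / total
                              for d in range(len(centres[j]))]
    return centres, assign(points, centres)

def importance_subsample(curves, k_rough, m, rng):
    """Rough clustering, then sampling proportional to a distance-based score."""
    centres, labels = weighted_kmeans(curves, [1.0] * len(curves), k_rough, rng)
    costs = [sq_dist(c, centres[l]) for c, l in zip(curves, labels)]
    total_cost = sum(costs) or 1.0
    sizes = [labels.count(j) for j in range(k_rough)]
    # Assumed score: share of the total cost plus a uniform term per cluster.
    scores = [cost / total_cost + 1.0 / sizes[l] for cost, l in zip(costs, labels)]
    norm = sum(scores)
    probs = [s / norm for s in scores]
    idx = rng.choices(range(len(curves)), weights=probs, k=m)  # with replacement
    # Inverse-probability weights keep weighted sums approximately unbiased.
    return [curves[i] for i in idx], [1.0 / (m * probs[i]) for i in idx]

# Hypothetical toy data: two groups of noisy decreasing curves.
rng = random.Random(1)
conc = [0.1, 0.3, 1.0, 3.0, 10.0, 30.0]

def toy_curve(ec50):
    """Noisy log-logistic curve with the given EC50 (slope fixed at 2)."""
    return [1.0 / (1.0 + (x / ec50) ** 2) + rng.gauss(0.0, 0.03) for x in conc]

curves = [toy_curve(1.0) for _ in range(150)] + [toy_curve(10.0) for _ in range(150)]

# Step 1: rough clustering with many clusters, then importance subsampling.
sample, weights = importance_subsample(curves, k_rough=8, m=60, rng=rng)
# Step 2: cluster only the weighted subsample with the target number of clusters.
final_centres, sample_labels = weighted_kmeans(sample, weights, 2, rng)
print(len(sample), len(final_centres))
```

The second k-means run only sees the 60 sampled curves instead of all 300, which is the source of the computational saving. The weights compensate for the fact that curves far from their cluster mean are sampled more often.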

Then, we evaluate how well the model selection criterion developed in the first phase can be applied to the individual resulting clusters. Further, the resulting clustering can be fed into a hierarchical Bayesian model that jointly analyses all clusters with a subpopulation structure that reflects the clustering, similar to Thomas et al. (2022). One possibility to evaluate our clustering and dimension reduction approach is to compare this hierarchical Bayesian model with subgroup structure to an overall hierarchical Bayesian model that jointly considers all concentration-response curves. In a further step, the resulting clustering can be used to formulate graphical models that take possible dependence structures into account.

References

Feldman D, Langberg M (2011). A unified framework for approximating and clustering data. In: Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), pp. 569-578. https://people.csail.mit.edu/dannyf/stoc11.pdf
Schwiegelshohn C, Sheikh-Omar OA (2022). An empirical evaluation of k-means coresets. In: 30th Annual European Symposium on Algorithms (ESA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 244, pp. 84:1-84:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi: 10.4230/LIPIcs.ESA.2022.84
Thomas M, Bornkamp B, Ickstadt K (2022). Identifying treatment effect heterogeneity in dose-finding trials using Bayesian hierarchical models. Pharm Stat. 21(1):17-37. doi: 10.1002/pst.2150