A challenging statistical problem in drug discovery and development is the identification of patient subgroups that exhibit differential treatment effects. To further our understanding of how a treatment works on a disease, it helps if the subgroups are interpretable, for example by being defined by a small set of biomarkers and their associated threshold values. When the set of potential biomarkers is large, this can be a formidable task even if the objective is only to identify the biomarker subset, let alone to find the threshold values. Regression tree methods are natural solutions because the subgroups are defined by the terminal nodes of a decision tree.
Several regression tree algorithms exist for subgroup identification, but many have significant deficiencies. Algorithms based entirely on greedy search are biased toward selecting biomarkers measured at higher levels of granularity, because such variables offer more candidate splits; this increases the probability of identifying the wrong biomarkers. Further, algorithms that search for biomarker thresholds by maximizing a measure of the between-subgroup difference in treatment effects inevitably yield overly optimistic estimates that cannot be replicated. Finally, the vast majority of algorithms are inapplicable to data with missing values, to multivariate response variables, and to treatment variables with more than two levels.
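The optimism produced by threshold search can be seen in a small simulation. The sketch below is illustrative only and is not taken from GUIDE or from any of the algorithms discussed in the course: it generates data with no treatment effect at all, then compares the between-subgroup difference in treatment effects at a pre-specified threshold with the largest difference found by searching over all thresholds. The function names, sample size, and minimum arm size are assumptions chosen for the illustration.

```python
import random

def effect(y, t, idx):
    """Treated-minus-control difference in mean response among patients idx."""
    treated = [y[i] for i in idx if t[i] == 1]
    control = [y[i] for i in idx if t[i] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def arms_ok(t, idx, min_arm):
    """True if idx holds at least min_arm treated and min_arm control patients."""
    n_treated = sum(t[i] for i in idx)
    return n_treated >= min_arm and len(idx) - n_treated >= min_arm

def searched_gap(x, y, t, min_arm=10):
    """Largest |effect(left) - effect(right)| over all thresholds on x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = 0.0
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        if arms_ok(t, left, min_arm) and arms_ok(t, right, min_arm):
            best = max(best, abs(effect(y, t, left) - effect(y, t, right)))
    return best

def fixed_gap(x, y, t, cut=0.5):
    """|effect(left) - effect(right)| at a threshold chosen before seeing data.

    Assumes both subgroups contain treated and control patients, which is
    practically certain at the sample size used below.
    """
    left = [i for i in range(len(x)) if x[i] <= cut]
    right = [i for i in range(len(x)) if x[i] > cut]
    return abs(effect(y, t, left) - effect(y, t, right))

def simulate(n=120, reps=50, seed=1):
    """Average searched and fixed gaps when the true gap is zero."""
    rng = random.Random(seed)
    searched, fixed = [], []
    for _ in range(reps):
        x = [rng.random() for _ in range(n)]         # hypothetical biomarker
        t = [i % 2 for i in range(n)]                # balanced 1:1 assignment
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # response: pure noise
        searched.append(searched_gap(x, y, t))
        fixed.append(fixed_gap(x, y, t))
    return sum(searched) / reps, sum(fixed) / reps

if __name__ == "__main__":
    avg_searched, avg_fixed = simulate()
    print("average gap, searched threshold:", round(avg_searched, 3))
    print("average gap, fixed threshold:   ", round(avg_fixed, 3))
```

Because the true difference is zero in every simulated trial, any gap is noise; the searched gap is larger on average simply because it is a maximum over many candidate thresholds, which is why such estimates tend not to replicate.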
GUIDE is an algorithm that is unique in these respects: (1) it has negligible bias in variable selection, (2) it has negligible bias in treatment effect estimation, (3) it is applicable to treatment variables with more than two levels, (4) it can control for effects of prognostic variables within subgroups, (5) it does not require imputation of missing values, and (6) it is applicable to multiple, longitudinal, and censored response variables.
Using a series of real examples, the course will discuss (i) why many algorithms have the above-mentioned difficulties and how GUIDE overcomes them, (ii) the effect of multicollinearity and masking on subgroup identification, (iii) the meaning of a "subgroup" when there is no true subgroup, (iv) the conceptual and statistical difficulties of post-selection inference, particularly for subgroup identification, and (v) a solution by means of bootstrap calibration. The last part of the course is a hands-on demonstration of the free GUIDE software, which can be obtained from http://www.stat.wisc.edu/~loh/guide.html. Attendees are encouraged to install the software and data sets on their computers beforehand.
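As a rough illustration of what calibrating a confidence level by bootstrap means, the sketch below treats the simplest possible case, an interval for the mean of a single sample. It is not the GUIDE procedure for subgroup treatment effects; the function names, the grid of nominal levels, and the use of a normal-theory interval are all assumptions made for this example. The idea is to estimate, by resampling, the actual coverage of the interval at each nominal level, and then to use the nominal level whose estimated coverage reaches the desired target.

```python
import random
from statistics import NormalDist, mean, stdev

def normal_interval(sample, level):
    """Normal-theory interval for the mean at the given nominal level."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * stdev(sample) / len(sample) ** 0.5
    m = mean(sample)
    return m - half, m + half

def calibrated_level(sample, target=0.90, n_boot=500, seed=0):
    """Smallest nominal level on a grid whose bootstrap coverage reaches target."""
    rng = random.Random(seed)
    n = len(sample)
    center = mean(sample)  # plays the role of the true mean in the bootstrap world
    # Summarize each bootstrap resample by its mean and standard error.
    summaries = []
    for _ in range(n_boot):
        resample = rng.choices(sample, k=n)
        summaries.append((mean(resample), stdev(resample) / n ** 0.5))
    grid = [round(0.80 + 0.01 * i, 2) for i in range(20)]  # 0.80, ..., 0.99
    for level in grid:
        z = NormalDist().inv_cdf(0.5 + level / 2)
        hits = sum(abs(m - center) <= z * se for m, se in summaries)
        if hits / n_boot >= target:
            return level
    return grid[-1]  # target not reached on the grid; fall back to the widest

if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.expovariate(1.0) for _ in range(15)]  # small, skewed sample
    level = calibrated_level(data, target=0.90)
    print("nominal level to use:", level)
    print("calibrated interval: ", normal_interval(data, level))
```

For small or skewed samples the uncalibrated interval often covers less than its nominal level, so the calibrated nominal level is typically higher than the target. Post-selection intervals for subgroup treatment effects face a harder version of the same problem, because the subgroups themselves are chosen from the data.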
Day 1 Morning session: Basic ideas of classification and regression trees
1. Classification with continuous predictors: glaucoma prediction
2. Classification with categorical predictors: peptide binding
3. Key differences between GUIDE and CART: heart disease
4. Longitudinal response variables: Alzheimer's disease
5. Importance scoring of predictor variables
Day 1 Afternoon session: Subgroup identification in randomized trials
1. No missing values: breast cancer
2. Large numbers of missing values: retrospective gene study
3. Longitudinal response: Type II diabetes
4. Importance scoring when treatment variable is present
5. Comparison of GUIDE with other tree and forest methods
Day 2 Morning session: Post-selection inference and local prognostic control
1. Conceptual difficulties of post-selection inference
2. Bootstrap-calibrated confidence intervals
3. Bootstrap intervals for treatment effects
4. Local linear control of prognostic variables
Day 2 Afternoon session: Demonstration of GUIDE software
Free software, manual and sample data sets from http://www.stat.wisc.edu/~loh/guide.html
References:
Loh, W.-Y. (1987), Calibrating confidence coefficients, Journal of the American Statistical Association, vol. 82, 155-162.
Loh, W.-Y., and Vanichsetakul, N. (1988), Tree-structured classification via generalized discriminant analysis (with discussion), Journal of the American Statistical Association, vol. 83, 715-728.
Chaudhuri, P., Huang, M.-C., Loh, W.-Y., and Yao, R. (1994), Piecewise-polynomial regression trees, Statistica Sinica, vol. 4, 143-167.
Loh, W.-Y., and Shih, Y.-S. (1997), Split selection methods for classification trees, Statistica Sinica, vol. 7, 815-840.
Loh, W.-Y. (2002), Regression trees with unbiased variable selection and interaction detection, Statistica Sinica, vol. 12, 361-386.
Loh, W.-Y. (2009), Improving the precision of classification trees, Annals of Applied Statistics, vol. 3, 1710-1737.
Loh, W.-Y. (2011), Classification and regression trees, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, 14-23.
Loh, W.-Y., and Zheng, W. (2013), Regression trees for longitudinal and multiresponse data, Annals of Applied Statistics, vol. 7, 496-522.
Loh, W.-Y. (2014), Fifty years of classification and regression trees (with discussion), International Statistical Review, vol. 82, 329-370.
Loh, W.-Y., He, X., and Man, M. (2015), A regression tree approach to identifying subgroups with differential treatment effects, Statistics in Medicine, vol. 34, 1818-1833.
Loh, W.-Y., Fu, H., Man, M., Champion, V., and Yu, M. (2016), Identification of subgroups with differential treatment effects for longitudinal and multiresponse variables, Statistics in Medicine, vol. 35, 4837-4855.
Wei-Yin Loh is Professor of Statistics at the University of Wisconsin, Madison. He is a Fellow of the American Statistical Association and the Institute of Mathematical Statistics and a consultant to government and industry. He has been developing classification and regression tree algorithms for more than thirty years. See http://www.stat.wisc.edu/~loh/.
