Variable selection and validation in multivariate modelling
Date
2019
Author
Shi, Lin
Westerhuis, Johan A.
Rosén, Johan
Landberg, Rikard
Brunius, Carl
Abstract
Motivation: Validation of variable selection and predictive performance is crucial in construction of
robust multivariate models that generalize well, minimize overfitting and facilitate interpretation of
results. Inappropriate variable selection leads instead to selection bias, thereby increasing the risk
of model overfitting and false positive discoveries. Although several algorithms exist to identify a
minimal set of most informative variables (i.e. the minimal-optimal problem), few can select all variables related to the research question (i.e. the all-relevant problem). Robust algorithms combining
identification of both minimal-optimal and all-relevant variables with proper cross-validation are
urgently needed.
Results: We developed the MUVR algorithm to improve predictive performance and minimize
overfitting and false positives in multivariate analysis. In the MUVR algorithm, minimal variable
selection is achieved by performing recursive variable elimination in a repeated double cross-validation (rdCV) procedure. The algorithm supports partial least squares and random forest modelling, and simultaneously identifies minimal-optimal and all-relevant variable sets for regression,
classification and multilevel analyses. Using three authentic omics datasets, MUVR yielded
parsimonious models with minimal overfitting and improved model performance compared with
state-of-the-art rdCV. Moreover, MUVR showed advantages over other variable selection algorithms, i.e. Boruta and VSURF, including a simultaneous variable selection and validation scheme
and wider applicability.
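
Illustration: the idea of recursive variable elimination nested inside repeated double cross-validation can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions, not the MUVR implementation: it handles two-class problems only, swaps in a nearest-centroid classifier and a standardized mean-difference ranking in place of partial least squares or random forest, and returns a single variable set rather than separate minimal-optimal and all-relevant sets. All function names and parameters (nested_cv_selection, keep, make_toy_data and so on) are hypothetical.

```python
import random
from statistics import mean, pstdev

def stratified_folds(rows, y, k, rng):
    """Split row indices into k folds, dealing each class out round-robin."""
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def rank_variables(X, y, rows, variables):
    """Order variables by standardized difference of class means, best first."""
    def score(v):
        a = [X[i][v] for i in rows if y[i] == 0]
        b = [X[i][v] for i in rows if y[i] == 1]
        return abs(mean(a) - mean(b)) / (pstdev(a + b) or 1.0)
    return sorted(variables, key=score, reverse=True)

def predict(X, y, train, test, variables):
    """Nearest-centroid classifier (assumed stand-in for PLS or random forest)."""
    centroids = {
        label: [mean(X[i][v] for i in train if y[i] == label) for v in variables]
        for label in (0, 1)
    }
    def nearest(i):
        return min((0, 1), key=lambda lab: sum(
            (X[i][v] - c) ** 2 for v, c in zip(variables, centroids[lab])))
    return [nearest(i) for i in test]

def misses(X, y, train, test, variables):
    """Number of misclassified test rows."""
    preds = predict(X, y, train, test, variables)
    return sum(p != y[i] for p, i in zip(preds, test))

def nested_cv_selection(X, y, n_rep=3, k_outer=4, k_inner=3, keep=0.7, seed=0):
    """Return (outer-loop error rate, how often each variable was selected).

    Assumes binary labels 0/1 with enough rows per class that every
    training set contains both classes.
    """
    rng = random.Random(seed)
    n_vars = len(X[0])
    outer_misses, outer_total = 0, 0
    selected = [0] * n_vars
    for _ in range(n_rep):                      # repetitions
        outer = stratified_folds(list(range(len(X))), y, k_outer, rng)
        for test in outer:                      # outer loop: untouched test segment
            calib = [i for fold in outer if fold is not test for i in fold]
            inner = stratified_folds(calib, y, k_inner, rng)
            inner_misses = {}                   # variable count -> validation misses
            for val in inner:                   # inner loop: tune number of variables
                train = [i for fold in inner if fold is not val for i in fold]
                variables = list(range(n_vars))
                while variables:                # recursive elimination
                    variables = rank_variables(X, y, train, variables)
                    n = len(variables)
                    inner_misses[n] = (inner_misses.get(n, 0)
                                       + misses(X, y, train, val, variables))
                    variables = variables[:int(n * keep)]
            # Fewest validation misses; ties go to the smaller variable set.
            best_n = min(inner_misses, key=lambda n: (inner_misses[n], n))
            # Simplification: re-rank once on the whole calibration set.
            chosen = rank_variables(X, y, calib, list(range(n_vars)))[:best_n]
            outer_misses += misses(X, y, calib, test, chosen)
            outer_total += len(test)
            for v in chosen:
                selected[v] += 1
    return outer_misses / outer_total, selected

def make_toy_data(n_per_class=20, n_vars=10, shift=3.0, seed=1):
    """Synthetic data: only variables 0 and 1 differ between the classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_vars)]
            row[0] += shift * label
            row[1] += shift * label
            X.append(row)
            y.append(label)
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    rate, selected = nested_cv_selection(X, y)
    print("outer-loop error rate:", rate)
    print("selection counts per variable:", selected)
```

The point of the nesting is that the test segment of each outer fold never influences either the variable ranking or the choice of how many variables to keep, so the outer-loop error rate is not inflated by selection bias.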
URI
http://hdl.handle.net/10394/32087
https://academic.oup.com/bioinformatics/article-pdf/35/6/972/28079182/bty710.pdf
https://doi.org/10.1093/bioinformatics/bty710