Evaluating Variable Selection Methods in a Classification Framework: A Simulation Study
DOI: https://doi.org/10.54103/2282-0930/29431

Abstract
INTRODUCTION
Variable selection is a common step in clinical research, where large datasets often include many, potentially highly correlated, variables. The main objective is to identify the most relevant predictors for an outcome, thereby enhancing model interpretability, simplicity, and predictive performance [1]. However, data-driven variable selection also carries several underappreciated risks. These include the potential exclusion of important predictors, inclusion of irrelevant ones, biased coefficient estimates, underestimated standard errors, invalid confidence intervals, and overall model instability [2].
Simulation studies are a valuable approach for evaluating statistical methods, provided they are carefully designed. Yet, many such studies exhibit bias in favor of the newly proposed methods [3]. To address this, we developed a neutral comparison simulation study to fairly evaluate the performance of several variable selection techniques.
OBJECTIVE
To systematically evaluate and compare different variable selection methods across multiple simulated scenarios.
METHODS
To improve the design and reporting of our simulation study, we followed the ADEMP structure [4], which involves specifying the aim (A), the data-generating process (D), the estimand or target of inference (E), the analytical methods (M), and the criteria used to evaluate performance (P).
We designed different simulation scenarios by varying the number of observations, total variables, and number of true predictors. Predictor correlations were modeled to decay exponentially with increasing distance between variables, and effect sizes for true predictors were varied [5, 6]. Noise was introduced into the correlation structures to better mimic real-world data.
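The data-generating process described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the AR(1)-style decay `rho**|i-j|`, the symmetric noise perturbation, and all parameter values (`noise_sd`, effect sizes) are assumptions chosen for the example.

```python
import numpy as np

def make_corr(p, rho, noise_sd=0.02, seed=0):
    """Correlation decaying exponentially with distance: corr(X_i, X_j) = rho**|i-j|,
    perturbed with symmetric noise to mimic real-world data."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])
    noise = rng.normal(0.0, noise_sd, size=(p, p))
    noise = (noise + noise.T) / 2          # keep the matrix symmetric
    np.fill_diagonal(noise, 0.0)
    corr = corr + noise
    # project back to a valid correlation matrix: clip eigenvalues, rescale diagonal
    w, v = np.linalg.eigh(corr)
    corr = v @ np.diag(np.clip(w, 1e-6, None)) @ v.T
    d = np.sqrt(np.diag(corr))
    return corr / np.outer(d, d)

def simulate(n, p, beta, rho, seed=0):
    """Correlated Gaussian predictors and a binary outcome from a logistic model."""
    rng = np.random.default_rng(seed)
    sigma = make_corr(p, rho, seed=seed)
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta))))
    return X, y
```

Varying `n`, `p`, `rho`, and the number of nonzero entries in `beta` reproduces the scenario grid described above.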
We focused on a binary classification setting, evaluating each method on two key outcomes: model selection accuracy (i.e., whether the true model is selected) and predictive performance. Five variable selection methods were compared: stepwise logistic regression, LASSO logistic regression, Elastic Net logistic regression, a Random Forest classifier with OOB-error-based backward elimination [7], and a Genetic Algorithm (GA) [8, 9]. Performance metrics included the Area Under the Curve (AUC), the number of variables selected, and the True Positive Rate (TPR). All analyses were performed using Python 3.12.
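For the penalized methods, the evaluation loop can be sketched as below. This is an illustrative skeleton, not the study's pipeline: the helper `evaluate`, the coefficient threshold, and the fixed `C`/`l1_ratio` values are assumptions (in practice these hyperparameters would be tuned, e.g. by cross-validation).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def evaluate(model, X_tr, y_tr, X_te, y_te, true_support):
    """Fit a penalized logistic model, then report the three metrics used here:
    test-set AUC, number of variables selected, and TPR on the true predictors."""
    model.fit(X_tr, y_tr)
    selected = np.abs(model.coef_.ravel()) > 1e-8   # nonzero coefficients
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    tpr = selected[true_support].mean()             # fraction of true predictors kept
    return auc, int(selected.sum()), tpr

# LASSO and Elastic Net variants of logistic regression (saga handles both penalties)
lasso = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=0.1, max_iter=5000)
```

Repeating `evaluate` over the Monte Carlo replicates of each scenario yields the mean AUC and TPR reported in the Results.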
RESULTS
We ran 1,000 Monte Carlo simulations per scenario, varying key factors such as sample size, number of predictors, true signal strength, and correlation strength. Elastic Net consistently achieved the highest mean AUC and TPR, particularly in high-dimensional or strong-signal settings (e.g., Scenarios 5–8), showing robust performance across conditions. Random Forest and Genetic Algorithm performed comparably in some scenarios but incurred substantially higher computational costs. LASSO achieved competitive AUC with significantly lower runtime, though it tended to underselect in weaker signal scenarios. Stepwise selection, while the fastest method, had the lowest overall predictive performance and true positive rates (Table 1).
CONCLUSION
Among the five evaluated methods, Elastic Net provided the best trade-off between predictive performance and model stability, particularly in realistic, high-dimensional settings. Our results reinforce the importance of carefully considering the variable selection method in the context of the data structure and research goals. This neutral comparison contributes to evidence-based guidance for method selection in clinical research and similar applied settings.
References
[1] Chan JYL, Leow SMH, Bea KT, Cheng WK, Phoong SW, Hong ZW, Chen YL. Mitigating the multicollinearity problem and its machine learning approach: a review. Mathematics. 2022;10(8):1283. DOI: https://doi.org/10.3390/math10081283
[2] Ullmann T, Heinze G, Hafermann L, Schilhart-Wallisch C, Dunkler D, for TG2 of the STRATOS initiative. Evaluating variable selection methods for multivariable regression models: a simulation study protocol. PLoS ONE. 2024;19(8):e0308543. DOI: https://doi.org/10.1371/journal.pone.0308543
[3] Kipruto E, Sauerbrei W. Comparison of variable selection procedures and investigation of the role of shrinkage in linear regression: protocol of a simulation study in low-dimensional data. PLoS ONE. 2022;17(10):e0271240. DOI: https://doi.org/10.1371/journal.pone.0271240
[4] Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Statistics in Medicine. 2019;38:2074-2102. DOI: https://doi.org/10.1002/sim.8086
[5] Bag S, Gupta K, Deb S. A review and recommendations on variable selection methods in regression models for binary data. arXiv preprint arXiv:2201.06063. 2022.
[6] Hardin J, Garcia SR, Golan D. A method for generating realistic correlation matrices. The Annals of Applied Statistics. 2013:1733-1762. DOI: https://doi.org/10.1214/13-AOAS638
[7] Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:3. DOI: https://doi.org/10.1186/1471-2105-7-3
[8] Mitchell M. An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press; 1998.
[9] Zhang Z, Trevino V, Hoseini SS, et al. Variable selection in logistic regression model with genetic algorithm. Ann Transl Med. 2018;6(3):45. DOI: https://doi.org/10.21037/atm.2018.01.15
Copyright (c) 2025 Samuele Minari, Dario Pescini, Antonella Zambon, Davide Soranna

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


