A Comparison of Random Forest Variable Selection Methods for Regression Modeling of Continuous Outcomes
Background: The Importance of Variable Selection in Machine Learning Regression Models
In recent years, the widespread application of machine learning in bioinformatics and data science has greatly advanced predictive modeling. Random forest (RF) regression, a commonly used ensemble learning algorithm, has become an important tool for building predictive models of continuous outcomes because of its strong prediction accuracy and robustness. With high-dimensional data, however, using more predictor variables does not necessarily improve predictive performance; it can instead introduce redundant information, promote overfitting, and make the model harder to apply in practice. “Variable selection” (also known as feature selection or feature reduction) has therefore become a crucial modeling step.
Variable selection can reduce redundancy, enhance prediction performance and generalization, lower the cost of subsequent data collection and model deployment, and improve model interpretability and efficiency in application. Various random forest-based variable selection methods have been proposed, but their relative performance, suitability, and best practices on real-world continuous-outcome datasets have not been consistently established. Moreover, variable selection involves multiple objectives: not only prediction accuracy, but also model parsimony (minimizing the number of variables) and computational efficiency.
Source and Authors
This research is titled “A comparison of random forest variable selection methods for regression modeling of continuous outcomes”, published in 2025 in Briefings in Bioinformatics (Volume 26, Issue 2, DOI: https://doi.org/10.1093/bib/bbaf096), and jointly authored by Nathaniel S. O’Connell, Byron C. Jaeger, Garrett S. Bullock, and Jaime Lynn Speiser. All authors are from the Department of Biostatistics and Data Science, Department of Orthopaedic Surgery, and Division of Public Health Sciences at the Wake Forest University School of Medicine, USA.
Detailed Study Design: Comprehensive Benchmark Evaluation
1. Research Aims and Overall Design
The aim of this study is to systematically evaluate and compare 13 random forest variable selection methods implemented in R for regression modeling of continuous outcomes. The study characterizes the performance differences among these methods across a range of real-world, publicly available datasets, providing methodological guidance for practical use. The evaluation covers three main aspects: predictive accuracy (primarily R^2), model parsimony (the proportion of variables removed), and computational efficiency (computation time). Following open science principles, all code and data are publicly available, emphasizing reproducibility and transparency.
2. Data Sources and Processing Workflow
All datasets in this study were sourced from OpenML (https://www.openml.org/) and the R package modeldata. The inclusion criteria were strict: only supervised regression tasks were considered, missingness had to be below 50%, the number of variables between 10 and 1000, the sample size between 100 and 10,000, and the outcome had to be continuous with at least 10 unique values. Ultimately, the authors included 59 datasets (53 from OpenML, 6 from modeldata), covering domains such as medicine, manufacturing, weather, economics, and education, thus ensuring good representativeness.
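As a concrete illustration, the inclusion criteria could be encoded in a screening function like the one below; the function name, interface, and the example dataset are hypothetical and not taken from the authors' code.

```r
## Hypothetical helper encoding the inclusion criteria described above;
## the name and interface are illustrative, not the authors' code.
meets_inclusion_criteria <- function(data, outcome) {
  y <- data[[outcome]]
  n <- nrow(data)
  p <- ncol(data) - 1                     # number of predictor variables
  prop_missing <- mean(is.na(data))       # overall proportion of missing cells

  is.numeric(y) &&
    length(unique(na.omit(y))) >= 10 &&   # continuous outcome with >= 10 unique values
    prop_missing < 0.5 &&                 # less than 50% missingness
    p >= 10 && p <= 1000 &&               # between 10 and 1000 variables
    n >= 100 && n <= 10000                # between 100 and 10,000 observations
}

# Example screen of a public dataset (not necessarily one used in the paper):
# meets_inclusion_criteria(modeldata::ames, outcome = "Sale_Price")
```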
3. Variable Selection Methods: Implementation and Categorization
The 13 variable selection methods evaluated are all from the R ecosystem, including widely used packages such as caret, boruta, vsurf, and rrf, as well as newly developed oblique random forest implementations (e.g., the aorsf family). Each method is implemented in accordance with its original publication, with default hyperparameters unless otherwise specified. The authors classify the methods into two broad categories (a brief sketch of each follows the list):
- Test-based methods: variables are selected using statistical or permutation tests of significance (e.g., boruta, altman, aorsf-permutation).
- Performance-based methods: variables are selected recursively based on changes in model performance after their addition or removal (e.g., caret, jiang, rrf, aorsf-menze).
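To make the distinction concrete, here is a minimal sketch of each category, using Boruta (test-based) and caret's recursive feature elimination (performance-based) as representatives; the subset sizes, fold counts, and wrapper names are illustrative choices, not the paper's exact configurations.

```r
library(Boruta)  # test-based selection
library(caret)   # performance-based selection (recursive feature elimination)

# Test-based: Boruta compares each variable's permutation importance with that
# of randomized "shadow" copies and keeps variables confirmed as important.
select_test_based <- function(x, y) {
  fit <- Boruta(x = x, y = y)
  getSelectedAttributes(fit, withTentative = FALSE)
}

# Performance-based: recursive feature elimination refits a random forest on
# shrinking variable subsets and keeps the subset with the best resampled fit.
select_performance_based <- function(x, y) {
  ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
  fit  <- rfe(x = x, y = y, sizes = c(5, 10, 20), rfeControl = ctrl)
  predictors(fit)
}
```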
4. Experimental Procedures and Evaluation
The authors use Monte Carlo cross-validation with 20 random splits (split-sample validation): each dataset is randomly split into a training set and a test set (50%:50%; for very large datasets, the training set is capped at 1,000 samples). All variable selection is performed on the training set. For datasets with more than 150 variables, 150 variables are randomly selected in each replication to keep the computational load manageable. The selected feature subset is then used to train both an axis-based RF (using the ranger package) and an oblique RF (using the aorsf package), with R^2 evaluated on the test set.
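The design described above can be sketched roughly as follows for a single replication on a single dataset; `select_vars` stands in for any of the 13 methods, and all names, defaults, and simplifications are mine rather than the authors'. The sketch also records the timing and variable-reduction bookkeeping described in the next paragraph.

```r
library(ranger)  # axis-based random forest
library(aorsf)   # oblique random forest

# One Monte Carlo replication (assumed structure): split the data, select
# variables on the training half, fit both forests on the selected subset,
# and score predictions on the test half.
one_replication <- function(data, outcome, select_vars,
                            max_train = 1000, max_p = 150) {
  predictors <- setdiff(names(data), outcome)
  if (length(predictors) > max_p)
    predictors <- sample(predictors, max_p)          # cap at 150 candidate variables

  n_train <- min(floor(nrow(data) / 2), max_train)   # 50% split, training capped at 1000
  idx     <- sample(nrow(data), n_train)
  train   <- data[idx,  c(outcome, predictors)]
  test    <- data[-idx, c(outcome, predictors)]

  t0   <- Sys.time()
  keep <- select_vars(train, outcome)                # selection uses training data only
  secs <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
  if (length(keep) == 0) return(NULL)                # some methods may select nothing

  r2   <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
  form <- reformulate(keep, response = outcome)

  fit_axis    <- ranger(form, data = train)
  fit_oblique <- orsf(train[, c(outcome, keep)], formula = form)

  data.frame(
    seconds      = secs,
    prop_reduced = 1 - length(keep) / length(predictors),
    r2_axis      = r2(test[[outcome]], predict(fit_axis, data = test)$predictions),
    r2_oblique   = r2(test[[outcome]], as.numeric(predict(fit_oblique, new_data = test)))
  )
}
```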
For each replication, the time taken for variable selection and the proportion of variables removed are also recorded. Standardized z-scores are used to place the different metrics on a common scale and to compare the methods across datasets.
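Assuming, as a natural reading of this description, that the z-scores are computed within each dataset across methods, the standardization might look like the following; the `results` data frame and its column names are hypothetical.

```r
# Standardize a metric to z-scores within each dataset so that methods can be
# compared on a common scale across datasets; `results` is assumed to have
# columns: dataset, method, r_squared, prop_reduced, seconds.
zscore_within_dataset <- function(results, metric) {
  ave(results[[metric]], results$dataset,
      FUN = function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
}

# e.g. results$z_r_squared <- zscore_within_dataset(results, "r_squared")
# e.g. results$z_seconds   <- zscore_within_dataset(results, "seconds")
```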
Main Results
1. Overall Performance of Variable Selection Methods
Computational Efficiency
The fastest variable selection methods were axis-sfe, rrf, aorsf-menze, aorsf-negation, and aorsf-permutation, with median computation times under 5 seconds on most datasets. The slowest methods were rfvimptest, caret, and svetnik, whose computation times reached thousands of seconds on some datasets.
Variable Reduction Ability
rfvimptest achieved the greatest variable reduction (removing more than 90% of variables); vsurf, altman, and svetnik removed around 80%, while rrf removed almost no variables. Notably, methods such as caret and boruta showed large variation across datasets in the proportion of variables removed, suggesting that their behavior adapts to the complexity of each dataset.
Predictive Performance (R^2)
Most methods (except rfvimptest) achieved median test-set R^2 between 0.61 and 0.67 with the axis-based RF and between 0.62 and 0.73 with the oblique RF, indicating that although the selection strategies differ, their ultimate predictive performance is broadly similar. The best R^2 values were observed for aorsf-menze and aorsf-permutation with the oblique RF; with the axis-based RF, caret, jiang, boruta, and aorsf-permutation performed best.
2. Sensitivity and Stratified Analyses
Because some methods occasionally failed to select any variables (rfvimptest and boruta on a few datasets, altman and vsurf in isolated cases), the authors performed a sensitivity analysis restricted to replications in which every method selected at least one variable. The overall ranking of methods remained consistent with the main analysis, indicating that the core findings are robust.
Further, a subgroup analysis by high (n:p ≥ 10) and low (n:p < 10) sample-to-variable ratios revealed:
- In low n:p settings (high dimensionality, relatively few samples), the oblique RF had a clear advantage, with predictive accuracy well above that of the traditional axis-based RF.
- In high n:p settings, the axis-based and oblique RF performance of the main methods converged.
3. Comparison by Method Characteristics and Categories
The paper also analyzes methods by algorithmic implementation (axis-based RF, conditional inference RF, oblique RF) and by category (test-based or performance-based). Conditional inference RF methods performed unremarkably, largely because of their high computational cost. Oblique RF methods (the aorsf family) were both efficient and accurate, clearly outperforming the others. Neither the test-based nor the performance-based class was clearly superior; the main performance differences were driven by the specific algorithm structure and implementation details.
4. Data and Code Reproducibility
All code and data for this study are hosted on GitHub (https://github.com/nateoconnellphd/rfvs_regression), ensuring high transparency and reproducibility of the research, and encouraging peer reuse and further development.
Main Conclusions and Significance
The authors draw the following conclusions for random forest regression with continuous outcomes under default R implementations:
- For axis-based RF, boruta and aorsf-permutation are recommended;
- For oblique RF, aorsf-permutation and aorsf-menze are optimal.
These methods combine high predictive accuracy, strong variable reduction, and good computational efficiency, making them well suited to high-dimensional data and practical deployment. The authors further recommend that applied researchers try several of the top-performing methods and choose the one best suited to the characteristics of their own data.
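As a practical illustration of the recommended direction, the following sketch ranks variables by aorsf permutation importance and then refits smaller forests; the toy data (mtcars), the positive-importance cutoff, and the object names are illustrative assumptions, not prescriptions from the paper.

```r
library(aorsf)   # oblique random forest with built-in importance measures
library(ranger)  # axis-based random forest for the refit

# Fit an oblique RF with permutation importance, keep variables with positive
# importance (an assumed cutoff), then refit parsimonious final models.
fit_full <- orsf(mtcars, formula = mpg ~ ., importance = "permute")
vi       <- orsf_vi(fit_full)            # named vector, sorted by importance
keep     <- names(vi)[vi > 0]            # illustrative threshold, not from the paper

final_oblique <- orsf(mtcars[, c("mpg", keep)], formula = mpg ~ .)
final_axis    <- ranger(reformulate(keep, response = "mpg"), data = mtcars)
```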
Study Highlights and Scientific Value
- Large-scale benchmarking on real datasets: the use of 59 heterogeneous public datasets substantially enhances the credibility and generalizability of the results and lays a solid foundation for the further development and application of variable selection methods.
- Introduction and systematic evaluation of oblique random forests: a comprehensive comparison of oblique RF methods for continuous outcome prediction is provided for the first time, filling a gap left by prior studies that focused only on conventional RF approaches.
- Multidimensional and standardized evaluation system: Integrating predictive performance, model simplicity, and computational efficiency, the study offers a more scientific and practical reference standard.
- Emphasis on open science and reproducibility: By providing full source code and data, the study fosters ongoing verification and optimization in academia, enhancing community transparency and knowledge sharing.
- Practical guidance for real-world scenarios: the study clearly highlights the role of variable selection in reducing data collection costs and improving model interpretability, addressing the needs of applied research and industry.
Additional Information
- The research is supported by NIH and other grants in the United States, underscoring the importance attached to this line of investigation.
- All data and code used in the paper are openly available, facilitating reuse and further development by researchers worldwide.
Conclusion and Outlook
This study systematically reviews mainstream and emerging random forest variable selection methods for regression in the current R ecosystem. Through rigorous empirical analysis on a large collection of datasets and comprehensive quantitative evaluation, it clearly assesses the strengths, weaknesses, and applicable scenarios of each method, providing an essential theoretical and practical basis for choosing variable selection schemes for continuous outcome prediction in bioinformatics, medicine, engineering, and other domains. Its emphasis on open science and its focus on high-dimensional, complex real-world data also set a strong example for subsequent research on machine learning variable selection and interpretability methods.