A Random Forest Algorithm For Identifying Risk Factors For Multimorbidity In The UK Biobank Cohort
DOI: https://doi.org/10.54103/2282-0930/29299
Abstract
Introduction: High-income countries are undergoing significant demographic shifts, characterized by population decline and progressive aging. These transformations are associated with an increase in the prevalence of chronic diseases, which often coexist, worsening individuals’ quality of life and increasing healthcare costs. Identifying the factors that contribute to the onset of multimorbidity is particularly complex, as these factors often interact with each other and cause multiple effects across different diseases.
Objectives: This study aimed to identify the main risk factors for multimorbidity within a large UK cohort using a fully nonparametric ensemble method. This approach makes no assumptions about the underlying relationships between variables and allows high-dimensional data to be handled while preventing overfitting.
Methods: We analyzed data from the UK Biobank cohort, which includes detailed information on socioeconomic status, lifestyle, anthropometric measures, and environmental exposures collected at recruitment, along with disease occurrence obtained through linkage with hospital admissions (primary and secondary diagnoses), death records, and cancer registries. Multimorbidity was defined as the presence of at least two chronic conditions from a list developed through an international consensus using a modified Delphi method [1]. To assess the role of 18 candidate variables in predicting the onset of multimorbidity over a five-year follow-up, we applied a random forest algorithm adapted for survival analysis within a competing risk framework [2], considering two competing events: the development of multimorbidity and death prior to its onset. The candidate variables included: white British/Irish ethnicity (Yes/No), qualification level, average total household income before tax (adjusted for household size and categorized into quintiles), area-level index of multiple deprivation (deciles), body mass index (kg/m²), waist circumference (cm), pack-years of smoking, alcohol drinking (g/day), healthy diet score (ranging from 0 to 5, based on the intake of fruit, vegetables, fish, whole grains, processed and red meat), walking (at least 10 min, number of times a week), moderate physical activity (at least 10 min, number of times a week), vigorous physical activity (at least 10 min, number of times a week), fine particulate matter air pollution (PM2.5) (µg/m³), PM2.5-10 (µg/m³), PM10 (µg/m³), NO2 (µg/m³), and average exposure to evening (7:00 pm – 11:00 pm) or night noise (11:00 pm – 7:00 am) (dB). Results were summarised using out-of-bag partial dependence plots and variable importance (VIMP) metrics.
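The competing risk framework described above can be illustrated with a minimal sketch of the nonparametric (Aalen-Johansen) cumulative incidence estimator, the quantity reported as five-year cumulative incidence. This is not the random forest used in the study; it only shows how death before multimorbidity is treated as a competing event rather than as censoring. The function name, the event coding (0 = censored, 1 = multimorbidity, 2 = death before multimorbidity), and the toy data are assumptions made for illustration.

```python
import numpy as np

def cumulative_incidence(time, event, cause, horizon):
    """Aalen-Johansen estimate of the cumulative incidence of `cause` at `horizon`.

    time  : follow-up time per person (years)
    event : 0 = censored, 1 = multimorbidity, 2 = death before multimorbidity
            (hypothetical coding chosen for this sketch)
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    event_free = 1.0  # probability of no event of any type just before t
    cif = 0.0
    # Step through the distinct times at which any event occurred up to the horizon.
    for t in np.unique(time[(event > 0) & (time <= horizon)]):
        at_risk = np.sum(time >= t)
        d_cause = np.sum((time == t) & (event == cause))
        d_any = np.sum((time == t) & (event > 0))
        cif += event_free * d_cause / at_risk
        event_free *= 1.0 - d_any / at_risk
    return cif

# Toy data (not from UK Biobank): six people followed for up to six years.
toy_time = [1, 2, 3, 4, 5, 6]
toy_event = [1, 0, 2, 1, 0, 1]
cif_multimorbidity = cumulative_incidence(toy_time, toy_event, cause=1, horizon=5)
cif_death = cumulative_incidence(toy_time, toy_event, cause=2, horizon=5)
```

On the toy data, the two cause-specific incidences and the remaining event-free probability sum to one, which is the property that distinguishes this estimator from applying one minus Kaplan-Meier to each event separately.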
Results: Of the 422,344 individuals included in the cohort, aged between 39 and 73 years, we selected 137,565 participants who were free from the conditions included in the definition of multimorbidity at the time of recruitment and for whom risk factor information was available. During the five-year follow-up, 4384 individuals developed multimorbidity (2740 males, 1644 females). The five-year cumulative incidence was 3.9% in males and 2.6% in females. Among individuals who developed multimorbidity during follow-up, the main conditions observed were cancer (52.4% of males and 52.1% of females), arrhythmias (44.7% of males and 28.5% of females) and coronary artery disease (42.1% of males and 24.8% of females). Based on VIMP metrics, the strongest predictors in men were smoking, waist circumference, and sleep duration; in women, they were alcohol, smoking, and waist circumference. Five-year cumulative incidence was higher for heavy smokers (sex-specific 95th percentile of pack-years) (males: 6.3%, females: 4.0%) compared to non-smokers (males: 3.5%, females: 2.4%); for individuals with elevated waist circumference (sex-specific 95th percentile) (males: 6.1%, females: 5.2%) versus those with median values (males: 3.9%, females: 2.6%); for heavy alcohol drinkers (sex-specific 95th percentile) (males: 4.6%, females: 4.0%) versus median intake (males: 3.8%, females: 2.4%); and for those sleeping 4 hours/day (males: 6.3%, females: 4.2%) or 10 hours/day (males: 6.5%, females: 4.5%) versus 7 hours/day (males: 3.7%, females: 2.5%). Diet, physical activity, and air pollution had smaller impacts.
Conclusions: Preventive interventions targeting smoking, abdominal obesity, and heavy alcohol consumption among middle-aged adults in the UK, and likely in other high-income countries, may substantially reduce the incidence of multimorbidity. Such interventions could improve the health trajectory and reduce the burden of disease of future older populations. In addition, promoting adequate sleep duration appears to be beneficial and should be integrated into public health recommendations.
References
[1] Ho ISS, Azcoaga-Lorenzo A, Akbari A, et al. Measuring multimorbidity in research: Delphi consensus study. BMJ Med 2022;1(1):e000247. DOI: https://doi.org/10.1136/bmjmed-2022-000247
[2] Ishwaran H, Gerds TA, Kogalur UB, et al. Random survival forests for competing risks. Biostatistics 2014;15:757-773. DOI: https://doi.org/10.1093/biostatistics/kxu010
License
Copyright (c) 2025 Linia Patel, Silvia Mignozzi, Margherita Pizzato, Carlo La Vecchia, Gianfranco Alicandro

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


