Explainability in Microbiome-Based Models for CRC Prediction via Partial Dependence Plots
DOI: https://doi.org/10.54103/2282-0930/29480
Abstract
INTRODUCTION
Gut microbiome profiling through 16S rRNA sequencing has emerged as a promising non-invasive tool for colorectal cancer (CRC) detection. Despite their predictive accuracy, machine learning (ML) models often struggle with interpretability, especially when dealing with high-dimensional and correlated microbial data. Ensemble methods such as random forests provide strong classification performance, but their internal mechanisms are opaque. The fuzzy forest (FF) algorithm extends the random forest approach by improving feature selection under multicollinearity, but still lacks direct interpretability of predictions. To address this limitation, explainability techniques such as Partial Dependence Plots (PDPs) can be used to visualize the marginal contribution of key features, enabling better understanding of the relationships between microbial taxa and disease risk.
OBJECTIVES
This study aims to enhance the interpretability of a microbiome-based classifier applied to the 16S rRNA sequencing dataset of Baxter et al. by using Partial Dependence Plots (PDPs). It also aims to reduce feature importance bias by employing the fuzzy forest (FF) method, which addresses the limitations of random forests in handling highly correlated features. PDPs visualize the marginal effect of each microbial or clinical feature on the predicted probability of CRC. The goal is to offer interpretable insights into the nonlinear and complex relationships captured by the FF model.
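For reference, the marginal effect that a PDP displays can be written in the standard textbook form below. The notation is an assumption introduced here and is not taken from the study: \hat{f} is the fitted classifier's predicted CRC probability, x_S is the feature of interest, X_C are the remaining features, and n is the number of samples.

```latex
\hat{f}_S(x_S)
  = \mathbb{E}_{X_C}\!\left[\hat{f}(x_S, X_C)\right]
  \approx \frac{1}{n} \sum_{i=1}^{n} \hat{f}\!\left(x_S, x_C^{(i)}\right)
```

The right-hand side is the empirical estimate: the feature of interest is fixed at x_S in every sample, and the resulting predictions are averaged.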
METHODS
We analysed faecal samples from CRC patients and healthy controls included in the Baxter et al. dataset. After centered log-ratio (clr) transformation of the data, we implemented the fuzzy forest (FF) algorithm for feature selection and classification. FF enhances the standard random forest by combining correlation clustering with recursive feature elimination, which reduces the bias in feature ranking that arises under high multicollinearity. We then applied PDPs to the top-ranked microbial and clinical features. These plots show the marginal effect of each feature on the model's predicted probability of CRC, averaged over the observed values of the other features, offering a means to interpret the impact of each variable in isolation.
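The two generic steps in this pipeline, the clr transformation and the partial dependence computation, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pseudocount of 0.5, the toy count table, and `toy_predict` (a hypothetical stand-in for a fitted classifier, not a fuzzy forest) are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence


def clr(counts: Sequence[float], pseudocount: float = 0.5) -> List[float]:
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount (an assumption, not from the study) avoids log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log equals dividing by the geometric mean.
    return [v - mean_log for v in logs]


def partial_dependence(
    predict: Callable[[Sequence[float]], float],
    X: Sequence[Sequence[float]],
    feature: int,
    grid: Sequence[float],
) -> List[float]:
    """Average prediction over all samples with one feature fixed per grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)
            modified[feature] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append(total / len(X))
    return curve


def toy_predict(row: Sequence[float]) -> float:
    """Hypothetical model: probability rises only once feature 0 exceeds 1.0."""
    score = 2.0 * max(row[0] - 1.0, 0.0) + 0.5 * row[1]
    return 1.0 / (1.0 + math.exp(-score))


# Toy count table (3 samples x 3 taxa); the values are invented for illustration.
counts = [[10, 0, 90], [5, 5, 40], [0, 20, 30]]
X = [clr(sample) for sample in counts]

grid = [-1.0, 0.0, 1.0, 2.0, 3.0]
pd_curve = partial_dependence(toy_predict, X, feature=0, grid=grid)
for value, prob in zip(grid, pd_curve):
    print(f"feature 0 = {value:+.1f} -> mean predicted probability {prob:.3f}")
```

Because `toy_predict` responds to feature 0 only above 1.0, the resulting curve is flat up to that value and rises afterwards, which is the kind of threshold shape a PDP makes visible. With a real fitted model, `predict` would be replaced by the model's predicted-probability function for the CRC class.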
RESULTS
The PDPs highlighted non-linear and threshold effects for both microbial and clinical predictors in the Baxter dataset (Figure 1). Age showed a biphasic relationship with CRC probability: a decreasing effect up to around 65 years, followed by a marked increase in risk thereafter. Among microbial features, Porphyromonas (ASV 417) was positively associated with CRC in a monotonic pattern, whereas Faecalibacterium (ASV 471) and Paraprevotella (ASV 446) showed threshold behaviour, with CRC probability increasing only beyond certain abundance levels. These results support the use of non-linear models for microbiome-based prediction tasks and highlight biologically plausible patterns in feature-response relationships.
CONCLUSIONS
By combining fuzzy forest feature selection with Partial Dependence Plots, we constructed an interpretable and robust modelling pipeline for microbiome-based CRC prediction. This framework enables a better understanding of the individual contribution of microbial and clinical features to model predictions, enhancing both scientific interpretability and clinical relevance.
References
[1] Conn D., Ngun T., Li G. et al., Fuzzy forests: extending random forest feature selection for correlated, high-dimensional data. J Stat Soft, 2019; 91:1–25
[2] Porreca A., Ibrahimi E., Maturo F. et al., Robust prediction of colorectal cancer via gut microbiome 16S rRNA sequencing data. J Med Microbiol, 2024; 73:001903
License
Copyright (c) 2025 Annamaria Porreca, Eliana Ibrahimi, Fabrizio Maturo, Laura Judith Marcos Zambrano, Melisa Meto, Marta B. Lopes

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


