Explainability in Microbiome-Based Models for CRC Prediction via Partial Dependence Plots

Authors

  • Annamaria Porreca, Department for the Promotion of Human Science and Quality of Life, San Raffaele University of Rome; Unit of Clinical and Molecular Epidemiology, IRCCS San Raffaele Roma
  • Eliana Ibrahimi, Department of Biology, University of Tirana
  • Fabrizio Maturo, Mercatorum University
  • Laura Judith Marcos Zambrano, IMDEA Food, Madrid, Spain
  • Melisa Meto, Department of Biology, University of Tirana
  • Marta B. Lopes, NOVA Math, NOVA School of Science and Technology, Universidade Nova de Lisboa

DOI:

https://doi.org/10.54103/2282-0930/29480

Abstract

INTRODUCTION

Gut microbiome profiling through 16S rRNA sequencing has emerged as a promising non-invasive tool for colorectal cancer (CRC) detection. Despite their predictive accuracy, machine learning (ML) models often struggle with interpretability, especially when dealing with high-dimensional and correlated microbial data. Ensemble methods such as random forests provide strong classification performance, but their internal mechanisms are opaque. The fuzzy forest (FF) algorithm extends the random forest approach by improving feature selection under multicollinearity, but still lacks direct interpretability of predictions. To address this limitation, explainability techniques such as Partial Dependence Plots (PDPs) can be used to visualize the marginal contribution of key features, enabling better understanding of the relationships between microbial taxa and disease risk.

OBJECTIVES

This study aims to enhance the interpretability of a microbiome-based classifier applied to the 16S rRNA sequencing dataset of Baxter et al. by using Partial Dependence Plots (PDPs). It also aims to reduce feature importance bias by employing the fuzzy forest (FF) method, which addresses the limitations of random forests in handling highly correlated features. PDPs visualize the marginal effect of each microbial or clinical feature on the predicted probability of CRC. The goal is to offer interpretable insights into the nonlinear and complex relationships captured by the FF model.
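
For reference, the partial dependence of a fitted model on a single feature is commonly defined as shown below. The notation is an assumption introduced here for illustration and does not come from the abstract: \hat{f} denotes the model's predicted probability of CRC, x_S the feature of interest, and x_C^{(i)} the observed values of all remaining features for sample i out of n.

```latex
\hat{f}_S(x_S) \;=\; \frac{1}{n} \sum_{i=1}^{n} \hat{f}\bigl(x_S,\, x_C^{(i)}\bigr)
```

Plotting \hat{f}_S over a grid of x_S values gives the PDP for that feature.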

METHODS

We analysed faecal samples from CRC patients and healthy controls included in the dataset of Baxter et al. After centered log-ratio (clr) transformation of the data, we implemented the fuzzy forest (FF) algorithm for feature selection and classification. FF enhances the standard random forest by incorporating recursive feature elimination and correlation clustering, which yields a less biased ranking of features even in the presence of high multicollinearity. We then applied PDPs to the top-ranked microbial and clinical features. Each plot shows the marginal effect of one feature on the model's predicted probability of CRC, averaged over the observed values of the other features, offering a means to interpret the impact of each variable in isolation.
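
The two generic steps in this pipeline, the clr transformation and the partial dependence computation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pseudocount used to handle zero counts, the toy count table, and the logistic `toy_model` standing in for the fitted fuzzy forest are all hypothetical, and the function names are chosen here for readability.

```python
import math
from typing import Callable, List, Sequence


def clr_transform(counts: Sequence[float], pseudocount: float = 0.5) -> List[float]:
    """Centered log-ratio: log of each part minus the mean log over all parts.

    The pseudocount is an assumption made here so that zero counts can be logged.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def partial_dependence(
    predict_proba: Callable[[Sequence[float]], float],
    rows: Sequence[Sequence[float]],
    feature: int,
    grid: Sequence[float],
) -> List[float]:
    """Average prediction over all samples with one feature fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)       # copy so the original sample is untouched
            modified[feature] = value  # force the feature of interest to the grid value
            total += predict_proba(modified)
        curve.append(total / len(rows))
    return curve


# Hypothetical toy data: 3 samples x 3 taxa of raw counts (not from the study).
raw_counts = [[10, 0, 90], [25, 5, 70], [40, 20, 40]]
clr_rows = [clr_transform(r) for r in raw_counts]


# Hypothetical stand-in for a fitted classifier: logistic function of feature 0.
def toy_model(row: Sequence[float]) -> float:
    return 1.0 / (1.0 + math.exp(-row[0]))


grid = [-2.0, 0.0, 2.0]
pd_curve = partial_dependence(toy_model, clr_rows, feature=0, grid=grid)
```

In practice `predict_proba` would wrap the trained classifier, and `grid` would span the observed range of the feature being plotted.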

RESULTS

The PDPs highlighted nonlinear and threshold effects for both microbial and clinical predictors in the Baxter dataset (Figure 1). Age showed a biphasic relationship with predicted CRC probability: a decreasing effect up to around 65 years, followed by a marked increase in risk thereafter. Among microbial features, Porphyromonas (ASV 417) was positively associated with CRC in a monotonic pattern, whereas Faecalibacterium (ASV 471) and Paraprevotella (ASV 446) showed threshold behaviour, with CRC probability increasing only beyond certain abundance levels. These results support the use of nonlinear models for microbiome-based prediction tasks and highlight biologically plausible patterns in feature-response relationships.

CONCLUSIONS

By combining fuzzy forest feature selection with Partial Dependence Plots, we constructed an interpretable and robust modelling pipeline for microbiome-based CRC prediction. This framework enables a better understanding of the individual contribution of microbial and clinical features to model predictions, enhancing both scientific interpretability and clinical relevance.

References

[1] Conn D., Ngun T., Li G. et al., Fuzzy forests: extending random forest feature selection for correlated, high-dimensional data. J Stat Soft, 2019; 91:1–25

[2] Porreca A., Ibrahimi E., Maturo F. et al., Robust prediction of colorectal cancer via gut microbiome 16S rRNA sequencing data. J Med Microbiol, 2024; 73:001903

Published

2025-09-08

How to Cite

1. Porreca A, Ibrahimi E, Maturo F, Marcos Zambrano LJ, Meto M, Lopes MB. Explainability in Microbiome-Based Models for CRC Prediction via Partial Dependence Plots. ebph [Internet]. 2025 [cited 2026 Feb. 6];. Available from: https://riviste.unimi.it/index.php/ebph/article/view/29480

Section

Congress Abstract - Section 3: Metodi Biostatistici