A Novel One-Class Classification Framework for Highly Imbalanced Binary Outcomes: the OC-Cat Approach

Authors

  • Federico Fassio Department of Public Health, Experimental and Forensic Medicine, University of Pavia image/svg+xml
  • Jessica Leoni Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano image/svg+xml
  • Rebecca Fattore Department of Infectious Diseases, Unit II, L. Sacco Hospital, ASST Fatebenefratelli Sacco image/svg+xml
  • Giovanni Scaglione Department of Infectious Diseases, Unit II, L. Sacco Hospital, ASST Fatebenefratelli Sacco image/svg+xml
  • Giovanni De Capitani Department of Infectious Diseases
  • Fabio Borgonovo Department of Infectious Diseases, Unit II, L. Sacco Hospital, ASST Fatebenefratelli Sacco image/svg+xml
  • Claudia Conflitti National PhD Programme in One Health Approaches to Infectious Diseases and Life Science Research, Department of Public Health, Experimental and Forensic Medicine, University of Pavia image/svg+xml
  • Daniele Zizzo Department of Infectious Diseases, Unit II, L. Sacco Hospital, ASST Fatebenefratelli Sacco image/svg+xml
  • Antonio Gidaro Division of Internal Medicine, Luigi Sacco Hospital, University of Milan image/svg+xml
  • Maria Calloni Division of Internal Medicine, Luigi Sacco Hospital, University of Milan image/svg+xml
  • Francesco Casella Division of Internal Medicine, Luigi Sacco Hospital, University of Milan image/svg+xml
  • Chiara Cogliati Division of Internal Medicine, Luigi Sacco Hospital, University of Milan image/svg+xml
  • Andrea Gori Department of Infectious Diseases, Unit II, L. Sacco Hospital, ASST Fatebenefratelli Sacco ; Department of Biomedical and Clinical Sciences, University of Milan, Milan, Italy image/svg+xml
  • Antonella Foschi Department of Infectious Diseases, Unit II, L. Sacco Hospital, ASST Fatebenefratelli Sacco image/svg+xml
  • Marta Colaneri Department of Infectious Diseases, Unit II, L. Sacco Hospital, ASST Fatebenefratelli Sacco ; Department of Biomedical and Clinical Sciences, University of Milan, Milan, Italy image/svg+xml
  • Valentina Breschi Department of Electrical Engineering, Eindhoven University of Technology image/svg+xml

DOI:

https://doi.org/10.54103/2282-0930/29381

Abstract

Introduction

Extremely rare events can challenge traditional classification models, which may exhibit reduced power in highly unbalanced datasets (i.e., when two or more target groups are unevenly represented). Moreover, this effect seems to be accentuated by the reduction of the sample size. Some of the easiest and intuitive methods proposed to handle unbalanced datasets, while still using a classical statistical models, are random under- or oversampling or hybrid methods[1]. Alternatively, other approaches have been proposed with different strategies, such as ensemble models (e.g. AdaBoost, XGBoost), or novelty detection models[2].

In medicine, this kind of scenario can occur when analysing catheter related/associated blood stream infections (CRBSI/CABSI), whose incidence usually remains <1/1000 catheter days[3], but could be higher in very frail patients[4]. Catheter insertion has a potential risk of complications and longer hospitalization: the use of decision-making algorithms is of great importance in order to avoid complications for these patients[5].

Objectives

The main purpose of our study is to adopt a novel anomaly detection model focused on binary/categorical covariates to predict risk of CRBSI/CABSI occurrence at baseline. To reach this result, we use a combined approach: features reduction, novelty detection algorithm and importance grid for model explainability.

Methods

Data from hospital patients who received a vascular access device (VAD) placements at the University Hospital Luigi Sacco in Milan between January 2021 and January 2025 were analysed. All patients underwent central or peripheral catheterization in a non-ICU department. Parameters were collected at catheter insertion: age, sex, any major comorbidities, active intravenous drug usage, parenteral nutrition, regimen of hospitalization, transfer from the ICU, type of catheter, number of lumens, tunnel, exit site and number of placement attempts. All continuous variables were discretized into categorical format, yielding  29 Boolean and 2 categorical features.

The designed framework (OC-Cat) combines:

1) a graph-search-based feature selection method;

2) a one-class soft classifier designed (based on characterization of patients who didn’t incurred in catheter infection);

3) a feature ranking that clarifies the classifier's decisions by ordering features based on their unique role in identifying uninfected patients.

In details:

  • we assess the redundancy of each pair of features using the excess over independence metric[6]. Then, we design a undirected connected graph where each node represents a feature, and the edge weights reflect the excess over independence between feature pairs. From each node, we apply the Bellman-Ford algorithm[7] to find the shortest closed path. Among all paths, we select the one that best represents the original data based on the Bayesian Information Criterion (BIC). The features included in this optimal path constitute the final selected feature set;
  • to design the soft-classifier, we rely on the assumption that a higher occurrence of a specific feature combination in majority class records (uninfected) implies that each new instance with those values is less likely to be infected. The learning phase consists of estimating the probability for a majority-class record occurring, given the distribution of uninfected patients. The prediction phase, instead, consists of estimated the majority-class probability for a new record (based on its i‑th attribute combination) using a weighted inverse Hamming distance [8]. The weight increases with the record's frequency among uninfected patients;
  • accordingly, the method ranks features based on a tailored definition of importance, stating that a feature - or a features set - is more important if it consistently exhibits the same value in majority-class data. To achieve this, we build a tree where nodes represent subsets of features, and each step measures the contribution of each new feature in reducing the majority-class data entropy. Last, once exploring all feature combinations and identifying the path with minimal entropy, the algorithm reports the features ranking as the order in which features appear along the path: from the root (most important) to the leaf (least important).

To evaluate the framework performance in terms of one-class classification, we compared OC-Cat probability distribution with that obtained from Isolation Forest (iForest) and One-Class Support Vector Machine (OCSVM). For the analysis, dataset was split into training and test set (August 2023 as threshold: ~75% vs 25%).

Results

Data from 2836 hospitalized patients with VADs were retrieved. After keeping only the first VAD placement for each patients, we considered 2275 subjects (1222 women and 1053 men between 18 to 101 years) Among them, 148 become infected: 62 patients developed a CRBSI, 80 a CABSI and 3 both. In the first step, our approach retained 16 out of 29 variables, which were then inserted in the novel model in the second step. Figure 1 displays the risk factor index distributions for the training and test sets of our model, iForest, and OCSVM, along with their respective ROC curves. Lastly, catheter insertion site (upper vs lower limb vs neck), biological sex, hypertension, Charlson Comorbidity index, neurological disease and diabetes resulted the first most characterizing feature.

Conclusion

Our model introduces a novel, integrated approach for both characterizing and forecasting outcomes under severe imbalance in the target variable. It outperformed the iForest and OCSVM models applied to categorical and Boolean variables in a specific clinical contest. We are currently conducting further analysis and refinements to optimize performance on both our internal and external datasets, enhancing the model's generalization.

 

 

Downloads

Download data is not yet available.

References

[1] Hoens TR, Chawla NV. Imbalanced Datasets: From Sampling to Classifiers. Imbalanced Learning, John Wiley & Sons, Ltd; 2013, p. 43–59. DOI: https://doi.org/10.1002/9781118646106.ch3

[2] Pimentel MAF, Clifton DA, Clifton L et al. A review of novelty detection. Signal Processing 2014;99:215–49. DOI: https://doi.org/10.1016/j.sigpro.2013.12.026

[3] Dreesen M, Foulon V, Spriet I et al. Epidemiology of catheter-related infections in adult patients receiving home parenteral nutrition: a systematic review. Clin Nutr 2013;32:16–26. DOI: https://doi.org/10.1016/j.clnu.2012.08.004

[4] Zhao VM, Griffith DP, Blumberg HM, et al. Characterization of Post-Hospital Infections in Adults Requiring Home Parenteral Nutrition. Nutrition 2013;29:52–9. DOI: https://doi.org/10.1016/j.nut.2012.03.010

[5] Catho G, Fortchantre L, Teixeira D, et al. Surveillance of catheter-associated bloodstream infections: development and validation of a fully automated algorithm. Antimicrob Resist Infect Control 2024;13:38. DOI: https://doi.org/10.1186/s13756-024-01395-4

[6] Maron ME, Kuhns JL. On Relevance, Probabilistic Indexing and Information Retrieval. J ACM 1960;7:216–44. DOI: https://doi.org/10.1145/321033.321035

[7] Bellman R. On a routing problem. Quart Appl Math 1958;16:87–90. DOI: https://doi.org/10.1090/qam/102435

[8] Hamming RW. Error Detecting and Error Correcting Codes. Bell System Technical Journal 1950;29:147–60. DOI: https://doi.org/10.1002/j.1538-7305.1950.tb00463.x

[9] Robertson SE. The probability ranking principle in IR. Journal of Documentation 1977;33:294–304. DOI: https://doi.org/10.1108/eb026647

Published

2025-09-08

How to Cite

1.
Fassio F, Leoni J, Fattore R, Scaglione G, De Capitani G, Borgonovo F, et al. A Novel One-Class Classification Framework for Highly Imbalanced Binary Outcomes: the OC-Cat Approach. ebph [Internet]. 2025 [cited 2026 Feb. 6];. Available from: https://riviste.unimi.it/index.php/ebph/article/view/29381

Issue

Section

Congress Abstract - Section 3: Metodi Biostatistici