Enhanced Software Defect Prediction Using Ensemble Learning with Correlation-Based Feature Selection and SMOTE
Abstract
Numerous studies have explored software defect prediction using machine learning algorithms; however, their performance on publicly available defect datasets often remains limited due to high feature dimensionality and class imbalance. This study addresses these issues using five AEEEM project datasets, namely Eclipse Equinox (EQ), JDT, Apache Lucene (LC), Mylyn (ML), and PDE UI. Seven ensemble learning algorithms (AdaBoost, Gradient Boosting, XGBoost, Random Forest, Extra Trees (ET), Bagging, and Stacking) were implemented. To reduce dimensionality, three feature selection techniques, namely Correlation-Based Feature Selection (CFS), Sequential Forward Selection (SFS), and Correlation-based Filter (CO), were applied, while the Synthetic Minority Oversampling Technique (SMOTE) was employed to handle class imbalance. Experiments were conducted using 10-fold and nested cross-validation, and model performance was evaluated using accuracy, recall, precision, F-measure, and area under the ROC curve (AUC) metrics. The combination of CO feature selection with the ET ensemble algorithm outperformed all other models across the five datasets. Using nested cross-validation with grid search optimization, accuracies of 92.1%, 97.3%, 99.1%, 98.2%, and 98.5% were achieved for the EQ, JDT, LC, ML, and PDE datasets, respectively. These findings demonstrate that integrating effective feature selection and data balancing significantly enhances defect prediction performance compared to models using default hyperparameters.
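The sketch below is not the authors' code; it is a minimal illustration of the best-performing pipeline the abstract describes (correlation-based filter feature selection, SMOTE, and an Extra Trees classifier, tuned with grid search inside nested cross-validation), using scikit-learn and imbalanced-learn. The dataset file name, correlation threshold, and parameter grid are assumptions for illustration only.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # applies SMOTE only to training folds
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score


def correlation_filter(X: pd.DataFrame, y: pd.Series, threshold: float = 0.1) -> list:
    """Keep features whose absolute correlation with the defect label exceeds
    `threshold` (a simple correlation-based filter; the threshold is an assumption)."""
    corr = X.apply(lambda col: col.corr(y)).abs()
    return corr[corr > threshold].index.tolist()


# X: software metrics, y: binary defect label (1 = defective); "EQ.csv" is a hypothetical export
df = pd.read_csv("EQ.csv")
X, y = df.drop(columns=["class"]), df["class"]
selected = correlation_filter(X, y)

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),            # oversample the minority (defective) class
    ("et", ExtraTreesClassifier(random_state=42)),
])
param_grid = {"et__n_estimators": [100, 300, 500], "et__max_depth": [None, 10, 20]}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)   # hyperparameter tuning
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # performance estimate
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=inner_cv)

# Nested CV: the outer loop estimates generalization while the inner loop tunes the model;
# other metrics reported in the study (recall, precision, F-measure, AUC) can be computed similarly.
scores = cross_val_score(search, X[selected], y, cv=outer_cv, scoring="accuracy")
print(f"Nested-CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Placing SMOTE inside the imblearn pipeline ensures oversampling is fitted only on each training fold, avoiding the leakage that occurs when synthetic samples are generated before cross-validation splitting.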
Copyright (c) 2026 Ethiopian Journal of Science and Sustainable Development

This work is licensed under a Creative Commons Attribution 4.0 International License.