Enhanced Software Defect Prediction Using Ensemble Learning with Correlation-Based Feature Selection and SMOTE
Abstract
Numerous studies have explored software defect prediction using machine learning algorithms; however, their performance on publicly available defect datasets often remains limited due to high feature dimensionality and class imbalance. This study addresses these issues using five AEEEM project datasets, namely Eclipse Equinox (EQ), JDT, Apache Lucene (LC), Mylyn (ML), and PDE UI. Seven ensemble learning algorithms (AdaBoost, Gradient Boosting, XGBoost, Random Forest, Extra Trees (ET), Bagging, and Stacking) were implemented. To reduce dimensionality, three feature selection techniques, namely Correlation-Based Feature Selection (CFS), Sequential Forward Selection (SFS), and Correlation-based Filter (CO), were applied, while the Synthetic Minority Oversampling Technique (SMOTE) was employed to handle class imbalance. Experiments were conducted using 10-fold and nested cross-validation, and model performance was evaluated using accuracy, recall, precision, F-measure, and area under the ROC curve (AUC) metrics. The combination of CO feature selection with the ET ensemble algorithm outperformed all other models across the five datasets. Using nested cross-validation with grid search optimization, accuracies of 92.1%, 97.3%, 99.1%, 98.2%, and 98.5% were achieved for the EQ, JDT, LC, ML, and PDE datasets, respectively. These findings demonstrate that integrating effective feature selection and data balancing significantly enhances defect prediction performance compared to models using default hyperparameters.
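The sketch below is not the authors' code; it is a minimal illustration of the best-performing pipeline the abstract describes (correlation-based filter feature selection, SMOTE, and an Extra Trees classifier, tuned with grid search inside nested cross-validation), using scikit-learn and imbalanced-learn. The dataset file name, correlation threshold, and parameter grid are assumptions for illustration only.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # applies SMOTE only to training folds
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score


def correlation_filter(X: pd.DataFrame, y: pd.Series, threshold: float = 0.1) -> list:
    """Keep features whose absolute correlation with the defect label exceeds
    `threshold` (a simple correlation-based filter; the threshold is an assumption)."""
    corr = X.apply(lambda col: col.corr(y)).abs()
    return corr[corr > threshold].index.tolist()


# X: software metrics, y: binary defect label (1 = defective); "EQ.csv" is a hypothetical export
df = pd.read_csv("EQ.csv")
X, y = df.drop(columns=["class"]), df["class"]
selected = correlation_filter(X, y)

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),            # oversample the minority (defective) class
    ("et", ExtraTreesClassifier(random_state=42)),
])
param_grid = {"et__n_estimators": [100, 300, 500], "et__max_depth": [None, 10, 20]}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)   # hyperparameter tuning
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # performance estimate
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=inner_cv)

# Nested CV: the outer loop estimates generalization while the inner loop tunes the model;
# other metrics reported in the study (recall, precision, F-measure, AUC) can be computed similarly.
scores = cross_val_score(search, X[selected], y, cv=outer_cv, scoring="accuracy")
print(f"Nested-CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Placing SMOTE inside the imblearn pipeline ensures oversampling is fitted only on each training fold, avoiding the leakage that occurs when synthetic samples are generated before cross-validation splitting.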
Copyright (c) 2026 Ethiopian Journal of Science and Sustainable Development

This work is licensed under a Creative Commons Attribution 4.0 International License.