Improving Software Defect Prediction Using Cluster Undersampling

Overview
Cite Work

ABSTRACT

Adequately learning and classifying datasets that are highly unbalanced has become one of the most challenging tasks in Data Mining and Machine Learning disciplines. Most datasets are adversely affected by the class imbalance problem due to the limited occurrence of positive examples. This phenomenon adversely affects the ability of classification algorithms to adequately learn from these data to correctly classify positive examples in new datasets. Data sampling techniques presented in Data Mining and Machine Learning literature are often used to manipulate the training data in order to minimize the level of imbalance prior to training classification models. This study presents an undersampling technique that has the capability of further improving the performance of classification algorithms when learning from imbalance datasets. The technique targets the removal of potential problematic instances from the majority class in the course of undersampling. The proposed technique uses Tomek links to detect and remove noisy/inconsistent instances, and data clustering to detect and remove outliers and redundant instances from the majority class. The proposed technique is implemented in Java within the framework of the WEKA machine learning tool. The performance of the proposed sampling technique has been evaluated with WEKA machine learning tool using C4.5 and OneR classification algorithms. Sixteen datasets with varying degrees of imbalance are used. The performance of the models when CUST is used are compared to RUS, ROS, CBU, SMOTE, OSS, and when no sampling is performed prior to training (NONE). The results of CUST are encouraging as compared to the other techniques particularly in datasets that have less than 2% minority instances and larger quantities of repeated instances. The experimental results using AUC and G-Mean showed that CUST resulted in higher performance in most of the datasets than the other methods. The average performance of the classification algorithms across the datasets for each technique also showed that CUST resulted in the highest average performance in all test cases. Statistical comparison of the mean performance also revealed that CUST performed statistically better than ROS, SMOTE, OSS and NONE in all test cases. CUST however, performed statistically the same as RUS and CBU, but with a higher mean performance. The results also confirmed that CUST is a viable alternative to the already existing sampling techniques particularly when the datasets are highly unbalanced with larger quantities of repeated, noisy instances and outliers.

Read Download

Improving Software Defect Prediction Using Cluster Undersampling

Related Works