The effect of varying levels of class distribution on bagging for different algorithms: An empirical study

Publication Type:
Journal Article
International Journal of Machine Learning and Cybernetics, 2014, 5 (1), pp. 63 - 71
Many real-world applications involve highly imbalanced class distributions. Learning from imbalanced class distributions is considered one of the ten challenging problems in data mining research, and it has increasingly captured the attention of both academia and industry. In this work, we study the effects of different levels of imbalanced class distribution on bagging predictors by using under-sampling techniques. Despite the popularity of bagging in many real-world applications, some questions have not been clearly answered in the existing research, such as how varying the level of class distribution affects different bagging predictors, e.g., whether bagging remains superior to single learners as the level of class distribution changes. Most classification learning algorithms are designed to maximize the overall accuracy rate and assume that training instances are uniformly distributed; however, overall accuracy does not reflect correct prediction on the minority class, which is the class of interest to users. The overall accuracy metric is therefore ineffective for evaluating the performance of classifiers on extremely imbalanced data. This study investigates the effect of varying levels of class distribution on different bagging predictors using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) as the performance metric, applying an under-sampling technique to 14 data-sets with imbalanced class distributions. Our experimental results indicate that Decision Table (DTable) and RepTree are the learning algorithms with the best bagging AUC performance. The AUC performance of bagging predictors is statistically superior to that of single learners, with the exception of Support Vector Machines (SVM) and Decision Stump (DStump). © 2012 Springer-Verlag.
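The two methodological building blocks named in the abstract, under-sampling the majority class to a chosen class distribution and scoring classifiers by AUC rather than accuracy, can be sketched as follows. This is an illustrative sketch only, not the authors' code; the function names, the 1/0 label convention, and the `ratio` parameter (majority size as a multiple of minority size) are assumptions for the example.

```python
# Illustrative sketch (not the authors' implementation):
# (1) under-sample the majority class to a target class distribution;
# (2) compute AUC from classifier scores via the Mann-Whitney formulation.
import random

def under_sample(X, y, ratio=1.0, seed=0):
    """Keep every minority example; randomly keep majority examples so that
    |majority| ~= ratio * |minority|. Labels assumed binary (1 = positive)."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    if len(pos) > len(neg):          # ensure `pos` holds the minority class
        pos, neg = neg, pos
    keep = pos + rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    return [X[i] for i in keep], [y[i] for i in keep]

def auc(scores, labels):
    """AUC = probability that a random positive is scored above a random
    negative, counting ties as one half (Mann-Whitney statistic)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Unlike accuracy, the AUC computed this way is insensitive to the class ratio of the evaluation set, which is why it suits experiments that deliberately vary the class distribution.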