A hybrid evolutionary preprocessing method for imbalanced datasets

Wong, GY; Leung, FHF; Ling, SH

Publication Type: Journal Article
Citation: Information Sciences, 2018, 454-455, pp. 161-177
Issue Date: 2018-07-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Wong, GY	en_US
dc.contributor.author	Leung, FHF	en_US
dc.contributor.author	Ling, SH https://orcid.org/0000-0003-0849-5098	en_US
dc.date.available	2020-07-01T19:12:23Z
dc.date.issued	2018-07-01	en_US
dc.identifier.citation	Information Sciences, 2018, 454-455 pp. 161 - 177	en_US
dc.identifier.issn	0020-0255	en_US
dc.identifier.uri	http://hdl.handle.net/10453/128105
dc.relation.ispartof	Information Sciences	en_US
dc.relation.isbasedon	10.1016/j.ins.2018.04.068	en_US
dc.rights	info:eu-repo/semantics/openAccess
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	A hybrid evolutionary preprocessing method for imbalanced datasets	en_US
dc.type	Journal Article
utslib.citation.volume	454-455	en_US
utslib.for	0102 Applied Mathematics	en_US
utslib.for	0806 Information Systems	en_US
utslib.for	0915 Interdisciplinary Engineering	en_US
utslib.for	01 Mathematical Sciences	en_US
utslib.for	08 Information and Computing Sciences	en_US
utslib.for	09 Engineering	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Biomedical Engineering
pubs.organisational-group	/University of Technology Sydney/Strength - CHT - Health Technologies
utslib.copyright.status	open_access	*
pubs.publication-status	Published	en_US
pubs.volume	454-455	en_US

Abstract:

Imbalanced datasets are commonly encountered in real-world classification problems. Many machine learning algorithms were originally designed for well-balanced datasets; therefore, re-sampling has become an important step in pre-processing imbalanced data. Re-sampling aims to balance a dataset either by increasing the samples of the smaller class (over-sampling) or by decreasing the samples of the larger class (under-sampling). In this paper, a sampling strategy based on both over-sampling and under-sampling is proposed, in which the new samples of the smaller class are created using fuzzy logic. The datasets are then improved by the evolutionary computation method of Cross-generational elitist selection, Heterogeneous recombination and Cataclysmic mutation (CHC), which under-samples both the minority and majority samples. Together, these steps form a hybrid preprocessing method for re-sampling imbalanced datasets. The method is evaluated by training Support Vector Machine (SVM), C4.5 decision tree and nearest neighbor rule classifiers on the re-sampled training sets. The experimental results show that the proposed method improves both the F-measure and the AUC. The over-sampling rate and the complexity of the classification model are also compared. The proposed method is found to be superior to all other methods under comparison and is more robust across different classifiers.
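
The general idea of combining over-sampling with under-sampling, and of scoring a classifier with the F-measure, can be illustrated with a minimal Python sketch. This is not the method of the paper: plain random duplication and random removal stand in for the fuzzy-logic sample generation and the CHC selection, and the function names (`oversample`, `undersample`, `f_measure`), the toy data and the target size of 30 are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def oversample(data, minority, target, rng):
    """Duplicate random minority rows until that class has `target` rows."""
    pool = [row for row in data if row[1] == minority]
    extra = [rng.choice(pool) for _ in range(max(0, target - len(pool)))]
    return data + extra


def undersample(data, majority, target, rng):
    """Keep a random subset of `target` majority rows; keep all other rows."""
    major = [row for row in data if row[1] == majority]
    rest = [row for row in data if row[1] != majority]
    return rest + rng.sample(major, min(target, len(major)))


def f_measure(y_true, y_pred, positive):
    """F-measure (F1): harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy data (assumed): 90 majority ("neg") rows and 10 minority ("pos") rows.
data = [((i,), "neg") for i in range(90)] + [((i,), "pos") for i in range(10)]

# Hybrid re-sampling: grow the minority to 30 rows, then shrink the majority to 30.
balanced = undersample(oversample(data, "pos", 30, rng), "neg", 30, rng)
counts = Counter(label for _, label in balanced)

# F-measure on hand-written predictions: 2 true positives, 1 false positive,
# 1 false negative, so precision = recall = 2/3.
score = f_measure(
    ["pos", "pos", "pos", "neg", "neg"],
    ["pos", "pos", "neg", "pos", "neg"],
    "pos",
)
```

After re-sampling, `counts` holds 30 rows for each class. In practice the re-sampling would be applied to the training split only, so that evaluation data keeps its original class distribution.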

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/128105