Handling over-fitting in test cost-sensitive decision tree learning by feature selection, smoothing and pruning

Wang, T; Qin, Z; Jin, Z; Zhang, S

Publication Type: Journal Article
Citation: Journal of Systems and Software, 2010, 83 (7), pp. 1137-1147
Issue Date: 2010-07-01

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, T	en_US
dc.contributor.author	Qin, Z	en_US
dc.contributor.author	Jin, Z	en_US
dc.contributor.author	Zhang, S	en_US
dc.date.issued	2010-07-01	en_US
dc.identifier.citation	Journal of Systems and Software, 2010, 83 (7), pp. 1137 - 1147	en_US
dc.identifier.issn	0164-1212	en_US
dc.identifier.uri	http://hdl.handle.net/10453/13481
dc.relation.ispartof	Journal of Systems and Software	en_US
dc.relation.isbasedon	10.1016/j.jss.2010.01.002	en_US
dc.subject.classification	Software Engineering	en_US
dc.title	Handling over-fitting in test cost-sensitive decision tree learning by feature selection, smoothing and pruning	en_US
dc.type	Journal Article
utslib.citation.volume	7	en_US
utslib.citation.volume	83	en_US
utslib.for	0806 Information Systems	en_US
utslib.for	0803 Computer Software	en_US
utslib.for	0804 Data Format	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Systems, Management and Leadership
utslib.copyright.status	closed_access
pubs.issue	7	en_US
pubs.publication-status	Published	en_US
pubs.volume	83	en_US

Abstract:

Cost-sensitive learning algorithms are typically designed to minimize the total cost when multiple costs are taken into account. Like other learning algorithms, they face a significant challenge in applied settings: over-fitting. Specifically, they can generate good results on training data but normally do not produce an optimal model when applied to unseen data in real-world applications. This is called data over-fitting. This paper addresses data over-fitting by designing three simple and efficient strategies for the TCSDT (test cost-sensitive decision tree) method: feature selection, smoothing and threshold pruning. The feature selection approach is used to pre-process the data set before applying the TCSDT algorithm. Smoothing and threshold pruning are used within the TCSDT algorithm before calculating the class probability estimate for each decision tree leaf. To evaluate our approaches, we conduct extensive experiments on selected UCI data sets across different cost ratios, and on a real-world data set, KDD-98, with real misclassification costs. The experimental results show that our algorithms outperform both the original TCSDT and other competing algorithms in reducing data over-fitting. © 2010 Elsevier Inc. All rights reserved.
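
The abstract does not give the smoothing formula, so the following is only a minimal sketch of the general idea, assuming the common Laplace correction for leaf class probabilities and a simple minimum-expected-cost decision rule. The function names, the `alpha` parameter, and the cost values are hypothetical and are not taken from the paper.

```python
from typing import Dict


def laplace_leaf_probabilities(class_counts: Dict[str, int],
                               alpha: float = 1.0) -> Dict[str, float]:
    """Laplace-smoothed class probabilities for one leaf (assumed technique)."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {
        label: (count + alpha) / (total + alpha * num_classes)
        for label, count in class_counts.items()
    }


def min_expected_cost_label(probs: Dict[str, float],
                            cost: Dict[str, Dict[str, float]]) -> str:
    """Pick the prediction with the lowest expected misclassification cost.

    cost[predicted][actual] is the cost of predicting `predicted`
    when the true class is `actual`.
    """
    def expected_cost(predicted: str) -> float:
        return sum(probs[actual] * cost[predicted][actual] for actual in probs)

    return min(probs, key=expected_cost)


# A small leaf that saw only three training examples, all negative.
leaf_counts = {"positive": 0, "negative": 3}
total = sum(leaf_counts.values())

# Unsmoothed frequency estimate: the positive class gets probability 0.
raw = {label: count / total for label, count in leaf_counts.items()}
smoothed = laplace_leaf_probabilities(leaf_counts)

# Hypothetical costs: missing a positive is ten times worse than a false alarm.
cost = {
    "positive": {"positive": 0.0, "negative": 1.0},
    "negative": {"positive": 10.0, "negative": 0.0},
}

raw_label = min_expected_cost_label(raw, cost)
smoothed_label = min_expected_cost_label(smoothed, cost)
```

With these illustrative numbers, the unsmoothed estimate treats the leaf as certainly negative, while the smoothed estimate keeps some probability mass on the positive class, which is enough to change the minimum-cost prediction at a leaf backed by so few examples.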

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/13481