Bridging local and global data cleansing: Identifying class noise in large, distributed data datasets

Zhu, X; Wu, X; Chen, Q

Publication Type: Journal Article
Citation: Data Mining and Knowledge Discovery, 2006, 12 (2-3), pp. 275-308
Issue Date: 2006-05-01

Full metadata record

Field	Value	Language
dc.contributor.author	Zhu, X	en_US
dc.contributor.author	Wu, X	en_US
dc.contributor.author	Chen, Q	en_US
dc.date.issued	2006-05-01	en_US
dc.identifier.citation	Data Mining and Knowledge Discovery, 2006, 12 (2-3), pp. 275 - 308	en_US
dc.identifier.issn	1384-5810	en_US
dc.identifier.uri	http://hdl.handle.net/10453/15210
dc.relation.ispartof	Data Mining and Knowledge Discovery	en_US
dc.relation.isbasedon	10.1007/s10618-005-0012-8	en_US
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	Bridging local and global data cleansing: Identifying class noise in large, distributed data datasets	en_US
dc.type	Journal Article
utslib.citation.volume	2-3	en_US
utslib.citation.volume	12	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	0804 Data Format	en_US
utslib.for	0806 Information Systems	en_US
dc.location.activity	Athens, Greece
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
utslib.copyright.status	closed_access
pubs.issue	2-3	en_US
pubs.publication-status	Published	en_US
pubs.volume	12	en_US

Abstract:

To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major-set-oriented scheme: the training dataset is separated into two parts (a major set and a minor set), and the classifiers learned from the major set are used to identify noise in the minor set. The obvious drawbacks of such a scheme are twofold: (1) when the underlying data volume keeps growing, it would be either physically impossible or time-consuming to load the major set into memory for inductive learning; and (2) for multiple or distributed datasets, it can be either technically infeasible or factitiously forbidden to download data from other sites (for security or privacy reasons). Therefore, these approaches have severe limitations in conducting effective global data cleansing from large, distributed datasets. In this paper, we propose a solution to bridge the local and global analysis for noise cleansing. More specifically, the proposed effort tries to identify and eliminate mislabeled data items from large or distributed datasets through local analysis and global incorporation. For this purpose, we make use of distributed datasets or partition a large dataset into subsets, each of which is regarded as a local subset and is small enough to be processed by an induction algorithm at one time to construct a local model for noise identification. We construct good rules from each subset, and use the good rules to evaluate the whole dataset. For a given instance I_t, two error count variables are used to count the number of times it has been identified as noise by all data subsets. An instance with higher error counts has a higher probability of being a mislabeled example. Two threshold schemes, majority and non-objection, are used to identify and eliminate the noisy examples.
Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach. © 2005 Springer Science + Business Media, Inc.
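
The partition, local model, and threshold steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: a nearest-centroid classifier on a single numeric feature stands in for the "good rules" induced from each subset, a single error count per instance replaces the paper's two error count variables (which the abstract does not define), and all function names and the toy data are hypothetical. Here, "majority" is read as "more than half of the local models call the instance an error" and "non-objection" as "every local model calls it an error".

```python
from collections import defaultdict

def partition(data, n_subsets):
    """Round-robin split; each part stands in for one local subset or site."""
    return [data[i::n_subsets] for i in range(n_subsets)]

def fit_centroids(subset):
    """Stand-in local learner (assumption): per-class mean of one feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, label in subset:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def error_counts(data, models):
    """For each instance, count the local models that disagree with its label."""
    return [sum(predict(m, x) != label for m in models) for x, label in data]

def flag_noise(data, n_subsets, scheme="majority"):
    """Flag suspected mislabeled instances under one of the two schemes."""
    models = [fit_centroids(s) for s in partition(data, n_subsets)]
    counts = error_counts(data, models)
    if scheme == "majority":
        return [c > len(models) / 2 for c in counts]
    if scheme == "non-objection":
        return [c == len(models) for c in counts]
    raise ValueError(f"unknown scheme: {scheme}")

# Hypothetical toy data: class "a" lies near 0-3, class "b" near 9-11,
# and (0.5, "b") is a deliberately mislabeled instance.
toy = [(0.0, "a"), (2.0, "a"), (1.0, "a"), (10.0, "b"), (0.5, "b"),
       (9.0, "b"), (1.8, "a"), (3.0, "a"), (11.0, "b")]

majority_flags = flag_noise(toy, 3, "majority")            # flags only (0.5, "b")
non_objection_flags = flag_noise(toy, 3, "non-objection")  # flags nothing
```

In this toy run, the local model trained on the subset that contains the mislabeled instance fits it and therefore does not object to its label, so the stricter non-objection reading leaves it in place while the majority reading removes it. Only the local models, not the raw subsets, need to be shared to evaluate the whole dataset, which matches the motivation given for distributed sites.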

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/15210