Healing sample selection bias by source classifier selection

Seah, CW; Tsang, IWH; Ong, YS

Healing sample selection bias by source classifier selection

Seah, CW Tsang, IWH

Ong, YS

Permalink

Publication Type:: Conference Proceeding
Citation:: Proceedings - IEEE International Conference on Data Mining, ICDM, 2011, pp. 577 - 586
Issue Date:: 2011-12-01

Closed Access

	Filename	Description	Size
	2013004337OK.pdf		432.2 kB	Adobe PDF	View/Open

Copyright Clearance Process

Recently Added
In Progress
Closed Access

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Seah, CW	en_US
dc.contributor.author	Tsang, IWH https://orcid.org/0000-0001-8095-4637	en_US
dc.contributor.author	Ong, YS	en_US
dc.date.issued	2011-12-01	en_US
dc.identifier.citation	Proceedings - IEEE International Conference on Data Mining, ICDM, 2011, pp. 577 - 586	en_US
dc.identifier.isbn	9780769544083	en_US
dc.identifier.issn	1550-4786	en_US
dc.identifier.uri	http://hdl.handle.net/10453/29434
dc.description.abstract	Domain Adaptation (DA) methods are usually carried out by means of simply reducing the marginal distribution differences between the source and target domains, and subsequently using the resultant trained classifier, namely source classifier, for use in the target domain. However, in many cases, the true predictive distributions of the source and target domains can be vastly different especially when their class distributions are skewed, causing the issues of sample selection bias in DA. Hence, DA methods which leverage the source labeled data may suffer from poor generalization in the target domain, resulting in negative transfer. In addition, we observed that many DA methods use either a source classifier or a linear combination of source classifiers with a fixed weighting for predicting the target unlabeled data. Essentially, the labels of the target unlabeled data are spanned by the prediction of these source classifiers. Motivated by these observations, in this paper, we propose to construct many source classifiers of diverse biases and learn the weight for each source classifier by directly minimizing the structural risk defined on the target unlabeled data so as to heal the possible sample selection bias. Since the weights are learned by maximizing the margin of separation between opposite classes on the target unlabeled data, the proposed method is established here as Maximal Margin Target Label Learning (MMTLL), which is in a form of Multiple Kernel Learning problem with many label kernels. Extensive experimental studies of MMTLL against several state-of-the-art methods on the Sentiment and Newsgroups datasets with various imbalanced class settings showed that MMTLL exhibited robust accuracies on all the settings considered and was resilient to negative transfer, in contrast to other counterpart methods which suffered significantly in prediction accuracy. © 2011 IEEE.	en_US
dc.relation.ispartof	Proceedings - IEEE International Conference on Data Mining, ICDM	en_US
dc.relation.isbasedon	10.1109/ICDM.2011.73	en_US
dc.title	Healing sample selection bias by source classifier selection	en_US
dc.type	Conference Proceeding
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
dc.location.activity	Vancouver, Canada	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	closed_access
pubs.publication-status	Published	en_US

Abstract:

Domain Adaptation (DA) methods are usually carried out by means of simply reducing the marginal distribution differences between the source and target domains, and subsequently using the resultant trained classifier, namely source classifier, for use in the target domain. However, in many cases, the true predictive distributions of the source and target domains can be vastly different especially when their class distributions are skewed, causing the issues of sample selection bias in DA. Hence, DA methods which leverage the source labeled data may suffer from poor generalization in the target domain, resulting in negative transfer. In addition, we observed that many DA methods use either a source classifier or a linear combination of source classifiers with a fixed weighting for predicting the target unlabeled data. Essentially, the labels of the target unlabeled data are spanned by the prediction of these source classifiers. Motivated by these observations, in this paper, we propose to construct many source classifiers of diverse biases and learn the weight for each source classifier by directly minimizing the structural risk defined on the target unlabeled data so as to heal the possible sample selection bias. Since the weights are learned by maximizing the margin of separation between opposite classes on the target unlabeled data, the proposed method is established here as Maximal Margin Target Label Learning (MMTLL), which is in a form of Multiple Kernel Learning problem with many label kernels. Extensive experimental studies of MMTLL against several state-of-the-art methods on the Sentiment and Newsgroups datasets with various imbalanced class settings showed that MMTLL exhibited robust accuracies on all the settings considered and was resilient to negative transfer, in contrast to other counterpart methods which suffered significantly in prediction accuracy. © 2011 IEEE.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/29434