AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications

Yang, P; Ormerod, JT; Liu, W; Ma, C; Zomaya, AY; Yang, JYH

Publication Type: Journal Article
Citation: IEEE Transactions on Cybernetics, 2019, 49 (5), pp. 1932 - 1943
Issue Date: 2019-05-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Yang, P	en_US
dc.contributor.author	Ormerod, JT	en_US
dc.contributor.author	Liu, W https://orcid.org/0000-0002-3003-1313	en_US
dc.contributor.author	Ma, C	en_US
dc.contributor.author	Zomaya, AY	en_US
dc.contributor.author	Yang, JYH	en_US
dc.date.available	2021-08-12T19:01:01Z
dc.date.issued	2019-05-01	en_US
dc.identifier.citation	IEEE Transactions on Cybernetics, 2019, 49 (5), pp. 1932 - 1943	en_US
dc.identifier.issn	2168-2267	en_US
dc.identifier.uri	http://hdl.handle.net/10453/132177
dc.description.abstract	© 2018 IEEE. Class labels are required for supervised learning but may be corrupted or missing in various applications. In binary classification, for example, when only a subset of positive instances is labeled whereas the remaining are unlabeled, positive-unlabeled (PU) learning is required to model from both positive and unlabeled data. Similarly, when class labels are corrupted by mislabeled instances, methods are needed for learning in the presence of class label noise (LN). Here we propose adaptive sampling (AdaSampling), a framework for both PU learning and learning with class LN. By iteratively estimating the class mislabeling probability with an adaptive sampling procedure, the proposed method progressively reduces the risk of selecting mislabeled instances for model training and subsequently constructs highly generalizable models even when a large proportion of mislabeled instances is present in the data. We demonstrate the utilities of proposed methods using simulation and benchmark data, and compare them to alternative approaches that are commonly used for PU learning and/or learning with LN. We then introduce two novel bioinformatics applications where AdaSampling is used to: 1) identify kinase-substrates from mass spectrometry-based phosphoproteomics data and 2) predict transcription factor target genes by integrating various next-generation sequencing data.	en_US
dc.relation.ispartof	IEEE Transactions on Cybernetics	en_US
dc.relation.isbasedon	10.1109/TCYB.2018.2816984	en_US
dc.rights	info:eu-repo/semantics/embargoedAccess
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.subject.mesh	Phosphotransferases	en_US
dc.subject.mesh	Proteins	en_US
dc.subject.mesh	Phosphoproteins	en_US
dc.subject.mesh	Transcription Factors	en_US
dc.subject.mesh	Models, Statistical	en_US
dc.subject.mesh	Computational Biology	en_US
dc.subject.mesh	Algorithms	en_US
dc.subject.mesh	Machine Learning	en_US
dc.title	AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications	en_US
dc.type	Journal Article
utslib.citation.volume	5	en_US
utslib.citation.volume	49	en_US
utslib.for	0803 Computer Software	en_US
utslib.for	0102 Applied Mathematics	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	0906 Electrical and Electronic Engineering	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
utslib.copyright.status	open_access	*
pubs.issue	5	en_US
pubs.publication-status	Published	en_US
pubs.volume	49	en_US

Abstract:

© 2018 IEEE. Class labels are required for supervised learning but may be corrupted or missing in various applications. In binary classification, for example, when only a subset of positive instances is labeled whereas the remaining are unlabeled, positive-unlabeled (PU) learning is required to model from both positive and unlabeled data. Similarly, when class labels are corrupted by mislabeled instances, methods are needed for learning in the presence of class label noise (LN). Here we propose adaptive sampling (AdaSampling), a framework for both PU learning and learning with class LN. By iteratively estimating the class mislabeling probability with an adaptive sampling procedure, the proposed method progressively reduces the risk of selecting mislabeled instances for model training and subsequently constructs highly generalizable models even when a large proportion of mislabeled instances is present in the data. We demonstrate the utilities of proposed methods using simulation and benchmark data, and compare them to alternative approaches that are commonly used for PU learning and/or learning with LN. We then introduce two novel bioinformatics applications where AdaSampling is used to: 1) identify kinase-substrates from mass spectrometry-based phosphoproteomics data and 2) predict transcription factor target genes by integrating various next-generation sequencing data.
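
The adaptive sampling idea in the abstract can be illustrated for the PU case with a minimal sketch. This is not the authors' implementation: the Gaussian naive Bayes base classifier, the function names (`fit_gaussian_nb`, `prob_positive`, `ada_sampling_pu`), the fixed iteration count, and sampling with replacement are all assumptions chosen for brevity. The sketch starts by treating every unlabeled instance as negative, then repeatedly samples negatives in proportion to their current estimated probability of being negative, retrains, and updates those probabilities.

```python
import math
import random


def fit_gaussian_nb(X, y):
    """Fit a diagonal Gaussian naive Bayes model (assumed base classifier)."""
    model = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        n, d = len(rows), len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(d)]
        # A small constant keeps variances strictly positive.
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / n + 1e-6
            for j in range(d)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def prob_positive(model, x):
    """Posterior probability that x belongs to class 1 under the fitted model."""
    log_post = {}
    for label, (prior, means, variances) in model.items():
        lp = math.log(prior)
        for xj, m, v in zip(x, means, variances):
            lp += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        log_post[label] = lp
    top = max(log_post.values())  # subtract the max for numerical stability
    e0 = math.exp(log_post[0] - top)
    e1 = math.exp(log_post[1] - top)
    return e1 / (e0 + e1)


def ada_sampling_pu(positives, unlabeled, n_iter=10, seed=0):
    """Return an estimated probability of being positive for each unlabeled item.

    positives and unlabeled are lists of feature vectors (lists of floats).
    """
    rng = random.Random(seed)
    # Start by treating every unlabeled instance as a negative.
    p_neg = [1.0] * len(unlabeled)
    for _ in range(n_iter):
        # Sample negatives in proportion to their current probability of
        # being negative; the floor avoids an all-zero weight vector.
        weights = [max(p, 1e-9) for p in p_neg]
        sampled = rng.choices(unlabeled, weights=weights, k=len(unlabeled))
        X = positives + sampled
        y = [1] * len(positives) + [0] * len(sampled)
        model = fit_gaussian_nb(X, y)
        # Re-estimate how likely each unlabeled instance is to be negative.
        p_neg = [1.0 - prob_positive(model, x) for x in unlabeled]
    return [1.0 - p for p in p_neg]
```

Calling `ada_sampling_pu(labeled_positives, unlabeled)` returns one score per unlabeled instance; higher scores suggest hidden positives. In practice the base classifier, the stopping rule, and any final ensembling step would be chosen to suit the data.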

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/132177