Detection of outlier residues for improving interface prediction in protein heterocomplexes

Chen, P; Wong, L; Li, J

Publication Type: Journal Article
Citation: IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9 (4), pp. 1155-1165
Issue Date: 2012-05-31

Closed Access

	Filename	Description	Size	Format
	2011005479OK.pdf		1.5 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Chen, P	en_US
dc.contributor.author	Wong, L https://orcid.org/0000-0003-1241-5441	en_US
dc.contributor.author	Li, J https://orcid.org/0000-0003-1833-7413	en_US
dc.date.issued	2012-05-31	en_US
dc.identifier.citation	IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9 (4), pp. 1155 - 1165	en_US
dc.identifier.issn	1545-5963	en_US
dc.identifier.uri	http://hdl.handle.net/10453/22338
dc.description.abstract	Sequence-based understanding and identification of protein binding interfaces is a challenging research topic because of the complexity of protein systems and the imbalanced distribution of interface and noninterface residues. This paper presents an outlier detection idea to address the redundancy problem in protein interaction data. The cleaned training data are then used to improve prediction performance. We use three novel measures to describe the extent to which a residue is considered an outlier in comparison with the other residues: the distance of a residue instance from the center instance of all residue instances of the same class label (Dist), the probability of the class label of the residue instance (PCL), and the importance of within-class and between-class (IWB) residue instances. Outlier scores are computed by integrating the three factors; instances with a sufficiently large score are treated as outliers and removed. The data sets without outliers are taken as input for a support vector machine (SVM) ensemble. The proposed SVM ensemble trained on input data without outliers performs better than the same ensemble trained on data with outliers. Our method is also more accurate than many methods in the literature on benchmark data sets. From our empirical studies, we found that some outlier interface residues are truly near noninterface regions, and some outlier noninterface residues are close to interface regions. © 2012 IEEE.	en_US
dc.relation.ispartof	IEEE/ACM Transactions on Computational Biology and Bioinformatics	en_US
dc.relation.isbasedon	10.1109/TCBB.2012.58	en_US
dc.subject.classification	Bioinformatics	en_US
dc.subject.mesh	Proteins	en_US
dc.subject.mesh	Area Under Curve	en_US
dc.subject.mesh	Sequence Analysis, Protein	en_US
dc.subject.mesh	Computational Biology	en_US
dc.subject.mesh	Protein Binding	en_US
dc.subject.mesh	Databases, Protein	en_US
dc.subject.mesh	Protein Interaction Domains and Motifs	en_US
dc.subject.mesh	Support Vector Machine	en_US
dc.title	Detection of outlier residues for improving interface prediction in protein heterocomplexes	en_US
dc.type	Journal Article
utslib.citation.volume	4	en_US
utslib.citation.volume	9	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	01 Mathematical Sciences	en_US
utslib.for	06 Biological Sciences	en_US
utslib.for	08 Information and Computing Sciences	en_US
dc.location.activity	Sydney, Australia
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Strength - CHT - Health Technologies
utslib.copyright.status	closed_access
pubs.issue	4	en_US
pubs.publication-status	Published	en_US
pubs.volume	9	en_US

Abstract:

Sequence-based understanding and identification of protein binding interfaces is a challenging research topic because of the complexity of protein systems and the imbalanced distribution of interface and noninterface residues. This paper presents an outlier detection idea to address the redundancy problem in protein interaction data. The cleaned training data are then used to improve prediction performance. We use three novel measures to describe the extent to which a residue is considered an outlier in comparison with the other residues: the distance of a residue instance from the center instance of all residue instances of the same class label (Dist), the probability of the class label of the residue instance (PCL), and the importance of within-class and between-class (IWB) residue instances. Outlier scores are computed by integrating the three factors; instances with a sufficiently large score are treated as outliers and removed. The data sets without outliers are taken as input for a support vector machine (SVM) ensemble. The proposed SVM ensemble trained on input data without outliers performs better than the same ensemble trained on data with outliers. Our method is also more accurate than many methods in the literature on benchmark data sets. From our empirical studies, we found that some outlier interface residues are truly near noninterface regions, and some outlier noninterface residues are close to interface regions. © 2012 IEEE.
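
The scoring-and-removal step described in the abstract can be illustrated with a rough sketch. The abstract names the three measures but does not give their formulas or say how they are integrated, so everything below is an assumption made for illustration and not the authors' method: Dist is taken as the Euclidean distance to the class centroid, scaled by the class's mean distance; PCL is approximated by the fraction of the k nearest neighbours that share the instance's label; IWB is approximated by comparing mean within-class and between-class distances; and the three terms are simply averaged. The function names, `k`, and `threshold` are hypothetical, and the SVM ensemble is not shown.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def outlier_scores(X, y, k=3):
    """Return one score in [0, 1] per instance; larger means more outlier-like.

    All three terms are illustrative stand-ins (assumptions), not the
    definitions used in the paper.
    """
    labels = set(y)
    centers = {c: centroid([x for x, lab in zip(X, y) if lab == c]) for c in labels}
    dist = [euclid(x, centers[lab]) for x, lab in zip(X, y)]
    mean_dist = {}
    for c in labels:
        ds = [d for d, lab in zip(dist, y) if lab == c]
        mean_dist[c] = sum(ds) / len(ds)

    scores = []
    for i, (x, lab) in enumerate(zip(X, y)):
        others = [(euclid(x, X[j]), y[j]) for j in range(len(X)) if j != i]

        # Dist stand-in: distance to own-class centroid relative to the class mean.
        denom = dist[i] + mean_dist[lab]
        d_term = dist[i] / denom if denom > 0 else 0.0

        # PCL stand-in: share of the k nearest neighbours with the same label.
        nearest = sorted(others, key=lambda t: t[0])[:k]
        pcl = sum(1 for _, l in nearest if l == lab) / len(nearest)

        # IWB stand-in: mean within-class distance versus mean between-class distance.
        within = [d for d, l in others if l == lab]
        between = [d for d, l in others if l != lab]
        w = sum(within) / len(within) if within else 0.0
        b = sum(between) / len(between) if between else 0.0
        iwb = w / (w + b) if (w + b) > 0 else 0.0

        # Assumed integration rule: plain average of the three terms.
        scores.append((d_term + (1.0 - pcl) + iwb) / 3.0)
    return scores


def remove_outliers(X, y, threshold=0.6, k=3):
    """Drop instances whose score reaches the (hypothetical) threshold."""
    scores = outlier_scores(X, y, k)
    kept = [(x, lab) for x, lab, s in zip(X, y, scores) if s < threshold]
    return [x for x, _ in kept], [lab for _, lab in kept]


# Toy data: label 1 = interface, 0 = noninterface. The last point is labelled
# as interface but sits inside the noninterface cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]
y = [1, 1, 1, 1, 0, 0, 0, 0, 1]
X_clean, y_clean = remove_outliers(X, y)
```

On this toy data the mislabelled-looking point receives the highest score and is the only instance removed. In a fuller pipeline, `X_clean` and `y_clean` would then be passed to whatever classifier is being trained; the abstract uses an SVM ensemble for that step.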

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/22338