Mining approximate repeating patterns from sequence data with gap constraints

He, D; Zhu, X; Wu, X

Publication Type: Journal Article
Citation: Computational Intelligence, 2011, 27 (3), pp. 336-362
Issue Date: 2011-08-01

Closed Access

Filename	Description	Size	Format
2011000589OK.pdf		961.07 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	He, D	en_US
dc.contributor.author	Zhu, X	en_US
dc.contributor.author	Wu, X	en_US
dc.date.issued	2011-08-01	en_US
dc.identifier.citation	Computational Intelligence, 2011, 27 (3), pp. 336 - 362	en_US
dc.identifier.issn	0824-7935	en_US
dc.identifier.uri	http://hdl.handle.net/10453/18234
dc.relation	http://purl.org/au-research/grants/arc/DP1093762
dc.relation.ispartof	Computational Intelligence	en_US
dc.relation.isbasedon	10.1111/j.1467-8640.2011.00383.x	en_US
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	Mining approximate repeating patterns from sequence data with gap constraints	en_US
dc.type	Journal Article
utslib.citation.issue	3	en_US
utslib.citation.volume	27	en_US
utslib.for	0802 Computation Theory and Mathematics	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	0806 Information Systems	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
utslib.copyright.status	closed_access
pubs.issue	3	en_US
pubs.publication-status	Published	en_US
pubs.volume	27	en_US

Abstract:

The rapid increase in available DNA, protein, and other biological sequences has made discovering meaningful patterns in sequences an important task for bioinformatics research. Of the pattern types defined in the literature, repeating patterns with gap constraints are the most challenging to find. In this article, we identify a new research problem: mining approximate repeating patterns (ARPs) with gap constraints, where an occurrence of a pattern need only match approximately, a situation that is very common in biological sequences. To solve the problem, we propose ArpGap (ARP mining with Gap constraints), an algorithm with three major components: (1) a data-driven pattern generation approach that avoids generating unnecessary candidates for validation; (2) a back-tracking pattern search process that discovers approximate occurrences of a pattern under user-specified gap constraints; and (3) an Apriori-like deterministic pruning approach that progressively prunes patterns and stops the search when necessary. Experimental results on synthetic and real-world protein sequences indicate that ArpGap is efficient in terms of memory consumption and computational cost. The results further suggest that the proposed method is practical for discovering approximate patterns in protein sequences, where the sequence length is usually several hundred to one thousand and the pattern length is relatively short. © 2011 Wiley Periodicals, Inc.
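
To make the second component more concrete, the following is a minimal Python sketch of a back-tracking search for approximate occurrences under gap constraints. It is not the ArpGap algorithm itself, whose details are not given in this record. The function name `count_approx_occurrences` and its parameters are hypothetical, and the sketch assumes that (a) the number of wildcard positions between consecutive pattern characters must lie in `[min_gap, max_gap]`, and (b) "approximate" means at most `max_mismatch` substituted characters. The article may define both differently.

```python
def count_approx_occurrences(seq: str, pattern: str, min_gap: int,
                             max_gap: int, max_mismatch: int) -> int:
    """Count start positions in seq that begin at least one approximate
    occurrence of pattern (assumed definitions, see the lead-in)."""
    if not pattern:
        return 0

    def extend(pos: int, idx: int, mismatches: int) -> bool:
        # pos is the sequence position aligned with pattern[idx - 1].
        if idx == len(pattern):
            return True  # every pattern character has been placed
        for gap in range(min_gap, max_gap + 1):
            nxt = pos + gap + 1  # skip `gap` characters, land on the next one
            if nxt >= len(seq):
                break  # larger gaps would also run off the sequence
            cost = mismatches + (seq[nxt] != pattern[idx])
            # Abandon this branch as soon as the mismatch budget is
            # exceeded; otherwise recurse, and back-track on failure.
            if cost <= max_mismatch and extend(nxt, idx + 1, cost):
                return True
        return False

    count = 0
    for start in range(len(seq)):
        cost = int(seq[start] != pattern[0])
        if cost <= max_mismatch and extend(start, 1, cost):
            count += 1
    return count


if __name__ == "__main__":
    # Exact matching of "A", one skipped character, then "G".
    print(count_approx_occurrences("ACGTACGT", "AG", 1, 1, 0))  # 2
    # Allowing one mismatch and gaps of one or two characters.
    print(count_approx_occurrences("ACGTACGT", "AT", 1, 2, 1))  # 4
```

Counting start positions rather than every gap assignment is a simplification chosen for this sketch; a support measure used for Apriori-like pruning would need to be defined as the article specifies.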

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/18234