NOSEP: Nonoverlapping sequence pattern mining with gap constraints

Publication Type:
Journal Article
Citation:
IEEE Transactions on Cybernetics, 2018, 48 (10), pp. 2809 - 2822
Issue Date:
2018-10-01
Files in This Item:
Filename: 08053457.pdf
Description: Published Version
Size: 2.48 MB (Adobe PDF)
© 2017 IEEE. Sequence pattern mining aims to discover frequent subsequences as patterns in a single sequence or a sequence database. By combining gap constraints (or flexible wildcards), users can specify special characteristics of the patterns and discover meaningful subsequences suitable for their own application domains, such as finding gene transcription sites from DNA sequences or discovering patterns for time series data classification. Due to the inherent complexity of sequence patterns, including the exponential candidate space with respect to pattern letters and gap constraints, to date, existing sequence pattern mining methods are either incomplete or do not support the Apriori property because the support ratio of a pattern may be greater than that of its subpatterns. Most importantly, patterns discovered by these methods are either too restrictive or too general and cannot represent underlying meaningful knowledge in the sequences. In this paper, we focus on a nonoverlapping sequence pattern mining task with gap constraints, where a nonoverlapping sequence pattern allows sequence letters to be flexibly and maximally utilized for pattern discovery. A new Apriori-based nonoverlapping sequence pattern mining algorithm, NOSEP, is proposed. NOSEP is a complete pattern mining algorithm, which uses a specially designed data structure, Nettree, to calculate the exact occurrence of a pattern in the sequence. Experimental results and comparisons on biology DNA sequences, time series data, and Gazelle datasets demonstrate the efficiency of the proposed algorithm and the uniqueness of nonoverlapping sequence patterns compared to other methods.
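To make the nonoverlapping condition concrete, the following is a minimal Python sketch, not the NOSEP algorithm and not its Nettree structure. It assumes the usual reading of the terms in the abstract: a gap constraint [min_gap, max_gap] bounds the number of sequence letters skipped between consecutive pattern letters, and two occurrences are nonoverlapping when no sequence position is matched to the same pattern position in both. The function name `find_nonoverlapping` and its parameters are hypothetical, chosen only for illustration. The search is a simple leftmost-first backtracking pass, so the result is a valid set of nonoverlapping occurrences and should be read as a lower bound on the support rather than the exact count the paper computes.

```python
def find_nonoverlapping(seq, pattern, min_gap, max_gap):
    """Collect nonoverlapping occurrences of `pattern` in `seq` (illustrative sketch).

    Assumptions (not taken from the abstract):
    - between consecutive matched letters, min_gap..max_gap letters are skipped;
    - a sequence position may be reused across occurrences only at a
      different pattern position.
    Returns a list of occurrences, each a list of 0-based positions.
    """
    m = len(pattern)
    if m == 0:
        return []
    # used[j] holds sequence positions already matched to pattern[j].
    used = [set() for _ in range(m)]

    def extend(j, prev):
        # Try to match pattern[j:], given pattern[j-1] was matched at `prev`.
        lo = prev + min_gap + 1
        hi = min(prev + max_gap + 1, len(seq) - 1)
        for pos in range(lo, hi + 1):
            if seq[pos] == pattern[j] and pos not in used[j]:
                if j == m - 1:
                    return [pos]
                rest = extend(j + 1, pos)
                if rest is not None:
                    return [pos] + rest
        return None  # no admissible continuation from `prev`

    occurrences = []
    for start in range(len(seq)):
        if seq[start] != pattern[0]:
            continue
        rest = [] if m == 1 else extend(1, start)
        if rest is None:
            continue
        occ = [start] + rest
        for j, pos in enumerate(occ):
            used[j].add(pos)  # reserve each position for its pattern index
        occurrences.append(occ)
    return occurrences


# Position 1 is shared by both occurrences, but at different pattern positions.
print(find_nonoverlapping("AAA", "AA", 0, 0))   # [[0, 1], [1, 2]]
print(find_nonoverlapping("AATT", "AT", 0, 1))  # [[0, 2], [1, 3]]
```

The first call illustrates how letters can be reused flexibly: the middle A ends one occurrence and begins another. The recursive search has no memoization, so it is suited to short illustrative inputs only.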