Sequential pattern mining refers to the identification of frequent subsequences in sequence databases as patterns. It provides an effective way to analyze the sequential data. The selection of interesting sequences is generally based on the frequency/support framework: sequences of high frequency are treated as significant. In the last two decades, researchers have proposed many techniques and algorithms for extracting the frequent sequential patterns, in which the downward closure property (also known as Apriori property) plays a fundamental role. At the same time, the relative importance of each item has been introduced in frequent pattern mining, and “high utility itemset mining” has been proposed. Instead of selecting high frequency patterns, the utility-based methods extract itemsets with high utilities, and many algorithms and strategies have been proposed. These methods can only process the itemsets in the utility framework.
However, all the above methods suffer from the following common issues and problems to varying extents: 1) Sometimes, most of frequent patterns may not be informative to business decision-making, since they do not show the business value and impact. 2) Even if there is an algorithm that considers the business impact (namely utility), it can only obtain high utility sequences based on a given minimum utility threshold, thus it is very difficult for users to specify an appropriate minimum utility and to directly obtain the most valuable patterns. 3) The algorithm in the utility framework may generate a large number of patterns, many of which maybe redundant.
Although high utility sequential pattern mining is essential, discovering the patterns is challenging for the following reasons: 1) The downward closure property does not hold in utility-based sequence mining. This means that most of the existing algorithms cannot be directly transferred, e.g. from frequent sequential pattern mining to high utility sequential pattern mining. Furthermore, compared to high utility itemset mining, utility-based sequence analysis faces the critical combinational explosion and computational complexity caused by sequencing between sequential elements (itemsets). 2) Since the minimum utility is not given in advance, the algorithm essentially starts searching from 0 minimum support. This not only incurs very high computational costs, but also the challenge of how to raise the minimum threshold without missing any top-k high utility sequences. 3) Due to the fundamental difference, incorporating the traditional closure concept into high utility sequential pattern mining makes the outcome patterns irreversibly lossy and no longer recoverable, which will be reasoned in the following chapters. Therefore, it is exceedingly challenging to address the above issues by designing a novel representation for high utility sequential patterns.
To address these research limitations and challenges, this thesis proposes a high utility sequential pattern mining framework, and proposes both a threshold-based and top-k-based mining algorithm. Furthermore, a compact and lossless representation of utility-based sequence is presented, and an efficient algorithm is provided to mine such kind of patterns.
Chapter 2 thoroughly reviews the related works in the frequent sequential pattern mining and high utility itemset/sequence mining.
Chapter 3 incorporates utility into sequential pattern mining, and a generic framework for high utility sequence mining is defined. Two efficient algorithms, namely USpan and USpan+, are presented to mine for high utility sequential patterns. In USpan and USpan+, we introduce the lexicographic quantitative sequence tree to extract the complete set of high utility sequences and design concatenation mechanisms for calculating the utility of a node and its children with three effective pruning strategies.
Chapter 4 proposes a novel framework called top-k high utility sequential pattern mining to tackle this critical problem. Accordingly, an efficient algorithm, Top-k high Utility Sequence (TUS for short) mining, is designed to identify top-k high utility sequential patterns without minimum utility. In addition, three effective features are introduced to handle the efficiency problem, including two strategies for raising the threshold and one pruning for filtering unpromising items.
Chapter 5 proposes a novel concise framework to discover US-closed (Utility Sequence closed) high utility sequential patterns, with theoretical proof that it expresses the lossless representation of high-utility patterns. An efficient algorithm named CloUSpan is introduced to extract the US-closed patterns. Two effective strategies are used to enhance the performance of CloUSpan.
All of the algorithms are examined in both synthetic and real datasets. The performances, including the running time and memory consumption, are compared. Furthermore, the utility-based sequential patterns are compared with the patterns in the frequency/support framework. The results show that high utility sequential patterns provide insightful knowledge for users.