
Big Data Frequent Pattern Mining

David C. Anastasiu and Jeremy Iverson and Shaden Smith and George Karypis

Department of Computer Science and Engineering

University of Minnesota, Twin Cities, MN 55455, U.S.A.

{dragos, jiverson, shaden, karypis}@cs.umn.edu

Abstract

Frequent pattern mining is an essential data mining task, with the goal of discovering knowledge in the form of repeated patterns. Many efficient pattern mining algorithms have been discovered in the last two decades, yet most do not scale to the type of data we are presented with today, the so-called “Big Data”. Scalable parallel algorithms hold the key to solving the problem in this context. In this chapter, we review recent advances in parallel frequent pattern mining, analyzing them through the Big Data lens. We identify three areas as challenges to designing parallel frequent pattern mining algorithms: memory scalability, work partitioning, and load balancing. With these challenges as a frame of reference, we extract and describe key algorithmic design patterns from the wealth of research conducted in this domain.

Introduction

As an essential data mining task, frequent pattern mining has applications ranging from intrusion detection and market basket analysis, to credit card fraud prevention and drug discovery. Many efficient pattern mining algorithms have been discovered in the last two decades, yet most do not scale to the type of data we are presented with today, the so-called “Big Data”. Web log data from social media sites such as Twitter produce over one hundred terabytes of raw data daily [32]. Giants such as Walmart register billions of yearly transactions [1]. Today’s high-throughput gene sequencing platforms are capable of generating terabytes of data in a single experiment [16]. Tools are needed that can effectively mine frequent patterns from these massive datasets in a timely manner.

Some of today’s frequent pattern mining source data may not fit on a single machine’s hard drive, let alone in its volatile memory. The exponential nature of the solution search space compounds this problem. Scalable parallel algorithms hold the key to addressing pattern mining in the context of Big Data. In this chapter, we review recent advances in solving the frequent pattern mining problem in parallel. We start by presenting an overview of the frequent pattern mining problem and its specializations in Section 1. In Section 2, we examine advantages of and challenges encountered when parallelizing algorithms, given today’s distributed and shared memory systems, centering our discussion in the frequent pattern mining context. We survey existing serial and parallel pattern mining methods in Sections 3–5. Finally, Section 6 draws some conclusions about the state-of-the-art and further opportunities in the field.

1 Frequent Pattern Mining: Overview

Since the well-known itemset model was introduced by Agrawal and Srikant [4] in 1994, numerous papers have been published proposing efficient solutions to the problem of discovering frequent patterns in databases. Most follow two well-known paradigms, which we briefly describe in this section, after first introducing notation and concepts used throughout the paper.

1.1 Preliminaries

Let I = {i1, i2, ..., in} be a set of items. An itemset C is a subset of I. We denote by |C| its length or size, i.e., the number of items in C. Given a list of transactions 𝒯, where each transaction T ∈ 𝒯 is an itemset, |𝒯| denotes the total number of transactions. Transactions are generally identified by a transaction id (tid). The support of C is the proportion of transactions in 𝒯 that contain C, i.e., φ(C) = |{T | T ∈ 𝒯, C ⊆ T}|/|𝒯|. The support count, or frequency, of C is the number of transactions in 𝒯 that contain C. An itemset is said to be a frequent itemset if it has a support greater than some user-defined minimum support threshold, σ.
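As a concrete illustration of these definitions, the sketch below computes support and support count for an itemset; the toy transaction list is hypothetical, not the chapter’s Table 1.

```python
transactions = [
    {"a", "b", "c"},
    {"a", "c", "d"},
    {"b", "c"},
    {"a", "c"},
]

def support_count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Proportion of containing transactions (phi(C) in the text)."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"a", "c"}, transactions))  # 3
print(support({"a", "c"}, transactions))        # 0.75
```

With a threshold of, say, σ = 0.5, the itemset {a, c} would be frequent in this toy database.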

The itemset model was extended to handle sequences by Srikant and Agrawal [54]. A sequence is defined as an ordered list of itemsets, s = C1, C2, ..., Cl, where Cj ⊆ I, 1 ≤ j ≤ l. A sequence database D is a list of |D| sequences, in which each sequence may be associated with a customer id and elements in the sequence may have an assigned timestamp. Without loss of generality, we assume a lexicographic order of items i ∈ C, C ⊆ I. We assume sequence elements are ordered in non-decreasing order based on their timestamps. A sequence s′ = C′1, C′2, ..., C′m, m ≤ l, is a sub-sequence of s if there exist integers i1, i2, ..., im s.t. 1 ≤ i1 < i2 < ... < im ≤ l and C′j ⊆ Cij, j = 1, 2, ..., m.

In words, itemsets in s′ are subsets of those in s and follow the same list order as in s. If s′ is a sub-sequence of s, we write s′ ⊆ s and say that s contains s′ and s is a super-sequence of s′. Similar to the itemset support, the support of s′ is defined as the proportion of sequences in D that contain it, i.e., φ(s′) = |{s | s ∈ D, s′ ⊆ s}|/|D|. A sequence is said to be a frequent sequence if it has a support greater than σ.
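Sub-sequence containment as defined above can be checked with a simple greedy scan; the example sequence here is made up for this sketch.

```python
def is_subsequence(sub, seq):
    """Greedy check that `sub` (a list of itemsets) is a sub-sequence of
    `seq`: each element of `sub` must be a subset of a distinct element
    of `seq`, preserving relative order."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

s = [{"a"}, {"b", "c"}, {"d"}]
print(is_subsequence([{"a"}, {"c"}], s))  # True: {a} <= {a}, {c} <= {b, c}
print(is_subsequence([{"c"}, {"a"}], s))  # False: order is violated
```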

A similar model extension has been proposed for mining structures, or graphs/networks. We are given a set of graphs 𝒢 of size |𝒢|. Graphs in 𝒢 typically have labelled edges and vertices, though this is not required. V(G) and E(G) represent the vertex and edge sets of a graph G, respectively. The graph G = (V(G), E(G)) is said to be a subgraph of another graph H = (V(H), E(H)) if there is a bijection from E(G) to a subset of E(H). The relation is noted as G ⊆ H. The support of G is the proportion of graphs in 𝒢 that have G as a subgraph, i.e., φ(G) = |{H | H ∈ 𝒢, G ⊆ H}|/|𝒢|. A graph is said to be a frequent graph if it has a support greater than σ.

The problem of frequent pattern mining (FPM) is formally defined as follows. Its specialization for frequent itemset mining (FIM), frequent sequence mining (FSM), and frequent graph mining (FGM) is straightforward.

**Definition 1** Given a pattern container 𝒫 and a user-specified parameter σ (0 ≤ σ ≤ 1), find all sub-patterns, each of which is supported by at least σ|𝒫| patterns in 𝒫.

At times, we may wish to restrict the search to only maximal or closed patterns. A maximal pattern m is not a sub-pattern of any other frequent pattern in the database, whereas a closed pattern c has no proper super-pattern in the database with the same support.
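The maximal and closed restrictions can be illustrated with a short sketch that filters a table of frequent itemsets and their support counts; the table below is hypothetical.

```python
# Hypothetical frequent itemsets mapped to their support counts.
frequent = {
    frozenset("a"): 3, frozenset("b"): 2, frozenset("c"): 4,
    frozenset("ab"): 2, frozenset("ac"): 3, frozenset("bc"): 2,
    frozenset("abc"): 2,
}

def maximal(frequent):
    """Frequent patterns with no frequent proper super-pattern."""
    return {p for p in frequent if not any(p < q for q in frequent)}

def closed(frequent):
    """Frequent patterns with no proper super-pattern of equal support."""
    return {p for p in frequent
            if not any(p < q and frequent[q] == frequent[p] for q in frequent)}

print({"".join(sorted(p)) for p in maximal(frequent)})       # {'abc'}
print(sorted("".join(sorted(p)) for p in closed(frequent)))  # ['abc', 'ac', 'c']
```

Note that every maximal pattern is also closed, but not vice versa: {ac} is closed (its only superset {abc} has lower support) yet not maximal.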

A number of variations of the frequent sequence and frequent graph problems have been proposed. In some domains, the elements in a sequence are symbols from an alphabet A, e.g., A = {A, C, G, T} and s = TGGTGAGT.

We call these sequences symbol sequences. The symbol sequence model is equivalent to the general itemset sequence model where |C| = 1 for all C ∈ s, s ∈ D. Another interesting problem, sequence motif mining, seeks to find frequent sub-sequences within one (or a few) very long sequences. In this case, the support threshold is given as a support count, the minimum number of occurrences of the sub-sequence, rather than a value 0 ≤ σ ≤ 1, and additional constraints may be specified, such as minimum/maximum sub-sequence length. A similar problem is defined for graphs, unfortunately also called frequent graph mining in the literature, where the support of G is the number of edge-disjoint subgraphs in a large graph 𝒢 that are isomorphic to G. Two subgraphs are edge-disjoint if they do not share any edges. We call each appearance of G in 𝒢 an embedding. Two graphs G and H are isomorphic if there exists a bijection between their vertex sets, f : V(G) → V(H), s.t. any two vertices u, v ∈ V(G) are adjacent in G if and only if f(u) and f(v) are adjacent in H.

1.2 Basic Mining Methodologies

Many sophisticated frequent itemset mining methods have been developed over the years. Two core methodologies emerge from these methods for reducing computational cost. The first aims to prune the candidate frequent itemset search space, while the second focuses on reducing the number of comparisons required to determine itemset support.

While we center our discussion on frequent itemsets, the methodologies noted in this section have also been used in designing FSM and FGM algorithms, which we describe in Sections 4 and 5, respectively.

1.2.1 Candidate Generation

A brute-force approach to determine frequent itemsets in a set of transactions is to compute the support for every possible candidate itemset. Given the set of items I and a partial order with respect to the subset operator, one can denote all possible candidate itemsets by an itemset lattice, in which nodes represent itemsets and edges correspond to the subset relation. Figure 1 shows the itemset lattice containing candidate itemsets for the example transactions denoted in Table 1. The brute-force approach would compare each candidate itemset with every transaction T ∈ 𝒯 to check for containment. An approach like this would require O(|𝒯| · L · |I|) item comparisons, where the number of non-empty itemsets in the lattice is L = 2^|I| − 1. This type of computation becomes prohibitively expensive for all but the smallest sets of items and transaction sets.
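A minimal sketch of this brute-force enumeration, over a hypothetical four-item database, makes the exponential candidate count concrete:

```python
from itertools import combinations

items = ["a", "b", "c", "d"]
transactions = [{"a", "b", "c"}, {"a", "c", "d"}, {"b", "c"}, {"a", "c"}]

def brute_force_frequent(items, transactions, minsup):
    """Test all 2^|I| - 1 non-empty candidates against every transaction."""
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            c = frozenset(cand)
            sup = sum(1 for t in transactions if c <= t) / len(transactions)
            if sup >= minsup:
                frequent[c] = sup
    return frequent

print(sorted("".join(sorted(c))
             for c in brute_force_frequent(items, transactions, 0.5)))
# ['a', 'ac', 'b', 'bc', 'c']
```

Even here, 15 candidates are tested against 4 transactions; with |I| = 40 the lattice already holds over a trillion candidates.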

One way to reduce computational complexity is to reduce the number of candidate itemsets tested for support. To do this, algorithms rely on the observation that every candidate itemset of size k is the union of two candidate itemsets of size (k − 1), and on the converse of the following lemma.

**Lemma 1.1 (Downward Closure)** The subsets of a frequent itemset must be frequent.

Conversely, the supersets of an infrequent itemset must be infrequent. Thus, given a sufficiently high minimum support, there are large portions of the itemset lattice that do not need to be explored. None of the white nodes in Figure 1 need to be tested, as they do not have at least two frequent parent nodes. This technique is often referred to as support-based pruning and was first introduced in the Apriori algorithm by Agrawal and Srikant [4].

Algorithm 1 shows the pseudo-code for Apriori-based frequent itemset discovery. Starting with each item as an itemset, the support for each itemset is computed, and itemsets that do not meet the minimum support threshold σ are removed. This results in the set F1 = {i|i ∈ I, φ({i}) ≥ σ} (line 2). From F1, all candidate itemsets of size two can be generated by joining frequent itemsets of size one, F2 = {C|C ∈ F1 × F1, |C| = 2, φ(C) ≥ σ}. In order to avoid re-evaluating itemsets of size one, only those sets in the Cartesian product which have size two are checked. This process can be generalized for all Fk, 2 ≤ k ≤ |I| (line 5). When Fk = ∅, all frequent itemsets have been discovered and can be expressed as the union of all frequent itemsets of size no more than k, F1 ∪ F2 ∪ · · · ∪ Fk (line 7).
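The loop described above can be sketched as follows; this is a simplified rendering of the Apriori idea with naive support counting and toy data, not the authors' optimized implementation.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Simplified Apriori: level-wise candidate generation with
    downward-closure pruning and naive support counting."""
    n = len(transactions)
    supp = lambda c: sum(1 for t in transactions if c <= t) / n
    items = {i for t in transactions for i in t}
    # F1: frequent 1-itemsets
    fk = {frozenset([i]) for i in items if supp(frozenset([i])) >= minsup}
    frequent = set(fk)
    k = 2
    while fk:
        # join step: unions of frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in fk for b in fk if len(a | b) == k}
        # prune step: all (k-1)-subsets must be frequent (Lemma 1.1),
        # then verify support against the threshold
        fk = {c for c in candidates
              if all(frozenset(s) in frequent for s in combinations(c, k - 1))
              and supp(c) >= minsup}
        frequent |= fk
        k += 1
    return frequent

transactions = [{"a", "b", "c"}, {"a", "c", "d"}, {"b", "c"}, {"a", "c"}]
print(sorted("".join(sorted(c)) for c in apriori(transactions, 0.5)))
# ['a', 'ac', 'b', 'bc', 'c']
```

Note how {a, b, c} is never support-counted: its subset {a, b} is infrequent, so the prune step discards it before the database is touched.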

In practice, the candidate generation and support computation step (line 5) can be made efficient with the use of a subset function and a hash tree. Instead of computing the Cartesian product, Fk−1 × Fk−1, we consider all subsets of size k within all transactions in 𝒯. A subset function takes as input a transaction and returns all its subsets of size k, which become candidate itemsets. A hash tree data structure can be used to efficiently keep track of the number of times each candidate itemset is encountered in the database, i.e., its support count. Details for the construction of the hash tree can be found in the work of Agrawal and Srikant [4].
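The subset-function idea can be sketched as follows; a plain hash table stands in here for the hash tree, which plays the same counting role (toy data again):

```python
from collections import Counter
from itertools import combinations

def count_k_subsets(transactions, k):
    """Enumerate every size-k subset of every transaction and count
    occurrences; a Counter stands in for the hash tree here."""
    counts = Counter()
    for t in transactions:
        for sub in combinations(sorted(t), k):
            counts[frozenset(sub)] += 1
    return counts

transactions = [{"a", "b", "c"}, {"a", "c", "d"}, {"b", "c"}, {"a", "c"}]
counts = count_k_subsets(transactions, 2)
print(counts[frozenset({"a", "c"})])  # 3
```

Unlike the Cartesian-product join, this enumerates only subsets that actually occur in the data, at the cost of generating C(|T|, k) subsets per transaction T.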

1.2.2 Pattern Growth

Apriori-based algorithms process candidates in a breadth-first search manner, decomposing the itemset lattice into level-wise, itemset-size-based equivalence classes: k-itemsets must be processed before (k + 1)-itemsets. Assuming a lexicographic ordering of itemset items, the search space can also be decomposed into prefix-based and suffix-based equivalence classes. Figures 2 and 3 show equivalence classes for 1-length itemset prefixes and 1-length itemset suffixes, respectively, for our test database. Once frequent 1-itemsets are discovered, their equivalence classes can be mined independently. Patterns are grown by appending (prepending) appropriate items that follow (precede) the parent’s last (first) item in lexicographic order.

Figure 2: Prefix tree showing prefix-based 1-length equivalence classes in the itemset lattice for I = {a, b, c, d}.

Figure 3: Suffix tree showing suffix-based 1-length equivalence classes in the itemset lattice for I = {a, b, c, d}.

Zaki [63] was the first to suggest prefix-based equivalence classes as a means of independent sub-lattice mining in his algorithm, Equivalence CLAss Transformation (ECLAT). In order to improve candidate support counting, Zaki transforms the transactions into a vertical database format. In essence, he creates an inverted index, storing, for each itemset, a list of tids it can be found in. Frequent 1-itemsets are then those with at least σ|𝒯| listed tids. He uses lattice theory to prove that if two itemsets C1 and C2 are frequent, so will their intersection set C1 ∩ C2 be. After creating the vertical database, each equivalence class can be processed independently, in either breadth-first or depth-first order, by recursive intersections of candidate itemset tid-lists, while still taking advantage of the downward closure property. For example, assuming {b} is infrequent, we can find all frequent itemsets having prefix a by intersecting the tid-lists of {a} and {c} to find support for {ac}, then the tid-lists of {ac} and {d} to find support for {acd}, and finally the tid-lists of {a} and {d} to find support for {ad}. Note that the {ab}-rooted subtree is not considered, as {b} is infrequent and will thus not be joined with {a}.
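ECLAT’s vertical representation and tid-list intersections can be sketched as follows (illustrative transactions, not the chapter’s example database):

```python
from collections import defaultdict

transactions = [{"a", "b", "c"}, {"a", "c", "d"}, {"b", "c"}, {"a", "c"}]

# build the vertical (inverted index) layout: item -> set of tids
tidlists = defaultdict(set)
for tid, t in enumerate(transactions):
    for item in t:
        tidlists[item].add(tid)

# support of {a, c} comes from intersecting the two tid-lists
ac_tids = tidlists["a"] & tidlists["c"]
print(len(ac_tids) / len(transactions))  # 0.75

# growing the prefix: tid-list of {ac} intersected with that of {d}
acd_tids = ac_tids & tidlists["d"]
print(len(acd_tids) / len(transactions))  # 0.25
```

Once the tid-list for a pattern is computed, its support is just the list length divided by |𝒯|, with no further database scans.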

A similar divide-and-conquer approach is employed by Han et al. [22] in FP-growth, which decomposes the search space based on length-1 suffixes. Additionally, they reduce database scans during the search by leveraging a compressed representation of the transaction database, via a data structure called an FP-tree. The FP-tree is a specialization of a prefix-tree, storing an item at each node, along with the support count of the itemset denoted by the path from the root to that node. Each database transaction is mapped onto a path in the tree. The FP-tree also keeps pointers between nodes containing the same item, which helps identify all itemsets ending in a given item.
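A bare-bones sketch of FP-tree construction may help; note that real FP-growth orders items by descending frequency and keeps the header-table links between same-item nodes, both simplified away here.

```python
# Hypothetical minimal FP-tree: nodes store an item, a count, and children.
class FPNode:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions):
    """Map each transaction onto a root-to-leaf path; shared prefixes
    share nodes, whose counts accumulate support."""
    root = FPNode(None)
    for t in transactions:
        node = root
        # a fixed item order is required; FP-growth normally sorts by
        # descending item frequency, we use lexicographic order for brevity
        for item in sorted(t):
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root

tree = build_fp_tree([{"a", "b", "c"}, {"a", "c", "d"}, {"b", "c"}, {"a", "c"}])
print(tree.children["a"].count)                # 3: three paths begin with 'a'
print(tree.children["a"].children["c"].count)  # 2: two paths share the a->c prefix
```

The compression is visible even on this toy input: four transactions of total size 11 collapse into a tree of seven item nodes.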