FREE ELECTRONIC LIBRARY - Thesis, dissertations, books

Pages:   || 2 | 3 | 4 | 5 |   ...   | 7 |

«Abstract Frequent pattern mining is an essential data mining task, with a goal of discovering knowledge in the form of repeated patterns. Many ...»

-- [ Page 1 ] --

Big Data Frequent Pattern Mining

David C. Anastasiu and Jeremy Iverson and Shaden Smith and George Karypis

Department of Computer Science and Engineering

University of Minnesota, Twin Cities, MN 55455, U.S.A.

{dragos, jiverson, shaden, karypis}@cs.umn.edu


Frequent pattern mining is an essential data mining task, with a goal of discovering knowledge in the form of

repeated patterns. Many efficient pattern mining algorithms have been discovered in the last two decades, yet most do not scale to the type of data we are presented with today, the so-called “Big Data”. Scalable parallel algorithms hold the key to solving the problem in this context. In this chapter, we review recent advances in parallel frequent pattern mining, analyzing them through the Big Data lens. We identify three areas as challenges to designing parallel frequent pattern mining algorithms: memory scalability, work partitioning, and load balancing. With these challenges as a frame of reference, we extract and describe key algorithmic design patterns from the wealth of research conducted in this domain.

Introduction As an essential data mining task, frequent pattern mining has applications ranging from intrusion detection and market basket analysis, to credit card fraud prevention and drug discovery. Many efficient pattern mining algorithms have been discovered in the last two decades, yet most do not scale to the type of data we are presented with today, the so-called “Big Data”. Web log data from social media sites such as Twitter produce over one hundred terabytes of raw data daily [32]. Giants such as Walmart register billions of yearly transactions [1]. Today’s high-throughput gene sequencing platforms are capable of generating terabytes of data in a single experiment [16]. Tools are needed that can effectively mine frequent patterns from these massive data in a timely manner.

Some of today’s frequent pattern mining source data may not fit on a single machine’s hard drive, let alone in its volatile memory. The exponential nature of the solution search space compounds this problem. Scalable parallel algorithms hold the key to addressing pattern mining in the context of Big Data. In this chapter, we review recent advances in solving the frequent pattern mining problem in parallel. We start by presenting an overview of the frequent pattern mining problem and its specializations in Section 1. In Section 2, we examine advantages of and challenges encountered when parallelizing algorithms, given today’s distributed and shared memory systems, centering our discussion in the frequent pattern mining context. We survey existing serial and parallel pattern mining methods in Sections 3 – 5. Finally, Section 6 draws some conclusions about the state-of-the-art and further opportunities in the field.

1 Frequent Pattern Mining: Overview Since the well-known itemset model was introduced by Agrawal and Srikant [4] in 1994, numerous papers have been published proposing efficient solutions to the problem of discovering frequent patterns in databases. Most follow two well known paradigms, which we briefly describe in this section, after first introducing notation and concepts used throughout the paper.

1.1 Preliminaries Let I = {i1, i2,..., in } be a set of items. An itemset C is a subset of I. We denote by |C| its length or size, i.e.

the number of items in C. Given a list of transactions T, where each transaction T ∈ T is an itemset, |T | denotes the total number of transactions. Transactions are generally identified by a transaction id (tid ). The support of C is the proportion of transactions in T that contain C, i.e., φ(C) = |{T |T ∈ T, C ⊆ T }|/|T |. The support count, or frequency of C is the number of transactions in T that contain C. An itemset is said to be a frequent itemset if it has a support greater than some user defined minimum support threshold, σ.

The itemset model was extended to handle sequences by Srikant and Agrawal [54]. A sequence is defined as an ordered list of itemsets, s = C1, C2,..., Cl, where Cj ⊆ I, 1 ≤ j ≤ l. A sequence database D is a list of |D| sequences, in which each sequence may be associated with a customer id and elements in the sequence may have an assigned timestamp. Without loss of generality, we assume a lexicographic order of items i ∈ C, C ⊆ I. We assume sequence elements are ordered in non-decreasing order based on their timestamps. A sequence s = C1, C2,..., Cm, m ≤ l is a sub-sequence of s if there exist integers i1, i2,..., im s.t. 1 ≤ i1 ≤ i2 ≤... ≤ im ≤ l and Cj ⊆ Cij, j = 1, 2,..., m.

In words, itemsets in s are subsets of those in s and follow the same list order as in s. If s is a sub-sequence of s, we write that s ⊆ s and say that s contains s and s is a super-sequence of s. Similar to the itemset support, the support of s is defined as the proportion of sequences in D that contain s, i.e., φ(s) = |{s |s ∈ D, s ⊆ s }|/|D|. A sequence is said to be a frequent sequence if it has a support greater than σ.

A similar model extension has been proposed for mining structures, or graphs/networks. We are given a set of graphs G of size |G|. Graphs in G typically have labelled edges and vertices, though this is not required. V (G) and E(G) represent the vertex and edge sets of a graph G, respectively. The graph G = (V (G), E(G)) is said to be a subgraph of another graph H = (V (H), E(H)) if there is a bijection from E(G) to a subset of E(H). The relation is noted as G ⊆ H. The support of G is the proportion of graphs in G that have G as a subgraph, i.e., φ(G) = |{H|H ∈ G, G ⊆ H}|/|G|. A graph is said to be a frequent graph if it has a support greater than σ.

The problem of frequent pattern mining (FPM) is formally defined as follows. Its specialization for the frequent itemset mining (FIM), frequent sequence mining (FSM), and frequent graph mining (FGM) is straight-forward.

Definition 1 Given a pattern container P and a user-specified parameter σ (0 ≤ σ ≤ 1), find all sub-patterns each of which is supported by at least σ|P| patterns in P.

At times, we may wish to restrict the search to only maximal or closed patterns. A maximal pattern m is not a sub-pattern of any other frequent pattern in the database, whereas a closed pattern c has no proper super-pattern in the database with the same support.

A number of variations of the frequent sequence and frequent graph problems have been proposed. In some domains, the elements in a sequence are symbols from an alphabet A, e.g., A = {A, C, G, T } and s = T GGT GAGT.

We call these sequences symbol sequences. The symbol sequence model is equivalent to the general itemset sequence model where |C| = 1 for all C ∈ s, s ∈ D. Another interesting problem, sequence motif mining, looks to find frequent sub-sequences within one (or a few) very long sequences. In this case, the support threshold is given as a support count, the minimum number of occurrences of the sub-sequence, rather than a value 0 ≤ σ ≤ 1, and additional constraints may be specified, such as minimum/maximum sub-sequence length. A similar problem is defined for graphs, unfortunately also called frequent graph mining in the literature, where the support of G is the number of edge-disjoint subgraphs in a large graph G that are isomorphic to G. Two subgraphs are edge-disjoint if they do not share any edges. We call each appearance of G in G an embedding. Two graphs G and H are isomorphic if there exists a bijection between their vertex sets, f : V (G) → V (H), s.t. any two vertices u, v ∈ V (G) are adjacent in G if and only if f (u) and f (v) are adjacent in H.

1.2 Basic Mining Methodologies Many sophisticated frequent itemset mining methods have been developed over the years. Two core methodologies emerge from these methods for reducing computational cost. The first aims to prune the candidate frequent itemset search space, while the second focuses on reducing the number of comparisons required to determine itemset support.

While we center our discussion on frequent itemsets, the methodologies noted in this section have also been used in designing FSM and FGM algorithms, which we describe in Sections 4 and 5, respectively.

1.2.1 Candidate Generation A brute-force approach to determine frequent itemsets in a set of transactions is to compute the support for every possible candidate itemset. Given the set of items I and a partial order with respect to the subset operator, one can denote all possible candidate itemsets by an itemset lattice, in which nodes represent itemsets and edges correspond to the subset relation. Figure 1 shows the itemset lattice containing candidate itemsets for example transactions denoted in Table 1. The brute-force approach would compare each candidate itemset with every transaction C ∈ T to check for containment. An approach like this would require O(|T | · L · |I|) item comparisons, where the number of non-empty itemsets in the lattice is L = 2|I| − 1. This type of computation becomes prohibitively expensive for all but the smallest sets of items and transaction sets.

One way to reduce computational complexity is to reduce the number of candidate itemsets tested for support.

To do this, algorithms rely on the observation that every candidate itemset of size k is the union of two candidate itemsets of size (k − 1), and on the converse of the following lemma.

–  –  –

Lemma 1.1 (Downward Closure) The subsets of a frequent itemset must be frequent.

Conversely, the supersets of an infrequent itemset must be infrequent. Thus, given a sufficiently high minimum support, there are large portions of the itemset lattice that do not need to be explored. None of the white nodes in Figure 1 must be tested, as they do not have at least two frequent parent nodes. This technique is often referred to as support-based pruning and was first introduced in the Apriori algorithm by Agrawal and Srikant [4].

–  –  –

Algorithm 1 shows the pseudo-code for Apriori-based frequent itemset discovery. Starting with each item as an itemset, the support for each itemset is computed, and itemsets that do not meet the minimum support threshold σ are removed. This results in the set F1 = {i|i ∈ I, φ({i}) ≥ σ} (line 2). From F1, all candidate itemsets of size two can be generated by joining frequent itemsets of size one, F2 = {C|C ∈ F1 × F1, |C| = 2, φ(C) ≥ σ}. In order to avoid re-evaluating itemsets of size one, only those sets in the Cartesian product which have size two are checked. This process can be generalized for all Fk, 2 ≤ k ≤ |I| (line 5). When Fk = ∅, all frequent itemsets have been discovered and can be expressed as the union of all frequent itemsets of size no more than k, F1 ∪ F2 ∪ · · · ∪ Fk (line 7).

In practice, the candidate generation and support computation step (line 5) can be made efficient with the use of a subset function and a hash tree. Instead of computing the Cartesian product, Fk−1 × Fk−1, we consider all subsets of size k within all transactions in T. A subset function takes as input a transaction and returns all its subsets of size k, which become candidate itemsets. A hash tree data structure can be used to efficiently keep track of the number of times each candidate itemset is encountered in the database, i.e. its support count. Details for the construction of the hash tree can be found in the work of Agrawal and Srikant [4].

1.2.2 Pattern Growth Apriori-based algorithms process candidates in a breath-first search manner, decomposing the itemset lattice into level-wise itemset-size based equivalence classes: k-itemsets must be processed before (k + 1)-itemsets. Assuming a lexicographic ordering of itemset items, the search space can also be decomposed into prefix-based and suffix-based equivalence classes. Figures 2 and 3 show equivalence classes for 1-length itemset prefixes and 1-length itemset suffixes, respectively, for our test database. Once frequent 1-itemsets are discovered, their equivalence classes can be mined independently. Patterns are grown by appending (prepending) appropriate items that follow (precede) the parent’s last (first) item in lexicographic order.

Zaki [63] was the first to suggest prefix-based equivalence classes as a means of independent sub-lattice mining in his algorithm, Equivalence CLAss Transformation (ECLAT). In order to improve candidate support counting, Zaki transforms the transactions into a vertical database format. In essence, he creates an inverted index, storing,

–  –  –

Figure 2: Prefix tree showing prefix-based 1-length Figure 3: Suffix tree showing suffix-based 1-length equivequivalence classes in the itemset lattice for I = alence classes in the itemset lattice for I = {a, b, c, d}.

{a, b, c, d}.

for each itemset, a list of tids it can be found in. Frequent 1-itemsets are then those with at least σ|T | listed tids. He uses lattice theory to prove that if two itemsets C1 and C2 are frequent, so will their intersection set C1 ∩ C2 be. After creating the vertical database, each equivalence class can be processed independently, in either breath-first or depth-first order, by recursive intersections of candidate itemset tid-lists, while still taking advantage of the downward closure property. For example, assuming {b} is infrequent, we can find all frequent itemsets having prefix a by intersecting tid-lists of {a} and {c} to find support for {ac}, then tid-lists of {ac} and {d} to find support for {acd}, and finally tid-lists of {a} and {d} to find support for {ad}. Note that the {ab}-rooted subtree is not considered, as {b} is infrequent and will thus not be joined with {a}.

A similar divide-and-conquer approach is employed by Han et al. [22] in FP-growth, which decomposes the search space based on length-1 suffixes. Additionally, they reduce database scans during the search by leveraging a compressed representation of the transaction database, via a data structure called an FP-tree. The FP-tree is a specialization of a prefix-tree, storing an item at each node, along with the support count of the itemset denoted by the path from the root to that node. Each database transaction is mapped onto a path in the tree. The FP-tree also keeps pointers between nodes containing the same item, which helps identify all itemsets ending in a given item.

Pages:   || 2 | 3 | 4 | 5 |   ...   | 7 |

Similar works:

«COVER L ETTERS FOR ACADEMIC JOB APPLICATIONS www.uh.edu/ucs 713-743-5100 ucs@uh.edu Location: Student Service Center 1 Room 106 (First Floor) #524 on the UH campus map P: (713) 743-5100 W: www.uh.edu/ucs E: ucs@uh.edu COVER LETTERS FOR ACADEMIC JOB APPLICATIONS OVERVIEW A cover letter must accompany any application you submit for an academic job. The purpose of a cover letter, also sometimes called a letter of application or job letter, is to introduce yourself and to demonstrate the fit...»

«Guidelines for Life Cycle Assessment: Electronic Components October, 2008 Electronic Components Engineering Committee Electronic Components Board Japan Electronics and Information Technology Industries Association Warning: These guidelines have been voluntarily created strictly as reference material by the Subcommittee on Electronic Components Technology Environment, Electronic Components Engineering Committee, Electronic Components Board of the Japan Electronic and Information Technology...»

«Social Development Vol •• No. •• ••–•• •• 2013 doi: 10.1111/sode.12026 Inter-parent Aggression as a Precursor to Disengagement Coping in Emerging Adulthood: The Buffering Role of Friendship Competence Casey L. Brown, Barbara A. Oudekerk, David E. Szwedo and Joseph P. Allen, University of Virginia Abstract Using multi-informant data drawn from a prospective study involving 184 youth, mother-perpetrated and father-perpetrated partner aggression during early adolescence (the...»

«Pakendi infoleht: teave kasutajale *käsimüügiravim Dolmen, 25 mg suukaudse lahuse graanulid Deksketoprofeentrometamool Enne ravimi võtmist lugege hoolikalt infolehte, sest siin on teile vajalikku teavet. Võtke seda ravimit alati täpselt nii, nagu on kirjeldatud selles infolehes või nagu arst või apteeker on teile selgitanud.Hoidke infoleht alles, et seda vajadusel uuesti lugeda.Lisateabe saamiseks pidage nõu oma apteekriga. Kui teil tekib ükskõik milline kõrvaltoime, pidage nõu oma...»

«DATASTREAM CHARTING GETTING STARTED GUIDE This short guide outlines the main features of Datastream Charting 2.1. For more detailed information on using the product, please click the online help button at the top right of each page.An indexed PDF of the full set of help topics can be viewed on the Datastream Extranet: http://extranet.datastream.com/User%20Support/PubDoc/DatastreamCharting.htm The above link also contains demos and sample charts. This document covers: Adding series to charts...»

«Exploring Calculus And Differential Equations With The TI 89 And TI 92 Plus These opportunity and such forwarding has already taking bank administrative. It are only be to make upon segment and want your free account computer and prices. Exploring calculus and differential equations with the TI-89 and TI-92 Plus Twice why that protection makes surely learned you answers more or more run-down to add final if your means are successfully diminished, and to see you also the factory you do as our...»

«SOFTWARE—PRACTICE AND EXPERIENCE Softw. Pract. Exper. 2001; 31:103–128 Developing multi-agent systems with a FIPA-compliant agent framework Fabio Bellifemine1, Agostino Poggi2,∗,† and Giovanni Rimassa2 1 CSELT S.p.A., Via G. Reiss Romoli 274, Torino I-10148, Italy 2 Dipartimento di Ingegneria dell’Informazione, University of Parma, Parco Area delle Scienze 181A, Parma I-43100, Italy SUMMARY To ease large-scale realization of agent applications there is an urgent need for frameworks,...»

«Background Paper on Freedom of Expression and Internet Regulation for the International Seminar on Promoting Freedom of Expression With the Three Specialised International Mandates London, United Kingdom 19-20 November 2001 Introduction The Internet has fast become a key instrument for the exercise of the right to freedom of expression. It combines within one medium both the right to receive as well as the right to express and disseminate information, ideas and opinions, be it in the form of...»

«amsterdam marathon 2016 amsterdam marathon 2016 Amsterdam Preisvergleich | Amsterdam.TripAdvisor.de Amsterdam Hotels super günstig Alle Anbieter direkt vergleichen! AMSTERDAM MARATHON 2016 World AMSTERDAM MARATHON Boston Marathon: Which Countries Have the Fastest Recreational Runners The third D.R. marathon studied was Boston, whose subject Marathons in Amsterdam (Netherlands) Marathons in Amsterdam (Netherlands) 2016 list of marathons in city Amsterdam within year 2016. Amsterdam Marathon...»

«ISSN 1327-2101 AUSTRALASIAN NEMATOLOGY NEWSLETTER Published by: Australasian Association of Nematologists VOLUME 24 NO. 1 JANUARY 2013 From the Editor Thank you to those of you who made contributions to this newsletter. July Issue The deadline for the July issue will be mid June 2013. Kerrie Davies will notify you a month in advance so please have your material ready then. Katherine Linsell on behalf of Kerrie Davies Contacts Mike Hodda President, Australasian Association of Nematologists CSIRO...»

«A NEW APPROACH TO CRUSHING 3-MANIFOLD TRIANGULATIONS BENJAMIN A. BURTON Author’s self-archived version Available from http://www.maths.uq.edu.au/~bab/papers/ Abstract. The crushing operation of Jaco and Rubinstein is a powerful technique in algorithmic 3-manifold topology: it enabled the first practical implementations of 3-sphere recognition and prime decomposition of orientable manifolds, and it plays a prominent role in state-of-the-art algorithms for unknot recognition and testing for...»

«What If They Don’t Speak English? For Primary & Secondary Teachers This book is to serve as a Resource Guide for the educator who has been assigned students who speak a language other than English in their homes and have a limited proficiency in English Compiled from various English as a Second Language Resources by the MISD Bilingual/ESL Department Suchiraphon McKeithen-Polish, Bilingual Education Consultant Help! What do I do now? Que Pasa? Information in this booklet is for classroom...»

<<  HOME   |    CONTACTS
2016 www.dis.xlibx.info - Thesis, dissertations, books

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.