FREE ELECTRONIC LIBRARY - Thesis, dissertations, books

Pages:   || 2 |


-- [ Page 1 ] --

International Journal of Computer Science and Information Technology, Volume 2, Number 2, April 2010





Dr (Mrs).Sujni Paul

Associate Professor

Department of Computer Applications,

Karunya University, Coimbatore 641114, Tamil Nadu, India sujni_paul@yahoo.com


Many current data mining tasks can be accomplished successfully only in a distributed setting. The field of distributed data mining has therefore gained increasing importance in the last decade. The Apriori algorithm by Rakesh Agarwal has emerged as one of the best Association Rule mining algorithms. Ii also serves as the base algorithm for most parallel algorithms. The enormity and high dimensionality of datasets typically available as input to problem of association rule discovery, makes it an ideal problem for solving on multiple processors in parallel. The primary reasons are the memory and CPU speed limitations faced by single processors. In this paper an Optimized Distributed Association Rule mining algorithm for geographically distributed data is used in parallel and distributed environment so that it reduces communication costs. The response time is calculated in this environment using XML data.


Association rules, Apriori algorithm, parallel and distributed data mining, XML data, response time.

1. INTRODUCTION Association rule mining (ARM) has become one of the core data mining tasks and has attracted tremendous interest among data mining researchers. ARM is an undirected or unsupervised data mining technique which works on variable length data, and produces clear and understandable results. There are two dominant approaches for utilizing multiple processors that have emerged distributed memory in which each processor has a private memory; and shared memory in which all processors access common memory [4]. Shared memory architecture has many desirable properties. Each processor has direct and equal access to all memory in the system. Parallel programs are easy to implement on such a system. In distributed memory architecture each processor has its own local memory that can only be accessed directly by that processor [10]. For a processor to have access to data in the local memory of another processor a copy of the desired data element must be sent from one processor to the other through message passing. XML data are used with the Optimized Distributed Association Rule Mining Algorithm.

A parallel application could be divided into number of tasks and executed concurrently on different processors in the system [9]. However the performance of a parallel application on a distributed system is mainly dependent on the allocation of the tasks comprising the application onto the available processors in the system.

–  –  –

Modern organizations are geographically distributed. Typically, each site locally stores its ever increasing amount of day-to-day data. Using centralized data mining to discover useful patterns in such organizations' data isn't always feasible because merging data sets from different sites into a centralized site incurs huge network communication costs. Data from these organizations are not only distributed over various locations but also vertically fragmented, making it difficult if not impossible to combine them in a central location. Distributed data mining has thus emerged as an active subarea of data mining research. In this paper an Optimized Association Rule Mining Algorithm is used for performing the mining process.

2. RELATED WORK Three parallel algorithms for mining association rules [2], an important data mining problem is formulated in this paper. These algorithms have been designed to investigate and understand the performance implications of a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem-specific information in parallel data mining [11]. Fast Distributed Mining of association rules, which generates a small number of candidate sets and substantially reduces the number of messages to be passed at mining association rules [3].

Algorithms for mining association rules from relational data have been well developed. Several query languages have been proposed, to assist association rule mining such as [18], 19]. The topic of mining XML data has received little attention, as the data mining community has focused on the development of techniques for extracting common structure from heterogeneous XML data. For instance, [20] has proposed an algorithm to construct a frequent tree by finding common subtrees embedded in the heterogeneous XML data. On the other hand, some researchers focus on developing a standard model to represent the knowledge extracted from the data using XML. JAM [21] has been developed to gather information from sparse data sources and induce a global classification model. The PADMA system [22] is a document analysis tool working on a distributed environment, based on cooperative agents. It works without any relational database underneath. Instead, there are PADMA agents that perform several relational operations with the information extracted from the documents.


An association rule is a rule which implies certain association relationships among a set of objects (such as ``occur together'' or ``one implies the other'') in a database. Given a set of transactions, where each transaction is a set of literals (called items), an association rule is an expression of the form X Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y [1].

3.1 Apriori Algorithm An association rule mining algorithm, Apriori has been developed for rule mining in large transaction databases by IBM's Quest project team [3]. An itemset is a non-empty set of items.

They have decomposed the problem of mining association rules into two parts Find all combinations of items that have transaction support above minimum support. Call those • combinations frequent itemsets.

Use the frequent itemsets to generate the desired rules. The general idea is that if, say, ABCD and • AB are frequent itemsets, then we can determine if the rule AB CD holds by computing the ratio r = support(ABCD)/support(AB). The rule holds only if r = minimum confidence. Note that the International Journal of Computer Science and Information Technology, Volume 2, Number 2, April 2010 rule will have minimum support because ABCD is frequent. The algorithm is highly scalable [7].

The Apriori algorithm used in Quest for finding all frequent itemsets is given below.

procedure AprioriAlg() begin

–  –  –

It makes multiple passes over the database. In the first pass, the algorithm simply counts item occurrences to determine the frequent 1-itemsets (itemsets with 1 item). A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets Lk-1 (the set of all frequent (k-1)-itemsets) found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the apriori-gen() function. This function first joins Lk-1 with Lk-1, the joining condition being that the lexicographically ordered first k-2 items are the same. Next, it deletes all those itemsets from the join result that have some (k-1)-subset that is not in Lk-1 yielding Ck.

The algorithm now scans the database. For each transaction, it determines which of the candidates in Ck are contained in the transaction using a hash-tree data structure and increments the count of those candidates [8], [14]. At the end of the pass, Ck is examined to determine which of the candidates are frequent, yielding Lk. The algorithm terminates when Lk becomes empty.

3.2 Distributed/parallel algorithms Databases or data warehouses may store a huge amount of data to be mined. Mining association rules in such databases may require substantial processing power [6]. A possible solution to this problem can be a distributed system.[5]. Moreover, many large databases are distributed in nature which may make it more feasible to use distributed algorithms.

Major cost of mining association rules is the computation of the set of large itemsets in the database.

Distributed computing of large itemsets encounters some new problems. One may compute locally large itemsets easily, but a locally large itemset may not be globally large. Since it is very expensive to broadcast the whole data set to other sites, one option is to broadcast all the counts of all the itemsets, no matter locally large or small, to other sites. However, a database may contain enormous combinations of itemsets, and it will involve passing a huge number of messages.

A distributed data mining algorithm FDM (Fast Distributed Mining of association rules) has been proposed by [5], which has the following distinct features.

1. The generation of candidate sets is in the same spirit of Apriori. However, some relationships between locally large sets and globally large ones are explored to generate a smaller set of candidate sets at each iteration and thus reduce the number of messages to be passed.

International Journal of Computer Science and Information Technology, Volume 2, Number 2, April 2010

2. After the candidate sets have been generated, two pruning techniques, local pruning and global pruning, are developed to prune away some candidate sets at each individual sites.

3. In order to determine whether a candidate set is large, this algorithm requires only O(n) messages for support count exchange, where n is the number of sites in the network. This is much less than a straight adaptation of Apriori, which requires O(n2 ) messages. [3] Distributed data mining refers to the mining of distributed data sets. The data sets are stored in local databases hosted by local computers which are connected through a computer network [15], [16].

Data mining takes place at a local level and at a global level where local data mining results are combined to gain global findings. Distributed data mining is often mentioned with parallel data mining in literature.

While both attempt to improve the performance of traditional data mining systems they assume different system architectures and take different approaches. In distributed data mining computers are distributed and communicate through message passing. In parallel data mining a parallel computer is assumed with processors sharing memory and or disk. Computers in a distributed data mining system may be viewed as processors sharing nothing. This difference in architecture greatly influences algorithm design, cost model, and performance measure in distributed and parallel data mining.

3.3 Distributed Algorithms [23] Distributed association rule learning • Collective decision tree learning • Collective PCA and PCA-based clustering • Distributed hierarchical clustering • Other distributed clustering algorithms • Collective Bayesian network learning • Collective multi-variate regression •

3.4 Parallel Algorithms The main challenges associated with parallel data mining include

- minimizing I/O

- minimizing synchronization and communication

- effective load balancing [12]

- effective data layout

- deciding on the best search procedure to use

- good data decomposition

- minimizing/avoiding duplication of work[2]

The Four Parallel Algorithms are

1. Count Distribution – parallelizing the task of measuring the frequency of a pattern inside a database

2. Candidate Distribution – parallelizing the task of generating longer patterns

3. Hybrid Count and Candidate Distribution – a hybrid algorithm that tries to combine the strengths of the above algorithms

4. Sampling with Hybrid Count and Candidate Distribution – an algorithm that tries to only use a sample of the database.

International Journal of Computer Science and Information Technology, Volume 2, Number 2, April 2010 The speed and the efficiency of parallel formulations are discussed below.In a parallel data mining the main issues taken into account are

–  –  –

For a maximum speed up there are limits to the scalability of parallelism. For example, at fp = 0.50, 0.90 and 0.99, 50%, 90% and 99% of the code is parallelizable. However, certain problems demonstrate increased performance by increasing the problem size. Problems which increase the percentage of parallel time with their size are more "scalable" than problems with a fixed percentage of parallel time.


–  –  –



The performance of Apriori ARM algorithms degrades for various reasons. It requires n number of database scans to generate a frequent n-itemset. Furthermore, it doesn't recognize transactions in the data set with identical itemsets if that data set is not loaded into the main memory. Therefore, it unnecessarily occupies resources for repeatedly generating itemsets from such identical transactions. For example, if a data set has 10 identical transactions, the Apriori algorithm not only enumerates the same candidate itemsets 10 times but also updates the support counts for those candidate itemsets 10 times for each iteration. Moreover, directly loading a raw data set into the main memory won't find a significant number of identical transactions because each transaction of a raw data set contains both frequent and infrequent items. To overcome these problems, we don't generate candidate support counts from the raw data set after the first pass. This technique not only reduces the average transaction length but also reduces the International Journal of Computer Science and Information Technology, Volume 2, Number 2, April 2010 data set size significantly, so we can accumulate more transactions in the main memory. The number of items in the data set might be large, but only a few will satisfy the support threshold..

Pages:   || 2 |

Similar works:

«The NISTian The NISTian Editorial From the Director i From the Placement Director iii From the editor v AvAnt-GArde: the StrikinG edGe 1. A Bird's Dream Sk Janesar Ahemad 1 2. Liberation Shreya Pratik 1 3. An Encounter with a Professional Killer Dr Suresh Patnaik 2 4. Fringe Rohit 3 5. I Can't Say Biswamitra Kar 4 6. Is Life an Immaculate Child's Play. Snigda Patro 5 7. It's Silence Silences Shant Shika 7 8. Living on the Edge Sabhyasachi Parida 7 9. One More Time Ritwik Bitihotro 8 10. Rain...»

«Report on the Scottish Parliament election on 5 May 2011 October 2011 Translations and other formats For information on obtaining this publication in another language or in a large-print or Braille version, please contact the Electoral Commission: Tel: 020 7271 0500 Email: publications@electoralcommission.org.uk © The Electoral Commission 2011 Contents Foreword 1 Summary 2 1 Introduction 9 2 information 12 3 registration and voting 18 The experience of those standing in the election Delivering...»

«  The Fictitious World Traveller:   The Swede on Timor and the Noble Savage Imagery  By Hans Hägerdal Abstract Travel writing soared in the Western world in the early-modern era with the widening geographical knowledge. This was accompanied by a genre of travel fiction. The present study analyses a short Swedish novella from 1815, Swensken på Timor (The Swede on Timor), “translated” by Christina Cronhjelm from a purported English account. It is a romantic tale of a Swedish...»

«1a direkt 1a direkt Original Mexico Möbel Größte Auswahl. Kleinste Preise. Größte Auswahl. Kleinste Preise. Abholung oder Lieferung europaweit. Moebelkarton Ihr Möbel Online Shop. Online Shop. Keine Versandkosten und mit Trusted Shops Käuferschutz 1A Direkt Limited Sprachschule Eisenbahnstr. 16a 1A Direkt Limited in Renchen mit Beiträgen von Menschen, wie du und ich. Mit Yelp kannst du suchen, Empfehlungen teilen und dich mit anderen darüber 1A Direkt GmbH 61381 Friedrichsdorf...»

«Die Auswirkungen Des Internationalen Handels Auf Die Arbeitslosigkeit In Deutschland An Vietnam Will will pass previous reward even for an improvement and the is the way relevant. Traffic like checking along your best mortgage if who you are it charge made to learn. Stocking plan, curb and software to convey other jobs, or cd to spend epub, can slightly be to keep your epub and a power at a me have closing. Why it includes to becoming any study who is the sailing yourself should create Die...»

«Satisfactory Academic Progress (SAP) Berkeley College Overview INTRODUCTION Academic Programs Qualitative Assessments: Every matriculated student is required to maintain a Admissions and minimum grade point average (GPA), which varies based on the number of credits the Finances student has already attempted. Compliance with this qualitative requirement is measured quarterly. Continued failure to meet this standard after a warning (automatic) or Administration, probationary (when permitted)...»

«CURRICULUM VITAE OF ROBERT SPAHR I. PROFESSIONAL AFFILIATION AND CONTACT INFORMATION Assistant Professor of Digital Media Arts Department of Cinema and Photography College of Mass Communication and Media Arts Southern Illinois University Carbondale Mailcode 6610 Carbondale, IL 62901 rob@robertspahr.com http://www.robertspahr.com II. EDUCATION 1991 Parson's School of Design, New York, NY Master of Fine Arts, Sculpture, magna cum laude 1988 Art Academy of Cincinnati, Cincinnati, OH Bachelors of...»

«Reaching for Rigor. Identifying Practices of Effective High Schools. Marisa Cannata | Katherine Taylor Haynes | Thomas M. Smith Research Report October 2013 Reaching for Rigor Research Report | October 2013 1 The National Center on Scaling Up Effective Schools (NCSU) is a national research and development center that focuses on identifying the combination of essential components and the programs, practices, processes, and policies that make some high schools in large urban districts...»

«FULTON COUNTY, INDIANA BIRTHS 1976-1980 From Rochester Sentinel clippings compiled by Jean C. Tombaugh Copied to computer and indexed by Wendell C. Tombaugh. ========== RS 1-3-76 BOWERS, Ross Walter The wait for the winner of the First Baby of the Year Derby in Fulton County was one of the longest in years but it finally ended at 1:57 p.m. Friday. That's when a boy was born at Woodlawn hospital to Mr. & Mrs. John Bowers of 1629 Bancroft avenue. He checked in at a weight of 7 lb 15 oz, and a...»

«Age of Bronze, State of Grace: Music and Dogs in Coetzee's Disgrace Author(s): Derek Attridge Reviewed work(s): Source: NOVEL: A Forum on Fiction, Vol. 34, No. 1 (Autumn, 2000), pp. 98-121 Published by: Duke University Press Stable URL: http://www.jstor.org/stable/1346141. Accessed: 07/11/2012 06:48 Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at. http://www.jstor.org/page/info/about/policies/terms.jsp. JSTOR is a not-for-profit service...»

«Access Statement M&D's, Scotland's Theme Park Strathclyde Country Park, Motherwell, Scotland ML1 3RT Tel: 01698 333 777 Fax: 01698 303 034 Email: info@scotlandsthemepark.com June 2015 Table of Contents Welcome Travelling by Rail.4 Travelling by Bus / Taxi.5 Car Parking and Drop Offs’ Main Entrance and Tickets Outdoor Attractions Indoor Attractions Krazy Congo 11 Game Zone 12 Carousel 13 Cosmic Bowl 14 Amazonia 16 Public Toilets Catering Giuseppe’s 22 Food Court 24 The Family Restaurant 25...»

«Written Testimony of Roger Featherstone Southwest Circuit Rider EARTHWORKS Public Lands and Forests Subcommitee of the Senate Energy and Natural Resources Committee Hearing on S. 3157 “Southeast Arizona Land Exchange and Conservation Act of 2008” Wednesday, July 9, 2008 EARTHWORKS is a non-profit, non-partisan environmental organization dedicated to protecting communities and the environment from the adverse impacts of mineral development. Our national office, based in Washington D.C.,...»

<<  HOME   |    CONTACTS
2016 www.dis.xlibx.info - Thesis, dissertations, books

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.