«AN OPTIMIZED DISTRIBUTED ASSOCIATION RULE MINING ALGORITHM IN PARALLEL AND DISTRIBUTED DATA MINING WITH XML DATA FOR IMPROVED RESPONSE TIME. Dr ...»
International Journal of Computer Science and Information Technology, Volume 2, Number 2, April 2010
AN OPTIMIZED DISTRIBUTED ASSOCIATION
RULE MINING ALGORITHM IN PARALLEL AND
DISTRIBUTED DATA MINING WITH XML DATA
FOR IMPROVED RESPONSE TIME.
Dr (Mrs).Sujni Paul
Department of Computer Applications,
Karunya University, Coimbatore 641114, Tamil Nadu, India email@example.com
KEYWORDSAssociation rules, Apriori algorithm, parallel and distributed data mining, XML data, response time.
1. INTRODUCTION Association rule mining (ARM) has become one of the core data mining tasks and has attracted tremendous interest among data mining researchers. ARM is an undirected or unsupervised data mining technique which works on variable length data, and produces clear and understandable results. There are two dominant approaches for utilizing multiple processors that have emerged distributed memory in which each processor has a private memory; and shared memory in which all processors access common memory . Shared memory architecture has many desirable properties. Each processor has direct and equal access to all memory in the system. Parallel programs are easy to implement on such a system. In distributed memory architecture each processor has its own local memory that can only be accessed directly by that processor . For a processor to have access to data in the local memory of another processor a copy of the desired data element must be sent from one processor to the other through message passing. XML data are used with the Optimized Distributed Association Rule Mining Algorithm.
A parallel application could be divided into number of tasks and executed concurrently on different processors in the system . However the performance of a parallel application on a distributed system is mainly dependent on the allocation of the tasks comprising the application onto the available processors in the system.
Modern organizations are geographically distributed. Typically, each site locally stores its ever increasing amount of day-to-day data. Using centralized data mining to discover useful patterns in such organizations' data isn't always feasible because merging data sets from different sites into a centralized site incurs huge network communication costs. Data from these organizations are not only distributed over various locations but also vertically fragmented, making it difficult if not impossible to combine them in a central location. Distributed data mining has thus emerged as an active subarea of data mining research. In this paper an Optimized Association Rule Mining Algorithm is used for performing the mining process.
2. RELATED WORK Three parallel algorithms for mining association rules , an important data mining problem is formulated in this paper. These algorithms have been designed to investigate and understand the performance implications of a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem-specific information in parallel data mining . Fast Distributed Mining of association rules, which generates a small number of candidate sets and substantially reduces the number of messages to be passed at mining association rules .
Algorithms for mining association rules from relational data have been well developed. Several query languages have been proposed, to assist association rule mining such as , 19]. The topic of mining XML data has received little attention, as the data mining community has focused on the development of techniques for extracting common structure from heterogeneous XML data. For instance,  has proposed an algorithm to construct a frequent tree by finding common subtrees embedded in the heterogeneous XML data. On the other hand, some researchers focus on developing a standard model to represent the knowledge extracted from the data using XML. JAM  has been developed to gather information from sparse data sources and induce a global classification model. The PADMA system  is a document analysis tool working on a distributed environment, based on cooperative agents. It works without any relational database underneath. Instead, there are PADMA agents that perform several relational operations with the information extracted from the documents.
3. ASSOCIATION RULE MINING ALGORITHMSAn association rule is a rule which implies certain association relationships among a set of objects (such as ``occur together'' or ``one implies the other'') in a database. Given a set of transactions, where each transaction is a set of literals (called items), an association rule is an expression of the form X Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y .
3.1 Apriori Algorithm An association rule mining algorithm, Apriori has been developed for rule mining in large transaction databases by IBM's Quest project team . An itemset is a non-empty set of items.
They have decomposed the problem of mining association rules into two parts Find all combinations of items that have transaction support above minimum support. Call those • combinations frequent itemsets.
Use the frequent itemsets to generate the desired rules. The general idea is that if, say, ABCD and • AB are frequent itemsets, then we can determine if the rule AB CD holds by computing the ratio r = support(ABCD)/support(AB). The rule holds only if r = minimum confidence. Note that the International Journal of Computer Science and Information Technology, Volume 2, Number 2, April 2010 rule will have minimum support because ABCD is frequent. The algorithm is highly scalable .
The Apriori algorithm used in Quest for finding all frequent itemsets is given below.
procedure AprioriAlg() begin
It makes multiple passes over the database. In the first pass, the algorithm simply counts item occurrences to determine the frequent 1-itemsets (itemsets with 1 item). A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets Lk-1 (the set of all frequent (k-1)-itemsets) found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the apriori-gen() function. This function first joins Lk-1 with Lk-1, the joining condition being that the lexicographically ordered first k-2 items are the same. Next, it deletes all those itemsets from the join result that have some (k-1)-subset that is not in Lk-1 yielding Ck.
The algorithm now scans the database. For each transaction, it determines which of the candidates in Ck are contained in the transaction using a hash-tree data structure and increments the count of those candidates , . At the end of the pass, Ck is examined to determine which of the candidates are frequent, yielding Lk. The algorithm terminates when Lk becomes empty.
3.2 Distributed/parallel algorithms Databases or data warehouses may store a huge amount of data to be mined. Mining association rules in such databases may require substantial processing power . A possible solution to this problem can be a distributed system.. Moreover, many large databases are distributed in nature which may make it more feasible to use distributed algorithms.
Major cost of mining association rules is the computation of the set of large itemsets in the database.
Distributed computing of large itemsets encounters some new problems. One may compute locally large itemsets easily, but a locally large itemset may not be globally large. Since it is very expensive to broadcast the whole data set to other sites, one option is to broadcast all the counts of all the itemsets, no matter locally large or small, to other sites. However, a database may contain enormous combinations of itemsets, and it will involve passing a huge number of messages.
A distributed data mining algorithm FDM (Fast Distributed Mining of association rules) has been proposed by , which has the following distinct features.
1. The generation of candidate sets is in the same spirit of Apriori. However, some relationships between locally large sets and globally large ones are explored to generate a smaller set of candidate sets at each iteration and thus reduce the number of messages to be passed.
International Journal of Computer Science and Information Technology, Volume 2, Number 2, April 2010
2. After the candidate sets have been generated, two pruning techniques, local pruning and global pruning, are developed to prune away some candidate sets at each individual sites.
3. In order to determine whether a candidate set is large, this algorithm requires only O(n) messages for support count exchange, where n is the number of sites in the network. This is much less than a straight adaptation of Apriori, which requires O(n2 ) messages.  Distributed data mining refers to the mining of distributed data sets. The data sets are stored in local databases hosted by local computers which are connected through a computer network , .
Data mining takes place at a local level and at a global level where local data mining results are combined to gain global findings. Distributed data mining is often mentioned with parallel data mining in literature.
While both attempt to improve the performance of traditional data mining systems they assume different system architectures and take different approaches. In distributed data mining computers are distributed and communicate through message passing. In parallel data mining a parallel computer is assumed with processors sharing memory and or disk. Computers in a distributed data mining system may be viewed as processors sharing nothing. This difference in architecture greatly influences algorithm design, cost model, and performance measure in distributed and parallel data mining.
3.3 Distributed Algorithms  Distributed association rule learning • Collective decision tree learning • Collective PCA and PCA-based clustering • Distributed hierarchical clustering • Other distributed clustering algorithms • Collective Bayesian network learning • Collective multi-variate regression •
3.4 Parallel Algorithms The main challenges associated with parallel data mining include
- minimizing I/O
- minimizing synchronization and communication
- effective load balancing 
- effective data layout
- deciding on the best search procedure to use
- good data decomposition
- minimizing/avoiding duplication of work
The Four Parallel Algorithms are
1. Count Distribution – parallelizing the task of measuring the frequency of a pattern inside a database
2. Candidate Distribution – parallelizing the task of generating longer patterns
3. Hybrid Count and Candidate Distribution – a hybrid algorithm that tries to combine the strengths of the above algorithms
4. Sampling with Hybrid Count and Candidate Distribution – an algorithm that tries to only use a sample of the database.
International Journal of Computer Science and Information Technology, Volume 2, Number 2, April 2010 The speed and the efficiency of parallel formulations are discussed below.In a parallel data mining the main issues taken into account are
For a maximum speed up there are limits to the scalability of parallelism. For example, at fp = 0.50, 0.90 and 0.99, 50%, 90% and 99% of the code is parallelizable. However, certain problems demonstrate increased performance by increasing the problem size. Problems which increase the percentage of parallel time with their size are more "scalable" than problems with a fixed percentage of parallel time.