«Edited by CHARU C. AGGARWAL IBM T. J. Watson Research Center, Hawthorne, NY 10532 PHILIP S. YU University of Illinois at Chicago, Chicago, IL 60607 ...»
PRIVACY-PRESERVING DATA MINING:
MODELS AND ALGORITHMS
PRIVACY-PRESERVING DATA MINING:
MODELS AND ALGORITHMS
CHARU C. AGGARWAL
IBM T. J. Watson Research Center, Hawthorne, NY 10532
PHILIP S. YU
University of Illinois at Chicago, Chicago, IL 60607
Kluwer Academic Publishers
Boston/Dordrecht/London Contents List of Figures xv List of Tables xx Preface xxi An Introduction to Privacy-Preserving Data Mining Charu C. Aggarwal, Philip S. Yu
1. Introduction 1
2. Privacy-Preserving Data Mining Algorithms 3
3. Conclusions and Summary 7 References 8 A General Survey of Privacy-Preserving Data Mining Models and Algorithms Charu C. Aggarwal, Philip S. Yu
1. Introduction 11
2. The Randomization Method 13
2.1 Privacy Quantiﬁcation 15
2.2 Adversarial Attacks on Randomization 17
2.3 Randomization Methods for Data Streams 18
2.4 Multiplicative Perturbations 18
2.5 Data Swapping 19
3. Group Based Anonymization 20 The k-Anonymity Framework 3.1 20
3.2 Personalized Privacy-Preservation 23
3.3 Utility Based Privacy Preservation 24
3.4 Sequential Releases 25 The l-diversity Method 3.5 26 The t-closeness Model 3.6 27
3.7 Models for Text, Binary and String Data 27
4. Distributed Privacy-Preserving Data Mining 28
4.1 Distributed Algorithms over Horizontally Partitioned Data Sets 30
4.2 Distributed Algorithms over Vertically Partitioned Data 31 Distributed Algorithms for k-Anonymity 4.3 31
In recent years, advances in hardware technology have lead to an increase in the capability to store and record personal data about consumers and individuals. This has lead to concerns that the personal data may be misused for a variety of purposes. In order to alleviate these concerns, a number of techniques have recently been proposed in order to perform the data mining tasks in a privacy-preserving way. These techniques for performing privacy-preserving data mining are drawn from a wide array of related topics such as data mining, cryptography and information hiding. The material in this book is designed to be drawn from the different topics so as to provide a good overview of the important topics in the ﬁeld.
While a large number of research papers are now available in this ﬁeld, many of the topics have been studied by different communities with different styles.
At this stage, it becomes important to organize the topics in such a way that the relative importance of different research areas is recognized. Furthermore, the ﬁeld of privacy-preserving data mining has been explored independently by the cryptography, database and statistical disclosure control communities. In some cases, the parallel lines of work are quite similar, but the communities are not sufﬁciently integrated for the provision of a broader perspective. This book will contain chapters from researchers of all three communities and will therefore try to provide a balanced perspective of the work done in this ﬁeld.
This book will be structured as an edited book from prominent researchers in the ﬁeld. Each chapter will contain a survey which contains the key research content on the topic, and the future directions of research in the ﬁeld. Emphasis will be placed on making each chapter self-sufﬁcient. While the chapters will be written by different researchers, the topics and content is organized in such a way so as to present the most important models, algorithms, and applications in the privacy ﬁeld in a structured and concise way. In addition, attention is paid in drawing chapters from researchers working in different areas in order to provide different points of view. Given the lack of structurally organized information on the topic of privacy, the book will provide insights which are not easily accessible otherwise. A few chapters in the book are not surveys, since the corresponding topics fall in the emerging category, and enough material is not
xxii PRIVACY-PRESERVING DATA MINING: MODELS AND ALGORITHMSavailable to create a survey. In such cases, the individual results have been included to give a ﬂavor of the emerging research in the ﬁeld. It is expected that the book will be a great help to researchers and graduate students interested in the topic. While the privacy ﬁeld clearly falls in the emerging category because of its recency, it is now beginning to reach a maturation and popularity point, where the development of an overview book on the topic becomes both possible and necessary. It is hoped that this book will provide a reference to students, researchers and practitioners in both introducing the topic of privacy-preserving data mining and understanding the practical and algorithmic aspects of the area.
AN INTRODUCTION TO PRIVACY-PRESERVING
DATA MININGCharu C. Aggarwal IBM T. J. Watson Research Center Hawthorne, NY 10532 firstname.lastname@example.org Philip S. Yu IBM T. J. Watson Research Center Hawthorne, NY 10532 email@example.com Abstract The ﬁeld of privacy has seen rapid advances in recent years because of the increases in the ability to store data. In particular, recent advances in the data mining ﬁeld have lead to increased concerns about privacy. While the topic of privacy has been traditionally studied in the context of cryptography and informationhiding, recent emphasis on data mining has lead to renewed interest in the ﬁeld.
In this chapter, we will introduce the topic of privacy-preserving data mining and provide an overview of the different topics covered in this book.
1. IntroductionThe problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users, and the increasing sophistication of data mining algorithms to leverage this information. A number of techniques such as randomization and k-anonymity [1, 4, 16] have been suggested in recent years in order to perform privacy-preserving data mining. Furthermore, the problem has been discussed in multiple communities such as the database community, the statistical disclosure control community and the cryptography community. In some cases,
2 PRIVACY-PRESERVING DATA MINING: MODELS AND ALGORITHMSthe different communities have explored parallel lines of work which are quite similar. This book will try to explore different topics from the perspective of different communities, and will try to give a fused idea of the work in different communities.
The key directions in the ﬁeld of privacy-preserving data mining are as follows:
Privacy-Preserving Data Publishing: These techniques tend to study different transformation methods associated with privacy. These techniques include methods such as randomization , k-anonymity [16, 7], and l-diversity . Another related issue is how the perturbed data can be used in conjunction with classical data mining methods such as association rule mining . Other related problems include that of determining privacy-preserving methods to keep the underlying data useful (utility-based methods), or the problem of studying the different deﬁnitions of privacy, and how they compare in terms of effectiveness in different scenarios.
Changing the results of Data Mining Applications to preserve privacy: In many cases, the results of data mining applications such as association rule or classiﬁcation rule mining can compromise the privacy of the data. This has spawned a ﬁeld of privacy in which the results of data mining algorithms such as association rule mining are modiﬁed in order to preserve the privacy of the data. A classic example of such techniques are association rule hiding methods, in which some of the association rules are suppressed in order to preserve privacy.
Query Auditing: Such methods are akin to the previous case of modifying the results of data mining algorithms. Here, we are either modifying or restricting the results of queries. Methods for perturbing the output of queries are discussed in , whereas techniques for restricting queries are discussed in [9, 13].
Cryptographic Methods for Distributed Privacy: In many cases, the data may be distributed across multiple sites, and the owners of the data across these different sites may wish to compute a common function. In such cases, a variety of cryptographic protocols may be used in order to communicate among the different sites, so that secure function computation is possible without revealing sensitive information. A survey of such methods may be found in .
Theoretical Challenges in High Dimensionality: Real data sets are usually extremely high dimensional, and this makes the process of privacypreservation extremely difﬁcult both from a computational and effectiveness point of view. In , it has been shown that optimal k-anonymization An Introduction to Privacy-Preserving Data Mining is NP-hard. Furthermore, the technique is not even effective with increasing dimensionality, since the data can typically be combined with either public or background information to reveal the identity of the underlying record owners. A variety of methods for adversarial attacks in the high dimensional case are discussed in [5, 6].
This book will attempt to cover the different topics from the point of view of different communities in the ﬁeld. This chapter will provide an overview of the different privacy-preserving algorithms covered in this book. We will discuss the challenges associated with each kind of problem, and discuss an overview of the material in the corresponding chapter.
2. Privacy-Preserving Data Mining Algorithms In this section, we will discuss the key stream mining problems and will discuss the challenges associated with each problem. We will also discuss an overview of the material covered in each chapter of this book. The broad topics
covered in this book are as follows:
General Survey. In chapter 2, we provide a broad survey of privacypreserving data-mining methods. We provide an overview of the different techniques and how they relate to one another. The individual topics will be covered in sufﬁcient detail to provide the reader with a good reference point. The idea is to provide an overview of the ﬁeld for a new reader from the perspective of the data mining community. However, more detailed discussions are deferred to future chapters which contain descriptions of different data mining algorithms.
Statistical Methods for Disclosure Control. The topic of privacy-preserving data mining has often been studied extensively by the data mining community without sufﬁcient attention to the work done by the conventional work done by the statistical disclosure control community. In chapter 3, detailed methods for statistical disclosure control have been presented along with some of the relationships to the parallel work done in the database and data mining community.
This includes methods such as k-anonymity, swapping, randomization, microaggregation and synthetic data generation. The idea is to give the readers an overview of the common themes in privacy-preserving data mining by different communities.
Measures of Anonymity. There are a very large number of deﬁnitions of anonymity in the privacy-preserving data mining ﬁeld. This is partially because of the varying goals of different privacy-preserving data mining algorithms. For example, methods such as k-anonymity, l-diversity and t-closeness are all designed to prevent identiﬁcation, though the ﬁnal goal is to preserve the
4 PRIVACY-PRESERVING DATA MINING: MODELS AND ALGORITHMSunderlying sensitive information. Each of these methods is designed to prevent disclosure of sensitive information in a different way. Chapter 4 is s a survey of different measures of anonymity. The chapter tries to deﬁne privacy from the perspective of anonymity measures and classiﬁes such measures. The chapter also compares and contrasts different measures, and discusses the relative advantages of different measures. This chapter thus provides an overview and perspective of the different ways in which privacy could be deﬁned, and what the relative advantages of each method might be.
The k-anonymity Method. An important method for privacy de-identiﬁcation is the method of k-anonymity . The motivating factor behind the kanonymity technique is that many attributes in the data can often be considered pseudo-identiﬁers which can be used in conjunction with public records in order to uniquely identify the records. For example, if the identiﬁcations from the records are removed, attributes such as the birth date and zip-code an be used in order to uniquely identify the identities of the underlying records. The idea in k-anonymity is to reduce the granularity of representation of the data in such a way that a given record cannot be distinguished from at least (k − 1) other records. In chapter 5, the k-anonymity method is discussed in detail. A number of important algorithms for k-anonymity are discussed in the same chapter.
The Randomization Method. The randomization technique uses data distortion methods in order to create private representations of the records [1, 4].
In most cases, the individual records cannot be recovered, but only aggregate distributions can be recovered. These aggregate distributions can be used for
data mining purposes. Two kinds of perturbation are possible with the randomization method:
Additive Perturbation: In this case, randomized noise is added to the data records. The overall data distributions can be recovered from the randomized records. Data mining and management algorithms re designed to work with these data distributions. A detailed discussion of these methods is provided in chapter 6.
Multiplicative Perturbation: In this case, the random projection or random rotation techniques are used in order to perturb the records. A detailed discussion of these methods is provided in chapter 7.
In addition, these chapters deal with the issue of adversarial attacks and vulnerabilities of these methods.