ECML/PKDD-2002 Tutorial

Privacy, Security, and Data Mining

20 August 2002 9:00-13:00

Index

Motivation ·
Objectives ·
Target Audience ·
Outline ·
Presenter ·
References ·

Motivation

Data mining technology has emerged as a means of identifying patterns and trends from large quantities of data. One of the key requirements of a data mining project is access to the relevant data. Privacy and Security concerns can constrain such access, threatening to derail data mining projects. This tutorial discusses constraints imposed by privacy and security, and presents technical solutions that enable data mining projects to proceed without violating these constraints.

Privacy and security have been a concern since early in the age of automated data processing, and these issues were raised early in the data mining community [CM96]. There has recently been a surge in solutions to these issues [AS00, LP00, AA01, KC02, VC02]. The combination of growing use of data mining (leading to more conflict with privacy and security concerns), along with the growing number of technical solutions, makes the time ripe for a tutorial on the subject.

There are many data mining situations where these privacy and security issues arise. A few examples are:

Identifying public health problem outbreaks (e.g., epidemics, biological warfare instances). There are many data collectors (insurance companies, HMOs, public health agencies). Individual privacy concerns limit the willingness of the data custodians to share data, even with government agencies such as the U.S. Centers for Disease Control. Can we accomplish the desired results while still preserving privacy of individual entities?

Collaborative corporations or entities. Ford and Firestone shared a problem with a jointly produced product: Ford Explorers with Firestone tires. Ford and Firestone may have been able to use association rule techniques to detect problems earlier. This would have required extensive data sharing. Factors such as trade secrets and agreements with other manufacturers stand in the way of the necessary sharing. Could we obtain the same results, while still preserving the secrecy of each side's data? Government entities face similar problems, such as limitations on sharing between law enforcement, intelligence agencies, and tax collection.

Multi-national corporations. An individual country's legal system may prevent sharing of customer data between a subsidiary and its parent.

These examples each define a different problem, or set of problems. The problems can be characterized by the following three parameters:

Outcome
What is the desired data mining result? Do we want to cluster the data, as in the disease outbreak example? Are we looking for association rules identifying relationships among the attributes? There are several such data mining tasks, and each poses a new set of challenges.

Control of Data
Who controls the data? Is each entity found only at a single site (as with medical insurance records)? Or do different sites contain different types of data (Ford on vehicles, Firestone on tires)?

Privacy
What are the privacy requirements? If the concern is solely that values associated with an individual entity not be released (e.g., "personally identifiable information") we can develop techniques that provably protect such information. In other cases, the notion of "sensitive" may not be known in advance. This would lead to human vetting of the intermediate results.

Sometimes it may be difficult (or impossible) to develop an exact solution that meets the privacy constraints. In data mining an approximate solution is often sufficient. The goal, then, is to obtain a solution with bounded error.

Objectives

The first half of this tutorial will discuss privacy and security constraints that are likely to affect data mining projects. The second half will introduce technical solutions to these problems: How to obtain data mining results without violating privacy, where "normal" data mining approaches would require a level of data access that violates privacy and security constraints.

Target Audience

This tutorial meets the needs of two audiences. Practitioners will learn to recognize when privacy and security concerns threaten to derail a data mining project. They will become able to develop alternative approaches that enable project completion while meeting privacy constraints. They will learn of technical solutions that enable those approaches, and of contacts within the research community who can develop new solutions where existing ones do not meet their needs.

Researchers will become familiar with the current state of research in this relatively new area. They will learn the constraints that lead to privacy and security problems with data mining, enabling them to identify new challenges and develop new solutions in this rapidly developing field.

Participants will need a general knowledge of data mining methods and techniques.

Outline

Introduction:
Examples of conflict between data mining and privacy, example of simple solution to one example (distributed association rules in horizontally partitioned distributed data where global access to individual values restricted).

Privacy and security constraints:
Individual privacy, Collection privacy, Limitations on use of results.

Sources of privacy and security constraints:
Regulatory, Contractual, Secrecy.

Classes of solutions:

Data obfuscation. Example: United States Census Bureau Public Use Microdata [RAM96].
Summarization. Example: Statistical Queries [Den80].
Data separation. Horizontal vs. vertical partitioning of data. Example: Hospital records.

Part 1 summary and break:
When privacy concerns should be addressed, in terms of the CRoss Industry Standard Process for Data Mining (CRISP-DM). Articulating that data mining results rarely violate these concerns. The problem is for the data miners to obtain access to the data.

Setup for Part 2:
technical solutions to obtain data mining results in spite of limited data access.

Data obfuscation based technique:
Developing reliable classifiers from unreliable data through reconstructing distributions. Method of [AS00].

Data separation based techniques:
Computing results without sharing data.

Overview of secure multiparty computation. Definition of secure, semi-honest and malicious model. Brief overview of the general two-party protocol [Yao86].
Secure decision tree construction. [LP00].
Secure association rules. Horizontally partitioned data [KC02], Vertically partitioned data [VC02].

Presenter

Dr. Chris Clifton, Associate Professor
Department of Computer Sciences
Purdue University
West Lafayette, Indiana 47907-1398 USA
+1 765-494-6005, Fax: +1 765 494-0739
clifton@cs.purdue.edu

Chris Clifton is an Associate Professor of Computer Science at Purdue University. He has a Ph.D. from Princeton University, and Bachelor's and Master's degrees from the Massachusetts Institute of Technology. Prior to joining Purdue in 2001, Chris had served as a Principal Scientist at The MITRE Corporation and as an Assistant Professor of Computer Science at Northwestern University. His research interests include data mining, data security, database support for text, and heterogeneous databases.

References

[AA01]
Dakshi Agrawal and Charu C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Santa Barbara, California, USA, May 21-23 2001. ACM.

[AS00]
Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD Conference on Management of Data, Dallas, TX, May 14-19 2000. ACM.

[CM96]
Chris Clifton and Don Marks. Security and privacy implications of data mining. In Workshop on Data Mining and Knowledge Discovery, pages 15-19, Montreal, Canada, June 2 1996. ACM SIGMOD.

[Den80]
Dorothy E. Denning. Secure statistical databases with random sample queries. ACM Transactions on Database Systems, 5(3):291-315, September 1980.

[KC02]
Murat Kantarcioglu and Chris Clifton. Distributed association rule mining without sharing. Submitted to Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'02). ACM SIGMOD, June 2 2002.

[LP00]
Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. In Advances in Cryptology - CRYPTO 2000, pages 36-54. Springer-Verlag, August 20-24 2000.

[RAM96]
Jr. Richard A. Moore. Controlled data-swapping techniques for masking public use microdata sets. Statistical Research Division Report Series RR 96-04, U.S. Bureau of the Census, Washington, DC., 1996.

[VC02]
Jaideep Shrikant Vaidya and Chris Clifton. Privacy preserving association rule mining in vertically partitioned data. Submitted to The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26 2002.

[Yao86]
Andrew C. Yao. How to generate and exchange secrets. In Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, pages 162-167. IEEE, 1986.