Availability
Library | Item Barcode | Call Number | Material Type | Item Category 1 | Status |
---|---|---|---|---|---|
 | 30000010194070 | QA76.9.D343 P74 2008 | Open Access Book | Book | On Order |
Summary
Advances in hardware technology have increased the capability to store and record personal data about consumers and individuals, causing concerns that personal data may be used for a variety of intrusive or malicious purposes.
Privacy-Preserving Data Mining: Models and Algorithms proposes a number of techniques for performing data mining tasks in a privacy-preserving way. These techniques generally fall into the following categories: data modification techniques; cryptographic methods and protocols for data sharing; statistical techniques for disclosure and inference control; query auditing methods; and randomization and perturbation-based techniques.
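As a rough illustration of the last category, the sketch below shows additive randomization: each record is released only after independent noise is added, yet aggregate statistics remain estimable because the noise distribution is known. This is a minimal sketch assuming NumPy; the salary-like data and noise scale are invented for the example, not taken from the book.

```python
import numpy as np

# Additive randomization: the collector sees x_i + r_i, never x_i itself.
rng = np.random.default_rng(seed=42)

NOISE_STD = 20_000.0  # publicly known noise scale (illustrative)
true_values = rng.normal(loc=50_000, scale=10_000, size=10_000)  # e.g. salaries
noise = rng.normal(loc=0.0, scale=NOISE_STD, size=true_values.size)
randomized = true_values + noise  # the only data that leaves each user's machine

# Because the noise has zero mean and known variance, first and second
# moments of the original distribution can be reconstructed in aggregate:
# E[X + R] = E[X] and Var(X + R) = Var(X) + Var(R).
est_mean = randomized.mean()
est_var = randomized.var() - NOISE_STD**2

print(f"estimated mean: {est_mean:.0f}, estimated std: {est_var**0.5:.0f}")
```

Chapter 6 of the volume surveys reconstruction methods (Bayes- and EM-based) that recover full distributions, not just moments, from data randomized in this way.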
This edited volume contains surveys by distinguished researchers in the privacy field. Each survey includes the key research content as well as future research directions.
Privacy-Preserving Data Mining: Models and Algorithms is designed for researchers, professors, and advanced-level students in computer science, and is also suitable for industry practitioners.
Table of Contents
Preface | p. v |
List of Figures | p. xvii |
List of Tables | p. xxi |
1 An Introduction to Privacy-Preserving Data Mining | p. 1 |
1.1 Introduction | p. 1 |
1.2 Privacy-Preserving Data Mining Algorithms | p. 3 |
1.3 Conclusions and Summary | p. 7 |
References | p. 8 |
2 A General Survey of Privacy-Preserving Data Mining Models and Algorithms | p. 11 |
2.1 Introduction | p. 11 |
2.2 The Randomization Method | p. 13 |
2.2.1 Privacy Quantification | p. 15 |
2.2.2 Adversarial Attacks on Randomization | p. 18 |
2.2.3 Randomization Methods for Data Streams | p. 18 |
2.2.4 Multiplicative Perturbations | p. 19 |
2.2.5 Data Swapping | p. 19 |
2.3 Group Based Anonymization | p. 20 |
2.3.1 The k-Anonymity Framework | p. 20 |
2.3.2 Personalized Privacy-Preservation | p. 24 |
2.3.3 Utility Based Privacy Preservation | p. 24 |
2.3.4 Sequential Releases | p. 25 |
2.3.5 The l-diversity Method | p. 26 |
2.3.6 The t-closeness Model | p. 27 |
2.3.7 Models for Text, Binary and String Data | p. 27 |
2.4 Distributed Privacy-Preserving Data Mining | p. 28 |
2.4.1 Distributed Algorithms over Horizontally Partitioned Data Sets | p. 30 |
2.4.2 Distributed Algorithms over Vertically Partitioned Data | p. 31 |
2.4.3 Distributed Algorithms for k-Anonymity | p. 32 |
2.5 Privacy-Preservation of Application Results | p. 32 |
2.5.1 Association Rule Hiding | p. 33 |
2.5.2 Downgrading Classifier Effectiveness | p. 34 |
2.5.3 Query Auditing and Inference Control | p. 34 |
2.6 Limitations of Privacy: The Curse of Dimensionality | p. 37 |
2.7 Applications of Privacy-Preserving Data Mining | p. 38 |
2.7.1 Medical Databases: The Scrub and Datafly Systems | p. 39 |
2.7.2 Bioterrorism Applications | p. 40 |
2.7.3 Homeland Security Applications | p. 40 |
2.7.4 Genomic Privacy | p. 42 |
2.8 Summary | p. 43 |
References | p. 43 |
3 A Survey of Inference Control Methods for Privacy-Preserving Data Mining | p. 53 |
3.1 Introduction | p. 54 |
3.2 A Classification of Microdata Protection Methods | p. 55 |
3.3 Perturbative Masking Methods | p. 58 |
3.3.1 Additive Noise | p. 58 |
3.3.2 Microaggregation | p. 59 |
3.3.3 Data Swapping and Rank Swapping | p. 61 |
3.3.4 Rounding | p. 62 |
3.3.5 Resampling | p. 62 |
3.3.6 PRAM | p. 62 |
3.3.7 MASSC | p. 63 |
3.4 Non-perturbative Masking Methods | p. 63 |
3.4.1 Sampling | p. 64 |
3.4.2 Global Recoding | p. 64 |
3.4.3 Top and Bottom Coding | p. 65 |
3.4.4 Local Suppression | p. 65 |
3.5 Synthetic Microdata Generation | p. 65 |
3.5.1 Synthetic Data by Multiple Imputation | p. 65 |
3.5.2 Synthetic Data by Bootstrap | p. 66 |
3.5.3 Synthetic Data by Latin Hypercube Sampling | p. 66 |
3.5.4 Partially Synthetic Data by Cholesky Decomposition | p. 67 |
3.5.5 Other Partially Synthetic and Hybrid Microdata Approaches | p. 67 |
3.5.6 Pros and Cons of Synthetic Microdata | p. 68 |
3.6 Trading off Information Loss and Disclosure Risk | p. 69 |
3.6.1 Score Construction | p. 69 |
3.6.2 R-U Maps | p. 71 |
3.6.3 k-anonymity | p. 71 |
3.7 Conclusions and Research Directions | p. 72 |
References | p. 73 |
4 Measures of Anonymity | p. 81 |
4.1 Introduction | p. 81 |
4.1.1 What is Privacy? | p. 82 |
4.1.2 Data Anonymization Methods | p. 83 |
4.1.3 A Classification of Methods | p. 84 |
4.2 Statistical Measures of Anonymity | p. 85 |
4.2.1 Query Restriction | p. 85 |
4.2.2 Anonymity via Variance | p. 85 |
4.2.3 Anonymity via Multiplicity | p. 86 |
4.3 Probabilistic Measures of Anonymity | p. 87 |
4.3.1 Measures Based on Random Perturbation | p. 87 |
4.3.2 Measures Based on Generalization | p. 90 |
4.3.3 Utility vs Privacy | p. 94 |
4.4 Computational Measures of Anonymity | p. 94 |
4.4.1 Anonymity via Isolation | p. 97 |
4.5 Conclusions and New Directions | p. 97 |
4.5.1 New Directions | p. 98 |
References | p. 99 |
5 k-Anonymous Data Mining: A Survey | p. 105 |
5.1 Introduction | p. 105 |
5.2 k-Anonymity | p. 107 |
5.3 Algorithms for Enforcing k-Anonymity | p. 110 |
5.4 k-Anonymity Threats from Data Mining | p. 117 |
5.4.1 Association Rules | p. 118 |
5.4.2 Classification Mining | p. 118 |
5.5 k-Anonymity in Data Mining | p. 120 |
5.6 Anonymize-and-Mine | p. 123 |
5.7 Mine-and-Anonymize | p. 126 |
5.7.1 Enforcing k-Anonymity on Association Rules | p. 126 |
5.7.2 Enforcing k-Anonymity on Decision Trees | p. 130 |
5.8 Conclusions | p. 133 |
Acknowledgments | p. 133 |
References | p. 134 |
6 A Survey of Randomization Methods for Privacy-Preserving Data Mining | p. 137 |
6.1 Introduction | p. 137 |
6.2 Reconstruction Methods for Randomization | p. 139 |
6.2.1 The Bayes Reconstruction Method | p. 139 |
6.2.2 The EM Reconstruction Method | p. 141 |
6.2.3 Utility and Optimality of Randomization Models | p. 143 |
6.3 Applications of Randomization | p. 144 |
6.3.1 Privacy-Preserving Classification with Randomization | p. 144 |
6.3.2 Privacy-Preserving OLAP | p. 145 |
6.3.3 Collaborative Filtering | p. 145 |
6.4 The Privacy-Information Loss Tradeoff | p. 146 |
6.5 Vulnerabilities of the Randomization Method | p. 149 |
6.6 Randomization of Time Series Data Streams | p. 151 |
6.7 Multiplicative Noise for Randomization | p. 152 |
6.7.1 Vulnerabilities of Multiplicative Randomization | p. 153 |
6.7.2 Sketch Based Randomization | p. 153 |
6.8 Conclusions and Summary | p. 154 |
References | p. 154 |
7 A Survey of Multiplicative Perturbation for Privacy-Preserving Data Mining | p. 157 |
7.1 Introduction | p. 158 |
7.1.1 Data Privacy vs. Data Utility | p. 159 |
7.1.2 Outline | p. 160 |
7.2 Definition of Multiplicative Perturbation | p. 161 |
7.2.1 Notations | p. 161 |
7.2.2 Rotation Perturbation | p. 161 |
7.2.3 Projection Perturbation | p. 162 |
7.2.4 Sketch-based Approach | p. 164 |
7.2.5 Geometric Perturbation | p. 164 |
7.3 Transformation Invariant Data Mining Models | p. 165 |
7.3.1 Definition of Transformation Invariant Models | p. 166 |
7.3.2 Transformation-Invariant Classification Models | p. 166 |
7.3.3 Transformation-Invariant Clustering Models | p. 167 |
7.4 Privacy Evaluation for Multiplicative Perturbation | p. 168 |
7.4.1 A Conceptual Multidimensional Privacy Evaluation Model | p. 168 |
7.4.2 Variance of Difference as Column Privacy Metric | p. 169 |
7.4.3 Incorporating Attack Evaluation | p. 170 |
7.4.4 Other Metrics | p. 171 |
7.5 Attack Resilient Multiplicative Perturbations | p. 171 |
7.5.1 Naive Estimation to Rotation Perturbation | p. 171 |
7.5.2 ICA-Based Attacks | p. 173 |
7.5.3 Distance-Inference Attacks | p. 174 |
7.5.4 Attacks with More Prior Knowledge | p. 176 |
7.5.5 Finding Attack-Resilient Perturbations | p. 177 |
7.6 Conclusion | p. 177 |
Acknowledgment | p. 178 |
References | p. 179 |
8 A Survey of Quantification of Privacy Preserving Data Mining Algorithms | p. 183 |
8.1 Introduction | p. 184 |
8.2 Metrics for Quantifying Privacy Level | p. 186 |
8.2.1 Data Privacy | p. 186 |
8.2.2 Result Privacy | p. 191 |
8.3 Metrics for Quantifying Hiding Failure | p. 192 |
8.4 Metrics for Quantifying Data Quality | p. 193 |
8.4.1 Quality of the Data Resulting from the PPDM Process | p. 193 |
8.4.2 Quality of the Data Mining Results | p. 198 |
8.5 Complexity Metrics | p. 200 |
8.6 How to Select a Proper Metric | p. 201 |
8.7 Conclusion and Research Directions | p. 202 |
References | p. 202 |
9 A Survey of Utility-based Privacy-Preserving Data Transformation Methods | p. 207 |
9.1 Introduction | p. 208 |
9.1.1 What is Utility-based Privacy Preservation? | p. 209 |
9.2 Types of Utility-based Privacy Preservation Methods | p. 210 |
9.2.1 Privacy Models | p. 210 |
9.2.2 Utility Measures | p. 212 |
9.2.3 Summary of the Utility-Based Privacy Preserving Methods | p. 214 |
9.3 Utility-Based Anonymization Using Local Recoding | p. 214 |
9.3.1 Global Recoding and Local Recoding | p. 215 |
9.3.2 Utility Measure | p. 216 |
9.3.3 Anonymization Methods | p. 217 |
9.3.4 Summary and Discussion | p. 219 |
9.4 The Utility-based Privacy Preserving Methods in Classification Problems | p. 219 |
9.4.1 The Top-Down Specialization Method | p. 220 |
9.4.2 The Progressive Disclosure Algorithm | p. 224 |
9.4.3 Summary and Discussion | p. 228 |
9.5 Anonymized Marginal: Injecting Utility into Anonymized Data Sets | p. 228 |
9.5.1 Anonymized Marginal | p. 229 |
9.5.2 Utility Measure | p. 230 |
9.5.3 Injecting Utility Using Anonymized Marginals | p. 231 |
9.5.4 Summary and Discussion | p. 233 |
9.6 Summary | p. 234 |
Acknowledgments | p. 234 |
References | p. 234 |
10 Mining Association Rules under Privacy Constraints | p. 239 |
10.1 Introduction | p. 239 |
10.2 Problem Framework | p. 240 |
10.2.1 Database Model | p. 240 |
10.2.2 Mining Objective | p. 241 |
10.2.3 Privacy Mechanisms | p. 241 |
10.2.4 Privacy Metric | p. 243 |
10.2.5 Accuracy Metric | p. 245 |
10.3 Evolution of the Literature | p. 246 |
10.4 The FRAPP Framework | p. 251 |
10.4.1 Reconstruction Model | p. 252 |
10.4.2 Estimation Error | p. 253 |
10.4.3 Randomizing the Perturbation Matrix | p. 256 |
10.4.4 Efficient Perturbation | p. 256 |
10.4.5 Integration with Association Rule Mining | p. 258 |
10.5 Sample Results | p. 259 |
10.6 Closing Remarks | p. 263 |
Acknowledgments | p. 263 |
References | p. 263 |
11 A Survey of Association Rule Hiding Methods for Privacy | p. 267 |
11.1 Introduction | p. 267 |
11.2 Terminology and Preliminaries | p. 269 |
11.3 Taxonomy of Association Rule Hiding Algorithms | p. 270 |
11.4 Classes of Association Rule Algorithms | p. 271 |
11.4.1 Heuristic Approaches | p. 272 |
11.4.2 Border-based Approaches | p. 277 |
11.4.3 Exact Approaches | p. 278 |
11.5 Other Hiding Approaches | p. 279 |
11.6 Metrics and Performance Analysis | p. 281 |
11.7 Discussion and Future Trends | p. 284 |
11.8 Conclusions | p. 285 |
References | p. 286 |
12 A Survey of Statistical Approaches to Preserving Confidentiality of Contingency Table Entries | p. 291 |
12.1 Introduction | p. 291 |
12.2 The Statistical Approach to Privacy Protection | p. 292 |
12.3 Data Mining Algorithms, Association Rules, and Disclosure Limitation | p. 294 |
12.4 Estimation and Disclosure Limitation for Multi-way Contingency Tables | p. 295 |
12.5 Two Illustrative Examples | p. 301 |
12.5.1 Example 1: Data from a Randomized Clinical Trial | p. 301 |
12.5.2 Example 2: Data from the 1993 U.S. Current Population Survey | p. 305 |
12.6 Conclusions | p. 308 |
Acknowledgments | p. 309 |
References | p. 309 |
13 A Survey of Privacy-Preserving Methods Across Horizontally Partitioned Data | p. 313 |
13.1 Introduction | p. 313 |
13.2 Basic Cryptographic Techniques for Privacy-Preserving Distributed Data Mining | p. 315 |
13.3 Common Secure Sub-protocols Used in Privacy-Preserving Distributed Data Mining | p. 318 |
13.4 Privacy-preserving Distributed Data Mining on Horizontally Partitioned Data | p. 323 |
13.5 Comparison to Vertically Partitioned Data Model | p. 326 |
13.6 Extension to Malicious Parties | p. 327 |
13.7 Limitations of the Cryptographic Techniques Used in Privacy-Preserving Distributed Data Mining | p. 329 |
13.8 Privacy Issues Related to Data Mining Results | p. 330 |
13.9 Conclusion | p. 332 |
References | p. 332 |
14 A Survey of Privacy-Preserving Methods Across Vertically Partitioned Data | p. 337 |
14.1 Introduction | p. 337 |
14.2 Classification | p. 341 |
14.2.1 Naïve Bayes Classification | p. 342 |
14.2.2 Bayesian Network Structure Learning | p. 343 |
14.2.3 Decision Tree Classification | p. 344 |
14.3 Clustering | p. 346 |
14.4 Association Rule Mining | p. 347 |
14.5 Outlier Detection | p. 349 |
14.5.1 Algorithm | p. 351 |
14.5.2 Security Analysis | p. 352 |
14.5.3 Computation and Communication Analysis | p. 354 |
14.6 Challenges and Research Directions | p. 355 |
References | p. 356 |
15 A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods | p. 359 |
15.1 Introduction | p. 360 |
15.2 Definitions and Notation | p. 360 |
15.3 Attacking Additive Data Perturbation | p. 361 |
15.3.1 Eigen-Analysis and PCA Preliminaries | p. 362 |
15.3.2 Spectral Filtering | p. 363 |
15.3.3 SVD Filtering | p. 364 |
15.3.4 PCA Filtering | p. 365 |
15.3.5 MAP Estimation Attack | p. 366 |
15.3.6 Distribution Analysis Attack | p. 367 |
15.3.7 Summary | p. 367 |
15.4 Attacking Matrix Multiplicative Data Perturbation | p. 369 |
15.4.1 Known I/O Attacks | p. 370 |
15.4.2 Known Sample Attack | p. 373 |
15.4.3 Other Attacks Based on ICA | p. 374 |
15.4.4 Summary | p. 375 |
15.5 Attacking k-Anonymization | p. 376 |
15.6 Conclusion | p. 376 |
Acknowledgments | p. 377 |
References | p. 377 |
16 Private Data Analysis via Output Perturbation | p. 383 |
16.1 Introduction | p. 383 |
16.2 The Abstract Model - Statistical Databases, Queries, and Sanitizers | p. 385 |
16.3 Privacy | p. 388 |
16.3.1 Interpreting the Privacy Definition | p. 390 |
16.4 The Basic Technique: Calibrating Noise to Sensitivity | p. 394 |
16.4.1 Applications: Functions with Low Global Sensitivity | p. 396 |
16.5 Constructing Sanitizers for Complex Functionalities | p. 400 |
16.5.1 k-Means Clustering | p. 401 |
16.5.2 SVD and PCA | p. 403 |
16.5.3 Learning in the Statistical Queries Model | p. 404 |
16.6 Beyond the Basics | p. 405 |
16.6.1 Instance Based Noise and Smooth Sensitivity | p. 406 |
16.6.2 The Sample-Aggregate Framework | p. 408 |
16.6.3 A General Sanitization Mechanism | p. 409 |
16.7 Related Work and Bibliographic Notes | p. 409 |
Acknowledgments | p. 411 |
References | p. 411 |
17 A Survey of Query Auditing Techniques for Data Privacy | p. 415 |
17.1 Introduction | p. 415 |
17.2 Auditing Aggregate Queries | p. 416 |
17.2.1 Offline Auditing | p. 417 |
17.2.2 Online Auditing | p. 418 |
17.3 Auditing Select-Project-Join Queries | p. 426 |
17.4 Challenges in Auditing | p. 427 |
17.5 Reading | p. 429 |
References | p. 430 |
18 Privacy and the Dimensionality Curse | p. 433 |
18.1 Introduction | p. 433 |
18.2 The Dimensionality Curse and the k-anonymity Method | p. 435 |
18.3 The Dimensionality Curse and Condensation | p. 441 |
18.4 The Dimensionality Curse and the Randomization Method | p. 446 |
18.4.1 Effects of Public Information | p. 446 |
18.4.2 Effects of High Dimensionality | p. 450 |
18.4.3 Gaussian Perturbing Distribution | p. 450 |
18.4.4 Uniform Perturbing Distribution | p. 455 |
18.5 The Dimensionality Curse and l-diversity | p. 458 |
18.6 Conclusions and Research Directions | p. 459 |
References | p. 460 |
19 Personalized Privacy Preservation | p. 461 |
19.1 Introduction | p. 461 |
19.2 Formalization of Personalized Anonymity | p. 463 |
19.2.1 Personal Privacy Requirements | p. 464 |
19.2.2 Generalization | p. 465 |
19.3 Combinatorial Process of Privacy Attack | p. 467 |
19.3.1 Primary Case | p. 468 |
19.3.2 Non-primary Case | p. 469 |
19.4 Theoretical Foundation | p. 470 |
19.4.1 Notations and Basic Properties | p. 471 |
19.4.2 Derivation of the Breach Probability | p. 472 |
19.5 Generalization Algorithm | p. 473 |
19.5.1 The Greedy Framework | p. 474 |
19.5.2 Optimal SA-generalization | p. 476 |
19.6 Alternative Forms of Personalized Privacy Preservation | p. 478 |
19.6.1 Extension of k-anonymity | p. 479 |
19.6.2 Personalization in Location Privacy Protection | p. 480 |
19.7 Summary and Future Work | p. 482 |
References | p. 485 |
20 Privacy-Preserving Data Stream Classification | p. 487 |
20.1 Introduction | p. 487 |
20.1.1 Motivating Example | p. 488 |
20.1.2 Contributions and Paper Outline | p. 490 |
20.2 Related Works | p. 491 |
20.3 Problem Statement | p. 493 |
20.3.1 Secure Join Stream Classification | p. 493 |
20.3.2 Naive Bayesian Classifiers | p. 494 |
20.4 Our Approach | p. 495 |
20.4.1 Initialization | p. 495 |
20.4.2 Bottom-Up Propagation | p. 496 |
20.4.3 Top-Down Propagation | p. 497 |
20.4.4 Using NBC | p. 499 |
20.4.5 Algorithm Analysis | p. 500 |
20.5 Empirical Studies | p. 501 |
20.5.1 Real-life Datasets | p. 502 |
20.5.2 Synthetic Datasets | p. 504 |
20.5.3 Discussion | p. 506 |
20.6 Conclusions | p. 507 |
References | p. 508 |
Index | p. 511 |