Title:
Data quality : concepts, methodologies and techniques
Personal Author:
Batini, Carlo
Series:
Data-centric systems and applications
Publication Information:
Berlin : Springer, 2006
ISBN:
9783540331728
Added Author:
Scannapieco, Monica

Available:

Item Barcode    Call Number          Material Type     Item Category 1
30000010119031  QA76.9.D3 B374 2006  Open Access Book  Book
30000010155763  QA76.9.D3 B374 2006  Open Access Book  Book

Summary

Poor data quality can seriously hinder or damage the efficiency and effectiveness of organizations and businesses. Growing awareness of these repercussions has led to major public initiatives such as the "Data Quality Act" in the USA and Directive 2003/98/EC of the European Parliament.

Batini and Scannapieco present a comprehensive and systematic introduction to the wide range of issues related to data quality. They start with a detailed description of data quality dimensions, such as accuracy, completeness, and consistency, and their importance in different types of data, such as federated data, web data, and time-dependent data, and in different data categories classified by frequency of change, such as stable, long-term, and frequently changing data. The book's extensive description of techniques and methodologies from core data quality research, as well as from related fields such as data mining, probability theory, statistical data analysis, and machine learning, gives an excellent overview of the current state of the art. The presentation concludes with a short description and critical comparison of tools and practical methodologies, which will help readers resolve their own data quality problems.

This book is an ideal combination of the soundness of theoretical foundations and the applicability of practical approaches. It is ideally suited for everyone - researchers, students, or professionals - interested in a comprehensive overview of data quality issues. In addition, it will serve as the basis for an introductory course or for self-study on this topic.


Author Notes

Carlo Batini is a full professor of Computer Engineering at the University of Milano Bicocca. He became an associate professor in 1983 and a full professor in 1986. His research interests include cooperative information systems, information systems and database modeling and design, usability of information systems, and data and information quality. From 1995 to 2003 he was a member of the board of directors of the Authority for Information Technology in Public Administration, where he headed several large-scale projects for the modernization of public administration.

Monica Scannapieco is a research associate at the Computer Engineering Department of the University of Roma La Sapienza. Her research interests are data quality issues, including data quality dimensions, measurement and improvement techniques, dynamics of data quality, and record matching.


Table of Contents

1 Introduction to Data Quality
1.1 Why Data Quality is Relevant
1.2 Introduction to the Concept of Data Quality
1.3 Data Quality and Types of Data
1.4 Data Quality and Types of Information Systems
1.5 Main Research Issues and Application Domains in Data Quality
1.5.1 Research Issues in Data Quality
1.5.2 Application Domains in Data Quality
1.5.3 Research Areas Related to Data Quality
1.6 Summary
2 Data Quality Dimensions
2.1 Accuracy
2.2 Completeness
2.2.1 Completeness of Relational Data
2.2.2 Completeness of Web Data
2.3 Time-Related Dimensions: Currency, Timeliness, and Volatility
2.4 Consistency
2.4.1 Integrity Constraints
2.4.2 Data Edits
2.5 Other Data Quality Dimensions
2.5.1 Accessibility
2.5.2 Quality of Information Sources
2.6 Approaches to the Definition of Data Quality Dimensions
2.6.1 Theoretical Approach
2.6.2 Empirical Approach
2.6.3 Intuitive Approach
2.6.4 A Comparative Analysis of the Dimension Definitions
2.6.5 Trade-offs Between Dimensions
2.7 Schema Quality Dimensions
2.7.1 Readability
2.7.2 Normalization
2.8 Summary
3 Models for Data Quality
3.1 Introduction
3.2 Extensions of Structured Data Models
3.2.1 Conceptual Models
3.2.2 Logical Models for Data Description
3.2.3 The Polygen Model for Data Manipulation
3.2.4 Data Provenance
3.3 Extensions of Semistructured Data Models
3.4 Management Information System Models
3.4.1 Models for Process Description: the IP-MAP model
3.4.2 Extensions of IP-MAP
3.4.3 Data Models
3.5 Summary
4 Activities and Techniques for Data Quality: Generalities
4.1 Data Quality Activities
4.2 Quality Composition
4.2.1 Models and Assumptions
4.2.2 Dimensions
4.2.3 Accuracy
4.2.4 Completeness
4.3 Error Localization and Correction
4.3.1 Localize and Correct Inconsistencies
4.3.2 Incomplete Data
4.3.3 Discovering Outliers
4.4 Cost and Benefit Classifications
4.4.1 Cost Classifications
4.4.2 Benefits Classification
4.5 Summary
5 Object Identification
5.1 Historical Perspective
5.2 Object Identification for Different Data Types
5.3 The High-Level Process for Object Identification
5.4 Details on the Steps for Object Identification
5.4.1 Preprocessing
5.4.2 Search Space Reduction
5.4.3 Comparison Functions
5.5 Object Identification Techniques
5.6 Probabilistic Techniques
5.6.1 The Fellegi and Sunter Theory and Extensions
5.6.2 A Cost-Based Probabilistic Technique
5.7 Empirical Techniques
5.7.1 Sorted Neighborhood Method and Extensions
5.7.2 The Priority Queue Algorithm
5.7.3 A Technique for Complex Structured Data: Delphi
5.7.4 XML Duplicate Detection: DogmatiX
5.7.5 Other Empirical Methods
5.8 Knowledge-Based Techniques
5.8.1 A Rule-Based Approach: Intelliclean
5.8.2 Learning Methods for Decision Rules: Atlas
5.9 Comparison of Techniques
5.9.1 Metrics
5.9.2 Search Space Reduction Methods
5.9.3 Comparison Functions
5.9.4 Decision Methods
5.9.5 Results
5.10 Summary
6 Data Quality Issues in Data Integration Systems
6.1 Introduction
6.2 Generalities on Data Integration Systems
6.2.1 Query Processing
6.3 Techniques for Quality-Driven Query Processing
6.3.1 The QP-alg: Quality-Driven Query Planning
6.3.2 DaQuinCIS Query Processing
6.3.3 Fusionplex Query Processing
6.3.4 Comparison of Quality-Driven Query Processing Techniques
6.4 Instance-level Conflict Resolution
6.4.1 Classification of Instance-Level Conflicts
6.4.2 Overview of Techniques
6.4.3 Comparison of Instance-level Conflict Resolution Techniques
6.5 Inconsistencies in Data Integration: a Theoretical Perspective
6.5.1 A Formal Framework for Data Integration
6.5.2 The Problem of Inconsistency
6.6 Summary
7 Methodologies for Data Quality Measurement and Improvement
7.1 Basics on Data Quality Methodologies
7.1.1 Inputs and Outputs
7.1.2 Classification of Methodologies
7.1.3 Comparison among Data-driven and Process-driven Strategies
7.2 Assessment Methodologies
7.3 Comparative Analysis of General-purpose Methodologies
7.3.1 Basic Common Phases Among Methodologies
7.3.2 The TDQM Methodology
7.3.3 The TQdM Methodology
7.3.4 The Istat Methodology
7.3.5 Comparisons of Methodologies
7.4 The CDQM methodology
7.4.1 Reconstruct the State of Data
7.4.2 Reconstruct Business Processes
7.4.3 Reconstruct Macroprocesses and Rules
7.4.4 Check Problems with Users
7.4.5 Measure Data Quality
7.4.6 Set New Target DQ Levels
7.4.7 Choose Improvement Activities
7.4.8 Choose Techniques for Data Activities
7.4.9 Find Improvement Processes
7.4.10 Choose the Optimal Improvement Process
7.5 A Case Study in the e-Government Area
7.6 Summary
8 Tools for Data Quality
8.1 Introduction
8.2 Tools
8.2.1 Potter's Wheel
8.2.2 Telcordia's Tool
8.2.3 Ajax
8.2.4 Artkos
8.2.5 Choice Maker
8.3 Frameworks for Cooperative Information Systems
8.3.1 DaQuinCIS Framework
8.3.2 FusionPlex Framework
8.4 Toolboxes to Compare Tools
8.4.1 Theoretical Approach
8.4.2 Tailor
8.5 Summary
9 Open Problems
9.1 Dimensions and Metrics
9.2 Object Identification
9.2.1 XML Object Identification
9.2.2 Object Identification of Personal Information
9.2.3 Record Linkage and Privacy
9.3 Data Integration
9.3.1 Trust-Aware Query Processing in P2P Contexts
9.3.2 Cost-Driven Query Processing
9.4 Methodologies
9.5 Conclusions
References
Index