Title:
Data quality and record linkage techniques
Personal Author:
Herzog, Thomas N.
Publication Information:
New York, NY : Springer, 2007
ISBN:
9780387695020
General Note:
Available online version
Electronic Access:
Fulltext

Available:

Item Barcode:
30000010163048
Call Number:
QA76.9.E94 H47 2007
Material Type:
Open Access Book
Item Category 1:
Book

Summary

This book offers a practical understanding of the issues involved in improving data quality through editing, imputation, and record linkage. The first part of the book deals with methods and models, focusing on the Fellegi-Holt edit-imputation model, the Little-Rubin multiple-imputation scheme, and the Fellegi-Sunter record linkage model. The second part presents case studies in which these techniques are applied in a variety of areas, including mortgage guarantee insurance, medicine, biomedical research, highway safety, and social insurance, as well as in the construction of list frames and administrative lists. Throughout, the book combines practical advice, mathematical rigor, management insight, and philosophy.


Author Notes

Thomas N. Herzog, Ph.D., ASA, is the Chief Actuary at the U.S. Department of Housing and Urban Development.
Fritz J. Scheuren, Ph.D., is a Vice President for Statistics with the National Opinion Research Center at the University of Chicago.
William E. Winkler, Ph.D., is Principal Researcher at the U.S. Census Bureau.


Table of Contents

Preface
About the Authors
1 Introduction
1.1 Audience and Objective
1.2 Scope
1.3 Structure
Part 1 Data Quality: What It Is, Why It Is Important, and How to Achieve It
2 What Is Data Quality and Why Should We Care?
2.1 When Are Data of High Quality?
2.2 Why Care About Data Quality?
2.3 How Do You Obtain High-Quality Data?
2.4 Practical Tips
2.5 Where Are We Now?
3 Examples of Entities Using Data to their Advantage/Disadvantage
3.1 Data Quality as a Competitive Advantage
3.2 Data Quality Problems and their Consequences
3.3 How Many People Really Live to 100 and Beyond? Views from the United States, Canada, and the United Kingdom
3.4 Disabled Airplane Pilots - A Successful Application of Record Linkage
3.5 Completeness and Accuracy of a Billing Database: Why It Is Important to the Bottom Line
3.6 Where Are We Now?
4 Properties of Data Quality and Metrics for Measuring It
4.1 Desirable Properties of Databases/Lists
4.2 Examples of Merging Two or More Lists and the Issues that May Arise
4.3 Metrics Used when Merging Lists
4.4 Where Are We Now?
5 Basic Data Quality Tools
5.1 Data Elements
5.2 Requirements Document
5.3 A Dictionary of Tests
5.4 Deterministic Tests
5.5 Probabilistic Tests
5.6 Exploratory Data Analysis Techniques
5.7 Minimizing Processing Errors
5.8 Practical Tips
5.9 Where Are We Now?
Part 2 Specialized Tools for Database Improvement
6 Mathematical Preliminaries for Specialized Data Quality Techniques
6.1 Conditional Independence
6.2 Statistical Paradigms
6.3 Capture-Recapture Procedures and Applications
7 Automatic Editing and Imputation of Sample Survey Data
7.1 Introduction
7.2 Early Editing Efforts
7.3 Fellegi-Holt Model for Editing
7.4 Practical Tips
7.5 Imputation
7.6 Constructing a Unified Edit/Imputation Model
7.7 Implicit Edits - A Key Construct of Editing Software
7.8 Editing Software
7.9 Is Automatic Editing Taking Up Too Much Time and Money?
7.10 Selective Editing
7.11 Tips on Automatic Editing and Imputation
7.12 Where Are We Now?
8 Record Linkage - Methodology
8.1 Introduction
8.2 Why Did Analysts Begin Linking Records?
8.3 Deterministic Record Linkage
8.4 Probabilistic Record Linkage - A Frequentist Perspective
8.5 Probabilistic Record Linkage - A Bayesian Perspective
8.6 Where Are We Now?
9 Estimating the Parameters of the Fellegi-Sunter Record Linkage Model
9.1 Basic Estimation of Parameters Under Simple Agreement/Disagreement Patterns
9.2 Parameter Estimates Obtained via Frequency-Based Matching
9.3 Parameter Estimates Obtained Using Data from Current Files
9.4 Parameter Estimates Obtained via the EM Algorithm
9.5 Advantages and Disadvantages of Using the EM Algorithm to Estimate m- and u-probabilities
9.6 General Parameter Estimation Using the EM Algorithm
9.7 Where Are We Now?
10 Standardization and Parsing
10.1 Obtaining and Understanding Computer Files
10.2 Standardization of Terms
10.3 Parsing of Fields
10.4 Where Are We Now?
11 Phonetic Coding Systems for Names
11.1 Soundex System of Names
11.2 NYSIIS Phonetic Decoder
11.3 Where Are We Now?
12 Blocking
12.1 Independence of Blocking Strategies
12.2 Blocking Variables
12.3 Using Blocking Strategies to Identify Duplicate List Entries
12.4 Using Blocking Strategies to Match Records Between Two Sample Surveys
12.5 Estimating the Number of Matches Missed
12.6 Where Are We Now?
13 String Comparator Metrics for Typographical Error
13.1 Jaro String Comparator Metric for Typographical Error
13.2 Adjusting the Matching Weight for the Jaro String Comparator
13.3 Winkler String Comparator Metric for Typographical Error
13.4 Adjusting the Weights for the Winkler Comparator Metric
13.5 Where Are We Now?
Part 3 Record Linkage Case Studies
14 Duplicate FHA Single-Family Mortgage Records: A Case Study of Data Problems, Consequences, and Corrective Steps
14.1 Introduction
14.2 FHA Case Numbers on Single-Family Mortgages
14.3 Duplicate Mortgage Records
14.4 Mortgage Records with an Incorrect Termination Status
14.5 Estimating the Number of Duplicate Mortgage Records
15 Record Linkage Case Studies in the Medical, Biomedical, and Highway Safety Areas
15.1 Biomedical and Genetic Research Studies
15.2 Who Goes to a Chiropractor?
15.3 National Master Patient Index
15.4 Provider Access to Immunization Register Securely (PAiRS) System
15.5 Studies Required by the Intermodal Surface Transportation Efficiency Act of 1991
15.6 Crash Outcome Data Evaluation System
16 Constructing List Frames and Administrative Lists
16.1 National Address Register of Residences in Canada
16.2 USDA List Frame of Farms in the United States
16.3 List Frame Development for the US Census of Agriculture
16.4 Post-enumeration Studies of US Decennial Census
17 Social Security and Related Topics
17.1 Hidden Multiple Issuance of Social Security Numbers
17.2 How Social Security Stops Benefit Payments after Death
17.3 CPS-IRS-SSA Exact Match File
17.4 Record Linkage and Terrorism
Part 4 Other Topics
18 Confidentiality: Maximizing Access to Micro-data while Protecting Privacy
18.1 Importance of High Quality of Data in the Original File
18.2 Documenting Public-use Files
18.3 Checking Re-identifiability
18.4 Elementary Masking Methods and Statistical Agencies
18.5 Protecting Confidentiality of Medical Data
18.6 More-advanced Masking Methods - Synthetic Datasets
18.7 Where Are We Now?
19 Review of Record Linkage Software
19.1 Government
19.2 Commercial
19.3 Checklist for Evaluating Record Linkage Software
20 Summary Chapter
Bibliography
Index