Availability

| Library | Item Barcode | Call Number | Material Type | Item Category 1 | Status |
|---|---|---|---|---|---|
| | 30000010328875 | QA76.9.D32 B47 2013 | Open Access Book | Book | |
| | 33000000010179 | QA76.9.D32 B47 2013 | Open Access Book | Book | |
Summary
Principles of Big Data helps readers avoid the common mistakes that endanger all Big Data projects. By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes of complex data, and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are stressed throughout the book. The book demonstrates how adept analysts can find relationships among data objects held in disparate Big Data resources, when the data objects are endowed with semantic support (i.e., organized in classes of uniquely identified data objects). Readers will learn how their data can be integrated with data from other resources, and how the data extracted from Big Data resources can be used for purposes beyond those imagined by the data creators.
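The summary's central idea of "semantic support" (data objects organized into classes and given permanent, unique identifiers, so that objects from disparate resources can later be related and merged) can be illustrated with a minimal sketch. This is not code from the book; the `DataObject` class, its attribute names, and the example values are all hypothetical, chosen only to show the pattern.

```python
import uuid

class DataObject:
    """A hypothetical 'semantically supported' data object: a permanent
    unique identifier bound to a class membership and its data values."""
    def __init__(self, cls, **values):
        self.uid = str(uuid.uuid4())   # permanent, globally unique identifier
        self.cls = cls                 # class membership supplies the semantics
        self.values = values           # the object's actual data content

# Objects created in disparate resources remain distinguishable by identifier
# and relatable by class, regardless of where or when they originated.
specimen = DataObject("TissueSample", site="lung", stain="H&E")
record = DataObject("PatientRecord", age=54)

assert specimen.uid != record.uid          # identifiers never collide
assert specimen.cls == "TissueSample"      # class tells us what the object is
```

Because the identifier never changes even when the values do, later measurements can be appended as new objects that reference the old identifier, which is one way to keep data permanent while its content evolves.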
Author Notes
Jules Berman holds two bachelor of science degrees from MIT (Mathematics, and Earth and Planetary Sciences), a PhD from Temple University, and an MD from the University of Miami. He was a graduate researcher at the Fels Cancer Research Institute at Temple University and at the American Health Foundation in Valhalla, New York. His postdoctoral studies were completed at the U.S. National Institutes of Health, and his residency at the George Washington University Medical Center in Washington, D.C. Dr. Berman served as Chief of Anatomic Pathology, Surgical Pathology, and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he transferred to the U.S. National Institutes of Health as a Medical Officer and as the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past President of the Association for Pathology Informatics and the 2011 recipient of the association's Lifetime Achievement Award. He is a listed author on over 200 scientific publications and has written more than a dozen books in his three areas of expertise: informatics, computer programming, and cancer biology. Dr. Berman is currently a freelance writer.
Table of Contents
Acknowledgments | p. xi |
Author Biography | p. xiii |
Preface | p. xv |
Introduction | p. xix |
1 Providing Structure to Unstructured Data | |
Background | p. 1 |
Machine Translation | p. 2 |
Autocoding | p. 4 |
Indexing | p. 9 |
Term Extraction | p. 11 |
2 Identification, Deidentification, and Reidentification | |
Background | p. 15 |
Features of an Identifier System | p. 17 |
Registered Unique Object Identifiers | p. 18 |
Really Bad Identifier Methods | p. 22 |
Embedding Information in an Identifier: Not Recommended | p. 24 |
One-Way Hashes | p. 25 |
Use Case: Hospital Registration | p. 26 |
Deidentification | p. 28 |
Data Scrubbing | p. 30 |
Reidentification | p. 31 |
Lessons Learned | p. 32 |
3 Ontologies and Semantics | |
Background | p. 35 |
Classifications, the Simplest of Ontologies | p. 36 |
Ontologies, Classes with Multiple Parents | p. 39 |
Choosing a Class Model | p. 40 |
Introduction to Resource Description Framework Schema | p. 44 |
Common Pitfalls in Ontology Development | p. 46 |
4 Introspection | |
Background | p. 49 |
Knowledge of Self | p. 50 |
eXtensible Markup Language | p. 52 |
Introduction to Meaning | p. 54 |
Namespaces and the Aggregation of Meaningful Assertions | p. 55 |
Resource Description Framework Triples | p. 56 |
Reflection | p. 59 |
Use Case: Trusted Time Stamp | p. 59 |
Summary | p. 60 |
5 Data Integration and Software Interoperability | |
Background | p. 63 |
The Committee to Survey Standards | p. 64 |
Standard Trajectory | p. 65 |
Specifications and Standards | p. 69 |
Versioning | p. 71 |
Compliance Issues | p. 73 |
Interfaces to Big Data Resources | p. 74 |
6 Immutability and Immortality | |
Background | p. 77 |
Immutability and Identifiers | p. 78 |
Data Objects | p. 80 |
Legacy Data | p. 82 |
Data Born from Data | p. 83 |
Reconciling Identifiers across Institutions | p. 84 |
Zero-Knowledge Reconciliation | p. 86 |
The Curator's Burden | p. 87 |
7 Measurement | |
Background | p. 89 |
Counting | p. 90 |
Gene Counting | p. 93 |
Dealing with Negations | p. 93 |
Understanding Your Control | p. 95 |
Practical Significance of Measurements | p. 96 |
Obsessive-Compulsive Disorder: The Mark of a Great Data Manager | p. 97 |
8 Simple but Powerful Big Data Techniques | |
Background | p. 99 |
Look at the Data | p. 100 |
Data Range | p. 110 |
Denominator | p. 112 |
Frequency Distributions | p. 115 |
Mean and Standard Deviation | p. 119 |
Estimation-Only Analyses | p. 122 |
Use Case: Watching Data Trends with Google Ngrams | p. 123 |
Use Case: Estimating Movie Preferences | p. 126 |
9 Analysis | |
Background | p. 129 |
Analytic Tasks | p. 130 |
Clustering, Classifying, Recommending, and Modeling | p. 130 |
Data Reduction | p. 134 |
Normalizing and Adjusting Data | p. 137 |
Big Data Software: Speed and Scalability | p. 139 |
Find Relationships, Not Similarities | p. 141 |
10 Special Considerations in Big Data Analysis | |
Background | p. 145 |
Theory in Search of Data | p. 146 |
Data in Search of a Theory | p. 146 |
Overfitting | p. 148 |
Bigness Bias | p. 148 |
Too Much Data | p. 151 |
Fixing Data | p. 152 |
Data Subsets in Big Data: Neither Additive nor Transitive | p. 153 |
Additional Big Data Pitfalls | p. 154 |
11 Stepwise Approach to Big Data Analysis | |
Background | p. 157 |
Step 1. A Question Is Formulated | p. 158 |
Step 2. Resource Evaluation | p. 158 |
Step 3. A Question Is Reformulated | p. 159 |
Step 4. Query Output Adequacy | p. 160 |
Step 5. Data Description | p. 161 |
Step 6. Data Reduction | p. 161 |
Step 7. Algorithms Are Selected, If Absolutely Necessary | p. 162 |
Step 8. Results Are Reviewed and Conclusions Are Asserted | p. 164 |
Step 9. Conclusions Are Examined and Subjected to Validation | p. 164 |
12 Failure | |
Background | p. 167 |
Failure Is Common | p. 168 |
Failed Standards | p. 169 |
Complexity | p. 172 |
When Does Complexity Help? | p. 173 |
When Redundancy Fails | p. 174 |
Save Money; Don't Protect Harmless Information | p. 176 |
After Failure | p. 177 |
Use Case: Cancer Biomedical Informatics Grid, a Bridge Too Far | p. 178 |
13 Legalities | |
Background | p. 183 |
Responsibility for the Accuracy and Legitimacy of Contained Data | p. 184 |
Rights to Create, Use, and Share the Resource | p. 185 |
Copyright and Patent Infringements Incurred by Using Standards | p. 187 |
Protections for Individuals | p. 188 |
Consent | p. 190 |
Unconsented Data | p. 194 |
Good Policies Are a Good Policy | p. 197 |
Use Case: The Havasupai Story | p. 198 |
14 Societal Issues | |
Background | p. 201 |
How Big Data Is Perceived | p. 201 |
The Necessity of Data Sharing, Even When It Seems Irrelevant | p. 204 |
Reducing Costs and Increasing Productivity with Big Data | p. 208 |
Public Mistrust | p. 210 |
Saving Us from Ourselves | p. 211 |
Hubris and Hyperbole | p. 213 |
15 The Future | |
Background | p. 217 |
Last Words | p. 226 |
Glossary | p. 229 |
References | p. 247 |
Index | p. 257 |