Title:
Principles of big data : preparing, sharing, and analyzing complex information
Personal Author:
Berman, Jules
Publication Information:
Amsterdam : Elsevier, 2013
Physical Description:
xxvi, 261 p. ; 25 cm
ISBN:
9780124045767

Available:

Library | Item Barcode | Call Number | Material Type | Item Category 1 | Status
- | 30000010328875 | QA76.9.D32 B47 2013 | Open Access Book | Book | -
- | 33000000010179 | QA76.9.D32 B47 2013 | Open Access Book | Book | -

On Order

Summary

Principles of Big Data helps readers avoid the common mistakes that endanger all Big Data projects. By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes of complex data, and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are stressed throughout the book. The book demonstrates how adept analysts can find relationships among data objects held in disparate Big Data resources, when the data objects are endowed with semantic support (i.e., organized in classes of uniquely identified data objects). Readers will learn how their data can be integrated with data from other resources, and how the data extracted from Big Data resources can be used for purposes beyond those imagined by the data creators.
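The summary's notion of "semantic support" can be made concrete: each data object carries a permanent unique identifier, belongs to a class, and every assertion about it is stored as a subject-property-value triple, so data from separate resources can be merged without losing meaning. The following Python sketch is illustrative only (it is not taken from the book); the resource names, classes, and properties are hypothetical.

```python
import uuid

def new_object(triples, cls, **properties):
    """Register a data object: give it a permanent unique identifier,
    assert its class membership, and record each property as a triple."""
    oid = str(uuid.uuid4())
    triples.append((oid, "is_instance_of", cls))
    for prop, value in properties.items():
        triples.append((oid, prop, value))
    return oid

# Two independent "resources" (hypothetical) can be merged by simple
# concatenation, because identifiers are globally unique and meaning
# travels with each triple rather than with a table layout.
resource_a, resource_b = [], []
patient = new_object(resource_a, "Patient", name="J. Doe")
sample = new_object(resource_b, "Specimen", taken_from=patient)

merged = resource_a + resource_b
# A relationship that spans the two resources survives the merge:
related = [(s, p, o) for (s, p, o) in merged if o == patient]
print(related)  # the specimen's "taken_from" triple points at the patient
```

Because the identifier, not the row position, carries an object's identity, a third resource could be appended the same way and queried with the same triple patterns.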


Author Notes

Jules Berman holds two bachelor of science degrees from MIT (Mathematics, and Earth and Planetary Sciences), a PhD from Temple University, and an MD from the University of Miami. He was a graduate researcher at the Fels Cancer Research Institute at Temple University and at the American Health Foundation in Valhalla, New York. His postdoctoral studies were completed at the U.S. National Institutes of Health, and his residency was completed at the George Washington University Medical Center in Washington, D.C. Dr. Berman served as Chief of Anatomic Pathology, Surgical Pathology, and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998 he transferred to the U.S. National Institutes of Health as a Medical Officer and as the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past president of the Association for Pathology Informatics and the 2011 recipient of the association's Lifetime Achievement Award. He is a listed author on over 200 scientific publications and has written more than a dozen books in his three areas of expertise: informatics, computer programming, and cancer biology. Dr. Berman is currently a freelance writer.


Table of Contents

Acknowledgments p. xi
Author Biography p. xiii
Preface p. xv
Introduction p. xix
1 Providing Structure to Unstructured Data
Background p. 1
Machine Translation p. 2
Autocoding p. 4
Indexing p. 9
Term Extraction p. 11
2 Identification, Deidentification, and Reidentification
Background p. 15
Features of an Identifier System p. 17
Registered Unique Object Identifiers p. 18
Really Bad Identifier Methods p. 22
Embedding Information in an Identifier: Not Recommended p. 24
One-Way Hashes p. 25
Use Case: Hospital Registration p. 26
Deidentification p. 28
Data Scrubbing p. 30
Reidentification p. 31
Lessons Learned p. 32
3 Ontologies and Semantics
Background p. 35
Classifications, the Simplest of Ontologies p. 36
Ontologies, Classes with Multiple Parents p. 39
Choosing a Class Model p. 40
Introduction to Resource Description Framework Schema p. 44
Common Pitfalls in Ontology Development p. 46
4 Introspection
Background p. 49
Knowledge of Self p. 50
eXtensible Markup Language p. 52
Introduction to Meaning p. 54
Namespaces and the Aggregation of Meaningful Assertions p. 55
Resource Description Framework Triples p. 56
Reflection p. 59
Use Case: Trusted Time Stamp p. 59
Summary p. 60
5 Data Integration and Software Interoperability
Background p. 63
The Committee to Survey Standards p. 64
Standard Trajectory p. 65
Specifications and Standards p. 69
Versioning p. 71
Compliance Issues p. 73
Interfaces to Big Data Resources p. 74
6 Immutability and Immortality
Background p. 77
Immutability and Identifiers p. 78
Data Objects p. 80
Legacy Data p. 82
Data Born from Data p. 83
Reconciling Identifiers across Institutions p. 84
Zero-Knowledge Reconciliation p. 86
The Curator's Burden p. 87
7 Measurement
Background p. 89
Counting p. 90
Gene Counting p. 93
Dealing with Negations p. 93
Understanding Your Control p. 95
Practical Significance of Measurements p. 96
Obsessive-Compulsive Disorder: The Mark of a Great Data Manager p. 97
8 Simple but Powerful Big Data Techniques
Background p. 99
Look at the Data p. 100
Data Range p. 110
Denominator p. 112
Frequency Distributions p. 115
Mean and Standard Deviation p. 119
Estimation-Only Analyses p. 122
Use Case: Watching Data Trends with Google Ngrams p. 123
Use Case: Estimating Movie Preferences p. 126
9 Analysis
Background p. 129
Analytic Tasks p. 130
Clustering, Classifying, Recommending, and Modeling p. 130
Data Reduction p. 134
Normalizing and Adjusting Data p. 137
Big Data Software: Speed and Scalability p. 139
Find Relationships, Not Similarities p. 141
10 Special Considerations in Big Data Analysis
Background p. 145
Theory in Search of Data p. 146
Data in Search of a Theory p. 146
Overfitting p. 148
Bigness Bias p. 148
Too Much Data p. 151
Fixing Data p. 152
Data Subsets in Big Data: Neither Additive nor Transitive p. 153
Additional Big Data Pitfalls p. 154
11 Stepwise Approach to Big Data Analysis
Background p. 157
Step 1. A Question Is Formulated p. 158
Step 2. Resource Evaluation p. 158
Step 3. A Question Is Reformulated p. 159
Step 4. Query Output Adequacy p. 160
Step 5. Data Description p. 161
Step 6. Data Reduction p. 161
Step 7. Algorithms Are Selected, If Absolutely Necessary p. 162
Step 8. Results Are Reviewed and Conclusions Are Asserted p. 164
Step 9. Conclusions Are Examined and Subjected to Validation p. 164
12 Failure
Background p. 167
Failure Is Common p. 168
Failed Standards p. 169
Complexity p. 172
When Does Complexity Help? p. 173
When Redundancy Fails p. 174
Save Money; Don't Protect Harmless Information p. 176
After Failure p. 177
Use Case: Cancer Biomedical Informatics Grid, a Bridge Too Far p. 178
13 Legalities
Background p. 183
Responsibility for the Accuracy and Legitimacy of Contained Data p. 184
Rights to Create, Use, and Share the Resource p. 185
Copyright and Patent Infringements Incurred by Using Standards p. 187
Protections for Individuals p. 188
Consent p. 190
Unconsented Data p. 194
Good Policies Are a Good Policy p. 197
Use Case: The Havasupai Story p. 198
14 Societal Issues
Background p. 201
How Big Data Is Perceived p. 201
The Necessity of Data Sharing, Even When It Seems Irrelevant p. 204
Reducing Costs and Increasing Productivity with Big Data p. 208
Public Mistrust p. 210
Saving Us from Ourselves p. 211
Hubris and Hyperbole p. 213
15 The Future
Background p. 217
Last Words p. 226
Glossary p. 229
References p. 247
Index p. 257