Availability

| Library | Item Barcode | Call Number | Material Type | Item Category 1 | Status |
|---|---|---|---|---|---|
| | 30000010328875 | QA76.9.D32 B47 2013 | Open Access Book | Book | |
| | 33000000010179 | QA76.9.D32 B47 2013 | Open Access Book | Book | |
Summary
Principles of Big Data helps readers avoid the common mistakes that endanger all Big Data projects. By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes of complex data, and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are stressed throughout the book. The book demonstrates how adept analysts can find relationships among data objects held in disparate Big Data resources, when the data objects are endowed with semantic support (i.e., organized in classes of uniquely identified data objects). Readers will learn how their data can be integrated with data from other resources, and how the data extracted from Big Data resources can be used for purposes beyond those imagined by the data creators.
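The summary's central idea of "semantic support" (data objects organized into classes and given permanent, unique identifiers, so that objects from disparate resources can later be related and merged) can be illustrated with a minimal sketch. This is not code from the book; the `DataObject` class, its attribute names, and the example values are all hypothetical, chosen only to show the pattern.

```python
import uuid

class DataObject:
    """A hypothetical 'semantically supported' data object: a permanent
    unique identifier bound to a class membership and its data values."""
    def __init__(self, cls, **values):
        self.uid = str(uuid.uuid4())   # permanent, globally unique identifier
        self.cls = cls                 # class membership supplies the semantics
        self.values = values           # the object's actual data content

# Objects created in disparate resources remain distinguishable by identifier
# and relatable by class, regardless of where or when they originated.
specimen = DataObject("TissueSample", site="lung", stain="H&E")
record = DataObject("PatientRecord", age=54)

assert specimen.uid != record.uid          # identifiers never collide
assert specimen.cls == "TissueSample"      # class tells us what the object is
```

Because the identifier never changes even when the values do, later measurements can be appended as new objects that reference the old identifier, which is one way to keep data permanent while its content evolves.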
Author Notes
Jules Berman holds two bachelor of science degrees from MIT (Mathematics, and Earth and Planetary Sciences), a PhD from Temple University, and an MD from the University of Miami. He was a graduate researcher at the Fels Cancer Research Institute at Temple University and at the American Health Foundation in Valhalla, New York. His postdoctoral studies were completed at the U.S. National Institutes of Health, and his residency at the George Washington University Medical Center in Washington, D.C. Dr. Berman served as Chief of Anatomic Pathology, Surgical Pathology, and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he transferred to the U.S. National Institutes of Health as a Medical Officer and as the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past President of the Association for Pathology Informatics and the 2011 recipient of the association's Lifetime Achievement Award. He is a listed author on over 200 scientific publications and has written more than a dozen books in his three areas of expertise: informatics, computer programming, and cancer biology. Dr. Berman is currently a freelance writer.
Table of Contents
Acknowledgments | p. xi |
Author Biography | p. xiii |
Preface | p. xv |
Introduction | p. xix |
1 Providing Structure to Unstructured Data | |
Background | p. 1 |
Machine Translation | p. 2 |
Autocoding | p. 4 |
Indexing | p. 9 |
Term Extraction | p. 11 |
2 Identification, Deidentification, and Reidentification | |
Background | p. 15 |
Features of an Identifier System | p. 17 |
Registered Unique Object Identifiers | p. 18 |
Really Bad Identifier Methods | p. 22 |
Embedding Information in an Identifier: Not Recommended | p. 24 |
One-Way Hashes | p. 25 |
Use Case: Hospital Registration | p. 26 |
Deidentification | p. 28 |
Data Scrubbing | p. 30 |
Reidentification | p. 31 |
Lessons Learned | p. 32 |
3 Ontologies and Semantics | |
Background | p. 35 |
Classifications, the Simplest of Ontologies | p. 36 |
Ontologies, Classes with Multiple Parents | p. 39 |
Choosing a Class Model | p. 40 |
Introduction to Resource Description Framework Schema | p. 44 |
Common Pitfalls in Ontology Development | p. 46 |
4 Introspection | |
Background | p. 49 |
Knowledge of Self | p. 50 |
eXtensible Markup Language | p. 52 |
Introduction to Meaning | p. 54 |
Namespaces and the Aggregation of Meaningful Assertions | p. 55 |
Resource Description Framework Triples | p. 56 |
Reflection | p. 59 |
Use Case: Trusted Time Stamp | p. 59 |
Summary | p. 60 |
5 Data Integration and Software Interoperability | |
Background | p. 63 |
The Committee to Survey Standards | p. 64 |
Standard Trajectory | p. 65 |
Specifications and Standards | p. 69 |
Versioning | p. 71 |
Compliance Issues | p. 73 |
Interfaces to Big Data Resources | p. 74 |
6 Immutability and Immortality | |
Background | p. 77 |
Immutability and Identifiers | p. 78 |
Data Objects | p. 80 |
Legacy Data | p. 82 |
Data Born from Data | p. 83 |
Reconciling Identifiers across Institutions | p. 84 |
Zero-Knowledge Reconciliation | p. 86 |
The Curator's Burden | p. 87 |
7 Measurement | |
Background | p. 89 |
Counting | p. 90 |
Gene Counting | p. 93 |
Dealing with Negations | p. 93 |
Understanding Your Control | p. 95 |
Practical Significance of Measurements | p. 96 |
Obsessive-Compulsive Disorder: The Mark of a Great Data Manager | p. 97 |
8 Simple but Powerful Big Data Techniques | |
Background | p. 99 |
Look at the Data | p. 100 |
Data Range | p. 110 |
Denominator | p. 112 |
Frequency Distributions | p. 115 |
Mean and Standard Deviation | p. 119 |
Estimation-Only Analyses | p. 122 |
Use Case: Watching Data Trends with Google Ngrams | p. 123 |
Use Case: Estimating Movie Preferences | p. 126 |
9 Analysis | |
Background | p. 129 |
Analytic Tasks | p. 130 |
Clustering, Classifying, Recommending, and Modeling | p. 130 |
Data Reduction | p. 134 |
Normalizing and Adjusting Data | p. 137 |
Big Data Software: Speed and Scalability | p. 139 |
Find Relationships, Not Similarities | p. 141 |
10 Special Considerations in Big Data Analysis | |
Background | p. 145 |
Theory in Search of Data | p. 146 |
Data in Search of a Theory | p. 146 |
Overfitting | p. 148 |
Bigness Bias | p. 148 |
Too Much Data | p. 151 |
Fixing Data | p. 152 |
Data Subsets in Big Data: Neither Additive nor Transitive | p. 153 |
Additional Big Data Pitfalls | p. 154 |
11 Stepwise Approach to Big Data Analysis | |
Background | p. 157 |
Step 1. A Question Is Formulated | p. 158 |
Step 2. Resource Evaluation | p. 158 |
Step 3. A Question Is Reformulated | p. 159 |
Step 4. Query Output Adequacy | p. 160 |
Step 5. Data Description | p. 161 |
Step 6. Data Reduction | p. 161 |
Step 7. Algorithms Are Selected, If Absolutely Necessary | p. 162 |
Step 8. Results Are Reviewed and Conclusions Are Asserted | p. 164 |
Step 9. Conclusions Are Examined and Subjected to Validation | p. 164 |
12 Failure | |
Background | p. 167 |
Failure Is Common | p. 168 |
Failed Standards | p. 169 |
Complexity | p. 172 |
When Does Complexity Help? | p. 173 |
When Redundancy Fails | p. 174 |
Save Money; Don't Protect Harmless Information | p. 176 |
After Failure | p. 177 |
Use Case: Cancer Biomedical Informatics Grid, a Bridge Too Far | p. 178 |
13 Legalities | |
Background | p. 183 |
Responsibility for the Accuracy and Legitimacy of Contained Data | p. 184 |
Rights to Create, Use, and Share the Resource | p. 185 |
Copyright and Patent Infringements Incurred by Using Standards | p. 187 |
Protections for Individuals | p. 188 |
Consent | p. 190 |
Unconsented Data | p. 194 |
Good Policies Are a Good Policy | p. 197 |
Use Case: The Havasupai Story | p. 198 |
14 Societal Issues | |
Background | p. 201 |
How Big Data Is Perceived | p. 201 |
The Necessity of Data Sharing, Even When It Seems Irrelevant | p. 204 |
Reducing Costs and Increasing Productivity with Big Data | p. 208 |
Public Mistrust | p. 210 |
Saving Us from Ourselves | p. 211 |
Hubris and Hyperbole | p. 213 |
15 The Future | |
Background | p. 217 |
Last Words | p. 226 |
Glossary | p. 229 |
References | p. 247 |
Index | p. 257 |