Correspondence Analysis and Data Coding with Java and R

Developed by Jean-Paul Benzécri more than 30 years ago, correspondence analysis as a framework for analyzing data quickly found widespread popularity in Europe. The topicality and importance of correspondence analysis continue, and, with the tremendous computing power now available and new fields of application emerging, its significance is greater than ever.

Correspondence Analysis and Data Coding with Java and R clearly demonstrates why this technique remains important and, in the eyes of many, unsurpassed as an analysis framework. After some historical background, the author presents a theoretical overview of the mathematics and underlying algorithms of correspondence analysis and hierarchical clustering. The focus then shifts to data coding, with a survey of the widely varied possibilities correspondence analysis offers and an introduction to the Java software for correspondence analysis, clustering, and interpretation tools. A chapter of case studies follows, in which the author explores applications to areas such as shape analysis and time-evolving data. The final chapter reviews the wealth of studies on textual content as well as textual form carried out by Benzécri and his research lab. These discussions show the importance of correspondence analysis to artificial intelligence as well as to stylometry and other fields.

This book not only shows why correspondence analysis is important but, with a clear presentation replete with advice and guidance, also shows how to put this technique into practice. Downloadable software and data sets allow quick, hands-on exploration of innovative correspondence analysis applications.

Author Notes

Murtagh, Fionn

Table of Contents

1 Introduction
1.1 Data Analysis
1.2 Notes on the History of Data Analysis
1.2.1 Biometry
1.2.2 Era Piscatoria
1.2.3 Psychometrics
1.2.4 Analysis of Proximities
1.2.5 Genesis of Correspondence Analysis
1.3 Correspondence Analysis or Principal Components Analysis
1.3.1 Similarities of These Two Algorithms
1.3.2 Introduction to Principal Components Analysis
1.3.3 An Illustrative Example
1.3.4 Principal Components Analysis of Globular Clusters
1.3.5 Correspondence Analysis of Globular Clusters
1.4 R Software for Correspondence Analysis and Clustering
1.4.1 Fuzzy or Piecewise Linear Coding
1.4.2 Utility for Plotting Axes
1.4.3 Correspondence Analysis Program
1.4.4 Running the Analysis and Displaying Results
1.4.5 Hierarchical Clustering
1.4.6 Handling Large Data Sets
2 Theory of Correspondence Analysis
2.1 Vectors and Projections
2.2 Factors
2.2.1 Review of Metric Spaces
2.2.2 Clouds of Points, Masses, and Inertia
2.2.3 Notation for Factors
2.2.4 Properties of Factors
2.2.5 Properties of Factors: Tensor Notation
2.3 Transform
2.3.1 Forward Transform
2.3.2 Inverse Transform
2.3.3 Decomposition of Inertia
2.3.4 Relative and Absolute Contributions
2.3.5 Reduction of Dimensionality
2.3.6 Interpretation of Results
2.3.7 Analysis of the Dual Spaces
2.3.8 Supplementary Elements
2.4 Algebraic Perspective
2.4.1 Processing
2.4.2 Motivation
2.4.3 Operations
2.4.4 Axes and Factors
2.4.5 Multiple Correspondence Analysis
2.4.6 Summary of Correspondence Analysis Properties
2.5 Clustering
2.5.1 Hierarchical Agglomerative Clustering
2.5.2 Minimum Variance Agglomerative Criterion
2.5.3 Lance-Williams Dissimilarity Update Formula
2.5.4 Reciprocal Nearest Neighbors and Reducibility
2.5.5 Nearest-Neighbor Chain Algorithm
2.5.6 Minimum Variance Method in Perspective
2.5.7 Minimum Variance Method: Mathematical Properties
2.5.8 Simultaneous Analysis of Factors and Clusters
2.6 Questions
2.7 Further R Software for Correspondence Analysis
2.7.1 Supplementary Elements
2.7.2 FACOR: Interpretation of Factors and Clusters
2.7.3 VACOR: Interpretation of Variables and Clusters
2.7.4 Hierarchical Clustering in C, Callable from R
2.8 Summary
3 Input Data Coding
3.1 Introduction
3.1.1 The Fundamental Role of Coding
3.1.2 "Semantic Embedding"
3.1.3 Input Data Encodings
3.1.4 Input Data Analyzed Without Transformation
3.2 From Doubling to Fuzzy Coding and Beyond
3.2.1 Doubling
3.2.2 Complete Disjunctive Form
3.2.3 Fuzzy, Piecewise Linear or Barycentric Coding
3.2.4 General Discussion of Data Coding
3.2.5 From Fuzzy Coding to Possibility Theory
3.3 Assessment of Coding Methods
3.4 The Personal Equation and Double Rescaling
3.5 Case Study: DNA Exon and Intron Junction Discrimination
3.6 Conclusions on Coding
3.7 Java Software
3.7.1 Running the Java Software
4 Examples and Case Studies
4.1 Introduction to Analysis of Size and Shape
4.1.1 Morphometry of Prehistoric Thai Goblets
4.1.2 Software Used
4.2 Comparison of Prehistoric and Modern Groups of Canids
4.2.1 Software Used
4.3 Craniometric Data from Ancient Egyptian Tombs
4.3.1 Software Used
4.4 Time-Varying Data Analysis: Examples from Economics
4.4.1 Imports and Exports of Phosphates
4.4.2 Services and Other Sectors in Economic Growth
4.5 Financial Modeling and Forecasting
4.5.1 Introduction
4.5.2 Brownian Motion
4.5.3 Granularity of Coding
4.5.4 Fingerprinting the Price Movements
4.5.5 Conclusions
5 Content Analysis of Text
5.1 Introduction
5.1.1 Accessing Content
5.1.2 The Work of J.-P. Benzécri
5.1.3 Objectives and Some Findings
5.1.4 Outline of the Chapter
5.2 Correspondence Analysis
5.2.1 Analyzing Data
5.2.2 Textual Data Preprocessing
5.3 Tool Words: Between Analysis of Form and Analysis of Content
5.3.1 Tool Words versus Full Words
5.3.2 Tool Words in Various Languages
5.3.3 Tool Words versus Metalanguages or Ontologies
5.3.4 Refinement of Tool Words
5.3.5 Tool Words in Survey Analysis
5.3.6 The Text Aggregates Studied
5.4 Towards Content Analysis
5.4.1 Intra-Document Analysis of Content
5.4.2 Comparative Semantics: Diagnosis versus Prognosis
5.4.3 Semantics of Connotation and Denotation
5.4.4 Discipline-Based Theme Analysis
5.4.5 Mapping Cognitive Processes
5.4.6 History and Evolution of Ideas
5.4.7 Doctrinal Content and Stylistic Expression
5.4.8 Interpreting Antinomies Through Cluster Branchings
5.4.9 The Hypotheses of Plato on The One
5.5 Textual and Documentary Typology
5.5.1 Assessing Authorship
5.5.2 Further Studies with Tool Words and Miscellaneous Approaches
5.6 Conclusion: Methodology in Free Text Analysis
5.7 Software for Text Processing
5.8 Introduction to the Text Analysis Case Studies
5.9 Eight Hypotheses of Parmenides Regarding the One
5.10 Comparative Study of Reality, Fable and Dream
5.10.1 Aviation Accidents
5.10.2 Dream Reports
5.10.3 Grimm Fairy Tales
5.10.4 Three Jane Austen Novels
5.10.5 Set of Texts
5.10.6 Tool Words
5.10.7 Domain Content Words
5.10.8 Analysis of Domains through Content-Oriented Words
5.11 Single Document Analysis
5.11.1 The Data: Aristotle's Categories
5.11.2 Structure of Presentation
5.11.3 Evolution of Presentation
5.12 Conclusion on Text Analysis Case Studies
6 Concluding Remarks
References
Index
