Analyzing linguistic data: a practical introduction to statistics using R

Statistical analysis is a useful skill for linguists and psycholinguists, allowing them to understand the quantitative structure of their data. This textbook provides a straightforward introduction to the statistical analysis of language. Designed for linguists with a non-mathematical background, it clearly introduces the basic principles and methods of statistical analysis, using 'R', the leading computational statistics programme. The reader is guided step-by-step through a range of real data sets, allowing them to analyse acoustic data, construct grammatical trees for a variety of languages, quantify register variation in corpus linguistics, and measure experimental data using state-of-the-art models. The visualization of data plays a key role, both in the initial stages of data exploration and later on when the reader is encouraged to criticize various models. Containing over 40 exercises with model answers, this book will be welcomed by all linguists wishing to learn more about working with and presenting quantitative data.

Author Notes

R. H. Baayen is Professor of Quantitative Linguistics at the University of Alberta, Edmonton.

Preface	p. x
1 An introduction to R	p. 1
1.1 R as a calculator	p. 2
1.2 Getting data into and out of R	p. 4
1.3 Accessing information in data frames	p. 6
1.4 Operations on data frames	p. 10
1.4.1 Sorting a data frame by one or more columns	p. 10
1.4.2 Changing information in a data frame	p. 12
1.4.3 Extracting contingency tables from data frames	p. 13
1.4.4 Calculations on data frames	p. 15
1.5 Session management	p. 18
2 Graphical data exploration	p. 20
2.1 Random variables	p. 20
2.2 Visualizing single random variables	p. 21
2.3 Visualizing two or more variables	p. 32
2.4 Trellis graphics	p. 37
3 Probability distributions	p. 44
3.1 Distributions	p. 44
3.2 Discrete distributions	p. 44
3.3 Continuous distributions	p. 57
3.3.1 The normal distribution	p. 58
3.3.2 The t, F, and χ² distributions	p. 63
4 Basic statistical methods	p. 68
4.1 Tests for single vectors	p. 71
4.1.1 Distribution tests	p. 71
4.1.2 Tests for the mean	p. 75
4.2 Tests for two independent vectors	p. 77
4.2.1 Are the distributions the same?	p. 78
4.2.2 Are the means the same?	p. 79
4.2.3 Are the variances the same?	p. 81
4.3 Paired vectors	p. 82
4.3.1 Are the means or medians the same?	p. 82
4.3.2 Functional relations: linear regression	p. 84
4.3.3 What does the joint density look like?	p. 97
4.4 A numerical vector and a factor: analysis of variance	p. 101
4.4.1 Two numerical vectors and a factor: analysis of covariance	p. 108
4.5 Two vectors with counts	p. 111
4.6 A note on statistical significance	p. 114
5 Clustering and classification	p. 118
5.1 Clustering	p. 118
5.1.1 Tables with measurements: principal components analysis	p. 118
5.1.2 Tables with measurements: factor analysis	p. 126
5.1.3 Tables with counts: correspondence analysis	p. 128
5.1.4 Tables with distances: multidimensional scaling	p. 136
5.1.5 Tables with distances: hierarchical cluster analysis	p. 138
5.2 Classification	p. 148
5.2.1 Classification trees	p. 148
5.2.2 Discriminant analysis	p. 154
5.2.3 Support vector machines	p. 160
6 Regression modeling	p. 165
6.1 Introduction	p. 165
6.2 Ordinary least squares regression	p. 169
6.2.1 Nonlinearities	p. 174
6.2.2 Collinearity	p. 181
6.2.3 Model criticism	p. 188
6.2.4 Validation	p. 193
6.3 Generalized linear models	p. 195
6.3.1 Logistic regression	p. 195
6.3.2 Ordinal logistic regression	p. 208
6.4 Regression with breakpoints	p. 214
6.5 Models for lexical richness	p. 222
6.6 General considerations	p. 236
7 Mixed models	p. 241
7.1 Modeling data with fixed and random effects	p. 242
7.2 A comparison with traditional analyses	p. 259
7.2.1 Mixed-effects models and quasi-F	p. 260
7.2.2 Mixed-effects models and Latin Square designs	p. 266
7.2.3 Regression with subjects and items	p. 269
7.3 Shrinkage in mixed-effects models	p. 275
7.4 Generalized linear mixed models	p. 278
7.5 Case studies	p. 284
7.5.1 Primed lexical decision latencies for Dutch neologisms	p. 284
7.5.2 Self-paced reading latencies for Dutch neologisms	p. 287
7.5.3 Visual lexical decision latencies of Dutch eight-year-olds	p. 289
7.5.4 Mixed-effects models in corpus linguistics	p. 295
Appendix A Solutions to the exercises	p. 303
Appendix B Overview of R functions	p. 335
References	p. 342
Index	p. 347
Index of data sets	p. 347
Index of R	p. 347
Index of topics	p. 349
Index of authors	p. 352
