The statistical evaluation of medical tests for classification and prediction

This book describes statistical techniques for the design and evaluation of research studies on medical diagnostic tests, screening tests, biomarkers, and new technologies for classification and prediction in medicine. Grounded in mathematical theory, it includes worked examples along with data and code, so that readers can implement the methods readily.

Author Notes

Margaret Sullivan Pepe is a Professor of Biostatistics at the University of Washington and the Fred Hutchinson Cancer Research Center, Washington, USA.

Notation	p. xv
1 Introduction	p. 1
1.1 The medical test	p. 1
1.1.1 Tests, classification and the broader context	p. 1
1.1.2 Disease screening versus diagnosis	p. 2
1.1.3 Criteria for a useful medical test	p. 2
1.2 Elements of study design	p. 3
1.2.1 Scale for the test result	p. 4
1.2.2 Selection of study subjects	p. 4
1.2.3 Comparing tests	p. 5
1.2.4 Test integrity	p. 5
1.2.5 Sources of bias	p. 6
1.3 Examples and datasets	p. 8
1.3.1 Overview	p. 8
1.3.2 The CASS dataset	p. 8
1.3.3 Pancreatic cancer serum biomarkers study	p. 10
1.3.4 Hepatic metastasis ultrasound study	p. 10
1.3.5 CARET PSA biomarker study	p. 10
1.3.6 Ovarian cancer gene expression study	p. 11
1.3.7 Neonatal audiology data	p. 11
1.3.8 St Louis prostate cancer screening study	p. 11
1.4 Topics and organization	p. 11
1.5 Exercises	p. 12
2 Measures of accuracy for binary tests	p. 14
2.1 Measures of accuracy	p. 14
2.1.1 Notation	p. 14
2.1.2 Disease-specific classification probabilities	p. 14
2.1.3 Predictive values	p. 16
2.1.4 Diagnostic likelihood ratios	p. 17
2.2 Estimating accuracy with data	p. 21
2.2.1 Data from a cohort study	p. 21
2.2.2 Proportions: (FPF, TPF) and (PPV, NPV)	p. 22
2.2.3 Ratios of proportions: DLRs	p. 24
2.2.4 Estimation from a case-control study	p. 25
2.2.5 Merits of case-control versus cohort studies	p. 26
2.3 Quantifying the relative accuracy of tests	p. 27
2.3.1 Comparing classification probabilities	p. 28
2.3.2 Comparing predictive values	p. 29
2.3.3 Comparing diagnostic likelihood ratios	p. 30
2.3.4 Which test is better?	p. 31
2.4 Concluding remarks	p. 33
2.5 Exercises	p. 34
3 Comparing binary tests and regression analysis	p. 35
3.1 Study designs for comparing tests	p. 35
3.1.1 Unpaired designs	p. 35
3.1.2 Paired designs	p. 36
3.2 Comparing accuracy with unpaired data	p. 37
3.2.1 Empirical estimators of comparative measures	p. 37
3.2.2 Large sample inference	p. 38
3.3 Comparing accuracy with paired data	p. 41
3.3.1 Sources of correlation	p. 41
3.3.2 Estimation of comparative measures	p. 41
3.3.3 Wide or long data representations	p. 42
3.3.4 Large sample inference	p. 43
3.3.5 Efficiency of paired versus unpaired designs	p. 44
3.3.6 Small sample properties	p. 45
3.3.7 The CASS study	p. 45
3.4 The regression modeling framework	p. 48
3.4.1 Factors potentially affecting test performance	p. 48
3.4.2 Questions addressed by regression modeling	p. 50
3.4.3 Notation and general set-up	p. 50
3.5 Regression for true and false positive fractions	p. 51
3.5.1 Binary marginal GLM models	p. 51
3.5.2 Fitting marginal models to data	p. 51
3.5.3 Illustration: factors affecting test accuracy	p. 53
3.5.4 Comparing tests with regression analysis	p. 55
3.6 Regression modeling of predictive values	p. 58
3.6.1 Model formulation and fitting	p. 58
3.6.2 Comparing tests	p. 59
3.6.3 The incremental value of a test for prediction	p. 59
3.7 Regression models for DLRs	p. 61
3.7.1 The model form	p. 61
3.7.2 Fitting the DLR model	p. 61
3.7.3 Comparing DLRs of two tests	p. 61
3.7.4 Relationships with other regression models	p. 62
3.8 Concluding remarks	p. 63
3.9 Exercises	p. 64
4 The receiver operating characteristic curve	p. 66
4.1 The context	p. 66
4.1.1 Examples of non-binary tests	p. 66
4.1.2 Dichotomizing the test result	p. 66
4.2 The ROC curve for continuous tests	p. 67
4.2.1 Definition of the ROC	p. 67
4.2.2 Mathematical properties of the ROC curve	p. 68
4.2.3 Attributes of and uses for the ROC curve	p. 71
4.2.4 Restrictions and alternatives to the ROC curve	p. 75
4.3 Summary indices	p. 76
4.3.1 The area under the ROC curve (AUC)	p. 77
4.3.2 The ROC(t_0) and partial AUC	p. 79
4.3.3 Other summary indices	p. 80
4.3.4 Measures of distance between distributions	p. 81
4.4 The binormal ROC curve	p. 81
4.4.1 Functional form	p. 82
4.4.2 The binormal AUC	p. 83
4.4.3 The binormal assumption	p. 84
4.5 The ROC for ordinal tests	p. 85
4.5.1 Tests with ordered discrete results	p. 85
4.5.2 The latent decision variable model	p. 86
4.5.3 Identification of the latent variable ROC	p. 86
4.5.4 Changes in accuracy versus thresholds	p. 88
4.5.5 The discrete ROC curve	p. 89
4.5.6 Summary measures for the discrete ROC curve	p. 92
4.6 Concluding remarks	p. 92
4.7 Exercises	p. 94
5 Estimating the ROC curve	p. 96
5.1 Introduction	p. 96
5.1.1 Approaches	p. 96
5.1.2 Notation and assumptions	p. 96
5.2 Empirical estimation	p. 97
5.2.1 The empirical estimator	p. 97
5.2.2 Sampling variability at a threshold	p. 99
5.2.3 Sampling variability of ROC_e(t)	p. 99
5.2.4 The empirical AUC and other indices	p. 103
5.2.5 Variability in the empirical AUC	p. 104
5.2.6 Comparing empirical ROC curves	p. 107
5.2.7 Illustration: pancreatic cancer biomarkers	p. 109
5.2.8 Discrete ordinal data ROC curves	p. 110
5.3 Modeling the test result distributions	p. 111
5.3.1 Fully parametric modeling	p. 111
5.3.2 Semiparametric location-scale models	p. 112
5.3.3 Arguments against modeling test results	p. 114
5.4 Parametric distribution-free methods: ordinal tests	p. 114
5.4.1 The binormal latent variable framework	p. 115
5.4.2 Fitting the discrete binormal ROC function	p. 117
5.4.3 Generalizations and comparisons	p. 118
5.5 Parametric distribution-free methods: continuous tests	p. 119
5.5.1 LABROC	p. 119
5.5.2 The ROC-GLM estimator	p. 120
5.5.3 Inference with parametric distribution-free methods	p. 124
5.6 Concluding remarks	p. 125
5.7 Exercises	p. 127
5.8 Proofs of theoretical results	p. 128
6 Covariate effects on continuous and ordinal tests	p. 130
6.1 How and why?	p. 130
6.1.1 Notation	p. 130
6.1.2 Aspects to model	p. 131
6.1.3 Omitting covariates/pooling data	p. 132
6.2 Reference distributions	p. 136
6.2.1 Non-diseased as the reference population	p. 136
6.2.2 The homogeneous population	p. 137
6.2.3 Nonparametric regression quantiles	p. 139
6.2.4 Parametric estimation of S_{D,Z}	p. 140
6.2.5 Semiparametric models	p. 141
6.2.6 Application	p. 141
6.2.7 Ordinal test results	p. 143
6.3 Modeling covariate effects on test results	p. 144
6.3.1 The basic idea	p. 144
6.3.2 Induced ROC curves for continuous tests	p. 144
6.3.3 Semiparametric location-scale families	p. 148
6.3.4 Induced ROC curves for ordinal tests	p. 150
6.3.5 Random effect models for test results	p. 150
6.4 Modeling covariate effects on ROC curves	p. 151
6.4.1 The ROC-GLM regression model	p. 152
6.4.2 Fitting the model to data	p. 154
6.4.3 Comparing ROC curves	p. 157
6.4.4 Three examples	p. 159
6.5 Approaches to ROC regression	p. 164
6.5.1 Modeling ROC summary indices	p. 164
6.5.2 A qualitative comparison	p. 164
6.6 Concluding remarks	p. 166
6.7 Exercises	p. 167
7 Incomplete data and imperfect reference tests	p. 168
7.1 Verification biased sampling	p. 168
7.1.1 Context and definition	p. 168
7.1.2 The missing at random assumption	p. 170
7.1.3 Correcting for bias with Bayes' theorem	p. 170
7.1.4 Inverse probability weighting/imputation	p. 171
7.1.5 Sampling variability of corrected estimates	p. 172
7.1.6 Adjustments for other biasing factors	p. 175
7.1.7 A broader context	p. 177
7.1.8 Non-binary tests	p. 179
7.2 Verification restricted to screen positives	p. 180
7.2.1 Extreme verification bias	p. 180
7.2.2 Identifiable parameters for a single test	p. 181
7.2.3 Comparing tests	p. 183
7.2.4 Evaluating covariate effects on (DP, FP)	p. 185
7.2.5 Evaluating covariate effects on (TPF, FPF) and on prevalence	p. 187
7.2.6 Evaluating covariate effects on (rTPF, rFPF)	p. 189
7.2.7 Alternative strategies	p. 193
7.3 Imperfect reference tests	p. 194
7.3.1 Examples	p. 194
7.3.2 Effects on accuracy parameters	p. 194
7.3.3 Classic latent class analysis	p. 197
7.3.4 Relaxing the conditional independence assumption	p. 200
7.3.5 A critique of latent class analysis	p. 203
7.3.6 Discrepant resolution	p. 205
7.3.7 Composite reference standards	p. 206
7.4 Concluding remarks	p. 207
7.5 Exercises	p. 209
7.6 Proofs of theoretical results	p. 210
8 Study design and hypothesis testing	p. 214
8.1 The phases of medical test development	p. 214
8.1.1 Research as a process	p. 214
8.1.2 Five phases for the development of a medical test	p. 215
8.2 Sample sizes for phase 2 studies	p. 218
8.2.1 Retrospective validation of a binary test	p. 218
8.2.2 Retrospective validation of a continuous test	p. 220
8.2.3 Sample size based on the AUC	p. 224
8.2.4 Ordinal tests	p. 228
8.3 Sample sizes for phase 3 studies	p. 229
8.3.1 Comparing two binary tests: paired data	p. 229
8.3.2 Comparing two binary tests: unpaired data	p. 233
8.3.3 Evaluating population effects on test performance	p. 233
8.3.4 Comparisons with continuous test results	p. 234
8.3.5 Estimating the threshold for screen positivity	p. 237
8.3.6 Remarks on phase 3 analyses	p. 238
8.4 Sample sizes for phase 4 studies	p. 239
8.4.1 Designs for inference about (FPF, TPF)	p. 239
8.4.2 Designs for predictive values	p. 241
8.4.3 Designs for (FP, DP)	p. 243
8.4.4 Selected verification of screen negatives	p. 244
8.5 Phase 5	p. 245
8.6 Matching and stratification	p. 246
8.6.1 Stratification	p. 246
8.6.2 Matching	p. 247
8.7 Concluding remarks	p. 248
8.8 Exercises	p. 251
9 More topics and conclusions	p. 253
9.1 Meta-analysis	p. 253
9.1.1 Goals of meta-analysis	p. 253
9.1.2 Design of a meta-analysis study	p. 253
9.1.3 The summary ROC curve	p. 255
9.1.4 Binomial regression models	p. 258
9.2 Incorporating the time dimension	p. 259
9.2.1 The context	p. 259
9.2.2 Incident cases and long-term controls	p. 260
9.2.3 Interval cases and controls	p. 263
9.2.4 Predictive values	p. 266
9.2.5 Longitudinal measurements	p. 266
9.3 Combining multiple test results	p. 267
9.3.1 Boolean combinations	p. 267
9.3.2 The likelihood ratio principle	p. 269
9.3.3 Optimality of the risk score	p. 271
9.3.4 Estimating the risk score	p. 274
9.3.5 Development and assessment of the combination score	p. 276
9.4 Concluding remarks	p. 277
9.4.1 Topics we only mention	p. 277
9.4.2 New applications and new technologies	p. 277
9.5 Exercises	p. 279
Bibliography	p. 280
Index	p. 297
