Cover image for Fault-tolerant systems
Title:
Fault-tolerant systems
Personal Author:
Publication Information:
Amsterdam : Elsevier/Morgan Kaufmann, 2007
ISBN:
9780120885251
Added Author:

Available:*

Library
Item Barcode
Call Number
Material Type
Item Category 1
Status
Searching...
30000010141218 QA76.9.F38 K67 2007 Open Access Book Book
Searching...
Searching...
30000003483116 QA 76.9.F38 K67 2007 Open Access Book Book
Searching...

On Order

Summary

Summary

Fault-Tolerant Systems is the first book on fault tolerance design with a systems approach to both hardware and software. No other text on the market takes this approach, nor offers the comprehensive and up-to-date treatment that Koren and Krishna provide.

This book incorporates case studies that highlight six different computer systems with fault-tolerance techniques implemented in their design. A complete ancillary package is available to lecturers, including online solutions manual for instructors and PowerPoint slides.

Students, designers, and architects of high performance processors will value this comprehensive overview of the field.


Author Notes

C. Mani Krishna is a Professor of Electrical and Computer Engineering at the University of Massachusetts, Amherst.


Table of Contents

Forewordp. xi
Prefacep. xiii
Acknowledgementsp. xvii
About the Authorsp. xix
1 Preliminariesp. 1
1.1 Fault Classificationp. 2
1.2 Types of Redundancyp. 3
1.3 Basic Measures of Fault Tolerancep. 4
1.3.1 Traditional Measuresp. 5
1.3.2 Network Measuresp. 6
1.4 Outline of This Bookp. 7
1.5 Further Readingp. 9
Referencesp. 10
2 Hardware Fault Tolerancep. 11
2.1 The Rate of Hardware Failuresp. 11
2.2 Failure Rate, Reliability, and Mean Time to Failurep. 13
2.3 Canonical and Resilient Structuresp. 15
2.3.1 Series and Parallel Systemsp. 16
2.3.2 Non-Series/Parallel Systemsp. 17
2.3.3 M-of-N Systemsp. 20
2.3.4 Votersp. 23
2.3.5 Variations on N-Modular Redundancyp. 23
2.3.6 Duplex Systemsp. 27
2.4 Other Reliability Evaluation Techniquesp. 30
2.4.1 Poisson Processesp. 30
2.4.2 Markov Modelsp. 33
2.5 Fault-Tolerance Processor-Level Techniquesp. 36
2.5.1 Watchdog Processorp. 37
2.5.2 Simultaneous Multithreading for Fault Tolerancep. 39
2.6 Byzantine Failuresp. 41
2.6.1 Byzantine Agreement with Message Authenticationp. 46
2.7 Further Readingp. 48
2.8 Exercisesp. 48
Referencesp. 53
3 Information Redundancyp. 55
3.1 Codingp. 56
3.1.1 Parity Codesp. 57
3.1.2 Checksump. 64
3.1.3 M-of-N Codesp. 65
3.1.4 Berger Codep. 66
3.1.5 Cyclic Codesp. 67
3.1.6 Arithmetic Codesp. 74
3.2 Resilient Disk Systemsp. 79
3.2.1 Raid Level 1p. 79
3.2.2 Raid Level 2p. 81
3.2.3 Raid Level 3p. 82
3.2.4 Raid Level 4p. 83
3.2.5 Raid Level 5p. 84
3.2.6 Modeling Correlated Failuresp. 84
3.3 Data Replicationp. 88
3.3.1 Voting: Non-Hierarchical Organizationp. 89
3.3.2 Voting: Hierarchical Organizationp. 95
3.3.3 Primary-Backup Approachp. 96
3.4 Algorithm-Based Fault Tolerancep. 99
3.5 Further Readingp. 101
3.6 Exercisesp. 102
Referencesp. 106
4 Fault-Tolerant Networksp. 109
4.1 Measures of Resiliencep. 110
4.1.1 Graph-Theoretical Measuresp. 110
4.1.2 Computer Networks Measuresp. 111
4.2 Common Network Topologies and Their Resiliencep. 112
4.2.1 Multistage and Extra-Stage Networksp. 112
4.2.2 Crossbar Networksp. 119
4.2.3 Rectangular Mesh and Interstitial Meshp. 121
4.2.4 Hypercube Networkp. 124
4.2.5 Cube-Connected Cycles Networksp. 128
4.2.6 Loop Networksp. 130
4.2.7 Ad hoc Point-to-Point Networksp. 132
4.3 Fault-Tolerant Routingp. 135
4.3.1 Hypercube Fault-Tolerant Routingp. 136
4.3.2 Origin-Based Routing in the Meshp. 138
4.4 Further Readingp. 141
4.5 Exercisesp. 142
Referencesp. 145
5 Software Fault Tolerancep. 147
5.1 Acceptance Testsp. 148
5.2 Single-Version Fault Tolerancep. 149
5.2.1 Wrappersp. 149
5.2.2 Software Rejuvenationp. 152
5.2.3 Data Diversityp. 155
5.2.4 Software Implemented Hardware Fault Tolerance (SIHFT)p. 157
5.3 N-Version Programmingp. 160
5.3.1 Consistent Comparison Problemp. 161
5.3.2 Version Independencep. 162
5.4 Recovery Block Approachp. 169
5.4.1 Basic Principlesp. 169
5.4.2 Success Probability Calculationp. 169
5.4.3 Distributed Recovery Blocksp. 171
5.5 Preconditions, Postconditions, and Assertionsp. 173
5.6 Exception-Handlingp. 173
5.6.1 Requirements from Exception-Handlersp. 174
5.6.2 Basics of Exceptions and Exception-Handlingp. 175
5.6.3 Language Supportp. 177
5.7 Software Reliability Modelsp. 178
5.7.1 Jelinski-Moranda Modelp. 178
5.7.2 Littlewood-Verrall Modelp. 179
5.7.3 Musa-Okumoto Modelp. 180
5.7.4 Model Selection and Parameter Estimationp. 182
5.8 Fault-Tolerant Remote Procedure Callsp. 182
5.8.1 Primary-Backup Approachp. 182
5.8.2 The Circus Approachp. 183
5.9 Further Readingp. 184
5.10 Exercisesp. 186
Referencesp. 188
6 Checkpointingp. 193
6.1 What is Checkpointing?p. 195
6.1.1 Why is Checkpointing Nontrivial?p. 197
6.2 Checkpoint Levelp. 197
6.3 Optimal Checkpointing-An Analytical Modelp. 198
6.3.1 Time Between Checkpoints-A First-Order Approximationp. 200
6.3.2 Optimal Checkpoint Placementp. 201
6.3.3 Time Between Checkpoints-A More Accurate Modelp. 202
6.3.4 Reducing Overheadp. 204
6.3.5 Reducing Latencyp. 205
6.4 Cache-Aided Rollback Error Recovery (CARER)p. 206
6.5 Checkpointing in Distributed Systemsp. 207
6.5.1 The Domino Effect and Livelockp. 209
6.5.2 A Coordinated Checkpointing Algorithmp. 210
6.5.3 Time-Based Synchronizationp. 211
6.5.4 Diskless Checkpointingp. 212
6.5.5 Message Loggingp. 213
6.6 Checkpointing in Shared-Memory Systemsp. 217
6.6.1 Bus-Based Coherence Protocolp. 218
6.6.2 Directory-Based Protocolp. 219
6.7 Checkpointing in Real-Time Systemsp. 220
6.8 Other Uses of Checkpointingp. 223
6.9 Further Readingp. 223
6.10 Exercisesp. 224
Referencesp. 226
7 Case Studiesp. 229
7.1 NonStop Systemsp. 229
7.1.1 Architecturep. 229
7.1.2 Maintenance and Repair Aidsp. 233
7.1.3 Softwarep. 233
7.1.4 Modifications to the NonStop Architecturep. 235
7.2 Stratus Systemsp. 236
7.3 Cassini Command and Data Subsystemp. 238
7.4 IBM G5p. 241
7.5 IBM Sysplexp. 242
7.6 Itaniump. 244
7.7 Further Readingp. 246
Referencesp. 247
8 Defect Tolerance in VLSI Circuitsp. 249
8.1 Manufacturing Defects and Circuit Faultsp. 249
8.2 Probability of Failure and Critical Areap. 251
8.3 Basic Yield Modelsp. 253
8.3.1 The Poisson and Compound Poisson Yield Modelsp. 254
8.3.2 Variations on the Simple Yield Modelsp. 256
8.4 Yield Enhancement Through Redundancyp. 258
8.4.1 Yield Projection for Chips with Redundancyp. 259
8.4.2 Memory Arrays with Redundancyp. 263
8.4.3 Logic Integrated Circuits with Redundancyp. 270
8.4.4 Modifying the Floorplanp. 272
8.5 Further Readingp. 276
8.6 Exercisesp. 277
Referencesp. 281
9 Fault Detection in Cryptographic Systemsp. 285
9.1 Overview of Ciphersp. 286
9.1.1 Symmetric Key Ciphersp. 286
9.1.2 Public Key Ciphersp. 295
9.2 Security Attacks Through Fault Injectionp. 296
9.2.1 Fault Attacks on Symmetric Key Ciphersp. 297
9.2.2 Fault Attacks on Public (Asymmetric) Key Ciphersp. 298
9.3 Countermeasuresp. 299
9.3.1 Spatial and Temporal Duplicationp. 300
9.3.2 Error-Detecting Codesp. 300
9.3.3 Are These Countermeasures Sufficient?p. 304
9.3.4 Final Commentp. 307
9.4 Further Readingp. 307
9.5 Exercisesp. 307
Referencesp. 308
10 Simulation Techniquesp. 311
10.1 Writing a Simulation Programp. 311
10.2 Parameter Estimationp. 315
10.2.1 Point Versus Interval Estimationp. 315
10.2.2 Method of Momentsp. 316
10.2.3 Method of Maximum Likelihoodp. 318
10.2.4 The Bayesian Approach to Parameter Estimationp. 322
10.2.5 Confidence Intervalsp. 324
10.3 Variance Reduction Methodsp. 328
10.3.1 Antithetic Variablesp. 328
10.3.2 Using Control Variablesp. 330
10.3.3 Stratified Samplingp. 331
10.3.4 Importance Samplingp. 333
10.4 Random Number Generationp. 341
10.4.1 Uniformly Distributed Random Number Generatorsp. 342
10.4.2 Testing Uniform Random Number Generatorsp. 345
10.4.3 Generating Other Distributionsp. 349
10.5 Fault Injectionp. 355
10.5.1 Types of Fault Injection Techniquesp. 356
10.5.2 Fault Injection Application and Toolsp. 358
10.6 Further Readingp. 358
10.7 Exercisesp. 359
Referencesp. 363
Subject Indexp. 365