Fault-tolerant systems

Personal Author:

Koren, Israel 1945-

Publication Information:

Amsterdam : Elsevier/Morgan Kaufmann, 2007

ISBN:

9780120885251

Subject Term:

Fault-tolerant computing

Computer systems -- Reliability

Added Author:

Krishna, C. M.

Available:*

Library	Item Barcode	Call Number	Material Type	Item Category 1	Status
PSZ JB	30000010141218	QA76.9.F38 K67 2007	Open Access Book	Book	Unknown
SPACE KL Main Library	30000003483116	QA 76.9.F38 K67 2007	Open Access Book	Book	Unknown

Fault-Tolerant Systems is the first book on fault tolerance design with a systems approach to both hardware and software. No other text on the market takes this approach, nor offers the comprehensive and up-to-date treatment that Koren and Krishna provide.

This book incorporates case studies that highlight six different computer systems with fault-tolerance techniques implemented in their design. A complete ancillary package is available to lecturers, including an online solutions manual for instructors and PowerPoint slides.

Students, designers, and architects of high-performance processors will value this comprehensive overview of the field.

Author Notes

C. Mani Krishna is a Professor of Electrical and Computer Engineering at the University of Massachusetts, Amherst.

Table of Contents

Foreword
Preface
Acknowledgements
About the Authors
1 Preliminaries
1.1 Fault Classification
1.2 Types of Redundancy
1.3 Basic Measures of Fault Tolerance
1.3.1 Traditional Measures
1.3.2 Network Measures
1.4 Outline of This Book
1.5 Further Reading
References
2 Hardware Fault Tolerance
2.1 The Rate of Hardware Failures
2.2 Failure Rate, Reliability, and Mean Time to Failure
2.3 Canonical and Resilient Structures
2.3.1 Series and Parallel Systems
2.3.2 Non-Series/Parallel Systems
2.3.3 M-of-N Systems
2.3.4 Voters
2.3.5 Variations on N-Modular Redundancy
2.3.6 Duplex Systems
2.4 Other Reliability Evaluation Techniques
2.4.1 Poisson Processes
2.4.2 Markov Models
2.5 Fault-Tolerance Processor-Level Techniques
2.5.1 Watchdog Processor
2.5.2 Simultaneous Multithreading for Fault Tolerance
2.6 Byzantine Failures
2.6.1 Byzantine Agreement with Message Authentication
2.7 Further Reading
2.8 Exercises
References
3 Information Redundancy
3.1 Coding
3.1.1 Parity Codes
3.1.2 Checksum
3.1.3 M-of-N Codes
3.1.4 Berger Code
3.1.5 Cyclic Codes
3.1.6 Arithmetic Codes
3.2 Resilient Disk Systems
3.2.1 RAID Level 1
3.2.2 RAID Level 2
3.2.3 RAID Level 3
3.2.4 RAID Level 4
3.2.5 RAID Level 5
3.2.6 Modeling Correlated Failures
3.3 Data Replication
3.3.1 Voting: Non-Hierarchical Organization
3.3.2 Voting: Hierarchical Organization
3.3.3 Primary-Backup Approach
3.4 Algorithm-Based Fault Tolerance
3.5 Further Reading
3.6 Exercises
References
4 Fault-Tolerant Networks
4.1 Measures of Resilience
4.1.1 Graph-Theoretical Measures
4.1.2 Computer Networks Measures
4.2 Common Network Topologies and Their Resilience
4.2.1 Multistage and Extra-Stage Networks
4.2.2 Crossbar Networks
4.2.3 Rectangular Mesh and Interstitial Mesh
4.2.4 Hypercube Network
4.2.5 Cube-Connected Cycles Networks
4.2.6 Loop Networks
4.2.7 Ad hoc Point-to-Point Networks
4.3 Fault-Tolerant Routing
4.3.1 Hypercube Fault-Tolerant Routing
4.3.2 Origin-Based Routing in the Mesh
4.4 Further Reading
4.5 Exercises
References
5 Software Fault Tolerance
5.1 Acceptance Tests
5.2 Single-Version Fault Tolerance
5.2.1 Wrappers
5.2.2 Software Rejuvenation
5.2.3 Data Diversity
5.2.4 Software Implemented Hardware Fault Tolerance (SIHFT)
5.3 N-Version Programming
5.3.1 Consistent Comparison Problem
5.3.2 Version Independence
5.4 Recovery Block Approach
5.4.1 Basic Principles
5.4.2 Success Probability Calculation
5.4.3 Distributed Recovery Blocks
5.5 Preconditions, Postconditions, and Assertions
5.6 Exception-Handling
5.6.1 Requirements from Exception-Handlers
5.6.2 Basics of Exceptions and Exception-Handling
5.6.3 Language Support
5.7 Software Reliability Models
5.7.1 Jelinski-Moranda Model
5.7.2 Littlewood-Verrall Model
5.7.3 Musa-Okumoto Model
5.7.4 Model Selection and Parameter Estimation
5.8 Fault-Tolerant Remote Procedure Calls
5.8.1 Primary-Backup Approach
5.8.2 The Circus Approach
5.9 Further Reading
5.10 Exercises
References
6 Checkpointing
6.1 What is Checkpointing?
6.1.1 Why is Checkpointing Nontrivial?
6.2 Checkpoint Level
6.3 Optimal Checkpointing: An Analytical Model
6.3.1 Time Between Checkpoints: A First-Order Approximation
6.3.2 Optimal Checkpoint Placement
6.3.3 Time Between Checkpoints: A More Accurate Model
6.3.4 Reducing Overhead
6.3.5 Reducing Latency
6.4 Cache-Aided Rollback Error Recovery (CARER)
6.5 Checkpointing in Distributed Systems
6.5.1 The Domino Effect and Livelock
6.5.2 A Coordinated Checkpointing Algorithm
6.5.3 Time-Based Synchronization
6.5.4 Diskless Checkpointing
6.5.5 Message Logging
6.6 Checkpointing in Shared-Memory Systems
6.6.1 Bus-Based Coherence Protocol
6.6.2 Directory-Based Protocol
6.7 Checkpointing in Real-Time Systems
6.8 Other Uses of Checkpointing
6.9 Further Reading
6.10 Exercises
References
7 Case Studies
7.1 NonStop Systems
7.1.1 Architecture
7.1.2 Maintenance and Repair Aids
7.1.3 Software
7.1.4 Modifications to the NonStop Architecture
7.2 Stratus Systems
7.3 Cassini Command and Data Subsystem
7.4 IBM G5
7.5 IBM Sysplex
7.6 Itanium
7.7 Further Reading
References
8 Defect Tolerance in VLSI Circuits
8.1 Manufacturing Defects and Circuit Faults
8.2 Probability of Failure and Critical Area
8.3 Basic Yield Models
8.3.1 The Poisson and Compound Poisson Yield Models
8.3.2 Variations on the Simple Yield Models
8.4 Yield Enhancement Through Redundancy
8.4.1 Yield Projection for Chips with Redundancy
8.4.2 Memory Arrays with Redundancy
8.4.3 Logic Integrated Circuits with Redundancy
8.4.4 Modifying the Floorplan
8.5 Further Reading
8.6 Exercises
References
9 Fault Detection in Cryptographic Systems
9.1 Overview of Ciphers
9.1.1 Symmetric Key Ciphers
9.1.2 Public Key Ciphers
9.2 Security Attacks Through Fault Injection
9.2.1 Fault Attacks on Symmetric Key Ciphers
9.2.2 Fault Attacks on Public (Asymmetric) Key Ciphers
9.3 Countermeasures
9.3.1 Spatial and Temporal Duplication
9.3.2 Error-Detecting Codes
9.3.3 Are These Countermeasures Sufficient?
9.3.4 Final Comment
9.4 Further Reading
9.5 Exercises
References
10 Simulation Techniques
10.1 Writing a Simulation Program
10.2 Parameter Estimation
10.2.1 Point Versus Interval Estimation
10.2.2 Method of Moments
10.2.3 Method of Maximum Likelihood
10.2.4 The Bayesian Approach to Parameter Estimation
10.2.5 Confidence Intervals
10.3 Variance Reduction Methods
10.3.1 Antithetic Variables
10.3.2 Using Control Variables
10.3.3 Stratified Sampling
10.3.4 Importance Sampling
10.4 Random Number Generation
10.4.1 Uniformly Distributed Random Number Generators
10.4.2 Testing Uniform Random Number Generators
10.4.3 Generating Other Distributions
10.5 Fault Injection
10.5.1 Types of Fault Injection Techniques
10.5.2 Fault Injection Application and Tools
10.6 Further Reading
10.7 Exercises
References
Subject Index
