Title:
Design and analysis of reliable and fault-tolerant computer systems
Personal Author:
Publication Information:
London : Imperial College Press, 2007
ISBN:
9781860946684

Available:

Item Barcode:
30000010149586
Call Number:
QA76.9.F38 A22 2007
Material Type:
Open Access Book
Item Category 1:
Book

Summary

Covering both the theoretical and practical aspects of fault-tolerant mobile systems, and fault tolerance and analysis, this book tackles the current issues of reliability-based optimization of computer networks, fault-tolerant mobile systems, and fault tolerance and reliability of high-speed and hierarchical networks. The book is divided into six parts to facilitate coverage of the material by course instructors and computer systems professionals. The sequence of chapters in each part ensures the gradual coverage of issues from the basics to the most recent developments. A useful set of references, including electronic sources, is listed at the end of each chapter.


Table of Contents

Preface p. vii
Acknowledgements p. xiii
Chapter 1 Fundamental Concepts in Fault Tolerance and Reliability Analysis p. 1
1.1 Introduction p. 1
1.2 Redundancy Techniques p. 4
1.2.1 Hardware Redundancy p. 5
1.2.1.1 Passive (Static) Hardware Redundancy p. 5
1.2.1.2 Active (Dynamic) Hardware Redundancy p. 6
1.2.1.3 Hybrid Hardware Redundancy p. 7
1.2.2 Software Redundancy p. 9
1.2.2.1 Static Software Redundancy Techniques p. 9
1.2.2.2 Dynamic Software Redundancy Techniques p. 10
1.2.3 Information Redundancy p. 12
1.2.3.1 Error Detecting Codes p. 14
1.2.3.2 Error Correcting Codes p. 18
1.2.3.3 SEC-DED Codes p. 20
1.2.3.4 CRC Codes p. 26
1.2.3.5 Convolution Codes p. 27
1.2.4 Time Redundancy p. 29
1.2.4.1 Permanent Error Detection with Time Redundancy p. 30
1.3 Reliability Modeling and Evaluation p. 33
1.3.1 Empirical Models p. 34
1.3.2 The Analytical Technique p. 34
1.4 Summary p. 42
References p. 42
Chapter 2 Fault Modeling, Simulation and Diagnosis p. 44
2.1 Fault Modeling p. 44
2.2 Fault Simulation p. 51
2.3 Fault Simulation Algorithms p. 52
2.3.1 Serial Fault Simulation Algorithm p. 52
2.3.2 Parallel Fault Simulation p. 53
2.3.3 Deductive Fault Simulation p. 54
2.3.4 Concurrent Fault Simulation p. 57
2.3.5 Critical Path Tracing p. 57
2.4 Fault Diagnosis p. 59
2.4.1 Combinational Fault Diagnosis p. 59
2.4.2 Sequential Fault Diagnosis Methods p. 61
2.5 Summary p. 64
References p. 64
Chapter 3 Error Control and Self-Checking Circuits p. 66
3.1 Error-Detecting/Error-Correcting Codes p. 67
3.2 Self-Checking Circuits p. 81
3.3 Summary p. 92
References p. 92
Chapter 4 Fault Tolerance in Multiprocessor Systems p. 94
4.1 Fault Tolerance in Interconnection Networks p. 95
4.2 Reliability and Fault Tolerance in Single Loop Architectures p. 104
4.3 Introduction to Fault Tolerance in Hypercube Networks p. 108
4.4 Introduction to Fault Tolerance in Mesh Networks p. 120
4.5 Summary p. 125
References p. 126
Chapter 5 Fault-Tolerant Routing in Multi-Computer Networks p. 127
5.1 Introduction p. 127
5.2 Fault-Tolerant Routing Algorithms in Hypercube p. 131
5.2.1 Depth-First Search Approach p. 131
5.2.2 Iterative-Based Heuristic Routing Algorithm p. 135
5.3 Routing in Faulty Mesh Networks p. 140
5.3.1 Node Labeling Technique p. 140
5.3.2 A FT Routing Scheme for Meshes with Non-Convex Faults p. 141
5.4 Algorithm Extensions p. 147
5.4.1 Multidimensional Meshes p. 147
5.4.2 Faults with f-Chains p. 148
5.5 Summary p. 149
References p. 149
Chapter 6 Fault Tolerance and Reliability in Hierarchical Interconnection Networks p. 152
6.1 Introduction p. 152
6.2 Block-Shift Network (BSN) p. 154
6.2.1 BSN Edges Groups p. 155
6.2.2 BSN Construction p. 156
6.2.3 BSN Degree and Diameter p. 158
6.2.4 BSN Connectivity p. 158
6.2.5 BSN Fault Diameter p. 159
6.2.6 BSN Reliability p. 160
6.3 Hierarchical Cubic Network (HCN) p. 161
6.3.1 HCN Degree and Diameter p. 162
6.4 HINs versus HCNs p. 163
6.4.1 Topological Cost p. 163
6.5 The Hyper-Torus Network (HTN) p. 166
6.6 Summary p. 170
References p. 170
Chapter 7 Fault Tolerance and Reliability of Computer Networks p. 172
7.1 Background Material p. 173
7.2 Fault Tolerance in Loop Networks p. 174
7.2.1 Reliability of Token-Ring Networks p. 175
7.2.2 Reliability of Bypass-Switch Networks p. 176
7.2.3 Double Loop Architectures p. 176
7.2.4 Multi-Drop Architectures p. 178
7.2.5 Daisy-Chain Architectures p. 178
7.3 Reliability of General Graph Networks p. 180
7.3.1 The Exact Method p. 180
7.3.2 Reliability Bounding p. 185
7.4 Topology Optimization of Networks Subject to Reliability & Fault Tolerance Constraints p. 188
7.4.1 Enumeration Techniques p. 189
7.4.1.1 Network Reliability p. 195
7.4.2 Iterative Techniques p. 199
7.5 Maximizing Network Reliability by Adding a Single Edge p. 204
7.6 Design for Networks Reliability p. 204
7.7 Summary p. 205
References p. 206
Chapter 8 Fault Tolerance in High Speed Switching Networks p. 208
8.1 Introduction p. 208
8.2 Classification of Fault-Tolerant Switching Architectures p. 212
8.3 One-Fault Tolerance Switch Architectures p. 213
8.3.1 Extra-Stage Shuffle Exchange p. 213
8.3.2 Itoh Network p. 214
8.3.3 The B-Tree Network p. 215
8.3.4 Benes Network p. 216
8.3.5 Parallel Banyan Network p. 217
8.3.6 Tagle & Sharma Network p. 218
8.4 Two-Fault Tolerance Switch Architectures p. 219
8.4.1 Binary Tree Banyan Network p. 219
8.5 Logarithmic-Fault Tolerance p. 220
8.5.1 RAZAN p. 220
8.5.2 Logical Neighborhood p. 222
8.5.3 Improved Logical Neighborhood p. 223
8.6 Architecture-Dependent Fault Tolerance p. 224
8.7 Summary p. 226
References p. 226
Chapter 9 Fault Tolerance in Distributed and Mobile Computing Systems p. 229
9.1 Introduction p. 229
9.2 Background Material p. 231
9.3 Checkpointing Techniques in Mobile Networks p. 236
9.3.1 Minimal Snapshot Collection Algorithm p. 237
9.3.2 Mutable Checkpoints p. 239
9.3.3 Adaptive Recovery p. 241
9.3.4 Message Logging Based Checkpoints p. 243
9.3.5 Hybrid Checkpoints p. 244
9.4 Comparison p. 245
9.5 Summary p. 247
References p. 247
Chapter 10 Fault Tolerance in Mobile Networks p. 249
10.1 Background Material p. 249
10.2 More on Mutable Checkpoint Techniques in Mobile Networks p. 251
10.2.1 Handling Mobility, Disconnection and Reconnection of MHs p. 252
10.2.2 A Checkpointing Algorithm Based on Mutable Checkpoints p. 253
10.2.3 Performance Evaluation p. 261
10.3 Hardware Approach for Fault Tolerance in Mobile Networks p. 265
10.4 Summary p. 273
References p. 273
Chapter 11 Reliability and Yield Enhancement of VLSI/WSI Circuits p. 276
11.1 Defect and Failure in VLSI Circuits p. 276
11.2 Yield and Defect Model in VLSI/WSI Circuits p. 279
11.3 Techniques to Improve Yield p. 284
11.4 Effect of Redundancy on Yield p. 286
11.5 Summary p. 288
References p. 288
Chapter 12 Design of Fault-Tolerant Processor Arrays p. 291
12.1 Introduction p. 291
12.2 Hardware Redundancy Techniques p. 294
12.3 Self-Reconfiguration Techniques p. 317
12.4 Summary p. 321
References p. 322
Chapter 13 Algorithm-Based Fault Tolerance p. 326
13.1 Checksum-Based ABFT for Matrix Operations p. 327
13.2 Checksum-Based ABFT Error Handling p. 330
13.3 Weighted Checksum Based ABFT p. 331
13.4 ABFT on a Mesh Multiprocessor p. 332
13.5 Checksum-Based ABFT on a Hypercube Multiprocessor p. 334
13.6 Partition-Based ABFT for Floating-Point Matrix Operations p. 336
13.7 Summary p. 339
References p. 339
Chapter 14 System Level Diagnosis-I p. 341
14.1 Background Material and Basic Terminology p. 342
14.2 System-Level Diagnosis Models p. 347
14.3 Diagnosable Systems p. 352
14.4 Diagnose-Ability Algorithms p. 358
14.4.1 Centralized Diagnosis Systems p. 359
14.4.2 Distributed Diagnosis Systems p. 365
14.5 Summary p. 372
References p. 373
Chapter 15 System Level Diagnosis-II p. 378
15.1 Diagnosis Algorithms for Regular Structures p. 378
15.2 Regular Structures p. 379
15.3 Pessimistic One-Step Diagnosis Algorithms for Hypercube p. 380
15.4 Diagnosis for Symmetric Multiple Processor Architecture p. 383
15.5 Summary p. 394
References p. 394
Appendix p. 397
Chapter 16 Fault Tolerance and Reliability of the RAID Systems p. 400
16.1 Introduction p. 401
16.2 Redundancy Mechanisms p. 403
16.3 Simple Reliability Analysis p. 411
16.4 Advanced RAID Systems p. 413
16.5 More on RAIDS p. 418
16.6 Summary p. 423
References p. 423
Chapter 17 High Availability in Computer Systems p. 426
17.1 Introduction p. 426
17.2 Tandem High Availability Computers at a Glance p. 430
17.3 Availability in Client/Server Computing p. 438
17.4 Chapter Summary p. 440
References p. 440