Title:

Design and analysis of reliable and fault-tolerant computer systems

Personal Author:

Abd-El-Barr, Mostafa, 1950-

Publication Information:

London : Imperial College Press, 2007

ISBN:

9781860946684

Subject Term:

Fault-tolerant computing

Computer systems -- Reliability

Computer architecture

Available:*

Library	Item Barcode	Call Number	Material Type	Item Category 1	Status
PSZ JB	30000010149586	QA76.9.F38 A22 2007	Open Access Book	Book	Unknown

Covering both the theoretical and practical aspects of fault tolerance and reliability analysis, this book tackles current issues in the reliability-based optimization of computer networks, fault-tolerant mobile systems, and the fault tolerance and reliability of high-speed and hierarchical networks. The book is divided into six parts to facilitate coverage of the material by course instructors and computer systems professionals. Within each part, the chapters progress from the basics to the most recent developments. Each chapter ends with a useful set of references, including electronic sources.

Preface	p. vii
Acknowledgements	p. xiii
Chapter 1 Fundamental Concepts in Fault Tolerance and Reliability Analysis	p. 1
1.1 Introduction	p. 1
1.2 Redundancy Techniques	p. 4
1.2.1 Hardware Redundancy	p. 5
1.2.1.1 Passive (Static) Hardware Redundancy	p. 5
1.2.1.2 Active (Dynamic) Hardware Redundancy	p. 6
1.2.1.3 Hybrid Hardware Redundancy	p. 7
1.2.2 Software Redundancy	p. 9
1.2.2.1 Static Software Redundancy Techniques	p. 9
1.2.2.2 Dynamic Software Redundancy Techniques	p. 10
1.2.3 Information Redundancy	p. 12
1.2.3.1 Error Detecting Codes	p. 14
1.2.3.2 Error Correcting Codes	p. 18
1.2.3.3 SEC-DED Codes	p. 20
1.2.3.4 CRC Codes	p. 26
1.2.3.5 Convolution Codes	p. 27
1.2.4 Time Redundancy	p. 29
1.2.4.1 Permanent Error Detection with Time Redundancy	p. 30
1.3 Reliability Modeling and Evaluation	p. 33
1.3.1 Empirical Models	p. 34
1.3.2 The Analytical Technique	p. 34
1.4 Summary	p. 42
References	p. 42
Chapter 2 Fault Modeling, Simulation and Diagnosis	p. 44
2.1 Fault Modeling	p. 44
2.2 Fault Simulation	p. 51
2.3 Fault Simulation Algorithms	p. 52
2.3.1 Serial Fault Simulation Algorithm	p. 52
2.3.2 Parallel Fault Simulation	p. 53
2.3.3 Deductive Fault Simulation	p. 54
2.3.4 Concurrent Fault Simulation	p. 57
2.3.5 Critical Path Tracing	p. 57
2.4 Fault Diagnosis	p. 59
2.4.1 Combinational Fault Diagnosis	p. 59
2.4.2 Sequential Fault Diagnosis Methods	p. 61
2.5 Summary	p. 64
References	p. 64
Chapter 3 Error Control and Self-Checking Circuits	p. 66
3.1 Error-Detecting/Error-Correcting Codes	p. 67
3.2 Self-Checking Circuits	p. 81
3.3 Summary	p. 92
References	p. 92
Chapter 4 Fault Tolerance in Multiprocessor Systems	p. 94
4.1 Fault Tolerance in Interconnection Networks	p. 95
4.2 Reliability and Fault Tolerance in Single Loop Architectures	p. 104
4.3 Introduction to Fault Tolerance in Hypercube Networks	p. 108
4.4 Introduction to Fault Tolerance in Mesh Networks	p. 120
4.5 Summary	p. 125
References	p. 126
Chapter 5 Fault-Tolerant Routing in Multi-Computer Networks	p. 127
5.1 Introduction	p. 127
5.2 Fault-Tolerant Routing Algorithms in Hypercube	p. 131
5.2.1 Depth-First Search Approach	p. 131
5.2.2 Iterative-Based Heuristic Routing Algorithm	p. 135
5.3 Routing in Faulty Mesh Networks	p. 140
5.3.1 Node Labeling Technique	p. 140
5.3.2 A FT Routing Scheme for Meshes with Non-Convex Faults	p. 141
5.4 Algorithm Extensions	p. 147
5.4.1 Multidimensional Meshes	p. 147
5.4.2 Faults with f-Chains	p. 148
5.5 Summary	p. 149
References	p. 149
Chapter 6 Fault Tolerance and Reliability in Hierarchical Interconnection Networks	p. 152
6.1 Introduction	p. 152
6.2 Block-Shift Network (BSN)	p. 154
6.2.1 BSN Edges Groups	p. 155
6.2.2 BSN Construction	p. 156
6.2.3 BSN Degree and Diameter	p. 158
6.2.4 BSN Connectivity	p. 158
6.2.5 BSN Fault Diameter	p. 159
6.2.6 BSN Reliability	p. 160
6.3 Hierarchical Cubic Network (HCN)	p. 161
6.3.1 HCN Degree and Diameter	p. 162
6.4 HINs versus HCNs	p. 163
6.4.1 Topological Cost	p. 163
6.5 The Hyper-Torus Network (HTN)	p. 166
6.6 Summary	p. 170
References	p. 170
Chapter 7 Fault Tolerance and Reliability of Computer Networks	p. 172
7.1 Background Material	p. 173
7.2 Fault Tolerance in Loop Networks	p. 174
7.2.1 Reliability of Token-Ring Networks	p. 175
7.2.2 Reliability of Bypass-Switch Networks	p. 176
7.2.3 Double Loop Architectures	p. 176
7.2.4 Multi-Drop Architectures	p. 178
7.2.5 Daisy-Chain Architectures	p. 178
7.3 Reliability of General Graph Networks	p. 180
7.3.1 The Exact Method	p. 180
7.3.2 Reliability Bounding	p. 185
7.4 Topology Optimization of Networks Subject to Reliability & Fault Tolerance Constraints	p. 188
7.4.1 Enumeration Techniques	p. 189
7.4.1.1 Network Reliability	p. 195
7.4.2 Iterative Techniques	p. 199
7.5 Maximizing Network Reliability by Adding a Single Edge	p. 204
7.6 Design for Networks Reliability	p. 204
7.7 Summary	p. 205
References	p. 206
Chapter 8 Fault Tolerance in High Speed Switching Networks	p. 208
8.1 Introduction	p. 208
8.2 Classification of Fault-Tolerant Switching Architectures	p. 212
8.3 One-Fault Tolerance Switch Architectures	p. 213
8.3.1 Extra-Stage Shuffle Exchange	p. 213
8.3.2 Itoh Network	p. 214
8.3.3 The B-Tree Network	p. 215
8.3.4 Benes Network	p. 216
8.3.5 Parallel Banyan Network	p. 217
8.3.6 Tagle & Sharma Network	p. 218
8.4 Two-Fault Tolerance Switch Architectures	p. 219
8.4.1 Binary Tree Banyan Network	p. 219
8.5 Logarithmic-Fault Tolerance	p. 220
8.5.1 RAZAN	p. 220
8.5.2 Logical Neighborhood	p. 222
8.5.3 Improved Logical Neighborhood	p. 223
8.6 Architecture-Dependent Fault Tolerance	p. 224
8.7 Summary	p. 226
References	p. 226
Chapter 9 Fault Tolerance in Distributed and Mobile Computing Systems	p. 229
9.1 Introduction	p. 229
9.2 Background Material	p. 231
9.3 Checkpointing Techniques in Mobile Networks	p. 236
9.3.1 Minimal Snapshot Collection Algorithm	p. 237
9.3.2 Mutable Checkpoints	p. 239
9.3.3 Adaptive Recovery	p. 241
9.3.4 Message Logging Based Checkpoints	p. 243
9.3.5 Hybrid Checkpoints	p. 244
9.4 Comparison	p. 245
9.5 Summary	p. 247
References	p. 247
Chapter 10 Fault Tolerance in Mobile Networks	p. 249
10.1 Background Material	p. 249
10.2 More on Mutable Checkpoint Techniques in Mobile Networks	p. 251
10.2.1 Handling Mobility, Disconnection and Reconnection of MHs	p. 252
10.2.2 A Checkpointing Algorithm Based on Mutable Checkpoints	p. 253
10.2.3 Performance Evaluation	p. 261
10.3 Hardware Approach for Fault Tolerance in Mobile Networks	p. 265
10.4 Summary	p. 273
References	p. 273
Chapter 11 Reliability and Yield Enhancement of VLSI/WSI Circuits	p. 276
11.1 Defect and Failure in VLSI Circuits	p. 276
11.2 Yield and Defect Model in VLSI/WSI Circuits	p. 279
11.3 Techniques to Improve Yield	p. 284
11.4 Effect of Redundancy on Yield	p. 286
11.5 Summary	p. 288
References	p. 288
Chapter 12 Design of Fault-Tolerant Processor Arrays	p. 291
12.1 Introduction	p. 291
12.2 Hardware Redundancy Techniques	p. 294
12.3 Self-Reconfiguration Techniques	p. 317
12.4 Summary	p. 321
References	p. 322
Chapter 13 Algorithm-Based Fault Tolerance	p. 326
13.1 Checksum-Based ABFT for Matrix Operations	p. 327
13.2 Checksum-Based ABFT Error Handling	p. 330
13.3 Weighted Checksum Based ABFT	p. 331
13.4 ABFT on a Mesh Multiprocessor	p. 332
13.5 Checksum-Based ABFT on a Hypercube Multiprocessor	p. 334
13.6 Partition-Based ABFT for Floating-Point Matrix Operations	p. 336
13.7 Summary	p. 339
References	p. 339
Chapter 14 System Level Diagnosis-I	p. 341
14.1 Background Material and Basic Terminology	p. 342
14.2 System-Level Diagnosis Models	p. 347
14.3 Diagnosable Systems	p. 352
14.4 Diagnose-Ability Algorithms	p. 358
14.4.1 Centralized Diagnosis Systems	p. 359
14.4.2 Distributed Diagnosis Systems	p. 365
14.5 Summary	p. 372
References	p. 373
Chapter 15 System Level Diagnosis-II	p. 378
15.1 Diagnosis Algorithms for Regular Structures	p. 378
15.2 Regular Structures	p. 379
15.3 Pessimistic One-Step Diagnosis Algorithms for Hypercube	p. 380
15.4 Diagnosis for Symmetric Multiple Processor Architecture	p. 383
15.5 Summary	p. 394
References	p. 394
Appendix	p. 397
Chapter 16 Fault Tolerance and Reliability of the RAID Systems	p. 400
16.1 Introduction	p. 401
16.2 Redundancy Mechanisms	p. 403
16.3 Simple Reliability Analysis	p. 411
16.4 Advanced RAID Systems	p. 413
16.5 More on RAIDS	p. 418
16.6 Summary	p. 423
References	p. 423
Chapter 17 High Availability in Computer Systems	p. 426
17.1 Introduction	p. 426
17.2 Tandem High Availability Computers at a Glance	p. 430
17.3 Availability in Client/Server Computing	p. 438
17.4 Chapter Summary	p. 440
References	p. 440
