Title:
Design and analysis of reliable and fault-tolerant computer systems
Personal Author:
Publication Information:
London : Imperial College Press, 2007
ISBN:
9781860946684
Available:*
Item Barcode | Call Number | Material Type | Item Category 1 |
---|---|---|---|
30000010149586 | QA76.9.F38 A22 2007 | Open Access Book | Book |
Summary
Covering both the theoretical and practical aspects of fault tolerance and reliability analysis, this book tackles current issues in reliability-based optimization of computer networks, fault-tolerant mobile systems, and the fault tolerance and reliability of high-speed and hierarchical networks. The book is divided into six parts to facilitate coverage of the material by course instructors and computer systems professionals. Within each part, the sequence of chapters progresses gradually from the basics to the most recent developments. A set of references, including electronic sources, is listed at the end of each chapter.
Table of Contents
Preface | p. vii |
Acknowledgements | p. xiii |
Chapter 1 Fundamental Concepts in Fault Tolerance and Reliability Analysis | p. 1 |
1.1 Introduction | p. 1 |
1.2 Redundancy Techniques | p. 4 |
1.2.1 Hardware Redundancy | p. 5 |
1.2.1.1 Passive (Static) Hardware Redundancy | p. 5 |
1.2.1.2 Active (Dynamic) Hardware Redundancy | p. 6 |
1.2.1.3 Hybrid Hardware Redundancy | p. 7 |
1.2.2 Software Redundancy | p. 9 |
1.2.2.1 Static Software Redundancy Techniques | p. 9 |
1.2.2.2 Dynamic Software Redundancy Techniques | p. 10 |
1.2.3 Information Redundancy | p. 12 |
1.2.3.1 Error Detecting Codes | p. 14 |
1.2.3.2 Error Correcting Codes | p. 18 |
1.2.3.3 SEC-DED Codes | p. 20 |
1.2.3.4 CRC Codes | p. 26 |
1.2.3.5 Convolution Codes | p. 27 |
1.2.4 Time Redundancy | p. 29 |
1.2.4.1 Permanent Error Detection with Time Redundancy | p. 30 |
1.3 Reliability Modeling and Evaluation | p. 33 |
1.3.1 Empirical Models | p. 34 |
1.3.2 The Analytical Technique | p. 34 |
1.4 Summary | p. 42 |
References | p. 42 |
Chapter 2 Fault Modeling, Simulation and Diagnosis | p. 44 |
2.1 Fault Modeling | p. 44 |
2.2 Fault Simulation | p. 51 |
2.3 Fault Simulation Algorithms | p. 52 |
2.3.1 Serial Fault Simulation Algorithm | p. 52 |
2.3.2 Parallel Fault Simulation | p. 53 |
2.3.3 Deductive Fault Simulation | p. 54 |
2.3.4 Concurrent Fault Simulation | p. 57 |
2.3.5 Critical Path Tracing | p. 57 |
2.4 Fault Diagnosis | p. 59 |
2.4.1 Combinational Fault Diagnosis | p. 59 |
2.4.2 Sequential Fault Diagnosis Methods | p. 61 |
2.5 Summary | p. 64 |
References | p. 64 |
Chapter 3 Error Control and Self-Checking Circuits | p. 66 |
3.1 Error-Detecting/Error-Correcting Codes | p. 67 |
3.2 Self-Checking Circuits | p. 81 |
3.3 Summary | p. 92 |
References | p. 92 |
Chapter 4 Fault Tolerance in Multiprocessor Systems | p. 94 |
4.1 Fault Tolerance in Interconnection Networks | p. 95 |
4.2 Reliability and Fault Tolerance in Single Loop Architectures | p. 104 |
4.3 Introduction to Fault Tolerance in Hypercube Networks | p. 108 |
4.4 Introduction to Fault Tolerance in Mesh Networks | p. 120 |
4.5 Summary | p. 125 |
References | p. 126 |
Chapter 5 Fault-Tolerant Routing in Multi-Computer Networks | p. 127 |
5.1 Introduction | p. 127 |
5.2 Fault-Tolerant Routing Algorithms in Hypercube | p. 131 |
5.2.1 Depth-First Search Approach | p. 131 |
5.2.2 Iterative-Based Heuristic Routing Algorithm | p. 135 |
5.3 Routing in Faulty Mesh Networks | p. 140 |
5.3.1 Node Labeling Technique | p. 140 |
5.3.2 A FT Routing Scheme for Meshes with Non-Convex Faults | p. 141 |
5.4 Algorithm Extensions | p. 147 |
5.4.1 Multidimensional Meshes | p. 147 |
5.4.2 Faults with f-Chains | p. 148 |
5.5 Summary | p. 149 |
References | p. 149 |
Chapter 6 Fault Tolerance and Reliability in Hierarchical Interconnection Networks | p. 152 |
6.1 Introduction | p. 152 |
6.2 Block-Shift Network (BSN) | p. 154 |
6.2.1 BSN Edges Groups | p. 155 |
6.2.2 BSN Construction | p. 156 |
6.2.3 BSN Degree and Diameter | p. 158 |
6.2.4 BSN Connectivity | p. 158 |
6.2.5 BSN Fault Diameter | p. 159 |
6.2.6 BSN Reliability | p. 160 |
6.3 Hierarchical Cubic Network (HCN) | p. 161 |
6.3.1 HCN Degree and Diameter | p. 162 |
6.4 HINs versus HCNs | p. 163 |
6.4.1 Topological Cost | p. 163 |
6.5 The Hyper-Torus Network (HTN) | p. 166 |
6.6 Summary | p. 170 |
References | p. 170 |
Chapter 7 Fault Tolerance and Reliability of Computer Networks | p. 172 |
7.1 Background Material | p. 173 |
7.2 Fault Tolerance in Loop Networks | p. 174 |
7.2.1 Reliability of Token-Ring Networks | p. 175 |
7.2.2 Reliability of Bypass-Switch Networks | p. 176 |
7.2.3 Double Loop Architectures | p. 176 |
7.2.4 Multi-Drop Architectures | p. 178 |
7.2.5 Daisy-Chain Architectures | p. 178 |
7.3 Reliability of General Graph Networks | p. 180 |
7.3.1 The Exact Method | p. 180 |
7.3.2 Reliability Bounding | p. 185 |
7.4 Topology Optimization of Networks Subject to Reliability & Fault Tolerance Constraints | p. 188 |
7.4.1 Enumeration Techniques | p. 189 |
7.4.1.1 Network Reliability | p. 195 |
7.4.2 Iterative Techniques | p. 199 |
7.5 Maximizing Network Reliability by Adding a Single Edge | p. 204 |
7.6 Design for Networks Reliability | p. 204 |
7.7 Summary | p. 205 |
References | p. 206 |
Chapter 8 Fault Tolerance in High Speed Switching Networks | p. 208 |
8.1 Introduction | p. 208 |
8.2 Classification of Fault-Tolerant Switching Architectures | p. 212 |
8.3 One-Fault Tolerance Switch Architectures | p. 213 |
8.3.1 Extra-Stage Shuffle Exchange | p. 213 |
8.3.2 Itoh Network | p. 214 |
8.3.3 The B-Tree Network | p. 215 |
8.3.4 Benes Network | p. 216 |
8.3.5 Parallel Banyan Network | p. 217 |
8.3.6 Tagle & Sharma Network | p. 218 |
8.4 Two-Fault Tolerance Switch Architectures | p. 219 |
8.4.1 Binary Tree Banyan Network | p. 219 |
8.5 Logarithmic-Fault Tolerance | p. 220 |
8.5.1 RAZAN | p. 220 |
8.5.2 Logical Neighborhood | p. 222 |
8.5.3 Improved Logical Neighborhood | p. 223 |
8.6 Architecture-Dependent Fault Tolerance | p. 224 |
8.7 Summary | p. 226 |
References | p. 226 |
Chapter 9 Fault Tolerance in Distributed and Mobile Computing Systems | p. 229 |
9.1 Introduction | p. 229 |
9.2 Background Material | p. 231 |
9.3 Checkpointing Techniques in Mobile Networks | p. 236 |
9.3.1 Minimal Snapshot Collection Algorithm | p. 237 |
9.3.2 Mutable Checkpoints | p. 239 |
9.3.3 Adaptive Recovery | p. 241 |
9.3.4 Message Logging Based Checkpoints | p. 243 |
9.3.5 Hybrid Checkpoints | p. 244 |
9.4 Comparison | p. 245 |
9.5 Summary | p. 247 |
References | p. 247 |
Chapter 10 Fault Tolerance in Mobile Networks | p. 249 |
10.1 Background Material | p. 249 |
10.2 More on Mutable Checkpoint Techniques in Mobile Networks | p. 251 |
10.2.1 Handling Mobility, Disconnection and Reconnection of MHs | p. 252 |
10.2.2 A Checkpointing Algorithm Based on Mutable Checkpoints | p. 253 |
10.2.3 Performance Evaluation | p. 261 |
10.3 Hardware Approach for Fault Tolerance in Mobile Networks | p. 265 |
10.4 Summary | p. 273 |
References | p. 273 |
Chapter 11 Reliability and Yield Enhancement of VLSI/WSI Circuits | p. 276 |
11.1 Defect and Failure in VLSI Circuits | p. 276 |
11.2 Yield and Defect Model in VLSI/WSI Circuits | p. 279 |
11.3 Techniques to Improve Yield | p. 284 |
11.4 Effect of Redundancy on Yield | p. 286 |
11.5 Summary | p. 288 |
References | p. 288 |
Chapter 12 Design of Fault-Tolerant Processor Arrays | p. 291 |
12.1 Introduction | p. 291 |
12.2 Hardware Redundancy Techniques | p. 294 |
12.3 Self-Reconfiguration Techniques | p. 317 |
12.4 Summary | p. 321 |
References | p. 322 |
Chapter 13 Algorithm-Based Fault Tolerance | p. 326 |
13.1 Checksum-Based ABFT for Matrix Operations | p. 327 |
13.2 Checksum-Based ABFT Error Handling | p. 330 |
13.3 Weighted Checksum Based ABFT | p. 331 |
13.4 ABFT on a Mesh Multiprocessor | p. 332 |
13.5 Checksum-Based ABFT on a Hypercube Multiprocessor | p. 334 |
13.6 Partition-Based ABFT for Floating-Point Matrix Operations | p. 336 |
13.7 Summary | p. 339 |
References | p. 339 |
Chapter 14 System Level Diagnosis-I | p. 341 |
14.1 Background Material and Basic Terminology | p. 342 |
14.2 System-Level Diagnosis Models | p. 347 |
14.3 Diagnosable Systems | p. 352 |
14.4 Diagnose-Ability Algorithms | p. 358 |
14.4.1 Centralized Diagnosis Systems | p. 359 |
14.4.2 Distributed Diagnosis Systems | p. 365 |
14.5 Summary | p. 372 |
References | p. 373 |
Chapter 15 System Level Diagnosis-II | p. 378 |
15.1 Diagnosis Algorithms for Regular Structures | p. 378 |
15.2 Regular Structures | p. 379 |
15.3 Pessimistic One-Step Diagnosis Algorithms for Hypercube | p. 380 |
15.4 Diagnosis for Symmetric Multiple Processor Architecture | p. 383 |
15.5 Summary | p. 394 |
References | p. 394 |
Appendix | p. 397 |
Chapter 16 Fault Tolerance and Reliability of the RAID Systems | p. 400 |
16.1 Introduction | p. 401 |
16.2 Redundancy Mechanisms | p. 403 |
16.3 Simple Reliability Analysis | p. 411 |
16.4 Advanced RAID Systems | p. 413 |
16.5 More on RAIDS | p. 418 |
16.6 Summary | p. 423 |
References | p. 423 |
Chapter 17 High Availability in Computer Systems | p. 426 |
17.1 Introduction | p. 426 |
17.2 Tandem High Availability Computers at a Glance | p. 430 |
17.3 Availability in Client/Server Computing | p. 438 |
17.4 Chapter Summary | p. 440 |
References | p. 440 |