Title:
Fault tolerance in distributed systems
Personal Author:
Publication Information:
Englewood Cliffs, N.J. : Prentice Hall, 1994
ISBN:
9780133013672
Available:*
Library | Item Barcode | Call Number | Material Type | Item Category 1 | Status |
---|---|---|---|---|---|
Searching... | 30000002955346 | QA76.9.F38 J25 1994 | Open Access Book | Book | Searching... |
Summary
Fault tolerance is an approach by which the reliability of a computer system can be increased beyond what can be achieved by traditional methods. While hardware-supported fault tolerance has been well documented, the newer, software-supported fault tolerance techniques have remained scattered throughout the literature. Comprehensive and self-contained, this book organizes that body of knowledge with a focus on fault tolerance in distributed systems. (The uniprocess case is treated as a special case of distributed systems.) KEY TOPICS: Treats fault-tolerant distributed systems as consisting of levels of abstraction, providing different fault-tolerant services. For researchers and practitioners working in the area of fault tolerance.
Table of Contents
1 Introduction
  Basic Concepts and Definitions
  Phases in Fault Tolerance
  Overview of Hardware Fault Tolerance
  Reliability and Availability
  Summary
2 Distributed Systems
  System Model
  Interprocess Communication
  Ordering of Events and Logical Clocks
  Execution Model and System State
  Summary
3 Basic Building Blocks
  Byzantine Agreement
  Synchronized Clocks
  Stable Storage
  Fail Stop Processors
  Failure Detection and Fault Diagnosis
  Reliable Message Delivery
  Summary
4 Reliable, Atomic, and Causal Broadcast
  Reliable Broadcast
  Atomic Broadcast
  Causal Broadcast
5 Recovering a Consistent State
  Asynchronous Checkpointing and Rollback
  Distributed Checkpointing
  Summary
6 Atomic Actions
  Atomic Actions and Serializability
  Atomic Actions in a Centralized System
  Commit Protocols
  Atomic Actions on Decentralized Data
  Summary
7 Data Replication and Resiliency
  Optimistic Approaches
  Primary Site Approach
  Resiliency with Active Replicas
  Voting
  Degree of Replication
  Summary
8 Process Resiliency
  Resilient Remote Procedure Call
  Resiliency with Asynchronous Communication
  Resiliency with Synchronous Message Passing
  Total Failure and Last Process to Fail
  Summary
9 Software Design Faults
  Approaches for Uniprocess Software
  Backward Recovery in Concurrent Systems
  Forward Recovery in Concurrent Systems
  Summary
Bibliography