Title:

Fault tolerance in distributed systems

Personal Author:

Jalote, Pankaj

Publication Information:

Englewood Cliffs, N.J. : Prentice Hall, 1994

ISBN:

9780133013672

Subject Term:

Fault-tolerant computing

Electronic data processing

Available:

Library	Item Barcode	Call Number	Material Type	Item Category 1	Status
PSZ JB	30000002955346	QA76.9.F38 J25 1994	Open Access Book	Book	Unknown

Fault tolerance is an approach by which the reliability of a computer system can be increased beyond what can be achieved by traditional methods. While hardware-supported fault tolerance has been well documented, the newer, software-supported fault tolerance techniques have remained scattered throughout the literature. Comprehensive and self-contained, this book organizes that body of knowledge with a focus on fault tolerance in distributed systems. (The uniprocess case is treated as a special case of distributed systems.) The book treats fault-tolerant distributed systems as consisting of levels of abstraction, with the levels providing different fault-tolerant services. It is intended for researchers and practitioners working in the area of fault tolerance.

1 Introduction

Basic Concepts and Definitions

Phases in Fault Tolerance

Overview of Hardware Fault Tolerance

Reliability and Availability

Summary

2 Distributed Systems

System Model

Interprocess Communication

Ordering of Events and Logical Clocks

Execution Model and System State

Summary

3 Basic Building Blocks

Byzantine Agreement

Synchronized Clocks

Stable Storage

Fail Stop Processors

Failure Detection and Fault Diagnosis

Reliable Message Delivery

Summary

4 Reliable, Atomic, and Causal Broadcast

Reliable Broadcast

Atomic Broadcast

Causal Broadcast

5 Recovering a Consistent State

Asynchronous Checkpointing and Rollback

Distributed Checkpointing

Summary

6 Atomic Actions

Atomic Actions and Serializability

Atomic Actions in a Centralized System

Commit Protocols

Atomic Actions on Decentralized Data

Summary

7 Data Replication and Resiliency

Optimistic Approaches

Primary Site Approach

Resiliency with Active Replicas

Voting

Degree of Replication

Summary

8 Process Resiliency

Resilient Remote Procedure Call

Resiliency with Asynchronous Communication

Resiliency with Synchronous Message Passing

Total Failure and Last Process to Fail

Summary

9 Software Design Faults

Approaches for Uniprocess Software

Backward Recovery in Concurrent Systems

Forward Recovery in Concurrent Systems

Summary

Bibliography
