Title:
Practical system reliability
Personal Author:
Publication Information:
Piscataway, NJ : IEEE Press ; Hoboken, NJ : Wiley, 2009
Physical Description:
xiii, 287 p. : ill. ; 25 cm.
ISBN:
9780470408605

Item Barcode:
30000010236752
Call Number:
TA169 B38 2009
Material Type:
Open Access Book
Item Category 1:
Book

Summary

Learn how to model, predict, and manage system reliability/availability throughout the development life cycle

Written by a panel of authors with a wealth of industry experience, this book presents methods and concepts that give readers a solid understanding of modeling and managing system and software availability and reliability through the development of real applications and products. The modeling and prediction techniques and tools are customer-focused and data-driven, and are aligned with industry standards (Telcordia, TL 9000, ISO, etc.). Readers will get a clear understanding of what real-world reliability and availability mean through step-by-step discussions of:

System availability
Conceptual model of reliability and availability
Why availability varies between customers
Modeling availability
Estimating parameters and availability from field data
Estimating input parameters from laboratory data
Estimating input parameters in the architecture/design stage
Prediction accuracy
Connecting the dots

This book can be used by system architects, engineers, and developers to better understand and manage the reliability/availability of their products; quality engineers to grasp how software and hardware quality relate to system availability; and engineering students as part of a short course on system availability and software reliability.


Author Notes

Eric Bauer is a manager of reliability engineering in Alcatel-Lucent's wireline business in Murray Hill, New Jersey. He has designed, modeled, and analyzed reliability for many different products and solutions, and architected and developed software for a variety of communications devices, platforms, and products.
Xuemei Zhang, PhD, is a principal member of the technical staff in the Network Design and Performance Analysis Department at AT&T Labs. She has been working on reliability and performance analysis of wireline and wireless communications systems and networks. Her major work and research areas are system and architectural reliability and performance modeling, and software reliability.
Douglas A. Kimber retired from Alcatel-Lucent as a staff reliability engineer. Throughout his career at Bell Labs, Lucent Technologies, and Alcatel-Lucent, he developed high-reliability hardware and software platforms, applications, and systems, and then transitioned to reliability engineering, where he performed reliability modeling and analysis.


Table of Contents

Preface p. xi
Acknowledgments p. xiii
1 Introduction p. 1
2 System Availability p. 5
2.1 Availability, Services and Elements p. 6
2.2 Classical View p. 8
2.3 Customers' View p. 9
2.4 Standards View p. 10
3 Conceptual Model of Reliability and Availability p. 15
3.1 Concept of Highly Available Systems p. 15
3.2 Conceptual Model of System Availability p. 17
3.3 Failures p. 19
3.4 Outage Resolution p. 23
3.5 Downtime Budgets p. 26
4 Why Availability Varies Between Customers p. 31
4.1 Causes of Variation in Outage Event Reporting p. 31
4.2 Causes of Variation in Outage Duration p. 33
5 Modeling Availability p. 37
5.1 Overview of Modeling Techniques p. 38
5.2 Modeling Definitions p. 58
5.3 Practical Modeling p. 69
5.4 Widget Example p. 78
5.5 Alignment with Industry Standards p. 89
6 Estimating Parameters and Availability from Field Data p. 95
6.1 Self-Maintaining Customers p. 96
6.2 Analyzing Field Outage Data p. 96
6.3 Analyzing Performance and Alarm Data p. 106
6.4 Coverage Factor and Failure Rate p. 107
6.5 Uncovered Failure Recovery Time p. 108
6.6 Covered Failure Detection and Recovery Time p. 109
7 Estimating Input Parameters from Lab Data p. 111
7.1 Hardware Failure Rate p. 111
7.2 Software Failure Rate p. 114
7.3 Coverage Factors p. 129
7.4 Timing Parameters p. 130
7.5 System-Level Parameters p. 132
8 Estimating Input Parameters in the Architecture/Design Stage p. 137
8.1 Hardware Parameters p. 138
8.2 System-Level Parameters p. 146
8.3 Sensitivity Analysis p. 149
9 Prediction Accuracy p. 167
9.1 How Much Field Data Is Enough? p. 168
9.2 How Does One Measure Sampling and Prediction Errors? p. 172
9.3 What Causes Prediction Errors? p. 173
10 Connecting the Dots p. 177
10.1 Set Availability Requirements p. 179
10.2 Incorporate Architectural and Design Techniques p. 179
10.3 Modeling to Verify Feasibility p. 206
10.4 Testing p. 208
10.5 Update Availability Prediction p. 208
10.6 Periodic Field Validation and Model Update p. 208
10.7 Building an Availability Roadmap p. 209
10.8 Reliability Report p. 210
11 Summary p. 213
Appendix A System Reliability Report outline p. 216
1 Executive Summary p. 215
2 Reliability Requirements p. 217
3 Unplanned Downtime Model and Results p. 217
Annex A Reliability Definitions p. 219
Annex B References p. 219
Annex C Markov Model State-Transition Diagrams p. 220
Appendix B Reliability and Availability Theory p. 221
1 Reliability and Availability Definitions p. 221
2 Probability Distributions in Reliability Evaluation p. 228
3 Estimation of Confidence Intervals p. 237
Appendix C Software Reliability Growth Models p. 245
1 Software Characteristic Models p. 245
2 Nonhomogeneous Poisson Process Models p. 246
Appendix D Acronyms and Abbreviations p. 263
Appendix E Bibliography p. 265
Index p. 279
About the Authors p. 285