

The role of the tutorials is to provide a platform for more intensive scientific exchange amongst researchers interested in a particular topic and to serve as a meeting point for the community. Tutorials complement the depth-oriented technical sessions by providing participants with broad overviews of emerging fields. A tutorial can be scheduled for 1.5 or 3 hours.


Tutorial on
Rethinking Software Fault Tolerance


Kishor Trivedi
Elect and Comp. Eng, Duke University
United States
Brief Bio
Kishor S. Trivedi holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, Durham, NC. He has a B.Tech (EE, 1968) from IIT Mumbai, M.S. (CS, 1972) and PhD (CS, 1974) from the University of Illinois, Urbana-Champaign. He has been on the Duke faculty since 1975. He is the author of a well-known text entitled Probability and Statistics with Reliability, Queuing and Computer Science Applications, first published by Prentice-Hall; a thoroughly revised second edition (including its Indian edition) of this book has been published by John Wiley. He has authored several other books. He is a Life Fellow of the Institute of Electrical and Electronics Engineers and a Golden Core Member of the IEEE Computer Society. He has published over 600 articles and has supervised 48 Ph.D. dissertations. His h-index is 108. He is a recipient of the IEEE Computer Society Technical Achievement Award for his research on Software Aging and Rejuvenation, and of the IEEE Reliability Society’s Lifetime Achievement Award. He has worked closely with industry in carrying out reliability/availability analysis, providing short courses on reliability, availability and performability modeling, and in the development and dissemination of software packages such as SHARPE and SPNP.

Complex systems in many domains contain a significant amount of software, and several studies have established that a large fraction of system outages are due to software faults. Traditional methods of fault avoidance, fault removal based on extensive testing and debugging, and fault tolerance based on design or data diversity have proved inadequate to ensure high software dependability. The key challenge, then, is how to provide highly dependable software. We discuss a view of fault tolerance for software-based systems aimed at ensuring high dependability. We classify software faults into Bohrbugs and Mandelbugs, and identify aging-related bugs as a subtype of the latter. Traditional methods were designed to deal with Bohrbugs, so what remains is to develop mitigation methods for Mandelbugs in general and aging-related bugs in particular. We submit that mitigation methods for Mandelbugs utilize environmental diversity. Retrying the operation, restarting the application, failing over to an identical replica (hot, warm or cold) and rebooting the OS are reactive recovery techniques, applied after a failure has occurred, and all of them rely on environmental diversity. For aging-related bugs, it is also possible to use a proactive environmental diversity technique known as software rejuvenation. We discuss environmental diversity from both experimental and analytic points of view and cite examples of real systems employing these techniques.
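
As a rough illustration of the reactive recovery techniques listed above, the following Python sketch applies recovery actions in order of increasing cost until the operation succeeds. It is only a toy model under stated assumptions: the names run_with_escalation, env, retry and restart_application are hypothetical and not taken from any real system, and the "environment" is a single counter standing in for leaked OS resources.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")
Action = Tuple[str, Callable[[], None]]


def run_with_escalation(operation: Callable[[], T], ladder: List[Action]) -> Tuple[T, str]:
    """Run operation; after a failure, apply each recovery action in turn and re-run.

    Returns the result together with the name of the action that made it succeed
    ("none" if the first attempt succeeded).
    """
    try:
        return operation(), "none"
    except Exception:
        pass  # fall through to the recovery ladder
    for name, action in ladder:
        action()  # may change the environment; the code itself is never fixed
        try:
            return operation(), name
        except Exception:
            continue  # escalate to the next, more disruptive action
    raise RuntimeError("all recovery actions exhausted")


# Toy environment (assumption): stale handles accumulated by earlier executions.
env = {"stale_handles": 3}


def operation() -> str:
    # Fails because of the environment state, not because of its input.
    if env["stale_handles"] > 0:
        raise RuntimeError("resource exhausted")
    return "ok"


def retry() -> None:
    pass  # a plain retry re-executes without changing this toy environment


def restart_application() -> None:
    env["stale_handles"] = 0  # a restart releases the stale resources


ladder = [("retry", retry), ("restart", restart_application)]
result, used = run_with_escalation(operation, ladder)
print(result, used)
```

In this toy model the plain retry fails because nothing in the environment has changed, while the restart succeeds because it resets the state that triggered the failure. A failover or reboot step could be appended to the ladder in the same way.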


Design diversity, Environmental diversity, Bohrbugs, Mandelbugs, Software Aging, Software rejuvenation

Aims and Learning Objectives

Novel ideas on environment-dependent bugs and environmental diversity

Target Audience

Practicing software engineers and students of software engineering

Prerequisite Knowledge of Audience

Software engineering

Detailed Outline

1. Introduction
The introduction sets the stage by explaining the importance of software reliability: software contributes more to outages and downtime than hardware does.
2. Inadequacy of Fault Avoidance and Fault Removal
Sound software engineering practices and formal methods have their limitations, so testing and debugging to remove faults is the natural next step. In spite of all this, delivered software still contains many bugs, which makes it natural to consider software fault tolerance.
3. Traditional Fault Tolerance
Traditional software fault tolerance is based on design diversity, since it was thought that identical copies of software, unlike identical copies of hardware, would not be useful. These methods are very expensive, however, and hence are used only in safety-critical systems. Yet failure-free operation is desired in other applications as well.
4. Software Fault Tolerance in Some Real Systems
We examine the types of fault tolerance used in real-life software systems. We find that identical copies are used as a form of software redundancy, and that recovery after a failure caused by a software bug consists of restarting the software or rebooting the node, without fixing the bug that caused the failure.
5. Classification of Software Bugs
The classification of bugs into Bohrbugs and Mandelbugs is key to understanding the successful use of restart/reboot as recovery methods and of identical copies of software as a form of redundancy. We explore work done to define and study these bug types.
6. Environmental Diversity as a Method of Software Fault Tolerance
We submit that restart/reboot-based recovery after a software failure and the use of identical software copies as a form of redundancy work because a significant proportion of bugs in software systems are Mandelbugs. These bugs are more elusive: interactions with OS resources and with other concurrently running processes take part in activating them and leading to a failure. Restart/reboot, or failover to an identical software copy executing on another virtual machine, changes the environment in which the software operates.
7. References
Relevant references will be provided.
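
To make outline items 5 and 6 more concrete, here is a minimal Python sketch of why a restart can help with one class of bug and not the other. It is a deliberately simplified model, not the tutorial's own material: bohrbug_op, mandelbug_op, succeeds_after_restart and the free_buffers field are hypothetical names, and a real Mandelbug involves far more complex activation conditions than a single counter.

```python
from typing import Callable, Dict

Env = Dict[str, int]


def bohrbug_op(x: int, env: Env) -> int:
    # Bohrbug-like fault: fails every time for a given input,
    # whatever the state of the environment.
    if x == 0:
        raise ZeroDivisionError("x must be non-zero")
    return 100 // x


def mandelbug_op(x: int, env: Env) -> int:
    # Mandelbug-like fault: the same input fails only when the
    # environment happens to be in a bad state.
    if env["free_buffers"] == 0:
        raise MemoryError("no free buffers")
    return 100 // x


def succeeds_after_restart(op: Callable[[int, Env], int], x: int, env: Env) -> bool:
    """Run op; on failure, model a restart and re-run with the same input."""
    try:
        op(x, env)
        return True  # no failure in the first place
    except Exception:
        env["free_buffers"] = 8  # the restart resets the environment only
    try:
        op(x, env)
        return True
    except Exception:
        return False  # same code, same input, same failure


bohr_recovered = succeeds_after_restart(bohrbug_op, 0, {"free_buffers": 0})
mandel_recovered = succeeds_after_restart(mandelbug_op, 5, {"free_buffers": 0})
print(bohr_recovered, mandel_recovered)
```

The restart changes neither the code nor the input, so the input-triggered failure recurs, while the environment-triggered failure disappears once the environment has been reset.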
