January 4, 2016
Fault tolerance quality attribute
Nowadays, we critically depend on a vast number of software products. They do their job every minute of our lives, and some of those minutes are really challenging for them. While we become more and more aware of the complexity of software development, an average end user isn’t filled with compassion and understanding when an application they have bought suddenly loses data or simply fails. Needless to say, some apps are built to save lives or to prevent disasters, so their failures can be literally fatal.
According to ISO 9126, fault tolerance is part of the Reliability quality attribute group, and it represents the ability of a system to withstand component failure.
A fault can always happen, and nobody is immune to that. That’s why software products are designed around four main approaches. Each accepts that faults are possible, but each plans and implements its countermeasures differently.
- Fault prevention: the software is designed to be as fault-free as possible. This approach requires a very thoughtful and slightly paranoid perspective from developers, because every possible issue should be taken into account. It’s very time-consuming and is mainly restricted by the time and cost limits of a project.
- Fault removal: here comes the testing team. Once the development stage is completed, it’s the testers’ turn to check everything in as much detail as possible.
- Fault tolerance: this approach doesn’t attempt to prevent a fault or discover it. It is based on the assumption that there’s no way to detect all possible faults or to create a failure-free design. That’s why the system should be designed so that it keeps operating properly even if faults occur.
- Fault forecasting: planning for faults, investigating their possible presence, and calculating the chances of their future occurrence.
Software faults are not as identical as twins; however, they are all design faults, and they can be classified by the phase in which they occur, system boundaries, cause, intent, and persistence (Xie, Sun and Saluja, 2001).
Certainly, it is important for a system to be fault-tolerant at the hardware level as well as the software level, and software fault tolerance can’t be assured without a trustworthy hardware foundation.
Current methods for software fault tolerance include recovery blocks, N-version programming, and self-checking software. Their application depends on environment settings and characteristics.
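To make the first of these methods more concrete, here is a minimal Python sketch of a recovery block: alternates are tried in order until one passes an acceptance test. All names in it (recovery_block, buggy_sort, simple_sort, is_sorted_permutation) are hypothetical and chosen only for illustration. The sketch also assumes the alternates are pure functions, so it skips the state checkpoint and rollback that a real recovery block would need before trying the next alternate.

```python
from collections import Counter


class RecoveryBlockError(Exception):
    """Raised when no alternate produces an acceptable result."""


def recovery_block(alternates, acceptance_test, data):
    """Try each alternate in order and return the first accepted result."""
    for alternate in alternates:
        try:
            result = alternate(data)
        except Exception:
            continue  # a crashed alternate counts as a failed attempt
        if acceptance_test(data, result):
            return result
    raise RecoveryBlockError("no alternate passed the acceptance test")


def buggy_sort(items):
    # Hypothetical primary alternate with a design fault: it never sorts.
    return list(items)


def simple_sort(items):
    # Hypothetical secondary alternate: simpler and trusted.
    return sorted(items)


def is_sorted_permutation(original, result):
    # Acceptance test: same elements as the input, in non-decreasing order.
    same_elements = Counter(original) == Counter(result)
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return same_elements and ordered


if __name__ == "__main__":
    print(recovery_block([buggy_sort, simple_sort], is_sorted_permutation, [3, 1, 2]))
```

The quality of the acceptance test decides how much protection the block really gives: a test that is too weak lets wrong results through, while a test as complex as the algorithm itself is just as likely to contain faults.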
According to Laura L. Pullum, the applicable techniques depend on the environment:
Monitoring techniques, atomicity of actions, decision verification, and exception handling may be used to partially tolerate software design faults in a Single Version Software Environment (SVSE).
Multiple Version Software Environment (MVSE) requires design-diverse techniques, which rely on independently developed, functionally equivalent software versions to tolerate software design faults. These design techniques include recovery blocks (RcB), N-version programming (NVP), and N self-checking programming (NSCP).
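As a rough illustration of the design-diverse idea, the following Python sketch shows N-version programming with a simple majority voter. The three "versions" are hypothetical stand-ins for independently developed implementations of the same specification (sum of squares), and n_version_execute and NoMajorityError are names invented for this example. It assumes that results are hashable and can be compared for exact equality, which is often not true for floating-point outputs.

```python
from collections import Counter


class NoMajorityError(Exception):
    """Raised when the versions cannot agree on a majority result."""


def n_version_execute(versions, data):
    """Run every version on the same input and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(data))
        except Exception:
            pass  # a crashed version simply casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        # A strict majority of all versions, not only of the survivors.
        if votes > len(versions) // 2:
            return value
    raise NoMajorityError("versions did not reach a majority")


def version_a(numbers):
    return sum(x * x for x in numbers)


def version_b(numbers):
    total = 0
    for x in numbers:
        total += x ** 2
    return total


def version_c(numbers):
    # Hypothetical design fault: this team forgot to square the values.
    return sum(numbers)


if __name__ == "__main__":
    print(n_version_execute([version_a, version_b, version_c], (1, 2, 3)))
```

The voter only helps if the versions fail independently. If two teams misread the specification in the same way, the faulty answer wins the vote.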
Multiple Data Representation Environment (MDRE) involves data-diverse techniques, which use different representations of the input data to provide tolerance to software design faults. Examples of such techniques include retry blocks (RtB), N-copy programming and N-self-checking programming.
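The data-diverse idea can be sketched in the same way. In the simplified Python retry block below, a single algorithm is rerun on a re-expressed but logically equivalent input whenever it crashes or fails the acceptance test. Everything here is an assumption made for illustration: fragile_mean is a made-up algorithm with a deliberate design fault, rotate is one possible re-expression (the mean does not depend on element order), and the watchdog timer and backup algorithm of a full retry block are left out.

```python
class RetryBlockError(Exception):
    """Raised when every re-expressed input fails as well."""


def retry_block(algorithm, acceptance_test, re_express, data, max_retries=3):
    """Run one algorithm, re-expressing the input after each failure."""
    candidate = data
    for _ in range(max_retries + 1):
        try:
            result = algorithm(candidate)
            if acceptance_test(data, result):
                return result
        except Exception:
            pass  # treat a crash like a failed acceptance test
        candidate = re_express(candidate)
    raise RetryBlockError("all re-expressions of the input failed")


def fragile_mean(values):
    # Hypothetical design fault: scaling by the first element breaks
    # the computation whenever that element is zero.
    scale = values[0]
    return sum(v / scale for v in values) / len(values) * scale


def rotate(values):
    # Re-expression: same elements, different order.
    return values[1:] + values[:1]


def mean_in_range(original, result):
    # Acceptance test: a mean must lie between the minimum and the maximum.
    return min(original) <= result <= max(original)


if __name__ == "__main__":
    print(retry_block(fragile_mean, mean_in_range, rotate, [0.0, 2.0, 4.0]))
```

Unlike the design-diverse techniques, this needs only one implementation, but it works only for faults that are triggered by a particular form of the input.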
But the main challenge in developing fault-tolerant products is the conflict between redundancy requirements and economic constraints. Redundancy comes at a high price: operating cost (performance), development cost, and additional complexity. In fact, hardware redundancy is much cheaper, because hardware faults are usually expected to be independent and caused by wear, and producing identical hardware units costs less than developing diverse designs to make software tolerate faults.
And, in conclusion, let us all remember that fault tolerance is not an alternative to performing regular backups.