Is anyone here working on software fault tolerance? If so, can anybody suggest how to determine the fault tolerance of a system, and what metrics are available to measure it?
I have gone through much of the literature available on the net, but most of it addresses fault tolerance for very highly reliable systems like avionics. Can anybody suggest how to determine fault tolerance in client/server software such as an ERP application or a web application?
Thanks in advance.
Software Test Engineer
If you cannot think about changing yourself, you don't have the right to change the world.
[Just a note: I will be moving this to the Software Process Improvement forum. So please look for it there.]
Realize that talking about fault tolerance is one way of talking about reliability and availability, depending upon context. You have to be careful with these, because it is possible to go way overboard relative to what you really need and spend more time on your metrics than you do actually measuring your "fault tolerance". Just remember that fault tolerance, strictly defined, is a prescribed, permissible level of failure that is tolerated within the context of your system. So you can relate that to availability. You can also relate that to reliability, which then speaks to the number of defects that you still have (or estimate that you still have) in your system. (Here I am using "system" in the broad sense, so that it could apply to a client/server application or a simple desktop application.)
This is actually quite a large topic, even though it might not appear so on the surface, because strict fault tolerance can depend, to a large extent, on whatever fault avoidance techniques you use. (These are intended to keep faults out of the system at the design stage as much as possible.) It also depends on the fault detection techniques you use to find faults within the system.
Finally, it depends on fault containment, which is what you do to limit the spread of the effects of a fault from one area of the system into another. (Bear in mind that this last point can be addressed at the code stage, such as by introducing various object-oriented concepts or modularization.)
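To make the containment idea a little more concrete, here is a minimal sketch of containing a fault at the code level. Everything in it is an assumption for illustration: the helper `run_isolated`, the `ContainmentReport` type, and the two example components are hypothetical names, not part of any particular ERP or web framework. Each component runs behind its own boundary, so a failure in one is detected and recorded without stopping the others.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ContainmentReport:
    """Hypothetical record of which components succeeded and which faulted."""
    results: Dict[str, object] = field(default_factory=dict)
    faults: Dict[str, str] = field(default_factory=dict)


def run_isolated(components: Dict[str, Callable[[], object]]) -> ContainmentReport:
    """Run each component behind its own boundary.

    A fault in one component is detected (caught and recorded) and
    contained (the remaining components still run).
    """
    report = ContainmentReport()
    for name, component in components.items():
        try:
            report.results[name] = component()
        except Exception as exc:
            # Detection: note what failed. Containment: keep going.
            report.faults[name] = f"{type(exc).__name__}: {exc}"
    return report


# Hypothetical components of a client/server application.
def load_orders():
    return ["order-1", "order-2"]


def load_invoices():
    raise ConnectionError("invoice service unreachable")


report = run_isolated({"invoices": load_invoices, "orders": load_orders})
print("succeeded:", sorted(report.results))
print("faulted:", report.faults)
```

The count of faulted components versus total components per run is one simple number you could track over time, if containment is one of the stages you care about.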
So the point is that "fault tolerance" can come in at a lot of stages. Which metrics are most valuable to you depends on the stages you use and the techniques you apply.
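As a rough illustration of relating a tolerated level of failure to availability, the sketch below computes three commonly used figures from an outage log: mean time between failures (uptime divided by number of failures), mean time to repair (downtime divided by number of failures), and availability (uptime divided by total observed time). The function name, the 720-hour observation window, the outage durations, and the 99% target are all made-up assumptions; substitute whatever your own context prescribes.

```python
from typing import Dict, List


def reliability_metrics(observed_hours: float, outage_hours: List[float]) -> Dict[str, float]:
    """Compute MTBF, MTTR and availability from a list of outage durations."""
    if observed_hours <= 0:
        raise ValueError("observed_hours must be positive")
    downtime = sum(outage_hours)
    if downtime > observed_hours:
        raise ValueError("total downtime cannot exceed the observation window")

    failures = len(outage_hours)
    uptime = observed_hours - downtime
    return {
        # With no failures observed, MTBF is unbounded and MTTR is zero.
        "mtbf_hours": uptime / failures if failures else float("inf"),
        "mttr_hours": downtime / failures if failures else 0.0,
        "availability": uptime / observed_hours,
    }


# Assumed example: a 30-day window (720 hours) with three outages.
metrics = reliability_metrics(720.0, [1.5, 0.5, 2.0])

# Assumed tolerance: the system may be unavailable at most 1% of the time.
TOLERATED_AVAILABILITY = 0.99
within_tolerance = metrics["availability"] >= TOLERATED_AVAILABILITY

print(f"MTBF: {metrics['mtbf_hours']:.1f} h")
print(f"MTTR: {metrics['mttr_hours']:.2f} h")
print(f"Availability: {metrics['availability']:.4%}")
print("Within tolerated level:", within_tolerance)
```

Note that the comparison at the end is the part that corresponds to "fault tolerance" in the strict sense: the metrics only become meaningful once you have decided what level of failure is permissible.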
There are two papers you might want to check out. The first of these is from NASA, and it is about software reliability. There is nothing wrong with looking at how other industries do these things. You have to adapt what is useful to your own software context, but that is no harder than doing the same thing for Six Sigma.