This post is part of my collaborative research with Shinsei Bank on highly-evolvable enterprise software. It is licensed under the Creative Commons Attribution-ShareAlike 3.0 license. I am indebted to Jay Dvivedi and his team at Shinsei Bank for supporting this research. All errors are my own.
In an earlier post, I observed in a footnote that Jay is more interested in detecting errors than in guaranteeing correct output, because once an error has been detected, the problem can be contained and the cause of the error can be identified and eliminated. I suggested that this approach is far easier than the problem of guaranteeing correctness tackled in some research on reliable computing. I recently had an opportunity to speak with Dr. Mary Loomis about this research, and she was intrigued by this idea and encouraged me to explore it further. At this point, I’m not aware of any academic work on the topic, but please let me know if you know of any academic papers that might be relevant.
Anyway, I thought it might be interesting to state the idea more precisely with the aid of a toy model. Let’s assume that we need to perform a computation, and that the cost of acting on an incorrect output is unacceptably high, while the cost of inaction or delaying action is relatively low. This might be the case for a decision to make a loan: making a loan to an unqualified borrower could be very costly, while turning away a potential borrower or delaying disbursement of the loan carries a far smaller opportunity cost. Let us further assume that any system component we deploy will be unreliable, with a probability e of producing incorrect output.
A simple (and naive, see here for a more sophisticated treatment) approach to guaranteeing correctness would be to deploy three systems. For a binary output, such as whether to make a loan, we can simply take the output given by a majority of the systems. For simplicity, I'll assume that no errors occur when computing the majority, although that's not necessarily the case. Under these assumptions, an incorrect output results only when two or three of the systems produce incorrect output, the likelihood of which is (assuming I got the computation right) 3e² - 2e³. If e is on the order of a percent or so, we have a roughly 0.03% probability of incorrect output. Of course, the likelihood of an error may be much higher when components are new or recently modified. If e is closer to 20%, the probability of incorrect output exceeds 10%.
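To check the arithmetic, here is a minimal Python sketch (my own illustration, not anything drawn from Shinsei's systems) that simulates three independent binary components, each wrong with probability e, and compares the simulated majority-vote error rate against the 3e² - 2e³ formula:

```python
import random

def majority_vote(outputs):
    """Return the answer given by a majority of the (binary) components."""
    return sum(outputs) > len(outputs) // 2

def simulate_majority_error(e, trials=200_000):
    """Estimate how often a three-component majority vote is wrong
    when each component independently errs with probability e."""
    correct = True  # the right answer, e.g. "approve the loan"
    wrong_votes = 0
    for _ in range(trials):
        outputs = [(not correct) if random.random() < e else correct
                   for _ in range(3)]
        if majority_vote(outputs) != correct:
            wrong_votes += 1
    return wrong_votes / trials

for e in (0.01, 0.20):
    analytic = 3 * e**2 - 2 * e**3
    print(f"e={e:.2f}: simulated {simulate_majority_error(e):.5f}, "
          f"analytic {analytic:.5f}")
```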
Now, consider the approach where we seek to detect incorrect output and prevent any action from being taken on it until the problem has been resolved. This time, we similarly deploy three redundant components, but we assume the output to be correct only if we do not detect an error. We detect an error any time the outputs disagree. We fail to detect an error only when all three systems fail (for a binary output, three failing components all produce the same wrong answer, so there is no disagreement to observe), which occurs with probability e³. For e of 1%, we fail to detect incorrect output with 0.0001% probability, while for e of 20%, we have a 0.8% probability of overlooking incorrect output. The detecting wrongness approach thus seems likely to be more robust, though it won't work if the cost of delaying action is high, in which case we must come up with our best answer even if we detect a malfunctioning component.
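The same comparison in code, again only a sketch: checked_output below (a helper name I am inventing for illustration) releases an output only when all three components agree, and the short loop prints the analytic probability of acting on a wrong answer under each approach.

```python
def checked_output(outputs):
    """Return the common output only if all components agree.
    Any disagreement is treated as a detected error: withhold the
    output and investigate rather than act on a possibly wrong answer."""
    if all(o == outputs[0] for o in outputs):
        return outputs[0]
    raise RuntimeError("Disagreement detected; withhold output and investigate")

# Probability of acting on a wrong answer with three components, error rate e each
for e in (0.01, 0.20):
    print(f"e={e:.2f}: majority voting {3*e**2 - 2*e**3:.6f}, "
          f"detect-and-halt {e**3:.6f}")
```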
The problem changes somewhat when outputs are not binary. If the range of possible outputs is large, the likelihood of malfunctioning systems agreeing on an incorrect answer (in the absence of strategic collusion among the components) is small, perhaps even effectively zero. In this case, the majority voting approach may not yield a conclusive result, so it ends up resembling the detecting wrongness approach with a less sensitive detector: when there is no majority, we conclude that no output can be selected with sufficient certainty and report an error.
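A sketch of that degenerate form, with a hypothetical vote_or_flag helper: it returns a value only when a strict majority of components agree on it, and otherwise reports an error instead of guessing.

```python
from collections import Counter

def vote_or_flag(outputs):
    """Majority vote over arbitrary (non-binary) outputs. With a large
    output space, faulty components rarely agree, so the absence of a
    strict majority is itself treated as a detected error."""
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:
        return value
    raise RuntimeError("No majority; report an error instead of an output")

print(vote_or_flag([1200, 1200, 1200]))  # all agree -> 1200
print(vote_or_flag([1200, 1200, 1250]))  # one fault, still a majority -> 1200
# vote_or_flag([1200, 1250, 1300]) would raise: no majority, so no output
```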
The technique of mutually verifying dualism, which I discussed previously, probably improves the likelihood of detecting errors by exploiting the unlikeliness of multiple, independent errors that exactly balance each other. If we have two systems that each sum a collection of numbers, and we know that the outputs of the two systems should be equal in magnitude and opposite in sign, the likelihood of errors occurring independently in both systems that distort the outputs by equal amounts in opposite directions is vanishingly small. Intuitively, it seems that the more complex the relationship between the outputs of the mutually verifying components, the less likely that errors, and perhaps also tampering, will go undetected.
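To make the idea concrete, here is a toy sketch; the debit/credit framing and the function names are mine, not a description of Jay's actual implementation. Two independently computed totals should be equal in magnitude and opposite in sign, and any non-zero sum is treated as a detected error.

```python
def debit_total(amounts_in_cents):
    """First system: total of the entries posted as debits (positive)."""
    return sum(amounts_in_cents)

def credit_total(amounts_in_cents):
    """Second system: total of the same entries posted as credits (negative),
    computed independently from its own copy of the data."""
    return sum(-a for a in amounts_in_cents)

def verified_total(amounts_in_cents):
    """Mutually verifying dualism: release the total only if the two
    systems' outputs cancel exactly; otherwise flag a detected error."""
    d = debit_total(amounts_in_cents)
    c = credit_total(amounts_in_cents)
    if d + c != 0:
        raise RuntimeError(f"Mismatch detected: debits {d}, credits {c}")
    return d

print(verified_total([10000, 25050, -7525]))  # totals cancel -> 27525
```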
Finally, it also seems worth noting that the consequences of incorrect outputs may not be immediate or direct (for example, erroneous loan decisions may increase the losses from defaults several years later), so the return to investments in reliable computing may be difficult to measure.
Mr. David, the process of one person entering data and a second person either re-entering or checking it is used in most enterprise systems. These are called maker-maker or maker-checker functions, and they exist to avoid data-entry errors. Which technique is selected depends on the specific context, e.g. volume.
I think the loan example is a poor choice, because those decisions are made on the basis of well-defined rules. These can easily be codified, tested, and put to use to produce a pass or fail result, which is then checked by a person before the final decision is made. The first part could be done either by a system or by a person, depending on the volume of loan applications.
Secondly, even if there is a majority, the decision could be wrong; in team decision-making this is called groupthink.
Quality control is well documented, and there are established ways to calculate the cost of a bad product. As with your loan example, even when a car has been produced and has to be recalled to fix an issue (similar to what you described), the cost can still be calculated.
This really seems a nice approach for analyzing the kind of control systems that would need to be designed into a process. We all accept that errors will creep in from various sources: data errors, analysis errors, or even errors of judgment. What is most useful, however, is working out how to keep them from passing through, or from a wrong decision step being acted upon. I think the model of using the cost factor to look at the complete process, up to the point where the final decision is acted on, is very worthwhile.
It would be a good idea to have a little discussion on this, so that this kind of modeling exercise can be extended to various situations and the corresponding error-control processes designed. Certainly, the thinking process of understanding errors needs a radical view like this.
Thanks a lot for making such a thinking process look so simple. I am sure it is not so simple to start thinking in this manner.