This post is part of my collaborative research with Shinsei Bank on highly-evolvable enterprise software. It is licensed under the Creative Commons Attribution-ShareAlike 3.0 license. I am indebted to Jay Dvivedi and his team at Shinsei Bank for supporting this research. All errors are my own.
In an earlier post, I observed in a footnote that Jay is more interested in detecting errors than in guaranteeing correct output, because once an error has been detected, the problem can be contained and the cause of the error can be identified and eliminated. I suggested that this approach is far easier than the problem of guaranteeing correctness tackled in some research on reliable computing. I recently had an opportunity to speak with Dr. Mary Loomis about this research, and she was intrigued by this idea and encouraged me to explore it further. At this point, I’m not aware of any academic work on the topic, but please let me know if you know of any academic papers that might be relevant.
Anyway, I thought it might be interesting to state the idea more precisely with the aid of a toy model. Let’s assume that we need to perform a computation, and that the cost of acting on an incorrect output is unacceptably high, while the cost of inaction or delaying action is relatively low. This might be the case for a decision to make a loan: making a loan to an unqualified borrower could be very costly, while turning away a potential borrower or delaying disbursement of the loan carries a far smaller opportunity cost. Let us further assume that any system component we deploy will be unreliable, with a probability e of producing incorrect output.
A simple (and naive, see here for a more sophisticated treatment) approach to guaranteeing correctness would be to deploy three redundant systems. For a binary output, such as whether to make a loan, we can simply take the output given by a majority of the systems. For simplicity, I’ll assume that no errors occur when computing the majority, although that’s not necessarily the case. Under these assumptions, an incorrect output results only when at least two of the three systems produce incorrect output, the likelihood of which is 3e² − 2e³. If e is on the order of a percent or so, we have a roughly 0.03% probability of incorrect output. Of course, the likelihood of an error may be much higher when components are new or recently modified. If e is closer to 20%, the probability of incorrect output exceeds 10%.
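To make the arithmetic concrete, here is a small sketch in Python (the function name is mine, chosen for illustration):

```python
def majority_error(e: float) -> float:
    """Probability that a 2-of-3 majority vote produces incorrect output,
    given three independent components that each err with probability e.
    P(exactly 2 err) + P(all 3 err) = 3*e^2*(1-e) + e^3 = 3e^2 - 2e^3."""
    return 3 * e**2 - 2 * e**3

print(majority_error(0.01))  # ~0.000298, roughly 0.03%
print(majority_error(0.20))  # 0.104, just over 10%
```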
Now, consider the approach where we seek to detect incorrect output and prevent any action from being taken on it until the problem has been resolved. This time, we similarly deploy three redundant components, but we assume the output to be correct only if we do not detect an error, and we detect an error any time the outputs disagree. We fail to detect an error only when all three systems fail, which occurs with probability e³: with a binary output, three failed systems necessarily agree on the same incorrect answer, leaving no disagreement to flag. For e of 1%, we fail to detect incorrect output with 0.0001% probability, while for e of 20%, we have a 0.8% probability of overlooking incorrect output. The detecting wrongness approach thus seems likely to be more robust, though it won’t work if the cost of delaying action is high; in that case, we must come up with our best answer even if we detect a malfunctioning component.
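A quick Monte Carlo check of both failure modes, under the same independence assumption (the values in the final comment are the analytic ones; the simulation should land close to them):

```python
import random

random.seed(0)

def simulate(e, trials=200_000):
    """Each trial draws three independent components that err with
    probability e. With a binary output, an erring component necessarily
    produces the single wrong answer, so the majority vote is wrong when
    two or more components err, while the disagreement detector misses
    the error only when all three err (unanimous wrong output)."""
    vote_wrong = detector_miss = 0
    for _ in range(trials):
        errors = sum(random.random() < e for _ in range(3))
        vote_wrong += errors >= 2
        detector_miss += errors == 3
    return vote_wrong / trials, detector_miss / trials

print(simulate(0.20))  # close to the analytic values (0.104, 0.008)
```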
The problem changes somewhat when outputs are not binary. If the range of possible outputs is large, the likelihood of malfunctioning systems agreeing on an incorrect answer (in the absence of strategic collusion among the components) is small, perhaps even effectively zero. In this case, the majority voting approach may not yield a conclusive result, so it ends up resembling the detecting wrongness approach with a less sensitive detector: when there is no majority, we conclude that no output can be selected with sufficient certainty and report an error.
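One way to sketch this degenerate voting rule (the function name is my own): when no two of the three outputs agree, report an error rather than guess.

```python
def vote_or_flag(outputs):
    """Return the value produced by a majority of the outputs, or None
    when there is no majority, signalling that no output can be selected
    with sufficient certainty."""
    for candidate in set(outputs):
        if outputs.count(candidate) >= 2:
            return candidate
    return None

print(vote_or_flag([42, 42, 17]))  # 42: two components agree
print(vote_or_flag([42, 17, 99]))  # None: no majority, report an error
```

With a large output range, two independently erring components almost never agree on the same wrong value, so in practice this behaves like the disagreement detector, except that it silently tolerates a single failed component.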
The technique of mutually verifying dualism, which I discussed previously, probably improves the likelihood of detecting errors by exploiting the unlikeliness of multiple, independent errors that exactly balance each other. If we have two systems that each sum a collection of numbers, and we know that the outputs of the two systems should be equal in magnitude and opposite in sign, the likelihood of errors occurring independently in both systems that distort the outputs by equal amounts in opposite directions is vanishingly small. Intuitively, it seems that the more complex the relationship between the outputs of the mutually verifying components, the less likely that errors, and perhaps also tampering, will go undetected.
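A minimal sketch of the summing example (the helper name and the fault-injection hook are mine, for demonstration): two paths compute the total, one over the values and one over their negations, and action is withheld whenever the results fail to cancel.

```python
def verified_sum(values, corrupt_backward_by=0):
    """Sum the values along two paths whose outputs should be equal in
    magnitude and opposite in sign; raise if they fail to cancel.
    `corrupt_backward_by` injects a fault purely for demonstration."""
    forward = sum(values)
    backward = sum(-v for v in values) + corrupt_backward_by
    if forward + backward != 0:
        raise ValueError("mutual verification failed; withholding output")
    return forward

print(verified_sum([100, -25, 7]))  # 82, and the two paths cancel
# verified_sum([100, -25, 7], corrupt_backward_by=1) raises: a single
# fault is caught, and an undetected error now requires two independent
# faults that balance each other exactly.
```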
Finally, it also seems worth noting that the consequences of incorrect outputs may not be immediate or direct (for example, erroneous loan decisions may increase the losses from defaults several years later), so the return to investments in reliable computing may be difficult to measure.