We are developing an approach called in-situ model checking to thoroughly check general systems software in a lightweight manner. We've made our approach so easy that we have applied it to more than 20 widely used, well tested systems and found nearly a hundred serious errors. Some of these errors can cause unrecoverable loss of an entire file system. Currently we're collaborating with researchers at Microsoft Research to apply the eXplode approach to distributed systems.
Our goal is to effectively detect these bugs. The main technique we use is model checking. This formal verification technique systematically enumerates the possible execution paths of a distributed system by starting from an initial state and repeatedly performing all possible actions to this state and its successors. This state-space exploration makes rare actions such as machine crashes and network failures appear as often as common ones, thereby quickly driving a system into corner cases where subtle bugs surface.However, naive application of model checking to large systems is prohibitive because of the huge cost required to write an abstract specification of the checked system. Recent work has developed implementation-level model checkers that check code directly, but these checkers still require an invasive, heavyweight port of the checked system to run inside these model checkers.
Instead of the heavyweight way of creating a simulated environment to run a system, we created an in-situ checking architecture that interlaces the mechanism for comprehensive checking within the checked system. This architecture enables users to check live systems, drastically reducing the overhead and the invasive infrastructure needed. For example, in eXplode (our storage system checker), the checking infrastructure is reduced down to a single device driver, dynamically loadable to a running OS kernel. These drivers are fairly small and easy to write; both the Linux and FreeBSD drivers of eXplode are less than two thousand lines of code. Once such a driver is loaded, eXplode can readily check any storage system that runs inside or above the OS kernel. Compared to the cost of weeks or months to check a system using the old approaches, eXplode only requires minutes, several orders of magnitude reduction.
Using this approach, we have found numerous serious errors in 20 widely used, well-tested systems. For example, we found data-loss errors that can vaporize all user data in 17 storage systems, including three version control software, a database, Linux NFS, ten local file systems, a software RAID, and a popular commercial virtual machine.