eXplode: Effective Model Checking of Real Systems

We are developing an approach called in-situ model checking to thoroughly check general systems software in a lightweight manner. We've made our approach so easy that we have applied it to more than 20 widely used, well tested systems and found nearly a hundred serious errors. Some of these errors can cause unrecoverable loss of an entire file system. Currently we're collaborating with researchers at Microsoft Research to apply the eXplode approach to distributed systems.

Our goal is to effectively detect these bugs. The main technique we use is model checking. This formal verification technique systematically enumerates the possible execution paths of a distributed system by starting from an initial state and repeatedly performing all possible actions to this state and its successors. This state-space exploration makes rare actions such as machine crashes and network failures appear as often as common ones, thereby quickly driving a system into corner cases where subtle bugs surface.

However, naive application of model checking to large systems is prohibitive because of the huge cost required to write an abstract specification of the checked system. Recent work has developed implementation-level model checkers that check code directly, but these checkers still require an invasive, heavyweight port of the checked system to run inside these model checkers.

Instead of the heavyweight way of creating a simulated environment to run a system, we created an in-situ checking architecture that interlaces the mechanism for comprehensive checking within the checked system. This architecture enables users to check live systems, drastically reducing the overhead and the invasive infrastructure needed. For example, in eXplode (our storage system checker), the checking infrastructure is reduced down to a single device driver, dynamically loadable to a running OS kernel. These drivers are fairly small and easy to write; both the Linux and FreeBSD drivers of eXplode are less than two thousand lines of code. Once such a driver is loaded, eXplode can readily check any storage system that runs inside or above the OS kernel. Compared to the cost of weeks or months to check a system using the old approaches, eXplode only requires minutes, several orders of magnitude reduction.

Using this approach, we have found numerous serious errors in 20 widely used, well-tested systems. For example, we found data-loss errors that can vaporize all user data in 17 storage systems, including three version control software, a database, Linux NFS, ten local file systems, a software RAID, and a popular commercial virtual machine.

Publications

Practical Software Model Checking via Dynamic Interface Reduction

Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Junfeng Yang, Lintao Zhang
Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11), October 2011

MODIST: Transparent Model Checking of Unmodified Distributed Systems

Junfeng Yang, Tisheng Chen, Ming Wu, Zhilei Xu, Xuezheng Liu, Haoxiang Lin, Mao Yang, Fan Long, Lintao Zhang, Lidong Zhou
Proceedings of the Sixth Symposium on Networked Systems Design and Implementation (NSDI '09), April 2009

EXPLODE: a Lightweight, General System for Finding Serious Storage System Errors

Proceedings of the Seventh Symposium on Operating Systems Design and Implementation (OSDI '06), November 2006

Using Model Checking to Find Serious File System Errors

Junfeng Yang, Paul Twohey, Dawson Engler, Madanlal Musuvathi
ACM Transactions on Computer Systems, Volume 24, Issue 4, November 2006

eXplode: A Lightweight, General Approach for Finding Serious Errors in Storage Systems

Junfeng Yang, Paul Twohey, Ben Pfaff, Can Sar, Dawson Engler
Proceedings of the first Workshop on the Evaluation of Software Defect Detection Tools (BUGS '05), June 2005

Using Model Checking to Find Serious File System Errors

Junfeng Yang, Paul Twohey, Dawson Engler, Madanlal Musuvathi
Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (OSDI '04), December 2004

Download eXplode

The above are three ways to get eXplode. You can download the eXplode source code locally, or download a virtual machine image with eXplode and eXplode-patched kernels compiled, or get the source code from sourceforge. You can browse the version history of eXplode at sourceforge.

For instructions to compile, build, and use eXplode, Please refer to the README file in the top-level directory after you uncompress the explode tar ball.

The eXplode distribution contains a generic model checker that can be applied to other systems. For details, please refer to the README file under directory mcl.
Real systems are difficult to get right because they must correctly handle a practically infinite number of rare events. For example, file systems must correctly recover from all possible crashes; distributed systems must ensure consistency and liveness despite of a large variety of rare events, such as machine crashes, network partition, message delays, and message loss. The complexity to handle these rare events often leads to corner-case errors that are difficult to test, and once detected in the field, impossible to reproduce.