Transparent Checkpoint-Restart of Multiple Processes on Commodity Operating Systems

Proceedings of the 2007 USENIX Annual Technical Conference, Santa Clara, CA, June 17-22, 2007, pp. 323-336


The ability to checkpoint a running application and restart it later can provide many useful benefits, including fault recovery, advanced resource sharing, dynamic load balancing, and improved service availability. However, applications often involve multiple processes that have dependencies through the operating system. We present a transparent mechanism for commodity operating systems that can checkpoint multiple processes in a consistent state so that they can be restarted correctly at a later time. We introduce an efficient algorithm for recording process relationships and correctly saving and restoring shared state in a manner that leverages existing operating system kernel functionality. We have implemented our system as a loadable kernel module and user-space utilities in Linux. We demonstrate its ability on real-world applications to provide transparent checkpoint-restart functionality without modifying, recompiling, or relinking applications, libraries, or the operating system kernel. Our results show checkpoint and restart times 3 to 55 times faster than OpenVZ and 5 to 1100 times faster than Xen.



Columbia University Department of Computer Science