Record and Transplay: Partial Checkpointing for Replay Debugging Across Heterogeneous Systems

Dinesh Subhraveti, Jason Nieh

Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2011), San Jose, CA, June 7-11, 2011


Software bugs that occur in production are often difficult to reproduce in the lab due to subtle differences in the application environment and nondeterminism. To address this problem, we present Transplay, a system that captures production software bugs into small per-bug recordings that are used to reproduce the bugs on a completely different operating system without access to any of the original software used in the production environment. Transplay introduces partial checkpointing, a new mechanism that efficiently captures the partial state necessary to reexecute just the last few moments of the application before it encountered a failure. The recorded state, which typically consists of a few megabytes of data, is used to replay the application without requiring the specific application binaries, libraries, support data, or the original execution environment. Transplay integrates with existing debuggers to provide standard debugging facilities, allowing the user to examine the contents of variables and other program state at each source line of the application’s replayed execution. We have implemented a Transplay prototype that can record unmodified Linux applications and replay them on different versions of Linux as well as Windows. Experiments with several applications, including Apache and MySQL, show that Transplay can reproduce real bugs and be used in production with modest recording overhead.



Columbia University Department of Computer Science