- AboutThis should describe the systems research collaboration, and present the overall research goals of the new group.
- PeopleHere are the different labs in the SRC…
- PublicationsA page where you will find categorized publications!
- ProjectsA page where you will find our projects
- ResourcesVarious resources for prospective students, current students, alumni. Maybe put something here about life in NYC and at Columbia…
Publications from 2009
Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2009), March 2009
Software failures in server applications are a significant problem for preserving system availability. We present AS- SURE, a system that introduces rescue points that recover software from unknown faults while maintaining both sys- tem integrity and availability, by mimicking system behav- ior under known error conditions. Rescue points are loca- tions in existing application code for handling a given set of programmer-anticipated failures, which are automatically repurposed and tested for safely enabling fault recovery from a larger class of (unanticipated) faults. When a fault occurs at an arbitrary location in the program, ASSURE restores execution to an appropriate rescue point and in- duces the program to recover execution by virtualizing the programâ€™s existing error-handling facilities. Rescue points are identified using fuzzing, implemented using a fast co- ordinated checkpoint-restart mechanism that handles multi- process and multi-threaded applications, and, after testing, are injected into production code using binary patching. We have implemented an ASSURE Linux prototype that oper- ates without application source code and without base op- erating system kernel changes. Our experimental results on a set of real-world server applications and bugs show that ASSURE enabled recovery for all of the bugs tested with fast recovery times, has modest performance overhead, and provides automatic self-healing orders of magnitude faster than current human-driven patch deployment methods.
ACM Transactions on Storage Systems, Volume 4, Issue 4, January 2009
Kinesis is a novel data placement model for distributed storage systems. It exemplifies three design principles: structure (division of servers into a few failure-isolated segments), freedom of choice (freedom to allocate the best servers to store and retrieve data based on current resource availability), and scattered distribution (independent, pseudo-random spread of replicas in the system). These design principles enable storage systems to achieve balanced utilization of storage and network resources in the presence of incremental system expansions, failures of single and shared components, and skewed distributions of data size and popularity. In turn, this ability leads to significantly reduced resource provisioning costs, good user-perceived response times, and fast, parallelized recovery from independent and correlated failures. This article validates Kinesis through theoretical analysis, simulations, and experiments on a prototype implementation. Evaluations driven by real-world traces show that Kinesis can significantly outperform the widely used Chain replica-placement strategy in terms of resource requirements, end-to-end delay, and failure recovery.