DP Infrastructure
Incorporating support for differential privacy into data processing infrastructure.
Vision
We are a group of systems/security researchers at Columbia University and the University of British Columbia working on infrastructure systems for differential privacy. We believe that differential privacy is an essential privacy technology for today’s data-driven world, in which users’ data is avidly collected and processed through a variety of machine learning and analytics workloads aimed at improving products, targeting ads, informing new business directions, and more.

We posit that across all of these workloads, user privacy is a critical computing resource that is being implicitly consumed but whose consumption is not tracked, managed, or paid for in any way. This is in contrast to how the use of other computing resources – such as CPU, GPU, and RAM – is tightly controlled and supported by infrastructure systems such as data analytics frameworks, ML platforms, resource orchestrators, and others. We seek to incorporate privacy as a first-order resource into such infrastructure systems, so it can be similarly managed, monitored, conserved, and carefully accounted for. Differential privacy (DP) gives us the theoretical and algorithmic building blocks for defining such a privacy resource.
We’ve used DP to incorporate privacy as a resource into:
- the Kubernetes orchestrator (PrivateKube project below);
- the TensorFlow Extended ML training platform (Sage project); and
- most recently, the caching components of an analytics database (Turbo project).
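The core idea of treating privacy as a schedulable resource can be sketched in a few lines of Python. This is a hypothetical illustration, not PrivateKube's actual API: a thread-safe accountant that admits a task's DP query only when enough epsilon remains, charging the budget under basic sequential composition.

```python
import threading


class PrivacyBudget:
    """Tracks a dataset's epsilon budget as a consumable resource.

    Hypothetical sketch: the class and method names are illustrative,
    not taken from PrivateKube or any real system.
    """

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0
        self._lock = threading.Lock()

    def try_consume(self, epsilon: float) -> bool:
        """Admit the request only if budget remains (basic composition:
        total spend is the sum of per-query epsilons)."""
        with self._lock:
            if self.spent + epsilon > self.total:
                return False  # budget exhausted: deny, like an OOM for privacy
            self.spent += epsilon
            return True

    def remaining(self) -> float:
        with self._lock:
            return self.total - self.spent


# Usage: two tasks compete for a budget of epsilon = 1.0.
budget = PrivacyBudget(1.0)
print(budget.try_consume(0.6))  # admitted
print(budget.try_consume(0.6))  # denied: only 0.4 remains
```

Once the budget is modeled this way, an orchestrator can apply familiar admission-control and scheduling machinery to it, just as it does for CPU or RAM.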
It turns out that incorporating the DP-based privacy resource into these infrastructure systems also helps address – or at least operationalize – some notoriously sticky problems that have stymied this privacy technology’s adoption for years. For example, incorporating DP as a resource into Kubernetes recasts DP’s “running out of privacy budget” problem as a limited-resource scheduling problem, for which well-known algorithms and theory can be tapped to manage this rather fundamental challenge in practical ways. As another example, when a computing resource is limited, caching is the go-to approach for conserving it in traditional systems. In Turbo, we show that caching designed specifically for the privacy resource can conserve it enormously, enabling DP systems to run much longer before exhausting their privacy budget. Thus, our effort to incorporate support for DP into data processing infrastructure helps advance this important privacy technology further toward adoption.
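The caching idea can be sketched in the same spirit. This is an illustrative toy, not Turbo's actual design: a cache hit re-releases an already-noised answer, which costs no additional budget because post-processing a DP output is free; only cache misses spend epsilon (here, on a Laplace-noised count).

```python
import random


class CachingDPEngine:
    """Serves noisy counts, reusing cached releases to conserve epsilon.

    Hypothetical sketch of DP-aware caching (in the spirit of Turbo),
    not Turbo's actual mechanism; names and parameters are illustrative.
    """

    def __init__(self, budget: float, epsilon_per_query: float,
                 sensitivity: float = 1.0):
        self.remaining = budget
        self.eps = epsilon_per_query
        self.sens = sensitivity
        self.cache = {}  # query key -> previously released noisy answer

    def noisy_count(self, key: str, true_count: int) -> float:
        # Cache hit: re-releasing an already-noised answer is free
        # (post-processing), so no budget is charged.
        if key in self.cache:
            return self.cache[key]
        if self.remaining < self.eps:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= self.eps
        # Laplace(scale = sensitivity/epsilon) as the difference of two
        # i.i.d. exponentials with mean sensitivity/epsilon.
        rate = self.eps / self.sens
        noise = random.expovariate(rate) - random.expovariate(rate)
        answer = true_count + noise
        self.cache[key] = answer
        return answer


# Usage: the repeated query "q1" consumes budget only once.
engine = CachingDPEngine(budget=1.0, epsilon_per_query=0.5)
engine.noisy_count("q1", 100)
engine.noisy_count("q1", 100)   # cache hit: same answer, no extra spend
print(engine.remaining)          # 0.5 of the budget still available
```

A real system must also decide when a cached answer is accurate enough to reuse for a related query, which is where the design gets interesting; this sketch only shows the exact-match case.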