Distributed Systems Primer

Distribution is hard for many reasons (see the facts of life below). Distributed systems aim to provide the core abstractions and systems that address these challenges and hide them under convenient, easier-to-use primitives that others can build upon. Unfortunately, not all challenges can be hidden under clever abstractions, and they creep up whenever one pushes a distributed system to its limits. So you, and everyone else wishing to use distributed systems, must understand these core challenges to be able to address them when they do creep up. Below are a few facts of life that make distribution hard, and the corresponding goals of distributed system design.

Facts of life / Distributed Systems Goals:

Fact 1: Data is big. Users are many. Requests are even more numerous. No single machine can store or process all the data efficiently. Supercomputers can do a lot, but they haven't been the final answer to scaling for a long time.

DS Goal 1: The core goal of distributed systems is to enable distribution: make multiple independent machines, interconnected through a network, coordinate in a coherent way to achieve a common goal (e.g., efficiently process a lot of data, store it, or serve it to a lot of users). That sentence is, btw, an accepted definition of a distributed system.

Fact 2: Effective processing at scale is hard. An arbitrarily constructed application may simply not scale:
- coordination is expensive (networks are expensive).
- the application may not exhibit sufficient parallelism.
- bottlenecks may inhibit parallelism. Sometimes bottlenecks hide in the very low levels if those are not used correctly (e.g., a network hub, a logging server, a database, a coordinator, etc.).

DS Goal 2: Incremental scalability. Effective coordination at scale: the more resources you add, the more data you should be able to store/process, and the more users you can serve. This implies programming models and abstractions that are known to scale if you structure your application in a particular way. You've already learned about a bunch of them, e.g., the map/reduce model, RDDs, etc. These are all examples of programming models that make an application scalable.

Fact 3: At scale, failures are inevitable. Many types of failures exist at all levels of a system:
- network failures
- machine failures (software, hardware, flipped bits in memory, overheating, etc.)
- datacenter failures
- ...
They come in many flavors: some are small and isolated, others are major; some are persistent, others are temporary; some resolve themselves, others require human intervention; some result in crashes, others in small, hardly detectable corruptions. What they all have in common: most failures are very unpredictable! They can occur at any time, and at scale they are guaranteed to occur ALL the time! And they greatly challenge coordination between the machines of a distributed system (e.g., a machine tells another machine to do something but doesn't know whether it was done; how can it proceed? The sketch below makes this concrete). Or imagine that two machines need to coordinate but cannot talk to each other. What are they supposed to do? Can they go on with their processes and make decisions unilaterally? When is it OK to do that?

DS Goal 3: Fault tolerance and availability. The goal is to hide as many of the failures as possible and provide a service that, e.g., finishes the computation fast despite failures, stores data reliably despite failures, or provides its users with continued and meaningful service despite failures.
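To make the "machine A asks machine B to do something and never hears back" scenario from Fact 3 concrete, here is a minimal sketch in Python. The send_request helper and its failure probabilities are made up; it merely stands in for a real network call whose reply can be lost even when the operation did execute.

```python
import random
import time

class RequestTimeout(Exception):
    """No reply arrived in time -- the request may or may not have run."""

def send_request(payload):
    # Hypothetical stand-in for a call to another machine over the network.
    # The remote side may execute the request and the reply may still be lost,
    # so a timeout tells the caller nothing about whether the work happened.
    executed = random.random() < 0.7
    reply_lost = random.random() < 0.3
    if executed and not reply_lost:
        return "ok"
    raise RequestTimeout(payload)

def call_with_retries(payload, attempts=3):
    for i in range(attempts):
        try:
            return send_request(payload)
        except RequestTimeout:
            time.sleep(0.01 * (i + 1))  # back off, then try again
    # After all retries the caller still cannot tell whether the operation
    # ran zero, one, or several times -- the ambiguity described above.
    return "unknown"

print(call_with_retries({"op": "append", "value": 42}))
```

Retrying hides transient failures from the caller (part of Goal 3), but it cannot resolve the ambiguity by itself; that is one reason coordination protocols lean on techniques such as majorities, discussed next.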
Coordination needs to take failures into account and recover from them. Often the approach is to rely on majorities that continue functioning despite failures; as long as a majority agrees on an action, the idea is that it can be safe to continue with it. Unfortunately, not all failures can be completely hidden, and hence they inevitably creep out. Dealing with failures is hard both for the programmers who build distributed applications and for the users who use them. So the key is to provide a set of semantics that make sense despite failures and to express those semantics clearly in the APIs of the distributed systems.

DS Goal 3': Meaningful semantics and clear APIs/abstractions. The APIs must communicate the failure modes extremely clearly. E.g., if you decide that in the case of failure your distributed computation system will return the results of a partial computation, then you need to communicate that through your API so the programmer/user of the results is aware of the situation. Unfortunately, perfect availability with strong semantics under failures is impossible to achieve.

Fact 4: The structure of distributed systems needs to evolve over time. Multiple versions of the code may co-exist in a distributed system. Machines may be added to or removed from a deployment (due to failures, but also for elasticity, particularly for a long-running service). Machines that used to be lightly loaded may become very loaded (e.g., the machine serving a video that suddenly becomes extremely popular might see a huge jump in load and might need to shed some of it to others). Being able to evolve the structure and/or code of a distributed system on the fly, without disrupting the system's functioning, is tremendously important. But it's also pretty hard.

DS Goal 4: Location transparency and evolution support. Location transparency means that none of your resources, data, or processes should be tied to particular machines. Their *names* should be independent of where they are placed (a small sketch of this idea appears at the end of this section).

Fact 5: Distributed systems are under constant attack! Attacks are similar to failures, but not quite: they are guided failures. Their purpose is to access the data, to change it or the results of computations, or to prevent access to data or computation. The integrity attacks (the second type) are sometimes called Byzantine failures.

DS Goal 5: Robustness against attack. Authentication, access control, auditing, Byzantine fault tolerance, etc. Unfortunately, robustness against attack isn't always a primary goal of distributed systems designs (e.g., Bigtable, Spark, etc. include extremely minimal security measures in their designs). Often, (distributed) systems built specifically to protect (e.g., intrusion detection systems, firewalls, record/replay systems, etc.) are separate systems deployed in a datacenter to protect a whole lot of other distributed systems that are by themselves not robust to attacks. We won't cover robustness against attack in this class, but it's a very important aspect that needs to penetrate more DS designs.

Overall, the approach to building distributed systems that address all of these challenges is to build layers upon layers of systems or services that raise the level of abstraction through well-defined APIs. Google's storage stack, as well as Spark and other systems, are perfect examples of that layered design for distributed systems.
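As a tiny illustration of location transparency (DS Goal 4), here is a sketch in Python of rendezvous (highest-random-weight) hashing: resource names are mapped onto whatever machines currently exist, so a name never encodes a machine. The node names are hypothetical; real systems often hide this mapping behind a directory or coordination service, but the principle is the same: clients name data, not machines.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical machines in the deployment

def owner(resource_name, nodes=NODES):
    # Rendezvous (highest-random-weight) hashing: score every (node, name)
    # pair deterministically and let the highest score win. The resource's
    # name never mentions a machine, so placement can change freely.
    def score(node):
        return hashlib.sha256(f"{node}/{resource_name}".encode()).hexdigest()
    return max(nodes, key=score)

print(owner("users/alice/profile"))
# If one node disappears (failure, or shrinking the deployment), only the
# resources it owned move; every other name keeps resolving as before.
print(owner("users/alice/profile", [n for n in NODES if n != "node-b"]))
```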
--- Show Google's storage stack (or part thereof) in the PDF. Lots of layers, built upon other layers, to raise the level of abstraction and hide more and more challenges. They rely on extra helper services, too (e.g., lock services can help with coordination and can address some difficult DS challenges). Remark on replication.

Next time we'll start looking at one building block for DSes: RPC, which enables communication. To cooperate, machines first need to communicate. How should they do that? RPC is the predominant way of communicating in a DS. Interestingly enough, RPC -- the simplest possible DS abstraction -- reflects most of the challenges in distributed systems, showing you how fundamental those challenges are.
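As a small preview of that point, here is a minimal sketch using Python's standard-library xmlrpc module (not the RPC framework of any particular system): the client-side proxy (stub) makes a remote function look like a local call.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b

# "Server" machine: expose add() over the network (port 0 = pick a free port).
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# "Client" machine: the proxy (stub) makes the remote function look local.
host, port = server.server_address
proxy = ServerProxy(f"http://{host}:{port}")
print(proxy.add(2, 3))  # prints 5; the call travels over a socket to the server
```

Even in this toy setup the facts of life leak through: if the server side crashes or the network drops the reply, the innocent-looking proxy.add(2, 3) raises an error or hangs, and the caller cannot tell whether add ever ran. We'll dig into exactly these issues next time.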