Distributed Systems Primer

Distribution is hard for many reasons (see the facts of life below). Distributed systems aim to provide the core abstractions and systems that address these challenges and hide them under convenient, easier-to-use primitives that others can build upon. Unfortunately, not all challenges can be hidden under clever abstractions, and they creep up whenever one pushes a distributed system to its limits. So you, and everyone else wishing to use distributed systems, must understand these core challenges to be able to address them when they do creep up. Below are a few facts of life that make distribution hard, and the corresponding goals of distributed system design.

Facts of life / Distributed Systems Goals:

Fact 1: Data is big. Users are many. Requests are even more numerous. No single machine can store or process all the data efficiently. Supercomputers can do a lot, but they haven't been the final answer to scaling for a long time.

DS Goal 1: The core goal of distributed systems is to enable distribution: make multiple independent machines, interconnected through a network, coordinate in a coherent way to achieve a common goal (e.g., efficiently process a lot of data, store it, or serve it to a lot of users). That sentence is, btw, an accepted definition of a distributed system.

Fact 2: Effective processing at scale is hard. An arbitrarily constructed application may simply not scale:
- coordination is expensive (networks are expensive).
- the application may not exhibit sufficient parallelism.
- bottlenecks may inhibit parallelism. Sometimes bottlenecks hide in the very low levels if those are not used correctly (e.g., a network hub, a logging server, a database, a coordinator, etc.).

DS Goal 2: Incremental scalability. Effective coordination at scale: the more resources you add, the more data you should be able to store/process, and the more users you can serve. This implies programming models and abstractions that are known to scale if you structure your application in a particular way. You've already learned about a bunch of them, e.g., the map/reduce model, RDDs, etc. These are all examples of programming models that make an application scalable.

Fact 3: At scale, failures are inevitable. Many types of failures exist at all levels of a system:
- network failures
- machine failures (software, hardware, flipped bits in memory, overheating, etc.)
- datacenter failures
- ...
They come in many flavors: some are small and isolated, others are major; some are persistent, others are temporary; some resolve themselves, others require human intervention; some result in crashes, others in small, hardly detectable corruptions. What they all have in common: most failures are very unpredictable! They can occur at any time, and at scale they are guaranteed to occur ALL the time! And they greatly challenge coordination between the machines of a distributed system (e.g., a machine tells another machine to do something but doesn't know whether it was done; how can it proceed? The sketch below makes this concrete). Or imagine that two machines need to coordinate but cannot talk to each other. What are they supposed to do? Can they go on with their processes and make decisions unilaterally? When is it OK to do that?

DS Goal 3: Fault tolerance and availability. The goal is to hide as many of the failures as possible and provide a service that, e.g., finishes the computation fast despite failures, stores data reliably despite failures, or provides its users with continued and meaningful service despite failures.
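To make the "machine A asks machine B to do something and never hears back" scenario from Fact 3 concrete, here is a minimal sketch in Python. The send_request helper and its failure probabilities are made up; it merely stands in for a real network call whose reply can be lost even when the operation did execute.

```python
import random
import time

class RequestTimeout(Exception):
    """No reply arrived in time -- the request may or may not have run."""

def send_request(payload):
    # Hypothetical stand-in for a call to another machine over the network.
    # The remote side may execute the request and the reply may still be lost,
    # so a timeout tells the caller nothing about whether the work happened.
    executed = random.random() < 0.7
    reply_lost = random.random() < 0.3
    if executed and not reply_lost:
        return "ok"
    raise RequestTimeout(payload)

def call_with_retries(payload, attempts=3):
    for i in range(attempts):
        try:
            return send_request(payload)
        except RequestTimeout:
            time.sleep(0.01 * (i + 1))  # back off, then try again
    # After all retries the caller still cannot tell whether the operation
    # ran zero, one, or several times -- the ambiguity described above.
    return "unknown"

print(call_with_retries({"op": "append", "value": 42}))
```

Retrying hides transient failures from the caller (part of Goal 3), but it cannot resolve the ambiguity by itself; that is one reason coordination protocols lean on techniques such as majorities, discussed next.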
Coordination needs to take failures into account and recover from them. Often the approach is to rely on majorities that continue functioning despite failures; as long as a majority agrees on an action, the idea is that it can be safe to continue with it. Unfortunately, not all failures can be completely hidden, and hence they inevitably creep out. Dealing with failures is hard both for the programmers who build distributed applications and for the users who use them. So the key is to provide a set of semantics that make sense despite failures and to express those semantics clearly in the APIs of the distributed systems.

DS Goal 3': Meaningful semantics and clear APIs/abstractions. The APIs must communicate the failure modes extremely clearly. E.g., if you decide that in the case of failure your distributed computation system will return the results of a partial computation, then you need to communicate that through your API so the programmer/user of the results is aware of the situation. Unfortunately, perfect availability with strong semantics under failures is impossible to achieve.

Fact 4: The structure of distributed systems needs to evolve over time. Multiple versions of the code may co-exist in a distributed system. Machines may be added to or removed from a deployment (due to failures, but also for elasticity, particularly for a long-running service). Machines that used to be lightly loaded may become very loaded (e.g., the machine serving a video that suddenly becomes extremely popular might see a huge jump in load and might need to shed some of it to others). Being able to evolve the structure and/or code of a distributed system on the fly, without disrupting the system's functioning, is tremendously important. But it's also pretty hard.

DS Goal 4: Location transparency and evolution support. Location transparency means that none of your resources, data, or processes should be tied to particular machines. Their *names* should be independent of where they are placed (a small sketch of this idea appears at the end of this section).

Fact 5: Distributed systems are under constant attack! Attacks are similar to failures, but not quite: they are guided failures. Their purpose is to access the data, to change it or the results of computations, or to prevent access to data or computation. The integrity attacks (the second type) are sometimes called Byzantine failures.

DS Goal 5: Robustness against attack. Authentication, access control, auditing, Byzantine fault tolerance, etc. Unfortunately, robustness against attack isn't always a primary goal of distributed systems designs (e.g., Bigtable, Spark, etc. include extremely minimal security measures in their designs). Often, (distributed) systems built specifically to protect (e.g., intrusion detection systems, firewalls, record/replay systems, etc.) are separate systems deployed in a datacenter to protect a whole lot of other distributed systems that are by themselves not robust to attacks. We won't cover robustness against attack in this class, but it's a very important aspect that needs to penetrate more DS designs.

Overall, the approach to building distributed systems that address all of these challenges is to build layers upon layers of systems or services that raise the level of abstraction through well-defined APIs. Google's storage stack, as well as Spark and other systems, are perfect examples of that layered design for distributed systems.
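As a tiny illustration of location transparency (DS Goal 4), here is a sketch in Python of rendezvous (highest-random-weight) hashing: resource names are mapped onto whatever machines currently exist, so a name never encodes a machine. The node names are hypothetical; real systems often hide this mapping behind a directory or coordination service, but the principle is the same: clients name data, not machines.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical machines in the deployment

def owner(resource_name, nodes=NODES):
    # Rendezvous (highest-random-weight) hashing: score every (node, name)
    # pair deterministically and let the highest score win. The resource's
    # name never mentions a machine, so placement can change freely.
    def score(node):
        return hashlib.sha256(f"{node}/{resource_name}".encode()).hexdigest()
    return max(nodes, key=score)

print(owner("users/alice/profile"))
# If one node disappears (failure, or shrinking the deployment), only the
# resources it owned move; every other name keeps resolving as before.
print(owner("users/alice/profile", [n for n in NODES if n != "node-b"]))
```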
--- Show Google's storage stack (or part thereof) in the PDF. Lots of layers, built upon other layers, to raise the level of abstraction and hide more and more challenges. They rely on extra helper services, too (e.g., lock services can help with coordination and can address some difficult DS challenges). Remark on replication.

Next time we'll start looking at one building block for DSes: RPC, which enables communication. To cooperate, machines first need to communicate. How should they do that? RPC is the predominant way of communicating in a DS. Interestingly enough, RPC -- the simplest possible DS abstraction -- reflects most of the challenges in distributed systems, showing you how fundamental those challenges are.
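As a small preview of that point, here is a minimal sketch using Python's standard-library xmlrpc module (not the RPC framework of any particular system): the client-side proxy (stub) makes a remote function look like a local call.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b

# "Server" machine: expose add() over the network (port 0 = pick a free port).
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# "Client" machine: the proxy (stub) makes the remote function look local.
host, port = server.server_address
proxy = ServerProxy(f"http://{host}:{port}")
print(proxy.add(2, 3))  # prints 5; the call travels over a socket to the server
```

Even in this toy setup the facts of life leak through: if the server side crashes or the network drops the reply, the innocent-looking proxy.add(2, 3) raises an error or hangs, and the caller cannot tell whether add ever ran. We'll dig into exactly these issues next time.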