Lectures

The following topics will be presented over the course of the semester. Each topic will be covered in (roughly) one lecture. Lecture notes are linked as they become available.

  1. Course introduction
  2. Distributed systems primer
    • challenges and goals of distributed systems
    • example architectures
  3. Distributed computation
    • MapReduce
    • Spark
    • tradeoffs
  4. Communication models
    • remote procedure calls (RPC)
    • RPC libraries
    • failure models
    • semantics
  5. Time and coordination
    • challenges
    • physical and logical clocks
    • distributed mutual exclusion
  6. Agreement in distributed systems
    • the atomic commitment problem
    • the consensus problem
    • use cases for each
    • the FLP impossibility result for consensus
  7. The transaction abstraction
    • ACID semantics
    • concurrency control mechanisms
    • recovery mechanisms
  8. Atomic commitment protocols
    • two-phase commit (2PC)
    • blocking nature
  9. Consensus protocols
    • Paxos overview, key ideas, basic algorithm
    • examples of normal operation and operation under failures
    • liveness failure mode
    • multi-Paxos
    • applications
  10. Case studies from industry:
  11. Broader view of isolation and consistency semantics
    • isolation: serializability, repeatable reads, read committed, read uncommitted
    • consistency: external, sequential, causal, eventual
    • mechanisms for each
    • performance/usability tradeoffs
  12. Beyond storage and MapReduce: Broader infrastructure systems
    • Google’s software stack
    • Meta’s software stack
    • Hadoop and Spark software stacks
  13. Cluster scheduling
    • scheduler architectures and considerations
    • frameworks: YARN, Mesos, Borg
    • algorithms: dominant resource fairness, bin packing
  14. Testing and model checking
    • testing approaches and challenges
    • formal specification and model checking
    • TLA+ primer
  15. Security and Byzantine fault tolerance