Lectures
The following topics will be presented over the course of the semester.
Each topic will be covered in (roughly) one lecture.
Lecture notes are linked as they become available.
- Course introduction
- Distributed systems primer
  - challenges and goals of distributed systems
  - example architectures
- Distributed computation
  - MapReduce
  - Spark
  - tradeoffs
- Communication models
  - remote procedure calls (RPC)
  - RPC libraries
  - failure models
  - semantics
- Time and coordination
  - challenges
  - physical and logical clocks
  - distributed mutual exclusion
- Agreement in distributed systems
  - the atomic commitment problem
  - the consensus problem
  - use cases for each
  - the FLP impossibility result for consensus
- The transaction abstraction
  - ACID semantics
  - concurrency control mechanisms
  - recovery mechanisms
- Atomic commitment protocols
  - two-phase commit (2PC)
  - blocking nature
- Consensus protocols
  - Paxos overview, key ideas, basic algorithm
  - examples of normal operation and operation under failures
  - liveness failure mode
  - multi-Paxos
  - applications
- Case studies from industry:
- Broader view of isolation and consistency semantics
  - isolation: serializability, repeatable reads, read committed, read uncommitted
  - consistency: external, sequential, causal, eventual
  - mechanisms for each
  - performance/usability tradeoffs
- Beyond storage and MapReduce: Broader infrastructure systems
  - Google’s software stack
  - Meta’s software stack
  - Hadoop and Spark software stacks
- Cluster scheduling
  - scheduler architectures and considerations
  - frameworks: YARN, Mesos, Borg
  - algorithms: dominant resource fairness, bin packing
- Testing and model checking
  - testing approaches and challenges
  - formal specification and model checking
  - TLA+ primer
- Security and Byzantine fault tolerance