# AWS Outage — October 2025
Notes on the major AWS outage of October 19–20, 2025. AWS's post-mortem report: aws.amazon.com/message/101925
I’m sure AWS’s summary over-simplifies things, and we’ll have to wait for a deeper post-mortem. But meanwhile, one can’t help but wonder how simple the issue was (or was it?). My thoughts:
## The Incident
DNS plans, applied in order of age: `plan_old` → `plan_new` → `plan_newest`
The “race condition”:

1. The slow DNS Enactor reads the current plan’s timestamp and sees it is `plan_old`.
2. It starts applying `plan_new`.
3. Meanwhile, `plan_newest` gets generated.
4. The fast enactor replaces everything with `plan_newest`.
5. The fast enactor deletes `plan_new`, because it is older than `plan_newest`.
6. The slow enactor overwrites `plan_newest` with `plan_new` (now empty due to the deletion in step 5), because it checked plan freshness only in step 1.
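The interleaving above fits in a few lines of code. Here is a toy in-process sketch (all names and data structures are mine, not AWS’s actual components): two threads share a plan store, the slow enactor checks freshness only once up front, the fast enactor deletes the older plan after applying the newest one, and the records end up empty:

```python
import threading
import time

# Hypothetical in-memory stand-ins for the plan store and the DNS records.
plans = {
    "plan_old": {"version": 1, "entries": {"db.example": "10.0.0.1"}},
    "plan_new": {"version": 2, "entries": {"db.example": "10.0.0.2"}},
}
applied = {"name": "plan_old"}                 # plan the records currently reflect
records = dict(plans["plan_old"]["entries"])


def slow_enactor():
    # Step 1: freshness is checked ONCE, up front; plan_new is newer, so proceed.
    target = plans["plan_new"]
    time.sleep(0.2)                            # steps 2 onward: applying slowly...
    # Step 6: blindly overwrite with plan_new, whose entries are empty by now.
    records.clear()
    records.update(target["entries"])
    applied["name"] = "plan_new"


def fast_enactor():
    time.sleep(0.05)
    # Step 3: plan_newest gets generated meanwhile.
    plans["plan_newest"] = {"version": 3, "entries": {"db.example": "10.0.0.3"}}
    # Step 4: fast enactor replaces everything with plan_newest.
    records.clear()
    records.update(plans["plan_newest"]["entries"])
    applied["name"] = "plan_newest"
    # Step 5: delete plan_new, because it is older than plan_newest.
    plans["plan_new"]["entries"] = {}


slow = threading.Thread(target=slow_enactor)
fast = threading.Thread(target=fast_enactor)
slow.start(); fast.start()
slow.join(); fast.join()

print(records)  # {} — the records were wiped by the stale, now-empty plan
```

The sleeps stand in for the pathological slowness; the end state is the same whichever thread wins the early steps, because the slow enactor never re-checks freshness before its final write.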
The correctness of this setup depended on:
- The assumption that “This process typically completes rapidly” (unpardonable), and
- Having locks only at per-entry level, not at the plan level (understandably, plans can be quite large).
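The second point is worth dwelling on: per-entry locks make each individual write safe, but they say nothing about whole-plan atomicity. A minimal, hand-replayed sketch of one bad interleaving (all names hypothetical):

```python
import threading

# Per-entry locks: each single write is protected, but nothing serializes
# two whole-plan applications against each other.
entries = {"db1": "plan_old", "db2": "plan_old"}
locks = {key: threading.Lock() for key in entries}


def write(key, value):
    with locks[key]:              # per-entry lock: this one write is atomic
        entries[key] = value


# One possible interleaving of a slow and a fast enactor, replayed by hand:
write("db1", "plan_new")          # slow enactor writes its first entry
write("db1", "plan_newest")       # fast enactor applies its whole plan...
write("db2", "plan_newest")
write("db2", "plan_new")          # ...then the slow enactor finishes, late

print(entries)  # {'db1': 'plan_newest', 'db2': 'plan_new'} — a mix of two plans
```

No lock was ever violated, yet the record as a whole reflects neither plan.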
## Problem Statement
How do you make updates to a very large record (say, millions of lines) atomic while also staying available and fault tolerant?
## Design Questions
- Why not just have a single pointer that points to the latest plan, and update that pointer atomically instead of overwriting the entire plan?
- Why have multiple enactors running in parallel for availability? Why not have one leader and the others as backups that check the leader’s health via heartbeats?
- Or simply: why not check per-entry freshness before each update?
- …
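For the first question, here is a minimal sketch of the pointer-swap idea, under the assumption that plans are written immutably off to the side and a versioned compare-and-swap guards a single "current" pointer. An in-process lock stands in for whatever atomic primitive a real store would provide; all names are hypothetical:

```python
import threading

class PlanStore:
    """Plans are immutable once published; only the pointer ever moves."""

    def __init__(self):
        self._lock = threading.Lock()
        self.plans = {}
        self.current = None          # name of the live plan
        self.version = 0             # bumps on every successful swap

    def publish(self, name, entries):
        # Writing a plan is cheap to retry: it touches nothing live.
        self.plans[name] = entries

    def swap(self, name, expected_version):
        # Compare-and-swap: succeeds only if nobody swapped since we looked.
        with self._lock:
            if self.version != expected_version:
                return False
            self.current = name
            self.version += 1
            return True


store = PlanStore()
store.publish("plan_old", {"db.example": "10.0.0.1"})
store.swap("plan_old", 0)

v = store.version                    # slow enactor notes the version it saw
store.publish("plan_new", {"db.example": "10.0.0.2"})

# Fast enactor races ahead and swaps first:
store.publish("plan_newest", {"db.example": "10.0.0.3"})
assert store.swap("plan_newest", v)

# The slow enactor's stale swap now fails instead of clobbering newer state:
assert not store.swap("plan_new", v)
print(store.current)  # plan_newest
```

The stale enactor simply loses the race; it can re-read the version and retry (or discard its plan) rather than overwrite a newer one.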