AWS Outage — October 2025

Notes on the major AWS outage of October 19–20, 2025. AWS's post-mortem: aws.amazon.com/message/101925

AWS's summary likely over-simplifies things, and we'll have to wait for a deeper post-mortem. But meanwhile, one can't help but wonder whether the issue really was this simple (or was it not?). My thoughts:

The Incident

DNS plans, to be applied in order of age: plan_old → plan_new → plan_newest

The “race condition”:

  1. The slow DNS Enactor reads the current plan’s timestamp and sees plan_old.
  2. It starts applying plan_new.
  3. … Meanwhile, plan_newest gets generated …
  4. … The fast enactor replaces everything with plan_newest.
  5. … The fast enactor deletes plan_new, because it is older than plan_newest.
  6. The slow enactor overwrites plan_newest with plan_new (now empty due to the deletion in step 5), because it checked plan freshness only in step 1.
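
The steps above amount to a classic check-then-act race: freshness is validated once, then acted on later without re-validation. A minimal simulation (hypothetical names and data structures, not AWS's actual code):

```python
# Simulate the race: a slow enactor checks freshness once, then applies
# its stale decision after a fast enactor has moved on and cleaned up.

class Registry:
    def __init__(self):
        self.active = ("plan_old", 1)              # (name, timestamp)
        self.plans = {"plan_old": 1, "plan_new": 2}

    def apply(self, name):
        # Overwrites the active plan unconditionally -- no re-check.
        self.active = (name, self.plans.get(name, 0))

    def cleanup(self):
        # Deletes plans older than the active one (step 5 above).
        cutoff = self.active[1]
        self.plans = {n: t for n, t in self.plans.items() if t >= cutoff}

reg = Registry()

# Slow enactor, step 1: plan_new (ts=2) is newer than active (ts=1), so proceed.
ok_to_apply = reg.plans["plan_new"] > reg.active[1]

# Meanwhile, plan_newest appears; the fast enactor applies it and cleans up.
reg.plans["plan_newest"] = 3
reg.apply("plan_newest")
reg.cleanup()                      # plan_old and plan_new are deleted

# Slow enactor, step 6: acts on the stale step-1 decision.
if ok_to_apply:
    reg.apply("plan_new")          # plan_new is gone -> an "empty" plan wins

print(reg.active)  # ('plan_new', 0) -- plan_newest has been clobbered
```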

The correctness of this setup depended on:

  1. The assumption that “This process typically completes rapidly” (unpardonable), and
  2. Locks existing only at the per-entry level, not at the plan level (understandable, since plans can be quite large).
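
Assumption 1 disappears if freshness is re-validated at apply time rather than only up front. A sketch of one such guard (a compare-and-swap on the active plan's timestamp; hypothetical names, not AWS's actual fix):

```python
import threading

# Reject stale applies instead of trusting a check made earlier:
# the apply itself compares timestamps under a lock and refuses to
# overwrite a newer plan.

class Registry:
    def __init__(self):
        self._lock = threading.Lock()
        self.active = ("plan_old", 1)   # (name, timestamp)

    def apply_if_newer(self, name, timestamp):
        with self._lock:
            if timestamp <= self.active[1]:
                return False            # stale plan: reject
            self.active = (name, timestamp)
            return True

reg = Registry()
assert reg.apply_if_newer("plan_newest", 3)     # fresh: accepted
assert not reg.apply_if_newer("plan_new", 2)    # stale: rejected
print(reg.active)  # ('plan_newest', 3)
```

Note the lock here protects only the tiny pointer swap, not the (potentially huge) plan body, so point 2 above is not violated.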

Problem Statement

How do you make updates to a very large record (say, millions of lines) atomic, while remaining available and fault tolerant?
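
One standard pattern: never mutate the large record in place. Write each version immutably under its own key, then publish it with a single small pointer swap, so readers see either the old complete record or the new one, never a partial write. A sketch under assumed names:

```python
# Immutable versioned bodies + one small mutable pointer.
# The slow part (writing the body) happens off to the side; the
# publish step is a single tiny write that can be made atomic.

plans = {}                     # version -> full plan body (write-once)
pointer = {"current": None}    # the only mutable cell

def publish(version, body):
    plans[version] = body           # step 1: write the full plan (slow is fine)
    pointer["current"] = version    # step 2: one atomic pointer flip

def read_plan():
    v = pointer["current"]
    return plans[v] if v is not None else None

publish("v1", ["record"] * 3)
publish("v2", ["record"] * 5)
print(len(read_plan()))  # 5 -- never a half-written plan
```

The garbage collector then only needs one invariant: never delete the version the pointer currently references (which is arguably the invariant the cleanup in step 5 of the incident violated).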

Design Questions

  1. Why not just have a single pointer that points to the latest plan and update the pointer atomically instead of “overwriting” the entire plan?
  2. Why have multiple enactors running in parallel for availability? Why not have one leader, with the others as backups that monitor the leader’s health via heartbeats?
  3. Or simply: Why not check per-entry freshness before updating?
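
Question 2 can be sketched as a time-bounded lease: only one enactor is active at a time, and a backup takes over only when the lease expires (a simplified toy; real systems typically park the lease in a consensus store such as etcd or ZooKeeper):

```python
# Toy lease-based leadership: one holder at a time, takeover only
# after the lease expires. Time is passed in explicitly to keep the
# sketch deterministic.

LEASE_SECONDS = 2.0
lease = {"holder": None, "expires": 0.0}

def try_acquire(node, now):
    # Grant if the lease is free, expired, or already ours (renewal).
    if lease["holder"] is None or now >= lease["expires"] or lease["holder"] == node:
        lease["holder"] = node
        lease["expires"] = now + LEASE_SECONDS
        return True
    return False

assert try_acquire("enactor-A", 0.0)        # A becomes leader
assert not try_acquire("enactor-B", 1.0)    # lease still valid: B waits
assert try_acquire("enactor-B", 3.0)        # lease expired: B takes over
print(lease["holder"])  # enactor-B
```

This trades some availability (a failover gap of up to one lease period) for never having two enactors applying plans concurrently, which is exactly the concurrency that made the race possible.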