Posts

Showing posts from June, 2025

Analyzing Metastable Failures in Distributed Systems

So it goes: your system is purring like a tiger, devouring requests, until, without warning, it slumps into existential dread. Not a crash. Not a bang. A quiet, self-sustaining collapse. The system doesn’t stop. It just refuses to get better. Metastable failure is what happens when the feedback loops in the system go feral. Retries pile up, queues overflow, recovery stalls. Everything runs but nothing improves. The system is busy and useless.

In an earlier post, I reviewed the excellent OSDI ’22 paper on metastable failures, which dissected real-world incidents and laid the theoretical groundwork. If you haven’t read that one, start there.

This HotOS ’25 paper picks up the thread. It introduces tooling and a simulation framework to help engineers identify potential metastable failure modes before disaster strikes. It’s early-stage work. A short paper. But a promising start. Let’s walk through it.

Introduction

Like most great tragedies, metastable failure doesn't begin with villain...
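The retry feedback loop described in this excerpt can be made concrete with a toy simulation. This is a minimal sketch and not the paper's tooling or framework: the function name `simulate`, the FIFO single-queue model, and every parameter (capacity, load, timeout, retry cap, spike size) are assumptions chosen purely for illustration. Clients resend a request once it has waited `timeout` ticks, while the stale copy stays in the queue and still consumes server capacity.

```python
from collections import deque


def simulate(ticks=60, capacity=10, base_load=8, timeout=3,
             max_retries=3, spike=50, spike_start=20, spike_len=3):
    """Toy model (all parameters hypothetical): returns useful completions per tick."""
    queue = deque()  # FIFO of (enqueue_tick, attempt_number)
    goodput = []     # jobs served before their client gave up on that copy
    for t in range(ticks):
        # A client whose request has waited exactly `timeout` ticks resends it.
        # The stale copy is NOT removed, so the server will still burn work on it.
        retries = [(t, attempt + 1) for (born, attempt) in queue
                   if t - born == timeout and attempt < max_retries]
        # Fresh load is below capacity except during the short trigger spike.
        in_spike = spike_start <= t < spike_start + spike_len
        fresh = base_load + (spike if in_spike else 0)
        queue.extend(retries)
        queue.extend((t, 0) for _ in range(fresh))
        # The server works through at most `capacity` jobs per tick, oldest first.
        served = [queue.popleft() for _ in range(min(capacity, len(queue)))]
        # Only jobs answered before the client's timeout count as useful work.
        goodput.append(sum(1 for (born, _) in served if t - born < timeout))
    return goodput


if __name__ == "__main__":
    g = simulate()
    print("before spike:", g[19], "during spike:", g[20], "long after spike:", g[59])
```

With these defaults the fresh load (8 per tick) stays below capacity (10 per tick) both before and after the three-tick spike, yet useful completions fall to zero once the spike ends and never come back within the run: the server stays fully busy on requests whose clients have already retried. The trigger is gone, but the retry loop keeps the overload alive, which is the behavior the excerpt describes.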

Chapter 6: Centralized Recovery (Concurrency Control Book)

With Chapter 6, the Concurrency Control and Recovery in Database Systems book shifts focus from concurrency control to recovery! This chapter addresses how to preserve the atomicity and durability of transactions in the presence of failures, and how to restore the system to a consistent state afterward. The book offers a remarkably complete foundation for transactional recovery, covering undo/redo logging, checkpointing, and crash recovery. While it doesn't use the phrase "write-ahead logging", the basic concepts are there, including log-before-data and dual-pass recovery. When the book was written, the full WAL abstraction in ARIES was still five years away, arriving in 1992 (see my review here). I revisit this discussion/comparison at the end of the post.

System Model and Architecture

In earlier chapters, we reviewed the architecture: transactions pass through a transaction manager (TM), which sends operations to a scheduler and then to a data manager (DM...
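The log-before-data rule and two-pass recovery mentioned in this excerpt can be sketched in a few lines. This is a generic undo/redo illustration under stated assumptions, not the book's algorithm and not ARIES: the class `TinyStore` and its methods are hypothetical, the log is assumed to reach stable storage the moment a record is appended, executions are assumed strict (no transaction overwrites another's uncommitted write), and `None` is reserved to mean "key absent".

```python
class TinyStore:
    """Hypothetical in-memory model of a stable database, a stable log, and a volatile cache."""

    def __init__(self):
        self.disk = {}   # stable database: survives a crash
        self.log = []    # stable log: survives a crash (assumed forced on append)
        self.cache = {}  # volatile buffer: lost on a crash

    def read(self, key):
        return self.cache.get(key, self.disk.get(key))

    def write(self, txn, key, value):
        # Log-before-data: the before and after images are logged
        # before the data item itself is changed.
        self.log.append(("update", txn, key, self.read(key), value))
        self.cache[key] = value

    def commit(self, txn):
        self.log.append(("commit", txn))

    def flush(self, key):
        # The buffer manager may push any cached value to disk, even an uncommitted one.
        self.disk[key] = self.cache[key]

    def crash(self):
        self.cache = {}

    def recover(self):
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        # Pass 1 (undo): scan backward, restoring before images of uncommitted updates.
        for rec in reversed(self.log):
            if rec[0] == "update" and rec[1] not in committed:
                _, _, key, before, _ = rec
                if before is None:
                    self.disk.pop(key, None)
                else:
                    self.disk[key] = before
        # Pass 2 (redo): scan forward, reapplying after images of committed updates.
        for rec in self.log:
            if rec[0] == "update" and rec[1] in committed:
                self.disk[rec[2]] = rec[4]


if __name__ == "__main__":
    s = TinyStore()
    s.write("T1", "x", 1)
    s.commit("T1")
    s.write("T2", "x", 2)   # T2 never commits
    s.write("T2", "y", 9)
    s.flush("x")            # uncommitted values reach disk before the crash
    s.flush("y")
    s.crash()
    s.recover()
    print(s.disk)
```

In the usage above, T2's uncommitted values were flushed before the crash, so the undo pass has to remove them, while the redo pass guarantees that T1's committed write is on disk even though T1 never flushed it itself. Recovery leaves the stable database as `{'x': 1}`. Because both passes only install logged images, running `recover()` again after a second crash gives the same result.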

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Making database systems usable

Advice to the young

Looming Liability Machines (LLMs)

Learning about distributed systems: where to start?

Foundational distributed systems papers

Scalable OLTP in the Cloud: What’s the BIG DEAL?

What I'd do as a College Freshman in 2025

Distributed Transactions at Scale in Amazon DynamoDB