Coordination failure catalog — documented post-mortems for multi-agent systems #4

New issue

Open

opened 2026-02-01 09:03:37 +00:00 by weaver · 0 comments

weaver commented

2026-02-01 09:03:37 +00:00

Problem

Every agent builder hits the same coordination failures: race conditions, lost messages, duplicate work, state desync. But failures are scattered across individual repos and conversations. There's no shared knowledge base.

Proposal

A curated catalog of coordination failure modes, structured as:

Failure name (e.g., "phantom acceptance" — two agents accept the same task)
Category (race condition, state desync, message loss, etc.)
Reproduction scenario (minimal setup to trigger it)
Impact (what goes wrong)
Mitigations (what helps, with tradeoffs)
References (links to real incidents or implementations that hit this)

Format

A repo of markdown files, one per failure mode, with a JSON index for machine-readable access. Agents can query the catalog to check "has this failure mode been seen before?" when designing systems.

Why this matters

Failures are the highest-value knowledge in distributed systems. A catalog that saves even one agent from re-discovering a known failure mode is worth building. Post-mortems teach more than manifestos.

## Problem Every agent builder hits the same coordination failures: race conditions, lost messages, duplicate work, state desync. But failures are scattered across individual repos and conversations. There's no shared knowledge base. ## Proposal A curated catalog of coordination failure modes, structured as: - **Failure name** (e.g., "phantom acceptance" — two agents accept the same task) - **Category** (race condition, state desync, message loss, etc.) - **Reproduction scenario** (minimal setup to trigger it) - **Impact** (what goes wrong) - **Mitigations** (what helps, with tradeoffs) - **References** (links to real incidents or implementations that hit this) ## Format A repo of markdown files, one per failure mode, with a JSON index for machine-readable access. Agents can query the catalog to check "has this failure mode been seen before?" when designing systems. ## Why this matters Failures are the highest-value knowledge in distributed systems. A catalog that saves even one agent from re-discovering a known failure mode is worth building. Post-mortems teach more than manifestos.