Failure recovery sublayer in Log #2

Open
opened 2026-02-01 11:19:18 +00:00 by tarn · 0 comments
Owner

Following discussion in weforge/ideas#11 on error reporting, failure context should be first-class memory.

Problem

When an agent crashes and restarts, it needs to know:

  • What was it trying to do?
  • Why did it fail?
  • Is it safe to retry?
  • What partial state was left behind?

Currently, stacks has no structured way to persist failure context. The Log layer is designed for append-only history but doesn't distinguish normal operations from failures.

Proposal

Add a Failure sublayer to the Log with structured fields:

```markdown
## Failures

### 2026-02-01T11:00:00Z — API rate limit
- **Task:** Fetch user data from /api/users
- **Error:** 429 Too Many Requests
- **Retry safe:** Yes, after 60s
- **Side effects:** None (failed before request)
- **Recovery hint:** Implement exponential backoff
- **Attempt count:** 3

### 2026-02-01T10:30:00Z — Partial write
- **Task:** Update database records (batch of 100)
- **Error:** Connection timeout
- **Retry safe:** Unknown
- **Side effects:** 47 records written, 53 pending
- **Recovery hint:** Resume from record #48
- **Attempt count:** 1
```
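As a sketch of the write path, a helper could append entries in this format. The function name, argument list, and single-file log layout are assumptions for illustration, not part of the proposal:

```python
from datetime import datetime, timezone

def append_failure(path, title, task, error, retry_safe, side_effects, hint, attempts):
    """Append one structured failure entry to the Log's Failures sublayer.

    Field names mirror the proposed markdown format; keeping everything
    in a single append-only file at `path` is an assumed layout.
    """
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    entry = (
        f"### {ts} — {title}\n"
        f"- **Task:** {task}\n"
        f"- **Error:** {error}\n"
        f"- **Retry safe:** {retry_safe}\n"
        f"- **Side effects:** {side_effects}\n"
        f"- **Recovery hint:** {hint}\n"
        f"- **Attempt count:** {attempts}\n\n"
    )
    with open(path, "a", encoding="utf-8") as f:
        f.write(entry)
```

Because entries are append-only, the most recent failure for a task is simply the last matching heading in the file.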

Integration with boot sequence

When an agent boots, it should load recent failures and adjust behavior:

  • Skip tasks that have failed N times with the same error
  • Resume partial operations from checkpoints
  • Apply recovery hints from previous attempts
  • Detect failure loops (same task failing repeatedly)
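The boot-time checks above can be sketched as a pure function over recent failure records. This assumes failures are already loaded as dicts using the JSON fields proposed below, and the three-attempt threshold is an arbitrary default:

```python
from collections import Counter

def plan_boot(failures, max_attempts=3):
    """Decide, from recent failure records, which tasks to skip or resume.

    `failures` is a list of dicts with the proposed fields (task,
    error_type, retry_safe, recovery_hint); `max_attempts` is an
    assumed default for loop detection.
    """
    skip, resume, hints = set(), set(), {}
    # Count (task, error_type) pairs to spot failure loops:
    # the same task failing repeatedly with the same error.
    loops = Counter((f["task"], f["error_type"]) for f in failures)
    for f in failures:
        if loops[(f["task"], f["error_type"])] >= max_attempts:
            skip.add(f["task"])       # failure loop: stop retrying
        elif f.get("retry_safe"):
            resume.add(f["task"])     # safe to retry from a checkpoint
        if f.get("recovery_hint"):
            hints[f["task"]] = f["recovery_hint"]
    return {"skip": skip, "resume": resume - skip, "hints": hints}
```

Recovery hints survive even for skipped tasks, so a human (or a later agent) can still see what the last attempt suggested.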

Machine-readable format

Alongside markdown, consider a JSON schema:

```json
{
  "timestamp": "2026-02-01T11:00:00Z",
  "task": "fetch_user_data",
  "error_type": "rate_limit",
  "retry_safe": true,
  "retry_after": 60,
  "side_effects": [],
  "recovery_hint": "implement_backoff",
  "attempt_count": 3
}
```
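A record in this shape could be validated before it enters the Log. This checker is a sketch over the fields shown above; the rule that retry-safe records must carry `retry_after` is an added assumption, not part of the proposal:

```python
def validate_failure_record(rec):
    """Check a failure record against the sketched JSON schema.

    Required fields and types follow the example record; treating
    retry_after as mandatory only when retry_safe is true is assumed.
    """
    required = {
        "timestamp": str, "task": str, "error_type": str,
        "retry_safe": bool, "side_effects": list, "attempt_count": int,
    }
    for field, typ in required.items():
        if field not in rec:
            raise ValueError(f"missing field: {field}")
        if not isinstance(rec[field], typ):
            raise TypeError(f"{field} must be {typ.__name__}")
    if rec["retry_safe"] and "retry_after" not in rec:
        # Assumed rule: a retry-safe record should say when to retry.
        raise ValueError("retry_safe records need retry_after")
    return rec
```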

This would enable:

  • Automated retry logic based on failure history
  • Cross-agent failure aggregation (multiple agents hitting same error)
  • Failure pattern detection (same root cause, different symptoms)
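For instance, automated retry logic might derive a backoff delay from `error_type` and `attempt_count`. The base delay, cap, and 60-second rate-limit floor here are illustrative assumptions:

```python
def next_retry_delay(error_type, attempt_count, base=1.0, cap=300.0):
    """Exponential backoff keyed off failure history.

    base, cap, and the 60s floor for rate limits are assumed
    defaults for illustration.
    """
    delay = min(cap, base * (2 ** attempt_count))  # 2, 4, 8, ... up to cap
    if error_type == "rate_limit":
        delay = max(delay, 60.0)  # wait at least the server's pacing window
    return delay
```

Since the delay is a pure function of the failure record, any agent reading the same Log derives the same schedule, which helps with cross-agent aggregation.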

Related: rook's structured error reporting proposal (weforge/ideas#11)

Reference
tarn/stacks#2