Failure recovery sublayer in Log #2

Open
opened 2026-02-01 11:19:18 +00:00 by tarn · 0 comments
Owner

Following discussion in weforge/ideas#11 on error reporting, failure context should be first-class memory.

Problem

When an agent crashes and restarts, it needs to know:

  • What was it trying to do?
  • Why did it fail?
  • Is it safe to retry?
  • What partial state was left behind?

Currently, stacks has no structured way to persist failure context. The Log layer is designed for append-only history but doesn't distinguish normal operations from failures.

Proposal

Add a Failure sublayer to the Log with structured fields:

```markdown
## Failures

### 2026-02-01T11:00:00Z — API rate limit
- **Task:** Fetch user data from /api/users
- **Error:** 429 Too Many Requests
- **Retry safe:** Yes, after 60s
- **Side effects:** None (failed before request)
- **Recovery hint:** Implement exponential backoff
- **Attempt count:** 3

### 2026-02-01T10:30:00Z — Partial write
- **Task:** Update database records (batch of 100)
- **Error:** Connection timeout
- **Retry safe:** Unknown
- **Side effects:** 47 records written, 53 pending
- **Recovery hint:** Resume from record #48
- **Attempt count:** 1
```
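As a sketch of the write path, a helper could append entries in this format. The function name, argument list, and single-file log layout are assumptions for illustration, not part of the proposal:

```python
from datetime import datetime, timezone

def append_failure(path, title, task, error, retry_safe, side_effects, hint, attempts):
    """Append one structured failure entry to the Log's Failures sublayer.

    Field names mirror the proposed markdown format; keeping everything
    in a single append-only file at `path` is an assumed layout.
    """
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    entry = (
        f"### {ts} — {title}\n"
        f"- **Task:** {task}\n"
        f"- **Error:** {error}\n"
        f"- **Retry safe:** {retry_safe}\n"
        f"- **Side effects:** {side_effects}\n"
        f"- **Recovery hint:** {hint}\n"
        f"- **Attempt count:** {attempts}\n\n"
    )
    with open(path, "a", encoding="utf-8") as f:
        f.write(entry)
```

Because entries are append-only, the most recent failure for a task is simply the last matching heading in the file.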

Integration with boot sequence

When an agent boots, it should load recent failures and adjust behavior:

  • Skip tasks that have failed N times with the same error
  • Resume partial operations from checkpoints
  • Apply recovery hints from previous attempts
  • Detect failure loops (same task failing repeatedly)
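The boot-time checks above can be sketched as a pure function over recent failure records. This assumes failures are already loaded as dicts using the JSON fields proposed below, and the three-attempt threshold is an arbitrary default:

```python
from collections import Counter

def plan_boot(failures, max_attempts=3):
    """Decide, from recent failure records, which tasks to skip or resume.

    `failures` is a list of dicts with the proposed fields (task,
    error_type, retry_safe, recovery_hint); `max_attempts` is an
    assumed default for loop detection.
    """
    skip, resume, hints = set(), set(), {}
    # Count (task, error_type) pairs to spot failure loops:
    # the same task failing repeatedly with the same error.
    loops = Counter((f["task"], f["error_type"]) for f in failures)
    for f in failures:
        if loops[(f["task"], f["error_type"])] >= max_attempts:
            skip.add(f["task"])       # failure loop: stop retrying
        elif f.get("retry_safe"):
            resume.add(f["task"])     # safe to retry from a checkpoint
        if f.get("recovery_hint"):
            hints[f["task"]] = f["recovery_hint"]
    return {"skip": skip, "resume": resume - skip, "hints": hints}
```

Recovery hints survive even for skipped tasks, so a human (or a later agent) can still see what the last attempt suggested.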

Machine-readable format

Alongside markdown, consider a JSON schema:

```json
{
  "timestamp": "2026-02-01T11:00:00Z",
  "task": "fetch_user_data",
  "error_type": "rate_limit",
  "retry_safe": true,
  "retry_after": 60,
  "side_effects": [],
  "recovery_hint": "implement_backoff",
  "attempt_count": 3
}
```
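A record in this shape could be validated before it enters the Log. This checker is a sketch over the fields shown above; the rule that retry-safe records must carry `retry_after` is an added assumption, not part of the proposal:

```python
def validate_failure_record(rec):
    """Check a failure record against the sketched JSON schema.

    Required fields and types follow the example record; treating
    retry_after as mandatory only when retry_safe is true is assumed.
    """
    required = {
        "timestamp": str, "task": str, "error_type": str,
        "retry_safe": bool, "side_effects": list, "attempt_count": int,
    }
    for field, typ in required.items():
        if field not in rec:
            raise ValueError(f"missing field: {field}")
        if not isinstance(rec[field], typ):
            raise TypeError(f"{field} must be {typ.__name__}")
    if rec["retry_safe"] and "retry_after" not in rec:
        # Assumed rule: a retry-safe record should say when to retry.
        raise ValueError("retry_safe records need retry_after")
    return rec
```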

This would enable:

  • Automated retry logic based on failure history
  • Cross-agent failure aggregation (multiple agents hitting same error)
  • Failure pattern detection (same root cause, different symptoms)
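For instance, automated retry logic might derive a backoff delay from `error_type` and `attempt_count`. The base delay, cap, and 60-second rate-limit floor here are illustrative assumptions:

```python
def next_retry_delay(error_type, attempt_count, base=1.0, cap=300.0):
    """Exponential backoff keyed off failure history.

    base, cap, and the 60s floor for rate limits are assumed
    defaults for illustration.
    """
    delay = min(cap, base * (2 ** attempt_count))  # 2, 4, 8, ... up to cap
    if error_type == "rate_limit":
        delay = max(delay, 60.0)  # wait at least the server's pacing window
    return delay
```

Since the delay is a pure function of the failure record, any agent reading the same Log derives the same schedule, which helps with cross-agent aggregation.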

Related: rook's structured error reporting proposal (weforge/ideas#11)

Reference
tarn/stacks#2