From 6fa9dd013c2d00717ae3977611f8c91aa2d160cc Mon Sep 17 00:00:00 2001
From: argenis de la rosa
Date: Sat, 28 Feb 2026 20:30:51 -0500
Subject: [PATCH] docs(rfi): add F1-3 and Q0-3 state machine design docs

---
 docs/SUMMARY.md                               |   2 +
 docs/docs-inventory.md                        |   4 +-
 docs/project/README.md                        |   2 +
 ...-lifecycle-state-machine-rfi-2026-03-01.md | 193 +++++++++++++++
 ...top-reason-state-machine-rfi-2026-03-01.md | 222 ++++++++++++++++++
 5 files changed, 422 insertions(+), 1 deletion(-)
 create mode 100644 docs/project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md
 create mode 100644 docs/project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md

diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
index fe91dc26a..65a324047 100644
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -111,5 +111,7 @@ Last refreshed: **February 28, 2026**.
 - [project-triage-snapshot-2026-02-18.md](project-triage-snapshot-2026-02-18.md)
 - [docs-audit-2026-02-24.md](docs-audit-2026-02-24.md)
 - [project/m4-5-rfi-spike-2026-02-28.md](project/m4-5-rfi-spike-2026-02-28.md)
+- [project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md](project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md)
+- [project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md](project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md)
 - [i18n-gap-backlog.md](i18n-gap-backlog.md)
 - [docs-inventory.md](docs-inventory.md)
diff --git a/docs/docs-inventory.md b/docs/docs-inventory.md
index b3b1ae175..aae833215 100644
--- a/docs/docs-inventory.md
+++ b/docs/docs-inventory.md
@@ -2,7 +2,7 @@
 
 This inventory classifies documentation by intent and canonical location.
 
-Last reviewed: **February 28, 2026**.
+Last reviewed: **March 1, 2026**.
 
 ## Classification Legend
 
@@ -125,6 +125,8 @@ These are valuable context, but **not strict runtime contracts**.
 | `docs/project-triage-snapshot-2026-02-18.md` | Snapshot |
 | `docs/docs-audit-2026-02-24.md` | Snapshot (docs architecture audit) |
 | `docs/project/m4-5-rfi-spike-2026-02-28.md` | Snapshot (M4-5 workspace split RFI baseline and execution plan) |
+| `docs/project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md` | Snapshot (F1-3 lifecycle state machine RFI) |
+| `docs/project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md` | Snapshot (Q0-3 stop-reason/continuation RFI) |
 | `docs/i18n-gap-backlog.md` | Snapshot (i18n depth gap tracking) |
 
 ## Maintenance Contract
diff --git a/docs/project/README.md b/docs/project/README.md
index a2238ed5a..712ff3501 100644
--- a/docs/project/README.md
+++ b/docs/project/README.md
@@ -7,6 +7,8 @@ Time-bound project status snapshots for planning documentation and operations wo
 - [../project-triage-snapshot-2026-02-18.md](../project-triage-snapshot-2026-02-18.md)
 - [../docs-audit-2026-02-24.md](../docs-audit-2026-02-24.md)
 - [m4-5-rfi-spike-2026-02-28.md](m4-5-rfi-spike-2026-02-28.md)
+- [f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md](f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md)
+- [q0-3-stop-reason-state-machine-rfi-2026-03-01.md](q0-3-stop-reason-state-machine-rfi-2026-03-01.md)
 
 ## Scope
 
diff --git a/docs/project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md b/docs/project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md
new file mode 100644
index 000000000..69fd96bc2
--- /dev/null
+++ b/docs/project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md
@@ -0,0 +1,193 @@
+# F1-3 Agent Lifecycle State Machine RFI (2026-03-01)
+
+Status: RFI complete, implementation planning ready.
+GitHub issue: [#2308](https://github.com/zeroclaw-labs/zeroclaw/issues/2308)
+Linear: [RMN-256](https://linear.app/zeroclawlabs/issue/RMN-256/rfi-f1-3-agent-lifecycle-state-machine)
+
+## Summary
+
+ZeroClaw currently has strong component supervision and health snapshots, but it does not expose a
+formal agent lifecycle state model. This RFI defines a lifecycle FSM, transition contract,
+synchronization model, persistence posture, and migration path that can be implemented without
+changing existing daemon reliability behavior.
+
+## Current-State Findings
+
+### Existing behavior that already works
+
+- `src/daemon/mod.rs` supervises gateway/channels/heartbeat/scheduler with restart backoff.
+- `src/health/mod.rs` tracks per-component `status`, `last_ok`, `last_error`, and `restart_count`.
+- `src/agent/session.rs` persists conversational history with memory/SQLite backends and TTL cleanup.
+- `src/agent/loop_.rs` and `src/agent/agent.rs` provide bounded per-turn execution loops.
+
+### Gaps blocking lifecycle consistency
+
+- No typed lifecycle enum for the agent runtime (or per-session runtime state).
+- No validated transition guard rails (invalid transitions are not prevented centrally).
+- Health state and lifecycle state are conflated (`ok`/`error` are not full lifecycle semantics).
+- Persistence only covers health snapshots and conversation history, not lifecycle transitions.
+- No single integration contract for daemon, channels, supervisor, and health endpoint consumers.
+
+## Proposed Lifecycle Model
+
+### State definitions
+
+- `Created`: runtime object exists but not started.
+- `Starting`: dependencies are being initialized.
+- `Running`: normal operation, accepting and processing work.
+- `Degraded`: still running but with elevated failure/restart signals.
+- `Suspended`: intentionally paused (manual pause, e-stop, or maintenance gate).
+- `Backoff`: recovering after crash/error; restart cooldown active.
+- `Terminating`: graceful shutdown in progress.
+- `Terminated`: clean shutdown completed.
+- `Crashed`: unrecoverable failure after retry budget is exhausted.
+
+### State diagram
+
+```mermaid
+stateDiagram-v2
+    [*] --> Created
+    Created --> Starting: daemon run/start
+    Starting --> Running: init_ok
+    Starting --> Backoff: init_fail
+    Running --> Degraded: component_error_threshold
+    Degraded --> Running: recovered
+    Running --> Suspended: pause_or_estop
+    Degraded --> Suspended: pause_or_estop
+    Suspended --> Running: resume
+    Backoff --> Starting: retry_after_backoff
+    Backoff --> Crashed: retry_budget_exhausted
+    Running --> Terminating: shutdown_signal
+    Degraded --> Terminating: shutdown_signal
+    Suspended --> Terminating: shutdown_signal
+    Terminating --> Terminated: shutdown_complete
+    Crashed --> Terminating: manual_stop
+```
+
+### Transition table
+
+| From | Trigger | Guard | To | Action |
+|---|---|---|---|---|
+| `Created` | daemon start | config valid | `Starting` | emit lifecycle event |
+| `Starting` | init success | all required components healthy | `Running` | clear restart streak |
+| `Starting` | init failure | retry budget available | `Backoff` | increment restart streak |
+| `Running` | component errors | restart streak >= threshold | `Degraded` | set degraded cause |
+| `Degraded` | recovery success | error window clears | `Running` | clear degraded cause |
+| `Running`/`Degraded` | pause/e-stop | operator or policy signal | `Suspended` | stop intake/execution |
+| `Suspended` | resume | policy allows | `Running` | re-enable intake |
+| `Backoff` | retry timer | retry budget available | `Starting` | start component init |
+| `Backoff` | retry exhausted | no retries left | `Crashed` | emit terminal failure event |
+| non-terminal states | shutdown | signal received | `Terminating` | drain and stop workers |
+| `Terminating` | done | all workers stopped | `Terminated` | persist final snapshot |
+
+## Implementation Approach
+
+### State representation
+
+Add a dedicated lifecycle type in runtime/daemon scope:
+
+```rust
+enum AgentLifecycleState {
+    Created,
+    Starting,
+    Running,
+    Degraded { cause: String },
+    Suspended { reason: String },
+    Backoff { retry_in_ms: u64, attempt: u32 },
+    Terminating,
+    Terminated,
+    Crashed { reason: String },
+}
+```
+
+### Synchronization model
+
+- Use a single `LifecycleRegistry` (`Arc>`) owned by daemon runtime.
+- Route all lifecycle writes through `transition(from, to, trigger)` with guard checks.
+- Emit transition events from one place, then fan out to health snapshot and observability.
+- Reject invalid transitions at runtime and log them as policy violations.
+
+## Persistence Decision
+
+Decision: **hybrid persistence**.
+
+- Runtime source of truth: in-memory lifecycle registry for low-latency transitions.
+- Durable checkpoint: persisted lifecycle snapshot alongside `daemon_state.json`.
+- Optional append-only transition journal (`lifecycle_events.jsonl`) for audit and forensics.
+
+Rationale:
+
+- In-memory state keeps current daemon behavior fast and simple.
+- Persistent checkpoint enables status restoration after restart and improves operator clarity.
+- Event journal is valuable for post-incident analysis without changing runtime control flow.
+
+## Integration Points
+
+- `src/daemon/mod.rs`
+  - wrap supervisor start/failure/backoff/shutdown with explicit lifecycle transitions.
+- `src/health/mod.rs`
+  - expose lifecycle state in health snapshot without replacing component-level health detail.
+- `src/main.rs` (`status`, `restart`, e-stop surfaces)
+  - render lifecycle state and transition reason in CLI output.
+- `src/channels/mod.rs` and channel workers
+  - gate message intake when lifecycle is `Suspended`, `Terminating`, `Crashed`, or `Terminated`.
+- `src/agent/session.rs`
+  - keep session history semantics unchanged; add optional link from session to runtime lifecycle id.
+
+## Migration Plan
+
+### Phase 1: Non-breaking state plumbing
+
+- Add lifecycle enum/registry and default transitions in daemon startup/shutdown.
+- Include lifecycle state in health JSON output.
+- Keep existing component health fields unchanged.
+
+### Phase 2: Supervisor transition wiring
+
+- Convert supervisor restart/error signals into lifecycle transitions.
+- Add backoff metadata (`attempt`, `retry_in_ms`) to lifecycle snapshots.
+
+### Phase 3: Intake gating + operator controls
+
+- Enforce channel/gateway intake gating by lifecycle state.
+- Surface lifecycle controls and richer status output in CLI.
+
+### Phase 4: Persistence + event journal
+
+- Persist snapshot and optional JSONL transition events.
+- Add recovery behavior for daemon restart from persisted snapshot.
+
+## Verification and Testing Plan
+
+### Unit tests
+
+- transition guard tests for all valid/invalid state pairs.
+- lifecycle-to-health serialization tests.
+- persistence round-trip tests for snapshot and event journal.
+
+### Integration tests
+
+- daemon startup failure -> backoff -> recovery path.
+- repeated failure -> `Crashed` transition.
+- suspend/resume behavior for channel intake and scheduler activity.
+
+### Chaos/failure tests
+
+- component panic/exit simulation under supervisor.
+- rapid restart storm protection and state consistency checks.
+
+## Risks and Mitigations
+
+| Risk | Impact | Mitigation |
+|---|---|---|
+| Overlap between health and lifecycle semantics | Operator confusion | Keep both domains explicit and documented |
+| Invalid transition bugs during rollout | Runtime inconsistency | Central transition API with guard checks |
+| Excessive persistence I/O | Throughput impact | snapshot throttling + async event writes |
+| Channel behavior regressions on suspend | Message loss | add intake gating tests and dry-run mode |
+
+## Implementation Readiness Checklist
+
+- [x] State diagram and transition table documented.
+- [x] State representation and synchronization approach selected.
+- [x] Persistence strategy documented.
+- [x] Integration points and migration plan documented.
diff --git a/docs/project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md b/docs/project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md
new file mode 100644
index 000000000..b85301896
--- /dev/null
+++ b/docs/project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md
@@ -0,0 +1,222 @@
+# Q0-3 Stop-Reason State Machine + Max-Tokens Continuation RFI (2026-03-01)
+
+Status: RFI complete, implementation planning ready.
+GitHub issue: [#2309](https://github.com/zeroclaw-labs/zeroclaw/issues/2309)
+Linear: [RMN-257](https://linear.app/zeroclawlabs/issue/RMN-257/rfi-q0-3-stop-reason-state-machine-max-tokens-continuation)
+
+## Summary
+
+ZeroClaw currently parses text/tool calls and token usage across providers, but it does not carry a
+normalized stop reason into `ChatResponse`, and there is no deterministic continuation loop for
+`max_tokens` truncation. This RFI defines a provider mapping model, a continuation FSM, partial
+tool-call recovery policy, and observability/testing requirements.
+
+## Current-State Findings
+
+### Confirmed implementation behavior
+
+- `src/providers/traits.rs` `ChatResponse` has no stop-reason field.
+- Provider adapters parse text/tool-calls/usage, but stop reason fields are mostly discarded.
+- `src/agent/loop_.rs` finalizes response if no parsed tool calls are present.
+- Existing parser in `src/agent/loop_/parsing.rs` already handles many malformed/truncated
+  tool-call formats safely (no panic), but this is parsing recovery, not continuation policy.
+
+### Known gap
+
+- When a provider truncates output due to max token cap, the loop lacks a dedicated continuation
+  path. Result: partial responses can be returned silently.
+
+## Proposed Stop-Reason Model
+
+### Normalized enum
+
+```rust
+enum NormalizedStopReason {
+    EndTurn,
+    ToolCall,
+    MaxTokens,
+    ContextWindowExceeded,
+    SafetyBlocked,
+    Cancelled,
+    Unknown(String),
+}
+```
+
+### `ChatResponse` extension
+
+Add stop-reason payload to provider response contract:
+
+```rust
+pub struct ChatResponse {
+    pub text: Option,
+    pub tool_calls: Vec,
+    pub usage: Option,
+    pub reasoning_content: Option,
+    pub quota_metadata: Option,
+    pub stop_reason: Option,
+    pub raw_stop_reason: Option,
+}
+```
+
+`raw_stop_reason` preserves provider-native values for diagnostics and future mapping updates.
+
+## Provider Mapping Matrix
+
+This table defines implementation targets for active provider families in ZeroClaw.
+
+| Provider family | Native field | Native values | Normalized |
+|---|---|---|---|
+| OpenAI / OpenRouter / OpenAI-compatible chat | `finish_reason` | `stop` | `EndTurn` |
+| OpenAI / OpenRouter / OpenAI-compatible chat | `finish_reason` | `tool_calls`, `function_call` | `ToolCall` |
+| OpenAI / OpenRouter / OpenAI-compatible chat | `finish_reason` | `length` | `MaxTokens` |
+| OpenAI / OpenRouter / OpenAI-compatible chat | `finish_reason` | `content_filter` | `SafetyBlocked` |
+| Anthropic messages | `stop_reason` | `end_turn`, `stop_sequence` | `EndTurn` |
+| Anthropic messages | `stop_reason` | `tool_use` | `ToolCall` |
+| Anthropic messages | `stop_reason` | `max_tokens` | `MaxTokens` |
+| Anthropic messages | `stop_reason` | `model_context_window_exceeded` | `ContextWindowExceeded` |
+| Gemini generateContent | `finishReason` | `STOP` | `EndTurn` |
+| Gemini generateContent | `finishReason` | `MAX_TOKENS` | `MaxTokens` |
+| Gemini generateContent | `finishReason` | `SAFETY`, `RECITATION` | `SafetyBlocked` |
+| Bedrock Converse | `stopReason` | `end_turn` | `EndTurn` |
+| Bedrock Converse | `stopReason` | `tool_use` | `ToolCall` |
+| Bedrock Converse | `stopReason` | `max_tokens` | `MaxTokens` |
+| Bedrock Converse | `stopReason` | `guardrail_intervened` | `SafetyBlocked` |
+
+Notes:
+
+- Unknown values map to `Unknown(raw)` and must be logged once per provider/model combination.
+- Mapping must be unit-tested against fixture payloads for each provider adapter.
+
+## Continuation State Machine
+
+### Goals
+
+- Continue only when stop reason indicates output truncation.
+- Bound retries and total output growth.
+- Preserve tool-call correctness (never execute partial JSON).
+
+### State diagram
+
+```mermaid
+stateDiagram-v2
+    [*] --> Request
+    Request --> EvaluateStop: provider_response
+    EvaluateStop --> Complete: EndTurn
+    EvaluateStop --> ExecuteTools: ToolCall
+    EvaluateStop --> ContinuePending: MaxTokens
+    EvaluateStop --> Abort: SafetyBlocked/ContextWindowExceeded/UnknownFatal
+    ContinuePending --> RequestContinuation: under_limits
+    RequestContinuation --> EvaluateStop: provider_response
+    ContinuePending --> AbortPartial: retry_limit_or_budget_exceeded
+    AbortPartial --> Complete: return_partial_with_notice
+    ExecuteTools --> Request: tool_results_appended
+```
+
+### Hard limits (defaults)
+
+- `max_continuations_per_turn = 3`
+- `max_total_completion_tokens_per_turn = 4 * initial_max_tokens` (configurable)
+- `max_total_output_chars_per_turn = 120_000` (safety cap)
+
+## Partial Tool-Call JSON Policy
+
+### Rules
+
+- Never execute tool calls when parsed payload is incomplete/ambiguous.
+- If `MaxTokens` and parser detects malformed/partial tool-call body:
+  - request deterministic re-emission of the tool call payload only.
+  - keep attempt budget separate (`max_tool_repair_attempts = 1`).
+- If repair fails, degrade safely:
+  - return a partial response with explicit truncation notice.
+  - emit structured event for operator diagnosis.
+
+### Recovery prompt contract
+
+Use a strict system-side continuation hint:
+
+```text
+Previous response was truncated by token limit.
+Continue exactly from where you left off.
+If you intended a tool call, emit one complete tool call payload only.
+Do not repeat already-sent text.
+```
+
+## Observability Requirements
+
+Emit structured events per turn:
+
+- `stop_reason_observed`
+  - provider, model, normalized reason, raw reason, turn id, iteration.
+- `continuation_attempt`
+  - attempt index, cumulative output tokens/chars, budget remaining.
+- `continuation_terminated`
+  - terminal reason (`completed`, `retry_limit`, `budget_exhausted`, `safety_blocked`).
+- `tool_payload_repair`
+  - parse issue type, repair attempted, repair success/failure.
+
+Metrics:
+
+- counter: continuations triggered by provider/model.
+- counter: truncation exits without continuation (guardrail/budget cases).
+- histogram: continuation attempts per turn.
+- histogram: end-to-end turn latency for continued turns.
+
+## Implementation Outline
+
+### Provider layer
+
+- Parse and map native stop reason fields in each adapter.
+- Populate `stop_reason` and `raw_stop_reason` in `ChatResponse`.
+- Add fixture-based unit tests for mapping.
+
+### Agent loop layer
+
+- Introduce `ContinuationController` in `src/agent/loop_.rs`.
+- Route `MaxTokens` through continuation FSM before finalization.
+- Merge continuation text chunks into one coherent assistant response.
+- Keep existing tool parsing and loop-detection guards intact.
+
+### Config layer
+
+Add config keys under `agent`:
+
+- `continuation_max_attempts`
+- `continuation_max_output_chars`
+- `continuation_max_total_completion_tokens`
+- `continuation_tool_repair_attempts`
+
+## Verification and Testing Plan
+
+### Unit tests
+
+- stop-reason mapping tests per provider adapter.
+- continuation FSM transition tests (all terminal paths).
+- budget cap tests and retry-limit behavior.
+
+### Integration tests
+
+- mock provider returns `MaxTokens` then successful continuation.
+- mock provider returns repeated `MaxTokens` until retry cap.
+- mock provider emits partial tool-call JSON then repaired payload.
+
+### Regression tests
+
+- ensure non-truncated normal responses are unchanged.
+- ensure existing parser recovery tests in `loop_/parsing.rs` remain green.
+- verify no duplicate text when continuation merges.
+
+## Risks and Mitigations
+
+| Risk | Impact | Mitigation |
+|---|---|---|
+| Provider mapping drift | incorrect continuation triggers | keep `raw_stop_reason` + tests |
+| Continuation repetition loops | poor UX, extra tokens | dedupe heuristics + strict caps |
+| Partial tool-call execution | unsafe tool behavior | hard block on malformed payload |
+| Latency growth | slower responses | cap attempts and emit metrics |
+
+## Implementation Readiness Checklist
+
+- [x] Provider stop-reason mapping documented.
+- [x] Continuation policy and hard limits documented.
+- [x] Partial tool-call handling strategy documented.
+- [x] Proposed state machine documented for implementation.
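
Illustrative note, not part of the patch: the OpenAI-compatible rows of the Q0-3 mapping matrix and the `EvaluateStop` decision with its hard-limit defaults could be sketched roughly as follows. This is a minimal sketch under stated assumptions. `map_openai_finish_reason`, `ContinuationLimits`, `NextStep`, and `next_step` are hypothetical names that do not appear in the RFI, and treating `Cancelled` and `Unknown` as `Abort` is an assumption (the diagram only names `UnknownFatal`).

```rust
// Hypothetical sketch of the Q0-3 mapping and continuation decision.
// Only `NormalizedStopReason` and the limit values come from the RFI text.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum NormalizedStopReason {
    EndTurn,
    ToolCall,
    MaxTokens,
    ContextWindowExceeded,
    SafetyBlocked,
    Cancelled,
    Unknown(String),
}

/// Maps an OpenAI-compatible `finish_reason` string per the mapping matrix.
/// Unrecognized values are preserved in `Unknown(raw)`.
fn map_openai_finish_reason(raw: &str) -> NormalizedStopReason {
    match raw {
        "stop" => NormalizedStopReason::EndTurn,
        "tool_calls" | "function_call" => NormalizedStopReason::ToolCall,
        "length" => NormalizedStopReason::MaxTokens,
        "content_filter" => NormalizedStopReason::SafetyBlocked,
        other => NormalizedStopReason::Unknown(other.to_string()),
    }
}

/// Per-turn hard limits (hypothetical struct; values are the RFI defaults).
struct ContinuationLimits {
    max_continuations: u32,
    max_total_completion_tokens: u64,
    max_total_output_chars: usize,
}

impl ContinuationLimits {
    fn defaults(initial_max_tokens: u64) -> Self {
        Self {
            max_continuations: 3,
            max_total_completion_tokens: 4 * initial_max_tokens,
            max_total_output_chars: 120_000,
        }
    }
}

/// Next FSM step after evaluating a stop reason (names follow the diagram).
#[derive(Debug, PartialEq)]
enum NextStep {
    Complete,
    ExecuteTools,
    RequestContinuation,
    AbortPartial,
    Abort,
}

/// Decides the next step from the stop reason and the budget used so far.
fn next_step(
    reason: &NormalizedStopReason,
    continuations_so_far: u32,
    completion_tokens_so_far: u64,
    output_chars_so_far: usize,
    limits: &ContinuationLimits,
) -> NextStep {
    match reason {
        NormalizedStopReason::EndTurn => NextStep::Complete,
        NormalizedStopReason::ToolCall => NextStep::ExecuteTools,
        NormalizedStopReason::MaxTokens => {
            // Continue only while every budget still has headroom.
            let under_limits = continuations_so_far < limits.max_continuations
                && completion_tokens_so_far < limits.max_total_completion_tokens
                && output_chars_so_far < limits.max_total_output_chars;
            if under_limits {
                NextStep::RequestContinuation
            } else {
                NextStep::AbortPartial
            }
        }
        // Assumption: Cancelled and Unknown are treated as fatal here.
        NormalizedStopReason::SafetyBlocked
        | NormalizedStopReason::ContextWindowExceeded
        | NormalizedStopReason::Cancelled
        | NormalizedStopReason::Unknown(_) => NextStep::Abort,
    }
}
```

With `ContinuationLimits::defaults(1024)`, a first `length` response yields `RequestContinuation`, while the same response after three continuations yields `AbortPartial`. Keeping the decision a pure function of the reason and the counters is one way to make the "continuation FSM transition tests (all terminal paths)" item cheap to cover.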