From 6fa9dd013c2d00717ae3977611f8c91aa2d160cc Mon Sep 17 00:00:00 2001
From: argenis de la rosa <theonlyhennygod@gmail.com>
Date: Sat, 28 Feb 2026 20:30:51 -0500
Subject: [PATCH] docs(rfi): add F1-3 and Q0-3 state machine design docs

---
 docs/SUMMARY.md                               |   2 +
 docs/docs-inventory.md                        |   4 +-
 docs/project/README.md                        |   2 +
 ...-lifecycle-state-machine-rfi-2026-03-01.md | 193 +++++++++++++++
 ...top-reason-state-machine-rfi-2026-03-01.md | 222 ++++++++++++++++++
 5 files changed, 422 insertions(+), 1 deletion(-)
 create mode 100644 docs/project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md
 create mode 100644 docs/project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md

diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
index fe91dc26a..65a324047 100644
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -111,5 +111,7 @@ Last refreshed: **February 28, 2026**.
 - [project-triage-snapshot-2026-02-18.md](project-triage-snapshot-2026-02-18.md)
 - [docs-audit-2026-02-24.md](docs-audit-2026-02-24.md)
 - [project/m4-5-rfi-spike-2026-02-28.md](project/m4-5-rfi-spike-2026-02-28.md)
+- [project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md](project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md)
+- [project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md](project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md)
 - [i18n-gap-backlog.md](i18n-gap-backlog.md)
 - [docs-inventory.md](docs-inventory.md)
diff --git a/docs/docs-inventory.md b/docs/docs-inventory.md
index b3b1ae175..aae833215 100644
--- a/docs/docs-inventory.md
+++ b/docs/docs-inventory.md
@@ -2,7 +2,7 @@
 
 This inventory classifies documentation by intent and canonical location.
 
-Last reviewed: **February 28, 2026**.
+Last reviewed: **March 1, 2026**.
 
 ## Classification Legend
 
@@ -125,6 +125,8 @@ These are valuable context, but **not strict runtime contracts**.
 | `docs/project-triage-snapshot-2026-02-18.md` | Snapshot |
 | `docs/docs-audit-2026-02-24.md` | Snapshot (docs architecture audit) |
 | `docs/project/m4-5-rfi-spike-2026-02-28.md` | Snapshot (M4-5 workspace split RFI baseline and execution plan) |
+| `docs/project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md` | Snapshot (F1-3 lifecycle state machine RFI) |
+| `docs/project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md` | Snapshot (Q0-3 stop-reason/continuation RFI) |
 | `docs/i18n-gap-backlog.md` | Snapshot (i18n depth gap tracking) |
 
 ## Maintenance Contract
diff --git a/docs/project/README.md b/docs/project/README.md
index a2238ed5a..712ff3501 100644
--- a/docs/project/README.md
+++ b/docs/project/README.md
@@ -7,6 +7,8 @@ Time-bound project status snapshots for planning documentation and operations wo
 - [../project-triage-snapshot-2026-02-18.md](../project-triage-snapshot-2026-02-18.md)
 - [../docs-audit-2026-02-24.md](../docs-audit-2026-02-24.md)
 - [m4-5-rfi-spike-2026-02-28.md](m4-5-rfi-spike-2026-02-28.md)
+- [f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md](f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md)
+- [q0-3-stop-reason-state-machine-rfi-2026-03-01.md](q0-3-stop-reason-state-machine-rfi-2026-03-01.md)
 
 ## Scope
 
diff --git a/docs/project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md b/docs/project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md
new file mode 100644
index 000000000..69fd96bc2
--- /dev/null
+++ b/docs/project/f1-3-agent-lifecycle-state-machine-rfi-2026-03-01.md
@@ -0,0 +1,193 @@
+# F1-3 Agent Lifecycle State Machine RFI (2026-03-01)
+
+- Status: RFI complete, implementation planning ready.
+- GitHub issue: [#2308](https://github.com/zeroclaw-labs/zeroclaw/issues/2308)
+- Linear: [RMN-256](https://linear.app/zeroclawlabs/issue/RMN-256/rfi-f1-3-agent-lifecycle-state-machine)
+
+## Summary
+
+ZeroClaw currently has strong component supervision and health snapshots, but it does not expose a
+formal agent lifecycle state model. This RFI defines a lifecycle FSM, transition contract,
+synchronization model, persistence posture, and migration path that can be implemented without
+changing existing daemon reliability behavior.
+
+## Current-State Findings
+
+### Existing behavior that already works
+
+- `src/daemon/mod.rs` supervises gateway/channels/heartbeat/scheduler with restart backoff.
+- `src/health/mod.rs` tracks per-component `status`, `last_ok`, `last_error`, and `restart_count`.
+- `src/agent/session.rs` persists conversational history with memory/SQLite backends and TTL cleanup.
+- `src/agent/loop_.rs` and `src/agent/agent.rs` provide bounded per-turn execution loops.
+
+### Gaps blocking lifecycle consistency
+
+- No typed lifecycle enum for the agent runtime (or per-session runtime state).
+- No validated transition guard rails (invalid transitions are not prevented centrally).
+- Health state and lifecycle state are conflated (`ok`/`error` are not full lifecycle semantics).
+- Persistence only covers health snapshots and conversation history, not lifecycle transitions.
+- No single integration contract for daemon, channels, supervisor, and health endpoint consumers.
+
+## Proposed Lifecycle Model
+
+### State definitions
+
+- `Created`: runtime object exists but has not been started.
+- `Starting`: dependencies are being initialized.
+- `Running`: normal operation, accepting and processing work.
+- `Degraded`: still running but with elevated failure/restart signals.
+- `Suspended`: intentionally paused (manual pause, e-stop, or maintenance gate).
+- `Backoff`: recovering after crash/error; restart cooldown active.
+- `Terminating`: graceful shutdown in progress.
+- `Terminated`: clean shutdown completed.
+- `Crashed`: unrecoverable failure after retry budget is exhausted.
+
+### State diagram
+
+```mermaid
+stateDiagram-v2
+    [*] --> Created
+    Created --> Starting: daemon run/start
+    Starting --> Running: init_ok
+    Starting --> Backoff: init_fail
+    Running --> Degraded: component_error_threshold
+    Degraded --> Running: recovered
+    Running --> Suspended: pause_or_estop
+    Degraded --> Suspended: pause_or_estop
+    Suspended --> Running: resume
+    Backoff --> Starting: retry_after_backoff
+    Backoff --> Crashed: retry_budget_exhausted
+    Running --> Terminating: shutdown_signal
+    Degraded --> Terminating: shutdown_signal
+    Suspended --> Terminating: shutdown_signal
+    Terminating --> Terminated: shutdown_complete
+    Crashed --> Terminating: manual_stop
+```
+
+### Transition table
+
+| From | Trigger | Guard | To | Action |
+|---|---|---|---|---|
+| `Created` | daemon start | config valid | `Starting` | emit lifecycle event |
+| `Starting` | init success | all required components healthy | `Running` | clear restart streak |
+| `Starting` | init failure | retry budget available | `Backoff` | increment restart streak |
+| `Running` | component errors | restart streak >= threshold | `Degraded` | set degraded cause |
+| `Degraded` | recovery success | error window clears | `Running` | clear degraded cause |
+| `Running`/`Degraded` | pause/e-stop | operator or policy signal | `Suspended` | stop intake/execution |
+| `Suspended` | resume | policy allows | `Running` | re-enable intake |
+| `Backoff` | retry timer | retry budget available | `Starting` | start component init |
+| `Backoff` | retry exhausted | no retries left | `Crashed` | emit terminal failure event |
+| non-terminal states | shutdown | signal received | `Terminating` | drain and stop workers |
+| `Terminating` | done | all workers stopped | `Terminated` | persist final snapshot |
+
+## Implementation Approach
+
+### State representation
+
+Add a dedicated lifecycle type in runtime/daemon scope:
+
+```rust
+enum AgentLifecycleState {
+    Created,
+    Starting,
+    Running,
+    Degraded { cause: String },
+    Suspended { reason: String },
+    Backoff { retry_in_ms: u64, attempt: u32 },
+    Terminating,
+    Terminated,
+    Crashed { reason: String },
+}
+```
+
+### Synchronization model
+
+- Use a single `LifecycleRegistry` (`Arc<RwLock<...>>`) owned by daemon runtime.
+- Route all lifecycle writes through `transition(from, to, trigger)` with guard checks.
+- Emit transition events from one place, then fan out to health snapshot and observability.
+- Reject invalid transitions at runtime and log them as policy violations.
+
+## Persistence Decision
+
+Decision: **hybrid persistence**.
+
+- Runtime source of truth: in-memory lifecycle registry for low-latency transitions.
+- Durable checkpoint: persisted lifecycle snapshot alongside `daemon_state.json`.
+- Optional append-only transition journal (`lifecycle_events.jsonl`) for audit and forensics.
+
+Rationale:
+
+- In-memory state keeps current daemon behavior fast and simple.
+- Persistent checkpoint enables status restoration after restart and improves operator clarity.
+- Event journal is valuable for post-incident analysis without changing runtime control flow.
+
+## Integration Points
+
+- `src/daemon/mod.rs`
+  - wrap supervisor start/failure/backoff/shutdown with explicit lifecycle transitions.
+- `src/health/mod.rs`
+  - expose lifecycle state in health snapshot without replacing component-level health detail.
+- `src/main.rs` (`status`, `restart`, e-stop surfaces)
+  - render lifecycle state and transition reason in CLI output.
+- `src/channels/mod.rs` and channel workers
+  - gate message intake when lifecycle is `Suspended`, `Terminating`, `Crashed`, or `Terminated`.
+- `src/agent/session.rs`
+  - keep session history semantics unchanged; add optional link from session to runtime lifecycle id.
+
+## Migration Plan
+
+### Phase 1: Non-breaking state plumbing
+
+- Add lifecycle enum/registry and default transitions in daemon startup/shutdown.
+- Include lifecycle state in health JSON output.
+- Keep existing component health fields unchanged.
+
+### Phase 2: Supervisor transition wiring
+
+- Convert supervisor restart/error signals into lifecycle transitions.
+- Add backoff metadata (`attempt`, `retry_in_ms`) to lifecycle snapshots.
+
+### Phase 3: Intake gating + operator controls
+
+- Enforce channel/gateway intake gating by lifecycle state.
+- Surface lifecycle controls and richer status output in CLI.
+
+### Phase 4: Persistence + event journal
+
+- Persist snapshot and optional JSONL transition events.
+- Add recovery behavior for daemon restart from persisted snapshot.
+
+## Verification and Testing Plan
+
+### Unit tests
+
+- transition guard tests for all valid/invalid state pairs.
+- lifecycle-to-health serialization tests.
+- persistence round-trip tests for snapshot and event journal.
+
+### Integration tests
+
+- daemon startup failure -> backoff -> recovery path.
+- repeated failure -> `Crashed` transition.
+- suspend/resume behavior for channel intake and scheduler activity.
+
+### Chaos/failure tests
+
+- component panic/exit simulation under supervisor.
+- rapid restart storm protection and state consistency checks.
+
+## Risks and Mitigations
+
+| Risk | Impact | Mitigation |
+|---|---|---|
+| Overlap between health and lifecycle semantics | Operator confusion | Keep both domains explicit and documented |
+| Invalid transition bugs during rollout | Runtime inconsistency | Central transition API with guard checks |
+| Excessive persistence I/O | Throughput impact | Snapshot throttling + async event writes |
+| Channel behavior regressions on suspend | Message loss | Add intake gating tests and dry-run mode |
+
+## Implementation Readiness Checklist
+
+- [x] State diagram and transition table documented.
+- [x] State representation and synchronization approach selected.
+- [x] Persistence strategy documented.
+- [x] Integration points and migration plan documented.
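Reviewer note: the state diagram and synchronization model in the F1-3 document can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the planned implementation. It follows the state diagram's edges (which are narrower than the table's "non-terminal states" shutdown row), and the `is_valid_transition` helper, the `current` accessor, the `String` error type, and the simplified `transition(to, trigger)` signature are all hypothetical.

```rust
use std::sync::{Arc, RwLock};

// State names and payloads are copied from the RFI; the derives are an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AgentLifecycleState {
    Created,
    Starting,
    Running,
    Degraded { cause: String },
    Suspended { reason: String },
    Backoff { retry_in_ms: u64, attempt: u32 },
    Terminating,
    Terminated,
    Crashed { reason: String },
}

/// Hypothetical guard: true only for edges drawn in the RFI state diagram.
pub fn is_valid_transition(from: &AgentLifecycleState, to: &AgentLifecycleState) -> bool {
    use AgentLifecycleState::*;
    matches!(
        (from, to),
        (Created, Starting)
            | (Starting, Running)
            | (Starting, Backoff { .. })
            | (Running, Degraded { .. })
            | (Degraded { .. }, Running)
            | (Running | Degraded { .. }, Suspended { .. })
            | (Suspended { .. }, Running)
            | (Backoff { .. }, Starting)
            | (Backoff { .. }, Crashed { .. })
            | (Running | Degraded { .. } | Suspended { .. } | Crashed { .. }, Terminating)
            | (Terminating, Terminated)
    )
}

/// Single shared registry; every lifecycle write goes through `transition`.
#[derive(Clone)]
pub struct LifecycleRegistry {
    state: Arc<RwLock<AgentLifecycleState>>,
}

impl LifecycleRegistry {
    pub fn new() -> Self {
        Self { state: Arc::new(RwLock::new(AgentLifecycleState::Created)) }
    }

    /// Returns a copy of the current state.
    pub fn current(&self) -> AgentLifecycleState {
        self.state.read().expect("lifecycle lock poisoned").clone()
    }

    /// Applies the transition if the guard allows it; otherwise leaves the
    /// state untouched and reports the rejected edge (assumed error shape).
    pub fn transition(&self, to: AgentLifecycleState, trigger: &str) -> Result<(), String> {
        let mut guard = self.state.write().expect("lifecycle lock poisoned");
        if !is_valid_transition(&guard, &to) {
            return Err(format!("invalid transition {:?} -> {:?} on {}", *guard, to, trigger));
        }
        *guard = to;
        Ok(())
    }
}
```

Holding the write lock across both the guard check and the assignment keeps check-then-set atomic, which is the property the central transition API is meant to provide. Event emission and health fan-out are left out of the sketch.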
diff --git a/docs/project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md b/docs/project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md
new file mode 100644
index 000000000..b85301896
--- /dev/null
+++ b/docs/project/q0-3-stop-reason-state-machine-rfi-2026-03-01.md
@@ -0,0 +1,222 @@
+# Q0-3 Stop-Reason State Machine + Max-Tokens Continuation RFI (2026-03-01)
+
+- Status: RFI complete, implementation planning ready.
+- GitHub issue: [#2309](https://github.com/zeroclaw-labs/zeroclaw/issues/2309)
+- Linear: [RMN-257](https://linear.app/zeroclawlabs/issue/RMN-257/rfi-q0-3-stop-reason-state-machine-max-tokens-continuation)
+
+## Summary
+
+ZeroClaw currently parses text/tool calls and token usage across providers, but it does not carry a
+normalized stop reason into `ChatResponse`, and there is no deterministic continuation loop for
+`max_tokens` truncation. This RFI defines a provider mapping model, a continuation FSM, partial
+tool-call recovery policy, and observability/testing requirements.
+
+## Current-State Findings
+
+### Confirmed implementation behavior
+
+- `src/providers/traits.rs` `ChatResponse` has no stop-reason field.
+- Provider adapters parse text/tool-calls/usage, but stop reason fields are mostly discarded.
+- `src/agent/loop_.rs` finalizes the response whenever no parsed tool calls are present.
+- Existing parser in `src/agent/loop_/parsing.rs` already handles many malformed/truncated
+  tool-call formats safely (no panic), but this is parsing recovery, not continuation policy.
+
+### Known gap
+
+- When a provider truncates output because the max-token cap was reached, the loop has no dedicated
+  continuation path, so partial responses can be returned silently.
+
+## Proposed Stop-Reason Model
+
+### Normalized enum
+
+```rust
+enum NormalizedStopReason {
+    EndTurn,
+    ToolCall,
+    MaxTokens,
+    ContextWindowExceeded,
+    SafetyBlocked,
+    Cancelled,
+    Unknown(String),
+}
+```
+
+### `ChatResponse` extension
+
+Add stop-reason payload to provider response contract:
+
+```rust
+pub struct ChatResponse {
+    pub text: Option<String>,
+    pub tool_calls: Vec<ToolCall>,
+    pub usage: Option<TokenUsage>,
+    pub reasoning_content: Option<String>,
+    pub quota_metadata: Option<QuotaMetadata>,
+    pub stop_reason: Option<NormalizedStopReason>,
+    pub raw_stop_reason: Option<String>,
+}
+```
+
+`raw_stop_reason` preserves provider-native values for diagnostics and future mapping updates.
+
+## Provider Mapping Matrix
+
+This table defines implementation targets for active provider families in ZeroClaw.
+
+| Provider family | Native field | Native values | Normalized |
+|---|---|---|---|
+| OpenAI / OpenRouter / OpenAI-compatible chat | `finish_reason` | `stop` | `EndTurn` |
+| OpenAI / OpenRouter / OpenAI-compatible chat | `finish_reason` | `tool_calls`, `function_call` | `ToolCall` |
+| OpenAI / OpenRouter / OpenAI-compatible chat | `finish_reason` | `length` | `MaxTokens` |
+| OpenAI / OpenRouter / OpenAI-compatible chat | `finish_reason` | `content_filter` | `SafetyBlocked` |
+| Anthropic messages | `stop_reason` | `end_turn`, `stop_sequence` | `EndTurn` |
+| Anthropic messages | `stop_reason` | `tool_use` | `ToolCall` |
+| Anthropic messages | `stop_reason` | `max_tokens` | `MaxTokens` |
+| Anthropic messages | `stop_reason` | `model_context_window_exceeded` | `ContextWindowExceeded` |
+| Gemini generateContent | `finishReason` | `STOP` | `EndTurn` |
+| Gemini generateContent | `finishReason` | `MAX_TOKENS` | `MaxTokens` |
+| Gemini generateContent | `finishReason` | `SAFETY`, `RECITATION` | `SafetyBlocked` |
+| Bedrock Converse | `stopReason` | `end_turn` | `EndTurn` |
+| Bedrock Converse | `stopReason` | `tool_use` | `ToolCall` |
+| Bedrock Converse | `stopReason` | `max_tokens` | `MaxTokens` |
+| Bedrock Converse | `stopReason` | `guardrail_intervened` | `SafetyBlocked` |
+
+Notes:
+
+- Unknown values map to `Unknown(raw)` and must be logged once per provider/model combination.
+- Mapping must be unit-tested against fixture payloads for each provider adapter.
+
+## Continuation State Machine
+
+### Goals
+
+- Continue only when stop reason indicates output truncation.
+- Bound retries and total output growth.
+- Preserve tool-call correctness (never execute partial JSON).
+
+### State diagram
+
+```mermaid
+stateDiagram-v2
+    [*] --> Request
+    Request --> EvaluateStop: provider_response
+    EvaluateStop --> Complete: EndTurn
+    EvaluateStop --> ExecuteTools: ToolCall
+    EvaluateStop --> ContinuePending: MaxTokens
+    EvaluateStop --> Abort: SafetyBlocked/ContextWindowExceeded/UnknownFatal
+    ContinuePending --> RequestContinuation: under_limits
+    RequestContinuation --> EvaluateStop: provider_response
+    ContinuePending --> AbortPartial: retry_limit_or_budget_exceeded
+    AbortPartial --> Complete: return_partial_with_notice
+    ExecuteTools --> Request: tool_results_appended
+```
+
+### Hard limits (defaults)
+
+- `max_continuations_per_turn = 3`
+- `max_total_completion_tokens_per_turn = 4 * initial_max_tokens` (configurable)
+- `max_total_output_chars_per_turn = 120_000` (safety cap)
+
+## Partial Tool-Call JSON Policy
+
+### Rules
+
+- Never execute tool calls when the parsed payload is incomplete or ambiguous.
+- If the stop reason is `MaxTokens` and the parser detects a malformed or partial tool-call body:
+  - request deterministic re-emission of the tool call payload only.
+  - keep the repair attempt budget separate from the continuation budget (`max_tool_repair_attempts = 1`).
+- If repair fails, degrade safely:
+  - return a partial response with explicit truncation notice.
+  - emit structured event for operator diagnosis.
+
+### Recovery prompt contract
+
+Use a strict system-side continuation hint:
+
+```text
+Previous response was truncated by token limit.
+Continue exactly from where you left off.
+If you intended a tool call, emit one complete tool call payload only.
+Do not repeat already-sent text.
+```
+
+## Observability Requirements
+
+Emit structured events per turn:
+
+- `stop_reason_observed`
+  - provider, model, normalized reason, raw reason, turn id, iteration.
+- `continuation_attempt`
+  - attempt index, cumulative output tokens/chars, budget remaining.
+- `continuation_terminated`
+  - terminal reason (`completed`, `retry_limit`, `budget_exhausted`, `safety_blocked`).
+- `tool_payload_repair`
+  - parse issue type, repair attempted, repair success/failure.
+
+Metrics:
+
+- counter: continuations triggered by provider/model.
+- counter: truncation exits without continuation (guardrail/budget cases).
+- histogram: continuation attempts per turn.
+- histogram: end-to-end turn latency for continued turns.
+
+## Implementation Outline
+
+### Provider layer
+
+- Parse and map native stop reason fields in each adapter.
+- Populate `stop_reason` and `raw_stop_reason` in `ChatResponse`.
+- Add fixture-based unit tests for mapping.
+
+### Agent loop layer
+
+- Introduce `ContinuationController` in `src/agent/loop_.rs`.
+- Route `MaxTokens` through continuation FSM before finalization.
+- Merge continuation text chunks into one coherent assistant response.
+- Keep existing tool parsing and loop-detection guards intact.
+
+### Config layer
+
+Add config keys under `agent`, one for each hard limit and for the tool-repair budget defined above:
+
+- `continuation_max_attempts`
+- `continuation_max_output_chars`
+- `continuation_max_total_completion_tokens`
+- `continuation_tool_repair_attempts`
+
+## Verification and Testing Plan
+
+### Unit tests
+
+- stop-reason mapping tests per provider adapter.
+- continuation FSM transition tests (all terminal paths).
+- budget cap tests and retry-limit behavior.
+
+### Integration tests
+
+- mock provider returns `MaxTokens` then successful continuation.
+- mock provider returns repeated `MaxTokens` until retry cap.
+- mock provider emits partial tool-call JSON then repaired payload.
+
+### Regression tests
+
+- ensure non-truncated normal responses are unchanged.
+- ensure existing parser recovery tests in `loop_/parsing.rs` remain green.
+- verify no duplicate text when continuation merges.
+
+## Risks and Mitigations
+
+| Risk | Impact | Mitigation |
+|---|---|---|
+| Provider mapping drift | incorrect continuation triggers | keep `raw_stop_reason` + tests |
+| Continuation repetition loops | poor UX, extra tokens | dedupe heuristics + strict caps |
+| Partial tool-call execution | unsafe tool behavior | hard block on malformed payload |
+| Latency growth | slower responses | cap attempts and emit metrics |
+
+## Implementation Readiness Checklist
+
+- [x] Provider stop-reason mapping documented.
+- [x] Continuation policy and hard limits documented.
+- [x] Partial tool-call handling strategy documented.
+- [x] Proposed state machine documented for implementation.
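Reviewer note: the provider mapping matrix and the hard limits in the Q0-3 document can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the planned implementation. `ProviderFamily`, `normalize_stop_reason`, `ContinuationLimits`, `NextStep`, and `next_step` are hypothetical names, and routing `Cancelled` and every `Unknown` value to `Abort` is an assumption, since the diagram only names `UnknownFatal`.

```rust
// Enum copied from the RFI; the derives are an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NormalizedStopReason {
    EndTurn,
    ToolCall,
    MaxTokens,
    ContextWindowExceeded,
    SafetyBlocked,
    Cancelled,
    Unknown(String),
}

/// Hypothetical tag for the provider families listed in the mapping matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProviderFamily {
    OpenAiCompatible,
    Anthropic,
    Gemini,
    Bedrock,
}

/// Maps a provider-native stop value to the normalized enum, row by row.
pub fn normalize_stop_reason(family: ProviderFamily, raw: &str) -> NormalizedStopReason {
    use NormalizedStopReason::*;
    use ProviderFamily::*;
    match (family, raw) {
        (OpenAiCompatible, "stop") => EndTurn,
        (OpenAiCompatible, "tool_calls" | "function_call") => ToolCall,
        (OpenAiCompatible, "length") => MaxTokens,
        (OpenAiCompatible, "content_filter") => SafetyBlocked,
        (Anthropic, "end_turn" | "stop_sequence") => EndTurn,
        (Anthropic, "tool_use") => ToolCall,
        (Anthropic, "max_tokens") => MaxTokens,
        (Anthropic, "model_context_window_exceeded") => ContextWindowExceeded,
        (Gemini, "STOP") => EndTurn,
        (Gemini, "MAX_TOKENS") => MaxTokens,
        (Gemini, "SAFETY" | "RECITATION") => SafetyBlocked,
        (Bedrock, "end_turn") => EndTurn,
        (Bedrock, "tool_use") => ToolCall,
        (Bedrock, "max_tokens") => MaxTokens,
        (Bedrock, "guardrail_intervened") => SafetyBlocked,
        // Anything not in the matrix keeps its raw value for diagnostics.
        (_, other) => Unknown(other.to_string()),
    }
}

/// Hard limits with the default values stated in the RFI.
pub struct ContinuationLimits {
    pub max_continuations_per_turn: u32,
    pub max_total_completion_tokens_per_turn: u64,
    pub max_total_output_chars_per_turn: usize,
}

impl ContinuationLimits {
    pub fn defaults(initial_max_tokens: u64) -> Self {
        Self {
            max_continuations_per_turn: 3,
            max_total_completion_tokens_per_turn: 4 * initial_max_tokens,
            max_total_output_chars_per_turn: 120_000,
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    Complete,
    ExecuteTools,
    RequestContinuation,
    AbortPartial,
    Abort,
}

/// Decides the edge out of `EvaluateStop` (and `ContinuePending`) in the diagram.
pub fn next_step(
    reason: &NormalizedStopReason,
    attempts_so_far: u32,
    completion_tokens_so_far: u64,
    output_chars_so_far: usize,
    limits: &ContinuationLimits,
) -> NextStep {
    match reason {
        NormalizedStopReason::EndTurn => NextStep::Complete,
        NormalizedStopReason::ToolCall => NextStep::ExecuteTools,
        NormalizedStopReason::MaxTokens => {
            let under_limits = attempts_so_far < limits.max_continuations_per_turn
                && completion_tokens_so_far < limits.max_total_completion_tokens_per_turn
                && output_chars_so_far < limits.max_total_output_chars_per_turn;
            if under_limits { NextStep::RequestContinuation } else { NextStep::AbortPartial }
        }
        // Assumption: Cancelled and all Unknown values abort, like SafetyBlocked.
        _ => NextStep::Abort,
    }
}
```

Keeping the mapping a pure function of `(family, raw)` makes it straightforward to cover with the fixture-based tests the RFI asks for, and keeping `next_step` free of I/O lets the FSM transition tests run without a mock provider.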