From 9769822dc85e76bdd48b18be23a4c028819b9c1e Mon Sep 17 00:00:00 2001
From: Chummy
Date: Wed, 25 Feb 2026 14:13:11 +0000
Subject: [PATCH] docs(ci): harden matrix/nightly gate mapping and escalation runbooks

---
 docs/ci-map.md                                |  2 +
 docs/operations/feature-matrix-runbook.md     | 47 +++++++++++++++++--
 .../nightly-all-features-runbook.md           | 41 ++++++++++++++++
 docs/operations/required-check-mapping.md     | 17 +++++++
 4 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/docs/ci-map.md b/docs/ci-map.md
index 9a6ef69ef..e3a300c80 100644
--- a/docs/ci-map.md
+++ b/docs/ci-map.md
@@ -36,8 +36,10 @@ Merge-blocking checks should stay small and deterministic. Optional checks are u
 - `.github/workflows/feature-matrix.yml` (`Feature Matrix`)
     - Purpose: compile-time matrix validation for `default`, `whatsapp-web`, `browser-native`, and `nightly-all-features` lanes
     - Additional behavior: each lane emits machine-readable result artifacts; summary lane aggregates owner routing from `.github/release/nightly-owner-routing.json`
+    - Additional behavior: required-check mapping is anchored to stable job name `Feature Matrix Summary`; lane jobs stay informational
 - `.github/workflows/nightly-all-features.yml` (`Nightly All-Features`)
     - Purpose: scheduled high-risk matrix execution with per-lane artifacts and summary rollup for overnight signal quality
+    - Additional behavior: owner routing + escalation policy is documented in `docs/operations/nightly-all-features-runbook.md`
 - `.github/workflows/sec-audit.yml` (`Security Audit`)
     - Purpose: dependency advisories (`rustsec/audit-check`, pinned SHA), policy/license checks (`cargo deny`), gitleaks-based secrets governance (allowlist policy metadata + expiry guard), and SBOM snapshot artifacts (`CycloneDX` + `SPDX`)
 - `.github/workflows/sec-codeql.yml` (`CodeQL Analysis`)
diff --git a/docs/operations/feature-matrix-runbook.md b/docs/operations/feature-matrix-runbook.md
index 5494b325d..514d06b3e 100644
--- a/docs/operations/feature-matrix-runbook.md
+++ b/docs/operations/feature-matrix-runbook.md
@@ -22,10 +22,51 @@ Workflow: `.github/workflows/feature-matrix.yml`
 
 - Per-lane report: `feature-matrix-`
 - Aggregated report: `feature-matrix-summary` (`feature-matrix-summary.json`, `feature-matrix-summary.md`)
+- Retention: 21 days for lane + summary artifacts
+
+## Required Check Contract
+
+Branch protection should use stable, non-matrix-expanded check names for merge gates:
+
+- `Feature Matrix Summary` (from `feature-matrix.yml`)
+
+Matrix lane jobs stay observable but are not required check targets:
+
+- `Matrix Lane (default)`
+- `Matrix Lane (whatsapp-web)`
+- `Matrix Lane (browser-native)`
+- `Matrix Lane (nightly-all-features)`
+
+Check-name stability rule:
+
+- Do not rename the job names above without updating `docs/operations/required-check-mapping.md`.
+- Keep lane names in the matrix include-list stable to avoid check-name drift.
+
+Verification commands:
+
+- `gh run list --repo zeroclaw-labs/zeroclaw --workflow feature-matrix.yml --limit 3`
+- `gh run view --repo zeroclaw-labs/zeroclaw --json jobs --jq '.jobs[].name'`
 
 ## Failure Triage
 
-1. Open `feature-matrix-summary.md` and identify failed lane(s).
+1. Open `feature-matrix-summary.md` and identify failed lane(s), owner, and failing command.
 2. Download lane artifact (`nightly-result-.json`) for exact command + exit code.
-3. Reproduce locally using the reported command.
-4. Attach reproduction output to the corresponding Linear execution issue.
+3. Reproduce locally with the exact command and toolchain lock (`--locked`).
+4. Attach local reproduction logs + fix PR link to the active Linear execution issue.
+
+## High-Frequency Failure Classes
+
+| Failure class | Signal | First response | Escalation trigger |
+| --- | --- | --- | --- |
+| Rust dependency lock drift | `cargo check --locked` fails with lock mismatch | run `cargo update -p ` only when needed; regenerate lockfile in focused PR | same lane fails on 2 consecutive runs |
+| Feature-flag compile drift (`whatsapp-web`) | missing symbols or cfg-gated modules | run the lane command locally and inspect feature-gated module imports | unresolved in 24h |
+| Feature-flag compile drift (`browser-native`) | platform/feature binding compile errors | inspect browser-native cfg paths and recent dependency bumps | unresolved in 24h |
+| System package dependency drift (`nightly-all-features`) | missing `libudev`/`pkg-config` or linker errors | verify apt install step succeeded; rerun in clean container with same deps | recurs 3 times in 7 days |
+| CI environment/runtime regressions | lane timeout or infrastructure transient failure | re-run once, compare with prior successful run, then isolate infra vs code | 2+ lanes impacted in one run |
+| Summary aggregation contract break | `Feature Matrix Summary` fails to parse artifacts | verify artifact names + JSON schema from lane outputs | any merge-gate failure on protected branches |
+
+## Debug Data Expectations
+
+- Lane JSON must include: lane, status, exit_code, duration_seconds, command.
+- Summary JSON must include: total, passed, failed, per-lane rows, owner routing.
+- Preserve artifacts for at least one full release cycle (21 days currently configured).
diff --git a/docs/operations/nightly-all-features-runbook.md b/docs/operations/nightly-all-features-runbook.md
index 306c68455..c539ee87b 100644
--- a/docs/operations/nightly-all-features-runbook.md
+++ b/docs/operations/nightly-all-features-runbook.md
@@ -22,6 +22,46 @@ Lane owners are configured in `.github/release/nightly-owner-routing.json`.
 
 - Per-lane: `nightly-lane-` with `nightly-result-.json`
 - Aggregate: `nightly-all-features-summary` with `nightly-summary.json` and `nightly-summary.md`
+- Retention: 30 days for lane + summary artifacts
+
+## Scheduler and Activation Notes
+
+- Schedule contract: daily at `03:15 UTC` (`cron: 15 3 * * *`).
+- Determinism contract: pinned Rust toolchain (`1.92.0`), locked Cargo commands, explicit apt package install for all-features lane.
+- GitHub schedule/discovery caveat: scheduled and `workflow_dispatch` discovery is driven by the repository default branch workflow catalog. If this workflow is only on `dev`, promote `dev -> main` before expecting native schedule/dispatch visibility.
+
+## Ownership Routing and Escalation
+
+Owner routing source: `.github/release/nightly-owner-routing.json`
+
+- `default` -> `@chumyin`
+- `whatsapp-web` -> `@chumyin`
+- `browser-native` -> `@chumyin`
+- `nightly-all-features` -> `@chumyin`
+
+Escalation thresholds:
+
+- Single-lane nightly failure: notify mapped owner within 30 minutes of triage start.
+- Same lane fails for 2 consecutive nightly runs: escalate in release governance thread and link both run URLs.
+- 3 or more lanes fail in one nightly run: open incident issue and page on-call maintainer.
+- Failure unresolved for 24 hours: escalate to maintainers list and block related release promotion tasks.
+
+SLA targets:
+
+- Acknowledge: within 30 minutes during working window.
+- Initial diagnosis update: within 4 hours.
+- Mitigation PR or rollback decision: within 24 hours.
+
+## Traceability (Last 3 Runs)
+
+Use:
+
+- `gh run list --repo zeroclaw-labs/zeroclaw --workflow nightly-all-features.yml --limit 3`
+- `gh run view --repo zeroclaw-labs/zeroclaw --json jobs,headSha,event,createdAt,url`
+
+Project update expectation:
+
+- Every weekly status update links the latest 3 nightly runs (URL + conclusion + failed lanes).
 
 ## Failure Handling
 
@@ -29,3 +69,4 @@ Lane owners are configured in `.github/release/nightly-owner-routing.json`.
 2. Download the failed lane artifact and rerun the exact command locally.
 3. Capture fix PR + test evidence.
 4. Link remediation back to release or CI governance issues.
+5. If escalation threshold is hit, include escalation ticket/runbook action in the issue update.
diff --git a/docs/operations/required-check-mapping.md b/docs/operations/required-check-mapping.md
index 083877e82..8a8a45774 100644
--- a/docs/operations/required-check-mapping.md
+++ b/docs/operations/required-check-mapping.md
@@ -11,6 +11,13 @@ This document maps merge-critical workflows to expected check names.
 | `Feature Matrix Summary` | `.github/workflows/feature-matrix.yml` | feature-combination compile matrix |
 | `Workflow Sanity` | `.github/workflows/workflow-sanity.yml` | workflow syntax and lint |
 
+Feature matrix lane check names (informational, non-required):
+
+- `Matrix Lane (default)`
+- `Matrix Lane (whatsapp-web)`
+- `Matrix Lane (browser-native)`
+- `Matrix Lane (nightly-all-features)`
+
 ## Promotion to `main`
 
 | Required check name | Source workflow | Scope |
@@ -27,8 +34,18 @@ This document maps merge-critical workflows to expected check names.
 | `Pre-release Guard` | `.github/workflows/pub-prerelease.yml` | stage progression + tag integrity |
 | `Nightly Summary & Routing` | `.github/workflows/nightly-all-features.yml` | overnight integration signal |
 
+## Verification Procedure
+
+1. Resolve latest workflow run IDs:
+   - `gh run list --repo zeroclaw-labs/zeroclaw --workflow feature-matrix.yml --limit 1`
+   - `gh run list --repo zeroclaw-labs/zeroclaw --workflow ci-run.yml --limit 1`
+2. Enumerate check/job names and compare to this mapping:
+   - `gh run view --repo zeroclaw-labs/zeroclaw --json jobs --jq '.jobs[].name'`
+3. If any merge-critical check name changed, update this file before changing branch protection policy.
+
 ## Notes
 
 - Use pinned `uses:` references for all workflow actions.
 - Keep check names stable; renaming check jobs can break branch protection rules.
+- GitHub scheduled/manual discovery for workflows is default-branch driven. If a release/nightly workflow only exists on `dev`, promotion to `main` is required before default-branch schedule visibility is expected.
 - Update this mapping whenever merge-critical workflows/jobs are added or renamed.
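
Reviewer note: the `Debug Data Expectations` list added to `docs/operations/feature-matrix-runbook.md` names the fields every lane JSON must carry, and the `Required Check Contract` names the job that gates merges. A minimal Python sketch of how both contracts might be checked follows. It is not part of this patch or of the repository's tooling: the helper names `missing_lane_fields` and `missing_required_checks` and the sample payload are hypothetical, and only the field names and the check name are taken from the runbook text.

```python
import json

# Field names copied from the "Debug Data Expectations" list in the runbook.
REQUIRED_LANE_FIELDS = ("lane", "status", "exit_code", "duration_seconds", "command")

# Required check name copied from the "Required Check Contract" section.
REQUIRED_CHECKS = ("Feature Matrix Summary",)


def missing_lane_fields(raw_json: str) -> list[str]:
    """Return the required lane fields absent from a lane result document."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("lane result must be a JSON object")
    return [field for field in REQUIRED_LANE_FIELDS if field not in data]


def missing_required_checks(job_names: list[str]) -> list[str]:
    """Return required check names that do not appear among a run's job names."""
    present = set(job_names)
    return [name for name in REQUIRED_CHECKS if name not in present]


# Hypothetical lane payload; the values are illustrative, not real CI output.
sample = json.dumps(
    {"lane": "default", "status": "pass", "exit_code": 0, "command": "cargo check --locked"}
)
print(missing_lane_fields(sample))
print(missing_required_checks(["Matrix Lane (default)", "Feature Matrix Summary"]))
```

The job-name list passed to `missing_required_checks` could come from the `gh run view ... --jq '.jobs[].name'` command in the verification steps, one name per output line. Returning the missing names rather than a boolean keeps the result usable directly in a triage comment.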