Connectivity Probes Runbook
This runbook defines how maintainers operate the provider/model connectivity matrix probes.
Last verified: February 24, 2026.
Scope
Covers the scheduled/manual workflow:
- .github/workflows/ci-connectivity-probes.yml
- scripts/ci/provider_connectivity_matrix.py
- .github/connectivity/probe-contract.json
Probe purpose:
- verify provider catalog discovery (doctor models --provider ...)
- classify failures into actionable buckets
- keep CI noise low with transient-failure policy
- publish machine-readable artifacts for triage
Contract Model
Contract file: .github/connectivity/probe-contract.json
Each provider entry defines:
- name: display label in report
- provider: provider ID passed to zeroclaw doctor models --provider
- required: whether this provider can gate the run
- secret_env: required credential env var name for live probe
- timeout_sec: per-attempt timeout
- retries: retry count for transient failures
- notes: operator-facing context
Global field:
- consecutive_transient_failures_to_escalate: threshold for promoting network/rate_limit from warning to gate failure on required providers
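As a rough illustration of the contract shape described above, the following sketch checks a parsed contract for the listed fields. The field names come from this runbook; the top-level `providers` key, the expected types, and the sample values are assumptions made for the example and may not match the real probe-contract.json.

```python
import json
from typing import List

# Field names are taken from the runbook; the types are assumptions.
PROVIDER_FIELDS = {
    "name": str,
    "provider": str,
    "required": bool,
    "secret_env": str,
    "timeout_sec": (int, float),
    "retries": int,
    "notes": str,
}


def validate_contract(contract: dict) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks sane."""
    problems = []
    threshold = contract.get("consecutive_transient_failures_to_escalate")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, int) or threshold < 1:
        problems.append(
            "consecutive_transient_failures_to_escalate must be a positive integer"
        )
    # "providers" as the list key is hypothetical, not taken from the real file.
    for index, entry in enumerate(contract.get("providers", [])):
        for field, expected in PROVIDER_FIELDS.items():
            if field not in entry:
                problems.append(f"providers[{index}]: missing {field}")
            elif not isinstance(entry[field], expected):
                problems.append(f"providers[{index}]: {field} has unexpected type")
    return problems


# Hypothetical contract, for illustration only.
EXAMPLE = json.loads("""
{
  "consecutive_transient_failures_to_escalate": 3,
  "providers": [
    {"name": "Example", "provider": "example", "required": true,
     "secret_env": "EXAMPLE_API_KEY", "timeout_sec": 30, "retries": 2,
     "notes": "illustrative entry"}
  ]
}
""")

print(validate_contract(EXAMPLE))
```

A check like this could run before merging a contract change, alongside the local probe run.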
Failure Taxonomy
Categories in connectivity-report.json:
- auth: missing/invalid credential, permission denied, quota/access issues
- network: timeout, DNS/connectivity/TLS transport failures
- unavailable: unsupported endpoint, 404, empty model list, service unavailable
- rate_limit: HTTP 429 / explicit rate-limit errors
- other: uncategorized provider failures
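To make the taxonomy concrete, here is a minimal sketch of a message classifier that maps a failure message onto these five categories. The regular expressions are illustrative guesses and are not the classifier hints used by provider_connectivity_matrix.py; only the category names come from this runbook.

```python
import re

# Ordered (category, pattern) hints; the first match wins.
# All patterns are hypothetical examples, not the script's real hints.
CATEGORY_HINTS = [
    ("rate_limit", re.compile(r"\b429\b|rate.?limit", re.I)),
    ("auth", re.compile(
        r"\b40[13]\b|unauthorized|forbidden|invalid api key|permission denied|quota",
        re.I,
    )),
    ("network", re.compile(
        r"timed? ?out|\bdns\b|\btls\b|connection (refused|reset)", re.I
    )),
    ("unavailable", re.compile(
        r"\b404\b|\b503\b|unsupported|not supported|no models|service unavailable",
        re.I,
    )),
]


def classify_failure(message: str) -> str:
    """Return the first matching category, or "other" when nothing matches."""
    for category, pattern in CATEGORY_HINTS:
        if pattern.search(message):
            return category
    return "other"


for sample in ("HTTP 429 Too Many Requests", "401 Unauthorized", "request timed out"):
    print(sample, "->", classify_failure(sample))
```

Checking rate_limit before auth is a deliberate choice in this sketch: a message mentioning both a 429 and a quota is treated as transient rather than as a credential problem.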
Gate Policy
Default policy implemented by provider_connectivity_matrix.py:
- Required provider + auth/unavailable/other => immediate gate failure
- Required provider + network/rate_limit => gate only after reaching transient threshold
- Optional provider failures => never gate
- Missing secret on required provider => immediate gate failure
For ad-hoc diagnostics, use workflow input enforcement_mode=report-only.
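The policy above can be sketched as a small decision function. This is an illustration under stated assumptions, not the code in provider_connectivity_matrix.py: `ProbeResult`, `should_gate`, and `next_streak` are hypothetical names, the streak is assumed to include the current failure, report-only mode is assumed never to gate, and a missing secret is assumed to be reported under the auth category.

```python
from dataclasses import dataclass
from typing import Optional

TRANSIENT = {"network", "rate_limit"}


@dataclass
class ProbeResult:
    provider: str
    required: bool
    category: Optional[str]  # None means the probe succeeded


def next_streak(previous: int, category: Optional[str]) -> int:
    """Count consecutive transient failures; anything else resets the streak."""
    return previous + 1 if category in TRANSIENT else 0


def should_gate(result: ProbeResult, streak: int, threshold: int,
                report_only: bool = False) -> bool:
    """Decide whether this result fails the run under the default policy."""
    if report_only or result.category is None or not result.required:
        return False
    if result.category in TRANSIENT:
        # Transient failures gate only once the streak reaches the threshold.
        return streak >= threshold
    # auth / unavailable / other on a required provider gate immediately.
    return True


streak = 0
for category in ("network", "network", "network"):
    streak = next_streak(streak, category)
    result = ProbeResult("example", True, category)
    print(streak, should_gate(result, streak, threshold=3))
```

With a threshold of 3, the first two network failures stay warnings and the third one gates.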
CI Artifacts
Each run uploads:
- connectivity-report.json (full machine-readable matrix)
- connectivity-summary.md (human summary table)
- .ci/connectivity-state.json (transient tracking state)
- .ci/connectivity-raw.log (per-probe raw line log)
The markdown summary is also appended to GITHUB_STEP_SUMMARY.
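For reference, GitHub Actions exposes GITHUB_STEP_SUMMARY as an environment variable holding the path of a file that collects the job summary. A minimal sketch of the append step might look like the following; `append_step_summary` is a hypothetical helper, and the actual workflow may do this in a shell step instead.

```python
import os
from pathlib import Path


def append_step_summary(markdown_path: str) -> bool:
    """Append a markdown file to the job summary; return False outside Actions."""
    summary_target = os.environ.get("GITHUB_STEP_SUMMARY")
    if not summary_target:
        # Running locally: the variable is only set inside a workflow job.
        return False
    text = Path(markdown_path).read_text(encoding="utf-8")
    # Open in append mode so earlier summary content in the job is preserved.
    with open(summary_target, "a", encoding="utf-8") as handle:
        handle.write(text.rstrip("\n") + "\n")
    return True
```

Returning False when the variable is unset keeps the same helper usable during local reproduction.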
Local Reproduction
Build binary first:
cargo build --profile release-fast --locked --bin zeroclaw
Run probes in enforce mode:
python3 scripts/ci/provider_connectivity_matrix.py \
--binary target/release-fast/zeroclaw \
--contract .github/connectivity/probe-contract.json \
--state-file .ci/connectivity-state.json \
--output-json connectivity-report.json \
--output-markdown connectivity-summary.md
Run report-only mode:
python3 scripts/ci/provider_connectivity_matrix.py \
--binary target/release-fast/zeroclaw \
--contract .github/connectivity/probe-contract.json \
--report-only
Triage Playbook
- Open connectivity-summary.md for quick provider matrix.
- For gate failures, inspect category and message in connectivity-report.json.
- Follow category-specific actions:
  - auth:
    - verify secret exists and is non-empty
    - rotate secret if revoked/expired
    - confirm plan/permission supports /models
  - network:
    - retry once manually (workflow_dispatch)
    - check provider status page / GitHub Actions network incidents
    - escalate only after threshold is exceeded
  - unavailable:
    - validate endpoint path contract
    - confirm provider still supports live model discovery
  - rate_limit:
    - re-run later or reduce probe frequency for that provider
    - ensure provider plan allows current request rate
  - other:
    - inspect raw log and provider response body
    - adjust classifier hints if recurring and actionable
Change Control
When editing .github/connectivity/probe-contract.json:
- Explain why provider requirement or threshold changed.
- Keep required set small and stable to avoid alert fatigue.
- Run local probe once before merging.
- Update this runbook if policy behavior changed.