
# Provider Model Fallback Configuration

This guide explains how to configure ZeroClaw to handle quota exhaustion and model failures gracefully using provider and model fallback chains.

## Problem: Rate Limits and Model Quota

When using AI providers, you may encounter:

1. **Rate limits (429)** - your primary provider has no available quota.
2. **Model-specific quota** - a specific model (e.g., `gemini-2.0-flash-exp`) is exhausted, but other models from the same provider still work.
3. **Model incompatibility** - fallback providers don't support the same model names.

## Solution 1: Provider Fallback with Model-Specific Defaults

Configure fallback providers with their own default models:

```toml
# config.toml
default_provider = "google"
default_model = "gemini-2.0-flash-exp"

[reliability]
# When Gemini fails, fall back to OpenAI Codex
fallback_providers = ["openai-codex:codex-1"]

# Provider-specific configuration
[[providers.openai_codex]]
profile = "codex-1"
model = "gpt-4o-mini"  # Use this model when falling back to Codex
```

How it works:

1. The primary request goes to Gemini with `gemini-2.0-flash-exp`.
2. If Gemini returns 429 (rate limited), ZeroClaw tries the fallback.
3. The fallback provider (OpenAI Codex) uses its configured model (`gpt-4o-mini`) instead of trying the Gemini-specific model.
4. No `400 Bad Request: model not supported` errors!
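
To make the remapping concrete, here is a minimal Python sketch of the selection logic. `send_request`, `RateLimitError`, and `FALLBACK_DEFAULT_MODELS` are hypothetical stand-ins for illustration, not ZeroClaw's actual internals:

```python
# Minimal sketch of Solution 1; all names here are illustrative stand-ins.

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response."""

def send_request(provider: str, model: str, prompt: str) -> str:
    """Placeholder for a real provider client call."""
    raise RateLimitError(f"{provider} is rate limited")

# Mirrors the [[providers.openai_codex]] block above: each fallback
# provider carries its own default model.
FALLBACK_DEFAULT_MODELS = {"openai-codex:codex-1": "gpt-4o-mini"}

def complete(prompt: str) -> str:
    try:
        return send_request("google", "gemini-2.0-flash-exp", prompt)
    except RateLimitError:
        for fallback in ["openai-codex:codex-1"]:
            # Use the fallback's own model, never the Gemini model name,
            # which avoids "400 Bad Request: model not supported".
            model = FALLBACK_DEFAULT_MODELS[fallback]
            try:
                return send_request(fallback, model, prompt)
            except RateLimitError:
                continue
        raise
```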

## Solution 2: Model Fallback Within the Same Provider

When quota for one model is exhausted but other models from the same provider work:

```toml
# config.toml
default_provider = "google"
default_model = "gemini-2.0-flash-exp"

[reliability]
# Try alternative Gemini models when quota is exhausted
[reliability.model_fallbacks]
"gemini-2.0-flash-exp" = ["gemini-1.5-pro", "gemini-1.5-flash"]
```

How it works:

1. The request tries `gemini-2.0-flash-exp` first.
2. If it fails with a rate limit (429), retry with `gemini-1.5-pro`.
3. If that also fails, retry with `gemini-1.5-flash`.
4. All retries use the same provider (Gemini).
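
As a sketch, the loop stays on one provider and simply walks the configured model list; `call_google` is a hypothetical placeholder, not a real client:

```python
# Minimal sketch of Solution 2: rotate models within one provider on 429.

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response."""

def call_google(model: str, prompt: str) -> str:
    """Placeholder for a real Gemini client call."""
    raise RateLimitError(model)

# Mirrors [reliability.model_fallbacks] above.
MODEL_FALLBACKS = {
    "gemini-2.0-flash-exp": ["gemini-1.5-pro", "gemini-1.5-flash"],
}

def complete(prompt: str, model: str = "gemini-2.0-flash-exp") -> str:
    # Primary model first, then each configured alternative, same provider.
    for candidate in [model, *MODEL_FALLBACKS.get(model, [])]:
        try:
            return call_google(candidate, prompt)
        except RateLimitError:
            continue  # quota exhausted for this model; try the next one
    raise RateLimitError("all configured Gemini models are exhausted")
```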

## Solution 3: Combined Provider + Model Fallback

For maximum reliability, combine both strategies:

```toml
# config.toml
default_provider = "google"
default_model = "gemini-2.0-flash-exp"

[reliability]
# Provider fallback chain
fallback_providers = ["anthropic", "openai-codex:codex-1"]

# Model fallback within each provider
[reliability.model_fallbacks]
"gemini-2.0-flash-exp" = ["gemini-1.5-pro"]
"claude-opus-4" = ["claude-sonnet-4"]

# Provider-specific models for fallbacks
[[providers.openai_codex]]
profile = "codex-1"
model = "gpt-4o-mini"
```

Fallback order:

1. `gemini-2.0-flash-exp` on Google
2. `gemini-1.5-pro` on Google (model fallback)
3. `claude-opus-4` on Anthropic (provider fallback)
4. `claude-sonnet-4` on Anthropic (model fallback)
5. `gpt-4o-mini` on OpenAI Codex (provider fallback with default model)
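
One way to picture the combined behavior is as a flattened attempt list: each provider is expanded with its model fallbacks before moving on to the next provider. This sketch (hypothetical structures, not ZeroClaw internals) reproduces the five steps above:

```python
# Expand the provider chain plus model fallbacks into a flat attempt order.

PROVIDER_CHAIN = [
    ("google", "gemini-2.0-flash-exp"),
    ("anthropic", "claude-opus-4"),
    ("openai-codex", "gpt-4o-mini"),  # model from [[providers.openai_codex]]
]

MODEL_FALLBACKS = {
    "gemini-2.0-flash-exp": ["gemini-1.5-pro"],
    "claude-opus-4": ["claude-sonnet-4"],
}

def attempt_order(chain, model_fallbacks):
    """Yield (provider, model) pairs in the order they would be tried."""
    for provider, model in chain:
        yield provider, model
        for alternative in model_fallbacks.get(model, []):
            yield provider, alternative

for provider, model in attempt_order(PROVIDER_CHAIN, MODEL_FALLBACKS):
    print(f"{model} on {provider}")
```

Running this prints the same five `model on provider` lines as the list above.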

## Retry Configuration

Fine-tune retry behavior:

```toml
[reliability]
provider_retries = 2        # Retry each provider 2 times before moving to next
provider_backoff_ms = 500   # Initial wait of 500 ms, growing exponentially between retries
```
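
Assuming the backoff doubles on each attempt (the exact curve and any jitter are implementation details), the delay schedule looks like this:

```python
provider_retries = 2       # retries per provider, as configured above
provider_backoff_ms = 500  # initial backoff

# Assumed doubling schedule: 500 ms before retry 1, 1000 ms before retry 2.
for attempt in range(provider_retries):
    delay_ms = provider_backoff_ms * 2 ** attempt
    print(f"retry {attempt + 1}: wait {delay_ms} ms")
```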

## Real-World Example: Multi-Region Gemini Setup

```toml
# config.toml
default_provider = "google"
default_model = "gemini-2.0-flash-exp"

[reliability]
# Rotate through API keys (round-robin on 429)
api_keys = [
    "sk-key-for-project-a",
    "sk-key-for-project-b",
    "sk-key-for-project-c"
]

# Provider fallback if all Gemini quota is exhausted
# (must appear before the subtable below, or TOML assigns it to model_fallbacks)
fallback_providers = ["anthropic", "openai"]

# Model fallback within Gemini
[reliability.model_fallbacks]
"gemini-2.0-flash-exp" = [
    "gemini-1.5-pro-latest",
    "gemini-1.5-flash-8b"
]
```

What happens when quota runs out:

1. Try `gemini-2.0-flash-exp` with `sk-key-for-project-a`.
2. 429 → rotate to `sk-key-for-project-b` and retry.
3. 429 → rotate to `sk-key-for-project-c` and retry.
4. Still failing → try `gemini-1.5-pro-latest` (model fallback).
5. Still failing → try `gemini-1.5-flash-8b` (model fallback).
6. Still failing → fall back to Anthropic Claude.
7. Still failing → fall back to OpenAI.
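
Steps 1-3 are a round-robin over keys: on each 429, advance to the next key and retry, and only when every key is exhausted does model fallback take over. A hypothetical sketch (`call_gemini` is a placeholder, not a real client):

```python
# Round-robin key rotation on 429, as in steps 1-3 above.

API_KEYS = [
    "sk-key-for-project-a",
    "sk-key-for-project-b",
    "sk-key-for-project-c",
]

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response."""

def call_gemini(api_key: str, model: str, prompt: str) -> str:
    """Placeholder for a real Gemini client call."""
    raise RateLimitError(api_key)

def complete(prompt: str, model: str = "gemini-2.0-flash-exp") -> str:
    for key in API_KEYS:
        try:
            return call_gemini(key, model, prompt)
        except RateLimitError:
            continue  # 429: rotate to the next project's key
    # All keys exhausted; a real implementation would now apply the
    # model fallbacks and, after those, the provider fallbacks.
    raise RateLimitError("all API keys exhausted")
```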

## Logging and Monitoring

When fallback occurs, ZeroClaw logs:

```
[INFO] Provider recovered (failover/retry)
  provider="openai-codex"
  model="gpt-4o-mini"
  original_model="gemini-2.0-flash-exp"
  requested_model="gemini-2.0-flash-exp"
  attempt=1
```

Monitor these logs to:

- Detect when quotas are running low
- Identify which fallback paths are used most
- Optimize your provider and model configuration
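
As a hypothetical starting point, a small script can tally which fallback paths fire most often. This sketch assumes the `key="value"` fields shown above arrive as one event per line, and that the log is written to a file named `zeroclaw.log` (both are assumptions about your logging setup, not ZeroClaw defaults):

```python
import re
from collections import Counter

# Matches key="value" pairs like provider="openai-codex".
FIELD = re.compile(r'(\w+)="([^"]*)"')

paths = Counter()
with open("zeroclaw.log") as log:  # hypothetical log location
    for line in log:
        fields = dict(FIELD.findall(line))
        if "original_model" in fields and "provider" in fields:
            source = fields["original_model"]
            target = f"{fields['provider']}/{fields.get('model', '?')}"
            paths[f"{source} -> {target}"] += 1

for path, count in paths.most_common():
    print(f"{count:6d}  {path}")
```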

## Best Practices

1. **Test your fallback chain** - ensure every provider in the chain has valid credentials.
2. **Use model fallbacks first** - trying another model from the same provider is cheaper than switching providers.
3. **Set appropriate retry counts** - too many retries slow down responses; too few miss transient failures.
4. **Monitor costs** - fallback models may have different pricing.
5. **Keep provider-specific models updated** - when adding new providers, configure their default models.

## See Also