* fix(web-fetch): remove dead feature gates, add noise stripping, add docstrings
The nanohtml2text and fast_html2md providers were both guarded by
cfg(feature) checks for features (web-fetch-plaintext, web-fetch-html2md)
that are never declared in Cargo.toml. Because the gated code was silently
compiled out, every web_fetch call returned an error instead of fetching
content.
Changes:
- Add strip_noise_elements() which removes <script>, <style>, <nav>,
<header>, <footer>, <aside>, <noscript>, <form>, <button> blocks
before text extraction, eliminating menu/ad/boilerplate noise.
- Fix fast_html2md path: when web-fetch-html2md feature is not compiled
in, fall through to nanohtml2text rather than returning an error.
- Fix nanohtml2text path: remove dead cfg(feature = "web-fetch-plaintext")
gate; nanohtml2text is a direct dependency and needs no feature flag.
- Both previously gated tests (html_to_markdown_conversion_preserves_structure,
html_to_plaintext_conversion_removes_html_tags) are now always-on.
Added strip_noise_removes_nav_scripts_footer test.
- Add docstrings to all public/private methods to meet coverage threshold.
Tavily and firecrawl providers are unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
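The noise stripping described above can be sketched roughly as follows. This is a minimal, std-only illustration (naive substring matching, no nested-tag handling); the actual strip_noise_elements implementation may differ:

```rust
/// Illustrative sketch: remove whole element blocks for boilerplate tags
/// before text extraction. Naive: assumes non-nested, well-closed tags.
fn strip_noise_elements(html: &str) -> String {
    const NOISE_TAGS: [&str; 9] = [
        "script", "style", "nav", "header", "footer",
        "aside", "noscript", "form", "button",
    ];
    let mut out = html.to_string();
    for tag in NOISE_TAGS {
        let open = format!("<{tag}");
        let close = format!("</{tag}>");
        // Repeatedly cut the first <tag ...> ... </tag> block found.
        while let Some(start) = out.to_ascii_lowercase().find(&open) {
            match out.to_ascii_lowercase()[start..].find(&close) {
                Some(rel) => {
                    let end = start + rel + close.len();
                    out.replace_range(start..end, "");
                }
                None => break, // unclosed tag: leave the rest as-is
            }
        }
    }
    out
}

fn main() {
    let html = "<nav>menu</nav><p>Article body</p><script>track()</script>";
    println!("{}", strip_noise_elements(html)); // <p>Article body</p>
}
```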
* fix(web-fetch): align default provider to nanohtml2text, remove dead feature
- Change empty-provider default from deprecated 'fast_html2md' to
'nanohtml2text' to match WEB_FETCH_PROVIDER_HELP and PR description.
- Remove dead 'web-fetch-plaintext' feature from Cargo.toml (no code
references it after the feature-gate removal).
- Apply cargo fmt to strip_noise_elements array formatting.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: xj <gh-xj@users.noreply.github.com>
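The intended resolution order after these two commits can be sketched as below. Provider names come from the commit messages; the helper itself is illustrative, not the actual code:

```rust
/// Illustrative sketch of provider resolution: an empty provider defaults
/// to nanohtml2text, and the deprecated fast_html2md falls through to
/// nanohtml2text rather than returning an error.
fn resolve_provider(configured: &str) -> Result<&'static str, String> {
    match configured {
        "" | "nanohtml2text" | "fast_html2md" => Ok("nanohtml2text"),
        "tavily" => Ok("tavily"),
        "firecrawl" => Ok("firecrawl"),
        other => Err(format!("unknown web_fetch provider: {other}")),
    }
}

fn main() {
    println!("{:?}", resolve_provider(""));             // Ok("nanohtml2text")
    println!("{:?}", resolve_provider("fast_html2md")); // Ok("nanohtml2text")
}
```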
Ports remaining changes from feat/unify-web-fetch-providers that were
not yet integrated into dev:
- config/schema.rs: add `user_agent` field (default "ZeroClaw/1.0") to
HttpRequestConfig, WebFetchConfig, and WebSearchConfig, with a shared
default_user_agent() helper. Field is serde-default so existing configs
remain backward compatible.
- tools/http_request.rs: accept user_agent in constructor; pass it to
reqwest::Client via .user_agent() replacing the implicit default.
- tools/web_fetch.rs: accept user_agent in constructor; replace hardcoded
"ZeroClaw/0.1 (web_fetch)" in build_http_client with the configured value.
- tools/web_search_tool.rs: accept user_agent in constructor; replace
hardcoded Chrome UA string in search_duckduckgo and add .user_agent()
to the Brave and Firecrawl client builders.
- tools/mod.rs: wire user_agent from each config struct into the
corresponding tool constructor (HttpRequestTool, WebFetchTool,
WebSearchTool).
- onboard/wizard.rs: add setup_web_tools() as wizard Step 6 "Web &
Internet Tools" (total steps bumped from 9 to 10). Configures
WebSearchConfig, WebFetchConfig, and HttpRequestConfig interactively
with provider selection and optional API key/URL prompts. The http_request
and web_search outputs of Step 5's setup_tool_mode() are now discarded
(_, _), since Step 6 owns that configuration. Uses dev's generic
api_key/api_url schema fields unchanged.
Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit fb83da8db021903cf5844852bdb67b9b259941d7)
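The backward-compatible default can be sketched with a std-only stand-in. The field and helper names (user_agent, default_user_agent, WebFetchConfig) come from the notes above; the real structs additionally derive serde and mark the field #[serde(default = "default_user_agent")] so existing configs without it still deserialize:

```rust
/// Shared default used by HttpRequestConfig, WebFetchConfig, and
/// WebSearchConfig (per the port notes above).
fn default_user_agent() -> String {
    "ZeroClaw/1.0".to_string()
}

/// Trimmed-down stand-in for the real config struct; the actual type
/// derives serde, so a missing user_agent field falls back to the default.
#[derive(Debug)]
struct WebFetchConfig {
    user_agent: String,
}

impl Default for WebFetchConfig {
    fn default() -> Self {
        Self { user_agent: default_user_agent() }
    }
}

fn main() {
    let cfg = WebFetchConfig::default();
    println!("{}", cfg.user_agent); // ZeroClaw/1.0
}
```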
- Problem: The existing http_request tool returns raw HTML/JSON, which is nearly unusable for LLMs to extract
meaningful content from web pages.
- Why it matters: All mainstream AI agents (Claude Code, Gemini CLI, Aider) have dedicated web content extraction
tools. ZeroClaw lacks this capability, limiting its ability to research and gather information from the web.
- What changed: Added a new web_fetch tool that fetches web pages and converts HTML to clean plain text using
nanohtml2text. Includes domain allowlist/blocklist, SSRF protection, redirect following, and content-type aware
processing.
- What did not change (scope boundary): http_request tool is untouched. No shared code extracted between http_request
and web_fetch (DRY rule-of-three: only 2 callers). No changes to existing tool behavior or defaults.
Label Snapshot (required)
- Risk label: risk: medium
- Size label: size: M
- Scope labels: tool, config
- Module labels: tool: web_fetch
- If any auto-label is incorrect, note requested correction: N/A
Change Metadata
- Change type: feature
- Primary scope: tool
Linked Issue
- Closes #
- Related #
- Depends on #
- Supersedes #
Supersede Attribution (required when Supersedes # is used)
N/A
Validation Evidence (required)
cargo fmt --all -- --check # pass
cargo clippy --all-targets -- -D warnings # no new warnings (pre-existing warnings only)
cargo test --lib -- web_fetch # 26/26 passed
cargo test --lib -- tools::tests # 12/12 passed
cargo test --lib -- config::schema::tests # 134/134 passed
- Evidence provided: unit test results (26 new tests), manual end-to-end test with Ollama + qwen2.5:72b
- If any command is intentionally skipped, explain why: Full cargo clippy --all-targets has 43 pre-existing errors
unrelated to this PR (e.g. await_holding_lock, format! appended to String). Zero errors from web_fetch code.
Security Impact (required)
- New permissions/capabilities? Yes — new web_fetch tool can make outbound HTTP GET requests
- New external network calls? Yes — fetches web pages from allowed domains
- Secrets/tokens handling changed? No
- File system access scope changed? No
- If any Yes, describe risk and mitigation:
- Deny-by-default: enabled = false by default; tool is not registered unless explicitly enabled
- Domain filtering: allowed_domains (default ["*"] = all public hosts) plus blocked_domains; the blocklist
always wins over the allowlist.
- SSRF protection: Blocks localhost, private IPs (RFC 1918), link-local, multicast, reserved ranges, IPv4-mapped
IPv6, .local TLD — identical coverage to http_request
- Rate limiting: can_act() + record_action() enforce autonomy level and rate limits
- Read-only mode: Blocked when autonomy is ReadOnly
- Response size cap: 500KB default truncation prevents context window exhaustion
- Proxy support: Honors [proxy] config via tool.web_fetch service key
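The two host checks above (blocklist-over-allowlist ordering and the literal-IP SSRF guard) can be sketched with std-only helpers. The function names are illustrative; only the behavior is taken from the mitigation list:

```rust
use std::net::IpAddr;

/// Blocklist takes priority; "*" in the allowlist means "all public hosts".
fn is_domain_allowed(host: &str, allowed: &[&str], blocked: &[&str]) -> bool {
    if blocked.iter().any(|d| *d == host) {
        return false; // blocklist always wins
    }
    allowed.iter().any(|d| *d == "*" || *d == host)
}

/// Pre-request SSRF guard: reject localhost, .local, and literal
/// private/link-local/multicast/unspecified IPs, including IPv4-mapped IPv6.
fn is_private_host(host: &str) -> bool {
    if host.eq_ignore_ascii_case("localhost") || host.ends_with(".local") {
        return true;
    }
    match host.parse::<IpAddr>() {
        Ok(IpAddr::V4(v4)) => {
            v4.is_loopback() || v4.is_private() || v4.is_link_local()
                || v4.is_multicast() || v4.is_unspecified()
        }
        Ok(IpAddr::V6(v6)) => {
            // IPv4-mapped IPv6 (::ffff:a.b.c.d): re-check as IPv4.
            if let Some(mapped) = v6.to_ipv4_mapped() {
                return mapped.is_loopback() || mapped.is_private()
                    || mapped.is_link_local();
            }
            v6.is_loopback() || v6.is_multicast() || v6.is_unspecified()
                || (v6.segments()[0] & 0xfe00) == 0xfc00 // unique-local fc00::/7
                || (v6.segments()[0] & 0xffc0) == 0xfe80 // link-local fe80::/10
        }
        Err(_) => false, // plain hostname: DNS-level checks handled elsewhere
    }
}

fn main() {
    assert!(!is_domain_allowed("evil.example", &["*"], &["evil.example"]));
    assert!(is_private_host("::ffff:10.0.0.1"));
    println!("checks ok");
}
```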
Privacy and Data Hygiene (required)
- Data-hygiene status: pass
- Redaction/anonymization notes: No personal data in code, tests, or fixtures
- Neutral wording confirmation: All test identifiers use neutral project-scoped labels
Compatibility / Migration
- Backward compatible? Yes — new tool, no existing behavior changed
- Config/env changes? Yes — new [web_fetch] section in config.toml (all fields have defaults)
- Migration needed? No — #[serde(default)] on all fields; existing configs without [web_fetch] section work unchanged
i18n Follow-Through (required when docs or user-facing wording changes)
- i18n follow-through triggered? No — no docs or user-facing wording changes
Human Verification (required)
- Verified scenarios:
- End-to-end test: zeroclaw agent with Ollama qwen2.5:72b successfully called web_fetch to fetch
https://github.com/zeroclaw-labs/zeroclaw, returned clean plain text with project description, features, star count
- Tool registration: tool_count increased from 22 to 23 when enabled = true
- Config: enabled = false (default) → tool not registered; enabled = true → tool available
- Edge cases checked:
- Missing [web_fetch] section in existing config.toml → works (serde defaults)
- Blocklist priority over allowlist
- SSRF with localhost, private IPs, IPv6
- What was not verified:
- Proxy routing (no proxy configured in test environment)
- Very large page truncation with real-world content
Side Effects / Blast Radius (required)
- Affected subsystems/workflows: all_tools_with_runtime() signature gained one parameter (web_fetch_config); all 5
call sites updated
- Potential unintended effects: None — new tool only, existing tools unchanged
- Guardrails/monitoring for early detection: enabled = false default; tool_count in debug logs
Agent Collaboration Notes (recommended)
- Agent tools used: Claude Code (Opus 4.6)
- Workflow/plan summary: Plan mode → approval → implementation → validation
- Verification focus: Security (SSRF, domain filtering, rate limiting), config compatibility, tool registration
- Confirmation: naming + architecture boundaries followed (CLAUDE.md + CONTRIBUTING.md): Yes — trait implementation +
factory registration pattern, independent security helpers (DRY rule-of-three), deny-by-default config
Rollback Plan (required)
- Fast rollback command/path: git revert <commit>
- Feature flags or config toggles: [web_fetch] enabled = false (default) disables completely
- Observable failure symptoms: tool_count in debug logs drops by 1; LLM cannot call web_fetch
Risks and Mitigations
- Risk: SSRF bypass via DNS rebinding (attacker-controlled domain resolving to private IP)
- Mitigation: Pre-request host validation blocks known private/local patterns. Same defense level as existing
http_request tool. Full DNS-level protection would require async DNS resolution before connect, which is out of scope
for this PR.
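For reference, the out-of-scope DNS-level check would look roughly like the blocking sketch below. This is illustrative only: a real fix would need async resolution and would also have to pin the connection to the validated IP, since re-resolving at connect time reopens the rebinding window:

```rust
use std::net::{IpAddr, ToSocketAddrs};

/// Resolve the host and check whether any returned address is private/local.
/// Blocking sketch; port 443 is only needed to satisfy ToSocketAddrs.
fn resolves_to_private(host: &str) -> std::io::Result<bool> {
    let mut addrs = (host, 443).to_socket_addrs()?;
    Ok(addrs.any(|a| match a.ip() {
        IpAddr::V4(v4) => v4.is_loopback() || v4.is_private() || v4.is_link_local(),
        IpAddr::V6(v6) => v6.is_loopback() || (v6.segments()[0] & 0xfe00) == 0xfc00,
    }))
}

fn main() {
    // localhost always resolves to a loopback address.
    println!("{:?}", resolves_to_private("localhost")); // Ok(true)
}
```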
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 04597352cc)