Commit Graph

10 Commits

Author SHA1 Message Date
maxtongwang
e37a53c690
fix(web-fetch): remove dead feature gates and add noise stripping (#2262)
* fix(web-fetch): remove dead feature gates, add noise stripping, add docstrings

The nanohtml2text and fast_html2md providers were both guarded by
cfg(feature) checks for features (web-fetch-plaintext, web-fetch-html2md)
that are never declared in Cargo.toml. This caused every web_fetch call
to silently return an error instead of fetching content.

Changes:
- Add strip_noise_elements() which removes <script>, <style>, <nav>,
  <header>, <footer>, <aside>, <noscript>, <form>, <button> blocks
  before text extraction, eliminating menu/ad/boilerplate noise.
- Fix fast_html2md path: when web-fetch-html2md feature is not compiled
  in, fall through to nanohtml2text rather than returning an error.
- Fix nanohtml2text path: remove dead cfg(feature = "web-fetch-plaintext")
  gate; nanohtml2text is a direct dependency and needs no feature flag.
- Both previously gated tests (html_to_markdown_conversion_preserves_structure,
  html_to_plaintext_conversion_removes_html_tags) are now always-on.
  Added strip_noise_removes_nav_scripts_footer test.
- Add docstrings to all public/private methods to meet coverage threshold.
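
For illustration only, a minimal sketch of what a noise-stripping pass like strip_noise_elements() might look like. The tag list comes from the message above; the scanning strategy, the NOISE_TAGS constant, and the ends_tag_name helper are assumptions and not the code from this commit.

```rust
/// Tags whose whole blocks are dropped before text extraction (list from the commit message).
const NOISE_TAGS: [&str; 9] = [
    "script", "style", "nav", "header", "footer", "aside", "noscript", "form", "button",
];

/// Hypothetical helper: true when `rest` (the text right after `<tag`) ends the tag name,
/// so `<form>` matches but `<format>` does not.
fn ends_tag_name(rest: &str) -> bool {
    matches!(
        rest.bytes().next(),
        Some(b'>') | Some(b' ') | Some(b'/') | Some(b'\t') | Some(b'\n') | Some(b'\r')
    )
}

/// Removes every `<tag ...>...</tag>` block for the tags in NOISE_TAGS.
/// Assumption: plain string scanning, no HTML parser, no handling of nested same-name tags.
fn strip_noise_elements(html: &str) -> String {
    let mut out = html.to_string();
    for tag in NOISE_TAGS {
        let open = format!("<{}", tag);
        let close = format!("</{}>", tag);
        let mut search_from = 0;
        loop {
            // ASCII lowercasing keeps byte offsets identical to `out`.
            let lower = out.to_ascii_lowercase();
            let start = match lower[search_from..].find(&open) {
                Some(rel) => search_from + rel,
                None => break,
            };
            let after_name = start + open.len();
            if !ends_tag_name(&lower[after_name..]) {
                // A longer tag name (e.g. <format>); skip past it and keep looking.
                search_from = after_name;
                continue;
            }
            // An unclosed block swallows the rest of the document.
            let end = match lower[after_name..].find(&close) {
                Some(rel_end) => after_name + rel_end + close.len(),
                None => out.len(),
            };
            out.replace_range(start..end, "");
            search_from = start;
        }
    }
    out
}
```

Under these assumptions, the output would then be handed to the text extractor (nanohtml2text in this commit), so menus and scripts never reach the converted text.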

Tavily and firecrawl providers are unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(web-fetch): align default provider to nanohtml2text, remove dead feature

- Change empty-provider default from deprecated 'fast_html2md' to
  'nanohtml2text' to match WEB_FETCH_PROVIDER_HELP and PR description.
- Remove dead 'web-fetch-plaintext' feature from Cargo.toml (no code
  references it after the feature-gate removal).
- Apply cargo fmt to strip_noise_elements array formatting.
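
The provider fallback described in these two commits could be sketched roughly as follows. The resolve_provider function and its html2md_compiled parameter are hypothetical stand-ins (the real code checks a Cargo feature at compile time); only the provider names and the fallback behaviour are taken from the messages above.

```rust
/// Default used when no provider is configured (per the commit message).
const DEFAULT_PROVIDER: &str = "nanohtml2text";

/// Resolves the configured provider name to the one that would actually run.
/// `html2md_compiled` is a hypothetical stand-in for the `web-fetch-html2md` feature check.
fn resolve_provider(configured: &str, html2md_compiled: bool) -> &str {
    match configured.trim() {
        // Empty provider: use the default instead of the deprecated fast_html2md.
        "" => DEFAULT_PROVIDER,
        // Feature not compiled in: fall through to nanohtml2text rather than erroring.
        "fast_html2md" if !html2md_compiled => DEFAULT_PROVIDER,
        // Anything else (for example firecrawl) is passed through unchanged.
        other => other,
    }
}
```

The point of the sketch is that no configuration leads to an error purely because a feature flag is absent; the caller always gets a usable provider name.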

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: xj <gh-xj@users.noreply.github.com>
2026-02-28 12:19:40 -08:00
argenis de la rosa
7307aab103 feat(tools): add Tavily provider and API-key round-robin 2026-02-27 06:37:57 -05:00
argenis de la rosa
d63a6a8ceb feat(security): unify URL validation with configurable CIDR/domain allowlist 2026-02-26 22:07:07 -05:00
argenis de la rosa
b27b44829a chore: promote dev snapshot to main (resolve #1978/#1970) 2026-02-26 21:09:33 -05:00
Chum Yin
9b0e70b2f2
supersede: file-replay changes from #1895 (#1926)
Automated conflict recovery via changed-file replay on latest main.
2026-02-26 04:15:47 -05:00
Ricardo Magaña
da62bd172f feat(tools): add user_agent config and setup_web_tools wizard step
Ports remaining changes from feat/unify-web-fetch-providers that were
not yet integrated into dev:

- config/schema.rs: add `user_agent` field (default "ZeroClaw/1.0") to
  HttpRequestConfig, WebFetchConfig, and WebSearchConfig, with a shared
  default_user_agent() helper. Field is serde-default so existing configs
  remain backward compatible.

- tools/http_request.rs: accept user_agent in constructor; pass it to
  reqwest::Client via .user_agent() replacing the implicit default.

- tools/web_fetch.rs: accept user_agent in constructor; replace hardcoded
  "ZeroClaw/0.1 (web_fetch)" in build_http_client with the configured value.

- tools/web_search_tool.rs: accept user_agent in constructor; replace
  hardcoded Chrome UA string in search_duckduckgo and add .user_agent()
  to the Brave and Firecrawl client builders.

- tools/mod.rs: wire user_agent from each config struct into the
  corresponding tool constructor (HttpRequestTool, WebFetchTool,
  WebSearchTool).

- onboard/wizard.rs: add setup_web_tools() as wizard Step 6 "Web &
  Internet Tools" (total steps bumped from 9 to 10). Configures
  WebSearchConfig, WebFetchConfig, and HttpRequestConfig interactively
  with provider selection and optional API key/URL prompts. Step 5
  setup_tool_mode() http_request and web_search outputs are now discarded
  (_, _) since step 6 owns that configuration. Uses dev's generic
  api_key/api_url schema fields unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit fb83da8db021903cf5844852bdb67b9b259941d7)
2026-02-25 23:43:42 +08:00
Chummy
83ef0a3cf6 fix(tools): address codeql api key handling alerts 2026-02-25 03:30:45 +08:00
Chummy
ffe340f849 fix(tools): satisfy strict delta lint for firecrawl dispatch 2026-02-25 03:30:45 +08:00
Chummy
b4df1dc30d feat(tools): add web_fetch provider dispatch and shared URL validation 2026-02-25 03:30:45 +08:00
reidliu41
d3f0a79fe9 Summary
- Problem: The existing http_request tool returns raw HTML/JSON, from which LLMs can barely extract
  meaningful web page content.
- Why it matters: All mainstream AI agents (Claude Code, Gemini CLI, Aider) have dedicated web content extraction
  tools. ZeroClaw lacks this capability, limiting its ability to research and gather information from the web.
- What changed: Added a new web_fetch tool that fetches web pages and converts HTML to clean plain text using
  nanohtml2text. Includes domain allowlist/blocklist, SSRF protection, redirect following, and content-type aware
  processing.
- What did not change (scope boundary): http_request tool is untouched. No shared code extracted between http_request
   and web_fetch (DRY rule-of-three: only 2 callers). No changes to existing tool behavior or defaults.

Label Snapshot (required)

  - Risk label: risk: medium
  - Size label: size: M
  - Scope labels: tool, config
  - Module labels: tool: web_fetch
  - If any auto-label is incorrect, note requested correction: N/A

  Change Metadata

  - Change type: feature
  - Primary scope: tool

  Linked Issue

  - Closes #
  - Related #
  - Depends on #
  - Supersedes #

  Supersede Attribution (required when Supersedes # is used)

  N/A

  Validation Evidence (required)

  cargo fmt --all -- --check   # pass
  cargo clippy --all-targets -- -D warnings  # no new warnings (pre-existing warnings only)
  cargo test --lib -- web_fetch  # 26/26 passed
  cargo test --lib -- tools::tests  # 12/12 passed
  cargo test --lib -- config::schema::tests  # 134/134 passed

  - Evidence provided: unit test results (26 new tests), manual end-to-end test with Ollama + qwen2.5:72b
  - If any command is intentionally skipped, explain why: Full cargo clippy --all-targets has 43 pre-existing errors
  unrelated to this PR (e.g. await_holding_lock, format! appended to String). Zero errors from web_fetch code.

  Security Impact (required)

  - New permissions/capabilities? Yes — new web_fetch tool can make outbound HTTP GET requests
  - New external network calls? Yes — fetches web pages from allowed domains
  - Secrets/tokens handling changed? No
  - File system access scope changed? No
  - If any Yes, describe risk and mitigation:
    - Deny-by-default: enabled = false by default; tool is not registered unless explicitly enabled
    - Domain filtering: allowed_domains (default ["*"] = all public hosts) + blocked_domains.
  The blocklist always takes priority over the allowlist.
    - SSRF protection: Blocks localhost, private IPs (RFC 1918), link-local, multicast, reserved ranges, IPv4-mapped
  IPv6, .local TLD — identical coverage to http_request
    - Rate limiting: can_act() + record_action() enforce autonomy level and rate limits
    - Read-only mode: Blocked when autonomy is ReadOnly
    - Response size cap: 500KB default truncation prevents context window exhaustion
    - Proxy support: Honors [proxy] config via tool.web_fetch service key
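
As a rough illustration of the "blocklist always wins" rule in the mitigations above, a sketch under stated assumptions: host_permitted and domain_matches are hypothetical names, and subdomain matching is assumed here rather than confirmed by the PR description.

```rust
/// Hypothetical helper: true when `host` equals `pattern` or is a subdomain of it.
fn domain_matches(host: &str, pattern: &str) -> bool {
    host == pattern || host.ends_with(&format!(".{}", pattern))
}

/// Checks the blocklist first, so a blocked host is rejected even when
/// allowed_domains is ["*"]. Only then is the allowlist consulted.
fn host_permitted(host: &str, allowed: &[&str], blocked: &[&str]) -> bool {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    if blocked
        .iter()
        .any(|p| domain_matches(&host, &p.to_ascii_lowercase()))
    {
        return false;
    }
    allowed
        .iter()
        .any(|p| *p == "*" || domain_matches(&host, &p.to_ascii_lowercase()))
}
```

Evaluating the blocklist before the allowlist means a host listed in both is still rejected, which matches the priority stated in the Security Impact section.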

  Privacy and Data Hygiene (required)

  - Data-hygiene status: pass
  - Redaction/anonymization notes: No personal data in code, tests, or fixtures
  - Neutral wording confirmation: All test identifiers use neutral project-scoped labels

  Compatibility / Migration

  - Backward compatible? Yes — new tool, no existing behavior changed
  - Config/env changes? Yes — new [web_fetch] section in config.toml (all fields have defaults)
  - Migration needed? No — #[serde(default)] on all fields; existing configs without [web_fetch] section work unchanged

  i18n Follow-Through (required when docs or user-facing wording changes)

  - i18n follow-through triggered? No — no docs or user-facing wording changes

  Human Verification (required)

  - Verified scenarios:
    - End-to-end test: zeroclaw agent with Ollama qwen2.5:72b successfully called web_fetch to fetch
  https://github.com/zeroclaw-labs/zeroclaw, returned clean plain text with project description, features, star count
    - Tool registration: tool_count increased from 22 to 23 when enabled = true
    - Config: enabled = false (default) → tool not registered; enabled = true → tool available
  - Edge cases checked:
    - Missing [web_fetch] section in existing config.toml → works (serde defaults)
    - Blocklist priority over allowlist
    - SSRF with localhost, private IPs, IPv6
  - What was not verified:
    - Proxy routing (no proxy configured in test environment)
    - Very large page truncation with real-world content
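
The truncation case listed as unverified could be exercised with something like the sketch below. truncate_response is a hypothetical name, and the exact byte value behind the "500KB" default is not stated in the PR, so the limit is left as a parameter.

```rust
/// Truncates `text` to at most `max_bytes` bytes without splitting a UTF-8 character.
/// Assumption: the cap is measured in bytes of the extracted text.
fn truncate_response(text: &str, max_bytes: usize) -> &str {
    if text.len() <= max_bytes {
        return text;
    }
    // Walk back to the nearest character boundary; index 0 is always a boundary.
    let mut end = max_bytes;
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}
```

Cutting on a character boundary matters for real-world pages with non-ASCII content, where slicing at an arbitrary byte offset would panic.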

  Side Effects / Blast Radius (required)

  - Affected subsystems/workflows: all_tools_with_runtime() signature gained one parameter (web_fetch_config); all 5
  call sites updated
  - Potential unintended effects: None — new tool only, existing tools unchanged
  - Guardrails/monitoring for early detection: enabled = false default; tool_count in debug logs

  Agent Collaboration Notes (recommended)

  - Agent tools used: Claude Code (Opus 4.6)
  - Workflow/plan summary: Plan mode → approval → implementation → validation
  - Verification focus: Security (SSRF, domain filtering, rate limiting), config compatibility, tool registration
  - Confirmation: naming + architecture boundaries followed (CLAUDE.md + CONTRIBUTING.md): Yes — trait implementation +
   factory registration pattern, independent security helpers (DRY rule-of-three), deny-by-default config

  Rollback Plan (required)

  - Fast rollback command/path: git revert <commit>
  - Feature flags or config toggles: [web_fetch] enabled = false (default) disables completely
  - Observable failure symptoms: tool_count in debug logs drops by 1; LLM cannot call web_fetch

  Risks and Mitigations

  - Risk: SSRF bypass via DNS rebinding (attacker-controlled domain resolving to private IP)
    - Mitigation: Pre-request host validation blocks known private/local patterns. Same defense level as existing
  http_request tool. Full DNS-level protection would require async DNS resolution before connect, which is out of scope
   for this PR.
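
A minimal sketch of the kind of pre-request host validation this mitigation refers to, using only the standard library. The function names are hypothetical and the covered ranges are an approximation of the list in the Security Impact section. Like the mitigation it illustrates, it inspects the host string only and does nothing about DNS rebinding.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical check for IPv4 ranges that should never be fetched.
fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()
        || ip.is_private() // RFC 1918
        || ip.is_link_local()
        || ip.is_multicast()
        || ip.is_broadcast()
        || ip.is_unspecified()
        || ip.is_documentation()
}

/// Hypothetical check for IPv6, including IPv4-mapped addresses.
fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    // ::ffff:a.b.c.d is judged by its embedded IPv4 address.
    if let Some(v4) = ip.to_ipv4_mapped() {
        return is_blocked_v4(v4);
    }
    let first = ip.segments()[0];
    ip.is_loopback()
        || ip.is_unspecified()
        || ip.is_multicast()
        || (first & 0xffc0) == 0xfe80 // link-local fe80::/10
        || (first & 0xfe00) == 0xfc00 // unique local fc00::/7
}

/// Rejects localhost, .local names, and literal IPs in private or reserved ranges.
fn is_blocked_host(host: &str) -> bool {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    // URL hosts may wrap IPv6 literals in brackets.
    let bare = host.trim_start_matches('[').trim_end_matches(']');
    if let Ok(ip) = bare.parse::<IpAddr>() {
        return match ip {
            IpAddr::V4(v4) => is_blocked_v4(v4),
            IpAddr::V6(v6) => is_blocked_v6(v6),
        };
    }
    bare == "localhost" || bare.ends_with(".localhost") || bare.ends_with(".local")
}
```

A public hostname that later resolves to a private address passes this check, which is exactly the residual risk the PR calls out.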

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 04597352cc)
2026-02-23 20:30:21 +08:00