* fix(web-fetch): remove dead feature gates, add noise stripping, add docstrings
The nanohtml2text and fast_html2md providers were both guarded by
cfg(feature) checks for features (web-fetch-plaintext, web-fetch-html2md)
that are never declared in Cargo.toml. Because the gated code was silently
compiled out, every web_fetch call returned an error instead of fetching
content.
Changes:
- Add strip_noise_elements() which removes <script>, <style>, <nav>,
<header>, <footer>, <aside>, <noscript>, <form>, <button> blocks
before text extraction, eliminating menu/ad/boilerplate noise.
- Fix fast_html2md path: when web-fetch-html2md feature is not compiled
in, fall through to nanohtml2text rather than returning an error.
- Fix nanohtml2text path: remove dead cfg(feature = "web-fetch-plaintext")
gate; nanohtml2text is a direct dependency and needs no feature flag.
- Both previously gated tests (html_to_markdown_conversion_preserves_structure,
html_to_plaintext_conversion_removes_html_tags) are now always-on.
Added strip_noise_removes_nav_scripts_footer test.
- Add docstrings to all public/private methods to meet coverage threshold.
Tavily and firecrawl providers are unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
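The noise stripping described above can be sketched roughly as follows. This is a minimal, std-only illustration (naive substring matching, no nested-tag handling); the actual strip_noise_elements implementation may differ:

```rust
/// Illustrative sketch: remove whole element blocks for boilerplate tags
/// before text extraction. Naive: assumes non-nested, well-closed tags.
fn strip_noise_elements(html: &str) -> String {
    const NOISE_TAGS: [&str; 9] = [
        "script", "style", "nav", "header", "footer",
        "aside", "noscript", "form", "button",
    ];
    let mut out = html.to_string();
    for tag in NOISE_TAGS {
        let open = format!("<{tag}");
        let close = format!("</{tag}>");
        // Repeatedly cut the first <tag ...> ... </tag> block found.
        while let Some(start) = out.to_ascii_lowercase().find(&open) {
            match out.to_ascii_lowercase()[start..].find(&close) {
                Some(rel) => {
                    let end = start + rel + close.len();
                    out.replace_range(start..end, "");
                }
                None => break, // unclosed tag: leave the rest as-is
            }
        }
    }
    out
}

fn main() {
    let html = "<nav>menu</nav><p>Article body</p><script>track()</script>";
    println!("{}", strip_noise_elements(html)); // <p>Article body</p>
}
```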
* fix(web-fetch): align default provider to nanohtml2text, remove dead feature
- Change empty-provider default from deprecated 'fast_html2md' to
'nanohtml2text' to match WEB_FETCH_PROVIDER_HELP and PR description.
- Remove dead 'web-fetch-plaintext' feature from Cargo.toml (no code
references it after the feature-gate removal).
- Apply cargo fmt to strip_noise_elements array formatting.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: xj <gh-xj@users.noreply.github.com>
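The intended resolution order after these two commits can be sketched as below. Provider names come from the commit messages; the helper itself is illustrative, not the actual code:

```rust
/// Illustrative sketch of provider resolution: an empty provider defaults
/// to nanohtml2text, and the deprecated fast_html2md falls through to
/// nanohtml2text rather than returning an error.
fn resolve_provider(configured: &str) -> Result<&'static str, String> {
    match configured {
        "" | "nanohtml2text" | "fast_html2md" => Ok("nanohtml2text"),
        "tavily" => Ok("tavily"),
        "firecrawl" => Ok("firecrawl"),
        other => Err(format!("unknown web_fetch provider: {other}")),
    }
}

fn main() {
    println!("{:?}", resolve_provider(""));             // Ok("nanohtml2text")
    println!("{:?}", resolve_provider("fast_html2md")); // Ok("nanohtml2text")
}
```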
Ports remaining changes from feat/unify-web-fetch-providers that were
not yet integrated into dev:
- config/schema.rs: add `user_agent` field (default "ZeroClaw/1.0") to
HttpRequestConfig, WebFetchConfig, and WebSearchConfig, with a shared
default_user_agent() helper. Field is serde-default so existing configs
remain backward compatible.
- tools/http_request.rs: accept user_agent in constructor; pass it to
reqwest::Client via .user_agent() replacing the implicit default.
- tools/web_fetch.rs: accept user_agent in constructor; replace hardcoded
"ZeroClaw/0.1 (web_fetch)" in build_http_client with the configured value.
- tools/web_search_tool.rs: accept user_agent in constructor; replace
hardcoded Chrome UA string in search_duckduckgo and add .user_agent()
to the Brave and Firecrawl client builders.
- tools/mod.rs: wire user_agent from each config struct into the
corresponding tool constructor (HttpRequestTool, WebFetchTool,
WebSearchTool).
- onboard/wizard.rs: add setup_web_tools() as wizard Step 6 "Web &
Internet Tools" (total steps bumped from 9 to 10). Configures
WebSearchConfig, WebFetchConfig, and HttpRequestConfig interactively
with provider selection and optional API key/URL prompts. The http_request
and web_search outputs of Step 5's setup_tool_mode() are now discarded
(_, _), since Step 6 owns that configuration. Uses dev's generic
api_key/api_url schema fields unchanged.
Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit fb83da8db021903cf5844852bdb67b9b259941d7)
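The backward-compatible default can be sketched with a std-only stand-in. The field and helper names (user_agent, default_user_agent, WebFetchConfig) come from the notes above; the real structs additionally derive serde and mark the field #[serde(default = "default_user_agent")] so existing configs without it still deserialize:

```rust
/// Shared default used by HttpRequestConfig, WebFetchConfig, and
/// WebSearchConfig (per the port notes above).
fn default_user_agent() -> String {
    "ZeroClaw/1.0".to_string()
}

/// Trimmed-down stand-in for the real config struct; the actual type
/// derives serde, so a missing user_agent field falls back to the default.
#[derive(Debug)]
struct WebFetchConfig {
    user_agent: String,
}

impl Default for WebFetchConfig {
    fn default() -> Self {
        Self { user_agent: default_user_agent() }
    }
}

fn main() {
    let cfg = WebFetchConfig::default();
    println!("{}", cfg.user_agent); // ZeroClaw/1.0
}
```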
- Problem: The existing http_request tool returns raw HTML/JSON, which is nearly unusable for LLMs to extract
meaningful content from web pages.
- Why it matters: All mainstream AI agents (Claude Code, Gemini CLI, Aider) have dedicated web content extraction
tools. ZeroClaw lacks this capability, limiting its ability to research and gather information from the web.
- What changed: Added a new web_fetch tool that fetches web pages and converts HTML to clean plain text using
nanohtml2text. Includes domain allowlist/blocklist, SSRF protection, redirect following, and content-type aware
processing.
- What did not change (scope boundary): http_request tool is untouched. No shared code extracted between http_request
and web_fetch (DRY rule-of-three: only 2 callers). No changes to existing tool behavior or defaults.
Label Snapshot (required)
- Risk label: risk: medium
- Size label: size: M
- Scope labels: tool, config
- Module labels: tool: web_fetch
- If any auto-label is incorrect, note requested correction: N/A
Change Metadata
- Change type: feature
- Primary scope: tool
Linked Issue
- Closes #
- Related #
- Depends on #
- Supersedes #
Supersede Attribution (required when Supersedes # is used)
N/A
Validation Evidence (required)
cargo fmt --all -- --check # pass
cargo clippy --all-targets -- -D warnings # no new warnings (pre-existing warnings only)
cargo test --lib -- web_fetch # 26/26 passed
cargo test --lib -- tools::tests # 12/12 passed
cargo test --lib -- config::schema::tests # 134/134 passed
- Evidence provided: unit test results (26 new tests), manual end-to-end test with Ollama + qwen2.5:72b
- If any command is intentionally skipped, explain why: Full cargo clippy --all-targets has 43 pre-existing errors
unrelated to this PR (e.g. await_holding_lock, format! appended to String). Zero errors from web_fetch code.
Security Impact (required)
- New permissions/capabilities? Yes — new web_fetch tool can make outbound HTTP GET requests
- New external network calls? Yes — fetches web pages from allowed domains
- Secrets/tokens handling changed? No
- File system access scope changed? No
- If any Yes, describe risk and mitigation:
- Deny-by-default: enabled = false by default; tool is not registered unless explicitly enabled
- Domain filtering: allowed_domains (default ["*"] = all public hosts) plus blocked_domains; the blocklist
always wins over the allowlist.
- SSRF protection: Blocks localhost, private IPs (RFC 1918), link-local, multicast, reserved ranges, IPv4-mapped
IPv6, .local TLD — identical coverage to http_request
- Rate limiting: can_act() + record_action() enforce autonomy level and rate limits
- Read-only mode: Blocked when autonomy is ReadOnly
- Response size cap: 500KB default truncation prevents context window exhaustion
- Proxy support: Honors [proxy] config via tool.web_fetch service key
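The two host checks above (blocklist-over-allowlist ordering and the literal-IP SSRF guard) can be sketched with std-only helpers. The function names are illustrative; only the behavior is taken from the mitigation list:

```rust
use std::net::IpAddr;

/// Blocklist takes priority; "*" in the allowlist means "all public hosts".
fn is_domain_allowed(host: &str, allowed: &[&str], blocked: &[&str]) -> bool {
    if blocked.iter().any(|d| *d == host) {
        return false; // blocklist always wins
    }
    allowed.iter().any(|d| *d == "*" || *d == host)
}

/// Pre-request SSRF guard: reject localhost, .local, and literal
/// private/link-local/multicast/unspecified IPs, including IPv4-mapped IPv6.
fn is_private_host(host: &str) -> bool {
    if host.eq_ignore_ascii_case("localhost") || host.ends_with(".local") {
        return true;
    }
    match host.parse::<IpAddr>() {
        Ok(IpAddr::V4(v4)) => {
            v4.is_loopback() || v4.is_private() || v4.is_link_local()
                || v4.is_multicast() || v4.is_unspecified()
        }
        Ok(IpAddr::V6(v6)) => {
            // IPv4-mapped IPv6 (::ffff:a.b.c.d): re-check as IPv4.
            if let Some(mapped) = v6.to_ipv4_mapped() {
                return mapped.is_loopback() || mapped.is_private()
                    || mapped.is_link_local();
            }
            v6.is_loopback() || v6.is_multicast() || v6.is_unspecified()
                || (v6.segments()[0] & 0xfe00) == 0xfc00 // unique-local fc00::/7
                || (v6.segments()[0] & 0xffc0) == 0xfe80 // link-local fe80::/10
        }
        Err(_) => false, // plain hostname: DNS-level checks handled elsewhere
    }
}

fn main() {
    assert!(!is_domain_allowed("evil.example", &["*"], &["evil.example"]));
    assert!(is_private_host("::ffff:10.0.0.1"));
    println!("checks ok");
}
```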
Privacy and Data Hygiene (required)
- Data-hygiene status: pass
- Redaction/anonymization notes: No personal data in code, tests, or fixtures
- Neutral wording confirmation: All test identifiers use neutral project-scoped labels
Compatibility / Migration
- Backward compatible? Yes — new tool, no existing behavior changed
- Config/env changes? Yes — new [web_fetch] section in config.toml (all fields have defaults)
- Migration needed? No — #[serde(default)] on all fields; existing configs without [web_fetch] section work unchanged
i18n Follow-Through (required when docs or user-facing wording changes)
- i18n follow-through triggered? No — no docs or user-facing wording changes
Human Verification (required)
- Verified scenarios:
- End-to-end test: zeroclaw agent with Ollama qwen2.5:72b successfully called web_fetch to fetch
https://github.com/zeroclaw-labs/zeroclaw, returned clean plain text with project description, features, star count
- Tool registration: tool_count increased from 22 to 23 when enabled = true
- Config: enabled = false (default) → tool not registered; enabled = true → tool available
- Edge cases checked:
- Missing [web_fetch] section in existing config.toml → works (serde defaults)
- Blocklist priority over allowlist
- SSRF with localhost, private IPs, IPv6
- What was not verified:
- Proxy routing (no proxy configured in test environment)
- Very large page truncation with real-world content
Side Effects / Blast Radius (required)
- Affected subsystems/workflows: all_tools_with_runtime() signature gained one parameter (web_fetch_config); all 5
call sites updated
- Potential unintended effects: None — new tool only, existing tools unchanged
- Guardrails/monitoring for early detection: enabled = false default; tool_count in debug logs
Agent Collaboration Notes (recommended)
- Agent tools used: Claude Code (Opus 4.6)
- Workflow/plan summary: Plan mode → approval → implementation → validation
- Verification focus: Security (SSRF, domain filtering, rate limiting), config compatibility, tool registration
- Confirmation: naming + architecture boundaries followed (CLAUDE.md + CONTRIBUTING.md): Yes — trait implementation +
factory registration pattern, independent security helpers (DRY rule-of-three), deny-by-default config
Rollback Plan (required)
- Fast rollback command/path: git revert <commit>
- Feature flags or config toggles: [web_fetch] enabled = false (default) disables completely
- Observable failure symptoms: tool_count in debug logs drops by 1; LLM cannot call web_fetch
Risks and Mitigations
- Risk: SSRF bypass via DNS rebinding (attacker-controlled domain resolving to private IP)
- Mitigation: Pre-request host validation blocks known private/local patterns. Same defense level as existing
http_request tool. Full DNS-level protection would require async DNS resolution before connect, which is out of scope
for this PR.
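For reference, the out-of-scope DNS-level check would look roughly like the blocking sketch below. This is illustrative only: a real fix would need async resolution and would also have to pin the connection to the validated IP, since re-resolving at connect time reopens the rebinding window:

```rust
use std::net::{IpAddr, ToSocketAddrs};

/// Resolve the host and check whether any returned address is private/local.
/// Blocking sketch; port 443 is only needed to satisfy ToSocketAddrs.
fn resolves_to_private(host: &str) -> std::io::Result<bool> {
    let mut addrs = (host, 443).to_socket_addrs()?;
    Ok(addrs.any(|a| match a.ip() {
        IpAddr::V4(v4) => v4.is_loopback() || v4.is_private() || v4.is_link_local(),
        IpAddr::V6(v6) => v6.is_loopback() || (v6.segments()[0] & 0xfe00) == 0xfc00,
    }))
}

fn main() {
    // localhost always resolves to a loopback address.
    println!("{:?}", resolves_to_private("localhost")); // Ok(true)
}
```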
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 04597352cc)