CortexScout is the Deep Research & Web Extraction module within the Cortex-Works ecosystem. It is designed for agent workloads that require token-efficient web retrieval, reliable anti-bot handling, and optional Human-in-the-Loop (HITL) fallback.
CortexScout provides a single, self-hostable Rust binary that exposes search and extraction capabilities over MCP (stdio) and an optional HTTP server. Output formats are structured and optimized for downstream LLM use.
It is built to handle the practical failure modes of web retrieval (rate limits, bot challenges, JavaScript-heavy pages) through progressive fallbacks: native retrieval → Chromium CDP rendering → HITL workflows.
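The escalation order above can be sketched as a simple tiered loop. The function and tier names below are illustrative stubs, not CortexScout's internals; they only show the shape of "try the cheap path first, escalate on a block":

```python
# Sketch of the progressive fallback chain: each tier is tried in order,
# escalating only when the previous tier reports a block or challenge.
class Blocked(Exception):
    """Raised by a tier when the target refuses or challenges the request."""

def fetch_with_fallbacks(url, tiers):
    """Try each (name, fetcher) tier in order; return the first success."""
    errors = []
    for name, tier in tiers:
        try:
            return name, tier(url)
        except Blocked as e:
            errors.append((name, str(e)))
    raise RuntimeError(f"all tiers blocked for {url}: {errors}")

# Stub tiers standing in for native HTTP, Chromium CDP, and HITL:
def native(url):
    raise Blocked("403 bot challenge")

def cdp_render(url):
    return "<html>rendered content</html>"

def hitl(url):
    return "<html>human-completed content</html>"

tier_used, body = fetch_with_fallbacks(
    "https://example.com",
    [("native", native), ("cdp", cdp_render), ("hitl", hitl)],
)
print(tier_used)  # → cdp
```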
| Area | MCP Tools / Capabilities |
|---|---|
| Search | web_search, web_search_json (parallel meta-search + dedup/scoring) |
| Fetch | web_fetch, web_fetch_batch (token-efficient clean output, optional semantic filtering) |
| Crawl | web_crawl (bounded discovery for doc sites / sub-pages) |
| Extraction | extract_fields, fetch_then_extract (schema-driven extraction) |
| Anti-bot handling | CDP rendering, proxy rotation, block-aware retries |
| HITL | visual_scout (screenshot for gate confirmation), human_auth_session (authenticated fetch with persisted sessions), non_robot_search (last resort rendering) |
| Memory | memory_search (LanceDB-backed research history) |
| Deep research | deep_research (multi-hop search + scrape + synthesis via OpenAI-compatible APIs) |
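The exact scoring behind `web_search_json`'s merge step is internal to CortexScout; as a rough illustration, cross-engine dedup by URL with a reciprocal-rank score (an assumption, not the actual formula) looks like this:

```python
def merge_results(per_engine_results):
    """Merge ranked result lists from several engines, dedup by URL,
    and score each URL by summing reciprocal ranks across engines."""
    scores, seen = {}, {}
    for engine, results in per_engine_results.items():
        for rank, item in enumerate(results, start=1):
            url = item["url"]
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
            seen.setdefault(url, item)  # keep the first snippet seen per URL
    return sorted(seen.values(), key=lambda it: scores[it["url"]], reverse=True)

results = merge_results({
    "google": [{"url": "https://a.example", "title": "A"},
               {"url": "https://b.example", "title": "B"}],
    "bing":   [{"url": "https://b.example", "title": "B"},
               {"url": "https://c.example", "title": "C"}],
})
print([r["url"] for r in results])
# → ['https://b.example', 'https://a.example', 'https://c.example']
```

URLs confirmed by more than one engine (here `b.example`) float to the top, which is the intuition behind merge/dedup scoring.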
While CortexScout runs as a standalone tool today, it is designed to integrate with CortexDB and CortexStudio for multi-agent scaling, shared retrieval artifacts, and centralized governance.
This repository includes captured evidence artifacts that validate extraction and HITL flows against representative protected targets.
| Target | Protection | Evidence | Notes |
|---|---|---|---|
| — | Cloudflare + Auth | JSON · Snippet | Auth-gated listings extraction |
| Ticketmaster | Cloudflare Turnstile | JSON · Snippet | Challenge-handled extraction |
| Airbnb | DataDome | JSON · Snippet | Large result sets under bot controls |
| Upwork | reCAPTCHA | JSON · Snippet | Protected listings retrieval |
| Amazon | AWS Shield | JSON · Snippet | Search result extraction |
| nowsecure.nl | Cloudflare | JSON | Manual return path validated |
See proof/README.md for methodology and raw outputs.
Download the latest release assets from GitHub Releases and run one of:
- `cortex-scout-mcp` — MCP stdio server (recommended for VS Code / Cursor / Claude Desktop)
- `cortex-scout` — optional HTTP server (default port `5000`; override via `--port`, `PORT`, or `CORTEX_SCOUT_PORT`)
Health check (HTTP server):

```sh
./cortex-scout --port 5000
curl http://localhost:5000/health
```

Basic build (search, scrape, deep research, memory):

```sh
git clone https://github.com/cortex-works/cortex-scout.git
cd cortex-scout/mcp-server
cargo build --release
```

Full build (includes `hitl_web_fetch` / visible-browser HITL):

```sh
cargo build --release --all-features
```

Add a server entry to your MCP config.
VS Code (`mcp.json` — global, or `settings.json` under `mcp.servers`):
**Important:** Always use `RUST_LOG=warn`, not `info`. At `info` level, the server emits hundreds of log lines per request to stderr, which can confuse MCP clients that monitor stderr.

**Windows:** There is no `env` command on Windows. Use the `command` + `env` object format instead — see docs/IDE_SETUP.md.
With deep research (LLM synthesis via OpenRouter / any OpenAI-compatible API):
```json
{
  "servers": {
    "cortex-scout": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=warn",
        "SEARCH_ENGINES=google,bing,duckduckgo,brave",
        "LANCEDB_URI=/YOUR_PATH/cortex-scout/lancedb",
        "HTTP_TIMEOUT_SECS=30",
        "MAX_CONTENT_CHARS=10000",
        "IP_LIST_PATH=/YOUR_PATH/cortex-scout/ip.txt",
        "PROXY_SOURCE_PATH=/YOUR_PATH/cortex-scout/proxy_source.json",
        "OPENAI_BASE_URL=https://openrouter.ai/api/v1",
        "OPENAI_API_KEY=sk-or-v1-...",
        "DEEP_RESEARCH_LLM_MODEL=moonshotai/kimi-k2.5",
        "DEEP_RESEARCH_ENABLED=1",
        "DEEP_RESEARCH_SYNTHESIS=1",
        "DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS=4096",
        "--",
        "/YOUR_PATH/cortex-scout/mcp-server/target/release/cortex-scout-mcp"
      ]
    }
  }
}
```

Multi-IDE guide: docs/IDE_SETUP.md
Create `cortex-scout.json` in the same directory as the binary (or the repository root). All fields are optional; environment variables act as the fallback.
```json
{
  "deep_research": {
    "enabled": true,
    "llm_base_url": "http://localhost:1234/v1",
    "llm_api_key": "",
    "llm_model": "lfm2-2.6b",
    "synthesis_enabled": true,
    "synthesis_max_sources": 3,
    "synthesis_max_chars_per_source": 800,
    "synthesis_max_tokens": 1024
  }
}
```

| Variable | Default | Description |
|---|---|---|
| `RUST_LOG` | `warn` | Log level. Keep `warn` for MCP stdio — `info` floods stderr and confuses MCP clients |
| `HTTP_TIMEOUT_SECS` | `30` | Per-request read timeout (seconds) |
| `HTTP_CONNECT_TIMEOUT_SECS` | `10` | TCP connect timeout (seconds) |
| `OUTBOUND_LIMIT` | `32` | Max concurrent outbound HTTP connections |
| `MAX_CONTENT_CHARS` | `10000` | Max characters returned per scraped page |
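The file-over-environment precedence noted above ("environment variables act as fallback") can be sketched as a small resolver. The helper name and the single-key lookup are illustrative, not CortexScout's actual config code:

```python
import json
import os

def resolve(config, key, env_var, default=None):
    """Config-file value wins; the environment variable is the fallback;
    then a built-in default."""
    if key in config:
        return config[key]
    if env_var in os.environ:
        return os.environ[env_var]
    return default

cfg = json.loads('{"deep_research": {"synthesis_max_tokens": 1024}}')["deep_research"]
os.environ["DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS"] = "4096"

# The file value shadows the environment variable:
print(resolve(cfg, "synthesis_max_tokens", "DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS"))  # → 1024
```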
| Variable | Default | Description |
|---|---|---|
| `CHROME_EXECUTABLE` | auto-detected | Override path to Chromium/Chrome/Brave binary |
| `SEARCH_CDP_FALLBACK` | `true` | Retry search-engine fetches via native Chromium CDP when blocked |
| `SEARCH_TIER2_NON_ROBOT` | unset | Set `1` to allow `hitl_web_fetch` as last-resort search escalation |
| `MAX_LINKS` | `100` | Max links followed per page crawl |
| Variable | Default | Description |
|---|---|---|
| `SEARCH_ENGINES` | `google,bing,duckduckgo,brave` | Active engines (comma-separated) |
| `SEARCH_MAX_RESULTS_PER_ENGINE` | `10` | Results per engine before merge/dedup |
| Variable | Default | Description |
|---|---|---|
| `IP_LIST_PATH` | — | Path to `ip.txt` (one proxy per line: `http://`, `socks5://`) |
| `PROXY_SOURCE_PATH` | — | Path to `proxy_source.json` (used by `proxy_control` grab) |
| Variable | Default | Description |
|---|---|---|
| `LANCEDB_URI` | — | Directory path for persistent research memory. Omit to disable |
| `CORTEX_SCOUT_MEMORY_DISABLED` | `0` | Set `1` to disable memory even when `LANCEDB_URI` is set |
| `MODEL2VEC_MODEL` | built-in | HuggingFace model ID or local path for embeddings (e.g. `minishlab/potion-base-8M`) |
| Variable | Default | Description |
|---|---|---|
| `DEEP_RESEARCH_ENABLED` | `1` | Set `0` to disable the `deep_research` tool at runtime |
| `OPENAI_API_KEY` | — | API key for LLM synthesis. Omit for key-less local endpoints (Ollama) |
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | OpenAI-compatible endpoint (OpenRouter, Ollama, LM Studio, etc.) |
| `DEEP_RESEARCH_LLM_MODEL` | `gpt-4o-mini` | Model identifier (must be supported by the endpoint) |
| `DEEP_RESEARCH_SYNTHESIS` | `1` | Set `0` to skip LLM synthesis (search + scrape only) |
| `DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS` | `1024` | Max tokens for the synthesis response. Use `4096`+ for large-context models |
| `DEEP_RESEARCH_SYNTHESIS_MAX_SOURCES` | `8` | Max source documents fed to LLM synthesis |
| `DEEP_RESEARCH_SYNTHESIS_MAX_CHARS_PER_SOURCE` | `2500` | Max characters extracted per source for synthesis |
| Variable | Default | Description |
|---|---|---|
| `CORTEX_SCOUT_PORT` / `PORT` | `5000` | Listening port for the HTTP server binary (`cortex-scout`) |
Recommended operational flow:
- Call `memory_search` before any new research run — skip live fetching if similarity ≥ 0.60 and `skip_live_fetch` is `true`.
- For initial topic discovery use `web_search_json` (returns structured snippets, lower token cost than a full scrape).
- For known URLs use `web_fetch` with `output_format="clean_json"`; set `query` + `strict_relevance=true` to truncate irrelevant content.
- On 403/429: call `proxy_control` with `action:"grab"` to refresh the proxy list, then retry with `use_proxy:true`.
- For auth-gated pages: `visual_scout` to confirm the gate type → `human_auth_session` to complete login (cookies persisted under `~/.cortex-scout/sessions/`).
- For deep research: `deep_research` handles multi-hop search + scrape + LLM synthesis automatically. Tune `depth` (1–3) and `max_sources` per run cost budget.
- For CAPTCHA or heavy-JS pages where all other paths fail: `hitl_web_fetch` opens a visible Brave/Chrome window for human completion (requires the `--all-features` build and a local desktop session).
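The memory-first gate in the first step can be sketched as a predicate over `memory_search` hits. The field names below are assumed from the flow description, not a documented response schema:

```python
def should_skip_live_fetch(memory_hits, threshold=0.60):
    """Return True when any stored result is similar enough to reuse,
    per the memory_search-first flow described above."""
    return any(
        hit["similarity"] >= threshold and hit.get("skip_live_fetch", False)
        for hit in memory_hits
    )

hits = [
    {"similarity": 0.72, "skip_live_fetch": True},   # reusable match
    {"similarity": 0.41, "skip_live_fetch": True},   # below threshold
]
print(should_skip_live_fetch(hits))  # → True
```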
See CHANGELOG.md.
MIT. See LICENSE.