Cloudflare Workers for the Solas data collection pipeline.

| Worker | Purpose | Input | Output |
|---|---|---|---|
| scraper | Extract clean text from a URL | `{ url, maxChars? }` | `{ url, title, text, charCount }` |
| crawler | BFS crawl a domain | `{ baseUrl, maxPages?, maxDepth? }` | `{ pages: [...], totalFound, totalCrawled }` |
| search-aggregator | Multi-source search | `{ query, count? }` | `{ results: [...], totalSources }` |
| extractor | Regex-based structured extraction | `{ content, type?, html? }` | `{ type, data: {...} }` |

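As an illustration of the scraper row in the table, here is a minimal client sketch. The field names come from the table; the field types, the use of `POST` with a JSON body, the `callScraper` helper, and the endpoint URL are assumptions, not something this README specifies.

```typescript
// Shapes from the table above. Field types are assumed.
interface ScraperRequest {
  url: string;
  maxChars?: number;
}

interface ScraperResponse {
  url: string;
  title: string;
  text: string;
  charCount: number;
}

// Narrow fetch signature so a stub can be injected in tests.
// The global `fetch` is assignable to this type.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical helper: sends the request as JSON via POST (an assumption)
// and returns the parsed response body.
async function callScraper(
  endpoint: string,
  req: ScraperRequest,
  fetchImpl: FetchLike
): Promise<ScraperResponse> {
  const res = await fetchImpl(endpoint, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) {
    throw new Error(`scraper returned HTTP ${res.status}`);
  }
  return (await res.json()) as ScraperResponse;
}
```

In a real caller, the global `fetch` can be passed as `fetchImpl`, with `endpoint` set to wherever the scraper worker is deployed.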

    cd <worker-name>
    npm install
    npx wrangler deploy

- Cloudflare account with Workers enabled
- `CLOUDFLARE_API_TOKEN` set in environment
- KV namespaces created for `crawler` and `search-aggregator`:

      cd crawler && npx wrangler kv:namespace create CRAWL_RESULTS
      cd search-aggregator && npx wrangler kv:namespace create SEARCH_CACHE

- Update `wrangler.jsonc` with returned KV namespace IDs


    User → Worker → External Source → Extract → Return
             ↓
          KV Cache (optional)

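The optional cache step in the flow above could be sketched as a cache-aside helper. This is a sketch under assumptions: the `cached` helper and the `KVLike` interface are hypothetical names, and `KVLike` models only the `get` and `put` calls of a Workers KV binding that the sketch needs. The one-hour default mirrors the TTL stated for the search aggregator.

```typescript
// Minimal subset of a Workers KV binding used by this sketch.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

const ONE_HOUR_SECONDS = 60 * 60;

// Hypothetical helper: return the cached value for `key` if present;
// otherwise run `produce`, store its JSON form with a TTL, and return it.
// Passing `undefined` for `kv` skips caching entirely (the "optional" branch).
async function cached<T>(
  kv: KVLike | undefined,
  key: string,
  produce: () => Promise<T>,
  ttlSeconds: number = ONE_HOUR_SECONDS
): Promise<T> {
  if (kv) {
    const hit = await kv.get(key);
    if (hit !== null) {
      return JSON.parse(hit) as T;
    }
  }
  const fresh = await produce();
  if (kv) {
    // expirationTtl is given in seconds.
    await kv.put(key, JSON.stringify(fresh), { expirationTtl: ttlSeconds });
  }
  return fresh;
}
```

A handler might wrap its external fetch and extraction in `produce` and pass its KV binding as `kv`. The binding name depends on `wrangler.jsonc`.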
All workers are written in TypeScript, are self-contained, and have minimal dependencies. They use native Cloudflare APIs (HTMLRewriter, fetch, KV) for performance and cost efficiency.
- Scraper uses HTMLRewriter for streaming extraction (no full DOM parse)
- Crawler respects robots.txt and rate-limits at 1 req/sec
- Search aggregator caches results in KV for 1 hour
- Extractor uses regex patterns and has no LLM dependency (a future version will add LLM extraction)
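
To make the regex-based approach of the extractor concrete, here is a minimal sketch. The `{ type, data }` output shape follows the table. The accepted `type` values (`"emails"`, `"urls"`), the `matches` field, the patterns themselves, and the `extract` function name are all assumptions for illustration, since the actual patterns are not listed here.

```typescript
// Hypothetical extraction types; the values the real worker accepts are not documented here.
type ExtractType = "emails" | "urls";

interface ExtractResult {
  type: ExtractType;
  data: { matches: string[] };
}

// Deliberately simple illustrative patterns, not full RFC-grade matchers.
const PATTERNS: Record<ExtractType, RegExp> = {
  emails: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
  urls: /https?:\/\/[^\s"'<>)]+/g,
};

// Return every distinct match of the chosen pattern, in order of first appearance.
function extract(content: string, type: ExtractType = "emails"): ExtractResult {
  const found = content.match(PATTERNS[type]) || [];
  return { type, data: { matches: Array.from(new Set(found)) } };
}
```

For example, `extract("Contact a@b.com or c@d.org", "emails")` returns the two addresses under `data.matches`.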