shekinahfire77/solas-workers

Solas Workers

Cloudflare Workers for the Solas data collection pipeline.

Workers

| Worker | Purpose | Input | Output |
| --- | --- | --- | --- |
| scraper | Extract clean text from a URL | { url, maxChars? } | { url, title, text, charCount } |
| crawler | BFS crawl a domain | { baseUrl, maxPages?, maxDepth? } | { pages: [...], totalFound, totalCrawled } |
| search-aggregator | Multi-source search | { query, count? } | { results: [...], totalSources } |
| extractor | Regex-based structured extraction | { content, type?, html? } | { type, data: {...} } |
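The input and output shapes in the table above can be written down as TypeScript interfaces. This is a sketch only: the README gives field names but not field types, so every type below is an assumption, and `isScraperInput` is a hypothetical helper that is not part of the repository.

```typescript
// Shapes transcribed from the worker table. Field TYPES are assumptions;
// the README lists only field names.
interface ScraperInput { url: string; maxChars?: number }
interface ScraperOutput { url: string; title: string; text: string; charCount: number }

interface CrawlerInput { baseUrl: string; maxPages?: number; maxDepth?: number }
interface CrawlerOutput { pages: unknown[]; totalFound: number; totalCrawled: number }

interface SearchInput { query: string; count?: number }
interface SearchOutput { results: unknown[]; totalSources: number }

// The type of `html` is not stated in the README, so it is left as unknown.
interface ExtractorInput { content: string; type?: string; html?: unknown }
interface ExtractorOutput { type: string; data: Record<string, unknown> }

// Hypothetical helper: checks that an untyped JSON body is a usable ScraperInput.
function isScraperInput(body: unknown): body is ScraperInput {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  if (typeof b.url !== "string") return false;
  try {
    new URL(b.url); // throws on malformed or relative URLs
  } catch {
    return false;
  }
  // maxChars is optional; when present it must be a positive number.
  return b.maxChars === undefined || (typeof b.maxChars === "number" && b.maxChars > 0);
}
```

A worker could run a guard like this on the parsed request body before doing any fetching, and return a 400 response when it fails.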

Deploy

cd <worker-name>
npm install
npx wrangler deploy

Prerequisites

  • Cloudflare account with Workers enabled
  • CLOUDFLARE_API_TOKEN set in environment
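  • KV namespaces created for crawler and search-aggregator (run each command from the repository root):
    cd crawler && npx wrangler kv:namespace create CRAWL_RESULTS
    cd search-aggregator && npx wrangler kv:namespace create SEARCH_CACHE
    (newer Wrangler releases spell this subcommand "kv namespace create", with a space instead of a colon)
  • Update each worker's wrangler.jsonc with the KV namespace ID returned by the command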

Architecture

User → Worker → External Source → Extract → Return
                    ↓
              KV Cache (optional)

All workers are written in TypeScript, are self-contained, and have minimal dependencies. They use native Cloudflare APIs (HTMLRewriter, fetch, KV) for performance and cost efficiency.
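The optional KV cache step in the diagram can be sketched as a cache-aside helper. This is a minimal illustration, not the repository's code: `KVLike` is a hand-written subset of the Workers KV binding (only `get` and `put` with `expirationTtl`), and `getOrCompute` is a hypothetical name.

```typescript
// Minimal subset of a Workers KV binding used by this sketch. The real
// KVNamespace type has more methods and options.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

// Hypothetical helper: return the cached value for `key` if present;
// otherwise compute it, store it with a TTL in seconds, and return it.
async function getOrCompute<T>(
  kv: KVLike,
  key: string,
  ttlSeconds: number,
  compute: () => Promise<T>,
): Promise<T> {
  const cached = await kv.get(key);
  if (cached !== null) {
    return JSON.parse(cached) as T; // cache hit: skip the external source
  }
  const fresh = await compute(); // cache miss: go to the external source
  await kv.put(key, JSON.stringify(fresh), { expirationTtl: ttlSeconds });
  return fresh;
}
```

Inside a Worker, `kv` would be a KV binding from the environment, and a `ttlSeconds` of 3600 would correspond to a one-hour cache. Because the helper depends only on `get` and `put`, it can be exercised in tests with an in-memory `Map`.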

Notes

  • Scraper uses HTMLRewriter for streaming extraction (no full DOM parse)
  • Crawler respects robots.txt and rate-limits at 1 req/sec
  • Search aggregator caches results in KV for 1 hour
  • Extractor uses regex patterns and has no LLM dependency (a future version will add LLM extraction)
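The crawler's breadth-first traversal with a page cap, a depth cap, and a pause between requests might look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's implementation: `crawl` and `getLinks` are hypothetical names, the default values are invented, `totalFound` is assumed to mean the number of distinct same-origin URLs discovered, and robots.txt handling is omitted for brevity.

```typescript
interface CrawlResult { pages: string[]; totalFound: number; totalCrawled: number }

// Hypothetical BFS over same-origin links. `getLinks` stands in for fetching a
// page and extracting its hrefs; `delayMs` is the pause between requests
// (1000 ms matches a 1 req/sec limit). Default values are assumptions.
async function crawl(
  baseUrl: string,
  getLinks: (url: string) => Promise<string[]>,
  maxPages = 10,
  maxDepth = 2,
  delayMs = 1000,
): Promise<CrawlResult> {
  const origin = new URL(baseUrl).origin;
  const start = new URL(baseUrl).href;
  const seen = new Set<string>([start]);
  const queue: Array<{ url: string; depth: number }> = [{ url: start, depth: 0 }];
  const pages: string[] = [];

  while (queue.length > 0 && pages.length < maxPages) {
    const { url, depth } = queue.shift()!; // FIFO order gives breadth-first traversal
    if (pages.length > 0 && delayMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, delayMs)); // rate limit
    }
    const links = await getLinks(url);
    pages.push(url);
    if (depth >= maxDepth) continue; // do not expand pages at the depth limit
    for (const href of links) {
      let next: URL;
      try {
        next = new URL(href, url); // resolve relative links against the page URL
      } catch {
        continue; // skip malformed hrefs
      }
      next.hash = ""; // treat fragment variants as the same page
      if (next.origin !== origin || seen.has(next.href)) continue;
      seen.add(next.href);
      queue.push({ url: next.href, depth: depth + 1 });
    }
  }
  return { pages, totalFound: seen.size, totalCrawled: pages.length };
}
```

Passing the link extractor in as a parameter keeps the traversal logic separate from fetching and parsing, so the queue, depth, and deduplication behavior can be checked against a small in-memory link graph.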
