test(bench): add /map recall benchmark #191

Merged: us merged 1 commit into main from bench/map-recall on Jun 28, 2026
@us us commented Jun 26, 2026

What

A reproducible benchmark for /map URL-discovery recall — turns "map keeps missing URLs" into a number.

Ground truth = purpose-built scraper sandboxes with fixed, publicly-known structure (deterministic, unprotected, free):

| Site | Known ground truth |
| --- | --- |
| books.toscrape.com | 1000 products, 50 listing pages, 50 categories |
| quotes.toscrape.com | 10 pages |

Recall is the metric that matters, because a raw count lies. trendyol.com returns ~5082 URLs but almost none of the Turkish catalog: only 20 of the 5082 are product URLs, and nearly all of the rest come from /ar/ and /bg/ foreign-locale category sitemaps.
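
To make the distinction concrete, here is a minimal sketch of recall versus raw count. It is not the scorer in bench/map_recall.py: the `product_recall` helper, the `PRODUCT` URL pattern, and the sample URLs are assumptions chosen for illustration.

```python
import re

def product_recall(discovered_urls, expected_total, pattern):
    """Distinct discovered URLs matching `pattern`, divided by the known total."""
    matched = {u for u in discovered_urls if pattern.search(u)}
    return len(matched) / expected_total

# Assumed URL shape for a product detail page (illustrative, not taken from the bench).
PRODUCT = re.compile(r"/catalogue/[^/]+_\d+/index\.html$")

# Hypothetical /map output: 4 URLs, only 2 of which are product pages.
urls = [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
]

raw_count = len(urls)
# Pretend the site is known to have exactly 4 products.
recall = product_recall(urls, expected_total=4, pattern=PRODUCT)
print(f"raw count = {raw_count}, product recall = {recall:.2f}")
```

The raw count equals the size of the toy catalog, yet only half of the products were found. That is the trendyol.com situation in miniature.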

Run

CRW_API_URL=http://localhost:3000 uv run python bench/map_recall.py   # live
uv run python bench/map_recall.py --selfcheck                          # offline, recorded fixture

Baseline findings (hosted fastCRW, default maxDepth=2)

  • books.toscrape.com: products 709/1000 (recall 0.71); main-catalogue pagination only 4/50 pages reached. A clean, unprotected, sitemap-less static site still loses ~29% of products.
  • maxDepth 5 and 10 → HTTP 502 on hosted (default depth 2 works) — can't raise depth to compensate.

Root cause (follow-up, not fixed here)

crw-crawl/src/crawl.rs::discover_urls runs a BFS gated by depth < max_depth (default 2). Pagination is a chain (page-2 → page-3 → …), so depth-2 reaches ~page-3 and everything deeper is lost. The fix is pagination-aware traversal (follow rel=next / numbered pages on a separate budget) rather than just bumping default depth — raising depth lengthens the crawl and worsens the gateway 502. Tracking separately.
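
The depth gate and the proposed fix can be sketched on a toy site. This is a Python model under stated assumptions, not a port of the Rust `discover_urls`: the toy site (50 listing pages with 20 products each, linked only through a next-page chain), the function names, and the `max_pages` budget are all hypothetical. It does not reproduce the measured 709/1000, since the real site also links products from category pages.

```python
from collections import deque

def discover(start, links, max_depth):
    """Plain BFS: a page is expanded only while depth < max_depth."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # URL is recorded, but its outgoing links are never read
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen

def discover_paginated(start, links, next_links, max_depth, max_pages):
    """BFS where next-page links use a separate page budget instead of depth."""
    seen, queue = {start}, deque([(start, 0)])
    pages_followed = 0
    while queue:
        url, depth = queue.popleft()
        nxt = next_links.get(url)
        if nxt is not None and nxt not in seen and pages_followed < max_pages:
            pages_followed += 1
            seen.add(nxt)
            queue.append((nxt, depth))  # same depth: pagination is not "deeper"
        if depth >= max_depth:
            continue
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen

def toy_site(n_pages=50, per_page=20):
    """Listing pages link to their products; page-k points to page-(k+1)."""
    links, next_links = {}, {}
    for k in range(1, n_pages + 1):
        links[f"page-{k}"] = [f"product-{k}-{i}" for i in range(per_page)]
        if k < n_pages:
            next_links[f"page-{k}"] = f"page-{k + 1}"
    return links, next_links

def with_next_as_plain_link(links, next_links):
    """What a pagination-unaware crawler sees: 'next' is just another link."""
    return {p: out + ([next_links[p]] if p in next_links else [])
            for p, out in links.items()}

def count(seen, prefix):
    return sum(1 for u in seen if u.startswith(prefix))

links, next_links = toy_site()
plain = discover("page-1", with_next_as_plain_link(links, next_links), max_depth=2)
aware = discover_paginated("page-1", links, next_links, max_depth=2, max_pages=100)
print(count(plain, "page-"), count(plain, "product-"))
print(count(aware, "page-"), count(aware, "product-"))
```

In this model the plain BFS stops at page-3, matching the chain argument above, while the pagination-aware variant walks the whole chain without raising `max_depth`. The `max_pages` budget caps how many next-page links are followed in total, so crawl length stays bounded.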

Scope

Bench tooling only — no product code touched. Offline --selfcheck pins the ~0.71 baseline from a recorded fixture so scorer regressions fail loudly; the live run needs a local crw server.
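
A pinned-baseline self-check of this kind could look roughly like the sketch below. The fixture layout (`urls`, `expected_products`), the `/product/` URL test, and the `selfcheck` function are assumptions for illustration; the real --selfcheck in bench/map_recall.py may be structured differently.

```python
import json
import math

def selfcheck(fixture_json, pinned_recall, tol=0.005):
    """Re-score a recorded /map response and fail if recall drifts from the pin."""
    fixture = json.loads(fixture_json)
    found = {u for u in fixture["urls"] if "/product/" in u}
    recall = len(found) / fixture["expected_products"]
    if not math.isclose(recall, pinned_recall, abs_tol=tol):
        raise AssertionError(
            f"scorer drift: recall {recall:.3f} != pinned {pinned_recall:.3f}"
        )
    return recall

# Hypothetical recorded fixture: 7 of 10 known products were discovered.
FIXTURE = json.dumps({
    "expected_products": 10,
    "urls": [f"https://example.test/product/{i}" for i in range(7)]
            + ["https://example.test/page/2"],
})

print(selfcheck(FIXTURE, pinned_recall=0.7))
```

Because the fixture is recorded, the check needs no network access, and any change in the score can only come from the scorer itself.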

Measures /map URL-discovery recall against purpose-built scraper sandboxes
(books.toscrape.com, quotes.toscrape.com) whose structure is fixed and known.
Offline scorer self-check pins the observed baseline (~0.71 product recall at
default maxDepth=2) from a recorded fixture, so scorer regressions fail loudly.

Baseline finding: a clean, unprotected, sitemap-less static site still loses
~29% of products because discover_urls BFS is depth-gated and does not follow
pagination chains past depth. Root cause + follow-up noted in MAP_RECALL.md.
@us us merged commit e02d789 into main Jun 28, 2026
4 checks passed
@us us deleted the bench/map-recall branch June 28, 2026 15:02