test(bench): add /map recall benchmark by us · Pull Request #191 · us/crw

us · 2026-06-26T17:23:31Z

What

A reproducible benchmark for /map URL-discovery recall — turns "map keeps missing URLs" into a number.

Ground truth = purpose-built scraper sandboxes with fixed, publicly-known structure (deterministic, unprotected, free):

Site	Known ground truth
books.toscrape.com	1000 products, 50 listing pages, 50 categories
quotes.toscrape.com	10 pages

Recall is the metric that matters, because a raw URL count is misleading. trendyol.com returns ~5082 URLs, but only 20 of those 5082 are products, so coverage of the Turkish catalog is ~0. Almost all of the rest come from /ar/ and /bg/ foreign-locale category sitemaps.
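
The gap between raw count and recall can be sketched in a few lines of Python. This is a minimal illustration and not the scorer in bench/map_recall.py; the `recall` function and the `shop.example` URLs are hypothetical and the numbers are synthetic.

```python
def recall(discovered, ground_truth):
    """Fraction of ground-truth URLs that appear in the discovered set."""
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground truth must not be empty")
    return len(truth & set(discovered)) / len(truth)


# Synthetic example: a large result set that barely overlaps the catalog.
truth = {f"https://shop.example/p/{i}" for i in range(100)}
discovered = {f"https://shop.example/ar/c/{i}" for i in range(5000)}
discovered |= {f"https://shop.example/p/{i}" for i in range(2)}

print(len(discovered))            # raw count: 5002
print(recall(discovered, truth))  # recall: 0.02
```

A result set can be large and still cover almost none of the URLs that matter, which is why the benchmark scores against a known ground truth.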

Run

CRW_API_URL=http://localhost:3000 uv run python bench/map_recall.py   # live
uv run python bench/map_recall.py --selfcheck                          # offline, recorded fixture

Baseline findings (hosted fastCRW, default maxDepth=2)

books.toscrape.com: products 709/1000 (recall 0.71); main-catalogue pagination only 4/50 pages reached. A clean, unprotected, sitemap-less static site still loses ~29% of products.
maxDepth 5 and 10 both return HTTP 502 on the hosted service (the default depth of 2 works), so raising depth is not available as a workaround.

Root cause (follow-up, not fixed here)

crw-crawl/src/crawl.rs::discover_urls runs a BFS gated by depth < max_depth (default 2). Pagination is a chain (page-2 → page-3 → …), so depth-2 reaches ~page-3 and everything deeper is lost. The fix is pagination-aware traversal (follow rel=next / numbered pages on a separate budget) rather than just bumping default depth — raising depth lengthens the crawl and worsens the gateway 502. Tracking separately.
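
The effect of the depth gate on a pagination chain can be reproduced with a toy model. The sketch below is written in Python for brevity and is not the crw-crawl implementation: `build_site`, `discover`, and `page_budget` are hypothetical names, and "a separate budget" is interpreted here as letting a limited number of pagination hops cost no depth. The toy site has 50 pages of 20 products and ignores category listings, so its numbers are not the benchmark's.

```python
from collections import deque


def build_site(pages, per_page):
    """Toy paginated catalogue: page N links to its products and to page N+1."""
    links = {}
    for n in range(1, pages + 1):
        out = [f"product-{n}-{k}" for k in range(per_page)]
        if n < pages:
            out.append(f"page-{n + 1}")
        links[f"page-{n}"] = out
    return links


def discover(links, start, max_depth, page_budget=0):
    """BFS that expands a node only while depth < max_depth.

    While page_budget > 0, a pagination link is enqueued at its parent's
    depth, so following the chain does not consume the depth budget.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # discovered, but its outgoing links are never followed
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt.startswith("page-") and page_budget > 0:
                page_budget -= 1
                queue.append((nxt, depth))
            else:
                queue.append((nxt, depth + 1))
    return seen


def count(urls, prefix):
    return sum(u.startswith(prefix) for u in urls)


site = build_site(pages=50, per_page=20)
gated = discover(site, "page-1", max_depth=2)
aware = discover(site, "page-1", max_depth=2, page_budget=49)

print(count(gated, "page-"), count(gated, "product-"))  # 3 40
print(count(aware, "page-"), count(aware, "product-"))  # 50 1000
```

In this model the depth-gated run stops at page-3, while the run with a pagination budget walks the whole chain without raising `max_depth` for ordinary links.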

Scope

Bench tooling only — no product code touched. Offline --selfcheck pins the ~0.71 baseline from a recorded fixture so scorer regressions fail loudly; the live run needs a local crw server.
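
Pinning a baseline so that scorer regressions fail loudly could look roughly like the sketch below. It is an assumption about the shape of such a check, not the actual --selfcheck code: the `selfcheck` function, its parameters, and the tolerance are hypothetical, and only the 709/1000 figure comes from the baseline findings above.

```python
def selfcheck(found_products, total_products, expected=0.71, tol=0.005):
    """Raise if the scored recall drifts from the pinned baseline."""
    recall = found_products / total_products
    if abs(recall - expected) > tol:
        raise AssertionError(
            f"recall {recall:.3f} drifted from pinned baseline {expected:.2f}"
        )
    return recall


print(selfcheck(709, 1000))  # 0.709
```

Because the input is a recorded fixture, a change in the result points at the scorer rather than at the live site or the server.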

Measures /map URL-discovery recall against purpose-built scraper sandboxes (books.toscrape.com, quotes.toscrape.com) whose structure is fixed and known. Offline scorer self-check pins the observed baseline (~0.71 product recall at default maxDepth=2) from a recorded fixture, so scorer regressions fail loudly. Baseline finding: a clean, unprotected, sitemap-less static site still loses ~29% of products because the discover_urls BFS is depth-gated and does not follow pagination chains past the depth limit. Root cause and follow-up are noted in MAP_RECALL.md.

us merged commit e02d789 into main Jun 28, 2026
4 checks passed

us deleted the bench/map-recall branch June 28, 2026 15:02

us merged 1 commit into main from bench/map-recall
