This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
YogaMatLabData is a data pipeline repository that scrapes yoga mat product information from 19+ Shopify brand websites, normalizes the data to a unified schema, and makes it available to the YogaMatLabApp application. The pipeline runs daily via GitHub Actions to keep product data current.
- Extract → Scrape product data from brand Shopify stores using Playwright
- Normalize → Transform raw Shopify data to unified YogaMat schema
- Aggregate → Combine all brands into single dataset
- Detect Changes → Diff against previous day to track additions/removals/updates
- Download Images → Fetch and optimize product images
YogaMatLabData/
├── config/
│   └── field-mappings.json  # Shopify → YogaMat field transformations
├── scripts/
│   ├── extract-all-brands.ts  # Main orchestrator (queries Convex for brands)
│   ├── normalize-data.ts  # Data transformation
│   ├── aggregate-data.ts  # Combine all brands
│   ├── detect-changes.ts  # Change detection
│   ├── download-images.ts  # Batch image downloader
│   └── lib/
│       ├── shopify-scraper.ts  # Playwright-based Shopify scraper
│       ├── image-downloader.ts  # Image fetching with Sharp
│       ├── field-mapper.ts  # Field mapping logic
│       └── logger.ts  # Structured logging
├── data/
│   ├── raw/{date}/  # Daily Shopify extractions (by brand)
│   ├── normalized/{date}/  # Transformed to YogaMat schema
│   ├── aggregated/{date}/  # Combined datasets
│   │   └── all-products.json  # Single file with all products
│   └── changes/  # Daily changeset logs
└── .github/workflows/
    └── fetch-products.yml  # Automated daily pipeline
The unified product schema is defined in types/yogaMat.ts. Key types:
- YogaMat - Main product type (references Convex schema)
- Brand - Brand information
- MaterialType - Material categories (PVC, TPE, Natural Rubber, Cork, etc.)
- YogaStyle - Compatible yoga styles (Vinyasa, Hot Yoga, etc.)
- YogaMatFeature - Product features (Eco-Friendly, Non-Slip, etc.)
Raw scraped data structure from scripts/lib/shopify-scraper.ts:
{
  brand: string
  model: string
  price: number
  thickness?: number // in mm
  length?: number // in inches
  width?: number // in inches
  weight?: number // in lbs
  material?: string
  texture?: string
  imageUrl?: string
  description?: string
  features?: string[]
  variants?: Array<{name: string, price: number}>
}
Since package.json doesn't exist yet, the intended commands will be:
# Run full pipeline (once package.json is created)
npm run extract-all # Extract from all brands
npm run normalize # Transform to YogaMat schema
npm run aggregate # Combine into single dataset
npm run detect-changes # Generate changeset
# Or run all at once
npm run pipeline # Runs all steps sequentially
# Using tsx directly
npx tsx scripts/extract-all-brands.ts
npx tsx scripts/normalize-data.ts
npx tsx scripts/aggregate-data.ts
npx tsx scripts/detect-changes.ts
- Uses Playwright for browser automation
- Handles pagination automatically (Shopify's ?page=N pattern)
- Supports lazy-loading with scroll behavior
- Multiple fallback strategies for price extraction:
- Price-related CSS classes
- Data attributes
- JSON-LD structured data
- Shopify product JSON
- Converts all units to standard: mm (thickness), inches (dimensions), lbs (weight)
- Defaults to $99 if price not found
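The price fallback chain and unit normalization described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/lib/shopify-scraper.ts: the names resolvePrice, thicknessToMm, lengthToInches, and weightToLbs are hypothetical, and each price strategy is assumed to be a function that returns a number or null.

```typescript
// Hypothetical helper names; the real logic lives in scripts/lib/shopify-scraper.ts.
type PriceStrategy = () => number | null;

const DEFAULT_PRICE = 99; // default price stated in this document

// Try each strategy in order and return the first usable price.
function resolvePrice(strategies: PriceStrategy[]): number {
  for (const strategy of strategies) {
    const price = strategy();
    if (price !== null && Number.isFinite(price) && price > 0) {
      return price;
    }
  }
  return DEFAULT_PRICE;
}

// Conversion factors to the standard units: mm, inches, lbs.
const MM_PER_INCH = 25.4;
const CM_PER_INCH = 2.54;
const LBS_PER_KG = 2.20462;

function thicknessToMm(value: number, unit: "mm" | "in"): number {
  return unit === "mm" ? value : value * MM_PER_INCH;
}

function lengthToInches(value: number, unit: "in" | "cm"): number {
  return unit === "in" ? value : value / CM_PER_INCH;
}

function weightToLbs(value: number, unit: "lbs" | "kg"): number {
  return unit === "lbs" ? value : value * LBS_PER_KG;
}
```

Ordering the strategies from most to least specific keeps the $99 default as a last resort only.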
- Uses Sharp for image optimization
- Resizes to max 1200px width
- Converts to JPEG (quality: 85) for consistency
- Processes in configurable batches (default: 3 concurrent)
- Saves to slug-based filenames: {brand-slug}-{model-slug}.jpg
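As a rough illustration of the filename and batching behavior listed above, the sketch below builds a slug-based filename and processes items in batches of three. The helper names (slugify, imageFilename, toBatches, processInBatches) and the exact slug rules are assumptions, and the Sharp resize step is left out so the example stays self-contained.

```typescript
// Assumed slug rule: lowercase, collapse non-alphanumerics to "-", trim dashes.
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Builds {brand-slug}-{model-slug}.jpg
function imageFilename(brand: string, model: string): string {
  return `${slugify(brand)}-${slugify(model)}.jpg`;
}

// Split a list into consecutive batches of at most batchSize items.
function toBatches<T>(items: T[], batchSize = 3): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Run the worker concurrently within a batch, one batch after another.
async function processInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  batchSize = 3,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of toBatches(items, batchSize)) {
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}
```

In the real downloader the worker would fetch one image and pass it through Sharp; here it is any async function.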
Brand scraping configuration is stored in YogaMatLabApp's Convex brands table with these fields:
{
  // Standard brand fields
  name: string
  slug: string
  website: string
  // Scraping configuration
  scrapingEnabled: boolean // Toggle scraping on/off
  shopifyCollectionUrl: string | null // e.g., "/collections/yoga-mats"
  isShopify: boolean
  rateLimit: {
    delayBetweenProducts: number // ms between product pages (default: 500)
    delayBetweenPages: number // ms between collection pages (default: 1000)
  }
}
The pipeline queries api.brands.getScrapableBrands at runtime to fetch enabled brands.
- Individual brand failures don't stop the pipeline
- All errors logged to logs/{date}.log
- Pipeline continues to next brand on failure
- Validation errors skip invalid products but continue processing
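One way to get this behavior is to wrap each brand in its own try/catch and record the outcome. The sketch below assumes a hypothetical extractBrand function and an injected log callback; neither name comes from the repository.

```typescript
interface BrandResult {
  slug: string;
  ok: boolean;
  productCount: number;
  error?: string;
}

// Process brands one at a time; a failure is logged and recorded,
// then the loop moves on to the next brand.
async function runAllBrands(
  slugs: string[],
  extractBrand: (slug: string) => Promise<unknown[]>, // hypothetical extractor
  log: (message: string) => void = () => {},
): Promise<BrandResult[]> {
  const results: BrandResult[] = [];
  for (const slug of slugs) {
    try {
      const products = await extractBrand(slug);
      results.push({ slug, ok: true, productCount: products.length });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      log(`[${slug}] extraction failed: ${message}`);
      results.push({ slug, ok: false, productCount: 0, error: message });
    }
  }
  return results;
}
```

Returning a result per brand, rather than throwing, lets the caller decide whether a partial run is still worth committing.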
YogaMatLabApp includes this repo as a git submodule at data/external/:
# In YogaMatLabApp
git submodule update --remote data/external
npm run import-mats # Import to Convex database
YogaMatLabApp reads data/aggregated/latest/all-products.json via the submodule and imports it into the Convex database using the api.yogaMats.bulkUpsert mutation.
- Runs at 2 AM UTC daily (cron: 0 2 * * *)
- Also supports manual trigger via workflow_dispatch
- Fetches products using simple HTTP requests (no Playwright needed!)
- Commits results with detailed changeset summary
- Creates GitHub issue on failure
- Uploads logs as artifacts (30-day retention)
- Checkout repository
- Setup Node.js 20 with npm cache
- Install dependencies (npm ci)
- Create .env from secrets
- Run full pipeline (npm run pipeline)
- Update latest/ symlinks
- Generate commit message from changeset
- Commit and push results
- Upload logs as artifacts
- Create issue if failed
- Post execution summary
- CONVEX_URL - Convex deployment URL for querying brands table (required)
- PAT_TOKEN - Personal Access Token for cross-repo commits (optional, uses github.token by default)
Data update: YYYY-MM-DD
Changes detected:
- New products: X
- Removed products: Y
- Price changes: Z
- Total changes: N
🤖 Generated with YogaMatLab Data Pipeline
Run: #123
The pipeline supports three different scraper types to handle various e-commerce platforms:
File: scripts/lib/fetch-products-json.ts
How it works:
- Fetches from Shopify's public /collections/{path}/products.json API
- No browser automation needed - simple HTTP requests
- Supports pagination (up to 250 products per page)
- Handles multiple collections via pipe-delimited URLs
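The pagination and pipe-delimited handling above might look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, the HTTP call is injected as fetchJson so the example needs no network access, and the loop stops when a page returns fewer products than the page size.

```typescript
const MAX_PAGE_SIZE = 250; // per-page limit noted above

// Split a pipe-delimited productsJsonUrl value into individual collection URLs.
function splitCollectionUrls(productsJsonUrl: string): string[] {
  return productsJsonUrl
    .split("|")
    .map((url) => url.trim())
    .filter((url) => url.length > 0);
}

// Append limit and page query parameters to a products.json URL.
function buildPageUrl(baseUrl: string, page: number, limit = MAX_PAGE_SIZE): string {
  const separator = baseUrl.includes("?") ? "&" : "?";
  return `${baseUrl}${separator}limit=${limit}&page=${page}`;
}

interface ProductsPage {
  products: unknown[];
}

// Walk every collection page by page until a short page signals the end.
async function fetchAllProducts(
  productsJsonUrl: string,
  fetchJson: (url: string) => Promise<ProductsPage>, // injected HTTP call (assumption)
): Promise<unknown[]> {
  const all: unknown[] = [];
  for (const baseUrl of splitCollectionUrls(productsJsonUrl)) {
    for (let page = 1; ; page++) {
      const { products } = await fetchJson(buildPageUrl(baseUrl, page));
      all.push(...products);
      if (products.length < MAX_PAGE_SIZE) break;
    }
  }
  return all;
}
```

A product that appears in more than one collection would be returned twice by this sketch, so a real implementation would likely deduplicate by product ID.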
Configuration:
{
  platform: 'shopify',
  isShopify: true,
  productsJsonUrl: 'https://brand.com/collections/yoga-mats/products.json',
  // For multiple collections, use a pipe-delimited value instead:
  // productsJsonUrl: 'https://brand.com/collections/mats/products.json|https://brand.com/collections/props/products.json'
}
Brands using this:
- Alo Yoga, Liforme, Jade Yoga, Yolo Ha Yoga, Scoria World
- Gaiam, Yogi-Bare, Bala, 42 Birds, Ananday
- Oko Living, HeatHyoga, Stakt, Sensu, Wolo Yoga
- Keep Store, Yoga Matters, House of Mats, Shakti Warrior, Satori Concept
Advantages:
- Fast and reliable (no browser needed)
- Low resource usage
- Built-in pagination
- Complete product data including variants, images, options
Enhanced headers for Alo Yoga: Added browser-like headers to bypass 403 blocks:
- Accept-Language, Accept-Encoding
- Sec-Fetch-* headers (the Fetch Metadata request headers a browser normally sends)
- User-Agent rotation from pool
File: scripts/lib/lululemon-scraper.ts
How it works:
- Queries Lululemon's GraphQL API at https://shop.lululemon.com/api/graphql
- Uses category-based product search with pagination
- Converts GraphQL response to Shopify-compatible format
Configuration:
{
  platform: 'lululemon',
  platformConfig: {
    lululemonCategoryId: '8s6' // yoga accessories category
  }
}
GraphQL Query Structure:
query ProductSearch($categoryId: String!, $offset: Int, $limit: Int) {
  search(categoryId: $categoryId, offset: $offset, limit: $limit) {
    total
    products {
      productId
      name
      price { currentPrice, fullPrice }
      swatches { swatchName, colorId, images }
      sizes { details, isAvailable, price }
      pdpUrl
      featurePanels { featuresContent }
    }
  }
}
Data Conversion:
- Builds variants from color swatches × sizes
- Extracts features from featurePanels
- Maps images from all color swatches
- Generates SKUs from productId + colorId + size
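The swatches × sizes expansion can be illustrated with a small cross product. The field names follow the GraphQL query above, but the value types, the SKU separator, and the Variant shape are assumptions made for this sketch.

```typescript
interface Swatch {
  swatchName: string;
  colorId: string;
}

interface Size {
  details: string;
  isAvailable: boolean;
  price: number; // assumed numeric; the real API may return a string
}

interface Variant {
  sku: string;
  title: string;
  price: number;
  available: boolean;
}

// One variant per (color swatch, size) pair; SKU = productId + colorId + size.
function buildVariants(productId: string, swatches: Swatch[], sizes: Size[]): Variant[] {
  const variants: Variant[] = [];
  for (const swatch of swatches) {
    for (const size of sizes) {
      variants.push({
        sku: `${productId}-${swatch.colorId}-${size.details}`, // "-" separator is an assumption
        title: `${swatch.swatchName} / ${size.details}`,
        price: size.price,
        available: size.isAvailable,
      });
    }
  }
  return variants;
}
```

With two swatches and three sizes this yields six variants, ordered by swatch and then by size.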
Category IDs:
- 8s6: Yoga Accessories (includes yoga mats)
- 1z0qd0t: Women's Yoga
- 1z0qetm: Men's Yoga
Advantages:
- Official API (more stable than scraping)
- Rich product data with variants
- Supports pagination (60 products per page)
- No browser automation needed
File: scripts/lib/bigcommerce-scraper.ts
How it works:
- Uses Playwright browser automation
- Extracts product links from collection pages
- Visits each product page to extract detailed data
- Parses HTML and JSON-LD structured data
Configuration:
{
  platform: 'bigcommerce',
  platformConfig: {
    bigcommerceCollectionUrl: 'https://www.huggermugger.com/collections/yoga-mats'
  }
}
Extraction Strategy:
- Navigate to collection page
- Extract all product card links
- Visit each product page
- Extract data from:
  - HTML selectors (.bc-product__title, .bc-product__price)
  - JSON-LD structured data (<script type="application/ld+json">)
  - Product option dropdowns (colors, sizes)
- Convert to Shopify-compatible format
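The JSON-LD step of the strategy above could be sketched as follows. This version works on an HTML string (for example, the result of Playwright's page.content()), and the function names and the JsonLdProduct shape are hypothetical; the actual scraper may read the script tags through the browser instead.

```typescript
interface JsonLdProduct {
  "@type": string;
  name?: string;
  offers?: { price?: string | number };
}

// Collect and parse the body of every <script type="application/ld+json"> tag.
function extractJsonLdBlocks(html: string): unknown[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const blocks: unknown[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      blocks.push(JSON.parse(match[1] ?? ""));
    } catch {
      // Malformed JSON-LD is skipped rather than failing the whole page.
    }
  }
  return blocks;
}

// Return the first JSON-LD entry whose @type is "Product", or null.
function findProduct(html: string): JsonLdProduct | null {
  for (const block of extractJsonLdBlocks(html)) {
    const candidates: unknown[] = Array.isArray(block) ? block : [block];
    for (const candidate of candidates) {
      if (
        typeof candidate === "object" &&
        candidate !== null &&
        (candidate as JsonLdProduct)["@type"] === "Product"
      ) {
        return candidate as JsonLdProduct;
      }
    }
  }
  return null;
}
```

Pages that nest the product under @graph or use an array for @type would need extra handling that this sketch omits.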
Brands using this:
- Hugger Mugger (WordPress + BigCommerce plugin)
Limitations:
- Slower than API-based scrapers (browser automation)
- Higher resource usage (Chromium instance)
- Rate limiting important (1s delay between products)
- Variant extraction limited (requires dropdown interaction)
Advantages:
- Works with any rendered HTML (no API needed)
- Can extract from WordPress + BigCommerce hybrid sites
- Handles JavaScript-rendered content
All brands use /collections/{slug}/products.json endpoints:
- Alo Yoga (aloyoga.com/collections/yoga) - Enhanced headers for 403 bypass
- Liforme, Jade Yoga, Yolo Ha Yoga, Scoria World
- Gaiam, Yogi-Bare, Bala, 42 Birds, Ananday
- Oko Living, HeatHyoga, Stakt, Sensu, Wolo Yoga
- Keep Store, Yoga Matters, House of Mats, Shakti Warrior
- Satori Concept (4 collections via pipe-delimited URLs)
- Lululemon - Custom Next.js + GraphQL implementation
- Hugger Mugger - WordPress + BigCommerce plugin (Playwright scraper)
Require investigation and custom scrapers:
- Sugamats, Byoga, Grip Yoga, EcoYoga
- Sequential brand processing (no parallel requests)
- Configurable delays between requests
- User agent identifies as "YogaMatLab Data Pipeline"
- Only model is strictly required
- Missing price defaults to $99
- Invalid products logged but skipped
- All products validated against YogaMat schema
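A plain TypeScript sketch of these rules is shown below. It does not use zod (which the dependency list mentions) and is not the repository's validator; the type names and the validateProducts function are hypothetical.

```typescript
interface RawProduct {
  brand?: string;
  model?: string;
  price?: number;
  [key: string]: unknown;
}

interface ValidProduct extends RawProduct {
  model: string;
  price: number;
}

const DEFAULT_PRICE = 99; // default stated in this document

// Keep products that have a model, fill in a default price, and log the rest.
function validateProducts(
  raw: RawProduct[],
  log: (message: string) => void = () => {},
): ValidProduct[] {
  const valid: ValidProduct[] = [];
  raw.forEach((product, index) => {
    const model = typeof product.model === "string" ? product.model.trim() : "";
    if (model === "") {
      log(`Skipping product at index ${index}: missing model`);
      return;
    }
    const price =
      typeof product.price === "number" && Number.isFinite(product.price)
        ? product.price
        : DEFAULT_PRICE;
    valid.push({ ...product, model, price });
  });
  return valid;
}
```

Because the default price is indistinguishable from a real $99 price downstream, a real implementation might also flag defaulted records.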
Tracks three types of changes:
- New products - In current day, not in previous
- Removed products - In previous day, not in current (for redirect setup)
- Updated products - Price changes, spec changes
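The three change types can be computed with a keyed comparison of two daily snapshots. The sketch below is an assumption-laden illustration: it keys products by brand and model, and it reports only price differences as updates, leaving spec changes out for brevity.

```typescript
interface Product {
  brand: string;
  model: string;
  price: number;
}

interface PriceChange {
  key: string;
  from: number;
  to: number;
}

interface Changeset {
  added: Product[];
  removed: Product[];
  priceChanges: PriceChange[];
}

// Assumed identity for a product across days: brand plus model.
function productKey(p: Product): string {
  return `${p.brand}::${p.model}`;
}

function detectChanges(previous: Product[], current: Product[]): Changeset {
  const prevByKey = new Map<string, Product>();
  for (const p of previous) prevByKey.set(productKey(p), p);

  const currentKeys = new Set<string>();
  const changes: Changeset = { added: [], removed: [], priceChanges: [] };

  for (const p of current) {
    const key = productKey(p);
    currentKeys.add(key);
    const before = prevByKey.get(key);
    if (before === undefined) {
      changes.added.push(p); // in current day, not in previous
    } else if (before.price !== p.price) {
      changes.priceChanges.push({ key, from: before.price, to: p.price });
    }
  }

  for (const p of previous) {
    if (!currentKeys.has(productKey(p))) {
      changes.removed.push(p); // in previous day, not in current
    }
  }
  return changes;
}
```

If a brand renames a model, this keying reports one removal and one addition, so a stable product ID would be a better key where one is available.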
- Raw data: data/raw/{YYYY-MM-DD}/{brand-slug}.json
- Normalized: data/normalized/{YYYY-MM-DD}/{brand-slug}.json
- Aggregated: data/aggregated/{YYYY-MM-DD}/all-products.json
- Changes: data/changes/{YYYY-MM-DD}-changeset.json
- Images: {brand-slug}-{model-slug}.jpg
{
  "dependencies": {
    "playwright": "^1.x",
    "sharp": "^0.x",
    "zod": "^3.x",
    "convex": "^1.x"
  },
  "devDependencies": {
    "typescript": "^5.x",
    "tsx": "^4.x"
  }
}
- Convex brand query integration
- Products.json fetching (replaced Playwright scraping)
- Rate limiting and error handling
- Multi-brand orchestration
- Field mapping configuration
- Normalization to YogaMat schema
- Data aggregation with statistics
- Change detection between runs
- GitHub Actions workflow (daily at 2 AM UTC)
- Latest symlinks updater
- Automatic commits with changeset summary
- Failure notifications via GitHub issues
- Workflow runs successfully with partial brand failures
- See INTEGRATION_INSTRUCTIONS.md for complete setup guide
- Git submodule configuration
- Convex bulk upsert mutation (pending in YogaMatLabApp)
- Import script (pending in YogaMatLabApp)
- README.md ✅
- CLAUDE.md ✅
- DATA_PIPELINE.md ✅
- INTEGRATION_INSTRUCTIONS.md ✅
- GitHub Actions setup guide ✅
- Add to YogaMatLabApp's Convex brands table:
{
  scrapingEnabled: true,
  platform: 'shopify', // or omit (defaults to shopify if isShopify: true)
  isShopify: true,
  productsJsonUrl: 'https://brand.com/collections/yoga-mats/products.json',
  rateLimit: {
    delayBetweenProducts: 500,
    delayBetweenPages: 1000
  }
}
- Test: curl "https://brand.com/collections/yoga-mats/products.json?limit=10"
- For multiple collections, use pipe-delimited URLs: productsJsonUrl: 'url1/products.json|url2/products.json|url3/products.json'
- Add to Convex brands table:
{
  scrapingEnabled: true,
  platform: 'lululemon',
  platformConfig: {
    lululemonCategoryId: '8s6' // yoga accessories
  },
  rateLimit: { delayBetweenPages: 2000 } // 2s between GraphQL pages
}
- No testing needed - scraper auto-configured
- Add to Convex brands table:
{
  scrapingEnabled: true,
  platform: 'bigcommerce',
  platformConfig: {
    bigcommerceCollectionUrl: 'https://www.brand.com/collections/yoga-mats'
  },
  rateLimit: { delayBetweenProducts: 1000 } // 1s between products
}
- Test with browser to verify collection page loads
- Note: Slower than other scrapers (browser automation)
- Investigate the brand's e-commerce platform:
- Check HTML for platform indicators (Shopify, BigCommerce, WooCommerce, etc.)
- Look for API endpoints in Network tab
- Check for JSON-LD structured data
- If Shopify, follow Shopify instructions above
- If custom platform, create new scraper in scripts/lib/{brand}-scraper.ts
- Update get-brands-from-convex.ts to route to new scraper
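Routing a brand to its scraper might be expressed as a lookup on the platform field, with the Shopify default described in the configuration comments. The names below (resolvePlatform, selectScraper, the Scraper type) are hypothetical and do not reflect the actual contents of get-brands-from-convex.ts.

```typescript
type Platform = "shopify" | "lululemon" | "bigcommerce";

interface BrandConfig {
  slug: string;
  platform?: Platform;
  isShopify?: boolean;
}

type Scraper = (brand: BrandConfig) => Promise<unknown[]>;

// An explicit platform wins; otherwise fall back to Shopify when isShopify is true.
function resolvePlatform(brand: BrandConfig): Platform | null {
  if (brand.platform) return brand.platform;
  return brand.isShopify ? "shopify" : null;
}

// Pick the scraper for a brand, failing loudly when none is configured.
function selectScraper(brand: BrandConfig, scrapers: Record<Platform, Scraper>): Scraper {
  const platform = resolvePlatform(brand);
  if (platform === null) {
    throw new Error(`No scraper configured for brand "${brand.slug}"`);
  }
  return scrapers[platform];
}
```

Adding a new platform then means extending the Platform union and registering one more entry in the scrapers record, which the compiler enforces.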
- Test on 2-3 brands before running full pipeline
- Shopify scraper has 4 price extraction fallbacks - maintain all
- Unit conversion functions in scraper are critical - don't break
- Always respect rate limits
- YogaMat type references Convex schema in YogaMatLabApp
- Any schema changes require coordination with YogaMatLabApp
- Maintain backwards compatibility for existing data files
Before the pipeline can run, YogaMatLabApp must have:
- Brands schema with scraping fields (see Brand Configuration section above)
- Convex query: convex/brands/getScrapableBrands.ts that returns brands with scrapingEnabled: true
// Example query structure
import { query } from "../_generated/server"; // path assumes the file lives in convex/brands/
export const getScrapableBrands = query({
  handler: async (ctx) => {
    return await ctx.db
      .query("brands")
      .filter((q) => q.eq(q.field("scrapingEnabled"), true))
      .collect();
  },
});