CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Repository Purpose

YogaMatLabData is a data pipeline repository that scrapes yoga mat product information from 19+ Shopify brand websites, normalizes the data to a unified schema, and makes it available to the YogaMatLabApp application. The pipeline runs daily via GitHub Actions to keep product data current.

Core Architecture

Data Flow

Extract → Fetch product data from brand stores (Shopify products.json, Lululemon GraphQL, or Playwright for BigCommerce)
Normalize → Transform raw Shopify data to unified YogaMat schema
Aggregate → Combine all brands into single dataset
Detect Changes → Diff against previous day to track additions/removals/updates
Download Images → Fetch and optimize product images

Directory Structure

YogaMatLabData/
├── config/
│   └── field-mappings.json      # Shopify → YogaMat field transformations
├── scripts/
│   ├── extract-all-brands.ts    # Main orchestrator (queries Convex for brands)
│   ├── normalize-data.ts        # Data transformation
│   ├── aggregate-data.ts        # Combine all brands
│   ├── detect-changes.ts        # Change detection
│   ├── download-images.ts       # Batch image downloader
│   └── lib/
│       ├── shopify-scraper.ts   # Playwright-based Shopify scraper
│       ├── image-downloader.ts  # Image fetching with Sharp
│       ├── field-mapper.ts      # Field mapping logic
│       └── logger.ts            # Structured logging
├── data/
│   ├── raw/{date}/              # Daily Shopify extractions (by brand)
│   ├── normalized/{date}/       # Transformed to YogaMat schema
│   ├── aggregated/{date}/       # Combined datasets
│   │   └── all-products.json    # Single file with all products
│   └── changes/                 # Daily changeset logs
└── .github/workflows/
    └── fetch-products.yml       # Automated daily pipeline

Data Schema

YogaMat Type

The unified product schema is defined in types/yogaMat.ts. Key types:

YogaMat - Main product type (references Convex schema)
Brand - Brand information
MaterialType - Material categories (PVC, TPE, Natural Rubber, Cork, etc.)
YogaStyle - Compatible yoga styles (Vinyasa, Hot Yoga, etc.)
YogaMatFeature - Product features (Eco-Friendly, Non-Slip, etc.)

ProductData (Raw Shopify)

Raw scraped data structure from scripts/lib/shopify-scraper.ts:

{
  brand: string
  model: string
  price: number
  thickness?: number  // in mm
  length?: number     // in inches
  width?: number      // in inches
  weight?: number     // in lbs
  material?: string
  texture?: string
  imageUrl?: string
  description?: string
  features?: string[]
  variants?: Array<{name: string, price: number}>
}

Development Commands

Running the Pipeline

Since package.json doesn't exist yet, the intended commands will be:

# Run full pipeline (once package.json is created)
npm run extract-all    # Extract from all brands
npm run normalize      # Transform to YogaMat schema
npm run aggregate      # Combine into single dataset
npm run detect-changes # Generate changeset

# Or run all at once
npm run pipeline       # Runs all steps sequentially

Manual Execution (Current)

# Using tsx directly
npx tsx scripts/extract-all-brands.ts
npx tsx scripts/normalize-data.ts
npx tsx scripts/aggregate-data.ts
npx tsx scripts/detect-changes.ts

Key Implementation Details

Shopify Scraper (`scripts/lib/shopify-scraper.ts`)

Uses Playwright for browser automation
Handles pagination automatically (Shopify's ?page=N pattern)
Supports lazy-loading with scroll behavior
Multiple fallback strategies for price extraction:
1. Price-related CSS classes
2. Data attributes
3. JSON-LD structured data
4. Shopify product JSON
Converts all units to standard: mm (thickness), inches (dimensions), lbs (weight)
Defaults to $99 if price not found
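
The fallback order above can be read as a chain of extractors tried in sequence. The following is a minimal sketch under that reading, not the scraper's actual code: `PriceExtractor`, `extractPrice`, `parsePriceText`, and `PageSnapshot` are hypothetical names, and the extractors work on a plain object rather than a Playwright page.

```typescript
// Hypothetical shape: an extractor returns a price, or undefined if it finds nothing.
type PriceExtractor<T> = (source: T) => number | undefined;

// Default taken from the note above: $99 when no strategy yields a price.
const DEFAULT_PRICE = 99;

// Try each extractor in order and return the first finite, positive price.
function extractPrice<T>(source: T, extractors: PriceExtractor<T>[]): number {
  for (const extract of extractors) {
    const price = extract(source);
    if (price !== undefined && Number.isFinite(price) && price > 0) {
      return price;
    }
  }
  return DEFAULT_PRICE;
}

// Parse strings such as "$1,299.00" into 1299; undefined when no number is present.
function parsePriceText(text: string | undefined): number | undefined {
  if (!text) return undefined;
  const match = text.replace(/,/g, "").match(/\d+(\.\d+)?/);
  return match ? Number(match[0]) : undefined;
}

// Stand-in for values already pulled from a product page (illustrative fields only).
interface PageSnapshot {
  cssPriceText?: string;
  dataPriceAttr?: string;
  jsonLdPrice?: number;
  productJsonPrice?: number;
}

const strategies: PriceExtractor<PageSnapshot>[] = [
  (p) => parsePriceText(p.cssPriceText),  // 1. price-related CSS classes
  (p) => parsePriceText(p.dataPriceAttr), // 2. data attributes
  (p) => p.jsonLdPrice,                   // 3. JSON-LD structured data
  (p) => p.productJsonPrice,              // 4. Shopify product JSON
];
```

Keeping the strategies in an ordered array makes it harder to drop one by accident when the scraper is modified.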

Image Downloader (`scripts/lib/image-downloader.ts`)

Uses Sharp for image optimization
Resizes to max 1200px width
Converts to JPEG (quality: 85) for consistency
Processes in configurable batches (default: 3 concurrent)
Saves to slug-based filenames: {brand-slug}-{model-slug}.jpg
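
The filename pattern and the batch size above could be sketched as follows. The Sharp calls are left out so the example stays dependency-free; `slugify`, `imageFilename`, `toBatches`, and `processInBatches` are illustrative names, and the exact slug rules used by the real downloader are an assumption.

```typescript
// Lowercase, replace runs of non-alphanumerics with "-", trim leading/trailing dashes.
function slugify(value: string): string {
  return value
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Build the {brand-slug}-{model-slug}.jpg filename described above.
function imageFilename(brand: string, model: string): string {
  return `${slugify(brand)}-${slugify(model)}.jpg`;
}

// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size = 3): T[][] {
  if (size < 1) throw new RangeError("batch size must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `worker` over all items one batch at a time; items inside a batch run concurrently.
// Promise.allSettled keeps one failed download from rejecting the whole batch.
async function processInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  size = 3,
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = [];
  for (const batch of toBatches(items, size)) {
    results.push(...(await Promise.allSettled(batch.map((item) => worker(item)))));
  }
  return results;
}
```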

Brand Configuration (Convex)

Brand scraping configuration is stored in YogaMatLabApp's Convex brands table with these fields:

{
  // Standard brand fields
  name: string
  slug: string
  website: string

  // Scraping configuration
  scrapingEnabled: boolean              // Toggle scraping on/off
  shopifyCollectionUrl: string | null   // e.g., "/collections/yoga-mats"
  isShopify: boolean
  rateLimit: {
    delayBetweenProducts: number        // ms between product pages (default: 500)
    delayBetweenPages: number           // ms between collection pages (default: 1000)
  }
}

The pipeline queries api.brands.getScrapableBrands at runtime to fetch enabled brands.
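
As an illustration of how the documented defaults might be applied once brands are fetched, here is a small sketch. Treating `rateLimit` and its fields as optional is an assumption (some examples in this file set only one delay), and `resolveRateLimit` and `scrapableBrands` are hypothetical helpers.

```typescript
interface RateLimit {
  delayBetweenProducts: number; // ms between product pages
  delayBetweenPages: number;    // ms between collection pages
}

// Mirrors the brand fields listed above; rateLimit is assumed to be optional or partial.
interface BrandScrapeConfig {
  name: string;
  slug: string;
  website: string;
  scrapingEnabled: boolean;
  shopifyCollectionUrl: string | null;
  isShopify: boolean;
  rateLimit?: Partial<RateLimit>;
}

// Defaults documented in the field comments above.
const DEFAULT_RATE_LIMIT: RateLimit = {
  delayBetweenProducts: 500,
  delayBetweenPages: 1000,
};

// Fill in any missing delay with its default.
function resolveRateLimit(brand: BrandScrapeConfig): RateLimit {
  return {
    delayBetweenProducts:
      brand.rateLimit?.delayBetweenProducts ?? DEFAULT_RATE_LIMIT.delayBetweenProducts,
    delayBetweenPages:
      brand.rateLimit?.delayBetweenPages ?? DEFAULT_RATE_LIMIT.delayBetweenPages,
  };
}

// Local equivalent of the scrapingEnabled toggle, useful when testing with fixture data.
function scrapableBrands(brands: BrandScrapeConfig[]): BrandScrapeConfig[] {
  return brands.filter((brand) => brand.scrapingEnabled);
}
```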

Error Handling

A failure in one brand does not stop the pipeline; processing continues with the next brand
All errors are logged to logs/{date}.log
Products that fail validation are logged and skipped; the rest of the brand's products are still processed
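
The continue-on-failure behaviour can be sketched as a sequential loop with a try/catch per brand. This is an illustrative outline only: `runAllBrands`, `BrandResult`, and the injected `scrapeBrand` and `log` callbacks are assumptions, not the orchestrator's real interface.

```typescript
interface BrandResult {
  brand: string;
  ok: boolean;
  productCount: number;
  error?: string;
}

// Process brands one at a time; record a failure and move on instead of aborting.
async function runAllBrands(
  brands: string[],
  scrapeBrand: (brand: string) => Promise<unknown[]>,
  log: (line: string) => void = console.error,
): Promise<BrandResult[]> {
  const results: BrandResult[] = [];
  for (const brand of brands) {
    try {
      const products = await scrapeBrand(brand);
      results.push({ brand, ok: true, productCount: products.length });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      log(`[${brand}] failed: ${message}`);
      results.push({ brand, ok: false, productCount: 0, error: message });
    }
  }
  return results;
}
```

Returning a result per brand, rather than throwing, also gives the workflow enough information to report partial failures.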

Integration with YogaMatLabApp

Git Submodule

YogaMatLabApp includes this repo as a git submodule at data/external/:

# In YogaMatLabApp
git submodule update --remote data/external
npm run import-mats  # Import to Convex database

Data Consumption

YogaMatLabApp reads data/aggregated/latest/all-products.json via the submodule and imports it into the Convex database using the api.yogaMats.bulkUpsert mutation (listed as pending under Phase 4 of Implementation Status).

GitHub Actions Automation

Daily Pipeline

Runs at 2 AM UTC daily (cron: 0 2 * * *)
Also supports manual trigger via workflow_dispatch
Fetches Shopify and Lululemon products using plain HTTP requests; only the BigCommerce scraper needs Playwright
Commits results with detailed changeset summary
Creates GitHub issue on failure
Uploads logs as artifacts (30-day retention)

Workflow Steps

Checkout repository
Setup Node.js 20 with npm cache
Install dependencies (npm ci)
Create .env from secrets
Run full pipeline (npm run pipeline)
Update latest/ symlinks
Generate commit message from changeset
Commit and push results
Upload logs as artifacts
Create issue if failed
Post execution summary

Required Secrets

CONVEX_URL - Convex deployment URL for querying brands table (required)
PAT_TOKEN - Personal Access Token for cross-repo commits (optional, uses github.token by default)

Commit Message Format

Data update: YYYY-MM-DD

📊 Changes detected:
- New products: X
- Removed products: Y
- Price changes: Z
- Total changes: N

🤖 Generated with YogaMatLab Data Pipeline
Run: #123
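
A commit message in this format could be assembled from changeset counts roughly as follows. The `ChangesetSummary` field names are hypothetical, and the total is taken as an input rather than computed, since this file does not say how it is derived.

```typescript
// Hypothetical summary shape; the real changeset file may use different field names.
interface ChangesetSummary {
  date: string; // YYYY-MM-DD
  newProducts: number;
  removedProducts: number;
  priceChanges: number;
  totalChanges: number;
  runNumber: number;
}

// Render the summary using the commit message layout shown above.
function buildCommitMessage(s: ChangesetSummary): string {
  return [
    `Data update: ${s.date}`,
    "",
    "📊 Changes detected:",
    `- New products: ${s.newProducts}`,
    `- Removed products: ${s.removedProducts}`,
    `- Price changes: ${s.priceChanges}`,
    `- Total changes: ${s.totalChanges}`,
    "",
    "🤖 Generated with YogaMatLab Data Pipeline",
    `Run: #${s.runNumber}`,
  ].join("\n");
}
```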

Scraper Types and Platform Support

The pipeline supports three different scraper types to handle various e-commerce platforms:

1. Shopify Products.json Scraper ✅ (Primary)

File: scripts/lib/fetch-products-json.ts

How it works:

Fetches from Shopify's public /collections/{path}/products.json API
No browser automation needed - simple HTTP requests
Supports pagination (up to 250 products per page)
Handles multiple collections via pipe-delimited URLs
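
The pagination and pipe-delimited handling could look roughly like this. It is a sketch under assumptions: the `limit` and `page` query parameters are used to page through the endpoint, a page shorter than the limit is treated as the last one, and `fetchJson` is an injected stand-in for the real HTTP call. The function names are illustrative.

```typescript
const MAX_PAGE_SIZE = 250; // upper bound per page noted above

// Split a pipe-delimited productsJsonUrl value into individual collection URLs.
function splitCollectionUrls(productsJsonUrl: string): string[] {
  return productsJsonUrl
    .split("|")
    .map((url) => url.trim())
    .filter((url) => url.length > 0);
}

// Build the URL for one page of a collection's products.json.
function pageUrl(baseUrl: string, page: number, limit = MAX_PAGE_SIZE): string {
  const url = new URL(baseUrl);
  url.searchParams.set("limit", String(Math.min(limit, MAX_PAGE_SIZE)));
  url.searchParams.set("page", String(page));
  return url.toString();
}

// Walk every page of every collection; stop a collection when a short page is returned.
async function fetchAllProducts(
  productsJsonUrl: string,
  fetchJson: (url: string) => Promise<{ products: unknown[] }>,
): Promise<unknown[]> {
  const all: unknown[] = [];
  for (const base of splitCollectionUrls(productsJsonUrl)) {
    for (let page = 1; ; page++) {
      const { products } = await fetchJson(pageUrl(base, page));
      all.push(...products);
      if (products.length < MAX_PAGE_SIZE) break;
    }
  }
  return all;
}
```

Injecting `fetchJson` keeps the paging logic testable without network access; the rate-limit delay between pages would sit inside that callback or the inner loop.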

Configuration:

{
  platform: 'shopify',
  isShopify: true,
  productsJsonUrl: 'https://brand.com/collections/yoga-mats/products.json',
  // For multiple collections, use a pipe-delimited value instead:
  // productsJsonUrl: 'https://brand.com/collections/mats/products.json|https://brand.com/collections/props/products.json'
}

Brands using this:

Alo Yoga, Liforme, Jade Yoga, Yolo Ha Yoga, Scoria World
Gaiam, Yogi-Bare, Bala, 42 Birds, Ananday
Oko Living, HeatHyoga, Stakt, Sensu, Wolo Yoga
Keep Store, Yoga Matters, House of Mats, Shakti Warrior, Satori Concept

Advantages:

Fast and reliable (no browser needed)
Low resource usage
Built-in pagination
Complete product data including variants, images, options

Enhanced headers for Alo Yoga: Added browser-like headers to bypass 403 blocks:

Accept-Language, Accept-Encoding
Sec-Fetch-* fetch metadata headers, as sent by real browsers
User-Agent rotation from pool

2. Lululemon GraphQL Scraper ✅

File: scripts/lib/lululemon-scraper.ts

How it works:

Queries Lululemon's GraphQL API at https://shop.lululemon.com/api/graphql
Uses category-based product search with pagination
Converts GraphQL response to Shopify-compatible format

Configuration:

{
  platform: 'lululemon',
  platformConfig: {
    lululemonCategoryId: '8s6' // yoga accessories category
  }
}

GraphQL Query Structure:

query ProductSearch($categoryId: String!, $offset: Int, $limit: Int) {
  search(categoryId: $categoryId, offset: $offset, limit: $limit) {
    total
    products {
      productId
      name
      price { currentPrice, fullPrice }
      swatches { swatchName, colorId, images }
      sizes { details, isAvailable, price }
      pdpUrl
      featurePanels { featuresContent }
    }
  }
}

Data Conversion:

Builds variants from color swatches × sizes
Extracts features from featurePanels
Maps images from all color swatches
Generates SKUs from productId + colorId + size
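
The swatch × size expansion can be sketched as a cartesian product. Field names follow the GraphQL query above, but their types (`images` as a string array, `price` as a number, `details` as the size label), the hyphen-joined SKU format, and the `Variant` shape are assumptions made for illustration.

```typescript
// Types inferred from the query fields above; actual response types may differ.
interface Swatch {
  swatchName: string;
  colorId: string;
  images: string[];
}

interface Size {
  details: string; // assumed to be the size label, e.g. "5mm"
  isAvailable: boolean;
  price: number;
}

// Hypothetical Shopify-like variant shape.
interface Variant {
  sku: string;
  title: string;
  option1: string; // color
  option2: string; // size
  price: number;
  available: boolean;
}

// One variant per (swatch, size) pair; SKU built from productId + colorId + size.
function buildVariants(productId: string, swatches: Swatch[], sizes: Size[]): Variant[] {
  return swatches.flatMap((swatch) =>
    sizes.map((size) => ({
      sku: `${productId}-${swatch.colorId}-${size.details}`,
      title: `${swatch.swatchName} / ${size.details}`,
      option1: swatch.swatchName,
      option2: size.details,
      price: size.price,
      available: size.isAvailable,
    })),
  );
}

// Collect images from all swatches, dropping duplicates while keeping order.
function collectImages(swatches: Swatch[]): string[] {
  return Array.from(new Set(swatches.flatMap((swatch) => swatch.images)));
}
```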

Category IDs:

8s6: Yoga Accessories (includes yoga mats)
1z0qd0t: Women's Yoga
1z0qetm: Men's Yoga

Advantages:

Structured GraphQL API (more stable than HTML scraping)
Rich product data with variants
Supports pagination (60 products per page)
No browser automation needed
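
For the offset/limit pagination, the per-page query variables might be generated as below. This sketch assumes `offset` counts products rather than pages, and that `total` is read from the first response; `pageVariables` and `buildRequestBody` are illustrative helpers using the standard `{ query, variables }` GraphQL request body.

```typescript
const PAGE_SIZE = 60; // products per page, as noted above

interface SearchVariables {
  categoryId: string;
  offset: number;
  limit: number;
}

// Variables for every page needed to cover `total` products.
function pageVariables(categoryId: string, total: number, limit = PAGE_SIZE): SearchVariables[] {
  const pages: SearchVariables[] = [];
  for (let offset = 0; offset < total; offset += limit) {
    pages.push({ categoryId, offset, limit });
  }
  return pages;
}

// Serialize a GraphQL POST body for one page.
function buildRequestBody(query: string, variables: SearchVariables): string {
  return JSON.stringify({ query, variables });
}
```

In practice the first request (offset 0) would be sent before `total` is known, and the remaining pages derived from the `total` it returns.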

3. BigCommerce Playwright Scraper ✅

File: scripts/lib/bigcommerce-scraper.ts

How it works:

Uses Playwright browser automation
Extracts product links from collection pages
Visits each product page to extract detailed data
Parses HTML and JSON-LD structured data

Configuration:

{
  platform: 'bigcommerce',
  platformConfig: {
    bigcommerceCollectionUrl: 'https://www.huggermugger.com/collections/yoga-mats'
  }
}

Extraction Strategy:

Navigate to collection page
Extract all product card links
Visit each product page
Extract data from:
- HTML selectors (.bc-product__title, .bc-product__price)
- JSON-LD structured data (<script type="application/ld+json">)
- Product option dropdowns (colors, sizes)
Convert to Shopify-compatible format
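
The JSON-LD step can be illustrated on a raw HTML string. This is a simplified, regex-based sketch rather than the Playwright code: it assumes a schema.org `Product` object with `name` and `offers.price`, and `extractJsonLdProduct` is a hypothetical helper name.

```typescript
interface JsonLdProduct {
  name?: string;
  price?: number;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Find the first JSON-LD block describing a Product and pull out name and price.
function extractJsonLdProduct(html: string): JsonLdProduct | undefined {
  const scriptPattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  while ((match = scriptPattern.exec(html)) !== null) {
    let data: unknown;
    try {
      data = JSON.parse(match[1]);
    } catch {
      continue; // malformed JSON-LD block: try the next one
    }
    const nodes: unknown[] = Array.isArray(data) ? data : [data];
    for (const node of nodes) {
      if (!isRecord(node) || node["@type"] !== "Product") continue;
      const offersValue = node["offers"];
      const offer: unknown = Array.isArray(offersValue) ? offersValue[0] : offersValue;
      const price = Number.parseFloat(String(isRecord(offer) ? offer["price"] : undefined));
      const name = node["name"];
      return {
        name: typeof name === "string" ? name : undefined,
        price: Number.isFinite(price) ? price : undefined,
      };
    }
  }
  return undefined;
}
```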

Brands using this:

Hugger Mugger (WordPress + BigCommerce plugin)

Limitations:

Slower than API-based scrapers (browser automation)
Higher resource usage (Chromium instance)
Rate limiting important (1s delay between products)
Variant extraction limited (requires dropdown interaction)

Advantages:

Works with any rendered HTML (no API needed)
Can extract from WordPress + BigCommerce hybrid sites
Handles JavaScript-rendered content

Brand Platform Mapping

Shopify Brands (19+)

All brands use /collections/{slug}/products.json endpoints:

Alo Yoga (aloyoga.com/collections/yoga) - Enhanced headers for 403 bypass
Liforme, Jade Yoga, Yolo Ha Yoga, Scoria World
Gaiam, Yogi-Bare, Bala, 42 Birds, Ananday
Oko Living, HeatHyoga, Stakt, Sensu, Wolo Yoga
Keep Store, Yoga Matters, House of Mats, Shakti Warrior
Satori Concept (4 collections via pipe-delimited URLs)

Lululemon (GraphQL API)

Lululemon - Custom Next.js + GraphQL implementation

BigCommerce Brands

Hugger Mugger - WordPress + BigCommerce plugin (Playwright scraper)

Future Brands (Pending Implementation)

Require investigation and custom scrapers:

Sugamats, Byoga, Grip Yoga, EcoYoga

Important Constraints

Politeness

Sequential brand processing (no parallel requests)
Configurable delays between requests
User agent identifies as "YogaMatLab Data Pipeline"

Data Validation

Only model is strictly required
Missing price defaults to $99
Invalid products logged but skipped
All products validated against YogaMat schema
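
A minimal sketch of these rules, assuming validation runs over plain objects before the full schema check: `RawProduct`, `ValidatedProduct`, and `validateProducts` are hypothetical names, and only the model and price rules stated above are modelled.

```typescript
// Subset of raw fields relevant to these rules; other fields would pass through.
interface RawProduct {
  brand: string;
  model?: string;
  price?: number;
}

interface ValidatedProduct extends RawProduct {
  model: string;
  price: number;
}

const FALLBACK_PRICE = 99; // documented default for a missing price

// Keep products with a non-empty model; log and skip the rest.
function validateProducts(
  raw: RawProduct[],
  log: (message: string) => void = console.warn,
): ValidatedProduct[] {
  const valid: ValidatedProduct[] = [];
  raw.forEach((product, index) => {
    const model = product.model?.trim();
    if (!model) {
      log(`Skipping product #${index} from ${product.brand}: missing model`);
      return;
    }
    const price =
      typeof product.price === "number" && Number.isFinite(product.price)
        ? product.price
        : FALLBACK_PRICE;
    valid.push({ ...product, model, price });
  });
  return valid;
}
```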

Change Detection

Tracks three types of changes:

New products - In current day, not in previous
Removed products - In previous day, not in current (for redirect setup)
Updated products - Price changes, spec changes

File Naming Conventions

Raw data: data/raw/{YYYY-MM-DD}/{brand-slug}.json
Normalized: data/normalized/{YYYY-MM-DD}/{brand-slug}.json
Aggregated: data/aggregated/{YYYY-MM-DD}/all-products.json
Changes: data/changes/{YYYY-MM-DD}-changeset.json
Images: {brand-slug}-{model-slug}.jpg
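
These conventions could be centralised in one helper so every script builds paths the same way. The sketch below assumes dates are taken in UTC (the pipeline is scheduled in UTC); `dataPaths` is a hypothetical function.

```typescript
interface DataPaths {
  raw: string;
  normalized: string;
  aggregated: string;
  changes: string;
}

// Build the per-day file paths for one brand, using a UTC YYYY-MM-DD date.
function dataPaths(date: Date, brandSlug: string): DataPaths {
  const day = date.toISOString().slice(0, 10);
  return {
    raw: `data/raw/${day}/${brandSlug}.json`,
    normalized: `data/normalized/${day}/${brandSlug}.json`,
    aggregated: `data/aggregated/${day}/all-products.json`,
    changes: `data/changes/${day}-changeset.json`,
  };
}
```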

Dependencies (Planned)

{
  "dependencies": {
    "playwright": "^1.x",
    "sharp": "^0.x",
    "zod": "^3.x",
    "convex": "^1.x"
  },
  "devDependencies": {
    "typescript": "^5.x",
    "tsx": "^4.x"
  }
}

Implementation Status

Phase 1: Core Extraction ✅ COMPLETE

Convex brand query integration
Products.json fetching (replaced Playwright scraping)
Rate limiting and error handling
Multi-brand orchestration

Phase 2: Data Processing ✅ COMPLETE

Field mapping configuration
Normalization to YogaMat schema
Data aggregation with statistics
Change detection between runs

Phase 3: Automation ✅ COMPLETE

GitHub Actions workflow (daily at 2 AM UTC)
Latest symlinks updater
Automatic commits with changeset summary
Failure notifications via GitHub issues
Workflow runs successfully with partial brand failures

Phase 4: Integration with YogaMatLabApp 🔄 IN PROGRESS

See INTEGRATION_INSTRUCTIONS.md for complete setup guide
Git submodule configuration
Convex bulk upsert mutation (pending in YogaMatLabApp)
Import script (pending in YogaMatLabApp)

Phase 5: Documentation 📝 ONGOING

README.md ✅
CLAUDE.md ✅
DATA_PIPELINE.md ✅
INTEGRATION_INSTRUCTIONS.md ✅
GitHub Actions setup guide ✅

Notes for AI Assistants

When Adding New Brands

Shopify Brands (Easiest)

Add to YogaMatLabApp's Convex brands table:

{
  scrapingEnabled: true,
  platform: 'shopify',  // or omit (defaults to shopify if isShopify: true)
  isShopify: true,
  productsJsonUrl: 'https://brand.com/collections/yoga-mats/products.json',
  rateLimit: { delayBetweenProducts: 500, delayBetweenPages: 1000 }
}

Test: curl "https://brand.com/collections/yoga-mats/products.json?limit=10"

For multiple collections, use pipe-delimited URLs:

productsJsonUrl: 'url1/products.json|url2/products.json|url3/products.json'

Lululemon

Add to Convex brands table:

{
  scrapingEnabled: true,
  platform: 'lululemon',
  platformConfig: {
    lululemonCategoryId: '8s6' // yoga accessories
  },
  rateLimit: { delayBetweenPages: 2000 } // 2s between GraphQL pages
}

No testing needed - scraper auto-configured

BigCommerce Brands

Add to Convex brands table:

{
  scrapingEnabled: true,
  platform: 'bigcommerce',
  platformConfig: {
    bigcommerceCollectionUrl: 'https://www.brand.com/collections/yoga-mats'
  },
  rateLimit: { delayBetweenProducts: 1000 } // 1s between products
}

Test with browser to verify collection page loads
Note: Slower than other scrapers (browser automation)

Unknown Platform

Investigate the brand's e-commerce platform:
- Check HTML for platform indicators (Shopify, BigCommerce, WooCommerce, etc.)
- Look for API endpoints in Network tab
- Check for JSON-LD structured data
If Shopify, follow Shopify instructions above
If custom platform, create new scraper in scripts/lib/{brand}-scraper.ts
Update get-brands-from-convex.ts to route to new scraper
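
Routing a brand to its scraper might be expressed as a lookup on the platform field, with the documented default of Shopify when `isShopify` is true. This is a sketch of the idea only; `resolvePlatform` and `scraperFor` are hypothetical, and the real routing in get-brands-from-convex.ts may be structured differently.

```typescript
type Platform = "shopify" | "lululemon" | "bigcommerce";

interface BrandPlatformFields {
  platform?: Platform;
  isShopify?: boolean;
}

// Scraper files named elsewhere in this document, keyed by platform.
const SCRAPER_MODULES: Record<Platform, string> = {
  shopify: "scripts/lib/fetch-products-json.ts",
  lululemon: "scripts/lib/lululemon-scraper.ts",
  bigcommerce: "scripts/lib/bigcommerce-scraper.ts",
};

// An explicit platform wins; otherwise fall back to shopify when isShopify is true.
function resolvePlatform(brand: BrandPlatformFields): Platform | undefined {
  if (brand.platform) return brand.platform;
  return brand.isShopify ? "shopify" : undefined;
}

// Return the scraper module for a brand, or fail loudly for an unknown platform.
function scraperFor(brand: BrandPlatformFields): string {
  const platform = resolvePlatform(brand);
  if (!platform) {
    throw new Error("No scraper configured: set platform or isShopify on the brand");
  }
  return SCRAPER_MODULES[platform];
}
```

A new custom scraper would then need one more `Platform` member and one more entry in the lookup, and the compiler flags the lookup if the entry is forgotten.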

When Modifying Scrapers

Test on 2-3 brands before running full pipeline
Shopify scraper has 4 price extraction fallbacks - maintain all
Unit conversion functions in scraper are critical - don't break
Always respect rate limits
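
Because the unit conversions are easy to break silently, a table-driven form with a few spot checks can help. The sketch below shows one way to normalize to the standard units (mm, inches, lbs); the supported input units and function names are assumptions, not the scraper's actual helpers.

```typescript
type LengthUnit = "mm" | "cm" | "in";
type WeightUnit = "lb" | "kg" | "g" | "oz";

// Millimeters per one unit of length.
const MM_PER_UNIT: Record<LengthUnit, number> = { mm: 1, cm: 10, in: 25.4 };

// Pounds per one unit of weight.
const LB_PER_UNIT: Record<WeightUnit, number> = {
  lb: 1,
  kg: 2.20462,
  g: 0.00220462,
  oz: 1 / 16,
};

// Thickness is standardized to millimeters.
function toMillimeters(value: number, unit: LengthUnit): number {
  return value * MM_PER_UNIT[unit];
}

// Length and width are standardized to inches.
function toInches(value: number, unit: LengthUnit): number {
  return toMillimeters(value, unit) / MM_PER_UNIT.in;
}

// Weight is standardized to pounds.
function toPounds(value: number, unit: WeightUnit): number {
  return value * LB_PER_UNIT[unit];
}
```

For example, a 183 cm mat comes out at roughly 72 inches and a 2 kg mat at roughly 4.4 lbs.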

When Working with Data Schema

YogaMat type references Convex schema in YogaMatLabApp
Any schema changes require coordination with YogaMatLabApp
Maintain backwards compatibility for existing data files

Required Convex Setup (in YogaMatLabApp)

Before the pipeline can run, YogaMatLabApp must have:

Brands schema with scraping fields (see Brand Configuration section above)

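Convex query: getScrapableBrands, exported from convex/brands.ts so that it resolves to api.brands.getScrapableBrands, returning brands with scrapingEnabled: true

// Example query structure (convex/brands.ts)
import { query } from "./_generated/server";

export const getScrapableBrands = query({
  args: {},
  handler: async (ctx) => {
    // Return only brands that have scraping switched on
    return await ctx.db
      .query("brands")
      .filter((q) => q.eq(q.field("scrapingEnabled"), true))
      .collect();
  },
});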
