
Unstructured Document Processing API

A robust, scalable API for processing documents using unstructured.io.

Features

  • 25+ File Types: PDF, DOCX, PPTX, XLSX, HTML, images (OCR), emails, and more
  • Async Processing: Queue-based processing with Celery for large documents
  • Scalable: Horizontal scaling via multiple worker instances
  • API Key Auth: Secure authentication with rate limiting
  • Webhooks: Get notified when processing completes
  • Batch Processing: Submit multiple documents at once

Tech Stack

  • FastAPI - Modern async Python web framework
  • Celery + Redis - Distributed task queue
  • PostgreSQL - Job metadata and results storage
  • unstructured.io - Document parsing engine
  • Docker - Containerized deployment

Quick Start

Local Development

  1. Clone the repository and set up the environment:
git clone https://github.com/feebleai/unstructured.git
cd unstructured
cp .env.example .env
  2. Start the stack with Docker Compose:
docker-compose -f docker/docker-compose.yml up -d
  3. Run database migrations:
docker-compose -f docker/docker-compose.yml exec api alembic upgrade head
  4. Generate an API key:
docker-compose -f docker/docker-compose.yml exec api python scripts/generate_api_key.py --name "dev-key" --init-db
  5. Test the API:
# Submit a document
curl -X POST "http://localhost:8000/api/v1/documents/process" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "[email protected]"

# Check job status
curl "http://localhost:8000/api/v1/jobs/{job_id}" \
  -H "X-API-Key: YOUR_API_KEY"

# Get results
curl "http://localhost:8000/api/v1/jobs/{job_id}/result" \
  -H "X-API-Key: YOUR_API_KEY"

API Endpoints

Method   Endpoint                           Description
POST     /api/v1/documents/process          Submit a document for processing
POST     /api/v1/documents/process/batch    Submit multiple documents
GET      /api/v1/jobs/{job_id}              Get job status
GET      /api/v1/jobs/{job_id}/result       Get processing results
GET      /api/v1/jobs                       List all jobs
DELETE   /api/v1/jobs/{job_id}              Cancel or delete a job
GET      /health                            Health check
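
As a sketch of the batch endpoint, the calls below submit two files in one request and then list jobs. The files form-field name is an assumption (the batch request schema is not documented in this README), so verify it against the API docs or code.

# Submit two documents in a single batch request (form-field name "files" is assumed)
curl -X POST "http://localhost:8000/api/v1/documents/process/batch" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "files=@report.pdf" \
  -F "files=@slides.pptx"

# List all jobs submitted with this API key
curl "http://localhost:8000/api/v1/jobs" \
  -H "X-API-Key: YOUR_API_KEY"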

Processing Options

Strategies

  • auto - Automatically detect the best strategy (default)
  • fast - Faster processing; may miss some content
  • hi_res - Higher quality; better for complex documents
  • ocr_only - Force OCR for all content

Chunking

  • null - No chunking (default)
  • basic - Simple character-based chunking
  • by_title - Chunk by document structure/titles
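
The sketch below shows how these options might be passed as form fields alongside the upload. The field names strategy and chunking_strategy mirror unstructured.io's conventions but are assumptions as far as this API's request schema goes.

# Request hi_res parsing with title-based chunking (field names assumed)
curl -X POST "http://localhost:8000/api/v1/documents/process" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "strategy=hi_res" \
  -F "chunking_strategy=by_title"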

Railway Deployment

  1. Create a new project on Railway

  2. Add services:

    • PostgreSQL (from Railway plugins)
    • Redis (from Railway plugins)
    • API service (from this repo)
    • Worker service (from this repo, with a custom start command; see the sketch after this list)
  3. Set environment variables:

    • DATABASE_URL - Auto-set by PostgreSQL plugin
    • REDIS_URL - Auto-set by Redis plugin
    • ADMIN_API_KEY - Your admin key
  4. Deploy:

    • API uses docker/Dockerfile.api
    • Worker uses docker/Dockerfile.worker
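
If Railway asks for the worker's custom start command, it will be a Celery worker invocation along the lines of the sketch below. The module path app.worker.celery_app is hypothetical; use whatever docker/Dockerfile.worker actually runs.

# Hypothetical worker start command; the Celery app path is an assumption
celery -A app.worker.celery_app worker --loglevel=info --concurrency=2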

Environment Variables

Variable                Description                             Default
DATABASE_URL            PostgreSQL connection URL               Required
REDIS_URL               Redis connection URL                    Required
API_KEY_HEADER          Header name for the API key             X-API-Key
RATE_LIMIT_PER_MINUTE   API rate limit (requests per minute)    60
MAX_FILE_SIZE_MB        Maximum upload size (MB)                100
TEMP_STORAGE_PATH       Temporary file directory                /tmp/uploads
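
A minimal .env for local development might look like the following. The connection strings and key value are placeholders; only the variables listed above (plus ADMIN_API_KEY from the deployment section) are assumed to exist.

# Example .env (placeholder values)
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/unstructured
REDIS_URL=redis://localhost:6379/0
ADMIN_API_KEY=change-me
API_KEY_HEADER=X-API-Key
RATE_LIMIT_PER_MINUTE=60
MAX_FILE_SIZE_MB=100
TEMP_STORAGE_PATH=/tmp/uploads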

Supported File Types

  • Documents: PDF, DOC, DOCX, ODT, RTF
  • Spreadsheets: XLS, XLSX, ODS, CSV, TSV
  • Presentations: PPT, PPTX, ODP
  • Web: HTML, XML, JSON, Markdown
  • Images: PNG, JPEG, TIFF, BMP (with OCR)
  • Email: EML, MSG
  • Ebooks: EPUB

License

MIT
