A robust, scalable API for processing documents using unstructured.io.
## Features

- 25+ File Types: PDF, DOCX, PPTX, XLSX, HTML, images (OCR), emails, and more
- Async Processing: Queue-based processing with Celery for large documents
- Scalable: Horizontal scaling via multiple worker instances
- API Key Auth: Secure authentication with rate limiting
- Webhooks: Get notified when processing completes (receiver sketch after this list)
- Batch Processing: Submit multiple documents at once
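
The webhook payload isn't documented in this README, so the receiver below is only a sketch: it assumes the service POSTs JSON containing `job_id` and `status` fields, and both field names are assumptions.

```python
# Minimal webhook receiver sketch. The payload fields ("job_id",
# "status") are assumptions -- verify against the actual callback body.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhooks/document-processed")
async def on_job_complete(request: Request):
    payload = await request.json()
    job_id = payload.get("job_id")    # assumed field name
    status = payload.get("status")    # assumed field name
    print(f"job {job_id} finished with status {status}")
    return {"ok": True}
```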
## Tech Stack

- FastAPI - Modern async Python web framework
- Celery + Redis - Distributed task queue (hand-off sketched after this list)
- PostgreSQL - Job metadata and results storage
- unstructured.io - Document parsing engine
- Docker - Containerized deployment
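
To make the FastAPI + Celery split concrete, here is a simplified sketch of the hand-off from the web layer to a worker. Every name in it (`celery_app`, `process_document`, the Redis URLs) is illustrative rather than taken from this repo's source.

```python
# Illustrative sketch of the API-to-worker hand-off; names are invented.
from celery import Celery

celery_app = Celery(
    "docs",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)

@celery_app.task(name="process_document")
def process_document(job_id: str, file_path: str, strategy: str = "auto"):
    # A worker would run unstructured's partitioning here and persist
    # the resulting elements to PostgreSQL, keyed by job_id.
    ...

# Inside a FastAPI route, the API only enqueues and returns a job_id:
#   process_document.delay(job_id, saved_path, strategy)
```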
## Quick Start

- Clone and set up:

  ```bash
  cd unstructured
  cp .env.example .env
  ```

- Start with Docker Compose:

  ```bash
  docker-compose -f docker/docker-compose.yml up -d
  ```

- Run migrations:

  ```bash
  docker-compose exec api alembic upgrade head
  ```

- Generate an API key:

  ```bash
  docker-compose exec api python scripts/generate_api_key.py --name "dev-key" --init-db
  ```

- Test the API:
  ```bash
  # Submit a document
  curl -X POST "http://localhost:8000/api/v1/documents/process" \
    -H "X-API-Key: YOUR_API_KEY" \
    -F "file=@document.pdf"

  # Check job status
  curl "http://localhost:8000/api/v1/jobs/{job_id}" \
    -H "X-API-Key: YOUR_API_KEY"

  # Get results
  curl "http://localhost:8000/api/v1/jobs/{job_id}/result" \
    -H "X-API-Key: YOUR_API_KEY"
  ```

## API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/v1/documents/process` | Submit document for processing |
| POST | `/api/v1/documents/process/batch` | Submit multiple documents |
| GET | `/api/v1/jobs/{job_id}` | Get job status |
| GET | `/api/v1/jobs/{job_id}/result` | Get processing results |
| GET | `/api/v1/jobs` | List all jobs |
| DELETE | `/api/v1/jobs/{job_id}` | Cancel/delete job |
| GET | `/health` | Health check |
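
The same flow works from any HTTP client, not just curl. Below is a minimal Python polling client for the endpoints above; the response field names (`job_id`, `status`) and the status values are assumptions, so check the actual responses.

```python
# Minimal client sketch for the endpoints above. Response field names
# and status values are assumptions -- inspect the real responses.
import time
import requests

BASE = "http://localhost:8000/api/v1"
HEADERS = {"X-API-Key": "YOUR_API_KEY"}

# Submit a document for processing.
with open("document.pdf", "rb") as f:
    resp = requests.post(f"{BASE}/documents/process", headers=HEADERS,
                         files={"file": f})
resp.raise_for_status()
job_id = resp.json()["job_id"]  # assumed field name

# Poll job status until it settles.
while True:
    job = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS).json()
    if job.get("status") not in ("pending", "processing"):  # assumed values
        break
    time.sleep(2)

# Fetch the parsed elements.
result = requests.get(f"{BASE}/jobs/{job_id}/result", headers=HEADERS).json()
print(result)
```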
## Processing Strategies

- `auto` - Automatically detect best strategy (default)
- `fast` - Faster processing, may miss some content
- `hi_res` - High quality, better for complex documents
- `ocr_only` - Force OCR for all content
## Chunking Strategies

- `null` - No chunking (default)
- `basic` - Simple character-based chunking
- `by_title` - Chunk by document structure/titles
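
Both knobs are chosen per request. Here is a sketch of passing them at submission time, assuming they travel as multipart form fields named `strategy` and `chunking_strategy` (the field names are assumptions):

```python
# Sketch of selecting a processing strategy and chunking mode.
# The form field names "strategy" and "chunking_strategy" are assumptions.
import requests

with open("report.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/v1/documents/process",
        headers={"X-API-Key": "YOUR_API_KEY"},
        files={"file": f},
        data={"strategy": "hi_res", "chunking_strategy": "by_title"},
    )
print(resp.json())
```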
## Deployment (Railway)

1. Create a new project on Railway
2. Add services:
   - PostgreSQL (from Railway plugins)
   - Redis (from Railway plugins)
   - API service (from this repo)
   - Worker service (from this repo, with a custom start command)
3. Set environment variables:
   - `DATABASE_URL` - Auto-set by the PostgreSQL plugin
   - `REDIS_URL` - Auto-set by the Redis plugin
   - `ADMIN_API_KEY` - Your admin key
4. Deploy:
   - API uses `docker/Dockerfile.api`
   - Worker uses `docker/Dockerfile.worker`
## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DATABASE_URL` | PostgreSQL connection URL | Required |
| `REDIS_URL` | Redis connection URL | Required |
| `API_KEY_HEADER` | Header name for API key | `X-API-Key` |
| `RATE_LIMIT_PER_MINUTE` | API rate limit (requests per minute) | `60` |
| `MAX_FILE_SIZE_MB` | Max upload size (MB) | `100` |
| `TEMP_STORAGE_PATH` | Temp file directory | `/tmp/uploads` |
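
One plausible way a FastAPI service loads these variables is with pydantic settings; this is a common pattern, not necessarily this repo's implementation:

```python
# Sketch of loading the variables above with pydantic-settings.
# A common pattern, not necessarily how this repo does it.
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    database_url: str                        # DATABASE_URL (required)
    redis_url: str                           # REDIS_URL (required)
    api_key_header: str = "X-API-Key"        # API_KEY_HEADER
    rate_limit_per_minute: int = 60          # RATE_LIMIT_PER_MINUTE
    max_file_size_mb: int = 100              # MAX_FILE_SIZE_MB
    temp_storage_path: str = "/tmp/uploads"  # TEMP_STORAGE_PATH

settings = Settings()  # field names map to the env vars, case-insensitively
```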
## Supported File Types

- Documents: PDF, DOC, DOCX, ODT, RTF
- Spreadsheets: XLS, XLSX, ODS, CSV, TSV
- Presentations: PPT, PPTX, ODP
- Web: HTML, XML, JSON, Markdown
- Images: PNG, JPEG, TIFF, BMP (with OCR)
- Email: EML, MSG
- Ebooks: EPUB
## License

MIT