
Unstructured Document Processing API

A robust, scalable API for processing documents using unstructured.io.

Features

  • 25+ File Types: PDF, DOCX, PPTX, XLSX, HTML, images (OCR), emails, and more
  • Async Processing: Queue-based processing with Celery for large documents
  • Scalable: Horizontal scaling via multiple worker instances
  • API Key Auth: Secure authentication with rate limiting
  • Webhooks: Get notified when processing completes
  • Batch Processing: Submit multiple documents at once

Tech Stack

  • FastAPI - Modern async Python web framework
  • Celery + Redis - Distributed task queue
  • PostgreSQL - Job metadata and results storage
  • unstructured.io - Document parsing engine
  • Docker - Containerized deployment

Quick Start

Local Development

  1. Clone the repository and set up the environment:
git clone https://github.com/feebleai/unstructured.git
cd unstructured
cp .env.example .env
  2. Start the stack with Docker Compose:
docker-compose -f docker/docker-compose.yml up -d
  3. Run database migrations:
docker-compose -f docker/docker-compose.yml exec api alembic upgrade head
  4. Generate an API key:
docker-compose -f docker/docker-compose.yml exec api python scripts/generate_api_key.py --name "dev-key" --init-db
  5. Test the API:
# Submit a document
curl -X POST "http://localhost:8000/api/v1/documents/process" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "[email protected]"

# Check job status
curl "http://localhost:8000/api/v1/jobs/{job_id}" \
  -H "X-API-Key: YOUR_API_KEY"

# Get results
curl "http://localhost:8000/api/v1/jobs/{job_id}/result" \
  -H "X-API-Key: YOUR_API_KEY"

API Endpoints

Method   Endpoint                           Description
POST     /api/v1/documents/process          Submit a document for processing
POST     /api/v1/documents/process/batch    Submit multiple documents
GET      /api/v1/jobs/{job_id}              Get job status
GET      /api/v1/jobs/{job_id}/result       Get processing results
GET      /api/v1/jobs                       List all jobs
DELETE   /api/v1/jobs/{job_id}              Cancel or delete a job
GET      /health                            Health check
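
As a sketch of the batch endpoint, the calls below submit two files in one request and then list jobs. The files form-field name is an assumption (the batch request schema is not documented in this README), so verify it against the API docs or code.

# Submit two documents in a single batch request (form-field name "files" is assumed)
curl -X POST "http://localhost:8000/api/v1/documents/process/batch" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "files=@report.pdf" \
  -F "files=@slides.pptx"

# List all jobs submitted with this API key
curl "http://localhost:8000/api/v1/jobs" \
  -H "X-API-Key: YOUR_API_KEY"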

Processing Options

Strategies

  • auto - Automatically detect the best strategy (default)
  • fast - Faster processing; may miss some content
  • hi_res - Higher quality; better for complex documents
  • ocr_only - Force OCR for all content

Chunking

  • null - No chunking (default)
  • basic - Simple character-based chunking
  • by_title - Chunk by document structure/titles
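
The sketch below shows how these options might be passed as form fields alongside the upload. The field names strategy and chunking_strategy mirror unstructured.io's conventions but are assumptions as far as this API's request schema goes.

# Request hi_res parsing with title-based chunking (field names assumed)
curl -X POST "http://localhost:8000/api/v1/documents/process" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "strategy=hi_res" \
  -F "chunking_strategy=by_title"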

Railway Deployment

  1. Create a new project on Railway

  2. Add services:

    • PostgreSQL (from Railway plugins)
    • Redis (from Railway plugins)
    • API service (from this repo)
    • Worker service (from this repo, with a custom start command; see the sketch after this list)
  3. Set environment variables:

    • DATABASE_URL - Auto-set by PostgreSQL plugin
    • REDIS_URL - Auto-set by Redis plugin
    • ADMIN_API_KEY - Your admin key
  4. Deploy:

    • API uses docker/Dockerfile.api
    • Worker uses docker/Dockerfile.worker
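
If Railway asks for the worker's custom start command, it will be a Celery worker invocation along the lines of the sketch below. The module path app.worker.celery_app is hypothetical; use whatever docker/Dockerfile.worker actually runs.

# Hypothetical worker start command; the Celery app path is an assumption
celery -A app.worker.celery_app worker --loglevel=info --concurrency=2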

Environment Variables

Variable                Description                             Default
DATABASE_URL            PostgreSQL connection URL               Required
REDIS_URL               Redis connection URL                    Required
API_KEY_HEADER          Header name for the API key             X-API-Key
RATE_LIMIT_PER_MINUTE   API rate limit (requests per minute)    60
MAX_FILE_SIZE_MB        Maximum upload size (MB)                100
TEMP_STORAGE_PATH       Temporary file directory                /tmp/uploads
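
A minimal .env for local development might look like the following. The connection strings and key value are placeholders; only the variables listed above (plus ADMIN_API_KEY from the deployment section) are assumed to exist.

# Example .env (placeholder values)
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/unstructured
REDIS_URL=redis://localhost:6379/0
ADMIN_API_KEY=change-me
API_KEY_HEADER=X-API-Key
RATE_LIMIT_PER_MINUTE=60
MAX_FILE_SIZE_MB=100
TEMP_STORAGE_PATH=/tmp/uploads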

Supported File Types

  • Documents: PDF, DOC, DOCX, ODT, RTF
  • Spreadsheets: XLS, XLSX, ODS, CSV, TSV
  • Presentations: PPT, PPTX, ODP
  • Web: HTML, XML, JSON, Markdown
  • Images: PNG, JPEG, TIFF, BMP (with OCR)
  • Email: EML, MSG
  • Ebooks: EPUB

License

MIT
