Azure OCR Pre-Consume Script for Paperless-ngx

This script enables Azure Document Intelligence as the OCR engine for Paperless-ngx. It generates a searchable PDF by overlaying the recognized text onto the original document. Azure Document Intelligence offers superior recognition quality, even for handwritten notes, receipts, or poor-quality scans. Pricing is affordable, at approximately $1.40 per 1,000 pages, regardless of how much content each page contains. It is also significantly faster than the built-in Tesseract OCR: even very long documents are processed in seconds.

Downstream from Paperless-ngx, I'm using the excellent Paperless-AI. Thanks to its flexible query mechanism, it delivers great results for tagging and metadata extraction, especially now that the OCR content is nearly perfect. However, Paperless-AI currently does not limit the amount of content passed into the prompt. Since my Azure OpenAI GPT-4o-mini model is limited to 8k tokens per prompt, very large documents may not be processed at all.

To address this, I've added an optional content cutoff that limits the amount of recognized text. In practice, most everyday documents are short enough, and even longer ones usually have the important content within the first few pages. Setting a cutoff of 15,000 to 20,000 characters helps reduce prompt size without sacrificing relevant context. (Note: as of May 14th, 2025, a change is on the way for Paperless-AI that introduces a configurable token limit. Once it has been published, the content cutoff will no longer be needed, unless you also want to reduce the load on the Paperless search engine.)

Features

✅ Uses Azure Document Intelligence for high-quality OCR, including handwriting (for PDF document input only!)
✅ Adds invisible text overlay using PyMuPDF
✅ Optional character cutoff for OCR content, which restricts the content size and therefore the number of tokens required for further AI processing (default: off). For caveats, see the notes below.
✅ Detailed logging in paperless.log
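
As an illustration of the logging feature, here is a minimal sketch of a logger that appends `[azure.ocr]`-prefixed lines to paperless.log. The helper name `build_logger` and the exact line format (timestamp and level) are assumptions made for this example, not taken from the script itself; only the prefix and the file name come from this README.

```python
import logging
from pathlib import Path


def build_logger(log_dir: str) -> logging.Logger:
    """Return a logger that appends '[azure.ocr]'-prefixed lines to paperless.log.

    `log_dir` is the directory that holds paperless.log, for example the
    value of the PAPERLESS_LOGGING_DIR environment variable.
    """
    logger = logging.getLogger("azure.ocr")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate lines if called more than once
    logger.propagate = False  # keep messages out of the root logger

    handler = logging.FileHandler(Path(log_dir) / "paperless.log", encoding="utf-8")
    # Assumed line format; only the [azure.ocr] prefix is documented.
    handler.setFormatter(
        logging.Formatter("[%(asctime)s] [%(levelname)s] [azure.ocr] %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With a prefix like this, the relevant entries can be filtered with something like `grep "azure.ocr" paperless.log`.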

Configuration

The script uses the following configuration variables:

| Variable | Description | Required |
| --- | --- | --- |
| `AZURE_FORM_RECOGNIZER_ENDPOINT` | Endpoint of your Azure resource | Yes |
| `AZURE_FORM_RECOGNIZER_KEY` | Azure API key | Yes |
| `OCR_CONTENT_CUTOFF` | Maximum character count (default: 0, meaning no cutoff) | No |
| `OCR_CONTINUE_ON_ERROR` | Whether Paperless should continue processing if OCR fails (default: False). Setup errors such as a missing API key always abort processing. | No |
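
The following sketch shows one way these variables could be read and validated. It is not the script's actual code: the `OcrConfig` class, the `load_config` helper, and the accepted spellings for a true boolean value are assumptions made for illustration. Only the variable names and defaults come from the table above.

```python
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("AZURE_FORM_RECOGNIZER_ENDPOINT", "AZURE_FORM_RECOGNIZER_KEY")


@dataclass(frozen=True)
class OcrConfig:
    endpoint: str
    key: str
    content_cutoff: int       # 0 means "no cutoff"
    continue_on_error: bool


def load_config(env: Mapping[str, str]) -> OcrConfig:
    """Build the configuration from a mapping such as os.environ."""
    # Missing required settings are setup errors and always raise.
    missing = [name for name in REQUIRED if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing required setting(s): " + ", ".join(missing))

    cutoff = int(env.get("OCR_CONTENT_CUTOFF", "").strip() or "0")
    if cutoff < 0:
        raise ValueError("OCR_CONTENT_CUTOFF must be 0 or a positive integer")

    # Assumed truthy spellings; the script may accept a different set.
    flag = env.get("OCR_CONTINUE_ON_ERROR", "false").strip().lower()
    return OcrConfig(
        endpoint=env["AZURE_FORM_RECOGNIZER_ENDPOINT"].strip(),
        key=env["AZURE_FORM_RECOGNIZER_KEY"].strip(),
        content_cutoff=cutoff,
        continue_on_error=flag in {"1", "true", "yes", "on"},
    )
```

Typical usage would be `config = load_config(os.environ)` at the start of the script, so that configuration problems surface before any document is sent to Azure.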

These paperless-ngx configuration variables need to be set:

| Variable | Description | Required |
| --- | --- | --- |
| `PAPERLESS_OCR_MODE` | Must be set to "skip" so that no additional OCR takes place afterwards | Yes |
| `PAPERLESS_PRE_CONSUME_SCRIPT` | Full path to the pre-consume script | Yes |
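
For orientation, here is a hedged sketch of how a pre-consume script's entry point could be structured. Paperless-ngx passes the path of the working copy in the `DOCUMENT_WORKING_PATH` environment variable. As far as I know, a non-zero exit code makes it abort the consumption. The function names `is_pdf` and `run`, and the `ocr` callback standing in for the Azure call, are hypothetical and not taken from the actual script.

```python
import sys
from pathlib import Path
from typing import Callable, Mapping


def is_pdf(path: Path) -> bool:
    """Check the file signature instead of trusting the extension."""
    with path.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


def run(env: Mapping[str, str], ocr: Callable[[Path], None], continue_on_error: bool) -> int:
    """Return the exit code for the pre-consume step."""
    working = env.get("DOCUMENT_WORKING_PATH")
    if not working:
        return 1  # setup error: always fail

    path = Path(working)
    if not is_pdf(path):
        return 0  # non-PDF input is handed back untouched

    try:
        ocr(path)  # hypothetical callback that rewrites the file in place
    except Exception as exc:
        print(f"[azure.ocr] OCR failed: {exc}", file=sys.stderr)
        # Mirrors OCR_CONTINUE_ON_ERROR: 0 lets Paperless carry on.
        return 0 if continue_on_error else 1
    return 0
```

A script built this way would end with something like `sys.exit(run(os.environ, my_ocr_function, continue_on_error))`, where `my_ocr_function` is whatever performs the actual OCR.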

Installation (Bare Metal)

Save requirements.txt to your pre-consume script folder and install the Python dependencies into the Paperless virtual environment:

/opt/paperless/.venv/bin/python3 -m pip install -r requirements.txt

Note: if pip is not yet available in the virtual environment, run this first:

/opt/paperless/.venv/bin/python3 -m ensurepip --upgrade

Save the script as pre_consume_azure_ocr.py and make it executable:

chmod +x /opt/azure_ocr_preconsume/pre_consume_azure_ocr.py

Set the configuration variables (for example in paperless.conf):

AZURE_FORM_RECOGNIZER_ENDPOINT=https://<your-endpoint>.cognitiveservices.azure.com/
AZURE_FORM_RECOGNIZER_KEY=<your-key>
OCR_CONTENT_CUTOFF=15000
PAPERLESS_OCR_MODE=skip
PAPERLESS_PRE_CONSUME_SCRIPT=<full path of pre-consume script>

Docker Setup

Note: I haven't tested the Docker setup myself, as I am running Paperless-ngx on bare metal. The steps below were copied from elsewhere, so any feedback on whether they work is appreciated and will help me update them.

If running inside the Paperless Docker container:

Mount the script into the container:

volumes:
  - ./azure_ocr_preconsume:/scripts/azure_ocr_preconsume

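Define the environment variables in docker-compose.override.yml (as shown here) or, in KEY=value form, in .env. PAPERLESS_PRE_CONSUME_SCRIPT must be the script's path inside the container, which with the mount above is under /scripts/azure_ocr_preconsume: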

environment:
  AZURE_FORM_RECOGNIZER_ENDPOINT: https://<your-endpoint>.cognitiveservices.azure.com/
  AZURE_FORM_RECOGNIZER_KEY: <your-key>
  OCR_CONTENT_CUTOFF: 15000
  PAPERLESS_OCR_MODE: skip
  PAPERLESS_PRE_CONSUME_SCRIPT: <full path of pre-consume script>

Notes

Only PDF documents are supported for now, because Paperless-ngx does not allow the file type to change during the pre-consume step. Other formats are handed back untouched so that Paperless-ngx can process them itself.
When you set a content cutoff, only as many pages as fit within the total character limit are made searchable by Paperless-ngx. As a rule of thumb, a fully filled text page contains roughly 2,000 characters, which corresponds to around 500 tokens per page when the content is processed by AI tools like Paperless-AI. For example, if your model has an 8k token limit per prompt (or if you simply want to reduce processing cost), a cutoff of 15,000 to 20,000 characters, equivalent to about 8 to 10 pages or roughly 3,750 to 5,000 tokens, is a safe and practical choice. The exact number of pages varies with document type and density. This also leaves enough room in the prompt for the actual instructions to the AI, such as which tags or metadata to extract.
Log entries are prefixed with [azure.ocr] for easy filtering.
If the built-in fallback mechanisms for locating the log file don't work for your setup, you can set the PAPERLESS_LOGGING_DIR environment variable explicitly.
As of May 14th, 2025, a change is on the way for Paperless-AI that introduces a configurable token limit. Once it has been published, the OCR content cutoff will no longer be needed.
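
The page-based cutoff and the token rule of thumb from these notes can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation. The function names are hypothetical, the ratio of 4 characters per token is derived from the "2,000 characters is roughly 500 tokens" estimate, and the choice to drop a page that would exceed the limit (rather than truncating it) is an assumption.

```python
from typing import List

# Rule of thumb from the notes: ~2,000 characters per page ≈ ~500 tokens.
CHARS_PER_TOKEN = 4


def pages_within_cutoff(pages: List[str], cutoff: int) -> List[str]:
    """Keep the leading pages whose combined length fits the cutoff.

    A cutoff of 0 (the default) disables the limit and keeps every page.
    """
    if cutoff <= 0:
        return list(pages)
    kept: List[str] = []
    used = 0
    for text in pages:
        if used + len(text) > cutoff:
            break  # this page would exceed the limit, so stop here
        kept.append(text)
        used += len(text)
    return kept


def estimated_tokens(chars: int) -> int:
    """Rough token estimate for a given number of characters (rounded up)."""
    return -(-chars // CHARS_PER_TOKEN)  # ceiling division


if __name__ == "__main__":
    pages = ["x" * 2000] * 12                     # twelve fully filled pages
    kept = pages_within_cutoff(pages, 15000)
    print(len(kept), estimated_tokens(sum(len(p) for p in kept)))
```

With twelve pages of 2,000 characters each and a cutoff of 15,000, seven pages (14,000 characters, about 3,500 estimated tokens) are kept, which leaves ample room in an 8k-token prompt for the instructions.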
