habitoti/Azure-OCR-Pre-Consume-Script

Azure OCR Pre-Consume Script for Paperless-ngx

This script enables Azure Document Intelligence as the OCR engine for Paperless-ngx. It generates a searchable PDF by overlaying the recognized text onto the original document. Azure Document Intelligence offers superior recognition quality, even for handwritten notes, receipts, or poor-quality scans. Pricing is approximately $1.40 per 1,000 pages, regardless of how much content each page contains. It is also significantly faster than the built-in Tesseract OCR: even very long documents are processed in seconds.

Downstream from Paperless-ngx, I’m using the excellent Paperless-AI. Thanks to its flexible query mechanism, it delivers great results for tagging and metadata extraction—especially now that the OCR content is nearly perfect. However, as of now, Paperless-AI does not limit the amount of content passed into the prompt. Since my Azure OpenAI GPT-4o-mini model is limited to 8k tokens per prompt, very large documents may not be processed at all.

To address this, I’ve added an optional content cutoff that limits the amount of recognized text. In practice, most everyday documents are short enough, and even longer ones usually have the important content within the first few pages. Setting a cutoff of 15,000 to 20,000 characters helps reduce prompt size without sacrificing relevant context. (Note: as of May 14th, 2025, a change is on the way for Paperless-AI to introduce a configurable token limit. Once it has been published, the content cutoff will no longer be needed, unless you also want to reduce the load on the Paperless search engine.)
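The cutoff logic can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: it assumes the cutoff is applied page by page (whole pages are kept until the character budget is used up), and the function name `pages_within_cutoff` is hypothetical.

```python
def pages_within_cutoff(page_texts, cutoff):
    """Return how many leading pages fit within `cutoff` characters.

    Assumption: a cutoff of 0 (or less) means "no limit", mirroring the
    documented default of OCR_CONTENT_CUTOFF.
    """
    if cutoff <= 0:
        return len(page_texts)

    total = 0
    kept = 0
    for text in page_texts:
        # Stop before the first page that would exceed the budget.
        if total + len(text) > cutoff:
            break
        total += len(text)
        kept += 1
    return kept


# Ten dense pages of 2,000 characters each.
pages = ["x" * 2000] * 10
kept = pages_within_cutoff(pages, 15000)
```

With ten pages of 2,000 characters and a cutoff of 15,000, this keeps the first seven pages, because the eighth would push the total to 16,000.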

Features

  • ✅ Uses Azure Document Intelligence for high-quality OCR, including handwriting (for PDF document input only!)
  • ✅ Adds invisible text overlay using PyMuPDF
  • ✅ Optional character cutoff for OCR content, which restricts the content size and therefore the number of tokens required for further AI processing (default: off). For caveats, see the notes below.
  • ✅ Detailed logging in paperless.log
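One detail behind the text overlay is mapping OCR coordinates onto the PDF page. The sketch below is an illustration under stated assumptions, not the script's implementation: it assumes the OCR result reports line polygons for PDF input in inches with a top-left origin, and that the overlay is placed in PDF points (72 per inch), which is the unit PyMuPDF works in. The helper name `polygon_to_rect` is hypothetical.

```python
POINTS_PER_INCH = 72.0


def polygon_to_rect(polygon, unit="inch"):
    """Convert an OCR polygon [(x, y), ...] into a bounding box in PDF points.

    Returns (x0, y0, x1, y1). Assumption: PDF input is measured in inches;
    other units (for example pixels for images) are rejected here.
    """
    if unit != "inch":
        raise ValueError(f"unsupported unit: {unit!r}")

    xs = [x * POINTS_PER_INCH for x, _ in polygon]
    ys = [y * POINTS_PER_INCH for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


# A text line 2 inches wide and 0.25 inches tall, 1 inch from the top-left corner.
line_polygon = [(1.0, 1.0), (3.0, 1.0), (3.0, 1.25), (1.0, 1.25)]
rect = polygon_to_rect(line_polygon)
```

The resulting box could then be used to position the recognized text on the page, for instance with PyMuPDF's `Page.insert_text` and `render_mode=3`, which draws text invisibly. Whether the script uses exactly that call is an assumption.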

Configuration

The script uses the following configuration variables:

| Variable | Description | Required |
|---|---|---|
| AZURE_FORM_RECOGNIZER_ENDPOINT | Endpoint of your Azure resource | Yes |
| AZURE_FORM_RECOGNIZER_KEY | Azure API Key | Yes |
| OCR_CONTENT_CUTOFF | Max character count (default: 0/no cutoff) | No |
| OCR_CONTINUE_ON_ERROR | Should Paperless continue processing if OCR fails? (default: False/no) Note: it will always break for setup errors like a missing API key. | No |
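How these variables might be read and validated can be sketched like this. It is a hedged illustration only: the variable names come from the table above, but `load_config`, `ConfigError`, and the accepted spellings for the boolean flag are assumptions, not the script's real interface.

```python
import os


class ConfigError(Exception):
    """Raised for setup problems that should always stop processing."""


def load_config(env=None):
    """Read the documented variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    endpoint = env.get("AZURE_FORM_RECOGNIZER_ENDPOINT", "").strip()
    key = env.get("AZURE_FORM_RECOGNIZER_KEY", "").strip()
    if not endpoint or not key:
        # Missing credentials are a setup error, independent of OCR_CONTINUE_ON_ERROR.
        raise ConfigError(
            "AZURE_FORM_RECOGNIZER_ENDPOINT and AZURE_FORM_RECOGNIZER_KEY are required"
        )

    # Assumption: an empty or missing cutoff behaves like 0 (no cutoff).
    cutoff = int(env.get("OCR_CONTENT_CUTOFF", "").strip() or 0)

    # Assumption: these spellings count as "true"; the script may accept others.
    flag = env.get("OCR_CONTINUE_ON_ERROR", "").strip().lower()
    continue_on_error = flag in {"1", "true", "yes"}

    return {
        "endpoint": endpoint,
        "key": key,
        "cutoff": cutoff,
        "continue_on_error": continue_on_error,
    }
```

Passing the environment in as a mapping keeps the function easy to check without touching the real process environment.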

These paperless-ngx configuration variables need to be set:

| Variable | Description | Required |
|---|---|---|
| PAPERLESS_OCR_MODE | Needs to be set to "skip" so that no additional OCR takes place afterwards | Yes |
| PAPERLESS_PRE_CONSUME_SCRIPT | Full path to the pre-consume script | Yes |

Installation (Bare Metal)

  1. Save requirements.txt to your pre-consume script folder and install the Python dependencies:
/opt/paperless/.venv/bin/python3 -m pip install -r requirements.txt

Note: if pip is not yet available, run this first:

/opt/paperless/.venv/bin/python3 -m ensurepip --upgrade
  2. Save the script as pre_consume_azure_ocr.py and make it executable:
chmod +x /opt/azure_ocr_preconsume/pre_consume_azure_ocr.py
  3. Set the configuration:
AZURE_FORM_RECOGNIZER_ENDPOINT=https://<your-endpoint>.cognitiveservices.azure.com/
AZURE_FORM_RECOGNIZER_KEY=<your-key>
OCR_CONTENT_CUTOFF=15000
PAPERLESS_OCR_MODE=skip
PAPERLESS_PRE_CONSUME_SCRIPT=<full path of pre-consume script>

Docker Setup

Note: I haven't tested the Docker setup, as I am running it bare metal. Feedback on whether this setup (copied from elsewhere) works is appreciated, so that I can update it accordingly.

If running inside the Paperless Docker container:

  1. Mount the script into the container:
volumes:
  - ./azure_ocr_preconsume:/scripts/azure_ocr_preconsume
  2. Define environment variables in docker-compose.override.yml or .env:
environment:
  AZURE_FORM_RECOGNIZER_ENDPOINT: https://<your-endpoint>.cognitiveservices.azure.com/
  AZURE_FORM_RECOGNIZER_KEY: <your-key>
  OCR_CONTENT_CUTOFF: 15000
  PAPERLESS_OCR_MODE: skip
  PAPERLESS_PRE_CONSUME_SCRIPT: <full path of pre-consume script>

Notes

  • For now, only PDF documents are supported (Paperless-ngx doesn't allow changing the file type during the pre-consume step). Other formats are handed back untouched so that Paperless can run its own OCR on them.
  • When you set a content cutoff, only as many pages as fit within the total character limit will be made searchable by Paperless-ngx. As a rule of thumb, a fully filled text page contains roughly 2,000 characters. To process that content with AI tools like Paperless-AI, you’ll typically need around 500 tokens per page. For example, if your model has an 8k token limit per prompt (or if you simply want to reduce processing cost), a cutoff of 15,000 to 20,000 characters—equivalent to about 8–10 pages—is a safe and practical choice. The exact number of pages may vary depending on the document type and density. This approach also ensures there’s enough room left in the prompt for the actual instructions to the AI, such as what kind of tags or metadata should be extracted.
  • Logging entries are prefixed with [azure.ocr] for easy filtering.
  • If the implemented fallback mechanisms to determine the path of the logfile don't work for your setup, you can explicitly set the PAPERLESS_LOGGING_DIR environment variable.
  • As of May 14th, 2025, a change is on the way for Paperless-AI to introduce a configurable token limit. Once it has been published, the OCR content cutoff will no longer be needed.
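The rule of thumb from the cutoff note above can be turned into a quick estimate. This sketch uses only the figures given there (about 2,000 characters and about 500 tokens per full page); the function names are hypothetical, and treating "8k" as exactly 8,000 tokens is an assumption.

```python
CHARS_PER_PAGE = 2000   # rule of thumb for a fully filled text page
TOKENS_PER_PAGE = 500   # rough token cost of such a page


def estimate_pages(cutoff_chars):
    """Approximate number of full pages covered by a character cutoff."""
    return cutoff_chars / CHARS_PER_PAGE


def estimate_tokens(cutoff_chars):
    """Approximate token count for the text kept by a character cutoff."""
    return cutoff_chars * TOKENS_PER_PAGE // CHARS_PER_PAGE


def prompt_headroom(cutoff_chars, token_limit=8000):
    """Tokens left for the instructions after the document text is added."""
    return token_limit - estimate_tokens(cutoff_chars)


for cutoff in (15000, 20000):
    print(cutoff, estimate_pages(cutoff), estimate_tokens(cutoff), prompt_headroom(cutoff))
```

Under these assumptions, a cutoff of 15,000 characters covers about 7.5 pages and 3,750 tokens, leaving roughly 4,250 tokens for instructions. A cutoff of 20,000 characters covers about 10 pages and 5,000 tokens, leaving roughly 3,000.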

MIT License © 2025
