A Python-based OCR (Optical Character Recognition) solution that leverages the DocuPipe.ai API to extract text from PDF documents, with special support for right-to-left (RTL) languages like Hebrew.
DocuPipeOCR provides a simple interface to extract text from PDF documents using the DocuPipe.ai service. The project includes a main Jupyter notebook:
- DocuPipeOCR.ipynb: OCR text extraction with support for both standard and RTL languages
- Extract text from PDF documents
- Process multi-page documents
- Handle right-to-left (RTL) languages with proper text direction
- Format extracted Hebrew text with appropriate styling
- Support document sections and page structure
- Fix number formatting in RTL text
- Python 3.x
- DocuPipe.ai API key
- Clone the repository:

      git clone <repository-url>
      cd DocuPipeOCR

- Create a virtual environment:

      python -m venv .venv
      source .venv/bin/activate  # On Windows: .venv\Scripts\activate

- Install required packages:

      pip install -r requirements.txt

- Create a .env file with your DocuPipe.ai API key:

      MY_API_KEY=your_api_key_here

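
As a rough illustration of how the key in .env could be read at runtime, here is a minimal standard-library sketch. The helper names `load_env_file` and `get_api_key` are hypothetical and are not taken from the notebook, which may well use a package such as python-dotenv instead. Only the variable name `MY_API_KEY` comes from the setup step above.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict.

    Hypothetical helper: blank lines, comments, and lines without '='
    are skipped; surrounding quotes around values are removed.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Return MY_API_KEY from the environment, falling back to the .env file."""
    key = os.environ.get("MY_API_KEY") or load_env_file(path).get("MY_API_KEY")
    if not key:
        raise RuntimeError("MY_API_KEY is not set; add it to your .env file")
    return key
```

Checking the process environment first means the same code works in CI or on a server where the key is injected without a .env file.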
Open DocuPipeOCR.ipynb in Jupyter and run the cells to process PDF documents:

    # Process a PDF document and extract text
    do_ocr("your_document.pdf")

- The PDF document is encoded in base64 and sent to the DocuPipe.ai API
- The API processes the document and returns a document ID
- The system polls the API until processing is complete
- The extracted text is retrieved and processed
- For RTL documents, special formatting is applied to ensure proper text display
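
The encode, submit, and poll steps above can be sketched without any network calls. This is an illustration under stated assumptions, not the notebook's implementation: the payload shape in `build_payload`, the status strings `"completed"` and `"failed"`, and all function names are hypothetical, so check the DocuPipe.ai API reference for the real request and response fields. The status lookup is passed in as a callable, which keeps the polling logic independent of any particular HTTP client.

```python
import base64
import time


def encode_pdf(pdf_bytes):
    """Base64-encode raw PDF bytes as an ASCII string for a JSON body."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def build_payload(pdf_bytes, filename):
    """Build an upload body. The field names here are assumptions."""
    return {
        "document": {
            "file": {"contents": encode_pdf(pdf_bytes), "filename": filename}
        }
    }


def poll_until_complete(fetch_status, interval=2.0, timeout=120.0, sleep=time.sleep):
    """Call fetch_status() repeatedly until it reports completion.

    fetch_status is any zero-argument callable returning a dict with a
    "status" key; the status values checked below are hypothetical.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":
            raise RuntimeError("document processing failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("document was not processed before the timeout")
        sleep(interval)
```

In practice `fetch_status` would wrap an authenticated HTTP GET for the document ID returned by the upload request.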
- A submission function sends the document to DocuPipe.ai and returns a document ID.
- A status function checks the processing status and retrieves the results when processing is complete.
- An extraction function extracts and formats text from the API response.
- do_ocr is the main function that orchestrates the OCR process.
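
The RTL formatting step is described here only in general terms, so the following is one possible approach rather than the notebook's actual code. It assumes the text is shown as HTML (for example in a Jupyter output cell), detects Hebrew by Unicode range, and wraps runs of digits in left-to-right spans so that values like dates or ranges keep their order inside an RTL block. The names `is_rtl` and `to_display_html` are hypothetical.

```python
import html
import re

# Hebrew Unicode block; other RTL scripts would need additional ranges.
HEBREW = re.compile(r"[\u0590-\u05FF]")
# Digit runs, optionally joined by common separators (12-34, 1.5, 01/02/2024).
NUMBER_RUN = re.compile(r"\d+(?:[.,/:\-]\d+)*")


def is_rtl(text):
    """Return True if the text contains at least one Hebrew character."""
    return bool(HEBREW.search(text))


def to_display_html(text):
    """Wrap text in a div, using RTL direction when Hebrew is present."""
    # quote=False escapes only &, <, and >, so no numeric entities are
    # produced that the number pattern could match by accident.
    escaped = html.escape(text, quote=False)
    if not is_rtl(text):
        return f"<div>{escaped}</div>"
    isolated = NUMBER_RUN.sub(
        lambda m: f'<span dir="ltr">{m.group(0)}</span>', escaped
    )
    return f'<div dir="rtl" style="text-align: right;">{isolated}</div>'
```

In a notebook, the returned string could be passed to `IPython.display.HTML` for rendering.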
- The API key is stored in a .env file, which is excluded from version control via .gitignore
- Always keep your API keys secure and never commit them to public repositories
- DocuPipe.ai for providing the OCR API service