A production-quality layout-aware invoice key information extraction tool that jointly models text, document layout, and visual information. Extracts structured fields (vendor, invoice number, date, total, etc.) from invoice and receipt images.
| Model | Description | Training |
|---|---|---|
| BERT+bbox | Custom fusion: BERT + linear bbox embeddings | scripts/train.py |
| LayoutLM | Microsoft LayoutLM (document pretrained) | scripts/train_layoutlm.py |
| Baseline | Rule-based regex/heuristics | No training |
LayoutLM is document-pretrained and typically performs better on SROIE than BERT+bbox.
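To make the Baseline row concrete, here is a minimal sketch of what rule-based regex extraction can look like. The patterns, the function name `extract_baseline`, and the sample text are assumptions made for illustration; the project's own baseline may use different rules and cover more fields.

```python
import re

# Hypothetical patterns for illustration only; not taken from the project.
DATE_RE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b")
# The leading \b keeps "Subtotal" from matching as "total".
TOTAL_RE = re.compile(r"\btotal\b[^\d\n]*(\d+(?:[.,]\d{2})?)", re.IGNORECASE)


def extract_baseline(text: str) -> dict:
    """Pull a date and a total out of raw OCR text with plain regexes."""
    date_match = DATE_RE.search(text)
    # If several "Total" lines appear, keep the last one.
    totals = TOTAL_RE.findall(text)
    return {
        "date": date_match.group(1) if date_match else "",
        "total": totals[-1] if totals else "",
    }


sample = "ACME STORE\nDate: 12/03/2019\nSubtotal 10.00\nTax 0.60\nTotal RM 10.60\n"
result = extract_baseline(sample)
print(result)
```

Rules like these are fast and easy to inspect, but they fail as soon as the wording or layout shifts, which matches the "brittle" trade-off listed for the baseline.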
Image → OCR (tokens + bbox) → Tokenizer
↓
Layout-aware Transformer (BERT+bbox or LayoutLM)
↓
Token labels (BIO) → Postprocess → JSON
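The last stage of the pipeline above turns per-token BIO labels into field values. The sketch below shows one common way to do that grouping. The function name `decode_bio` and the label names (`VENDOR`, `TOTAL`) are hypothetical, and the project's postprocessing may resolve conflicts differently.

```python
from collections import defaultdict


def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into one string per field.

    "B-FIELD" opens a span, "I-FIELD" continues it, and "O" marks
    tokens outside any field. Only the first span per field is kept.
    """
    spans = defaultdict(list)  # field name -> completed span strings
    current_field, current_words = None, []

    def flush():
        # Store the span that is currently open, if any.
        if current_field is not None and current_words:
            spans[current_field].append(" ".join(current_words))

    for token, tag in zip(tokens, tags):
        prefix, _, field = tag.partition("-")
        if prefix == "B":
            flush()
            current_field, current_words = field, [token]
        elif prefix == "I" and field == current_field:
            current_words.append(token)
        else:
            # "O", or an "I-" tag that does not continue the open span.
            flush()
            current_field, current_words = None, []
    flush()
    return {field: found[0] for field, found in spans.items()}


tokens = ["ACME", "STORE", "Total", "10.60"]
tags = ["B-VENDOR", "I-VENDOR", "O", "B-TOTAL"]
fields = decode_bio(tokens, tags)
print(fields)
```

Keeping only the first span per field is a simplification; a real postprocessor might instead pick the span with the highest model confidence.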
Conda (recommended):
conda env create -f environment.yml
conda activate invoicelm
pip install -e .
python scripts/verify_env.py
The environment includes Tesseract OCR via conda-forge. If verify_env.py reports OCR issues, install system Tesseract: apt install tesseract-ocr (Linux) or brew install tesseract (macOS).
Pip only:
pip install -r requirements.txt
pip install -e .
# Tesseract (system): apt install tesseract-ocr # Linux
Scripts must be run from the project root. If you are not using pip install -e ., set PYTHONPATH=src.
Extract from image (baseline, no training):
python scripts/inference.py path/to/invoice.jpg --baseline --output result.json
Extract with trained model (BERT+bbox or LayoutLM; type auto-detected from checkpoint):
python scripts/inference.py path/to/invoice.jpg --model_path outputs/checkpoint --output result.json
python scripts/inference.py path/to/invoice.jpg --model_path outputs_layoutlm/layoutlm_.../checkpoint --output result.json
Web demo:
python app.py
# Open http://localhost:5000
- Download SROIE: python scripts/download_sroie.py
- Download the validation split: python scripts/download_val_sroie.py (requires Kaggle credentials)
- OCR runs once and is cached; subsequent runs load from the cache. If you changed label alignment, delete data/sroie/train_ocr_cache.pkl.
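The caching behaviour described above follows a familiar load-or-build pattern. The sketch below illustrates it with a pickle file; `load_or_build_cache` and `fake_ocr` are hypothetical names, and the project's actual cache format and contents may differ.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build_cache(cache_path, build_fn):
    """Return cached results if the pickle exists, else build and save them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    results = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(results, fh)
    return results


calls = []


def fake_ocr():
    # Stand-in for the expensive OCR pass; records how often it runs.
    calls.append(1)
    return {"img_001.jpg": [("TOTAL", [10, 20, 80, 40])]}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train_ocr_cache.pkl"
    first = load_or_build_cache(path, fake_ocr)   # runs fake_ocr, writes cache
    second = load_or_build_cache(path, fake_ocr)  # loads from the pickle

print(len(calls), first == second)
```

Because the cache stores results computed under the old settings, a change to label alignment is not picked up until the file is deleted, which is why the cache must be removed by hand in that case.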
BERT+bbox training outputs go to outputs/<hyperparams>/:
python scripts/train.py --data_dir data/sroie --output_dir outputs
Alternative hyperparameters:
python scripts/train.py --config configs/alt_hyperparams.yaml --data_dir data/sroie --output_dir outputs
LayoutLM training outputs go to outputs_layoutlm/<hyperparams>/:
python scripts/train_layoutlm.py --data_dir data/sroie
With config:
python scripts/train_layoutlm.py --config configs/layoutlm.yaml --data_dir data/sroie
# Baseline
python scripts/evaluate.py --data_dir data/sroie --split test --baseline --sroie_only
# BERT+bbox (model_type auto-detected from checkpoint)
python scripts/evaluate.py --data_dir data/sroie --model_path outputs/lr2e-5_bs28_ep70_wd0.01_entw2.0/checkpoint --sroie_only
# LayoutLM
python scripts/evaluate.py --data_dir data/sroie --model_path outputs_layoutlm/layoutlm_.../checkpoint --model_type layoutlm --sroie_only
{
"vendor_name": "",
"invoice_number": "",
"date": "",
"subtotal": "",
"tax": "",
"total": "",
"currency": "",
"address": "",
"line_items": [],
"raw_tokens": []
}
InvoiceLM/
├── src/invoice_kie/ # Main package
│ ├── models/ # Model implementations
│ │ ├── bert_bbox.py # BERT + bbox (LayoutAwareKIE)
│ │ └── layoutlm.py # LayoutLM (LayoutLMKIE)
│ ├── model.py # Legacy re-export
│ ├── preprocessing.py
│ ├── ocr.py # OCR + bbox normalization
│ ├── dataset.py # SROIE loader, label alignment
│ ├── tokenizer_utils.py # Tokenize with boxes
│ ├── postprocess.py
│ ├── baseline.py
│ ├── extractor.py # InvoiceExtractor (bert or layoutlm)
│ ├── metrics.py
│ ├── labels.py
│ └── schema.py
├── scripts/
│ ├── train.py # BERT+bbox training
│ ├── train_layoutlm.py # LayoutLM training
│ ├── inference.py
│ ├── evaluate.py
│ ├── download_sroie.py
│ └── download_val_sroie.py
├── configs/
│ ├── default.yaml # BERT+bbox
│ ├── alt_hyperparams.yaml # BERT alternative
│ └── layoutlm.yaml # LayoutLM
├── app.py
├── environment.yml
└── requirements.txt
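The tree above lists ocr.py as handling bbox normalization. LayoutLM expects each box as integers on a 0 to 1000 grid relative to the page size, so a typical normalization step looks like the sketch below. The function name `normalize_bbox` is an assumption, and the project's implementation may round or clamp differently.

```python
def normalize_bbox(bbox, width, height, scale=1000):
    """Map a pixel box (x0, y0, x1, y1) onto a 0..scale integer grid."""
    x0, y0, x1, y1 = bbox

    def clamp(value):
        # Guard against OCR boxes that spill slightly past the page edge.
        return max(0, min(scale, value))

    return [
        clamp(int(scale * x0 / width)),
        clamp(int(scale * y0 / height)),
        clamp(int(scale * x1 / width)),
        clamp(int(scale * y1 / height)),
    ]


# A 500x1000 pixel page with one OCR box.
box = normalize_bbox((50, 100, 250, 150), width=500, height=1000)
print(box)
```

Normalizing by page size makes box coordinates comparable across images of different resolutions.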
| Approach | Pros | Cons |
|---|---|---|
| Baseline | Fast, no GPU, interpretable | Brittle, poor address |
| BERT+bbox | Uses layout, moderate | Underperforms LayoutLM |
| LayoutLM | Document-pretrained, strong | Heavier, needs more RAM |
License: MIT