Invoice KIE — Layout-Aware Key Information Extraction

A production-quality, layout-aware key information extraction (KIE) tool for invoices. It jointly models text, document layout, and visual information, and extracts structured fields (vendor, invoice number, date, total, etc.) from invoice and receipt images.

Models

Model	Description	Training
BERT+bbox	Custom fusion: BERT + linear bbox embeddings	`scripts/train.py`
LayoutLM	Microsoft LayoutLM (document pretrained)	`scripts/train_layoutlm.py`
Baseline	Rule-based regex/heuristics	No training

LayoutLM is document-pretrained and typically performs better on SROIE than BERT+bbox.
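
Both layout-aware models take a bounding box per token alongside the text, and LayoutLM's pretrained position embeddings expect box coordinates on a 0 to 1000 integer grid. The sketch below shows one way to map pixel-space OCR boxes onto that grid. The function name `normalize_bbox` and the clamping behaviour are assumptions for illustration, not necessarily what `ocr.py` does.

```python
def normalize_bbox(box, page_width, page_height, scale=1000):
    """Map a pixel-space (x0, y0, x1, y1) box onto a 0..scale integer grid."""
    x0, y0, x1, y1 = box
    # Order the corners so x0 <= x1 and y0 <= y1 even if OCR returns them flipped.
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))

    def to_grid(value, extent):
        # Clamp to the page before scaling so stray OCR boxes stay in range.
        clamped = min(max(value, 0), extent)
        return int(scale * clamped / extent)

    return [
        to_grid(x0, page_width),
        to_grid(y0, page_height),
        to_grid(x1, page_width),
        to_grid(y1, page_height),
    ]
```

Normalizing by page size makes the layout signal independent of scan resolution, so the same receipt photographed at two sizes yields the same box features.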

Architecture

Image → OCR (tokens + bbox) → Tokenizer
                                    ↓
                    Layout-aware Transformer (BERT+bbox or LayoutLM)
                                    ↓
                    Token labels (BIO) → Postprocess → JSON
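
The last stage of the pipeline turns per-token BIO labels into field values. A minimal sketch of that grouping step follows, assuming labels of the form `B-FIELD`, `I-FIELD`, and `O`. The function `decode_bio` and the label names used in the usage example are hypothetical; the actual label set lives in `labels.py` and the real logic in `postprocess.py`.

```python
def decode_bio(tokens, labels):
    """Group BIO-tagged tokens into {field: text}; keeps the first span per field."""
    fields = {}
    current_field, current_tokens = None, []

    def flush():
        # Store the finished span unless that field was already filled.
        if current_field is not None and current_field not in fields:
            fields[current_field] = " ".join(current_tokens)

    for token, label in zip(tokens, labels):
        prefix, _, field = label.partition("-")
        if prefix == "B" or (prefix == "I" and field != current_field):
            # A new span starts here (a stray I- tag is treated like a B- tag).
            flush()
            current_field, current_tokens = field, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" or any unrecognised label closes the open span.
            flush()
            current_field, current_tokens = None, []
    flush()
    return fields
```

For example, calling `decode_bio(["ACME", "LTD", "TOTAL", "12.50"], ["B-VENDOR", "I-VENDOR", "O", "B-TOTAL"])` groups the first two tokens into one vendor span and the last token into a total span. Keeping only the first span per field is a simplification; a real postprocessor might prefer the highest-confidence span instead.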

Setup

Conda (recommended):

conda env create -f environment.yml
conda activate invoicelm
pip install -e .
python scripts/verify_env.py

The conda environment includes Tesseract OCR from conda-forge. If verify_env.py reports OCR issues, install Tesseract system-wide: apt install tesseract-ocr (Debian/Ubuntu Linux) or brew install tesseract (macOS).

Pip only:

pip install -r requirements.txt
pip install -e .
# Tesseract (system): apt install tesseract-ocr  # Linux

Run all scripts from the project root. If you skip pip install -e ., set PYTHONPATH=src so the invoice_kie package can be imported.

Quick Start

Extract from image (baseline, no training):

python scripts/inference.py path/to/invoice.jpg --baseline --output result.json
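
To give a feel for what a rule-based baseline involves, here is a small sketch that pulls a date and a total out of OCR text lines with regular expressions. It is an illustration only: the patterns, the function name `baseline_extract`, and the "last line mentioning total" heuristic are assumptions, not the rules implemented in `baseline.py`.

```python
import re

# Dates like 05/01/2024, 5-1-24, or ISO 2024-01-05.
DATE_RE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b")
# Amounts with two decimals and optional thousands separators, e.g. 1,234.56.
AMOUNT_RE = re.compile(r"\d+(?:,\d{3})*\.\d{2}")


def baseline_extract(lines):
    """Return the first date-like string and the amount on the last 'total' line."""
    result = {"date": "", "total": ""}
    for line in lines:
        if not result["date"]:
            match = DATE_RE.search(line)
            if match:
                result["date"] = match.group(0)
        lowered = line.lower()
        if "total" in lowered and "subtotal" not in lowered:
            amounts = AMOUNT_RE.findall(line)
            if amounts:
                # Later "total" lines overwrite earlier ones.
                result["total"] = amounts[-1]
    return result
```

Heuristics like these are fast and easy to inspect, but they break as soon as a receipt uses an unexpected date format or wording.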

Extract with trained model (BERT+bbox or LayoutLM; type auto-detected from checkpoint):

python scripts/inference.py path/to/invoice.jpg --model_path outputs/checkpoint --output result.json
python scripts/inference.py path/to/invoice.jpg --model_path outputs_layoutlm/layoutlm_.../checkpoint --output result.json

Web demo:

python app.py
# Open http://localhost:5000

Training (SROIE)

Download SROIE: python scripts/download_sroie.py
Download val: python scripts/download_val_sroie.py (requires Kaggle credentials)
OCR results are cached on the first run; subsequent runs load from the cache. If you change label alignment, delete data/sroie/train_ocr_cache.pkl so the cache is rebuilt.
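
The cache behaviour described above follows a common load-or-build pattern. A minimal sketch, assuming a pickle file and a caller-supplied build function; the helper name `load_or_build_cache` is hypothetical and the project's own caching code may differ:

```python
import pickle
from pathlib import Path


def load_or_build_cache(cache_path, build_fn):
    """Return cached results if the pickle exists; otherwise build and save them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    # The expensive step, e.g. running OCR over every training image.
    results = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(results, fh)
    return results
```

Because the cache is keyed only by its file path, nothing invalidates it when the upstream logic changes. That is why the cache file has to be deleted by hand after changing label alignment.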

BERT+bbox (custom)

Outputs go to outputs/<hyperparams>/:

python scripts/train.py --data_dir data/sroie --output_dir outputs

Alternative hyperparams:

python scripts/train.py --config configs/alt_hyperparams.yaml --data_dir data/sroie --output_dir outputs

LayoutLM (recommended for better accuracy)

Outputs go to outputs_layoutlm/<hyperparams>/:

python scripts/train_layoutlm.py --data_dir data/sroie

With config:

python scripts/train_layoutlm.py --config configs/layoutlm.yaml --data_dir data/sroie

Evaluation

# Baseline
python scripts/evaluate.py --data_dir data/sroie --split test --baseline --sroie_only

# BERT+bbox (model_type auto-detected from checkpoint)
python scripts/evaluate.py --data_dir data/sroie --model_path outputs/lr2e-5_bs28_ep70_wd0.01_entw2.0/checkpoint --sroie_only

# LayoutLM
python scripts/evaluate.py --data_dir data/sroie --model_path outputs_layoutlm/layoutlm_.../checkpoint --model_type layoutlm --sroie_only
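
For orientation, the sketch below computes micro-averaged exact-match precision, recall, and F1 over the four SROIE annotation keys (company, date, address, total). It is an assumption that `metrics.py` reports something along these lines; the function name `field_f1` and the whitespace stripping are illustrative choices.

```python
def field_f1(predictions, references, fields=("company", "date", "address", "total")):
    """Micro-averaged exact-match precision, recall, and F1 over the given fields."""
    correct = predicted = expected = 0
    for pred, ref in zip(predictions, references):
        for field in fields:
            pred_value = (pred.get(field) or "").strip()
            ref_value = (ref.get(field) or "").strip()
            predicted += bool(pred_value)  # the model produced something
            expected += bool(ref_value)    # the ground truth has this field
            # A field counts only if it is non-empty and matches exactly.
            correct += bool(ref_value) and pred_value == ref_value
    precision = correct / predicted if predicted else 0.0
    recall = correct / expected if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Exact match is strict: a single OCR character error in an address makes the whole field count as wrong, which is one reason address scores tend to lag behind date and total.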

Output Schema

{
  "vendor_name": "",
  "invoice_number": "",
  "date": "",
  "subtotal": "",
  "tax": "",
  "total": "",
  "currency": "",
  "address": "",
  "line_items": [],
  "raw_tokens": []
}
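
Downstream code is easier to write when every result carries all schema keys, even those the extractor could not fill. A minimal sketch of that normalization, assuming plain dicts; `SCHEMA_DEFAULTS` and `to_result` are hypothetical names, and `schema.py` may use a different representation:

```python
SCHEMA_DEFAULTS = {
    "vendor_name": "",
    "invoice_number": "",
    "date": "",
    "subtotal": "",
    "tax": "",
    "total": "",
    "currency": "",
    "address": "",
    "line_items": [],
    "raw_tokens": [],
}


def to_result(extracted):
    """Return a dict with every schema key, filling gaps with empty defaults."""
    unknown = set(extracted) - set(SCHEMA_DEFAULTS)
    if unknown:
        raise KeyError(f"unexpected fields: {sorted(unknown)}")
    # Copy list defaults so separate results never share mutable state.
    result = {
        key: (list(value) if isinstance(value, list) else value)
        for key, value in SCHEMA_DEFAULTS.items()
    }
    result.update(extracted)
    return result
```

The returned dict can be passed straight to `json.dumps` to produce output in the shape shown above.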

Project Structure

InvoiceLM/
├── src/invoice_kie/           # Main package
│   ├── models/                # Model implementations
│   │   ├── bert_bbox.py       # BERT + bbox (LayoutAwareKIE)
│   │   └── layoutlm.py        # LayoutLM (LayoutLMKIE)
│   ├── model.py               # Legacy re-export
│   ├── preprocessing.py
│   ├── ocr.py                 # OCR + bbox normalization
│   ├── dataset.py             # SROIE loader, label alignment
│   ├── tokenizer_utils.py     # Tokenize with boxes
│   ├── postprocess.py
│   ├── baseline.py
│   ├── extractor.py           # InvoiceExtractor (bert or layoutlm)
│   ├── metrics.py
│   ├── labels.py
│   └── schema.py
├── scripts/
│   ├── train.py               # BERT+bbox training
│   ├── train_layoutlm.py      # LayoutLM training
│   ├── inference.py
│   ├── evaluate.py
│   ├── download_sroie.py
│   └── download_val_sroie.py
├── configs/
│   ├── default.yaml           # BERT+bbox
│   ├── alt_hyperparams.yaml   # BERT alternative
│   └── layoutlm.yaml          # LayoutLM
├── app.py
├── environment.yml
└── requirements.txt

Tradeoffs

Approach	Pros	Cons
Baseline	Fast, no GPU, interpretable	Brittle; poor on address fields
BERT+bbox	Uses layout; moderate accuracy	Underperforms LayoutLM
LayoutLM	Document-pretrained; strong accuracy	Heavier, needs more RAM

License

MIT
