FrankZhaodong/LayoutLM-Invoices

Invoice KIE — Layout-Aware Key Information Extraction

A production-quality, layout-aware tool for key information extraction (KIE) from invoices. It jointly models text, document layout, and visual information, and extracts structured fields (vendor, invoice number, date, total, etc.) from invoice and receipt images.

Models

| Model | Description | Training |
|---|---|---|
| BERT+bbox | Custom fusion: BERT + linear bbox embeddings | scripts/train.py |
| LayoutLM | Microsoft LayoutLM (document pretrained) | scripts/train_layoutlm.py |
| Baseline | Rule-based regex/heuristics | No training |

LayoutLM is document-pretrained and typically performs better on SROIE than BERT+bbox.
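
To make the rule-based baseline concrete, here is a minimal sketch of what regex extraction over raw OCR text might look like. The patterns and the function name `extract_baseline` are illustrative assumptions, not the contents of the repository's baseline.py, and they only cover two fields.

```python
import re

# Hypothetical patterns; the repository's baseline may use different rules.
DATE_RE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b")
# "\btotal\b" deliberately does not match inside "Subtotal".
TOTAL_RE = re.compile(r"\btotal\b[^\d\n]*(\d[\d,]*(?:\.\d{2})?)", re.IGNORECASE)


def extract_baseline(text: str) -> dict:
    """Pull a date and a total out of raw OCR text with regexes."""
    date_match = DATE_RE.search(text)
    totals = TOTAL_RE.findall(text)
    return {
        "date": date_match.group(1) if date_match else "",
        # Use the last "total" line, which is usually the grand total.
        "total": totals[-1] if totals else "",
    }


sample = "ACME STORE\nDate: 12/03/2019\nSubtotal 10.00\nTax 0.60\nTotal RM 10.60\n"
result = extract_baseline(sample)
```

Rules like these are fast and easy to inspect, but they break as soon as a receipt uses an unexpected date format or label, which is the brittleness the learned models are meant to address.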

Architecture

Image → OCR (tokens + bbox) → Tokenizer
                                    ↓
                    Layout-aware Transformer (BERT+bbox or LayoutLM)
                                    ↓
                    Token labels (BIO) → Postprocess → JSON
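
The OCR step in the diagram yields pixel-space boxes, while LayoutLM expects box coordinates scaled to the 0 to 1000 range. The following is a minimal sketch of that normalization; the function name `normalize_bbox` is an assumption for illustration, and the repository's ocr.py may implement it differently.

```python
def normalize_bbox(bbox, page_width, page_height, scale=1000):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0..scale range."""
    x0, y0, x1, y1 = bbox

    def clamp(value):
        # OCR boxes can spill slightly outside the page; keep them in range.
        return max(0, min(scale, value))

    return [
        clamp(int(scale * x0 / page_width)),
        clamp(int(scale * y0 / page_height)),
        clamp(int(scale * x1 / page_width)),
        clamp(int(scale * y1 / page_height)),
    ]


box = normalize_bbox((50, 100, 250, 200), page_width=500, page_height=1000)
```

Normalizing by page size makes the layout signal independent of scan resolution, so the same model can handle images of different dimensions.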

Setup

Conda (recommended):

conda env create -f environment.yml
conda activate invoicelm
pip install -e .
python scripts/verify_env.py

The environment includes Tesseract OCR via conda-forge. If verify_env.py reports OCR issues, install Tesseract system-wide: apt install tesseract-ocr (Linux) or brew install tesseract (macOS).

Pip only:

pip install -r requirements.txt
pip install -e .
# Tesseract (system): apt install tesseract-ocr  # Linux

Run all scripts from the project root. If you do not use pip install -e ., set PYTHONPATH=src.

Quick Start

Extract from image (baseline, no training):

python scripts/inference.py path/to/invoice.jpg --baseline --output result.json

Extract with trained model (BERT+bbox or LayoutLM; type auto-detected from checkpoint):

python scripts/inference.py path/to/invoice.jpg --model_path outputs/checkpoint --output result.json
python scripts/inference.py path/to/invoice.jpg --model_path outputs_layoutlm/layoutlm_.../checkpoint --output result.json
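
One plausible way to auto-detect the model type is to read it from the checkpoint's configuration file. The sketch below assumes a Hugging Face-style checkpoint directory containing a config.json with a `model_type` key; the helper name `detect_model_type` and the "bert" fallback are hypothetical, and the repository may detect the type another way.

```python
import json
from pathlib import Path


def detect_model_type(checkpoint_dir, default="bert"):
    """Return the model_type recorded in config.json, or a default."""
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        # No config found: fall back to the assumed default.
        return default
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("model_type", default)
```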

Web demo:

python app.py
# Open http://localhost:5000

Training (SROIE)

  1. Download SROIE: python scripts/download_sroie.py
  2. Download val: python scripts/download_val_sroie.py (requires Kaggle credentials)
  3. OCR results are cached on the first run; subsequent runs load from the cache. If you change label alignment, delete data/sroie/train_ocr_cache.pkl so the cache is rebuilt.
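
The caching behavior in step 3 follows a common load-or-build pattern, sketched below with pickle. The helper `load_or_build_cache` and the `build_fn` callback are illustrative names and not taken from the repository.

```python
import pickle
from pathlib import Path


def load_or_build_cache(cache_path, build_fn):
    """Load a pickled result if present; otherwise build it and save it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    # Cache miss: run the expensive step (e.g. OCR) once and persist it.
    result = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Because the cache stores derived data, anything that changes how it is derived (such as label alignment) makes the file stale, which is why deleting it forces a rebuild.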

BERT+bbox (custom)

Outputs go to outputs/<hyperparams>/:

python scripts/train.py --data_dir data/sroie --output_dir outputs

Alternative hyperparams:

python scripts/train.py --config configs/alt_hyperparams.yaml --data_dir data/sroie --output_dir outputs

LayoutLM (recommended for better accuracy)

Outputs go to outputs_layoutlm/<hyperparams>/:

python scripts/train_layoutlm.py --data_dir data/sroie

With config:

python scripts/train_layoutlm.py --config configs/layoutlm.yaml --data_dir data/sroie

Evaluation

# Baseline
python scripts/evaluate.py --data_dir data/sroie --split test --baseline --sroie_only

# BERT+bbox (model_type auto-detected from checkpoint)
python scripts/evaluate.py --data_dir data/sroie --model_path outputs/lr2e-5_bs28_ep70_wd0.01_entw2.0/checkpoint --sroie_only

# LayoutLM
python scripts/evaluate.py --data_dir data/sroie --model_path outputs_layoutlm/layoutlm_.../checkpoint --model_type layoutlm --sroie_only
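
As a rough illustration of field-level scoring, the sketch below computes micro-averaged exact-match precision, recall, and F1 over predicted and reference field dictionaries. This is an assumed scoring scheme for illustration; the repository's metrics.py and the official SROIE scoring may differ in details such as text normalization.

```python
def field_f1(predictions, references):
    """Micro-averaged exact-match precision, recall, and F1 over fields."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        for key in set(pred) | set(ref):
            p = pred.get(key, "").strip()
            r = ref.get(key, "").strip()
            if p and p == r:
                tp += 1          # predicted value matches the reference
            else:
                if p:
                    fp += 1      # predicted something wrong or spurious
                if r:
                    fn += 1      # reference value was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


preds = [{"total": "10.60", "date": "12/03/2019", "vendor_name": ""}]
refs = [{"total": "10.60", "date": "13/03/2019", "vendor_name": "ACME"}]
scores = field_f1(preds, refs)
```

In this example one of two predictions is correct and one of three reference fields is recovered, so a wrong value costs both precision and recall while an empty prediction only costs recall.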

Output Schema

{
  "vendor_name": "",
  "invoice_number": "",
  "date": "",
  "subtotal": "",
  "tax": "",
  "total": "",
  "currency": "",
  "address": "",
  "line_items": [],
  "raw_tokens": []
}
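
To show how BIO token labels could map onto this schema, here is a minimal decoding sketch. It assumes label names of the form B-<field> and I-<field> where <field> matches a schema key; that naming, and the function `decode_bio`, are assumptions rather than the repository's actual label set or postprocess.py logic.

```python
def decode_bio(tokens, labels):
    """Group BIO-tagged tokens into {field: text}; first span per field wins."""
    spans = []          # list of (field, [tokens]) in reading order
    open_field = None   # field of the span currently being extended, if any
    for token, label in zip(tokens, labels):
        prefix, _, field = label.partition("-")
        if prefix == "B":
            spans.append((field, [token]))
            open_field = field
        elif prefix == "I" and field == open_field:
            spans[-1][1].append(token)
        else:
            # "O", or an I- tag that does not continue the open span: drop it.
            open_field = None
    result = {}
    for field, words in spans:
        result.setdefault(field, " ".join(words))
    return result


tokens = ["ACME", "STORE", "Date", ":", "12/03/2019", "Total", "10.60"]
labels = ["B-vendor_name", "I-vendor_name", "O", "O", "B-date", "O", "B-total"]
fields = decode_bio(tokens, labels)
```

Keeping only the first span per field is one simple policy; a real postprocessor might instead pick the highest-confidence span or merge adjacent ones.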

Project Structure

InvoiceLM/
├── src/invoice_kie/           # Main package
│   ├── models/                # Model implementations
│   │   ├── bert_bbox.py       # BERT + bbox (LayoutAwareKIE)
│   │   └── layoutlm.py        # LayoutLM (LayoutLMKIE)
│   ├── model.py               # Legacy re-export
│   ├── preprocessing.py
│   ├── ocr.py                 # OCR + bbox normalization
│   ├── dataset.py             # SROIE loader, label alignment
│   ├── tokenizer_utils.py     # Tokenize with boxes
│   ├── postprocess.py
│   ├── baseline.py
│   ├── extractor.py           # InvoiceExtractor (bert or layoutlm)
│   ├── metrics.py
│   ├── labels.py
│   └── schema.py
├── scripts/
│   ├── train.py               # BERT+bbox training
│   ├── train_layoutlm.py      # LayoutLM training
│   ├── inference.py
│   ├── evaluate.py
│   ├── download_sroie.py
│   └── download_val_sroie.py
├── configs/
│   ├── default.yaml           # BERT+bbox
│   ├── alt_hyperparams.yaml   # BERT alternative
│   └── layoutlm.yaml          # LayoutLM
├── app.py
├── environment.yml
└── requirements.txt

Tradeoffs

| Approach | Pros | Cons |
|---|---|---|
| Baseline | Fast, no GPU, interpretable | Brittle, poor address |
| BERT+bbox | Uses layout, moderate | Underperforms LayoutLM |
| LayoutLM | Document-pretrained, strong | Heavier, needs more RAM |

License

MIT

About

git repository for the project "Layout-Aware Key Information Extraction from Invoices"
