feat: add receipt_parser library #211
Review Summary by Qodo
Add receipt_parser library with OCR and AI parsing capabilities
Walkthroughs
Description
• Created dedicated receipt_parser library from Jupyter notebook
• Implemented OCR text detection with Google Cloud Vision API
• Added structured receipt data models using Pydantic
• Integrated Gemini AI for intelligent receipt parsing
• Included text preprocessing to reduce noise and token usage
Diagram
flowchart LR
Image["Image/PDF File"] -- "detect_text" --> OCR["Google Cloud Vision OCR"]
OCR -- "preprocess_text" --> CleanText["Cleaned OCR Text"]
CleanText -- "parse_result" --> Gemini["Gemini AI Parser"]
Gemini -- "JSON Schema" --> Models["ReceiptData Model"]
Models --> Output["Structured Receipt Data"]
File Changes
1. src/receipt_parser/config/settings.py
Pull request overview
This PR introduces a new receipt_parser Python library intended to turn the prior receipt OCR notebook prototype into reusable code (OCR via Google Cloud Vision + structured parsing via Gemini).
Changes:
- Added OCR utilities to extract and preprocess text from images and PDFs using Google Cloud Vision.
- Added Gemini-based parsing to convert OCR text into structured `ReceiptData` Pydantic models.
- Added initial setup documentation for required environment variables and credentials.
Reviewed changes
Copilot reviewed 6 out of 10 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| src/receipt_parser/SETUP.md | Documents env vars and credential setup for OCR + Gemini usage. |
| src/receipt_parser/config/settings.py | Loads env/config and constructs a Gemini client. |
| src/receipt_parser/models/receipt.py | Introduces Pydantic models for receipt items/summary/data. |
| src/receipt_parser/models/__init__.py | Exposes receipt models via package exports. |
| src/receipt_parser/services/ocr.py | Implements PDF/image OCR and text preprocessing via Vision API. |
| src/receipt_parser/services/parser.py | Implements Gemini JSON-schema-based parsing into ReceiptData. |
| except Exception as e: | ||
| print(f"Error type: {type(e).__name__}") | ||
| print(f"Error message: {str(e)}") | ||
| print(f"Full error: {repr(e)}") | ||
| raise |
This exception handler prints error details directly to stdout. In the FastAPI app this bypasses the configured logging pipeline (setup_logger) and can leak request/receipt content in logs unexpectedly. Prefer using the project logger (or remove the prints and re-raise with a sanitized message).
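A minimal sketch of the logger-based handler this comment asks for. It uses the standard library `logging.getLogger` as a stand-in for the project's `setup_logger`, and `parse_or_raise` is a hypothetical wrapper name, not code from this PR.

```python
import logging

# Stand-in for the project's setup_logger(__name__); assumed here to behave
# like a standard logging.Logger.
logger = logging.getLogger("receipt_parser.parser")


def parse_or_raise(parse, receipt_text):
    """Call parse(receipt_text); log a failure with its traceback, then re-raise."""
    try:
        return parse(receipt_text)
    except Exception:
        # logger.exception logs at ERROR level and attaches the stack trace.
        # The message itself leaves out receipt_text on purpose.
        logger.exception("Failed to parse receipt text")
        raise
```

Routing the failure through the logger keeps it inside whatever handlers and formatters the application has configured, while the bare `raise` preserves the original exception for the caller.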
| from dotenv import load_dotenv | ||
| from google import genai | ||
| load_dotenv() | ||
| os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json" | ||
| WORD = re.compile(r"\w+") | ||
| api_key = os.getenv("GEMINI_API_KEY") |
This module performs side effects at import time (loading dotenv and overwriting GOOGLE_APPLICATION_CREDENTIALS with a hardcoded relative filename). This can break deployments and also prevents users from configuring credentials via environment variables. Prefer not mutating os.environ here; instead read GOOGLE_APPLICATION_CREDENTIALS (or accept a path parameter) and fail with a clear error if it’s not set.
Suggested change
- from dotenv import load_dotenv
- from google import genai
- load_dotenv()
- os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"
- WORD = re.compile(r"\w+")
- api_key = os.getenv("GEMINI_API_KEY")
+ from google import genai
+ def _get_required_env(name: str) -> str:
+     value = os.getenv(name)
+     if not value:
+         raise RuntimeError(
+             f"Required environment variable '{name}' is not set."
+         )
+     return value
+ WORD = re.compile(r"\w+")
+ GOOGLE_APPLICATION_CREDENTIALS = _get_required_env("GOOGLE_APPLICATION_CREDENTIALS")
+ api_key = _get_required_env("GEMINI_API_KEY")
| get credentials from [Google Cloud Console](https://console.cloud.google.com/apis/credentials) then | ||
| rename the file to `ocr_demo_key.json`. No newline at end of file |
This setup doc instructs renaming a service account key file to a fixed filename. Combined with settings.py forcing GOOGLE_APPLICATION_CREDENTIALS to that name, it encourages keeping credentials in the repo working directory and makes secrets easier to accidentally commit. Prefer documenting GOOGLE_APPLICATION_CREDENTIALS=/absolute/path/to/key.json and explicitly instructing users not to commit the key file.
Suggested change
- get credentials from [Google Cloud Console](https://console.cloud.google.com/apis/credentials) then
- rename the file to `ocr_demo_key.json`.
+ get credentials from [Google Cloud Console](https://console.cloud.google.com/apis/credentials), store the downloaded service account key in a secure location outside this repository, and set `GOOGLE_APPLICATION_CREDENTIALS` to its absolute path (for example, `GOOGLE_APPLICATION_CREDENTIALS=/absolute/path/to/key.json`).
+ Do not rename the key to a fixed filename in the project directory, and do not commit the key file to version control.
| def parse_result(receipt_text, model=current_model): | ||
| """Parse OCR text into structured receipt data using Gemini.""" | ||
| try: | ||
| response = client.models.generate_content( | ||
| model=model, | ||
| contents=receipt_text, | ||
| config={ | ||
| "system_instruction": RECEIPT_PARSER_PROMPT, | ||
| "response_mime_type": "application/json", | ||
| "response_schema": ReceiptData.model_json_schema(), | ||
| "thinking_config": { | ||
| "thinking_budget": 0 # no thinking | ||
| }, | ||
| }, | ||
| ) | ||
| return ReceiptData.model_validate_json(response.text) |
New parsing behavior is introduced here but there are no accompanying tests. Since the repo already uses pytest (tests/unit/*), add unit tests that mock the Gemini client to verify: schema validation/model_validate_json handling, error propagation, and at least one happy-path response.
Suggested change
- def parse_result(receipt_text, model=current_model):
-     """Parse OCR text into structured receipt data using Gemini."""
-     try:
-         response = client.models.generate_content(
-             model=model,
-             contents=receipt_text,
-             config={
-                 "system_instruction": RECEIPT_PARSER_PROMPT,
-                 "response_mime_type": "application/json",
-                 "response_schema": ReceiptData.model_json_schema(),
-                 "thinking_config": {
-                     "thinking_budget": 0 # no thinking
-                 },
-             },
-         )
-         return ReceiptData.model_validate_json(response.text)
+ def _build_generation_config(receipt_model=ReceiptData):
+     """Build the Gemini response configuration for structured receipt parsing."""
+     return {
+         "system_instruction": RECEIPT_PARSER_PROMPT,
+         "response_mime_type": "application/json",
+         "response_schema": receipt_model.model_json_schema(),
+         "thinking_config": {
+             "thinking_budget": 0 # no thinking
+         },
+     }
+ def _validate_receipt_response(response_text, receipt_model=ReceiptData):
+     """Validate Gemini JSON output against the receipt schema."""
+     return receipt_model.model_validate_json(response_text)
+ def parse_result(receipt_text, model=current_model, gemini_client=client, receipt_model=ReceiptData):
+     """Parse OCR text into structured receipt data using Gemini."""
+     try:
+         response = gemini_client.models.generate_content(
+             model=model,
+             contents=receipt_text,
+             config=_build_generation_config(receipt_model),
+         )
+         return _validate_receipt_response(response.text, receipt_model)
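Once the client is injectable, the requested tests can run without network access. The following is only a sketch: `parse_with_client` is a hypothetical, trimmed stand-in for `parse_result`, `json.loads` stands in for `ReceiptData.model_validate_json`, and the model name is a placeholder.

```python
import json
from unittest.mock import MagicMock


def parse_with_client(receipt_text, gemini_client, validate=json.loads):
    """Hypothetical, trimmed stand-in for parse_result with an injected client."""
    response = gemini_client.models.generate_content(
        model="hypothetical-model-name",
        contents=receipt_text,
        config={"response_mime_type": "application/json"},
    )
    return validate(response.text)


def make_stub_client(response_text):
    """Build a mock whose generate_content() returns an object with .text set."""
    client = MagicMock()
    client.models.generate_content.return_value.text = response_text
    return client


def test_happy_path():
    client = make_stub_client('{"store": "Corner Shop", "currency": "CAD"}')
    result = parse_with_client("raw ocr text", client)
    assert result == {"store": "Corner Shop", "currency": "CAD"}
    # The OCR text should be forwarded unchanged as the request contents.
    kwargs = client.models.generate_content.call_args.kwargs
    assert kwargs["contents"] == "raw ocr text"


def test_invalid_json_is_propagated():
    client = make_stub_client("not json")
    try:
        parse_with_client("raw ocr text", client)
    except json.JSONDecodeError:
        return
    raise AssertionError("expected a JSONDecodeError")


def test_client_error_is_propagated():
    client = MagicMock()
    client.models.generate_content.side_effect = RuntimeError("quota exceeded")
    try:
        parse_with_client("raw ocr text", client)
    except RuntimeError as exc:
        assert str(exc) == "quota exceeded"
        return
    raise AssertionError("expected a RuntimeError")
```

The three functions follow pytest's `test_*` naming, so they would be collected as-is if placed under `tests/unit/`.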
| def detect_text(path): | ||
| """ | ||
| Detects text in a file using Google Cloud Vision OCR. | ||
| Handles images and multi-page PDFs by converting PDF pages to images. | ||
| """ | ||
| vision_client = vision.ImageAnnotatorClient() | ||
| file_ext = Path(path).suffix.lower() | ||
| all_text = [] | ||
| image_contents = [] | ||
| if file_ext == ".pdf": | ||
| # opening PDF and iterating through all pages | ||
| pdf_document = fitz.open(path) | ||
| for page_num in range(len(pdf_document)): | ||
| page = pdf_document[page_num] | ||
| # convert each page to an image | ||
| matrix = fitz.Matrix(2, 2) | ||
| pix = page.get_pixmap(matrix=matrix) | ||
| image_contents.append(pix.tobytes("png")) | ||
| pdf_document.close() | ||
| else: | ||
| # Handle standard image files (png, jpg, etc.) | ||
| with open(path, "rb") as image_file: | ||
| image_contents.append(image_file.read()) | ||
| # Process each image/page through Vision OCR | ||
| for content in image_contents: | ||
| image = vision.Image(content=content) | ||
| # We use document_text_detection for better handling of dense text/receipts | ||
| response = vision_client.document_text_detection(image=image) | ||
| if response.error.message: | ||
| raise Exception(f"Vision API Error: {response.error.message}") | ||
| # text_annotations[0] contains the entire page's text as a single string | ||
| if response.text_annotations: | ||
| page_text = response.text_annotations[0].description | ||
| all_text.append(page_text) | ||
New OCR logic is added here (PDF page rendering + Vision API calls) but there are no tests covering file-type branching, multi-page PDF handling, and Vision API error handling. Add unit tests that mock fitz.open and vision.ImageAnnotatorClient to validate the behavior without external calls.
| from google import genai | ||
| load_dotenv() | ||
| os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json" | ||
| WORD = re.compile(r"\w+") | ||
| api_key = os.getenv("GEMINI_API_KEY") | ||
| client = genai.Client(api_key=api_key) |
from google import genai depends on the google-genai package, but it’s not declared in pyproject.toml (and not present in uv.lock). This will raise ImportError in deployed environments. Add the dependency (or gate it behind an optional extra) and document the required install path for the receipt parser feature.
Suggested change
- from google import genai
- load_dotenv()
- os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"
- WORD = re.compile(r"\w+")
- api_key = os.getenv("GEMINI_API_KEY")
- client = genai.Client(api_key=api_key)
+ try:
+     from google import genai
+ except ImportError:
+     genai = None
+ RECEIPT_PARSER_INSTALL_HINT = (
+     "The receipt parser Gemini integration requires the optional "
+     "`google-genai` package. Install it before using this feature."
+ )
+ load_dotenv()
+ os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"
+ WORD = re.compile(r"\w+")
+ api_key = os.getenv("GEMINI_API_KEY")
+ def get_genai_client():
+     if genai is None:
+         raise RuntimeError(RECEIPT_PARSER_INSTALL_HINT)
+     if not api_key:
+         raise RuntimeError(
+             "The receipt parser Gemini integration requires the "
+             "`GEMINI_API_KEY` environment variable to be set."
+         )
+     return genai.Client(api_key=api_key)
+ client = get_genai_client() if genai is not None and api_key else None
rajpandya737 left a comment
good start @ian-yeh. Address the AI comments and make sure there is good type hinting; we want the types to always line up, because mismatches might affect stuff down the pipeline. Get rid of the .md file and I'll take a deeper look once done.
- Add pymupdf, google-genai, and google-cloud-vision dependencies
- Update .gitignore for OCR credentials and test directories
- Remove SETUP.md from receipt_parser module
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Add type annotations to preprocess_text() and detect_text() in ocr.py
- Add type annotations to parse_result() in parser.py
- Import __future__.annotations for forward compatibility
…rser services
- Add module-level logger using setup_logger(__name__) in ocr.py and parser.py
- Replace print() with logger.info() for token savings message in ocr.py
- Replace multiple print() statements with logger.exception() in parser.py to properly capture stack traces on parse failures
Pull request overview
Copilot reviewed 6 out of 12 changed files in this pull request and generated 9 comments.
| load_dotenv() | ||
| # Only set credentials if not already configured (e.g., by deployment/host) | ||
| if "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ: | ||
| # Resolve relative to project root, not current working directory | ||
| project_root = Path(__file__).parent.parent.parent.parent | ||
| default_creds_path = project_root / "google_ocr_credentials.json" | ||
| if not default_creds_path.exists(): | ||
| raise FileNotFoundError( | ||
| f"GOOGLE_APPLICATION_CREDENTIALS environment variable is not set " | ||
| f"and default credentials file not found: {default_creds_path}\n" | ||
| f"Please set GOOGLE_APPLICATION_CREDENTIALS to your service account key file path." | ||
| ) | ||
| os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(default_creds_path) | ||
| WORD = re.compile(r"\w+") | ||
| api_key = os.getenv("GEMINI_API_KEY") | ||
| if not api_key: | ||
| raise RuntimeError( | ||
| "GEMINI_API_KEY environment variable must be set before initializing the Gemini client." | ||
| ) | ||
| client = genai.Client(api_key=api_key) |
settings.py performs credential discovery, reads .env, and raises exceptions at import time. This makes simply importing src.receipt_parser (or parse_result) crash in environments that don't have Vision/Gemini configured, even if receipt parsing is not used. Consider moving env/credential validation + genai.Client creation into a lazy get_client() function (or inside parse_result) so module import is side-effect-free, and raise a clear error only when the functionality is invoked.
Suggested change
- load_dotenv()
- # Only set credentials if not already configured (e.g., by deployment/host)
- if "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ:
-     # Resolve relative to project root, not current working directory
-     project_root = Path(__file__).parent.parent.parent.parent
-     default_creds_path = project_root / "google_ocr_credentials.json"
-     if not default_creds_path.exists():
-         raise FileNotFoundError(
-             f"GOOGLE_APPLICATION_CREDENTIALS environment variable is not set "
-             f"and default credentials file not found: {default_creds_path}\n"
-             f"Please set GOOGLE_APPLICATION_CREDENTIALS to your service account key file path."
-         )
-     os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(default_creds_path)
- WORD = re.compile(r"\w+")
- api_key = os.getenv("GEMINI_API_KEY")
- if not api_key:
-     raise RuntimeError(
-         "GEMINI_API_KEY environment variable must be set before initializing the Gemini client."
-     )
- client = genai.Client(api_key=api_key)
+ WORD = re.compile(r"\w+")
+ _client = None
+ def _configure_environment():
+     load_dotenv()
+     # Only set credentials if not already configured (e.g., by deployment/host)
+     if "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ:
+         # Resolve relative to project root, not current working directory
+         project_root = Path(__file__).parent.parent.parent.parent
+         default_creds_path = project_root / "google_ocr_credentials.json"
+         if not default_creds_path.exists():
+             raise FileNotFoundError(
+                 f"GOOGLE_APPLICATION_CREDENTIALS environment variable is not set "
+                 f"and default credentials file not found: {default_creds_path}\n"
+                 f"Please set GOOGLE_APPLICATION_CREDENTIALS to your service account key file path."
+             )
+         os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(default_creds_path)
+ def get_client():
+     global _client
+     if _client is not None:
+         return _client
+     _configure_environment()
+     api_key = os.getenv("GEMINI_API_KEY")
+     if not api_key:
+         raise RuntimeError(
+             "GEMINI_API_KEY environment variable must be set before initializing the Gemini client."
+         )
+     _client = genai.Client(api_key=api_key)
+     return _client
| from src.core.logging_utils import setup_logger | ||
| from src.receipt_parser.config.settings import client | ||
| from src.receipt_parser.models.receipt import ReceiptData | ||
parser.py imports client from config.settings at module import time. Because config.settings validates env vars / credentials during import, importing this module can raise before any function is called. Consider deferring the client import/initialization until inside parse_result (or using a lazy accessor) to avoid import-time failures and improve library ergonomics.
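One way to get the lazy accessor described here is to cache a factory function. This is a sketch under assumptions: `StubGeminiClient` is a hypothetical stand-in for `genai.Client` so the example runs offline, and `get_client` is an illustrative name rather than the accessor the PR ended up with.

```python
import os
from functools import lru_cache


class StubGeminiClient:
    """Hypothetical stand-in for genai.Client so the sketch runs offline."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key


@lru_cache(maxsize=1)
def get_client() -> StubGeminiClient:
    """Create the client on first use; importing this module does nothing."""
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        # Raised only when a client is actually requested. lru_cache does not
        # cache exceptions, so a later call succeeds once the variable is set.
        raise RuntimeError("GEMINI_API_KEY must be set before parsing receipts.")
    return StubGeminiClient(api_key)
```

With this shape, `parse_result` would call `get_client()` as its first statement instead of importing a module-level `client`, so importing `parser.py` has no credential requirements.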
| if file_ext == ".pdf": | ||
| with fitz.open(path) as pdf_document: | ||
| page_count = len(pdf_document) | ||
| pages_to_process = min(page_count, max_pages) if max_pages else page_count |
The max_pages check treats 0 as falsy, which will process all pages when max_pages=0 is passed. Use an explicit max_pages is not None check so callers can intentionally pass 0 (or any falsy int) and get the expected behavior.
Suggested change
- pages_to_process = min(page_count, max_pages) if max_pages else page_count
+ pages_to_process = (
+     min(page_count, max_pages) if max_pages is not None else page_count
+ )
| processed_text = preprocess_text(raw_text) | ||
| logger.info(f"Tokens Saved: {len(raw_text) - len(processed_text)}/{len(raw_text)}") | ||
| return processed_text |
The log message labels a character-count delta as "Tokens Saved" (len(raw_text) - len(processed_text)). This is not tokenization and can be misleading for debugging/cost estimation. Consider renaming to something like "Characters removed" or computing an actual token estimate if that's the intent.
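A small sketch of the renamed message, assuming a standard `logging.Logger`; `log_reduction` is a hypothetical helper name. It reports characters only and makes no attempt at a token estimate, since that would depend on the model's tokenizer.

```python
import logging

logger = logging.getLogger(__name__)


def log_reduction(raw_text: str, processed_text: str) -> int:
    """Log how many characters preprocessing removed and return that count."""
    removed = len(raw_text) - len(processed_text)
    # Lazy %-style formatting: the string is only built if INFO is enabled.
    logger.info("Characters removed: %d/%d", removed, len(raw_text))
    return removed
```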
| class ReceiptSummary(BaseModel): | ||
| number_of_items: int | ||
| subtotal: float | ||
| discount: float | ||
| delivery_fee: float | ||
| service_fee: float | ||
| tax: float | ||
| tip: float | ||
| total: float | ||
The prompt says to "Default missing numeric fields to 0", but the Pydantic models currently require all numeric fields with no defaults. If Gemini omits a field, model_validate_json will fail. Consider adding = 0 defaults (and possibly items: list[...] = [] if desired) so the schema aligns with the prompt and parsing is more resilient.
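A sketch of what the suggested defaults could look like, assuming Pydantic v2 (the PR already calls `model_validate_json`). The field names are copied from the model above; the defaults are the reviewer's suggestion, not the merged code.

```python
from pydantic import BaseModel


class ReceiptSummary(BaseModel):
    # Every numeric field falls back to zero when the response omits it.
    number_of_items: int = 0
    subtotal: float = 0.0
    discount: float = 0.0
    delivery_fee: float = 0.0
    service_fee: float = 0.0
    tax: float = 0.0
    tip: float = 0.0
    total: float = 0.0


# A response that leaves out most numeric fields still validates.
summary = ReceiptSummary.model_validate_json('{"subtotal": 10.0, "total": 11.3}')
```

Whether `total` should also default to 0 is a judgment call: a silently missing total may be worse for downstream consumers than a validation error.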
| class ReceiptData(BaseModel): | ||
| store: str | ||
| order_number: str | None = None | ||
| date: str | None = None | ||
| currency: str | ||
| items: list[ReceiptItem] |
date is modeled as str | None, which doesn't enforce the documented YYYY-MM-DD format and will accept arbitrary strings. Consider using datetime.date (or a constrained string pattern) to validate format consistently and reduce downstream parsing errors.
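To illustrate the kind of check being asked for, here is a standard-library sketch that combines a constrained pattern with calendar validation. `parse_receipt_date` is a hypothetical helper; in the actual model the same rule would more likely live in a typed field or a validator.

```python
from __future__ import annotations

import re
from datetime import date

# Exactly four digits, two digits, two digits, separated by hyphens.
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_receipt_date(value: str | None) -> date | None:
    """Accept None or a strict YYYY-MM-DD string; reject everything else."""
    if value is None:
        return None
    if not _ISO_DATE.fullmatch(value):
        raise ValueError(f"expected YYYY-MM-DD, got {value!r}")
    # fromisoformat also rejects impossible dates such as 2023-02-29.
    return date.fromisoformat(value)
```

The explicit pattern is there because `date.fromisoformat` alone accepts additional ISO 8601 forms on newer Python versions.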
| def detect_text( | ||
| path: str | Path, | ||
| *, | ||
| zoom: float = 2.0, | ||
| max_pages: int | None = None, | ||
| ) -> str: |
PR description says usage is detect_text([image_path]), but detect_text currently accepts a single path: str | Path. Either update the public API to accept a list (if intended), or adjust the documentation/README/usage guidance so callers don't get a runtime TypeError.
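If the list form is the intended public API, one option is to normalize the argument up front. This is a sketch only: `normalize_paths` is a hypothetical helper, and the PR may instead keep the single-path signature and fix the description.

```python
from __future__ import annotations

from collections.abc import Iterable
from pathlib import Path


def normalize_paths(paths: str | Path | Iterable[str | Path]) -> list[Path]:
    """Accept one path or an iterable of paths and always return a list."""
    # A str is itself iterable, so single paths must be checked for first.
    if isinstance(paths, (str, Path)):
        return [Path(paths)]
    return [Path(p) for p in paths]
```

`detect_text` could then loop over `normalize_paths(path)` so that both `detect_text(image_path)` and `detect_text([image_path])` behave as callers expect.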
| def preprocess_text(receipt_text: str) -> str: | ||
| """Clean up raw OCR text to reduce noise and token count.""" | ||
| # Remove page break markers | ||
| receipt_text = re.sub(r"-+\s*Page\s*Break\s*-+", "", receipt_text) | ||
| # Remove UPC/barcode numbers (long digit strings, optionally ending in KF) |
New receipt parsing functionality (preprocess_text/detect_text) is currently untested. Since the repo already has pytest unit tests, add unit tests for text preprocessing (deterministic) and for the PDF/image branching logic (mocking Vision/PyMuPDF) to prevent regressions without requiring external API calls.
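Because preprocessing is deterministic, its tests need no mocks. The sketch below covers only the page-break step visible in the hunk above; `strip_page_breaks` is a hypothetical extraction of that one line, and the remaining steps of `preprocess_text` are not reproduced here.

```python
import re


def strip_page_breaks(receipt_text: str) -> str:
    """Reproduce only the page-break removal step of preprocess_text."""
    return re.sub(r"-+\s*Page\s*Break\s*-+", "", receipt_text)


def test_page_break_marker_is_removed():
    raw = "MILK 2.99\n--- Page Break ---\nBREAD 3.49"
    # The marker disappears; the newlines on either side of it remain.
    assert strip_page_breaks(raw) == "MILK 2.99\n\nBREAD 3.49"


def test_text_without_marker_is_unchanged():
    raw = "MILK 2.99\nBREAD 3.49"
    assert strip_page_breaks(raw) == raw
```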
| def parse_result(receipt_text: str, model: str = current_model) -> ReceiptData: | ||
| """Parse OCR text into structured receipt data using Gemini.""" | ||
| try: | ||
| response = client.models.generate_content( | ||
| model=model, | ||
| contents=receipt_text, | ||
| config={ | ||
| "system_instruction": RECEIPT_PARSER_PROMPT, | ||
| "response_mime_type": "application/json", | ||
| "response_schema": ReceiptData.model_json_schema(), | ||
| "thinking_config": { | ||
| "thinking_budget": 0 # no thinking | ||
| }, | ||
| }, | ||
| ) | ||
| return ReceiptData.model_validate_json(response.text) | ||
| except Exception: | ||
| logger.exception("Failed to parse receipt text") |
parse_result is core library behavior but has no automated tests. Add tests that validate the request configuration passed to the Gemini client and that JSON parsing/validation failures are handled/logged as expected (using a stubbed/mocked client).
Suggested change
- def parse_result(receipt_text: str, model: str = current_model) -> ReceiptData:
-     """Parse OCR text into structured receipt data using Gemini."""
-     try:
-         response = client.models.generate_content(
-             model=model,
-             contents=receipt_text,
-             config={
-                 "system_instruction": RECEIPT_PARSER_PROMPT,
-                 "response_mime_type": "application/json",
-                 "response_schema": ReceiptData.model_json_schema(),
-                 "thinking_config": {
-                     "thinking_budget": 0 # no thinking
-                 },
-             },
-         )
-         return ReceiptData.model_validate_json(response.text)
-     except Exception:
-         logger.exception("Failed to parse receipt text")
+ def _build_generation_config() -> dict:
+     """Build the Gemini request configuration used for receipt parsing."""
+     return {
+         "system_instruction": RECEIPT_PARSER_PROMPT,
+         "response_mime_type": "application/json",
+         "response_schema": ReceiptData.model_json_schema(),
+         "thinking_config": {
+             "thinking_budget": 0 # no thinking
+         },
+     }
+ def parse_result(
+     receipt_text: str,
+     model: str = current_model,
+     gemini_client=client,
+ ) -> ReceiptData:
+     """Parse OCR text into structured receipt data using Gemini."""
+     config = _build_generation_config()
+     try:
+         response = gemini_client.models.generate_content(
+             model=model,
+             contents=receipt_text,
+             config=config,
+         )
+     except Exception:
+         logger.exception("Failed to request receipt parsing from Gemini")
+         raise
+     try:
+         return ReceiptData.model_validate_json(response.text)
+     except Exception:
+         logger.exception("Failed to parse or validate Gemini receipt response")
What is this issue for and how does it solve it
`detect_text([image_path])` and `parse_result(text)`
Link to the Github Issue
#184 Receipt OCR Test