feat: add receipt_parser library by ian-yeh · Pull Request #211 · McMaster-Solar-Car-Project/purchase-request-site

ian-yeh · 2026-04-04T00:33:49Z

What is this issue for and how does it solve it

Moved the receipt OCR Jupyter notebook test into a dedicated Python library.
It can be referenced directly in the codebase with detect_text([image_path]) and parse_result(text).

Link to the Github Issue

#184 Receipt OCR Test

qodo-code-review · 2026-04-04T00:34:06Z

Review Summary by Qodo

Add receipt_parser library with OCR and AI parsing capabilities

✨ Enhancement

Walkthroughs

Description

• Created dedicated receipt_parser library from Jupyter notebook
• Implemented OCR text detection with Google Cloud Vision API
• Added structured receipt data models using Pydantic
• Integrated Gemini AI for intelligent receipt parsing
• Included text preprocessing to reduce noise and token usage

Diagram

flowchart LR
  Image["Image/PDF File"] -- "detect_text" --> OCR["Google Cloud Vision OCR"]
  OCR -- "preprocess_text" --> CleanText["Cleaned OCR Text"]
  CleanText -- "parse_result" --> Gemini["Gemini AI Parser"]
  Gemini -- "JSON Schema" --> Models["ReceiptData Model"]
  Models --> Output["Structured Receipt Data"]

File Changes

1. src/receipt_parser/config/settings.py ⚙️ Configuration changes +13/-0

Configuration and API client initialization

src/receipt_parser/config/settings.py

2. src/receipt_parser/models/__init__.py ✨ Enhancement +3/-0

Export receipt data model classes

src/receipt_parser/models/__init__.py

3. src/receipt_parser/models/receipt.py ✨ Enhancement +28/-0

Define Pydantic models for receipt data

src/receipt_parser/models/receipt.py

4. src/receipt_parser/services/ocr.py ✨ Enhancement +114/-0

OCR detection and text preprocessing logic

src/receipt_parser/services/ocr.py

5. src/receipt_parser/services/parser.py ✨ Enhancement +41/-0

Gemini-based receipt parsing with schema validation

src/receipt_parser/services/parser.py

6. src/receipt_parser/SETUP.md 📝 Documentation +10/-0

Setup instructions and environment configuration

src/receipt_parser/SETUP.md

7. src/receipt_parser/__init__.py Additional files +0/-0

...

src/receipt_parser/__init__.py

8. src/receipt_parser/config/__init__.py Additional files +0/-0

...

src/receipt_parser/config/__init__.py

9. src/receipt_parser/services/__init__.py Additional files +0/-0

...

src/receipt_parser/services/__init__.py

10. src/receipt_parser/utils/__init__.py Additional files +0/-0

...

src/receipt_parser/utils/__init__.py

qodo-code-review · 2026-04-04T00:34:08Z

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)

1. ~~Broken package imports~~ ☑ 🐞 Bug ≡ Correctness

Description

src/receipt_parser/... uses absolute imports starting with receipt_parser.*, but the app is
launched as uvicorn src.main:app so receipt_parser is not importable from the container’s
default sys.path, causing ModuleNotFoundError when importing parser/models.

Code

src/receipt_parser/services/parser.py[R1-2]
+from receipt_parser.config.settings import client
+from receipt_parser.models.receipt import ReceiptData

Evidence

The Docker image runs uvicorn src.main:app from /app, and only copies src/ into /app/src.
Therefore, modules should be imported as src.receipt_parser... or via relative imports. However,
both parser.py and models/__init__.py import receipt_parser.* as if it were a top-level
package, which will fail in this runtime layout.

Dockerfile[62-70]
src/receipt_parser/services/parser.py[1-2]
src/receipt_parser/models/__init__.py[1-3]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Modules under `src/receipt_parser` import `receipt_parser.*` as a top-level package, but this repo runs as `uvicorn src.main:app` (with `/app` on `sys.path`). That means `receipt_parser` is not importable, and importing the receipt parser will raise `ModuleNotFoundError`.
### Issue Context
`receipt_parser` lives at `/app/src/receipt_parser` in the container. In that layout, imports should be relative (preferred within the package) or start with `src.receipt_parser`.
### Fix Focus Areas
- src/receipt_parser/services/parser.py[1-2]
- src/receipt_parser/models/__init__.py[1-3]
### Implementation notes
- In `services/parser.py`, change to `from ..config.settings import client` and `from ..models.receipt import ReceiptData` (or `from src.receipt_parser...`).
- In `models/__init__.py`, change to `from .receipt import ReceiptData, ReceiptItem, ReceiptSummary`.

2. ~~Credentials env overridden~~ ☑ 🐞 Bug ☼ Reliability

Description

settings.py unconditionally sets GOOGLE_APPLICATION_CREDENTIALS to ocr_demo_key.json at import
time, overriding host/application credentials and making auth depend on current working directory.

Code

src/receipt_parser/config/settings.py[R7-8]
+load_dotenv()
+os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"

Evidence

Importing receipt_parser.config.settings will always overwrite any pre-configured
GOOGLE_APPLICATION_CREDENTIALS value. The setup doc also instructs renaming the key file to
ocr_demo_key.json, confirming this is not a fallback but a hard requirement baked into code.

src/receipt_parser/config/settings.py[7-9]
src/receipt_parser/SETUP.md[8-10]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`src/receipt_parser/config/settings.py` mutates `os.environ["GOOGLE_APPLICATION_CREDENTIALS"]` at import time. This overrides credentials configured by the deployment/host process and is fragile because it uses a relative filename.
### Issue Context
This module is imported by other receipt_parser modules; any import triggers the override automatically.
### Fix Focus Areas
- src/receipt_parser/config/settings.py[7-9]
- src/receipt_parser/SETUP.md[8-10]
### Implementation notes
- Remove the unconditional assignment to `GOOGLE_APPLICATION_CREDENTIALS`.
- Instead, read `GOOGLE_APPLICATION_CREDENTIALS` (or a receipt-parser-specific env var like `RECEIPT_PARSER_GCP_CREDENTIALS_PATH`) and:
- if missing, raise a clear `ValueError` explaining how to configure it, OR
- if you want a default, only set it *if not already set* and resolve to an absolute path (e.g., relative to the repo or module), and validate the file exists before proceeding.
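
The "only set it if not already set" option above could look roughly like the following. This is a minimal sketch, not the PR's code: the helper name `resolve_credentials_path` and its `default` parameter are hypothetical, and it only resolves and validates a path without mutating `os.environ`.

```python
from __future__ import annotations

import os
from pathlib import Path


def resolve_credentials_path(default: Path | None = None) -> Path:
    """Return the credentials file path, preferring the existing env var.

    Hypothetical helper: an already-configured value always wins, and the
    optional default is used only as a fallback.
    """
    configured = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if configured:
        path = Path(configured)
    elif default is not None:
        path = default
    else:
        raise ValueError(
            "GOOGLE_APPLICATION_CREDENTIALS is not set; point it at the "
            "absolute path of your service account key file."
        )

    # Resolve to an absolute path so behaviour does not depend on the cwd.
    path = path.expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Credentials file not found: {path}")
    return path
```

Calling this from the function that builds the Vision client, rather than at import time, keeps importing the module free of side effects.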

3. ~~PDF OCR buffers all pages~~ ☑ 🐞 Bug ➹ Performance

Description

detect_text() converts all PDF pages to PNG bytes and stores them in a list before OCR, which can
spike memory on larger PDFs; additionally, if an exception occurs during PDF-to-image conversion,
the PDF handle may not be closed.

Code

src/receipt_parser/services/ocr.py[R76-88]

+    image_contents = []
+
+    if file_ext == ".pdf":
+        # opening PDF and iterating through all pages
+        pdf_document = fitz.open(path)
+        for page_num in range(len(pdf_document)):
+            page = pdf_document[page_num]
+
+            # convert each page to an image
+            matrix = fitz.Matrix(2, 2)
+            pix = page.get_pixmap(matrix=matrix)
+            image_contents.append(pix.tobytes("png"))
+        pdf_document.close()

Evidence

For PDFs, the code appends each page’s PNG bytes into image_contents (a full in-memory buffer) and
only closes the PDF after the loop completes. There is no try/finally or context manager around
fitz.open(), so errors inside the conversion loop can skip pdf_document.close().

src/receipt_parser/services/ocr.py[76-88]
src/receipt_parser/services/ocr.py[94-107]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`detect_text()` buffers all PDF pages as PNG bytes in memory before OCR and does not guarantee the PDF is closed if conversion fails mid-loop.
### Issue Context
This can cause high memory usage for multi-page PDFs and can leak file handles if an exception occurs during page rendering.
### Fix Focus Areas
- src/receipt_parser/services/ocr.py[76-107]
### Implementation notes
- Use a context manager: `with fitz.open(path) as pdf_document:`
- Process each page sequentially: render one page -> call Vision OCR -> append text -> discard bytes, rather than building `image_contents` for all pages.
- (Optional) expose tuning knobs like zoom/matrix or a max page limit to avoid pathological inputs.
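
The page-at-a-time flow described in these notes can be sketched as a generator. To keep the sketch runnable without PyMuPDF or the Vision SDK, `render` and `ocr` are hypothetical callables standing in for the page rendering and the Vision call, and `ocr_pages` is an illustrative name rather than anything in the PR.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable, Iterator


def ocr_pages(
    pages: Iterable[object],
    render: Callable[[object], bytes],
    ocr: Callable[[bytes], str],
    max_pages: int | None = None,
) -> Iterator[str]:
    """Yield OCR text one page at a time.

    Only one page's image bytes are alive at any moment, instead of a list
    holding every rendered page.
    """
    for index, page in enumerate(pages):
        if max_pages is not None and index >= max_pages:
            break
        image_bytes = render(page)  # stand-in for rendering a page to PNG bytes
        yield ocr(image_bytes)      # stand-in for the Vision OCR request
        # image_bytes is rebound on the next iteration, so it can be freed


# Usage with trivial stand-ins:
texts = list(ocr_pages(["p1", "p2"], render=str.encode, ocr=bytes.decode))
```

In the real function, the loop would sit inside the `with fitz.open(path) as pdf_document:` block suggested above, so the document is closed even if rendering or OCR raises midway.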

4. ~~No GEMINI key validation~~ ☑ 🐞 Bug ☼ Reliability

Description

settings.py constructs a global Gemini client from GEMINI_API_KEY without checking that the key
exists, so the first call to parse_result() can fail with a generic downstream exception rather
than a clear configuration error.

Code

src/receipt_parser/config/settings.py[R12-13]
+api_key = os.getenv("GEMINI_API_KEY")
+client = genai.Client(api_key=api_key)

Evidence

api_key is read directly from the environment and passed into genai.Client(...) without
validation. parse_result() then relies on this global client for every request.

src/receipt_parser/config/settings.py[12-13]
src/receipt_parser/services/parser.py[21-36]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The Gemini client is created from `GEMINI_API_KEY` without validating that the env var is set. This pushes configuration errors to runtime and makes failures harder to diagnose.
### Issue Context
`parse_result()` depends on the global `client` imported from settings.
### Fix Focus Areas
- src/receipt_parser/config/settings.py[12-13]
- src/receipt_parser/services/parser.py[21-36]
### Implementation notes
- Add an explicit check: if `GEMINI_API_KEY` is missing/blank, raise `ValueError("GEMINI_API_KEY is required ...")`.
- Consider lazy initialization (e.g., `get_genai_client()`), so importing the module doesn’t create external-service clients automatically.
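
A lazy accessor along the lines of `get_genai_client()` might look like this. It is a sketch under assumptions: `FakeClient` is a stand-in so the example runs without the `google-genai` package, and in the real module the return line would construct `genai.Client(api_key=api_key)` instead.

```python
import os
from functools import lru_cache


class FakeClient:
    """Stand-in for genai.Client, used only to keep this sketch runnable."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key


@lru_cache(maxsize=1)
def get_genai_client() -> FakeClient:
    """Create the client on first use and reuse it afterwards."""
    api_key = os.getenv("GEMINI_API_KEY", "").strip()
    if not api_key:
        # Raised only when the feature is actually used, not at import time.
        raise ValueError("GEMINI_API_KEY is required to use the receipt parser")
    return FakeClient(api_key=api_key)
```

`lru_cache` does not cache a call that raised, so a missing key can be fixed and the call retried without restarting the process.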

5. ~~Stdout prints in library~~ ☑ 🐞 Bug ≡ Correctness

Description

OCR and parsing code prints operational data and error details to stdout, bypassing the repo’s
structured logging and creating noisy/unstructured logs in production.

Code

src/receipt_parser/services/ocr.py[R111-113]

+    processed_text = preprocess_text(raw_text)
+    print(f"Tokens Saved: {len(raw_text) - len(processed_text)}/{len(raw_text)}")
+    return processed_text

Evidence

detect_text() prints token savings and parse_result() prints exception details. The repo already
provides a setup_logger() helper for consistent structured logging across modules.

src/receipt_parser/services/ocr.py[109-113]
src/receipt_parser/services/parser.py[37-41]
src/core/logging_utils.py[1-54]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Library code uses `print()` for both normal operational messages and exception reporting, which bypasses the project’s logging system.
### Issue Context
The repo has `src/core/logging_utils.setup_logger()` used elsewhere for consistent logs.
### Fix Focus Areas
- src/receipt_parser/services/ocr.py[109-113]
- src/receipt_parser/services/parser.py[37-41]
- src/core/logging_utils.py[21-54]
### Implementation notes
- Create a module-level logger (e.g., `logger = setup_logger(__name__)`) and replace prints with `logger.info/debug`.
- For exceptions, prefer `logger.exception("...")` inside the `except` block to capture stack traces.
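
As a rough illustration of the `logger.exception` pattern, the sketch below uses the standard library's `logging.getLogger` in place of the project's `setup_logger` helper, whose signature is not shown in this thread. The function `parse_quantity` is a hypothetical example, not code from the PR.

```python
import logging

# Assumption: stdlib logger as a stand-in for setup_logger(__name__).
logger = logging.getLogger("receipt_parser.sketch")


def parse_quantity(raw: str) -> int:
    """Convert a quantity string to int, logging the traceback on failure."""
    try:
        return int(raw)
    except ValueError:
        # logger.exception logs at ERROR level and attaches the active
        # traceback, so separate print() calls for type/message/repr are
        # not needed. The message deliberately omits the raw input.
        logger.exception("Failed to parse quantity")
        raise
```

Re-raising after logging keeps the caller in control of error handling, while the log record still carries the stack trace.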

Copilot

Pull request overview

This PR introduces a new receipt_parser Python library intended to turn the prior receipt OCR notebook prototype into reusable code (OCR via Google Cloud Vision + structured parsing via Gemini).

Changes:

- Added OCR utilities to extract and preprocess text from images and PDFs using Google Cloud Vision.
- Added Gemini-based parsing to convert OCR text into structured ReceiptData Pydantic models.
- Added initial setup documentation for required environment variables and credentials.

Reviewed changes

Copilot reviewed 6 out of 10 changed files in this pull request and generated 13 comments.

File	Description
src/receipt_parser/SETUP.md	Documents env vars and credential setup for OCR + Gemini usage.
src/receipt_parser/config/settings.py	Loads env/config and constructs a Gemini client.
src/receipt_parser/models/receipt.py	Introduces Pydantic models for receipt items/summary/data.
src/receipt_parser/models/__init__.py	Exposes receipt models via package exports.
src/receipt_parser/services/ocr.py	Implements PDF/image OCR and text preprocessing via Vision API.
src/receipt_parser/services/parser.py	Implements Gemini JSON-schema-based parsing into `ReceiptData`.

Copilot · 2026-04-04T00:38:22Z

+    except Exception as e:
+        print(f"Error type: {type(e).__name__}")
+        print(f"Error message: {str(e)}")
+        print(f"Full error: {repr(e)}")
+        raise


This exception handler prints error details directly to stdout. In the FastAPI app this bypasses the configured logging pipeline (setup_logger) and can leak request/receipt content in logs unexpectedly. Prefer using the project logger (or remove the prints and re-raise with a sanitized message).

Copilot · 2026-04-04T00:38:22Z

+from dotenv import load_dotenv
+from google import genai
+
+load_dotenv()
+os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"
+
+WORD = re.compile(r"\w+")
+
+api_key = os.getenv("GEMINI_API_KEY")


This module performs side effects at import time (loading dotenv and overwriting GOOGLE_APPLICATION_CREDENTIALS with a hardcoded relative filename). This can break deployments and also prevents users from configuring credentials via environment variables. Prefer not mutating os.environ here; instead read GOOGLE_APPLICATION_CREDENTIALS (or accept a path parameter) and fail with a clear error if it’s not set.

Suggested change

-from dotenv import load_dotenv
-from google import genai
-
-load_dotenv()
-os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"
-
-WORD = re.compile(r"\w+")
-
-api_key = os.getenv("GEMINI_API_KEY")
+from google import genai
+
+def _get_required_env(name: str) -> str:
+    value = os.getenv(name)
+    if not value:
+        raise RuntimeError(
+            f"Required environment variable '{name}' is not set."
+        )
+    return value
+
+WORD = re.compile(r"\w+")
+
+GOOGLE_APPLICATION_CREDENTIALS = _get_required_env("GOOGLE_APPLICATION_CREDENTIALS")
+
+api_key = _get_required_env("GEMINI_API_KEY")

Copilot · 2026-04-04T00:38:24Z

+get credentials from [Google Cloud Console](https://console.cloud.google.com/apis/credentials) then
+rename the file to `ocr_demo_key.json`. 


This setup doc instructs renaming a service account key file to a fixed filename. Combined with settings.py forcing GOOGLE_APPLICATION_CREDENTIALS to that name, it encourages keeping credentials in the repo working directory and makes secrets easier to accidentally commit. Prefer documenting GOOGLE_APPLICATION_CREDENTIALS=/absolute/path/to/key.json and explicitly instructing users not to commit the key file.

Suggested change

-get credentials from [Google Cloud Console](https://console.cloud.google.com/apis/credentials) then
-rename the file to `ocr_demo_key.json`.
+get credentials from [Google Cloud Console](https://console.cloud.google.com/apis/credentials), store the downloaded service account key in a secure location outside this repository, and set `GOOGLE_APPLICATION_CREDENTIALS` to its absolute path (for example, `GOOGLE_APPLICATION_CREDENTIALS=/absolute/path/to/key.json`).
+
+Do not rename the key to a fixed filename in the project directory, and do not commit the key file to version control.

Copilot · 2026-04-04T00:38:24Z

+def parse_result(receipt_text, model=current_model):
+    """Parse OCR text into structured receipt data using Gemini."""
+    try:
+        response = client.models.generate_content(
+            model=model,
+            contents=receipt_text,
+            config={
+                "system_instruction": RECEIPT_PARSER_PROMPT,
+                "response_mime_type": "application/json",
+                "response_schema": ReceiptData.model_json_schema(),
+                "thinking_config": {
+                    "thinking_budget": 0  # no thinking
+                },
+            },
+        )
+        return ReceiptData.model_validate_json(response.text)


New parsing behavior is introduced here but there are no accompanying tests. Since the repo already uses pytest (tests/unit/*), add unit tests that mock the Gemini client to verify: schema validation/model_validate_json handling, error propagation, and at least one happy-path response.

Suggested change

-def parse_result(receipt_text, model=current_model):
-    """Parse OCR text into structured receipt data using Gemini."""
-    try:
-        response = client.models.generate_content(
-            model=model,
-            contents=receipt_text,
-            config={
-                "system_instruction": RECEIPT_PARSER_PROMPT,
-                "response_mime_type": "application/json",
-                "response_schema": ReceiptData.model_json_schema(),
-                "thinking_config": {
-                    "thinking_budget": 0  # no thinking
-                },
-            },
-        )
-        return ReceiptData.model_validate_json(response.text)
+def _build_generation_config(receipt_model=ReceiptData):
+    """Build the Gemini response configuration for structured receipt parsing."""
+    return {
+        "system_instruction": RECEIPT_PARSER_PROMPT,
+        "response_mime_type": "application/json",
+        "response_schema": receipt_model.model_json_schema(),
+        "thinking_config": {
+            "thinking_budget": 0  # no thinking
+        },
+    }
+
+def _validate_receipt_response(response_text, receipt_model=ReceiptData):
+    """Validate Gemini JSON output against the receipt schema."""
+    return receipt_model.model_validate_json(response_text)
+
+def parse_result(receipt_text, model=current_model, gemini_client=client, receipt_model=ReceiptData):
+    """Parse OCR text into structured receipt data using Gemini."""
+    try:
+        response = gemini_client.models.generate_content(
+            model=model,
+            contents=receipt_text,
+            config=_build_generation_config(receipt_model),
+        )
+        return _validate_receipt_response(response.text, receipt_model)
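
The kind of unit test requested here might be sketched as follows. To stay runnable without Pydantic or the Gemini SDK, `parse_with` is a hypothetical stand-in for `parse_result` that takes an injected client and decodes JSON with the standard library; the model name and receipt text are made up for illustration.

```python
import json
from unittest.mock import MagicMock


def parse_with(client, receipt_text: str) -> dict:
    """Hypothetical stand-in for parse_result using an injected client."""
    response = client.models.generate_content(
        model="stub-model", contents=receipt_text
    )
    return json.loads(response.text)


def test_happy_path() -> None:
    client = MagicMock()
    client.models.generate_content.return_value.text = (
        '{"store": "Acme", "currency": "CAD"}'
    )
    result = parse_with(client, "ACME receipt text")
    assert result["store"] == "Acme"
    client.models.generate_content.assert_called_once()


def test_client_error_propagates() -> None:
    client = MagicMock()
    client.models.generate_content.side_effect = RuntimeError("boom")
    try:
        parse_with(client, "text")
    except RuntimeError:
        return
    raise AssertionError("expected RuntimeError")


def test_invalid_json_raises() -> None:
    client = MagicMock()
    client.models.generate_content.return_value.text = "not json"
    try:
        parse_with(client, "text")
    except json.JSONDecodeError:
        return
    raise AssertionError("expected JSONDecodeError")
```

Passing the client as an argument, as the suggested `gemini_client` parameter does, is what makes this style of test possible without patching module globals.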

Copilot · 2026-04-04T00:38:24Z

+def detect_text(path):
+    """
+    Detects text in a file using Google Cloud Vision OCR.
+    Handles images and multi-page PDFs by converting PDF pages to images.
+    """
+    vision_client = vision.ImageAnnotatorClient()
+    file_ext = Path(path).suffix.lower()
+    all_text = []
+
+    image_contents = []
+
+    if file_ext == ".pdf":
+        # opening PDF and iterating through all pages
+        pdf_document = fitz.open(path)
+        for page_num in range(len(pdf_document)):
+            page = pdf_document[page_num]
+
+            # convert each page to an image
+            matrix = fitz.Matrix(2, 2)
+            pix = page.get_pixmap(matrix=matrix)
+            image_contents.append(pix.tobytes("png"))
+        pdf_document.close()
+    else:
+        # Handle standard image files (png, jpg, etc.)
+        with open(path, "rb") as image_file:
+            image_contents.append(image_file.read())
+
+    # Process each image/page through Vision OCR
+    for content in image_contents:
+        image = vision.Image(content=content)
+
+        # We use document_text_detection for better handling of dense text/receipts
+        response = vision_client.document_text_detection(image=image)
+
+        if response.error.message:
+            raise Exception(f"Vision API Error: {response.error.message}")
+
+        # text_annotations[0] contains the entire page's text as a single string
+        if response.text_annotations:
+            page_text = response.text_annotations[0].description
+            all_text.append(page_text)
+


New OCR logic is added here (PDF page rendering + Vision API calls) but there are no tests covering file-type branching, multi-page PDF handling, and Vision API error handling. Add unit tests that mock fitz.open and vision.ImageAnnotatorClient to validate the behavior without external calls.

Copilot · 2026-04-04T00:38:24Z

+from google import genai
+
+load_dotenv()
+os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"
+
+WORD = re.compile(r"\w+")
+
+api_key = os.getenv("GEMINI_API_KEY")
+client = genai.Client(api_key=api_key)


from google import genai depends on the google-genai package, but it’s not declared in pyproject.toml (and not present in uv.lock). This will raise ImportError in deployed environments. Add the dependency (or gate it behind an optional extra) and document the required install path for the receipt parser feature.

Suggested change

-from google import genai
-
-load_dotenv()
-os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"
-
-WORD = re.compile(r"\w+")
-
-api_key = os.getenv("GEMINI_API_KEY")
-client = genai.Client(api_key=api_key)
+try:
+    from google import genai
+except ImportError:
+    genai = None
+
+RECEIPT_PARSER_INSTALL_HINT = (
+    "The receipt parser Gemini integration requires the optional "
+    "`google-genai` package. Install it before using this feature."
+)
+
+load_dotenv()
+os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"
+
+WORD = re.compile(r"\w+")
+
+api_key = os.getenv("GEMINI_API_KEY")
+
+def get_genai_client():
+    if genai is None:
+        raise RuntimeError(RECEIPT_PARSER_INSTALL_HINT)
+    if not api_key:
+        raise RuntimeError(
+            "The receipt parser Gemini integration requires the "
+            "`GEMINI_API_KEY` environment variable to be set."
+        )
+    return genai.Client(api_key=api_key)
+
+client = get_genai_client() if genai is not None and api_key else None

rajpandya737

good start @ian-yeh, address the AI comments and make sure there is good type hinting, we want to make sure that the types always line up because that might affect stuff down the pipeline. get rid of the .md file and I'll take a deeper look once done

- Add pymupdf, google-genai, and google-cloud-vision dependencies
- Update .gitignore for OCR credentials and test directories
- Remove SETUP.md from receipt_parser module

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…ar-Project/purchase-request-site into receipt-ocr-lib

- Add type annotations to preprocess_text() and detect_text() in ocr.py
- Add type annotations to parse_result() in parser.py
- Import __future__.annotations for forward compatibility

…rser services
- Add module-level logger using setup_logger(__name__) in ocr.py and parser.py
- Replace print() with logger.info() for token savings message in ocr.py
- Replace multiple print() statements with logger.exception() in parser.py to properly capture stack traces on parse failures

…mory issues

…date file exists

Copilot

Pull request overview

Copilot reviewed 6 out of 12 changed files in this pull request and generated 9 comments.

Copilot · 2026-04-07T05:58:02Z

+load_dotenv()
+
+# Only set credentials if not already configured (e.g., by deployment/host)
+if "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ:
+    # Resolve relative to project root, not current working directory
+    project_root = Path(__file__).parent.parent.parent.parent
+    default_creds_path = project_root / "google_ocr_credentials.json"
+
+    if not default_creds_path.exists():
+        raise FileNotFoundError(
+            f"GOOGLE_APPLICATION_CREDENTIALS environment variable is not set "
+            f"and default credentials file not found: {default_creds_path}\n"
+            f"Please set GOOGLE_APPLICATION_CREDENTIALS to your service account key file path."
+        )
+
+    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(default_creds_path)
+
+WORD = re.compile(r"\w+")
+
+api_key = os.getenv("GEMINI_API_KEY")
+if not api_key:
+    raise RuntimeError(
+        "GEMINI_API_KEY environment variable must be set before initializing the Gemini client."
+    )
+client = genai.Client(api_key=api_key)


settings.py performs credential discovery, reads .env, and raises exceptions at import time. This makes simply importing src.receipt_parser (or parse_result) crash in environments that don't have Vision/Gemini configured, even if receipt parsing is not used. Consider moving env/credential validation + genai.Client creation into a lazy get_client() function (or inside parse_result) so module import is side-effect-free, and raise a clear error only when the functionality is invoked.

Suggested change

-load_dotenv()
-
-# Only set credentials if not already configured (e.g., by deployment/host)
-if "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ:
-    # Resolve relative to project root, not current working directory
-    project_root = Path(__file__).parent.parent.parent.parent
-    default_creds_path = project_root / "google_ocr_credentials.json"
-
-    if not default_creds_path.exists():
-        raise FileNotFoundError(
-            f"GOOGLE_APPLICATION_CREDENTIALS environment variable is not set "
-            f"and default credentials file not found: {default_creds_path}\n"
-            f"Please set GOOGLE_APPLICATION_CREDENTIALS to your service account key file path."
-        )
-
-    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(default_creds_path)
-
-WORD = re.compile(r"\w+")
-
-api_key = os.getenv("GEMINI_API_KEY")
-if not api_key:
-    raise RuntimeError(
-        "GEMINI_API_KEY environment variable must be set before initializing the Gemini client."
-    )
-client = genai.Client(api_key=api_key)
+WORD = re.compile(r"\w+")
+
+_client = None
+
+def _configure_environment():
+    load_dotenv()
+
+    # Only set credentials if not already configured (e.g., by deployment/host)
+    if "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ:
+        # Resolve relative to project root, not current working directory
+        project_root = Path(__file__).parent.parent.parent.parent
+        default_creds_path = project_root / "google_ocr_credentials.json"
+
+        if not default_creds_path.exists():
+            raise FileNotFoundError(
+                f"GOOGLE_APPLICATION_CREDENTIALS environment variable is not set "
+                f"and default credentials file not found: {default_creds_path}\n"
+                f"Please set GOOGLE_APPLICATION_CREDENTIALS to your service account key file path."
+            )
+
+        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(default_creds_path)
+
+def get_client():
+    global _client
+    if _client is not None:
+        return _client
+
+    _configure_environment()
+
+    api_key = os.getenv("GEMINI_API_KEY")
+    if not api_key:
+        raise RuntimeError(
+            "GEMINI_API_KEY environment variable must be set before initializing the Gemini client."
+        )
+
+    _client = genai.Client(api_key=api_key)
+    return _client

Copilot · 2026-04-07T05:58:02Z

+from src.core.logging_utils import setup_logger
+from src.receipt_parser.config.settings import client
+from src.receipt_parser.models.receipt import ReceiptData
+


parser.py imports client from config.settings at module import time. Because config.settings validates env vars / credentials during import, importing this module can raise before any function is called. Consider deferring the client import/initialization until inside parse_result (or using a lazy accessor) to avoid import-time failures and improve library ergonomics.

Copilot · 2026-04-07T05:58:02Z

+    if file_ext == ".pdf":
+        with fitz.open(path) as pdf_document:
+            page_count = len(pdf_document)
+            pages_to_process = min(page_count, max_pages) if max_pages else page_count


The max_pages check treats 0 as falsy, which will process all pages when max_pages=0 is passed. Use an explicit max_pages is not None check so callers can intentionally pass 0 (or any falsy int) and get the expected behavior.

Suggested change

-            pages_to_process = min(page_count, max_pages) if max_pages else page_count
+            pages_to_process = (
+                min(page_count, max_pages) if max_pages is not None else page_count
+            )
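
The difference between the two checks is easiest to see side by side. The two helper functions below are hypothetical and exist only to isolate the expression from the surrounding `detect_text` code.

```python
from __future__ import annotations


def pages_truthy(page_count: int, max_pages: int | None = None) -> int:
    """Original check: 0 is falsy, so max_pages=0 means 'no limit'."""
    return min(page_count, max_pages) if max_pages else page_count


def pages_explicit(page_count: int, max_pages: int | None = None) -> int:
    """Suggested check: only None means 'no limit'."""
    return min(page_count, max_pages) if max_pages is not None else page_count


# With a 5-page PDF and max_pages=0 the two versions disagree:
truthy_result = pages_truthy(5, 0)      # processes all 5 pages
explicit_result = pages_explicit(5, 0)  # processes 0 pages
```

Neither version guards against a negative `max_pages`; whether to reject or clamp such values is a separate decision for the public API.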

Copilot · 2026-04-07T05:58:02Z

+    processed_text = preprocess_text(raw_text)
+    logger.info(f"Tokens Saved: {len(raw_text) - len(processed_text)}/{len(raw_text)}")
+    return processed_text


The log message labels a character-count delta as "Tokens Saved" (len(raw_text) - len(processed_text)). This is not tokenization and can be misleading for debugging/cost estimation. Consider renaming to something like "Characters removed" or computing an actual token estimate if that's the intent.
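
One way to make the metric honest about what it measures is sketched below. The function name `reduction_stats` is hypothetical, and the word count based on the `\w+` pattern is only a rough proxy: it is not the tokenizer Gemini uses, so it should not be read as a token count either.

```python
import re

WORD = re.compile(r"\w+")


def reduction_stats(raw_text: str, processed_text: str) -> dict[str, int]:
    """Report what preprocessing removed, in units that match the numbers."""
    return {
        "characters_removed": len(raw_text) - len(processed_text),
        # Rough proxy only; real token counts depend on the model tokenizer.
        "words_removed": len(WORD.findall(raw_text))
        - len(WORD.findall(processed_text)),
    }


stats = reduction_stats("a  b  c", "a b")
```

A log line built from these keys, such as "Characters removed: 4/7", then says exactly what was counted.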

Copilot · 2026-04-07T05:58:03Z

+class ReceiptSummary(BaseModel):
+    number_of_items: int
+    subtotal: float
+    discount: float
+    delivery_fee: float
+    service_fee: float
+    tax: float
+    tip: float
+    total: float
+


The prompt says to "Default missing numeric fields to 0", but the Pydantic models currently require all numeric fields with no defaults. If Gemini omits a field, model_validate_json will fail. Consider adding = 0 defaults (and possibly items: list[...] = [] if desired) so the schema aligns with the prompt and parsing is more resilient.

Copilot · 2026-04-07T05:58:03Z

+class ReceiptData(BaseModel):
+    store: str
+    order_number: str | None = None
+    date: str | None = None
+    currency: str
+    items: list[ReceiptItem]


date is modeled as str | None, which doesn't enforce the documented YYYY-MM-DD format and will accept arbitrary strings. Consider using datetime.date (or a constrained string pattern) to validate format consistently and reduce downstream parsing errors.
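
A strict YYYY-MM-DD check can be sketched with the standard library alone. The helper `parse_receipt_date` is hypothetical; in the actual model the same logic would more likely live in a `datetime.date` field or a field validator. The explicit pattern is there because `date.fromisoformat` on newer Python versions also accepts other ISO 8601 spellings.

```python
from __future__ import annotations

import re
from datetime import date

# Exactly four, two, and two ASCII digits separated by hyphens.
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_receipt_date(value: str | None) -> date | None:
    """Return a date for 'YYYY-MM-DD' input, None for None, else raise."""
    if value is None:
        return None
    if not _ISO_DATE.fullmatch(value):
        raise ValueError(f"Expected YYYY-MM-DD, got {value!r}")
    # fromisoformat also rejects impossible dates such as 2026-02-30.
    return date.fromisoformat(value)
```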

Copilot · 2026-04-07T05:58:03Z

+def detect_text(
+    path: str | Path,
+    *,
+    zoom: float = 2.0,
+    max_pages: int | None = None,
+) -> str:


PR description says usage is detect_text([image_path]), but detect_text currently accepts a single path: str | Path. Either update the public API to accept a list (if intended), or adjust the documentation/README/usage guidance so callers don't get a runtime TypeError.

Copilot · 2026-04-07T05:58:03Z

+def preprocess_text(receipt_text: str) -> str:
+    """Clean up raw OCR text to reduce noise and token count."""
+
+    # Remove page break markers
+    receipt_text = re.sub(r"-+\s*Page\s*Break\s*-+", "", receipt_text)
+
+    # Remove UPC/barcode numbers (long digit strings, optionally ending in KF)


New receipt parsing functionality (preprocess_text/detect_text) is currently untested. Since the repo already has pytest unit tests, add unit tests for text preprocessing (deterministic) and for the PDF/image branching logic (mocking Vision/PyMuPDF) to prevent regressions without requiring external API calls.
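
A deterministic test for the preprocessing step might start from the one rule visible in this hunk, the page break pattern. In the sketch below, `strip_page_breaks` is a hypothetical extraction of that single substitution, not the full `preprocess_text`, and the sample receipt text is invented.

```python
import re

# Pattern copied from the hunk above.
PAGE_BREAK = re.compile(r"-+\s*Page\s*Break\s*-+")


def strip_page_breaks(receipt_text: str) -> str:
    """Remove page break markers, leaving all other text untouched."""
    return PAGE_BREAK.sub("", receipt_text)


def test_removes_marker() -> None:
    text = "Milk 3.99\n--- Page Break ---\nTotal 3.99"
    assert strip_page_breaks(text) == "Milk 3.99\n\nTotal 3.99"


def test_leaves_other_text_alone() -> None:
    text = "Milk 3.99\nTotal 3.99"
    assert strip_page_breaks(text) == text
```

Tests of this shape need no network access or credentials, so they fit alongside the existing pytest unit tests.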

Copilot · 2026-04-07T05:58:03Z

+def parse_result(receipt_text: str, model: str = current_model) -> ReceiptData:
+    """Parse OCR text into structured receipt data using Gemini."""
+    try:
+        response = client.models.generate_content(
+            model=model,
+            contents=receipt_text,
+            config={
+                "system_instruction": RECEIPT_PARSER_PROMPT,
+                "response_mime_type": "application/json",
+                "response_schema": ReceiptData.model_json_schema(),
+                "thinking_config": {
+                    "thinking_budget": 0  # no thinking
+                },
+            },
+        )
+        return ReceiptData.model_validate_json(response.text)
+    except Exception:
+        logger.exception("Failed to parse receipt text")


parse_result is core library behavior but has no automated tests. Add tests that validate the request configuration passed to the Gemini client and that JSON parsing/validation failures are handled/logged as expected (using a stubbed/mocked client).

Suggested change

-def parse_result(receipt_text: str, model: str = current_model) -> ReceiptData:
-    """Parse OCR text into structured receipt data using Gemini."""
-    try:
-        response = client.models.generate_content(
-            model=model,
-            contents=receipt_text,
-            config={
-                "system_instruction": RECEIPT_PARSER_PROMPT,
-                "response_mime_type": "application/json",
-                "response_schema": ReceiptData.model_json_schema(),
-                "thinking_config": {
-                    "thinking_budget": 0  # no thinking
-                },
-            },
-        )
-        return ReceiptData.model_validate_json(response.text)
-    except Exception:
-        logger.exception("Failed to parse receipt text")
+def _build_generation_config() -> dict:
+    """Build the Gemini request configuration used for receipt parsing."""
+    return {
+        "system_instruction": RECEIPT_PARSER_PROMPT,
+        "response_mime_type": "application/json",
+        "response_schema": ReceiptData.model_json_schema(),
+        "thinking_config": {
+            "thinking_budget": 0  # no thinking
+        },
+    }
+
+
+def parse_result(
+    receipt_text: str,
+    model: str = current_model,
+    gemini_client=client,
+) -> ReceiptData:
+    """Parse OCR text into structured receipt data using Gemini."""
+    config = _build_generation_config()
+    try:
+        response = gemini_client.models.generate_content(
+            model=model,
+            contents=receipt_text,
+            config=config,
+        )
+    except Exception:
+        logger.exception("Failed to request receipt parsing from Gemini")
+        raise
+
+    try:
+        return ReceiptData.model_validate_json(response.text)
+    except Exception:
+        logger.exception("Failed to parse or validate Gemini receipt response")
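
Once the client is injectable, as in the suggestion above, a test can swap in a stub and inspect the request. The sketch below is an assumption-laden illustration rather than the repo's test code: `parse_with_client` and `make_stub_client` are hypothetical names, `json.loads` stands in for `ReceiptData.model_validate_json` so that nothing beyond the standard library is needed, and the model name and receipt text are placeholders.

```python
import json
from types import SimpleNamespace
from unittest.mock import MagicMock


def make_stub_client(response_text: str) -> MagicMock:
    """Build a fake Gemini client whose generate_content returns fixed text."""
    stub = MagicMock()
    stub.models.generate_content.return_value = SimpleNamespace(text=response_text)
    return stub


def parse_with_client(receipt_text: str, model: str, gemini_client, config: dict) -> dict:
    """Hypothetical stand-in for parse_result with an injected client."""
    response = gemini_client.models.generate_content(
        model=model, contents=receipt_text, config=config
    )
    # Stand-in for ReceiptData.model_validate_json(response.text).
    return json.loads(response.text)


def test_request_config_is_forwarded() -> None:
    config = {"response_mime_type": "application/json"}
    stub = make_stub_client('{"total": 12.5}')
    result = parse_with_client("MILK 12.50", "placeholder-model", stub, config)
    assert result == {"total": 12.5}
    stub.models.generate_content.assert_called_once_with(
        model="placeholder-model", contents="MILK 12.50", config=config
    )


def test_invalid_json_raises() -> None:
    stub = make_stub_client("not json")
    try:
        parse_with_client("MILK", "placeholder-model", stub, {})
    except json.JSONDecodeError:
        return
    raise AssertionError("expected a JSONDecodeError")
```

The same pattern would apply to the real `parse_result` by passing the stub as `gemini_client` and asserting on the logged exception with pytest's `caplog` fixture.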

feat: add receipt_parser library

5c63ea3

Copilot AI review requested due to automatic review settings April 4, 2026 00:33

ian-yeh requested a review from rajpandya737 as a code owner April 4, 2026 00:33

Copilot started reviewing on behalf of ian-yeh April 4, 2026 00:34

qodo-code-review Bot reviewed Apr 4, 2026

Comment thread src/receipt_parser/services/parser.py Outdated

Comment thread src/receipt_parser/config/settings.py Outdated

Copilot AI reviewed Apr 4, 2026

rajpandya737 requested changes Apr 4, 2026

ian-yeh and others added 12 commits April 6, 2026 23:38

chore(deps): add receipt parser ocr dependencies and update gitignore

5ea9362

- Add pymupdf, google-genai, and google-cloud-vision dependencies
- Update .gitignore for OCR credentials and test directories
- Remove SETUP.md from receipt_parser module

Update src/receipt_parser/models/receipt.py

81ec925

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/receipt_parser/services/parser.py

f74643a

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/receipt_parser/config/settings.py

572b639

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/receipt_parser/models/__init__.py

dc3307c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'receipt-ocr-lib' of https://github.com/McMaster-Solar-Car-Project/purchase-request-site into receipt-ocr-lib

7a2a860

style: format code with ruff and yamlfmt

7d58334

Merge branch 'receipt-ocr-lib' of https://github.com/McMaster-Solar-Car-Project/purchase-request-site into receipt-ocr-lib

12b57e7

refactor(types): add type hints to receipt parser services

9df373b

- Add type annotations to preprocess_text() and detect_text() in ocr.py
- Add type annotations to parse_result() in parser.py
- Import __future__.annotations for forward compatibility

refactor(ocr): stream PDF processing and add guardrails to prevent memory issues

ce4e0a2

fix(config): respect existing GOOGLE_APPLICATION_CREDENTIALS and validate file exists

60d1463

ian-yeh requested a review from Copilot April 7, 2026 05:53

Copilot started reviewing on behalf of ian-yeh April 7, 2026 05:53

Copilot AI reviewed Apr 7, 2026

		get credentials from [Google Cloud Console](https://console.cloud.google.com/apis/credentials) then
		rename the file to `ocr_demo_key.json`.
\ No newline at end of file

-            pages_to_process = min(page_count, max_pages) if max_pages else page_count
+            pages_to_process = (
+                min(page_count, max_pages) if max_pages is not None else page_count
+            )
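
The behavioral difference in this change only shows up when `max_pages` is `0`. The sketch below isolates the two expressions from the diff into hypothetical helper functions (`pages_truthy` and `pages_explicit` are illustrative names, not functions in the PR) so the edge case can be compared directly.

```python
from __future__ import annotations


def pages_truthy(page_count: int, max_pages: int | None) -> int:
    """Old expression: any falsy max_pages (None or 0) means 'no limit'."""
    return min(page_count, max_pages) if max_pages else page_count


def pages_explicit(page_count: int, max_pages: int | None) -> int:
    """New expression: only None means 'no limit'; 0 is honored as a cap."""
    return min(page_count, max_pages) if max_pages is not None else page_count


# The two agree for None and for positive limits.
assert pages_truthy(5, None) == pages_explicit(5, None) == 5
assert pages_truthy(5, 3) == pages_explicit(5, 3) == 3

# They differ only for max_pages=0.
assert pages_truthy(5, 0) == 5
assert pages_explicit(5, 0) == 0
```

Whether a cap of zero pages is a meaningful input is a separate question; the explicit check at least makes `0` behave as a limit rather than silently disabling it.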
