feat: add receipt_parser library#211

Open
ian-yeh wants to merge 13 commits into main from receipt-ocr-lib

Conversation

@ian-yeh (Collaborator) commented Apr 4, 2026

What is this issue for and how does it solve it

  • Moved the receipt OCR Jupyter notebook test into a dedicated Python library
  • Can be referenced directly in the codebase with detect_text([image_path]) and parse_result(text)

Link to the GitHub Issue

#184 Receipt OCR Test

Copilot AI review requested due to automatic review settings April 4, 2026 00:33
@ian-yeh ian-yeh requested a review from rajpandya737 as a code owner April 4, 2026 00:33
@qodo-code-review (Contributor)

Review Summary by Qodo

Add receipt_parser library with OCR and AI parsing capabilities

✨ Enhancement

Walkthroughs

Description
• Created dedicated receipt_parser library from Jupyter notebook
• Implemented OCR text detection with Google Cloud Vision API
• Added structured receipt data models using Pydantic
• Integrated Gemini AI for intelligent receipt parsing
• Included text preprocessing to reduce noise and token usage
Diagram
flowchart LR
  Image["Image/PDF File"] -- "detect_text" --> OCR["Google Cloud Vision OCR"]
  OCR -- "preprocess_text" --> CleanText["Cleaned OCR Text"]
  CleanText -- "parse_result" --> Gemini["Gemini AI Parser"]
  Gemini -- "JSON Schema" --> Models["ReceiptData Model"]
  Models --> Output["Structured Receipt Data"]

File Changes

1. src/receipt_parser/config/settings.py ⚙️ Configuration changes +13/-0
   Configuration and API client initialization

2. src/receipt_parser/models/__init__.py ✨ Enhancement +3/-0
   Export receipt data model classes

3. src/receipt_parser/models/receipt.py ✨ Enhancement +28/-0
   Define Pydantic models for receipt data

4. src/receipt_parser/services/ocr.py ✨ Enhancement +114/-0
   OCR detection and text preprocessing logic

5. src/receipt_parser/services/parser.py ✨ Enhancement +41/-0
   Gemini-based receipt parsing with schema validation

6. src/receipt_parser/SETUP.md 📝 Documentation +10/-0
   Setup instructions and environment configuration

7. src/receipt_parser/__init__.py Additional files +0/-0
   ...

8. src/receipt_parser/config/__init__.py Additional files +0/-0
   ...

9. src/receipt_parser/services/__init__.py Additional files +0/-0
   ...

10. src/receipt_parser/utils/__init__.py Additional files +0/-0
    ...

qodo-code-review (Bot, Contributor) commented Apr 4, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)

Action required

1. Broken package imports 🐞 Bug ≡ Correctness
Description
src/receipt_parser/... uses absolute imports starting with receipt_parser.*, but the app is
launched as uvicorn src.main:app so receipt_parser is not importable from the container’s
default sys.path, causing ModuleNotFoundError when importing parser/models.
Code

src/receipt_parser/services/parser.py[R1-2]

+from receipt_parser.config.settings import client
+from receipt_parser.models.receipt import ReceiptData
Evidence
The Docker image runs uvicorn src.main:app from /app, and only copies src/ into /app/src.
Therefore, modules should be imported as src.receipt_parser... or via relative imports. However,
both parser.py and models/__init__.py import receipt_parser.* as if it were a top-level
package, which will fail in this runtime layout.

Dockerfile[62-70]
src/receipt_parser/services/parser.py[1-2]
src/receipt_parser/models/__init__.py[1-3]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Modules under `src/receipt_parser` import `receipt_parser.*` as a top-level package, but this repo runs as `uvicorn src.main:app` (with `/app` on `sys.path`). That means `receipt_parser` is not importable, and importing the receipt parser will raise `ModuleNotFoundError`.
### Issue Context
`receipt_parser` lives at `/app/src/receipt_parser` in the container. In that layout, imports should be relative (preferred within the package) or start with `src.receipt_parser`.
### Fix Focus Areas
- src/receipt_parser/services/parser.py[1-2]
- src/receipt_parser/models/__init__.py[1-3]
### Implementation notes
- In `services/parser.py`, change to `from ..config.settings import client` and `from ..models.receipt import ReceiptData` (or `from src.receipt_parser...`).
- In `models/__init__.py`, change to `from .receipt import ReceiptData, ReceiptItem, ReceiptSummary`.


2. Credentials env overridden 🐞 Bug ☼ Reliability
Description
settings.py unconditionally sets GOOGLE_APPLICATION_CREDENTIALS to ocr_demo_key.json at import
time, overriding host/application credentials and making auth depend on current working directory.
Code

src/receipt_parser/config/settings.py[R7-8]

+load_dotenv()
+os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"
Evidence
Importing receipt_parser.config.settings will always overwrite any pre-configured
GOOGLE_APPLICATION_CREDENTIALS value. The setup doc also instructs renaming the key file to
ocr_demo_key.json, confirming this is not a fallback but a hard requirement baked into code.

src/receipt_parser/config/settings.py[7-9]
src/receipt_parser/SETUP.md[8-10]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`src/receipt_parser/config/settings.py` mutates `os.environ["GOOGLE_APPLICATION_CREDENTIALS"]` at import time. This overrides credentials configured by the deployment/host process and is fragile because it uses a relative filename.
### Issue Context
This module is imported by other receipt_parser modules; any import triggers the override automatically.
### Fix Focus Areas
- src/receipt_parser/config/settings.py[7-9]
- src/receipt_parser/SETUP.md[8-10]
### Implementation notes
- Remove the unconditional assignment to `GOOGLE_APPLICATION_CREDENTIALS`.
- Instead, read `GOOGLE_APPLICATION_CREDENTIALS` (or a receipt-parser-specific env var like `RECEIPT_PARSER_GCP_CREDENTIALS_PATH`) and:
- if missing, raise a clear `ValueError` explaining how to configure it, OR
- if you want a default, only set it *if not already set* and resolve to an absolute path (e.g., relative to the repo or module), and validate the file exists before proceeding.
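A minimal sketch of the options in the notes above, assuming a hypothetical helper named `resolve_credentials_path` (it is not part of this PR). It reads the variable rather than mutating `os.environ`, and falls back to a default file only when the caller passes one and the file exists.

```python
import os
from pathlib import Path


def resolve_credentials_path(env=None, default_path=None):
    """Return an absolute path to the GCP key file without touching os.environ.

    `env` and `default_path` are illustrative parameters, not names from the PR.
    """
    env = os.environ if env is None else env
    configured = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if configured:
        # Respect whatever the host or deployment already configured.
        return Path(configured).resolve()
    if default_path is not None:
        candidate = Path(default_path).resolve()
        if candidate.is_file():
            return candidate
    raise ValueError(
        "GOOGLE_APPLICATION_CREDENTIALS is not set and no default key file "
        "was found; set it to the absolute path of your service account key."
    )
```

Passing `env` explicitly keeps the helper easy to unit test, since a plain dict can stand in for the process environment.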



Remediation recommended

3. PDF OCR buffers all pages 🐞 Bug ➹ Performance
Description
detect_text() converts all PDF pages to PNG bytes and stores them in a list before OCR, which can
spike memory on larger PDFs; additionally, if an exception occurs during PDF-to-image conversion,
the PDF handle may not be closed.
Code

src/receipt_parser/services/ocr.py[R76-88]

+    image_contents = []
+
+    if file_ext == ".pdf":
+        # opening PDF and iterating through all pages
+        pdf_document = fitz.open(path)
+        for page_num in range(len(pdf_document)):
+            page = pdf_document[page_num]
+
+            # convert each page to an image
+            matrix = fitz.Matrix(2, 2)
+            pix = page.get_pixmap(matrix=matrix)
+            image_contents.append(pix.tobytes("png"))
+        pdf_document.close()
Evidence
For PDFs, the code appends each page’s PNG bytes into image_contents (a full in-memory buffer) and
only closes the PDF after the loop completes. There is no try/finally or context manager around
fitz.open(), so errors inside the conversion loop can skip pdf_document.close().

src/receipt_parser/services/ocr.py[76-88]
src/receipt_parser/services/ocr.py[94-107]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`detect_text()` buffers all PDF pages as PNG bytes in memory before OCR and does not guarantee the PDF is closed if conversion fails mid-loop.
### Issue Context
This can cause high memory usage for multi-page PDFs and can leak file handles if an exception occurs during page rendering.
### Fix Focus Areas
- src/receipt_parser/services/ocr.py[76-107]
### Implementation notes
- Use a context manager: `with fitz.open(path) as pdf_document:`
- Process each page sequentially: render one page -> call Vision OCR -> append text -> discard bytes, rather than building `image_contents` for all pages.
- (Optional) expose tuning knobs like zoom/matrix or a max page limit to avoid pathological inputs.
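The notes above can be sketched roughly as follows. This is an illustration only: `ocr_document` and its three callable parameters are hypothetical names, chosen so the control flow can be shown without importing PyMuPDF or the Vision client. In the PR, `open_document` would presumably be `fitz.open`, `render_page` would wrap `page.get_pixmap(...).tobytes("png")`, and `ocr_image` would wrap the Vision call.

```python
def ocr_document(path, open_document, render_page, ocr_image, max_pages=None):
    """OCR a document one page at a time and close it even if a page fails.

    open_document(path) must return a context manager yielding an iterable of
    pages; render_page(page) returns image bytes; ocr_image(bytes) returns text.
    """
    texts = []
    with open_document(path) as document:
        for index, page in enumerate(document):
            if max_pages is not None and index >= max_pages:
                break
            # Only one page's image bytes are alive at any time.
            image_bytes = render_page(page)
            texts.append(ocr_image(image_bytes))
    return "\n".join(texts)
```

Because the document is opened in a `with` block, an exception raised while rendering or recognising a page still triggers the context manager's cleanup.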


4. No GEMINI key validation 🐞 Bug ☼ Reliability
Description
settings.py constructs a global Gemini client from GEMINI_API_KEY without checking that the key
exists, so the first call to parse_result() can fail with a generic downstream exception rather
than a clear configuration error.
Code

src/receipt_parser/config/settings.py[R12-13]

+api_key = os.getenv("GEMINI_API_KEY")
+client = genai.Client(api_key=api_key)
Evidence
api_key is read directly from the environment and passed into genai.Client(...) without
validation. parse_result() then relies on this global client for every request.

src/receipt_parser/config/settings.py[12-13]
src/receipt_parser/services/parser.py[21-36]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The Gemini client is created from `GEMINI_API_KEY` without validating that the env var is set. This pushes configuration errors to runtime and makes failures harder to diagnose.
### Issue Context
`parse_result()` depends on the global `client` imported from settings.
### Fix Focus Areas
- src/receipt_parser/config/settings.py[12-13]
- src/receipt_parser/services/parser.py[21-36]
### Implementation notes
- Add an explicit check: if `GEMINI_API_KEY` is missing/blank, raise `ValueError("GEMINI_API_KEY is required ...")`.
- Consider lazy initialization (e.g., `get_genai_client()`), so importing the module doesn’t create external-service clients automatically.
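One way the two notes above might fit together, as a sketch under stated assumptions: `require_env` and `make_lazy_client` are hypothetical helpers, and `factory` stands in for `genai.Client` so the example runs without the `google-genai` package installed.

```python
import os
from functools import lru_cache


def require_env(name, env=None):
    """Return a non-blank environment value or raise a clear configuration error."""
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise ValueError(f"{name} is required to use the receipt parser.")
    return value


def make_lazy_client(factory, env=None):
    """Return a zero-argument getter that builds the client on first use only."""

    @lru_cache(maxsize=1)
    def get_client():
        # Validation happens here, at call time, not at import time.
        return factory(api_key=require_env("GEMINI_API_KEY", env))

    return get_client
```

With this shape, importing the settings module creates no external-service client; the first call to the getter either returns a cached client or fails with a message naming the missing variable.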



Advisory comments

5. Stdout prints in library 🐞 Bug ≡ Correctness
Description
OCR and parsing code prints operational data and error details to stdout, bypassing the repo’s
structured logging and creating noisy/unstructured logs in production.
Code

src/receipt_parser/services/ocr.py[R111-113]

+    processed_text = preprocess_text(raw_text)
+    print(f"Tokens Saved: {len(raw_text) - len(processed_text)}/{len(raw_text)}")
+    return processed_text
Evidence
detect_text() prints token savings and parse_result() prints exception details. The repo already
provides a setup_logger() helper for consistent structured logging across modules.

src/receipt_parser/services/ocr.py[109-113]
src/receipt_parser/services/parser.py[37-41]
src/core/logging_utils.py[1-54]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Library code uses `print()` for both normal operational messages and exception reporting, which bypasses the project’s logging system.
### Issue Context
The repo has `src/core/logging_utils.setup_logger()` used elsewhere for consistent logs.
### Fix Focus Areas
- src/receipt_parser/services/ocr.py[109-113]
- src/receipt_parser/services/parser.py[37-41]
- src/core/logging_utils.py[21-54]
### Implementation notes
- Create a module-level logger (e.g., `logger = setup_logger(__name__)`) and replace prints with `logger.info/debug`.
- For exceptions, prefer `logger.exception("...")` inside the `except` block to capture stack traces.
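A small sketch of the second note, using the standard library's `logging.getLogger` as a stand-in for the repo's `setup_logger(__name__)`, whose exact signature is not shown here. `parse_with_logging` is a hypothetical wrapper used only for illustration.

```python
import logging

# Stand-in for `logger = setup_logger(__name__)` from src/core/logging_utils.py.
logger = logging.getLogger("receipt_parser.parser")


def parse_with_logging(parse, receipt_text):
    """Call `parse`, logging any failure with its stack trace before re-raising."""
    try:
        return parse(receipt_text)
    except Exception:
        # logger.exception logs at ERROR level and attaches the active traceback.
        # The message deliberately omits the receipt text itself.
        logger.exception("Receipt parsing failed")
        raise
```

A single `logger.exception` call replaces the three `print()` calls and keeps the error type, message, and traceback together in one log record.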

Comment thread src/receipt_parser/services/parser.py Outdated
Comment thread src/receipt_parser/config/settings.py Outdated

Copilot AI left a comment

Pull request overview

This PR introduces a new receipt_parser Python library intended to turn the prior receipt OCR notebook prototype into reusable code (OCR via Google Cloud Vision + structured parsing via Gemini).

Changes:

  • Added OCR utilities to extract and preprocess text from images and PDFs using Google Cloud Vision.
  • Added Gemini-based parsing to convert OCR text into structured ReceiptData Pydantic models.
  • Added initial setup documentation for required environment variables and credentials.

Reviewed changes

Copilot reviewed 6 out of 10 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| src/receipt_parser/SETUP.md | Documents env vars and credential setup for OCR + Gemini usage. |
| src/receipt_parser/config/settings.py | Loads env/config and constructs a Gemini client. |
| src/receipt_parser/models/receipt.py | Introduces Pydantic models for receipt items/summary/data. |
| src/receipt_parser/models/__init__.py | Exposes receipt models via package exports. |
| src/receipt_parser/services/ocr.py | Implements PDF/image OCR and text preprocessing via Vision API. |
| src/receipt_parser/services/parser.py | Implements Gemini JSON-schema-based parsing into ReceiptData. |

Comment thread src/receipt_parser/services/parser.py Outdated
Comment thread src/receipt_parser/services/parser.py Outdated
Comment on lines +37 to +41
    except Exception as e:
        print(f"Error type: {type(e).__name__}")
        print(f"Error message: {str(e)}")
        print(f"Full error: {repr(e)}")
        raise

Copilot AI Apr 4, 2026

This exception handler prints error details directly to stdout. In the FastAPI app this bypasses the configured logging pipeline (setup_logger) and can leak request/receipt content in logs unexpectedly. Prefer using the project logger (or remove the prints and re-raise with a sanitized message).

Comment on lines +4 to +12
from dotenv import load_dotenv
from google import genai

load_dotenv()
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"

WORD = re.compile(r"\w+")

api_key = os.getenv("GEMINI_API_KEY")

Copilot AI Apr 4, 2026

This module performs side effects at import time (loading dotenv and overwriting GOOGLE_APPLICATION_CREDENTIALS with a hardcoded relative filename). This can break deployments and also prevents users from configuring credentials via environment variables. Prefer not mutating os.environ here; instead read GOOGLE_APPLICATION_CREDENTIALS (or accept a path parameter) and fail with a clear error if it’s not set.

Suggested change
-from dotenv import load_dotenv
-from google import genai
-load_dotenv()
-os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"
-WORD = re.compile(r"\w+")
-api_key = os.getenv("GEMINI_API_KEY")
+from google import genai
+
+
+def _get_required_env(name: str) -> str:
+    value = os.getenv(name)
+    if not value:
+        raise RuntimeError(
+            f"Required environment variable '{name}' is not set."
+        )
+    return value
+
+
+WORD = re.compile(r"\w+")
+GOOGLE_APPLICATION_CREDENTIALS = _get_required_env("GOOGLE_APPLICATION_CREDENTIALS")
+api_key = _get_required_env("GEMINI_API_KEY")

Comment thread src/receipt_parser/config/settings.py
Comment thread src/receipt_parser/models/__init__.py Outdated
Comment thread src/receipt_parser/SETUP.md Outdated
Comment on lines +9 to +10
get credentials from [Google Cloud Console](https://console.cloud.google.com/apis/credentials) then
rename the file to `ocr_demo_key.json`.

Copilot AI Apr 4, 2026

This setup doc instructs renaming a service account key file to a fixed filename. Combined with settings.py forcing GOOGLE_APPLICATION_CREDENTIALS to that name, it encourages keeping credentials in the repo working directory and makes secrets easier to accidentally commit. Prefer documenting GOOGLE_APPLICATION_CREDENTIALS=/absolute/path/to/key.json and explicitly instructing users not to commit the key file.

Suggested change
-get credentials from [Google Cloud Console](https://console.cloud.google.com/apis/credentials) then
-rename the file to `ocr_demo_key.json`.
+get credentials from [Google Cloud Console](https://console.cloud.google.com/apis/credentials), store the downloaded service account key in a secure location outside this repository, and set `GOOGLE_APPLICATION_CREDENTIALS` to its absolute path (for example, `GOOGLE_APPLICATION_CREDENTIALS=/absolute/path/to/key.json`).
+Do not rename the key to a fixed filename in the project directory, and do not commit the key file to version control.

Comment thread src/receipt_parser/services/parser.py Outdated
Comment on lines +21 to +36
def parse_result(receipt_text, model=current_model):
    """Parse OCR text into structured receipt data using Gemini."""
    try:
        response = client.models.generate_content(
            model=model,
            contents=receipt_text,
            config={
                "system_instruction": RECEIPT_PARSER_PROMPT,
                "response_mime_type": "application/json",
                "response_schema": ReceiptData.model_json_schema(),
                "thinking_config": {
                    "thinking_budget": 0  # no thinking
                },
            },
        )
        return ReceiptData.model_validate_json(response.text)

Copilot AI Apr 4, 2026

New parsing behavior is introduced here but there are no accompanying tests. Since the repo already uses pytest (tests/unit/*), add unit tests that mock the Gemini client to verify: schema validation/model_validate_json handling, error propagation, and at least one happy-path response.

Suggested change
-def parse_result(receipt_text, model=current_model):
-    """Parse OCR text into structured receipt data using Gemini."""
-    try:
-        response = client.models.generate_content(
-            model=model,
-            contents=receipt_text,
-            config={
-                "system_instruction": RECEIPT_PARSER_PROMPT,
-                "response_mime_type": "application/json",
-                "response_schema": ReceiptData.model_json_schema(),
-                "thinking_config": {
-                    "thinking_budget": 0  # no thinking
-                },
-            },
-        )
-        return ReceiptData.model_validate_json(response.text)
+def _build_generation_config(receipt_model=ReceiptData):
+    """Build the Gemini response configuration for structured receipt parsing."""
+    return {
+        "system_instruction": RECEIPT_PARSER_PROMPT,
+        "response_mime_type": "application/json",
+        "response_schema": receipt_model.model_json_schema(),
+        "thinking_config": {
+            "thinking_budget": 0  # no thinking
+        },
+    }
+
+
+def _validate_receipt_response(response_text, receipt_model=ReceiptData):
+    """Validate Gemini JSON output against the receipt schema."""
+    return receipt_model.model_validate_json(response_text)
+
+
+def parse_result(receipt_text, model=current_model, gemini_client=client, receipt_model=ReceiptData):
+    """Parse OCR text into structured receipt data using Gemini."""
+    try:
+        response = gemini_client.models.generate_content(
+            model=model,
+            contents=receipt_text,
+            config=_build_generation_config(receipt_model),
+        )
+        return _validate_receipt_response(response.text, receipt_model)
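
The kind of test this comment asks for might look like the sketch below. It is an assumption-laden illustration: the `parse_result` here is a simplified stand-in that takes an injected `gemini_client` (as the suggested change does) and uses `json.loads` in place of `ReceiptData.model_validate_json`, so it runs without Pydantic or the Gemini SDK. The model name is a placeholder.

```python
import json
from unittest.mock import MagicMock


def parse_result(receipt_text, gemini_client, model="placeholder-model"):
    """Stand-in for the real function; json.loads replaces schema validation."""
    response = gemini_client.models.generate_content(model=model, contents=receipt_text)
    return json.loads(response.text)


def test_parse_result_happy_path():
    fake_client = MagicMock()
    fake_client.models.generate_content.return_value.text = (
        '{"store": "Acme", "currency": "USD"}'
    )
    result = parse_result("ACME 1.00", fake_client)
    assert result["store"] == "Acme"
    fake_client.models.generate_content.assert_called_once()


def test_parse_result_propagates_client_errors():
    fake_client = MagicMock()
    fake_client.models.generate_content.side_effect = RuntimeError("quota exceeded")
    try:
        parse_result("ACME 1.00", fake_client)
    except RuntimeError as exc:
        assert "quota" in str(exc)
    else:
        raise AssertionError("expected RuntimeError")


def test_parse_result_rejects_malformed_json():
    fake_client = MagicMock()
    fake_client.models.generate_content.return_value.text = "not json"
    try:
        parse_result("ACME 1.00", fake_client)
    except json.JSONDecodeError:
        pass
    else:
        raise AssertionError("expected JSONDecodeError")
```

The functions use plain `assert` statements and `test_` names, so pytest would collect them as written; against the real module, the malformed-response case would expect Pydantic's validation error instead.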

Comment thread src/receipt_parser/services/ocr.py Outdated
Comment on lines +67 to +108
def detect_text(path):
    """
    Detects text in a file using Google Cloud Vision OCR.
    Handles images and multi-page PDFs by converting PDF pages to images.
    """
    vision_client = vision.ImageAnnotatorClient()
    file_ext = Path(path).suffix.lower()
    all_text = []

    image_contents = []

    if file_ext == ".pdf":
        # opening PDF and iterating through all pages
        pdf_document = fitz.open(path)
        for page_num in range(len(pdf_document)):
            page = pdf_document[page_num]

            # convert each page to an image
            matrix = fitz.Matrix(2, 2)
            pix = page.get_pixmap(matrix=matrix)
            image_contents.append(pix.tobytes("png"))
        pdf_document.close()
    else:
        # Handle standard image files (png, jpg, etc.)
        with open(path, "rb") as image_file:
            image_contents.append(image_file.read())

    # Process each image/page through Vision OCR
    for content in image_contents:
        image = vision.Image(content=content)

        # We use document_text_detection for better handling of dense text/receipts
        response = vision_client.document_text_detection(image=image)

        if response.error.message:
            raise Exception(f"Vision API Error: {response.error.message}")

        # text_annotations[0] contains the entire page's text as a single string
        if response.text_annotations:
            page_text = response.text_annotations[0].description
            all_text.append(page_text)

Copilot AI Apr 4, 2026

New OCR logic is added here (PDF page rendering + Vision API calls) but there are no tests covering file-type branching, multi-page PDF handling, and Vision API error handling. Add unit tests that mock fitz.open and vision.ImageAnnotatorClient to validate the behavior without external calls.

Comment on lines +5 to +13
from google import genai

load_dotenv()
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"

WORD = re.compile(r"\w+")

api_key = os.getenv("GEMINI_API_KEY")
client = genai.Client(api_key=api_key)

Copilot AI Apr 4, 2026

from google import genai depends on the google-genai package, but it’s not declared in pyproject.toml (and not present in uv.lock). This will raise ImportError in deployed environments. Add the dependency (or gate it behind an optional extra) and document the required install path for the receipt parser feature.

Suggested change
-from google import genai
-load_dotenv()
-os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"
-WORD = re.compile(r"\w+")
-api_key = os.getenv("GEMINI_API_KEY")
-client = genai.Client(api_key=api_key)
+try:
+    from google import genai
+except ImportError:
+    genai = None
+
+RECEIPT_PARSER_INSTALL_HINT = (
+    "The receipt parser Gemini integration requires the optional "
+    "`google-genai` package. Install it before using this feature."
+)
+
+load_dotenv()
+os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "ocr_demo_key.json"
+WORD = re.compile(r"\w+")
+api_key = os.getenv("GEMINI_API_KEY")
+
+
+def get_genai_client():
+    if genai is None:
+        raise RuntimeError(RECEIPT_PARSER_INSTALL_HINT)
+    if not api_key:
+        raise RuntimeError(
+            "The receipt parser Gemini integration requires the "
+            "`GEMINI_API_KEY` environment variable to be set."
+        )
+    return genai.Client(api_key=api_key)
+
+
+client = get_genai_client() if genai is not None and api_key else None

Comment thread src/receipt_parser/models/receipt.py Outdated

@rajpandya737 (Member) left a comment

Good start @ian-yeh. Address the AI comments and make sure there is good type hinting; we want the types to always line up, because mismatches might affect things down the pipeline. Get rid of the .md file and I'll take a deeper look once done.

ian-yeh and others added 12 commits April 6, 2026 23:38
- Add pymupdf, google-genai, and google-cloud-vision dependencies
- Update .gitignore for OCR credentials and test directories
- Remove SETUP.md from receipt_parser module
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Add type annotations to preprocess_text() and detect_text() in ocr.py
- Add type annotations to parse_result() in parser.py
- Import __future__.annotations for forward compatibility
…rser services

- Add module-level logger using setup_logger(__name__) in ocr.py and parser.py
- Replace print() with logger.info() for token savings message in ocr.py
- Replace multiple print() statements with logger.exception() in parser.py
  to properly capture stack traces on parse failures

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 12 changed files in this pull request and generated 9 comments.


Comment on lines +8 to +32
load_dotenv()

# Only set credentials if not already configured (e.g., by deployment/host)
if "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ:
    # Resolve relative to project root, not current working directory
    project_root = Path(__file__).parent.parent.parent.parent
    default_creds_path = project_root / "google_ocr_credentials.json"

    if not default_creds_path.exists():
        raise FileNotFoundError(
            f"GOOGLE_APPLICATION_CREDENTIALS environment variable is not set "
            f"and default credentials file not found: {default_creds_path}\n"
            f"Please set GOOGLE_APPLICATION_CREDENTIALS to your service account key file path."
        )

    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(default_creds_path)

WORD = re.compile(r"\w+")

api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError(
        "GEMINI_API_KEY environment variable must be set before initializing the Gemini client."
    )
client = genai.Client(api_key=api_key)

Copilot AI Apr 7, 2026

settings.py performs credential discovery, reads .env, and raises exceptions at import time. This makes simply importing src.receipt_parser (or parse_result) crash in environments that don't have Vision/Gemini configured, even if receipt parsing is not used. Consider moving env/credential validation + genai.Client creation into a lazy get_client() function (or inside parse_result) so module import is side-effect-free, and raise a clear error only when the functionality is invoked.

Suggested change
-load_dotenv()
-# Only set credentials if not already configured (e.g., by deployment/host)
-if "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ:
-    # Resolve relative to project root, not current working directory
-    project_root = Path(__file__).parent.parent.parent.parent
-    default_creds_path = project_root / "google_ocr_credentials.json"
-    if not default_creds_path.exists():
-        raise FileNotFoundError(
-            f"GOOGLE_APPLICATION_CREDENTIALS environment variable is not set "
-            f"and default credentials file not found: {default_creds_path}\n"
-            f"Please set GOOGLE_APPLICATION_CREDENTIALS to your service account key file path."
-        )
-    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(default_creds_path)
-WORD = re.compile(r"\w+")
-api_key = os.getenv("GEMINI_API_KEY")
-if not api_key:
-    raise RuntimeError(
-        "GEMINI_API_KEY environment variable must be set before initializing the Gemini client."
-    )
-client = genai.Client(api_key=api_key)
+WORD = re.compile(r"\w+")
+_client = None
+
+
+def _configure_environment():
+    load_dotenv()
+    # Only set credentials if not already configured (e.g., by deployment/host)
+    if "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ:
+        # Resolve relative to project root, not current working directory
+        project_root = Path(__file__).parent.parent.parent.parent
+        default_creds_path = project_root / "google_ocr_credentials.json"
+        if not default_creds_path.exists():
+            raise FileNotFoundError(
+                f"GOOGLE_APPLICATION_CREDENTIALS environment variable is not set "
+                f"and default credentials file not found: {default_creds_path}\n"
+                f"Please set GOOGLE_APPLICATION_CREDENTIALS to your service account key file path."
+            )
+        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(default_creds_path)
+
+
+def get_client():
+    global _client
+    if _client is not None:
+        return _client
+    _configure_environment()
+    api_key = os.getenv("GEMINI_API_KEY")
+    if not api_key:
+        raise RuntimeError(
+            "GEMINI_API_KEY environment variable must be set before initializing the Gemini client."
+        )
+    _client = genai.Client(api_key=api_key)
+    return _client

Comment on lines +3 to +6
from src.core.logging_utils import setup_logger
from src.receipt_parser.config.settings import client
from src.receipt_parser.models.receipt import ReceiptData

Copilot AI Apr 7, 2026

parser.py imports client from config.settings at module import time. Because config.settings validates env vars / credentials during import, importing this module can raise before any function is called. Consider deferring the client import/initialization until inside parse_result (or using a lazy accessor) to avoid import-time failures and improve library ergonomics.

    if file_ext == ".pdf":
        with fitz.open(path) as pdf_document:
            page_count = len(pdf_document)
            pages_to_process = min(page_count, max_pages) if max_pages else page_count

Copilot AI Apr 7, 2026

The max_pages check treats 0 as falsy, which will process all pages when max_pages=0 is passed. Use an explicit max_pages is not None check so callers can intentionally pass 0 (or any falsy int) and get the expected behavior.

Suggested change
-            pages_to_process = min(page_count, max_pages) if max_pages else page_count
+            pages_to_process = (
+                min(page_count, max_pages) if max_pages is not None else page_count
+            )
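
To make the difference concrete, here is a small side-by-side sketch. Both function names are hypothetical; the first mirrors the suggested `is not None` check and the second mirrors the original truthiness check.

```python
def pages_to_process(page_count, max_pages=None):
    """None means no limit; 0 means process no pages."""
    if max_pages is None:
        return page_count
    return min(page_count, max_pages)


def pages_to_process_truthy(page_count, max_pages=None):
    """The original truthiness check, kept only for comparison."""
    return min(page_count, max_pages) if max_pages else page_count


print(pages_to_process(5, 0))         # prints 0
print(pages_to_process_truthy(5, 0))  # prints 5, the limit is silently ignored
print(pages_to_process(5, 3))         # prints 3
print(pages_to_process(5))            # prints 5
```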

Comment on lines +137 to +139
    processed_text = preprocess_text(raw_text)
    logger.info(f"Tokens Saved: {len(raw_text) - len(processed_text)}/{len(raw_text)}")
    return processed_text

Copilot AI Apr 7, 2026

The log message labels a character-count delta as "Tokens Saved" (len(raw_text) - len(processed_text)). This is not tokenization and can be misleading for debugging/cost estimation. Consider renaming to something like "Characters removed" or computing an actual token estimate if that's the intent.
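
A minimal sketch of the renaming option, assuming a hypothetical helper called `describe_reduction`. It reports characters, since that is what `len()` on a string measures; it makes no attempt at a real token estimate.

```python
def describe_reduction(raw_text: str, processed_text: str) -> str:
    """Report the size change in characters, which is what len() measures."""
    removed = len(raw_text) - len(processed_text)
    return f"Characters removed: {removed}/{len(raw_text)}"


print(describe_reduction("TOTAL ---- 4.00", "TOTAL 4.00"))
# prints: Characters removed: 5/15
```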

Comment on lines +11 to +20
class ReceiptSummary(BaseModel):
    number_of_items: int
    subtotal: float
    discount: float
    delivery_fee: float
    service_fee: float
    tax: float
    tip: float
    total: float

Copilot AI Apr 7, 2026

The prompt says to "Default missing numeric fields to 0", but the Pydantic models currently require all numeric fields with no defaults. If Gemini omits a field, model_validate_json will fail. Consider adding = 0 defaults (and possibly items: list[...] = [] if desired) so the schema aligns with the prompt and parsing is more resilient.
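
A sketch of what the defaults could look like, assuming Pydantic v2 (the PR already calls `model_validate_json`); the sample JSON payload is made up for illustration:

```python
from pydantic import BaseModel


class ReceiptSummary(BaseModel):
    # Defaults mirror the prompt's "default missing numeric fields to 0" rule.
    number_of_items: int = 0
    subtotal: float = 0.0
    discount: float = 0.0
    delivery_fee: float = 0.0
    service_fee: float = 0.0
    tax: float = 0.0
    tip: float = 0.0
    total: float = 0.0


# A response that omits most fields still validates (payload is illustrative).
summary = ReceiptSummary.model_validate_json('{"subtotal": 10.0, "total": 11.3}')
```

One trade-off worth weighing: with defaults in place, a field the model forgot and a field that is genuinely zero become indistinguishable downstream.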

Comment on lines +22 to +27
class ReceiptData(BaseModel):
    store: str
    order_number: str | None = None
    date: str | None = None
    currency: str
    items: list[ReceiptItem]

Copilot AI Apr 7, 2026

date is modeled as str | None, which doesn't enforce the documented YYYY-MM-DD format and will accept arbitrary strings. Consider using datetime.date (or a constrained string pattern) to validate format consistently and reduce downstream parsing errors.
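
A sketch of the `datetime.date` option, assuming Pydantic v2 and using a hypothetical trimmed-down `ReceiptHeader` model rather than the PR's full `ReceiptData`:

```python
import datetime
from typing import Optional

from pydantic import BaseModel, ValidationError


class ReceiptHeader(BaseModel):
    """Hypothetical trimmed-down model; only the fields needed here are shown."""

    store: str
    date: Optional[datetime.date] = None


# An ISO-formatted date string is parsed into a datetime.date.
ok = ReceiptHeader.model_validate_json('{"store": "Acme", "date": "2026-04-04"}')

# A differently formatted string is rejected at validation time.
try:
    ReceiptHeader.model_validate_json('{"store": "Acme", "date": "04/04/2026"}')
    rejected = False
except ValidationError:
    rejected = True
```

Changing the field type also changes the JSON schema generated for it, so whether the Gemini `response_schema` handling accepts the resulting date format would need to be checked separately; this sketch does not cover that.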

Comment on lines +73 to +78
def detect_text(
    path: str | Path,
    *,
    zoom: float = 2.0,
    max_pages: int | None = None,
) -> str:

Copilot AI Apr 7, 2026

PR description says usage is detect_text([image_path]), but detect_text currently accepts a single path: str | Path. Either update the public API to accept a list (if intended), or adjust the documentation/README/usage guidance so callers don't get a runtime TypeError.
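
If the list form is what callers want, one low-risk option is to leave the single-path function alone and add a thin wrapper. The sketch below uses a hypothetical `detect_text_many` helper that takes the OCR function as a parameter, so it runs without Vision or PyMuPDF:

```python
from pathlib import Path
from typing import Callable, Iterable, List, Union

PathLike = Union[str, Path]


def detect_text_many(
    paths: Iterable[PathLike],
    detect: Callable[[PathLike], str],
) -> List[str]:
    """Run a single-path OCR function over several paths, preserving order."""
    return [detect(p) for p in paths]
```

In the library this would presumably be called as `detect_text_many([image_path], detect_text)`, keeping the existing `detect_text(path)` signature intact.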

Comment on lines +14 to +20
def preprocess_text(receipt_text: str) -> str:
    """Clean up raw OCR text to reduce noise and token count."""

    # Remove page break markers
    receipt_text = re.sub(r"-+\s*Page\s*Break\s*-+", "", receipt_text)

    # Remove UPC/barcode numbers (long digit strings, optionally ending in KF)

Copilot AI Apr 7, 2026

New receipt parsing functionality (preprocess_text/detect_text) is currently untested. Since the repo already has pytest unit tests, add unit tests for text preprocessing (deterministic) and for the PDF/image branching logic (mocking Vision/PyMuPDF) to prevent regressions without requiring external API calls.
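
A sketch of what deterministic preprocessing tests could look like. The `preprocess_text` here is a stand-in containing only the page-break rule visible in the quoted snippet; the real function has more rules that are not shown in this thread, and the sample receipt strings are made up:

```python
import re


def preprocess_text(receipt_text: str) -> str:
    """Stand-in containing only the page-break rule quoted in the review."""
    return re.sub(r"-+\s*Page\s*Break\s*-+", "", receipt_text)


def test_removes_page_break_markers() -> None:
    raw = "Milk 3.99\n--- Page Break ---\nTotal 3.99"
    assert preprocess_text(raw) == "Milk 3.99\n\nTotal 3.99"


def test_leaves_ordinary_lines_untouched() -> None:
    raw = "Milk 3.99\nTotal 3.99"
    assert preprocess_text(raw) == raw
```

In the actual test suite the stand-in would be replaced by an import of the library's own `preprocess_text`, with one test per cleanup rule.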

Comment on lines +26 to +43
def parse_result(receipt_text: str, model: str = current_model) -> ReceiptData:
    """Parse OCR text into structured receipt data using Gemini."""
    try:
        response = client.models.generate_content(
            model=model,
            contents=receipt_text,
            config={
                "system_instruction": RECEIPT_PARSER_PROMPT,
                "response_mime_type": "application/json",
                "response_schema": ReceiptData.model_json_schema(),
                "thinking_config": {
                    "thinking_budget": 0  # no thinking
                },
            },
        )
        return ReceiptData.model_validate_json(response.text)
    except Exception:
        logger.exception("Failed to parse receipt text")

Copilot AI Apr 7, 2026

parse_result is core library behavior but has no automated tests. Add tests that validate the request configuration passed to the Gemini client and that JSON parsing/validation failures are handled/logged as expected (using a stubbed/mocked client).

Suggested change
-def parse_result(receipt_text: str, model: str = current_model) -> ReceiptData:
-    """Parse OCR text into structured receipt data using Gemini."""
-    try:
-        response = client.models.generate_content(
-            model=model,
-            contents=receipt_text,
-            config={
-                "system_instruction": RECEIPT_PARSER_PROMPT,
-                "response_mime_type": "application/json",
-                "response_schema": ReceiptData.model_json_schema(),
-                "thinking_config": {
-                    "thinking_budget": 0  # no thinking
-                },
-            },
-        )
-        return ReceiptData.model_validate_json(response.text)
-    except Exception:
-        logger.exception("Failed to parse receipt text")
+def _build_generation_config() -> dict:
+    """Build the Gemini request configuration used for receipt parsing."""
+    return {
+        "system_instruction": RECEIPT_PARSER_PROMPT,
+        "response_mime_type": "application/json",
+        "response_schema": ReceiptData.model_json_schema(),
+        "thinking_config": {
+            "thinking_budget": 0  # no thinking
+        },
+    }
+def parse_result(
+    receipt_text: str,
+    model: str = current_model,
+    gemini_client=client,
+) -> ReceiptData:
+    """Parse OCR text into structured receipt data using Gemini."""
+    config = _build_generation_config()
+    try:
+        response = gemini_client.models.generate_content(
+            model=model,
+            contents=receipt_text,
+            config=config,
+        )
+    except Exception:
+        logger.exception("Failed to request receipt parsing from Gemini")
+        raise
+    try:
+        return ReceiptData.model_validate_json(response.text)
+    except Exception:
+        logger.exception("Failed to parse or validate Gemini receipt response")
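
A sketch of how an injectable client makes this testable without network access. The `parse_result` below is a simplified stand-in (it uses `json.loads` instead of the Pydantic model and a reduced config), and `"some-model"` is a placeholder model name; only the call shape follows the code quoted in this thread:

```python
import json
from unittest.mock import MagicMock


def parse_result(receipt_text: str, model: str, gemini_client) -> dict:
    """Simplified stand-in: same call shape, json.loads instead of Pydantic."""
    response = gemini_client.models.generate_content(
        model=model,
        contents=receipt_text,
        config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)


def test_request_configuration() -> None:
    fake = MagicMock()
    fake.models.generate_content.return_value.text = '{"store": "Acme"}'

    result = parse_result("ACME\nTOTAL 1.00", "some-model", fake)

    assert result == {"store": "Acme"}
    fake.models.generate_content.assert_called_once()
    kwargs = fake.models.generate_content.call_args.kwargs
    assert kwargs["model"] == "some-model"
    assert kwargs["contents"] == "ACME\nTOTAL 1.00"
    assert kwargs["config"]["response_mime_type"] == "application/json"


def test_invalid_json_raises() -> None:
    fake = MagicMock()
    fake.models.generate_content.return_value.text = "not json"
    try:
        parse_result("x", "some-model", fake)
    except json.JSONDecodeError:
        return
    raise AssertionError("expected JSONDecodeError")
```

Against the real function, the second test would assert whatever failure contract the library settles on (re-raise, or a logged error with a `None` return), which the suggestion above leaves open for validation errors.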



3 participants