@muhammad-ali-emumba
PR Description

Problem

Dolphin pads PDF page images during processing, but the bboxes dumped to the recognition JSON are relative to the unpadded images. Visualizing or reusing those bboxes against the padded images therefore produces misaligned regions.

This leads to a mismatch when users attempt to:

  • Reconstruct or visualize the processed pages using the dumped bboxes.

  • Crop specific parts of a page (e.g., code blocks or tables) for downstream tasks such as LLM-based text/code extraction.

As a result, the dumped bboxes from the existing recognition_json/.json do not correctly align with the actual padded images used internally.
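
To make the misalignment concrete, here is a minimal sketch of how padding shifts a bbox. It assumes the page is centered on a square canvas and that bboxes are [x1, y1, x2, y2] in pixels; both are assumptions for illustration, and the helper names pad_offsets and to_padded are hypothetical, not Dolphin functions.

```python
def pad_offsets(width, height):
    """Offsets introduced by centering a width x height page on a square canvas.

    Assumed padding scheme, for illustration only.
    """
    side = max(width, height)
    left = (side - width) // 2
    top = (side - height) // 2
    return left, top


def to_padded(bbox, width, height):
    """Shift an [x1, y1, x2, y2] bbox from unpadded to padded coordinates."""
    left, top = pad_offsets(width, height)
    x1, y1, x2, y2 = bbox
    return [x1 + left, y1 + top, x2 + left, y2 + top]


# An 800 x 1000 page padded to 1000 x 1000 gains 100 px on the left and 0 px on top.
unpadded_bbox = [100, 200, 300, 400]
print(to_padded(unpadded_bbox, width=800, height=1000))  # [200, 200, 400, 400]
```

Under these assumptions, drawing the unpadded bbox on the padded image would land 100 px to the left of the actual region.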

Solution

This PR introduces two key enhancements to improve accuracy and reproducibility:

Export Padded Images

  • Added logic in demo_page.py and utils/util.py to save the padded page images generated during PDF processing.

  • Padded images are saved in a directory specified by the user via the new CLI argument --processed_images_dir.

  • If no directory is provided (via CLI or config), Dolphin will automatically create a default folder named processed_images_by_dolphin in the parent directory.

  • The saved images follow a consistent naming structure:

    <processed_images_dir>/<pdf_name>/page-1.png
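
The naming structure above can be sketched as follows. This is an illustration rather than the code in demo_page.py: the function name padded_image_path is hypothetical, <pdf_name> is assumed to be the PDF file name without its extension, page numbers are assumed to start at 1, and the default folder is resolved relative to the current directory here because the exact default location is not spelled out.

```python
from pathlib import Path

# Default folder name taken from the PR description.
DEFAULT_DIR_NAME = "processed_images_by_dolphin"


def padded_image_path(pdf_path, page_number, processed_images_dir=None):
    """Build <processed_images_dir>/<pdf_name>/page-<n>.png (hypothetical helper)."""
    base = Path(processed_images_dir) if processed_images_dir else Path(DEFAULT_DIR_NAME)
    pdf_name = Path(pdf_path).stem  # assumption: file name without extension
    return base / pdf_name / f"page-{page_number}.png"


print(padded_image_path("docs/report.pdf", 1, "out").as_posix())
print(padded_image_path("docs/report.pdf", 2).as_posix())
```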

Dump Adjusted (Padded) Bounding Boxes

  • When Dolphin transforms the original bboxes to match the padded images, the adjusted coordinates are now stored in the output JSON under a new key, padded_bboxes, alongside the existing bboxes:

    {
      "bboxes": [...],
      "padded_bboxes": [...]
    }

  • This lets consumers of the output map visual elements or crop regions from the padded images without any additional coordinate transformation.
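
As a rough sketch of how a consumer might use the new key, the snippet below turns padded_bboxes into crop boxes clamped to the padded image size. It assumes each entry is [x1, y1, x2, y2] in pixels and that the record has the shape shown above; the function name crop_boxes and the sample values are hypothetical.

```python
import json


def crop_boxes(record, image_size):
    """Return (left, upper, right, lower) boxes clamped to the padded image.

    Assumes record["padded_bboxes"] holds [x1, y1, x2, y2] pixel coordinates.
    Boxes that are empty after clamping are skipped.
    """
    width, height = image_size
    boxes = []
    for x1, y1, x2, y2 in record["padded_bboxes"]:
        left = max(0, min(int(x1), width))
        upper = max(0, min(int(y1), height))
        right = max(0, min(int(x2), width))
        lower = max(0, min(int(y2), height))
        if right > left and lower > upper:
            boxes.append((left, upper, right, lower))
    return boxes


# Illustrative record; real files come from the recognition JSON output.
record = json.loads(
    '{"bboxes": [[100, 200, 300, 400]], "padded_bboxes": [[200, 200, 400, 400]]}'
)
print(crop_boxes(record, image_size=(1000, 1000)))  # [(200, 200, 400, 400)]
```

Each returned tuple uses the (left, upper, right, lower) order that Pillow's Image.crop expects, so it can be passed directly to the saved padded page image.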

Impact

  • Enables accurate cropping of specific PDF regions (e.g., code snippets, figures) for post-processing or LLM-based enrichment.

  • Improves reproducibility between internal image processing and exported JSON metadata.

  • Backward compatible: existing users relying on the current bboxes field are not affected.

Files Modified

  • demo_page.py

  • utils/util.py

Testing

Validated on multiple PDFs containing code blocks and tables.
Verified that:

  • Dumped padded_bboxes correctly align with visual regions when overlaid on the padded images.

…cognition json files and save the padded images on disk.