Add support for exporting padded images and adjusted bounding boxes for accurate region cropping #154
PR Description
Problem
Dolphin applies padding to PDF page images during processing, but the dumped bboxes in the recognition JSON correspond to the un-padded images, causing misalignment when visualizing or reusing those bboxes.
This leads to a mismatch when users attempt to:
Reconstruct or visualize the processed pages using the dumped bboxes.
Crop specific parts of a page (e.g., code blocks or tables) for downstream tasks such as LLM-based text/code extraction.
As a result, the dumped bboxes from the existing recognition_json/.json do not correctly align with the actual padded images used internally.
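The misalignment described above amounts to a coordinate offset. A minimal sketch follows, assuming boxes are stored as [x1, y1, x2, y2] in pixels and that padding adds a known number of pixels on the left and top of the page image. The function name and the `pad_left` / `pad_top` parameters are hypothetical and are not taken from the Dolphin code.

```python
from typing import Sequence, Tuple


def shift_bbox_to_padded(
    bbox: Sequence[float], pad_left: float, pad_top: float
) -> Tuple[float, float, float, float]:
    """Translate an [x1, y1, x2, y2] box from un-padded to padded image coordinates.

    Assumption: padding only translates the page content; it does not rescale it.
    """
    x1, y1, x2, y2 = bbox
    return (x1 + pad_left, y1 + pad_top, x2 + pad_left, y2 + pad_top)


# Hypothetical values: 40 px of padding on the left, none on top.
unpadded_box = [100, 200, 300, 260]
padded_box = shift_bbox_to_padded(unpadded_box, pad_left=40, pad_top=0)
```

If the pipeline also resizes the image after padding, a scale factor would have to be applied as well; this sketch does not cover that case.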
Solution
This PR introduces two key enhancements to improve accuracy and reproducibility:
Export Padded Images
Added logic in demo_page.py and utils/util.py to save the padded page images generated during PDF processing.
Padded images are saved in a directory specified by the user via the new CLI argument --processed_images_dir.
If no directory is provided (via CLI or config), Dolphin will automatically create a default folder named processed_images_by_dolphin in the parent directory.
The saved images follow a consistent naming structure:
<processed_images_dir>/<pdf_name>/page-1.png
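The naming structure above could be produced roughly as follows. This is an illustrative sketch only: the helper name is hypothetical, and it assumes that <pdf_name> is the PDF file name without its extension and that page numbers start at 1, as the example path suggests.

```python
from pathlib import Path


def padded_image_path(processed_images_dir: str, pdf_path: str, page_number: int) -> Path:
    """Build <processed_images_dir>/<pdf_name>/page-<n>.png for one page.

    Assumption: <pdf_name> is the PDF file name with the extension removed.
    """
    pdf_name = Path(pdf_path).stem
    return Path(processed_images_dir) / pdf_name / f"page-{page_number}.png"


# Hypothetical inputs, using the default folder name from this PR.
target = padded_image_path("processed_images_by_dolphin", "docs/report.pdf", 1)
```

A caller would typically create the parent folder first with `target.parent.mkdir(parents=True, exist_ok=True)` before saving the image.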
Dump Adjusted (Padded) Bounding Boxes
Impact
Enables accurate cropping of specific PDF regions (e.g., code snippets, figures) for post-processing or LLM-based enrichment.
Improves reproducibility between internal image processing and exported JSON metadata.
Backward compatible: existing users relying on the current bboxes field are not affected.
Files Modified
demo_page.py
utils/util.py
Testing
Validated on multiple PDFs containing code blocks and tables.
Verified that:
- Dumped padded_bboxes correctly align with visual regions when overlaid on padded images.
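A check of this kind can be partly automated. The sketch below is a sanity test and not the validation actually used for this PR: it assumes each entry in padded_bboxes is an [x1, y1, x2, y2] pixel box, and the function name is hypothetical. It confirms only that a box is non-empty and lies inside the padded image; visual alignment still needs an overlay.

```python
from typing import Sequence


def bbox_fits_image(bbox: Sequence[float], width: int, height: int) -> bool:
    """Return True if an [x1, y1, x2, y2] box is non-empty and inside the image."""
    x1, y1, x2, y2 = bbox
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


# Hypothetical padded page of 1000 x 1000 pixels.
inside = bbox_fits_image([140, 200, 340, 260], width=1000, height=1000)
outside = bbox_fits_image([900, 200, 1100, 260], width=1000, height=1000)
```

A box that passes this check can be handed to an image library for cropping, for example Pillow's `Image.crop`, which takes a (left, upper, right, lower) tuple.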