Description of the bug
When converting PDFs that contain URL annotations plus decorative vector link styling, pdf2docx can produce three hyperlink-related regressions in DOCX output:
- Invalid OOXML hyperlink structure (w:hyperlink nested under w:r), which is not the expected paragraph-level structure.
- Duplicate underline artifacts: decorative vector strips near links can be exported as extra inline images, so links show an additional line.
- Link color mismatch: link color can be reported as black from text extraction, even when vector drawing around the hyperlink region is blue.
This affects real-world PDFs with vector-heavy text/styling (including CJK-heavy documents), and can lead to visibly broken link rendering in Word/LibreOffice.
How to reproduce the bug
- Use the sample script (added in the PR branch) to generate the sample PDF:
cd test/samples
python generate_demo_hyperlink_style_shape.py
This creates test/samples/demo-hyperlink-style-shape.pdf.
- Convert it:
from pdf2docx import Converter
cv = Converter('test/samples/demo-hyperlink-style-shape.pdf')
cv.convert('out.docx')
cv.close()
- Inspect out.docx internals (word/document.xml):
- On current release, hyperlinks can be nested under runs.
- Decorative vector strip may appear as word/media/image*.png + w:drawing in the same paragraph as link text.
- Hyperlink text color may not match vector-styled source links.
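The three inspection points above can be checked with a small script instead of reading the XML by hand. This is a minimal sketch using only the standard library; the helper names inspect_document_xml and inspect_docx and the report keys are hypothetical and not part of pdf2docx. It assumes a flat document body: hyperlinks inside text boxes or nested tables may be counted more than once or flagged as nested.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def inspect_document_xml(xml_bytes):
    """Summarize hyperlink structure in a word/document.xml payload (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    report = {
        "nested_hyperlinks": 0,                 # w:hyperlink found under a w:r
        "paragraphs_with_link_and_drawing": 0,  # link text and w:drawing share a paragraph
        "link_colors": [],                      # w:color values found inside hyperlinks
    }

    # Point 1: a hyperlink should be paragraph content, not a descendant of a run.
    for run in root.iter(W + "r"):
        report["nested_hyperlinks"] += len(run.findall(".//" + W + "hyperlink"))

    for para in root.iter(W + "p"):
        links = para.findall(".//" + W + "hyperlink")
        if not links:
            continue
        # Point 2: a drawing next to link text may be the decorative strip.
        if para.find(".//" + W + "drawing") is not None:
            report["paragraphs_with_link_and_drawing"] += 1
        # Point 3: collect the run colors applied to the link text.
        for link in links:
            for color in link.iter(W + "color"):
                report["link_colors"].append(color.get(W + "val"))
    return report


def inspect_docx(path):
    """Read word/document.xml from a .docx file and summarize it."""
    with zipfile.ZipFile(path) as zf:
        return inspect_document_xml(zf.read("word/document.xml"))
```

Calling inspect_docx('out.docx') after the conversion step returns the report dictionary. A paragraph that contains both a link and a drawing is only a hint, since a legitimate inline image can also share a paragraph with a link.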
Expected behavior:
- w:hyperlink should be direct paragraph content (valid structure).
- Decorative line strip used for link styling should not be emitted as an extra inline image.
- Hyperlink color should follow source vector styling when available.
I also added an automated regression test that checks all three points:
pytest -q test/test.py::TestConversion::test_hyperlink_style_and_structure
pdf2docx version
0.5.10
Operating system
macOS
Python version
3.11