Hyperlink conversion emits invalid OOXML and duplicate underline artifacts #369

@devinhurry

Description of the bug

When converting PDFs that contain URL annotations plus decorative vector link styling, pdf2docx can produce three hyperlink-related regressions in DOCX output:

  1. Invalid OOXML hyperlink structure (w:hyperlink nested under w:r), which is not the expected paragraph-level structure.
  2. Duplicate underline artifacts: decorative vector strips near links can be exported as extra inline images, so links show an additional line.
  3. Link color mismatch: link color can be reported as black from text extraction, even when vector drawing around the hyperlink region is blue.

This affects real-world PDFs with vector-heavy text/styling (including CJK-heavy documents), and can lead to visibly broken link rendering in Word/LibreOffice.
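
To make item 1 concrete, here is a minimal sketch contrasting the two shapes. The XML fragments are hand-written for illustration (they are not actual pdf2docx output), and `hyperlink_parent_tags` is a hypothetical helper name; it reports which element directly contains each w:hyperlink.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Illustrative fragment: hyperlink wrapped inside a run (the shape reported as invalid).
NESTED = f"""
<w:p xmlns:w="{W}" xmlns:r="{R}">
  <w:r>
    <w:hyperlink r:id="rId5"><w:r><w:t>example</w:t></w:r></w:hyperlink>
  </w:r>
</w:p>"""

# Illustrative fragment: hyperlink as a direct child of the paragraph (the expected shape).
PARAGRAPH_LEVEL = f"""
<w:p xmlns:w="{W}" xmlns:r="{R}">
  <w:hyperlink r:id="rId5"><w:r><w:t>example</w:t></w:r></w:hyperlink>
</w:p>"""


def hyperlink_parent_tags(xml_text):
    """Return the local tag name of the direct parent of every w:hyperlink."""
    root = ET.fromstring(xml_text.strip())
    parents = []
    for parent in root.iter():
        for child in parent:
            if child.tag == f"{{{W}}}hyperlink":
                # Tags look like "{namespace}local"; keep only the local part.
                parents.append(parent.tag.split("}")[1])
    return parents
```

Calling `hyperlink_parent_tags(NESTED)` reports a run (`r`) as the parent, while `hyperlink_parent_tags(PARAGRAPH_LEVEL)` reports the paragraph (`p`).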

How to reproduce the bug

  1. Use the attached sample flow (added in the PR branch) to generate the sample PDF:

     cd test/samples
     python generate_demo_hyperlink_style_shape.py

     This creates test/samples/demo-hyperlink-style-shape.pdf.

  2. Convert it:

     from pdf2docx import Converter

     # The path is relative to the repository root, so run this from there.
     cv = Converter('test/samples/demo-hyperlink-style-shape.pdf')
     cv.convert('out.docx')
     cv.close()

  3. Inspect out.docx internals (word/document.xml):
  • On the current release (0.5.10), w:hyperlink can be nested under w:r.
  • Decorative vector strip may appear as word/media/image*.png + w:drawing in the same paragraph as link text.
  • Hyperlink text color may not match vector-styled source links.
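
The inspection in step 3 can be done by hand, but a small script makes it repeatable. The sketch below uses only the standard library and is not the regression test from the PR; `inspect_hyperlinks` and the keys of the report it returns are names chosen here for illustration. It assumes that link color, when present, is stored as w:color inside the hyperlink's runs.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Qualify a local tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


def inspect_hyperlinks(docx_path):
    """Summarize the three symptoms described in this issue for one DOCX file."""
    with zipfile.ZipFile(docx_path) as zf:
        # Files under word/media/ (skip bare directory entries).
        media = sorted(
            n for n in zf.namelist()
            if n.startswith("word/media/") and not n.endswith("/")
        )
        root = ET.fromstring(zf.read("word/document.xml"))

    # Symptom 1: w:hyperlink appearing as a direct child of w:r.
    nested = sum(
        1
        for run in root.iter(w("r"))
        for child in run
        if child.tag == w("hyperlink")
    )

    # Symptoms 2 and 3: drawings sharing a paragraph with a link, and link colors.
    link_paragraphs_with_drawing = 0
    colors = []
    for para in root.iter(w("p")):
        links = list(para.iter(w("hyperlink")))
        if not links:
            continue
        if next(para.iter(w("drawing")), None) is not None:
            link_paragraphs_with_drawing += 1
        for link in links:
            for color in link.iter(w("color")):
                colors.append(color.get(w("val")))

    return {
        "media_parts": media,
        "nested_hyperlinks": nested,
        "link_paragraphs_with_drawing": link_paragraphs_with_drawing,
        "hyperlink_colors": colors,
    }
```

After step 2, `print(inspect_hyperlinks('out.docx'))` gives a quick summary. A nonzero `nested_hyperlinks` count corresponds to the first bullet, a nonzero `link_paragraphs_with_drawing` count to the second, and the `hyperlink_colors` values can be compared against the source styling for the third.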

Expected behavior:

  • w:hyperlink should be direct paragraph content (valid structure).
  • Decorative line strip used for link styling should not be emitted as an extra inline image.
  • Hyperlink color should follow source vector styling when available.

I also added an automated regression test that checks all three points:

pytest -q test/test.py::TestConversion::test_hyperlink_style_and_structure

pdf2docx version

0.5.10

Operating system

macOS

Python version

3.11
