extract_inner_text silently deletes text between < and > in DocTags text nodes (e.g. p-values like "P < 0.05") #618

@papatistos

Summary

load_from_doctags uses a nested helper extract_inner_text to strip structural DocTags from text node content. The regex it uses has a bug that silently deletes arbitrary spans of text whenever the content contains both a < and a later > character — a common occurrence in scientific text (e.g. statistical comparisons like P < 0.05 ... P > 0.05).

Affected file

docling_core/types/doc/document.py, inside DoclingDocument.load_from_doctags:

def extract_inner_text(text_chunk: str) -> str:
    """Strip all <...> tags (except <_..._>) to get the raw text content."""
    return re.sub(r"<(?!_.*?_>).*?>", "", text_chunk, flags=re.DOTALL).strip()

Root cause

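The pattern <(?!_.*?_>).*?> does not require the < to begin a real tag: the negative lookahead only rejects the <_..._> form, so any other < opens a match. The lazy .*? then runs to the next > anywhere in the string (across newlines too, because of re.DOTALL). When a text node contains something like: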

...( r = -0.36, P < 0.05), but not ( r = -0.08, P > 0.05)...

The regex opens a "tag" match at the < in P < 0.05, then .*? scans forward until it finds the > in P > 0.05, consuming the entire span:

< 0.05), but not ( r = -0.08, P >

This leaves the output as ...( r = -0.36, P  0.05)... (note the double space): both comparison operators and all text between them are silently dropped.
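To make the consumed span visible, the following minimal sketch runs the pattern from the issue against the example text and prints exactly what one match swallows. The variable names are illustrative and not part of docling-core.

```python
import re

# Pattern copied from extract_inner_text as quoted in this issue.
BUGGY = re.compile(r"<(?!_.*?_>).*?>", flags=re.DOTALL)

# Example text node content from the issue (no DocTags in it at all).
text = "( r = -0.36, P < 0.05), but not ( r = -0.08, P > 0.05)"

# The first match starts at the "<" of "P < 0.05" and ends at the ">" of "P > 0.05".
match = BUGGY.search(text)
consumed = match.group(0)
remaining = BUGGY.sub("", text)

print(repr(consumed))
print(repr(remaining))
```

The match is a single span even though the input contains no tag, which is why the loss is silent: re.sub simply replaces it with an empty string.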

Reproduction

import re

def extract_inner_text_buggy(text_chunk):
    return re.sub(r"<(?!_.*?_>).*?>", "", text_chunk, flags=re.DOTALL).strip()

sample = '<loc_35><loc_264>We found (r = -0.36, P < 0.05), but not (r = -0.08, P > 0.05), lending support.'
print(extract_inner_text_buggy(sample))
# Output: We found (r = -0.36, P  0.05), lending support.
#                                ^ everything between < and > is deleted

Fix

Require the tag to start with [a-zA-Z/] (an actual tag name character) and use [^>]* instead of .*? so the match can never cross a > boundary:

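def extract_inner_text(text_chunk: str) -> str:
    """Strip <name...> and </name> tags (except <_..._>) to get the raw text content."""
    # [a-zA-Z/] requires a tag-like first character; [^>]* cannot run past the first ">".
    return re.sub(r"<(?!_.*?_>)[a-zA-Z/][^>]*>", "", text_chunk).strip()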

Verified fix:

# Reuses `re` and `sample` from the Reproduction snippet above.
def extract_inner_text_fixed(text_chunk):
    return re.sub(r"<(?!_.*?_>)[a-zA-Z/][^>]*>", "", text_chunk).strip()

print(extract_inner_text_fixed(sample))
# Output: We found (r = -0.36, P < 0.05), but not (r = -0.08, P > 0.05), lending support.

Note: re.DOTALL is also unnecessary with the fixed pattern since [^>]* already handles multi-line tag content correctly.
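A few extra checks on the proposed pattern may be useful as a regression test. The sketch below uses made-up inputs (assumptions, not taken from a real document), including <_Python_> as an assumed example of the <_..._> form. It also compares against a hypothetical simplified pattern without the lookahead: since [a-zA-Z/] can never match an underscore, the lookahead appears to be redundant once the character class is in place.

```python
import re

# Pattern proposed in this issue.
FIXED = re.compile(r"<(?!_.*?_>)[a-zA-Z/][^>]*>")
# Hypothetical simplification: the same pattern with the lookahead dropped.
SIMPLIFIED = re.compile(r"<[a-zA-Z/][^>]*>")


def strip_tags(pattern, text_chunk):
    """Apply a tag-stripping pattern the same way extract_inner_text does."""
    return pattern.sub("", text_chunk).strip()


# Illustrative inputs mapped to the output expected from the fixed pattern.
cases = {
    # Comparisons survive; location tags are removed.
    "<loc_35><loc_264>P < 0.05 and P > 0.05": "P < 0.05 and P > 0.05",
    # Newlines in the text content are untouched.
    "<text>first\nsecond</text>": "first\nsecond",
    # A tag split across lines is still removed without re.DOTALL.
    "<loc\n_1>tag split across lines": "tag split across lines",
    # The <_..._> form is preserved.
    "keep <_Python_> as is": "keep <_Python_> as is",
    # Remaining gap: a letter directly after "<" is still treated as a tag.
    "a <b and c> d": "a  d",
}

for raw, expected in cases.items():
    assert strip_tags(FIXED, raw) == expected
    # The lookahead does not change the result for any of these inputs.
    assert strip_tags(SIMPLIFIED, raw) == strip_tags(FIXED, raw)
```

The last case shows that the fix narrows the problem rather than eliminating it: prose such as "a <b and c> d" would still lose text. Matching only known DocTags names would be stricter, but that is a design choice for the maintainers.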

Impact

Any document whose text nodes contain < followed by a later > (statistical notation, inequalities, angle brackets in prose, HTML fragments in text) will have content silently deleted when using the DocTags pipeline (e.g. VlmPipeline with GraniteDocling or SmolDocling models).
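One rough way to estimate whether existing DocTags output was affected is to run both patterns over the raw text chunks and flag the ones where the results differ. This is only a sketch: find_affected_chunks is a hypothetical helper, not a docling-core function, and the chunks are invented examples.

```python
import re

BUGGY = re.compile(r"<(?!_.*?_>).*?>", flags=re.DOTALL)  # current pattern
FIXED = re.compile(r"<(?!_.*?_>)[a-zA-Z/][^>]*>")         # proposed pattern


def find_affected_chunks(chunks):
    """Return (index, buggy_output, fixed_output) where the two patterns disagree."""
    affected = []
    for index, chunk in enumerate(chunks):
        buggy = BUGGY.sub("", chunk).strip()
        fixed = FIXED.sub("", chunk).strip()
        if buggy != fixed:
            affected.append((index, buggy, fixed))
    return affected


# Invented raw chunks; only the second has "<" followed by a later ">".
chunks = [
    "<loc_10><loc_20>Plain sentence with no comparisons.",
    "<loc_35><loc_264>We found (r = -0.36, P < 0.05), but not (r = -0.08, P > 0.05).",
    "<loc_5><loc_6>Only one comparison here: n < 30.",
]

affected_indices = [index for index, _, _ in find_affected_chunks(chunks)]
print(affected_indices)
```

A chunk with a lone < and no later > is not flagged, which matches the condition described above: the deletion needs both characters in the same text node.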

Version

docling-core==2.77.0
