Summary
load_from_doctags uses a nested helper extract_inner_text to strip structural DocTags from text node content. The regex it uses has a bug that silently deletes arbitrary spans of text whenever the content contains both a < and a later > character — a common occurrence in scientific text (e.g. statistical comparisons like P < 0.05 ... P > 0.05).
Affected file
docling_core/types/doc/document.py, inside DoclingDocument.load_from_doctags:
def extract_inner_text(text_chunk: str) -> str:
    """Strip all <...> tags (except <_..._>) to get the raw text content."""
    return re.sub(r"<(?!_.*?_>).*?>", "", text_chunk, flags=re.DOTALL).strip()
Root cause
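The pattern <(?!_.*?_>).*?> places no constraint on what follows the opening <. After any < that does not start a <_..._> token, the lazy quantifier .*? expands until it reaches the next >, however far away, and re.DOTALL lets it cross line breaks as well. When a text node contains something like: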
...( r = -0.36, P < 0.05), but not ( r = -0.08, P > 0.05)...
The regex opens a "tag" match at P <, then .*? scans forward until it finds the > inside P > 0.05, consuming the entire span:
< 0.05), but not ( r = -0.08, P >
This leaves the output as ...( r = -0.36, P 0.05)... — all text between the two comparisons is silently dropped.
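The over-match described above can be inspected directly by listing what the pattern matches instead of substituting it away. This is a minimal sketch: the pattern is copied from the issue, while the constant name BUGGY and the shortened input string are illustrative assumptions.

```python
import re

# Pattern copied from extract_inner_text; BUGGY is an illustrative name.
BUGGY = re.compile(r"<(?!_.*?_>).*?>", flags=re.DOTALL)

# Shortened version of the text node quoted above (no real DocTags in it).
text = "( r = -0.36, P < 0.05), but not ( r = -0.08, P > 0.05)"

# Collect every span the pattern treats as a "tag".
spans = [m.group(0) for m in BUGGY.finditer(text)]
print(spans)
# → ['< 0.05), but not ( r = -0.08, P >']
```

The single match starts at the less-than sign and ends at the greater-than sign, so both operators and everything between them are removed.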
Reproduction
import re

def extract_inner_text_buggy(text_chunk):
    # Same pattern and flags as the current extract_inner_text
    return re.sub(r"<(?!_.*?_>).*?>", "", text_chunk, flags=re.DOTALL).strip()

sample = '<loc_35><loc_264>We found (r = -0.36, P < 0.05), but not (r = -0.08, P > 0.05), lending support.'
print(extract_inner_text_buggy(sample))
# Output: We found (r = -0.36, P  0.05), lending support.
# Everything from the "<" through the later ">" is deleted.
Fix
Require the tag to start with [a-zA-Z/] (an actual tag name character) and use [^>]* instead of .*? so the match can never cross a > boundary:
def extract_inner_text(text_chunk: str) -> str:
    return re.sub(r"<(?!_.*?_>)[a-zA-Z/][^>]*>", "", text_chunk).strip()
Verified fix:
def extract_inner_text_fixed(text_chunk):
    return re.sub(r"<(?!_.*?_>)[a-zA-Z/][^>]*>", "", text_chunk).strip()

print(extract_inner_text_fixed(sample))
# Output: We found (r = -0.36, P < 0.05), but not (r = -0.08, P > 0.05), lending support.
Note: re.DOTALL is also unnecessary with the fixed pattern, since a negated character class such as [^>] matches newlines regardless of that flag, so multi-line tag content is still handled correctly.
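A small regression check along the following lines could accompany the fix. It is only a sketch: the two patterns are copied from this issue, but the helper strip_tags, the constant names, and the input strings are hypothetical and not part of docling-core.

```python
import re

# Both patterns are copied from the issue text; the names are illustrative.
BUGGY = r"<(?!_.*?_>).*?>"
FIXED = r"<(?!_.*?_>)[a-zA-Z/][^>]*>"


def strip_tags(pattern: str, text: str, flags: int = 0) -> str:
    """Hypothetical helper: remove every match of `pattern` from `text`."""
    return re.sub(pattern, "", text, flags=flags).strip()


# Illustrative inputs (not taken from real documents) -> expected output.
CASES = {
    # Comparison operators in prose survive.
    "<loc_35><loc_264>P < 0.05 but P > 0.05": "P < 0.05 but P > 0.05",
    # Opening and closing tags are both stripped.
    "<text>plain</text>": "plain",
    # A tag containing a newline is stripped without re.DOTALL.
    "<loc_1\n>tag split across lines": "tag split across lines",
    # Tokens of the <_..._> form are kept, as the docstring states.
    "<code><_Python_>x = 1</code>": "<_Python_>x = 1",
}

for raw, expected in CASES.items():
    assert strip_tags(FIXED, raw) == expected, raw

# Remaining limitation: "<" directly followed by a letter still looks like a tag.
print(repr(strip_tags(FIXED, "a<b and c>d")))
# → 'ad'
```

The last line shows that the proposed pattern narrows the bug rather than eliminating it: an unspaced comparison such as a<b followed by a later > is still removed. Because [a-zA-Z/] can never match an underscore, the (?!_.*?_>) lookahead no longer changes the result in the fixed pattern and could probably be dropped.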
Impact
Any document whose text nodes contain < followed by a later > (statistical notation, inequalities, angle brackets in prose, HTML fragments in text) will have content silently deleted when using the DocTags pipeline (e.g. VlmPipeline with GraniteDocling or SmolDocling models).
Version
docling-core==2.77.0