extract_inner_text silently deletes text between < and > in DocTags text nodes (e.g. p-values like "P < 0.05") #618

## Summary

`load_from_doctags` uses a nested helper, `extract_inner_text`, to strip structural DocTags from text node content. The regex it uses silently deletes arbitrary spans of text whenever the content contains a `<` followed later by a `>`. This is common in scientific text (e.g. statistical comparisons like `P < 0.05 ... P > 0.05`).

## Affected file

`docling_core/types/doc/document.py`, inside `DoclingDocument.load_from_doctags`:

```python
def extract_inner_text(text_chunk: str) -> str:
    """Strip all <...> tags (except <_..._>) to get the raw text content."""
    return re.sub(r"<(?!_.*?_>).*?>", "", text_chunk, flags=re.DOTALL).strip()
```

## Root cause

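The pattern `<(?!_.*?_>).*?>` accepts any `<` as the start of a tag (unless it begins a `<_..._>` token), and the lazy quantifier `.*?` then extends the match to the *next* `>` anywhere in the string, including across newlines because of `re.DOTALL`. When a text node contains something like: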

```
...( r = -0.36, P < 0.05), but not ( r = -0.08, P > 0.05)...
```

The regex opens a "tag" match at the `<` in `P < 0.05`, then `.*?` scans forward until it finds the `>` inside `P > 0.05`, consuming the entire span:

```
< 0.05), but not ( r = -0.08, P >
```

This leaves the output as `...( r = -0.36, P  0.05)...`: all text between the two comparison operators, and the operators themselves, is silently dropped.
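To make the over-match visible, the following minimal sketch lists every span the current pattern treats as a "tag". The pattern is copied from the helper above; the sample string is a shortened, illustrative text node, not taken from a real document.

```python
import re

# Pattern copied from extract_inner_text as quoted in this issue.
BUGGY_TAG = re.compile(r"<(?!_.*?_>).*?>", flags=re.DOTALL)

# Illustrative text node (an assumption for this sketch).
sample = "<loc_35>(r = -0.36, P < 0.05), but not (r = -0.08, P > 0.05)"

# Every match is a span that re.sub would delete as if it were a tag.
matched_spans = [m.group(0) for m in BUGGY_TAG.finditer(sample)]
for span in matched_spans:
    print(repr(span))
```

The first match is the real `<loc_35>` tag; the second is ordinary prose that merely starts with `<` and ends at the next `>`.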

## Reproduction

```python
import re

def extract_inner_text_buggy(text_chunk):
    return re.sub(r"<(?!_.*?_>).*?>", "", text_chunk, flags=re.DOTALL).strip()

sample = '<loc_35><loc_264>We found (r = -0.36, P < 0.05), but not (r = -0.08, P > 0.05), lending support.'
print(extract_inner_text_buggy(sample))
# Output: We found (r = -0.36, P  0.05), lending support.
#                                ^ everything between < and > is deleted
```

## Fix

Require the tag to start with `[a-zA-Z/]` (an actual tag name character) and use `[^>]*` instead of `.*?` so the match can never cross a `>` boundary:

```python
def extract_inner_text(text_chunk: str) -> str:
    return re.sub(r"<(?!_.*?_>)[a-zA-Z/][^>]*>", "", text_chunk).strip()
```

Verified fix:

```python
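import re

def extract_inner_text_fixed(text_chunk):
    return re.sub(r"<(?!_.*?_>)[a-zA-Z/][^>]*>", "", text_chunk).strip()

# Same sample as in the reproduction above
sample = '<loc_35><loc_264>We found (r = -0.36, P < 0.05), but not (r = -0.08, P > 0.05), lending support.'
print(extract_inner_text_fixed(sample))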
# Output: We found (r = -0.36, P < 0.05), but not (r = -0.08, P > 0.05), lending support.
```

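Note: `re.DOTALL` is also unnecessary with the fixed pattern, since `[^>]*` is a negated character class that matches newlines and so already handles multi-line tag content. The `(?!_.*?_>)` lookahead becomes redundant as well, because `_` is not in `[a-zA-Z/]`; it is harmless to keep.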
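A few regression-style checks for the proposed pattern could look like the sketch below. The inputs are made up for illustration, and `<_token_>` is a hypothetical stand-in for the `<_..._>` form that the helper's docstring says should be kept. The last case records a limitation rather than desired behaviour: a `<` immediately followed by a letter is still treated as a tag.

```python
import re

def extract_inner_text_fixed(text_chunk: str) -> str:
    # Proposed pattern from this issue.
    return re.sub(r"<(?!_.*?_>)[a-zA-Z/][^>]*>", "", text_chunk).strip()

# input -> expected output (illustrative cases, not from the test suite)
cases = {
    "<loc_35><loc_264>P < 0.05 and P > 0.05": "P < 0.05 and P > 0.05",
    "<text>hello</text>": "hello",                         # open and close tags
    "<loc_1>keep <_token_> here": "keep <_token_> here",   # <_..._> is preserved
    "<loc_1\n>multi-line tag": "multi-line tag",           # works without re.DOTALL
}
results = {src: extract_inner_text_fixed(src) for src in cases}

# Known limitation: prose such as "a<b and c>d" still loses "<b and c>".
limitation = extract_inner_text_fixed("a<b and c>d")
```

If that last case matters in practice, one option would be to match only known DocTags names, which would need the actual tag vocabulary from the library.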

## Impact

Any document whose text nodes contain `<` followed by a later `>` (statistical notation, inequalities, angle brackets in prose, HTML fragments in text) will have content silently deleted when using the DocTags pipeline (e.g. `VlmPipeline` with GraniteDocling or SmolDocling models).
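To gauge which text nodes are affected, one rough approach is to run both patterns and flag chunks where they disagree. This is only a sketch: `is_affected` and the example strings are hypothetical and not part of `docling-core`.

```python
import re

def strip_tags_buggy(text_chunk: str) -> str:
    # Pattern quoted in this issue as the current behaviour.
    return re.sub(r"<(?!_.*?_>).*?>", "", text_chunk, flags=re.DOTALL).strip()

def strip_tags_fixed(text_chunk: str) -> str:
    # Pattern proposed in this issue.
    return re.sub(r"<(?!_.*?_>)[a-zA-Z/][^>]*>", "", text_chunk).strip()

def is_affected(text_chunk: str) -> bool:
    """Hypothetical helper: True if the two patterns disagree on this chunk."""
    return strip_tags_buggy(text_chunk) != strip_tags_fixed(text_chunk)

examples = [
    "<loc_1>P < 0.05 in group A, P > 0.05 in group B",  # statistical notation
    "<loc_1>if a < b then c > d",                       # inequalities
    "<loc_1>only one bracket: x < 3",                   # no later '>'
    "<loc_1>plain text",                                # no brackets at all
]
flags = [is_affected(e) for e in examples]
print(flags)
```

Only chunks with a `<` followed by a later `>` are flagged; a lone `<` passes through both patterns unchanged.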

## Version

`docling-core==2.77.0`
