Skip to content

Phase 5a: pure-Swift RTFConverter#16

Merged
ronaldmannak merged 4 commits into
mainfrom
claude/phase-5a-rtf
Jun 7, 2026
Merged

Phase 5a: pure-Swift RTFConverter#16
ronaldmannak merged 4 commits into
mainfrom
claude/phase-5a-rtf

Conversation

@ronaldmannak

Copy link
Copy Markdown
Contributor

Phase 5a — RTFConverter (Tier A polish)

RTF is listed as a supported type but previously threw documentTypeNotSupported (it was deferred during the rewrite to avoid reintroducing NSAttributedString). This adds a small pure-Swift RTF→Markdown parser, so RTF now converts with no NSAttributedString, no main-thread hop, and no font-size heading guessing.

Parser

  • Decodes the RTF byte container as Latin-1 (non-ASCII arrives via \uN / \'hh).
  • Walks groups, tracking bold/italic on a state stack; emits paragraphs on \par/\row/\sect.
  • Skips control tables (fonttbl/colortbl/stylesheet/info/pict/…) and ignorable (\*) destinations.
  • Handles \uN (with the \ucN fallback-skip), \'hh, and the common control symbols (\\ \{ \} \~ \_ \-).
  • RTF has no semantic headings, so none are inferred — deliberately avoiding the lossy font-size→heading guessing the old engine did.

Wiring & tests

  • Registered at .specific priority; detection already classifies RTF via the {\rtf magic bytes.
  • Adds a conversion test (paragraphs + **bold**/*italic*, control tables skipped, no raw control words leak).
  • Drops the now-obsolete "RTF deferred" note from the registry.

First of the Tier A polish items (RTF → renderers → DOCX images).


Generated by Claude Code

Adds an RTFConverter so RTF (a documented supported type) converts instead of
throwing documentTypeNotSupported. It's a small dependency-free parser — no
NSAttributedString (the lossy, main-thread-biased path this rewrite removed):

- decodes the RTF byte container as Latin-1 (non-ASCII arrives via \uN / \'hh);
- walks groups, tracking bold/italic on a state stack and emitting paragraphs on
  \par/\row/\sect;
- skips control tables (fonttbl/colortbl/stylesheet/info/pict/...) and ignorable
  (\*) destinations;
- handles \uN (with the \ucN fallback skip), \'hh, and the common control
  symbols.

RTF carries no semantic headings, so none are inferred (we deliberately don't
guess headings from font sizes). Registered at specific priority; detection
already classifies RTF by its {\rtf magic. Adds a conversion test and drops the
"RTF deferred" note from the registry.

https://claude.ai/code/session_01VPiUoXjbN1KLgYXsSE2cdP

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request registers and implements a new pure-Swift RTFConverter to convert RTF documents to Markdown without relying on NSAttributedString, and adds corresponding unit tests. The review feedback identifies a bug in the fallback skipping logic where it prematurely terminates on backslashes, potentially causing duplicate characters. Additionally, the reviewer recommends scoping the ucSkip parameter to groups by saving and restoring it on the group state stack.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +132 to +137
// Skip the \ucN fallback characters that follow a \uN.
var skipped = 0
while i < n, skipped < ucSkip,
chars[i] != "\\", chars[i] != "{", chars[i] != "}" {
i += 1; skipped += 1
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current fallback skipping logic terminates immediately if it encounters a backslash (\). In RTF, Unicode fallback characters are very frequently represented as hex escape control symbols (e.g., \'hh like \u2014\'97). Because the loop stops at \, the fallback escape sequence is not skipped and instead gets parsed and appended to the output, resulting in duplicate characters. The skipping logic should be updated to correctly skip control symbols and control words as a single fallback item.

                        // Skip the \ucN fallback characters that follow a \uN.
                        var skipped = 0
                        while i < n, skipped < ucSkip {
                            let c = chars[i]
                            if c == "{" || c == "}" {
                                break
                            } else if c == "\\" {
                                if i + 1 < n {
                                    let next = chars[i + 1]
                                    if next == "'" {
                                        i += min(4, n - i)
                                    } else if next.isLetter {
                                        i += 1
                                        while i < n, chars[i].isLetter { i += 1 }
                                        if i < n, chars[i] == "-" { i += 1 }
                                        while i < n, chars[i].isNumber { i += 1 }
                                        if i < n, chars[i] == " " { i += 1 }
                                    } else {
                                        i += 2
                                    }
                                } else {
                                    i += 1
                                }
                            } else {
                                i += 1
                            }
                            skipped += 1
                        }

// MARK: - Parser

private struct Run { var text: String; var bold: Bool; var italic: Bool }
private struct GroupState { var bold: Bool; var italic: Bool; var ignore: Bool }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The ucSkip parameter (defined by the \ucN control word) is group-scoped according to the RTF specification. It should be saved and restored on the group state stack to prevent changes inside a group from leaking into the rest of the document.

Suggested change
private struct GroupState { var bold: Bool; var italic: Bool; var ignore: Bool }
private struct GroupState { var bold: Bool; var italic: Bool; var ignore: Bool; var ucSkip: Int }

Comment on lines +85 to +93
case "{":
stack.append(GroupState(bold: bold, italic: italic, ignore: ignore))
i += 1

case "}":
if let saved = stack.popLast() {
bold = saved.bold; italic = saved.italic; ignore = saved.ignore
}
i += 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Save and restore the ucSkip state on the group stack when entering ({) and exiting (}) a group to ensure its scope remains local to the group.

Suggested change
case "{":
stack.append(GroupState(bold: bold, italic: italic, ignore: ignore))
i += 1
case "}":
if let saved = stack.popLast() {
bold = saved.bold; italic = saved.italic; ignore = saved.ignore
}
i += 1
case "{":
stack.append(GroupState(bold: bold, italic: italic, ignore: ignore, ucSkip: ucSkip))
i += 1
case "}":
if let saved = stack.popLast() {
bold = saved.bold; italic = saved.italic; ignore = saved.ignore; ucSkip = saved.ucSkip
}
i += 1

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49416d43ba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +153 to +155
let byte = UInt32(String([chars[i], chars[i + 1]]), radix: 16),
let scalar = Unicode.Scalar(byte) {
appendText(String(Character(scalar)))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Decode RTF hex escapes using the declared code page

For RTFs that use \ansicpg1252 (the common Windows ANSI default), \'hh bytes in the 0x80–0x9F range represent printable characters such as smart quotes and dashes, but this path maps the byte directly to the same Unicode code point, producing C1 control characters instead. Files containing text like \'93quoted\'94 will convert to invisible/control output instead of readable punctuation; the parser needs to honor the RTF code page (at least Windows-1252 for \ansi/\ansicpg1252) when decoding hex escapes.

Useful? React with 👍 / 👎.

Comment on lines +128 to +129
if value >= 0, let scalar = Unicode.Scalar(UInt32(value)) {
appendText(String(Character(scalar)))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve surrogate pairs in Unicode escapes

RTF \uN values are UTF-16 code units, so characters outside the BMP (for example emoji) are emitted as high/low surrogate pairs. Each surrogate code unit makes Unicode.Scalar(UInt32(value)) return nil here, and the fallback bytes are still skipped, so those characters disappear entirely from converted documents. Buffering a high surrogate and combining it with the following low surrogate before appending would preserve these characters.

Useful? React with 👍 / 👎.

Comment on lines +134 to +135
while i < n, skipped < ucSkip,
chars[i] != "\\", chars[i] != "{", chars[i] != "}" {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip escaped fallback bytes after Unicode escapes

When an RTF writer emits the \uc fallback as an escaped byte sequence, such as the common \u8217\'92 form, this loop stops immediately at the backslash and skips zero fallback characters. The parser then processes \'92 as real text, so affected non-ASCII characters are duplicated or followed by a bogus control character in the Markdown output. The fallback skip needs to consume escaped control-symbol fallback bytes as one skipped character as well as raw fallback characters.

Useful? React with 👍 / 👎.

Comment on lines +111 to +112
case "par", "row", "sect":
flushParagraph()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ignore paragraph controls inside skipped destinations

When ignore is true for destinations like footnote, header, or pict, control words inside that skipped group still call flushParagraph(). A document with inline ignored content such as Hello{\footnote ...\par} world\par will be split into two Markdown paragraphs even though the skipped destination should not affect body text layout; paragraph/line/table controls should be no-ops while ignoring a destination.

Useful? React with 👍 / 👎.

Comment on lines +90 to +91
if let saved = stack.popLast() {
bold = saved.bold; italic = saved.italic; ignore = saved.ignore

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restore Unicode fallback length when leaving groups

RTF character properties such as \ucN are scoped by the current group, but the stack only restores bold, italic, and ignored-destination state. If a nested group sets \uc0 or \uc2, that value leaks into following body text after the group closes, causing later \uN escapes to skip the wrong number of fallback characters and either duplicate fallback text or drop real characters. Include ucSkip in the saved group state and restore it on }.

Useful? React with 👍 / 👎.

- Group-scope \ucN: save/restore ucSkip on the group stack so a nested \ucN
  doesn't leak into later body text (Gemini; Codex).
- Unit-aware \uN fallback skip: skip whole fallback units (\'hh, control words,
  control symbols, literals) instead of stopping at the first backslash, so an
  escaped fallback like 舗\'92 isn't re-parsed and duplicated (Gemini; Codex).
- Decode \'hh via Windows-1252 (the \ansi default) rather than raw Latin-1, so
  0x80-0x9F become smart quotes/dashes instead of C1 control characters (Codex).
- Combine \uN UTF-16 surrogate pairs so astral characters (e.g. emoji) survive
  instead of being dropped at a failed scalar init (Codex).
- Don't treat \par/\line/\row/escaped-newline as body breaks while inside a
  skipped destination (footnote/header/pict), so inline ignored content doesn't
  split a paragraph (Codex).

Adds tests for the fallback-no-duplicate, Windows-1252, surrogate-pair, and
ignored-destination-paragraph cases.

https://claude.ai/code/session_01VPiUoXjbN1KLgYXsSE2cdP

Copy link
Copy Markdown
Contributor Author

Addressed the @gemini-code-assist and @chatgpt-codex-connector reviews in 84c1adc. Combined, the RTF parser now handles:

  1. Group-scoped \ucN (Gemini :40/:93; Codex :91) — ucSkip is saved/restored on the group stack, so a nested \ucN no longer leaks into later body text.
  2. Unit-aware \uN fallback skip (Gemini :137; Codex :135) — skips whole fallback units (hex escapes, control words, control symbols, literals) instead of stopping at the first backslash, so an escaped fallback isn't re-parsed and duplicated.
  3. \'hh decoded via Windows-1252 (Codex :155) — the \ansi default code page, so 0x80–0x9F become smart quotes/dashes instead of C1 control characters (falls back to Latin-1 for CP1252-undefined bytes).
  4. \uN surrogate pairs combined (Codex :129) — high+low UTF-16 surrogates are merged into the astral scalar, so emoji and other non-BMP characters survive instead of being dropped.
  5. Skipped-destination paragraph controls (Codex :112) — \par/\line/\row/escaped-newline are no-ops while inside an ignored destination (footnote/header/pict), so inline ignored content doesn't split a body paragraph.

Added tests covering the no-duplicate fallback, Windows-1252 punctuation, surrogate-pair (U+1F600), and ignored-destination-paragraph cases (alongside the existing bold/italic/paragraph test).


Generated by Claude Code

@chatgpt-codex-connector

Copy link
Copy Markdown

To use Codex here, create an environment for this repo.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 84c1adc541

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

guard let rtf = String(data: data, encoding: .isoLatin1), !rtf.isEmpty else {
throw ConverterError.decodingFailed
}
let markdown = Self.markdown(fromRTF: rtf)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate the RTF header before parsing

When detection classifies an input as .rtf from the filename/MIME hint rather than the {"\\rtf magic bytes, this call still parses any decodable text as body content because markdown(fromRTF:) appends ordinary characters by default. A corrupt or mislabeled notes.rtf containing not rtf will therefore succeed and emit that text instead of surfacing a real document failure, which bypasses the strict document-hint failure behavior used elsewhere; check the decoded stream starts with an RTF header before converting.

Useful? React with 👍 / 👎.

/// quotes/dashes rather than C1 control characters. Falls back to Latin-1 for
/// the few CP1252-undefined bytes.
private static func decodeANSIByte(_ byte: UInt8) -> String {
if let decoded = String(bytes: [byte], encoding: .windowsCP1252), !decoded.isEmpty {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor non-1252 RTF code pages

For RTFs that declare a different ANSI code page, such as \ansicpg1251 or \ansicpg932, escaped bytes in \'hh are still decoded as Windows-1252 here because the parser never records the ansicpg value. Those documents now convert successfully but with mojibake for escaped non-ASCII text; the fresh evidence in this revision is that decodeANSIByte hard-codes .windowsCP1252 for every hex escape.

Useful? React with 👍 / 👎.

Comment on lines +176 to +178
default:
if Self.ignoredDestinations.contains(word) { ignore = true }
// All other control words carry no body text.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve RTF special-character control words

RTF writers can emit printable characters as control words such as \emdash, \endash, \bullet, or \lquote/\rquote instead of using \uN or \'hh. This default treats every unrecognized control word as non-text, so an otherwise valid input like A\emdash B converts to A B, silently dropping punctuation and bullets from the document; add explicit text mappings for these character controls before discarding unknown words.

Useful? React with 👍 / 👎.

Comment on lines +176 to +178
default:
if Self.ignoredDestinations.contains(word) { ignore = true }
// All other control words carry no body text.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip binary payloads before parsing braces

When an ignored destination such as \pict or \object contains \binN data, those next N bytes are arbitrary and can include } or \. This branch currently treats \bin as an unknown control word and then resumes normal parsing, so a } byte inside image data can prematurely pop the group, clear ignore, and leak the remaining binary bytes into the converted Markdown; handle bin by advancing over its declared byte count before interpreting any more RTF syntax.

Useful? React with 👍 / 👎.

- Validate the RTF header: when .rtf is inferred from a filename/MIME hint
  (not the magic bytes), reject input that doesn't start with the RTF signature
  instead of emitting its raw text -- matching the strict-failure rule used for
  other hint-detected document types.
- Honor the declared code page: track \ansicpgN and decode hex escapes with that
  Windows code page (via CFStringConvertWindowsCodepageToEncoding), so e.g.
  \ansicpg1251 Cyrillic no longer turns into Windows-1252 mojibake. Falls back to
  1252 then Latin-1.
- Map special-character control words (\emdash, \endash, \bullet, \lquote,
  \rquote, \ldblquote, \rdblquote, \emspace/\enspace) to their characters instead
  of dropping them as unknown.
- Skip \binN binary payloads (in \pict/\object) by their declared byte count, so
  a brace or backslash inside image data can't prematurely pop the group and leak
  binary into the output.

Adds tests for the mislabeled-header failure, special characters, the CP1251
code page, and the binary-payload brace hazard.

https://claude.ai/code/session_01VPiUoXjbN1KLgYXsSE2cdP

Copy link
Copy Markdown
Contributor Author

Addressed @chatgpt-codex-connector's round-2 review (commit 84c1adc) in ef460a9:

  1. Validate the RTF header (:29) — fixed. When .rtf is inferred from a filename/MIME hint rather than the magic bytes, input that doesn't start with the RTF signature is now rejected (fileCorrupted) instead of having its raw text emitted as a document — matching the strict-failure rule used for other hint-detected document types.
  2. Honor non-1252 code pages (:217) — fixed. The parser now tracks \ansicpgN and decodes \'hh escapes with that Windows code page (via CFStringConvertWindowsCodepageToEncoding), so e.g. \ansicpg1251 Cyrillic no longer becomes Windows-1252 mojibake. Falls back to 1252, then Latin-1, for undecodable bytes.
  3. Preserve special-character control words (:178) — fixed. \emdash/\endash/\bullet/\lquote/\rquote/\ldblquote/\rdblquote/\emspace/\enspace now map to their characters instead of being dropped as unknown.
  4. Skip \binN binary payloads (:178) — fixed. \bin now advances past its declared byte count, so a }/\ inside \pict/\object image data can't prematurely pop the group and leak binary into the Markdown.

Added tests for the mislabeled-header failure, special characters, the CP1251 code page, and the binary-payload brace hazard.


Generated by Claude Code

@chatgpt-codex-connector

Copy link
Copy Markdown

To use Codex here, create an environment for this repo.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ef460a921a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

/// become the right punctuation/letters rather than C1 control characters.
/// Falls back to Windows-1252, then Latin-1, for undecodable bytes.
private static func decodeByte(_ byte: UInt8, encoding: String.Encoding) -> String {
if let decoded = String(bytes: [byte], encoding: encoding), !decoded.isEmpty {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Decode multibyte ANSI escapes as byte runs

For RTFs declaring multibyte code pages such as \ansicpg932, a character is often encoded as consecutive hex escapes (for example Shift-JIS bytes \'82\'a0 for あ). This decodes each \'hh in isolation, so the lead byte cannot be decoded with the selected code page and then falls back to CP1252/Latin-1, producing mojibake instead of one character. Fresh evidence versus the earlier code-page finding is that the current implementation now records ansicpg, but still passes only [byte] to String(bytes:encoding:); buffer adjacent escaped bytes for multibyte encodings before decoding.

Useful? React with 👍 / 👎.

Comment on lines +122 to +123
case "line":
appendText("\n")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Emit hard Markdown breaks for RTF line controls

When an RTF uses \line for an explicit line break inside a paragraph, this appends only a bare newline. In Markdown that is a soft break and can be rendered as a space/wrapped text, unlike the DOCX converter's w:br path which emits \n for a hard break, so addresses, poems, or other line-sensitive RTF content lose their intended breaks. Emit a Markdown hard break (or another preserved block break) for this control word.

Useful? React with 👍 / 👎.

Comment on lines +199 to +201
default:
if Self.ignoredDestinations.contains(word) { ignore = true }
// All other control words carry no body text.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve separation at page breaks

RTF \page is a forced page break, but it currently falls into the unknown-control default after the parser has consumed the delimiter space. Inputs such as End\page Start therefore become EndStart, losing even word separation at the page boundary; treat \page as at least a paragraph or hard break instead of discarding it.

Useful? React with 👍 / 👎.

- \line now emits a Markdown hard break ("  \n") instead of a bare newline (a
  soft break), matching the DOCX w:br path, so line-sensitive content (addresses,
  poems) keeps its breaks.
- \page is treated as a paragraph break, so "End\page Start" keeps separation
  rather than collapsing to "EndStart".

Deferred (documented in source): reassembling multibyte ANSI \'hh runs for DBCS
code pages like 932 (Shift-JIS). It needs byte-run buffering for a legacy niche;
modern RTF emits non-ASCII (incl. CJK) as \uN, which is fully handled (including
surrogate pairs). Each \'hh is decoded individually with the declared code page.

Adds a test for the \line hard break and \page separation.

https://claude.ai/code/session_01VPiUoXjbN1KLgYXsSE2cdP

Copy link
Copy Markdown
Contributor Author

Addressed @chatgpt-codex-connector's round-3 review (commit ef460a9) in c9a97ac:

  1. \line hard break (:123) — fixed. \line now emits a Markdown hard break ( \n) instead of a bare newline (a soft break that renders as a space), matching the DOCX w:br path, so line-sensitive content (addresses, poems) keeps its breaks.
  2. \page separation (:201) — fixed. \page is now treated as a paragraph break, so End\page Start stays separated instead of collapsing to EndStart.
  3. Multibyte ANSI \'hh runs (:240) — deferred (documented in source). Reassembling consecutive \'hh bytes for DBCS code pages (e.g. 932 Shift-JIS) needs byte-run buffering for a legacy niche; modern RTF emits non-ASCII — including CJK — as \uN, which is fully handled here (with surrogate-pair combining). Each \'hh is decoded individually with the declared code page. I added a source comment on decodeByte noting this limitation so it's an explicit, skippable decision rather than an oversight.

Added a test for the \line hard break and \page separation.


Generated by Claude Code

@chatgpt-codex-connector

Copy link
Copy Markdown

To use Codex here, create an environment for this repo.

@ronaldmannak ronaldmannak merged commit 5590467 into main Jun 7, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants