Phase 5a: pure-Swift RTFConverter#16
Conversation
Adds an RTFConverter so RTF (a documented supported type) converts instead of
throwing documentTypeNotSupported. It's a small dependency-free parser — no
NSAttributedString (the lossy, main-thread-biased path this rewrite removed):
- decodes the RTF byte container as Latin-1 (non-ASCII arrives via \uN / \'hh);
- walks groups, tracking bold/italic on a state stack and emitting paragraphs on
\par/\row/\sect;
- skips control tables (fonttbl/colortbl/stylesheet/info/pict/...) and ignorable
(\*) destinations;
- handles \uN (with the \ucN fallback skip), \'hh, and the common control
symbols.
RTF carries no semantic headings, so none are inferred (we deliberately don't
guess headings from font sizes). Registered at specific priority; detection
already classifies RTF by its {\rtf magic. Adds a conversion test and drops the
"RTF deferred" note from the registry.
https://claude.ai/code/session_01VPiUoXjbN1KLgYXsSE2cdP
There was a problem hiding this comment.
Code Review
This pull request registers and implements a new pure-Swift RTFConverter to convert RTF documents to Markdown without relying on NSAttributedString, and adds corresponding unit tests. The review feedback identifies a bug in the fallback skipping logic where it prematurely terminates on backslashes, potentially causing duplicate characters. Additionally, the reviewer recommends scoping the ucSkip parameter to groups by saving and restoring it on the group state stack.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| // Skip the \ucN fallback characters that follow a \uN. | ||
| var skipped = 0 | ||
| while i < n, skipped < ucSkip, | ||
| chars[i] != "\\", chars[i] != "{", chars[i] != "}" { | ||
| i += 1; skipped += 1 | ||
| } |
There was a problem hiding this comment.
The current fallback skipping logic terminates immediately if it encounters a backslash (\). In RTF, Unicode fallback characters are very frequently represented as hex escape control symbols (e.g., \'hh like \u2014\'97). Because the loop stops at \, the fallback escape sequence is not skipped and instead gets parsed and appended to the output, resulting in duplicate characters. The skipping logic should be updated to correctly skip control symbols and control words as a single fallback item.
// Skip the \ucN fallback characters that follow a \uN.
var skipped = 0
while i < n, skipped < ucSkip {
let c = chars[i]
if c == "{" || c == "}" {
break
} else if c == "\\" {
if i + 1 < n {
let next = chars[i + 1]
if next == "'" {
i += min(4, n - i)
} else if next.isLetter {
i += 1
while i < n, chars[i].isLetter { i += 1 }
if i < n, chars[i] == "-" { i += 1 }
while i < n, chars[i].isNumber { i += 1 }
if i < n, chars[i] == " " { i += 1 }
} else {
i += 2
}
} else {
i += 1
}
} else {
i += 1
}
skipped += 1
}| // MARK: - Parser | ||
|
|
||
| private struct Run { var text: String; var bold: Bool; var italic: Bool } | ||
| private struct GroupState { var bold: Bool; var italic: Bool; var ignore: Bool } |
There was a problem hiding this comment.
The ucSkip parameter (defined by the \ucN control word) is group-scoped according to the RTF specification. It should be saved and restored on the group state stack to prevent changes inside a group from leaking into the rest of the document.
| private struct GroupState { var bold: Bool; var italic: Bool; var ignore: Bool } | |
| private struct GroupState { var bold: Bool; var italic: Bool; var ignore: Bool; var ucSkip: Int } |
| case "{": | ||
| stack.append(GroupState(bold: bold, italic: italic, ignore: ignore)) | ||
| i += 1 | ||
|
|
||
| case "}": | ||
| if let saved = stack.popLast() { | ||
| bold = saved.bold; italic = saved.italic; ignore = saved.ignore | ||
| } | ||
| i += 1 |
There was a problem hiding this comment.
Save and restore the ucSkip state on the group stack when entering ({) and exiting (}) a group to ensure its scope remains local to the group.
| case "{": | |
| stack.append(GroupState(bold: bold, italic: italic, ignore: ignore)) | |
| i += 1 | |
| case "}": | |
| if let saved = stack.popLast() { | |
| bold = saved.bold; italic = saved.italic; ignore = saved.ignore | |
| } | |
| i += 1 | |
| case "{": | |
| stack.append(GroupState(bold: bold, italic: italic, ignore: ignore, ucSkip: ucSkip)) | |
| i += 1 | |
| case "}": | |
| if let saved = stack.popLast() { | |
| bold = saved.bold; italic = saved.italic; ignore = saved.ignore; ucSkip = saved.ucSkip | |
| } | |
| i += 1 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 49416d43ba
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let byte = UInt32(String([chars[i], chars[i + 1]]), radix: 16), | ||
| let scalar = Unicode.Scalar(byte) { | ||
| appendText(String(Character(scalar))) |
There was a problem hiding this comment.
Decode RTF hex escapes using the declared code page
For RTFs that use \ansicpg1252 (the common Windows ANSI default), \'hh bytes in the 0x80–0x9F range represent printable characters such as smart quotes and dashes, but this path maps the byte directly to the same Unicode code point, producing C1 control characters instead. Files containing text like \'93quoted\'94 will convert to invisible/control output instead of readable punctuation; the parser needs to honor the RTF code page (at least Windows-1252 for \ansi/\ansicpg1252) when decoding hex escapes.
Useful? React with 👍 / 👎.
| if value >= 0, let scalar = Unicode.Scalar(UInt32(value)) { | ||
| appendText(String(Character(scalar))) |
There was a problem hiding this comment.
Preserve surrogate pairs in Unicode escapes
RTF \uN values are UTF-16 code units, so characters outside the BMP (for example emoji) are emitted as high/low surrogate pairs. Each surrogate code unit makes Unicode.Scalar(UInt32(value)) return nil here, and the fallback bytes are still skipped, so those characters disappear entirely from converted documents. Buffering a high surrogate and combining it with the following low surrogate before appending would preserve these characters.
Useful? React with 👍 / 👎.
| while i < n, skipped < ucSkip, | ||
| chars[i] != "\\", chars[i] != "{", chars[i] != "}" { |
There was a problem hiding this comment.
Skip escaped fallback bytes after Unicode escapes
When an RTF writer emits the \uc fallback as an escaped byte sequence, such as the common \u8217\'92 form, this loop stops immediately at the backslash and skips zero fallback characters. The parser then processes \'92 as real text, so affected non-ASCII characters are duplicated or followed by a bogus control character in the Markdown output. The fallback skip needs to consume escaped control-symbol fallback bytes as one skipped character as well as raw fallback characters.
Useful? React with 👍 / 👎.
| case "par", "row", "sect": | ||
| flushParagraph() |
There was a problem hiding this comment.
Ignore paragraph controls inside skipped destinations
When ignore is true for destinations like footnote, header, or pict, control words inside that skipped group still call flushParagraph(). A document with inline ignored content such as Hello{\footnote ...\par} world\par will be split into two Markdown paragraphs even though the skipped destination should not affect body text layout; paragraph/line/table controls should be no-ops while ignoring a destination.
Useful? React with 👍 / 👎.
| if let saved = stack.popLast() { | ||
| bold = saved.bold; italic = saved.italic; ignore = saved.ignore |
There was a problem hiding this comment.
Restore Unicode fallback length when leaving groups
RTF character properties such as \ucN are scoped by the current group, but the stack only restores bold, italic, and ignored-destination state. If a nested group sets \uc0 or \uc2, that value leaks into following body text after the group closes, causing later \uN escapes to skip the wrong number of fallback characters and either duplicate fallback text or drop real characters. Include ucSkip in the saved group state and restore it on }.
Useful? React with 👍 / 👎.
- Group-scope \ucN: save/restore ucSkip on the group stack so a nested \ucN doesn't leak into later body text (Gemini; Codex). - Unit-aware \uN fallback skip: skip whole fallback units (\'hh, control words, control symbols, literals) instead of stopping at the first backslash, so an escaped fallback like 舗\'92 isn't re-parsed and duplicated (Gemini; Codex). - Decode \'hh via Windows-1252 (the \ansi default) rather than raw Latin-1, so 0x80-0x9F become smart quotes/dashes instead of C1 control characters (Codex). - Combine \uN UTF-16 surrogate pairs so astral characters (e.g. emoji) survive instead of being dropped at a failed scalar init (Codex). - Don't treat \par/\line/\row/escaped-newline as body breaks while inside a skipped destination (footnote/header/pict), so inline ignored content doesn't split a paragraph (Codex). Adds tests for the fallback-no-duplicate, Windows-1252, surrogate-pair, and ignored-destination-paragraph cases. https://claude.ai/code/session_01VPiUoXjbN1KLgYXsSE2cdP
|
Addressed the @gemini-code-assist and @chatgpt-codex-connector reviews in
Added tests covering the no-duplicate fallback, Windows-1252 punctuation, surrogate-pair (U+1F600), and ignored-destination-paragraph cases (alongside the existing bold/italic/paragraph test). Generated by Claude Code |
|
To use Codex here, create an environment for this repo. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 84c1adc541
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| guard let rtf = String(data: data, encoding: .isoLatin1), !rtf.isEmpty else { | ||
| throw ConverterError.decodingFailed | ||
| } | ||
| let markdown = Self.markdown(fromRTF: rtf) |
There was a problem hiding this comment.
Validate the RTF header before parsing
When detection classifies an input as .rtf from the filename/MIME hint rather than the {"\\rtf magic bytes, this call still parses any decodable text as body content because markdown(fromRTF:) appends ordinary characters by default. A corrupt or mislabeled notes.rtf containing not rtf will therefore succeed and emit that text instead of surfacing a real document failure, which bypasses the strict document-hint failure behavior used elsewhere; check the decoded stream starts with an RTF header before converting.
Useful? React with 👍 / 👎.
| /// quotes/dashes rather than C1 control characters. Falls back to Latin-1 for | ||
| /// the few CP1252-undefined bytes. | ||
| private static func decodeANSIByte(_ byte: UInt8) -> String { | ||
| if let decoded = String(bytes: [byte], encoding: .windowsCP1252), !decoded.isEmpty { |
There was a problem hiding this comment.
For RTFs that declare a different ANSI code page, such as \ansicpg1251 or \ansicpg932, escaped bytes in \'hh are still decoded as Windows-1252 here because the parser never records the ansicpg value. Those documents now convert successfully but with mojibake for escaped non-ASCII text; the fresh evidence in this revision is that decodeANSIByte hard-codes .windowsCP1252 for every hex escape.
Useful? React with 👍 / 👎.
| default: | ||
| if Self.ignoredDestinations.contains(word) { ignore = true } | ||
| // All other control words carry no body text. |
There was a problem hiding this comment.
Preserve RTF special-character control words
RTF writers can emit printable characters as control words such as \emdash, \endash, \bullet, or \lquote/\rquote instead of using \uN or \'hh. This default treats every unrecognized control word as non-text, so an otherwise valid input like A\emdash B converts to A B, silently dropping punctuation and bullets from the document; add explicit text mappings for these character controls before discarding unknown words.
Useful? React with 👍 / 👎.
| default: | ||
| if Self.ignoredDestinations.contains(word) { ignore = true } | ||
| // All other control words carry no body text. |
There was a problem hiding this comment.
Skip binary payloads before parsing braces
When an ignored destination such as \pict or \object contains \binN data, those next N bytes are arbitrary and can include } or \. This branch currently treats \bin as an unknown control word and then resumes normal parsing, so a } byte inside image data can prematurely pop the group, clear ignore, and leak the remaining binary bytes into the converted Markdown; handle bin by advancing over its declared byte count before interpreting any more RTF syntax.
Useful? React with 👍 / 👎.
- Validate the RTF header: when .rtf is inferred from a filename/MIME hint (not the magic bytes), reject input that doesn't start with the RTF signature instead of emitting its raw text -- matching the strict-failure rule used for other hint-detected document types. - Honor the declared code page: track \ansicpgN and decode hex escapes with that Windows code page (via CFStringConvertWindowsCodepageToEncoding), so e.g. \ansicpg1251 Cyrillic no longer turns into Windows-1252 mojibake. Falls back to 1252 then Latin-1. - Map special-character control words (\emdash, \endash, \bullet, \lquote, \rquote, \ldblquote, \rdblquote, \emspace/\enspace) to their characters instead of dropping them as unknown. - Skip \binN binary payloads (in \pict/\object) by their declared byte count, so a brace or backslash inside image data can't prematurely pop the group and leak binary into the output. Adds tests for the mislabeled-header failure, special characters, the CP1251 code page, and the binary-payload brace hazard. https://claude.ai/code/session_01VPiUoXjbN1KLgYXsSE2cdP
|
Addressed @chatgpt-codex-connector's round-2 review (commit
Added tests for the mislabeled-header failure, special characters, the CP1251 code page, and the binary-payload brace hazard. Generated by Claude Code |
|
To use Codex here, create an environment for this repo. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef460a921a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| /// become the right punctuation/letters rather than C1 control characters. | ||
| /// Falls back to Windows-1252, then Latin-1, for undecodable bytes. | ||
| private static func decodeByte(_ byte: UInt8, encoding: String.Encoding) -> String { | ||
| if let decoded = String(bytes: [byte], encoding: encoding), !decoded.isEmpty { |
There was a problem hiding this comment.
Decode multibyte ANSI escapes as byte runs
For RTFs declaring multibyte code pages such as \ansicpg932, a character is often encoded as consecutive hex escapes (for example Shift-JIS bytes \'82\'a0 for あ). This decodes each \'hh in isolation, so the lead byte cannot be decoded with the selected code page and then falls back to CP1252/Latin-1, producing mojibake instead of one character. Fresh evidence versus the earlier code-page finding is that the current implementation now records ansicpg, but still passes only [byte] to String(bytes:encoding:); buffer adjacent escaped bytes for multibyte encodings before decoding.
Useful? React with 👍 / 👎.
| case "line": | ||
| appendText("\n") |
There was a problem hiding this comment.
Emit hard Markdown breaks for RTF line controls
When an RTF uses \line for an explicit line break inside a paragraph, this appends only a bare newline. In Markdown that is a soft break and can be rendered as a space/wrapped text, unlike the DOCX converter's w:br path which emits \n for a hard break, so addresses, poems, or other line-sensitive RTF content lose their intended breaks. Emit a Markdown hard break (or another preserved block break) for this control word.
Useful? React with 👍 / 👎.
| default: | ||
| if Self.ignoredDestinations.contains(word) { ignore = true } | ||
| // All other control words carry no body text. |
There was a problem hiding this comment.
Preserve separation at page breaks
RTF \page is a forced page break, but it currently falls into the unknown-control default after the parser has consumed the delimiter space. Inputs such as End\page Start therefore become EndStart, losing even word separation at the page boundary; treat \page as at least a paragraph or hard break instead of discarding it.
Useful? React with 👍 / 👎.
- \line now emits a Markdown hard break (" \n") instead of a bare newline (a
soft break), matching the DOCX w:br path, so line-sensitive content (addresses,
poems) keeps its breaks.
- \page is treated as a paragraph break, so "End\page Start" keeps separation
rather than collapsing to "EndStart".
Deferred (documented in source): reassembling multibyte ANSI \'hh runs for DBCS
code pages like 932 (Shift-JIS). It needs byte-run buffering for a legacy niche;
modern RTF emits non-ASCII (incl. CJK) as \uN, which is fully handled (including
surrogate pairs). Each \'hh is decoded individually with the declared code page.
Adds a test for the \line hard break and \page separation.
https://claude.ai/code/session_01VPiUoXjbN1KLgYXsSE2cdP
|
Addressed @chatgpt-codex-connector's round-3 review (commit
Added a test for the Generated by Claude Code |
|
To use Codex here, create an environment for this repo. |
Phase 5a — RTFConverter (Tier A polish)
RTF is listed as a supported type but previously threw
documentTypeNotSupported(it was deferred during the rewrite to avoid reintroducingNSAttributedString). This adds a small pure-Swift RTF→Markdown parser, so RTF now converts with noNSAttributedString, no main-thread hop, and no font-size heading guessing.Parser
\uN/\'hh).\par/\row/\sect.fonttbl/colortbl/stylesheet/info/pict/…) and ignorable (\*) destinations.\uN(with the\ucNfallback-skip),\'hh, and the common control symbols (\\ \{ \} \~ \_ \-).Wiring & tests
.specificpriority; detection already classifies RTF via the{\rtfmagic bytes.**bold**/*italic*, control tables skipped, no raw control words leak).First of the Tier A polish items (RTF → renderers → DOCX images).
Generated by Claude Code