fix: force UTF-8 for Windows shell subprocesses to fix CJK mojibake #1418

Merged
yanyihan-xiaomi merged 5 commits into main from fix/shell-utf8-encoding-windows
Jun 28, 2026

@yanyihan-xiaomi yanyihan-xiaomi commented Jun 28, 2026

Collaborator

Summary

Fixes CJK mojibake (garbled Chinese/Japanese/Korean text) in shell command output on Windows with non-UTF-8 locales (e.g. zh-CN, where the active code page is 936/GBK). Spawned PowerShell/cmd subprocesses emit output in the legacy code page, which we then decode as UTF-8, producing garbled characters.
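
To make the failure mode concrete, here is a minimal TypeScript sketch. It assumes the standard `TextDecoder` global (available in Node.js and browsers); the byte values are the GBK and UTF-8 encodings of 中文, hard-coded for illustration rather than taken from this PR.

```typescript
// Illustration only: the same two characters ("中文") as raw bytes in two encodings.
const gbkBytes = new Uint8Array([0xd6, 0xd0, 0xce, 0xc4]); // code page 936 (GBK)
const utf8Bytes = new Uint8Array([0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87]); // UTF-8

// The reader side always decodes as UTF-8, mirroring how subprocess output is read.
const decoder = new TextDecoder("utf-8");

// The GBK bytes are not valid UTF-8, so the decoder substitutes U+FFFD for them.
const garbled = decoder.decode(gbkBytes);

// The UTF-8 bytes round-trip cleanly.
const clean = decoder.decode(utf8Bytes);

console.log(JSON.stringify(garbled), JSON.stringify(clean));
```

The decoder has no way to know the producer used a different code page, which is why the fix makes the subprocess emit UTF-8 rather than trying to guess the encoding on the reading side.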

Background

Based on upstream anomalyco/opencode#31658 ("set default UTF-8 encoding for spawned subprocess on Windows"), which prepends an encoding statement to the command and adds PYTHONIOENCODING.

We investigated two earlier upstream attempts first and rejected them:

  • #14766 (SetConsoleCP(CP_UTF8) via FFI) — targets TUI keyboard input, not command output; verified ineffective for our mojibake case on the test machine.
  • #23635 ([Console]::OutputEncoding only) — partial; doesn't cover cmd or piped Python output.

What we changed (beyond upstream #31658)

Upstream only patched the single subprocess spawn path. Our codebase has two independent shell execution paths, and we fixed both:

  1. bash tool (src/tool/bash.ts) — used by the model/LLM tool calls.
  2. TUI shell mode (src/session/prompt.ts shellImpl) — the interactive path when the user presses `tab` to switch into shell mode and types commands directly. This path is completely separate from the bash tool and was missed by a naive single-location port; it has its own `invocations` table for picking shell args.

In both paths, on Windows:

  • PowerShell / pwsh: prepend `$OutputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new($false);` (sets both the pipe output encoding and the console encoding).
  • cmd: prepend `chcp 65001 >nul & `.
  • env: inject `PYTHONIOENCODING=utf-8` so piped Python output is UTF-8 (Python ignores the console code page when stdout is a pipe and falls back to ANSI/GBK).
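
The per-shell handling above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/tool/bash.ts` or `src/session/prompt.ts`: the function name `withUtf8Setup`, the `ShellInvocation` type, the trailing space after the PowerShell statement, and the basename-based shell detection are all assumptions made for the example. The prefix statements and the `PYTHONIOENCODING` variable are the ones listed in this description.

```typescript
// Hypothetical shape for the example; the real code paths differ.
interface ShellInvocation {
  command: string;
  env: Record<string, string | undefined>;
}

const POWERSHELL_PREFIX =
  "$OutputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new($false); ";
const CMD_PREFIX = "chcp 65001 >nul & ";

// Assumption: the shell is identified by its executable name.
function shellName(shellPath: string): string {
  const base = shellPath.split(/[\\/]/).pop() ?? "";
  return base.toLowerCase().replace(/\.exe$/, "");
}

function withUtf8Setup(
  platform: string,
  shellPath: string,
  command: string,
  env: Record<string, string | undefined>,
): ShellInvocation {
  // Other platforms are left untouched.
  if (platform !== "win32") return { command, env };

  const name = shellName(shellPath);
  let prefix = "";
  if (name === "powershell" || name === "pwsh") prefix = POWERSHELL_PREFIX;
  else if (name === "cmd") prefix = CMD_PREFIX;

  return {
    // The user command stays last so its exit status is the one reported.
    command: prefix + command,
    // Python falls back to the ANSI code page on a pipe unless told otherwise.
    env: { ...env, PYTHONIOENCODING: "utf-8" },
  };
}
```

Under these assumptions, `withUtf8Setup("win32", "cmd.exe", "dir", process.env)` would yield the command `chcp 65001 >nul & dir` with `PYTHONIOENCODING=utf-8` added to the environment.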

Recommended workaround before this lands

Until this fix is merged and released, the most reliable workaround is to enable Windows' system-wide UTF-8 support, which sets the active code page (ACP) to 65001 for all programs:

Settings → Time & language → Language & region → Administrative language settings (Change system locale) → check "Beta: Use Unicode UTF-8 for worldwide language support" → reboot.

This makes the ACP UTF-8 globally, so subprocesses no longer inherit GBK and the mojibake disappears without any build change. Note it is a system-wide Beta toggle and may affect other legacy (non-Unicode) apps, so treat it as a workaround rather than a permanent requirement.

This PR also adds a "Windows: garbled CJK output" note documenting this workaround to both the English and Chinese README.

Verification

Built a `windows-x64` binary and tested on Windows 11 + Windows Terminal (zh-CN, code page 936):

  • Model-invoked `ls` of a directory containing files with Chinese names now renders correctly (previously garbled).
  • TUI shell-mode `ls` is fixed by the second change, in `prompt.ts`.

Before

(screenshot: shell output before the fix)

After

(screenshot: shell output after the fix)

On non-UTF-8 Windows locales (e.g. zh-CN ACP 936/GBK), shell subprocesses
emit output in the legacy code page which we decode as UTF-8, producing
mojibake for CJK filenames/output. Fix both execution paths:

- bash tool (src/tool/bash.ts): prepend an $OutputEncoding/[Console]::
  OutputEncoding UTF-8 statement for PowerShell, chcp 65001 for cmd.
- TUI shell mode (src/session/prompt.ts shellImpl): same prefixes for
  powershell/pwsh/cmd invocations.
- Add PYTHONIOENCODING=utf-8 to the shell env on Windows so piped Python
  output is UTF-8 (Python ignores the code page when stdout is a pipe).

Based on upstream anomalyco/opencode#31658.
@yanyihan-xiaomi yanyihan-xiaomi force-pushed the fix/shell-utf8-encoding-windows branch from f1237a3 to c8a3598 Compare June 28, 2026 06:42
Document the system-wide UTF-8 (Beta) toggle as a workaround for garbled
CJK shell output on non-UTF-8 Windows locales, for users on older versions
or tools not yet special-cased. Added to both English and Chinese README.
@qiaozongming

Collaborator

Reviewed with a focus on the command-concatenation safety. TL;DR: no new injection surface, and the most dangerous concat pitfall (exit-code clobbering) is handled correctly. One minor robustness nit worth fixing.

No new injection risk. input.command is already passed verbatim to the shell by design. Prefixing an encoding setup in front of it grants no capability the caller didn't already have, and there's no "break out of the prefix" direction since the user command is the trailing part. Neutral from a security standpoint.

Exit-code semantics are correct (the easy thing to get wrong here):

  • cmd uses & (not &&): chcp 65001 >nul & <cmd> returns the exit code of the last command (the user's), and chcp failing won't block it. ✅
  • PowerShell uses ;, so $LASTEXITCODE/$? reflect the trailing user statement. ✅
  • Good call choosing & over &&: && would have let chcp's status leak into the result.

Minor: >nul only redirects stdout, not stderr. packages/opencode/src/tool/bash.ts:327 — if chcp fails (restricted/locked-down shell), its error goes to stderr and pollutes command output. Suggest:

chcp 65001 >nul 2>nul & ${command}

Notes (not blocking):

  • The PowerShell prefix globally overrides $OutputEncoding/[Console]::OutputEncoding. That's the intended fix; using UTF8Encoding($false) (no BOM) is the right choice to avoid leading garbage bytes. Just flagging it as a global-state side effect for awareness.
  • The UTF-8 prefix string is duplicated as a literal in both session/prompt.ts and tool/bash.ts. Consider extracting a shared constant so the two paths can't drift.

Overall LGTM on the concat form; only the 2>nul is worth a quick follow-up.

…stderr

Address PR review: extract POWERSHELL_UTF8_PREFIX and CMD_UTF8_PREFIX into
src/shell/shell.ts so the bash tool and TUI shell-mode paths can't drift, and
redirect both stdout and stderr (>nul 2>nul) so a chcp failure in a restricted
shell never pollutes command output.
@yanyihan-xiaomi

Collaborator Author

Thanks for the review! Both addressed in b2dd9fc:

  • 2>nul: cmd prefix is now chcp 65001 >nul 2>nul & so a chcp failure in a restricted shell no longer pollutes command output.
  • Shared constant: extracted POWERSHELL_UTF8_PREFIX and CMD_UTF8_PREFIX into src/shell/shell.ts; both the bash tool and the TUI shell-mode path now reference them, so the two can't drift (which was exactly the root cause of the earlier miss where the shell-mode path was overlooked).

The global $OutputEncoding/[Console]::OutputEncoding override is intended; keeping the BOM-less UTF8Encoding($false) as you noted.
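
For readers following along, the shared-constant arrangement might look roughly like the sketch below. The constant names and the cmd prefix string come from this thread; the exact PowerShell string, the `utf8PrefixFor` helper, and the two caller functions are hypothetical stand-ins for the bash tool and TUI shell-mode paths, not the code in b2dd9fc.

```typescript
// Sketch of a shared module such as src/shell/shell.ts (contents assumed).
const POWERSHELL_UTF8_PREFIX =
  "$OutputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new($false); ";
// chcp's stdout and stderr are both discarded; "&" runs the user command regardless.
const CMD_UTF8_PREFIX = "chcp 65001 >nul 2>nul & ";

type WindowsShell = "powershell" | "pwsh" | "cmd" | "other";

// Hypothetical helper: a single lookup used by every caller.
function utf8PrefixFor(shell: WindowsShell): string {
  if (shell === "powershell" || shell === "pwsh") return POWERSHELL_UTF8_PREFIX;
  if (shell === "cmd") return CMD_UTF8_PREFIX;
  return "";
}

// Hypothetical stand-ins for the two execution paths.
function bashToolCommand(shell: WindowsShell, command: string): string {
  return utf8PrefixFor(shell) + command;
}

function shellModeCommand(shell: WindowsShell, command: string): string {
  return utf8PrefixFor(shell) + command;
}
```

Because both callers go through one lookup, a later change to either prefix reaches both paths at once.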

@yanyihan-xiaomi yanyihan-xiaomi merged commit aae0243 into main Jun 28, 2026
6 checks passed
@yanyihan-xiaomi yanyihan-xiaomi deleted the fix/shell-utf8-encoding-windows branch June 28, 2026 07:22
gabrieljamh pushed a commit to gabrieljamh/Aria-Chat that referenced this pull request Jun 28, 2026
…iaomiMiMo#1418)

* fix: force UTF-8 for Windows shell subprocesses to fix CJK mojibake

* docs: add Windows CJK mojibake workaround to README

* docs: clarify MiMoCode version and tone in Windows CJK note

* docs: refine wording of Windows CJK note

* refactor: extract shared Windows UTF-8 shell prefixes; redirect chcp stderr
