Skip to content

mcp-publisher mangles non-ASCII description bytes on Windows (em-dash → \udc94 lone surrogate) #1306

@Liyue2341

Description

@Liyue2341

Summary

The mcp-publisher CLI on Windows mangles non-ASCII bytes in the description field of server.json somewhere between reading the local file and sending the publish request to the registry. The result: published entries contain invalid UTF-16 surrogates and broken text that JSON parsers flag as malformed.

Environment

  • OS: Windows 11
  • Shell: Git Bash (MSYS2) on Windows
  • System locale: zh-CN (default Windows codepage CP-936 / GBK)
  • mcp-publisher binary: mcp-publisher_windows_amd64.tar.gz from the official release
  • Auth method tried: both DNS-based and GitHub OIDC — same outcome

Reproducer

  1. mcp-publisher init and edit the description to include an em-dash (U+2014, UTF-8 E2 80 94):

    {
      "name": "org.example/my-server",
      "description": "Turn rough requests into rigorously structured prompts — for any coding agent.",
      "version": "0.1.3",
      "...": "..."
    }
  2. Verify the local file is correct UTF-8:

    $ xxd server.json | grep e28094
    ... e2 80 94 ...   # em-dash present as expected
    
  3. mcp-publisher login ... and mcp-publisher publish. The publish step reports success.

  4. Query the registry:

    curl 'https://registry.modelcontextprotocol.io/v0.1/servers?search=<name>'
    
  5. The returned description no longer contains . It contains a mangled sequence with an invalid lone low surrogate.

Concrete evidence — live registry entry

A real broken entry is still queryable (we had to bump version to ship a fix):

$ curl -s 'https://registry.modelcontextprotocol.io/v0.1/servers?search=org.sciscale/agentforge-mcp' \
  | python -m json.tool

Version 0.1.3 returns:

"description": "Turn rough requests into rigorously structured prompts 鈥\udc94 for any coding agent."

Version 0.1.4 (em-dash removed as workaround) is clean:

"description": "Turn rough requests into rigorously structured prompts, for any coding agent."

Note the mangling pattern: the three UTF-8 bytes E2 80 94 came out as 鈥\udc94:

  • E2 80 decoded as CP-936 (GBK) (a valid CJK character, 硥)
  • The orphan 94 byte → \udc94, a lone low surrogate (invalid UTF-16; Go's typical way of round-tripping an unrepresentable byte via the "surrogateescape"-like scheme)

This is the unmistakable signature of UTF-8 bytes being reinterpreted as the Windows ANSI / OEM codepage somewhere in the pipeline, then the orphan byte being escaped into a surrogate.

Likely root cause

Something in the publish path is going through a Windows codepage API instead of treating the JSON body as opaque UTF-8 bytes. Candidates worth auditing in mcp-publisher:

  • syscall.UTF16FromString / UTF16ToString on anything that touches the description string
  • File reading that goes through a layer doing windows.MultiByteToWideChar(CP_ACP, ...) instead of os.ReadFile
  • os.Args decoding on Windows (less likely here since the field is in a file)
  • Any logging / printing step that re-serializes the body and then sends the re-serialized version on the wire

For context, we observed the same exact \udc94-style mangling with gh api and curl when posting non-ASCII bodies from Git Bash on this machine. Switching the same request to Python's urllib.request (which uses Windows-native sockets and doesn't pass the body through the MSYS / Git Bash layer) preserved bytes correctly. So the issue may not even be in mcp-publisher's Go code per se — it could be that the binary inherits a mangled environment from Git Bash / MSYS. Either way, the publisher should be robust to this.

Impact

  • Any Windows publisher with non-ASCII in description (em-dash, en-dash, smart quotes, emoji, accented characters, CJK text) will silently ship mangled metadata to the official registry.
  • Directory listings on registry.modelcontextprotocol.io and downstream consumers display broken text.
  • Strict JSON parsers refuse to load the response at all (Python's json.dumps rejects lone surrogates by default).

Compounding problem

The registry refuses to republish the same version, so a publisher who hits this can't simply re-run publish after spotting the bug — they have to bump the package version (and re-publish to npm / PyPI / wherever) just to fix a cosmetic display issue. In our case we had to bump npm 0.1.3 → 0.1.4 with no actual code change other than removing the em-dash from one string.

Suggested fix direction

  1. Validate before sending. Right before transmitting the publish request body, run utf8.Valid(jsonBytes) and a check that no 0xDC00..0xDFFF lone surrogates appear in any string field. Fail loudly with a clear message naming the field if either check fails — better to refuse to publish than to silently ship garbage.
  2. Audit the read path. Confirm server.json is read with os.ReadFile (which gives you raw bytes) and that the JSON decoder never round-trips through syscall.UTF16FromString or any CP_ACP-aware Windows API.
  3. Consider sniffing the environment. If mcp-publisher detects it's running under Git Bash / MSYS with a non-UTF-8 ANSI codepage, it could print a one-line warning at startup pointing at this issue.

Offer

Happy to test patches against this exact repro — we have a Windows 11 + Git Bash + zh-CN setup that reliably reproduces the bug.

Affected package

  • npm package: agentforge-mcp
  • Registry name: org.sciscale/agentforge-mcp
  • Broken version: 0.1.3 (still queryable — see evidence above)
  • Workaround version: 0.1.4 (em-dash removed; not a real fix, just avoidance)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions