Replace utf8 with modified utf8. #197

Open
xymb-endcrystalme wants to merge 1 commit into vberlier:main from xymb-endcrystalme:main

@xymb-endcrystalme
nbtlib is sadly broken. Minecraft uses modified UTF-8, while nbtlib uses normal UTF-8. Reading an NBT file with UTF-8 characters and saving it again screws up texts.

I admit, I did just ask o3 to write this code. However, a test with 70 region files started passing, so I suspect it's at least somewhat correct.

My test was "open a region file, open and save each chunk with nbtlib, verify the contents (chunk bytes) are the same". nbtlib was making errors in some chunks that contained books with funny characters; after this patch it stopped. So I'll start internally using this for now.
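A round-trip check of this kind could be sketched roughly as follows. This is not the test that was actually run, only an illustration under stated assumptions: the names `iter_chunk_payloads` and `find_roundtrip_mismatches` and the caller-supplied `roundtrip` callable (parse NBT bytes, serialize them again) are hypothetical, the region data is assumed to follow the usual Anvil layout (a 4 KiB location table, then chunks prefixed with a 4-byte big-endian length and a 1-byte compression type), only zlib-compressed chunks are handled, and the comparison is done on uncompressed NBT bytes.

```python
import struct
import zlib
from typing import Callable, Iterator, List, Tuple

SECTOR = 4096  # region files are organised in 4 KiB sectors


def iter_chunk_payloads(region: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (chunk index, uncompressed NBT bytes) for zlib-compressed chunks."""
    for index in range(1024):
        # Each location entry: 3 bytes sector offset, 1 byte sector count.
        (entry,) = struct.unpack_from(">I", region, index * 4)
        offset, count = entry >> 8, entry & 0xFF
        if offset == 0 and count == 0:
            continue  # chunk not present in this region
        start = offset * SECTOR
        length, compression = struct.unpack_from(">IB", region, start)
        if compression != 2:
            continue  # assumption: this sketch skips anything that is not zlib
        # `length` counts the compression byte plus the compressed data.
        payload = region[start + 5 : start + 4 + length]
        yield index, zlib.decompress(payload)


def find_roundtrip_mismatches(
    region: bytes, roundtrip: Callable[[bytes], bytes]
) -> List[int]:
    """Return indices of chunks whose NBT bytes change after a parse/serialize cycle."""
    return [i for i, raw in iter_chunk_payloads(region) if roundtrip(raw) != raw]
```

Here `roundtrip` would wrap whatever parse-and-write calls the NBT library under test provides; an empty result means every chunk survived unchanged.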

@Happy2018new
Happy2018new commented Jun 25, 2025

Maybe this is a problem. However, this project may also be used by some Bedrock projects, and unlike Minecraft Java Edition, Bedrock seems to use standard UTF-8 encoding.

So the question is: how would projects based on Minecraft Bedrock keep working? (This PR looks like it would break Bedrock support.)

@Happy2018new
Happy2018new commented Jun 25, 2025

I renamed an item to 𡧛 (its ord is 137691) using the anvil in Minecraft Bedrock Edition and exported it as a .mcstructure file with a Structure Block. The result shows that Minecraft Bedrock Edition uses standard UTF-8.

test_mcstructure.zip

b'\x08\x04\x00\x4e\x61\x6d\x65\x04\x00\xf0\xa1\xa7\x9b'

\x08 The ID of TAG_String (8).
\x04\x00 The length (4) of the key name of the TAG_String (the key is 'Name'), encoded in little endian.
\x4e\x61\x6d\x65 The key name of the TAG_String: b'\x4e\x61\x6d\x65'.decode() = 'Name'.
\x04\x00 The length (4) of the value of this key ('Name'), encoded in little endian.
\xf0\xa1\xa7\x9b The value of this key ('Name'): b'\xf0\xa1\xa7\x9b'.decode() = '𡧛'.
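The breakdown above can be reproduced with a few lines of Python. This is only a minimal sketch for this one byte dump, not a general NBT parser: `read_named_string_tag` is a hypothetical helper name, and it assumes a single named TAG_String with little-endian unsigned 16-bit length prefixes and standard UTF-8 text, as in the exported data.

```python
import struct
from typing import Tuple


def read_named_string_tag(data: bytes) -> Tuple[str, str]:
    """Decode one named TAG_String from little-endian NBT bytes (sketch only)."""
    # 1 byte tag ID followed by a 2-byte little-endian name length.
    tag_id, name_len = struct.unpack_from("<BH", data, 0)
    if tag_id != 8:
        raise ValueError(f"expected TAG_String (8), got {tag_id}")
    offset = 3
    name = data[offset : offset + name_len].decode("utf-8")
    offset += name_len
    # 2-byte little-endian length of the string value, counted in bytes.
    (value_len,) = struct.unpack_from("<H", data, offset)
    offset += 2
    value = data[offset : offset + value_len].decode("utf-8")
    return name, value


dump = b"\x08\x04\x00\x4e\x61\x6d\x65\x04\x00\xf0\xa1\xa7\x9b"
print(read_named_string_tag(dump))  # ('Name', '𡧛')
```

Note that the length prefix counts encoded bytes, not characters: the single character 𡧛 takes 4 bytes in standard UTF-8.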

However, with your code, the encoded result is not \xf0\xa1\xa7\x9b but _modified_utf8_encode('𡧛') = b'\xed\xa1\x86\xed\xb7\x9b'.
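The two byte sequences differ because modified UTF-8 writes a character outside the Basic Multilingual Plane as a UTF-16 surrogate pair, with each surrogate encoded as 3 bytes, and writes U+0000 as the two bytes C0 80. The sketch below illustrates that rule; `to_modified_utf8` is a hypothetical name used here for illustration and is not the `_modified_utf8_encode` function from the PR.

```python
def to_modified_utf8(text: str) -> bytes:
    """Encode text as Java-style modified UTF-8 (illustrative sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Supplementary character: split into a UTF-16 surrogate pair.
            cp -= 0x10000
            units = [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]
        else:
            units = [cp]
        for u in units:
            if 0x01 <= u <= 0x7F:
                out.append(u)  # plain ASCII, one byte
            elif u <= 0x7FF:
                # Two-byte form; also covers U+0000, which becomes C0 80.
                out += bytes((0xC0 | (u >> 6), 0x80 | (u & 0x3F)))
            else:
                # Three-byte form, used for each surrogate as well.
                out += bytes(
                    (0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F))
                )
    return bytes(out)


print(to_modified_utf8("𡧛"))  # b'\xed\xa1\x86\xed\xb7\x9b'
print("𡧛".encode("utf-8"))    # b'\xf0\xa1\xa7\x9b'
```

For text without NUL or supplementary characters, both encodings produce identical bytes, which is why only some strings are affected.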

@xymb-endcrystalme
Author

I wasn't aware that NBT isn't a "standard" and that Bedrock used normal UTF-8. 🤣

Yea, my patch would 100% break that. To do it properly, nbtlib will probably need some kind of switch that tells it whether it's reading Bedrock or Java NBT.
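One way such a switch might look is sketched below. Nothing here is nbtlib's actual API: `encode_nbt_string`, `decode_nbt_string`, and the `java` keyword are hypothetical names, and the Java branch leans on Python's `surrogatepass` error handler as a compact way to get modified UTF-8.

```python
def encode_nbt_string(text: str, *, java: bool) -> bytes:
    """Encode string contents for Java (modified UTF-8) or Bedrock (UTF-8) NBT."""
    if not java:
        return text.encode("utf-8")
    # Go through UTF-16 code units so supplementary characters become surrogates.
    units = text.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i : i + 2], "big")
        if unit == 0:
            out += b"\xc0\x80"  # modified UTF-8 never emits a raw NUL byte
        else:
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def decode_nbt_string(raw: bytes, *, java: bool) -> str:
    """Inverse of encode_nbt_string for well-formed input."""
    if not java:
        return raw.decode("utf-8")
    # Restore NUL, decode surrogates individually, then merge pairs via UTF-16.
    text = raw.replace(b"\xc0\x80", b"\x00").decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


print(encode_nbt_string("𡧛", java=True))   # b'\xed\xa1\x86\xed\xb7\x9b'
print(encode_nbt_string("𡧛", java=False))  # b'\xf0\xa1\xa7\x9b'
```

Making the flag keyword-only forces callers to state which edition they are targeting, so neither format silently becomes the default.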
