Steps to reproduce
- Save the two attached files locally: a 7950-byte and an 8000-byte markdown file, both valid UTF-8, both containing CJK characters mixed with ASCII. (Any markdown file with CJK whose content= payload size is ≥ ~8000 bytes will reproduce; the attached pair is a minimal threshold pair.)
- Run, on the 7950-byte file:
obsidian create path="test.md" content="$(cat cli-utf8-repro-7950.md)" overwrite silent
obsidian read path="test.md" | grep "��"
No matches.
- Run, on the 8000-byte file:
obsidian create path="test.md" content="$(cat cli-utf8-repro-8000.md)" overwrite silent
obsidian read path="test.md" | grep "��"
One or more matches: a multi-byte UTF-8 character has been silently replaced with U+FFFD (��).
- Repeat step 3 several times. Same input produces the same corrupted positions every time.
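For anyone without the attachments, a payload of an exact byte size can be generated with a small Node script. This is only a sketch and not the attached files: makePayload, the repeated text line, and the output file names are made up for illustration, and whether a given size actually corrupts depends on which bytes end up at the chunk boundary.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: returns a string whose UTF-8 encoding is exactly
// targetBytes long, built from a repeating ASCII + CJK line.
function makePayload(targetBytes) {
  const unit = 'mixed ASCII and CJK: 這是一段中文測試內容\n';
  let text = '';
  // Add whole lines while they still fit in the byte budget.
  while (Buffer.byteLength(text + unit, 'utf8') <= targetBytes) {
    text += unit;
  }
  // Pad with single-byte ASCII so the total lands exactly on targetBytes.
  text += 'x'.repeat(targetBytes - Buffer.byteLength(text, 'utf8'));
  return text;
}

// Write the two sizes from the steps above into the OS temp directory.
for (const size of [7950, 8000]) {
  const file = path.join(os.tmpdir(), `cjk-payload-${size}.md`);
  fs.writeFileSync(file, makePayload(size), 'utf8');
  console.log(file, fs.statSync(file).size);
}
```

The generated files can then be passed to content= in place of the attached pair.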
Did you follow the troubleshooting guide? [Y/N]
Yes.
- Confirmed on latest Obsidian (1.12.7) and the bundled CLI (1.12.7).
- Searched the bug-reports forum: closest hits are the now-fixed hang #111035, the escape-character threads #113772 / #113117, and the filename-dot truncation #113711. None describe silent UTF-8 corruption with a sharp ~8 KB threshold.
- Sandbox vault / restricted mode / themes / plugins are not relevant: the corruption happens inside the CLI argv → vault write pipeline, before any plugin or renderer is involved. Reproduces against any vault, including a fresh one.
Expected result
Vault file content equals the input file content, byte-for-byte. No �� injected.
Actual result
One or more characters in the vault file are replaced with �� (U+FFFD). The CLI exits with success — corruption is silent, no error or warning. For fixed input content the corrupted byte position is deterministic; varying the content size shifts which character is hit.
This is not the previously fixed hang from #111035 (that was fixed in 1.12.2; my CLI is 1.12.7 and never hangs). It is a different multi-byte issue that surfaces only once the content= payload crosses a sharp ~8 KB boundary.
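To check that the corrupted positions are the same from run to run, the byte offsets of every U+FFFD in the written file can be listed and compared. A minimal sketch, assuming the vault file has been read into a Buffer; replacementOffsets is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the byte offset of every U+FFFD
// (encoded in UTF-8 as EF BF BD) found in buf.
function replacementOffsets(buf) {
  const marker = Buffer.from('\uFFFD', 'utf8');
  const offsets = [];
  let at = buf.indexOf(marker);
  while (at !== -1) {
    offsets.push(at);
    at = buf.indexOf(marker, at + marker.length);
  }
  return offsets;
}

// In-memory example; for a real check, pass fs.readFileSync(<vault file>).
const sample = Buffer.from('ab\uFFFDcd\uFFFD', 'utf8');
console.log(replacementOffsets(sample));
```

Identical offset lists across repeated writes of the same input would match the deterministic behavior described above.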
Environment
- Obsidian: 1.12.7
- obsidian CLI: 1.12.7 (installer 1.12.7)
- OS: macOS Darwin 24.6.0, x86_64
- Shell: zsh, locale LANG=LC_ALL=en_US.UTF-8
Additional information
Threshold bisect (same input prefix, varying length)
| Content size (bytes) | Corruption count over 3 trials |
| --- | --- |
| 6000 | 0 / 0 / 0 |
| 7400 | 0 / 0 / 0 |
| 7800 | 0 / 0 / 0 |
| 7950 | 0 / 0 / 0 |
| 8000 | 1 / 1 / 1 |
| 8050 | 1 / 1 / 1 |
| 9000 | 1 / 1 / 1 |
| ~22000 (full doc) | 2 / 2 / 2 (always at the same two character positions) |
The sharp cliff near 8 KB strongly suggests an IPC or stdin read buffer of 8192 bytes that does byte-chunk → string conversion before re-assembly, instead of buffering bytes until a complete UTF-8 sequence is available. When a chunk boundary lands inside a 2-, 3-, or 4-byte UTF-8 sequence, the partial bytes are decoded independently and replaced with U+FFFD.
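The suspected mechanism is easy to reproduce in plain Node, independent of Obsidian. The sketch below only mimics the pattern described above and is not Obsidian's code; the 8192-byte chunk size is the assumption from the hypothesis, and the input is constructed so that one 3-byte CJK character straddles that boundary.

```javascript
const CHUNK = 8192; // assumed chunk size, per the hypothesis

// 8191 ASCII bytes, then one 3-byte CJK character occupying byte offsets
// 8191..8193, so the 8192 boundary falls inside it.
const original = 'a'.repeat(CHUNK - 1) + '中' + ' tail';
const bytes = Buffer.from(original, 'utf8');

// Decode each chunk on its own and concatenate the resulting strings.
let decoded = '';
for (let i = 0; i < bytes.length; i += CHUNK) {
  decoded += bytes.subarray(i, i + CHUNK).toString('utf8');
}

console.log('round-trips intact:', decoded === original);
console.log('contains U+FFFD:', decoded.includes('\uFFFD'));
```

Moving the CJK character one byte earlier or later changes whether it is split, which fits the observation that the corrupted position depends on the exact payload size.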
Workaround
Bypass the content= parameter by writing through eval + Node fs:
obsidian eval code="(async()=>{
  const fs=require('fs');
  await app.vault.adapter.write(
    'test.md',
    fs.readFileSync('/tmp/input.md','utf8')
  );
})()"
This routes the bytes through Node’s UTF-8-aware readFileSync and Obsidian’s vault adapter, skipping the CLI’s content= pipeline. Confirmed clean on the same input that triggers the corruption.
Attached repro files
The two files share the same prefix; only the trailing 50 bytes differ.
Credits
Bug noticed in real use by me (writing a long CJK note via the CLI). Bisect, threshold confirmation, and minimal repro construction were performed by an agent driving the CLI (Claude Code).
Hello. You didn’t include the output of the “Show debug info” command.
Can you please add it?
I’ll try to reproduce your findings.
I tried, but I can’t access that command.
The video will be removed later: <removed youtube link>
Never mind, Obsidian also localizes the command name; for me it is “顯示除錯訊息” (“Show debug info”).
SYSTEM INFO:
Obsidian version: 1.12.7
Installer version: 1.12.7
Operating system: Darwin Kernel Version 24.6.0: Mon Jan 19 22:00:10 PST 2026; root:xnu-11417.140.69.708.3~1/RELEASE_X86_64 24.6.0
Login status: logged in
Language: zh-TW
Catalyst license: none
Insider build toggle: off
Live preview: on
Base theme: dark
Community theme: Minimal 8.1.6
Snippets enabled: 3
Restricted mode: off
Plugins installed: 13
Plugins enabled: 13
1: Admonition v10.3.2
2: Calendar v1.5.10
3: Dataview v0.5.68
4: Homepage v4.3.1
5: Kanban v2.0.51
6: Spaced Repetition v1.13.9
7: Git v2.38.0
8: Vimrc Support v0.10.2
9: Minimal Theme Settings v8.2.1
10: Local REST API v3.4.6
11: Dialogue v1.0.2
12: Relative Line Numbers v3.0.0
13: Habit Tracker 21 v2.4.1
RECOMMENDATIONS:
Custom theme and snippets: for cosmetic issues, please first try updating your theme and disabling your snippets. If still not fixed, please try to make the issue happen in the Sandbox Vault or disable community theme and snippets.
Community plugins: for bugs, please first try updating all your plugins to latest. If still not fixed, please try to make the issue happen in the Sandbox Vault or disable community plugins.
And here is an update. Opus 4.7 is terrifying. I can’t do this kind of debugging by myself.
After more testing the trigger is not the size of content= alone — it is the total CLI argv payload (target path= + content= + framing overhead) crossing the ~8 KB chunk boundary. The same 8000-byte content writes cleanly when the destination is a short path like _tmp.md in the vault root, but reliably corrupts (3/3 trials) when the destination is 02 resources/ideas/_tmp.md. Twenty extra path characters are enough to push the same content from “always clean” to “always corrupted”. So the minimum content size that triggers the bug depends on the path you write to: maintainers attempting to reproduce should pin both a long nested path and a content payload that brings the total argv past ~8 KB. (My earlier note about the bug “disappearing after I opened DevTools” was a confound — between those runs the destination path had also shortened.)
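Why a few extra path bytes matter can be illustrated at the byte level: shifting the content by one or two bytes changes whether a fixed offset lands on the first byte of a character or in the middle of one. This is a sketch under assumptions: the real framing overhead is unknown, so the overhead values are arbitrary stand-ins, 8192 is the assumed chunk size, and splitsSequence is a hypothetical helper.

```javascript
// True when offset falls inside a multi-byte UTF-8 sequence, i.e. the byte
// at that offset is a continuation byte (bit pattern 10xxxxxx).
function splitsSequence(buf, offset) {
  return offset > 0 && offset < buf.length && (buf[offset] & 0xc0) === 0x80;
}

const CHUNK = 8192; // assumed chunk size
const content = Buffer.from('字'.repeat(3000), 'utf8'); // 9000 bytes, 3 per character

// Arbitrary stand-ins for path + framing bytes that precede the content.
const results = {};
for (const overhead of [90, 91, 92, 110]) {
  const message = Buffer.concat([Buffer.alloc(overhead, 0x20), content]);
  results[overhead] = splitsSequence(message, CHUNK);
  console.log(overhead, results[overhead]);
}
```

With real notes the CJK characters are interleaved with ASCII, so the effect is less regular, but the same principle applies: the prefix length decides which byte sits at the boundary.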
As a hypothesis worth verifying — please double-check, since this came from reading the minified obsidian.asar/main.js and I may be misreading — the relevant CLI server code on my install (Obsidian 1.12.7) appears to accumulate incoming socket chunks with r += o.toString() before searching for the framing \n that delimits the JSON header. If that reading is correct, each Buffer#toString() call would decode its chunk independently as UTF-8, so a chunk boundary that splits a multi-byte sequence would replace the partial bytes with U+FFFD before they could be rejoined with their other halves. The shape and threshold of the corruption I see is consistent with that mechanism, but I am not in a position to confirm it from outside, so please treat the suggestion as a starting point rather than a diagnosis. If it does turn out to be the cause, two standard fixes exist: accumulate as Buffer and call toString('utf8') exactly once after Buffer.concat, or feed each chunk through a string_decoder.StringDecoder('utf8'), which is designed to buffer incomplete UTF-8 sequences across chunk boundaries.
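Both suggested fixes can be sketched in a few lines of standalone Node. This is not a patch against Obsidian's code, which I have only read in minified form; decodeByConcat and decodeByStringDecoder are illustrative names, and the two-chunk input is constructed so that the split lands inside a CJK character.

```javascript
const { StringDecoder } = require('string_decoder');

// Option 1: keep raw bytes and decode exactly once after concatenation.
function decodeByConcat(chunks) {
  return Buffer.concat(chunks).toString('utf8');
}

// Option 2: stream through StringDecoder, which holds back an incomplete
// multi-byte sequence until the rest of it arrives.
function decodeByStringDecoder(chunks) {
  const decoder = new StringDecoder('utf8');
  let out = '';
  for (const chunk of chunks) out += decoder.write(chunk);
  return out + decoder.end();
}

const original = 'note: 中文 and ASCII';
const bytes = Buffer.from(original, 'utf8');
// 'note: ' is 6 bytes, so cutting at 7 separates the lead byte of the first
// CJK character from its two continuation bytes.
const chunks = [bytes.subarray(0, 7), bytes.subarray(7)];

// Per-chunk decoding, as in the suspected pattern.
const naive = chunks.map((c) => c.toString('utf8')).join('');

console.log('per-chunk toString intact:', naive === original);
console.log('concat intact:', decodeByConcat(chunks) === original);
console.log('StringDecoder intact:', decodeByStringDecoder(chunks) === original);
```

The concat approach is simpler when the whole message is needed before parsing anyway; StringDecoder suits code that must keep working with strings chunk by chunk.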
Hi @WhiteNoise,
Posting a Windows finding that looks like the same defect family with a different face.
On Windows the same chunk-boundary issue manifests as a fatal JSON.parse SyntaxError that crashes Obsidian’s main process whenever a single CLI argv element exceeds approximately 4 KB. Symptom is Uncaught Exception: SyntaxError: Unexpected token ',' ... is not valid JSON raised inside Socket.n at obsidian.asar/main.js:66:136, called from Pipe.onStreamRead — same parse site you pointed at, just a different surface symptom because what straddles the chunk boundary on Windows is JSON syntax (which JSON.parse rejects loudly) rather than a multi-byte UTF-8 codepoint (which Buffer#toString() rewrites silently to U+FFFD).
Two screenshots attached, both from a fresh empty vault on Windows 11 with Obsidian 1.12.7:
- Screenshot 1: ..."":["eval"],"tty":"fa..." is from an obsidian eval code=... call where code= crossed ~4.4 KB. The fragment ["eval"],"tty":"false" in the parser’s complaint confirms the receiver is JSON-parsing the full argv-plus-framing envelope (subcommand name + spawn-options descriptor), with the parse fault landing mid-string inside the user-supplied code= payload.
- Screenshot 2: ..."reate.md","overwrit..." is from an obsidian create path=... content=... overwrite call. Same parser site, same crash dialog, with the cut landing between the content= value and the trailing overwrite flag.
Bisect across both surfaces against a fresh vault:
| size | create.argv | create | eval.argv | eval |
| --- | --- | --- | --- | --- |
| 2 KB | 2076 | OK | 3111 | OK |
| 3 KB | 3076 | OK | 4443 | CRASH |
| 4 KB | 4076 | OK | 5779 | CRASH |
| 5 KB | 5128 | CRASH | 7245 | CRASH |
Both surfaces flip from success to crash when their largest single argv element crosses ~4 to 4.5 KB. The eval surface trips on a smaller content size only because its code= element packs a constant ~150-byte JS template plus base64 expansion. CLI exit code is 0 and stderr empty in every CRASH row — no signal back to the caller; the host JS-error dialog is the only indication.
The Windows ~4 KB vs macOS ~8 KB threshold differential is consistent with the platforms’ default IPC pipe chunk sizes, which fits your hypothesis that the server-side chunk handler is the locus. If your reading of main.js (per-chunk Buffer#toString() accumulation rather than buffer-then-decode-once) is correct, both manifestations fall out of the same defect: a chunk boundary that lands inside any framed JSON message corrupts whichever byte sequence it straddles — UTF-8 codepoints become U+FFFD silently, JSON syntax yields an uncaught SyntaxError.
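For illustration, a receiver that avoids both symptoms could buffer raw bytes and parse only complete frames. The sketch below assumes newline-delimited JSON framing, which is inferred from the description of main.js earlier in the thread and not confirmed; createFrameReader and the message shape are hypothetical.

```javascript
// Hypothetical receiver: accumulates raw bytes and parses only complete,
// newline-terminated JSON frames.
function createFrameReader(onMessage, onError) {
  let pending = Buffer.alloc(0);
  return function onData(chunk) {
    pending = Buffer.concat([pending, chunk]);
    let newline;
    // Byte 0x0a never occurs inside a multi-byte UTF-8 sequence, so
    // splitting on it at the byte level is safe.
    while ((newline = pending.indexOf(0x0a)) !== -1) {
      const frame = pending.subarray(0, newline).toString('utf8');
      pending = pending.subarray(newline + 1);
      try {
        onMessage(JSON.parse(frame));
      } catch (err) {
        onError(err); // report instead of letting the exception escape
      }
    }
  };
}

// Usage: one message delivered in two chunks, cut inside a CJK character.
const received = [];
const errors = [];
const onData = createFrameReader((m) => received.push(m), (e) => errors.push(e));
const wire = Buffer.from(JSON.stringify({ argv: ['create', 'content=中文\nline'] }) + '\n', 'utf8');
const cut = wire.indexOf(Buffer.from('中', 'utf8')) + 1;
onData(wire.subarray(0, cut));
onData(wire.subarray(cut));
console.log(received.length, errors.length, received[0].argv[1] === 'content=中文\nline');
```

Because decoding and parsing happen only after a full frame is present, a chunk boundary can fall anywhere in the message without affecting the result, and a malformed frame is reported to the error callback instead of becoming an uncaught exception.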