Incorrect Character and Word Counting for 4-Byte UTF-8 (Surrogate-Pair) Characters

Steps to reproduce

  1. Create a new vault.
  2. Create a new note.
  3. Paste or start typing text containing Unicode codepoints greater than U+FFFF, such as π‘žπ‘¦π‘• π‘¦π‘Ÿ 𐑩 π‘‘π‘§π‘•π‘‘πŸ˜€ (β€œthis is a testπŸ˜€β€ with the Latin letters replaced with Shavian letters)
  4. Check the character and word counts.

Did you follow the troubleshooting guide? Y

Expected result

The character/word counter shows 4 words, 14 characters

Actual result

The character/word counter shows 0 words, 25 characters

Environment

SYSTEM INFO:
Operating system: android 16 (Google Pixel 8)
Webview version: 141.0.7390.111
Obsidian version: 1.9.14 (242)
API version: v1.9.14
Login status: not logged in
Language: en
Live preview: on
Base theme: adapt to system
Community theme: none
Snippets enabled: 0
Restricted mode: on

RECOMMENDATIONS:
none


Additional information

It appears that all of the Shavian letters and the emoji are being double-counted: there are 11 of them in that snippet plus 3 whitespace characters, and 2 * 11 + 3 = 25. I’ve also tested Hiragana, Hangul, Cyrillic, and the Greek alphabet, all of which sit in the Basic Multilingual Plane; none of them have this problem.
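As a rough illustration of where the 25 could come from (a plain TypeScript sketch, not a claim about Obsidian’s actual counting code): `String.prototype.length` counts UTF-16 code units, while iterating the string counts code points.

```ts
const text = "π‘žπ‘¦π‘• π‘¦π‘Ÿ 𐑩 π‘‘π‘§π‘•π‘‘πŸ˜€";

// .length counts UTF-16 code units: the 11 Shavian letters / emoji take
// two code units each, plus 3 spaces = 2 * 11 + 3 = 25.
console.log(text.length); // 25

// Spreading the string iterates by code point, which matches what a
// reader would call β€œcharacters” here.
console.log([...text].length); // 14
```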

There are also intermittent errors when editing in the middle of body text that is made up almost entirely of these codepoints: a character elsewhere in the buffer becomes an invalid codepoint, usually one that sits next to a β€œnormal” character.
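That sounds consistent with a surrogate pair being split somewhere. Purely as an illustration (not a claim about how Obsidian edits the buffer), cutting a string at a UTF-16 code-unit index that falls inside one of these characters leaves a lone surrogate, which is not a valid character:

```ts
const letter = "π‘ž"; // U+10450, a single Shavian code point

console.log(letter.length); // 2 β€” stored as the surrogate pair D801 DC50

// Slicing at code-unit index 1 lands in the middle of the pair and
// leaves a lone high surrogate behind.
const broken = letter.slice(0, 1);
console.log(broken.codePointAt(0)?.toString(16)); // "d801" β€” an unpaired surrogate
```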

The Shavian block is U+10450–U+1047F, and the emoji (Emoticons) block is U+1F600–U+1F64F.
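Both of these blocks sit above U+FFFF, i.e. outside the Basic Multilingual Plane, while the scripts that count correctly (Hiragana, Hangul, Cyrillic, Greek) are all inside it. A throwaway check, just for illustration:

```ts
// True for any code point outside the Basic Multilingual Plane,
// i.e. anything that needs a surrogate pair in UTF-16.
const isSupplementary = (ch: string): boolean =>
  (ch.codePointAt(0) ?? 0) > 0xffff;

console.log(isSupplementary("π‘ž"));  // true  (U+10450, Shavian)
console.log(isSupplementary("😀")); // true  (U+1F600, Emoticons)
console.log(isSupplementary("γ‚")); // false (U+3042, Hiragana β€” counts correctly)
```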

For word counting, Shavian works like English: a normal English word counter should be fine as long as it treats Shavian letters the same way it treats Latin letters.
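For what it’s worth, a Unicode-property-aware counter handles Shavian the same way it handles Latin text, because Shavian letters carry the Letter property and the `u` regex flag works on code points rather than code units. This is only a sketch of one possible approach, not a description of Obsidian’s current counter:

```ts
// Count words as runs of Unicode letters/digits; the `u` flag makes the
// regex code-point aware, so surrogate pairs are never split.
function countWords(text: string): number {
  return text.match(/[\p{L}\p{N}]+/gu)?.length ?? 0;
}

console.log(countWords("this is a test"));  // 4
console.log(countWords("π‘žπ‘¦π‘• π‘¦π‘Ÿ 𐑩 π‘‘π‘§π‘•π‘‘")); // 4 β€” same as the Latin spelling
```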

I’m not sure how emoji should be handled for word counting.

While most of these blocks probably aren’t in common use, I suspect the same character-counting and editing problems affect every other block above U+FFFF (everything that UTF-8 encodes in 4 bytes and UTF-16 encodes as a surrogate pair), but I’ve only tested the two blocks above.

Yes, this is because JavaScript strings are UTF-16 internally: the UTF-8 text is decoded to UTF-16 to be handled, and naive length and index operations count UTF-16 code units rather than characters.

Emoji (and the Shavian letters) are each converted to two UTF-16 code units; that’s where the double counting comes from.
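Counting by code points, or using Intl.Segmenter, sidesteps this. A rough sketch assuming only standard web APIs in the webview, not anything about Obsidian’s internals:

```ts
const text = "π‘žπ‘¦π‘• π‘¦π‘Ÿ 𐑩 π‘‘π‘§π‘•π‘‘πŸ˜€";

// Character count by code point instead of UTF-16 code unit.
const charCount = [...text].length; // 14 instead of 25

// Word count via the built-in segmenter, which is surrogate-pair aware;
// only segments the segmenter considers word-like (letters/digits) count.
const segmenter = new Intl.Segmenter(undefined, { granularity: "word" });
const wordCount = [...segmenter.segment(text)].filter((s) => s.isWordLike).length;

console.log(charCount, wordCount);
```

Intl.Segmenter has been in Chromium since version 87, so it should be available in the webview version reported above.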

Is there a plan to do anything to fix this? Is it possible?