Support Tibetan language - don't segment using whitespace (behave like CJK)

Steps to reproduce

I mark links ([[zzzzzzz]]) in the text of pre-existing files whose UTF-8 file names are outside the ASCII (Roman) range.

Expected result

In the link popup, I expect file names starting with the typed text, e.g. (Tibetan) [[སངས, followed by longer file names that begin with the same text.

Actual result

However, with non-English UTF-8 names I get seemingly random file names rather than files starting with the typed text; e.g. (Tibetan) [[སངས་ does not bring up files such as སངས་.md and similarly named longer files.

Environment

macOS 10.15.7

  • Obsidian version:
    v0.11.5

Additional information

![Screen Shot 2021-03-22 at 1.51.09 PM|406x500](upload://u2YO74kBk7qwDOesjD956gzYek2.jpeg)

1. When the file names are fetched from the file system to build the link popup, are they handled as UTF-8, or is ASCII (a UTF-8 subset) assumed?
2. Or does the popup sort the file names using UTF-8 sorting?

If you need UTF-8 files for testing, let me know. If you have a beta build, I could also try it out here to see whether it fixes the issue.

Can you post some screenshots and screen recordings?
What’s your installer version?

I update directly from Obsidian, so I’m not sure about the installer version. Anyway, the image above shows (clearly, I think) the issue: there are files in the file system whose names start with the string typed after [[, yet they are not displayed at all, let alone as the first entries.

Do you use [ as part of the filename? We don’t support it. You should get a warning when you try to type it.

If you go in settings->about, what is the installer version?

You are going to have to be more precise than this, because I struggle to understand you. I don’t think the problem is UTF-8, because we have CJK users and they are fine with Obsidian. Now, maybe there is a problem with Tibetan, maybe not. But I need to see a list of the file names you have on your disk and what autocomplete shows. Make a simple test case and take big screenshots.

Yes, it’s odd, as any UTF-8 should work fine in that case. Note that Tibetan most likely uses two- or three-byte UTF-8 sequences rather than one byte per character.
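For reference, the byte lengths are easy to check. The snippet below is only a quick Python sketch (the sample string is the first syllable of the test file names plus the tsek, written as escapes so the code points are unambiguous). Every code point in the Tibetan block U+0F00 to U+0FFF falls in the three-byte range of UTF-8.

```python
# Tibetan letters live in the Unicode block U+0F00-U+0FFF.
text = "\u0f66\u0f44\u0f66\u0f0b"  # three letters followed by the tsek

# Show each code point together with its UTF-8 encoding.
for ch in text:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} bytes: {encoded.hex(' ')}")

# Number of UTF-8 bytes used by each character.
byte_lengths = [len(ch.encode("utf-8")) for ch in text]
print(byte_lengths)
```

So a correct UTF-8 decoder has to read three bytes per Tibetan character; code that assumed single-byte characters would garble these names completely rather than merely mis-sort them.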

I’ve uploaded a simple folder with a test case, see TestPage.md inside how to reproduce this case with that page itself.

If you need help, just let me know; I don’t expect everyone to know Tibetan. The UTF-8 codes should be valid, though: Unicode has been the standard for Tibetan for 10+ years now, and it has been very solid on the Mac platform.
Obsidian-Tibetan-Test.zip (2.4 MB)

image

This seems fine to me. I am on Windows.

My advice is to download and reinstall Obsidian from the website.
We updated CodeMirror in the latest insider build.

I don’t know. Maybe this is a Mac-specific problem.

I don’t understand. Maybe you mean that སངས་རྒྱས་ཆོས་དང་ཚོགས་ཀྱི་མཆོག་རྣམས་ལ། ། should offer autocomplete to the page སངས་རྒྱས?

Does Tibetan separate words with spaces or not?

There’s no such thing as a word separator in Tibetan or in many other Asian languages. There is a syllable separator; in Tibetan it’s called the tsek and looks like this: ་
I hope you can see that tiny raised dot. Anyway, does the parsing of md files assume anything about word separators? That could cause issues.
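To illustrate why an assumption about word separators matters, here is a small Python sketch (not Obsidian's code; the sample phrase is built from the first three syllables of the test sentence). Splitting on whitespace leaves the whole phrase as a single token, while splitting on the tsek (U+0F0B) recovers the syllables.

```python
TSEK = "\u0f0b"  # Tibetan intersyllabic mark, the tiny raised dot

# Three syllables written as escapes so the code points are unambiguous.
syllables = [
    "\u0f66\u0f44\u0f66",        # first syllable
    "\u0f62\u0f92\u0fb1\u0f66",  # second syllable
    "\u0f46\u0f7c\u0f66",        # third syllable
]
phrase = TSEK.join(syllables)

# A whitespace tokenizer sees no boundary at all: the tsek is punctuation,
# not whitespace, so the phrase stays in one piece.
whitespace_tokens = phrase.split()

# Splitting on the tsek gives the individual syllables back.
tsek_tokens = phrase.split(TSEK)

print(len(whitespace_tokens), len(tsek_tokens))
```

Any matching logic that compares whitespace-delimited words would therefore treat an entire Tibetan sentence as one long word.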

And yes, སངས་རྒྱས་ཆོས་དང་ཚོགས་ཀྱི་མཆོག་རྣམས་ལ། when marked at the beginning with [[ should give སངས་རྒྱས.md as the first entry in the popup.

Maybe it’s a CodeMirror issue specific to the Mac, but UTF-8 file names should be UTF-8 file names. Where could I download the insider build for testing, as I’m not an insider? If this is a CodeMirror issue, we need to file a GitHub bug report about it, as it could bite in other programs as well.

I know the UTF-8 file names work fine on macOS, as I can see and sort them in a terminal emulator directly from the file system I/O calls. I did my tests on an APFS file system.

On my file system here:
14:59|ksandvik ~> ls སངས་*

སངས་རྒྱས.md
སངས་རྒྱས་པ.md
སངས་རྒྱས་ཆོས.md
སངས་རྒྱས་ཉིད.md
སངས་རྒྱས་དངོས.md
སངས་རྒྱས་ཐོབ་པ.md
སངས་རྒྱས་ཐམས་ཅད.md
སངས་རྒྱས་བསྟན་པ.md

So the output from the file system looks legit with no spaces between or odd characters.
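The behavior I expect from the popup can be sketched as plain prefix matching on the raw file names. This is a minimal Python illustration, not Obsidian's implementation; the function name `suggest` and the "shortest match first" ordering are my own assumptions, and the names come from the listing above plus TestPage.md from the test folder.

```python
def suggest(query: str, file_names: list[str]) -> list[str]:
    """Return the names that start with query, shortest first, then by code point."""
    matches = [name for name in file_names if name.startswith(query)]
    return sorted(matches, key=lambda name: (len(name), name))


file_names = [
    "TestPage.md",
    "སངས་རྒྱས་ཆོས.md",
    "སངས་རྒྱས.md",
    "སངས་རྒྱས་པ.md",
]

# Typing [[ followed by the first syllable and the tsek.
suggestions = suggest("སངས་", file_names)
for name in suggestions:
    print(name)
```

With this kind of matching, every file whose name begins with the typed syllable is offered and unrelated names such as TestPage.md are not.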

It does for western languages. I think that’s the issue.
We have to make Tibetan behave like CJK.
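As a rough idea of what "behave like CJK" could mean, here is a hedged Python sketch. It is not Obsidian's actual code: the range list, `is_unspaced` and `matches` are hypothetical names and choices made for illustration. Queries containing characters from scripts written without spaces are compared as raw character sequences, while everything else is matched word by word on whitespace.

```python
# Code point ranges treated as "written without spaces between words".
# This list is an assumption for the sketch, not Obsidian's actual list.
UNSPACED_RANGES = [
    (0x0F00, 0x0FFF),  # Tibetan
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def is_unspaced(text: str) -> bool:
    """True if any character of text falls in one of the unspaced ranges."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in UNSPACED_RANGES)


def matches(query: str, candidate: str) -> bool:
    if is_unspaced(query):
        # CJK-like path: no word segmentation, compare raw characters.
        return query in candidate
    # Whitespace path: every query word must be a prefix of some candidate word.
    words = candidate.lower().split()
    return all(any(w.startswith(q) for w in words) for q in query.lower().split())


print(matches("སངས་", "སངས་རྒྱས.md"))
print(matches("test", "Test Page"))
```

The point of the split is that Western-language matching keeps its word-based behavior, while Tibetan queries no longer depend on whitespace boundaries that the script does not have.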

If this is the case, the unlinked mentions should not look right either.

I am going to move this to feature requests and rename it.

OK, thanks. If that was needed, then it’s OK. There are some other Asian languages, such as Hindi, Sanskrit and various others, with no space separators as well.

I would appreciate it if you could provide a list of languages that you are SURE don’t separate words with whitespace and are like CJK.

https://www.quora.com/Does-Hindi-have-the-same-or-similar-rules-as-to-when-to-add-a-space-or-mark-a-boundary-between-words-or-sets-of-words-like-with-Sanskrit-or-no-Between-spaces-there-may-be-multiple-words-strung-together-like-in

Here it says that in Hindi the words are commonly separated by spaces.
So are you sure about what you are saying?

Will work in 0.11.10.