Ignore accents/diacritics in search

mikeknight · January 31, 2024, 5:18pm

I do a lot of study on ancient languages, so this feature would be invaluable.

reytrace · February 1, 2024, 11:29am

+1 to add this feature, searching without accents would help a lot

camilohoyos · February 9, 2024, 1:31pm

+1 for this.

purplebrackets · February 9, 2024, 1:52pm

Still waiting for this. I feel like it is a lack of inclusivity

rsenna · February 12, 2024, 10:41am

I found this page again while looking for another question.

Just wanted to confirm: Omnisearch is the way to go for all of us “international users” (meaning non-English speakers, also known as “most of the world”). It can be configured to ignore diacritics, and it works as expected (so searching for “idee” will find “idée.md”).
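For readers wondering what "ignoring diacritics" usually means in practice, here is a minimal sketch in Python. It is not Omnisearch's or Obsidian's actual code; the function names `fold_diacritics` and `matches` are made up for illustration. It assumes the common approach of decomposing text with Unicode NFD and dropping the combining marks before comparing.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits a precomposed letter such as "é" into "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def matches(query: str, candidate: str) -> bool:
    """Case-insensitive, accent-insensitive substring match (hypothetical helper)."""
    return fold_diacritics(query).casefold() in fold_diacritics(candidate).casefold()


if __name__ == "__main__":
    print(fold_diacritics("id\u00e9e.md"))       # idee.md
    print(matches("idee", "id\u00e9e.md"))       # True
```

Folding both the query and the candidate the same way is what makes the match symmetric: “idée” also finds “idee.md”.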

It even has some other very cool features. It requires a bit of setting up, including replacing hot keys, but once that is done it just works. Highly recommended.

Dor · February 12, 2024, 4:40pm

Not sure about that. Apparently 20% of the world speaks English. And most of the world doesn’t use the Roman alphabet, which is where diacritics are modifiers.

rsenna · February 12, 2024, 6:07pm

First of all, I think yours is a very unnecessary reply. It adds nothing to the main discussion, and can be interpreted as, again, contempt against “diacritics users”…

Having said that, it seems to be also wrong, according to the Encyclopedia Britannica:

The Latin alphabet is the most widely used script, with nearly 70 percent of the world’s population employing it.

Of course, how many of those actually use diacritics I don’t know… I would say the whole of Latin America, most of western continental Europe (and many eastern European countries too, such as Romania and Poland), and some Asian languages (at least Filipino, off the top of my head).

Apparently 20% of the world speaks English

Yeah, about that: this is the full paragraph where I think this information you mentioned was quoted from:

Approximately 20% of the world’s population speaks English, with around 400 million people speaking it natively and an additional 1.5 billion who speak it as their second or foreign language.

So 20% (i.e. 1.9 billion people), for this discussion, is a very inflated value, since most of these people do not speak English natively, meaning it’s neither their primary nor their only language.

TL;DR I do not know how many people use diacritics in the world, but I can say it’s a lot. Probably more than native English speakers, and probably, yes, most of the world. Is it enough to consider this an important feature? We’ll see.

Dor · February 12, 2024, 6:47pm

I refer you to your comment:

which seems entirely unnecessary

8% world population

Europe total only 9%

“Most Filipinos (and Philippine news journals) write Tagalog without using any diacritic at all . However, pieces of Tagalog writing which use diacritics can occasionally be found in some religious journals, old books, and others. This case is where a word is stressed in the last or end syllable.”

I don’t think I’m wrong.
Over 50% of the world’s population is in Asia. Most of those using the Latin alphabet will do so in English.

idk. The numbers that article relies on are wrong, and underestimate the number of native English speakers (there are nearly their 400m in North America alone). It doesn’t matter. The point is that claiming a spurious superiority by being “international” or “most of the world” is not the way to support a case. Especially when diacritic users are not most of the world.

I have no contempt for diacritic users: I sometimes use them myself. Just as I don’t always use the Latin alphabet.

rigmarole · February 12, 2024, 11:16pm

@Dor please back down. It doesn’t matter if you’re right or wrong, there is no reason to make any of these arguments. Any further arguing in this thread will simply be flagged and removed. The previous replies may still be.

@rsenna, you are (a bit) newer here, so I’ll guide you to the Code of Conduct, “Encouraged Behaviors”, which includes “step away when heated”. You are free to flag posts you think are off-topic or inappropriate. Community code of conduct - Obsidian Help

The feature request is tagged “valuable” and “i18n”. No one needs to defend whether this is a valuable feature request. It is.

Shaiya · February 17, 2024, 4:17pm

Hi, being able to search regardless of special characters is critical for Hebrew as well (all the other searches I have used ignore the vowel characters). So +1 for this option.

50bbx · February 17, 2024, 4:39pm

Well, I didn’t know about this issue until now. I don’t know how many searches missed some of the results because of accents, but +1

reaty · February 19, 2024, 6:55am

I want to add that Cyrillic script also has diacritics and suffers from this issue as well. For example, the Cyrillic letter “ё” is often written as “е” (these are not the same letters as the Latin ones, at least for computers). There may be other examples that I don’t know about, because there are many languages that use different versions of Cyrillic.
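The “ё” versus “е” case can be checked with the Unicode decomposition, sketched here in Python. The helper names and the `KEEP` set are assumptions for illustration only: a blanket "strip all marks" rule would also turn “й” into “и”, which are distinct letters, so this sketch exempts “й” as one possible policy, not as anything Obsidian or Omnisearch is known to do.

```python
import unicodedata

# Letters whose mark should NOT be stripped (an assumed policy, adjust per language).
KEEP = {"\u0439", "\u0419"}  # й, Й


def decompose(ch: str) -> list:
    """List the code points of a character's canonical decomposition."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


def fold_cyrillic(text: str) -> str:
    """Strip combining marks except on letters listed in KEEP (hypothetical helper)."""
    out = []
    # NFC first, so that a decomposed "й" is recognised by the KEEP check.
    for ch in unicodedata.normalize("NFC", text):
        if ch in KEEP:
            out.append(ch)
            continue
        parts = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in parts if unicodedata.category(c) != "Mn"))
    return "".join(out)


if __name__ == "__main__":
    print(decompose("\u0451"))  # ['U+0435', 'U+0308'], i.e. "е" plus a combining diaeresis
    print(decompose("\u0439"))  # ['U+0438', 'U+0306'], i.e. "и" plus a combining breve
    print(fold_cyrillic("\u0451\u0436") == "\u0435\u0436")  # True ("ёж" folds to "еж")
```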

youben · February 25, 2024, 8:28am

This is a followup to an old post that was published in Help and left unanswered. More details can be found in the original post Exempting Diacritics/Tashkeel/Harakat (Arabic) in Search

Use case or problem

I want Arabic words to be found through search even if they have different “tashkeel” [1]. An example is: if I search for “أَسْمَى” I would find instances with “أسمى”. You can also think of it as searching for “éléphant” and finding instances with “elephant”.

[1] The same word can be written with Tashkeel or not and would mean exactly the same thing, it serves the pronunciation as the same letter can be pronounced in different forms. Two words with different tashkeels can also have two different meanings, however, they are usually related.

Proposed solution

Keywords and the documents being indexed should be stored in a canonical form. So we basically need an Arabic language canonicalizer that would be plugged into the search engine.
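As a rough illustration of the canonical-form idea, here is a minimal Python sketch. It is not an existing Obsidian or plugin API: `canonicalize_arabic` and `search` are hypothetical names, and the range U+064B to U+0652 (fathatan through sukun) is an assumed definition of “tashkeel” that a real canonicalizer would likely extend.

```python
import re

# Assumed tashkeel set: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
TASHKEEL = re.compile("[\u064B-\u0652]")


def canonicalize_arabic(text: str) -> str:
    """Remove tashkeel so marked and unmarked spellings compare equal."""
    return TASHKEEL.sub("", text)


def search(query: str, documents: list) -> list:
    """Return the documents containing the query, comparing canonical forms."""
    needle = canonicalize_arabic(query)
    return [doc for doc in documents if needle in canonicalize_arabic(doc)]


if __name__ == "__main__":
    with_marks = "\u0623\u064E\u0633\u0652\u0645\u064E\u0649"  # the word with fatha and sukun marks
    without_marks = "\u0623\u0633\u0645\u0649"                  # the same word, unmarked
    print(canonicalize_arabic(with_marks) == without_marks)    # True
    print(len(search(with_marks, [without_marks, "elephant"])))  # 1
```

Because the query and the indexed text go through the same function, the match works in both directions, whichever side carries the tashkeel.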

Current workaround (optional)

Nothing really, apart from the fact that I usually try to write keywords without tashkeel so that I can find documents later. This is quite hard to manage, so I wouldn’t consider it an actual workaround.

Note that I’m willing to help implement this feature as a plugin, if that is actually feasible (given that it touches document search). So any guidance would be helpful as well.

BoBo.Zeko · March 9, 2024, 11:14pm

One out of my three vaults is actually useless because I can’t search any text at all, and I wonder if I have missed some results on the others as well because of Tashkeel/diacritics. +1

odyash · March 24, 2024, 3:04pm

I’m beginning to get close to a workaround solution (for Arabic at least), but I unfortunately don’t have the time to implement it, so I’ll share my findings here. As @rsenna mentioned, Omnisearch has an “ignore diacritics” option which works for most languages. However, when I tried it with Arabic, it didn’t yield favorable results. Therefore, I thought about:

forking the Omnisearch repo,
then seeing the part of the code where it fetches the content of each document (i.e., the content in which we will search on our query string)
then adding a condition that if the first letter of the query string is in Arabic, I’ll run the function mentioned in this StackOverflow answer to remove diacritics from the content string,
so now hopefully the word matches correctly even with words that initially contained diacritics.

[Image: visualization of how this can be done, with the added code marked by a green rectangle]

However, my only problem is that the query variable isn’t defined in this omnisearch.ts file, so I don’t know how to do my “first arabic letter” check :[.
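The conditional described in the steps above can be sketched in isolation, assuming the query is simply passed in as a parameter. This is Python rather than the TypeScript of omnisearch.ts, and none of it is Omnisearch code: `starts_with_arabic` and `prepare_content` are hypothetical names, and the regex is a stand-in for the function from the StackOverflow answer, not a copy of it.

```python
import re

# Assumption: "Arabic" means the basic Arabic Unicode block, U+0600 to U+06FF.
ARABIC_BLOCK = re.compile("[\u0600-\u06FF]")
# Assumed tashkeel set (fathatan through sukun); a stand-in for the linked function.
TASHKEEL = re.compile("[\u064B-\u0652]")


def starts_with_arabic(query: str) -> bool:
    """True if the first non-space character of the query is in the Arabic block."""
    stripped = query.lstrip()
    return bool(stripped) and ARABIC_BLOCK.match(stripped[0]) is not None


def prepare_content(query: str, content: str) -> str:
    """Strip tashkeel from the content only when the query looks Arabic."""
    if starts_with_arabic(query):
        return TASHKEEL.sub("", content)
    return content


if __name__ == "__main__":
    content = "\u0623\u064E\u0633\u0652\u0645\u064E\u0649"  # word written with tashkeel
    query = "\u0623\u0633\u0645\u0649"                       # same word typed without it
    print(query in prepare_content(query, content))          # True
```

For this to work when the query itself carries tashkeel, the query would need the same stripping before the comparison.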

Hopefully this will help anyone who wants to improve Omnisearch until the Obsidian team implements this functionality! :]

scambier · March 27, 2024, 6:31pm

Omnisearch dev here. Please file an issue in the repo with relevant links and I’ll take a look at it.

dreamdoggo · April 19, 2024, 9:22am

Was directed here from my failed bug report: Diacritics and curly/smart quotes cannot be searched well, universally or in a note. Not that anyone’s asking for it, but here are my two cents…

To me, calling this a feature request, even a “valuable” one, downplays reasonable expectations that most users have, especially after being primed to expect similar fuzzy logic from so many other programs.

If you searched your bank card transactions for “Lucky’s Diner”, but by default your bank removed apostrophes from business names, you’d expect some logic to show “Luckys Diner” transactions when searching for “Lucky’s Diner”, in the same way you’d expect your bank’s search function to understand “luckys diner” (lowercase) is the same as “Luckys Diner”.

At the very least, you’d expect your bank to explicitly tell you how to type your search queries if they’re going to have such exacting, narrow scope. Not doing so leads to misleading results. In my example, you might get no results and wrongly think you used a different card for the transaction. What a waste of your time.

You could argue adding such logic is a new feature, but it’s also clearly bad UX with a UI operating in a subtly opaque way. If a huge break from common usability expectations can’t count as a bug—when we all know bugs tend to rightfully take priority—and a fix is more likely to count as a nice-to-have feature, I think the Troubleshooting Guide for bug reports should make it clearer what counts as a bug. That would be useful for everyone.

Not trying to be snide. I legitimately do not understand the distinction if this counts as a feature request rather than a bug.

For a lot of us, the search results are clearly missing relevant information, breaking functionality in a variety of ways. But because this is a feature request, I’m going to assume it’s naturally not as high priority (nearly four years in here). Meanwhile, to see accurate results in Obsidian, I have to painstakingly, manually replace single straight quotes/apostrophes with curly quotes/apostrophes (real fun on mobile, and Omnisearch doesn’t appear to recognize them either) or try to remember alt codes for diacritics (just kidding, I’m often on a TKL keyboard where this is a nightmare). It’s a thing I didn’t initially realize I had to do. Really, it’s just forcing me to use other programs half the time, when I’d really like to use Obsidian.
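The quote mismatch described here is the same kind of problem as diacritics and can be handled by the same kind of folding. A minimal Python sketch, under the assumption that curly quotes should simply be treated as their straight equivalents; `fold_quotes` and `matches` are hypothetical names, not part of Obsidian or Omnisearch.

```python
# Map curly single and double quotes to their straight ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (the typographic apostrophe)
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
})


def fold_quotes(text: str) -> str:
    """Replace curly quotes with straight ones (hypothetical helper)."""
    return text.translate(QUOTE_MAP)


def matches(query: str, candidate: str) -> bool:
    """Case-insensitive substring match that ignores quote style."""
    return fold_quotes(query).casefold() in fold_quotes(candidate).casefold()


if __name__ == "__main__":
    note_text = "Lunch at Lucky\u2019s Diner"      # curly apostrophe, as pasted from a forum
    print(matches("lucky's diner", note_text))     # True
```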

CawlinTeffid · April 19, 2024, 11:22am

The developers’ definition of “bug” is something not working as intended. An example would be if they’d programmed search to ignore diacritics but it didn’t. It’s a narrow definition which can be surprising.

dreamdoggo · April 22, 2024, 8:28am

I’ll admit I find it hard to believe that curly quotes (or diacritics) breaking Obsidian’s search function is the code working as intended. Amusingly, if I were to paste text from this very thread into Obsidian, Obsidian would not seamlessly spit out search results for it in many cases because the forum automatically adjusts our punctuation marks to be curly. Every contraction becomes a nightmare in search then. But I guess that’s working as intended…

CawlinTeffid · April 24, 2024, 1:57am

It’s not that they intend the breakage, it’s (apparently) that the intent overlooked this case.