Exempting Diacritics/Tashkeel/Harakat in Search

This is a follow-up to an old post that was published in Help and left unanswered. More details can be found in the original post, “Exempting Diacritics/Tashkeel/Harakat (Arabic) in Search”.

Use case or problem

I want Arabic words to be found through search even if they carry different “tashkeel” [1]. For example, searching for “أَسْمَى” should also find instances of “أسمى”. This is comparable to searching for “éléphant” and finding instances of “elephant”.

[1] The same word can be written with or without tashkeel and mean exactly the same thing; the marks guide pronunciation, since the same letter can be pronounced in different ways. Two words with different tashkeel can also have different meanings, although these are usually related.

Proposed solution

Keywords and the documents being indexed should be stored in a canonical form. So we basically need an Arabic-language canonicalizer that can be plugged into the search engine.
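To make the idea concrete, here is a minimal sketch of what such a canonical form could look like, written in Python purely for illustration. It is not how the search engine works today, and the names `canonicalize` and `matches` are hypothetical. It assumes that "tashkeel" means the marks U+064B to U+0652 (tanwin, harakat, shadda, sukun) plus the superscript alef U+0670, and that Latin accents are the combining marks U+0300 to U+036F; a real canonicalizer might draw those boundaries differently.

```python
import re
import unicodedata

# Assumption: the marks to ignore are the Arabic tashkeel U+064B..U+0652,
# the superscript alef U+0670, and the Latin combining marks U+0300..U+036F.
# Hamza above/below (U+0654, U+0655) are deliberately left alone.
_IGNORED_MARKS = re.compile("[\u064B-\u0652\u0670\u0300-\u036F]")


def canonicalize(text: str) -> str:
    """Return a diacritic-free form of `text` for indexing and querying."""
    # Decompose so that accented letters become base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop only the marks listed above.
    stripped = _IGNORED_MARKS.sub("", decomposed)
    # Recompose what is left (for example alef + hamza above).
    return unicodedata.normalize("NFC", stripped)


def matches(query: str, document: str) -> bool:
    """Hypothetical helper: substring search on canonical forms."""
    return canonicalize(query) in canonicalize(document)
```

With this sketch, both the query and the indexed text go through the same `canonicalize` step, so a keyword typed with tashkeel and a document written without it end up comparing equal. Hamza is kept on purpose, because removing it would merge letters that are spelled differently, not just pronounced differently.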

Current workaround (optional)

Nothing really, apart from trying to write keywords without tashkeel so that I can find documents later. This is quite hard to manage, so I wouldn’t consider it an actual workaround.

Note that I’m willing to help implement this feature as a plugin, if that is feasible given that it touches document search. Any guidance would be helpful as well.

Given the complexity of the Arabic language, careful consideration must be given to the accuracy and performance of the canonicalizer. Supporting diacritic-insensitive search for Arabic text would significantly benefit users who work with it, making search more intuitive and efficient.

A post was merged into an existing topic: Ignore accents/diacritics in search