An inspiring idea, using regex to search for problematic Unicode characters, was discussed at :
That reminded me of another problem, caused by Unicode characters that look the same. In the following case, the character causing the problem is \u00A0, i.e. No-Break Space, and it sits in the position where \u0020, i.e. a normal space, should be.
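To make the difference concrete, here is a minimal Python sketch (not taken from the original discussion) showing that the two characters print alike but are distinct code points. The helper name `describe` and the sample strings are made up for illustration.

```python
import unicodedata

NBSP = "\u00A0"   # No-Break Space
SPACE = "\u0020"  # ordinary space

# Two strings that look identical on screen (hypothetical sample text).
with_nbsp = "hello" + NBSP + "world"
with_space = "hello" + SPACE + "world"


def describe(text):
    """Return (code point, Unicode name) for every whitespace character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text if ch.isspace()]


print(with_nbsp == with_space)  # the strings compare as different
print(describe(with_nbsp))
print(describe(with_space))
```

Comparing the two strings, or searching one for the other, fails even though nothing looks wrong on screen.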
As this problem cannot be detected by the human eye, I thought a trick similar to the one discussed in the first reference might be helpful.
So I ran a test.
Using the built-in query command, I looked up all occurrences of \u00A0 using regex.
The result is astonishing: there are hundreds of occurrences of the No-Break Space across many notes in my vault. Those notes mainly contain text I copied from websites or from emails written in English.
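For anyone who wants to repeat the count outside the app, the same kind of search could be sketched in Python roughly like this. It is not the query I used above; the function names `count_nbsp` and `scan_vault` are hypothetical, and it assumes the notes are UTF-8 `.md` files under one folder.

```python
import re
from pathlib import Path

# Pattern for the No-Break Space only; ordinary spaces are not matched.
NBSP_PATTERN = re.compile("\u00A0")


def count_nbsp(text: str) -> int:
    """Count occurrences of U+00A0 in a string."""
    return len(NBSP_PATTERN.findall(text))


def scan_vault(vault_dir: str) -> dict:
    """Map each .md file under vault_dir to its U+00A0 count (files with none are skipped)."""
    hits = {}
    for path in Path(vault_dir).rglob("*.md"):
        # Assumption: notes are UTF-8; undecodable bytes are replaced rather than raising.
        n = count_nbsp(path.read_text(encoding="utf-8", errors="replace"))
        if n:
            hits[str(path)] = n
    return hits
```

Calling `scan_vault("path/to/vault")` (the path is a placeholder) returns a dictionary of file paths and counts, which makes it easy to see which notes the copied text ended up in.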
So I wonder: is the No-Break Space very common in English-speaking countries? And should I keep them as they are, or replace them with normal spaces (\u0020) before some unexpected problem occurs?
In my experience, the no-break space is usually used to insert a space where you don’t want a line break, for example where a break would look strange or could easily be misread.
I also know some people like to use it after certain abbreviations, like No._1 (I’m using an underscore here, not the no-break space). And sometimes I’ve seen it used to force a slightly wider gap between words and/or sentences.
Regarding the language question: in Norwegian we can combine words and create new words very easily, so we could write some variation like nobreak space, which is not common in English. So whether some people use a no-break space in compound words like that, as in no_break space, I’m not sure, but I wouldn’t rule it out.
In short, I believe they are mostly there for a reason, and I wouldn’t change them. Within regexes, you often have special characters or classes that match whitespace, so it shouldn’t pose a problem in that context either.
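As an illustration of the whitespace-class point, here is a small Python sketch. It only reflects how Python's `re` module behaves with string patterns; other regex engines may treat `\s` differently, so check the one you use. The sample text is made up.

```python
import re

# Sample text with a no-break space after "No." and a normal space before "more".
text = "No.\u00A01 and\u0020more"

# With str patterns, \s is Unicode-aware and matches U+00A0 as well as U+0020.
unicode_tokens = re.split(r"\s+", text)

# With re.ASCII, \s is limited to ASCII whitespace, so U+00A0 no longer matches.
ascii_tokens = re.split(r"\s+", text, flags=re.ASCII)

print(unicode_tokens)  # ['No.', '1', 'and', 'more']
print(ascii_tokens)    # ['No.\xa01', 'and', 'more']
```

So a pattern written with `\s` instead of a literal space handles both kinds of space without touching the notes themselves.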