How can I find all occurrences of Unicode characters whose code point starts with "\u2f"?

GLight · December 14, 2022, 4:02am

Things I have tried

The problem I ran into is that, in Chinese text, the characters in the left column below look almost identical to the ones in the right column. The ones on the right, whose code points start with \u2f, are seldom used.

\u4e00：：：：\u2f00

\u7528：：：：\u2f64

\u751f：：：：\u2f63

\u65b9：：：：\u2f45

(the screenshot of the result of the unicode-to-Chinese conversion)

However, for some reason (OCR, I guess), some of the Chinese characters in my notes are actually the wrong ones, i.e. they should be the characters whose code points do not start with \u2f. It is not possible to spot them in a note, because they look the same when you read it. When you search, however, it becomes a huge problem: the character you type into the search and the one you see in the note will never match.

And since there are a lot of characters, I cannot search for and replace them one by one based on the codebook.

What I’m trying to do

I think one possible way to fix this problem is, as a first step, to identify every occurrence in my vault of a character whose code point starts with \u2f. Optimistically, I expect the total number of occurrences to be small. A manual fix can then be applied to each occurrence by deleting the wrong character and re-entering the correct one, probably inside Obsidian so as to avoid breaking links.

My question is: how can I accomplish the first step, i.e. find all occurrences of characters starting with \u2f within a folder, including its sub-folders (such as the vault)?

holroy · December 14, 2022, 11:07pm

Disclaimer: I don’t know any Chinese, so sorry if that makes parts of my answer confusing!

I do however know a little bit about unicode and regex, and that’s my basis for answering your question.

Match part or single characters

Firstly, it’s actually hard to match only part of a multi-byte character, so it’s not easy to locate everything starting with \u2f directly. To do such a search, you would need a binary editor with search capabilities, and I don’t think that’s the path you should go down.

However, using regex you can search for character classes, and with a little bit of luck you can search for the entire (or a suitable) range of characters starting with \u2f.

Since I’m rather bad (read: totally blank) on Chinese characters, I’m going to give you an example using ordinary Latin characters, and hope you’re able to translate it into your use case.

Character class regex search

To get regex search, you need to search your entire vault using cmd+shift+F, not just cmd+F (or Ctrl + … if on Windows). You can limit the search with something like path:/YourFolder or other means. Then you can enter something like /[a-cru]/. This searches for a single character that must be in the character class, [...], which here means a to c, or r, or u. In other words: a, b, c, r or u.

It’ll now show all matching notes, and you can click on any note and it’ll show you all the characters matching your pattern. You can now either change them manually, or possibly do a search-and-replace for the various cases in that particular file.

I don’t think you can do a global search and replace within Obsidian, and I’m not sure from your description whether there are many characters you need to search for, or just the four you mentioned.
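
For the replacement side, one option outside Obsidian is a small script. The sketch below assumes the look-alike characters all come from the Kangxi Radicals block (U+2F00 to U+2FD5) and relies on Unicode compatibility normalization (NFKC), which maps each of those radicals to its ordinary CJK ideograph. The names `RADICAL_TABLE` and `fix_radicals` are hypothetical, not anything Obsidian provides.

```python
import unicodedata

# Assumption: the problem characters are the Kangxi radicals,
# U+2F00 through U+2FD5 (214 code points).
KANGXI_FIRST, KANGXI_LAST = 0x2F00, 0x2FD5

# Map each radical to its NFKC form, e.g. U+2F00 -> U+4E00.
RADICAL_TABLE = {
    cp: unicodedata.normalize("NFKC", chr(cp))
    for cp in range(KANGXI_FIRST, KANGXI_LAST + 1)
}

def fix_radicals(text: str) -> str:
    """Replace Kangxi radicals only; leave every other character untouched."""
    return text.translate(RADICAL_TABLE)
```

Restricting the table to this one block is deliberate: running NFKC over a whole note would also rewrite unrelated characters such as full-width punctuation.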

Building your character class

You might already have understood how to build your search, but basically you type /[ to start the character class, and then enter either the individual characters you want to search for, or a range like a-c, where you replace the a with the first character in the range (something close to \u2f00) and the c with something near the end of the range (something close to \u2fff).

Finally, you end the character class with ]/. Just for the sake of clarity: the / at the start and end delimits the regex, and the [ and ] mark the character class.
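
As a concrete illustration of such a range, here is a minimal Python sketch. It assumes the characters of interest are the Kangxi radicals, U+2F00 to U+2FD5; the sample text and the name `RADICAL_CLASS` are made up for the example. In Obsidian the analogous query would presumably be /[\u2F00-\u2FD5]/, assuming its regex search accepts \u escapes, which has not been checked here.

```python
import re

# Character class covering the Kangxi Radicals block, written as a range
# instead of listing every character.
RADICAL_CLASS = re.compile(r"[\u2F00-\u2FD5]")

# Hypothetical sample: the second character is U+2F64, not the usual U+7528.
sample = "使\u2f64方法"

for match in RADICAL_CLASS.finditer(sample):
    ch = match.group()
    print(f"offset {match.start()}: U+{ord(ch):04X}")
```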

Hope this helps!

GLight · December 15, 2022, 1:20pm

Brilliant idea! Thank you!
I will have a try and report the result.

GLight · December 15, 2022, 2:26pm

It works!

I used Numbers to generate the strings from \u2f00 to \u2fe1,
put the results into one cell,
and removed the \t between them:

\u2f00\u2f01\u2f02\u2f03\u2f04\u2f05\u2f06\u2f07\u2f08\u2f09\u2f0A\u2f0B\u2f0C\u2f0D\u2f0E\u2f0F\u2f10\u2f11\u2f12\u2f13\u2f14\u2f15\u2f16\u2f17\u2f18\u2f19\u2f1A\u2f1B\u2f1C\u2f1D\u2f1E\u2f1F\u2f20\u2f21\u2f22\u2f23\u2f24\u2f25\u2f26\u2f27\u2f28\u2f29\u2f2A\u2f2B\u2f2C\u2f2D\u2f2E\u2f2F\u2f30\u2f31\u2f32\u2f33\u2f34\u2f35\u2f36\u2f37\u2f38\u2f39\u2f3A\u2f3B\u2f3C\u2f3D\u2f3E\u2f3F\u2f40\u2f41\u2f42\u2f43\u2f44\u2f45\u2f46\u2f47\u2f48\u2f49\u2f4A\u2f4B\u2f4C\u2f4D\u2f4E\u2f4F\u2f50\u2f51\u2f52\u2f53\u2f54\u2f55\u2f56\u2f57\u2f58\u2f59\u2f5A\u2f5B\u2f5C\u2f5D\u2f5E\u2f5F\u2f60\u2f61\u2f62\u2f63\u2f64\u2f65\u2f66\u2f67\u2f68\u2f69\u2f6A\u2f6B\u2f6C\u2f6D\u2f6E\u2f6F\u2f70\u2f71\u2f72\u2f73\u2f74\u2f75\u2f76\u2f77\u2f78\u2f79\u2f7A\u2f7B\u2f7C\u2f7D\u2f7E\u2f7F\u2f80\u2f81\u2f82\u2f83\u2f84\u2f85\u2f86\u2f87\u2f88\u2f89\u2f8A\u2f8B\u2f8C\u2f8D\u2f8E\u2f8F\u2f90\u2f91\u2f92\u2f93\u2f94\u2f95\u2f96\u2f97\u2f98\u2f99\u2f9A\u2f9B\u2f9C\u2f9D\u2f9E\u2f9F\u2fA0\u2fA1\u2fA2\u2fA3\u2fA4\u2fA5\u2fA6\u2fA7\u2fA8\u2fA9\u2fAA\u2fAB\u2fAC\u2fAD\u2fAE\u2fAF\u2fB0\u2fB1\u2fB2\u2fB3\u2fB4\u2fB5\u2fB6\u2fB7\u2fB8\u2fB9\u2fBA\u2fBB\u2fBC\u2fBD\u2fBE\u2fBF\u2fC0\u2fC1\u2fC2\u2fC3\u2fC4\u2fC5\u2fC6\u2fC7\u2fC8\u2fC9\u2fCA\u2fCB\u2fCC\u2fCD\u2fCE\u2fCF\u2fD0\u2fD1\u2fD2\u2fD3\u2fD4\u2fD5\u2fD6\u2fD7\u2fD8\u2fD9\u2fDA\u2fDB\u2fDC\u2fDD\u2fDE\u2fDF\u2fE0\u2fE1

Then I pasted the string from the cell into a website that converts Unicode escapes to Chinese characters:

⼀⼁⼂⼃⼄⼅⼆⼇⼈⼉⼊⼋⼌⼍⼎⼏⼐⼑⼒⼓⼔⼕⼖⼗⼘⼙⼚⼛⼜⼝⼞⼟⼠⼡⼢⼣⼤⼥⼦⼧⼨⼩⼪⼫⼬⼭⼮⼯⼰⼱⼲⼳⼴⼵⼶⼷⼸⼹⼺⼻⼼⼽⼾⼿⽀⽁⽂⽃⽄⽅⽆⽇⽈⽉⽊⽋⽌⽍⽎⽏⽐⽑⽒⽓⽔⽕⽖⽗⽘⽙⽚⽛⽜⽝⽞⽟⽠⽡⽢⽣⽤⽥⽦⽧⽨⽩⽪⽫⽬⽭⽮⽯⽰⽱⽲⽳⽴⽵⽶⽷⽸⽹⽺⽻⽼⽽⽾⽿⾀⾁⾂⾃⾄⾅⾆⾇⾈⾉⾊⾋⾌⾍⾎⾏⾐⾑⾒⾓⾔⾕⾖⾗⾘⾙⾚⾛⾜⾝⾞⾟⾠⾡⾢⾣⾤⾥⾦⾧⾨⾩⾪⾫⾬⾭⾮⾯⾰⾱⾲⾳⾴⾵⾶⾷⾸⾹⾺⾻⾼⾽⾾⾿⿀⿁⿂⿃⿄⿅⿆⿇⿈⿉⿊⿋⿌⿍⿎⿏⿐⿑⿒⿓⿔⿕

Finally, I put the mess generated above into the Obsidian search pane as a regex:

/[⼀⼁⼂⼃⼄⼅⼆⼇⼈⼉⼊⼋⼌⼍⼎⼏⼐⼑⼒⼓⼔⼕⼖⼗⼘⼙⼚⼛⼜⼝⼞⼟⼠⼡⼢⼣⼤⼥⼦⼧⼨⼩⼪⼫⼬⼭⼮⼯⼰⼱⼲⼳⼴⼵⼶⼷⼸⼹⼺⼻⼼⼽⼾⼿⽀⽁⽂⽃⽄⽅⽆⽇⽈⽉⽊⽋⽌⽍⽎⽏⽐⽑⽒⽓⽔⽕⽖⽗⽘⽙⽚⽛⽜⽝⽞⽟⽠⽡⽢⽣⽤⽥⽦⽧⽨⽩⽪⽫⽬⽭⽮⽯⽰⽱⽲⽳⽴⽵⽶⽷⽸⽹⽺⽻⽼⽽⽾⽿⾀⾁⾂⾃⾄⾅⾆⾇⾈⾉⾊⾋⾌⾍⾎⾏⾐⾑⾒⾓⾔⾕⾖⾗⾘⾙⾚⾛⾜⾝⾞⾟⾠⾡⾢⾣⾤⾥⾦⾧⾨⾩⾪⾫⾬⾭⾮⾯⾰⾱⾲⾳⾴⾵⾶⾷⾸⾹⾺⾻⾼⾽⾾⾿⿀⿁⿂⿃⿄⿅⿆⿇⿈⿉⿊⿋⿌⿍⿎⿏⿐⿑⿒⿓⿔⿕]*/
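
The spreadsheet and website steps can also be done in one go with a short script. This is a rough sketch, assuming the range of interest is U+2F00 to U+2FD5 (U+2FD5 being the last character in the converted string above); `build_class` is a hypothetical helper name.

```python
def build_class(first: int = 0x2F00, last: int = 0x2FD5) -> str:
    """Return an Obsidian-style regex listing every character in the range."""
    chars = "".join(chr(cp) for cp in range(first, last + 1))
    # No quantifier after the class: one match per offending character.
    return f"/[{chars}]/"

pattern = build_class()
print(pattern)
```

The class is left without a trailing `*` because a bare character class already matches exactly one character, whereas `[...]*` can also match the empty string.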

Here is the result that I have:

(screenshot of the search result)

It seems that I’m lucky enough.
The problematic line that I ran into by accident a few days ago is the only one in my vault. But since I will sometimes use OCR to capture text from books, this trick will be useful, and I will probably use it to check my vault for errors periodically.
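
For a periodic check like this, a script run outside Obsidian may be handy. The following is only a sketch: it assumes the notes are UTF-8 `.md` files, that the characters to flag are the Kangxi radicals (U+2F00 to U+2FD5), and `scan_vault` is a hypothetical name.

```python
import re
from pathlib import Path

RADICALS = re.compile(r"[\u2F00-\u2FD5]")  # Kangxi Radicals block

def scan_vault(root: str):
    """Yield (path, line number, character) for each radical found.

    Walks the folder and all of its sub-folders.
    """
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in RADICALS.finditer(line):
                yield path, lineno, match.group()

def report(root: str) -> None:
    """Print one line per hit, e.g. for a manual fix in Obsidian."""
    for path, lineno, ch in scan_vault(root):
        print(f"{path}:{lineno}: U+{ord(ch):04X} {ch}")

# Example call (the path is a placeholder): report("path/to/vault")
```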

@holroy,
I really appreciate your help.
