Better Handle md files not stored in UTF8 format

Lise · February 23, 2021, 6:41pm

My accents display correctly in the original md files but they do not display correctly in Obsidian. Worse, one edit within Obsidian and it overwrites my original file, stripping all accented characters and replacing them with the single � character. Of course this damage is irreparable: since all the different characters are replaced with the same single character, a search and replace cannot restore them.

This French md file renders correctly in Notepad and Textpad.
[Screenshot: 20210223_131608]

In Obsidian it looks like this:
[Screenshot: 20210223_131902]

One edit in Obsidian (adding a tag, whatever) makes the original look like this:

[Screenshot: 20210223_133542]

Steps to reproduce

I don’t know if it happens with all files that include accents. This one (and some others that do not work) are encoded in ANSI Latin I, so maybe try putting that sort of file with accents in the vault.

In fact, I’m tracking another similar issue where spaces in English files are being replaced with the same character, so my guess is that it is related to ANSI Latin I files.

Environment

Operating system: Win 10

Obsidian version: 0.11.0

WhiteNoise · February 23, 2021, 6:53pm

We have to handle these cases better, but it’s not easy. My advice is to configure your other editors to use a more modern format: UTF-8.

Lise · February 23, 2021, 7:06pm

@WhiteNoise - Not sure how to do that, but I will look it up. Having said that, I noticed that all my files that have been touched by Obsidian were (auto) converted to UTF-8, and those are the ones that don’t handle the accents. So I’m a bit confused about your suggestion. Is UTF8 supposed to handle special characters? And if so, then why is Obsidian’s ‘conversion’ to that format stripping all of them? Just trying to figure this out.

WhiteNoise · February 23, 2021, 7:09pm

UTF-8 handles accents and so, so much more. The problem is that à in ANSI Latin I corresponds to gibberish in UTF-8. And there is no easy way to tell the encoding of a file.
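To make the mismatch concrete, here is a minimal Python sketch. It is not Obsidian's actual code, and the sample text is only an illustration. It encodes accented text as Latin-1, then decodes those bytes as if they were UTF-8, which is roughly what a UTF-8-only reader does. Each accented letter becomes the same replacement character, and once that result is saved, every one of them is stored as the same three bytes.

```python
# Illustration only: mimics a UTF-8-only reader opening a Latin-1 file.
original = "Sébastien à côté"

# In Latin-1 each accented letter is a single byte (é -> 0xE9, à -> 0xE0, ô -> 0xF4).
latin1_bytes = original.encode("latin-1")

# Those single bytes are not valid UTF-8 on their own, so a decoder that
# substitutes instead of failing emits U+FFFD (the � character) for each one.
misread = latin1_bytes.decode("utf-8", errors="replace")
print(misread)

# Saving the misread text as UTF-8 stores every � as the same byte sequence,
# so the file no longer records which letter was which.
resaved = misread.encode("utf-8")
print(resaved.count("\ufffd".encode("utf-8")), "replacement characters saved")
```

This is why the damage cannot be undone from the saved file alone: é, à and ô all end up as identical bytes.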

Moonbase59 · April 20, 2021, 7:31pm

@Lise, if you are on Windows, you might want to try Notepad++. It handles UTF-8 files and can even convert. It also handles Mac/Linux line ends correctly (and can convert those).

Most modern systems use UTF-8 text encoding, since it handles (almost) all languages of the world, and you’ll never have to worry about “character sets” again.

It’s just great being able to seamlessly work on Linux, Macs, Windows, Android, iPads, all using the same files.

Lise · April 21, 2021, 6:09am

@Moonbase59, yeah, I got it the same day and saw that conversion option. Works like a charm. I’ve set it as my default, but I must confess I’ll often open my (non-French!) files with Textpad, probably because I’ve been using it for almost 20 years and I can regex my way through anything with it. I just have to get comfy with Notepad++. I certainly have the project for it; turning all my 30 year old WordPerfect files into md files. I’m so tired of proprietary anything. I should have listened to the ‘geeks’ back in the day

Moonbase59 · April 21, 2021, 6:23am

Oh well, life is learning something new every day!

Back in my Windows days, many years ago, I was quite happy with Notepad++. And even more happy with my decision to use “naked” text files, later Markdown, and UTF-8 encoding wherever possible. Made the transition to Linux much easier.

Don’t forget to check out their plugins, might be something useful in there!

Then there’s Atom, too, but it is a monster, and starts getting slow with many plugins installed and large files. The Markdown preview is also not the best. On the other hand, it’s customizable like hell and has some great features for other programming languages. The forefather of all Electron-based editors (Electron once was “Atom Shell”).

Lise · April 21, 2021, 6:36am

Ah, thanks for that plugin page. I’ll definitely take a gander there.
I’ve heard about Atom but never looked into it. Added to my todo list.

Lise · July 15, 2021, 11:53pm

@whitenoise In this thread or a related one you told me to get back to you if I saw any funny characters (so non-UTF-8) in Obsidian, and I just saw it happen again.

The md file opened with Notepad++ includes this line:
director: Sébastien Betbeder

When I open the md file in Obsidian it looks like this:
director: S�bastien Betbeder

WhiteNoise · July 16, 2021, 1:11am

Yes, it’s the same issue. The file was likely Latin-1, and Obsidian interpreted and resaved it as UTF-8.

I believe Notepad++ shows/guesses the encoding of files in the bottom right corner and also has a function to save the file in a different encoding.

Lise · July 16, 2021, 1:27am

Yeah, I have it set to auto display as UTF-8… but I have to manually open and save each file in order to actually get it to be UTF-8. What a pain in the … I need that Windows switch on my desktop, you know, the one that says “EVERYTHING that comes in to this computer is auto converted to UTF-8 before it gets here!”
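For converting many files without opening each one by hand, a small script can do the re-encoding in bulk. The following is a minimal Python sketch, not an official tool. The function name `convert_folder_to_utf8` is made up for this example, and it assumes that any file which is not valid UTF-8 is Windows-1252 (the code page Windows editors usually label "ANSI"). Back up the folder before trying anything like this.

```python
from pathlib import Path


def convert_folder_to_utf8(folder, source_encoding="cp1252"):
    """Re-encode every .md file under `folder` that is not already valid UTF-8.

    Assumption: non-UTF-8 files are in `source_encoding` (Windows-1252 by default).
    Returns the list of files that were rewritten.
    """
    converted = []
    for path in sorted(Path(folder).rglob("*.md")):
        raw = path.read_bytes()
        try:
            raw.decode("utf-8")
            continue  # already valid UTF-8 (plain ASCII included): leave untouched
        except UnicodeDecodeError:
            pass
        try:
            text = raw.decode(source_encoding)
        except UnicodeDecodeError:
            # Contains bytes that are undefined in the assumed encoding.
            print(f"skipped (not valid {source_encoding}): {path}")
            continue
        # Write bytes directly so line endings are preserved exactly.
        path.write_bytes(text.encode("utf-8"))
        converted.append(path)
    return converted
```

It would be called with the vault folder, for example `convert_folder_to_utf8(r"C:\path\to\vault")`, where the path is a placeholder. Files that are already UTF-8 are never rewritten, so running it twice is harmless.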

Jare · July 19, 2021, 7:44pm

I’m not commenting about the bug itself, but I’d like to try to help with this problem about getting the � characters converted back to how they were. I’m just making a guess here: If I’m not mistaken, multiple different characters can appear as �. Even though the corrupted characters look the same, they may be different.

Try to open one of the corrupted files in, for example, Notepad++ (a good choice, as you have already discussed it), then highlight one of these � characters in your file, copy it to the clipboard, and do a search & replace operation. Search for the � character that you just copied, and replace it with another character, e.g. à. Then see whether all � characters were replaced, or just some.

If all characters were replaced, then my assumption was wrong, and � characters are not different from each other. If only some of the � characters were replaced by à, then you should be able to repair the corruption.
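Another way to test this hunch is to look at the raw bytes rather than the displayed characters. The sketch below is Python, and the function name `diagnose` and its three labels are invented for this example. It rests on one assumption: that the corruption happens when Latin-1 bytes are decoded as UTF-8 with substitution and then saved, as described earlier in the thread. If so, every saved � is the same three bytes (EF BF BD), while a file that has not been resaved still holds its original, recoverable bytes.

```python
# U+FFFD (the � character) as it is stored in a UTF-8 file: b"\xef\xbf\xbd".
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")


def diagnose(raw: bytes) -> str:
    """Classify file contents by looking at the bytes, not the rendering."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: probably a legacy encoding whose original bytes
        # are still intact, so the accents can still be recovered.
        return "legacy encoding"
    if REPLACEMENT_UTF8 in raw:
        # Valid UTF-8 that contains literal U+FFFD: the original bytes were
        # already replaced and saved.
        return "damaged"
    return "clean utf-8"


print(diagnose("Sébastien".encode("latin-1")))      # file never resaved
print(diagnose("S\ufffdbastien".encode("utf-8")))   # file resaved after misreading
print(diagnose("Sébastien".encode("utf-8")))        # file stored correctly
```

A file reported as "legacy encoding" only looks corrupted on screen and can be converted safely; one reported as "damaged" needs a backup or manual repair.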

Lise · July 19, 2021, 9:51pm

Jare, that’s a great suggestion. The next time it happens I’ll give it a go. It sure would be interesting if your hunch were correct.

s.lopez31 · August 18, 2023, 9:38am

Hi there,

I encountered the same issue with this Python script. Fortunately, I was able to resolve the problem by trying the following solution: Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10) - Stack Overflow

Please be cautious, as there might be potential side effects with legacy applications.

horst · March 5, 2025, 11:29pm

Hi and thanks for the hints.

### Request:
But please, can’t we just get a feature implemented in Obsidian for Windows where Windows-1252 (Latin-1) encoding is “guessed” in a similar way, and the user can choose at load time whether to convert to UTF-8 so that the file displays correctly? The file should not be changed unless saved by the user.

### Justification:
As you are delivering a version for Windows, and as (apparently) for many of us Windows-1252/Latin-1 was set as default encoding for many years, it makes a lot of sense to support these defaulted encodings.

### Value:
In my case, I would need to go over all my existing files, one by one, and convert them to UTF-8. This would mean a lot of work, and the ‘last modified’ dates would also awkwardly be set to today.

Thanks in advance. Best wishes. And congratulations on the otherwise excellent Obsidian!
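As a rough illustration of the behaviour requested above, here is a Python sketch. It is not how Obsidian works and not a proposed patch. Both function names are hypothetical, and the fallback to Windows-1252 is an assumption. The first function guesses the encoding while only reading the file. The second converts a file to UTF-8 and then restores its original timestamps, which addresses the ‘last modified’ concern.

```python
import os
from pathlib import Path


def read_guessing_encoding(path, fallback="cp1252"):
    """Return (text, encoding_used) without modifying the file.

    Tries UTF-8 first; if the bytes are not valid UTF-8, assumes `fallback`.
    Raises UnicodeDecodeError if the bytes are not valid in `fallback` either.
    """
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


def convert_keeping_mtime(path, fallback="cp1252"):
    """Rewrite a legacy-encoded file as UTF-8 while keeping its timestamps.

    Returns True if the file was rewritten, False if it was already UTF-8.
    """
    path = Path(path)
    before = path.stat()  # remember access/modification times
    text, encoding = read_guessing_encoding(path, fallback)
    if encoding == "utf-8":
        return False
    path.write_bytes(text.encode("utf-8"))
    # Put the original access and modification times back.
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
    return True
```

The "try UTF-8, then fall back" order works because text containing accented Latin-1 bytes is very unlikely to also be valid UTF-8, whereas almost any byte sequence decodes under Windows-1252, so the reverse order would misread genuine UTF-8 files.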