Converting html to md with Pandoc, problems in Obsidian

I am having tons of problems with pandoc’s conversion of html to md. It “works” but external links are cosmetically “messed up” when imported into Obsidian.

The html files look perfect when viewed in my browser, but the md files are screwed up when viewed in Obsidian. The links to the assets folder are fine and pics show perfectly in the note. However, there are random line breaks in the middle of sentences, and often in the middle of a link. They can be fixed by going to the end of the line, hitting delete to remove the return, and then reinserting a space. So I guess I could say that, seemingly at random, spaces are being replaced by returns. Very strange.

The original script I used with pandoc to convert the html files to md was as follows:

#!/bin/bash

# Use the current working directory as the root directory
root_dir="."

# Recursively traverse the directory structure and convert each .html file to .md
find "$root_dir" -name '*.html' -type f -exec sh -c 'pandoc "$1" -o "${1%.html}.md"' _ {} \;

That created all kinds of gibberish, which I have been told has to do with pandoc’s fenced divs, which Obsidian apparently doesn’t support.

I revised the script, which got rid of the gibberish:

find "$root_dir" -name '*.html' -type f -exec sh -c 'pandoc "$1" -t gfm-raw_html -o "${1%.html}.md"' _ {} \;
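For anyone puzzled by the `"${1%.html}.md"` part of that command: it is ordinary shell parameter expansion, which strips a trailing `.html` from the path that `find` hands over and appends `.md`, so each note is written next to its source file. A minimal sketch follows; `md_path_for` and the file names are made up for illustration and are not part of my script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): prints the .md path
# that the find command above would write for a given .html path.
md_path_for() {
    # ${1%.html} removes a trailing ".html" from the first argument
    printf '%s\n' "${1%.html}.md"
}

md_path_for "notes/recipes/bread.html"   # prints notes/recipes/bread.md
md_path_for "./index.html"               # prints ./index.md
```

Because only the suffix changes, the folder part of the path is kept as-is, which is why the directory structure survives the conversion.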

It appears as though the gibberish is gone in my notes now, which is great, but the problems above with the random returns persist.

In a related post on Stack Overflow, John MacFarlane said:

Those are a pandoc markdown extension, fenced divs. Apparently obsidian’s markdown dialect doesn’t support them. That’s fine, you can disable them by running pandoc with -t markdown-fenced_divs . In that case you may get some raw HTML div tags; to disable all of this you can use -t markdown-fenced_divs-native_divs-raw_html . Or you could try something like -t commonmark or -t gfm or -t markdown_strict . Pandoc supports many different markdown dialects.
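One way to pick among the writers listed in that quote is to convert a single sample file with each of them and open the results side by side in Obsidian. This is only a rough sketch: `compare_writers` is a hypothetical helper name, the output naming scheme is my own choice, and it assumes pandoc is installed and on the PATH.

```shell
#!/bin/bash
# Hypothetical helper: convert one HTML file with each writer mentioned
# in the quote above, producing one .md file per writer for comparison.
compare_writers() {
    src="$1"
    for fmt in markdown-fenced_divs \
               markdown-fenced_divs-native_divs-raw_html \
               commonmark gfm markdown_strict; do
        # Output name is an assumption, e.g. page.html -> page.gfm.md
        pandoc "$src" -t "$fmt" -o "${src%.html}.$fmt.md"
    done
}

# Usage (path is a placeholder):
#   compare_writers path/to/sample.html
```

Running it on one representative note first is cheaper than reconverting a whole vault for every candidate format.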

Could the random-returns problem be related to an improper format in my revised pandoc script? If I could get that random-return thing fixed I’d be home free.

I am absolutely at a loss. Any help would be tremendously appreciated.

Have you tried the Obsidian Importer plugin …

… to see how it compares?

PS: I have so far only tried the Importer for my Google Keep notes.

I have. It does a reasonable job, but the problem is that it doesn’t preserve nested folders: all the files are imported into one single folder. It ignores my directory structure, so it becomes quite a project (I have hundreds of nested folders, with thousands of files). Ugh.

I wish there were a way around that.


Yes, understood. That requires a lot of extra work.

RESOLVED

Update:

The following script run in bash fixed all my issues:

find "$root_dir" -name '*.html' -type f -exec sh -c 'pandoc "$1" -t gfm-raw_html --wrap=none -o "${1%.html}.md"' _ {} \;

Note that the -t gfm-raw_html piece fixed the gibberish from fenced divs.

There was still another problem though: the text was getting broken up by frequent and seemingly random line breaks. It seems pandoc wraps its output by default, replacing a space with a line break once a line reaches about 72 characters (I’ve also read 80, so I’m not sure of the exact number). In any event, the --wrap=none piece fixed that as well.
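With thousands of files it may be worth previewing what will be written before running the real conversion. Here is a sketch of the same command wrapped in a function with a preview mode. `convert_tree` and the `dry-run` argument are names I made up for this example; the pandoc flags are the ones from the command above.

```shell
#!/bin/bash
# Sketch: same pandoc flags as the command above, wrapped in a function.
# convert_tree and the "dry-run" argument are hypothetical, not pandoc features.
convert_tree() {
    root_dir="${1:-.}"      # directory to search, defaults to the current one
    dry_run="${2:-no}"      # pass "dry-run" to only print what would happen
    find "$root_dir" -name '*.html' -type f -exec sh -c '
        dry_run=$1; shift
        for f in "$@"; do
            out="${f%.html}.md"
            if [ "$dry_run" = "dry-run" ]; then
                # Preview only: show source and target, write nothing
                printf "%s -> %s\n" "$f" "$out"
            else
                pandoc "$f" -t gfm-raw_html --wrap=none -o "$out"
            fi
        done
    ' _ "$dry_run" {} +
}

# Usage:
#   convert_tree . dry-run   # list the conversions without running pandoc
#   convert_tree .           # convert everything under the current directory
```

Using `{} +` instead of `{} \;` hands many files to each `sh` invocation, which should start fewer processes on a large tree; the conversion of each file is otherwise the same.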

