Converting PDF to Markdown with images in Obsidian

kannan · March 13, 2022, 7:45am

1. Get the PDF.
2. Convert it into a Word document using this website and put it into the Obsidian vault folder.
3. Install Pandoc.
4. Go to the terminal and change directory to the Obsidian vault.
5. Run this command:

pandoc --extract-media= "word doc file name".docx -o "converted markdown file name".md
(Optional) The output requires a little formatting, so you can use the regex find-and-replace plugin to clean it up easily.
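
A minimal sketch of a pre-flight check for the steps above, assuming Python 3 is installed. It reports whether Pandoc can be found on the PATH and whether the current folder looks like an Obsidian vault (vaults normally contain a `.obsidian` settings folder). The function names `pandoc_available` and `looks_like_vault` are made up for illustration and are not part of Pandoc or Obsidian.

```python
import shutil
from pathlib import Path


def pandoc_available() -> bool:
    """Return True if a `pandoc` executable is found on the PATH."""
    return shutil.which("pandoc") is not None


def looks_like_vault(folder: Path) -> bool:
    """Return True if `folder` contains a `.obsidian` settings folder."""
    return (folder / ".obsidian").is_dir()


if __name__ == "__main__":
    here = Path.cwd()
    print("Current folder:", here)
    print("Looks like an Obsidian vault:", looks_like_vault(here))
    print("Pandoc installed:", pandoc_available())
```

Saving this as a `.py` file and running it from the terminal after changing directory shows which folder the terminal is actually in before the Pandoc command is run.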

If you have any doubts, feel free to ask

jvalon · March 16, 2022, 9:20am

I am new to scripts, coding, command prompts, etc. I’m Googling and trying to learn, so I apologize if these are silly questions, but I can’t find the answer. Once I change my directory to my Obsidian vault, how do I know it worked? Step 5 says to run this command. I copied and pasted it into Pandoc and updated the file names, but how do I save it and execute it?

kannan · March 16, 2022, 2:46pm

Sorry for taking this long.

I am writing a Python script for this so that you can do this process in bulk. I have some other work to do, but I will finish writing it within a day. I will update this thread when I finish.
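
The finished script is not shown in this thread. As a rough sketch only of what a bulk conversion could look like, the code below builds the same Pandoc call as in step 5 for every `.docx` file in a folder. The function names and the `<name>_media` folder naming are assumptions made for illustration; Pandoc itself writes the extracted images into a `media` subfolder of whatever directory is passed to `--extract-media`.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(docx: Path, media_dir: Path) -> list[str]:
    """Build the pandoc call for one .docx; the .md lands next to it."""
    markdown = docx.with_suffix(".md")
    return [
        "pandoc",
        f"--extract-media={media_dir}",
        str(docx),
        "-o",
        str(markdown),
    ]


def convert_folder(folder: Path) -> None:
    """Convert every .docx directly inside `folder` (subfolders are skipped)."""
    for docx in sorted(folder.glob("*.docx")):
        # Assumption: one media folder per document, so image names from
        # different documents do not overwrite each other.
        media_dir = folder / f"{docx.stem}_media"
        # check=True raises CalledProcessError if pandoc reports a failure.
        subprocess.run(build_pandoc_command(docx, media_dir), check=True)
```

Calling `convert_folder(Path.cwd())` from a script started inside the vault would convert every Word document in that folder, provided Pandoc is installed and on the PATH.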

bhickta · March 30, 2022, 5:04pm

I am using Adobe to convert PDF into Markdown, but the problem comes when I convert the .docx file to .md using Pandoc. It puts each line in a block quote (a leading “>”), like below:

> some text
> some other text
> … and so on

Is there any solution?

Archie · March 30, 2022, 5:24pm

Is there a quick way to deal with tables in a PDF, like turning them into an image? I can’t find an efficient way to do that, especially when they are multiple pages long.

Archie · March 30, 2022, 6:37pm

It does that for me too when I tried it. It is not all of the text, but about 99% of it.

kannan · April 4, 2022, 2:55am

That happens to me as well. I just find and replace “>” with “ ”.

Archie · April 15, 2022, 6:29am

There are actual uses of “>” in documents, so I made this little script in AHK to get rid of the quote markers, in addition to single line breaks (relying on line wrapping instead):

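; Assumes ContentsL1 already holds the Markdown text with CRLF line endings.
; 1. Protect quoted blank lines (a lone ">") with a placeholder unlikely to occur in the text.
StringReplace, ContentsL2, ContentsL1, `r`n>`r`n,92!@#$84, All
ContentsL1 := ""
; 2. Join each remaining quoted line onto the previous line with a single space.
StringReplace, ContentsL3, ContentsL2, `r`n>%A_Space%,%A_Space%, All
ContentsL2 := ""
; 3. Turn the placeholders back into paragraph breaks.
StringReplace, ContentsL4, ContentsL3, 92!@#$84,`r`n`r`n, All
ContentsL3 := ""
; 4. Strip the marker from quoted lines that now start a paragraph.
StringReplace, ContentsL5, ContentsL4, `r`n>%A_Space%,`r`n , All
ContentsL4 := ""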
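
For readers without AutoHotkey, here is a rough Python sketch of the same idea, not a line-by-line port. It only touches lines that begin with the quote marker, so a “>” in the middle of a line is left alone. The function name `strip_blockquotes` is made up for illustration, and genuine block quotes in the document would be unquoted as well.

```python
def strip_blockquotes(text: str) -> str:
    """Remove leading "> " markers and join wrapped quoted lines."""
    out: list[str] = []
    for line in text.splitlines():
        if line.rstrip() == ">":
            # A quoted blank line becomes a paragraph break.
            out.append("")
        elif line.startswith("> "):
            body = line[2:]
            if out and out[-1] != "":
                # Continue the previous line instead of keeping the break.
                out[-1] += " " + body
            else:
                # First line of a new paragraph.
                out.append(body)
        else:
            # Not quoted: keep the line exactly as it is.
            out.append(line)
    # Note: the result uses "\n" line endings, even if the input used "\r\n".
    return "\n".join(out)


sample = "> first line\n> second line\n>\n> new paragraph\na > b stays"
print(strip_blockquotes(sample))
```

With the sample above, the two quoted lines are merged into one, the lone “>” becomes an empty line, and the final line keeps its “>” because it is not at the start of the line.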

Archie · April 15, 2022, 7:04am

Another problem is that some images have selectable text in them, and the output of this method is useless for them. I wonder whether there is a reliable way to get a true image in those cases and paste it into the vault. A screenshot is one option, but in terms of quality, is there a better one?
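
One option that might work here, as a sketch only: render the relevant pages to PNG at a high resolution instead of taking a screenshot. This assumes the `pdftoppm` tool from Poppler is installed and on the PATH, which nothing in this thread confirms. The function names `build_render_command` and `render_pages` and the 300 DPI default are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_render_command(pdf: Path, out_prefix: Path, first_page: int,
                         last_page: int, dpi: int = 300) -> list[str]:
    """Build a pdftoppm call that renders the given pages to PNG files."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("page range must satisfy 1 <= first_page <= last_page")
    return [
        "pdftoppm",
        "-png",                  # write PNG instead of the default PPM
        "-r", str(dpi),          # resolution in DPI; higher is sharper
        "-f", str(first_page),   # first page to render (1-based)
        "-l", str(last_page),    # last page to render
        str(pdf),
        str(out_prefix),         # pdftoppm appends the page number and .png
    ]


def render_pages(pdf: Path, out_prefix: Path, first_page: int,
                 last_page: int, dpi: int = 300) -> None:
    """Run pdftoppm; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(
        build_render_command(pdf, out_prefix, first_page, last_page, dpi),
        check=True,
    )
```

The rendered pages are whole-page images, so a figure or a multi-page table would still need cropping in an image editor before being placed in the vault.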