Converting PDF to Markdown with images in Obsidian

  1. Get the PDF
  2. Convert it into a Word document using this website and put it into the Obsidian vault folder
  3. Install Pandoc
  4. Go to the terminal and change directory to the Obsidian vault
  5. Run this command

    pandoc --extract-media= "word doc file name".docx -o "converted markdown file name".md

  6. (Optional) The output requires a little formatting, so you can use the regex find and replace plugin to clean it up easily

If you have any doubts, feel free to ask.


I am new to scripts, coding, command prompts, etc. I’m Googling and trying to learn, so I apologize if these are silly questions, but I can’t find the answer. Once I change my directory to my Obsidian vault, how do I know it worked? Step 5 says to run this command. I copied and pasted it into Pandoc and updated the file names, but how do I save it and execute it?

Sorry for taking this long.

I am writing a Python script for this, so that you can do this process in bulk. I have some other work to do, but I will finish writing it within a day. I will update this thread when I finish.
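For anyone who wants to try the bulk idea in the meantime, here is a rough sketch of what such a script might look like. It is not the script promised above, only an illustration. It assumes `pandoc` is installed and on your PATH, and it simply repeats the step 5 command for every .docx file in the vault folder. The function names and the per-document `<name>_media` folder are my own choices (the original command leaves the `--extract-media=` value blank); a separate folder per document is used in case extracted images from different documents would otherwise share file names.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(docx_path: Path) -> list[str]:
    """Return the pandoc command that converts one .docx file to Markdown."""
    md_path = docx_path.with_suffix(".md")
    # Assumption: one media folder per document, named after the file.
    media_dir = f"{docx_path.stem}_media"
    return [
        "pandoc",
        f"--extract-media={media_dir}",
        str(docx_path),
        "-o",
        str(md_path),
    ]


def convert_vault(vault_dir: str) -> list[Path]:
    """Convert every .docx directly inside vault_dir; return the .md paths."""
    vault = Path(vault_dir)
    written = []
    for docx_path in sorted(vault.glob("*.docx")):
        # Run pandoc from inside the vault so image links stay relative to it.
        cmd = build_pandoc_command(Path(docx_path.name))
        subprocess.run(cmd, cwd=vault, check=True)
        written.append(docx_path.with_suffix(".md"))
    return written
```

You would call it with the path to your vault, for example `convert_vault("path/to/your/vault")` (a placeholder path). It stops with an error if pandoc fails on any file.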


I am using Adobe to convert PDF into Markdown,
but a problem comes up when I convert the .docx file to .md using Pandoc.
It puts a quote marker (>) at the start of each line, like below:

> some text
> some other text
> … so on

Any solution?

Is there a quick way to deal with tables in a PDF, like turning them into an image? I can’t find an efficient way to do that, especially if they are multiple pages long.

It does that for me too when I tried it. It is not all of the text, but about 99% of it.

That happens to me as well. I just find and replace ">" with " ".

There is actual use of ">" in documents, so I made this little script in AHK to get rid of the markers, in addition to single line breaks (using line wrapping freely instead).

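    ; AutoHotkey v1 syntax. The input text is expected in ContentsL1.
    ; Protect paragraph breaks (a line holding only ">") with a placeholder token.
    StringReplace, ContentsL2, ContentsL1, `r`n>`r`n,92!@#$84, All
    ContentsL1 := ""
    ; Join wrapped quoted lines: a line break followed by "> " becomes a space.
    StringReplace, ContentsL3, ContentsL2, `r`n>%A_Space%,%A_Space%, All
    ContentsL2 := ""
    ; Restore the protected paragraph breaks as blank lines.
    StringReplace, ContentsL4, ContentsL3, 92!@#$84,`r`n`r`n, All
    ContentsL3 := ""
    ; Strip the "> " that still follows a restored paragraph break.
    StringReplace, ContentsL5, ContentsL4, `r`n>%A_Space%,`r`n , All
    ContentsL4 := ""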
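If you do not use AutoHotkey, the same idea can be sketched in Python. This is a loose analogue of the script above, not an exact port: it only touches lines that start with "> " or that hold nothing but ">", so a ">" used inside normal text is left alone. The function name `strip_blockquote_markers` is made up for this example.

```python
def strip_blockquote_markers(text: str) -> str:
    """Remove leading '> ' markers and join wrapped quoted lines."""
    lines_out: list[str] = []
    in_quoted_paragraph = False
    for line in text.splitlines():
        if line.strip() == ">":
            # A line holding only ">" marks a paragraph break.
            lines_out.append("")
            in_quoted_paragraph = False
        elif line.startswith("> "):
            body = line[2:]
            if in_quoted_paragraph:
                # Continuation of the same paragraph: join with a space.
                lines_out[-1] += " " + body
            else:
                lines_out.append(body)
                in_quoted_paragraph = True
        else:
            # Unquoted lines are kept exactly as they are.
            lines_out.append(line)
            in_quoted_paragraph = False
    return "\n".join(lines_out)
```

You would read the converted .md file into a string, pass it through this function, and write the result back. Try it on a copy first, since headings or lists that Pandoc placed inside quoted lines would be joined into the surrounding paragraph.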

Another problem is that some images have selectable text in them, and the output of this method is useless for them. I wonder, is there a reliable way to get a true image for those cases and paste it in the vault? A screenshot is one option, but in regard to quality, is there a better option?