Simple PowerShell (for now) script to convert Microsoft Word files into markdown with pandoc

This is something that I quickly hacked together for someone in the community who needed to batch convert a whole bunch of docx files into markdown on Windows. I found a couple of GUIs for pandoc, but for some reason none of them could process docx files in batch, so I decided to do it with PowerShell instead.

There is a lot of room for improvement, but I’m leaving this here before I forget about it. I’ll come up with a bash/zsh compatible version at a later date for macOS & Linux users.

  1. Install pandoc: https://pandoc.org/installing.html
  2. Reboot your computer if you are on Windows (the installer updates $PATH on Windows, and a reboot is required for the change to take effect)
  3. Make a new folder and copy your documents into it. Then hold down Shift, right-click an empty area of that folder in File Explorer, and click “Open PowerShell here”
  4. Try running pandoc --help. If you get a bunch of help text, pandoc is properly installed. Paste the following into the console and it should convert all of the docx files in this folder to markdown.
    # Convert every .docx file in the current folder to a .md file with the same base name.
    Get-ChildItem . -Filter *.docx |
    ForEach-Object {
        pandoc --from docx --to markdown $_.FullName -o ($_.BaseName + '.md')
    }
    

The pandoc manual has a comprehensive list of all supported formats: https://pandoc.org/MANUAL.html. If you want to convert something else that Obsidian can’t import directly, change the -Filter *.docx parameter and the --from docx argument to match your source format.
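For bash or zsh users, the same idea of swapping the file filter and the --from value can be sketched as a small shell function. This is only a sketch, not the bash/zsh version promised above: the name convert_all and its two parameters are made up for illustration, and it assumes you pass the file extension and the matching pandoc input format yourself.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): convert every file with
# the given extension in the current folder to markdown.
#   $1 = file extension without the dot (e.g. docx, odt, epub)
#   $2 = pandoc input format            (e.g. docx, odt, epub)
convert_all() {
    ext=$1
    from=$2
    for f in ./*."$ext"; do
        # When nothing matches, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        # ${f%.*} strips the last extension, so report.odt becomes report.md.
        pandoc --from "$from" --to markdown "$f" -o "${f%.*}.md"
    done
}
```

After defining the function, something like convert_all odt odt or convert_all epub epub should convert the matching files in the current folder.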

Thanks! Changed the title a little to add keywords for search!

That sounds awesome! :slight_smile: Will try to adapt it to Linux within the next few days…

Thank you.
Works well (within the limits of pandoc).

I tested it with:
- a docx file with a table (interesting visual for a row in italics :slight_smile: )
- an epub of War and Peace (too big for Typora, handled fine by Obsidian if somewhat slow to load; some glitches on chapter headings and the index).
A few glitches in complex documents are normal in pandoc, and I have to remember it uses markdown links, but this was very impressive for the script and Obsidian.

The script above produces .md files, but they are not clean: pandoc hard-wraps the text at a fixed line length (72 characters by default).
Add --wrap=none to get clean .md files.

    # Same conversion, with --wrap=none so paragraphs are not hard-wrapped.
    Get-ChildItem . -Filter *.docx |
    ForEach-Object {
        pandoc --from docx --to markdown --wrap=none $_.FullName -o ($_.BaseName + '.md')
    }

For Linux users:

Place a script with the following content in the folder containing the files you’d like to convert:

    #!/bin/sh
    # Recursively convert every .docx file under the current folder to markdown.
    find . -name "*.docx" -type f -exec sh -c '
        for f; do
            pandoc -f docx -t markdown -o "${f%.*}.md" "$f"
        done
    ' find-sh {} +

Then make the script executable (chmod +x name-of-script) and run it from a terminal like this: ./name-of-script
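The --wrap=none option mentioned earlier in the thread should work in the Linux version as well. A minimal sketch, assuming the same find-based approach; the function name convert_tree and its optional folder argument are hypothetical additions for illustration:

```shell
#!/bin/sh
# Sketch: the recursive find-based conversion with --wrap=none added.
# convert_tree is a hypothetical name; $1 is the folder to search (default: .).
convert_tree() {
    find "${1:-.}" -name "*.docx" -type f -exec sh -c '
        for f; do
            # --wrap=none stops pandoc from hard-wrapping paragraphs.
            pandoc -f docx -t markdown --wrap=none -o "${f%.*}.md" "$f"
        done
    ' find-sh {} +
}
```

Calling convert_tree . at the end of the script would run it on the folder the script is started from.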

If you want to remove each original file after it has been converted, add the following line before the “done” command:

    rm "$f"
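One caveat: a plain rm "$f" on its own line runs even when pandoc failed for that file, so the original could be lost without a converted copy. A more cautious sketch chains the two commands with &&, so the source is deleted only after pandoc exits successfully. The function name convert_and_remove and its folder argument are hypothetical:

```shell
#!/bin/sh
# Sketch: delete each source file only after pandoc reports success.
# convert_and_remove is a hypothetical name; $1 is the folder (default: .).
convert_and_remove() {
    find "${1:-.}" -name "*.docx" -type f -exec sh -c '
        for f; do
            # && runs rm only when pandoc exits with status 0.
            pandoc -f docx -t markdown -o "${f%.*}.md" "$f" && rm "$f"
        done
    ' find-sh {} +
}
```

It is probably still wise to try this on a copy of your documents first.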

Can you tell me which option to use for pandoc so that the tables are converted in the correct form for Obsidian?
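One option that may help, assuming Obsidian renders GitHub-style pipe tables: ask pandoc for gfm output instead of markdown. Pandoc's gfm writer emits pipe tables where it can, though complex tables (for example, cells containing lists or several paragraphs) may still come out as raw HTML. A minimal sketch for a single file; the function name to_gfm is hypothetical:

```shell
#!/bin/sh
# Sketch: convert one docx file with pandoc's gfm writer, which prefers pipe tables.
# to_gfm is a hypothetical name; $1 is the input file.
to_gfm() {
    pandoc --from docx --to gfm --wrap=none "$1" -o "${1%.*}.md"
}
```

For example, to_gfm report.docx should write report.md next to the input file. The same --to gfm value can be swapped into the batch scripts earlier in the thread.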