Simple PowerShell (for now) script to convert Microsoft Word files into markdown with pandoc

This is something that I quickly hacked together for someone in the community who needed to batch-convert a whole bunch of docx files into markdown on Windows. I found a couple of GUIs for pandoc, but none of them could process docx files in batch for some reason, so I decided to do it with PowerShell instead.

There is a lot of room for improvement, but I’m leaving this here first before I forget about it. I’ll come up with a bash/zsh-compatible version for macOS & Linux users at a later date.

  1. Install pandoc: https://pandoc.org/installing.html
  2. If you are on Windows, reboot your computer (the installer updates PATH, and programs that are already running do not see the change; opening a new PowerShell window is often enough, but a reboot is the safe option)
  3. Make a new folder and copy your documents into it. Hold down Shift, right-click an empty area in File Explorer, and click “Open PowerShell here”
  4. Try running pandoc --help. If you get a bunch of help text, pandoc is properly installed. Paste the following into the console and it should convert all of the docx files in this folder to markdown.
    # Convert every .docx file in this folder to a .md file with the same name
    Get-ChildItem . -Filter *.docx |
    ForEach-Object {
        pandoc --from docx --to markdown $_ -o $_.Name.Replace('.docx', '.md')
    }
    

The pandoc manual has a comprehensive list of all supported formats. If you want to convert something that Obsidian can’t import directly, simply change the -Filter *.docx parameter and the --from docx argument to your source format: https://pandoc.org/MANUAL.html


Thanks! Changed the title a little to add keywords for search!

That sounds awesome! :slight_smile: Will try to adapt it to Linux within the next few days…


Thank you.
Works well (within the limits of pandoc).

I tested it with
docx file with table (interesting visual for row in italics :slight_smile: )
epub War and Peace (too big for Typora, handled fine by Obsidian if somewhat slow loading; some glitches on chapter headings and Index).
A few glitches in complex documents are normal in pandoc, and I have to remember it uses markdown links, but this was very impressive for the script and Obsidian.


The script above produces .md files, but they are not clean: pandoc hard-wraps the text at a fixed line length (its default is 72 columns).
Add --wrap=none to get clean .md files.

Get-ChildItem . -Filter *.docx | 
Foreach-Object {
    pandoc --from docx --to markdown --wrap=none $_ -o $_.Name.Replace('.docx', '.md')
}

For Linux users:

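Place a script with the following content in the folder containing the files you’d like to convert:

#!/bin/sh
# Convert every .docx file below the current folder to a .md file next to it
find . -name "*.docx" -type f -exec sh -c '
      for f; do
         pandoc -f docx -t markdown -o "${f%.*}.md" "$f"
      done
   ' find-sh {} +

Make the script executable (chmod +x name-of-script), then run it from a terminal like this: ./name-of-script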

If you want to remove each original file once it has been converted, add the following line before the “done” command:

rm "$f"
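
A slightly more cautious variant of the deletion step, as a minimal sketch: it removes the original only when pandoc exits successfully, so a failed conversion does not cost you the source file. The helper name convert_and_remove and the PANDOC override variable are made up for illustration and are not part of the script above.

```shell
# convert_and_remove is a hypothetical helper: it converts one .docx and
# deletes the original only if pandoc reports success.
# PANDOC is an assumed override variable; it defaults to the real pandoc.
convert_and_remove() {
    src=$1
    if "${PANDOC:-pandoc}" -f docx -t markdown -o "${src%.*}.md" "$src"; then
        rm -- "$src"
    else
        printf 'conversion failed, keeping %s\n' "$src" >&2
    fi
}

# Walk the current folder and its subfolders.
# Note: file names containing newlines are not handled by this loop.
find . -name '*.docx' -type f | while IFS= read -r f; do
    convert_and_remove "$f"
done
```

It is probably worth trying this on a copy of your documents first.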


Can you tell me which option to use for pandoc so that the tables are converted in the correct form for Obsidian?
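
One approach that may help, sketched under assumptions rather than verified against every kind of table: pandoc’s gfm (GitHub-Flavored Markdown) output format has pipe tables as its only Markdown table syntax, and pipe tables are what Obsidian renders. Tables with merged cells or multi-paragraph cells may still not come out as pipe tables. The helper name to_pipe_tables and the PANDOC override variable are hypothetical; the example is written for a POSIX shell, but the pandoc options are the same in PowerShell.

```shell
# to_pipe_tables is a hypothetical helper: it converts one .docx to
# GitHub-Flavored Markdown, which writes tables in pipe-table syntax.
# PANDOC is an assumed override variable; it defaults to the real pandoc.
to_pipe_tables() {
    src=$1
    "${PANDOC:-pandoc}" -f docx -t gfm --wrap=none -o "${src%.*}.md" "$src"
}

# Convert every .docx in the current folder (skip if there are none).
for f in ./*.docx; do
    [ -e "$f" ] || continue
    to_pipe_tables "$f"
done
```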


It’s unable to add images to the markdown file.

Here are some PowerShell command lines to convert ALL the files within a directory, including its subfolders.

From .md to .docx:

gci -r -i *.md |foreach{$docx=$_.directoryname+"\"+$_.basename+".docx";pandoc -f markdown -s --citeproc $_.fullname -o $docx}

From .md to .pdf (pandoc needs a PDF engine such as LaTeX installed for this):

gci -r -i *.md |foreach{$pdf=$_.directoryname+"\"+$_.basename+".pdf";pandoc -f markdown -s --citeproc $_.fullname -o $pdf}

From .docx to .md:

gci -r -i *.docx |foreach{$md=$_.directoryname+"\"+$_.basename+".md";pandoc -f docx -s $_.fullname -o $md}

You can adapt the syntax to convert from/to other formats.


You have to pass an option to pandoc.
This converts .docx to .md and extracts the images into a media folder:

gci -r -i *.docx |foreach{$md=$_.directoryname+"\"+$_.basename+".md";pandoc -f docx -s --extract-media=./ $_.fullname -o $md}

I have encountered a problem: when dealing with multiple docx documents, if an image inside them has the same name, like image1.png, it is automatically overwritten so that only one file remains. How can I solve this problem? Thank you!
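
One way around this, as a minimal sketch: give each document its own folder via --extract-media, so that image1.png from one file cannot overwrite image1.png from another. The _media suffix, the helper names media_dir_for and convert_one, and the PANDOC override variable are assumptions made for illustration. The example is written for a POSIX shell; the same --extract-media idea should carry over to the PowerShell one-liners.

```shell
# media_dir_for is a hypothetical helper that derives a per-document
# media folder: "./report.docx" -> "./report_media"
media_dir_for() {
    printf '%s_media\n' "${1%.*}"
}

# convert_one is a hypothetical helper that converts a single .docx and
# extracts its images into that document's own folder.
# PANDOC is an assumed override variable; it defaults to the real pandoc.
convert_one() {
    src=$1
    "${PANDOC:-pandoc}" -f docx -t markdown --wrap=none \
        --extract-media="$(media_dir_for "$src")" \
        -o "${src%.*}.md" "$src"
}

# Convert every .docx in the current folder (skip if there are none).
for f in ./*.docx; do
    [ -e "$f" ] || continue
    convert_one "$f"
done
```

With this, the images of report.docx would typically end up under report_media/media/, and the links in report.md point there.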

I am a total nube and do not have any coding experience. Is there a way for me to set this up that’s not that complicated?

If you use Windows, search for PowerShell and open the PowerShell ISE app.

Copy and paste this code:


#tell our computer we trust our ability to download packages
Set-ExecutionPolicy RemoteSigned -scope CurrentUser
#download a single package that we trust
Invoke-Expression (New-Object System.Net.WebClient).DownloadString('https://get.scoop.sh')
#scoop screens packages for us, so the packages available on scoop are generally more trustworthy

#wget allows for downloading from the web

scoop install wget

#pandoc allows for converting between many types of document
scoop install pandoc

Run it.

Then you can make a new file (Ctrl + N)

And run the code from this page…after feeding in your directory.

# set the working directory 
cd 'C:\Users\myusername\myfolder\word-documents-for-converting'

# find all .docx files in current directory
Get-ChildItem . -Filter *.docx | Foreach-Object {    pandoc --from docx --to markdown --wrap=none $_ -o $_.Name.Replace('.docx', '.md')}

Genius! Thank you! Very useful!

Thanks all for sharing!

I can’t get the tables right… This is what I tried:

pandoc -s --extract-media= "note.docx" -t markdown-bracketed_spans-raw_html-native_spans-pipe_tables --wrap=none -o "note.md"

Anyone please?

I always get grid_tables instead of pipe_tables


after googling a while I found that this seems to be my problem: