Simple PowerShell (for now) script to convert Microsoft Word files into markdown with pandoc

htuy · October 15, 2020, 9:50pm

This is something I quickly hacked together for someone in the community who needed to batch convert a whole bunch of docx files into markdown on Windows. I found a couple of GUIs for pandoc, but none of them could process docx files in batch for some reason, so I decided to do it with PowerShell instead.

There is a lot of room for improvement, but I’m leaving this here before I forget about it. I’ll come up with a bash/zsh compatible version at a later date for macOS & Linux users.

1. Install pandoc: https://pandoc.org/installing.html
2. Reboot your computer if you are on Windows (the installer updates $PATH, and a reboot is required for the change to take effect).
3. Make a new folder and copy your documents into it. Hold down Shift, right-click an empty area in File Explorer, and click “Open PowerShell here”.
4. Try running pandoc --help. If you get a bunch of help text, pandoc is properly installed. Paste the following into the console and it should convert all of the docx files in this folder to markdown.
```
Get-ChildItem . -Filter *.docx | 
Foreach-Object {
    pandoc --from docx --to markdown $_ -o $_.Name.Replace('.docx', '.md')
}
```

The pandoc manual has a comprehensive list of all supported formats. If you want to convert something that Obsidian can’t import directly, simply change the -Filter *.docx parameter and the --from docx argument to match your source format: https://pandoc.org/MANUAL.html
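For macOS and Linux, the same idea can be sketched in shell. This is only a rough sketch, not the promised bash/zsh version: convert_folder is a made-up helper name, and it assumes you pass the file extension and the pandoc reader name separately (they often match, as with docx and epub, but not always).

```shell
#!/usr/bin/env bash
# Sketch: convert every file with a given extension in the current folder to markdown.
# convert_folder is a hypothetical helper name, not part of pandoc.
convert_folder() {
    ext="$1"      # source file extension, e.g. docx or epub
    format="$2"   # matching pandoc reader name, e.g. docx or epub
    for f in ./*."$ext"; do
        [ -e "$f" ] || continue   # skip when the glob matched nothing
        # "${f%.*}" strips the extension, so ./notes.docx becomes ./notes.md
        pandoc --from "$format" --to markdown "$f" -o "${f%.*}.md"
    done
}
```

After pasting the function into a terminal, convert_folder docx docx mirrors the PowerShell loop above, and convert_folder epub epub would handle epub files instead.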

argentum · October 16, 2020, 8:56am

Thanks! Changed the title a little to add keywords for search!

alltagsverstand · October 16, 2020, 9:34am

That sounds awesome! Will try to adapt it to Linux within the next few days…

Dor · October 16, 2020, 10:47am

Thank you.
Works well (within the limits of pandoc).

I tested it with:
- a docx file with a table (interesting visual for a row in italics)
- an epub of War and Peace (too big for Typora, handled fine by Obsidian if somewhat slow to load; some glitches on chapter headings and the index)
A few glitches in complex documents are normal in pandoc, and I have to remember it uses markdown links, but this was very impressive for the script and Obsidian.

mafsi · October 18, 2020, 9:31pm

The script above will produce .md files, but they are not clean: by default pandoc hard-wraps the text at a fixed line length (72 characters).
Add --wrap=none to get clean .md files without those line breaks.

```
Get-ChildItem . -Filter *.docx |
Foreach-Object {
    pandoc --from docx --to markdown --wrap=none $_ -o $_.Name.Replace('.docx', '.md')
}
```

alltagsverstand · October 18, 2020, 10:09pm

For Linux users:

Place a script with the following content in the folder with the files you’d like to convert:

```
find . -name "*.docx" -type f -exec sh -c '
      for f; do
         pandoc -f docx -t markdown -o "${f%.*}.md" "$f"
      done
   ' find-sh {} +
```

Then make the script executable (chmod +x name-of-script) and run it from a terminal like this: ./name-of-script

If you want to remove the original file, add the following line before the “done” command:

```
rm "$f"
```
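One caveat with a bare rm before “done”: it runs even when pandoc fails on a file, so the original could be lost without a converted copy. A possible safer variant is sketched below. convert_and_remove is a made-up helper name, and the --wrap=none flag is borrowed from the post above rather than being part of the original Linux script.

```shell
#!/usr/bin/env bash
# Sketch: convert one docx file and delete the original only if pandoc succeeded.
# convert_and_remove is a hypothetical helper name.
convert_and_remove() {
    f="$1"
    if pandoc -f docx -t markdown --wrap=none -o "${f%.*}.md" "$f"; then
        rm -- "$f"    # reached only when pandoc exited with status 0
    else
        echo "conversion failed, kept: $f" >&2
        return 1
    fi
}
```

It could be used in a simple loop over the current folder, for example: for f in ./*.docx; do convert_and_remove "$f"; done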

Vadych · October 25, 2020, 11:26am

Can you tell me which option to use for pandoc so that the tables are converted in the correct form for Obsidian?
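One thing that may be worth trying (an assumption, not something confirmed in this thread): pandoc's gfm output format writes pipe tables, which Obsidian renders. Tables that a pipe table cannot represent, such as ones with multi-paragraph cells, may still come out in another form. docx_to_gfm below is a made-up helper name for a minimal sketch.

```shell
#!/usr/bin/env bash
# Sketch: convert one docx file to GitHub-Flavored Markdown (pipe tables where possible).
# docx_to_gfm is a hypothetical helper name.
docx_to_gfm() {
    f="$1"
    # -t gfm selects pandoc's GitHub-Flavored Markdown writer
    pandoc -f docx -t gfm --wrap=none -o "${f%.*}.md" "$f"
}
```

Usage would be something like docx_to_gfm ./note.docx, which should write ./note.md next to the source file.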

phantomsin · April 20, 2021, 10:01am

It’s unable to add images to the markdown file.

mafsi · May 16, 2021, 4:48pm

Here are some PowerShell one-liners to convert ALL the files within a directory, including its subfolders. They pass $_.fullname to pandoc so that files found in subfolders are resolved correctly.

From .md to .docx:

```
gci -r -i *.md |foreach{$docx=$_.directoryname+"\"+$_.basename+".docx";pandoc -f markdown -s --citeproc $_.fullname -o $docx}
```

From .md to .pdf:

```
gci -r -i *.md |foreach{$pdf=$_.directoryname+"\"+$_.basename+".pdf";pandoc -f markdown -s --citeproc $_.fullname -o $pdf}
```

From .docx to .md:

```
gci -r -i *.docx |foreach{$md=$_.directoryname+"\"+$_.basename+".md";pandoc -f docx -s $_.fullname -o $md}
```

You can play around with the syntax to convert from/to other formats.
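For Linux and macOS, a rough shell counterpart could look like the sketch below. convert_tree is a made-up helper name. It lets pandoc infer both formats from the file extensions and leaves out --citeproc, so treat it as a starting point rather than an exact port.

```shell
#!/usr/bin/env bash
# Sketch: recursively convert files from one extension to another,
# writing each result next to its source file.
# convert_tree is a hypothetical helper name.
convert_tree() {
    src_ext="$1"   # e.g. md
    dst_ext="$2"   # e.g. docx
    find . -name "*.${src_ext}" -type f -exec sh -c '
        dst_ext="$1"; shift          # first argument is the target extension
        for f; do                    # remaining arguments are the files found
            pandoc -s "$f" -o "${f%.*}.${dst_ext}"
        done
    ' find-sh "$dst_ext" {} +
}
```

For example, convert_tree md docx would walk the current folder and write a .docx next to every .md file.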

mafsi · May 16, 2021, 5:01pm

You have to pass an option to pandoc.
This will convert .docx to .md, extracting the images into a media folder:

```
gci -r -i *.docx |foreach{$md=$_.directoryname+"\"+$_.basename+".md";pandoc -f docx -s --extract-media=./ $_.fullname -o $md}
```

cjh1993129 · April 19, 2022, 1:07am

I have encountered a problem: when converting multiple docx documents, if the images inside them have the same name, like image1.png, each one automatically overwrites the previous one so that only one file remains. How can I solve this problem? Thank you!
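One possible workaround, sketched here in shell as an assumption rather than a tested fix, is to give every document its own --extract-media folder so that identical image names no longer collide. convert_with_own_media and the _media folder suffix are made-up names.

```shell
#!/usr/bin/env bash
# Sketch: extract each document's images into a folder named after the document.
# convert_with_own_media and the "_media" suffix are hypothetical choices.
convert_with_own_media() {
    f="$1"
    base="${f%.*}"   # ./report.docx -> ./report
    # Images from ./report.docx land under ./report_media/ instead of a shared folder
    pandoc -f docx -t markdown --wrap=none \
        --extract-media="${base}_media" -o "${base}.md" "$f"
}
```

Called as convert_with_own_media ./report.docx, this should leave the image1.png of each document in a separate folder.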

Tiresias · January 25, 2023, 1:43pm

I am a total newbie and do not have any coding experience. Is there a way for me to set this up that’s not that complicated?

zacht · February 13, 2023, 12:21am

If you use Windows, search for PowerShell and open the PowerShell ISE app.

Copy and paste this code:


```
# tell our computer we trust our ability to download packages
Set-ExecutionPolicy RemoteSigned -scope CurrentUser
# download a single package that we trust
Invoke-Expression (New-Object System.Net.WebClient).DownloadString('https://get.scoop.sh')
# scoop screens packages for us, so the packages available on scoop are generally more trustworthy

# wget allows for downloading from the web
scoop install wget

# pandoc allows for converting between many types of document
scoop install pandoc
```

Run it.

Then you can make a new file (Ctrl + N).

And run the code from this page, after filling in your own directory.

```
# set the working directory
cd 'C:\Users\myusername\myfolder\word-documents-for-converting'

# convert all .docx files in the current directory to markdown
Get-ChildItem . -Filter *.docx | Foreach-Object {
    pandoc --from docx --to markdown --wrap=none $_ -o $_.Name.Replace('.docx', '.md')
}
```

tondeaf · May 10, 2023, 3:54am

Genius! Thank you! Very useful!

berot3 · November 20, 2023, 12:31pm

Thanks all for sharing!

I can’t get the tables right… This is what I tried:

```
pandoc -s --extract-media= "note.docx" -t markdown-bracketed_spans-raw_html-native_spans-pipe_tables --wrap=none -o "note.md"
```

Anyone please?

I always get grid_tables instead of pipe_tables

after googling a while I found that this seems to be my problem: