Intelligently Absorbing PDFs into your PKM

I’ve been copying important documents into my vault’s PDF folder, but there are many times when I wish I had the markdown text file equivalent of that PDF instead.

I often start with this site to convert PDF to MD, grab the results, and paste that text into a new note. The problem is there’s A LOT of cleanup needed. That’s driven an unhealthy obsession with regular expressions, but that’s another story…
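For anyone curious what that regex cleanup tends to look like, here is a minimal Python sketch of a few common passes. The function name `clean_pdf_text` and the particular rules are my own assumptions, not what any specific converter requires, so treat it as a starting point and tune it for your documents.

```python
import re


def clean_pdf_text(text: str) -> str:
    """Apply a few common cleanup passes to text extracted from a PDF.

    Hypothetical helper: the rules below are illustrative assumptions.
    """
    # Normalize Windows line endings.
    text = text.replace("\r\n", "\n")
    # Strip trailing spaces and tabs at the end of every line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Rejoin words hyphenated across a line break ("conver-\nsion").
    # Caveat: this also joins genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks inside a paragraph into spaces,
    # leaving blank-line paragraph breaks alone.
    # Caveat: this will also flatten lists and headings.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    raw = "Intro-\nduction to the   \ntopic.\n\n\n\nNext para."
    print(clean_pdf_text(raw))
```

The order of the passes matters: trailing whitespace has to go first, otherwise the hyphen rule never sees a hyphen directly followed by a newline.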

Sometimes I try Pandoc (where I’m a complete noob) and I get decent output, especially if I use Acrobat to wash the PDF into DOCX, then Pandoc to go from DOCX to markdown.

That seemed to work well, except tables come out horribly. I get lines with plus signs all over them instead of standard Obsidian markdown table syntax. In some cases I’m going to be responsible for editing/re-creating these documents, so it would make sense to create their markdown file equivalent.
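Those plus-sign lines look like Pandoc's grid table syntax, which its default markdown writer can fall back to for tables with complex cells. One thing that may help is asking for GitHub-flavored markdown output (`-t gfm`), which writes pipe tables where it can. Below is a rough Python sketch that wraps the command. The function names and file names are placeholders I made up, and very complex tables may still come out as raw HTML, so check the result.

```python
import shutil
import subprocess


def build_pandoc_command(src: str, dest: str) -> list[str]:
    """Build a pandoc command that converts DOCX to GFM markdown.

    Hypothetical helper; the flags themselves are standard pandoc options.
    """
    return [
        "pandoc",
        src,
        "-f", "docx",      # input format
        "-t", "gfm",       # GitHub-flavored markdown (pipe tables)
        "--wrap=none",     # do not hard-wrap paragraphs
        "-o", dest,        # output file
    ]


def convert(src: str, dest: str) -> None:
    """Run the conversion, assuming pandoc is installed and on PATH."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(build_pandoc_command(src, dest), check=True)


if __name__ == "__main__":
    # "report.docx" and "report.md" are placeholder file names.
    print(" ".join(build_pandoc_command("report.docx", "report.md")))
```

The same command works directly in a terminal; the wrapper is only useful if you want to batch-convert a folder.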

  1. What do you do with PDFs? Keep them external and import the notes only?
  2. Run them through a customized PDF-to-markdown filter?
  3. Some other system?

I use the Zotero best-practices workflow with ZotFile and Mdnotes, and I work with the PDFs in Zotero separately. With Zotero links you can also link directly to specific PDF pages from Obsidian. I personally prefer the separation of concerns: reference management and PDFs in Zotero, everything else in Obsidian.
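To illustrate the page links: as far as I know, Zotero opens a PDF at a given page through a `zotero://open-pdf/...` URI, so a markdown link in an Obsidian note can point at it. Here is a small Python sketch that builds such a link. The helper name, the item key `ABCD1234`, and the label are made up for illustration, and the URI pattern assumes a recent Zotero version with the PDF in your personal library.

```python
def zotero_pdf_link(item_key: str, page: int, label: str) -> str:
    """Return a markdown link that opens a Zotero PDF attachment at a page.

    Hypothetical helper: item_key is the key of the PDF attachment item,
    and the URI pattern is an assumption about your Zotero setup.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    uri = f"zotero://open-pdf/library/items/{item_key}?page={page}"
    return f"[{label}]({uri})"


if __name__ == "__main__":
    # Placeholder key and label, not a real library item.
    print(zotero_pdf_link("ABCD1234", 12, "Smith 2020, p. 12"))
```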


This is great advice. Dealing with PDFs is something I’ve struggled with for years.

Hopefully someday someone will develop a tool that can convert a flat PDF or image into fully responsive, searchable HTML. The hard part, though, is when text sits on some kind of background and you want to extract that text while also preserving the background.

Thanks for the tips.