Intelligently Absorbing PDFs into your PKM

s0ph0s · September 22, 2020, 2:52pm

I’ve been copying important documents into my vault’s PDF folder, but there are many times when I wish I had the markdown text file equivalent of that PDF instead.

I often start with this site to convert PDF to MD, grab the results, and paste that text into a new note. The problem is there’s A LOT of cleanup needed. That’s driven an unhealthy obsession with regular expressions, but that’s another story…

Sometimes I try Pandoc (where I’m a complete noob) and I get decent output, especially if I use Acrobat to wash the PDF into DOCX, then Pandoc to go from DOCX to markdown.
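For the DOCX-to-markdown step described above, a minimal Python sketch might look like the following. It assumes `pandoc` is installed and on the PATH; the function names `pandoc_command` and `convert_folder` are made up for illustration. It targets pandoc's `gfm` output format, which as far as I know prefers pipe-style tables where the table is simple enough, and uses `--wrap=none` so paragraphs are not hard-wrapped.

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path, md_path):
    """Build the pandoc argument list for one DOCX -> markdown conversion."""
    return [
        "pandoc", str(docx_path),
        "--from", "docx",
        "--to", "gfm",       # GitHub-flavored markdown output
        "--wrap=none",       # keep each paragraph on a single line
        "--output", str(md_path),
    ]


def convert_folder(folder):
    """Convert every .docx file directly inside `folder` to a sibling .md file."""
    converted = []
    for docx in sorted(Path(folder).glob("*.docx")):
        md = docx.with_suffix(".md")
        # check=True raises CalledProcessError if pandoc reports a failure.
        subprocess.run(pandoc_command(docx, md), check=True)
        converted.append(md)
    return converted
```

Calling `convert_folder("exports")` (a hypothetical folder name) would leave one `.md` file next to each `.docx`; the output will usually still need some manual cleanup.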

That seemed to work well, except tables come out horribly. I get lines with plus signs all over them instead of standard Obsidian markdown table syntax. In some cases I’m going to be responsible for editing/re-creating these documents, so it would make sense to create their markdown file equivalent.
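The lines full of plus signs are most likely pandoc's grid-table syntax. For notes that already contain such tables, a rough cleanup sketch in Python could look like this. It assumes the simplest case only: every cell fits on one line, no cell contains a `|` character, and the first row is the header. The function name `grid_to_pipe` is hypothetical.

```python
def grid_to_pipe(grid: str) -> str:
    """Convert a simple grid table (single-line cells) to a pipe table."""
    rows = []
    for line in grid.strip().splitlines():
        line = line.strip()
        if line.startswith("+"):
            continue  # border line such as +-----+-----+ or +=====+=====+
        if line.startswith("|"):
            # Drop the outer pipes, then split the row into trimmed cells.
            rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if not rows:
        return ""
    out = ["| " + " | ".join(rows[0]) + " |"]
    # Separator row that marks the first row as the header.
    out.append("|" + "|".join(" --- " for _ in rows[0]) + "|")
    for cells in rows[1:]:
        out.append("| " + " | ".join(cells) + " |")
    return "\n".join(out)


example = """
+------+-------+
| Name | Value |
+======+=======+
| a    | 1     |
+------+-------+
| b    | 2     |
+------+-------+
"""
print(grid_to_pipe(example))
```

Tables with cells that span several lines would come out wrong with this approach and are better fixed by hand.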

What do you do with PDFs? Keep them external and import the notes only?
Run them through a customized PDF-to-markdown filter?
Some other system?

tallguyjenks · September 22, 2020, 2:55pm

I use the Zotero best-practices workflow with zotfile and mdnotes, working with the PDFs in Zotero separately; with Zotero links you can also link directly to PDF pages from Obsidian. I personally prefer the separation of concerns: reference management and PDFs in Zotero, everything else in Obsidian.

robertsmith · October 4, 2020, 3:45am

This is great advice. “Dealing with” pdfs is something I’ve dealt with for years.

Hopefully someday, someone will develop a tool that can convert a flat PDF or image into fully responsive, searchable HTML. The hard part, though, is when text sits on some kind of background and you want to extract that text while also preserving the background.

Thanks for the tips.

alltagsverstand · January 24, 2021, 11:37am

This is a topic I have been dealing with since I started using Obsidian as my main working environment. I would be interested in hearing about others' experience: how do you handle your PDFs? Do you store them inside or outside your Obsidian vault?

Currently, I still have a folder external to my vault under which I store all of my pdf attachments. I use Zotero in combination with zotfile, BetterBibtex and mdnotes for organizing my literature and - if required - extracting highlights and exporting them to my vault. The corresponding pdfs are linked in my vault so that I can open them in my external pdf viewer.

In any case, I will keep Zotero as my main application for organizing references. Yet I am still weighing whether I should move my attachments folder into my vault. The question has become more pressing because Obsidian’s new internal PDF viewer lets me select text and copy and paste it directly into my active note, and the viewer may develop further in the future, for example by letting me highlight text and extract my highlights directly into my note. Even if that feature never comes, there is already the PDF extract highlights plugin I can use for that.

Moving my attachments folder to my obsidian vault thus has several advantages:

PDFs open faster
Links to PDFs are inserted faster and more smoothly
I don’t have to switch between applications; my working note and my source file sit side by side
The backlink pane shows all my notes that already cite a particular source or are otherwise related to it

Yet, to date, I am still hesitating. The main reason is:

My attachments folder contains several thousand PDF files, currently about 10 GB in total, and it will keep growing. I am afraid that this could massively slow down Obsidian (to be honest, I still haven’t completely understood Obsidian’s caching process… Does Obsidian, for example, cache only the file names of other file formats like PDF, or also their content?).
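Before deciding, it may help to measure the folder exactly. The sketch below counts the PDFs under a folder and sums their sizes; it says nothing about how Obsidian itself would cope with them. The function name `pdf_stats` and the example path are assumptions made for illustration.

```python
from pathlib import Path


def pdf_stats(folder):
    """Return (count, total_bytes) for all PDF files under `folder`, recursively."""
    count = 0
    total = 0
    for path in Path(folder).rglob("*"):
        # Match the extension case-insensitively so .PDF files are included.
        if path.is_file() and path.suffix.lower() == ".pdf":
            count += 1
            total += path.stat().st_size
    return count, total
```

For example, `count, size = pdf_stats("/path/to/attachments")` followed by `print(count, round(size / 1024**3, 2), "GiB")` reports the number of files and their combined size.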

Have others already experimented with such a large attachment folder? Which solutions have you found?

A compromise (maybe as a feature request) could be for Obsidian to let users “connect” an external attachments folder, so that external PDFs can be opened in its integrated viewer. Is this an option others would like to see?

bepolymathe · February 2, 2021, 8:39pm

Hi,

I understand why you are asking yourself this question. For my part, I tend to think it is desirable to have all of one's sources in Obsidian, especially for reading notes. On the other hand, the “Citations” plugin allowed me to find a compromise. I only put the PDF files attached to Zotero items in my vault. The Zotero database itself (and thus the .html files of the screenshots) remains outside. But the template I use in “Citations” allows me to open everything from the reference note.

# {{title}}
## {{authorString}}
**Publication** : {{containerTitle}}
**Year** : {{year}}
**Pages** : {{page}}
**DOI** : {{DOI}}
**File** : [Open file](<file://{{entry.files.[0]}}>)
**Web** : [Open online]({{URL}})
**Zotero** : [Open Zotero]({{zoteroSelectURI}})
**Annotations** : [[@{{citekey}} - Annotations]]

## Abstract
> {{abstract}}



---
Type (⚛)
#📘
Tags (🔖)

Project: (📂)

Links (🔗)

Review (📅)

With this method my PDF files could live outside Obsidian, but I leave them in because I have subscribed to the synchronization option… it gives me access to the PDFs on several machines.
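To illustrate how the double-brace placeholders in the template above get filled in, here is a simplified Python stand-in. It is not the plugin's code: it only handles flat names such as `{{title}}`, not paths like `{{entry.files.[0]}}`, and the sample values are made up.

```python
import re


def fill_template(template: str, values: dict) -> str:
    """Replace each {{name}} with values[name]; unknown names are left as-is."""
    def substitute(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))
    return re.sub(r"\{\{\s*([A-Za-z_]\w*)\s*\}\}", substitute, template)


template = "# {{title}}\n**Year** : {{year}}\n**DOI** : {{DOI}}"
sample = {"title": "Example Paper", "year": 2020}  # made-up values
print(fill_template(template, sample))
```

Leaving unknown placeholders untouched makes it easy to spot fields the reference entry did not provide.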