PDF Peculiarities

I have a vault that includes a directory tree with a sizeable number of files as source materials, around 60,000 - a subset of a corpus of several million legal documents. There are no more than a couple of hundred files in any single sub-directory. The files are mostly PDF; a few are ePUB or other formats. In Obsidian, I can click on any such PDF in the Files pane and it loads into a tab in one or two seconds (only modestly slower than using a default Windows platform app as the file viewer). This is so whether I use the built-in PDF viewer or a plug-in PDF file viewer in Obsidian. Perfectly acceptable performance.

But this is true only when that file has no links to it.

Once the file has links to it, then whether I attempt to open it via such a link or via its entry in the Files pane, it will take a good 20 or 30 seconds or longer to begin to open the file.

This is a real usability killer for my developing workflows in Obsidian. It’s also unclear to me why this huge performance hit would exist.

I have tried disabling all plug-ins. I have excluded the vault from real-time anti-virus scanning. Logically, I did not expect these efforts to make any difference to the performance issue, and they did not.

A second (potentially related?) peculiarity is that when a link has been used to open a PDF to a specific page, say page 150, the next attempt to open that PDF, whether from a link that does not include a page number or from the Files pane, will result in the built-in viewer opening that document to page 150. This does not seem to me to be an intended behaviour - a link that does not specify a page number should open the document at page 1, no?

I should add that I have found these problems reproducible on different installations of Obsidian on different computers. (I had initially thought that disk speed might be involved, but that appears to make little difference).

Any concrete suggestions would be much appreciated.

Some follow-up:

I compared various scenarios (e.g. text-based versus image-based PDFs) without spotting any real difference in performance. Finally, I compared Obsidian’s PDF viewer with a known performant program (SumatraPDF). I would expect Sumatra to be slightly more performant as a native app, but it was in fact orders of magnitude faster, and this is where it may be possible to home in on a clue as to why Obsidian performs so very poorly.

1st is that Sumatra apparently only allocates enough memory to open the document to the beginning page, or to a specific page if one has been specified. Obsidian, on the other hand, allocates as much memory as is required to hold the entire file, or even more. I assume the entire document is being read into memory.

Unsurprisingly, this appears to have an impact on the time required to open a file (as well as using a heck of a lot of memory if more than a few files of any size are open at once - although Obsidian does appear to release a bit of memory if the currently selected tab is one containing a smaller file). In any case, the act of opening the file also spikes the CPU more and longer than it does with Sumatra and, overall, the process of memory allocation appears to account for a modest amount of the performance issue.

2nd, and apparently far more impactful, is the difference in duration and resource consumption needed to scroll a document to a specific page. In this case, the difference is primarily in CPU usage, and the duration appears to account for most of the performance issue initially reported. Even though the file is apparently entirely in memory, when Obsidian scrolls to a specific page in that document, whether because

  1. the user interacts with the document viewer to scroll the document,
  2. because the file is being opened via a link specifying the page number to scroll to, or
  3. because Obsidian is opening the document to the page it was on when last opened

the CPU spikes to nearly everything the system will permit it (about 80% CPU on the system tested) and this will remain the case until the operation is complete - something that can take 30 seconds or more if the file is large and the requested page number is high, but which takes a number of seconds to complete even if the document is relatively small (in either file size or page count) and the requested page number is low.

There is some kind of gross inefficiency at play here. There is no rationale for this behavior. Error is involved.

At a guess, doing a little code tracing could run this down pretty quick, and would be likely to yield an opportunity to implement a major performance improvement.

I imagine this also explains the peculiar behavior of opening a PDF file becoming far less performant once it is linked to. Obsidian appears to be remembering the ‘last page displayed’ in this case and scrolling to that page - thereby invoking the performance penalty described above.

Follow up bug report here: Massive CPU usage incurred when scrolling a PDF - very long delay in opening files - blocked UI

Maybe you get this slowdown because your PDFs are OCR PDFs or filled with high-quality images.

In both cases, the editors or authors neglected to post-process their final product before distribution.
For example, you can usually shrink PDFs by at least 20 to 50% of their original size; also, you don’t need all pages in RGB, just those with pictures.

I stopped using Obsidian for PDFs some time ago. The PDF viewer has had an update since then, so I wonder. Anyway, I just wanted to add my 2 cents.

Thanks for thinking about the issue. Unfortunately, the problem exists with both text-based and scanned PDFs and, as I described, the fact that scanned PDFs tend to be bulky does not explain why Obsidian can open a large, scanned PDF in two seconds, even though it is reading the entire file into memory, but cannot open that same file in twenty seconds if it has to open/scroll to a specific page in the file (which is what is happening when the file has links to it).

To me, this poor performance appears to be the result of two things:

  • first is that Obsidian is remembering the last page number the document was opened at, and re-opening the document to that page number. To me, that is a mistaken behavior. While I can understand that some people might like to go back to where they left off in a document, this default behavior means that every link would have to be formulated as somedoc.pdf#page=1 in order to open documents at the beginning.

  • second, even forcing every document to open at page 1 wouldn’t solve the performance issue, since it appears that in this case, Obsidian opens the document at the last viewed page, and then scrolls to page 1, invoking the performance hit anyway.
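To illustrate what formulating every link as somedoc.pdf#page=1 would involve, here is a minimal Python sketch. It assumes the links are written as wikilinks such as [[somedoc.pdf]] or [[somedoc.pdf|alias]]; the function name pin_to_first_page and the regular expression are hypothetical helpers of my own, not anything Obsidian provides. Given the second point above, I would not expect this to remove the slowdown; it only shows the amount of link rewriting the current default would demand.

```python
import re

# Hypothetical pattern: a wikilink whose target ends in .pdf and has no
# '#...' fragment yet, optionally followed by '|alias'.
PDF_LINK = re.compile(r"\[\[([^\]\|#]+\.pdf)(\|[^\]]*)?\]\]", re.IGNORECASE)


def pin_to_first_page(markdown: str) -> str:
    """Append '#page=1' to PDF wikilinks that do not already name a page."""

    def add_fragment(match: re.Match) -> str:
        target = match.group(1)
        alias = match.group(2) or ""  # includes the leading '|' when present
        return f"[[{target}#page=1{alias}]]"

    return PDF_LINK.sub(add_fragment, markdown)


if __name__ == "__main__":
    note = "See [[somedoc.pdf]] and [[somedoc.pdf#page=150]]."
    print(pin_to_first_page(note))
```

Links that already carry a page fragment, and links to non-PDF notes, are left untouched.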

Based purely on observation, it seems likely that the tool the Obsidian team is using is capable of much better performance, since using it directly results in much less CPU usage and markedly faster movement through a file.

At a guess, based on watching CPU and memory usage, the Obsidian team has implemented this tool in such a way that they are pre-rendering every single page between the current position in the file and the page being moved to, instead of simply moving to the appropriate offset and then rendering the page there. Just a guess, but it fits the observed behavior. If so, it’s a clumsy implementation and explains why the operation blocks the entire Obsidian UI for much of the time it is occurring.
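To make that guess concrete, here is a toy Python model of the two strategies. It is purely illustrative: FakeViewer and its methods are hypothetical names of mine, nothing here is taken from Obsidian or from the PDF tool it uses, and the "render" step just increments a counter.

```python
class FakeViewer:
    """Toy stand-in for a PDF viewer that counts how many pages it renders."""

    def __init__(self, page_count: int) -> None:
        self.page_count = page_count
        self.current = 1   # page currently shown (1-based)
        self.renders = 0   # number of render calls so far

    def _render(self, page: int) -> None:
        # A real viewer would rasterise the page here; we only count the call.
        self.renders += 1

    def _check(self, target: int) -> None:
        if not 1 <= target <= self.page_count:
            raise ValueError(f"page {target} is out of range")

    def scroll_rendering_everything(self, target: int) -> None:
        """The suspected behaviour: render every page from current to target."""
        self._check(target)
        step = 1 if target >= self.current else -1
        for page in range(self.current, target + step, step):
            self._render(page)
        self.current = target

    def jump_and_render(self, target: int) -> None:
        """The expected behaviour: move to the target and render only it."""
        self._check(target)
        self._render(target)
        self.current = target


if __name__ == "__main__":
    slow, fast = FakeViewer(500), FakeViewer(500)
    slow.scroll_rendering_everything(150)
    fast.jump_and_render(150)
    print(slow.renders, fast.renders)
```

In this model, moving from page 1 to page 150 costs 150 render calls under the first strategy and 1 under the second, which is the kind of gap that would grow with the requested page number, as observed.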

We are all limited by our tools and our abilities to use them, I guess. In my case, I will have to find a different tool or solution to handling PDFs, since they are the source material I need to work with and I do not have control over them and cannot make any changes to them since they are legal documents.

It’s a pity, but I think you were correct to end up choosing to no longer use Obsidian for PDFs.

