Managing Large Files and External Resources in Obsidian Vaults

Dear Obsidian Forum,

Obsidian has become the central piece of my productivity system in just over a year. I strongly believe in the power of having a well-organized information bank as part of your daily life, and Obsidian brilliantly tackles most of the obstacles we face when trying to put this into practice.

One of the few problems I think remains unsolved — and feel free to correct me if I’m wrong — is managing non-Markdown files. Obsidian, in its current state, whether you pay for Sync or not, is a fantastic tool for managing “lightweight content,” which, in my view, includes general text notes with some reasonable exceptions.

That said, I’m sure many of us have come across the problem of integrating what we could call “resources” into the ecosystem. Many artists work with images, of varying resolutions; many researchers have large PDF collections to manage, and the list goes on.

Including such files as part of the vault creates two main problems. First, there’s an issue with the system’s execution time. But the main issue, by far, is storage management. Obsidian Sync is clearly limited in that respect, and any third-party external solution is insecure and dependent on those same third parties. That, in my opinion, goes against the very philosophy of Obsidian, which is about giving users the choice to depend or not depend on external services, regardless of any financial aspect.

I understand that, up to now, the ecosystem has been built for note management, so these issues are expected. But I also believe we can’t ignore the fact that when building a flexible personal knowledge manager, resources are essential elements for many people’s routines and workflows, even though they may have a different nature. I firmly believe they should be able to integrate seamlessly. Just the other day, I read a post from @Kepano on LinkedIn asking what features we miss most, and after thinking about it for a few days, I believe this is one of the major gaps that remain to be addressed.

A well-organized cloud is useful, but it feels inorganic. It requires a lot of configuration to integrate properly with the vault, and if, for any reason (financial or otherwise), you want to migrate, you’re left with the hassle of reconfiguring. I see it as a very fundamental piece of the puzzle, but not in the best way.

After some thought, the ideal solution for me would enable two key points: first, indexing large resources (like simple links) within the normal vault elements, and second, providing flexible connection to any external storage source, including local or virtual hard drives. I think the idea proposed in this post is very much in this direction.

At the end of the day, what’s the issue with having a second vault for resources that integrates smoothly? A simple “link” with access to a main root could allow connections to a Drive repository, GitHub, or any cloud provider, just as easily as it would to any local or online source, including a NAS. Migrations would also be easy if the hierarchy is preserved; all it would take is swapping the root link to the new provider or new local location.
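To make the "swap the root link" idea concrete, here is a minimal sketch. It is not an existing Obsidian feature or plugin; the `ROOTS` table and the `resolve` function are hypothetical names. Notes would store only a relative path, and one configurable root decides where that path actually lives.

```python
from pathlib import PurePosixPath

# Hypothetical roots: in the proposal, the vault would store exactly one of these.
ROOTS = {
    "local": "/mnt/resources",
    "nas": "smb://nas.local/resources",
}


def resolve(relative_path: str, root: str) -> str:
    """Join a relative resource path (as stored in a note) onto the active root."""
    rel = PurePosixPath(relative_path)
    # Refuse absolute paths and parent references so a link cannot escape the root.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"not a relative resource path: {relative_path}")
    return root.rstrip("/") + "/" + rel.as_posix()


# The same stored link resolves against either provider.
print(resolve("papers/2024/smith.pdf", ROOTS["local"]))
print(resolve("papers/2024/smith.pdf", ROOTS["nas"]))
```

As long as the hierarchy under the root is preserved, a migration only changes the root string; none of the links inside the notes need to be touched.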

Some extra features would definitely be useful, such as enabling or disabling indexing or search in supported file types like PDFs based on certain directory criteria, or having a “ghost file” that holds the hierarchy of indexed resources so that when there’s no connection, you can still see they exist and can be accessed once the connection is restored.
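The "ghost file" could be as simple as a small manifest that records the hierarchy while the storage is reachable. The sketch below is only an illustration of that idea, assuming the external storage is mounted as a normal folder; the function names and the JSON layout are made up for this example.

```python
import json
from pathlib import Path


def build_manifest(root: Path) -> list[dict]:
    """List every file under root with its relative path and size in bytes."""
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            entries.append({
                "path": path.relative_to(root).as_posix(),
                "size": path.stat().st_size,
            })
    return entries


def save_manifest(root: Path, manifest_file: Path) -> None:
    """Write the manifest while the resource root is connected."""
    manifest_file.write_text(
        json.dumps(build_manifest(root), indent=2), encoding="utf-8"
    )


def load_manifest(manifest_file: Path) -> list[dict]:
    """Read the manifest back; this works even when the root is offline."""
    return json.loads(manifest_file.read_text(encoding="utf-8"))
```

The manifest itself is tiny text, so it could live in the main vault and sync like any other note, while the actual files stay wherever the root points.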

I don’t think this is the case yet, but is there any plugin that allows managing this kind of system with minimal friction? And, most importantly, how do you all handle this issue if you’ve encountered it?

A big shout-out to the whole community — wishing you all a very happy New Year!
Take care,
Jose


I have been working on something like this for my own use case. Namely, I’ve been working towards an integrated Zotero-Obsidian workflow that relies on Obsidian’s robust search engine, so that any topic can be easily investigated across our own collected material.

Objectives:

  • OCR PDF files comprising scanned images with professional software such as ABBYY FineReader (as the Zotero OCR plugin I tried is not up to par).
  • Have all Zotero storage and externally linked publications pulled into a dedicated secondary Obsidian vault (all files will be linked to a primary PKM vault via Obsidian URI links).
    • A Python script reads and sanitizes the Zotero-exported Better BibTeX JSON, extracts thumbnails of PDFs, populates each Markdown file with metadata (Obsidian properties) based on the JSON item properties, and copies part or all of Zotero’s full-text index into a helper file (so the master file stays uncluttered and annotations made with the Zotero Integration plugin can easily be added to it).
  • Obsidian core search will handle proximity search with regex (/searchterm.*?searchterm2/ or /searchterm[\s\S]*?searchterm2/), finding the two strings within the same paragraph or across multiple paragraphs, respectively. This search functionality is impossible in Zotero.
  • DataCore (beta plugin) table with live filtering (preview here).
  • A plus, though it may not yet be feasible: optionally edit the metadata in cells of said DC table and sync the changes back to Zotero via its API or by hacking some existing plugin.
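To show the difference between the two proximity patterns from the list above, here is a small Python sketch. Python's `re` module treats `.` and `[\s\S]` the same way as the JavaScript regex engine for these patterns (`.` stops at a line break, `[\s\S]` does not). The `proximity_hits` helper and the sample text are invented for illustration.

```python
import re


def proximity_hits(text: str, first: str, second: str, cross_lines: bool = False) -> list[str]:
    """Return the shortest spans that start with `first` and end with `second`."""
    # ".*?" cannot cross a line break; "[\s\S]*?" can.
    gap = r"[\s\S]*?" if cross_lines else r".*?"
    pattern = re.compile(re.escape(first) + gap + re.escape(second))
    return [match.group(0) for match in pattern.finditer(text)]


sample = "alpha appears here, then beta.\n\ngamma is alone.\n\nalpha again\n\nfinally beta"

# Only the first pair sits on one line.
print(proximity_hits(sample, "alpha", "beta"))
# Both pairs are found once the gap may span line breaks.
print(proximity_hits(sample, "alpha", "beta", cross_lines=True))
```

The lazy `*?` matters here: a greedy `*` would stretch each match to the last occurrence of the second term instead of the nearest one.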

Tools:

  • Obsidian, with plugins like Datacore and PDF++
  • Zotero for annotating, if one doesn’t want to use PDF++, and of course mainly for source material management and for very fast indexing of thousands of files; the index files are plain text, so they are easy to copy to Obsidian
  • VSCode Editor or Cursor AI Editor with built-in AI coding help

Coding upkeep:

Any new fiddling in Zotero and the resulting automatic BetterBibTex exports can break things, so the Python script needs to be updated now and again, squashing new bugs. This is easily done with Cursor AI Editor’s built-in Composer functionality, but still needs user interaction and upkeep.
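For anyone curious what the core of such a script might look like, here is a stripped-down sketch and not my actual script. It assumes the export has a top-level "items" list whose entries carry "title", "date", "citationKey", "itemType" and "creators" fields; check those names against your own export. Reading every field with `.get()` is one way to keep the script from falling over when an export changes shape.

```python
import json
import re
from pathlib import Path


def safe_name(text: str) -> str:
    """Replace characters that are awkward in file names or Obsidian links."""
    return re.sub(r'[\\/:*?"<>|#^\[\]]', "_", text).strip() or "untitled"


def item_to_markdown(item: dict) -> str:
    """Render one exported item as a note that contains only YAML properties."""
    def quote(value) -> str:
        # A JSON string is also a valid double-quoted YAML scalar.
        return json.dumps(value, ensure_ascii=False)

    authors = []
    for creator in item.get("creators", []):
        parts = (creator.get("firstName"), creator.get("lastName"))
        name = " ".join(p for p in parts if p) or creator.get("name", "")
        if name:
            authors.append(name)

    lines = [
        "---",
        f"title: {quote(item.get('title', ''))}",
        f"citekey: {quote(item.get('citationKey', ''))}",
        f"date: {quote(item.get('date', ''))}",
    ]
    if authors:
        lines.append("authors:")
        lines.extend(f"  - {quote(name)}" for name in authors)
    else:
        lines.append("authors: []")
    lines.append("---")
    return "\n".join(lines) + "\n"


def export_notes(json_path: Path, vault_dir: Path) -> int:
    """Write one note per regular item and return how many were written."""
    data = json.loads(json_path.read_text(encoding="utf-8"))
    vault_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for item in data.get("items", []):
        # Assumption: standalone attachments and notes should not get their own file.
        if item.get("itemType") in ("attachment", "note"):
            continue
        name = safe_name(item.get("citationKey") or item.get("title") or "untitled")
        (vault_dir / f"{name}.md").write_text(item_to_markdown(item), encoding="utf-8")
        written += 1
    return written
```

Thumbnails and the full-text helper files are left out here; they would hang off the same loop.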


With Obsidian and external tools, using Python and some JavaScript here and there, one can build really cool things within a couple of days, or possibly weeks, on top of Obsidian, which will always remain a safe base to build upon and which cannot keep up with users’ various needs in the short run anyway.

Regarding the large files and syncing them:
Zotero can be self-hosted or paid for (to sync attachments as well as data to zotero.org). Obsidian can be used with git, and PDF files, whether linked in Zotero or not associated with Zotero at all, can easily be symlinked into the Obsidian vault.
The secondary vault can then be synced to GitHub without the PDFs, which stay at their external hard drive locations.
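A rough sketch of the symlink-plus-git arrangement, assuming the PDFs sit in one source folder and should appear in a vault subfolder; the "linked-pdfs" name and both function names are placeholders. On Windows, creating symlinks may require Developer Mode or an elevated prompt.

```python
from pathlib import Path


def link_pdfs(source_dir: Path, vault_dir: Path, subfolder: str = "linked-pdfs") -> list[Path]:
    """Create one symlink in the vault for every PDF found under source_dir."""
    target_dir = vault_dir / subfolder
    target_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for pdf in sorted(source_dir.rglob("*.pdf")):
        link = target_dir / pdf.name
        # Skip names that are already taken (this also skips name collisions).
        if link.exists() or link.is_symlink():
            continue
        link.symlink_to(pdf.resolve())
        created.append(link)
    return created


def ignore_in_git(vault_dir: Path, subfolder: str = "linked-pdfs") -> None:
    """Add the link folder to .gitignore so git pushes notes only."""
    gitignore = vault_dir / ".gitignore"
    entry = f"{subfolder}/"
    existing = gitignore.read_text(encoding="utf-8").splitlines() if gitignore.exists() else []
    if entry not in existing:
        gitignore.write_text("\n".join(existing + [entry]) + "\n", encoding="utf-8")
```

With the folder ignored, the repository stays small, and the links can be recreated on another machine by running the same function against that machine's drive.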

This is my current journey now!!! I am looking for a way to create a Digital Asset Management (DAM) system. I am in the process of rebuilding my Zotero library, and it occurred to me that now is the time to find something better. I am on Windows 11, though I also run WSL2 Linux on my machine, and I have been on the lookout for an alternative to something like DEVONthink on the Mac. Oddly, the combo I keep coming back to is Obsidian paired with job-appropriate software like Zotero for PDFs and Eagle for images, though there are some open-source DAM products that I want to look into, like ResourceSpace.

I am frustrated with (Zotero). Don’t get me wrong, the software is great, but there is an ego with many that is off-putting in their belief that their way is the correct and only way. Little changes and improvements would go a long way to making the tool great for handling more than just citations as the software currently handles so many file types.

I would also like a way to handle resource images that I find for my art. Apps like (Eagle) and (PureRef) are great for taking images and such and making notes on my ideas so I don’t forget.

I know that much, if not all, of this can be handled within Obsidian through a combo of plugins, templates, and a well-defined structure, but that is not the point of Obsidian to me (for any woodworkers out there, just think about the Shopsmith line of woodworking machines versus dedicated tools and machines). I am already using Obsidian with a ton of plugins and occasionally find one or two here and there that are losing support from the developer. I am pretty sure that eventually I will head back down the slope to the minimal plugin setup or freeze my system at a point I like and not upgrade (free for you forever type thing).

Hearing that others have the same sort of need is encouraging. I hope others here will weigh in with their thoughts on such a thing, and I will endeavor to update this post with what I find.

That is not true, actually. It didn’t work out, at least for me.
You can actually use regex for your attachment content if you have indexed your files beforehand:

Searches here didn’t work for me, as the Zotero process hit 10GB in RAM usage and another 30GB was used for virtual memory and still no output.

What one can also do to query Zotero and non-Zotero items (on an external hard drive, for instance) is to have AnyTXT Searcher (OS: Windows) index every folder where PDFs and other items are located (.mobi, .mmap, etc.) and use that program.
But again, that program’s GUI can also go ‘Not responding’ on you.
Looks like no recent development has been done on this program, either…

Luckily, AnyTXT Searcher has an API (beta, but works) which we can leverage with a Python script.
Claude.AI’s latest model will be able to handle that (just make sure you build in some trade-offs, such as limits, so you don’t have to wait too long when 100+ files come back with generated content).

So now I have the following setup:
All my Zotero PDFs (52GB worth) are indexed, with the index txt files copied into md files in a vault, where each index file is linked to its master md file.
That vault can be queried with near-instant results.

All my Zotero plus other non-Zotero items (140GB) are indexed with AnyTXT Searcher and queryable through its API via a Python script. Its output, when copied into a markdown file in Obsidian, has a zotero link to click if the file is in my Zotero storage, or just a path on the external drive if no Zotero link is available.
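The link-or-path step can be sketched as below. This is only the formatting half; the AnyTXT Searcher API call is left out, and the function simply takes the file paths that a search returned. It assumes Zotero's usual storage layout, where each stored attachment sits in a folder named after its 8-character item key, and uses the `zotero://select` link form; `hit_to_markdown` is a hypothetical name.

```python
import re

# Assumption: stored attachments live at .../storage/<8-character key>/<file name>.
STORAGE_KEY = re.compile(r"/storage/([A-Z0-9]{8})/[^/]+$")


def hit_to_markdown(path: str) -> str:
    """Turn one search hit into a markdown list line for an Obsidian note."""
    normalized = path.replace("\\", "/")  # accept Windows-style paths
    name = normalized.rsplit("/", 1)[-1]
    match = STORAGE_KEY.search(normalized)
    if match:
        # The file is inside Zotero storage: link to the attachment item.
        return f"- [{name}](zotero://select/library/items/{match.group(1)})"
    # Not a Zotero file: fall back to the plain path on the drive.
    return f"- {name}: `{path}`"


hits = [
    r"C:\Users\me\Zotero\storage\ABCD1234\paper.pdf",
    "E:/books/other.pdf",
]
print("\n".join(hit_to_markdown(hit) for hit in hits))
```

Files that Zotero only links to (rather than stores) would not match the pattern and fall through to the plain path.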

This setup is far better than anything I ever envisioned having, and certainly better than what I have seen offered on the forum (DocFetcher-related stuff).

EDIT.
Even better (more cosily achieved) results via the Interactivity plugin: