Managing Large Files and External Resources in Obsidian Vaults

josevzs · December 26, 2024, 11:18pm

Dear Obsidian Forum,

Obsidian has become the central piece of my productivity system in just over a year. I strongly believe in the power of having a well-organized information bank as part of your daily life, and Obsidian brilliantly tackles most of the obstacles we face when trying to put this into practice.

One of the few problems I think remains unsolved — and feel free to correct me if I’m wrong — is managing non-Markdown files. Obsidian, in its current state, whether you pay for Sync or not, is a fantastic tool for managing “lightweight content,” which, in my view, includes general text notes with some reasonable exceptions.

That said, I’m sure many of us have come across the problem of integrating what we could call “resources” into the ecosystem. Many artists work with images, of varying resolutions; many researchers have large PDF collections to manage, and the list goes on.

Including such files as part of the vault creates mainly two problems: first, there’s an issue with the system’s execution time. But the main issue, by far, is storage management. Obsidian Sync is clearly limited in that respect, and any third-party external solution is insecure and dependent on those same third parties. That, in my opinion, goes against the very philosophy of Obsidian, which is about giving users the choice to depend or not depend on external services — regardless of any financial aspect.

I understand that, up to now, the ecosystem has been built for note management, so these issues are expected. But I also believe we can’t ignore the fact that when building a flexible personal knowledge manager, resources are essential elements for many people’s routines and workflows, even though they may have a different nature. I firmly believe they should be able to integrate seamlessly. Just the other day, I read a post from @Kepano on LinkedIn asking what features we miss most, and after thinking about it for a few days, I believe this is one of the major gaps that remain to be addressed.

A well-organized cloud is useful, but it feels inorganic. It requires a lot of configuration to integrate properly with the vault, and if, for any reason (financial or otherwise), you want to migrate, you’re left with the hassle of reconfiguring. I see it as a very fundamental piece of the puzzle, but not in the best way.

After some thought, the ideal solution for me would enable two key points: first, indexing large resources (like simple links) within the normal vault elements, and second, providing flexible connection to any external storage source, including local or virtual hard drives. I think the idea proposed in this post is very much in this direction.

At the end of the day, what’s the issue with having a second vault for resources that integrates smoothly? A simple “link” with access to a main root could allow connections to a Drive repository, GitHub, or any cloud provider, just as easily as it would to any local or online source, including a NAS. Migrations would also be easy if the hierarchy is preserved; all it would take is swapping the root link to the new provider or new local location.
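The "swap the root link" idea can be sketched in a few lines of Python. This is only an illustration of the concept, not an existing Obsidian feature or plugin: the function name `resolve` and the example roots are hypothetical. Notes would store only paths relative to the resource root, and a single setting would decide where that root currently lives.

```python
from urllib.parse import quote


def resolve(relative_path: str, root: str) -> str:
    """Join a resource path stored in a note with the current root link.

    `root` may be any URI prefix (file://, https://, ...). Only the
    relative part is percent-encoded, so spaces in file names survive.
    """
    rel = relative_path.replace("\\", "/").lstrip("/")
    return root.rstrip("/") + "/" + quote(rel, safe="/")


# Hypothetical roots: migrating means changing this one value only.
LOCAL_ROOT = "file:///mnt/nas/resources"
CLOUD_ROOT = "https://example.com/resources/"

print(resolve("art/ref sheet.png", LOCAL_ROOT))
print(resolve("art/ref sheet.png", CLOUD_ROOT))
```

As long as the folder hierarchy under the root is preserved, every stored relative path keeps resolving after a migration.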

Some extra features would definitely be useful, such as enabling or disabling indexing or search in supported file types like PDFs based on certain directory criteria, or having a “ghost file” that holds the hierarchy of indexed resources so that when there’s no connection, you can still see they exist and can be accessed once the connection is restored.
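A rough sketch of the "ghost file" idea, assuming it is simply a JSON manifest of the external hierarchy kept inside the vault. The function names and the manifest layout are hypothetical; nothing here is an existing plugin API.

```python
import json
from pathlib import Path


def build_manifest(root: Path) -> list:
    """List every file under the external root with its relative path and size."""
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            entries.append(
                {"path": path.relative_to(root).as_posix(), "bytes": path.stat().st_size}
            )
    return entries


def write_manifest(root: Path, manifest_file: Path) -> None:
    """Refresh the ghost file while the external source is connected."""
    manifest_file.write_text(json.dumps(build_manifest(root), indent=2), encoding="utf-8")


def list_resources(root: Path, manifest_file: Path):
    """Return (paths, online). Falls back to the ghost file when the root is unreachable."""
    if root.is_dir():
        return [e["path"] for e in build_manifest(root)], True
    cached = json.loads(manifest_file.read_text(encoding="utf-8"))
    return [e["path"] for e in cached], False
```

With the drive unplugged, `list_resources` still reports which files exist, flagged as offline, so notes can show them as known but currently unreachable.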

I don’t think this is the case yet, but is there any plugin that allows managing this kind of system with minimal friction? And, most importantly, how do you all handle this issue if you’ve encountered it?

A big shout-out to the whole community — wishing you all a very happy New Year!
Take care,
Jose

Yurcee · January 4, 2025, 11:08am

I have been working on something like this for my own use case. Namely, I’ve been working towards an integrated Zotero-Obsidian workflow relying on Obsidian’s robust search engine so any topic is easily investigated in our dedicated material.

Objectives:

OCR PDF files comprising scanned images with professional software such as ABBYY FineReader (as the Zotero OCR plugin I tried is not up to par).
Have all Zotero storage and externally linked publications pulled into a dedicated secondary Obsidian vault (all files will be linked to a primary PKM vault via Obsidian URI links).
- A Python script reads (and sanitizes) the Better BibTeX JSON exported from Zotero, extracts thumbnails of the PDFs, populates each markdown file with metadata (Obsidian properties) based on the JSON item properties, and copies part or all of Zotero’s full-text index into a helper file (so the master file stays uncluttered and annotations made with the Zotero Integration plugin can easily be added to it).
Obsidian core search will handle proximity search with regex (/searchterm.*?searchterm2/ or /searchterm[\s\S]*?searchterm2/), finding the two strings within the same paragraph or across multiple paragraphs, respectively. This search functionality is impossible in Zotero.
Datacore (beta plugin) table with live filtering (preview here).
A bonus that may not yet be feasible: optionally edit the metadata in cells of said Datacore table and sync the changes back to Zotero via its API or by hacking some existing plugin.
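To see why the two regex forms behave differently, here is a small Python demonstration. Python's `re` module is used only as a stand-in (Obsidian's search runs on JavaScript regex), but for these patterns the relevant behaviour is the same: `.` does not match a line break, while `[\s\S]` matches any character. The sample text and the bounded variant at the end are made up for illustration.

```python
import re

note = (
    "alpha appears here with beta nearby.\n"
    "\n"
    "A later paragraph mentions gamma."
)

# Same paragraph only: "." stops at the line break.
same_paragraph = re.compile(r"alpha.*?beta")
# Across paragraphs: [\s\S] also matches newlines.
across_paragraphs = re.compile(r"alpha[\s\S]*?gamma")
# Hypothetical refinement: cap the distance at 200 characters.
within_200_chars = re.compile(r"alpha[\s\S]{0,200}?gamma")

print(bool(same_paragraph.search(note)))             # terms on one line
print(bool(re.search(r"alpha.*?gamma", note)))       # "." cannot cross the break
print(bool(across_paragraphs.search(note)))          # crosses the blank line
print(bool(within_200_chars.search(note)))           # still within the window
```

In Obsidian's search box the same patterns would be wrapped in forward slashes.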

Tools:

Obsidian, with plugins like Datacore and PDF++
Zotero, for annotating if one doesn’t want to use PDF++ and, of course, mainly for source material management and for indexing thousands of files very quickly; its plain-text index files are easy to copy to Obsidian
VSCode Editor or Cursor AI Editor with built-in AI coding help

Coding upkeep:

Any new fiddling in Zotero and the resulting automatic BetterBibTex exports can break things, so the Python script needs to be updated now and again, squashing new bugs. This is easily done with Cursor AI Editor’s built-in Composer functionality, but still needs user interaction and upkeep.

With Obsidian and external tools, using Python and some JavaScript here and there, one can build really cool things within a couple of days, or possibly weeks, on top of Obsidian, which will always remain a safe base to build upon and which cannot keep up with users’ various needs in the short run anyway.

Regarding the large files and syncing them:
Zotero sync can be self-hosted or paid for (to sync attachments as well as data to zotero.org), Obsidian can be used with git, and PDF files, whether linked in Zotero or not associated with Zotero at all, can easily be symlinked into the Obsidian vault.
The secondary vault can be synced to GitHub without the PDFs, which stay in said external hard drive locations.

Spark706 · January 7, 2025, 2:56pm

This is my current journey now!!! I am looking for a way to create a Digital Asset Management (DAM) system. I am in the process of rebuilding my Zotero library, and it occurred to me that now is the time to find something better. I am on Windows 11, though I also run WSL2 Linux on my machine, and have been on the lookout for an alternative to something like the Mac’s (Devonthink). Oddly, the combo I keep coming back to is Obsidian paired with job-appropriate software like Zotero for PDFs and Eagle for images, though there are some open-source DAM products that I want to look into, like (ResourceSpace).

I am frustrated with (Zotero). Don’t get me wrong, the software is great, but there is an ego with many that is off-putting in their belief that their way is the correct and only way. Little changes and improvements would go a long way to making the tool great for handling more than just citations as the software currently handles so many file types.

I would also like a way to handle resource images that I find for my art. Apps like (Eagle) and (PureRef) are great for taking images and such and making notes on my ideas so I don’t forget.

I know that much, if not all, of this can be handled within Obsidian through a combo of plugins, templates, and a well-defined structure, but that is not the point of Obsidian to me (for any woodworkers out there, just think about the Shopsmith line of woodworking machines versus dedicated tools and machines). I am already using Obsidian with a ton of plugins and occasionally find one or two here and there that are losing support from the developer. I am pretty sure that eventually I will head back down the slope to the minimal plugin setup or freeze my system at a point I like and not upgrade (free for you forever type thing).

Hearing that others have the same sort of need is encouraging. I hope others here will weigh in with their thoughts on such a thing, and I will endeavor to update this post with what I am finding.

Yurcee · January 10, 2025, 2:52pm

That is not true, actually. It didn’t work out, at least for me.
You can actually use regex for your attachment content if you have indexed your files beforehand:

Searches here didn’t work for me, as the Zotero process hit 10GB in RAM usage and another 30GB was used for virtual memory and still no output.

What one can do also to query Zotero and non-Zotero items (on an external hard drive, for instance) is to have AnyTXT Searcher (OS: Windows) index every folder where PDFs and other items are located (.mobi, .mmap, etc.) and use that program.
But again, that program’s GUI can also go ‘Not responding’ on you.
Looks like no recent developments have been done on this program, either…

Luckily, AnyTXT Searcher has an API (beta, but works) which we can leverage with a Python script.
Claude.AI’s latest model will be able to handle that (just make sure you set compromises so you don’t have to wait too long for 100+ files with content generated).

So now I have the following setup:
All my Zotero PDFs (52GB worth) are indexed, with the index txt files copied to md files in a vault, where each index file is linked to a master md file.
That vault can be queried with near-instant results.

All my Zotero plus other non-Zotero items (140GB) are indexed with AnyTXT Searcher and queryable through its API via a Python script. Its output, when copied into a markdown file in Obsidian, contains a Zotero link to click if the file is in my Zotero storage, or just a path on the external drive if no Zotero link is available.
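The last step (Zotero link if the hit is in Zotero storage, plain path otherwise) could look roughly like the sketch below. It is not the script described above and does not touch the AnyTXT Searcher API. It assumes the usual Zotero layout where each stored attachment sits in `storage/<8-character key>/`, and that the key can be used in a `zotero://select/library/items/<key>` link for items in the personal library; the function name is hypothetical.

```python
import re

# Assumption: Zotero keeps each attachment in .../storage/<KEY>/<file>,
# where <KEY> is eight uppercase letters or digits.
STORAGE_KEY = re.compile(r"/storage/([A-Z0-9]{8})/[^/]+$")


def to_markdown_line(hit_path: str) -> str:
    """Turn one search hit into a line for an Obsidian note."""
    normalized = hit_path.replace("\\", "/")
    name = normalized.rsplit("/", 1)[-1]
    match = STORAGE_KEY.search(normalized)
    if match:
        # File lives in Zotero storage: emit a clickable Zotero link.
        return f"[{name}](zotero://select/library/items/{match.group(1)})"
    # Anything else: just show where it sits on the external drive.
    return f"`{normalized}`"


print(to_markdown_line(r"C:\Users\me\Zotero\storage\ABCD2345\paper.pdf"))
print(to_markdown_line("E:/Books/manual.pdf"))
```

File names containing square brackets would need escaping before being used as link text; that is left out here.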

This setup is far better than I have ever envisioned having, certainly better than what I have seen offered on the forum (DocFetcher related stuff).

EDIT.
Even better (more cosily achieved) results via the Interactivity plugin:

josevzs · March 25, 2025, 10:27pm

Hi everyone,
Thank you for sharing your workflow and tools. I’ll have a deeper look at Eagle (didn’t know it!), and probably will explore further Zotero possibilities.

However, I’m still looking for:
A lightweight vault, pure Markdown, synced via Obsidian Sync.
All heavy resources (media, reference material, scans, drawings) live outside the vault, ideally on external drives, cloud storage, or other software, but are still indexed, referenced and reachable from the notes in a sustainable and scalable way.

Also important for me:
I keep a minimal but meaningful folder hierarchy in my vault, which reflects contexts and domains I regularly work with.
To maintain coherence, I need my attachments to follow that same structure — or at least mirror it partially. Otherwise, things become too chaotic to track in the long term.

What I’ve already explored

I’ve tested all the known options, and none fully deliver:

Relative or absolute links: break when moving folders across devices, don’t scale well, and offer no preview or indexing.
Symlinks or folder mounts: fragile across OSes, problematic for cloud sync, and not portable.
Embed plugins (like External File Embed and Link): some promise, but previews are limited and functionality is brittle.
Zotero: great for PDFs, but too rigid and academic for broader media.
Eagle / Tropy / PureRef: excellent at what they do, but disconnected from Obsidian without serious workaround hacking.
Dataview / metadata-based indexing: not helpful when files are external, and no preview or search.
Third-party DAMs or cloud drives: difficult to maintain stable links, especially across devices or after folder reorganization.

The deeper issue

The problem is not “can I link to an external file?” — of course I can.
The problem is: can I sustainably integrate an external resource collection into my knowledge system, so it’s searchable, previewable (even partially), and portable — without bloating or breaking the vault, and without losing the structural coherence I need across domains?

Where we’re stuck

After all this, it seems there is still no viable, robust workflow for this hybrid model — a lean Obsidian vault with properly linked and semi-integrated external resources.

If anyone out there has figured out a truly sustainable setup, I’d love to hear about it.

Thanks again!

Yurcee · March 26, 2025, 12:31am

Use one external hard drive or you can have 2 with the same structure, one for backup.
Symlinks: You don’t need to symlink multiple items, only the top folder. So whatever you do beneath the top level doesn’t matter. You can move things around or add subfolders, whatever.
- I’m not sure what Obsidian does when you do this frequently, though. It’s possible that it will have to reindex something and there will be some freezes on start-up. Otherwise, no issues.
Multiple OS’s: You add one symlink of the top-level folder on all OS’s. Done. I do it for Windows and Linux. On iPad I have limited space, so only sync one third of my stuff, which would still last me a few lifetimes, at the rate I’m going…
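A minimal sketch of the "one symlink for the top folder" approach in Python. The function name and the default link name `resources` are hypothetical. On Windows, creating symlinks usually requires Developer Mode or an elevated prompt.

```python
import os
from pathlib import Path


def link_resources(vault: Path, external_root: Path, link_name: str = "resources") -> Path:
    """Create (or refresh) a single symlink inside the vault pointing at the external top folder."""
    link = vault / link_name
    if link.is_symlink():
        if link.resolve() == external_root.resolve():
            return link  # already points at the right place
        link.unlink()  # points elsewhere, e.g. after switching drives
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    os.symlink(external_root, link, target_is_directory=True)
    return link
```

Running it once per machine, with that machine's own mount point as `external_root`, gives every OS the same in-vault path, so links written on one device keep working on the others.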

I don’t keep everything in Zotero. I upload vids to YouTube, which automatically transcribes them, then I keep the full transcription and the AI-made summary.

What you do inside the vault doesn’t matter. As long as the symlinked items are there, you can hide those folders with File Explorer++, doesn’t matter. You’ll link to the content anyway. It’s like they are inside the vault but they are not.

I don’t use anything but git for syncing at the moment. Symlinked items are not uploaded to the remote. I’m not sure how cloud sync providers fare with them. I’d re-think syncing options then.

You have a Python script written that generates and updates md files for your attachments, along with thumbnails and metadata. Then you can query them. Every day or week you re-run the script to update stuff.
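A stripped-down sketch of such a script, under stated assumptions: it only writes one stub note per attachment with a few Obsidian properties, mirrors the resource folder hierarchy inside the vault, and leaves out thumbnails entirely. The function name, the property names, and the `.pdf.md` naming are hypothetical choices, not the actual script.

```python
from pathlib import Path


def write_stubs(resource_root: Path, notes_dir: Path, extensions=(".pdf",)) -> int:
    """Create one markdown stub per attachment; return how many were created."""
    created = 0
    for f in sorted(resource_root.rglob("*")):
        if not f.is_file() or f.suffix.lower() not in extensions:
            continue
        rel = f.relative_to(resource_root)
        # Mirror the external hierarchy; keep the original extension in the
        # note name so "a.pdf" and "a.epub" do not collide.
        note = notes_dir / rel.parent / (rel.name + ".md")
        if note.exists():
            continue  # never overwrite a stub the user may have edited
        note.parent.mkdir(parents=True, exist_ok=True)
        note.write_text(
            "---\n"
            f'source: "{f.resolve().as_uri()}"\n'
            f"extension: {f.suffix.lower().lstrip('.')}\n"
            f"size_bytes: {f.stat().st_size}\n"
            "---\n",
            encoding="utf-8",
        )
        created += 1
    return created
```

Because existing stubs are skipped, the script can be re-run daily or weekly and will only add notes for new files.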

And mostly, take this advice: if you want to set something up, you need to let some things go. You will only pick up ideas to improve your setup along the way.

Hi.LikeKite · April 1, 2025, 9:32pm

Good to read your thoughts as I get to the point of implementing Obsidian… after 6 months of research/testing whilst getting used to Zotero for law. I’ve come to the same conclusion re a DAM (or similar) acting as an “Attachment manager” of sorts from and into various sources. So I’m wondering if you have identified anything that may fit the bill?

This comes after a significant review of the options out there today, which has raised more questions than it answered. Quite a few of the good fits I found were taken over by cloud providers, which removed the precise features I want from their desktop/device apps. However, I REALLY like the direction in which TagStudio (not TagSpace) is heading, and that will surely become an Obsidian-integrated app due to the main man Travis starting the dev from this POV… but the project could still be a year from a stable release on iOS and Windows (a now-hidden Svelte-based repo seemed to be targeting an iOS app and Windows).

Today’s deeper delve into lightweight local/offline DAM-like apps slowly turned into customising a self-hosted CMS - which seems to be what most of the open-source versions are based upon anyway… at least I’d choose a headless SQLite wrapper instead of Drupal or similar… but I’ve not seen a lightweight, locally cached “DAM”, which returns me to thinking, like @Yurcee, about a solution based upon searching Obsidian notes with links or embedded files, which would certainly force a commitment to producing all my content for other apps from within Obsidian. Anyway, it’s interesting how cloud providers took over three of the most popular DAMs in their respective specialised areas around 2018-2020: Nuxeo stopped development on their iPad/iPhone sync apps, Razuna stopped committing changes to their open-source repo, and Kyno’s buyers seemed to drop the app’s dev.

The way I see it, if we are cataloguing/tagging assets, we need to capitalise on that effort from within other applications, and have DAM features like EXIF metadata and target-focused resized/cropped versions being automatically prepared for exports from Obsidian (pandoc longform/email-html/quarto doc) as well as for other processes such as within page templates of a headless CMS, email client, social media workflow, etc.

After years of supporting failing sync systems, I’m also keen on keeping Obsidian data as compact, simple and responsive as possible, so I’ll stay away from symlinks, and I’d like the indexing/cataloguing/tagging of heavy documents to remain outside of it. Of course, accessing a search of that index/tags from within Obsidian for an attachment would be ideal, but inserting an attachment from the DAM would be fine… if I could find one suitable.

BroadwayTower · April 11, 2025, 10:37pm

I have developed a tentative solution to this problem which I’ll describe below.

I am working on a Mac using DevonThink and several AI tools, and am attempting to set up a system in which I maintain a lean Obsidian Vault and have large pdf and other files stored separately but linked from .md files, such as DevLogs, that reside in the vault.

I found that the Obsidian Vault should be in its own folder with nothing else in that folder. So I have a structure like this …/Main/Obsidian with the Obsidian Vault in the folder called “Obsidian.” Large research files are in Main/Research with a few sub-folders under Research to separate out some major branches of my research.

I experimented with relative links, but found that Obsidian would not properly read a relative link to a folder outside the vault.

ChatGPT suggested symlinks, but Obsidian Help strongly advises against symlinks.

I have been avoiding absolute links because I may want to move Main to a new computer in the future and all the paths of absolute links would need tedious (or impossible) updating.

So now I am planning to go ahead with absolute links but with all of this data stored on a dedicated portable hard drive named Z100. So the file path will be Z100/Project/Main/Obsidian (for the location of the Obsidian Vault). The file path for the large research files will be Z100/Project/Main/Research/Subfolder-One.

For an overview of everything I’ll use DevonThink and the lateral creativity of its See Also feature. The file path for the DevonThink database will be Z100/Project/DevonThink. The DevonThink database itself will be left completely empty. Instead, DevonThink will just index everything in Main.

If this works, I also plan to add further distinct Obsidian Vaults, each in their own folder under Main (Obsidian-A, Obsidian-B etc…)

In the future, I will/should be able to put the Project folder on a new portable hard drive and have all links still working.
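One caveat: on macOS an external drive is mounted under /Volumes/ followed by its name, so the absolute links only keep working unchanged if the new drive is also named Z100. If the name or location does change, a small script can rewrite the prefix in every note. The sketch below is a plain find-and-replace over .md files; the function name and the new drive name Z200 in the usage note are hypothetical.

```python
from pathlib import Path


def relink(vault: Path, old_prefix: str, new_prefix: str) -> int:
    """Replace old_prefix with new_prefix in every .md file; return the number of files changed."""
    changed = 0
    for note in vault.rglob("*.md"):
        text = note.read_text(encoding="utf-8")
        if old_prefix in text:
            note.write_text(text.replace(old_prefix, new_prefix), encoding="utf-8")
            changed += 1
    return changed
```

For example, `relink(vault, "file:///Volumes/Z100/Project/", "file:///Volumes/Z200/Project/")` would update all links after a move to a differently named drive. Running it on a copy of the vault first is a sensible precaution.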

BroadwayTower · April 12, 2025, 1:50am

External link is working.

On a Mac, with an external hard drive named Z100, in a DevLog in Z100/Project/Main/Obsidian/Obsidian-Vault, I have a link in the format file:///Volumes/Z100/Project/Main/Research/Subfolder-1/research-file.pdf. Clicking this link opens the pdf in Adobe Acrobat Pro.

When creating the link, using a three-button mouse, I press Option + Right-click when selecting research-file.pdf in Subfolder-1. Pasting this into the DevLog looks like this: /Volumes/Z100/Project/Main/Research/Subfolder-1/research-file.pdf. All I have to do to make it a working link is to prepend file://
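Prepending file:// by hand works for paths like the one above, but paths containing spaces or other special characters need percent-encoding. A small Python helper can do both; the function name and the optional label (which would also hide the long path behind a short link text) are hypothetical additions.

```python
from urllib.parse import quote


def to_file_link(copied_path: str, label: str = "") -> str:
    """Turn an absolute POSIX path into a file:// URI, optionally as a markdown link."""
    if not copied_path.startswith("/"):
        raise ValueError("expected an absolute path such as /Volumes/...")
    uri = "file://" + quote(copied_path)  # "/" is kept, spaces become %20
    return f"[{label}]({uri})" if label else uri


print(to_file_link("/Volumes/Z100/Project/Main/Research/Subfolder-1/research-file.pdf"))
print(to_file_link("/Volumes/Z100/Project/Main/Research/My Notes.pdf", label="My Notes"))
```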

For now, I am leaving the whole path showing. Maybe in the future I’ll find this makes too much visual clutter, but for now I like the clarity of seeing how my file structure is set up.

BroadwayTower · April 14, 2025, 2:26pm

In the file structure I describe above, my research pdfs are just in my file system and not inside a DevonThink database. I use a DevonThink database to index this file system, but the DevonThink database itself contains no pdfs. So links from .md files from an Obsidian Vault are absolute links in the format: file:///Volumes/Z100…

Alternatively, I could have my research pdfs in a second DevonThink database and link to these from an Obsidian Vault with links in the .md files in this format: x-devonthink-item://

I like the absolute links which future-proof the setup so it is not locked into needing DevonThink.

But the x-devonthink-item:// links have a big advantage with how they work inside a DevonThink database. A pdf in DevonThink linked to a .md in Obsidian can be moved to any folder in DevonThink, the pdf name can be changed, the pdf can be duplicated or replicated (sort of like an alias) in DevonThink…

Perhaps there are other tools (AI or other) that I may want to have access to all of my .md and pdf files. Having the pdfs inside a DevonThink database may be a disadvantage in this case, unless it is a tool that interoperates with DevonThink.

It is a tricky business trying to figure out a file structure that works well in the future in this fast-changing digital space.

I think I’ll opt for only indexing in DevonThink for now and see how it goes. If I really miss how it works to have research pdfs (and other files) inside a DevonThink database, I’ll switch back to that.