Terabyte-sized, million-note vaults? How scalable is Obsidian?

What I’m trying to do

I follow an immutable-data, “store everything” approach, and I’m considering using Obsidian as an indexing front end for some of that data. The Obsidian-managed data is expected to grow by ~1 TB/year. 99% of it by size is not text but autogenerated images, packaged sensor data, etc., which Obsidian should not be concerned with beyond keeping links to them and indexing the file names; the remaining 1% is the accompanying .md files.

The goal is to have a thin layer of annotations, metadata, and autogenerated descriptions on top, organized just like my personal vault: by tags, frontmatter, folder structure, and bookmarks.

I’m going to try load testing it with fake contents this weekend (a rough sketch of the generator I have in mind is below the questions), but first I’d like to hear your opinions:

  • has anyone tried this?
  • how does having non-.md files affect Obsidian’s performance?
  • how does having many .md files affect Obsidian’s performance?
  • what edge cases should I pay attention to when load testing?
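
For reference, this is the kind of generator I’m planning to use for the fake contents. It’s a rough Node/TypeScript sketch; the counts, sizes, and folder layout are arbitrary placeholders I picked for illustration, not a recommendation:

```ts
// generate-fake-vault.ts -- populate a test vault with many small .md notes
// plus dummy binary blobs that Obsidian should only link to, never parse.
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { randomBytes } from "node:crypto";

const VAULT = "./load-test-vault";  // assumed output path
const NOTE_COUNT = 100_000;         // arbitrary: ~100k .md files
const BLOB_EVERY = 100;             // one binary blob per 100 notes
const BLOB_SIZE = 10 * 1024 * 1024; // 10 MB of dummy "sensor data" per blob

for (let i = 0; i < NOTE_COUNT; i++) {
  const folder = join(VAULT, `batch-${Math.floor(i / 1000)}`);
  mkdirSync(folder, { recursive: true });

  // A small note with frontmatter, tags, and a link to a sibling blob,
  // mimicking the 1% text / 99% binary split described above.
  const note = [
    "---",
    `run: ${i}`,
    "tags: [sensor-run, autogenerated]",
    "---",
    `Autogenerated description for run ${i}.`,
    `Data: [[blob-${Math.floor(i / BLOB_EVERY)}.bin]]`,
    "",
  ].join("\n");
  writeFileSync(join(folder, `run-${i}.md`), note);

  if (i % BLOB_EVERY === 0) {
    // Random bytes stand in for packaged sensor data / images.
    writeFileSync(join(folder, `blob-${i / BLOB_EVERY}.bin`), randomBytes(BLOB_SIZE));
  }
}
console.log(`Wrote ${NOTE_COUNT} notes into ${VAULT}`);
```

With these numbers it produces roughly 100k tiny notes plus ~1,000 dummy 10 MB blobs, i.e. a ~10 GB vault; scaling the constants up should get it to the sizes I’m worried about.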

Things I have tried

So far I’ve brought it up to 10 GB, and it’s working well.

Thanks in advance!


This is the largest benchmark I know of:

https://www.goedel.io/p/interlude-obsidian-vs-100000

If you end up testing it, please publish your results somewhere. It would be great to have them as a reference.


This is beyond what we can currently handle, and likely beyond what we will handle in the future as well. You’ll need to write your own software for this use case.


Out of curiosity, how many files does this translate to? The file size of non-Markdown files doesn’t matter much, as I understand it (except for syncing).

Thanks! This is actually very encouraging.

Yes, I’m considering that. I’m just trying to escape the hassle of incessant tuning and managing, which comes with any unbundled toolset: using ElasticSearch + MySQL + file system, for example, means a lot of connective tissue. So I’ll probably just postpone it again for a few years and wait for the tools to catch up. Basically, it’s the difference between a nice hobby and something that feels like work.

Yeah, there’s no syncing involved for that workload; I already have tools for it. I was just pondering what’s under the hood of Obsidian, and where it might screech to a halt (or sail smoothly!).

I realized today that I may be able to compress the text information that needs to be indexed by another 10x, so it may end up at >1000:1 for size and <100:1 for file count (non-.md : .md).

The file size of non-Markdown files doesn’t matter much, as I understand it (except for syncing).

Yeah, that’s what I thought too: it shouldn’t matter to Obsidian what’s in the non-.md files. That means if I can keep it under 1 GB of pure text, and Obsidian can handle that, then I’m going to be fine for at least a year.
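
Rough numbers behind that guess; every input here is an assumption on my part:

```ts
// Back-of-envelope estimate; all inputs are guesses, not measurements.
const yearlyDataBytes = 1e12;              // ~1 TB/year of non-.md payload
const sizeRatio = 1000;                    // hoped-for non-.md : .md size ratio
const mdBytesPerYear = yearlyDataBytes / sizeRatio;             // ≈ 1 GB of text
const avgNoteBytes = 10 * 1024;            // assumed ~10 KB per annotation note
const notesPerYear = Math.round(mdBytesPerYear / avgNoteBytes); // ≈ 100,000 notes
console.log({ mdGBPerYear: mdBytesPerYear / 1e9, notesPerYear });
```

So at those ratios I’d be adding on the order of 100k notes per year, which is why the 100,000-note benchmark linked above is encouraging.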

My biggest concern, I suppose, is reindexing. What events can invalidate the index? Is it sharded or global? Does it do caching, fingerprinting, Bloom filters, or something else?

Plugins are certainly going to be a problem, so unfortunately I’m not counting on Dataview for this.

I think any normal tool presented with more than a few GB of content is going to struggle. Indexing will be slow for sure, and the memory footprint of the derived metadata might prove an issue unless the system you are running it on has heaps of memory.

I’ve found that many tools don’t cope well when the dataset is that large. I work with large log files, for instance, and for anything larger than a few GB, few tools can manage; I know of only one or two editors, like SlickEdit, that can cope with a single file that size. Files like that often have to be processed in memory multiple times: once to index them, and again whenever searching, sorting, or editing is required. I imagine Obsidian will have the same issues. But it’s an interesting experiment to try, to see just what the limits are.
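
If you do run the experiment, one cheap way to see what Obsidian’s metadata cache has actually picked up is the developer console (Ctrl+Shift+I, or Cmd+Opt+I on macOS). Treat this as a rough probe of the public `app.vault` / `app.metadataCache` objects, not a supported benchmark harness:

```ts
// Paste into Obsidian's developer console once indexing has settled down.
const files = app.vault.getFiles();              // every file, .md or not
const notes = app.vault.getMarkdownFiles();      // .md files only
const cachedNotes = notes.filter(
  (f) => app.metadataCache.getFileCache(f) !== null  // notes the cache has parsed
);
const mdBytes = notes.reduce((sum, f) => sum + f.stat.size, 0);
console.log({
  totalFiles: files.length,
  markdownFiles: notes.length,
  cachedNotes: cachedNotes.length,
  markdownMB: Math.round(mdBytes / 1e6),
});
```

Timing how long it takes for those numbers to stop moving after the vault is opened should give a rough feel for how re-indexing scales.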


Yes, you are correct, thank you for your perspective!

What I’ve experienced, though, is that the combination of ever more powerful machines and an army of smart toolmakers keeps making it possible to do things with less and less effort. What would have needed a Hadoop cluster job 10 years ago now happily hums along on an M1 laptop.

So that’s what I’m chasing 🙂. Every year or so I reevaluate the available tools to see what I can do this year as toy projects. Fun things come in smaller and smaller packages.

Did you do the experiment, and if so, what were the results?

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.