Building a 9,000+ PDF database to integrate with my LLM. Obsidian crashes. Any solutions?

What I’m trying to do

I can't get Obsidian to finish indexing everything, let alone try Smart Connections. I've disabled Graph view, Sync, and File Recovery. My PC: RTX 3090, 128GB DDR4 RAM, Ryzen 5900XT, 1TB HDD.

Things I have tried

I imported in batches of 1,000 and tried disabling core plugins one by one. Still failed.

When I did similar things, I always symlinked the top folder of all the PDFs into the vault and never brought the physical PDFs into the vault, since syncing the vault would have been impossible afterwards.
I remember little freezes, but even on a smaller laptop, 30k+ files got indexed.

I reckon there is some issue with one particular path (folder) or file.

But I would NOT recommend going down this route at all.

Add all your PDFs to Zotero; it will index all your files and can run LLM models over the extracted txt files, which you can turn into md files with Python or whatever. You can use other software like CursorAI for this (import a workspace with your indexed files: NOT the PDFs, but the raw txt/md).
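The txt-to-md step is trivial to script. A minimal sketch, assuming the extracted text already sits as `.txt` files in some folder (the function name and folder layout here are mine, not from any particular tool):

```python
from pathlib import Path

def txt_to_md(src_dir: str, dst_dir: str) -> int:
    """Copy every .txt file in src_dir into dst_dir as a .md file.

    Prepends a '# <filename>' heading so Obsidian (or any other
    indexer) picks up a sensible note title. Returns the file count.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for txt in sorted(src.glob("*.txt")):
        body = txt.read_text(encoding="utf-8", errors="replace")
        (dst / f"{txt.stem}.md").write_text(f"# {txt.stem}\n\n{body}",
                                            encoding="utf-8")
        count += 1
    return count
```

`errors="replace"` matters with 9,000 scanned PDFs: one badly-encoded extraction shouldn't kill the whole batch.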

You do NOT want to use Obsidian for all things.

Alternatively, use Cursor AI to regex-search your material for common topics, and add only those index files or PDFs to Obsidian, in a dedicated project vault.
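The pre-filtering idea doesn't need Cursor specifically; a few lines of Python do the same scan over the extracted text. A sketch (the topic pattern is obviously yours to choose):

```python
import re
from pathlib import Path

def pick_by_topic(root: str, pattern: str) -> list[str]:
    """Return paths of .md/.txt files whose text matches the regex.

    Scans the extracted text rather than the PDFs themselves, so only
    the matching subset needs to go into the project vault.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in {".md", ".txt"}:
            continue
        if rx.search(path.read_text(encoding="utf-8", errors="replace")):
            hits.append(str(path))
    return sorted(hits)
```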

I am currently trying out Neural Composer for a smaller batch of books (md-index files because I have them, but I could try original PDFs) to see what gives.


Hi @mystvearn!

With that RTX 3090 and 128GB RAM, your machine is an absolute beast for local AI. You should be flying, not crashing!

The crash likely happens because standard Obsidian plugins run the indexing logic (reading files + calculating embeddings) inside the main Obsidian process (Electron/JavaScript) or try to store the vector index as thousands of small files inside the vault. On a 1TB HDD (mechanical drive), the I/O latency of reading/writing thousands of small cache files + the memory overhead of indexing 9000 PDFs will choke Obsidian to death.

The Solution: Decoupled Architecture
You need a system that runs the heavy lifting (indexing/embedding) in a separate process, not inside Obsidian’s UI thread.

As @Sunnaq445 mentioned, Neural Composer (which uses LightRAG) handles this differently:

  1. It spawns a separate Python server.

  2. The heavy indexing work happens in that Python process (utilizing your RTX 3090 via CUDA).

  3. Obsidian stays lightweight because it just queries the database; it doesn’t hold the index in RAM.

Recommendation for your setup:
Even if you don’t use my plugin, try to move the embedding workload outside of Obsidian (using external scripts or servers). But if you want to keep the UX inside Obsidian, Neural Composer was built exactly for this “heavy local hardware” scenario.

Tip regarding the HDD: Since you are on a mechanical drive, set MAX_PARALLEL_INSERT=1 in the plugin settings (Review .env) to avoid killing your disk seek time during ingestion.

Hope you can put that 3090 to good use!


I actually tried the md file format first, before using the PDFs directly. Same result though: still crashed. It took me 10 days non-stop to convert from PDF to md using pdf maker. I still have both folders, PDF and md.

Sorry, forgot to mention that it is an NVMe SSD (PCIe Gen 3), not an HDD.

I don’t mind slow as long as it works. I tried indexing in AnythingLLM. It works, but the output is terrible: not accurate, just very loosely correct.

I still think Obsidian as a whole is better; I’m just not sure how to get it to work.

Obsidian doesn’t like hefty md files.
And plugins written for it will never be as performant as dedicated software.

I can then recommend Msty Studio and its Knowledge Stacks for the whole lot.
But as I said, I’d try chunking them up, 5-6 books at a time, especially to experiment with RAG chunking, etc.
I personally gave up on RAG and use only regex and semantic searches in VS Code (and CursorAI and Windsurf) on 56 GB of PDFs’ worth of md files.
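If you do experiment with RAG chunking on those 5-6 book batches, the simplest baseline is fixed-size windows with overlap. A sketch (the sizes are conventional starting points, not recommendations from any tool mentioned here):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows.

    The crudest RAG chunking scheme, but a useful baseline:
    vary size/overlap per batch and compare retrieval quality.
    """
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break
    return chunks
```

Paragraph- or heading-aware splitting usually beats this for books, but the fixed-window version is the control you measure against.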

Hi @mystvearn!

Ah, NVMe Gen 3! That changes the game completely. With an RTX 3090 + 128GB RAM + NVMe, you have a dream setup for local AI. You definitely shouldn’t be crashing.

Why your previous results were “terrible”:
You mentioned using “AnythingLLM” and getting “loosely correct” answers. That is the classic limitation of Standard Vector RAG.

  • Standard RAG just looks for matching keywords/chunks. If you have 9,000 PDFs, it retrieves chunks that sound similar but might be totally irrelevant contextually. It creates a “Frankenstein” answer.
  • Neural Composer (LightRAG) is different. It builds a Knowledge Graph. It maps relationships between concepts across those 9,000 files. Instead of just guessing based on keywords, it “traverses” the graph to build a coherent answer. This is exactly why we built this plugin—to fix that “loosely correct” problem.

How to get it to work:
Since you prioritize “working” over “speed”, here is the safest path to ingest that massive library without crashing Obsidian:

  1. The Backend: Since Neural Composer runs the indexing in a separate Python process, it will not crash Obsidian even if it takes hours. Your UI will remain responsive.
  2. The Model: With your RTX 3090, I highly recommend using Ollama locally.
    • LLM: qwen2.5:14b (Great balance of speed/intelligence) or llama3.
    • Embedding: bge-m3 (Excellent for retrieval).
  3. The Strategy: Don’t throw the 9,000 PDFs at once on the first try.
    • Create a folder in Obsidian with just 10 or 20 complex PDFs.
    • Right-click → Ingest Folder.
    • Test the output in the “Vault Chat”.
    • If the quality amazes you (and I think it will), then add the rest in batches (e.g., 500 at a time).
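The "rest in batches" part of step 3 is easy to drive from a script. A sketch (the helper name is mine; 500 mirrors the batch size suggested above):

```python
from pathlib import Path
from typing import Iterable, Iterator

def batches(paths: Iterable[Path], size: int = 500) -> Iterator[list[Path]]:
    """Yield the library in fixed-size batches.

    Feed each batch to the ingester, verify quality in Vault Chat,
    then move on — so one bad batch never takes the whole run down.
    """
    batch: list[Path] = []
    for p in paths:
        batch.append(p)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # trailing partial batch
        yield batch
```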

Settings Tip:
In the plugin settings (“Review .env”), ensure:
MAX_ASYNC=4 (Your 3090 can handle it).
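For reference, the two .env keys mentioned in this thread would sit together roughly like this (MAX_ASYNC=4 is from this post; the MAX_PARALLEL_INSERT=1 value was the earlier mechanical-drive tip and can likely be raised on NVMe):

```
MAX_ASYNC=4            # parallel embedding requests; a 3090 handles this fine
MAX_PARALLEL_INSERT=1  # keep at 1 on a HDD; raise on NVMe
```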

Give it a shot with a small batch first. The difference between Vector RAG (AnythingLLM) and Graph RAG (Neural Composer) should be noticeable immediately in the quality of the answers.
