Global Search text in PDFs

LeAvie · May 19, 2022, 8:41pm

indeed!

LeAvie · May 19, 2022, 8:42pm

any workaround found?

Seri · June 4, 2022, 4:57am

+1

I can see how this may be helpful. It depends on a person’s workflow. Sometimes PDFs can just be used as reference, but sometimes they also contain essential information that needs to be searched. Since PDFs tend to have many pages, a search function would make it easier to navigate through them and quickly get to the information that I need.

on_no · June 30, 2022, 2:51am

+1
two years after. no changes. sadly!!!

Kalmir · July 8, 2022, 5:54am

+1
Searching in PDFs would allow for new database-like use cases of Obsidian.

obsequious · July 27, 2022, 8:39pm

On a Mac, just use Spotlight or Finder search (with Finder, you can scope your search to the root of your vault), as macOS automatically indexes PDF files. I imagine Windows has a similar function as well.
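For anyone who prefers the terminal, the same Spotlight index can be queried with the macOS `mdfind` command, which accepts `-onlyin` to restrict results to one folder. The following is a minimal sketch, assuming Python 3.9+ on macOS; the function names and the vault path in the usage note are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_mdfind_command(vault: Path, query: str) -> list:
    """Build an mdfind call that is limited to one folder (the vault root)."""
    return ["mdfind", "-onlyin", str(vault), query]


def search_vault_pdfs(vault: Path, query: str) -> list:
    """Query the Spotlight index and keep only the hits that are PDF files."""
    result = subprocess.run(
        build_mdfind_command(vault, query),
        capture_output=True,
        text=True,
        check=True,
    )
    return [
        Path(line)
        for line in result.stdout.splitlines()
        if line.lower().endswith(".pdf")
    ]
```

Calling `search_vault_pdfs(Path.home() / "Documents" / "MyVault", "invoice")` would list matching PDFs under that (hypothetical) vault folder. This only works on macOS and only finds files Spotlight has already indexed.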

CharlieP · July 28, 2022, 12:35pm

I have written a crude Python script using pdfplumber and ocrmypdf (when required) to create parallel md files containing the text by page, with links back to the PDF. I agree that searching non-text files shouldn’t be core to Obsidian, but it could be achieved through an add-in. However, searching text within a PDF while viewing it would be very useful.
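The script itself was not posted in the thread, so the following is only a rough sketch of the idea, not CharlieP's code. It assumes the third-party pdfplumber package is installed, skips the OCR step entirely, and assumes links of the form `[[file.pdf#page=N]]`; the function names are invented for illustration.

```python
from pathlib import Path


def page_section(pdf_name: str, page_number: int, text: str) -> str:
    """Format one page: a heading, a link back to that PDF page, then the text."""
    link = f"[[{pdf_name}#page={page_number}]]"
    return f"## Page {page_number}\n{link}\n\n{text.strip()}\n"


def pdf_to_markdown(pdf_path: Path) -> Path:
    """Write <name>.md next to the PDF, with one section per page."""
    import pdfplumber  # third-party; imported here so page_section works without it

    sections = []
    with pdfplumber.open(str(pdf_path)) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            # extract_text() may return None or "" for pages without a text layer
            sections.append(page_section(pdf_path.name, number, page.extract_text() or ""))

    md_path = pdf_path.with_suffix(".md")
    md_path.write_text("\n".join(sections), encoding="utf-8")
    return md_path
```

Because the output is an ordinary Markdown note, Obsidian's built-in search covers it without any plugin. Scanned pages with no text layer would produce empty sections unless they are OCRed first.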

Thibaultmol · July 28, 2022, 1:46pm

I guess if you have time to make it something more concrete (maybe even an obsidian plug-in, I’m honestly not sure what the limitations are there) that would be interesting

CharlieP · July 28, 2022, 3:55pm

I’ll see if it’s something I can easily share. Although I have programmed in the past, Python is new to me, and it’s not pretty! Where is the best place to post scripts like that? I’m not sure how you would build an add-in based on Python. I don’t think there are equivalent JavaScript libraries for all the features.

obsequious · July 28, 2022, 4:18pm

Might want to check out MSzturc/obsidian-better-pdf-plugin on GitHub; its stated goal is to implement a native PDF handling workflow in Obsidian. Since it uses pdf.js, the viewer should have a search feature at the top.

Blubloos · August 1, 2022, 1:12pm

MASSIVE +1 for this feature from me. I’d be interested in taking a stab at making a plugin to add text search in PDFs to the global Obsidian search. Is there any work being done on this that I should know of before I give it a go?

CharlieP · August 2, 2022, 12:33pm

I have a very crude Python script. My intention is just to use it on specific folders as I bring my files into Obsidian. I am running it manually from the terminal on macOS. I’m not sure where I can post it on this forum.

I just wrote some crude documentation:

ToMD.py
Convert files to Obsidian markdown
By default uses the current working folder, or can take a folder on the command line or a file

Will OCR image files, and PDF files where necessary

Extracts text from pdf, docx, html, xlsx, pptx files
and from png, jpg, jpeg, gif, bmp image files with OCR

Converts generic scan file names to more meaningful ones based on first lines of text

Limits file names to alphanumeric, space and the ‘-’ ‘_’ characters

Will only convert a file once, unless it has a more recent modified time

Options:
-o output folder [default is the same folder]
-c compression level 0-4 [0 uses the default compression level for images and PDFs]
-b back up compressed files
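Three of the behaviours described above (restricting file names to a safe character set, naming a file after its first line of text, and converting only when the source is newer) can be sketched as follows. ToMD.py was not posted, so this is an illustration under assumptions: the function names, the 60-character limit, and the exact rules are invented here.

```python
from pathlib import Path


def safe_name(text: str, max_length: int = 60) -> str:
    """Keep letters, digits, spaces, '-' and '_'; collapse repeated spaces."""
    kept = "".join(ch for ch in text if ch.isalnum() or ch in " -_")
    return " ".join(kept.split())[:max_length].strip()


def name_from_text(text: str, fallback: str) -> str:
    """Use the first line that still has content after cleaning, else the fallback."""
    for line in text.splitlines():
        candidate = safe_name(line)
        if candidate:
            return candidate
    return fallback


def needs_conversion(source: Path, target: Path) -> bool:
    """Convert if there is no output yet or the source was modified after it."""
    return not target.exists() or source.stat().st_mtime > target.stat().st_mtime
```

Comparing modification times keeps repeated runs over the same folder cheap, since unchanged files are skipped.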

ashish · October 28, 2022, 12:57am

Is there any update on this feature? Or does anyone have a solution?

pivic · November 1, 2022, 12:20pm

A couple of days ago, the Omnisearch community plugin was released in the product itself. I’ve used it successfully to index and search PDFs. This covers my needs so far, actually.

Edit: Omnisearch has been updated to add support for indexing text inside of images.

There’s also Obsidian OCR, another community plugin, that lets you search for text in both images and PDFs.

I hope I’m not killing people by doing this, but here goes: ping @ashish, @Blubloos, @on_no, @LeAvie, @carlosbaraza, @freecicero, @mochsner, @friednslip, @Gnopps, @varian93.

Blubloos · November 23, 2022, 12:55am

I’d say Omnisearch does the trick

Bjaux · March 1, 2023, 12:01pm

I may be mistaken, but I can’t seem to find an easy solution here. I have around 600 PDF documents (mostly German), most between 1 and 5 pages, but some with 20-30 and a handful that are larger. All of them already have the recognized text embedded (my scanner does that). The macOS Finder can easily find any of these PDFs with a full-text search.

But neither the Obsidian search nor Omnisearch nor Obsidian OCR seems to do the trick. The problem is that Obsidian’s native search does not seem to recognize the embedded text, Omnisearch does its own indexing but misses around 60% of the PDFs (empty text in the metadata), and Obsidian OCR always hangs partway through some of my PDFs.

I might be mistaken here, but shouldn’t it be rather easy to run a full-text search on the recognized text that is already embedded in the PDFs, without any additional OCR? Any ideas?

scambier · March 1, 2023, 7:16pm

Omnisearch & Text Extractor developer here.

Short answer: no, or it would already be built into Obsidian, or Text Extractor wouldn’t fail on so many files.

You mention that macOS has already OCRed your files through its “Live Text” (iirc) feature, but unfortunately there is no API for third-party apps to use this data.
The Text Extractor plugin does not use OCR for PDFs, but tries to extract the text directly from the file. OCR with TesseractJS works pretty well, but converting a 150-page PDF into images to OCR it is… not ideal.
What makes the whole thing even more complex is that Obsidian is essentially a web app, so we’re limited to web technologies (JS and WASM).

While there are ways to do this, solutions are often flawed: they break often, or don’t scale well, or take too much time/resources. The first version of Text Extractor used PDFjs and worked perfectly… unless you had more than a dozen files, then it hard crashed Obsidian.

So yeah, less easy than you think

Bjaux · March 1, 2023, 8:03pm

Wow, thank you so much for your fast, kind and elaborate answer!

I just ditched a full Notion-Obsidian sync setup that I wrote and used for a year, in order to move completely to Obsidian, and I am quite amazed by the community so far!

Your answer helps me in multiple ways:

I wasn’t aware that macOS does its own OCR run and was quite convinced it only used existing embedded data. But this fits quite well with the whole problem I’m seeing.
I should read the plugin description more carefully. As you say, Text Extractor does OCR only for images and extracts already embedded text from PDFs if possible. On the other hand, this means I might just need to improve or alter the embedded text inside the 60% of PDFs it is missing. I know that at least a share of them do not have the text included, though I would have guessed that number to be lower. That’s something I could try to improve!
I couldn’t find a way to limit the number of pages or exclude PDF files for Obsidian OCR, but this is something I could look into too. Also, I had the impression that it failed (and hung) on a 9-page document, but maybe I’ve read that wrong.
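One way to find out which PDFs really lack an embedded text layer is to try extracting text from each file outside Obsidian. The sketch below assumes the third-party pdfplumber package and an arbitrary threshold of 20 characters; the function names are invented. It reports what pdfplumber can read, which is not necessarily what Text Extractor's own library would return for the same file.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def has_text(page_texts: Iterable[Optional[str]], min_chars: int = 20) -> bool:
    """True if the pages together contain at least min_chars non-blank characters."""
    total = sum(len((text or "").strip()) for text in page_texts)
    return total >= min_chars


def pdfs_without_text(vault: Path) -> List[Path]:
    """List PDFs under the vault where no usable text could be extracted."""
    import pdfplumber  # third-party; imported here so has_text works without it

    missing = []
    for pdf_path in sorted(vault.rglob("*.pdf")):
        try:
            with pdfplumber.open(str(pdf_path)) as pdf:
                texts = [page.extract_text() for page in pdf.pages]
        except Exception:
            # Unreadable or malformed file: treat it as having no text
            texts = []
        if not has_text(texts):
            missing.append(pdf_path)
    return missing
```

Files on the resulting list are candidates for a fresh OCR pass, for example with ocrmypdf as mentioned earlier in the thread.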

I think the Text Extractor approach seems the most promising to me. Thanks a lot again!

zhenbo · March 2, 2023, 4:14pm

Speaking of workarounds, please allow me to promote my own project, Endle/fireSeqSearch on GitHub: when you use a search engine, it also searches your local Logseq notebook.

It requires running an external binary (server) and a browser extension. It’s not integrated into ObsidianMD as a plugin.

With the server, loading PDFs and extracting text is a bit easier. If anyone is interested in this workaround, I could add support for Obsidian notes soon.

scambier · March 2, 2023, 9:10pm

That’s unfortunately not necessarily the case. The library used by Text Extractor can fail on PDFs that have selectable text. And TBH I would have expected Obsidian OCR to work better because of their requirements, but it seems not. So definitely not an easy problem to solve.