Search text in PDFs

indeed!

Has anyone found a workaround?

+1

I can see how this may be helpful. It depends on a person’s workflow. Sometimes PDFs can just be used as reference, but sometimes they also contain essential information that needs to be searched. Since PDFs tend to have many pages, a search function would make it easier to navigate through them and quickly get to the information that I need.

+1
Two years later and no changes, sadly!

+1
Searching in PDFs would allow for new database-like use cases of Obsidian.

On a Mac, just use Spotlight or Finder search (with Finder, you can scope the search to the root of your vault), since macOS automatically indexes PDF files. I imagine Windows has a similar function as well.

I have written a crude Python script using pdfplumber and ocrmypdf (when required) to create parallel Markdown files containing the text page by page, with links back to the PDF. I agree that searching non-text files shouldn’t be core to Obsidian, but it could be achieved through a plugin. However, searching text within a PDF while viewing it would be very useful.
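For anyone who wants to try the same idea, here is a minimal sketch of the approach described above. It is not the poster's script (which was not shared in this thread), and the function names are made up for illustration. It assumes pdfplumber is installed and that the PDF already has a text layer; the OCR step with ocrmypdf is left out. The `[[file.pdf#page=N]]` link form is the one Obsidian uses to open a PDF at a given page.

```python
from pathlib import Path


def extract_page_texts(pdf_path):
    """Return one string per page, using pdfplumber (third-party)."""
    import pdfplumber  # imported lazily so the pure helper below works without it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for a page without a text layer
        return [page.extract_text() or "" for page in pdf.pages]


def pages_to_markdown(pdf_name, page_texts):
    """Build a note with one section per page, each linking back to the PDF."""
    lines = [f"Text extracted from [[{pdf_name}]]", ""]
    for number, text in enumerate(page_texts, start=1):
        lines.append(f"## Page {number}")
        lines.append(f"[[{pdf_name}#page={number}]]")  # link back to that page
        lines.append("")
        lines.append(text.strip())
        lines.append("")
    return "\n".join(lines)


def write_parallel_note(pdf_path):
    """Write 'name.md' next to 'name.pdf' (hypothetical naming choice)."""
    pdf_path = Path(pdf_path)
    note_path = pdf_path.with_suffix(".md")
    texts = extract_page_texts(pdf_path)
    note_path.write_text(pages_to_markdown(pdf_path.name, texts), encoding="utf-8")
    return note_path
```

Once the notes exist, the regular Obsidian search finds the PDF text through them, and each page section links back to the matching page of the PDF.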

I guess if you have time to turn it into something more concrete (maybe even an Obsidian plugin; I’m honestly not sure what the limitations are there), that would be interesting.

I’ll see if it’s something I can easily share. Although I have programmed in the past, Python is new to me, and it’s not pretty! Where is the best place to post scripts like that? I’m not sure how you would build a plugin based on Python. I don’t think there are equivalent JavaScript libraries for all the features.

You might want to check out MSzturc/obsidian-better-pdf-plugin on GitHub, whose goal is to implement a native PDF handling workflow in Obsidian. Since it uses pdf.js, the viewer should have a search feature at the top.

MASSIVE +1 for this feature from me. I’d be interested in taking a stab at making a plugin to add text search in PDFs to the global Obsidian search. Is there any work being done on this that I should know of before I give it a go?

I have a very crude Python script. My intention is just to use it on specific folders as I bring my files into Obsidian. I am running it manually from the terminal on macOS. I’m not sure where I can post it on this forum.

I just wrote some crude documentation:

ToMD.py
Convert files to Obsidian markdown
By default uses the current working folder, or can take a folder on the command line or a file

Will OCR image files, and PDF files where necessary

Extracts text from pdf, docx, html, xlsx, pptx files
and from png, jpg, jpeg, gif, bmp image files with OCR

Converts generic scan file names to more meaningful ones based on first lines of text

Limits file names to alphanumeric, space and the ‘-’ ‘_’ characters

Will only convert a file once, unless it has a more recent modified time

Options:
-o output folder [default is the same folder]
-c compression level 0 - 4 [0 uses the default compression level for images and PDFs]
-b back up compressed files
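
A few of the behaviours listed above can be sketched in a handful of lines. This is only an illustration under stated assumptions, not the actual ToMD.py: the function names, the `dest` names, and the choice to restrict "alphanumeric" to ASCII are all made up here. Only the `-o`, `-c`, and `-b` flags and their meanings come from the documentation above.

```python
import argparse
import re
from pathlib import Path


def sanitize_name(name):
    """Keep only ASCII letters, digits, spaces, '-' and '_' (ASCII is an assumption)."""
    return re.sub(r"[^A-Za-z0-9 _-]", "", name).strip()


def needs_conversion(source, target):
    """Convert when the note is missing or the source was modified after it."""
    source, target = Path(source), Path(target)
    if not target.exists():
        return True
    return source.stat().st_mtime > target.stat().st_mtime


def build_parser():
    """Command-line options as described in the documentation above."""
    parser = argparse.ArgumentParser(
        prog="ToMD.py", description="Convert files to Obsidian markdown"
    )
    parser.add_argument("path", nargs="?", default=".",
                        help="folder or file (default: current working folder)")
    parser.add_argument("-o", dest="output", default=None,
                        help="output folder (default: same folder)")
    parser.add_argument("-c", dest="compression", type=int,
                        choices=range(0, 5), default=0,
                        help="compression level 0 - 4 (0: default level)")
    parser.add_argument("-b", dest="backup", action="store_true",
                        help="back up compressed files")
    return parser
```

Comparing modification times is what makes a script like this cheap to re-run on the same folder: files whose notes are already up to date are skipped.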

Is there any update on this feature? Or does someone have a solution?

A couple of days ago, the Omnisearch community plugin became available within Obsidian itself. I’ve used it successfully to index and search PDFs. So far, it covers my needs.

There’s also Obsidian OCR, another community plugin, that lets you search for text in both images and PDFs.

I hope I’m not killing people by doing this, but here goes: ping @ashish, @Blubloos, @on_no, @LeAvie, @carlosbaraza, @freecicero, @mochsner, @friednslip, @Gnopps, @varian93.

I’d say Omnisearch does the trick