Global Search text in PDFs

My main points are:

  1. It’s a natural plugin and doesn’t need the devs to do it themselves.
  2. The plugin should cover all frequently used types of text document. There are many threads asking for support for Office and other formats.
  3. Many users will accumulate a very large number of files, and would gain flexibility and speed from a plugin that maintained an index.

Indeed. As I said originally.
But it’s open source. It can be forked, it can be updated.
If someone can make it smaller or faster I’m sure the original developer would be delighted.

Size here is about the range of formats it accepts and also the need to have an index database. A simple text scraper offers much less. Some users might prefer a simple grep plugin, but there’s no reason there can’t be both.

I can tell you’ve never used docfetcher in practice. It works OK. It takes text from a number of files other, sometimes expensive, programs choke on. I’ve never encountered a PDF it won’t work on. I work a lot with PDFs and have most of the common commercial editors which will open and save in all up-to-date formats, and docfetcher has never had a problem with any of them. iirc the most recent PDF standard was 2017 (and the vast majority of PDFs adhere to earlier standards).


Docfetcher was only an example of the type of functionality needed, and one which could make an easy route to a plugin, given it’s open source. It’s cross-platform text search software, not file management software. Fairly simple but effective and free.

It’s text search that makes a natural plugin.

Probably at least two in the end, one indexed and one not. An indexed version would allow a much more sophisticated set of features.
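To make the indexed versus non-indexed distinction concrete, here is a minimal sketch in Python. It assumes the text has already been extracted from each file into a plain dictionary; the function names and sample data are made up for illustration and are not taken from docfetcher or any Obsidian plugin.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [t.lower() for t in TOKEN.findall(text)]


def scan_search(docs, query):
    """Non-indexed (grep-style) search: rereads every document on every query."""
    needle = query.lower()
    return sorted(name for name, text in docs.items() if needle in text.lower())


def build_index(docs):
    """Build an inverted index: token -> set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def index_search(index, query):
    """Return the documents that contain every word of the query (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return sorted(hits)
```

The scan version needs no stored state but does work proportional to the whole vault on each query. The indexed version pays that cost once and then answers multi-word queries with set lookups, which is where the more sophisticated features (ranking, phrase search and so on) would build from.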


I’d like to see this as well (searching available data in a PDF).


It’s not possible currently due to


Does anyone know if this is likely to happen? Or is it just completely impossible? I really, really want to stop using Evernote and move to Obsidian, and this is the only thing stopping me at present.


Yeah, that’s a pity.

Since the internal PDF viewer of Electron doesn’t support proper PDF handling, I decided to write my own plugin that can show just one page of a PDF, or cut a picture out of a PDF document and show it inside a note. It’s based on pdf.js. You can find the first prototype here.

With pdf.js I’m able to extract the whole text of a PDF. Is there a way to extend search results via a plugin?


We used to use pdf.js and moved to the native renderer because it’s faster and more accurate (some PDFs do not open correctly with pdf.js).
We might revisit our choice in the future, especially if we need to do other things with PDFs.


The PDF viewer feature is great, but a great viewer definitely needs a search bar.


I’ve posted in the Electron issue here pushing for this feature and asking for some sort of detail to help the community get involved with this, if necessary. Support searching in native PDF rendering · Issue #9030 · electron/electron · GitHub . Upvote if you agree with any notion and want more information from Deepak, an Electron maintainer who might be able to speak to what can be done (but may not have the time).

Worth noting for the Obsidian developers: it seems that Chromium (Electron’s base) handles PDF search through the browser’s search, not the embedded PDF workflow. So is it possible that Chromium ships its search outside the scope of the embedded PDF viewer, and that is why it doesn’t get pulled into the viewer in Obsidian?

For the time being, the closest workaround I’ve found is to open the PDF with the native app:

  1. Click the title or anywhere outside the PDF itself, since the PDF captures your keyboard scope
  2. Ctrl + P
  3. Search “Default app” => Click “Open in Default App”

Is there a currently suggested way to work around this issue and allow the text of PDFs to be searchable as part of a master vault search?

Do any of the PDF plugins allow “importing”, or is there a suggested way to use Pandoc or something else to create an MD file from a PDF?

Thanks…


I am also quite affected by this. Evernote is pretty good at searching the full text, including PDFs. My main problem is that I often scan documents into PDF, which I would love to be able to search.

Does anyone have a workaround to search both .md files and .pdf? It’s particularly tricky on mobile (iOS). I’m even considering implementing this functionality somehow, even if it is as a plugin, because it is the only thing that really stops me from loving my Obsidian experience.


Not sure if it would be possible to integrate something like pdfgrep (https://pdfgrep.org/) in the mobile version of Obsidian.

Alternatively, maybe a plugin could be made to maintain an OCR dump of every PDF file in the vault, which could then be searched. The integration with search might be painful. Maybe it’d be possible to detect that you are trying to open the OCR text version of the PDF and open the PDF instead (maybe even in the right spot).
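A rough sketch in Python of the "open the PDF instead, in the right spot" idea, under two assumptions that are mine rather than anything Obsidian provides: the dump for `scan.pdf` is stored next to it as `scan.pdf.txt`, and pages in the dump are separated by form feed characters (which is how `pdftotext` separates pages). The function names are hypothetical.

```python
from pathlib import Path


def pdf_for_sidecar(sidecar):
    """Map a dump such as 'vault/scan.pdf.txt' back to 'vault/scan.pdf'.

    The '<name>.pdf.txt' naming convention is an assumption of this sketch.
    """
    sidecar = Path(sidecar)
    if sidecar.suffix != ".txt" or Path(sidecar.stem).suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF text dump: {sidecar}")
    return sidecar.with_suffix("")


def locate_hits(dump_text, query):
    """Return (page_number, line) for every line containing the query.

    Pages are 1-based and assumed to be separated by form feeds.
    """
    needle = query.lower()
    hits = []
    for page_number, page_text in enumerate(dump_text.split("\f"), start=1):
        for line in page_text.splitlines():
            if needle in line.lower():
                hits.append((page_number, line.strip()))
    return hits
```

With the page number in hand, a plugin could in principle redirect a click on the text hit to the matching page of the original PDF, though how that would hook into Obsidian's search UI is left open here.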


indeed!

any workaround found?

+1

I can see how this may be helpful. It depends on a person’s workflow. Sometimes PDFs can just be used as reference, but sometimes they also contain essential information that needs to be searched. Since PDFs tend to have many pages, a search function would make it easier to navigate through them and quickly get to the information that I need.


+1
Two years later, no changes. Sadly!


+1
searching in pdfs would allow for new database-like use cases of Obsidian

On a Mac, just use Spotlight search or Finder search (with Finder, you can scope your search to the root of your vault), as macOS automatically indexes PDF files. I imagine Windows has a similar function as well.

I have written a crude Python script using pdfplumber and ocrmypdf (when required) to create parallel md files with the text by page and links back to the PDF. I agree that searching non-text files shouldn’t be core to Obsidian, but it could be achieved through an add-on. However, searching text within a PDF when viewing it would be very useful.
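For anyone wanting to try the same approach, here is a minimal sketch of what such a script might look like. It is not the script described above. It assumes pdfplumber is installed, leaves the OCR step out entirely, and uses a `.pdf.md` file name and the `[[file.pdf#page=N]]` link form as illustrative choices.

```python
from pathlib import Path


def extract_pages(pdf_path):
    """Return the text of each page using pdfplumber (third-party package)."""
    import pdfplumber  # imported lazily so the rest works without it installed

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may yield nothing for scanned pages without a text layer
        return [page.extract_text() or "" for page in pdf.pages]


def pages_to_markdown(pdf_name, pages):
    """Build a parallel markdown note: one section per page, each linking back."""
    lines = [f"# Text of {pdf_name}", ""]
    for number, text in enumerate(pages, start=1):
        lines.append(f"## Page {number}")
        lines.append(f"[[{pdf_name}#page={number}]]")
        lines.append("")
        lines.append(text.strip() or "(no text layer; OCR may be needed)")
        lines.append("")
    return "\n".join(lines)


def write_sidecar(pdf_path):
    """Write '<name>.pdf.md' next to the PDF and return its path."""
    pdf_path = Path(pdf_path)
    out = pdf_path.with_name(pdf_path.name + ".md")
    out.write_text(
        pages_to_markdown(pdf_path.name, extract_pages(pdf_path)),
        encoding="utf-8",
    )
    return out
```

Running `write_sidecar` over every PDF in a vault would make the text visible to the normal vault search, at the cost of duplicate files that have to be regenerated when a PDF changes.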


I guess if you have time to make it something more concrete (maybe even an Obsidian plug-in, I’m honestly not sure what the limitations are there), that would be interesting.