Search text in PDFs

Would it be possible to search content inside of PDFs? Similar to how it functions in Evernote or OneNote.

36 Likes

Also it would be really cool if navigation can be improved in one way or another

  1. PgUp and PgDn or arrow keys for moving between pages
  2. Zoom in and out
  3. Ability to edit the page number in the indicator to jump from one page to another

12 Likes

Agreed. I would love for PDFs to be part of search, or added as a kind of alternative search, like ‘search in: PDFs’.

4 Likes

I agree with @nicrivard and @divesh_code, these features would be very useful.

3 Likes

Not only would this be helpful in general, since some people have valuable information stored in PDFs that they’d like searched along with the rest of their vault, but it would also help with “web clipping”: if you ran across a page whose formatting is difficult to translate to markdown, you could simply “print as PDF” and put the PDF in your vault.

2 Likes

The trouble is that this isn’t a simple thing to do. Reading a PDF takes time, managing a very large number of them would require indexing, and if the text isn’t already in extractable form, it will require OCR.

DocFetcher, a very useful open source text searcher (in maintenance mode last I heard, owing to the maintainer’s lack of time), has the following in its FAQ:

Why are the DocFetcher installer and the other packages so large (> 30 MB)?
This is mainly due to the fact that DocFetcher is shipped with lots of built-in text extraction libraries, some of which are quite big. The worst offenders are the libraries for MS Office and PDF files. However, the developers of these libraries aren’t to blame here: The libraries have to be big because the respective file formats are immensely complex.

The whole point of plaintext and markdown is that it is quickly and easily read. PDFs are a whole extra world of processing requirements.
My suggestion for people who need to be able to do a native Obsidian search on this text would be to bulk extract the text outside of Obsidian and put it into separate files with links to the original PDF.
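A minimal sketch of that bulk-extraction idea, in Python. It assumes Poppler's `pdftotext` command-line tool is installed; the function names, the `.pdf.md` sidecar naming and the `Source:` line are my own illustrative choices, not anything Obsidian prescribes. PDFs with no extractable text (scans that would need OCR) are simply skipped.

```python
import subprocess
from pathlib import Path
from typing import Callable


def pdftotext_extract(pdf: Path) -> str:
    """Extract text with Poppler's `pdftotext` CLI (assumed to be installed)."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def write_sidecars(
    vault: Path,
    extract: Callable[[Path], str] = pdftotext_extract,
) -> list[Path]:
    """Write `<name>.pdf.md` next to every PDF in the vault; return the new paths."""
    written = []
    for pdf in sorted(vault.rglob("*.pdf")):
        text = extract(pdf).strip()
        if not text:
            # Nothing extractable: probably a scanned PDF that needs OCR first.
            continue
        sidecar = pdf.with_name(pdf.name + ".md")
        # The wikilink on the first line points back at the original PDF.
        sidecar.write_text(f"Source: [[{pdf.name}]]\n\n{text}\n", encoding="utf-8")
        written.append(sidecar)
    return written
```

Calling `write_sidecars(Path("/path/to/vault"))` would leave a searchable markdown note beside each PDF. The extractor is a parameter so a different tool can be swapped in without touching the rest.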

5 Likes

Thinking about it, maybe someone could produce a plugin that called docfetcher (or another program/libraries) and allowed the text to be searched and used in Obsidian. That would be a plugin rather than a feature though.

3 Likes

add-on = plugin
But no need for Obsidian’s developers to do it themselves.

And if you look at docfetcher’s filetypes they go much further than just PDF. Includes the MSOffice formats, Open Office, epub, RTF etc. All formats that some Obsidian users would like.

My main points are:

  1. It’s a natural plugin and doesn’t need the devs to do it themselves.
  2. The plugin should cover all frequently used types of text document. There are many threads for Office etc support.
  3. Many users will accumulate a very large number of files, and would gain flexibility and speed from a plugin that maintained an index.
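To illustrate point 3, here is a rough sketch of the kind of index such a plugin might maintain: an in-memory inverted index mapping each word to the documents that contain it. The `TextIndex` class and its method names are hypothetical, and it assumes the text has already been extracted from each file by some other tool. A real plugin would also persist the index to disk and handle ranking and phrase queries.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")


def tokenize(text: str) -> set[str]:
    """Lower-cased set of word tokens in the text."""
    return {t.lower() for t in TOKEN.findall(text)}


class TextIndex:
    """Tiny inverted index: word -> names of documents containing it."""

    def __init__(self) -> None:
        self.postings: dict[str, set[str]] = defaultdict(set)
        self.doc_tokens: dict[str, set[str]] = {}

    def add(self, name: str, text: str) -> None:
        """Index a document, replacing any earlier version with the same name."""
        self.remove(name)
        tokens = tokenize(text)
        self.doc_tokens[name] = tokens
        for tok in tokens:
            self.postings[tok].add(name)

    def remove(self, name: str) -> None:
        """Drop a document from the index (no-op if it was never added)."""
        for tok in self.doc_tokens.pop(name, set()):
            self.postings[tok].discard(name)

    def search(self, query: str) -> set[str]:
        """Names of documents containing every word in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))
```

A lookup touches only the posting sets for the query words rather than every file, which is where the speed on large collections comes from. Re-adding a changed file updates it in place.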

Indeed. As I said originally.
But it’s open source. It can be forked, it can be updated.
If someone can make it smaller or faster I’m sure the original developer would be delighted.

Size here is about the range of formats it accepts and also the need to have an index database. A simple text scraper offers much less. Some users might prefer a simple grep plugin, but there’s no reason there can’t be both.
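For contrast, the grep-style approach needs no database at all. The sketch below just rescans plain-text files on every query; `grep_vault` is a hypothetical name, and it assumes any PDF text has already been extracted into `.md` files.

```python
import re
from pathlib import Path
from typing import Iterator


def grep_vault(
    vault: Path, pattern: str, suffix: str = ".md"
) -> Iterator[tuple[Path, int, str]]:
    """Yield (file, line number, line) for each line matching the regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    for path in sorted(vault.rglob(f"*{suffix}")):
        # Undecodable bytes are replaced so one odd file cannot abort the scan.
        with path.open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if regex.search(line):
                    yield path, lineno, line.rstrip("\n")
```

It is simple and always up to date, but every search reads every file, so the cost grows with the size of the vault.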

I can tell you’ve never used docfetcher in practice. It works OK. It takes text from a number of files other, sometimes expensive, programs choke on. I’ve never encountered a PDF it won’t work on. I work a lot with PDFs and have most of the common commercial editors which will open and save in all up-to-date formats, and docfetcher has never had a problem with any of them. iirc the most recent PDF standard was 2017 (and the vast majority of PDFs adhere to earlier standards).

DocFetcher was only an example of the type of functionality needed, and one that could offer an easy route to a plugin, given it’s open source. It’s cross-platform text search software, not document management. Fairly simple but effective and free.

It’s text search that makes a natural plugin.

Probably at least two in the end, one indexed and one not. An indexed version would allow a much more sophisticated set of features.

1 Like

I’d like to see this as well (searching available data in a PDF).

2 Likes

It’s not possible currently due to

Does anyone know if this is likely to happen? Or is it just completely impossible? I really, really want to stop using Evernote and move to Obsidian, and this is the only thing stopping me at present.

2 Likes

Yeah, that’s a pity.

Since Electron’s internal PDF viewer doesn’t support proper PDF handling, I decided to write my own plugin that can show a single PDF page, or cut a picture out of a PDF document and show it inside a note. It’s based on pdf.js. You can find the first prototype here.

With pdf.js I’m able to extract the whole text of a PDF. Is there a way to extend search results via a plugin?

3 Likes

We used to use pdf.js and moved to the native renderer because it’s faster and more accurate (some PDFs do not open correctly with pdf.js).
We might revisit our choice in the future, especially if we need to do other things with PDFs.

2 Likes

The PDF viewer feature is great, but a great viewer definitely needs a search bar.

2 Likes

I’ve posted in the Electron issue here pushing for this feature and asking for some sort of detail to help the community get involved with this, if necessary. Support searching in native PDF rendering · Issue #9030 · electron/electron · GitHub . Upvote if you agree with any notion and want more information from Deepak, an Electron maintainer who might be able to speak to what can be done (but may not have the time).

Worth noting for Obsidian developers: it seems that in Chromium (Electron’s base), searching a PDF uses the browser’s search, not the embedded PDF workflow. So is it possible that Chromium ships search outside the scope of the embedded PDF viewer, and so it doesn’t get pulled into the viewer for Obsidian?

For the time being, the closest workaround I’ve found for this issue is to open the PDF with the native app:

  1. Click title or outside of the PDF itself, since it eats your scope
  2. Ctrl + P
  3. Search “Default app” => Click “Open in Default App”

1 Like