Global Search text in PDFs

nicrivard · May 26, 2020, 8:40am

Would it be possible to search content inside of PDFs? Similar to how it functions in Evernote or OneNote.

divesh_code · May 29, 2020, 12:46pm

Also, it would be really cool if navigation could be improved in one way or another:

PgUp and PgDn or arrow keys for moving between pages
Zoom in and out
Ability to edit the page number in the indicator to jump from one page to another

pivic · July 3, 2020, 5:39am

Agreed. I would love for PDFs to be part of search, or added as a kind of alternative search, like ‘search in: PDFs’.

marcelobritox3 · October 20, 2020, 9:46pm

I agree with @nicrivard and @divesh_code, these features would be very useful.

varian93 · October 21, 2020, 10:21am

Not only would this be helpful in general, as some people have valuable/insightful information stored in PDFs that they’d like searched as well as their standard vault, but it would help with “web clipping”; if one ran across a page with difficult-to-translate-to-markdown-formatting they could simply “print as PDF” and put the PDF in their vault.

Dor · October 21, 2020, 11:02am

The trouble is that this isn’t a simple thing to do. Reading a PDF takes time, managing a very large number of them would require indexing, and if the text isn’t already in extractable form, it will require OCR.

DocFetcher, a very useful open source text searcher (in maintenance mode, last I heard, owing to the maintainer’s lack of time), has the following in its FAQ:

Why are the DocFetcher installer and the other packages so large (> 30 MB)?
This is mainly due to the fact that DocFetcher is shipped with lots of built-in text extraction libraries, some of which are quite big. The worst offenders are the libraries for MS Office and PDF files. However, the developers of these libraries aren’t to blame here: The libraries have to be big because the respective file formats are immensely complex.

The whole point of plaintext and markdown is that it is quickly and easily read. PDFs are a whole extra world of processing requirements.
My suggestion for people who need to be able to do a native Obsidian search on this text would be to bulk extract the text outside of Obsidian and put it into separate files with links to the original PDF.
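
A minimal sketch of that bulk-extraction idea, assuming the Poppler `pdftotext` command-line tool is installed and on the PATH. The function names and the `<name>.pdf.md` sidecar naming are hypothetical choices for illustration, not anything Obsidian prescribes. Each PDF gets a Markdown note next to it that starts with a wikilink back to the original, so a normal vault search finds the text.

```python
import subprocess
from pathlib import Path
from typing import Callable


def pdftotext_extract(pdf: Path) -> str:
    """Extract text with the Poppler `pdftotext` CLI (assumed to be installed)."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def write_sidecars(
    vault: Path,
    extract: Callable[[Path], str] = pdftotext_extract,
) -> list[Path]:
    """Write `<name>.pdf.md` beside every PDF in the vault (hypothetical naming)."""
    written = []
    for pdf in sorted(vault.rglob("*.pdf")):
        note = pdf.with_name(pdf.name + ".md")
        # The first line links back to the original PDF; the rest is searchable text.
        body = f"Source: [[{pdf.name}]]\n\n{extract(pdf)}"
        note.write_text(body, encoding="utf-8")
        written.append(note)
    return written
```

Calling `write_sidecars(Path("/path/to/vault"))` would process the whole vault. The extractor is passed in as a parameter so a different tool (an OCR step for scanned PDFs, say) could be swapped in without touching the rest.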

Dor · October 21, 2020, 3:49pm

Thinking about it, maybe someone could produce a plugin that called docfetcher (or another program/libraries) and allowed the text to be searched and used in Obsidian. That would be a plugin rather than a feature though.

Dor · October 21, 2020, 6:04pm

add-on = plugin
But no need for Obsidian’s developers to do it themselves.

And if you look at docfetcher’s filetypes they go much further than just PDF. Includes the MSOffice formats, Open Office, epub, RTF etc. All formats that some Obsidian users would like.

Dor · October 22, 2020, 10:36am

My main points are:

It’s a natural plugin and doesn’t need the devs to do it themselves.
The plugin should cover all frequently used types of text document. There are many threads for Office etc support.
Many users will accumulate a very large number of files, and would gain flexibility and speed from a plugin that maintained an index.

Indeed. As I said originally.
But it’s open source. It can be forked, it can be updated.
If someone can make it smaller or faster I’m sure the original developer would be delighted.

Size here is about the range of formats it accepts and also the need to have an index database. A simple text scraper offers much less. Some users might prefer a simple grep plugin, but there’s no reason there can’t be both.

I can tell you’ve never used docfetcher in practice. It works OK. It takes text from a number of files other, sometimes expensive, programs choke on. I’ve never encountered a PDF it won’t work on. I work a lot with PDFs and have most of the common commercial editors which will open and save in all up-to-date formats, and docfetcher has never had a problem with any of them. iirc the most recent PDF standard was 2017 (and the vast majority of PDFs adhere to earlier standards).

Dor · October 22, 2020, 4:57pm

Docfetcher was only an example of the type of functionality needed and which could make an easy route to a plugin, given it’s open source. It’s cross-platform text search software but not management. Fairly simple but effective and free.

It’s text search that makes a natural plugin.

Probably at least two in the end, one indexed and one not. An indexed version would allow a much more sophisticated set of features.
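
To make the indexed versus non-indexed distinction concrete, here is a rough sketch under stated assumptions: the text has already been extracted into a plain dict of file name to text, and all function names are made up for illustration. This is not how DocFetcher works internally, just the general inverted-index idea next to a grep-style scan.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index: word -> names of the documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search_index(index: dict[str, set[str]], query: str) -> set[str]:
    """Indexed search: documents containing every query word, in any order."""
    words = tokenize(query)
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


def search_scan(docs: dict[str, str], query: str) -> set[str]:
    """Grep-style search: re-reads every document and looks for the exact phrase."""
    needle = query.lower()
    return {name for name, text in docs.items() if needle in text.lower()}
```

The scan has to read every file on every query, while the index is built once and then answers lookups from memory. The price is that the index has to be stored and kept up to date as files change.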

Gnopps · November 3, 2020, 7:06am

I’d like to see this as well (searching available data in a PDF).

WhiteNoise · January 11, 2021, 6:14am

It’s not possible currently due to

friednslip · January 31, 2021, 8:33am

Does anyone know if this is likely to happen? Or is it just completely impossible? I really, really want to stop using Evernote and move to Obsidian, and this is the only thing stopping me at present.

MSzturc · January 31, 2021, 11:06am

Yeah, that’s a pity.

Since the internal PDF viewer of Electron doesn’t support proper PDF handling, I decided to write my own plugin to show only a single PDF page, or to cut a picture out of a PDF document and show it inside a note. It’s based on pdf.js. You can find the first prototype here.

With pdf.js I’m able to extract the whole text of a PDF. Is there a way to extend search results via a plugin?

WhiteNoise · January 31, 2021, 1:50pm

We used to use pdf.js and moved to the native renderer because it’s faster and more accurate (some PDFs do not open correctly with pdf.js).
We might revisit our choice in the future, especially if we need to do other things with PDFs.

fashaun · September 6, 2021, 12:19pm

The PDF viewer feature is great, but a great viewer definitely needs a search bar.

mochsner · October 13, 2021, 4:04pm

I’ve posted in the Electron issue here pushing for this feature and asking for some sort of detail to help the community get involved with this, if necessary. Support searching in native PDF rendering · Issue #9030 · electron/electron · GitHub . Upvote if you agree with any notion and want more information from Deepak, an Electron maintainer who might be able to speak to what can be done (but may not have the time).

Noteworthy, for Obsidian developers, it does seem that Chromium (electron’s base) search for PDF uses the browser search, not the embedded PDF workflow. So is it possible that Chromium releases a search outside of the scope of the embedded PDF viewer, and it doesn’t get pulled into the viewer for Obsidian?

For the time being, the closest workaround I’ve found is to just open the PDF with the native app:

Click title or outside of the PDF itself, since it eats your scope
Ctrl + P
Search “Default app” => Click “Open in Default App”

freecicero · January 11, 2022, 1:37pm

Is there a current suggested way to work around this issue and allow the text of PDFs to be searchable as part of a master vault search?

Do any of the PDF plugins allow “importing”, or is there a suggested way to use Pandoc or something else to create an MD file from a PDF?

Thanks…

carlosbaraza · February 24, 2022, 5:37pm

I am also quite affected by this. Evernote is pretty good searching in the full text, including PDFs. My main problem is that I often scan documents into PDF, which I would love to be able to search.

Does anyone have a workaround to search both .md and .pdf files? It’s particularly tricky on mobile (iOS). I’m even considering implementing this functionality somehow, even if it is as a plugin, because it is the only thing that really stops me from loving my Obsidian experience.

carlosbaraza · April 13, 2022, 11:47am

Not sure if it would be possible to integrate something like pdfgrep (https://pdfgrep.org/) in the mobile version of Obsidian.

Alternatively, maybe a plugin could be made to maintain an OCR dump of every PDF file in the vault, which could then be searched. The integration with search might be painful. Maybe it’d be possible to detect that you are trying to open the OCR text version of a PDF and open the PDF itself instead (maybe even at the right spot).
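
A rough sketch of the "right spot" part, with two assumptions: the text dump separates pages with a form feed character (which is what `pdftotext` emits), and a `[[file.pdf#page=N]]` link opens the PDF at that page in Obsidian. The function names are hypothetical.

```python
def find_pages(dump: str, query: str) -> list[int]:
    """Return the 1-based page numbers whose text contains the query."""
    # Assumption: pages in the dump are separated by form feeds ("\f").
    pages = dump.split("\f")
    needle = query.lower()
    return [n for n, page in enumerate(pages, start=1) if needle in page.lower()]


def page_links(pdf_name: str, dump: str, query: str) -> list[str]:
    """Build one wikilink per matching page (link form is an assumption)."""
    return [f"[[{pdf_name}#page={n}]]" for n in find_pages(dump, query)]
```

A plugin could keep the dump out of sight and only show links like these in the results, so that clicking a hit lands in the PDF rather than in the text file.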