Searchable OCR - Let's get it built!

I need a usable OCR function (similar to what I’ve used in OneNote).

Here are some requirements to make this usable:

  • Automatic - I don’t have to think about it or run a command. OCR just works in the background.

  • Searchable OCR that leads to the note - The OCR text shows up in Obsidian’s global search, and links from search results lead to the note the image is ON, not the media file itself. The best case would be that it links to the image in the note AND highlights the matches. I’m not sure if highlighting is possible, but linking (maybe with a block ID like ^3agr73) should be.

  • (optional) Handwriting OCR - Lower priority, but it would help with my Excalidraw usage.

  • Not store a million text files - It would be nice if the OCRed text was stored somewhere other than a bunch of other vault text files (like maybe in a search index file)… or stored directly below the image like in an image alt (messy when editing), etc…
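As a rough illustration of the linking requirement, here is a minimal TypeScript sketch that builds a link to the block holding an image embed. It assumes the embed line already ends with a block ID (for example `![[slide.png]] ^3agr73`); the function name and parameters are hypothetical and not part of any existing plugin.

```typescript
// Hypothetical helper: builds a wiki link to the block that holds an image embed.
// Returns null when the note does not embed the image at all.
function blockLinkForImage(
  noteName: string,
  noteText: string,
  imageName: string
): string | null {
  for (const line of noteText.split("\n")) {
    // Skip lines that do not embed this image.
    if (line.indexOf("![[" + imageName) === -1) continue;
    // A block ID is written as " ^id" at the end of the line.
    const match = /\s\^([A-Za-z0-9-]+)\s*$/.exec(line);
    // Fall back to a plain note link when the line has no block ID.
    return match ? `[[${noteName}#^${match[1]}]]` : `[[${noteName}]]`;
  }
  return null;
}
```

A search result could then point at `[[Lecture#^3agr73]]` instead of the image file. Highlighting the match is not covered by this sketch.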

Why another OCR thread

I realize there are many other posts and a couple other solutions out there (here and here).

I want to put my specific use case out there, gather interest, raise some funds, and get a plugin created.

If you’re interested in creating this plugin, let me know what it might cost to get created.


I’d say there are two parts to this request:

  1. Extracting data in image-based PDFs, i.e. doing OCR on those PDFs.
  2. Including PDF data in global search.

For point 1 I don’t think any new files would have to be created; the PDF itself could just be updated to include the data gathered through OCR. This is also covered in the OCR thread you linked.

For point 2 there is already a separate request. How to best solve this I don’t know, but I agree with you - there shouldn’t be plenty of new files created.

Thanks for the feedback and ideas. I’ll be honest, PDFs are a world of their own to me. So for the sake of simplicity, I’m going to keep them out of my “specs” above.

For image OCR, check out https://bensonruan.com/image-to-text-ocr-with-tesseract-js

  • Not store a million text files - It would be nice if the OCRed text was stored somewhere other than a bunch of other vault text files (like maybe in a search index file)… or stored directly below the image like in an image alt (messy when editing), etc…

You could store the text in %% comments in the note that embeds the image.
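A minimal TypeScript sketch of that idea, working on the note text as a plain string. The function name, the `%% OCR:` prefix, and the one-line layout are all assumptions made for illustration, and the sketch takes it as given that search picks up text inside %% comments.

```typescript
// Hypothetical marker so the helper can recognise comments it already wrote.
const OCR_PREFIX = "%% OCR:";

// Inserts a one-line %% comment holding the OCR text directly below each
// line that embeds the image. Safe to run again on the same note.
function addOcrComment(note: string, imageName: string, ocrText: string): string {
  // Drop "%%" so the comment cannot end early, and collapse whitespace to one line.
  const clean = ocrText.replace(/%%/g, "").replace(/\s+/g, " ").trim();
  const lines = note.split("\n");
  const out: string[] = [];
  for (let i = 0; i < lines.length; i++) {
    out.push(lines[i]);
    // Match both ![[image]] and sized embeds such as ![[image|300]].
    const hasEmbed =
      lines[i].indexOf("![[" + imageName + "]]") !== -1 ||
      lines[i].indexOf("![[" + imageName + "|") !== -1;
    const next = i + 1 < lines.length ? lines[i + 1] : "";
    const alreadyTagged = next.indexOf(OCR_PREFIX) === 0;
    if (hasEmbed && !alreadyTagged) {
      out.push(OCR_PREFIX + " " + clean + " %%");
    }
  }
  return out.join("\n");
}
```

Running it twice leaves the note unchanged, which matters if OCR is meant to run automatically in the background.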

I’m going to play with the %% comments a bit, since they could also be useful for copying the contents of embedded Google Docs or Google Sheets and exposing them to search.

You can use a # heading on top of the comment and then collapse it (with the “fold heading” setting enabled). That way it doesn’t look too messy.

I think the best option would be to store the OCR text in a plugin index somehow, then inject the results into the search if possible.
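To make the index idea a bit more concrete, here is a small TypeScript sketch. The `OcrIndex` shape and the `searchOcr` function are hypothetical. The sketch only shows how a single index could map OCR matches back to the notes that embed the image; whether those results can be injected into the built-in search is left open.

```typescript
// Hypothetical index shape; a plugin could persist this as a single JSON file
// instead of one text file per image.
interface OcrIndex {
  textByImage: Record<string, string>;    // image path -> OCR text
  notesByImage: Record<string, string[]>; // image path -> notes embedding it
}

interface OcrHit {
  note: string;  // the note the image is ON
  image: string; // the image whose OCR text matched
}

// Case-insensitive substring search over the OCR text of every indexed image.
function searchOcr(index: OcrIndex, query: string): OcrHit[] {
  const needle = query.trim().toLowerCase();
  const hits: OcrHit[] = [];
  if (needle === "") return hits;
  for (const image of Object.keys(index.textByImage)) {
    const text = index.textByImage[image] || "";
    if (text.toLowerCase().indexOf(needle) === -1) continue;
    // Report the notes that show the image, not the image file itself.
    for (const note of index.notesByImage[image] || []) {
      hits.push({ note: note, image: image });
    }
  }
  return hits;
}
```

Keeping the image-to-note mapping inside the index is what lets a hit lead to the note rather than the media file, and it avoids scattering extra text files through the vault.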


Just throwing out an alternative workflow that doesn’t need a plug-in …

I threw that one OCR solution together for fun, but ultimately decided to not store PDFs in Obsidian.
Only content I write goes in Obsidian.
Reference material, PDFs, etc. all go in an external tool (DEVONthink), and I just link to it from my markdown notes in Obsidian.

DEVONthink can index Obsidian’s vault too, so I still have a single place to search, including PDF contents.

For me, I never store PDFs in Obsidian. What I’m talking about are notes that involve screenshots (like when watching a YouTube video or Zoom conference). I think it would probably be a pain to put all of these in alternate note software… though doable.

You are right; it’s doable, but I don’t find it to be a pain, and it’s quite usable keeping digital assets outside Obsidian and textual notes and things I create inside Obsidian. A plug-in could be nice for this OCR case, but it’s not necessary.