Searchable OCR - Let's get it built!

I need a usable OCR function (similar to what I’ve used in OneNote).

Here are some requirements to make this usable:

  • Automatic - I don’t have to think about it or run a command. OCR just works in the background.

  • Searchable OCR to the note - The OCR text shows up in Obsidian’s global search, and links from search lead to the note the image is ON, not to the media file itself. The best case would be that it links to the image on the note AND highlights the matching words. I’m not sure if it’s possible to highlight them, but linking (maybe with a block ID like ^3agr73) should be.

  • (optional) Handwriting OCR - Lower priority, but it would help with my Excalidraw usage.

  • Not store a million text files - It would be nice if the OCRed text was stored somewhere other than a bunch of extra text files in the vault (like maybe in a search index file)… or stored directly below the image like an image alt text (messy when editing), etc…
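To make the linking requirement concrete, here is a minimal sketch of how a line that embeds an image could be given a block ID so a search hit can point at that spot in the note. It assumes Obsidian treats a trailing ^id on an embed line the same way it does on a paragraph; the function names (tag_image_lines, link_to_block) are made up for illustration and are not part of any existing plugin.

```python
import re
import secrets

# Matches Obsidian image embeds such as ![[shot.png]] or ![[shot.png|300]]
EMBED_RE = re.compile(
    r"!\[\[([^\]|]+\.(?:png|jpe?g|gif|webp))(?:\|[^\]]*)?\]\]", re.IGNORECASE
)
# Matches a block ID that is already at the end of a line, e.g. " ^3agr73"
BLOCK_ID_RE = re.compile(r"\s\^([A-Za-z0-9-]+)\s*$")


def new_block_id(length=6):
    """Random lowercase alphanumeric ID (hypothetical helper)."""
    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
    return "".join(secrets.choice(alphabet) for _ in range(length))


def tag_image_lines(note_text, make_id=new_block_id):
    """Give every line that embeds an image a block ID.

    Returns (new_text, {image_name: block_id}). Lines that already end in
    a block ID are left unchanged and their existing ID is reused.
    """
    ids = {}
    out = []
    for line in note_text.split("\n"):
        match = EMBED_RE.search(line)
        if match:
            existing = BLOCK_ID_RE.search(line)
            if existing:
                block_id = existing.group(1)
            else:
                block_id = make_id()
                line = f"{line.rstrip()} ^{block_id}"
            ids[match.group(1)] = block_id
        out.append(line)
    return "\n".join(out), ids


def link_to_block(note_name, block_id):
    """Wiki link that jumps to the tagged line inside the note."""
    return f"[[{note_name}#^{block_id}]]"
```

A search result for text found in whiteboard.png could then show link_to_block("Meeting", ids["whiteboard.png"]) instead of a link to the media file. Highlighting inside the image is not covered here.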

Why another OCR thread

I realize there are many other posts and a couple other solutions out there (here and here).

I want to put my specific use case out there, gather interest, raise some funds, and get a plugin created.

If you’re interested in creating this plugin, let me know what it might cost to get created.

I’d say there are two parts to this request:

  1. Extracting data from image-based PDFs, i.e. doing OCR on those PDFs.
  2. Including PDF data in global search.

For point 1 I don’t think any new files would have to be created; the PDF itself could just be updated to include the data gathered through OCR. This is also covered in the OCR thread you linked.

For point 2 there is already a separate request. How to best solve this I don’t know, but I agree with you - there shouldn’t be plenty of new files created.

Thanks for the feedback and ideas. I’ll be honest, PDFs to me have their own world. So for the sake of simplicity, I’m going to keep them out of my “specs” above.

For image OCR, check out https://bensonruan.com/image-to-text-ocr-with-tesseract-js

  • Not store a million text files - It would be nice if the OCRed text was stored somewhere other than a bunch of extra text files in the vault (like maybe in a search index file)… or stored directly below the image like an image alt text (messy when editing), etc…

You could store the text in %% comments in the note that embeds the image.
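As a rough sketch of that idea, the snippet below drops a one-line %% comment under each image embed. It assumes the OCR text has already been produced by some other tool and that Obsidian's search indexes comment text, as the suggestion implies. The "ocr:" marker and the function names are invented for this example.

```python
import re

# Matches ![[name]] and ![[name|size]] embeds and captures the file name
EMBED_RE = re.compile(r"!\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def ocr_comment(image_name, text):
    """Format OCR text as a single-line Obsidian comment."""
    # "%%" inside the text would close the comment early, so drop it,
    # then collapse all whitespace (including newlines) to single spaces.
    safe = " ".join(text.replace("%%", "").split())
    return f"%% ocr:{image_name} {safe} %%"


def add_ocr_comments(note_text, ocr_by_image):
    """Insert an OCR comment on the line after each embedded image.

    ocr_by_image maps image file names to their recognized text.
    Images that already have an "ocr:" comment in the note are skipped,
    so running this twice does not duplicate anything.
    """
    out = []
    for line in note_text.split("\n"):
        out.append(line)
        for match in EMBED_RE.finditer(line):
            name = match.group(1)
            if name not in ocr_by_image:
                continue
            if f"%% ocr:{name} " in note_text:
                continue  # comment already present somewhere in the note
            out.append(ocr_comment(name, ocr_by_image[name]))
    return "\n".join(out)
```

Keeping the comment on a single line makes it easy to find and replace later if the image is re-scanned.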

I’m going to play with the %% comments a bit, since they could also be useful for copying the contents of embedded Google Docs or Google Sheets and exposing them to the search.

You can use a # heading on top of the comment and then collapse it (with the “fold heading” setting enabled). That way it doesn’t look too messy.

I think the best option would be to store the OCR text in a plugin index somehow, then inject the results into the search if possible.
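A minimal sketch of what such an index might look like, outside of Obsidian: one JSON file that maps each image to its OCR text and to the notes that embed it, so a text hit can be reported against the note rather than the media file. The structure and function names are assumptions for illustration; injecting results into Obsidian's own search would need the plugin API and is not shown.

```python
import json
import re
from pathlib import Path

# Matches ![[name]] and ![[name|size]] embeds and captures the file name
EMBED_RE = re.compile(r"!\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def build_index(notes, ocr_by_image):
    """Build {image_name: {"text": ..., "notes": [...]}}.

    notes maps note paths to their markdown text; ocr_by_image maps
    image file names to text already produced by an OCR tool.
    """
    index = {}
    for note_path, text in notes.items():
        for name in EMBED_RE.findall(text):
            if name not in ocr_by_image:
                continue
            entry = index.setdefault(
                name, {"text": ocr_by_image[name], "notes": []}
            )
            if note_path not in entry["notes"]:
                entry["notes"].append(note_path)
    return index


def search(index, query):
    """Return the sorted note paths whose embedded images contain the query."""
    needle = query.lower()
    hits = set()
    for entry in index.values():
        if needle in entry["text"].lower():
            hits.update(entry["notes"])
    return sorted(hits)


def save_index(index, path):
    """Persist the whole index as a single JSON file."""
    Path(path).write_text(json.dumps(index, indent=2), encoding="utf-8")
```

This keeps everything in one file instead of one text file per image, which is the point of the "search index file" idea.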

Just throwing out an alternative workflow that doesn’t need a plug-in …

I threw that one OCR solution together for fun, but ultimately decided to not store PDFs in Obsidian.
Only content I write goes in Obsidian.
Reference material, PDFs, etc all go in an external tool (DEVONthink) and I just link to it from my markdown notes in Obsidian.

DEVONthink can index Obsidian’s vault too, so I still have a single place to search, including PDF contents.

For me, I never store PDFs in Obsidian. What I’m talking about are notes that involve screenshots (like when watching a YouTube video or a Zoom conference). I think it would probably be a pain to put all these in an alternate note software… though doable.

You are right; it’s doable, but I don’t find it to be a pain, and it’s quite usable to keep digital assets outside Obsidian and the textual notes and things I create inside Obsidian. A plug-in could be nice for this OCR case, but it’s not necessary.

I really want this to be a thing. I take a lot of screenshots of meeting notes/diagrams/conversations and I just want to be able to find them quickly. I’m bent on using Obsidian because of the Excalidraw plugin, but image OCR is one of the things I really miss from OneNote.

I don’t know what the plugin API looks like for Obsidian, but I think embedding OCR text as a hidden comment is the easiest option. If this idea gets more traction, maybe someone could then work out how to highlight the correct part of the image.

hi,
same boat as you.
i snapshot MCQs on my phone,
and want to OCR them and edit/study them in obsidian.
time flies; there are already plugins like “text extractor” and “taskbone ocr”.
the latter may require payment if used heavily.
text extractor claims to use tesseract, but doesn’t work well.

but there is a STANDALONE FOSS tool called capture2text, which also uses tesseract, but they have somehow modified it and it works very, very well.

capture2text supports a CLI (command line) mode, e.g. capture2text -i input.jpg -o output.txt (or output to the clipboard)

the only remaining thing is how to make it work smoothly on a whole batch instead of OCRing files one by one. thanks
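One possible way to batch this is a small script that walks a folder and calls the CLI once per image that has no text output yet. This is only a sketch: the -i and -o flags are the ones quoted in the post, while the executable name on your system, the .txt output naming, and the function names are assumptions.

```python
import subprocess
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def ocr_command(image, exe="capture2text"):
    """Build the command line for one image.

    -i / -o are the flags quoted in the post above; writing the result
    to a .txt file next to the image is an assumption of this sketch.
    """
    image = Path(image)
    return [exe, "-i", str(image), "-o", str(image.with_suffix(".txt"))]


def pending_images(folder):
    """Images in the folder that do not have a matching .txt file yet."""
    folder = Path(folder)
    return sorted(
        p
        for p in folder.iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
        and not p.with_suffix(".txt").exists()
    )


def ocr_folder(folder, exe="capture2text"):
    """Run the OCR tool once per pending image; stops on the first failure."""
    for image in pending_images(folder):
        subprocess.run(ocr_command(image, exe), check=True)
```

Running ocr_folder on the attachments folder from a scheduled task would get close to "automatic", and the resulting text could then be folded into the notes instead of being left as separate files.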

ps: i saw someone use templater for this with tesseract, so this may also be possible.

New user, coming over from Evernote. I have thousands and thousands of notes, mostly manuals in PDF format, that I really need to be searchable, so OCR is something I’d love to get working. How does something like this get funded and developed for Obsidian?

Do you have requirements which exceed what the Omnisearch search/OCR plugin can provide?

cc @ccchan

I will give it a try and report back. I was not able to distinguish this one from all of the others and their dependencies - tesseract, etc.

Thank you for the pointer.