Search inside handwritten PDF (Imported from GoodNotes)

dipaktandel · March 16, 2023, 8:34pm

What I’m trying to do

I have imported my handwritten notes from GoodNotes into Obsidian and want to be able to search them.

Things I have tried

I have installed and enabled the Omnisearch and Text Extractor plugins, and PDF and image indexing are both enabled, but Omnisearch does not show any handwritten notes in its search results.

The macOS PDF viewer is able to find both occurrences of the word “variance”.

The same PDF is not searchable in Obsidian with Omnisearch and Text Extractor: Omnisearch does not find any result for the word “variance”.

I tried many easy handwritten words that are written very clearly, but nothing works.

scambier · March 17, 2023, 10:36am

Unfortunately, there’s virtually 0 chance that Text Extractor will be able to extract handwritten content.

PDFs are not OCRed, they’re just “read” by Text Extractor, which tries to find embedded text.
Image files are OCRed with Tesseract, which is not trained for handwriting.

The only way that could work is if 1) your PDF has been “pre-OCRed” by another process (such as Adobe), 2) the resulting text is saved inside the PDF, and 3) Text Extractor can find that text.

dipaktandel · March 17, 2023, 12:46pm

@scambier thank you for your suggestions. I think the text is already embedded in the PDF, and search works in the Chrome browser as well.

Embedded Text

Search results in chrome

dipaktandel · March 17, 2023, 12:59pm

Text Extractor is extracting the below text from the PDF, which means it’s working but not accurate.

But I am surprised that it’s not using embedded text or maybe unable to read the embedded text properly.

I am attaching the test PDF here for reference.
Time Series.zip (722.9 KB)

I used this tool to find the embedded text in the PDF: FREE PDF Documents analyser

holroy · March 17, 2023, 1:57pm

I’ve not worked with the Text Extractor plugin, or the other tools you mention, but the text you provided as example in that last part, could that be stripped of all white space, and then be used in your search?

It seems like that text just introduces a lot of spaces where you didn’t intend for them to be, but other than that the text seems to be coherent enough for searching.

You could end up with some false positives if you search for something which happens to be the end of one word and the start of another. (For example, if you searched the previous sentence for “fan”, it would produce a hit on “oFANother”.) But I’m not sure this would be a great issue in general.
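
A minimal Python sketch of the whitespace-stripping idea, including the false positive. The helper names `squash` and `loose_contains` are hypothetical and the sample strings are made up; this is not how Text Extractor or Omnisearch work internally, it only illustrates the trade-off.

```python
import re


def squash(text: str) -> str:
    """Remove all whitespace and lowercase, so stray spaces cannot break a match."""
    return re.sub(r"\s+", "", text).lower()


def loose_contains(haystack: str, needle: str) -> bool:
    """Whitespace-insensitive, case-insensitive substring test."""
    return squash(needle) in squash(haystack)


# Extracted text with spurious spaces (made-up sample, not taken from the attached PDF).
extracted = "The v ar ian ce of a time se ries"
print(loose_contains(extracted, "variance"))  # True

# The false positive: "fan" only appears once the space in "of another" is removed.
sentence = "the end of one word and the start of another"
print("fan" in sentence.lower(), loose_contains(sentence, "fan"))  # False True
```

Requiring longer queries, or matching against the original spacing first and falling back to the squashed text, would reduce such false hits at the cost of some recall.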

dipaktandel · March 17, 2023, 2:18pm

@holroy I was thinking along similar lines. If the extractor stripped all the white space, the text would be searchable, though not 100% correct, so it would be a temporary fix. The better option is to use the embedded text properly. I don’t see any extra spaces in the embedded text.

dipaktandel · March 17, 2023, 9:14pm

I tried pdfminer and the result is encouraging. As all the text is embedded in the PDF, it was able to extract the text with quite good accuracy. @scambier maybe Text Extractor was not extracting this embedded text and was instead using Tesseract to OCR the PDF, which is not very accurate. Is it possible to use this or any better API? I am ready to spend some time and contribute back.
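
For anyone wanting to reproduce this, here is a minimal sketch. It assumes the maintained fork `pdfminer.six` (`pip install pdfminer.six`) and its `pdfminer.high_level.extract_text` function; the post does not say which pdfminer variant or call was actually used, and the helper names below are hypothetical.

```python
import re


def extract_embedded_text(pdf_path: str) -> str:
    """Return the text layer stored in the PDF (no OCR is performed)."""
    # Imported lazily so count_word() below works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def count_word(text: str, word: str) -> int:
    """Count case-insensitive whole-word occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Made-up sample text standing in for the output of extract_embedded_text().
sample = "Variance measures spread; the variance of a stationary series is constant."
print(count_word(sample, "variance"))  # 2
```

With a real file, something like `count_word(extract_embedded_text("notes.pdf"), "variance")` (hypothetical file name) would show whether the embedded text layer contains the word at all, independently of Obsidian.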

scambier · March 18, 2023, 9:15am

If anyone is willing to contribute on Text Extractor, here’s a discussion about the current state of PDFs: Improving text extraction of PDFs · scambier/obsidian-text-extractor · Discussion #21 · GitHub

If you wish to work on other extractors (Word, Excel, HTML, anything), you’re also more than welcome.
