Mistral recently announced a SOTA OCR model that converts PDFs into Markdown. It works pretty well and even crops out the images automatically. I wanted to use it in Obsidian, so I tweaked the code from their documentation, mainly so that the images work with wikilinks. By default the images were encoded directly in the Markdown document, and that made my notes very slow.
I found it very useful for LaTeX formulas. Before, that was difficult: I was sending images of each page to ChatGPT, which was clunky.
Here is the repository: pdf-ocr-obsidian, where I put a Python notebook you can all explore. I’m open to improvements, so feel free to open pull requests. It would be great if this could work inside Obsidian at some point, like the new web-browser plugin does with webpages, but with PDFs…
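To give an idea of the image handling, here is a minimal sketch, not the notebook's actual code. It assumes you already have the page Markdown plus a mapping from image name to base64 data; the function name `save_images_and_use_wikilinks` and its parameters are hypothetical. It writes each image to an attachments folder and swaps the standard Markdown image link for an Obsidian wikilink.

```python
import base64
import re
from pathlib import Path


def save_images_and_use_wikilinks(
    markdown: str, images: dict[str, str], attachments_dir: Path
) -> str:
    """Save base64 images to disk and rewrite their links as ![[name]].

    `images` is assumed to map an image name as it appears in the Markdown
    (e.g. "img-0.jpeg") to its base64 payload, with or without a
    "data:image/...;base64," prefix.
    """
    attachments_dir.mkdir(parents=True, exist_ok=True)
    for name, payload in images.items():
        # Drop the data-URI header if present, keeping only the base64 body.
        b64 = payload.split(",", 1)[1] if payload.startswith("data:") else payload
        (attachments_dir / name).write_bytes(base64.b64decode(b64))

        # Replace ![any alt text](name) with the wikilink ![[name]].
        pattern = re.compile(r"!\[[^\]]*\]\(" + re.escape(name) + r"\)")
        markdown = pattern.sub(lambda _match, n=name: f"![[{n}]]", markdown)
    return markdown
```

Pointing `attachments_dir` at the vault's attachment folder should be enough for Obsidian to resolve the links, since wikilinks are looked up by file name; the note itself then only contains short links instead of the encoded image data.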
What has been your experience with Mistral OCR when it comes to text keeping its correct location and words not being split after the conversion? I’ve been using “Marker” and it’s fairly solid, but here and there I notice text that comes out like this: Constru ctive c riticism is welcome, but criticize ideas, not pe ople.
I haven’t seen any errors so far, I think. I hate that too. Since the model seems to use some kind of LLM, the phrases stay coherent. But it could be worse in some cases: if there are hallucinations, the converted phrases would make sense in context while entire words could be made up, so mistakes would be harder to find.
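A rough way to flag this kind of split in converted text is to check whether two neighbouring fragments join into a known word. This is only a sketch under assumptions: the function name `find_split_words` is hypothetical, and you have to supply your own word list as `vocabulary`.

```python
def find_split_words(text: str, vocabulary: set[str]) -> list[str]:
    """Return words that appear to have been split into two adjacent tokens.

    A pair of tokens is flagged when the joined form is in `vocabulary`
    but at least one of the two fragments is not a known word itself.
    """
    punctuation = ".,;:!?\"'()"
    tokens = [t.strip(punctuation).lower() for t in text.split()]
    hits = []
    for left, right in zip(tokens, tokens[1:]):
        joined = left + right
        if joined in vocabulary and (
            left not in vocabulary or right not in vocabulary
        ):
            hits.append(joined)
    return hits
```

It only catches words split into exactly two pieces and depends entirely on the quality of the word list, so treat the hits as places to look at rather than as a list of confirmed errors.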
Thanks for sharing! I ran across this post a few months ago, and it inspired me to write a plugin that uses Mistral OCR to extract text from PDFs, documents, images, etc. embedded in notes. It’s a bit different from your approach (my main focus was on extracting text to make it searchable, so it doesn’t extract images), but I hope you don’t mind me sharing it here too in case anyone finds it helpful for their use case:
Hello! Great to hear that. Sure, I’ll also add it to my GitHub project README in case somebody else finds it useful. I’ll test the plugin too; it would be really useful to have it integrated.