Convert entire PDFs to Markdown (New Mistral OCR)

Mistral recently announced a SOTA OCR model that converts PDFs into markdown. It works pretty well, and it even crops out the images automatically. I wanted to use this in Obsidian, so I slightly changed the code they provide in their documentation, mainly to make the images work with wikilinks. By default it encoded the images directly in the markdown document, and that made my notes very slow.
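To illustrate the wikilink idea, here is a minimal sketch, not the code from the repository. It assumes the converted markdown contains images inlined as base64 data URIs (`![alt](data:image/png;base64,...)`); the function name `externalize_images`, the `prefix` parameter, and the file naming scheme are all made up for this example.

```python
import base64
import re
from pathlib import Path

# Matches Markdown images whose target is an inline base64 data URI,
# e.g. ![fig](data:image/png;base64,iVBORw0...)
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<ext>[a-zA-Z0-9.+-]+);base64,(?P<data>[A-Za-z0-9+/=\s]+)\)"
)


def externalize_images(markdown: str, attachments_dir: Path, prefix: str = "img") -> str:
    """Write each inlined image to attachments_dir and return the markdown
    with the data URIs replaced by Obsidian wikilink embeds.

    Hypothetical helper: the name, arguments and naming scheme are assumptions.
    """
    attachments_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def _replace(match: re.Match) -> str:
        nonlocal counter
        # "svg+xml" -> "svg", "jpeg" -> "jpg"; other subtypes are kept as-is.
        ext = match.group("ext").lower().split("+")[0]
        ext = "jpg" if ext == "jpeg" else ext
        filename = f"{prefix}-{counter}.{ext}"
        counter += 1
        # Decode the payload and store it as a regular attachment file.
        (attachments_dir / filename).write_bytes(base64.b64decode(match.group("data")))
        # Obsidian embed syntax: ![[file name]]
        return f"![[{filename}]]"

    return DATA_URI_IMAGE.sub(_replace, markdown)
```

Calling something like `externalize_images(text, Path("attachments"), prefix="my-paper")` would leave a note containing only short `![[my-paper-0.png]]` style embeds, with the image bytes stored as separate files, so the note itself stays small.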

I found it very useful for LaTeX formulas. Before, this was difficult: I was sending images of each page to ChatGPT, and it was clunky.

Here is the repository: pdf-ocr-obsidian, where I put a Python notebook you can all explore. I’m open to improvements, so feel free to open pull requests. It would be great if this could work inside Obsidian at some point, like the new web-browser plugin does with webpages, but with PDFs…

Here is an example of the results:


What has been your experience with Mistral OCR when it comes to text keeping its correct location and words not being split after the conversion? I’ve been using “Marker” and it’s fairly solid, though here and there I notice some text comes out like this: Constru ctive c riticism is welcome, but criticize ideas, not pe ople.

I haven’t seen any errors so far, I think. I hate that too. But since the model seems to use some kind of LLM, the phrases stay coherent. It could be worse in some cases, though: if there are hallucinations, the converted phrases would still make sense in context, but entire words could be made up, so mistakes would be harder to find.