Intriguing announcement; however, the examples on the mistral.ai page seem rather "easy".

What about rare glyphs in different languages using handwriting from previous centuries?

I've been dealing with OCR issues and evaluating different approaches for the past 5+ years at the national library where I work.

The usual consensus is that the widely used open-source Tesseract is inferior to commercial models.

That might be so without fine-tuning. However, you can perform supplemental training and build your own Tesseract models that outperform the base ones.

Case study of Kant's letters from the 18th century:

About 6 months ago, I tested OpenAI's approach to OCR on some old 18th-century letters that needed digitizing.

The results were rather good (90+% accuracy) with the usual hallucination here and there.

What was funny was that OpenAI was using base Tesseract to generate the segmentation and initial OCR.

The actual OCRed content before the last inference step was rather horrid, because the Tesseract model that OpenAI was using was not appropriate for the particular image.

When I took OpenAI out of the first step and switched to my own Tesseract models, I gained significantly in "raw" OCR accuracy at the character level.
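For anyone who wants to put a number on "raw" character-level accuracy, here is a minimal sketch. It assumes character error rate (CER: edit distance divided by the length of the ground-truth text) as the metric; which metric was actually used for the letters is not stated above, and the function names are purely illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # distance is symmetric; keep the DP row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # drop a character from a
                current[j - 1] + 1,      # add a character to a
                previous[j - 1] + cost,  # substitute (or keep, if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / length of the reference (ground-truth) text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical usage: compare raw OCR output against a hand-made transcription.
cer = character_error_rate("prerogative", "P!3goattie")
print(f"CER: {cer:.2f}")
```

Running this on the raw OCR output and again on the LLM-corrected text, against the same ground truth, is one way to see how much of a gain in the first step survives to the final result.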

Then I performed normal LLM inference at the last step.

What was a bit shocking: my actual gains for the task (human-readable text for general use) were not particularly significant.

That is, LLMs are fantastic at "untangling" a complete mess of tokens into something human-readable.

For example:

P!3goattie -> prerogative (that is, even when the surrounding text is similarly garbled)