I love Mistral and what they do. I got really excited about this, but was a little disappointed after my first few tests.
I tried a complex table that we use as a first test of any new model, and Mistral OCR decided the entire table should just be extracted as an 'image' and returned this markdown:
```

```
I'll keep testing, but so far, very disappointing :(
This document I tried is the entire reason we created Doctly to begin with. We needed an OCR tool for the regulatory documents we use, and nothing could really give us the right data.
Doctly uses a judge: it OCRs a document with multiple LLMs and decides which output to pick. It will keep re-running the page until the judge scores above a certain threshold.
I would have loved to add this to the judge list, but might have to skip it.
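For anyone curious how a loop like that might be wired up, here's a minimal sketch; `ocr_page` and `judge_score` are hypothetical stand-ins, not Doctly's actual internals, and the model list and threshold are made up:
```
# Minimal sketch of a score-threshold judge loop (hypothetical helpers,
# not Doctly's actual internals).
from typing import Callable

MODELS = ["gemini", "gpt-4o", "claude"]  # assumed candidate models
THRESHOLD = 0.9                          # assumed acceptance score
MAX_ROUNDS = 3                           # give up after this many passes

def best_ocr(page: bytes,
             ocr_page: Callable[[bytes, str], str],
             judge_score: Callable[[bytes, str], float]) -> str:
    """OCR the page with every model, keep the judge's favorite,
    and re-run until the score clears the threshold."""
    best_text, best_score = "", float("-inf")
    for _ in range(MAX_ROUNDS):
        for model in MODELS:
            text = ocr_page(page, model)     # one OCR generation
            score = judge_score(page, text)  # judge rates fidelity 0..1
            if score > best_score:
                best_text, best_score = text, score
        if best_score >= THRESHOLD:          # good enough, stop early
            break
    return best_text
```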
Where did you test it? At the end of the post they say:
> Mistral OCR capabilities are free to try on le Chat
but when asked, Le Chat responds:
> can you do ocr?
> I don't have the capability to perform Optical Character Recognition (OCR) directly. However, if you have an image with text that you need to extract, you can describe the text or provide details, and I can help you with any information or analysis related to that text. If you need OCR functionality, you might need to use a specialized tool or service designed for that purpose.
Edit: Tried anyway by attaching an image; it said it could do OCR and then output... completely random text that had absolutely nothing to do with the text in the image!... Concerning.
Tried again with a higher-definition image; it output only the first twenty words or so of the page.
Interestingly, I’m currently scanning the hundreds of medical journal papers my grandfather authored and thinking through what to do about graphs. I was expecting to do some form of multiphase, agent-based generation of LaTeX or SVG rather than a verbal summary of the graphs; at least in his generation of authorship, his papers clearly explained the graphs already. I was naturally excited to see your post, but when I looked at the examples, what I saw was, effectively, a more verbose form of
```  ```
I’m assuming this is partially because your use case is targeting RAG under various assumptions, but also partially because multimodal models aren’t anywhere near what I would need to be successful with?
We need to update the examples on the front page. Currently, for things that are considered charts/graphs/figures, we convert to a description. For things like logos or images, we do an image tag. You can also choose to exclude them.
The difference with this is that it took the entire page as an image tag (it's just a table of text in my document), rather than being more selective.
I do like that they give you coordinates for the images, though; we need to do something like that.
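For reference, here's roughly how those coordinates come back from their API, going by the launch docs; the field names are from memory, so treat them as assumptions and verify against the current docs:
```
# Rough sketch based on the Mistral OCR launch docs; field names are
# from memory and may have changed, so check the current docs.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

resp = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url",
              "document_url": "https://example.com/doc.pdf"},  # placeholder URL
)

for page in resp.pages:
    for img in page.images:
        # Bounding box of each extracted image on the page.
        print(img.id, img.top_left_x, img.top_left_y,
              img.bottom_right_x, img.bottom_right_y)
```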
Give the actual tool a try. Would love to get your feedback for that use case. It gives you 100 free credits initially but if you email me ([email protected]), I can give you an extra 500 (goes for anyone else here also)
If you have a judge system, and Mistral performs well on other tests, wouldn't you want to include it so that if it scores highest in your judge's ranking, it would select the most accurate result? Or are you saying that Mistral's image markdown would score higher with your judge?
We'll definitely be doing more tests, but the results I got on the complex tests would receive a lower score and might not be worth the extra cost of the judging itself.
In our current setup, Gemini wins most often. We enter multiple generations from each model into the 'tournament'; sometimes one generation from Gemini can be at the top while another from the same model lands at the bottom of the same tournament.
We've been getting great results with those as well. But of course there is always some chance of not getting it perfect, especially with different handwriting.
Give it a try; no credit card needed. If you email me ([email protected]) I can give you extra free credits for testing.
Customers are willing to pay for accuracy compared to existing solutions out there. We started out in need of an accurate solution for a RAG product we were building, but none of the solutions we tried were providing the accuracy we needed.
Great question. The language models are definitely beating the old tools; take a look at Gemini, for example.
Doctly runs a tournament-style judge: it runs multiple generations across LLMs and picks the best one, outperforming any single generation from a single model.
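A rough sketch of what tournament-style selection can look like; `judge_prefers` is a hypothetical stand-in for the LLM judge, not our actual implementation:
```
# Hypothetical single-elimination tournament over OCR candidates;
# judge_prefers(page, a, b) stands in for an LLM judge that returns
# True when transcription `a` reads more faithfully than `b`.
from typing import Callable, List

def tournament_pick(page: bytes, candidates: List[str],
                    judge_prefers: Callable[[bytes, str, str], bool]) -> str:
    """Pairwise-eliminate candidates until one transcription remains."""
    remaining = list(candidates)
    while len(remaining) > 1:
        winners = []
        for a, b in zip(remaining[::2], remaining[1::2]):
            winners.append(a if judge_prefers(page, a, b) else b)
        if len(remaining) % 2:          # odd one out gets a bye
            winners.append(remaining[-1])
        remaining = winners
    return remaining[0]
```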