
Co-founder of doctly.ai here (OCR tool)

I love Mistral and what they do. I got really excited about this, but was a little disappointed after my first few tests.

I tried a complex table that we use as a first test of any new model, and Mistral OCR decided the entire table should just be extracted as an 'image' and returned this markdown:

``` ![img-0.jpeg](img-0.jpeg) ```

I'll keep testing, but so far, very disappointing :(

The document I tried is the entire reason we created Doctly to begin with. We needed an OCR tool for the regulatory documents we work with, and nothing could really give us the right data.

Doctly uses a judge: it OCRs a document with multiple LLMs and decides which output to pick. It keeps re-running a page until the judge's score clears a threshold.
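Roughly, the loop looks like this (a simplified sketch: `ocr_with` and `judge_score` stand in for our real model calls and judge prompt, and the model names and threshold here are just placeholders):

```
def ocr_with(model: str, page_image: bytes) -> str:
    """Ask one LLM to transcribe the page to markdown (stub)."""
    raise NotImplementedError

def judge_score(page_image: bytes, markdown: str) -> float:
    """Ask a judge LLM to rate a transcription from 0.0 to 1.0 (stub)."""
    raise NotImplementedError

MODELS = ["gemini", "gpt-4o", "claude"]   # placeholder names
THRESHOLD = 0.9                           # stop once the judge is satisfied
MAX_ROUNDS = 3                            # cap the re-runs per page

def transcribe_page(page_image: bytes) -> str:
    best, best_score = "", float("-inf")
    for _ in range(MAX_ROUNDS):
        for model in MODELS:
            candidate = ocr_with(model, page_image)
            score = judge_score(page_image, candidate)
            if score > best_score:
                best, best_score = candidate, score
        # Re-run the page only while the judge is still unhappy.
        if best_score >= THRESHOLD:
            break
    return best
```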

I would have loved to add this into the judge list, but might have to skip it.



Where did you test it? At the end of the post they say:

> Mistral OCR capabilities are free to try on le Chat

but when asked, Le Chat responds:

> can you do ocr?

> I don't have the capability to perform Optical Character Recognition (OCR) directly. However, if you have an image with text that you need to extract, you can describe the text or provide details, and I can help you with any information or analysis related to that text. If you need OCR functionality, you might need to use a specialized tool or service designed for that purpose.

Edit: Tried anyway by attaching an image; it said it could do OCR and then output... completely random text that had absolutely nothing to do with the text in the image! Concerning.

Tried again with a higher-definition image; it output only the first twenty words or so of the page.

Did you try using the API?


Yes I used the API. They have examples here:

https://docs.mistral.ai/capabilities/document/

I sent a base64 encoding of the image of the PDF page. The output is an object containing the markdown and coordinates for the images:

```
[OCRPageObject(
    index=0,
    markdown='![img-0.jpeg](img-0.jpeg)',
    images=[OCRImageObject(
        id='img-0.jpeg',
        top_left_x=140, top_left_y=65,
        bottom_right_x=2136, bottom_right_y=1635,
        image_base64=None)],
    dimensions=OCRPageDimensions(dpi=200, height=1778, width=2300))]
model='mistral-ocr-2503-completion'
usage_info=OCRUsageInfo(pages_processed=1, doc_size_bytes=634209)
```
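The call itself is only a few lines with the mistralai Python SDK. This is a sketch based on the docs linked above; double-check parameter names there (note the `mistral-ocr-latest` alias versus the `mistral-ocr-2503` string echoed back in the response):

```
import base64
from mistralai import Mistral  # pip install mistralai

client = Mistral(api_key="...")  # your API key

# Base64-encode an image of the PDF page.
with open("page.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "image_url",
              "image_url": f"data:image/png;base64,{b64}"},
)
print(resp.pages[0].markdown)  # for my page: just '![img-0.jpeg](img-0.jpeg)'
```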


Any luck with this? I'm trying to process photos of paperwork (.pdf, .png) and got the same results as you.

Feels like something is missing in the docs, or the API itself.

https://imgur.com/a/1J9bkml


Interestingly, I'm currently scanning the hundreds of journal papers my grandfather authored in medicine and thinking through what to do about graphs. I was expecting to do some form of multi-phase, agent-based generation of LaTeX or SVG rather than a verbal summary of the graphs; at least in his generation of authorship, his papers clearly explained the graphs already. I was naturally excited to see your post, but when I looked at the examples, what I saw was, effectively, a more verbose form of

``` ![img-0.jpeg](img-0.jpeg) ```

I'm assuming this is partly because your use case targets RAG under various assumptions, but also partly because multimodal models aren't yet near what I'd need to be successful?


We need to update the examples on the front page. Currently, anything considered a chart/graph/figure is converted to a description; things like logos or photos get an image tag. You can also choose to exclude them entirely.

The difference here is that Mistral treated the entire page as an image tag (it's just a table of text in my document) rather than being more selective.

I do like that they give you coordinates for the images, though; we need to do something like that.

Give the actual tool a try; would love to get your feedback for that use case. It gives you 100 free credits initially, but if you email me ([email protected]) I can give you an extra 500 (goes for anyone else here too).


If you have a judge system and Mistral performs well on other tests, wouldn't you want to include it, so that whenever it scores highest in your judge's ranking, the most accurate result gets selected? Or are you saying that Mistral's image markdown would score higher with your judge?


We'll definitely be doing more tests, but results like the one I got on the complex test would score low, and including the model might not be worth the extra cost of the judging itself.

In our current setup Gemini wins most often. We enter multiple generations from each model into the 'tournament'; sometimes one generation from Gemini ends up at the top while another lands at the bottom of the same tournament.
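Concretely, each generation competes on its own, so the same model can place both first and last. An illustrative sketch, reusing the stand-in `ocr_with`/`judge_score` helpers and `MODELS` list from my sketch above (not our actual code):

```
def run_tournament(page_image: bytes, generations_per_model: int = 3):
    # Every (model, generation) pair is an independent entrant.
    entrants = [
        (model, ocr_with(model, page_image))
        for model in MODELS
        for _ in range(generations_per_model)
    ]
    # Rank entrants by judge score alone; model identity is ignored,
    # so two Gemini generations can land at opposite ends of the ranking.
    return sorted(
        entrants,
        key=lambda entrant: judge_score(page_image, entrant[1]),
        reverse=True,
    )
```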


Does doctly do handwritten forms like dates?

I have a lot of "This document filed and registered in the county of ______ on ______ of _____ 2023" sort of thing.


We've been getting great results with those as well. But of course there is always some chance of not getting it perfect, especially with varied handwriting.

Give it a try; no credit card needed. If you email me ([email protected]) I can give you extra free credits for testing.


Just tried it. Got all the dates correct and even extracted signatures really well.

Now to figure out how many millions of pages I have.


How do you stay competitive at $2 per 100 pages when Mistral and others offer approximately 1000 pages for $1?


Customers are willing to pay for accuracy that existing solutions don't deliver. We started out needing an accurate solution for a RAG product we were building, and none of the solutions we tried provided the accuracy we needed.


Why pay more for Doctly than for AWS Textract?


I did not try Doctly, but AWS Textract does not support Russian, which is my use case, so the output is completely useless.


Great question. The language models are definitely beating the old tools. Take a look at Gemini for example.

Doctly runs a tournament-style judge: it runs multiple generations across LLMs and picks the best one, outperforming any single generation from a single model.


Would love to see the test file.


Would be glad to see benchmarking results.


This is a good idea. We should publish benchmark results and comparisons.



