I've found that the stunning OCR results so far came from models that were trained on the same category of files as the examples. Is that the case here, or can this handle a variety of documents?
After four years of "this VLM will solve OCR forever!" posts, I've firmly put VLMs in the "useless until 100T parameters" category.
Someday, when there's enough internal state and training data for them to recognize tables, images, and text, we'll get a GPT-3-like moment that makes regular OCR obsolete.
But that day is very far off, and everyone I've talked with or consulted about using VLMs in their pipeline has been better served by doing something specific to their use case.
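To make that concrete, here's a minimal sketch of the kind of use-case-specific tooling I mean (assuming pytesseract and Pillow are installed; `document.png` and the confidence cutoff of 60 are placeholders). Unlike a VLM's free-text answer, plain Tesseract hands you word-level boxes and confidence scores you can actually act on:

```python
import pytesseract
from PIL import Image

# image_to_data returns word-level results as a dict of parallel lists.
data = pytesseract.image_to_data(
    Image.open("document.png"),  # placeholder input file
    output_type=pytesseract.Output.DICT,
)

for text, conf, x, y, w, h in zip(
    data["text"], data["conf"],
    data["left"], data["top"], data["width"], data["height"],
):
    if not text.strip():
        continue  # skip non-word structural entries
    conf = int(float(conf))
    # Low-confidence words can be flagged for human review instead of
    # silently trusted -- the recovery path a VLM's prose answer lacks.
    flag = "?" if conf < 60 else " "
    print(f"{flag} ({x},{y} {w}x{h}) conf={conf:3d} {text}")
```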
Here, for example, is what one VLM gave me when asked to extract the text inside each box of a test image while ignoring the text in sub-boxes:

---

To extract the text contained within each box while ignoring the text in sub-boxes, we can follow these steps:
1. Identify the outermost box and extract its text.
2. Move to the next outermost box and extract its text, ignoring any text within its sub-boxes.
3. Continue this process for all boxes.
Let's apply this to the image provided:
1. The outermost box contains the text: "This should be second".
2. The next outermost box (ignoring the sub-box within it) contains the text: "First".
3. The next box contains the text: "And also this".
4. The final box contains the text: "The quick brown fox".
So, the extracted text from each box, ignoring sub-boxes, is:
1. "This should be second"
2. "First"
3. "And also this"
4. "The quick brown fox"
---
As you can plainly see, it's _wildly_ wrong, and it gives you no way to even try to recover from those errors.
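For contrast, here's how I'd sketch a task-specific classical pipeline for that exact nested-boxes image: let OpenCV's contour hierarchy find the rectangles, blank out each box's children, and OCR only what's left. (This assumes opencv-python and pytesseract; `boxes.png` and the area threshold are placeholders you'd tune for your own documents.)

```python
import cv2
import pytesseract

img = cv2.imread("boxes.png")  # placeholder test image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# RETR_TREE recovers the full nesting structure:
# hierarchy[0][i] = [next_sibling, prev_sibling, first_child, parent]
contours, hierarchy = cv2.findContours(binary, cv2.RETR_TREE,
                                       cv2.CHAIN_APPROX_SIMPLE)

for i, contour in enumerate(contours):
    x, y, w, h = cv2.boundingRect(contour)
    if w * h < 1000:  # placeholder: skip glyph-sized contours
        continue
    region = img[y:y + h, x:x + w].copy()

    # Paint over every direct child box so its text is ignored;
    # grandchildren sit inside the children, so they vanish too.
    child = hierarchy[0][i][2]
    while child != -1:
        cx, cy, cw, ch = cv2.boundingRect(contours[child])
        cv2.rectangle(region, (cx - x, cy - y),
                      (cx - x + cw, cy - y + ch), (255, 255, 255), -1)
        child = hierarchy[0][child][0]  # advance to next sibling

    text = pytesseract.image_to_string(region).strip()
    if text:
        print(f"box at ({x},{y}): {text!r}")
```

Every step is inspectable: if a box gets missed you can see it in the contours, and if the OCR misreads you still have the crop to retry with different settings.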