With the rise of vision-language models (VLMs) such as Qwen-VL and GPT-4.1, new end-to-end OCR models like DeepSeek-OCR have emerged. These models jointly understand visual and textual information, enabling direct interpretation of PDFs without an explicit layout-detection step.
However, this paradigm shift raises an important question:
If a VLM can already process both the document images and the query to produce an answer directly, do we still need the intermediate OCR step?
Thanks for the great question! We actually use a reasoning-based, vectorless approach. In short, it follows this process:
1. Generate a table of contents (ToC) for the document.
2. Read the ToC to select a relevant section.
3. Extract relevant information from the selected section.
4. If enough information has been gathered, provide the answer; otherwise, return to step 2.
We believe this approach closely mimics how a human would navigate and read long PDFs.
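For concreteness, here is a minimal sketch of that loop in Python. It assumes step 1 has already produced a `sections` mapping from ToC titles to section text; the `answer_question` name, the `llm` callable, and the prompts are illustrative placeholders rather than our actual implementation.

```python
# Minimal sketch of the ToC-guided, vectorless loop described above.
# All names and prompts here are illustrative; `llm` stands in for
# whatever text-in/text-out model call you use.
from typing import Callable, Dict, List, Set


def answer_question(
    question: str,
    sections: Dict[str, str],       # ToC title -> section text (output of step 1)
    llm: Callable[[str], str],      # any text-in/text-out model call
    max_rounds: int = 5,
) -> str:
    notes: List[str] = []           # information extracted so far
    visited: Set[str] = set()

    for _ in range(max_rounds):
        # Step 2: read the ToC and pick the most relevant unread section.
        toc = "\n".join(t for t in sections if t not in visited)
        title = llm(
            f"Question: {question}\n"
            f"Table of contents:\n{toc}\n"
            "Reply with the single section title most likely to contain the answer."
        ).strip()
        if title not in sections:
            break                   # nothing usable was selected; stop early
        visited.add(title)

        # Step 3: extract relevant information from the selected section.
        notes.append(llm(
            f"Question: {question}\n"
            f"Section '{title}':\n{sections[title]}\n"
            "Extract only the facts relevant to the question."
        ))

        # Step 4: answer if we have enough, otherwise loop back to step 2.
        verdict = llm(
            f"Question: {question}\nNotes so far:\n" + "\n".join(notes) +
            "\nIs this enough to answer? Reply YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            break

    # Final answer grounded only in the extracted notes.
    return llm(
        f"Question: {question}\nNotes:\n" + "\n".join(notes) +
        "\nAnswer the question using only these notes."
    )
```

The `max_rounds` bound keeps the loop from cycling indefinitely when the document simply doesn't contain the answer; in that case the model answers from whatever notes it has gathered.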