
Yeah good point!

Yeah lolll, this is the memory you need

With the rise of vision-language models (VLMs) such as Qwen-VL and GPT-4.1, new end-to-end OCR models like DeepSeek-OCR have emerged. These models jointly understand visual and textual information, enabling direct interpretation of PDFs without an explicit layout-detection step.

However, this paradigm shift raises an important question:

If a VLM can already process both the document images and the query to produce an answer directly, do we still need the intermediate OCR step?
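
For concreteness, direct VLM processing might look roughly like the sketch below. This is only an illustration: it assumes the OpenAI-style chat API and pdf2image for rasterization, which are stand-ins rather than anything these particular models require.

  # Minimal sketch: render one PDF page to an image and send it, together with
  # the question, straight to a VLM with no OCR or layout-detection step.
  # The OpenAI client and pdf2image are illustrative choices only.
  import base64
  import io

  from openai import OpenAI
  from pdf2image import convert_from_path

  client = OpenAI()

  def ask_page(pdf_path: str, page_index: int, question: str) -> str:
      page = convert_from_path(pdf_path)[page_index]  # PIL image of the page
      buf = io.BytesIO()
      page.save(buf, format="PNG")
      b64 = base64.b64encode(buf.getvalue()).decode()
      resp = client.chat.completions.create(
          model="gpt-4.1",  # one of the VLMs mentioned above
          messages=[{"role": "user", "content": [
              {"type": "text", "text": question},
              {"type": "image_url",
               "image_url": {"url": f"data:image/png;base64,{b64}"}},
          ]}],
      )
      return resp.choices[0].message.content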


Interesting, is this based on an external Vector DB to store and process the PDF?


Thanks for the great question! We actually use a reasoning-based, vectorless approach. In short, it follows this process:

  1. Generate a table of contents (ToC) for the document.

  2. Read the ToC to select a relevant section.

  3. Extract relevant information from the selected section.

  4. If enough information has been gathered, provide the answer; otherwise, return to step 2.

We believe this approach closely mimics how a human would navigate and read long PDFs; a rough sketch of the loop follows below.
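
To make that loop concrete, here is a Python sketch under stated assumptions: ask_llm is a hypothetical stand-in for whatever chat-completion call is actually used, and section extraction is approximated by naive page slicing rather than real ToC-aware parsing.

  # Hypothetical sketch of the ToC-driven reading loop described above.
  def ask_llm(prompt: str) -> str:
      """Placeholder for a chat-completion call; not the project's real API."""
      raise NotImplementedError

  def answer(pages: list[str], question: str, max_rounds: int = 5) -> str:
      # Step 1: generate a table of contents for the document.
      toc = ask_llm("List this document's sections as 'title: start page'.\n\n"
                    + "\n\n".join(pages))
      notes: list[str] = []
      for _ in range(max_rounds):
          # Step 2: read the ToC (plus notes so far) to pick a relevant section.
          start = int(ask_llm(f"ToC:\n{toc}\n\nQuestion: {question}\n"
                              f"Notes so far: {notes}\n"
                              "Reply with only the start page of the most "
                              "relevant unread section."))
          # Step 3: extract relevant information from the selected section.
          section = "\n".join(pages[start - 1:start + 2])  # naive page slice
          notes.append(ask_llm(f"Question: {question}\nSection text:\n{section}\n"
                               "Note anything relevant to the question."))
          # Step 4: answer if enough has been gathered; otherwise return to step 2.
          verdict = ask_llm(f"Question: {question}\nNotes: {notes}\n"
                            "If the notes suffice, give the final answer; "
                            "otherwise reply exactly CONTINUE.")
          if verdict.strip() != "CONTINUE":
              return verdict
      return ask_llm(f"Question: {question}\nNotes: {notes}\nGive a best-effort answer.")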


Sounds interesting; I'll try it out.


Thanks, any feedback is welcome!

