With the rise of vision-language models (VLMs) such as Qwen-VL and GPT-4.1, new end-to-end OCR models like DeepSeek-OCR have emerged. These models jointly understand visual and textual information, enabling direct interpretation of PDFs without an explicit layout-detection step.
However, this paradigm shift raises an important question:
If a VLM can already process both the document images and the query to produce an answer directly, do we still need the intermediate OCR step?
Thanks for the great question! We actually use a reasoning-based, vectorless approach. In short, it follows this process:
1. Generate a table of contents (ToC) for the document.
2. Read the ToC to select a relevant section.
3. Extract relevant information from the selected section.
4. If enough information has been gathered, provide the answer; otherwise, return to step 2.
We believe this approach closely mimics how a human would navigate and read long PDFs.
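For concreteness, here is a minimal sketch of that loop in Python. It assumes step 1 has already produced a `sections` mapping from ToC titles to section text; the `answer_question` name, the `llm` callable, and the prompts are illustrative placeholders rather than our actual implementation.

```python
# Minimal sketch of the ToC-guided, vectorless loop described above.
# All names and prompts here are illustrative; `llm` stands in for
# whatever text-in/text-out model call you use.
from typing import Callable, Dict, List, Set


def answer_question(
    question: str,
    sections: Dict[str, str],       # ToC title -> section text (output of step 1)
    llm: Callable[[str], str],      # any text-in/text-out model call
    max_rounds: int = 5,
) -> str:
    notes: List[str] = []           # information extracted so far
    visited: Set[str] = set()

    for _ in range(max_rounds):
        # Step 2: read the ToC and pick the most relevant unread section.
        toc = "\n".join(t for t in sections if t not in visited)
        title = llm(
            f"Question: {question}\n"
            f"Table of contents:\n{toc}\n"
            "Reply with the single section title most likely to contain the answer."
        ).strip()
        if title not in sections:
            break                   # nothing usable was selected; stop early
        visited.add(title)

        # Step 3: extract relevant information from the selected section.
        notes.append(llm(
            f"Question: {question}\n"
            f"Section '{title}':\n{sections[title]}\n"
            "Extract only the facts relevant to the question."
        ))

        # Step 4: answer if we have enough, otherwise loop back to step 2.
        verdict = llm(
            f"Question: {question}\nNotes so far:\n" + "\n".join(notes) +
            "\nIs this enough to answer? Reply YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            break

    # Final answer grounded only in the extracted notes.
    return llm(
        f"Question: {question}\nNotes:\n" + "\n".join(notes) +
        "\nAnswer the question using only these notes."
    )
```

The `max_rounds` bound keeps the loop from cycling indefinitely when the document simply doesn't contain the answer; in that case the model answers from whatever notes it has gathered.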