LLMs are inherently bad at this due to tokenization, scaling, and lack of traini...

nostrebored · 2025-01-24T18:48:02 1737744482

PaliGemma on computer use data is absolutely not good. The difference between a FT YOLO model and a FT PaliGemma model is huge if generic bboxes are what you need. Microsoft's OmniParser also winds up using a YOLO backbone [1]. All of the browser use tools (like our friends at browser-use [2]) wind up trying to get a generic set of bboxes using the DOM and then applying generative models.

PaliGemma seems to fit into a completely different niche right now (VQA and Segmentation) that I don't really see having practical applications for computer use.

[1] https://huggingface.co/microsoft/OmniParser?language=python [2] https://github.com/browser-use/browser-use