PaliGemma on computer use data is absolutely not good. The difference between a FT YOLO model and a FT PaliGemma model is huge if generic bboxes are what you need. Microsoft's OmniParser also winds up using a YOLO backbone [1]. All of the browser use tools (like our friends at browser-use [2]) wind up trying to get a generic set of bboxes using the DOM and then applying generative models.
PaliGemma seems to fit into a completely different niche right now (VQA and Segmentation) that I don't really see having practical applications for computer use.
PaliGemma seems to fit into a completely different niche right now (VQA and Segmentation) that I don't really see having practical applications for computer use.
[1] https://huggingface.co/microsoft/OmniParser?language=python [2] https://github.com/browser-use/browser-use