Unfortunately, VLMs don’t seem to work. Even the best ones (e.g., Gemini 2.5 Pro) hallucinate to a laughable degree. I’ve been seeing them make up large-scale features in images, a kind of mistake that absolutely no human would ever make.
What does work is narrow AI trained to do very specific tasks (such as medical image segmentation). But I don’t think that type of AI will benefit much from massive scaling.