The biggest takeaway is that they claim SOTA for multi-modal stuff even ahead of...

ACCount37 · 2025-09-23T22:29:39 1758666579

Most multi-modal input implementations suck, and a lot of them suck big time.

Doesn't seem to be far ahead of existing proprietary implementations. But it's still good that someone's willing to push that far and release the results. Getting multimodal input to work even this well is not at all easy.

Computer0 · 2025-09-23T22:34:57 1758666897

I feel like most Open Source releases, regardless of size, claim to be similar in output quality to SOTA closed source stuff.