This is exactly what Google ASR does. Give it a try and watch how the results flow back to you; it certainly is not waiting for VAD segment breaks. I should know.

Streaming used to be something people cared about more. VAD is always part of those systems as well: you want it to start segments and to force a hard cut-off, but it is just the starting point. To me, streaming support is a big gap in the models available since Whisper came out, partly, I think, because it adds complexity to using the model, and latency has to be tuned and traded off against quality.
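To make the shape of that concrete, here is a rough Python sketch of the control loop I mean. It is not how Google or any real system does it: `is_speech` is a toy energy threshold standing in for a real VAD, `decode` is a hypothetical callback standing in for the model, and `partial_every`, `end_silence`, and `max_frames` are made-up knobs. The point is only that VAD opens the segment and enforces the cut-off, while partial results go out in between.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Frame = List[float]  # one fixed-size chunk of audio samples


def is_speech(frame: Frame, threshold: float = 0.01) -> bool:
    """Toy energy-based VAD (assumption): mean squared amplitude above a threshold."""
    if not frame:
        return False
    return sum(s * s for s in frame) / len(frame) > threshold


def stream_transcribe(
    frames: Iterable[Frame],
    decode: Callable[[List[Frame]], str],  # hypothetical model call
    partial_every: int = 2,   # emit a partial every N frames (latency knob)
    end_silence: int = 3,     # trailing silent frames that close a segment
    max_frames: int = 10,     # hard cut-off for one segment
) -> Iterator[Tuple[str, str]]:
    """Yield ("partial", text) while a segment is open and ("final", text) when it closes."""
    segment: List[Frame] = []
    silence = 0
    for frame in frames:
        speech = is_speech(frame)
        if not segment and not speech:
            continue  # VAD has not opened a segment yet
        segment.append(frame)
        silence = 0 if speech else silence + 1
        if silence >= end_silence or len(segment) >= max_frames:
            # Segment ends on sustained silence or on the hard cut-off.
            yield ("final", decode(segment))
            segment, silence = [], 0
        elif len(segment) % partial_every == 0:
            # Results flow back without waiting for the segment break.
            yield ("partial", decode(segment))
    if segment:
        yield ("final", decode(segment))  # flush whatever is left at end of stream
```

A smaller `partial_every` gets text back sooner but re-decodes more often and on less context, which is the latency versus quality trade-off in miniature. Re-decoding the whole segment for every partial is also the naive version; a real streaming model would carry state forward instead.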