so if i understand this correctly — you want the speech recognition model to ide...

GavCo · 2025-04-01T08:46:40 1743497200

I was wondering the same and found these related papers:

https://arxiv.org/pdf/2309.08561 https://arxiv.org/pdf/2406.02649

I haven't really dug in yet but from a quick skim, it looks promising. They show a big improvement over Whisper on a medical dataset (F1 increased from 80.5% to 96.58%).

The inference time for the keyword detection is about 10ms. If it scales linearly with additional keywords you could potentially scale to hundreds or thousands of keywords but it really depends on how sensitive you are to latency. For real-time with large vocabularies my guess is you might still want to fine-tune.

agold97 · 2025-04-01T08:59:26 1743497966

yeah — sounds about right. retraining the whole model just to add one jargon-y term isn’t super efficient. this approach lets you plug in a vocab list at runtime instead, which feels a lot more scalable.