Show HN: Windows port of OpenAI's Whisper automatic speech recognition model (github.com/const-me)
43 points by Const-me on Jan 16, 2023 | 20 comments
This project is a Windows port of the whisper.cpp implementation: https://github.com/ggerganov/whisper.cpp

Which in turn is a C++ port of OpenAI's Whisper automatic speech recognition (ASR) model: https://github.com/openai/whisper

The implementation has no dependencies, is usually much faster than realtime, and should hopefully work on most Windows computers in the world.




whisper.cpp is said to already run on Windows. What's the difference?


whisper.cpp runs on the CPU. My version runs on the GPU, because Windows includes a good vendor-agnostic GPU API, Direct3D. On my desktop computer, the performance difference between them is about an order of magnitude. My version is even twice as fast as OpenAI’s original GPGPU implementation, which is based on PyTorch and CUDA.
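
To illustrate the vendor-agnostic part, here’s a minimal DirectCompute skeleton (error handling and the actual shader omitted; this is a sketch, not code from the library):

    #include <d3d11.h>
    #pragma comment( lib, "d3d11.lib" )

    // Sketch: GPU compute through Direct3D 11, no vendor SDK required.
    // The same binary runs on NVIDIA, AMD and Intel GPUs.
    int main()
    {
        ID3D11Device* dev = nullptr;
        ID3D11DeviceContext* ctx = nullptr;
        D3D11CreateDevice( nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
            nullptr, 0, D3D11_SDK_VERSION, &dev, nullptr, &ctx );

        // Real code would compile an HLSL compute shader, create buffers
        // and views, bind them with CSSetShader / CSSetUnorderedAccessViews,
        // then launch thread groups:
        // ctx->Dispatch( groupsX, groupsY, groupsZ );

        ctx->Release();
        dev->Release();
        return 0;
    }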

The original version only supports *.wav audio files with a 16kHz sample rate; my version supports most audio and video codecs at any sample rate, because Windows comes with built-in APIs to decode audio and convert between sample rates.
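
For reference, a minimal sketch of decoding through Media Foundation’s source reader (error handling omitted; resampling to 16kHz is a separate step, Media Foundation ships an Audio Resampler DSP for that):

    #include <mfapi.h>
    #include <mfidl.h>
    #include <mfreadwrite.h>
    #pragma comment( lib, "mfplat.lib" )
    #pragma comment( lib, "mfreadwrite.lib" )
    #pragma comment( lib, "mfuuid.lib" )

    // Sketch: decode any supported container/codec into uncompressed PCM.
    void decodeAudio( const wchar_t* path )
    {
        MFStartup( MF_VERSION );

        IMFSourceReader* reader = nullptr;
        MFCreateSourceReaderFromURL( path, nullptr, &reader );

        // Ask the reader to decode whatever codec is in the file into PCM.
        IMFMediaType* mt = nullptr;
        MFCreateMediaType( &mt );
        mt->SetGUID( MF_MT_MAJOR_TYPE, MFMediaType_Audio );
        mt->SetGUID( MF_MT_SUBTYPE, MFAudioFormat_PCM );
        reader->SetCurrentMediaType( (DWORD)MF_SOURCE_READER_FIRST_AUDIO_STREAM, nullptr, mt );

        // Pull decoded samples until the end of the stream.
        for( ;; )
        {
            DWORD flags = 0;
            IMFSample* sample = nullptr;
            reader->ReadSample( (DWORD)MF_SOURCE_READER_FIRST_AUDIO_STREAM,
                0, nullptr, &flags, nullptr, &sample );
            if( flags & MF_SOURCE_READERF_ENDOFSTREAM )
                break;
            if( sample )
            {
                // ...copy PCM bytes out of the sample's media buffer...
                sample->Release();
            }
        }

        mt->Release();
        reader->Release();
        MFShutdown();
    }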

My version can capture audio directly from microphones, again because Windows comes with a Microsoft-supported API to deal with audio capture devices.


> The original version only supports *.wav audio files with 16kHz sample rate

This particular point is not true (at least not fully). The version publicly announced in 2022 had an ffmpeg dependency for supporting any audio-containing format. On Windows I just had to drop the binary into the Python script folder and enjoy converting from anything.


GP asked about the difference between whisper.cpp and my version, not OpenAI’s implementation and my version. By “the original version” in that paragraph I meant whisper.cpp.

On a general note, I believe using ffmpeg or gstreamer on Windows is sloppy software engineering. Media Foundation is part of the OS and is supported by Microsoft.

For software which deals with video (as opposed to just audio) it’s even more important, because GPU vendors directly support MF. When installing their GPU drivers, they also install DLLs which expose their hardware codecs as Media Foundation transforms. Examples of such transforms are NVIDIA H.264 Encoder MFT, NVIDIA HEVC Encoder MFT, AMD D3D11 Hardware MFT Playback Decoder, and AMDh265Encoder.
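
If anyone wants to see which of these transforms their drivers installed, a quick sketch like this lists them (error handling omitted):

    #include <windows.h>
    #include <mfapi.h>
    #include <mfidl.h>
    #include <mftransform.h>
    #include <stdio.h>
    #pragma comment( lib, "mfplat.lib" )
    #pragma comment( lib, "mfuuid.lib" )

    // Sketch: enumerate hardware video encoder MFTs registered on this machine.
    int main()
    {
        MFStartup( MF_VERSION );

        IMFActivate** activates = nullptr;
        UINT32 count = 0;
        MFTEnumEx( MFT_CATEGORY_VIDEO_ENCODER,
            MFT_ENUM_FLAG_HARDWARE | MFT_ENUM_FLAG_SORTANDFILTER,
            nullptr, nullptr, &activates, &count );

        for( UINT32 i = 0; i < count; i++ )
        {
            wchar_t* name = nullptr;
            UINT32 len = 0;
            activates[ i ]->GetAllocatedString( MFT_FRIENDLY_NAME_Attribute, &name, &len );
            wprintf( L"%s\n", name );  // e.g. "NVIDIA H.264 Encoder MFT"
            CoTaskMemFree( name );
            activates[ i ]->Release();
        }
        CoTaskMemFree( activates );
        MFShutdown();
        return 0;
    }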


Excellent project!

When I run this, it works fine, but I get the message "This build of the DLL doesn’t implement the reference CPU-running Whisper model."

What does this mean? Also, I'm very interested in the hybrid option: how do you get it working, and does it use both GPU and CPU simultaneously?


> What does this mean?

It probably means you flipped the combobox on the first screen. In the build on GitHub, the only included model implementation is the GPU one. The other two implementations are disabled with macros, here: https://github.com/Const-me/Whisper/blob/1.1.0/Whisper/stdaf... These implementations lack some UX features like callbacks and cancellation, and I haven't tested them for a while, but they might still work.
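
The gating looks roughly like this (macro names here are hypothetical; see the linked stdafx.h for the real ones):

    // Hypothetical sketch: compile-time switches selecting model implementations.
    #define BUILD_GPU_MODEL       1   // DirectCompute implementation, always on
    #define BUILD_REFERENCE_MODEL 0   // reference CPU model, compiled out of the release
    #define BUILD_HYBRID_MODEL    0   // GPU+CPU hybrid, compiled out of the release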

> does this use both GPU and CPU simultaneously?

No, it's sequential; there's a data dependency between the two stages. The encode function computes some buffers (probably called "cross attention", but I'm not sure, I'm not an ML expert), and the decode function then needs that data to generate the output text.
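
In pseudo-C++ the dependency looks like this (all names are illustrative stubs, not the library's actual API):

    #include <string>
    #include <vector>

    // Conceptual sketch of the encode -> decode dependency.
    using AudioBuffer = std::vector<float>;
    struct EncoderOutput { std::vector<float> crossAttn; };  // stays in VRAM in the real thing
    using Token = int;
    constexpr Token END_OF_TEXT = 0;

    EncoderOutput runEncoder( const AudioBuffer& ) { return {}; }                 // stub
    Token runDecoderStep( const EncoderOutput&, Token ) { return END_OF_TEXT; }   // stub
    std::string detokenize( Token ) { return ""; }                                // stub

    std::string transcribe( const AudioBuffer& pcm )
    {
        // Stage 1: the encoder produces the buffers the decoder attends to.
        const EncoderOutput enc = runEncoder( pcm );

        // Stage 2: the decoder consumes them token by token; it can't start
        // before the encoder has finished, hence sequential execution.
        std::string text;
        Token tok = 1;  // start-of-transcript, illustrative
        while( tok != END_OF_TEXT )
        {
            tok = runDecoderStep( enc, tok );
            text += detokenize( tok );
        }
        return text;
    }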


Is there any overhead from Windows (e.g. codec translation) during the live transcription? Kinda surprising the latency is so large...

...anyway this is great, will check it out after work!


That latency is in my code, not in some Windows component. I’m accumulating several seconds of audio before running the model to transcribe the buffered samples.

The logic is in this method: https://github.com/Const-me/Whisper/blob/15aea5bc/Whisper/Wh...

The exact length of “several seconds” is controlled by these user-adjustable parameters: https://github.com/Const-me/Whisper/blob/8648d1d5/Whisper/AP...
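
The gist of it, as a sketch (type and parameter names invented; the real tunables are behind the link above):

    #include <cstddef>
    #include <vector>

    // Sketch of the accumulate-then-transcribe approach used for capture.
    struct CaptureAccumulator
    {
        std::vector<float> pcm;              // 16 kHz mono samples
        size_t minSamples = 16000 * 7;       // e.g. wait for ~7 seconds of audio

        template<typename RunModel>
        void onAudioChunk( const float* chunk, size_t len, RunModel&& run )
        {
            pcm.insert( pcm.end(), chunk, chunk + len );
            if( pcm.size() >= minSamples )
            {
                run( pcm );   // transcribe the accumulated window
                pcm.clear();  // real code would keep overlap to avoid splitting words
            }
        }
    };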


Thanks, I understand now. You needed to send buffered audio because the model wasn't handling short snippets well.

Do you have a sample audio clip you would like to add to the repo for benchmarking purposes? I'm going to try it on my 3060ti tonight and could compare times...


I have uploaded two sample clips: https://github.com/Const-me/Whisper/tree/master/SampleClips

The text files in that folder contain performance data from two computers: a desktop with an NVIDIA 1080Ti, and a laptop with an integrated AMD GPU.

If you want just a single number, look at the “RunComplete” value in these text files.


Here are my benchmarks on a 3060Ti, i7 12700k:

Columbia Medium EN: 21.56 seconds

Columbia Large: 37.5 seconds

JFK Medium EN: 1.89 seconds

JFK Large: 3.25 seconds

Seems like your optimizations for your native hardware are really good!


Wikipedia says there are two versions of the 3060Ti: one has GDDR6 memory with 448 GB/second bandwidth, the other has GDDR6X memory with 608 GB/second bandwidth: https://en.wikipedia.org/wiki/GeForce_30_series#Desktop

The GDDR5X VRAM in the 1080Ti delivers up to 484 GB/second.

I wonder: are you using the GDDR6 or the GDDR6X version of the 3060Ti?


Founders Edition, so according to this site,

https://www.techpowerup.com/gpu-specs/nvidia-geforce-rtx-306...

it's the 6X version.


Here's another Founders Edition on that website, with GDDR6 memory:

https://www.techpowerup.com/gpu-specs/nvidia-geforce-rtx-306...

They have a tool to find out for sure: https://www.techpowerup.com/download/techpowerup-gpu-z/


Yes, the page I quoted is wrong. The Founders Edition is the lower-range GDDR6 version, 8 GB.


That’s what I thought, and I think we have our answer. Apparently, these compute shaders are memory bound on our two GPUs, and the 1080Ti has faster VRAM.
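
For concreteness, assuming yours is the GDDR6 variant:

    1080Ti: 484 GB/s vs. 3060Ti (GDDR6): 448 GB/s, a ratio of 484 / 448 ≈ 1.08

So the 1080Ti has roughly an 8% bandwidth edge even though the 3060Ti has much higher raw FP32 throughput; for a memory-bound shader, the bandwidth number is the one that matters.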


Just curious (I have no idea how GPU stats influence neural network benchmarks): would slapping a 1080Ti alongside my 3060Ti gain me anything? Can I 'cluster' VRAM for better performance? Can we top ~5x transcribe speeds with more VRAM?

I'm open to the idea of buying an additional old-gen GPU that nails a good price/VRAM ratio.


> I have no idea how GPU stats influence neural network benchmarks

I don’t have any idea either; I don’t do ML stuff professionally. In my day job I use the same tech (C++, SSE and AVX SIMD, DirectCompute) for a CAM/CAE application.

> would slapping a 1080ti alongside my 3060ti gain me anything

In the current version of my library, you’ll gain very little. You’ll probably get the same performance as on my computer.

I think it should be technically possible to split the work to multiple GPUs. The most expensive compute shaders in that library, by far, are computing matrix*matrix products. When each GPU has enough VRAM to fit both input matrices, the problem is parallelizable.
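
Conceptually the split is simple; here's a sketch of the row partitioning, with plain CPU threads standing in for GPUs (illustrative only, nothing like this is in the library):

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Sketch: partition C = A * B by rows of A across N devices.
    // Each "device" needs the whole B matrix in its local memory.
    void matmulPartitioned( const float* A, const float* B, float* C,
        size_t rows, size_t cols, size_t inner, size_t devices )
    {
        std::vector<std::thread> workers;
        for( size_t d = 0; d < devices; d++ )
        {
            // Each device gets a contiguous slice of A's rows.
            const size_t r0 = rows * d / devices;
            const size_t r1 = rows * ( d + 1 ) / devices;
            workers.emplace_back( [=]
            {
                for( size_t i = r0; i < r1; i++ )
                    for( size_t j = 0; j < cols; j++ )
                    {
                        float acc = 0;
                        for( size_t k = 0; k < inner; k++ )
                            acc += A[ i * inner + k ] * B[ k * cols + j ];
                        C[ i * cols + j ] = acc;
                    }
            } );
        }
        for( auto& t : workers )
            t.join();
    }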

However, that’s a lot of work, not something I’m willing to do within the scope of that project. Also, if you have multiple input streams to transcribe, you’ll get better overall throughput processing these streams in parallel on different GPUs.

> I'm open to the idea of buying an additional old gen GPU that nails a good price/VRAM ratio

Based on my observations from these tests https://github.com/Const-me/Whisper/blob/master/SampleClips/... and also this thread about the 3060Ti, it looks like the library is indeed bound by VRAM bandwidth, not compute.

I have another data point in this commit: https://github.com/Const-me/Whisper/commit/062d01a9701a11468... Same AMD iGPU; the only difference is the BIOS setup: I switched the memory from the default DDR4-2400T to the faster XMP-3332 mode.

If you can, try a Radeon RX 6700 XT, or a better one from that table: https://en.wikipedia.org/wiki/Radeon_RX_6000_series#Desktop The figure for VRAM bandwidth is “only” 384 GB/sec, but the GPU has 96 MB of L3 cache, which might make a difference for these compute shaders. That's pure theory though; I haven't tested on such GPUs. If you do, make sure to play with the comboboxes in the “Advanced GPU Settings” dialog of the desktop example.


Very cool, thanks for sharing!


Hi,

This is very cool, thanks for porting.



