Performance Results:
Initial Latency: ~315ms for short text
Audio Generation Speed (seconds of audio per second of processing):
- Short text (12 chars): 3.35x realtime
- Medium text (100 chars): 5.34x realtime
- Long text (225 chars): 5.46x realtime
- Very Long text (306 chars): 5.50x realtime
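
For anyone wanting to reproduce these numbers, here is a minimal sketch of how a "seconds of audio per second of processing" figure can be computed. It is not the script used for the results above. `synthesize` is a hypothetical stand-in for whatever TTS call is being measured, and it is assumed to return a sequence of audio samples at a known sample rate.

```python
import time


def realtime_factor(num_samples: int, sample_rate: int, processing_seconds: float) -> float:
    """Return seconds of audio produced per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    audio_seconds = num_samples / sample_rate
    return audio_seconds / processing_seconds


def benchmark(synthesize, text: str, sample_rate: int) -> float:
    """Time one call to `synthesize` (hypothetical) and return its realtime factor."""
    start = time.perf_counter()
    samples = synthesize(text)  # assumed to return a sequence of samples
    elapsed = time.perf_counter() - start
    return realtime_factor(len(samples), sample_rate, elapsed)
```

A value of 5.0 would mean that 5 seconds of audio are generated for every second of processing.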
Findings:
- Model loads in ~710ms
- Generates audio at ~5x realtime speed (excluding initial latency)
- Performance is consistent across different voices (4.63x - 5.28x realtime)
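
As a rough illustration of what these figures imply for end-to-end time, the sketch below assumes a simple additive model: total time is the initial latency plus the audio duration divided by the generation speed. That model is my assumption, not something measured here, and the default values are just the approximate numbers reported above.

```python
def estimated_total_seconds(
    audio_seconds: float,
    initial_latency_s: float = 0.315,  # ~315ms initial latency reported above
    speed: float = 5.0,                # ~5x realtime reported above
) -> float:
    """Estimate wall-clock time to generate `audio_seconds` of audio.

    Assumes total = initial latency + audio duration / speed (a simplification).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return initial_latency_s + audio_seconds / speed
```

Under that assumption, 10 seconds of audio would take about 0.315 + 10 / 5 = 2.3 seconds.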
Thanks for running the benchmarks. The models are not optimized yet. We will optimize loading, etc., when we release an SDK meant for production :)
Ubuntu 24, Razer Blade 16, Intel Core i9-14900HX