Hacker News | sandslides's comments

So Ghostbusters II is now reality? :)


The model weights have been uploaded to Hugging Face: https://huggingface.co/pyp1/VoiceCraft
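If you'd rather grab the weights from a script than the browser, Hugging Face serves raw repo files from a predictable `resolve` URL. A stdlib-only sketch; the checkpoint filename below is a guess, so check the repo's file listing for the real names:

```python
import urllib.request
from pathlib import Path


def weight_url(repo_id: str, filename: str) -> str:
    # Hugging Face exposes raw files at /<repo_id>/resolve/<revision>/<filename>
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"


def fetch(repo_id: str, filename: str, dest_dir: str = ".") -> Path:
    # Download the file next to the script (or into dest_dir)
    dest = Path(dest_dir) / filename
    urllib.request.urlretrieve(weight_url(repo_id, filename), dest)
    return dest


# Hypothetical checkpoint name, for illustration only:
# fetch("pyp1/VoiceCraft", "model.pth")
```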

This seems to be really high quality judging by the demos. I haven't had time to try it myself yet.

Demos: https://jasonppy.github.io/VoiceCraft_web/


The LibriTTS demo clones unseen speakers from a clip of around five seconds.


Ah ok, thanks. I tried the other demo.


I tried it. Sounds absolutely nothing like my voice or my wife's voice. I used the same sample files as I used 2 days ago on the Eleven Labs website, and they worked flawlessly there. So this is very, very far from being close to "Eleven Labs quality" when it comes to voice cloning.


Ah that's disappointing, have you tried https://git.ecker.tech/mrq/ai-voice-cloning ? I've had decent results with that, but inference is quite slow.


ElevenLabs is based on Tortoise-TTS, which was pre-trained on millions of hours of data, but this one was only trained on LibriTTS, which is 500 hours at best. If a model has seen millions of voices, some of them are definitely going to sound like you. It's just a matter of training data, but it's very difficult to get someone to collect that much data and train on it.


The speech generated is the best I've heard from an open source model. The one test I made didn't produce an exact clone either, but these are still early days; there's likely something not quite right. The cloned voice does speak without the artifacts or other weirdness that most TTS systems suffer from.


Yep, tried it as well. Tried a little clip of Tony Soprano and it came out as a British guy.

xTTSv2 does it much better. The quality of the trained voices is great, though.


Yes, same for my voice. Made me sound British and didn't capture anything special about my voice that makes it recognizable.


Yes, I noticed that. Doesn't seem right, does it?


Just tried the Colab notebooks. The quality seems very good. It also supports voice cloning.


Great stuff. I took a look through the README but... what are the minimum hardware requirements to run this? Is this gonna blow up my CPU / hard drive?


Not sure. The only inference demos are Colab notebooks. The models are approx. 700 MB each, so I imagine it will run on a modest GPU.
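Back-of-envelope, if the checkpoint is ~700 MB, inference memory is usually the weights plus activations/decoding state. The 2x multiplier here is purely my guess, not a measurement from this model:

```python
def vram_estimate_mb(checkpoint_mb: float, overhead_factor: float = 2.0) -> float:
    # Rough rule of thumb (assumed, not measured): weights plus
    # activations and decoding state often land near 2x checkpoint size.
    return checkpoint_mb * overhead_factor


# A ~700 MB checkpoint would then want very roughly ~1.4 GB of VRAM
print(vram_estimate_mb(700))  # 1400.0
```

So even a low-end card with 2-4 GB of VRAM would plausibly fit it, if the guess holds.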


Would it run on a cheap non-GPU server?


Seems to run at about 2x realtime on a 2015 4-core i7-6700HQ laptop, i.e. 5 seconds to generate 10 seconds of output. I can imagine that being 4x or greater on a real machine.
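For anyone comparing numbers, "2x realtime" here is just seconds of audio produced per second of wall-clock time:

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    # >1.0 means faster than realtime; 10 s of audio in 5 s of
    # compute gives a factor of 2.0 ("2x realtime").
    return audio_seconds / wall_seconds


print(realtime_factor(10, 5))  # 2.0
```

(Some papers report the inverse, wall time / audio time, where lower is better, so watch which convention a benchmark uses.)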


I skimmed the GitHub repo but didn't see any info on this: how long does it take to fine-tune to a particular voice?

