Self-proclaimed state of the art. A year ago I would have been blown away; today, this is dramatically worse than Eleven Labs: lower-quality audio, strange cadence, pretty monotonic. It’s not what people sound like.
I think it’s impressive, but I wouldn’t call it state of the art.
I’ve played around with Eleven Labs a bunch, and while it does a pretty good job of sounding like an audiobook a lot of the time, the quality of cloned voices and the ability to convey emotion are inconsistent IME. Based on the samples, this isn’t obviously worse and might actually be better.
> Generating humanlike speech requires the model to act as though it thinks while speaking, while making use of filler words to make the speech sound extremely realistic.
This is impressive from a "generating humanlike conversation" standpoint, but applying it to a support call seems rather boneheaded to me.
I accept filler words when speaking to a human because humans need time to think. An AI-generated response doesn't need that time; the filler words are just extra tokens that waste time and energy.
I'd feel somewhat miffed if I found out I was speaking to an AI that was intentionally wasting my time in an effort to sound more human. I only need a TTS voice to sound more human so that I can better understand what it's saying, not so that I feel convinced that I'm talking to a real human.
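A back-of-envelope sketch of the cost (every number below is a made-up assumption, just to make the "wasted tokens" point concrete):

```python
# Hypothetical numbers: how much time do synthetic filler words add to a call?
FILLERS_PER_RESPONSE = 4    # "um", "uh", "you know", ... per turn (assumed)
SECONDS_PER_FILLER = 0.6    # spoken duration of each filler (assumed)
TURNS_PER_CALL = 10         # agent responses in a typical support call (assumed)

wasted = FILLERS_PER_RESPONSE * SECONDS_PER_FILLER * TURNS_PER_CALL
print(f"~{wasted:.0f} extra seconds per call spent on filler words")
# -> ~24 extra seconds per call, plus the compute spent generating them
```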
Why charge on a subscription basis if you're getting a limited amount of usage? If Netflix charged me $12 and limited me to 5 shows a month, I'd be outraged.
At least offer one-time payments, even if you mark them up 10-20%.
Previously TortoiseTTS was associated with PlayHT in some way, although the exact connection is a bit vague [0].
From the descriptions here it sounds a lot like AudioLM / SPEAR-TTS / some of Meta's recent multilingual TTS approaches. Those models aren't open source, but PlayHT's approach sounds like it's in a similar spirit. The discussion of "mel tokens" is closer to what I'd call the classic TTS pipeline in many ways... PlayHT has generally been pretty closed about what they use, so it would be interesting to know more.
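For anyone unfamiliar, the "classic" pipeline that the mel discussion gestures at is text -> mel spectrogram -> vocoder. Here's a minimal sketch of the mel representation itself, using librosa (the parameters are common Tacotron-style defaults I've picked, not anything PlayHT has confirmed):

```python
import librosa
import numpy as np

# Load any wav file; 22050 Hz is a common TTS sample rate.
y, sr = librosa.load("sample.wav", sr=22050)

# 80-bin log-mel spectrogram: the intermediate representation a
# Tacotron-style acoustic model predicts and a vocoder consumes.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
print(log_mel.shape)  # (80, n_frames)
```

The AudioLM / SPEAR-TTS family instead models discrete audio tokens (from a neural codec) with a language model, which is where the more natural, less controllable prosody tends to come from.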
If you're interested in recent work you can actually sample from that pushes on this kind of random expressiveness (sometimes at the expense of typical "quality" in TTS terms), Bark is pretty interesting [1]. The audio quality suffers a bit from how they turn token sequences into waveforms, but the prosody and timing are really interesting.
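Sampling from Bark is about this simple (assuming the suno-ai/bark package from [1]; the prompt here is my own):

```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads the text/coarse/fine models on first run

# Bark picks up non-speech cues like [laughs] and "..." from the prompt,
# which is where the expressive (and unpredictable) prosody comes from.
audio = generate_audio("Well... [laughs] I did not expect that to work!")
write_wav("bark_sample.wav", SAMPLE_RATE, audio)
```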
I assume the key factor here is high-quality, emotive audio with a good data-cleaning process. Probably not even a lot of data, at least by speech standards, where "a lot" means millions of hours for ASR or hundreds to thousands of hours for TTS. It's probably not some radically new architectural piece never before seen in the literature; there are lots of really nice tools for emotive and expressive TTS buried in recent years of publications.
Tacotron 2 is perfectly capable of this kind of thing as well, as Dessa showed [2] a few years ago (their writeup is a nice intro to TTS concepts). The limit is largely that at some point you haven't heard certain phonetic sounds in a given voice, and need to do something to get plausible output for new voices.
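If you want to poke at Tacotron 2 yourself, torchaudio ships a pretrained pipeline; a minimal sketch (bundle name is from torchaudio's docs, the input text is my own):

```python
import torch
import torchaudio

# Pretrained Tacotron 2 (phoneme input) + WaveRNN vocoder, trained on LJSpeech.
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2()
vocoder = bundle.get_vocoder()

text = "Tacotron two can sound surprisingly expressive."
with torch.inference_mode():
    tokens, lengths = processor(text)                       # text -> phoneme IDs
    spec, spec_lengths, _ = tacotron2.infer(tokens, lengths)  # IDs -> mel frames
    waveforms, _ = vocoder(spec, spec_lengths)              # mel -> audio
torchaudio.save("tacotron2_sample.wav", waveforms, vocoder.sample_rate)
```

The single-speaker LJSpeech training data is exactly the limit mentioned above: it sounds good in that voice, but you need fine-tuning or a multi-speaker setup to get plausible output for a new one.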