Hacker News
PlayHT2.0: State-of-the-Art Generative Voice AI Model for Conversational Speech (play.ht)
47 points by smusamashah on Aug 11, 2023 | 10 comments


Self-proclaimed state of the art. A year ago I would have been blown away; today, this is dramatically worse than Eleven Labs. Lower-quality audio, strange cadence, pretty monotone. It’s not what people sound like.

I think it’s impressive, but I wouldn’t call it state of the art.


I’ve played around with Eleven Labs a bunch, and while it does a pretty good job sounding like an audiobook a lot of the time, the quality of cloned voices and the ability to convey emotions is inconsistent IME. Based on the samples, this isn’t obviously worse and might actually be better.


> Generating humanlike speech requires the model to act as though it thinks while speaking, while making use of filler words to make the speech sound extremely realistic.

This is impressive from a "generating humanlike conversation" standpoint, but its application in terms of a support call seems rather boneheaded to me.

I accept filler words when speaking to a human because many humans need time to think. An AI generated response does not. The filler words are extra tokens that waste time and energy.

I'd feel somewhat miffed if I found out I was speaking to an AI that was intentionally wasting my time in an effort to sound more human. I only need a TTS voice to sound more human so that I can better understand what it's saying, not so that I feel convinced that I'm talking to a real human.


Why charge on a subscription basis if you're getting a limited amount of usage? If Netflix charged me $12 and limited me to 5 shows a month, I'd be outraged.

At least offer one-time payments, even if you mark them up 10-20%.


These guys want to act like they were the reason for certain things to exist lol.

Nothing beats Voicebox by Meta in realism and generation speed.


It says closed alpha, but also says available through the API. Is it closed or open now?


I’m guessing it’s closed because you can’t download it. It’s only available through the API.


What models/architecture are they using?


Previously TortoiseTTS was associated with PlayHT in some way, although the exact connection is a bit vague [0].

From the descriptions here it sounds a lot like AudioLM / SPEAR-TTS / some of Meta's recent multilingual TTS approaches. Those models are not open source, but it sounds like PlayHT's approach is in a similar spirit. The discussion of "mel tokens" is closer to what I would call the classic TTS pipeline in many ways... PlayHT has generally been kind of closed about what they use, so it would be interesting to know more.

If you are interested in recent work that is open to sample from and pushes on this kind of random expressiveness (sometimes at the expense of typical "quality" in terms of TTS), Bark is pretty interesting [1]. Though the audio quality suffers a bit from how they realize sequences -> waveforms, the prosody and timing are really interesting.

I assume the key factor here is high-quality, emotive audio with good data cleaning processes. Probably not even a lot of data, at least on the scale of "a lot" in speech, e.g. ASR (millions of hours) or TTS (hundreds to thousands of hours). It probably isn't some radically new architectural piece never before seen in the literature; there are lots of really nice tools for emotive and expressive TTS buried in recent years of publications.

Tacotron 2 is perfectly capable of this type of stuff as well, as shown by Dessa [2] a few years ago (this writeup is a nice intro to TTS concepts). The limit is largely that at some point you haven't heard certain phonetic sounds in a voice before, and need to do something to get plausible outcomes for new voices.

[0] Discussion here https://github.com/neonbjb/tortoise-tts/issues/182#issuecomm...

[1] https://www.tiktok.com/@jonathanflyfly/video/722513498370947...

[1a] Bark github https://github.com/suno-ai/bark

[2] https://medium.com/dessa-news/realtalk-how-it-works-94c1afda...


Mel + multispeaker vocoder is very much a classic (Tacotron-era) TTS approach.
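
For anyone unfamiliar with the "mel" half of that pipeline, here is a rough sketch of how a log-mel spectrogram is typically computed from a waveform, using only NumPy. Everything specific in it is an assumption for illustration: the HTK-style mel formula, the function names, and the parameters (22050 Hz, 1024-point FFT, hop of 256, 80 mel bands) are common choices in the Tacotron-era literature, not anything PlayHT has confirmed using. The vocoder half, which turns these frames back into audio, is not shown.

```python
import numpy as np

def hz_to_mel(hz):
    # HTK-style mel scale (an assumption; other variants such as Slaney's exist)
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are evenly spaced on the mel scale
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(wave, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Slice the waveform into overlapping, Hann-windowed frames
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Power spectrum per frame, then project onto the mel filterbank
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    # Small floor avoids log(0) on silent bands
    return np.log(mel + 1e-6)

# One second of a 440 Hz tone as a stand-in for real speech
sr = 22050
t = np.arange(sr) / sr
wave = 0.5 * np.sin(2 * np.pi * 440.0 * t)
mels = log_mel_spectrogram(wave, sr=sr)
print(mels.shape)  # (frames, mel bands)
```

In the classic setup, an acoustic model predicts frames like these from text and a separately trained vocoder maps them to a waveform; a "mel tokens" variant would presumably quantize such frames into discrete codes first, though that is a guess about the approach rather than a description of it.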



