
A good time to link the TTS leaderboard: https://huggingface.co/spaces/TTS-AGI/TTS-Arena

Eleven Labs is still very far above open source models in quality. But StyleTTS2 (MIT license) is impressively good and quite fast. It'll be interesting to see where this new one ends up. The code-switching ability is quite interesting. Most open source TTS models are strictly one language per sentence, often one language per voice.

In general though, TTS as an isolated system is mostly a dead end IMO. The future is in multimodal end-to-end audio-to-audio (or anything-to-audio) models, as demonstrated by OpenAI with GPT-4o's voice mode (though I've been saying this since long before their demo: https://news.ycombinator.com/item?id=38339222). Text is very useful as training data but as a way to represent other modalities like audio or image data it is far too lossy.



> In general though, TTS as an isolated system is mostly a dead end IMO

Do you mean like as a simple text to speech application? There is a huge need for better quality audiobook output.


I don't think recording an audiobook with human-level quality is "simple". It's really a kind of acting. TTS models do very poorly at acting because they generally process one sentence at a time, or at most a paragraph, and have very little context or understanding. They just kind of fake it like a newscaster reading an unrehearsed script from a teleprompter.

True human-level audiobook reading would require understanding the whole story, which often assumes general cultural knowledge, which you'll only get from a model trained on LLM-scale data. If you asked GPT-4o's new end-to-end voice mode to read an audiobook you'd probably get a better result than any TTS model. I bet it would even do different voices for the characters if you asked it to.


Well, no. This is a reasonable guess turned strangely confidently wrong and opinionated.

Voice acting is quite literally done a sentence or at most a paragraph at a time. Often the recording order is completely different from the script.

An actor may very well record his final scene on the first day of a project, after the whole character arc has transpired. But you know, acting. They get fed a line with stage direction and do a bunch of takes and somehow it works.

Heck you might be a full blown Italian who can't say a word in English but with the right kind of jacket it comes out a banger: https://www.youtube.com/watch?v=-VsmF9m_Nt8

You mention Eleven Labs being ahead; check out Suno. There is no LLM-scale anything involved there. The voice in this context is a musical instrument, and there are lots of viable ways to tackle this problem domain.


We're talking about audiobooks here. An actor recording an audiobook does not read the sentences or paragraphs in a random order without context.

Sure, voice acting for games or movies is done piecemeal. But the actor still gets information about the story ahead of time to inform their acting, along with their general cultural knowledge as a human. Most crucially, when acting is done in this way it is done with a human director in the loop with deep knowledge of the story and a strong vision, coaching the actor as they record each line and selecting takes afterward. When the directing is done poorly, it is pretty easy to tell.

Sure, for a movie or game you could direct a TTS system line by line in the same way and select takes manually, but it would be labor intensive and not at all automatic. And to take human direction the model would need more than just the text as input. Either a special annotation language (requiring a bunch of engineering and special annotated training datasets), or preferably a general audio-to-audio model that can understand the same kind of direction a human voice actor gets.


> I don't think recording an audiobook with human-level quality is "simple"

It is, though; I did it just a few days ago. Once you have clean text extracted from the book (this is actually the difficult part: removing headers, footers, footnote references, etc.), you can feed it into edge-tts (I recommend the Ava voice) and you get something that is, in my opinion, better than most human performers. Not perfect, but humans aren't either (I'm currently listening to a book performed by a human who pronounces GNU like G-N-U).
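
For anyone curious, a minimal sketch of that edge-tts step, assuming the cleaned text is already in a file and that "Ava" corresponds to a voice name like en-US-AvaNeural (check `edge-tts --list-voices` for the exact name):

    # Cleaned book text -> MP3 via the edge-tts package.
    # The voice name is an assumption; list the real ones with `edge-tts --list-voices`.
    import asyncio
    import edge_tts

    async def narrate(text_path, out_path, voice="en-US-AvaNeural"):
        # Headers, footers and footnote markers should already be stripped.
        with open(text_path, encoding="utf-8") as f:
            text = f.read()
        communicate = edge_tts.Communicate(text, voice)
        await communicate.save(out_path)  # streams the synthesized audio to disk

    asyncio.run(narrate("book_clean.txt", "book.mp3"))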


Something tells me one of you is speaking about fiction and one of you is speaking about nonfiction.

Inflection, emotion, tone, character-specific affects and all that can really change the audiobook experience for fiction. Your mentioning of footnote references and GNU suggests you're talking about nonfiction, perhaps technical books. For that, a voice that never significantly changes is fine and maybe even a good thing. For fiction it'd be a big step down from a professional human reader who understands the story and the characters' mental states.


I'm talking about both. I cannot listen to audiobooks that are acted, in fact, that was the reason I decided to go down the rabbit hole of creating my own.

> For fiction it'd be a big step down from a professional human reader who understands the story and the characters' mental states.

On the contrary, I don't want the reader to understand anything, I just want the text in audio form and I will do the interpretation of it myself.


shrug good for you. A lot of people including myself find audiobook fiction really hard to listen to if read with a flat automated voice.

I think how you tend to listen might also matter. I mostly use audiobooks when I'm driving or otherwise doing something else that is going to claim a portion of my attention. Following the narrative and dialog is easier when the audio provides cues like vocal tone changes for each speaker / narrator.


Agreed. In fact a great example of this is the Blood Meridian audio book where each of the characters seems to get a distinct "voice" despite being narrated by a single person.

You can find it on YouTube easily if you want an example.


Maybe authors can tag sentences/paragraphs with acting directions while they write, to facilitate the acting. Seems like there are ways for some human input to streamline the process.


Approaches based on tagging and interpreting metadata are tempting. Building structured human knowledge into the system seems like a good idea. But ultimately it isn't scalable or effective compared to general learning from large amounts of data. It's exactly the famous Bitter Lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

To the extent that authors provide supplemental notes or instructions to human actors reading their books, that information would be helpful to provide to an automated audiobook reading model. But there is no reason for it to be in a different form than a human actor would get. Additional structure or detail would be neither necessary nor helpful.


The difference is that production moves from multiple people/skills to potentially one person, the writer, who ideally already knows the emotions in each scene. The economics make sense even before one-click audiobook production, as long as it substantially reduces labour.


It would be better to just have a professional director guide the model the same way you would any other actor.


Not only the whole story, but also which character is currently speaking, what place and mood they are in, whether it is sarcasm or irony, and many more aspects.

However, in my opinion it would be a huge benefit if this kind of metadata were put into the ebook file in some way, so that it could be extracted rather than having to be detected. I think it would be enough to ID the characters and tag a gender and a mood together with the quotations, so that you could assign different speech models to different characters. That would also allow using different voices for different characters.
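
Purely as an illustration of what such extractable metadata could look like once pulled out of the ebook (every field name, character ID and voice here is made up, not part of any existing ebook standard):

    # Hypothetical per-book character sheet with gender and default mood.
    characters = {
        "narrator": {"gender": "neutral", "default_mood": "neutral"},
        "alice": {"gender": "female", "default_mood": "cheerful"},
    }

    # Each quotation tagged with its speaker and an optional mood override.
    tagged_quotes = [
        {"character": "narrator", "text": "The night was quiet."},
        {"character": "alice", "text": "We leave at dawn.", "mood": "determined"},
    ]

    # A reader app could then assign a different voice per character.
    voice_for_character = {"narrator": "en-US-AvaNeural", "alice": "en-US-JennyNeural"}

    for q in tagged_quotes:
        mood = q.get("mood", characters[q["character"]]["default_mood"])
        voice = voice_for_character[q["character"]]
        print(f"[{voice} | {mood}] {q['text']}")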

I wrote a little tool called voicebuilder (which I will open source next year). It's a "sentence splitter" that can extract an LJSpeech-style training dataset from an audio file and an epub file by length matching. It works pretty accurately for now, although the extracted data still needs manual polishing. Still way faster than doing it manually.
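
For reference, the LJSpeech layout such a tool targets is just a folder of wav clips plus a pipe-delimited metadata.csv. A minimal sketch of writing that format from already-aligned (clip, sentence) pairs (the pair list and file names here are assumptions, not voicebuilder's actual API):

    import csv
    from pathlib import Path

    def write_ljspeech_metadata(aligned_pairs, out_dir="dataset"):
        # aligned_pairs: (wav_filename, sentence) tuples from the length-matching step.
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        with open(out / "metadata.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, delimiter="|")
            for wav_name, sentence in aligned_pairs:
                # LJSpeech rows: file id | raw transcript | normalized transcript
                writer.writerow([Path(wav_name).stem, sentence, sentence])

    write_ljspeech_metadata([("chapter1_0001.wav", "Call me Ishmael.")])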

This way you can build speech sets of your favorite narrators and although you would never be allowed to publish them, I think for private use they are great!


For non-fiction books TTS is already good enough. What's needed is the convenience and speed of turning text into audio. If I can start listening with one click in my ebook app, it'll be a darn good feature for me.


This one does seem to handle multiple languages in a sentence (at least for its currently supported languages, Chinese and English): https://www.bilibili.com/video/BV1zn4y1o7iV/?share_source=co...

But it does seem like the Chinese version of this TTS is better than the English one, which would make the arena not quite as applicable to it, since the entries there are all focused on English.


Pi by Inflection.ai was doing audio-to-audio long before GPT-4o with the most realistic voice ever (human-like imperfections, micro-pauses, etc). I don’t know why it didn’t get more attention.


Was it end-to-end, or was it audio -> speech-to-text -> LLM -> text-to-speech -> audio? I imagine it's the latter.


It's end-to-end audio, in the sense that you speak and it will reply audibly, all without visibly transcribing your words into a prompt and submitting (it may in fact be employing STT->LLM on the backend, I don't know).

Works great in the car on speaker with the kids -- endless Harry Potter trivia, science questions, etc. I was completely blown away by the voice. Truly conversational.


> It's end-to-end audio, in the sense that you speak and it will reply audibly

This is not what was meant by "audio-to-audio" or "end-to-end". It's not a statement about the UI, it's a statement about the whole system.

> it may in fact be employing STT->LLM on the backend

It certainly is, and additionally TTS after the LLM, all connected solely by text. This is not audio-to-audio, it's audio-to-text-to-audio, and the components are trained separately, not end-to-end. ChatGPT has already done this too, for months.

See OpenAI's GPT-4o blog post: https://openai.com/index/hello-gpt-4o/

> Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

> With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.


Thanks. I hadn't actually read the announcement, just all the hullabaloo about how the voice sounded so human-like (and like ScarJo), and that's what had impressed me the most about conversing with Pi, thus my OP.


Can it understand how you feel from the intonation of your voice? Can it recognize an animal by its sound? If not, then it's probably not end-to-end. ChatGPT has already had this mode for months, where they simply use STT and TTS to let you converse with the AI.


Why isn’t Microsoft Azure’s TTS on here? (or am I missing something and it’s called something else)


Many proprietary ones are missing including OpenAI. I'd guess that they don't have budget to pay for the API usage. I think the leaderboard is more focused on open source options.


Isn't GPT-4o voice not audio-to-audio, but audio-to-text-to-audio?


It isn't released yet, but the new one that they demoed is audio-to-audio. That's why it can do things like sing songs, and detect emotion in voices.

The one that you can currently access in the ChatGPT app (with subscription) is the old one which is ASR->LLM->TTS.
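
For illustration, that cascaded pipeline roughly amounts to chaining three separate calls, each handing plain text to the next. A sketch using the OpenAI Python SDK (the model names and voice are assumptions, and this is obviously not how OpenAI wires it internally):

    from openai import OpenAI

    client = OpenAI()

    def cascaded_voice_turn(audio_path):
        # 1. ASR: transcribe the user's audio; tone and speaker identity are lost here.
        with open(audio_path, "rb") as f:
            transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
        # 2. Text LLM: reasons over the transcript only.
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": transcript.text}],
        )
        # 3. TTS: emotion has to be guessed from the reply text alone.
        speech = client.audio.speech.create(
            model="tts-1",
            voice="alloy",
            input=reply.choices[0].message.content,
        )
        return speech.content  # raw audio bytes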


Are we sure it’s a single model behind the scenes doing that?

Practically it doesn’t really matter, but I’d like to know for sure.


It's the second paragraph in their announcement blog post. https://openai.com/index/hello-gpt-4o/

> Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

> With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.


I'm pretty sure you can already use the new GPT-4o audio-to-audio model, even without a subscription. You can even use the "Sky" voice if you didn't update your app.



