
A good time to link the TTS leaderboard: https://huggingface.co/spaces/TTS-AGI/TTS-Arena

Eleven Labs is still very far above open source models in quality. But StyleTTS2 (MIT license) is impressively good and quite fast. It'll be interesting to see where this new one ends up. The code-switching ability is quite interesting. Most open source TTS models are strictly one language per sentence, often one language per voice.

In general though, TTS as an isolated system is mostly a dead end IMO. The future is in multimodal end-to-end audio-to-audio (or anything-to-audio) models, as demonstrated by OpenAI with GPT-4o's voice mode (though I've been saying this since long before their demo: https://news.ycombinator.com/item?id=38339222). Text is very useful as training data but as a way to represent other modalities like audio or image data it is far too lossy.



> In general though, TTS as an isolated system is mostly a dead end IMO

Do you mean like as a simple text to speech application? There is a huge need for better quality audiobook output.


I don't think recording an audiobook with human-level quality is "simple". It's really a kind of acting. TTS models do very poorly at acting because they generally process one sentence at a time, or at most a paragraph, and have very little context or understanding. They just kind of fake it like a newscaster reading an unrehearsed script from a teleprompter.

True human-level audiobook reading would require understanding the whole story, which often assumes general cultural knowledge, which you'll only get from a model trained on LLM-scale data. If you asked GPT-4o's new end-to-end voice mode to read an audiobook you'd probably get a better result than any TTS model. I bet it would even do different voices for the characters if you asked it to.


Well, no. This is a reasonable guess turned strangely confidently wrong and opinionated.

Voice acting is quite literally done a sentence or at most a paragraph at a time. Often the recording order is completely different from the script.

An actor may very well record his final scene on the first day of a project, after the whole character arc has transpired. But you know, acting. They get fed a line with stage direction and do a bunch of takes and somehow it works.

Heck you might be a full blown Italian who can't say a word in English but with the right kind of jacket it comes out a banger: https://www.youtube.com/watch?v=-VsmF9m_Nt8

You mention Eleven Labs being ahead; check out Suno. There is no LLM-scale anything involved there. The voice in this context is a musical instrument, and there are lots of viable ways to tackle this problem domain.


We're talking about audiobooks here. An actor recording an audiobook does not read the sentences or paragraphs in a random order without context.

Sure, voice acting for games or movies is done piecemeal. But the actor still gets information about the story ahead of time to inform their acting, along with their general cultural knowledge as a human. Most crucially, when acting is done in this way it is done with a human director in the loop with deep knowledge of the story and a strong vision, coaching the actor as they record each line and selecting takes afterward. When the directing is done poorly, it is pretty easy to tell.

Sure, for a movie or game you could direct a TTS system line by line in the same way and select takes manually, but it would be labor intensive and not at all automatic. And to take human direction the model would need more than just the text as input. Either a special annotation language (requiring a bunch of engineering and special annotated training datasets), or preferably a general audio-to-audio model that can understand the same kind of direction a human voice actor gets.


> I don't think recording an audiobook with human-level quality is "simple"

It is, though; I did it just a few days ago. Once you have clean text extracted from the book (this is actually the difficult part: removing headers, footers, footnote references, etc.), you can feed it into edge-tts (I recommend the Ava voice) and you get something that is, in my opinion, better than most human performers. Not perfect, but humans aren't either (I'm currently listening to a book performed by a human who pronounces GNU like G-N-U).
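
For anyone curious, a minimal sketch of that edge-tts step, assuming the cleaned text is already in a file and that "Ava" corresponds to a voice name like en-US-AvaNeural (check `edge-tts --list-voices` for the exact name):

    # Cleaned book text -> MP3 via the edge-tts package.
    # The voice name is an assumption; list the real ones with `edge-tts --list-voices`.
    import asyncio
    import edge_tts

    async def narrate(text_path, out_path, voice="en-US-AvaNeural"):
        # Headers, footers and footnote markers should already be stripped.
        with open(text_path, encoding="utf-8") as f:
            text = f.read()
        communicate = edge_tts.Communicate(text, voice)
        await communicate.save(out_path)  # streams the synthesized audio to disk

    asyncio.run(narrate("book_clean.txt", "book.mp3"))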


Something tells me one of you is speaking about fiction and one of you is speaking about nonfiction.

Inflection, emotion, tone, character-specific affects and all that can really change the audiobook experience for fiction. Your mentioning of footnote references and GNU suggests you're talking about nonfiction, perhaps technical books. For that, a voice that never significantly changes is fine and maybe even a good thing. For fiction it'd be a big step down from a professional human reader who understands the story and the characters' mental states.


I'm talking about both. I cannot listen to audiobooks that are acted, in fact, that was the reason I decided to go down the rabbit hole of creating my own.

> For fiction it'd be a big step down from a professional human reader who understands the story and the characters' mental states.

On the contrary, I don't want the reader to understand anything, I just want the text in audio form and I will do the interpretation of it myself.


shrug good for you. A lot of people including myself find audiobook fiction really hard to listen to if read with a flat automated voice.

I think how you tend to listen might also matter. I mostly use audiobooks when I'm driving or otherwise doing something else that is going to claim a portion of my attention. Following the narrative and dialog is easier when the audio provides cues like vocal tone changes for each speaker / narrator.


Agreed. In fact a great example of this is the Blood Meridian audio book where each of the characters seems to get a distinct "voice" despite being narrated by a single person.

You can find it on YouTube easily if you want an example.


Maybe authors can tag sentences/paragraphs with acting directions while they write, to facilitate the acting. Seems like there are ways for some human input to streamline the process.


Approaches based on tagging and interpreting metadata are tempting. Building structured human knowledge into the system seems like a good idea. But ultimately it isn't scalable or effective compared to general learning from large amounts of data. It's exactly the famous Bitter Lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

To the extent that authors provide supplemental notes or instructions to human actors reading their books, that information would be helpful to provide to an automated audiobook reading model. But there is no reason for it to be in a different form than a human actor would get. Additional structure or detail would be neither necessary nor helpful.


The difference is that production moves from multiple people/skills to potentially one person, the writer, who ideally already knows the emotions in each scene. The economics make sense even before one-click audiobook production, as long as it substantially reduces labour.


It would be better to just have a professional director guide the model the same way you would any other actor.


Not only the whole story, but also which character is currently speaking, what place and mood they are in, whether it is sarcasm or irony, and many more aspects.

However, in my opinion it would be a huge benefit if this kind of metadata were put into the ebook file in some way, so that it could be extracted rather than having to be detected. I think it would be enough to ID the characters and tag a gender and a mood together with the quotations, so that you could assign different speech models to different characters. That would also allow using different voices for different characters.
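
Purely as an illustration of what such extractable metadata could look like once pulled out of the ebook (every field name, character ID and voice here is made up, not part of any existing ebook standard):

    # Hypothetical per-book character sheet with gender and default mood.
    characters = {
        "narrator": {"gender": "neutral", "default_mood": "neutral"},
        "alice": {"gender": "female", "default_mood": "cheerful"},
    }

    # Each quotation tagged with its speaker and an optional mood override.
    tagged_quotes = [
        {"character": "narrator", "text": "The night was quiet."},
        {"character": "alice", "text": "We leave at dawn.", "mood": "determined"},
    ]

    # A reader app could then assign a different voice per character.
    voice_for_character = {"narrator": "en-US-AvaNeural", "alice": "en-US-JennyNeural"}

    for q in tagged_quotes:
        mood = q.get("mood", characters[q["character"]]["default_mood"])
        voice = voice_for_character[q["character"]]
        print(f"[{voice} | {mood}] {q['text']}")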

I wrote a little tool called voicebuilder (which I will open source next year). It's a "sentence splitter" that can extract an LJSpeech-style training dataset from an audio file and an epub file by length matching. It works pretty accurately for now, although the extracted data still needs manual polishing. Still way faster than doing it manually.
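
For reference, the LJSpeech layout such a tool targets is just a folder of wav clips plus a pipe-delimited metadata.csv. A minimal sketch of writing that format from already-aligned (clip, sentence) pairs (the pair list and file names here are assumptions, not voicebuilder's actual API):

    import csv
    from pathlib import Path

    def write_ljspeech_metadata(aligned_pairs, out_dir="dataset"):
        # aligned_pairs: (wav_filename, sentence) tuples from the length-matching step.
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        with open(out / "metadata.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, delimiter="|")
            for wav_name, sentence in aligned_pairs:
                # LJSpeech rows: file id | raw transcript | normalized transcript
                writer.writerow([Path(wav_name).stem, sentence, sentence])

    write_ljspeech_metadata([("chapter1_0001.wav", "Call me Ishmael.")])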

This way you can build speech sets of your favorite narrators and although you would never be allowed to publish them, I think for private use they are great!


For non-fiction books TTS is already good enough. What's needed is the convenience and speed of turning text into audio. If I can start listening with one click in my ebook app, it'll be a darn good feature for me.


This one does seem to handle multiple languages in a sentence (at least for its currently supported languages, Chinese and English): https://www.bilibili.com/video/BV1zn4y1o7iV/?share_source=co...

But it does seem like the Chinese version of this TTS is better than the English one, which would make the arena not quite as applicable to it, since the entries there are all focused on English.


Pi by Inflection.ai was doing audio-to-audio long before GPT-4o with the most realistic voice ever (human-like imperfections, micro-pauses, etc). I don’t know why it didn’t get more attention.


Was it end-to-end, or was it audio -> speech-to-text -> LLM -> text-to-speech -> audio? I imagine it's the latter.


It's end-to-end audio, in the sense that you speak and it will reply audibly, all without visibly transcribing your words into a prompt and submitting (it may in fact be employing STT->LLM on the backend, I don't know).

Works great in the car on speaker with the kids -- endless Harry Potter trivia, science questions, etc. I was completely blown away by the voice. Truly conversational.


> It's end-to-end audio, in the sense that you speak and it will reply audibly

This is not what was meant by "audio-to-audio" or "end-to-end". It's not a statement about the UI, it's a statement about the whole system.

> it may in fact be employing STT->LLM on the backend

It certainly is, and additionally TTS after the LLM, all connected solely by text. This is not audio-to-audio, it's audio-to-text-to-audio, and the components are trained separately, not end-to-end. ChatGPT has already done this too, for months.

See OpenAI's GPT-4o blog post: https://openai.com/index/hello-gpt-4o/

> Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

> With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.


Thanks. I hadn't actually read the announcement, just all the hullabaloo about how the voice sounded so human-like (and like ScarJo), and that's what had impressed me the most about conversing with Pi, thus my OP.


Can it understand how you feel from the intonation of your voice? Can it recognize an animal by its sound? If not, then it's probably not end-to-end. ChatGPT has already had this mode for months, where they simply use STT and TTS to let you converse with the AI.


Why isn’t Microsoft Azure’s TTS on here? (or am I missing something and it’s called something else)


Many proprietary ones are missing including OpenAI. I'd guess that they don't have budget to pay for the API usage. I think the leaderboard is more focused on open source options.


Isn't GPT-4o voice not audio-to-audio, but audio-to-text-to-audio?


It isn't released yet, but the new one that they demoed is audio-to-audio. That's why it can do things like sing songs, and detect emotion in voices.

The one that you can currently access in the ChatGPT app (with subscription) is the old one which is ASR->LLM->TTS.
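
For illustration, that cascaded pipeline roughly amounts to chaining three separate calls, each handing plain text to the next. A sketch using the OpenAI Python SDK (the model names and voice are assumptions, and this is obviously not how OpenAI wires it internally):

    from openai import OpenAI

    client = OpenAI()

    def cascaded_voice_turn(audio_path):
        # 1. ASR: transcribe the user's audio; tone and speaker identity are lost here.
        with open(audio_path, "rb") as f:
            transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
        # 2. Text LLM: reasons over the transcript only.
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": transcript.text}],
        )
        # 3. TTS: emotion has to be guessed from the reply text alone.
        speech = client.audio.speech.create(
            model="tts-1",
            voice="alloy",
            input=reply.choices[0].message.content,
        )
        return speech.content  # raw audio bytes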


Are we sure it’s a single model behind the scenes doing that?

Practically it doesn’t really matter, but I’d like to know for sure.


It's the second paragraph in their announcement blog post. https://openai.com/index/hello-gpt-4o/

> Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

> With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.


I'm pretty sure you can already use the new GPT-4o audio-to-audio model, even without a subscription. You can even use the "Sky" voice if you didn't update your app.



