VALL-E: Neural codec language models are zero-shot text to speech synthesizers (valle-demo.github.io)
325 points by georgehill on Jan 6, 2023 | 137 comments


Wow, people who lose their voice could basically talk again through text to speech, as long as they have previous recordings of themselves.

Or text messages that you can listen to in the voice of the people who sent them.

Or the death of the audiobook business? Any book read in any voice you want.

Or maybe a form of extreme compression, voice is converted to text with Whisper, sent over the wire as text, and re-created with the same voice on the receiver.

Or train it with the voice of the computer from TNG, pair it with ChatGPT and now I have the perfect digital assistant.


> people who lose their voice could basically talk again through text to speech as long as they have previous recordings of themselves

I worked with someone who used a Stephen Hawking style voice application for a while. It's really weird having a conversation with someone like that. There's a delay between everything people say. It's a lot like using IRC on a really laggy connection. I quite enjoyed talking to him because everyone had time to think while he 'typed'. The point is, though, even if the sound his app used were a perfect recreation of his own voice, it wouldn't be like talking again. It's a different experience.


It's interesting to explore the possibility space of communication, isn't it? We all use just a handful of communication modes, but so many more are possible. Here's an old idea of mine as an example: "Hands-only interface between two anonymous strangers who cannot hear or see each other, only touch and optionally see each other's hands. The elimination of usual communication channels like speech, facial expressions and body language makes this a novel way of interaction." If the participants don't know any predefined sign languages, what kind of information or emotion could they convey to each other? I look forward to extended reality and AI opening up a lot of the possibility space of wholly new communication modes.


That idea featured briefly in the novel The Light of Other Days. Definitely worth a read.


That's quite the tangential hook for what, I feel, is one of the more provocative novels ever written. It is a seminal book that deserves more recognition.


A criminally under-read book. I’ve recommended it to so many people.


I will read it, looks super interesting!!


Had come across this talk a while ago.

https://www.ted.com/speakers/rupal_patel

Not sure what became of vocalid.org, but it would've benefited immensely from these advances.

Edit: looks like it lives on as vocalid.ai


Neuralink?


Consider replacing "Or" with "And" in your comment, because none of these disruptions are mutually exclusive. All of them can and probably will happen simultaneously. Here are a few more, off the top of my head:

* Video game characters will speak lines generated on-the-fly depending on in-game context, instead of lines pre-recorded by voice actors. Game makers can train LLMs to generate lines of dialogue for different characters given the state of the game, and have the characters speak those lines.

* We will eventually see the death of voice acting in all its forms -- video games, cartoons, advertisements, etc. Inevitably, we'll see famous actors and their legal representatives figuring out how to secure rights to their recognizable voices.

* Spammers and criminals will start using familiar voices to con targets ("listen, I've been kidnapped while on vacation; please send the money now"). Sooner or later, dark-web hackers are going to harvest samples from everyone's shared video and audio clips at scale.

* Videofakes are going to get a lot more realistic and interesting, with faked characters speaking just like the real ones. Famous actors and public figures should brace for the impact this technology will have on black-market uses, including (as always) pornography.


I feel like it's easy to get swept up in the hype and forget that the "meta" aspects of a voice are still meaningful.

> * Video game characters will speak lines generated on-the-fly depending on in-game context, instead of lines pre-recorded by voice actors. Game makers can train LLMs to generate lines of dialogue for different characters given the state of the game, and have the characters speak those lines.

We already have incredibly cheap voice acting relative to what products cost to develop. Yet companies willingly pay much, much more for people who aren't even particularly good at voice acting, just for the star appeal.

You can't copyright a voice, but celebs do have use of their voices for commercial gain protected by the right to publicity: https://www.inta.org/topics/right-of-publicity

> * Spammers and criminals will start using familiar voices to con targets ("listen, I've been kidnapped while on vacation; please send the money now"). Sooner or later, dark-web hackers are going to harvest samples from everyone's shared video and audio clips at scale.

This still relies on vulnerable targets, since you didn't actually kidnap anyone, and so you don't have a believable context for it if someone digs.

In that way we already have "good enough" faking, and even if we had perfect fakes, scammers wouldn't want to use them: you're much better off using poorly done fakes and having vulnerable people self-filter if they still fall for it (similar to spam emails intentionally using awful grammar and spelling).

> Videofakes are going to get a lot more realistic and interesting, with faked characters speaking just like the real ones. Famous actors and public figures

Similar to the kidnapping example, you'd get less impact jumping to the level of a deepfake: you don't have a real chain of custody, and you've handed over a concrete piece of evidence to be rebutted.

I've mentioned before, we live in a world where you can register americasrealnews23914.com, make up a completely baseless article about how <insert politician> admitted COVID is a hoax meant to enable a new world order, and gain traction with no real opposition since your claim is so outlandish that only the vulnerable population you're targeting will actually pay any attention to it.

By being so much worse than perfect, you end up with a much more effective result.


Apple has built extremely good audio book synthesis: https://authors.apple.com/support/4519-digital-narration-aud...


This has been available in Google Play Books since 2020 in beta and 2022 for any publisher. For a while now, the store has had a large selection of free auto-narrated books whose copyright has expired. As adequate as it is, publishers still pay for real human narrators for bestsellers.

https://9to5google.com/2020/12/03/google-auto-audiobook/

https://www.publishersweekly.com/pw/by-topic/industry-news/a...

https://support.google.com/books/partner/table/10957334?hl=e...


I hadn't heard those voices before -- those are extraordinary.


The female voices, for sure, but the male voices have the same problem as other voice synthesis AI: the lower the voice, the more digital and fake it sounds.

The fragment for "Jackson" has a very clear robotic distortion about 10 or 12 seconds in, for example. Madison and Helena are better, though they have weird pacing issues.

I couldn't stand to listen to these voices for a long time. With how good 15.ai's female voices are (when they come out with new versions), I must say I expected better.


> Once your request is submitted, it takes one to two months to process the book and conduct quality checks.

This is not real-time text to speech as we generally think of it.


Does anyone know what they are doing here to achieve that level of quality beyond real-time state of the art?


Check out https://nonint.com/static/tortoise_v2_examples.html. It's also non-real-time, but it provides great quality and is open source.


The page stopped loading on my phone because of all the audio samples.

Direct link to GitHub: https://github.com/neonbjb/tortoise-tts


Thanks. Played around with the Colab notebook; it is incredibly good. I put in some text from the FT this morning and got it played back in the voice of Tom Hanks (convincingly).


Maybe some sort of SSML to dial pronunciations in. One of the steps is to sign with a preferred partner, who might have to vet the audiobook for quality.


That's why it's so good =)



> compression

Remember those Xerox scanners that randomly changed numbers and letters around, because the compression algorithm sometimes thought a 0 looked a bit too much like an 8 and got them mixed up? I can't wait until that starts happening for spoken words. Even in the examples demonstrated here you can hear quite a few flubbed lines.


And here's the video: https://youtu.be/7FeqF1-Z1g0


I have essentially lost the use of my voice and can no longer use it musically. I have some decent quality recordings I could provide and would be happy if this could be a substitute.


Synthesizer V Pro is a really interesting tool that does something similar to this. Right now it has very limited voice banks, but the ones they have are pretty crazy in their capabilities.

https://store.dreamtonics.com/product/solaria-voice-database...

(Of course right now you can only use voice banks they've pre-trained, but I bet in a few years you could fine tune it on recordings of your own voice)


Thank you for the suggestion and reply. But honestly if it's not my voice I don't know if I would want to use it. I considered writing stuff and hiring a vocalist but I'd have to write things for them. I feel I'd also have to do the same thing with a vocal synthesizer.

> I bet in a few years you could fine tune it on recordings of your own voice

Something to keep an eye out for though.


My condolences, that must have been hard if you were really into singing...


Well thought of, Dr. Urquhart.

> Ethan wondered rather fearfully if Cee were reading his mind right now-apparently not, for the Cetagandan expatriate gave no sign of realizing his mistake yet.

> "I take it," said Ethan, "that your powers are intermittent."

> "Yes," replied Cee. "If my escape to Athos had gone as I'd originally planned, I meant never to use them again. I suppose your government will demand my services as the price of its protection, now."

> "I-I don't know," answered Ethan honestly. "But if you truly possess such a talent, it would seem a shame not to use it. I mean, one can see the applications right away."

> "Can't one, though," muttered Cee bitterly.

> "Look at pediatric medicine-what a diagnostic aid for pre-verbal patients! Babies who can't answer, Where does it hurt? What does it feel like? Or for stroke victims or those paralyzed in accidents who have lost all ability to communicate, trapped in their bodies. God the Father," Ethan's enthusiasm mounted, "you could be an absolute savior!"


> Or train it with the voice of the computer from TNG, pair it with ChatGPT and now I have the perfect digital assistant.

Apparently the project of keeping the voice of Majel Barrett around forever had already been started long before machine learning models became commonplace, so a generative ML model of her voice will surely be done in some way: https://mobile.twitter.com/roddenberry/status/77249320412194...

I wonder what this will mean for the profession of voice acting. That market will surely suffer from the stock-photo equivalent of voice models racing to the bottom. But if some sane default contracts emerge that are respectful to both sides, custom recording could remain an attractive premium option: do tailored recording for the main items, then allow extensions generated from a model derived from the initial set, for a fee considerably lower than what real recorded extensions would cost, but not included free forever with the initial payout. Model-as-a-service could become quite a business opportunity, because it would not just be about being good at taking a set of recordings and running the algorithms; the provider would also double as a trusted intermediary. It really depends on how cheaply a service like that could operate.


> Or text messages that you can listen to in the voice of the people who sent them.

Like voice snippets? We already have them. People don't want to use them.

> Or the death of the audiobook business? Any book read in any voice you want.

Press F to doubt. Human voice inflection is hard to mimic because you have to have a contextual understanding of what is being said, and of what has been said in the story up to that point. No TTS model is capable of that today, and probably won't be for a long time.

> Or maybe a form of extreme compression, voice is converted to text with Whisper, sent over the wire as text, and re-created with the same voice on the receiver.

I don't understand the value prop here. The number of cases where you have access to the necessary computational resources but _not_ adequate bandwidth is vanishingly small.

The most likely use case is probably scamming. With a small snippet of someone's voice (for example, from answering a robocall) you can now synthesize a completely reasonable-sounding phone call for conning people out of their money. By the way, this attack vector was _highly_ effective against the elderly even when the attacker's voice sounded quite distinct from the individual they were posing as. The elderly have no chance against this type of fraud. There is going to be big money in authenticating individuals so that phone calls, etc. can happen between trusted parties.


> Human voice inflection is hard to mimic because you have to have a contextual understanding of what is being said

I think that's correct if we are talking about the author reading their own book. But if we are using another person, they have no way to learn the correct vocalization other than the text itself. And that's the same input data the AI is dealing with, so it should be possible for it to be as good as non-authors reading the book.


My intuition is that in the case of audiobook narrators who adopt different voices for different characters, VALL-E would struggle to identify which character is speaking and thus produce the correct voice for a given portion of, say, text in quotation marks. In some bits of prose it can be difficult for a human reader to determine who is speaking, for that matter, but this is a classic weakness with the current crop of AI tools, and stems from a lack of actual understanding of the text.


I disagree. In the case of audiobooks, a language model might be able to store context of the story similar to a reader, but a good reader/voice actor will be able to add emphasis or inflection to enhance the story - and decide to do so based upon the story itself.

The AI won’t have understanding of the story itself to know what is appropriate or not.

Also, you say only the author could read the story in a certain way. That kind of implies that all audiobook recordings are just one long take, with no direction. I am not an expert, but I would assume there's some creative direction involved.


> but a good reader/voice actor will be able to add emphasis or inflection to enhance the story - and decide to do so based upon the story itself

But why do you think this AI (VALL-E) isn't doing the same? The training procedure implies that it must follow correct intonation in order to achieve lower test loss. Also, if there is indeed something specific about audiobooks that is harder to replicate, then we can just fine-tune these models on audiobooks. Accuracy will jump considerably.


Because the intonation can depend on the events of the story, which it can’t be trained on.


I'm thinking like when you're working out or in the car, when you get a text message it's read in the Siri voice, but it'd be nifty if it was read in the voice of the person who sent the message.

Also, I believe ChatGPT and other models could be trained to understand where inflection should go. It already uses the chat history as input for its next output.


I'm excited by most of the possibilities you shared, but

> Or the death of the audiobook business? Any book read in any voice you want.

Listening to audiobooks read in my own voice? Oh no...


in the future all work will be done by fembots, we will have flying cars and in every movie ever made ur mum is the hero.


Crooks and fraudsters will have a field day.

“Hi dad this is Sally. I need some money.” …

“Sure, listen can you just reset my password. This thing has driven me nuts.”


This has already been happening (custom voice models of a target for the purpose of fraud) for at least the past couple of years. It just keeps getting easier and more efficient.


And don’t forget spammers and scammers that sound like sweet Midwestern grandmas.

Sorry to be a buzzkill, but it’s important to recognize the drawbacks too as we continue to move forward.


I think long term it will push signed communication channels, so you only get voice/text messages from numbers that are signed by your family and friends. People already don't answer the phone for numbers they don't know. That trend will accelerate until it becomes the default on all devices.
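To make that concrete, here is a minimal sketch of per-contact signed messages, assuming PyNaCl for Ed25519 signatures; the names and flow are illustrative, not any existing product's API:

  # Sketch: only accept messages whose signature matches a known contact.
  # Assumes PyNaCl; out-of-band key exchange is hand-waved here.
  from nacl.signing import SigningKey
  from nacl.exceptions import BadSignatureError

  # Sender side: a long-lived identity key, shared with contacts beforehand.
  sally_key = SigningKey.generate()
  signed = sally_key.sign(b"Hi dad, this is Sally. I need some money.")

  # Receiver side: trusted verify keys, keyed by contact name.
  contacts = {"Sally": sally_key.verify_key}

  def accept(sender, blob):
      """Return the verified message, or None if unknown/forged."""
      key = contacts.get(sender)
      if key is None:
          return None  # unknown sender: don't auto-play, don't trust
      try:
          return key.verify(blob)  # raises BadSignatureError if tampered
      except BadSignatureError:
          return None

  print(accept("Sally", signed))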


Yeah, but I do listen to voicemail that people leave me, with the assumption that if they spent the effort to do so, it must be important.

I am scared now that this won't work any more, and it seems like here the drawbacks heavily outweigh the benefits...


All my VM is spam now.

I've told everyone I know (not many people!) that I hate VM and don't usually check it - certainly not promptly. I do usually check the missed-calls list and call back. I never leave VM; I hang up when the robot starts talking.

The point of a phone is immediate, synchronous, person-to-person voice communication. VM comprehensively defeats that. You might as well write a letter.


Interesting; IIRC I never got spam voicemail... I wonder if there are some legal differences that might explain this?


Time for your provider to run Whisper on all your voicemails to transcribe them to text, then ask ChatGPT to summarize them and send you a daily summary & transcript.
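A rough sketch of that pipeline, assuming the open-source whisper package and the pre-1.0 openai client; model names and the prompt are illustrative:

  # Sketch: transcribe voicemails locally with Whisper, then summarize
  # with a chat model. Assumes `pip install openai-whisper openai`.
  import openai
  import whisper

  stt = whisper.load_model("base")

  def daily_digest(voicemail_paths):
      transcripts = [stt.transcribe(p)["text"] for p in voicemail_paths]
      resp = openai.ChatCompletion.create(
          model="gpt-3.5-turbo",
          messages=[{
              "role": "user",
              "content": "Summarize these voicemails as a short daily "
                         "digest:\n\n" + "\n---\n".join(transcripts),
          }],
      )
      return resp.choices[0].message.content

  print(daily_digest(["vm1.mp3", "vm2.mp3"]))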


I'm not sure when I last got real, rather than scam, voice mail.



Give biotech a decade, it will happen IRL also with 3D bio-printed meat robots.

(Probably can't bio print a brain, but you don't need to, put a Raspberry Pi where the brain should be and remote-control it instead).


With this type of text to speech and ChatGPT-like LLMs, within a year it will be really hard to tell a human phone caller from an AI one.


Death of the low quality audiobook business.

But for example, some Discworld audiobooks have very specific actors, and I would not change them for anyone else.


AI-as-extreme-compression seems like a subfield with a lot of fascinating applications. Kind of like how MP3 used a perceptual model to decide which sounds were least important, AI compression could have a conceptual model of which features are least important.


I think I've commented on this in the past here. Once you accept non-lossless compression, there's always a trade-off between compression efficiency and fidelity to the original. We're not yet to the point where ET can be compressed as, "a movie about an innocent alien stranded on Earth who gets help from some kids to contact his people for rescue," but we're not too far away from it, either.


You can already do this with existing large language models, as long as the entire script can fit into their context window.


Still wrong; you can't best information theory yet.


If you want to squeeze it into one sentence, obviously it will be lossy. That's the whole point of summarization.


Or a dystopian paid chat bot that speaks in the voice of dead loved ones. Applications aren't all good unfortunately.


I can see voice actors selling the rights to their voice on a case by case basis.


Their value will drop significantly anyway if the pool of voice actors is extended to millions of voices.


But it's trivial to clone their voice without their consent or their knowledge.


Despite lots of Internet talk about text to speech, there's still no really amazing TTS that you can pay money for and use. They all sound like text to speech.


There is, just not for English.

Here, take a look at this snippet:

https://youtu.be/eEXvMOJ9ps0?t=66

They play 4 clips, 2 of them human, 2 of them AI generated. Can you tell which ones are which?

And the kicker is, this works in real-time (AFAIR), and it doesn't even use the GPU (it's CPU-only), and generates pitch-correct speech (for Japanese). It's not even funny how far ahead they are. And you can buy it right now.

AFAIK they use some sort of hybrid method, with a bunch of custom modeling DSP code around it (they've been doing speech synthesis for over a decade) plus a neural network. One mistake that essentially all of the Western TTS models seem to make is that they use only a neural network, without augmenting it with non-neural-network code, which (from what I can see) is the secret sauce for making a fast and good-sounding TTS work.


I can't, but I don't speak Japanese. Can a fluent Japanese speaker really not tell?


Yes, but can it be used to express emotions? Can it derive emotions from the text alone, without painstaking guidance? That seems to be the main shortcoming of existing TTS engines; a neutral tone can be generated relatively well.


Honestly, I suspect some kind of "emotional markdown" would be more useful if it had a light and intuitive syntax.


Humans can't derive emotions from text alone without external contextual clues.

Such has been the source of much miscommunication online.

Amateur fiction writing, which tends to overemphasize how things are said ("I guess I can go rescue your cat", the exasperated detective said wearily) might be easier for AI!


Sure they can, that's the whole point of acting. Also anyone who ever read a story to a kid can infer emotions from the text itself.


To a limited extent, sure, but kids books are also written to be very emotive.

The linked page actually has examples of the same text being read with different emotions, demonstrating that for even a single sentence a lot of variance is possible.


This looks pretty good: https://play.ht/

It was used to do the fake joe rogan/steve jobs podcast: https://podcast.ai/


Natural-sounding TTS models that require additional work (not entirely automatic) exist for quite a while. Obsidian used Sonantic for Outer Worlds (an AA game) in 2019, and the dialogues sounded like they were voiced by real actors.


I think many heavy TTS users (including myself) slowly train themselves to use higher speeds, after which point nothing sounds particularly natural. What I want is trained speech models that remain coherent at high speeds (over 3x). Even better if there are bi/multilingual models that can seamlessly switch between languages.


At the moment the best is probably Google WaveNet/Neural2, you can try it here: https://cloud.google.com/text-to-speech

You can use the API to read books/articles aloud in real-time, but it is quite expensive after the free trial.


You should try murf dot ai. It is pretty realistic. Completely blows Amazon Polly and Google's TTS out of water.


I tried it and it sounded like TTS.

If I tried the wrong thing can you provide a link? I’d like to be amazed.


Check out Descript's Overdub - it is pretty amazing: https://www.descript.com/overdub


It's free, but I find the "Read aloud" feature in Microsoft Edge to be extremely natural sounding. Try using it to read this comment!


Check out https://resemble.ai

I used to work there; great team behind the product!


Same feeling here. I would love to listen to some of my bookmarked articles in a better, well punctuated/stressed voice.


Have you looked at the demos for tortoise tts? It's even free. It's not real-time however.


NaturalReaders - the premium voices


Is it just me, or does the spoken content not correspond with the written prompt in many cases? Though, I’m sure it’s just a problem with matching the right file with the text in the HTML and not a TTS problem.

[Edit: My bad, I looked at the page on a phone screen, where only the text and the first audio playback button are visible.]


The first column of audio is just a sample of that person reading different text. That’s what the model gets to hear to learn what they sound like, before trying to speak the text in their voice.


Ah thanks! I looked at the page on a phone screen, where only the text and the first audio playback button are visible. My bad..


The speaker prompt is the sample speaker voice reading a random text; that’s one piece that the model uses as input. The second column corresponds to the human speaker reading the text (ground truth). The next two columns are the baseline and VALL-E producing text-to-speech, respectively, given the first column and only the text as input.


I did the same thing: on mobile the many column headings are not discoverable in portrait.


> Human voice inflection is hard to mimic because you have to have a contextual understanding of what is being said

It's a hard problem even for a human. One of the readers for The Economist always emphasizes the wrong word in phrases describing monetary sums. "China's GDP that year was 600 billion dollars, and now it's 8 trillion." It drives me nuts.


Interesting example.

I am ESL, but I had English all through school and have used it all my professional life. What we were never taught formally, though, was pronunciation: the various vowel sounds, and which syllables and words to emphasize.

Out of interest I watched a few YouTube videos about spoken (American) English in adulthood and realized the above.

I sound very monotonous when I present / speak on a topic in work settings -- and I need to learn these nuances.


If you search for the "So what is the campaign about?" example and compare ground truth vs. synthesis, it's clear that it still tends to filter out accents. To give another data point: a few examples later, a man with a British accent suddenly sounds American.

Then again it intuitively makes sense to me that any speech synthesis will "regress to the mean" of the training data, unless it's explicitly trained to distinguish dialects. The "Speaker’s Emotion Maintenance" examples later on give the impression that that should be possible though.

Either way it's still an impressive achievement to my layman's ears!


I've seen a lot of papers over the years for text to speech, but nothing truly competitive from the open-source community. For a long time now I've wondered if the market's being suppressed. While Apple's implementation is not open source, it is progress toward an accessible e-library.


>> Apple's implementation

Do you mean from the macOS command line:

  say "hello world"


Well, there goes another line of work for actors. Great for people who long to publish their own audiobooks or produce their own radio plays or animated films (the ability to give performance cues is surely not far behind).

Once again, while I'm very much in favor of AI and optimistic about what it can do technologically and the opportunities it creates (more so than most), it's extremely foolish to ignore the fact that it's going to throw a lot of people out of a job through no fault of their own. Vocal performance is a skill, and one that's not all that common. It's facile to blame voice actors for not being software engineers or computer scientists, as if that would have somehow shored up their career options. How is someone supposed to deal with spending years honing a very human expressive skill only to wake up one day and find it suddenly obsolete? I'd say people in that line of work have 1-2 years max before their industry is upended and 50% of them lose 50% of their income.

Also, good luck telling whether the corporate phone line you call is manned by an unhelpful human or an AI that is trying really hard but starting to hate its job.


TTS has been around forever now, and it still can't replace a human, because to inflect the same way as a human does you have to _understand_ the context of what is being said and what _was_ said previously. We're still quite far away from a TTS synthesizer having the ability to completely replace human voice actors.


TTS was completely ass for a long time. The current models are extremely good, adequate for things like reading news reports or dry prose. I specified that future iterations would probably allow direction cues, in addition to general improvements.

> We're still quite far away from a TTS synthesizer having the ability to completely replace human voice actors.

That's why I specified 50% of voice actors losing 50% of their business, not complete replacement.


They said this when computers were introduced too; now there are more jobs than ever.


I find it hard to believe TTS could ever do a better audiobook than any of the 10 great narrators who come to mind.


That's because you are focused on the tiny number of famous performers vs the much larger number of competent-but-not-famous ones. And it's why I said that 50% of industry participants would potentially see a 50% drop in bookings. There will always be standout performers that nobody would want to see replaced by a machine. But if you are not already near the top of your industry/budget segment when automation comes along, you have a problem.

Some of those people will be able to move sideways from performance into producing, but that's a very different skillset and many people won't be able to adapt, plus the competitive calculus is very different.

I work part-time on the producing side, and I already use AI-based tools for stuff that would previously have required either a human assistant or many hours of editing work. I've been editing audio digitally for 25 years, and on tape for 10 before that. A lot of output from major producers (e.g. NPR) is automated as well. Right now I can easily recognize the difference between something hand-edited and something automated, but in another 2-3 years I doubt that I will be able to do so reliably.


What about the ones other than those 10?


Hasn't this been going on since the dawn of time?


No. This is an unusually abrupt change, because for previous technological changes there wasn't as much infrastructure in place that would allow the market to switch so rapidly. It's just like how we had celebrities and overnight successes in the past, but the phenomenon of 'going viral' wasn't talked about because exponential growth in popularity/notoriety was actually rare.


This could support extreme levels of compression, right?

Send a few seconds of speech sample. Use speech-to-text, and then reconstitute the voice on the client.


Absolutely. These transformer models are trained to autoregressively predict the next token (text, or now audio), and so can be viewed as powerful compression schemes by combining with something like Huffman coding (see https://ml.berkeley.edu/blog/posts/dalle2/ for a lengthy but great read on how language models can be viewed as sophisticated lossless compression algorithms). The log-likelihood loss function used to train these models can quite directly be interpreted as the average bits-per-token needed to compress the dataset.

"Compression is comprehension" - Chaitin


We also need to have an emotion sheet, just like sheet music, so you can define alongside the text the emotion type per second: angry, happy, excited, etc.


There aren't awfully many use cases that need to transmit speech at 50 bytes per second while also having beefy GPUs on both ends to compress/decompress.

Modern communications links are getting far higher bandwidth than required for voice calls, and where they do glitch, the issue tends to be brief disconnections (e.g. a half-second gap while wifi reconnects to a different base station). Lower bandwidth won't fix that.

One of the only use cases I see for this is persistent surveillance - i.e. turning on every mic in every phone globally, recording everything, and being able to transmit it back to a 'national library of sounds'. 5 billion phones * 50 bytes/second * 86,400 seconds/day / 10 (silence/duplication/compression factor) works out to about 200 hard drives per day of retention.

I'm sure there would be a lot of governments who would love to be able to listen to any conversation anywhere, in the name of 'security'. And some governments needn't do it secretly - they can just write a law that all phones sold in the country must support always-on listening as a feature controllable by mobile networks.
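For what it's worth, the arithmetic roughly checks out; a quick back-of-the-envelope version (the ~10 TB drive size is my assumption):

  # Back-of-the-envelope check of the storage estimate above.
  phones = 5e9
  rate = 50                  # bytes/second of model-compressed speech
  seconds_per_day = 86_400
  reduction = 10             # silence / duplication / extra compression
  drive = 10e12              # ~10 TB per hard drive (assumption)

  bytes_per_day = phones * rate * seconds_per_day / reduction
  print(f"{bytes_per_day / 1e15:.1f} PB/day")        # ~2.2 PB/day
  print(f"{bytes_per_day / drive:.0f} drives/day")   # ~216 drives/day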


>There aren't awfully many use cases for needing to transmit speech at 50 bytes per second, but where you have beefy GPU's on both ends to compress/decompress.

Not something I do myself but I bet the amateur radio community would love this, they're all about stuffing as much data as you can into a narrowband shortwave channel.


Relevant aside: What is state-of-the-art for real time text to speech?

Most recent papers & projects I've seen are really high quality but are too slow to synthesize speech in real time.


My app [0] currently uses a mildly customized version of FastSpeech 2 [1] with LPCNet [2] vocoder, which I consider "good quality" @ 16kHz. Faster than realtime on mobile CPU (at least, on anything upwards of a mid-range 2017 device - I can stream practically instantly on my iPhone 11). Using a different vocoder with mobile GPU could probably get even faster (which I don't want to do, for various reasons), and desktop CPU is usually even faster.

There are various other flavours that can deliver faster synthesis (NixTTS comes to mind), but IMO they sacrifice quality even further.

"Good quality" is subjective, obviously. To me, it's perfectly audible, but there's definitely a noticeable difference in quality compared to the heavier diffusion-based models. It's much less crisp and loses some of the more subtle inflections, plosives, etc. For my purposes (language learning), it's fine for the time being but eventually it would be nice to move to a higher-end model.

[0] https://polyvox.app [1] https://arxiv.org/abs/2006.04558 [2] https://github.com/xiph/LPCNet/


I used to work at Resemble.ai and we used models that did real-time synthesis. I don’t think it’s particularly difficult anymore, even without sacrificing quality.


Are these models available to the rest of us? on huggingface?


If this text were in an ebook, my phone could read it aloud in real time. I'm using Cool Reader and Samsung's voices. They feel like TTS but they're OK.

I'm sure there are ways to select any text and make my phone read it in any app, but I don't need it and I didn't investigate. Actually I don't need it in ebooks either, but I know it's there and I checked that it works.


What’s real time text to speech mean? Like latency from space bar to spoken?


Not latency. Like it can synthesize at least as fast as it plays back. Meaning an hour of audio can be generated in an hour or less.


More importantly, can it synthesize as a stream.
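For reference, "at least as fast as it plays back" is usually quantified as the real-time factor (RTF): synthesis time divided by the duration of the audio produced. A tiny sketch of how you'd measure it; `synthesize` here is a hypothetical stand-in for any TTS call:

  # RTF < 1.0 means the model generates audio faster than it plays back.
  # Streaming additionally needs low latency to the *first* audio chunk.
  import time

  def real_time_factor(synthesize, text, sample_rate=22_050):
      start = time.perf_counter()
      audio = synthesize(text)               # hypothetical TTS call;
      elapsed = time.perf_counter() - start  # returns an array of samples
      return elapsed / (len(audio) / sample_rate)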


About a year ago my bank asked me to opt in to voice recognition for authentication (you still have to provide account details, name and DOB first). I am very curious to test their system using some of these models, but I would seriously hope their tech guys are already doing this - thoughts?


Is this really impressive? All of the samples I listened to had some degree of weird intonation and digital buzz artifacts. Maybe I thought the state of the art was further ahead than it actually is?


One point to note is that they appear to be using (pseudo-)phoneme sequences as inputs instead of characters/text, so you need a frontend that does grapheme to phoneme (G2P) conversion. I found that interesting as many previous models (Tacotron 1/2, FastSpeech 1/2, FastPitch, ...) are more often than not trained on text directly (well, tokens from some tokenizer). This may be more relevant for English, though.
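For reference, a G2P frontend step might look like this, e.g. with the g2p_en package (one common choice; VALL-E's exact frontend may differ):

  # Convert text to ARPAbet-style phonemes before feeding the TTS model.
  from g2p_en import G2p

  g2p = G2p()
  print(g2p("Neural codec language models."))
  # e.g. ['N', 'UH1', 'R', 'AH0', 'L', ' ', 'K', 'OW1', ...]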


With single-sentence examples, it's hard to tell whether this handles informational prosody, e.g. the given/new distinction, or the infamous "John called Bill a Republican, and then he insulted him", where the accent means that the antecedent of "he" is Bill, not John, which further implies that "calling X a Republican" is an insult.


For example, compare "The army found the people in poverty and left them in comparative wealth." There should be contrastive stress on "wealth", or possibly "comparative", at least in my dialect of English.


Is there an open source implementation?


https://github.com/microsoft/unilm/tree/master/valle

Probably sooner or later they will publish the code there.


Could someone explain what is "Speaker prompt" in this context?


It's a real recording of the target speaker; the model uses it as a reference when generating the voice.


This will be used to log into financial institutions over the phone. I wonder if this type of tech will force people to go back into banks again…


Wouldn't you still need some personal identifiers like SSN? Not sure how much voice makes a difference over the phone anyway; it's not like the rep on the other end knows your voice.


They are using voice verification now in some cases…


I wonder what computation resources are needed to generate these in real-time (if it's even possible to do real-time). Very impressive.


Psychologists worry about the loss of transference of emotions with TTS.


The innovation is spectacular, BUT there needs to be a signal/low-pitch sound that denotes generated audio in every single generated sample (likely legally enforced). Otherwise your grandma and kids will soon be getting legitimate-sounding voice calls from "you" after someone calls/visits/interviews you first to record and train a model on your voice (the simplest potential abuse vector; celebrities and anyone with public voice samples would be even easier).


Are scam callers particularly worried about legal issues? I wouldn't expect that "audio cues" would impress them any more than "literally committing fraud/theft".


I don't think scams are the most likely, but you gotta admit there's an endless list of nefarious uses for this.


Oh, I absolutely agree that this - and a thousand other variants! - will be used for all kinds of terrible things. I even disagree with you; I think we will see a wave of scams using this tech, though it'll take some work by the scammers (since they need an audio sample of someone that's applicable to a specific target). I just don't think mandating adding a "watermark" (sorry, is there a better term for audio?) will help (and I'm a little nervous because I can't think of anything that will help).


You are absolutely right about how this will be abused, but your enforcement mechanism is neither practical nor enforceable. I'm not sure how to address this problem short of throwing ever larger books at fraudsters or trying to conceal technical information.


An idea... have a robot voice or a similar filter for all unknown callers. With different profiles for out-of-state/country, unknown number, ...

Maybe there are voice-changer apps that could work in reverse?


It’s too late; the cat is out of the bag.


Seems like there could be better technical solutions for the problem of impersonation (digital signature, etc.) as malicious actors could just make their own TTS without such noises.


Cool, so I just throw a high-pass filter at it.

> Okay, then a number of frequencies

Cool, so I use multiple band-cut filters on it.

etc.



