> Additionally, rare hallucinations in Voice Mode persist with this update, resulting in unintended sounds resembling ads, gibberish, or background music. We are actively investigating these issues and working toward a solution.
Would be cool to hear some samples of this. I remember there was some hallucinated background music during the meditation demo in the original reveal livestream, but I haven't seen much beyond that. Probably an artifact of training on podcasts to get natural intonation.
I use advanced voice a lot and have come across many weird bugs.
1) Every response would be normal except it would end with a “whoosh,” like one of those sound effects some mail clients play when a message is sent, and the model itself either couldn’t or wouldn’t acknowledge it.
2) The same, except with someone knocking on a door, like a sound effect someone would play on a soundboard.
3) The entire history in the conversation disappearing after several minutes of back and forth, leading to the model having no idea what I’m talking about and acting as if it’s a fresh conversation.
4) Advanced voice mode stuttering because it hears its own voice and thinks it’s me interrupting (on a brand new iPhone 16 Pro, medium-low built in speaker volume and built-in mic).
5) Really weird changes in pronunciation or randomly saying certain words high-pitched, or suddenly using a weird accent.
And all of this was prior to these most recent changes.
It also stutters and repeats itself sometimes, and claims “poor connection” even though I know the connection is near-ideal.
I may know why that first one happens! They’re not correctly padding the latent in their decoder (by default torch pads with zeros, they should pad with whatever their latent’s representation of silence is). You can hear the same effect in songs generated with our music model: https://sonauto.ai/
If you pad your output with something that doesn't represent silence, then any output that happens to have a non-standard length (i.e. nearly all outputs) will end with whatever sound your padding bits represent in the model's embedding space. If "0000" represents "whoosh," then most of your outputs will end in "whoosh."
Here's a non-AI example: If all HN comments had to be some multiple of 50 characters long and comments were padded with the letter "A," then most HN comments would look like the user was screaming at the end. AAAAAAAAAAAAAAAAAA
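For the curious, here's a minimal PyTorch sketch of the fix. Everything in it is hypothetical: the names, dimensions, and block-size constraint are made up for illustration, and the real decoder's shapes will differ.

```python
import torch

# Hypothetical setup: a decoder that expects latent sequences whose
# length is a multiple of `block` frames. All names here are invented
# for illustration; nothing comes from OpenAI's or Sonauto's actual code.
frame_dim = 64
block = 16

# The latent the model actually decodes to silence. In practice you'd
# get this by encoding a clip of digital silence, not by using randn.
silence_latent = torch.randn(frame_dim)

def pad_latents(latents: torch.Tensor) -> torch.Tensor:
    """Pad a (T, frame_dim) latent sequence up to a multiple of `block`."""
    t = latents.shape[0]
    pad = (-t) % block
    if pad == 0:
        return latents
    # The bug: F.pad(latents, (0, 0, 0, pad)) fills with zeros, and the
    # decoder maps the all-zero latent to some arbitrary sound ("whoosh").
    # The fix: repeat the silence latent instead.
    tail = silence_latent.unsqueeze(0).expand(pad, frame_dim)
    return torch.cat([latents, tail], dim=0)

# A 37-frame output gets 11 frames of "silence" appended, so the decoded
# audio trails off quietly instead of ending in a whoosh.
print(pad_latents(torch.randn(37, frame_dim)).shape)  # torch.Size([48, 64])
```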
In addition to what Centigonal said, even if the autoencoder had been trained on only speech data, an all-zero vector is probably just out of distribution (the decoder has never seen it before) and causes weird sounds. However, given the hallucinations we're seeing, the AE has likely (maybe unintentionally) seen a bunch of non-speech data like music and sound effects too.
> 3) The entire history in the conversation disappearing after several minutes
Likely a natural consequence of using native voice mode. Its context span is probably quite short. Sesame, for instance, loses context after less than 20 minutes (not sure exactly when).
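To illustrate the failure mode (the numbers and structure here are purely made up, not OpenAI's actual implementation): if the server keeps only a rolling window of recent tokens, older turns silently fall out and the model treats the chat as brand new.

```python
from collections import deque

# Toy sketch of a rolling context window; the token budget is invented.
# Each turn costs some tokens; once the budget is exceeded, the oldest
# turns are dropped and the model "forgets" them entirely.
MAX_CONTEXT_TOKENS = 2000

history = deque()  # (turn_text, token_cost)
used = 0

def add_turn(text: str, cost: int) -> None:
    global used
    history.append((text, cost))
    used += cost
    while used > MAX_CONTEXT_TOKENS:
        _, dropped_cost = history.popleft()
        used -= dropped_cost  # earlier turns vanish from the model's view

for i in range(10):
    add_turn(f"turn {i}", cost=500)
print(len(history), used)  # only the last 4 turns (2000 tokens) survive
```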
> 4) Advanced voice mode stuttering because it hears its own voice and thinks it’s me interrupting
I experience the same issue on an iPhone 15 Pro Max and have to mute the mic whenever I'm listening to a response. I wish they'd add an option to disable voice interruptions so that it could be interrupted only by touch.
If anyone's wondering, here's a short sample. It quietly updated last night, and I ended up chatting for like an hour. It sounds as smart as before, but like 10x more emotionally intelligent. Laughter is the biggest giveaway, but the serious/empathetic tones for more therapy-like conversations are noticeable, too.
https://drive.google.com/file/d/16kiJ2hQW3KF4IfwYaPHdNXC-rsU...
I find it weird the way it's just lying. Did you tell it to impersonate a living human?
Obviously I get LLMs have no concept of truth and hallucinate all the time, but I would have thought that the model prompt would have told it to acknowledge that it's a chatbot.
Holy moley. Thanks for sharing. I had to work with the API version a lot over the last week, and it was frustrating how "old" it felt intelligence-wise. This is in another league; I hope it's 4.1 x audio training, because I'd love to talk to this. The current one is passable for hands-free RAG, and that's it for me.