> It's end-to-end audio, in the sense that you speak and it will reply audibly
This is not what was meant by "audio-to-audio" or "end-to-end". It's not a statement about the UI; it's a statement about the whole system.
> it may in fact be employing STT->LLM on the backend
It certainly is, and additionally TTS after the LLM, all connected solely by text. This is not audio-to-audio; it's audio-to-text-to-audio, and the components are trained separately, not end-to-end. ChatGPT has already done this too, for months.
See OpenAI's GPT-4o blog post: https://openai.com/index/hello-gpt-4o/
> Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.
> With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.
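To make the distinction concrete, here is a minimal sketch (plain Python with hypothetical stand-in functions, not any real API) of the text-bridged pipeline the quote describes. The only thing that crosses each model boundary is a string, which is exactly where tone, emotion, multiple speakers, and background sound get dropped:

```python
# Sketch of the pre-GPT-4o "Voice Mode" pipeline: three separately trained
# models joined only by plain strings. All bodies are hypothetical stand-ins.

def transcribe(audio: bytes) -> str:
    """STT stage: audio in, plain text out (stand-in for a speech recognizer)."""
    return "what's the weather like today?"  # tone, emotion, speakers lost here

def generate_reply(prompt: str) -> str:
    """LLM stage: text in, text out (stand-in for GPT-3.5/GPT-4)."""
    return "It looks sunny this afternoon."

def synthesize(text: str) -> bytes:
    """TTS stage: text in, audio out (stand-in for a speech synthesizer)."""
    return text.encode("utf-8")  # placeholder "audio"

def voice_mode(audio_in: bytes) -> bytes:
    # The only thing passed between the three models is a string.
    text_in = transcribe(audio_in)
    text_out = generate_reply(text_in)
    return synthesize(text_out)

# An end-to-end model like GPT-4o replaces all three stages with a single
# network that consumes and produces audio (and text/vision) directly, so
# there is no intermediate text bottleneck.
```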
Thanks. I hadn't actually read the announcement, just all the hullabaloo about how the voice sounded so human-like (and like ScarJo), and that's what had impressed me the most about conversing with Pi, thus my OP.