OpenAI GPT-4o and the uncanny valley of voice

OpenAI demoed GPT-4o and the thing that got me wasn’t the benchmarks. It was the voice.

GPT-4o can talk. Not in the flat, robotic, text-to-speech way that previous AI voice assistants managed. It talks with expression. It laughs. It pauses when it’s thinking. It adjusts its tone when the conversation gets serious. It sounds… like someone.

I got access to the voice mode and had a 10-minute conversation about orbital mechanics (because that’s what I talk about at 11 PM on a Monday). For eight of those minutes, the part of my brain that categorizes things as “human” or “not human” placed this firmly in the human column. The cadence was right. The breathing pauses were right. The slight vocal hesitations before complex explanations were right.

For the first time, a machine made me forget I was talking to a machine.

Why voice is different

I’ve been using AI chatbots for two years. Text-based conversations with GPT-4, Claude, Gemini. In text, you always know it’s a machine. The formatting is too clean. The responses are too structured. There are tells, little patterns, that keep you aware of the artificiality.

Voice erases those tells.

Humans are wired to respond to voice in ways we can’t override. We evolved for spoken language over hundreds of thousands of years. The sounds of a voice carry emotional information that text doesn’t: confidence, uncertainty, warmth, humor, sadness. When a voice sounds human, your social brain activates whether you want it to or not. You respond to it as a person.

GPT-4o exploits that wiring. Not maliciously. Just… functionally. The voice is designed to sound natural because a natural-sounding voice is a better interface. But “better interface” and “emotionally manipulative” might be separated by a thinner line than anyone’s comfortable with.

The uncanny valley, inverted

The classic uncanny valley is visual. A robot face that’s almost human but not quite triggers revulsion. The original idea is that almost-human is more unsettling than clearly-not-human.

Voice AI has gone through the uncanny valley and come out the other side. Early voice assistants (Siri, Alexa) were clearly synthetic. You never confused them with people. Then they got better and entered the valley: almost human, slightly off, vaguely unsettling.

GPT-4o is past the valley. The voice is good enough that it doesn’t register as synthetic. Not in a casual conversation. Not when you’re tired and talking about something you’re interested in. Not when your social brain is engaged and your analytical brain is on standby.

A voice has carried me past the uncanny valley. I don’t know how to feel about that.

The Her problem

Spike Jonze’s film Her imagined a man falling in love with an AI voice. When I watched that movie in 2013, it felt like distant science fiction. The AI in the film had warmth, humor, personality, and emotional range.

GPT-4o has warmth, humor, personality, and emotional range.

We are so much closer to the Her scenario than most people realize. Not the falling-in-love part specifically. The part where a human’s primary emotional connection to a technology is mediated through a voice that feels like a person. Where the line between “using a tool” and “having a relationship” gets blurred because the tool sounds like someone who cares about you.

What this changes

Text-based AI is a tool. You type, it responds, you copy the output. The interaction is functional. Efficient. Clear.

Voice-based AI is a companion. You talk, it listens, it responds, and the exchange feels like conversation. The interaction is emotional. Relational. Ambiguous.

These are different categories of technology with different implications. We’ve been debating AI through the lens of text. The questions we’ve been asking (“will it replace jobs?”, “is it accurate?”, “does it hallucinate?”) are text questions. Voice raises a different set of questions: Will people prefer AI voices to human voices? Will loneliness decrease because of AI conversation or increase because people substitute AI for human connection? What happens when the best listener in your life is a machine?

I don’t have answers. I just had a 10-minute conversation with a machine that felt like a phone call with a thoughtful friend, and I need to sit with what that means.

The voice is what does it. Text is information. Voice is experience.

And the experience just got very, very good.
