Gemini 2.5 Pro is genuinely multimodal and that

I pointed my phone at my neighborhood and recorded a 30-second video. Then I uploaded it to Gemini 2.5 Pro and asked, “What’s the vibe of this neighborhood?”

It told me: residential, tree-lined, older construction mixed with some newer infill. Probably built in the 1940s-1960s based on the architectural style. Walkable. A coffee shop on the corner that looked locally owned. A park two blocks behind me based on the tree canopy visible over the rooflines.

I hadn’t mentioned the park. I’d barely noticed the coffee shop. But it was there in the video, and Gemini saw it.

The feeling that changed

Every AI interaction I’ve had until now has felt like using a tool. I type. It responds. I evaluate. It’s text-in, text-out. The medium is language, and language is abstract.

Showing an AI a video is different. It feels like showing someone something. “Look at this.” “What do you see?” “What do you think?”

The interaction shifts from transaction to conversation. From query to collaboration. From asking a machine to telling a friend.

I know it’s not a friend. I know it’s a model processing pixel values. But the experience of pointing a camera at something and getting an intelligent response changes the emotional register of the interaction. It just does.

What multimodal actually means now

“Multimodal” has been a buzzword for two years. GPT-4 could process images. Claude could analyze photos. But Gemini 2.5 Pro is the first model where the multimodality feels native rather than bolted on.

It takes in video as a whole. In use, it doesn’t feel like frame-by-frame analysis. It follows motion, spatial relationships, and temporal context across the clip. It hears audio, reads lips, and understands the relationship between what’s being said and what’s being shown.

I played it a clip of someone explaining a recipe while cooking, and it could tell me both what they said and what they did. It even noted that they added an ingredient they never mentioned aloud.

The model watches the way you watch. Absorbing everything, not just the foreground.

Why this matters beyond the demos

Most people interact with the world through their senses, not through text. They see a broken thing and need to fix it. They hear a sound and need to identify it. They watch a video and want to understand it.

An AI that can see and hear the way people see and hear is accessible to everyone, not just people who can describe their problems in precise text. “What’s wrong with my plant?” accompanied by a photo. “What’s this sound?” accompanied by a recording. “Is this mole something I should worry about?” accompanied by an image.

The interface becomes the world itself. Not a text box. The actual world.

I’m not sure people have processed how big that shift is. The AI just got eyes and ears that work. Everything about how we use it changes from here.
