GPT-4 can see. That changes things.

GPT-4 is here. OpenAI released it today and the headline feature is multimodal input: it accepts images.

You can show it a photo and ask questions about it. Not in a “describe this image” party-trick way (though it can do that). In a “look at this circuit diagram and tell me why it doesn’t work” way. In a “here’s a photo of my fridge, what can I make for dinner?” way. In an “analyze this chart and tell me what’s interesting” way.

I’ve been testing it for a few hours. I showed it a photo of my desk.

The desk test

My desk had: a laptop, two books (one open, one closed), three coffee cups (two empty, one half full), a notebook with handwriting, a charging cable, and a small cactus.

GPT-4 identified all of it. Correctly. It identified the open book as “The Three-Body Problem” by Cixin Liu from the cover art visible at the edge of the photo. It noted the closed book appeared to be a Penguin Classic based on the spine design.

Then it said something I didn’t expect: “The three empty or mostly empty coffee cups suggest a long working session. You might be tired.”

I was tired. I’d been up since 5 AM testing the model.

The observation was trivial. Any human glancing at my desk would draw the same conclusion. But a machine drew it. A machine looked at a photo of my physical environment and inferred how I was feeling from the evidence. It happened to be right this time. Not reliably, not in any way that should be mistaken for empathy. But it did it.

What vision means for AI

Previous GPT versions were text-in, text-out. You described the world in words and the model responded in words. The world was always mediated through language.

GPT-4 with vision can look at the world directly. Not through a camera it controls (it doesn’t have a camera). Through images you provide. But the principle is the same: the model’s understanding of context just expanded from “what you tell it” to “what you show it.”

That expansion is significant. Language is a lossy compression of reality. When I describe my desk in words, I choose what to mention and what to leave out. When I show the model a photo, everything is there. The coffee cups I didn’t think were important. The book title I forgot to mention. The cable that tells the model which devices I use.

A model that can see your world extracts information you didn’t intend to share.

The benchmark results

OpenAI published GPT-4’s results on standardized tests. It scored around the 90th percentile on a simulated bar exam. It scored in the 99th percentile on the Biology Olympiad. It passed AP exams in art history, chemistry, environmental science, calculus, statistics, and more.

These are benchmarks designed for humans. A language model, with the addition of vision, now outscores most human test takers on exams built to measure human knowledge.

I don’t know what to do with that. The benchmarks prove capability. They don’t prove understanding. A model can pass the bar exam without knowing what justice is. It can score 99th percentile in biology without understanding what it means to be alive.

But the capability is real, and it is expanding in a direction that matters more than the benchmarks suggest. With text-only GPT-3, you could maintain the fiction that it was just a fancy autocomplete. When GPT-4 can look at a photo of your desk and notice you’re tired, the fiction gets harder to sustain.

What’s changing

The relationship between humans and AI systems is about to shift. Not because of what the model knows or what benchmarks it passes. Because of the interface. Vision input means the model can exist in your context without you having to translate your context into words.

Point your phone at a problem. Show the model what you see. Ask for help. That’s a different interaction than typing a description. It’s more natural. More ambient. More like asking a friend who’s standing next to you.

I keep thinking about my desk and the coffee cups and the model’s observation about tiredness. It was a party trick. But tricks become tools. Tools become infrastructure. Infrastructure becomes invisible.

A machine that can see. I need a minute with that.
