Llama 4 and Meta's open weight strategy

Meta gave away another frontier model.

Llama 4 uses a mixture-of-experts architecture, which means only a fraction of the model’s parameters activate for any given token. It’s fast. It’s efficient. And it benchmarks alongside the best from OpenAI and Anthropic.

And it’s free. Open weights. Anyone can download it. Anyone can fine-tune it. Anyone can build a product on it.

Why does Meta keep doing this?

The Android play

I wrote about this when Llama 2 came out. I’ll say it again because I think it’s still the answer.

Meta doesn’t need to sell AI. They sell ads. The AI model is a means to an end. If every developer in the world builds on Llama, Meta’s platform grows. The model becomes infrastructure. And infrastructure owners have power that model sellers don’t.

Google gave away Android for the same reason. The phone operating system was free. But the search engine, the app store, the data pipeline feeding the ad business? Those made money. Android wasn’t the product. Android was the distribution channel for the product.

Llama is Meta’s Android. The model is free. The platform is the business.

Why this matters for everyone else

If Llama 4 is competitive with closed models and it’s free, then OpenAI and Anthropic have to justify their pricing with something that open weights can’t match. So far that something has been a mix of three things: speed (closed models often ship improvements faster), safety (alignment and guardrails), and integration (APIs, tools, support).

But the capability gap keeps closing. Llama 2 was behind GPT-4. Llama 3 was roughly on par with GPT-4 on many tasks. Llama 4 is competing with the latest from everyone.

The recurring pattern: closed models lead. Open models catch up. The gap narrows with each generation. If this pattern continues (and so far there’s no reason to think it won’t), then the equilibrium state is open models at parity with closed models, and the value proposition of closed providers shifts from “better models” to “better everything else.”

The mixture-of-experts part

This is the technical detail I find most interesting. Traditional dense models use all their parameters for every token. A 70-billion-parameter model uses 70 billion parameters whether you ask it to solve a calculus problem or tell you a joke.

Mixture-of-experts activates a subset. A small router inside the model picks which experts handle each token, so different inputs activate different experts. The model can be enormous (hundreds of billions of parameters in total) while only using a fraction for any given response. More capacity. Less compute per query.
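To make the routing idea concrete, here is a toy sketch of top-k expert selection in plain Python. It is not Llama 4’s implementation. The names (`route_token`, `moe_layer`), the eight experts, the choice of `top_k=2`, and the trivial "experts" that just scale their input are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Turn raw router scores into probabilities."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=2):
    """Pick the top_k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, experts, router_weights, top_k=2):
    """Run one token through only its selected experts and mix the results."""
    # One router score per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    routes = route_token(logits, top_k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)  # the unselected experts never run
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in routes]


# Hypothetical setup: 8 toy experts, each of which just scales the input.
random.seed(0)
DIM, NUM_EXPERTS = 4, 8
experts = [lambda v, s=i + 1: [s * x for x in v] for i in range(NUM_EXPERTS)]
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, used = moe_layer(token, experts, router_weights, top_k=2)
print(f"experts used: {used} of {NUM_EXPERTS}")
```

The point of the sketch is the loop in `moe_layer`: six of the eight experts do no work for this token. In a real model the router is learned and the routing happens inside each mixture-of-experts layer, but the shape of the saving is the same.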

It’s how brains work, roughly. You don’t use your entire brain to tie your shoes. Different regions activate for different tasks.

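If this architecture scales well (and Llama 4 suggests it does), it changes the economics of inference. Bigger models. Lower cost per query. The quality of a huge model at the price of a small one, at least in compute. Memory is the catch: every expert still has to be loaded, so the hardware bill doesn’t shrink as fast as the compute bill does.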
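Some back-of-the-envelope arithmetic shows why the economics shift. The parameter counts below are made-up round numbers, not Llama 4’s published specs, and the "2 FLOPs per active parameter per token" figure is a common rule of thumb for a forward pass, not a measurement.

```python
def forward_flops_per_token(active_params):
    """Rule of thumb: about 2 FLOPs (multiply + add) per active parameter per token."""
    return 2 * active_params


def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory to hold all weights, assuming 16-bit (2-byte) parameters."""
    return total_params * bytes_per_param / 1e9


# Hypothetical models for comparison (illustrative numbers only).
dense = {"name": "dense 70B", "total": 70e9, "active": 70e9}
moe = {"name": "MoE 400B", "total": 400e9, "active": 20e9}

for model in (dense, moe):
    flops = forward_flops_per_token(model["active"])
    memory = weight_memory_gb(model["total"])
    print(f"{model['name']}: {flops / 1e9:.0f} GFLOPs per token, {memory:.0f} GB of weights")

ratio = forward_flops_per_token(dense["active"]) / forward_flops_per_token(moe["active"])
print(f"compute per token, dense vs. MoE: {ratio:.1f}x")
```

Under these assumptions the mixture-of-experts model has several times the total capacity of the dense one while doing less arithmetic per token. It also needs far more memory to hold its weights, which is the part of the bill this architecture does not shrink.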

Together AI and Hugging Face already have Llama 4 available for inference. The deployment community moves fast when the model is free.

I don’t know who wins the AI race. But I’m increasingly convinced that open models will be at the table when the race is decided. Meta made sure of that.
