‘Holy shit’: Gemini 3 is winning the AI race — for now

When an AI model release immediately spawns memes and treatises declaring the rest of the industry cooked, you know you’ve got something worth dissecting.

Google’s Gemini 3 was released Tuesday to widespread fanfare. The company called the model a “new era of intelligence” and, for the first time, integrated it into Google Search on day one. It’s blown past products from OpenAI and other competitors on a range of benchmarks and is topping the charts on LMArena, a crowdsourced AI evaluation platform that’s essentially the Billboard Hot 100 of AI model ranking. Within 24 hours of its launch, more than one million users tried Gemini 3 in Google AI Studio and the Gemini API, per Google. “From a day one adoption standpoint, [it’s] the best we’ve seen from any of our model releases,” Google DeepMind’s Logan Kilpatrick, who is product lead for Google’s AI Studio and the Gemini API, told The Verge.

Even OpenAI CEO Sam Altman and xAI CEO Elon Musk publicly congratulated the Gemini team on a job well done. And Salesforce CEO Marc Benioff wrote that after using ChatGPT every day for three years, spending two hours on Gemini 3 changed everything: “Holy shit … I’m not going back. The leap is insane — reasoning, speed, images, video… everything is sharper and faster. It feels like the world just changed, again.”

“This is more than a leaderboard shuffle,” said Wei-Lin Chiang, cofounder and CTO of LMArena. Chiang told The Verge that Gemini 3 Pro holds a “clear lead” in occupational categories including coding, math, and creative writing, and its agentic coding abilities “in many cases now surpass top coding models like Claude 4.5 and GPT-5.1.” It also got the top spot on visual comprehension and was the first model to surpass a ~1500 score on the platform’s text leaderboard.

The new model’s performance, Chiang said, “illustrates that the AI arms race is being shaped by models that can reason more abstractly, generalize more consistently, and deliver dependable results across an increasingly diverse set of real-world evaluations.”

Alex Conway, principal software engineer at DataRobot, told The Verge that one of Gemini 3’s most notable advancements was on a specific reasoning benchmark called ARC-AGI-2. Gemini scored almost twice as high as OpenAI’s GPT-5 Pro while running at one-tenth of the cost per task, he said, which is “really challenging the notion that these models are plateauing.” And on the SimpleQA benchmark — which involves simple questions and answers on a broad range of topics, and requires a lot of niche knowledge — Gemini 3 Pro scored more than twice as high as OpenAI’s GPT-5.1, Conway flagged. “Use case-wise, it’ll be great for a lot more niche topics and diving deep into state-of-the-art research and scientific fields,” he said.

But leaderboards aren’t everything. It’s possible — and in the high-pressure AI world, tempting — to train a model for narrow benchmarks rather than general-purpose success. So to really know how well a system is doing, you have to rely on real-world testing, anecdotal experience, and complex use cases in the wild.

The Verge spoke with professionals across disciplines who use AI every day for work. The consensus: Gemini 3 looks impressive, and it does a great job on a wide breadth of tasks — but when it comes to edge cases and niche aspects of certain industries, many professionals won’t be replacing their current models with it anytime soon.

The majority of people The Verge spoke with plan to continue to use Anthropic’s Claude for their coding needs, despite Gemini 3’s advancements in that space. Some also said that Gemini 3 isn’t optimal on the user interaction front. Tim Dettmers, assistant professor at Carnegie Mellon University and a research scientist at Ai2, said that though it’s a “great model,” it’s a bit raw when it comes to UX, meaning “it doesn’t follow instructions precisely.”

Tulsee Doshi, Google DeepMind’s senior director of product management for Gemini and Gen Media, told The Verge that the company prioritized bringing Gemini 3 to a variety of Google products in a “very real way.” When asked about the instruction-following concerns, she said it’s been helpful to see “where folks are hitting some of the sticking points.”

She also said that since the Pro model is the first release in the Gemini 3 suite, later models will help “round out that concern.”

Joel Hron, CTO of Thomson Reuters, said that the company has developed its own internal benchmarks to rank both its internal models and public ones on the areas most relevant to its work, such as comparing two documents up to several hundred pages in length, interpreting a long document, understanding legal contracts, and reasoning in the legal and tax spaces. He said that so far, Gemini 3 has performed strongly across all of them and is “a significant jump up from where Gemini 2.5 was.” It also outperforms several of Anthropic’s and OpenAI’s models right now in some of those areas.

Louis Blankemeier, cofounder and CEO of Cognita, a radiology AI startup, said that in terms of “pure numbers” Gemini 3 is “super exciting.” But, he said, “we still need some time to figure out what the real-world utility of this model is.” For more general domains, Blankemeier said, Gemini 3 is a star, but when he played around with it for radiology, it struggled with correctly identifying subtle rib fractures on chest X-rays, as well as uncommon or rare conditions. He calls radiology akin to self-driving cars in many ways, with a lot of edge cases — so a newer, more powerful model may still not be as effective as an older one that’s been refined and trained on custom data over time. “The real world is just so much more difficult,” he said.

Similarly, Matt Hoffman, head of AI at Longeye, a company providing AI tools for law enforcement investigations, sees promise in the Gemini 3 Pro-powered Nano Banana Pro image generator. Image generators allow Longeye to create convincing synthetic datasets for testing, letting it keep real, sensitive investigation data secure. But although the benchmarks are impressive, they may not map to the company’s actual use cases. “I’m not confident Longeye could swap out a model we’re using in production for Gemini 3 and see immediate improvements,” he said.

Other companies also say they’re excited about Gemini — but not necessarily using it to replace everything else. Built, a construction lending startup, currently uses a mix of foundational models from Google, Anthropic, OpenAI, and others to analyze construction draw requests — a package of documents often sent to a construction lender, like invoices and proof of work done, requesting that funds be paid. This requires multimodal analysis of text and images, plus a large context window for the main agent delegating tasks to the others, VP of engineering Thomas Schlegel told The Verge. That’s part of what Google promises with Gemini 3, so the company is currently exploring switching it out for 2.5.

“In the past we’ve found Gemini to be the best at all-purpose tasks, and 3 looks to be a big step forward along those same lines,” Schlegel said. “It’s everything we love about Gemini on steroids.” But he doesn’t yet think it will replace all the other models, including Claude for coding tasks and OpenAI products for business reasoning.

For Tanmai Gopal, cofounder and CEO of AI agent platform PromptQL, the stir Gemini 3 has caused is valid, but “it’s definitely not the end of anything” for Google’s competitors. AI models are becoming better and cheaper, and since they’re on such quick release cycles, “one is always ahead of the pack for a period of time.” (For instance, the day after Gemini 3 came out, OpenAI released GPT-5.1-Codex-Max, an update to a week-old model, ostensibly to challenge Gemini 3 on a few coding benchmarks.)

Gopal said PromptQL is still working on internal evaluations to decide how, if at all, the team’s model choices will change, but “initial results aren’t necessarily showing something drastically better” than their current lineup. He said his current preference is Claude for code generation, ChatGPT for web search, and GPT-5 Pro for “deep brainstorming,” but he may incorporate Gemini 3 as a default model, since it’s “probably best-in-class for consumer tasks across creative, text, [and] image.”

And like virtually every model, Gemini 3 has had moments of what I’ll dub “robotic hand syndrome” — when an AI system does something complex with flying colors but gets gobsmacked by the simplest query, akin to the robotic hands of yesteryear having trouble gripping a soda can. Famed researcher Andrej Karpathy, who was a founding member of OpenAI and former director of AI at Tesla, wrote on X after testing Gemini 3 that he “had a positive early impression yesterday across personality, writing, vibe coding, humor, etc., very solid daily driver potential, clearly a tier 1 LLM,” but he noted that the model refused to believe him when he said it was 2025 and later said it had forgotten to turn on Google Search. (He ascertained that in early testing, he may have been given a model with a stale system prompt.)

In The Verge’s own experience testing Gemini 3, we found it “delivers reasonably well — with caveats.” It likely won’t stay on top forever, but it’s an unmistakable step up for the company.

“You’re sort of in this leapfrog game from model to model, month to month, when a new one drops,” Hron said. “But what stuck to me about Google’s release is it makes substantial improvements across many dimensions of models — so it’s not like it just got better at coding or it just got better at reasoning … It really, across the board, got a good bit better.”
