Gemini 3 Deep Think Hits 84.6% on ARC-AGI-2 and 3455 Elo on Codeforces

Abhishek Gautam · 8 min read

Quick summary

Google's Gemini 3 Deep Think scored 84.6% on ARC-AGI-2, 48.4% on Humanity's Last Exam, and 3455 Elo on Codeforces. Gemini 3.1 Pro is now in preview. Here is what the benchmarks actually mean.

Google's Gemini 3 Deep Think scored 84.6% on ARC-AGI-2, 48.4% on Humanity's Last Exam, and achieved a 3455 Elo rating on Codeforces competitive programming. The ARC-AGI-2 score was independently verified by the ARC Prize Foundation in February 2026. Gemini 3.1 Pro is now in preview (March 2026) and currently leads the Artificial Analysis Intelligence Index. For developers choosing a model for coding, reasoning, and complex analysis tasks, these numbers change the calculus.

What ARC-AGI-2 Actually Tests

ARC-AGI (Abstraction and Reasoning Corpus) was designed by François Chollet specifically to test general reasoning that cannot be solved by pattern matching or memorization. The original ARC-AGI was largely solved by frontier models in 2024-2025. ARC-AGI-2 is significantly harder — it presents novel visual reasoning puzzles that require genuine problem decomposition, not recall.

The scoring context: the average human score on ARC-AGI-2 is approximately 60%. GPT-4o and Claude 3.5 Sonnet scored below 20% on early evaluations. Gemini 3 Deep Think at 84.6% — verified by the ARC Prize Foundation, not just Google's internal evaluation — represents a meaningful discontinuity from previous frontier model performance on this benchmark.

The question Chollet designed ARC-AGI to answer is whether a model is doing genuine reasoning or statistical pattern completion. A score of 84.6% from a system that had no access to the test puzzles during training (ARC-AGI-2 puzzles are novel by construction) is evidence of reasoning capability that goes beyond lookup.
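To make the scoring concrete, here is a minimal sketch of all-or-nothing grid scoring in the style ARC puzzles use: a predicted output grid counts only if it matches the expected grid exactly. The toy grids and the function names `is_exact_match` and `score_tasks` are illustrative assumptions, not part of any official harness, and details of the real evaluation (such as how many attempts are allowed per puzzle) are not modeled here.

```python
from typing import List

Grid = List[List[int]]  # each cell holds a small integer colour index


def is_exact_match(predicted: Grid, expected: Grid) -> bool:
    """A prediction counts only if the shape and every cell match."""
    return predicted == expected


def score_tasks(predictions: List[Grid], targets: List[Grid]) -> float:
    """Fraction of target grids reproduced exactly (no partial credit)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    solved = sum(is_exact_match(p, t) for p, t in zip(predictions, targets))
    return solved / len(targets)


# Hypothetical toy data: two expected output grids and two model guesses.
toy_targets = [
    [[0, 1], [0, 2]],
    [[3, 0, 0], [0, 0, 4]],
]
toy_predictions = [
    [[0, 1], [0, 2]],        # identical to the target, so it counts
    [[3, 0, 0], [0, 4, 0]],  # one cell misplaced, so it scores zero
]

print(score_tasks(toy_predictions, toy_targets))  # 0.5
```

Because a single wrong cell zeroes out a puzzle, a high score under this kind of metric cannot come from getting grids "mostly right".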

Humanity's Last Exam: 48.4% in Context

Humanity's Last Exam (HLE) is a benchmark assembled from questions that represent the hardest known problems across academic disciplines — the kind of questions that appear at the end of PhD qualifying exams or in the hardest competition mathematics. Questions are contributed by domain experts specifically to be outside what current AI systems can answer from training data.

48.4% on HLE means Gemini 3 Deep Think correctly answers nearly half of problems designed to be at the frontier of human expertise. For comparison: the human expert score on HLE questions from their own domain is around 90%; the score across all domains (where experts are not specialists) is lower. A score of 48.4% across all domains represents strong generalist performance at expert difficulty.

This does not mean Gemini 3 Deep Think is approaching human expert performance in the domains where it scored poorly — the benchmark is specifically designed to surface where models fail. The 51.6% it gets wrong includes the hardest problems in each domain. But the 48.4% it gets right is a substantial increase over what any previous model has achieved.

Codeforces 3455 Elo: What It Means for Developer Workflows

Codeforces is a competitive programming platform used by serious competitive programmers globally. Its Elo-style rating maps to named tiers: 1200 is beginner territory (the start of Pupil), 1600 is Expert, 1900 is Candidate Master, 2400 is Grandmaster, 2600 is International Grandmaster, and 3000+ is Legendary Grandmaster, a tier occupied by fewer than 100 active human competitors worldwide.
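One way to read a rating gap is through the standard Elo expected-score formula, which turns a rating difference into a head-to-head win expectation. The sketch below assumes that textbook logistic formula with a 400-point scale. It is an illustration of how large the gaps are, not a description of how Google or Codeforces computed the 3455 figure.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Compare a 3455-rated competitor with opponents at lower ratings.
for opponent in (1900, 2400, 3000):
    expected = elo_expected_score(3455, opponent)
    print(f"3455 vs {opponent}: expected score {expected:.3f}")
```

Under this assumption, a 3455-rated competitor would be expected to score roughly 0.93 against a 3000-rated one, which gives a sense of how far above the 3000 line the reported rating sits.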

3455 Elo places Gemini 3 Deep Think well inside that top tier, in the range of the highest-rated active human competitive programmers. This is not the same as saying it is better than top developers at real-world engineering tasks: competitive programming problems are well-defined, have verified solutions, and test algorithmic skill in isolation. But it is a strong signal for:

Algorithm and data structure implementation. If you need an efficient solution to a graph problem, a dynamic programming challenge, or a tree traversal, Gemini 3 Deep Think can likely produce correct, optimized code.

Debugging complex logical errors. Competitive programming skill correlates with the ability to identify logical edge cases — exactly the capability needed to debug subtle off-by-one errors, race conditions, and incorrect algorithm implementations.

Code review for correctness. A model at 3455 Codeforces Elo can evaluate whether an implementation is algorithmically correct, not just whether it passes basic tests.

Gemini 3.1 Pro Preview and the Current Leaderboard

Gemini 3.1 Pro (not Deep Think — a separate, more accessible model) entered preview in March 2026 and is currently leading the Artificial Analysis Intelligence Index, which aggregates performance across coding, reasoning, and instruction following benchmarks.

The model lineup as of late March 2026:

  • Gemini 3.1 Pro — preview, leading Artificial Analysis Index, faster and more accessible than Deep Think
  • Gemini 3 Deep Think — maximum reasoning mode, slower, best for hard problems requiring extended thinking
  • Claude 3.7 Sonnet — strong on coding and reasoning, particularly with extended thinking mode
  • GPT-4.5 — strong on instruction following and creative tasks, reasoning benchmarks below Gemini 3
  • Llama 4 Scout/Maverick — Meta's open-weight frontier models, strong for self-hosted deployment

For most developer use cases — coding assistance, code review, document analysis, RAG — Gemini 3.1 Pro is the practical choice: faster response time, lower cost than Deep Think, and top benchmark scores. Deep Think is for tasks where you can afford to wait for extended reasoning: complex mathematical proofs, adversarial security analysis, hard algorithmic problems.

What the Benchmarks Do Not Tell You

ARC-AGI-2 and HLE measure specific reasoning capabilities in controlled conditions. They do not measure:

Context window reliability. How well does the model perform at the 128K or 1M token context boundary? Benchmark evaluations typically use short prompts. Real developer workflows often involve very long documents.

Instruction following consistency. Does the model reliably follow format instructions, output constraints, and system prompt requirements across 1,000 API calls? Benchmarks test single-shot performance. Production deployments need consistency.

Hallucination rate on domain-specific facts. ARC-AGI-2 tests abstract reasoning. A model that scores 84.6% on abstract puzzles may still hallucinate specific API method names, library versions, or regulatory details.

Latency and cost at production scale. Deep Think's extended reasoning mode is slow. For latency-sensitive applications — real-time coding assistants, chat interfaces — response time matters more than peak benchmark score.
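The consistency and latency points can be checked with a small harness of your own. The sketch below is one possible shape, under stated assumptions: `fake_model` is a hypothetical stand-in for whatever client you actually call, the "format instruction" is simply "reply with a JSON object containing given keys", and the 95th percentile is a simple nearest-rank approximation.

```python
import json
import statistics
import time
from typing import Callable, Dict, List


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; swap in your own client."""
    return json.dumps({"answer": len(prompt)})


def is_valid_json_object(text: str, required_keys: List[str]) -> bool:
    """True if text parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


def evaluate(call_model: Callable[[str], str], prompts: List[str],
             required_keys: List[str]) -> Dict[str, float]:
    """Call the model once per prompt; report format compliance and latency."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    latencies: List[float] = []
    compliant = 0
    for prompt in prompts:
        start = time.perf_counter()
        reply = call_model(prompt)
        latencies.append(time.perf_counter() - start)
        if is_valid_json_object(reply, required_keys):
            compliant += 1
    latencies.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the list.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "compliance_rate": compliant / len(prompts),
        "median_latency_s": statistics.median(latencies),
        "p95_latency_s": latencies[p95_index],
    }


report = evaluate(fake_model, [f"question {i}" for i in range(1000)], ["answer"])
print(report)
```

Running the same harness against each candidate model, with your real prompts and output schema, gives numbers that are directly comparable for your workload in a way a published benchmark score is not.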

Key Takeaways

  • Gemini 3 Deep Think scored 84.6% on ARC-AGI-2 (independently verified by ARC Prize Foundation) — first model to substantially exceed average human performance (60%) on this benchmark
  • 48.4% on Humanity's Last Exam — nearly half of PhD-level expert questions answered correctly across all domains
  • 3455 Elo on Codeforces: in the range of the highest-rated active human competitive programmers, a strong signal for algorithmic code generation and debugging
  • Gemini 3.1 Pro in preview (March 2026), leading Artificial Analysis Intelligence Index — faster, more accessible version for daily developer use
  • ARC-AGI-2 at 84.6% is evidence of genuine reasoning, not pattern matching — the benchmark is specifically designed to test novelty
  • Benchmarks do not capture context window reliability, instruction following consistency, hallucination rate, or production latency — evaluate these separately for your use case

Frequently Asked Questions

What did Gemini 3 Deep Think score on ARC-AGI-2?

Gemini 3 Deep Think scored 84.6% on ARC-AGI-2, independently verified by the ARC Prize Foundation in February 2026. The average human score on ARC-AGI-2 is approximately 60%. Previous frontier models including GPT-4o and Claude 3.5 Sonnet scored below 20% on early ARC-AGI-2 evaluations. The benchmark tests novel abstract visual reasoning designed to require genuine problem decomposition rather than pattern matching from training data.

What is Gemini 3.1 Pro and how does it differ from Deep Think?

Gemini 3.1 Pro is a faster, more accessible model in Google's Gemini 3 family, currently in preview (March 2026) and leading the Artificial Analysis Intelligence Index. Deep Think is the maximum reasoning mode — slower, more expensive, designed for hard problems requiring extended thinking such as complex mathematical proofs and difficult algorithmic challenges. For most developer use cases (coding assistance, code review, document analysis), Gemini 3.1 Pro is the practical choice; Deep Think is for tasks where you can tolerate slower responses.

What does Gemini 3's Codeforces 3455 Elo mean for developers?

A Codeforces rating of 3455 places Gemini 3 Deep Think in the range of the highest-rated active human competitive programmers (3000+ is Legendary Grandmaster, a tier with fewer than 100 active humans). For developers, this means strong capability in algorithm implementation, data structure optimization, and identifying logical edge cases. It does not mean it outperforms top engineers on real-world software architecture or domain-specific engineering: competitive programming tests algorithmic skill in isolation from system design, API knowledge, and software engineering judgment.

How does Gemini 3 Deep Think compare to Claude and GPT-4.5?

As of March 2026: Gemini 3 Deep Think leads on ARC-AGI-2 (84.6%), Humanity's Last Exam (48.4%), and Codeforces Elo (3455). Claude 3.7 Sonnet is strong on coding and reasoning tasks, particularly with extended thinking mode. GPT-4.5 leads on instruction following and creative tasks but trails on hard reasoning benchmarks. Gemini 3.1 Pro (the faster preview model) currently leads the Artificial Analysis Intelligence Index which aggregates across coding, reasoning, and instruction following.

What is Humanity's Last Exam and why does the score matter?

Humanity's Last Exam (HLE) is a benchmark assembled from questions at the frontier of human expertise across all academic disciplines — designed to be outside what AI systems can answer from training data by using novel problems contributed by domain experts. Gemini 3 Deep Think scored 48.4%, meaning it correctly answers nearly half of PhD-level expert problems across all domains. For context, human experts score approximately 90% within their specific domain and lower across all domains. A score of 48.4% across all domains represents strong generalist performance at expert difficulty.

Written by Abhishek Gautam

Software Engineer based in Delhi, India. Writes about AI models, semiconductor supply chains, and tech geopolitics — covering the intersection of infrastructure and global events. 941+ posts cited by ChatGPT, Perplexity, and Gemini. Read in 167 countries.