NVIDIA Nemotron 3 Super: 60% SWE-bench, Best Open Model for Code

Abhishek Gautam · 7 min read

Quick summary

NVIDIA Nemotron 3 Super hits 60.47% on SWE-bench — highest open-weight score ever. 120B total, 12B active, 1M context, 5x throughput vs GPT-OSS. Already in CodeRabbit and Greptile.

NVIDIA released Nemotron 3 Super on March 11, 2026. The headline number is 60.47% on SWE-bench Verified — the highest score any open-weight model has achieved on the benchmark that tests AI's ability to resolve real GitHub issues.

For context: Claude Opus 4.6 scores 80.8% on the same benchmark and GPT-5.4 scores around 75%. Nemotron 3 Super is not beating frontier closed models. What it is doing is beating every other model you can download and run yourself — by a meaningful margin.

This matters because open-weight models have a fundamentally different value proposition than API-accessed models. No per-token costs. No data leaving your infrastructure. No rate limits on private codebases. If Nemotron 3 Super can handle 60% of real GitHub issues autonomously, that's a capable autonomous coding agent you can run on your own hardware.

The Architecture: Why Mamba-Transformer Hybrid Is Different

Most large language models use transformer attention, which has quadratic computational complexity in sequence length: doubling the context roughly quadruples the attention compute. This is why 1M-token context windows are expensive to run even when the model technically supports them.

Nemotron 3 Super uses a hybrid architecture: interleaved Mamba-2 layers, Mixture-of-Experts (MoE) layers, and select transformer attention layers. The Mamba-2 backbone uses linear-time sequence processing — compute cost scales linearly with context length, not quadratically.

The result: the model can process a 1M-token context window at a cost that doesn't explode the way pure transformer attention would. For developers trying to run codebase-wide analysis, this is the practical difference between "can do this with a single H100" and "needs a multi-GPU cluster."
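To see why that matters, here's a back-of-the-envelope comparison of how per-sequence compute grows under quadratic attention versus a linear-time layer. The unitless "costs" are my own illustration, not NVIDIA's published figures:

```python
def attention_cost(n_tokens):
    # Full self-attention compares every token with every other token: O(n^2).
    return n_tokens ** 2

def linear_cost(n_tokens):
    # A Mamba-style state-space layer makes one pass over the sequence: O(n).
    return n_tokens

# At 1M tokens, the gap between the two scaling regimes is a factor of 1M.
for n in (8_000, 128_000, 1_000_000):
    ratio = attention_cost(n) // linear_cost(n)
    print(f"{n:>9} tokens -> attention costs {ratio:,}x the linear layer")
```

The absolute constants differ wildly between real kernels, but the shape of the curves is the point: quadratic attention dominates total cost long before you reach 1M tokens, and a mostly-linear backbone sidesteps that.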

The Scale Numbers and What They Mean

Nemotron 3 Super has 120 billion total parameters and 12 billion active parameters. The gap between those two numbers is the MoE architecture at work. In each forward pass, only 12B of the 120B parameters activate — the router selects which expert sub-networks handle each token. You get near-120B model quality at roughly 12B inference compute cost.
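As a rough sketch of how MoE routing keeps active compute small, here's a toy top-k router in NumPy. The expert count, top-k value, and dimensions are invented for illustration; Nemotron's actual routing configuration isn't specified here:

```python
import numpy as np

# Toy Mixture-of-Experts layer: route each token to the top-k experts by
# gate score. 16 experts / top-2 / d_model=64 are illustrative numbers only.
rng = np.random.default_rng(0)
n_experts, k, d_model = 16, 2, 64

gate = rng.standard_normal((d_model, n_experts))            # router weights
experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.02

def moe_forward(x):
    scores = x @ gate                                # (n_experts,) gate logits
    top = np.argsort(scores)[-k:]                    # indices of the k best experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                         # softmax over selected experts
    # Only k of n_experts weight matrices are touched per token, which is why
    # "active parameters" are a small fraction of total parameters.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(f"output shape: {out.shape}; active experts per token: {k}/{n_experts}")
```

In this toy setup only 2 of 16 expert matrices run per token; scale the same idea up and you get Nemotron's 12B-active-of-120B-total profile.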

For throughput comparison:

  • 5x higher throughput than GPT-OSS-120B (a comparable-scale open-weight model)
  • 7.5x higher throughput than Qwen3.5-122B
  • 2-3x wall-clock speedup on structured generation like code and tool calls, via built-in speculative decoding

These are not marginal differences. A 5x throughput advantage means you can run 5 parallel coding agents for the same hardware budget that would run one GPT-OSS-120B agent. For agentic coding workflows where you want multiple parallel code-review or debugging passes, this compounds.
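Translated into hardware-budget arithmetic, using the multipliers quoted above (the baseline assumption that one GPT-OSS-120B agent saturates one GPU is mine, for illustration):

```python
# Parallel agents per fixed GPU budget, from the quoted throughput figures.
gpus = 4
throughput_multiplier = {
    "GPT-OSS-120B": 1.0,        # baseline
    "Qwen3.5-122B": 5.0 / 7.5,  # implied: Nemotron is 5x one, 7.5x the other
    "Nemotron 3 Super": 5.0,
}
for model, mult in throughput_multiplier.items():
    print(f"{model:>16}: ~{gpus * mult:.1f} parallel agents on {gpus} GPUs")
```

Note the implied comparison: if Nemotron is 5x GPT-OSS-120B and 7.5x Qwen3.5-122B, then Qwen3.5-122B runs at roughly two-thirds of GPT-OSS-120B's throughput.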

SWE-bench 60.47%: What the Number Actually Means

SWE-bench Verified is one of the most rigorous benchmarks for AI coding capability. It presents real GitHub issues from popular Python repositories — the same issues real contributors resolved — and asks the model to produce a patch that passes the test suite.

There's no memorization shortcut available. The issues are from real production codebases. The test suite validates whether the patch actually fixes the problem, not just whether it looks correct.

60.47% means Nemotron 3 Super resolves more than 3 in every 5 real GitHub issues autonomously. Among open-weight models, the previous best was in the high 50s. Among all models including closed frontier systems, 60.47% sits meaningfully below Claude Opus 4.6 (80.8%) and GPT-5.4 (~75%), but it's no longer in a different category from them.

The practical implication: for code review automation, bug triage, and greenfield feature implementation in constrained contexts, Nemotron 3 Super is capable enough that the "good enough" bar for a self-hosted solution has been crossed.

The Inference Stack: How to Access It

NVIDIA has made Nemotron 3 Super available through several routes, from zero-setup cloud inference to full self-hosted deployment.

Managed inference (no setup required):

  • Perplexity Labs API — callable via standard OpenAI-compatible endpoint
  • OpenRouter — aggregated access alongside other models
  • build.nvidia.com — NVIDIA's own NIM (NVIDIA Inference Microservices) endpoint

Self-hosted:

  • HuggingFace model hub — full weights available for download
  • NVIDIA NIM container — Docker-compatible deployment with built-in speculative decoding already configured

For most developers evaluating the model, starting with OpenRouter or build.nvidia.com is the fastest path to a working prototype before committing to the infrastructure investment of self-hosting a 120B-parameter model.
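Here's a minimal sketch of calling one of these OpenAI-compatible endpoints using only the Python standard library. The base URL follows OpenRouter's documented pattern, but the model identifier below is a placeholder; check the provider's model list for the real name:

```python
import json
import urllib.request

# Build a chat-completions request against an OpenAI-compatible endpoint.
BASE_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "nvidia/nemotron-3-super",  # placeholder identifier
    "messages": [
        {"role": "user", "content": "Summarize what this diff changes: ..."},
    ],
    "max_tokens": 512,
}

req = urllib.request.Request(
    BASE_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# With a real key, send it and read the standard OpenAI-shaped response:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
print("request prepared for", req.full_url)
```

Because all three managed routes speak the same OpenAI-compatible protocol, swapping providers is a matter of changing the base URL, key, and model string.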

Already Integrated: CodeRabbit, Factory, Greptile

Three coding tools have already shipped integrations with Nemotron 3 Super as of March 2026:

CodeRabbit — AI code review tool that comments on pull requests. Nemotron 3 Super is now available as a code review engine alongside Claude Opus 4.6 and GPT-5.4. The throughput advantage means faster PR turnaround at lower cost for high-volume repositories.

Factory — agentic coding platform that implements feature requests end-to-end. Nemotron 3 Super runs as an agent backbone for implementation tasks where users want to self-host the model rather than route through external APIs.

Greptile — codebase Q&A and search tool. Nemotron 3 Super's 1M-token context window is particularly relevant here: Greptile needs to load large code contexts to answer questions about complex codebases, and linear-time sequence processing makes that economical at scale.

The fact that these tools shipped integrations within days of the model release signals that the performance is real and reproducible outside of NVIDIA's own benchmarking environment.

The Caveats Developers Should Know

Training data cutoff. Pre-training data has a cutoff of June 2025. Post-training (instruction tuning) data has a cutoff of February 2026. For code in repositories that evolved significantly after mid-2025, the model may not be aware of new APIs, breaking changes, or community patterns introduced after that date.

English-primary. The model was trained primarily on English, plus 19 other natural languages and 43 programming languages. If your codebase has extensive non-English comments or documentation, expect performance below the benchmark numbers.

Mamba-2 and attention layer interaction. The hybrid architecture is newer than pure transformers and less battle-tested across diverse deployment configurations. Some inference frameworks don't yet fully optimize Mamba-2 layers. Validate the model on your specific workload before building a production pipeline around published numbers.

Open-weight is not fully open-source. The weights are downloadable, but the training code and full dataset composition are not released. You can run and fine-tune Nemotron 3 Super, but you cannot reproduce the training run.

Key Takeaways

  • Nemotron 3 Super scores 60.47% on SWE-bench Verified — the highest open-weight result on record
  • 120B total parameters, 12B active via MoE — near-120B quality at 12B inference compute cost
  • 1M-token context window with linear-time processing via Mamba-2 backbone
  • 5x throughput vs GPT-OSS-120B, 7.5x vs Qwen3.5-122B, 2-3x speedup on code generation
  • Available via Perplexity, OpenRouter, build.nvidia.com, and HuggingFace today
  • Already integrated into CodeRabbit, Factory, and Greptile
  • Best use cases: self-hosted code review automation, agentic coding in private codebases, codebase-wide Q&A
  • Caveats: training cutoff June 2025 (pre-training), no full open-source training code, hybrid Mamba architecture less tested in diverse deployment configs


