State of LLMs — The Complete Guide
Last updated: March 8, 2026
This unified guide synthesizes and cross-verifies the latest data from official announcements (OpenAI, Anthropic, Google, xAI, Alibaba, MiniMax, Zhipu, DeepSeek, Meta), leaderboards (SWE-Bench Verified, ARC-AGI-2, Terminal-Bench 2.0, GPQA Diamond, LMArena, Artificial Analysis), OpenRouter usage, and API pricing docs. Prices in USD per 1M tokens unless noted. Frontier capabilities have commoditized: Chinese and open-weight models now deliver 90-95%+ of Western flagship performance at 10-30x lower cost for most tasks. Agentic coding, 1M+ context, multi-agent reasoning, terminal control, and strong multimodality are table stakes. 99% of dev tasks run excellently on sub-$3/M blended models with smart routing and caching.
Key Takeaway (March 8, 2026): The Feb-Mar 2026 release wave (GPT-5.4 on Mar 5 unifying reasoning+coding, Gemini 3.1 Pro on Feb 19 leading many benchmarks, Grok 4.20 Beta 2 on Mar 3 with multi-agent capabilities, Claude 4.6 Opus/Sonnet family, Qwen3.5 Plus, MiniMax M2.5, GLM-5, DeepSeek V3.2) has dramatically narrowed gaps. Chinese models dominate value, volume, and price/performance; US/EU models lead in compliance-sensitive, high-stakes, and regulated work. Grok 5 is expected in Q2 2026.
Master Model List
| Full Name | Provider | Notes |
|---|---|---|
| Claude Opus 4.6 | Anthropic | Premium agentic/reasoning |
| Claude Sonnet 4.6 | Anthropic | High-value daily driver |
| GPT-5.4 | OpenAI | Unified reasoning+coding flagship (Mar 5) |
| GPT-5.3-Codex | OpenAI | Terminal/coding specialist |
| Gemini 3.1 Pro | Google | General/reasoning/terminal leader (Feb 19) |
| Grok 4.20 Beta 2 | xAI | Multi-agent, real-time X data |
| Grok 4.1 Fast | xAI | Massive context |
| Qwen3.5 Plus | Alibaba | Price/performance/multilingual leader |
| MiniMax M2.5 | MiniMax | Ultra-budget agentic coding SOTA |
| GLM-5 | Zhipu AI | Open agentic (744B MoE) |
| Llama 4 Maverick | Meta | Fully open weights |
| Llama 4 Scout | Meta | Extreme context open variant |
| DeepSeek V3.2 | DeepSeek | Ultra-cheap open frontier |
| GPT-5.3-Codex-Spark | OpenAI/Cerebras | 1,000+ tok/s coding specialist |
All frontier models support strong agentic/tool use and multimodality. Routing rule of thumb: Hard/high-stakes → Opus 4.6 or GPT-5.4; Medium → Gemini 3.1 Pro or Qwen3.5 Plus; Simple/volume → MiniMax M2.5 or DeepSeek V3.2. Caching + smart routing (OpenRouter/LiteLLM) is essential.
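The routing rule of thumb above can be sketched as a minimal dispatcher. A sketch only: the model IDs below are illustrative placeholders, not confirmed gateway slugs — check your router's catalog (OpenRouter/LiteLLM) for the real identifiers.

```python
# Tier-based model router sketch. Model IDs are illustrative
# placeholders, not verified OpenRouter slugs.
from typing import Literal

Tier = Literal["hard", "medium", "simple"]

# Tier -> ordered fallback chain, per the rule of thumb above.
ROUTES: dict[str, list[str]] = {
    "hard":   ["anthropic/claude-opus-4.6", "openai/gpt-5.4"],
    "medium": ["google/gemini-3.1-pro", "alibaba/qwen3.5-plus"],
    "simple": ["minimax/m2.5", "deepseek/v3.2"],
}

def pick_model(tier: Tier, unavailable: set[str] = frozenset()) -> str:
    """Return the first available model in the tier's fallback chain."""
    for model in ROUTES[tier]:
        if model not in unavailable:
            return model
    raise RuntimeError(f"no model available for tier {tier!r}")
```

In practice the same map plugs into a LiteLLM router or a thin OpenRouter wrapper; the point is that the fallback chain, not the individual model choice, is what you maintain.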
1. Market Summary
| Metric | Value | Notes/Source |
|---|---|---|
| Dev AI Coding Adoption | 92% use/plan | JetBrains 2026 |
| AI-Written Code Share | 55% of global code | GitHub/Greptile |
| Commodity 1M Context Price | $0.40 in / $2.40 out | Qwen3.5 Plus |
| OpenRouter Scale | 1T+ tokens/day, 5M+ devs | OpenRouter/a16z |
2. Frontier Model Overview
| Full Name | Provider | Release | Context | SWE-Bench Verified | Price (In/Out $/M) | Top Strengths |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | Feb 5 | 1M (beta) | ~80.8% | $5 / $25 | Agentic king, parallel sub-agents, high-stakes reasoning |
| GPT-5.4 | OpenAI | Mar 5 | 1M | Strong | $2.50 / $15-20 | Unified reasoning+coding, GDPval SOTA, native computer-use |
| Gemini 3.1 Pro | Google | Feb 19 | 1M+ | ~80.6% | ~$2-4 / $12-18 (tiered) | ARC-AGI/GPQA/terminal/science/math leader, 3-tier thinking |
| Grok 4.20 Beta 2 | xAI | Mar 3 | 2M | Strong | $3 / $15 | Multi-agent (4+), real-time X, rapid learning |
| Claude Sonnet 4.6 | Anthropic | Feb 17 | 1M (beta) | ~79.6% | $3 / $15 | Reliable daily driver (90%+ of Opus at lower cost) |
| GPT-5.3-Codex | OpenAI | Feb 5 | 400K-1M | ~80% | $1.75 / $14 | Terminal control, recursive self-debug |
| Qwen3.5 Plus | Alibaba | Feb 15 | 1M | 76-77% | $0.40 / $2.40 | Best price/performance, multilingual (strong CJK) |
| MiniMax M2.5 | MiniMax | Feb 12 | 205K | ~80.2% | $0.15-0.30 / $0.60-2.40 | Ultra-budget agentic SOTA, high throughput |
| GLM-5 | Zhipu AI | Feb 11 | 200K+ | ~77.8% | ~$2.55-3.20 | Open agentic leader |
| DeepSeek V3.2 | DeepSeek | Late 2025 | 128K-1M | ~80% | $0.028 (cached in) / $0.38-0.42 | Ultra-cheap open frontier, MIT license |
| Llama 4 Maverick | Meta | 2025 | 1M | Strong | Free/low (open) | Fully open weights, multimodal |
| Llama 4 Scout | Meta | 2025 | Up to 10M | Specialized | ~$0.34 (hosted) | Extreme context open variant |
| GPT-5.3-Codex-Spark | OpenAI/Cerebras | Feb 12 | — | Strong | Varies (fast) | 1,000+ tok/s coding |
3. API Pricing Snapshot (Sorted by Approx. Blended Cost, ~1:1.3 in:out ratio)
| Rank | Model | Context | Input ($/M) | Output ($/M) | Blended ~ | Hosting | Notes |
|---|---|---|---|---|---|---|---|
| 1 | Llama 4 Scout (Groq) | 10M+ | ~$0.11 | ~$0.34 | ~$0.55 | US | Fast open inference |
| 2 | DeepSeek V3.2 | 128K-1M | $0.028 (cache hit) | $0.38-0.42 | ~$0.57 | China | Cache miss higher |
| 3 | MiniMax M2.5 | 205K | $0.15-0.30 | $0.60-2.40 | ~$1-2 | China | Ultra-budget agentic |
| 4 | Qwen3.5 Plus | 1M | $0.40 | $2.40 | ~$3.50 | China | Volume sweet spot |
| 5 | GPT-5.3-Codex | 1M | $1.75 | $14 | ~$20 | US/EU | Coding specialist |
| 6 | Gemini 3.1 Pro | 1M+ | ~$2-4 (tiered at 200K) | ~$12-18 (tiered at 200K) | ~$17-26 | US/EU | Tiered pricing |
| 7 | Claude Sonnet 4.6 | 1M beta | $3 | $15 | ~$22.50 | US/EU | Daily driver |
| 8 | Grok 4.20/4.1 | 2M | $3 | $15 | ~$22.50 | US | Real-time + context |
| 9 | GPT-5.4 | 1M | $2.50 | $15-20 | ~$25-30 | US/EU | Unified flagship |
| 10 | Claude Opus 4.6 | 1M beta | $5 | $25 | ~$37.50 | US/EU | Premium high-stakes |
Caching delivers 75-90% savings on supported providers.
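The blended figures in the table follow directly from the stated ~1:1.3 in:out ratio, i.e. the cost of 1M input tokens plus 1.3M output tokens. A quick sketch that reproduces the table's math:

```python
def blended_cost(input_price: float, output_price: float,
                 out_ratio: float = 1.3) -> float:
    """Blended $ cost of 1M input tokens plus `out_ratio` M output tokens."""
    return input_price + out_ratio * output_price

# Opus 4.6: 5 + 1.3 * 25 = 37.5, matching the table's ~$37.50.
```

The same function makes it easy to re-rank models under your own in:out ratio — agentic coding workloads often skew far more output-heavy than 1:1.3.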
Paid Chat Subscriptions (Non-API)
- Anthropic Claude Pro: $20/mo — Opus 4.6, Sonnet 4.6 (best for reasoning/agents)
- OpenAI ChatGPT Plus: $20/mo — GPT-5.4 (general + creative)
- Google Gemini AI Pro: ~$20/mo — Gemini 3.1 Pro (1M+ context)
- xAI SuperGrok: ~$30-50/mo — Grok 4.20/4.1 (real-time)
Free/Generous Tiers: Google Gemini (unlimited basic), DeepSeek Chat (very high limits), Groq (fast open models), HuggingChat/Ollama (open weights).
4. Key Innovations
- Unified Reasoning + Coding + Computer-Use — GPT-5.4 (native tool use, mid-response steering)
- Parallel/Multi Sub-Agents & Agent Teams — Opus 4.6, Grok 4.20 Beta 2 (4-16 agents that coordinate/debate)
- 3-Tier Thinking System — Gemini 3.1 Pro (Low/Med/High compute modes)
- Recursive Self-Debug & Terminal Control — GPT-5.3-Codex, Gemini 3.1 Pro, Opus 4.6 (autonomous error fixing)
- Rapid Learning Architecture — Grok 4.20 Beta 2 (weekly real-world updates)
- Extreme Context Reliability — Grok 4.1 Fast, Llama 4 Scout (2M-10M production-grade)
- Ultra-Efficient Agentic at Low Cost — MiniMax M2.5, DeepSeek V3.2, Qwen3.5 Plus (near-SOTA at 10-30x lower price)
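The parallel sub-agent pattern above reduces to fanning a task out to several role-specialized workers and merging their results. A minimal sketch with stubbed agent calls — the personas and the `sub_agent` function are illustrative stand-ins for real model API calls, not any vendor's agent API:

```python
# Parallel sub-agent fan-out sketch. `sub_agent` stands in for a
# real model call; real orchestrators add debate/merge steps on top.
from concurrent.futures import ThreadPoolExecutor

def sub_agent(task: str, persona: str) -> str:
    # Placeholder for an API call with a persona-specific system prompt.
    return f"[{persona}] draft answer for: {task}"

def run_agent_team(task: str, personas: list[str]) -> list[str]:
    """Run one sub-agent per persona concurrently; collect in order."""
    with ThreadPoolExecutor(max_workers=len(personas)) as pool:
        futures = [pool.submit(sub_agent, task, p) for p in personas]
        return [f.result() for f in futures]

results = run_agent_team("refactor auth module",
                         ["planner", "coder", "reviewer", "tester"])
```

Threads suffice here because real agent work is I/O-bound API calls; the coordinate/debate step the frontier models add is a second model pass over `results`.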
5. Context Window Tiers
- Extreme: 10M+ → Llama 4 Scout
- Massive: 1M-2M+ → Grok 4.1 Fast/Grok 4.20 Beta 2, GPT-5.4, Gemini 3.1 Pro, Opus 4.6, Qwen3.5 Plus, DeepSeek V3.2
- Large: 400K-1M → GPT-5.3-Codex
- Standard: 128K-262K → MiniMax M2.5, GLM-5
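The tiers above can double as a routing input: pick the smallest tier whose window covers the prompt. A sketch with approximate window sizes taken from the list (simplified to a single ceiling per tier; real limits vary by provider):

```python
# Context-tier picker sketch. Window ceilings are approximations
# of the tiers listed above, not exact provider limits.
TIERS = [
    (10_000_000, "Extreme",  ["Llama 4 Scout"]),
    (1_000_000,  "Massive",  ["Grok 4.1 Fast", "GPT-5.4", "Gemini 3.1 Pro"]),
    (400_000,    "Large",    ["GPT-5.3-Codex"]),
    (128_000,    "Standard", ["MiniMax M2.5", "GLM-5"]),
]

def tier_for(tokens_needed: int) -> tuple[str, list[str]]:
    """Return the smallest tier (and its models) that fits the prompt."""
    for window, name, models in reversed(TIERS):
        if tokens_needed <= window:
            return name, models
    raise ValueError("exceeds every listed context window")
```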
6. Best Models by Use Case
| Use Case | Top Pick | Why | Budget/Alt |
|---|---|---|---|
| Pro Coding/Agents | Opus 4.6 | Highest SWE + reliability | Sonnet 4.6, MiniMax M2.5 |
| Unified Reasoning+Coding | GPT-5.4 | GDPval SOTA, native computer use | GPT-5.3-Codex |
| Terminal/DevOps | Gemini 3.1 Pro / GPT-5.3-Codex | Terminal-Bench leader | - |
| Cost-Efficient Production | Qwen3.5 Plus | 1M context at low cost | MiniMax M2.5, DeepSeek V3.2 |
| Research/Science/Math | Gemini 3.1 Pro | ARC-AGI/GPQA leader | - |
| Real-Time/Creative/Long | Grok 4.20 Beta 2/4.1 | Multi-agent + live X data | - |
| Open/On-Prem/Custom | Llama 4 Maverick/Llama 4 Scout/DeepSeek V3.2/GLM-5 | Fully open or self-host | - |
| Multilingual | Qwen3.5 Plus | Strong CJK + others | MiniMax M2.5 |
| Ultra-Budget Frontier | MiniMax M2.5 / DeepSeek V3.2 | 80%+ SWE at minimal cost | - |
7. Leaderboard Summary (Early March 2026)
SWE-Bench Verified (Agentic Coding): Opus 4.6 (~80.8%) > Gemini 3.1 Pro (~80.6%) > MiniMax M2.5 (~80.2%)
Other Key Benchmarks:
- ARC-AGI-2: Gemini 3.1 Pro leads (~77%)
- GPQA Diamond: Gemini 3.1 Pro leads (~94%)
- GDPval: GPT-5.4 leads
- Terminal-Bench 2.0: Gemini 3.1 Pro / GPT-5.3-Codex
- LMArena (Elo): Opus 4.6 variants top
- OpenRouter Usage: MiniMax M2.5 highest volume, followed by Qwen3.5 Plus/DeepSeek V3.2
8. AI Coding Tools & IDEs
- Cursor ($20 Pro): AI-native IDE, excellent Composer agents, multi-file edits. Supports Opus 4.6, GPT-5.4, Gemini 3.1 Pro, Grok 4.1 Fast.
- Windsurf ($15 Pro): Best value, Cascade agent, strong with Qwen3.5 Plus/Opus 4.6/MiniMax M2.5.
- Claude Code / Artifacts: Terminal agentic workflows with Opus 4.6/Sonnet 4.6.
- GitHub Copilot: Enterprise integration.
- OSS (Cline, Continue, Aider): Free BYOK agentic editing in VS Code/CLI — privacy-focused, Git-aware.
Local Inference Platforms (Ollama, LM Studio, Jan AI, Llamafile, GPT4All): Ideal for privacy. Popular models: Qwen3-Coder variants, Llama 4 Maverick/Llama 4 Scout, DeepSeek V3.2.
Recommended Stack: Cursor or Windsurf + OpenRouter/LiteLLM routing + Claude Code for terminal + Ollama/Continue for local/privacy.
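The BYOK routing leg of this stack ultimately sends OpenAI-compatible chat payloads to a gateway such as OpenRouter. A sketch of building one such request body — the model slug is an illustrative placeholder, and auth headers/endpoint URL are omitted:

```python
# Build an OpenAI-compatible chat request body for a routing gateway.
# The model slug is an illustrative placeholder.
import json

def chat_payload(model: str, prompt: str, max_tokens: int = 512) -> str:
    """Serialize a minimal chat-completions request body."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

payload = chat_payload("minimax/m2.5", "Summarize this diff.")
```

Because every provider in the stack speaks this shape (natively or via the gateway), swapping models is a one-string change — which is exactly what makes smart routing cheap to adopt.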
9. Global Directory & Compliance
US: OpenAI (GPT-5.4/GPT-5.3-Codex), Anthropic (Claude 4.6), Google (Gemini 3.1 Pro), xAI (Grok 4.20/4.1), Meta (Llama 4). China (ultra-low cost, non-sensitive only): Alibaba (Qwen3.5 Plus), MiniMax (MiniMax M2.5), Zhipu (GLM-5), DeepSeek (DeepSeek V3.2). Other: Mistral (EU), Cohere (Canada).
India-Specific: Prioritize US/EU or self-hosted (Ollama + Llama 4 Maverick/Qwen open weights) for client/NDA/gov work. MiniMax M2.5/Qwen3.5 Plus via OpenRouter for cost-effective frontier agentic coding. Leverage GitHub Student Pack, Azure/Google education credits.
Hosting Guide:
- US/EU: Opus 4.6, GPT-5.4/GPT-5.3-Codex, Gemini 3.1 Pro, Grok 4.20/4.1 (sensitive/client work)
- China: Qwen3.5 Plus, MiniMax M2.5, DeepSeek V3.2 (high-volume, non-sensitive)
- Self-Host/Open: Llama 4 Maverick/Llama 4 Scout, DeepSeek V3.2, GLM-5 (full privacy)
10. Cost Optimization & Upcoming
Strategies: Prompt caching (75-90% savings), smart routing (70-80%), batch API (~50%).
Cheapest Realistic Blended:
- DeepSeek V3.2 (cached) or Groq Llama 4 Scout: ~$0.55/M
- MiniMax M2.5: ~$1-2/M
- Qwen3.5 Plus: ~$3.50/M
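The caching savings quoted above are just expected value over the hit rate: effective price = hit_rate x cached price + (1 - hit_rate) x full price. A sketch (the 10%-of-full cached price is an illustrative assumption, not a quoted rate):

```python
def effective_price(full: float, cached: float, hit_rate: float) -> float:
    """Expected $/M input with prompt caching at a given hit rate."""
    return hit_rate * cached + (1 - hit_rate) * full

def savings_pct(full: float, cached: float, hit_rate: float) -> float:
    """Percent saved vs. paying the full input price every time."""
    return 100 * (1 - effective_price(full, cached, hit_rate) / full)

# Assumed example: cached price at 10% of full, 90% hit rate
# -> 81% savings, inside the 75-90% range quoted above.
```

The takeaway: savings depend far more on hit rate than on the cached price itself, so structure prompts (stable system prefix, volatile suffix) to maximize hits.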
Upcoming (Q2 2026+): Grok 5 (large MoE), Llama 4 Behemoth (2T+ params potential), Claude 5 family, continued Chinese iterations.
March 8, 2026 Bottom Line
- High-Stakes: Opus 4.6 + GPT-5.4/GPT-5.3-Codex
- Value/Agentic: MiniMax M2.5 + Qwen3.5 Plus/DeepSeek V3.2 (95%+ performance at commodity prices)
- Context/Real-Time: Grok 4.1 Fast/Grok 4.20 Beta 2 + Llama 4 Scout
- General/Terminal: Gemini 3.1 Pro
- Open/Local: Llama 4 variants + Ollama/Continue/Cline
The raw capability gap has largely closed for practical use. The real bottlenecks are now integration, workflow design, prompt engineering, and knowing what to build. Route intelligently, cache aggressively, and build boldly — especially with affordable frontier agentic options widely available. 🚀
Sources: Official provider releases and API docs (as of Mar 8, 2026), SWE-Bench Verified, ARC-AGI-2, GPQA, LMArena, Artificial Analysis, OpenRouter stats. The landscape evolves weekly—always verify latest pricing and benchmarks for production use.