State of LLMs — The Complete Guide
Last updated: March 8, 2026
This unified guide synthesizes and cross-verifies the latest data from official announcements (OpenAI, Anthropic, Google, xAI, Alibaba, MiniMax, Zhipu, DeepSeek, Meta), leaderboards (SWE-Bench Verified, ARC-AGI-2, Terminal-Bench 2.0, GPQA Diamond, LMArena, Artificial Analysis), OpenRouter usage, and API pricing docs. Prices in USD per 1M tokens unless noted. Frontier capabilities have commoditized: Chinese and open-weight models now deliver 90-95%+ of Western flagship performance at 10-30x lower cost for most tasks. Agentic coding, 1M+ context, multi-agent reasoning, terminal control, and strong multimodality are table stakes. 99% of dev tasks run excellently on sub-$3/M blended models with smart routing and caching.
Key Takeaway (March 8, 2026): The Feb-Mar 2026 release wave (GPT-5.4 on Mar 5 unifying reasoning+coding, Gemini 3.1 Pro on Feb 19 leading many benchmarks, Grok 4.20 Beta 2 on Mar 3 with multi-agent capabilities, Claude 4.6 Opus/Sonnet family, Qwen3.5 Plus, MiniMax M2.5, GLM-5, DeepSeek V3.2) has dramatically narrowed gaps. Chinese models dominate value, volume, and price/performance; US/EU models lead in compliance-sensitive, high-stakes, and regulated work. Grok 5 is expected in Q2 2026.
Master Model List
| Full Name | Provider | Notes |
|---|---|---|
| Claude Opus 4.6 | Anthropic | Premium agentic/reasoning |
| Claude Sonnet 4.6 | Anthropic | High-value daily driver |
| GPT-5.4 | OpenAI | Unified reasoning+coding flagship (Mar 5) |
| GPT-5.3-Codex | OpenAI | Terminal/coding specialist |
| Gemini 3.1 Pro | Google | General/reasoning/terminal leader (Feb 19) |
| Grok 4.20 Beta 2 | xAI | Multi-agent, real-time X data |
| Grok 4.1 Fast | xAI | Massive context |
| Qwen3.5 Plus | Alibaba | Price/performance/multilingual leader |
| MiniMax M2.5 | MiniMax | Ultra-budget agentic coding SOTA |
| GLM-5 | Zhipu AI | Open agentic (744B MoE) |
| Llama 4 Maverick | Meta | Fully open weights |
| Llama 4 Scout | Meta | Extreme context open variant |
| DeepSeek V3.2 | DeepSeek | Ultra-cheap open frontier |
| GPT-5.3-Codex-Spark | OpenAI/Cerebras | 1,000+ tok/s coding specialist |
All frontier models support strong agentic/tool use and multimodality. Routing rule of thumb: Hard/high-stakes → Opus 4.6 or GPT-5.4; Medium → Gemini 3.1 Pro or Qwen3.5 Plus; Simple/volume → MiniMax M2.5 or DeepSeek V3.2. Caching + smart routing (OpenRouter/LiteLLM) is essential.
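The routing rule of thumb above can be sketched as a minimal dispatcher. A sketch only: the model IDs below are illustrative placeholders, not confirmed gateway slugs — check your router's catalog (OpenRouter/LiteLLM) for the real identifiers.

```python
# Tier-based model router sketch. Model IDs are illustrative
# placeholders, not verified OpenRouter slugs.
from typing import Literal

Tier = Literal["hard", "medium", "simple"]

# Tier -> ordered fallback chain, per the rule of thumb above.
ROUTES: dict[str, list[str]] = {
    "hard":   ["anthropic/claude-opus-4.6", "openai/gpt-5.4"],
    "medium": ["google/gemini-3.1-pro", "alibaba/qwen3.5-plus"],
    "simple": ["minimax/m2.5", "deepseek/v3.2"],
}

def pick_model(tier: Tier, unavailable: set[str] = frozenset()) -> str:
    """Return the first available model in the tier's fallback chain."""
    for model in ROUTES[tier]:
        if model not in unavailable:
            return model
    raise RuntimeError(f"no model available for tier {tier!r}")
```

In practice the same map plugs into a LiteLLM router or a thin OpenRouter wrapper; the point is that the fallback chain, not the individual model choice, is what you maintain.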
1. Market Summary
| Metric | Value | Notes/Source |
|---|---|---|
| Dev AI Coding Adoption | 92% use/plan | JetBrains 2026 |
| AI-Written Code Share | 55% of global code | GitHub/Greptile |
| Commodity 1M Context Price | $0.40 in / $2.40 out | Qwen3.5 Plus |
| OpenRouter Scale | 1T+ tokens/day, 5M+ devs | OpenRouter/a16z |
2. Frontier Model Overview
| Full Name | Provider | Release | Context | SWE-Bench Verified | Price (In/Out $/M) | Top Strengths |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | Feb 5 | 1M (beta) | ~80.8% | $5 / $25 | Agentic king, parallel sub-agents, high-stakes reasoning |
| GPT-5.4 | OpenAI | Mar 5 | 1M | Strong | $2.50 / $15-20 | Unified reasoning+coding, GDPval SOTA, native computer-use |
| Gemini 3.1 Pro | Google | Feb 19 | 1M+ | ~80.6% | ~$2-4 / $12-18 (tiered) | ARC-AGI/GPQA/terminal/science/math leader, 3-tier thinking |
| Grok 4.20 Beta 2 | xAI | Mar 3 | 2M | Strong | $3 / $15 | Multi-agent (4+), real-time X, rapid learning |
| Claude Sonnet 4.6 | Anthropic | Feb 17 | 1M (beta) | ~79.6% | $3 / $15 | Reliable daily driver (90%+ of Opus at lower cost) |
| GPT-5.3-Codex | OpenAI | Feb 5 | 400K-1M | ~80% | $1.75 / $14 | Terminal control, recursive self-debug |
| Qwen3.5 Plus | Alibaba | Feb 15 | 1M | 76-77% | $0.40 / $2.40 | Best price/performance, multilingual (strong CJK) |
| MiniMax M2.5 | MiniMax | Feb 12 | 205K | ~80.2% | $0.15-0.30 / $0.60-2.40 | Ultra-budget agentic SOTA, high throughput |
| GLM-5 | Zhipu AI | Feb 11 | 200K+ | ~77.8% | ~$2.55-3.20 | Open agentic leader |
| DeepSeek V3.2 | DeepSeek | Late 2025 | 128K-1M | ~80% | $0.028 (cached in) / $0.38-0.42 | Ultra-cheap open frontier, MIT license |
| Llama 4 Maverick | Meta | 2025 | 1M | Strong | Free/low (open) | Fully open weights, multimodal |
| Llama 4 Scout | Meta | 2025 | Up to 10M | Specialized | ~$0.34 (hosted) | Extreme context open variant |
| GPT-5.3-Codex-Spark | OpenAI/Cerebras | Feb 12 | — | Strong | Varies (fast) | 1,000+ tok/s coding |
3. API Pricing Snapshot (Sorted by Approx. Blended Cost, ~1:1.3 in:out ratio)
| Rank | Model | Context | Input ($/M) | Output ($/M) | Blended ~ | Hosting | Notes |
|---|---|---|---|---|---|---|---|
| 1 | Llama 4 Scout (Groq) | 10M+ | ~$0.11 | ~$0.34 | ~$0.55 | US | Fast open inference |
| 2 | DeepSeek V3.2 | 128K-1M | $0.028 (cache hit) | $0.38-0.42 | ~$0.57 | China | Cache miss higher |
| 3 | MiniMax M2.5 | 205K | $0.15-0.30 | $0.60-2.40 | ~$1-2 | China | Ultra-budget agentic |
| 4 | Qwen3.5 Plus | 1M | $0.40 | $2.40 | ~$3.50 | China | Volume sweet spot |
| 5 | GPT-5.3-Codex | 1M | $1.75 | $14 | ~$20 | US/EU | Coding specialist |
| 6 | Gemini 3.1 Pro | 1M+ | ~$2-4 (tiered at 200K) | ~$12-18 (tiered at 200K) | ~$17-26 | US/EU | Tiered pricing |
| 7 | Claude Sonnet 4.6 | 1M beta | $3 | $15 | ~$22.50 | US/EU | Daily driver |
| 8 | Grok 4.20/4.1 | 2M | $3 | $15 | ~$22.50 | US | Real-time + context |
| 9 | GPT-5.4 | 1M | $2.50 | $15-20 | ~$25-30 | US/EU | Unified flagship |
| 10 | Claude Opus 4.6 | 1M beta | $5 | $25 | ~$37.50 | US/EU | Premium high-stakes |
Caching delivers 75-90% savings on supported providers.
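The blended figures in the table follow directly from the stated ~1:1.3 in:out ratio, i.e. the cost of 1M input tokens plus 1.3M output tokens. A quick sketch that reproduces the table's math:

```python
def blended_cost(input_price: float, output_price: float,
                 out_ratio: float = 1.3) -> float:
    """Blended $ cost of 1M input tokens plus `out_ratio` M output tokens."""
    return input_price + out_ratio * output_price

# Opus 4.6: 5 + 1.3 * 25 = 37.5, matching the table's ~$37.50.
```

The same function makes it easy to re-rank models under your own in:out ratio — agentic coding workloads often skew far more output-heavy than 1:1.3.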
Paid Chat Subscriptions (Non-API)
- Anthropic Claude Pro: $20/mo — Opus 4.6, Sonnet 4.6 (best for reasoning/agents)
- OpenAI ChatGPT Plus: $20/mo — GPT-5.4 (general + creative)
- Google Gemini AI Pro: ~$20/mo — Gemini 3.1 Pro (1M+ context)
- xAI SuperGrok: ~$30-50/mo — Grok 4.20/4.1 (real-time)
Free/Generous Tiers: Google Gemini (unlimited basic), DeepSeek Chat (very high limits), Groq (fast open models), HuggingChat/Ollama (open weights).
4. Key Innovations
- Unified Reasoning + Coding + Computer-Use — GPT-5.4 (native tool use, mid-response steering)
- Parallel/Multi Sub-Agents & Agent Teams — Opus 4.6, Grok 4.20 Beta 2 (4-16 agents that coordinate/debate)
- 3-Tier Thinking System — Gemini 3.1 Pro (Low/Med/High compute modes)
- Recursive Self-Debug & Terminal Control — GPT-5.3-Codex, Gemini 3.1 Pro, Opus 4.6 (autonomous error fixing)
- Rapid Learning Architecture — Grok 4.20 Beta 2 (weekly real-world updates)
- Extreme Context Reliability — Grok 4.1 Fast, Llama 4 Scout (2M-10M production-grade)
- Ultra-Efficient Agentic at Low Cost — MiniMax M2.5, DeepSeek V3.2, Qwen3.5 Plus (near-SOTA at 10-30x lower price)
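The parallel sub-agent pattern above reduces to fanning a task out to several role-specialized workers and merging their results. A minimal sketch with stubbed agent calls — the personas and the `sub_agent` function are illustrative stand-ins for real model API calls, not any vendor's agent API:

```python
# Parallel sub-agent fan-out sketch. `sub_agent` stands in for a
# real model call; real orchestrators add debate/merge steps on top.
from concurrent.futures import ThreadPoolExecutor

def sub_agent(task: str, persona: str) -> str:
    # Placeholder for an API call with a persona-specific system prompt.
    return f"[{persona}] draft answer for: {task}"

def run_agent_team(task: str, personas: list[str]) -> list[str]:
    """Run one sub-agent per persona concurrently; collect in order."""
    with ThreadPoolExecutor(max_workers=len(personas)) as pool:
        futures = [pool.submit(sub_agent, task, p) for p in personas]
        return [f.result() for f in futures]

results = run_agent_team("refactor auth module",
                         ["planner", "coder", "reviewer", "tester"])
```

Threads suffice here because real agent work is I/O-bound API calls; the coordinate/debate step the frontier models add is a second model pass over `results`.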
5. Context Window Tiers
- Extreme: 10M+ → Llama 4 Scout
- Massive: 1M-2M+ → Grok 4.1 Fast/Grok 4.20 Beta 2, GPT-5.4, Gemini 3.1 Pro, Opus 4.6, Qwen3.5 Plus, DeepSeek V3.2
- Large: 400K-1M → GPT-5.3-Codex
- Standard: 128K-262K → MiniMax M2.5, GLM-5
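The tiers above can double as a routing input: pick the smallest tier whose window covers the prompt. A sketch with approximate window sizes taken from the list (simplified to a single ceiling per tier; real limits vary by provider):

```python
# Context-tier picker sketch. Window ceilings are approximations
# of the tiers listed above, not exact provider limits.
TIERS = [
    (10_000_000, "Extreme",  ["Llama 4 Scout"]),
    (1_000_000,  "Massive",  ["Grok 4.1 Fast", "GPT-5.4", "Gemini 3.1 Pro"]),
    (400_000,    "Large",    ["GPT-5.3-Codex"]),
    (128_000,    "Standard", ["MiniMax M2.5", "GLM-5"]),
]

def tier_for(tokens_needed: int) -> tuple[str, list[str]]:
    """Return the smallest tier (and its models) that fits the prompt."""
    for window, name, models in reversed(TIERS):
        if tokens_needed <= window:
            return name, models
    raise ValueError("exceeds every listed context window")
```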
6. Best Models by Use Case
| Use Case | Top Pick | Why | Budget/Alt |
|---|---|---|---|
| Pro Coding/Agents | Opus 4.6 | Highest SWE + reliability | Sonnet 4.6, MiniMax M2.5 |
| Unified Reasoning+Coding | GPT-5.4 | GDPval SOTA, native computer use | GPT-5.3-Codex |
| Terminal/DevOps | Gemini 3.1 Pro / GPT-5.3-Codex | Terminal-Bench leader | - |
| Cost-Efficient Production | Qwen3.5 Plus | 1M context at low cost | MiniMax M2.5, DeepSeek V3.2 |
| Research/Science/Math | Gemini 3.1 Pro | ARC-AGI/GPQA leader | - |
| Real-Time/Creative/Long | Grok 4.20 Beta 2/4.1 | Multi-agent + live X data | - |
| Open/On-Prem/Custom | Llama 4 Maverick/Llama 4 Scout/DeepSeek V3.2/GLM-5 | Fully open or self-host | - |
| Multilingual | Qwen3.5 Plus | Strong CJK + others | MiniMax M2.5 |
| Ultra-Budget Frontier | MiniMax M2.5 / DeepSeek V3.2 | 80%+ SWE at minimal cost | - |
7. Leaderboard Summary (Early March 2026)
SWE-Bench Verified (Agentic Coding): Opus 4.6 (~80.8%) > Gemini 3.1 Pro (~80.6%) > MiniMax M2.5 (~80.2%)
Other Key Benchmarks:
- ARC-AGI-2: Gemini 3.1 Pro leads (~77%)
- GPQA Diamond: Gemini 3.1 Pro leads (~94%)
- GDPval: GPT-5.4 leads
- Terminal-Bench 2.0: Gemini 3.1 Pro / GPT-5.3-Codex
- LMArena (Elo): Opus 4.6 variants top
- OpenRouter Usage: MiniMax M2.5 highest volume, followed by Qwen3.5 Plus/DeepSeek V3.2
8. AI Coding Tools & IDEs
- Cursor ($20 Pro): AI-native IDE, excellent Composer agents, multi-file edits. Supports Opus 4.6, GPT-5.4, Gemini 3.1 Pro, Grok 4.1 Fast.
- Windsurf ($15 Pro): Best value, Cascade agent, strong with Qwen3.5 Plus/Opus 4.6/MiniMax M2.5.
- Claude Code / Artifacts: Terminal agentic workflows with Opus 4.6/Sonnet 4.6.
- GitHub Copilot: Enterprise integration.
- OSS (Cline, Continue, Aider): Free BYOK agentic editing in VS Code/CLI — privacy-focused, Git-aware.
Local Inference Platforms (Ollama, LM Studio, Jan AI, Llamafile, GPT4All): Ideal for privacy. Popular models: Qwen3-Coder variants, Llama 4 Maverick/Llama 4 Scout, DeepSeek V3.2.
Recommended Stack: Cursor or Windsurf + OpenRouter/LiteLLM routing + Claude Code for terminal + Ollama/Continue for local/privacy.
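The BYOK routing leg of this stack ultimately sends OpenAI-compatible chat payloads to a gateway such as OpenRouter. A sketch of building one such request body — the model slug is an illustrative placeholder, and auth headers/endpoint URL are omitted:

```python
# Build an OpenAI-compatible chat request body for a routing gateway.
# The model slug is an illustrative placeholder.
import json

def chat_payload(model: str, prompt: str, max_tokens: int = 512) -> str:
    """Serialize a minimal chat-completions request body."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

payload = chat_payload("minimax/m2.5", "Summarize this diff.")
```

Because every provider in the stack speaks this shape (natively or via the gateway), swapping models is a one-string change — which is exactly what makes smart routing cheap to adopt.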
9. Global Directory & Compliance
US: OpenAI (GPT-5.4/GPT-5.3-Codex), Anthropic (Claude 4.6), Google (Gemini 3.1 Pro), xAI (Grok 4.20/4.1), Meta (Llama 4). China (ultra-low cost, non-sensitive only): Alibaba (Qwen3.5 Plus), MiniMax (MiniMax M2.5), Zhipu (GLM-5), DeepSeek (DeepSeek V3.2). Other: Mistral (EU), Cohere (Canada).
India-Specific: Prioritize US/EU or self-hosted (Ollama + Llama 4 Maverick/Qwen open weights) for client/NDA/gov work. MiniMax M2.5/Qwen3.5 Plus via OpenRouter for cost-effective frontier agentic coding. Leverage GitHub Student Pack, Azure/Google education credits.
Hosting Guide:
- US/EU: Opus 4.6, GPT-5.4/GPT-5.3-Codex, Gemini 3.1 Pro, Grok 4.20/4.1 (sensitive/client work)
- China: Qwen3.5 Plus, MiniMax M2.5, DeepSeek V3.2 (high-volume, non-sensitive)
- Self-Host/Open: Llama 4 Maverick/Llama 4 Scout, DeepSeek V3.2, GLM-5 (full privacy)
10. Cost Optimization & Upcoming
Strategies: Prompt caching (75-90% savings), smart routing (70-80%), batch API (~50%).
Cheapest Realistic Blended:
- DeepSeek V3.2 (cached) or Groq Llama 4 Scout: ~$0.55/M
- MiniMax M2.5: ~$1-2/M
- Qwen3.5 Plus: ~$3.50/M
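The caching savings quoted above are just expected value over the hit rate: effective price = hit_rate x cached price + (1 - hit_rate) x full price. A sketch (the 10%-of-full cached price is an illustrative assumption, not a quoted rate):

```python
def effective_price(full: float, cached: float, hit_rate: float) -> float:
    """Expected $/M input with prompt caching at a given hit rate."""
    return hit_rate * cached + (1 - hit_rate) * full

def savings_pct(full: float, cached: float, hit_rate: float) -> float:
    """Percent saved vs. paying the full input price every time."""
    return 100 * (1 - effective_price(full, cached, hit_rate) / full)

# Assumed example: cached price at 10% of full, 90% hit rate
# -> 81% savings, inside the 75-90% range quoted above.
```

The takeaway: savings depend far more on hit rate than on the cached price itself, so structure prompts (stable system prefix, volatile suffix) to maximize hits.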
Upcoming (Q2 2026+): Grok 5 (large MoE), Llama 4 Behemoth (2T+ params potential), Claude 5 family, continued Chinese iterations.
March 8, 2026 Bottom Line
- High-Stakes: Opus 4.6 + GPT-5.4/GPT-5.3-Codex
- Value/Agentic: MiniMax M2.5 + Qwen3.5 Plus/DeepSeek V3.2 (95%+ performance at commodity prices)
- Context/Real-Time: Grok 4.1 Fast/Grok 4.20 Beta 2 + Llama 4 Scout
- General/Terminal: Gemini 3.1 Pro
- Open/Local: Llama 4 variants + Ollama/Continue/Cline
The raw capability gap has largely closed for practical use. The real bottlenecks are now integration, workflow design, prompt engineering, and knowing what to build. Route intelligently, cache aggressively, and build boldly — especially with affordable frontier agentic options widely available. 🚀
Sources: Official provider releases and API docs (as of Mar 8, 2026), SWE-Bench Verified, ARC-AGI-2, GPQA, LMArena, Artificial Analysis, OpenRouter stats. The landscape evolves weekly—always verify latest pricing and benchmarks for production use.