11 AI Models Released in February 2026 — Benchmarks, Pricing, and Architecture Comparison


February 2026 saw 11 AI models released in a single month. This article doesn’t just list what came out. Written from the perspective of a developer who actually has to choose among them, it compares which model suits which tasks along three axes: benchmarks, pricing, and architecture.

February Release Timeline at a Glance

| Date | Model | Company | Key Features |
|---|---|---|---|
| 2/5 | Claude Opus 4.6 | Anthropic | Adaptive thinking, 1M ctx, 128K output |
| 2/5 | GPT-5.3-Codex | OpenAI | Agentic coding, first “High” risk classification |
| 2/11 | GLM-5 | Zhipu AI | 744B open source |
| 2/12 | Gemini 3 Deep Think | Google | Science & reasoning specialized upgrade |
| 2/14 | Doubao 2.0 | ByteDance | 4-model family, 200M+ users |
| 2/16 | Qwen 3.5-397B | Alibaba | 397B MoE (17B active), 512 experts |
| 2/17 | Claude Sonnet 4.6 | Anthropic | Opus-level performance at 1/5 price |
| 2/17 | Grok 4.20 Beta | xAI | 4-16 agent collaboration |
| 2/19 | Gemini 3.1 Pro | Google | ARC-AGI-2 77.1%, 3-level reasoning control |
| Late Jan-Feb | Kimi K2.5 | Moonshot AI | Open-weight, agent swarm |
| Mid-Feb (expected) | DeepSeek V4 | DeepSeek | 1T parameters, 1M ctx |

Coding Performance: Who Codes Best?

Coding was the fiercest battleground of February’s release rush. SWE-bench Verified (a benchmark where models fix real GitHub issues) reveals the picture.

GPT-5.3-Codex was the first flagship designed specifically for coding[1]. It understood project structure from the CLI, wrote tests, and performed debugging in agent fashion. GitHub Copilot integration went GA on February 9, opening an era where agentic coding became the default in IDEs. It was also the first model classified as “High” risk in OpenAI’s internal safety evaluation, a testament to how powerful its autonomous code execution abilities had become.

Claude Sonnet 4.6 scored 79.6% on SWE-bench Verified[2]. This was only 0.2 percentage points behind Opus 4.6, while pricing was $3/$15 (input/output per 1M tokens), one-fifth of Opus. For coding tasks prioritizing cost-effectiveness, it was the most reasonable choice as of February.

Gemini 3.1 Pro achieved the numerically highest SWE-bench score at 80.6%[3]. Pricing was also cheaper than Claude Sonnet at $2/$12. However, its 77.1% on ARC-AGI-2 reflected general reasoning ability, distinct from pure coding skills.

From the Chinese camp, Qwen 3.5-397B made its presence felt with 83.6 on LiveCodeBench v6[4]. The 397B MoE architecture activating only 17B kept inference costs low, and being open source (Apache 2.0) enabled self-hosting. GLM-5, at 744B and open source, approached Claude Opus 4.5’s coding benchmarks[5].

| Model | SWE-bench Verified | LiveCodeBench v6 | API Price (input/output, per 1M tokens) |
|---|---|---|---|
| Gemini 3.1 Pro | 80.6% | | $2/$12 |
| Claude Sonnet 4.6 | 79.6% | | $3/$15 |
| GPT-5.3-Codex | Undisclosed | | |
| Qwen 3.5-397B | | 83.6 | $0.11/— (Alibaba) |

Reasoning Ability: Thinking Depth Evolved

The keyword shared across February’s models was “reasoning control.” The fundamental change was not just that models got smarter, but that users gained control over how deeply they reason.

Claude Opus 4.6’s adaptive thinking let the model judge difficulty itself[6]. It gave instant answers to “How’s the weather today?” but engaged long thought chains for complex mathematical proofs. On the METR benchmark, its 50% task-completion time horizon was 14 hours 30 minutes, meaning the model could work autonomously for over half a day. The 1M token context window (beta) and 128K output tokens dramatically expanded the physical limits for long-form work.
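The idea of adaptive thinking can be sketched as a dispatcher that estimates difficulty and allocates a reasoning budget accordingly. The heuristics and budget values below are invented for illustration; they are not Anthropic’s actual mechanism.

```python
# Toy illustration of "adaptive thinking": estimate prompt difficulty,
# then choose a chain-of-thought token budget. The cue list and budget
# numbers are made up for this sketch, not Anthropic's implementation.
def thinking_budget(prompt: str) -> int:
    """Return a reasoning-token budget scaled by rough difficulty cues."""
    cues = ("prove", "derive", "optimize", "step by step", "theorem")
    difficulty = sum(cue in prompt.lower() for cue in cues)
    # 0 cues -> answer directly; 1 cue -> short reasoning; more -> deep reasoning
    return {0: 0, 1: 1024}.get(difficulty, 8192)

print(thinking_budget("How's the weather today?"))         # 0
print(thinking_budget("Prove the theorem step by step."))  # 8192
```

A real implementation would let the model itself estimate difficulty rather than keyword-matching, but the budget-allocation shape is the same.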

Gemini 3.1 Pro took a different approach[3]. Users could directly select reasoning depth across three levels: shallow reasoning for quick responses, deep reasoning when accuracy mattered. VentureBeat called this “Deep Think Mini,” a lightweight version of Google’s dedicated reasoning model Deep Think embedded in a general-purpose model.

Gemini 3 Deep Think itself was also upgraded on February 12[7]. As a reasoning mode specialized for science, research, and engineering, it focused on complex multi-step problem solving. Simultaneously updating the general model (3.1 Pro) and the specialized reasoning model (Deep Think) signaled that Google was applying pressure across the full spectrum of the reasoning domain.

Grok 4.20 Beta showed a completely different philosophy[8]. Instead of deepening a single model’s reasoning, it ran 4 (or 16 in Heavy mode) AI agents simultaneously and had them collaborate. Its Chatbot Arena ELO was estimated at 1505-1535. The experiment tested how far multiple experts in discussion could go in place of one genius.
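The multi-agent pattern above can be sketched in a few lines: run several agents in parallel and aggregate their answers. This is a generic illustration of the idea, not Grok’s implementation; `ask_agent` is a hypothetical stand-in for a model API call, and a real system would add a debate or critique round rather than a bare majority vote.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# Hypothetical stand-in for one agent's answer; a real system would
# call a model API here, varying prompt, persona, or temperature per agent.
def ask_agent(agent_id: int, question: str) -> str:
    return f"answer-{hash((agent_id, question)) % 2}"

def agent_ensemble(question: str, n_agents: int = 4) -> str:
    """Run n_agents in parallel and return the majority answer."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda i: ask_agent(i, question),
                                range(n_agents)))
    # Simple majority vote over the agents' answers.
    return Counter(answers).most_common(1)[0][0]

print(agent_ensemble("Is 97 prime?", n_agents=4))
```

Scaling `n_agents` from 4 to 16 is exactly the knob the Heavy mode described above turns; the cost grows linearly with the number of agents.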

Open Source vs Closed: The Gap Is Disappearing

February’s most significant change was open source models reaching parity with closed models. This wasn’t a simple case of catching up on benchmarks; it was an event that reshaped the ecosystem.

GLM-5 (744B) was released completely open source by Zhipu AI[5]. In coding, it approached Claude Opus 4.5 and surpassed Gemini 3 Pro in some areas. Such open source model performance would have been unimaginable a year ago.

Qwen 3.5-397B pushed efficiency to the extreme[4]. An MoE structure activating only 17B parameters out of 512 experts scored 91.3 on AIME26 (math) while improving decoding throughput 8.6-19x over the previous generation. Apache 2.0 license.
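Why can a 397B-parameter model bill like a 17B one? Because an MoE layer routes each token to only its top-k experts, so compute scales with k while parameter count scales with the total number of experts. A minimal NumPy sketch of top-k routing, with toy sizes (Qwen 3.5 reportedly uses 512 experts; the shapes and gating here are a generic MoE illustration, not its actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 16, 2                    # toy sizes for illustration

router_w = rng.standard_normal((d, n_experts))    # token -> expert scores
experts_w = rng.standard_normal((n_experts, d, d))  # one toy expert per slot

def moe_forward(x):
    """Route one token to its top_k experts; the other experts stay idle.

    Compute cost scales with top_k, parameter count with n_experts --
    which is how a 397B model can activate only ~17B per token.
    """
    scores = x @ router_w                          # (n_experts,)
    top = np.argsort(scores)[-top_k:]              # indices of chosen experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax gate
    return sum(g * (experts_w[i] @ x) for g, i in zip(gates, top))

y = moe_forward(rng.standard_normal(d))
print(y.shape)  # (8,)
```

Production MoE layers add load-balancing losses and batched expert dispatch, but the ratio of active to total parameters is set by exactly this routing step.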

Kimi K2.5 was Moonshot AI’s open-weight model with vision capabilities and an agent swarm, showing GPT-5 and Gemini-level coding performance[9]. Open-weight models beginning to include agent functionality was a noteworthy trend.

These three models’ simultaneous emergence created clear pressure. Closed model companies could no longer justify API pricing with just “our model is better.” Claude Sonnet 4.6 launching at 1/5 of Opus pricing and Gemini 3.1 Pro freezing prices weren’t unrelated to this open source pressure.

The Real Dawn of the Agent Era

Another current running through February’s models was “agents.” Previously, agents were framework territory (LangChain, CrewAI, etc.), but now models themselves began embedding agent capabilities.

  • GPT-5.3-Codex: A full-cycle agent that explores codebases, analyzes issues, and submits PRs[1]
  • Grok 4.20: Up to 16 agents collaborating simultaneously[8]
  • Doubao 2.0: ByteDance officially declared the “agent era,” restructuring its 4-model family around agents[10]. Already deployed in an app used by 200 million people
  • Kimi K2.5: Agent swarm, independently controlling parallel workflows[9]
  • Claude Opus 4.6: METR 50% time horizon of 14 hours 30 minutes, i.e. half a day of autonomous work[6]

Models were transitioning from “answering tools” to “working colleagues.” This change accelerated rapidly in February.

The Unrevealed Card: DeepSeek V4

The most anticipated yet unrevealed model in February’s rush was DeepSeek V4[11]. Known specs: 1 trillion parameters, 1M token context, targeting 80%+ on SWE-bench. DeepSeek had already shocked the industry twice with V3 and R1. If V4 actually achieved these specs, it could be the first case of open source completely catching up to closed models.

Personal Thoughts

Comparing 11 models, what struck me most was that the concept of “best model” itself is becoming meaningless. GPT-5.3-Codex and Gemini 3.1 Pro excel at coding, Claude Opus 4.6 dominates long autonomous work, and Qwen 3.5 is unmatched in cost-effectiveness. The era of one model winning in all domains is over.

The practical conclusion was simple. Multi-model strategies switching models by task are now necessity, not choice. And with open source model performance reaching these levels, pricing premiums for closed models will become increasingly hard to justify.
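In practice, a multi-model strategy can start as something as simple as a routing table keyed by task type. The model ids below are illustrative strings based on this article’s comparison, not identifiers from any real SDK:

```python
# Illustrative task-to-model routing table. The choices mirror the
# article's February 2026 comparison; the id strings are hypothetical.
ROUTES = {
    "coding":      "gemini-3.1-pro",     # highest SWE-bench score at $2/$12
    "long_agent":  "claude-opus-4.6",    # 1M ctx, longest autonomous runs
    "cheap_batch": "qwen-3.5-397b",      # open source, lowest per-token cost
    "default":     "claude-sonnet-4.6",  # near-Opus quality at 1/5 the price
}

def pick_model(task_type: str) -> str:
    """Return the model id for a task, falling back to the default."""
    return ROUTES.get(task_type, ROUTES["default"])

print(pick_model("coding"))       # gemini-3.1-pro
print(pick_model("translation"))  # claude-sonnet-4.6
```

The table is the part that changes monthly; the routing function does not, which is what makes the strategy cheap to maintain.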

February isn’t even over yet. DeepSeek V4 remains.

Footnotes

  1. OpenAI, “Introducing GPT-5.3-Codex,” February 5, 2026.

  2. Anthropic, “Claude Sonnet 4.6,” February 17, 2026. VentureBeat coverage: SWE-bench 79.6%, $3/$15.

  3. Google DeepMind, “Gemini 3.1 Pro,” February 19, 2026. ARC-AGI-2 77.1%, SWE-bench 80.6%.

  4. Alibaba Cloud, “Qwen 3.5,” February 16, 2026. 397B MoE, LiveCodeBench v6 83.6, AIME26 91.3.

  5. Zhipu AI, “GLM-5 Open Source Release,” February 11, 2026. 744B parameters.

  6. Anthropic, “Claude Opus 4.6,” February 5, 2026. Adaptive thinking, METR 50% time horizon 14h30m.

  7. Google DeepMind, “Gemini 3 Deep Think Upgrade,” February 12, 2026.

  8. xAI, “Grok 4.20 Beta,” February 17, 2026. 4-16 agent collaboration, ELO 1505-1535.

  9. Moonshot AI, “Kimi K2.5,” Jan-Feb 2026. Open-weight, vision + agent swarm.

  10. ByteDance, “Doubao 2.0 / Seed 2.0,” February 14, 2026. 4-model family.

  11. DeepSeek, “DeepSeek V4 Preview,” February 2026. 1T parameters, 1M context.
