11 AI Models Released in February 2026 — Benchmarks, Pricing, and Architecture Comparison


February 2026 saw 11 AI models released in a single month. This article doesn’t just list what came out. Written from the perspective of a developer who actually has to choose among them, it compares which model suits which tasks along three axes: benchmarks, pricing, and architecture.

February Release Timeline at a Glance

| Date | Model | Company | Key Features |
|---|---|---|---|
| 2/5 | Claude Opus 4.6 | Anthropic | Adaptive thinking, 1M ctx, 128K output |
| 2/5 | GPT-5.3-Codex | OpenAI | Agentic coding, first “High” risk classification |
| 2/11 | GLM-5 | Zhipu AI | 744B open source |
| 2/12 | Gemini 3 Deep Think | Google | Science & reasoning specialized upgrade |
| 2/14 | Doubao 2.0 | ByteDance | 4-model family, 200M+ users |
| 2/16 | Qwen 3.5-397B | Alibaba | 397B MoE (17B active), 512 experts |
| 2/17 | Claude Sonnet 4.6 | Anthropic | Opus-level performance at 1/5 price |
| 2/17 | Grok 4.20 Beta | xAI | 4-16 agent collaboration |
| 2/19 | Gemini 3.1 Pro | Google | ARC-AGI-2 77.1%, 3-level reasoning control |
| Late Jan-Feb | Kimi K2.5 | Moonshot AI | Open-weight, agent swarm |
| Mid-Feb (expected) | DeepSeek V4 | DeepSeek | 1T parameters, 1M ctx |

Coding Performance: Who Codes Best?

Coding was the fiercest battleground of February’s release rush. SWE-bench Verified (a benchmark where models fix real GitHub issues) reveals the picture.

GPT-5.3-Codex was the first flagship designed specifically for coding[1]. It understood project structure from the CLI, wrote tests, and performed debugging in agent fashion. GitHub Copilot integration went GA on February 9, opening an era where agentic coding became the default in IDEs. It was also the first model classified as “High” risk in OpenAI’s internal safety evaluation, a testament to how powerful its autonomous code execution abilities had become.

Claude Sonnet 4.6 scored 79.6% on SWE-bench Verified[2]. This was only 0.2 percentage points behind Opus 4.6, while pricing was $3/$15 (input/output per 1M tokens), one-fifth of Opus. For coding tasks prioritizing cost-effectiveness, it was the most reasonable choice as of February.

Gemini 3.1 Pro achieved the numerically highest SWE-bench score at 80.6%[3]. Pricing was also cheaper than Claude Sonnet at $2/$12. However, its 77.1% on ARC-AGI-2 reflected general reasoning ability, distinct from pure coding skills.

From the Chinese camp, Qwen 3.5-397B made its presence felt with 83.6 on LiveCodeBench v6[4]. The 397B MoE architecture activating only 17B kept inference costs low, and being open source (Apache 2.0) enabled self-hosting. GLM-5, at 744B and open source, approached Claude Opus 4.5’s coding benchmarks[5].

| Model | SWE-bench Verified | LiveCodeBench v6 | API Price (input/output, per 1M tokens) |
|---|---|---|---|
| Gemini 3.1 Pro | 80.6% | | $2/$12 |
| Claude Sonnet 4.6 | 79.6% | | $3/$15 |
| GPT-5.3-Codex | Undisclosed | | |
| Qwen 3.5-397B | | 83.6 | $0.11/— (Alibaba) |

Reasoning Ability: Thinking Depth Evolved

The keyword shared across February’s models was “reasoning control.” The fundamental change was not just that models got smarter, but that users gained control over how deeply they reason.

Claude Opus 4.6’s adaptive thinking let the model judge difficulty itself[6]. It gave instant answers to “How’s the weather today?” but engaged long thought chains for complex mathematical proofs. On the METR benchmark, its 50% task-completion time horizon was 14 hours 30 minutes, meaning the model could work autonomously for over half a day. The 1M token context window (beta) and 128K output tokens dramatically expanded the physical limits for long-form work.
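The idea of adaptive thinking can be sketched as a dispatcher that estimates difficulty and allocates a reasoning budget accordingly. The heuristics and budget values below are invented for illustration; they are not Anthropic’s actual mechanism.

```python
# Toy illustration of "adaptive thinking": estimate prompt difficulty,
# then choose a chain-of-thought token budget. The cue list and budget
# numbers are made up for this sketch, not Anthropic's implementation.
def thinking_budget(prompt: str) -> int:
    """Return a reasoning-token budget scaled by rough difficulty cues."""
    cues = ("prove", "derive", "optimize", "step by step", "theorem")
    difficulty = sum(cue in prompt.lower() for cue in cues)
    # 0 cues -> answer directly; 1 cue -> short reasoning; more -> deep reasoning
    return {0: 0, 1: 1024}.get(difficulty, 8192)

print(thinking_budget("How's the weather today?"))         # 0
print(thinking_budget("Prove the theorem step by step."))  # 8192
```

A real implementation would let the model itself estimate difficulty rather than keyword-matching, but the budget-allocation shape is the same.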

Gemini 3.1 Pro took a different approach[3]. Users could directly select reasoning depth across three levels: shallow reasoning for quick responses, deep reasoning when accuracy mattered. VentureBeat called this “Deep Think Mini,” a lightweight version of Google’s dedicated reasoning model Deep Think embedded in a general-purpose model.

Gemini 3 Deep Think itself was also upgraded on February 12[7]. As a reasoning mode specialized for science, research, and engineering, it focused on complex multi-step problem solving. Simultaneously updating the general model (3.1 Pro) and the specialized reasoning model (Deep Think) signaled that Google was applying pressure across the full spectrum of the reasoning domain.

Grok 4.20 Beta showed a completely different philosophy[8]. Instead of deepening a single model’s reasoning, it ran 4 (or 16 in Heavy mode) AI agents simultaneously and had them collaborate. Its Chatbot Arena ELO was estimated at 1505-1535. The experiment tested how far multiple experts in discussion could go in place of one genius.
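The multi-agent pattern above can be sketched in a few lines: run several agents in parallel and aggregate their answers. This is a generic illustration of the idea, not Grok’s implementation; `ask_agent` is a hypothetical stand-in for a model API call, and a real system would add a debate or critique round rather than a bare majority vote.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# Hypothetical stand-in for one agent's answer; a real system would
# call a model API here, varying prompt, persona, or temperature per agent.
def ask_agent(agent_id: int, question: str) -> str:
    return f"answer-{hash((agent_id, question)) % 2}"

def agent_ensemble(question: str, n_agents: int = 4) -> str:
    """Run n_agents in parallel and return the majority answer."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda i: ask_agent(i, question),
                                range(n_agents)))
    # Simple majority vote over the agents' answers.
    return Counter(answers).most_common(1)[0][0]

print(agent_ensemble("Is 97 prime?", n_agents=4))
```

Scaling `n_agents` from 4 to 16 is exactly the knob the Heavy mode described above turns; the cost grows linearly with the number of agents.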

Open Source vs Closed: The Gap Is Disappearing

February’s most significant change was open source models reaching parity with closed models. This wasn’t a simple case of catching up on benchmarks; it was an event that reshaped the ecosystem.

GLM-5 (744B) was released completely open source by Zhipu AI[5]. In coding, it approached Claude Opus 4.5 and surpassed Gemini 3 Pro in some areas. Such open source model performance would have been unimaginable a year ago.

Qwen 3.5-397B pushed efficiency to the extreme[4]. An MoE structure activating only 17B parameters out of 512 experts scored 91.3 on AIME26 (math) while improving decoding throughput 8.6-19x over the previous generation. Apache 2.0 license.
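Why can a 397B-parameter model bill like a 17B one? Because an MoE layer routes each token to only its top-k experts, so compute scales with k while parameter count scales with the total number of experts. A minimal NumPy sketch of top-k routing, with toy sizes (Qwen 3.5 reportedly uses 512 experts; the shapes and gating here are a generic MoE illustration, not its actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 16, 2                    # toy sizes for illustration

router_w = rng.standard_normal((d, n_experts))    # token -> expert scores
experts_w = rng.standard_normal((n_experts, d, d))  # one toy expert per slot

def moe_forward(x):
    """Route one token to its top_k experts; the other experts stay idle.

    Compute cost scales with top_k, parameter count with n_experts --
    which is how a 397B model can activate only ~17B per token.
    """
    scores = x @ router_w                          # (n_experts,)
    top = np.argsort(scores)[-top_k:]              # indices of chosen experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax gate
    return sum(g * (experts_w[i] @ x) for g, i in zip(gates, top))

y = moe_forward(rng.standard_normal(d))
print(y.shape)  # (8,)
```

Production MoE layers add load-balancing losses and batched expert dispatch, but the ratio of active to total parameters is set by exactly this routing step.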

Kimi K2.5 was Moonshot AI’s open-weight model with vision capabilities and an agent swarm, showing GPT-5 and Gemini-level coding performance[9]. Open-weight models beginning to include agent functionality was a noteworthy trend.

These three models’ simultaneous emergence created clear pressure. Closed model companies could no longer justify API pricing with just “our model is better.” Claude Sonnet 4.6 launching at 1/5 of Opus pricing and Gemini 3.1 Pro freezing prices weren’t unrelated to this open source pressure.

The Real Dawn of the Agent Era

Another current running through February’s models was “agents.” Previously, agents were framework territory (LangChain, CrewAI, etc.), but now models themselves began embedding agent capabilities.

  • GPT-5.3-Codex: A full-cycle agent that explores codebases, analyzes issues, and submits PRs[1]
  • Grok 4.20: Up to 16 agents collaborating simultaneously[8]
  • Doubao 2.0: ByteDance officially declared the “agent era,” restructuring its 4-model family around agents[10]. Already deployed in an app used by 200 million people
  • Kimi K2.5: Agent swarm, independently controlling parallel workflows[9]
  • Claude Opus 4.6: METR 50% time horizon of 14 hours 30 minutes, i.e. half a day of autonomous work[6]

Models were transitioning from “answering tools” to “working colleagues.” This change accelerated rapidly in February.

The Unrevealed Card: DeepSeek V4

The most anticipated yet unrevealed model in February’s rush was DeepSeek V4[11]. Known specs: 1 trillion parameters, 1M token context, targeting 80%+ on SWE-bench. DeepSeek had already shocked the industry twice with V3 and R1. If V4 actually achieved these specs, it could be the first case of open source completely catching up to closed models.

Personal Thoughts

Comparing 11 models, what struck me most was that the concept of “best model” itself is becoming meaningless. GPT-5.3-Codex and Gemini 3.1 Pro excel at coding, Claude Opus 4.6 dominates long autonomous work, and Qwen 3.5 is unmatched in cost-effectiveness. The era of one model winning in all domains is over.

The practical conclusion was simple. Multi-model strategies switching models by task are now necessity, not choice. And with open source model performance reaching these levels, pricing premiums for closed models will become increasingly hard to justify.
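In practice, a multi-model strategy can start as something as simple as a routing table keyed by task type. The model ids below are illustrative strings based on this article’s comparison, not identifiers from any real SDK:

```python
# Illustrative task-to-model routing table. The choices mirror the
# article's February 2026 comparison; the id strings are hypothetical.
ROUTES = {
    "coding":      "gemini-3.1-pro",     # highest SWE-bench score at $2/$12
    "long_agent":  "claude-opus-4.6",    # 1M ctx, longest autonomous runs
    "cheap_batch": "qwen-3.5-397b",      # open source, lowest per-token cost
    "default":     "claude-sonnet-4.6",  # near-Opus quality at 1/5 the price
}

def pick_model(task_type: str) -> str:
    """Return the model id for a task, falling back to the default."""
    return ROUTES.get(task_type, ROUTES["default"])

print(pick_model("coding"))       # gemini-3.1-pro
print(pick_model("translation"))  # claude-sonnet-4.6
```

The table is the part that changes monthly; the routing function does not, which is what makes the strategy cheap to maintain.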

February isn’t even over yet. DeepSeek V4 remains.

Footnotes

  1. OpenAI, “Introducing GPT-5.3-Codex,” February 5, 2026.

  2. Anthropic, “Claude Sonnet 4.6,” February 17, 2026. VentureBeat coverage: SWE-bench 79.6%, $3/$15.

  3. Google DeepMind, “Gemini 3.1 Pro,” February 19, 2026. ARC-AGI-2 77.1%, SWE-bench 80.6%.

  4. Alibaba Cloud, “Qwen 3.5,” February 16, 2026. 397B MoE, LiveCodeBench v6 83.6, AIME26 91.3.

  5. Zhipu AI, “GLM-5 Open Source Release,” February 11, 2026. 744B parameters.

  6. Anthropic, “Claude Opus 4.6,” February 5, 2026. Adaptive thinking, METR 50% time horizon 14h30m.

  7. Google DeepMind, “Gemini 3 Deep Think Upgrade,” February 12, 2026.

  8. xAI, “Grok 4.20 Beta,” February 17, 2026. 4-16 agent collaboration, ELO 1505-1535.

  9. Moonshot AI, “Kimi K2.5,” Jan-Feb 2026. Open-weight, vision + agent swarm.

  10. ByteDance, “Doubao 2.0 / Seed 2.0,” February 14, 2026. 4-model family.

  11. DeepSeek, “DeepSeek V4 Preview,” February 2026. 1T parameters, 1M context.
