DeepSeek Didn't Show Up—GLM-5 and Qwen3.5 Did, and They Came to Win
Chinese AI labs stay within three months of top U.S. models as GLM-5 targets Claude Opus 4.5 and Qwen 3.5 challenges Gemini 3.0
Usually, the weeks between the New Year and Chinese New Year follow a familiar rhythm for Chinese companies—recap the past year, let employees rest, and quietly plan for the year ahead. The AI race doesn’t allow for that luxury anymore.
In the weeks leading up to Chinese New Year, which began on February 17th, nearly every frontier Chinese AI lab—except DeepSeek—rushed out their latest flagship models. The timing wasn’t coincidental. With DeepSeek-V4 rumored to drop before the holiday, no lab could afford to sit still and wait to be made irrelevant. These models have a shelf life of months at best, and every lab knows that being first to claim a benchmark, a capability, or a narrative matters enormously for how developers and enterprises perceive you on the world stage.
And it wasn’t just Chinese AI labs. Anthropic, OpenAI, and Google each dropped major releases of their own—Claude Sonnet 4.6, GPT-5.3 Codex, and Gemini 3.1 Pro—in the same compressed window. If you zoom out and look at February 2026 as a whole, it may well go down as the most consequential single month in the history of AI development. I genuinely hope this pace of progress continues, as dizzying as it is to keep up with.
I’ll do my best to walk you through all of these releases and what they mean—but let’s start with the four Chinese open-weight models that I think deserve the most attention right now.
Qwen3.5-Plus: The Most Ambitious One
Qwen3.5-Plus isn’t the strongest open model right now, but it might be the most ambitious, and I mean that as a genuine compliment.
Native multimodality, finally. Alibaba’s previous Qwen series separated vision and language into dedicated model lines, Qwen and Qwen-VL. Qwen3.5-Plus breaks from that tradition with a native multimodal foundation model trained from the ground up on a large corpus of mixed visual and text tokens, enriched with Chinese, English, multilingual, STEM, and reasoning data. This isn’t a text model with vision bolted on, and the difference shows in tasks like visual reasoning, visual coding, and visual agentic workflows.
Extreme sparsity. The model has 397 billion total parameters, but only 17 billion are activated per forward pass. It’s built on the Qwen3-Next architecture, which introduces higher-sparsity MoE, a Gated DeltaNet + Gated Attention hybrid attention mechanism, stability optimizations, and multi-token prediction.
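To make that sparsity concrete, here’s a minimal sketch of top-k expert routing, the basic mechanism underneath any MoE layer. The dimensions, expert count, and k are illustrative placeholders, not Qwen3.5-Plus’s actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Generic top-k mixture-of-experts layer. All sizes here are
    illustrative, not Qwen3.5-Plus's real config."""
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        # Each token routes to its k highest-scoring experts; all other
        # experts stay idle for that token. This is why active parameters
        # are a small fraction of total parameters.
        weights = F.softmax(self.router(x), dim=-1)         # (tokens, n_experts)
        topk_w, topk_idx = weights.topk(self.k, dim=-1)     # (tokens, k)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in topk_idx[:, slot].unique():
                sel = topk_idx[:, slot] == e                # tokens using expert e
                out[sel] += topk_w[sel, slot].unsqueeze(-1) * self.experts[int(e)](x[sel])
        return out
```

At 397 billion total and 17 billion active parameters, only about 4% of the network participates in any given forward pass, which is what makes the pricing story below possible.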
A quick note on what those attention mechanisms actually mean: Gated DeltaNet is a form of linear recurrent attention that uses gating to selectively update a memory matrix, letting the model efficiently track information across long sequences without the quadratic cost of standard attention. Gated Attention layers are interspersed with these linear layers to preserve the model’s ability to do precise, content-based retrieval when it needs to. Together, the hybrid avoids the bottlenecks of both pure transformers and pure linear models.
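If you want the shape of that idea in code, here’s a heavily simplified single-step sketch of a gated delta-rule update (the general family Gated DeltaNet belongs to, not Alibaba’s exact formulation):

```python
import torch

def gated_delta_step(state, k, v, alpha, beta):
    """One recurrence step of a gated delta-rule linear attention layer.
    state: (d_k, d_v) memory matrix; k: (d_k,); v: (d_v,);
    alpha in (0, 1) is a forget gate, beta a write-strength gate.
    A generic textbook form, not Alibaba's exact update rule.
    """
    pred = state.T @ k  # what the memory currently recalls for key k
    # Decay old memory, then write a correction toward v (the "delta rule").
    state = alpha * state + beta * torch.outer(k, v - pred)
    return state

def read(state, q):
    # Per-token read cost is O(d_k * d_v): constant in sequence length,
    # unlike softmax attention, whose per-token cost grows with context.
    return state.T @ q
```

The interleaved full-attention layers exist because this fixed-size memory matrix is lossy; exact retrieval of one specific token from hundreds of thousands of positions back is precisely where linear recurrences struggle and softmax attention shines.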
The throughput gains from this architecture are striking. Under a 32K context, Qwen3.5-Plus runs at 8.6× the decoding speed of Qwen3-Max; under 256K context, that jumps to 19×. It supports up to a 1 million token context window, built-in tools, and adaptive tool use. It currently ranks third on the Artificial Analysis Intelligence Index, behind GLM-5 and K2.5.
On pricing, Qwen3.5-Plus charges ¥0.8 per million input tokens (~$0.12) and ¥4.8 per million output tokens on Alibaba Cloud in mainland China, which Alibaba positions as roughly 1/18th the cost of Gemini 3 Pro at comparable performance. On OpenRouter, it’s $0.40 input and $2.40 output.
GLM-5: The Best Open-Weight Model
After GLM-4.7 signaled that Z.ai was back in the conversation among top-tier Chinese AI labs, GLM-5 makes the argument definitively. It currently sits at the top of the Artificial Analysis Intelligence Index, the best open-weight model you can run today.
The model’s focus is concrete: AI agents that can autonomously plan, implement, debug, and iterate on real-world software tasks. And I want to call out something I genuinely appreciated: Z.ai clearly learned from DeepSeek’s approach of writing accessible, readable research papers. GLM-5’s technical report is one of the cleaner reads in this space.
Architecture. The model has 744 billion total parameters with 40 billion activated per forward pass. Pre-training data expanded from 23 trillion to 28.5 trillion tokens, with a 200K context window. Z.ai adopted several of DeepSeek’s architectural choices, including Multi-head Latent Attention (MLA), multi-token prediction, and most notably DeepSeek Sparse Attention (DSA). DSA replaces the standard dense O(L²) attention with dynamic token selection, reducing long-context compute cost by 1.5–2× without sacrificing quality.
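The idea behind DSA is easy to sketch, even though the real implementation uses a trained indexer and custom kernels. In this toy version (mine, not DeepSeek’s code), a cheap scoring head ranks all previous tokens for each query, and full attention runs only over the top-k winners:

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, q_idx, k_idx, top_k=512):
    """Toy sparse attention via dynamic top-k token selection, in the
    spirit of DSA. q, k, v: (L, d) full projections; q_idx, k_idx:
    (L, d_idx) cheap low-dimensional "indexer" projections used only
    to decide which tokens each query attends to.
    """
    L, d = q.shape
    scores = q_idx @ k_idx.T                      # (L, L), but cheap: d_idx << d
    causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    kk = min(top_k, L)
    sel = scores.topk(kk, dim=-1).indices         # (L, kk) kept tokens per query
    # Full-precision attention over only the selected tokens: O(L * kk)
    # instead of O(L^2) once L >> top_k.
    qk = torch.einsum("ld,lkd->lk", q, k[sel]) / d ** 0.5
    qk = qk.masked_fill(scores.gather(-1, sel) == float("-inf"), float("-inf"))
    attn = F.softmax(qk, dim=-1)
    return torch.einsum("lk,lkd->ld", attn, v[sel])
```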
Post-training is where Z.ai really distinguished itself. Not all agent tasks are created equal: some finish in a few steps, while others require long, sprawling rollouts. Traditional synchronous RL infrastructure couples generation with training, which means fast samples sit idle waiting for slow ones: GPUs go idle, throughput drops, and training bottlenecks on the slowest rollout.
Z.ai’s solution is an asynchronous RL infrastructure that fully decouples generation from training. Fast tasks update model parameters immediately; slow tasks keep running until completion. But this introduces a new challenge: rollouts generated by older model versions become off-policy data. To handle this, they developed a suite of asynchronous RL algorithms—Token-in-Token-Out (TITO), double-sided importance sampling, and off-policy filtering—that together stabilize training on this kind of stale data.
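The decoupling pattern itself is simple to picture. Here’s a toy sketch of the plumbing (mine, not Z.ai’s infrastructure): rollout workers of wildly different lengths feed a queue, the trainer consumes whatever is ready, and version tags make off-policy filtering possible:

```python
import queue
import random
import threading
import time

rollout_q = queue.Queue()
policy_version = 0
MAX_STALENESS = 4  # drop rollouts more than 4 versions behind (off-policy filtering)

def rollout_worker(worker_id):
    # Generation never waits for training: each worker runs its episode to
    # completion at its own pace and tags the result with the policy
    # version it was sampled from.
    while True:
        version = policy_version
        time.sleep(random.uniform(0.1, 2.0))  # stands in for a 5-step or 500-step task
        rollout_q.put({"worker": worker_id, "version": version})

def trainer():
    global policy_version
    while True:
        rollout = rollout_q.get()
        if policy_version - rollout["version"] > MAX_STALENESS:
            continue  # too stale to trust; discard
        # A real trainer would apply an importance-sampling correction here,
        # scaled by how off-policy the rollout is, then take a gradient step.
        policy_version += 1

for i in range(8):
    threading.Thread(target=rollout_worker, args=(i,), daemon=True).start()
threading.Thread(target=trainer, daemon=True).start()
time.sleep(5)
print(f"reached policy v{policy_version} without ever blocking on a slow rollout")
```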
In its technical paper, Z.ai researchers highlight that GLM-5 has been optimized for major domestic Chinese AI chips such as Huawei's Ascend, Cambricon, and Baidu's Kunlunxin.
K2.5: Visual Coding and Agentic Search
Like Qwen3.5-Plus, K2.5 is Moonshot AI’s first native multimodal foundation model. It was trained on approximately 15 trillion mixed visual and text tokens—vision and language learned jointly from day one. The result is notably stronger cross-modal reasoning across images, videos, and text.
K2.5 builds on the Kimi K2 foundation model: a trillion-parameter MoE with 1.04T total parameters and ~32B activated per token. Where K2.5 stands out is in two specific areas: front-end visual coding (it’s particularly strong at creative website design, and can generate functional code directly from video walkthroughs) and agentic search, where it reportedly outperforms even proprietary models—a genuine achievement in deep research workflows.
The most novel contribution in the paper is Agent Swarm, which introduces a learned framework for parallel multi-agent execution. Rather than a single agent working through tasks sequentially, K2.5’s PARL (Parallel-Agent Reinforcement Learning) paradigm works like this:
A trainable orchestrator learns when and how to spawn sub-agents. The sub-agents themselves are frozen during training—deliberately not co-trained—to avoid unstable credit assignment across a dynamic swarm. The orchestrator is then trained via reinforcement learning to schedule parallel work and balance efficiency against complexity.
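A toy version of that control flow might look like the sketch below. All names and the hard-coded decomposition are mine for illustration; the learned parts are stubbed out:

```python
import concurrent.futures

def frozen_subagent(subtask: str) -> str:
    # Stands in for a frozen sub-agent policy: it gets called during
    # training but its weights are never updated.
    return f"result({subtask})"

class Orchestrator:
    """The only trainable component. Its policy decides how many
    sub-agents to spawn and what each works on; RL rewards it for
    trading off latency (more parallelism) against coordination cost."""

    def plan(self, task: str) -> list[str]:
        # A learned model would emit this decomposition; hard-coded here.
        return [f"{task}/part-{i}" for i in range(3)]

    def run(self, task: str) -> list[str]:
        subtasks = self.plan(task)
        # Sub-agents execute concurrently rather than one after another;
        # this is where the up-to-4.5x latency reduction comes from.
        with concurrent.futures.ThreadPoolExecutor() as pool:
            return list(pool.map(frozen_subagent, subtasks))

print(Orchestrator().run("survey-recent-open-weight-models"))
```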
The results are compelling: inference latency reduced by up to 4.5× compared to sequential agents, with meaningful gains in task completion quality and F1 scores. The agent swarm isn’t just a throughput trick—it actually improves how well tasks get decomposed and executed.
M2.5: The Office Specialist
M2.5 is a more incremental release—an upgrade over M2.1, likely timed to respond to the momentum of other models (and possibly DeepSeek V4 specifically). The architecture is presumably close to M2.1’s, so I won’t spend too much time on the model itself.
What MiniMax focused on this cycle is coding and office productivity. M2.5 achieved then-SOTA results on multiple coding benchmarks and, interestingly, demonstrated the ability to generate professional deliverables in Word, Excel, and PowerPoint—not just raw text. It wins approximately 59% of head-to-head comparisons against peer models on productivity benchmarks.
The more interesting story is MiniMax’s post-training infrastructure, Forge—their in-house agent-native RL framework that, like Z.ai’s asynchronous system, decouples training from inference. Forge trains across 200,000+ real-world environments spanning coding, search, tool interaction, and workspace automation. The key stabilization algorithm they use is CISPO, which helps maintain training stability and improves long-horizon performance on complex agentic tasks.
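For the curious, here’s my reading of the CISPO objective, sketched with illustrative hyperparameters; treat it as an approximation of the published idea rather than MiniMax’s actual code. The key move, as I understand it, is clipping and detaching the importance weight itself, instead of clipping the PPO surrogate and zeroing out gradients for clipped tokens:

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_low=1.0, eps_high=0.2):
    """Sketch of a CISPO-style policy loss (my approximation, not
    MiniMax's code; hyperparameters are illustrative). PPO's ratio clip
    kills the gradient of any clipped token; here the importance-sampling
    weight is clipped and detached, so every token, including rare but
    pivotal ones deep in a long agent rollout, still contributes a gradient.
    """
    ratio = torch.exp(logp_new - logp_old)            # IS weight r_t
    clipped = ratio.clamp(1 - eps_low, 1 + eps_high).detach()
    return -(clipped * advantages * logp_new).mean()  # REINFORCE-style term
```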
One number from MiniMax’s deployment data is worth lingering on: 30% of all tasks across the company’s daily operations—R&D, product, sales, HR, finance—are now autonomously completed by M2.5. And M2.5-generated code accounts for 80% of newly committed code. That’s not a benchmark. That’s a live claim about internal operations, and it’s either the most confident thing said in this release cycle or the most audacious.
Three Patterns Worth Taking Seriously
Looking across all four models, three themes emerge that I think have real implications for the broader industry.
1. Sparsity is the new default
MoE has been the dominant architecture since DeepSeek-V3 and R1, and these new models push sparsity further than any previous generation without giving up performance. But it’s not just network sparsity anymore. Frontier open-weight models are now also adopting sparse attention mechanisms—GLM-5 with DeepSeek’s DSA, Qwen3.5-Plus with its hybrid Gated DeltaNet and Gated Attention layers.
The practical consequence is cheaper inference, and that’s where Chinese open-weight models are carving out a real competitive identity. Alibaba claims its model costs roughly 1/18th as much as Gemini 3 Pro, and MiniMax’s “intelligence too cheap to meter” narrative has also captured broad attention. For developers who are cost-sensitive, or for agentic workloads where token consumption can be 10–100× higher than standard queries, this pricing gap is not marginal. It’s structural. And in China, where AI commercialization paths remain unclear and most consumer chatbots are still free, cheaper models can save companies not millions but billions in compute costs.
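Some back-of-envelope math makes “structural” concrete. The Qwen rates below are the OpenRouter prices quoted earlier; the proprietary rates and traffic volumes are hypothetical placeholders, not any vendor’s actual pricing:

```python
# Qwen3.5-Plus OpenRouter rates quoted above; proprietary rates are
# hypothetical stand-ins for a frontier closed model.
QWEN_IN, QWEN_OUT = 0.40, 2.40   # $ per 1M tokens
PROP_IN, PROP_OUT = 3.00, 15.00  # hypothetical

def monthly_cost(requests, in_tok, out_tok, p_in, p_out):
    return requests * (in_tok * p_in + out_tok * p_out) / 1e6

# Same request volume; the agentic workload burns ~50x the tokens per request.
workloads = {
    "chatbot": dict(requests=1_000_000, in_tok=500, out_tok=700),
    "agentic": dict(requests=1_000_000, in_tok=30_000, out_tok=30_000),
}
for name, load in workloads.items():
    q = monthly_cost(**load, p_in=QWEN_IN, p_out=QWEN_OUT)
    p = monthly_cost(**load, p_in=PROP_IN, p_out=PROP_OUT)
    print(f"{name}: ${q:,.0f} vs ${p:,.0f} per month ({p / q:.1f}x)")
```

At chatbot volumes the absolute gap is modest; at agentic token volumes the same ratio translates into hundreds of thousands of dollars a month, which is the sense in which the gap is structural rather than marginal.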
2. Native multimodality has arrived in open weights
A year ago, the leading open-weight models were overwhelmingly text-only, including Qwen3 and Kimi K2. Many researchers believed native multimodality wasn’t necessary for core intelligence. Then Gemini 3.0 happened, and its staggering benchmark performance shifted the conversation.
Chinese AI labs responded fast. Qwen3.5-Plus and K2.5 are both the first native multimodal flagship models from their respective labs. Add ERNIE 5.0 from Baidu and Seed2.0 from ByteDance, and you have a clear generational inflection: open-weight LLMs are no longer text-only. Training vision and language jointly from scratch—rather than layering vision on top—produces qualitatively better cross-modal reasoning. Both Qwen3.5-Plus and K2.5 can build functional websites directly from video input. That’s a capability class that didn’t exist in this model tier twelve months ago.
3. Agentic capabilities in the real world
Previous-generation Chinese open-weight models added tool use and basic agent scaffolding. This generation is different—these models are credibly competing with top proprietary systems on complex, real-world workflows.
Almost every lab in this release cycle rebuilt their post-training RL recipe specifically for agentic tasks. Z.ai and MiniMax both developed asynchronous RL systems that decouple inference from training. K2.5 introduced Agent Swarm with a learned orchestrator. The common thread: standard synchronous RL wasn’t designed for the variance in rollout length that agentic tasks produce, and each lab had to engineer around that constraint independently.
The business implications are significant. Agentic models that actually work—that can replace portions of employee workflows—consume tokens at a fundamentally different scale than chatbots. Z.ai seems to understand this: days after releasing GLM-5, they raised the price of their GLM Coding Plan subscription by 30%, announced expansions to their chip partner network, and watched their stock surge over 40% in a single day. Their market cap is now approaching Baidu’s, a company that reported over $18 billion in revenue in 2024.
Three years ago, Chinese AI labs open-sourced their models largely to raise awareness and goodwill within the developer community. Contributing to the open-source ecosystem was the goal; competing internationally was not really on the table. That framing has changed meaningfully heading into 2026.
Look at how these labs are presenting themselves now. The release blogs, the benchmark visuals, the example demos, the design polish—all of it is clearly aimed beyond domestic audiences. These labs are active on X, writing in fluent English, and positioning their models explicitly against GPT, Gemini, and Claude.
The early commercial signals are modest but real. Moonshot AI, which is reportedly seeking a $10 billion valuation in an ongoing funding round, disclosed that its overseas API revenue has quadrupled since November 2025. These models are finding their first footholds through open-weight platforms like OpenRouter, a natural entry point for cost-sensitive developers, and the expectation is that enterprise and consumer markets follow.
What makes this moment historically significant is that, for the first time, China’s leading technology companies and AI startups are competing head-to-head with America’s best in AI, perhaps the most consequential technology of our era. The race is still early, the outcome is genuinely uncertain, and that is precisely what makes February 2026 feel like a turning point worth paying attention to.