Discussion about this post

JP:

The progression from 64 to 160 to 256 experts is interesting to track. DeepSeek keeps scaling the expert count, while NVIDIA took a different route with Nemotron 3 Super: 512 experts but only 22 active per token, with tokens compressed into a lower-dimensional latent space before routing. They call it LatentMoE. It gets you 4x more experts at the same compute cost.
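The "compress, route, project back up" idea is easy to picture in code. Here's a minimal sketch of the mechanism (mine, not NVIDIA's implementation; the dimensions, names, and expert MLP shape are all placeholders):

```python
# Sketch of latent-space MoE routing: project tokens down before the router
# and experts, so expert FLOPs scale with the latent width, not model width.
# All sizes here are hypothetical, not Nemotron's actual config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoESketch(nn.Module):
    def __init__(self, d_model=1024, d_latent=256, n_experts=512, top_k=22):
        super().__init__()
        self.top_k = top_k
        self.down = nn.Linear(d_model, d_latent)    # compress before routing
        self.router = nn.Linear(d_latent, n_experts)
        # experts operate in the cheaper latent space
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_latent, 4 * d_latent),
                          nn.GELU(),
                          nn.Linear(4 * d_latent, d_latent))
            for _ in range(n_experts)
        )
        self.up = nn.Linear(d_latent, d_model)      # project back up

    def forward(self, x):                           # x: (tokens, d_model)
        z = self.down(x)
        weights, idx = F.softmax(self.router(z), dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize top-k
        out = torch.zeros_like(z)
        for t in range(z.size(0)):                  # naive loop; real kernels batch this
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](z[t])
        return self.up(out)

if __name__ == "__main__":
    moe = LatentMoESketch()
    print(moe(torch.randn(4, 1024)).shape)          # torch.Size([4, 1024])
```

That's where the "4x more experts at the same compute" framing comes from: if the latent width is a quarter of the model width, each expert is roughly 4x cheaper, so you can afford 4x as many for the same per-token budget.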

Both are betting on sparsity, but the mechanisms diverge sharply. DeepSeek's MLA and native sparse attention attack the KV-cache bottleneck; Nemotron stacks Mamba-2 layers (linear cost in sequence length) with sparse attention layers and routes through LatentMoE. I covered the Nemotron architecture in detail here: https://reading.sh/inside-the-model-merging-three-ai-architectures-into-one-c5dcc7302528. The 10:1 total-to-active parameter ratio is the part that keeps surprising people.
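The ratio is just arithmetic once you split shared parameters from expert parameters. Quick back-of-envelope (the per-expert and shared sizes below are numbers I made up so the math lands near 10:1, not Nemotron's actual figures):

```python
# Back-of-envelope total-to-active ratio for a 512-expert, 22-active MoE.
# expert_params and shared_params are illustrative guesses, not real config.
n_experts, active_experts = 512, 22
expert_params = 50e6      # hypothetical params per expert
shared_params = 1.6e9     # hypothetical attention/Mamba/embedding params

total = shared_params + n_experts * expert_params
active = shared_params + active_experts * expert_params
print(f"total={total/1e9:.1f}B  active={active/1e9:.1f}B  ratio={total/active:.1f}:1")
# total=27.2B  active=2.7B  ratio=10.1:1
```

The shared (always-active) parameters are what cap the ratio: the 512/22 expert split alone would give ~23:1, and the dense layers drag it back down toward 10:1.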

Curious whether V4 will stick with pure transformer + MoE or follow the hybrid trend. Qwen 3.5 went GDN, Nemotron went Mamba, and IBM Granite went hybrid too. Feels like pure transformers are losing ground.
