DeepSeek V4 Launches With 1 Million Token Context on Huawei Chips
Chinese AI startup DeepSeek released a preview of DeepSeek-V4 on April 24, marking the company's most significant model since R1 disrupted global markets 15 months ago. The open-source release introduces 1-million-token context as a standard feature — not a premium tier — and represents the first major model optimized for Huawei's domestic AI chips rather than Nvidia hardware.
The preview is officially live on Hugging Face and through DeepSeek's API, with the company announcing: "Welcome to the era of cost-effective 1M context length." V4 comes in two variants: V4-Pro with 1.6 trillion total parameters (49 billion active) and V4-Flash with 284 billion parameters (13 billion active). Both are available now, though the Pro version costs up to 12x more than Flash due to current compute capacity constraints.
The headline technical upgrade is a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). This architecture dramatically cuts the compute and memory costs of processing long contexts. At 1 million tokens, V4-Pro requires only 27% of the inference FLOPs and 10% of the KV cache that DeepSeek-V3.2 needs. That efficiency is what makes million-token contexts economically viable: if KV cache memory is the limiting factor, a 10x smaller cache means a single GPU can serve roughly 10x as many concurrent long-context sessions.
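The capacity argument is simple arithmetic, and a minimal sketch makes it concrete. The two ratios below are the ones reported for V4-Pro against DeepSeek-V3.2; the memory figures (`BASELINE_CACHE_GB`, `CACHE_BUDGET_GB`) are hypothetical placeholders chosen only for illustration, not measured values for any real GPU or model.

```python
# Ratios reported for V4-Pro relative to DeepSeek-V3.2 at 1M tokens.
FLOPS_RATIO = 0.27     # share of inference FLOPs still required
KV_CACHE_RATIO = 0.10  # share of KV cache memory still required


def sessions_that_fit(cache_budget_gb: float, per_session_gb: float) -> int:
    """Whole number of sessions whose KV cache fits in the memory budget."""
    return int(cache_budget_gb // per_session_gb)


# Hypothetical numbers (assumptions, not from the article).
BASELINE_CACHE_GB = 40.0  # assumed KV cache per 1M-token session on the baseline
CACHE_BUDGET_GB = 410.0   # assumed memory left for KV cache after loading weights

baseline_sessions = sessions_that_fit(CACHE_BUDGET_GB, BASELINE_CACHE_GB)
v4_sessions = sessions_that_fit(CACHE_BUDGET_GB, BASELINE_CACHE_GB * KV_CACHE_RATIO)

# Compute needed per token drops by the inverse of the FLOPs ratio.
compute_reduction = 1 / FLOPS_RATIO

print(f"baseline sessions: {baseline_sessions}")
print(f"V4-Pro sessions:   {v4_sessions}")
print(f"compute reduction: {compute_reduction:.1f}x")
```

The sketch only models the memory ceiling. Real serving capacity also depends on compute, batching, and scheduler behavior, so the 10x figure is best read as an upper bound under a memory-bound assumption.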
According to the official DeepSeek-V4 blog post, V4's KV cache is roughly 2% the size of the cache required by an established architecture such as grouped-query attention with 8 heads. This makes deployment for very large contexts significantly easier. The 61-layer stack alternates between CSA and HCA layers, so the attention pattern differs from layer to layer.
Benchmarks show V4-Pro outperforming every other open-source model on world-knowledge tests and, among all models, trailing only Google's Gemini 3.1 Pro. In agentic coding evaluations, V4-Pro beats Anthropic's Claude Sonnet 4.5 and approaches the non-thinking-mode performance of Claude Opus 4.6. On competition math benchmarks, V4-Pro-Max scores 95.2 on HMMT 2026 February and 89.8 on IMOAnswerBench, within range of OpenAI's GPT-5.4.
The company's tech report acknowledges limitations: V4 "falls marginally short of GPT-5.4 and Gemini 3.1 Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately three to six months." In thinking/reasoning mode specifically, the gap with top closed-source models remains wider.
DeepSeek's pricing undercuts Western competitors by a wide margin. V4-Flash costs $0.28 per million output tokens, while V4-Pro costs $3.48. By comparison, OpenAI's GPT-5.5 costs $30 and Anthropic's Claude Opus 4.6 costs $25 per million output tokens, which makes V4-Pro roughly 7 to 9 times cheaper and V4-Flash roughly 90 to 107 times cheaper. DeepSeek expects Pro pricing to drop sharply once Huawei scales up Ascend 950 production in the second half of this year.
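To see what these list prices mean for a budget, the sketch below compares output-token costs across the four models. The prices are the per-million-output-token figures quoted above; the helper names and the 50-million-token workload are illustrative assumptions, and input-token pricing is not modeled because the article does not give it.

```python
# Output-token prices in USD per million tokens, as quoted in the article.
PRICE_PER_M_OUTPUT = {
    "DeepSeek V4-Flash": 0.28,
    "DeepSeek V4-Pro": 3.48,
    "Anthropic Claude Opus 4.6": 25.00,
    "OpenAI GPT-5.5": 30.00,
}


def output_cost_usd(model: str, output_tokens: int) -> float:
    """Cost in USD of generating `output_tokens` tokens with `model`."""
    return PRICE_PER_M_OUTPUT[model] * output_tokens / 1_000_000


def price_ratio(expensive: str, cheap: str) -> float:
    """How many times more `expensive` costs per output token than `cheap`."""
    return PRICE_PER_M_OUTPUT[expensive] / PRICE_PER_M_OUTPUT[cheap]


# Hypothetical workload: 50 million output tokens per month.
MONTHLY_OUTPUT_TOKENS = 50_000_000

for name in PRICE_PER_M_OUTPUT:
    cost = output_cost_usd(name, MONTHLY_OUTPUT_TOKENS)
    print(f"{name:28s} ${cost:>10,.2f}")

print(f"GPT-5.5 vs V4-Pro:  {price_ratio('OpenAI GPT-5.5', 'DeepSeek V4-Pro'):.1f}x")
print(f"V4-Pro vs V4-Flash: {price_ratio('DeepSeek V4-Pro', 'DeepSeek V4-Flash'):.1f}x")
```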
The most significant shift from DeepSeek's earlier models: V4 is optimized for Huawei's Ascend AI chips rather than Nvidia hardware. Hours after the preview launched, Huawei confirmed that V4 is fully supported on its Ascend 950-based supernode clusters and that its chips were used for part of V4-Flash's training. "Through close technical collaboration, the entire Ascend supernode product line now supports the DeepSeek-V4 series models," Huawei said.
DeepSeek trained its earlier V3 and R1 models on Nvidia H800 GPUs. V4's pivot to domestic chips comes as U.S. export controls continue blocking Chinese developers from purchasing Nvidia's most advanced processors. Independent reporting from Tom's Hardware corroborates the hardware shift and pricing details.
The release landed the same day Reuters reported that the U.S. State Department sent a diplomatic cable instructing embassy staff worldwide to warn foreign governments about alleged IP theft by DeepSeek and other Chinese AI firms. The White House Office of Science and Technology Policy published a memo earlier in the week accusing Chinese entities of running "deliberate, industrial-scale campaigns" to distill American frontier AI systems.
Anthropic has claimed that DeepSeek, Moonshot AI, and MiniMax used 24,000 fraudulent accounts to make 16 million exchanges with its Claude model. China's foreign ministry called the accusations "groundless" and "a smear against the achievements of China's AI industry." DeepSeek has previously said its V3 model relied on naturally occurring data collected through web crawling and didn't intentionally use synthetic data generated by OpenAI.
V4 is designed for agent workflows, not just chatbot interactions. The model works with mainstream development frameworks, including Claude Code, OpenClaw, OpenCode, and CodeBuddy — allowing developers to integrate it into existing coding tools. The company has been using V4 internally as its primary agentic coding model for program development tasks.
The model is optimized for multi-step processes: data collection, organization, and output generation as a complete workflow rather than isolated responses. V4 preserves reasoning content across user message boundaries when the conversation contains tool calls, allowing a coherent, cumulative chain of thought over long-horizon agent tasks.
V4 is still a preview release. DeepSeek hasn't announced when the final version will ship. The company's existing deepseek-chat and deepseek-reasoner API endpoints will be fully retired after July 24, 2026. Developers should migrate to the explicit deepseek-v4-pro and deepseek-v4-flash model IDs before that date.
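For teams planning the migration, a small guard in application config can catch leftover legacy IDs before the cutoff. The sketch below is an illustration under stated assumptions: the model IDs and retirement date come from the paragraph above, but the helper names are hypothetical, and the article does not say which V4 model replaces which legacy endpoint, so the caller chooses the replacement explicitly.

```python
from datetime import date

# Taken from the article.
RETIREMENT_DATE = date(2026, 7, 24)
LEGACY_MODEL_IDS = {"deepseek-chat", "deepseek-reasoner"}
V4_MODEL_IDS = {"deepseek-v4-pro", "deepseek-v4-flash"}


def resolve_model_id(configured: str, replacement: str) -> str:
    """Return `configured`, or `replacement` if `configured` is a legacy ID.

    The legacy-to-V4 mapping is not specified in the article, so the
    caller must pick the replacement deliberately.
    """
    if replacement not in V4_MODEL_IDS:
        raise ValueError(f"unknown V4 model ID: {replacement!r}")
    return replacement if configured in LEGACY_MODEL_IDS else configured


def days_until_retirement(today: date) -> int:
    """Days remaining before the legacy endpoints are retired."""
    return (RETIREMENT_DATE - today).days


# Example: a service still configured with a legacy ID.
print(resolve_model_id("deepseek-chat", "deepseek-v4-flash"))
print(days_until_retirement(date(2026, 7, 1)))
```

Requiring an explicit replacement keeps the cost and capability trade-off between Pro and Flash a conscious decision rather than a silent default.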
DeepSeek is also reportedly raising funds at a valuation exceeding $20 billion, with Alibaba and Tencent in discussions to take stakes, according to The Information. Whether users will actually pay for V4 remains the real question.
Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, cost-efficient inference, no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt