DeepSeek V4 Cuts Memory Use 9.5x, Adds Huawei Ascend Support
The AI model landscape just got a lot more crowded, and significantly cheaper to run. DeepSeek has released two new open-weight models in preview: DeepSeek V4 and DeepSeek V4-Pro, both featuring architectural changes that slash memory requirements by up to 13.7x compared to the previous generation.
The headline number is 9.5x lower memory usage, but the actual range extends to 13.7x depending on configuration. This isn't marketing fluff: it comes from concrete architectural decisions that directly affect how developers deploy these models in production.
Both models are now available for download on Hugging Face and through DeepSeek's API and web service, according to the company's official announcement.
Under the hood, V4 introduces a hybrid attention mechanism combining Compressed Sparse Attention and Heavily Compressed Attention. This combination reduces computation during inference and compresses the key-value (KV) caches used to track the model's state. The result: a one-million-token context window with dramatically reduced memory overhead.
For anyone who's tried to run large context models, the physical reality is brutal. You watch your GPU memory fill up, swap to system RAM, and wait through cold start penalties that make the experience feel like watching paint dry. DeepSeek's compressed KV caches directly address this friction point.
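To get a feel for why KV cache compression matters at a one-million-token context, here is a rough back-of-the-envelope sketch. The layer count, head count, head size, and cache precision are hypothetical placeholders, not DeepSeek V4's actual configuration, and the sketch assumes the reported 9.5x and 13.7x reductions apply to the KV cache.

```python
GIB = 1024 ** 3

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Size of an uncompressed KV cache for one sequence.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per key-value head.
    """
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value

# Hypothetical dense-attention baseline; these are NOT DeepSeek V4's real dimensions.
baseline_bytes = kv_cache_bytes(
    tokens=1_000_000,   # the one-million-token window cited in the article
    layers=60,          # assumed
    kv_heads=8,         # assumed
    head_dim=128,       # assumed
    bytes_per_value=2,  # assumed 16-bit cache entries
)

print(f"baseline: {baseline_bytes / GIB:.1f} GiB")
for reduction in (9.5, 13.7):  # reduction factors reported in the article
    print(f"{reduction}x smaller: {baseline_bytes / reduction / GIB:.1f} GiB")
```

Under these assumed dimensions, a single long sequence would not fit on one accelerator uncompressed, while the compressed cache would. The absolute numbers depend entirely on the assumed shape.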
The precision strategy is equally aggressive. Both V4 models utilize a mix of FP8 and FP4 precision, with quantization-aware training applied to the mixture-of-experts weights. Using FP4 roughly halves the memory needed to store model weights compared to FP8. The trade-off is precision loss, but for many use cases, the memory savings justify the compromise.
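The FP8 versus FP4 trade-off is simple arithmetic: 8 bits per weight against 4. The sketch below applies it to the parameter counts reported in this article. The share of parameters stored in FP4 is a made-up illustrative value, since the article only says FP4 is applied to the mixture-of-experts weights, and the calculation ignores quantization scale factors and other overhead.

```python
def weight_gb(params, bits_per_param):
    """Storage for model weights in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9

def mixed_weight_gb(params, fp4_fraction):
    """Storage when a fraction of parameters is FP4 and the rest is FP8."""
    fp4_params = params * fp4_fraction
    fp8_params = params - fp4_params
    return weight_gb(fp4_params, 4) + weight_gb(fp8_params, 8)

# Total parameter counts as reported in the article.
MODEL_PARAMS = {"V4": 284e9, "V4-Pro": 1.6e12}

# Hypothetical share of parameters held in FP4; not a published figure.
ASSUMED_FP4_FRACTION = 0.9

for name, params in MODEL_PARAMS.items():
    print(
        f"{name}: all-FP8 {weight_gb(params, 8):.0f} GB, "
        f"all-FP4 {weight_gb(params, 4):.0f} GB, "
        f"mixed {mixed_weight_gb(params, ASSUMED_FP4_FRACTION):.0f} GB"
    )
```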
DeepSeek V4 is a 284 billion parameter mixture-of-experts model with 13 billion active parameters. V4-Pro scales to 1.6 trillion parameters with 49 billion active parameters, trained on 33 trillion tokens. The company claims it outperforms all open-weight large language models and rivals leading proprietary Western models across its benchmark suite.
These claims are self-reported, which means they should be evaluated against independent testing before being treated as fact. (We've seen enough benchmark gaming to know better.)
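The parameter counts quoted above also show how sparse these mixture-of-experts models are. As a rough sketch, the snippet below computes the share of weights active per token and applies the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. That rule is an approximation that ignores attention cost, which grows with context length.

```python
# Total and active parameter counts as reported in the article.
MODELS = {
    "V4": {"total": 284e9, "active": 13e9},
    "V4-Pro": {"total": 1.6e12, "active": 49e9},
}

def active_fraction(total, active):
    """Share of the model's parameters used for any single token."""
    return active / total

def forward_flops_per_token(active):
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter per token."""
    return 2 * active

for name, p in MODELS.items():
    frac = active_fraction(p["total"], p["active"])
    gflops = forward_flops_per_token(p["active"]) / 1e9
    print(f"{name}: {frac:.1%} of weights active, ~{gflops:.0f} GFLOPs per token")
```

The practical consequence is that memory is sized by the total parameter count, while per-token compute is sized by the much smaller active count.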
The hardware story is where things get geopolitically interesting. DeepSeek V4 has been confirmed to operate on both Nvidia GPUs and Huawei Ascend NPU platforms. The technical paper mentions validation of the model's expert parallel scheme across these hardware types.
It is not clear whether Huawei accelerators were used during training or solely for inference. The Register notes that the paper only mentions the chips in passing, stating the company validated its "fine-grained EP scheme on both Nvidia GPUs and Ascend NPU platforms."
This distinction matters. Training frontier models on domestic Chinese hardware would represent a significant milestone for China's AI infrastructure independence. Inference support, while still notable, is a lower barrier to entry for chipmakers.
At one point, DeepSeek reportedly tried to train models on Huawei's chips. That effort was derailed by dodgy chips, glacial interconnects, and an immature software stack that ultimately drove DeepSeek back into Nvidia's embrace. Whether V4 represents a return to that strategy remains unclear.
The pricing structure is aggressive. DeepSeek V4 costs $0.14 per million input tokens and $0.28 per million output tokens for uncached requests. V4-Pro is priced at $1.74 per million input tokens and $3.48 per million output tokens.
For comparison, OpenAI's GPT-5.5 is priced at $5 per million input tokens and $30 per million output tokens. That's a massive difference for high-volume deployments, though the performance gap between models remains to be seen in real-world applications.
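To see what these list prices mean for a real bill, here is a small cost calculator using the per-million-token prices quoted above. The monthly workload is a hypothetical example, and the sketch ignores cache discounts, batching discounts, and any price changes after publication.

```python
# (input, output) prices in USD per million tokens, as quoted in the article.
PRICES = {
    "DeepSeek V4": (0.14, 0.28),
    "DeepSeek V4-Pro": (1.74, 3.48),
    "GPT-5.5": (5.00, 30.00),
}

def cost_usd(model, input_tokens, output_tokens):
    """Uncached cost of a workload at list price."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical monthly workload: 500M input tokens, 100M output tokens.
INPUT_TOKENS = 500_000_000
OUTPUT_TOKENS = 100_000_000

for model in PRICES:
    print(f"{model}: ${cost_usd(model, INPUT_TOKENS, OUTPUT_TOKENS):,.2f}")
```

Because output tokens carry most of the price gap, workloads that generate a lot of text see a larger difference than workloads dominated by long prompts.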
Both models, including base and instruction-tuned versions, are now available in preview through the DeepSeek API and Hugging Face. The weights are released under the MIT license, which permits commercial use, modification, and redistribution, so developers can deploy V4 on their own or cloud infrastructure without worrying about proprietary restrictions.
On the training side, V4 uses the Muon optimizer, which aims to accelerate convergence and enhance training stability. Muon is not a DeepSeek invention: it was introduced by independent researchers in 2024 and has since been adopted by other labs. Its use in V4 suggests DeepSeek is willing to build its training stack around newer techniques rather than relying entirely on long-established defaults.
For developers self-hosting V4, the question of which hardware DeepSeek trained on is largely irrelevant. The open weights are released in standard formats that can be served on Nvidia GPUs via vLLM, SGLang, or other inference frameworks, provided the deployment has enough memory for a 284-billion-parameter model. You don't need Huawei hardware to run V4.
The strategic implications are significant regardless. If Chinese labs can train frontier models on domestic chips, the leverage of US export controls diminishes. This doesn't mean the controls are irrelevant, since Nvidia hardware likely still offers efficiency advantages, but the assumption that China can't build frontier AI without American chips would no longer hold.
Whether users actually pay for V4 remains the real question. The efficiency gains are impressive on paper, but real-world performance across diverse workloads will determine if V4 becomes a mainstream choice or remains a niche option for cost-conscious deployments.
Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, cost-efficient inference, no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt