AI Agents AI Gadgets & HW AI Models - LLM AI Open Source AI Security AI for Coding AI for Gaming AI for Images AI for Music AI for Videos Artificial Intelligence Editor's Choice NVIDIA AI Other News Robotics Tech Face-off Tech Satire

DeepSeek-V4 Launches with Million-Token Context and Agent Optimizations

By Artūras Malašauskas Apr 24, 2026 4 min read Share:
DeepSeek has released V4-Pro and V4-Flash models with 1M token context windows, claiming 90% KV cache reduction and improved agentic performance over V3.2.

DeepSeek has officially released the DeepSeek-V4 series, introducing two Mixture-of-Experts models designed specifically for long-running agentic workloads. The announcement, published on April 24, 2026, marks a shift from simply expanding context capacity to making million-token inference economically viable.

The core problem with million-token context windows has never been raw capacity. It's the cost of every forward pass at that depth. When an agent runs a long tool-use trajectory—think a SWE-bench task, a multi-step browse session, or a terminal session with hundreds of commands—every tool result appends to the context. Each subsequent token pays the full attention cost against everything that came before. The KV cache fills the GPU. The trace blows past the context budget. The model stops. You reprompt. The cycle repeats.

DeepSeek built V4 to fix these known failures. The series includes two variants: DeepSeek-V4-Pro with 1.6 trillion total parameters (49 billion activated) and DeepSeek-V4-Flash with 284 billion total parameters (13 billion activated). Both support a 1 million token context window and are fully open-source with API access available immediately.

The efficiency gains are specific and measurable. At 1M tokens, DeepSeek-V4-Pro requires 27% of single-token inference FLOPs compared with DeepSeek-V3.2. It uses 10% of the KV cache memory. V4-Flash drops these numbers further: 10% of the FLOPs and 7% of the KV cache. Against a standard grouped query attention architecture with 8 heads stored in bfloat16 format, V4 requires roughly 2% the cache size. This matters when you're actually deploying models that need to remember entire codebases or multi-hour terminal sessions.

The architecture achieves this through a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses KV entries by 4x along the sequence dimension using softmax-gated pooling with a learned positional bias. A lightning indexer—FP4, ReLU-scored multi-head dot product—picks the top-k compressed blocks per query. It inherits the sparse-selection idea from DeepSeek Sparse Attention in V3.2, but runs it over blocks that are already 4x shorter than the original sequence. The indexer's search space shrinks with it.

Documentation from the company reveals additional architectural changes. Manifold-Constrained Hyper-Connections (mHC) strengthen conventional residual connections, enhancing stability of signal propagation across layers while preserving model expressivity. The Muon optimizer replaces standard AdamW for faster convergence and greater training stability. Hash-based routing in early layers of the MoE system replaces the initial dense Feed-Forward Network layers from V3.

Both models were pre-trained on more than 32 trillion diverse and high-quality tokens, followed by a comprehensive post-training pipeline. The post-training features a two-stage paradigm: independent cultivation of domain-specific experts through SFT and RL with GRPO, followed by unified model consolidation via on-policy distillation. This integrates distinct proficiencies across diverse domains into a single model.

Performance benchmarks show V4-Pro leads among all open-source models in agentic coding benchmarks. The company deployed V4-Pro internally as its coding agent of choice. Employee feedback indicates the experience surpasses Claude Sonnet 4.5, with output quality approaching Claude Opus 4.6 in non-thinking mode, though still trailing Opus 4.6's thinking mode. On world knowledge benchmarks, V4-Pro significantly outperforms other open-source models, falling only slightly short of Gemini Pro 3.1.

Independent reporting from Yahoo News corroborates the launch details and notes analysts at Jefferies see the release as part of a broader wave of rapid AI model releases, with more than 10 new model announcements in April alone across the industry. They highlighted DeepSeek's reported improvements in agentic capabilities, reasoning, and long-context performance across coding, tool use, and knowledge benchmarks.

The official model page on HuggingFace confirms the technical specifications and provides download links for both variants. Starting today, you can chat with DeepSeek-V4 at chat.deepseek.com or via the official app. The API is live—simply set model_name to deepseek-v4-pro or deepseek-v4-flash to get started.

Specialized optimization for agentic use cases includes fine-tuning for popular agent products including Claude Code, OpenClaw, OpenCode, and CodeBuddy. Performance improvements have been observed across code generation, document creation, and other agent-driven tasks. This kind of framework-specific tuning matters more in practice than it might sound. A model that performs well in isolation but behaves inconsistently inside a structured agent loop is useless for actual deployment.

DeepSeek-V4-Flash delivers comparable reasoning performance to the Pro version when given a larger thinking budget, though its smaller parameter scale naturally places it slightly behind on pure knowledge tasks and the most complex agentic workflows. The tradeoff is clear: faster response times and more economical API pricing versus maximum capability.

Going forward, 1M token context is the standard for all official DeepSeek services. The company has positioned this as the baseline, not the ceiling. Whether developers actually need a million tokens for their use cases remains an open question. Most real-world applications probably don't require that much context, but having it available changes what's technically possible. The real test comes when teams try to deploy these models at scale and discover whether the efficiency gains hold up under production load.

Whether users actually pay for it remains the real question.

Arturas Malas Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, cost-efficient inference—no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt Connect on LinkedIn
Share:

Comments

Sign in to comment:
    <