Google DeepMind Releases Gemma 4, Its 'Most Capable' Open-Source AI Models

By Artūras Malašauskas, May 15, 2026

Google DeepMind has launched Gemma 4 under an Apache 2.0 license, offering four model sizes optimized for everything from mobile devices to data centers with claims of outperforming models 20 times larger.

Google DeepMind has officially released Gemma 4, a new family of open-source AI models that the company describes as its most capable to date. The announcement comes through multiple official channels, including the Google AI blog, where the firm details the technical specifications and licensing terms.

The release spans four distinct model sizes: Effective 2B (E2B), Effective 4B (E4B), a 26B Mixture of Experts (MoE) variant, and a 31B Dense model. All four variants draw from the same research foundation powering Google's proprietary Gemini 3 models, though they're engineered for different hardware profiles and use cases.

What stands out immediately is the licensing shift. Gemma 4 launches under an Apache 2.0 license, which is commercially permissive and grants developers full control over data, infrastructure, and deployment environments. This marks a notable departure from the more restrictive open-model licenses that have dominated the space, and it is a flexibility developers have been requesting for years.

Clement Farabet, Vice President of Research at Google DeepMind, and Olivier Lacombe, Director of Product Management, co-authored the official announcement. They positioned the models as a response to community feedback, noting that developers have downloaded Gemma family models more than 400 million times since the first generation launched.

The performance claims are aggressive. According to the official documentation, the 31B Dense model currently ranks third among open models on the Arena AI text leaderboard, while the 26B MoE variant holds the sixth spot. Google states these models outcompete rivals up to 20 times their size on those benchmarks.

Hardware targeting is where Gemma 4 gets interesting. The 26B and 31B models are designed for researchers and developers working on personal computers and workstations. Unquantized bfloat16 weights fit on a single 80GB Nvidia H100 GPU, while quantized versions can run on consumer-grade GPUs for use in IDEs and coding assistants.
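The single-GPU claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an estimate under stated assumptions: it treats the nominal sizes in the model names (26B, 31B) as exact parameter counts, counts weights only (no KV cache, activations, or runtime overhead), and uses generic bytes-per-parameter figures for the quantized formats rather than any Gemma-specific scheme.

```python
# Weights-only memory estimate. Assumptions: nominal parameter counts taken
# from the model names, and generic bytes-per-parameter values. KV cache,
# activations, and framework overhead are NOT included.
BYTES_PER_PARAM = {"bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


def fits_on_device(params_billions: float, dtype: str, device_gb: float) -> bool:
    """True if the weights alone fit in the given device memory."""
    return weight_memory_gb(params_billions, dtype) <= device_gb


for name, size_b in (("26B MoE", 26.0), ("31B Dense", 31.0)):
    for dtype in BYTES_PER_PARAM:
        gb = weight_memory_gb(size_b, dtype)
        print(f"{name:9s} {dtype:8s} ~{gb:5.1f} GB  fits in 80 GB: "
              f"{fits_on_device(size_b, dtype, 80.0)}")
```

Under these assumptions the 31B model needs roughly 62 GB for bfloat16 weights, which leaves limited headroom on an 80GB card for long-context KV cache, so real-world capacity will depend on the serving stack.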

The 26B MoE model activates only 3.8 billion of its total parameters during inference, prioritizing speed. The 31B Dense model targets raw quality and fine-tuning flexibility. At the other end of the spectrum, the E2B and E4B models are engineered for mobile and IoT deployments, with effective footprints of two billion and four billion parameters, respectively, during inference to preserve RAM and battery life.
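To put the MoE trade-off in numbers, here is a rough sketch comparing per-token compute for the two larger models. It assumes the nominal totals from the model names and uses the common rule of thumb of about 2 FLOPs per active parameter per generated token, which ignores attention cost and is not a figure from Google.

```python
# Figures from the article; the totals are nominal sizes from the model names.
MOE_TOTAL_B = 26.0    # total parameters, billions
MOE_ACTIVE_B = 3.8    # parameters active per token, billions
DENSE_B = 31.0        # dense model: every parameter is active


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for each token."""
    return active_b / total_b


def approx_flops_per_token(active_params_b: float) -> float:
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter.

    This is an approximation (assumption) that ignores attention overhead.
    """
    return 2.0 * active_params_b * 1e9


fraction = active_fraction(MOE_ACTIVE_B, MOE_TOTAL_B)
ratio = approx_flops_per_token(DENSE_B) / approx_flops_per_token(MOE_ACTIVE_B)
print(f"MoE active share per token: {fraction:.1%}")
print(f"Dense vs MoE compute per token: ~{ratio:.1f}x")
```

The saving applies to compute, not memory: in a typical MoE deployment all experts still have to be loaded, so the 26B model's weight footprint is set by its total parameter count.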

Google worked with its Pixel team as well as Qualcomm Technologies and MediaTek to ensure the smaller models run offline with near-zero latency on edge devices. This includes phones, Raspberry Pi boards, and Nvidia Jetson Orin Nano units. Android developers can prototype agentic flows in the AICore Developer Preview for forward-compatibility with Gemini Nano 4.

In practical terms, the difference between these hardware tiers matters. Running a model on a laptop means waiting through compilation steps, watching GPU utilization spike, and dealing with thermal throttling. Running a model on a phone means the device warms up in your hand, the battery indicator ticks down faster, and you get responses without network latency. The E2B and E4B models are specifically optimized for this reality.

Key capabilities include multi-step planning, deeper logic, and improved performance on math and instruction-following benchmarks. Native function-calling, structured JSON output, and system instructions are included to enable developers to build autonomous agents that interact with tools and APIs.
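As an illustration of what function-calling with structured JSON output looks like on the application side, the sketch below parses a tool call and dispatches it to a local Python function. Everything here is hypothetical: the `get_weather` tool, the `{"name": ..., "arguments": ...}` shape, and the hardcoded `simulated_output` string stand in for a real model response, and the actual Gemma 4 tool-call format may differ.

```python
import json
from typing import Any, Callable


def get_weather(city: str) -> dict[str, Any]:
    """Hypothetical tool: returns canned data instead of calling a real API."""
    return {"city": city, "forecast": "sunny", "temp_c": 21}


# Registry of tools the application is willing to expose to the model.
TOOLS: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch_tool_call(raw_model_output: str) -> Any:
    """Validate a JSON tool call and run the matching registered function."""
    try:
        call = json.loads(raw_model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")

    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("tool arguments must be a JSON object")
    return TOOLS[name](**args)


# Stand-in for text produced by a model; the format is an assumption.
simulated_output = '{"name": "get_weather", "arguments": {"city": "Vilnius"}}'
result = dispatch_tool_call(simulated_output)
print(result)
```

Checking the tool name against an explicit registry keeps the model from invoking arbitrary code, which matters once agents are allowed to act on real APIs.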

Code generation support is designed to turn workstations into local-first AI coding assistants. All models natively process video and images at variable resolutions, with capabilities spanning OCR and chart understanding. The E2B and E4B models also feature native audio input for speech recognition and understanding.

Context windows have been expanded significantly. The edge models offer 128K tokens, while the larger models support up to 256K tokens. This is enough to pass entire code repositories or long documents in a single prompt, which changes how developers approach context management.
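Whether a given repository actually fits in one prompt can be estimated before sending anything. The following is a rough sketch, not a tokenizer: it assumes about 4 characters per token (a common heuristic that varies by language and by tokenizer), treats "128K" and "256K" as 128,000 and 256,000 tokens, and reserves a hypothetical 4,000 tokens for the model's reply.

```python
import math

CHARS_PER_TOKEN = 4.0  # rough heuristic (assumption); real tokenizers vary

# Context limits from the article, interpreted as round numbers (assumption).
CONTEXT_LIMITS = {"edge (E2B/E4B)": 128_000, "large (26B/31B)": 256_000}


def estimate_tokens(text: str) -> int:
    """Estimate token count from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(files: dict[str, str], limit: int,
                    reserve_for_output: int = 4_000) -> bool:
    """True if all files plus a reserved output budget fit within the limit."""
    total = sum(estimate_tokens(body) for body in files.values())
    return total + reserve_for_output <= limit


# Synthetic stand-in for a repository: file name -> file contents.
repo = {"main.py": "x" * 400_000, "utils.py": "y" * 200_000}
for label, limit in CONTEXT_LIMITS.items():
    print(f"{label}: fits = {fits_in_context(repo, limit)}")
```

For anything close to the limit, counting with the model's own tokenizer is the safer choice, since the character heuristic can be off by a wide margin for code or non-English text.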

The models are natively trained on more than 140 languages. This helps developers build inclusive, high-performance applications for a global audience without relying on translation layers that often lose nuance.

Clement Delangue, Co-Founder and Chief Executive Officer of Hugging Face, described the licensing decision as "a huge milestone." He confirmed that Hugging Face would support the Gemma 4 family on day one, which matters for ecosystem adoption.

Google states that Gemma 4 models undergo the same infrastructure security protocols as its proprietary models. This pitches them as a trusted foundation for enterprises and sovereign organizations that need transparency without sacrificing security standards.

The 31B and 26B MoE variants are available through Google AI Studio, and the E4B and E2B models through Google AI Edge Gallery. Model weights can be downloaded from Hugging Face, Kaggle, or Ollama.

Day-one tooling support includes Hugging Face (Transformers, TRL, Transformers.js, Candle), LiteRT-LM, vLLM, llama.cpp, MLX, Ollama, Nvidia NIM and NeMo, LM Studio, Unsloth, SGLang, Cactus, Baseten, Docker, MaxText, Tunix, and Keras.

For production workloads, Google Cloud offers deployment through Vertex AI, Cloud Run, GKE, Sovereign Cloud, and TPU-accelerated serving. On the hardware side, Gemma 4 is optimized for Nvidia AI infrastructure spanning Nvidia Jetson Orin Nano through to Blackwell GPUs.

AMD GPU support is available through the open-source ROCm stack, and Google's own Trillium and Ironwood TPUs are also supported. This breadth of hardware compatibility is unusual for a major model release and suggests serious commitment to the open model strategy.

Google is also running a Gemma 4 Good Challenge on Kaggle, inviting developers to build products using the models. This kind of incentive program helps drive early adoption and creates showcase applications that demonstrate real-world use cases.

The community momentum is already visible. Since the first generation of Gemma models launched, developers have spawned what Google describes as a "Gemmaverse" of more than 100,000 community-built variants. That ecosystem will now have access to significantly more capable base models.

Whether this translates to widespread enterprise adoption remains the real question. Apache 2.0 licensing removes legal friction, but infrastructure costs, fine-tuning expertise, and integration complexity still create barriers for many organizations.

For individual developers and smaller teams, the ability to run frontier-class reasoning on consumer hardware is genuinely useful. For enterprises, the value proposition depends on whether they can justify the operational overhead of managing their own model infrastructure versus using managed services.

Google's positioning of Gemma 4 as "byte for byte, the most capable open models" is a bold claim that will face scrutiny from the broader AI community. Benchmarks can be gamed, and real-world performance often diverges from leaderboard rankings.

The release represents a significant commitment to open models from a company that has historically balanced open and proprietary strategies. Whether this signals a broader shift in Google's AI strategy or simply a tactical move to maintain developer engagement remains to be seen.

For now, developers have access to the models, the tooling, and the documentation. The next few months will show whether the community builds meaningful applications or whether Gemma 4 becomes another well-documented release that gathers dust in model repositories.

A related question is whether users will actually pay for the infrastructure to run these models. The models are free, but the electricity, hardware, and engineering time required to deploy them are not.

Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, cost-efficient inference, no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt
