Google Unveils Dual TPU Architecture for Agentic AI Workloads

By Artūras Malašauskas, Apr 24, 2026

Google has split its eighth-generation TPU line into two specialized chips—TPU 8t for training and TPU 8i for inference—marking a structural shift in how hyperscalers approach custom AI silicon.

At Google Cloud Next on April 22, 2026, Google announced the eighth generation of its Tensor Processing Units, but with a critical departure from previous designs. The company is launching two distinct chips: TPU 8t for training workloads and TPU 8i for inference and reinforcement learning. This bifurcation signals that the computational profiles for frontier model training and low-latency inference have diverged to the point where a single architecture can no longer optimally serve both.

The announcement comes from Google's official blog, where the company frames the split as a response to the "agentic era." AI agents need to reason, plan, and execute multi-step workflows. TPU 8i is designed to let agents complete those workflows quickly enough to keep the user experience responsive. Its counterpart, TPU 8t, is optimized for training and can run even the most complex models on a single, massive pool of memory.

Hardware development cycles are much longer than software cycles. With each TPU generation, Google has to anticipate which technologies and demands will exist by the time the chips reach the market. Several years ago, the company anticipated rising inference demand as customers deployed frontier AI models in production and at scale. With the rise of AI agents, it determined that the community would benefit from chips individually specialized for training and serving.

TPU 8t is built to reduce the frontier model development cycle from months to weeks. By balancing compute throughput, shared memory, and interchip bandwidth against power efficiency and productive compute time, the system delivers nearly 3x the compute performance per pod over the previous generation. A single TPU 8t superpod now scales to 9,600 chips and two petabytes of shared high-bandwidth memory, with double the interchip bandwidth of the previous generation. This architecture delivers 121 FP4 ExaFlops of compute and allows the most complex models to use a single, massive pool of memory.
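As a back-of-envelope check, the pod-level figures above can be divided across the 9,600 chips to get rough per-chip numbers. This is only a sketch: it assumes decimal units and an even split across chips, neither of which the announcement states, and "two petabytes" is presumably a rounded figure.

```python
# Per-chip figures implied by the pod-level numbers quoted above.
# Assumptions (not stated in the article): decimal units
# (1 PB = 1e15 bytes, 1 EFLOP/s = 1e18 FLOP/s) and an even split across chips.

CHIPS_PER_SUPERPOD = 9_600
POD_HBM_BYTES = 2e15   # "two petabytes" of shared HBM
POD_FLOPS = 121e18     # "121 ExaFlops" of compute


def per_chip(total: float, chips: int = CHIPS_PER_SUPERPOD) -> float:
    """Divide a pod-level total evenly across all chips."""
    return total / chips


hbm_gb_per_chip = per_chip(POD_HBM_BYTES) / 1e9
pflops_per_chip = per_chip(POD_FLOPS) / 1e15

print(f"HBM per chip:     ~{hbm_gb_per_chip:.0f} GB")
print(f"Compute per chip: ~{pflops_per_chip:.1f} PFLOP/s")
```

Under those assumptions, the result is roughly 208 GB of HBM and about 12.6 PFLOP/s per chip. These are derived estimates, not published per-chip specifications.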

TPU 8i targets inference and reinforcement learning with different priorities. The chip triples on-chip SRAM to 384 megabytes, increases HBM by 50% to 288 gigabytes, and introduces a new Collectives Acceleration Engine. The most strategically significant feature may not be the memory specs but the network topology surrounding the chip. Google's Boardfly topology was co-designed with DeepMind to optimize for latency rather than bandwidth. Rather than arranging chips in a conventional mesh or torus, where data hops through intermediate nodes, Boardfly is engineered to minimize the number of hops between any two chips in an inference cluster.
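To make the hop-count concern concrete, the sketch below computes the worst-case and average number of hops in a conventional 3D torus. The 8 x 8 x 8 size is a hypothetical example chosen for illustration, and Boardfly itself is not modeled, because the announcement does not describe its structure.

```python
from itertools import product


def ring_distance(a: int, b: int, k: int) -> int:
    """Shortest hop count between positions a and b on a ring of k nodes."""
    d = abs(a - b)
    return min(d, k - d)


def torus_hops(k: int, dims: int = 3) -> tuple[int, float]:
    """Return (diameter, mean hops to another node) for a k-ary torus.

    A torus looks the same from every node, so measuring from the
    origin is enough to characterize the whole network.
    """
    distances = [
        sum(ring_distance(0, c, k) for c in coords)
        for coords in product(range(k), repeat=dims)
    ]
    others = len(distances) - 1  # exclude the origin itself
    return max(distances), sum(distances) / others


diameter, mean = torus_hops(8)  # hypothetical 8 x 8 x 8 = 512-chip torus
print(f"diameter = {diameter} hops, mean = {mean:.2f} hops")
```

For this example, the farthest pair of chips is 12 hops apart and an average message travels about 6 hops. That per-message cost is what a latency-oriented topology aims to cut.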

Interactions between agents at scale magnify even small inefficiencies. Because agentic workflows are latency-sensitive, every millisecond of network hop time compounds across thousands of agent interactions. This is why the Boardfly topology trades some aggregate bandwidth for dramatically lower point-to-point latency.
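The compounding effect can be illustrated with a simple model: if exchanges happen one after another, total network time is hops times per-hop latency times the number of exchanges. Every number below is a hypothetical placeholder rather than a measured TPU figure, and real systems overlap communication with compute, so treat this as an upper-bound sketch.

```python
def network_time_ms(hops: float, per_hop_us: float, exchanges: int) -> float:
    """Total network time in milliseconds for strictly sequential exchanges."""
    return hops * per_hop_us * exchanges / 1000.0


# Every value below is a hypothetical placeholder, not a measured figure.
PER_HOP_US = 1.0      # assumed latency added by each hop, in microseconds
EXCHANGES = 10_000    # assumed number of sequential chip-to-chip exchanges

for hops in (6, 2):   # a longer path versus a shorter one
    total = network_time_ms(hops, PER_HOP_US, EXCHANGES)
    print(f"{hops} hops per exchange -> {total:.0f} ms of network time")
```

With these placeholder values, cutting the path from 6 hops to 2 reduces accumulated network time from 60 ms to 20 ms, which is how a small per-hop saving grows across many interactions.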

Independent reporting from Ars Technica notes that most companies fully committed to building AI models are gobbling up every Nvidia AI accelerator they can get, but Google has taken a different approach. Most of its cloud AI infrastructure is based on its line of custom Tensor Processing Units. After announcing the seventh-gen Ironwood TPU in 2025, the company has moved on to the eighth-gen version, and it is not just a faster iteration of the same chip.

The physical reality of these systems matters. A single TPU 8t superpod houses 9,600 chips with two petabytes of shared high-bandwidth memory. That is not abstract compute: it is racks of liquid-cooled hardware humming in data centers, consuming megawatts of power, with engineers monitoring thermal gradients and network latency in real time. At 121 FP4 EFlops per pod, it offers almost three times Ironwood's training compute ceiling. For anyone building giant AI models, all this hardware saves time.

Alongside the TPU announcements, Google confirmed that Google Cloud will be among the first to offer Nvidia's upcoming Vera Rubin system, paired with Google's new Virgo network for large-scale training clusters. The company is also introducing Axion N4A VMs powered by custom Axion Arm-based CPUs, Google Compute Engine 4th generation VMs, and Google Cloud Managed Lustre for high-performance parallel file systems.

Analyst Brendan Burke from Futurum Group notes that Google's decision to bifurcate its TPU line into dedicated training and inference chips marks a structural evolution in how hyperscalers approach custom AI silicon. The dual-chip strategy acknowledges that the computational profiles for frontier model training and low-latency inference have diverged to the point where a single architecture can no longer optimally serve both. This move places Google alongside AWS and Nvidia, which have pursued similar disaggregation with Trainium and the Rubin inference architecture, respectively.

The key tension is whether workload specialization at this level delivers a durable competitive advantage or simply raises the silicon engineering burden without proportional returns. Software complexity already weighs on the TPU ecosystem, and a second chip architecture may make the platform even harder to operate outside of close engagements with the top AI labs.

Google's VP and GM of AI and Computing Infrastructure, Mark Lohmeyer, stated that the eighth-generation TPUs are "two chips for the agentic era," reflecting the company's view that training and inference workloads now require fundamentally different hardware architectures. The shift to agentic intelligence means a single intent triggers a chain reaction. Unlike in a chat exchange, a primary AI agent decomposes goals into specific tasks for a fleet of specialized agents that then collaborate, preserve state, and use reinforcement learning to deliver outcomes in real time.

This process scales intelligence per interaction, but also creates complexity that yesterday's architectures cannot support without spiraling costs or performance bottlenecks. To scale efficiently and effectively, you must move beyond manually integrating fragmented components and technologies. To deliver agentic experiences that are smart, fast, scalable, and cost-effective, you need a unified infrastructure stack that spans purpose-built hardware, open software, and flexible consumption models.

Whether enterprise customers actually adopt this specialized infrastructure at scale remains the real question. The hardware is impressive on paper, but the software ecosystem around TPUs has historically lagged behind Nvidia's CUDA. Whether developers will embrace the dual-chip architecture or continue to gravitate toward more familiar platforms is something only time—and customer deployments—will reveal.

Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, cost-efficient inference, no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt
