Google Splits TPU Line Into Training and Inference Chips

By Artūras Malašauskas, Apr 24, 2026

Google's eighth-generation TPUs now come as two specialized chips—TPU 8t for training and TPU 8i for inference—marking a structural shift in AI accelerator design for the agentic era.

At Google Cloud Next on April 22, 2026, Google announced a fundamental restructuring of its custom AI silicon. The company is splitting its eighth-generation Tensor Processing Units into two distinct chips: TPU 8t for training and TPU 8i for inference. This marks the first time the company has bifurcated its TPU line into workload-specific architectures.

The move signals that the computational profiles for frontier model training and low-latency inference have diverged to the point where a single architecture can no longer optimally serve both. Amin Vahdat, SVP and Chief Technologist for AI and Infrastructure at Google, called them "two chips for the agentic era" in the official announcement.

TPU 8t is the training powerhouse. A single superpod scales to 9,600 chips with two petabytes of shared high-bandwidth memory. The architecture delivers 121 exaflops of compute and doubles the interchip bandwidth over the previous generation. According to Google's official blog post, this configuration enables near-linear scaling for up to a million chips in a single logical cluster.
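For a sense of scale, the superpod totals can be divided down to rough per-chip figures. The sketch below is back-of-the-envelope arithmetic only. It assumes decimal units and an even split across chips, neither of which the announcement states, and the variable names are illustrative.

```python
# Rough per-chip figures derived from the quoted TPU 8t superpod totals.
# Assumptions (not stated in the announcement): decimal units
# (1 PB = 1e15 bytes, 1 exaflop = 1e18 FLOP/s) and an even split across chips.
CHIPS_PER_SUPERPOD = 9_600
SHARED_HBM_BYTES = 2e15      # "two petabytes" of shared HBM
SUPERPOD_FLOPS = 121e18      # "121 exaflops"; numeric precision is not given here

hbm_per_chip_gb = SHARED_HBM_BYTES / CHIPS_PER_SUPERPOD / 1e9
flops_per_chip_pf = SUPERPOD_FLOPS / CHIPS_PER_SUPERPOD / 1e15

print(f"HBM per chip: ~{hbm_per_chip_gb:.0f} GB")
print(f"Compute per chip: ~{flops_per_chip_pf:.1f} PFLOP/s")
```

Under those assumptions the totals work out to roughly 208 GB of HBM and about 12.6 PFLOP/s per chip.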

TPU 8i targets inference and reinforcement learning. It triples on-chip SRAM to 384 megabytes and increases HBM by 50% to 288 gigabytes. The chip introduces a new Collectives Acceleration Engine and uses a Boardfly network topology co-designed with DeepMind to minimize latency rather than maximize aggregate bandwidth.
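The "triples" and "increases by 50%" figures also imply a comparison baseline. A minimal sketch of that arithmetic follows. It assumes both ratios are measured against the same earlier chip, which the article does not name.

```python
# Baseline memory sizes implied by the quoted TPU 8i ratios.
# Assumption: both ratios refer to the same earlier chip.
SRAM_8I_MB = 384     # "triples on-chip SRAM to 384 megabytes"
HBM_8I_GB = 288      # "increases HBM by 50% to 288 gigabytes"

baseline_sram_mb = SRAM_8I_MB / 3     # tripled, so divide by 3
baseline_hbm_gb = HBM_8I_GB / 1.5     # +50%, so divide by 1.5

print(f"Implied baseline SRAM: {baseline_sram_mb:.0f} MB")
print(f"Implied baseline HBM: {baseline_hbm_gb:.0f} GB")
```

The implied baseline is 128 MB of SRAM and 192 GB of HBM.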

The Boardfly topology is worth examining. Rather than arranging chips in a conventional mesh or torus where data hops through intermediate nodes, Boardfly is engineered to minimize the number of hops between any two chips in an inference cluster. This trades some aggregate bandwidth for dramatically lower point-to-point latency—critical when AI agents execute multi-step workflows in continuous loops.
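The hop-count argument can be illustrated with a toy comparison. The sketch below is not Boardfly's actual wiring, which this article does not describe. It compares a 64-node 2D torus with a hypothetical grouped layout (all-to-all links inside each group, one direct link between every pair of groups) and counts hops with breadth-first search. The function names and the grouped layout are assumptions made for illustration, and the grouped layout uses more links per node, so this is not a cost-equal comparison.

```python
from collections import deque
from itertools import product


def torus_2d(side):
    """Adjacency map for a side x side 2D torus (mesh with wraparound)."""
    adj = {}
    for x, y in product(range(side), repeat=2):
        adj[(x, y)] = {
            ((x + 1) % side, y), ((x - 1) % side, y),
            (x, (y + 1) % side), (x, (y - 1) % side),
        }
    return adj


def grouped(groups):
    """Hypothetical low-diameter layout (NOT Boardfly's real wiring).

    Each of `groups` groups holds `groups` nodes. Nodes are fully connected
    inside their group, and node (g, i) has one global link to node (i, g),
    which gives exactly one direct link between every pair of groups.
    """
    adj = {}
    for g, i in product(range(groups), repeat=2):
        peers = {(g, j) for j in range(groups) if j != i}
        if i != g:
            peers.add((i, g))
        adj[(g, i)] = peers
    return adj


def hop_stats(adj):
    """Return (worst-case hops, mean hops) over all node pairs, using BFS."""
    worst, total, pairs = 0, 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
        total += sum(dist.values())
        pairs += len(dist) - 1
    return worst, total / pairs


torus_worst, torus_mean = hop_stats(torus_2d(8))    # 64 nodes
group_worst, group_mean = hop_stats(grouped(8))     # 64 nodes
print(f"2D torus: worst {torus_worst} hops, mean {torus_mean:.2f}")
print(f"grouped:  worst {group_worst} hops, mean {group_mean:.2f}")
```

In this toy setup the torus needs up to 8 hops between the farthest pair of nodes, while the grouped layout never needs more than 3. That is the kind of trade a latency-first topology makes: more direct links in exchange for shorter paths.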

Both chips can technically run either kind of workload, but specialization unlocks significant efficiencies. TPU 8t delivers nearly 3x the compute performance per pod over the previous generation. TPU 8i is designed with more memory bandwidth to serve latency-sensitive inference workloads, which matters because interactions between agents at scale magnify even small inefficiencies.

Google is not abandoning its partnership with NVIDIA. The company confirmed that Google Cloud will be among the first to offer NVIDIA's upcoming Vera Rubin system, paired with Google's new Virgo network for large-scale training clusters. This dual-strategy approach means enterprises can choose between custom silicon and industry-standard GPUs depending on their workload requirements.

Independent reporting from TechCrunch notes that Google's chips are not a full frontal assault on NVIDIA's future. Like other hyperscalers including Microsoft and Amazon, Google is using these chips to supplement NVIDIA-based systems in its infrastructure rather than replacing them outright.

Hardware development cycles are much longer than software cycles, so the decision to split the line was made well before this week's announcement. Google anticipated rising demand for inference from customers as frontier AI models are deployed in production and at scale. Several years ago, the company determined that the community would benefit from chips individually specialized for training and for serving.

TPU 8t integrates 10x faster storage access combined with TPUDirect to pull data directly into the TPU. This helps ensure maximum utilization of the end-to-end system. The new Virgo Network, combined with JAX and Pathways software, means TPU 8t can provide near-linear scaling for massive clusters.

From a practical perspective, the difference between these chips shows up in the workload itself. For training on TPU 8t, the goal is a shorter time to convergence, on the order of weeks rather than months for the largest models. For inference on TPU 8i, the goal is to cut the latency between user input and agent response by enough that the difference is perceptible during actual use.

Google's documentation states that TPUs have been powering leading foundation models, including Gemini, for years. The eighth generation represents the culmination of more than a decade of development. The key insight behind the original TPU design continues to hold: by customizing and co-designing silicon with hardware, networking, and software, the company can deliver dramatically more power efficiency and absolute performance.

Organizations like Citadel Securities are already choosing TPUs to power their cutting-edge AI workloads. The firm's adoption suggests that financial services and other latency-sensitive industries see value in the specialized inference architecture.

The split creates complexity for software engineers. Native PyTorch support for TPUs is now available, but the ecosystem alignment remains a challenge. Analysts note that software complexity already weighs on the TPU's ecosystem score, and this additional complexity may make the system even less operable outside of close engagements with top AI labs.

Whether customers will actually pay for the specialization remains the real question. The infrastructure investment is substantial, and the benefits depend entirely on workload characteristics. For training massive models, TPU 8t delivers clear advantages. For inference-heavy applications, TPU 8i's latency optimizations matter. But for general-purpose workloads, the specialization may introduce friction rather than efficiency.

Google's approach reflects a broader industry shift toward system-level co-design. The company is also engineering networking that allows NVIDIA-based systems to perform more efficiently in its cloud. Google and NVIDIA are working to improve Falcon, the software-based networking technology that Google created and open sourced in 2023 under the Open Compute Project.

The agentic era requires computing infrastructure designed and optimized for new requirements. Companies that want to lead need infrastructure that spans purpose-built hardware, open software, and flexible consumption models. Google positions its AI Hypercomputer as AI-optimized infrastructure built for this era.

Time will tell if the specialization pays off. The market will decide whether workload-specific silicon delivers durable competitive advantage or simply raises the engineering burden without proportional returns. For now, the chips are available on Google Cloud, and the performance claims are backed by official documentation. Whether that translates to customer adoption is another matter entirely.

Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, and cost-efficient inference, with no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt
