Google Unveils TPU 8t and 8i for AI Training and Inference

By Artūras Malašauskas, Apr 22, 2026

Google's eighth-generation TPU lineup introduces specialized chips for training (8t) and inference (8i), promising performance-per-dollar gains of 2.7x for training and 80% for inference over the prior generation while targeting agentic AI workloads.

Google Cloud has announced its eighth-generation Tensor Processing Units (TPUs), introducing two specialized architectures designed to address distinct phases of AI development: the TPU 8t for large-scale pre-training and the TPU 8i for low-latency inference. Unveiled at Cloud Next 2026, these chips form the hardware foundation of Google's AI Hypercomputer, an integrated system combining purpose-built accelerators, networking, and software to optimize the full AI lifecycle from training to deployment.

The TPU 8t targets massive-scale model training with a 9,600-chip superpod architecture delivering 121 Exaflops of FP4 compute capacity, 2.84x the figure for the previous Ironwood generation. Key features include the SparseCore accelerator, which offloads irregular memory access patterns during embedding lookups, and native FP4 support, which Google says doubles matrix multiply unit (MXU) throughput while maintaining model accuracy. Google emphasizes that TPU 8t achieves a 2.7x performance-per-dollar improvement over Ironwood for large-scale training, directly addressing the computational demands of emerging agentic AI systems requiring multi-step reasoning and world-model simulations.
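As a back-of-the-envelope check on those figures, the short Python sketch below derives the implied per-chip FP4 throughput and the implied Ironwood pod baseline from the numbers quoted above. It assumes that the 121 Exaflops figure is the peak for the full 9,600-chip superpod and that "2.84x" is a plain ratio of pod-level peak compute; the article does not confirm either assumption, and the function names are purely illustrative.

```python
# Back-of-the-envelope arithmetic on the pod-level figures quoted in the article.
# Assumption: 121 Exaflops is the peak for the full 9,600-chip superpod, and
# "2.84x" is a simple ratio of pod-level peak compute versus Ironwood.

EXA = 10**18
PETA = 10**15


def per_chip_petaflops(pod_exaflops: float, chips: int) -> float:
    """Average peak throughput per chip, in petaflops."""
    return pod_exaflops * EXA / chips / PETA


def implied_baseline_exaflops(pod_exaflops: float, ratio: float) -> float:
    """Pod-level figure for the older generation implied by the quoted ratio."""
    return pod_exaflops / ratio


if __name__ == "__main__":
    print(f"TPU 8t per chip: {per_chip_petaflops(121, 9600):.1f} PFLOPS (FP4)")
    print(f"Implied Ironwood pod: {implied_baseline_exaflops(121, 2.84):.1f} Exaflops")
```

Under these assumptions the sketch gives roughly 12.6 petaflops of FP4 per chip and an implied Ironwood pod figure of about 42.6 Exaflops. A ratio like this can mix numeric precisions across generations, so it is not necessarily a like-for-like comparison.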

Meanwhile, the TPU 8i focuses on inference efficiency with 288 GB of high-bandwidth memory (HBM) and 384 MB of on-chip SRAM, triple the previous generation's capacity. The larger SRAM is meant to keep more of a model's active working set on-chip, cutting the latency of memory transfers. A TPU 8i pod delivers 331.8 Exaflops of FP8 compute (6.74x Ironwood), with an 80% performance-per-dollar improvement for large Mixture-of-Experts (MoE) models. Google attributes this to its Collectives Acceleration Engine (CAE), which offloads global operations and reduces on-chip latency by up to 5x, and to its Boardfly architecture, which cuts network diameter by 50% to speed up inter-chip communication.
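Network diameter is the largest number of hops on a shortest path between any two chips, so halving it shortens the worst-case route for collective traffic. The Python sketch below is a toy illustration of that metric, not a model of Boardfly, whose topology the article does not describe. It measures the diameter of a hypothetical 16-node ring with breadth-first search, then again after adding hypothetical shortcut links to the opposite side of the ring.

```python
from collections import deque


def ring(n):
    """Adjacency sets for an n-node ring (each node linked to two neighbours)."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def add_chords(graph, stride):
    """Return a copy with an extra link from every node i to (i + stride) % n."""
    n = len(graph)
    out = {node: set(neigh) for node, neigh in graph.items()}
    for i in range(n):
        j = (i + stride) % n
        out[i].add(j)
        out[j].add(i)
    return out


def eccentricity(graph, start):
    """Hop count from start to its farthest node, via breadth-first search."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return max(dist.values())


def diameter(graph):
    """Largest shortest-path hop count over all node pairs (connected graph)."""
    return max(eccentricity(graph, node) for node in graph)


if __name__ == "__main__":
    base = ring(16)                       # hypothetical toy topology
    shortcut = add_chords(base, stride=8)  # hypothetical cross-ring links
    print("ring diameter:", diameter(base))
    print("ring + chords diameter:", diameter(shortcut))
```

In this toy case the diameter drops from 8 hops to 4, a 50% cut obtained by adding one extra link per node pair. Real interconnects trade such shortcuts against wiring cost and port count, which this sketch ignores.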

Both chips integrate Arm-based Axion CPUs to eliminate data preparation bottlenecks, with the TPU 8i leveraging a non-uniform memory access (NUMA) architecture for optimized isolation. The system also incorporates Google's Virgo Network for near-linear scaling across up to a million chips, paired with JAX and Pathways software to maximize utilization. Crucially, Google's technical deep dive confirms these architectures were designed specifically for the operational intensities of modern agentic AI workloads, including long-context reasoning and simulation-based learning.

Google's approach contrasts with Nvidia's unified GPU strategy by explicitly separating training and inference hardware, a response to the diverging requirements of pre-training (massive parallelism) versus serving (low-latency, high-throughput). The company positions the TPU 8t as a way to reduce frontier model development cycles from months to weeks, while the TPU 8i directly addresses the "memory wall" in inference through on-chip SRAM expansion. This specialization aligns with Google's broader AI Hypercomputer vision, which integrates TPU 8 chips, Axion CPUs, and Nvidia Rubin GPUs into a single cohesive architecture for the agentic AI era.

Industry implications are significant for developers and enterprises. The TPU 8t's 121 Exaflops capacity and 9,600-chip scalability could accelerate training of models like Google DeepMind's Genie 3, which simulates complex agent behaviors in virtual environments. For businesses, the TPU 8i's 80% performance-per-dollar gain on inference may lower costs for deploying MoE models in applications like Workspace Intelligence, Google's new contextual AI layer for Gmail, Docs, and Sheets that leverages these chips to enable features like "Ask Gemini" in Chat. As noted in the official TPU documentation, these chips power Google's own AI services, including Search and Maps, serving over 1 billion users daily.
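To put the performance-per-dollar figures in cost terms: if a chip does k times as much work per dollar, the same workload costs 1/k as much. The Python sketch below applies that identity to the article's two figures. It assumes that "80% improvement" means 1.8x performance per dollar and that the workload is otherwise unchanged; actual savings depend on pricing and utilization, which the article does not give, and the function names are illustrative.

```python
def cost_fraction(perf_per_dollar_ratio: float) -> float:
    """Cost of a fixed workload relative to the baseline, given a perf/$ ratio."""
    if perf_per_dollar_ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 / perf_per_dollar_ratio


def savings_percent(perf_per_dollar_ratio: float) -> float:
    """Percentage reduction in cost for the same amount of work."""
    return (1.0 - cost_fraction(perf_per_dollar_ratio)) * 100.0


if __name__ == "__main__":
    # Figures quoted in the article; 1.8 assumes "80% improvement" means 1.8x.
    for label, ratio in [("TPU 8t training", 2.7), ("TPU 8i inference", 1.8)]:
        print(f"{label}: {savings_percent(ratio):.0f}% lower cost for the same work")
```

Under these assumptions, a 2.7x gain corresponds to about 63% lower cost for the same training work, and an 80% gain to about 44% lower cost for the same inference work.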

While the TPU 8t and 8i are described as "coming soon" in Google's public materials, their technical specifications—particularly the FP4/FP8 precision support and SparseCore architecture—represent a clear evolution from prior generations. The focus on workload-specific optimization, rather than raw FLOPS, signals a maturing AI infrastructure market where hardware specialization is becoming as critical as software innovation. For enterprises evaluating AI infrastructure, the TPU 8 series offers a compelling alternative to Nvidia's H100s, particularly for organizations prioritizing cost efficiency in large-scale training or low-latency inference scenarios.

Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, cost-efficient inference, no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt.
