Google Sells TPUs, Mistral Deploys Vibe Agents, AI Evals Hit Cost Wall
Alphabet is finally selling its custom Tensor Processing Units to outside customers, marking a strategic pivot that puts the company in direct competition with Nvidia for enterprise AI infrastructure. The move follows announcements of two new TPU generations dedicated to training and inference workloads, with Anthropic and Meta already signed as early adopters. This isn't just a hardware play—it's a statement about who controls the AI supply chain when cloud providers decide to open their doors.
According to a filing with the U.S. Securities and Exchange Commission, Broadcom has entered a long-term agreement to develop and supply future generations of Google's custom TPUs through 2031. The partnership includes approximately 3.5 gigawatts of next-generation TPU-based AI compute capacity for Anthropic starting in 2027. That's a massive commitment from Anthropic, whose annual revenue run rate has tripled since late 2025 and now exceeds $30 billion. CRN reported the details of the deal, noting that the vast majority of the infrastructure will be built in the U.S.
Anthropic's CFO Krishna Rao called it their "most significant compute commitment to date," but the language is telling. The company is hedging its bets across multiple platforms—AWS Trainium, Google TPUs, and Nvidia GPUs—to match workloads to chips best suited for them. This diversity translates to better performance and resilience for customers depending on Claude for critical work. It's a pragmatic approach in a market where hardware shortages can derail deployment timelines.
Meanwhile, Mistral AI is pushing agentic workflows into production with Medium 3.5, a 128B dense model designed for instruction-following, reasoning, and coding in a single package. The model has a 256,000-token context window and can run on just four GPUs for self-hosting, which matters for enterprises prioritizing European data sovereignty. The weights are published on Hugging Face under a modified MIT license, and API pricing is $1.50 per million input tokens and $7.50 per million output tokens.
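To put the quoted token prices in perspective, here is a minimal sketch of the per-request arithmetic. The function name and the example request size (a full 256,000-token context plus 4,000 output tokens) are illustrative assumptions, not figures from Mistral.

```python
# Prices quoted in the article (USD per million tokens).
INPUT_USD_PER_MILLION = 1.50
OUTPUT_USD_PER_MILLION = 7.50


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the API cost of one request at the quoted list prices."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000


# Hypothetical request: a full 256,000-token context plus 4,000 output tokens.
full_context_cost = request_cost_usd(256_000, 4_000)
print(f"${full_context_cost:.3f} per request")
```

At these prices, filling the context window costs far more than the output it produces, which is worth keeping in mind for agents that resend long histories on every turn.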
The bigger story is Mistral Vibe's remote agents. Previously, coding agents ran exclusively on local machines, but now sessions can be "teleported" to the cloud, running multiple agents in parallel while notifying users when work is complete. Each session runs in an isolated sandbox, and once finished, the agent can automatically open a pull request on GitHub. Developers only review the result, not every single step. Heise covered the launch, noting the model scores 77.6% on SWE-Bench Verified.
Le Chat also gained a "Work Mode" that allows agents to use multiple tools simultaneously for complex, multi-stage tasks—reviewing emails and calendars, combining web research with internal documents, or sending summaries to Slack. Each tool call and reasoning justification remains visible, and the agent explicitly asks for permission before sensitive actions like sending messages or changing data. This transparency is non-negotiable for regulated industries.
But here's the problem nobody wants to discuss: AI evaluation is becoming the new compute bottleneck. The Holistic Agent Leaderboard recently spent about $40,000 to run 21,730 agent rollouts across nine models and nine benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. Hugging Face's analysis shows a 33× cost spread on identical tasks, isolating scaffold choice as a first-order cost driver.
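A quick back-of-the-envelope check on those figures, as a sketch. The constants come from the paragraph above; the spend is quoted as "about $40,000", so the derived values are approximate, and the variable names are only illustrative.

```python
TOTAL_SPEND_USD = 40_000   # "about $40,000", so results below are approximate
ROLLOUTS = 21_730          # agent rollouts reported for the leaderboard
GAIA_RUN_USD = 2_829       # one GAIA run on a frontier model, before caching

# Average cost of a single agent rollout.
cost_per_rollout = TOTAL_SPEND_USD / ROLLOUTS

# How many such GAIA runs the whole budget would cover, and the share one run takes.
gaia_runs_in_budget = TOTAL_SPEND_USD // GAIA_RUN_USD
gaia_share_of_budget = GAIA_RUN_USD / TOTAL_SPEND_USD

print(f"average cost per rollout: ${cost_per_rollout:.2f}")
print(f"GAIA runs covered by the budget: {gaia_runs_in_budget}")
print(f"one GAIA run as share of budget: {gaia_share_of_budget:.1%}")
```

The average rollout is cheap, but a single expensive benchmark run consumes a noticeable slice of the total, which is why averages alone understate the problem.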
The cost problem started before agents. When Stanford's CRFM released HELM in 2022, API costs ranged from $85 for OpenAI's code-cushman-001 to $10,926 for AI21's J1-Jumbo. Across HELM's 30 models and 42 scenarios, aggregate costs came to roughly $100,000. Perlitz et al. (2023) noted that evaluation costs "may even surpass those of pretraining when evaluating checkpoints." For small models, evaluation becomes the dominant compute line item across the whole development cycle.
Static benchmarks had a weakness you could exploit: model differences often concentrate in a small subset of items, so rankings can survive aggressive subsampling. tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error. The Open LLM Leaderboard shrank from 29,000 examples to 180. That trick weakened sharply once benchmarks moved from static predictions to agents. When each item is a multi-turn rollout with its own variance, the long trajectory behind every single question cannot be avoided, and that trajectory becomes the expensive object.
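The subsampling idea can be illustrated with a toy simulation. This sketch uses plain uniform random subsampling on synthetic data, which is a simpler stand-in for the anchor-item selection that tinyBenchmarks actually performs; the model success rates, seeds, and helper names are all assumptions made for illustration.

```python
import random


def simulate_results(n_models: int, n_items: int, seed: int = 0) -> list[list[int]]:
    """Synthetic per-item correctness (1 or 0); model i has a fixed success rate."""
    rng = random.Random(seed)
    rates = [0.3 + 0.5 * i / (n_models - 1) for i in range(n_models)]
    return [[1 if rng.random() < p else 0 for _ in range(n_items)] for p in rates]


def accuracy(row: list[int], items: list[int]) -> float:
    """Accuracy of one model restricted to the given item indices."""
    return sum(row[i] for i in items) / len(items)


# Five hypothetical models on a 14,000-item static benchmark.
results = simulate_results(n_models=5, n_items=14_000)
all_items = list(range(14_000))

# Evaluate on a uniform random subset of 100 items instead of all 14,000.
subset = random.Random(1).sample(all_items, 100)

full_acc = [accuracy(row, all_items) for row in results]
sub_acc = [accuracy(row, subset) for row in results]
max_abs_error = max(abs(f - s) for f, s in zip(full_acc, sub_acc))

print("full benchmark:", [round(a, 3) for a in full_acc])
print("100-item subset:", [round(a, 3) for a in sub_acc])
print("largest accuracy gap:", round(max_abs_error, 3))
```

This works only because each static item is a cheap, single prediction. An agent benchmark has no equivalent shortcut: dropping items reduces the count of rollouts, but each remaining rollout is still long and noisy.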
Higher spend does not reliably buy better results. On Online Mind2Web, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy. SeeAct with GPT-5 Medium hit 42% for $171. The HAL paper notes "a 9× difference in cost despite just a two-percentage-point difference in accuracy." On GAIA, an HAL Generalist with o3 Medium cost $2,828 for 28.5% accuracy, while a different agent hit 57.6% for $1,686. CLEAR finds across six SOTA agents on 300 enterprise tasks that "accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives" with comparable real-world performance.
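The cost-versus-accuracy comparison above is a Pareto question: a configuration is worth considering only if no other one is both cheaper and at least as accurate. A minimal sketch using the four data points quoted in this paragraph follows; the `Run` type and `pareto_front` helper are illustrative names, and the second GAIA agent is left unnamed because the text does not identify it.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    cost_usd: float
    accuracy_pct: float


def pareto_front(runs: list[Run]) -> list[Run]:
    """Keep runs that no other run beats on cost and accuracy at the same time."""
    front = []
    for r in runs:
        dominated = any(
            o.cost_usd <= r.cost_usd
            and o.accuracy_pct >= r.accuracy_pct
            and (o.cost_usd < r.cost_usd or o.accuracy_pct > r.accuracy_pct)
            for o in runs
        )
        if not dominated:
            front.append(r)
    return front


# Figures as quoted in the article.
mind2web = [
    Run("Browser-Use + Claude Sonnet 4", 1577, 40.0),
    Run("SeeAct + GPT-5 Medium", 171, 42.0),
]
gaia = [
    Run("HAL Generalist + o3 Medium", 2828, 28.5),
    Run("unnamed agent (per article)", 1686, 57.6),
]

mind2web_front = pareto_front(mind2web)
gaia_front = pareto_front(gaia)
cost_ratio = mind2web[0].cost_usd / mind2web[1].cost_usd

print([r.name for r in mind2web_front])
print([r.name for r in gaia_front])
print(f"Online Mind2Web cost ratio: {cost_ratio:.1f}x")
```

In both benchmarks the more expensive configuration is strictly dominated, so it drops off the frontier entirely rather than representing a trade-off.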
This creates a perverse incentive structure. Companies with deep pockets can afford to run expensive evals, while smaller players struggle to validate their models. Costs are distributed unevenly across models and tasks, which points to inefficiencies and to the need for cost-effective practices such as standardized documentation and data reuse. Until these issues are addressed, evaluation will stay expensive, undermining equal access and hindering external validation in AI research, a problem that has persisted for years.
Some benchmarks escape the API-cost framing altogether because their evaluation protocol trains models from scratch. The Well costs about 960 H100-hours to evaluate one new architecture and 3,840 H100-hours for a full four-baseline sweep. While compression techniques have been proposed for static benchmarks, new agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and when you try to add reliability to these evals, repeated runs further multiply the cost.
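The way repeated runs multiply a training-in-the-loop budget can be sketched in a few lines. The 960 H100-hour figure is from the text; the repeat count and the hourly GPU price are placeholder assumptions chosen only to make the arithmetic concrete, not quoted rates.

```python
H100_HOURS_PER_ARCHITECTURE = 960  # figure quoted for The Well


def sweep_hours(n_architectures: int, n_repeats: int = 1) -> int:
    """Total H100-hours when every architecture is trained n_repeats times."""
    return H100_HOURS_PER_ARCHITECTURE * n_architectures * n_repeats


def sweep_cost_usd(hours: int, usd_per_h100_hour: float) -> float:
    """Convert GPU hours to dollars at a caller-supplied hourly rate."""
    return hours * usd_per_h100_hour


# Single pass over four baselines, matching the figure in the text.
baseline_hours = sweep_hours(n_architectures=4)

# Hypothetical: three repeated runs per architecture for tighter error bars.
repeated_hours = sweep_hours(n_architectures=4, n_repeats=3)

# Placeholder rate of 2.00 USD per H100-hour; substitute a real quote.
repeated_cost = sweep_cost_usd(repeated_hours, usd_per_h100_hour=2.00)

print(baseline_hours, repeated_hours, repeated_cost)
```

Because the cost is linear in the number of repeats, every extra seed added for reliability costs a full additional sweep.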
The physical reality of this bottleneck is tangible. Engineers waiting for eval results face hours of idle time while GPUs burn through electricity. The click to start a benchmark triggers a chain of events that can cost thousands before anyone sees a single accuracy number. There's no progress bar that feels satisfying—just a terminal window slowly filling with logs while your credit card statement grows.
Whether users actually pay for these capabilities remains the real question. Google's TPU sales depend on enterprises willing to commit to proprietary hardware in a market dominated by Nvidia's CUDA ecosystem. Mistral's agents need developers who trust cloud sandboxes with their codebases. And AI evaluation costs will only rise as models grow more complex, creating a barrier that may concentrate power in the hands of a few well-funded labs.
The infrastructure is here. The models are shipping. The question is whether the economics work for anyone besides the companies already writing nine-figure checks.
Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, and cost-efficient inference, with no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt.