Thinking Machines Debuts Real-Time Interrupting AI Model

By Artūras Malašauskas, May 12, 2026

Thinking Machines Lab announced TML-Interaction-Small, a full-duplex AI model that processes input and generates responses simultaneously, with a reported turn-taking latency of 0.40 seconds.

The AI startup Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, announced a research preview of what it calls interaction models: systems designed to interrupt users during a conversation rather than waiting for them to finish their turn. The company argues that traditional AI models work sequentially, listening and then responding, whereas its new architecture is meant to process input and generate responses at the same time.

The technical term for this capability is "full duplex," and the company claims its model, TML-Interaction-Small, responds in 0.40 seconds, which is roughly the speed of natural human conversation. This performance exceeds that of comparable models from OpenAI and Google, according to the firm's official documentation.

According to the company's official blog post, the model uses a multi-stream, micro-turn design that processes input and output simultaneously in 200 ms chunks. Rather than relying on large standalone encoders such as Whisper for audio, the system feeds raw audio (as dMel features) and images (as patches) through a lightweight embedding layer, and all components are co-trained from scratch together with the transformer.
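To make the micro-turn idea concrete, here is a toy Python sketch of the chunking step only. It assumes 16 kHz mono audio (the company has not published a sample rate), and the names `MicroTurn`, `chunk_stream`, and `run_full_duplex`, as well as the `respond` callback standing in for the model, are hypothetical. It does not attempt dMel features, image patches, or the transformer itself. What it shows is the shape of the loop: every 200 ms window carries both new user input and a (possibly empty) slice of model output, instead of the model waiting for a finished turn.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Sequence

CHUNK_MS = 200                                        # micro-turn size from the announcement
SAMPLE_RATE = 16_000                                  # assumption: 16 kHz mono input
SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000    # 3,200 samples per window


@dataclass
class MicroTurn:
    index: int
    user_audio: Sequence[float]                       # raw samples for this 200 ms window
    model_output: List[str] = field(default_factory=list)

    @property
    def elapsed_seconds(self) -> float:
        # With fixed-size windows, the time at which a window starts is its index times 200 ms.
        return self.index * CHUNK_MS / 1000


def chunk_stream(samples: Sequence[float]) -> Iterator[MicroTurn]:
    """Split an audio buffer into consecutive 200 ms windows (the last may be short)."""
    for i, start in enumerate(range(0, len(samples), SAMPLES_PER_CHUNK)):
        yield MicroTurn(index=i, user_audio=samples[start:start + SAMPLES_PER_CHUNK])


def run_full_duplex(samples: Sequence[float],
                    respond: Callable[[MicroTurn], List[str]]) -> List[MicroTurn]:
    """Toy loop: in every window the stand-in model sees new input and may emit output."""
    transcript = []
    for turn in chunk_stream(samples):
        turn.model_output = respond(turn)             # output is produced per window, not per turn
        transcript.append(turn)
    return transcript


one_second = [0.0] * SAMPLE_RATE                      # one second of silent audio
turns = run_full_duplex(one_second, lambda t: ["mm-hm"] if t.index == 3 else [])
print(len(turns), turns[3].elapsed_seconds, turns[3].model_output)  # 5 0.6 ['mm-hm']
```

Here the stand-in "model" backchannels during the fourth window while user audio keeps streaming in. A side effect of fixed-size windows is that elapsed time falls straight out of the window index, which is one plausible reading of how such a design could give a model a direct sense of time.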

This architecture represents a fundamental shift in how AI perceives time and presence. Current frontier models typically experience reality in a single thread; they wait for a user to finish an input before they begin processing, and their perception freezes while they generate a response. The Thinking Machines researchers described the status quo as a limitation that forces humans to contort themselves to AI interfaces, phrasing questions like emails and batching their thoughts (which is frustrating, honestly).

To solve this "collaboration bottleneck," the company has moved away from the standard token sequence in which user and model turns strictly alternate. TML-Interaction-Small is a 276-billion-parameter Mixture-of-Experts (MoE) model with 12 billion active parameters. Because real-time interaction requires near-instantaneous responses, which often conflict with deep reasoning, the company pairs it with a second model in a two-part system.

The Interaction Model stays in constant exchange with the user, handling dialog management, presence, and immediate follow-ups. The Background Model is an asynchronous agent that handles sustained reasoning, web browsing, and complex tool calls, streaming its results back to the Interaction Model, which weaves them naturally into the conversation. This setup lets the AI perform tasks such as live translation or generating a UI chart while continuing to listen to user feedback.
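This division of labour resembles a familiar concurrency pattern: a fast loop that never blocks, plus a slow worker that reports back through a queue. The Python `asyncio` sketch below illustrates that pattern under loose assumptions. The functions `interaction_model` and `background_model`, the `search:` trigger, and the timings are all hypothetical, and nothing here reflects how Thinking Machines actually connects its two models.

```python
import asyncio

TICK = 0.2  # one 200 ms micro-turn (assumed to match the announced chunk size)


async def background_model(query: str, results: asyncio.Queue) -> None:
    """Hypothetical slow agent standing in for reasoning, browsing, or tool calls."""
    await asyncio.sleep(0.5)                       # simulate a long-running task
    await results.put(f"result for {query}")


def drain(results: asyncio.Queue, log: list) -> None:
    """Weave in whatever background results have arrived, without blocking."""
    while not results.empty():
        log.append(f"update: {results.get_nowait()}")


async def interaction_model(user_chunks: list) -> list:
    """Hypothetical fast loop: keeps handling input while the background task runs."""
    results: asyncio.Queue = asyncio.Queue()
    log, pending = [], None
    for chunk in user_chunks:
        if chunk.startswith("search:") and pending is None:
            query = chunk[len("search:"):].strip()
            pending = asyncio.create_task(background_model(query, results))
            log.append("ack: looking that up")
        else:
            log.append(f"heard: {chunk}")
        drain(results, log)
        await asyncio.sleep(TICK)                  # advance to the next micro-turn
    if pending is not None:
        await pending                              # finish any task still in flight
        drain(results, log)
    return log


log = asyncio.run(interaction_model(["hi", "search: weather", "a", "b", "c", "d"]))
print(log)
```

The point to notice is that "heard: a" and later chunks keep arriving after the acknowledgement; the lookup result is woven in at whichever micro-turn it happens to land on, rather than the conversation pausing until the slow task finishes.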

Independent reporting from TechCrunch matches the announced timeline and technical claims. The publication notes that the model's results on visual benchmarks such as RepCount-A (counting repetitions accurately) and ProactiveVideoQA (answering questions as visual events unfold) point to capabilities that other frontier models lack.

To demonstrate the efficacy of this approach, the lab used FD-bench, a benchmark designed to measure interaction quality rather than raw intelligence. By the company's figures, TML-Interaction-Small significantly outperforms existing real-time systems: it achieved a turn-taking latency of 0.40 seconds, compared with 0.57 seconds for Gemini-3.1-flash-live and 1.18 seconds for GPT-realtime-2.0 minimal.

On interaction quality, the model scored 77.8 on FD-bench V1.5, well ahead of its primary competitors: GPT-realtime-2.0 minimal scored 46.8 and Gemini-3.1-flash-live scored 54.3, putting TML-Interaction-Small roughly 66% and 43% higher, respectively. These numbers suggest the model can track whether a speaker is thinking, yielding, self-correcting, or inviting a response, without a separate dialog management component.
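The relative gaps are easy to check from the reported figures alone. This short Python snippet computes them; the inputs are the company's numbers as quoted above, so it verifies only the arithmetic, not the benchmarks themselves.

```python
# Figures as reported by Thinking Machines Lab (not independently verified).
FD_BENCH_V15 = {
    "TML-Interaction-Small": 77.8,
    "Gemini-3.1-flash-live": 54.3,
    "GPT-realtime-2.0 minimal": 46.8,
}
TURN_LATENCY_S = {
    "TML-Interaction-Small": 0.40,
    "Gemini-3.1-flash-live": 0.57,
    "GPT-realtime-2.0 minimal": 1.18,
}
BASELINE = "TML-Interaction-Small"


def score_ratio(rival: str) -> float:
    """How many times higher TML's FD-bench V1.5 score is (higher is better)."""
    return FD_BENCH_V15[BASELINE] / FD_BENCH_V15[rival]


def latency_ratio(rival: str) -> float:
    """How many times longer the rival takes to take a turn (lower latency is better)."""
    return TURN_LATENCY_S[rival] / TURN_LATENCY_S[BASELINE]


for rival in ("Gemini-3.1-flash-live", "GPT-realtime-2.0 minimal"):
    print(f"{rival}: score x{score_ratio(rival):.2f}, latency x{latency_ratio(rival):.2f}")
```

On these figures, TML-Interaction-Small's FD-bench V1.5 score is about 1.43 times Gemini-3.1-flash-live's and about 1.66 times GPT-realtime-2.0 minimal's, while those rivals take roughly 1.4 and nearly 3 times as long to take a turn.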

Building interactivity into the model itself unlocks capabilities that would otherwise have to be implemented in external scaffolding. The model can interject when the context calls for it, not only when the user finishes speaking. The user and the model can speak concurrently, which enables live translation, and the model has a direct sense of elapsed time.

While speaking and listening to the user, the model can also search, browse the web, or generate UI, weaving the results back into the conversation as needed. Over a longer, real session, all of this happens continuously, creating an experience that feels more like collaboration than command-and-response.

TML-Interaction-Small is currently a research preview and is not open to the public. According to its announcement blog post, Thinking Machines Lab plans to open a limited research preview in the next few months to collect feedback, followed by a wider release later this year.

If they reach the enterprise sector, Thinking Machines' interaction models would represent a fundamental shift in how businesses integrate AI into their operational workflows. A native interaction model enables capabilities that are currently impossible or highly brittle with standard multimodal models.

Today's enterprise AI typically needs a completed turn before it can analyze data. In a manufacturing or lab setting, a native interaction model could instead monitor a video feed and interject the moment it detects a safety violation or a deviation from protocol, without waiting for the worker to ask.
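The difference between waiting for a turn and interjecting can be sketched as the frame at which each style of assistant first speaks up. In this hypothetical Python example, the frame format, the `no_hard_hat` detector, and the moment the worker asks are all invented for illustration; a real system would run a vision model rather than a boolean check.

```python
from typing import Callable, Iterable, Optional

Frame = dict                          # hypothetical: features extracted from one video frame
Detector = Callable[[Frame], bool]    # hypothetical safety check


def turn_based(frames: Iterable[Frame], detect: Detector, asked_at: int) -> Optional[int]:
    """Frame index at which a turn-based assistant reports a violation, if at all.

    It buffers the feed and analyses it only when the worker asks (frame `asked_at`).
    """
    buffered = []
    for i, frame in enumerate(frames):
        buffered.append(frame)
        if i == asked_at:
            return i if any(detect(f) for f in buffered) else None
    return None


def streaming(frames: Iterable[Frame], detect: Detector) -> Optional[int]:
    """Frame index at which a streaming assistant interjects: the first violating frame."""
    for i, frame in enumerate(frames):
        if detect(frame):
            return i
    return None


def no_hard_hat(frame: Frame) -> bool:
    return not frame["hard_hat"]


feed = [{"hard_hat": True}] * 3 + [{"hard_hat": False}] + [{"hard_hat": True}] * 8
print(turn_based(feed, no_hard_hat, asked_at=10), streaming(feed, no_hard_hat))  # 10 3
```

The violation occurs at frame 3, but the turn-based assistant says nothing until the worker asks at frame 10; the streaming assistant speaks up at frame 3.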

The benchmarks are impressive, and the underlying idea, that interactivity should be native to the model, is genuinely interesting. However, until people can actually use it, we won't know whether the real-world experience lives up to the technical claims.

There's also the question of whether anyone actually wants an AI that interrupts them. Building interactivity into the model is a novel approach, but its ultimate success remains to be seen. Time will tell whether users resent being talked over by a machine or find it genuinely useful.

Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, and cost-efficient inference, with no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. He is technical editor at muza-ai.eu, ai-verslas.lt, and ai-naujinos.lt.