Krisp VIVA 2.0 Adds Predictive Voice Infrastructure for AI Agents
The voice AI infrastructure market is getting crowded, but Krisp is betting that most voice agents fail not because of bad language models, but because of bad audio. The company launched VIVA 2.0 today, a server-side SDK that sits in the audio pipeline before speech-to-text processing. The goal: make voice agents handle messy, real-world conversations the way humans do.
According to the official press release from Krisp, voice agent usage grew 9x in 2025. Yet most systems still collapse the moment they leave a demo room. Background noise pushes word error rates from 5% to over 30%. Voice activity detection misfires on background voices. On telephony, the agent's own voice can loop back through the mic and trigger self-interruption. These aren't edge cases. They're every call.
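For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the reference length. The sketch below is a minimal illustration of that standard definition. The sample sentences are invented for this example and are not Krisp's benchmark data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented sentences, for illustration only.
reference = "please move my appointment to friday at three pm"
clean_transcript = "please move my appointment to friday at three pm"
noisy_transcript = "please move my apartment to friday at free"

print(f"clean WER: {word_error_rate(reference, clean_transcript):.0%}")
print(f"noisy WER: {word_error_rate(reference, noisy_transcript):.0%}")
```

In the noisy transcript, two substituted words and one dropped word out of nine are enough to put the error rate at about a third, and both substitutions change what the caller asked for.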
VIVA 2.0 introduces four new model categories. Turn Prediction v3 predicts end-of-turn from audio alone, no transcription needed. It reacts to real turn-ends while holding through mid-sentence pauses, so the agent can respond with low latency without cutting users off. The model is small enough to run on standard CPUs or locally on-device for robotics and conversational toys. Interrupt Prediction v1, which Krisp describes as a first-of-its-kind audio-only classifier, predicts when a user intends to interrupt the agent. It distinguishes intent-to-take-the-floor from backchannel speech like "yes" or "mhm." Krisp says a patent has been filed.
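To make the two predictions concrete, here is a minimal sketch of how an agent loop might act on them. Everything in it is an assumption for illustration: the `FrameScores` fields, the `decide` function, and the threshold values are hypothetical and do not reflect the actual VIVA SDK interface, which the announcement does not describe.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP_LISTENING = "keep_listening"  # user is mid-turn, even if pausing
    RESPOND = "respond"                # user has finished; agent may speak
    KEEP_SPEAKING = "keep_speaking"    # ignore backchannel such as "mhm"
    YIELD_FLOOR = "yield_floor"        # user wants to interrupt; stop talking


@dataclass
class FrameScores:
    """Hypothetical per-frame model outputs, not the SDK's real schema."""
    end_of_turn: float  # probability the user has finished their turn
    interrupt: float    # probability the user intends to take the floor


def decide(
    scores: FrameScores,
    agent_is_speaking: bool,
    end_of_turn_threshold: float = 0.7,  # assumed value, tune per deployment
    interrupt_threshold: float = 0.6,    # assumed value, tune per deployment
) -> Action:
    """Map audio-only scores to a turn-taking action."""
    if agent_is_speaking:
        if scores.interrupt >= interrupt_threshold:
            return Action.YIELD_FLOOR
        return Action.KEEP_SPEAKING
    if scores.end_of_turn >= end_of_turn_threshold:
        return Action.RESPOND
    return Action.KEEP_LISTENING


# A mid-sentence pause: low end-of-turn score, so the agent keeps waiting.
print(decide(FrameScores(end_of_turn=0.2, interrupt=0.0), agent_is_speaking=False))
# A backchannel "mhm" while the agent talks: low interrupt score, keep going.
print(decide(FrameScores(end_of_turn=0.1, interrupt=0.15), agent_is_speaking=True))
# A real attempt to cut in: the agent stops and yields.
print(decide(FrameScores(end_of_turn=0.1, interrupt=0.9), agent_is_speaking=True))
```

The point of the split is that each score answers a different question: one governs when the agent may start talking, the other when it must stop.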
Signal Detectors are a new category of real-time audio models. Three launch with VIVA 2.0: TTS Detector identifies synthetic speech in real time (useful when an outbound AI agent calls and recognizes another AI or IVR on the other end), Accent Detector identifies the speaker's accent so audio can be routed to the STT model best tuned for it, and Gender Detector identifies speaker gender to enable personalized responses. Voice Isolation v3 is an upgraded version of what Krisp calls the world's most widely used voice isolation model, and the company says it delivers measurable improvements in downstream word error rate.
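The accent-routing idea can be sketched in a few lines. This is an illustration under stated assumptions, not Krisp's implementation: the accent labels, the STT model names, the confidence score, and the `pick_stt_model` helper are all hypothetical.

```python
# Hypothetical mapping from detected accent label to an STT model name.
STT_MODEL_BY_ACCENT = {
    "en-US": "stt-english-us",
    "en-GB": "stt-english-uk",
    "en-IN": "stt-english-india",
}
DEFAULT_STT_MODEL = "stt-english-general"


def pick_stt_model(
    accent_label: str,
    confidence: float,
    min_confidence: float = 0.8,  # assumed cutoff, not a documented value
) -> str:
    """Route to an accent-tuned STT model only when the detection is confident."""
    if confidence < min_confidence:
        return DEFAULT_STT_MODEL
    return STT_MODEL_BY_ACCENT.get(accent_label, DEFAULT_STT_MODEL)


print(pick_stt_model("en-IN", confidence=0.93))  # stt-english-india
print(pick_stt_model("en-IN", confidence=0.55))  # stt-english-general
print(pick_stt_model("en-AU", confidence=0.97))  # stt-english-general
```

Falling back to a general model on low confidence or an unmapped accent keeps a wrong detection from making transcription worse than it would have been without routing.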
All models run on standard server CPUs. They operate on audio input alone with no transcription required. They're bundled into existing VIVA pricing at no additional charge. This matters because developers don't want to pay for infrastructure they can't see. The SDK processes more than 12 billion minutes of voice AI agent traffic a year and is embedded in over 130 voice AI products, including Daily, Vapi, LiveKit, Ultravox, and Telnyx.
Platforms running VIVA report a 3.5x improvement in turn-taking accuracy, 50% fewer dropped calls, and 30% higher customer satisfaction. The numbers are specific enough to be verifiable, which is more than most AI vendors offer. "At scale, the biggest challenge in voice AI isn't the model. It's the quality of the signal going into it," said David Casem, CEO of Telnyx. "Krisp addresses that at the source, which improves everything downstream from transcription to response."
The technical deep-dive on the Krisp blog includes benchmark comparisons. In those benchmarks, Turn Prediction v3 leads on balanced accuracy and AUC across all conditions compared to SmartTurn v3.2, Deepgram Flux, and LiveKit. The company says v3 catches 47% more true turn-shifts within the first 200 milliseconds than v2, without more false positives. That's the difference between a conversation that feels natural and one that feels like talking to a robot counting seconds of silence.
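Balanced accuracy is worth a quick illustration, because turn-ends are rare compared with moments when the user is still talking. It averages the recall on each class, so a model cannot score well simply by always predicting "not finished." The labels below are invented for this example and have nothing to do with Krisp's benchmark data or methodology.

```python
def balanced_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Mean of per-class recall for binary labels (1 = turn-end, 0 = not)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one example of each class")
    recall_turn_end = tp / (tp + fn)
    recall_not_turn_end = tn / (tn + fp)
    return (recall_turn_end + recall_not_turn_end) / 2


def plain_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of frames labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented labels: 2 turn-ends among 10 frames.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_wait = [0] * 10  # a useless model that never predicts a turn-end

print(plain_accuracy(labels, always_wait))     # 0.8
print(balanced_accuracy(labels, always_wait))  # 0.5
```

The useless model looks respectable on plain accuracy but scores no better than chance on balanced accuracy, which is why the latter is the more informative number for turn detection.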
Here's what that means in practice. You call a voice agent from a busy airport with your kid screaming in the background and a bad cell connection mangling the audio. Without a working audio layer, the agent talks over you, ignores a real interruption, or gets confused by a siren outside the window. VIVA 2.0 sits in the pipeline before any of that reaches speech-to-text or the LLM. It isolates your voice, predicts when you're done speaking, and tells the system when you're about to interrupt. The conversation flows. Or it doesn't. (The difference is whether the audio layer actually works.)
Robert Schoenfield, EVP of Licensing and Partnerships at Krisp, said voice is becoming the primary interface between humans and AI. Those conversations don't happen in clean environments. They happen in the real world, shaped by noise and subtle human cues. VIVA brings that layer into the system, so voice agents can operate the way people actually speak. Krisp will showcase VIVA 2.0 live at Twilio Signal 2026 on May 6-7 in San Francisco.
The company has spent over eight years solving real-world voice in production, first for human-to-human conversations and now for human-to-AI. Krisp argues that this experience gives VIVA a depth of training data and field-tested reliability that nothing else on the market can match. Krisp is deployed on over 200 million devices and processes more than 80 billion minutes of voice conversations every month. It recently won two 2026 Webby Awards for Technical Achievement.
Whether this actually moves the needle for voice AI adoption remains to be seen. The infrastructure is available now, but developers still need to integrate it, test it, and decide whether the improvement justifies the engineering overhead. Voice AI demos work. Production doesn't. VIVA 2.0 claims to close that gap. The real question is whether customers will pay for it.
Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, cost-efficient inference, no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt