Why Your Voice Agent Breaks in Noisy Calls & How to Fix It?

Date

Jun 18, 26

Reading Time

10 Minutes

Real Phone Calls Have Physics. Your Demo Didn't.

Human conversation runs on strict timing. A PNAS study analyzing turn-taking across ten languages found an average inter-turn gap of roughly 200ms. The Max Planck Institute pushed it further: humans start preparing their response before the other person finishes speaking. We pre-plan. We anticipate.

Reactive AI architectures wait for silence before taking any action. That 200ms gap is where your agent already starts losing.

Your voice agent latency breaks into two separate budgets. The End-to-End Response Budget is how fast the agent generates and speaks a reply. The Barge-In Flush Budget is how fast it stops when the caller interrupts. Both matter independently. Fail the first, and your agent feels slow. Fail the second, and it feels broken.

Metric	Threshold	What the Caller Experiences
Barge-In Flush Target	Under 150ms	Agent yields naturally, feels responsive
E2E Response (Ideal)	300ms to 500ms	Indistinguishable from human cadence
E2E Response (Acceptable)	500ms to 800ms	Slightly sluggish but functional
Cognitive Friction Point	Over 800ms	Caller repeats themselves, assumes the agent failed
Conversation Breakdown	Over 1,500ms	The caller talks over the agent, and the call is abandoned
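
As a rough illustration, the thresholds in the table can be turned into a small classifier for logged latencies. The function names are hypothetical, and treating anything under 300ms as "ideal" is an assumption, since the table does not define a band below that.

```python
def classify_e2e_latency(latency_ms: float) -> str:
    """Map an end-to-end response latency to the bands in the table above."""
    if latency_ms > 1500:
        return "conversation breakdown"
    if latency_ms > 800:
        return "cognitive friction"
    if latency_ms > 500:
        return "acceptable"
    # Assumption: latencies below 300ms are also counted as ideal.
    return "ideal"


def barge_in_within_budget(flush_ms: float, budget_ms: float = 150.0) -> bool:
    """True when the agent stopped speaking inside the barge-in flush budget."""
    return flush_ms < budget_ms
```

Running both checks per turn keeps the two budgets separate: a turn can pass the response budget and still fail the flush budget.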

And then there's the Lombard Effect. When callers are in noisy environments, they don't just speak louder. Their vocal frequency actually shifts. A model trained on clean studio audio receives a phonetically different signal from someone calling off a highway. That's a voice agent noise-handling problem that no LLM upgrade touches.

Your demo ran in silence, over a stable connection, with a patient speaker. Real calls have engine noise, compressed mobile codecs, side conversations, and callers who cut in mid-sentence. Proper voice agent noise handling means accounting for all of that. And the place where it visibly collapses first is voice agent interruption handling, specifically when the barge-in flush budget exceeds 150ms.

Four specific layers in your stack are responsible for 95% of these failures. Only one of them involves the model.

The Four Places Your Voice Agent Actually Breaks

Infographic showing four layers where a voice agent breaks in production: VAD failure, AEC collapse and double-talk divergence, transport protocol stalling the stream, and token buffer mismanagement

Most voice agent failures trace back to one of four specific layers. And good voice agent noise handling requires all four to work. None of them is the LLM.

Layer 1: VAD is lying to you

Legacy WebRTC VAD runs on Gaussian Mixture Models and raw energy thresholds. Simple in theory: detect energy above X, declare speech. The problem is that real-world noise immediately breaks this.

Background hum, keyboard clicks, a cough, compressed mobile audio, all of it can cross that threshold. In real-world noise simulations, WebRTC VAD misses roughly 50% of valid speech frames at a 5% false positive rate.

That's not a tuning problem. That's a structural limitation of energy-based detection.

You run into two specific edge cases that no threshold setting solves:

The TV Problem. A television or a colleague talking nearby contains actual human speech. The VAD can't distinguish it from the caller. It fires a barge-in, and the agent cuts itself off mid-response.
The Whisper Problem. Soft-spoken callers never push enough energy to cross the threshold. The agent doesn't register them at all. From the system's perspective, nobody's speaking.

Endlessly tuning sensitivity doesn't fix either of these. You're just choosing which failure mode to accept.
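
A minimal sketch of why a single energy threshold cannot win here. This is not the WebRTC VAD implementation; it is a bare energy gate, and the sine tones are synthetic stand-ins for a quiet caller and a loud background hum. The -35 dBFS threshold is an arbitrary assumption.

```python
import math


def frame_dbfs(samples: list[float]) -> float:
    """RMS level of one frame in dBFS; samples are floats in [-1.0, 1.0]."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-10))


def energy_vad(samples: list[float], threshold_dbfs: float = -35.0) -> bool:
    """Declare speech whenever frame energy crosses a fixed threshold."""
    return frame_dbfs(samples) > threshold_dbfs


def tone(amplitude: float, freq_hz: float, n: int = 320, rate: int = 16000) -> list[float]:
    """One 20ms frame of a sine tone, used here as a synthetic test signal."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


quiet_caller = tone(0.005, 200)  # stand-in for a soft-spoken caller
engine_hum = tone(0.1, 100)      # stand-in for loud background noise

caller_detected = energy_vad(quiet_caller)  # False: the caller is missed
noise_detected = energy_vad(engine_hum)     # True: the noise fires the VAD
```

Lowering the threshold far enough to catch the quiet caller makes the hum fire even more often, which is the trade-off described above.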

Layer 2: AEC collapse and double-talk divergence

Without Acoustic Echo Cancellation, the agent hears its own TTS output through the microphone, feeds it back into the STT engine, and interrupts itself. That's false barge-in at its most basic.

The nastier failure is double-talk divergence. When the caller interrupts mid-response, both signals hit the microphone simultaneously.

If the Double-Talk Detector misfires, the adaptive filter starts treating the caller's voice as residual echo and mutes it. The caller is speaking. The agent can't hear them. The call falls apart.

Layer 3: Your transport protocol is stalling the stream

Server-Sent Events run over TCP. TCP guarantees packet delivery by holding up the entire stream while it retransmits a dropped packet. On a degraded mobile connection, the wait is 20-80ms. A single dropped packet causes audible clipping and barge-in signals that arrive late or out of order. Voice agent interruption handling depends on millisecond precision. TCP doesn't offer that on real networks.

Layer 4: Token buffer mismanagement

Sending individual LLM tokens straight to TTS is a shortcut that kills audio quality. The TTS engine needs sentence-level context to produce natural intonation. Without a SentenceAggregator buffering tokens to complete sentences before synthesis, you get choppy audio that sounds broken even when the transcription is perfect.
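
One possible shape for such a buffer, as a sketch only. The class name follows the text, but the `push` and `flush` methods and the boundary rule (a terminator followed by whitespace) are assumptions, not any framework's API. Abbreviations such as "Dr." would still split early under this rule.

```python
import re


class SentenceAggregator:
    """Buffer streamed LLM tokens and release only complete sentences."""

    # A sentence ends at . ! or ? followed by whitespace, so "3.50" stays whole.
    _BOUNDARY = re.compile(r"(.+?[.!?]+)\s+", re.S)

    def __init__(self) -> None:
        self._buffer = ""

    def push(self, token: str) -> list[str]:
        """Add one token; return any sentences that are now complete."""
        self._buffer += token
        sentences = []
        while (match := self._BOUNDARY.match(self._buffer)):
            sentences.append(match.group(1).strip())
            self._buffer = self._buffer[match.end():]
        return sentences

    def flush(self) -> list[str]:
        """Call at end of stream to release whatever text remains."""
        rest = self._buffer.strip()
        self._buffer = ""
        return [rest] if rest else []


aggregator = SentenceAggregator()
spoken = []
for tok in ["Your", " balance", " is", " $3", ".50", ".", " Anything", " else", "?"]:
    spoken.extend(aggregator.push(tok))  # each item would go to TTS as one unit
spoken.extend(aggregator.flush())
```

Waiting for the whitespace after the terminator costs one token of delay per sentence, in exchange for not splitting decimals mid-number.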

"It's a nightmare, and as you said, there's no silver bullet. Different languages, different applications, different acoustic environments, different people all have different requirements."

That quote is from someone running voice agents at scale. They're right that there's no single fix. But there is a correct order of operations across your voice AI stack.

One of these four failures is most likely being caused by a fix you're already running. The order of operations in your DSP pipeline is probably wrong.

The Specific Ways Noisy Calls Destroy Interruption Handling

Three specific failure modes break voice agent interruption handling on real calls. Each one has a different cause and a different fix.

Problem 1: Backchannel Misfires

A caller says "mm-hmm" or "yeah" while the agent is mid-explanation. They're not interrupting. They're signaling they're still listening.

Pure acoustic VAD can't distinguish between them. It sees energy above threshold, declares a barge-in, and the agent stops dead. The conversation broke for no reason.

Teams that try to fix this with semantic interrupt detection run into a different wall. Classifying whether an interruption is a backchannel versus a genuine redirect still carries roughly a 15% error rate in real testing, and the classification itself adds latency. So you trade one failure mode for two.

Problem 2: Premature Endpointing

A caller says: "I'd like to transfer funds to my... [400ms pause] ...checking account."

The VAD's silence threshold fires on that pause. The agent responds to an incomplete sentence. The caller repeats themselves.

You raise the silence threshold to 1,000ms to prevent the cutoff. Now every completed sentence carries a full second of dead air before the agent responds. One frustrating failure traded for another.

Problem 3: The DSP Pipeline Ordering Error

This is the one most teams miss. And it causes both problems above to get significantly worse.

Placing deep-learning noise suppression before the Acoustic Echo Canceller in your DSP pipeline breaks your interruption logic entirely.

Here's why:

Noise suppression applies spectral subtraction and dynamic range compression
These introduce non-linearities into the signal
The AEC's linear adaptive filter needs a mathematically clean, linear signal to model the room impulse response correctly
Feed it a non-linear signal, and it can't converge
The result is continuous echo bleed and an endless loop of false barge-ins

Expert Tip: AEC must be the absolute first DSP operation on the raw PCM audio stream. Any non-linear preprocessing upstream, including noise suppression, permanently prevents the adaptive filter from converging. This is the most common cause of constant false barge-ins that teams can't debug.

AEC goes first. No exceptions. If you want to make your voice agent sound more human on real calls, this ordering issue is the first thing to check.
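
A small configuration check along these lines can catch the ordering error before deployment. This is a sketch: the stage names are hypothetical labels, and listing `agc` and `compressor` as non-linear stages is an assumption extending the noise-suppression case described above.

```python
# Assumed labels for stages that introduce non-linearities into the signal.
NONLINEAR_STAGES = {"noise_suppression", "agc", "compressor"}


def check_dsp_order(stages: list[str]) -> list[str]:
    """Return ordering problems; an empty list means AEC sees unprocessed audio."""
    if "aec" not in stages:
        return ["no AEC stage: the agent will hear its own TTS output"]
    aec_index = stages.index("aec")
    return [
        f"{name} runs before AEC and feeds it a non-linear signal"
        for name in stages[:aec_index]
        if name in NONLINEAR_STAGES
    ]
```

For example, `check_dsp_order(["noise_suppression", "aec", "vad"])` reports one problem, while `["aec", "noise_suppression", "vad"]` reports none.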

Once you know what's breaking, the fixes are precise. Apply them in the wrong order and you make things worse.

Fixing Voice Agent Noise Handling, Layer by Layer

Infographic listing 5 fixes for voice agent noise handling: replace energy VAD with Silero, protect AEC from double-talk divergence, move transport from SSE to WebRTC, add semantic endpointing, and adaptive backchannel handling

These aren't suggestions. Each fix addresses a specific failure mode from the previous sections. Apply them in order.

Fix 1: Replace Energy VAD with Silero

Legacy WebRTC VAD misses roughly 50% of valid speech frames under real-world noise. That's not a tuning problem. It's a structural one. Swap it for Silero VAD: trained on clean speech, overlapping audio, and background noise, it drops the miss rate to 12.3% under hostile conditions.

Don't run it on defaults. Tune these specifically:

Parameter	Default	Noisy Environment Setting	Why
activation_threshold	0.5	0.7 to 0.8	Stops background hum from triggering false detections
deactivation_threshold	Varies	activation minus 0.15	Prevents rapid state-switching at the boundary
min_silence_duration	Varies	300ms to 550ms	Allows mid-sentence pauses without cutting the caller off
min_speech_duration	50ms	50ms (strict)	Rejects mic bumps that can't sustain speech probability for 50ms
speech_pad_ms	Varies	500ms	Stops STT from clipping terminal consonants
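
To show how these parameters interact, here is a sketch of the hysteresis logic applied to a stream of per-frame speech probabilities. It is not Silero's code or API; the field names only mirror the table, the 10ms frame size and the specific values are assumptions, and padding (`speech_pad_ms`) is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VadConfig:
    activation_threshold: float = 0.75
    deactivation_threshold: float = 0.60  # activation minus 0.15
    min_speech_ms: int = 50
    min_silence_ms: int = 400
    frame_ms: int = 10  # assumed frame size for this sketch


def segment_speech(probs: list[float], cfg: Optional[VadConfig] = None) -> list[tuple[int, int]]:
    """Turn per-frame speech probabilities into (start_ms, end_ms) segments."""
    cfg = cfg or VadConfig()
    segments = []
    speaking = False
    start_ms = run_ms = silence_ms = 0
    for i, p in enumerate(probs):
        t = i * cfg.frame_ms
        if not speaking:
            if p >= cfg.activation_threshold:
                if run_ms == 0:
                    start_ms = t
                run_ms += cfg.frame_ms
                # Only commit once speech has lasted min_speech_ms.
                if run_ms >= cfg.min_speech_ms:
                    speaking, silence_ms = True, 0
            else:
                run_ms = 0  # a short bump is discarded
        elif p < cfg.deactivation_threshold:
            silence_ms += cfg.frame_ms
            # Only end the turn after min_silence_ms of continuous silence.
            if silence_ms >= cfg.min_silence_ms:
                segments.append((start_ms, t + cfg.frame_ms - silence_ms))
                speaking, run_ms = False, 0
        else:
            silence_ms = 0  # still above the lower threshold: keep the turn open
    if speaking:
        segments.append((start_ms, len(probs) * cfg.frame_ms - silence_ms))
    return segments


# 30ms mic bump, 200ms speech, 300ms mid-sentence pause, 100ms speech, 500ms silence
probs = [0.9] * 3 + [0.1] * 10 + [0.9] * 20 + [0.1] * 30 + [0.9] * 10 + [0.1] * 50
segments = segment_speech(probs)
```

With these assumed settings the mic bump is rejected and the 300ms pause does not split the turn, so the input produces one segment.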

Two edge cases need separate handling beyond the table:

Whisper Problem: Auto-tune mic gain at the start of the session. Sample the caller's baseline energy, lower the absolute floor to around -50 dBFS for quiet profiles.
TV Problem: Enforce AND-logic on barge-in. Sound must have a high VAD probability AND exceed a dynamically calculated local volume baseline. One gate alone fails.
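
The AND-logic gate might look like the following sketch. The class, its parameters, the 10 dB margin, and the exponential-average baseline are all assumptions chosen for illustration. Note that the baseline needs a short warm-up before it reflects the room.

```python
class BargeInGate:
    """Fire a barge-in only when VAD probability AND relative loudness agree."""

    def __init__(self, vad_threshold: float = 0.75, margin_db: float = 10.0,
                 alpha: float = 0.05, initial_baseline_db: float = -50.0) -> None:
        self.vad_threshold = vad_threshold
        self.margin_db = margin_db      # how far above the baseline a caller must be
        self.alpha = alpha              # smoothing factor for the running baseline
        self.baseline_db = initial_baseline_db

    def should_barge_in(self, vad_prob: float, level_dbfs: float) -> bool:
        is_speech = vad_prob >= self.vad_threshold
        loud_enough = level_dbfs > self.baseline_db + self.margin_db
        fire = is_speech and loud_enough
        if not fire:
            # Frames that do not trigger are treated as ambient and update the baseline.
            self.baseline_db += self.alpha * (level_dbfs - self.baseline_db)
        return fire


gate = BargeInGate()
# A nearby TV: real speech (high VAD probability) but at ambient level.
tv_results = [gate.should_barge_in(0.9, -42.0) for _ in range(200)]
caller_fires = gate.should_barge_in(0.9, -20.0)  # speech well above the baseline
bang_fires = gate.should_barge_in(0.2, -15.0)    # loud but not speech
```

The TV passes the VAD gate and fails the loudness gate; the loud bang does the opposite. Only the caller passes both.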

Fix 2: Protect AEC from Double-Talk Divergence

Implement Geigel Double-Talk Detection. It compares the near-end (microphone) signal level against the maximum far-end signal level over a recent window. When the near-end level exceeds a fixed fraction of that maximum, double-talk is declared and the adaptive filter's coefficients freeze immediately. This prevents the filter from mistaking the caller's voice for residual echo.
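
A sample-level sketch of the Geigel test, under stated assumptions: a 256-sample window, a 0.5 threshold (which presumes the echo path attenuates by at least 6 dB), and a hangover counter that keeps the freeze active briefly. The class and helper names are hypothetical, and the sine signals are synthetic.

```python
import math
from collections import deque


class GeigelDetector:
    """Declare double-talk when the mic level exceeds a fraction of recent far-end peaks."""

    def __init__(self, window: int = 256, threshold: float = 0.5, hangover: int = 240) -> None:
        self._far_abs = deque(maxlen=window)  # recent |far-end| samples
        self.threshold = threshold
        self.hangover = hangover
        self._hold = 0

    def update(self, far_sample: float, near_sample: float) -> bool:
        """Return True while filter adaptation should stay frozen."""
        self._far_abs.append(abs(far_sample))
        if abs(near_sample) > self.threshold * max(self._far_abs):
            self._hold = self.hangover  # (re)start the hangover period
        elif self._hold > 0:
            self._hold -= 1
        return self._hold > 0


def any_double_talk(far: list[float], near: list[float]) -> bool:
    detector = GeigelDetector()
    return any([detector.update(f, n) for f, n in zip(far, near)])


far = [0.8 * math.sin(2 * math.pi * 440 * i / 8000) for i in range(400)]     # agent TTS
echo_only = [0.3 * f for f in far]                                           # attenuated echo
caller = [0.7 * math.sin(2 * math.pi * 300 * i / 8000) for i in range(400)]  # interrupting caller
double_talk = [e + c for e, c in zip(echo_only, caller)]
```

The detector also fires when only the caller is speaking, which is harmless: there is no far-end signal to adapt on in that case. Recomputing `max` over the window on every sample is fine for a sketch but would be optimized in a real-time path.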

Monitor it actively via WebRTC getStats():

ERLE below 10 dB = your adaptive filter is diverging
residualEchoLikelihood above 0.5 = the non-linear processing (NLP) stage is suppressing audio so aggressively that it's clipping the caller's voice
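
For offline checks on recorded audio, ERLE can be computed directly as the power ratio between the microphone signal and the canceller's residual, measured while only the agent is speaking. This is a sketch with hypothetical function names; the 10 dB floor comes from the rule above.

```python
import math


def erle_db(mic: list[float], residual: list[float]) -> float:
    """Echo Return Loss Enhancement in dB: mic power over post-AEC residual power."""
    mic_power = sum(s * s for s in mic) / len(mic)
    residual_power = sum(s * s for s in residual) / len(residual)
    return 10 * math.log10(max(mic_power, 1e-12) / max(residual_power, 1e-12))


def filter_is_diverging(mic: list[float], residual: list[float], floor_db: float = 10.0) -> bool:
    """Flag the adaptive filter when ERLE drops below the floor."""
    return erle_db(mic, residual) < floor_db


mic = [0.5, -0.5] * 100
healthy_residual = [0.05, -0.05] * 100  # echo reduced by a factor of 10 in amplitude
leaky_residual = [0.25, -0.25] * 100    # echo only halved
```

A tenfold amplitude reduction corresponds to 20 dB of ERLE; halving the echo gives only about 6 dB, which falls under the floor.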

Fix 3: Move Transport from SSE to WebRTC

SSE runs over TCP. One dropped packet stalls the entire stream for 20-80ms. On a degraded mobile connection, that's audible clipping and missed barge-in signals. Switch to WebRTC. For a full breakdown, see WebRTC vs SIP for voice agents.

The specific settings that matter for voice agent noise handling:

Audio via RTP, interruption metadata via WebRTC Data Channels
Set Data Channels to unreliable: ordered: false, maxRetransmits: 0. Packets drop instead of stalling.
Reduce JitterBufferTarget to 40ms, JitterBufferMaxPackets to 50

Expert Tip: The 150ms barge-in flush budget breaks down as follows: VAD-to-flush dispatch (10 to 30ms) + WebSocket cancellation to the TTS provider (20 to 30ms) + buffer drain (20 to 40ms) + device release (10 to 20ms). ElevenLabs Turbo and Cartesia Sonic both support mid-stream cancellation via WebSocket close frame. If your TTS provider doesn't, you're already over budget before the caller finishes interrupting.
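
Adding up the stage ranges quoted in the tip shows how much headroom the budget leaves. The dictionary keys are illustrative labels; the numbers are the ones given above.

```python
BUDGET_MS = 150

# (best case, worst case) in milliseconds for each stage of the flush path
stages = {
    "vad_to_flush_dispatch": (10, 30),
    "tts_cancellation": (20, 30),
    "buffer_drain": (20, 40),
    "device_release": (10, 20),
}

best_case_ms = sum(low for low, _ in stages.values())
worst_case_ms = sum(high for _, high in stages.values())
headroom_ms = BUDGET_MS - worst_case_ms
```

The stages total 60 to 120ms, which leaves 30ms of slack in the worst case. Any extra hop, such as a TTS provider that cannot cancel mid-stream, has to fit inside that slack.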

Fix 4: Add Semantic Endpointing

An acoustic VAD answers a binary question: Is there speech in this 20ms window? It has no idea whether the sentence is finished. Semantic endpointing adds a Small Language Model that reads partial transcripts from streaming STT every 50ms and evaluates syntactic completeness.

LiveKit's production implementation runs a 135M parameter SmolLM-v2 fine-tune locally on the worker node
Gradium's Semantic VAD emits turn-completion predictions every 80ms across three future horizons: 0.5s, 1s, and 2s inactivity probabilities
Trailing dependency detected ("I was walking down the...") = silence threshold extends up to 1,500ms
Clean completion detected ("Yes.") = turn fires immediately, cutting response latency by 200 to 500ms

The flushing trick: the moment the SLM commits to an end-of-turn decision, force the server to process all outstanding buffered audio instantly. Don't wait for the STT engine's natural silence timer. Bypass it.
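
The decision logic can be sketched with a crude word-list heuristic standing in for the Small Language Model. This is an assumption for illustration only, not how LiveKit or Gradium implement it; the word list, the function name, and the 500ms default are all hypothetical.

```python
# Assumed stand-in for the SLM: words that usually signal an unfinished phrase.
DANGLING_WORDS = {"the", "a", "an", "my", "to", "and", "or", "of", "for", "with", "but"}


def silence_threshold_ms(partial_transcript: str, default_ms: int = 500,
                         extended_ms: int = 1500) -> int:
    """Pick how long to wait in silence before treating the turn as finished."""
    text = partial_transcript.strip()
    if not text:
        return default_ms
    trailing_off = text.endswith(("...", "…"))
    words = text.rstrip(".!?… ").lower().split()
    last_word = words[-1] if words else ""
    if trailing_off or last_word in DANGLING_WORDS:
        return extended_ms  # trailing dependency: give the caller time to finish
    if text[-1] in ".!?":
        return 0            # clean completion: fire the turn immediately
    return default_ms
```

A real model would score syntactic completeness instead of matching words, but the output contract is the same: one transcript in, one silence threshold out.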

Fix 5: Adaptive Backchannel Handling

Deploy an audio-based ML model that distinguishes the prosodic footprint of "uh-huh" from a genuine redirect. This is separate from choosing the right LLM for your voice agent. It's smaller, faster, and solves one specific problem. LiveKit 1.5's adaptive interruption handling rejects up to 51% of false barge-ins through this approach alone.

These fixes handle the engineering layer. But poor conversation design causes the same failures, and repairing it costs no engineering time.

Conversation Design Fixes That Need No Engineering

Good voice agent noise handling doesn't always need a stack change. Sometimes the prompts are the problem.

A few things worth fixing before your next deployment:

Keep prompts short. Long agent responses give callers more time to interrupt. Under two sentences per turn is a good rule.
One question per exchange. Multi-part questions on noisy calls mean one detail gets lost every time. Ask for one thing, confirm it, move on.
Confirm critical fields. Names, numbers, dates. A quick "Was that April 14?" costs two seconds. A wrong entry costs the whole call.
Make recovery prompts surgical. "I didn't catch the last four digits." beats "Could you repeat that?" The caller knows exactly what to re-say.
Build a fast human handoff. A 2026 SurveyMonkey study found that 79% of Americans prefer a human when automation starts to feel unreliable. Don't force bad calls through more retries.

For a deeper look, the guide on prompting your voice agent for real call conditions is worth reading alongside this.

Before your next deployment, check:
Short prompts (under 2 sentences per turn)
One question per exchange
Confirmation step on all numeric input
Human escalation trigger configured for sustained low audio confidence

None of these changes matter if you can't measure whether they worked.

How to Tell If Your Voice Agent Is Actually Fixed

WER (Word Error Rate) tells you nothing useful about real call performance. It measures transcription accuracy in isolation. It won't tell you whether a barge-in fired correctly, whether the agent held the floor during a mid-sentence pause, or why callers hang up in the first 30 seconds.

Track these instead:

Metric	Definition	Target
Barge-In Success Rate (P90)	Latency to halt audio in 90% of barge-in events	Under 150ms
False Barge-In Rate	Interruptions triggered by noise or backchannels	Under 5%
E2E Endpointing Latency (P95)	Caller stops speaking to the agent's first audio byte	Under 600ms
Reprompt Rate	"I didn't catch that" frequency per call	Below 10% of turns
Wrong-Intent Rate	Recognized intent mismatches the caller's actual need	Below 5%
Audio-Caused Transfer Rate	Human transfers from low audio confidence	Track separately
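
As a sketch of how two of these rates could be derived from per-call logs: the `CallLog` fields are a hypothetical schema, and labeling a barge-in as false is assumed to happen upstream (for example, through review or a classifier).

```python
from dataclasses import dataclass


@dataclass
class CallLog:
    turns: int            # agent turns in the call
    reprompts: int        # "I didn't catch that" style turns
    barge_ins: int        # barge-in events fired
    false_barge_ins: int  # barge-ins later labeled as noise or backchannel


def false_barge_in_rate(calls: list[CallLog]) -> float:
    total = sum(c.barge_ins for c in calls)
    return sum(c.false_barge_ins for c in calls) / total if total else 0.0


def reprompt_rate(calls: list[CallLog]) -> float:
    total = sum(c.turns for c in calls)
    return sum(c.reprompts for c in calls) / total if total else 0.0


calls = [CallLog(turns=20, reprompts=1, barge_ins=12, false_barge_ins=0),
         CallLog(turns=30, reprompts=2, barge_ins=13, false_barge_ins=1)]
meets_targets = false_barge_in_rate(calls) < 0.05 and reprompt_rate(calls) < 0.10
```

Aggregating across calls before dividing keeps short calls from skewing the rate.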

For acoustic-level diagnostics, poll WebRTC getStats() continuously. Watch echoReturnLossEnhancement, residualEchoLikelihood, totalRoundTripTime, and fractionLost. These show you what's happening in the signal layer, not just the application layer.

And watch P99 latency, not just averages. A 400ms average TTFT (time to first token) looks fine. A P99 spike to 2,500ms means 1% of callers hit dead air and abandon. Averages hide tail failures.
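
A short sketch of the gap between the mean and the tail, using a nearest-rank percentile and made-up latency samples. The sample values are illustrative, not measurements.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Illustrative data: 98 fast turns and 2 turns that hit dead air.
latencies_ms = [380] * 98 + [2500] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

Here the mean stays a little above 420ms and the median sits at 380ms, while P99 lands on the 2,500ms stall that the slowest callers actually experienced.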

Expert Tip: Track Task Success Rate alongside Barge-In Success Rate. An agent running a 12% false barge-in rate will show measurably lower task completion even when WER looks clean. Voice agent noise handling problems always show up in task outcomes before they show up in transcription metrics.

Good voice agent monitoring and regular regression testing are what separate teams that catch these failures early from teams that learn about them only from customer complaints.

The teams getting this right share one thing: they treat the acoustic layer as seriously as the model layer.

The Model Was Never the Problem

The instinct when calls break is to upgrade the model. Better reasoning, cleaner output. But the model was fine. The acoustic pipeline underneath it was failing.

VAD firing on background speech. AEC filters diverging under double-talk. TCP stalling the stream when a packet is dropped. Noise suppression placed before AEC and destroying filter convergence. These are the actual failure points in voice agent noise handling on real production calls.

Fix those four layers, add semantic endpointing, and tune your monitoring to acoustic metrics instead of WER. The model you already have will perform significantly better.

If your agent is breaking on noisy calls or can't handle interruptions cleanly, Relinns scopes exactly which layer is failing before touching anything else. That's the starting point, not a demo.
