Why Your Voice Agent Breaks in Noisy Calls & How to Fix It?
Date
Jun 18, 26
Reading Time
10 Minutes
Category
AI Voice Agents

Your voice agents passed every internal test. The demo was clean. Then production happened. Suddenly, the agent talks over callers, misses chunks of conversation, or falls apart the moment someone calls from a car.
The instinct is to upgrade the model. Better LLM, better calls. That logic sounds right until you look at where the voice agent's noise handling actually breaks down. It's almost never the model. The LLM is reasoning correctly. The acoustic pipeline underneath it is what's failing.
Fixing voice agent noise handling means fixing VAD configuration, AEC filter management, and transport protocol. That's where voice agent interruption handling collapses on real production calls.
Before you swap models or raise your silence threshold again, you need to know what's actually collapsing on those calls.
Real Phone Calls Have Physics. Your Demo Didn't.
Human conversation runs on strict timing. A PNAS study analyzing turn-taking across ten languages found an average inter-turn gap of roughly 200ms. The Max Planck Institute pushed it further: humans start preparing their response before the other person finishes speaking. We pre-plan.
We anticipate. Reactive AI architectures wait for silence before taking any action. That 200ms gap is where your agent already starts losing.
Your voice agent latency breaks into two separate budgets. The End-to-End Response Budget is how fast the agent generates and speaks a reply. The Barge-In Flush Budget is how fast it stops when the caller interrupts. Both matter independently. Fail the first, and your agent feels slow. Fail the second, and it feels broken.
And then there's the Lombard Effect. When callers are in noisy environments, they don't just speak louder. Their vocal frequency actually shifts. A model trained on clean studio audio receives a phonetically different signal from someone calling off a highway. That's a voice agent noise-handling problem; no LLM upgrade touches.
Your demo ran in silence, over a stable connection, with a patient speaker. Real calls have engine noise, compressed mobile codecs, side conversations, and callers who cut in mid-sentence. Proper voice agent noise handling means accounting for all of that. And the place where it visibly collapses first is voice agent interruption handling, specifically when the barge-in flush budget exceeds 150ms.
Four specific layers in your stack are responsible for 95% of these failures. Only one of them involves the model.
The Four Places Your Voice Agent Actually Breaks

Most voice agent failures trace back to one of four specific layers. And good voice agent noise handling requires all four to work. None of them is the LLM.
Layer 1: VAD is lying to you
Legacy WebRTC VAD runs on Gaussian Mixture Models and raw energy thresholds. Simple in theory: detect energy above X, declare speech. The problem is that real-world noise immediately breaks this.
Background hum, keyboard clicks, a cough, compressed mobile audio, all of it can cross that threshold. In real-world noise simulations, WebRTC VAD misses roughly 50% of valid speech frames at a 5% false positive rate.
That's not a tuning problem. That's a structural limitation of energy-based detection.
You run into two specific edge cases that no threshold setting solves:
- The TV Problem. A television or a colleague talking nearby contains actual human speech. The VAD can't distinguish it from the caller. It fires a barge-in, and the agent cuts itself off mid-response.
- The Whisper Problem. Soft-spoken callers never push enough energy to cross the threshold. The agent doesn't register them at all. From the system's perspective, nobody's speaking.
Endlessly tuning sensitivity doesn't fix either of these. You're just choosing which failure mode to accept.
Layer 2: AEC collapse and double-talk divergence
Without Acoustic Echo Cancellation, the agent hears its own TTS output through the microphone, feeds it back into the STT engine, and interrupts itself. That's false barge-in at its most basic.
The nastier failure is double-talk divergence. When the caller interrupts mid-response, both signals hit the microphone simultaneously.
If the Double-Talk Detector misfires, the adaptive filter starts treating the caller's voice as residual echo and mutes it. The caller is speaking. The agent can't hear them. The call falls apart.
Layer 3: Your transport protocol is stalling the stream
Server-Sent Events run over TCP. TCP guarantees packet delivery by holding up the entire stream while it retransmits a dropped packet. On a degraded mobile connection, the wait is 20-80ms. A single dropped packet causes audible clipping and barge-in signals that arrive late or out of order. Voice agent interruption handling depends on millisecond precision. TCP doesn't offer that on real networks.
Layer 4: Token buffer mismanagement
Sending individual LLM tokens straight to TTS is a shortcut that kills audio quality. The TTS engine needs sentence-level context to produce natural intonation. Without a SentenceAggregator buffering tokens to complete sentences before synthesis, you get choppy audio that sounds broken even when the transcription is perfect.
"It's a nightmare, and as you said, there's no silver bullet. Different languages, different applications, different acoustic environments, different people all have different requirements."
That quote is from someone running voice agents at scale. They're right that there's no single fix. But there is a correct order of operations across your voice AI stack.
One of these four failures is most likely being caused by a fix you're already running. The order of operations in your DSP pipeline is probably wrong.
The Specific Ways Noisy Calls Destroy Interruption Handling
Three specific failure modes break voice agent interruption handling on real calls. Each one has a different cause and a different fix.
Problem 1: Backchannel Misfires
A caller says "mm-hmm" or "yeah" while the agent is mid-explanation. They're not interrupting. They're signaling they're still listening.
Pure acoustic VAD can't distinguish between them. It sees energy above threshold, declares a barge-in, and the agent stops dead. The conversation broke for no reason.
Teams that try to fix this with semantic interrupt detection run into a different wall. Classifying whether an interruption is a backchannel versus a genuine redirect still carries roughly a 15% error rate in real testing, and the classification itself adds latency. So you trade one failure mode for two.
Problem 2: Premature Endpointing
A caller says: "I'd like to transfer funds to my... [400ms pause] ...checking account."
The VAD's silence threshold fires on that pause. The agent responds to an incomplete sentence. The caller repeats themselves.
You raise the silence threshold to 1,000ms to prevent the cutoff. Now every completed sentence carries a full second of dead air before the agent responds. One frustrating failure traded for another.
Problem 3: The DSP Pipeline Ordering Error
This is the one most teams miss. And it causes both problems above to get significantly worse.
Placing deep-learning noise suppression before the Acoustic Echo Canceller in your DSP pipeline breaks your interruption logic entirely.
Here's why:
- Noise suppression applies spectral subtraction and dynamic range compression
- These introduce non-linearities into the signal
- The AEC's linear adaptive filter needs a mathematically clean, linear signal to model the room impulse response correctly
- Feed it a non-linear signal, and it can't converge
- The result is continuous echo bleed and an endless loop of false barge-ins
Expert Tip: AEC must be the absolute first DSP operation on the raw PCM audio stream. Any non-linear preprocessing upstream, including noise suppression, permanently prevents the adaptive filter from converging. This is the most common cause of constant false barge-ins that teams can't debug.
AEC goes first. No exceptions. If you want to make your voice agent sound more human on real calls, this ordering issue is the first thing to check.
Once you know what's breaking, the fixes are precise. Apply them in the wrong order and you make things worse.
Fixing Voice Agent Noise Handling, Layer by Layer

These aren't suggestions. Each fix addresses a specific failure mode from the previous sections. Apply them in order.
Fix 1: Replace Energy VAD with Silero
Legacy WebRTC VAD misses roughly 50% of valid speech frames under real-world noise. That's not a tuning problem. It's a structural one. Swap it for Silero VAD: trained on clean speech, overlapping audio, and background noise, it drops the miss rate to 12.3% under hostile conditions.
Don't run it on defaults. Tune these specifically:
Two edge cases need separate handling beyond the table:
- Whisper Problem: Auto-tune mic gain at the start of the session. Sample the caller's baseline energy, lower the absolute floor to around -50 dBFS for quiet profiles.
- TV Problem: Enforce AND-logic on barge-in. Sound must have a high VAD probability AND exceed a dynamically calculated local volume baseline. One gate alone fails.
Fix 2: Protect AEC from Double-Talk Divergence
Implement Geigel Double-Talk Detection. It compares the ratio of the maximum far-end signal level over an interval to the near-end signal. When double-talk is declared mathematically, the adaptive filter's coefficients freeze immediately. This prevents the filter from mistaking the caller's voice for residual echo.
Monitor it actively via WebRTC getStats():
- ERLE below 10 dB = your adaptive filter is diverging
- residualEchoLikelihood above 0.5 = NLP stage is suppressing audio so aggressively that it's clipping the caller's voice
Fix 3: Move Transport from SSE to WebRTC
SSE runs over TCP. One dropped packet stalls the entire stream for 20-80ms. On a degraded mobile connection, that's audible clipping and missed barge-in signals. Switch to WebRTC. For a full breakdown, see WebRTC vs SIP for voice agents.
The specific settings that matter for voice agent noise handling:
- Audio via RTP, interruption metadata via WebRTC Data Channels
- Set Data Channels to unreliable: ordered: false, maxRetransmits: 0. Packets drop instead of stalling.
- Reduce JitterBufferTarget to 40ms, JitterBufferMaxPackets to 50
Expert Tip: The 150ms barge-in flush budget breaks down as follows: VAD-to-flush dispatch (10 to 30ms) + WebSocket cancellation to the TTS provider (20 to 30ms) + buffer drain (20 to 40ms) + device release (10 to 20ms). ElevenLabs Turbo and Cartesia Sonic both support mid-stream cancellation via WebSocket close frame. If your TTS provider doesn't, you're already over budget before the caller finishes interrupting.
Fix 4: Add Semantic Endpointing
An acoustic VAD answers a binary question: Is there speech in this 20ms window? It has no idea whether the sentence is finished. Semantic endpointing adds a Small Language Model that reads partial transcripts from streaming STT every 50ms and evaluates syntactic completeness.
- LiveKit's production implementation runs a 135M parameter SmolLM-v2 fine-tune locally on the worker node
- Gradium's Semantic VAD emits turn-completion predictions every 80ms across three future horizons: 0.5s, 1s, and 2s inactivity probabilities
- Trailing dependency detected ("I was walking down the...") = silence threshold extends up to 1,500ms
- Clean completion detected ("Yes.") = turn fires immediately, cutting response latency by 200 to 500ms
The flushing trick: the moment the SLM commits to an end-of-turn decision, force the server to process all outstanding buffered audio instantly. Don't wait for the STT engine's natural silence timer. Bypass it.
Fix 5: Adaptive Backchannel Handling
Deploy an audio-based ML model that distinguishes the prosodic footprint of "uh-huh" from a genuine redirect. This is separate from choosing the right LLM for your voice agent. It's smaller, faster, and solves one specific problem. LiveKit 1.5's adaptive interruption handling rejects up to 51% of false barge-ins through this approach alone.
These fixes handle the engineering layer. But conversation design causes the same failures for free, and costs nothing to repair.
Conversation Design Fixes That Need No Engineering
Good voice agent noise handling doesn't always need a stack change. Sometimes the prompts are the problem.
A few things worth fixing before your next deployment:
- Keep prompts short. Long agent responses give callers more time to interrupt. Under two sentences per turn is a good rule.
- One question per exchange. Multi-part questions on noisy calls mean one detail gets lost every time. Ask for one thing, confirm it, move on.
- Confirm critical fields. Names, numbers, dates. A quick "Was that April 14?" costs two seconds. A wrong entry costs the whole call.
- Make recovery prompts surgical. "I didn't catch the last four digits." beats "Could you repeat that?" The caller knows exactly what to re-say.
- Build a fast human handoff. A 2026 SurveyMonkey study found that 79% of Americans prefer a human when automation starts to feel unreliable. Don't force bad calls through more retries.
For a deeper look at prompting your voice agent for real call conditions, that's worth reading alongside this.
Before your next deployment, check:
Short prompts (under 2 sentences per turn) / One question per exchange / Confirmation step on all numeric input / Human escalation trigger configured for sustained low audio confidence
None of these changes matter if you can't measure whether they worked.
How to Tell If Your Voice Agent Is Actually Fixed
WER (Word Error Rate) tells you nothing useful about real call performance. It measures transcription accuracy in isolation. It won't tell you whether a barge-in fired correctly, whether the agent held the floor during a mid-sentence pause, or why callers hang up in the first 30 seconds.
Track these instead:
For acoustic-level diagnostics, poll WebRTC getStats() continuously. Watch echoReturnLossEnhancement, residualEchoLikelihood, totalRoundTripTime, and fractionLost. These show you what's happening in the signal layer, not just the application layer.
And watch P99 latency, not just averages. A 400ms average TTFT looks fine. A P99 spike to 2,500ms means 1% of callers hit dead air and abandon. Averages hide tail failures.
Expert Tip: Track Task Success Rate alongside Barge-In Success Rate. An agent running a 12% false barge-in rate will show measurably lower task completion even when WER looks clean. Voice agent noise handling problems always show up in task outcomes before they show up in transcription metrics.
Good voice agent monitoring and regular regression testing are what separate teams that catch these failures early from teams that learn about them only from customer complaints.
The teams getting this right share one thing: they treat the acoustic layer as seriously as the model layer.
The Model Was Never the Problem
The instinct when calls break is to upgrade the model. Better reasoning, cleaner output. But the model was fine. The acoustic pipeline underneath it was failing.
VAD firing on background speech. AEC filters diverging under double-talk. TCP stalls the token stream when a packet is dropped. Noise suppression is placed before AEC and destroys filter convergence. These are the actual failure points in voice agent noise handling on real production calls.
Fix those four layers, add semantic endpointing, and tune your monitoring to acoustic metrics instead of WER. The model you already have will perform significantly better.
If your agent is breaking on noisy calls or can't handle voice agent interruption handling cleanly, Relinns scopes exactly which layer is failing before touching anything else. That's the starting point, not a demo.


