On Saturday 13 December 2025 at 11:12:15, Ambica via Asterisk Community wrote:
Help: Implementing Barge-in with ARI in Asterisk 22.6
I don’t have an answer to your question, but I do have a question about what
you are trying to do:
System: “Please tell me your 10-digit account num—”
Caller: “1234567890” ← interrupts
System: Stops immediately and listens
If I were having a conversation with a human who was asking me for my account
number, I would not interrupt them before they finished the question, so why is
it desirable to allow people to do that with a TTS system?
If you do find a way to detect that someone has interrupted the TTS, I would
suggest not listening to what they say and processing their speech, but simply
responding with “please do not interrupt - it’s very rude” and continue with
the question.
Maybe I’m just old-fashioned about how people should interact with each other
(and, by extension, with replacements for people).
Antony.
–
The best time to plant a tree is 20 years ago.
The second best time is now.
Thank you for responding. I’ve gone through your responses. I believe the following information will help you understand my requirement:
Barge-in refers to interrupting prompt playback based on live inbound speech rather than DTMF input, enabling more natural, conversational voice interactions.
We are currently developing an AI-driven voice agent on Asterisk and are looking to implement barge-in functionality. The goal is to immediately stop audio playback when caller speech is detected, capture the inbound audio in real time, and stream it to an external STT/AI service for processing.
Our setup uses PJSIP with standard Asterisk playback applications. We are exploring potential approaches such as voice activity detection (VAD), media hooks, or ARI-based control, and would appreciate guidance on any recommended architectures, best practices, or proven implementations within Asterisk for achieving this behavior.
Any insights, suggestions, or references would be greatly appreciated.
How to implement barge-in with ARI? Yes, you need two-way audio so that you can detect the caller’s audio and apply VAD to stop playing the TTS. Yes, external media is recommended; it is just software and ports in your ARI application, not hardware.
Voice Activity Detection: There are multiple ways to detect the caller’s speech. I measure the “power” of the incoming audio, track it, and stop playback once it has persisted for a short, configurable time. Asterisk ARI handles the VAD.
Audio Control: VAD is your friend here. Control whatever you need with it.
STT Coordination: There is no need to stream to STT for this; VAD detection through ARI while the TTS is playing is enough.
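For anyone who wants to see what the power-based detection described above could look like inside an ARI application, here is a minimal sketch. It is not the code from my application: it assumes the external media leg delivers 16-bit signed little-endian mono PCM in 20 ms frames, and the names `frame_rms` and `PowerVad` as well as the default threshold of 500 are made up for illustration and would need tuning against real calls.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Return the RMS level of a 16-bit signed little-endian mono PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack("<%dh" % n, frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class PowerVad:
    """Hypothetical power-based detector: speech is declared only after the
    level stays above the threshold for a configurable amount of time."""

    def __init__(self, threshold=500.0, frame_ms=20, min_speech_ms=100):
        self.threshold = threshold  # assumed RMS level, tune per deployment
        self.frames_needed = max(1, min_speech_ms // frame_ms)
        self.loud_frames = 0

    def feed(self, frame: bytes) -> bool:
        """Feed one frame; return True once barge-in should be triggered."""
        if frame_rms(frame) >= self.threshold:
            self.loud_frames += 1
        else:
            # Any quiet frame resets the run, so short clicks are ignored.
            self.loud_frames = 0
        return self.loud_frames >= self.frames_needed
```

The application would call `feed()` for every inbound frame while the TTS is playing and stop the playback the first time it returns True. Requiring several consecutive loud frames is what keeps a single click or cough from cutting the prompt.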
Looking For
You can see a video of how barge-in/interruptions work, but with a truly realtime model:
Sample code, no TTS or STT, instead OpenAI Realtime model:
I’d like to clarify our use case further. Our requirement is to implement barge-in in a model-agnostic way, such that it works uniformly regardless of the AI provider being used (OpenAI, Gemini, Claude, or any other future model).
From our understanding:
Barge-in should be handled entirely at the telephony/ARI layer using two-way audio and VAD.
Detection of caller speech via ARI/VAD should immediately stop TTS playback.
This logic should remain independent of the downstream STT/LLM/TTS provider, so we can swap models without changing the barge-in mechanism.
External media is acceptable, as long as it remains a software-only implementation.
Please confirm if this understanding is correct and whether ARI’s built-in VAD is sufficient to reliably support this model-independent barge-in behavior while TTS is playing.
Additionally, if there are any best practices or limitations we should be aware of when supporting multiple AI providers in this setup, please let us know.
From purely a built-in Asterisk perspective there is just the TALK_DETECT dialplan function[1] which can be set on a channel in ARI using the set variable method, and will raise an ARI event. It does not inherently stop anything - that’s up to you in the ARI application to react to the event.
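As a rough illustration of that flow, the sketch below builds the two ARI REST requests involved and shows the event handling in between: set `TALK_DETECT(set)` through the channel variable resource, wait for a `ChannelTalkingStarted` event, then delete the active playback. It only constructs the requests and does not send them. The base URL, the threshold values, and the helper names (`talk_detect_request`, `stop_playback_request`, `on_ari_event`) are assumptions for the example; authentication and the WebSocket event loop are left out.

```python
from urllib.parse import quote, urlencode

# Assumed ARI base URL; adjust host, port, and prefix for your http.conf.
ARI_BASE = "http://localhost:8088/ari"


def talk_detect_request(channel_id, silence_ms=1200, talk_threshold=128):
    """Build the request that enables TALK_DETECT on a channel.

    The value is "silence threshold in ms, talking threshold"; the numbers
    here are example values, not recommendations.
    """
    query = urlencode({
        "variable": "TALK_DETECT(set)",
        "value": f"{silence_ms},{talk_threshold}",
    })
    path = f"/channels/{quote(channel_id, safe='')}/variable"
    return ("POST", f"{ARI_BASE}{path}?{query}")


def stop_playback_request(playback_id):
    """Build the request that stops a playback that is in progress."""
    return ("DELETE", f"{ARI_BASE}/playbacks/{quote(playback_id, safe='')}")


def on_ari_event(event, active_playbacks):
    """React to one decoded ARI event.

    active_playbacks maps channel id -> id of the playback currently running
    on that channel (bookkeeping the application has to maintain itself).
    Returns the request to issue, or None if there is nothing to stop.
    """
    if event.get("type") != "ChannelTalkingStarted":
        return None
    channel_id = event.get("channel", {}).get("id")
    playback_id = active_playbacks.pop(channel_id, None)
    if playback_id is None:
        return None
    return stop_playback_request(playback_id)
```

Because the handler only knows about channels and playbacks, nothing in it depends on which STT, LLM, or TTS provider produced the audio being played.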
Note that if your caller has poor echo suppression, you risk false speech detections (most likely with hands-free use). Also, if their loudspeaking phone handles echo with anti-vox, it may be impossible for them to break in without using DTMF.