Barge-in in voice call

Help: Implementing Barge-in with ARI in Asterisk 22.6

Current Setup

  • Asterisk 22.6 with ARI

  • STT/TTS working (Google, OpenAI - need dynamic support)

  • Outbound calls and basic conversation working

Problem

Callers must wait for complete TTS playback before speaking. Need barge-in so they can interrupt anytime.

Example:

  • System: “Please tell me your 10-digit account num—”

  • Caller: “1234567890” ← interrupts

  • System: Stops immediately and listens

Questions

  1. How to implement barge-in with ARI?

    • Should I use bidirectional audio streaming?

    • Need external media servers (AudioSocket)?

  2. Voice Activity Detection:

    • How to detect caller speech during TTS playback?

    • Handle VAD in Asterisk or delegate to STT provider?

  3. Audio Control:

    • Best way to stop TTS playback instantly when speech detected?

    • Which ARI endpoints/events to use?

  4. STT Coordination:

    • Should STT stream continuously while TTS plays?

    • How to manage TTS-to-STT transition?

Looking For

  • ARI code examples for barge-in

On Saturday 13 December 2025 at 11:12:15, Ambica via Asterisk Community wrote:

Help: Implementing Barge-in with ARI in Asterisk 22.6

I don’t have an answer to your question, but I do have a question about what
you are trying to do:

  • System: “Please tell me your 10-digit account num—”

  • Caller: “1234567890” ← interrupts

  • System: Stops immediately and listens

If I were having a conversation with a human who was asking me for my account
number, I would not interrupt them before they finished the question, so why is
it desirable to allow people to do that with a TTS system?

If you do find a way to detect that someone has interrupted the TTS, I would
suggest not listening to what they say and processing their speech, but simply
responding with “please do not interrupt - it’s very rude” and continue with
the question.

Maybe I’m just old-fashioned about how people should interact with each other
(and, by extension, with replacements for people).

Antony.


The best time to plant a tree is 20 years ago.
The second best time is now.

I would not consider “replacements for people” to be “people”. Save that for the bad Hollywood movies.

A machine telling people what is or is not rude will likely not go down well with some (real) people …

Thank you for responding. I have gone through your replies, and I believe the following information will help clarify my requirement.

Barge-in refers to interrupting prompt playback based on live inbound speech rather than DTMF input, enabling more natural, conversational voice interactions.

We are currently developing an AI-driven voice agent on Asterisk and are looking to implement barge-in functionality. The goal is to immediately stop audio playback when caller speech is detected, capture the inbound audio in real time, and stream it to an external STT/AI service for processing.

Our setup uses PJSIP with standard Asterisk playback applications. We are exploring potential approaches such as voice activity detection (VAD), media hooks, or ARI-based control, and would appreciate guidance on any recommended architectures, best practices, or proven implementations within Asterisk for achieving this behavior.

Any insights, suggestions, or references would be greatly appreciated.

Hi,

I can answer your questions:

  1. How to implement barge-in with ARI? Yes, you need two-way audio so that you can detect the caller’s audio and apply VAD to stop the TTS playback. Yes, external media is recommended; it is just software and ports in your ARI application, not hardware.
  2. Voice Activity Detection: There are multiple ways to detect the caller’s speech. I use the “power” of the incoming audio: detect it, track it, and stop playback after a short, configurable time. Asterisk ARI handles the VAD.
  3. Audio Control: VAD is your friend here. Use it to control whatever you need.
  4. STT Coordination: There is no need to stream to STT while TTS is playing; VAD detection through ARI is enough.
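
The “power” approach in point 2 could be sketched roughly as below. This is only an illustration, not the code from my application: it assumes the external media stream delivers 16-bit signed little-endian PCM in 20 ms frames, and the names `rms_power` and `EnergyVad`, as well as the threshold values, are made up for the example and would need tuning against real calls.

```python
import math
import struct


def rms_power(frame: bytes) -> float:
    """Root-mean-square amplitude of one frame of 16-bit signed LE PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class EnergyVad:
    """Reports speech once the power stays above a threshold long enough."""

    def __init__(self, threshold: float = 500.0, min_speech_ms: int = 200,
                 frame_ms: int = 20) -> None:
        self.threshold = threshold  # hypothetical value, tune per deployment
        self.frames_needed = max(1, min_speech_ms // frame_ms)
        self.loud_frames = 0

    def process(self, frame: bytes) -> bool:
        """Feed one frame; return True while sustained speech is detected."""
        if rms_power(frame) >= self.threshold:
            self.loud_frames += 1
        else:
            self.loud_frames = 0  # a quiet frame resets the run
        return self.loud_frames >= self.frames_needed
```

When `process()` first returns True during a prompt, the application would stop the playback. Requiring several consecutive loud frames is what gives the “small configurable time” and filters out short clicks.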

Looking For

  • You can see a video of how barge-in/interruptions work, but with a truly Realtime model:
  • Sample code that uses the OpenAI Realtime model instead of separate TTS and STT:

Regards!

Hi,

Thank you for the detailed explanation.

I’d like to clarify our use case further. Our requirement is to implement barge-in in a model-agnostic way, such that it works uniformly regardless of the AI provider being used (OpenAI, Gemini, Claude, or any other future model).

From our understanding:

  • Barge-in should be handled entirely at the telephony/ARI layer using two-way audio and VAD.

  • Detection of caller speech via ARI/VAD should immediately stop TTS playback.

  • This logic should remain independent of the downstream STT/LLM/TTS provider, so we can swap models without changing the barge-in mechanism.

  • External media is acceptable, as long as it remains a software-only implementation.
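
To make the intended separation concrete, here is a minimal sketch of how we picture it, under our own assumptions. The names `SpeechSink` and `BargeInController` are hypothetical, the `speech_detected` flag is assumed to come from whatever VAD the telephony layer provides, and the provider-specific code would live entirely behind `send_audio`.

```python
from typing import Callable, Protocol


class SpeechSink(Protocol):
    """Anything that accepts caller audio: an adapter for any STT/AI provider."""

    def send_audio(self, frame: bytes) -> None: ...


class BargeInController:
    """Telephony-side barge-in logic with no knowledge of the AI provider."""

    def __init__(self, stop_playback: Callable[[], None],
                 sink: SpeechSink) -> None:
        self._stop_playback = stop_playback  # e.g. wraps an ARI call
        self._sink = sink
        self.tts_playing = False

    def tts_started(self) -> None:
        self.tts_playing = True

    def tts_finished(self) -> None:
        self.tts_playing = False

    def on_caller_frame(self, frame: bytes, speech_detected: bool) -> None:
        # Caller speech during a prompt: stop the prompt first.
        if speech_detected and self.tts_playing:
            self._stop_playback()
            self.tts_playing = False
        # Audio is forwarded only while no prompt is playing.
        if not self.tts_playing:
            self._sink.send_audio(frame)
```

Swapping OpenAI for Gemini or another provider would then mean writing a new `SpeechSink` adapter, with the controller left unchanged.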

Please confirm if this understanding is correct and whether ARI’s built-in VAD is sufficient to reliably support this model-independent barge-in behavior while TTS is playing.

Additionally, if there are any best practices or limitations we should be aware of when supporting multiple AI providers in this setup, please let us know.

From purely a built-in Asterisk perspective there is just the TALK_DETECT dialplan function[1] which can be set on a channel in ARI using the set variable method, and will raise an ARI event. It does not inherently stop anything - that’s up to you in the ARI application to react to the event.

[1] TALK_DETECT - Asterisk Documentation
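
As a rough sketch of that flow (not a complete ARI client): the code below only builds the HTTP requests and decides how to react to events; sending them, authentication, and the WebSocket event loop are left out. `ARI_BASE`, the function names, and `BargeInTracker` are hypothetical. The set-variable and stop-playback endpoints and the `ChannelTalkingStarted` and `PlaybackFinished` event types are the standard ARI ones.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import quote, urlencode

ARI_BASE = "http://localhost:8088/ari"  # assumption: local Asterisk HTTP server

Request = Tuple[str, str]  # (HTTP method, URL)


def enable_talk_detect(channel_id: str, value: str = "") -> Request:
    """Build the set-variable request that turns on TALK_DETECT.

    An empty value leaves the detection thresholds at their defaults.
    """
    query = urlencode({"variable": "TALK_DETECT(set)", "value": value})
    path = f"/channels/{quote(channel_id, safe='')}/variable"
    return "POST", f"{ARI_BASE}{path}?{query}"


def stop_playback(playback_id: str) -> Request:
    """Build the request that stops a playback in progress."""
    return "DELETE", f"{ARI_BASE}/playbacks/{quote(playback_id, safe='')}"


class BargeInTracker:
    """Remembers which playback is active on which channel."""

    def __init__(self) -> None:
        self.active: Dict[str, str] = {}  # channel id -> playback id

    def playback_started(self, channel_id: str, playback_id: str) -> None:
        self.active[channel_id] = playback_id

    def handle_event(self, event: dict) -> Optional[Request]:
        """Return the request to send in response to an ARI event, if any."""
        etype = event.get("type")
        if etype == "ChannelTalkingStarted":
            playback_id = self.active.pop(event["channel"]["id"], None)
            if playback_id is not None:
                return stop_playback(playback_id)
        elif etype == "PlaybackFinished":
            finished = event["playback"]["id"]
            for channel_id, playback_id in list(self.active.items()):
                if playback_id == finished:
                    del self.active[channel_id]
        return None
```

The application would send the `enable_talk_detect` request once the channel enters Stasis, call `playback_started` with the `playbackId` it passed when starting each prompt, and send whatever request `handle_event` returns.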


Note that if your caller has poor echo suppression, you risk false speech detections (most likely with hands-free use). Also, if they handle echo with anti-vox on a speakerphone, it may be impossible for them to break in without using DTMF.