Enabling barge in with audiosocket connection between asterisk and python server

We have a Python-based audio server connected to our Asterisk IVR system using AudioSocket. The call flow works as follows:

same => n,Wait(2)
same => n,Read(EXTID,${VOICE_DIR}/pls-enter-phone-ext,6,3,300)
same => n,Wait(1)
same => n,Playback(${VOICE_DIR}/you-have-entered)
same => n,SayDigits(${EXTID})
same => n,Wait(1)

same => n,Set(CIVR_HOST=ipconfig)
same => n,Log(NOTICE, Starting AudioSocket connection to ${CIVR_HOST}:3000)

same => n,Set(CALL_UUID=${UUID()})

same => n,Log(NOTICE, Notifying Python app - CallerId: ${CALLERID(num)}, UUID: ${CALL_UUID}, DNID: ${CALLERID(DNID)}, EXTID: ${EXTID})

same => n,System(curl -s "http://${CIVR_HOST}:1650/api/call-start?callerId=${CALLERID(num)}&uuid=${CALL_UUID}&dnid=${CALLERID(DNID)}&lext=${EXTID}" >/dev/null 2>&1 || wget -q -O - "http://${CIVR_HOST}:1650/api/call-start?callerId=${CALLERID(num)}&uuid=${CALL_UUID}&dnid=${CALLERID(DNID)}&lext=${EXTID}" >/dev/null 2>&1)

same => n,AudioSocket(${CALL_UUID},${CIVR_HOST}:3000)

same => n,Log(NOTICE, AudioSocket connection ended)

Once the AudioSocket connection is established:

  • User speech is streamed to the Python server.

  • The server performs Speech-to-Text (STT).

  • The transcribed text is sent to an agentic system for generating a response.

  • The response text is sent to an external TTS service.

  • TTS audio chunks are streamed back through AudioSocket and played to the caller via Asterisk.

  • After playback finishes, the system starts listening for user input again.

This loop continues throughout the call.
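For readers unfamiliar with the wire format behind this loop, here is a minimal sketch of AudioSocket frame handling on the Python side. It assumes the commonly documented framing: a 3-byte header (1-byte kind, 2-byte big-endian payload length) followed by the payload, with kind 0x00 for hangup, 0x01 for the call UUID, 0x10 for 8 kHz 16-bit signed linear mono audio, and 0xff for errors. The function names are hypothetical and not taken from the poster's server.

```python
import struct

# Message kinds, as assumed from the AudioSocket protocol description.
KIND_HANGUP = 0x00
KIND_UUID = 0x01
KIND_AUDIO = 0x10
KIND_ERROR = 0xFF

HEADER_SIZE = 3  # 1-byte kind + 2-byte big-endian payload length


def build_frame(kind: int, payload: bytes) -> bytes:
    """Serialize one message: header followed by the payload."""
    return struct.pack(">BH", kind, len(payload)) + payload


def parse_frames(buffer: bytes):
    """Split a receive buffer into complete (kind, payload) frames.

    Returns the list of complete frames plus the unconsumed remainder,
    which the caller should prepend to the next chunk read from the socket.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER_SIZE:
        kind, length = struct.unpack_from(">BH", buffer, offset)
        if len(buffer) - offset - HEADER_SIZE < length:
            break  # payload not fully received yet
        start = offset + HEADER_SIZE
        frames.append((kind, buffer[start:start + length]))
        offset = start + length
    return frames, buffer[offset:]
```

Because TCP reads do not respect message boundaries, keeping the remainder between reads avoids corrupting audio when a 320-byte (20 ms) frame is split across two `recv` calls.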


Current Problem

The current implementation behaves like a walkie-talkie or half-duplex system.

While the bot is speaking (during TTS playback), microphone/input processing is disabled to avoid:

  • Background noise

  • Echo from bot audio

  • TTS audio being reprocessed as user speech

Because of this, the system cannot hear the caller while the bot is speaking.

If a user interrupts mid-response — for example to clarify something or barge in — their speech is ignored completely.


Attempted Solution

To support full-duplex conversations and barge-in, I tried:

  • WebRTC-based echo cancellation (AEC)

  • Voice Activity Detection (VAD)

  • Continuous audio processing during TTS playback

All these changes were implemented on the Python server side.


Issues With Current Full-Duplex Attempt

Even after significant tuning, we are facing two major issues:

  1. Missed Barge-Ins

    • The system often fails to detect when the user is speaking over the bot audio.
  2. False Barge-Ins

    • The system frequently triggers a barge-in immediately when TTS playback starts, even when the user is completely silent.

This makes the experience unstable and unreliable.


What We Want To Understand

We are exploring whether there is a better approach from the Asterisk side itself.

Specifically:

  • Can full-duplex/barge-in handling be implemented more effectively using Asterisk dialplan features?

  • Does Asterisk provide any native support for interruptible playback or duplex audio handling with AudioSocket?

  • Are there recommended architectural patterns for implementing reliable barge-in with Asterisk + AudioSocket + external STT/TTS systems?

  • Would moving some logic from Python into Asterisk help improve detection stability?

That’s probably because the echo canceller needs to receive some echoes in order to calibrate itself. I suppose you might be able to send a calibration signal as soon as the call is answered, and before barge in detection is enabled, but that might annoy the other party, and they may be in an environment where the acoustics are continually changing.

One problem with this is that the correct place to cancel far-end echoes is the far end, and you don’t control the algorithm at that end. Doing echo cancellation at both ends is likely to cause conflicts and prevent correct operation.
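The calibration point above can be illustrated with a toy adaptive filter. This is a minimal sketch of a normalized LMS (NLMS) echo canceller, not the WebRTC AEC used in the original attempt; the echo path, tap count, and step size are made-up values chosen only to show that the residual echo is large until the filter has seen enough far-end audio.

```python
import random


def nlms_residual(far_end, mic, taps=8, mu=0.5, eps=1e-6):
    """Run an NLMS filter and return the residual (mic minus estimated echo)."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate
        norm = sum(h * h for h in history) + eps
        step = mu * error / norm
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def simulate_echo(far_end):
    """Hypothetical echo path: two delayed, attenuated copies of the bot audio."""
    return [
        0.6 * (far_end[n - 2] if n >= 2 else 0.0)
        + 0.3 * (far_end[n - 5] if n >= 5 else 0.0)
        for n in range(len(far_end))
    ]


def energy(samples):
    return sum(s * s for s in samples)


rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(2000)]  # stand-in for TTS audio
mic = simulate_echo(far_end)                             # caller is silent
residual = nlms_residual(far_end, mic)
early = energy(residual[:100])   # before the filter has adapted
late = energy(residual[-100:])   # after it has adapted
```

In this simulation the caller never speaks, yet the early residual carries most of the echo energy, which is consistent with a barge-in detector firing right after TTS playback starts.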

If you are open to exploring other options, I dealt with this exact problem in my project. Alternatively, you can use my VAD and barge-in logic; it is open source and Python-based.

A different approach might be to use ExternalIVR instead of AudioSocket. This allows the callee to press DTMF buttons to interrupt the playback.

This would likely mean breaking up the audio response into small temporary files for insertion in the playback queue. Note however that I have found Asterisk’s handling of the playback queue to be unreliable: luckily, you get notification as each file is played, so you can ensure the queue never has more than one file waiting.

For an example, see the ivr_dtmf_demo script in the seaskirt_examples repository (Lawrence D’Oliveiro, GitLab).

Thanks for the suggestion, but we are not allowed to use voice agents. We have a different setup in which we must process audio chunks from an external TTS service and send the caller’s audio chunks to an STT service.

Thanks for the reply! ExternalIVR only lets the caller interrupt with buttons, and according to the docs it does not offer real-time streaming, which our case requires. Our project is essentially a conversational IVR.

Yes, that’s the main issue. Is there any way to stop Asterisk and the Python server from treating the bot’s voice as input for barge-in?

I think there may be a way around that. If you look at the ami+agi_audio_player_async example script in that repo, you will see that it tells Asterisk to play an audio file via the AGI STREAM FILE command, and passes it the name of a pipe into which audio is being streamed in real time.

I think the same trick would work with ExternalIVR. Assuming that Asterisk uses the same audio-file-playing code in both places, of course … (whyever would it not do that?) …

This will just create ARI and AMI event handlers. Asterisk emits events when we speak, so nothing starts until I speak. In our case, however, the bot has to play a greeting message first, and we also need to interact with the database at the start for tracking purposes.

Other requirements aside, I did try listening for these events, but even when I spoke, no AMI or ARI events were fired by Asterisk.

Asterisk Dialplan

same => n(civrAsteriskMode),Log(NOTICE, CIVR asterisk/FIFO mode UUID=${CALL_UUID} caller=${CALLERID(num)} ext=${EXTID})
same => n,Set(TALK_DETECT(set)=1200,384)
; MixMonitor writes caller audio (rx only, flag 'r') to the FIFO Python created in call-start.
; Python opens the read end after creating the FIFO, feeds it to the STT pump.
same => n,MixMonitor(/tmp/civr-rx-${CALL_UUID}.sln,r)
; Notify ARI listener that AsyncAGI is about to start so channel↔UUID is mapped.
same => n,UserEvent(CIVRStart,UUID: ${CALL_UUID},CallerID: ${CALLERID(num)},DNID: ${CALLERID(DNID)},EXTID: ${EXTID})
same => n,Log(NOTICE, CIVR AGI starting UUID=${CALL_UUID})
; Enter AsyncAGI: Python drives STREAM FILE calls via AMI; exits via ASYNCAGI BREAK.
same => n,AGI(agi:async,civr)
same => n,StopMixMonitor()
same => n,Set(TALK_DETECT(remove)=)
same => n,UserEvent(CIVREnd,UUID: ${CALL_UUID},CallerID: ${CALLERID(num)})
same => n,Log(NOTICE, CIVR AGI ended UUID=${CALL_UUID})
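On the Python side, the talk-detection events that this dialplan tries to enable would arrive over AMI as blocks of "Key: Value" lines. The following is a minimal sketch of recognizing them, assuming TALK_DETECT raises ChannelTalkingStart and ChannelTalkingStop manager events for the channel; the helper names and the sample channel name are hypothetical, and it does not explain why no events were observed in the test described above.

```python
def parse_ami_event(block: str) -> dict:
    """Parse one AMI message (lines of 'Key: Value') into a dict."""
    fields = {}
    for line in block.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip anything that is not a 'Key: Value' line
            fields[key.strip()] = value.strip()
    return fields


def is_talk_start(event: dict, channel: str) -> bool:
    """True when the caller on `channel` has started talking."""
    return (
        event.get("Event") == "ChannelTalkingStart"
        and event.get("Channel") == channel
    )


# Hypothetical sample message, as it might be read from the AMI socket.
sample = (
    "Event: ChannelTalkingStart\r\n"
    "Channel: PJSIP/1001-00000001\r\n"
    "Uniqueid: 1700000000.1\r\n"
    "\r\n"
)
event = parse_ami_event(sample)
```

A handler built on `is_talk_start` could then issue the AsyncAGI break or stop the current STREAM FILE; that part depends on the rest of the application and is left out here.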

We have barge-in enabled in our telephony voice bot (Asterisk AudioSocket + Python STT/TTS loop), but we see inconsistent behavior:

  1. Sometimes barge-in does not trigger at all.
    The user speaks while the bot is talking, but the bot continues speaking until the prompt finishes.

  2. Sometimes barge-in works, but later the bot’s own audio gets transcribed as user speech.
    Bot audio seems to leak into STT after or between turns.

Current approach:

  • While TTS is playing, incoming audio is analyzed for barge-in, but silence is fed to STT to avoid echo.

  • A frame is treated as possible user speech only if:

    • TTS is actively transmitting,
    • cooldown after TTS start has passed,
    • inbound RMS crosses an adaptive threshold,
    • echo correlation check says it is not bot audio.
  • Uses a sliding-window qualification instead of strict consecutive frames.

  • On barge-in:

    • cancel TTS,
    • flush playback queue,
    • clear TTS-active state,
    • send preroll speech frames to STT,
    • add a short silence tail to reduce residual echo.
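The sliding-window qualification mentioned in the list above could look something like the sketch below. The class name, window size, and required count are assumptions for illustration, not the values used in the actual system.

```python
from collections import deque


class SlidingWindowQualifier:
    """Trigger when enough of the last `window` frames looked like speech.

    Unlike a strict consecutive-frame rule, one or two dropped frames in
    the middle of an utterance do not reset the count.
    """

    def __init__(self, window: int = 10, required: int = 6):
        self._hits = deque(maxlen=window)  # oldest entries fall off automatically
        self._required = required

    def update(self, is_speech: bool) -> bool:
        """Record one frame decision and report whether barge-in qualifies."""
        self._hits.append(is_speech)
        return sum(self._hits) >= self._required

    def reset(self) -> None:
        """Clear state, for example after a barge-in or when TTS stops."""
        self._hits.clear()
```

Calling `reset()` whenever TTS starts or stops keeps qualified frames from one bot turn from counting toward a barge-in in the next.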

Detection logic is roughly:

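if tts_active and bot_is_transmitting:
    rms = inbound_rms(frame)

    # Energy gate: only frames louder than the adaptive threshold are considered.
    if rms > adaptive_threshold:
        echo_corr = correlate_with_recent_tx(frame)

        # Low correlation with recently transmitted audio means it is probably not bot echo.
        if echo_corr < echo_corr_threshold:
            qualify_frame()

    # Sliding-window check over recently qualified frames.
    if enough_qualified_frames():
        trigger_barge_in()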
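For discussion purposes, the two per-frame checks in the pseudocode above can be written out in runnable form. This is a sketch under assumptions: the function names, the lag search range, and the thresholds are hypothetical, samples are already decoded to numbers, and the correlation is a plain normalized cross-correlation searched over a range of delays, which may differ from what the real system does.

```python
import math


def rms(samples) -> float:
    """Root-mean-square level of one frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def echo_correlation(rx, tx_history, max_lag: int) -> float:
    """Peak normalized correlation between rx and recently sent audio.

    tx_history holds the most recently transmitted samples (newest last).
    Lags 0..max_lag are searched because the echo delay is not known.
    """
    n = len(rx)
    rx_energy = sum(s * s for s in rx)
    best = 0.0
    for lag in range(max_lag + 1):
        end = len(tx_history) - lag
        start = end - n
        if start < 0:
            break  # not enough transmitted audio for this lag
        segment = tx_history[start:end]
        seg_energy = sum(s * s for s in segment)
        if rx_energy == 0 or seg_energy == 0:
            continue
        dot = sum(a * b for a, b in zip(rx, segment))
        best = max(best, abs(dot) / math.sqrt(rx_energy * seg_energy))
    return best


def is_candidate_speech(rx, tx_history, rms_threshold, corr_threshold, max_lag=80):
    """Energy gate first, then reject frames that look like delayed bot audio."""
    if rms(rx) <= rms_threshold:
        return False
    return echo_correlation(rx, tx_history, max_lag) < corr_threshold
```

One design consequence is visible here: if the real echo delay exceeds `max_lag` samples (10 ms at 8 kHz for the default of 80), the echo looks uncorrelated and passes as speech, so the lag range has to cover the round-trip latency of the call path.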

The problem is that this works well in some calls, but behaves poorly in others depending on call acoustics, latency, echo, carrier quality, speakerphone usage, etc.

Looking for practical suggestions from people who have implemented reliable barge-in in real telephony systems:

  • better echo rejection strategies,
  • VAD tuning,
  • handling residual TTS leakage,
  • timing/cooldown strategies,
  • or architecture improvements.

Earlier I also asked about implementing barge-in using Asterisk’s built-in events: Enabling barge in with audiosocket connection between asterisk and python server

You need to run STT in parallel and implement interruption based on content, not just sound level. If the user says “ok”, you should not stop talking. If they say “connect to operator”, you have to stop.
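A minimal sketch of that content-based decision, applied to partial transcripts from the parallel STT stream, might look like this. The backchannel phrase list and the function names are assumptions for illustration; a real deployment would tune the list per language and use case.

```python
import re

# Short acknowledgements that should NOT interrupt the bot (hypothetical list).
BACKCHANNELS = {"ok", "okay", "yes", "yeah", "right", "uh huh", "mm hmm", "i see"}


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z' ]+", " ", text.lower()).split())


def should_interrupt(partial_transcript: str) -> bool:
    """Stop TTS only when the caller said something other than a backchannel."""
    text = normalize(partial_transcript)
    if not text:
        return False  # nothing recognized: likely noise or echo
    return text not in BACKCHANNELS
```

With this in place, `should_interrupt("OK.")` leaves playback running while `should_interrupt("connect to operator")` stops it. An empty transcript never interrupts, which also filters out frames where the level detector fired on noise.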

Looks like DTMF is still the simplest and most reliable way to do this.