Separating Audio Streams for Sending Only Caller Audio to OpenAI

Hi. We are reaching out to seek guidance on how to properly separate audio streams in Asterisk in order to achieve our specific use case. Below, we will outline our goals, current implementation, challenges, and the specific assistance we require.


General Goal

Our goal is to integrate Asterisk with OpenAI’s real-time API to create an AI-powered voicebot that interacts with users over the phone. The core functionality involves:

  1. Receiving audio from the caller and sending it to OpenAI for processing.
  2. Receiving audio responses from OpenAI and sending them to the caller.
  3. Recording audio streams in separate files:
  • One file containing only the audio sent by OpenAI (AI-to-caller audio).
  • Another file containing only the audio from the caller (caller-to-AI audio).

It is crucial that the caller’s audio and OpenAI’s audio remain completely isolated to prevent OpenAI from hearing its own audio (feedback loop) while still providing the caller with the generated responses in real time.


Specific Implementation Goals

  1. Separate the caller’s audio and OpenAI’s audio into two distinct streams.
  • Only send the caller’s audio to OpenAI.
  • Play back only OpenAI’s audio to the caller.
  2. Use Asterisk’s externalMedia functionality to handle OpenAI’s API integration and RTP streams effectively.
  3. Record the two audio streams in separate files:
  • Caller-to-AI stream (external_bridge_*).
  • AI-to-caller stream (user_bridge_*).

Current Implementation

We currently use two holding bridges in Asterisk to manage the separation of audio streams:

  1. User Bridge (user_bridge):
  • Contains the caller’s channel (PJSIP/680).
  • Plays the audio received from OpenAI to the caller.
  • Records OpenAI’s responses to the caller in a separate file (user_bridge_*.wav).
  2. External Bridge (external_bridge):
  • Contains the external media channel (UnicastRTP) connected to OpenAI.
  • Supposed to record only the audio from the caller to send to OpenAI.

The RTP stream from the caller is routed to the external_bridge, and the audio responses from OpenAI are routed back to the user_bridge. However, while the OpenAI-to-caller recording works as expected, the caller-to-AI recording is empty (0 seconds). This indicates that the external_bridge is not correctly receiving or handling the caller’s audio.


Challenges

  1. Empty Audio Recording in external_bridge:
    The file generated for the external_bridge is always empty (0 seconds), even though the caller’s channel (PJSIP/680) is added to the external_bridge along with the externalMedia channel. We suspect that the caller’s audio is not being routed correctly to the external_bridge.
  2. Isolating Streams:
    We need to ensure that:
  • The external_bridge receives only the caller’s audio.
  • The user_bridge plays only OpenAI’s audio to the caller.
  • Neither bridge mixes the two streams.
  3. Bidirectional RTP Synchronization:
  • externalMedia uses a dedicated RTP port to handle communication with OpenAI. While we see RTP packets flowing correctly between OpenAI and Asterisk, we are unsure if the audio stream from the caller is being routed correctly to this channel for processing.
  4. Avoiding Feedback Loops:
    Currently, OpenAI could potentially hear its own audio responses if the streams are not isolated correctly. This would disrupt the voicebot’s behavior and produce undesired results.

Specific Questions

  1. How can we ensure the caller’s audio is correctly routed to the external_bridge for both processing by OpenAI and recording?
  • Are there specific configurations required for the externalMedia channel to capture the caller’s audio properly?
  • Is it necessary to configure externalMedia differently when used with holding bridges?
  2. Is our approach of using two separate holding bridges (user_bridge and external_bridge) appropriate for separating the streams?
  • If not, what would be the recommended way to isolate audio streams for our use case?
  • Is there an alternative to using holding bridges for isolating audio streams while maintaining real-time playback and recording?
  3. How can we debug or confirm the audio routing between the caller channel (PJSIP/680), externalMedia, and the bridges?
  • Are there specific tools or logs we can enable to verify where the audio stream is being dropped?
  • Would enabling rtp set debug on or other debugging methods help in this scenario?
  4. What is the best way to configure the record() function for bridges to ensure that the correct audio streams are captured?
  • For example, we currently pass options: 'b(IN)b(OUT)'. Is this correct for recording only the audio received or transmitted by a bridge?
  5. Is there a better way to manage the externalMedia channel and its integration with bridges for real-time processing by OpenAI?
  • For example, should we consider an alternative architecture or bridge type (e.g., mixing) for more control over audio streams?

Logs and Observations

Here are some key observations from our logs:

  1. RTP packets from the caller are received successfully:

     Got RTP from 10.7.1.2:4084 (type 96, seq 005518, ts 166400, len 000640)

  2. RTP packets are sent to OpenAI via the externalMedia channel:

     Sent RTP to 127.0.0.1:12050 (type 118, seq 064810, ts 166400, len 000640)

  3. OpenAI’s audio responses are played correctly to the caller:

     Playing response on user bridge: sound:chunk_123456

  4. The file for the caller-to-AI stream (external_bridge_*) is always empty:

     /var/spool/asterisk/recording/external_bridge_*.wav (0 seconds)

Despite receiving RTP packets from the caller, the audio is not being recorded or processed correctly in the external_bridge.


Expected Outcome

We need to achieve the following:

  1. Correctly route the caller’s audio to the external_bridge for processing and recording.
  2. Ensure that the caller’s audio stream and OpenAI’s audio stream are isolated and do not mix.
  3. Maintain real-time playback for the caller, with OpenAI’s responses being audible as expected.
  4. Generate two separate audio files:
  • One containing the caller’s audio only.
  • One containing OpenAI’s responses only.

Thank you for taking the time to review this request. Your assistance in helping us achieve this setup would be greatly appreciated. If you need additional details, such as code snippets or configuration files, we will be happy to provide them.

async start() {
try {
    logger.info(`Starting voicebot handler for channel ${this.channel.id}`);
    this.isActive = true;

    // Ensure the directory for storing audio files exists
    await this.ensureAudioDir();

    // Skip initialization if this is an external media channel
    if (this.isExternalMediaChannel) {
        logger.info('External media channel detected, skipping initialization');
        return;
    }

    // Answer the incoming channel
    await this.channel.answer();
    logger.info(`Channel ${this.channel.id} answered`);

    // Create the first holding bridge for the caller
    this.userBridge = await this.ari.bridges.create({ 
        type: 'holding',  // Use a holding bridge
        name: `user_bridge_${this.channel.id}`,
        bridgeId: `user_bridge_${this.channel.id}`
    });
    logger.info(`[BRIDGE] Created user holding bridge ${this.userBridge.id}`);

    // Add the caller's channel to the user bridge
    await this.userBridge.addChannel({
        channel: this.channel.id
    });
    logger.info(`[BRIDGE] Added caller channel ${this.channel.id} to user holding bridge`);

    // Create an external media channel to communicate with OpenAI
    this.externalMedia = await this.ari.channels.externalMedia({
        app: 'voicebot',
        external_host: '0.0.0.0:12050',  // Dedicated RTP port for external media
        format: 'slin16',
        channelId: `external_${this.channel.id}`,
        variables: {
            JITTERBUFFER: 'adaptive',
            AUDIO_BUFFER_POLICY: 'strict',
            AUDIO_BUFFER_SIZE: '128'
        }
    });
    logger.info(`[MEDIA] Created external media channel ${this.externalMedia.id}`);

    // Create the second holding bridge for the external media channel
    this.externalBridge = await this.ari.bridges.create({
        type: 'holding',  // Separate holding bridge
        name: `external_bridge_${this.channel.id}`,
        bridgeId: `external_bridge_${this.channel.id}`
    });
    logger.info(`[BRIDGE] Created external holding bridge ${this.externalBridge.id}`);

    // Add the external media channel to the external bridge
    await this.externalBridge.addChannel({
        channel: this.externalMedia.id
    });
    logger.info(`[BRIDGE] Added external media channel ${this.externalMedia.id} to external holding bridge`);

    // Configure recording for the user bridge (records OpenAI responses)
    const userRecording = await this.userBridge.record({
        name: `user_bridge_${this.channel.id}`,
        format: 'wav',
        beep: false,
        maxDurationSeconds: 3600,
        ifExists: 'overwrite',
        options: 'b(IN)b(OUT)' // Record both incoming and outgoing streams
    });

    // Handle events for the user bridge recording
    userRecording.once('RecordingStarted', () => {
        logger.info(`[MONITOR] Recording started on user bridge ${this.userBridge.id}`);
    });

    userRecording.once('RecordingFailed', (event) => {
        logger.error(`[MONITOR] Recording failed on user bridge: ${event.error}`);
    });

    userRecording.once('RecordingFinished', () => {
        logger.info(`[MONITOR] Recording completed on user bridge ${this.userBridge.id}`);
    });

    logger.info(`[MONITOR] Configured recording on user bridge ${this.userBridge.id}`);

    // Configure recording for the external bridge (records caller's audio)
    const externalRecording = await this.externalBridge.record({
        name: `external_bridge_${this.channel.id}`,
        format: 'wav',
        beep: false,
        maxDurationSeconds: 3600,
        ifExists: 'overwrite',
        options: 'b(IN)b(OUT)' // Record both incoming and outgoing streams
    });

    // Handle events for the external bridge recording
    externalRecording.once('RecordingStarted', () => {
        logger.info(`[MONITOR] Recording started on external bridge ${this.externalBridge.id}`);
    });

    externalRecording.once('RecordingFailed', (event) => {
        logger.error(`[MONITOR] Recording failed on external bridge: ${event.error}`);
    });

    externalRecording.once('RecordingFinished', () => {
        logger.info(`[MONITOR] Recording completed on external bridge ${this.externalBridge.id}`);
    });

    logger.info(`[MONITOR] Configured recording on external bridge ${this.externalBridge.id}`);

    // Connect to OpenAI's real-time API for processing audio
    logger.info("[OPENAI] Connecting to OpenAI Realtime API...");
    await this.realtimeHandler.connect();

    // Set up event handlers for the voicebot functionality
    await this._setupEventHandlers();

    // Mark the handler as initialized
    this.isInitialized = true;
    logger.info('VoicebotHandler initialization completed');

    // Add event handlers for cleanup when the channel is destroyed or leaves Stasis
    this.channel.once('ChannelDestroyed', async () => {
        logger.info(`[EVENT] Channel ${this.channel.id} destroyed, starting cleanup`);
        await this.cleanup();
    });

    this.channel.once('StasisEnd', async () => {
        logger.info(`[EVENT] Channel ${this.channel.id} left Stasis, starting cleanup`);
        await this.cleanup();
    });

    // Handle unexpected hangup events
    this.channel.once('ChannelHangupRequest', async () => {
        logger.info(`[EVENT] Hangup requested for channel ${this.channel.id}`);
        await this.cleanup();
    });

} catch (error) {
    logger.error('[START] Error in VoicebotHandler:', error);
    await this.cleanup();
    throw error;
}

}

Why are you using a holding bridge? It’s not intended for recording, so that could very well be the cause. Additionally there is no “options” to record. Calling record on a bridge will record the entirety of the bridge, both directions.

If you want to separate the streams in an asynchronous fashion and record you would create a snoop channel on each channel, specifying what direction you want, and then call record on that snoop channel.

This would simplify this down to a single bridge with a normal connection between the caller and external media - while still recording each direction individually.
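
A minimal sketch of that layout, assuming the same ari-client context (this.ari, this.channel) and 'voicebot' Stasis app as the code above, with illustrative names and port, might look like:

    // One mixing bridge: caller <-> external media (OpenAI RTP endpoint).
    const bridge = await this.ari.bridges.create({ type: 'mixing' });

    const external = await this.ari.channels.externalMedia({
        app: 'voicebot',
        external_host: '127.0.0.1:12050',   // illustrative: where your RTP handler listens
        format: 'slin16'
    });

    await bridge.addChannel({ channel: this.channel.id });
    await bridge.addChannel({ channel: external.id });

    // Snoop each channel in the 'in' direction and record the snoops separately:
    // audio coming from the caller (caller-to-AI)...
    const callerSnoop = await this.ari.channels.snoopChannel({
        channelId: this.channel.id,
        app: 'voicebot',
        spy: 'in'
    });
    await callerSnoop.record({ name: `caller_to_ai_${this.channel.id}`, format: 'wav' });

    // ...and audio coming from the external media channel (AI-to-caller).
    const aiSnoop = await this.ari.channels.snoopChannel({
        channelId: external.id,
        app: 'voicebot',
        spy: 'in'
    });
    await aiSnoop.record({ name: `ai_to_caller_${this.channel.id}`, format: 'wav' });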

Did you use ChatGPT or some other thing to write this implementation?

Thank you for your previous guidance. We’ve made progress but still have some issues that we’d like to solve.

Current Situation:

  1. We’ve successfully set up a snoop channel that records caller audio to a WAV file
  2. The format and codecs are correctly configured (slin16)
  3. The WebSocket connection to OpenAI works (we receive audio from OpenAI)
  4. However, we’re not receiving any AudioFrameReceived events from the snoop channel

Here’s our current implementation:

// Snoop channel setup
async _setupSnoopChannels() {
    try {
        const originalChannelId = this.channel.id;
        
        // Initialize snoop channel for capturing caller audio
        this.callerSnoop = await this.ari.channels.snoopChannel({
            channelId: originalChannelId,
            app: 'voicebot',
            spy: 'in',         // Only capture incoming audio
            whisper: 'none',   // Don't inject audio
            snoopId: `snoop_caller_${originalChannelId}`,
            appArgs: 'audio=yes'  // Explicitly request audio events
        });
        
        // Debug recording works fine
        const callerRecording = await this.callerSnoop.record({
            name: `debug_caller_${originalChannelId}`,
            format: 'wav',
            beep: false,
            maxDurationSeconds: 3600,
            ifExists: 'overwrite',
            terminateOn: 'none'
        });

        // Audio frame listener that never gets called
        this.callerSnoop.on('AudioFrameReceived', async (event) => {
            logger.debug(`[SNOOP] >>> AudioFrameReceived event triggered`);
            // ... audio processing code ...
        });
        
        // Format verification shows correct configuration
        const snoopFormat = await this.callerSnoop.getChannelVar({ 
            variable: 'CHANNEL(audioreadformat)' 
        });
        logger.info(`[SNOOP] Snoop channel audio format: ${snoopFormat?.value}`); // Shows "slin16"
        
    } catch (error) {
        logger.error('[SNOOP] Error in snoop channel setup:', error);
        throw error;
    }
}

Our logs show:

  1. Snoop channel is created successfully
  2. Debug recording works (we get a valid WAV file with caller audio)
  3. Audio format is correctly set to slin16
  4. But no AudioFrameReceived events are triggered

Questions:

  1. Is there something specific we need to do to enable audio frame events on the snoop channel?
  2. Since we can record the audio successfully, we know the audio is flowing through the snoop channel. What could prevent the AudioFrameReceived events from being triggered?
  3. Is there a way to debug or trace whether the events are being emitted by Asterisk?

We’d greatly appreciate any guidance on how to get the audio frame events working, as we need them to forward the audio to OpenAI in real-time.

I don’t know what or where AudioFrameReceived comes from. It doesn’t exist in Asterisk or ARI. If you are using ChatGPT/Copilot/AI to help write this, it is making things up.

If you want to receive audio in your ARI application you must still use an external media channel, bridged to something (such as the caller).

I would also suggest learning what each of these things actually does and how it works, by looking at the ARI documentation and experimenting with the specific components before putting things together.
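
A rough sketch of that arrangement, assuming ari-client, Node's dgram module, an illustrative local port (12050), and a plain 12-byte RTP header with no CSRC/extensions, might look like:

    // Receive caller audio in the ARI application via an external media channel.
    const dgram = require('dgram');

    async function attachExternalMedia(ari, callerChannel) {
        // Listen for the RTP that Asterisk sends to external_host.
        const rtpSocket = dgram.createSocket('udp4');
        rtpSocket.on('message', (packet) => {
            // Assumes a basic 12-byte RTP header (no CSRC/extensions).
            const audio = packet.subarray(12);
            // ...forward `audio` (slin16 here) to OpenAI in real time...
        });
        rtpSocket.bind(12050, '127.0.0.1');

        // Asterisk sends the bridge audio to 127.0.0.1:12050 and plays back
        // whatever RTP is sent to it in return.
        const external = await ari.channels.externalMedia({
            app: 'voicebot',
            external_host: '127.0.0.1:12050',
            format: 'slin16'
        });

        // Bridge caller and external media so the caller's audio reaches the socket.
        const bridge = await ari.bridges.create({ type: 'mixing' });
        await bridge.addChannel({ channel: callerChannel.id });
        await bridge.addChannel({ channel: external.id });
    }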

Thank you for your previous guidance. I’d like to detail my current situation and what I’ve tried based on your suggestions.

INITIAL SETUP (where OpenAI could hear itself):

// Initial working setup (but with audio loop)
this.bridge = await this.ari.bridges.create({
    type: 'mixing',
    name: `voicebot_${this.channel.id}`
});

// Create channels and add to bridge
await this.bridge.addChannel({ 
    channel: this.callerMedia.id,
    role: 'announcer'  // Caller can only talk
});

await this.bridge.addChannel({ 
    channel: this.openaiMedia.id,
    role: 'participant'  // OpenAI can talk and hear
});

await this.bridge.addChannel({ 
    channel: this.channel.id,
    role: 'listener'  // Main channel can only listen
});

In this setup:

  • I could hear OpenAI’s responses correctly
  • OpenAI could hear me
  • But OpenAI was also hearing itself (causing a loop)

Based on your suggestion about snoop channels, I tried modifying the setup:

// Modified setup attempting to use snoop channel
this.bridge = await this.ari.bridges.create({
    type: 'mixing',
    name: `voicebot_${this.channel.id}`
});

// Create external media for OpenAI
this.openaiMedia = await this.ari.channels.externalMedia({
    app: 'voicebot',
    external_host: '127.0.0.1:12050',
    format: 'slin16',
    direction: 'out',
    channelId: `openai_${this.channel.id}`,
    variables: {
        DIRECTION: 'out',
        MIXMONITOR_DIRECTION: 'WRITE'
    }
});

// Create snoop channel for caller audio
this.callerSnoop = await this.ari.channels.snoopChannel({
    channelId: this.channel.id,
    app: 'voicebot',
    spy: 'in',         
    whisper: 'none',   
    snoopId: `snoop_caller_${this.channel.id}`
});

// Modified bridge setup
await this.bridge.addChannel({ 
    channel: this.channel.id,
    role: 'listener'  
});

await this.bridge.addChannel({ 
    channel: this.openaiMedia.id,
    role: 'announcer'  
});

Current result:

  • Neither the caller nor OpenAI can hear each other
  • The audio flow seems completely broken

Questions:

  1. In my initial setup, how was the caller hearing OpenAI even though callerMedia was set as ‘announcer’?

  2. For using snoop channel:

    • Do I need two bridges (one for snoop → OpenAI, one for OpenAI → caller)?
    • Or should I add the snoop channel to the existing bridge with specific roles?
    • How should I route the audio from snoop channel to OpenAI?
  3. For the External Media channels:

    • I see that setting ‘direction’ doesn’t control actual audio flow
    • Should I be using different configuration for the External Media channels?

My goal is to:

  1. Maintain the working part (caller hearing OpenAI)
  2. Use snoop channel to feed caller’s audio to OpenAI
  3. Prevent OpenAI from hearing itself

Can you help me understand how to properly implement this setup?

Note: Here’s my full current code, in case it helps in understanding the complete context. With it:

  • I could hear OpenAI’s responses correctly
  • OpenAI could hear me
  • But OpenAI was also hearing itself (causing a loop)
// Initial working configuration where OpenAI could hear both caller and itself
async start() {
   try {
       // Create mixing bridge
       this.bridge = await this.ari.bridges.create({
           type: 'mixing',
           name: `voicebot_${this.channel.id}`,
           bridgeId: `voicebot_${this.channel.id}`,
           // Advanced mixing configuration
           options: {
               MIXMONITOR_FORMAT: 'slin16',
               BRIDGE_MIXING_INTERVAL: '20',
               BRIDGE_VIDEO_MODE: 'none'
           }
       });

       // Create external media channel for caller (input only)
       const callerMediaOptions = {
           app: 'voicebot',
           external_host: '127.0.0.1:12051',
           format: 'slin16',
           channelId: `caller_media_${this.channel.id}`,
           encapsulation: 'rtp',
           transport: 'udp',
           connection_type: 'client',
           direction: 'in',  // Set to receive audio only
           variables: {
               CHANNEL_DIRECTION: 'caller',
               ORIGINAL_CHANNEL_ID: this.channel.id,
               UNICASTRTP_LOCAL_ADDRESS: '127.0.0.1',
               UNICASTRTP_LOCAL_PORT: '12051',
               DIRECTION: 'in',
               MIXMONITOR_DIRECTION: 'READ'
           }
       };
       this.callerMedia = await this.ari.channels.externalMedia(callerMediaOptions);

       // Create external media channel for OpenAI (output only)
       const openaiMediaOptions = {
           app: 'voicebot',
           external_host: '127.0.0.1:12050',
           format: 'slin16', 
           channelId: `openai_${this.channel.id}`,
           encapsulation: 'rtp',
           transport: 'udp',
           connection_type: 'client',
           direction: 'out',  // Set to send audio only
           variables: {
               CHANNEL_DIRECTION: 'openai',
               ORIGINAL_CHANNEL_ID: this.channel.id,
               UNICASTRTP_LOCAL_ADDRESS: '127.0.0.1',
               UNICASTRTP_LOCAL_PORT: '12050',
               DIRECTION: 'out',
               MIXMONITOR_DIRECTION: 'WRITE'
           }
       };
       this.openaiMedia = await this.ari.channels.externalMedia(openaiMediaOptions);

       // Add channels to bridge with roles
       // This configuration created the audio loop:
       
       // 1. Caller channel set as announcer
       await this.bridge.addChannel({ 
           channel: this.callerMedia.id,
           inhibitConnectedLineUpdates: true,
           mute: false,
           role: 'announcer'  // Caller can only talk (but somehow could still hear OpenAI)
       });

       // 2. OpenAI channel set as participant
       await this.bridge.addChannel({ 
           channel: this.openaiMedia.id,
           inhibitConnectedLineUpdates: true,
           mute: false,
           role: 'participant'  // OpenAI can both talk and listen (causing it to hear itself)
       });

       // 3. Main channel set as listener
       await this.bridge.addChannel({ 
           channel: this.channel.id,
           inhibitConnectedLineUpdates: true,
           mute: false,
           role: 'listener'  // Main channel can only listen
       });

       // Result of this configuration:
       // - Caller could hear OpenAI (despite being 'announcer')
       // - OpenAI could hear caller (as 'participant')
       // - OpenAI could hear itself (as 'participant'), causing audio loop
       // - Audio direction settings ('in'/'out') didn't prevent the loop

   } catch (error) {
       logger.error('[START] Error in VoicebotHandler:', error);
       await this.cleanup();
       throw error;
   }
}

I would suggest instead of undertaking such a large project, you first do smaller projects to understand the specific components and how they work, and then put them together.

Someone else may choose to respond and help you write it in more detail, but that is beyond what I can do.

Thank you for your feedback. You’re absolutely right about breaking down the problem into smaller components. Let me focus on a specific issue I’m trying to solve:

I’m working on preventing a channel from hearing itself in a mixing bridge, while maintaining bidirectional communication with another channel.

Here’s my minimal test case:

// Single mixing bridge
this.bridge = await this.ari.bridges.create({ type: 'mixing' });

// Two channels in bridge:
// Channel A (needs to hear B but not itself)
await this.bridge.addChannel({ 
    channel: channelA.id,
    role: 'participant'  // Can hear itself (problem)
});

// Channel B
await this.bridge.addChannel({ 
    channel: channelB.id,
    role: 'participant'
});

I’ve seen several possible approaches in the documentation:

Bridge Roles:

  • Using different combinations of announcer/participant/listener roles
  • Using the whisper role to control audio routing

Snoop Channel:

  • Using snoopChannel to capture audio without mixing
  • Could potentially isolate the audio streams

Multiple Bridges:

  • Separating input/output audio flows
  • But unsure about the correct configuration

Which approach would you recommend for this specific use case? I want to understand the basic concepts before expanding the implementation.
Would you mind explaining the pros/cons of these approaches, or suggesting the most appropriate solution for this particular problem?
A workable solution could also be to use talk detection to send audio to OpenAI only while the user is speaking. Thanks for the help.
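
A rough sketch of that talk-detect idea, assuming the same ari-client handler context as before (TALK_DETECT is enabled via setChannelVar, and ARI then emits ChannelTalkingStarted/ChannelTalkingFinished events):

    // Enable talk detection on the caller channel with default thresholds.
    await this.channel.setChannelVar({
        variable: 'TALK_DETECT(set)',
        value: ''
    });

    let callerTalking = false;

    this.channel.on('ChannelTalkingStarted', () => {
        callerTalking = true;    // start forwarding RTP payloads to OpenAI
    });

    this.channel.on('ChannelTalkingFinished', () => {
        callerTalking = false;   // stop forwarding (and optionally commit the turn)
    });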

If this alone causes someone to hear themselves, then the issue is outside of ARI/Asterisk. A bridge itself will never have the audio from someone go back to themselves.

In fact, with only two channels there is no conference bridge or mixing. Asterisk is just passing audio back/forth.

If there is a real telephone, there will always be some echo back, unless the device uses very aggressive VOX operation. Even without a real telephone, a complex system will attempt echo suppression, and, if that is done by cancellation, the canceller may learn the wrong parameters and generate echo where there was none.

I assume that the AI knows what it sent back, as the subsequent replies won’t make complete sense without that information, so it seems to me that AI services should be recognizing their own echoes themselves.

I can see some benefit in separating the directions and feeding them simultaneously into a recognizer, but not in trying to recognize one direction in isolation.