RTP suppression issue in bidirectional flow

Our bot application handles incoming calls via a SIP trunk with Asterisk. We bridge the incoming call and the external bot application using an ARI application; the bridge type is ‘mixing’ (simple_bridges). Bidirectional RTP flows remain smooth until the user speaks while the bot is playing audio. At that point, the bot’s RTP stream appears to be suppressed, and the user hears degraded audio. The SIP provider delivers audio encoded in μ-law (ulaw), which is transcoded to slin on our side, and we see no CPU utilization spike. Is this related to the Asterisk configuration, to how our apps handle RTP, or is it something that the SIP trunk provider needs to look into?

What do the streams actually show on the wire?

Please find attached the pcap traces:

  1. voip_inf.pcap: tcpdump on the interface connected to the SBC
  2. va_inf.pcap: tcpdump on the interface connected to the voice bot

Have you looked at it, and listened?

I don’t hear the voice bot audio being suppressed in voip_inf.pcap.

Though it does have to insert silence, which you may not be doing in your voice bot generated audio.

Thank you for the feedback. Yes, I have noticed this behavior.

We are currently receiving 20ms (160 samples @8000Hz) RTP packets from Asterisk. While sending audio back, we are maintaining the same 20ms packet duration but introducing a 20ms delay between consecutive packets.

Would this delay require us to explicitly insert silence for the missing 20ms interval?

Or could we add a local dummy channel that continuously transmits silence and bridge it with both the incoming and outgoing streams to ensure smooth RTP flow without gaps?

Looking forward to your insights on the best approach.

You should provide a constant stream of audio to Asterisk in 20ms chunks including silence. I don’t know what your source is, so can’t comment on that.
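As a rough illustration of a constant 20ms stream, here is a minimal Python sketch. It assumes the bot sends 16-bit signed linear audio (slin) at 8000Hz, where silence is all-zero samples; for ulaw the silence byte would be 0xFF instead. The names `next_frame`, `run_paced` and the `send` callback are hypothetical and stand in for whatever actually writes RTP payloads to Asterisk.

```python
import time
from collections import deque

FRAME_MS = 20
SAMPLE_RATE = 8000
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000   # 160 samples
BYTES_PER_SAMPLE = 2                                 # assumption: slin, 16-bit
FRAME_BYTES = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE   # 320 bytes
SILENCE_FRAME = b"\x00" * FRAME_BYTES                # slin silence is all zeros


def next_frame(pending: deque) -> bytes:
    """Return the next bot frame, or a silence frame if none is queued."""
    if pending:
        # Pad a short final frame with silence so every frame is 20 ms long.
        return pending.popleft().ljust(FRAME_BYTES, b"\x00")
    return SILENCE_FRAME


def run_paced(pending: deque, send, n_frames: int) -> None:
    """Call send() once per 20 ms tick, with bot audio or silence."""
    start = time.monotonic()
    for i in range(n_frames):
        send(next_frame(pending))
        # Sleep until the next tick, measured from start so errors don't accumulate.
        delay = start + (i + 1) * FRAME_MS / 1000 - time.monotonic()
        if delay > 0:
            time.sleep(delay)
```

The point of the sketch is that the sender never skips a tick: when the bot has nothing to say, a silence frame goes out in its place.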

If you do send with gaps, you must make sure that the timestamp is incremented for both the frames actually sent and the missing ones, while the sequence number steps by only one for each frame actually sent.

E.g.

Time (ms)   Timestamp    Sequence
1000        1234000      33
Gap
1040        1234320      34
Gap
1080        1234640      35
1100        1234800      36
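
The arithmetic behind the table can be sketched in Python as follows. This is only an illustration, assuming an 8000Hz RTP clock (so 20ms is 160 timestamp units); `rtp_fields` is a hypothetical helper, not part of Asterisk or any RTP library. The timestamp follows elapsed media time, while the sequence number counts packets actually sent.

```python
SAMPLE_RATE = 8000  # RTP clock rate in Hz for 8 kHz audio


def rtp_fields(send_times_ms, first_timestamp, first_sequence):
    """Return (time_ms, timestamp, sequence) for each packet actually sent."""
    t0 = send_times_ms[0]
    rows = []
    for n, t in enumerate(send_times_ms):
        # Timestamp advances with elapsed time, including any gaps (32-bit wrap).
        timestamp = (first_timestamp + (t - t0) * SAMPLE_RATE // 1000) % 2**32
        # Sequence advances by one per packet sent (16-bit wrap).
        sequence = (first_sequence + n) % 2**16
        rows.append((t, timestamp, sequence))
    return rows


for row in rtp_fields([1000, 1040, 1080, 1100], 1234000, 33):
    print(*row)
```

With the send times from the table, this prints the same timestamp and sequence values: a 40ms gap advances the timestamp by 320, a 20ms step by 160, and the sequence number goes up by one each time.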