RTP Event packets not being forwarded by Asterisk

We are using Asterisk 18.0.0-rc2 to route outbound calls from our in-house IVR system to our upline SIP provider. A call is eventually answered by one of our customers who interacts with the call using DTMF. The problem we are having is that for about 1 in 10 of these calls, the RTP Events (with the DTMF information) don’t make it back our IVR. I’m capturing all the relevant network traffic with Wireshark, and I can see that the RTP Event packets are being correctly sent from our SIP provider to Asterisk, but sometimes Asterisk filters out the RTP Events when it forwards the RTP stream along to our IVR.

I can’t see any rhyme or reason as to why this works with most calls and only fails on some. We are using PJSIP. Both trunks (to our SIP Provider and to the IVR) are currently configured with “dtmf_mode=auto” (although I have experimented with various combinations of the DTMF settings).

The SDP seems to correctly specify the telephone-events in all legs of the communication
m=audio 15602 RTP/AVP 0 101
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16

Here are two images of the Wireshark captures to illustrate what is happening. Notice that the RTP Events are there in the first image (from our SIP provider to Asterisk). But when Asterisk passes the RTP stream onto our IVR, the RTP Events are mysteriously missing, leaving gaps in the RTP sequence numbers.

I’m very curious about what kinds of things might cause this behavior, and I would appreciate any head start in the right direction. I am able to run Asterisk in debug, but I would welcome some advice about where to start looking…

Thanks,
Jeff Wilhite

Please provide the DTMF log and please provide it as plain text, not an image.

Well, I’ve tried that, but my dtmf log isn’t recording anything. I’ve got “dtmf => dtmf” in my logger.conf file, and after restarting asterisk several times and pumping about 30 calls through the system (some where dtmf worked and some where it failed), my /var/log/asterisk/dtmf log file just says:

[Jan 20 15:38:05] Asterisk 18.0.0-rc2 built by root @ MLB2-test on a x86_64 running Linux on 2020-11-09 18:18:55 UTC
[Jan 20 15:48:11] Asterisk 18.0.0-rc2 built by root @ MLB2-test on a x86_64 running Linux on 2020-11-09 18:18:55 UTC
[Jan 21 08:30:18] Asterisk 18.0.0-rc2 built by root @ MLB2-test on a x86_64 running Linux on 2020-11-09 18:18:55 UTC

I haven’t configured Asterisk to take any special action on DTMF (like parking a call), so maybe DTMF that is just passed through Asterisk isn’t logged?

Also, I’m setting “direct_media=no” for both endpoints in PJSIP.

Your packet capture shows µ-law and telephony events being sent at the same time. Normally telephony events are sent in place of normal media. That may will confuse things.

So, I think that may be the problem. Our upline SIP provider is stripping out the inband DTMF but is missing a few milliseconds of the tone at the beginning. I don’t know why this causes a problem for Asterisk, though. If my endpoints are configured with “dtmf_mode=rfc4733” I don’t know why Asterisk would care about the tones. I am experimenting with different combinations of DTMF modes in our test environment, but our production system shows dtmf_mode=rfc4733 for all of our endpoints.

Can you point me to the place in the source code where this DTMF logic lives?

What’s going on here is that Asterisk is connecting the two endpoints with a local native bridge. That’s why nothing is showing up in the DTMF log; Asterisk is just forwarding the RTP packets between the channels without doing any kind of processing on them.

It seems the problem is that our upline SIP provider is sending the DTMF tones at the same time as other non-DTMF audio. We experience the problem only when the RTP Event arrives at almost exactly the same time (within 0.3 ms) as another non-DTMF RTP audio packet. Looking at the related RFCs, I’m not seeing anything that says this should be illegal, though. And our upline SIP provider says they are just passing along exactly what they are getting without doing any digital signal processing that might be breaking things.

I have updated to Asterisk 18.9.0 and it didn’t fix anything. I was hopeful that this commit https://gerrit.asterisk.org/c/asterisk/+/13343 was going to fix it because it says it “corrects situations where successive dtmf events are received quicker than the signalled event duration (plus minimum gap/pause) allows)” But, alas, I still have the same problem.

Anyway, I don’t know why no one else using asterisk seems to be experiencing this, but it looks like my only option is to try to fix it in the asterisk source somehow. Any advice about where to start? bridges/bridge_native_rtp.c seems promising, or is this more likely to be an issue in pjmedia?

pjmedia is not used at all in Asterisk, the RTP implementation is located at res/res_rtp_asterisk.c

Thanks, that was helpful. I was able to add a bit of additional debug code in res_rtp_asterisk.c to investigate the problem.

However, this is looking like an intractable problem to me. UDP is designed to be lossy, and the receiving end is not supposed to have to depend upon all the packets arriving successfully. However, we are experiencing a situation where, in certain conditions, we are missing 100% of the RTP events due to the inopportune pattern of their arrival (being mixed with the ulaw RTP packets).

First, ast_recvfrom() receives a 172 byte ulaw packet, then it eventually calls bridge_p2p_rtp_write() and passes the packet immediately on to the bridged channel. However, the next time ast_recvfrom() gets called, we get another 172 byte ulaw packet, completely missing the 16 byte RTP event that wireshark says we should have seen next. So, the RTP stream received by the bridged channel is missing every other sequence number where the RTP events are supposed to be.

I tried to throw in a few spurious calls to recvfrom with the MSG_PEEK and MSG_DONTWAIT flags set just to see if I could catch and log a few of those RTP events at exactly the right time, and I was actually able to have some limited success with that, but even that diagnostic code ended up causing additional packet losses elsewhere.

Anyway, my main question is, “Why do I seem to be the only one having this problem?” We aren’t really doing anything complicated here, just trying to pass DTMF over a local native bridge.

I have done some testing at home using Twilio as my SIP trunk (instead of this provider we use at work), and it works just fine with them because Twilio doesn’t send the ulaw stream and the RTP events at the same time. However, I don’t know enough about what’s going on behind the scenes in the telecom industry to understand why DTMF from my cell phone routed through Twilio ends up looking very different than DTMF from the same cell phone routed through a different SIP trunk.

If our SIP provider is violating something in one of the RFCs by doing this, that would be very helpful to know, but from what I can tell, this should be legal. Any more sage advice? I’m not really sure where to go next.

I don’t believe the code has ever allowed/expected audio to be intermingled, so that’s likely an off-nominal. It may or may not fundamentally work, may work code changes.

There was also a fix that went in recently for DTMF and sequence numbers for audio[1].

[1] https://gerrit.asterisk.org/c/asterisk/+/17912

I figured out what is going on here. The Linux kernel is actually dropping the UDP packets in these situations where the RTP Events arrive < 1ms after another audio RTP packet. When I run netstat -su I can actually watch the Udp “packet receive errors” incrementing exactly when I have the problem. However, the “receive buffer errors” doesn’t increment, so that means it’s not a buffer overflow. I even greatly increased net.core.rmem_default and net.core.rmem_max just to be sure. From the kernel source, it looks like this can only happen if there is a checksum failure in the UDP packets (which I verified is not the case) or if a packet is received in the brief window of time while the application has the socket locked for reading (which seems is what’s happening here).

I know that UDP packets are not guaranteed to be delivered, but due to the timing of the RTP events, we are experiencing a situation where (in certain scenarios) 100% of the RTP events are dropped.

So, it seems it doesn’t actually have anything to do with the asterisk source at all. I’m curious, though, why I seem to be the only one having this problem. I’m using RHEL 7.9. Do more recent Linux kernels process simultaneous UDP packets better? Also, what could my telecom be doing wrong here? Is there some way to misconfigure certain types of telephony hardware to cause this issue where the audio and DTMF events are sent simultaneously?

I know this doesn’t really have anything to do with Asterisk, but I am hoping the telephony experts here may have experienced something like this before. What would you do in such a situation? Get a new telecom? Hack the Linux kernel?

Could it be a buffer overflow on the ethernet interface itself, rather than in software?

Well, WireShark is showing the RTP packets arriving, and the Linux kernel seems to know that it’s dropping the packets because it’s incrementing the “packet receive errors” counter. So, the kernel is receiving the packets and then dropping them before they are made available to the application (Asterisk).

Also, I experimented with using direct_media=yes, allowing the RTP traffic to completely bypass Asterisk. Our IVR system that this traffic is ultimately going to uses Dialogic HMP running on Windows Server 2016, and it’s having the same problem. Dialogic is missing those DTMF events, too. We’ve opened a ticket with Dialogic, and hopefully, they can provide some insight into what’s going on here.

So, even if I am somehow able to get Linux to see the packets, process them in Asterisk, and forward them on to our IVR system, if the timing of the RTP packets is still too close, our IVR system won’t see them anyway. Our telecom is trying to help us solve the problem by using only inband DTMF, but it seems that some of their upline providers aren’t properly sending inband DTMF to them, so it doesn’t really help any.

Something, somewhere in the telecom system upline from us is broken, and I don’t know how (or who to tell) to fix it.

Assuming you have verified that you really are achieving direct media, having another system reject them does suggest, to me, that there is something wrong with what is on the wire.

Yes, I have verified in the wireshark capture that I am actually achieving direct media. And, yes, I agree that there is actually something wrong with what is on the wire.

And although it seems highly unlikely that I’m the only person to have ever encountered this problem, I have yet to find any other person who has experienced it or who knows what is causing it (or how to fix it).

I wanted to report back and let everyone know how this gripping narrative played out. It seems there was (mostly) no problem with Asterisk or our SIP provider. Our network adapter on aging hardware seems to have been unable to keep up with the demands of modern computing. We moved the system to a different network and upgraded from RHEL 7.9 to 8.4 and the problems went away.

Also, it seems that Asterisk can correctly handle concurrent audio and DTMF events after all (as long as the OS can get the UDP packets to it). I spent a lot of time tracking this down, but it wasn’t entirely wasted. I certainly learned a lot about the asterisk source and the Linux kernel code in the process.

Hi jwilhite,

I am currently battling with the similar problem for over a month now. I run this on Ubuntu 18.4 and asterisk 16.9

On scenarios:

  1. My Asterisk receive call from the Huawei SBC (Telco), my ivr got the DTMF and worked on my asterisk server.
  2. My Asterisk receive call from the Telco and bridge the call to another asterisk not behind NAT. The DTMF from the Telco get to my asterisk (tcpdump showed the DTMF) but not forwarded to the remote asterisk. Audio worked fine.
  3. My Asterisk received call from the Telco and forward it to remote asterisk behind NAT. DTMF received on my asterisk but not forwarded to the remote asterisk behind NAT.

I will be glad for any assistance as well. I have been on this for over a month now accessing several asterisk past list responses without luck…

The OP lost the packet before it would have been seen by tcpdump, so their problem is different.

Goksie,

One thing you want to check is to see if the OS is dropping the UDP packets. Run:
watch -n 1 netstat -su

The output of this will update every second, so you can watch the network statistics change while your calls are in progress. Keep an eye on the lines “packet receive errors” and “receive buffer errors”. Usually, those numbers should be the same. However, if the kernel is dropping the packets before making them available to Asterisk, the “packet receive errors” will increment, but the “receive buffer errors” wont. If this is the case, then Asterisk isn’t the problem. It’s your network, possibly your hardware, that’s the problem.

I’m curious, are you receiving audio and DTMF events simultaneously? I mean, do the audio RTP packets stop while the RTP event packets are arriving, or are they coming in together?
Look at the frame deltas on the UDP packets that aren’t successfully forwarded. If the UDP packet for the RTP Event arrives less than 1 millisecond after the previous audio RTP packet, that may be what is causing the problem.

Also, has this been working for you in the past and then suddenly broke? That’s what happened to us. I think some telecom equipment somewhere out there must have recently changed to start causing this behavior, maybe due to the shaken/stir rollout?

Jeff

Thank you Jeff.

I run the netstat and the value for the receiver error or buffer error count remain the same throughout the calls.
my asterisk server got the DTMF (tcpdump). however, forwarding DTMF to a remote asterisk peer has never worked for this particular TELCO. The TELCO use Huawei (SBC model - CloudSE2980 [ Huawei]) and dtmf_mode is inband.

I hope to receive your help.

Regards

Goksie,

Let me make sure I understand. You are receiving RFC 2833 RTP Events from one telco, and then Asterisk is translating the 2833 DTMF to inband DTMF and sending it over a different telco to a remote Asterisk, but the remote Asterisk doesn’t hear the inband DTMF? Is that correct?

Can you verify that the outbound RTP stream leaving the first Asterisk machine has good inband DTMF tones? I use Wireshark for my captures. It has some excellent telephony features that allow you to listen to individual RTP streams so that you can verify that you are sending good inband DTMF tones. Also, are you capturing the incoming traffic on your second Asterisk machine? Can you verify that the DTMF tones are correct on the first Asterisk machine but missing on the second Asterisk machine?

Also, inspect the SIP INVITE packet for the leg of the call between the two Asterisk machines. What do you see in the Media Description of the Session Description Protocol of your INVITE packet? Also, what does the telco send back in the Media Description of their “200 OK (INVITE)” packet?

I’ve encountered one carrier (Peerless Technologies) that seems to have a problem where they clip the inband DTMF tones as if they are converting to 2833 and then they don’t even send the RTP Events. We ended up resolving our problem by asking our telco to block calls from going through that one problematic carrier.