SIP retransmissions when qualify enabled in sip.conf

Hey there!

We’re experiencing some really odd behavior with qualify enabled in sip.conf on both Asterisk 11.6-cert15 and 13.8-cert3 running on both CentOS 6.7 and Amazon Linux 2016-09 instances in Amazon AWS.

When qualify is enabled (globally, or on an individual peer), we see consistent and reproducible SIP packet retransmissions from the Asterisk instance in our Wireshark traces (see screenshots below, where the Asterisk instance is 10.60.254.113 and the other IPs are Bria clients).

Our problem is that the frequency of the retransmissions, combined with the call volumes we’re putting on these instances, causes the CPU to spike and call quality to degrade substantially (even with 16 vCPUs). The retransmissions affect not only the call setup signaling but the OPTIONS messages as well (it looks like a self-imposed DoS attack when viewed with ngrep in real time).

There is nothing in the system logs or asterisk logs (outside of Asterisk noting the retransmissions when qualify is enabled) to indicate a specific issue or misconfiguration. We’ve built several new instances from scratch with very minimal configurations to test the functionality, with the exact same results.

Our team manages several hundred Asterisk instances across the globe, spanning multiple virtualization environments like VMware, Xen, AWS, and GCP - and we’re stumped on this one.

We’ve done the customary googling until our eyes bleed and come up with nothing, so before we go and open a bug report, we’re following the bug report checklist.

Any thoughts / ideas would be greatly appreciated.

All the best,

  • Darren

Qualify On

Worth noting, while this is one capture, the behavior is very consistent and we have tons of pcaps.

The retransmission on that trace is legitimate. 200ms is a long time for an initial 100 response.

I would be more concerned about the ACK from the client in an inappropriate place, at 133.140810.

The only effect of qualify is that it allows Asterisk to learn the typical round trip time and fine tune the timeouts.

I haven’t heard of retransmissions being a significant drain on resources. The main CPU hogs are media handling and complex queues, not signalling.

Also, what does the “v” mean in “vCPUs”?

When chan_sip is in use, the result of “qualify” influences the T1 timer, as it is expected that the transmission time for other requests should be similar to (or better than) that of the OPTIONS. There’s no way I know of to disable this while keeping qualify on. It may also be a sign of another problem outside of Asterisk, such as variable network conditions.
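To make the effect on T1 concrete, here is a minimal Python sketch of how a qualify round trip time could turn into a retransmission schedule. It rests on two assumptions that are not confirmed in this thread: that chan_sip uses the last qualify RTT as T1 for the peer, clamped to no less than the `t1min` value from sip.conf (falling back to `timert1` when there is no measurement), and that retransmissions follow the RFC 3261 INVITE schedule (Timer A doubling from T1, Timer B at 64 x T1). The function names are hypothetical, and chan_sip's real timing may differ in detail.

```python
def effective_t1(qualify_rtt_ms, timert1_ms=500, t1min_ms=100):
    """Return the T1 value (ms) assumed to be used for a peer.

    Assumption: a qualify measurement replaces the configured default,
    but is never allowed to drop below t1min.
    """
    if qualify_rtt_ms is None:
        return timert1_ms
    return max(qualify_rtt_ms, t1min_ms)


def invite_retransmit_offsets(t1_ms):
    """Offsets (ms after the first send) at which an unanswered INVITE is resent.

    RFC 3261: Timer A starts at T1 and doubles after every retransmission;
    Timer B (64 * T1) ends the transaction.
    """
    offsets = []
    interval = t1_ms
    elapsed = t1_ms
    timer_b = 64 * t1_ms
    while elapsed < timer_b:
        offsets.append(elapsed)
        interval *= 2
        elapsed += interval
    return offsets


def spurious_retransmits(t1_ms, response_after_ms):
    """Retransmissions sent although a response was on its way, only late."""
    return [t for t in invite_retransmit_offsets(t1_ms) if t < response_after_ms]


# A peer that qualified at 127 ms but answers an INVITE after 200 ms
t1 = effective_t1(127)
print(t1, spurious_retransmits(t1, 200))

# The same peer without a qualify measurement (default T1 of 500 ms)
t1 = effective_t1(None)
print(t1, spurious_retransmits(t1, 200))
```

Under these assumptions, a first response that takes longer than the last OPTIONS round trip is enough to trigger a retransmission, even though nothing was lost. If the clamp works as assumed, raising `t1min` would be one thing to test, but that is unverified against the versions discussed here.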

The thing is, we haven’t seen any messages on the console indicating that the peers are unreachable.

[Oct 19 06:23:39] NOTICE[59580]: chan_sip.c:23446 handle_response_peerpoke: Peer '1028' is now Reachable. (127ms / 2000ms)
[Oct 19 06:23:40] NOTICE[59580]: chan_sip.c:23446 handle_response_peerpoke: Peer '1058' is now Reachable. (192ms / 2000ms)

*CLI> sip show peers
Name/username  Host            Dyn  Forcerport  ACL  Port   Status       Description
1028/1028      162.207.112.26  D    N                61033  OK (134 ms)
1058/1058      186.32.185.42   D    N                57623  OK (170 ms)
2 sip peers [Monitored: 2 online, 0 offline Unmonitored: 0 online, 0 offline]

[general]
externip=52.29.30.192
media_address=52.29.30.192
transport=udp
localnet=10.60.0.0/255.255.0.0
allowguest=no
context=unauthorized
dtmfmode=rfc2833
t38pt_udptl = yes
bindport=5060
bindaddr=0.0.0.0
srvlookup=no
disallow=all
allow=g729
port=5060
alwaysauthreject=yes
callcounter=yes
nat=force_rport,comedia
qualify=yes

OPTIONS requests are not sent as often, and the response time could just change up and down depending on conditions.
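To see how much the qualify round trip time moves around for each peer, the figures already posted in this thread can be collected and compared. The sketch below is only an illustration: the regular expressions are written against the two line formats shown above (the "is now Reachable" notice and the `sip show peers` status column), and the function names are hypothetical.

```python
import re

# RTT in a chan_sip "is now Reachable" notice; accepts straight or curly quotes.
REACHABLE = re.compile(
    r"Peer ['‘’](?P<peer>[^'‘’]+)['‘’] is now Reachable\. "
    r"\((?P<rtt>\d+)ms / \d+ms\)"
)
# RTT in a "sip show peers" row such as "1028/1028 ... OK (134 ms)".
STATUS = re.compile(r"^(?P<peer>[^/\s]+)/\S+\s.*\bOK \((?P<rtt>\d+) ms\)")


def qualify_rtts(lines):
    """Collect every qualify RTT sample (ms) found in the lines, per peer."""
    samples = {}
    for line in lines:
        match = REACHABLE.search(line) or STATUS.search(line.strip())
        if match:
            samples.setdefault(match.group("peer"), []).append(int(match.group("rtt")))
    return samples


def rtt_spread(samples):
    """Difference between the slowest and fastest sample for each peer."""
    return {peer: max(values) - min(values) for peer, values in samples.items()}


lines = [
    "[Oct 19 06:23:39] NOTICE[59580]: chan_sip.c:23446 handle_response_peerpoke: Peer '1028' is now Reachable. (127ms / 2000ms)",
    "[Oct 19 06:23:40] NOTICE[59580]: chan_sip.c:23446 handle_response_peerpoke: Peer '1058' is now Reachable. (192ms / 2000ms)",
    "1028/1028 162.207.112.26 D N 61033 OK (134 ms)",
    "1058/1058 186.32.185.42 D N 57623 OK (170 ms)",
]
samples = qualify_rtts(lines)
print(samples)
print(rtt_spread(samples))
```

Two samples per peer are far too few to draw conclusions from, but the same approach over a longer log would show whether the round trip time regularly swings by more than the margin the timers allow.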

I’d suggest doing the analysis on the peer side. Do a packet capture there and see what it sees (and when) in comparison to the Asterisk side. Rule out the networking in between. If you only see 1 packet and Asterisk sent 2, then you’ve isolated it.
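The comparison can be done mechanically once each capture is reduced to a list of SIP messages. Here is a rough Python sketch, assuming both captures have already been exported (by whatever tool you prefer) as `(call_id, cseq, start_line)` tuples; the tuple layout, the sample values and the function name are all hypothetical.

```python
from collections import Counter


def compare_captures(sent, received):
    """Compare what Asterisk sent with what the peer actually saw.

    Both arguments are iterables of (call_id, cseq, start_line) tuples,
    one tuple per packet, so retransmissions show up as repeated tuples.
    Returns {message: (copies_sent, copies_received, verdict)}.
    """
    sent_counts = Counter(sent)
    received_counts = Counter(received)
    report = {}
    for message, n_sent in sent_counts.items():
        n_received = received_counts.get(message, 0)
        if n_received < n_sent:
            verdict = "lost in transit"
        elif n_sent > 1:
            verdict = "retransmitted, all copies arrived"
        else:
            verdict = "ok"
        report[message] = (n_sent, n_received, verdict)
    return report


# Made-up example: the INVITE was sent twice but the peer saw it only once.
invite = ("call-1@example", "102 INVITE", "INVITE sip:1028@example SIP/2.0")
options = ("call-2@example", "103 OPTIONS", "OPTIONS sip:1028@example SIP/2.0")
asterisk_side = [invite, invite, options]
peer_side = [invite, options]

for message, result in compare_captures(asterisk_side, peer_side).items():
    print(message[1], result)
```

A "lost in transit" verdict points at the network between the two capture points. If every copy arrived, the retransmission was triggered by timing (a late response) rather than by loss.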

Client capture

Asterisk capture

At a glance, it looks to me like not all the packets are making it to the client.

Since a packet capture is done outside of Asterisk this also rules Asterisk itself out.

I don’t see lost packets, only excessive round trip times, i.e., network overload/bufferbloat.

We’ve done an extensive amount of testing, from scratch, and can confirm that the retransmissions with qualify enabled happen in both AWS and Google Compute, on low-end to very high-end instance types. We’ve tested using CentOS 6.7, Ubuntu 14, and Ubuntu 16, and all do the exact same thing.

Our support organization uses the qualify option extensively to do troubleshooting on endpoints. Without it, we’re semi-blind as to the status of the endpoint.

Any additional thoughts? ideas?

Greatly appreciate the input.

Best,

  • Darren

FWIW, using chan_pjsip with qualify turned on for the endpoint does not exhibit any retransmissions.

chan_pjsip is not written to behave that way, so I’m not surprised. The only way to turn it off in chan_sip would be to modify the code, as far as I know.

Which behavior doesn’t it have exactly? I believe you, I just want to be able to have an intelligent conversation with the rest of our technical team and be able to explain.

Is it the ability for qualify to modify the T1 timer?

Thanks Josh!

Yes, it does not have that ability. The timer controls in PJSIP are system-wide and not granular.

Thanks again for all the feedback (it’s certainly appreciated)!