PJSIP stack resends messages on high CPS

While load testing the PJSIP stack, I see strange behaviour once it comes under moderate load.
Asterisk keeps resending the 200 OK message, although it has received the ACK.
After 7 resends, it sends BYE 7 times, although it has received the 200 OK.
CPU is not the issue, as I’m testing on a 256-thread EPYC setup.

The Asterisk pjsip logger shows that it has received the ACK and 200 OK messages, as does Wireshark; it just seems to fail to parse them.

Environment:
CentOS8 with two kernel versions: 4.18.0-193.19.1.el8_2 and 5.8.11-1.el8
Servers are connected locally over a 10Gb/s connection.
Ulimit raised to 102400, rtp port range to 30000 ports, pjsip and stasis thread pool optimized as suggested by Asterisk wiki.
Asterisk 17.7.0 and 16.13.0 with embedded PJPROJECT 2.10
CPU: Ryzen 2700 with 16 threads and 2 x EPYC 7742 64 with 256 threads
sipp: SIPp v3.6.1-TLS-SCTP-PCAP-RTPSTREAM
sipp command: sipp -sf /root/uac.xml -rtp_echo -nr -trace_msg -trace_err -d 180s -r 100 -l 5000 -i 172.16.0.32 -p 5070 -s 505 172.16.0.31:5060

On the lower-spec machine (Ryzen 2700) the problem appears at 30 cps, whereas on the higher-spec machine it starts at around 70 cps and above.

Issue is present with sipp rtp_echo and without it.

Should I file this as a bug?

Same behaviour can be observed with chan_sip.
Could this be related to sipp?

On which machine were the logs captured? Please provide the channel driver (e.g. sip set debug on) logging output from the Asterisk machine, as it is likely that the callid is wrong on the ACK. (Message sequence diagrams are rarely useful.)

Hi David,
Thx for your reply,
Capture was done on the same machine asterisk is running on, to avoid any network drop offs.
As I mentioned, I’ve run into the same issue with chan_sip, with one difference: with pjsip, Asterisk logs the same packets as tcpdump, whereas chan_sip does not log them.
Here are the files: filtered asterisk log, pcap and sequence diagram from wireshark.

I can’t see any obvious problem, although it could be a non-printing character, such as an excess carriage return.

You should turn up the debugging

Will do. Full verbose & debug.
I plan to try older versions of asterisk/chan_sip/pjsip.
What tool could be used instead of sipp to eliminate it as a potential problem?

Full verbose & debug log uploaded under test-15.log/pcap/pdf. Can’t see any obvious error.
It looks like a bug.
If you look at, for example, Call-ID: 5-3492805@10.40.1.37, it looks like Asterisk gets the messages but fails to process them.
I’ve tried increasing the timer_t1 to 750ms, although the roundtrip time is lower than 500 ms (according to wireshark).
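
For reference, here is a rough sketch of how T1 drives the retransmission schedule of a non-INVITE request such as BYE over UDP, following the generic RFC 3261 rules (Timer E doubles from T1 up to T2, and Timer F gives up after 64*T1). The function name is made up for illustration, and the T2 default of 4 s is the RFC value, not something taken from this setup. What Asterisk/PJSIP actually does may differ, so the number of resends seen in the capture need not match this.

```python
def non_invite_retransmit_times(t1=0.5, t2=4.0):
    """Offsets in seconds at which a non-INVITE request would be resent.

    Hypothetical helper following RFC 3261 Timer E/F, not PJSIP code.
    """
    timeout = 64 * t1      # Timer F: the transaction gives up here
    times = []
    interval = t1          # Timer E starts at T1
    elapsed = interval
    while elapsed < timeout:
        times.append(elapsed)
        interval = min(2 * interval, t2)  # double, but never beyond T2
        elapsed += interval
    return times


if __name__ == "__main__":
    for t1 in (0.5, 0.75):
        print(t1, non_invite_retransmit_times(t1))
```

Raising T1 only stretches this schedule. If the replies are arriving but not being processed, as the logs suggest, a longer T1 would not be expected to help.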

BTW, PJSIP log timing correlates with the Wireshark timestamps, whereas chan_sip logs the replies 10 seconds late on average.

In regards to PJSIP, I would suggest getting a backtrace[1].

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-Runningast_coredumperfordeadlocks,taskprocessorbackups,etc.

Thx for the advice. I’ve filed the issue and attached the core dump here

As Joshua has figured out, it was a DNS resolution issue, so if you run into this issue, be sure to check your DNS traffic and configure the names being looked up in /etc/hosts or your DNS server.
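
A quick way to spot slow or failing lookups from the Asterisk machine is to time them directly. The sketch below is only an illustration: the function names and the 0.1 s threshold are arbitrary choices, and the hostnames to check have to come from your own configuration and SIP traffic.

```python
import socket
import sys
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Return (resolved, seconds) for one lookup of hostname."""
    start = time.monotonic()
    try:
        resolver(hostname, None)
        resolved = True
    except OSError:  # socket.gaierror is a subclass of OSError
        resolved = False
    return resolved, time.monotonic() - start


def slow_lookups(hostnames, threshold=0.1, resolver=socket.getaddrinfo):
    """List (name, resolved, seconds) for lookups that fail or exceed threshold."""
    report = []
    for name in hostnames:
        resolved, elapsed = time_lookup(name, resolver)
        if not resolved or elapsed > threshold:
            report.append((name, resolved, elapsed))
    return report


if __name__ == "__main__":
    # Pass the names to check on the command line.
    for name, resolved, elapsed in slow_lookups(sys.argv[1:]):
        print(f"{name}: resolved={resolved} took {elapsed:.3f}s")
```

Any name that shows up here is a candidate for an /etc/hosts entry or a fix on the DNS server.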