Just looking for a little guidance on where to proceed next in my investigation. I appreciate any feedback.
I am running Asterisk 20.9.1 at three locations, all of them on Ubuntu 24.04 LTS. Each Asterisk location connects to our ITSP vendor via two TLS SIP trunks. We recently migrated them from UDP over MPLS to TLS/SRTP directly over the internet.
SBC A is very stable overall, while SBC B closes its side of the TCP socket on port 5061 every 60 seconds. New sockets get created soon after, but after another 60 seconds it sends us a FIN, ACK again.
The impact depends on timing: if we attempt an outbound call to SBC B, we can see our INVITE go out and the 100 Trying and 183 Session Progress come back, and then, right after the socket closes, the Asterisk server reports Congestion. In this scenario the Asterisk server sends a 503 Service Unavailable to the originating server that is trying to make the call; we wait a few seconds and then try again.
There is nothing in pjsip_wizard.conf that differs between SBC A and SBC B except the remote host IP address. I originally had aor/qualify_frequency set to 60, but I changed it to 10. I also see that both SBCs send us OPTIONS packets every 7-15 seconds. In both cases, the 200 OK replies are immediate.
I sent this up to our ITSP, who forwarded it to their SBC vendor (MetaSwitch). They are claiming it’s because the Asterisk servers are not sending [TCP Keep-Alive] packets to them, and because of that they are closing the socket.
It’s been a while, but from what I’ve read the only reason a [TCP Keep-Alive] packet would be sent is because of inactivity, and since both sides are sending multiple OPTIONS messages, there shouldn’t be a need for one. That being said, I did set the global keep_alive_interval to 15 seconds in pjsip.conf, and I do see that setting in “pjsip show settings”.
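For context on the distinction being argued here: kernel-level TCP keep-alive is opt-in per socket through SO_KEEPALIVE, and it is separate from anything an application sends on its own. The sketch below is a generic Python illustration of that socket option, not Asterisk or PJSIP code. The helper name `enable_tcp_keepalive` is hypothetical, and whether Asterisk enables this option on its TLS sockets is not something established in this post.

```python
import socket


def enable_tcp_keepalive(sock, idle=45, interval=15, count=9):
    """Turn on kernel TCP keep-alive probes for one socket (hypothetical helper)."""
    # Without SO_KEEPALIVE the kernel never sends keep-alive probes on this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Optional per-socket overrides of the system-wide values. These constants
    # are platform-specific, so only use the ones this platform exposes.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    # Read the flag back to confirm it is set.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("keep-alive enabled:", enable_tcp_keepalive(s))
```

If a socket never has this option set, the kernel will not emit [TCP Keep-Alive] probes on it regardless of how much or how little traffic it carries.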
Not seeing any difference, I did some more research and found that I can change some OS values in Ubuntu/Linux relating to TCP keepalives:
net.ipv4.tcp_keepalive_time = 45 (was default 7200)
net.ipv4.tcp_keepalive_intvl = 15 (was default 75)
Even with all these changes I don’t see any different behavior with SBC B.
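As a rough sketch of what those two values imply, the snippet below reads the keep-alive sysctls from /proc and works out when the first probe would go out on an idle connection and when the kernel would give up on it. It assumes tcp_keepalive_probes is still at its default of 9 (that value is not stated above), and the function names are only illustrative. Note that these sysctls only govern sockets that have SO_KEEPALIVE enabled.

```python
from pathlib import Path

SYSCTL_DIR = Path("/proc/sys/net/ipv4")
NAMES = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive_sysctls(base=SYSCTL_DIR):
    """Return the current kernel keep-alive settings as a dict of ints (Linux only)."""
    return {name: int((Path(base) / name).read_text()) for name in NAMES}


def keepalive_timeline(time_s, intvl_s, probes):
    """Seconds of idle time until the first probe, and until the peer is declared dead."""
    first_probe_at = time_s
    # Each unanswered probe is followed by another after intvl_s seconds.
    declared_dead_at = time_s + intvl_s * probes
    return first_probe_at, declared_dead_at


if __name__ == "__main__":
    values = read_keepalive_sysctls()
    first, dead = keepalive_timeline(
        values["tcp_keepalive_time"],
        values["tcp_keepalive_intvl"],
        values["tcp_keepalive_probes"],
    )
    print(f"first probe after {first}s idle, peer declared dead after {dead}s")
```

With 45 and 15 as set above and 9 probes assumed, `keepalive_timeline(45, 15, 9)` gives a first probe after 45 seconds of idle time, which is inside the 60-second window in which SBC B closes the socket.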
So, I guess I’m asking the following:
- Do you believe I’m on the right path?
- Is there something in a PCAP or log that would confirm that either my global keep_alive_interval setting or the OS settings are actually taking effect, and that what I’m doing is correct?
- Any other suggestions you might have in order to resolve this?
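To make the PCAP question concrete, here is a minimal sketch of how the two kinds of keep-alive could be told apart in a capture. The first rule mirrors the heuristic Wireshark documents for its [TCP Keep-Alive] label (zero or one byte of payload, sequence number one less than the next expected, no SYN/FIN/RST). The second rule looks for the SIP CRLF keep-alive described in RFC 5626 (double CRLF as the ping, single CRLF as the reply). The function name and its inputs are hypothetical, and on a TLS trunk the CRLF payload is encrypted, so that rule only applies to decrypted traffic.

```python
def classify_segment(seq, payload, next_expected_seq, syn=False, fin=False, rst=False):
    """Label one TCP segment as a TCP keep-alive, a SIP CRLF keep-alive, or other."""
    # TCP sequence numbers wrap at 2**32, so compare modulo that.
    one_before_expected = (next_expected_seq - 1) & 0xFFFFFFFF
    if not (syn or fin or rst) and len(payload) <= 1 and seq == one_before_expected:
        return "tcp-keepalive"
    # Application-level keep-alive: CRLFCRLF is the ping, a lone CRLF is the reply.
    if payload in (b"\r\n\r\n", b"\r\n"):
        return "sip-crlf-keepalive"
    return "other"


if __name__ == "__main__":
    print(classify_segment(999, b"", 1000))
    print(classify_segment(1000, b"\r\n\r\n", 1000))
    print(classify_segment(1000, b"OPTIONS sip:example SIP/2.0\r\n", 1000))
```

A capture from our side that shows neither kind of segment leaving the Asterisk server toward SBC B would suggest the settings are not producing what the vendor expects to see.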
I can send snips of any logs you need. Thank you in advance.
Dean