SIP packets lost due to overflow of the UDP socket receive buffer

Hello!

After migrating to a new virtual machine, we have an issue: some SIP messages do not reach the Asterisk chan_sip module. It is most noticeable with “OPTIONS” messages. At those moments the Asterisk log shows many “NOTICE[13098] chan_sip.c: Peer is now UNREACHABLE!” messages within the same second.

After some diagnostics we found that the receive buffer of the Asterisk UDP socket (default size “net.core.rmem_default = 212992”) is overflowing. We have heplify installed on the system, and it sees all the packets (it captures via AF_PACKET sniffing), but Asterisk does not. Checking “Recv-Q” with “ss” shows:
"
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
UNCONN 215040 0 *:5060 *:* users:(("asterisk",pid=12987,fd=16))
after 1 second
UNCONN 215040 0 *:5060 *:* users:(("asterisk",pid=12987,fd=16))
after 1 second
UNCONN 215040 0 *:5060 *:* users:(("asterisk",pid=12987,fd=16))

"
So “Recv-Q” stays at 215040 for several consecutive seconds.
Looking at the total number of SIP packets received, we cannot see any peak at those moments.
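For anyone who wants to log this over time instead of running “ss” by hand, here is a minimal Python sketch that reads the same counters from /proc/net/udp on Linux. The rx_queue column should correspond to what “ss” shows as Recv-Q, and the last column (drops) counts datagrams the kernel discarded for that socket. The function names are made up for illustration, and the IPv4-only path is an assumption (an IPv6 socket would be listed in /proc/net/udp6).

```python
from pathlib import Path


def parse_udp_line(line):
    """Parse one data row of /proc/net/udp into (port, rx_queue_bytes, drops)."""
    fields = line.split()
    # fields[1] is the local address as "HEXIP:HEXPORT".
    port = int(fields[1].rsplit(":", 1)[1], 16)
    # fields[4] is "tx_queue:rx_queue", both in hex.
    rx_queue = int(fields[4].split(":")[1], 16)
    # The last column is the per-socket drop counter, in decimal.
    drops = int(fields[-1])
    return port, rx_queue, drops


def udp_stats(port, path="/proc/net/udp"):
    """Return (port, rx_queue, drops) for every UDP socket bound to `port`."""
    rows = Path(path).read_text().splitlines()[1:]  # skip the header row
    return [s for s in map(parse_udp_line, rows) if s[0] == port]


# Example use on the Asterisk host (hypothetical polling loop):
#   import time
#   while True:
#       print(time.strftime("%H:%M:%S"), udp_stats(5060))
#       time.sleep(1)
```

If the drops counter keeps growing while rx_queue sits near the buffer size, the kernel is discarding datagrams because the application is not reading the socket fast enough.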

We can increase “net.core.rmem_default” in the hope that it resolves the issue, but maybe someone can tell us why Asterisk can’t read all the data from the receive buffer in time (maybe some locks occur, etc.).
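As background on that sysctl: a UDP socket whose application does not set SO_RCVBUF itself starts with net.core.rmem_default as its receive buffer size, and an explicit SO_RCVBUF request is capped by net.core.rmem_max. The Python sketch below only demonstrates that kernel behaviour on a throwaway socket; it does not touch Asterisk, and the function name is hypothetical.

```python
import socket


def rcvbuf_after_request(requested_bytes):
    """Open a UDP socket, request a receive buffer size,
    and return (default_size, effective_size) as reported by the kernel."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # Size the socket got without asking (rmem_default on Linux).
        default = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        # Ask for a specific size; the kernel may cap or adjust it.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        effective = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return default, effective


# Example: print(rcvbuf_after_request(4 * 1024 * 1024))
```

On Linux the value read back is normally twice the (possibly capped) request, because the kernel reserves room for its own bookkeeping. A larger buffer only buys more time during a stall; it does not explain why the reader stalls.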

rasterisk -V
Asterisk 13.22.0

sip show peers
2458 sip peers [Monitored: 1356 online, 986 offline Unmonitored: 110 online, 6 offline]

load average: 2.96, 3.72, 4.09 on 12 CPUs

Thanks!

In the Asterisk CLI, do you get this sort of error?

Unable to allocate RTCP socket: Too many open files in system

The chan_sip module will handle 1 thing at a time on incoming UDP. If that blocks, then everything else waits. That’s the way it is architected and works.
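To illustrate why this can lose packets without any traffic peak, here is a toy Python model of a single reader draining a fixed-size receive buffer. It is not chan_sip code. The packet size, arrival rate, drain rate and stall length are all assumed numbers; only the buffer size comes from this thread.

```python
def simulate_queue(capacity, pkt_size, arrivals_per_ms, drain_per_ms,
                   stall_start_ms, stall_ms, total_ms):
    """Simulate a byte-limited receive queue with one reader.

    Returns (dropped_packets, peak_queued_bytes)."""
    queued = 0
    dropped = 0
    peak = 0
    for t in range(total_ms):
        # Packets arrive at a constant rate; a full queue drops them.
        for _ in range(arrivals_per_ms):
            if queued + pkt_size > capacity:
                dropped += 1
            else:
                queued += pkt_size
        peak = max(peak, queued)
        # The single reader drains the queue unless it is blocked.
        stalled = stall_start_ms <= t < stall_start_ms + stall_ms
        if not stalled:
            queued = max(0, queued - drain_per_ms * pkt_size)
    return dropped, peak


# Same constant arrival rate in both runs; only the reader stall differs.
no_stall = simulate_queue(212992, 1000, 1, 5, 100, 0, 2000)
with_stall = simulate_queue(212992, 1000, 1, 5, 100, 500, 2000)
print("no stall:", no_stall)
print("500 ms stall:", with_stall)
```

In this model the arrival rate never changes, yet a stall of a few hundred milliseconds in the reader is enough to fill the buffer and start dropping.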

No, we don't get that error at all.

How can we find out what is causing the relatively long blocking?

You can get a backtrace[1] and try to understand why things are blocked, if they are. In general though you’re on an unsupported version of Asterisk, with a channel driver that doesn’t see attention any longer, so set your expectations accordingly on help.

[1] Getting a Backtrace - Asterisk Project Wiki

If you are on a virtual machine, it may well be that Asterisk isn’t being given CPU time sufficiently frequently. VMs weren’t originally intended for guests with tight real time constraints.

Yes, we use a VM, but we have dedicated several processors specifically to the VM running Asterisk. Anyway, how can we check for a lack of CPU time?
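One generic Linux indicator is the “steal” counter on the aggregate cpu line of /proc/stat, which counts time the hypervisor gave to other guests while this VM wanted to run. A minimal Python sketch that computes the steal percentage between two samples follows. The function names are hypothetical, and not every hypervisor reports steal time to the guest, so a value of zero is not conclusive.

```python
def parse_cpu_line(line):
    """Return (steal_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:]]
    # Column order: user nice system idle iowait irq softirq steal guest guest_nice
    steal = values[7] if len(values) > 7 else 0
    # guest and guest_nice are already included in user and nice,
    # so only the first eight columns are summed.
    total = sum(values[:8])
    return steal, total


def steal_percent(before_line, after_line):
    """Percentage of CPU time stolen between two samples of the 'cpu' line."""
    s0, t0 = parse_cpu_line(before_line)
    s1, t1 = parse_cpu_line(after_line)
    elapsed = t1 - t0
    return 100.0 * (s1 - s0) / elapsed if elapsed > 0 else 0.0


# Example use on the VM (hypothetical):
#   import time
#   first = open("/proc/stat").readline()
#   time.sleep(5)
#   second = open("/proc/stat").readline()
#   print(steal_percent(first, second))
```

The same figure appears as “st” in top and vmstat. Sampling it during the seconds when Recv-Q is stuck at its maximum would show whether the two coincide.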

Asterisk is running in production, so we will only try to get a backtrace if nothing else helps.