SIP Trunks going unreachable

Hi everyone,

I have two Asterisk machines installed on CentOS 7.X, both running Asterisk 16.XX.

Asterisk box 1 runs Asterisk 16.7.0.

On this box we have 3 SIP trunks configured. 2 are internal SIP trunks to other Asterisk machines on the same network. We also have SIP trunks from telecom operators, connected and configured on separate NICs.

All the calls go through the SIP trunk from the operator; the internal SIP trunks are not carrying any calls.

The issue: the Asterisk box works fine with 300-350 calls. But once it goes above 350 calls, we see the SIP trunk become lagged, and after 2-3 seconds it goes unreachable.

When call volume is low (< 200 calls), sip show peers shows all my trunks as OK (< 7 ms).

As call volume increases, the value in sip show peers rises to OK (> 50 ms). With more calls (> 350), sip show peers shows OK (1000 ms), after 2 seconds it turns to LAGGED (1500 ms), and within seconds it changes to UNREACHABLE. After some time the SIP trunk comes back online and sip show peers shows OK again.

As far as I have checked, there is no firewall or iptables enabled on the server. The trunks are from different operators, and the internal SIP trunks go unreachable at the same time as well.

Can anyone help with this?

The same happens on my other Asterisk machine, which runs Asterisk 16.1.1.

How are your system resources used? I’d start with htop.

Sorry, both servers have a 16-core CPU and 64 GB RAM, and the load average during this time is also < 5.

Something must be the bottleneck. I guess that you are using a bare metal server and not a VM. The next step would be to do the usual iperf and iftop tests to get an idea of whether there is something special with the NICs.

Even with fewer connections you sometimes get some resource warnings. It seems that you are using the older chan_sip module. Maybe PJSIP is more performant (or not). What happens if you disable all OPTIONS requests locally? Usually, not all accounts get calls at the same time, so if you have 350 calls, there are probably more than 2000 registered SIP accounts. If the qualify interval is 60 seconds, that would imply that Asterisk has to handle more than 30 OPTIONS requests per second.
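To put a rough number on that estimate, here is a small Python sketch. The 2000 accounts and the 60 second interval are the guesses from this post, not measured values, and `options_per_second` is just a helper name made up for illustration. It only counts outgoing requests, so the matching responses would double the message count.

```python
def options_per_second(peers: int, qualify_interval_s: float) -> float:
    """Average OPTIONS requests per second if every peer is
    qualified once per interval (responses are not counted)."""
    if qualify_interval_s <= 0:
        raise ValueError("qualify interval must be positive")
    return peers / qualify_interval_s


if __name__ == "__main__":
    # Guessed figures: 2000 registered accounts, one qualify every 60 s.
    rate = options_per_second(2000, 60)
    print(f"{rate:.1f} OPTIONS requests per second")
```

With only a handful of qualified peers the same formula gives a negligible rate, so the OPTIONS load itself would then be unlikely to be the bottleneck.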

Asterisk cannot spread work evenly across all cores, as a thread can only be on one core at a time. For example, with chan_sip, all incoming SIP signalling is handled by a single thread and therefore a single core.

OK. Is there any way I can make use of all cores?

@EkFudrek - I have only 3 peers that send OPTIONS messages. Only 1 of them carries calls; it sends 250+ calls, and that is when the issues start.

Interesting. So far I thought you had the problem with multiple simple endpoints like phones, but I guess you are connecting multiple PBXs or similar. If so, then Asterisk might not be the right type of server. Asterisk is basically a B2BUA (back-to-back user agent), while you would likely need a SIP server. I think explaining this would go too far here. That said, you can actually couple Asterisk and SIP servers like Kamailio to handle larger loads, failures, etc.

“SIP server” has a specific meaning in SIP terminology, and Asterisk very definitely implements SIP server capabilities. I think the term you are after is SIP proxy…

If you are strict about the terminology, then that is correct. It is, however, the case that an Asterisk server is typically used differently than a product like Kamailio. The choice of words was deliberately imprecise because one can imagine different scenarios and I didn’t want to go into too much detail.

I don’t know of a small company that uses Kamailio as a telephone system, nor of a large telephony provider that uses Asterisk as a central controller. The opposite pairing is far more common. Even if there are functional overlaps, the products are ultimately not interchangeable in every situation.

I have an Asterisk server with no SIP phones connected to it.

I have a SIP trunk from the telecom operator (1)
SIP trunks from other Asterisk boxes (2)

All the calls are system-generated calls that conference users together over the telecom lines. I run into the issue when more than 300 calls are being generated to the telecom operator lines. I have also tried upgrading Asterisk from 16.7 to 16.12, and the issue remains the same.

I am not sure how PJSIP handles incoming calls, but chan_sip has limitations, as David already said. It probably makes more sense to check the newer PJSIP stack instead of simply upgrading the version.

This is one reason why the term SIP server, above, can cause confusion. People think of Asterisk as a server, when it is actually the hardware on which it runs that is the server. In generic OS terms, Asterisk is a daemon, and in SIP terms it acts as both client and server. On normal calls, it acts as both.

Conferencing isn’t something for which SIP proxies are used, as it does require you to terminate call legs.

I haven’t looked into the detailed structure of conference bridging in Asterisk, but that is something that will obviously require serialisation and, if you have a small number of large conferences, might well result in not being able to make effective use of all the cores. The key point is that real world applications often cannot make full use of multicore architectures, unless you run large numbers of completely independent instances of them. (Certain very structured problems, typically mathematical models, can be designed to take advantage of as many cores as are available, but most applications just aren’t that structured.)

The lagged messages might be the result of the SIP receive thread overloading, but could well be the result of the network overloading. You would need to take packet captures and see where the delays were occurring. It is possible, but unlikely that locking is pushing back into the SIP receive thread. You would need to take snapshot dumps, to see if it was waiting for things other than the SIP socket.
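Alongside packet captures, a quick external probe can show whether SIP responses slow down as call volume rises. The following is only a rough Python sketch, not a replacement for a capture: it sends one minimal OPTIONS request over UDP from another host and times the first response. The helper names `build_options` and `options_rtt_ms`, the `probe` user part, and the port 5060 default are assumptions made for illustration, and any SIP response (200, 401, 404, ...) is treated as an answer.

```python
import socket
import time
import uuid
from typing import Optional


def build_options(target_host: str, target_port: int,
                  local_host: str, local_port: int) -> bytes:
    """Build a minimal SIP OPTIONS request (hypothetical probe identity)."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 magic cookie prefix
    lines = [
        f"OPTIONS sip:{target_host}:{target_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch};rport",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}:{target_port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        f"Contact: <sip:probe@{local_host}:{local_port}>",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def options_rtt_ms(target_host: str, target_port: int = 5060,
                   timeout_s: float = 2.0) -> Optional[float]:
    """Return the round-trip time in ms, or None if nothing came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        sock.connect((target_host, target_port))
        local_host, local_port = sock.getsockname()
        request = build_options(target_host, target_port,
                                local_host, local_port)
        start = time.monotonic()
        try:
            sock.send(request)
            sock.recv(4096)  # first datagram back counts as the response
        except OSError:      # timeout, ICMP port unreachable, etc.
            return None
        return (time.monotonic() - start) * 1000.0
```

Calling `options_rtt_ms("<asterisk-ip>")` once a second while the call count climbs would show whether the delay grows on the wire or only inside Asterisk; comparing it with a capture taken on the Asterisk host narrows that down further.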

There is also a thread debugging build option, that provides CLI commands to look at the locks, but this adds significant overheads, and isn’t something you would want to do if the system is already overloaded.

Thanks @david551 @EkFudrek

So basically, what would you recommend if I need to handle 1000 concurrent calls?

I have also checked a pcap trace and found that sometimes calls show up in the trace but do not reach Asterisk.

I have even disabled the qualify option and got better performance, but there are still issues once concurrent calls go above 300.

The standard answer to all sizing questions is that you need to benchmark for your particular usage.

Firstly, you need to establish that your network connectivity supports your traffic loads with low latency. A common problem with networks is buffer bloat, where packets don’t get lost, but get stuck in buffers in the network, because there is a lot of buffer capacity, but inadequate network capacity.

Typically, where the bottleneck is the PABX, one uses load balancing proxies to distribute traffic over multiple PABXes, but that assumes there isn’t a resource, such as a single conference, needed for all calls. Having said that, I would say that 1,000 erlangs would strike me as a heavy load, for general purpose PC hardware.

My involvement was for software development, for a product that tended to deal with around 100 calls, purely over intranets, so I don’t have personal experience of sizing - that was done by the pre-sales people.

When you are working purely on an intranet, you can prioritise SIP signalling and media, in the network, but I’m not sure why an ISP would do this, and any prioritisation wouldn’t survive onto the general internet.

Where exactly did you do the measurement of your traces? Both on the same machine or one in the network and one on the asterisk machine itself?

Are you using udp or tcp for signaling?

Is your problem RTP or SIP related? Or both? With pjsip and tcp, I would use different transports per trunk hoping that different transports are handled on different CPUs (better parallelization). But at first you should know your bottleneck.

BTW: In order to test your network or the local network parameters in the kernel, you should test with small packets (RTP UDP length is around 182 bytes, depending on codec, or even much less), not big packets. Try to get the max throughput to the desired destination. I wouldn’t be surprised if the throughput you get with those small packets is far from what you get with the usual size (1500).

Or to be more precise: 1 call on the trunk consists of two RTP streams, one incoming and one outgoing. This call needs 50 packets/second in each direction (G.711, with each packet carrying 20 ms of speech). 400 calls would produce 20k packets/s in each direction. This means your server must process 40k packets each second on the network stack alone. But those packets must be processed further (merging them, maybe transcoding), and all of it in real time.
My power-optimized server (6 W / APU 2) with an Intel gigabit interface can do at most 45k - 50k packets per second as long as nothing else is running on it (but it has no problem reaching 1 GBit/s with “regular” packets).
You may test yourself with

netperf -H $DestIP -f M -t UDP_STREAM -- -m 128

Check it in both directions (here: 128 bytes per packet).
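The packet-rate arithmetic above can be sketched in a few lines of Python. The 20 ms packetization and the 182-byte UDP length are the figures from this post; the 20-byte IPv4 header and 38 bytes of Ethernet overhead (header, FCS, preamble, inter-frame gap) are standard values I am assuming apply here, and the function names are made up for illustration.

```python
def rtp_packets_per_second(calls: int, ptime_ms: float = 20.0) -> dict:
    """Packets per second for `calls` two-way RTP calls,
    one packet every `ptime_ms` milliseconds per direction."""
    per_direction = calls * (1000.0 / ptime_ms)
    return {"per_direction": per_direction, "total": 2 * per_direction}


def wire_rate_mbit(pps: float, udp_length: int = 182,
                   ip_header: int = 20, ethernet_overhead: int = 38) -> float:
    """Approximate load on the wire in Mbit/s for a given packet rate."""
    bytes_on_wire = udp_length + ip_header + ethernet_overhead
    return pps * bytes_on_wire * 8 / 1_000_000


if __name__ == "__main__":
    rates = rtp_packets_per_second(400)
    print(f"{rates['per_direction']:.0f} packets/s per direction")
    print(f"{rates['total']:.0f} packets/s in total")
    print(f"about {wire_rate_mbit(rates['total']):.1f} Mbit/s on the wire")
```

Under these assumptions 400 calls come to roughly 77 Mbit/s, far below a gigabit link, which is why the packet rate rather than the raw bandwidth is the figure worth testing.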

My traffic is not on the Internet. The SIP lines are directly connected to the network card of the server, and there is no latency to the SBC IP of the telecom provider (no Internet is used for this).

Where exactly did you do the measurement of your traces? Both on the same machine or one in the network and one on the asterisk machine itself? - I am measuring between my Asterisk machine and the telecom line, which is connected to a network card on the same Asterisk machine.
Are you using udp or tcp for signaling? - UDP.

Is your problem RTP or SIP related? - SIP related (SIP over UDP).
I am using the law codec on the SIP trunk, and my network card has 1 Gbit/s of bandwidth.
