Finding server bottleneck when running asterisk

I am running the latest asterisk 20 on a very large server, 48 cores, real hardware, lots of memory, but when the number of channels reaches 300, the call quality starts to decrease with increased jitter while the OS load is always under 10. Which parts of the Linux server OS can be tweaked to better suit what asterisk needs? What about the PJSIP max thread? How can I know how many max threads to set? Is there a guide to analyze and optimize the OS or Asterisk configuration to provide the best environment?

Leandro

Disk write cache comes to mind. It’s often ‘off’ by default in Linux.

That’s because Linux does its own caching. I imagine that if you let the disk controller cache, it could re-order the writes in a way that is more vulnerable to power failures.

Disk writes should only be an issue if you are recording all calls.

PJSIP threads also aren’t involved in media, they handle signaling.

You need to be more specific about your use case. What the calls are doing, calls per second if calls are being set up, what the dialplan is doing.

I don’t think it is a disk bottleneck. There are no processes in D state when checking with “top”, and all cores show a 0.0% waiting state. During normal calls, there is very little disk activity.

Disk controllers usually activate a write cache when they are battery backed, so that in case of power failure they can still write to disk what they have in memory. These servers have no battery-backed disk controller, just NVMe disks. The load of the disk subsystem is always zero.

From experience, setting up incoming calls requires more resources than handling established ones.
A machine that could handle 1000 simultaneous calls could have trouble accepting 50 new ones per second.
At first sight, I would say 300 channels is not very demanding for simple trunking.

Can you elaborate on “when the number of channels reaches 300, the call quality starts to decrease with increased jitter”? How is call quality evaluated, for instance?

I agree: once the call is established, asterisk should be very good at just forwarding RTP packets, given that there is no transcoding. In the busiest moments, there are 40 new calls every minute. The dialplan can be complex, but that number doesn’t seem really high.

However, I was expecting to see some kind of activity on each of the CPU cores, but “top” shows at most 2% in the us column, with 98% in the idle one.

So I think I am missing something.

It does not seem to be the disk, because I don’t have any wait % in top.

It does not seem to be the CPU, because I have at most 2% in the us column in top.

It does not seem to be the network, because at peak it uses less than 20Mbit/s.

What should I look at?

How did you evaluate things?

A point on networking, VoIP is a lot of tiny packets.

Assuming 400 calls:

400 * 50 = 20,000

Number of Calls * Number of Packets per Second per Call (1000 ms / 20 ms of media per RTP packet = 50)

That’s 20k packets per second in each direction.
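
As a sanity check on the packet rate, here is a minimal Python sketch. It assumes a fixed packetization time per stream (20 ms is a common default, but check what is actually negotiated); the function name is only illustrative. Note that the packetization time is what determines the packets-per-second figure (1000 ms / 20 ms = 50), so the two are not multiplied together.

```python
def rtp_packets_per_second(calls: int, ptime_ms: float = 20.0) -> int:
    """Estimate RTP packets per second, in one direction, for `calls` streams.

    ptime_ms is the amount of media carried by each RTP packet, in
    milliseconds. This is an assumption; 20 ms is common but not universal.
    """
    if calls < 0:
        raise ValueError("calls must not be negative")
    if ptime_ms <= 0:
        raise ValueError("ptime_ms must be positive")
    packets_per_stream = 1000.0 / ptime_ms  # e.g. 1000 / 20 = 50
    return round(calls * packets_per_stream)


if __name__ == "__main__":
    for calls in (300, 400, 1000):
        pps = rtp_packets_per_second(calls)
        print(f"{calls} calls at 20 ms ptime: {pps} packets/s per direction")
```

Even at a modest bit rate, this is a steady stream of small packets, so the packet rate is worth looking at in addition to the Mbit/s figure.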

I think ultimately you need to narrow it down, before we go down the road of any changes. A packet capture for a period of time can provide data for analysis directly in Wireshark for timing, jitter, timestamps. It can also be used to compare the ingress packet with the egress packet and how long it took to traverse Asterisk. Depending on the result that narrows down where to actually identify/resolve any potential issues. If things pass through Asterisk quickly and correctly then you know Asterisk is not the problem and something outside of it is, such as the Linux instance, or perhaps even something networking related.
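
For anyone who wants to check jitter numbers outside of Wireshark, here is a rough Python sketch of the interarrival jitter estimate described in RFC 3550 (the running average J = J + (|D| - J) / 16). It assumes you have already extracted, for a single RTP stream, the arrival time of each packet in seconds and its RTP timestamp; the function name, the 8000 Hz clock rate default, and the input layout are assumptions for illustration, and timestamp wraparound and packet reordering are ignored for brevity.

```python
def interarrival_jitter_ms(arrivals_s, rtp_timestamps, clock_rate=8000):
    """Return the RFC 3550 style interarrival jitter estimate in milliseconds.

    arrivals_s     -- packet arrival times in seconds (capture clock)
    rtp_timestamps -- RTP timestamp field of the same packets
    clock_rate     -- RTP clock rate in Hz (8000 assumed, as for G.711)
    """
    jitter = 0.0          # running estimate, in RTP timestamp units
    prev_transit = None
    for arrival, ts in zip(arrivals_s, rtp_timestamps):
        # Relative transit time: arrival converted to timestamp units,
        # minus the RTP timestamp of the packet.
        transit = arrival * clock_rate - ts
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter / clock_rate * 1000.0


if __name__ == "__main__":
    # Hypothetical sample: 20 ms packets, with the third one arriving 5 ms late.
    arrivals = [0.000, 0.020, 0.045, 0.060]
    timestamps = [0, 160, 320, 480]
    print(f"jitter estimate: {interarrival_jitter_ms(arrivals, timestamps):.3f} ms")
```

Running the same calculation on the ingress capture and on the egress capture of one call gives a quick indication of whether the jitter is already present when the packets arrive or is added while they pass through the server.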

If stuff like this has been done then elaborating on it would be good.