I am running the latest Asterisk 20 on a very large server (48 cores, real hardware, lots of memory), but when the number of channels reaches 300, call quality starts to degrade with increased jitter, while the OS load always stays under 10. Which parts of the Linux server OS can be tweaked to better suit what Asterisk needs? What about the PJSIP max threads? How can I know how many max threads to set? Is there a guide for analyzing and optimizing the OS or the Asterisk configuration to provide the best environment?
That’s because Linux does its own caching. I imagine that if you let the disk controller cache, it could reorder the writes in a way that is more vulnerable to power failures.
Disk writes should only be an issue if you are recording all calls.
I don’t think it is a disk bottleneck. There are no processes in D state when checking with “top”, and all cores show 0.0% in the wait state. During normal calls, there is very little disk activity.
Disk controllers usually enable a write cache only when they are battery backed, so that in case of a power failure they can still write to disk what they hold in memory. These servers have no battery-backed disk controller, just NVMe disks. The load on the disk subsystem is always zero.
From experience, setting up incoming calls requires more resources than handling established ones.
A machine that can handle 1000 simultaneous calls could still have trouble accepting 50 new ones per second.
At first sight, I would say 300 channels is not very demanding for simple trunking.
Can you elaborate on “when the number of channels reaches 300, the call quality starts to decrease with increased jitter”? How is call quality evaluated, for instance?
I agree: once the call is established, Asterisk should be very good at just forwarding RTP packets, given that there is no transcoding. In the busiest moments, there are 40 new calls every minute. The dialplan can be complex, but that number doesn’t seem very high.
However, I was expecting to see some kind of activity on each of the CPU cores, but “top” shows at most 2% in the us column and 98% in the idle one.
So I think I am missing something.
It doesn’t seem to be the disk, because I don’t see any wait % in top.
It doesn’t seem to be the CPU, because I see at most 2% in the us column in top.
It doesn’t seem to be the network, because at peak it uses less than 20 Mbit/s.
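One thing that averaged figures can hide is a single core doing most of the packet work. As a rough sketch (not something established in this thread), the Python below compares two snapshots of /proc/stat and reports, per core, the busy percentage and the share spent in softirq. The function names are made up for illustration, and it assumes the usual field order on modern kernels (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def parse_proc_stat(text):
    """Return {core_name: [counters]} for the per-core lines (cpu0, cpu1, ...)."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and every non-CPU line.
        if fields and fields[0].startswith("cpu") and fields[0][3:].isdigit():
            cores[fields[0]] = [int(v) for v in fields[1:]]
    return cores


def per_core_usage(before, after):
    """Percent busy and percent softirq for each core between two snapshots."""
    usage = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue
        delta = [n - o for n, o in zip(new, old)]
        # Assumed order: user nice system idle iowait irq softirq steal.
        # Guest time is already included in user, so only the first 8 count.
        total = sum(delta[:8])
        if total <= 0:
            continue
        idle = delta[3] + delta[4]  # idle + iowait
        usage[name] = {
            "busy": 100.0 * (total - idle) / total,
            "softirq": 100.0 * delta[6] / total,
        }
    return usage


def read_snapshot(path="/proc/stat"):
    """Read and parse the current counters (Linux only)."""
    with open(path) as handle:
        return parse_proc_stat(handle.read())
```

To use it, call read_snapshot() twice a second or so apart during a busy period and pass both results to per_core_usage(). A core that shows a high softirq share while the others sit idle would be worth a closer look.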
I think ultimately you need to narrow it down, before we go down the road of any changes. A packet capture for a period of time can provide data for analysis directly in Wireshark for timing, jitter, timestamps. It can also be used to compare the ingress packet with the egress packet and how long it took to traverse Asterisk. Depending on the result that narrows down where to actually identify/resolve any potential issues. If things pass through Asterisk quickly and correctly then you know Asterisk is not the problem and something outside of it is, such as the Linux instance, or perhaps even something networking related.
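As a minimal sketch of the timing analysis, the Python below computes the interarrival jitter estimate defined in RFC 3550 from a list of (arrival time, RTP timestamp) pairs for a single stream, such as could be exported from a capture. The function name, the input format, and the 8000 Hz clock rate (typical for G.711) are assumptions for illustration. It ignores RTP timestamp wraparound and packet reordering.

```python
def interarrival_jitter(packets, clock_rate=8000):
    """Estimate RFC 3550 interarrival jitter for one RTP stream.

    packets: iterable of (arrival_time_seconds, rtp_timestamp) in arrival order.
    clock_rate: RTP clock rate in Hz (8000 is assumed here).
    Returns the final jitter estimate in milliseconds.
    """
    jitter = 0.0          # running estimate, in RTP timestamp units
    prev_transit = None
    for arrival, rtp_ts in packets:
        # Relative transit time: arrival expressed in timestamp units,
        # minus the sender's RTP timestamp.
        transit = arrival * clock_rate - rtp_ts
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            # Smoothing with gain 1/16, as specified in RFC 3550.
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter * 1000.0 / clock_rate
```

Running this on the same stream captured on the ingress side and on the egress side gives two numbers to compare: if jitter is already high on ingress, the cause is upstream of Asterisk. If it only appears on egress, the server itself is the place to look. Wireshark's RTP stream analysis reports a comparable figure without any scripting.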
If stuff like this has been done then elaborating on it would be good.