TL;DR: Asterisk 13 app_queue.c with a high queue count causes high load and system CPU time. This is exacerbated by the CPU core count available, because the high load comes from thread mutex locks. The more cores, the more time the mutexes take, the less time for real call handling. A high number of queued callers causes choppy audio because of the lack of available CPU time, effectively denying system usability for existing bridged calls.
I currently run a vanilla install of Asterisk 13.6.0 with a couple minor patches of mine for after-call-work automatic pausing as well as tweaked reason text for that and automatic pausing. It is running on Gentoo in a Qemu KVM virtual machine running on a host with dual E5-2620 v3 cpus at 2.4GHz and 128G RAM. The virtual machine (as I write this) currently has 4 virtual cpus and 4G RAM.
I have been battling with a scalability issue in queueing calls in app_queue.c. We currently run nine ACD queues. Seven of the nine contain about 15 people. Four of those seven queues are “Support,” for Customers, Sales, Contractors, and Tenants. They contain the same people (using the whisper feature to inform the agent what the their role should be for the call).
During the week of the huge snowstorm on the east coast about a month ago, we were daily hammered with calls way above our usual call volume, each of those support queues were constantly filled with about ten to forty people waiting for an agent. We regularly had between about 40 to 120 calls waiting. When this happened, system started to have choppy audio. As I’d listen to the audio, it did NOT sound like the chops or static that packet loss sounds like, where a packet is lost every hundred milliseconds or so. It would drop audio for rather long chunks of time (still a fraction of a second per instance) before the audio would pick up again. This would happen to all of the current calls in progress, including the unbridged ones which were simply listening to hold music in the queue. The nature of how the audio chopped rendered most calls in progress useless, because there wasn’t enough audio for any of the parties to understand each other.
Needless to say, our Asterisk phone system was not very popular that week. I was confused because we had handled 40 or so concurrent calls before without a hitch. But something about queueing the calls was driving the system bonkers.
In my troubleshooting, I noticed that when this would happen, the CPU load (meaning the number of processes waiting in line for CPU time) would easily raise over the number of cores in the machine, sometimes going over 30. That isn’t 30% cpu usage, its 30 concurrent processes waiting for CPU time. I tried to do my due diligence as a system administrator to find the bottleneck in the system. I determined that this was absolutely not a network problem. We were far from capacity in both packets per second and bandwidth. I determined that this had little to do with any cpu-intensive things, like transcoding, recording to disk, or encoding those recordings into mp3. I am nicing mp3 encoding to the lowest priority anyway. The disk usage was relatively idle even with the recording. In watching top, and switching to threads, I noticed that the cpu usage of each Asterisk thread when things were good was about <2%, but every once in a while, when the load would skyrocket, things would exacerbate and each thread would suddenly consume over 15% CPU. Yet, I saw that this was still not using much actual user CPU computation time. Most cpus on the system had a lot of idle time left over, and most of the high CPU usage was found in the system category (which I understand is low level kernel or libc code).
My first reflex was to give the machine more CPU cores, because a high load should definitely be alleviated by more cores chomping away at the computation, right? I bumped the virtual machine from 8 virtual cores (which become 8 threads on the host) to 12 virtual cores (12 threads on the host). I did not want to go much beyond that because though the host has 24 threads, 12 of those are hardware-virtualized “hyper threads.” To my surprise, this exacerbated the problem. The load often skyrocketed to near 100, and the Asterisk threads under duress consumed much more cpu and each core was spending the same time in system usage, even though there were more cores.
I thought that maybe there was contention with other virtual machines on the host, so I migrated all of the other virtual machines away, leaving the asterisk virtual machine alone on the host, having all of the cpu resources to itself. That didn’t help at all.
I thought that maybe it was wasting CPU by transcoding the hold music on the fly, so I made sure that the hold music was encoded for the native call encoding. That didn’t make any difference. I already had it encoded for ulaw, and adding the other encodings was superfluous – it was already using ulaw. I tried turning the hold music off entirely. That didn’t help either.
I started testing with a test virtual machine with various hardware configurations and recompiled Asterisk for debugging and started dumping gmon profiles. I used sipp to simulate callers queueing up in the various queues. It seemed to take a few more sipp connections to get the system to the level where it acted like Production, probably because it is fake RTP traffic and because my test system had zero actual calls in progress, but I was able to duplicate the problem nonetheless. In the gprof dumps, I noticed that most of the CPU time was spent in thread mutex locks and atomic integer fetches. As I browsed the app_queue.c, I noticed that it locks out all other threads while it checking to see if it can hand a call off to an agent in ast_tryconnect, among other functions. That explained the high system wait time. That explained the high load. This is thread mutual exclusive lock contention.
I thought that maybe this had to do with the overhead of virtualization, and that the lock contention might be alleviated by reducing any latency that virtualization may cause. So, I revamped my test OS to fit in a Linux container instead of a KVM image. In the Linux container, Asterisk had access to the bare-hardware kernel with all 24 virtual cpus, 12 cores and 12 hyperthreads. This did not help either; it exacerbated the problem even more. For grins and giggles I tested outside of a Linux container with the same result.
In all my testing it started to become clear to me: the more CPU cores that are available to Asterisk, the more expensive queueing calls in app_queue.c is. At this point, I backed the production machine back to 4 virtual CPUs, and started to drop queued calls out of the queue to voicemail after five minutes of hold time to keep the number of calls in the queue low. Keeping the amount of queued calls in each queue under about 10 ensured that nary an RTP audio packet was dropped because of constipated CPUs. Real bridged channels use hardly any cpu in comparison, but they sure are subject to lock contention constipation.
Could Asterisk queues possibly perform better on a single CPU machine? My gut kept saying “no way” because Asterisk was consuming all 12 to 24 cores in system time alone to hold some 80 fake queued calls. But it’s easy enough to test. I spun up the test Virtual Machine with only one virtual core and tested. I watched as system load increased and held between 10 and 20 (which is still scary high by the way), but there it was, queued with 120 calls, and not dropping any audio because of any cpu time contention. I pushed it up to 160 queued calls, and things began to become wiggy, but to think it was handling this capacity without the extra multiprocessing was surprising to me. This probably makes sense because the volatile variables, mutex locks, and other hot memory will sit in the L1/L2 caches of the core it’s forced to run on.
As I was trying to research this problem, I found the following thread which I believe is quite related to the problem I am seeing:
At the end of that thread, the solution was to downgrade to the Asterisk 11.6 (Certified) version, because it apparently does not have the lock contention flaws that appears in subsequent versions in 11, and probably all of 13. I don’t have the luxury to downgrade to 11, because I’ve already become highly invested in the new ways that Queues are handled in Asterisk 13 in all of my queue management, reporting, and real-time monitoring scripting.
I tried using cgroups and cpuset to exclusively run Asterisk on one core, and then placing ALL other OS operations on other cores. That seemed to slightly mitigate the problem, except that I couldn’t find a way to get AGI scripts and other shell-out calls from Asterisk to extricate themselves from the exclusive cpuset, as well as the fact that I actually wouldn’t mind channel and bridge threads to run on other cores where plenty of CPU cpu time is available. This seems to be the only thing I can really do at this point, since I don’t really have any expertise in the app_queue.c code itself.
I think this should be classified as a bug: it seems to be some sloppy multiprocessing programming for app_queue.c. There needs to be a more elegant way to handle mutually exclusive locks, and a less thread-blocking way to check for available agents and distribute queued calls. Another option might be to be able to set the affinity for all app_queue.c threads to a single core, so it doesn’t have to contend with all of the other truly multithreaded processes, like bridges and channels.
Would it be inappropriate for me to report this as a bug in the Asterisk Jira?
Have any of you run into this problem and how did you fix it?