Performance issue with idle AMI connections

Hi!

We are using the AMI “Redirect” action to forward a call from a queue to a new phone. (Which some ~20 queue-members already ringing.)
We’re debugging an issue where this causes ~1.5s CPU “sys” load (not 100% reproducible), delaying AMI events on the same connection.

In an attempt to narrow it down, I’ve managed to reproduce the “sys” load by creating 100 extra AMI connections (manager.conf: read=call, write=call) that do nothing and ((carry no TCP traffic))*. (In reality we have ~20, doing some polling.)

*((Edit: Sorry, small mistake: In my reproduction attempt, the TCP connections do see lots of AMI events, which justifies the load. In the real system, the AMI login is with “events: off”. Changed the reproducer, events are gone from TCP stream, but load/issue is still as described below. Traffic verified in wireshark.))

Those 100 connections trigger 170ms of additional CPU “sys” load per “Redirect”, which surprises me a bit (they are inactive?). Is this expected to cause problems?

Any ideas what we can try / check / analyze?

Asterisk 18.6.0 on Rocky VM. (Same issue with earlier versions.)
Below is a strace and perf output during the load:

strace -p $(pgrep -f /usr/sbin/asterisk) -f -c

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 91.77   80.143584        6077     13188     13151 poll
  2.51    2.189540        2302       951           epoll_wait
  2.34    2.044230       24336        84        73 restart_syscall
  1.78    1.556101        4590       339        42 futex
  0.56    0.489837      122459         4           nanosleep
  0.40    0.347178          26     13226     13226 rt_sigreturn
  0.34    0.300971          22     13226           tgkill
  0.21    0.184507          13     13253           getpid
  0.06    0.053148          50      1062           write

perf record -a ; perf report

Samples: 33K of event 'cpu-clock:pppH', Event count (approx.): 8267500000
Overhead  Command          Shared Object             Symbol
  98.65%  swapper          [kernel.kallsyms]         [k] native_safe_halt
   0.26%  asterisk         [kernel.kallsyms]         [k] vmware_sched_clock
   0.04%  asterisk         [kernel.kallsyms]         [k] _raw_spin_unlock_irqrestore
   0.03%  asterisk         app_queue.so              [.] ast_copy_string
   0.03%  asterisk         asterisk                  [.] hash_ao2_iterator_next
   0.03%  swapper          [kernel.kallsyms]         [k] finish_task_switch
   0.03%  asterisk         [kernel.kallsyms]         [k] tcp_poll

Small correction (previous post edited): I noticed the ~20 AMI connections actually do login with “events: off”. Without that, they would get a significant amount of events.

Maybe those suppressed events still have a cost, e.g. in terms of threading overhead?

Each thread still receives notification of the events, and goes through them applying filtering.

Thanks. I’ve noticed the “tgkill” syscall count scales (roughly) with the number of “Event: QueueMemberStatus” AMI messages. (That would be sent for each queue via each AMI connection, if they were not filtered.)

In the real setup we have a bunch of phones (~30) that are all members of in the same queues (~40) and a few AMI connections (~70). This leads approx. to 30x40x70 = ~80’000 “tgkill” syscalls. (Plus associated syscalls of the same number: “poll”, “rt_sigreturn”, “getpid”, leading to 1-2s CPU time spent in the Linux kernel.)

Possible mitigation:

[declined_message_types]
decline=queue_member_status_type 

I have measured that this mitigation works. It drastically reduces the number of syscalls, as well as the CPU load.

Not sure yet if we will be able to use it, but we probably don’t use those QueueMemberStatus messages. (Certainly we don’t need them per-queue in AMI.) I hope disabling it doesn’t affect the “inuse” ringing strategy, or something else asterisk-internal…

I’m somewhat surprised that asterisk is sending a signal (SIGURG, I think) in addition to sending messages…? Would it be possible to disable this, just to see if it helps?

If you modify the code… you could… but that likely has consequences to it. SIGURG is used to wake up threads that are polling so they can do something.

Ah okay, I was looking at the poll syscalls - it makes more sense if SIGURG is the only notification.

Maybe ideally there would be a single SIGURG for the whole loop that goes through all queue members. Doesn’t look like an easy change, though. Or maybe the OS was supposed to combine the multiple signals into a single thread wake-up; in that case, maybe there is a fix at the OS/threading level.

But now that we understand what triggers the CPU load, we can probably just avoid it.