High CPU spikes with PJSIP and Asterisk 13.15

Hi all,

I am migrating the old elastix system to new FreePBX Distro (latest 10.13.66, Asterisk 13.15) and I am experiencing strange CPU spikes from time to time even though the system is under no load. Spikes sometimes exceed 100% CPU (more than one fully utilized core). I have KVM-based virtualized environment with 4CPUs.
Could you please point me to a direction, how to figure out, why is this happening? I tried ps -LlFm and pstack with high load threads with following description:

#0  0x00007f140e3dc6d0 in memcpy () from /lib64/libc.so.6
#1  0x00007f14103d5eff in ?? () from /usr/lib64/libsqlite3.so.0
#2  0x00007f14103c0560 in sqlite3_step () from /usr/lib64/libsqlite3.so.0
#3  0x00000000004f98f5 in ast_db_gettree ()
#4  0x00007f13c5d5cfde in ?? () from /usr/lib64/asterisk/modules/res_sorcery_astdb.so
#5  0x00000000005ca2c1 in ast_sorcery_retrieve_by_regex ()
#6  0x00007f138eb57f6b in ast_sip_location_retrieve_aor_contacts_nolock () from /usr/lib64/asterisk/modules/res_pjsip.so
#7  0x00007f138eb5a587 in ast_sip_location_retrieve_aor_contacts () from /usr/lib64/asterisk/modules/res_pjsip.so
#8  0x00007f138eb5a827 in ast_sip_for_each_contact () from /usr/lib64/asterisk/modules/res_pjsip.so
#9  0x000000000045eb0c in ?? ()
#10 0x000000000045ee3f in __ao2_callback ()
#11 0x00007f138eb591c1 in ?? () from /usr/lib64/asterisk/modules/res_pjsip.so
#12 0x00007f138eb5c1e4 in ast_sip_cli_traverse_objects () from /usr/lib64/asterisk/modules/res_pjsip.so
#13 0x00000000004da4cc in ast_cli_command_full ()
#14 0x00000000004da630 in ast_cli_command_multiple_full ()
#15 0x0000000000453fce in ?? ()
#16 0x0000000000600bb4 in ?? ()
#17 0x00007f140f0b3aa1 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f140e43b93d in clone () from /lib64/libc.so.6

My top results are:


top - 12:46:03 up  2:51,  2 users,  load average: 0.10, 0.14, 0.13
Tasks: 175 total,   1 running, 174 sleeping,   0 stopped,   0 zombie
Cpu0  : 32.7%us,  4.3%sy,  0.0%ni, 63.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu1  :  9.0%us,  3.3%sy,  0.0%ni, 87.7%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Cpu2  : 17.9%us,  1.6%sy,  0.0%ni, 74.3%id,  0.0%wa,  0.0%hi,  0.0%si,
6.3%st
Cpu3  : 58.0%us,  2.3%sy,  0.0%ni, 39.7%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Mem:   3922216k total,  2187576k used,  1734640k free,    93200k buffers
Swap:   786428k total,        0k used,   786428k free,   534736k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2061 asterisk  20   0 3021m  88m  16m S 99.0  2.3  17:45.63 asterisk
20974 asterisk  20   0  331m  26m 9056 S  4.3  0.7   0:00.13 php
20950 root      20   0  110m 2444 1252 S  2.7  0.1   0:00.82 htop
 1492 mysql     20   0  947m  40m 6284 S  1.7  1.1   0:36.99 mysqld
 3114 mongodb   20   0  327m  43m  20m S  0.7  1.1   0:33.33 mongod
    4 root      20   0     0    0    0 S  0.3  0.0   0:00.16 ksoftirqd/0
  860 root      20   0     0    0    0 S  0.3  0.0   0:00.39 kauditd
 1746 asterisk  20   0 4607m 484m  18m S  0.3 12.6   2:59.63 java
 2140 root      20   0  323m  16m 9100 S  0.3  0.4   0:06.46 php
20951 root      20   0 15032 1308  948 R  0.3  0.0   0:00.08 top
    1 root      20   0 19360 1500 1196 S  0.0  0.0   0:00.81 init

System is currently configured with 250 PJSIP endpoints (150 online). No other channel is in active use.

I have seen similar report here, but with no answer, just pointing at PJSIP inefficiency.

Thank you.

The trace you have provided is from a CLI command listing out either AORs or contacts, which hits the disk for the sqlite database. Itā€™s possible that is taking a period of time resulting in high CPU time, or it takes a bit to create the results which uses CPU.

Generally though you either have to eliminate usage to try to isolate what is using CPU, or use a profiling tool to see what is taking CPU time and then looking at code to try to understand why/how to improve it.

Thank you VERY much. Now I know what to look for. My guess is that freepbx does this listing for its statistics and part will be my job of running pjsip list * for monitoring. Is there a less CPU-expensive way of getting the online endpoints? Is pjsip show * similar from the performance point of view?

Thank you once again.

It depends on what exactly is being queried. In the case of contacts we store those in the sqlite3 database by default, so they get pulled from there. Everything else by default is stored in memory.

Thank you, will do some testing.

Hi again,

I thought I solved the problem. According to jcolps suggestions I changed the monitoring and stopped asking asterisk for information about contacts. It looked pretty good, spikes were much less common.

But during the week I added 200 more PJSIP extensions (now 500 in total) and the spikes got worse. Now I get 200+% CPU utilization on asterisk every 40-50 seconds even if there is no load at all (just the usual OPTIONS, REGISTER).

I kind of wonder, if the PJSIP is ready to be used with hundreds of extensions. Does anyone have any suggestions, how to optimize the asterisk behavior? I doubt I am first one to experience this.

Thank you very much

Weā€™ve tested it with upwards of 1000 endpoints. But the usage patterns and how things are stored dictate the repercussions of doing so, along with traffic usage itself. This is an area that could certainly be improved to handle more cases but it takes quite a long time to set up scenarios, test, isolate the problem areas, determine how (even if) they can be improved.

I think that, with the serialisation question, you may be the first person with the requirement, so, even if it tries to lock things properly, the code may not have been well exercised.

Thank you guys for your answers. Should I then file a minor bug report? Can I help somehow to isolate the problem? It is a production environment with FreePBX so I cannot recompile the whole stuff. I will definitely touch the 1000 endpoints threshold in several months so it would be great for me to know if I can rely on PJSIP or should consider switching back to CHAN_SIP instead.

Thanks again

Itā€™s not something weā€™d really accept a bug report for in its current state, itā€™s also not something that is likely to be worked for quite awhile unless a community member picks it up. Someone ultimately has to do the work to isolate things as Iā€™ve said. As for switching back to chan_sip, thatā€™s up to you.

Thank you jcolp and please forgive me, if my bad english caused you to think I want you to decide which channel is ok for my environment. In fact, I am rather interested in your personal opinion about PJSIP and its current state. Is it something anyone can rely on? My point is, I got this feeling, but during this migration from oldy Asterisk 1.8.14 that worked almost like a charm I have experienced deadlocks and this strange CPU utilization, which makes me wonder if PJSIP is ready to be used in medium business. I thought there would be tons of guys with 500+ endpoints configuredā€¦

Nevermind. I am definitely prepared to do my part as the member of community, but am rather inexperienced in this. So, for you guys to take this problem as a bug and start working on it, what is it you need? I can create a testing environment and compile there whatever you want. I am also capable of simulating 500+ endpoints in terms of registrations and calls. So please, how should I proceed?

Thank you very much.

I think Jonathon meant the state of the bug report. I think he is saying there is not enough information to locate the problem well enough for anyone to want to investigate it.

Very well, for that there is the part where I am offering some testing, but I dont know what you needā€¦Backtraces? Do I need to compile asterisk with some special flags (DONT OPTIMIZE, BETTER BaCKTRACESā€¦)?

Thank you

We would need complete details about the environment. Whether .conf files are in use and their contents, if any AMI is going on at the time, dialplan, everything. We essentially have to reproduce the exact environment and start profiling it to see where CPU time is being spent. I also have no time frame on when we would get to this. While there are people running such a configuration there are not a lot, and they have differing configurations resulting in different performance characteristics. Itā€™s a large amount of time spent for only a few people, so it goes lower in the list for now.

Thank you guys. I understand that this is not a number one priority. First, I will try the same setup in different virtual environment for myself. Then, If I see no change in behavior, I will do what I can to deliver all the data you requested.

Thank you very much for your time and expertise.

Hello everyone!

I have the very same problem with Asterisk version 13.17.2. I have several Asterisk systems with lots of extensions which work perfectly on chan_sip. Recently we decided to migrate to chan_pjsip as it has many features we would like to use and which are not available in chan_sip, one of them is maxcontacts directive and ability have static contacts along with dynamic ones. I set everything up, first with 1-2 extensions, tested - everything is good. However as soon as I add all of the extensions i need, and it is around 2500, on pjsip reload console freezes, CPU goes to 100% and after pjsip is unresponsive, all extensions went to Unreachable state and phones canā€™t register.
I can see messages like:

The ā€˜subm:endpoint_topic_all-cached- taskā€™ - processor queue reached 500 scheduled tasks.

core show taskprocessors

shows queue and number in queue increasing under pjsip/default like this:

Processor                                      Processed   In Queue  Max Depth  Low water High water
app_voicemail                                          0          0          0        450        500
ast_msg_queue                                          0          0          0        450        500
CCSS_core                                              0          0          0        450        500
iax2_transmit                                          0          0          0        450        500
pjsip/default-0000000b                              1708      22675      22674        450        500
pjsip/default-0000000c                              1834      22673      22653        450        500
pjsip/default-0000000d                              1825      22672      22668        450        500
pjsip/default-0000000e                              1777      22668      22665        450        500
pjsip/default-0000000f                              1821      22674      22655        450        500
pjsip/default-00000010                              1697      22678      22654        450        500
pjsip/default-00000011                              1429      22864      22854        450        500
pjsip/default-00000012                              1825      22664      22639        450        500

I searched google a lot and found that pjsip now uses and rely on task processors and if there are many taskprocessors in queue then pjsip slows down until it frees queue. I even added 1 more core to this test server to see if it helps but asterisk now eat 200% of CPU and situation has not changed.
It is worth mentioning that all of my extensions have static
contact=sip:${EXTEN}@IP:5060
directive because i need asterisk to dial not only registered IP but sending a call to another server as well.
As soon as i lower down number of loaded extensions to, say, 450 it loads, a bit slow but extension are working fine.
So it seems like the problem is in amount of extensions.

I tested this configuration and this problem happens on all versions beginning from version 13.8(didnā€™t test earlier ones) and to 15.
I gathered backtrace and ā€œcore show locksā€ information during this problem. Should this be added to asterisk tracker or there is no chance for this to be checked and fixed?

I appreciate you time and thank you in advance to all who can help me on that.

Any difference if you adjust the thread pool size in pjsip.conf?

I tested and now i donā€™t see any error messages. However cli still freezes and pjsip is unresponsive. I found one interesting thing is that if i disable qualify on all extensions - it takes about 20-25 seconds to reload pjsip module (it stucks for this times) but after this everything works fine. It still takes more time to reload pjsip and to show all endpoints but at least it doesnā€™t die on reload.
Hope that helps anyone.

Hello devox, do you know if this problem was solved ?

I have this problem too, after some ā€œreloadsā€ in cli the pjsip dont process more calls.

I use version 15.4.1

Hi,

No. I reduced number of endpoints on servers and if i need more - i disable qualify, it seems that the issue is with many qualify requests at the same time.