res_pjsip configuration causing high CPU load

I am running Asterisk 20.8.1, and I have a CPU load issue with the following sorcery.conf setup:

[res_pjsip]

endpoint/cache=memory_cache,expire_on_reload=no,object_lifetime_stale=3600,object_lifetime_maximum=3720,full_backend_cache=yes
endpoint=realtime,ps_endpoints

auth/cache=memory_cache,expire_on_reload=no,object_lifetime_stale=3600,object_lifetime_maximum=3720,full_backend_cache=yes
auth=realtime,ps_auths

aor/cache=memory_cache,expire_on_reload=no,object_lifetime_stale=3600,object_lifetime_maximum=3720,full_backend_cache=yes
aor=realtime,ps_aors

domain_alias/cache=memory_cache,expire_on_reload=no,object_lifetime_stale=3600,object_lifetime_maximum=3720,full_backend_cache=yes
domain_alias=realtime,ps_domain_aliases

transport=config,pjsip.conf,criteria=type=transport

contact/cache=memory_cache,expire_on_reload=no,object_lifetime_stale=3600,object_lifetime_maximum=3720,full_backend_cache=yes
contact=realtime,ps_contacts

When the contact section is not set (commented out), we have observed that with 30-40 concurrent calls the CPU load increases from 1-3 to 12-20 (on an 11-core CPU). In addition, Playback audio and the calls themselves start to lag, as if there were packet loss. What could be causing this?

When you say “commented out”, are you commenting out both the contact/cache and the contact entries? If you do that, then contact storage reverts to the astdb.sqlite3 database in /var/lib/asterisk. This should be MUCH FASTER than using a database backend, with much lower CPU utilization, since the sqlite3 database will most probably be in memory. For this reason, we typically recommend that realtime not be used for contacts unless you need to share them with another hot-standby Asterisk instance.
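In other words, “commented out” here would mean something like this in your sorcery.conf (taking the lines from your post), which makes sorcery fall back to its default astdb storage for contacts:

;contact/cache=memory_cache,expire_on_reload=no,object_lifetime_stale=3600,object_lifetime_maximum=3720,full_backend_cache=yes
;contact=realtime,ps_contacts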

If you’ve truly commented out both contact lines in sorcery.conf and CPU utilization still increases, then I’d suspect a host filesystem issue. Is this a bare-metal or virtualized environment? If virtualized, what is the backend storage type of the VM filesystem? Local to the host or SAN? Block device, QCOW file, etc.?

If you’ve only commented out the contact/cache line, then every request for a contact object will result in a round trip to the database and back.
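That partially-commented variant would look like this (again based on your posted config), with no cache layered in front of the realtime backend:

;contact/cache=memory_cache,expire_on_reload=no,object_lifetime_stale=3600,object_lifetime_maximum=3720,full_backend_cache=yes
contact=realtime,ps_contacts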

How many contacts are there in total? How many are permanent (via a ‘contact’ parameter in an aor) vs. dynamic (via inbound registration)?
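To illustrate the distinction, a permanent contact is defined statically on the aor, while a dynamic one is created when a device registers. A hypothetical pjsip.conf aor showing both (the name and address are made up):

[1000]
type=aor
; permanent contact, configured statically
contact=sip:1000@192.0.2.10:5060
; allow one dynamic contact via inbound registration
max_contacts=1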

Also, once a call is in progress, there’s very little activity on the contact objects involved, so something isn’t adding up.

Thank you for your response.
Both contact lines were commented out. The problem is likely due to the astdb. Since the PBX operates in failover mode, storing the astdb in sqlite3 didn’t always yield good results: when a failover occurred and service switched to the PBX on the other VM, the Queue and DeviceState data didn’t always return to their previous state. There were cases where Queue and DeviceState statuses were lost, and more rarely, the sqlite3 file became corrupted.
Therefore, we decided to migrate astdb storage to MySQL using a patch.
Is there a solution for storing the astdb in sqlite3 where, in a failover system, the other VM can receive the astdb state before the switchover occurs?

Just trying to understand what you are doing. When you say failover mode, do you mean a couple of machines running VMware (like this), or are you using something like High Availability for Asterisk (HAast from Telium)?

If VMware, you would have to constantly copy the astdb (whether SQLite or MySQL) from the running node to a standalone device, and then restore it from that device to the failover node AFTER the failover node has started but before the Asterisk service has started. If HAast, you could designate the astdb (in either format) to be synchronized by HAast, which will automatically prevent a corrupted astdb from moving to the secondary node (so that should not be a factor).

Going back to your question: is the high CPU load because Asterisk is starting with a corrupt astdb, or is it really a PJSIP issue, a filesystem issue, a flood of network traffic, etc.? Have you checked I/O stats to see where the bottleneck is?

Our system is built similarly to the VMware solution you mentioned. We previously tried copying the astdb.sqlite3 file, and after that we tried storing it on DRBD storage, but there were always exceptional cases where, for some reason, the astdb file could become corrupted (it happened a few times), which is why we decided to store the astdb in MySQL.

We considered trying this solution: Corosync with Asterisk or AIS, but in the end we did not try it.
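From what I understand, it would have been configured via res_corosync.conf along these lines (a minimal sketch based on the stock sample options; we never tested it, and it assumes the underlying Corosync cluster is already set up):

[general]
; publish our device state changes to the other cluster nodes
publish_event = device_state
; accept device state changes published by the other nodes
subscribe_event = device_state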

The high CPU load was because we commented out the entire contact section in the sorcery configuration, which led to contacts being written to the astdb (we didn’t know it would store data there in that case). However, since our astdb is stored in MySQL, it makes sense that when calls arrived in many Queues with many Endpoints, Asterisk was constantly writing to and reading from MySQL (~22 queries/sec). I believe this explains why performance degraded compared to when contact caching was enabled in sorcery. The difference between the two sorcery settings was clearly visible in Percona, with the number of MySQL astdb queries dropping to about one sixth.
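For reference, the contact entries that end up in the astdb can be listed from the CLI; with the default sorcery mapping they live in the ‘registrar’ family (assuming a stock mapping, the family name could differ between versions):

asterisk -rx "database show registrar"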

What concerns me is that in the long term (with more Endpoints/Queues) this problem might reoccur. Based on the above, storing the astdb in MySQL is likely not going to yield good results in our case. That’s why I asked whether there is any best practice (possibly the aforementioned Distributed Device State) for carrying Queue and DeviceState state over after the switchover.

I’m still a bit confused. If you store contacts directly in the database via sorcery.conf and you use your astdb-mysql patch to have everything else in the astdb stored in MySQL, are things still not working well?

Where did you get the astdb-mysql patch from? How and where is the MySQL database implemented?