PJSIP stops responding

Sometimes PJSIP stops responding.

For example, an IP phone sends a REGISTER and PJSIP does not respond,
or an extension places a call (INVITE) and PJSIP does not respond.

The only way to get it working again is to restart Asterisk.

The server carries a lot of traffic: 6000 endpoints and 300 simultaneous calls. It runs for 5 to 12 days, and then this happens.

I analyzed the PJSIP debug log but did not see anything.

Do you get warnings about insufficient resources? Also, what does htop say? Depending on the configuration and whatever else runs on the server, you might run out of memory after some time; excessive paging could then occur, and that would explain what you see.

The other question is of course that 6000/300 looks very ambitious. I think you would usually want some kind of load balancing and something that detects failures. The simplest solution could be to place a Kamailio server, configured as a stateless proxy, in front of a couple of Asterisk boxes. If you can detect failures within a couple of seconds, maintenance would also be a lot simpler. On the other hand, things get more complicated if you have to support a lot of telephony features, like subscriptions, etc.

This could be a deadlock; see "Getting a Backtrace" on the Asterisk Project Wiki.

In any case we need to know the exact version of Asterisk, and I think the core show channels output and pjsip show channels output might be instructive.

My version is 18.12.1

I have 5 servers with 6k-7k endpoints. Memory on the server is fine: it has 128 GB, and normally 20-30 GB is free.

Doesn’t a backtrace only return information if the Asterisk process crashes?

You can force a dump from a running process. There will be a pause whilst it is running, but it sounds like the system is no longer usable. You can also force a stop with a dump by using, for example, kill -6.

That figure generally isn’t useful, as Linux will use free space for caches and buffers, and only try and maintain a limited amount that is truly free. However, the error here is probably towards greatly over-estimating your real memory needs.

OK, I will kill the process to get a backtrace.

What do you mean by “However, the error here is probably towards greatly over-estimating your real memory needs.”?

There is probably a lot more than 20-30GB available to Asterisk, if it needs it. You are probably not using anything close to 100GB for Asterisk. The figures below are from a desktop system, not from one running Asterisk, but you are probably looking at the Free figure, when the more realistic indication of the memory that is effectively free is the available figure:

root@dhcppc4:~# free
              total        used        free      shared  buff/cache   available
Mem:        8114032     6209040      356296      371636     1548696     1256228
Swap:      16601084      453284    16147800

Note that, if you have any swap usage on your system, you did have a memory crisis at some stage.
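To make the free versus available distinction concrete, here is a minimal sketch that parses the text printed by `free` and pulls out the three figures discussed above. It assumes the newer procps-ng column layout shown in this post (with an `available` column); the function names `parse_free` and `memory_report` are made up for this example.

```python
def parse_free(output):
    """Parse the text printed by `free` into {row: {column: KiB}}.

    Assumes the procps-ng layout with a header line followed by
    'Mem:' and 'Swap:' rows; older layouts are not handled.
    """
    rows = [line.split() for line in output.splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    table = {}
    for label, *values in data:
        # zip() stops at the shorter sequence, so the Swap row
        # (3 values) only gets the total/used/free columns.
        table[label.rstrip(":").lower()] = dict(
            zip(header, (int(v) for v in values))
        )
    return table


def memory_report(table):
    """Return the figures that matter for the discussion above."""
    mem = table["mem"]
    swap = table.get("swap", {})
    return {
        "free_kib": mem["free"],            # truly unused memory
        "available_kib": mem["available"],  # realistic headroom
        "swap_used_kib": swap.get("used", 0),
    }


# Sample copied from the desktop system quoted in this thread.
SAMPLE = """\
              total        used        free      shared  buff/cache   available
Mem:        8114032     6209040      356296      371636     1548696     1256228
Swap:      16601084      453284    16147800
"""

if __name__ == "__main__":
    print(memory_report(parse_free(SAMPLE)))
```

On the sample above, the available figure is several times larger than the free figure, and the non-zero swap usage is the sign of an earlier memory shortage mentioned in the note.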

If the process is salvageable, you can use gcore to get a snapshot of the running process, although it may still cause a few seconds of disruption.
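As a rough sketch of how the gcore snapshot could be scripted, the helper below builds the `gcore -o <prefix> <pid>` command line and optionally runs it. The function names and the default prefix `/tmp/asterisk-core` are assumptions for illustration only; gcore ships with gdb and normally needs to be run as root or as the user that owns the Asterisk process.

```python
import subprocess


def build_gcore_command(pid, prefix="/tmp/asterisk-core"):
    """Build the argument list for `gcore -o PREFIX PID`.

    gcore writes the snapshot to PREFIX.PID. The default prefix
    here is only an example location.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return ["gcore", "-o", prefix, str(pid)]


def dump_running_process(pid, prefix="/tmp/asterisk-core"):
    """Run gcore against a live process; raises if gcore fails."""
    subprocess.run(build_gcore_command(pid, prefix), check=True)


if __name__ == "__main__":
    # Only print the command; replace 1234 with the real Asterisk PID.
    print(" ".join(build_gcore_command(1234)))
```

The process is paused while the snapshot is written, so on a busy server expect the few seconds of disruption mentioned above.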

Hi, it looks like a stasis overload or an Asterisk performance misconfiguration.

Which Asterisk version are you running?
What kind of Asterisk architecture are you running? ARA? If yes: which realtime modules are you running on your Asterisk, and how many?
Are you using sorcery strategies?
Are you using direct media on your PJSIP endpoints? (A good configuration could avoid an Asterisk PBX overload.)
Are you qualifying your PJSIP endpoints?
Have you configured the Linux daemon properly for your users’ system requirements?

Regards,

Hello,

The version is 18.12.0.
I don’t use realtime; my configuration is in files.

I don’t use sorcery.

I don’t use direct media because the calls need to be recorded.

What do you mean by qualifying endpoints?

Sorry, small mistake on my side; I meant: Are you qualifying your contacts? (Asterisk 18 Configuration_res_pjsip - Asterisk Project - Asterisk Project Wiki)

More questions:

  1. Have you detected any problems (retransmissions) with your PJSIP timers? (Asterisk 18 Configuration_res_pjsip - Asterisk Project - Asterisk Project Wiki)

  2. What kind of PJSIP transport are you using? Are you using TLS on your Asterisk?

  3. When your PBX crashes (the pjsip module), do the end users still have RTP audio?

When your PBX crashes, or in your log history, check whether the Asterisk full log file contains any messages from “taskprocessor.c”.

From https://www.asterisk.org/asterisk-task-processor-queue-size-warnings/:
“….

  • Disable features you do not need. It is important that if you do not need features like Homer(HEP), CDR, CEL, or AMI that you do not enable them and for HEP you shouldn’t load those modules either.

  • You can adjust the thread pool parameters for PJSIP and stasis task processors. The PJSIP thread pool is configured in pjsip.conf. The stasis thread pool is configured in stasis.conf.
    …”

Finally, take a look at the sorcery cache system: Sorcery - Asterisk Project - Asterisk Project Wiki

I use the UDP, TCP and WSS transports.

When PJSIP stops responding, calls that are already in progress continue normally.

I will search the full log for taskprocessor messages at the next crash.

My stasis.conf
[threadpool]
initial_size = 15
idle_timeout_sec = 120
max_size = 60

This picture shows which tasks have many calls.

About qualifying contacts: I don’t have many contacts with this option.

This is from today:

[Jun 15 08:31:52] WARNING[25221][C-00000db0] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks.

[Jun 15 08:31:59] WARNING[21789][C-00000b89] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:32:14] WARNING[9083][C-00000dca] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:32:19] WARNING[9252][C-00000dd6] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:32:46] WARNING[9176][C-00000dcf] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:32:47] WARNING[9842][C-00000def] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:33:44] WARNING[18575][C-00000d89] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:33:44] WARNING[17508][C-00000e21] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:33:48] WARNING[9978][C-00000df7] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:33:49] WARNING[3353][C-00000d29] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:33:53] WARNING[17636][C-00000de6] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:33:53] WARNING[17646][C-00000e29] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:33:54] WARNING[3353][C-00000d29] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:33:54] WARNING[17668][C-00000e2c] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:33:57] WARNING[17813][C-00000e37] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:33:58] WARNING[17846][C-00000e3e] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.

[Jun 15 08:33:59] WARNING[3353][C-00000d29] taskprocessor.c: The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks again.
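To see at a glance which task processors are backing up, the warnings can be tallied per queue name. The following is a minimal sketch, not an official Asterisk tool: the function names are made up, and `/var/log/asterisk/full` is only the usual default log location, which may differ on your system. The pattern accepts both straight and curly quotes, because forum software tends to rewrite the quotes seen in the excerpt above.

```python
import re
from collections import Counter

# Matches the queue name and reported level in lines such as:
#   ... taskprocessor.c: The 'stasis/p:channel:all-000043c8' task
#   processor queue reached 500 scheduled tasks again.
QUEUE_WARNING_RE = re.compile(
    r"taskprocessor\.c.*?The ['‘’](?P<name>[^'‘’]+)['‘’] "
    r"task processor queue reached (?P<level>\d+) scheduled tasks"
)


def count_queue_warnings(lines):
    """Count queue-size warnings per task processor name."""
    counts = Counter()
    for line in lines:
        match = QUEUE_WARNING_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts


def count_queue_warnings_in_file(path="/var/log/asterisk/full"):
    """Same tally, read from a log file (the path is an assumption)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_queue_warnings(handle)


if __name__ == "__main__":
    for name, hits in count_queue_warnings_in_file().most_common(10):
        print(f"{hits:6d}  {name}")
```

In the excerpt above every warning names the same queue, `stasis/p:channel:all-000043c8`, so the tally would point at the stasis channel topic subscriber rather than at a PJSIP task processor.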

Hi,
depending on your scenario, and if you have correctly configured the inbound auth (and the endpoints), I think it is not really necessary to send OPTIONS to your local endpoints (users’ phones), and this can be avoided. In some particular cases you may want the reverse process, where the UA sends OPTIONS to the SIP servers, for example in a disaster recovery scenario.

The stasis errors are not directly connected to Asterisk itself; in most cases they could be connected to a poorly developed or poorly designed dialplan.

You can try to avoid this behavior by following some best practices, which you can also find in the Asterisk documentation, including:

1) Load the "limit" module on your Asterisk and check the limits of your service.
2) Configure your Asterisk Linux daemon(s) as per your scenario requirements (check contrib/systemd/ in the Asterisk source directory, or other guides).
3) Optimize your PJSIP settings as per your scenario requirements.
4) Have you loaded all the PJSIP modules involved in your Asterisk environment, but not res_hep_pjsip.so?
5) Unload the unused modules from your Asterisk (CDR.csv, AMI, etc.).
6) Is the stasis module loaded in your Asterisk environment?

Remember: PJSIP has its own taskprocessor configuration, but changing it may not be necessary.

Note: the taskprocessor warning tells you “The ‘stasis/p:channel:all-000043c8’ task processor queue reached 500 scheduled tasks.”; so, what kind of operations are you doing (or not doing) on those channels? Could you try a more lightweight version of your PBX dialplan?

Sorry, I don’t have a plug-and-play solution for these cases, and maybe nobody does. You may need to start structured, organized troubleshooting if you want to eradicate this problem from your environment, and certainly try to get more information by enabling backtraces, as @david551 suggested above.

Hope this can help

Regards,