13.10.0 pjsip deadlock

This message appeared today in logs:
taskprocessor.c: The ‘pjsip/distributor-00000167’ task processor queue reached 500 scheduled tasks.

All peers started to disconnect and pjsip became unresponsive.

Any chance to debug it?

What sort of setup are you using? Are you connected to realtime? We also have a wiki page[1] which explains what that warning actually means and when it happens, and you can grab a backtrace[2], which may provide further detail on what is taking so long.

[1] http://blogs.asterisk.org/2016/07/13/asterisk-task-processor-queue-size-warnings/
[2] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-GettingInformationForADeadlock
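
If you want a concrete starting point, here is a minimal sketch of grabbing that information from the running (hung) process, assuming gdb is installed and Asterisk was built with DONT_OPTIMIZE (the output paths are just examples):

gdb -ex "thread apply all bt" --batch /usr/sbin/asterisk $(pidof asterisk) > /tmp/backtrace-threads.txt
asterisk -rx "core show locks" > /tmp/core-show-locks.txt   # only produces output if DEBUG_THREADS was compiled in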

I did not personally set up “realtime”, maybe it is turned on by default.

I use the FreePBX distro, updated from Asterisk 13.5.0.
I don't use any queues in my configuration, just trunks and extensions.

Before problem appears again I’ll try to get this information.

We have been having this issue for about 3 weeks now. It has been tough to track down: no real errors, just this warning, which seems to be associated with it. Asterisk stops responding to all PJSIP registrations. It does, however, still seem to register to our trunks.

We are on Asterisk 13.10 with the FreePBX distro. Very recent install. Happens randomly typically once a day.

We’d still need a backtrace to try to figure out what is going on.

I will try to get one on the next occurrence.

Is the ‘core show locks’ output important? I do not have that compiled in.

Both would be ideal.
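
Note that ‘core show locks’ only produces output when Asterisk was built with the DEBUG_THREADS compiler flag. As a rough sketch on a plain source build (the path and version are just examples; the FreePBX distro may package things differently):

cd /usr/src/asterisk-13.10.0      # example location of the source tree
./configure
make menuselect.makeopts
menuselect/menuselect --enable DEBUG_THREADS --enable DONT_OPTIMIZE menuselect.makeopts
make && make install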

Hello, I just signed up for an account here because I’m having the same exact issue.

I am running the FreePBX Distro, hosted in AWS, serving about 300 phones with up to 50 simultaneous calls, and we use PJSIP for everything. I have full FreePBX, Asterisk, and SAR logs recorded if you need them.

Here's the log output; we get a pjsip/distributor message and then all phones go offline until the server is rebooted. Our trunks are treated as PJSIP endpoints, so they are disconnected too. The FreePBX dashboard graph shows all the Asterisk endpoints dropping at that point.


[2016-09-01 21:06:52] WARNING[1963] taskprocessor.c: The ‘pjsip/distributor-00000116’ task processor queue reached 500 scheduled tasks.
[2016-09-01 21:07:13] VERBOSE[5562] res_pjsip/pjsip_configuration.c: Contact [redacted] is now Unreachable. RTT: 0.000 msec
[2016-09-01 21:07:13] VERBOSE[5562] res_pjsip/pjsip_configuration.c: Endpoint [redacted] is now Unreachable
[2016-09-01 21:07:55] VERBOSE[31317] res_pjsip/pjsip_configuration.c: Contact 4611/sip:4611@ [redacted]:1040 has been deleted
[2016-09-01 21:07:55] VERBOSE[31317] res_pjsip/pjsip_configuration.c: Endpoint 4611 is now Unreachable
[2016-09-01 21:07:55] VERBOSE[31317] res_pjsip/pjsip_configuration.c: Contact 4629/sip:4629@ [redacted]:1191 has been deleted
[2016-09-01 21:07:55] VERBOSE[31317] res_pjsip/pjsip_configuration.c: Endpoint 4629 is now Unreachable
[2016-09-01 21:07:55] VERBOSE[31317] res_pjsip/pjsip_configuration.c: Contact 4637/sip:4637@ [redacted]:1194 has been deleted

The referenced page states that this functionality is new in v13.10.0, which is where I started seeing this issue. Probably a good starting point for investigation.

http://blogs.asterisk.org/2016/07/13/asterisk-task-processor-queue-size-warnings/
pjsip/distributor

Task processor created by res_pjsip.so. PJSIP distributes incoming messages to these task processors if the message is not associated with any other task processor. The task processor assignment is determined by hashing the SIP request’s Call-ID and From tag. These task processors are new since Asterisk v13.10.0.
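
In practice you can watch this pool of distributor serializers and their queue depth from the shell; a quick check (the exact columns vary a bit between versions) is something like:

asterisk -rx "core show taskprocessors" | grep "pjsip/distributor"

Each line is one serializer; a queue that keeps climbing toward the 500-task high-water mark is the one that is stuck.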

We had this happen again. I was able to run this command; I'm not sure the data is helpful.
gdb -ex "thread apply all bt" --batch /usr/sbin/asterisk $(pidof asterisk) > /tmp/backtrace-threads.txt

This issue had not occurred for about a week, until today. One difference today is that the Internet provider was down, so one of the two trunks and some of the endpoints were unreachable. I know this does not cause the problem, but maybe it exacerbates it.

We would need to see the backtrace in order to determine what is going on (if it is a new issue, an existing filed one, or an already fixed one).

I am also new here, and the issue described here matches pretty much what I am seeing.

Also a FreePBX user. On Friday I found my office with no phones working. I quickly checked the Asterisk and pjsip logs, and what struck me was that I could see all the phones trying to register to the PBX (lots of REGISTER requests coming from the phones), yet there was absolutely no answer/reply to those requests. Hence, the phones could not register. The trunks were still online.

I looked at the chart which shows online vs. offline peers over time and found the exact time when everything fell apart: the line representing the number of online peers suddenly fell sharply and the number of offline peers rose sharply, just like the image BigChief posted.

Unfortunately, I could not spend much time debugging as I needed the PBX back up, so I applied every pending update and rebooted the PBX. Everything came back up and I had my phones working again.

Later, after the PBX was working again and I had found this thread, I searched for the "taskprocessor.c" warning message and found it, just as reported here:

[2016-09-15 19:07:34] WARNING[2043] taskprocessor.c: The ‘pjsip/distributor-0000004a’ task processor queue reached 500 scheduled tasks.
[2016-09-15 19:07:34] WARNING[2043] taskprocessor.c: The ‘pjsip/distributor-0000004a’ task processor queue reached 500 scheduled tasks.

In my case, two such messages were logged at exactly the same time.

Right after those lines, the log had lots of:
[2016-09-15 19:11:35] VERBOSE[32099] res_pjsip/pjsip_configuration.c: Contact [redacted] is now Unreachable. RTT: 0.000 msec

This shows that the issue started right around the moment the taskprocessor.c warning was logged.

Now, I know that what is needed is a backtrace, and I will try to get one if/when it happens again, but I am the fourth one to report this here; maybe someone could take a look at it despite the absence of backtraces?

A note to the other FreePBX users reporting this issue here, I opened a ticket on FreePBX issue tracker:
http://issues.freepbx.org/browse/FREEPBX-13175

Usually, if a bug belongs upstream, issue reports like mine are closed and the FreePBX devs tell you to open a ticket upstream. However, since all the users reporting this issue here are FreePBX users, I think it was worth opening so that someone on the FreePBX side pays attention to it.

I would like to ask everyone seeing this issue to vote for my issue report on the FreePBX tracker.

I confirm that this bug also exists in Asterisk 13.11.2; I was previously running 13.10, but now it has happened again on 13.11.2.

I have a backtrace now (but “core show locks” does not work). I hope it will shed some light on this issue:
http://pastebin.com/6HHxDz2E

A few other pieces of info:

  • Asterisk is reporting 3 active calls when no call is ongoing;
  • Many hours after this bug occurred, the task processor still looks like this (from ‘core show taskprocessors’):
    Processor Processed In Queue Max Depth Low water High water
    pjsip/distributor-0000004e 110 515 2 450 500
  • Command ‘core show channels’ shows two extensions with state “ringing” when they are not.

As I mentioned on your filed issue this has already been fixed and will be in the next release.

Thanks for your attention to this topic, but are you sure it is the same issue?

The taskprocessor.c message on the issue that was fixed mentions a completely different processor: here we have a problem with a processor named ‘pjsip/distributor-00000XXX’, while on the issue you say is fixed the problem was with the processor ‘subm:rtp_topic-000000de’.

Also, in my case there is no masquerading going on, and it seems the issue you marked this one as a duplicate of had a lot to do with masquerading.

If you are not 100% sure these issues are related, could you reopen the ticket so at least it gets a chance to be looked at?

I was preparing to recompile Asterisk 13.11.2 with the right flags and to turn on debug logging, but once you quickly marked this bug as a duplicate of another one that is already fixed, I was discouraged from investing that much time in it. If you reconsider, I will work on adding these pieces of info to the bug report (see the command sketch after this list):

  • debug log
  • another backtrace (from recompiled asterisk 13.11.2)
  • core show locks
  • core show taskprocessors
  • core show threads
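
To be concrete, this is roughly how I plan to collect them, assuming I have rebuilt with DEBUG_THREADS and DONT_OPTIMIZE enabled in menuselect (the output file names are just examples):

asterisk -rx "core set debug 5"                        # raise the debug level; logger.conf must route debug messages to a file
asterisk -rx "core show locks" > /tmp/locks.txt
asterisk -rx "core show taskprocessors" > /tmp/taskprocessors.txt
asterisk -rx "core show threads" > /tmp/threads.txt
gdb -ex "thread apply all bt" --batch /usr/sbin/asterisk $(pidof asterisk) > /tmp/backtrace.txt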

The backtrace you have provided is the same as the one on the linked issue, and a masquerade is going on; you just may not know it.

From the linked issue backtrace:

Thread 19 (Thread 0x7f0a23691700 (LWP 8490)):
#0  0x00007f0ad5537b20 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x000000000053d01d in __ast_cond_wait (filename=0x7f0aa82e1bd2 "res_pjsip_session.c", lineno=1583, func=0x7f0aa82e2d00 <__PRETTY_FUNCTION__.26246> "ast_sip_session_suspend", cond_name=0x7f0aa82e1c10 "&suspender->cond_suspended", mutex_name=0x7f0aa82e2088 "ao2_object_get_lockaddr(suspender)", cond=0x7f0a78018fe8, t=0x7f0a78018f90) at lock.c:600
#2  0x00007f0aa82e0bee in ast_sip_session_suspend (session=0x7f0ab0173ea8) at res_pjsip_session.c:1583
#3  0x00007f0a246d92bf in chan_pjsip_indicate (ast=0x7f0ab00eb5f8, condition=34, data=0x7f0a74115758, datalen=4) at chan_pjsip.c:1335
#4  0x00000000004b9df3 in indicate_data_internal (chan=0x7f0ab00eb5f8, _condition=34, data=0x7f0a74115758, datalen=4) at channel.c:4559
#5  0x00000000004ba27b in ast_indicate_data (chan=0x7f0ab00eb5f8, _condition=34, data=0x7f0a74115758, datalen=4) at channel.c:4725
#6  0x00000000004c01d2 in channel_do_masquerade (original=0x7f0a7c49c7d8, clonechan=0x7f0ab00eb5f8) at channel.c:6614
#7  0x00000000004cafa4 in ast_channel_move (dest=0x7f0a7c49c7d8, source=0x7f0ab00eb5f8) at channel.c:10552
#8  0x0000000000484ee6 in after_bridge_move_channel (chan_bridged=0x7f0ab00eb5f8, data=0x7f0a7c49c7d8) at bridge_channel.c:1824
#9  0x0000000000477c85 in ast_bridge_run_after_callback (chan=0x7f0ab00eb5f8) at bridge_after.c:243
<< Private Info Removed >>
#18 0x00000000005796ea in __ast_pbx_run (c=0x7f0ab00eb5f8, args=0x0) at pbx.c:4235
#19 0x000000000057ae16 in pbx_thread (data=0x7f0ab00eb5f8) at pbx.c:4555
#20 0x00000000006008f6 in dummy_start (data=0x7f0ab0141220) at utils.c:1235
#21 0x00007f0ad553261a in start_thread () at /lib64/libpthread.so.0
#22 0x00007f0ad486e59d in clone () at /lib64/libc.so.6

From your backtrace:

Thread 9 (Thread 0x7fda80fec700 (LWP 5464)):
#0  0x00007fdad088368c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000539850 in __ast_cond_wait (filename=0x7fdacd4b5149 "res_pjsip_session.c", lineno=1583, func=0x7fdacd4b6500 "ast_sip_session_suspend", cond_name=0x7fdacd4b527e "&suspender->cond_suspended", mutex_name=0x7fdacd4b5b70 "ao2_object_get_lockaddr(suspender)", cond=0x7fdac01152a8, t=0x7fdac0115250) at lock.c:600
#2  0x00007fdacd4b2b6d in ast_sip_session_suspend (session=0x7fdac00f0878) at res_pjsip_session.c:1583
#3  0x00007fda5efde2df in chan_pjsip_indicate (ast=0x7fdac006cf88, condition=34, data=0x7fda80fe41f4, datalen=4) at chan_pjsip.c:1335
#4  0x00000000004b8429 in indicate_data_internal (chan=0x7fdac006cf88, _condition=34, data=0x7fda80fe41f4, datalen=4) at channel.c:4559
#5  0x00000000004b88ad in ast_indicate_data (chan=0x7fdac006cf88, _condition=34, data=0x7fda80fe41f4, datalen=4) at channel.c:4725
#6  0x00000000004bea54 in channel_do_masquerade (original=0x7fdac00c74b8, clonechan=0x7fdac006cf88) at channel.c:6614
#7  0x00000000004c9b05 in ast_channel_move (dest=0x7fdac00c74b8, source=0x7fdac006cf88) at channel.c:10552
#8  0x00000000004838b3 in after_bridge_move_channel (chan_bridged=0x7fdac006cf88, data=0x7fdac00c74b8) at bridge_channel.c:1824
#9  0x0000000000476465 in ast_bridge_run_after_callback (chan=0x7fdac006cf88) at bridge_after.c:243
#10 0x000000000046c791 in ast_bridge_join (bridge=0x7fdac008af68, chan=0x7fdac006cf88, swap=0x0, features=0x7fda80fe4990, tech_args=0x0, flags=3) at bridge.c:1716
#11 0x000000000050c474 in ast_bridge_call_with_flags (chan=0x7fdac006cf88, peer=0x7fdac0113488, config=0x7fda80fe5a30, flags=0) at features.c:658

So the backtrace is the same, and looking at it as a whole shows it to be the same problem that was already fixed.

Great, thanks so much! Now crossing fingers for 13.12 to be released soon.

Will check why masquerade is happening and see if I can disable it for now.

Masquerades are an internal thing that can't be disabled; they are used by various pieces of functionality. If you can determine which specific usage in your setup causes them, you could discontinue that.

Hi, I have just experienced this (or a very similar) issue on the 13.14 branch. Under minimal load, Asterisk stopped accepting SIP messages (chan_pjsip), all my phones were unable to renew their registrations, and after a while the described task processor error message appeared. Asterisk did not crash, so I have no dump. During this problem CPU usage spiked to about 50% (2 out of 4 CPUs were fully utilized).

System:
Virtualized 4CPU/4GB RAM
FreePBX 10.13.66-19 with 80 extensions all on PJSIP

Log:
[2017-05-27 18:34:24] ERROR[1616] res_pjsip.c: Error 120032 ‘Broken pipe’ sending OPTIONS request to endpoint 7000
[2017-05-27 18:34:24] VERBOSE[1327] res_pjsip/pjsip_configuration.c: Contact 7000/sip:7000@10.0.68.207:56883;transport=TCP is now Unreachable. RTT: 0.000 msec
… (similar output for other phones)
[2017-05-27 19:28:31] VERBOSE[1327] res_pjsip/pjsip_configuration.c: Contact 7203/sip:7203@10.0.68.121:5060;user=phone has been deleted
[2017-05-27 19:28:32] VERBOSE[1327] res_pjsip/pjsip_configuration.c: Endpoint 7203 is now Unreachable
[2017-05-27 23:40:01] WARNING[20722] taskprocessor.c: The ‘pjsip/default-0000000d’ task processor queue reached 500 scheduled tasks.

Could you please point me in the right direction on how to trace this issue and, ideally, prevent it from happening again?

Thank you.