Asterisk semi-unresponsive, SIP deadlock

Hello

I am having a problem with Asterisk 1.6.2.17 (installed from source) running on Centos 5.4. I am using FreePBX 2.8.1 for configuration. We mainly have SIP softphones, with a couple of SIP hard phones and IAX extensions. We have one SIP trunk that all our calls are routed over.

The system was running virtually fine with Asterisk 1.6.1, but since upgrading to 1.6.2.17 last week, Asterisk is frequently becoming partially unresponsive (deadlocked?). Yesterday, this happened 4 times. When it happens, It is not possible to make a call, internal or external, and the softphones display ‘408 unregistered, timeout’ warnings. Asterisk is not crashing with a core dump. It seems to me that SIP is getting stuck, as there are some other things that are logged when it is deadlocked.

The log file does not indicate any error, or anything consistently happening just before it deadlocks. Once the deadlock happens, the Asterisk process has to be killed with -9 for it to restart. Issuing “amportal restart” or doing a regular kill on the asterisk process makes asterisk log this “Waiting for inactivity to perform halt…” but it just sits there.

Log output:
[Mar 8 10:42:36] VERBOSE[730] app_dial.c: – Called 513
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 916 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 517 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 265 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 513 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 264 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 261 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 259 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 271 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 514 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 915 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 915 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 912 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 272 (queued)
[Mar 8 10:42:36] VERBOSE[372] chan_sip.c: == Extension Changed 513[ext-local] new state Ringing for Notify User 511 (queued)
[Mar 8 10:51:07] NOTICE[385] chan_iax2.c: Peer ‘263’ is now UNREACHABLE! Time: 37
[Mar 8 10:51:17] NOTICE[386] chan_iax2.c: Peer ‘263’ is now REACHABLE! Time: 36
[Mar 8 10:52:14] VERBOSE[773] logger.c: Asterisk Event Logger Started /var/log/asterisk/event_log
[Mar 8 10:52:14] VERBOSE[773] config.c: == Parsing ‘/etc/asterisk/asterisk.conf’: [Mar 8 10:52:14] VERBOSE[773] config.c: == Found
[Mar 8 10:52:14] VERBOSE[773] loader.c: Asterisk Dynamic Loader Starting:
[Mar 8 10:52:14] VERBOSE[773] config.c: == Parsing ‘/etc/asterisk/modules.conf’: [Mar 8 10:52:14] VERBOSE[773] config.c: == Found
[Mar 8 10:52:14] NOTICE[773] loader.c: 2 modules will be loaded.

I had seen in another thread that DNS timeouts could be the problem, but I don’t think it is. I have a caching nameserver running on the box.

Make sure that you have optimisation disabled and thread debugging enabled, in the compile options, and rebuild, if necessary.

The next time this happens, try to connect a new CLI process and enter the command “core show locks”.

Rather than using kill -9 to terminate the process, attach a debugger and obtain backtraces as described in doc/backtrace.txt (in the Wiki for later versions), or use gcore to force a dump.

After obscuring sensistive date, raise a bug report on issues.asterisk.org and attach the relevant files.

Before using kill -9, try kill -3 which should also force dump, although possibly only of the active thread. kill -9 should only ever be used when all else fails.

Thank you for the suggestions, I will have a bash