Asterisk hanging occasionally


#1

Lately I am having problems with an Asterisk server that has been working just fine for the last couple of months. What happens is that Asterisk process hangs - it is running and is accessible with “asterisk -r” but it does not receive/respond to any SIP messages.

The server setup is quite simple and the system size is very small. The system consist of:

  • HP Microserver server that is running CentOS 6.0 and Asterisk 1.8.5.0 (run as non-root)
  • ISDN-to-VoIP gateway (connected to Asterisk via SIP Trunk)
  • GSM-to-VoIP gateway (connected to Asterisk via SIP Trunk)
  • 2 Grandstream GXP2100 IP Phones

The DialPlan is quite simple - all incomming calls are routed to ring group 450, that rings both SIP Phones (200 and 201). All the calls are recored with the help of MixMonitor (I suspect that that the problem is here). This is the relevant part of the dialplan:

;Call to ring group 450
exten => 450,1,Set(MixMon_Filename=${STRFTIME(${EPOCH},,%Y%m%d-%H%M%S)}-${CALLERID(num)}-${UNIQUEID})
exten => 450,n,MixMonitor(${MixMon_Filename}.wav)
exten => 450,n,Dial(SIP/200&SIP/201,30,t)
exten => 450,n,Hangup()

The server is running very good, but lately we have been having problems with telephony not working. The cause of the problem is in the Asterisk process just hanging and not receiving/responding to any more SIP messages. Asterisk hanged for the first time a week ago (after couple months of normal operation), the second time yesterday morning and the third time today morning. The Asterisk process hanging is quite randomly and is becomming more and more frequent, which is concerning me.
This is the Asterisk process behaviour when the server gets in a state when telephony is not working:

  1. asterisk and safe_asterisk processes are running
  2. I can locally login to asterisk console via “asterisk -r” command
  3. command “sip show peers” is showing all peers as up and running (gateways and IP Phones have status OK)
  4. command “sip set debug on” is working OK, but when I make a call, I do not see any packets showing in Asterisk console. If I make a packet capture with tcpdump I can see that the SIP packets are reaching the server’s NIC, it is just not reaching the Asterisk process
  5. command “core restart now” is not working (after I issue the command, nothing happens)
  6. Linux command “kill asterisk_pid” does not kill the Asterisk process - I have to use “kill -9 asterisk_pid” to kill the process
  7. after I manuall kill the Asterisk process, safe_asterisk starts up the Asterisk again - after that the Asterisk process is running OK (accepting and replying to SIP messages as expected)

When I checked the Asterisk logs I saw that the last call log before the Asterisk hangs was:

[Sep 17 08:42:26] VERBOSE[13940] pbx.c:     -- Executing [450@incomming_calls:1] Set("SIP/GSM_Trunk_1-00000309", "MixMon_Filename=20110917-084226-055555555-1316241740.777") in new stack
[Sep 17 08:42:26] VERBOSE[13940] pbx.c:     -- Executing [450@incomming_calls:2] MixMonitor("SIP/GSM_Trunk_1-00000309", "20110917-084226-055555555-1316241740.777.wav") in new stack
[Sep 17 08:42:26] VERBOSE[13940] pbx.c:     -- Executing [450@incomming_calls:3] Dial("SIP/GSM_Trunk_1-00000309", "SIP/200&SIP/201,30,t") in new stack
[Sep 17 08:42:26] VERBOSE[13940] netsock2.c:   == Using SIP RTP CoS mark 5
[Sep 17 08:42:26] VERBOSE[13940] app_dial.c:     -- Called SIP/200
[Sep 17 08:42:26] VERBOSE[13940] netsock2.c:   == Using SIP RTP CoS mark 5
[Sep 17 08:42:26] VERBOSE[13940] app_dial.c:     -- Called SIP/201
[Sep 17 08:42:26] VERBOSE[13940] app_dial.c:     -- SIP/200-0000030a connected line has changed. Saving it until answer for SIP/GSM_Trunk_1-00000309
[Sep 17 08:42:26] VERBOSE[13940] app_dial.c:     -- SIP/201-0000030b connected line has changed. Saving it until answer for SIP/GSM_Trunk_1-00000309
[Sep 17 08:42:26] VERBOSE[13942] app_mixmonitor.c:   == Begin MixMonitor Recording SIP/GSM_Trunk_1-00000309
[Sep 17 08:42:26] VERBOSE[13940] app_dial.c:     -- SIP/201-0000030b is ringing
[Sep 17 08:42:26] VERBOSE[13940] app_dial.c:     -- SIP/200-0000030a is ringing

For reference, this is a log of a “normal call”:

[Sep 17 08:32:27] VERBOSE[13848] pbx.c:     -- Executing [450@incomming_calls:1] Set("SIP/GSM_Trunk_1-000002ff", "MixMon_Filename=20110917-083227-055555555-1316241141.767") in new stack
[Sep 17 08:32:27] VERBOSE[13848] pbx.c:     -- Executing [450@incomming_calls:2] MixMonitor("SIP/GSM_Trunk_1-000002ff", "20110917-083227-055555555-1316241141.767.wav") in new stack
[Sep 17 08:32:27] VERBOSE[13848] pbx.c:     -- Executing [450@incomming_calls:3] Dial("SIP/GSM_Trunk_1-000002ff", "SIP/200&SIP/201,30,t") in new stack
[Sep 17 08:32:27] VERBOSE[13849] app_mixmonitor.c:   == Begin MixMonitor Recording SIP/GSM_Trunk_1-000002ff
[Sep 17 08:32:27] VERBOSE[13848] netsock2.c:   == Using SIP RTP CoS mark 5
[Sep 17 08:32:27] VERBOSE[13848] app_dial.c:     -- Called SIP/200
[Sep 17 08:32:27] VERBOSE[13848] netsock2.c:   == Using SIP RTP CoS mark 5
[Sep 17 08:32:27] VERBOSE[13848] app_dial.c:     -- Called SIP/201
[Sep 17 08:32:27] VERBOSE[13848] app_dial.c:     -- SIP/200-00000300 connected line has changed. Saving it until answer for SIP/GSM_Trunk_1-000002ff
[Sep 17 08:32:27] VERBOSE[13848] app_dial.c:     -- SIP/201-00000301 connected line has changed. Saving it until answer for SIP/GSM_Trunk_1-000002ff
[Sep 17 08:32:27] VERBOSE[13848] app_dial.c:     -- SIP/201-00000301 is ringing
[Sep 17 08:32:27] VERBOSE[13848] app_dial.c:     -- SIP/200-00000300 is ringing
[Sep 17 08:32:29] VERBOSE[13848] app_dial.c:     -- SIP/200-00000300 connected line has changed. Saving it until answer for SIP/GSM_Trunk_1-000002ff
[Sep 17 08:32:29] VERBOSE[13848] app_dial.c:     -- SIP/200-00000300 answered SIP/GSM_Trunk_1-000002ff
[Sep 17 08:33:25] VERBOSE[13848] pbx.c:   == Spawn extension (incomming_calls, 450, 3) exited non-zero on 'SIP/GSM_Trunk_1-000002ff'
[Sep 17 08:33:25] VERBOSE[13849] app_mixmonitor.c:   == MixMonitor close filestream
[Sep 17 08:33:25] VERBOSE[13849] app_mixmonitor.c:   == End MixMonitor Recording SIP/GSM_Trunk_1-000002ff

From what I can gather here is that there is some problem with the MixMonitor that is causing the Asterisk to hang. I am not sure if I am correct (and even if I am right, I do not know how to fix it), so I need your help with identifying and fixing the problem.

Two things that I have also noticed and find a bit odd are also:

With the call that caused the Asterisk process to hang, the “Begin MixMonitor Recording” message came very late - after the “Called SIP/200” and “Called SIP/201” messages. I have been observing other call logs and I did not see the “Begin MixMonitor Recording” message showing up so late - usually the messages show up in the same order as in the upper call log.

  1. What does the “connected line has changed. Saving it until answer for” mean?

This is my first serious problem with Asterisk so I hope that we can get to the bottom of it. Thanks for all the help in advance.


#2

This sounds like a deadlock. You need to build with thread debugging enabled (and no optimisation). When it jams, issue the command “core show locks”. Also follow the advice that is given if you google for “asterisk wiki backtrace” to get a backtrace.

Although running non-root probably isn’t causing the problem, it may complicate the process of installing the debug version and obtaining the backtrace.

The connected line has changed message will relate to a new feature, the ability to change caller and called ID information mid-call. I’m not familiar with the details.


#3

david55 thank you very much for a very detailed answer. Getting a backtrace seems like a very good idea, but I would need to re-compile the Asterisk on the server. My problem is that this is a production server that can not afford any long downtime. So I defnitily do not have permission to play arround with it :smile: The fact that this bug happens very randomly does not help matters out.

This problem sounds very strange to me, because I am running Asterisk 1.8.5.0 on 2 or 3 different other servers (that are running much larger ammount of internal phones and call load) and I do not have any issues with this Asterisk version. But it is true that I do not use MixMonitor on that servers and all those severs are running CentOS 5.6 or 5.7.

If I would go and make a backtrace - wouldn’t it make sense that I first upgrade to the latest Asterisk version and see if this problem also happens on the latest Asterisk version?

What are the chances that the problem is in the CentOS 6.0-Asterisk issue? I was checking the CentOS logs and could not see anything wrong with Asterisk process. On the CentOS 6.0 I have SELinux enabled to the default value (SELINUX=enforcing, SELINUXTYPE=targeted - I did not change any default SELinux settings). Would it make sense to disable SELinux?


#4

In my experience, deadlocks are in Asterisk and they can be very difficult to reproduce.


#5

You could try checking what Asterisk is doing, if it has any subprocesses spawned and what are they doing. I have had similar issues with other software, not with Asterisk, caused by the application waiting to write to a piped program, which in turn has stopped reading stdin for some reason (in my case full hard drive).
Check (using ps -eLf) Asterisk threads and their state. Try running a strace against the asterisk process and threads to see what are they doing.

This is not going to fix your issue, but should generally help you track it down.


#6

Most Asterisk deadlocks are between threads and you need to have quite a good understanding of the internals to work out what they are from the backtrace (although an unoptimised one won’t be accepted in a bug report, it would stil indicate the general area in the code).

For most deadlocks, core show locks is the best initial debugging tool, but it does require a build with the necessary code included.


#7

I have been working with Linux and Asterisk for quite some time and this is the first time that I came across deadlocks. So please excuse any possible dumb questions that I might have.

If I run “ps -eLf | grep asterisk” I see about 43 processes and I do not see anything wrong now when all is running OK.

So what do you propose I do in this case? Upgrade to 1.8.6.0 (compile with debug support) and see if the deadlock happens again?

Would it make sense to disable the SELinux or this should not have any influence on the matter?

If we are not able to solve the deadlock issue, do you think that periodical (daily) Asterisk reboots would help?


#8

Most deadlocks are not dependent on up time. They are the result of race conditions which tend to be random, although they are more likely when the system is busy.

Most multi-threaded systems are vunlnerable to deadlocks although they can and should be designed out.


#9

I’m having the same problem. In my case, it’s a school, so I usually can’t afford any extra downtime for the debug - though with these hints, i’ll try it the next time.

What happens for me: call come in on analog (sangoma & dahdi), hits the day/night switch, goes over to a ring group of 5 phones. Of those 5, 1 does not exist yet. This works great. However, in both instances where it hung, the SIP channels did not work:

[Sep 21 11:03:30] VERBOSE[5287] pbx.c: – Executing [s@macro-dial-one:37] Dial(“DAHDI/3-1”, “SIP/7001,15,r”) in new stack
[Sep 21 11:03:30] VERBOSE[5287] netsock2.c: == Using SIP RTP TOS bits 184
[Sep 21 11:03:30] VERBOSE[5287] netsock2.c: == Using SIP RTP CoS mark 5

and what would normally follow:

[Sep 21 07:26:01] VERBOSE[4618] app_dial.c: – Called 7000

is NOT present; at that point, the calls aren’t going anywhere. Asterisk still responds to asterisk -r

I didn’t have SIP debug on, and I got no useful debug as noted here. But twice in the space of 15 minutes is really peculiar.

Next time (which is hopefully a long time from now), i’ll try to remember the debug.

Asterisk 1.8.4 built by root @ phone on a i686 running Linux on 2011-05-27 00:46:09 UTC

Any thoughts?

jim o


#10

There is a recent issue on the issue tracker about problems when calling unreachable SIP devices.


#11

I have made some progress on this problem. I have narrowed down the problem to the time when a caller puts an existing call on hold and picks up another incomming call that comes in on another telephone line. It might also have something to do with only putting the caller on hold, the deadlock trigger is definitly in there somewhere.

When I test out the scenario with 2 incomming calls on the same phone (1st caller is put on hold until the agent finishes with the 2nd caller), the server soon starts to experience problems in a form that the asterisk process starts burning 100% of the CPU.

top - 10:35:28 up 7 days, 22:37,  2 users,  load average: 1.09, 0.95, 0.51

Tasks: 139 total,   1 running, 138 sleeping,   0 stopped,   0 zombie

Cpu(s):  3.5%us,  1.0%sy,  0.0%ni, 95.0%id,  0.4%wa,  0.0%hi,  0.1%si,  0.0%st

Mem:    899708k total,   867404k used,    32304k free,   228004k buffers

Swap:  2096112k total,     2352k used,  2093760k free,   466676k cached



  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                         

 1733 _asteris  20   0 39660  17m 7360 S 97.7  2.0  43:57.77 asterisk      

By running “sip show channelstats” I also noticed that neither of the SIP channels closed down even though both calls have long ended:

192.168.9.220 - IP Phone
192.168.9.210 - GSM-VoIP gateway (connected to Asterisk via SIP Trunk)
192.168.9.211 - ISDN-VoIP gateway (connected to Asterisk via SIP Trunk)

192.168.9.220    362b4cb7111  00:10:11 0000010656  0000072807 (87.23%) 0.0000 0000010692  0000000000 ( 0.00%) 0.0005

192.168.9.220    331c9724262  00:05:32 0000000514  0000000000 ( 0.00%) 0.0000 0000000528  0000000000 ( 0.00%) 0.0001

192.168.9.211    7d7542991d6  00:05:37 0000001304  0000000000 ( 0.00%) 0.0000 0000001258  0000000000 ( 0.00%) 0.0001

192.168.9.210    cb561f26703  00:10:16 0000016169  0000000000 ( 0.00%) 0.0000 0000016100  0000000000 ( 0.00%) 0.0000

I just do not get why Asterisk is not closing down active channels if the call has ended. The IP phone that is used in the setup is Grandstream GXP2110. I did not check the SIP messages that the SIP phone is sending to Asterisk when this happens. I will first upgrade the Asterisk version from 1.8.5.0 to 1.8.7.0 and if this does not solve the issue, I will go into further debugging.

Does anyone have any good ideas to why something simple as putting a caller on hold or switching between two active SIP lines on the IP phone is bringing down the server?


#12

SIP channel structures are held for a few tens of seconds after the call because the other side may lose a packet and send a retransmission which needs to be matched against the call.

Internally, the operations you describe are amongst the most complex in Asterisk.