Channels getting stuck in "Ring" state

j3flight · June 16, 2009, 3:59pm

CentOS 5.2 x86_64
Asterisk 1.4.25
Zaptel 1.4.12.1
Digium TE412P

My Asterisk box is configured as a conferencing server. All 4 T1s are configured as FEATD_MF, and are connected to a Nortel DMS-100 switch. (I wish I could do PRI, but can’t.)

The system has worked great for quite a long time, except for the following issue, which seems to pop up about once a week… Every once in a while, a channel will get stuck in “Ring” state. For example:

Channel       Location        State    Application(Data)     
Zap/77-1      s@tdma:1        Ring     (None)                
1 active channel
0 active calls

The DMS-100 shows this channel “idle” while it is stuck in Asterisk. It will remain stuck like this permanently until I restart Asterisk. Once I do that (and nothing else), the channel is clear again with no problems.

I look at the logs and there is nothing to indicate what went wrong. The “bad” call comes in, the switch sends the ANI and DNIS via DTMF and then NOTHING. Where a normal call would get dropped into the dialplan, these “stuck” calls go nowhere…

Good call:

[Jun 14 07:14:16] DTMF[25747] channel.c: DTMF end '5' received on Zap/61-1, duration 0 ms
[Jun 14 07:14:16] DTMF[25747] channel.c: DTMF end accepted without begin '5' on Zap/61-1
[Jun 14 07:14:16] DTMF[25747] channel.c: DTMF end passthrough '5' on Zap/61-1
[Jun 14 07:14:16] DTMF[25747] channel.c: DTMF end '#' received on Zap/61-1, duration 0 ms
[Jun 14 07:14:16] DTMF[25747] channel.c: DTMF end accepted without begin '#' on Zap/61-1
[Jun 14 07:14:16] DTMF[25747] channel.c: DTMF end passthrough '#' on Zap/61-1
[Jun 14 07:14:17] VERBOSE[25747] logger.c:     -- Executing [1111111111@tdma:1] Answer("Zap/61-1", "") in new stack
[Jun 14 07:14:17] DEBUG[25747] chan_dahdi.c: Took Zap/61-1 off hook

Bad Call:

[Jun 14 07:21:07] DTMF[25761] channel.c: DTMF end '5' received on Zap/54-1, duration 0 ms
[Jun 14 07:21:07] DTMF[25761] channel.c: DTMF end accepted without begin '5' on Zap/54-1
[Jun 14 07:21:07] DTMF[25761] channel.c: DTMF end passthrough '5' on Zap/54-1
[Jun 14 07:21:07] DTMF[25761] channel.c: DTMF end '#' received on Zap/54-1, duration 0 ms
[Jun 14 07:21:07] DTMF[25761] channel.c: DTMF end accepted without begin '#' on Zap/54-1
[Jun 14 07:21:07] DTMF[25761] channel.c: DTMF end passthrough '#' on Zap/54-1

This problem has occurred on ALL the T1s in the group.
It has occurred in multiple locations in my dial plan (all under the TDMA context though, which is where inbound calls drop into from my T1s.)

I have switched cards to a new TE412P, no change.
I have rebooted the server, no change.
I have upgraded Asterisk, no change.
I have made changes to the dialplan, no change.
The T1s are error free.

Has anyone had any experience with this?
Thanks…

j3flight · June 20, 2009, 12:34am

bump…
Really, no one?

bkofd · June 23, 2009, 2:54am

Not even Soft Hangup works?

A while back we updated our CentOS server and the version of libpri we had been using would no longer compile, so we updated libpri to the latest but left the system running Zaptel. We started to see some really strange behavior; not the same as what you are having, but very odd. We fixed it by updating to DAHDI and updating our Asterisk (which was required to work with DAHDI).

My suggestion would be to update to the latest version of dahdi (may require a recompile of asterisk), and libpri if you are using it.

You can set ‘dahdichanname = no’ under [options] in asterisk.conf so that channels keep their Zap names and you don’t have to change the rest of your Asterisk configuration or dialplan. You will still need to configure the DAHDI drivers on the system, but dahdi_genconf tends to do a decent job of that for you.
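As a rough sketch, the setting mentioned above would look something like this in asterisk.conf. The option name comes from the post; check the sample config shipped with your Asterisk version before relying on it.

```
[options]
; Keep the legacy "Zap/..." channel names after moving to DAHDI,
; so existing dialplan and config references keep working.
dahdichanname = no
```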

j3flight · June 23, 2009, 9:56pm

Nope, soft hangup does nothing… I guess I should have mentioned that. Also, when I do a "core show channel" on the stuck channel, it does not indicate that it’s blocked anywhere.

A couple days ago, I turned on FULL debug and verbose and got some extra info when a channel stuck. It turns out that chan_dahdi is receiving an on-hook event JUST after the final # in the Feature Group D string. I think this whole thing might be a race condition between the event handling thread and the thread pulling in the DTMF tones.

If a user hangs up the call just as the DTMF tones are being completed, chan_dahdi looks like it might be trying to put the (now dead) channel into PBX_Run(). I have put some debug code into chan_dahdi as a first attempt to prove this, but it seems plausible at the moment.

I’ll post back when the next channel gets stuck.

j3flight · June 23, 2009, 9:57pm

By the way, thanks for the advice on moving to DAHDI. I had considered that before and may do it if my current debugging fails to get me anywhere…

j3flight · June 30, 2009, 12:48am

I think I’ve got this fixed, but I’ll have to wait a week or two to be sure…

Inside ss_thread of chan_dahdi, the code pulls in all the digits for the FEATD_MF channel and eventually drops the caller into pbx_run(). Within that portion of the code, there is a short wait after the final FEATD wink that is handled with a safe_sleep(100); the comments next to it call it a ‘guard time’.

If a caller hangs up AFTER the final digit collection but BEFORE this safe_sleep(100) is complete, the safe_sleep(100) exits non-zero and the code jumps straight to the thread exit. The result of that is a stuck channel.

I was actually able to reproduce it after about 15 tries by timing my hangup just right. If I increased the wait timer (to like 2000), I could reproduce the issue EVERY TIME. I added a call to dahdi_set_onhook() (and some debug text) if the safe_sleep() returns non-zero, and that seems to have cured the problem…
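To make the failure mode concrete, here is a minimal standalone model of the control flow described above. It is not the chan_dahdi source and not the actual patch: `mock_chan`, `mock_safe_sleep()`, `mock_set_onhook()` and `finish_featd_call()` are all hypothetical stand-ins, and the "hangup time" is just a number rather than a real event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for channel state; these are NOT Asterisk types. */
enum hook_state { HOOK_IDLE, HOOK_RING, HOOK_UP };

struct mock_chan {
    enum hook_state state; /* roughly what "core show channels" would report */
    int hangup_at_ms;      /* caller hangs up this many ms into the guard
                              time; -1 means the caller never hangs up */
};

/* Models a sleep that aborts when the caller hangs up:
 * returns 0 if the full interval elapsed, non-zero if interrupted. */
int mock_safe_sleep(const struct mock_chan *c, int ms)
{
    return (c->hangup_at_ms >= 0 && c->hangup_at_ms < ms) ? -1 : 0;
}

/* Models forcing the line back on-hook so the channel is released. */
void mock_set_onhook(struct mock_chan *c)
{
    c->state = HOOK_IDLE;
}

/* Models the end of digit collection: guard time, then hand-off to the
 * dialplan. With apply_fix == false the early exit leaves the channel in
 * Ring; with apply_fix == true the channel is put on-hook first. */
enum hook_state finish_featd_call(struct mock_chan *c, int guard_ms,
                                  bool apply_fix)
{
    assert(c != NULL);
    c->state = HOOK_RING;

    if (mock_safe_sleep(c, guard_ms)) {
        if (apply_fix) {
            fprintf(stderr, "guard time interrupted, forcing on-hook\n");
            mock_set_onhook(c);
        }
        return c->state; /* early exit: the dialplan is never entered */
    }

    c->state = HOOK_UP; /* stands in for dropping the call into the dialplan */
    return c->state;
}
```

In this model, a hangup at 50 ms with a 100 ms guard time leaves the channel in Ring unless the fix is applied. A hangup at 500 ms slips past a 100 ms guard but is caught by a 2000 ms one, which matches the observation that raising the timer made the problem easy to reproduce.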

I’ll keep an eye on my logs for a while and if my debug text pops up periodically but my channels stay clean, I’ll know it’s fixed. Then, I’ll post a bug report and patch.