[FIXED] SIP bridging with Dial() results in zombie channels

I am having a problem with Asterisk 1.8.7.1 where one of my servers is having difficulty cleaning up after one SIP channel executes a Dial() to another SIP channel within the extension logic, and then one of the two channels hangs up.

I’m guessing it has something to do with my extension logic itself since other Asterisk servers I’m running (and with much simpler dialplans) at the same version level do not have this problem, and their sip.conf settings are relatively similar. But I’ll be darned if I can figure out what it is.

The problem is that the extensions.conf for the particular server in question is a huge, sprawling thing that hooks into SQL databases and HTTP datastores using func_curl, so it wouldn’t be prudent to post it here nor do I suspect it would be particularly easy for someone to debug without giving them remote access to the server. What I’m hoping is that someone can help point me in the right direction after giving them the pertinent clues.

Here’s what I can tell you:

  1. I’m only dealing with SIP channels and the SIP channel driver.
  2. I’m not even using chan_local anywhere.
  3. I have some macros but I’m basically using them as poor-man’s subroutines.
  4. I don’t Dial() from within Macros.
  5. I have no ‘h’ extensions defined in any contexts.

…and here are the symptoms:

  1. When a call (either inbound or outbound…doesn’t matter which leg of the call) is in progress, and one of the legs hangs up, Asterisk just sits there like a lump without sending a BYE message to the remaining active channel.

  2. If you wait and watch, and the remaining channel keeps its RTP stream to Asterisk open (phone off-hook still), the remaining leg will FINALLY get sent a BYE from Asterisk exactly 30 seconds after the first leg hangs up. It doesn’t matter how long the call lasted (seconds or hours), you can count on needing to wait 30 seconds from the instant the first channel is dead. When this happens, the channel is cleaned up as expected.

  3. If you hang up the remaining channel (put phone back on-hook) BEFORE 30 seconds are up, Asterisk really doesn’t like this. A few seconds after you do this, you will see a “WARNING: chan_sip.c: __sip_autodestruct: Autodestruct on dialog ‘[…]’ with owner in place (Method: BYE)” show on the console, and the BYE dialog will remain “stuck” in ‘sip show channels’ until you restart Asterisk. After a while, these really begin to accumulate.

It’s not a hangup detection issue…if I add a simple one-liner ‘h’ extension in the appropriate context that simply calls Wait(0), I see that this gets executed right when the first leg calls it quits. (Besides, I’m not bridging to a Zaptel/DAHDI interface or something else that can’t always properly detect hangup events or where they are ambiguous.) But it is exactly 30 seconds after the ‘h’ extension has finished executing that I finally see my “Spawn extension ([context], [exten], [#]) exited non-zero on '[channel]” (assuming the remaining channel hasn’t also called it quits before 30 seconds are up).

The question is, of course, what the heck could be causing Asterisk to wait 30 seconds between when it gets the BYE from the first channel and when it decides to send a BYE to the remaining open channel. I’ve scoured my extensions.conf file and have no explanation for this. My understanding from my research so far is that execution of the dialplan gets stayed by Dial() until the Dial application has exited (when one of the parties hangs up/says BYE), at which point it jumps to the local ‘h’ extension before it hangs up on the other party, and that sometimes you can gum up things by taking your sweet time within ‘h’. Well, I have no ‘h’ extension except when I put in a dummy one for debugging (the aforementioned “exten => h,1,Wait(0)”).

Any ideas what could cause something like this or what I can be looking for? Any help is much appreciated!

Thanks,

– Nathan

For anybody who is curious or who runs across this thread in the future with the same or a similar issue, I finally figured out what was going on…though it wasn’t terribly easy.

The thing is that there is (as I learned) a lot of stuff going on behind-the-scenes that Asterisk – even at the highest debug and verbosity levels – does not clue you into. Based on the 30-second thing, I knew there had to be some kind of action happening at the end of a call that was timing out, and was apparently configured to wait for 30 seconds before giving up and moving on. But I had no way to see what this thing was…when the first channel hung up, nothing indicated to me what action was being taken right before the 30-second countdown went into effect.

I finally got to the point where I had to scour through chan_sip.c source to look for clues. I found sip_hangup, which is pointed to by ast_hangup, and so I fired up gdb, set a breakpoint at ast_hangup, placed a call, caused one end to hang up, waited 30 seconds, and sure enough execution was paused at ast_hangup. Ran a backtrace and started going through dial_exec_full() in app_dial.c which is a mammoth function. Didn’t really see anything obvious, but I ran across some references to CDRs, and that’s when I thought, “waitaminute…”

I realized that CDRs are probably saved/committed when a call hangs up. I also realized that I forgot to mention one thing earlier (which didn’t seem relevant at the time anyway): I’m sending all of my CDRs to a RADIUS server. I have two servers: a primary and a backup. So I checked, and sure enough, my primary RADIUS server was nonresponsive. (This whole system is in a testing phase, and I haven’t gotten around to getting live monitors in place for everything yet.) I checked the RADIUS client configuration file, which has a timeout default of 10 seconds set and a retry # default of 3 set. 10 * 3 == 30 seconds. Bingo.

I restarted the RADIUS daemon on the primary server, and hang-ups were instantaneous after that. I have since adjusted the timeout and retry values in the RADIUS client config file to much lower values so that this kind of thing doesn’t happen again.

It sure would have been nice to have REALLY detailed/verbose debug information on the Asterisk console…you know, something that showed “Saving CDRs…” or SOMETHING useful. I would have been able to solve this problem a lot more quickly. (I have both verbosity and debug cranked up to 9.)

Thanks for reading and for letting me vent,

– Nathan