Autodestruct on dialog WARNING - system eventually locked up

#1

Hi guys, we watched again in horror today as one of our boxes eventually ground to a halt… the only thing we could really see was lines of the same sort of message:

[2019-04-11 14:59:58] WARNING[12765]: chan_sip.c:4337 __sip_autodestruct: Autodestruct on dialog ‘02080e0775ca38c25084edad1a671a1d@1.1.1.1:5060’ with owner SIP/Trunkin1-0000dc1e in place (Method: BYE). Rescheduling destruction for 10000 ms

I have read most of the other topics related to this, but I still feel like I don’t have a good place to start. They also seem to be from years ago.

This only seems to happen on Asterisk 13.25.0. It’s a test-case box for us ahead of moving all 30-odd servers over to Asterisk 13. We are currently on Asterisk 11 (latest available build). The problem is that Asterisk 11 deadlocks in chan_sip from time to time, where nothing shows on the CLI at all… it just sits there doing nothing. That is why we are keen to move to 13.

Some notes about the situation:

  • The boxes are all carrying around 100-150 simultaneous calls at any one time.

  • The calls built up over a period of about 20 minutes, and during that time the system and CLI were responsive. After reaching about 900 channels it eventually caused extensions to go offline - that’s when we pulled the trigger.

  • It wasn’t particularly busy; in fact the day was calming down to some degree. The time before, it also happened around 3-4pm in the afternoon.

  • If this condition comes from a long “h” dial-plan, why does it suddenly start out of the blue, and why is it that from that point on it’s a one-way path to a total Asterisk crash? (Wouldn’t this just affect a handful of calls?)

  • We do have “some” things in the “h”, but at the same time as this, other calls and their “h” channels are being executed fine, including writing to the CDR db.

  • A lot of the calls go into queues - and they don’t seem to exit. There seems to be some relationship to queues - though I’m not 100% sure on this - because we did a module reload on queue, and it had no effect.

As I see it, it’s either a chan_sip bug, or we have a monkey wrench in the “h” dial-plan that’s preventing some channels from being torn down and not others, on something of a linear trajectory.
(It’s unlikely to be something in the dial plan, as it would then probably happen all the time.)
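One low-impact way to catch the build-up described above before it reaches 900 channels is to sample the channel count periodically (for example from cron, via `asterisk -rx "core show channels count"`) and flag a steady climb. The sketch below only parses that command's text output and checks a list of samples; the three-line output format it expects and the threshold of 300 are assumptions, and the helper names are hypothetical.

```python
import re

# Assumed output format of "core show channels count", e.g.:
#   412 active channels
#   190 active calls
#   88231 calls processed
COUNT_LINE = re.compile(
    r"^\s*(\d+)\s+(active channels?|active calls?|calls? processed)\s*$",
    re.MULTILINE,
)

def parse_channel_counts(text):
    """Return {'channels': n, 'calls': n, 'processed': n} for the lines found."""
    counts = {}
    for number, label in COUNT_LINE.findall(text):
        if "processed" in label:
            key = "processed"
        elif "channel" in label:
            key = "channels"
        else:
            key = "calls"
        counts[key] = int(number)
    return counts

def is_building_up(samples, threshold):
    """True if every sample is higher than the previous one and the
    latest sample exceeds the (hypothetical) alert threshold."""
    if len(samples) < 2:
        return False
    rising = all(b > a for a, b in zip(samples, samples[1:]))
    return rising and samples[-1] > threshold
```

With a normal load of 100-150 calls, something like `is_building_up(last_five_samples, 300)` could trigger an alert or a capture of `core show channels` while the CLI is still responsive.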

So… is it time to go to PJSIP, or is this curable?

What can we do to track this down? Bear in mind these are production boxes, and under normal testing conditions this never happens. Running debugging tools on our production boxes is going to be tricky.
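One thing that can be done without touching the running system is to summarize the existing log offline. The sketch below counts the Autodestruct warnings per minute, per trunk/peer, per SIP method and per owner channel, which should show whether a few channels are being rescheduled over and over or whether new channels keep getting stuck. The regular expression is modelled only on the single line quoted above, so the exact format on other builds is an assumption, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Modelled on the one WARNING line quoted in this post. The optional
# "[C-...]" call-id group after the thread id is an assumption.
AUTODESTRUCT = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\]\s+"
    r"WARNING\[\d+\](?:\[[^\]]+\])?:\s+"
    r"chan_sip\.c:\d+ __sip_autodestruct: Autodestruct on dialog\s+\S+\s+"
    r"with owner (?P<owner>\S+) in place \(Method: (?P<method>\w+)\)"
)

def summarize(lines):
    """Count autodestruct warnings by minute, peer, method and owner channel."""
    stats = {
        "per_minute": Counter(),
        "per_peer": Counter(),
        "per_method": Counter(),
        "per_owner": Counter(),
    }
    for line in lines:
        m = AUTODESTRUCT.search(line)
        if not m:
            continue  # not an autodestruct warning
        owner = m.group("owner")
        stats["per_minute"][m.group("ts")] += 1
        # "SIP/Trunkin1-0000dc1e" -> "SIP/Trunkin1"
        stats["per_peer"][owner.rsplit("-", 1)[0]] += 1
        stats["per_method"][m.group("method")] += 1
        stats["per_owner"][owner] += 1
    return stats
```

It can be fed an open log file, e.g. `summarize(open(path, errors="replace"))`, on a copy of the log taken off the production box. An owner that appears many times is a channel that never got torn down; a growing number of distinct owners per minute points at new calls getting stuck.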

#2

Yes. The main reason is that chan_pjsip is a core-supported module, whereas chan_sip is community supported. Assuming this issue is caused by a bug, with chan_sip you have no estimated time frame for when it is going to be fixed.

#3

Roger that… now the only question: how do I tell my boss that we need to rewrite the entire dial plan and back-end systems that we spent multiple millions of dollars developing… ooops :frowning:

#4

I think that before spending such a huge amount of money, he should have taken into consideration the pros and cons of using open source products. Has he ever run the command core show warranty? There is an interesting part that says "THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION".

But going back to your issue: I can’t say it will be fixed by using the PJSIP channel driver, so it is better to wait for replies from other members and see if they have some advice that could help you get this fixed.

#5

As it pertains to timelines, chan_pjsip was introduced in Asterisk 12, which came out in 2013. We’ve been marching towards this for quite some time now.
