Hi guys, we watched again in horror today as one of our boxes eventually ground to a halt. The only thing we could really see was repeated lines of the same sort of message:
[2019-04-11 14:59:58] WARNING[12765]: chan_sip.c:4337 __sip_autodestruct: Autodestruct on dialog ‘02080e0775ca38c25084edad1a671a1d@1.1.1.1:5060’ with owner SIP/Trunkin1-0000dc1e in place (Method: BYE). Rescheduling destruction for 10000 ms
I have read most of the other topics related to this, but still feel like I don’t have a good place to start. Also, they seem to be from years ago.
This only seems to happen on Asterisk 13.25.0. It's a test-case box for moving all 30-odd of our servers over to Asterisk 13. We are currently on Asterisk 11 (latest available build). The problem is that Asterisk 11 deadlocks in chan_sip from time to time, where nothing shows on the CLI at all; it just sits there doing nothing. That is why we are keen to move to 13.
Some notes about the situation:
- The boxes are all carrying around 100-150 simultaneous calls at any one time.
- The calls built up over a period of about 20 minutes, and during that time the system and CLI were responsive. After reaching about 900 channels it eventually caused extensions to go offline, and that’s when we pulled the trigger.
- It wasn’t particularly busy; in fact the day was calming down to some degree. The time before that, it also happened around 3-4pm.
- If this condition comes from a long “h” dial-plan, why does it suddenly start out of the blue, and why is it that from that point on it's a one-way path to a total Asterisk crash? (Wouldn’t this just affect a handful of calls?)
- We do have “some” things in the “h” extension, but at the same time as this, other calls and their “h” channels are being executed fine, including writing to the CDR db.
- A lot of the calls go into queues, and they don’t seem to exit. There seems to be some relationship to queues, though I'm not 100% sure on this, because we did a module reload on queue and it had no effect.
As I see it, it's either a chan_sip bug, or we have a monkey wrench in the “h” dial-plan that’s preventing some channels (but not others) from being torn down, on something of a linear trajectory.
(It’s unlikely to be something in the dial plan as it would probably happen all the time then.)
So… is it time to go to PJSIP, or is this curable?
What can we do to track this down? Bear in mind these are production boxes, and under normal testing conditions this never happens. Running debugging tools on our production boxes is going to be tricky.
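As a low-impact starting point that needs no debugger on the box, something like the Python sketch below could tally these warnings per minute and per trunk/peer from a copy of the log, to show when the build-up starts and whether it is confined to particular trunks. The regex is modelled only on the single warning line quoted above, so the exact wording on other builds is an assumption, and the names `AUTODESTRUCT`, `SAMPLE` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the one warning line quoted in this post (assumption:
# other Asterisk builds may word the message differently).
AUTODESTRUCT = re.compile(
    r"^\[(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\] WARNING\[\d+\]: "
    r"chan_sip\.c:\d+ __sip_autodestruct: .* with owner "
    r"(?P<peer>SIP/.+)-[0-9a-f]+ in place \(Method: (?P<method>\w+)\)"
)

# The line from the post, reused here as sample input.
SAMPLE = (
    "[2019-04-11 14:59:58] WARNING[12765]: chan_sip.c:4337 __sip_autodestruct: "
    "Autodestruct on dialog ‘02080e0775ca38c25084edad1a671a1d@1.1.1.1:5060’ "
    "with owner SIP/Trunkin1-0000dc1e in place (Method: BYE). "
    "Rescheduling destruction for 10000 ms"
)


def summarize(lines):
    """Count autodestruct warnings per minute and per (peer, SIP method)."""
    per_minute = Counter()
    per_peer = Counter()
    for line in lines:
        match = AUTODESTRUCT.search(line)
        if match is None:
            continue  # not an autodestruct warning, ignore it
        per_minute[match.group("minute")] += 1
        per_peer[(match.group("peer"), match.group("method"))] += 1
    return per_minute, per_peer


if __name__ == "__main__":
    per_minute, per_peer = summarize([SAMPLE])
    for minute, count in sorted(per_minute.items()):
        print(minute, count)
    for (peer, method), count in per_peer.most_common():
        print(peer, method, count)
```

To run it against real data, an open file handle on a copy of the log can be passed to `summarize` in place of the sample list, since it only iterates over lines. Working on a copy keeps the load off the production box.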