I have an Asterisk server, version 16.17.0, installed on CentOS Linux release 7.9.2009 (Core) on AWS infrastructure.
On this server I have two SIP trunks from two different telecom operators. I have noticed that my SIP peers intermittently become unreachable, although sip show registry still shows them as registered on my server. I have also checked ping responses and there are no drops.
I am using the chan_sip driver.
Checking the logs, I can see the following:
[Sep 14 11:46:14] NOTICE[17337] chan_sip.c: Peer 'OPR-sip-trunk' is now Lagged. (2005ms / 2000ms)
[Sep 14 11:46:24] NOTICE[17337] chan_sip.c: Peer 'OPR-sip-trunk' is now Reachable. (12ms / 2000ms)
While this is happening, existing calls on the server stay connected and keep working, but new outgoing calls are dropped with the reason:
[Sep 14 11:46:24] WARNING[20821][C-0000027d] app_dial.c: Unable to create channel of type 'SIP' (cause 20 - Subscriber absent)
I have also had this SIP trunk checked at the operator’s end, and there are no issues on the telecom operator’s side.
This is a qualify failure. In particular, it is showing a round trip delay of over two seconds, which is well beyond what most speech users would consider acceptable.
To avoid this, should I disable qualify, or is there another suggestion? Calls were failing during that time. It also happens intermittently, around six times a day.
As I said earlier, I have two SIP trunks configured on the server.
When this issue happens, both of my SIP trunks become unreachable. One of them is an internal SIP trunk to a different Asterisk box, and even that trunk becomes unreachable when the issue occurs.
Is there a chance of a deadlock?
What is your view on this?
More likely something is competing for CPU time. Is the system virtual? You might find it correlates to other uses of the same host, or even host housekeeping operations.
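To look for that kind of correlation, it may help to pull the Lagged/Reachable transitions out of the Asterisk log and see when each window started and how long it lasted. The following is only a minimal sketch: the function name outage_windows and the year argument are made up for illustration (the log timestamps carry no year), and the pattern is based on the NOTICE lines quoted earlier in this thread plus the UNREACHABLE wording chan_sip uses, so it may need adjusting for your log format.

```python
import re
from datetime import datetime

# Based on lines such as:
# [Sep 14 11:46:14] NOTICE[17337] chan_sip.c: Peer 'OPR-sip-trunk' is now Lagged. (2005ms / 2000ms)
# Both straight and typographic quotes are accepted around the peer name.
STATE_RE = re.compile(
    r"\[(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2})\] NOTICE\[\d+\] chan_sip\.c: "
    r"Peer ['‘’](?P<peer>[^'‘’]+)['‘’] is now (?P<state>\w+)"
)


def outage_windows(lines, year=2024):
    """Pair each Lagged/UNREACHABLE notice with the next Reachable notice.

    Returns a list of (peer, start_time, duration_in_seconds).
    The year is an assumption; it only matters across a year boundary.
    """
    down_since = {}
    windows = []
    for line in lines:
        m = STATE_RE.search(line)
        if not m:
            continue
        when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        peer, state = m["peer"], m["state"].lower()
        if state in ("lagged", "unreachable"):
            # Keep the first failure time if several failures follow each other.
            down_since.setdefault(peer, when)
        elif state == "reachable" and peer in down_since:
            start = down_since.pop(peer)
            windows.append((peer, start, (when - start).total_seconds()))
    return windows
```

Feeding it the log file line by line gives a list of start times that can be compared against cron jobs, backups, snapshots or anything else scheduled on the instance.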
I’ve seen this kind of behavior a lot before. David is right, but in case you are running a VM you might want to allow some disk caching (the more “unsafe” the better) as this usually tends to effectively freeze the entire box. Also, you get the message because of the 2s timeout. You could get a pcap trace of the entire communication and check the overall behavior. I guess you run into the problem only sporadically.
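Alongside the pcap trace, one could also time an OPTIONS round trip from outside Asterisk, to see whether the delay is on the network or inside the box. This is a different technique from the packet capture and only a rough sketch: build_options and probe are hypothetical helper names, and it assumes the far end answers OPTIONS sent from an arbitrary source port, which many operators will not do, so no reply does not prove anything by itself.

```python
import socket
import time
import uuid


def build_options(host, port, local_ip, local_port):
    """Build a minimal SIP OPTIONS request with the mandatory RFC 3261 headers."""
    token = uuid.uuid4().hex
    lines = [
        f"OPTIONS sip:{host}:{port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch=z9hG4bK{token[:16]};rport",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_ip}>;tag={token[16:24]}",
        f"To: <sip:{host}:{port}>",
        f"Call-ID: {token}@{local_ip}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",
    ]
    # The two empty entries produce the blank line that ends the headers.
    return "\r\n".join(lines).encode("ascii")


def probe(host, port=5060, timeout=2.0):
    """Return the OPTIONS round trip in milliseconds, or None if nothing came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((host, port))  # lets the OS pick the local address and port
        local_ip, local_port = sock.getsockname()
        request = build_options(host, port, local_ip, local_port)
        started = time.monotonic()
        sock.send(request)
        try:
            sock.recv(4096)
        except (socket.timeout, ConnectionRefusedError):
            return None
        return (time.monotonic() - started) * 1000.0
```

Running probe against the internal Asterisk box every second or so during a bad period would show whether the round trip really climbs towards two seconds, or whether only Asterisk on the affected server sees the delay.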
In your case it is the server, but I’ve seen this a lot with older SIP phones and SIP doorbells. I typically got the message every couple of hours and the most critical factors were usually WiFi connections and with doorbells getting an extra video stream for other purposes. Anyway, you are always temporarily running out of some kind of resource.
You might still want to check whether disabling qualify helps; on this point I don’t agree with David. If there is a problem getting an answer to the OPTIONS request, Asterisk blocks the peer until the next request. If the problem only lasts a few seconds, the line then stays blocked for far longer than the problem itself. Of course, if you try to call while the communication really is blocked, you are out of luck again. It’s more like a Gremlin taking out the battery for a second and then putting it back, but if you check every two minutes you could get blocked for up to 2 minutes.
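To put rough numbers on that, here is a small worked example. It is a simplified model, not a description of how chan_sip schedules its polls: it assumes the failed poll coincides with the start of the glitch and that the peer stays marked down until the first poll after the glitch has ended. The function name blocked_seconds is made up, the 10 second interval is simply the gap between the Lagged and Reachable lines in the log at the top of the thread, and 120 seconds is the two-minute figure mentioned in this post.

```python
import math


def blocked_seconds(glitch_s, recheck_s):
    """Time a peer stays marked down in this simplified model.

    A poll fails at t=0, the real problem lasts glitch_s seconds, and the
    peer is polled again every recheck_s seconds until a poll succeeds.
    """
    polls_needed = max(1, math.ceil(glitch_s / recheck_s))
    return polls_needed * recheck_s


# A 3 second glitch with a 10 second recheck blocks new calls for 10 seconds.
print(blocked_seconds(3, 10))
# The same 3 second glitch with a 120 second recheck blocks them for 120 seconds.
print(blocked_seconds(3, 120))
```

In both cases the peer is unusable for much longer than the glitch itself, which is the argument for either disabling qualify or at least keeping the recheck interval short.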
I’ve not researched how Amazon manage EC2 instances in detail, so whilst I’m sure that there will be behaviour that gives away that they are really virtual, I don’t know to what extent that will present in a way that would produce the observed effects.