I have a customer that has two SIP trunk connections - both using TLS.
Frequently, but not always, Asterisk marks the endpoints unreachable immediately after receiving their response to an OPTIONS request sent by us.
The qualify_frequency is set to 60, and when we send a new OPTIONS request 1 minute later, Asterisk believes that the endpoint is reachable again when we receive their response.
I see no pattern for when this happens, but it typically happens 5-6 times per day, and on other days not at all. It happens both when there is high load on the system and in the early morning, when (almost) the only activity is OPTIONS requests and responses.
I have attached 3 examples from today, and the transport and endpoint configuration that we have.
I deleted your attachments because they had IP addresses in them, but try setting the qualify_frequency to something other than exactly 60 seconds. Try 55 seconds.
Unfortunately setting the qualify frequency to 55 seconds did not change anything.
The SipTrunks are still wrongly reported as unreachable 2-3 times a day.
It “only” lasts for a minute, but this is a very busy customer, and calls are getting lost because the agents can’t answer the calls. When they fail to answer, then our system marks the agent as unreachable, and within seconds all agents become unreachable.
Please note that we have two siptrunks configured for this tenant.
One for international numbers and another for national ones.
Could Asterisk get confused and mark the wrong endpoint as unreachable when it receives a response to OPTIONS?
And why would Asterisk ever mark anything as unreachable when it receives a reply to OPTIONS?
Do you have another idea about where the error lies?
And how to avoid it?
From a SIP protocol perspective, if a response is received after the request transaction goes away, then it wouldn’t be marked as reachable, because Asterisk no longer has any record of that transaction. Is that the cause? Unknown, but just answering that question.
A debug level log would be needed[1] as that shows the logic of what is happening inside.
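To illustrate the late-response case described above, here is a minimal toy model in Python. It is not Asterisk's or PJSIP's actual code: the class name, the method names, the branch value and the 3-second timeout are all assumptions chosen only for illustration. It shows that a response arriving after the transaction has expired finds nothing to match against, so the status stays Unreachable.

```python
from dataclasses import dataclass, field


@dataclass
class QualifyModel:
    """Toy model of OPTIONS qualify transactions (hypothetical, not Asterisk code)."""
    timeout: float                                 # seconds before a transaction is dropped (assumed value)
    pending: dict = field(default_factory=dict)    # branch id -> time the request was sent
    status: str = "Unknown"

    def send_options(self, branch: str, now: float) -> None:
        # Remember the outstanding request so a response can be matched to it.
        self.pending[branch] = now

    def tick(self, now: float) -> None:
        # Drop every transaction that has waited at least `timeout` seconds.
        for branch, sent in list(self.pending.items()):
            if now - sent >= self.timeout:
                del self.pending[branch]
                self.status = "Unreachable"

    def receive_response(self, branch: str, now: float) -> None:
        self.tick(now)
        if branch in self.pending:
            del self.pending[branch]
            self.status = "Reachable"
        # Otherwise there is no matching transaction and the response is ignored.


m = QualifyModel(timeout=3.0)
m.send_options("z9hG4bK-1", now=0.0)
m.receive_response("z9hG4bK-1", now=3.5)   # arrives after the transaction expired
print(m.status)                            # prints "Unreachable"
```

Whether this is what actually happens on the affected system is unknown; the debug log is what would confirm or rule it out.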
I tried setting the debug level to 5 as written, but anything higher than 2 results in a massive log that I would never be able to send to you.
Running it for just 30-40 seconds resulted in a full log of 6MB.
I have now set it to 2, and hope that this is enough information.
There is actually a third SIP trunk that we do register towards; however, we never see the issue for this one.
We do not register the 2 SIP trunks that periodically become unreachable.
We and the provider have hardcoded the remote contact to use.
Please note that these SIP trunks are using TLS - I don’t know whether this can explain something…
Just noticed that the same localnet was shown 3 times when I requested the transport settings from the CLI.
This is however not what is written in the pjsip.conf.
I don’t know whether this could be related to the unreachable problem, but just to be sure, I have pasted the entire transport configuration as written in pjsip.conf below:
You are right. I inadvertently gave you the transports for the wrong tenant.
This tenant uses the ports 5066 and 5067 for unencrypted and encrypted respectively.
All tenants are however set up in the same way - only the ports differ, which is how we separate tenants from each other.
The port 5063 is the port of the SBCs that we communicate with.
I have attached the correct transport configuration.
Attached is also the full log with debug level set to 2. I have however removed the start and the end of the log to be able to attach it. The entire log is 42MB and seemingly you don’t accept zip attachments. If you need the entire log, then just say so.
There are two incidents in the log where the two SBCs were deemed unreachable:
You will see a lot of other “is now Unreachable” log messages, but those are merely WebRTC softphones logging off, which is working like a charm and is not part of the problem.
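To separate the trunk incidents from the softphone noise, a small filter along these lines could be run over the log. This is only a sketch: the endpoint names in `TRUNKS` are hypothetical placeholders, and it assumes that every status change is logged with the same "is now ..." wording as the “is now Unreachable” messages mentioned above.

```python
import re
from typing import Iterable, Iterator

# Hypothetical endpoint names; replace with the names of the two trunk endpoints.
TRUNKS = ("trunk-national", "trunk-international")

# Matches the phrase quoted above ("is now Unreachable"); other states are
# assumed to be logged with the same "is now <State>" wording.
STATUS = re.compile(r"is now (\w+)")


def trunk_status_lines(lines: Iterable[str], trunks=TRUNKS) -> Iterator[str]:
    """Yield only the status-change lines that mention one of the given trunks."""
    for line in lines:
        if STATUS.search(line) and any(name in line for name in trunks):
            yield line.rstrip("\n")


# Example use against a log file (the path is an assumption):
# with open("full.log", errors="replace") as fh:
#     for hit in trunk_status_lines(fh):
#         print(hit)
```

Matching on the endpoint name keeps the trunk state changes and drops the softphone lines, without needing to know the exact layout of the rest of each log line.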