Peers in Unreachable / Lagged / Reachable status "Loop"

Hi there, community,

I hope whoever is reading this is doing well.

On some days (with no specific pattern) I’m seeing recurrent SIP peer disconnections on my LAN, MPLS and WAN links, on two different physical Asterisk servers.

Peer X is Lagged
Peer X is Unreachable
Peer X is Reachable

The peers keep looping between those states.

I can’t find any log entry (Warning, Error or Notice) that helps with this. After restarting the service, everything seems normal.

Summary:

2 x Asterisk certified/16.8-cert3 on HP G8 physical hardware
1 x WAN Network (1 IP per server)
1 x LAN Network (1 IP per server)
2 x MPLS Networks (1 IP per server and MPLS)
4 NICs per Server

Has anyone experienced something similar?
I don’t understand why this happens only on certain days of the year, given that I haven’t changed anything in sip.conf. Since we have iptables + fail2ban, we don’t think this is an attack, but I’m not sure.

Edit1: ICMP seems fine; this happens only with Asterisk SIP OPTIONS (I don’t want to disable them).

Any advice or ideas will be greatly appreciated.
If further information is required, please let me know so I can retrieve it.
Kind Regards

These state changes are based on connectivity tests done with SIP OPTIONS, so will only be seen if you have them enabled. However, if you are getting them, other methods are likely to be equally affected, regardless of whether you are sending OPTIONS.

Dear @david551

Thanks for the advice.
What debugging tool or command would you recommend to check this?
I’m 100% sure that this is an Asterisk service or SIP issue, since there is no way that my LAN, WAN and MPLS links are all compromised at the same time.

Thanks in advance!!
Kind Regards and have a great day
Diego Espinoza

Why are you using certified Asterisk, but not asking your questions via Sangoma’s commercial support contract channels? If you don’t have a contract, the first thing to do would be to upgrade to 16.18.0.

Dear @david551,

Certified Asterisk is licensed under the terms of the GNU General Public License version 2 as published by the Free Software Foundation (see ‘core show license’).

I haven’t paid for any modules, which is why I’m here. Also, I use this version because it’s the most “stable” according to some Asterisk courses (but if you have a different opinion, kindly let me know).

Also, this service cannot be interrupted, so I can’t do an upgrade the way one would hot-swap some HDDs.

Thanks for your feedback
Kind Regards
Diego Espinoza

The most stable version within any major version is the latest sub-version, and the major version should be an LTS one. Certified versions are only intended for use by people with support contracts from Sangoma, and actually come with some components disabled by default, because they are not included in the support contract.

They are only more stable to the extent that they have fewer bugs fixed, as only those reported by paying customers are fixed.

Dear @david551

Thanks for the advice about the Asterisk version.
I’m currently looking for possible causes of this behavior.
I checked Asterisk’s DNS-related services by disabling them and forcing the use of Google DNS, with no success.

What do you recommend for debugging this?

Kind Regards
Diego Espinoza

LAGGED and UNREACHABLE happen when Asterisk does not get a 200 OK response from the other side within a certain amount of time. A device going REACHABLE could be due to Asterisk getting a 200 OK response within the allowed time, or to the device re-registering.

This is mostly a network issue.
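
To make the rule above concrete, here is a minimal Python sketch of how a qualify result could map to the three states. It is a simplified model, not the actual chan_sip code: the function name `classify_peer`, the `max_ms` parameter and its 2000 ms default (which I believe is what `qualify=yes` uses) are assumptions for illustration only.

```python
from typing import Optional


def classify_peer(reply_ms: Optional[float], max_ms: float = 2000.0) -> str:
    """Map one qualify attempt to a peer state (simplified model).

    reply_ms is the round-trip time of the reply to the OPTIONS request,
    or None if no reply arrived before the poll gave up.
    max_ms is the assumed qualify threshold in milliseconds.
    """
    if reply_ms is None:
        return "UNREACHABLE"   # no reply at all
    if reply_ms > max_ms:
        return "LAGGED"        # a reply arrived, but too slowly
    return "REACHABLE"         # a reply arrived within the threshold


# A peer whose replies are intermittently slow or missing produces
# exactly the kind of loop described in the first post.
samples = [40.0, 2600.0, None, 35.0]
states = [classify_peer(ms) for ms in samples]
print(states)  # ['REACHABLE', 'LAGGED', 'UNREACHABLE', 'REACHABLE']
```

The point of the sketch is that the state depends only on whether and how fast a reply to OPTIONS is seen, so anything that delays SIP signalling processing on either side can trigger the loop even when ICMP looks clean.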

Any response will do, not only a 200 OK. However, the OP seems convinced that they don’t have network problems.

Thanks for the replies!

I’m quite sure that this isn’t a network issue, because calls don’t lose voice quality, MTR on Linux looks fine, and the people who use the system haven’t reported anything yet.

I’m sure mainly because this happens on two different physical machines with 4 network links each (in total, 2 WAN, 2 LAN and 4 MPLS).

Hope you get the core idea.
I’m running out of resources and I don’t want to “re-install” a production server just yet.

Any ideas will be greatly appreciated
Kind Regards

Are you, by any chance, using Asterisk Realtime Architecture, with a non-local database?

Dear @david551.

I’m currently using CDR_ODBC with a local database on the same hardware, but I have some replication and monitoring processes.

So the local DB gets replicated to a remote DB very quickly,
and the monitor executes the CLI command ‘core show channels concise’ via PHP.

Do you think I might be hitting some kind of service or network throttling?
I ask because the hardware still has spare capacity.

Thanks again for your replies mate,
Kind Regards

I don’t think CDR database accesses can block anything critical.

So you have run the proper debugging to trace SIP traffic, and you can 100% see the replies coming back to the OPTIONS requests?
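
One way to check this independently of Asterisk is to send an OPTIONS request yourself and time the reply. The following is a minimal Python sketch, assuming plain SIP over UDP with no authentication. The helper names `build_options` and `probe`, the `probe` user part and the 2-second timeout are all made up for illustration; it is not a replacement for a proper SIP trace.

```python
import socket
import time
import uuid


def build_options(target_host, target_port, local_host, local_port):
    """Build a minimal SIP OPTIONS request as bytes (UDP transport)."""
    branch = "z9hG4bK" + uuid.uuid4().hex[:16]  # RFC 3261 branch prefix
    lines = [
        f"OPTIONS sip:{target_host}:{target_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch};rport",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}:{target_port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        f"Contact: <sip:probe@{local_host}:{local_port}>",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(target_host, target_port=5060, timeout=2.0):
    """Send one OPTIONS and return (status_line, elapsed_ms).

    Returns (None, None) if nothing comes back within the timeout.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((target_host, target_port))
        local_host, local_port = sock.getsockname()
        request = build_options(target_host, target_port, local_host, local_port)
        start = time.monotonic()
        sock.send(request)
        try:
            data = sock.recv(4096)
        except (socket.timeout, ConnectionError):
            return None, None
        elapsed_ms = (time.monotonic() - start) * 1000.0
    status_line = data.split(b"\r\n", 1)[0].decode("ascii", "replace")
    return status_line, elapsed_ms


# Example (192.0.2.10 is a documentation address, replace with a real peer):
#   status, ms = probe("192.0.2.10")
#   print(status, ms)
```

Running `probe` in a loop against a peer while the flapping is happening would show whether the replies are genuinely late on the wire or only late by the time Asterisk processes them.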

Dear @BlazeStudios,

In my SIP traces I see inbound SIP OPTIONS being replied to, but sometimes with a delay (PDD); the same goes for outbound SIP OPTIONS requests.
The thing is, I see no problems with MTR to the WAN and LAN addresses, so how is it possible to have that delay on the LAN?

Thanks for the replies
Kind Regards
Diego Espinoza

Are you referring to PDD as Post Dial Delay? Because if you are, that has nothing to do with the OPTIONS messages.

Dear @BlazeStudios

Sorry if I wasn’t clear. What I mean is that sometimes Asterisk sends or receives the 200 OK with some delay.
I know that this could cause peer disconnection / lagging, but I don’t understand why it is happening only with SIP traffic, on both WAN and LAN.

RTP audio and ICMP tests are OK.

It’s as if Asterisk is getting congested and can’t keep up, even though there is spare hardware capacity.

Kind Regards

Without seeing verbose logging or the actual SIP trace, it is hard to say what is going on.

Dear @BlazeStudios ,

It seems that after removing the non-reachable peers, these “disconnections and lags” stopped. I’m really not sure, but this was the last change made to Asterisk.

I hope this helps someone who reads it.
I’ll be closing this thread if no further issue is logged.

Thanks for the assistance, @BlazeStudios , @david551
Kind Regards
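
For anyone landing here with the same symptom, a quick way to see which peers are flapping is to count the state-change messages per peer. This is a minimal Python sketch, assuming log lines shaped like the ones quoted in the first post (`Peer X is Lagged`); the regular expression also tolerates quotes around the name and an optional "now", which is an assumption about the real log format. The function name `count_transitions` is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "Peer X is Lagged" or "Peer 'X' is now UNREACHABLE".
# The exact log wording is an assumption; adjust it to your own logs.
STATE_RE = re.compile(
    r"Peer\s+'?(?P<peer>[^'\s]+)'?\s+is\s+(?:now\s+)?"
    r"(?P<state>Lagged|Unreachable|Reachable)",
    re.IGNORECASE,
)


def count_transitions(lines):
    """Return a Counter mapping peer name to number of state messages."""
    counts = Counter()
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            counts[match.group("peer")] += 1
    return counts


sample_log = [
    "Peer X is Lagged",
    "Peer X is Unreachable",
    "Peer X is Reachable",
    "Peer Y is Reachable",
    "some unrelated line",
]
flaps = count_transitions(sample_log)
# Peers with the most state messages are the first candidates to inspect.
print(flaps.most_common())  # [('X', 3), ('Y', 1)]
```

Feeding it the relevant lines from the Asterisk log would give a ranked list of peers to review, which is useful given that removing permanently unreachable peers appears to have stopped the loop here.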