[Solved]: 20 SIP phones all become UNREACHABLE

Hello,
We have 20 sip phones connected to asterisk on a dedicated switch. Once or twice a day all phones lose their registration and show as UNREACHABLE in the asterisk console.

I used to be able to correct this without a reboot by restarting asterisk and then manually re-registering a phone or two. Once that was done - POP - all the other phones would come back online.

Recently, however, I have had to actually reboot the server to get the phones to re-register. Ping times during the UNREACHABLE phase are very low - all on a local network with good wiring using a 100Mb switch.

I have qualify = on on all phones in sip.conf, and have tried disabling qualify (phones lose registration very quickly), and changing other options on my zap channels (fairly certain those have nothing to do with it).

Of note is that all our in-progress zap calls get hung up as asterisk thinks the phones no longer exist. Zap calls flow in properly after the UNREACHABLE and before the reboot, and I can watch them go straight to voicemail from the console.

Has anyone had this issue? I have been googling for days, and finding nothing. Any info would be appreciated, and of course, I can post my configs if it will help. All phones are snom 320s, and I am running asterisk SVN-branch-1.2-r76653.

Thanks in advance,
-Nick

Hi

What do the logs show ?

It sounds like a networking problem. Are you 100% sure everything is set up correctly with respect to broadcast address, DNS, and gateway for the server and the gateway? Is the sip.conf correct with respect to the localnet?

Do you have any SIP trunks, and do they keep on working OK?

Ian

Thanks for the reply!

The logs actually show nothing out of the ordinary, except that the asterisk log shows something similar for all sip phones:

Jul 26 10:01:15 NOTICE[9793] chan_sip.c: Peer ‘102’ is now UNREACHABLE! Last qualify: 42
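
For anyone who wants to confirm from the log that the peers really drop at the same moment, here is a minimal Python sketch. It assumes the log lines look like the one above; the function name `unreachable_bursts`, the 10-second window, and the default year are all made up for illustration (the timestamps shown carry no year).

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
#   Jul 26 10:01:15 NOTICE[9793] chan_sip.c: Peer '102' is now UNREACHABLE! Last qualify: 42
# Straight and curly quotes are both accepted, since forum software often rewrites them.
UNREACHABLE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*"
    r"Peer ['‘’](?P<peer>[^'‘’]+)['‘’] is now UNREACHABLE"
)


def unreachable_bursts(lines, window_seconds=10, year=2000):
    """Group UNREACHABLE notices that occur close together in time.

    `year` is an assumption (the log format shown has none); 2000 is a
    leap year, so any month/day combination parses.
    Returns a list of (first_timestamp, sorted_peer_names) tuples.
    """
    events = []
    for line in lines:
        match = UNREACHABLE_RE.search(line)
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S"
            )
            events.append((stamp, match.group("peer")))
    events.sort()

    bursts = []
    window = timedelta(seconds=window_seconds)
    for stamp, peer in events:
        # Extend the current burst if this event is within the window
        # of the previous one; otherwise start a new burst.
        if bursts and stamp - bursts[-1]["last"] <= window:
            bursts[-1]["peers"].add(peer)
            bursts[-1]["last"] = stamp
        else:
            bursts.append({"first": stamp, "last": stamp, "peers": {peer}})
    return [(b["first"], sorted(b["peers"])) for b in bursts]
```

A burst that names nearly every peer within a few seconds would suggest looking at the server side rather than at individual phones.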

I agree it sounds like a networking problem, but all is set up as it should be. It is a triple-homed host with one card dedicated to ip phone traffic, one dedicated to local computer traffic, and one connected to the internet. This machine is also our firewall (just iptables). Nothing is logging an error about connectivity, and I can ping the phones just fine when asterisk is complaining, well within the default qualify time.

We don’t have any sip trunks, so I couldn’t say, but we do have an IAX link to a branch office (they never have this issue with almost the exact same setup, but they only have two phones). The IAX link stays up while the phones are in the unreachable state.

Thanks again for the reply,
-Nick

Hmm

My guess would be that * is listening on the wrong interface, or possibly there is something on one of the LANs with the wrong address.

You might want to use wireshark or the like to see what's happening to the packets when it's happening.

Ian

Check your registration timeouts as well.

Most clients (phones, devices, softphones) have a setting that forces a re-registration event periodically.

If the Asterisk registration timeout has been exceeded and the device hasn’t re-registered yet, Asterisk will mark that SIP endpoint as unreachable.

Currently, I think that Asterisk marks a SIP endpoint as unreachable if it hasn’t re-registered in 3600 seconds. Set your re-registration for something less than that.

Thank you both for your replies!
I will reply to both of your messages at once:

The bindaddr in sip.conf is the address of my internal phone interface.
I have re-checked all the endpoints on the phone lan and the pc lan, and all are being addressed properly. In fact, I have the phones set up to get their ip address from dhcp set to a static ip for each phone (based on MAC).

That was my thought too, but sip is a chatty protocol, and I would need to basically log all sip traffic all day until it happens again (random).

I have checked each phone, and all phones are set to 3600 s. I have changed this in my config file for the phones to 3000 s. I will reboot them all tonight and see if that changes anything, but I have tried it before with as little as 600 s . This did not keep it from happening again, however.

The only similarity I can find between any of these disconnects is that one of my users has forwarded all their calls (302) to another extension. When someone calls that forwarded phone and reaches the person the calls are forwarded to, anywhere from several seconds to several minutes later all peers become unreachable. It could just be coincidence, though. And it doesn’t happen every time - I have tried calling in to forwarded extensions many many times after-hours and cannot replicate the issue.

The other similarity is that upwards of 4 of our 6 zap channels are in use.

-Nick

Hi

[quote]That was my thought too, but sip is a chatty protocol, and I would need to basically log all sip traffic all day until it happens again (random).
[/quote] My thought was to capture once they have lost registration, as the phones will still be trying to talk and * will be as well (hopefully).

Ian

Thanks Ian - I’ll give it a shot and see what it looks like next time it happens.

Actually, I thought I had it solved, but it just happened again - does anyone know of something similar to wireshark, but with a curses-based interface?

I don’t have X installed on that server. If I need to, I can install X, but I’d rather not.

Thanks,
-Nick

If you are using a managed switch, you can run wireshark on a Windows laptop. Set up one of the ports on the switch as a monitor port and plug the wireshark PC into that port. Another good one is called tracebuster; you can filter with it pretty well too…

tcpdump is about the only Linux command-line way I know to monitor a port… You could tcpdump that voice interface to a file and then search that file for any SIP registration traffic.
-Christopher

Thank you! For some reason I had completely forgotten about tcpdump. I am capturing all packets on the voice interface now.

Once I have the packet data, I’ll post back with my findings (and maybe post a bug report?).

-Nick

Hi

Don’t forget to filter it with grep

Ian

You may wanna try this:

For catching IP packets, run "tcpdump -vv -s 0 -w /tmp/call.cap host (destination IP address)"

Any resolution to this? I’m seeing an extremely similar problem at one of my sites. I’m running Asterisk 1.2.23

Thx

M

Hi,
I’ve had this problem at a client’s site.

Asking on another forum, I was told it’s down to DNS lookup failures.
In this particular case, the problem appears if the internet connection fails for some reason. The setup has some SIP trunks that were set up using the hostnames of the remote systems.
When asterisk tries to re-register & fails on the DNS lookup, the whole SIP subsystem acts like it’s vanished off the local network and only re-appears when the internet connection is restored. (Zaptel trunks & extensions continue to function throughout the problem).

The advice I was given was simply to use IP addresses rather than hostnames for anything outside the local net.

Since making this change I’ve not had a fault report, but the internet rarely goes down at that site.
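
To find which entries would need changing, a rough Python sketch like the following could list every `host=` and `register =>` line in sip.conf that uses a hostname instead of an IP address. The function names are made up for illustration, and it assumes the usual `register => user:secret@host:port/extension` layout and IPv4 addresses; it is not a full sip.conf parser.

```python
import ipaddress
import re

# "host = value" lines inside peer sections, and "register => spec" lines.
# Lines that start with ";" are comments and do not match either pattern.
HOST_RE = re.compile(r"^\s*host\s*=\s*(?P<host>[^\s;]+)", re.IGNORECASE)
REGISTER_RE = re.compile(r"^\s*register\s*=>?\s*(?P<spec>[^\s;]+)", re.IGNORECASE)


def is_ip_literal(value):
    """Return True if value is a literal IP address, False otherwise."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def hostnames_in_sip_conf(text):
    """Return (line_number, hostname) pairs that would need a DNS lookup."""
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        host = None
        match = HOST_RE.match(line)
        if match:
            host = match.group("host")
        else:
            match = REGISTER_RE.match(line)
            if match:
                # Keep what follows the last "@", then drop "/extension" and ":port".
                spec = match.group("spec")
                host = spec.rsplit("@", 1)[-1].split("/", 1)[0].split(":", 1)[0]
        # host=dynamic means the peer registers to us, so no lookup is involved.
        if host and host.lower() != "dynamic" and not is_ip_literal(host):
            found.append((number, host))
    return found
```

Running it over the text of sip.conf gives a short list of lines to review by hand.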

I had this problem too, and I’m still working on it. I found this information at * website:

[b]If qualify=yes or a numeric value, then asterisk will sometimes poke this peer by sending a “SIP OPTIONS” request to phones or other pbx’s.

If they do not reply on time, they will be considered unreachable, and this message will be printed on the asterisk CLI.

When the phone is back online (first time it replies on time) then asterisk will tell you Peer ‘XXX’ is now REACHABLE, if we got a reply from the phone, but not on time, the message Peer ‘XXX’ is now too LAGGED will be printed on the CLI.

The timeout is set to 2000ms by default. (If you specify qualify=yes).
But you could also set it to any other value.

e.g. qualify=3000

  1. Reasons for seeing this message:

When a phone is rebooted, or when a phone hangs, or when its shut down this message might pop up.

(Or when there is a too big delay on the network).

If all your phones become unreachable at the same time, its probably your asterisk server that has network problems instead of the phone.

When a phone is unreachable, asterisk will not try to call it. (So you might want to set this value not too low, or you might want to completely disable it).

If the phone that has unreachable messages all the time is behind a NAT, it might be that the UDP timeout is set too low on the firewall.[/b]

So, there are two options. Number one is to disable the qualify attribute, which should change the status of the peer to unmonitored. Increasing the qualify value is not likely to work, because 2 seconds is plenty of time to get a response from the peer in most situations (of course it may help sometimes, and it’s worth a shot). It’s important to note that the qualify value is just a timeout for the response; it’s not the interval that asterisk uses to poll the peer. Even though you won’t get an unreachable status with qualify disabled, it’s not guaranteed that your peer is going to work, and that’s because of the last line in Reasons for seeing this message. This means that if your peer is behind NAT on a LAN, its gateway has a UDP timeout on the binding between its public port and your application. The binding is established the first time you send a packet, and it stays active only for a limited time without new packets. When the timeout occurs, your application won’t receive any more packets, and won’t be able to receive the OPTIONS request from asterisk. That’s the main problem… In this case, the qualify attribute helps to keep the UDP binding active, because it keeps traffic flowing through the gateway’s public port.
In summary: try increasing the qualify timeout. If that doesn’t work, you’re losing the UDP binding, and your application is not receiving the packets anymore.
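
To make the qualify behaviour described above a bit more concrete, here is a small Python sketch that sends a single SIP OPTIONS request over UDP and classifies the result the way the quoted text describes. It is only a model of that description, not asterisk’s actual code: the names `build_options`, `probe` and `classify` and the `probe` user in the From header are made up, and the 2000 ms default simply mirrors the qualify=yes default quoted above.

```python
import socket
import time
import uuid


def build_options(target_ip, target_port, local_ip, local_port):
    """Build a minimal SIP OPTIONS request (header layout per RFC 3261)."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 branch prefix
    lines = [
        f"OPTIONS sip:{target_ip}:{target_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_ip}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_ip}:{target_port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_ip}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(target_ip, target_port=5060, wait_seconds=5.0):
    """Return the round-trip time in ms for one OPTIONS request, or None."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(wait_seconds)
        # connect() on UDP sends nothing; it just fixes the peer and lets
        # getsockname() report the local address used to reach it.
        sock.connect((target_ip, target_port))
        local_ip, local_port = sock.getsockname()
        started = time.monotonic()
        try:
            sock.send(build_options(target_ip, target_port, local_ip, local_port))
            sock.recv(4096)  # any reply at all counts as an answer
        except OSError:  # timeout, or an ICMP error such as port unreachable
            return None
        return (time.monotonic() - started) * 1000.0


def classify(rtt_ms, timeout_ms=2000):
    """Map a probe result onto the three states described in the quoted text."""
    if rtt_ms is None:
        return "UNREACHABLE"
    if rtt_ms > timeout_ms:
        return "LAGGED"
    return "REACHABLE"
```

Something like `classify(probe("192.0.2.10"), timeout_ms=3000)` (with a placeholder address) run from the asterisk box while the phones show as unreachable would tell you whether a phone still answers OPTIONS when asked directly.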

Well, I got a tcpdump output from the time the sip peers went to ‘Unreachable’ - it shows 192.68.55.115 trying to register several times, DHCP request, etc.

If you want to check it out, it will be temporarily at http://www.commund.com/tcpdump_output.txt

55.1 is the sip server, 50.3 is my desktop, 55.115 is my desk phone.

I don’t see anything weird in the dump, though.

I have my qualify set to 3000 for all peers, and we don’t have any external SIP channels - DNS is actually hosted on the same server and continues to resolve just fine after this happens.

Restarting asterisk does not fix it, the only thing that seems to reset it is to reboot the server. Taking interfaces up and down doesn’t correct it - I really have no idea why this is happening.

Luckily, after beefing up the UPS, it seems to only want to do it every other week or so. Before the UPS upgrade, it was doing it once a day.

Let me know what you think.

Thanks for your time,
-Nick

If the UPS upgrade helped, then maybe it is a bad power problem - typically UPSs do some sort of filtering on the power line. Have you considered changing your server PSU?

M