All Registrations drop at once...everything unreachable


#1

Hot failover setup:

Dual quad-core Dell R410s with hardware RAID 1, completely redundant.

An rsync script copies all pertinent directories every 5 minutes.
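(For reference, the cron side is just along these lines; the directory list and standby hostname here are placeholders, not my actual script:)

```
# /etc/cron.d/asterisk-sync -- placeholder paths and hostname
*/5 * * * * root rsync -a --delete /etc/asterisk/ root@standby:/etc/asterisk/
*/5 * * * * root rsync -a --delete /var/lib/asterisk/ root@standby:/var/lib/asterisk/
```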

Heartbeat is working between the two boxes, watching httpd, asterisk, and mysqld (the heartbeat controls work as well, i.e. grabbing resources).

MySQL replication is working in master/slave mode (the second box has a realtime copy of the Asterisk database).

Failover is working in real time with SIP registrations, endpoints, etc.

Boxes are behind a Cisco ASA 5505 on a 172.19.x.x addressing scheme, with a private 192.168.x.x network for heartbeat on eth1 (direct crossover connection); eth0 is gigabit-connected to an HP managed switch. Everything is in a datacenter.

Asterisk 1.6.2.13 on both boxes

Everything runs fine on the primary box for 15, 20, 30 minutes. Then everything becomes unreachable: sip debug shows retransmission #6 or so for the phones, and the registration to the SIP provider keeps retrying. Phones will not re-register (I am testing with a softphone on this side; all phones are remote). I have turned up logging but still see nothing noted in the logs, just a series of unreachable extensions, then the SIP provider, and nothing re-registers until the network is restarted or the box is rebooted. Restarting Asterisk does not seem to clear the issue.
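For reference, I am watching it from the Asterisk CLI with (1.6.x syntax, as I recall it):

```
pbx*CLI> sip show peers       (peer status; this is where UNREACHABLE shows up)
pbx*CLI> sip show registry    (state of the outbound registration to the provider)
pbx*CLI> sip set debug on     (per-packet SIP trace)
```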

I have the bind address set to the virtual IP, which floats between the two boxes via heartbeat. Audio and registration work fine when they work.

The boxes were on 1.6.2.10 yesterday; I read about a bug suggesting the SIP stack was having issues and upgraded via RPM to 1.6.2.13, but the same issue persists.

The boxes didn't seem to do this in the lab, where they were behind a Draytek, not a Cisco. The Cisco has SIP fixup disabled and the proper ports mapped to the virtual IP, but I have not yet convinced myself this is a network issue anywhere other than the NIC(s) on the R410.

I have turned up logging and watched the console when this happens, and I have no idea what's going on. No issue is reported, per se.

Anyone? I'm starting to pull my hair out. Thanks in advance for your responses and knowledge.


#2

It just did it again after an hour of uptime… makes no sense.


#3

Hi

Does it do the same on the failover box?

Also, I would assume you have been running tcpdump to capture all the traffic on port 5060; watching the console won't show you much.
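Something bounded along these lines is enough for a first look (the interface name and capture path are my assumptions about your setup):

```shell
# Bounded SIP capture: save port-5060 traffic to a file so one-way patterns
# show up when you read the trace back. Interface and path are assumptions.
IFACE=eth0
CAP=/var/tmp/sip-capture.pcap
if command -v tcpdump >/dev/null 2>&1; then
    # stop after 200 packets or 10 seconds, whichever comes first
    timeout 10 tcpdump -ni "$IFACE" -s 0 -c 200 -w "$CAP" udp port 5060
fi
```

Read it back later with `tcpdump -r /var/tmp/sip-capture.pcap` (or open it in Wireshark) and look at which direction the packets flow.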

It sounds like you have a network issue, as this setup should work fine; we have it on many sites.

Ian


#4

That's what I figure… it was fine behind that Draytek, and now I'm fighting it behind the Cisco ASA 5505. I also just read that NIC bonding (a virtual interface) can have issues behind Cisco devices…

hmmm


#5

Ian, it looks like you were right. With the floating IP, the Cisco apparently doesn't like the MAC address change:

Feb 22 2011 12:22:39: %ASA-4-405001: Received ARP request collision from 172.19.x.x/842b.2b77.e95b on interface INSIDE with existing ARP entry 172.19.x.x/842b.2b77.28f6

It looks like the IP may be flipping back and forth between the boxes, which could certainly be why all connections drop at once. There is nothing in the ASA that would close all connections at once unless it lost link completely.

Which makes sense with a virtual floating IP…

Also, when the issue happens, the tcpdump shows one-way SIP packets, only outbound:

14:37:30.638822 IP 172.19.x.x.sip > 70.x.x.170.33533: SIP, length: 522
14:37:33.478641 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:37:37.478562 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:37:38.478382 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:37:39.478328 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:37:40.638402 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:37:41.478205 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:37:41.638203 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:37:42.638153 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:37:43.638081 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:37:44.638039 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:37:45.477993 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:37:49.478761 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:37:53.478544 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:37:54.637596 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:37:55.637425 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:37:56.637373 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:37:57.478456 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:37:57.638311 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:37:58.478273 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:37:58.637260 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:37:59.478211 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:38:01.478074 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:38:05.477873 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:38:08.636734 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:38:09.477660 IP 172.19.x.x.sip > 208.x.x.x.sip: SIP, length: 625
14:38:09.637637 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:38:10.637595 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522
14:38:11.637546 IP 172.19.x.x.sip > 70.182.x.x.33533: SIP, length: 522


#6

Hi

On the systems we run like this, we have a script that, as well as changing the floating IP, changes the MAC address at the same time.
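As a sketch only, the idea is along these lines; every value here is a placeholder, since our actual script is site-specific:

```shell
#!/bin/sh
# Takeover sketch: move the floating IP and pin a fixed, locally administered
# MAC, so the ASA's ARP entry never has to change on failover. All values
# below are placeholders, not from a real deployment.
IFACE=eth0
VIP=172.19.0.50/24                 # hypothetical virtual IP
VMAC=02:00:ac:13:00:32             # locally administered MAC, same on both nodes

takeover() {
    ip link set dev "$IFACE" down
    ip link set dev "$IFACE" address "$VMAC"   # identical MAC on whichever node is active
    ip link set dev "$IFACE" up
    ip addr add "$VIP" dev "$IFACE"
    # gratuitous ARP so neighbours (including the ASA) update immediately
    arping -U -c 3 -I "$IFACE" "${VIP%/*}"
}
```

Both nodes use the same locally administered MAC (the `02:` prefix marks it as locally administered), so from the ASA's point of view the IP-to-MAC binding never changes, only the port it is learned on.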

Ian


#7

Interestingly enough, the dump starts going one-way (as posted above) when heartbeat kicks over, which I would expect.

But the question then becomes: why is heartbeat kicking over repeatedly?

I just turned off heartbeat… brought up a static eth0:0 alias… and everything seems OK.

Does anyone have a suggestion as to why heartbeat could be popping back and forth between the servers, causing this issue?


#8

What does that script consist of? And I still need to find out why heartbeat is kicking over when nothing has failed.

Thanks for your direction, Ian; it is extremely appreciated. I owe you a beer, mate.


#9

Hi, are you saying that heartbeat is flipping?

It might be a less disjointed conversation if you Skype me, as I'm not always looking at the forum.

Ian
www.cyber-cottage.co.uk


#10

Just in case you all didn't know, or couldn't tell from his 2,590+ posts, Ian is THE MAN!

Cheers mate! Open source is all about folks like you helping folks like me.


#11

Howdy,

What did the final solution end up being? Taking over the MAC as well?


#12

I am not at that point yet; it is still in progress.

A large part of the issue was that the DNS records didn't exist publicly when it was in the lab. They do now, as it is being migrated to production. As a result I modified the DNS entries and HA started working much better. That said, I am still trying to figure out how to get postfix to do an authenticated relay based on a DNS record that doesn't exist; if I re-create the record, the packets will hit the internet again, and heartbeat will have more issues.

I have figured out this much:

The heartbeat timers had to be increased. I didn't realize it until doing some low-level testing, but the crossover link auto-negotiates to 100 full between eth1 on the two boxes and is seeing 30-40% packet loss. I am troubleshooting that now. I have tried manually setting it to 10 half, 10 full, 100 full, 100 half, etc., and only 100 full works. When the packet loss issue is fixed (still working on this; it makes no sense) I will restore the heartbeat timers to their original values.
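The duplex forcing and loss test were along these lines (eth1 and the peer address stand in for my actual values):

```shell
# Force the heartbeat link to 100 Mb full duplex, then measure loss across
# the crossover cable. eth1 and the peer IP are placeholders.
IFACE=eth1
PEER=192.168.1.2
if command -v ethtool >/dev/null 2>&1; then
    ethtool -s "$IFACE" speed 100 duplex full autoneg off
fi
# check the "packet loss" summary line of the ping output
ping -c 3 -W 1 "$PEER" || true
```

Run the same ethtool setting on both ends; a forced/auto mismatch is itself a classic cause of duplex errors and loss.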

The other issue I am running into is a message in the heartbeat logs: “ERROR: both machines own our resources”.

I am troubleshooting that as well. I figure the latter has something to do with the former, so they should be troubleshot in that order.
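For anyone following along, the timers live in /etc/ha.d/ha.cf; a loosened set looks something like this (values and node names are illustrative, not my final ones):

```
# /etc/ha.d/ha.cf -- illustrative values, loosened to tolerate a lossy link
keepalive 2        # seconds between heartbeats
warntime 10        # log a warning after this long without one
deadtime 30        # declare the peer dead after this long
initdead 60        # extra grace period at startup
udpport 694
ucast eth1 192.168.1.2
auto_failback off  # don't flap back automatically when the primary returns
node pbx1.example.com pbx2.example.com
```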


#13

If anyone has suggestions on how to take over the MAC on the secondary box, I am all ears, as this is a large part of the issue. The Cisco ASA 5505 doesn't like the MACs swapping on the virtual IP.

That's actually what gave rise to the entire issue, with HA being the culprit.


#14

So… in summary:

UNREACHABLE was due to the MAC address changing on the network because of an unstable heartbeat connection. Apparently my guy put a used/handmade crossover cable between the boxes for heartbeat, which resulted in 30-40% packet loss.

It took a few days to figure out, but it is resolved. HA is stable and the MAC isn't flopping around anymore, which means the boxes are now stable.

With Ian's help I was also able to get a script (all credit due to him) to control the virtual IP's MAC when swapping services; it is the first script executed of the five services the boxes are clustered for.


#15

Also, just as a side note: the Cisco ASA 5505 holds MAC/IP (ARP) associations for four hours by default. This was dropped to 60 seconds, and the HA setup now works without issue even without the scripts.
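On the ASA, that change is a one-liner from global configuration mode (the 14400-second default is what gives the four-hour window):

```
! Shorten the ARP cache timeout from the 14400 s (4 h) default so a
! failover's new MAC is learned within a minute.
ciscoasa(config)# arp timeout 60
```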


#16

Hi Jake,

Can you share how to write/configure the HA script to change the MAC address after HA comes up? I am assuming I need to write a script in /etc/init.d and reference it in the /etc/ha.d/haresources file? I am running two Asterisk servers with HA and want the MAC address of eth0:0 to float over to the slave node as well if the primary node goes down.
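Something like this is what I have in mind for haresources (node name, IP, and script name are just placeholders):

```
# /etc/ha.d/haresources -- placeholder names; "macfixup" would be my MAC
# script in /etc/ha.d/resource.d/ or /etc/init.d/, started first on takeover
pbx1.example.com macfixup 172.19.0.50 asterisk mysqld httpd
```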

Thanks,
Rocky

[quote=“voipcitadel.com”]So… in summary:

UNREACHABLE was due to the MAC address changing on the network because of an unstable heartbeat connection. Apparently my guy put a used/handmade crossover cable between the boxes for heartbeat, which resulted in 30-40% packet loss.

It took a few days to figure out, but it is resolved. HA is stable and the MAC isn't flopping around anymore, which means the boxes are now stable.

With Ian's help I was also able to get a script (all credit due to him) to control the virtual IP's MAC when swapping services; it is the first script executed of the five services the boxes are clustered for.[/quote]