Hot failover setup:
Dual quad-core, HW RAID 1 Dell R410s, completely redundant.
An rsync script copies all pertinent directories every 5 minutes.
Heartbeat is working between the two boxes, watching httpd, asterisk and mysqld (the Heartbeat controls work as well, i.e. grabbing resources).
MySQL replication is working as master/slave (the second box has a real-time copy of the asterisk database).
Failover works in real time for SIP registrations, endpoints, etc.
The boxes are behind a Cisco ASA5505 on a 172.19.x.x address scheme, with a private 192.168.x.x network for Heartbeat on ether1 (direct crossover connection). ether0 is gigabit-connected to an HP managed switch. Everything is in a datacenter.
Asterisk 1.6.2.13 on both boxes
Everything runs fine on the primary box for 15, 20, or 30 minutes. Then everything becomes unreachable: sip debug shows retransmission #6 or so for the phones, and registration to the SIP provider keeps retrying. Phones will not register (I’m using a softphone for testing on this side; all phones are remote). I have turned up logging, but nothing is noted in the logs, just a series of unreachable extensions, then the SIP provider, and nothing re-registers until the network is restarted or the box is rebooted. Restarting Asterisk does not seem to clear the issue.
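One way to pin down the exact moment the box goes dark would be to probe the SIP port from another machine and log a timestamp for every attempt. The following is only a rough sketch, not something from my setup: `build_options`, `probe` and `watch` are made-up names, and it assumes Asterisk answers an unauthenticated OPTIONS request with some reply (any status code counts as "alive").

```python
import socket
import time
import uuid


def build_options(target_ip, target_port, local_ip, local_port):
    """Build a minimal SIP OPTIONS request as bytes (hypothetical helper)."""
    branch = "z9hG4bK" + uuid.uuid4().hex[:16]  # RFC 3261 magic cookie prefix
    lines = [
        f"OPTIONS sip:{target_ip}:{target_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch={branch};rport",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_ip}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_ip}:{target_port}>",
        f"Call-ID: {uuid.uuid4().hex}",
        "CSeq: 1 OPTIONS",
        f"Contact: <sip:probe@{local_ip}:{local_port}>",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(target_ip, target_port=5060, timeout=2.0):
    """Send one OPTIONS; return the reply's status line, or None if no reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.connect((target_ip, target_port))
        local_ip, local_port = sock.getsockname()
        sock.send(build_options(target_ip, target_port, local_ip, local_port))
        reply = sock.recv(4096)
        return reply.split(b"\r\n", 1)[0].decode("ascii", "replace")
    except OSError:  # covers timeouts and ICMP port-unreachable
        return None
    finally:
        sock.close()


def watch(target_ip, interval=10.0):
    """Probe forever, printing one timestamped line per attempt."""
    while True:
        status = probe(target_ip)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, status if status else "NO REPLY", flush=True)
        time.sleep(interval)
```

Running `watch()` against the virtual IP from a host inside the ASA and from one outside it at the same time should show whether the outage starts on both sides at once or only on the far side of the firewall.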
I have the bind address set to the virtual IP, which floats between the two boxes with Heartbeat. Audio and registration work fine when it works.
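Since the bind address is the floating IP, it may be worth confirming during an outage that the address is still actually configured on the active box. A small sketch of that check, under the assumption that a plain UDP bind fails when the address is not local (true on Linux unless `net.ipv4.ip_nonlocal_bind` is enabled); `can_bind_udp` is a made-up helper name.

```python
import socket


def can_bind_udp(ip, port=0):
    """Return True if a UDP socket can be bound to ip on this host.

    Port 0 lets the kernel pick a free port, so this does not collide
    with Asterisk already listening on 5060.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((ip, port))
        return True
    except OSError:  # e.g. EADDRNOTAVAIL when the IP is not on this box
        return False
    finally:
        sock.close()
```

If this returns False for the virtual IP on the box that Heartbeat believes is primary, the address has gone missing from the interface, which would point at Heartbeat or the network scripts rather than at Asterisk.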
The boxes were on 1.6.2.10 yesterday. I read about a bug suggesting the SIP stack was having issues and upgraded via RPM to 1.6.2.13; the same issue resulted.
The box didn’t seem to do this in the lab, which was behind a Draytek, not a Cisco. The Cisco has the no sip fixup configuration applied and the proper ports are pointed at the virtual IP. But I have not yet convinced myself this is a network issue outside of the network card(s) on the R410.
I have turned up logging and been watching the console when this happens, and I have no idea what’s going on. No issue is reported per se.
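To help rule the network cards in or out, the kernel's per-interface error counters could be sampled before and after an outage. This is a sketch only: it assumes a Linux box exposing counters under `/sys/class/net/<iface>/statistics/`, the function names are made up, and the interface name you pass in (for example `eth0`) is a guess at what I called ether0 above.

```python
from pathlib import Path

# Counter files the Linux kernel exposes per interface in sysfs.
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")


def read_nic_counters(iface, sysfs_root="/sys/class/net"):
    """Return the current error/drop counters for iface as a dict of ints."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    return {
        name: int((stats_dir / name).read_text().strip())
        for name in COUNTERS
    }


def counter_deltas(before, after):
    """Return how much each counter grew between two samples."""
    return {name: after[name] - before[name] for name in before}
```

Taking one sample while things work and another right after the phones go unreachable, then looking at `counter_deltas(before, after)`, would show whether errors or drops jumped on the card during the failure. Flat counters would make the NIC a less likely suspect, though they would not prove anything about the ASA.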
Anyone? I’m starting to pull my hair out. Appreciate your responses and knowledge in advance.