Wes,
If you work it out, can you please post a summary and the resolution? If nothing else, it will let people know not to wait anymore, as no more updates are coming. I know that usually you take a deep breath and move on to the next fire, but you have an audience here.
B.
Sorry this took so long…
So far, we've had two resolutions to this issue.
The first involved having our electricians come in and check all of the distribution panels - several of them were incorrectly wired, which was causing instability at the desktop. This did not affect the servers, however, and did not contribute to the lockup issue we were seeing, but having clean power made a tremendous difference in calls dropping at the desktop. Note that extensive tests were required for line noise, etc. - not just making sure hot and neutral weren't swapped.
The second involved updating to the 1.2.x branch - so far, on three production boxes, we've had zero issues, with 14 days of uptime on two of them. The third was done Tuesday, and the fourth box was just finished today. So, half our servers are on 1.2, the other half on 1.0.7.
The bulk of the issues we see are on the older boxes, so I have to say that our stability issues were caused by running the older Asterisk and/or Zaptel versions. We also have a dialer box running Vicidial that was updated to 1.2 a few weeks ago, and it has been running much better than before.
I don't know what more I could add, other than that simply updating to 1.2 and making a few minor modifications to modules.conf and musiconhold.conf fixed our problems. We get several warnings due to our dialplan being slightly out of date (deprecated commands, etc.), but those will be cleared up in the coming weeks, once all of our servers are running 1.2.
I will try to check in on this post should anyone have any further questions about our experience.
Thanks to everyone who helped out.
W
well, i spoke too soon.
two of the boxes we just upgraded have locked up since the upgrade to 1.2. at this point, i think we're going to end up wiping the machines and doing a fresh reinstall of the entire OS along with asterisk. i don't know what is causing this, nor do i know how to effectively troubleshoot it.
i'm open to suggestions.
Are these servers on a power-conditioning battery backup? Something like an APC Smart-UPS?
I would still recommend Slackware Linux 10.2. We have a total of 12 production Asterisk servers at different locations, all running Slackware with a custom 2.4 kernel, and only one of them locks up periodically (it seems to be some kind of motherboard issue - it's our only non-ASUS motherboard system).
whoiswes:
Gentoo is the best Linux distribution. Everything is compiled from source.
Try turning off hyper-threading. Also, have you checked the logs in Linux, and in the Dell BIOS, for PCI parity errors?
Also, at boot, try adding nmi_watchdog=1.
Hopefully we can get some debugging info if it is an interrupt error.
we've been getting the parity errors since day one, and a call to digium a few months back indicated this was normal and we shouldn't worry about it.
we haven't dropped hyperthreading yet, but that is on my list of possibilities. i feel that this issue is something specific to these two machines, as we have 8 total 2850s, all identical in config, and these are the only two affected. since the hardware is the same, the software is the only variable left.
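for anyone else wanting to confirm whether HT is actually on before flipping it in the BIOS, here's a rough sketch - it assumes an x86-style /proc/cpuinfo layout, and the sample values below are made up, not from our boxes:

```shell
# sample /proc/cpuinfo fields (made-up values); with hyper-threading on,
# "siblings" is greater than "cpu cores" - on a live box, replace the
# sample with the real file: awk -F: '/siblings/ ...' /proc/cpuinfo
sample='siblings  : 4
cpu cores  : 2'
sib=$(echo "$sample" | awk -F: '/siblings/ {print $2 + 0}')
cores=$(echo "$sample" | awk -F: '/cpu cores/ {print $2 + 0}')
if [ "$sib" -gt "$cores" ]; then
  echo "hyper-threading enabled"
else
  echo "hyper-threading disabled"
fi
```

(the numbers will vary per CPU, but the siblings-vs-cores comparison is the useful part.)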
[quote=tommy13v]Also, at boot, try adding nmi_watchdog=1.
Hopefully we can get some debugging info if it is an interrupt error.[/quote]
would you mind giving me a bit more info on this? i'm still fairly new to linux and don't have all of the finer points of troubleshooting down yet.
these servers were the first two we built, and i think that has a lot to do with it. they weren't actually designed as production boxes, but were built more as test boxes that were shoehorned into production after the fact. i will be doing a thorough cleaning of both dialplans, after which i'll blow away the entire array and start from scratch… i think that will do wonders.
thanks for the input guys.
Have you run an exhaustive motherboard test from a boot CD on these machines during off hours? (memtest, burn-in, etc.)
I found a bad chunk of memory at 1968MB out of my 2048MB of RAM on one of my flaky servers a few months ago; I replaced the last DDR2 module and have had no freezes since.
memtest86.com/
ultimatebootcd.com/
Actually, there was a firmware issue with the 4-port T1 cards back in the summer that would give this parity error and shut the server down. That is why I mentioned the NMI watchdog setting.
Are you using LILO or grub for your bootloader?
Here are some examples.
grub.conf
default=0
timeout=10
splashimage=(hd0,0)/grub/splash.xpm.gz
title HA Test Kernel (2.4.9-10smp)
root (hd0,0)
# This is the kernel's command line.
kernel /vmlinuz-2.4.9-10smp ro root=/dev/hda2 nmi_watchdog=1
end of grub.conf
On systems using lilo instead of grub, add nmi_watchdog=1 to the "append" section in /etc/lilo.conf. For example:
lilo.conf
prompt
timeout=50
default=linux
boot=/dev/hda
map=/boot/map
install=/boot/boot.b
lba32
image=/boot/vmlinuz-2.4.9-10smp
label=linux
read-only
root=/dev/hda2
append="nmi_watchdog=1"
end of lilo.conf
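Once the watchdog trips, the kernel should print NMI messages you can grep for in dmesg or /var/log/messages. A rough way to scan for them - note the "log" contents below are just a made-up sample of the sort of line a 2.4 kernel prints, not output from your machines:

```shell
# scan a kernel log excerpt for NMI messages; "log" here is a made-up
# sample - on a live box, pipe in `dmesg` or /var/log/messages instead
log='Uhhuh. NMI received. Dazed and confused, but trying to continue
eth0: link up'
echo "$log" | grep -ci 'nmi'
```

If the count comes back nonzero right before a lockup, that points at an interrupt/parity problem rather than Asterisk itself.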
we're using grub…
so this flag will basically allow the system to reboot itself in the event of an interrupt hang (that is what i'm getting from a quick google of nmi_watchdog). will there be any error logs that we can look at to determine if the cards are in fact causing the issue?
it's funny you mention summer - the two machines having the issues are running the oldest cards we have… they would have been purchased around august/september. all of the other boxes are running newer cards, purchased october or later.
any way to determine the firmware version of our digium card? we're running the quad-T1s in every box.
do an "lspci" on your command line and look for this:
(the quad cards will say "rev 01" or "rev 02")
If you have any rev 01 cards, get them upgraded as soon as you can - it really does help.
Digium T400P:
01:0a.0 Bridge: PLX Technology, Inc.: Unknown device d00d (rev 01)
Digium T100P:
01:0b.0 Network controller: Tiger Jet Network Inc. Tiger3XX Modem/ISDN interface
Clone X100P modem card(X100P):
01:0b.0 Communication controller: Tiger Jet Network Inc. Tiger3XX Modem/ISDN interface
Digium TE410P and TE405P (firmware rev 1):
00:09.0 Communication controller: Xilinx Corporation: Unknown device 0314 (rev 01)
Digium TE406P and TE405P (firmware rev 2):
01:0a.0 Communication controller: Unknown device d161:0405 (rev 02)
Sangoma a104u:
01:0a.0 Network controller: Unknown device 1923:0400
Sangoma a104d:
01:0a.0 Network controller: Unknown device 1923:0100
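If it helps, here's a one-liner to pull just the rev number out of an lspci line, using the TE410P example above as sample input (on your boxes you'd pipe lspci itself through the sed):

```shell
# extract the firmware revision from an lspci line
# (sample input copied from the TE410P example above)
line='00:09.0 Communication controller: Xilinx Corporation: Unknown device 0314 (rev 01)'
rev=$(echo "$line" | sed -n 's/.*(rev \([0-9][0-9]*\)).*/\1/p')
echo "$rev"
```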
I would call Digium and reference this link from the list this past summer.
72.14.203.104/search?q=cache:gPZ … =firefox-a
Sounds like the same issue.
well, it's not that… both of those servers show this:
thanks Matt.
i figured you meant cat /proc/interrupts:
CPU0 CPU1 CPU2 CPU3
0: 13494758 13494424 26496293 26524683 IO-APIC-edge timer
8: 0 0 0 0 IO-APIC-edge rtc
9: 0 0 0 1 IO-APIC-level acpi
14: 32180 297940 277548 109365 IO-APIC-edge ide0
177: 64674 1041727 595476 588189 IO-APIC-level megaraid
185: 28341706 0 0 18 IO-APIC-level eth0
193: 20706952 22975309 4781637 31527971 IO-APIC-level wct4xxp
NMI: 1 0 0 0
LOC: 80012579 80012582 80012581 80012580
ERR: 0
MIS: 0
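in case it's useful to anyone following along, here's a quick way to total the card's interrupt counts across CPUs - i've pasted in the wct4xxp line from above as sample input, but on a live box you'd awk /proc/interrupts directly (if that total stops climbing while calls are up, the card may be wedged):

```shell
# sum the per-CPU interrupt counts for the Digium card's line
# (sample line copied from the /proc/interrupts output above)
line='193: 20706952 22975309 4781637 31527971 IO-APIC-level wct4xxp'
echo "$line" | awk '{print $2 + $3 + $4 + $5}'
```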
UPDATE - yesterday we had three of our servers drop randomly. one was running 1.2.4, the other two were on 1.0.7. The 1.2 box has the nmi_watchdog flag added to grub.conf, and has hyperthreading disabled.
The odd thing is that we hadn't had a single issue with any of our servers locking up for about two weeks, and then, out of the blue, three of them locked up in the same day. I'm beginning to wonder if we have some external influence affecting these machines.
If anyone has ANY suggestions on this, please please please let me know.
What kind of battery backup do you use? Does it have power conditioning?
Switching from a regular battery backup (UPS) to an APC Smart-UPS with power conditioning helped our clustered crashes to the point that we don't have them at all anymore.
we have an enterprise-level power backup system that controls our entire data center, including the 5 and 10 ton A/C units. It's a Liebert 50 kVA system, hooked into our generator as well.
i don't know about power filtering, though - for some reason, i have a feeling it DOESN'T do any power filtering. i will find out about that.
i've been scouring google groups for issues, and came across one thing i haven't tried yet - turning off APIC. has anyone tried this, and if so, did it make any difference for you?
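if i do end up trying it, my understanding is it's just another kernel parameter, same as the nmi_watchdog flag - so in grub.conf the change would look something like this (kernel paths as in the earlier example; i'm not sure how noapic interacts with nmi_watchdog=1, so treat this as a sketch):
grub.conf
kernel /vmlinuz-2.4.9-10smp ro root=/dev/hda2 nmi_watchdog=1 noapic
end of grub.conf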
thanks guys.