[SOLVED] Asterisk Stability Issues - updated 6/7

Wes,

If you work it out, can you please post a summary and the resolution. If nothing else, to let people know not to wait anymore as no more updates are coming. I know that usually you take a deep breath and move on to the next fire, but you have an audience here :smiley:

B.

Sorry this took so long…

So far, we've had two resolutions to this issue.

The first involved having our electricians come in and check all of the distribution panels - several of them were incorrectly wired, which was resulting in instability at the desktop. This did not affect the servers, however, and did not contribute to the locking-up issue we were seeing, but having clean power made a tremendous difference in calls dropping at the desktop. Note that there were extensive tests required for line noise, etc…not just making sure hot and neutral weren't swapped.

The second involved updating to the 1.2.x branch - so far, on three production boxes, we've had zero issues, with 14 days of uptime on two of them. The third was done Tuesday, and the fourth box was just finished today. So, half our servers are on 1.2, the other half on 1.0.7.

The bulk of the issues we see are on the older boxes, so I have to say that our stability issues were caused by running the older Asterisk and/or Zaptel versions. We also have a dialer box running Vicidial that was updated to 1.2 a few weeks ago, and it has been running much better than previously.

I don't know what more I could add, other than that simply updating to 1.2 and making a few minor modifications to modules.conf and musiconhold.conf fixed our problems. We get several warnings due to our dialplan being slightly out of date (deprecated commands, etc.), but those will be cleared up in the coming weeks, once all of our servers are running 1.2.
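For anyone curious what I mean by "minor modifications": I won't post our exact configs, but the sort of tweak involved looks roughly like the sketch below under 1.2 (the module names and moh path here are placeholders for illustration, not our actual settings).

modules.conf

[modules]
autoload=yes
; placeholder example - skip channel drivers that aren't used on this box
noload => chan_alsa.so
noload => chan_oss.so

musiconhold.conf (1.2 uses a class-based format)

[default]
mode=files
directory=/var/lib/asterisk/mohmp3

end of examples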

I will try to check in on this post should anyone have any further questions about our experience.

Thanks to everyone who helped out.

W

well, i spoke too soon.

two of the boxes we just upgraded have locked up since the upgrade to 1.2. at this point, i think we're going to end up wiping the machines and doing a fresh reinstall of the entire OS along with asterisk. i don't know what is causing this, nor do i know how to effectively troubleshoot it.

i'm open to suggestions.

Are these servers on a power-conditioning battery backup? Something like an APC Smart-UPS?

I would still recommend Slackware Linux 10.2. We have a total of 12 production Asterisk servers at different locations, all running Slackware with a custom 2.4 kernel, and I only have one that locks up periodically (it seems to be some kind of motherboard issue; it is our only non-ASUS motherboard system).

whoiswes:
Gentoo is the best Linux Distribution. Everything is compiled from the source.

Try turning off hyperthreading. Also, have you checked the logs in Linux and in the Dell BIOS for PCI parity errors?

Also, at boot, try adding nmi_watchdog=1 to the kernel command line.

Hopefully we can get some debugging info if it is an interrupt error.

we've been getting the parity errors since day one, and a call to digium a few months back indicated this was normal and we shouldn't worry about it.

we haven't dropped hyperthreading yet, but that is on my list of possibilities. i feel that this issue is something specific to these two machines, as we have 8 total 2850's, all identical in config, and these are the only two affected. since the hardware is the same, the software is the only variable left.
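(for reference, hyperthreading being enabled is easy to confirm from the running kernel before touching the BIOS - a quick check like this should show it, assuming the kernel exposes these fields in /proc/cpuinfo:)

# on a dual-socket 2850, four "processor" entries with siblings : 2 suggests HT is on
grep -E '^processor|physical id|siblings' /proc/cpuinfo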

[quote=tommy13v]Also, at boot, try adding nmi_watchdog=1 to the kernel command line.

Hopefully we can get some debugging info if it is an interrupt error.[/quote]

would you mind giving me a bit more info on this? i'm still fairly new to linux and don't have all of the finer points of troubleshooting down yet.

these servers were the first two we built, and i think that has a lot to do with it. they weren't actually designed as production boxes, but were built more as test boxes that were shoehorned into production after the fact. i will be doing a thorough cleaning of both dialplans, after which i'll blow away the entire array and start from scratch…i think that will do wonders.

thanks for the input guys.

Have you run an exhaustive motherboard test from a boot CD on these machines during off hours? (memtest, burn-in, etc.)

I found a bad chunk of memory at 1968MB out of my 2048MB of RAM on one of my flaky servers a few months ago; replaced the last DDR2 RAM module and no freezes since then.

memtest86.com/
ultimatebootcd.com/

Actually, there was a firmware issue with the 4-port T1 cards back in the summer that would give this parity error and shut the server down. That is why I mentioned the NMI watchdog setting.

Are you using LILO or grub for your bootloader?

Here are some examples.

grub.conf

default=0
timeout=10
splashimage=(hd0,0)/grub/splash.xpm.gz
title HA Test Kernel (2.4.9-10smp)
root (hd0,0)
# This is the kernel's command line.
kernel /vmlinuz-2.4.9-10smp ro root=/dev/hda2 nmi_watchdog=1

end of grub.conf

On systems using lilo instead of grub, add nmi_watchdog=1 to the "append" line in /etc/lilo.conf. For example:

lilo.conf

prompt
timeout=50
default=linux
boot=/dev/hda
map=/boot/map
install=/boot/boot.b
lba32

image=/boot/vmlinuz-2.4.9-10smp
label=linux
read-only
root=/dev/hda2
append="nmi_watchdog=1"

end of lilo.conf
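After rebooting, you can confirm the flag actually took effect - something like this should show the parameter on the kernel command line and the NMI counter, which should keep climbing once the watchdog is armed:

# the parameters the running kernel was actually booted with
cat /proc/cmdline

# the NMI row should keep incrementing while the watchdog is active
grep NMI /proc/interrupts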

we're using grub…

so this flag will basically allow the system to reboot itself in the event of an interrupt hang (that is what i'm getting from a quick google of nmi_watchdog). will there be any error logs that we can look at to determine if in fact the cards are causing the issue?

it's funny you mention summer - the two machines having the issues are running the oldest cards we have…they would have been purchased around august/september. all of the other boxes are running newer cards, purchased october or later.

any way to determine the firmware version of our digium card? we're running the quad-T1's in every box.

Take a look here.

http://72.14.203.104/search?q=cache:gPZqiIAep7cJ:lists.digium.com/pipermail/asterisk-users/2005-July/116058.html+&hl=en&gl=us&ct=clnk&cd=1&client=firefox-a

do a 'lspci' on your command line and look for this:
(the quad cards will say "rev 01" or "rev 02")

If you have any rev 01 cards, get them upgraded as soon as you can - it really does help. (See the grep snippet after the list below if you just want to filter the output.)

Digium T400P:
01:0a.0 Bridge: PLX Technology, Inc.: Unknown device d00d (rev 01)

Digium T100P:
01:0b.0 Network controller: Tiger Jet Network Inc. Tiger3XX Modem/ISDN interface

Clone X100P modem card(X100P):
01:0b.0 Communication controller: Tiger Jet Network Inc. Tiger3XX Modem/ISDN interface

Digium TE410P and TE405P (firmware rev 1):
00:09.0 Communication controller: Xilinx Corporation: Unknown device 0314 (rev 01)

Digium TE406P and TE405P (firmware rev 2):
01:0a.0 Communication controller: Unknown device d161:0405 (rev 02)

Sangoma a104u:
01:0a.0 Network controller: Unknown device 1923:0400

Sangoma a104d:
01:0a.0 Network controller: Unknown device 1923:0100
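If you want to pull just those entries out of a long lspci listing, something like this should do it (the match strings come straight from the examples above, so adjust them if your pci.ids database names the devices differently):

# show only the telephony cards so the (rev 01)/(rev 02) part is easy to spot
lspci | grep -i -E 'd161|xilinx|tiger|plx|1923'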

I would call Digium and reference this link from the list this past summer.

72.14.203.104/search?q=cache:gPZ … =firefox-a

Sounds like the same issue.

well, it's not that…both of those servers show this:

thanks Matt.

post a /cat/proc

i figured you meant cat /proc/interrupts:

           CPU0       CPU1       CPU2       CPU3
  0:   13494758   13494424   26496293   26524683   IO-APIC-edge   timer
  8:          0          0          0          0   IO-APIC-edge   rtc
  9:          0          0          0          1   IO-APIC-level  acpi
 14:      32180     297940     277548     109365   IO-APIC-edge   ide0
177:      64674    1041727     595476     588189   IO-APIC-level  megaraid
185:   28341706          0          0         18   IO-APIC-level  eth0
193:   20706952   22975309    4781637   31527971   IO-APIC-level  wct4xxp
NMI:          1          0          0          0
LOC:   80012579   80012582   80012581   80012580
ERR:          0
MIS:          0
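(if it helps anyone reproduce this: one way to see whether the wct4xxp interrupt counter stalls when a box wedges - assuming the console is still responsive at that point - would be to leave a terminal up running something like this:)

# refresh the card's interrupt counters (and the NMI line) every second
watch -n 1 "grep -E 'wct4xxp|NMI' /proc/interrupts"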

UPDATE - yesterday we had three of our servers drop randomly. one was running 1.2.4, the other two were running 1.0.7. The 1.2 box has the nmi_watchdog flag added to grub.conf, and has hyperthreading disabled.

The odd thing is that we hadn't had a single issue with any of our servers locking up for about two weeks, and then out of the blue three of them locked up on the same day. I'm beginning to wonder if we have some external influence affecting these machines.

If anyone has ANY suggestions on this, please please please let me know.

What kind of battery backup do you use? Does it have power conditioning?

Switching from a regular battery backup (UPS) to an APC Smart-UPS with power conditioning helped with our clustered crashes to the point that we don't have them at all any more.

we have an enterprise-level power backup system that controls our entire data center, including the 5 and 10 ton A/C units. It's a Liebert 50 kVA system, hooked into our generator as well.

i don't know about power filtering, though - for some reason, i have a feeling it DOESN'T do any power filtering. i will find out about that.

i've been scouring google groups for issues, and came across one thing i haven't tried yet - turning off APIC. has anyone tried this, and if so, did it make any difference for you?
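(if we go that route, from what i can tell it would just be another flag on the same grub kernel line - something like the example below, reusing the kernel and root from the grub.conf sample earlier in the thread. one caveat: nmi_watchdog=1 relies on the IO-APIC, so with noapic the watchdog would presumably need to switch to the local-APIC mode, nmi_watchdog=2.)

# example only - same kernel and root as the earlier grub.conf sample
kernel /vmlinuz-2.4.9-10smp ro root=/dev/hda2 noapic nmi_watchdog=2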

thanks guys.