[SOLVED] Asterisk Stability Issues - updated 6/7

Wes,

If you work it out, can you please post a summary and the resolution. If nothing else, to let people know not to wait anymore as no more updates are coming. I know that usually you take a deep breath and move on to the next fire, but you have an audience here :smiley:

B.

Sorry this took so long…

So far, we've had two resolutions to this issue.

The first involved having our electricians come in and check all of the distribution panels - several of them were incorrectly wired, which was resulting in instability at the desktop. This did not affect the servers, however, and did not contribute to the locking-up issue we were seeing, but having clean power made a tremendous difference in calls dropping at the desktop. Note that there were extensive tests required for line noise, etc…not just making sure hot and neutral weren't swapped.

The second involved updating to the 1.2.x branch - so far, on three production boxes, we've had zero issues, with 14 days of uptime on two of them. The third was done Tuesday, and the fourth box was just finished today. So, half our servers are on 1.2, the other half on 1.0.7.

The bulk of the issues we see are on the older boxes, so I have to say that our stability issues were caused by running the older Asterisk and/or Zaptel versions. We also have a dialer box running Vicidial that was updated to 1.2 a few weeks ago, and it has been running much better than previously.

I don't know what more I could add, other than that simply updating to 1.2 and making a few minor modifications to modules.conf and musiconhold.conf fixed our problems. We get several warnings due to our dialplan being slightly out of date (deprecated commands, etc.), but those will be cleared up in the coming weeks, once all of our servers are running 1.2.
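For anyone curious what I mean by "minor modifications": I won't post our exact configs, but the sort of tweak involved looks roughly like the sketch below under 1.2 (the module names and moh path here are placeholders for illustration, not our actual settings).

modules.conf

[modules]
autoload=yes
; placeholder example - skip channel drivers that aren't used on this box
noload => chan_alsa.so
noload => chan_oss.so

musiconhold.conf (1.2 uses a class-based format)

[default]
mode=files
directory=/var/lib/asterisk/mohmp3

end of examples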

I will try to check in on this post should anyone have any further questions about our experience.

Thanks to everyone who helped out.

W

well, i spoke too soon.

two of the boxes we just upgraded have locked up since the upgrade to 1.2. at this point, i think we're going to end up wiping the machines and doing a fresh reinstall of the entire OS along with asterisk. i don't know what is causing this, nor do i know how to effectively troubleshoot it.

i'm open to suggestions.

Are these servers on a power-conditioning battery backup? Something like an APC Smart-UPS?

I would still recommend Slackware Linux 10.2. We have a total of 12 production Asterisk servers at different locations, all running Slackware with a custom 2.4 kernel, and I only have one that locks up periodically (it seems to be some kind of motherboard issue; it is our only non-ASUS motherboard system).

whoiswes:
Gentoo is the best Linux Distribution. Everything is compiled from the source.

Try turning off hyperthreading. Also, have you checked the logs in Linux and in the Dell BIOS for PCI parity errors?

Also, at boot, try adding nmi_watchdog=1 to the kernel command line.

Hopefully we can get some debugging info if it is an interrupt error.

we've been getting the parity errors since day one, and a call to digium a few months back indicated this was normal and we shouldn't worry about it.

we haven't dropped hyperthreading yet, but that is on my list of possibilities. i feel that this issue is something specific to these two machines, as we have 8 total 2850's, all identical in config, and these are the only two affected. since the hardware is the same, the software is the only variable left.
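(for reference, hyperthreading being enabled is easy to confirm from the running kernel before touching the BIOS - a quick check like this should show it, assuming the kernel exposes these fields in /proc/cpuinfo:)

# on a dual-socket 2850, four "processor" entries with siblings : 2 suggests HT is on
grep -E '^processor|physical id|siblings' /proc/cpuinfo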

[quote=tommy13v]Also, at boot, try adding nmi_watchdog=1 to the kernel command line.

Hopefully we can get some debugging info if it is an interrupt error.[/quote]

would you mind giving me a bit more info on this? i'm still fairly new to linux and don't have all of the finer points of troubleshooting down yet.

these servers were the first two we built, and i think that has a lot to do with it. they weren't actually designed as production boxes, but were built more as test boxes that were shoehorned into production after the fact. i will be doing a thorough cleaning of both dialplans, after which i'll blow away the entire array and start from scratch…i think that will do wonders.

thanks for the input guys.

Have you run an exhaustive motherboard test from a boot CD on these machines during off hours? (memtest, burn-in, etc.)

I found a bad chunk of memory at 1968MB out of my 2048MB of RAM on one of my flaky servers a few months ago; replaced the last DDR2 RAM module and no freezes since then.

memtest86.com/
ultimatebootcd.com/

Actually, there was a firmware issue with the 4-port T1 cards back in the summer that would give this parity error and shut the server down. That is why I mentioned the NMI watchdog setting.

Are you using LILO or grub for your bootloader?

Here are some examples.

grub.conf

default=0
timeout=10
splashimage=(hd0,0)/grub/splash.xpm.gz
title HA Test Kernel (2.4.9-10smp)
root (hd0,0)
# This is the kernel's command line.
kernel /vmlinuz-2.4.9-10smp ro root=/dev/hda2 nmi_watchdog=1

end of grub.conf

On systems using lilo instead of grub, add nmi_watchdog=1 to the "append" line in /etc/lilo.conf. For example:

lilo.conf

prompt
timeout=50
default=linux
boot=/dev/hda
map=/boot/map
install=/boot/boot.b
lba32

image=/boot/vmlinuz-2.4.9-10smp
label=linux
read-only
root=/dev/hda2
append="nmi_watchdog=1"

end of lilo.conf
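After rebooting, you can confirm the flag actually took effect - something like this should show the parameter on the kernel command line and the NMI counter, which should keep climbing once the watchdog is armed:

# the parameters the running kernel was actually booted with
cat /proc/cmdline

# the NMI row should keep incrementing while the watchdog is active
grep NMI /proc/interrupts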

we're using grub…

so this flag will basically allow the system to reboot itself in the event of an interrupt hang (that is what i'm getting from a quick google of nmi_watchdog). will there be any error logs that we can look at to determine if in fact the cards are causing the issue?

it's funny you mention summer - the two machines having the issues are running the oldest cards we have…they would have been purchased around august/september. all of the other boxes are running newer cards, purchased october or later.

any way to determine the firmware version of our digium card? we're running the quad-T1's in every box.

Take a look here.

http://72.14.203.104/search?q=cache:gPZqiIAep7cJ:lists.digium.com/pipermail/asterisk-users/2005-July/116058.html+&hl=en&gl=us&ct=clnk&cd=1&client=firefox-a

do a 'lspci' on your command line and look for this:
(the quad cards will say "rev 01" or "rev 02")

If you have any rev 01 cards, get them upgraded as soon as you can - it really does help. (See the grep snippet after the list below if you just want to filter the output.)

Digium T400P:
01:0a.0 Bridge: PLX Technology, Inc.: Unknown device d00d (rev 01)

Digium T100P:
01:0b.0 Network controller: Tiger Jet Network Inc. Tiger3XX Modem/ISDN interface

Clone X100P modem card(X100P):
01:0b.0 Communication controller: Tiger Jet Network Inc. Tiger3XX Modem/ISDN interface

Digium TE410P and TE405P (firmware rev 1):
00:09.0 Communication controller: Xilinx Corporation: Unknown device 0314 (rev 01)

Digium TE406P and TE405P (firmware rev 2):
01:0a.0 Communication controller: Unknown device d161:0405 (rev 02)

Sangoma a104u:
01:0a.0 Network controller: Unknown device 1923:0400

Sangoma a104d:
01:0a.0 Network controller: Unknown device 1923:0100
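If you want to pull just those entries out of a long lspci listing, something like this should do it (the match strings come straight from the examples above, so adjust them if your pci.ids database names the devices differently):

# show only the telephony cards so the (rev 01)/(rev 02) part is easy to spot
lspci | grep -i -E 'd161|xilinx|tiger|plx|1923'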

I would call Digium and reference this link from the list this past summer.

72.14.203.104/search?q=cache:gPZ … =firefox-a

Sounds like the same issue.

well, it's not that…both of those servers show this:

thanks Matt.

post a /cat/proc

i figured you meant cat /proc/interrupts:

           CPU0       CPU1       CPU2       CPU3
  0:   13494758   13494424   26496293   26524683   IO-APIC-edge   timer
  8:          0          0          0          0   IO-APIC-edge   rtc
  9:          0          0          0          1   IO-APIC-level  acpi
 14:      32180     297940     277548     109365   IO-APIC-edge   ide0
177:      64674    1041727     595476     588189   IO-APIC-level  megaraid
185:   28341706          0          0         18   IO-APIC-level  eth0
193:   20706952   22975309    4781637   31527971   IO-APIC-level  wct4xxp
NMI:          1          0          0          0
LOC:   80012579   80012582   80012581   80012580
ERR:          0
MIS:          0
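(if it helps anyone reproduce this: one way to see whether the wct4xxp interrupt counter stalls when a box wedges - assuming the console is still responsive at that point - would be to leave a terminal up running something like this:)

# refresh the card's interrupt counters (and the NMI line) every second
watch -n 1 "grep -E 'wct4xxp|NMI' /proc/interrupts"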

UPDATE - yesterday we had three of our servers drop randomly. one was running 1.2.4, the other two were running 1.0.7. The 1.2 box has the nmi_watchdog flag added to grub.conf, and has hyperthreading disabled.

The odd thing is that we hadn't had a single issue with any of our servers locking up for about two weeks, and then out of the blue three of them locked up on the same day. I'm beginning to wonder if we have some external influence affecting these machines.

If anyone has ANY suggestions on this, please please please let me know.

What kind of battery backup do you use? Does it have power conditioning?

Switching from a regular battery backup (UPS) to an APC Smart-UPS with power conditioning helped with our clustered crashes to the point that we don't have them at all any more.

we have an enterprise-level power backup system that controls our entire data center, including the 5 and 10 ton A/C units. It's a Liebert 50 kVA system, hooked into our generator as well.

i don't know about power filtering, though - for some reason, i have a feeling it DOESN'T do any power filtering. i will find out about that.

i've been scouring google groups for issues, and came across one thing i haven't tried yet - turning off APIC. has anyone tried this, and if so, did it make any difference for you?
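(if we go that route, from what i can tell it would just be another flag on the same grub kernel line - something like the example below, reusing the kernel and root from the grub.conf sample earlier in the thread. one caveat: nmi_watchdog=1 relies on the IO-APIC, so with noapic the watchdog would presumably need to switch to the local-APIC mode, nmi_watchdog=2.)

# example only - same kernel and root as the earlier grub.conf sample
kernel /vmlinuz-2.4.9-10smp ro root=/dev/hda2 noapic nmi_watchdog=2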

thanks guys.