[SOLVED] Asterisk Stability Issues - updated 6/7

If it’s an NMI issue, the SVN branch as of 3/3/06 and the 1.2.5 release of zaptel contain a fix. Release version 1.2.4 of zaptel does not contain the fix.

We have APIC on all of our Asterisk servers and it doesn’t seem to be an issue with them.

Not to be confused with ACPI which you should be able to disable on your motherboard. Do you have ACPI active? We have had problems with ACPI before, and we disable it on all new machines when we build them.

oops, i meant ACPI, so you are correct. if i have this straight, i need to add ‘acpi=off’ to grub.conf to disable ACPI. i don’t recall if the 2850’s have a BIOS option to disable ACPI - i doubt it, given how much dell restricts their BIOSes.
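For reference, a grub.conf entry with ACPI disabled might look roughly like the sketch below. Only `acpi=off` comes from this thread; the title, kernel image name and root device are placeholders and will differ on your machine.

```
# /boot/grub/grub.conf (excerpt) - title, <version> and <root-device> are placeholders
title Linux
    root (hd0,0)
    kernel /vmlinuz-<version> ro root=<root-device> acpi=off
    initrd /initrd-<version>.img
```

After the next reboot, `cat /proc/cmdline` should show whether the kernel was actually booted with `acpi=off`.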

i will give that a shot on at least one of our machines - i wish these problems occurred more frequently - it would be much easier to determine if the changes i am making are having the slightest effect.

I had a similar behaviour with Asterisk 1.2.x running in realtime mode. Check your “asterisk.conf” for “highpriority=yes” and change it to “no” if it’s set.

This is just a workaround; the bug is still there. But I would rather have a stable system running Asterisk at normal priority than one running at realtime priority and freezing at least once every 24 hours, as it did in my case.
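As a sketch, the setting would sit in the [options] section of asterisk.conf. The file path shown is the usual default and is an assumption here, so check where your install keeps it:

```
; /etc/asterisk/asterisk.conf (path assumed)
[options]
highpriority = no
```

Asterisk needs a restart to pick up the change.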

i think i saw your thread - i have checked all of our boxes, and that variable isn’t even set - i’m assuming if undefined, it defaults to no.

i went ahead and added it to one server…it didn’t seem to change anything on my test box (in the way asterisk loaded, anyway).

another one of our 1.0.7 boxes locked up this morning - it had been rebooted less than three hours before.

this box has the nmi_watchdog flag set, and that seemed to make no difference at all.

one thing i noticed in the messages file on startup was that i was receiving the following:

Apr 7 08:36:48 aml-ast kernel: TE4XXP: Span 1 configured for ESF/B8ZS
Apr 7 08:36:48 aml-ast kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
Apr 7 08:36:48 aml-ast kernel: You probably have a hardware problem with your RAM chips

i’m assuming this is related to zaptel and not the physical memory - i also find it hard to believe that 6 out of 8 servers have bad RAM. regardless, i will look into replacing the RAM, provided we have some extra lying around.
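One way to see whether these NMI warnings line up with the lockups is to pull their timestamps out of the messages file. This is a minimal sketch: `nmi_events` is a hypothetical helper name, and the default path /var/log/messages is an assumption, since the exact location depends on the syslog configuration.

```shell
# Print the timestamp (first three syslog fields) of every NMI warning
# found in a syslog-style file. The default path is an assumption.
nmi_events() {
    awk '/NMI received/ { print $1, $2, $3 }' "${1:-/var/log/messages}"
}

# Example (not run here): nmi_events /var/log/messages
```

Comparing that list against the times of the reboots should show whether the NMI only fires when zaptel loads at startup or also shortly before a freeze.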

this machine did not have highpriority=no set in asterisk.conf, so i added it.

another thing - in the zaptel 1.2.5 release notes, there is this note:

i’d be interested to know what changes were made…i will be updating our 1.2.x boxes to the new zaptel version this weekend, but only one of those has ever had problems, and they are all running identical configurations, for the most part.

i am still open to suggestions or ideas, especially if anyone can tell me how to start logging these lockups…to my knowledge, nothing is being written anywhere that can help me determine what is happening - if someone can tell me how i would go about gathering this information for debugging purposes, i would very much appreciate it.

thanks for continuing to read this one…

Don’t dismiss the idea of bad RAM (although 6 out of 8 is not very probable). I banged my head against a wall for several months until I ran memtest overnight on one of our machines that was freezing randomly, and it found an error at the end of the last DDR2 DIMM. We replaced it and all has been good since then.

Download the ultimate boot CD and give it a try:
ultimatebootcd.com/

To begin with, you can set “kernel.sysrq = 1” in /etc/sysctl.conf.

“kernel.panic = 15” in the same file tells the kernel to reboot itself 15 seconds after it encounters a kernel panic. That didn’t help in my case with the realtime priority, because the kernel was fine; Asterisk had simply entered an infinite loop and was consuming all the processor power. However, if you have issues with kernel modules (like zaptel) and the kernel panics, that might be a quick workaround to have the machines reset automatically when they lock.
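Put together, the two settings above would look like this in the file:

```
# /etc/sysctl.conf
kernel.sysrq = 1
kernel.panic = 15
```

Running `sysctl -p` as root should load them without a reboot.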

With SysRq enabled, you should be able to get extra information from the kernel at any time using the keyboard. You can find more about it in the kernel sources, in the Documentation/sysrq.txt file. Basically, by pressing Alt+SysRq+letter you can get different info such as memory info (letter = m), running tasks (t), current registers and flags (p) and help (h).
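If the box is headless or you are logged in over SSH, the same letters can usually be sent through /proc/sysrq-trigger instead of the keyboard, on kernels that provide that file. This is a minimal sketch: `sysrq_send` and the `SYSRQ_TRIGGER` override are hypothetical names, and the override exists only so the function can be tried against an ordinary file first.

```shell
# Send one SysRq command letter to the kernel (needs root on a real system).
# SYSRQ_TRIGGER is a hypothetical override for safely trying the function out.
sysrq_send() {
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    printf '%s\n' "$1" > "$trigger"
}

# Examples (not run here):
#   sysrq_send m    # memory info
#   sysrq_send t    # running tasks
```

The output goes to the kernel log, so look for it with `dmesg` or in the messages file.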

Now, if you have a kernel panic and you want to debug it, you need a core dump. One way to obtain it is the netdump daemon (included in Red Hat distributions). Basically, it dumps a core over the network when a kernel panic is encountered. You can test that by using Alt+SysRq+c to simulate a kernel crash.

Good luck.

I just sent an email to digium, and they suggested that it is the onboard ethernet controller causing the issues (uses the e1000 chipset).

Why on earth the 2850 was listed on a recommended servers page (which has since been removed) I have no idea…I guess we’re off to Microcenter to pick up some gig-E cards…

/frustrated beyond belief right now

EDIT: what really irks me is that I had called them twice (PAID support, no less) about this and nobody even suggested that the e1000 might be the culprit. ARGH!

How can I configure Asterisk to work with a Cisco gateway over SIP, without configuring the sip-ua option on the Cisco?
I would like to configure only dial-peer voice XXX voip on the Cisco.
Please, can you help me?
Thanks

UPDATE 4/11

After conferring with Digium tech support, we determined that the onboard gigabit controller might be the problem, so we disabled all onboard controllers (actually, we disabled everything but the RAID controller) and installed a Linksys EG1032 Gig-E card (realtek 8169 chipset). The machine was up from Saturday afternoon until yesterday afternoon, at which point it locked up again. I was out of the office, but as far as I can tell, the issue is the same as it was with the onboard controller.

So we’re back to square one (almost).

I had a few thoughts - we are currently running kernel 2.6.11 - any reason to update?

angler, on this box, we are running * 1.0.7, so updating the zaptel drivers is not an option (yet). Moving to 1.2 across the board is in the pipeline, and it is something we are planning on doing.

We use Linux Kernel 2.4.31. It’s very stable and we’ve never had any issues with drivers or compatibility using it.

Also, have you by chance tried a Sangoma a104u quad T1 card in these machines to see if they have the same issues as they do with the Digium cards in them?

MATT—

UPDATE: sorry this took so long to update, but i’ve been pretty busy.

we had one of our servers crash and burn (lost two hard drives in the array, and thus the array)…after getting it rebuilt, we completely reformatted all of our machines and did reinstalls of everything. we were still having stability issues, and finally just went out and bought a Sangoma A104D.

immediately, almost all of our sound quality issues disappeared, and we haven’t had a server lock up yet.

our remote location (fed via a 10Mbit fiber pipe) was having some slight dropouts in audio - updating the e1000 drivers has almost eliminated that as well…by and large, the sangoma cards have been an absolute godsend.

i know that many users don’t have any issues at all with the digium cards, and i’m not knocking them, but we probably won’t buy anything but sangoma from here on out…it was a night and day difference for us.

Careful, you are posting this on a Digium-owned server, might get banned :smiley:

That was our experience on a few of our servers as well. We do still have some servers running on Digium cards, and they run very well, but we had a couple that didn’t start behaving until we put Sangoma cards in them (especially two quad cards in one system).