[SOLVED] Asterisk Stability Issues - updated 6/7

Hello everyone,

We are having random lockups on some of our servers that require the machine to be physically powered off and brought back up. We have 8 servers currently in production, all with the same basic config. Our dialplans are simple - three or four inbound 800 numbers that dump into menus with a few options. Each server has one or two inbound call queues. We are running MySQL and Apache, but no major processes other than Asterisk.

Our system config:

Asterisk 1.0.7 (running zaptel 1.0.9.2 drivers as per Digium’s suggestion)
Dell Poweredge 2850 (Dual Xeon 3.0s, HT enabled)
4 GB RAM
Ultra 320 SCSI RAID 5 Disk array
Gig E
TE410P Quad T1 cards
Fedora Core 4
init level 3, no framebuffer
onboard sound, CD, second NIC, COM port - all disabled in BIOS

We are currently restarting the servers every night, as that helps some, but after only an hour or two, the servers can lock up again. This doesn’t affect every server equally - two of them are particularly prone to this and lock up three or four times a week, while another box hasn’t locked up in three weeks.

I’m fairly inexperienced when it comes to system building in Linux. I’ve made sure that all of the cards are on their own IRQ, and that we are not running any extraneous processes, but even those steps haven’t helped much.
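
For anyone wanting to double-check the IRQ claim on their own box, here is a minimal sketch. It assumes the usual /proc/interrupts layout, where a shared line lists several handlers separated by commas; the function name shared_irqs is just a made-up helper, not anything shipped with Asterisk or zaptel.

```shell
#!/bin/sh
# shared_irqs: print interrupt lines that list more than one handler.
# Reads /proc/interrupts unless another file (e.g. a saved copy) is given.
shared_irqs() {
    # Numbered IRQ rows start with "<n>:"; shared rows name several
    # handlers separated by commas at the end of the row.
    grep -E '^ *[0-9]+:' "${1:-/proc/interrupts}" | grep ','
}
```

Running shared_irqs with no argument inspects the live system; if a T1 card's driver shows up on a printed line, it is sharing that interrupt with whatever else is named there.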

Another thing that we are seeing on one of the boxes in particular is that all active calls will just drop. This seems to affect one of the queues the most - the CSRs will be on the phone answering queue calls and every call will just drop - their phones immediately ring with the next calls in queue. This happens once or twice a week, and this server has been fairly stable otherwise.

I’m about to pull my hair out because nothing I have done has made anything better. If you have any thoughts or suggestions, I’d very much appreciate them.

Wes

UPDATE - yesterday we had three of our servers drop randomly. One was running 1.2.4; the other two are running 1.0.7. The 1.2 box has the nmi_watchdog flag added to grub.conf and has hyperthreading disabled.
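
A quick way to confirm the watchdog option actually reached the running kernel is to look at the boot command line. This is only a sketch: the helper name nmi_watchdog_setting is made up, and the post does not say which value was used in grub.conf, so the function simply prints whatever it finds.

```shell
#!/bin/sh
# nmi_watchdog_setting: report whether the running kernel was booted with
# an nmi_watchdog= option. Reads /proc/cmdline unless a file is given.
nmi_watchdog_setting() {
    grep -o 'nmi_watchdog=[^ ]*' "${1:-/proc/cmdline}" || echo 'nmi_watchdog not set'
}
```

If the option is missing after a reboot, the edit to grub.conf probably landed on a kernel entry other than the one being booted.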

The odd thing is that we hadn’t had a single issue with any of our servers locking up for about two weeks, and then, out of the blue, three of them locked up on the same day. I’m beginning to wonder if we have some external influence affecting these machines.

If anyone has ANY suggestions on this, please please please let me know.

Upgrade to 1.2.4. I’m suspecting zaptel driver bug(s)…

Yikes, how frustrating this has to be!

When you say lock up… do you really mean that the entire system freezes and just dies, requiring a physical power cycle? Or does the application just seem to die and you need to reboot to get things up and running again?

Not an easy thing to do in a production environment, UNLESS we can stay on Asterisk 1.0.7 while running the zaptel 1.2.4 drivers… is this possible?

Entire system freezes - I can’t hit the machine via SSH, nor can I hit it via the console - about half the time, the machine will still respond to pings, but we cannot register phones and neither inbound nor outbound calls complete successfully.

We are planning a migration to 1.2.x, but not until fixes for two bugs are committed to the source code. Both involve the call-limit not working correctly, and both have patches, but my CTO is hesitant to push the new code live until it has been thoroughly tested. If we can find a stop-gap solution for another month or two, that would be enough.

Thanks guys!!!

W

Hello,

Try upgrading to Asterisk 1.0.10 if possible. Have you done extensive modifications to 1.0.7 so that you cannot upgrade Asterisk?

Also, I would recommend getting off of Dells: try an Asus or SuperMicro motherboard system instead, and try a less bloated Linux distro like Slackware with a kernel custom-built for your hardware.

I would also recommend running a heavy boot-time testing app like memtest for several hours on the server to see if you have some faulty RAM (this happened to us a few months ago).

Do you see any kernel panic messages on the monitors of these servers when they lock up?

We are running Dell 2650s with RHEL4 and TE411P cards.

No issues.

Yeah, don’t worry about the Dells, they work fine. Anyone who spent the amount of money needed to buy that many 2850s is never going to have a case to go to a new motherboard system… management will laugh at that… and then question competence.

I’d make sure you have the latest and greatest Digium drivers. Completely remove the old zaptel drivers from the system and download from the CVS site from scratch to make sure that you have the absolute latest drivers.

Would you have a box that you could install a different distribution on to check stability? Download asterisk@home 2.5 (I know it works on Dells, including 2650s and 2850s), burn the ISO to a CD and install it.

A@H has the CentOS distribution built in and takes about 30 minutes to do a complete soup-to-nuts install of the OS, Asterisk and Digium drivers. Basically, if you could install that, it would be a very fast way of testing the stability problem that you are having and seeing if it goes away. (I won’t get into an argument about production worthiness in this thread… but it IS.)

We never got Slackware to run stable on any hardware, and yes, we did follow the scratch install to a T.

We also can’t go drop $15000 on new servers - plus I’m partial to the Dell boxes as they work fairly well. I don’t believe this is a hardware issue so much as it’s a software issue.

Like I said, when the server locks, I don’t see anything - any SSH sessions are killed and the console goes dead.

If I’m going to go through the hassle of an upgrade, we’re going to 1.2, for numerous reasons - we hadn’t planned on doing that upgrade until March, so if we could upgrade JUST the zaptel drivers for now, that would be good enough for me.

One other thing - the one box we do have running 1.2 has been up for almost two weeks now and hasn’t had a SINGLE problem, and I haven’t even optimized it yet (assigning IRQ priority and turning off extraneous peripherals) - this is also the box that the executives are on, so if there were ANY problems, we’d hear about it. I’m thinking that Asterisk is the problem, specifically the version we’re running.

dolesec - what version of * are you running?

If somebody can tell me about the zaptel driver upgrade and whether that would work, I’d appreciate it.

Thanks,

Wes

This is the procedure that I follow to upgrade zaptel drivers on our systems. You sound knowledgeable enough to work your way through it. If anyone else reading has some suggestions to streamline or improve it, let me know. Thanks!

cd /usr/src    # or whichever directory contains your Zaptel drivers
rm -rf zaptel  # removes the entire directory… just to be sure

export CVSROOT=:pserver:anoncvs@cvs.digium.com:/usr/cvsroot
cvs login    # the password is anoncvs
cvs checkout zaptel

cd /usr/src/zaptel

make clean
make
make install

modprobe zaptel
modprobe wct2xxp    # substitute whatever driver you need for wct2xxp
cd /usr/src/zaptel
make config

The command zttool should show you the state of your Digium cards… hopefully in “OK” status.

So does this mean that the zaptel driver version is independent of the Asterisk version? In other words, can you run one version of Asterisk (in our case 1.0.7) with another version of the zaptel drivers (we’d be moving to the 1.2.x branch)?

That is the only thing I really need to know at this point, and I’m betting that will alleviate our issues to a large degree.

Thanks!

Yes. The zaptel drivers are separate. What you would need to test is compatibility, so back up all your old drivers. But I’m running 1.0.9 on the latest release of zaptel, no problem.

Of course, like anything, this is something that you will want to test in a development environment. You just don’t want to go slapping new drivers into production (unless you can accommodate a backout plan). But to the best of my knowledge, specifically for the Digium PRI cards, the drivers are backwards compatible.
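
As a rough sketch of the "back up your old drivers" step, something like the following saves a module directory to a tarball so it can be restored if the new drivers misbehave. The helper name backup_modules is made up, and where the zaptel modules live on your system is an assumption you should verify before relying on it.

```shell
#!/bin/sh
# backup_modules: archive a kernel module directory before replacing drivers.
#   $1 = directory to save (assumption: wherever your zaptel modules live)
#   $2 = tarball to create
backup_modules() {
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}
```

For example, backup_modules "/lib/modules/$(uname -r)/misc" /root/zaptel-backup.tar.gz, where both paths are only illustrative. Restoring is a matter of unpacking the tarball back into the parent directory and running depmod -a.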

Just upgraded the zaptel drivers on Asterisk version 1.0.9 to the latest release of zaptel on a Digium dual PRI (wct2xxp, from the digium.com CVS site)… works fine.

Should have mentioned that after the update, either reboot or issue the command:
service zaptel restart
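
After the restart it is worth confirming that the modules really came back. A minimal sketch, assuming the standard /proc/modules format and that the drivers of interest have names starting with "zaptel" or "wct" (the helper name is made up):

```shell
#!/bin/sh
# zaptel_modules_loaded: list loaded modules whose names start with
# "zaptel" or "wct". Reads /proc/modules unless a file is given.
zaptel_modules_loaded() {
    awk '$1 ~ /^(zaptel|wct)/ { print $1 }' "${1:-/proc/modules}"
}
```

If this prints nothing, the drivers are not loaded and zttool will have nothing to report.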

DicksonC, that’s exactly what I needed to hear.

I’ll get our test box up and running today and hopefully we can put this issue to bed shortly!

Keep me posted, I’ll be interested in hearing the results. I never played with 1.0.7 so I’m not sure how buggy that release is, but 1.0.9 is working just fine.

Yes…PLEASE do it on a test box! heheeh…just in case!
Keep us posted

[quote=“whoiswes”]
Another thing that we are seeing on one of the boxes in particular is that all active calls will just drop. This seems to affect one of the queues the most - the CSRs will be on the phone answering queue calls and every call will just drop - their phones immediately ring with the next calls in queue. This happens once or twice a week, and this server has been fairly stable otherwise.

Wes[/quote]

Wes, I had the same problem. We actually narrowed it down to laughter causing the disconnects. Yes, everybody laughed when I told them the calls disconnect on certain people’s laughter. Try the following:

Set callprogress=yes in zapata.conf and try disabling busydetect as well. That worked for me and we haven’t had a call dropped since.
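
Before changing anything, it helps to see what the file currently says. A small sketch, assuming the config lives at the usual /etc/asterisk/zapata.conf and that commented-out lines start with a semicolon (the helper name is made up):

```shell
#!/bin/sh
# show_detect_settings: print the active callprogress/busydetect lines,
# with line numbers. Reads /etc/asterisk/zapata.conf unless a file is given.
show_detect_settings() {
    grep -nE '^[[:space:]]*(callprogress|busydetect)[[:space:]]*=' \
        "${1:-/etc/asterisk/zapata.conf}"
}
```

Keep in mind that zapata.conf settings apply to the channel definitions that follow them, so the same option can appear more than once with different values.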

Don’t really have advice on the other issues other than I tried Fedora Core 4 but went back to Fedora Core 3.

Good luck!

Hmmm, we have busydetect=no and callprogress=no currently… I had read that having those two settings enabled caused things to flake out.

It looks like 1.0.7 is not compatible with the newer zaptel drivers, so I’m toying with upgrading to 1.0.9 on the test box, since that shouldn’t affect our dial plan at all.

I will look into setting callprogress=yes though - even if just for testing.

Thanks!

Incompatible? Wow, sorry to have misled you on that. It surprises me - what led you to that conclusion?

How are all these boxes connected? Do you have 8 T1 lines coming in directly to each box?

I have 2 boxes but only 1 T1 line, so I split the channels using an Adtran CSU 120. The 2 boxes are also identical and running the same Asterisk version, 1.2.4, but the second box generates errors on the Telco lines and after a day or so it will just bring down the whole T1 line. Digium told me to run it in debug mode till it crashes but I haven’t done that yet. I can sympathise with your situation.

DicksonC -
I was unable to compile Asterisk 1.0.7 when running zaptel 1.2.3, but Asterisk 1.0.9 compiles fine - I know there were changes to the channel structures starting in 1.0.8, so that doesn’t surprise me. In any case, after discussing this with my CTO, we’re probably going to jump straight to the 1.2 branch - the additional features are worth it anyway.

gventer-
We have between 3 and 4 T1s per box, with each box running a quad span card - each server is set up for an individual company, so right now they are distinct units. We eventually will implement SER or some other derivation to allow for inter-Asterisk dialing, but that is down the road. We still have one more company to convert to Asterisk, and that will hopefully happen next week.

In any case, if anyone has any thoughts on how to get the zaptel 1.2.x drivers to play nice with Asterisk 1.0.7, I’m all ears - that would be an easy way to patch our current installs until we have the time to migrate the dial plans to 1.2 code.

Thanks again for all the input and advice - I learn more than I thought possible every day, and most of it is due to kind individuals like yourselves.

Hey Wes?

It probably is software, so I suggest that be your priority, but if you have someone else with spare cycles - get them to check the power/heat situation as well.

My thinking is that you configured these systems relatively the same and it worked at one point (or is still working on another system now) so it may be environmental with that particular machine.

External influence (is the call volume, types of calls or basically input different?)

Environment - heating, cooling, venting, proximity, magnetism.

Both of these items will not be detectable on the system or in the code, unless you can replay call traffic (simulate the day) or find it in logs.

The heat thing will continue after an upgrade of software. For that test get a cheap thermometer and record the temperature, vs other machines, if it is higher, just put some fans on it - cheap way to rule it out. If the temp drops and it still happens - rule it out.
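
To make the thermometer comparison less of a guessing game, the readings can be written down in a simple log and summarised per machine. This is just a sketch: the helper names, the temps.csv filename and the host labels are all made up for illustration.

```shell
#!/bin/sh
# log_temp: append one manual thermometer reading to a CSV log.
#   $1 = host label, $2 = reading in degrees, $3 = log file (default temps.csv)
log_temp() {
    printf '%s,%s,%s\n' "$(date '+%Y-%m-%d %H:%M')" "$1" "$2" >> "${3:-temps.csv}"
}

# max_temp_by_host: print the highest reading seen for each host,
# so the warmest machines stand out.
max_temp_by_host() {
    awk -F, '{ if (!($2 in max) || $3 + 0 > max[$2]) max[$2] = $3 + 0 }
             END { for (h in max) print h, max[h] }' "${1:-temps.csv}" | sort
}
```

Log a reading for each box a few times a day (for example, log_temp pbx1 85), then compare the max_temp_by_host output against the dates of the lockups.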

B.

B,

Thanks for the input - we do actually have a bit of a heat load problem in our data center, but the boxes affected by this are in the middle of the cluster - they’re the boxes running the warmest.

I will definitely check into that further though, because it makes complete sense, and I hadn’t even thought of it.

I am also going to check the power feeds to ensure we’re not overloading a circuit, as that could affect things as well…

Again, thanks for the advice - I would not have thought of that!

W