We are having random lockups on some of our servers that require the machine to be physically powered off and brought back up. We have 8 servers currently in production, all with the same basic config. Our dialplans are simple - three or four inbound 800 numbers that dump into menus with a few options. Each server has one or two inbound call queues. We are running MySQL and Apache, but no major processes other than Asterisk.
Our system config:
Asterisk 1.0.7 (running zaptel 184.108.40.206 drivers as per Digium’s suggestion)
Dell Poweredge 2850 (Dual Xeon 3.0s, HT enabled)
4 GB RAM
Ultra 320 SCSI RAID 5 Disk array
TE410P Quad T1 cards
Fedora Core 4
runlevel 3, no framebuffer
onboard sound, CD, second NIC, COM port - all disabled in BIOS
We are currently restarting the servers every night, which helps somewhat, but a server can still lock up only an hour or two after a restart. This doesn’t affect every server equally - two of them are particularly prone to it and lock up three or four times a week, while another box hasn’t locked up in three weeks.
I’m fairly inexperienced when it comes to system building in Linux. I’ve made sure that each of the cards is on its own IRQ, and that we are not running any extraneous processes, but even those steps haven’t helped much.
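For anyone who wants to repeat the IRQ check, here is a rough sketch of one way to do it: parse the contents of /proc/interrupts and list any IRQ line that more than one device is attached to. It assumes the older 2.6-era layout (per-CPU counts, one controller column such as IO-APIC-level, then a comma-separated device list). The sample text and the driver names in it (wct4xxp, uhci_hcd) are made up for illustration, not taken from our boxes.

```python
def parse_interrupts(text):
    """Map IRQ number -> list of device names from /proc/interrupts-style text.

    Assumes the older layout: per-CPU counts, one controller-type column,
    then a comma-separated list of devices sharing the line.
    """
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue                      # skip NMI, LOC, ERR and similar rows
        fields = rest.split()
        # drop the per-CPU counts and the controller column
        devices = " ".join(fields[ncpu + 1:])
        table[int(label)] = [d.strip() for d in devices.split(",") if d.strip()]
    return table


def shared_irqs(table):
    """Return only the IRQs that have more than one device attached."""
    return {irq: devs for irq, devs in table.items() if len(devs) > 1}


# Hypothetical sample, not real output from any of the servers.
SAMPLE = """\
           CPU0       CPU1
  0:       1000          0    IO-APIC-edge  timer
 16:        500         20   IO-APIC-level  wct4xxp, uhci_hcd
 17:        300         10   IO-APIC-level  wct4xxp
NMI:          0          0
"""

print(shared_irqs(parse_interrupts(SAMPLE)))
```

On a live box the same functions could be fed the text read from /proc/interrupts instead of SAMPLE. An empty result would mean no IRQ line is shared.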
Another thing that we are seeing on one of the boxes in particular is that all active calls will just drop. This seems to affect one of the queues the most - the CSRs will be on the phone answering queue calls and every call will drop at once, and their phones immediately ring with the next calls in queue. This happens once or twice a week, and this server has been fairly stable otherwise.
I’m about to pull my hair out because nothing I have done has made anything better. If you have any thoughts or suggestions, I’d very much appreciate them.
UPDATE - yesterday we had three of our servers drop randomly. One was running 1.2.4; the other two are running 1.0.7. The 1.2 box has the nmi_watchdog flag added to grub.conf, and has hyperthreading disabled.
The odd thing is that we hadn’t had a single issue with any of our servers locking up for about two weeks, and then out of the blue three of them locked up on the same day. I’m beginning to wonder if we have some external influence affecting these machines.
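A small sketch of how the watchdog setting could be confirmed on a running box, in case it is useful: look for the nmi_watchdog parameter in the kernel command line (/proc/cmdline) and read the NMI row of /proc/interrupts, whose counts should keep climbing when the watchdog is actually active. The function names and the sample strings are illustrative assumptions, not output from our machines.

```python
def nmi_watchdog_setting(cmdline):
    """Return the value of nmi_watchdog= on a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            return token.split("=", 1)[1]
    return None


def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.partition(":")
        if label.strip() == "NMI":
            # keep only the numeric fields; some kernels append a description
            return [int(f) for f in rest.split() if f.isdigit()]
    return []


# Hypothetical inputs for illustration only.
print(nmi_watchdog_setting("ro root=LABEL=/ nmi_watchdog=1 quiet"))
print(nmi_counts("           CPU0       CPU1\nNMI:        120        118\n"))
```

Reading the NMI counts twice a few seconds apart and comparing them would show whether the watchdog is ticking.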
If anyone has ANY suggestions on this, please please please let me know.