Hi all. We are currently running 9 asterisk systems at remote locations. We are in the process of swapping out our older asterisk systems with new hardware, and are having some issues with our first test system.
We are using a single TE133 card in a CentOS system with an Intel Atom C2750 processor and 4GB of RAM. The drives are two Intel SSDs in RAID0. Hardware is all new.
We are using asterisk 11.9.0 and dahdi 2.9.1.1.
At random times, several times a week, dahdi will crap out and we’ll get the following message dumped into our log file repeatedly:
And dahdi will go down:
[Jun 28 03:03:47] WARNING[2097] sig_pri.c: Span 1: D-channel is down!
This will happen at seemingly random times, even when the system isn’t under load (such as at 3am). Restarting the dahdi service fixes the problem and the phone system is useable again until the next time it happens.
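To check whether the outages really are random, it may help to pull the timestamps of the "D-channel is down" warnings out of the log and count them by hour of day. Below is a minimal Python sketch that assumes the log lines look like the one quoted above. The function names and the `year` parameter are made up for illustration (the log format carries no year, so a placeholder is used), and the month name is parsed assuming an English locale.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines shaped like the warning quoted in the post, e.g.
# [Jun 28 03:03:47] WARNING[2097] sig_pri.c: Span 1: D-channel is down!
DOWN_RE = re.compile(
    r"^\[(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2})\] WARNING\[\d+\] "
    r"sig_pri\.c: Span (?P<span>\d+): D-channel is down!"
)


def d_channel_down_events(lines, year=2000):
    """Yield (timestamp, span) for every D-channel-down warning in `lines`.

    The log timestamp has no year, so `year` is a placeholder (assumption).
    """
    for line in lines:
        m = DOWN_RE.match(line)
        if m:
            ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
            yield ts, int(m.group("span"))


def events_by_hour(lines):
    """Count D-channel-down warnings per hour of day (0-23)."""
    return Counter(ts.hour for ts, _ in d_channel_down_events(lines))
```

Feeding it the lines of the Asterisk log (for example via `open(path)`, with the path depending on your install) gives a per-hour histogram. Since the warning repeats while the span is down, the counts show when outages are in progress, not how many separate outages there were.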
You might want to open a ticket with Digium’s technical support to help troubleshoot this problem.
But, based on what you said here, it sounds like something is happening on your host system which is either preventing the interrupt handler from running in a timely fashion (e.g., is the system going into a low power mode? Is there a framebuffer running, or a slow serial console?) or interrupts are not being routed reliably on this platform.
We were getting hardware under-runs after about 5.5 to 6.5 days, but it too would happen at quiet times. Framebuffers are disabled.
After replacing the PCIe riser card, we thought we had fixed the problem, but now we get hardware under-runs after about 10 days. Our next option is to discard the riser completely, but that involves replacing the case (1U chassis), so before I go ahead and do that, I wanted to know if you resolved the issue.
Update - swapped the system into a new case so I don’t need a riser card, and I got an underrun after less than 4 days - a new record!
I'm beginning to wonder if the card just doesn’t like the motherboard / chipset. I’ve asked Digium support about that and am waiting for a reply. At this rate, I’m going to have a 2nd machine around
Another update - spoke to Digium and there “appears” to be an issue that they are trying to patch.
I also wondered whether this was being caused by power management, so I decided to disable ACPI and APIC by adding this to the kernel configuration line in /boot/grub/menu.lst
A consequence is that the PRI card is no longer on its own IRQ, but is sharing it with a USB hub (not in use) and smbus. However, by monitoring /proc/interrupts, I can see that I’m getting an average of 1,005 interrupts per second (Min: 1,003; Max: 1,010).
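For anyone wanting to repeat the /proc/interrupts measurement, here is a rough Python sketch of one way to do it: sum the per-CPU counters on the row that mentions the card, sample that total at intervals, and turn the differences into a rate. The function names and the `device` string are hypothetical. Check /proc/interrupts on your own box for the label the card's driver actually registers. On a shared IRQ the figure also includes interrupts from the other devices on that line.

```python
import time


def irq_count(interrupts_text, device):
    """Sum the per-CPU counts on the /proc/interrupts row mentioning `device`."""
    lines = interrupts_text.splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        fields = line.split()
        counts = fields[1:1 + ncpus]          # one counter per CPU
        label = " ".join(fields[1 + ncpus:])  # controller + device names
        if device in label and all(c.isdigit() for c in counts):
            return sum(int(c) for c in counts)
    raise LookupError(f"no interrupt row mentions {device!r}")


def sample_rates(device, samples=10, interval=1.0, path="/proc/interrupts"):
    """Return `samples` interrupts-per-second readings for `device`."""
    def read():
        with open(path) as f:
            return irq_count(f.read(), device), time.monotonic()

    rates = []
    prev_count, prev_time = read()
    for _ in range(samples):
        time.sleep(interval)
        count, now = read()
        rates.append((count - prev_count) / (now - prev_time))
        prev_count, prev_time = count, now
    return rates


def rate_stats(rates):
    """Return (min, average, max) of a non-empty list of rates."""
    return min(rates), sum(rates) / len(rates), max(rates)
```

Calling `rate_stats(sample_rates("your-card-label"))` gives min, average and max figures like the ones quoted above. As far as I know DAHDI cards normally interrupt about once per millisecond, so a steady reading near 1,000 per second would suggest interrupts are still being delivered, though that is my assumption and not something confirmed in this thread.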
So far it’s been up for almost 7 days - I await the dreaded “John, the phones are down”!