I’ve been trying to wrap my head around this problem for a while now, but first things first. Here’s the story:
We’ve been running Asterisk 1.2 / Zaptel (some SVN revision) on Debian Sarge (i386) with a Digium Wildcard TE410P in E1 mode in an HP ProLiant DL380 G5. That setup worked just fine for about 3 years, but then we decided to add Skype For Asterisk, which apparently only works with Asterisk 1.4 or higher. Why not go all the way and upgrade to 1.6? So I did…
Now I’m running ArchLinux (with a frozen pkg repo) with the following versions built from source:
Linux 18.104.22.168 (stock w/ BKL enabled)
In principle the system works, until it seemingly at random starts causing kernel panics and reboots. Once that has happened, it can no longer complete a dahdi_cfg call without causing a fatal MCE. A simple reboot is not enough to get the card up again; a cold start is the only thing that sometimes works.
The MCE looks as follows:
I ran memtest86+ and the CPU test utility (from the Arch Linux install CD) on the box and went through a bunch of other things, but couldn’t find a significant problem other than that the card shares an IRQ with the graphics card (disabled) and an unused USB bus (can’t be disabled). After the machine stopped booting at all, I tried putting the card into another slot and shuffled the IRQs around in the RBSU to get as few clashes as possible. Now I’m really stuck. The system works for a random amount of time, then starts this behaviour again.
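For anyone who wants to check the IRQ sharing without reading the table by eye, here is a minimal Python sketch that lists every IRQ line with more than one handler. It assumes the usual /proc/interrupts layout (a CPU header row, then one row per IRQ with the handler names comma-separated at the end). The function name and the sample text are made up for illustration and are not output from this box; the device names on a real system will differ.

```python
def shared_irqs(proc_interrupts):
    """Return {irq: [handler names]} for IRQs that have more than one handler."""
    lines = proc_interrupts.strip("\n").splitlines()
    n_cpus = len(lines[0].split())           # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            continue                         # skip NMI/LOC/ERR style summary rows
        fields = rest.split(None, n_cpus)    # per-CPU counts, then the remainder
        if len(fields) <= n_cpus:
            continue                         # no handler registered on this line
        names = [p.strip() for p in fields[n_cpus].split(",")]
        # The first entry still carries the interrupt controller type in front;
        # keep only its last token (assumes handler names contain no spaces).
        names[0] = names[0].split()[-1]
        if len(names) > 1:
            shared[label.strip()] = names
    return shared


# Invented sample text, not taken from the machine in this thread.
SAMPLE = """\
           CPU0       CPU1
  0:        125          0   IO-APIC-edge      timer
 16:      48211      47950   IO-APIC-fasteoi   uhci_hcd:usb1, wct4xxp
 17:       9034       8871   IO-APIC-fasteoi   eth0
NMI:          0          0   Non-maskable interrupts
"""

if __name__ == "__main__":
    for irq, names in shared_irqs(SAMPLE).items():
        print(irq, names)
```

On a live system the input would be the contents of /proc/interrupts, e.g. `shared_irqs(open("/proc/interrupts").read())`.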
Machine check exceptions are hardware problems. Software would have had to put the hardware into a diagnostic mode to provoke one on good hardware. I suspect, if you analyse the machine check code, you will find this is a cache RAM problem, rather than a main memory problem.
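To make "analyse the machine check code" a bit more concrete, here is a rough Python sketch of how the 64-bit status value from a machine check bank can be split into its flag bits and a coarse error class. It assumes the IA32_MCi_STATUS layout described in the machine-check chapter of the Intel SDM (flag bits 63 to 57, MCA error code in bits 15:0). The function names are my own, the example value is invented rather than taken from this thread, and model-specific codes are not covered, so treat the result as a hint and not a diagnosis.

```python
# Flag bits of IA32_MCi_STATUS (layout assumed from the Intel SDM).
FLAGS = {
    63: "VAL",    # register holds valid error information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # the MISC register holds extra information
    58: "ADDRV",  # the ADDR register holds the failing address
    57: "PCC",    # processor context may be corrupt
}

SIMPLE_CODES = {
    0x0: "no error",
    0x1: "unclassified error",
    0x2: "microcode ROM parity error",
    0x3: "external error",
    0x4: "FRC error",
    0x5: "internal parity error",
}


def classify_mca_code(code):
    """Coarse class of the 16-bit MCA error code (bits 15:0 of the status)."""
    code &= 0xEFFF                      # ignore bit 12, the filtering flag
    if code >> 13:
        return "unrecognised code"
    if code & 0x0800:
        return "bus/interconnect error"
    if code & 0x0400:
        return "internal timer or internal unclassified error"
    if code & 0x0100:
        return "cache hierarchy error"
    if code & 0x0080:
        return "memory controller error"
    if code & 0x0010:
        return "TLB error"
    if (code & 0xFFFC) == 0x000C:
        return "generic cache hierarchy error"
    return SIMPLE_CODES.get(code, "unrecognised code")


def decode_status(status):
    """Return (list of set flag names, coarse error class)."""
    flags = [name for bit, name in FLAGS.items() if status & (1 << bit)]
    return flags, classify_mca_code(status & 0xFFFF)


if __name__ == "__main__":
    # Invented example value, purely to show the call.
    print(decode_status(0xB200000000000135))
```

If the class comes out as a cache hierarchy error, that would point at the CPU caches; a bus/interconnect error would make the PCI-X side more interesting.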
Asterisk 1.6 is unsupported, as of April 21st, so there is no incentive to develop for it. Asterisk 1.4 also ceased support then, although it had a longer lifetime. Any new development would have to be for Asterisk 1.8.
Skype support is third-party support and may not be of interest to Digium, because it has no benefit to their business model. It will also be rejected by many open source developers, because Skype is the antithesis of an open protocol.
I can’t imagine the Skype interface module for Asterisk uses dahdi, except possibly indirectly, for timing, and it shouldn’t be doing any ring 0 work itself. I’m not sure why Skype would require direct hardware access.
Again, Asterisk itself isn’t the issue at all. dahdi_cfg is causing the trouble; I haven’t even got as far as starting the asterisk daemon. And I’m using the latest released versions of the dahdi tools and kernel parts that I am aware of.
Anyway, I have the same setup on two machines now. They are identical in model and specs, and the configuration fails in exactly the same way. Both show the symptoms of the same problem. As I don’t have any other machines with old PCI-X slots available, I unfortunately can’t try the cards in any other servers, but it’s intriguing that both machines run perfectly fine with the old Asterisk 1.2 / Zaptel based setup. This is why I have concluded (of course I can be proven wrong) that it’s not a H/W issue, even though MCEs occur.
The cause would be:
Everything after that I judge to be mere resulting problems, not necessarily the cause. mcelog says this, which is rather cryptic:
Thanks for your replies, but as they were not useful in our case, we have rolled the system back to the old setup (Asterisk 1.2 / Zaptel (SVN r4620)) and voilà, it’s running well again with warm starts and everything. Who would’ve thought?
The plan now is to get rid of E1 altogether, since we have already had to spend way too much time on this.