Dell PowerEdge & Dahdi => CPU Context Error

Hi everybody,

We currently have a big problem with our two Dell PowerEdge 1950 servers. Each of them has two Digium TE410P cards.

With Asterisk 1.4 and Zaptel, everything was OK, but since we moved to Asterisk 1.6 and Dahdi, the servers crash randomly and a power cycle is necessary to get a usable server back.

When a server crashes, here is what is shown on the console:

CPU 1: Machine Check Exception: 0000000000000004
CPU 3: Machine Check Exception: 0000000000000004
Uhhu. NMI received for unknown reason 31 on CPU0.
CPU 2: Machine Check Exception: 0000000000000005
CPU 3: Bank 0: 3200000410000800
Do you have a strange power saving mode enabled ?
CPU 2: Bank 5: 3200001080200e0f
CPU 3: Bank 5: 3200000040100e0f
Dazed and confused, but trying to continue
Kernel panic - not syncing: CPU context corrupt
Kernel panic - not syncing: CPU contect corrupt
CPU 1: Bank 0: 3200000410000800
CPU 1: Bank 5: 3200000044100e0f
Kernel Panic - not syncing: CPU context corrupt

We tried updating the BIOS, but that did not solve the problem. Thinking that the cause was an IRQ conflict between the two Digium cards, we removed one, but the problem was still there.

Hope you can help me.

Our configuration : PowerEdge 1950, Debian Lenny, Asterisk 1.6.2.18, Dahdi 2.4.1.2, LibPri 1.4.11.5.

Thanks in advance,

Regards,

Paul

Hardware problem.

The problem could be that the two cards share the same IRQ. Remove one and try again; it may work.

Check these links, they may be useful:

http://stackoverflow.com/questions/628920/problems-with-irqs-when-connecting-two-digium-card-in-and-asterisk-box

http://serverfault.com/questions/70585/manually-assign-a-pci-card-to-an-interrupt

Also check the user guide provided by Digium for multiple-card configurations:

http://www.digium.com/en/products/digital/te410p.php#documentation

I hope this helps.

Regards
Ibrahim

Thanks for your reply, Ibrahim.

As I said in my message, we tried removing one of the cards, and the problem appeared again.

Here are the interrupts of one of the servers:

cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:         91         42         37         38   IO-APIC-edge      timer
  1:          1          0          1          0   IO-APIC-edge      i8042
  8:         10         11         12          9   IO-APIC-edge      rtc0
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          1          1          0          2   IO-APIC-edge      i8042
 14:         18         18         21         18   IO-APIC-edge      ide0
 20:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2, uhci_hcd:usb4
 21:          6          6          6          7   IO-APIC-fasteoi   uhci_hcd:usb1, uhci_hcd:usb3, ehci_hcd:usb5
 32:   11003586   10693197   11000377   10694358   IO-APIC-fasteoi   wct4xxp
 64:   10694124   11003019   10686814   11003034   IO-APIC-fasteoi   wct4xxp
213:    1594958    1596576    1605601    1595401   PCI-MSI-edge      eth0
214:      27488      27347      27344      27351   PCI-MSI-edge      ioc0
218:          0          0          0          0   PCI-MSI-edge      aerdrv
219:          0          0          0          0   PCI-MSI-edge      aerdrv
220:          0          0          0          0   PCI-MSI-edge      aerdrv
221:          0          0          0          0   PCI-MSI-edge      aerdrv
222:          0          0          0          0   PCI-MSI-edge      aerdrv
223:          0          0          0          0   PCI-MSI-edge      aerdrv
NMI:          0          0          0          0   Non-maskable interrupts
LOC:     288383     245098     231942     278559   Local timer interrupts
RES:     340157     293464     307736     305125   Rescheduling interrupts
CAL:        325        393        354        407   function call interrupts
TLB:      33435      31525      35063      38320   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
SPU:          0          0          0          0   Spurious interrupts
ERR:          0
MIS:          0

Does everything seem to be OK?
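For anyone who wants to check a listing like the one above for shared IRQ lines without reading it by eye, here is a minimal Python sketch. It assumes the column layout shown in this post (one counter per CPU, then the controller type, then a comma-separated device list); newer kernels add extra columns and would need a different split. The function name `shared_irqs` and the trimmed `SAMPLE` text are only for illustration.

```python
def shared_irqs(interrupts_text):
    """Return {irq: [device, ...]} for numbered IRQs claimed by several devices."""
    shared = {}
    lines = interrupts_text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        irq, _, rest = line.partition(":")
        irq = irq.strip()
        if not irq.isdigit():
            continue  # skip NMI, LOC, ERR and the other named counters
        # ncpus counters, one controller column, then the device list
        fields = rest.split(None, ncpus + 1)
        if len(fields) < ncpus + 2:
            continue
        devices = [d.strip() for d in fields[ncpus + 1].split(",")]
        if len(devices) > 1:
            shared[int(irq)] = devices
    return shared


# A few lines copied from the listing in this post.
SAMPLE = """
           CPU0       CPU1       CPU2       CPU3
 20:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2, uhci_hcd:usb4
 21:          6          6          6          7   IO-APIC-fasteoi   uhci_hcd:usb1, uhci_hcd:usb3, ehci_hcd:usb5
 32:   11003586   10693197   11000377   10694358   IO-APIC-fasteoi   wct4xxp
 64:   10694124   11003019   10686814   11003034   IO-APIC-fasteoi   wct4xxp
NMI:          0          0          0          0   Non-maskable interrupts
ERR:          0
"""

if __name__ == "__main__":
    for irq, devices in sorted(shared_irqs(SAMPLE).items()):
        print(irq, devices)
```

On a live box the same function can be fed `open("/proc/interrupts").read()`. In the listing above, only the USB controllers on IRQ 20 and 21 share a line; the two wct4xxp cards sit on separate lines (32 and 64).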

david55, could it really be a hardware problem if it appeared right after the update to Dahdi, and on both servers at the same time?

Thanks,

Paul

Machine checks are almost always hardware related, although maybe the driver is using a feature of the hardware that wasn’t used before.
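To see what the hex values in the crash output say, here is a rough Python sketch that splits them into fields. It assumes the architectural bit layout from the Intel manuals: the "Machine Check Exception" value is taken to be MCG_STATUS (RIPV, EIPV, MCIP in bits 0 to 2) and each "Bank N" value an IA32_MCi_STATUS register (flags in bits 63 to 57, MCA error code in bits 15:0, model-specific code in bits 31:16). The helper names are made up for this example, and it does not try to interpret the error codes themselves.

```python
# Bit positions assumed from the Intel architectural machine-check layout.
MCG_STATUS_BITS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}
MCI_STATUS_BITS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                   59: "MISCV", 58: "ADDRV", 57: "PCC"}


def set_flags(value, names):
    """List the names of the flag bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if (value >> bit) & 1]


def decode_bank(hex_status):
    """Split a bank status value (hex string) into flags and error codes."""
    status = int(hex_status, 16)
    return {
        "flags": set_flags(status, MCI_STATUS_BITS),
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    # Values copied from the console output in the first post.
    for mcg in ("0000000000000004", "0000000000000005"):
        print(mcg, set_flags(int(mcg, 16), MCG_STATUS_BITS))
    for bank in ("3200000410000800", "3200001080200e0f",
                 "3200000040100e0f", "3200000044100e0f"):
        info = decode_bank(bank)
        print(bank, info["flags"], hex(info["mca_error_code"]),
              hex(info["model_specific_code"]))
```

With the values as posted, every bank has UC (uncorrected) and PCC (processor context corrupt) set, which is consistent with the "CPU context corrupt" panic. The VAL bit is not set in any of them, so the leading digits may have been mistyped when the screen was copied by hand; treat the decode as a hint rather than a diagnosis.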

Try disabling CPU Hyper-Threading in your BIOS if it is enabled.

It may not solve the problem, but it is worth a try.
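Before rebooting into the BIOS, a quick way to see whether Hyper-Threading is active at all is to compare the `siblings` and `cpu cores` fields in /proc/cpuinfo. The Python sketch below does that; it assumes an x86 Linux kernel that exposes both fields and only looks at the first processor entry. The function name and the two sample strings are hypothetical, not taken from the servers in this thread.

```python
def hyperthreading_enabled(cpuinfo_text):
    """Return True/False, or None when the needed fields are missing.

    Hyper-Threading is assumed active when a package reports more
    logical siblings than physical cores.
    """
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # keep only the first occurrence, i.e. processor 0
            fields.setdefault(key.strip(), value.strip())
    if "siblings" not in fields or "cpu cores" not in fields:
        return None
    return int(fields["siblings"]) > int(fields["cpu cores"])


# Hypothetical excerpts, for illustration only.
HT_ON = "processor\t: 0\nsiblings\t: 4\ncpu cores\t: 2\n"
HT_OFF = "processor\t: 0\nsiblings\t: 2\ncpu cores\t: 2\n"

if __name__ == "__main__":
    print(hyperthreading_enabled(HT_ON))
    print(hyperthreading_enabled(HT_OFF))
```

On the server itself, pass in `open("/proc/cpuinfo").read()`. If the result is False, there is nothing to disable and the BIOS setting can be skipped.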

Thanks for your replies,

We will try disabling Hyper-Threading.

Regards,

Paul

Hi,

Digium Support:

digium.com/support

Open a case.

Let me know if it works :smiley: