TE410P unexpected system restart

Hi everybody,

I have a Digium TE410P card installed in my Asterisk box. It was working perfectly until a few days ago, when the server started restarting several times under heavy load. By heavy load I mean Asterisk handling around 70 concurrent calls, which is not really that heavy anyway.
I ran a few tests on the server, such as a memory test, to look for any hardware failure, but everything checks out. The logs (/var/log/messages, dmesg, /var/log/asterisk/*) show the system behaving normally right up to the sudden restart, after which it boots normally again.

[quote="/var/log/messages"]Oct 27 11:32:51 localhost ntpd[1763]: Deferring DNS for 1.centos.pool.ntp.org 1
Oct 27 11:33:11 localhost ntpd[1763]: Deferring DNS for 2.centos.pool.ntp.org 1
Oct 27 11:33:31 localhost ntpd[1763]: Deferring DNS for 3.centos.pool.ntp.org 1
Oct 27 11:33:31 localhost ntpd[1763]: 0.0.0.0 c016 06 restart
Oct 27 11:33:31 localhost ntpd[1763]: 0.0.0.0 c012 02 freq_set kernel 0.000 PPM
Oct 27 11:33:31 localhost ntpd[1763]: 0.0.0.0 c011 01 freq_not_set

Oct 27 12:37:34 localhost kernel: imklog 5.8.10, log source = /proc/kmsg started.
Oct 27 12:37:34 localhost rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1355" x-info="http://www.rsyslog.com"] start
Oct 27 12:37:34 localhost kernel: ipmi_si ipmi_si.0: Found new BMC (man_id: 0x000157, prod_id: 0x0029, dev_id: 0x20)
Oct 27 12:37:34 localhost kernel: ipmi_si ipmi_si.0: IPMI kcs interface initialized
Oct 27 12:37:34 localhost kernel: input: Sleep Button as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0E:00/input/input0
Oct 27 12:37:34 localhost kernel: ACPI: Sleep Button [SLPB]
Oct 27 12:37:34 localhost kernel: input: Power Button as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input1
Oct 27 12:37:34 localhost kernel: ACPI: Power Button [PWRB]
Oct 27 12:37:34 localhost kernel: input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input2
Oct 27 12:37:34 localhost kernel: ACPI: Power Button [PWRF]
Oct 27 12:37:34 localhost kernel: Marking TSC unstable due to TSC halts in idle
Oct 27 12:37:34 localhost kernel: [Firmware Bug]: No valid trip found
Oct 27 12:37:34 localhost kernel: ERST: Error Record Serialization Table (ERST) support is initialized.

Trimmed the log because I thought it wasn't useful. I only left the BMC part that relates to what was said in the replies.

[/quote]

I am running Asterisk 13 beta1, but I have also tested beta2, which showed the same behavior. I understand it's not ideal to run a beta version, but I need a feature that was added in Asterisk 13, so I'm stuck with it. However, I'm not sure whether this is an Asterisk bug or something wrong with the card. dahdi_test also shows that everything is as it should be.

Any help would be appreciated.

Your kernel reboot logs contain no useful information.

If you are getting a panic and restart, the chances are that the disk system is unusable at the time the diagnostic is produced, so you need to capture it from the screen. Also, if it is panicking, the problem will not lie in Asterisk; if the Asterisk subsystem is involved, it will lie in dahdi, although it could simply be a general hardware failure.

I'll second what david55 said. There isn't any useful information to indicate the source of the restart.

I see from the kernel logs that there is a BMC attached. Is there any information in the system event log of the BMC that could indicate why the system was restarted?
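
If the operating system is up, the SEL can usually be read in-band with ipmitool (assuming the ipmitool package is installed; this is just a generic sketch):

[code]
# Read the BMC's System Event Log from the running OS. Your log shows ipmi_si is
# already loaded; ipmi_devintf provides the /dev/ipmi0 device ipmitool uses in-band.
modprobe ipmi_devintf
ipmitool sel list     # raw event list
ipmitool sel elist    # extended, decoded event list
[/code]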

[quote="david55"]Your kernel reboot logs contain no useful information.

If you are getting a panic and restart, the chances are that the disk system is unusable at the time the diagnostic is produced, so you need to capture it from the screen. Also, if it is panicking, the problem will not lie in Asterisk; if the Asterisk subsystem is involved, it will lie in dahdi, although it could simply be a general hardware failure.[/quote]

I stared at the monitor for a while hoping to catch something useful on the display, but unfortunately when it happens the system restarts instantly without getting a chance to print anything on screen. So I have nothing in hand to diagnose the issue. I forgot to mention that I'm also using libss7 v2.0, and I'm fairly sure the latest version of DAHDI is installed.
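
(If it matters, this is how the loaded DAHDI version can be double-checked; the paths below assume a stock install where the module exports its version:)

[code]
# Confirm which DAHDI version is actually loaded (not just installed on disk):
cat /sys/module/dahdi/version
modinfo dahdi | grep -i ^version
[/code]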

Any hint?

[quote="sruffell"]I'll second what david55 said. There isn't any useful information to indicate the source of the restart.

I see from the kernel logs that there is a BMC attached. Is there any information in the system event log of the BMC that could indicate why the system was restarted?[/quote]

I have no idea what a BMC is. Can you please give me more information on how I can obtain the system event logs from the BMC?

BMC is new to me (I deal with developer systems, not production ones), but en.wikipedia.org/wiki/Intelligen … _Interface should explain.

You appear to have a non-default option if Linux isn’t allowing you time to read the panic report, but this may help you prevent an automatic restart: cyberciti.biz/tips/reboot-li … panic.html
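
For reference, the setting that article describes comes down to something like this (the values are only examples):

[code]
# Stop the automatic reboot after a kernel panic (0 = wait forever), or set a
# delay long enough to read or photograph the trace:
sysctl -w kernel.panic=0
# To make it persistent across reboots, add this line to /etc/sysctl.conf:
#   kernel.panic = 0
[/code]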

[quote="david55"]BMC is new to me (I deal with developer systems, not production ones), but en.wikipedia.org/wiki/Intelligen … _Interface should explain.

You appear to have a non-default option if Linux isn’t allowing you time to read the panic report, but this may help you prevent an automatic restart: cyberciti.biz/tips/reboot-li … panic.html[/quote]

Thanks to your tips I finally got a look at the kernel panic, but only for a few seconds. I don't know why it stays on screen for only about 5 seconds when I have set the kernel.panic parameter to 300 seconds. From what I could read in that time, I remember a line along the lines of:
"Generic hardware error. PCIe failure", or something like that. The only PCI Express card I can see in there is the SAS-5805 RAID controller. Do you suspect the RAID controller is causing problems? I'm guessing there may be something wrong with the IRQs: under heavy load the TE410P needs many interrupts, so maybe that affects how the SAS controller behaves?

Edit 1:
I had also configured kdump before the crash happened, but the /var/crash directory is empty!
I’ve followed http://linuxsysconfig.com/2013/03/kdump-on-centos-6/ to configure kdump.
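
(For anyone checking kdump the same way, a few generic sanity checks, assuming CentOS 6 defaults:)

[code]
# Check that kdump is actually armed:
grep -o 'crashkernel=[^ ]*' /proc/cmdline   # crash memory must be reserved at boot
cat /sys/kernel/kexec_crash_loaded          # prints 1 when the capture kernel is loaded
service kdump status
# Note: kdump only captures a panic taken by the running kernel; if the BMC or
# firmware resets the box on a fatal PCIe error, the capture kernel never runs
# and /var/crash stays empty.
[/code]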

Edit 2:
Disabling the SAS controller in the BIOS didn't prevent the kernel panic!

In the BIOS, is there anything about reading the "System Event Log"? The baseboard management controller can reset the computer if it detects an NMI or another fatal error, but it will normally record something in its log to help with troubleshooting.

The TE410P generates one interrupt per millisecond regardless of how many calls are active on it, so I doubt your problem has anything to do with interrupts.
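
A quick way to confirm that fixed interrupt rate (the registered name is an assumption; the TE410P driver usually shows up as wct4xxp or t4xxp):

[code]
# Sample the card's interrupt counter one second apart; it should advance by
# roughly 1000 regardless of call volume.
grep -iE 'wct4xxp|t4xxp' /proc/interrupts; sleep 1; grep -iE 'wct4xxp|t4xxp' /proc/interrupts
[/code]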

In your original report you said:

[quote]It was working perfectly until a few days ago, when the server started restarting several times under heavy load.[/quote]

Is there anything else that happened to the server around the time this problem started? Did you update the kernel or the BIOS?

(As an aside, I would not have trimmed the kernel logs; there is other potentially useful information in what was trimmed out. For example, I could see from the part you removed that you have a Gen 4.6 card, which was good to know.)

I don't have physical access to the server for a few hours. I'll check it as soon as I get there and will let you know the results.

Didn’t know that. If it is so, then you must be right.

Well, the only thing I can think of is that I upgraded Asterisk 13 from beta1 to beta2. Initially I thought that was the root cause, so I rolled back to beta1, but the problem continued. I have since upgraded Asterisk 13 to the release version that came out a few days ago. Additionally, the system is gaining new users every day, so I suspect this is related to the number of users on the system rather than to any change I made. I strongly believe that because the server works perfectly for about 20 hours a day and panics only during its peak hours, which are 9:30 AM-12:30 PM and around 6 PM.

[quote="sruffell"]
(As an aside, I would not have trimmed the kernel logs; there is other potentially useful information in what was trimmed out. For example, I could see from the part you removed that you have a Gen 4.6 card, which was good to know.)[/quote]

Sorry about that; I thought it was only making the post unreadable. I'll restore it as soon as I can.

Why are you using a beta version on a production system?

If you use a code block for the log quote, it will get scroll bars.

I needed a feature that was added in Asterisk 13, and it was only available in a beta release until 26 Oct. I installed the LTS release yesterday, but it didn't solve the problem. :frowning:

Didn’t know that. I’ll use it. First, I need to collect a new log report. Thanks for the tip.

I managed to get a screenshot of the problem. Any idea what's happening?
The image is uploaded here: http://i58.tinypic.com/21lmmvl.jpg

Still not much to go on. It looks like the code was running in user space when the fault occurred, so, although the current process is Asterisk, there is no strong evidence that it triggered the fault.

As david55 said, from the backtrace all we can see is that an NMI was generated, but other than that not much.

The system event logs will typically record the source of the error. For example, it says PCIe error, but we still do not know whether it's from a PCIe-to-PCI bridge, from the SAS controller, from the video hardware, etc.

If you want to troubleshoot this on this machine, I would try to get any SEL output.

Also, if you purchased this card recently, you are entitled to support from Digium. The tech support department can help you collect more detailed information about your server (e.g. is the BIOS updated to the latest version, are you on the latest revision of the card, etc.).

But I would really try to get the SEL output, as that might narrow down what you even need to look at. You could also collect a full kernel trace if you set up a serial console to capture the output of the crash on another machine.

[quote="sruffell"]As david55 said, from the backtrace all we can see is that an NMI was generated, but other than that not much.

The system event logs will typically record the source of the error. For example, it says PCIe error, but we still do not know whether it's from a PCIe-to-PCI bridge, from the SAS controller, from the video hardware, etc.

If you want to troubleshoot this on this machine, I would try to get any SEL output.

Also, if you purchased this card recently, you are entitled to support from Digium. The tech support department can help you collect more detailed information about your server (e.g. is the BIOS updated to the latest version, are you on the latest revision of the card, etc.).

But I would really try to get the SEL output, as that might narrow down what you even need to look at. You could also collect a full kernel trace if you set up a serial console to capture the output of the crash on another machine.[/quote]

First of all, I need to thank both of you for sticking with me on this problem. Since this is a production system, I'm losing users and it's adversely affecting my business, so I really appreciate all your kindness.

Unfortunately this card is not under Digium's support plan, so I need to solve this problem without their help.
I have also checked the system event log, but there was only an error of minor severity with a code and no further description, so I assumed it wasn't anything important. I have cleared the log so I can check it again after the next kernel panic, and I'll let you know if anything notable gets logged there.
Regarding the SEL output, are you referring to the System Event Log shown in the BIOS that I mentioned earlier? I'll search the net for how to collect a full kernel log and will do so as soon as I get access to the server. In the meantime, is there anything else I can do?

  1. Set up another server that you can use to replace the one that is having problems while you troubleshoot on the main server.

  2. Make sure the BIOS is updated to the latest version. I've seen problems with errors like this before when the BIOS had misconfigured the host bridge.

  3. Prepare a system you can use to capture the console of the troubled machine over a serial port (a rough sketch follows below). Then you might be able to get the top of the backtrace and a better idea of where the kernel is. Although, with NMIs this often isn't an exact science, because the error is reported by the host bridge and can sometimes arrive after the offending transaction.
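
A rough sketch of that serial console capture (the port, speed, and tooling below are assumptions; adjust to your hardware):

[code]
# On the troubled server, append to the "kernel" line in /boot/grub/grub.conf so
# console output is mirrored to the first serial port (ttyS0 at 115200 is an example):
#   console=tty0 console=ttyS0,115200n8
#
# On a second machine connected with a null-modem cable, log everything received:
screen -L /dev/ttyS0 115200
# The panic backtrace is then preserved in screen's log file even though the
# troubled box restarts immediately.
[/code]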

Other than that…I’m not sure what to tell you…

The big problem you have here is that the fault is being detected by hardware, not by software. The software is only reporting the fact that the hardware has told it that there has been an unrecoverable error.

Sorry for bringing this topic back. I thought sharing what I eventually did might help others diagnose similar problems. You won't believe what the actual cause of this unexpected behavior was! After I disabled echo cancellation by commenting out the echocanceller=… line in system.conf, the problem was gone, and the system has now been up and running for almost 4 months! Any idea what the root of this problem might be?
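
(For reference, the change was just commenting out the echo canceller line in /etc/dahdi/system.conf; the module name and channel range below are placeholders, not my exact values:)

[code]
# /etc/dahdi/system.conf -- illustrative only; mg2 and the 1-31 channel range are
# placeholders for whatever echo canceller and spans were actually configured.
#
#   #echocanceller=mg2,1-31    <-- line commented out to disable the software echo canceller
#
# Reload the DAHDI configuration to apply the change (best done outside peak hours,
# as it reconfigures the spans):
dahdi_cfg -vv
[/code]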