Spans jumping randomly

I have installed a high-availability system with two Asterisk servers.
Four Ethernet cards:
1. Network.
2. Heartbeat dedicated.
3 and 4. Two FoneBRIDGEs for 8 E1s.

The system works fine until the spans go down. The Asterisk logs show:

[May 23 10:19:52] WARNING[11781] chan_dahdi.c: Detected alarm on channel 94: Yellow Alarm
[May 23 10:19:52] WARNING[11781] chan_dahdi.c: Detected alarm on channel 95: Yellow Alarm
.
. (every channel of span 4)
.
[May 23 10:19:52] NOTICE[11780] chan_dahdi.c: PRI got event: Alarm (4) on D-channel of span 4
.
.
.
[May 23 10:19:52] NOTICE[11781] chan_dahdi.c: Alarm cleared on channel 94
[May 23 10:19:52] NOTICE[11781] chan_dahdi.c: Alarm cleared on channel 95
.
. (again, alarm cleared on every channel)
.
[May 23 10:19:52] NOTICE[11780] chan_dahdi.c: PRI got event: No more alarm (5) on D-channel of span 4

The kernel logs say:

May 23 10:19:52 ambato kernel: TDMoX: New master: DYN/ethmf/eth1/00:50:c2:65:d7:10/3
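
To see whether the timing master keeps changing between the dynamic spans, the "New master" lines can be counted per span. This is only a rough sketch: the function name is made up for illustration, and the pattern assumes the kernel log lines look exactly like the one above.

```python
import re
from collections import Counter

# Assumes the kernel message format shown above:
#   "... kernel: TDMoX: New master: DYN/ethmf/eth1/<mac>/<n>"
MASTER_RE = re.compile(r"kernel: TDMoX: New master: (?P<span>\S+)")


def count_master_changes(lines):
    """Count how many times each dynamic span was elected timing master."""
    counts = Counter()
    for line in lines:
        match = MASTER_RE.search(line)
        if match:
            counts[match.group("span")] += 1
    return counts
```

Feeding it the lines of the kernel log (for example an open file object) gives one counter entry per span address, which makes it easier to compare against the times of the alarms.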

The jumps are random: there is no other warning or error message, they follow no order (it can be span 1, 2, 3, 4, etc.), and they do not repeat a fixed number of times (once, twice, 23 times, etc.). They only occur when we have telephony traffic, though: from 7am to 9pm.
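
To back up the "only during traffic hours" observation with numbers, the D-channel alarms can be tallied per span and per hour. A minimal sketch, assuming the Asterisk log lines have exactly the format pasted above; the function names and the default log path are my own assumptions, so adjust them to the real setup.

```python
import re
from collections import Counter

# Matches only the "Alarm (4)" D-channel notices in the format shown above;
# "No more alarm (5)" and per-channel alarm lines are deliberately ignored.
ALARM_RE = re.compile(
    r"^\[\w{3}\s+\d{1,2} (?P<hour>\d{2}):\d{2}:\d{2}\] "
    r"NOTICE\[\d+\] chan_dahdi\.c: PRI got event: Alarm \(4\) "
    r"on D-channel of span (?P<span>\d+)"
)


def count_span_alarms(lines):
    """Return a Counter keyed by (span, hour) of D-channel alarm events."""
    counts = Counter()
    for line in lines:
        match = ALARM_RE.match(line)
        if match:
            key = (int(match.group("span")), int(match.group("hour")))
            counts[key] += 1
    return counts


def print_report(path="/var/log/asterisk/messages"):
    """Print one line per (span, hour). The default path is an assumption."""
    with open(path, errors="replace") as handle:
        counts = count_span_alarms(handle)
    for (span, hour), total in sorted(counts.items()):
        print(f"span {span} hour {hour:02d}: {total} alarm(s)")
```

Calling `print_report()` on a day's log should show whether the alarms really cluster between 7am and 9pm and whether any one span flaps more than the others.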

The project is not finished: we are expecting many more people to use the system. For now we are using only the first span, and users lose telephony when span 1 jumps. The other spans are already connected and waiting for traffic.

I have disabled the avahi-daemon (no more avahi messages in the kernel logs) and set the IRQs of the Ethernet cards (no more HDLC messages in the Asterisk logs). The only messages I get in the kernel and Asterisk logs are the examples shown above. Heartbeat has been disabled and we are working with only one system and 4 spans. The jumps continue as before…

The telco is going to check the spans, but only at night… Any suggestions? Has anyone had a similar scenario?

Someone told me the cause is a slip in the signaling channel, because the telco sometimes has a delay. When it happens on the first span, Asterisk can only use a few channels until I restart it. The telco sees all the other channels as occupied and sends a congestion message to the caller.

Why does this happen? Any suggestions to solve the problem?

I had written: “OK, we have made a timing analysis and the telco has slips in the signaling, so the case is closed and they are going to solve it. By the way, they have stopped calling Asterisk ‘that little system you have’ (in Spanish it sounds even worse).”

But we still have the problem. The telco says they have solved it. So, if anyone has experience with this or wants more info, let me know and I’ll post it.