Problems with Dahdi/Zap overnight

For some reason every time we come back in the morning, we cannot make or receive calls on the Zap/Dahdi interface. It has taken me some time to be able to source what the problem might be, and in fact the only way to resolve it is to reset the whole server.

This morning I’ve had a look through the overnight log, which I’ve set as full/verbose/everything under the sun!.. and think I may have found the source of the problem… Can someone make any suggestions as to what might be causing it?

[quote][Nov 25 02:12:50] VERBOSE[21977] logger.c: == Starting post polarity CID detection on channel 1
[Nov 25 02:12:50] VERBOSE[17310] logger.c: – Starting simple switch on ‘Zap/1-1’
[Nov 25 02:12:51] NOTICE[17310] chan_dahdi.c: Got event 4 (Alarm)…
[Nov 25 02:12:51] DEBUG[17310] chan_dahdi.c: Ignoring Polarity switch to IDLE on channel 1, state 9
[Nov 25 02:12:51] DEBUG[17310] chan_dahdi.c: Polarity Reversal event occured - DEBUG 2: channel 1, state 9, pol= 0, aonp= 0, honp= 0, pdelay= 600, tv= -781474917
[Nov 25 02:12:52] NOTICE[17310] chan_dahdi.c: Alarm cleared on channel 1
[Nov 25 02:12:52] DEBUG[17310] chan_dahdi.c: Ignore switch to REVERSED Polarity on channel 1, state 9
[Nov 25 02:12:52] DEBUG[17310] chan_dahdi.c: Ignoring Polarity switch to IDLE on channel 1, state 9
[Nov 25 02:12:52] DEBUG[17310] chan_dahdi.c: Polarity Reversal event occured - DEBUG 2: channel 1, state 9, pol= 0, aonp= 0, honp= 0, pdelay= 600, tv= -781473731
[Nov 25 02:12:53] WARNING[17310] chan_dahdi.c: Detected alarm on channel 1: Red Alarm
[Nov 25 02:12:53] WARNING[17310] chan_dahdi.c: Hangup received waiting for ring. Exiting simple switch
[Nov 25 02:12:53] VERBOSE[17310] logger.c: – Hungup ‘Zap/1-1’
[Nov 25 02:12:54] VERBOSE[21977] logger.c: == Starting post polarity CID detection on channel 1
[Nov 25 02:12:54] VERBOSE[17316] logger.c: – Starting simple switch on ‘Zap/1-1’
[Nov 25 02:12:55] NOTICE[17316] chan_dahdi.c: Got event 5 (No more alarm)…
[Nov 25 02:12:57] WARNING[17316] chan_dahdi.c: CID timed out waiting for ring. Exiting simple switch
[Nov 25 02:12:57] VERBOSE[17316] logger.c: – Hungup ‘Zap/1-1’

[Nov 25 07:24:14] VERBOSE[21977] logger.c: == Starting post polarity CID detection on channel 1
[Nov 25 07:24:14] VERBOSE[2550] logger.c: – Starting simple switch on ‘Zap/1-1’
[Nov 25 07:24:16] NOTICE[2550] chan_dahdi.c: CallerID number: 901234567890, name: EXTERNAL, flags=0
[Nov 25 07:24:18] WARNING[2550] chan_dahdi.c: CID timed out waiting for ring. Exiting simple switch
[Nov 25 07:24:18] VERBOSE[2550] logger.c: – Hungup ‘Zap/1-1’

[Nov 25 07:24:53] VERBOSE[21977] logger.c: == Starting post polarity CID detection on channel 1
[Nov 25 07:24:53] VERBOSE[2756] logger.c: – Starting simple switch on ‘Zap/1-1’
[Nov 25 07:24:53] NOTICE[2756] chan_dahdi.c: Got event 2 (Ring/Answered)…
[Nov 25 07:24:55] WARNING[2756] chan_dahdi.c: CID timed out waiting for ring. Exiting simple switch
[Nov 25 07:24:55] VERBOSE[2756] logger.c: – Hungup ‘Zap/1-1’

[Nov 25 07:27:38] VERBOSE[3767] logger.c: – Executing [9123456@DLPN_Nursery:1] Macro(“Zap/3-1”, “trunkdial-failover-0.3|Zap/g2/9123456|IAX2/123456/441234123456|trunk_1|84435161”) in new stack
[Nov 25 07:27:38] VERBOSE[3767] logger.c: – Executing [s@macro-trunkdial-failover-0.3:1] Set(“Zap/3-1”, “CALLERID(num)=”) in new stack
[Nov 25 07:27:38] DEBUG[3767] app_macro.c: Executed application: Set
[Nov 25 07:27:38] VERBOSE[3767] logger.c: – Executing [s@macro-trunkdial-failover-0.3:2] GotoIf(“Zap/3-1”, “0?1-dial|1”) in new stack
[Nov 25 07:27:38] DEBUG[3767] app_macro.c: Executed application: GotoIf
[Nov 25 07:27:38] VERBOSE[3767] logger.c: – Executing [s@macro-trunkdial-failover-0.3:3] Set(“Zap/3-1”, “CALLERID(all)=01234567890”) in new stack
[Nov 25 07:27:38] DEBUG[3767] app_macro.c: Executed application: Set
[Nov 25 07:27:38] VERBOSE[3767] logger.c: – Executing [s@macro-trunkdial-failover-0.3:4] Goto(“Zap/3-1”, “1-dial|1”) in new stack
[Nov 25 07:27:38] VERBOSE[3767] logger.c: – Goto (macro-trunkdial-failover-0.3,1-dial,1)
[Nov 25 07:27:38] DEBUG[3767] app_macro.c: Executed application: Goto
[Nov 25 07:27:38] VERBOSE[3767] logger.c: – Executing [1-dial@macro-trunkdial-failover-0.3:1] Dial(“Zap/3-1”, “Zap/g2/9123456”) in new stack
[Nov 25 07:27:38] WARNING[3767] app_dial.c: Unable to create channel of type ‘Zap’ (cause 34 - Circuit/channel congestion)
[Nov 25 07:27:38] VERBOSE[3767] logger.c: == Everyone is busy/congested at this time (1:0/1/0)
[Nov 25 07:27:38] DEBUG[3767] app_macro.c: Executed application: Dial
[Nov 25 07:27:38] VERBOSE[3767] logger.c: – Executing [1-dial@macro-trunkdial-failover-0.3:2] GotoIf(“Zap/3-1”, “26 > 0 ?1-CONGESTION|1:1-out|1”) in new stack
[Nov 25 07:27:38] VERBOSE[3767] logger.c: – Goto (macro-trunkdial-failover-0.3,1-CONGESTION,1)
[Nov 25 07:27:38] DEBUG[3767] app_macro.c: Executed application: Gotoif
[Nov 25 07:27:38] VERBOSE[3767] logger.c: – Executing [1-CONGESTION@macro-trunkdial-failover-0.3:1] Dial(“Zap/3-1”, “IAX2/84435161/441452731880”) in new stack
[Nov 25 07:27:38] WARNING[3767] app_dial.c: Unable to create channel of type ‘IAX2’ (cause 20 - Unknown)
[Nov 25 07:27:38] VERBOSE[3767] logger.c: == Everyone is busy/congested at this time (1:0/0/1)
[Nov 25 07:27:38] DEBUG[3767] app_macro.c: Executed application: Dial
[Nov 25 07:27:38] VERBOSE[3767] logger.c: – Executing [1-CONGESTION@macro-trunkdial-failover-0.3:2] Hangup(“Zap/3-1”, “”) in new stack
[Nov 25 07:27:38] VERBOSE[3767] logger.c: == Spawn extension (macro-trunkdial-failover-0.3, 1-CONGESTION, 2) exited non-zero on ‘Zap/3-1’ in macro ‘trunkdial-failover-0.3’
[Nov 25 07:27:38] VERBOSE[3767] logger.c: == Spawn extension (macro-trunkdial-failover-0.3, 1-CONGESTION, 2) exited non-zero on ‘Zap/3-1’
[Nov 25 07:27:38] VERBOSE[3767] logger.c: – Hungup ‘Zap/3-1’
[/quote]

As a bit of explanation… Zap channels 1 & 2 are PSTN lines (fxo), and 3 is a telephone (i.e. fxs).

It’s the events at 2:12 am that concern me, as I think it must be something reseting on the telephone line, causing a failure, but I’m not sure.

Thanks
Nunners

Hi

Sounds like what used to be called a SALT .

basicly a automated line test, where are you based ?

what does zap show channels say in the morning and what does show channels verbose have to say ?

running

service asterisk stop
service zaptel restart
service asterisk start

should clear the issue or even just “service asterisk restart” may well do it

Ian

I’ve since reset the server, so no point doing that now… but I will check next time it happens.

Is there anyway of monitoring say the show channel status, and if it does have an error, doing an automatic reset?

Oh - I’m in the UK… using BT PSTN lines… which is probably the problem!

Hi

Proberly a SALT test then, You need to speak to BT about getting CPC signals enabled on you line.
As to tests best option is to do the restart with a cron at 4am

Ian

CPCs enabled… overnight tests off… bt were quite quick for once!.. we’ll wait and see what happens tonight!

Hi

If you ask the right questions they usually are.

Ian

On a TDM400P, Internally, in chan_dahdi, this happens because the alarm event is generated & handled in the normal / idle channel code but the noalarm (clear) event fires in the middle of the caller ID detection. Caller ID detection is triggered by the polarity event and the code does NOT clear the InAlarm flag.

Anyway, if anyone wants it, I have a patch (unofficial) for chan_dahdi.c to clear the InAlarm if the noalarm event is received there.

It is, as yet, not very well tested (I wrote it half an hour ago) but so far its done the job.

The BT line test can be triggered from bt.com if you have a login. Also, an engineer can start a test too so you are not totally safe if the auto test is turned off.

I also have a patch to make CID more reliable on BT analog lines.

Hi Jedi,

Both patches sounds fantastic, but as my only server is fully operational (i.e. I don’t have a test one) I’d prefer not to try it until it’s tested further.

Are you planning to submit it to the asterisk team for full inclusion?

Many thanks
Nunners

I don’t know about fantastic, they are both pretty simple, only 3 or 4 lines. The alarms patch is new (still working today) but the CID patch has been in use for over 2 years on production & test 1.2 & 1.4 servers.

I intend to raise bug reports & submit the patches but just when I finally got enthused about the idea, the bug reporting system seems to have done a runner!

I have the luxury of a test & a production system and a bunch of users who know that if I break it I can fix it pretty quickly!!

Patches shown below FYI.

chan_dahdi-1.4-cid.diff: -

--- chan_dahdi.c.orig	2006-12-10 03:04:18.000000000 +0000
+++ chan_dahdi.c	2006-12-28 01:12:53.343195508 +0000
@@ -6147,6 +6147,11 @@
 						}
 						if (i & DAHDI_IOMUX_SIGEVENT) {
 							res = dahdi_get_event(p->subs[index].dfd);
+                                                        /* Ignore ring before end of cid 'slot' (955ms = 7640 @ 8K samples/sec) */
+                                                        if (res == DAHDI_EVENT_RINGOFFHOOK && samples < 7640)
+                                                              ast_log(LOG_NOTICE, "Ignoring event %d (%s) due to waiting for CID, samples=%d\n", res, event2str(res), samples);
+                                                        else
+                                                        {
 							ast_log(LOG_NOTICE, "Got event %d (%s)...\n", res, event2str(res));
 
 							if (p->cid_signalling == CID_SIG_V23_JP) {
@@ -6160,6 +6165,7 @@
 								res = 0;
 								break;
 							}
+							}
 						} else if (i & DAHDI_IOMUX_READ) {
 							res = read(p->subs[index].dfd, buf, sizeof(buf));
 							if (res < 0) {

chan_dahdi-1.4-inalarm.diff: -

--- asterisk-1.4.22/channels/chan_dahdi.c.orig	2008-12-10 11:59:36.000000000 +0000
+++ asterisk-1.4.22/channels/chan_dahdi.c	2008-12-29 23:08:45.000000000 +0000
@@ -6311,6 +6311,7 @@
 						}
 						if (i & DAHDI_IOMUX_SIGEVENT) {
 							res = dahdi_get_event(p->subs[index].dfd);
+							if (res == DAHDI_EVENT_NOALARM) p->inalarm = 0;
 							ast_log(LOG_NOTICE, "Got event %d (%s)...\n", res, event2str(res));
 
 							if (p->cid_signalling == CID_SIG_V23_JP) {
@@ -6415,6 +6422,7 @@
 							if (i & DAHDI_IOMUX_SIGEVENT) {
 								res = dahdi_get_event(p->subs[index].dfd);
 								ast_log(LOG_NOTICE, "Got event %d (%s)...\n", res, event2str(res));
+								if (res == DAHDI_EVENT_NOALARM) p->inalarm = 0;
 								res = 0;
 								/* Let us detect distinctive ring */
 		
@@ -6520,6 +6528,7 @@
 					if (i & DAHDI_IOMUX_SIGEVENT) {
 						res = dahdi_get_event(p->subs[index].dfd);
 						ast_log(LOG_NOTICE, "Got event %d (%s)...\n", res, event2str(res));
+						if (res == DAHDI_EVENT_NOALARM) p->inalarm = 0;
 						/* If we get a PR event, they hung up while processing calerid */
 						if ( res == DAHDI_EVENT_POLARITY && p->hanguponpolarityswitch && p->polarity == POLARITY_REV) {
 							ast_log(LOG_DEBUG, "Hanging up due to polarity reversal on channel %d while detecting callerid\n", p->channel);
@@ -6591,6 +6600,7 @@
 						if (i & DAHDI_IOMUX_SIGEVENT) {
 							res = dahdi_get_event(p->subs[index].dfd);
 							ast_log(LOG_NOTICE, "Got event %d (%s)...\n", res, event2str(res));
+							if (res == DAHDI_EVENT_NOALARM) p->inalarm = 0;
 							res = 0;
 							/* Let us detect callerid when the telco uses distinctive ring */

Alarm issue reported to bug tracker http://bugs.digium.com/view.php?id=14163.

CID issue currently comes under another report http://bugs.digium.com/view.php?id=12531

Awesome stuff…