New PSTN prov fixes disc. supervision, causes instability?

For years I’ve had ZAP disconnect supervision issues.

It was a tale of two servers, one with disconnect supervision issues on Zap and the other without (though both had disconnect supervision issues on other channels). The only difference was the return address on the bills for the analog lines.

I tried everything after the PSTN people assured me that there would be no difference between Verizon and AT&T in terms of disconnect supervision. Nothing worked.

Restarting Asterisk nightly seemed to help, but guaranteed me nothing.

Finally, after running out of new versions to try, architectures to change, platforms to swap, PSUs to change, Digium hardware to swap for Sangoma equivalents, and distros to try, I switched PSTN provider anyway.

It seems to have worked. Almost two weeks have passed and zero hangup detection issues (on zap). The second week I even allowed Asterisk to sleep peacefully at night without a restart. Still no issues.

It appears I have angered the Gods.

Yesterday the server just shut down. The fans were running, but I could not ssh in for the first time since asterisk 1.2 and the deadlocks (after ‘avoiding the initial’ ones, as the logs used to say). The system logs indicated nothing, nor did Asterisk’s.

Someone told me they had slightly moved the server (it’s not in a rack) so I just chalked it up to that. Perhaps the power cord became disconnected enough to shut down the server, but not enough to trigger a restart on power failure (yeah, right).

Then, today, less than 24 hours after restarting, something happened that I had never seen before… Not in asterisk 1.0X, 1.2X, or all the versions of 1.4X which I’d tried to solve the hangup supervision problem:

[size=150]A call would come in over Zap, ring once on the calling end, and as soon as Asterisk picked up the caller would get a BUSY signal… The phones would continue to ring internally, but the busy signal would persist for the calling party. When answered, the users told me all they got was ‘dead air’.[/size]

No cores were dumped, no errors in the logs besides this cryptic message:

[Dec 10 22:47:37] WARNING[7960] chan_iax2.c: I was supposed to send a LAGRQ with callno 13345, but no such call exists (and I cannot remove lagid, either).

… which I always get since switching to 32-bit Linux. I’ve tried googling it but the results are equally cryptic, and appear to have nothing to do with hangup supervision on Zap before, or Haywire v2.0 now (see below for info on the original ‘haywire’ state).

I wish I were able to give some information for debugging besides the histrionics, but after seeing it for myself my first response was to restart asterisk on the production server and get the phones working ASAP (which, of course, restarting asterisk achieved).

This is worse than ‘Haywire’ v1.0 (my users’ affectionate term for the deadlocks in 1.2), and worse than the disconnect supervision issues with the old PSTN provider. At least in the latter case calls would roll over to a VoIP provider, and it would only cost my users money at a rate of 2 cents per minute until the nightly restart.

This, on the other hand, will cost them clients. They have been through so much, and I have 0 answers for them at this point, nor do I have anything of substance with which to file a bug report.

I don’t know if I need to sacrifice a goat or what, for there is no dump, and really no chance for me to waste time debugging the next time this happens, and I can only assume it will happen again according to Murphy’s law.

So what should I do now? I’m sorry for all the histrionics, but I have nothing else other than the specs on the server:

> core show version
Asterisk 1.4.22 built by root @ claudia on a i686 running Linux on 2008-10-13 05:53:27 UTC

# uname -a

Linux claudia 2.6.25-gentoo-r8 #1 SMP Mon Oct 13 03:18:25 Local time zone must be set--see zic  i686 Dual Core AMD Opteron(tm) Processor 165 AuthenticAMD GNU/Linux

# cat /usr/src/zaptel-
 * version.h
 * Automatically generated

# cat /usr/src/libpri-1.4.7/version.c

> show uptime
System uptime: 8 hours, 2 minutes, 17 seconds

Has anyone ever heard of this?


Yes, it can be a lot of fun at times!

What Zap hardware are you using?

Turn debug on in logger.conf and post the output when it happens again.

messages => notice,warning,error,debug
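
For context, here is a minimal sketch of where that line usually sits, assuming the stock logger.conf layout with a [logfiles] section. The file path in the comment is an assumption, so adjust it to your install. Depending on the version, debug lines may not appear until the logger has been reloaded and the core debug level raised from the Asterisk CLI.

```ini
; logger.conf, typically /etc/asterisk/logger.conf (path is an assumption)
[logfiles]
; Write notice, warning, error and debug messages to the "messages" log file.
messages => notice,warning,error,debug
```

Debug logging is verbose, so it may be worth dropping debug from this line again once the failure has been captured.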