Flaky hangup detection on ZAP possible cause of instability

How can I get asterisk to run for 60 days with no intervention on my part?

  • Keep troubleshooting and upgrading.
  • Hire someone to take a look at the problem.
  • Learn the code and debug it for real. It’s the only way.
  • Make and take fewer calls.


I have several Asterisk servers, but one of the locations has never quite worked right. We have been through three motherboards, two TDM400P cards, a Sangoma A200, countless config tweaks, and just about every version of Asterisk from 1.1 to 1.2 to 1.4, but nothing seems to help.

Over the years it has gotten more reliable (I no longer see catastrophic deadlocks, and only rare crashes), but lately it has plateaued and maybe even started to get worse.

The main issue from my perspective is that inbound calls sometimes stay open indefinitely, long after both parties have disconnected. Given enough time, I will run show channels and see that all three of my FXO are tied up.

Calls will then start coming in over VoIP at 2 cents per minute. I have to either restart Asterisk or soft hangup the channels to free them. Even restarting Asterisk daily doesn’t guarantee me anything.
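The "look for tied-up channels" check described above could be scripted. This is only a minimal sketch: it assumes the `!`-separated output of `core show channels concise`, with the channel name in the first field and the call duration in seconds in the eleventh. That layout is an assumption and should be checked against the Asterisk version in use. The sample text is made up for illustration, not taken from these servers.

```python
# Sketch: flag channels that have been up longer than a threshold.
# ASSUMPTION: input is '!'-separated "concise" channel output where
# field 0 is the channel name and field 10 is the duration in seconds.

def find_stuck_channels(concise_output, max_seconds=4 * 3600):
    """Return (channel, seconds) pairs for channels up longer than max_seconds."""
    stuck = []
    for line in concise_output.splitlines():
        fields = line.strip().split("!")
        if len(fields) < 11:
            continue  # not a channel line (blank line, banner, etc.)
        try:
            seconds = int(fields[10])
        except ValueError:
            continue  # duration field missing or not numeric
        if seconds > max_seconds:
            stuck.append((fields[0], seconds))
    return stuck


# Hypothetical sample, not real output from the servers in this thread.
SAMPLE = (
    "Zap/5-1!inbound!s!3!Up!Dial!SIP/carmen!5551234!!3!86400!SIP/carmen-0001\n"
    "SIP/aron-0002!default!100!1!Up!Dial!Zap/1!101!!3!42!Zap/1-1\n"
)

if __name__ == "__main__":
    for name, secs in find_stuck_channels(SAMPLE):
        print(f"{name} has been up for {secs} s")
```

The four-hour default is arbitrary. Pick a threshold longer than any legitimate call at the site.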

When this happens the system is much more likely to go into a state my users affectionately call ‘haywire’.

When the system has gone haywire, many intermittent problems will rear their ugly heads.

  1. Phantom calls: Calls come in but are disconnected the moment we answer at one of our SIP phones.

  2. Locked SIP channels: I often see that a call between a SIP phone and one of the parking extensions is in an indefinitely active state.

  3. Random disconnects when transferring: Calls will randomly be dropped when parking or transferring a call.

I was planning on getting a butt set to monitor the incoming lines, but some of the software that came with my Sangoma card already shows me that the telco is changing the line voltages as it should.

The only difference between this location and another location which works much more reliably (besides the actual building) is that the problematic server uses AT&T as its CLEC, whereas the well-behaved server uses Verizon.

(*Note that this “sister-server” is only relatively well-behaved. Occasionally I will see 20 active sip channels between two of my Snom 360 phones, and no one is actually on the line. This doesn’t seem to bother the server, and I just soft-hangup or restart it when this is an issue, but it’s similar enough to this problem that I thought it worth mentioning.)

The server is running Gentoo Linux:

Linux claudia 2.6.18-gentoo-r6 #6 SMP Wed May 9 22:02:20 EDT 2007 x86_64 Dual Core AMD Opteron(tm) Processor 165 AuthenticAMD GNU/Linux

The motherboard is a Tyan S265 series socket 939 with nforce4 chipset.

I’m using a Sangoma A200 with 4 FXO and 4 FXS, but when I used a TDM400P (and * 1.2.X) everything was less reliable in general, with deadlocks on top.

zapata.conf:

;autogenerated by /usr/local/sbin/config-zaptel  do not hand edit
;Zaptel Channels Configurations (zapata.conf)
;
;For detailed zapata options, view /etc/asterisk/zapata.conf.orig

[trunkgroups]

[channels]
context=default
usecallerid=yes
hidecallerid=no
callwaiting=yes
usecallingpres=yes
callwaitingcallerid=yes
threewaycalling=yes
transfer=yes
canpark=yes
cancallforward=yes
callreturn=yes
echocancel=yes
echocancelwhenbridged=no
echotraining=256
group=1
callgroup=1
pickupgroup=1

immediate=no

;Sangoma A200 [slot:9 bus:1 span:1]
context=outbound
group=0
signalling = fxo_ks
channel => 1-4

rxgain=-6
txgain=-6

context=inbound
hanguponpolarityswitch
busydetect=yes
busycount=10
group=1
signalling = fxs_ks
channel => 5-8
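One thing that stands out in the listing above is the bare hanguponpolarityswitch line. As far as I know, the Asterisk config parser expects every option in key=value form and warns about (and skips) lines with no equal sign, so that option may never take effect. That is an assumption to verify against the startup log, not a confirmed diagnosis. A minimal sketch of a check for such lines, using an excerpt of the config above as input:

```python
# Sketch: list option lines in an Asterisk-style config that have no '='.
# ASSUMPTION: such lines are ignored by the config parser; verify in the logs.

def lines_without_equals(config_text):
    """Return (line_number, text) for option lines that contain no '='."""
    suspects = []
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.split(";", 1)[0].strip()  # drop ';' comments
        if not line:
            continue
        if line.startswith("#"):
            continue  # directives such as #include
        if line.startswith("[") and line.endswith("]"):
            continue  # section header such as [channels]
        if "=" not in line:
            suspects.append((number, line))
    return suspects


# Excerpt copied from the zapata.conf posted above.
EXCERPT = """\
context=inbound
hanguponpolarityswitch
busydetect=yes
busycount=10
group=1
signalling = fxs_ks
channel => 5-8
"""

if __name__ == "__main__":
    for number, line in lines_without_equals(EXCERPT):
        print(f"line {number}: '{line}' has no '='")
```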

Here is my dialplan for incoming calls:

[open]
exten => s,1,Answer
exten => s,n,wait(1)
exten => s,n,Dial(zap/1&zap/2&sip/carmen&sip/aron&sip/maria&sip/register,25)
exten => s,n,voicemail(b0)
exten => s,n,hangup

If I call my FXO line and hang up within 5 seconds, Asterisk will miss the hangup about 15% of the time, and when someone answers they will get a dial tone directly on an FXO. I’m not sure if this is in any way related to the other problems we’ve been having. My understanding is that all PBXes have some problem with that situation.

The only thing I see in my log that worries me is an occasional:

[May 29 19:58:36] WARNING[20286] chan_sip.c: Remote host can't match request BYE to call 'blah@blah'. Giving up.

but like every other problem it seems to be totally random.

I’m almost inclined to believe it is something with the location, but it’s so intermittent and I am never there. The closest I’ve come to experiencing it first hand was when I called in once and the call dropped at about the moment I expected them to answer on their end.

I’m not expecting someone to have a magic bullet, but I’m hoping someone out there has some troubleshooting ideas for me.

If I could have Asterisk run for 30 days without any intervention on my part, that would be a huge first step.

On the other hand, perhaps it is time to throw in the towel. If it were crashing I would at least have a dump for someone else to work with, but I don’t even have that. My history with Asterisk has been one of intermittent predictability, so why should anything ever change?

I see no reason to start throwing more money at the problem and swapping more parts, and how would I post about this in the ‘Job Opportunities’ forum? “Bounty: Make my asterisk run for 60 days.” (If I’m paying for it, I expect at least a two month solution).

I’m not a software engineer, and I’m starting to realize features are a lot less important to me and my users than stability. Perhaps Asterisk just isn’t for me, and requires someone a little more talented to administrate and troubleshoot.

If this is the case, will someone please tell me :). I won’t be offended. Perhaps there is an alternate software solution someone could recommend for us mere mortals who only know enough C to not quit our day jobs.

I use Sangoma now because, in the beginning, when my TDM400P cards were unreliable (to a greater extent than the A200 is now), the Sangoma sales rep promised me superior technical support.

My extended troubleshooting conversation with the Sangoma tech support team ended like so:

I’m thinking perhaps I’ll try to put in a TDM400P again, and see if Digium support will /really/ troubleshoot the issue. Sangoma wasn’t at all helpful, but they at least went through the motions of logging in and checking out the situation.

The bottom line is that every version of Asterisk I’ve used since 1.1.X has had major reliability issues with chan_zap, though they no longer result in or coincide with deadlocks (check the other unanswered posts by me over the years). I don’t even know of a way of compiling information for a proper bug report when Asterisk doesn’t dump a core. Did I miss a wiki article?


I’m not looking for a magic bullet at this point, but can anyone help me formulate a strategy to isolate the problem? Anything that gets me to a useful bug report would be a great success. All I need to do is wait for a bad hangup.

I’m sure it’s going to involve gdb and some tracing, but I’m very much a beginner when it comes to C programming and debugging, and Asterisk is out of my league.
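For the case where nothing crashes and there is no core to hand over, one commonly suggested starting point is a backtrace of every thread in the still-running process, captured while a channel is stuck. Below is a minimal sketch that only builds the gdb command line. The `-p`, `-batch`, and `-ex` options and the `thread apply all bt` command are standard gdb. The PID is a placeholder, and how useful the output is depends on Asterisk having been built with debugging symbols, which is assumed here.

```python
import shlex


def gdb_backtrace_args(pid):
    """Build a gdb command line that prints a backtrace for every thread."""
    pid = int(pid)
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        "gdb",
        "-p", str(pid),                # attach to the running process
        "-batch",                      # run the commands below, then exit
        "-ex", "thread apply all bt",  # backtrace of every thread
    ]


if __name__ == "__main__":
    # 12345 is a placeholder PID, not taken from the thread.
    print(shlex.join(gdb_backtrace_args(12345)))
```

Attaching pauses the process while gdb collects the traces, so calls in progress may glitch briefly. Redirect the output to a file and attach it to the bug report along with the stuck channel names.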

I can’t believe that no one is at least echoing my sentiments. I’ve used Asterisk for years, and it’s never been a question of “if”, but “when” Asterisk would start acting up between its scheduled restarts.

P.S. Since my first post, and because Sangoma was troubleshooting with me, I stopped restarting Asterisk daily to allow the issue to happen. I caught it several times in that short span, and already have 2.2 GB of MP3 in my main VM dir.

Hi

One simple check is to confirm that the line provides calling party clear signals, what they are, and, if it uses disconnect signalling, what the disconnect value is. This is the most common cause of CO lines not disconnecting, and has been for as long as analog lines have existed. I’m sure the value the TDM card looks for is in the source, but I haven’t the inclination to look. (Digium guys, what is it?)

Ian

ianplain:

Thanks for taking the time to respond.

ianplain wrote: “One simple check is to check that the line has Calling party clear signals and what these are and what the value of disconnect is if its disconnect signalling.”

I’m not familiar with the term “Calling party clear signals”. I tried searching Google for it, and variations of it, but had no luck.

Are we talking about the voltage modulation the CO uses to signal a hangup has taken place?

If so, my Sangoma A200 has some sort of built-in voltmeter, and I was able to demonstrate the voltage fluctuating as it should:

claudia ~ # wanpipemon -i w1g1 -c astats -m 5

        ------- Voltage Status  (FXO,port 4) -------

VOLTAGE : 50 Volts


claudia ~ # wanpipemon -i w1g1 -c astats -m 5

        ------- Voltage Status  (FXO,port 4) -------

VOLTAGE : 7 Volts


claudia ~ # wanpipemon -i w1g1 -c astats -m 5

        ------- Voltage Status  (FXO,port 4) -------

VOLTAGE : 1 Volts


claudia ~ # wanpipemon -i w1g1 -c astats -m 5

        ------- Voltage Status  (FXO,port 4) -------

VOLTAGE : 1 Volts


claudia ~ # wanpipemon -i w1g1 -c astats -m 5

        ------- Voltage Status  (FXO,port 4) -------

VOLTAGE : 50 Volts

Is this the signaling you were referring to?
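To make the readings above easier to interpret, here is a minimal sketch that pulls the voltage out of each sample and labels it. The thresholds are assumptions for illustration only (roughly 48 V when idle, single digits while a call is up, near zero while the CO briefly removes battery to signal a disconnect). They are not values taken from Sangoma, Digium, or the telco.

```python
import re

# Matches the "VOLTAGE : 50 Volts" lines printed by the astats command above.
VOLTAGE_RE = re.compile(r"VOLTAGE\s*:\s*(-?\d+)\s*Volts")


def classify(volts, on_hook_min=40, open_loop_max=3):
    """Label one reading. Both thresholds are illustrative assumptions."""
    v = abs(volts)  # polarity may be reversed; only the magnitude is used
    if v >= on_hook_min:
        return "on-hook (idle battery)"
    if v <= open_loop_max:
        return "battery removed (possible disconnect signal)"
    return "off-hook (call in progress)"


def label_readings(text):
    """Return (volts, label) for every voltage line found in text."""
    return [(int(m), classify(int(m))) for m in VOLTAGE_RE.findall(text)]


# The five readings posted above, in order.
SAMPLE = """\
VOLTAGE : 50 Volts
VOLTAGE : 7 Volts
VOLTAGE : 1 Volts
VOLTAGE : 1 Volts
VOLTAGE : 50 Volts
"""

if __name__ == "__main__":
    for volts, label in label_readings(SAMPLE):
        print(f"{volts:>3} V  {label}")
```

If the near-zero readings really are the disconnect signal, what matters next is how long they last, since a card can miss a drop that is shorter than the duration it is configured to expect.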

Update

To all,
I’ve since updated to the latest Asterisk/Zaptel/libpri, and whoa nelly!

After the first 24 hours, even my “well-behaved server” had BOTH its FXO channels in a perma-offhook state.

That’s two servers, two locations, 5 total FXO, same result.

My poorly-behaved server has some hung sip<->parking channels already (usually this one hangs sip<->zap), and there are some new manifestations this time:

Eventually most if not all commands (like “show channels” and “soft hangup blah”) will fail to output any information or do anything. Subsequent commands will totally lock the console the first time I hit tab complete, and I must ^C to get out.

I’m noticing that it isn’t necessarily a hung FXO that will cause this behavior. Currently there is a typical hung call between a parking spot and a Snom phone, and the same symptoms are showing up. This remains true even after I close the console and reconnect.

I haven’t seen this in the absence of a “stuck” channel of some sort.

I’m noticing a pattern: both my Asterisk servers have issues detecting hangup across chan_sip, chan_zap, and chan_local. The only difference is which channels tend to misbehave more. If I’m lucky it’s only sip/local, if I’m unlucky it’s zap, and if I’m really unlucky it’s both.

So in summation, the newest version of asterisk has made my general hangup detection problems more consistent, and added a new wrinkle in the form of an unresponsive console. I can no longer simply soft-hangup to resolve the issue when channels get stuck unpredictably, and on a daily basis.
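Since a wedged console is itself a symptom, it can be detected from outside by running a CLI command with a time limit. The sketch below assumes the standard `asterisk -rx "<command>"` way of running a single CLI command. The `run_cli` helper and its `prefix` parameter are hypothetical names introduced here for illustration.

```python
import subprocess


def run_cli(command, timeout=10, prefix=("asterisk", "-rx")):
    """Run one CLI command; return its output, or None if it does not finish in time.

    `prefix` is the program and flags placed before the command. It is a
    parameter only so the helper can be tried without Asterisk installed.
    """
    try:
        result = subprocess.run(
            [*prefix, command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # the console never answered: treat as wedged
    return result.stdout


# Example use from cron or a monitoring script:
#   if run_cli("show channels") is None:
#       ...log a timestamp, capture debugging info, alert someone...
```

A timeout here does not say why the console is stuck, but a timestamped record of when it first stopped answering narrows down which call triggered it.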

If there is any bright side, it seems that the “intermittent-disconnect-when-answering-fxo-from-sip-phones” issue I discussed in my opening post was resolved by removing a faulty disk that had degraded from my array.

Also it seems that even with my FXO all tied up, channels getting stuck, and the console useless, Asterisk continues to chug along, and my users aren’t complaining. See, it’s not all terrible… Knock on wood and go figure!

I’m hoping this will make it easier to file a bug report.

I heard that libpri 1.4.4 has a nasty bug, which could explain the recent aggravation of my issues (or perhaps it’s just coincidence).

I couldn’t remember exactly what packages I had before, so I just downgraded everything one point release. Since then I’ve had almost a week of uptime with no undetected hangups…

I’m now using:
Asterisk 1.4.20.1
Zaptel 1.4.9.2
Libpri 1.4.3
and I didn’t install asterisk-addons, which I wasn’t using anyway.

Of course, it is impossible to know what really ‘fixed it’ (if it is fixed), and if I was superstitious I would be knocking on wood right now, but I’m pretty happy. I could probably go back to restarting asterisk nightly and get weeks upon weeks of seemingly flawless execution.

It’s a good thing too, because my latest theory revolved around poor AMD64 support, and I was about to reinstall everything.

After a few more days, the ‘well-behaved’ server had its usual half dozen imaginary conversations going on between two inactive SIP phones, and the ‘poorly-behaved’ server has started locking FXO every few days.

Here’s what it looks like on the well-behaved server…

… and the poorly-behaved server does something similar, though the culprit can be an inbound or outbound call.

So the problems I’ve always had continue. As I mentioned, both of these systems are AMD64, and I’m starting to think that was a mistake, so out of desperation and exasperation I’ll be changing the poorly-behaved server to a fresh x86 install.