[STUMPED] - calls cutting out/static - updated 7/6

We are experiencing call quality issues at our remote location, namely calls cutting out and breaking up for our agents. The building houses about 60 users, 30 or so of whom are on calls at any one time. The location is connected to our main office via a 10Mbit low-latency fiber trunk, with gigabit switches on either side of the fiber endpoints. The floor at the remote location is all 100Mbit. Each user is running a Dell Optiplex 170L, 2.8GHz or greater, XP SP2, 256MB RAM, and Eyebeam 1.10n for their softphone, with ulaw as the codec.

We are connected to the PSTN through a Sangoma A104D, using E&M Wink signalling.

When the call is breaking up, it’s normally one-sided, and sounds like “popping”, “cutting up” and “breaking up” - it’s not a dropout of ALL the audio, but more of a garbling of the audio.

I have spent the past three days working on this issue, and have opened tickets with both Sangoma and Counterpath. I have been monitoring our bandwidth closely - we’re averaging around 4.5Mbit, so we should have plenty available. I just updated the onboard gigabit drivers to the current version provided by Intel, and that has helped with the larger dropouts, but we’re still getting the choppy voice.

Sangoma statistics aren’t showing anything out of the ordinary - the system is performing as it should.

We have two other servers that are identical in configuration that serve the main office, and they have no sound quality issues whatsoever - the only difference between the server having the issues and the ones that aren’t is the connection to the users: one is a local LAN connection, the other is the WAN.

I just found out about the configuration option ‘jitterbuffers=X’ in zapata.conf. I had not previously used this setting, and changed it from nothing to jitterbuffers=8 this morning. As the system is live, I haven’t had a chance to do a restart of asterisk to load the modifications - has anyone had any experience with the jitterbuffer setting, and has it made a difference?
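
For reference, the option just goes in the [channels] section of zapata.conf - something like this (8 is simply the value I picked, not a recommendation, and I believe each buffer is 20ms, but don’t quote me on that):

[code]
; /etc/asterisk/zapata.conf
[channels]
; number of jitter buffers per Zap channel (each is reportedly 20ms)
jitterbuffers=8
[/code]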

Otherwise, if anyone has any suggestions, questions, comments, or encouragement, I am in dire need of any/all.

UPDATE 7/6 - happening again, see my last post for more.

30 simultaneous users on g.711 is about 1/2 of the 4.5Mbit you are seeing. Are you doing any CoS over that 10Mbit link to prioritize the VoIP packets? And what is the MTU size, given the gigabit connections? It sounds like that link may be your issue, but since you are using softphones, does any of your equipment on the path have the ability to re-classify the traffic to isolate just the VoIP, since it will all be coming from the PCs?
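
Rough math, assuming 20ms packetization and counting the per-packet overhead (your exact numbers may differ a little):

[code]
G.711 payload per 20ms packet:   160 bytes
+ RTP (12) + UDP (8) + IP (20) + Ethernet (~18)  = ~218 bytes per packet
218 bytes * 50 packets/sec * 8 bits              = ~87 kbit/s per direction per call
30 calls * ~87 kbit/s                            = ~2.6 Mbit/s each way over the link
[/code]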

p

Hi

I think the jitterbuffer is enabled by default, might be wrong though. As to the popping, have you done an ethereal capture? Is it happening on all calls/phones?

Also, have you set the TOS bits on all the phones?

It does sound like a capacity issue. What else is using the fibre? Is it just voice, or is it being used for data as well?

I would also check that everything is running full duplex and 100M where needed.
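
On the Linux side, something like this will show what the NIC actually negotiated (eth0 is just an example interface name):

[code]
# check negotiated speed/duplex on the server's NIC
ethtool eth0 | grep -E 'Speed|Duplex'
# a duplex mismatch usually shows up as steadily climbing errors/collisions
ifconfig eth0 | grep -E 'errors|collisions'
[/code]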

hey guys,

thanks for the comments. to answer your questions:

no, we are not currently doing any QOS over the pipe - the problem actually didn’t start till a few weeks ago…before that, the calls were very clean. and, during that time, we had 60 remote desktop sessions going over the same pipe, so the overall bandwidth utilization would have been much HIGHER when the calls were going through fine…we have since moved the remote server to the local location, so that other than VOIP traffic, there is not much else going over the pipe - we DO have a domain controller up here at the remote location that synchs with the master at the main office - i will have to ask our network admin to assist with determining how much overhead that might be adding.

also, i just conferred with all of the users that are using Polycom phones (probably 8 in total) and they have few or no sound issues with those - the problems seem to be isolated to the softphones. the one polycom that has been reported as having issues might actually be a bad handset cord as well…

MTU on the adapter is the default, 1500 I believe…also, the 4.5Mbit utilization we were showing before has dropped to around 2.5 currently…so either something else was running or the call load was higher than i thought - this location tends to average around 30 calls, which is what I quoted above.

I was not familiar with the TOS flag, but after reading up on voip-info, that refers to QOS in general - again, not implemented for the reasons mentioned above. That is on my short list of things that I want to get done, but I think that may fix the problem, not solve it…

I believe we are in the process of turning the pipe up from 10Mbit to 20Mbit, and our provider has mentioned enabling a filter that would help the voice traffic get through, but I keep coming back to the fact that it worked FINE three weeks ago, and since then, the following have been done:

  1. complete rebuild of server, going from full install of fedora to minimal, along with optimization of IRQs and boot options - none of the other servers that were rebuilt have had any issues whatsoever…the only thing that comes to mind is if there was some sort of network shaping daemon that was running on the full install that was not installed with the minimal…any ideas?

  2. migration from digium to sangoma quad-span T1 cards (made a massive improvement in voice quality)

  3. move of main application server from main office to remote location, negating need for ~60 remote desktop sessions over the fiber WAN connection

thanks again for the comments - i really appreciate the help.

one more thing - it’s affecting polycoms as well…

i had set up an extension that called the milliwatt app, so that i would have a steady outbound audio stream that i could listen for cuts on. i just called that test extension from a polycom and almost immediately had a 1/2 second drop in the audio…so i’m leaning heavily towards it being entirely the network.
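
for anyone that wants to do the same thing, the test extension is just a couple of lines in extensions.conf - roughly like this (the extension number and context are made up, use whatever fits your dialplan):

[code]
; extensions.conf - test extension that just plays a constant test tone
; (extension number and context here are made up)
[internal]
exten => 9999,1,Answer()
exten => 9999,2,Milliwatt()
[/code]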

question is - why did this pop up all of a sudden? i can’t imagine that a full (including the GUI and every daemon known to man!) install of fedora would perform BETTER than a stripped down minimal install, optimized for voice…that is completely counter-intuitive to me - but i am not a seasoned linux veteran. we also tried a third party NIC (since the onboard chip conflicted with the digium card, we figured we’d try disabling it) and that made no difference on the minimal install…

i guess the only other thing would be that our provider changed something and didn’t tell us…

if anyone has any thoughts on our contradiction, please let me know.

wes, could be coincidence. some kind of network bursts that were not happening before? anyway, i believe in the basic fedora networking code, ToS is honored automatically (the pfifo code). so, if you have the lowdelay bit set, that might help. i’m not sure what you mean by “fix the problem instead of solve it”. if there is network traffic causing delays of VoIP packets, that’s why ToS is available, no?
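
for what it’s worth, i think the lowdelay bit on the asterisk side is just a one-liner in sip.conf - something like this (it only marks the packets asterisk itself sends; the softphones would need their own setting for the other direction):

[code]
; sip.conf - mark SIP/RTP that asterisk sends out with minimize-delay
; (only affects traffic leaving the server, not what the softphones send)
[general]
tos=lowdelay
[/code]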

Hi

Without TOS/QOS all packets are handled on a first-come, first-served basis. It is standard practice to have TOS enabled; voice has to be the highest priority traffic with the lowest latency on your network. Without setting the TOS bit it’s just one of the herd and will wait patiently in line.
Is the “jitter” one way or both ways?

Ian

the jitter is both ways - it is affecting both the caller and the callee, anyway, which is what i’m basing that statement on.

i know and understand what QOS does and how it works, and i agree that it would probably make a difference in our situation - but, since we did not have it enabled before (none of our core network has any QOS routing enabled), i want to know what is causing the problem now, if possible.

i’m not the type of person that likes the quick fix - i want to understand why this didn’t occur on a much more saturated network, and why it is occurring now…

again, i don’t mean to seem ungrateful, i’m just so frustrated at how everything seems to be exactly the opposite of what i’m expecting…

i’m on my way back to the main office now and will corral our network admin and get him started on setting the switches up for QOS.

thanks.

i understand your frustration. it just seems like it might be some unrelated network activity suddenly biting you, and immunizing yourself to it can’t be a bad thing…

Hi

Enabling QOS/TOS is part of the core design and implementation of a voip network. It seems that the basics were missed and now the problems are starting to surface. Doing the groundwork is not a quick fix.

well, i don’t always get to make those calls, i’m just a worker bee.

and WHY DID IT WORK FINE BEFORE?

not directing that at you, just trying to figure out what the hell is going on. we worked fine without the ‘groundwork’ in place for over six months…

we have no idea, since we don’t know what’s wrong. but again, you may have just been lucky and things worked until some unrelated thing broke.

i know, i know…and i’m sorry to yell, but i’m pretty sure you guys understand what’s going on in my head right now…

in any case, after i got back to the main office and actually talked through everything, i do NOT think it’s the network, and here is why.

i had set up my milliwatt echo test and dialed in from my mobile phone, which would have routed the voice signal through the PSTN, into our mux, into the asterisk box via a zap channel, and right back out…at no point did it even touch the network.

i was getting the exact same dropouts and cuts when calling in via a purely zap channel.

this tells me (and the two other guys that are working with me) that it’s probably something on the server itself…we’re starting by replacing the sangoma card, and will start testing individual components if that fails to fix it…and if the hardware checks out, then i’ll blow away the system and rebuild it from scratch…again…

thanks again, will update when i know more.

ugh, you have my sympathies.

Hi

Well that puts a different slant on it. What hardware is the server? And do you have access to an ISDN test set so you can monitor the channels?

It does sound like a hardware issue. When you used digium hardware, what were the results of zttest?

Coming back to hardware, is it SMP, and what drives does it have?

And one thought - is it running cool? As we head into warmer weather and servers warm up, funny things can happen.

with digium hardware, we NEVER got 100% on zttest, and we had random lockups…moving to the sangoma hardware has made a 100% improvement in the stability of the system, and the sound quality is much better than it was, but we’re still having the issues at hand.
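
for the record, zttest is the little utility that ships in the zaptel source tree - we just ran it straight from there, something like this (the path is wherever your zaptel source lives):

[code]
# run from the zaptel source directory; it reports how close the timing source
# comes to a perfect 8000 samples/sec - closer to 100.000% is better
cd /usr/src/zaptel
./zttest
[/code]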

we dropped the server, put the new card in, and are still getting pops and drops on the echo test…so that is out.

hardware wise, it’s a dell 2850, dual 2.8GHz Xeons, 4GB RAM, 3 x 73GB U320 HDs (RAID 1 mirror with a hotspare) on the onboard RAID controller, dual 1Gb NICs.

we currently have hyperthreading disabled, no ACPI, no framebuffer, and no irqbalance. the sangoma is interrupting on CPU0, everything else is interrupting on CPU1. running fedora core 4, fairly minimal install (httpd and mysql are the only two options installed, for basic CDR reporting - turning either off makes no difference).
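
for anyone curious, with irqbalance off, pinning interrupts by hand looks roughly like this (the IRQ numbers below are placeholders - check /proc/interrupts for the real ones on your box):

[code]
# see which IRQ each device is using and which CPU it fires on
cat /proc/interrupts
# pin the sangoma's IRQ to CPU0 and the NIC's to CPU1 (masks: 0x1=CPU0, 0x2=CPU1)
# IRQ numbers 169/177 are placeholders
echo 1 > /proc/irq/169/smp_affinity
echo 2 > /proc/irq/177/smp_affinity
[/code]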

we ARE recording every call, but even at low call volumes, we are still seeing the issues, and all of our other servers are recording everything, and they don’t have a problem…

one thing happened when we booted back up that is worth mentioning - we have three T1’s in this box, two LD and one local. the second LD span is showing up in zttool and asterisk, but no calls are being routed over it. we have asterisk set to dial outbound calls over the highest numbered trunk, while inbounds come in on the lowest. the same thing happened yesterday - we “lost” the second T1, even though it shows up and isn’t showing any errors or anything…a warm reboot fixed it then, but i’m wondering if possibly we have a bad cable or some other random problem that might be contributing to this issue…

we just redid the A/C in the data center, so it’s actually running cooler than it did all winter - we have a total of 15 tons of cooling capacity, FWIW…but i will note the component temperatures when i open the server back up in an hour (the company is finishing up for the night).

this is beginning to almost be fun, like one of those horrible movies that at first sucks, but gets to be almost enjoyable because it’s so bad…

EDIT:

asterisk 1.2.4, zaptel 1.2.5, no libpri (using robbed-bit T1’s)
wanpipe 2.3.4-beta drivers (recommended by sangoma)
2.6.11-1.1369 kernel
fedora core 4, with mysql and http only install options

i found a thread on the asterisk-dev forums that was discussing noise on the line, and one of the things that this thread mentioned was that monitor has issues sometimes and can cause dropouts - kevin fleming wrote this:

[quote]
Monitor() is likely the source of the problem then; it is known to cause
audio path inconsistencies because it does all the writes to the
filesystem synchronously, and if the filesystem does not respond quickly
enough it will cause the audio path to be disrupted.[/quote]

so, we disabled all call recording, and that has fixed the clicks and pops, but now we’re just having audio dropouts (pauses, where no audio is sent or received) - it seems to be one-sided.
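
for context, this is roughly what a Monitor() call in the dialplan looks like (this is illustrative only - the pattern, format, and path below are made up, not our actual dialplan):

[code]
; extensions.conf - recording each call before dialing out
; (pattern, format and path here are made up for illustration)
exten => _X.,1,Monitor(wav|/var/spool/asterisk/monitor/${UNIQUEID}|m)
exten => _X.,2,Dial(Zap/g1/${EXTEN})
[/code]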

i just did my test call, and was still getting occasional pops, but no dropouts in 15 minutes. so ANOTHER theory shot to sh!t.

i’m going to push for an upgrade to 1.2.7.1 either today or tonight, and just keep plugging away.

if ANYONE has ANY ideas in the meantime please let me know!!!

I know it has been mentioned and I’ve read your responses so I won’t re-hash it. But - as a baseline, enable some form of ToS/CoS so that it does not linger in the back of your head. This should be done regardless of whether your network has the equipment to handle it. Also - if Asterisk isn’t running as root, you will have to either apply a patch that is available (I don’t have a pointer to it but ran across it) or set up iptables to set ToS bits on the outbound VoIP packets (and make sure the switches are looking at ToS bits).
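
Something along these lines in the mangle table would do it (the RTP port range below assumes the stock 10000-20000 from rtp.conf - adjust to whatever yours is set to):

[code]
# mark SIP signalling and RTP leaving the Asterisk box with minimize-delay
# (sport 10000:20000 assumes the default rtp.conf range - adjust as needed)
iptables -t mangle -A OUTPUT -p udp --sport 5060 -j TOS --set-tos Minimize-Delay
iptables -t mangle -A OUTPUT -p udp --sport 10000:20000 -j TOS --set-tos Minimize-Delay
[/code]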

p

QOS is in the pipe, and I agree 100% with you. But I do not think that will solve the problem because the problem presents itself across all channel types.

we bumped up to 1.2.7.1 and that seems to have helped (no issues reported…yet). i recompiled zaptel 1.2.5 (patched by the sangoma installer already) as well…i’m going to be supremely pissed if we had a bad build and it was some minor little thing. i highly doubt it, and would assume at this point that we have faulty hardware somewhere.

i do think QOS would help though, and will try to get the powers that be to implement it across the entire company. thus far, i haven’t had much luck convincing them.

i’m supremely pissed - it was probably a faulty build.

upgrading to 1.2.7.1 made almost all of the problems completely go away. i’m guessing that it wasn’t actually asterisk, but the zaptel drivers that were fubar’d and that a recompile/reinstall was actually what did it…

either way, we’ve had no issues since 10:00am this morning except for 1 dropped call (which probably was outside of our network).

i am hesitant to post this, as murphy’s law will kick in and the sucker will start acting up again…but if that happens, i’m just going to rebuild it.

FWIW, has anyone ever had something like this happen, where a recompile fixes things?