Audio loss for 1 second every minute

I’m running two servers with identical Asterisk configurations, both version 1.8.9.0 and compiled from source. Both are dual quad-core Xeons with no more than 30 concurrent calls. Both connect to exactly the same providers. Server1 (the one with issues) is running CentOS 5.7, while Server2 (working fine) is running CentOS 5.6. Both servers are on fiber connections at different datacenters. Server1 is located in a city less than 200 miles from me, while Server2 is about 2000 miles away.

The server with problems seems to have a 1 second loss of audio exactly every minute. I can call and get audio loss at 0:57, 1:57, 2:57, and so on. If I hang up about 30 seconds after I get the audio loss and make another call, I’ll get audio loss at approximately 0:27, 1:27, 2:27, and so on. I’ve rebooted the server. I’ve been testing using a 5 minute audio file of a generated tone. It’s in sln format and I have used the same file on both servers.

Anyone have any ideas of where to start? I can’t really narrow it down unless there is some incompatibility with the Asterisk 1.8.9.0 source compiled on CentOS 5.7?

You have any other process on that system that’s attacking your CPU at 60 second intervals?

I’ve checked the crontab and nothing is scheduled.

We only use this server for Asterisk. I don’t see what could cause the server to ‘lock up’ every 60 seconds for 1 second. It’s a dual quad-core Xeon.

Update:
I’ve been watching top and I haven’t seen anything hit the CPU hard other than top itself (0.3) and Asterisk (0.5).

Update2:
I made two calls roughly 15 seconds apart to my test number that plays the test tone. Both calls dropped audio at the exact same time for one second, then did it again 60 seconds later. Does this provide evidence that it is not an Asterisk issue? Could it be a bandwidth problem since I don’t see anything pegging the CPU?
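The timing observations so far are what you would expect if the glitch is locked to the wall clock rather than to any individual call. A minimal sketch of that arithmetic follows. The function name `loss_offsets` is made up for illustration, and the 3 seconds allowed for redialing in the second call is an assumption, not something measured in this thread.

```python
def loss_offsets(call_start, glitch_phase, period=60, count=3):
    """Seconds into a call at which a wall-clock-periodic glitch is heard.

    call_start and glitch_phase are seconds on the same (arbitrary) clock.
    """
    first = (glitch_phase - call_start) % period
    return [first + n * period for n in range(count)]


# First call starts at t=0 and the glitch is heard 57 s in.
first_call = loss_offsets(call_start=0, glitch_phase=57)

# Hang up 30 s after the glitch (t=87), assume ~3 s to redial (t=90).
second_call = loss_offsets(call_start=90, glitch_phase=57)

print(first_call)   # [57, 117, 177]
print(second_call)  # [27, 87, 147]
```

The same calculation covers the two overlapping calls: a call started 15 seconds after the first hears the glitch 42 seconds in, which is the same moment on the wall clock.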

Update3:
I ran top with a refresh rate of 0.1 seconds and made a test call. The top display ‘froze’ at the same time I got audio loss on the phone call. This doesn’t conclusively prove anything, since I’m connecting to the server via SSH and a network issue would also freeze the SSH window at the same moment as the audio loss.
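One way to separate "the network froze" from "the machine froze" is to log stalls on the server itself, so SSH is out of the picture. The following is only a rough sketch: it assumes a Python 3 interpreter is available on the box (CentOS 5 does not ship one by default), and the names `sample` and `find_stalls` are hypothetical helpers, not part of any tool mentioned here.

```python
import time


def sample(duration, interval):
    """Collect (wall_clock, monotonic) pairs, sleeping `interval` between them."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.time(), time.monotonic()))
        time.sleep(interval)
    return samples


def find_stalls(samples, limit):
    """Return (wall_clock, gap) for every pair of samples more than `limit` s apart.

    Gaps are measured on the monotonic clock so an NTP step is not
    mistaken for a stall; the wall clock is kept only as a label.
    """
    stalls = []
    for (_, prev_mono), (wall, mono) in zip(samples, samples[1:]):
        gap = mono - prev_mono
        if gap > limit:
            stalls.append((wall, gap))
    return stalls
```

Running something like `find_stalls(sample(180, 0.01), 0.2)` from the console or from a job that writes to a local file, while a test call is up, should show whether the process itself was held up at the moment of the audio loss. If the loop never stalls while the audio still drops, that points away from the host and toward the network.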

Instead of top, try sar and see if you can find the spikes. It sounds like something is pre-empting your processes.
thegeekstuff.com/2011/03/sar-examples/

Anything in the error logs?

I was thinking about this, and the fact that it occurs at such regular intervals leads me to believe that it’s flaky hardware somewhere. The system will choke on I/O if it is having trouble, say, accessing a disk, or if the network interface is having trouble.

So use sar 1 during a call to see what the system is like at the time of the fault.

sar 1
Linux 2.6.32-5-686 (HFC-6800)   03/30/2012      _i686_  (4 CPU)

11:35:10 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:35:11 AM     all     13.14      0.00      5.11      0.97      0.00     80.78
11:35:12 AM     all     29.18      0.00      3.99      1.25      0.00     65.59
11:35:13 AM     all      6.68      0.00      3.71      2.48      0.00     87.13

I’m guessing you’ll see a spike in iowait there, which may indicate the hardware issue. Then it’s just a matter of using the rest of the tools to narrow down what’s at fault.
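If you capture the `sar 1` output to a file for the length of a test call, picking out the spikes can be automated. Here is a small sketch that assumes the column layout shown above; the function name `iowait_spikes` and the 10% default threshold are arbitrary choices for illustration.

```python
def iowait_spikes(sar_lines, limit=10.0):
    """Return (timestamp, %iowait) for sar CPU rows whose %iowait exceeds `limit`."""
    spikes = []
    iowait_col = None
    cpu_col = None
    for line in sar_lines:
        fields = line.split()
        if not fields or fields[0] == "Average:":
            continue  # blank lines and the summary row have no timestamp
        if "%iowait" in fields and "CPU" in fields:
            # Header row: remember where the columns are.
            iowait_col = fields.index("%iowait")
            cpu_col = fields.index("CPU")
            continue
        if iowait_col is None or len(fields) <= iowait_col:
            continue  # banner line, or a row too short to be CPU data
        try:
            value = float(fields[iowait_col])
        except ValueError:
            continue
        if value > limit:
            # Everything before the CPU column is the timestamp (with AM/PM if present).
            spikes.append((" ".join(fields[:cpu_col]), value))
    return spikes
```

Fed the three sample rows above with `limit=2.0`, it returns only the 11:35:13 AM row (2.48). On the problem server you would be looking for a row that lines up with the second of audio loss.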

I’ve seen cases in the past where System Management Interrupts (SMI) can cause interrupt latencies of about one second at very regular intervals. In one case just updating the kernel seemed to exacerbate the issue, but I never tracked down the specific change in the BIOS/kernel interaction that pushed the impact up to 900 ms (it was on a Dell PowerEdge 2600).

What I normally do is compile cyclictest and run it in a realtime priority class on each CPU. This will tell you if the issue is across all the CPUs or only on a subset of them.
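If building cyclictest is not convenient right away, the per-CPU idea can be roughly approximated as below. This is not cyclictest and it does not run at realtime priority, so treat its numbers as a coarse hint only. It assumes Python 3 on Linux (CPU pinning uses `os.sched_setaffinity`), and `worst_wakeup_delay` and `per_cpu_report` are hypothetical names.

```python
import os
import time


def worst_wakeup_delay(cpu, interval, samples):
    """Pin to one CPU (where supported) and return the largest oversleep in seconds."""
    if cpu is not None and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})
    worst = 0.0
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(interval)
        late = (time.monotonic() - start) - interval
        worst = max(worst, late)
    return worst


def per_cpu_report(interval=0.001, samples=90000):
    """Measure each allowed CPU in turn; the defaults run at least 90 s per CPU."""
    if hasattr(os, "sched_getaffinity"):
        original = os.sched_getaffinity(0)
        cpus = sorted(original)
    else:
        original, cpus = None, [None]  # no pinning available on this platform
    report = {}
    try:
        for cpu in cpus:
            report[cpu] = worst_wakeup_delay(cpu, interval, samples)
    finally:
        if original is not None:
            os.sched_setaffinity(0, original)  # restore the original CPU mask
    return report
```

Each CPU is sampled for longer than the 60 second period so that at least one occurrence should fall inside every window. A worst-case delay near one second on every CPU would be consistent with something like an SMI stalling the whole machine; a delay on only some CPUs suggests looking at what is bound to those cores.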