Timers in Asterisk, Meetme audio quality problem

Hi All!

Hoping someone can give me some definitive answers regarding timing in Asterisk.

Recently I upgraded our company MeetMe conference bridge to 11.5.0 from 1.6.0.28 on a new 24 core RHEL 6.4 server platform. The end users are all connected to Cisco UCM and connected to this platform using a SIP trunk/G711u/Directmedia.

The bridge was up for over 3 years with no issues using 1.6.0, but since the upgrade I have received a couple of reports about bad audio for isolated participants on a larger conference after around 45 minutes or so. The audio is severely distorted, and the caller has to disconnect and reconnect, after which the problem is resolved for the remainder of the call. Other users are not affected at the same time. No errors logged anywhere that I can find to help explain it.

I’m looking at Asterisk timing along with the other possible root causes. On the new platform, dahdi_test results are not what I expected… most individual tests are 99.997, but occassionally I see tests as low as 99.6. Strangely, the bad timer results seem to come in sets. 99.997 99.997 99.604 99.604 99.997 (etc, etc.) We’ve tried lots of tweaks to RHEL kernel timing sources, etc, with no change in the dahdi_test behavior.

In the process of researching this, I uncovered a couple of things that have changed since the original 1.6.0 build:

–MeetMe no longer requires dahdi for timing. (Still for bridging.) My installation uses the res_timing_timerfd.so interface, which is used instead of dahdi if available.

Question 1: How do I confirm the timerfd interface is working properly? “dahdi_test” results are now meaningless for my purposes, right? If dahdi is not used for timing, is 99.6 still an indicator of a system problem?

–There are reportedly better timing sources available than HPET.

Question 2: RHEL 6.4 (our build anyway) uses tsc for timing, but can also use hpet and acpi. “tsc” is now the preferred method, yes?

Can anyone give me the scoop on timing nowadays?

For those of you that made it to the end of this, I thank you. :smile:

-Brian

Just out of curiosity, did you change your DAHDI version as well or just the version of Asterisk?

Went from DAHDI 2.3.0.1 to 2.7.0, which was latest at the time of build.

Ok, that might explain the variance in the dahdi_test results. Try running dahdi_test with the “-v” flag and look at the Cummulative Accuracy.

# dahdi_test -v -c 5
Opened pseudo dahdi interface, measuring accuracy...

8192 samples in 8191.944 system clock sample intervals (99.999%)
8192 samples in 8191.512 system clock sample intervals (99.994%)
8192 samples in 8191.848 system clock sample intervals (99.998%)
8192 samples in 8191.824 system clock sample intervals (99.998%)
8192 samples in 8191.808 system clock sample intervals (99.998%)
--- Results after 5 passes ---
Best: 99.999% -- Worst: 99.994% -- Average: 99.997400%
Cummulative Accuracy (not per pass): 99.997

If you’re computer is running with a slower system timer, you’ll see more jitter on the clock samples, but the cumulative should still be relatively solid. Another thing you’ll want to make sure of now is that the system is keeping good wall clock. Sometimes that means running ntpd if the local clock isn’t so good.

Ok, at the bottom is what I get on a similar system with DAHDI 2.8.0. I had to wait quite a few samples before the 99.6’s showed up, that is the section I included here. Otherwise, the results are all extremely good.

Current kernel timer is tsc:
#cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

Just so I understand this… nowadays the measure of a good clock is the cumulative timer, and it’s no longer true that individual tests (or the average) necessarily point to a problem?

Really appreciate the help with this!

-Brian

#dahdi_test -v

8192 samples in 8191.801 system clock sample intervals (99.998%)
8192 samples in 8191.776 system clock sample intervals (99.997%)
8192 samples in 8191.728 system clock sample intervals (99.997%)
8192 samples in 8191.879 system clock sample intervals (99.999%)
8192 samples in 8223.624 system clock sample intervals (99.614%)
8192 samples in 8159.799 system clock sample intervals (99.607%)
8192 samples in 8191.824 system clock sample intervals (99.998%)
8192 samples in 8191.680 system clock sample intervals (99.996%)
8192 samples in 8191.808 system clock sample intervals (99.998%)
8192 samples in 8223.752 system clock sample intervals (99.612%)
8192 samples in 8159.735 system clock sample intervals (99.606%)
8192 samples in 8191.848 system clock sample intervals (99.998%)
8192 samples in 8191.760 system clock sample intervals (99.997%)
8192 samples in 8191.752 system clock sample intervals (99.997%)
8192 samples in 8191.784 system clock sample intervals (99.997%)
8192 samples in 8223.688 system clock sample intervals (99.613%)
8192 samples in 8160.928 system clock sample intervals (99.621%)
8192 samples in 8190.704 system clock sample intervals (99.984%)
8192 samples in 8223.704 system clock sample intervals (99.613%)
8192 samples in 8191.808 system clock sample intervals (99.998%)
8192 samples in 8159.808 system clock sample intervals (99.607%)^C
— Results after 127 passes —
Best: 99.999% – Worst: 99.606% – Average: 99.971558%
Cummulative Accuracy (not per pass): 99.997

In my experience yes.

That being said however, there was one discussion on the asterisk-users mailing list back in October where someone had some problems with the core timer in DAHDI on an openvz issue but I do not think we ever determined if updating to a newer kernel resolved the issue for this user:

Dahdi dummy is more accurate than core timer?

In my experience, using more CPU cycles to take the jitter out of conferencing in the kernel should not be necessary since all the other sources of jitter should be much greater unless there is something on the system that prevents the system timer from running for 10+ms, which would be highly unlikely.

One other thing I just realized, DAHDI-Linux 2.3.0.1 had the core timer changes, so probably the difference you’re seeing now in dahdi_test has to do with perhaps a kernel change between your old system and the new one?

If you would like to get a 1ms interrupt to driver your conferencing again as a test you can try loading dahdi_dummy, but it shouldn’t make a difference in most normal environments.

First off, as we troubleshoot the original problem today, it’s looking less and less likely the issue is with the Asterisk timers. The symptoms seem isolated to a certain small group of end points, only after a certain duration on a conference. (~45 min or more.) If the timers were a problem, I would think this would show up more at random, and on a more even distribution of bridges and participants. Based on current data, that’s completely not the case.

So this discussion is becoming more academic. That said, it’s very handy to know if how/when the results of dahdi_test indicate a problem for future events.

Purely as feedback, I wanted to share the below information. Please note that loading dahdi_dummy doesn’t seem to change things a whole lot. Also, it’s kinda odd to me that the cumulative is so much different than the average; probably makes more sense if you know how it’s calculated. When I get some time I’ll take a look at the code to see if I can figure it out.

The example below that shows a bunch of bad samples is how things look when it’s in a “bad” period. When things are “good”, I get long periods of 99.997s. Sure seems like something is running on the system (off and on) that robs cycles somehow…

-Brian

=========

/etc/init.d/dahdi restart

Unloading DAHDI hardware modules: done
Loading DAHDI hardware modules:
dahdi_dummy: [ OK ]

Running dahdi_cfg: [ OK ]

dahdi_test -v

Opened pseudo dahdi interface, measuring accuracy…

8192 samples in 8159.944 system clock sample intervals (99.609%)
8192 samples in 8223.760 system clock sample intervals (99.612%)
8192 samples in 8159.392 system clock sample intervals (99.602%)
8192 samples in 8191.792 system clock sample intervals (99.997%)
8192 samples in 8191.792 system clock sample intervals (99.997%)
8192 samples in 8191.769 system clock sample intervals (99.997%)
8192 samples in 8223.824 system clock sample intervals (99.612%)
8192 samples in 8159.785 system clock sample intervals (99.607%)
8192 samples in 8216.400 system clock sample intervals (99.702%)
8192 samples in 8191.072 system clock sample intervals (99.989%)^C
— Results after 10 passes —
Best: 99.997% – Worst: 99.602% – Average: 99.772413%
Cummulative Accuracy (not per pass): 99.987