Unexplained RTP marker in MeetMe capture

Hi,

Hoping to get some guidance here. Been trying to track down a very very weird problem for months and today I finally ran across a “smoking gun”. Looking to understand what’s happening a little better.

In our company we have a CUCM 8.6 system and make very heavy use of Asterisk for a MeetMe based bridge (among other things.) Since an upgrade to Asterisk 11.5 last summer, been getting sketchy reports of individual callers to the bridge getting “bad quality” audio, isolated to Cisco 7962G phones. (We have all sorts of phones in our fleet, but only the 7962 has this issue.) It’s relatively rare, and I have not been able to reproduce it, but I have personally experienced it. When it does happen, it only affects one conference participant at a time.

The issue sounds exactly like this–and forgive me for posting a link to Cisco’s forum here :blush: :
supportforums.cisco.com/thread/2180971

Pretty weird. Lasts about 90 seconds and goes away, or you can hold/unhold to restart the RTP and it also goes away.

Today, finally captured the network traffic and I see that exactly when this issue starts, there is a “marked” RTP packet in the stream. I’m also missing a number of packets leading up to the marked packet, so it looks just like this issue:
issues.asterisk.org/jira/i#brow … RISK-17952

I have a capture at the phone and also a tcpdump from the server that shows this, so I know it’s coming from the server itself. The capture glitches a little at that point when I play it back using Wireshark, so it appears the 7962 has some particular problem with this marked packet, and basically loses its mind for a bit. Other phones seem more resilient.
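
For anyone who wants to check their own capture for the same pattern, here is a minimal Python sketch (not taken from Asterisk or Wireshark) that parses the 12-byte fixed RTP header and flags a marker bit or timestamp jump that shows up mid-stream without an SSRC change. The names `build_rtp_header`, `parse_rtp_header` and `find_anomalies` are made up for illustration, and the default of 160 samples per packet assumes 20 ms G.711 frames, which may not match your codec.

```python
import struct
from typing import Iterable, List, NamedTuple, Tuple


class RtpHeader(NamedTuple):
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def build_rtp_header(sequence: int, timestamp: int, ssrc: int,
                     marker: bool = False, payload_type: int = 0) -> bytes:
    """Build a bare 12-byte RTP v2 header (no CSRCs, no extension); test helper."""
    byte0 = 2 << 6  # version 2, no padding, no extension, CSRC count 0
    byte1 = (0x80 if marker else 0) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, sequence & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the fixed RTP header from the start of a UDP payload."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed RTP header")
    byte0, byte1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    if byte0 >> 6 != 2:
        raise ValueError("not RTP version 2")
    return RtpHeader(bool(byte1 & 0x80), byte1 & 0x7F, seq, ts, ssrc)


def find_anomalies(headers: Iterable[RtpHeader],
                   samples_per_packet: int = 160) -> List[Tuple[int, str]]:
    """Return (packet index, description) for suspicious mid-stream packets."""
    events: List[Tuple[int, str]] = []
    prev = None
    for index, hdr in enumerate(headers):
        if prev is not None and hdr.ssrc == prev.ssrc:
            # Modular differences handle 16-bit / 32-bit wraparound.
            seq_step = (hdr.sequence - prev.sequence) & 0xFFFF
            ts_step = (hdr.timestamp - prev.timestamp) & 0xFFFFFFFF
            if seq_step > 1:
                events.append((index, "sequence gap"))
            if ts_step != seq_step * samples_per_packet:
                events.append((index, "timestamp jump"))
            if hdr.marker:
                events.append((index, "marker without SSRC change"))
        prev = hdr
    return events
```

A sequence gap with a matching timestamp step points at packets lost on the wire or in the capture, while a timestamp jump with contiguous sequence numbers suggests the sender itself skipped audio.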

In the Asterisk issue 17952, the behavior is listed as “normal” since Asterisk is not the source of the audio and is not the source of the skew leading to the marking.

But…

In this case the source of the RTP is Asterisk MeetMe on that server, so I can’t figure out why Asterisk would suddenly think it’s a good idea to skip some audio. The server definitely knows that something happened, else it would not mark the packet, right?

And no, I don’t think we have issues with server capacity… this HP blade has 24 cores, RHEL 6.4, 256 GB RAM. Typically runs with a load average of 0.2. I’ve synthetically tested the server all the way up to the 512 call DAHDI limit, and tested ConfBridge to a couple of thousand. Normal load is a small fraction of that.

And yes, I do hope to migrate to ConfBridge some day… the problem is that the WebMeetme based portal we use would need to be totally redone, and that takes time. :smile:

I thought I would reach out for some guidance while I try to dig through the code.

For now, can anyone point me to where this RTP header is built? Dahdi or Asterisk? Is it rebuilt by the bridge, or is it copied from a source packet?

Any thoughts on this issue would be greatly appreciated.

Thank you!

-Brian

CUCM phones are definitely intolerant of timestamps jumping without a corresponding SSRC change. For real source changes, we modified an earlier version of Asterisk to fake an SSRC change whenever the marker bit was set.

David,

Thank you for the reply!

So to make sure I’m clear on this… sounds like my particular case will require that SSRC workaround you describe. Was there a patch you could point me to? Like I said earlier, I’ll still do my own digging, just trying to save some time.

Previously I was using 1.6.0.28 and now I’m on 11.5. I assume I can look in the 1.6.0.28 code for this and potentially develop a patch from that?

-Brian

I think the latest versions now push real SSRC changes through, so I’m not sure that a patch could be easily used. However, it was basically a change in rtp.c to add one to the outgoing SSRC every time that the marker bit was set. I think most of the code was actually to add an option to control this.

As it was a hack, and it might already have been past the end of life on 1.6.1, I’m not sure we ever submitted it.
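
To make the idea concrete, here is a minimal Python sketch of that kind of rewrite. It is not the original patch (which lived in Asterisk's C code), and the class name `SsrcBumper` is purely hypothetical. It just shows the mechanics: keep a running offset, add one to it whenever an outgoing packet has the marker bit set, and write the shifted SSRC into bytes 8 to 11 of the RTP header.

```python
import struct


class SsrcBumper:
    """Illustrative only: shift the outgoing SSRC by one per marked packet."""

    def __init__(self) -> None:
        self.offset = 0  # how many marked packets have been seen so far

    def rewrite(self, packet: bytes) -> bytes:
        if len(packet) < 12:
            raise ValueError("shorter than the 12-byte fixed RTP header")
        # The marker bit is the top bit of the second header byte.
        if packet[1] & 0x80:
            self.offset += 1
        # The SSRC is the 32-bit big-endian field at bytes 8..11.
        (ssrc,) = struct.unpack("!I", packet[8:12])
        new_ssrc = (ssrc + self.offset) & 0xFFFFFFFF
        return packet[:8] + struct.pack("!I", new_ssrc) + packet[12:]
```

Because the offset is cumulative, every packet after a marked one keeps the new SSRC, so the receiver sees what looks like a fresh source rather than a timestamp jump inside the old one.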

We have made progress on this. To help others with similar problems, here’s some additional information…

First, the root cause of the skipped RTP frames and then marked packet:

We were able to replicate this in a lab setting as well as observe it in production.

We use MeetMe with Realtime enabled, but our use case is an “always on” bridge for each user. We have 1850 bridges configured. For each caller, MeetMe+Realtime queries the database for the “endtime” field on the minute. In our case, the MySQL database had nothing in the way of indices or query caching, thus slowing the DB response. So… under load (~80 or more callers) roughly a third of the current RTP streams would show this “pause” during the endtime SQL queries, followed by a restart of RTP with a marked frame (and no SSRC change) for those RTP streams affected.

Is this a bug, or an artifact of the way MeetMe+Realtime works? The code is clearly blocked somewhere during realtime activity, but only some of the time.

To work around it, we added an index for the “confno” field in the realtime table, and also enabled the MySQL query cache on the database server. That combination sped up the DB enough to prevent the problem from recurring.
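
To illustrate why the index matters, here is a small self-contained Python sketch. It uses SQLite from the standard library as a stand-in for MySQL, so the query planner output will not match what MySQL's `EXPLAIN` prints, and the table name `meetme_rooms`, the index name and the placeholder `endtime` values are assumptions for illustration only (`confno` and `endtime` are the fields mentioned above). The `CREATE INDEX` statement itself has the same basic form in MySQL.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, confno: str) -> str:
    """Return SQLite's plan description for the per-caller endtime lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT endtime FROM meetme_rooms WHERE confno = ?",
        (confno,),
    ).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meetme_rooms (confno TEXT, endtime TEXT)")
# 1850 bridges, as in our setup; the endtime value is a dummy placeholder.
conn.executemany(
    "INSERT INTO meetme_rooms VALUES (?, ?)",
    [(str(n), "2099-12-31 23:59:59") for n in range(1850)],
)

before = query_plan(conn, "1234")  # no index yet: full table scan
conn.execute("CREATE INDEX idx_meetme_confno ON meetme_rooms (confno)")
after = query_plan(conn, "1234")   # index lookup on confno

print("before:", before)
print("after: ", after)
```

Without the index, every per-caller lookup walks the whole table; with it, the lookup goes straight to the matching row.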

I would like to look at a way to query once per bridge rather than once per call. Once per call seems unnecessary.
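
As a rough sketch of what "once per bridge" could look like, here is a small Python cache keyed by conference number. This is only an illustration of the idea, not how MeetMe is structured internally: `BridgeEndtimeCache`, the `fetch` callback and the 60 second lifetime are all assumptions. Every caller on the same bridge shares one cached `endtime`, and the database is only hit again once the entry has expired.

```python
import time
from typing import Callable, Dict, Tuple


class BridgeEndtimeCache:
    """Share one endtime lookup per bridge among all of its callers."""

    def __init__(self, fetch: Callable[[str], str],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # does the real database query
        self._ttl = ttl_seconds      # how long a cached value stays valid
        self._clock = clock          # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, confno: str) -> str:
        now = self._clock()
        entry = self._entries.get(confno)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: no database round trip
        value = self._fetch(confno)  # expired or missing: query once
        self._entries[confno] = (now, value)
        return value
```

With 80 callers on one bridge, this would turn 80 queries a minute into one. It would not remove the blocking itself, so whichever call happens to trigger the refresh could still stall if the database is slow.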

Second, the “Tron” audio problem on the 7962 phones:

I was NOT able to replicate this in the lab, but there is a clear correlation between this weird 7962 audio and the skipped audio + RTP marker problem described above.

We have not experienced this problem since the workaround to the RTP marker issue was implemented, and I would have expected to get at least one report by now based on prior frequency of occurrence.