Exceptionally long queue length queuing to PJSIP/

I’ve just experienced an Asterisk server becoming completely unresponsive, as in every single service on it became unavailable. Not only Asterisk, but also a set of other services the machine runs for a variety of other things - I was also unable to SSH into it.

The only thing I found in the logs were these (repeated many, many times):

[Jun 18 15:33:56] WARNING[13337] channel.c: Exceptionally long queue length queuing to PJSIP/registrar-0000d96a

The Asterisk instance is configured to core dump, but no dump was generated.

What I’m pondering is, since the instance was completely dead, including non-Asterisk-related services, whether it was Asterisk that somehow managed to pull down the instance, or whether those warnings were a result of some other failure on the instance?

Is this a virtual environment? If so, what kind of resources are on the host, and what did you assign to the guest(s) on the host?

Ah yes, vital information.
It’s hosted on AWS, running on a c5.xlarge instance. That’s 4 cores and 8 GB of RAM. These servers host at most around 30 calls; this one was serving 16 when it blew up.

I’m running 5 of these, and I’ve only seen this happen on this particular instance.

So I just want some input on how I could gather more information in case this is something that happened to Asterisk, or whether the consensus might be that the physical hardware just suffered some sort of disruption. The only reason I’m concerned is that I saw another crash on this exact instance last week as well.

Even if Asterisk perhaps deadlocked itself, I don’t believe this would take down the entire OS with it? Just looking for experience or insight.

I’ve seen the behavior you describe in virtual environments where the host CPU is oversubscribed. In ESXi it’s the co-stop and wait stats that give you an indication that this is the problem (not sure if there’s an equivalent in AWS).

Right off the bat I’d pull back to a single core for your instance, see if that helps.
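If it helps, one rough way to check for that from inside the guest is to watch the “steal” column in /proc/stat over a short interval. A quick sketch, assuming a Linux guest (the 5-second window is arbitrary):

```python
#!/usr/bin/env python3
"""Rough check for CPU steal inside a guest (Linux only).

High steal means the hypervisor is withholding CPU from the guest,
which can stall the whole box the way described above. On AWS you
would cross-check the instance's CloudWatch CPU graphs as well.
"""
import time

def cpu_times():
    # First line of /proc/stat:
    # cpu user nice system idle iowait irq softirq steal guest guest_nice
    with open("/proc/stat") as f:
        values = [int(v) for v in f.readline().split()[1:]]
    steal = values[7] if len(values) > 7 else 0
    return sum(values), steal

total1, steal1 = cpu_times()
time.sleep(5)
total2, steal2 = cpu_times()

delta = (total2 - total1) or 1
print(f"steal over last 5s: {100.0 * (steal2 - steal1) / delta:.1f}%")
```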

So I’ve done some more digging after having had the issue again yesterday, and it seems there is a memory leak under some pretty brutal circumstances.

It seems we have a user with a UAC (MicroSIP) which sometimes goes absolutely haywire, and in response to the INVITE (from Asterisk), will just explode and send its 180 Ringing reply in a loop. We’re talking thousands of them. I can’t even retrieve them all from Homer, it just cuts off after 100. But those 100 were sent within a timespan of 17 milliseconds.

But judging from the network monitoring, we’re talking megabytes upon megabytes of 180 Ringing replies. Asterisk’s memory usage shot up by 2 GB each time this happened, until I suppose it just used all of the available memory and locked up the system somehow. The memory usage doesn’t go down even after several hours.
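For what it’s worth, one way to count the 180s per dialog straight from a capture instead of relying on Homer is something like this sketch (it assumes tshark is installed; flood.pcap is just a placeholder file name):

```python
#!/usr/bin/env python3
"""Count 180 Ringing responses per Call-ID in a capture (sketch).

Assumes tshark is installed and 'flood.pcap' is a placeholder name;
handy for confirming how many retransmitted 180s one dialog produced.
"""
import subprocess
from collections import Counter

out = subprocess.run(
    ["tshark", "-r", "flood.pcap",
     "-Y", "sip.Status-Code == 180",
     "-T", "fields", "-e", "sip.Call-ID"],
    capture_output=True, text=True, check=True,
).stdout

counts = Counter(line for line in out.splitlines() if line)
for call_id, n in counts.most_common(10):
    print(f"{n:8d}  {call_id}")
```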

While this happens it seems that Asterisk starts printing these lines (hundreds per second):
Exceptionally long queue length queuing to PJSIP/registrar-0000d96a

The channel noted there is the one belonging to the outbound call placed to the UAC that craps out and floods 180 Ringing replies back.

I’ve moved the offending customer to a separate Asterisk 16.11.1 instance (the crashes have happened on 16.6.1).

Is there anyone who can provide some guidance on what I should do to gather more information, or on what I can provide to create a proper issue ticket for this? Unless it’s already been addressed in 16.11.1, this seems like it could be a DoS attack vector that should be mitigated somehow.

Some kind of packet capture, logs, everything. Issues should be filed on the issue tracker[1]. I’m not really sure though that there is a way to mitigate such a problem. You ultimately end up consuming resources to process such things, even just the amount you have to process in order to decide to block. The only thing we could possibly do is add an option for a fixed-size work queue, but the result of that is you then potentially drop legitimate SIP traffic, so you’re still going to potentially have problems there. There’s no real “aha!” fix for such things.

[1] https://issues.asterisk.org/jira
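Just to illustrate the trade-off: a minimal sketch of what a fixed-size work queue ends up doing (this is not Asterisk’s actual taskprocessor code, purely the concept):

```python
import queue

# Bounded queue: once the backlog reaches the cap, new items are
# dropped instead of growing memory without limit. The cost is that
# legitimate messages arriving during a flood get discarded too.
work = queue.Queue(maxsize=1000)
dropped = 0

def enqueue(item):
    global dropped
    try:
        work.put_nowait(item)
    except queue.Full:
        dropped += 1  # shed load instead of queueing indefinitely
```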

Oh I agree; while the flooding happens, it is what it is. I’m working on getting our Kamailio server in front of Asterisk to drop these replies in some way.
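Not the actual Kamailio config (that would live in its routing script, probably built on the pike or htable modules), but the idea is roughly a per-dialog rate limit on provisional responses, something like this sketch:

```python
import time
from collections import defaultdict

# Sketch of the idea only: forward at most MAX_PER_SECOND provisional
# responses per Call-ID and drop the rest of the flood.
MAX_PER_SECOND = 10
window = defaultdict(lambda: [0.0, 0])  # Call-ID -> [window start, count]

def should_forward(call_id, now=None):
    if now is None:
        now = time.monotonic()
    start, count = window[call_id]
    if now - start >= 1.0:
        window[call_id] = [now, 1]  # new one-second window
        return True
    if count < MAX_PER_SECOND:
        window[call_id][1] = count + 1
        return True
    return False  # excess 180s for this dialog are dropped
```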

However, with each flood of replies, Asterisk permanently increased its memory footprint significantly, and it didn’t go down. When it happened yesterday I managed to isolate this Asterisk server before it actually crashed out. Normally Asterisk uses something like 5 MB of RAM. After I’d isolated it, it was sitting at 6 GB of usage, and it didn’t go down. I restarted it and reinstated it in its cluster this morning, 12 hours after I’d isolated it.

So I’m more concerned there is a memory leak that happens when this “attack” occurs.
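In the meantime, a crude way to watch for it is to log the asterisk process RSS once a minute and correlate the jumps with the floods; a sketch (assumes a single asterisk process findable via pidof):

```python
#!/usr/bin/env python3
"""Log the asterisk process RSS once a minute (sketch).

Assumes a single 'asterisk' process findable via pidof; useful for
correlating memory growth with the 180 Ringing floods.
"""
import subprocess
import time

def asterisk_rss_kb():
    pid = subprocess.run(["pidof", "-s", "asterisk"],
                         capture_output=True, text=True).stdout.strip()
    if not pid:
        return None
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is in kB
    return None

while True:
    rss = asterisk_rss_kb()
    print(time.strftime("%F %T"), f"{rss} kB" if rss else "asterisk not running", flush=True)
    time.sleep(60)
```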

Anyways, I’ll create an issue on the issue tracker! Is there some way I can provide you guys with a pcap privately? The one I’ve got contains IP addresses of our customers and such, which I don’t want to share publicly.

PCAPs can be sent to asteriskteam@digium.com

Issue created ASTERISK-28962
