Random 503 errors on register & taskprocessors?

Hi there,

I have an issue where my webRTC clients are disconnected randomly.

I am on Asterisk 16.10, using realtime and full ARI.

I connected to the console and saw this message:

WARNING[9818]: taskprocessor.c:1160 taskprocessor_push: The 'stasis/m:rtp:all-00001616' task processor queue reached 500 scheduled tasks again.

This message was followed on the client side by a 503 error:

SIP/2.0 503 Service Unavailable
Via: SIP/2.0/WSS rthnds28cabp.invalid;rport;received=127.0.0.1;branch=z9hG4bK5595006
Call-ID: 3ct89of0q9jgtnip2ablfl
From: "b05c090d-f1aa-48f2-bfe7-4be0d6ebf7f7" <sip:b05c090d-f1aa-48f2-bfe7-4be0d6ebf7f7@redacted>;tag=1ehu7kmpcs
To: <sip:b05c090d-f1aa-48f2-bfe7-4be0d6ebf7f7@redacted>;tag=z9hG4bK5595006
CSeq: 13 REGISTER
Server: Asterisk PBX 16.10.0
Content-Length:  0

The channel count is fairly low (about 100-120) for a dual-core Xeon 8275CL CPU @ 3.00GHz.
CPU usage is about 25% and the 15-minute load average is 0.80.

If I understand correctly, taskprocessor warnings are just warnings and nothing is “blocked”, so why did Asterisk reply with a 503 Service Unavailable to the registration?

I know this isn’t much detail, but there are no other warnings or messages that I can use to track down the issue. Do you have any clues?

How can I debug/track what’s going on with the taskprocessors?

Thanks for your help :pray:

That taskprocessor is the RTP topic, so it would most likely be events related to RTCP and traffic there. You could see if disabling such messages in stasis.conf[1] would make it go away. Upgrading Asterisk may also change conditions.
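
For illustration, declining the RTCP messages would look roughly like this in stasis.conf (a sketch only; the message type names below are assumptions based on the RTP engine, so verify them against the stasis.conf.sample shipped with your version):

[declined_message_types]
; Don't publish RTCP report messages to Stasis at all.
; Note: anything that consumes these messages (res_hep_rtcp, for instance)
; would stop receiving them.
decline = ast_rtp_rtcp_sent_type
decline = ast_rtp_rtcp_received_type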

As for REGISTER getting a 503: if a taskprocessor overloads, then new things (such as calls or registrations) are rejected to give Asterisk time to work off the load. There’s an option to limit this[2]. I have no idea if your version of Asterisk has it.

[1] asterisk/stasis.conf.sample at master · asterisk/asterisk · GitHub
[2] asterisk/pjsip.conf.sample at master · asterisk/asterisk · GitHub
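
For reference, the option in [2] looks roughly like this (a sketch; as noted, your 16.10 may not have it, and the option and value names should be checked against the pjsip.conf.sample for your version):

[system]
type = system
; Only let overloads of PJSIP's own taskprocessors trigger the 503 rejection,
; instead of any taskprocessor in the system (such as stasis/m:rtp).
taskprocessor_overload_trigger = pjsip_only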

Hi @jcolp ,

Thanks for your feedback!

So the bottleneck is not at the Stasis level but rather at the RTP level?

That’s weird: I have another IPBX with the exact same hardware, OS & config handling about 300 channels, and it has no issues at all.

Is there something I can do to try to pinpoint what’s wrong or what is causing the overload?

As for the provided links, the comments state that those changes can be dangerous, so I’m evaluating the risks/benefits. It seems that if I change the taskprocessor overload triggers, Asterisk can crash under certain circumstances? That sounds risky.

RTP is likely creating the stasis messages, which are then published to the topic and take some time to process. That processing time results in the queue growing faster than it can be handled. There is stasis statistics stuff when developer mode is enabled that can give some insight, but yet again I have no idea if it exists in the version of Asterisk you’re running.
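
If you want to try that, it would look roughly like this (assuming the statistics CLI exists in your build; the command names below are from newer versions and may differ):

# rebuild with developer mode so the Stasis statistics are compiled in
./configure --enable-dev-mode
make && make install

# then, from the shell, see which topics/subscribers are slow or backed up
asterisk -rx "stasis statistics show topics"
asterisk -rx "stasis statistics show subscriptions"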

The act of changing the option won’t itself cause a crash, but it can cause more things to be allowed to happen resulting in an overload.

Thanks for your explanations,

I understand that, for the option, allowing more things to be processed can lead to a crash if I run out of system resources.

As for the RTP taskprocessor, I think I get the point: there is RTP traffic that can be call related (INVITE, BYE, …) or session related (REGISTER, OPTIONS, …).
Each “interaction” between the IPBX and an endpoint generates one or more stasis messages under the “RTP” taskprocessor and not the “Stasis” taskprocessor.

Is that correct?

Then I have an overload issue somewhere in the chain, so could it be the Stasis application responding too slowly? Or something inside the IPBX that is stuck? (I’m just thinking of possible/plausible scenarios.)

RTP is media. RTCP messages are related to statistics and information about the media; they are sent and received. I don’t know your complete deployment or how you are using things/what you are doing, so I can’t answer any further beyond what I have.

You are right, I was so focused on the SIP/ARI stack that I mixed things up… RTP is RTP.

I will try to investigate the RTCP messages to check what’s going on.
I just remembered something else: I have a sipcapture node, so I’m using res_hep, and I know that it processes RTCP messages.
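
One isolation test I’m considering (assuming res_hep_rtcp can be safely unloaded at runtime, which I still need to confirm) is to temporarily stop forwarding RTCP to the capture node and see whether the warnings go away:

asterisk -rx "module unload res_hep_rtcp.so"
# observe the taskprocessor warnings for a while, then re-enable capture
asterisk -rx "module load res_hep_rtcp.so"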

The only thing is that I don’t have this issue on my other IPBXs, and all my IPBXs use the exact same system image :thinking:

It seems to me that if Asterisk is implementing throttling measures, it really should be logging a warning to that effect. Currently the warning just reads to me as cautionary rather than actually indicating a change in behaviour.

I’d suggest filing an issue for such a thing. I don’t know if that’s been gone back and forth on. It feels like we did have it as such, then people complained. I could just be misremembering another log message though.

Quick update,

I’ve been thinking about my issue and I believe it is linked to webRTC usage.

Multiple servers were suddenly affected over the last 6-8 days; they have the same system image (so the exact same configuration), and all are using webRTC.
The issue has not affected servers that are not used for webRTC.

The symptoms are the same: I start seeing webRTC disruptions, with frequent taskprocessor warnings.
The taskprocessor warnings disappear as the number of channels decreases; it seems that around 100 channels the issue is mostly gone.
All servers (IPBX) use the same trunks. RTCP is enabled on the provider side and working, but I don’t believe it is causing the issue, since all servers are receiving it and everything was working fine for months without any change on my side (traffic is almost the same).

Once the issue starts, the only way to fix it is to restart Asterisk; then everything goes back to normal.

I attached the “core show taskprocessors” output to my post just in case. Indeed, the stasis/m:rtp topic’s max queue depth is really big; I am still trying to figure out what’s happening.

taskprocessors.txt (570.7 KB)
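
To keep an eye on it I’m polling the queue depth from the shell, something like this (the grep pattern just matches the topic name I see in the warnings):

watch -n 5 'asterisk -rx "core show taskprocessors" | grep "stasis/m:rtp"'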

I made a tcpdump for ~1 min including all traffic; the RTCP traffic doesn’t seem that high: