Task processor overload

I’ve been getting 503 responses to REGISTER requests from phones, and looking at the source code, the only way that could happen is task processor overload (pjsip_distributor.c). I found that these log messages appear for almost every call:

WARNING[27858][C-00000001] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks.
WARNING[25160][C-00000008] taskprocessor.c: The 'stasis/p:endpoint:PJSIP/XXX-0000001c' task processor queue reached 500 scheduled tasks.

The dialplan is rather complex, so I suspected that the first message was caused by excessive VarSet and Newexten events. Indeed, after I disabled their creation in the source and recompiled, the first warning disappeared.

I have no clue about the second, though, and I would appreciate some guidance. If I dial one PJSIP endpoint from another, the warning usually appears for both, but more often for the calling one.

If you haven’t seen it already, this post has some good, and still applicable, information about taskprocessors and queue size warnings.

Note the ‘p’ in:

stasis/p:endpoint:PJSIP/XXX-0000001c

That means it’s “pooled”, so adjusting the stasis threadpool might potentially help some. From the blog post:

You can adjust the thread pool parameters for PJSIP and stasis task processors. The PJSIP thread pool is configured in pjsip.conf. The stasis thread pool is configured in stasis.conf.
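For reference, a rough sketch of where those two thread pools are tuned. The option names follow the sample configuration files shipped with Asterisk; the values are only illustrative, not recommendations, so check the samples for your version before copying anything.

```ini
; stasis.conf -- thread pool used by pooled ("p:") stasis subscriptions
[threadpool]
initial_size = 5        ; threads created at startup
idle_timeout_sec = 20   ; seconds an idle thread lives before it is destroyed
max_size = 50           ; upper bound on the number of threads

; pjsip.conf -- thread pool used by the PJSIP stack
[system]
type = system
threadpool_initial_size = 0
threadpool_auto_increment = 5   ; threads added when the pool needs to grow
threadpool_idle_timeout = 60
threadpool_max_size = 50
```

A bigger pool only helps when subscribers are waiting for a free thread. It does not reduce the number of messages being published.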

Thanks for the reply. I’ve played with the thread pool settings already, but it doesn’t seem to make a difference. We are talking about a system with a single active call. I am not doing anything fancy: one PJSIP endpoint dials another. Does that really require 500+ tasks? The fact that the developers added this warning tells me that reaching 500 scheduled tasks is not a normal or desirable situation, so I would like to know what could theoretically cause it. Is there some way I can debug it? I can never catch these tasks with “core show taskprocessors”.

After further investigation, the second warning is also caused by dialplan complexity. My mistake was thinking that the PJSIP endpoint task processor only deals with SIP signalling, but apparently it deals with everything that happens on the channel. I keep the call routing table in a Postgres database and process the rules in the dialplan, which generates an extraordinary number of tasks. I am guessing VarSet and Newexten are again the culprits.

OK, after more digging, I’ve encountered the option “hide_messaging_ami_events”, which is supposed to suppress Newexten and VarSet events, and even defaults to “yes” in v18. However, even if I explicitly set it in asterisk.conf, I am still getting those events over AMI. Am I not understanding something?

That option has no impact on those events for calls. It is for the text messaging channel, not for regular channels.

The stasis.conf configuration file can be used to stop the creation of internal Stasis messages for different types, such as the ones you mentioned.
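As a minimal sketch of what that looks like, assuming the message type names listed in the sample stasis.conf apply to your version: message types are declined by name in the [declined_message_types] section. Only the VarSet type is shown here, and the sample file has the full list of names.

```ini
; stasis.conf
[declined_message_types]
; Stop the internal VarSet message from being created at all.
; Anything that relies on this message (for example the AMI
; VarSet event) will no longer see it.
decline = ast_channel_varset_type
```

Decline only types you are sure nothing on the system depends on.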

Thanks. I am guessing ast_channel_snapshot_type is for Newexten? I’ve tried disabling that, but then I get:

[2021-09-03 15:35:26.926] NOTICE[29403] cdr.c: CDR simple logging enabled.
[2021-09-03 15:35:26.927] ERROR[29403] cel.c: Failed to register for Stasis messages
[2021-09-03 15:35:26.936] ERROR[29403] loader.c: *** Failed to load module cel
[2021-09-03 15:35:26.936] ERROR[29403] asterisk.c: Module initialization failed. ASTERISK EXITING!

I don’t need CEL; can I disable it altogether?

Disabling ast_channel_snapshot_type would end poorly. It’s used by a lot of core things. If CEL is disabled in cel.conf then it will not subscribe to messages.
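For completeness, turning CEL off is a single setting in the [general] section of cel.conf (a sketch based on the sample file):

```ini
; cel.conf
[general]
enable = no   ; CEL stays loaded but does not subscribe to Stasis messages
```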

OK, so I can’t disable Newexten without compromising Asterisk stability?

The creation of the internal message, no. There are filters that can be done in manager.conf if you’re actually using AMI so it at least doesn’t go out AMI.
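A rough sketch of such a filter, assuming a manager user named "monitor" (the user name, secret, and permissions here are hypothetical placeholders). Each eventfilter line is a regular expression matched against the event text, and a leading "!" excludes events that match:

```ini
; manager.conf
[monitor]                         ; hypothetical AMI user
secret = changeme                 ; placeholder, use your own
read = call,dialplan              ; example permissions only
eventfilter = !Event: Newexten    ; do not send Newexten to this user
eventfilter = !Event: VarSet      ; do not send VarSet to this user
```

This only keeps the events from going out over this AMI connection. The internal Stasis messages are still created and queued, so on its own it will not make the endpoint task processor warning go away.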

As an observation, I tried to find documentation on Asterisk Stasis messaging, but all I found was an extremely high-level overview and stateless descriptions of a few APIs. I couldn’t find anything short of the source code itself that describes the internal structure, in particular why it needs task processor threads, or the ways that the different components of Asterisk communicate using it, including the naming structures for the various queues that can overflow.

Have I overlooked something, or are there some modules that are good examples of its use, both as publisher and subscriber, that aren’t a megabyte of source code?

We seem to be seeing more and more reports of these overloads, but it is difficult to respond to them at anything more than a very high level, namely that they are only warnings and symptomatic of other problems, and I haven’t really seen any replies that specifically identify the bottlenecks involved.

I don’t know of anything precisely, but stasis itself isn’t overly complex. You create a message, you publish it to a topic, each subscriber receives that message. A subscriber can either receive the message using its own dedicated thread or a thread out of a pool. If the subscriber can’t process things fast enough (messages are published faster than being consumed), then the queue of messages (they’re ordered) grows and you get a warning.

The naming of things is to a degree arbitrary. In developer mode, however, there are stasis CLI commands (stasis show and TAB to auto complete) to allow you to inspect things further which can provide better insight.

I guess the conclusion is that having a dialplan with a lot of processing is unfeasible. The solution is to offload the critical parts to FastAGI?

One of the core users of Asterisk, FreePBX, has huge dialplans.

I know, but not as huge as this. Mine actually has fewer lines than theirs, but there is a lot of complex processing in a loop. A single call generates 40k - 50k scheduled tasks.

This sounds like ‘AGI/AMI territory’ to me. ‘Dialplan’ is fine for routing calls and simple call flow, but when I hear ‘complex’ and ‘lots of stuff’ I think it’s time for a real programming language – features, debugging, real syntax checking with real error messages, etc.

I’ve written it in AEL, which at least tries to be a programming language. I didn’t have much trouble with AEL itself, and everything works. It just didn’t occur to me that the sheer number of darn Newextens was going to be a problem.

AEL is a step up, but a misplaced semi-colon or brace and half your dialplan disappears with no or misleading warnings.

You also still have limited functionality, no debugging, obtuse quoting, no libraries, …

Yeah, tell me about it. Still, I could have lived with it, were it not for this issue.

I guess the problem is that you are doing a very long run of processing without blocking. Either all the other CPU cores are busy in the short term, so not all the subscribers can read out their queues, or the subscribers themselves are blocking, so they can’t read out their full queue.