Is the CLI command “stasis show topics” particularly heavy to run? On an Asterisk (v20) with 1-2 agents and one ARI application it returns around 300 topics. We ran it on a production system with 150+ agents, the CLI hung, and Asterisk appeared to deadlock and stopped processing incoming SIP traffic until it was restarted.
I’m looking to learn more about this command, and to hear whether anyone has experienced anything similar.
There have been no reported issues with it, though I suspect few people actually run it because, practically speaking, it’s generally only useful for developers. If a deadlock did occur, a backtrace would show where.
Yeah, there is just no way for us to replicate this right now without risking an impact on production. Unfortunately, we don’t have the tools to mimic that kind of load in a test environment.
The live container that holds the topics is a hash container optimized for searches by topic name and hash containers aren’t sorted for obvious reasons. So, in order to show a sorted list, the live container is copied topic-by-topic into a list container that has a sort function. While it’s copying, a read lock is held on the live container. When the copy finishes, the read lock on the live container is released and the list container is dumped to the console.
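In rough terms the pattern looks like the sketch below. This is not the actual main/stasis.c code; the function and variable names (show_topics, topic_all, topic_sort_fn) are placeholders, and it assumes the usual asterisk.h, astobj2.h, stasis.h and cli.h headers. It’s only meant to show the shape of it:

static int topic_sort_fn(const void *obj_left, const void *obj_right, int flags)
{
    /* Only object-to-object comparison is needed for this temporary
     * container: order topics alphabetically by name. */
    const struct stasis_topic *left = obj_left;
    const struct stasis_topic *right = obj_right;

    return strcmp(stasis_topic_name(left), stasis_topic_name(right));
}

static void show_topics(int fd, struct ao2_container *topic_all)
{
    struct ao2_container *sorted;
    struct ao2_iterator iter;
    struct stasis_topic *topic;

    /* Temporary, unlocked list container with a sort callback.  Each
     * insert walks the list to find its position, so copying N topics
     * costs on the order of N^2 comparisons. */
    sorted = ao2_container_alloc_list(AO2_ALLOC_OPT_LOCK_NOLOCK, 0,
        topic_sort_fn, NULL);
    if (!sorted) {
        return;
    }

    /* The copy read-locks the live container for its entire duration;
     * this is the window during which other topic operations block. */
    if (ao2_container_dup(sorted, topic_all, 0)) {
        ao2_cleanup(sorted);
        return;
    }

    /* The live container is unlocked again; dump the sorted copy. */
    iter = ao2_iterator_init(sorted, 0);
    while ((topic = ao2_iterator_next(&iter))) {
        ast_cli(fd, "%s\n", stasis_topic_name(topic));
        ao2_ref(topic, -1);
    }
    ao2_iterator_destroy(&iter);
    ao2_cleanup(sorted);
}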
There are a few ways we could speed this up.
1. We could change the live container from a hash container to an RBtree container. Red-Black trees are naturally sorted, so we could then dump the live container directly to the console. HOWEVER…
a. The “topic_all” container is critical to core operation so changing it to a different container type could have unintended consequences.
b. We’d still have to keep a read lock on it while dumping and if we’re dumping direct to the console and for some reason the console I/O blocks, we’d be preventing write operations on the live container. Not a good thing.
2. We could change the temporary list container to an RBtree container. In theory, this should be much less expensive than a list with a sort function, which means we’d be holding the lock on the live container for a shorter period of time, and the time to display the first topic would also be shorter. We actually use this technique in a few other places (a rough sketch of what this would look like follows below).
If you think this would be worth the effort, go ahead and open an issue and we’ll take a look at it when we get a chance. Of course, you could also try the change yourself and open a PR. The function that handles “stasis show topics” is in main/stasis.c and is fairly simple. The “cli_show_tasks” function in res/res_pjsip/pjsip_scheduler.c is a good example of this.
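For option 2, the change would be mostly confined to how the temporary container is allocated, roughly along these lines (again an illustrative sketch using the same placeholder names as above, not the actual code):

    /* Same sort callback, but backed by a red-black tree: each insert is
     * O(log n) instead of a linear walk of a list, so the read lock on
     * the live container is held for much less time when there are many
     * thousands of topics. */
    sorted = ao2_container_alloc_rbtree(AO2_ALLOC_OPT_LOCK_NOLOCK, 0,
        topic_sort_fn, NULL);

Iterating an rbtree container should still produce the topics in sorted order, so the dump loop itself would not need to change.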
I waited a good 30 seconds for a response from the CLI, then I jumped onto another terminal session and noticed that Asterisk had stopped processing new SIP requests. So I killed the Asterisk process and restarted the service to restore functionality.
It took me at least 2-3 minutes to complete this process, and during that time we were unable to process any calls. I never checked back to see whether the command eventually returned a response, so I cannot say how long I would have had to wait for Asterisk to release the lock.
/sbin/service asterisk stop
/sbin/service asterisk start
I later dropped max_size to 125.
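In case it helps anyone else, those knobs live in the [threadpool] section of stasis.conf (if it’s the Stasis threadpool you’re tuning; other modules have their own pools). The values below are only an example, with max_size set to the figure mentioned above:

[threadpool]
initial_size = 5        ; threads created when the pool starts
idle_timeout_sec = 20   ; seconds before an idle thread is destroyed
max_size = 125          ; upper bound on the pool size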
Tinkering with these parameters is risky. Make changes gradually and observe the results. As with all tuning parameters, what is going on here is a balancing act.
In my case Asterisk came to a halt because it exceeded the threadpool size for too long. However, it’s normal for Asterisk to exceed the threadpool size many times during the day - just not too many times. It’s a balancing act: make the threadpool too large and Asterisk gets overwhelmed with processes being run; make it too small and Asterisk gets overwhelmed with processes waiting to be run. Either way leads to a freeze. My environment is a border condition, where the Goldilocks Zone for that parameter is not where the Asterisk devs hard-coded the default. So I had problems, I adjusted the defaults, and the problems went away. Fortunately for me this was a new pre-production test environment, and it was a virtualized server, so I could add CPU cores and RAM easily.
Running stasis show topics in your environment is a border condition that virtually nobody hits - you’re not using Asterisk in a normal use case - so you are also going to have to tune it by changing defaults. Start with the Asterisk Tuning link above, make some adjustments you think might help, run the command on the 150-agent Asterisk install during the quietest part of the day, and see whether that prevents Asterisk from hanging. If not, go back, adjust again, and run it again. Keep making changes and testing each time to see whether your changes help or hurt.
Ever tuned a shortwave radio to pick up the best signal? Same idea.
It’s not a pretty way to do it, but there are so many variables that this is the quickest way.
I am unsure that is a road we wish to take; we would lose the dynamic behavior and end up having to manually optimize specific parameters depending on the load. If you ask me, no CLI command should ever lock Asterisk up for several minutes; such commands should only be accessible via specific build flags.
We have a farm of many different Asterisk instances with varying numbers of agents, so this solution would quickly become an admin headache.
The particular system that was impacted is running 4 CPUs with these settings, and we have no other issues except for this one deadlock following stasis show topics.
I would not expect touching the threadpool settings to have much of an impact on this specific case, as they don’t affect the number of topics in said container. My guess is that any impact would be incidental, since there would be additional work going on; it could even make things worse, with more threads handling topics and subscriptions and more activity overall.
As George said, you can open an issue. There would be no time frame on when it would get looked at.
Referencing the thread is fine, but in the future for things like this a backtrace is really really really good because it can actually confirm things.
Taskpool would have no direct involvement or interaction with this; this is nowhere near the problem it was written to solve. George has already done the analysis on this in his prior post. My backtrace comment was for the future, to save time if something like this happens again.
I didn’t tell him to mess with the threadpool settings; I told him to study the tunables and adjust the ones he thinks might affect it. Big difference there.
In my case I got lucky, because the developers helpfully put in a threadpool debug print statement, most likely for their own purposes, which in my environment basically shot a red rocket up saying “look, dummy, there’s something weird going on here.” I know perfectly well a debug statement is useless if it’s emitted so often that it spams every other debug statement out of the log - whoever put that statement there wouldn’t have been able to use it at all in my environment. Logic told me they left it in because normally it wasn’t overpowering everything else - thus, something about my environment was a border condition.
I was lucky, I had the problem handed to me on a silver platter.
For sure, a quick first step for him would be to crank up debugging to its highest level and see if anything unusual gets shaken out. If not, then he can proceed to more complex debugging - the most complex of all, of course, being thoroughly understanding the code and the available tunables and experimenting, if nothing grossly obvious is thrown up.
But, a backtrace isn’t a bad choice, either. It might even reveal something.
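If it happens again, grabbing a backtrace from the hung process before restarting only takes a few seconds, roughly along these lines (assuming gdb is installed; building Asterisk with DONT_OPTIMIZE makes the output far more readable):

gdb -p $(pidof asterisk) -batch -ex "thread apply all bt full" > /tmp/asterisk-backtrace.txt

That output would show which thread is holding the lock and which threads are stuck waiting behind it.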
I have to disagree with this - in the world of computers and IT and so on, 90% of the admins out there don’t know what the heck a CLI is much less are able to spell it. The CLI is pretty much understood by most people to be reserved for developers or admins who could be developers if they wanted to be but are too lazy (like me) to go for the gold and be a developer.
It’s very sweet that you have admins running Asterisk systems via the CLI who would consider something like this to be a headache. Here’s a nickel, buy yourself a real server farm and wrap all your Asterisk servers in a GUI, and they will kiss your feet, lol.
My back-of-the-envelope calculations say that, whilst the time the lock is held does seem rather longer than should be possible, with the number of topics being sorted there is no way you are going to sort them in an amount of time that makes keeping a global resource locked acceptable.
The topics are being used by internal Asterisk processing, so they are not under the direct control of the OP, and no-one has commented on whether the number is reasonable or whether it represents a resource leak.
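To put some hypothetical numbers on that back-of-the-envelope (the real topic count on a 150+ agent system isn’t known, so N here is purely an assumption), a trivial standalone program comparing the rough comparison counts for building the sorted copy with a list versus a balanced tree:

/* build: cc compare.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Assumed topic count: ~300 were seen with 1-2 agents; the count at
     * 150+ agents is unknown, so this figure is purely illustrative. */
    double n = 30000.0;

    /* Sorted linked list: each insert walks about half the list. */
    double list_cmps = n * n / 4.0;

    /* Red-black tree: each insert costs roughly log2(n) comparisons. */
    double rbtree_cmps = n * log2(n);

    printf("N = %.0f topics\n", n);
    printf("sorted list    ~ %.2e comparisons\n", list_cmps);
    printf("red-black tree ~ %.2e comparisons\n", rbtree_cmps);
    return 0;
}

The gap is several orders of magnitude; the question is whether even the faster approach keeps the lock on the global container held for an acceptably short time on a system of that size.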
Just to make sure I understand, david551: are you saying that, even if we switched the temporary list container to an RBTree (to reduce sort time and lock duration), it still wouldn’t make a big difference, because the total number of topics is so large that sorting them will always take too long?
I wasn’t sure whether that contradicts the RBTree idea or just sets expectations for it.