Asterisk 20.3 (PJSIP) grinds to a halt on concurrent BLF subscriptions

We’re currently experiencing an issue with our Asterisk 20.3 environment (with PJSIP of course). For the most part it runs perfectly fine, except for one thing: Asterisk practically grinds to a halt when several BLF subscriptions arrive concurrently.

When there’s a single subscription incoming (as monitored via Asterisk CLI and Wireshark), there is nothing wrong. Our usual Grandstream phones also have this nice habit of only sending the next subscription after the previous has been answered.

But we’re using a softphone application by Acrobits which sends many subscriptions (up to 25) at once. When this happens, Asterisk becomes unusable for at least 30 seconds. If any phone tries to do anything during that time, expect to wait up to 10 seconds before its requests result in any Asterisk dialplan action.

In our production environment we’re mostly running Asterisk 16. In those instances this is not an issue at all. I’ve been scrolling through changelogs concerning this, but haven’t found anything so far.

Is anyone else experiencing issues like this? Has something changed in PJSIP concerning BLF subscriptions?

I have a couple of log files which were made on a machine with this issue. However, this is a live machine, so there’s too much noise and private data in these logs. I will try to reproduce the problem, create clean log files on a separate machine, and post those a.s.a.p.

I hope someone can verify the issue or maybe point me in the right direction if the error is ours. Thanks in advance!

[Update #1]
I tried recreating this same issue on a clone of the VM where we ran into the problem. However, no issues were present there, so it seems to be related to the amount of traffic on the live environment. Is there a way to share logs without exposing these to the entirety of the www?
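
One way to make live logs shareable is to pseudonymise the private parts before posting. The following is only a minimal sketch under stated assumptions: it assumes the sensitive data is limited to IPv4 addresses and SIP URIs, and the names `redact_line`, `_token` and the `salt` parameter are hypothetical, not part of Asterisk. Each value is replaced with a stable token, so the same phone or peer can still be followed through the log.

```python
import hashlib
import re

# Assumption: private data appears as IPv4 addresses and sip:/sips: URIs.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SIP_URI_RE = re.compile(r"\b(sips?):([^@\s;>]+)@([^\s;>:]+)")


def _token(kind, value, salt):
    """Map a value to a short, stable pseudonym such as 'ip-1a2b3c4d'."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:8]
    return f"{kind}-{digest}"


def redact_line(line, salt="change-me"):
    """Replace SIP users, hostnames and IPv4 addresses with pseudonyms."""

    def sip_sub(match):
        scheme, user, host = match.groups()
        # IPv4 hosts are left for the IPv4 pass below, so the same address
        # gets the same token whether or not it appears inside a URI.
        if not IPV4_RE.fullmatch(host):
            host = _token("host", host, salt)
        return f"{scheme}:{_token('user', user, salt)}@{host}"

    line = SIP_URI_RE.sub(sip_sub, line)
    line = IPV4_RE.sub(lambda m: _token("ip", m.group(0), salt), line)
    return line


def redact_file(src_path, dst_path, salt="change-me"):
    """Write a redacted copy of src_path to dst_path, line by line."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(redact_line(line, salt))
```

Calling `redact_file("full.log", "full.redacted.log", salt="some-secret")` (both file names are placeholders) would produce a copy to share. Passwords, display names and other free text are not covered, so the result still needs a manual check before posting.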

[Update #2]
Downgrading from Asterisk 20.3 to 20.2.1 seems to have resolved the issue, as per suggestion of jcolp. Seems to have been related to the PJSIP update in 20.3.

There have been no reports of this and I haven’t seen anything elsewhere. Within Asterisk there haven’t been any major changes to subscriptions, aside from some changes to handle the latest version of PJSIP that we updated to. You could try rolling back to an earlier Asterisk 20 release to see whether the update to the latest PJSIP is what caused it.

A deadlock backtrace[1] would also show what everything is doing inside of Asterisk.

[1] Getting a Backtrace - Asterisk Project Wiki

We’re on 20.1 (with 20.3 in testing), but since around 18.something we’ve had a really obscure issue that has something to do with state/hint subscriptions, where Asterisk completely stops processing calls and even killing it manually can take up to 5 minutes.

Unfortunately, we haven’t been able to catch it in the act and it happens weeks or months apart but it’s definitely an issue. I’ve hesitated bringing it up because we just don’t have any data.

About the only thing that we can find, and it does NOT line up with the event, is stasis manager errors where our AMI queues reach 3000 tasks. We see this several times a day on various servers, but it doesn’t seem to impact anything.

By the time it gets to me the servers have typically been restarted :(

Thanks for the quick reply jcolp! I’ve downgraded to 20.2.1 and the issue seems to have vanished. So it looks like it’s PJSIP-related indeed. I’ll stick to 20.2.1 for now and monitor it during further testing.

The deadlock backtraces weren’t applicable I think because Asterisk never really crashed but just slowed down to near zero for a while after which it came back up to speed. Good to have that doc at hand though - thanks.

bkervaski, that problem sounds eerily similar on some level. However, killing Asterisk manually was never really an issue in our situation, as I recall. We are having another, similar issue where Asterisk becomes unresponsive because of what seems to be excessive memory usage. That was one of the primary reasons why we are looking into upgrading to Asterisk 20 (LTS).

A deadlock backtrace shows what is happening at the moment the backtrace is taken, which can shed light on exactly what Asterisk is doing.
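
To illustrate taking a snapshot from a process that is slow rather than crashed, here is a rough sketch that attaches gdb to a running process and dumps every thread's stack. It is not the procedure from the wiki page referenced above. It assumes gdb is installed and that you have permission to attach to the process, and the helper names `build_backtrace_command` and `capture_backtrace` are made up for this example.

```python
import subprocess


def build_backtrace_command(pid):
    """Return a gdb invocation that prints all thread stacks, then exits."""
    return [
        "gdb", "-batch",               # run the commands below and quit
        "-p", str(pid),                # attach to the running process
        "-ex", "thread apply all bt",  # backtrace of every thread
    ]


def capture_backtrace(pid, out_path):
    """Attach to the live process and write the backtrace to out_path."""
    cmd = build_backtrace_command(pid)
    with open(out_path, "w") as out:
        # The process is paused while gdb is attached, so expect a short
        # stall on a live system.
        subprocess.run(cmd, stdout=out, stderr=subprocess.STDOUT, check=True)
```

Running `capture_backtrace(asterisk_pid, "backtrace.txt")` during one of the slow periods, with `asterisk_pid` being the PID of the Asterisk process, would record what each thread was waiting on at that moment.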
