Good morning folks
I just wanted to share something that will certainly benefit new Asterisk admins and probably give a chuckle to the experienced ones.
About 6 months ago we upgraded from Asterisk 13 with chan_sip to Asterisk 16 with PJSIP. It didn’t come without its challenges, but it wasn’t terrible.
Since then, we’ve had the occasional issue where PJSIP contacts would become unavailable as a result of a CPU spike.
It’s been frustrating, to say the least. Until just now we couldn’t figure out why, because this is one of those issues where, as soon as you set up to observe the behavior, the behavior doesn’t happen.
Two things were happening:
- A call causing a race condition (infinite loop) in certain edge cases, causing …
- The logger to fall behind because the file system couldn’t keep up (due to a missing parenthesis in a dialplan)
This resulted in Asterisk misbehaving (or behaving properly, depending on your point of view) and, if the call didn’t hang up soon enough or there were multiple race conditions, eventually dumping core and failing.
What’s interesting about this is that when watching the Asterisk console with verbose output, we could never get it to fail. Apparently, the act of outputting to the console was just enough to let the server breathe and catch up.
Only when we set verbosity to 0 did Asterisk, or rather our dialplan causing the infinite loop, fail.
We actually already had infinite loop detection in our dialplan, but an update made around the same time as the move from 13 to 16 bypassed it under certain conditions.
So the moral of the story, after some heartache: make sure you build in infinite loop detection wherever your users have some control over the dialplan. Something as simple as an inherited variable (e.g., __LOOP) incremented with INC() will do it.
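For anyone who wants a starting point, here is a rough, untested sketch of that kind of loop guard. It is not our production dialplan. The context names, the FORWARD_TARGET variable, and the limit of 10 are all made up for illustration. It assumes the loop happens through Local channels created by Dial(), which is where an inherited variable helps, because the count follows the call into each new channel.

```
[user-forward]
; Hypothetical context: each pass through here is one hop in a
; user-configurable forwarding chain.
exten => _X.,1,NoOp(Forward hop for ${EXTEN})
 ; INC() expects the variable to already hold a number, so seed it on the first hop.
 same => n,ExecIf($["${LOOP}" = ""]?Set(__LOOP=0))
 ; Increment, then set again with the __ prefix so channels created by Dial() keep inheriting it.
 same => n,Set(__LOOP=${INC(LOOP)})
 ; 10 is an arbitrary example limit.
 same => n,GotoIf($[${LOOP} > 10]?loop-detected,s,1)
 ; FORWARD_TARGET is a placeholder for wherever the user's forward setting is stored.
 same => n,Dial(Local/${FORWARD_TARGET}@user-forward)
 same => n,Hangup()

[loop-detected]
exten => s,1,Log(WARNING,Possible dialplan loop detected at count ${LOOP})
 same => n,Hangup()
```

The double underscore matters here: a single underscore (_LOOP) is only inherited by the first generation of child channels, while __LOOP keeps being passed down as the chain grows.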
Also, if you’re new to Asterisk, consider using Wait() to give the server time to relax and catch up. Just having a strategically placed Wait(0.5) would have prevented our crashes.
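As an illustration only (again untested, with a made-up context name, a placeholder TARGET variable, and an arbitrary retry limit of 5), a retry loop with a short pause on every pass might look something like this:

```
[retry-loop]
; Hypothetical retry loop. Without the Wait(), a fast failure in Dial()
; would let this spin as quickly as the CPU allows.
exten => s,1,Set(TRIES=0)
 same => n(again),Dial(${TARGET},20)
 same => n,GotoIf($["${DIALSTATUS}" = "ANSWER"]?done)
 ; Give up after 5 attempts (arbitrary example limit).
 same => n,GotoIf($[${INC(TRIES)} >= 5]?done)
 ; Brief pause so the logger and the rest of the system can catch up.
 same => n,Wait(0.5)
 same => n,Goto(again)
 same => n(done),Hangup()
```

The Wait() does not fix a loop by itself, which is why the sketch pairs it with a counter. It just keeps a loop from consuming everything the server has while it runs.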
Hope this helps somebody, cheers