I am using Asterisk 20.2.0 on Debian 12. Every so often Asterisk stops responding to INVITEs and no longer seems to work. If I run `lsof | egrep 'asterisk.*STREAM \(CONNECTED' | wc -l` I get back 44872, and `lsof | egrep ' asterisk' | wc -l` gives me 79466. When I look at `cat /proc/$(pidof asterisk)/limits` I get back

| Limit | Soft Limit | Hard Limit | Units |
| --- | --- | --- | --- |
| Max cpu time | unlimited | unlimited | seconds |
| Max file size | unlimited | unlimited | bytes |
| Max data size | unlimited | unlimited | bytes |
| Max stack size | unlimited | unlimited | bytes |
| Max core file size | unlimited | unlimited | bytes |
| Max resident set | unlimited | unlimited | bytes |
| Max processes | 65535 | 65535 | processes |
| Max open files | 1048576 | 1048576 | files |
| Max locked memory | 8388608 | 8388608 | bytes |
| Max address space | unlimited | unlimited | bytes |
| Max file locks | unlimited | unlimited | locks |
| Max pending signals | 31566 | 31566 | signals |
| Max msgqueue size | 819200 | 819200 | bytes |
| Max nice priority | 0 | 0 | |
| Max realtime priority | 0 | 0 | |
| Max realtime timeout | unlimited | unlimited | us |

It seems as if “something” is not releasing the fds for the RTP streams and we are then hitting one of the limits. What would cause this, and how would I go about troubleshooting such an issue?
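One way to check whether the process is really near its descriptor limit is to count the entries in `/proc/<pid>/fd` directly, since a system-wide `lsof` can list the same descriptor more than once and so overstate the total. The following is a minimal sketch, not an Asterisk tool: the function names are made up for illustration, and it assumes a Linux `/proc` layout and enough privileges to read the Asterisk process's `fd` directory.

```python
import os
from collections import Counter


def classify_fd_target(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor type."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def count_fd_types(targets) -> Counter:
    """Count descriptors per coarse type from an iterable of link targets."""
    return Counter(classify_fd_target(t) for t in targets)


def read_fd_targets(pid: int) -> list:
    """Read the symlink target of every open descriptor of a process."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets


def soft_open_file_limit(limits_text: str):
    """Return the soft 'Max open files' value from /proc/<pid>/limits text.

    Returns None if the line is missing or the limit is 'unlimited'.
    """
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            soft = line[len(prefix):].split()[0]
            return None if soft == "unlimited" else int(soft)
    return None
```

Calling `count_fd_types(read_fd_targets(pid))` with the Asterisk PID and comparing the total against `soft_open_file_limit(...)` for the same process shows how close it actually is to the "Max open files" limit.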
The reason is generally that the underlying channel is still around. Not releasing RTP file descriptors would also not cause it to stop responding to INVITEs; they would still be responded to, just rejected. It sounds more like a deadlock, which would need a backtrace of the running process.
The specific link I posted is for a deadlock. The underlying ast_coredumper script that collects the information works on crashes or deadlocks, depending on the given arguments. The `--running` argument causes it to locate a running Asterisk instance and get a backtrace.
@jcolp I just had it happen again. I have never filed an issue since the move to GitHub. Does it go here: Issues · asterisk/asterisk · GitHub? How do I attach the dump data while keeping it secure?
That is the place to file issues, yes. You cannot attach things securely. It is advised to scrub the backtrace of any information you consider sensitive before attaching it. If you REALLY can’t do so, then you COULD send it to asteriskteam@sangoma.com; however, this limits any investigation or resolution to Sangoma. There is no timeframe, or even a guarantee that such a thing would be resolved.
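As a rough illustration of scrubbing a backtrace before attaching it, here is a minimal sketch that redacts SIP URIs and IPv4 addresses from text. What counts as sensitive is an assumption here; the patterns and the `scrub` helper are hypothetical, they will miss other identifiers (hostnames, account names, credentials), and the result should still be reviewed by hand.

```python
import re

# Assumption: SIP/SIPS URIs and IPv4 addresses are what needs hiding.
SIP_URI_RE = re.compile(r"\b(sips?):[^@\s>;]+@[^\s>;]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub(text: str) -> str:
    """Replace SIP URIs and IPv4 addresses with fixed placeholders."""
    # URIs go first so the host part of a URI is removed with the user part.
    text = SIP_URI_RE.sub(r"\1:[redacted]", text)
    return IPV4_RE.sub("[ip]", text)
```

Reading the backtrace file, passing its contents through `scrub`, and writing the result to a new file keeps the original intact in case the redaction removed something the developers need.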
I am responding here after working this on GH ([bug]: Asterisk stops responding to SIP INVITES · Issue #373 · asterisk/asterisk · GitHub). I currently have two machines running Debian 12. To keep track of things: the logs on GH were from the box a14. I have another box, a15, that also started having this problem this morning. In the logs, the last thing that I saw before things went sideways was
[2023-10-26 00:13:28] ERROR[3647858][C-000009ae] res_config_mysql.c: MySQL RealTime: Ping failed (2006). Trying an explicit reconnect.
[2023-10-26 00:13:28] VERBOSE[3647858][C-000009ae] res_musiconhold.c: Started music on hold, class '60720elmv0', on channel 'PJSIP/endpoint-external-000009ad'
After that, Asterisk stopped responding to SIP INVITEs. A debug of pjsip showed the same taskprocessor overload alert error. When looking at a14 I do not see any MySQL errors. I have restarted a15 as I needed to process calls (which is now OK). Running `core show taskprocessors` I get back
From what I gather, stasis/p:endpoint:PJSIP/endpoint-external-0000001a has 1238 tasks but a max of 500 in the queue? There are currently 0 calls on the box, so why would PJSIP be taking up any more resources? Why are they not being freed up? a14 is still up in its “sad” state if it would help to look at it directly.
The taskprocessor list is showing that stasis/p:endpoint:PJSIP/endpoint-external-0000001a has 1238 tasks in the queue but has a max queue depth of 9. This contradiction indicates that the thread handling that taskprocessor queue is deadlocked for some reason, because it has not been able to update the max queue depth statistic. By default, when a taskprocessor queue reaches the high water level, Asterisk stops processing any further new PJSIP calls until the queue backlog goes below the low water level.
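The check described above can be sketched as a small parser over saved `core show taskprocessors` output. This is only an illustration: it assumes each data row ends with five numeric columns in the order Processed, In Queue, Max Depth, Low water, High water, and the names `TaskprocessorRow`, `parse_rows` and `suspicious` are made up for this example.

```python
from typing import List, NamedTuple


class TaskprocessorRow(NamedTuple):
    name: str
    processed: int
    in_queue: int
    max_depth: int
    low_water: int
    high_water: int


def parse_rows(output: str) -> List[TaskprocessorRow]:
    """Parse data rows, skipping headers, blank lines and summary lines."""
    rows = []
    for line in output.splitlines():
        # Split from the right so the processor name stays in one piece.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue
        name, *numbers = parts
        if not all(n.isdigit() for n in numbers):
            continue  # header line or other non-data text
        rows.append(TaskprocessorRow(name.strip(), *map(int, numbers)))
    return rows


def suspicious(rows: List[TaskprocessorRow]) -> List[TaskprocessorRow]:
    """Return rows whose backlog exceeds the recorded max depth or has
    reached the high water level."""
    return [
        r for r in rows
        if r.in_queue > r.max_depth or r.in_queue >= r.high_water
    ]
```

A row that comes back from `suspicious` with an in-queue count far above its max depth, like the endpoint taskprocessor in this thread, is the one whose thread is worth looking for in a backtrace.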
Are there any docs that explain what the different numbers in the queue are and what they mean? I have since restarted Asterisk. When this happens again, in a case where the tasks are higher than the max, should I do a backtrace or should I be looking elsewhere? Another interesting thing I noticed: when I ran `asterisk -rx 'module show'` it showed a Use Count of 65 even though there were no AGIs running. I assume something is “stuck”, which may be a bug?