Outage after taskprocessor_push: The 'subm:rtp_topic-000000aa' task processor queue reached 500 scheduled tasks

Hi!

I am facing this error on my production setup for a 95-endpoint “callcenter”:
“taskprocessor.c:888 taskprocessor_push: The ‘subm:rtp_topic-000000aa’ task processor queue reached 500 scheduled tasks”

After this error, no registrations are possible, which effectively drops all endpoints. New incoming calls from the SBC (IP auth) still reach Asterisk but then fail because the called endpoint is not registered.

Environment:

  • Debian Stretch
  • Asterisk 13.18.2
  • mostly extensions.ael
  • PJSIP only
  • HDD is not full
  • 8G RAM, 2G Swap
  • 4 vCores on “Intel Xeon E312xx (Sandy Bridge, IBRS update)”

Is this an overload problem or a known bug?

This setup had been fine since Jan 2018, but I noticed the same problem yesterday and today at approximately the same time. The only change during this period was a BIOS update on the hypervisor to mitigate the Intel CPU flaws; the VMs have since been running on a patched version of QEMU-KVM.

Thank you very much.

There is a blog post[1] explaining what a task processor “queue reached” message means.

[1] https://blogs.asterisk.org/2016/07/13/asterisk-task-processor-queue-size-warnings/

I found and read this blog post, but I have to admit I am not sure how to proceed.

This problem seems to be similar to:

Core dumps are enabled, but this is not a crash, so I am unable to show a backtrace (DONT_OPTIMIZE and BETTER_BACKTRACES are set).

I will try to create a manual core dump when this happens again.
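My plan is to use gcore from the gdb package, which (as far as I understand) can snapshot the running process without stopping it. A minimal sketch, assuming the main process is simply named “asterisk” and gdb is installed:

```python
#!/usr/bin/env python3
"""Sketch: grab a core of the running Asterisk without stopping it.

Assumes the gdb package (which provides gcore) is installed and that
the main process is named "asterisk"; adjust names/paths as needed.
"""
import subprocess
import time

# Find the PID of the running Asterisk process.
pid = subprocess.check_output(["pidof", "-s", "asterisk"]).decode().strip()

# gcore writes <prefix>.<pid> without killing the process, so the
# call centre keeps running while the dump is collected.
prefix = "/tmp/asterisk-core.{}".format(int(time.time()))
subprocess.check_call(["gcore", "-o", prefix, pid])
print("core written to {}.{}".format(prefix, pid))
```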

Do you have a hint for me in the meantime?

I’d first suggest upgrading to the latest version, as we do fix and tweak things. Secondly, you have to determine what is causing the system to be slow at processing and why.
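In the meantime, watching “core show taskprocessors” on the CLI will show you which queues are backing up before the 500-task warning is hit. A rough sketch of a watcher, assuming “asterisk -rx” is usable by the monitoring user and that the column layout matches recent 13.x (adjust the parsing if not):

```python
#!/usr/bin/env python3
"""Sketch: log task processor queue depths so the backlog is visible
before the 500-task warning fires.

Assumes "asterisk -rx" is runnable by this user; the output columns of
"core show taskprocessors" can differ between versions, so the parsing
below is only a guess and may need adjusting.
"""
import subprocess
import time

THRESHOLD = 100   # log anything noticeably backed up
INTERVAL = 10     # seconds between samples

while True:
    out = subprocess.check_output(
        ["asterisk", "-rx", "core show taskprocessors"]).decode()
    for line in out.splitlines():
        parts = line.split()
        # Guessed layout: name, processed, in-queue, max-depth, ...
        if len(parts) >= 3 and parts[2].isdigit() and int(parts[2]) >= THRESHOLD:
            print("{} {}".format(time.strftime("%F %T"), line.strip()))
    time.sleep(INTERVAL)
```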

OK, I am already preparing 13.21-cert2, but it is not production-ready with my adjustments yet.
I will monitor the VM’s behaviour and add more RAM and CPUs to it, as the hardware is dedicated to this VM (it is only virtualised to make migration between hosts easy).

This is only a workaround, but it might lower the frequency of the problem in the meantime.

Thanks for your feedback!

After some further debugging with journalctl, I noticed that both MySQL servers were backed up (LVM snapshot) 10 minutes before the outage. One holds the endpoints and main tables (replicated), and the other, with local-only storage, holds the CDRs. This might have introduced lag during prime time.

Is my assumption correct that ODBC/realtime lookups can also produce blocked tasks when the database is slow?

Yes, that can cause things to get blocked.
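To confirm it, you could time a trivial realtime query around the backup window; if the LVM snapshot stalls MySQL, the latency spike will be obvious. A rough sketch with pyodbc, where the DSN name “asterisk” and the ps_endpoints table are only assumptions (use whatever your res_odbc.conf and realtime configuration actually point at):

```python
#!/usr/bin/env python3
"""Sketch: time a trivial realtime query so a stalled MySQL server
(e.g. during the LVM snapshot) shows up as a latency spike.

The DSN name "asterisk" and the table "ps_endpoints" are assumptions;
substitute your own DSN and realtime table names.
"""
import time
import pyodbc

conn = pyodbc.connect("DSN=asterisk", timeout=5)

while True:
    start = time.monotonic()
    conn.cursor().execute("SELECT id FROM ps_endpoints LIMIT 1").fetchall()
    elapsed = time.monotonic() - start
    print("{} query took {:.3f}s".format(time.strftime("%F %T"), elapsed))
    if elapsed > 1.0:
        print("  -> realtime lookups this slow will back up task processors")
    time.sleep(10)
```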