I have an issue where the system gets into a state in which Asterisk becomes unavailable. One log entry that I see is “ast_queue_frame: Exceptionally long voice queue length queuing to Local”.
I’m in the process of replacing chan_sip with PJSIP. I’ve encountered this on RHEL 7 with multiple versions of Asterisk, from the latest 13 up to 16.7.
Is there anything I can do to avoid getting into a state where the system goes down?
Not really. This indicates either system overload or a deadlock whereby a channel is blocked. Without determining what specifically in your environment triggers it, or isolating and resolving the issue, there’s nothing specific that can be said.
It would be somewhere in Asterisk, based on something you are doing. A backtrace at the time[1] would show what is going on, and would need to be looked at by someone familiar with the codebase.
Hello,
We have had exactly the same problem on version 16.7.0 for a few days.
The message “Exceptionally long voice queue length queuing to Local” occurs each time just after the bridge message.
The problem occurs at 15 CPS as well as at 100 CPS.
We have temporarily fixed the problem by configuring CDR in batch mode, but we don’t understand the link between the two.
High CPU load and/or low IOPS on the server, caused by a web server and local database, could be your problem.
Try increasing your server capacity, or move the database and web server to another machine. That will reduce resource consumption on the Asterisk server.
I’m already running my database on different nodes. I’m also using realtime.
When you changed CDR to batch mode, did that make any difference? This is a race condition of some sort, triggered when something is exercised that leads down the dark path.
Yes, there are no errors when we are in batch CDR mode, even at 200 CPS.
Again, we don’t understand the link, because like you our CDRs are stored on another server, and we don’t have a web server on this machine, only Asterisk:
24-core CPU at 5% usage
idle time = 90%
2 PDUs of 1100 W
So it is not a resource problem on the server, because it is doing nothing.
Curious as to what got you to change to batch mode? I see the issue when I get to about 30 CPS, and I see no load. My hosts have 2 vCPUs at about 12% usage. The cluster is VMware in AWS, and the underlying hardware would be i3 bare-metal equivalents with vSAN and all-SSD storage.
When realtime or a database is involved with Asterisk, it can block critical paths, resulting in problems. For example, in chan_sip UDP traffic is single threaded, and by default ODBC is single threaded if that is used, so the result is that chan_sip is blocked when it has to query the database. If that slows down even for a moment, it can cause issues. This also occurs elsewhere, such as CDR, which can block all CDR handling while records are being stored if batch mode is not used.
I don’t recommend tightly coupling a database to Asterisk unless you know the precise characteristics of the database and can do performance analysis in combination with Asterisk when used.
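For readers hitting the same thing: batch mode is enabled in cdr.conf. A minimal sketch; the size and time thresholds below are illustrative values, not recommendations, and should be tuned against your own call volume:

```ini
; /etc/asterisk/cdr.conf
[general]
enable = yes
; Queue CDRs in memory and post them to the backend in batches,
; instead of blocking call handling on every individual record.
batch = yes
; Post a batch once 100 records are queued...
size = 100
; ...or every 300 seconds, whichever comes first.
time = 300
; Flush any pending records on shutdown so queued CDRs are not lost.
safe_shutdown = yes
```

Restart or reload Asterisk after changing this for it to take effect.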
With chan_sip we discovered that, under overload, the BYE message may not be sent correctly or on time while the CDRs are being inserted into the database, which can cause billing discrepancies.
In our case we don’t use realtime or a database: it’s a basic setup that generates a PJSIP call via AMI and connects it to a Local channel when the call is answered.
We use an external database only for the CDRs.
But since we moved the CDRs to batch mode, we no longer see the “Exceptionally long voice queue…” message.
Each time, the message appears when Asterisk starts to bridge both legs, but we don’t understand the link with batch-mode CDR at this point in call processing…
Having multiple connections can be beneficial, but it is still dependent on the performance of the underlying database itself. If your queries are slow, they can still block the system. If you have multiple connections and those end up slow as well, you can end up blocking it even further, or in different ways.
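For those using ODBC, the connection count per class is set in res_odbc.conf. A minimal sketch, assuming a reasonably recent Asterisk that supports the max_connections option; the class name, DSN, and credentials are placeholders for your own setup:

```ini
; /etc/asterisk/res_odbc.conf
[asterisk]                    ; class name referenced from func_odbc/realtime (placeholder)
enabled => yes
dsn => asterisk-connector     ; DSN defined in /etc/odbc.ini (placeholder)
username => asterisk          ; placeholder credentials
password => secret
pre-connect => yes            ; open a connection at module load
max_connections => 5          ; allow concurrent queries; the default of 1
                              ; serializes every query through this class
```

With the default single connection, every query through that class serializes behind the previous one, which is exactly the kind of blocking described above.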