Asterisk stops processing calls under heavy load

Hi,

I am having an issue with Asterisk 11.18.0 where, after a certain number of queue calls (I would say about 40), the system stops responding to inbound calls. It never actually crashes, but I can see many locks in the system (core show locks), some of them related to MySQL. I have an external system that queries the database and it never stops functioning, so I don’t believe it is a MySQL issue.

The only errors I can see in the log are:

htop and top show only 20% cpu usage during the occurrence of the issue and RAM does not seem to be an issue either.

Running CentOS 6.8 32-bit, 8 GB RAM (PAE), quad-core Xeon X5550 with HT at 2.67 GHz.

Any suggestions? What else can I use to troubleshoot?

wiki.asterisk.org/wiki/display/ … rADeadlock gives information on how to debug deadlocks.

core show locks:

pastebin.com/iUC2KUq3

gdb threads:

pastebin.com/jYCWHjZX

Not sure how to read these exactly, any assistance would be appreciated.

You are looking for cycles of two or more locks, where threads are holding one and waiting on the other, or, as in this case, for signs that a lock is being held indefinitely for some other reason: namely the odbc lock at 0xab3e488.
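That cycle hunt can be mechanized: model each thread from the core show locks output as a set of held locks plus the one it is waiting on, build a wait-for graph, and search it for cycles. A minimal Python sketch (the thread/lock structures here are hypothetical, hand-extracted rather than parsed from real output):

```python
def find_deadlock_cycle(threads):
    """threads: list of dicts with 'held' (set of lock ids) and
    'waiting' (a lock id, or None). Returns a cycle of thread
    indices if one exists, else None."""
    # Edge i -> j when thread i waits on a lock thread j holds.
    graph = {}
    for i, t1 in enumerate(threads):
        for j, t2 in enumerate(threads):
            if i != j and t1["waiting"] is not None and t1["waiting"] in t2["held"]:
                graph.setdefault(i, set()).add(j)

    def dfs(node, path, seen):
        if node in path:                      # back to a thread on the path: cycle
            return path[path.index(node):]
        if node in seen:                      # already fully explored, cycle-free
            return None
        seen.add(node)
        for nxt in graph.get(node, ()):
            cycle = dfs(nxt, path + [node], seen)
            if cycle:
                return cycle
        return None

    seen = set()
    for start in list(graph):
        cycle = dfs(start, [], seen)
        if cycle:
            return cycle
    return None

# Two threads each holding what the other wants: a textbook deadlock.
demo = [{"held": {"0xab3e488"}, "waiting": "0xdeadbee"},
        {"held": {"0xdeadbee"}, "waiting": "0xab3e488"}]
print(find_deadlock_cycle(demo))  # → [0, 1]
```

A stalled lock like the odbc one shows up differently: no cycle, just one lock held for a long time with many threads queued behind it.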

It looks to me as though your ODBC connection for call logging has stalled and everything is waiting for it.

That’s likely to be a third party problem and is one of the reasons why I would want to keep call logging as simple as possible.

I agree, it looks as though something is up with the underlying database connection.

This is personally why I don’t like tightly coupling Asterisk to databases. It’s handy at times, but if your database acts up then Asterisk reflects that.

I have now moved the MySQL database and put it on its own dedicated server. Let’s see how it functions today.

Thanks

Hi, we had the same issue occur almost immediately. Have not had much time to come back here because we had chaos with the system going down during busy hours (100 concurrent calls). We had to move them to their old Asterisk server which does not suffer from this issue.

Things we have tried:

Installed Asterisk 11 on brand new hardware (HP Server, Dual Xeon [32 cores total], 96GB RAM, RAID 10 10k Hard Drives). I know, overkill. This was all we had available at the time. Same issue occurs with this new server. I can successfully reproduce the issue with sipp on a simple dialplan:

[from-sip-test]
exten => 696969,1,Answer
exten => 696969,n,MusicOnHold(default)
exten => 696969,n,Hangup

After about 30 calls (at 10 calls per second), sipp starts receiving transmit errors (no response from host).

command:

Shows thousands of packets queued.

Asterisk only shows about 10 calls.
See this in the error log:

res_musiconhold.c: Failed to write frame to 'SIP/sip-test-00000024': Resource temporarily unavailable
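That “Resource temporarily unavailable” text is strerror(EAGAIN): a non-blocking write was attempted but the kernel buffer on the far side was full, which matches the thousands of queued packets. The same errno can be provoked outside Asterisk entirely; a small Python sketch against an ordinary pipe:

```python
import errno
import os

def provoke_eagain():
    """Fill a non-blocking pipe until the kernel buffer is full,
    then return the errno of the write that could not proceed."""
    r, w = os.pipe()
    os.set_blocking(w, False)
    try:
        while True:
            os.write(w, b"x" * 4096)  # eventually the buffer fills up
    except OSError as e:
        return e.errno
    finally:
        os.close(r)
        os.close(w)

print(os.strerror(provoke_eagain()))  # "Resource temporarily unavailable" on Linux
```

So the log line itself is a symptom of output backing up, not the root cause; the interesting question is why the consumer side stopped draining.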

Htop shows over 80 GB of RAM free; CPU spikes to almost 95% during this time (all 32 cores).

Backtrace:
pastebin.com/qY6X5mqm

Core show locks:
pastebin.com/cUdFm5mB

I tried different Asterisk versions, 11.12 and 11.16, with the same results; the tests above were run on 11.18. The old server, which does not suffer from this issue (as far as I can tell), is running 11.12 x64 on Debian. This test was done on Ubuntu 14, 64-bit.

Not sure what else to try.

Oh, and I removed CDR logging via ODBC/MySQL. It now logs directly to the filesystem with CSV files.

At the moment, the forum server’s encryption is unsafe as far as Debian Iceweasel is concerned, which means that it is not easy for me to access it from home, where I have some time. Consequently you are going to have to do more of the leg work and actually work out the cycle of locks, or the locks that seem to be being held for too long, and the threads that seem to be running when you would expect them to be waiting.

@david55, I’ve opened a ticket for this already with Digium’s IT folks.

There are a very large number of threads waiting for /var/lib/asterisk/bin/record_runafter.pl

That has either stalled or you have overwhelmed the system with them. As it isn’t part of Asterisk, you will need to debug that.

If it, in turn, uses AMI, or asterisk -r, I suppose you could have a deadlock that is not explicitly visible.

There also seem to be a lot of locks associated with TLS sessions. Are you using the HTTP AMI interface, by any chance?

The issue occurs regardless of running that script, and no, I do not use the HTTP AMI interface. The simple dialplan I showed in my previous post, which does not run that script, still has calls failing after 30 concurrent calls. I opened a ticket with Digium and they mentioned this seems like a bug. After downgrading to Asterisk 11.7, I can now sustain about 150 concurrent calls at a rate of 15 calls per second. Still not as many as the old server, but much better.

The number of scripts running still needs to be explained, as there is clearly some resource those scripts need but are not getting. I suppose it could be CPU, if Asterisk is running at elevated priority. Running lots of Perl scripts is likely to be expensive because of the start-up time needed for each one.
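That per-invocation start-up cost is easy to put a number on: fork/exec plus interpreter initialization is paid every time, before the script does any real work. A rough measurement sketch (using the Python interpreter itself as a stand-in for the Perl script, since the script path in the thread is not available to run here):

```python
import subprocess
import sys
import time

def per_spawn_overhead(runs=10):
    """Average wall-clock cost of spawning an interpreter that does
    nothing: a floor on what each run-after style invocation costs."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs

print(f"per-spawn overhead: {per_spawn_overhead() * 1000:.1f} ms")
```

Even a few tens of milliseconds each becomes significant at 10 calls per second, and that is before the script opens any database or network connection of its own.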

The other big factor was in AMI. I was guessing HTTP because those threads seemed to be holding locks whilst reading network input, which I wouldn’t normally have expected, but I’d need to go down into the code to find the exact context. That lock wasn’t part of a classic deadlock, so the missing resource has to be something else, e.g. network input itself.

In my view AMI should not be holding locks at such points. It already has a problem in that a stall in network output can cause it to hold critical locks and freeze the system; however, I can’t see that happening here. We modified it to buffer all the output in a file, to decouple network output from internal locks, although that increases processor and memory load.

I can see no lock cycles. Although you have a large number of locks, that looks more like the result of the system being close to overload than any classical deadlock, as the configuration changes between the show locks and the backtrace.

All the Asterisk threads appear to be in waits, although I can’t be sure that some of these aren’t spin locks, without ps -l output with the show individual threads option.

What I do see is about 56 threads apparently stalled in the runafter script, so, even if that is not the only factor, you need to either explain why it is stalling or remove it from the equation, and then re-run the analysis to find the next level of problem.

Is this script transferring recordings over the network? It may well be that you have network problems, which might also explain why you had problems with the database.

I didn’t fully understand the bit about Digium and bug reports. Digium only handles bugs in commercially supported derivatives of Asterisk, or relating to the hardware they sell. Were you referring to the community issue tracker at issues.asterisk.org? If so, you should cross-link the report and this thread.

Regarding overloads: a system like Asterisk, which uses locks heavily, can see its CPU usage rise rapidly as it approaches the point of overload, because contended locks are much more expensive than uncontended ones. I’m not sure where the tipping point is, but, for example, you may find CPU usage goes up fairly smoothly to 80%, yet only a little more load is needed to exceed the processing power of the system.
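The shape of that tipping point is classic queueing behaviour: mean response time grows roughly like 1/(1 - rho) as utilisation rho approaches 1, so going from 80% to 95% load hurts far more than going from 40% to 55%. A quick illustrative sketch using the M/M/1 formula (the 10 ms service time is made up):

```python
def mm1_response_time(service_time, utilisation):
    """Mean response time of an M/M/1 queue: W = S / (1 - rho)."""
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in [0, 1)")
    return service_time / (1.0 - utilisation)

# A hypothetical 10 ms locked operation at increasing load levels:
# latency stays modest until utilisation nears 1, then explodes.
for rho in (0.40, 0.80, 0.95, 0.99):
    w = mm1_response_time(0.010, rho)
    print(f"utilisation {rho:.0%}: mean response {w * 1000:.0f} ms")
```

Asterisk is not a single queue, of course; this just illustrates why CPU graphs can look comfortable right up until the system falls over.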

What I was referring to when I mentioned Digium support is that I currently have a support agreement for open source Asterisk. I submitted a ticket and their final response was “it may be a bug”. I had posted here simultaneously to get other opinions, and also to post the solution once I found it.

They were quick to blame whatever appeared to be the last process in the “waiting” state. First it was the MySQL connection; I then removed MySQL, with the same results. Then it was the record_runafter.pl script; I removed that as well, with the same results. Finally, they agreed that there was some issue in the core and suggested downgrading Asterisk. I tried the latest certified 11 version.

Simply by downgrading Asterisk and changing NOTHING at all, the issue almost completely disappeared and I can now handle 150 concurrent calls. Prior to the downgrade, I could only handle about 30.

I may not have provided the debug information for the other test scenario, but I can assure you, I removed all scripts and ran the simple dialplan I showed in my previous post with the exact same issue. I will change the Asterisk version once again, after hours, to reproduce the debug information with that simple dialplan (basic music on hold). Perhaps it can assist in troubleshooting the source of the bug.

Thanks for taking the time and looking into this. I understand it is complex and hopefully this can benefit others in the process.

Thanks Illizit for being so clear in your analysis.

I just wanted to remark that we have the same problem… once I get to 30-35 concurrent calls using a predictive dialer application, Asterisk 11.17 stops responding. CPU, IO, throughput and memory all look perfectly stable (it’s an IBM Xeon with 8 cores and 4 GB of RAM)… we have downgraded to 11.10 to see what happens… then I’ll let you know. I hope we can find something to clarify what the problem is, in order to look for a solution.

[quote=“leri”]Thanks Illizit for being so clear in your analysis.

I just wanted to remark that we have the same problem… once I get to 30-35 concurrent calls using a predictive dialer application, Asterisk 11.17 stops responding. CPU, IO, throughput and memory all look perfectly stable (it’s an IBM Xeon with 8 cores and 4 GB of RAM)… we have downgraded to 11.10 to see what happens… then I’ll let you know. I hope we can find something to clarify what the problem is, in order to look for a solution.[/quote]

Did you see any errors in the logs? I would try using the certified version of asterisk unless you have a specific need for a different version.

Illizit, have you found any solutions since, besides downgrading?

I’m having much the same problem. I can’t queue more than, say, 10 calls before things start to go sour. The excessive locking in app_queue.c drives system load on a quad-CPU system beyond 4. The actual CPU usage isn’t pegged; the load is driven up because all the futex locking clogs up the threads. If I queue more, say 20 or more, eventually all of the RTP streams start stuttering and chopping audio, even though there may be idle CPU available. I guess futex waits don’t show up in top as real CPU usage.

I can’t really downgrade to the certified 11 version because I’m already heavily invested in Asterisk 13 and its new way of doing things, and I need the latest version’s fixes because the older Asterisk 13 app_queue.c sometimes segfaults.

I have the exact same issue; I contacted Digium and was basically given the run-around. Did any of you find the solution to this problem? Please let me know, thank you. I tried this with Asterisk 11.20.

Unbelievable. I would have hoped by now this bug would have been caught and prevented from moving into later versions. I am only able to produce these errors under high load; I guess no one has put a significant amount of load on these versions of Asterisk? I find that hard to believe.

The fix for me was to downgrade to the certified version of Asterisk 11. I have not had this issue since, and I have been too afraid to use anything higher than the certified version for now.