Why is Asterisk using all CPU time?

We’re a SIP provider for some 700 clients, mostly using Linksys ATAs. The OS is 2.6.17-gentoo-r4 #1 SMP and we’re running Asterisk 1.4.19.

For the past several days at least, our Asterisk server has been running at 100% CPU usage, and as far as I can tell, there’s not really anything unusual happening in Asterisk, except that perhaps an unusually high number of SIP channels are open - over 100 where “normal” operation has about 60 or so.

What would I need to do to properly diagnose this problem and find the culprit?

post the output of “ps -ef |grep asterisk” and “top” as well

Ian

Unfortunately Asterisk requires several days of uptime before it starts doing this, and as such I can’t post this stuff until it starts doing this again. I restarted it yesterday and that “fixes” the problem, but it’s only a short term solution.

In the meantime though, I can definitely say that at the time Top showed that Asterisk was the only process consuming any real resources. Neither Mysql nor any other process was showing any significant CPU time, either in Top or ‘ps ax’.

Currently this is what ‘ps axo time,comm --sort stime’ says:

00:00:00 kjournald
00:00:01 udevd
00:00:01 bbrun
00:00:04 pdflush
00:00:04 kswapd0
00:00:04 kjournald
00:00:15 syslog-ng
00:00:09 khelper
00:00:27 init
00:02:10 sshd
00:02:28 kjournald
00:28:50 mysqld
02:17:25 asterisk

(all 0 stime processes have been removed except kjournald for illustrative purposes)

Hi

I am also getting the same problem. Asterisk shows over 100% cpu usage, dialparties.agi also shows up frequently some times with about 70% or more cpu usage so does op_server.pl. What confuses me though is that at times not many calls will be coming in or at times none but cpu % will still be high. Were you able to find a solution to your problem? I have 15 zap channels, 3 sip channels, 35 extensions, 2GB ram and 3.6 GHz processor. Please help me out.

Thanx!

Okay, and now that the weekend is done, here’s output from top:

[code]
top - 09:01:53 up 27 days, 20:37, 1 user, load average: 1.47, 1.40, 1.26
Tasks: 73 total, 3 running, 70 sleeping, 0 stopped, 0 zombie
Cpu(s): 61.7% us, 36.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 1.7% hi, 0.0% si
Mem: 1555076k total, 1453696k used, 101380k free, 347056k buffers
Swap: 803232k total, 600k used, 802632k free, 684504k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2201 root 22 0 84712 45m 11m S 99.3 3.0 484:49.92 asterisk
12767 root 17 0 5012 1648 1340 S 0.3 0.1 0:00.01 sshd
1 root 15 0 1516 520 452 S 0.0 0.0 0:33.13 init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0[/code]

also:

# ps -ef |grep asterisk
root      2201     1  8 Apr17 ?        08:20:50 asterisk
root     15437 12085  0 09:18 pts/1    00:00:00 grep asterisk

Hmm

What does show channels show, The only time I have seen anything like this was a malformed dialplan looping

Ian

Here’s what ‘show channels’ shows:

voip2*CLI> show channels
Channel              Location             State   Application(Data)             
Zap/20-1             778XXXXXXX@local:2   Up      VoiceMail(u778XXXXXXX)        
Zap/2-1              (None)               Up      Bridged Call(SIP/0014BFxxxxxx-
SIP/0014BFxxxxxx-1-b 604XXXXXXX@default:1 Up      Dial(Zap/g1/604XXXXXXX)       
SIP/xxx.xxx.xx.xx-08 (None)               Up      Bridged Call(SIP/0014BFxxxxxx-
SIP/0014BFxxxxxx-1-b 1250XXXXXXX@default: Up      Dial(SIP/XXXXXXXXXXXXXXX@203.1
SIP/xxx.xxx.xx.xx-08 (None)               Up      Bridged Call(SIP/0014BFxxxxxx-
SIP/0014BF4EC157-1-b 1250XXXXXXX@default: Up      Dial(SIP/XXXXXXXXXXXXXXX@203.1
SIP/xxx.xxx.xx.xx-08 (None)               Up      Bridged Call(SIP/0014BFxxxxxx-
SIP/0014BFxxxxxx-1-b 1250XXXXXXX@default: Up      Dial(SIP/XXXXXXXXXXXXXXX@203.1
Zap/3-1              (None)               Up      Bridged Call(Zap/19-1)        
Zap/19-1             604XXXXXXX@default:1 Up      Dial(Zap/g1/604XXXXXXX)       
SIP/xxx.xxx.xx.xx-08 (None)               Up      Bridged Call(SIP/0014BFxxxxxx-
SIP/0014BFxxxxxx-1-b 1406XXXXXXX@default: Up      Dial(SIP/XXXXXXXXXXXXXXX@203.1
SIP/xxx.xxx.xx.xx-08 (None)               Up      Bridged Call(SIP/0014BFxxxxxx-
SIP/0014BFxxxxxx-1-b 1250XXXXXXX@default: Up      Dial(SIP/XXXXXXXXXXXXXXX@203.1
SIP/xxx.xxx.xx.xx-0  (None)               Up      Bridged Call(SIP/0018F8xxxxxx-
SIP/0018F8xxxxxx-2-b 01135XXXXXXX@defa    Up      Dial(SIP/011358XXXXXXX@204.2
SIP/203.160.32.46-b5 (None)               Up      Bridged Call(SIP/001C10xxxxxx-
SIP/001C10xxxxxx-1-b 1604XXXXXXX@default: Up      Dial(SIP/XXXXXXXXXXXXXXX@203.1
Zap/21-1             604XXXXXXX@local:3   Up      VoiceMailMain()               
20 active channels
11 active calls

Coincidentally, the stuff behind the XX’s looks pretty normal to me, just incoming and outgoing calls

Hi

Well if this is all thats going on then you have a serious problem I have boxes with over 200 calls in progress and I would be worried if the load wne over 0.3 let alone 1.4 as yours does.

You need to set a lab system copy the dialplan and hammer it to see whats going wrong with it. something isnt right, What codecs are you using ?

Ian

Codecs allowed are ulaw, alaw, g729:20, g723.1 and g726

The really odd part is that it ramps up slowly over time. I would really not expect that if the dialplan was in an infinite loop.

Might this be a problem with the underlying operating system? We’re running Gentoo 2.6.17-gentoo-r4 #1 SMP on an Intel P4 2.40GHz cpu.

 # asterisk -rx 'core show uptime'
System uptime: 17 hours, 58 minutes, 6 seconds 
Last reload: 1 hour, 10 minutes, 54 seconds 
top - 12:29:14 up 29 days, 4 min,  1 user,  load average: 0.18, 0.23, 0.19
Tasks:  71 total,   2 running,  69 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.7% us,  1.3% sy,  0.0% ni, 95.0% id,  0.0% wa,  0.3% hi,  0.7% si
Mem:   1555076k total,  1459236k used,    95840k free,   300664k buffers
Swap:   803232k total,      600k used,   802632k free,   751500k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                  
13800 root      22   0 67856  34m  10m S  4.0  2.2  82:44.64 asterisk                                                 
    1 root      15   0  1516  520  452 S  0.0  0.0   0:35.37 init                                                     
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0                                              
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.06 ksoftirqd/0                                              
    4 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/0 

I’m getting the same issue on one of my asterisk boxes, also running 1.4.19.

Funny thing is, though, that I’m running two other identical Asterisk machines with slightly differing hardware, and neither of them have this issue. Dialplans, etc, are all the same.

One of the boxes is an HP ProLiant DL380 G5 with 2gb RAM and a quadcore Xeon, with a TE110P card.

Another is an HP ProLiant DL380 G3 with 2gb RAM and two single core Xeons, with a Sangoma A101 card.

The affected box, however, is an HP ProLiant ML350 with 4gb RAM and a quadcore Xeon, with a Sangoma A102 card hooked to two T1s. I rebooted it this morning and it was fine for a few hours and then began being affected again.

These machines all run the exact same dialplan, OS versions, kernel versions, and Asterisk versions. The only difference is the hardware.

It doesn’t affect anything other than the reading of callback numbers in voicemails, but this is an incredibly high-maintenance client, and I’d love for a resolution to be found. For the most part, my top looks like his.

This morning, I pulled the server down because I had to go from an A102 to an A102D card (customer complained of echo on the line), and in the process, I removed 2gb RAM bringing the total down to 2gb.

That seems to have fixed the issue.

Is there a known problem running asterisk with 4gb RAM?

I’m having the same problem on 1.4.19.1. It has been happening on a few of my boxes - one with 5GB of ram and approx 230 peers and 4 T1s, another with 3GB of ram and <30 peers. Both use FreePBX for configuration, and both seem to use 100% of a single core (dual hyperthreaded xeons) after being up for several days. It doesn’t seem to happen to another box I have that doesn’t use FreePBX and has 2GB of ram.

I never noticed this behavior on 1.4.18 - it started when we moved to 1.4.19.1.

I have the a similar problem. What I have noticed is that everytime my CPU spikes to 106% is that someone was checking there voicemail and the SIP channel is still open even though they have hung up and the user reports that nothing strange happened to them while they were checking voicemail and that they just hung up when they were done. Their phone also shows that the call was hungup. Here is an example from core show channels:

Channel Location State Application(Data)
SIP/209-b6c89b88 777@custom-from-inte Up VoiceMailMain(209@default)

First I thought it might be a corrupted mail box, but it happens to be different mailboxes every time and always with someone who was checking their voicemail.

Bingo, that’s the problem alright. It’s always someone checking their voicemail. We have our voicemail number set as *98 (with another number that works if they want to check it remotely) and it’s always that number. The application VoiceMailMain() appears to be the culprit, and it seems to be stuck in a loop.

Is there any way to kill the application without restarting Asterisk?

Hi

Is *98 in the main context ? sounds like hangup is being called correctly.

Ian

Yes, it’s in the main context.

Coincidentally, I’m upgrading to Asterisk 1.4.22 tonight, to see if the problem has already been resolved.

Hi Are you actualy seeing stuck channels ? or is it that load goes up when the voicemail is used ?

Having sites with 100+ office extensions I have never had this issue

what does the *98 dialplan looklike ?

Ian