About performance

Hi folks,

We have an Asterisk (11.20) server with 8 PRIs (Sangoma card), some agents, and call recording (we don't do anything special with the calls). We use AMI for a third-party application. When there are more than ~190 calls, AMI doesn't respond. I couldn't find any errors in the logs. Our server (HP DL380) performance is good.
So what might be the bottleneck for our Asterisk? Does it have a problem with 240 calls or more? How can I tune Asterisk for more than 190 calls?

Thanks and Regards.

When you say the server performance is good, what metrics are you looking at? Can you use sysstat to collect regular data for a few busy hours and paste it here?

I think the issue might be I/O related.
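
As a rough cross-check while the sysstat data is being collected, iowait can also be read straight from /proc/stat on Linux. The sketch below is only an illustration: the function names are made up for this example, and it assumes the usual field order of the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal).

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of counters."""
    fields = line.split()
    if len(fields) < 6 or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line with an iowait field")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    # Only the first 8 counters are summed: guest time is already
    # included in user/nice, so adding it again would double count.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


def sample(path="/proc/stat"):
    """Read the current aggregate CPU counters."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


def watch(interval=5.0, count=12):
    """Print the iowait share once per interval (hypothetical helper)."""
    previous = sample()
    for _ in range(count):
        time.sleep(interval)
        current = sample()
        print(f"iowait: {iowait_percent(previous, current):.1f}%")
        previous = current
```

Calling watch() during a busy hour gives a quick picture; sysstat remains the better tool for longer collection.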

I had a similar problem before, where I would notice:

  • AMI slowing down
  • AMI proxy crashing
  • increased post-dial delay

It turned out my iowait was very high. The things I did to make this more reliable were:

  • No more RAID 5 (MySQL doesn’t like it)

  • did call recording in RAM

  • created a new Asterisk startup process that loaded all my greetings and MOH into RAM.

This helped me lower the reads and writes on the server and reduce the iowait.

Some reference sites for you:

http://dba.stackexchange.com/questions/12977/is-raid-5-suitable-for-a-mysql-installation

Dear John,

Thanks so much for your reply.
I checked with ‘top’. I’ll use sysstat and update you.
I think you’re right; I/O may be causing this issue. Did you test SSD storage? Could it help?
About RAID 5, what did you replace it with?

Thanks and Regards.

Note
The Manager API is not exactly famous for its ability to handle multiple simultaneous connections gracefully (even though this has improved immensely in version 1.4). If you anticipate this kind of load, it is worth considering an AMI proxy such as the “Simple Asterisk Manager Proxy” (a Perl script), which can handle many connections and bundles them in a single connection. This is completely transparent to the script accessing the AMI. Of course, for the purposes of playing around, it isn’t strictly necessary.
http://the-asterisk-book.com/1.6/asterisk-manager-api.html

Thanks for the reply.
But is this true for Asterisk 11.20 too? Hasn’t AMI improved any further between 1.6 and 11?
We have one connection to AMI. Does Asterisk fork to more?

@psdk honestly I don’t know how many simultaneous connections the current version of AMI can handle without hanging. Anyway, from the issue you describe, it seems you need some kind of proxy to handle more connections without hanging your AMI server.

The proxy is not entirely transparent because it serialises the requests. This could be a particularly big hit if the OP is doing synchronous originates, or other slow running actions.

Also, I can’t see anything that actually says there is more than one manager thread running.

Also check the writetimeout option. If the device connected via this user accepts input slowly, the timeout for writes to it can be increased to keep it from being disconnected (value is in milliseconds):

writetimeout = 100

RAID 1 or 10

If I/O is your problem, you want to make as many changes as possible to reduce iowait.

Dear David,

Do you suggest using a proxy, or another solution?
When I monitor the third-party app server, it has only one connection to AMI. So does Asterisk handle all requests with one connection and one thread?

If you only have one connection, the proxy isn’t going to help.

What is the AMI action on which it is stalling?

Note that, if you were having issues with the timeout, you would be getting errors logged about that.

Yes, I have only one connection.
The third-party app uses this connection to receive events and send some commands.
I checked all the logs. Asterisk doesn’t log any errors for this.
Under high load, AMI doesn’t answer the app’s commands, or in most cases it answers with a big delay of about 40~50 secs.

Given that Asterisk uses non-blocking writes for AMI responses, and would report a timeout if the requestor was failing to read responses, I think we have to assume that Asterisk is failing to obtain a lock, rather than this being a round-trip flow control issue. That will be very sensitive to the actual command being used, but it is possible that it is trying for a conditional lock and the resource is locked so much of the time that it can never catch it when it is free.

Unfortunately you need to compile with thread debugging enabled, which has a significant performance penalty, in itself, to be able to positively confirm this. Just in case it has been built with thread debugging, you could try to see whether the CLI command “core show locks” is accepted.

Actually, you may be able to get some way towards seeing what is happening by forcing a core dump. This can be done with the gcore command. It will probably freeze the application for a fraction of a second, but it will have to be done when there is a high load, so there is going to be some disruption. You can then use gdb to identify the AMI thread and see what it is doing. It is rather easier to do this if the code was built with optimisation disabled.
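
For anyone wanting to script the gcore and gdb steps, something like the following could work. It is a sketch under assumptions: the helper names, the /tmp/asterisk-core prefix and the /usr/sbin/asterisk binary path are placeholders for illustration, and the helpers only build the command lines so nothing runs until you choose to.

```python
import shlex


def gcore_command(pid, prefix="/tmp/asterisk-core"):
    """argv for gcore; the dump is written to '<prefix>.<pid>'."""
    return ["gcore", "-o", prefix, str(pid)]


def core_path(pid, prefix="/tmp/asterisk-core"):
    """File name gcore produces for the given prefix and pid."""
    return f"{prefix}.{pid}"


def gdb_backtrace_command(binary, core_file):
    """argv for a batch-mode gdb run that prints a backtrace of every thread."""
    return ["gdb", "-batch", "-ex", "thread apply all bt", binary, core_file]


def describe(pid, binary="/usr/sbin/asterisk"):
    """Return the two shell commands as text, ready to review before running."""
    dump = shlex.join(gcore_command(pid))
    trace = shlex.join(gdb_backtrace_command(binary, core_path(pid)))
    return [dump, trace]
```

The lists can be passed to subprocess.run(..., check=True) as root, with the gdb output redirected to a file. Look for the thread whose backtrace sits in the manager code to see where it is waiting.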

The exact command that is stalling is likely to be significant in terms of trying to get a black box diagnosis.
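
One way to find out which command stalls is to time individual actions from a small probe script. The sketch below is an assumption-laden illustration, not the OP's application: the helper names and the "probe-…" ActionID values are invented, and it opens its own AMI connection (so its own manager thread), which means it measures how long Asterisk takes to answer an action rather than any queueing inside the third-party app.

```python
import socket
import time


def build_action(name, action_id, **params):
    """Serialise an AMI action: 'Key: Value' lines ended by a blank line."""
    lines = [f"Action: {name}", f"ActionID: {action_id}"]
    lines += [f"{key}: {value}" for key, value in params.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode()


def parse_message(raw):
    """Turn one AMI message (without the trailing blank line) into a dict."""
    message = {}
    for line in raw.decode(errors="replace").split("\r\n"):
        key, sep, value = line.partition(":")
        if sep:  # lines without a colon, such as the banner, are skipped
            message[key.strip()] = value.strip()
    return message


def read_until_action_id(sock, action_id, buf=b""):
    """Read messages until one carries our ActionID, skipping everything else."""
    while True:
        while b"\r\n\r\n" in buf:
            raw, buf = buf.split(b"\r\n\r\n", 1)
            message = parse_message(raw)
            if message.get("ActionID") == action_id:
                return message, buf
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("AMI closed the connection")
        buf += chunk


def time_action(host, port, username, secret, name, **params):
    """Log in, send one action, and return (seconds, response dict)."""
    with socket.create_connection((host, port), timeout=120) as sock:
        login = build_action("Login", "probe-login", Username=username,
                             Secret=secret, Events="off")
        sock.sendall(login)
        reply, buf = read_until_action_id(sock, "probe-login")
        if reply.get("Response") != "Success":
            raise PermissionError(reply.get("Message", "login failed"))
        started = time.monotonic()
        sock.sendall(build_action(name, "probe-1", **params))
        reply, _ = read_until_action_id(sock, "probe-1", buf)
        return time.monotonic() - started, reply
```

For example, time_action("127.0.0.1", 5038, "user", "secret", "Ping") gives a baseline; repeating it with the actions the application actually sends, under load, should show whether one particular action accounts for the 40~50 second delays.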

If it is related to a conditional lock, one would expect the delays to start rising quite quickly beyond a certain level of load.

By the way, do you mean 240 calls or 240 channels? If you have all the channels busy on point-to-point calls, that would be 120 calls, but if they were all on IVR or voicemail, it could be 240. This makes a difference to the number of processes running.

Also, do you use parking? There was an issue, which may or may not have been fixed, that the parking code used the select system call, which is limited to 1024 file descriptors. 240 calls would almost certainly exceed that and cause strange behaviour.
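
To see how close the running process is to that limit, the open descriptors can be listed from /proc on Linux. This is a sketch with made-up helper names. It assumes the limit is 1024 (the usual FD_SETSIZE on Linux/glibc) and needs to run as root or as the Asterisk user.

```python
import os

SELECT_FD_LIMIT = 1024  # assumed FD_SETSIZE; typical for Linux/glibc builds


def open_fds(pid, proc_root="/proc"):
    """Return the descriptor numbers currently open in a process (Linux only)."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    return sorted(int(name) for name in os.listdir(fd_dir))


def select_risk(fds, limit=SELECT_FD_LIMIT):
    """Summarise descriptor usage against the select() limit.

    select() cannot watch descriptors numbered at or above the limit,
    so the highest number matters, not only the count.
    """
    highest = max(fds, default=-1)
    return {"count": len(fds), "highest": highest, "over_limit": highest >= limit}
```

Running select_risk(open_fds(pid)) with the Asterisk pid at peak load shows whether any descriptor number has crossed the limit.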

Thanks so much for your technical answer.

In our case, we have 8 PRI links, so up to 240 calls that connect to users. At full load we have 240 calls on the system, some of them in a queue and some connected to users, and when load is high we have more than 240 channels (some on IVR, some in a queue, and some connected to SIP users).
About parking: we don’t use the parking feature directly, but we’re using “waiting(X)” in the dialplan before “answer()” until a destination is found for the call. Could that count as the parking issue?

As @david551 mentioned, all of this is guesswork unless you can provide specifics. You need to provide:

  • The AMI commands you are running when things “stall”, along with the specific parameters provided to them.

  • What the system is specifically doing when things “stall”. Merely providing the number of simultaneous calls isn’t helpful.

  • Some indication of what is occurring in the system when things slow down. A thread dump would be helpful, although it will tank your performance. Another option is to get a snapshot of the threads using gdb. Either way, you should do this analysis in a labbed up environment in order to avoid impacting your production system.

It’s important to note that each AMI session gets its own dedicated thread. That does not mean, however, that each Action sent to a session will be executed by that thread. Some actions which are expected to be long running are either dispatched out onto other threads, or have an option to force that (such as Originate). As such, how you are using AMI will dictate its performance in certain scenarios.
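
As an illustration of that option, an Originate can be sent with the Async parameter set so that the session does not wait for the call to complete. The sketch below only builds the action text. The helper name and the channel, context and extension values are placeholders, not taken from the OP's setup.

```python
def originate_action(channel, context, exten, priority=1,
                     action_id="orig-1", run_async=True, timeout_ms=30000):
    """Build the text of an AMI Originate action.

    With Async set to true, the response comes back once the request is
    queued, and the outcome is reported later in an OriginateResponse event.
    """
    fields = [
        ("Action", "Originate"),
        ("ActionID", action_id),
        ("Channel", channel),
        ("Context", context),
        ("Exten", exten),
        ("Priority", str(priority)),
        ("Timeout", str(timeout_ms)),
        ("Async", "true" if run_async else "false"),
    ]
    # Each header ends with CRLF; an empty line terminates the action.
    return "".join(f"{key}: {value}\r\n" for key, value in fields) + "\r\n"
```

If the third-party application sends synchronous originates on its single connection, every other command on that session queues behind them, so this is worth checking first.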
