ARI high latency on POST channels/ring and POST channels/answer


#1

Hi everyone,

I’m working with the ARI interface, and I noticed that the two requests:

  • POST channels/channel_id/ring
  • POST channels/channel_id/answer
    take much more time to be managed when compared with all the other requests (almost two order of magnitude).

I made my tests using the latest version of asterisk (tried also with the current repository HEAD) and two different hardware, by tracing http and sip calls using wireshark, measuring the time between my http request and the relative sip message sent from asterisk.
In every test I made, ring and answer took more than 150ms, while all the other requests took less than 10ms.

You can test yourself by using ARI tester and wireshark.

Anyone observed the same issue? Do you think it’s a configuration problem or an intrinsic property of the current asterisk implementation?

Thanks
Daniele


#2

I haven’t seen any other reports of a latency problem with these, or at least noone else has noticed. Both should be asynchronous operations and return fairly quickly. What version of Asterisk are you using?


#3

Hi @jcolp,

thanks for your quick replay. I’m developing an open source C++ library to use asterisk ARI interface (I need it for a mission critical application that will do heavy use of ARI).

I tried several asterisk versions: 14.0.2, the git commit “5b42dd7” (latest commit) and the commit “ada0032” (current HEAD in master branch). Every time, for “ring” and “answer” I got much slower response time with respect to other requests.

I attach one of the trace I took with version 14.0.2 (but all the versions behave the same): you can see that ring took 168ms, answer took 139ms while the other operations (in the trace you can see the hangup / sip bye) are much faster (~5ms).

Thank you so much for your help.


#4

Oh, you are saying that the SIP response takes that period of time to occur? Now I understand. Which SIP channel driver? There are two. As well, it’s likely just the way it is. Those are queued up, handled by one thread, and then given to the SIP channel driver to do. This can take a bit of time.


#5

@jcolp sorry if I wasn’t clear enough.

I’m using a sip channel (not a pjsip).

Anyway, what seems real strange to me is that:

  1. only ring and answer have this problem. The other requests (e.g., “sip bye”, “create”, “sip message”, and so on) are always faster (~5ms)

  2. when I perform my test, asterisk is unloaded: there is just one channel active (the one I’m working on), and I perform one request at a time (using aritester) and waiting for the previous request to complete. So I guess the http thread queue should be empty, and so the channel driver… and anyway, 150ms is a very long period of time: which task / set of task can take so long to be performed?

Thanks again


#6

I haven’t looked at the interactions with chan_sip so I’m not sure why it would be slowing down there.


#7

@jcolp I tried also with chan_pjsip, but the results are the same.
Can you take a wireshark trace to see if you measure high latencies too?


#8

I’m not currently set up to do such a thing and don’t have the time. If you believe there’s an underlying bug you can file an issue[1] but as the answer and ring still occurs, it’s likely to be less priority.

[1] https://issues.asterisk.org/jira


#9

@jcolp of course.

However, I appreciate if anyone else could try my test on his asterisk setup to confirm my measure.
To perform the test, only wireshark and aritester are needed.

In the meantime, I’ll have a look at asterisk source to try to understand better the problem.

Thanks again.


#10

@jcolp I checked the source code, and I think now I understand better the problem.

When a channel enter a stasis application, a new thread is spawned and the function stasis_app_exec executed (file res_stasis.c). The function enter in a while loop that performs a ast_waitfor(channel, MAX_WAIT_MS) call that waits for events with a timeout of MAX_WAIT_MS (200ms).

Before a “channel/answer” request, the control flow is blocked on the ast_waitfor for 200ms, then handles the requests inside control_dispatch_all, then waits again on the ast_waitfor and so on… I see that the while loop repeats exactly every 200ms. So, when a “channel/answer” request arrives, ast_waitfor doesn’t wake up immediately (!) but waits for the 200ms timeout. When ast_waitfor exits for timeout, finally the channel performs the answer (WCET for the answer: 200ms).

After answering the channel, I see that the while cycle enter every 20ms, because ast_waitfor now exits after only 20ms (and not 200ms as before). So, next requests are more responsive (WCET for following requests: 20ms).

So, it seems to me that:

  • the ast_waitfor function doesn’t work as expected (it doesn’t wake up on http requests, but always wait the timeout expiration) OR
  • http requests don’t wake up ast_waitfor by design (but then, what are the events it’s waiting for?).

In the first case, it’s a bug. In the second, it’s a big performance flaw. Could you please help me understand what’s the case?

Thanks a lot.


#11

It was probably overlooked for that usage pattern since the queue would eventually get processed. There is no mechanism in place to wake that thread up from what I can see. It may be possible to do such a thing.


#12

@jcolp I see. If I file an issue do you think there is a chance someone will fix this problem? :slight_smile:


#13

It’s possible, but I would provide no guarantee on when. The fact it takes 200ms hasn’t impacted, or been noticed, by anyone else.


#14

@jcolp I guess it’s because ARI is not widely used yet; at least, not for intensive load applications. I ran a stress test using sipp and I’ve noticed immediately.

Anyway, I hope my effort to test ARI and to bring a C++ library to the community will help to increase the number of applications using ARI.

I will fill an issue.


#15

It is used quite heavily in some environments. I can’t disclose where.


#16

@jcolp good, so I’d say it’s worth solving also the performance issues :slight_smile:
Thanks again!


#17

I too am experiencing a similar about a 200ms delay in the PJSIP channel driver, on the channel.answer(). Asterisk 13.14.0.

I am in my ARI StasisStart callback, and measuring times throughout my StasisStart function, to find out where my delays are coming from. My code delays are ~0.002 ms, channel.answer() = 200ms, And ari_confbridge.addChannel() around 203 ms. I will get more information from a wireshark capture shortly.

Just wanted to let you all know that this is also in the PJSIP channel driver…

Thanks…


#18

Since there is a bug in the stasis core, the problem manifests in every kind of channels.
For a quick dirty fix of the problem, you can modify the constant MAX_WAIT_MS from the current value (200ms) to a lower value. But of course, this is not a real solution.


#19

I will try that, for the short term. Yes, this is a real problem for our implementation. I have 200+ endpoints that will be calling in all about the same time, (external trigger), get validate in my stasis app for correct extension that triggered the stasis, answer(), then add to the correct confbridge for the group conversation. These group calls/conferences typically only last for about 30 secs or so anyhow, but have a lot of varying endpoints that need to participate at times. Everything was going great until this delay thing, for the answer and add to confbridge. Thanks…


#20

Daniele77, Because my distro is from Sangoma (freepbx) I am not sure I can make the #define change and create a duplicate build… Good thought though…

I guess the 200mS delay probably works just fine in the “typical” dial plan where you have a lot of parallel channel activities going on, and letting channel items stack up to be processed in bulk makes sense. However, because the stasis_app gets called synchronously, where the answer(), etc. happen, it never gets a chance to queue any channel operations up, since the callback is essentially single threaded, so the next stasis_callback won’t occur until the 1st one is done, you always end-up hitting the bottleneck (i.e. 200ms timeout). If the stasis app was able to be called asynchronously (as fast as events happen), then it would probably work just fine.