How to implement speech recognition barge-in with ARI

We have developed an ARI application that uses Google’s Speech Recognition API (we need Hebrew speech recognition, so there is no other option). The app records the user’s speech and sends it to Google for recognition. Can you give us some tips on how to implement barge-in in this configuration?


Maybe I’ll ask that in this way: Is there a way to get an event from ARI when voice is detected?

The TALK_DETECT dialplan function[1] can be used to detect when a party on a channel starts and stops talking. This raises ARI events (ChannelTalkingStarted and ChannelTalkingFinished), and it can be set on a channel in ARI via the normal channel variable route.
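As a hedged sketch of that channel-variable route: the snippet below builds the ARI request that sets TALK_DETECT(set) on a channel. The base URL and thresholds are assumptions for illustration (authentication is omitted for brevity); adjust them for your ari.conf and call flow.

```python
# Sketch: enabling TALK_DETECT from ARI via the channel-variable endpoint.
# ARI_BASE and the default thresholds are assumptions, not canonical values.
import urllib.parse
import urllib.request

ARI_BASE = "http://localhost:8088/ari"  # assumed ARI HTTP address

def talk_detect_request(channel_id, silence_ms=2500, talk_threshold=256):
    """Build the URL for POST /channels/{id}/variable setting TALK_DETECT(set).

    The value is "<silence_ms>,<talk_threshold>": milliseconds of silence
    that end a talking burst, then the DSP energy level counted as talking.
    Once set, Asterisk raises ChannelTalkingStarted / ChannelTalkingFinished
    events on the application's ARI WebSocket.
    """
    query = urllib.parse.urlencode({
        "variable": "TALK_DETECT(set)",
        "value": f"{silence_ms},{talk_threshold}",
    })
    return f"{ARI_BASE}/channels/{channel_id}/variable?{query}"

def enable_talk_detect(channel_id, opener=urllib.request.urlopen):
    # POST with an empty body; ARI takes variable/value as query parameters.
    req = urllib.request.Request(
        talk_detect_request(channel_id), data=b"", method="POST")
    return opener(req)
```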


Seems exactly what I’m looking for. I’ll try and update. Thanks!

A little bit tricky - the main problem is that if I start recording on the ChannelTalkingStarted event, I lose the start of the user’s sentence. I can reduce the loss if I increase the sensitivity of the detection, but then I receive more false detections. Is there a way to buffer the last second or so?

BTW: I think the documentation on the wiki page above has a mistake in the descriptions of the parameters. If I understand correctly, the first is the length of silence to be treated as the end of talking, and the second is the energy level to be considered as talking.

There is no way to buffer the last second. The dialplan function is strictly to know when talking starts and ends.

As for the documentation please leave a comment on the wiki page and we’ll look into it.

I am trying to record the user from call start, and to recognize from one second before the ChannelTalkingStarted event. The strange thing is that when I start recording the user, playing prompts stops working. What’s going on here?

You can’t do two things at once to a channel. Record in ARI is just that: it records the channel as if you were calling Record() in the dialplan. It is not a MixMonitor equivalent. The foundation is there to implement such a thing, though, using a Snoop channel and Record on the Snoop channel.
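A minimal sketch of that Snoop-channel approach: spy on the caller’s incoming audio with a snoop channel, then run Record on the snoop channel, leaving the real channel free for prompt playback. ARI_BASE, the APP name, and the id/name scheme are assumptions for illustration (authentication is omitted).

```python
# Sketch: record via a snoop channel so the original channel stays free.
# ARI_BASE and APP are placeholder assumptions.
import urllib.parse
import urllib.request

ARI_BASE = "http://localhost:8088/ari"
APP = "myapp"  # assumed Stasis application name

def snoop_request(channel_id, snoop_id):
    """POST /channels/{id}/snoop: spy on the caller's incoming audio only."""
    query = urllib.parse.urlencode(
        {"spy": "in", "app": APP, "snoopId": snoop_id})
    return f"{ARI_BASE}/channels/{channel_id}/snoop?{query}"

def record_request(snoop_id, name):
    """POST /channels/{snoopId}/record: record the snoop channel in ulaw."""
    query = urllib.parse.urlencode(
        {"name": name, "format": "ulaw", "ifExists": "overwrite"})
    return f"{ARI_BASE}/channels/{snoop_id}/record?{query}"

def start_snoop_recording(channel_id, opener=urllib.request.urlopen):
    """Create the snoop channel, then start recording it."""
    snoop_id = f"snoop-{channel_id}"
    for url in (snoop_request(channel_id, snoop_id),
                record_request(snoop_id, f"rec-{channel_id}")):
        opener(urllib.request.Request(url, data=b"", method="POST"))
    return snoop_id
```

With this in place, Playback on the original channel and Record on the snoop channel run independently.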

I see. Is there any preference between creating and recording a snoop channel or a holding bridge?

I don’t understand the question. They are separate things. While in a bridge you also have limited control over the channel.

I meant that I can record the user using a snoop channel, or by adding the channel to a bridge and recording the bridge. I am wondering which is better in terms of resource usage.

You’d need to profile and see for your use case what would be better.

O.K., I’ll try and see. Thanks.

Working well now with snoop channel. Thanks!


Would it be possible for you to share the source code of your application? I am curious to know how it is done. You can e-mail it to me on


Do you have any advice on how to do the same thing?

Basically, what we do is create a snoop channel, which immediately starts recording the user, and we save the time the recording started. Then, when we get ChannelTalkingStarted on the snoop channel, we save that time too. When ChannelTalkingFinished finally arrives, we stop recording and copy the recorded file from one second before the ChannelTalkingStarted event (since we record in ulaw, this is straightforward). Then we send this file to Google speech recognition.
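The "one second before" copy is straightforward because ulaw at 8 kHz is exactly one byte per sample, i.e. 8000 bytes per second, so the trim point is plain byte arithmetic on the raw file. A small sketch of that step (function names and the one-second lead-in are illustrative):

```python
# Sketch: trim a raw ulaw recording to start one second before talking began.
ULAW_BYTES_PER_SEC = 8000  # 8 kHz, one byte per sample: offsets are exact

def barge_in_offset(recording_started, talking_started, lead_in=1.0):
    """Byte offset into the recording, lead_in seconds before talking began.

    Both arguments are timestamps in seconds (e.g. from time.time()),
    saved when recording started and when ChannelTalkingStarted arrived.
    """
    seconds = max(0.0, (talking_started - recording_started) - lead_in)
    return int(seconds * ULAW_BYTES_PER_SEC)

def trim_ulaw(raw, recording_started, talking_started, lead_in=1.0):
    """Return the tail of the raw ulaw bytes, ready to send for recognition."""
    return raw[barge_in_offset(recording_started, talking_started, lead_in):]
```

If talking started less than a second into the recording, the offset clamps to zero and the whole file is kept.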

I think the secret is to jump into the Asterisk internals and see if you can build access to the dialplan speech applications through ARI. That would be ideal.