Asterisk 15, Jack, streams, speech recognition... so many questions!


#1

I know Asterisk has a speech recognition interface built in, but I need to go beyond that, with APIs like Lex, Wit or LUIS.

These APIs can respond really quickly and accurately because they can receive and interpret an audio stream, but all the Asterisk speech recognition tools I can find say that they need to save the speech to a file, then convert it, then upload it via whatever API.

There are also very cheap or free, high-quality speech synthesis services like Amazon Polly, which can also return an audio stream object (or save a file).

Removing the “record in Asterisk/store as file/convert file/upload file <> receive stream/save file/convert file/playback in Asterisk” part of the sequence would save vital seconds of silence and caller annoyance.

I am looking at, for example, Google’s speech-to-text service, and it can cope with the following codecs, some of which can be used directly from Asterisk, yes? So there is no need to transcode? As Google returns “live” word results, I could set an API to “watch” for a trigger word, and then use AGI to trigger something in Asterisk.

Well, that is how I am assuming it can/should be done!

Also, there is JackAudio. And I’m thinking to myself that surely something could be done here, but then also, surely one of the people far more skilled than me would have already done this?

So, have they either missed a trick here, or is there something in a recent version of Asterisk that makes this possible?

LINEAR16 Uncompressed 16-bit signed little-endian samples (Linear PCM).
FLAC FLAC (Free Lossless Audio Codec) is the recommended encoding because it is lossless–therefore recognition is not compromised–and requires only about half the bandwidth of LINEAR16. FLAC stream encoding supports 16-bit and 24-bit samples, however, not all fields in STREAMINFO are supported.
MULAW 8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law.
AMR Adaptive Multi-Rate Narrowband codec. sample_rate_hertz must be 8000.
AMR_WB Adaptive Multi-Rate Wideband codec. sample_rate_hertz must be 16000.
OGG_OPUS Opus encoded audio frames in Ogg container (OggOpus). sample_rate_hertz must be 16000.
SPEEX_WITH_HEADER_BYTE Although the use of lossy encodings is not recommended, if a very low bitrate encoding is required, OGG_OPUS is highly preferred over Speex encoding. The Speex encoding supported by Cloud Speech API has a header byte in each block, as in MIME type audio/x-speex-with-header-byte. It is a variant of the RTP Speex encoding defined in RFC 5574. The stream is a sequence of blocks, one block per RTP packet. Each block starts with a byte containing the length of the block, in bytes, followed by one or more frames of Speex data, padded to an integral number of bytes (octets) as specified in RFC 5574. In other words, each RTP header is replaced with a single byte containing the block length. Only Speex wideband is supported. sample_rate_hertz must be 16000.
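On the transcoding question above: Asterisk’s native ulaw already matches Google’s MULAW entry, so those frames could in principle go up as-is; and if LINEAR16 is wanted instead, the G.711 mu-law expansion is simple enough to do in memory rather than via a file conversion step. A small Python sketch (the function names are mine, not from any Asterisk or Google library):

```python
import struct

def ulaw_byte_to_pcm16(b: int) -> int:
    """Expand one G.711 mu-law byte to a 16-bit signed linear sample."""
    b = ~b & 0xFF
    sign = b & 0x80
    exponent = (b >> 4) & 0x07
    mantissa = b & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def ulaw_to_linear16(ulaw: bytes) -> bytes:
    """Convert a mu-law frame (Asterisk ulaw) to LINEAR16:
    16-bit signed little-endian PCM, as Google's RecognitionConfig expects."""
    return b"".join(struct.pack("<h", ulaw_byte_to_pcm16(b)) for b in ulaw)

# One 20 ms frame of 8 kHz mu-law audio is 160 bytes -> 320 bytes of LINEAR16.
frame = b"\xff" * 160  # mu-law 0xFF encodes silence (sample 0)
assert len(ulaw_to_linear16(frame)) == 320
```

That keeps everything as in-memory byte strings, which is the whole point: no record/store/convert file steps in the middle.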

Example services…
https://cloud.google.com/speech/reference/rpc/google.cloud.speech.v1#google.cloud.speech.v1.RecognitionConfig
https://pypi.python.org/pypi/SpeechRecognition/3.7.1
https://wit.ai/faq
https://wiki.asterisk.org/wiki/display/AST/Asterisk+15+Application_JACK

(Mod - please move if this is the wrong subforum)


#2

The tools you are referring to don’t use the actual Asterisk speech recognition API. The speech recognition API provides a stream of media, as received, to the engine in a format it asks for, and then expects the engine to provide feedback (such as start of speech detected, end of speech detected, results). I expect no one has really looked into doing such a thing and people have just used what was at hand to interface, as it is easier.


#3

Sure - I understand the Speech Rec API would not be of use in my case, but I think what I was asking (badly!) was… is there a way to use something like Jack to pass audio to and from, say, the Google Speech API?

I’m looking at these three links, and wondering if these can be glued together in such a way, or am I completely misunderstanding the concept of streams?

https://wiki.asterisk.org/wiki/display/AST/Application_EAGI

“Using ‘EAGI’ provides enhanced AGI, with incoming audio available out of band on file descriptor 3”
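As a sketch of what that EAGI route might look like: descriptor 3 carries the caller’s audio as 8 kHz signed linear, so a script could pull 20 ms frames off it and feed them onwards. Rough Python (the read loop is standard; the stt_stream it would feed is a hypothetical placeholder):

```python
import os

AUDIO_FD = 3  # EAGI exposes the caller's inbound audio here (8 kHz slin)

def read_audio_frames(fd: int, frame_bytes: int = 320):
    """Yield raw audio frames from a descriptor.
    320 bytes = 20 ms of 8 kHz 16-bit signed linear audio.
    Stops cleanly on EOF (i.e. the caller hung up)."""
    while True:
        chunk = os.read(fd, frame_bytes)
        if not chunk:
            return
        yield chunk

# In a real EAGI script, something like:
#     for frame in read_audio_frames(AUDIO_FD):
#         stt_stream.write(frame)  # hypothetical streaming recognizer client
```

The AGI command channel itself stays on stdin/stdout as usual; the audio is purely out-of-band on fd 3, which is what makes this pattern possible without touching Asterisk internals.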

https://wiki.asterisk.org/wiki/display/AST/New+in+15

“For Asterisk 15, the stream concept has been codified with a new set of capabilities designed specifically for manipulating streams and stream topologies that can be used by any channel driver.”

https://wiki.asterisk.org/wiki/display/AST/Asterisk+15+Application_JACK

“Other applications can be hooked up to these ports to access audio coming from, or being send to the channel”

I might not be able to build what I want to do by myself, but I’d like to know what is possible so that I know what I’m asking for.

Thanks!


#4

I can’t comment on EAGI or JACK from that perspective but I’m confused over how you came to the conclusion that the speech rec API would not be of use. If an Asterisk module was written then it could certainly be used. That’s what people haven’t done, they’ve done things outside of Asterisk which has resulted in the experience you reference (record, transcode, send, get back result, etc).

If you are wanting to do it outside of Asterisk without having to touch it, then we don’t really have a good option for that. You can try piecing something together like with EAGI or JACK as you’ve mentioned, but you may be in uncharted territory.


#5

Thank you! OK, in which case, I will take a closer look at https://wiki.asterisk.org/wiki/display/AST/Speech+Recognition+API if you are saying it would be right in this situation.

It’s just that all the demo code I have found in places like GitHub, for both recognising and generating speech with Asterisk, completely bypasses the Asterisk functions and jumps straight to using AGI. I was wondering why that was, and had (perhaps wrongly?) assumed that they had tried the SpeechRec API and had no luck.

But then again, SO much example code also seems to be stuck on 1.8/Python 2.x - programmers in this world seem very conservative with updating to latest/current versions of anything.


#6

It’s likely because Asterisk modules are written in C, and it’s more difficult to do things in that fashion. Using the Record and ship it off using Python, etc, is just easier and gets the job done for a lot of people to where they find it acceptable.


#7

Ah, OK, I understand now. Thanks - so as I understand (correct me if wrong)

If I want to do “realtime” passing of audio to an API, then Asterisk Speech Recognition might do it, Jack probably would not, but for ease of use, I should just stick to the current AGI scheme that everyone else is using? As long as there’s some kind of comforting “we’re working on it” noise going on, the delay isn’t too bad; it’s just that I was trying to almost eliminate it.

(Or, of course, I could hire someone who knows a bit of C to write a script - shouldn’t be rocket science, right? - for example, there are already official libraries for Google Speech to Text in C#, GO, JAVA, NODE.JS, PHP, PYTHON and RUBY!)


#8

The Asterisk Speech Recognition API does provide a stream of audio, as received, to the implementation. So it will most certainly do that. But yes, the AGI approach is the easiest.


#9

OK, continuing on to the next stage…

Briefly: I want to be able to have “press or say (number)”, with Asterisk listening for a spoken number, but accepting a DTMF digit, too.

I’m posting everything I found so far, here, partly to show working, but also in case anyone else finds it useful. So, moving on…

This looked hopeful for a moment until I realised that it doesn’t do DTMF:
https://wiki.asterisk.org/wiki/display/AST/Asterisk+15+Application_SpeechBackground

So then there’s https://wiki.asterisk.org/wiki/display/AST/Asterisk+15+Application_Record, which can terminate on any DTMF key with “y”, but according to the docs, “RECORD_STATUS” only sets a flag of “DTMF” (A terminating DTMF was received (’#’ or ‘*’, depending upon option ‘t’)).
So, I don’t get to know which key was pressed via that method, either.

There’s very little information I can find about the built-in functions for speech recognition.
https://wiki.asterisk.org/wiki/display/AST/Speech+Recognition+API doesn’t actually explain how to integrate the actual speech engines.

Earlier in this thread, jcolp explained that most people don’t use the speech interface anyway, because:

“Asterisk modules are written in C, and it’s more difficult to do things in that fashion. Using the Record and ship it off using Python, etc, is just easier and gets the job done for a lot of people to where they find it acceptable.”

So, AGI it is! But I’m still stuck on how I record for speech AND get a DTMF if it was dialled.
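One possible way around the Record limitation while staying with AGI: the AGI RECORD FILE command takes a set of escape digits and, as far as I can tell, reports the interrupting digit’s ASCII code in its “200 result=” reply, so a script can get both the recording and the key. A rough Python sketch (untested against a live Asterisk; the command format and reply parsing are my reading of the AGI docs):

```python
import re
import sys

def record_with_dtmf(filename: str, timeout_ms: int = 5000):
    """Issue AGI RECORD FILE with digits 0-9 as escape keys.
    Returns the interrupting digit as a string, or None on timeout."""
    sys.stdout.write(
        f'RECORD FILE {filename} wav "0123456789" {timeout_ms}\n')
    sys.stdout.flush()
    return parse_record_result(sys.stdin.readline())

def parse_record_result(reply: str):
    """AGI replies like '200 result=49 (dtmf) endpos=8000';
    result is the ASCII code of the escape digit (49 -> '1'),
    or 0 if the recording simply ran to its timeout."""
    m = re.match(r"200 result=(-?\d+)", reply)
    code = int(m.group(1)) if m else -1
    if 48 <= code <= 57:  # ASCII '0'..'9'
        return chr(code)
    return None

# In an AGI script: digit = record_with_dtmf("/tmp/utterance")
# digit is the pressed key, and /tmp/utterance.wav holds any speech
# captured before the keypress, ready to ship to a recognition API.
```

That would give “press or say (number)” semantics from one primitive: a keypress short-circuits, otherwise the recording goes off to the recognizer.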

Regarding speech in general, even “Asterisk - The Definitive Guide” just says:

“Asterisk does not have speech recognition built in, but there are many third-party speech recognition packages that integrate with Asterisk. Much of that is outside of the scope of this book, as those applications are external to Asterisk” - helpful!

The speech-rec mailing list at http://lists.digium.com/pipermail/asterisk-speech-rec/ hasn’t been posted to since 2013.

Someone else asked about speech recognition and unimrcp in this post:
http://lists.digium.com/pipermail/asterisk-users/2017-February/290875.html

UniMRCP: https://mojolingo.com/blog/2015/speech-rec-asterisk-get-started/
http://www.unimrcp.org/manuals/html/AsteriskManual.html#_Toc424230605
This has a Google Speech Recogniser plugin, but it’s $50 per channel: http://www.unimrcp.org/gsr

Reasons to use Lex over Google TTS
• Has just been released in eu-west-1: https://forums.aws.amazon.com/ann.jspa?annID=5186
• Supports 8KHz telephony https://forums.aws.amazon.com/ann.jspa?annID=4775
• Is in the core AWS SDK http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/LexRuntime.html
• Has a number slot type: http://docs.aws.amazon.com/lex/latest/dg/built-in-slot-number.html

  • this means no accidental recognition of “won”, “one” or “juan” instead of 1!

The pricing is definitely right: “The cost for 1,000 speech requests would be $4.00, and 1,000 text requests would cost $0.75. From the date you get started with Amazon Lex, you can process up to 10,000 text requests and 5,000 speech requests per month for free for the first year”.

Amazon Transcribe looks promising too, but is only available for developer invitation at this time:
https://aws.amazon.com/transcribe/ https://aws.amazon.com/transcribe/pricing/

But all I need now is the quickest, simplest way to send Lex a short 8KHz file and get a single digit back, as quickly and reliably as possible.
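For the “send Lex a short clip” step, a heavily hedged sketch of what the call might look like with boto3’s lex-runtime client. The bot name, alias and user id below are placeholders for whatever is configured in Lex, and the content type assumes raw linear PCM (the clip may need resampling to whatever rate PostContent actually accepts):

```python
def lex_request_kwargs(audio: bytes, user_id: str) -> dict:
    """Build the PostContent parameters for a short audio clip.
    botName/botAlias are hypothetical; substitute the real bot."""
    return dict(
        botName="DigitBot",   # placeholder bot name
        botAlias="prod",      # placeholder alias
        userId=user_id,
        contentType="audio/l16; rate=16000; channels=1",
        accept="text/plain; charset=utf-8",
        inputStream=audio,
    )

# With boto3 installed and AWS credentials configured:
#     import boto3
#     client = boto3.client("lex-runtime", region_name="eu-west-1")
#     resp = client.post_content(**lex_request_kwargs(pcm_audio, "caller-1234"))
#     print(resp["inputTranscript"], resp["message"])
```

With the number slot type doing the digit extraction, the bot’s reply (or the filled slot) should come back as a clean “1” rather than “won”.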

Before I travel too far down this road, can someone point me in the right direction and possibly steer me away from the wrong path?!


#10

Hi,

@lardconcepts Did you make any progress on your project? I am trying to do the same with the Google Speech API. I have something working with AGI, but it introduces some delay and noise, which makes the solution not good enough to be usable.


#11

To be honest, I gave up for the time being. But what I DID discover was that silence takes as long to process as speech. If you have gaps at the start, at the end, and between words, you can dramatically speed up processing by removing them - see here: https://unix.stackexchange.com/questions/293376/remove-silence-from-audio-files-while-leaving-gaps
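For reference, the recipe behind that link boils down to: trim leading silence, reverse the file, trim the (now-leading) trailing silence, reverse back. Wrapped in Python with subprocess (the thresholds, 0.1 s above 1% amplitude, are starting points to tune per line quality; note this trims only the ends, and the fuller recipes in the link are needed for gaps between words):

```python
import subprocess

def silence_trim_cmd(src: str, dst: str) -> list:
    """sox command: strip leading silence, reverse, strip the
    (now-leading) trailing silence, reverse back.
    '1 0.1 1%' = speech starts once 0.1 s exceeds 1% amplitude."""
    return ["sox", src, dst,
            "silence", "1", "0.1", "1%",
            "reverse",
            "silence", "1", "0.1", "1%",
            "reverse"]

def trim_silence(src: str, dst: str) -> None:
    """Run the trim; requires sox to be installed on the box."""
    subprocess.run(silence_trim_cmd(src, dst), check=True)

# e.g. trim_silence("utterance.wav", "utterance-trimmed.wav")
# before shipping the file off to the recognition API.
```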