Is it possible to integrate OpenAI Whisper with Asterisk?

Hi everyone,

I wanted to ask if there’s a way to use OpenAI Whisper with Asterisk, similar to how it works with the aeap-speech-to-text module.

Has anyone tried this or have any ideas on how Whisper could be integrated with Asterisk in a similar manner? If there are any existing projects or approaches, I’d appreciate any pointers!

Thanks in advance!

I’m not aware of any out-of-the-box, “here you go” implementation. The fundamentals to do it exist in Asterisk, but the integration itself still has to be written.

Thanks for the encouraging news.
I think the Python script is the least of the problems, but how can I send the live audio stream to my Python script from the dialplan?

You can either write an AEAP implementation using its protocol if all you want is text to speech, or write an ARI application using external media if you want to do more.
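For the ARI/external media route, the shape of it looks roughly like this (an untested sketch: the ARI credentials, the Stasis app name “speech”, and the ports are placeholders, and in a real application you would also bridge the external media channel with the caller inside your Stasis app):

```python
# Ask Asterisk (via ARI) to fork the call audio to us as RTP, then read it
# from a local UDP socket.
import socket

import requests

ARI = "http://localhost:8088/ari"
AUTH = ("ari", "secret")  # placeholder ARI credentials

# Create an external media channel: Asterisk sends 8 kHz signed-linear
# audio as RTP to external_host.
resp = requests.post(
    f"{ARI}/channels/externalMedia",
    auth=AUTH,
    params={"app": "speech", "external_host": "127.0.0.1:10000", "format": "slin"},
)
resp.raise_for_status()

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 10000))
while True:
    packet, _ = sock.recvfrom(2048)
    audio = packet[12:]  # strip the fixed 12-byte RTP header
    # ...feed `audio` to your speech-to-text engine here
```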

I don’t want TTS, I want STT :grimacing:

Sorry, I meant speech to text. AEAP only does speech to text currently.

Oh dear, I can’t do C++, and I believe AEAP is written in C++. There are already Python scripts for OpenAI Whisper for live audio, but I need to figure out how to make Asterisk send the live audio to the script. Maybe I’ll find a way to adapt the existing AEAP script, as you already suggested. I’ll ask in the OpenAI forum, and once I have a solution, I will of course share it in this post. I wish you a pleasant day. Best regards from Germany!

AEAP isn’t “written in C++”. AEAP is a defined protocol[1] between Asterisk and an outside application, which can be in any language. The example we provide is in JavaScript.

[1] Asterisk External Application Protocol (AEAP) - Asterisk Documentation
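If it helps to see the shape of it, the transport side of such an external application in Python could start like this (a transport-only sketch: the `websockets` package, the port, and `feed_recognizer` are placeholders of mine — the actual JSON message exchange is defined in the documentation above):

```python
# Skeleton of an AEAP-style server: JSON control messages arrive as text
# frames, audio arrives as binary frames.
import asyncio
import json

import websockets

def feed_recognizer(audio_bytes):
    # Stand-in: wire this up to your actual STT engine (Vosk, WhisperLive, ...).
    pass

async def handle(ws, path=None):
    async for message in ws:
        if isinstance(message, bytes):
            feed_recognizer(message)       # binary frame: raw audio from Asterisk
        else:
            request = json.loads(message)  # text frame: AEAP control message
            print("AEAP message:", request)

async def main():
    async with websockets.serve(handle, "0.0.0.0", 9099):
        await asyncio.Future()  # run forever

asyncio.run(main())
```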

Thank you for the correction. I just checked, and I think provider.js needs to be adapted for WhisperLive. However, I’ll need to dig deeper into it to figure it out. I wanted to use Vosk, but it requires a huge amount of resources, and my server’s RAM isn’t sufficient for it. Google STT is all well and good, but I’m looking for a local solution that doesn’t rely on an external provider.

If you cache the rendered audio from the external TTS provider, your dependence and costs may be tolerable.

We use Polly (AWS) and Watson (IBM). When we receive the TTS audio, I store it as /tts-cache/<provider>/<language>/<voice>/<md5sum-of-text>.wav.
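Computing that cache key is cheap; a sketch of the naming scheme (the provider/language/voice values are just examples):

```python
# Build the cache path described above; hashlib and os are standard library.
import hashlib
import os

def tts_cache_path(provider: str, language: str, voice: str, text: str) -> str:
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return os.path.join("/tts-cache", provider, language, voice, digest + ".wav")

# e.g. /tts-cache/polly/en-US/Joanna/900150983cd24fb0d6963f7d28e17f72.wav
print(tts_cache_path("polly", "en-US", "Joanna", "abc"))
```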

For some of our clients, we can extract the text and variables from their script and pre-render the text.

You can reduce Vosk models to fit your memory; for example, if you remove the rnnlm and rescore folders from the model, it will probably fit in 2 GB.
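For example (a sketch; the model path is an example, the folder names are the ones mentioned above):

```python
# Delete the optional rescoring data from a Vosk model to cut its memory use.
import shutil

MODEL = "/opt/vosk-model-en-us-0.22"  # example path to a big Vosk model
for folder in ("rnnlm", "rescore"):
    shutil.rmtree(f"{MODEL}/{folder}", ignore_errors=True)
```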

As for Whisper, it needs much more compute (a GPU card) and isn’t really real-time.

Try using AudioSocket.

The problem is that I only have 1 GB of memory, and Asterisk is already running on the server. Only short sentences need to be recognized, which should not be longer than 10-15 seconds.

Can you provide me with an example of a dialplan using AudioSocket and Vosk?
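A minimal sketch of what that could look like (untested; the extension, UUID, port, and model path are placeholders you’d adjust). The dialplan side just answers the call and hands the audio to a TCP listener via the AudioSocket application:

```
[from-internal]
exten => 100,1,Answer()
 same => n,AudioSocket(40325ec2-5efd-4bd3-805f-53576e581d13,127.0.0.1:9092)
 same => n,Hangup()
```

On the other end, a small Python server can parse the AudioSocket framing (1 byte type, 2 bytes big-endian length, then the payload; type 0x01 carries the call UUID, 0x10 carries 16-bit/8 kHz signed-linear audio, 0x00 signals the end of the call) and feed the audio frames to Vosk:

```python
# Minimal AudioSocket server feeding a (small) Vosk model.
import socket
import struct

from vosk import KaldiRecognizer, Model

model = Model("/opt/vosk-model-small-en-us-0.15")  # example small model

def read_exact(conn, n):
    """Read exactly n bytes from the TCP connection."""
    data = b""
    while len(data) < n:
        chunk = conn.recv(n - len(data))
        if not chunk:
            raise ConnectionError("socket closed")
        data += chunk
    return data

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 9092))
server.listen(1)

while True:
    conn, _ = server.accept()
    rec = KaldiRecognizer(model, 8000)  # AudioSocket audio is 8 kHz slin
    try:
        while True:
            kind, length = struct.unpack(">BH", read_exact(conn, 3))
            payload = read_exact(conn, length) if length else b""
            if kind == 0x00:          # Asterisk ended the call
                break
            if kind == 0x10:          # audio frame
                if rec.AcceptWaveform(payload):
                    print(rec.Result())
        print(rec.FinalResult())
    finally:
        conn.close()
```

With one of the small Vosk models this should stay well within a modest memory budget, but you’d want to verify that on your 1 GB server.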

OpenAI Whisper does not natively support streaming audio input. So you have to send OpenAI a recording file.

I have done this type of thing three different ways, and they all work, although one of them did not use OpenAI because of the lack of streaming capability. How you do it depends on your specific application.

If you want to stream the audio in real time, OpenAI is probably not the best choice. You can chunk the audio into small segments and send them sequentially, but there are better services for real-time streaming. If you go that route, I’d suggest using ARI with an external media channel to get the audio into your app; I believe that’s the best method, because you can then also stream TTS back to Asterisk.

For a simple application, though, any script that sends the recording file to OpenAI will work. One use case I have done is transcribing voicemail: I send the file for transcription via an AGI script called in a hangup handler.
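For that simple recording-file case, the core really is just a few lines (a sketch, not production code; the voicemail path is an example, and it assumes the openai Python package with an API key in OPENAI_API_KEY):

```python
# Transcribe a finished recording with OpenAI's hosted Whisper model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Example path; in an AGI hangup handler you'd get this from the channel/recording.
path = "/var/spool/asterisk/voicemail/default/100/INBOX/msg0000.wav"

with open(path, "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

print(transcript.text)
```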

That sounds great. Would you share your code here?

Sorry, no I can’t.

And per PM?

What are you asking?

I asked if you could send me the code in a private message.

“I can’t” really means I can’t. It wasn’t just an objection to sharing it publicly in the forum. I don’t own the code, since I wrote it as an employee. Besides, it’s in Go, not Python.

Sorry. Not everyone can share their code. My point in responding was that you don’t have to overcomplicate it. Depending on your specific needs, it could be very simple, as in about 5-6 lines of code. If you need to stream the audio, it is much more complicated, and in that scenario you have to work around the fact that OpenAI doesn’t accept audio streams.

Good luck.