Interactive TTS and STT methods

I’m currently developing an interactive voice response system that leverages Google’s Text-to-Speech (TTS) and Speech-to-Text (STT) services. Here’s a breakdown of how I envision the system working:

  1. Initial Greeting: When a caller dials in, they’re greeted with a standard message, like “Thanks for calling, tell me something.”
  2. Caller Input: After the greeting, the system waits for the caller to respond. If 5 seconds pass in silence, the system asks whether the caller is still there. Once the caller starts speaking, recording continues until 3 seconds of silence is detected.
  3. Speech-to-Text Conversion: Whatever the caller says is captured and converted into text using Google’s STT.
  4. Playback with TTS: The system then reads back the transcribed text to the caller using Google TTS. For example, “I think you said: [caller’s words].”
  5. Follow-Up Prompt: After the playback, the caller is asked, “Would you like to tell me more?”
  • If the answer is yes, the system loops back to the initial “Tell me something” prompt.
  • If the answer is no, the caller hears “Thanks for calling” before the call ends.
  6. Interruption Handling: Importantly, the system allows callers to interrupt the TTS playback. If they interrupt during the playback of “I think you said: [caller’s words]”, their new input is captured, converted to text, and read back.
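The call flow above can be sketched as a small loop, independent of any telephony library. This is a minimal sketch, not an implementation: `listen`, `transcribe`, and `speak` are placeholder callables standing in for the real AGI record/playback and Google STT/TTS calls, and the 5-second “are you still there?” branch is omitted for brevity.

```python
# Sketch of the IVR call flow described above. The listen/transcribe/
# speak callables are stand-ins for the real AGI + Google STT/TTS calls.
def run_call(listen, transcribe, speak):
    """Drive one call: greet, echo the input back, loop until 'no'."""
    speak("Thanks for calling, tell me something.")
    while True:
        audio = listen()                 # records until trailing silence
        text = transcribe(audio)
        speak(f"I think you said: {text}")
        speak("Would you like to tell me more?")
        answer = transcribe(listen()).strip().lower()
        if answer.startswith("no"):
            speak("Thanks for calling")
            return
        speak("Tell me something.")
```

Keeping the flow as a pure function like this makes it easy to unit-test the conversation logic with stubbed audio before wiring in Asterisk.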

For the technical setup, I’m using Asterisk Gateway Interface (AGI) with a Python script for basic tests. I’ve successfully managed to set up an external Node.js application in one of my experiments. This app processes the recordings using Google services and then sends the results back to my debug window.
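Since AGI is just a line protocol over stdin/stdout, the two commands this flow needs can be formatted as plain strings. A sketch of the relevant helpers (filenames, escape digits, and defaults here are examples, not fixed values):

```python
# Helpers that format raw AGI commands. AGI speaks a simple line
# protocol on stdin/stdout, so no library is strictly required.
def stream_file(filename, escape_digits="#"):
    # Play a sound file; escape_digits lets the caller barge in,
    # which is one way to approach the interruption requirement.
    return f'STREAM FILE {filename} "{escape_digits}"'

def record_file(filename, fmt="wav", escape_digits="#",
                timeout_ms=-1, silence_s=3):
    # Record caller audio; s=<n> stops after n seconds of silence,
    # matching the 3-second rule in the flow above. timeout_ms=-1
    # means no overall recording limit.
    return (f'RECORD FILE {filename} {fmt} "{escape_digits}" '
            f'{timeout_ms} s={silence_s}')
```

An AGI script would print these commands to stdout and read the `200 result=...` responses from stdin.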

I’ve seen some promising examples and frameworks for this kind of system, but I’m looking for advice on the most efficient and effective way to implement these features. Any insights, especially from those who have worked on similar projects, would be greatly appreciated!

For reference, some of the repos and articles I’ve come across:


I have a similar working setup with the ChatGPT API for AI, Whisper for speech-to-text, and the Google Text-to-Speech API for playback. I use the Asterisk Record application. I tried implementing the interruption part, but the results were not as expected. For now I have it working in a simpler way, and it works very well. It’s like having ChatGPT over the phone.

Whisper is not real-time. Do you chunk the data somehow? Can you share your setup on GitHub?
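One common workaround, since Whisper only accepts complete audio, is to cut the recording into fixed-length windows and transcribe each one. A minimal sketch (the 8 kHz / 16-bit defaults assume typical telephony audio; adjust for your codec):

```python
def chunk_pcm(pcm: bytes, seconds: float = 5.0,
              rate: int = 8000, width: int = 2) -> list[bytes]:
    """Split raw mono PCM into fixed-length windows for batch STT.

    Each window can be wrapped in a WAV header and sent to Whisper
    independently; smarter variants split on silence instead of at
    fixed offsets to avoid cutting words in half.
    """
    step = int(seconds * rate) * width   # bytes per window
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]
```

This is still batch processing with added latency, not true streaming.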

If you need real-time, you will have to use the ARI external media option plus Google’s real-time (streaming) transcription.
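For context, ARI’s external media feature is driven by a REST call that tells Asterisk to stream a channel’s audio over RTP to an external host, where a bridge to a streaming STT service can live. A sketch of building that request (base URL, app name, and host:port below are assumptions for illustration):

```python
from urllib.parse import urlencode

def external_media_url(base: str, app: str, host_port: str,
                       fmt: str = "ulaw") -> str:
    """Build the ARI request that forks channel media to an external
    RTP endpoint. Asterisk then sends the caller's audio to host_port,
    where your own process can relay it to Google streaming STT."""
    params = {"app": app, "external_host": host_port, "format": fmt}
    return f"{base}/channels/externalMedia?{urlencode(params)}"
```

The actual request is a POST with ARI credentials; the receiving process must speak plain RTP.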

What do you think about AEAP? https://docs.asterisk.org/Configuration/Interfaces/Asterisk-External-Application-Protocol-AEAP/

This is the first time I’ve heard of AEAP, so I did some quick research and reading. It seems like a simple way to implement an external speech-to-text application for Asterisk using the Google STT API.

I will take some time to test the GitHub code above. On the Asterisk side, everything looks good. Let’s see if the Node.js code is functional.

I have a live demo with STT, GPT, and Google TTS.
If you still haven’t got it working, ping me here and I’ll explain how I did it.

It would be great to hear how you’ve done it.

I found jambonz, which does this OOTB and is easy to set up, but I’d like to stay within Asterisk.

Try Vosk too, see here
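For anyone evaluating Vosk: it runs fully offline, feeding a recognizer fixed-size frame blocks from a WAV file. A minimal sketch, assuming `pip install vosk` and a downloaded model directory (the model path is an example, not a fixed value):

```python
import json
import wave

def read_frames(path: str, block: int = 4000):
    """Yield fixed-size frame blocks from a WAV file -- the shape of
    input Vosk's recognizer consumes incrementally."""
    with wave.open(path, "rb") as wf:
        while True:
            data = wf.readframes(block)
            if not data:
                break
            yield data

def transcribe(path: str, model_dir: str) -> str:
    """Offline transcription with Vosk. model_dir is a downloaded
    model, e.g. vosk-model-small-en-us-0.15 (example path)."""
    from vosk import Model, KaldiRecognizer  # pip install vosk
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
    rec = KaldiRecognizer(Model(model_dir), rate)
    pieces = []
    for data in read_frames(path):
        if rec.AcceptWaveform(data):
            pieces.append(json.loads(rec.Result())["text"])
    pieces.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in pieces if p)
```

Accuracy on 8 kHz telephony audio depends heavily on which model you pick, so test with real call recordings.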

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.