I’m currently developing an interactive voice response system that leverages Google’s Text-to-Speech (TTS) and Speech-to-Text (STT) services. Here’s a breakdown of how I envision the system working:
- Initial Greeting: When a caller dials in, they’re greeted with a standard message, like “Thanks for calling, tell me something.”
- Caller Input: After the greeting, the system waits for the caller to respond. If the caller says nothing for 5 seconds, the system asks whether they are still there. Once the caller starts speaking, recording continues until 3 seconds of silence are detected.
- Speech-to-Text Conversion: Whatever the caller says is captured and converted into text using Google’s STT.
- Playback with TTS: The system then reads back the transcribed text to the caller using Google TTS. For example, “I think you said: [caller’s words].”
- Follow-Up Prompt: After the playback, the caller is asked, “Would you like to tell me more?”
  - If the answer is yes, the system loops back to the initial “Tell me something” prompt.
  - If the answer is no, the caller hears “Thanks for calling” before the call ends.
- Interruption Handling: Callers must be able to interrupt the TTS playback (barge-in). If they start speaking while “I think you said: [caller’s words]” is playing, their new input is captured, converted to text, and read back.
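To make the flow above concrete, here is a minimal Python sketch of the dialog loop. It is only an illustration under stated assumptions: `say` and `listen` are hypothetical callables standing in for the TTS playback and the record-plus-STT steps, `say` is assumed to return the transcript of any speech that interrupted playback (or `None`), and `listen` is assumed to return `None` when the caller stays silent. The names `run_call`, `is_yes`, `YES_WORDS`, and the `max_turns` limit are likewise made up for the example.

```python
from typing import Callable, List, Optional

# Hypothetical list of words treated as "yes"; a real system would need more care.
YES_WORDS = {"yes", "yeah", "yep", "sure", "ok", "okay"}


def is_yes(answer: Optional[str]) -> bool:
    """Return True if the transcribed answer contains an affirmative word."""
    if not answer:
        return False
    words = answer.lower().replace(",", " ").replace(".", " ").split()
    return any(word in YES_WORDS for word in words)


def run_call(
    say: Callable[[str], Optional[str]],
    listen: Callable[[], Optional[str]],
    max_turns: int = 20,
) -> List[str]:
    """Drive one call and return everything the caller was heard to say.

    say(text)  -> plays text; returns barge-in speech as text, or None (assumed).
    listen()   -> returns the caller's speech as text, or None on silence (assumed).
    """
    heard_all: List[str] = []
    prompt = "Thanks for calling, tell me something."
    for _ in range(max_turns):
        say(prompt)
        heard = listen()
        if heard is None:
            # Caller stayed silent: check once, then give up.
            say("Are you still there?")
            heard = listen()
            if heard is None:
                break
        # Read back; if the caller interrupts, read back the new input instead.
        while heard is not None:
            heard_all.append(heard)
            heard = say(f"I think you said: {heard}")
        say("Would you like to tell me more?")
        if not is_yes(listen()):
            break
        prompt = "Tell me something."
    say("Thanks for calling")
    return heard_all
```

Keeping the loop independent of Asterisk and Google makes it possible to test the conversation logic with scripted fakes before wiring in real audio.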
For the technical setup, I’m using the Asterisk Gateway Interface (AGI) with a Python script for basic tests. In one of my experiments I also set up an external Node.js application that processes the recordings using Google services and sends the results back to my debug window.
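For anyone unfamiliar with AGI, the Python side boils down to reading `agi_*` variables from stdin and then exchanging text commands and `200 result=...` replies over stdout and stdin. The sketch below shows that exchange with the streams passed in explicitly; the helper names (`read_agi_env`, `parse_response`, `record_command`, `send_command`), the 30-second maximum recording time, and the `#` escape digit are assumptions for illustration, and the `s=3` option is meant to match the 3-second silence rule described above.

```python
import re
from typing import Dict, Optional, TextIO

# Matches replies such as "200 result=0 (timeout) endpos=16000".
_RESPONSE = re.compile(r"^(\d{3})(?:[ -]result=(-?\d+))?(?:\s+\((.*?)\))?")


def read_agi_env(stream: TextIO) -> Dict[str, str]:
    """Read the 'agi_key: value' header lines Asterisk sends before any command."""
    env: Dict[str, str] = {}
    for line in stream:
        line = line.strip()
        if not line:  # a blank line ends the header block
            break
        key, _, value = line.partition(":")
        env[key.strip()] = value.strip()
    return env


def parse_response(line: str) -> Dict[str, Optional[object]]:
    """Split an AGI reply into status code, numeric result, and parenthesised data."""
    match = _RESPONSE.match(line.strip())
    if match is None:
        return {"code": None, "result": None, "data": None}
    code, result, data = match.groups()
    return {
        "code": int(code),
        "result": int(result) if result is not None else None,
        "data": data,
    }


def record_command(path: str, fmt: str = "wav",
                   timeout_ms: int = 30000, silence_s: int = 3) -> str:
    """Build a RECORD FILE command that stops on '#', timeout, or trailing silence."""
    return f'RECORD FILE {path} {fmt} "#" {timeout_ms} s={silence_s}'


def send_command(command: str, out: TextIO, inp: TextIO) -> Dict[str, Optional[object]]:
    """Write one AGI command and parse the single-line reply."""
    out.write(command + "\n")
    out.flush()
    return parse_response(inp.readline())
```

In a real AGI script, `out` and `inp` would be `sys.stdout` and `sys.stdin`, for example `send_command(record_command("/tmp/caller"), sys.stdout, sys.stdin)` after calling `read_agi_env(sys.stdin)`; the `/tmp/caller` path is only a placeholder.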
I’ve seen some promising examples and frameworks for this kind of system, but I’m looking for advice on the most efficient and effective way to implement these features. Any insights, especially from those who have worked on similar projects, would be greatly appreciated!
For reference, some of the repos and articles I’ve come across: