Hi everyone,
I’m currently building an AI-driven voice system using Asterisk and AGI scripts. The system can successfully initiate calls, and once the user picks up, the AI agent plays the assigned instructions or messages. The current call flow works like this:
Model → Asterisk → TTS → User
Once the message has been delivered, the call ends politely, and the full recording is saved for transcription and summary generation.
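For reference, the playback side is nothing fancy; here's a stripped-down sketch of roughly what my AGI script does (the sound file and recording path are just illustrative, and the recording step is shown with RECORD FILE purely for illustration):

```python
#!/usr/bin/env python3
# Simplified AGI playback sketch; file names and paths are illustrative.
import sys

def agi(cmd):
    # Send one AGI command and read its "200 result=..." reply line.
    sys.stdout.write(cmd + "\n")
    sys.stdout.flush()
    return sys.stdin.readline().strip()

# Asterisk sends the AGI environment first; it ends with a blank line.
while sys.stdin.readline().strip():
    pass

agi('ANSWER')
# Play the pre-generated TTS prompt (sound file name without extension).
agi('STREAM FILE agent-intro ""')
# Record the caller for later transcription: wav format, '#' to stop, 60s timeout.
agi('RECORD FILE /var/spool/asterisk/monitor/call-recording wav # 60000')
agi('HANGUP')
```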
The issue I’m facing is with real-time two-way communication. While the agent can speak to the user based on predefined scripts, it does not yet process the user’s speech input in real time.
In other words, I’m still developing this flow:
User → Asterisk → STT → Model → User
I’m looking for ideas, best practices, or examples of how to stream the caller’s audio to a model in real time and play its responses back with minimal latency during the call. Any pointers on integrating STT with AGI scripts or handling real-time audio streams efficiently would be highly appreciated.
Nice work on your project — I’ve been in a very similar situation in the past.
When I was trying to move from AGI-driven playback to real-time interaction, one of the solutions I adopted was Asterisk’s AudioSocket module. It streams raw audio both ways over a TCP connection between Asterisk and your AI stack (STT → LLM → TTS), which is what makes real-time conversation possible.
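To make that concrete, here's a minimal sketch of what the receiving end can look like. It assumes the standard AudioSocket framing (a 1-byte message type, a 2-byte big-endian payload length, then the payload, with audio carried as 8 kHz 16-bit signed linear mono in 20 ms frames), and the two placeholder functions are where your STT → LLM → TTS pipeline would plug in, so treat it as a starting point rather than a drop-in implementation:

```python
#!/usr/bin/env python3
# Minimal AudioSocket server sketch. Framing: 1-byte type, 2-byte big-endian
# length, then the payload. Audio frames are 8 kHz 16-bit signed linear mono
# (320 bytes per 20 ms). The two placeholder functions are where a real
# STT -> LLM -> TTS pipeline would go.
import socket
import struct

TYPE_HANGUP = 0x00
TYPE_UUID   = 0x01
TYPE_AUDIO  = 0x10

def handle_caller_audio(pcm: bytes) -> None:
    # Placeholder: feed 20 ms of caller audio into a streaming STT session.
    pass

def next_agent_frame() -> bytes:
    # Placeholder: fetch 20 ms of TTS audio to play back; returning silence
    # keeps the stream timing intact while the agent has nothing to say.
    return b"\x00" * 320

def recv_exact(conn, n):
    # Read exactly n bytes, or return None if the peer closed the socket.
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            return None
        buf += chunk
    return buf

def serve(host="0.0.0.0", port=9092):
    srv = socket.create_server((host, port))
    print(f"AudioSocket server listening on {host}:{port}")
    while True:
        conn, _ = srv.accept()
        with conn:
            while True:
                header = recv_exact(conn, 3)
                if header is None:
                    break
                msg_type, length = header[0], struct.unpack("!H", header[1:])[0]
                payload = recv_exact(conn, length) if length else b""
                if payload is None:
                    break
                if msg_type == TYPE_UUID:
                    print("New call, UUID:", payload.hex())
                elif msg_type == TYPE_AUDIO:
                    handle_caller_audio(payload)
                    reply = next_agent_frame()
                    # Audio sent back to Asterisk uses the same framing.
                    conn.sendall(bytes([TYPE_AUDIO]) + struct.pack("!H", len(reply)) + reply)
                elif msg_type == TYPE_HANGUP:
                    break

if __name__ == "__main__":
    serve()
```

On the Asterisk side, if I remember correctly, you hand the call to a server like this with the AudioSocket() dialplan application, passing a per-call UUID and the host:port of the server (e.g. AudioSocket(<uuid>,127.0.0.1:9092)) in place of the AGI playback step.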
If you’d like a reference implementation, take a look at AgentVoiceResponse (AVR) — an open-source project that’s already built around AudioSocket and supports providers like OpenAI, Deepgram, Gemini, and ElevenLabs.