Hi, a few months ago I started working on a personal project: an open source voice/chat bot communication platform.
For the voicebot I am using Asterisk and an AudioSocket server, integrated with the Rasa assistant for the bot responses. I would like to build a community around this project, so if you are interested, feel free to contact me.
Hi, how are you dealing with latency between the various phases of the interaction?
I'm integrating OpenAI, and between the TTS, STT and OpenAI calls the call has gaps of silence that are too long.
I experienced the same, mostly because of the calls to ChatGPT, which I am using to detect the language and translate if necessary. My chatbot understands only English, but my idea was that the speaker could use any language. To reduce the latency, I first designed a mechanism for silence detection, which is working pretty well. Once the audio is recorded I use Whisper to transcribe it, then picotts to generate audio back from the Rasa response. Both of them are pretty fast.
My experiments have shown that translation is where most of the latency comes from, so if the chatbot and the user are using the same language the latency is not high.
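For anyone wondering what such a silence detection step might look like, here is a minimal sketch in Go. It is not the project's actual code. It assumes 16-bit little-endian signed PCM at 8 kHz in 20 ms frames (320 bytes), and every name and value in it (`frameRMS`, `SilenceDetector`, the threshold of 500, the 25-frame limit) is hypothetical and would need tuning on real calls.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// frameRMS returns the root-mean-square amplitude of a frame of
// 16-bit little-endian signed PCM samples.
func frameRMS(frame []byte) float64 {
	n := len(frame) / 2
	if n == 0 {
		return 0
	}
	var sum float64
	for i := 0; i < n; i++ {
		s := float64(int16(binary.LittleEndian.Uint16(frame[2*i:])))
		sum += s * s
	}
	return math.Sqrt(sum / float64(n))
}

// SilenceDetector reports the end of an utterance once speech has been
// heard and MaxSilent consecutive frames then fall below Threshold.
type SilenceDetector struct {
	Threshold float64 // RMS below this counts as silence (assumed value)
	MaxSilent int     // consecutive silent frames that end the utterance

	heardSpeech bool
	silentRun   int
}

// Feed processes one frame and returns true when the utterance is over.
func (d *SilenceDetector) Feed(frame []byte) bool {
	if frameRMS(frame) >= d.Threshold {
		d.heardSpeech = true
		d.silentRun = 0
		return false
	}
	if !d.heardSpeech {
		return false // leading silence: keep waiting for the caller
	}
	d.silentRun++
	return d.silentRun >= d.MaxSilent
}

// constFrame builds a 20 ms test frame (160 samples) of constant amplitude.
func constFrame(amp int16) []byte {
	frame := make([]byte, 320)
	for i := 0; i < 160; i++ {
		binary.LittleEndian.PutUint16(frame[2*i:], uint16(amp))
	}
	return frame
}

func main() {
	d := &SilenceDetector{Threshold: 500, MaxSilent: 25} // 25 frames = 0.5 s
	frames := 0
	for i := 0; ; i++ {
		amp := int16(0)
		if i < 10 { // ten loud frames of "speech", then silence
			amp = 3000
		}
		frames++
		if d.Feed(constFrame(amp)) {
			break
		}
	}
	fmt.Println("utterance ended after", frames, "frames")
}
```

As soon as `Feed` returns true, the buffered audio can be handed to the transcriber without waiting for a fixed recording timeout.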
Sorry to bother you…
You wrote that you designed a mechanism for silence detection to reduce the latency.
I’m using the classic:
same => n,Set(FILENAME=/var/sounds/${UNIQUEID}.wav)
same => n,Record(${FILENAME},2,0,k)
And you?
I downloaded an audio file with the classic office buzz on it and I wanted to play it in the background to avoid such long gaps of silence, but I haven’t managed it yet.
How do you plan to handle them?
Yes, but I would have to manage the process asynchronously. I had opted to play the comfort noise throughout the entire call.
I’m trying to understand how to do it. It seems like a rather difficult topic, even for AI.
Yeah, I see. The way you are trying is not possible, because all call management is done by the AudioSocket server in Go; Asterisk just sends the request to it. Can you open an issue in my project? I will see how to handle that, I think it is a good idea to implement. Can you also send me the audio with the “classic office buzz”?
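One possible way to approach this on the server side is to mix a looped background recording into every outgoing frame, so that during gaps the caller hears the noise alone. The sketch below is only an illustration under assumptions: frames are 16-bit little-endian PCM, and the background file has already been decoded into samples. `NoiseMixer`, `sampleAt` and the gain value are hypothetical names and numbers, not part of the project.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// NoiseMixer loops over a background recording and mixes it into
// the audio sent to the caller.
type NoiseMixer struct {
	noise []int16 // decoded background samples, e.g. the office buzz
	pos   int     // index of the next noise sample to use
	Gain  float64 // attenuation applied to the noise (assumed value)
}

// NextFrame returns n samples as little-endian bytes. speech may be nil
// or shorter than n; the missing part is filled with noise only.
func (m *NoiseMixer) NextFrame(speech []int16, n int) []byte {
	out := make([]byte, 2*n)
	for i := 0; i < n; i++ {
		var mixed float64
		if i < len(speech) {
			mixed = float64(speech[i])
		}
		if len(m.noise) > 0 {
			mixed += m.Gain * float64(m.noise[m.pos])
			m.pos = (m.pos + 1) % len(m.noise)
		}
		// Clamp to the int16 range so loud passages do not wrap around.
		if mixed > 32767 {
			mixed = 32767
		} else if mixed < -32768 {
			mixed = -32768
		}
		binary.LittleEndian.PutUint16(out[2*i:], uint16(int16(mixed)))
	}
	return out
}

// sampleAt decodes the i-th sample of a frame (helper for inspection).
func sampleAt(frame []byte, i int) int16 {
	return int16(binary.LittleEndian.Uint16(frame[2*i:]))
}

func main() {
	m := &NoiseMixer{noise: []int16{200, -200, 100, -100}, Gain: 0.5}
	gap := m.NextFrame(nil, 4)                  // bot is silent: noise only
	talk := m.NextFrame([]int16{1000, 1000}, 2) // bot is speaking: speech plus noise
	fmt.Println(sampleAt(gap, 0), sampleAt(talk, 0))
}
```

Because the mixer keeps its own position in the noise buffer, the background stays continuous across frames whether or not the bot is speaking.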
First of all, it's very good work and a good idea (kudos). Basically this is the only way to create a voicebot (STT, Rasa, TTS).
Maybe it would be nice to prepare the code for local processing (for example, self-hosted Whisper). I'm currently working on a speech-to-text project for my company (call it research rather than a project), and we have had some good experiences with self-hosted Whisper on a T4.
The Whisper model can also do language identification, so I think the ChatGPT part of the voicebot could be replaced with Whisper, because it can detect the language from the first chunk.
Anyway, I'm also working in Go (with our Piper TTS integration as well), so maybe I will use your code, but I need time to understand the whole thing.
Wow, thanks for telling me that about the Whisper model. I didn't know that. I am going to test it, it will really help. Does it have to be the self-hosted Whisper model?
I don't know how the OpenAI API works with Whisper (or transcribe), but the local model can do the detection, according to their docs on GitHub.
There is a whisper.detect_language() function for that.
You can see it here, near the end of the page: GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision
Yeah, I was looking for a local model in Go but it seems there isn't anything. I saw that the Go OpenAI API has an AudioResponse type with a Language field, so I expected to receive the language from the API, but it comes back empty: openai package - github.com/sashabaranov/go-openai - Go Packages
I will try to use the local model by implementing a small API that the AudioSocket server can call.
They wrote that the open source model can do this, but the API cannot.
Maybe, if you're going the cloud-native way and don't want to support a local model or local API, you could use some language detection API. I don't know whether the guys at OpenAI have implemented something for that or not.
Yes, this is what I'm doing with my own Whisper container on a Tesla T4.
I think it's a design decision, but many of us like or want an on-premise solution, whether by preference or because of regulations, etc.
If you can make this configurable (OpenAI API or local API), then it could run on a local machine without any cloud-based third party involved.
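A rough sketch of how such a configurable backend could look in Go: the server talks to a small interface, and a config value decides which implementation sits behind it. Everything here is an assumption for illustration. The `Transcriber` interface, the `LocalWhisper` type, the JSON fields `text` and `language`, and the idea that the local service accepts a WAV body over HTTP POST are all hypothetical, and the OpenAI-backed variant is left out. The fake server in `main` only stands in for a real Whisper container.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Result is what the voicebot needs back from any STT backend.
type Result struct {
	Text     string `json:"text"`
	Language string `json:"language"`
}

// Transcriber abstracts the STT backend so it can be chosen in config.
type Transcriber interface {
	Transcribe(wav []byte) (Result, error)
}

// LocalWhisper posts audio to a self-hosted service (hypothetical API).
type LocalWhisper struct {
	URL string
}

// Transcribe sends the WAV bytes and decodes the JSON reply.
func (l LocalWhisper) Transcribe(wav []byte) (Result, error) {
	var res Result
	resp, err := http.Post(l.URL, "audio/wav", bytes.NewReader(wav))
	if err != nil {
		return res, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return res, fmt.Errorf("local whisper: unexpected status %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&res)
	return res, err
}

// NewTranscriber picks a backend from a config value. Only "local" is
// sketched here; a cloud backend would be another case.
func NewTranscriber(backend, url string) (Transcriber, error) {
	switch backend {
	case "local":
		return LocalWhisper{URL: url}, nil
	default:
		return nil, fmt.Errorf("unknown STT backend %q", backend)
	}
}

// NeedsTranslation reports whether the text must be translated before
// it reaches a bot that only understands botLang.
func NeedsTranslation(detected, botLang string) bool {
	return detected != "" && detected != botLang
}

// fakeWhisper stands in for the real container in this demo.
func fakeWhisper() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(Result{Text: "hola", Language: "es"})
	}))
}

func main() {
	srv := fakeWhisper()
	defer srv.Close()

	stt, err := NewTranscriber("local", srv.URL)
	if err != nil {
		panic(err)
	}
	res, err := stt.Transcribe([]byte("fake wav bytes"))
	if err != nil {
		panic(err)
	}
	fmt.Println(res.Text, res.Language, NeedsTranslation(res.Language, "en"))
}
```

If the backend also returns the detected language, the translation step can be skipped whenever the caller already speaks the bot's language.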
Maybe later I can send some PRs to the project, because I'm also interested in creating a voicebot in the long run and your approach is one of the closest to my ideas (Go, Rasa, local TTS).
I am working on it. If the local Whisper works fine I will get rid of the OpenAI API. I can translate text with another tool as well, and I am using OpenAI only for transcribing and translating.
I have further ideas as well, like integrating with Anthropic. I see that Claude 3.5 Sonnet is getting very positive feedback for implementing assistants.
I want to offer the choice of using Rasa or Anthropic as the assistant.
It's slow on CPU; you need a CUDA-enabled GPU to run the model faster. I think we can achieve the same as the OpenAI API on a Tesla T4 GPU.
But I can test it for you.