Looking for technical support regarding Asterisk + Vosk customization; need anytime STT

Hello. We are Helping Hand India, an NGO working in India for poor students' education.
We are using Asterisk 20 with Vosk STT for multilingual transcription (English and Hindi), and everything is working well. We now need real-time (always-on) STT within our Asterisk environment. An expert already helped us in the past to make it multilingual. I am sharing the existing configuration here for context.

same => n,SpeechCreate(vosk^en)
same => n,SpeechCreate(vosk^hi)
same => n,SpeechCreate(vosk^enin)
;same => n,SpeechBackground(silence1,4,p,en^hi^enin)
same => n,SpeechBackground(quizivr/2025/${LSelect}Main-Menu-April25E,0,p,en^hi^enin)
same => n,Verbose(0,Result was ${SPEECH_TEXT(0^en)})
same => n,Verbose(0,Result was ${SPEECH_TEXT(0^hi)})
same => n,Verbose(0,Result was ${SPEECH_TEXT(0^enin)})
same => n,Set(EnglishVoice=${SPEECH_TEXT(0^en)})
same => n,Set(HindiVoice=${SPEECH_TEXT(0^hi)})
same => n,Set(EnglishVoice2=${SPEECH_TEXT(0^enin)})
same => n,SpeechDestroy(vosk^en)
same => n,SpeechDestroy(vosk^hi)
same => n,SpeechDestroy(vosk^enin)

cat /etc/asterisk/res_speech_vosk.conf
[general]
[en]
type=horse
url = ws://localhost:2702
[hi]
type=horse
url = ws://localhost:2700
[enin]
type=horse
url = ws://localhost:2701

docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3eb6ceef0959 alphacep/kaldi-en-in:latest "python3 ./asr_serve…" 3 hours ago Up 3 hours 0.0.0.0:2701->2700/tcp, :::2701->2700/tcp sleepy_spence
860d807f84a8 alphacep/kaldi-en:latest "python3 ./asr_serve…" 3 hours ago Up 3 hours 0.0.0.0:2702->2700/tcp, :::2702->2700/tcp brave_bassi
b08767cb1e33 alphacep/kaldi-hi:latest "python3 ./asr_serve…" 3 hours ago Up 3 hours 0.0.0.0:2700->2700/tcp, :::2700->2700/tcp funny_austin

You can reply/quote with your charges on our email.
office@helpinghandindiango.org

I think you’re going to need to be more specific on what exactly this means and what you’re looking for.

OK. In the current situation, the user is able to speak only once, at the start. We need the user to be able to speak at any time during their session, say a number from one to eighty, and go directly to the desired module. We have eighty modules.
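One way to approximate "anytime" input with the existing SpeechBackground setup is to re-arm the recognizer in a loop against a silence prompt, so a new utterance can be captured at any point. This is only a sketch: the context name, the silence sound file, and the module-N target contexts are placeholders, not part of the original configuration, and a spoken word like "twenty" would still need translating to the digits "20" before the Goto.

```
; Sketch only: loop SpeechBackground so the recognizer is re-armed
; after every utterance. Names below are hypothetical.
[anytime-stt]
exten => s,1,Answer()
 same => n,SpeechCreate(vosk^en)
 same => n(listen),SpeechBackground(silence/10,0,p,en)   ; listen, then fall through
 same => n,Set(Heard=${SPEECH_TEXT(0^en)})
 same => n,GotoIf($["${Heard}" = ""]?listen)             ; nothing recognized, listen again
 ; A word-to-number step would be needed here; the target contexts
 ; module-1 ... module-80 are assumed, not shown in the original post.
 same => n,Goto(module-${Heard},s,1)
```

Note the limitation: SpeechBackground only listens while that application is running, so this loop re-arms between results rather than being truly always-on, and any other prompt playback has to happen inside the same loop.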

Excellent!

To clarify, STT is Speech To Text. This is still a thing. So is Automatic Speech Recognition (ASR). And Text To Speech (TTS)!

However, there’s been lots of software developer growth in this area recently – see the forums, natch! – and the current nomenclature gravitates towards the more generic phrases like Artificial Intelligence (AI) and Large Language Models (LLM). Not to turn this reply into a Public Service Announcement, but it is definitely a note to self: it would be nice :rainbow: :sun: :smile: to revive STT/ASR/TTS in the VoIP space that Asterisk lives in, if only to help regain some focus on the problems that folks are trying to solve.

I don’t think this is close to the current crop of speech recognition requirements. Whilst I suppose it could be done with a continuous speech recognition system and the resulting text post-processed, this seems to be limited-vocabulary, isolated speech. The unusual feature is that it isn’t listening for an immediate response.

At least one consequence of using a current-generation recognizer is that recognition is likely to be delayed, as such systems require significant look-ahead to decode the speech reliably. Whereas, if one knows the input is one of eighty numbers, one already has a lot of context, and the technology can be, maybe, 20 years old.

I also think a lot more problem analysis needs to be provided here, as I suspect the media channel is also being used for, at the very least, DTMF input.

It would probably be better if there were a DTMF attention signal that triggered an attempt to read speech, rather than using the speech itself as the trigger.

I appreciate your guidance. I have already tried, but could not find helpful content. I think experts can give a quality, reliable solution.

Please suggest a good way to do continuous/anytime voice activity detection in the above scenario; BackgroundDetect() or the TALK_DETECT function seems not useful. Any good reference for implementing an Alexa-type solution, or helpful links related to the above requirements, would be appreciated.
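For reference on why TALK_DETECT does not help on its own: it performs no recognition at all, it only raises ChannelTalkingStart/ChannelTalkingStop events over AMI/ARI, so an external application would still have to catch those events and start STT itself. A fragment for illustration:

```
; TALK_DETECT only emits AMI/ARI events (ChannelTalkingStart /
; ChannelTalkingStop); it does no transcription. An external AMI/ARI
; application must react to the events and drive recognition.
same => n,Set(TALK_DETECT(set)=)   ; enable talking events on this channel
```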

Alexa uses an attention word and, I believe, detects it in the device or speaker. Only when it detects that trigger does it go to the cloud to decipher the full request. The analogy for phones would be the phone detecting the trigger word.

(Asterisk can detect a trigger DTMF, although some phone systems need a stronger attention signal, in the form of a hook flash or, in analogue systems, the earthing of one of the wires.)
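The DTMF-attention idea can be sketched in the dialplan: Background() lets the caller dial an extension while audio plays, so a key such as * can arm the recognizer only when pressed. This is an illustrative sketch; the context, prompt names, and the choice of * are assumptions, not from the original post.

```
; Sketch: * is a DTMF "attention key" that arms speech recognition,
; instead of speech itself being the trigger. Names are placeholders.
[menu]
exten => s,1,Answer()
 same => n(loop),Background(silence/10)   ; caller can press a key any time
 same => n,Goto(loop)

exten => *,1,SpeechCreate(vosk^en)        ; * pressed: start listening
 same => n,SpeechBackground(beep,10,p,en)
 same => n,Verbose(0,Heard: ${SPEECH_TEXT(0^en)})
 same => n,SpeechDestroy(vosk^en)
 same => n,Goto(s,loop)
```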

Try using AudioSocket.
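For context: AudioSocket streams the channel's raw audio over a plain TCP connection, where each frame is a 1-byte kind, a 2-byte big-endian payload length, and the payload (documented kinds: 0x00 hangup, 0x01 UUID, 0x10 signed-linear 16-bit/8 kHz mono audio, 0xff error). A minimal sketch of the receiving side's framing logic, under those assumptions:

```python
import struct

# AudioSocket frame kinds (per the Asterisk AudioSocket protocol).
KIND_HANGUP, KIND_UUID, KIND_AUDIO, KIND_ERROR = 0x00, 0x01, 0x10, 0xFF

def parse_frames(buf: bytes):
    """Split a raw TCP byte stream into (kind, payload) AudioSocket frames.

    Returns (frames, remainder), where remainder holds any incomplete
    trailing frame to be prepended to the next recv() chunk.
    """
    frames = []
    i = 0
    while i + 3 <= len(buf):
        kind = buf[i]
        (length,) = struct.unpack_from("!H", buf, i + 1)  # big-endian u16
        if i + 3 + length > len(buf):
            break  # partial frame; wait for more data
        frames.append((kind, buf[i + 3 : i + 3 + length]))
        i += 3 + length
    return frames, buf[i:]
```

On the dialplan side this would be driven by the AudioSocket application (a UUID plus a host:port service argument); the audio payloads collected here could then be forwarded to the existing Vosk websocket servers by a custom bridge, which is the part that would need building.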

Regards
CJ