Audio quality for Speech To Text

Hi,

I am using ARI ExternalMedia. The audio source is redirected to a websocket (I use rtp-udp-server.js, which is available here:

https://github.com/asterisk/asterisk-external-media/tree/master/lib

) which an STT platform (Google and Deepgram) then listens to.

My problem is that the audio quality is not so good for STT. Is there a way to improve the quality?
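For context, a minimal sketch of starting an ExternalMedia channel over ARI's REST interface (the app name, host, port, and credentials are placeholders):

```javascript
// Minimal sketch: start an ExternalMedia channel through ARI's REST API.
// App name, host, port and credentials here are placeholders.
const http = require('http');

const params = new URLSearchParams({
  app: 'stt-bridge',               // Stasis application name (placeholder)
  external_host: '127.0.0.1:9999', // where rtp-udp-server.js listens for RTP
  format: 'slin16'                 // 16 kHz signed linear PCM
});

const req = http.request({
  host: '127.0.0.1',
  port: 8088,                      // default ARI HTTP port
  path: '/ari/channels/externalMedia?' + params,
  method: 'POST',
  auth: 'ariuser:arisecret'        // placeholder ARI credentials
}, res => res.resume());
req.end();
```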

Thanks

Define “not so good”. It’s been fine for our usage and testing. The audio itself is pretty much what was received, aside from transcoding.

Thanks for your reply.
“Not so good” for the STT platform.
If, for example, I feed the STT from my microphone, the audio quality is good enough to get low latency, a good transcription, and end-of-speech detection.
By contrast, when using Asterisk, the latency is higher, the transcription is worse, and the end-of-speech detection comes later or not at all.

What does “using Asterisk” mean? You need to be specific about where the audio is coming from. For example, are you referring to receiving a call from an ITSP? Have you recorded the incoming audio to examine it, or looked at the latency there?

Internally, Asterisk delivers audio fairly fast, within milliseconds, and even sending it externally adds minimal latency. It also doesn’t alter the audio itself.

Hi sorry for the late answer.

“Using Asterisk” means originating a call with Asterisk. In this case I use the ExternalMedia app to transfer the audio (the voice of the person who is called) to a websocket that Google or Deepgram listens to. Before writing to the websocket, I remove the RTP headers to avoid errors with the STT platforms.
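For illustration, a minimal sketch of that header-stripping step, assuming the RTP from ExternalMedia carries no CSRC entries or header extensions (so the header is the fixed 12 bytes); forwardToWebsocket is a hypothetical stand-in for the websocket send:

```javascript
const dgram = require('dgram');

// Minimal sketch: receive the ExternalMedia RTP stream over UDP and strip
// the fixed 12-byte RTP header, assuming no CSRC entries or extensions.
const RTP_HEADER_LENGTH = 12;

const socket = dgram.createSocket('udp4');
socket.on('message', packet => {
  if (packet.length <= RTP_HEADER_LENGTH) return;      // ignore runt packets
  const payload = packet.subarray(RTP_HEADER_LENGTH);  // raw audio samples
  forwardToWebsocket(payload); // hypothetical: push payload to the STT feed
});
socket.bind(9999); // the port given as external_host to ExternalMedia
```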

In this case, the STT platform gives bad results, and the latency to detect the end of speech is much longer than when I use the STT platform with my computer’s microphone (so with better audio quality).

My question was how to improve the audio quality when using ExternalMedia, to get better results from the STT platforms.

Originate a call where? External media doesn’t touch the audio and just forwards it on, so if you’re receiving bad or delayed audio from outside of Asterisk then it would be relayed as such. There isn’t anything specific to external media to “improve” the audio quality. There are some dialplan functions for audio tweaking:

https://docs.asterisk.org/Asterisk_21_Documentation/API_Documentation/Dialplan_Functions/AGC/
https://docs.asterisk.org/Asterisk_21_Documentation/API_Documentation/Dialplan_Functions/DENOISE/
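For illustration, applying them in the dialplan would look something like this (an untested sketch; both functions are provided by func_speex, and the extension and Stasis app names are placeholders):

```
exten => stt,1,Answer()
 same => n,Set(AGC(rx)=8000)    ; automatic gain control on received audio
 same => n,Set(DENOISE(rx)=on)  ; noise reduction on received audio
 same => n,Stasis(stt-bridge)   ; placeholder Stasis application
```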

But I doubt this would help. You need to identify where your issue is actually coming from by looking at the audio as received by Asterisk.

I would expect the STT system to do the first (gain control) itself, and my guess is that denoising will lose some signal as well, so it is best left to the STT platform.

The main things for good audio are:

  1. don’t call anyone on the public phone network (limited audio bandwidth);

  2. even more so, don’t call anyone on the public mobile network (as above, plus aggressive vocoding and additional latency).

And for effective use of LLMs:

Process the conversation as a whole and do not attempt to recognise it incrementally.

Thanks! I can’t find anything wrong, except this: when I call someone (not a SIP phone connected to Asterisk) with my SIP number, I use ExternalMedia on the ‘external channel’ (only the voice of the person I called). ExternalMedia then gives me RTP packets that I transform by removing the headers and applying swap16 so the sample data is correctly ordered.
After this simple transformation the audio is fed to the Google or Deepgram API.

It only works if I treat the arriving RTP packets as 16kHz; declaring 8 does not work. Maybe the issue comes from here?
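For context, a minimal sketch of that transformation, assuming format=slin16 on the ExternalMedia channel (16 kHz signed linear, big-endian on the wire) and Deepgram's linear16 streaming input; the API key is a placeholder:

```javascript
const WebSocket = require('ws');

// Minimal sketch: slin16 arrives in network (big-endian) byte order, while
// Deepgram's linear16 expects little-endian, hence swap16 and 16000 Hz.
const dg = new WebSocket(
  'wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000&channels=1',
  { headers: { Authorization: 'Token YOUR_DEEPGRAM_KEY' } } // placeholder key
);

function sendPayload(payload) {
  const samples = Buffer.from(payload); // copy, so the source buffer is kept
  samples.swap16();                     // big-endian -> little-endian in place
  if (dg.readyState === WebSocket.OPEN) dg.send(samples);
}
```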

Thanks for the advice, but my solution needs to call over the mobile network, because not everybody is connected to the same Asterisk server, and to get a quick response I need to analyze the audio in streaming mode. If I wanted to analyze the entire audio, that would mean detecting the start and end of speech, analyzing it, and then sending it to the LLM. With streaming mode, when someone stops speaking I already have what they said, plus only the end-of-speech detection latency (for example Speech_Final on Deepgram).
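For context, a sketch of how that end-of-speech signal arrives in Deepgram's streaming responses (field names follow their documented Results message; the key is a placeholder):

```javascript
const WebSocket = require('ws');

// Sketch: watch Deepgram's streaming results for the end-of-utterance flag.
const dg = new WebSocket(
  'wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000',
  { headers: { Authorization: 'Token YOUR_DEEPGRAM_KEY' } } // placeholder key
);

dg.on('message', data => {
  const msg = JSON.parse(data);
  if (msg.type !== 'Results') return;
  const alt = msg.channel && msg.channel.alternatives[0];
  if (!alt) return;
  if (msg.speech_final) {
    // Utterance ended: the accumulated transcript can go to the LLM now.
    console.log('Speech_Final:', alt.transcript);
  }
});
```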

In that case, the ultimate limit on quality is the mobile network. Although many mobile networks support extended frequency ranges with suitable phones, they all use codecs that do aggressive signal processing in the frequency domain, so you will never get audio as clean as you would from a simple microphone capture.

G.722 is the only public network codec that would get close; it is only used for landlines, and I’m not sure how widely implemented it is. The traditional G.711 and GSM codecs use 8kHz sampling.

A bit of googling for “mean opinion score” will give you human scores for the quality of various codecs.

I assume you mean 16kHz.

Thanks a lot! And yes, it is kHz, sorry.