Audio quality for Speech To Text

Hi,

I am using ARI ExternalMedia. The audio source is redirected to a websocket (I use rtp-udp-server.js, which is available here:

https://github.com/asterisk/asterisk-external-media/tree/master/lib

) which an STT platform (Google and Deepgram) then listens to.

My problem is that the audio quality is not so good for STT. Is there a way to improve the quality?
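For context, a minimal sketch of starting an ExternalMedia channel over ARI's REST interface (the app name, host, port, and credentials are placeholders):

```javascript
// Minimal sketch: start an ExternalMedia channel through ARI's REST API.
// App name, host, port and credentials here are placeholders.
const http = require('http');

const params = new URLSearchParams({
  app: 'stt-bridge',               // Stasis application name (placeholder)
  external_host: '127.0.0.1:9999', // where rtp-udp-server.js listens for RTP
  format: 'slin16'                 // 16 kHz signed linear PCM
});

const req = http.request({
  host: '127.0.0.1',
  port: 8088,                      // default ARI HTTP port
  path: '/ari/channels/externalMedia?' + params,
  method: 'POST',
  auth: 'ariuser:arisecret'        // placeholder ARI credentials
}, res => res.resume());
req.end();
```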

Thanks

Define “not so good”. It’s been fine for our usage and testing. The audio itself is pretty much what was received, aside from transcoding.

Thanks for your reply.
“Not so good” for the STT platform.
If, for example, I feed the STT from my microphone, the audio quality is good enough to get low latency, a good transcription, and end-of-speech detection.
By contrast, when using Asterisk, the latency is higher, the transcription is worse, and the end-of-speech detection comes later or not at all.

What does “using Asterisk” mean? You need to be specific about where the audio is coming from. For example, are you referring to receiving a call from an ITSP? Have you recorded the incoming audio to examine it, or looked at the latency there?

Internally, Asterisk delivers audio fairly fast, within milliseconds, and even sending it externally adds minimal latency. It also doesn’t alter the audio itself.

Hi sorry for the late answer.

“Using Asterisk” means originating a call with Asterisk. In this case I use the ExternalMedia app to transfer the audio (the voice of the person who is called) to a websocket that Google or Deepgram listens to. Before writing to the websocket, I remove the RTP headers to avoid errors with the STT platforms.
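For illustration, a minimal sketch of that header-stripping step, assuming the RTP from ExternalMedia carries no CSRC entries or header extensions (so the header is the fixed 12 bytes); forwardToWebsocket is a hypothetical stand-in for the websocket send:

```javascript
const dgram = require('dgram');

// Minimal sketch: receive the ExternalMedia RTP stream over UDP and strip
// the fixed 12-byte RTP header, assuming no CSRC entries or extensions.
const RTP_HEADER_LENGTH = 12;

const socket = dgram.createSocket('udp4');
socket.on('message', packet => {
  if (packet.length <= RTP_HEADER_LENGTH) return;      // ignore runt packets
  const payload = packet.subarray(RTP_HEADER_LENGTH);  // raw audio samples
  forwardToWebsocket(payload); // hypothetical: push payload to the STT feed
});
socket.bind(9999); // the port given as external_host to ExternalMedia
```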

In this case, the STT platform gives bad results, and the latency to detect the end of speech is much longer than when I use the STT platform with my computer’s microphone (so with better audio quality).

My question was how to improve the audio quality when using ExternalMedia, to get better results from the STT platforms.

Originate a call where? External media doesn’t touch the audio and just forwards it on, so if you’re receiving bad or delayed audio from outside of Asterisk then it would be relayed as such. There isn’t anything specific to external media to “improve” the audio quality. There are some dialplan functions for audio tweaking:

https://docs.asterisk.org/Asterisk_21_Documentation/API_Documentation/Dialplan_Functions/AGC/
https://docs.asterisk.org/Asterisk_21_Documentation/API_Documentation/Dialplan_Functions/DENOISE/
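For illustration, applying them in the dialplan would look something like this (an untested sketch; both functions are provided by func_speex, and the extension and Stasis app names are placeholders):

```
exten => stt,1,Answer()
 same => n,Set(AGC(rx)=8000)    ; automatic gain control on received audio
 same => n,Set(DENOISE(rx)=on)  ; noise reduction on received audio
 same => n,Stasis(stt-bridge)   ; placeholder Stasis application
```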

But I doubt this would help. You need to identify where your issue is actually coming from by looking at the audio as received by Asterisk.

I would expect the STT system to do the first (gain control) itself, and my guess is that denoising will lose some signal as well, so it is best left to the STT platform.

The main things for good audio are:

  1. don’t call anyone on the public phone network (limited audio bandwidth);

  2. even more so, don’t call anyone on the public mobile network (as above, plus aggressive vocoding and additional latency).

And for effective use of LLMs:

Process the conversation as a whole and do not attempt to recognise it incrementally.

Thanks! I can’t find anything wrong, except this: when I call someone (not a SIP phone connected to Asterisk) with my SIP number, I use ExternalMedia on the ‘external channel’ (only the voice of the person I called). ExternalMedia then gives me RTP packets that I transform by removing the headers and applying swap16 so the sample data is correctly ordered.
After this simple transformation the audio is fed to the Google or Deepgram API.

It only works if I treat the arriving RTP packets as 16kHz; declaring 8 does not work. Maybe the issue comes from here?
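For context, a minimal sketch of that transformation, assuming format=slin16 on the ExternalMedia channel (16 kHz signed linear, big-endian on the wire) and Deepgram's linear16 streaming input; the API key is a placeholder:

```javascript
const WebSocket = require('ws');

// Minimal sketch: slin16 arrives in network (big-endian) byte order, while
// Deepgram's linear16 expects little-endian, hence swap16 and 16000 Hz.
const dg = new WebSocket(
  'wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000&channels=1',
  { headers: { Authorization: 'Token YOUR_DEEPGRAM_KEY' } } // placeholder key
);

function sendPayload(payload) {
  const samples = Buffer.from(payload); // copy, so the source buffer is kept
  samples.swap16();                     // big-endian -> little-endian in place
  if (dg.readyState === WebSocket.OPEN) dg.send(samples);
}
```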

Thanks for the advice, but my solution needs to call over the mobile network, because not everybody is connected to the same Asterisk server, and to get a quick response I need to analyze the audio in streaming mode. If I wanted to analyze the entire audio, that would mean detecting the start and end of speech, analyzing it, and then sending it to the LLM. With streaming mode, when someone stops speaking I already have what they said, plus only the end-of-speech detection latency (for example Speech_Final on Deepgram).
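For context, a sketch of how that end-of-speech signal arrives in Deepgram's streaming responses (field names follow their documented Results message; the key is a placeholder):

```javascript
const WebSocket = require('ws');

// Sketch: watch Deepgram's streaming results for the end-of-utterance flag.
const dg = new WebSocket(
  'wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000',
  { headers: { Authorization: 'Token YOUR_DEEPGRAM_KEY' } } // placeholder key
);

dg.on('message', data => {
  const msg = JSON.parse(data);
  if (msg.type !== 'Results') return;
  const alt = msg.channel && msg.channel.alternatives[0];
  if (!alt) return;
  if (msg.speech_final) {
    // Utterance ended: the accumulated transcript can go to the LLM now.
    console.log('Speech_Final:', alt.transcript);
  }
});
```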

In that case, the ultimate limit on quality is the mobile network. Although many mobile networks support extended frequency ranges with suitable phones, they all use codecs that do aggressive signal processing in the frequency domain, so you will never get audio as clean as you would from a simple microphone capture.

G.722 is the only public network codec that would get close; it is only used for landlines, and I’m not sure how widely implemented it is. The traditional G.711 and GSM codecs use 8kHz sampling.

A bit of googling for “mean opinion score” will give you human scores for the quality of various codecs.

I assume you mean 16kHz.

Thanks a lot! And yes, it is kHz, sorry.