Call recording transcription - Google/OpenAI

I have a client requesting call transcription for all calls made and received.

I have been looking at the Google Speech-to-Text API, but currently they don’t support GSM encoding. I have a script that converts the call recording to GSM:

sox /var/spool/asterisk/monitor/$UNIQUEID.wav -e gsm-full-rate /var/spool/asterisk/monitor/$DATE/$UNIQUEID.WAV

I was thinking that if I can push the recording to the Google API before it’s encoded it might work. However, I am trying to figure out how much additional load this would put on the system.
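Something like this is what I had in mind (untested sketch; it assumes the synchronous speech:recognize endpoint, which I believe only takes short clips, and gcloud application-default credentials, so full-length calls would need the long-running / Cloud Storage variant instead):

    #!/bin/bash
    # Untested sketch: push the un-encoded 8kHz mono recording to Google
    # Speech-to-Text before the sox GSM conversion runs.
    UNIQUEID="$1"
    WAV="/var/spool/asterisk/monitor/${UNIQUEID}.wav"
    BODY=$(mktemp)

    # Build the request body with the audio inlined as base64 (short clips only).
    printf '{"config":{"encoding":"LINEAR16","sampleRateHertz":8000,"languageCode":"en-US"},"audio":{"content":"%s"}}' \
        "$(base64 -w0 "${WAV}")" > "${BODY}"

    curl -s -X POST "https://speech.googleapis.com/v1/speech:recognize" \
        -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
        -H "Content-Type: application/json" \
        -d @"${BODY}" > "/var/spool/asterisk/monitor/${UNIQUEID}.json"

    rm -f "${BODY}"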

Alternatively, I could copy the unencoded file to a storage server for processing.

Any advice will be much appreciated.


I have also looked into OpenAI’s open source speech-to-text code (Whisper), which seems to work but is extremely slow without an Nvidia GPU.
It’s also extremely inaccurate with the small and medium English models.
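For reference, this is roughly how I was running it (the openai-whisper package installs a whisper command; flags from memory, so treat it as a sketch):

    pip install openai-whisper
    # CPU-only run, which was extremely slow for me:
    whisper /var/spool/asterisk/monitor/$UNIQUEID.wav --model medium.en --language en --output_format txt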

Speech recognition is difficult enough without degrading the signal with lossy compression like GSM. Also, GSM is no longer a good technical choice for a lossy speech codec, as there are better ones commonly available.

Can you suggest better ones to use?

Our system has been using GSM for the past 15 years to save storage space.

On Friday 29 March 2024 at 07:07:10, faqterson via Asterisk Community wrote:

Can you suggest better ones to use?

Our system has been using GSM for the past 15 years to save storage space.

For the best quality (and David551 has already made the point that this is important for computer speech recognition) you want to be using G.711, i.e. ulaw or alaw.
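As a rough, untested example, converting one of your existing signed-linear .wav recordings to A-law with sox should be as simple as:

    # untested: same recording, but G.711 A-law instead of GSM
    sox /var/spool/asterisk/monitor/$UNIQUEID.wav -e a-law /var/spool/asterisk/monitor/$DATE/$UNIQUEID-alaw.wav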

If this means you start running out of storage space, then I think it’s simply time you bought some new disks. After all, with 8 TB, 12 TB and 16 TB disks commonly available, I find it hard to imagine how many recordings you would have to keep before you couldn’t store them.
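Rough arithmetic to put numbers on it: G.711 is 64 kbit/s, i.e. 8,000 bytes per second per channel, so an hour of mono audio is under 30 MB and a single 16 TB disk holds on the order of half a million hours:

    # G.711 = 64 kbit/s = 8000 bytes/second per channel
    echo $(( 8000 * 3600 ))                    # bytes per hour of mono audio, ~28.8 MB
    echo $(( 16 * 10**12 / (8000 * 3600) ))    # hours on a 16 TB disk, ~555,000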

Antony.


What makes you think I know what I’m talking about?
I just have more O’Reilly books than most people.


You could do the speech processing on the device using our open source softphone. It does speech-to-text and summarisation. The recording, transcription and summary are then sent on to the desired webhook.


I am assuming then that I will have to change from the .wav format if changing to .ua.

    exten => s,n,MixMonitor(${CDR(uniqueid)}.wav,ab,/usr/local/scripts/convert ${CDR(uniqueid)})

Won’t .ua files have problems playing back in a media player? Also, will speech-to-text APIs support this format?
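If .ua here is just raw 8 kHz G.711 (the way Asterisk writes its .ulaw/.alaw files), I assume I would need an extra step like this (untested) to wrap a recording back into a .wav before anyone can play it:

    # untested: wrap a raw 8kHz mono u-law recording back into a .wav for playback
    sox -t raw -e mu-law -r 8000 -c 1 /var/spool/asterisk/monitor/$UNIQUEID.ua /var/spool/asterisk/monitor/$UNIQUEID.wav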

Since most of our voice providers only support the G.729 codec, transcoding would be needed for .ua files; could that increase load on the server?


We had one client that requested stereo audio for a period, to have the caller in the left ear and the agent in the right ear. This increased the average daily recording folder size from 1-1.5 GB to 20-25 GB. It also massively increased the hard drive I/O load.

1.2G - /var/spool/asterisk/monitor/2019-09-30
25G - /var/spool/asterisk/monitor/2019-11-21

I would like to avoid this at all costs.

.ua appears to be a container format, so I’d expect the cost of converting from signed linear, 8 kHz, 16 bit, mono in a .wav wrapper, to the same in a .ua wrapper, to be very low.

Someone is going to have to transcode it to linear to do the voice recognition. Moreover, G.729 is essentially obsolete in the Western world, as there are better low bit rate codecs, and the trend is towards higher voice bandwidth, given that data is so cheap.

I doubt that any speech-to-text service will accept G.729 directly, and anything that has ever been through G.729 will be starting off at a disadvantage regarding transcription quality.

I think I am going to stick with the .wav recording format and just remove the sox conversion to GSM:

-e gsm-full-rate

New MixMonitor line for Asterisk 20:

exten => s,n,MixMonitor(${CDR(uniqueid)}.wav,ab)

and attempt to use the recordings with OpenAI Whisper.
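If that works out, the post-processing would probably go back onto MixMonitor as a small hook script, something like this (untested sketch; the script name, model and transcript directory are just placeholders):

    #!/bin/bash
    # /usr/local/scripts/transcribe (placeholder name) - run by MixMonitor as the
    # post-process command, e.g.:
    #   exten => s,n,MixMonitor(${CDR(uniqueid)}.wav,ab,/usr/local/scripts/transcribe ${CDR(uniqueid)})
    UNIQUEID="$1"
    WAV="/var/spool/asterisk/monitor/${UNIQUEID}.wav"

    # Whisper is slow on CPU, so in practice this should probably queue the file
    # rather than transcribe inline on a busy system.
    whisper "${WAV}" --model medium.en --language en \
        --output_format txt --output_dir /var/spool/asterisk/transcripts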