RECORD FILE: what are the default options and how can they be changed?

Hello!

I use the RECORD FILE command in AGI projects and I was wondering what the default options are and how to change them:

filename - I use a random string
format - I use WAV, but I wonder if MP3 is supported too

The WAV files I get from RECORD FILE have a sample rate of 8000Hz. I would like to know if I can record at 16000Hz or 44100Hz, because some ASR services need files at one of these sample rates to work.

Thanks for any info you can give.

You don’t want to use .WAV for speech recognition. For 8kHz, you want to use .wav. The former has a GSM payload, the latter has a linear PCM one.
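To check which kind of file you actually got, here is a minimal sketch using Python's standard `wave` module. The name `describe_wav` is made up for illustration. The module only parses PCM payloads, so a GSM-encoded .WAV should raise `wave.Error` instead of returning values.

```python
import wave


def describe_wav(source):
    """Return (sample_rate_hz, bits_per_sample, channels) for a PCM WAV.

    `source` is a file path or a binary file object. Non-PCM payloads
    (such as GSM) are rejected by the wave module with wave.Error.
    """
    with wave.open(source, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()
```

A file recorded with the wav format should report 8000 Hz, 16 bits and 1 channel.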

I don’t believe that MP3 is supported for recording. It would be overkill for telephone-quality speech, and it would need to be converted to linear PCM for the recognition algorithms anyway.

If you are using traditional telephone-quality speech, there is no benefit in going beyond 8,000 Hz sampling (the best you might get from some PSTN calls, everything else being right, is 16,000 Hz). However, the .sln formats, which are raw PCM with no metadata, would allow you to record upsampled audio: .sln16 is 16,000 Hz and .sln44 is 44,100 Hz. The second argument of the AGI RECORD FILE command sets the format.
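As a rough sketch of how that looks from an AGI script, assuming the usual `RECORD FILE <filename> <format> <escape digits> <timeout> [BEEP]` argument order. The helper names (`read_agi_env`, `build_record_command`, `send_agi`) and the example path are hypothetical, not part of any Asterisk library.

```python
import sys


def read_agi_env(stream=sys.stdin):
    """Read the 'agi_key: value' lines Asterisk sends first, up to the blank line."""
    env = {}
    for line in stream:
        line = line.strip()
        if not line:
            break
        key, _, value = line.partition(":")
        env[key.strip()] = value.strip()
    return env


def build_record_command(filename, fmt="wav", escape_digits="#",
                         timeout_ms=-1, beep=False):
    """Build a RECORD FILE command line.

    `filename` is given without an extension; `fmt` selects the format
    (for example wav, sln16 or sln44). timeout_ms=-1 means no time limit.
    """
    parts = ["RECORD FILE", filename, fmt, f'"{escape_digits}"', str(timeout_ms)]
    if beep:
        parts.append("BEEP")
    return " ".join(parts)


def send_agi(command, out=sys.stdout, inp=sys.stdin):
    """Write one command to Asterisk and return its reply line."""
    out.write(command + "\n")
    out.flush()
    return inp.readline().strip()
```

In a real script you would call `read_agi_env()` once at startup and then something like `send_agi(build_record_command("/tmp/rec-abc", "sln16", "#", 10000, beep=True))`.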

None of this will improve the recognition of sibilants if the original audio is at the standard PSTN 8 kHz rate.
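If the service insists on a WAV container instead of raw samples, the headerless .sln16 data can be wrapped without touching the audio. A minimal sketch with Python's standard `wave` module follows; `wrap_sln_as_wav` is a hypothetical helper, and it assumes 16-bit mono little-endian samples, which I would expect on an x86 host but have not verified for other platforms.

```python
import wave


def wrap_sln_as_wav(raw_pcm, sample_rate_hz, dest):
    """Write headerless 16-bit mono PCM bytes to `dest` as a WAV file.

    `dest` is a file path or a writable, seekable binary file object.
    Assumption: samples are little-endian, which is what WAV expects.
    The audio itself is copied unchanged; only a header is added.
    """
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate_hz)
        wav.writeframes(raw_pcm)
```

For a .sln16 recording you would pass the file's bytes with `sample_rate_hz=16000`. This only changes the container, so it does not add any information above what the call carried.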

Actually, if someone is only offering 44.1kHz, I would seriously wonder whether the model has been trained on telephone quality speech.

Thanks, that is a far better explanation than I was expecting.
The problem appears to be a general lack of support for 8000Hz audio, because even when the audio is upsampled to 16000Hz or more, the transcript is blank.
At this point I have opened a ticket with the provider of the transcription service and attached a sound sample.

Hi,

please let me know what scenario you need this for!

Do you need it to trigger an action in the dialplan when a specific word is recognized, or do you need it for full transcription?

If it’s about triggering a dialplan action based on a recognized word, you could install Vosk in a Docker container. It works quite well for me, since I prefer to process the data offline instead of using external ASR services.

Actually, if someone is only offering 44.1kHz, I would seriously wonder whether the model has been trained on telephone quality speech.
That was illuminating. I asked the ASR service's support, and indeed the model I was sending queries to was not trained at all to transcribe telephony-quality speech.
What solved the case was switching to a custom legacy model that the ASR service had already trained for telephony audio, and which the customer was already used to.

—-

please let me know what scenario you need this for!

Do you need it to trigger an action in the dialplan when a specific word is recognized, or do you need it for full transcription?

If it’s about triggering a dialplan action based on a recognized word, you could install Vosk in a Docker container. It works quite well for me, since I prefer to process the data offline instead of using external ASR services.

The goal was to restore functionality that relied on an ASR provider's product which was supported years ago. The provider's new API has a different interface, so software changes had to be made.

Vosk is definitely an interesting suggestion I will be looking into, thanks!
So far I have tried the Whisper tiny model.
The target languages are English and Italian.
