Using External Media to play audio file from cloud text-to-speech results in noisy audio output

Hello,

I’m trying to create an application connected to Asterisk using External Media. The application follows this scheme:

https://docs.asterisk.org/Development/Reference-Information/Asterisk-Framework-and-API-Examples/External-Media-and-ARI/

The difference is that I’d like to process the transcription, run it through a cloud text-to-speech service, and stream the resulting audio back to the channel over RTP.

On the Asterisk side, phparia (GitHub - wormling/phparia: Framework for creating ARI (Asterisk REST Interface) applications.) is used to initiate the external media channel, carrying the sound over RTP to my application. In my application, I wrote a simple jitter buffer and forward the media stream to STT. After a few more steps, I take the text response, generate an audio file, and stream it back over RTP. The RTP transmitter is also written by hand: it simply cuts the audio file into chunks, wraps them in RTP, and sends them to Asterisk.
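For context, the receive-side jitter buffer is nothing elaborate. A minimal sketch of the idea (not my exact code; the class and method names are made up, and sequence-number wraparound at 65535 is ignored for brevity) would be to hold a few packets and release them in sequence order:

```python
import heapq

class JitterBuffer:
    """Minimal reordering buffer: hold up to `depth` packets,
    then release them to the consumer in sequence-number order."""

    def __init__(self, depth=4):
        self.depth = depth
        self.heap = []  # (seq, payload) pairs, ordered by seq

    def push(self, seq, payload):
        heapq.heappush(self.heap, (seq, payload))

    def pop_ready(self):
        """Return payloads in order once more than `depth` packets are queued."""
        out = []
        while len(self.heap) > self.depth:
            out.append(heapq.heappop(self.heap)[1])
        return out
```

A real buffer would also handle wraparound, late/duplicate packets, and timeouts, but this is enough to smooth out mild reordering before handing audio to STT.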

It almost works and prints this to the output:

[Feb  1 16:04:21.626] VERBOSE[3431966] dial.c: Called 10.10.7.125:42573
[Feb  1 16:04:21.627] VERBOSE[3431966] dial.c: UnicastRTP/10.10.7.125:42573-0x14ebe8e3a040 answered
[Feb  1 16:04:21.627] VERBOSE[3431966] ari/resource_channels.c: Launching Stasis(filter_1,mediaresend,{\"callName\":\"65bbb2f54a7899.39571233\"}) on UnicastRTP/10.10.7.125:42573-0x14ebe8e3a040
[Feb  1 16:04:21.780] VERBOSE[3431965][C-00000001] res_rtp_asterisk.c: 0x14ebe64c6000 -- Strict RTP qualifying stream type: audio
[Feb  1 16:04:21.827] VERBOSE[3431967] bridge_channel.c: Channel UnicastRTP/10.10.7.125:42573-0x14ebe8e3a040 joined 'simple_bridge' stasis-bridge <65bbb2f54a7899.39571233>
[Feb  1 16:04:21.834] VERBOSE[3431965][C-00000001] res_rtp_asterisk.c: 0x14ebe64c6000 -- Strict RTP switching source address to 172.23.254.27:4010
[Feb  1 16:04:22.401] VERBOSE[3431967] res_rtp_asterisk.c: 0x14ebe8e3d000 -- Strict RTP qualifying stream type: <unknown>
[Feb  1 16:04:22.571] VERBOSE[3431967] res_rtp_asterisk.c: 0x14ebe8e3d000 -- Strict RTP qualifying stream type: <unknown>
[Feb  1 16:04:22.741] VERBOSE[3431967] res_rtp_asterisk.c: 0x14ebe8e3d000 -- Strict RTP qualifying stream type: <unknown>
[Feb  1 16:04:22.911] VERBOSE[3431967] res_rtp_asterisk.c: 0x14ebe8e3d000 -- Strict RTP qualifying stream type: <unknown>
[Feb  1 16:04:22.911] VERBOSE[3431967] res_rtp_asterisk.c: 0x14ebe8e3d000 -- Strict RTP switching source address to 10.10.7.125:50370

The only problem is that the sound is corrupted. I can understand it, but it is noisy.

I captured the network traffic and played it back in Wireshark, and it sounds better.

The same result, clear sound, can be achieved when streaming the RTP data into ffmpeg.

I tried to understand the RTP receiver in Asterisk (res/res_rtp_asterisk.c), but it’s been a struggle so far.

I use 8-bit, 8 kHz A-law encoding everywhere. The audio signal coming from Asterisk to my application is fine; only the other direction is corrupted.

Can I expect Asterisk to restore the audio signal received over UDP/RTP? Is there a jitter buffer on the input, should I configure one, or must I deliver the UDP packets in order? Or am I missing something?

I hope the explanation makes sense; thanks for any ideas,
Martin

My current suspicion is that Asterisk does not know how to interpret the RTP stream because it writes:

[Feb  1 16:04:22.571] VERBOSE[3431967] res_rtp_asterisk.c: 0x14ebe8e3d000 -- Strict RTP qualifying stream type: <unknown>

Which looks like the codec is unknown:

if (rtp->rtp_source_learn.stream_type == AST_MEDIA_TYPE_UNKNOWN) {
	struct ast_rtp_codecs *codecs;
	
	codecs = ast_rtp_instance_get_codecs(instance);
	rtp->rtp_source_learn.stream_type =
		ast_rtp_codecs_get_stream_type(codecs);
	ast_verb(4, "%p -- Strict RTP qualifying stream type: %s\n",
		rtp, ast_codec_media_type2str(rtp->rtp_source_learn.stream_type));
}

But I’m not sure what to do about it.

There is normally no jitter buffer in Asterisk, so you’ll need to clarify what this does exactly. It needs to provide a steady stream of audio - the packets can’t all be sent at once, and the RTP header needs to reflect things properly (sequence number, timestamp).

It is an RTP transmitter implemented as a loop that sends the next 20 ms chunk of audio data every 20 ms. It’s a pretty simple ~200 LoC of calculation plus a loop with a sleep.

Anyway, I double-checked the sequence numbers and timestamps, and there actually was a bug! Wireshark hinted at the problem with the triangle/square/etc. marks scattered all over the audio signal I posted in the first message.

I fixed the issue in the timestamp calculation, plus one more: silence is not 0 in A-law, it is 85 (0x55). Now it works and I can understand the audio clearly.
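To see why zero-padding produces noise: decoding A-law byte 0x00 with the standard G.711 expansion yields a large sample, while 0x55 decodes to almost zero. A straightforward transcription of the classic reference decoder (a sketch for illustration):

```python
def alaw_to_linear(a_val):
    """Decode one G.711 A-law byte to a 13-bit-range linear PCM sample."""
    a_val ^= 0x55                 # undo the even-bit inversion
    t = (a_val & 0x0F) << 4       # quantization bits
    seg = (a_val & 0x70) >> 4     # segment number
    if seg == 0:
        t += 8
    elif seg == 1:
        t += 0x108
    else:
        t += 0x108
        t <<= seg - 1
    return t if a_val & 0x80 else -t
```

With this, 0x00 decodes to -5504 (a loud click when repeated), whereas 0x55 decodes to -8, i.e. effectively silence - which matches what I observed.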

ffmpeg was probably just too forgiving, accepting anything I sent without much complaint.

So thanks for your help :slight_smile: