MixMonitor with `D` flag produces distorted/unplayable audio when two channel legs have different codecs (ulaw + slin16)

Asterisk Version: 22.8.0
Channel Driver: PJSIP
AGI: FastAGI (Node.js)

Background

I am building a voice bot system using Asterisk 22.8.0 with PJSIP and a Node.js FastAGI server. Incoming PSTN calls hit my dialplan, get routed to FastAGI, and I record both legs using MixMonitor with the D flag to produce a stereo interleaved raw file (one speaker per channel).


The Problem

My two channel legs are running on different codecs and sample rates:

  • Channel 1 (PSTN / caller leg): ulaw → 8kHz, 8-bit µ-law
  • Channel 2 (Voice bot / application leg): slin16 → 16kHz, 16-bit signed linear

Asterisk channel stream output confirms this:

-- Streams --
Name: audio-0
    Type: audio
    State: sendrecv
    Group: -1
    Formats: (slin16)

My FastAGI executes MixMonitor like this:

AGI Script Executing Application: (MixMonitor) Options: (/Recording/4663.raw,D)

As per Asterisk documentation, the D flag:

Interleaves the audio coming from the channel and the audio going to the channel and outputs it as a 2 channel (stereo) raw stream. You must use the .raw extension.

This creates a single stereo interleaved file: 4663.raw


Converting with sox — All Attempts Fail

Since the two legs have different sample rates, no single sox command produces correct audio:

Attempt 1 — Treating as 16kHz (slin16):

sox -r 16000 -e signed-integer -b 16 -c 2 4663.raw 4663.wav

→ Audio plays too fast; the ulaw leg is at double speed

Attempt 2 — Treating as 8kHz (ulaw):

sox -r 8000 -e signed-integer -b 16 -c 2 4663.raw 4663.wav

→ Audio plays too slow and distorted; the slin16 leg is at half speed

soxi output of the resulting WAV:

Input File     : '4663.wav'
Channels       : 2
Sample Rate    : 8000
Precision      : 16-bit
Duration       : 00:02:30.50 = 1204000 samples
File Size      : 4.82M
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

The duration does not match the actual call length at either sample rate.
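
The arithmetic behind that mismatch can be checked directly. A headerless file has no intrinsic duration, so the figure soxi reports follows entirely from the rate given on the sox command line. A minimal Python sketch, assuming 16-bit samples in two channels (the layout used in both sox attempts above); the function name is illustrative, and the byte count is derived from the soxi sample count rather than measured from the file:

```python
def raw_duration_seconds(size_bytes, sample_rate, channels=2, bytes_per_sample=2):
    """Duration implied by a headerless PCM file at an assumed sample rate."""
    frame_size = channels * bytes_per_sample  # one sample per channel
    return size_bytes / (frame_size * sample_rate)

# 1204000 frames reported by soxi, 4 bytes per 16-bit stereo frame
size_bytes = 1204000 * 4

for rate in (8000, 16000):
    print(rate, raw_duration_seconds(size_bytes, rate))
```

At 8000 Hz this gives 150.5 s (the 00:02:30.50 shown above) and at 16000 Hz it gives 75.25 s. Comparing both figures against the known call length indicates which rate, if either, the file was written at.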


What I Already Tried

Attempt — Force codec via FastAGI before MixMonitor:

SET VARIABLE CHANNEL(audioreadformat) slin16
SET VARIABLE CHANNEL(audiowriteformat) slin16

Result — Asterisk throws a WARNING and ignores it:

WARNING[117679][C-00000024]: func_channel.c:802 func_channel_write_real:
Unknown or unavailable item requested: 'audioreadformat'

WARNING[117679][C-00000024]: func_channel.c:802 func_channel_write_real:
Unknown or unavailable item requested: 'audiowriteformat'

So audioreadformat and audiowriteformat appear to be read-only: they cannot be set via AGI SET VARIABLE, and presumably not via dialplan Set() either.


My Questions

  1. When MixMonitor uses the D flag and the two legs have different sample rates (ulaw 8kHz vs slin16 16kHz), at what sample rate does Asterisk actually write the interleaved stereo .raw file? Does it upsample, downsample, or just write raw bytes as-is?

  2. Is there any supported dialplan or AGI method to force both channel legs to the same codec/sample rate before MixMonitor starts — without having to enforce it at the PJSIP endpoint config level?

  3. If enforcing at the PJSIP endpoint level (disallow=all / allow=ulaw) is the only option, does that cause transcoding overhead on the slin16 bot leg, and is there a way to avoid it?

  4. Is there a correct sox command to handle a stereo raw file where left and right channels have different native sample rates?
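
For question 4, one way to narrow things down is to split the interleaved file into two mono files and audition each at both candidate rates. A minimal Python sketch using only the standard library follows. It assumes the file holds 16-bit signed samples in native byte order (little-endian on x86), interleaved one sample per channel, which matches the sox options above but is not confirmed for this recording. The function name and output paths are hypothetical.

```python
import array
import wave

def split_stereo_raw(raw_path, left_path, right_path, sample_rate):
    """Split interleaved 16-bit stereo PCM into two mono WAV files."""
    with open(raw_path, "rb") as f:
        data = f.read()
    data = data[: len(data) - len(data) % 4]  # drop any partial trailing frame

    samples = array.array("h")  # signed 16-bit, native byte order
    samples.frombytes(data)

    # Even indices belong to the first channel, odd indices to the second
    for path, channel in ((left_path, samples[0::2]), (right_path, samples[1::2])):
        with wave.open(path, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(sample_rate)
            w.writeframes(channel.tobytes())
    return len(samples) // 2  # frames per channel
```

Calling split_stereo_raw("4663.raw", "left.wav", "right.wav", 8000) and then again with 16000 gives four files to listen to. If both channels sound right at the same rate, that would suggest the file was written at a single common rate; if each channel only sounds right at a different rate, the two legs were not resampled to match.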

Any help or pointers to the relevant Asterisk source code or documentation would be hugely appreciated. Thank you!

On Wednesday 08 April 2026 at 12:33:11, gauravs456 wrote:

My two channel legs are running on different codecs and sample rates:

  • Channel 1 (PSTN / caller leg): ulaw → 8kHz, 8-bit µ-law
  • Channel 2 (Voice bot / application leg): slin16 → 16kHz, 16-bit
    signed linear

Why?

Since this voice bot’s audio is going to get sent down the PSTN channel,
what’s the purpose of using a codec and sample rate which don’t match?

You’re not going to get a better quality of audio for the person on the end of
the PSTN connection, because 16-bit 16kHz can’t be sent down that channel.

I’d suggest the best solution is to get your voice bot to speak 8-bit 8kHz.

Antony.


I conclude that there are two ways of constructing a software design: One way
is to make it so simple that there are obviously no deficiencies, and the
other way is to make it so complicated that there are no obvious
deficiencies.

  • C A R Hoare

Thank you for the response. I should clarify — I’m using externalMedia which provides flexibility to use various codecs (ulaw, slin16, slin48, opus, etc.).

My question then becomes: If the bot/externalMedia leg supports these higher-quality codecs, shouldn’t we be able to leverage them end-to-end, rather than being limited to ulaw just because of the PSTN caller leg?

In other words:

  1. Is there a way to transcode the PSTN leg (ulaw → slin16/slin48) on-the-fly so both legs can operate at a higher quality internally?

  2. If both legs can work at, say, slin16, would the MixMonitor recording then produce clean stereo output?

  3. Or is the limitation that PSTN codecs are hard constraints, and we’re forced to downgrade the entire system to ulaw?

On Wednesday 08 April 2026 at 18:07:02, gauravs456 wrote:

  1. Is there a way to transcode the PSTN leg (ulaw → slin16/slin48)
    on-the-fly so both legs can operate at a higher quality internally?

You can transcode it, yes, but you can’t improve the quality.

A transcoder can’t create detail which is (by design) missing from the
original. You’ll end up with four times the amount of data containing
precisely the same (quality of) information.
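
The "four times" figure follows from the nominal rates and sample widths quoted earlier in the thread. A quick Python check, taking uncompressed bit rate as sample rate × bits per sample for one mono stream (the helper name is illustrative):

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample):
    """Bits per second for a single mono audio stream."""
    return sample_rate_hz * bits_per_sample

ulaw = pcm_bitrate(8000, 8)      # 64000 bit/s
slin16 = pcm_bitrate(16000, 16)  # 256000 bit/s
print(slin16 // ulaw)            # 4
```

The upsampled stream still carries only what the 8 kHz source captured, so nothing above 4 kHz is recovered.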

  2. If both legs can work at, say, slin16, would the MixMonitor recording
    then produce clean stereo output?

I believe it would, yes.

  3. Or is the limitation that PSTN codecs are hard constraints, and we’re
    forced to downgrade the entire system to ulaw?

Yes.

Antony.


APL [is a language], in which you can write a program to simulate shuffling a
deck of cards and then dealing them out to several players, in four
characters, none of which appear on a standard keyboard.

  • David Given

Well, that depends on your telco. More and more calls are moving across carrier networks end to end with HD codecs like G.722.