MixMonitor D option produces invalid stereo .raw when bridged channels have different native sample rates

Hi all,

I’m hitting an issue with the D option in MixMonitor on Asterisk 22.8.0 when the two bridged channels use codecs with different native sample rates. The output .raw file ends up unusable — no single -r value passed to sox produces correct-sounding audio.

Setup

A typical inbound call in my voice-bot architecture:

  • Leg A (PJSIP) — inbound caller, NativeFormats: (ulaw) → 8 kHz

  • Leg B (WebSocket) — AI voice bot endpoint, NativeFormats: (slin16) → 16 kHz

The two are joined in a Stasis-bridged call. Channel info:

PJSIP/.../00000199
  NativeFormats: (ulaw)
  WriteFormat:   slin16
  ReadFormat:    slin16
  WriteTranscode: Yes (slin@16000)->(slin@8000)->(ulaw@8000)
  ReadTranscode:  Yes (ulaw@8000)->(slin@8000)->(slin@16000)

WebSocket/vpaas_rtp_engine/...
  NativeFormats: (slin16)
  WriteFormat:   slin16
  ReadFormat:    slin16
  WriteTranscode: No
  ReadTranscode:  No

MixMonitor launched against the PJSIP leg:

MixMonitor(/Recording/<hash>.raw,Db)

What I observe

The resulting .raw file plays incorrectly at every sample rate I try with sox:

sox -t raw -r 8000  -e signed -b 16 -c 2 file.raw out.wav  # too slow
sox -t raw -r 16000 -e signed -b 16 -c 2 file.raw out.wav  # too fast
sox -t raw -r 12000 -e signed -b 16 -c 2 file.raw out.wav  # still too fast

No fixed -r value produces normal-speed playback. Spectral analysis of the raw bytes shows energy mirrored across the spectrum, which is what I would expect from a stream with inconsistent sample rates: as if frames from one direction are at 8 kHz and frames from the other are at 16 kHz, but both are written into the same interleaved stereo stream without rate normalization.
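One way to test the mismatched-rate hypothesis is to de-interleave the file and listen to each direction on its own at a candidate rate. The following is a minimal Python sketch, not part of Asterisk. It assumes the .raw file is 16-bit signed little-endian with left/right samples alternating, matching the sox flags above; the function name and file paths are hypothetical.

```python
import array
import wave


def split_stereo_raw(raw_path, left_wav, right_wav, rate):
    """De-interleave a 16-bit stereo raw file into two mono WAV files.

    Assumes L/R samples alternate, 2 bytes each, little-endian.
    Returns the number of stereo frames that were split.
    """
    with open(raw_path, "rb") as f:
        data = f.read()
    # Drop a trailing partial frame (one stereo frame = 4 bytes).
    data = data[: len(data) - len(data) % 4]

    samples = array.array("h")
    samples.frombytes(data)  # bytes are only regrouped, never reinterpreted

    channels = ((left_wav, samples[0::2]), (right_wav, samples[1::2]))
    for path, channel in channels:
        with wave.open(path, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)
            wav.setframerate(rate)  # candidate rate to audition
            wav.writeframes(channel.tobytes())
    return len(samples) // 2
```

For example, `split_stereo_raw("file.raw", "left_8k.wav", "right_8k.wav", 8000)` followed by a second run at 16000 lets each direction be checked separately. If one side sounds right at 8 kHz and the other at 16 kHz, that supports the hypothesis; if neither side is clean at any rate, the mismatch is more likely at the frame level than per sample.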

What works

When both channels have matching native sample rates (e.g. ulaw ↔ ulaw, or slin16 ↔ slin16), the D option works perfectly and sox -r <rate> produces clean stereo output with caller on one channel and bot on the other.

The breakage only happens when native rates differ across the bridge.

Workaround I’m using

Switching to r(file) + t(file) (two separate .wav files, each with proper headers encoding the correct per-direction rate) and post-merging with sox -M works correctly across all codec combinations, because sox reads the rate from each WAV header and resamples as needed.
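For anyone who wants the merge step without depending on sox, here is a rough Python sketch of the same idea: read the two mono WAV files, bring the lower-rate one up to the higher rate, and interleave them into one stereo file. It is an illustration only. The function names are hypothetical, it assumes 16-bit mono PCM input whose rates are integer multiples of each other (e.g. 8 kHz and 16 kHz), and it uses plain linear interpolation rather than a proper resampling filter.

```python
import array
import sys
import wave


def read_mono_s16(path):
    """Return (sample_rate, samples) for a mono 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError(f"{path}: expected mono 16-bit PCM")
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
        if sys.byteorder == "big":
            samples.byteswap()  # WAV sample data is little-endian
        return wav.getframerate(), samples


def upsample(samples, factor):
    """Upsample by an integer factor using linear interpolation."""
    out = array.array("h")
    for i, cur in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else cur
        for k in range(factor):
            out.append(cur + (nxt - cur) * k // factor)
    return out


def merge_to_stereo(left_path, right_path, out_path):
    """Write a stereo WAV at the higher of the two input rates."""
    left_rate, left = read_mono_s16(left_path)
    right_rate, right = read_mono_s16(right_path)
    out_rate = max(left_rate, right_rate)
    if out_rate % left_rate or out_rate % right_rate:
        raise ValueError("rates must be integer multiples of each other")

    left = upsample(left, out_rate // left_rate)
    right = upsample(right, out_rate // right_rate)

    # Pad the shorter direction with silence so both have equal length.
    n = max(len(left), len(right))
    left.extend([0] * (n - len(left)))
    right.extend([0] * (n - len(right)))

    stereo = array.array("h", [0] * (2 * n))
    stereo[0::2] = left   # left channel: first input
    stereo[1::2] = right  # right channel: second input
    if sys.byteorder == "big":
        stereo.byteswap()

    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(out_rate)
        wav.writeframes(stereo.tobytes())
    return out_rate
```

Calling `merge_to_stereo("caller.wav", "bot.wav", "stereo.wav")` (file names are placeholders) would put the caller on the left and the bot on the right. Upsampling to the higher rate keeps the wideband leg intact; for production use a real resampler would be a better choice than linear interpolation.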

Suggestion

It would be very helpful if the D option could:

  1. Detect when the two directions have different sample rates and resample one to match the other before interleaving, OR

  2. Emit the raw stream at a defined fixed rate (e.g. the higher of the two, with the lower direction upsampled), OR

  3. At minimum, document this limitation — currently the docs just say “Interleave the audio coming from the channel and the audio going to the channel and output it as a 2 channel (stereo) raw stream”, with no mention that both directions must be at the same native rate.

The ideal behavior would be a guarantee that the resulting .raw file is always playable at a single, deterministic sample rate regardless of codec mismatch on the bridge — that would make D reliable for mixed-codec architectures like AI voice bots, WebRTC<->PSTN bridges, etc.

Environment

  • Asterisk 22.8.0 (also reproduced on 22.8.2)

  • Standard app_mixmonitor.so

  • Mixed PJSIP (ulaw) ↔ WebSocket (slin16/slin24/slin48/opus) bridges

  • chan_websocket channel driver

Has anyone else run into this? Is there a way to force the audiohook to a specific format before interleaving, or is the r()+t()+sox merge the only path for mixed-codec stereo recording?

Thanks!
(Claude helped me write this problem statement. Thanks to Claude.)

It could very well be the same underlying thing as: