Audio distortion/lag with chan_websocket ExternalMedia at 90+ concurrent calls (Asterisk 22.8.2)

I’m experiencing intermittent audio distortion and lag when running 90+ concurrent calls using chan_websocket with ARI ExternalMedia. Audio plays fine at lower call volumes (30-50 calls) but degrades at scale. I’ve done extensive debugging on my application side and narrowed the issue down to Asterisk’s internal behavior.

Architecture

```
Client App (TTS/AI)
    ↓ (WebSocket - audio + mark events)
RTP Engine (Node.js, running in K8s pod, 4 vCPU / 4GB RAM)
    ↓ (WebSocket per call - binary audio frames)
Asterisk 22.8.2 (Docker container, 16 vCPU / 16GB RAM)
    ↓ (RTP)
SIP Phone / Caller
```
  • Codec: slin16 (optimal_frame_size = 640 bytes = 20ms per frame; see the arithmetic sketch after this list)

  • Transport: WebSocket (chan_websocket)

  • Direction: Sending audio TO Asterisk for playback to the caller

  • Using START_MEDIA_BUFFERING mode for non-opus formats

  • Using MARK_MEDIA / MEDIA_MARK_PROCESSED for playback position tracking

  • Using FLUSH_MEDIA for barge-in/interruption
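For reference, the 640-byte figure falls straight out of the codec parameters (16 kHz, 16-bit signed linear, 20ms frames); a quick TypeScript sanity check, nothing Asterisk-specific assumed:

```ts
// slin16 = signed 16-bit linear PCM at 16 kHz.
const SAMPLE_RATE = 16_000;   // samples per second
const BYTES_PER_SAMPLE = 2;   // 16-bit samples
const FRAME_MS = 20;          // one Asterisk frame

// 16000 * 2 * 0.020 = 640 bytes per frame
const frameBytes = SAMPLE_RATE * BYTES_PER_SAMPLE * (FRAME_MS / 1000);
console.log(frameBytes); // 640

// Everything my app sends to chan_websocket is an exact multiple of this.
```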

What I’ve Verified (Not the Problem)

  1. Node.js event loop is healthy. My setInterval(20ms) timer fires with an average gap of 20.00ms and a max of 21.2ms. Zero drift. (Items 1 and 2 are measured roughly as in the sketch after this list.)

  2. No WebSocket backpressure. ws.bufferedAmount on the Asterisk-facing sockets stays near 0. Data reaches Asterisk instantly.

  3. Asterisk task processors are clean. core show taskprocessors shows 0 items in queue for all pjsip distributors. No backlog.

  4. Network is fine. No packet loss or significant latency between my RTP engine pod and Asterisk container.

  5. My application sends audio in correct frame-aligned multiples (exact multiples of 640 bytes for slin16).
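Here is a simplified sketch of how I check items 1 and 2, assuming the `ws` npm package for the Asterisk-facing socket (the 1 KB warning threshold and logging interval are arbitrary choices of mine):

```ts
import type WebSocket from "ws";

// Minimal health monitor for one call: tracks 20ms timer drift (item 1)
// and WebSocket backpressure toward Asterisk (item 2).
export function monitorCall(asteriskWs: WebSocket): void {
  let last = process.hrtime.bigint();
  let maxGapMs = 0;

  setInterval(() => {
    const now = process.hrtime.bigint();
    const gapMs = Number(now - last) / 1e6; // nanoseconds -> milliseconds
    last = now;
    maxGapMs = Math.max(maxGapMs, gapMs);

    // bufferedAmount near 0 means writes are draining immediately,
    // i.e. no backpressure on the socket toward Asterisk.
    if (asteriskWs.bufferedAmount > 1024) {
      console.warn(`backpressure: ${asteriskWs.bufferedAmount} bytes queued`);
    }
  }, 20);

  setInterval(() => {
    console.log(`max 20ms tick gap so far: ${maxGapMs.toFixed(2)}ms`);
  }, 5000);
}
```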

What I Observe

XOFF/XON cycling with consistent 2-second XOFF duration

When sending audio at a burst rate (e.g., 5x real-time = 3200 bytes per 20ms tick), I see XOFF lasting a consistent ~2030ms every cycle:

```
XOFF lasted 2038ms
XOFF lasted 2029ms
XOFF lasted 2021ms
XOFF lasted 2030ms
XOFF lasted 2036ms
XOFF lasted 2027ms
```

This makes sense given the hardcoded thresholds in chan_websocket.c:

```c
#define QUEUE_LENGTH_MAX         1000   // 20 seconds of audio
#define QUEUE_LENGTH_XOFF_LEVEL   900   // XOFF at 18 seconds
#define QUEUE_LENGTH_XON_LEVEL    800   // XON at 16 seconds
```

XOFF→XON gap = (900 - 800) frames × 20ms = 2000ms, which matches the observed ~2030ms durations almost exactly.
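For completeness, this is roughly how my sender paces frames and how the XOFF durations above are measured. It is a simplified sketch: the XOFF/XON detection below is a substring placeholder for the actual protocol notifications on the media socket, and each entry in `frames` is assumed to be one 640-byte frame:

```ts
import type WebSocket from "ws";

const FRAMES_PER_TICK = 5; // 5x real-time = 3200 bytes per 20ms tick

// Paced sender that honors Asterisk's flow control and logs XOFF duration.
export function startPacedSender(asteriskWs: WebSocket, frames: Buffer[]): void {
  let paused = false;
  let xoffAt = 0;

  asteriskWs.on("message", (data, isBinary) => {
    if (isBinary) return; // binary messages are media, not flow control
    const text = data.toString();
    if (text.includes("XOFF")) {
      paused = true;
      xoffAt = Date.now();
    } else if (text.includes("XON") && paused) {
      console.log(`XOFF lasted ${Date.now() - xoffAt}ms`); // the log lines above
      paused = false;
    }
  });

  setInterval(() => {
    if (paused) return; // stop feeding while Asterisk signals XOFF
    for (let i = 0; i < FRAMES_PER_TICK && frames.length > 0; i++) {
      asteriskWs.send(frames.shift()!); // each buffer is one frame-aligned 640-byte chunk
    }
  }, 20);
}
```

Dropping FRAMES_PER_TICK to 1 or 2 reproduces the other rows in the table below.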

The puzzle: buffer should never run empty, but audio still distorts

During the 2-second XOFF period, there are still 800+ frames (16 seconds of audio) sitting in Asterisk’s internal buffer. The phone should never run out of audio to play. Yet callers hear distortion/lag/choppy audio.

Distortion onset varies with system load

  • Call A (first call on a fresh system): audio starts lagging after ~14-15 seconds

  • Call B (started while 93 other calls are active): audio starts lagging within 4-5 seconds

  • At lower concurrency (30-50 calls): no distortion at all

Different send rates, same problem at scale

| Send rate | Behavior at 90+ calls |
| --- | --- |
| 1x (640 bytes/tick) | Choppy: no cushion, micro-gaps between packets |
| 2x (1280 bytes/tick) | Distortion after 14-15 seconds per call |
| 5x (3200 bytes/tick) | Distortion still occurs, XOFF/XON cycling every ~2 sec |
| 4-5x burst with XOFF | Audio quality slightly better but still degrades |

My Questions

  1. Is there a way to configure the XOFF/XON thresholds (QUEUE_LENGTH_XOFF_LEVEL, QUEUE_LENGTH_XON_LEVEL) without recompiling Asterisk? I don’t see any chan_websocket.conf setting for buffer sizes.

  2. Could the frame timer thread be the bottleneck? With 90+ channels each holding 800-900 queued frames and draining at 50 frames/second (one 20ms frame per tick), Asterisk needs to pop ~4500 frames/second across all channels. Could the timing thread fall behind under this load, causing uneven frame delivery to the RTP output?

  3. Is there a known concurrency limit for chan_websocket channels? I understand regular SIP/RTP channels can handle hundreds of calls, but chan_websocket with START_MEDIA_BUFFERING involves additional queue management per channel. Is there a practical ceiling?

  4. Could START_MEDIA_BUFFERING mode behave differently under high concurrency compared to passthrough mode? Should I be sending audio differently?

  5. Are there any Asterisk configuration settings (e.g., http.conf, timer settings, thread pool settings) that could improve chan_websocket performance at this scale?

Environment

  • Asterisk: 22.8.2 (Docker container)

  • Host: 16 vCPU, 16 GB RAM

  • Channel driver: chan_websocket

  • Codec: slin16

  • Concurrent calls: 90-95

  • RTP Engine: Node.js application in Kubernetes pod (4 vCPU, 4GB RAM)

Relevant core show taskprocessors output (during 90+ calls)

All pjsip distributors show 0 items in queue, max depth 3-5, which appears healthy.

Any insights from the community would be greatly appreciated. Happy to provide additional diagnostics or logs.

I would suggest trying the 22.9.0 release candidate, which includes a change made as a result of an audio issue[1].

[1] [bug]: chan_websocket doesn’t work with genericplc and transcoding · Issue #1785 · asterisk/asterisk · GitHub

Thanks for pointing me to #1785. I’ll test with transcode_via_sln = no and genericplc = false and report back.
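For anyone else following along, those two settings live in different files; a minimal sketch of what I'll be changing, assuming the stock sample config layout:

```ini
; asterisk.conf
[options]
transcode_via_sln = no   ; don't build translation paths through signed linear

; codecs.conf
[plc]
genericplc = false       ; disable generic packet loss concealment
```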

However, I believe my issue is different. #1785 is about garbled audio from transcoding path corruption — it happens at any call volume and is constant. My issue is lag/choppy audio that only appears at 90+ concurrent calls and is load-dependent. At 30-50 calls, audio is perfectly clean.

My debugging shows:

  • Audio reaches Asterisk on time (zero WebSocket backpressure)

  • XOFF/XON cycling works correctly (~2030ms matches the 100-frame gap between XOFF at 900 and XON at 800)

  • During XOFF, 800+ frames (16 sec) still sit in the buffer — the phone should never starve

  • Task processors show 0 queue depth

Could the internal frame timer thread become a bottleneck at 90+ chan_websocket channels? With 90 channels × 50 frames/sec = 4500 frame pops/second, could timing drift cause uneven RTP delivery under this load?

No, you should really just test the release candidate. The change itself has to do with the internal switching of codecs. It is good to eliminate it as a factor first, instead of potentially chasing something that has already been resolved.

It is unlikely. You can see if timing is affected on the system in general using the “timing test” CLI command.

You can measure this with a packet capture.
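Once per-packet timestamps have been extracted from such a capture, uneven delivery shows up as inter-packet gaps deviating from the expected 20ms cadence; a minimal sketch of that check (the 5ms tolerance is illustrative):

```ts
// Flag RTP deliveries whose inter-packet gap deviates from the 20ms cadence.
// `timestamps` holds per-packet capture times in seconds for one stream.
function reportUnevenGaps(timestamps: number[], expectedMs = 20, toleranceMs = 5): void {
  for (let i = 1; i < timestamps.length; i++) {
    const gapMs = (timestamps[i] - timestamps[i - 1]) * 1000;
    if (Math.abs(gapMs - expectedMs) > toleranceMs) {
      console.log(`packet ${i}: ${gapMs.toFixed(1)}ms gap (expected ~${expectedMs}ms)`);
    }
  }
}

// Example: a 45ms hiccup between the 3rd and 4th packets.
reportUnevenGaps([0.0, 0.02, 0.04, 0.085, 0.105]);
```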