Incoming audio stream to audio socket clips first words/letters after pause (especially sibilants)

Hi everyone,

I’m facing an issue with Asterisk where the incoming voice stream from a telephony provider trunk is sent to an audio socket (AI voice assistant). The problem is that after a pause in speech, the first letters or word of the caller’s phrase are clipped or smeared, especially if it starts with sibilant consonants (like “sh” or “s”). If the caller speaks continuously, everything is fine.

Setup:

  • Asterisk version: 20.6.0-rc1

  • OS: Ubuntu 22.04

  • Codecs: ulaw

  • Standard Audiosocket module is used

  • rtp_keepalive = 2, rtp_symmetric = yes, jitterbuffer enabled

Symptoms:

  • Clipping happens only after silence (pause >1-2 sec).

  • No issues with continuous speech.

  • Logs show no obvious errors.

What I’ve tried:

  • Captured with tcpdump: no problems found, no packet loss or jitter issues.

  • Tried different settings of jitterbuffer and rtp_keepalive.

Any ideas on debugging further or fixes? Suspecting silence suppression from provider.

Thanks!

This is most likely silence suppression (VAD with comfort noise generation, CNG) on the provider side, or possibly voice detection somewhere in your own chain. The clue is that it only happens after a 1-2 second pause and hits the leading edge of speech: that’s the classic signature of a gateway that stops sending audio during silence and reacts a little late when speech resumes.

A few things to check and try:

**1. Confirm silence suppression from the trunk provider**

Capture on the inbound leg specifically and look for CN (Comfort Noise) packets — payload type 13 in RTP:

```
tcpdump -i eth0 -w /tmp/inbound.pcap port 5060 or portrange 10000-20000
```

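Open in Wireshark, filter `rtp.p_type == 13`. If you see CN packets, the provider is running VAD/CNG. Not seeing any doesn’t rule it out: some gateways simply stop sending RTP during silence, which shows up as contiguous sequence numbers with a jump in the RTP timestamp, usually with the marker bit set on the first packet of the new talkspurt. Some providers let you disable it, some don’t. Worth a support ticket to ask.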
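If you’d rather check the capture programmatically, here is a rough Python sketch of that test. The names `parse_rtp_header` and `find_silence_gaps` are made up for illustration, and it assumes 20 ms ulaw packets at 8 kHz (160 samples per packet) and that you have already pulled the raw RTP packets of one stream out of the pcap (that part is not shown).

```python
import struct
from typing import Iterable, List, NamedTuple, Tuple

CN_PAYLOAD_TYPE = 13  # static RTP payload type for comfort noise


class RtpHeader(NamedTuple):
    payload_type: int
    marker: bool
    sequence: int
    timestamp: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the fixed 12-byte RTP header."""
    if len(packet) < 12:
        raise ValueError("packet too short for an RTP header")
    first, second, seq, ts, _ssrc = struct.unpack("!BBHII", packet[:12])
    if first >> 6 != 2:
        raise ValueError("not RTP version 2")
    return RtpHeader(second & 0x7F, bool(second & 0x80), seq, ts)


def find_silence_gaps(
    headers: Iterable[RtpHeader],
    samples_per_packet: int = 160,  # assumption: 20 ms of 8 kHz audio
    clock_rate: int = 8000,
) -> List[Tuple[int, float]]:
    """Return (sequence, gap_ms) for packets that follow suppressed silence.

    No packet was lost (sequence advanced by exactly 1) but the timestamp
    jumped by more than one packet's worth of samples.
    """
    gaps = []
    prev = None
    for h in headers:
        if prev is not None:
            seq_delta = (h.sequence - prev.sequence) & 0xFFFF
            ts_delta = (h.timestamp - prev.timestamp) & 0xFFFFFFFF
            if seq_delta == 1 and ts_delta > samples_per_packet:
                gap_ms = (ts_delta - samples_per_packet) * 1000 / clock_rate
                gaps.append((h.sequence, gap_ms))
        prev = h
    return gaps
```

If this reports gaps that line up with the pauses in the call, the silence is being suppressed before the audio ever reaches Asterisk.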

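**2. Check your PJSIP endpoint**

As far as I know there is no VAD switch on a PJSIP endpoint, so there is nothing to turn off there. What is worth checking is that your RTP timeouts are generous enough that a provider which stops sending RTP during silence doesn’t get the call torn down:

```
[your-endpoint]
type=endpoint
rtp_timeout=120
rtp_timeout_hold=300
```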

And in `rtp.conf`:

```
[general]
strictrtp=yes
rtpstart=10000
rtpend=20000
```

Asterisk doesn’t suppress silence on its own by default, but confirm you don’t have any silence-suppression or silence-threshold options set anywhere in your configs (the exact option names depend on the module).

**3. The jitterbuffer angle**

Since you’re using AudioSocket (external process), the jitterbuffer setting matters a lot. The fixed jitterbuffer can produce this kind of symptom: it may hold back or drop the first frames after a gap while it resyncs. Try switching to adaptive:

```
[general]
jbenable=yes
jbforce=yes
jbimpl=adaptive
jbmaxsize=200
jbtargetextra=40
jbresyncthreshold=1000
jblog=yes
```

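The `jbresyncthreshold` is key here: it controls how large a timestamp jump the jitterbuffer tolerates before it resyncs. 1000 ms is the usual default, so if your pauses are longer than that, try raising it and see whether the clipping changes. Note that these `jb*` options are read by channel drivers such as chan_sip; with chan_pjsip the jitterbuffer is normally applied from the dialplan with the `JITTERBUFFER()` function, e.g. `Set(JITTERBUFFER(adaptive)=default)`. It’s also worth one test call with the jitterbuffer disabled entirely to rule it out.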
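To see what the jitterbuffer (or the trunk) is actually doing to the stream, it helps to timestamp frames as they arrive at your AudioSocket server. Below is a minimal Python sketch. It assumes the AudioSocket framing as I understand it (1-byte message type, 2-byte big-endian payload length, type `0x10` for audio), so verify that against the AudioSocket documentation; `iter_messages` and `late_frames` are hypothetical helper names.

```python
import struct
from typing import BinaryIO, Iterator, List, Sequence, Tuple

KIND_AUDIO = 0x10  # assumption: AudioSocket message type for audio payloads


def _read_exact(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes or raise EOFError if the stream ends first."""
    data = b""
    while len(data) < n:
        chunk = stream.read(n - len(data))
        if not chunk:
            raise EOFError("stream closed")
        data += chunk
    return data


def iter_messages(stream: BinaryIO) -> Iterator[Tuple[int, bytes]]:
    """Yield (kind, payload) for each message until the stream ends."""
    while True:
        try:
            header = _read_exact(stream, 3)
        except EOFError:
            return
        kind, length = struct.unpack("!BH", header)
        payload = _read_exact(stream, length) if length else b""
        yield kind, payload


def late_frames(
    arrivals: Sequence[float],  # arrival time of each audio frame, in seconds
    frame_ms: float = 20.0,     # assumption: 20 ms frames
    tolerance_ms: float = 40.0,
) -> List[Tuple[int, float]]:
    """Return (frame_index, gap_ms) for frames that arrived noticeably late."""
    late = []
    for i in range(1, len(arrivals)):
        gap_ms = (arrivals[i] - arrivals[i - 1]) * 1000
        if gap_ms > frame_ms + tolerance_ms:
            late.append((i, round(gap_ms, 1)))
    return late
```

In the server loop you would wrap the connection with `conn.makefile("rb")`, append `time.monotonic()` to a list for every `KIND_AUDIO` message, and run `late_frames` on it after the call. Gaps that match the caller’s pauses mean no audio is reaching AudioSocket during silence; a steady 20 ms cadence with clipped onsets points at your own processing instead.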

**4. rtp_keepalive isn’t enough**

You have `rtp_keepalive=2`, which sends keepalive packets every 2 seconds. That keeps the NAT pinhole open but does nothing about what the provider sends you: if their media gateway clips the leading edge when it restarts the audio stream after silence, that happens before the audio reaches Asterisk. What would help more is getting the provider to keep sending RTP through silence, i.e. turning off silence suppression on the trunk.

**5. AudioSocket-specific workaround**

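If the provider won’t disable CNG and the jitterbuffer tuning doesn’t help, look at your AudioSocket application. Audio the trunk never sent can’t be recovered, but if your own speech detection triggers late (sibilants carry little energy in 8 kHz telephone audio and are easy to miss), a small pre-roll buffer helps: keep the last ~100-200 ms of audio in a ring buffer, and when you detect the transition from silence to voice, pass that buffered audio to the recognizer ahead of the live frames. Not ideal, but it works when the onset is being lost at the detection stage.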
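A minimal sketch of such a pre-roll buffer in Python, assuming 20 ms frames and that your application already has some per-frame speech/silence decision (the `is_speech` flag here stands in for whatever VAD you use; the class name is hypothetical):

```python
from collections import deque
from typing import Deque, List


class PreRollBuffer:
    """Keeps the most recent silence frames and releases them at speech onset."""

    def __init__(self, frame_ms: int = 20, pre_roll_ms: int = 200) -> None:
        # Number of frames to retain, e.g. 200 ms / 20 ms = 10 frames.
        self._frames: Deque[bytes] = deque(maxlen=max(1, pre_roll_ms // frame_ms))
        self._in_speech = False

    def push(self, frame: bytes, is_speech: bool) -> List[bytes]:
        """Return the frames to forward to the recognizer for this input."""
        if is_speech:
            if not self._in_speech:
                # Silence-to-speech transition: emit the buffered audio first
                # so the onset that the detector missed is not lost.
                self._in_speech = True
                out = list(self._frames) + [frame]
                self._frames.clear()
                return out
            return [frame]
        # Silence: remember the frame in case speech starts right after it.
        self._in_speech = False
        self._frames.append(frame)
        return []
```

Call `push()` once per incoming audio frame and feed whatever it returns to the recognizer. The pre-roll length is a trade-off: longer catches soft onsets more reliably but sends more leading silence downstream.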

One more thing: you’re on 20.6.0-rc1, which is a release candidate. Worth upgrading to the latest stable 20.x if you haven’t already, and checking the changelogs of the later 20.x releases for AudioSocket-related fixes.