Crash - FRACK and excessive refcount on ao2 object

backtrace 100045.txt (5.2 KB)

syslog.txt (442 Bytes)

We’re running a .NET Stasis application which crashed during a voice conference session; as a result, audio was lost for all participants (about 30 at the time). The backtrace from the crash dump (attached) shows that the crash coincides with a failed assertion in a bridge channel operation, following a failed allocation for a file descriptor and associated with an excessive refcount condition on an ao2 object. This is from the messages log:

[Jan 16 18:05:20] ERROR[6693][C-00000012] frame.c: Excessive refcount 100000 reached on ao2 object 0x1330e40
[Jan 16 18:05:20] ERROR[6693][C-00000012] frame.c: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x1330e40 (0)
[Jan 16 18:05:20] ERROR[6710][C-00000026] frame.c: Excessive refcount 100000 reached on ao2 object 0x1330e40
[Jan 16 18:05:20] ERROR[6710][C-00000026] frame.c: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x1330e40 (0)

We’ve traced through the source, and it looks like the file descriptor in question is for a recorded file, but a refcount of 100000 seems very high given that the application was restarted shortly before the crash. We’re running 14.6, which seemed to fix an earlier problem we were having with system crashes on 14.3
(https://community.asterisk.org/t/stasis-task-processor-queue-warnings-and-channel-c-frack/71504).
All had been going well over many such conferences until now.

The Stasis application makes extensive use of mute/unmute and of moving channels between bridges. We use MOH in a lobby channel, which is multi-encoded for all supported codecs, and some voice prompts in sln format. There’s a fast ARI connection to the controlling application running on a Windows box. At the time of the crash, ARI disconnected. Our logs (and those of our DID provider) show no associated incoming or outgoing calls, mute/unmute, or channel move operations at that time. This seems to have come out of the blue. We are using

same => n,Set(JITTERBUFFER(adaptive)=default)

in the dialplan, before the channel is routed into the Stasis app.

Incidentally, Asterisk seemed to recover some 10-15 minutes after the channels had left the conference (!), and reconnected to the ARI controller application.

What could be the issue? It would be good to have a general idea.

After tracing through the code, we have identified the issue as an excess of references to a codec object, triggered by a request from the jitterbuffer code in abstract_jb.c. For the time being we have disabled the Asterisk jitterbuffer, and we are submitting this as a bug/vulnerability.
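
To give a general idea of how a reference leak of this kind can reach the limit, here is a toy model in C. It is not Asterisk code: toy_obj, toy_ref, toy_unref, handle_frame and frames_until_excessive are hypothetical names made up for illustration, and the assumption that one reference is leaked per frame is ours, not something the backtrace confirms. Only the 100000 threshold comes from the log lines above.

```c
#include <assert.h>

/* Threshold taken from the "Excessive refcount 100000" log lines. */
#define EXCESSIVE_REFCOUNT 100000

/* Toy reference-counted object; a stand-in for an ao2 object. */
struct toy_obj {
    int refcount;
};

/* Take a reference. Returns -1 once the count reaches the threshold. */
static int toy_ref(struct toy_obj *obj)
{
    obj->refcount++;
    return (obj->refcount >= EXCESSIVE_REFCOUNT) ? -1 : 0;
}

/* Release a reference. */
static void toy_unref(struct toy_obj *obj)
{
    assert(obj->refcount > 0);
    obj->refcount--;
}

/*
 * Hypothetical per-frame handler: it references a shared format object
 * and, when 'leak' is non-zero, forgets to release it again.
 */
static int handle_frame(struct toy_obj *format, int leak)
{
    if (toy_ref(format) != 0) {
        return -1; /* the point where the real system would log FRACK */
    }
    if (!leak) {
        toy_unref(format);
    }
    return 0;
}

/*
 * Number of frames handled before the threshold trips,
 * or -1 if it never trips within max_frames.
 */
static long frames_until_excessive(int leak, long max_frames)
{
    struct toy_obj format = { 1 }; /* one legitimate owner */
    long n;

    for (n = 1; n <= max_frames; n++) {
        if (handle_frame(&format, leak) != 0) {
            return n;
        }
    }
    return -1;
}
```

In this model, frames_until_excessive(1, 200000) trips after just under 100000 frames, while the balanced case never trips. If one assumes 20 ms frames (50 per second), a single leaking channel would get there in roughly 33 minutes, and 30 channels sharing the same object in a little over a minute, so a high count shortly after a restart is plausible.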