Taskprocessor_push segfault occurs consistently

Just not in such a way that we can reproduce it on demand. This post seems to describe an identical issue, without resolution: Crash at taskprocessor.c:1171

So it happens about once a month, with the following output from ast_coredumper:

Thread 1 (Thread 0x7fb11e7e1700 (LWP 2955687)):
#0  0x000055710463aee0 in taskprocessor_push (t=<optimized out>, tps=0x7fb1904ea730) at taskprocessor.c:1239 
#1  ast_taskprocessor_push (tps=0x7fb1904ea730, task_exe=<optimized out>, datap=0x7fb1a814c5f0) at taskprocessor.c:1245
#2  0x00007fb1e081af5f in chan_pjsip_hangup (ast=0x7fb1a80d4b70) at chan_pjsip.c:2578
#3  0x000055710452aaaa in ast_hangup (chan=chan@entry=0x7fb1a80d4b70) at channel.c:2612
#4  0x00007fb1e1e8f520 in wait_for_answer (in=in@entry=0x7fb1c46de010, out_chans=out_chans@entry=0x7fb11e7dcf70, to=to@entry=0x7fb11e7dcf38, peerflags=peerflags@entry=0x7fb11e7ddae0, opt_args=opt_args@entry=0x7fb11e7dd2f0, pa=pa@entry=0x7fb11e7dd390, num_in=<optimized out>, result=<optimized out>, dtmf_progress=<optimized out>, mf_progress=<optimized out>, mf_wink=<optimized out>, sf_progress=<optimized out>, sf_wink=<optimized out>, hearpulsing=<optimized out>, ignore_cc=<optimized out>, forced_clid=<optimized out>, stored_clid=<optimized out>, config=<optimized out>) at app_dial.c:1426

So the crashing line would appear to be:

tps->listener->callbacks->task_pushed(tps->listener, was_empty);

And when inspecting the coredump, I see that the taskprocessor has already been cleaned up:

{ stats = {max_qsize = 3, _tasks_processed_count = 7}, 
  local_data = 0x0, 
  tps_queue_size = 0, 
  tps_queue_low = 2250, tps_queue_high = 2500, 
  tps_queue = {first = 0x0, last = 0x0}, 
  listener = 0x0, 
  thread = 18446744073709551615, 
  executing = 0, 
  high_water_warned = 0, high_water_alert = 0, 
  suspended = 0, 
  subsystem = 0x7f6dc0211e52 "", 
  name = 0x7f6dc0211e20 "pjsip/outsess/compass-proxy-00.00.00.00-000e91" (edited), 
  <incomplete sequence \340>}
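
With listener = 0x0, the quoted line becomes a straight read through a NULL pointer, which matches the segfault:

tps->listener->callbacks->task_pushed(tps->listener, was_empty);
/* tps->listener is NULL here, so ->callbacks dereferences a NULL pointer */

(The thread value of 18446744073709551615 is (pthread_t) -1, i.e. AST_PTHREADT_NULL, which I read as the processing thread already being gone, though I haven't verified that in the code yet.)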

We are running Asterisk 20.5. I don't see any mention of the issue in the changelogs for newer releases, nor any changes around the code that makes the bad access to the torn-down tps, and the issue seems to have been around in some form since Asterisk 16.

I'll add info as I research it, as I'm fairly new to the Asterisk code and know there are some points I can explore further. We want to deploy some patch soon though, to see if it might improve things, whether that's locking the TPS until after the listener has been accessed, adding a missing reference increment, or something else. I'm not quite sure yet where in the process the TPS is being destroyed.
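
To make the reference-increment idea concrete, this is roughly the shape I have in mind at the push site in chan_pjsip_hangup. It's only a sketch: I'm assuming the serializer taskprocessor can be bumped with ao2_bump() and that pairing that with ast_taskprocessor_unreference() is correct, both of which I still need to confirm, and hangup/h_data stand in for whatever the real code pushes.

/* Sketch: hold our own reference on the session's serializer across the
 * push so it cannot be torn down underneath us. */
struct ast_taskprocessor *serializer = ao2_bump(channel->session->serializer);

if (serializer) {
	if (ast_sip_push_task(serializer, hangup, h_data)) {
		/* Push failed; fall back to whatever chan_pjsip_hangup
		 * already does on failure. */
	}
	ast_taskprocessor_unreference(serializer);
}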

I'll also open an issue on GitHub soon if that's welcome. I know we originally tried to report this as ASTERISK-28834: Segfault in taskprocessor_push, but I suppose that didn't get migrated over.

I'm just going to try something, so I've been looking at what can be done without knowing the root cause. taskprocessor_push seems to be wrapped nicely, so I want to try simply adding a null check there for the listener before adding the task to the queue. I don't see any other nearby usages that would indicate the cleanup is occurring specifically in the taskprocessor_push function.
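
Concretely, something along these lines, next to the crashing line in taskprocessor.c (a sketch only; I might end up placing the check earlier, before the task is queued, and the log message and return value are illustrative):

if (!tps->listener) {
	/* The taskprocessor looks like it has already been shut down; bail
	 * out instead of dereferencing a NULL listener. */
	ast_log(LOG_ERROR, "No listener on taskprocessor '%s', skipping task_pushed notification\n",
		ast_taskprocessor_name(tps));
	return -1;
}
tps->listener->callbacks->task_pushed(tps->listener, was_empty);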

I think what's happening is that the session is decrementing the TPS's reference count too soon, presumably because the session is being destroyed. I don't understand this area very well yet, but there is a comment referencing some issues with the INVITE session when it is disconnected. This sounds related, because I think these crashes largely happen when a caller hangs up before the call can be answered. I'll need to check whether we've built with HAVE_PJSIP_INV_SESSION_REF, which maintains its own reference on the INVITE session.

/* When a PJSIP INVITE session is created it is created with a reference
 * count of 1, with that reference being managed by the underlying owner
 * of the INVITE session itself. When the INVITE session transitions to
 * a DISCONNECTED state that reference is released. This means we can not
 * rely on that reference to ensure the INVITE session remains for the
 * lifetime of our session. To ensure it does we add our own reference
 * and release it when our own session goes away, ensuring that the
 * INVITE session remains for the lifetime of the session. */
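
If I'm reading that right, the reference it describes is taken and released roughly like this in res_pjsip_session.c (paraphrased sketch, not a verbatim quote, and only compiled in when HAVE_PJSIP_INV_SESSION_REF is defined):

#ifdef HAVE_PJSIP_INV_SESSION_REF
	pjsip_inv_add_ref(inv_session);   /* our own ref, taken when the ast_sip_session is set up */
#endif

	/* ... and later, when our session is destroyed ... */

#ifdef HAVE_PJSIP_INV_SESSION_REF
	pjsip_inv_dec_ref(inv_session);
#endif

Whether that actually ties into the serializer's lifetime is something I still need to trace; if HAVE_PJSIP_INV_SESSION_REF isn't set in our build, that would at least fit the early-hangup pattern we're seeing, but that's speculation until I've checked.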

Sidenote: the info report generated by ast_coredumper lists a couple of dozen taskprocessors, but shows no tasks being run, which I don't think can be correct, as there were quite a few calls ongoing at the time of the crash. Not sure whether that aspect of the report is generally reliable, though.