Segfault on 18.12.1

I'm well aware of the proper procedure here, having just rebuilt appropriately to capture a backtrace the next time it happens.

In the meantime, I'd be very thankful if someone could consult the following and tell me whether a backtrace is the appropriate next step, or whether there is some other obvious problem. This doesn't appear to be particularly load-related, though I've never seen it happen at idle.

Jun 20 16:22:02 localhost kernel: [302654.726316] show_signal_msg: 4 callbacks suppressed
Jun 20 16:22:02 localhost kernel: [302654.726320] asterisk[341717]: segfault at 15c ip 0000558f7a70aaff sp 00007efcdbb2d760 error 4 in asterisk[558f7a5b0000+1ff000]
Jun 20 16:22:02 localhost kernel: [302654.726331] Code: a7 89 ea ff eb c8 0f 1f 44 00 00 f3 0f 1e fa 55 53 81 e6 e0 00 00 00 48 89 fb 48 83 ec 08 83 fe 20 74 48 31 c0 83 fe 40 75 37 <48> 0f be 2b 40 84 ed 74 48 e8 53 7b ea ff 48 8b 08 b8 05 15 00 00

I wondered if the above might reveal something obvious, like 'you're running out of RAM', to the trained eye. This is 18.12.1 on a 4 GB dedicated-CPU VPS.

Thank you everyone!

That's not a backtrace; it merely states that a segfault occurred in the asterisk process. That's it.

Yes mate, I appreciate that. As I said above, I've set myself up to produce a backtrace the next time this happens; while I wait, I wondered whether the above reveals anything. Is there a possibility this is brought on by running out of RAM?

It reveals what I said. It could be because code assumed something was allocated when it wasn't; it could be something else.

Segmentation faults shouldn’t happen, not even if you run out of RAM.

You can probably infer that you were doing something unusual at the time, but that is just because the usual will have been better tested.

Assuming it hadn't jumped to a random address, you might be able to use gdb to translate the IP value to a line of code. With position-independent code, I'm not sure whether things will be in the same place when you reload them, so I'm not certain that the translation will be stable. Also remember that Asterisk loads modules at run time, so you might have to let it run, in the debugger, to get the faulting address populated with code.

Thanks David. I’ll have a pop at translating the address to a line of code, and beyond that will be back with a backtrace when I get one. Thank you :slight_smile:

Okay, I can’t trigger this thing at will, but can reasonably expect it to happen every 2 or 3 days.

The backtrace looks like this; it's always the same:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--c
Core was generated by `/usr/sbin/asterisk -mqfg -C /etc/asterisk/asterisk.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055dae5937ab7 in ast_str_case_hash (str=0x15c <error: Cannot access memory at address 0x15c>) at /home/ahand/asterisk-18.13.0/include/asterisk/strings.h:1285
1285		while (*str) {
[Current thread is 1 (Thread 0x7fd4c187b700 (LWP 91490))]
(gdb) bt
#0  0x000055dae5937ab7 in ast_str_case_hash (str=0x15c <error: Cannot access memory at address 0x15c>) at /home/ahand/asterisk-18.13.0/include/asterisk/strings.h:1285
#1  0x000055dae5937d00 in channel_snapshot_uniqueid_hash_cb (obj=0x15c, flags=64) at stasis_channels.c:211
#2  0x000055dae57d0d6a in hash_ao2_find_first (self=0x55dae6539af8, flags=OBJ_SEARCH_KEY, arg=0x15c, state=0x7fd4c18793e0) at astobj2_hash.c:363
#3  0x000055dae57ceacb in internal_ao2_traverse
    (self=0x55dae6539af8, flags=OBJ_SEARCH_KEY, cb_fn=0x55dae5937d02 <channel_snapshot_uniqueid_cmp_cb>, arg=0x15c, data=0x0, type=AO2_CALLBACK_DEFAULT, tag=0x0, file=0x55dae5a35f61 "stasis_channels.c", line=911, func=0x55dae5a36a80 <__PRETTY_FUNCTION__.16790> "ast_channel_snapshot_get_latest") at astobj2_container.c:318
#4  0x000055dae57cee16 in __ao2_callback
    (c=0x55dae6539af8, flags=OBJ_SEARCH_KEY, cb_fn=0x55dae5937d02 <channel_snapshot_uniqueid_cmp_cb>, arg=0x15c, tag=0x0, file=0x55dae5a35f61 "stasis_channels.c", line=911, func=0x55dae5a36a80 <__PRETTY_FUNCTION__.16790> "ast_channel_snapshot_get_latest") at astobj2_container.c:414
#5  0x000055dae57ceed6 in __ao2_find
    (c=0x55dae6539af8, arg=0x15c, flags=OBJ_SEARCH_KEY, tag=0x0, file=0x55dae5a35f61 "stasis_channels.c", line=911, func=0x55dae5a36a80 <__PRETTY_FUNCTION__.16790> "ast_channel_snapshot_get_latest")
    at astobj2_container.c:437
#6  0x000055dae593c005 in ast_channel_snapshot_get_latest (uniqueid=0x15c <error: Cannot access memory at address 0x15c>) at stasis_channels.c:911
#7  0x00007fd51b982ddd in publish_chanspy_message (snoop=0x7fd4e126b838, start=0) at res_stasis_snoop.c:138
#8  0x00007fd51b9833e7 in snoop_hangup (chan=0x7fd4e126e400) at res_stasis_snoop.c:228
#9  0x000055dae58136e1 in ast_hangup (chan=0x7fd4e126e400) at channel.c:2612
#10 0x00007fd51b983ccb in stasis_app_control_snoop
    (chan=0x7fd4e108deb0, spy=STASIS_SNOOP_DIRECTION_IN, whisper=STASIS_SNOOP_DIRECTION_NONE, app=0x7fd4e1258104 "lebedev", app_args=0x0, snoop_id=0x7fd4e12532f8 "83865b86a5684328bcb6e76e60d26f2e")
    at res_stasis_snoop.c:386
#11 0x00007fd4d73e545b in ari_channels_handle_snoop_channel
    (args_channel_id=0x7fd4e125391a "31d8ab91d5534c3da0a626d738364fcd", args_spy=0x7fd4e123a714 "in", args_whisper=0x0, args_app=0x7fd4e1258104 "lebedev", args_app_args=0x0, args_snoop_id=0x7fd4e12532f8 "83865b86a5684328bcb6e76e60d26f2e", response=0x7fd4c1879b10) at ari/resource_channels.c:1638
#12 0x00007fd4d73e561c in ast_ari_channels_snoop_channel (headers=0x7fd4e0e378b0, args=0x7fd4c18798d0, response=0x7fd4c1879b10) at ari/resource_channels.c:1655
#13 0x00007fd4d73df22b in ast_ari_channels_snoop_channel_cb (ser=0x7fd4f8229910, get_params=0x7fd4e1258090, path_vars=0x7fd4e12538a0, headers=0x7fd4e0e378b0, body=0x0, response=0x7fd4c1879b10)
    at res_ari_channels.c:2581
#14 0x00007fd4d75e6bb6 in ast_ari_invoke
    (ser=0x7fd4f8229910, uri=0x7fd4c1879c6a "channels/31d8ab91d5534c3da0a626d738364fcd/snoop", method=AST_HTTP_POST, get_params=0x7fd4e1258090, headers=0x7fd4e0e378b0, body=0x0, response=0x7fd4c1879b10) at res_ari.c:587
#15 0x00007fd4d75e8347 in ast_ari_callback
    (ser=0x7fd4f8229910, urih=0x7fd4d75f43e0 <http_uri>, uri=0x7fd4c1879c6a "channels/31d8ab91d5534c3da0a626d738364fcd/snoop", method=AST_HTTP_POST, get_params=0x7fd4e1258090, headers=0x7fd4e0e378b0)
    at res_ari.c:1058
#16 0x000055dae59bc9e8 in handle_uri (ser=0x7fd4f8229910, uri=0x7fd4c1879c6a "channels/31d8ab91d5534c3da0a626d738364fcd/snoop", method=AST_HTTP_POST, headers=0x7fd4e0e378b0) at http.c:1490
#17 0x000055dae59bdc35 in httpd_process_request (ser=0x7fd4f8229910) at http.c:1931
#18 0x000055dae59bdf7f in httpd_helper_thread (data=0x7fd4f8229910) at http.c:1994
#19 0x000055dae5959698 in handle_tcptls_connection (data=0x7fd4f8229910) at tcptls.c:274
#20 0x000055dae596d876 in dummy_start (data=0x7fd4f804ba30) at utils.c:1574
#21 0x00007fd51aed1609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x00007fd51ac51133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Is there anything revelatory about the above? ast_str_case_hash looks fairly innocuous as functions go. Thank you again everyone.

The problem indicated by the given backtrace already has an issue filed for it[1].

[1] [ASTERISK-29604] ari: Segfault with lots of calls - Digium/Asterisk JIRA

For my understanding, please, what is it about 29604 that makes it equivalent to my segfault? I can't identify any common elements in the backtraces.

Multiple attached backtraces are the same as yours.

Can you elaborate for my understanding, please? I see no reference to ast_str_case_hash in the referenced issue. I'm sure I'm looking at the wrong thing, but I would be very thankful for a brief explanation.

Under Attachments there are backtraces. The core-brief.txt file and core-brief-5.txt files, at the bottom of them, have the same crash. Or to be more specific, they are close enough to be the same - different because the underlying memory is different, resulting in a crash in a different place during the same general operation. That is: Hanging up the Snoop channel, and getting the snapshot so a message can be published.

It's not innocuous when it is given the wrong address for a string. I would say that things probably went wrong at least as far back as __ao2_find.

Thank you both this is starting to make sense to me now.

From my Stasis app's sequence of calls, this looks like a race condition in which I attempt to snoop a channel at the same moment it is hanging up.

So this is unrelated to load as such; it's just that the more calls I put through, the greater the chance that a hangup coincides with the start of a snoop.