Suddenly system crashing with segmentation faults

eyalhasson · October 30, 2018, 12:31pm

Hello,

We have Asterisk 13.22.0 with FreePBX 14.0.4.1, which services a large ARI application with 1000 lines. It ran smoothly for 4 months. Suddenly, last night, it started getting segmentation faults, and it now crashes often: sometimes after 10 minutes, sometimes after 3 hours. The CentOS 7 messages file logs the crashes as shown below. It is very difficult to install a debug version on the system, since we use FreePBX. Any idea what is causing this?

Thanks,
Eyal Hasson.

Oct 30 09:06:03 phonelinuxsrv kernel: asterisk[67154]: segfault at 28 ip 000000000069b02b sp 00007fc05cc9d1d0 error 6 in asterisk[400000+38d000]
Oct 30 09:06:03 phonelinuxsrv kernel: asterisk[67397]: segfault at 28 ip 000000000069b02b sp 00007fc0597641d0 error 6 in asterisk[400000+38d000]
Oct 30 09:37:28 phonelinuxsrv kernel: asterisk[78451]: segfault at 28 ip 000000000069b02b sp 00007f42af2cb1d0 error 6 in asterisk[400000+38d000]
Oct 30 09:37:28 phonelinuxsrv kernel: asterisk[80086]: segfault at 28 ip 000000000069b02b sp 00007f409bc121d0 error 6 in asterisk[400000+38d000]
Oct 30 11:04:27 phonelinuxsrv kernel: asterisk[97364]: segfault at 98024040 ip 000000000062ba14 sp 00007f6cceff8f30 error 4 in asterisk[400000+38d000]
Oct 30 11:52:47 phonelinuxsrv kernel: asterisk[99197]: segfault at 28 ip 000000000069b02b sp 00007fae727901d0 error 6 in asterisk[400000+38d000]
Oct 30 11:52:47 phonelinuxsrv kernel: asterisk[106686]: segfault at 28 ip 000000000069b02b sp 00007fac60fa31d0 error 6 in asterisk[400000+38d000]
Oct 30 12:19:49 phonelinuxsrv kernel: asterisk[114809]: segfault at 28 ip 000000000069b02b sp 00007f7e66bf51d0 error 6
Oct 30 12:19:49 phonelinuxsrv kernel: asterisk[111934]: segfault at 28 ip 000000000069b02b sp 00007f7e66de91d0 error 6
Oct 30 12:19:49 phonelinuxsrv kernel: in asterisk[400000+38d000]
Oct 30 12:19:49 phonelinuxsrv kernel:
Oct 30 12:19:49 phonelinuxsrv kernel: in asterisk[400000+38d000]
Oct 30 12:32:24 phonelinuxsrv kernel: asterisk[118518]: segfault at 28 ip 000000000069b02b sp 00007fc3114081d0 error 6
Oct 30 12:32:24 phonelinuxsrv kernel: asterisk[116093]: segfault at 28 ip 000000000069b02b sp 00007fc311cd21d0 error 6 in asterisk[400000+38d000]
Oct 30 12:32:24 phonelinuxsrv kernel: in asterisk[400000+38d000]
Oct 30 13:25:47 phonelinuxsrv kernel: asterisk[129339]: segfault at 28 ip 000000000069b02b sp 00007fd6e1221f40 error 6 in asterisk[400000+38d000]
Oct 30 13:25:47 phonelinuxsrv kernel: asterisk[130106]: segfault at 28 ip 000000000069b02b sp 00007fd51324df40 error 6 in asterisk[400000+38d000]
Oct 30 13:41:03 phonelinuxsrv kernel: asterisk[133734]: segfault at 28 ip 000000000069b02b sp 00007f6a955021d0 error 6 in asterisk[400000+38d000]
Oct 30 13:41:03 phonelinuxsrv kernel: asterisk[132926]: segfault at 28 ip 000000000069b02b sp 00007f696a6191d0 error 6 in asterisk[400000+38d000]
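
For readers decoding kernel lines like these: the fields are the faulting address, the instruction pointer, the stack pointer, and the page-fault error code, all printed in hexadecimal. The following is a minimal Python sketch, not part of the original post, that parses one such line and decodes the low three bits of the error code, assuming an x86-64 Linux kernel. The function name and the dictionary keys are made up for illustration.

```python
import re

# Matches the kernel's "segfault at ..." message as it appears in the log above.
# The trailing "in <object>[base+size]" part is optional because some log
# lines above were split before it.
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Parse one kernel segfault line; return a dict, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    info = {
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "ip": int(m.group("ip"), 16),
        # x86 page-fault error code bits (assumption: x86-64 kernel):
        "page_present": bool(err & 1),               # bit 0
        "access": "write" if err & 2 else "read",    # bit 1
        "mode": "user" if err & 4 else "kernel",     # bit 2
    }
    if m.group("base"):
        # Offset of the faulting instruction from the start of the mapping.
        info["ip_offset"] = info["ip"] - int(m.group("base"), 16)
    return info


sample = (
    "Oct 30 13:41:03 phonelinuxsrv kernel: asterisk[133734]: segfault at 28 "
    "ip 000000000069b02b sp 00007f6a955021d0 error 6 in asterisk[400000+38d000]"
)
result = decode_segfault(sample)
print(result)
```

For the sample line this reports a user-mode write to an unmapped page at address 0x28. That pattern is consistent with writing through a structure pointer that is NULL or close to it, but the log line alone does not identify the cause; a backtrace is still needed.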

jcolp · October 30, 2018, 12:32pm

Without a backtrace[1] there’s no good way to point to what is going on. The only other thing you could do is look at logs and see what was going on when it crashed.

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

eyalhasson · October 30, 2018, 1:55pm

I was not aware that the FreePBX distro can produce a backtrace. The files are attached. Thanks.

core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03 0200-locks.txt (657 Bytes)
core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03 0200-thread1.txt (10.3 KB)
core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03 0200-brief.txt (902.0 KB)
core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03 0200-full.txt (2.3 MB)

jcolp · October 30, 2018, 2:16pm

What kind of usage does the system see and what is it used for? The crash appears to be due to excessive usage of a media format.

eyalhasson · October 30, 2018, 2:22pm

Mostly the system plays back lectures in vox or ulaw format. It also records lectures, but that is a small fraction compared to the listeners.

david551 · October 30, 2018, 2:24pm

It appears to me that it has double faulted. An assert failure would normally produce signal 3, so it looks like the assert code itself faulted.

jcolp · October 30, 2018, 2:25pm

You would need to file an issue[1], ideally with information on how to reproduce it and further details about the system. There is no time frame on when it would get looked into, though.

[1] https://issues.asterisk.org/jira

eyalhasson · October 30, 2018, 2:41pm

Thanks, I will. But meanwhile, can you point me to where the problem lies? Maybe I can circumvent it. This is a very heavily used system and we need something urgently.

jcolp · October 30, 2018, 2:59pm

I can’t. Either the media format is being leaked for reference counting or it’s just being used too much. There is no evident reason.

eyalhasson · October 30, 2018, 3:02pm

Can you please elaborate - what does it mean “media format is being leaked for reference counting or it’s just being used too much”?

Thanks.

jcolp · October 30, 2018, 3:05pm

Media formats are stored as reference counted objects. The assertion is because that count has reached an excessively large number. The reason why, I don’t know.
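
To illustrate the idea of a reference-counted object with an upper-bound check, here is a toy Python model. It is not Asterisk's astobj2 code: the class, the method names, and the `>=` comparison are assumptions made for illustration, and the ceiling is simply set to the value that appears in the assertion message quoted in this thread.

```python
class RefCounted:
    """Toy reference-counted object; not Asterisk's ao2 implementation."""

    # Hypothetical ceiling, taken from the assertion message in the thread.
    EXCESSIVE_REFCOUNT = 100000

    def __init__(self, name):
        self.name = name
        self.refcount = 1  # the creator holds the first reference

    def ref(self):
        """Take a reference; complain if the count looks unreasonable."""
        self.refcount += 1
        if self.refcount >= self.EXCESSIVE_REFCOUNT:
            raise AssertionError(
                f"Excessive refcount {self.refcount} reached on {self.name}"
            )
        return self

    def unref(self):
        """Release a reference and return the remaining count."""
        self.refcount -= 1
        return self.refcount


def balanced_use(obj, n):
    """Take and release a reference n times: the count does not grow."""
    for _ in range(n):
        obj.ref()
        obj.unref()


def leaky_use(obj, n):
    """Take a reference n times and never release it: the count grows by n."""
    for _ in range(n):
        obj.ref()


ulaw = RefCounted("ulaw")
balanced_use(ulaw, 200000)
count_after_balanced = ulaw.refcount  # still 1, however many uses there were

try:
    leaky_use(ulaw, 200000)
    tripped = False
except AssertionError as exc:
    tripped = True
    print(exc)
```

In this model, heavy but balanced use never trips the check, because every `ref()` is paired with an `unref()`. Only references that are taken and never released push the count toward the ceiling.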

eyalhasson · October 30, 2018, 3:07pm

And are the “media formats” themselves the actual media files the system plays back (like ulaw prompts etc.)?

jcolp · October 30, 2018, 3:09pm

They are an internal structure representation of a media format, such as “ulaw”. They are not the file itself. The media frames themselves have a reference to the media format.

eyalhasson · October 30, 2018, 3:15pm

Sure. Do you know what order of magnitude the “excessively large number” is?

jcolp · October 30, 2018, 3:19pm

It’s in the backtrace:

#4 0x00000000005fcc55 in __ast_assert_failed (condition=0, condition_str=0x7f6a95502940 “Excessive refcount 100000 reached on ao2 object 0x31abfc8”, file=0x6c3b7b “astobj2.c”, line=518, function=0x6c3d91

“Excessive refcount 100000 reached on ao2 object 0x31abfc8”
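
When several backtrace files need checking for this message, a small script can pull out the count and the object address. The following is a minimal Python sketch; the function name is hypothetical, and the regular expression assumes the message wording is exactly as quoted above.

```python
import re

# Assumes the assertion text appears exactly as in the backtrace frame above.
EXCESSIVE_RE = re.compile(
    r"Excessive refcount (\d+) reached on ao2 object (0x[0-9a-fA-F]+)"
)


def find_excessive_refcounts(text):
    """Return (refcount, object_address) pairs found in backtrace text."""
    return [(int(count), addr) for count, addr in EXCESSIVE_RE.findall(text)]


sample = (
    "#4 0x00000000005fcc55 in __ast_assert_failed (condition=0, "
    'condition_str=0x7f6a95502940 "Excessive refcount 100000 reached on '
    'ao2 object 0x31abfc8", file=0x6c3b7b "astobj2.c", line=518'
)
matches = find_excessive_refcounts(sample)
print(matches)
```

The object address is only meaningful within a single core dump, since heap addresses generally differ from one process run to the next. Identifying what the object actually is would still require inspecting it in the core file.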

eyalhasson · October 30, 2018, 3:50pm

Does it mean 100000 media files are open concurrently? This is way higher than what this system uses. It must mean files are not being closed. But even so, how can it reach such a number within 10 minutes?

jcolp · October 30, 2018, 3:50pm

No, it means that the internal representation of a media format has been referenced 100000 times. Not that there are that many media files open.

eyalhasson · October 30, 2018, 3:53pm

I see. But still, how can this happen within 10 minutes?

david551 · October 30, 2018, 3:54pm

More precisely, that it has been referenced that many more times than it has been unreferenced. Normally every reference should eventually result in a dereference.

Note, without checking the code, I can’t be sure whether the number quoted is the actual count or the maximum allowed count. If it is only the latter, this could be a fault secondary to memory corruption. In that case the exact details of the crash may not be repeatable.
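
On the question of how the count could get that high within 10 minutes, a back-of-the-envelope calculation gives a sense of scale. This Python sketch assumes that the 100000 figure is the actual count and that it grows steadily from a negligible starting value after each restart; neither assumption is confirmed in this thread. The function name is made up for illustration.

```python
THRESHOLD = 100000  # refcount quoted in the assertion message


def net_leak_rate(threshold, seconds):
    """Average net references gained per second to hit threshold in that time."""
    return threshold / seconds


# The two uptimes mentioned in the first post: 10 minutes and 3 hours.
rate_10_min = net_leak_rate(THRESHOLD, 10 * 60)
rate_3_hours = net_leak_rate(THRESHOLD, 3 * 60 * 60)

print(f"{rate_10_min:.1f} net refs/s to crash after 10 minutes")
print(f"{rate_3_hours:.1f} net refs/s to crash after 3 hours")
```

Under those assumptions, a net gain of roughly 167 references per second would be enough for the 10-minute case, and fewer than 10 per second for the 3-hour case. This says nothing about where such references would come from, only that the rate required is modest for a system with many active lines.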

eyalhasson · October 30, 2018, 3:59pm

I created backtraces for some more crashes, and all show the same problem. Is there a way to see which file is being referenced (or dereferenced)?
