System suddenly crashing with segmentation faults

Hello,

We have Asterisk 13.22.0 with FreePBX 14.0.4.1, which serves a large ARI application with 1000 lines. It ran smoothly for 4 months. Suddenly, last night, it started getting segmentation faults, and it now crashes often: sometimes after 10 minutes, and sometimes after 3 hours. The CentOS 7 messages file logs the crashes as shown below. It is very difficult to install a debug version on the system, since we use FreePBX. Any idea what is causing this?

Thanks,
Eyal Hasson.

Oct 30 09:06:03 phonelinuxsrv kernel: asterisk[67154]: segfault at 28 ip 000000000069b02b sp 00007fc05cc9d1d0 error 6 in asterisk[400000+38d000]
Oct 30 09:06:03 phonelinuxsrv kernel: asterisk[67397]: segfault at 28 ip 000000000069b02b sp 00007fc0597641d0 error 6 in asterisk[400000+38d000]
Oct 30 09:37:28 phonelinuxsrv kernel: asterisk[78451]: segfault at 28 ip 000000000069b02b sp 00007f42af2cb1d0 error 6 in asterisk[400000+38d000]
Oct 30 09:37:28 phonelinuxsrv kernel: asterisk[80086]: segfault at 28 ip 000000000069b02b sp 00007f409bc121d0 error 6 in asterisk[400000+38d000]
Oct 30 11:04:27 phonelinuxsrv kernel: asterisk[97364]: segfault at 98024040 ip 000000000062ba14 sp 00007f6cceff8f30 error 4 in asterisk[400000+38d000]
Oct 30 11:52:47 phonelinuxsrv kernel: asterisk[99197]: segfault at 28 ip 000000000069b02b sp 00007fae727901d0 error 6 in asterisk[400000+38d000]
Oct 30 11:52:47 phonelinuxsrv kernel: asterisk[106686]: segfault at 28 ip 000000000069b02b sp 00007fac60fa31d0 error 6 in asterisk[400000+38d000]
Oct 30 12:19:49 phonelinuxsrv kernel: asterisk[114809]: segfault at 28 ip 000000000069b02b sp 00007f7e66bf51d0 error 6
Oct 30 12:19:49 phonelinuxsrv kernel: asterisk[111934]: segfault at 28 ip 000000000069b02b sp 00007f7e66de91d0 error 6
Oct 30 12:19:49 phonelinuxsrv kernel: in asterisk[400000+38d000]
Oct 30 12:19:49 phonelinuxsrv kernel:
Oct 30 12:19:49 phonelinuxsrv kernel: in asterisk[400000+38d000]
Oct 30 12:32:24 phonelinuxsrv kernel: asterisk[118518]: segfault at 28 ip 000000000069b02b sp 00007fc3114081d0 error 6
Oct 30 12:32:24 phonelinuxsrv kernel: asterisk[116093]: segfault at 28 ip 000000000069b02b sp 00007fc311cd21d0 error 6 in asterisk[400000+38d000]
Oct 30 12:32:24 phonelinuxsrv kernel: in asterisk[400000+38d000]
Oct 30 13:25:47 phonelinuxsrv kernel: asterisk[129339]: segfault at 28 ip 000000000069b02b sp 00007fd6e1221f40 error 6 in asterisk[400000+38d000]
Oct 30 13:25:47 phonelinuxsrv kernel: asterisk[130106]: segfault at 28 ip 000000000069b02b sp 00007fd51324df40 error 6 in asterisk[400000+38d000]
Oct 30 13:41:03 phonelinuxsrv kernel: asterisk[133734]: segfault at 28 ip 000000000069b02b sp 00007f6a955021d0 error 6 in asterisk[400000+38d000]
Oct 30 13:41:03 phonelinuxsrv kernel: asterisk[132926]: segfault at 28 ip 000000000069b02b sp 00007f696a6191d0 error 6 in asterisk[400000+38d000]

Without a backtrace[1] there’s no good way to point to what is going on. The only other thing you could do is look at logs and see what was going on when it crashed.

[1] https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace
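
Even without a backtrace, the kernel lines above carry a little information. On x86, the number after "error" is the page-fault error code, a bit field. The following is a minimal sketch that decodes its three low bits; the function name `describe_fault` is made up for illustration, and the higher bits (such as instruction fetch) are ignored.

```c
#include <stdio.h>

/* Low bits of the x86 page-fault error code shown in the kernel's
 * "segfault at ADDR ... error N" message. */
#define PF_PROT  0x1  /* 0: page not mapped, 1: protection violation */
#define PF_WRITE 0x2  /* 0: read access,     1: write access */
#define PF_USER  0x4  /* 0: kernel mode,     1: user mode */

/* Hypothetical helper: writes a one-line description of the fault into out. */
static void describe_fault(unsigned long addr, unsigned int err,
                           char *out, size_t len)
{
    snprintf(out, len, "%s-mode %s %s address 0x%lx",
             (err & PF_USER) ? "user" : "kernel",
             (err & PF_WRITE) ? "write to" : "read from",
             (err & PF_PROT) ? "protected" : "unmapped",
             addr);
}
```

Calling `describe_fault(0x28, 6, buf, sizeof buf)` describes the recurring crash as a user-mode write to unmapped address 0x28. An address that small is consistent with a write through a NULL pointer at a structure offset of 0x28, though only a backtrace can confirm where it happens. The single "error 4" line at 11:04:27 decodes as a user-mode read from an unmapped address.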

I was not aware that the FreePBX distro can produce a backtrace. The files are attached. Thanks.

core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03 0200-locks.txt (657 Bytes)
core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03 0200-thread1.txt (10.3 KB)
core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03 0200-brief.txt (902.0 KB)
core.phonelinuxsrv.kolhl.com-2018-10-30T13-41-03 0200-full.txt (2.3 MB)

What kind of usage does the system see and what is it used for? The crash appears to be due to excessive usage of a media format.

Mostly the system plays back lectures in vox or ulaw format. It also records lectures, but this is a small fraction compared to the listeners.

It appears to me that it has double faulted. An assert failure would normally produce signal 3, so it looks like the assert code itself faulted.

You would need to file an issue[1], ideally with information on how to reproduce the crash and further details about the system. There is no time frame on when it would get looked into, though.

[1] https://issues.asterisk.org/jira

Thanks, I will. But meanwhile, can you point me to where the problem lies? Maybe I can circumvent it. This is a very heavily used system and we need something urgently.

I can’t. Either the media format is being leaked for reference counting or it’s just being used too much. There is no evident reason.

Can you please elaborate? What does “media format is being leaked for reference counting or it’s just being used too much” mean?

Thanks.

Media formats are stored as reference counted objects. The assertion is because that count has reached an excessively large number. The reason why, I don’t know.

And are the “media formats” themselves the actual media files the system plays back (like ulaw prompts etc.)?

They are an internal structure representation of a media format, such as “ulaw”. They are not the file itself. The media frames themselves have a reference to the media format.
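
The relationship described here can be sketched as follows. This is not Asterisk's astobj2 code: the struct and function names are hypothetical, the counter is not thread-safe, and only the threshold value is taken from the assertion message quoted later in the thread.

```c
#include <assert.h>
#include <stdlib.h>

#define EXCESSIVE_REFCOUNT 100000  /* value from the assertion message */

/* Hypothetical stand-in for the internal representation of a format
 * such as "ulaw". One object is shared by everything that uses it. */
struct media_format {
    const char *name;
    int refcount;
};

/* Hypothetical media frame: it holds one reference to its format. */
struct media_frame {
    struct media_format *format;
};

static struct media_format *format_ref(struct media_format *fmt)
{
    fmt->refcount++;
    /* Sanity check in the spirit of the "Excessive refcount" assertion. */
    assert(fmt->refcount < EXCESSIVE_REFCOUNT);
    return fmt;
}

static void format_unref(struct media_format *fmt)
{
    fmt->refcount--;
}

static struct media_frame *frame_create(struct media_format *fmt)
{
    struct media_frame *frame = malloc(sizeof(*frame));
    if (frame) {
        frame->format = format_ref(fmt);  /* the frame takes a reference */
    }
    return frame;
}

static void frame_destroy(struct media_frame *frame)
{
    if (!frame) {
        return;
    }
    format_unref(frame->format);  /* the frame gives its reference back */
    free(frame);
}
```

As long as every `frame_create` is matched by a `frame_destroy`, the count on the shared format returns to its starting value no matter how many frames pass through. The count only climbs toward the threshold if references are taken and never released, or if a very large number of holders exist at the same moment.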

Sure. Do you know what order of magnitude “an excessively large number” is?

It’s in the backtrace:

#4 0x00000000005fcc55 in __ast_assert_failed (condition=0, condition_str=0x7f6a95502940 “Excessive refcount 100000 reached on ao2 object 0x31abfc8”, file=0x6c3b7b “astobj2.c”, line=518, function=0x6c3d91

“Excessive refcount 100000 reached on ao2 object 0x31abfc8”

Does it mean 100000 media files are open concurrently? This is way higher than what this system uses. It must mean files are not being closed. But even so, how can it reach such a number within 10 minutes?

No, it means that the internal representation of a media format has been referenced 100000 times. Not that there are that many media files open.

I see. But still, how can this happen within 10 minutes?

More precisely, it means that it has been referenced that many more times than it has been unreferenced. Normally every reference should eventually result in a dereference.
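
As a rough illustration of how quickly such an imbalance could add up, the sketch below computes the time needed to accumulate a given number of unreleased references. The inputs are assumptions, not measurements from this system: 1000 active channels, 20 ms audio frames (50 frames per second per channel), and one format reference taken per frame. The function name is hypothetical.

```c
#include <stdio.h>

/* Seconds until `threshold` unreleased references accumulate, given how
 * many references are taken per second and what fraction of them is never
 * released. Returns a negative value if nothing leaks. */
static double seconds_to_threshold(double refs_per_sec,
                                   double leak_fraction,
                                   double threshold)
{
    double leaked_per_sec = refs_per_sec * leak_fraction;
    if (leaked_per_sec <= 0.0) {
        return -1.0;
    }
    return threshold / leaked_per_sec;
}
```

Under those assumptions the system takes about 50,000 references per second. If every one of them leaked, 100000 would be reached in 2 seconds; if only 1 in 300 leaked, it would take about 10 minutes. So even a rare leak on a busy code path would be enough to hit the limit in the time frames reported here, if a leak is in fact the cause.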

Note that, without checking the code, I can’t be sure whether the number quoted is the actual count or the maximum allowed count. If it is only the latter, this could be a fault secondary to memory corruption. In that case the exact details of the crash may not be repeatable.

I created backtraces for some more crashes, and all show the same problem. Is there a way to see which file is being referenced (or dereferenced)?