We have Asterisk 13.22.0 with FreePBX 14.0.4.1, which serves a large ARI application with 1000 lines. It ran smoothly for 4 months. Suddenly, last night, it started getting segmentation faults, and it now crashes often: sometimes after 10 minutes, sometimes after 3 hours. The CentOS 7 messages file logs them as shown below. It is very difficult to install a debug version on the system, since we use FreePBX. Any idea what is causing this?
Thanks,
Eyal Hasson.
Oct 30 09:06:03 phonelinuxsrv kernel: asterisk[67154]: segfault at 28 ip 000000000069b02b sp 00007fc05cc9d1d0 error 6 in asterisk[400000+38d000]
Oct 30 09:06:03 phonelinuxsrv kernel: asterisk[67397]: segfault at 28 ip 000000000069b02b sp 00007fc0597641d0 error 6 in asterisk[400000+38d000]
Oct 30 09:37:28 phonelinuxsrv kernel: asterisk[78451]: segfault at 28 ip 000000000069b02b sp 00007f42af2cb1d0 error 6 in asterisk[400000+38d000]
Oct 30 09:37:28 phonelinuxsrv kernel: asterisk[80086]: segfault at 28 ip 000000000069b02b sp 00007f409bc121d0 error 6 in asterisk[400000+38d000]
Oct 30 11:04:27 phonelinuxsrv kernel: asterisk[97364]: segfault at 98024040 ip 000000000062ba14 sp 00007f6cceff8f30 error 4 in asterisk[400000+38d000]
Oct 30 11:52:47 phonelinuxsrv kernel: asterisk[99197]: segfault at 28 ip 000000000069b02b sp 00007fae727901d0 error 6 in asterisk[400000+38d000]
Oct 30 11:52:47 phonelinuxsrv kernel: asterisk[106686]: segfault at 28 ip 000000000069b02b sp 00007fac60fa31d0 error 6 in asterisk[400000+38d000]
Oct 30 12:19:49 phonelinuxsrv kernel: asterisk[114809]: segfault at 28 ip 000000000069b02b sp 00007f7e66bf51d0 error 6
Oct 30 12:19:49 phonelinuxsrv kernel: asterisk[111934]: segfault at 28 ip 000000000069b02b sp 00007f7e66de91d0 error 6
Oct 30 12:19:49 phonelinuxsrv kernel: in asterisk[400000+38d000]
Oct 30 12:19:49 phonelinuxsrv kernel:
Oct 30 12:19:49 phonelinuxsrv kernel: in asterisk[400000+38d000]
Oct 30 12:32:24 phonelinuxsrv kernel: asterisk[118518]: segfault at 28 ip 000000000069b02b sp 00007fc3114081d0 error 6
Oct 30 12:32:24 phonelinuxsrv kernel: asterisk[116093]: segfault at 28 ip 000000000069b02b sp 00007fc311cd21d0 error 6 in asterisk[400000+38d000]
Oct 30 12:32:24 phonelinuxsrv kernel: in asterisk[400000+38d000]
Oct 30 13:25:47 phonelinuxsrv kernel: asterisk[129339]: segfault at 28 ip 000000000069b02b sp 00007fd6e1221f40 error 6 in asterisk[400000+38d000]
Oct 30 13:25:47 phonelinuxsrv kernel: asterisk[130106]: segfault at 28 ip 000000000069b02b sp 00007fd51324df40 error 6 in asterisk[400000+38d000]
Oct 30 13:41:03 phonelinuxsrv kernel: asterisk[133734]: segfault at 28 ip 000000000069b02b sp 00007f6a955021d0 error 6 in asterisk[400000+38d000]
Oct 30 13:41:03 phonelinuxsrv kernel: asterisk[132926]: segfault at 28 ip 000000000069b02b sp 00007f696a6191d0 error 6 in asterisk[400000+38d000]
Without a backtrace[1] there’s no good way to point to what is going on. The only other thing you could do is look at logs and see what was going on when it crashed.
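Short of a backtrace, the kernel lines quoted above can at least be decoded. The sketch below is only an illustration: the function names are hypothetical, and the error-code decoding assumes the usual x86 page-fault bits (bit 0 = protection violation, bit 1 = write, bit 2 = user mode). It shows which fields can be pulled out of each line; it does not identify the failing function.

```python
import re

# Pattern for the kernel segfault message in the format seen in the log above.
# The trailing "in object[base+size]" part is optional because some lines in
# the log were split before it.
SEGFAULT_RE = re.compile(
    r"(?P<name>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+)"
    r"(?: in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_error(code):
    """Decode the low three bits of the page-fault error code (assumed x86)."""
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


def parse_segfault(line):
    """Return the decoded fields of one segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    info = {
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "ip": int(m.group("ip"), 16),
        "error": decode_error(int(m.group("err"))),
    }
    if m.group("base"):
        # Offset of the faulting instruction from the start of the mapping.
        info["offset_in_object"] = info["ip"] - int(m.group("base"), 16)
    return info


line = ("Oct 30 09:06:03 phonelinuxsrv kernel: asterisk[67154]: segfault at 28 "
        "ip 000000000069b02b sp 00007fc05cc9d1d0 error 6 in asterisk[400000+38d000]")
print(parse_segfault(line))
```

Read this way, nearly every crash is a user-mode write to address 0x28 from the same instruction pointer. That would be consistent with writing to a structure member through a NULL pointer, but only a backtrace can confirm it.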
You would need to file an issue[1], ideally with information on how to reproduce the crash and further details about the system. There is no time frame on when it would get looked into, though.
Thanks, I will. But meanwhile, can you point me to where the problem lies? Maybe I can circumvent it. This is a very heavily used system and we need something urgently.
Media formats are stored as reference counted objects. The assertion is because that count has reached an excessively large number. The reason why, I don’t know.
They are an internal structure representation of a media format, such as “ulaw”. They are not the file itself. The media frames themselves have a reference to the media format.
Does it mean 100000 media files are open concurrently? That is way higher than what this system uses. It must mean files are not being closed. But even so, how can it reach such a number within 10 minutes?
More precisely, it means that it has been referenced that many more times than it has been unreferenced. Normally every reference should eventually result in an unreference.
Note that, without checking the code, I can’t be sure whether the number quoted is the actual count or the maximum allowed count. If it is only the latter, this could be a fault secondary to memory corruption. In that case the exact details of the crash may not be repeatable.
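As a rough illustration of how an unbalanced reference can add up so quickly, here is a toy model. It is not Asterisk's actual implementation: the class and function names are hypothetical, and the figures of 100 concurrent calls and 50 frames per second (20 ms packetization) are assumptions chosen only for the arithmetic.

```python
class RefCounted:
    """Toy stand-in for a reference-counted format object such as "ulaw"."""

    def __init__(self, name):
        self.name = name
        self.refs = 1  # the creator holds the first reference

    def ref(self):
        self.refs += 1
        return self

    def unref(self):
        self.refs -= 1


def simulate(calls, frames_per_second, seconds, leak):
    """Return the final count after every frame references the format once."""
    fmt = RefCounted("ulaw")
    for _ in range(calls * frames_per_second * seconds):
        fmt.ref()        # a frame takes a reference to its format
        if not leak:
            fmt.unref()  # and releases it when the frame is freed
    return fmt.refs


def seconds_to_reach(limit, calls, frames_per_second):
    """Seconds until a leak of one reference per frame reaches `limit`."""
    return limit / (calls * frames_per_second)


print(simulate(10, 50, 2, leak=False))     # balanced ref/unref
print(simulate(10, 50, 2, leak=True))      # one leaked reference per frame
print(seconds_to_reach(100000, 100, 50))   # assumed load, see above
```

Under those assumed numbers, a code path that leaks one reference per frame would pass 100000 in about 20 seconds, so reaching it within 10 minutes does not require anything like 100000 open files.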