Asterisk/FreePBX Crashing and FRACK! Errors

stevensedory · September 28, 2017, 3:02pm

Running FreePBX 13.0.192.16 and Asterisk 13.17.0

I have previously posted about this issue here:

Host: Dell R720 with 2x Xeon E5-2620 2.00GHz (6 Core), 64GB DDR3 ECC RAM, and local PERC storage.

Hypervisor: Proxmox 4.4-1.

Network: using onboard Quad NIC. Bridge “vmbr0” points to “bond0” as the bridge port, and bond0 has eth0 and eth1 in it in “active-backup” mode, each going to one of our two core switches. Using Cisco 3560G. Switch ports are in trunk mode, with native vlan set to our management vlan. VMs are tagged to our public facing vlan, for direct internet access.

VMs are running the FreePBX/Asterisk versions mentioned above. Each has 4GB RAM fixed with ballooning disabled, 4 cores (2 sockets, 2 cores; have tried with NUMA enabled and disabled) with type “Default (kvm64)”, a NIC using the E1000 model, and a 300G vdisk presented as ide0 as a raw image on a local LVM-Thin volume.

Endpoints: All endpoints are NAT’d. We use TCP for SIP with an obscure port (not 5060 or near that). RTP traffic on our VSP’s required port range is allowed as well. All other traffic is dropped per the FreePBX firewall.

In summary, what is happening is that we get a bunch of errors like this:

[2017-09-28 02:05:18] ERROR[7061] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)
[2017-09-28 02:05:24] ERROR[6934] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)
[2017-09-28 02:05:28] ERROR[7107] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)

and right before and after, we have most of our peers go unreachable. Sometimes Asterisk will crash afterwards, sometimes not.
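
To see how tightly these errors cluster, one rough approach is to count the FRACK! lines per minute. This is only a sketch: it assumes the log lines look exactly like the three quoted above, and the function name frack_per_minute is made up for illustration.

```python
import re
from collections import Counter

# Matches the prefix of the FRACK! lines quoted above and captures the
# timestamp down to the minute (seconds are discarded).
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\] ERROR\[\d+\] astobj2\.c: FRACK!"
)

def frack_per_minute(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH:MM' to the number of FRACK! lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# The three lines quoted in this post, used as sample input.
sample = [
    "[2017-09-28 02:05:18] ERROR[7061] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)",
    "[2017-09-28 02:05:24] ERROR[6934] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)",
    "[2017-09-28 02:05:28] ERROR[7107] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)",
]

for minute, count in sorted(frack_per_minute(sample).items()):
    print(minute, count)
```

Passing an open log file instead of the sample list would give the same per-minute tally for a full log.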

The issue happens intermittently, but seems to happen more frequently on the VMs that have more peers/endpoints (100+). I don’t think we’ve had it happen on any VMs that had fewer than 100 peers/endpoints.

We recently split a server that had about 130 endpoints into two, of 110 and 20. More accurately, we moved 110 off server A to server B, leaving 20 on server A. Before that move, we were experiencing FRACK! errors every day (anywhere from 20-300, usually all within a 20 minute window or so). Once the 110 were moved to server B, server A has never again had FRACK! errors or Asterisk crashes. Server B however is having them now, just much less often than when all 130 endpoints were on server A. My assumption is that this is due to the slightly lower endpoint total on the VM.

This morning was one of those instances. We had 193 errors, identical to the three I posted above (minus the ERROR[number] being different). AND, we had a crash afterwards. Here is the backtrace: http://pastebin.freepbx.org/view/8cccc15f

So I come to you, the asterisk community, for help. I first posted on the FreePBX forum, and was directed here.

I understand this may point to a memory issue, but what is strange is that the Dell iDrac log doesn’t show any memory errors in it. Perhaps there are errors but iDrac just isn’t seeing them to report them. I’m hoping someone out there can parse through the backtrace and give me a clear answer to what the problem is. Thanks in advance.

david551 · September 28, 2017, 5:14pm

You have memory corruption. If you are lucky, there will be an earlier error, or warning, that gives a clue as to what caused the corruption. Otherwise debugging this will be difficult.

stevensedory · September 28, 2017, 6:29pm

Thank you for the response. So I’m in a tough spot troubleshooting-wise. This is a production server with several VMs running on it. Would you suggest swapping out the RAM in it? Or do you think the memory corruption is something at the software level regardless of what memory is inside? Or is there just not enough information at this point to tell?

david551 · September 28, 2017, 11:22pm

Memory corruption is normally the result of software errors. The problem with it is the symptoms can be delayed relative to the original cause.

stevensedory · September 29, 2017, 1:07am

Okay thanks. Any suggestions on the next best step troubleshooting this?

stevensedory · October 2, 2017, 4:10pm

Any ideas on the best next steps for troubleshooting, anyone?

david551 · October 2, 2017, 5:16pm

You either have to work out what event triggers the crash, or you need to look at a crash dump and see if you can identify what the corrupted data was over-written with. The latter is very difficult.

Although my recent experience is that such faults are software related, it might be worth running a memory diagnostic, assuming you have a standby machine, or a time of day that can be used for maintenance.

stevensedory · October 2, 2017, 6:01pm

Thanks David. I pastebin’d the backtrace of the dump (above), but the dump itself was too big.

Here’s a link to the dump: https://cloud.verticalcomputers.com/index.php/s/Ae3YCLWysUsPqPr

Is our dump something you would be willing to look at?

david551 · October 2, 2017, 8:41pm

The dump will not make sense without the exact same binaries, including libraries, that you used. The chances of recognizing what is causing the corruption are way too low to make it worth my time looking.

stevensedory · October 2, 2017, 9:07pm

Understood. So if you were us, what would you do, assuming the memory is not corrupt? We have done fresh installs from the FreePBX 13 distro, fully updated, and are still having this problem.

david551 · October 2, 2017, 9:17pm

I would look for errors about failed locks.

Actually, rebuilding with thread debugging may give you more messages.

Where I have debugged this sort of thing, it has been from a developer perspective and involves quite a lot of knowledge of how the code works.

stevensedory · October 2, 2017, 9:29pm

Would these failed lock messages be in the asterisk logs or?

Can you give me an example of what I’m looking for?

stevensedory · October 5, 2017, 3:34pm

See above message? Also, we get a lot of Serious Network errors on these servers like this:


[2017-10-05 06:06:33] ERROR[7418] chan_sip.c: Serious Network Trouble; __sip_xmit returns error for pkt data
[2017-10-05 06:06:47] ERROR[7418] chan_sip.c: Serious Network Trouble; __sip_xmit returns error for pkt data
[2017-10-05 07:39:11] ERROR[8830] chan_sip.c: Serious Network Trouble; __sip_xmit returns error for pkt data

I’ve noticed they only come up on servers where we have TCP enabled for SIP and are using an obscure port number (for security). Are there any known issues with using TCP for SIP? We have both TCP and UDP enabled, but only our VSP communicates with us via UDP.

stevensedory · October 10, 2017, 6:09pm

Anyone?

Do you think the cause of our FRACK Errors and crashing are related to these Serious Network Trouble Errors?
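
One rough way to test this would be to check how many FRACK! lines have a Serious Network Trouble line close to them in time. The sketch below assumes both kinds of lines start with the bracketed timestamp shown in the excerpts in this thread; the helper names and the 60 second default window are arbitrary choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Bracketed timestamp at the start of each log line, as in the excerpts above.
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")

def timestamps(lines, needle):
    """Return the timestamps of all lines that contain the given text."""
    found = []
    for line in lines:
        match = TS_RE.match(line)
        if match and needle in line:
            found.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return found

def fraction_near(lines, window_seconds=60):
    """Fraction of FRACK! lines with a network-trouble line within the window.

    Returns None when there are no FRACK! lines at all.
    """
    fracks = timestamps(lines, "FRACK!")
    network = timestamps(lines, "Serious Network Trouble")
    if not fracks:
        return None
    window = timedelta(seconds=window_seconds)
    near = sum(1 for f in fracks if any(abs(f - n) <= window for n in network))
    return near / len(fracks)

# One line of each kind, copied from the excerpts in this thread.
sample = [
    "[2017-09-28 02:05:18] ERROR[7061] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)",
    "[2017-10-05 06:06:33] ERROR[7418] chan_sip.c: Serious Network Trouble; __sip_xmit returns error for pkt data",
]

print(fraction_near(sample))
```

A value near 0 over a full log would suggest the two errors do not occur together; a value near 1 would only show that they coincide in time, not which one causes the other.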

david551 · October 10, 2017, 8:38pm

They are unlikely to be related.

stevensedory · October 10, 2017, 8:45pm

Hi Dave, I just found this: https://www.syscore.dk/blog/1442/asterisk-1-8-and-kvm-segmentation-fault

It says: In case you wonder why on earth is asterisk failing with a segmentation fault after a fresh install (compilation by hand from sources) – and you happen to run it in a KVM virtual machine – then the answer is pretty easy: make sure you run make menuconfig before you start the compilation and remove the compilation flag called “BUILD NATIVE”. Once you do that asterisk will run normally.

Ever hear about needing to do this with KVM VMs? I know nothing about compiling Asterisk for the FreePBX distro, but if someone thinks this is a good lead, I’ll do it.

david551 · October 10, 2017, 8:48pm

Broken VMs normally cause an illegal instruction fault, not a segmentation fault. The VM mis-describes the capabilities of the virtual CPU, and gcc compiles code that uses instructions that are on the CPU described but not on the real/emulated one.
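
A quick way to see which CPU capabilities a guest is being told about is to compare the feature flags reported inside the VM with those on the host. The sketch below assumes the Linux /proc/cpuinfo layout with a "flags" line; the function names and the sample strings are hypothetical and are not taken from the servers in this thread.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Only the first processor entry is used for this comparison.
            return set(value.split())
    return set()

def missing_in_guest(host_text, guest_text):
    """Return the flags the host reports that the guest does not, sorted."""
    return sorted(cpu_flags(host_text) - cpu_flags(guest_text))

# Illustrative input only; real text would come from /proc/cpuinfo
# read on the host and inside the VM.
host_sample = "processor\t: 0\nflags\t\t: fpu sse2 avx aes\n"
guest_sample = "processor\t: 0\nflags\t\t: fpu sse2\n"

print(missing_in_guest(host_sample, guest_sample))
```

Flags that appear on the host but not in the guest are expected with a generic CPU type such as kvm64. The situation described above is the reverse: code built for instructions that the CPU running it does not actually provide.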

stevensedory · October 10, 2017, 8:52pm

So where does this leave me?
