Asterisk 16.4: every 20-some days chan_sip wigs out

About every 20 days or so, a problem occurs where phones and trunks look registered but nothing is working; the phones show offline.

I’ve tried to unload and reload chan_sip, but it won’t. Only a core restart resolves it.

In the logs I have errors like:
2019-12-03T16:07:52.747169-06:00 asterisk[67371]: ERROR[67420]: chan_sip.c:4321 in __sip_reliable_xmit: Serious Network Trouble; __sip_xmit returns error for pkt data
2019-12-03T16:08:06.746635-06:00 asterisk[67371]: ERROR[67420]: chan_sip.c:4321 in __sip_reliable_xmit: Serious Network Trouble; __sip_xmit returns error for pkt data
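
To see whether these errors cluster around particular times (busy hours, a scheduled job, a network event), the log lines can be bucketed per hour. This is only a minimal sketch, assuming the syslog format shown above with an ISO 8601 timestamp at the start of each line; the function name `count_xmit_errors_by_hour` is made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Text that identifies the chan_sip transmit error in the log lines above.
MARKER = "__sip_xmit returns error"

def count_xmit_errors_by_hour(lines):
    """Count matching error lines per hour (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        # Assumption: the first whitespace-separated field is the timestamp.
        stamp = line.split(" ", 1)[0]
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines whose first field is not a timestamp
        counts[when.strftime("%Y-%m-%d %H:00")] += 1
    return counts

sample = [
    "2019-12-03T16:07:52.747169-06:00 asterisk[67371]: ERROR[67420]: "
    "chan_sip.c:4321 in __sip_reliable_xmit: Serious Network Trouble; "
    "__sip_xmit returns error for pkt data",
    "2019-12-03T16:08:06.746635-06:00 asterisk[67371]: ERROR[67420]: "
    "chan_sip.c:4321 in __sip_reliable_xmit: Serious Network Trouble; "
    "__sip_xmit returns error for pkt data",
]
hourly = count_xmit_errors_by_hour(sample)
for hour, count in sorted(hourly.items()):
    print(hour, count)
```

Feeding it the full log file (for example `open(path)` as `lines`) would show whether the errors start at a consistent time before each outage.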

Since I’m on 16.4, I was planning to update to 16.6.2 now that it’s out. Any ideas on the chan_sip issue? Looking at the release notes, I don’t see anything that is a clear-cut match, but I see ASTERISK-28282, which might be part of the problem. I’m not 100% sure.

Additionally, when the event occurs (it is now happening every 2 days), I can’t put a call through to a phone, but the existing calls don’t drop. It also looks like calls continue to come into the system; I just can’t send them to the endpoints.

It’s getting bad: it didn’t make it a full day this time. So this is now very high up my list.

I noticed there is a RHEL patch for SDL. If I’m not using video, does Asterisk use SDL for audio as well?

It does not. Also, the chan_sip module is community supported, which is likely why no one has really responded or dug into this.

I understand. It has to be something environmental, but finding it is a bugger.

I have a change planned for 16.6.2, as I know that has some fixes, but I’ve been on 16.4.0 since June 10 and this issue only started in the last 2 weeks. I’ve been on vacation for a good part of that, so I know it’s not changes that I’ve made.

I’ve upgraded to 16.6.2, so I’m running the latest release code now. I hope that it fixes the issues seen, but based on the release notes I have my doubts.

16.6.2 still has the same issue. Well, or the environment does. But what would cause this? The OS is RHEL 7 with current updates.

If you had been running Asterisk 16.4.0 for a few months, but this issue only started a couple weeks ago then I’d try to focus on what changed in the last couple of weeks.

Did the Asterisk configuration change? Something on the network, or network configuration? Are you using realtime? If so any database related changes? New endpoints? Same or different trunks? etc…

What kind of transports are the endpoints configured to use? Does it happen for all endpoints?

Any other warnings/errors in the log? If you haven’t already, try enabling debug and setting it to at least level 3. Anything of note in the output?

It’s going to take months to confirm a 20-day cycle.

20 days sounds like a resource leak to me, so I would look at memory and file descriptor use.
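
A rough way to watch for that kind of leak on Linux is to sample the process's open descriptor count and resident memory from `/proc` at intervals and see whether either climbs steadily. The sketch below assumes a Linux `/proc` filesystem and that it runs as root or as the same user as Asterisk; the helper names are made up for illustration.

```python
import os

def count_open_fds(pid, proc_root="/proc"):
    """Return how many file descriptors the process currently has open."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    return len(os.listdir(fd_dir))

def rss_kb(pid, proc_root="/proc"):
    """Return resident memory (the VmRSS value, in kB) or None if absent."""
    status_path = os.path.join(proc_root, str(pid), "status")
    with open(status_path) as fh:
        for line in fh:
            if line.startswith("VmRSS:"):
                # The line looks like "VmRSS:      123456 kB".
                return int(line.split()[1])
    return None

def sample(pid, proc_root="/proc"):
    """Take one sample; log these over time and look for steady growth."""
    return {"fds": count_open_fds(pid, proc_root),
            "rss_kb": rss_kb(pid, proc_root)}
```

Calling `sample(pid)` with the Asterisk PID from cron every few minutes and appending the result to a file gives a trend line to compare against the time of the next outage.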

It started happening every 2 to 3 days.

I’ll look at the fd side. Memory has been looking good. I still won’t be surprised to find some weird item in the AWS / VMware hypervisor. It’s running in an AWS region with a VMware software-defined datacenter.

I have not found anything to suggest an issue in the environment. However, it’s likely different from what most users have. It is mostly VMware running on AWS i3 bare-metal hardware. The VMware version is newer than what you can buy for on-prem usage. It’s a service, so VMware manages that layer, and some items require VMware techs to make changes.

Servers are running RHEL 7 as the OS, with current patches.

I’m close to what would be a normal high period and I see fd usage up in the 900 range, so not very far from the 1024 default. I’ve increased the limits to see if I get up past that.

One thing for sure is that I’ve not seen any “too many open files” messages in the logs around the events. But if anyone has any ideas, by all means let me know.
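
One thing worth double-checking after raising limits is that the running process actually picked them up, since a limit changed in a shell or config file does not apply to a process that was already started. On Linux the effective values are in `/proc/<pid>/limits`. This is a minimal sketch, assuming that file's usual layout; the function names are hypothetical.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) for 'Max open files'; None means unlimited."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields: Max, open, files, <soft>, <hard>, <units>
            fields = line.split()
            soft, hard = fields[3], fields[4]
            return (None if soft == "unlimited" else int(soft),
                    None if hard == "unlimited" else int(hard))
    raise ValueError("no 'Max open files' line found")

def fd_headroom(open_fds, soft_limit):
    """Return how many more descriptors can be opened before the soft limit."""
    if soft_limit is None:
        return None  # unlimited
    return soft_limit - open_fds

# Example text in the layout of /proc/<pid>/limits (values are illustrative).
example = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max processes             31204                31204                processes\n"
    "Max open files            1024                 4096                 files\n"
)
soft, hard = parse_max_open_files(example)
print(soft, hard, fd_headroom(900, soft))
```

With around 900 descriptors in use against a soft limit of 1024, the headroom is small enough that a busy period could exhaust it, so confirming the new limit on the live PID rules that out.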

So far, no issues, but the holidays are much lighter loads than normal.

I don’t think that is it. It looks like calls come into the system but can’t be sent to the endpoints, so that’s why this is high up the list.

Check if you have a cron job shutting off Asterisk gracefully, or something of that nature?

It happened again.

Ulimits are up at 65535. What else would this be? No, there is no job to shut down Asterisk.

I have 16.6.2 running right now.

Does ASTERISK-28561, which is fixed in 16.7.0, apply to chan_sip or chan_pjsip? I’m using chan_sip, not pjsip.
The affected components suggest chan_pjsip.

Looking for what might be going on.

It should apply to either. Meaning the problem might occur when using either channel driver without the patch applied. Note, though, that the problem can only occur if you are initiating a call using a fast originate (Async=true). If you are not originating calls as such then this issue wouldn’t affect you.
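
For reference, if calls are originated over the manager interface (AMI), a fast originate is an `Originate` action carrying an `Async: true` header. The sketch below only builds and inspects the message text, assuming AMI's `Key: Value` lines terminated by CRLF with a blank line at the end; the function names, the `SIP/1001` channel, and the `from-internal` context are hypothetical, and the check only recognises the literal value `true`.

```python
def build_originate(channel, context, exten, priority=1, fast=False):
    """Build the text of an AMI Originate action (illustrative only)."""
    headers = [
        ("Action", "Originate"),
        ("Channel", channel),
        ("Context", context),
        ("Exten", exten),
        ("Priority", str(priority)),
    ]
    if fast:
        # This header is what makes it a fast (asynchronous) originate.
        headers.append(("Async", "true"))
    return "".join(f"{key}: {value}\r\n" for key, value in headers) + "\r\n"

def is_fast_originate(message):
    """Return True if the AMI message is an Originate with Async: true."""
    fields = {}
    for line in message.split("\r\n"):
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip().lower()
    return fields.get("action") == "originate" and fields.get("async") == "true"

plain = build_originate("SIP/1001", "from-internal", "1002")
fast = build_originate("SIP/1001", "from-internal", "1002", fast=True)
print(is_fast_originate(plain), is_fast_originate(fast))
```

Searching any scripts or applications that talk to AMI for an `Async` header is a quick way to tell whether this code path is in play at all.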

It shouldn’t be using Async=true, so that’s likely not it.
What is it then?

I’ve updated to 16.7.0 now. I still have to find out what the cause of the problem is, however.

Symptoms remain the same in that existing calls don’t drop, but phones get the dreaded X on the display and obviously can’t establish new calls.