PJSIP not answering INVITE in time

About a month ago, my Asterisk cluster became very unstable.

Initially Asterisk was crashing with a SEGFAULT. We found a workaround after opening this issue.

The workaround is described there: it involves using an IP address instead of a hostname in the PJSIP trunk configuration, because DNS resolution was taking too long.
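
As a sketch of what that change looks like (the section name, hostname and IP below are placeholders, not our real trunk):

# the AOR contact is where the hostname used to be:
#
#   [kamailio-trunk]
#   type = aor
#   ;contact = sip:kamailio.example.com:5060   ; hostname -> DNS lookup
#   contact = sip:192.0.2.10:5060              ; IP literal -> no DNS involved
#
# reload PJSIP so the new contact is picked up
asterisk -rx "module reload res_pjsip.so"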

After that we didn’t experience any more crashes, but some calls started being rejected. I don’t know whether this began after we fixed the crash or at the same time the crashes started, but it was not happening before these problems began.

The most common rejected call flow is shown here:

Sometimes Asterisk will actually answer but it will be too late, as shown here:

These were taken after we increased the INVITE timeout on Kamailio from 3s to 6s. That mitigated some of the rejected calls, but I don’t feel that increasing it further is the right way to go.
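
For context, that timeout lives in Kamailio’s tm module; which of the two timers we actually tuned, and the config path below, are details of our setup shown only as an illustration:

# both timers are in milliseconds; fr_timer waits for any reply,
# fr_inv_timer waits for a final reply to an INVITE
grep -nE 'modparam\("tm", "fr(_inv)?_timer"' /etc/kamailio/kamailio.cfg
#   e.g. modparam("tm", "fr_timer", 6000)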

In the Asterisk PJSIP logs I can see the INVITEs being received, as shown here. (This log was taken before the timeout was increased to 6s.)

I noticed that the thread that processes the response didn’t print any logs for several seconds. In the example linked above the thread had this log:

[Aug 18 15:55:31] VERBOSE[121143] res_pjsip_logger.c: <--- Transmitting SIP response (560 bytes) to UDP:<KAMAILIO IP>:5060 --->

Then these:

[Aug 18 15:55:37] VERBOSE[121143] pbx_variables.c: Setting global variable 'SIPDOMAIN' to '<ASTERISK IP>'
[Aug 18 15:55:37] VERBOSE[121143] res_pjsip_logger.c: <--- Transmitting SIP response (560 bytes) to UDP:<KAMAILIO IP>:5060 ---> //This is the first reply we see from the logs linked above

In between these, I can see other threads replying to other SIP messages.
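
To spot the gap I simply pulled out everything that one thread logged around the incident (the log path is our default full log; the thread ID is the one from the example above):

# all lines logged by thread 121143 during that minute, with timestamps
grep '\[121143\]' /var/log/asterisk/full | grep 'Aug 18 15:55'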

Investigating the task processors, I have one sample of the output here.

The most notable entries are:

Processor                                                               Processed   In Queue  Max Depth  Low water High water
stasis/m:bridge:all-00000056                                               532491          0        409        450        500
stasis/m:cache_pattern:2/bridge:all-0000004f                               532492          0        532        450        500
stasis/m:cdr:aggregator-00000005                                          7790024      12723      16159       4500       5000
stasis/m:manager:core-00000006                                            8266280          0       2002       2700       3000

This is one of the few samples where I can actually see something “In Queue”; most of the time, the few seconds it takes me to run “core show taskprocessors” are enough for all the queues to clear, but “Max Depth” is always high.
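
For reference, this is roughly how I sample it; the one-second interval and the filter on the “In Queue” column are just my choices:

# print, once per second, only the task processors that currently have something queued
while true; do
    date
    asterisk -rx "core show taskprocessors" | awk '$3 ~ /^[0-9]+$/ && $3 > 0'
    sleep 1
done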

CDR is disabled. make menuselect was run with --disable-category MENUSELECT_CEL --disable-category MENUSELECT_CDR

The full make menuselect:

make menuselect.makeopts
menuselect/menuselect \
  --disable app_morsecode --disable chan_skinny --disable chan_dahdi --disable chan_oss --disable chan_sip \
  --disable app_dahdiras --disable app_adsiprog --disable app_getcpeid --disable app_ices --disable app_image \
  --disable app_macro --disable app_nbscat --disable app_url --disable cdr_mysql --disable codec_dahdi \
  --disable res_fax --disable res_convert --disable res_calendar --disable res_format_attr_h263 \
  --disable res_format_attr_h264 --disable res_format_attr_vp8 --disable res_format_attr_siren7 \
  --disable res_format_attr_siren14 --disable res_odbc --disable res_odbc_transaction --disable res_config_sqlite3 \
  --disable res_adsi --disable res_smdi --disable func_odbc --disable cel_odbc --disable cel_beanstalkd \
  --disable cel_pgsql --disable cel_radius --disable cel_tds --disable cel_sqlite3_custom \
  --disable astdb2sqlite3 --disable astdb2bdb \
  --disable-category MENUSELECT_CEL --disable-category MENUSELECT_CDR --disable-category MENUSELECT_TESTS \
  menuselect.makeopts

I can still see CDR and CEL modules with `module show like XXX`:

myasterisk*CLI> module show like cdr
Module                         Description                              Use Count  Status      Support Level
app_cdr.so                     Tell Asterisk to not maintain a CDR for  0          Running              core
app_forkcdr.so                 Fork The CDR into 2 separate entities    0          Running              core
cdr                            CDR Engine                               3          Running              core
func_cdr.so                    Call Detail Record (CDR) dialplan functi 0          Running              core
4 modules loaded
myasterisk*CLI> module show like cel
Module                         Description                              Use Count  Status      Support Level
app_celgenuserevent.so         Generate an User-Defined CEL event       0          Running              core
cel                            CEL Engine                               2          Running              core
res_format_attr_celt.so        CELT Format Attribute Module             1          Running              core
3 modules loaded

I didn’t realize the app_XXX and func_XXX modules would still be loaded even with CDR disabled. I have since unchecked them in make menuselect, but “cdr” and “cel” still load. I don’t know whether they NEED to exist or whether I missed something.
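
As far as I understand it (and this is my assumption), “cdr” and “cel” are engines built into the core, so menuselect only removes the backend modules; the engines themselves are switched off in their own config files:

# check whether the engines are actually doing anything
asterisk -rx "cdr show status"
asterisk -rx "cel show status"
# they are disabled via "enable = no" under [general] in
# /etc/asterisk/cdr.conf and /etc/asterisk/cel.conf (default paths assumed)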

About the specific use case:
All SIP traffic goes through Kamailio.
There are several Asterisk boxes, both physical and virtual, talking to Kamailio with the same type of load.
The problem happens on both the physical and the virtual machines. The higher the number of calls, the more likely it is to happen.
Each box isn’t handling more than 100 concurrent channels.
Before this began, the same hardware easily handled 1000 channels. The load was split across more boxes as a possible countermeasure. It didn’t help; the problem happens just as often: before, it happened about 20 times a day across 5 servers, and now it happens 20 times a day across 40 servers.
The incoming call seen in these logs, and a less common type for which I didn’t find a specific error log, both go through the dialplan and are eventually put into ARI Stasis and bridged together.
For the rejected calls: in the first pattern (no response), no dialplan is executed; in the second pattern (late reply), only the first line of the dialplan is executed.
A third and last type of call is originated by ARI and either bridged with the calls mentioned above or put on hold.
There is a good amount of channel manipulation going on inside the ARI application. I can go into more detail.
This happens on “Asterisk certified/16.8-cert10” and 18.5.1, as well as on the version we were running before this started (we upgraded to these versions afterwards).
Our software wasn’t updated for more than a month before this started.
Nothing in Asterisk was changed for months before this started.

After my longest post ever, what could be causing these calls to be rejected?

I’m not sure, but on which machine did you actually take the trace? On Kamailio? Did you trace on Asterisk too? Are there any differences between those two trace points? A long time ago I had a completely broken in-kernel Ethernet driver which queued UDP packets for quite a long time and then suddenly released them after 8 or more seconds. Can you rule out a problem like this? By the way, this could also explain the slow DNS, which usually uses UDP as well.
Did you compare the time Asterisk logged the packet to the time it was seen on the machine (tcpdump)?
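
For example, something along these lines would let you line the wire timestamps up with the Asterisk log (the interface name is just an example):

# full date/time stamps on the SIP packets, to match against the log entries
tcpdump -ni eth0 -tttt udp port 5060
# and the DNS traffic too, in case the resolver is the slow part
tcpdump -ni eth0 -tttt udp port 53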

Yes.

Yes. All the same; the trace on Asterisk just doesn’t have the communication with , of course. Even the timestamps differ only by a few milliseconds.

Yes, they match. During the investigation we assumed the problem was somewhere else, such as retransmits or the network, so we compared all of those closely.

This is very interesting!

Our virtual machines are all clones, but the physical servers have different hardware. I’ll see how I can validate this. OS-wise they are all CentOS 7 with everything up to date.

OK, so I think you don’t have a problem with the Ethernet driver. If I understand it correctly, the problem wasn’t there from the beginning; it had been working fine for quite a long time and then “suddenly” came up. The problem can be seen on both the VMs and the bare-metal machines, and it is load independent.

The question is: what was changed before you saw the problem for the first time?
Did you do a CentOS update, for example? Or did you make some “changes” on the Asterisk side (like adding more users), or any other configuration change?

You said DNS resolution was taking too long. Can you see the same behavior using dig, for example, or is this behavior limited to Asterisk? Does Asterisk start the DNS lookup and then take a very long time to get an answer, or does it take a very long time before Asterisk starts the lookup at all?
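
For example (the resolver address and the name being looked up are placeholders):

# the ";; Query time: ... msec" line in the output is the interesting part
dig @192.0.2.53 kamailio.example.com A
# or repeat it to catch the occasional slow lookup
for i in $(seq 1 50); do dig @192.0.2.53 kamailio.example.com A | grep 'Query time'; sleep 1; done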

My current working hypothesis is that there was (some sort of) a change which causes this problem, so my goal would be to find that change and revert it.

Nothing. I checked the history on the machines to see if anyone had touched them.
I talked to the sysadmin: “Hey, we both know you changed something, just come clean already and we can fix this.” But no, nothing there either.

Even when we do updates, we are careful to update one server first and monitor it before rolling the change out to everything.

Five days prior he had enabled Zabbix, which is the closest change we could find. It wasn’t enabled on all the servers, though, and the problem was happening on servers both with and without it.

I noticed that resolving the Zabbix hostname also takes a long time on occasion. In a random sample, the proportion of Zabbix lookups that take too long is about the same as the proportion of Asterisk-related lookups that take too long.

We spent a long time looking at this; I’m sure something changed, but I can’t find that change inside our infrastructure.

Important update. We found a workaround! We removed all DNS lookups that Asterisk was doing and the problem stopped happening.

Looking at a packet capture, we saw that Asterisk was trying to resolve its own hostname, “myasterisk”, through DNS. The lookup would always return “Unknown hostname”. It has been like this for 7 years without issues, but it clearly wasn’t optimal. Adding an entry to the hosts file pointing that name at localhost removed the last remaining DNS resolution, and we are no longer experiencing issues.
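
Concretely, it boiled down to one hosts entry (shown here as a sketch, using the “myasterisk” name mentioned above):

# map the server's own name to loopback so it never goes out to DNS
echo '127.0.0.1   myasterisk' >> /etc/hosts
# verify it now resolves locally
getent hosts myasterisk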

We are still investigating what caused this. Since we first had the SEGFAULT and now the authentication failure, there might be more symptoms of whatever changed.

Congrats! This would mean something changed on the DNS side (configuration, …). Now it would be good to carefully examine the complete responses from your DNS regarding correct format, characters and so on. But I think you yourself know best how to proceed and narrow down the problem. From my point of view, I can’t see any problem using BIND 9.11.4.