PJSIP Occasionally Stops Processing Calls to Completion

Since upgrading my test system to Asterisk 16.17.0, I’ve had an issue twice now where chan_pjsip seems to stop processing some calls. The first time this happened, a particular endpoint couldn’t make outgoing calls, and I noticed a bunch of identical dead air call attempts were stacked up in “pjsip show channels” so I just restarted Asterisk and things started working again. Now, after several days of things working fine, this same endpoint can’t receive incoming calls.

With this endpoint that currently can’t receive calls, there are two scenarios happening. In the first scenario, when someone calls one of my DIDs, I see an INVITE come in on the Asterisk console and chan_pjsip sends no response at all. The provider retransmits the INVITE a few more times before giving up (the CSeq remains the same on all INVITEs).

The other scenario is that the provider sends me an INVITE and chan_pjsip sends back Trying, but doesn’t do anything else. Dial never gets executed (which is the first thing the dialplan should do for a call coming into this extension). Eventually my provider sends me a few CANCELs and PJSIP never sends an OK.
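To tell the two scenarios apart in a longer capture, something like the following rough sketch might help. It assumes the SIP messages have already been split out of the log into separate strings (how to do that depends on the logger output and is not shown), and the function names are made up for illustration. It groups INVITEs by Call-ID and CSeq number, so retransmissions count as one transaction, and reports whether each one got no response, only a provisional response such as 100 Trying, or a final response.

```python
def parse_sip_message(raw):
    """Return (kind, call_id, cseq_number, cseq_method, status) for one SIP message."""
    lines = raw.strip().splitlines()
    start = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    cseq_number, cseq_method = headers["cseq"].split()
    call_id = headers.get("call-id") or headers.get("i")  # "i" is the compact form
    if start.startswith("SIP/2.0"):
        # Status line, e.g. "SIP/2.0 100 Trying"
        return ("response", call_id, int(cseq_number), cseq_method, int(start.split()[1]))
    return ("request", call_id, int(cseq_number), cseq_method, None)


def classify_invites(messages):
    """Map (call_id, cseq_number) to 'no response', 'provisional only' or 'completed'."""
    state = {}
    for raw in messages:
        kind, call_id, number, method, status = parse_sip_message(raw)
        if method != "INVITE":
            continue  # CANCEL, REGISTER and their responses are ignored here
        key = (call_id, number)
        if kind == "request":
            # Retransmitted INVITEs share Call-ID and CSeq, so they land on the same key.
            state.setdefault(key, "no response")
        elif status >= 200:
            state[key] = "completed"
        elif state.get(key) != "completed":
            state[key] = "provisional only"
    return state
```

Transactions reported as "no response" would correspond to the first scenario, and "provisional only" to the second.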

I’ve placed a few test calls over approximately a ten minute span, and these two scenarios continue to happen. I haven’t restarted Asterisk or PJSIP because this problem isn’t hurting anyone at the moment, but I will have to restart by the end of the day.

Does anyone know why this might be happening and what I might be able to do? I set this system up for testing and I definitely can’t switch my production system to 16 unless I can figure out how to resolve this.

This sounds like a deadlock. See “Getting a Backtrace” on the Asterisk Project Wiki for how to collect debugging information.

I didn’t start Asterisk with the -g option, so based on that article I gather there’s nothing I can do to get debug information unless I kill Asterisk and restart it with -g. However, I haven’t had my breakfast or coffee yet, so I’ll look this over again later.

root@cloud:~# ps -C asterisk u
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     12611  0.7  6.5 1645220 66516 ?       Ssl  Apr07  42:04 /usr/sbin/asterisk

If you have a deadlock, you don’t actually need it to produce a core file, as you can produce a backtrace from the running memory image.

https://man7.org/linux/man-pages/man1/gcore.1.html
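As a rough sketch of that approach, the two steps (dump the running process with gcore, then ask gdb for a backtrace of every thread) could be wrapped like this. The binary path is taken from the ps output above; the output locations and function names are assumptions for illustration only, and the backtrace is far more useful if Asterisk was built with debugging symbols.

```python
import subprocess


def backtrace_commands(pid, binary="/usr/sbin/asterisk", prefix="/tmp/asterisk-core"):
    """Build the gcore and gdb command lines for a running process."""
    core_file = f"{prefix}.{pid}"  # gcore appends ".<pid>" to the -o prefix
    dump = ["gcore", "-o", prefix, str(pid)]
    trace = ["gdb", "-batch", "-ex", "thread apply all bt full", binary, core_file]
    return dump, trace


def collect_backtrace(pid, out_path="/tmp/asterisk-backtrace.txt"):
    """Dump the process image and write a backtrace of all threads to out_path."""
    dump, trace = backtrace_commands(pid)
    subprocess.run(dump, check=True)  # the process keeps running afterwards
    with open(out_path, "w") as out:
        subprocess.run(trace, stdout=out, stderr=subprocess.STDOUT, check=True)
    return out_path
```

With the process shown above, `collect_backtrace(12611)` would need to run as root. Both the core file and the backtrace can contain passwords and phone numbers, so they should be scrubbed before being shared.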

Thanks for that. I just generated the core dump with gcore. I will go through it later to try to scrub anything that I’d like to scrub.

Is the best way to proceed to open an issue on Jira?

Interestingly, now the problem appears to have morphed. Now chan_pjsip is parsing the incoming REGISTER messages, but isn’t responding to them.

I have had the same issues since I moved from Asterisk 13.36 to Asterisk 16.8 one month ago. I thought they were limited to my system, due to an error in my chan_pjsip configuration, but you have described exactly the same problems.

After investigating with my SIP provider (OVH France), we listed:

  • incoming REGISTER without answer
  • first INVITE, followed by Trying and then nothing else, and then a 5-second wait before the 200 OK is sent for the CANCEL request that my provider sends due to the lack of response
    This does not affect all calls, only some of them, when not registered.

I will also generate the core dump to investigate the deadlock “assumption”

I also upgraded from 13 (I don’t remember which minor version) to 16. Could this have something to do with an upgrade path vs a fresh install? I wouldn’t think so, but I remain open minded.

If you reload chan_pjsip or restart Asterisk, does the problem occur? Your problem description reminds me of an issue with another European provider, but I’d need more details.

If the problem never occurs after a restart, then I have more questions. You’d need to check the registration status and the status of the aors (or simply pjsip show contacts).

You are right, the problem does not occur right after a restart, only after a few hours.

As usual, it will be something obvious, but I can’t see it!

 Contact:  <Aor/ContactUri..............................> <Hash....> <Status> <RTT(ms)..>
==========================================================================================

  Contact:  my_trunk/sip:0033xxxxxxx@siptrunk.ovh.net:506 8ec5fdf743 NonQual         nan

      Aor:  <Aor..............................................>  <MaxContact>
    Contact:  <Aor/ContactUri............................> <Hash....> <Status> <RTT(ms)..>
==========================================================================================

      Aor:  my_trunk                                              0
    Contact:  my_trunk/sip:0033xxxxxxx@siptrunk.ovh.net:5 8ec5fdf743 NonQual         nan


 ParameterName        : ParameterValue
 ==============================================================
 authenticate_qualify : false
 contact              : sip:0033xxxxxxx@siptrunk.ovh.net:5060
 default_expiration   : 1800
 mailboxes            : 
 max_contacts         : 0
 maximum_expiration   : 7200
 minimum_expiration   : 60
 outbound_proxy       : 
 qualify_frequency    : 0
 qualify_timeout      : 3.000000
 remove_existing      : false
 support_path         : false
 voicemail_extension  :

FYI: I did not upgrade the system. It is a fresh install of 16.8, but I reused the config files of my previous running instance (13.36 with chan_pjsip), which is why I am looking for issues with the chan_pjsip parameters.

Can you switch back to the old system and check whether the problem does not occur?

The reason I am asking is that sometimes providers change things themselves. Unrelated to this problem, I recently read an announcement that one of the larger providers changed how their DNS servers work. I don’t really know whether they changed anything, but the announcement said that simple name lookup won’t work after April 1.

You could also check the states of the OVH connections in the router. I guess they allow TCP as well, so one could check that for signalling too. In the case of TCP, you should see one or more established connections permanently.

On my system, the issue doesn’t occur immediately after a restart. In both cases the issue occurred about three days after starting Asterisk.

I recompiled with DONT_OPTIMIZE last night so that hopefully I can provide a better core dump, assuming the issue comes up again. It’s been about 15 hours and no issue yet, but I think both of the previous times this issue occurred, the system was up for three days.

I doubt that it is a stability issue. Could you publish your pjsip registration (scrub the personal stuff) and say which OVH product you are using?

I switched back to the old system, and I have the same issue of “missed calls”.
So upgrading from 13.36 to 16.8 cannot be the problem, but a change in the provider’s infrastructure could be. I asked OVH to investigate on their side.

They might not know themselves :) or at least the people who usually talk to customers might not. Please publish your registration configuration. I have something specific in mind, but I’d like to check that first before I open my mouth.

thank you

[general]
externip = 111.111.111.111

[transport-udp]
type = transport
protocol = udp
bind = 192.168.31.1:5060
external_media_address = 111.111.111.111
external_signaling_address = 111.111.111.111
local_net = 192.168.31.0/24

[registration-options](!)
type = registration
retry_interval = 20
max_retries = 20
expiration = 1800
transport = transport-udp
line=yes

[reg_my_trunk](registration-options)
outbound_auth = auth_reg_my_trunk
client_uri = sip:0033xxxxxxx@siptrunk.ovh.net:5060
server_uri = sip:siptrunk.ovh.net:5060
endpoint=my_trunk

[auth_reg_my_trunk]
type = auth
password = password
username = 0033xxxxxxx

[endpoint-options](!)
type = endpoint
context = default
dtmf_mode = rfc4733
disallow = all
allow = alaw
rtp_symmetric = yes
force_rport = yes
rewrite_contact = yes
rtp_keepalive = 30
direct_media = no
send_rpid = yes
language = fr

[my_trunk]
type = aor
contact = sip:0033xxxxxxx@siptrunk.ovh.net:5060
default_expiration = 1800

[my_trunk]
type = identify
endpoint = my_trunk
match = siptrunk.ovh.net

[my_trunk](endpoint-options)
from_user = 0033xxxxxxx
outbound_auth = auth_reg_my_trunk
aors = my_trunk

You could check whether they rotate IP addresses for siptrunk.ovh.net. Asterisk always resolves names, and because of that the IP addresses might get out of sync. For example, your trunk could still be registered to an old address while a new call comes from a different IP. Then it would depend on what the OVH servers are doing and whether you still have a valid registration.
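A minimal sketch of such a check, assuming plain A/AAAA lookups are representative (Asterisk may also use SRV or NAPTR records, which this does not cover): resolve the name at intervals and compare the address sets. The function names are hypothetical.

```python
import socket


def resolve_ips(host, port=5060, resolver=socket.getaddrinfo):
    """Return the set of IP addresses the name currently resolves to."""
    infos = resolver(host, port, proto=socket.IPPROTO_UDP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def compare_lookups(previous, current):
    """Report which addresses appeared or disappeared between two lookups."""
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }
```

Calling `resolve_ips("siptrunk.ovh.net")` every few minutes and logging any non-empty result of `compare_lookups` would show whether the addresses change over the hours it takes for the problem to appear.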

I did not find what I was looking for.

For long-term observations I would attach a HEP server to Asterisk. HOMER is pretty easy to set up as it comes as a container and the configuration is trivial. Of course, you can also dump everything into a flat file, but searching for problems in thousands of lines is no fun.

The endpoint has a public IP, so you need to remove these:

rtp_symmetric = yes
force_rport = yes
rewrite_contact = yes

Why? I’d say the PBX is behind a NAT router as the transport section suggests:

The options referenced (rtp_symmetric, force_rport, and rewrite_contact) apply when the defined endpoint is behind NAT, not when the PBX is behind NAT.

And this is one of the commonest mistakes in chan_sip configurations, typically those copied and pasted from ITSP recommendations. It is normally fairly harmless, though.