PJSIP is now Unreachable / Realtime

Afternoon all,

We have been trying to hunt down the cause of this issue for sometime and hoping some less tired eyes will spot it.

Issue we are experiencing is multiple pjsip trunks reporting unreachable. They will always come back after a period of time (random length, could be seconds to minutes).

Whilst the issue first appeared on 16.22.0, we have been able to recreate the issue on v16.15.0 as well. We have tested ip based and outbound auth based trunks, both seem to be affected. Originally thought it was due to the carriers we are testing with using SRV and a config not being correct, not 100% sure though. Using realtime on all systems.

Looking at pjsip logger, registrations and options all look to be fine or returning straight away.

Hoping a bit of help here to understand whats wrong as I think its a qualify issue but not sure.

pjsip_wizard.conf

[hosted_trunk]
type = wizard
endpoint/transport=transport-udp
endpoint/allow = !all,ulaw,alaw
endpoint/rewrite_contact=yes
endpoint/context = default
endpoint/force_rport = yes
endpoint/send_rpid = yes
aor/qualify_frequency = 60
sends_auth = yes
sends_registrations = yes
remote_hosts = carrier_srv_fqdn
outbound_auth/username = xxxxxxxxxxx
outbound_auth/password = xxxxxxxxxxx
send_pai = yes

pjsip.conf

[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0:6443
local_net=x.x.x.0/24
external_media_address=x.x.x.x
external_signaling_address=x.x.x.x
external_signaling_port=6443

Debug/verbose logging show the trunks becoming unreachable but why is a bit beyond our understanding of the logs.

All vm’s are hosted in Azure, worth mentioning we also have another 16.22.0 in another region with identical configuration not experiencing the issue.

Would appreciate thoughts/directions on what to look at next, would be great.

You’ve stated you’re using realtime, but you’ve provided a config file. How is realtime in use? For endpoints? Something else?

And unreachable occurs when a SIP OPTIONS is sent and does not get a response. Without providing a SIP trace and logging, that’s the extent of what can be said.

Thanks,

Yes realtime is used to manage endpoints and their auth/aor, not trunks. We also use it for dialplan for outbound, voicemail.

I have a cap from pjsip logger but it just shows the registration every 60 mins and SIP OPTIONS retry every 60 seconds. There are no missing option replies. We get a 200 OK every time.

I don’t suppose the below provides any insight or have I not even pulled the right section?

[Apr 22 05:36:32] DEBUG[28599] res_pjsip/pjsip_options.c: Qualifying all contacts on AOR ‘trunk_context’
[Apr 22 05:36:32] DEBUG[28599] res_pjsip/pjsip_options.c: Qualifying contact ‘trunk_context@@4e3878d3e32c171ebc3f7dc87f185cc5’ on AOR ‘trunk_context’
[Apr 22 05:36:32] DEBUG[28599] res_odbc.c: Reusing ODBC handle 0x7f91cc004790 from class ‘realtime’
[Apr 22 05:36:32] DEBUG[28599] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_endpoints WHERE id = ?
[Apr 22 05:36:32] DEBUG[28599] res_config_odbc.c: Parameter 1 (‘id’) = ‘trunk_context’
[Apr 22 05:36:32] DEBUG[28599] res_odbc.c: Releasing ODBC handle 0x7f91cc004790 into pool
[Apr 22 05:36:32] DEBUG[28599] res_odbc.c: Reusing ODBC handle 0x7f91cc004790 from class ‘realtime’
[Apr 22 05:36:32] DEBUG[28599] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_domain_aliases WHERE id = ?
[Apr 22 05:36:32] DEBUG[28599] res_config_odbc.c: Parameter 1 (‘id’) = ‘carrier_srv_fqdn’
[Apr 22 05:36:32] DEBUG[28599] res_odbc.c: Releasing ODBC handle 0x7f91cc004790 into pool
[Apr 22 05:36:32] DEBUG[28599] res_pjsip.c: 0x7f91f8217480: Wrapper created
[Apr 22 05:36:32] DEBUG[28599] res_pjsip.c: 0x7f91f8217480: Set timer to 3000 msec
[Apr 22 05:36:32] DEBUG[28599] res_pjsip/pjsip_resolver.c: Performing SIP DNS resolution of target ‘carrier_srv_fqdn’
[Apr 22 05:36:32] DEBUG[28599] res_pjsip/pjsip_resolver.c: Transport type for target ‘carrier_srv_fqdn’ is ‘UDP transport’
[Apr 22 05:36:32] DEBUG[28599] res_pjsip/pjsip_resolver.c: [0x7f91f810a7c8] Created resolution tracking for target ‘carrier_srv_fqdn’
[Apr 22 05:36:32] DEBUG[28599] res_pjsip/pjsip_resolver.c: [0x7f91f810a7c8] Added target ‘carrier_srv_fqdn’ with record type ‘35’, transport ‘UDP transport’, and port ‘5060’
[Apr 22 05:36:32] DEBUG[28599] res_pjsip/pjsip_resolver.c: [0x7f91f810a7c8] Added target ‘_sip._udp.carrier_srv_fqdn’ with record type ‘33’, transport ‘UDP transport’, and port ‘5060’
[Apr 22 05:36:32] DEBUG[28599] res_pjsip/pjsip_resolver.c: [0x7f91f810a7c8] Added target ‘carrier_srv_fqdn’ with record type ‘1’, transport ‘UDP transport’, and port ‘5060’
[Apr 22 05:36:32] DEBUG[28599] res_pjsip/pjsip_resolver.c: [0x7f91f810a7c8] Starting initial resolution using parallel queries for target ‘carrier_srv_fqdn’
[Apr 22 05:36:32] DEBUG[26012] dns.c: DNS search failed for carrier_srv_fqdn
[Apr 22 05:36:32] DEBUG[26012] dns_system_resolver.c: DNS search failed for query: ‘carrier_srv_fqdn’
[Apr 22 05:36:32] DEBUG[26040] res_pjsip_registrar.c: Woke up at 1650605792 Interval: 30
[Apr 22 05:36:32] DEBUG[26040] res_odbc.c: Reusing ODBC handle 0x7f91cc004790 from class ‘realtime’
[Apr 22 05:36:32] DEBUG[26040] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_contacts WHERE expiration_time <= ? ORDER BY expiration_time
[Apr 22 05:36:32] DEBUG[26040] res_config_odbc.c: Parameter 1 (‘expiration_time <=’) = ‘1650605792’
[Apr 22 05:36:32] DEBUG[26040] res_odbc.c: Releasing ODBC handle 0x7f91cc004790 into pool
[Apr 22 05:36:32] DEBUG[26040] res_pjsip_registrar.c: Expiring 0 contacts
[Apr 22 05:36:35] DEBUG[26027] res_pjsip.c: 0x7f91f8217480: Internal tsx timer expired after 3000 msec
[Apr 22 05:36:35] DEBUG[26027] res_pjsip.c: 0x7f91f8217480: Timer handled here
[Apr 22 05:36:35] DEBUG[26027] res_pjsip.c: 0x7f91f8217480: Callbacks executed
[Apr 22 05:36:35] VERBOSE[7562] res_pjsip/pjsip_configuration.c: Endpoint trunk_context is now Unreachable
[Apr 22 05:36:35] VERBOSE[7562] res_pjsip/pjsip_options.c: Contact trunk_context/sip:carrier_srv_fqdn is now Unreachable. RTT: 0.000 msec
[Apr 22 05:36:35] DEBUG[7562] res_pjsip/pjsip_options.c: AOR ‘trunk_context’ now has 0 available contacts

A SIP trace would be done using “pjsip set logger on”. There are additional arguments you can use to limit to a specific hostname or IP address. Outside of Asterisk tcpdump can be used to do a packet capture.

It could also be that DNS resolution failed temporarily. More likely the SRV just failed, but I don’t know because your placeholder has “carrier_srv_fqdn” which references SRV.

Yes I think I didn’t have that on in the last capture.

So having run another debug with pjsip logger on, I think I can now see where the fault is which is a delay for sending the OPTION.

It starts the AOR qualify process, if there is a delay of more than 3 seconds, then its deemed unreachable. The times I am seeing the issue, the OPTION isn’t sent until 4-6 seconds after the qualify process begins (already set unreachable).

Even with a instant OPTION return the trunk remains unreachable. These delays can go on for awhile as well meaning multiple successful OPTION returns are not picked and the trunk is unavailable for us during this period.

Is there a way to adjust the tolerance in the dns resolver for pjsip?

Example below shows the qualify start at 15:01:55, fails at 15:01:58, OPTION send at 15:02:00, trunk remains offline

[Apr 23 15:01:55] DEBUG[988] res_pjsip/pjsip_options.c: Qualifying all contacts on AOR ‘hosted_trunk’
[Apr 23 15:01:55] DEBUG[988] res_pjsip/pjsip_options.c: Qualifying contact ‘hosted_trunk@@4e3878d3e32c171ebc3f7dc87f185cc5’ on AOR ‘hosted_trunk’
[Apr 23 15:01:55] DEBUG[988] res_odbc.c: Reusing ODBC handle 0x7f91f0015790 from class ‘realtime’
[Apr 23 15:01:55] DEBUG[988] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_endpoints WHERE id = ?
[Apr 23 15:01:55] DEBUG[988] res_config_odbc.c: Parameter 1 (‘id’) = ‘hosted_trunk’
[Apr 23 15:01:55] DEBUG[988] res_odbc.c: Releasing ODBC handle 0x7f91f0015790 into pool
[Apr 23 15:01:55] DEBUG[988] res_odbc.c: Reusing ODBC handle 0x7f91f0015790 from class ‘realtime’
[Apr 23 15:01:55] DEBUG[988] res_config_odbc.c: Skip: 0; SQL: SELECT * FROM ps_domain_aliases WHERE id = ?
[Apr 23 15:01:55] DEBUG[988] res_config_odbc.c: Parameter 1 (‘id’) = ‘hosted_trunk.voip.fonify.info’
[Apr 23 15:01:55] DEBUG[988] res_odbc.c: Releasing ODBC handle 0x7f91f0015790 into pool
[Apr 23 15:01:55] DEBUG[988] res_pjsip.c: 0x7f91e00321f0: Wrapper created
[Apr 23 15:01:55] DEBUG[988] res_pjsip.c: 0x7f91e00321f0: Set timer to 3000 msec
[Apr 23 15:01:55] DEBUG[988] res_pjsip/pjsip_resolver.c: Performing SIP DNS resolution of target ‘hosted_trunk.voip.fonify.info’
[Apr 23 15:01:55] DEBUG[988] res_pjsip/pjsip_resolver.c: Transport type for target ‘hosted_trunk.voip.fonify.info’ is ‘UDP transport’
[Apr 23 15:01:55] DEBUG[988] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Created resolution tracking for target ‘hosted_trunk.voip.fonify.info’
[Apr 23 15:01:55] DEBUG[988] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Added target ‘hosted_trunk.voip.fonify.info’ with record type ‘35’, transport ‘UDP transport’, and port ‘5060’
[Apr 23 15:01:55] DEBUG[988] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Added target ‘_sip._udp.hosted_trunk.voip.fonify.info’ with record type ‘33’, transport ‘UDP transport’, and port ‘5060’
[Apr 23 15:01:55] DEBUG[988] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Added target ‘hosted_trunk.voip.fonify.info’ with record type ‘1’, transport ‘UDP transport’, and port ‘5060’
[Apr 23 15:01:55] DEBUG[988] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Starting initial resolution using parallel queries for target ‘hosted_trunk.voip.fonify.info’
[Apr 23 15:01:56] DEBUG[26012] dns.c: DNS search failed for hosted_trunk.voip.fonify.info
[Apr 23 15:01:56] DEBUG[26012] dns_system_resolver.c: DNS search failed for query: ‘hosted_trunk.voip.fonify.info’
[Apr 23 15:01:58] DEBUG[26027] res_pjsip.c: 0x7f91e00321f0: Internal tsx timer expired after 3000 msec
[Apr 23 15:01:58] DEBUG[26027] res_pjsip.c: 0x7f91e00321f0: Timer handled here
[Apr 23 15:01:58] DEBUG[26027] res_pjsip.c: 0x7f91e00321f0: Callbacks executed
[Apr 23 15:01:58] DEBUG[988] res_pjsip/pjsip_options.c: Contact hosted_trunk/sip:hosted_trunk.voip.fonify.info status didn’t change: Unreachable, RTT: 0.000 msec

[Apr 23 15:01:59] DEBUG[26012] dns.c: DNS search failed for hosted_trunk.voip.fonify.info
[Apr 23 15:01:59] DEBUG[26012] dns_system_resolver.c: DNS search failed for query: ‘hosted_trunk.voip.fonify.info’
[Apr 23 15:01:59] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] All parallel queries completed
[Apr 23 15:01:59] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] SRV record received on target ‘_sip._udp.hosted_trunk.voip.fonify.info’
[Apr 23 15:01:59] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Added target ‘kam001.z1.las.voipproxy.net’ with record type ‘1’, transport ‘UDP transport’, and port ‘7000’
[Apr 23 15:01:59] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] SRV record received on target ‘_sip._udp.hosted_trunk.voip.fonify.info’
[Apr 23 15:01:59] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Added target ‘kam002.z1.las.voipproxy.net’ with record type ‘1’, transport ‘UDP transport’, and port ‘7000’
[Apr 23 15:01:59] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] SRV record received on target ‘_sip._udp.hosted_trunk.voip.fonify.info’
[Apr 23 15:01:59] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Added target ‘kam003.z1.grr.voipproxy.net’ with record type ‘1’, transport ‘UDP transport’, and port ‘7000’
[Apr 23 15:01:59] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] SRV record received on target ‘_sip._udp.hosted_trunk.voip.fonify.info’
[Apr 23 15:01:59] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Added target ‘kam002.z1.grr.voipproxy.net’ with record type ‘1’, transport ‘UDP transport’, and port ‘7000’
[Apr 23 15:01:59] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] New queries added, performing parallel resolution again
[Apr 23 15:02:00] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] All parallel queries completed
[Apr 23 15:02:00] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] A record received on target ‘kam001.z1.las.voipproxy.net
[Apr 23 15:02:00] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] A record received on target ‘kam002.z1.las.voipproxy.net
[Apr 23 15:02:00] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] A record received on target ‘kam003.z1.grr.voipproxy.net
[Apr 23 15:02:00] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] A record received on target ‘kam002.z1.grr.voipproxy.net
[Apr 23 15:02:00] DEBUG[26012] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Resolution completed - 4 viable targets
[Apr 23 15:02:00] DEBUG[988] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Address ‘0’ is 162.252.251.41:7000 with transport ‘UDP transport’
[Apr 23 15:02:00] DEBUG[988] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Address ‘1’ is 162.252.251.42:7000 with transport ‘UDP transport’
[Apr 23 15:02:00] DEBUG[988] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Address ‘2’ is 162.252.250.230:7000 with transport ‘UDP transport’
[Apr 23 15:02:00] DEBUG[988] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Address ‘3’ is 162.252.250.42:7000 with transport ‘UDP transport’
[Apr 23 15:02:00] DEBUG[988] res_pjsip/pjsip_resolver.c: [0x7f91e00e0218] Invoking user callback with ‘4’ addresses
[Apr 23 15:02:00] DEBUG[988] res_pjsip/pjsip_message_filter.c: Re-wrote Contact URI host/port to x.x.x.x:6443 (this may be re-written again later)
[Apr 23 15:02:00] DEBUG[988] netsock2.c: Splitting ‘162.252.251.41’ into…
[Apr 23 15:02:00] DEBUG[988] netsock2.c: …host ‘162.252.251.41’ and port ‘’.
[Apr 23 15:02:00] DEBUG[988] res_pjsip_nat.c: Re-wrote Contact URI port to 6443
[Apr 23 15:02:00] DEBUG[988] netsock2.c: Splitting ‘x.x.x.x:6443’ into…
[Apr 23 15:02:00] DEBUG[988] netsock2.c: …host ‘x.x.x.x’ and port ‘6443’.
[Apr 23 15:02:00] DEBUG[988] netsock2.c: Splitting ‘162.252.251.41:7000’ into…
[Apr 23 15:02:00] DEBUG[988] netsock2.c: …host ‘162.252.251.41’ and port ‘7000’.
[Apr 23 15:02:00] VERBOSE[988] res_pjsip_logger.c: <— Transmitting SIP request (435 bytes) to UDP:162.252.251.41:7000 —>
OPTIONS sip:hosted_trunk.voip.fonify.info SIP/2.0
Via: SIP/2.0/UDP x.x.x.x:6443;rport;branch=z9hG4bKPj66d0c8bf-38b6-4984-895a-f51dfa68ab53
From: sip:hosted_trunk@x.x.x.x;tag=4f03b2b8-c2fd-47a0-b7e7-4a918b361250
To: sip:hosted_trunk.voip.fonify.info
Contact: sip:hosted_trunk@x.x.x.x:6443
Call-ID: 0add92e1-9465-4c81-b686-e35a0de7e60e
CSeq: 31839 OPTIONS
Max-Forwards: 70
User-Agent: Asterisk PBX 16.22.0
Content-Length: 0

[Apr 23 15:02:00] VERBOSE[26027] res_pjsip_logger.c: <— Received SIP response (383 bytes) from UDP:162.252.251.41:7000 —>
SIP/2.0 200 Rawr!!
Via: SIP/2.0/UDP x.x.x.x:6443;rport=6443;branch=z9hG4bKPj66d0c8bf-38b6-4984-895a-f51dfa68ab53;received=x.x.x.x
From: sip:hosted_trunk@x.x.x.x;tag=4f03b2b8-c2fd-47a0-b7e7-4a918b361250
To: sip:hosted_trunk.voip.fonify.info;tag=3cc5c7754c521ae90c1a083c37a4749d.b911
Call-ID: 0add92e1-9465-4c81-b686-e35a0de7e60e
CSeq: 31839 OPTIONS
Content-Length: 0

Ok so based on this, I believe I’ve answered my own question. As qualify_timeout default is 3 seconds, and if the OPTIONS response is not back before this then the contact is then deemed unreachable.

In this situation, as its DNS resolution that is delaying the initial request, increasing the timeout to 10 should provide a bit of tolerance/workaround even though it is not resolving the root cause of the issue.

I’d suggest that you raise this as a bug report, as it wouldn’t surprise me if this explains a lot of intermittent unreachable reports. I suspect it may be difficult to fix, as the pjsip library may get in the way.

Setting a very large qualify time isn’t really good solution, as any time more than about 100ms, if not an artefact of DNS failures, implies an increasing unusable media path.

Ok done - [ASTERISK-30025] PJSIP affected by delays in DNS resolution - Digium/Asterisk JIRA

I totally agree increasing the qualify time doesn’t fix the root cause, unfortunately due to the nature of DNS/SRV and all things network/latency, we are not sure of another way to work around this for the time being.

We have changed the DNS endpoints we are using from Microsoft’s own to cloudflare, will see if that makes a difference, but will take more testing to see if the latency is from the pbx to dns or dns to domain host/dns.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.