Endpoints become unavaiable every few minutes

Hello

I have this problem occuring throughout the day. Multiple endpoints become unreachable at the same time, and then after a few seconds go back to reachable. Generally the disconnected endpoints are from the same incoming ip. And this happens to different users from different ip, not juste a specific address.

Nothing in the firewall that could cause this issue. Also if an endpoint in in a call, and becomes unreachable, the call is still on and not dropped.

OS version: Ubuntu 22.04.1
Asterisk version: 18.15.1
PJSIP version: 2.12.1

I’m using websocket for the clients.

websocket transport

[wss-tls-transport]
type=transport
protocol=wss
bind=0.0.0.0
external_media_address=MY_PUBLIC_IP
external_signaling_address=MY_PUBLIC_IP
cert_file=CERT_FILE
priv_key_file=PRIV_KEY
websocket_write_timeout=1500
[2022-12-19 11:33:26] VERBOSE[372] res_pjsip/pjsip_options.c: Contact 1401/sip:cr78jqmi@196.50.14.90:57755;transport=ws;x-ast-orig-host=r7ajg6im8a25.invalid:0 has been deleted
[2022-12-19 11:33:26] VERBOSE[372] res_pjsip/pjsip_configuration.c: Endpoint 1401 is now Unreachable
[2022-12-19 11:33:26] VERBOSE[20451] res_pjsip/pjsip_options.c: Contact 1588/sip:l4jskccr@196.50.14.90:56911;transport=ws;x-ast-orig-host=qbpck0eqrao0.invalid:0 has been deleted
[2022-12-19 11:33:26] VERBOSE[20451] res_pjsip/pjsip_configuration.c: Endpoint 1588 is now Unreachable
[2022-12-19 11:33:26] VERBOSE[4003] res_pjsip/pjsip_options.c: Contact 1544/sip:7l56ss8e@196.50.14.90:58861;transport=ws;x-ast-orig-host=cqm5d07sn6oe.invalid:0 has been deleted
[2022-12-19 11:33:26] VERBOSE[4003] res_pjsip/pjsip_configuration.c: Endpoint 1544 is now Unreachable
.......

Sometimes i see these errors relating to websocket and ssl, before or while the problem is occuring, either one of them or both at the same time, but it’s not always the case .

[2022-12-19 11:27:01] ERROR[62] iostream.c: SSL_shutdown() failed: error:00000005:lib(0):func(0):DH lib, Underlying BIO error: Bad file descriptor
[2022-12-19 11:27:01] VERBOSE[62] res_http_websocket.c: WebSocket connection from '196.50.14.90:36770' forcefully closed due to fatal write error

Thanks in advance

You are having network problems. You need to trace the OPTIONS request and responses through the network until you find where they get lost.

Thanks for quick the reply.

That was my first assumption. That is why we moved our asterisk to aws from another cloud provider, we thought maybe there was a network problem with that specific provider, but the problem still occurs even after changing cloud providers.

I forgot to mention that asterisk is in a docker container. Can it be related to the problem? Does docker create some kind of bottleneck or something of the kind ? We were hesitant to use docker with asterisk but a lot of people use it without a problem so we made the jump.

Thank you for your time

Docker is strongly discouraged, but I don’t know enough about it to know whether it could drop OPTIONS.

What is making me think maybe it is specific to the sip side, is we don’t have problems with audio. If there was a problem with network or docker droping packets, then we would have a lot of audio problems, which we don’t have.

Is it possible that the web server within asterisk that handles websocket can’t support a lot of traffic, and so close some connections when overloaded?

Thanks

The bad file descriptor error is more like the other party closed the connection, but I have no personal experience of WebRTC.

Something that i noticed that seems stange.

An endpoint sends a register, asterisk removes the contact for that endpoint “due to request”, and sends a 200 OK, immediately the endpoint becomes unreachable, and then other endpoints that have the same ip address become unavailable.

[2022-12-19 12:49:34] VERBOSE[36573] res_pjsip_logger.c: <--- Received SIP request (664 bytes) from WSS:REMOTE_IP:62730 --->
REGISTER sip:mydomain SIP/2.0
Via: SIP/2.0/WSS tokp14avjl56.invalid;branch=z9hG4bK7366797
To: <sip:1587@mydomain>
From: <sip:1587@mydomain>;tag=pk8p4fjcl4
CSeq: 18 REGISTER
Call-ID: 02t7126m9najp18c2g2g
Max-Forwards: 70
Authorization: Digest algorithm=MD5, username="1587", realm="asterisk", nonce="1671450565/e6ac38f8df635e3fca5730469cc38449", uri="sip:mydomain", response="add98768778405be8dfb1dd2faf9
bb79", opaque="6e738aa74cfe8182", qop=auth, cnonce="sc94ih7ku6p9", nc=00000001
Contact: <sip:vqqsfald@tokp14avjl56.invalid;transport=ws>;expires=0
Supported: outbound, path, gruu
User-Agent: SIP.js/0.20.0
Content-Length: 0


[2022-12-19 12:49:34] VERBOSE[36573] res_pjsip_registrar.c: Removed contact 'sip:vqqsfald@REMOTE_IP:62730;transport=ws;x-ast-orig-host=tokp14avjl56.invalid:0' from AOR '1587' due to request

[2022-12-19 12:49:34] VERBOSE[36573] res_pjsip_logger.c: <--- Transmitting SIP response (348 bytes) to WSS:REMOTE_IP:62730 --->
SIP/2.0 200 OK
Via: SIP/2.0/WSS tokp14avjl56.invalid;rport=62730;received=REMOTE_IP;branch=z9hG4bK7366797
Call-ID: 02t7126m9najp18c2g2g
From: <sip:1587@mydomain>;tag=pk8p4fjcl4
To: <sip:1587@mydomain>;tag=z9hG4bK7366797
CSeq: 18 REGISTER
Date: Mon, 19 Dec 2022 11:49:34 GMT
Server: Asterisk PBX 18.15.1
Content-Length:  0

[2022-12-19 12:49:34] VERBOSE[20451] res_pjsip/pjsip_options.c: Contact 1587/sip:vqqsfald@REMOTE_IP:62730;transport=ws;x-ast-orig-host=tokp14avjl56.invalid:0 has been deleted
[2022-12-19 12:49:34] VERBOSE[20451] res_pjsip/pjsip_configuration.c: Endpoint 1587 is now Unreachable
[2022-12-19 12:49:34] VERBOSE[20451] res_pjsip_registrar.c: Removed contact 'sip:lhl9hicd@REMOTE_IP:55761;transport=ws;x-ast-orig-host=fvrqt4qpullo.invalid:0' from AOR '1524' due to shut
down
[2022-12-19 12:49:34] VERBOSE[4003] res_pjsip_registrar.c: Removed contact 'sip:3i4cnauh@REMOTE_IP:51369;transport=ws;x-ast-orig-host=r9e4n067qlbo.invalid:0' from AOR '1552' due to shutd
own
[2022-12-19 12:49:34] VERBOSE[64003] res_pjsip_registrar.c: Removed contact 'sip:cr78jqmi@REMOTE_IP:57755;transport=ws;x-ast-orig-host=r7ajg6im8a25.invalid:0' from AOR '1401' due to shut
down
[2022-12-19 12:49:34] VERBOSE[75690] res_pjsip/pjsip_options.c: Contact 1524/sip:lhl9hicd@REMOTE_IP:55761;transport=ws;x-ast-orig-host=fvrqt4qpullo.invalid:0 has been deleted
[2022-12-19 12:49:34] VERBOSE[20451] res_pjsip_registrar.c: Removed contact 'sip:m3me8hqg@REMOTE_IP:59313;transport=ws;x-ast-orig-host=fvis0e38vsnp.invalid:0' from AOR '1407' due to shut
down
...........

All the endpoints removed are from the same ip address as the first one that was removed. Other endpoints with different ip aren’t removed

AOR config

type=aor
max_contacts=1
qualify_frequency=30
remove_existing=yes

And i have rewrite_contact=yes in the endpoint config

I tried setting max_contacts to 2 as stated here: PJSIP mis-configuration can cause loss of SIP registrations ⋆ Asterisk, but it juste made things worse

The “max_contacts” setting has three types of values that are useful.

  • Setting to at least twice the number of devices desired for the endpoint should give enough headroom to avoid unexpected registration loss.

That’s an unregister. The endpoint has asked to have the registration removed.

https://www.rfc-editor.org/rfc/rfc3261#section-10.2.2

Thanks, that something new i learned today :slight_smile: .

Shouldn’t that remove only the endpoint that sent the unregister, not every endpoint from that ip address? I see the same pattern that repeats, a Register request is sent for one endpoint, that endpoint is removed, then every endpoint with the same ip is removed immediately afterwards :confused:.

I’m thinking may be the problem is caused by the option remove_existing=yes in aor config. I’m setting it to no for all users and hoping this headache ends here.

I’d look more at the significance of “shutdown” in the messages.

I think you need to get tcpdump/wireshark traces and see how many TLS connections are being set up, when they are taken down, and by whom.

I don’t think it’s a connection problem. I could reproduce the same problem in another server.

I have two users connected from the same ip address. When i do pjsip show contacts, i have two separate contacts for each user.

  Contact:  <Aor/ContactUri..............................> <Hash....> <Status> <RTT(ms)..>
==========================================================================================

  Contact:  1017/sip:hb14hf8l@MY_IP_ADDRESS:46418;transpo 5f391ede42 Avail        89.639
  Contact:  1052/sip:43q2g0nn@MY_IP_ADDRESS:32940;transpo 27eef1bc06 Avail       219.331

If one of the users disconnected for whatever reason, its contact is removed, but also the contact of the other user who is from the same ip, pjsip show contacts is empty.

This is the problem i think, asterisk removes all of the contacts from the same ip address, if juste one is disonnected.

Asterisk doesn’t remove all contacts from the same IP address. It doesn’t work like that, and we’ve had no other reports of such an issue.

I’d suggest doing what David mentioned and getting a packet capture to determine who is shutting down the connections.

1 Like

Ok thank you @jcolp and @david551. I’ll look into it.

On the client side, i see that the websocket connection is still up, while asterisk is removing the contact and considering it unreachable. It becomes reachable again when the client sends a new Register request, on that same open connection!!

Are there any other reasons for this message other than a dropped connection ?

Removed contact 'sip:mcu7vmm4@X.X.X.X:50178;transport=ws;x-ast-orig-host=ijou1s2tent9.invalid:0' from AOR '1017' due to shutdown

Also, is there a cli command to show websocket connections? I couldn’t find any.

Thanks

As TCP connections, they would show with the shell command “netstat”.

DIdn’t think of that, thanks a lot.

We are seeing something very similar with asterisk shutting down websocket transports while the socket itself appears to stay open on the client side. It only appears to be an issue with the websocket transport, other transports are working as expected. The endpoints go unreachable until they re-register. I can confirm this behavior was introduced in 18.15.1 and that downgrading to 18.15.0 resolves the issue but I don’t like that as a solution as 18.15.1 is a security update.

Since you’ve isolated it down to that I’ve created an issue[1] and we’ll be holding off on releases of the current release candidates so we can determine what is going on.

[1] [ASTERISK-30369] res_pjsip: Websockets from same IP shut down when they shouldn't be - Digium/Asterisk JIRA