In Asterisk 16 we’re seeing TCP connections kept open when a WSS client registers over and over, each time from a different source port (I’m assuming some kind of NAT issue on the client’s end).
When this happens, Asterisk doesn’t release the associated connections, so the http.conf sessionlimit (default 100) is reached and web sockets become unresponsive, with the log entry “HTTP session count exceeded 100 sessions”.
Asterisk does eventually release each socket, exactly 15 minutes later, with this output on the CLI:
[Dec 29 15:08:35] ERROR[23690]: res_http_websocket.c:531 ws_safe_read: Error reading from web socket: Connection timed out
[Dec 29 15:08:35] ERROR[29866]: iostream.c:552 ast_iostream_close: SSL_shutdown() failed: error:00000005:lib(0):func(0):DH lib, Underlying BIO error: Bad file descriptor
The only setting I can find with a default involving “15” is session_keep_alive=15000 in http.conf, but that value is in milliseconds (15 seconds, not 15 minutes).
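For reference, here is a rough sketch of the http.conf options being discussed. The values are the sample-config defaults as far as I recall them, and the limit is spelled sessionlimit there, so check the http.conf.sample shipped with your version before relying on any of this.

```ini
; http.conf (sketch only; values are assumed sample-config defaults)
[general]
sessionlimit=100          ; maximum simultaneous HTTP sessions
session_inactivity=30000  ; ms to wait for more data on an HTTP connection
session_keep_alive=15000  ; ms to wait for the next request on a persistent connection
```

None of these defaults works out to 15 minutes, so the delay presumably comes from somewhere else.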
To try to get this under control (since we can’t control the client), I’ve tried the various PJSIP AOR expiration settings (minimum_expiration, default_expiration, maximum_expiration) without success.
Do you have qualify_frequency set on the AOR? I’d expect that would cause the connection drop to be detected sooner, since it would be trying to send a packet at an interval.
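A minimal sketch of what that could look like in pjsip.conf. The section name and the values are made up for illustration; only the option names come from PJSIP itself.

```ini
; pjsip.conf (sketch only; "webrtc-client" is a hypothetical AOR name)
[webrtc-client]
type=aor
qualify_frequency=30   ; seconds between OPTIONS qualifies; 0 disables qualifying
qualify_timeout=3.0    ; seconds to wait for a response to the qualify
```

With qualify enabled, Asterisk periodically tries to send to each contact, which is what should surface a dead connection earlier.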
The PJSIP keep_alive_interval is not explicitly set, so it should default to 90 seconds. I don’t think this is PJSIP related.
I think the built-in HTTP server is holding the TCP ports open; there’s no further traffic on them. If I restart Asterisk, the ports get closed. This is well after I’ve blocked the offending IP address.
When the client isn’t blocked, only the last 4 registered contacts’ IPs and ports have traffic (as expected). But reaching the HTTP sessionlimit is what makes it unresponsive, so http or res_http_websocket is counting those connections and eventually, 15 minutes on the dot, cleaning them up.
So the issue really appears to be http holding the ports open.
Obviously, preventing the client behavior resolves it. I’m assuming Asterisk isn’t supposed to hold unauthenticated sessions open like this; otherwise it’s a DoS opportunity.
However, with end users roaming from network to network, this is bound to happen. Somebody probably has some crazy double NAT setup where they aren’t getting the responses and the client just rapid-fires additional attempts.
The HTTP server doesn’t hold the connection at all once the websocket is established. It is passed to res_pjsip_transport_websocket which becomes the owner and waits for any data to come in on it[1]. I would expect that if we were to send a packet and the connection is closed, then that thread should wake up with a failure and it would close. I could be wrong though, TCP is not my specialty.
Okay, so in summary: the client can’t receive the SIP messages over the web socket, in this case apparently because of a double NAT scenario on their end where the 200 OK never reaches the device that initiated the connection. The client then retries in rapid-fire succession and runs into http.conf’s sessionlimit, which denies further connections until the dead sockets are cleaned up some minutes later.
I’m not sure where to go from here as far as Asterisk is concerned. Should I open a bug report?
In this case, we control the client, so we can just adjust its settings to prevent the rapid-fire retries that inevitably lead to what is essentially a DoS situation.