Hi everyone,
I’m investigating an issue where TLS handshake operations appear to hang indefinitely when running Asterisk Certified 20.7 (bundled pjproject) in a high packet-loss network environment.
Observed behavior
When SIP transport is set to TLS:
- The TCP connection is established normally.
- Packet loss causes some TLS handshake records to be dropped.
- TCP does not fail (no RST, no fatal error), so the socket stays “alive”.
- As a result, SSL_do_handshake() keeps failing with SSL_ERROR_WANT_READ/SSL_ERROR_WANT_WRITE (as reported by SSL_get_error()) forever.
- The TLS transport in pjproject never times out, so the connection remains stuck indefinitely.
This eventually results in what looks like a transport-level lockup, especially when multiple handshake attempts accumulate.
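To make the failure mode concrete, here is a minimal simulation of a handshake retry loop with and without a deadline. It is only a sketch: `fake_peer`, `fake_handshake_step()` and `run_handshake()` are hypothetical stand-ins written for this post, not OpenSSL or pjproject APIs, and the clock is a simulated millisecond counter rather than real time.

```c
#include <assert.h>

/* Hypothetical handshake outcomes; these are not OpenSSL's constants. */
typedef enum { HS_DONE, HS_WANT_IO, HS_TIMED_OUT } hs_result;

/* Hypothetical peer model: a negative value means the missing handshake
 * records never arrive, so every step keeps asking for more I/O. */
typedef struct {
    int steps_until_done;
} fake_peer;

static hs_result fake_handshake_step(fake_peer *p)
{
    if (p->steps_until_done < 0)
        return HS_WANT_IO;          /* records lost, nothing to process */
    if (p->steps_until_done == 0)
        return HS_DONE;
    p->steps_until_done--;
    return HS_WANT_IO;
}

/* Drive the handshake one step per simulated millisecond.
 * timeout_ms == 0 means "no deadline", matching the documented meaning of
 * a zero timeout. max_ticks only bounds the simulation so it terminates. */
static hs_result run_handshake(fake_peer *p, long timeout_ms,
                               long max_ticks, long *elapsed_ms)
{
    assert(timeout_ms >= 0 && max_ticks > 0);
    for (long now = 0; now < max_ticks; now++) {
        if (fake_handshake_step(p) == HS_DONE) {
            *elapsed_ms = now;
            return HS_DONE;
        }
        if (timeout_ms > 0 && now + 1 >= timeout_ms) {
            *elapsed_ms = now + 1;
            return HS_TIMED_OUT;    /* caller can now tear the socket down */
        }
    }
    *elapsed_ms = max_ticks;
    return HS_WANT_IO;              /* still pending when the simulation ends */
}
```

With a peer whose records never arrive, a zero timeout leaves the loop pending for as long as it is allowed to run, while a 5000 ms timeout ends it at 5000 ms and gives the caller a point at which to release the connection.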
Code-level investigation
While reviewing the Asterisk Certified 20.7 sources, I arrived at the following understanding (please confirm whether it is correct):
- pjsip_tls_setting includes a field:
    pj_time_val timeout
  (documented as: TLS negotiation timeout. If set to zero, no timeout is applied.)
- In Asterisk’s res_pjsip transport initialization logic, Asterisk does not set this timeout value explicitly.
- Therefore, the default created by pjsip_tls_setting_default() remains:
    timeout.sec = 0;
    timeout.msec = 0;
Meaning: the TLS handshake has no timeout in Asterisk Certified 20.7.
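The chain of reasoning above can be modelled in a few lines. This is a sketch under stated assumptions: `time_val`, `tls_setting` and the helper functions are hypothetical stand-ins that only mirror the sec/msec layout and the "zero means no timeout" rule quoted above; they are not the pjproject types, and the zeroing default is my assumption about how the real initializer behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in mirroring the sec/msec layout; not the pjproject type. */
typedef struct {
    long sec;
    long msec;
} time_val;

/* Hypothetical, trimmed-down settings struct with only the field at issue. */
typedef struct {
    time_val timeout;
} tls_setting;

/* Assumption: the default initializer zeroes the whole struct. */
static void tls_setting_default(tls_setting *s)
{
    memset(s, 0, sizeof(*s));
}

/* A timeout of exactly zero is treated as "no timeout applied". */
static bool handshake_timeout_enabled(const tls_setting *s)
{
    assert(s != NULL);
    return s->timeout.sec != 0 || s->timeout.msec != 0;
}

/* Convert the configured timeout to milliseconds for a deadline check. */
static long handshake_timeout_ms(const tls_setting *s)
{
    assert(s != NULL);
    return s->timeout.sec * 1000 + s->timeout.msec;
}
```

If nothing touches the field after the default initializer runs, `handshake_timeout_enabled()` stays false; assigning a non-zero `sec` is the only thing that turns the deadline on.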
Question 1 — Is this understanding correct?
Does Asterisk intentionally leave the TLS handshake timeout unset?
Proposed fix direction
To avoid “forever pending” TLS negotiations on lossy networks, I am considering a patch such as:
    tls_setting.timeout.sec = N; // e.g., 5 or 10 seconds
    tls_setting.timeout.msec = 0;
This would be inserted in res_pjsip before the call to pjsip_tls_transport_start2().
Question 2 — Would defining a handshake timeout be acceptable in Asterisk’s design?
Are there any known reasons not to set pjsip_tls_setting.timeout?
Question 3 — What would be a reasonable default timeout value?
For example:
- 5 seconds (common handshake expectation)
- 10 seconds (allows some retransmissions)
- Any recommendations from maintainers or others familiar with pjproject TLS behavior?
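To put rough numbers on "allows some retransmissions", here is a small calculation of how many TCP retransmissions of a single lost segment fit inside a given handshake budget. It assumes a retransmission timer that starts at a fixed value and doubles after each retransmission; the 1000 ms starting value used in the note below is an assumption for illustration, since the real timer depends on the measured RTT and the operating system. `retransmits_within()` is a hypothetical helper, not part of any library.

```c
#include <assert.h>

/* Count how many retransmissions of one lost segment fit in budget_ms,
 * assuming the retransmission timer starts at initial_rto_ms and doubles
 * after every retransmission (exponential backoff). */
static int retransmits_within(long budget_ms, long initial_rto_ms)
{
    assert(budget_ms >= 0 && initial_rto_ms > 0);
    int count = 0;
    long rto = initial_rto_ms;
    long fire_at = initial_rto_ms;  /* time of the next retransmission */
    while (fire_at <= budget_ms) {
        count++;
        rto *= 2;                   /* back off before the next attempt */
        fire_at += rto;
    }
    return count;
}
```

Under that assumption, a 1000 ms starting timer retransmits at 1 s, 3 s and 7 s, so a 5-second budget covers two retransmissions and a 10-second budget covers three. A shorter starting timer fits more attempts into the same budget.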
Goal
I would like to determine:
- whether the behavior is by design,
- whether the appropriate fix belongs in Asterisk, pjproject, or both,
- and what timeout value would be suitable for production environments.
If helpful, I can provide:
- packet captures,
- thread backtraces (gdb),
- or core show locks output.
Any guidance would be greatly appreciated.
Thank you!

