TLS handshake may hang indefinitely under packet loss — Is pjsip_tls_setting.timeout unused in Asterisk Certified 20.7?

Hi everyone,

I’m investigating an issue where TLS handshake operations appear to hang indefinitely when running Asterisk Certified 20.7 (bundled pjproject) in a high packet-loss network environment.

Observed behavior

When SIP transport is set to TLS:

  • The TCP connection is established normally.

  • Packet loss causes some TLS handshake records to drop.

  • TCP does not fail (no RST, no fatal error), so the socket stays “alive”.

  • As a result, SSL_do_handshake() keeps returning WANT_READ/WANT_WRITE forever.

  • The TLS transport in pjproject never times out, so the connection remains stuck indefinitely.

This eventually results in what looks like a transport-level lockup, especially when multiple handshake attempts accumulate.

Code-level investigation

While reviewing the Asterisk Certified 20.7 sources, I noticed the following. My understanding (please confirm if this is correct):

  • pjsip_tls_setting includes a field:

    pj_time_val timeout

    (documented as: TLS negotiation timeout. If set to zero, no timeout is applied.)

  • In Asterisk’s res_pjsip transport initialization logic, this timeout value is not set explicitly.

  • Therefore, the default created by pjsip_tls_setting_default() remains:

timeout.sec = 0;
timeout.msec = 0;

Meaning:

TLS handshake has no timeout in Asterisk Certified 20.7.

Question 1 — Is this understanding correct?

Does Asterisk intentionally leave the TLS handshake timeout unset?

Proposed fix direction

To avoid “forever pending” TLS negotiations under lossy networks, I am considering a patch such as:

tls_setting.timeout.sec = N;     // e.g., 5 or 10 seconds
tls_setting.timeout.msec = 0;

Inserted in res_pjsip before calling pjsip_tls_transport_start2().

Question 2 — Would defining a handshake timeout be acceptable in Asterisk’s design?

Are there any known reasons not to set pjsip_tls_setting.timeout?

Question 3 — What would be a reasonable default timeout value?

For example:

  • 5 seconds (common handshake expectation)

  • 10 seconds (allows some retransmissions)

  • Any recommendations from maintainers or others familiar with pjproject TLS behavior?

Goal

I would like to determine whether:

  • the behavior is by design,

  • the appropriate fix belongs in Asterisk, pjproject, or both,

  • and what timeout value would be suitable for production environments.

If helpful, I can provide:

  • packet captures,

  • thread backtraces (gdb),

  • or core show locks output.

Any guidance would be greatly appreciated.

Thank you!

Be aware that any changes that may occur as a result of this will not be merged to certified unless you are a support agreement customer.

No.

This should be configurable alongside the other transport options.

If it is configurable, a default of 5 seconds seems fine.
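As a rough sketch of what a configurable option with a default of 5 seconds could look like, the helper below turns an option string into sec/msec values. Everything here is an assumption for illustration: `struct time_val_sketch` is a local mirror of the sec/msec pair so the code compiles without pjproject, and `parse_handshake_timeout` and the default macro are hypothetical names, not existing Asterisk code.

```c
#include <errno.h>
#include <stdlib.h>

/* Local stand-in for the sec/msec pair used by pjsip_tls_setting.timeout. */
struct time_val_sketch {
    long sec;
    long msec;
};

/* Hypothetical default, following the 5 seconds suggested in this thread. */
#define DEFAULT_TLS_HANDSHAKE_TIMEOUT_SEC 5

/*
 * Parse an option value given in whole seconds.
 *   NULL or ""  -> default timeout
 *   "0"         -> no timeout (the current behaviour)
 * Returns 0 on success, -1 if the value is not a non-negative integer;
 * on failure *out holds the default.
 */
static int parse_handshake_timeout(const char *value,
                                   struct time_val_sketch *out)
{
    out->sec = DEFAULT_TLS_HANDSHAKE_TIMEOUT_SEC;
    out->msec = 0;

    if (value == NULL || *value == '\0')
        return 0;

    char *end = NULL;
    errno = 0;
    long sec = strtol(value, &end, 10);
    if (errno != 0 || end == value || *end != '\0' || sec < 0)
        return -1;

    out->sec = sec;
    return 0;
}
```

The resulting values would then be copied into the TLS settings before the transport is started. Keeping "0" as "no timeout" preserves the existing behaviour for anyone who wants it.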

The TCP transport layer at the sending end will notice that there has been no recent incoming packet with a receive count that covers the packet that was lost, and will retransmit the lost packet. The packet loss will then be resolved and things will continue normally.

Optionally, the receiving side will notice that it has received a packet with a transmit position that implies a missing packet and will send a selective ack for those correctly received, allowing the sender to know there is a problem and retransmit just the missing ones.

Basically, TCP is a reliable protocol and will recover from lost packets below the level at which they affect TLS. I think your problem analysis is wrong.

Thank you for your advice. I will make sure to set the struct pjsip_tls_setting parameters appropriately from Asterisk. I understand that the following process is performed in res/res_pjsip/config_transport.c:

ast_sip_initialize_sorcery_transport() // [Asterisk] Initialize transport settings, create PJSIP structure object
  ↓
transport_apply() // [Asterisk] Apply transport settings to PJSIP structure object
  ↓
pjsip_tls_transport_start() and pjsip_tls_transport_start2() // Settings are passed to PJSIP

You said 5 seconds is appropriate, but do you know of anyone else who explicitly sets the timeout parameter of struct pjsip_tls_setting? What timeout period do they use?

I’m using Miracle Linux (el8), and I’d like to set an appropriate value taking into account the timeout period on the OS’s TCP stack.

Thank you for your advice.
I also believe that packets are normally retransmitted by the OS’s TCP stack, maintaining the sequence.
However, I believe the handshake will fail in the following cases:
Case 1) The network error does not resolve within the number of retransmissions, and the retransmitted packet does not arrive.
Case 2) The packet arrival wait time is long, and retransmissions are not performed for a long time.

I am using Miracle Linux (el8) as my OS.
I am currently checking the timeout setting on the OS’s TCP stack, but since this is a mission-critical product, I am considering whether I can also protect it on the asterisk side.
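For a rough feel of how long the TCP stack itself may keep retrying in Case 1, the sketch below adds up exponentially backed-off retransmission timeouts. The inputs are assumptions, not values measured on this system: 15 retransmissions (the commonly documented Linux tcp_retries2 default), a 200 ms minimum RTO and a 120 s maximum RTO. `tcp_give_up_seconds` is a hypothetical helper name.

```c
/*
 * Sum the waits for `retries` retransmissions plus the final wait before
 * the connection is abandoned. The RTO starts at rto_min, doubles after
 * every timeout, and is capped at rto_max (all values in seconds).
 */
static double tcp_give_up_seconds(int retries, double rto_min, double rto_max)
{
    double total = 0.0;
    double rto = rto_min;

    for (int i = 0; i <= retries; i++) {
        total += rto;
        rto *= 2.0;
        if (rto > rto_max)
            rto = rto_max;
    }
    return total;
}
```

Under those assumptions, `tcp_give_up_seconds(15, 0.2, 120.0)` comes to about 924.6 seconds, roughly 15 minutes, which is far longer than a handshake timeout of 5 to 10 seconds. Note also that this timer only runs on the side that has unacknowledged data in flight. A side that is simply waiting to read gets no such limit from retransmission alone, which matches Case 2.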

I have no further information or knowledge.

Did you make this change, hijiri-yoshida, and did it work as you expected?

I have run into time out issues like this before. Not with TLS but with different phone testing. I tend to “collect” different VoIP desk phones and test them with Asterisk. I have several testing areas I use with many different manufacturer’s and models of VoIP desk phones. Many of these are obsolete models and/or obsolete firmware.

The problem of misbehavior with VoIP in a high packet-loss network is common, in my experience. But the reality is that you often cannot fix the problem by mucking around with settings in Asterisk. You also have to make adjustments in the endpoints, and desk telephones are notorious for being hard to adjust.

For example, to make a fix in one of my Cisco Multiprotocol firmware phones, I would need to have a service contract, then file a bug with Cisco TAC, and it would then languish for probably close to a year before Cisco’s firmware developer even looked at it. And Cisco explicitly supports Asterisk with MPP firmware on their phones. However, their “support” is along the lines of “the bigger you are the more we pay attention to you.” In reality, I’d have to have several thousand 3PCC/MPP phones under service contract, representing probably $25-$50k a year in revenue for Cisco for maintenance agreements, for them to pay any attention at all to a timeout bug.

This is why my efforts are mostly limited to documenting such misbehavior and posting it here and over on the FreePBX forums. Because the reality is that some phones DON’T have a problem.

For greenfield rollouts, such as if you have a known hostile network environment, it is ALWAYS better to just test and test and test until you develop a working setup with some manufacturer’s phones. The most common hostile network environments are high packet loss, transit through a VPN or other tunnel that reduces MTU, and transit through a network address translator. If you have ANY of those three, you are far better off developing a tested list of known working endpoints and bone-stock Asterisk than locking on to ONE model of endpoint that doesn’t work quite right and then trying to beat Asterisk into submission via settings changes in Asterisk.

Many endpoints do, in fact, offer MANY different settings knobs that can be tweaked in the SIP settings on the phone, so it is not impossible to get things to work in a hostile environment. For example there’s a few SIP settings in Polycom firmware:

And there’s more in Cisco 3PCC:

So tweaking these as well as tweaking settings in Asterisk might yield a reliable connection in a hostile environment such as a high packet-loss network.

But never forget that you are dealing with possible bugs in the 2 network stacks here, the Debian or other Linux you are running Asterisk on, and the embedded Linux or Java or whatever is being used on the endpoint itself. Often these network bugs remain latent and un-triggered on friendly network environments but rear their heads on hostile ones. So even if you do identify an issue it may be that even tweaking Asterisk AND the endpoint SIP settings, won’t result in something stable.

The Achilles’ heel of the VoIP protocols is that they were all designed back in the Dark Ages, when network designers could make assumptions like everyone has perfect networks and translation wasn’t a thing. What we have today in implementations is that perfect design, modified due to the realities of actual networks….