Basically, you always get an echo, but without VoIP the echo comes back so quickly that people don’t notice it. With VoIP, the added delay means people do notice.
A SIP phone will do its own echo cancellation, and may well remember the settings between one call and another. This is to cancel the echo originating from that phone and its environment. For PSTN you need to do it yourself, and each call will have different echo characteristics, because you are calling different people.
It takes time for the echo canceller to measure those characteristics, and, as it has to do so using live speech, the echo will be noticeable until the calculation completes. In theory one could speed this up by transmitting a special test signal when the line connects. This is what modems do, but such a signal would be rather annoying to a remote human party.
I don’t know what is considered the state of the art for echo canceller convergence, but it is certainly not zero.
What an echo canceller has to do is predict the echo that will be produced by the outbound signal and subtract it from the inbound signal. It needs to measure the echo to do this, with the added problem that it is receiving not just the echo but also incoming speech.
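To make the predict-and-subtract idea concrete, here is a minimal sketch using a normalised LMS (NLMS) adaptive filter. NLMS is a common textbook approach, and I am not claiming it is what any particular product uses. The echo path, tap count and step size are made-up values for illustration, and the simulation deliberately leaves out incoming speech from the far party, so it shows only the easy case.

```python
import random

def nlms_echo_cancel(outbound, inbound, taps=8, mu=0.5, eps=1e-6):
    """Subtract a running prediction of the echo of `outbound` from `inbound`.

    Returns the cleaned signal and the final echo-path estimate.
    taps, mu and eps are illustrative assumptions, not tuned values.
    """
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # recent outbound samples, newest first
    cleaned = []
    for x, d in zip(outbound, inbound):
        history = [x] + history[:-1]
        predicted = sum(w * h for w, h in zip(weights, history))
        error = d - predicted            # inbound minus predicted echo
        cleaned.append(error)
        # Adapt the estimate, normalising by the recent outbound power.
        power = sum(h * h for h in history) + eps
        step = mu * error / power
        weights = [w + step * h for w, h in zip(weights, history)]
    return cleaned, weights

def simulate(n=4000, seed=1):
    """Build a noise-like outbound signal and its echo (hypothetical echo path)."""
    rng = random.Random(seed)
    echo_path = [0.0, 0.0, 0.6, -0.3, 0.1]   # delayed, attenuated copies
    outbound = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    inbound = [
        sum(c * outbound[i - k] for k, c in enumerate(echo_path) if i - k >= 0)
        for i in range(n)
    ]
    return outbound, inbound, echo_path

if __name__ == "__main__":
    outbound, inbound, echo_path = simulate()
    cleaned, weights = nlms_echo_cancel(outbound, inbound)
    early = sum(e * e for e in cleaned[:50]) / 50
    late = sum(e * e for e in cleaned[-50:]) / 50
    print("residual echo power, first 50 samples:", early)
    print("residual echo power, last 50 samples: ", late)
```

The residual is largest at the start, while the filter is still learning the echo path, which is the convergence period during which the echo is audible. Once the far party talks at the same time, the error term contains their speech as well as the leftover echo, so a real canceller has to slow or pause adaptation during those periods.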