IAX trunk and Linux NAT: lost registration after disconnect

Or how we can easily lose the IAX2 registration to a peer after a WAN disconnection:

Hi. I’m using a PPPoE DSL link and a small router running OpenWrt White Russian 0.9, a Linux distribution for embedded routers.

Behind this router, I have an Asterisk PC running CentOS.

I have a registration nightmare with Asterisk, and I think I’ve finally tracked it down to a problem with iptables and NAT.

So here is the problem:

The Asterisk server registers with another Asterisk server located on the Internet, through the ppp0 WAN link. The link uses PPPoE with a static IP address. This detail is important.

After a PPPoE disconnect/reconnect, I lose UDP connectivity from the internal LAN Asterisk (UDP 4569) to the Internet Asterisk peer (UDP 4569 too).

To track down the problem, I made two tcpdump captures on the ppp0 interface: one before the PPPoE disconnect, one after.

And I discovered something that could be considered a strange NAT bug:

Before the PPPoE disconnect, all is OK and IAX registration works well; we see requests from the local Asterisk machine and answers from the Internet Asterisk peer:

tcpdump -i ppp0 host asterisk.external.peer # (addresses have been deliberately changed for privacy)

15:18:56.107906 IP my.external.IP.4569 > asterisk.external.peer.4569: UDP, length 12
15:18:56.148521 IP asterisk.external.peer.4569 > my.external.IP.4569: UDP, length 12
15:18:56.149290 IP my.external.IP.4569 > asterisk.external.peer.4569: UDP, length 12
15:18:57.024791 IP my.external.IP.4569 > asterisk.external.peer.4569: UDP, length 28
15:18:57.065587 IP asterisk.external.peer.4569 > my.external.IP.4569: UDP, length 39
15:18:57.066404 IP my.external.IP.4569 > asterisk.external.peer.4569: UDP, length 62
15:18:57.110189 IP asterisk.external.peer.4569 > my.external.IP.4569: UDP, length 56

But after the PPPoE reconnect:

tcpdump -i ppp0 host asterisk.external.peer

15:22:27.178244 IP 192.168.15.100.4569 > asterisk.external.peer.4569: UDP, length 12
15:22:29.178923 IP 192.168.15.100.4569 > asterisk.external.peer.4569: UDP, length 12
15:22:29.184854 IP 192.168.15.100.4569 > asterisk.external.peer.4569: UDP, length 28
15:22:30.175913 IP 192.168.15.100.4569 > asterisk.external.peer.4569: UDP, length 12
15:22:31.178632 IP 192.168.15.100.4569 > asterisk.external.peer.4569: UDP, length 12
15:22:37.178865 IP 192.168.15.100.4569 > asterisk.external.peer.4569: UDP, length 12
15:22:37.184855 IP 192.168.15.100.4569 > asterisk.external.peer.4569: UDP, length 12
15:22:39.179526 IP 192.168.15.100.4569 > asterisk.external.peer.4569: UDP, length 12
15:22:39.180041 IP 192.168.15.100.4569 > asterisk.external.peer.4569: UDP, length 12
15:22:39.184878 IP 192.168.15.100.4569 > asterisk.external.peer.4569: UDP, length 12
15:22:39.185024 IP 192.168.15.100.4569 > asterisk.external.peer.4569: UDP, length 28

We can see that the LAN address of the Asterisk server is sent over the WAN!

So we lose communication with the Internet Asterisk peer. The reason seems evident: the peer cannot reply to a non-routable LAN address.
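
For anyone who wants to check their own capture for the same symptom, here is a minimal Python sketch. It assumes tcpdump's default one-line text output as pasted above; the function name `leaked_private_sources` is only illustrative. It flags packets whose source is a private (RFC 1918) address, which should never appear on ppp0 if masquerading is working.

```python
import ipaddress
import re

# Matches the "IP src.port > " part of a tcpdump line.
# Only numeric IPv4 sources are captured; anonymised names are skipped.
SRC_RE = re.compile(r"\bIP (\d{1,3}(?:\.\d{1,3}){3})\.(\d+) > ")


def leaked_private_sources(lines):
    """Return (address, port) pairs for packets sent from a private address."""
    leaks = []
    for line in lines:
        match = SRC_RE.search(line)
        if match is None:
            continue
        try:
            address = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # not a valid IPv4 address, ignore the line
        if address.is_private:
            leaks.append((str(address), int(match.group(2))))
    return leaks


capture = [
    "15:18:56.107906 IP my.external.IP.4569 > asterisk.external.peer.4569: UDP, length 12",
    "15:22:27.178244 IP 192.168.15.100.4569 > asterisk.external.peer.4569: UDP, length 12",
]
print(leaked_private_sources(capture))  # [('192.168.15.100', 4569)]
```

The first line is ignored because its source was anonymised to a name; on a real capture taken with `-n`, a healthy link should give an empty list.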

It’s not the first time I have seen this problem, but this time I tracked it down.

This is a very annoying NAT bug, because as long as we keep trying to send data through the IAX2 port opened in the NAT, the abnormal state is held. The LAN address is sent over the Internet instead of the public masqueraded address. So this UDP IAX2 connection is lost for good.

To get things back to normal, we need to stop sending UDP data to the NAT for more than 30 seconds (longer than the NAT UDP session timeout). After this delay, a new session is opened in conntrack and masquerading works again.
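
To make the timing argument concrete, here is a small Python sketch of the behaviour described above. It is a simplified model, not kernel code: it assumes each outgoing packet refreshes the existing session entry, and that the entry is dropped only after more than `timeout` seconds of silence. The function name is made up for the illustration, and the 30-second value is the one observed on this router.

```python
def first_fresh_session(send_times, timeout):
    """Return the time of the first packet that opens a new session, or None.

    Simplified model: every packet refreshes the existing entry; the entry
    is dropped only when the gap since the previous packet exceeds `timeout`.
    """
    if not send_times:
        return None
    last_seen = send_times[0]
    for t in send_times[1:]:
        if t - last_seen > timeout:
            return t  # the stale entry has expired, this packet starts afresh
        last_seen = t
    return None


every_second = list(range(0, 120))  # a retry roughly every second
every_minute = [0, 60, 120]         # a retry every 60 seconds

print(first_fresh_session(every_second, 30))  # None: the stale entry never expires
print(first_fresh_session(every_minute, 30))  # 60: the second attempt gets a new session
```

Under this model, any retry interval longer than the session timeout would let the broken entry expire on its own.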

Unfortunately, when Asterisk loses the registered state with its peer, it tries to send data almost every second, so connectivity never comes back.

It would be nice if we could change those register timings. A 60-second retry time should be OK. But it would be better to have it settable, like we can set the qualify timings.

Currently, IAX2 register timings are only settable on the receiving side. This is not what we need to solve the problem. We need to be able to set the timing on the sender side.

It seems that the problem comes from conntrack. There is no way to flush the connection (no userland tool for conntrack on this router) and no possibility of reloading a module, as conntrack is built into the kernel.

I will add that the problem seems to show up only with a static-IP PPPoE connection. So far I have had no problem with dynamic PPPoE DSL links.

It should be possible to fix the bug inside the Linux kernel, but a large number of routers certainly have this bug and will never be updated, because they are physical devices.

So a simple register timing change in the IAX channel module should solve this problem for good. It is very annoying for a lot of Asterisk users and VoIP providers.

Thanks a lot for your help,

Olivier.

some time ago. I ended up moving Asterisk to my firewall to avoid NAT, since there seemed to be no obvious solution. I guess one hack would be, when you detect the link coming back up, to use the Asterisk Manager Interface to send an IAX2 reload command to your CentOS server?
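
A rough Python sketch of that manager-interface hack, under several assumptions: the manager interface is enabled and listening on its usual TCP port 5038, the account named below exists with permission to run commands, and `iax2 reload` is the right CLI command for your Asterisk version. The host, username and secret are placeholders, and the helper names are only illustrative.

```python
import socket


def ami_message(**fields):
    """Serialise one manager message: 'Key: value' lines, CRLF-terminated,
    followed by an empty line."""
    return "".join(f"{key}: {value}\r\n" for key, value in fields.items()) + "\r\n"


def reload_iax2_messages(username, secret, command="iax2 reload"):
    """Build the login, command and logoff messages, in sending order.

    `command` is an assumption; adjust it to your Asterisk version.
    """
    return [
        ami_message(Action="Login", Username=username, Secret=secret),
        ami_message(Action="Command", Command=command),
        ami_message(Action="Logoff"),
    ]


def send_reload(host, username, secret, port=5038, timeout=5.0):
    """Send the messages and return whatever the server answered (untested sketch)."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        for message in reload_iax2_messages(username, secret):
            conn.sendall(message.encode("ascii"))
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # the server closes the connection after Logoff
                break
            replies.append(chunk)
    return b"".join(replies).decode("ascii", errors="replace")


# Hypothetical call from a link-up hook on the router:
# print(send_reload("192.168.15.100", "linkwatch", "changeme"))
```

Note that a reload on its own does not stop Asterisk from sending for 30 seconds, so it may not be enough to clear the stale NAT entry.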

Those solutions do work in theory.

I’m currently testing a script that detects the lost IAX2 registration.

As a reminder:
The registration is lost because the LAN IP of the Asterisk server is sent to the Internet instead of the external public IP after a static-IP PPPoE disconnect.

So this script restarts Asterisk with a delay of about 40 seconds.

This delay allows conntrack to reset (the default NAT session timeout is generally about 30 seconds on Linux NAT routers).

Then the NAT masquerading process works again.
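
The logic of such a script can be sketched in Python roughly as follows. This is not the actual script, only an outline of the stop, wait, start sequence. How the registration is checked and how Asterisk is stopped and started depends on the installation, so those steps are passed in as callables; all the names here are hypothetical.

```python
import time


def recover_registration(is_registered, stop_asterisk, start_asterisk,
                         quiet_seconds=40, sleep=time.sleep):
    """Restart Asterisk with a quiet period if the registration is lost.

    Returns True if a restart was performed, False if nothing was needed.
    """
    if is_registered():
        return False
    stop_asterisk()        # stop sending IAX2 packets through the NAT
    sleep(quiet_seconds)   # wait longer than the ~30 s NAT UDP timeout
    start_asterisk()       # the first new packet should open a fresh session
    return True


# Dry run with stand-ins, so that nothing is really stopped and no time is spent:
events = []
restarted = recover_registration(
    is_registered=lambda: False,
    stop_asterisk=lambda: events.append("stop"),
    start_asterisk=lambda: events.append("start"),
    sleep=lambda seconds: events.append(f"sleep {seconds}"),
)
print(restarted, events)  # True ['stop', 'sleep 40', 'start']
```

In real use, the three callables would wrap whatever commands your system uses to query the registration state and to stop and start the Asterisk service.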

But! It seems that the Linux kernel is left in a bad state after such an operation. I get a very choppy processor load, with peaks to 100% every 3 seconds on the router!

It seems that all Linux kernels are affected, 2.4 as well as the later 2.6. A Linux developer tested this for me on different machines and confirmed the problem.

So I entirely agree with you: my advice would be not to use NAT if you don’t want problems, unless you know what you are doing and can debug at a low level when problems occur.

Use IPv6 instead!

If you do use NAT, be prepared to have some strange problems one day or another, as Linux kernels seem particularly buggy when it comes to NAT, especially when there is a static external address.

Another piece of advice would be to never trust software, and to be extremely cautious with new versions.

Simpler is always better!

Olivier.