Trunk Down - Making A* lose all extension registrations

Hi guys, we are having the following issue with our A* installation.
Basically, we have nearly 20 extensions on the A* and 10 to 12 SIP trunks. The trunks come from several accounts we have with overseas and local operators, so all our clients can call in by dialing a local number.

Now, the problem we are facing is that when a trunk goes down for some reason (an operator’s issue), the whole system stops accepting registrations from the extensions, and we lose the PBX completely.

The idea is that if a single trunk goes down, the PBX keeps working as it is, but with that trunk somehow marked unavailable until it is up again.
It looks like until that trunk is up again, the other REGISTERs are not taken into account…

You may suggest setting “registerattempts” in the general config to a non-zero value. But that makes the trunk unavailable until the next reload, right?

Is there a way of keeping a trunk down for as long as the operator takes to resolve their issue while keeping the rest of the PBX “alive”?

Thanks in advance

The behaviour you are describing is not normal for a valid Asterisk installation. The availability of one peer (SIP extension or SIP trunk) should not have any effect on the availability of other peers. I am 100% sure of this.

Do you have the Asterisk server on a static public IP or is the public IP changing?

Are you running Asterisk behind NAT? Are the IP phones behind NAT?

It sounds more like a temporary DNS failure than a trunk failure.

Ok, let me give you some more details.

  1. The A* is indeed configured with a public IP.
  2. The trunks are all on public IPs.
  3. The phones are behind NAT (that is not an issue, I double-checked the NAT table and timers).
  4. We faced the DNS issue in the past. It’s not that this time: the packets are indeed sent (checked with tcpdump).

I have some extra info: it looks like, for some reason, the SIP messages are queued (internally) and not presented to the upper layer until it’s too late.
What we did was run a tcpdump to see the messages arriving at and leaving the NIC while simultaneously capturing a full log.

What we noticed is that when the A* is in this state, it is not getting the responses (at the application layer; they do arrive at the server as per the tcpdump trace), and so the incoming REGISTERs are queued as well (or at least that is what I think at this time).
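One way to test the queuing theory would be to look at the kernel receive queue of the SIP socket while the problem is happening. The following is only a rough sketch, assuming a Linux server, SIP over UDP on port 5060, and the usual /proc/net/udp column layout (older kernels may lack the trailing drops column). The names parse_proc_net_udp, read_live_stats and UdpSocketStats are made up for this example.

```python
from typing import List, NamedTuple


class UdpSocketStats(NamedTuple):
    local_port: int
    rx_queue_bytes: int  # bytes received by the kernel but not yet read by the application
    drops: int           # datagrams dropped by the kernel; -1 if the column is missing


def parse_proc_net_udp(text: str, port: int = 5060) -> List[UdpSocketStats]:
    """Return receive-queue stats for every UDP socket bound to `port`."""
    stats = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue
        # fields[1] is local_address as HEXIP:HEXPORT
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        # fields[4] is tx_queue:rx_queue, both hexadecimal byte counts
        rx_queue = int(fields[4].split(":")[1], 16)
        # the drops column (decimal) is the 13th field when present
        drops = int(fields[12]) if len(fields) > 12 else -1
        stats.append(UdpSocketStats(local_port, rx_queue, drops))
    return stats


def read_live_stats(path: str = "/proc/net/udp", port: int = 5060) -> List[UdpSocketStats]:
    """Read the live table from the running kernel (Linux only)."""
    with open(path) as fh:
        return parse_proc_net_udp(fh.read(), port)
```

Calling read_live_stats() every second or so during an incident would show whether rx_queue_bytes keeps growing. If it does, the packets are sitting in the socket buffer waiting for the application to read them, which would fit what the tcpdump trace and the full log show side by side.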

The full log ( verbose at 5 and debug at 5 ) shows this:

[Nov 5 12:24:11] NOTICE[3599] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #5)
[Nov 5 12:24:11] DEBUG[3599] chan_sip.c: Stopping retransmission on '2b93ab8a6cb1690f3c237c7e4b66f63a@' of Request 145: Match Found
[Nov 5 12:24:11] DEBUG[3599] chan_sip.c: Allocating new SIP dialog for 2b93ab8a6cb1690f3c237c7e4b66f63a@ - REGISTER (No RTP)
[Nov 5 12:24:11] DEBUG[3599] chan_sip.c: Scheduled a registration timeout for id #20240

Right after that message is sent, the trace shows the SIP OK, but the application does not see it.

After a while you see a bunch of these messages:

[Nov 5 12:24:21] DEBUG[3599] chan_sip.c: = No match Their Call ID: 5b9d273717495d776c2a06a12fda6f82@ Their Tag Our tag: as005eb670
[Nov 5 12:24:21] DEBUG[3599] chan_sip.c: = No match Their Call ID: 61a4be170ef9059e096195e750c8525d@ Their Tag Our tag: as7baee248
[Nov 5 12:24:21] DEBUG[3599] chan_sip.c: = No match Their Call ID: 690b971e2c0f7266562479b54c46c88a@ Their Tag Our tag: as16d4bea3
[Nov 5 12:24:21] DEBUG[3599] chan_sip.c: = No match Their Call ID: 018b475f04c4ee323ea2462a558d57f6@ Their Tag Our tag: as23dd9ecb
[Nov 5 12:24:21] DEBUG[3599] chan_sip.c: = No match Their Call ID: 48701f4b7781f87778136046374d59fd@ Their Tag Our tag: as795d2507
[Nov 5 12:24:21] DEBUG[3599] chan_sip.c: = No match Their Call ID: 211cf8dc25ce2650692199ef02026584@ Their Tag Our tag: as5b277f68

And when I say a bunch I mean like 100 or so (I suspect those are all the “missing” packets), and they appear right after the A* gives up (we have set registerattempts=5).
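To put numbers on that burst instead of eyeballing it, the full log can be summarized per second. This is just a sketch, assuming the log lines look exactly like the excerpts above (timestamp in brackets, level, thread ID in brackets, source file, message). The names summarize and classify are invented for this example and are not part of Asterisk.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches lines such as:
# [Nov 5 12:24:21] DEBUG[3599] chan_sip.c: = No match Their Call ID: ...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z]+)\[(?P<tid>\d+)\]\s+"
    r"(?P<src>[^:\s]+):\s*(?P<msg>.*)$"
)


def classify(msg: str) -> str:
    """Bucket a chan_sip message into the kinds discussed in this thread."""
    if "No match" in msg:
        return "no_match"
    if "Registration" in msg and "timed out" in msg:
        return "reg_timeout"
    return "other"


def summarize(lines: Iterable[str]) -> Counter:
    """Count messages per (timestamp, thread ID, kind); skip unparseable lines."""
    counts: Counter = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        key: Tuple[str, str, str] = (m.group("ts"), m.group("tid"), classify(m.group("msg")))
        counts[key] += 1
    return counts
```

Feeding it the full log, for example summarize(open(path)) with path pointing at wherever the full log is written, gives one count per second, thread and message kind. A spike of no_match entries in the seconds right after the last reg_timeout would back up the suspicion that the delayed packets are all processed at once after the A* gives up. In the excerpts above every line carries the same thread ID, 3599.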

All of that makes me believe that this is not a trunk going “down”: the A* “thinks” the trunks are down because the responses are being missed by the application.

Any thoughts? Some queuing issue?