Which metrics should I monitor to estimate the Asterisk health of a cluster node?

Hello,

I’m currently using Keepalived in a two-node active-passive Asterisk cluster.
It works like this:

  • when the passive node detects that the active node no longer broadcasts VRRP messages, it promotes itself to be the new active node.

When promoted from passive to active, a node grabs the floating IPs and starts the Asterisk daemon.

I’m thinking about adding the capability for an active node to detect that Asterisk has failed and give up the active role, leaving the passive node to promote itself to the active state.

Which metrics would you include in an Asterisk health index from which an Asterisk instance could conclude “I’ve hit a serious error I can’t recover from, and the other Asterisk instance, configured the same way, has a reasonable chance of not hitting the same error”?

To avoid flapping between nodes, I think I would first include the last time the node gave up the active role: if that time is close to the current time, then health is considered nominal.
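As a rough illustration of that hold-down idea, here is a minimal Python sketch. The function name, the `last_give_up` timestamp and the 300-second window are all hypothetical; how the timestamp is stored between runs (a state file, for example) is left open.

```python
import time

HOLD_DOWN_SECONDS = 300  # hypothetical value; tune it to your own failover time


def should_give_up_active_role(healthy, last_give_up, now=None,
                               hold_down=HOLD_DOWN_SECONDS):
    """Return True only if the node is unhealthy and outside the hold-down window.

    last_give_up is the epoch time of the most recent give-up, or None if the
    node has never given up the active role.
    """
    if now is None:
        now = time.time()
    if last_give_up is not None and now - last_give_up < hold_down:
        # The role was given up recently: report nominal health to avoid flapping.
        return False
    return not healthy
```

With this shape, the hold-down check always wins over the other metrics, so two nodes that are both unhealthy cannot hand the role back and forth faster than once per window.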

Other metrics that come to mind are:

  • availability of local resources (memory, CPU, disk, file handles, …), as one node may be impacted while the other is not,
  • connectivity with some servers (NTP, log, provisioning, …) but not all (DNS? …),
  • Asterisk version,
  • absence of some Asterisk modules.
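The local-resource part of that list could be sketched as below, using only the Python standard library on a Unix-like node. The function name and both thresholds are assumptions chosen for illustration; memory and file-handle checks are not covered here.

```python
import os
import shutil


def local_resource_problems(path="/", min_free_disk_ratio=0.10,
                            max_load_per_cpu=4.0):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []

    # Disk: compare the free fraction of the filesystem holding `path` to a floor.
    usage = shutil.disk_usage(path)
    if usage.free / usage.total < min_free_disk_ratio:
        problems.append(f"low disk space on {path}")

    # CPU: use the 1-minute load average, normalised by the number of CPUs.
    load_1min = os.getloadavg()[0]
    cpus = os.cpu_count() or 1
    if load_1min / cpus > max_load_per_cpu:
        problems.append("load average too high")

    return problems
```

A wrapper script could exit non-zero when the list is not empty, so that a Keepalived `vrrp_script` check treats the node as failed.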

What else would you include?
Suggestions?

Best regards

I would check for SIP responses, e.g. by sending an OPTIONS request every 10 seconds or so. If it replies, Asterisk is working as expected.
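A probe of this kind might look like the following Python sketch, which sends one minimal SIP OPTIONS request over UDP and waits for any SIP response. The function names, the `healthcheck` user part and the 2-second timeout are assumptions; scheduling it every 10 seconds is left to the caller.

```python
import socket
import uuid


def build_options_request(host, port, local_ip, local_port):
    """Build a minimal SIP OPTIONS request (RFC 3261) as bytes."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # branch must start with this magic cookie
    lines = [
        f"OPTIONS sip:{host}:{port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch={branch};rport",
        "Max-Forwards: 70",
        f"From: <sip:healthcheck@{local_ip}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{host}:{port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_ip}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",  # the two empty items produce the blank line that ends the headers
    ]
    return "\r\n".join(lines).encode("ascii")


def sip_options_alive(host, port=5060, timeout=2.0):
    """Return True if any SIP response arrives before the timeout."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.connect((host, port))  # fixes the peer so getsockname() is usable
            local_ip, local_port = sock.getsockname()
            sock.send(build_options_request(host, port, local_ip, local_port))
            data = sock.recv(4096)
    except OSError:
        # Covers timeouts, refused ports and name-resolution failures.
        return False
    return data.startswith(b"SIP/2.0 ")
```

Any status code counts as "alive" here, since even a 401 or 404 shows the SIP stack is answering; a stricter check could parse the status line.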

Other metrics could be

  • Registered endpoints
  • Registered trunks (If you use registrations on your trunks)

The difficult situation to detect is a deadlock, and processing an OPTIONS request won’t take many locks, so it may well not detect one. It is very unlikely to detect one in anything other than the channel driver handling the request. REGISTER might exercise slightly more, but is still mainly contained within a single channel driver.

You should monitor for deadlocks and check every error message that appears in the log file.

@Chano:
I’m not enthusiastic about looking at SIP responses: those mostly depend on config errors or on errors on the northbound or southbound side, and those probably won’t go away if the active node changes.

@david551, @voip.com.vn
Changing the active node when a deadlock occurs seems very appropriate, as promoting the passive node is probably the best thing to do to restore telephony service.
The real questions are:
how do you detect a deadlock in general, or an Asterisk deadlock specifically?
how do you trigger a deadlock for unit testing?

I think all I was pointing out was that OPTIONS isn’t a good test for deadlocks, as it may indicate the system is up when it is actually fatally crippled. You need something that exercises a significant part of the system, so you certainly need to get up to the dialplan and make outgoing calls, but you also need to exercise each channel driver, if you are using more than one.
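One generic way to turn "it hangs" into a detectable failure is to run whatever probe you choose under a deadline. The Python sketch below assumes the probe is an external command; the function name and the 5-second deadline are hypothetical, and the probe itself (ideally something that places a test call through the dialplan) is not shown.

```python
import subprocess


def probe_within_deadline(command, deadline_seconds=5.0):
    """Run a probe command; a hang, a launch failure or a non-zero exit is a failure."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=deadline_seconds,  # the child is killed if it overruns
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0
```

For example, `probe_within_deadline(["asterisk", "-rx", "core show uptime"])` would only show that the CLI still answers, which is shallow for the reasons given above; the same wrapper can run a deeper probe script once one exists.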

You can only really simulate a deadlock with white box testing, but in any case, the one you are trying to simulate is the one that hasn’t already been found and fixed!

Well, my reason for using a SIP response was mostly based on my experience with Asterisk, where it would sometimes stop processing SIP responses but otherwise appear to be working just fine. It was not meant as the only thing to do, just one of the metrics, and something I personally know can stop working.

Also, wouldn’t restarting Asterisk be a better and faster solution in most cases than starting Asterisk on another server and moving the IP address to it? For me, an Asterisk restart usually takes about 5 seconds.
