I’m currently using Keepalived in a two-node active-passive Asterisk cluster.
It works like this:
- when the passive node detects that the active node no longer broadcasts VRRP messages, it promotes itself to be the new active node.
When promoting from passive to active, a node grabs the floating IPs and starts the Asterisk daemon.
I’m thinking about adding the capability for the active node to detect that “Asterisk has failed” and give up the active role, leaving the passive node to promote itself to the active state.
Which metrics would you include in an Asterisk health index with which an Asterisk instance could conclude “I’ve hit a serious error I can’t recover from, and the other Asterisk instance, configured the same way, has a reasonable chance of not hitting the same error”?
To avoid flapping between nodes, I think I would first include the last time the node gave up the active role: if that time is close to the current time, then health is considered nominal.
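To make the anti-flapping idea concrete, here is a minimal Python sketch of what I have in mind. The state file path, the 300-second window and the function names are all assumptions of mine, not anything Keepalived or Asterisk provides.

```python
from pathlib import Path
from typing import Optional

# Hypothetical values, to be adapted to the cluster.
STATE_FILE = Path("/var/run/asterisk-ha/last_giveup")  # assumed location
HOLD_DOWN_SECONDS = 300.0                              # assumed window


def read_last_giveup(path: Path = STATE_FILE) -> Optional[float]:
    """Return the epoch time at which this node last gave up the active role.

    Returns None if the file is missing or unreadable, which is treated
    as "this node never gave up the role".
    """
    try:
        return float(path.read_text().strip())
    except (OSError, ValueError):
        return None


def in_hold_down(last_giveup: Optional[float], now: float,
                 window: float = HOLD_DOWN_SECONDS) -> bool:
    """True if the node gave up the active role less than `window` seconds ago."""
    if last_giveup is None:
        return False
    elapsed = now - last_giveup
    # A negative value means the clock went backwards; ignore the record.
    return 0 <= elapsed < window


def effective_health(raw_healthy: bool, last_giveup: Optional[float],
                     now: float) -> bool:
    """Report nominal health during the hold-down window, whatever the checks say."""
    return raw_healthy or in_hold_down(last_giveup, now)
```

A node would write the current time to the state file just before giving up the role, then call `effective_health(checks_passed, read_last_giveup(), time.time())` on each evaluation.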
Other metrics that come to mind are:
- local resource availability (memory, CPU, disk, file handles, …), as one node may be impacted while the other is not,
- connectivity with some servers (NTP, log, provisioning, …) but not all (DNS? …),
- Asterisk version,
- absence of some Asterisk modules.
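As a rough sketch of how such checks could be combined, I’m picturing a Python script run by a Keepalived `vrrp_script` block, which (as far as I understand) treats exit status 0 as success and anything else as failure. The thresholds, check names and helper functions below are hypothetical, and the Asterisk-specific checks (version, modules) are left out.

```python
import os
import shutil
import socket
from typing import Callable, Dict


def disk_ok(path: str = "/", min_free_ratio: float = 0.10) -> bool:
    """True if at least `min_free_ratio` of the filesystem at `path` is free."""
    usage = shutil.disk_usage(path)
    return usage.total > 0 and usage.free / usage.total >= min_free_ratio


def load_ok(max_load_per_cpu: float = 2.0) -> bool:
    """True if the 1-minute load average per CPU is under the threshold (Unix only)."""
    load1, _, _ = os.getloadavg()
    return load1 / (os.cpu_count() or 1) <= max_load_per_cpu


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds (not suitable for UDP services such as NTP)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def evaluate(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every check; a check that raises is counted as failed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def exit_code(results: Dict[str, bool]) -> int:
    """0 if every check passed, 1 otherwise."""
    return 0 if all(results.values()) else 1
```

The script would end with something like `sys.exit(exit_code(evaluate({"disk": disk_ok, "load": load_ok})))`. Requiring every check to pass is the simplest policy; a weighted index would need per-check weights that I don’t know how to choose yet.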
What else would you include?