Asterisk High Availability

Hello all,

Are there any tips and tricks for Asterisk (18/20) high availability?
Has anyone ever built an Asterisk HA setup?

Any recommendations and tips are welcome.

In front of this cluster, there will be a Kamailio cluster.

Many thanks in advance.

Best Regards
Tom

You’d need to define what high availability means to you. It can mean different things to different people.

You are absolutely right, that is a good point. Sorry for that.

The best case for me is (at least) two active-active instances: the phones are registered to both instances at the same time, calls can flow independently through both Asterisk servers (incoming and outgoing), and, for example, subscriptions from a phone are synchronized between both servers. If one server is out of service, the other one should keep working without problems or long outages. Maximum outage → maybe only the active call, but no more than 5 minutes.

Also possible for me, but not as good: active/hot-standby with a switching time of at most 10-15 minutes.

I hope this helps. If not, please let me know what’s missing.

BR
Tom

This post on Voip-info does a good job of explaining what is and what is not high availability with regard to Asterisk.
This post on StackExchange does a good job of explaining the available options for Asterisk high availability.
Even YouTube has an “Asterisk Clustering in 5 min” video… But the advice you get is only as good as the knowledge of the person posting it… the author of this video (and of others like it) probably doesn’t understand HA in the context of a PBX.

And there’s also lots of bad advice around (HA solutions which don’t work in the real world).

I think you have to be realistic about how significant a disaster you need to recover from, and what is the real cost of an outage to your organization. Sometimes the right decision is just keeping a spare Asterisk box around, and moving the network cable over when things go bad. But if the cost to your organization is actually high enough, then the advice above is a good start.

The best solution I have implemented is:

a remote high-availability database (Amazon, Linode, Vultr)

2 OpenSIPS servers with the cluster module, plus Pacemaker and Corosync for HA

2 or more Asterisk servers to distribute the calls/services

RTPEngine HA to recover the current calls (only 5 seconds of interruption on the calls)

or something like this.

If you want to use only Asterisk, you can’t recover the current calls.

I’d class that as somewhere between cold and lukewarm, not hot.

This is why it’s important to understand your needs. Do you need to resume call recording etc. on the peer after failover? Do you need to transfer all of the agents signed in, callers in queue (in order), etc.? Can you have a single point of failure before the cluster (Kamailio/AB switch/controllable router)? Can you share resources behind the cluster (database, file share, etc.)?

If you want two fully independent nodes that will act as a cluster, then you are heading into commercial HA territory. The web is full of DIY solutions (and hey, in Linux there is a package for everything). But when you carefully consider the weaknesses of the various HA solutions, most drop off the list pretty quickly.

As per the links above, critical call centers (911, fire, hospital, etc.) don’t allow anything shared between the nodes (nothing in front, nothing behind), so that there is no single point of failure. That’s when you have true HA.

A good analogy is Cisco’s HA for routers; check out this link. That is what you want your Asterisk cluster to look like (replace the word “router” with “Asterisk” in the diagrams): two fully independent nodes acting like a cluster, using HSRP to implement an HA router/firewall.

However, since our company didn’t want to spend the money, we have a cheap standby router/firewall which I plug in if the old one dies. HSRP would be nice, just not worth the $8k investment.

I suspect t.zimmermann meant 10-15 seconds (not minutes). :) If he really meant minutes, then restoring a backup onto a spare box and plugging it in becomes solution #1.

How can you recover the streams if Asterisk dies? AFAIK, RTPEngine can only recover the stream if the proxy (Kamailio/OpenSIPS) dies.

You’re right.

You can only recover calls that RTPEngine manages.

Regards

First of all, many thanks for all the answers.
Maybe I should give more background information.

At the moment we have several Asterisk servers virtualized on our Proxmox cluster in our datacenter (true HA, automatic movement of virtual machines, etc.).

We want a VoIP solution which is as failsafe as possible. The servers will again be virtualized on a failsafe Proxmox cluster (3 nodes, redundant network infrastructure, etc.). If we need, for example, a Galera cluster, that wouldn’t be a problem. The “VoIP infrastructure” is in our datacenter; the phones will be at different sites.

At the moment we’ve got the Anynode SBC in front of our Asterisk servers (and we’ve had very good experience with this SBC, it’s our preferred SBC at the moment; Kamailio is not the easiest “tool” :), but if it’s necessary, we’ll find a way to configure it).

If we lose the active call, that’s not a problem. But it shouldn’t be more than this. Phone states and queue log-ins should be “visible” on all Asterisk servers. Also, there shouldn’t be a single point of failure. For updates, for example, we want to “shut down” one Asterisk server without having any “problems” (besides maybe losing an active call).

I’ve attached a picture of my point of view at the moment. I think some service/server is still missing for a functional HA solution.
Maybe someone can “edit” this and complete a possible solution?

I hope this makes my situation clearer. We want to be as independent of any single Asterisk server as possible.

BR and many thanks for your help so far
Tom

That would require multiple independent implementations of Asterisk! That’s something I’ve only really heard of in aircraft critical systems. One of the things high availability vendors don’t tell you is that the software used to implement HA tends to be a source of systematic single-point failures.

It is possible that the Stasis rework has made it easier to synchronise Asterisk instances; I haven’t looked at the innards in enough detail since then, but you will have to do things like two-phase commits across servers in order to synchronise states. You also need a transaction structure, even at detailed levels. That potentially makes the code a lot more complex and therefore fragile.
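For readers who have not met the term, here is a minimal in-memory sketch of a two-phase commit in Python. It is only an illustration of the idea: the `Node` class, the `two_phase_commit` function and the device-state strings are hypothetical, Asterisk exposes no such API, and a real implementation would also need durable logs, timeouts and handling for a coordinator that fails halfway through.

```python
from typing import Dict, List, Optional


class Node:
    """Hypothetical stand-in for one Asterisk instance holding device states."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.state: Dict[str, str] = {}
        self.pending: Optional[Dict[str, str]] = None

    def prepare(self, update: Dict[str, str]) -> bool:
        # Phase 1: stage the update and vote. An unhealthy node votes "no".
        if not self.healthy:
            return False
        self.pending = dict(update)
        return True

    def commit(self) -> None:
        # Phase 2 (success): make the staged update visible.
        if self.pending is not None:
            self.state.update(self.pending)
            self.pending = None

    def abort(self) -> None:
        # Phase 2 (failure): throw the staged update away.
        self.pending = None


def two_phase_commit(nodes: List[Node], update: Dict[str, str]) -> bool:
    """Apply the update on every node, or on none of them."""
    votes = [node.prepare(update) for node in nodes]
    if all(votes):
        for node in nodes:
            node.commit()
        return True
    for node in nodes:
        node.abort()
    return False
```

Even this toy version shows where the complexity comes from: every state change needs a staging step, a vote and a rollback path on every node.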

Asterisk does not have a way of sharing queue state between systems. If you want a fault tolerant queue where any call can land on any box then you need to create your own solution.

Just FYI, physical hardware failure is the least likely cause of VoIP service failure. We looked at installations with VMware, and they showed how to vMotion an Asterisk server while keeping all calls up, etc. However, this only helps if the VM/host disappears (complete shutdown/failure). The virtualization/container system has no visibility into VoIP health. Don’t confuse that with any type of VoIP HA!

I don’t think 3 Asterisk servers buy you anything more than two servers in terms of HA (particularly if they are in the same data center).

If you want to move full system state (queues/agents/device state/calls in progress/etc.) between cluster members, then you either have to start creating many single points of failure in front of and behind the Asterisk server, or go to one of the commercial solutions (see the links above). I have never seen an open source product do what you are asking. Maybe someone else has…

I know nothing about the features of Proxmox, but VMware has a feature that essentially runs a pair of machines in tandem, syncing every instruction between the two machines. If one dies, the other takes over right away, with almost no downtime. It DOES require twice the resources of just running a single instance, but it offers more or less instant failover.

Perhaps Proxmox can do something similar?

Okay, then I’ll have a look at the solutions with two Asterisk servers (should be fine for the moment).

Maybe you know Icinga? Icinga is a monitoring system (based on Nagios) which offers an active master-master setup, with satellites for executing checks, for example. And Icinga is open source :) (okay, yeah, only two servers…)

3 servers could make sense in our use case because this cluster has 3 nodes in 3 different fire compartments. But any kind of HA is definitely better than what we have now.

If anyone else has a good idea, please let me know :)
I’m happy about every idea. For now, I’ll read the posts above.

Best Regards
Tom

I think the StackExchange and Voip-Info links posted above will answer your questions!

I’ve used Icinga and Nagios, and they are very capable systems. To build sensors which monitor enough VoIP/Asterisk/system parameters, you have a fair bit of work ahead of you. You will have to dream up the combination of inputs which would indicate that an Asterisk node is in ‘failed’ health. You will want to hook into the AMI to get some of that data, sniff the SIP channels (look at Snort), send SIP OPTIONS requests to up/downstream hosts, monitor the latency of connections, etc. In other words, you will be building a fairly complex set of logic to measure/monitor a cluster. Then you have to write code to make the nodes talk to one another, negotiate which one takes over when the nodes recover from an outage, determine which node should take over if the cluster reassembles with both nodes active, etc. As you can see, just taking over once the other node dies is probably the easiest (and least likely) scenario you have to handle.
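As a rough illustration of just one of those inputs, the SIP OPTIONS check could be sketched in Python as below. This is not the code from my old project; the function and class names, the 2 second timeout and the "3 missed replies in a row" threshold are all assumptions made for the example, and it treats any SIP reply (even an error code) as a sign that the SIP stack is alive.

```python
import socket
import uuid
from typing import Optional


def build_options(target_host: str, target_port: int,
                  local_host: str, local_port: int) -> bytes:
    """Build a minimal SIP OPTIONS request (RFC 3261) as bytes."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # prefix required by RFC 3261
    lines = [
        f"OPTIONS sip:{target_host}:{target_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}:{target_port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status(response: bytes) -> Optional[int]:
    """Return the SIP status code, or None if this is not a SIP response."""
    first_line = response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    parts = first_line.split(" ", 2)
    if len(parts) >= 2 and parts[0] == "SIP/2.0" and parts[1].isdigit():
        return int(parts[1])
    return None


def probe(host: str, port: int = 5060, timeout: float = 2.0) -> Optional[int]:
    """Send one OPTIONS over UDP; return the status code or None on no reply."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.connect((host, port))
            local_host, local_port = sock.getsockname()
            sock.send(build_options(host, port, local_host, local_port))
            return parse_status(sock.recv(4096))
    except OSError:  # timeout, unreachable host, refused port, ...
        return None


class HealthTracker:
    """Declare a node failed only after `threshold` missed replies in a row."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, status: Optional[int]) -> bool:
        """Record one probe result; return True while the node counts as healthy."""
        self.failures = self.failures + 1 if status is None else 0
        return self.failures < self.threshold
```

A monitoring loop would call `tracker.record(probe("192.0.2.10"))` every few seconds (the address is a placeholder). Keep in mind that this covers only one signal; AMI data, latency and trunk quality would each need their own checks before you can decide that a node has really failed.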

In effect you would be recreating one of the VoIP HA products (see the links). That’s what you’re paying for with the commercial apps. If you’re comfortable with programming all that (and have the time), then I suspect you could build something decent.

As you might have guessed, I started by building my own VoIP HA at my last job. But with every outage we had a new use case and I had to modify my code (Python). After a couple of years it got out of hand and I admitted defeat. There’s a LOT more to this than meets the eye (at least to do it properly). I figure we spent $10k of my time trying to recreate a commercial package that cost half as much. So we finally gave up and bought a package (and no regrets: it paid for itself within 6 months, when our primary ITSP had serious quality issues and the HA software failed over to backup trunks).

Because we had an inbound call center, we estimated we would lose approx. $1500 per minute during an outage. So it was pretty easy to make the case to buy an HA VoIP package. However, if your per-minute outage cost is really low (e.g. a small office with mostly internal calls), then I would suggest skipping HA and just keeping a spare server/VM on hand.
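To put numbers on that argument, the break-even point can be worked out in a couple of lines of Python. The $1500 per minute comes from the estimate above; the $5000 package price is an assumption derived from "half as much" as $10k, and the function name is made up for the example.

```python
def break_even_minutes(package_cost: float, loss_per_minute: float) -> float:
    """Minutes of avoided outage after which the package has paid for itself."""
    return package_cost / loss_per_minute


# 1500 is the estimate from the post; 5000 is an assumption ("half as much" as $10k).
minutes = break_even_minutes(package_cost=5000, loss_per_minute=1500)
print(round(minutes, 1))  # 3.3
```

Run the same calculation with your own outage cost: if the result is measured in days rather than minutes, a spare server is probably the better deal.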

If you really want to build your own, then message me outside the forum and I can send you my old Python code. It works, but having used a commercial product, I would never go back.

In my limited experience, a multi-active solution is very tough to achieve in Asterisk. The problems to solve are:

  1. Queue state replication is a mess. You can replicate some events with Corosync, but you wind up with custom devstates, and if you miss anything you have agents marked as in a call after the call has been transferred, or agents who are on a queue call and continue to get calls. This is one of the biggest issues that has to be coded for.
    a. If you use Asterisk Realtime, the database layer becomes a bottleneck fast.
  2. If you go with multiple servers, you can do near-HA with Pacemaker, i.e. you have a warm standby whose failover you can automate. Some challenges are:
    a. You have to keep your files in sync. The astdb.sqlite3 file does not like to live on NFS, although NFSv4 gets close. If you are missing anything, your config is out of date or wrong and it’s a mess, but this can be done.
    b. Pacemaker is full of unintended surprises, so you have to be prepared to deal with outages caused by your own config. Whatever you use as the STONITH source can be critical: if it fails, the cluster fails. For example, I use the AWS EC2 API to get info, and it’s not that great, but if you make it more tolerant of missing responses it does work.
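What "more tolerant of missing responses" in point 2b can look like is sketched below in Python. This is a generic retry wrapper, not Pacemaker's or AWS's API: `query_with_retries` and its parameters are hypothetical, and the `query` callable stands in for whatever call your fence agent makes to find out the state of the peer.

```python
import time
from typing import Callable, Optional


def query_with_retries(query: Callable[[], Optional[str]],
                       attempts: int = 3,
                       delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call query() up to `attempts` times and return the first real answer.

    A missing answer (None) or an exception counts as "no response" and
    triggers a retry with exponential backoff. Only when every attempt has
    failed does the caller see None.
    """
    for attempt in range(attempts):
        try:
            answer = query()
        except Exception:
            answer = None  # treat API errors like a missing response
        if answer is not None:
            return answer
        if attempt < attempts - 1:
            sleep(delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
    return None
```

The point of the design is that a single dropped API response no longer looks like a dead node to the cluster. How many attempts and how long to wait is a trade-off against failover time, so the defaults here are only placeholders.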

There are other options, but I would not recommend running 2 instances of Asterisk if you have queues. If it’s an outbound call center with direct inbound, I absolutely would. The devstate issues are minor in those conditions. But if queues are involved, then no way would I want to unleash the resulting mess.

Not to mention rebuilding queues (of inbound callers), call recording in progress, moving stats between nodes, etc.

When most people first think about “HA”, they are taking far too simplistic a view. You will even find HA products that sell you a bunch of open source packages as VoIP HA, or HA packages that just copy a config backup from one node to another (seriously… you would be amazed who is selling that!).

Hello, here is the solution I found for my case.

Server A - Master (Operation) IP 192.168.0.240
Server B - Slave (Slave) IP 192.168.0.241

All my SIP peers are configured to use 192.168.0.240. If my master server is DOWN, I change the IP of my slave to 192.168.0.240, start Asterisk, and all SIP peers work fine.
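The manual takeover described above could be scripted roughly as follows. This is a sketch under assumptions, not the poster's actual procedure: the /24 prefix, the `eth0` interface name, the `asterisk` systemd unit name and the use of iputils `ping` and `arping` are all guesses, and the function names are made up. It defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List

SERVICE_IP = "192.168.0.240"  # master address from the post
PREFIX_LEN = 24               # assumption: /24 network
INTERFACE = "eth0"            # assumption: interface name on the slave


def takeover_commands(ip: str = SERVICE_IP, prefix: int = PREFIX_LEN,
                      iface: str = INTERFACE) -> List[List[str]]:
    """Commands the slave runs to take over the master's IP and start Asterisk."""
    return [
        ["ip", "addr", "add", f"{ip}/{prefix}", "dev", iface],
        # Unsolicited ARP so switches and phones learn the new MAC (iputils arping).
        ["arping", "-U", "-c", "3", "-I", iface, ip],
        ["systemctl", "start", "asterisk"],  # assumption: systemd unit name
    ]


def master_is_down(ip: str = SERVICE_IP, pings: int = 3) -> bool:
    """True if the master answers none of the pings (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", str(pings), "-W", "1", ip],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode != 0


def take_over(dry_run: bool = True) -> None:
    """Print the takeover commands, or run them (as root) when dry_run is False."""
    for cmd in takeover_commands():
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Note that a failed ping only tells you the host is unreachable from the slave, not that Asterisk is dead, and if the master comes back while the slave holds the address you get a duplicate IP. That is why doing this step by hand, as described above, is the safer starting point.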

About Replication:

And it solved my problem if my master server is DOWN for any reason…