Realtime cache consistency in HA environments

Hello everyone,

I have run into a limitation when using Asterisk Realtime together with the stale cache mechanism.

For realtime objects, Asterisk provides a cache and allows individual objects to be invalidated through AMI using stale or expire. This works very well in large deployments because it enables targeted updates. If the same process updates the database and invalidates the corresponding Asterisk cache entry, the cache remains consistent with the database.

However, the situation becomes more complicated in a high-availability environment.

In our setup, databases on different nodes are synchronized automatically. As a result, a change may be made on one node while another Asterisk instance continues to use a cached version of the object. The second node has no built-in way to determine whether the cached object is still consistent with the current database record.

In many systems, this problem is solved by introducing a revision field (or version field). The field receives a new unique value whenever the object is modified, and the cached revision can then be compared with the database revision to detect changes.
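To make the idea concrete, here is a minimal sketch of revision-based change detection. The `Record` class, `new_revision` and `is_stale` are hypothetical names used only for illustration and are not part of Asterisk; the sketch assumes the revision is an opaque unique string that is regenerated on every modification.

```python
import uuid
from dataclasses import dataclass, field


def new_revision() -> str:
    # Any unique value works; a random UUID is only one possible choice.
    return uuid.uuid4().hex


@dataclass
class Record:
    # Hypothetical stand-in for a database row with a revision column.
    object_id: str
    data: dict
    revision: str = field(default_factory=new_revision)

    def update(self, **changes) -> None:
        # Every modification assigns a fresh revision value.
        self.data.update(changes)
        self.revision = new_revision()


def is_stale(cached_revision: str, current: Record) -> bool:
    # The cached copy is outdated when its revision differs from the database.
    return cached_revision != current.revision
```

A cache that remembers the revision it loaded can then detect a later change with a single string comparison, without comparing every field of the object.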

I would like to ask for advice on where such a field could be integrated so that, if implemented correctly, a patch would have a chance of being accepted into the official Asterisk sources.

At the moment, my preferred approach is to add an optional revision field to every PJSIP object, with an empty value by default. The field would be available both for Realtime and for static configuration. I believe it could also be useful for configuration-based deployments, not only Realtime.

My idea is intentionally simple: the field would not introduce any special processing by itself. It would only be loaded from Realtime/config sources and exposed through AMI, allowing external systems to implement revision tracking and cache validation logic.

What do you think about this approach? Are there better places in the architecture where such functionality should be implemented?

Can you explain more about how it would actually work in practice?

Oh Joshua, greetings.

First of all, I’d like to thank you for your responsiveness and for the tremendous contribution you’ve made to Asterisk over the years. You’re frequently mentioned in our regional Asterisk community, and people genuinely appreciate how active and helpful you’ve been.

To answer your question about how such a field could be used in practice, I’ll describe our use case.

We store all PJSIP configuration in MariaDB using Realtime and Sorcery memory cache. Our management service writes changes to the database and then selectively refreshes affected objects in Asterisk using AMI commands such as SorceryMemoryCacheExpireObject.
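For illustration, this is roughly what our management service sends. It is a minimal sketch that only builds the AMI message; login and socket handling are omitted. The helper names are hypothetical, and the cache name `res_pjsip/endpoint` and object id `1001` are assumptions that depend on how the memory cache is configured in sorcery.conf.

```python
def build_ami_action(action: str, **headers: str) -> bytes:
    # AMI messages are "Key: Value" lines separated by CRLF,
    # terminated by an empty line.
    lines = [f"Action: {action}"]
    lines += [f"{key}: {value}" for key, value in headers.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")


def expire_object(cache: str, object_id: str, action_id: str) -> bytes:
    # Expires a single object from the named Sorcery memory cache.
    return build_ami_action(
        "SorceryMemoryCacheExpireObject",
        ActionID=action_id,
        Cache=cache,
        Object=object_id,
    )


# Assumed cache name and object id, shown only as an example.
message = expire_object("res_pjsip/endpoint", "1001", "expire-1")
```

The resulting bytes would be written to an authenticated AMI connection right after the database row has been updated.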

This has worked very well for us because configuration changes become visible immediately, while still benefiting from caching. Unexpected restarts are not a problem because Asterisk rebuilds its state from the database during startup.

We’re now implementing a two-node HA deployment with database replication. One node may be offline or unreachable for some time, but the databases will eventually synchronize themselves once connectivity is restored.

The challenge is that an external synchronization process may miss some change notifications while it is offline. When it comes back, I would like to determine whether the objects currently cached by Asterisk match the current state of the database and selectively refresh only the objects that are outdated.

That is where I thought a revision field might be useful. If the revision of cached Sorcery objects were exposed, an external synchronization process could compare cache state with database state and perform targeted refreshes immediately after startup, failover, network recovery, or service restarts, instead of waiting for object_lifetime_stale to expire or invalidating larger portions of the cache.
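The comparison I have in mind could look roughly like the sketch below. It assumes that the cached revisions and the database revisions are both available as simple mappings from object id to revision value; the function name and the input shape are hypothetical, and obtaining the cached side from Asterisk is exactly the open part.

```python
def objects_to_refresh(cached: dict[str, str], database: dict[str, str]) -> list[str]:
    """Return ids of cached objects that no longer match the database."""
    outdated = []
    for object_id, cached_revision in cached.items():
        current_revision = database.get(object_id)
        # Refresh when the row was deleted or its revision changed.
        if current_revision is None or current_revision != cached_revision:
            outdated.append(object_id)
    # Objects present only in the database need no action here:
    # they are not cached yet and would be loaded on first use.
    return sorted(outdated)
```

Each returned id would then get one targeted expire request, so the rest of the cache stays warm.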

While investigating this further, I realized there is another complication. What I actually need is access to the revision value of the cached object itself. I initially thought AMI actions such as PJSIPShowAors might help, but they appear to read directly from the realtime backend rather than from the Sorcery memory cache, so they always show the current database state.

I was hoping to find something similar in the sorcery memory cache dump, but so far I’ve only seen cache-related information such as expiration and stale timers.

So I’m still exploring possible approaches. If you have thoughts on how this could be solved properly, I’d be very interested to hear them.

I don’t have direct experience with this, so others who do may provide input. That said, I always encourage people to do things outside of Asterisk if at all possible, rather than putting them into Asterisk or modifying Asterisk.

My immediate question is: are targeted refreshes worth it for this scenario? Will the scenario occur often enough to warrant trying to be clever like this, or is telling Asterisk to refresh all of it sufficient?

That’s a very good question, and honestly I don’t yet know how often this scenario will occur in practice. We are only now implementing the HA functionality, and real-world feedback will come later.

What makes me cautious about a full cache invalidation is the size of some deployments. Our PBX product can have up to 3000 extensions, and the current approach of using SorceryMemoryCacheExpireObject together with explicitly reloading only the affected objects has been working reliably for a long time. In practice, we see many small configuration changes, and targeted refreshes have handled them very efficiently. Updates become visible immediately, we continue to benefit from caching, and we have not observed any noticeable performance degradation from this approach.

Looking at it from another angle, Asterisk already provides SorceryMemoryCacheExpireObject, but there is currently no straightforward way for an external service to determine whether a particular cached object is out of date and should be refreshed. That was the motivation behind my original idea.

That said, I think I’ll probably close this topic. My initial proposal seemed like a relatively small change that would not add much complexity while making cache management more convenient. After thinking about it more, I now realize that simply adding a revision field to PJSIP objects does not actually solve the problem. An external application would also need a way to retrieve the revision of the cached object itself, which would require more substantial changes.

So in the end we will most likely either implement something specific to our environment or simply rely on SorceryMemoryCacheExpire when needed.

Thank you for taking the time to look at this and share your thoughts.

The “external synchronization process” must cover all zones globally, not just local ones.

If node-1 has a “change notification,” then the “external synchronization process” above node-1 should make the necessary adjustments to nodes 2 and 3.
If node-3 is unavailable, then it should add a “run later” marker.
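A minimal sketch of that "run later" idea, assuming the process keeps a per-node set of pending object ids in memory. The `Notifier` class and the `send` callback are hypothetical; a real implementation would presumably persist the markers so they survive a restart of the process itself.

```python
from collections import defaultdict


class Notifier:
    def __init__(self, nodes, send):
        self.nodes = list(nodes)
        # send(node, object_id) returns True if the node accepted the change.
        self.send = send
        # "Run later" markers: node name -> object ids still to be delivered.
        self.pending = defaultdict(set)

    def notify(self, origin, object_id):
        # Forward a change made on `origin` to every other node.
        for node in self.nodes:
            if node == origin:
                continue
            if not self.send(node, object_id):
                self.pending[node].add(object_id)

    def retry(self, node):
        # Call when a node becomes reachable again.
        for object_id in sorted(self.pending[node]):
            if self.send(node, object_id):
                self.pending[node].discard(object_id)
```

With this shape, a node that was down receives only the changes it actually missed once `retry` is called for it.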

*cough* CAP Theorem *cough*