AppQueue stability issues

Background

We developed a proxy that interacts with Asterisk v1.2.x via the Manager API and FastAGI interfaces. The purpose of this integration is to allow our core products to leverage Asterisk for enterprise call centers, including inbound, outbound (progressive, preview and predictive) and multimedia (email, web chat, etc.) contacts.
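For readers unfamiliar with the Manager API, the sketch below shows the wire framing a proxy like this has to handle: each message is a set of "Key: Value" lines separated by CRLF and terminated by a blank line. It is a minimal illustration, not our proxy's code. The helper names build_action and parse_messages are hypothetical, and the queue name used in the usage example is made up.

```python
def build_action(name, **fields):
    """Serialize one Manager API action as CRLF-separated 'Key: Value' lines.

    Hypothetical helper for illustration; a blank line terminates the message.
    """
    lines = [f"Action: {name}"]
    lines += [f"{key}: {value}" for key, value in fields.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_messages(buffer):
    """Split raw bytes from the manager socket into complete messages.

    Returns (messages, leftover), where leftover holds any trailing partial
    message that should be prepended to the next read.
    """
    messages = []
    while b"\r\n\r\n" in buffer:
        raw, buffer = buffer.split(b"\r\n\r\n", 1)
        message = {}
        for line in raw.decode("ascii", "replace").split("\r\n"):
            key, sep, value = line.partition(":")
            if sep:  # lines without a colon (e.g. the greeting banner) are skipped
                message[key.strip()] = value.strip()
        messages.append(message)
    return messages, buffer


if __name__ == "__main__":
    # Request queue state; the ActionID lets the caller match the response.
    print(build_action("QueueStatus", ActionID="42"))
    # Simulated socket data: one complete event plus a partial one.
    events, rest = parse_messages(b"Event: QueueMember\r\nQueue: sales\r\n\r\nEvent: Que")
    print(events, rest)
```

Keeping the leftover bytes between reads matters under load, since a single socket read can end in the middle of an event.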

The platform leverages the existing appqueue application within Asterisk to maintain agent state and distribution models. The platform also supports SIP/IAX2/H323/ZAP channels for network interconnection to the PSTN and other telephony systems. The current architecture uses a single Asterisk server for agents and queues. Zaptel cards are either installed in the same box or moved to separate gateway servers, each networked to the main server via IAX2; sizing requirements determine which layout is used.

Further, the system uses the appmeetme application to fill gaps in Asterisk's third-party call control handling (no hold feature, etc.). All agents are therefore connected to a MeetMe room, and the Queue then places a callback to the MeetMe room for each agent.

Issue

Stability issues have been encountered where the appqueue application appears to lock and calls are no longer delivered to agents. When Zaptel channels are used, this is easily reproduced with 10 to 30 agents, after anywhere from 5 seconds to 10 minutes or more. When only SIP or IAX2 channels are used, the issue still occurs at 100 agents, but less frequently and less reliably. Note that in the SIP/IAX2-only tests the Zaptel module was still loaded and simply not used during call processing; we have not yet run tests with a noload for it in modules.conf.

The result is that the Asterisk server must be restarted, our system reconnected and the calls launched again, which creates significant issues within a contact center.

Detail

The ticket we have open with backtraces on this issue may be found here:

#8275 - bugs.digium.com/view.php?id=8257

However, we believe that the CDRCUSTOM assignment reported there is a symptom rather than evidence of a direct flaw in the CDRCUSTOM module as the root cause of this issue. We ran a test with a noload on cdr_custom and the problem still occurred, which supports this, although we have not yet posted that backtrace.
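For reference, the cdr_custom test corresponds to a modules.conf entry along the following lines. This is a sketch rather than a copy of our configuration: it assumes the stock module file names cdr_custom.so and chan_zap.so and the default /etc/asterisk location. The commented-out line is the Zaptel channel variant that has not been tested yet.

```ini
; /etc/asterisk/modules.conf (fragment; path assumes a default install)
[modules]
autoload=yes
noload => cdr_custom.so    ; test already run: the lockup still occurred
;noload => chan_zap.so     ; Zaptel channel driver: not yet tested
```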

As indicated in #8275, we therefore believe the issue is more closely related to:

#7870 - bugs.digium.com/view.php?id=7870

and

#8069 - bugs.digium.com/view.php?id=8069

and

#6626 - bugs.digium.com/view.php?id=6626
(Even though this is purported to be fixed, the symptoms are similar.)

Actions

We are willing to pay someone who can narrow this issue down into a concise bug report, and then to post a further bounty to fix the problem and contribute that fix back to the community. This is a vital issue for Asterisk in the enterprise contact center space, so workarounds and/or fixes are key.