I was hoping someone may be able to help with my issue, which I have so far been unable to resolve.
We have been running Asterisk 1.8.4.2 in production for some time now, on CentOS 5.6 (32-bit) machines, without any issues whatsoever.
We are now trying to swap them out for some new machines running CentOS 6.4 (64-bit) and Asterisk 1.8.15-cert2, with the same config, dialplan etc. However, at random, after a period of no load, Asterisk stops processing SIP calls, yet the CLI and manager interface both remain fully responsive.
When it happens, the command “netstat -anp | grep 5070” (5070 is the port we are listening on) shows lots of packets queued, waiting to be processed.
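For anyone wanting to watch the backlog over time rather than eyeballing netstat, here is a minimal sketch in Python 3. It assumes SIP is running over UDP on a Linux box and reads the same kernel data netstat uses (/proc/net/udp), where the receive queue is reported in bytes as a hex value. The function name and the sample text are made up for illustration; only port 5070 comes from the post above.

```python
import os

def parse_udp_rx_queue(proc_text, port):
    """Return the rx_queue size (bytes) of every UDP socket bound to `port`.

    `proc_text` is the content of /proc/net/udp. IPv6 sockets live in
    /proc/net/udp6 and are not covered here.
    """
    queues = []
    for line in proc_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue
        # fields[1] is "HEXADDR:HEXPORT" for the local end of the socket
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port:
            # fields[4] is "tx_queue:rx_queue", both in hex
            queues.append(int(fields[4].split(":")[1], 16))
    return queues

# Illustrative sample only (not real output from the affected machine):
# port 0x13CE is 5070, rx_queue 0x0001F400 is 128000 bytes.
SAMPLE = (
    "  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops\n"
    "  12: 00000000:13CE 00000000:0000 07 00000000:0001F400 00:00000000 00000000     0        0 23456 2 ffff880000000000 0\n"
    "  13: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 23457 2 ffff880000000001 0\n"
)

if __name__ == "__main__":
    print("sample:", parse_udp_rx_queue(SAMPLE, 5070))
    if os.path.exists("/proc/net/udp"):
        with open("/proc/net/udp") as fh:
            print("live:", parse_udp_rx_queue(fh.read(), 5070))
```

Run from cron or a loop, a value that keeps growing while the CLI is still responsive would match the stall described here.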
Then, all of a sudden, after 10-15 mins or so, Asterisk springs back into life and the queued messages are processed.
I have the output of “core show locks” and also a backtrace but can’t see a way to attach files here - I can provide links to them if required.
I hope someone has some experience of a similar issue, as I am just about running out of ideas.
Thanks for the reply, but that’s the first thing I tried. The SIP messages come in but there is no response from Asterisk while it is in this state. As said before, the messages appear to be queued and are eventually processed when Asterisk wakes up again.
Make sure you are using a current version of dahdi, or a non-dahdi timing source. If that is OK, you need to follow the deadlock debugging procedures:
build with thread debugging enabled and optimisation disabled (the former has a significant performance cost). When it stalls, run the “core show locks” CLI command (start a new CLI connection, if necessary), use gcore to get a dump, then get backtraces (google “asterisk wiki backtrace” for details).
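The collection steps above can be scripted so they are ready to run the moment a stall happens. This is only a rough sketch in Python 3: the binary path /usr/sbin/asterisk, the /tmp output prefix and the helper names are assumptions, so adjust them for your install. It relies on gcore writing its dump to "<prefix>.<pid>" and on gdb's batch mode to print a backtrace for every thread.

```python
import subprocess

def diagnostic_commands(pid, out_prefix="/tmp/asterisk-stall",
                        binary="/usr/sbin/asterisk"):
    """Build the commands to run, in order, while Asterisk is stalled.

    `out_prefix` and `binary` are assumed locations, not taken from the thread.
    """
    core_file = "{0}.{1}".format(out_prefix, pid)  # gcore appends ".<pid>"
    return [
        # 1. Lock state, via a fresh remote CLI connection
        ["asterisk", "-rx", "core show locks"],
        # 2. Core dump of the running process (does not kill it)
        ["gcore", "-o", out_prefix, str(pid)],
        # 3. Backtraces for all threads from that dump
        ["gdb", "-batch", "-ex", "thread apply all bt", binary, core_file],
    ]

def run_all(pid, log_path="/tmp/asterisk-stall.log"):
    """Run each command and append its output to one log file."""
    with open(log_path, "a") as log:
        for cmd in diagnostic_commands(pid):
            log.write("### " + " ".join(cmd) + "\n")
            result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                    stderr=subprocess.STDOUT,
                                    universal_newlines=True)
            log.write(result.stdout + "\n")
```

Calling run_all(pid) as root with the Asterisk process ID leaves the lock list and backtraces in a single file, which is easier to link to than separate pastes. The lock output is only useful if Asterisk was built with thread debugging, as described above.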
Thanks, David - I am only using DAHDI as a timing source and it is the most current version. I have disabled all other timing modules.
As per my original post, I already have the output of “core show locks” and also a backtrace, but there is nowhere to attach them here and I didn’t like to paste them directly into the message.
It appears to be blocked in the “realtime” mysql handler. I believe that is community supported, which could mean that getting a fix will take a long time.
It looks to me as though it is trying to read the table “ast_sip_buddies”, but may have lost the database connection and is trying to re-establish it.
Thanks - I did think that, but Wireshark shows no attempts to reconnect and the MySQL server is up the entire time (it is in use by other Asterisk boxes and I can connect to it manually using the mysql client), so I am not sure what Asterisk is waiting for in this instance.
I will try switching to res_odbc and report back. I have searched before but didn’t really find any definitive answers - is there a significant performance hit from introducing the additional layer of abstraction or is it negligible? I understand it will largely depend on hardware and how heavily the database is used in our configuration, but there must have been some kind of comparison or benchmark performed somewhere in the past?
Yes, it could be, although there are no entries like that in the log to suggest that is what is happening.
I suppose it could be an issue with libmysqlclient - currently, the latest is installed from RPM, but I think I will try the one in the CentOS repo (although older, it’s likely to be 100% compatible with the OS, at least).