Avoided initial deadlock but the subsequent ones got me

Hi all. I have an asterisk server that deadlocks often, once a week at worst. I have another identical machine that works fine. The only difference is that the well-behaved machine uses one TDM400P while the other uses two.

Hoping to eliminate the OS or the particular installation of asterisk as the source of the problem, I tarred up the root and kernel of the well-behaved server and copied it over the misbehaving server.

After about two days it deadlocked again. The users, understandably, first reboot the server, then call me to complain, so attaching gdb and filing a bug report is out of the question.
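One way around the reboot-first problem might be a small watchdog that grabs a backtrace on its own as soon as asterisk stops answering. The following is only a sketch under several assumptions: that the deadlock also hangs the CLI (it may not), that "show version" is a harmless CLI command on 1.2, that asterisk and gdb are both on the PATH, and that the output path under /tmp is acceptable. The function names and the timeout values are made up for illustration.

```python
import subprocess
import sys
import time

# Assumptions for illustration only: the probe command, the timeouts and the
# output location are guesses, not something taken from a known-good setup.
PROBE_CMD = ["asterisk", "-rx", "show version"]
PROBE_TIMEOUT = 10   # seconds to wait for the CLI to answer
POLL_INTERVAL = 30   # seconds between probes


def is_responsive(cmd, timeout):
    """Return True if cmd exits with status 0 within timeout seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out (possible deadlock) or the command could not be started.
        return False
    return result.returncode == 0


def gdb_command(pid):
    """Build a gdb invocation that prints every thread's backtrace and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtrace(pid, out_path):
    """Attach gdb to pid and write all thread backtraces to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(gdb_command(pid), stdout=out,
                       stderr=subprocess.STDOUT, timeout=120)


def main(pid):
    """Poll the CLI; on the first failed probe, save a backtrace and stop."""
    while True:
        if not is_responsive(PROBE_CMD, PROBE_TIMEOUT):
            dump_backtrace(pid, "/tmp/asterisk-deadlock-%d.txt" % int(time.time()))
            break
        time.sleep(POLL_INTERVAL)


# Usage: python watchdog.py <asterisk-pid>  (does nothing without a PID)
if __name__ == "__main__" and len(sys.argv) == 2:
    main(int(sys.argv[1]))
```

Note that a failed probe is not proof of a deadlock: the probe also fails if asterisk is simply not running. The backtrace is likely to be far more useful if asterisk was built with debugging symbols.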

I’m now at my wits’ end with this, and I was hoping someone with deadlock resolution experience could review my remaining list of suspected causes and the solutions I have in mind:

SUSPECTS

  1. Multiple TDM400P cards
  2. Use of metermaid patch for parking hints with 1.2
  3. lots of & dialing (as in dial(sip/1&sip/2&sip/3&sip/4))
  4. poor support of 64bit and/or nforce platform?
  5. bugs in asterisk 1.2 branch

…some of those are mine, and some were suggested by others. They are in order of what I feel is the most likely cause.

These are the possible solutions I’m considering…

SOLUTIONS

  1. Switch to a T-1 + channel bank
  2. Switch to a sangoma a200 card
  3. Downgrade to 1 * TDM400P
  4. Try asterisk 1.4 (parking hints feature built into branch)
  5. Try a different distribution (using Gentoo now)
  6. Callweaver

… Those are mostly ordered by what I think is most likely to help, but I could be way off.

Finally, I suppose it is important to list what I’ve tried:

TRIED SO FAR

  1. Replacing PSU
  2. Replacing UPS
  3. Cloning OS from identical well-behaved computer
  4. Recompiling asterisk / upgrading asterisk version.

… upgrading actually made it worse (that was before I cloned the OS), so I downgraded back to 1.2.16 and zaptel 1.2.14.

I’m beginning to think there are 18 point releases of asterisk just to get you to restart the service before it deadlocks. In my case, even that wouldn’t be soon enough.

I’m not looking for a concrete solution, just some insight, as I’m running out of troubleshooting options.