Mem leak plus CPU maxed out 1.6.2.18-19

We have some issues with Asterisk 1.6.2.18 and 1.6.2.19 that I hope someone can help us with.

Customer 1 ran .18 from the beginning, but we experienced memory leaks and were glad to read that .19 took care of those issues, so we upgraded to .19.
After the upgrade, the CPU maxed out after a while, which caused a lot of errors. (It also used more CPU than .18 right from the start.)
We reverted to .18 and scheduled regular restarts of Asterisk to avoid the memory leak problems.
We log the activity of Asterisk in top every 15 minutes. In one case, after an unplanned restart, we actually had two asterisk processes running (this only happened once, though). To solve this we had to restart Asterisk manually.

Customer 2 has .19, but has fewer clients and no MixMonitor recording. The CPU is somewhat high, but does not max out.

Any ideas? We are now looking at 1.8.5.0. We are not sure whether that solves our problem, but we have some indications that it might.

Below are the main issues in more detail.

Issues 1.6.2.18:
Memory leak: Asterisk leaks memory over time, which results in untimely restarts.
Although this log doesn’t show a restart, it has occurred.
A cron job restarts Asterisk at 12:40, and another cron job takes a top snapshot every 15 minutes.
Notice that the %MEM column (and RES with it) slowly grows over time.

Wed Jul  6 12:45:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  476m  19m 6636 S    0  1.0   0:00.84 asterisk

Wed Jul  6 13:00:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  476m  19m 6680 S    0  1.0   0:02.06 asterisk

Wed Jul  6 13:15:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  615m  50m 7340 S   32  2.5   2:25.31 asterisk

Wed Jul  6 13:30:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  710m  90m 7364 S   30  4.5   7:11.83 asterisk

Wed Jul  6 13:45:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  736m 129m 7368 S   30  6.4  12:10.48 asterisk

Wed Jul  6 14:00:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  762m 169m 7372 S   24  8.5  16:59.23 asterisk

Wed Jul  6 14:15:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  789m 211m 7364 S   20 10.5  22:05.06 asterisk

Wed Jul  6 14:30:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  815m 250m 7360 S   22 12.5  27:12.60 asterisk

Wed Jul  6 14:45:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  839m 289m 7356 S   24 14.4  32:05.90 asterisk

Wed Jul  6 15:00:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  853m 323m 7356 S   12 16.1  36:35.96 asterisk

Wed Jul  6 15:15:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  840m 325m 7340 S    4 16.2  37:25.60 asterisk

Wed Jul  6 15:30:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  879m 367m 7340 S   26 18.3  41:43.53 asterisk

Wed Jul  6 15:45:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  954m 391m 7320 S   26 19.5  46:29.26 asterisk

Wed Jul  6 16:00:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  953m 400m 7312 S   26 20.0  51:13.64 asterisk

Wed Jul  6 16:15:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  953m 406m 7312 S   26 20.3  55:28.70 asterisk

Wed Jul  6 16:30:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  953m 411m 7308 S   18 20.5  59:43.93 asterisk

Wed Jul  6 16:45:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24954 root      20   0  953m 416m 7276 S   22 20.7  62:34.02 asterisk
|
|

------- An unplanned restart happens between these two points in time.
|
|
Wed Jul 6 17:00:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  501m  26m 7328 S    6  1.3   0:21.56 asterisk

Wed Jul  6 17:15:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  572m  36m 7348 S    6  1.8   1:28.05 asterisk

Wed Jul  6 17:30:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  579m  49m 7356 S   10  2.5   2:40.68 asterisk

Wed Jul  6 17:45:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  589m  62m 7364 S   10  3.1   3:55.15 asterisk

Wed Jul  6 18:00:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  594m  73m 7372 S    6  3.6   5:09.73 asterisk

Wed Jul  6 18:15:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  602m  83m 7372 S    6  4.2   6:21.94 asterisk

Wed Jul  6 18:30:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  611m  97m 7372 S   12  4.8   7:37.94 asterisk

Wed Jul  6 18:45:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  618m 108m 7364 S    6  5.4   8:53.90 asterisk

Wed Jul  6 19:00:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  624m 118m 7364 S    2  5.9  10:06.03 asterisk

Wed Jul  6 19:15:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  625m 118m 7352 S    0  5.9  10:25.13 asterisk

Wed Jul  6 19:30:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  627m 124m 7344 S    8  6.2  11:08.39 asterisk

Wed Jul  6 19:45:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  628m 126m 7340 S    6  6.3  12:20.40 asterisk

Wed Jul  6 20:00:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  692m 128m 7340 S    8  6.4  13:30.57 asterisk

Wed Jul  6 20:15:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  692m 131m 7340 S    4  6.5  14:40.06 asterisk

Wed Jul  6 20:30:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  692m 133m 7336 S   10  6.6  15:45.51 asterisk

Wed Jul  6 20:45:01 CEST 2011

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23049 root      20   0  692m 135m 7336 S    8  6.8  16:56.46 asterisk
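
For anyone who wants to put a number on the growth in snapshots like the ones above, here is a minimal Python sketch. It assumes the log is exactly what the cron job writes: a `date` line followed by one top row per process, with RES in the sixth column. The function names are made up for illustration, the timezone abbreviation is ignored, and month names are parsed in the default English locale.

```python
import re
from datetime import datetime

# Timestamp lines as written by `date`, e.g. "Wed Jul  6 12:45:01 CEST 2011".
# The timezone abbreviation is matched but not interpreted.
STAMP = re.compile(r"^\w{3} (\w{3}) +(\d+) (\d\d:\d\d:\d\d) \w+ (\d{4})$")


def res_to_mib(field):
    """Convert a top RES field such as '19m', '1.2g' or '6636' (KiB) to MiB."""
    units = {"m": 1.0, "g": 1024.0}
    if field[-1] in units:
        return float(field[:-1]) * units[field[-1]]
    return float(field) / 1024.0  # no suffix: top reports KiB


def parse_snapshots(text):
    """Return (timestamp, pid, res_mib) tuples for every asterisk row."""
    samples, stamp = [], None
    for line in text.splitlines():
        line = line.strip()
        m = STAMP.match(line)
        if m:
            month, day, clock, year = m.groups()
            stamp = datetime.strptime(
                f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
            )
            continue
        cols = line.split()
        # A top row has 12 columns; RES is the sixth, COMMAND the last.
        if stamp and len(cols) == 12 and cols[-1] == "asterisk":
            samples.append((stamp, int(cols[0]), res_to_mib(cols[5])))
    return samples


def growth_mib_per_hour(samples):
    """Average RES growth between the first and last sample given."""
    (t0, _, r0), (t1, _, r1) = samples[0], samples[-1]
    hours = (t1 - t0).total_seconds() / 3600.0
    return (r1 - r0) / hours
```

Pass in only the samples of one PID at a time, since a restart resets the numbers. In the log above, PID 24954 goes from 19 MiB at 12:45 to 416 MiB at 16:45, roughly 99 MiB per hour, and PID 23049 goes from 26 MiB at 17:00 to 135 MiB at 20:45, roughly 29 MiB per hour.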

Issues 1.6.2.19:
CPU maxes out: We tried 1.6.2.19 at two different customers. The only difference between the two is that one (customer 1) uses MixMonitor to record all calls. At that customer, CPU usage was significantly higher than with 1.6.2.18, and after a while both cores maxed out, spewing a lot of "ERROR[26536] res_timing_timerfd.c: Read error: Bad file descriptor" messages into the logs.
While we haven’t had the same amount of trouble with 1.6.2.19 at the other customer, we can still spot a bunch of bad file descriptor messages there as well.

To minimize our problems, we restart Asterisk twice a day using a cron script, once at 08:00 and again at 12:40. Despite this, we have still hit the issues described above.
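
To compare how often the bad file descriptor error shows up at the two customers, or before and after a restart, something like the following Python sketch could count the messages per hour. The message text is the one quoted above. The leading "[Mon D HH:MM:SS]" timestamp is an assumption about the log format, so adjust the pattern to whatever your logger.conf actually writes. `errors_per_hour` is a made-up name.

```python
import re
from collections import Counter

# Assumed line shape (the timestamp prefix is an assumption):
#   "[<Mon> <D> <HH>:<MM>:<SS>] ERROR[<thread id>] res_timing_timerfd.c: Read error: Bad file descriptor"
PATTERN = re.compile(
    r"^\[(?P<hour>[^\]]+?):\d\d:\d\d\]\s+ERROR\[\d+\]"
    r".*?res_timing_timerfd\.c.*Bad file descriptor"
)


def errors_per_hour(lines):
    """Count timerfd read errors, keyed by the 'Mon D HH' part of the timestamp."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line)
        if m:
            counts[m.group("hour")] += 1
    return counts
```

Feed it an open log file, e.g. `errors_per_hour(open(path))`, where `path` is wherever your Asterisk log lives. Lining the hourly counts up against the top snapshots might show whether the errors start before or after the CPU climbs.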

Those versions are on security fixes only. The only way anything else might get fixed is if it is collateral damage resulting from a security fix.

Ok, so, that leaves us with:

Has anyone experienced this problem and solved it using 1.8.x, or would a downgrade solve it?
or
Could it be a module or configuration issue? Perhaps we might be able to disable some functionality.
or
Any other thoughts or clues that might help us dig further?

Good news! - We no longer have this issue - Downgrading to 1.6.2.9 was the solution. Memory is steady and CPU does not max out.

We have not yet tried with 1.8. Might be a future project.

With respect to memory leaks, for anyone that comes across this thread later, I’ll repost some advice on tracking them down:

Asterisk can optionally be compiled with a memory allocation debugger. To build this option in, from menuselect, browse to “Compiler Flags” and then enable “MALLOC_DEBUG.” Recompile and install Asterisk.

Then, from the CLI, you can do:

memory show summary

to see memory utilization by file. So, you can run it, wait a bit till your system says memory’s been leaked, and then run it again to compare and provide a pointer as to the culprit.

Once you’ve tracked down the file that’s at fault, you can then do:

memory show allocations offending_file.c

and it’ll provide the allocations by source line number. If something’s gone bonkers, there should be an allocation (or allocations) with really crazy numbers.
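
The before/after comparison of memory show summary described above can get tedious by eye, so here is a rough Python sketch that diffs two saved dumps. It assumes each line looks something like "<bytes> bytes in <count> allocations in file '<name>'". That shape is an assumption and may differ between versions, so check it against your own CLI output first. The function names are made up for illustration.

```python
import re

# Assumed line shape (verify against your own output):
#   "<bytes> bytes in <count> allocations in file '<name>'"
# Extra text between "bytes" and "in" is tolerated.
LINE = re.compile(
    r"(\d+)\s+bytes\b.*?\bin\s+(\d+)\s+allocations in file '([^']+)'"
)


def parse_summary(text):
    """Map file name -> bytes currently allocated, from one summary dump."""
    usage = {}
    for line in text.splitlines():
        m = LINE.search(line)
        if m:
            usage[m.group(3)] = int(m.group(1))
    return usage


def biggest_growth(before, after, top=5):
    """Return the `top` files whose allocated bytes grew most between dumps."""
    b, a = parse_summary(before), parse_summary(after)
    growth = {name: size - b.get(name, 0) for name, size in a.items()}
    ranked = sorted(growth.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]
```

The two dumps can be captured from the shell with `asterisk -rx "memory show summary"` redirected to a file, once right after start and once after memory has grown. The file at the top of the returned list is the first candidate for memory show allocations.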

Armed with some leading information, you can open an issue on the issue tracker at:

issues.asterisk.org

Reporting guidelines can be found here:

asterisk.org/developers/bug-guidelines

Cheers