Frequent disconnections from asterisk

We are getting frequent disconnections between Asterisk and our server. Asterisk is running over SSL with a calling rate of 8000 calls per hour. The disconnection rate rises directly with the calling rate.
In the Asterisk logs, "fflush() returned error" appears, after which Asterisk disconnects our server.

The fflush() call is in utils.c in Asterisk 1.6.2. I know support for that version has ended, but I am still asking for community help.
Line number: 1403
In int ast_careful_fwrite(FILE *f, int fd, const char *src, size_t len, int timeoutms), at:
while (fflush(f))

It is not clear to me what your question is. With 8,000 calls per hour of 5 minutes each (just an assumption), there is an average of around 670 concurrent calls, which is quite a lot. Unless you have a complete server park with load-balancing mechanisms in place and lots of bandwidth, I guess your solution runs out of resources (memory, bandwidth, interrupts, CPU cycles, etc.) and starts to drop calls.
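For reference, the estimate above is just arrival rate multiplied by average call duration (Little's law). A minimal sketch, assuming a hypothetical helper name and the same 5-minute guess for call length:

```c
#include <assert.h>

/* Hypothetical helper: average concurrent calls = arrival rate x average duration.
 * The 5-minute duration used in the thread is an assumption, not a measurement. */
double concurrent_calls(double calls_per_hour, double avg_duration_seconds)
{
	assert(calls_per_hour >= 0.0 && avg_duration_seconds >= 0.0);
	return calls_per_hour / 3600.0 * avg_duration_seconds;
}
```

concurrent_calls(8000.0, 300.0) comes out at roughly 667. The result scales linearly with the assumed duration, so much shorter calls would give a much smaller number.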

Warning: You should upgrade. For many, many reasons, not the least of which is that you are running a version of Asterisk that is behind on security updates. If you aren’t analyzing security vulnerabilities reported against Asterisk since 1.6.2 went completely EOL and patching Asterisk yourself, you’re leaving yourself vulnerable.

That aside, the code in ast_careful_fwrite is probably similar between versions. Notably, the code in recent versions of Asterisk will output what the error is:

	while (fflush(f)) {
		if (errno == EAGAIN || errno == EINTR) {
			/* fflush() does not appear to reset errno if it flushes
			 * and reaches EOF at the same time. It returns EOF with
			 * the last seen value of errno, causing a possible loop.
			 * Also usleep() to reduce CPU eating if it does loop */
			errno = 0;
			usleep(1);
			continue;
		}
		if (errno && !feof(f)) {
			/* Don't spam the logs if it was just that the connection is closed. */
			ast_log(LOG_ERROR, "fflush() returned error: %s\n", strerror(errno));
		}
		n = -1;
		break;
	}

Do you get an error message sent out when fflush returns an error? If so, what is it?

Note that one of the major reasons for ast_careful_fwrite failing when used by AMI (I’m assuming you’re using AMI, you didn’t specify that you were but that’s one of the few invokers of that function) is due to a remote system not processing events fast enough. When a remote system fails to read events quickly enough, the write timeout on the stream triggers and the socket disconnects. That gives you two options:

  1. Write your application so that it pulls events off of the socket as fast as possible and dispatches them to other threads/processes for handling.

  2. Increase the write timeout via the writetimeout parameter in manager.conf. If that option isn’t available in your version of Asterisk, then you should upgrade. (You should upgrade.)
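Option 1 could look something like the sketch below. It is not Asterisk or AMI library code: chunk, chunk_queue, queue_init, queue_push, queue_pop and drain_events are all hypothetical names, and it assumes a blocking socket and POSIX threads. The reader does nothing except read() and enqueue raw bytes; parsing and handling happen in whichever worker threads call queue_pop().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* All names here are hypothetical; this is a sketch, not Asterisk code. */
struct chunk {
	struct chunk *next;
	size_t len;
	char data[];
};

struct chunk_queue {
	pthread_mutex_t lock;
	pthread_cond_t nonempty;
	struct chunk *head, *tail;
	int closed; /* set once the socket reaches EOF or an error */
};

void queue_init(struct chunk_queue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	pthread_cond_init(&q->nonempty, NULL);
	q->head = q->tail = NULL;
	q->closed = 0;
}

void queue_push(struct chunk_queue *q, struct chunk *c)
{
	pthread_mutex_lock(&q->lock);
	c->next = NULL;
	if (q->tail)
		q->tail->next = c;
	else
		q->head = c;
	q->tail = c;
	pthread_cond_signal(&q->nonempty);
	pthread_mutex_unlock(&q->lock);
}

/* Worker side: waits for a chunk; returns NULL once the queue is closed
 * and empty. The caller frees the returned chunk. */
struct chunk *queue_pop(struct chunk_queue *q)
{
	struct chunk *c;

	pthread_mutex_lock(&q->lock);
	while (!q->head && !q->closed)
		pthread_cond_wait(&q->nonempty, &q->lock);
	c = q->head;
	if (c) {
		q->head = c->next;
		if (!q->head)
			q->tail = NULL;
	}
	pthread_mutex_unlock(&q->lock);
	return c;
}

/* Reader side: pull bytes off the socket as fast as possible and hand
 * them over without parsing. Returns the number of bytes drained. */
size_t drain_events(int fd, struct chunk_queue *q)
{
	char buf[4096];
	size_t total = 0;

	assert(q != NULL);
	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));
		struct chunk *c;

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			break;
		c = malloc(sizeof(*c) + (size_t)n);
		if (!c)
			break;
		c->len = (size_t)n;
		memcpy(c->data, buf, (size_t)n);
		queue_push(q, c);
		total += (size_t)n;
	}
	pthread_mutex_lock(&q->lock);
	q->closed = 1; /* wake any workers so they can exit */
	pthread_cond_broadcast(&q->nonempty);
	pthread_mutex_unlock(&q->lock);
	return total;
}
```

The queue is unbounded on purpose, so the reader never stalls on a slow handler. The price is memory growth if the workers fall behind for long, so a real application would probably want to monitor the queue length.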

@mjordan
According to our logs, fflush appears to be failing (returning a non-zero value) without setting errno, so errno is reported as 0 and the manager connection is dropped.
The logs also show that the data we are writing is very small and that no timeout is occurring, which rules out a slow reader.

Log:
[Dec 2 17:30:12] ERROR[75468] utils.c: fflush() returned error: Success 0
[Dec 2 17:30:12] ERROR[75468] utils.c: feof() returned error : Success after fflush returned error : Success with error code 0
[Dec 2 17:30:12] ERROR[75468] utils.c: feof() with error 0, DATA: final src length 0, initial src length 19, timeout in ms 20000 and elapsed time in ms 0
[Dec 2 17:30:12] WARNING[75468] manager.c: send_string returned error
[Dec 2 17:30:12] DEBUG[75468] manager.c: do_message returned error
[Dec 2 17:30:12] VERBOSE[75468] manager.c: == Manager 'asterisk' logged off from 10.10.35.141

The custom code of ast_careful_fwrite which we are using is:

	while (fflush(f)) {
		if (errno == EAGAIN || errno == EINTR) {
			if (errno == EAGAIN) {
				egainCounter++;
			}
			if (errno == EINTR) {
				einterCountr++;
			}
			if (egainCounter + einterCountr > 10) {
				ast_log(LOG_ERROR, "fflush() EAGAIN: %d + EINTR: %d  reached : %d\n", egainCounter, einterCountr, egainCounter+einterCountr);
				n = -1;
				break;
			}
			continue;
		}
		errorVal = errno;
		ast_log(LOG_ERROR, "fflush() returned error: %s %d \n", strerror(errno),errorVal);

		if (!feof(f)) {
			/* Don't spam the logs if it was just that the connection is closed. */
			ast_log(LOG_ERROR, "feof() returned error : %s after fflush returned error : %s with error code %d\n", strerror(errno), strerror(errorVal),errno);
			ast_log(LOG_ERROR, "feof() with error %d, DATA: final src length %d, initial src length %d, timeout in ms %d and elapsed time in ms %d \n ",errno,len,src_initial_len,timeoutms,elapsed );
		}
		n = -1;
		break;
	}

If that’s the case, then whatever you’re running on isn’t abiding by the contract fflush is documented to provide:

Return Value

Upon successful completion 0 is returned. Otherwise, EOF is returned and errno is set to indicate the error.

If EOF is being returned and errno is not being set, there’s nothing Asterisk can do. Your system is lying to Asterisk, and Asterisk can’t make assumptions about what is happening.

That being said, that may not be what is happening here. Since errno is a global, you may be losing the error reason due to some other system call executing by the time ast_log is called. Rather than cache errno off into errorVal, you could try printing the error quickly. You could also try calling explain_fflush to try and get a better reason on why fflush failed.
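To rule out a stale or overwritten errno, something along these lines might help. It is only a sketch: flush_report and flush_and_report are hypothetical names, not part of Asterisk. It clears errno right before fflush(), samples it right after, and records the stream's error and EOF indicators so they can be logged later without racing other library calls.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper, not part of Asterisk. */
struct flush_report {
	int ret;          /* return value of fflush(): 0 or EOF */
	int saved_errno;  /* errno sampled immediately after fflush() */
	int stream_error; /* nonzero if ferror(f) is set afterwards */
	int stream_eof;   /* nonzero if feof(f) is set afterwards */
};

struct flush_report flush_and_report(FILE *f)
{
	struct flush_report r;

	assert(f != NULL);
	errno = 0;             /* a leftover value cannot be mistaken for this call's error */
	r.ret = fflush(f);
	r.saved_errno = errno; /* sampled before any other call can overwrite it */
	r.stream_error = ferror(f);
	r.stream_eof = feof(f);
	return r;
}
```

If ret is EOF while saved_errno is still 0, the failure did not come with an errno from the layer underneath, and strerror() will print "Success" exactly as in the log above. The ferror() and feof() values then at least show which indicator the stream itself has set.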

Have you tried:
ulimit -n 32768 -c unlimited && echo "OK" || echo "FAILED"
before launching the main Asterisk process (not a console reconnection)?

Hello,
Note that this issue only occurs with SSL.
We have also tested the same scenario with Asterisk 13.9 against our server. The issue is reproducible with more than 10 concurrent calls of short duration (less than 10 seconds).
We have tested both Asterisk versions with:
OpenSSL 1.0.1e-fips 11 Feb 2013/openssl-1.0.1e-48.el6_8.1.x86_64
glibc-2.12-1.192.el6.x86_64/glibc-2.12-1.166.el6_7.3.x86_64
One more thing we have observed: when fflush is called, the ssl_write hook in tcptls.c is called, which in turn calls OpenSSL's SSL_write. When fflush fails, however, we see no call to ssl_write in tcptls.c.