z-push creates php threads (not processes) until server crashes

I’ve been struggling with the following problem for a while now and would very much appreciate your help.

My webserver completely crashes on a regular basis due to too many php threads. The number of php processes is stable within a normal range (e.g. 35 processes). The number of threads per process constantly increases over time. After a couple of hours the total sum of all threads in all php processes gets greater than 900 - and that’s when normal server operations start to shut down.

The threads are somehow tied to z-push. After disabling z-push and restarting php, the number of threads per process is 1 and stays that way. I am using z-push to sync kopano data to about 7 different android smartphones.

I am running

  • a virtual server
  • 8 GB RAM
  • 4 vCores
  • ubuntu 16.04
  • nginx 1.10.3
  • php-fpm 7.0.3
  • kopano core 8.4.90
  • kopano webapp 3.4.3
  • z-push 2.4.4

my …/fpm/pool.d/www.conf looks like this (alphabetical order):

user = www-data
request_terminate_timeout = 600s
pm.start_servers = 20
pm.min_spare_servers = 20
pm.max_spare_servers = 35
pm.max_requests = 500
pm.max_children = 72
pm = dynamic
listen.owner = www-data
listen.group = www-data
listen = /run/php/php7.0-fpm.sock
group = www-data
env[TMP] = /tmp
env[TMPDIR] = /tmp
env[TEMP] = /tmp
env[PATH] = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
env[HOSTNAME] = $HOSTNAME

And my nginx configuration looks like this:

user www-data;
worker_processes auto;
pid /run/nginx.pid;

events {
    worker_connections 1024;
    # multi_accept on;
}

http {
    client_max_body_size 501m;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 3;
    types_hash_max_size 2048;
    server_names_hash_max_size 8192;
    include /etc/nginx/mime.types;
    default_type application/octet-stream;  
    log_format  main  '$remote_addr - $host $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';     
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2; 
    ssl_prefer_server_ciphers on;
    map $status $loggable {
        ~^404  0;
        default 1;
    }
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log ;
    gzip on;
    gzip_disable "msie6";
    application/xml+rss text/javascript;
    gzip_min_length 1100;
    gzip_vary on;
    gzip_comp_level 2;
    gzip_proxied any;
    gzip_types      text/plain text/css application/x-javascript text/xml application/xml application/xml+rss text/javascript;  
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
    limit_conn_zone $binary_remote_addr zone=addr:10m;  
    limit_conn addr 50; 
    fastcgi_cache_path /etc/nginx/cache levels=1:2 keys_zone=YOURAPP:100m inactive=60m;
    fastcgi_cache_key "$scheme$request_method$host$request_uri";    
    upstream php {
        server unix:/run/php/php7.0-fpm.sock;
    }    
    map $http_upgrade $connection_upgrade {
            default upgrade;
            '' close;
    }     
}

server {
    listen 443;
    server_name my.server.de;
    # ssl specs here 
    # some general location stuff here
    root    /usr/share/z-push;
    index   index.php;

    fastcgi_param  QUERY_STRING       $query_string;
    fastcgi_param  REQUEST_METHOD     $request_method;
    fastcgi_param  CONTENT_TYPE       $content_type;
    fastcgi_param  CONTENT_LENGTH     $content_length;
    fastcgi_param  SCRIPT_NAME        $fastcgi_script_name;
    fastcgi_param  REQUEST_URI        $request_uri;
    fastcgi_param  DOCUMENT_URI       $document_uri;
    fastcgi_param  DOCUMENT_ROOT      $document_root;
    fastcgi_param  SERVER_PROTOCOL    $server_protocol;
    fastcgi_param  REQUEST_SCHEME     $scheme;
    fastcgi_param  HTTPS              $https if_not_empty;
    fastcgi_param  GATEWAY_INTERFACE  CGI/1.1;
    fastcgi_param  SERVER_SOFTWARE    nginx/$nginx_version;
    fastcgi_param  REMOTE_ADDR        $remote_addr;
    fastcgi_param  REMOTE_PORT        $remote_port;
    fastcgi_param  SERVER_ADDR        $server_addr;
    fastcgi_param  SERVER_PORT        $server_port;
    fastcgi_param  SERVER_NAME        $server_name;
    fastcgi_param  REDIRECT_STATUS    200;
    fastcgi_index   index.php;
    fastcgi_param   SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_param   REQUEST_URI $1;
    fastcgi_param   PHP_FLAG "magic_quotes_gpc=off \n register_globals=off \n magic_quotes_runtime=off \n short_open_tag=on";
    fastcgi_param   PHP_VALUE "post_max_size=20M \n upload_max_filesize=20M \n max_execution_time=3660";
    fastcgi_param   HTTP_PROXY ""; 
    fastcgi_read_timeout 3660; 

    location ~* /Microsoft-Server-ActiveSync {
        alias       /usr/share/z-push/index.php;
        access_log  /var/log/nginx/z-push-access.log;
        error_log   /var/log/nginx/z-push-error.log;
        client_max_body_size 20m;
        client_body_buffer_size 128k;
        fastcgi_pass    unix:/var/run/php/php7.0-fpm.sock; 
    }

    location ~* /AutoDiscover/AutoDiscover.xml {
        alias       /usr/share/z-push/autodiscover/autodiscover.php;
        access_log  /var/log/nginx/z-push-autodiscover-access.log;
        error_log   /var/log/nginx/z-push-autodiscover-error.log;
        fastcgi_pass    unix:/var/run/php/php7.0-fpm.sock;
    }

    location / {
        try_files $uri $uri/ index.php; 
    }
}

And of course z-push-conf.php:

<?php

    define('TIMEZONE', 'Europe/Berlin');
    setlocale(LC_CTYPE, "de_DE.UTF-8");  
    define('BASE_PATH', dirname($_SERVER['SCRIPT_FILENAME']). '/');
    define('SCRIPT_TIMEOUT', 0);
    define('USE_CUSTOM_REMOTE_IP_HEADER', false);
    define("CERTIFICATE_OWNER_PARAMETER", "SSL_CLIENT_S_DN_CN");
    define('USE_FULLEMAIL_FOR_LOGIN', true);
    define('STATE_MACHINE', 'FILE');
    define('STATE_DIR', '/var/lib/z-push/');
    define('IPC_PROVIDER', '');
    define('LOGBACKEND', 'filelog');
    define('LOGLEVEL', LOGLEVEL_ERROR);
    define('LOGAUTHFAIL', true);
    define('LOGUSERLEVEL', LOGLEVEL_DEVICEID);
    $specialLogUsers = array();
    define('LOGFILEDIR', '/var/log/z-push/');
    define('LOGFILE', LOGFILEDIR . 'z-push.log');
    define('LOGERRORFILE', LOGFILEDIR . 'z-push-error.log');
    define('LOG_SYSLOG_HOST', false);
    define('LOG_SYSLOG_PORT', 514);
    define('LOG_SYSLOG_PROGRAM', 'z-push');
    define('LOG_SYSLOG_FACILITY', LOG_LOCAL0);
    define('PROVISIONING', false);
    define('LOOSE_PROVISIONING', false);
    define('PROVISIONING_POLICYFILE', 'policies.ini');
    define('SYNC_CONFLICT_DEFAULT', SYNC_CONFLICT_OVERWRITE_PIM);
    define('SYNC_FILTERTIME_MAX', SYNC_FILTERTYPE_ALL);
    define('PING_INTERVAL', 300);
    define('FILEAS_ORDER', SYNC_FILEAS_LASTFIRST);
    define('SYNC_MAX_ITEMS', 25);
    define('UNSET_UNDEFINED_PROPERTIES', false);
    define('SYNC_CONTACTS_MAXPICTURESIZE', 5242880);
    define('ALLOW_WEBSERVICE_USERS_ACCESS', false);
    define('USE_PARTIAL_FOLDERSYNC', false);
    define('PING_LOWER_BOUND_LIFETIME', false);
    define('PING_HIGHER_BOUND_LIFETIME', 320); 
    define('SYNC_TIMEOUT_SHORT_DEVICETYPES', "");     
    define('SYNC_TIMEOUT_MEDIUM_DEVICETYPES', "");
    define('SYNC_TIMEOUT_LONG_DEVICETYPES',   "iPod, iPad, iPhone, WP, WindowsOutlook, WindowsMail");
    define('RETRY_AFTER_DELAY', 300);
    define('BACKEND_PROVIDER', '');
    define('SEARCH_PROVIDER', '');
    define('SEARCH_WAIT', 10);
    define('SEARCH_MAXRESULTS', 10);
    define('KOE_CAPABILITY_GAB', true);
    define('KOE_CAPABILITY_RECEIVEFLAGS', true);
    define('KOE_CAPABILITY_SENDFLAGS', true);
    define('KOE_CAPABILITY_OOF', true);
    define('KOE_CAPABILITY_OOFTIMES', true);
    define('KOE_CAPABILITY_NOTES', true);
    define('KOE_CAPABILITY_SHAREDFOLDER', true);
    define('KOE_CAPABILITY_SENDAS', true);
    define('KOE_CAPABILITY_SECONDARYCONTACTS', true);
    define('KOE_CAPABILITY_SIGNATURES', true);
    define('KOE_CAPABILITY_RECEIPTS', true);
    define('KOE_CAPABILITY_IMPERSONATE', true);
    define('KOE_GAB_STORE', 'SYSTEM');
    define('KOE_GAB_FOLDERID', '');
    define('KOE_GAB_NAME', 'Z-Push-KOE-GAB');
    $additionalFolders = array(
        array(
            'store'     => "SYSTEM",
            'folderid'  => "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
            'name'      => "IWDR-Kontakte",
            'type'      => SYNC_FOLDER_TYPE_USER_CONTACT,
            'flags'     => DeviceManager::FLD_FLAGS_NONE
        ),        
    );

And finally, I end up with plenty of entries in z-push.log which look like these:

22/09/2018 07:22:07 [ 8782] [ERROR] [name@server.de] Please check your logs for this PID and errors like PHP-Fatals or Apache segmentation faults and report your results to the Z-Push dev team.
22/09/2018 08:08:06 [16667] [ERROR] [name@server.de] LoopDetection->ProcessLoopDetectionPreviousConnectionFailed(): Command 'FolderSync' at 22/09/2018 08:08:01 with pid '15348' terminated unexpectedly or is still running.

I understand that this kind of problem is often connected to timeouts. Can you give me advice on the best practice to avoid these timeouts?
Are these timeouts also the reason for the ever growing number of threads?

Any help would be very much appreciated. Thank you in advance!

Hi smilbert,

did you check nginx log for segfaults regarding those pids? Or z-push.log if they were running for longer than 30 seconds? Are theses ProcessLoopDetectionPreviousConnectionFailed entries for FolderSync command only or also for other commands?
Do you have a proxy in front of nginx?
Do these android devices belong to different users?
When you disable Z-Push, are there some Webapp users? Or do you disable it as well?

Maybe it would help to reduce pm.max_requests value to e.g. 200.

Manfred

From a practical point of view - if your system is slow generating the responses to z-push you can run into a situation where the mobile has given up waiting, and re-issues the request while the server is still churning through gathering the data and preparing it. This can spiral out of control.

I would suggest you try reducing the

define('SYNC_MAX_ITEMS', 512);

in the config file to see if it alleviates the problem. Try 256 or even 128 and observe if it makes a difference. There is little downside for synced clients - obviously for new clients it will slow down the initial sync as more connection cycles will be needed to get the data over to the device.

If you identify that is helps you have a clue where to start looking - performance of your collaboration suite/z-push. .

Thank you both for your replies. As you suggested, I tried the following steps:

  • set “SYNC_MAX_ITEMS” to 25 and even down to 5
  • set “pm.max_requests” to 200 and even down to 100

Both attempts did not solve the problem. Before each change, I deacviated z-push for at least 6 Minutes so all clients would be disconnected. After the changes, I restartet php. Other things I recently and unsuccessfully tried:

  • set “SCRIPT_TIMEOUT” to “100”
  • set “PING_INTERVAL” to 1000
  • set “SYNC_CONTACTS_MAXPICTURESIZE” to 1242880
  • set “SEARCH_WAIT” and “SEARCH_MAXRESULTS” to 5 and even to 2
  • deactivate “additionalFolders”

And to answer your questions, Manfred:

  • My nginx error log shows plenty of entries like this: 2018/09/30 21:42:02 [error] 20417#20417: *249222 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 94.223.123.456, server: my.server.de, request: "POST /Microsoft-Server-ActiveSync?Cmd=Ping&User=my.user%40myDomain.de&DeviceId=android1234567890407&DeviceType=Android HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.0-fpm.sock:", host: "my.server.de"
  • I cannot find any interesting error messages in the z-push-logs. There are various entries with times of 300 seconds, though.
  • Lately, the "ProcessLoopDetectionPreviousConnectionFailed " did not appear anymore. I only find one of those in the logs - and that one is for “Provisioning” instead of “FolderSync”
  • I do not have a proxy in front of nginx.
  • The Android devices belong (mostly) to different users. Two users use more than one device, about three devices sync for more than one user.
  • When I disable z-push (by setting root-only permissions on its index.php), I leave Kopano Webapp fully accessible.

Thank you for your suggestions so far. Any additional help would be greatly appreciated!
Smilbert

Did you check your timeouts for:

?

++umgfoin

fastcgi_read_timeout was already set to the pretty high value of 3660.
And I increased max_execution_time to 360.
Still no improvement…

Hi Smilbert,

the Ping and Sync with heartbeat requests in Z-Push may be open up until 59 minutes (3540 seconds). fastcgi_read_timeout value in the nginx config already reflects that. However max_execution_time and request_terminate_timeout are way below this value. I’m not very familiar with nginx and php-fpm, but maybe you have to increase these two values as well?

Alternatively you could set PING_HIGHER_BOUND_LIFETIME in Z-Push config to some value lower than max_execution_time/request_terminate_timeout. Just be aware that this will result in a request every PING_HIGHER_BOUND_LIFETIME if there were no changes.

Manfred