z-push creates php threads (not processes) until server crashes
-
I’ve been struggling with the following problem for a while now and would very much appreciate your help.
My webserver completely crashes on a regular basis due to too many php threads. The number of php processes is stable within a normal range (e.g. 35 processes). The number of threads per process constantly increases over time. After a couple of hours the total sum of all threads in all php processes gets greater than 900 - and that’s when normal server operations start to shut down.
The threads are somehow tied to z-push. After disabling z-push and restarting php, the number of threads per process is 1 and stays that way. I am using z-push to sync kopano data to about 7 different android smartphones.
I am running
- a virtual server
- 8 GB RAM
- 4 vCores
- ubuntu 16.04
- nginx 1.10.3
- php-fpm 7.0.3
- kopano core 8.4.90
- kopano webapp 3.4.3
- z-push 2.4.4
my …/fpm/pool.d/www.conf looks like this (alphabetical order):
user = www-data request_terminate_timeout = 600s pm.start_servers = 20 pm.min_spare_servers = 20 pm.max_spare_servers = 35 pm.max_requests = 500 pm.max_children = 72 pm = dynamic listen.owner = www-data listen.group = www-data listen = /run/php/php7.0-fpm.sock group = www-data env[TMP] = /tmp env[TMPDIR] = /tmp env[TEMP] = /tmp env[PATH] = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games env[HOSTNAME] = $HOSTNAME
And my nginx configuration looks like this:
user www-data; worker_processes auto; pid /run/nginx.pid; events { worker_connections 1024; # multi_accept on; } http { client_max_body_size 501m; sendfile on; tcp_nopush on; tcp_nodelay on; keepalive_timeout 3; types_hash_max_size 2048; server_names_hash_max_size 8192; include /etc/nginx/mime.types; default_type application/octet-stream; log_format main '$remote_addr - $host $remote_user [$time_local] "$request" ' '$status $body_bytes_sent "$http_referer" ' '"$http_user_agent" "$http_x_forwarded_for"'; ssl_protocols TLSv1 TLSv1.1 TLSv1.2; ssl_prefer_server_ciphers on; map $status $loggable { ~^404 0; default 1; } access_log /var/log/nginx/access.log; error_log /var/log/nginx/error.log ; gzip on; gzip_disable "msie6"; application/xml+rss text/javascript; gzip_min_length 1100; gzip_vary on; gzip_comp_level 2; gzip_proxied any; gzip_types text/plain text/css application/x-javascript text/xml application/xml application/xml+rss text/javascript; include /etc/nginx/conf.d/*.conf; include /etc/nginx/sites-enabled/*; limit_conn_zone $binary_remote_addr zone=addr:10m; limit_conn addr 50; fastcgi_cache_path /etc/nginx/cache levels=1:2 keys_zone=YOURAPP:100m inactive=60m; fastcgi_cache_key "$scheme$request_method$host$request_uri"; upstream php { server unix:/run/php/php7.0-fpm.sock; } map $http_upgrade $connection_upgrade { default upgrade; '' close; } } server { listen 443; server_name my.server.de; # ssl specs here # some general location stuff here root /usr/share/z-push; index index.php; fastcgi_param QUERY_STRING $query_string; fastcgi_param REQUEST_METHOD $request_method; fastcgi_param CONTENT_TYPE $content_type; fastcgi_param CONTENT_LENGTH $content_length; fastcgi_param SCRIPT_NAME $fastcgi_script_name; fastcgi_param REQUEST_URI $request_uri; fastcgi_param DOCUMENT_URI $document_uri; fastcgi_param DOCUMENT_ROOT $document_root; fastcgi_param SERVER_PROTOCOL $server_protocol; fastcgi_param REQUEST_SCHEME $scheme; fastcgi_param HTTPS $https if_not_empty; fastcgi_param GATEWAY_INTERFACE CGI/1.1; fastcgi_param SERVER_SOFTWARE nginx/$nginx_version; fastcgi_param REMOTE_ADDR $remote_addr; fastcgi_param REMOTE_PORT $remote_port; fastcgi_param SERVER_ADDR $server_addr; fastcgi_param SERVER_PORT $server_port; fastcgi_param SERVER_NAME $server_name; fastcgi_param REDIRECT_STATUS 200; fastcgi_index index.php; fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name; fastcgi_param REQUEST_URI $1; fastcgi_param PHP_FLAG "magic_quotes_gpc=off \n register_globals=off \n magic_quotes_runtime=off \n short_open_tag=on"; fastcgi_param PHP_VALUE "post_max_size=20M \n upload_max_filesize=20M \n max_execution_time=3660"; fastcgi_param HTTP_PROXY ""; fastcgi_read_timeout 3660; location ~* /Microsoft-Server-ActiveSync { alias /usr/share/z-push/index.php; access_log /var/log/nginx/z-push-access.log; error_log /var/log/nginx/z-push-error.log; client_max_body_size 20m; client_body_buffer_size 128k; fastcgi_pass unix:/var/run/php/php7.0-fpm.sock; } location ~* /AutoDiscover/AutoDiscover.xml { alias /usr/share/z-push/autodiscover/autodiscover.php; access_log /var/log/nginx/z-push-autodiscover-access.log; error_log /var/log/nginx/z-push-autodiscover-error.log; fastcgi_pass unix:/var/run/php/php7.0-fpm.sock; } location / { try_files $uri $uri/ index.php; } }
And of course z-push-conf.php:
<?php define('TIMEZONE', 'Europe/Berlin'); setlocale(LC_CTYPE, "de_DE.UTF-8"); define('BASE_PATH', dirname($_SERVER['SCRIPT_FILENAME']). '/'); define('SCRIPT_TIMEOUT', 0); define('USE_CUSTOM_REMOTE_IP_HEADER', false); define("CERTIFICATE_OWNER_PARAMETER", "SSL_CLIENT_S_DN_CN"); define('USE_FULLEMAIL_FOR_LOGIN', true); define('STATE_MACHINE', 'FILE'); define('STATE_DIR', '/var/lib/z-push/'); define('IPC_PROVIDER', ''); define('LOGBACKEND', 'filelog'); define('LOGLEVEL', LOGLEVEL_ERROR); define('LOGAUTHFAIL', true); define('LOGUSERLEVEL', LOGLEVEL_DEVICEID); $specialLogUsers = array(); define('LOGFILEDIR', '/var/log/z-push/'); define('LOGFILE', LOGFILEDIR . 'z-push.log'); define('LOGERRORFILE', LOGFILEDIR . 'z-push-error.log'); define('LOG_SYSLOG_HOST', false); define('LOG_SYSLOG_PORT', 514); define('LOG_SYSLOG_PROGRAM', 'z-push'); define('LOG_SYSLOG_FACILITY', LOG_LOCAL0); define('PROVISIONING', false); define('LOOSE_PROVISIONING', false); define('PROVISIONING_POLICYFILE', 'policies.ini'); define('SYNC_CONFLICT_DEFAULT', SYNC_CONFLICT_OVERWRITE_PIM); define('SYNC_FILTERTIME_MAX', SYNC_FILTERTYPE_ALL); define('PING_INTERVAL', 300); define('FILEAS_ORDER', SYNC_FILEAS_LASTFIRST); define('SYNC_MAX_ITEMS', 25); define('UNSET_UNDEFINED_PROPERTIES', false); define('SYNC_CONTACTS_MAXPICTURESIZE', 5242880); define('ALLOW_WEBSERVICE_USERS_ACCESS', false); define('USE_PARTIAL_FOLDERSYNC', false); define('PING_LOWER_BOUND_LIFETIME', false); define('PING_HIGHER_BOUND_LIFETIME', 320); define('SYNC_TIMEOUT_SHORT_DEVICETYPES', ""); define('SYNC_TIMEOUT_MEDIUM_DEVICETYPES', ""); define('SYNC_TIMEOUT_LONG_DEVICETYPES', "iPod, iPad, iPhone, WP, WindowsOutlook, WindowsMail"); define('RETRY_AFTER_DELAY', 300); define('BACKEND_PROVIDER', ''); define('SEARCH_PROVIDER', ''); define('SEARCH_WAIT', 10); define('SEARCH_MAXRESULTS', 10); define('KOE_CAPABILITY_GAB', true); define('KOE_CAPABILITY_RECEIVEFLAGS', true); define('KOE_CAPABILITY_SENDFLAGS', true); define('KOE_CAPABILITY_OOF', true); define('KOE_CAPABILITY_OOFTIMES', true); define('KOE_CAPABILITY_NOTES', true); define('KOE_CAPABILITY_SHAREDFOLDER', true); define('KOE_CAPABILITY_SENDAS', true); define('KOE_CAPABILITY_SECONDARYCONTACTS', true); define('KOE_CAPABILITY_SIGNATURES', true); define('KOE_CAPABILITY_RECEIPTS', true); define('KOE_CAPABILITY_IMPERSONATE', true); define('KOE_GAB_STORE', 'SYSTEM'); define('KOE_GAB_FOLDERID', ''); define('KOE_GAB_NAME', 'Z-Push-KOE-GAB'); $additionalFolders = array( array( 'store' => "SYSTEM", 'folderid' => "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", 'name' => "IWDR-Kontakte", 'type' => SYNC_FOLDER_TYPE_USER_CONTACT, 'flags' => DeviceManager::FLD_FLAGS_NONE ), );
And finally, I end up with plenty of entries in z-push.log which look like these:
22/09/2018 07:22:07 [ 8782] [ERROR] [name@server.de] Please check your logs for this PID and errors like PHP-Fatals or Apache segmentation faults and report your results to the Z-Push dev team. 22/09/2018 08:08:06 [16667] [ERROR] [name@server.de] LoopDetection->ProcessLoopDetectionPreviousConnectionFailed(): Command 'FolderSync' at 22/09/2018 08:08:01 with pid '15348' terminated unexpectedly or is still running.
I understand that this kind of problem is often connected to timeouts. Can you give me advice on the best practice to avoid these timeouts?
Are these timeouts also the reason for the ever growing number of threads?Any help would be very much appreciated. Thank you in advance!
-
Hi smilbert,
did you check nginx log for segfaults regarding those pids? Or z-push.log if they were running for longer than 30 seconds? Are theses ProcessLoopDetectionPreviousConnectionFailed entries for FolderSync command only or also for other commands?
Do you have a proxy in front of nginx?
Do these android devices belong to different users?
When you disable Z-Push, are there some Webapp users? Or do you disable it as well?Maybe it would help to reduce pm.max_requests value to e.g. 200.
Manfred
-
From a practical point of view - if your system is slow generating the responses to z-push you can run into a situation where the mobile has given up waiting, and re-issues the request while the server is still churning through gathering the data and preparing it. This can spiral out of control.
I would suggest you try reducing the
define('SYNC_MAX_ITEMS', 512);
in the config file to see if it alleviates the problem. Try 256 or even 128 and observe if it makes a difference. There is little downside for synced clients - obviously for new clients it will slow down the initial sync as more connection cycles will be needed to get the data over to the device.
If you identify that is helps you have a clue where to start looking - performance of your collaboration suite/z-push. .
-
Thank you both for your replies. As you suggested, I tried the following steps:
- set “SYNC_MAX_ITEMS” to 25 and even down to 5
- set “pm.max_requests” to 200 and even down to 100
Both attempts did not solve the problem. Before each change, I deacviated z-push for at least 6 Minutes so all clients would be disconnected. After the changes, I restartet php. Other things I recently and unsuccessfully tried:
- set “SCRIPT_TIMEOUT” to “100”
- set “PING_INTERVAL” to 1000
- set “SYNC_CONTACTS_MAXPICTURESIZE” to 1242880
- set “SEARCH_WAIT” and “SEARCH_MAXRESULTS” to 5 and even to 2
- deactivate “additionalFolders”
And to answer your questions, Manfred:
- My nginx error log shows plenty of entries like this:
2018/09/30 21:42:02 [error] 20417#20417: *249222 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 94.223.123.456, server: my.server.de, request: "POST /Microsoft-Server-ActiveSync?Cmd=Ping&User=my.user%40myDomain.de&DeviceId=android1234567890407&DeviceType=Android HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.0-fpm.sock:", host: "my.server.de"
- I cannot find any interesting error messages in the z-push-logs. There are various entries with times of 300 seconds, though.
- Lately, the "ProcessLoopDetectionPreviousConnectionFailed " did not appear anymore. I only find one of those in the logs - and that one is for “Provisioning” instead of “FolderSync”
- I do not have a proxy in front of nginx.
- The Android devices belong (mostly) to different users. Two users use more than one device, about three devices sync for more than one user.
- When I disable z-push (by setting root-only permissions on its index.php), I leave Kopano Webapp fully accessible.
Thank you for your suggestions so far. Any additional help would be greatly appreciated!
Smilbert -
Did you check your timeouts for:
- php-fpm: max_execution_time
- nginx: fastcgi_read_timeout
?
++umgfoin
-
fastcgi_read_timeout was already set to the pretty high value of 3660.
And I increased max_execution_time to 360.
Still no improvement… -
Hi Smilbert,
the Ping and Sync with heartbeat requests in Z-Push may be open up until 59 minutes (3540 seconds). fastcgi_read_timeout value in the nginx config already reflects that. However max_execution_time and request_terminate_timeout are way below this value. I’m not very familiar with nginx and php-fpm, but maybe you have to increase these two values as well?
Alternatively you could set PING_HIGHER_BOUND_LIFETIME in Z-Push config to some value lower than max_execution_time/request_terminate_timeout. Just be aware that this will result in a request every PING_HIGHER_BOUND_LIFETIME if there were no changes.
Manfred
-
Hi Manfred,
again, thank you for your advice. To sum it all up: After trying everything mentioned above, my problem still hasn’t changed. (However, the segmentation faults are gone now. They appearently weren’t connected to the thread-flooding-problem after all.)
So I concentrated on further analyzing the problem using z-push-top.php. As it turns out, the threads are not spawned after the usual server side timeouts we discussed above. In z-push-top, I see many connections turn grey (–> terminated) after 30 seconds. And each time that happens, the php thread count goes up by one.
If I understand the comment in z-push-config correctly, then the 30s-Limit is not set on the server side but it is defined by the android clients, right? Do you have any suggestions on what I can do to solve that?Thank you and best regards
Smilbert -
Hi Smilbert,
yes, the Android clients have a built-in 30 seconds timeout for a request and if they don’t receive a response they will issue a new request. However it’s different for Ping requests as the devices send Lifetime/HeartbeatInterval parameter (which could be up to 3540 seconds) and also honour it.
Did you notice which connections are mostly turning grey? Ping? Sync?It might be that the processes aren’t being terminated correctly or there’s some issue with PHP-FPM, but unfortunately I don’t have any other suggestions.
Manfred
-
Hi Manfred,
I’d say that more than 90% of them are ping processes, the rest are Syncs and FolderSyncs.
In most cases, the column “Additional Information” just states “OK : lifetime 30s” or simply “OK” for the ping terminations.
Straaaaange!Stefan
-
It is very strange that you are getting 30s lifetimes.
If you look at the active Ping or Sink commands, what do you see immediately after the Ping/Sync - it sould show elapsed/maximum seconds and then the list of folders
-
One Month ago I updated Kopano Core from 8.4.90 to 8.6.82. That actually solved the problem. Now there’s only one thread per process. Better late than never. Thank you all for your help.