Over the last 6 months I've been facing an issue I can't seem to get rid of: apparently random "too many files open" / "can't open TCP/IP socket (24)" / "getaddrinfo: can't open file" errors in my applications' logs.
I run the following stack: MariaDB, PostgreSQL, Memcached, Redis and a couple of Node-based applications inside Docker containers; Apache with Passenger running a Ruby on Rails (Ruby 2.5.5, Rails 6) application and Sidekiq; all on a CentOS 7 machine (kernel 3.10.0-1127.el7.x86_64) with 6 cores and 16 GB of RAM. Load averages about 10%, with small spikes during main business hours, and is almost never over 30%.
Initially I thought another Java app was causing the problem, but after shutting it down the issue still appears, only after a longer time.
Whatever I do, I cannot reproduce this from the CLI; it happens apparently at random, without any significant load on the machine.
1 hour after a service restart I have the following stats:
Total open files:
$ lsof | wc -l
30594
Top processes by open files:
$ lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
8260 mysqld
4804 node
2728 Passenger
2491 container
2095 postgres
1445 dockerd
1320 processor
817 php-fpm
720 httpd
709 ruby
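One caveat on these numbers: lsof output also contains entries that are not file descriptors (cwd, txt, mem and similar rows) and, depending on the lsof version, may repeat entries per thread, so the counts above can overstate real descriptor usage. A minimal sketch of counting actual descriptors per process from /proc follows. It assumes Linux and root privileges to see other users' processes; the function names count_fds and top_fd_holders are made up for illustration.

```shell
#!/bin/sh
# Count the descriptors one process actually holds by listing /proc/<pid>/fd.
# Prints 0 if the process is gone or its fd directory is not readable.
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Print "count pid command" for the ten processes holding the most descriptors.
top_fd_holders() {
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        n=$(count_fds "$pid")
        if [ "$n" -gt 0 ]; then
            printf '%s %s %s\n' "$n" "$pid" "$(cat "$dir/comm" 2>/dev/null)"
        fi
    done | sort -rn | head
}

top_fd_holders
```

Unlike the lsof pipeline, this reports one row per PID rather than per command name, so several mysqld or node processes show up separately.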
MariaDB variables:
MariaDB [(none)]> Show global variables like 'open_files_limit';
+------------------+-------+
| Variable_name | Value |
+------------------+-------+
| open_files_limit | 65535 |
+------------------+-------+
1 row in set (0.01 sec)
MariaDB [(none)]> Show global status like 'opened_files';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Opened_files | 6151 |
+---------------+-------+
1 row in set (0.00 sec)
I've raised the system-wide maximum number of open files (fs.file-max) to about 130k in sysctl.conf, thinking it would solve the problem, but it only buys me some time: the errors still appear, just later.
$ sysctl fs.file-nr
fs.file-nr = 3360 0 131070
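Note that error 24 is EMFILE, the per-process descriptor limit (RLIMIT_NOFILE), which is separate from the system-wide ceiling reported by fs.file-nr. Services started by systemd or inside Docker containers get their own limits rather than the ones from an interactive shell, so it may be worth comparing each process's usage against its own soft limit. A rough sketch, assuming Linux /proc and root privileges; soft_nofile_limit and fd_usage_report are hypothetical helper names.

```shell
#!/bin/sh
# Print the soft "Max open files" limit of a running process.
# In /proc/<pid>/limits the fourth whitespace-separated field is the soft limit.
soft_nofile_limit() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits" 2>/dev/null
}

# Print "used limit pid command" for the ten processes using the most descriptors.
fd_usage_report() {
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        limit=$(soft_nofile_limit "$pid")
        used=$(ls "$dir/fd" 2>/dev/null | wc -l)
        if [ -n "$limit" ] && [ "$used" -gt 0 ]; then
            printf '%s %s %s %s\n' "$used" "$limit" "$pid" "$(cat "$dir/comm" 2>/dev/null)"
        fi
    done | sort -rn | head
}

fd_usage_report
```

A process whose first column approaches its second is the likely source of the errors, even when the system-wide total is far below fs.file-max.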
I just ran a quick "ab" test; the number of open files didn't go up much:
$ ab -n 1000 -c 10 http://www.example.com/
9589 mysqld
4804 node
4577 Passenger
3756 httpd
3225 postgres
2491 container
2166 utils.rb:
2080 ruby
1445 dockerd
1364 processor
This is probably irrelevant as a real user would not hit the homepage repeatedly.
I have a hunch that the culprit may be Docker somehow (I've run much busier servers without dockerizing the databases), but I would rather investigate other possibilities before moving the databases out of Docker, as that would be a very painful process.
I would appreciate some pointers.