Help solving an elusive file descriptor leak

Question

Over the last 6 months I've been facing an issue I can't seem to get rid of: apparently random "too many files open" / "can't open TCP/IP socket (24)" / "getaddrinfo: can't open file" errors in my applications' logs.

I run the following stack: MariaDB, PostgreSQL, memcached, Redis and a couple of Node-based applications inside Docker containers, plus Apache with Passenger running a Ruby on Rails application (Ruby 2.5.5, Rails 6) and Sidekiq, all on a CentOS 7 machine (3.10.0-1127.el7.x86_64) with 6 cores and 16 GB of RAM. Load averages about 10%, with small spikes during main business hours, and is almost never over 30%.

Initially I thought another Java app was causing the problem, but after shutting it down the issue still shows up, only after a longer time.

Whatever I do, I cannot reproduce this from the CLI. It happens apparently at random, without any significant load on the machine.

1 hour after a service restart I have the following stats:

Total open files

$ lsof | wc -l
30594

Top processes by open files

$ lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
   8260 mysqld
   4804 node
   2728 Passenger
   2491 container
   2095 postgres
   1445 dockerd
   1320 processor
    817 php-fpm
    720 httpd
    709 ruby
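
A caveat on these numbers: "lsof | wc -l" counts every line lsof prints, and that includes entries that are not file descriptors (cwd, txt, mem for memory-mapped libraries) and, depending on the lsof version, repeated entries per thread. The totals above therefore likely overstate real descriptor usage. A minimal sketch that counts the entries under /proc/<pid>/fd instead; the function name fd_counts and its optional argument are made up for illustration:

```shell
# Count open file descriptors per process by listing /proc/<pid>/fd.
# Prints "count pid command", highest count first.
# $1 (optional, hypothetical): alternative proc root, defaults to /proc.
fd_counts() {
    proc=${1:-/proc}
    for dir in "$proc"/[0-9]*; do
        pid=${dir##*/}
        # ls fails for processes that exited or that we may not inspect
        n=$(ls "$dir/fd" 2>/dev/null | wc -l) || n=0
        [ "$n" -gt 0 ] || continue
        name=$(cat "$dir/comm" 2>/dev/null) || name='?'
        printf '%s %s %s\n' "$n" "$pid" "$name"
    done | sort -rn
}

# Top ten processes by real descriptor count
fd_counts | head -n 10
```

Run it as root, otherwise only your own processes are readable and the rest are silently skipped.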

Mariadb variables:

MariaDB [(none)]> Show global variables like 'open_files_limit';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| open_files_limit | 65535 |
+------------------+-------+
1 row in set (0.01 sec)

MariaDB [(none)]> Show global status like 'opened_files';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Opened_files  | 6151  |
+---------------+-------+
1 row in set (0.00 sec)

I've set the maximum number of open files to 130k in sysctl.conf, thinking it would solve the problem, but it only buys me some time; the errors still show up, just later.

$ sysctl fs.file-nr
fs.file-nr = 3360   0   131070
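
For reading this output: the three fs.file-nr fields are allocated file handles, allocated but unused handles, and the system-wide maximum. With 3360 of 131070 allocated, the system-wide limit is nowhere near exhausted. On Linux, error 24 is EMFILE, which is a per-process (or per-user) limit, while hitting fs.file-max would give ENFILE (23). A rough sketch for checking both; the function names are hypothetical:

```shell
# Summarise a file-nr style file: allocated, unused, maximum, percent used.
# $1 (optional): path to read, defaults to the live kernel value.
file_nr_usage() {
    read -r allocated unused max < "${1:-/proc/sys/fs/file-nr}"
    echo "allocated=$allocated unused=$unused max=$max pct=$((allocated * 100 / max))"
}

# Print the soft "Max open files" limit from a /proc/<pid>/limits style file.
nofile_soft_limit() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Live values, only where /proc is available
if [ -r /proc/sys/fs/file-nr ]; then
    file_nr_usage
    nofile_soft_limit "/proc/$$/limits"
fi
```

Replace $$ with the PID of mysqld, a Passenger worker and so on to see what each service actually runs with. For services started by systemd on CentOS 7, that limit comes from LimitNOFILE in the unit file, not from sysctl.conf.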

I just ran a quick "ab" test; the number of open files didn't go up much:

$ ab -n 1000 -c 10 http://www.example.com/

Top processes by open files after the test

   9589 mysqld
   4804 node
   4577 Passenger
   3756 httpd
   3225 postgres
   2491 container
   2166 utils.rb:
   2080 ruby
   1445 dockerd
   1364 processor

This is probably irrelevant as a real user would not hit the homepage repeatedly.

I have a hunch that the culprit may be Docker somehow (I've run much busier servers without dockerizing the databases), but I would rather investigate other possibilities before moving the databases out of Docker, as that will be a very painful process.

I would appreciate some pointers.

Nick M · Accepted Answer · 2020-11-03 12:20:45Z

This was caused by having only 4096 inotify handlers. I've increased the limits and the issue is gone.

fs.file-max = 131070
fs.inotify.max_user_watches = 65536
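
The answer does not say which process was holding the inotify handles. One way to find out, sketched here under the assumption that you can read /proc as root: every inotify instance shows up as a descriptor whose link target is anon_inode:inotify. The function name inotify_instances and its optional argument are made up for illustration:

```shell
# Count inotify instances per PID by inspecting descriptor link targets.
# Prints "count pid" (uniq -c style), highest count first.
# $1 (optional, hypothetical): alternative proc root, defaults to /proc.
inotify_instances() {
    proc=${1:-/proc}
    for link in "$proc"/[0-9]*/fd/*; do
        [ "$(readlink "$link" 2>/dev/null)" = "anon_inode:inotify" ] || continue
        pid=${link#"$proc"/}   # strip the proc root, leaving "<pid>/fd/<n>"
        pid=${pid%%/*}         # keep only "<pid>"
        echo "$pid"
    done | sort | uniq -c | sort -rn
}

# Top ten PIDs by number of inotify instances
inotify_instances | head -n 10
```

The limits currently in effect can be read with "sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches".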


The title of the question says that there's a leak. Was/is there actually a leak? If so, increasing the file descriptor limit isn't solving the root of the issue and you're at risk of the same error happening in the future. I'm running into a similar issue.

– kylejw2, Feb 8, 2023 at 23:34

It was and still is a leak. I can't fix the application causing it, so I've bumped the limits. I had to go as high as 20M on a few larger machines.

– Nick M, Mar 15, 2023 at 21:54
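
Since the leak itself is still there, it may help to identify the leaking process by comparing descriptor counts over time. A minimal sketch, assuming two snapshot files taken some time apart, each with lines of the form "count pid command"; the function name fd_growth and the snapshot format are assumptions, not something from the thread:

```shell
# fd_growth OLD NEW
# Print "delta pid command" for every process whose descriptor count grew
# between the two snapshots, largest growth first. Processes that appear
# only in NEW are ignored.
fd_growth() {
    awk 'FILENAME == ARGV[1] { old[$2] = $1 + 0; next }
         ($2 in old) && ($1 + 0 > old[$2]) { print $1 - old[$2], $2, $3 }' \
        "$1" "$2" | sort -rn
}

# Example (hypothetical file names):
#   fd_growth /tmp/fds.0900 /tmp/fds.1500
```

A process that keeps appearing at the top across several intervals, while the load stays flat, is a reasonable first suspect.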
