Based on how the number of open files dropped to 15 or fewer after the error, I doubt the issue was caused by an actual leak.
I have had issues with not-quite FD leaks, where we would open the same file many times over the course of a task. It is not technically a leak, because we close all of the FDs at the end of the task. In particular, this meant that it slipped past the explicit FD leak detection logic in our test harness. It also worked flawlessly in our long-running stress tests.
For a while people assumed the high FD usage was legit, because it only showed up on tasks that involved thousands of files, and the required FD limit seemed to scale with the input file count.
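To illustrate why an end-of-task check misses this, here is a minimal sketch. It assumes Linux or macOS, where `/dev/fd` lists the current process's descriptors. The names `open_fd_count`, `redundant_task`, and `run_demo` are made up for the example and are not the actual harness code.

```python
import os
import tempfile


def open_fd_count():
    # /dev/fd lists this process's descriptors on Linux and macOS (assumption).
    # Listing the directory briefly uses a descriptor itself, so only the
    # difference between two readings is meaningful.
    return len(os.listdir("/dev/fd"))


def redundant_task(path, times):
    """Hypothetical task: reopen the same file `times` times, then close everything."""
    baseline = open_fd_count()
    handles = []
    peak = 0
    try:
        for _ in range(times):
            handles.append(open(path, "rb"))
            # Track the high-water mark while the task is still running.
            peak = max(peak, open_fd_count() - baseline)
    finally:
        for handle in handles:
            handle.close()
    # This is all an end-of-task leak check would see.
    leaked = open_fd_count() - baseline
    return peak, leaked


def run_demo(times=50):
    with tempfile.NamedTemporaryFile() as tmp:
        return redundant_task(tmp.name, times)


if __name__ == "__main__":
    peak, leaked = run_demo()
    print(f"peak extra FDs: {peak}, still open after task: {leaked}")
```

The end-of-task difference comes out as zero, so a leak detector stays quiet, while the peak grows with the number of redundant opens. Recording the peak rather than the final count is what would have flagged the problem.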