Thanks as always, though I'm not sure what that gets us. Indirectly, it made me realize we are probably making some assumptions about what oomkiller does. Half the team has touched this case, so I'm not 100% sure who knows what.
Is the customer paranoid? I should probably not answer that.
That said, ultimately this isn't *that* big of a deal. We aren't talking about data loss. You just need to restart the service. Nonetheless, HA is HA, and this is not that.
All the processes are going to go down in a power failure situation, so I don't think this test does anything there. Maybe there is some other resource limiter (DDoS?) that this tests, but it seems to me we are really just testing oomkiller. I should probably have a better understanding of what exactly oomkiller does.
That all said, I figured out that the issue is an epoll_wait() that never returns (timeouts are a thing that exists). My assumption is that systemd adds some sort of epoll_wait() timeout, but I need to validate that assumption.
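
For reference, here is a minimal sketch of what a bounded wait looks like, using Python's select.epoll (which wraps epoll_wait() on Linux). The socketpair, the 0.1-second timeout, and the function names are made up for illustration and are not from the actual service. It only shows that a wait with a timeout hands control back even when nothing ever arrives.

```python
import select
import socket


def wait_for_event(ep, timeout_s):
    """Wait on an epoll object; returns [] if timeout_s elapses with no events."""
    return ep.poll(timeout_s)


def demo():
    # Hypothetical stand-in for the service's connection: a local socket pair.
    a, b = socket.socketpair()
    ep = select.epoll()
    ep.register(a.fileno(), select.EPOLLIN)
    try:
        # Nothing has been written yet, so this returns an empty list
        # after roughly 0.1 seconds instead of blocking forever.
        idle = wait_for_event(ep, 0.1)

        # Once the peer writes, the same call returns the readable fd.
        b.send(b"x")
        ready = wait_for_event(ep, 0.1)
        summary = [(fd == a.fileno(), bool(ev & select.EPOLLIN)) for fd, ev in ready]
        return idle, summary
    finally:
        ep.close()
        a.close()
        b.close()


if __name__ == "__main__":
    print(demo())
```

With no timeout (the default for poll()), the first call would sit there until something shows up, which matches the behavior I think we are seeing.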