OOMProf: Profiling on the Brink

by ingve

devin

I find OOM kills are one of the most frequently misunderstood ops issues. I've often seen them treated as "This application ran out of memory". I think that "OOM Killed" is a bad way of describing to a user debugging an issue what is happening in the first place, never mind the rabbit hole of investigation that can follow it.

xmichael909

OOMKiller... in most cases where it has killed things, I feel it would’ve been far better to just let the system slog along and spill onto disk instead of killing the process outright, or, as the article says, killing the wrong process like it always seems to do.

jcelerier

Whenever this happens I have to reboot my system anyway because it becomes unusable, e.g. it becomes impossible to type / move the mouse for >1 hour. Between hitting the hardware reset button of my computer (or sometimes the power plug on more modern userspace-based reset buttons) vs just killing a process, I know what I prefer.

gdbsjjdn

You know that's exactly what swap is, right? You can enable swapping if that's the behaviour you want.
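
For anyone who wants to check whether swap is actually configured before arguing about it, here is a minimal sketch. It assumes Linux and the usual `/proc/meminfo` layout (`SwapTotal` and `SwapFree` reported in kB); the helper names `parse_meminfo` and `swap_summary` are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_summary(fields):
    """Summarise swap state from parsed meminfo fields (values assumed to be in kB)."""
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return {"enabled": total > 0, "total_kb": total, "used_kb": total - free}
```

On a Linux box you would feed it the real file, e.g. `swap_summary(parse_meminfo(open("/proc/meminfo").read()))`. If `enabled` comes back `False`, the kernel has nowhere to spill anonymous memory to, so the OOM killer is the only way out under pressure.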

xmichael909

Thanks Tips....

baq

default linux vm settings are abysmal. ratios are all wrong, overcommit is... questionable, cache flushes happen too late. my worst experiences are when there's a shitton of free memory and my processes get OOM killed, and not even due to fragmentation.

Bender

I've not seen OOM on any of my systems for a very long time, but I also set overcommit ratio to 0 and vm.min_free_kbytes to a higher number based on a formula. I have not allocated swap in a couple of decades, even in tiny VMs at VPS providers. If memory gets tight I move apps to a node with more memory and leave plenty free for inode/dentry/page cache. Unused RAM is never wasted.
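
To see what a given machine is actually running with, a rough sketch like the one below reads the relevant knobs from `/proc/sys/vm`. It assumes Linux; the function name `read_vm_settings` and the choice of keys are illustrative only. The comment above doesn't spell out which overcommit knob it means, so this reads both `overcommit_memory` (the mode: 0 heuristic, 1 always, 2 strict) and `overcommit_ratio` (only used in mode 2). It only reports values and doesn't recommend any.

```python
from pathlib import Path

# Sysctl files under /proc/sys/vm that come up in OOM discussions.
VM_KEYS = ("overcommit_memory", "overcommit_ratio", "min_free_kbytes", "swappiness")


def read_vm_settings(base="/proc/sys/vm", keys=VM_KEYS):
    """Return {key: int or None} for each sysctl file found under `base`.

    A key maps to None when the file is missing, unreadable, or not an integer.
    """
    settings = {}
    for key in keys:
        path = Path(base) / key
        try:
            settings[key] = int(path.read_text().strip())
        except (OSError, ValueError):
            settings[key] = None
    return settings
```

Calling `read_vm_settings()` with no arguments reads the live values. The `base` parameter exists only so the function can be pointed at a saved copy of the directory from another host.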
