[16:08] <JoshX> Hi! I have multiple ubuntu 14.04.5LTS systems running on the same hardware with the same kernel producing the same kernel crashes..
[16:09] <JoshX> crashlogs are here: https://pastebin.com/xnGmtLNg and https://pastebin.com/0vQX9fKz for example
[16:10] <JoshX> kernels are  Linux pc7-1428 4.4.0-116-generic #140~14.04.1-Ubuntu SMP Fri Feb 16 09:25:20 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[16:10] <JoshX> on all systems
[16:10] <JoshX> every time this happens, the machine is under medium IO load, low memory load
[16:11] <JoshX> and after the crash the machine behaves very erratically.. processes crash at random, and reboot and shutdown fail
[16:11] <JoshX> so far we've had this on 9 systems running identical hardware and software
[16:12] <JoshX> doing mainly ffmpeg work: h264 stream -> pipeline -> JPG frames written to disk (disks are connected via USB in an MD RAID1 mirror, md0)
[16:14] <TJ-> JoshX: looks like it could be due to fragmented memory allocations 
[16:14] <JoshX> TJ-: is this something I can do anything about?
[16:14] <JoshX> we're running stock 14.04.5LTS with no exciting software on average hardware
[16:15] <JoshX> Intel(R) Core(TM) i7-5550U CPU @ 2.00GHz with 16GB DDR3L RAM
[16:15] <TJ-> JoshX: I'm only guessing, but if the processes are requesting many smaller allocations (I infer that from the stack trace showing compaction_alloc) it's possible that eventually the kernel cannot create a contiguous block large enough for a request
[16:16] <JoshX> I understand, but am i running out of memory or out of large enough fragments?
[16:16] <TJ-> you'll notice the stack trace passes through do_huge_pmd_anonymous, try_to_compact_pages etc
[16:17] <TJ-> so I suspect fragmentation has ended up in a situation where even compaction cannot handle things (possibly there's also a bug there that ought not to be)
[16:18] <TJ-> I think there are some kernel 'knobs' you can twiddle to influence these things (sysctl)
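[Editor's note: the 'knobs' TJ- alludes to do exist on a 4.4-era kernel; a sketch of the ones most relevant to compaction and fragmentation follows. The values shown are illustrative examples, not tuned recommendations.]

```shell
# Illustrative only: knobs that influence compaction/fragmentation on a
# 4.4-era Ubuntu kernel. Run as root; values are examples, not advice.

# Trigger a full manual compaction pass (one-shot, safe to repeat):
echo 1 > /proc/sys/vm/compact_memory

# Keep a larger reserve of free pages, which helps the kernel satisfy
# high-order (contiguous) allocation requests:
sysctl -w vm.min_free_kbytes=131072

# Restrict transparent huge pages to madvise()d regions, so the
# do_huge_pmd_anonymous path in the trace only fires for opted-in apps:
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/transparent_hugepage/defrag
```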
[16:19] <JoshX> perhaps I can add a PPA and try an upstream/newer/beta kernel?
[16:19] <JoshX> or enable something to produce more/better logging?
[16:19] <JoshX> the crashes are so random I have yet to reproduce them at will
[16:21] <TJ-> JoshX: do they happen after the systems have been working for some time ?
[16:21] <JoshX> well, judging by the logs, these 2 systems crashed after 200k sec and 400k sec
[16:22] <JoshX> so 2.5 and 4 days or so
[16:22] <JoshX> not too long
[16:22] <TJ-> JoshX: if those processes are fragmenting memory that's about what I'd expect - I wouldn't expect this immediately after boot
[16:23] <JoshX> but we're running some nodejs processes, rabbitmq, ffmpeg
[16:23] <JoshX> and inotify on file close events
[16:23] <JoshX> on ~40 files / sec
[16:24] <JoshX> so we're handling tons of IO events on file closes to move files to the right place
[16:24] <JoshX> and we're queuing stuff in rabbitmq.. 
[16:25] <JoshX> could be erlang/rabbitmq messing things up
[16:25] <JoshX> could be nodejs
[16:25] <JoshX> but the end result is always the same, an unstable machine that is pingable and accessible via ssh
[16:26] <JoshX> but which does not work anymore.. (so monitoring this sucks since all seems normal)
[16:27] <JoshX> can i perform a command on these machines to see how bad fragmentation is at this moment?
[16:28] <JoshX> so I could log things over time and keep a graph or something to see if the problem is slowly getting worse?
[16:29] <TJ-> JoshX: try "echo m > /proc/sysrq-trigger"
[16:29] <TJ-> JoshX: then check dmesg
[16:30] <JoshX> https://pastebin.com/Xu6130pW
[16:30] <TJ-> JoshX: oh, forget that, we have buddy now! "cat /proc/buddyinfo"
[16:30] <JoshX> root@pc7-1428:/var/log# cat /proc/buddyinfo
[16:30] <JoshX> Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
[16:30] <JoshX> Node 0, zone    DMA32    325    115     59     36     10      8     50     44     26      0      0
[16:31] <JoshX> Node 0, zone   Normal     20     22      4     12      5      3    101     84     44      0      0
[16:31] <JoshX> 'erm..'
[16:31] <JoshX> is this good? bad? :)
[16:32] <TJ-> the numbers are counts of free blocks of 4KiB, 8KiB, 16KiB....
[16:33] <TJ-> you might want to read the docs in the ./Documentation/ directory of the kernel source for the details of that
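[Editor's note: TJ-'s column description can be made concrete with a small helper. Column N of each buddyinfo row counts free blocks of 2^N contiguous 4KiB pages; the helper below (hypothetical name `buddy_to_kib`, assuming a 4KiB page size) converts those counts to free KiB per order, so the near-zero rightmost columns that starve high-order allocations stand out.]

```shell
# buddy_to_kib: print each buddyinfo row as free KiB per order.
# Column N counts free blocks of 2^N contiguous 4KiB pages, so the
# rightmost columns are the large (up to 4MiB) blocks that compaction
# and THP allocations need.
buddy_to_kib() {
    awk '{
        printf "zone %-7s:", $4
        for (i = 5; i <= NF; i++)
            printf " %d", $i * 2 ^ (i - 5) * 4   # blocks * block size in KiB
        print ""
    }'
}

# Example (on a live system):
#   buddy_to_kib < /proc/buddyinfo
```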
[16:33] <JoshX> I have a system in a stress test now that's reporting higher numbers
[16:34] <JoshX> root@pc7-1422:~# cat /proc/buddyinfo
[16:34] <JoshX> Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
[16:34] <JoshX> Node 0, zone    DMA32    283     94     18     16      7     10     10     38     39      2      0
[16:34] <JoshX> Node 0, zone   Normal    209   3879     31      9     11     19     23      7      4     10      1
[16:35] <JoshX> and the memory info from dmesg: https://pastebin.com/q6iZC3Sk
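[Editor's note: the over-time logging JoshX asked about earlier can be sketched as below. The function name `log_buddyinfo` and the `BUDDYINFO`/`BUDDYLOG` overrides are hypothetical, added here so the sketch is testable off-box; on a real system the defaults apply.]

```shell
# Minimal fragmentation logger (sketch): append a timestamped copy of
# /proc/buddyinfo to a log file so the free-block counts can be graphed
# over time. BUDDYINFO and BUDDYLOG are overridable for testing.
log_buddyinfo() {
    src="${BUDDYINFO:-/proc/buddyinfo}"
    dst="${BUDDYLOG:-/var/log/buddyinfo.log}"
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    # one "timestamp Node N, zone NAME counts..." line per zone,
    # easy to grep for a zone and feed into gnuplot or similar
    while read -r line; do
        printf '%s %s\n' "$ts" "$line"
    done < "$src" >> "$dst"
}

# Example: sample once a minute from cron:
#   * * * * * root /usr/local/sbin/log_buddyinfo
```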
[16:35] <TJ-> you might try adjusting the memory overcommit settings
[16:37] <JoshX> how might I do that?
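[Editor's note: a sketch of the overcommit settings TJ- refers to, documented in Documentation/vm/overcommit-accounting in the kernel source. The values shown are examples, not recommendations for this workload.]

```shell
# Inspect and adjust memory overcommit behaviour (run as root).
sysctl vm.overcommit_memory        # 0=heuristic (default), 1=always, 2=strict
sysctl -w vm.overcommit_memory=2   # strict accounting: refuse over-commitment
sysctl -w vm.overcommit_ratio=80   # with mode 2: commit limit = swap + 80% of RAM

# To persist across reboots, add to /etc/sysctl.conf:
#   vm.overcommit_memory = 2
#   vm.overcommit_ratio = 80
```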
[17:00] <apw> it seems unreasonable for it to panic like that there