=== himcesjf_ is now known as him-cesjf
=== kamal is now known as Guest26711
[16:08] Hi! I have multiple Ubuntu 14.04.5 LTS systems running on the same hardware with the same kernel producing the same kernel crashes..
[16:09] crashlogs are here: https://pastebin.com/xnGmtLNg and https://pastebin.com/0vQX9fKz for example
[16:10] kernels are Linux pc7-1428 4.4.0-116-generic #140~14.04.1-Ubuntu SMP Fri Feb 16 09:25:20 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[16:10] on all systems
[16:10] every time this happens, the machine is under medium IO load, low memory load
[16:11] and after the crash it behaves very erratically.. random crashing processes, reboot and shutdown fail
[16:11] so far we've had this on 9 systems running identical hardware and software
[16:12] doing mainly writing ffmpeg -> h264 stream -> pipeline -> JPG frames to disk (disks are connected via USB in a mirrored md0)
[16:14] JoshX: looks like it could be due to fragmented memory allocations
[16:14] TJ-: is this something I can do anything about?
[16:14] we're running stock 14.04.5 LTS with no exciting software on average hardware
[16:15] Intel(R) Core(TM) i7-5550U CPU @ 2.00GHz with 16GB DDR3L RAM
[16:15] JoshX: I'm only guessing, but if the processes are requesting many smaller allocations (I infer that from the stack trace showing compaction_alloc) it's possible that eventually the kernel cannot create a block large enough for a request
[16:16] I understand, but am I running out of memory or out of large enough fragments?
[16:16] you'll notice the stack trace passes through do_huge_pmd_anonymous, try_to_compact_pages etc.
[16:17] so I suspect fragmentation has ended up in a situation where even compaction cannot handle things (possibly there's a bug there that ought not to be as well)
[16:18] I think there are some kernel 'knobs' you can twiddle to influence these things (sysctl)
[16:19] perhaps I can add a PPA and try an upstream/newer/beta kernel?
[16:19] or enable something to produce more/better logging?
[16:19] the crashes are so random I have yet to reproduce them at will
[16:21] JoshX: do they happen after the systems have been working for some time?
[16:21] well these 2 systems crashed after 200k sec and 400k sec, going by the logs
[16:22] so 2.5 and 4 days or so
[16:22] not too long
[16:22] JoshX: if those processes are fragmenting memory that's about what I'd expect - I wouldn't expect this immediately after boot
[16:23] but we're running some nodejs processes, rabbitmq, ffmpeg
[16:23] and inotify on file close events
[16:23] on ~40 files / sec
[16:24] so we're handling tons of IO events on file closes to move files to the right place
[16:24] and we're queuing stuff in rabbitmq..
[16:25] could be erlang/rabbitmq messing things up
[16:25] could be nodejs
[16:25] but the end result is always the same, an unstable machine that is pingable and accessible via ssh
[16:26] but which does not work anymore.. (so monitoring this sucks since everything seems normal)
[16:27] can I run a command on these machines to see how bad fragmentation is at this moment?
[16:28] so I could log things over time and keep a graph or something to see if the problem is slowly getting worse?
[16:29] JoshX: try "echo m > /proc/sysrq-trigger"
[16:29] JoshX: then check dmesg
[16:30] https://pastebin.com/Xu6130pW
[16:30] JoshX: oh, forget that, we have buddy now! "cat /proc/buddyinfo"
[16:30] root@pc7-1428:/var/log# cat /proc/buddyinfo
[16:30] Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
[16:30] Node 0, zone    DMA32    325    115     59     36     10      8     50     44     26      0      0
[16:31] Node 0, zone   Normal     20     22      4     12      5      3    101     84     44      0      0
[16:31] 'erm..'
[16:31] is this good? bad? :)
[16:32] the numbers are free blocks of 4KiB, 8KiB, 16KiB....
[16:33] you might want to read the docs in the ./Documentation/ directory of the kernel source for the details of that
[16:33] I have a system in stress test now that's reporting higher numbers
[16:34] root@pc7-1422:~# cat /proc/buddyinfo
[16:34] Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
[16:34] Node 0, zone    DMA32    283     94     18     16      7     10     10     38     39      2      0
[16:34] Node 0, zone   Normal    209   3879     31      9     11     19     23      7      4     10      1
[16:35] and the memory info from dmesg: https://pastebin.com/q6iZC3Sk
[16:35] you might try adjusting the memory overcommit settings
[16:37] how might I do that?
[17:00] it seems unreasonable for it to panic like that there