/srv/irclogs.ubuntu.com/2018/03/22/#ubuntu-kernel.txt

=== himcesjf_ is now known as him-cesjf
=== kamal is now known as Guest26711
JoshXHi! I have multiple ubuntu 14.04.5LTS systems running on the same hardware with the same kernel producing the same kernel crashes..16:08
JoshXcrashlogs are here: https://pastebin.com/xnGmtLNg and https://pastebin.com/0vQX9fKz for example16:09
JoshXkernels are  Linux pc7-1428 4.4.0-116-generic #140~14.04.1-Ubuntu SMP Fri Feb 16 09:25:20 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux16:10
JoshXon all systems16:10
JoshXevery time this happens, the machine is under medium IO load, low memory load16:10
JoshXand after the crash behaves very erratic.. random crashing processes, reboot and shutdown fail16:11
JoshXso far we've had this on 9 systems running identical hardware and software16:11
JoshXdoing mainly writing ffmpeg -> h264 stream -> pipeline -> to JPG frames to disk (disks are connected via USB in a mirror MD0)16:12
TJ-JoshX: looks like it could be due to fragmented memory allocations 16:14
JoshXTJ-: is this something I can do anything about?16:14
JoshXwe're running stock 14.04.5LTS with no exciting software on average hardware16:14
JoshXIntel(R) Core(TM) i7-5550U CPU @ 2.00GHz CPU With 16GB DDR3L ram16:15
TJ-JoshX: I'm only guessing, but if the processes are requesting many smaller allocations (I infer due to the stack trace showing compaction_alloc) it's possible that eventually the kernel cannot create a block large enough for a request16:15
JoshXI understand, but am i running out of memory or out of large enough fragments?16:16
TJ-you'll notice the stack trace passes through do_huge_pmd_anonymous, try_to_compact_pages etc16:16
TJ-so I suspect fragmentation has ended up in a situation where even compaction cannot handle things (possibly there's a bug there that ought not to be as well)16:17
TJ-I think there are some kernel 'knobs' you can twiddle to influence these things (sysctl)16:18
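One knob related to the do_huge_pmd_anonymous path in the trace is transparent huge pages, which is exposed under /sys/kernel/mm/transparent_hugepage/ rather than through sysctl. Below is a minimal Python sketch for reading the current setting, assuming the usual format where the active choice is shown in square brackets (for example "always [madvise] never"). The helper names `active_choice` and `thp_settings` are hypothetical, made up for illustration.

```python
import re
from pathlib import Path

# Standard sysfs location for transparent huge page settings.
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def active_choice(text):
    """Return the bracketed (active) option from a sysfs choice string."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def thp_settings(base=THP_DIR):
    """Read the 'enabled' and 'defrag' choices; None if a file is unreadable."""
    settings = {}
    for name in ("enabled", "defrag"):
        try:
            settings[name] = active_choice((Path(base) / name).read_text())
        except OSError:
            settings[name] = None
    return settings


if __name__ == "__main__":
    print(thp_settings())
```

This only reads the values. Whether changing them would help with these crashes is not established anywhere in the discussion.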
JoshXperhaps I can add a PPA and try an upstream/newer/beta kernel?16:19
JoshXor enable something to produce more/better logging?16:19
JoshXthe crashes are so random I have yet to reproduce them at will16:19
TJ-JoshX: do they happen after the systems have been working for some time ?16:21
JoshXwell these 2 systems crashed after 200k sec and 400k sec if i see the logs16:21
JoshXso 2.5 and 4 days or so16:22
JoshXnot too long16:22
TJ-JoshX: if those processes are fragmenting memory that's about what I'd expect - I wouldn't expect this immediately after boot16:22
JoshXbut we're running some nodejs processes, rabbitmq, ffmpeg16:23
JoshXand inotify on file close events16:23
JoshXon ~40 files / sec16:23
JoshXso we're handling tons of IO events on file closes to move files to the right place16:24
JoshXand we're queuing stuff in rabbitmq.. 16:24
JoshXcould be erlang/rabbitmq messing things up16:25
JoshXcould be nodejs16:25
JoshXbut the end result is always the same, an unstable machine that is pingable and accessible via ssh16:25
JoshXbut which does not work anymore.. (so monitoring this sucks since all seems normal)16:26
JoshXcan i perform a command on these machines to see how bad fragmentation is at this moment?16:27
JoshXso I could log things over time and keep a graph or something to see if the problem is slowly getting worse?16:28
TJ-JoshX: try "echo m > /proc/sysrq-trigger"16:29
TJ-JoshX: then check dmesg16:29
JoshXhttps://pastebin.com/Xu6130pW16:30
TJ-JoshX: oh, forget that, we have buddy now! "cat /proc/buddyinfo"16:30
JoshXroot@pc7-1428:/var/log# cat /proc/buddyinfo16:30
JoshXNode 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      316:30
JoshXNode 0, zone    DMA32    325    115     59     36     10      8     50     44     26      0      016:30
JoshXNode 0, zone   Normal     20     22      4     12      5      3    101     84     44      0      016:31
JoshX'erm..'16:31
JoshXis this good? bad? :)16:31
TJ-the numbers are blocks of 4KiB, 8KiB, 16KiB....16:32
TJ-you might want to read the docs in ./Documentation/ directory of the kernel source for the detail of that16:33
JoshXI have a system in stress test now that's reporting higher numbers16:33
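To make the columns concrete: each column is the count of free blocks of a given order, where an order-n block is 2^n contiguous pages. The Python sketch below parses buddyinfo text and totals free memory per zone, assuming 4 KiB base pages (the x86_64 default) and therefore that a 2 MiB huge page corresponds to order 9. The sample rows are the pc7-1428 values pasted above; the function names are hypothetical.

```python
PAGE_KIB = 4          # assumption: 4 KiB base pages, as on x86_64
HUGE_PAGE_ORDER = 9   # assumption: 2 MiB huge page = 512 pages = order 9

SAMPLE = """\
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone    DMA32    325    115     59     36     10      8     50     44     26      0      0
Node 0, zone   Normal     20     22      4     12      5      3    101     84     44      0      0
"""


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, 1, 2, ...]}."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal 20 22 4 ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(n) for n in parts[4:]]
    return zones


def free_kib(counts):
    """Total free memory in KiB across all orders."""
    return sum(n * PAGE_KIB * (1 << order) for order, n in enumerate(counts))


def blocks_at_or_above(counts, min_order):
    """Number of free blocks large enough for a request of min_order."""
    return sum(counts[min_order:])


if __name__ == "__main__":
    for (node, zone), counts in parse_buddyinfo(SAMPLE).items():
        print(node, zone, free_kib(counts), "KiB free,",
              blocks_at_or_above(counts, HUGE_PAGE_ORDER), "blocks >= order 9")
```

To track this over time, the same functions could be fed the contents of /proc/buddyinfo from a periodic job and the per-order counts written to a log for graphing.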
JoshXroot@pc7-1422:~# cat /proc/buddyinfo16:34
JoshXNode 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      316:34
JoshXNode 0, zone    DMA32    283     94     18     16      7     10     10     38     39      2      016:34
JoshXNode 0, zone   Normal    209   3879     31      9     11     19     23      7      4     10      116:34
JoshXand the memory info from dmesg: https://pastebin.com/q6iZC3Sk16:35
TJ-you might try adjusting the memory overcommit settings16:35
JoshXhow might I do that?16:37
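The overcommit settings are exposed as files under /proc/sys/vm/ (vm.overcommit_memory and vm.overcommit_ratio in sysctl terms). Here is a minimal Python sketch that reads the current mode and labels it, assuming the three documented modes 0, 1 and 2. The function names and label strings are hypothetical, and nothing in the discussion confirms that changing these values would fix the crashes.

```python
from pathlib import Path

# Labels for the documented vm.overcommit_memory modes.
MODES = {
    "0": "heuristic overcommit (default)",
    "1": "always overcommit",
    "2": "never overcommit (strict accounting)",
}


def describe_overcommit(mode_text):
    """Map the raw contents of overcommit_memory to a readable label."""
    return MODES.get(mode_text.strip(), "unknown mode")


def read_overcommit(proc_vm="/proc/sys/vm"):
    """Return (mode label, overcommit_ratio) or None if the files are unreadable."""
    base = Path(proc_vm)
    try:
        mode = (base / "overcommit_memory").read_text()
        ratio = (base / "overcommit_ratio").read_text().strip()
    except OSError:
        return None
    return describe_overcommit(mode), ratio


if __name__ == "__main__":
    print(read_overcommit())
```

Changing a value would typically be done as root with something like `sysctl -w vm.overcommit_memory=2`, and made persistent through /etc/sysctl.conf.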
apwit seems unreasonable for it to panic like that there17:00
