JoshX | Hi! I have multiple Ubuntu 14.04.5 LTS systems running on the same hardware with the same kernel producing the same kernel crashes.. | 16:08 |
JoshX | crashlogs are here: https://pastebin.com/xnGmtLNg and https://pastebin.com/0vQX9fKz for example | 16:09 |
JoshX | kernels are Linux pc7-1428 4.4.0-116-generic #140~14.04.1-Ubuntu SMP Fri Feb 16 09:25:20 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | 16:10 |
JoshX | on all systems | 16:10 |
JoshX | every time this happens, the machine is under medium IO load, low memory load | 16:10 |
JoshX | and after the crash it behaves very erratically.. random crashing processes, reboot and shutdown fail | 16:11 |
JoshX | so far we've had this on 9 systems running identical hardware and software | 16:11 |
JoshX | doing mainly ffmpeg -> h264 stream -> pipeline -> JPG frames written to disk (disks are connected via USB in an md0 mirror) | 16:12 |
TJ- | JoshX: looks like it could be due to fragmented memory allocations | 16:14 |
JoshX | TJ-: is this something I can do anything about? | 16:14 |
JoshX | we're running stock 14.04.5LTS with no exciting software on average hardware | 16:14 |
JoshX | Intel(R) Core(TM) i7-5550U CPU @ 2.00GHz with 16 GB DDR3L RAM | 16:15 |
TJ- | JoshX: I'm only guessing, but if the processes are requesting many smaller allocations (I infer due to the stack trace showing compaction_alloc) it's possible that eventually the kernel cannot create a block large enough for a request | 16:15 |
JoshX | I understand, but am i running out of memory or out of large enough fragments? | 16:16 |
TJ- | you'll notice the stack trace passes through do_huge_pmd_anonymous, try_to_compact_pages etc | 16:16 |
TJ- | so I suspect fragmentation has ended up in a situation where even compaction cannot handle things (possibly there's a bug there that ought not to be as well) | 16:17 |
TJ- | I think there are some kernel 'knobs' you can twiddle to influence these things (sysctl) | 16:18 |
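The 'knobs' being alluded to here are not named in the log; the following is a sketch of the ones most plausibly meant (which knobs are relevant is an assumption, though all of them exist on a 4.4 kernel):

```shell
# Read the current values of the knobs most relevant to compaction and
# high-order allocations:
sysctl vm.min_free_kbytes       # memory the kernel keeps free for atomic/high-order allocations
sysctl vm.extfrag_threshold     # how fragmented a zone must be before compaction is tried over reclaim
cat /sys/kernel/mm/transparent_hugepage/enabled   # THP mode; do_huge_pmd_anonymous in the trace points at THP

# Possible (temporary, non-persistent) mitigations -- commented out,
# the values are illustrative, not a recommendation:
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled   # only use huge pages when madvise()d
# sysctl -w vm.min_free_kbytes=131072                          # keep a larger free reserve
```

Changes made with `sysctl -w` or by writing to /sys are lost on reboot; to persist them, add the settings to /etc/sysctl.conf and an rc script respectively.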
JoshX | perhaps I can add a PPA and try an upstream/newer/beta kernel? | 16:19 |
JoshX | or enable something to produce more/better logging? | 16:19 |
JoshX | the crashes are so random I have yet to reproduce them at will | 16:19 |
TJ- | JoshX: do they happen after the systems have been working for some time ? | 16:21 |
JoshX | well these 2 systems crashed after 200k sec and 400k sec going by the logs | 16:21 |
JoshX | so 2.5 and 4 days or so | 16:22 |
JoshX | not too long | 16:22 |
TJ- | JoshX: if those processes are fragmenting memory that's about what I'd expect - I wouldn't expect this immediately after boot | 16:22 |
JoshX | but we're running some nodejs processes, rabbitmq, ffmpeg | 16:23 |
JoshX | and inotify on file close events | 16:23 |
JoshX | on ~40 files / sec | 16:23 |
JoshX | so we're handling tons of IO events on file closes to move files to the right place | 16:24 |
JoshX | and we're queuing stuff in rabbitmq.. | 16:24 |
JoshX | could be erlang/rabbitmq messing things up | 16:25 |
JoshX | could be nodejs | 16:25 |
JoshX | but the end result is always the same, an unstable machine that is pingable and accessible via ssh | 16:25 |
JoshX | but which does not work anymore.. (so monitoring this sucks since all seems normal) | 16:26 |
JoshX | can i perform a command on these machines to see how bad fragmentation is at this moment? | 16:27 |
JoshX | so I could log things over time and keep a graph or something to see if the problem is slowly getting worse? | 16:28 |
TJ- | JoshX: try "echo m > /proc/sysrq-trigger" | 16:29 |
TJ- | JoshX: then check dmesg | 16:29 |
JoshX | https://pastebin.com/Xu6130pW | 16:30 |
TJ- | JoshX: oh, forget that, we have buddy now! "cat /proc/buddyinfo" | 16:30 |
JoshX | root@pc7-1428:/var/log# cat /proc/buddyinfo | 16:30 |
JoshX | Node 0, zone DMA 1 1 1 0 2 1 1 0 1 1 3 | 16:30 |
JoshX | Node 0, zone DMA32 325 115 59 36 10 8 50 44 26 0 0 | 16:30 |
JoshX | Node 0, zone Normal 20 22 4 12 5 3 101 84 44 0 0 | 16:31 |
JoshX | 'erm..' | 16:31 |
JoshX | is this good? bad? :) | 16:31 |
TJ- | the numbers are counts of free blocks of 4KiB, 8KiB, 16KiB.... | 16:32 |
TJ- | you might want to read the docs in the ./Documentation/ directory of the kernel source for the detail of that | 16:33 |
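Since the stated goal is to log fragmentation over time and graph it, a one-liner that reduces /proc/buddyinfo to a couple of graphable numbers per zone might look like this (a sketch; the field positions follow the standard buddyinfo layout, where column N after the zone name counts free blocks of 2^N pages, orders 0..10, i.e. 4 KiB up to 4 MiB with 4 KiB pages):

```shell
# Summarise /proc/buddyinfo: total free KiB per zone plus the highest
# order that still has any free blocks (the lower that number falls,
# the worse fragmentation has become).
awk '/^Node/ {
    kib = 0; top = -1
    for (f = 5; f <= NF; f++) {        # fields 5..NF are the per-order free-block counts
        o = f - 5                      # buddy order for this column
        kib += $f * 4 * 2 ^ o          # each order-o block is 4*2^o KiB
        if ($f > 0) top = o
    }
    printf "%s %s %-8s free=%d KiB, highest free order=%d\n", $1, $2, $4, kib, top
}' /proc/buddyinfo
```

Prefixing each run with `date +%s` and appending the output to a file from cron would give the time series asked about above.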
JoshX | I have a system in stress test now that's reporting higher numbers | 16:33 |
JoshX | root@pc7-1422:~# cat /proc/buddyinfo | 16:34 |
JoshX | Node 0, zone DMA 1 1 1 0 2 1 1 0 1 1 3 | 16:34 |
JoshX | Node 0, zone DMA32 283 94 18 16 7 10 10 38 39 2 0 | 16:34 |
JoshX | Node 0, zone Normal 209 3879 31 9 11 19 23 7 4 10 1 | 16:34 |
JoshX | and the memory info from dmesg: https://pastebin.com/q6iZC3Sk | 16:35 |
TJ- | you might try adjusting the memory overcommit settings | 16:35 |
JoshX | how might I do that? | 16:37 |
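The question goes unanswered in the log; a minimal sketch of what adjusting the overcommit settings involves (the specific values below are illustrative, not a recommendation):

```shell
# vm.overcommit_memory: 0 = heuristic (the default), 1 = always allow,
# 2 = strict accounting against CommitLimit (visible in /proc/meminfo)
sysctl vm.overcommit_memory vm.overcommit_ratio

# Temporary switch to strict accounting -- commented out; persist in
# /etc/sysctl.conf only after testing the effect under real load:
# sysctl -w vm.overcommit_memory=2
# sysctl -w vm.overcommit_ratio=80   # percent of RAM counted toward CommitLimit
```

Note this governs how much virtual memory the kernel will promise to processes, not fragmentation directly; under mode 2, allocations fail with ENOMEM up front instead of the system degrading later.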
apw | it seems unreasonable for it to panic like that there | 17:00 |
Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!