JoshX | Hi! I have multiple Ubuntu 14.04.5 LTS systems running on the same hardware with the same kernel producing the same kernel crashes.. | 16:08 |
JoshX | crashlogs are here: https://pastebin.com/xnGmtLNg and https://pastebin.com/0vQX9fKz for example | 16:09 |
JoshX | kernels are Linux pc7-1428 4.4.0-116-generic #140~14.04.1-Ubuntu SMP Fri Feb 16 09:25:20 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | 16:10 |
JoshX | on all systems | 16:10 |
JoshX | every time this happens, the machine is under medium IO load, low memory load | 16:10 |
JoshX | and after the crash it behaves very erratically.. random crashing processes, reboot and shutdown fail | 16:11 |
JoshX | so far we've had this on 9 systems running identical hardware and software | 16:11 |
JoshX | doing mainly ffmpeg -> h264 stream -> pipeline -> JPG frames written to disk (disks are connected via USB in an md0 mirror) | 16:12 |
TJ- | JoshX: looks like it could be due to fragmented memory allocations | 16:14 |
JoshX | TJ-: is this something I can do anything about? | 16:14 |
JoshX | we're running stock 14.04.5LTS with no exciting software on average hardware | 16:14 |
JoshX | Intel(R) Core(TM) i7-5550U CPU @ 2.00GHz with 16 GB DDR3L RAM | 16:15 |
TJ- | JoshX: I'm only guessing, but if the processes are requesting many smaller allocations (I infer due to the stack trace showing compaction_alloc) it's possible that eventually the kernel cannot create a block large enough for a request | 16:15 |
JoshX | I understand, but am i running out of memory or out of large enough fragments? | 16:16 |
TJ- | you'll notice the stack trace passes through do_huge_pmd_anonymous, try_to_compact_pages etc | 16:16 |
TJ- | so I suspect fragmentation has ended up in a situation where even compaction cannot handle things (possibly there's a bug there that ought not to be as well) | 16:17 |
TJ- | I think there are some kernel 'knobs' you can twiddle to influence these things (sysctl) | 16:18 |
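The 'knobs' being alluded to here are not named in the log; the following is a sketch of the ones most plausibly meant (which knobs are relevant is an assumption, though all of them exist on a 4.4 kernel):

```shell
# Read the current values of the knobs most relevant to compaction and
# high-order allocations:
sysctl vm.min_free_kbytes       # memory the kernel keeps free for atomic/high-order allocations
sysctl vm.extfrag_threshold     # how fragmented a zone must be before compaction is tried over reclaim
cat /sys/kernel/mm/transparent_hugepage/enabled   # THP mode; do_huge_pmd_anonymous in the trace points at THP

# Possible (temporary, non-persistent) mitigations -- commented out,
# the values are illustrative, not a recommendation:
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled   # only use huge pages when madvise()d
# sysctl -w vm.min_free_kbytes=131072                          # keep a larger free reserve
```

Changes made with `sysctl -w` or by writing to /sys are lost on reboot; to persist them, add the settings to /etc/sysctl.conf and an rc script respectively.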
JoshX | perhaps I can add a PPA and try an upstream/newer/beta kernel? | 16:19 |
JoshX | or enable something to produce more/better logging? | 16:19 |
JoshX | the crashes are so random I have yet to reproduce them at will | 16:19 |
TJ- | JoshX: do they happen after the systems have been working for some time ? | 16:21 |
JoshX | well these 2 systems crashed after 200k sec and 400k sec going by the logs | 16:21 |
JoshX | so 2.5 and 4 days or so | 16:22 |
JoshX | not too long | 16:22 |
TJ- | JoshX: if those processes are fragmenting memory that's about what I'd expect - I wouldn't expect this immediately after boot | 16:22 |
JoshX | but we're running some nodejs processes, rabbitmq, ffmpeg | 16:23 |
JoshX | and inotify on file close events | 16:23 |
JoshX | on ~40 files / sec | 16:23 |
JoshX | so we're handling tons of IO events on file closes to move files to the right place | 16:24 |
JoshX | and we're queuing stuff in rabbitmq.. | 16:24 |
JoshX | could be erlang/rabbitmq messing things up | 16:25 |
JoshX | could be nodejs | 16:25 |
JoshX | but the end result is always the same, an unstable machine that is pingable and accessible via ssh | 16:25 |
JoshX | but which does not work anymore.. (so monitoring this sucks since all seems normal) | 16:26 |
JoshX | can i perform a command on these machines to see how bad fragmentation is at this moment? | 16:27 |
JoshX | so I could log things over time and keep a graph or something to see if the problem is slowly getting worse? | 16:28 |
TJ- | JoshX: try "echo m > /proc/sysrq-trigger" | 16:29 |
TJ- | JoshX: then check dmesg | 16:29 |
JoshX | https://pastebin.com/Xu6130pW | 16:30 |
TJ- | JoshX: oh, forget that, we have buddy now! "cat /proc/buddyinfo" | 16:30 |
JoshX | root@pc7-1428:/var/log# cat /proc/buddyinfo | 16:30 |
JoshX | Node 0, zone DMA 1 1 1 0 2 1 1 0 1 1 3 | 16:30 |
JoshX | Node 0, zone DMA32 325 115 59 36 10 8 50 44 26 0 0 | 16:30 |
JoshX | Node 0, zone Normal 20 22 4 12 5 3 101 84 44 0 0 | 16:31 |
JoshX | 'erm..' | 16:31 |
JoshX | is this good? bad? :) | 16:31 |
TJ- | the numbers are counts of free blocks of 4KiB, 8KiB, 16KiB.... | 16:32 |
TJ- | you might want to read the docs in the ./Documentation/ directory of the kernel source for the detail of that | 16:33 |
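Since the stated goal is to log fragmentation over time and graph it, a one-liner that reduces /proc/buddyinfo to a couple of graphable numbers per zone might look like this (a sketch; the field positions follow the standard buddyinfo layout, where column N after the zone name counts free blocks of 2^N pages, orders 0..10, i.e. 4 KiB up to 4 MiB with 4 KiB pages):

```shell
# Summarise /proc/buddyinfo: total free KiB per zone plus the highest
# order that still has any free blocks (the lower that number falls,
# the worse fragmentation has become).
awk '/^Node/ {
    kib = 0; top = -1
    for (f = 5; f <= NF; f++) {        # fields 5..NF are the per-order free-block counts
        o = f - 5                      # buddy order for this column
        kib += $f * 4 * 2 ^ o          # each order-o block is 4*2^o KiB
        if ($f > 0) top = o
    }
    printf "%s %s %-8s free=%d KiB, highest free order=%d\n", $1, $2, $4, kib, top
}' /proc/buddyinfo
```

Prefixing each run with `date +%s` and appending the output to a file from cron would give the time series asked about above.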
JoshX | I have a system in stress test now that's reporting higher numbers | 16:33 |
JoshX | root@pc7-1422:~# cat /proc/buddyinfo | 16:34 |
JoshX | Node 0, zone DMA 1 1 1 0 2 1 1 0 1 1 3 | 16:34 |
JoshX | Node 0, zone DMA32 283 94 18 16 7 10 10 38 39 2 0 | 16:34 |
JoshX | Node 0, zone Normal 209 3879 31 9 11 19 23 7 4 10 1 | 16:34 |
JoshX | and the memory info from dmesg: https://pastebin.com/q6iZC3Sk | 16:35 |
TJ- | you might try adjusting the memory overcommit settings | 16:35 |
JoshX | how might I do that? | 16:37 |
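The question goes unanswered in the log; a minimal sketch of what adjusting the overcommit settings involves (the specific values below are illustrative, not a recommendation):

```shell
# vm.overcommit_memory: 0 = heuristic (the default), 1 = always allow,
# 2 = strict accounting against CommitLimit (visible in /proc/meminfo)
sysctl vm.overcommit_memory vm.overcommit_ratio

# Temporary switch to strict accounting -- commented out; persist in
# /etc/sysctl.conf only after testing the effect under real load:
# sysctl -w vm.overcommit_memory=2
# sysctl -w vm.overcommit_ratio=80   # percent of RAM counted toward CommitLimit
```

Note this governs how much virtual memory the kernel will promise to processes, not fragmentation directly; under mode 2, allocations fail with ENOMEM up front instead of the system degrading later.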
apw | it seems unreasonable for it to panic like that there | 17:00 |
Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!