/srv/irclogs.ubuntu.com/2017/08/20/#ubuntu-server.txt

=== tikun is now known as sikun
=== JanC is now known as Guest58168
=== JanC_ is now known as JanC
=== led2 is now known as led1
NginUSWhich OpenVPN installation tutorial should we be using for 16.04? I see 4 different ones that all seem (semi) official. https://😻🍕.ws/⤵♓➿04:01
Hexianhey guys, I have a major intermittent issue which is affecting real time services on my Ubuntu 16.04.1 box at random times14:26
HexianI'm using netdata on the box to monitor a massive number of stats, and what appears to be happening is Ubuntu is randomly swapping in memory from disk for one of the real time process instances14:27
Hexianthe box uses an average of 50% CPU and 30% of 32GB ram, so it should have absolutely no reason to use a swap file for these 3 important processes which each use only around 10% of system memory14:28
Hexiandoes anyone have a solution for me? it seems like it is not possible to explicitly tell Ubuntu to never use swap memory for these specific processes14:29
Hexianevery time it swaps a chunk of memory for no reason whatsoever, it causes an IO wait queue which locks up my real time processes for anywhere from 500ms to over 30 000ms14:29
Hexianit seems like my only option is to disable the swap file completely on the entire box... I'd really like to avoid doing that if possible14:35
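Linux does offer a per-process way to keep memory out of swap: the mlockall(2) system call, made by the process itself. A minimal sketch follows, assuming a Linux box on x86 or ARM (the MCL_* values below are copied from that platform's headers and may differ elsewhere). The helper names memlock_limit and lock_all_memory are hypothetical, and the game servers discussed here are probably not written in Python, so the same call would be made from their own language.

```python
import ctypes
import ctypes.util
import os
import resource

# Assumption: Linux values from <sys/mman.h> on x86/ARM; other architectures may differ.
MCL_CURRENT = 1  # lock pages that are mapped right now
MCL_FUTURE = 2   # also lock pages mapped later


def memlock_limit():
    """Return the (soft, hard) RLIMIT_MEMLOCK values in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def lock_all_memory():
    """Ask the kernel to keep every page of this process resident in RAM."""
    libc = ctypes.CDLL(ctypes.util.find_library("c") or None, use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

The call fails unless the process has CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK, so checking memlock_limit() first shows whether the limit has to be raised (for example with LimitMEMLOCK= in a systemd unit).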
tomreynHexian: you could reduce swappiness, but this also affects all processes14:39
tomreynif I/O blocks for up to 30s due to swap in then something else seems to be wrong, though14:39
Hexiantomreyn: that wouldn't guarantee that the real time processes avoid swapping though, I'm not sure why there is swapping with 30% system ram usage though14:40
tomreynwhat is swapped in must have been swapped out before, try to understand why it was swapped out at all.14:40
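One way to start on that question is to see which processes currently have pages sitting in swap. The sketch below reads the VmSwap line that Linux exposes in /proc/<pid>/status. The function names are hypothetical, and it only reports what is in swap at the moment it runs, not why the pages were pushed out.

```python
import os
import re


def vmswap_kb(status_text):
    """Return the VmSwap value in kB from /proc/<pid>/status text, or 0 if absent."""
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def swap_by_process(proc_root="/proc"):
    """Map pid -> swapped kB for every readable process with pages in swap."""
    usage = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                swapped = vmswap_kb(handle.read())
        except OSError:
            continue  # process exited or is not readable
        if swapped:
            usage[int(entry)] = swapped
    return usage
```

Sorting the result of swap_by_process() by value shows whether the real time processes themselves are in swap or whether something else on the box is.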
HexianI guess it doesn't matter what process is doing the swapping, if any process swaps large pages, it's going to cause an IO wait queue and lock up any of my real time processes that happen to be reading from disk at the time14:40
tomreynit should not actually lock them up, just slow their reads down14:41
tomreynare you using HDDs or SSDs or something else for storage?14:41
tomreynany hardware or software RAID?14:42
HexianI've had this issue intermittently since I got the box, these processes don't even write to disk but use a ram disk for persistence to avoid writes14:42
Hexianthe weird part is, the issue is not bad enough to cause a serious problem every week14:42
Hexianit can happen a massive amount one day, and then not even be noticeable for the next week14:43
HexianI've been trying to solve this for months now, because when it happens seriously, it's a really major problem for my services14:43
Hexianthe hard disk is a mechanical sata drive14:45
tomreyni don't think you answered any of the questions i asked.14:45
tomreyni'd suggest you start with performance testing if you haven't done this yet. https://www.thomas-krenn.com/en/wiki/Linux_I/O_Performance_Tests_using_dd14:45
Hexianit's not as fast as a SSD or raid setup, but it's not very slow either, and I need very little IO, so it should not be causing a problem14:45
Hexianlike I said, I'm doing all writes from real time services to a ram disk14:45
Hexiantomreyn: I'd need to shutdown all my services to do that, but what exactly would it accomplish?14:46
HexianI know that the hardware isn't slow14:46
HexianI deploy updates every week which involves copying ~6GB a few times14:46
drabNginUS: https://www.digitalocean.com/community/tutorials/how-to-set-up-an-openvpn-server-on-ubuntu-16-0414:46
RoyKrotating drives are slow14:46
drabNginUS: I used that one14:47
drabNginUS: worked for me right away14:47
NginUSdrab: thx14:47
HexianRoyK: like I said, I'm using virtually no IO at all, I don't even do writes to the disk...14:47
Hexianthe disk is very fast for what I need it for14:47
RoyKNefertiti: how do you monitor the system?14:47
Hexianthe only time I have an issue is when Ubuntu randomly swaps huge pages for no reason14:47
tomreynHexian: okay i guess the disk I/O should not matter (except for swap) if your applications only write to / read from RAM disks.14:47
Hexianthe applications read from disk at times, and that is the issue14:48
RoyKI thought RAM disks were a thing in the nineties14:48
Hexianwhen Ubuntu swaps and causes IO wait, it locks up any process that happens to be reading from the disk at that time14:48
RoyKbuffering does a good enough job today to stay away from that14:49
Hexianand it is a major issue since these processes run on a nanosecond scale14:49
Hexianif they lock up for more than a few ms, I have an issue, but this can cause extremely long lockups14:49
RoyKHexian: pastebin sysctl vm.swappiness14:49
HexianRoyK: I started using a ram disk for writes on this box specifically to reduce disk IO and avoid the possibility of IO wait times slowing down these processes14:50
RoyKHexian: most filesystems buffer writes unless they are sync writes14:50
HexianI know that, but the box is still locking up my processes14:50
RoyKHexian: so generally a RAM drive is not needed - it just makes things worse after a power failure or similar14:51
drabHexian: how much mem do you have and how big is the ramdisk? if I had to guess I'd say something about the ramdisk is causing the depletion of the available cached memory and that turns into swapping14:51
tomreynmaybe your application needs to use a disk read/write thread so that these operations do not cause the rest of the application to stall.14:51
HexianI think this is something specific to newer Ubuntu versions, none of my older boxes have ever had this issue, and they are far lower end than this high end Xeon14:51
RoyKwell, try without a ram drive first14:52
Hexiandrab: 32GB ram, 30% used14:52
RoyKthen debug further14:52
drabHexian: 30% used for ramdisk you mean?14:52
RoyKbtw, what sort of ramdrive is this? tmpfs?14:52
HexianRoyK: I started using the ram disk months ago already, after months of intermittent issues14:52
Hexiandrab: system memory14:52
drabHexian: oh, so the issues already existed before the ram disk was introduced?14:52
Hexianthe problem has existed since I got this box, but it comes and goes in severity every week, some weeks I don't even notice it14:53
drabok14:53
Hexianwhich is why it is complete hell to try to solve14:53
drabindeed14:53
drabintermittent problems are the worst14:53
RoyKthese things happen at times14:53
Hexiantoday is the first time I noticed that it's swap related14:53
RoyKok, set swappiness to 114:53
HexianI'm 98% sure that the IO wait times causing my processes to lock up are happening exactly when the kernel swaps out memory pages for no reason14:54
RoyKthat'll basically turn off swap unless it's strictly needed14:54
Hexianor well, swaps in?14:54
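The swap-in versus swap-out question can be answered from the pswpin and pswpout counters in /proc/vmstat, which count pages swapped in and out since boot. A small sketch, with hypothetical function names, that compares two snapshots of that file:

```python
def swap_counters(vmstat_text):
    """Extract the pswpin/pswpout page counters from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name in ("pswpin", "pswpout"):
            counters[name] = int(value)
    return counters


def swap_delta(before, after):
    """Pages swapped in and out between two snapshots."""
    return {name: after[name] - before[name] for name in ("pswpin", "pswpout")}
```

Taking one snapshot before a stall and one after shows which direction the traffic went: a jump in pswpin alone means pages were being read back from swap during the stall.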
RoyKwhere's the swap?14:54
RoyKon a slow drive?14:54
drabwell the question afaik is why it's swapping to begin with14:55
Hexianit's on the same disk, but it shouldn't need to be used, that's the point14:55
drabat least in theory swap should only occur if the buffer cache is depleted14:55
draband the system needs to load more data than it has available memory14:55
Hexianwith 30% system memory being used, there is always enough for all needed memory to be in physical ram14:55
drabwhich sounds like a condition Hexian doesn't have14:55
drabright14:55
RoyKdrab: linux swaps a lot if memory is stressed - just reduce swappiness14:55
drabRoyK: ok, well, even if just for curiosity I'd like to understand that a tad better... what does "stressed" mean? I've never heard of that14:56
RoyKHexian: did you try to reduce swappiness?14:56
drabime you either have enough mem to fit data read, or you don't and it swaps, but maybe i'm making it too simple14:56
drabI don't have a very deep understanding of mem mgmt for sure14:56
RoyKdrab: well, if memory access is heavy, linux will try to page out things not in use, and it's not always too smart about it14:57
HexianI've set vm.swappiness to 1, I'm going to keep an eye on the box for the next few hours and see14:57
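To confirm the value the kernel is actually using, the setting can be read back from /proc/sys/vm/swappiness. A minimal sketch with hypothetical helper names:

```python
def parse_swappiness(text):
    """Parse the contents of /proc/sys/vm/swappiness into an int."""
    return int(text.strip())


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Read the swappiness value the running kernel is using."""
    with open(path) as handle:
        return parse_swappiness(handle.read())
```

A value set at runtime with sysctl is lost on reboot; to keep it, the line vm.swappiness = 1 would also go into /etc/sysctl.conf or a file under /etc/sysctl.d/.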
Hexianthe IO lock ups have been excessive over the last 2 days14:57
Hexianwhile last week there were basically no issues at all14:57
RoyKlinux likes swap - better use an ssd if you really need it14:58
tomreynHexian: i assume you have checked dmesg about these lockups?14:58
RoyKthat is - don't use spinning rust apart from mass storage14:58
RoyKjust my 2c14:58
Hexianmy only other idea would be to increase the ram disk size to 8GB and move the ~7GB of data which gets read by the real time processes into ram14:59
Hexianthat way even if the kernel causes disk wait times, the processes would be reading from ram14:59
Hexianthe processes only read small chunks from that data infrequently at random times, but if one of their reads happens to be when the kernel is swapping, the effect can be way more extreme than you'd think15:00
tomreyni'd also investigate the smart data of those disks. but you don't seem to have a resilient architecture there (i.e. not HA) so you can apparently not afford this during production.15:00
HexianI guess that's just due to the combination of the slow mechanical disk and the kernel's heavy swapping15:00
drabthis may be what's happening, altho just a guess: https://www.kernel.org/doc/gorman/html/understand/understand014.html15:00
RoyKHexian: you don't need a ram drive15:00
drabThe casual reader may think that with a sufficient amount of memory, swap is unnecessary but this brings us to the second reason. A significant number of the pages referenced by a process early in its life may only be used for initialisation and then never used again. It is better to swap out those pages and create more disk buffers than leave them resident and unused.15:00
RoyKHexian: what sort of application is this anyway?15:00
Hexiandrab: performance is far more important than reliability for my use case15:01
drabHexian: sure, but that seems to be what it's doing nonetheless, and maybe swappiness to 1 will influence that behavior15:01
Hexianas you can hear from me doing writes to a ram disk, I'd rather risk losing data in the case of a hardware failure, than have any performance spikes15:01
drabof course15:01
HexianRoyK: MMO game servers15:02
RoyKHexian: just setup proper monitoring of the server(s)15:03
RoyKzabbix, munin, something15:03
drabthis may be part of what happened since you said it didn't use to manifest as a problem15:03
drabhttps://kernelnewbies.org/Linux_4.11#head-e391b21340381dfcd6d837a15f8ec890fa1316c715:03
HexianRoyK: I'm using netdata, it's far better than those solutions15:04
drabswap mgmt changed in newer kernels15:04
HexianI can see every possible system stat in real time15:04
drabto make it more appropriate for SSDs15:04
drabie the opposite of what you want15:04
drabso that may be why you're seeing problems now that you didn't use to see, read that link15:04
Hexianthat does sound like it could be the culprit15:04
drabso you could in theory try an older kernel and see if that makes a difference15:05
tomreynor maybe the disk is just dying15:05
Hexianthe box is a few months old, I doubt it's a hardware issue15:05
Hexianespecially since I can have no issues for a week or 215:05
tomreynthat's no measure15:05
RoyKHexian: haven't tried netdata yet - will check - but it seems to be lacking support for windows machines15:05
* RoyK is working on setting up zabbix for monitoring ~300 machines15:06
drabRoyK: how do you monitor windows machines with zabbix? does it have a client for ms now?15:06
RoyKthe windows client has been there for years15:07
HexianRoyK: yeah netdata is very linux-specific unfortunately, I'd also love to use it on my windows boxes, but it's really great software, monitors an insane amount of stats in real time, with almost no system overhead15:07
draboh nice, I haven't looked at it for years :)15:07
drabHexian: do you do any aggregation/centralization of netdata data? to say influxdb?15:07
drabthat's what I'm trying to do15:07
drabbecause I want trending over longer periods, not just realtime data15:07
HexianI plan on it in the future, I have less linux boxes right now than I have had in the past, and netdata never used to have any aggregation options available15:08
drabyeah, it still doesn't, cavia the plugin to send data elsewhere, which includes influx now, altho graphite is also an option15:09
Hexianchanging swappiness seems to have reduced IO wait spikes so far, but there has been one performance drop on the box which affected all 3 of the real time processes15:21
Hexianmost of the time when one of these performance spikes happens, it only seriously affects one of the 3 processes, but sometimes it affects 2 or all 3 at the same time15:22
HexianI assumed that was when all 3 were reading from the disk at the same time as an IO wait spike, but now I'm not so sure15:23
HexianI'll have to wait patiently for the spike and see15:23
Hexianalso, even with swappiness set to 1, these real time processes can spike to over 100 major page faults in 1s15:25
Hexian100 major faults doesn't seem so bad when fail2ban randomly does 16 000 major faults in a second15:28
Hexianagain, I'm not sure why processes, let alone something tiny like fail2ban, are swapping out memory with plenty free15:28
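The major fault counts quoted here can be cross-checked against the kernel's own per-process counter, majflt, which is field 12 of /proc/<pid>/stat. A sketch with hypothetical function names. Note that a major fault is any fault that needs disk IO, so the counter also rises when file-backed pages (program code, libraries, mapped files) are read back in, not only when pages come back from swap.

```python
def major_faults(stat_text):
    """Return the majflt counter (field 12) from /proc/<pid>/stat text."""
    # comm (field 2) may contain spaces or parentheses, so split after the last ')'.
    fields = stat_text.rsplit(")", 1)[1].split()
    return int(fields[9])  # fields[0] is field 3 (state), so field 12 is index 9


def faults_per_second(first, second, interval_seconds):
    """Major fault rate between two samples taken interval_seconds apart."""
    return (second - first) / interval_seconds
```

Sampling the same pid twice, one second apart, and passing both counts to faults_per_second gives a figure comparable to the per-second numbers above.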
tomreynHexian: i think drab pointed you to an explanation which he quoted from https://www.kernel.org/doc/gorman/html/understand/understand014.html earlier15:35
tomreynalso, unless you only use it for ssh and don't actually depend on password authentication, get rid of fail2ban, it's not needed / just making things worse.15:37
HexianI mean, even with swappiness 115:37
Hexianfail2ban is there just for ssh as there is password authing on the box currently15:38
HexianI don't think it actually causes enough overhead to be an issue, at least from what I've seen15:38
Hexianapart from IO wait times, disk reads and major page faults, I haven't noticed anything else which coincides with the performance drops15:40
tomreynoverhead is small, but it could be abused to prevent legitimate admins from logging in if ip spoofing is possible.15:40
Hexianthe only other thing that can be seen in all the statistics on the box is that the processes affected simply use far less CPU time (or virtually none at all, if they lock up long enough)15:41
Hexianfrom the processes own perspective, it just hard freezes for that period15:41
Hexianinteresting, I'll read up about that, I wasn't aware of fail2ban exploits15:42
tomreynyou could use port knocking instead or expose ssh only on a management network which you reach through a vpn15:42
Hexianyeah, options are a bit limited due to the host, but I'll worry more about security of the box once I've solved the intermittent performance issues15:44
tomreynyou could also investigate a different io scheduler, but this seems a LOT too far fetched until you have looked more into basics like defective hardware and the like. after all i think you said other systems with identical configurations do not exhibit this behavior, which hints at hardware.15:44
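Before trying another IO scheduler it helps to know which one is active. The kernel lists the available schedulers in /sys/block/<device>/queue/scheduler with the active one in square brackets. A sketch, where the function names and the default device name sda are assumptions:

```python
import re


def active_scheduler(scheduler_text):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", scheduler_text)
    return match.group(1) if match else None


def read_active_scheduler(device="sda"):
    """Read the active IO scheduler for a block device (device name is an assumption)."""
    with open("/sys/block/%s/queue/scheduler" % device) as handle:
        return active_scheduler(handle.read())
```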
HexianI may end up having to get a new box altogether unless I can prevent these massive spikes, but I don't have any good options at this point15:45
Hexianwell, the other boxes have much older and slower hardware, and much older Ubuntu versions15:46
Hexianso there are a lot of possible factors at play15:46
Hexianwhile it could be hardware, I think it's far more likely a software related issue15:47
Hexianthe other boxes are running Ubuntu server 14.04, I assume a lot has changed between 14.04 and 16.04.115:49
tomreynokay, that's a lot harder to determine with homogenous hardware / software configurations15:49
tomreynsure the OS changed a lot.15:50
Hexianindeed, this kind of issue that just goes away for weeks at a time and comes back worse than ever at random times is living hell15:50
Hexianit was worse last night than I've seen in like 2 months15:50
Hexianbut it was even worse at a point 2 months ago15:51
Hexianthe processes haven't had a single spike since that one big performance drop that affected all 3 earlier15:52
Hexiancould be hours or days before I see another spike15:53
drabHexian: the easiest thing to try imho is kernel version16:18
drabif you recall from earlier, newer kernels have that optimization for swap on SSDs16:19
drabif you have older machines with older kernels and no problems, that could be a good one to test16:19
drabjust install an older kernel on your new server and pin the pkg and see how it goes16:19
Hexianprobably worth a try, knowing my luck I'll probably end up with some terrible new issue specific to the older kernel version16:21
HexianI'm not sure that I'd want to use a 3 year old kernel on the new high end box though16:23
Hexianbut if using an old kernel for a week or two solves the problem, then at least we know it's actually the kernel16:23
drabHexian: well, '3 yrs old kernel', it's all relative, do you actually know what's in the new ones that you need?16:27
drab14.04 will support the kernel with security updates for another ~2yrs16:27
drabso it's not like you're gonna run an unsupported/crappy kernel16:27
draband that's 3.x vs 4.x, which is where the SSD optimizations for swap were introduced according to that changelog16:28
Hexiandrab: good point. I don't keep track of kernel changes, but there are some things that I expect may be improved in newer kernels which would be useful, like IO caching and memory deduplication16:36
HexianI've had low level features like those cause performance problems for me with very old kernels in the past16:36
Hexianone of the processes just spent 1305 ms waiting on a system call at the exact moment that 10222 major page faults for the process occurred16:43
Hexianso even with swappiness set to 1, the kernel is still swapping pages for these processes and causing them to freeze up16:43
Hexianinteresting that the process didn't do any reads according to stats at that point, so the process presumably locked up for over a second purely due to the major page faults16:45
=== xGLaDER is now known as GLaDER
drabHexian: I'm assuming you're running a real time kernel? (and the sw takes advantage of those facilities, the way it's coded, that is)18:08
Hexiandrab: I'm not using a real time kernel, even with hosting more sensitive services in the past, it was generally overkill and the increased frequency just added more overhead18:09
drabfair enough18:09
=== JanC_ is now known as JanC
_Xenial_Xerus_I can make a picture of a walker man going to the F.D.L. with the SDCARD in pocket.21:39
_Xenial_Xerus_And a worse case scenario the police beating checkpoints attack find the SDCARD however the passphrase is stored in Man's mind.21:40
_Xenial_Xerus_So they hold and torture the man in a skyrise of 'unofficial' gas chambers.21:41
_Xenial_Xerus_Man only needs to recall the single passphrase to return to his HOME21:42
_Xenial_Xerus_After the cruel and 'now usual' punishment.21:42
=== _Xenial_Xerus_ is now known as ubuntu
=== ubuntu is now known as Guest17872
=== Guest17872 is now known as Xenial
=== Xenial is now known as Guest89430
=== Guest89430 is now known as _Xenial
