[04:01] <NginUS> Which OpenVPN installation tutorial should we be using for 16.04? I see 4 different ones that all seem (semi) official. https://😻🍕.ws/⤵♓➿
[14:26] <Hexian> hey guys, I have a major intermittent issue which is affecting real time services on my Ubuntu 16.04.1 box at random times
[14:27] <Hexian> I'm using netdata on the box to monitor a massive number of stats, and what appears to be happening is Ubuntu is randomly swapping in memory from disk for one of the real time process instances
[14:28] <Hexian> the box uses an average of 50% CPU and 30% of 32GB ram, so it should have absolutely no reason to use a swap file for these 3 important processes which each use only around 10% of system memory
[14:29] <Hexian> does anyone have a solution for me? it seems like it is not possible to explicitly tell Ubuntu to never use swap memory for these specific processes
[14:29] <Hexian> every time it swaps a chunk of memory for no reason whatsoever, it causes an IO wait queue which locks up my real time processes for anywhere from 500ms to over 30 000ms
[14:35] <Hexian> it seems like my only option is to disable the swap file completely on the entire box... I'd really like to avoid doing that if possible
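Linux does offer a per-process way to keep memory out of swap: the mlockall(2) system call keeps all of the calling process's pages resident. A minimal sketch, assuming the service can make the call itself at startup and has either a large enough RLIMIT_MEMLOCK or CAP_IPC_LOCK. The `lock_process_memory` helper name is hypothetical, and the MCL_* constants are the values used on x86 and ARM Linux (a few architectures define them differently):

```python
import ctypes
import os

# Values from <sys/mman.h> on x86/ARM Linux (assumption: other
# architectures such as PowerPC or SPARC use different numbers).
MCL_CURRENT = 1  # lock pages that are mapped right now
MCL_FUTURE = 2   # also lock pages mapped later


def lock_process_memory():
    """Ask the kernel to keep this process's memory resident (never swapped).

    Returns (True, "") on success or (False, reason) on failure, for
    example when RLIMIT_MEMLOCK is too small for the process.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        return False, os.strerror(ctypes.get_errno())
    return True, ""
```

A server written in C or C++ would call mlockall() directly with the same flags; the Python wrapper here only illustrates the call.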
[14:39] <tomreyn> Hexian: you could reduce swappiness, but this also affects all processes
[14:39] <tomreyn> if I/O blocks for up to 30s due to swap in then something else seems to be wrong, though
[14:40] <Hexian> tomreyn: that wouldn't guarantee that the real time processes avoid swapping though, I'm not sure why there is swapping with 30% system ram usage though
[14:40] <tomreyn> what is swapped in must have been swapped out before, try to understand why it was swapped out at all.
[14:40] <Hexian> I guess it doesn't matter what process is doing the swapping, if any process swaps large pages, it's going to cause an IO wait queue and lock up any of my real time processes that happen to be reading from disk at the time
[14:41] <tomreyn> it should not actually lock them up, just slow their reads down
[14:41] <tomreyn> are you using HDDs or SSDs or something else for storage?
[14:42] <tomreyn> any hardware or software RAID?
[14:42] <Hexian> I've had this issue intermittently since I got the box, these processes don't even write to disk but use a ram disk for persistence to avoid writes
[14:42] <Hexian> the weird part is, the issue is not bad enough to cause a serious problem every week
[14:43] <Hexian> it can happen a massive amount one day, and then not even be noticeable for the next week
[14:43] <Hexian> I've been trying to solve this for months now, because when it happens seriously, it's a really major problem for my services
[14:45] <Hexian> the hard disk is a mechanical sata drive
[14:45] <tomreyn> i don't think you answered any of the questions i asked.
[14:45] <tomreyn> i'd suggest you start with performance testing if you haven't done this yet. https://www.thomas-krenn.com/en/wiki/Linux_I/O_Performance_Tests_using_dd
[14:45] <Hexian> it's not as fast as an SSD or raid setup, but it's not very slow either, and I need very little IO, so it should not be causing a problem
[14:45] <Hexian> like I said, I'm doing all writes from real time services to a ram disk
[14:46] <Hexian> tomreyn: I'd need to shutdown all my services to do that, but what exactly would it accomplish?
[14:46] <Hexian> I know that the hardware isn't slow
[14:46] <Hexian> I deploy updates every week which involves copying ~6GB a few times
[14:46] <drab> NginUS: https://www.digitalocean.com/community/tutorials/how-to-set-up-an-openvpn-server-on-ubuntu-16-04
[14:46] <RoyK> rotating drives are slow
[14:47] <drab> NginUS: I used that one
[14:47] <drab> NginUS: worked for me right away
[14:47] <NginUS> drab: thx
[14:47] <Hexian> RoyK: like I said, I'm using virtually no IO at all, I don't even do writes to the disk...
[14:47] <Hexian> the disk is very fast for what I need it for
[14:47] <RoyK> Nefertiti: how do you monitor the system?
[14:47] <Hexian> the only time I have an issue is when Ubuntu randomly swaps huge pages for no reason
[14:47] <tomreyn> Hexian: okay i guess the disk I/O should not matter (except for swap) if your applications only write to / read from RAM disks.
[14:48] <Hexian> the applications read from disk at times, and that is the issue
[14:48] <RoyK> I thought RAM disks were a thing of the nineties
[14:48] <Hexian> when Ubuntu swaps and causes IO wait, it locks up any process that happens to be reading from the disk at that time
[14:49] <RoyK> buffering does a good enough job today to stay away from that
[14:49] <Hexian> and it is a major issue since these processes run on a nanosecond scale
[14:49] <Hexian> if they lock up for more than a few ms, I have an issue, but this can cause extremely long lockups
[14:49] <RoyK> Hexian: pastebin sysctl vm.swappiness
[14:50] <Hexian> RoyK: I started using a ram disk for writes on this box specifically to reduce disk IO and avoid the possibility of IO wait times slowing down these processes
[14:50] <RoyK> Hexian: most filesystems buffer writes unless they are sync writes
[14:50] <Hexian> I know that, but the box is still locking up my processes
[14:51] <RoyK> Hexian: so generally a RAM drive is not needed - it just makes things worse after a power failure or similar
[14:51] <drab> Hexian: how much mem do you have and how big is the ramdisk? if I had to guess I'd say something about the ramdisk is causing the depletion of the available cached memory and that turns into swapping
[14:51] <tomreyn> maybe your application needs to use a disk read/write thread so that these operations do not cause the rest of the application to stall.
[14:51] <Hexian> I think this is something specific to newer Ubuntu versions, none of my older boxes have ever had this issue, and they are far lower end than this high end Xeon
[14:52] <RoyK> well, try without a ram drive first
[14:52] <Hexian> drab: 32GB ram, 30% used
[14:52] <RoyK> then debug further
[14:52] <drab> Hexian: 30% used for ramdisk you mean?
[14:52] <RoyK> btw, what sort of ramdrive is this? tmpfs?
[14:52] <Hexian> RoyK: I started using the ram disk months ago already, after months of intermittent issues
[14:52] <Hexian> drab: system memory
[14:52] <drab> Hexian: oh, so the issues already existed before the ram disk was introduced?
[14:53] <Hexian> the problem has existed since I got this box, but it comes and goes in severity every week, some weeks I don't even notice it
[14:53] <drab> ok
[14:53] <Hexian> which is why it is complete hell to try to solve
[14:53] <drab> indeed
[14:53] <drab> intermittent problems are the worst
[14:53] <RoyK> these things happen at times
[14:53] <Hexian> today is the first time I noticed that it's swap related
[14:53] <RoyK> ok, set swappiness to 1
[14:54] <Hexian> I'm 98% sure that the IO wait times causing my processes to lock up are happening exactly when the kernel swaps out memory pages for no reason
[14:54] <RoyK> that'll basically turn off swap unless it's strictly needed
[14:54] <Hexian> or well, swaps in?
[14:54] <RoyK> where's the swap?
[14:54] <RoyK> on a slow drive?
[14:55] <drab> well the question afaik is why it's swapping to begin with
[14:55] <Hexian> it's on the same disk, but it shouldn't need to be used, that's the point
[14:55] <drab> at least in theory swap should only occur if the buffer cache is depleted
[14:55] <drab> and the system needs to load more data than it has available memory
[14:55] <Hexian> with 30% system memory being used, there is always enough for all needed memory to be in physical ram
[14:55] <drab> which sounds like a condition Hexian doesn't have
[14:55] <drab> right
[14:55] <RoyK> drab: linux swaps a lot if memory is stressed - just reduce swappiness
[14:56] <drab> RoyK: ok, well, even if just for curiosity I'd like to understand that a tad better... what does "stressed" mean? I've never heard of that
[14:56] <RoyK> Hexian: did you try to reduce swappiness?
[14:56] <drab> ime you either have enough mem to fit data read, or you don't and it swaps, but maybe i'm making it too simple
[14:56] <drab> I don't have a very deep understanding of mem mgmt for sure
[14:57] <RoyK> drab: well, if memory access is heavy, linux will try to page out things not in use, and it's not always too smart about it
[14:57] <Hexian> I've set vm.swappiness to 1, I'm going to keep an eye on the box for the next few hours and see
[14:57] <Hexian> the IO lock ups have been excessive over the last 2 days
[14:57] <Hexian> while last week there were basically no issues at all
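The swappiness change can be checked from /proc, together with how much swap is actually occupied. A rough sketch, assuming the standard /proc/sys/vm/swappiness and /proc/meminfo files; the function names are illustrative only:

```python
from pathlib import Path


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness value as an integer."""
    return int(Path(path).read_text().strip())


def swap_in_use_kib(meminfo_text):
    """Return SwapTotal - SwapFree (in KiB) from /proc/meminfo contents."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


if __name__ == "__main__":
    print("vm.swappiness =", read_swappiness())
    print("swap in use (KiB) =", swap_in_use_kib(Path("/proc/meminfo").read_text()))
```

A value set at runtime with sysctl is lost on reboot unless it is also written to /etc/sysctl.conf or a file under /etc/sysctl.d/.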
[14:58] <RoyK> linux likes swap - better use an ssd if you really need it
[14:58] <tomreyn> Hexian: i assume you have checked dmesg about these lockups?
[14:58] <RoyK> that is - don't use spinning rust apart from mass storage
[14:58] <RoyK> just my 2c
[14:59] <Hexian> my only other idea would be to increase the ram disk size to 8GB and move the ~7GB of data which gets read by the real time processes into ram
[14:59] <Hexian> that way even if the kernel causes disk wait times, the processes would be reading from ram
[15:00] <Hexian> the processes only read small chunks from that data infrequently at random times, but if one of their reads happens to be when the kernel is swapping, the effect can be way more extreme than you'd think
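The effect of small random reads colliding with other disk activity can be observed without stopping the services by timing reads from a side process. A minimal sketch, assuming a large data file on the same disk to probe; the `probe_read_latency` name and the 5 ms threshold are hypothetical choices, and reads served from the page cache will naturally be fast, so only the slow outliers are of interest:

```python
import os
import random
import time


def probe_read_latency(path, samples=100, chunk=4096, slow_ms=5.0):
    """Time small random reads and return (offset, milliseconds) for slow ones."""
    size = os.path.getsize(path)
    slow = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(samples):
            offset = random.randrange(0, max(1, size - chunk))
            start = time.perf_counter()
            os.pread(fd, chunk, offset)
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            if elapsed_ms >= slow_ms:
                slow.append((offset, elapsed_ms))
    finally:
        os.close(fd)
    return slow
```

Run in a loop with timestamps, the slow entries can be lined up against the IO wait and page fault graphs to see whether they coincide.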
[15:00] <tomreyn> i'd also investigate the smart data of those disks. but you don't seem to have a resilient architecture there (i.e. not HA) so you can apparently not afford this during production.
[15:00] <Hexian> I guess that's just due to the combination of the slow mechanical disk and the kernel's heavy swapping
[15:00] <drab> this may be what's happening, altho just a guess: https://www.kernel.org/doc/gorman/html/understand/understand014.html
[15:00] <RoyK> Hexian: you don't need a ram drive
[15:00] <drab> "The casual reader may think that with a sufficient amount of memory, swap is unnecessary but this brings us to the second reason. A significant number of the pages referenced by a process early in its life may only be used for initialisation and then never used again. It is better to swap out those pages and create more disk buffers than leave them resident and unused."
[15:00] <RoyK> Hexian: what sort of application is this anyway?
[15:01] <Hexian> drab: performance is far more important than reliability for my use case
[15:01] <drab> Hexian: sure, but that seems to be what it's doing nonetheless, and maybe swappiness to 1 will influence that behavior
[15:01] <Hexian> as you can hear from me doing writes to a ram disk, I'd rather risk losing data in the case of a hardware failure, than have any performance spikes
[15:01] <drab> of course
[15:02] <Hexian> RoyK: MMO game servers
[15:03] <RoyK> Hexian: just setup proper monitoring of the server(s)
[15:03] <RoyK> zabbix, munin, something
[15:03] <drab> this may be part of what happened since you said it didn't use to manifest as a problem
[15:03] <drab> https://kernelnewbies.org/Linux_4.11#head-e391b21340381dfcd6d837a15f8ec890fa1316c7
[15:04] <Hexian> RoyK: I'm using netdata, it's far better than those solutions
[15:04] <drab> swap mgmt changed in newer kernels
[15:04] <Hexian> I can see every possible system stat in real time
[15:04] <drab> to make it more appropriate for SSDs
[15:04] <drab> ie the opposite of what you want
[15:04] <drab> so that may be why you're seeing problems now that you didn't use to see, read that link
[15:04] <Hexian> that does sound like it could be the culprit
[15:05] <drab> so you could in theory try an older kernel and see if that makes a difference
[15:05] <tomreyn> or maybe the disk is just dying
[15:05] <Hexian> the box is a few months old, I doubt it's a hardware issue
[15:05] <Hexian> especially since I can have no issues for a week or 2
[15:05] <tomreyn> that's no measure
[15:05] <RoyK> Hexian: haven't tried netdata yet - will check - but it seems to be lacking support for windows machines
[15:06]  * RoyK is working on setting up zabbix for monitoring ~300 machines
[15:06] <drab> RoyK: how do you monitor windows machines with zabbix? does it have a client for ms now?
[15:07] <RoyK> the windows client has been there for years
[15:07] <Hexian> RoyK: yeah netdata is very linux-specific unfortunately, I'd also love to use it on my windows boxes, but it's really great software, monitors an insane amount of stats in real time, with almost no system overhead
[15:07] <drab> oh nice, I haven't looked at it for years :)
[15:07] <drab> Hexian: do you do any aggregation/centralization of netdata data? to say influxdb?
[15:07] <drab> that's what I'm trying to do
[15:07] <drab> because I want trending over longer periods, not just realtime data
[15:08] <Hexian> I plan on it in the future, I have less linux boxes right now than I have had in the past, and netdata never used to have any aggregation options available
[15:09] <drab> yeah, it still doesn't, cavia the plugin to send data elsewhere, which includes influx now, altho graphite is also an option
[15:21] <Hexian> changing swappiness seems to have reduced IO wait spikes so far, but there has been one performance drop on the box which affected all 3 of the real time processes
[15:22] <Hexian> most of the time when one of these performance spikes happens, it only seriously affects one of the 3 processes, but sometimes it affects 2 or all 3 at the same time
[15:23] <Hexian> I assumed that was when all 3 were reading from the disk at the same time as an IO wait spike, but now I'm not so sure
[15:23] <Hexian> I'll have to wait patiently for the spike and see
[15:25] <Hexian> also, even with swappiness set to 1, these real time processes can spike to over 100 major page faults in 1s
[15:28] <Hexian> 100 major faults doesn't seem so bad when fail2ban randomly does 16 000 major faults in a second
[15:28] <Hexian> again, I'm not sure why processes, let alone something tiny like fail2ban, are swapping out memory with plenty free
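Which processes actually have pages sitting in swap can be read from the VmSwap line of /proc/<pid>/status. A small sketch, assuming a kernel that exposes VmSwap (it is absent for kernel threads); both helper names are hypothetical:

```python
import os


def swap_kib_from_status(status_text):
    """Return the VmSwap value (KiB) from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmSwap line


def swap_by_process(proc_root="/proc"):
    """Map pid -> KiB in swap for every process with a non-zero VmSwap."""
    usage = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                kib = swap_kib_from_status(fh.read())
        except OSError:
            continue  # process exited or is not readable
        if kib:
            usage[int(entry)] = kib
    return usage
```

Sorting the result by value shows at a glance whether the real time processes or a helper such as fail2ban hold the swapped pages.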
[15:35] <tomreyn> Hexian: i think drab pointed you to an explanation which he quoted from https://www.kernel.org/doc/gorman/html/understand/understand014.html earlier
[15:37] <tomreyn> also, unless you only use it for ssh and don't actually depend on password authentication, get rid of fail2ban, it's not needed / just making things worse.
[15:37] <Hexian> I mean, even with swappiness 1
[15:38] <Hexian> fail2ban is there just for ssh as there is password authing on the box currently
[15:38] <Hexian> I don't think it actually causes enough overhead to be an issue, at least from what I've seen
[15:40] <Hexian> apart from IO wait times, disk reads and major page faults, I haven't noticed anything else which coincides with the performance drops
[15:40] <tomreyn> overhead is small, but it could be abused to prevent legitimate admins from logging in if ip spoofing is possible.
[15:41] <Hexian> the only other thing that can be seen in all the statistics on the box is that the processes affected simply use far less CPU time (or virtually none at all, if they lock up long enough)
[15:41] <Hexian> from the processes own perspective, it just hard freezes for that period
[15:42] <Hexian> interesting, I'll read up about that, I wasn't aware of fail2ban exploits
[15:42] <tomreyn> you could use port knocking instead or expose ssh only on a management network which you reach through a vpn
[15:44] <Hexian> yeah, options are a bit limited due to the host, but I'll worry more about security of the box once I've solved the intermittent performance issues
[15:44] <tomreyn> you could also investigate a different io scheduler, but this seems a LOT too far fetched until you have looked more into basics like defective hardware and the like. after all i think you said other systems with identical configurations do not exhibit this behavior, which hints at hardware.
[15:45] <Hexian> I may end up having to get a new box altogether unless I can prevent these massive spikes, but I don't have any good options at this point
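The scheduler currently in use for a disk is the bracketed entry in /sys/block/<device>/queue/scheduler. A short sketch for reading it, assuming the disk is `sda` (the device name is a guess, not something stated in the log) and using a hypothetical helper name:

```python
import re
from pathlib import Path


def active_scheduler(scheduler_text):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", scheduler_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Assumption: the mechanical SATA drive shows up as /dev/sda.
    text = Path("/sys/block/sda/queue/scheduler").read_text()
    print("available:", text.strip())
    print("active:", active_scheduler(text))
```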
[15:46] <Hexian> well, the other boxes have much older and slower hardware, and much older Ubuntu versions
[15:46] <Hexian> so there are a lot of possible factors at play
[15:47] <Hexian> while it could be hardware, I think it's far more likely a software related issue
[15:49] <Hexian> the other boxes are running Ubuntu server 14.04, I assume a lot has changed between 14.04 and 16.04.1
[15:49] <tomreyn> okay, that's a lot harder to determine with heterogeneous hardware / software configurations
[15:50] <tomreyn> sure the OS changed a lot.
[15:50] <Hexian> indeed, this kind of issue that just goes away for weeks at a time and comes back worse than ever at random times is living hell
[15:50] <Hexian> it was worse last night than I've seen in like 2 months
[15:51] <Hexian> but it was even worse at a point 2 months ago
[15:52] <Hexian> the processes haven't had a single spike since that one big performance drop that affected all 3 earlier
[15:53] <Hexian> could be hours or days before I see another spike
[16:18] <drab> Hexian: the easiest thing to try imho is kernel version
[16:19] <drab> if you recall from earlier, newer kernels have that optimization for swap on SSDs
[16:19] <drab> if you have older machines with older kernels and no problems, that could be a good one to test
[16:19] <drab> just install an older kernel on your new server and pin the pkg and see how it goes
[16:21] <Hexian> probably worth a try, knowing my luck I'll probably end up with some terrible new issue specific to the older kernel version
[16:23] <Hexian> I'm not sure that I'd want to use a 3 year old kernel on the new high end box though
[16:23] <Hexian> but if using an old kernel for a week or two solves the problem, then at least we know it's actually the kernel
[16:27] <drab> Hexian: well, '3 yrs old kernel', it's all relative, do you actually know what's in the new ones that you need?
[16:27] <drab> 14.04 will support the kernel with security updates for another ~2yrs
[16:27] <drab> so it's not like you're gonna run an unsupported/crappy kernel
[16:28] <drab> and that's 3.x vs 4.x, which is where the SSD optimizations for swap were introduced according to that changelog
[16:36] <Hexian> drab: good point. I don't keep track of kernel changes, but there are some things that I expect may be improved in newer kernels which would be useful, like IO caching and memory deduplication
[16:36] <Hexian> I've had low level features like those cause performance problems for me with very old kernels in the past
[16:43] <Hexian> one of the processes just spent 1305 ms waiting on a system call at the exact moment that 10222 major page faults for the process occurred
[16:43] <Hexian> so even with swappiness set to 1, the kernel is still swapping pages for these processes and causing them to freeze up
[16:45] <Hexian> interesting that the process didn't do any reads according to stats at that point, so the process presumably locked up for over a second purely due to the major page faults
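A major page fault is any fault that has to wait for disk I/O. That includes swap-ins, but also file-backed pages (program code, shared libraries, mmap'd files) that were dropped from the page cache and must be read again, which a low swappiness does not prevent. One way to tell the two apart is to compare a process's majflt counter with the system-wide pswpin/pswpout counters over the same interval. A sketch, assuming the documented field layout of /proc/<pid>/stat and /proc/vmstat; the function names are hypothetical:

```python
def parse_fault_counts(stat_line):
    """Return (minflt, majflt) from the contents of /proc/<pid>/stat."""
    # Field 2 (comm) is wrapped in parentheses and may contain spaces or
    # ')' itself, so split on the last ')' before tokenising the rest.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); minflt is field 10 and majflt is field 12.
    return int(rest[7]), int(rest[9])


def parse_swap_counters(vmstat_text):
    """Return (pswpin, pswpout): pages swapped in/out since boot."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters["pswpin"], counters["pswpout"]
```

If majflt jumps by thousands while pswpin barely moves, the faults are being served from files rather than from swap, and the swap settings are unlikely to be the cause.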
[18:08] <drab> Hexian: I'm assuming you're running a real time kernel? (and the sw takes advantage of those facilities, the way it's coded, that is)
[18:09] <Hexian> drab: I'm not using a real time kernel, even with hosting more sensitive services in the past, it was generally overkill and the increased frequency just added more overhead
[18:09] <drab> fair enough
[21:39] <_Xenial_Xerus_> I can make a picture of a walker man going to the F.D.L. with the SDCARD in pocket.
[21:40] <_Xenial_Xerus_> And a worse case scenario the police beating checkpoints attack find the SDCARD however the passphrase is stored in Man's mind.
[21:41] <_Xenial_Xerus_> So they hold and torture the man in a skyrise of 'unofficial' gas chambers.
[21:42] <_Xenial_Xerus_> Man only needs to recall the single passphrase to return to his HOME
[21:42] <_Xenial_Xerus_> After the cruel and 'now usual' punishment.