[08:33] <LeMike> Hello there. I want to debug what is taking so long in mysqld. One server is 120 times slower than 2 other server (1 weaker, 1 stronger). So I strace mysql and see some pread but lots of io_getevents and io_submit. Is this normal? How do I find the device or process that causes the problem?
[08:34] <sarnold> strace is the wrong tool for the job
[08:35] <sarnold> this is a decent starting point http://www.brendangregg.com/linuxperf.html
[08:35] <LeMike> Thanks. This is the best help so far :D dtrace or where could I start? I just found out that iotop has a "Netlink error" but the memtester passed.
[08:35] <LeMike> sarnold: thanks I read this
[08:36] <sarnold> LeMike: linux's perf tool / or the iovisor collection of tools is your best bet, no dtrace on linux
[08:39] <sarnold> alright, time to bail, have fun :)
[08:52] <LeMike> sarnold: thanks. I already like perf a lot :D read a report and liked to investigate even more but somehow the problem is gone. I have a good guess that the hoster finally found a thing on his side and shamefully fixed it silently. I hunt this shit since yesterday.
[11:29] <RoyK> LeMike: try slow query log
[11:29] <RoyK> LeMike: and make sure you install sysstat first so you can monitor the system while running the query
[11:29] <RoyK> LeMike: to see what's slowing things down - usually cpu or i/o
[14:33] <LeMike> RoyK: Slow query log just shows the query and how long it took. With that I came to the conclusion to look into mySQL and IO. Thanks
[14:35] <LeMike> I continue some other day because after a limit of 4 hours I have to stop investigating (as the customer does not want to pay more xD )
[14:41] <_KaszpiR_> LeMike what's the specs of the servers, and any noticeable mysql setting set?
[14:43] <_KaszpiR_> if they fixed it silently then I suspect that if this is on virtual platform then memory baloon driver/high cpu steal/horrid disk io perf (cause array rebuilds etc)
[14:46] <RoyK> LeMike: you should see where the load is high easily with sar
[17:53] <LeMike> RoyK: I dont get it. This seems more like recording a timespan than a single process. And what if some buffer is full or some process is just waiting for stupid reasons?
[17:59] <RoyK> it's recording a timespan, yes
[17:59] <RoyK> but then again, using things like top or htop can tell you where the problem lies
[18:00] <RoyK> iotop perhaps too
[18:01] <RoyK> if you have a truckload of different services on the same server, all competing, you should be virtualising more
[18:02] <RoyK> if the machine is dedicated to mysql, it should be trivial to find the bottleneck
[18:02] <blackflow> if I may interject for a moment.... how would virt help here. if anything it'd just slow things down with extra overhead.
[18:04] <blackflow> LeMike: you can check for two things, iowait and disk latency
[18:05] <RoyK> blackflow: it'll slow things down a few percent, but it'll definetely make it easier to monitor
[18:05] <RoyK> blackflow: throw a ton of services on the same machine and "something" is slow, well, then everything's slow
[18:07] <blackflow> pretty sure it'd be the same thing under a VM. the host kernel has to juggle processes and serialize IO in both cases
[18:07] <blackflow> but I agree about monitoring, somewhat. should be possible with cgroups to some extent too.
[18:08] <RoyK> if vitualised, monitoring on the guest will show you the bottleneck
[18:13] <LeMike> blackflow: I let it run with "iostat -d 1" and got somewhat like 2348 kB write every few seconds. iowait is 0,85% but I guess I need to reboot. dunno how to check latency
[18:13] <blackflow> LeMike: https://www.kernel.org/doc/Documentation/block/stat.txt
[18:14] <blackflow> read ticks / read I/Os = average wait (latency) for read
[18:14] <blackflow> same for write.
[18:16] <LeMike> uh. can I truncate this file while the server runs? :) this would be nice
[18:16] <RoyK> LeMike: how much memory? what is the iops count? what sort of storage does it have?
[18:16] <RoyK> what about cpu load?
[18:16] <RoyK> swap usage?
[18:17] <RoyK> sar will give you paging statistics
[18:17] <RoyK> have you tuned mysql/mariadb in any way?
[18:17] <RoyK> how large is the database's dataset?
[18:17] <blackflow> what about the db itself... effective indexes? does it access tuples sequentially or indexed?
[18:17] <RoyK> lots and lots of questions ;)
[18:17] <blackflow> indeedy.
[18:18] <RoyK> mysql allows you to run "explain <query>" to see if indices are used or not
[18:19] <RoyK> quite useful thing
[18:19] <blackflow> LeMike: I'd recommend you activate some graph baesed monitoring on that server. A picture is worth a thousand eyes, and problems are easier to spot. I use Munin for that, infact, that's how I know about that latency measure. It's from Munin's diskstats plugin.
[18:19] <blackflow> *a thousand words   (lol)
[18:20] <blackflow> I'm listening to TechSnap podcast and they're talking about "many eyes in open source security", so the fingers typed that :)
[18:21] <LeMike> 4G ram (more than 3G free also in swap free), cpu idles, iotop does not work :/ the table is 276M
[18:21] <LeMike> not touched mysql config (once for more cache etc but reset it already)
[18:22] <blackflow> 276M is a lot, esp if the DB has to sift through the rows sequentially because it's badly indexed. a single select can take seconds....
[18:22] <blackflow> (I'm assuming you meant bytes, because 270M rows would be gigabytes most likely...)
[18:22] <RoyK> and then - mysql doesn't scale too well
[18:23]  * RoyK only uses mysql at gunpoint
[18:23] <blackflow> yeah I don't know much about MySQL. PostgreSQL is my poison and I monitor all that with Munin - tuple counts, sequential vs indexed access, average query time, size, stuff...
[18:24] <blackflow> DB monitoring and statistics is a life saver.
[18:24] <RoyK> I'd choose postgres over mysql any day
[18:24] <blackflow> aye!
[18:24] <LeMike> (megabytes) Same DB and query on my laptop takes 1-2 seconds. server needs more than 60 seconds
[18:25] <RoyK> but then - some systems like wordpress don't work too well on postgres because those wp guys haven't understood a damn thing about abstraction
[18:25] <RoyK> LeMike: what sort of storage?
[18:25] <LeMike> I look up some monitoring and gather statistics. To bad I got perf.data here and was hoping to just read through and see the big time waster
[18:26] <blackflow> if MySQL is anything liek postgres,   EXPLAIN ANALYZE <query> is a treasure trove of data bout the performance of that query. number of iterations, indexes used, loops, hashes, everything.
[18:26] <LeMike> RoyK: You mean InnoDB? Or filesystem?
[18:26] <RoyK> disk
[18:27] <blackflow> RoyK: I was so pleasantly suprised when I installed GitLab and saw it developed specifically for Postgres (and recommending Postgres over the dolphin). So rare to see something written for Postgres.  Ohh ohh, and roundcube.
[18:27] <LeMike> RoyK: zfs
[18:27] <RoyK> LeMike: what sort of disks?
[18:27] <blackflow> oh wait, that changes things a bit. did you tune it for mysql?
[18:28] <blackflow> LeMike: iirc MySQL recordsize is recommended to be 16k. Postgres 8k. Now, depending on your use case and workload, that might mean a lot, or nothing at all (eg. my use case allows larger record size -- with postgres -- which benefits compressratio)
[18:29] <LeMike> RoyK: HDD. if that isn the answer then help pls
[18:30] <RoyK> LeMike: what sort of hdd and how many of them?
[18:30] <RoyK> LeMike: there's no straight answer here, but as blackflow says, the zfs blocksize for that dataset may be important
[18:32] <RoyK> LeMike: http://open-zfs.org/wiki/Performance_tuning#MySQL
[18:34] <LeMike> cant access shit with lshw. just one HDD. Got 50GB out of 3TB for this machine it seems.
[18:45] <LeMike> oh. 2x TOSHIBA DT01ACA3 in /proc/scsi/scsi. A raid? I dont see the block size. fdisk does not work and I dunno how
[18:52] <_KaszpiR_> zfs on one disk?
[18:53] <_KaszpiR_> show us  'lsblk'  output
[18:54] <RoyK> LeMike: zfs get
[18:58] <tomreyn> OS + zfs + mysql on 4GB RAM just sounds wrong.
[18:59] <tomreyn> unless its a *very* light-weight DB workload you'll want more RAM.
[19:00] <tomreyn> disks dont matter for a database server unless you designed it wrong, provided too little RAM / didn't tune it for the RAM it has available.
[19:02] <RoyK> zfs isn't that heavy on memory - that's a myth - that is - unless you turn on dedup
[19:02] <RoyK> but then, that's praying to all known gods for trouble
[19:04] <LeMike> _KaszpiR_: got it here https://pastebin.com/p3BsHefU
[19:04] <RoyK> LeMike: pastebin output of "zpool status", "zfs list" and "zfs get all", please
[19:09] <LeMike> RoyK: no pool, no dataset, zfs get all has no output (I am in PVE it seems, I am not the hoster or have access to the host)
[19:12] <RoyK> then you're not running zfs
[19:15] <LeMike> df -Th said so
[19:17] <RoyK> pastebin that
[19:17] <RoyK> keep in mind you need "sudo" in front of those zfs/zpool commands
[19:18] <LeMike> I am groot!
[19:18] <LeMike> https://pastebin.com/tqEXHQs7
[19:18] <RoyK> that doesn't make sense
[19:20] <LeMike> Either way this feels like a direction that is not really giving me an answer what specific device or thingy is taking so long . Like "the buffer is too low" or "sync takes too long" or "Raid is broken" etc.
[19:24] <tomreyn> it's probably a container and ZFS is managed outside of it
[19:24] <LeMike> Okay. This is reaching 1k € almost. I give up, tell the customer and let him kick the hoster in his nuts. I think that my app has nothin to do with being so slow :P
[19:25] <tomreyn> did you run mysqltuner?
[19:26] <tomreyn> doing so (or knowing what needs to be done OTOH) is the minimum i'd expect from a contractor in charge of managing a DB
[19:39] <LeMike> thanks, those tests are all okay. and I go to sleep now. This took the whole day and I am totally not a SysOp, just a simple dev. But I learned things. Thanks a lot! :)
[19:40] <_KaszpiR_> 1k EUR and answer is 'this is a vm provider issue, lol'
[19:40] <blackflow> RoyK: fwiw, zfs/zpool commands no longer need sudo for read only tasks  (eg list, status, get, ...)
[19:40] <RoyK> blackflow: possibly
[19:40] <blackflow> no longer = as of 0.7.x and Bionic
[19:41] <_KaszpiR_> LeMike I suggest installing some monitoring tools (newrelic is the easiest but free tier limits view stats for 24h afair)
[19:41] <RoyK> still, zfs list should show the dataset(s) and zpool list should show the pools etc
[19:41] <RoyK> if they don't, something is messed up
[19:41] <_KaszpiR_> zpool list without sudo will return permission denied
[19:41] <RoyK> LeMike: have you upgraded something lately?
[19:42] <blackflow> _KaszpiR_: not on Bionic
[19:42] <RoyK> zfs can be a bit touchy in that respect
[19:42] <blackflow> also... newrelic? why on earth.... there's plenty of nice, FOSS-y and not SaaS monitoring things. Munin, Nagios, Cacti, ...
[19:43] <_KaszpiR_> I know, but that was just an idea of getting thing super fast and easy without the whole mess required to set up other tools
[19:43] <_KaszpiR_> munin would be quite fast, though
[19:44] <blackflow> yeah. Munin is great, thats what I use.
[19:44] <RoyK> nagios went out of style some years back ;)
[19:44] <blackflow> heh
[19:44] <blackflow> what is the monitoring tool of the year now, Zabbix?
[19:44] <RoyK> I use zabbix
[19:44] <_KaszpiR_> but most of those paid projects hav eone advantage - you get a shitload of integrated monitoring stuff from the box, without much extra configs
[19:44] <RoyK> and munin
[19:44] <RoyK> why not both
[19:45] <_KaszpiR_> collectd + graphite + grafana / munin (wanna migrate from munin, though)
[19:49] <RoyK> munin is very nice to start with
[19:49] <_KaszpiR_> it works very well for say... 15 hosts
[19:49] <_KaszpiR_> but thend to be lacking if you need to aggregate stuff
[19:49] <RoyK> works fine with a hundred machines too, but it doesn't scale too well above that
[19:50] <RoyK> munin is old and rather simple
[19:50] <RoyK> but very easy to setup
[19:50] <_KaszpiR_> and about zabbix, I'm not a fan of it (but used only 2.x AFAIR and nothign else, and it was pretty nightmare)
[19:51] <RoyK> _KaszpiR_: we're at 3.4 now - quite a bit has hppened since v2 ;)
[19:54] <_KaszpiR_> I remember that the logic behind zabbix was pretty 'wtf' and it required A LOT of extra customization to get any app working, which was major pain
[19:55] <RoyK> it still seems russian
[19:55] <RoyK> but you get used to ut ;)
[19:55] <RoyK> s/ut/it/
[19:55] <_KaszpiR_> something like 'send any metric to server' was just impossible, because it had to exist on zabbix or it was dropped
[19:56] <_KaszpiR_> graphite as it's own issues but I've found it easier to manage with those
[19:56] <RoyK> but then - what else is there to choose?
[19:56] <RoyK> graphite only draws graphs
[19:56] <RoyK> it doesn't have alerts and triggers etc
[19:57] <_KaszpiR_> yeah
[19:57] <_KaszpiR_> still you can query those and have different alerting system
[19:57] <_KaszpiR_> grafana has alerting now
[19:58] <_KaszpiR_> from other solutiuons - prometheus is pretty decent
[19:59] <RoyK> we chose zabbix at work - some 350 servers, 60/40 linux/windows