=== med_ is now known as medberry
[08:33] Hello there. I want to debug what is taking so long in mysqld. One server is 120 times slower than two other servers (one weaker, one stronger). So I straced mysqld and see some pread but lots of io_getevents and io_submit. Is this normal? How do I find the device or process that causes the problem?
[08:34] strace is the wrong tool for the job
[08:35] this is a decent starting point http://www.brendangregg.com/linuxperf.html
[08:35] Thanks. This is the best help so far :D dtrace or where could I start? I just found out that iotop has a "Netlink error" but the memtester passed.
[08:35] sarnold: thanks, I read this
[08:36] LeMike: linux's perf tool / or the iovisor collection of tools is your best bet, no dtrace on linux
[08:39] alright, time to bail, have fun :)
[08:52] sarnold: thanks. I already like perf a lot :D read a report and wanted to investigate even more, but somehow the problem is gone. My best guess is that the hoster finally found something on their side and shamefully fixed it silently. I've been hunting this since yesterday.
[11:29] LeMike: try the slow query log
[11:29] LeMike: and make sure you install sysstat first so you can monitor the system while running the query
[11:29] LeMike: to see what's slowing things down - usually cpu or i/o
[14:33] RoyK: The slow query log just shows the query and how long it took. With that I came to the conclusion to look into MySQL and I/O. Thanks
[14:35] I'll continue some other day because after a limit of 4 hours I have to stop investigating (as the customer does not want to pay more xD)
[14:41] <_KaszpiR_> LeMike what are the specs of the servers, and are any noticeable mysql settings set?
[14:43] <_KaszpiR_> if they fixed it silently then I suspect that if this is on a virtual platform it was the memory balloon driver/high cpu steal/horrid disk io perf (caused by array rebuilds etc)
[14:46] LeMike: you should see where the load is high easily with sar
[17:53] RoyK: I don't get it.
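[Editor's note] The slow query log suggested above records each statement with its timings, and a few lines of scripting turn that into a ranked list of offenders. A minimal sketch in Python, assuming the stock `# Query_time:` header format; the function name and the one-statement-per-line simplification are mine, and real logs also carry `# Time:` and `# User@Host:` lines, which this skips:

```python
import re

# Per-query header MySQL writes to the slow log, e.g.
# "# Query_time: 61.5  Lock_time: 0.0 Rows_sent: 1  Rows_examined: 2400000"
HEADER = re.compile(
    r"# Query_time: (?P<qt>[\d.]+)\s+Lock_time: [\d.]+"
    r"\s+Rows_sent: \d+\s+Rows_examined: (?P<rows>\d+)"
)

def slowest_queries(log_text, top=5):
    """Return (query_time_s, rows_examined, sql) tuples, slowest first."""
    results, current = [], None
    for line in log_text.splitlines():
        m = HEADER.match(line)
        if m:
            current = (float(m.group("qt")), int(m.group("rows")))
        elif current and line and not line.startswith(("#", "SET ", "use ")):
            results.append(current + (line.strip(),))
            current = None
    return sorted(results, reverse=True)[:top]
```

A huge Rows_examined next to a tiny Rows_sent usually points at a missing index.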
This seems more like recording a timespan than a single process. And what if some buffer is full or some process is just waiting for stupid reasons?
[17:59] it's recording a timespan, yes
[17:59] but then again, using things like top or htop can tell you where the problem lies
[18:00] iotop perhaps too
[18:01] if you have a truckload of different services on the same server, all competing, you should be virtualising more
[18:02] if the machine is dedicated to mysql, it should be trivial to find the bottleneck
[18:02] if I may interject for a moment... how would virt help here? if anything it'd just slow things down with extra overhead.
[18:04] LeMike: you can check for two things, iowait and disk latency
[18:05] blackflow: it'll slow things down a few percent, but it'll definitely make it easier to monitor
[18:05] blackflow: throw a ton of services on the same machine and "something" is slow, well, then everything's slow
[18:07] pretty sure it'd be the same thing under a VM. the host kernel has to juggle processes and serialize IO in both cases
[18:07] but I agree about monitoring, somewhat. should be possible with cgroups to some extent too.
[18:08] if virtualised, monitoring on the guest will show you the bottleneck
[18:13] blackflow: I let it run with "iostat -d 1" and got something like 2348 kB written every few seconds. iowait is 0.85% but I guess I need to reboot. dunno how to check latency
[18:13] LeMike: https://www.kernel.org/doc/Documentation/block/stat.txt
[18:14] read ticks / read I/Os = average wait (latency) for read
[18:14] same for write.
[18:16] uh. can I truncate this file while the server runs? :) this would be nice
[18:16] LeMike: how much memory? what is the iops count? what sort of storage does it have?
[18:16] what about cpu load?
[18:16] swap usage?
[18:17] sar will give you paging statistics
[18:17] have you tuned mysql/mariadb in any way?
[18:17] how large is the database's dataset?
[18:17] what about the db itself... effective indexes?
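[Editor's note] The stat.txt arithmetic quoted above is easy to script. A sketch in Python, assuming the eleven-field `/sys/block/<dev>/stat` layout from that kernel document (newer kernels append discard and flush fields, which this ignores; the function name is mine):

```python
def avg_io_latency(stat_line):
    """Average read/write wait in ms from a /sys/block/<dev>/stat line.

    Field layout per Documentation/block/stat.txt:
      0 read I/Os   3 read ticks (ms)
      4 write I/Os  7 write ticks (ms)
    """
    f = [int(x) for x in stat_line.split()]
    read_ms = f[3] / f[0] if f[0] else 0.0
    write_ms = f[7] / f[4] if f[4] else 0.0
    return read_ms, write_ms

# e.g.: with open("/sys/block/sda/stat") as fh:
#           print(avg_io_latency(fh.read()))
```

Note these counters accumulate since boot, so this is a lifetime average; for per-interval latency take deltas between samples, which is what `iostat -x` (from the sysstat package mentioned earlier) reports.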
does it access tuples sequentially or indexed?
[18:17] lots and lots of questions ;)
[18:17] indeedy.
[18:18] mysql allows you to run "explain <query>" to see if indices are used or not
[18:19] quite useful thing
[18:19] LeMike: I'd recommend you activate some graph-based monitoring on that server. A picture is worth a thousand eyes, and problems are easier to spot. I use Munin for that; in fact, that's how I know about that latency measure. It's from Munin's diskstats plugin.
[18:19] *a thousand words (lol)
[18:20] I'm listening to the TechSnap podcast and they're talking about "many eyes in open source security", so the fingers typed that :)
[18:21] 4G ram (more than 3G free, swap also free), cpu idles, iotop does not work :/ the table is 276M
[18:21] haven't touched the mysql config (once for more cache etc but reset it already)
[18:22] 276M is a lot, esp if the DB has to sift through the rows sequentially because it's badly indexed. a single select can take seconds...
[18:22] (I'm assuming you meant bytes, because 276M rows would most likely be gigabytes...)
[18:22] and then - mysql doesn't scale too well
[18:23] * RoyK only uses mysql at gunpoint
[18:23] yeah I don't know much about MySQL. PostgreSQL is my poison and I monitor all that with Munin - tuple counts, sequential vs indexed access, average query time, size, stuff...
[18:24] DB monitoring and statistics is a life saver.
[18:24] I'd choose postgres over mysql any day
[18:24] aye!
[18:24] (megabytes) Same DB and query on my laptop takes 1-2 seconds. the server needs more than 60 seconds
[18:25] but then - some systems like wordpress don't work too well on postgres because those wp guys haven't understood a damn thing about abstraction
[18:25] LeMike: what sort of storage?
[18:25] I'll look up some monitoring and gather statistics.
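[Editor's note] In MySQL's EXPLAIN output, an access type of `ALL` means a full table scan and the `key` column is NULL when no index was chosen. A hedged sketch of checking for that programmatically (the helper name and the dict-per-row shape are my own; a real script would read the rows from a DB driver cursor):

```python
def full_scans(explain_rows):
    """Given EXPLAIN rows as dicts, return tables read without an index.

    MySQL reports access type 'ALL' for a full table scan; 'key' is the
    index actually chosen (None when no index is used).
    """
    return [
        (row["table"], row.get("rows", 0))
        for row in explain_rows
        if row.get("type") == "ALL" or row.get("key") is None
    ]
```

Any table this returns with a large row estimate is a candidate for a new index or a rewritten WHERE clause.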
Too bad I got perf.data here and was hoping to just read through it and see the big time waster
[18:26] if MySQL is anything like postgres, EXPLAIN ANALYZE is a treasure trove of data about the performance of that query. number of iterations, indexes used, loops, hashes, everything.
[18:26] RoyK: You mean InnoDB? Or filesystem?
[18:26] disk
[18:27] RoyK: I was so pleasantly surprised when I installed GitLab and saw it developed specifically for Postgres (and recommending Postgres over the dolphin). So rare to see something written for Postgres. Ohh ohh, and roundcube.
[18:27] RoyK: zfs
[18:27] LeMike: what sort of disks?
[18:27] oh wait, that changes things a bit. did you tune it for mysql?
[18:28] LeMike: iirc the recommended MySQL recordsize is 16k, Postgres 8k. Now, depending on your use case and workload, that might mean a lot, or nothing at all (eg. my use case allows a larger record size -- with postgres -- which benefits compressratio)
[18:29] RoyK: HDD. if that isn't the answer then help pls
[18:30] LeMike: what sort of hdd and how many of them?
[18:30] LeMike: there's no straight answer here, but as blackflow says, the zfs blocksize for that dataset may be important
[18:32] LeMike: http://open-zfs.org/wiki/Performance_tuning#MySQL
[18:34] can't access shit with lshw. just one HDD. Got 50GB out of 3TB for this machine it seems.
[18:45] oh. 2x TOSHIBA DT01ACA3 in /proc/scsi/scsi. A raid? I don't see the block size. fdisk does not work and I dunno how
[18:52] <_KaszpiR_> zfs on one disk?
[18:53] <_KaszpiR_> show us 'lsblk' output
[18:54] LeMike: zfs get
[18:58] OS + zfs + mysql on 4GB RAM just sounds wrong.
[18:59] unless it's a *very* light-weight DB workload you'll want more RAM.
[19:00] disks don't matter for a database server unless you designed it wrong, provided too little RAM / didn't tune it for the RAM it has available.
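[Editor's note] The recordsize advice above matters because ZFS reads, checksums and rewrites whole records: 16k InnoDB page I/O against ZFS's default 128k recordsize touches eight times the data per page. A back-of-envelope sketch (the helper is my own, nothing ZFS-specific):

```python
def read_amplification(recordsize_kb, page_kb):
    """Factor by which ZFS must over-read per DB page access when the
    dataset recordsize exceeds the database page size."""
    if recordsize_kb <= page_kb:
        return 1.0
    return recordsize_kb / page_kb

# ZFS default 128k vs InnoDB's 16k page gives 8x per-page amplification,
# which is why the OpenZFS tuning page linked above suggests
# recordsize=16k for MySQL datasets.
```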
[19:02] zfs isn't that heavy on memory - that's a myth - that is - unless you turn on dedup
[19:02] but then, that's praying to all known gods for trouble
[19:04] _KaszpiR_: got it here https://pastebin.com/p3BsHefU
[19:04] LeMike: pastebin the output of "zpool status", "zfs list" and "zfs get all", please
[19:09] RoyK: no pool, no dataset, zfs get all has no output (I am in PVE it seems, I am not the hoster nor do I have access to the host)
[19:12] then you're not running zfs
[19:15] df -Th said so
[19:17] pastebin that
[19:17] keep in mind you need "sudo" in front of those zfs/zpool commands
[19:18] I am groot!
[19:18] https://pastebin.com/tqEXHQs7
[19:18] that doesn't make sense
[19:20] Either way this feels like a direction that is not really giving me an answer as to what specific device or thing is taking so long. Like "the buffer is too low" or "sync takes too long" or "the RAID is broken" etc.
[19:24] it's probably a container and ZFS is managed outside of it
[19:24] Okay. This is almost reaching 1k €. I give up, I'll tell the customer and let him kick the hoster in the nuts. I think my app has nothing to do with it being so slow :P
[19:25] did you run mysqltuner?
[19:26] doing so (or knowing what needs to be done OTOH) is the minimum I'd expect from a contractor in charge of managing a DB
[19:39] thanks, those tests are all okay. and I'll go to sleep now. This took the whole day and I am totally not a SysOp, just a simple dev. But I learned things. Thanks a lot! :)
[19:40] <_KaszpiR_> 1k EUR and the answer is 'this is a vm provider issue, lol'
[19:40] RoyK: fwiw, zfs/zpool commands no longer need sudo for read-only tasks (eg list, status, get, ...)
[19:40] blackflow: possibly
[19:40] no longer = as of 0.7.x and Bionic
[19:41] <_KaszpiR_> LeMike I suggest installing some monitoring tools (newrelic is the easiest but the free tier limits viewing stats to 24h afair)
[19:41] still, zfs list should show the dataset(s) and zpool list should show the pools etc
[19:41] if they don't, something is messed up
[19:41] <_KaszpiR_> zpool list without sudo will return permission denied
[19:41] LeMike: have you upgraded something lately?
[19:42] _KaszpiR_: not on Bionic
[19:42] zfs can be a bit touchy in that respect
[19:42] also... newrelic? why on earth... there's plenty of nice, FOSS-y and non-SaaS monitoring things. Munin, Nagios, Cacti, ...
[19:43] <_KaszpiR_> I know, but that was just an idea for getting things going super fast and easy without the whole mess required to set up other tools
[19:43] <_KaszpiR_> munin would be quite fast, though
[19:44] yeah. Munin is great, that's what I use.
[19:44] nagios went out of style some years back ;)
[19:44] heh
[19:44] what is the monitoring tool of the year now, Zabbix?
[19:44] I use zabbix
[19:44] <_KaszpiR_> but most of those paid projects have one advantage - you get a shitload of integrated monitoring stuff out of the box, without much extra config
[19:44] and munin
[19:44] why not both
[19:45] <_KaszpiR_> collectd + graphite + grafana / munin (wanna migrate from munin, though)
[19:49] munin is very nice to start with
[19:49] <_KaszpiR_> it works very well for say...
15 hosts
[19:49] <_KaszpiR_> but it tends to be lacking if you need to aggregate stuff
[19:49] works fine with a hundred machines too, but it doesn't scale too well above that
[19:50] munin is old and rather simple
[19:50] but very easy to set up
[19:50] <_KaszpiR_> and about zabbix, I'm not a fan of it (but I only used 2.x AFAIR and nothing else, and it was a pretty nightmare)
[19:51] _KaszpiR_: we're at 3.4 now - quite a bit has happened since v2 ;)
[19:54] <_KaszpiR_> I remember that the logic behind zabbix was pretty 'wtf' and it required A LOT of extra customization to get any app working, which was a major pain
[19:55] it still seems russian
[19:55] but you get used to ut ;)
[19:55] s/ut/it/
[19:55] <_KaszpiR_> something like 'send any metric to server' was just impossible, because it had to exist on zabbix or it was dropped
[19:56] <_KaszpiR_> graphite has its own issues but I've found it easier to manage with those
[19:56] but then - what else is there to choose?
[19:56] graphite only draws graphs
[19:56] it doesn't have alerts and triggers etc
[19:57] <_KaszpiR_> yeah
[19:57] <_KaszpiR_> still you can query those and have a different alerting system
[19:57] <_KaszpiR_> grafana has alerting now
[19:58] <_KaszpiR_> from other solutions - prometheus is pretty decent
[19:59] we chose zabbix at work - some 350 servers, 60/40 linux/windows
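[Editor's note] The "send any metric to server" complaint is where graphite's carbon plaintext protocol is pleasant: a metric is just a `path value timestamp` line over TCP to port 2003, with no schema registered in advance. A minimal sketch (the host, port default and metric path are placeholders, not anything from this log):

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in graphite's carbon plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"

def send_metric(path, value, host="graphite.example.com", port=2003):
    """Push a single metric; carbon creates the series on first sight."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))
```

This push-anything model is the contrast being drawn with zabbix 2.x, where an item had to be configured server-side before its data was accepted.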