=== med_ is now known as medberry | ||
LeMike | Hello there. I want to debug what is taking so long in mysqld. One server is 120 times slower than 2 other server (1 weaker, 1 stronger). So I strace mysql and see some pread but lots of io_getevents and io_submit. Is this normal? How do I find the device or process that causes the problem? | 08:33 |
---|---|---|
sarnold | strace is the wrong tool for the job | 08:34 |
sarnold | this is a decent starting point http://www.brendangregg.com/linuxperf.html | 08:35 |
LeMike | Thanks. This is the best help so far :D dtrace or where could I start? I just found out that iotop has a "Netlink error" but the memtester passed. | 08:35 |
LeMike | sarnold: thanks I read this | 08:35 |
sarnold | LeMike: linux's perf tool / or the iovisor collection of tools is your best bet, no dtrace on linux | 08:36 |
sarnold | alright, time to bail, have fun :) | 08:39 |
LeMike | sarnold: thanks. I already like perf a lot :D read a report and liked to investigate even more but somehow the problem is gone. I have a good guess that the hoster finally found a thing on his side and shamefully fixed it silently. I hunt this shit since yesterday. | 08:52 |
RoyK | LeMike: try slow query log | 11:29 |
RoyK | LeMike: and make sure you install sysstat first so you can monitor the system while running the query | 11:29 |
RoyK | LeMike: to see what's slowing things down - usually cpu or i/o | 11:29 |
LeMike | RoyK: Slow query log just shows the query and how long it took. With that I came to the conclusion to look into mySQL and IO. Thanks | 14:33 |
LeMike | I continue some other day because after a limit of 4 hours I have to stop investigating (as the customer does not want to pay more xD ) | 14:35 |
_KaszpiR_ | LeMike what's the specs of the servers, and any noticeable mysql setting set? | 14:41 |
_KaszpiR_ | if they fixed it silently then I suspect that if this is on virtual platform then memory baloon driver/high cpu steal/horrid disk io perf (cause array rebuilds etc) | 14:43 |
RoyK | LeMike: you should see where the load is high easily with sar | 14:46 |
LeMike | RoyK: I dont get it. This seems more like recording a timespan than a single process. And what if some buffer is full or some process is just waiting for stupid reasons? | 17:53 |
RoyK | it's recording a timespan, yes | 17:59 |
RoyK | but then again, using things like top or htop can tell you where the problem lies | 17:59 |
RoyK | iotop perhaps too | 18:00 |
RoyK | if you have a truckload of different services on the same server, all competing, you should be virtualising more | 18:01 |
RoyK | if the machine is dedicated to mysql, it should be trivial to find the bottleneck | 18:02 |
blackflow | if I may interject for a moment.... how would virt help here. if anything it'd just slow things down with extra overhead. | 18:02 |
blackflow | LeMike: you can check for two things, iowait and disk latency | 18:04 |
RoyK | blackflow: it'll slow things down a few percent, but it'll definetely make it easier to monitor | 18:05 |
RoyK | blackflow: throw a ton of services on the same machine and "something" is slow, well, then everything's slow | 18:05 |
blackflow | pretty sure it'd be the same thing under a VM. the host kernel has to juggle processes and serialize IO in both cases | 18:07 |
blackflow | but I agree about monitoring, somewhat. should be possible with cgroups to some extent too. | 18:07 |
RoyK | if vitualised, monitoring on the guest will show you the bottleneck | 18:08 |
LeMike | blackflow: I let it run with "iostat -d 1" and got somewhat like 2348 kB write every few seconds. iowait is 0,85% but I guess I need to reboot. dunno how to check latency | 18:13 |
blackflow | LeMike: https://www.kernel.org/doc/Documentation/block/stat.txt | 18:13 |
blackflow | read ticks / read I/Os = average wait (latency) for read | 18:14 |
blackflow | same for write. | 18:14 |
LeMike | uh. can I truncate this file while the server runs? :) this would be nice | 18:16 |
RoyK | LeMike: how much memory? what is the iops count? what sort of storage does it have? | 18:16 |
RoyK | what about cpu load? | 18:16 |
RoyK | swap usage? | 18:16 |
RoyK | sar will give you paging statistics | 18:17 |
RoyK | have you tuned mysql/mariadb in any way? | 18:17 |
RoyK | how large is the database's dataset? | 18:17 |
blackflow | what about the db itself... effective indexes? does it access tuples sequentially or indexed? | 18:17 |
RoyK | lots and lots of questions ;) | 18:17 |
blackflow | indeedy. | 18:17 |
RoyK | mysql allows you to run "explain <query>" to see if indices are used or not | 18:18 |
RoyK | quite useful thing | 18:19 |
blackflow | LeMike: I'd recommend you activate some graph baesed monitoring on that server. A picture is worth a thousand eyes, and problems are easier to spot. I use Munin for that, infact, that's how I know about that latency measure. It's from Munin's diskstats plugin. | 18:19 |
blackflow | *a thousand words (lol) | 18:19 |
blackflow | I'm listening to TechSnap podcast and they're talking about "many eyes in open source security", so the fingers typed that :) | 18:20 |
LeMike | 4G ram (more than 3G free also in swap free), cpu idles, iotop does not work :/ the table is 276M | 18:21 |
LeMike | not touched mysql config (once for more cache etc but reset it already) | 18:21 |
blackflow | 276M is a lot, esp if the DB has to sift through the rows sequentially because it's badly indexed. a single select can take seconds.... | 18:22 |
blackflow | (I'm assuming you meant bytes, because 270M rows would be gigabytes most likely...) | 18:22 |
RoyK | and then - mysql doesn't scale too well | 18:22 |
* RoyK only uses mysql at gunpoint | 18:23 | |
blackflow | yeah I don't know much about MySQL. PostgreSQL is my poison and I monitor all that with Munin - tuple counts, sequential vs indexed access, average query time, size, stuff... | 18:23 |
blackflow | DB monitoring and statistics is a life saver. | 18:24 |
RoyK | I'd choose postgres over mysql any day | 18:24 |
blackflow | aye! | 18:24 |
LeMike | (megabytes) Same DB and query on my laptop takes 1-2 seconds. server needs more than 60 seconds | 18:24 |
RoyK | but then - some systems like wordpress don't work too well on postgres because those wp guys haven't understood a damn thing about abstraction | 18:25 |
RoyK | LeMike: what sort of storage? | 18:25 |
LeMike | I look up some monitoring and gather statistics. To bad I got perf.data here and was hoping to just read through and see the big time waster | 18:25 |
blackflow | if MySQL is anything liek postgres, EXPLAIN ANALYZE <query> is a treasure trove of data bout the performance of that query. number of iterations, indexes used, loops, hashes, everything. | 18:26 |
LeMike | RoyK: You mean InnoDB? Or filesystem? | 18:26 |
RoyK | disk | 18:26 |
blackflow | RoyK: I was so pleasantly suprised when I installed GitLab and saw it developed specifically for Postgres (and recommending Postgres over the dolphin). So rare to see something written for Postgres. Ohh ohh, and roundcube. | 18:27 |
LeMike | RoyK: zfs | 18:27 |
RoyK | LeMike: what sort of disks? | 18:27 |
blackflow | oh wait, that changes things a bit. did you tune it for mysql? | 18:27 |
blackflow | LeMike: iirc MySQL recordsize is recommended to be 16k. Postgres 8k. Now, depending on your use case and workload, that might mean a lot, or nothing at all (eg. my use case allows larger record size -- with postgres -- which benefits compressratio) | 18:28 |
LeMike | RoyK: HDD. if that isn the answer then help pls | 18:29 |
RoyK | LeMike: what sort of hdd and how many of them? | 18:30 |
RoyK | LeMike: there's no straight answer here, but as blackflow says, the zfs blocksize for that dataset may be important | 18:30 |
RoyK | LeMike: http://open-zfs.org/wiki/Performance_tuning#MySQL | 18:32 |
LeMike | cant access shit with lshw. just one HDD. Got 50GB out of 3TB for this machine it seems. | 18:34 |
LeMike | oh. 2x TOSHIBA DT01ACA3 in /proc/scsi/scsi. A raid? I dont see the block size. fdisk does not work and I dunno how | 18:45 |
_KaszpiR_ | zfs on one disk? | 18:52 |
_KaszpiR_ | show us 'lsblk' output | 18:53 |
RoyK | LeMike: zfs get | 18:54 |
tomreyn | OS + zfs + mysql on 4GB RAM just sounds wrong. | 18:58 |
tomreyn | unless its a *very* light-weight DB workload you'll want more RAM. | 18:59 |
tomreyn | disks dont matter for a database server unless you designed it wrong, provided too little RAM / didn't tune it for the RAM it has available. | 19:00 |
RoyK | zfs isn't that heavy on memory - that's a myth - that is - unless you turn on dedup | 19:02 |
RoyK | but then, that's praying to all known gods for trouble | 19:02 |
LeMike | _KaszpiR_: got it here https://pastebin.com/p3BsHefU | 19:04 |
RoyK | LeMike: pastebin output of "zpool status", "zfs list" and "zfs get all", please | 19:04 |
LeMike | RoyK: no pool, no dataset, zfs get all has no output (I am in PVE it seems, I am not the hoster or have access to the host) | 19:09 |
RoyK | then you're not running zfs | 19:12 |
LeMike | df -Th said so | 19:15 |
RoyK | pastebin that | 19:17 |
RoyK | keep in mind you need "sudo" in front of those zfs/zpool commands | 19:17 |
LeMike | I am groot! | 19:18 |
LeMike | https://pastebin.com/tqEXHQs7 | 19:18 |
RoyK | that doesn't make sense | 19:18 |
LeMike | Either way this feels like a direction that is not really giving me an answer what specific device or thingy is taking so long . Like "the buffer is too low" or "sync takes too long" or "Raid is broken" etc. | 19:20 |
tomreyn | it's probably a container and ZFS is managed outside of it | 19:24 |
LeMike | Okay. This is reaching 1k € almost. I give up, tell the customer and let him kick the hoster in his nuts. I think that my app has nothin to do with being so slow :P | 19:24 |
tomreyn | did you run mysqltuner? | 19:25 |
tomreyn | doing so (or knowing what needs to be done OTOH) is the minimum i'd expect from a contractor in charge of managing a DB | 19:26 |
LeMike | thanks, those tests are all okay. and I go to sleep now. This took the whole day and I am totally not a SysOp, just a simple dev. But I learned things. Thanks a lot! :) | 19:39 |
_KaszpiR_ | 1k EUR and answer is 'this is a vm provider issue, lol' | 19:40 |
blackflow | RoyK: fwiw, zfs/zpool commands no longer need sudo for read only tasks (eg list, status, get, ...) | 19:40 |
RoyK | blackflow: possibly | 19:40 |
blackflow | no longer = as of 0.7.x and Bionic | 19:40 |
_KaszpiR_ | LeMike I suggest installing some monitoring tools (newrelic is the easiest but free tier limits view stats for 24h afair) | 19:41 |
RoyK | still, zfs list should show the dataset(s) and zpool list should show the pools etc | 19:41 |
RoyK | if they don't, something is messed up | 19:41 |
_KaszpiR_ | zpool list without sudo will return permission denied | 19:41 |
RoyK | LeMike: have you upgraded something lately? | 19:41 |
blackflow | _KaszpiR_: not on Bionic | 19:42 |
RoyK | zfs can be a bit touchy in that respect | 19:42 |
blackflow | also... newrelic? why on earth.... there's plenty of nice, FOSS-y and not SaaS monitoring things. Munin, Nagios, Cacti, ... | 19:42 |
_KaszpiR_ | I know, but that was just an idea of getting thing super fast and easy without the whole mess required to set up other tools | 19:43 |
_KaszpiR_ | munin would be quite fast, though | 19:43 |
blackflow | yeah. Munin is great, thats what I use. | 19:44 |
RoyK | nagios went out of style some years back ;) | 19:44 |
blackflow | heh | 19:44 |
blackflow | what is the monitoring tool of the year now, Zabbix? | 19:44 |
RoyK | I use zabbix | 19:44 |
_KaszpiR_ | but most of those paid projects hav eone advantage - you get a shitload of integrated monitoring stuff from the box, without much extra configs | 19:44 |
RoyK | and munin | 19:44 |
RoyK | why not both | 19:44 |
_KaszpiR_ | collectd + graphite + grafana / munin (wanna migrate from munin, though) | 19:45 |
RoyK | munin is very nice to start with | 19:49 |
_KaszpiR_ | it works very well for say... 15 hosts | 19:49 |
_KaszpiR_ | but thend to be lacking if you need to aggregate stuff | 19:49 |
RoyK | works fine with a hundred machines too, but it doesn't scale too well above that | 19:49 |
RoyK | munin is old and rather simple | 19:50 |
RoyK | but very easy to setup | 19:50 |
_KaszpiR_ | and about zabbix, I'm not a fan of it (but used only 2.x AFAIR and nothign else, and it was pretty nightmare) | 19:50 |
RoyK | _KaszpiR_: we're at 3.4 now - quite a bit has hppened since v2 ;) | 19:51 |
_KaszpiR_ | I remember that the logic behind zabbix was pretty 'wtf' and it required A LOT of extra customization to get any app working, which was major pain | 19:54 |
RoyK | it still seems russian | 19:55 |
RoyK | but you get used to ut ;) | 19:55 |
RoyK | s/ut/it/ | 19:55 |
_KaszpiR_ | something like 'send any metric to server' was just impossible, because it had to exist on zabbix or it was dropped | 19:55 |
_KaszpiR_ | graphite as it's own issues but I've found it easier to manage with those | 19:56 |
RoyK | but then - what else is there to choose? | 19:56 |
RoyK | graphite only draws graphs | 19:56 |
RoyK | it doesn't have alerts and triggers etc | 19:56 |
_KaszpiR_ | yeah | 19:57 |
_KaszpiR_ | still you can query those and have different alerting system | 19:57 |
_KaszpiR_ | grafana has alerting now | 19:57 |
_KaszpiR_ | from other solutiuons - prometheus is pretty decent | 19:58 |
RoyK | we chose zabbix at work - some 350 servers, 60/40 linux/windows | 19:59 |
Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!