=== ming is now known as Guest12249
=== JanC_ is now known as JanC
=== ikonia_ is now known as ikonia
smb | iri, I updated the bug report. Just to make sure: I am not experienced enough with btrfs to claim something is good or bad right now. Just pointing out what observations I made with what we got | 11:19 |
iri | thanks smb, that's fine :) | 11:20 |
iri | smb: I left the single core machine copying overnight | 11:41 |
iri | https://gist.github.com/pwaller/cb8d088ebceb2707d24b | 11:41 |
iri | dmesg output: https://gist.github.com/pwaller/574a369ea4b65fe125b9 | 11:41 |
iri | The first link is dstat output, which shows no IOPs and 100% kernel CPU | 11:41 |
iri | the second shows lockups in sync and btrfs-transaction | 11:41 |
smb | iri, The xen:balloon error is something that we know of but have not yet found any other effect than the flood of messages. It is something we cannot reproduce outside ec2 and I believe the bug report on that has not been updated for a while. | 11:49 |
iri | smb: I was assuming that error was not the root of the problem I'm experiencing | 11:50 |
smb | ok, yeah. just wanted to make sure. looking at the other messages right now | 11:51 |
smb | Hm, HVM guest. That may not be relevant either. Just not the normal type of guest I look at | 11:53 |
smb | iri, Hm, I am getting a bit confused. From that dmesg I only see xvda1, xvdb (both pv disks) and some messages sound like both get mounted as ext4/ext3... And then out of nowhere some device-mapper device seems to appear with btrfs on it, which btrfs recognizes as an SSD... Oh here, I missed the xvdt line. | 12:03 |
iri | smb :) | 12:03 |
iri | smb: I attached the xvdt device after the machine booted. | 12:03 |
smb | So that is the one with btrfs on it. Still wondering how btrfs guesses this is a ssd... :) | 12:04 |
iri | smb: dark magic? | 12:04 |
smb | evil. :-P | 12:04 |
smb | So then the next stacktrace basically says the background write seems to have hung waiting for completion. The sysrq trigger backtrace only shows that (probably of little use in a single cpu case) | 12:08 |
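The "sysrq trigger" mentioned above is the kernel's magic SysRq facility; the exact key used in this session is not shown in the log, but a minimal sketch of how such blocked-task backtraces are usually captured (assuming root and that SysRq is enabled) looks like this:

```sh
# Enable all SysRq functions (may already be set via sysctl kernel.sysrq)
echo 1 | sudo tee /proc/sys/kernel/sysrq

# 'w' dumps tasks stuck in uninterruptible (blocked) state; 't' dumps all tasks
echo w | sudo tee /proc/sysrq-trigger

# The resulting backtraces end up in the kernel log
dmesg | tail -n 100
```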
smb | Hm, ok so both the sync and the btrfs task seem to be in some wait_for_completion. If the stats correlate, there was not a really big amount of data actually written, or read. Could be (maybe more unlikely) that io requests get lost and must be retried, but in that case I would expect some messages in dmesg. Or the whole thing triggers a problem in the whole stack of guest-host-storage, which unfortunately is a big black box | 12:19 |
iri | hm :/ | 12:20 |
smb | iri, Do you know whether you could snapshot/restore that testdata on non-ssd storage? | 12:24 |
iri | smb: I could give that a shot. | 12:24 |
smb | iri, Maybe just a blind shot but if it is possible it would give at least another pointer on how much influence the backing storage type has here.... | 12:25 |
iri | smb: first interesting observation is that it detects the magnetic volume as an SSD anyway | 12:56 |
smb | iri, Ah. Oh well. So that is broken rather than clever I suppose. Not expecting a third type apart from real spindle and real ssd... | 12:58 |
iri | smb: hm. I'm getting unreasonably high read IOPS initially, 2,000-3,000 | 12:58 |
iri | and after 10 seconds we're now stuck in the SYS spin again | 12:59 |
iri | (single core machine, restored snapshot of the magnetic disk) | 12:59 |
smb | ok... sigh. at least consistent in that way | 12:59 |
iri | smb: I'm tempted to do a random write test to the block device | 12:59 |
iri | I don't care about what's on here since I'm only going to delete it | 13:00 |
smb | iri, yeah, that sounds reasonable. | 13:00 |
smb | that way we get something to compare the performance of the fs against | 13:00 |
iri | smb: yeah. Any ideas how to pick somewhere sensible to send the writes within the block device? | 13:01 |
iri | smb: any ideas for anything better to write than /dev/zero? /dev/urandom gets 5.5MB/sec dd'ing to /dev/null | 13:02 |
iri | I fear /dev/zero may have funky behaviour, but that is just speculation | 13:02 |
smb | Not off the top of my head. Zeros, and maybe one pass just over the whole disk... | 13:03 |
iri | smb: I get a write rate of >2kiops immediately and looking fairly reliable. | 13:04 |
iri | (85MB/sec) | 13:04 |
iri | though to be fair this is at the beginning of the disk. | 13:04 |
smb | sounds like something one would expect roughly from standard sata. | 13:05 |
iri | I'm a bit surprised by the performance, this is more what I was expecting from an SSD volume | 13:06 |
iri | hm, I wonder if it is possible I have confused things. | 13:06 |
iri | oops. I think I did my test against the SSD. | 13:07 |
iri | I instructed the SSD to detach but it did not | 13:08 |
iri | smb: I picked numerous locations and did this: time sudo dd if=/dev/zero of=/dev/mapper/vg-lv bs=16k count=1M skip=400G | 13:10 |
iri | they all showed the expected SSD-like performance with >2kiops | 13:10 |
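One note on the dd invocation quoted above: `skip=` positions the *input* (here /dev/zero), while `seek=` positions the *output*, so a write test aimed at a specific offset on the device would typically use `seek=`. A hedged sketch (device path and offset purely illustrative; writing to a raw device destroys its contents):

```sh
# Write 1 GiB of zeros starting roughly 400 GiB into the device.
# seek= counts obs-sized blocks (16 KiB here): 400 GiB / 16 KiB = 26214400.
# oflag=direct bypasses the page cache so the rate reflects the device.
sudo dd if=/dev/zero of=/dev/mapper/vg-lv bs=16k count=65536 \
        seek=26214400 oflag=direct
```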
iri | I'm now going to try the magnetic disk file copy test | 13:10 |
smb | Probably should have said bare-metal spindle ... thought that was a number I would expect to see there for sequential writes... | 13:10 |
smb | ok... | 13:10 |
smb | One thing just came to my mind... the btrfs fs was created earlier, maybe with another underlying disk... maybe worth checking what minimum block size is used. Though I have not yet found which command to use | 13:12 |
smb | Oh here, btrfs-show-super <dev> | 13:18 |
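A small usage sketch for that command (the device path is an assumption based on the log; newer btrfs-progs expose the same data via `btrfs inspect-internal dump-super`):

```sh
# btrfs-progs of this era:
sudo btrfs-show-super /dev/mapper/vg-lv | grep -Ei 'sectorsize|nodesize|leafsize'

# Equivalent on newer btrfs-progs:
sudo btrfs inspect-internal dump-super /dev/mapper/vg-lv | grep -Ei 'sectorsize|nodesize'
```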
iri | smb: https://gist.github.com/pwaller/ea0386762852a9bf5462 | 13:20 |
smb | ok, so sector size 4k. should be ok | 13:20 |
iri | smb: the rate is a *lot* slower on the magnetic disks so it will be a while before the caches are totally full. | 13:24 |
iri | smb: the initial behaviour I'm observing is ~100 IOPS read (max of 5 MB/sec), 0-2 writes (max of 10 kB/sec), 100% iowait. | 13:25 |
iri | That's for ~5 minutes | 13:25 |
iri | I initiated a sync but that hasn't changed the write traffic situation | 13:25 |
smb | Hm, ok. sounds pretty slow somehow | 13:26 |
iri | smb: I suspect this is just a cold ebs which is not responsive to write traffic | 13:26 |
iri | I see a kworker and a btrfs-transaction kernel thread stuck in the "D" state | 13:26 |
iri | along with the sync and cat | 13:26 |
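For reference, one generic way to see which tasks are in uninterruptible sleep and where in the kernel they are blocked (nothing btrfs-specific assumed; the stack files need root):

```sh
# List tasks currently in the D (uninterruptible sleep) state
ps -eo state,pid,comm | awk '$1 == "D"'

# Dump the kernel stack of each D-state task
for pid in $(ps -eo state,pid --no-headers | awk '$1 == "D" {print $2}'); do
    echo "== PID $pid =="
    sudo cat /proc/"$pid"/stack
done
```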
iri | smb: I'm going to go afk for ~1h, back in a bit. | 13:27 |
smb | ok. | 13:27 |
iri | smb: write load has finally warmed up to peaks of 10 MB/sec @ 230 IOPS. Doesn't look like a stuck system. There are bursts of 50% SYS CPU about once per minute. | 13:33 |
iri | smb: I wonder if we're observing some inefficient algorithm in btrfs which goes horribly wrong when the IOPs are an order of magnitude higher | 13:33 |
iri | Ah, the sys bursts are getting longer and more frequent | 13:34 |
smb | iri, yeah that performance looks ok, not thrilling but ok. Not sure about a specific algorithm; rather something that causes the host side to perform badly. But no clear idea atm | 13:35 |
iri | smb: the system now is stuck in 100% system CPU | 13:35 |
iri | (on the magnetic disk) | 13:35 |
smb | unfortunately we cannot see what goes on on the host. | 13:36 |
iri | There is still some IO traffic though, 2 MB/sec @ 20 IOPS and 1.5 MB/sec @ 30 IOPS | 13:36 |
iri | 42% kworker/u30:1 load, 10% rcu_sched, <10% across several btrfs-endio threads | 13:37 |
iri | smb: it spent 13 minutes in 100% sys, then ~1 minute in 100% iowait, then went back to 100% sys | 13:54 |
iri | and now back to iowait for five minutes | 13:54 |
iri | smb: curiously, it's the `rm` which is blocking for > 120s | 13:55 |
iri | (from dmesg) | 13:55 |
smb | sounds a bit like pushing hard and waiting for things to happen. could be causing some form of sync.. | 13:56 |
iri | ('cause I'm running `sh -c 'rm -f {}.new; cat {} > {}.new'`) | 13:56 |
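The log does not show how that per-file command is being driven; one plausible reconstruction, purely for illustration (with /mnt/data standing in for the actual btrfs mount point), would be via find:

```sh
# Hypothetical driver for the copy test quoted above: for every file,
# drop any stale .new copy, then re-create it by streaming the original.
find /mnt/data -type f -exec sh -c 'rm -f "$1.new"; cat "$1" > "$1.new"' _ {} \;
```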
iri | the iops are going low but not staying at zero for longer than a minute at a time | 13:57 |
iri | (*minute or two) | 13:57 |
smb | iri, I must admit I have no more ideas right now. Apart from trying to replicate your setup as closely as possible on bare-metal and seeing whether that shows similar behaviour. | 14:07 |
iri | smb: I might try to make a smaller testcase | 14:07 |
iri | smb: I'm thinking of deleting lots of files and doing a shrink | 14:08 |
smb | ok, if you get something, let us know in the bug report. That is a less volatile place for info than irc... | 14:08 |
iri | sure. | 14:09 |
iri | smb: are you still attempting to reproduce anything on your end or are you done at this point? | 14:09 |
smb | iri, I and/or arges will try to get something up, it's just a matter of multi-tasking various things | 14:10 |
iri | great. | 14:10 |
=== manjo` is now known as manjo
iri | smb: is there a way to know the IO queue length to the underlying block device or anything like that? | 14:21 |
smb | iri, it's not exactly the block device, but there is some info under /sys/kernel/debug/bdi (backing device info). It is a bit tedious to use as it's split into directories named after major:minor numbers. | 14:28 |
smb | Also usually only per whole block device, so not per partition. But it could give some insight into various levels of the stack. You might have xvdt->dm-x->btrfs-x... | 14:31 |
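A short sketch for poking at that debugfs information (paths assume debugfs is mounted at /sys/kernel/debug; mapping the major:minor directory names back to device names first makes the output easier to read):

```sh
# Map block devices (including dm/LVM volumes) to their major:minor numbers
lsblk -o NAME,MAJ:MIN,TYPE,MOUNTPOINT

# Dump the writeback/bdi stats for every backing device the kernel tracks
for f in /sys/kernel/debug/bdi/*/stats; do
    echo "== $f =="
    sudo cat "$f"
done
```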
iri | smb: I see no difference in what's in there whether it is stuck in "sys" or "iowait" | 14:45 |
iri | nor whether there are 100 IOPS or 0 | 14:45 |
iri | smb: https://gist.github.com/pwaller/762f57f330cf8a5193ae shows the disk at the top and the lvm volume below | 14:46 |
iri | The DirtyThresh and BackgroundThresh vary a small amount | 14:47 |
iri | and I did see the disk go into state "c" | 14:47 |
iri | hmm, the lv state always seems to be "8" and the disk state flips between a few things, I've seen it read a/c/e/8 | 14:48 |
smb | iri, I would need to look up in the code what that even means. But not having anything in writeback sounds like it might be without influence on the writeback code, which would kind of be good as that takes away some things to worry about | 14:50 |
iri | I found the bits for state. 8 is BDI_registered, a is BDI_registered | BDI_async_congested and e is `a | BDI_sync_congested` | 14:58 |
iri | the majority of the time is spent in 'a' | 14:58 |
iri | so now I'm in an interesting case on the SSD machine where the device state is "8" (i.e., idle) but the kernel is using a whole CPU for a kworker | 15:23 |
smb | not sure, but maybe installing perf-tools and watching perf top would help to get some hint. | 15:29 |
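For a stock Ubuntu kernel that would roughly mean the following (package names are for the distro kernel; a mainline build needs its own matching linux-tools, which is exactly the problem raised a few lines below):

```sh
# Install perf matching the running distro kernel
sudo apt-get install linux-tools-common "linux-tools-$(uname -r)"

# Live, top-like view of where CPU time is going, including kernel symbols
sudo perf top -g
```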
smb | I might be afk for a bit (or a bit longer) | 15:31 |
iri | smb: a nice idea | 15:36 |
iri | smb: do you know where I can get linux-tools-3.16.0-031600rc7-generic / linux-cloud-tools-3.16.0-031600rc7-generic for perf? | 15:39 |
iri | (the packages) | 15:39 |
=== iri is now known as iri`away |