/srv/irclogs.ubuntu.com/2014/08/01/#ubuntu-kernel.txt

=== ming is now known as Guest12249
=== JanC_ is now known as JanC
=== ikonia_ is now known as ikonia
[11:19] <smb> iri, I updated the bug report. Just to make sure, I would not be experienced enough with btrfs to claim something is good or bad right now. Just pointing out what observations I made with what we got
[11:20] <iri> thanks smb, that's fine :)
[11:41] <iri> smb: I left the single core machine copying overnight
[11:41] <iri> https://gist.github.com/pwaller/cb8d088ebceb2707d24b
[11:41] <iri> dmesg output: https://gist.github.com/pwaller/574a369ea4b65fe125b9
[11:41] <iri> The first link is dstat output, which shows no IOPS and 100% kernel CPU
[11:41] <iri> the second shows lockups in sync and btrfs-transaction
[11:49] <smb> iri, The xen:balloon error is something that we know of but have not yet found any effect other than the messages flooding. It is something we cannot reproduce outside EC2, and I believe the bug report on that has not been updated for a while.
[11:50] <iri> smb: I was assuming that error was not the root of the problem I'm experiencing
[11:51] <smb> ok, yeah. just wanted to make sure. looking at the other messages right now
[11:53] <smb> Hm, HVM guest. That does not need to be relevant either. Just not the normal type of guest I look at
[12:03] <smb> iri, Hm, I am getting a bit confused. From that dmesg I only see xvda1 and xvdb (both PV disks), and some messages sound like both get mounted as ext4/ext3... And then out of nowhere some device-mapper device seems to appear and have btrfs on it, which btrfs recognizes as an SSD... Oh here, I missed the xvdt line.
[12:03] <iri> smb :)
[12:03] <iri> smb: I attached the xvdt device after the machine booted.
[12:04] <smb> So that is the one with btrfs on it. Still wondering how btrfs guesses this is an SSD... :)
[12:04] <iri> smb: dark magic?
[12:04] <smb> evil. :-P
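[Note: btrfs does not probe the medium; it enables its SSD heuristics when the block device's queue reports itself as non-rotational, so the "dark magic" is most likely just that flag. A sketch of how to check it, assuming the btrfs volume sits on xvdt (the dm-0 name below is only an example):
    cat /sys/block/xvdt/queue/rotational
    cat /sys/block/dm-0/queue/rotational
Xen blkfront and device-mapper devices commonly report 0 (non-rotational) regardless of the backing storage, which would also explain the magnetic volume being detected as an SSD further down.]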
[12:08] <smb> So then the next stacktrace basically says the background write seemed to hang in wait for completion. The sysrq trigger backtrace only shows that (probably of little use in a single-CPU case)
[12:19] <smb> Hm, ok so both the sync and the btrfs task seem to be in some wait_for_completion. If the stat correlates, there was not a really big amount of data actually written, or read. Could be (maybe more unlikely) that IO requests get lost and must be retried, but in that case I would expect some messages in dmesg. Or the whole thing triggers a problem in the whole stack of guest-host-storage, which unfortunately is a big black box.
[12:20] <iri> hm :/
[12:24] <smb> iri, Do you know whether you could snapshot/restore that testdata on non-SSD storage?
[12:24] <iri> smb: I could give that a shot.
[12:25] <smb> iri, Maybe just a blind shot but if it is possible it would give at least another pointer on how much influence the backing storage type has here...
[12:56] <iri> smb: first interesting observation is that it detects the magnetic volume as an SSD anyway
[12:58] <smb> iri, Ah. Oh well. So that is broken rather than clever, I suppose. Not expecting a third type apart from real spindle and real SSD...
[12:58] <iri> smb: hm. I'm getting unreasonably high read IOPS initially, 2000-3000
[12:59] <iri> and after 10 seconds we're now stuck in the SYS spin again
[12:59] <iri> (single core machine, restored snapshot of the magnetic disk)
[12:59] <smb> ok... sigh. at least consistent in that way
[12:59] <iri> smb: I'm tempted to do a random write test to the block device
[13:00] <iri> I don't care about what's on here since I'm only going to delete it
[13:00] <smb> iri, yeah, that sounds reasonable.
[13:00] <smb> that way we get something to compare the performance of the fs against
[13:01] <iri> smb: yeah. Any ideas how to pick somewhere sensible to send the writes within the block device?
[13:02] <iri> smb: any ideas for anything better to write than /dev/zero? /dev/urandom gets 5.5MB/sec dd'ing to /dev/null
[13:02] <iri> I fear /dev/zero may have funky behaviour, but that is just speculation
[13:03] <smb> nothing off the top of my head. zeros, and maybe one go just over the whole disk...
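[Note: a minimal sketch of the raw write test being discussed; the device name and offset are placeholders, not taken from the log. In GNU dd, seek= positions the output in obs-sized blocks, and oflag=direct bypasses the page cache so the rate reflects the device rather than dirty memory:
    # write 1 GiB of zeros starting 100 GiB into the device (destroys data there)
    sudo dd if=/dev/zero of=/dev/xvdt bs=16k count=65536 seek=6553600 oflag=direct
    # 6553600 * 16 KiB = 100 GiB offset; 65536 * 16 KiB = 1 GiB written
That is acceptable here only because the volume is going to be deleted anyway.]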
[13:04] <iri> smb: I get a write rate of >2k IOPS immediately and looking fairly reliable.
[13:04] <iri> (85MB/sec)
[13:04] <iri> though to be fair this is at the beginning of the disk.
[13:05] <smb> sounds like something one would expect roughly from standard SATA.
[13:06] <iri> I'm a bit surprised by the performance, this is more what I was expecting from an SSD volume
[13:06] <iri> hm, I wonder if it is possible I have confused things.
[13:07] <iri> oops. I think I did my test against the SSD.
[13:08] <iri> I instructed the SSD to detach but it did not
[13:10] <iri> smb: I picked numerous locations and did this: time sudo dd if=/dev/zero of=/dev/mapper/vg-lv bs=16k count=1M skip=400G
[13:10] <iri> they all showed the expected SSD-like performance with >2k IOPS
[13:10] <iri> I'm now going to try the magnetic disk file copy test
[13:10] <smb> Probably should have said bare-metal spindle... though that was a number I see there for sequential writes...
[13:10] <smb> ok...
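[Note: one caveat about the dd line pasted above: in GNU dd, skip= applies to the input (if=) while seek= applies to the output (of=), so skip=400G most likely just seeks within /dev/zero and every run still wrote from the start of the LV; to move the writes to a different location the command would need seek= instead, counted in bs-sized blocks rather than bytes. The consistent >2k IOPS across "numerous locations" may therefore all describe the same region of the device.]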
[13:12] <smb> One thing just came to my mind... when the btrfs fs was created earlier, maybe with another underlying disk... maybe worth checking what minimum block size is used. Though I have not yet found which command to use
[13:18] <smb> Oh here, btrfs-show-super <dev>
[13:20] <iri> smb: https://gist.github.com/pwaller/ea0386762852a9bf5462
[13:20] <smb> ok, so sector size 4k. should be ok
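[Note: btrfs-show-super prints the on-disk superblock, e.g.
    sudo btrfs-show-super /dev/mapper/vg-lv | grep -E 'sectorsize|nodesize|leafsize'
(the LV path is reused from the dd command above). In later btrfs-progs releases the same information is available via `btrfs inspect-internal dump-super <dev>`.]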
[13:24] <iri> smb: the rate is a *lot* slower on the magnetic disks, so it will be a while before the caches are totally full.
[13:25] <iri> smb: the initial behaviour I'm observing is ~100 IOPS read (max of 5MB/sec), 0-2 write (max of 10k/sec), 100% iowait.
[13:25] <iri> That's for ~5 minutes
[13:25] <iri> I initiated a sync but that hasn't changed the write traffic situation
[13:26] <smb> Hm, ok. sounds pretty slow somehow
[13:26] <iri> smb: I suspect this is just a cold EBS volume which is not responsive to write traffic
[13:26] <iri> I see a kworker and a btrfs-transaction kernel thread stuck in the "D" state
[13:26] <iri> along with the sync and cat
[13:27] <iri> smb: I'm going to go afk for ~1h, back in a bit.
[13:27] <smb> ok.
[13:33] <iri> smb: write load has finally warmed up to peaks of 10MB/sec @ 230 IOPS. Doesn't look like a stuck system. There are bursts of 50% SYS CPU about once per minute.
[13:33] <iri> smb: I wonder if we're observing some inefficient algorithm in btrfs which goes horribly wrong when the IOPS are an order of magnitude higher
[13:34] <iri> Ah, the sys bursts are getting longer and more frequent
[13:35] <smb> iri, yeah, that performance looks ok, not thrilling but ok. Not sure about a certain algorithm; rather something that causes the host side to perform badly. But no clear idea atm
[13:35] <iri> smb: the system now is stuck in 100% system CPU
[13:35] <iri> (on the magnetic disk)
[13:36] <smb> unfortunately we cannot see what goes on on the host.
[13:36] <iri> There is still some IO traffic though, 2MB/sec @ 20 IOPS and 1.5MB/sec @ 30 IOPS
[13:37] <iri> 42% kworker/u30:1 load, 10% rcu_sched, <10% across several btrfs-endio threads
[13:54] <iri> smb: it spent 13 minutes in 100% sys, then ~1 minute in 100% iowait, then back to 100% sys
[13:54] <iri> and now back to iowait for five minutes
[13:55] <iri> smb: curiously, it's the `rm` which is blocking for > 120s
[13:55] <iri> (from dmesg)
[13:56] <smb> sounds a bit like pushing hard and waiting for things to happen. could be causing some form of sync...
[13:56] <iri> ('cause I'm running `sh -c 'rm -f {}.new; cat {} > {}.new'`)
[13:57] <iri> the IOPS are going low but not staying at zero for longer than a minute at a time
[13:57] <iri> (*minute or two)
[14:07] <smb> iri, I must admit I have no more ideas right now, apart from trying to replicate your setup as closely as possible on bare metal and seeing whether that shows similar behaviour.
[14:07] <iri> smb: I might try to make a smaller testcase
[14:08] <iri> smb: I'm thinking of deleting lots of files and doing a shrink
[14:08] <smb> ok, if you get something, let us know in the bug report. That is less volatile with info than IRC...
[14:09] <iri> sure.
[14:09] <iri> smb: are you still attempting to reproduce anything on your end or are you done at this point?
[14:10] <smb> iri, I and/or arges will try to get something up; it's just a matter of multi-tasking various things
[14:10] <iri> great.
=== manjo` is now known as manjo
[14:21] <iri> smb: is there a way to know the IO queue length to the underlying block device, or anything like that?
[14:28] <smb> iri, it's not exactly the block device, but there is some info under /sys/kernel/debug/bdi (backing device info). It is a bit tedious to use as it's split into directories named after major:minor number.
[14:31] <smb> Also only per whole block device usually, so not per partition. But it could give some insight into various levels of the stack. You might have xvdt->dm-x->btrfs-x...
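[Note: a sketch of navigating /sys/kernel/debug/bdi: the directories are named after a device's major:minor pair, which lsblk can print, and each one contains a stats file with the writeback counters (BdiWriteback, BdiDirtyThresh, DirtyThresh, BackgroundThresh, state, ...):
    ls /sys/kernel/debug/bdi/
    lsblk -o NAME,MAJ:MIN,TYPE               # map e.g. 202:* (xvd*) and dm numbers to names
    cat /sys/kernel/debug/bdi/202:80/stats   # 202:80 is only an example entry
The gist referenced below appears to be a dump of exactly these stats files.]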
[14:45] <iri> smb: I see no difference in what's in there whether it is stuck in "sys" or "iowait"
[14:45] <iri> nor whether there are 100 IOPS or 0
[14:46] <iri> smb: https://gist.github.com/pwaller/762f57f330cf8a5193ae shows the disk at the top and the LVM volume below
[14:47] <iri> The DirtyThresh and BackgroundThresh vary a small amount
[14:47] <iri> and I did see the disk go into state "c"
[14:48] <iri> hmm, the LV state always seems to be "8" and the disk state flips between a few things; I've seen it read a/c/e/8
[14:50] <smb> iri, I would need to look up in the code what that even means. But not having anything in writeback sounds like it might be without influence on the writeback code, which kind of would be good as that takes away some things to worry about
[14:58] <iri> I found the bits for state. 8 is BDI_registered, a is BDI_registered | BDI_async_congested and e is `a | BDI_sync_congested`
[14:58] <iri> the majority of the time is spent in 'a'
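[Note: with the bit values worked out above (BDI_async_congested = 0x2, BDI_sync_congested = 0x4, BDI_registered = 0x8, from the bdi_state enum in include/linux/backing-dev.h of kernels of this vintage), the "c" state seen earlier would be BDI_registered | BDI_sync_congested, i.e. the sync queue congested without the async one.]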
[15:23] <iri> so now I'm in an interesting case on the SSD machine where the device state is "8" (i.e. idle) but the kernel is using a whole CPU for a kworker
[15:29] <smb> not sure, but maybe installing perf-tools and watching perf top would help to get some hint.
[15:31] <smb> I might be afk for a bit (or a bit longer)
[15:36] <iri> smb: a nice idea
[15:39] <iri> smb: do you know where I can get linux-tools-3.16.0-031600rc7-generic / linux-cloud-tools-3.16.0-031600rc7-generic for perf?
[15:39] <iri> (the packages)
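[Note: on a stock Ubuntu kernel the perf front-end comes from the matching tools package, roughly:
    sudo apt-get install linux-tools-$(uname -r)
    sudo perf top
For mainline/kernel-ppa builds such as 3.16.0-031600rc7 there is, as far as I know, no corresponding linux-tools/linux-cloud-tools package; common workarounds are to run the perf binary shipped by an installed distro kernel's linux-tools package directly (ignoring the version-mismatch warning) or to build perf from the matching kernel source tree (tools/perf).]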
=== iri is now known as iri`away
