[11:19] <smb> iri, I updated the bug report. Just to make sure: I'm not experienced enough with btrfs to claim something is good or bad right now. Just pointing out the observations I made with what we've got
[11:20] <iri> thanks smb, that's fine :)
[11:41] <iri> smb: I left the single core machine copying overnight
[11:41] <iri> https://gist.github.com/pwaller/cb8d088ebceb2707d24b
[11:41] <iri> dmesg output: https://gist.github.com/pwaller/574a369ea4b65fe125b9
[11:41] <iri> The first link is dstat output, which shows no IOPs and 100% kernel CPU
[11:41] <iri> the second shows lockups in sync and btrfs-transaction
[11:49] <smb> iri, The xen:balloon error is something we know of but have not yet found any effect from other than the messages flooding. It is something we cannot reproduce outside ec2, and I believe the bug report on that has not been updated for a while.
[11:50] <iri> smb: I was assuming that error was not the root of the problem I'm experiencing
[11:51] <smb> ok, yeah. just wanted to make sure. looking at the other messages right now
[11:53] <smb> Hm, HVM guest. That may not be relevant either. Just not the normal type of guest I look at
[12:03] <smb> iri, Hm, I am getting a bit confused. From that dmesg I only see xvda1, xvdb (both pv disks) and some messages sound like both get mounted as ext4/ext3... And then out of nowhere some device-mapper device seems to appear and have btrfs on it, which btrfs recognizes as an SSD... Oh here, I missed the xvdt line.
[12:03] <iri> smb :)
[12:03] <iri> smb: I attached the xvdt device after the machine booted.
[12:04] <smb> So that is the one with btrfs on it. Still wondering how btrfs guesses this is an SSD... :)
[12:04] <iri> smb: dark magic?
[12:04] <smb> evil. :-P
[12:08] <smb> So then the next stacktrace basically says a background write seems to be hung waiting for completion. The sysrq-triggered backtrace only shows that (probably of little use in a single-cpu case)
[12:19] <smb> Hm, ok, so both the sync and the btrfs task seem to be in some wait_for_completion. If the stats correlate, there was not a really big amount of data actually written. Or read. Could be (maybe more unlikely) that io requests get lost and must be retried. But in that case I would expect some messages in dmesg. Or the whole thing triggers a problem in the whole stack of guest-host-storage, which unfortunately is a big black box
[12:20] <iri> hm :/
[12:24] <smb> iri, Do you know whether you could snapshot/restore that testdata on non-ssd storage?
[12:24] <iri> smb: I could give that a shot.
[12:25] <smb> iri, Maybe just a blind shot but if it is possible it would give at least another pointer on how much influence the backing storage type has here....
[12:56] <iri> smb: first interesting observation is that it detects the magnetic volume as an SSD anyway
[12:58] <smb> iri, Ah. Oh well. So that is broken rather than clever, I suppose. Not expecting a third type apart from real spindle and real ssd...
[12:58] <iri> smb: hm. I'm getting unreasonably high read IOPS initially. 2000-3000
[12:59] <iri> and after 10 seconds we're now stuck in the SYS spin again
[12:59] <iri> (single core machine, restored snapshot of the magnetic disk)
[12:59] <smb> ok... sigh. at least consistent in that way
[12:59] <iri> smb: I'm tempted to do a random write test to the block device
[13:00] <iri> I don't care about what's on here since I'm only going to delete it
[13:00] <smb> iri, yeah, that sounds reasonable. 
[13:00] <smb> that way we get something to compare the performance of the fs against
[13:01] <iri> smb: yeah. Any ideas how to pick somewhere sensible to send the writes within the block device?
[13:02] <iri> smb: any ideas for anything better to write than /dev/zero? /dev/urandom gets 5.5MB/sec dd'ing to /dev/null
[13:02] <iri> I fear /dev/zero may have funky behaviour, but that is just speculation
[13:03] <smb> nothing off the top of my head. zeros, and maybe one pass just over the whole disk...
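(A sketch of the raw write test being discussed; in the real run TARGET would be the block device, e.g. /dev/xvdt, but a scratch file stands in here so the sketch is safe to execute.)

```shell
# Hypothetical stand-in for the block device under test.
TARGET=/tmp/ddtest.bin

# 16 KiB blocks, 64 MiB total; conv=fsync makes dd flush before printing
# its rate, so the reported number reflects data actually pushed out.
dd if=/dev/zero of="$TARGET" bs=16k count=4096 conv=fsync 2>&1

rm -f "$TARGET"
```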
[13:04] <iri> smb: I get a write rate of >2kiops immediately and looking fairly reliable.
[13:04] <iri> (85MB/sec)
[13:04] <iri> though to be fair this is at the beginning of the disk.
[13:05] <smb> sounds like something one would expect roughly from standard sata.
[13:06] <iri> I'm a bit surprised by the performance, this is more what I was expecting from an SSD volume
[13:06] <iri> hm, I wonder if it is possible I have confused things.
[13:07] <iri> oops. I think I did my test against the SSD.
[13:08] <iri> I instructed the SSD to detach but it did not
[13:10] <iri> smb: I picked numerous locations and did this: time sudo dd if=/dev/zero of=/dev/mapper/vg-lv bs=16k count=1M skip=400G
[13:10] <iri> they all showed the expected SSD-like performance with >2kiops
[13:10] <iri> I'm now going to try the magnetic disk file copy test
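(One caveat about the dd command quoted above: `skip=` offsets the *input* (/dev/zero), so every run actually lands at offset 0 of the output device; `seek=` is the flag that offsets the *output*. A scaled-down demonstration on a scratch file:)

```shell
# seek=N counts obs-sized blocks, so seek=16 with bs=16k starts writing at
# 256 KiB. (Recent GNU dd also accepts oflag=seek_bytes to make seek= a
# byte offset instead.)
dd if=/dev/zero of=/tmp/seektest.bin bs=16k count=4 seek=16 2>/dev/null

stat -c %s /tmp/seektest.bin    # (16 + 4) * 16 KiB = 327680 bytes
rm -f /tmp/seektest.bin
```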
[13:10] <smb> Probably should have said bare-metal spindle ... thought that was a number I'd expect there for sequential writes...
[13:10] <smb> ok...
[13:12] <smb> One thing just came to my mind... the btrfs fs was created earlier, maybe with another underlying disk... maybe worth checking what minimum blocksize is used. Though I have not yet found which command to use
[13:18] <smb> Oh here, btrfs-show-super <dev>
[13:20] <iri> smb: https://gist.github.com/pwaller/ea0386762852a9bf5462
[13:20] <smb> ok, so sector size 4k. should be ok
[13:24] <iri> smb: the rate is a *lot* slower on the magnetic disks so it will be a while before the caches are totally full.
[13:25] <iri> smb: the initial behaviour I'm observing is ~100 IOPS read (max of 5MB/sec), 0-2 write (max of 10kB/sec), 100% iowait.
[13:25] <iri> That's for ~5 minutes
[13:25] <iri> I initiated a sync but that hasn't changed the write traffic situation
[13:26] <smb> Hm, ok. sounds pretty slow somehow
[13:26] <iri> smb: I suspect this is just a cold ebs which is not responsive to write traffic
[13:26] <iri> I see a kworker and a btrfs-transaction kernel thread stuck in the "D" state
[13:26] <iri> along with the sync and cat
[13:27] <iri> smb: I'm going to go afk for ~1h, back in a bit.
[13:27] <smb> ok.
[13:33] <iri> smb: write load has finally warmed up to peaks of 10MB/sec @ 230 IOPS. Doesn't look like a stuck system. There are bursts of 50% SYS CPU about once per minute.
[13:33] <iri> smb: I wonder if we're observing some inefficient algorithm in btrfs which goes horribly wrong when the IOPs are an order of magnitude higher
[13:34] <iri> Ah, the sys bursts are getting longer and more frequent
[13:35] <smb> iri, yeah that performance looks ok, not thrilling but ok. Not sure about a specific algorithm; rather something that causes the host side to perform badly. But no clear idea atm
[13:35] <iri> smb: the system now is stuck in 100% system CPU
[13:35] <iri> (on the magnetic disk)
[13:36] <smb> unfortunately we cannot see what goes on on the host.
[13:36] <iri> There is still some IO traffic though, 2mb/sec @ 20iops and 1.5mb/sec @ 30 iops
[13:37] <iri> 42% kworker/u30:1 load, 10% rcu_sched, <10% across several btrfs-endio threads
[13:54] <iri> smb: it spent 13 minutes in 100% sys, then ~1 minute in 100% iowait, then back to 100% sys
[13:54] <iri> and now back to iowait for five minutes
[13:55] <iri> smb: curiously, it's the `rm` which is blocking for > 120s
[13:55] <iri> (from dmesg)
[13:56] <smb> sounds a bit like pushing hard and waiting for things to happen. could be causing some form of sync..
[13:56] <iri> ('cause I'm running `sh -c 'rm -f {}.new; cat {} > {}.new'`)
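(The per-file `rm`/`cat` template quoted above, driven here by a plain find loop as a stand-in — the actual driver feeding `{}` isn't shown in the log. Demo tree under a hypothetical /tmp/copytest path:)

```shell
# Hypothetical demo data.
mkdir -p /tmp/copytest
printf 'hello\n' > /tmp/copytest/a

# For each file: drop any stale copy, then rewrite it -- pure read+write
# load per file, matching the workload described in the log.
find /tmp/copytest -type f ! -name '*.new' | while IFS= read -r f; do
    rm -f "$f.new"
    cat "$f" > "$f.new"
done
```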
[13:57] <iri> the iops are going low but not staying at zero for longer than a minute at a time
[13:57] <iri> (*minute or two)
[14:07] <smb> iri, I must admit I have no more ideas right now. Apart from trying to replicate your setup as closely as possible on bare metal and seeing whether that shows similar behaviour.
[14:07] <iri> smb: I might try to make a smaller testcase
[14:08] <iri> smb: I'm thinking of deleting lots of files and doing a shrink
[14:08] <smb> ok, if you get something, let us know in the bug report. That is less volatile with info than irc...
[14:09] <iri> sure.
[14:09] <iri> smb: are you still attempting to reproduce anything on your end or are you done at this point?
[14:10] <smb> iri, I and/or arges will try to get something up; it's just a matter of multi-tasking various things
[14:10] <iri> great.
[14:21] <iri> smb: is there a way to know the IO queue length to the underlying block device or anything like that?
[14:28] <smb> iri, it's not exactly the block device, but there is some info under /sys/kernel/debug/bdi (backing device info). It is a bit tedious to use as it's split into directories named after major:minor numbers.
[14:31] <smb> Also usually only per whole block device, so not per partition. But it could give some insight into various levels of the stack. You might have xvdt->dm-x->btrfs-x...
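(A sketch of navigating those directories: resolve a device's major:minor and read its bdi stats. /dev/null stands in here; in the log it would be /dev/xvdt or the dm-X device, debugfs must be mounted, and reading it needs root.)

```shell
# Stand-in device; a char device works for the major:minor lookup even
# though only block devices have bdi entries.
DEV=/dev/null

# stat %t/%T print major/minor in hex; convert to decimal to match the
# directory names under /sys/kernel/debug/bdi (e.g. 202:80).
MAJ=$(( 0x$(stat -c %t "$DEV") ))
MIN=$(( 0x$(stat -c %T "$DEV") ))
echo "$MAJ:$MIN"

# cat "/sys/kernel/debug/bdi/$MAJ:$MIN/stats"   # BdiWriteback, DirtyThresh, state, ...
```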
[14:45] <iri> smb: I see no difference in what's in there whether it is stuck in "sys" or "iowait"
[14:45] <iri> nor whether there are 100 IOPS or 0
[14:46] <iri> smb: https://gist.github.com/pwaller/762f57f330cf8a5193ae shows the disk at the top and the lvm volume below
[14:47] <iri> The DirtyThresh and BackgroundThresh vary a small amount
[14:47] <iri> and I did see the disk go into state "c"
[14:48] <iri> hmm, the lv state always seems to be "8" and the disk state flips between a few things, I've seen it read a/c/e/8
[14:50] <smb> iri, I would need to look up in the code what that even means. But not having anything in writeback sounds like it might be without influence on the writeback code, which kind of would be good, as that takes away some things to worry about
[14:58] <iri> I found the bits for state. 8 is BDI_registered, a is BDI_registered | BDI_async_congested and e is `a | BDI_sync_congested`
[14:58] <iri> the majority of the time is spent in 'a'
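(The bit positions found above match the bdi_state enum in backing-dev.h around kernel 3.16: bit 1 = BDI_async_congested, bit 2 = BDI_sync_congested, bit 3 = BDI_registered. A small decoder sketch for the hex values seen in the bdi stats file:)

```shell
# Decode a bdi "state" bitmask (hex digit as seen in debugfs) into flag
# names, using the ~3.16 backing-dev.h bit positions noted above.
decode_bdi_state() {
    s=$(( 0x$1 ))
    flags=""
    if [ $(( s & 0x2 )) -ne 0 ]; then flags="$flags BDI_async_congested"; fi
    if [ $(( s & 0x4 )) -ne 0 ]; then flags="$flags BDI_sync_congested"; fi
    if [ $(( s & 0x8 )) -ne 0 ]; then flags="$flags BDI_registered"; fi
    echo "0x$1:$flags"
}

decode_bdi_state 8   # -> 0x8: BDI_registered
decode_bdi_state a   # -> 0xa: BDI_async_congested BDI_registered
decode_bdi_state e   # -> 0xe: BDI_async_congested BDI_sync_congested BDI_registered
```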
[15:23] <iri> so now I'm in an interesting case on the SSD machine where the device state is "8" (i.e., idle) but the kernel is using a whole CPU for a kworker
[15:29] <smb> not sure, but maybe installing perf-tools and watching perf top would help to get some hint.
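(On Ubuntu, perf ships in a kernel-versioned linux-tools package whose name is derived from the running kernel; for mainline/rc builds a matching package may simply not exist in the archive. A sketch of the usual route:)

```shell
# Build the package name from the running kernel version.
PKG="linux-tools-$(uname -r)"
echo "$PKG"

# sudo apt-get install "$PKG"
# sudo perf top    # live view of where kernel/user CPU time is going
```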
[15:31] <smb> I might be afk for a bit (or a bit longer) 
[15:36] <iri> smb: a nice idea
[15:39] <iri> smb: do you know where I can get linux-tools-3.16.0-031600rc7-generic / linux-cloud-tools-3.16.0-031600rc7-generic for perf?
[15:39] <iri> (the packages)