[00:17] <slangasek> xnox: if this is blocking anything we should override the milestone freeze, not let ourselves be blocked by it
[00:19] <xnox> slangasek: i'm blindly test rebuilding stuff against llvm 3.5 =) if you wish, you can unblock it and then i'd test it when it propagates. All other autopackage tests passed so yeah freeze block is the only one.
[00:20] <xnox> slangasek: but it's on all images, thus it's a respin-the-world trigger =/
[00:25] <slangasek> xnox: no, it is not
[00:25] <slangasek> xnox: we do *not* guarantee images are up-to-date with respect to the archive during alpha milestones
[09:36] <vroomfondel> I am just jumping a bit through the kernel source at lxr.free-electrons.com... the container_of macro (which is a heavily used macro) is defined in 10 files. Are there many redundancies like this one?
[09:40] <TJ-> vroomfondel: The drivers/staging/ ones are probably because those drivers haven't been cleaned up yet after being out-of-tree; the tools/ and scripts/ copies are for separate compilation units, which leaves gpu/radeon and include/linux to explain :)
[09:42] <TJ-> vroomfondel: looking at gpu/radeon that is also a stand-alone tool ... leaving just include/linux :)
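For reference, the duplicate definitions can be enumerated from a kernel checkout; a minimal sketch (the exact hit count varies by kernel version):

    git grep -n '#define container_of' -- '*.h' '*.c'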
[14:22] <smoser> hey. 
[14:22] <smoser> http://paste.ubuntu.com/7914933/
[14:22] <smoser> is this the same POWER woes we've seen before?
[14:28] <rtg> smoser, doesn't look familiar to me, but it's also a little sparse on details. apw ?
[14:29] <smoser> rtg, well, what you see is all i know. that and "it died"
[14:29] <smoser> :)
[14:29] <rtg> smoser, is this bare metal? I'm just not that familiar with P8
[14:30] <smoser> it's a guest.
[14:30] <smoser> kvm guest.
[14:31] <rtg> smoser, prolly something we need to get the IBM guys to look at
[14:32] <smoser> rtg, i'll file a bug.
[14:37] <smoser> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1350889
[14:50] <apw> yeah not something i have noticed so far 
[15:00] <hallyn> apw: stgraber: on 3.16 in utopic, in an unprivileged container, all of /proc is owned by nobody:nogroup
[15:00] <hallyn> dat aint gonna fly
[15:38] <pwaller> I can now reliably reproduce this kernel hang with BTRFS: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349711
[15:39] <pwaller> The machine I'm using for testing won't necessarily be around for much longer, so if there are any further immediate diagnostic things I should do, it would be good to know
[15:41] <rtg> pwaller, can you reproduce this on bare metal ? or is it just a Xen guest issue ?
[15:42] <pwaller> rtg: It's an amazon EC2 instance. I don't have bare metal available to test on. Or rather, I can't get the partition where I can reproduce it to a bare metal machine.
[15:42] <pwaller> (It's 750GB)
[15:47] <rtg> arges, since you're just plowing through virt issues this morning, how about having a look at this one ^^
[15:48] <pwaller> I should add, my test case involves using my existing partition. I haven't tried reproducing from scratch
[15:48] <pwaller> Mainly because I suspect a problem with the free space cache
[15:49] <pwaller> It took 12 days before first manifesting. And then 2 hours of solid writes before I ended up in a situation where I could quickly reproduce again.
[15:49] <rtg> pwaller, have you talked to upstream about it ? IIRC you said 3.16 still has the issue
[15:50] <pwaller> rtg: I have mailed upstream and had no response. I've had some interest from #btrfs who suggested trying 3.16, as did the person who responded in the ubuntu bug.
[15:57]  * arges looking
[15:59] <arges> pwaller: so you said you can rapidly reproduce. Is that documented in the bug already? (still reading)
[15:59] <pwaller> Yes arges, at the end
[15:59] <arges> pwaller: ahh there it is : )
[16:00] <arges> pwaller: so you pretty much just copy a large amount of files and get this error...
[16:00] <pwaller> arges: that's basically it
[16:01] <pwaller> arges: I haven't bothered trying to reproduce on a fresh BTRFS because I figure that would be covered by other testing done to this thing
[16:01] <pwaller> arges: and I'm also not sure what would constitute "interesting enough files"
[16:02] <arges> pwaller: ok so at this point you can only reproduce on this specific instance. so if you created a new instance and ran your test it may not necessarily reproduce
[16:02] <pwaller> arges: I reproduced it on a new instance with a snapshot of the same volume on a new EBS device
[16:03] <arges> smb: ^^^ more xen fun, have you seen something like this
[16:04] <smb> arges, no, nothing btrfs related which I can remember
[16:05] <pwaller> arges: smb I'm migrating to the pub, but I will reappear on IRC in about 10 minutes in case you have further questions.
[16:05] <arges> pwaller: ok i think on this end we'll try to digest the logs a bit
[16:06] <pwaller> arges: thanks very much for looking into this :)
[16:06] <arges> np
[16:06] <pwaller> arges: I can also do further testing on the spot if you have more ideas for experiments.
[16:06] <smb> arges, The fact it runs on ec2 may be just coincidental in this case
[16:02] <arges> smb: yea i wonder if it can be reproduced in a non-ec2 xen environment, or even non-virt
[16:20] <pwaller> I'm back arges/smb
[16:20] <arges> pwaller: so has this happened with other btrfs volumes you've used?
[16:21] <pwaller> arges: I've only observed this on this one filesystem (but copied the filesystem to another device and observed it there too)
[16:21] <pwaller> We're running 5 systems which are configured the same, but only one other has a similar workload
[16:21] <arges> pwaller: how did you copy the filesystem?
[16:21] <pwaller> arges: EC2 snapshot restore to new volume
[16:22] <pwaller> Even on the original system we haven't observed the crash since we upgraded kernel from 3.13 to 3.15
[16:22] <pwaller> (which is now at 2 days 1h uptime)
[16:22] <pwaller> but the workloads aren't very steady.
[16:23] <arges> pwaller: but a copy of ~400GB of small files seems to cause the soft lockups
[16:23] <smb> Though from the logs we see it is less of a crash than a lockup
[16:23] <pwaller> arges it's probably more like ~275GB maximum
[16:24] <pwaller> arges: I started copying all 275GB, and 2h in to the ~6 hour operation, it locked up. I then resumed from the point of the lockup after the reboot, and then it will reliably lockup within minutes, or even within 30 seconds sometimes.
[16:25] <arges> pwaller: you resumed what exactly, and how
[16:26] <pwaller> arges: the volume contains ~23,000 sqlite files of varying sizes (4kb - 12GB). I copy them all from $PATH to ${PATH}.new
[16:26] <pwaller> I observe the machine hang at ${GIVEN_PATH}, and then if I choose files that are after ${GIVEN_PATH} in the list, I can rapidly reproduce the lockup
[16:27] <arges> pwaller: so you observe the hang. then stop the cp process?
[16:28] <arges> then resume it after the point where it was hanging?
[16:28] <pwaller> arges: the cp process hangs. If I terminate the cp process, a kernel thread is using 100% CPU
[16:28] <pwaller> arges: If I leave the machine idle (except for the pinned CPU on a kernel thread) and unattended for 5-10 minutes, it locks up totally and becomes unresponsive to network traffic
[16:28] <arges> ah
[16:29] <pwaller> If I restart the machine and then resume the cp from the last file that was printed, it locks up fairly rapidly at that point
[16:29] <pwaller> arges: one more detail: I run 2 cp's in parallel
[16:29] <pwaller> arges: this is the literal command I'm running
[16:29] <pwaller> cat sqlite-files.txt | xargs -n1 -I{} -P2 sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}'
[16:29] <pwaller> (then I resume with `tail -n+18082 | xargs`)
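For readability, the same pipeline annotated (the command itself is unchanged from what pwaller pasted above):

    # two parallel workers (-P2); each removes any stale copy, rewrites the
    # file next to the original, then echoes the path so the last line
    # printed marks where a hang happened
    cat sqlite-files.txt | xargs -n1 -I{} -P2 sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}'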
[16:30] <arges> pwaller: have you been able to reproduce on a single cpu instance?
[16:30] <pwaller> arges: that is something I can try
[16:35] <arges> 1) create around ~300GB of small files (sqlite files for example), put the files into a list sqlite-files.txt
[16:35] <arges> 2) Start the copy:
[16:35] <arges> cat sqlite-files.txt | xargs -n1 -I{} -P2 sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}'
[16:35] <arges> 3) When it hangs, identify where it hung as $NUM and resume with the following:
[16:35] <arges> tail -n+$NUM sqlite-files.txt | xargs -n1 -I{} -P2 sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}'
[16:35] <arges> pwaller: ^^^ does that sum up the test case
[16:35] <smb> pwaller, where does the 87% full info actually come from? df info?
[16:35] <pwaller> du -sch /volume, smb
[16:35] <pwaller> arges: looks about right
[16:36] <pwaller> arges: unfortunately the problem didn't manifest until 12 days of running originally
[16:36] <pwaller> arges: though the write workload will not have been as pathological as what you just described
[16:36] <arges> pwaller: what kind of workload were you running for those 12 days? 
[16:36] <smb> pwaller, thanks... maybe try "btrfs fi df /mountpoint"
[16:37] <pwaller> smb: https://gist.github.com/pwaller/ce4312f5e16147847a65
[16:37] <pwaller> arges: user generated workload accessing those sqlite files for arbitrary read/writes
[16:52] <arges> pwaller: have you looked at this wiki https://btrfs.wiki.kernel.org/index.php/Balance_Filters
[16:53] <arges> one thing to note is 'if you are getting out of space errors' try 'btrfs balance ...' (i'm not sure if you already tried this)
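For reference, a sketch of the filtered balance that wiki suggests for out-of-space conditions, assuming '/volume' as the mountpoint mentioned earlier ('-dusage=5' only rewrites data chunks that are at most 5% used, so it is much cheaper than a full balance):

    sudo btrfs balance start -dusage=5 /volume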
[16:54] <pwaller> arges: we just reproduced it on a single core machine in < 15s
[16:54] <pwaller> (an AWS EC2 m3.medium)
[16:55] <jsalisbury> rtg, re bug 1350373, you have any idea why I can't describe commit b7dd0e in Linus' tree?  
[16:55] <jsalisbury> rtg jsalisbury@salisbury:~/src/linux$ git describe --contains b7dd0e350e0bd4c0fddcc9b8958342700b00b168
[16:55] <jsalisbury> fatal: cannot describe 'b7dd0e350e0bd4c0fddcc9b8958342700b00b168'
[16:55] <jsalisbury> rtg, am I just missing some git knowledge?
[16:55] <pwaller> arges: it sprang back into life
[16:55] <rtg> jsalisbury, there is likely no subsequent tag. is it after -rc7 ?
[16:55] <jsalisbury> rtg, ahh, that makes sense
[16:56] <pwaller> arges: probably just observed a volume warming up
[16:56] <jsalisbury> rtg, so it will work in -rc8 or whenever the next tag is added
[16:56] <rtg> jsalisbury, I think so
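For reference, a sketch of the two git behaviours (output depends on which tags exist in your clone):

    # --contains names the commit relative to a tag that comes *after* it,
    # so it fails until e.g. v3.16-rc8 (or whatever the next tag is) exists
    git describe --contains b7dd0e350e0bd4c0fddcc9b8958342700b00b168
    # without --contains, git names the commit relative to the nearest
    # *preceding* tag, so this works immediately
    git describe b7dd0e350e0bd4c0fddcc9b8958342700b00b168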
[16:56] <arges> pwaller: ok so at this point 1) test a bit with the single cpu machine, and 2) look at the wiki i mentioned, and see if 'btrfs balance ...' changes anything
[16:56] <arges> it's lunchtime here brb
[16:56] <jsalisbury> rtg, got it, thanks
[16:58] <pwaller> arges: I don't understand the rebalance thing. It's a single device volume, is rebalance even relevant if it isn't RAID?
[17:07] <pwaller> smb, arges: one interesting feature of the single CPU is that at times, the write load drops to 0 during the copy but the sys cpu goes to 100%
[17:08] <pwaller> which is spending ~10% in rcu_sched, 50% in kworker/u30:1 and some fraction in some btrfs processes (btrfs-transaction and btrfs-endio-met)
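For reference, one way to see which kernel threads are eating the CPU like this (a sketch: '-o %CPU' needs a reasonably recent procps top, and pidstat comes from the sysstat package):

    top -bHn1 -o %CPU | head -20    # one batch snapshot, threads sorted by CPU
    pidstat -t 1 5                  # per-thread CPU usage, five 1-second samples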
[17:09] <smb> pwaller, It kind of would make sense under the assumption that something in the kernel is desperately trying to do something and never giving up.
[17:09] <pwaller> the 100% sys load remains even after aborting the cp
[17:09] <pwaller> a "sync" from the command line isn't completing
[17:10] <smb> I would read the straces you posted as: the cp you do causes some background writes which are done by a kworker
[17:11] <pwaller> "echo l > /proc/sysrq-trigger" causes the machine to freeze for quite a long time
[17:13] <pwaller> dumping a stack trace via sysrq is considerably less interesting on the single-CPU machine, the stack just has "write_sysrq_trigger" in it
[17:13] <pwaller> having said that there is another stack trace with "generic_write_sync" at the top
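For reference, the sysrq sequence in use here (needs root, and the 'l' trigger writes a backtrace for every active CPU into the kernel log; the first line is only needed if the sysrq mask doesn't already allow dumps):

    echo 1 | sudo tee /proc/sys/kernel/sysrq    # enable all sysrq functions
    echo l | sudo tee /proc/sysrq-trigger       # backtrace all active CPUs
    sudo dmesg | tail -n 100                    # read the traces back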
[17:13] <pwaller> my sync is still going
[17:13] <pwaller> and the CPU is still stuck in 100% sys
[17:16] <smb> tbh not really surprised with that
[17:16] <pwaller> arges: someone from #btrfs (I don't know their credentials) said that rebalance was unlikely to have any effect in my circumstances
[17:17] <pwaller> smb: the sync finally finished after 8 minutes
[17:17] <pwaller> smb: so I guess this could just be explained by a cold block store
[17:17] <pwaller> the machine finally went down to 0% sys
[17:17] <smb> now that is more surprising if that was a lockup
[17:18] <smb> Hm... is this a case where io for some reason is slow as hell...
[17:18] <pwaller> smb: well the machine was effectively unusable given that the kernel was consuming all of the CPU. I wouldn't expect that if IO was just "running a bit slow"
[17:18] <pwaller> (but then I don't know if my expectations are "reasonable")
[17:22] <smb> It certainly should not be that drastic. I mean the guest has around 4G of memory and sure that gets used up as cache. But it sounds a bit like the writeout was done in a polling fashion...
[17:35] <pwaller> smb: "time echo l > /proc/sysrq-trigger" takes 18 seconsd
[17:35] <pwaller> smb: why would that be?
[17:38] <smb> Hm, cannot say for sure. Maybe it's done in some soft-interrupt, and something prevents the cpu from processing them quickly...
[17:40] <smb> no... apparently IPI, but I think for xen those were using event channel(s)...
[18:09] <iri-> arges: smb: I'm hitting the road so will appear/disappear from here out. Thanks for your help looking into things. If you have any further suggestions I guess I'll hear via the launchpad bug. Have a good day/evening :)
[18:09] <arges> cya thanks
[18:09] <iri-> I should add I'm also interested in ways to mitigate this since it affects production systems
[19:26] <Joe_CoT> smb`, I see you've been busy. I showed my coworker your "I had a dream" comment