[00:17] <slangasek> xnox: if this is blocking anything we should override the milestone freeze, not let ourselves be blocked by it
[00:19] <xnox> slangasek: i'm blindly test rebuilding stuff against llvm 3.5 =) if you wish, you can unblock it and then i'd test it when it propagates. All other autopackage tests passed so yeah freeze block is the only one.
[00:20] <xnox> slangasek: but it's like on all images, thus it is a respin-the-world trigger =/
[00:25] <slangasek> xnox: no, it is not
[00:25] <slangasek> xnox: we do *not* guarantee images are up-to-date with respect to the archive during alpha milestones
=== psivaa is now known as psivaa-brb
=== psivaa-brb is now known as psivaa
[09:36] <vroomfondel> I am just jumping a bit through the kernel source at lxr.free-electrons.com... the container_of macro (which is a heavy-use macro) is defined in 10 files. Are there many redundancies like this one?
[09:40] <TJ-> vroomfondel: The drivers/staging/ ones are probably because those drivers haven't been cleaned up yet after being out-of-tree; for tools/ and scripts/ those are for separate compilation units, which leaves the gpu/radeon and include/linux ones to explain :)
[09:42] <TJ-> vroomfondel: looking at gpu/radeon, that is also a stand-alone tool ... leaving just include/linux :)
[14:22] <smoser> hey.
[14:22] <smoser> is this the same power woes we've seen before ?
[14:28] <rtg> smoser, doesn't look familiar to me, but it's also a little sparse on details. apw ?
[14:29] <smoser> rtg, well, what you see is all i know. that and "it died"
[14:29] <rtg> smoser, is this bare metal ? I'm just not that familiar with P8
[14:30] <smoser> it's a guest.
[14:30] <smoser> kvm guest.
[14:31] <rtg> smoser, prolly something we need to get the IBM guys to look at
[14:32] <smoser> rtg, i'll file a bug.
[14:37] <ubot5> Launchpad bug 1350889 in linux (Ubuntu) "kernel crash kvm guest on power8" [Undecided,New]
[14:50] <apw> yeah not something i have noticed so far
[15:00] <hallyn> apw: stgraber: on 3.16 in utopic, in an unprivileged container, all of /proc is owned by nobody:nogroup
[15:00] <hallyn> dat aint gonna fly
[15:38] <pwaller> I can now reliably reproduce this kernel hang with BTRFS: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349711
[15:38] <ubot5> Launchpad bug 1349711 in linux (Ubuntu) "Machine lockup in btrfs-transaction" [High,Incomplete]
[15:39] <pwaller> The machine I'm using for testing won't necessarily be around for much longer, so if there are any further immediate diagnostic things I should do it would be good to know
[15:41] <rtg> pwaller, can you reproduce this on bare metal ? or is it just a Xen guest issue ?
[15:42] <pwaller> rtg: It's an amazon EC2 instance. I don't have bare metal available to test on. Or rather, I can't get the partition where I can reproduce it to a bare metal machine.
[15:42] <pwaller> (It's 750GB)
[15:47] <rtg> arges, since you're just plowing through virt issues this morning, how about having a look at this one ^^
[15:48] <pwaller> I should add, my test case involves using my existing partition. I haven't tried reproducing from scratch
[15:48] <pwaller> Mainly because I suspect a problem with the free space cache
[15:49] <pwaller> It took 12 days before first manifesting. And then 2 hours of solid writes before I ended up in a situation where I could quickly reproduce again.
[15:49] <rtg> pwaller, have you talked to upstream about it ? IIRC you said 3.16 still has the issue
[15:50] <pwaller> rtg: I have mailed upstream and had no response. I've had some interest from #btrfs who suggested trying 3.16, as did the person who responded in the ubuntu bug.
[15:57] * arges looking
=== psivaa is now known as psivaa-bbl
[15:59] <arges> pwaller: so you said you can rapidly reproduce. Is that documented in the bug already? (still reading)
[15:59] <pwaller> Yes arges, at the end
[15:59] <arges> pwaller: ahh there it is : )
[16:00] <arges> pwaller: so you pretty much just copy a large amount of files and get this error...
[16:00] <pwaller> arges: that's basically it
[16:01] <pwaller> arges: I haven't bothered trying to reproduce on a fresh BTRFS because I figure that would be covered by other testing done to this thing
[16:01] <pwaller> arges: and I'm also not sure what would constitute "interesting enough files"
[16:02] <arges> pwaller: ok so at this point you can only reproduce on this specific instance. so if you created a new instance and ran your test it may not necessarily reproduce
[16:02] <pwaller> arges: I reproduced it on a new instance with a snapshot of the same volume on a new EBS device
[16:03] <arges> smb: ^^^ more xen fun, have you seen something like this
[16:04] <smb> arges, no, nothing btrfs related which I can remember
[16:05] <pwaller> arges: smb: I'm migrating to the pub, but I will reappear on IRC in about 10 minutes in case you have further questions.
[16:05] <arges> pwaller: ok i think on this end we'll try to digest the logs a bit
[16:06] <pwaller> arges: thanks very much for looking into this :)
[16:06] <pwaller> arges: I can also do further testing on the spot if you have more ideas for experiments.
[16:06] <smb> arges, The fact it runs on ec2 may be just coincidental in this case
[16:06] <arges> smb: yea i wonder if it can be reproduced in a non-ec2 xen environment, or even non-virt
[16:20] <pwaller> I'm back arges/smb
[16:20] <arges> pwaller: so has this happened with other btrfs volumes you've used?
[16:21] <pwaller> arges: I've only observed this on this one filesystem (but copied the filesystem to another device and observed it there too)
[16:21] <pwaller> We're running 5 systems which are configured the same, but only one other has a similar workload
[16:21] <arges> pwaller: how did you copy the filesystem?
[16:21] <pwaller> arges: EC2 snapshot restore to new volume
[16:22] <pwaller> Even on the original system we haven't observed the crash since we upgraded the kernel from 3.13 to 3.15
[16:22] <pwaller> (which is now at 2 days 1h uptime)
[16:22] <pwaller> but the workloads aren't very steady.
[16:23] <arges> pwaller: but a copy of ~400GB of small files seems to cause the soft lockups
[16:23] <smb> Though from the logs we see it is less of a crash than a lockup
[16:23] <pwaller> arges: it's probably more like ~275GB maximum
[16:24] <pwaller> arges: I started copying all 275GB, and 2h in to the ~6 hour operation, it locked up. I then resumed from the point of the lockup after the reboot, and then it will reliably lock up within minutes, or even within 30 seconds sometimes.
[16:25] <arges> pwaller: you resumed what exactly, and how?
[16:26] <pwaller> arges: the volume contains ~23,000 sqlite files of varying sizes (4kb - 12GB). I copy them all from $PATH to ${PATH}.new
[16:26] <pwaller> I observe the machine hang at ${GIVEN_PATH}, and then if I choose files that are after ${GIVEN_PATH} in the list, I can rapidly reproduce the lockup
[16:27] <arges> pwaller: so you observe the hang. then stop the cp process?
[16:28] <arges> then resume it after the point where it was hanging?
[16:28] <pwaller> arges: the cp process hangs. If I terminate the cp process, a kernel thread is using 100% CPU
[16:28] <pwaller> arges: If I leave the machine idle (except for the pinned CPU on a kernel thread) and unattended for 5-10 minutes, it locks up totally and becomes unresponsive to network traffic
=== psivaa-bbl is now known as psivaa
[16:29] <pwaller> If I restart the machine and then resume the cp from the last file that was printed, it locks up fairly rapidly at that point
[16:29] <pwaller> arges: one more detail: I run 2 cp's in parallel
[16:29] <pwaller> arges: this is the literal command I'm running
[16:29] <pwaller> cat sqlite-files.txt | xargs -n1 -I{} -P2 sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}'
[16:29] <pwaller> (then I resume with `tail -n+18082 | xargs`)
[16:30] <arges> pwaller: have you been able to reproduce on a single cpu instance?
[16:30] <pwaller> arges: that is something I can try
[16:35] <arges> 1) create around ~300GB of small files (sqlite files for example), put the files into a list sqlite-files.txt
[16:35] <arges> 2) Start the copy:
[16:35] <arges> cat sqlite-files.txt | xargs -n1 -I{} -P2 sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}'
[16:35] <arges> 3) When it hangs, identify where it hung as $NUM and resume with the following:
[16:35] <arges> tail -n+$NUM sqlite-files.txt | xargs -n1 -I{} -P2 sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}'
[16:35] <arges> pwaller: ^^^ does that sum up the test case?
[16:35] <smb> pwaller, where does the 87% full info actually come from? df info?
[16:35] <pwaller> smb: du -sch /volume
[16:35] <pwaller> arges: looks about right
[16:36] <pwaller> arges: unfortunately the problem didn't manifest until 12 days of running originally
[16:36] <pwaller> arges: though the write workload will not have been as pathological as what you just described
[16:36] <arges> pwaller: what kind of workload were you running for those 12 days?
[16:36] <smb> pwaller, thanks... maybe try "btrfs fi df /mountpoint"
[16:37] <pwaller> smb: https://gist.github.com/pwaller/ce4312f5e16147847a65
[16:37] <pwaller> arges: user generated workload accessing those sqlite files for arbitrary read/writes
[16:52] <arges> pwaller: have you looked at this wiki? https://btrfs.wiki.kernel.org/index.php/Balance_Filters
[16:53] <arges> one thing to note is 'if you are getting out of space errors' try 'btrfs balance ...' (i'm not sure if you already tried this)
[16:54] <pwaller> arges: we just reproduced it on a single core machine in < 15s
[16:54] <pwaller> (an AWS EC2 m3.medium)
[16:55] <jsalisbury> rtg, re bug 1350373, you have any idea why I can't describe commit b7dd0e in Linus' tree?
[16:55] <ubot5> bug 1350373 in linux (Ubuntu Trusty) "Kernel BUG in paravirt_enter_lazy_mmu when running as a Xen PV guest" [Medium,Triaged] https://launchpad.net/bugs/1350373
[16:55] <jsalisbury> rtg: jsalisbury@salisbury:~/src/linux$ git describe --contains b7dd0e350e0bd4c0fddcc9b8958342700b00b168
[16:55] <jsalisbury> fatal: cannot describe 'b7dd0e350e0bd4c0fddcc9b8958342700b00b168'
[16:55] <jsalisbury> rtg, am I just missing some git knowledge?
[16:55] <pwaller> arges: it sprang back into life
[16:55] <rtg> jsalisbury, there is likely no subsequent tag. is it after -rc7 ?
[16:55] <jsalisbury> rtg, ahh, that makes sense
[16:56] <pwaller> arges: probably just observed a volume warming up
[16:56] <jsalisbury> rtg, so it will work in -rc8 or whenever the next tag is added
[16:56] <rtg> jsalisbury, I think so
[16:56] <arges> pwaller: ok so at this point 1) test a bit with the single cpu machine, and 2) look at the wiki i mentioned, and see if 'btrfs balance ...' changes anything
[16:56] <arges> it's lunchtime here, brb
[16:56] <jsalisbury> rtg, got it, thanks
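rtg's explanation can be verified in a throwaway repository: `git describe --contains` names a commit relative to a tag that is a descendant of it, so a commit newer than every existing tag (like one landing after -rc7) cannot be described until the next tag appears. The /tmp path and the identity flags below are illustrative.

```shell
# Throwaway repo demonstrating when `git describe --contains` fails.
rm -rf /tmp/describe-demo && git init -q /tmp/describe-demo && cd /tmp/describe-demo
git -c user.email=a@example.com -c user.name=a commit -q --allow-empty -m one
git tag v1.0                               # v1.0 "contains" commit one
git -c user.email=a@example.com -c user.name=a commit -q --allow-empty -m two

# A commit reachable from a tag can be named relative to it (e.g. v1.0^0)...
git describe --contains HEAD~1
# ...but a commit after every tag cannot, until the next tag is created:
git describe --contains HEAD || echo "no tag contains this commit yet"
```

Plain `git describe` (without `--contains`) looks backwards instead, at the most recent tag reachable *from* the commit, which is why it keeps working for untagged tip commits.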
[16:58] <pwaller> arges: I don't understand the rebalance thing. It's a single device volume, is rebalance even relevant if it isn't RAID?
[17:07] <pwaller> smb, arges: one interesting feature of the single CPU is that at times, the write load drops to 0 during the copy but the sys cpu goes to 100%
[17:08] <pwaller> which is spending ~10% in rcu_sched, 50% in kworker/u30:1 and some fraction in some btrfs processes (btrfs-transaction and btrfs-endio-met)
[17:09] <smb> pwaller, It kind of would make sense with the assumption that something in the kernel is desperately trying, but never giving up, to do something.
[17:09] <pwaller> the 100% sys load remains even after aborting the cp
[17:09] <pwaller> a "sync" from the command line isn't completing
[17:10] <smb> I would read the straces you posted as: the cp you do causes some background writes which are done by a kworker
[17:11] <pwaller> "echo l > /proc/sysrq-trigger" causes the machine to freeze for quite a long time
[17:13] <pwaller> dumping a stack trace via sysrq is considerably less interesting on the single-CPU machine, the stack just has "write_sysrq_trigger" in it
[17:13] <pwaller> having said that there is another stack trace with "generic_write_sync" at the top
[17:13] <pwaller> my sync is still going
[17:13] <pwaller> and the CPU is still stuck in 100% sys
[17:16] <smb> tbh not really surprised by that
[17:16] <pwaller> arges: someone from #btrfs (I don't know their credentials) said that rebalance was unlikely to have any effect in my circumstances
[17:17] <pwaller> smb: the sync finally finished after 8 minutes
[17:17] <pwaller> smb: so I guess this could just be explained by a cold block store
[17:17] <pwaller> the machine finally went down to 0% sys
[17:17] <smb> now that is more surprising if that was a lockup
[17:18] <smb> Hm... is this a case where io for some reason is slow as hell...
[17:18] <pwaller> smb: well the machine was effectively unusable given that the kernel was consuming all of the CPU. I wouldn't expect that if IO was just "running a bit slow"
[17:18] <pwaller> (but then I don't know if my expectations are "reasonable")
[17:22] <smb> It certainly should not be that drastic. I mean the guest has around 4G of memory and sure that gets used up as cache. But it sounds a bit like the writeout was done in a polling fashion...
[17:35] <pwaller> smb: "time echo l > /proc/sysrq-trigger" takes 18 seconds
[17:35] <pwaller> smb: why would that be?
[17:38] <smb> Hm, cannot say for sure. maybe done in some soft-interrupt and if something prevents the cpu from processing them quickly...
[17:40] <smb> no... apparently ipi, but I think for xen those were using event channel(s)...
[18:09] <iri-> arges: smb: I'm hitting the road so will appear/disappear from here out. Thanks for your help looking into things. If you have any further suggestions I guess I'll hear via the launchpad bug. Have a good day/evening :)
[18:09] <arges> cya, thanks
[18:09] <iri-> I should add I'm also interested in ways to mitigate this since it affects production systems
=== iri- is now known as pwaller`
=== DalekSec_ is now known as DalekSec
[19:26] <Joe_CoT> smb`, I see you've been busy. I showed my coworker your "I had a dream" comment
=== hatch__ is now known as hatch

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!