slangasek | xnox: if this is blocking anything we should override the milestone freeze, not let ourselves be blocked by it | 00:17 |
---|---|---|
xnox | slangasek: i'm blindly test rebuilding stuff against llvm 3.5 =) if you wish, you can unblock it and then i'd test it when it propagates. All other autopackage tests passed so yeah freeze block is the only one. | 00:19 |
xnox | slangasek: but it's like on all images, thus it is a respin-the-world trigger =/ | 00:20 |
slangasek | xnox: no, it is not | 00:25 |
slangasek | xnox: we do *not* guarantee images are up-to-date with respect to the archive during alpha milestones | 00:25 |
=== psivaa is now known as psivaa-brb | ||
=== psivaa-brb is now known as psivaa | ||
vroomfondel | I am just jumping a bit through the kernel source at lxr.free-electrons.com... the container_of macro (which is a heavy-use macro) is defined in 10 files. Are there many redundancies like this one? | 09:36 |
TJ- | vroomfondel: The drivers/staging/ are probably because those drivers haven't been cleaned up yet after being out-of-tree, for the tools/ and scripts/ those are for separate compilation units, which leaves the gpu/radeon and include/linux to explain :) | 09:40 |
TJ- | vroomfondel: looking at gpu/radeon that is also a stand-alone tool ... leaving just include/linux :) | 09:42 |
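For reference, the duplicate definitions are easy to enumerate from a kernel checkout (a sketch; the exact hit count varies by kernel version):

```sh
# List every definition of container_of in the tree; run from the root
# of a kernel source checkout.
git grep -n '#define container_of'
```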
smoser | hey. | 14:22 |
smoser | http://paste.ubuntu.com/7914933/ | 14:22 |
smoser | is this the same power woes we've seen before ? | 14:22 |
rtg | smoser, doesn't look familiar to me, but it's also a little sparse on details. apw ? | 14:28 |
smoser | rtg, well, what you see is all i know. that and "it died" | 14:29 |
smoser | :) | 14:29 |
rtg | smoser, is this bare metal ? I'm just not that familiar with P8 | 14:29 |
smoser | it's a guest. | 14:30 |
smoser | kvm guest. | 14:30 |
rtg | smoser, prolly something we need to get the IBM guys to look at | 14:31 |
smoser | rtg, i'll file a bug. | 14:32 |
smoser | https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1350889 | 14:37 |
ubot5 | Launchpad bug 1350889 in linux (Ubuntu) "kernel crash kvm guest on power8" [Undecided,New] | 14:37 |
apw | yeah not something i have noticed so far | 14:50 |
hallyn | apw: stgraber: on 3.16 in utopic, in a unprivileged container, all of /proc is owned by nobody:nogroup | 15:00 |
hallyn | dat aint gonna fly | 15:00 |
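That ownership is easy to confirm from inside the container (a sketch; per hallyn's report on 3.16, these would show nobody:nogroup rather than root:root):

```sh
# Run inside the unprivileged container:
ls -ld /proc /proc/sys /proc/sysrq-trigger
```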
pwaller | I can now reliably reproduce this kernel hang with BTRFS: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349711 | 15:38 |
ubot5 | Launchpad bug 1349711 in linux (Ubuntu) "Machine lockup in btrfs-transaction" [High,Incomplete] | 15:38 |
pwaller | The machine I'm using for testing won't necessarily be around for much longer, so if there are any further immediate diagnostic things I should do, it would be good to know | 15:39 |
rtg | pwaller, can you reproduce this on bare metal ? or is it just a Xen guest issue ? | 15:41 |
pwaller | rtg: It's an amazon EC2 instance. I don't have bare metal available to test on. Or rather, I can't get the partition where I can reproduce it to a bare metal machine. | 15:42 |
pwaller | (It's 750GB) | 15:42 |
rtg | arges, since you're just plowing through virt issues this morning, how about having a look at this one ^^ | 15:47 |
pwaller | I should add, my test case involves using my existing partition. I haven't tried reproducing from scratch | 15:48 |
pwaller | Mainly because I suspect a problem with the free space cache | 15:48 |
pwaller | It took 12 days before first manifesting. And then 2 hours of solid writes before I ended up in a situation where I could quickly reproduce again. | 15:49 |
rtg | pwaller, have you talked to upstream about it ? IIRC you said 3.16 still has the issue | 15:49 |
pwaller | rtg: I have mailed upstream and had no response. I've had some interest from #btrfs who suggested trying 3.16, as did the person who responded in the ubuntu bug. | 15:50 |
* arges looking | 15:57 | |
=== psivaa is now known as psivaa-bbl | ||
arges | pwaller: so you said you can rapidly reproduce. Is that documented in the bug already (still reading) | 15:59 |
pwaller | Yes arges at the end | 15:59 |
arges | pwaller: ahh there it is : ) | 15:59 |
arges | pwaller: so you pretty much just copy a large amount of files and get this error... | 16:00 |
pwaller | arges: that's basically it | 16:00 |
pwaller | arges: I haven't bothered trying to reproduce on a fresh BTRFS because I figure that would be covered by other testing done to this thing | 16:01 |
pwaller | arges: and I'm also not sure what would constitute "interesting enough files" | 16:01 |
arges | pwaller: ok so at this point you can only reproduce on this specific instance. so if you created a new instance and ran your test it may not necessarily reproduce | 16:02 |
pwaller | arges: I reproduced it on a new instance with a snapshot of the same volume on a new EBS device | 16:02 |
arges | smb: ^^^ more xen fun, have you seen something like this | 16:03 |
smb | arges, no, nothing btrfs related which I can remember | 16:04 |
pwaller | arges: smb I'm migrating to the pub, but I will reappear on IRC in about 10 minutes in case you have further questions. | 16:05 |
arges | pwaller: ok i think on this end we'll try to digest the logs a bit | 16:05 |
pwaller | arges: thanks very much for looking into this :) | 16:06 |
arges | np | 16:06 |
pwaller | arges: I can also do further testing on the spot if you have more ideas for experiments. | 16:06 |
smb | arges, The fact it runs on ec2 may be just coincidental in this case | 16:06 |
arges | smb: yea i wonder if it can be reproduced in a non-ec2 xen environment : or even non-virt | 16:06 |
pwaller | I'm back arges/smb | 16:20 |
arges | pwaller: so has this happened with other btrfs volumes you've used? | 16:20 |
pwaller | arges: I've only observed this on this one filesystem (but copied the filesystem to another device and observed it there too) | 16:21 |
pwaller | We're running 5 systems which are configured the same, but only one other has a similar workload | 16:21 |
arges | pwaller: how did you copy the filesystem? | 16:21 |
pwaller | arges: EC2 snapshot restore to new volume | 16:21 |
pwaller | Even on the original system we haven't observed the crash since we upgraded kernel from 3.13 to 3.15 | 16:22 |
pwaller | (which is now at 2 days 1h uptime) | 16:22 |
pwaller | but the workloads aren't very steady. | 16:22 |
arges | pwaller: but a copy of ~400GB of small files seems to cause the soft lockups | 16:23 |
smb | Though from the logs we see it is less of a crash than a lockup | 16:23 |
pwaller | arges it's probably more like ~275GB maximum | 16:23 |
pwaller | arges: I started copying all 275GB, and 2h in to the ~6 hour operation, it locked up. I then resumed from the point of the lockup after the reboot, and then it will reliably lockup within minutes, or even within 30 seconds sometimes. | 16:24 |
arges | pwaller: you resumed what exactly, and how | 16:25 |
pwaller | arges: the volume contains ~23,000 sqlite files of varying sizes (4kb - 12GB). I copy them all from $PATH to ${PATH}.new | 16:26 |
pwaller | I observe the machine hang at ${GIVEN_PATH}, and then if I choose files that are after ${GIVEN_PATH} in the list, I can rapidly reproduce the lockup | 16:26 |
arges | pwaller: so you observe the hang. then stop the cp process? | 16:27 |
arges | then resume it after the point where it was hanging? | 16:28 |
pwaller | arges: the cp process hangs. If I terminate the cp process, a kernel thread is using 100% CPU | 16:28 |
pwaller | arges: If I leave the machine idle (except for the pinned CPU on a kernel thread) and unattended for 5-10 minutes, it locks up totally and becomes unresponsive to network traffic | 16:28 |
=== psivaa-bbl is now known as psivaa | ||
arges | ah | 16:28 |
pwaller | If I restart the machine and then resume the cp from the last file that was printed, it locks up fairly rapidly at that point | 16:29 |
pwaller | arges: one more detail: I run 2 cp's in parallel | 16:29 |
pwaller | arges: this is the literal command I'm running | 16:29 |
pwaller | cat sqlite-files.txt | xargs -n1 -I{} -P2 sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}' | 16:29 |
pwaller | (then I resume with `tail -n+18082 | xargs`) | 16:29 |
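For readers less familiar with xargs, the copy pipeline breaks down roughly as follows (a sketch; sqlite-files.txt is the file list pwaller mentions, which the resume shorthand above elides as tail's input):

```sh
# -n1  : pass one file path per invocation
# -I{} : substitute {} with the path
# -P2  : run two copies in parallel
# For each file: drop any partial copy from an earlier run, rewrite the
# file through the page cache, then log progress so a hang can be resumed.
cat sqlite-files.txt | xargs -n1 -I{} -P2 \
    sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}'
```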
arges | pwaller: have you been able to reproduce on a single cpu instance? | 16:30 |
pwaller | arges: that is something I can try | 16:30 |
arges | 1) create around ~300GB of small files (sqlite files for example), put the files into a list sqlite-files.txt | 16:35 |
arges | 2) Start the copy: | 16:35 |
arges | cat sqlite-files.txt | xargs -n1 -I{} -P2 sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}' | 16:35 |
arges | 3) When it hangs, identify where it hung as $NUM and resume with the following: | 16:35 |
arges | tail -n+$NUM sqlite-files.txt | xargs -n1 -I{} -P2 sudo sh -c 'rm -f {}.new; cat {} > {}.new; echo {}' | 16:35 |
arges | pwaller: ^^^ does that sum up the test case | 16:35 |
smb | pwaller, where does the 87% full info actually come from? df info? | 16:35 |
pwaller | du -sch /volume, smb | 16:35 |
pwaller | arges: looks about right | 16:35 |
pwaller | arges: unfortunately the problem didn't manifest until 12 days of running originally | 16:36 |
pwaller | arges: though the write workload will not have been as pathological as what you just described | 16:36 |
arges | pwaller: what kind of workload were you running for those 12 days? | 16:36 |
smb | pwaller, thanks... maybe try "btrfs fi df /mountpoint" | 16:36 |
pwaller | smb: https://gist.github.com/pwaller/ce4312f5e16147847a65 | 16:37 |
pwaller | arges: user generated workload accessing those sqlite files for arbitrary read/writes | 16:37 |
arges | pwaller: have you looked at this wiki https://btrfs.wiki.kernel.org/index.php/Balance_Filters | 16:52 |
arges | one thing to note is 'if you are getting out of space errors' try 'btrfs balance ...' (i'm not sure if you already tried this) | 16:53 |
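For reference, the variant the wiki suggests for out-of-space problems is a filtered data balance (a sketch; usage=50 is an arbitrary example threshold and /volume stands in for the actual mount point):

```sh
# Rewrite only data block groups that are at most 50% used, compacting
# them so allocated-but-nearly-empty chunks return to free space:
sudo btrfs balance start -dusage=50 /volume
# Watch progress from another shell:
sudo btrfs balance status /volume
```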
pwaller | arges: we just reproduced it on a single core machine in < 15s | 16:54 |
pwaller | (an AWS EC2 m3.medium) | 16:54 |
jsalisbury | rtg, re bug 1350373, you have any idea why I can't describe commit b7dd0e in Linus' tree? | 16:55 |
ubot5 | bug 1350373 in linux (Ubuntu Trusty) "Kernel BUG in paravirt_enter_lazy_mmu when running as a Xen PV guest" [Medium,Triaged] https://launchpad.net/bugs/1350373 | 16:55 |
jsalisbury | rtg jsalisbury@salisbury:~/src/linux$ git describe --contains b7dd0e350e0bd4c0fddcc9b8958342700b00b168 | 16:55 |
jsalisbury | fatal: cannot describe 'b7dd0e350e0bd4c0fddcc9b8958342700b00b168' | 16:55 |
jsalisbury | rtg, am I just missing some git knowledge? | 16:55 |
pwaller | arges: it sprang back into life | 16:55 |
rtg | jsalisbury, there is likely no subsequent tag. is it after -rc7 ? | 16:55 |
jsalisbury | rtg, ahh, that makes sense | 16:55 |
pwaller | arges: probably just observed a volume warming up | 16:56 |
jsalisbury | rtg, so it will work in -rc8 or whenever the next tag is added | 16:56 |
rtg | jsalisbury, I think so | 16:56 |
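Until that next tag exists, the merge status can still be checked without describe (a sketch, assuming a current clone of Linus' tree with the remote named origin):

```sh
# Exits 0 (and prints "merged") if the commit is an ancestor of master:
git merge-base --is-ancestor \
    b7dd0e350e0bd4c0fddcc9b8958342700b00b168 origin/master && echo merged
# Once -rc8 or any later tag contains the commit, this starts working:
git describe --contains b7dd0e350e0bd4c0fddcc9b8958342700b00b168
```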
arges | pwaller: ok so at this point 1) test a bit with the single cpu machine, and 2) look at the wiki i mentioned, and see if 'btrfs balance ...' changes anything | 16:56 |
arges | its lunchtime here brb | 16:56 |
jsalisbury | rtg, got it, thanks | 16:56 |
pwaller | arges: I don't understand the rebalance thing. It's a single device volume, is rebalance even relevant if it isn't RAID? | 16:58 |
pwaller | smb, arges: one interesting feature of the single-CPU machine is that at times, the write load drops to 0 during the copy but the sys cpu goes to 100% | 17:07 |
pwaller | which is spending ~10% in rcu_sched, 50% in kworker/u30:1 and some fraction in some btrfs processes (btrfs-transaction and btrfs-endio-met) | 17:08 |
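Per-thread figures like those can be collected with either of the following (a sketch; pidstat comes from the sysstat package):

```sh
# Per-thread CPU usage, sampled once a second for five samples:
pidstat -t 1 5
# Or interactively, with kernel threads shown as individual lines:
top -H
```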
smb | pwaller, It kind of would make sense under the assumption that something in the kernel is desperately trying, but never giving up on, doing something. | 17:09 |
pwaller | the 100% sys load remains even after aborting the cp | 17:09 |
pwaller | a "sync" from the command line isn't completing | 17:09 |
smb | I would read the straces you posted as: the cp you do causes some background writes which are done by a kworker | 17:10 |
pwaller | "echo l > /proc/sysrq-trigger" causes the machine to freeze for quite a long time | 17:11 |
pwaller | dumping a stack trace via sysrq is considerably less interesting on the single-CPU machine; the stack just has "write_sysrq_trigger" in it | 17:13 |
pwaller | having said that there is another stack trace with "generic_write_sync" at the top | 17:13 |
pwaller | my sync is still going | 17:13 |
pwaller | and the CPU is still stuck in 100% sys | 17:13 |
smb | tbh not really surprised with that | 17:16 |
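For context on the sysrq experiments above (a sketch; these are standard magic-sysrq triggers, and 'w' is often the more telling one for hangs like this):

```sh
# Make sure magic sysrq is fully enabled:
echo 1 | sudo tee /proc/sys/kernel/sysrq
# 'l': backtrace of all active CPUs (what pwaller ran above):
echo l | sudo tee /proc/sysrq-trigger
# 'w': dump blocked, uninterruptible (D-state) tasks; read the output from dmesg:
echo w | sudo tee /proc/sysrq-trigger
dmesg | tail -n 100
```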
pwaller | arges: someone from #btrfs (I don't know their credentials) said that rebalance was unlikely to have any effect in my circumstances | 17:16 |
pwaller | smb: the sync finally finished after 8 minutes | 17:17 |
pwaller | smb: so I guess this could just be explained by a cold block store | 17:17 |
pwaller | the machine finally went down to 0% sys | 17:17 |
smb | now that is more surprising if that was a lockup | 17:17 |
smb | Hm... is this a case where io for some reason is slow as hell... | 17:18 |
pwaller | smb: well the machine was effectively unusable given that the kernel was consuming all of the CPU. I wouldn't expect that if IO was just "running a bit slow" | 17:18 |
pwaller | (but then I don't know if my expectations are "reasonable") | 17:18 |
smb | It certainly should not be that drastic. I mean the guest has around 4G of memory and sure that gets used up as cache. But it sounds a bit like the writeout was done in a polling fashion... | 17:22 |
pwaller | smb: "time echo l > /proc/sysrq-trigger" takes 18 seconsd | 17:35 |
pwaller | smb: why would that be? | 17:35 |
smb | Hm, cannot say for sure. Maybe done in some soft-interrupt, and if something prevents the cpu from processing them quickly... | 17:38 |
smb | no... apparently ipi but I think for xen those were using event channel(s)... | 17:40 |
iri- | arges: smb: I'm hitting the road so will appear/disappear from here out. Thanks for your help looking into things. If you have any further suggestions I guess I'll hear via the launchpad bug. Have a good day/evening :) | 18:09 |
arges | cya thanks | 18:09 |
iri- | I should add I'm also interested in ways to mitigate this since it affects production systems | 18:09 |
=== iri- is now known as pwaller` | ||
=== DalekSec_ is now known as DalekSec | ||
Joe_CoT | smb`, I see you've been busy. I showed my coworker your "I had a dream" comment | 19:26 |
=== hatch__ is now known as hatch |