/srv/irclogs.ubuntu.com/2012/03/20/#launchpad-yellow.txt

frankban	gmb: good morning, I am starting juju charms to run tests	09:48
gmb	frankban, Morning. Cool. I'm just looking at Gary's email now...	09:49
frankban	gmb: select/read hangs, scary	09:49
gmb	Yeah :/	09:49
frankban	si\	09:53
frankban	ops	09:53
* bac \o		11:36
gmb	Wow. Shows how long it's been since I did any LP development work. My env is completely broken.	11:36
bac	gmb: no kidding. and it seems things are moving around a lot over the last few weeks.	11:37
frankban	gary_poster: good morning, now I have a better understanding of what's going on with testrepository	12:06
gary_poster	frankban, awesome. good afternoon.	12:07
gary_poster	what does that mean practically for our issues?	12:08
frankban	gary_poster: the debian control file in http://bazaar.launchpad.net/~testrepository/debian/sid/python-testrepository/sid/view/head:/debian/control (the one we use) is different from the one in ubuntu upstream: http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/testrepository/precise/view/head:/debian/control	12:08
gary_poster	ahhh!	12:09
gary_poster	so if we switch, all will be well? easy?	12:09
gary_poster	need to step away; back soon.	12:10
frankban	gary_poster: yes, but I have a question: my branch patches testrepository trunk. So our ppa installs the trunk revision (maybe we could use that to see if https://bugs.launchpad.net/testrepository/+bug/775214 is fixed) However, no problem for me to update the ppa to use a patched upstream release.	12:13
_mup_	Bug #775214: On python 2.7: String or Integer object expected for key, unicode found <Testrepository:Fix Committed by lifeless> < https://launchpad.net/bugs/775214 >	12:13
* frankban lunch		12:13
frankban	gary_poster: buildbot tests are currently running on ec2, access granted for you and benji (user: ubuntu)	12:19
frankban	zk: ec2-107-20-6-115.compute-1.amazonaws.com	12:19
frankban	master: ec2-50-16-78-27.compute-1.amazonaws.com	12:20
frankban	slave: ec2-23-20-98-182.compute-1.amazonaws.com	12:20
* frankban really lunches		12:20
benji	precise... <sigh>	12:23
* benji goes to report a bug on precise		12:23
gary_poster	frankban, ec2: great thank you	12:35
gary_poster	frankban, testrepository: I guess...if we can easily fix the subunit dependency issue (and I expect we can) then using trunk would be good	12:36
gary_poster	that makes me a bit nervous	12:37
gary_poster	but it is probably good for the long run and hopefully ok for the short run	12:37
gary_poster	I would guess we would have to make our own branch of the ubuntu debian bits	12:39
* gary_poster thinks aptitude rocks		12:55
benji	I've considered starting to use aptitude, and there have been a couple of times I wish I already had.	13:02
gary_poster	bac benji frankban gmb call in 2 or sooner	13:09
gmb	A tumbleweed rolls through goldenhorde	13:11
gary_poster	https://docs.google.com/a/canonical.com/document/d/19Zn7fGkQH5oOpJkaU2lGCpt8RK5KiDpPBTKJKs50wWw/edit	13:26
gary_poster	benji, I'm going to restart post update, then I'd like to discuss strategy for tackling the test hangs when you have a moment	13:48
benji	gary_poster: sure	13:48
gary_poster	thanks	13:48
gary_poster	OK, benji. https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde when you get a chance. camera appears to be working post-update	14:04
gary_poster	benji, fwiw, this is the command I am running (no confirmation yet that it is working):	14:51
gary_poster	xvfb-run --error-file=/var/tmp/xvfb-errors.log --server-args='-screen 0 1024x768x24' -a /home/gary/launchpad/lp/sandbox/bin/test --subunit --load-list /home/gary/temp/tmp0d4ZXs	14:51
benji	k	14:51
gary_poster	yeah it seems to be working	14:51
gary_poster	I probably should have teed the output	14:52
gary_poster	but I didn't	14:52
benji	good point, I'll tee mine	14:53
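A minimal sketch of the "tee the output" idea from the exchange above, written in Python instead of the shell's `tee`. The helper name `run_and_tee` and the log path are hypothetical; the command passed in would be whatever test command is being run.

```python
import subprocess
import sys


def run_and_tee(command, log_path):
    """Run `command`, echoing its output to stdout while saving it to a file.

    Returns the exit status of the command.
    """
    with open(log_path, "w") as log_file:
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into the same stream
            text=True,
            errors="replace",  # do not die on odd bytes in test output
        )
        for line in process.stdout:
            sys.stdout.write(line)
            log_file.write(line)
        return process.wait()
```

Saving the stream this way means a later pass (for example with subunit2pyunit) can work from the file rather than from memory of what scrolled by.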
bac	sorry about the email churn with failed PPA builds. i'm trying to get the packaging to work with the new name spelling. hopefully the next one will work.	15:02
gary_poster	thanks, np	15:03
* gary_poster needs to go babysit and such. biab		15:03
gary_poster	benji, are you getting a lot of "SilentLaunchpadScriptFailure: 1" errors?	15:22
benji	gary_poster: nope	15:22
gary_poster	k	15:22
gary_poster	a lot of failures on my side. going away again	15:23
benji	gary_poster: I started later than you did, so I might not be there yet.	15:23
gary_poster	maybe so	15:23
gary_poster	hm	15:23
gary_poster	saw schema related error	15:24
gary_poster	going to stop, make schema, and retry	15:24
benji	yeah, I did a pull and make schema before my run, just in case	15:30
gary_poster	ok, restarted on new ephemeral with changes made	15:32
gary_poster	now really babysitting :-P	15:33
gary_poster	argh	15:33
gary_poster	fell over	15:33
benji	:(	15:34
gary_poster	dumb mistake	15:34
gary_poster	retrying	15:34
gary_poster	there we go	15:34
gary_poster	ok, now leaving :-)	15:34
frankban	benji: http://ec2-50-16-78-27.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/1/steps/shell_8/logs/stdio	15:35
bac	"baby sitting" + "fell over" is not good	15:36
benji	frankban: what am I looking for?	15:36
frankban	the results of a parallel test run	15:36
frankban	benji: and it's doing another run...	15:37
frankban	have you started it?	15:37
benji	frankban: have I started what? Another run? no. That was probably triggered by a commit.	15:38
frankban	benji: ah... ok	15:38
benji	frankban: I'm still not sure what you would like for me to notice about the output. That it finished without hanging, perhaps?	15:39
frankban	benji: yes, and only 4 failures... is the hang happening only using 8 cores?	15:41
benji	gary_poster: ah, ok. Nope, we've seen a hang with just two, so the fact that you didn't get one is interesting.	15:41
benji	frankban: note that the xx-bug.txt failure is a known issue in the trunk, the production buildbot reported the same failure a few hours ago	15:42
frankban	benji: yes, I've seen	15:44
gary_poster	actually, benji, frankban, I have not seen a hang lately with two cores	16:10
gary_poster	only 4 failures is great	16:10
gary_poster	benji, do you have failures on your run? I definitely do	16:11
gary_poster	benji, maybe worth noting is that testrepository had not reported any errors.	16:18
gary_poster	I wonder if this is some kind of "buffer filling too fast" problem	16:19
gary_poster	triggered by having so many errors	16:19
gary_poster	I'm not sure how many errors I'm going to end up with on this run, but "a lot" looks like a rough guess	16:20
gary_poster	benji, no hangs for me. trying to figure out a quick way to get results of run	16:32
gary_poster	"subunit2pyunit < testoutput.txt" yields a fairly confusing result: only one error?	16:38
gary_poster	benji, ok, yeah, I'm confused. I thought I saw a lot of errors flying by, but now when I look at the teed document, I see very few tracebacks. The only error I get from the command above is one for an issue that subunit itself seems to show as...successful?	16:45
gary_poster	test: lib/lp/app/javascript/overlay/tests/test_overlay.html	16:45
gary_poster	test: Could not communicate with subprocess	16:45
gary_poster	tags: zope:error_with_banner	16:45
gary_poster	successful: Could not communicate with subprocess	16:45
gary_poster	...riiiiight...	16:46
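A rough sketch of one quick way to count outcomes in a teed file like the one quoted above. It assumes the stream is the line-oriented subunit v1 text form seen in those lines ("test:", "tags:", "successful:"), that failure and error details are wrapped in a trailing "[" and a closing "]" line, and that the other outcome keywords are named as listed in the code. It does not use the subunit library, and `tally_outcomes` is a hypothetical helper name.

```python
import collections

# "successful" is the keyword seen in the log above; the other names are
# assumed outcome keywords of the subunit v1 text protocol.
OUTCOMES = {
    "success": "success",
    "successful": "success",
    "failure": "failure",
    "error": "error",
    "skip": "skip",
    "xfail": "xfail",
}


def tally_outcomes(lines):
    """Count outcome lines, ignoring anything inside a bracketed details block."""
    counts = collections.Counter()
    in_details = False
    for raw in lines:
        line = raw.rstrip("\n")
        if in_details:
            # Traceback text can contain anything; skip until the closing "]".
            if line == "]":
                in_details = False
            continue
        keyword, separator, rest = line.partition(":")
        outcome = OUTCOMES.get(keyword.strip())
        if not separator or outcome is None:
            continue  # "test:", "tags:", "time:" and plain output land here
        counts[outcome] += 1
        if rest.rstrip().endswith("["):
            in_details = True
    return counts
```

Run over the four lines quoted above, this counts a single success and no errors, which matches the confusing picture described: the "Could not communicate with subprocess" entry is recorded as successful.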
benji	gary_poster: was eating lunch; reading backlog now	16:49
benji	gary_poster: I have no failures in my non-ephemeral run	16:52
benji	my run took just under an hour and had no errors or failures at all	16:53
benji	I'm going to start another in an ephemeral container and see what that does	16:53
gary_poster	benji, interesting	16:55
gary_poster	benji, so, maybe my "I have tons of errors" was confused by the fact that I ended up searching into the previous run. not sure. in any case, the only issue I see in the tee'd file is the one I gave above. So I'm wondering what to do now, since I was unable to dupe. I was considering hacking testr to only start one process, and to include the --load-list that we are using, and see how that goes. Thoughts?	17:05
benji	gary_poster: so the intent of your hack would be to run in a normal environment, but serialize instead of parallelize in order to see if we get failures or not, right?	17:06
gary_poster	benji, not exactly. The intent would be to run a single process/container of what the eight core machine did, but exactly as it did. Specifically, I'm going to hack testr to make it think I only have one core (which will mean that it will run all the tests it is supposed to run in a single ephemeral lxc container); and I'll include --load-list=/home/gary/temp/tmp0... when I start testr, so only those tests are run	17:09
gary_poster	If I succeed in triggering a hang, I at least have a recipe for triggering it locally. If I do not succeed, then it implies that not only does testr need to run those tests in an ephemeral lxc container, but also they must be in parallel; or my machine is sufficiently different from the ec2 machine that it doesn't trigger.	17:11
gmb	gary_poster, So, I've lost a bunch of time this afternoon to my lp setup being hideously broken. I've now rebuilt it. Do you have any guidance for me re: bug 609986?	17:16
_mup_	Bug #609986: layer setup failures don't output a failure message (they spew straight to console) <lp-foundations> <paralleltest> <Launchpad itself:Triaged> < https://launchpad.net/bugs/609986 >	17:16
gary_poster	actually maybe I don't have to hack testr to not run in parallel; just don't use --parallel	17:16
gary_poster	gmb sure, lemme get that back in my head. want to hang out for just a bit?	17:16
gmb	gary_poster, Sure. Let me get Firefox running	17:17
gary_poster	k	17:17
gmb	Ah, crap, updates..	17:18
gary_poster	gmb, https://code.launchpad.net/~launchpad/zope.testing/3.9.4-p5 is something to talk about when you are ready	17:19
* gmb looks		17:19
gmb	gary_poster, goldenhorde?	17:19
gary_poster	gmb, yeah	17:19
gmb	k	17:19
benji	gary_poster: the ephemeral run completed with one failure: lp.services.job.tests.test_runner.TestTwistedJobRunner.test_memory_hog_job	17:56
gary_poster	benji, I got that one in my testr run so far	18:02
gary_poster	so, benji, we have an apparently intermittent test isolation error...	18:03
gary_poster	and we are unable to trigger the hang with merely an lxc or an ephemeral lxc.	18:03
gary_poster	I'm now adding testr to the mix	18:03
gary_poster	and if that does not hang	18:03
gary_poster	then we only have the two options that I mentioned above as the possible causes, afaik	18:04
gary_poster	I ended up only hacking my .testr.conf for what I wanted	18:04
gary_poster	and then running testr run	18:05
gary_poster	but to try and dupe the eight-way parallel run...I'm not sure how to do that, except to merely force my two-core machine to be treated as an eight-core machine by testr	18:06
gary_poster	which does not necessarily use the same test divisions	18:06
benji	yep, I agree with your evaluation of what the different outcomes suggest	18:06
gary_poster	and also demands more RAM than I have, according to the experience I had yesterday	18:06
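An illustration of the "not necessarily the same test divisions" point above: a minimal sketch, assuming a plain round-robin split. This is not claimed to be how testr actually divides tests; `partition_round_robin` and the test ids are hypothetical.

```python
def partition_round_robin(test_ids, workers):
    """Deal test ids out to `workers` buckets in turn."""
    buckets = [[] for _ in range(workers)]
    for index, test_id in enumerate(test_ids):
        buckets[index % workers].append(test_id)
    return buckets


test_ids = ["test_%02d" % number for number in range(16)]
two_way = partition_round_robin(test_ids, 2)
eight_way = partition_round_robin(test_ids, 8)

# The tests that share a process with test_00 differ between the two splits,
# so an isolation problem between neighbours may show up in one and not the other.
neighbours_two = set(two_way[0]) - {"test_00"}
neighbours_eight = set(eight_way[0]) - {"test_00"}
```

Under any splitting scheme, changing the worker count changes which tests run together and in what order, which is why a local eight-way run may not reproduce what the eight-core machine saw.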
gary_poster	tests are still running here	18:06
gary_poster	the new run is only about 20 minutes old	18:07
benji	k	18:07
gary_poster	so we may need to discuss how to instrument the ec2 machine	18:08
gary_poster	while we are waiting for the test results here	18:08
gary_poster	they might inform any result really	18:08
gary_poster	you mentioned the signal handler, and the debug build	18:08
gary_poster	I like the signal handler better than the debug build, because it changes less	18:09
gary_poster	and yet might still give us what we need.	18:09
gary_poster	(it almost seems like something that one always ought to run with)	18:09
gary_poster	we could also try that gdb hook trick	18:09
gary_poster	that lets you get into a Python process	18:10
frankban	gary_poster: EOD, my ec2 test run is still going, do you want me to leave those instances up?	18:11
gary_poster	frankban, ack. benji, I think he can kill them. what do you think?	18:11
benji	gary_poster, frankban: yeah I say kill them; I don't think we'll need them.	18:13
gary_poster	frankban, thank you. Have a great evening.	18:13
frankban	gary_poster, benji: ok, have a nice evening	18:13
benji	same to you, frankban	18:13
gary_poster	ah right, we can just sudo apt-get install python-dbg to get the debug build, can't we	18:14
benji	gary_poster: this looks like what we're looking for http://pypi.python.org/pypi/faulthandler/	18:14
benji	gary_poster: I believe so.	18:14
benji	gary_poster: I think your point about holding off on using the debug build is a good one	18:15
gary_poster	benji, nice package. the only thing that strikes me that might bite us there is testr/subunit eating things...	18:16
gary_poster	it might still work	18:16
gary_poster	but the dance would be this:	18:16
benji	there is an option for making it write to a file	18:17
gary_poster	send signal to Python process	18:17
gary_poster	ah!	18:17
gary_poster	mhm...we would need access to the ephemeral container	18:17
benji	register(signum, file=sys.stderr, all_threads=False, chain=False)	18:17
benji	we can use the console for that	18:18
gary_poster	in order to look at the file	18:18
benji	since they hang, they don't go away :)	18:18
gary_poster	as long as we do the root passwd/shadow file trick yeah	18:18
benji	yep	18:18
gary_poster	need to remember to do that first	18:18
gary_poster	tests are still rolling along here	18:18
gary_poster	ok...at 2:19, if I start a machine and get it initialized it would be ready by 3:20 ish	18:19
benji	actually, don't we have access to the "upper" directory? we can do the shadow trick there if we need to (but doing it before launch would be easiest)	18:19
benji	gary_poster: I have a slave ready	18:19
gary_poster	benji, ooh, 8 core?	18:19
benji	gary_poster: darn, no	18:19
gary_poster	:-/	18:20
gary_poster	benji, want me to start one, or you?	18:20
gary_poster	it may be trickier than I want it to be	18:21
gary_poster	because of the new juju changes announced last night	18:21
benji	gary_poster: have at it	18:21
gary_poster	I was relying on something they just ripped out and replaced	18:21
gary_poster	ok	18:21
benji	actually, there has been some discussion about not ripping out the old options but leaving them for backward compatibility	18:22
gary_poster	yeah I saw that	18:22
gary_poster	it makes sense to me not even for backwards compatibility but for setting defaults	18:23
gary_poster	ok, slave is starting. tests are still running locally. stepping away.	18:27
bac	hey gary_poster, can we have a quick chat to see if i can lend you a hand?	18:28
gary_poster	bac, hey, 1 sec	18:28
bac	gary_poster: np, i'll grab some tea	18:28
gary_poster	cool	18:28
gary_poster	bac, I await you in the horde (https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde_	18:35
gary_poster	https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde	18:35
gary_poster	benji, I have a call at 4 with Francis, btw, so you will need to take over	18:37
gary_poster	then	18:37
benji	ok	18:37
gary_poster	benji, http://ec2-174-129-101-121.compute-1.amazonaws.com:8010/waterfall	19:38
gary_poster	benji, I have added root passwd	19:38
gary_poster	benji, so I should start a test run	19:40
benji	gary_poster: sounds good	19:40
benji	after that we'll just be watching it, right?	19:40
gary_poster	benji, now I will add you to ssh	19:40
gary_poster	yeah	19:40
gary_poster	this may have been foolish :-(	19:40
benji	how so?	19:41
gary_poster	if yesterday is any indication, hang is in > 2 hours	19:41
gary_poster	past both of our EoDs	19:41
gary_poster	maybe one of the cores will hang sooner	19:42
gary_poster	benji, oh argh	19:42
gary_poster	should I not have installed the package, or at least python-dbg, before starting a test?	19:42
benji	gary_poster: we need to install faulthandler and register a handler for USR1 that will write all threads' stack traces to a file; oh and tweak /etc/shadow	19:44
benji	the "register a handler" bit might be interesting, probably hacking bin/test would be the easiest way	19:44
benji	oh, and the package needs to be installed in the container	19:45
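A minimal sketch of the "register a handler for USR1 that writes all threads' stack traces to a file" step. The discussion concerns the faulthandler package from PyPI; this sketch assumes the standard-library `faulthandler` module of Python 3.3 and later, whose `register` call has the same signature quoted earlier in the log. The helper name `install_stack_dump_handler` and the log path are hypothetical, and `register` is not available on Windows.

```python
import faulthandler
import signal


def install_stack_dump_handler(path, signum=signal.SIGUSR1):
    """Dump every thread's stack to `path` whenever `signum` arrives.

    faulthandler writes straight to the file descriptor, so the file object
    must stay open for as long as the handler is registered; it is returned
    so the caller can keep a reference to it.
    """
    log_file = open(path, "a")
    faulthandler.register(signum, file=log_file, all_threads=True)
    return log_file
```

With this installed early in a process, a hung worker could be inspected by sending it the signal (for example `kill -USR1 <pid>` from a shell) and then reading the file. Because the dump is done by a C-level handler, it should still work when the Python code itself is stuck.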
benji	gary_poster: you may be right, we might be too late in the day to do this right; maybe we should scrub	19:45
gary_poster	benji, I already changed /etc/shadow in root and added you to authorized keys. try ubuntu@ec2-174-129-101-121.compute-1.amazonaws.com	19:45
gary_poster	or scrub ;-)	19:45
benji	gary_poster: Permission denied (publickey).	19:46
gary_poster	benji, one thing you could do is make an LP branch that has the package installed	19:46
gary_poster	for tomorrow	19:46
gary_poster	I would suggest actually hooking it in at a different location	19:46
gary_poster	benji, try adding it to lib/lp_sitecustomize.py	19:47
gary_poster	that will register it for every LP process	19:47
gary_poster	which is what we want I think	19:47
gary_poster	and is easy to do	19:48
gary_poster	If I were going to do it today I would actually use python-dbg	19:48
gary_poster	because that's simpler	19:48
gary_poster	I just kill the test	19:48
gary_poster	install -dbg	19:48
gary_poster	and restart	19:48
gary_poster	when there is a hang, give it a whirl with gdb	19:49
benji	gary_poster: I'd really like to finish at least a section of my review today, so I prefer the option of killing the slave and I make a branch and tweak lp_sitecustomize	19:50
gary_poster	cool	19:51
gary_poster	I wonder what I did wrong with authorized_keys...	19:51
gary_poster	benji, for future reference, I simply added your key (the line from https://launchpad.net/~benji/+sshkeys) to /home/ubuntu/.ssh/authorized_keys . Did I need to do anything else?	19:52
gary_poster	Maybe ec2 security thing...	19:53
benji	gary_poster: I would have thought that would work	19:54
benji	I'm pretty sure I've done just that in the past	19:54
gary_poster	yeah, already set to allow 22 through in ec2	19:56
gary_poster	for everyone	19:56
gary_poster	benji, still watching while waiting for call. does top lie? this is what it is saying	20:05
gary_poster	Tasks: 505 total, 1 running, 504 sleeping, 0 stopped, 0 zombie	20:05
gary_poster	Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st	20:05
gary_poster	this is while I have eight parallel tests going	20:05
benji	gary_poster: I haven't seen top lie lately. (I do remember on some old red hat machines ZC once had...)	20:05
gary_poster	some tests are still running	20:07
gary_poster	just not very fast	20:07