/srv/irclogs.ubuntu.com/2012/03/20/#launchpad-yellow.txt

frankbangmb: good morning, I am starting juju charms to run tests09:48
gmbfrankban, Morning. Cool. I'm just looking at Gary's email now...09:49
frankbangmb: select/read hangs, scary09:49
gmbYeah :/09:49
frankbansi\09:53
frankbanops09:53
* bac \o11:36
gmbWow. Shows how long it's been since I did any LP development work. My env is completely broken.11:36
bacgmb: no kidding.  and it seems things are moving around a lot over the last few weeks.11:37
frankbangary_poster: good morning, now I have a better understanding of what's going on with testrepository12:06
gary_posterfrankban, awesome.  good afternoon.12:07
gary_posterwhat does that mean practically for our issues?12:08
frankbangary_poster: the debian control file in http://bazaar.launchpad.net/~testrepository/debian/sid/python-testrepository/sid/view/head:/debian/control (the one we use) is different from the one in ubuntu upstream: http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/testrepository/precise/view/head:/debian/control12:08
gary_posterahhh!12:09
gary_posterso if we switch, all will be well? easy?12:09
gary_posterneed to step away; back soon.12:10
frankbangary_poster: yes, but I have a question: my branch patches testrepository trunk. So our ppa installs the trunk revision (maybe we could use that to see if https://bugs.launchpad.net/testrepository/+bug/775214 is fixed). However, no problem for me to update the ppa to use a patched upstream release.12:13
_mup_Bug #775214: On python 2.7: String or Integer object expected for key, unicode found <Testrepository:Fix Committed by lifeless> < https://launchpad.net/bugs/775214 >12:13
* frankban lunch12:13
frankbangary_poster: buildbot tests are currently running on ec2, access granted for you and benji (user: ubuntu)12:19
frankbanzk: ec2-107-20-6-115.compute-1.amazonaws.com12:19
frankbanmaster: ec2-50-16-78-27.compute-1.amazonaws.com12:20
frankbanslave: ec2-23-20-98-182.compute-1.amazonaws.com12:20
* frankban really lunches12:20
benjiprecise... <sigh>12:23
* benji goes to report a bug on precise12:23
gary_posterfrankban, ec2: great thank you12:35
gary_posterfrankban, testrepository: I *guess*...if we can easily fix the subunit dependency issue (and I expect we can) then using trunk would be good12:36
gary_posterthat makes me a bit nervous12:37
gary_posterbut it is probably good for the long run and hopefully ok for the short run12:37
gary_posterI would guess we would have to make our own branch of the ubuntu debian bits12:39
* gary_poster thinks aptitude rocks12:55
benjiI've considered starting to use aptitude, and there have been a couple of times I wish I already had.13:02
gary_posterbac benji frankban gmb call in 2 or sooner13:09
gmbA tumbleweed rolls through goldenhorde13:11
gary_posterhttps://docs.google.com/a/canonical.com/document/d/19Zn7fGkQH5oOpJkaU2lGCpt8RK5KiDpPBTKJKs50wWw/edit13:26
gary_posterbenji, I'm going to restart post update, then I'd like to discuss strategy for tackling the test hangs when you have a moment13:48
benjigary_poster: sure13:48
gary_posterthanks13:48
gary_posterOK, benji.  https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde when you get a chance. camera appears to be working post-update14:04
gary_posterbenji, fwiw, this is the command I am running (no confirmation yet that it is working):14:51
gary_posterxvfb-run --error-file=/var/tmp/xvfb-errors.log --server-args='-screen 0 1024x768x24' -a /home/gary/launchpad/lp/sandbox/bin/test --subunit --load-list /home/gary/temp/tmp0d4ZXs14:51
benjik14:51
gary_posteryeah it seems to be working14:51
gary_posterI probably should have teed the output14:52
gary_posterbut I didn't14:52
benjigood point, I'll tee mine14:53
bacsorry about the email churn with failed PPA builds.  i'm trying to get the packaging to work with the new name spelling.  hopefully the next one will work.15:02
gary_posterthanks, np15:03
* gary_poster needs to go babysit and such. biab15:03
gary_posterbenji, are you getting a lot of "SilentLaunchpadScriptFailure: 1" errors?15:22
benjigary_poster: nope15:22
gary_posterk15:22
gary_postera lot of failures on my side.  going away again15:23
benjigary_poster: I started later than you did, so I might not be there yet.15:23
gary_postermaybe so15:23
gary_posterhm15:23
gary_postersaw schema related error15:24
gary_postergoing to stop, make schema, and retry15:24
benjiyeah, I did a pull and make schema before my run, just in case15:30
gary_posterok, restarted on new ephemeral with changes made15:32
gary_posternow really babysitting :-P15:33
gary_posterargh15:33
gary_posterfell over15:33
benji:(15:34
gary_posterdumb mistake15:34
gary_posterretrying15:34
gary_posterthere we go15:34
gary_posterok, now leaving :-)15:34
frankbanbenji: http://ec2-50-16-78-27.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/1/steps/shell_8/logs/stdio15:35
bac"baby sitting" + "fell over" is not good15:36
benjifrankban: what am I looking for?15:36
frankbanthe results of a parallel test run15:36
frankbanbenji: and it's doing another run...15:37
frankbanhave you started it?15:37
benjifrankban: have I started what?  Another run? no.  That was probably triggered by a commit.15:38
frankbanbenji: ah... ok15:38
benjifrankban: I'm still not sure what you would like for me to notice about the output.  That it finished without hanging, perhaps?15:39
frankbanbenji: yes, and only 4 failures... is the hang happening only using 8 cores?15:41
benjigary_poster: ah, ok.  Nope, we've seen a hang with just two, so the fact that you didn't get one is interesting.15:41
benjifrankban: note that the xx-bug.txt failure is a known issue in the trunk, the production buildbot reported the same failure a few hours ago15:42
frankbanbenji: yes, I've seen15:44
gary_posteractually, benji, frankban, I have not seen a hang lately with two cores16:10
gary_posteronly 4 failures is great16:10
gary_posterbenji, do you have failures on your run?  I definitely do16:11
gary_posterbenji, maybe worth noting is that testrepository had not reported any errors.16:18
gary_posterI wonder if this is some kind of "buffer filling too fast" problem16:19
gary_postertriggered by having so many errors16:19
gary_posterI'm not sure how many errors I'm going to end up with on this run, but "a lot" looks like a rough guess16:20
gary_posterbenji, no hangs for me.  trying to figure out a quick way to get results of run16:32
gary_poster"subunit2pyunit < testoutput.txt" yields a fairly confusing result: only one error?16:38
gary_posterbenji, ok, yeah, I'm confused.  I thought I saw a lot of errors flying by, but now when I look at the teed document, I see very few tracebacks.  The only error I get from the command above is one for an issue that subunit itself seems to show as...successful?16:45
gary_postertest: lib/lp/app/javascript/overlay/tests/test_overlay.html16:45
gary_postertest: Could not communicate with subprocess16:45
gary_postertags: zope:error_with_banner16:45
gary_postersuccessful: Could not communicate with subprocess16:45
gary_poster...riiiiight...16:46
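[Editor's sketch] One way to cross-check what subunit2pyunit reports is to tally outcome lines in the teed stream directly. The following is a minimal sketch, assuming the teed file is a subunit v1 text stream in which each result is a line of the form "keyword: test name" and failure details sit between a trailing " [" and a line containing only "]". The function name tally_subunit_v1 is hypothetical and not part of subunit.

```python
from collections import Counter

# Outcome keywords of the subunit v1 text protocol, folded to one label each.
OUTCOMES = {
    "success": "success",
    "successful": "success",
    "failure": "failure",
    "error": "error",
    "skip": "skip",
    "xfail": "xfail",
    "uxsuccess": "uxsuccess",
}


def tally_subunit_v1(lines):
    """Count test outcomes in an iterable of subunit v1 text lines.

    Assumption: only the "keyword: name" spelling is handled; lines such as
    "test:", "tags:" and "time:" are ignored.
    """
    counts = Counter()
    in_details = False
    for raw in lines:
        line = raw.rstrip("\n")
        if in_details:
            # Traceback text can look like anything, so skip it until "]".
            if line == "]":
                in_details = False
            continue
        keyword, sep, rest = line.partition(": ")
        if not sep or keyword not in OUTCOMES:
            continue
        counts[OUTCOMES[keyword]] += 1
        # A trailing " [" (or " [ multipart") opens a details block.
        if rest.endswith(" [") or rest.endswith(" [ multipart"):
            in_details = True
    return counts
```

Run over the teed file with something like tally_subunit_v1(open("testoutput.txt")), this gives a count per outcome that can be compared with what subunit2pyunit prints. A "successful:" line for "Could not communicate with subprocess" would be counted as a success here too, which is the oddity noted above.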
benjigary_poster: was eating lunch; reading backlog now16:49
benjigary_poster: I have no failures in my non-ephemeral run16:52
benjimy run took just under an hour and had no errors or failures at all16:53
benjiI'm going to start another in an ephemeral container and see what that does16:53
gary_posterbenji, interesting16:55
gary_posterbenji, so, maybe my "I have tons of errors" was confused by the fact that I ended up searching into the previous run.  not sure.  in any case, the only issue I see in the tee'd file is the one I gave above.  So I'm wondering what to do now, since I was unable to dupe.  I was considering hacking testr to only start one process, and to include the --load-list that we are using, and see how that goes.  Thoughts?17:05
benjigary_poster: so the intent of your hack would be to run in a normal environment, but serialize instead of parallelize in order to see if we get failures or not, right?17:06
gary_posterbenji, not exactly.  The intent would be to run a single process/container of what the eight core machine did, but exactly as it did.  Specifically, I'm going to hack testr to make it think I only have one core (which will mean that it will run all the tests it is supposed to run in a single ephemeral lxc container); *and* I'll include --load-list=/home/gary/temp/tmp0... when I start testr, so only those tests are run17:09
gary_posterIf I succeed in triggering a hang, I at least have a recipe for triggering it locally.  If I do not succeed, then it implies that not only does testr need to run those tests in an ephemeral lxc container, but also they must be in parallel; *or* my machine is sufficiently different from the ec2 machine that it doesn't trigger.17:11
gmbgary_poster, So, I've lost a bunch of time this afternoon to my lp setup being hideously broken. I've now rebuilt it. Do you have any guidance for me re: bug 609986?17:16
_mup_Bug #609986: layer setup failures don't output a failure message (they spew straight to console) <lp-foundations> <paralleltest> <Launchpad itself:Triaged> < https://launchpad.net/bugs/609986 >17:16
gary_posteractually maybe I don't have to hack testr to not run in parallel; just don't use --parallel17:16
gary_postergmb sure, lemme get that back in my head.  want to hang out for just a bit?17:16
gmbgary_poster, Sure. Let me get Firefox running17:17
gary_posterk17:17
gmbAh, crap, updates..17:18
gary_postergmb, https://code.launchpad.net/~launchpad/zope.testing/3.9.4-p5 is something to talk about when you are ready17:19
* gmb looks17:19
gmbgary_poster, goldenhorde?17:19
gary_postergmb, yeah17:19
gmbk17:19
benjigary_poster: the ephemeral run completed with one failure: lp.services.job.tests.test_runner.TestTwistedJobRunner.test_memory_hog_job17:56
gary_posterbenji, I got that one in my testr run so far18:02
gary_posterso, benji, we have an apparently intermittent test isolation error...18:03
gary_posterand we are unable to trigger the hang with merely an lxc or an ephemeral lxc.18:03
gary_posterI'm now adding testr to the mix18:03
gary_posterand if that does not hang18:03
gary_posterthen we only have the two options that I mentioned above as the possible causes, afaik18:04
gary_posterI ended up only hacking my .testr.conf for what I wanted18:04
gary_posterand then running testr run18:05
gary_posterbut to try and dupe the eight-way parallel run...I'm not sure how to do that, except to merely force my two-core machine to be treated as an eight-core machine by testr18:06
gary_posterwhich does not necessarily use the same test divisions18:06
benjiyep, I agree with your evaluation of what the different outcomes suggest18:06
gary_posterand also demands more RAM than I have, according to the experience I had yesterday18:06
gary_postertests are still running here18:06
gary_posterthe new run is only about 20 minutes old18:07
benjik18:07
gary_posterso we may need to discuss how to instrument the ec2 machine18:08
gary_posterwhile we are waiting for the test results here18:08
gary_posterthey might inform any result really18:08
gary_posteryou mentioned the signal handler, and the debug build18:08
gary_posterI like the signal handler better than the debug build, because it changes less18:09
gary_posterand yet might still give us what we need.18:09
gary_poster(it almost seems like something that one always ought to run with)18:09
gary_posterwe could also try that gdb hook trick18:09
gary_posterthat lets you get into a Python process18:10
frankbangary_poster: EOD, my ec2 test run is still going, do you want me to leave those instances up?18:11
gary_posterfrankban, ack.  benji, I think he can kill them.  what do you think?18:11
benjigary_poster, frankban: yeah I say kill them; I don't think we'll need them.18:13
gary_posterfrankban, thank you.  Have a great evening.18:13
frankbangary_poster, benji: ok, have a nice evening18:13
benjisame to you, frankban18:13
gary_posterah right, we can just sudo apt-get install python-dbg to get the debug build, can't we18:14
benjigary_poster: this looks like what we're looking for http://pypi.python.org/pypi/faulthandler/18:14
benjigary_poster: I believe so.18:14
benjigary_poster: I think your point about holding off on using the debug build is a good one18:15
gary_posterbenji, nice package.  the only thing that strikes me that might bite us there is testr/subunit eating things...18:16
gary_posterit might still work18:16
gary_posterbut the dance would be this:18:16
benjithere is an option for making it write to a file18:17
gary_postersend signal to Python process18:17
gary_posterah!18:17
gary_postermhm...we would need access to the ephemeral container18:17
benjiregister(signum, file=sys.stderr, all_threads=False, chain=False)18:17
benjiwe can use the console for that18:18
gary_posterin order to look at the file18:18
benjisince they hang, they don't go away :)18:18
gary_posteras long as we do the root passwd/shadow file trick yeah18:18
benjiyep18:18
gary_posterneed to remember to do that first18:18
gary_postertests are still rolling along here18:18
gary_posterok...at 2:19, if I start a machine and get it initialized it would be ready by 3:20 ish18:19
benjiactually, don't we have access to the "upper" directory?  we can do the shadow trick there if we need to (but doing it before launch would be easiest)18:19
benjigary_poster: I have a slave ready18:19
gary_posterbenji, ooh, 8 core?18:19
benjigary_poster: darn, no18:19
gary_poster:-/18:20
gary_posterbenji, want me to start one, or you?18:20
gary_posterit may be trickier than I want it to be18:21
gary_posterbecause of the new juju changes announced last night18:21
benjigary_poster: have at it18:21
gary_posterI was relying on something they just ripped out and replaced18:21
gary_posterok18:21
benjiactually, there has been some discussion about not ripping out the old options but leaving them for backward compatibility18:22
gary_posteryeah I saw that18:22
gary_posterit makes sense to me not even for backwards compatibility but for setting defaults18:23
gary_posterok, slave is starting.  tests are still running locally.  stepping away.18:27
bachey gary_poster, can we have a quick chat to see if i can lend you a hand?18:28
gary_posterbac, hey, 1 sec18:28
bacgary_poster: np, i'll grab some tea18:28
gary_postercool18:28
gary_posterbac, I await you in the horde (https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde)18:35
gary_posterbenji, I have a call at 4 with Francis, btw, so you will need to take over18:37
gary_posterthen18:37
benjiok18:37
gary_posterbenji, http://ec2-174-129-101-121.compute-1.amazonaws.com:8010/waterfall19:38
gary_posterbenji, I have added root passwd19:38
gary_posterbenji, so I should start a test run19:40
benjigary_poster: sounds good19:40
benjiafter that we'll just be watching it, right?19:40
gary_posterbenji, now I will add you to ssh19:40
gary_posteryeah19:40
gary_posterthis may have been foolish :-(19:40
benjihow so?19:41
gary_posterif yesterday is any indication, hang is in > 2 hours19:41
gary_posterpast both of our EoDs19:41
gary_postermaybe one of the cores will hang sooner19:42
gary_posterbenji, oh argh19:42
gary_postershould I not have installed the package, or at least python-dbg, before starting a test?19:42
benjigary_poster: we need to install faulthandler and register a handler for USR1 that will write all threads' stack traces to a file; oh and tweak /etc/shadow19:44
benjithe "register a handler" bit might be interesting, probably hacking bin/test would be the easiest way19:44
benjioh, and the package needs to be installed in the container19:45
benjigary_poster: you may be right, we might be too late in the day to do this right; maybe we should scrub19:45
gary_posterbenji, I already changed /etc/shadow in root and added you to authorized keys.  try ubuntu@ec2-174-129-101-121.compute-1.amazonaws.com19:45
gary_posteror scrub ;-)19:45
benjigary_poster: Permission denied (publickey).19:46
gary_posterbenji, one thing you could do is make an LP branch that has the package installed19:46
gary_posterfor tomorrow19:46
gary_posterI would suggest actually hooking it in at a different location19:46
gary_posterbenji, try adding it to lib/lp_sitecustomize.py19:47
gary_posterthat will register it for every LP process19:47
gary_posterwhich is what we want I think19:47
gary_posterand is easy to do19:48
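[Editor's sketch] The hook discussed above could look roughly like this. It is a minimal sketch, assuming Python 3, where faulthandler ships in the standard library with the same register() call benji quoted (the log itself refers to the PyPI backport). The function name install_stack_dumper and the path /var/tmp/lp-stacks.log are hypothetical, and where exactly this would go in lib/lp_sitecustomize.py is not shown in the log.

```python
import faulthandler
import signal


def install_stack_dumper(path, signum=signal.SIGUSR1):
    """Dump the stack of every thread to `path` whenever `signum` arrives.

    The file is returned and must stay open for the life of the process:
    faulthandler keeps only its file descriptor and writes to it directly
    from the signal handler.
    """
    dump_file = open(path, "a")
    # all_threads=True asks for every thread's stack, not just the current one.
    faulthandler.register(signum, file=dump_file, all_threads=True)
    return dump_file
```

Calling install_stack_dumper("/var/tmp/lp-stacks.log") once at process start, then running kill -USR1 <pid> against a hung test process, should append the stack of every thread to that file. Writing to a file rather than stderr sidesteps the concern about testr/subunit swallowing the output, at the cost of needing access to the container to read it.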
gary_posterIf I were going to do it today I would actually use python-dbg19:48
gary_posterbecause that's simpler19:48
gary_posterI just kill the test19:48
gary_posterinstall -dbg19:48
gary_posterand restart19:48
gary_posterwhen there is a hang, give it a whirl with gdb19:49
benjigary_poster: I'd really like to finish at least a section of my review today, so I prefer the option of killing the slave; I'll make a branch and tweak lp_sitecustomize19:50
gary_postercool19:51
gary_posterI wonder what I did wrong with authorized_keys...19:51
gary_posterbenji, for future reference, I simply added your key (the line from https://launchpad.net/~benji/+sshkeys) to /home/ubuntu/.ssh/authorized_keys .  Did I need to do anything else?19:52
gary_posterMaybe ec2 security thing...19:53
benjigary_poster: I would have thought that would work19:54
benjiI'm pretty sure I've done just that in the past19:54
gary_posteryeah, already set to allow 22 through in ec219:56
gary_posterfor everyone19:56
gary_posterbenji, still watching while waiting for call.  does top lie?  this is what it is saying20:05
gary_posterTasks: 505 total,   1 running, 504 sleeping,   0 stopped,   0 zombie20:05
gary_posterCpu(s):  0.1%us,  0.1%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st20:05
gary_posterthis is while I have eight parallel tests going20:05
benjigary_poster: I haven't seen top lie lately.  (I do remember on some old red hat machines ZC once had...)20:05
gary_postersome tests are still running20:07
gary_posterjust not very fast20:07
