[09:48] <frankban> gmb: good morning, I am starting juju charms to run tests
[09:49] <gmb> frankban, Morning. Cool. I'm just looking at Gary's email now...
[09:49] <frankban> gmb: select/read hangs, scary
[09:49] <gmb> Yeah :/
[09:53] <frankban> si\
[09:53] <frankban> oops
[11:36]  * bac \o
[11:36] <gmb> Wow. Shows how long it's been since I did any LP development work. My env is completely broken.
[11:37] <bac> gmb: no kidding.  and it seems things have been moving around a lot over the last few weeks.
[12:06] <frankban> gary_poster: good morning, now I have a better understanding of what's going on with testrepository
[12:07] <gary_poster> frankban, awesome.  good afternoon.
[12:08] <gary_poster> what does that mean practically for our issues?
[12:08] <frankban> gary_poster: the debian control file in http://bazaar.launchpad.net/~testrepository/debian/sid/python-testrepository/sid/view/head:/debian/control (the one we use) is different from the one in ubuntu upstream: http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/testrepository/precise/view/head:/debian/control
[12:09] <gary_poster> ahhh!
[12:09] <gary_poster> so if we switch, all will be well? easy?
[12:10] <gary_poster> need to step away; back soon.
[12:13] <frankban> gary_poster: yes, but I have a question: my branch patches testrepository trunk, so our ppa installs the trunk revision (maybe we could use that to see if https://bugs.launchpad.net/testrepository/+bug/775214 is fixed). However, it's no problem for me to update the ppa to use a patched upstream release.
[12:13] <_mup_> Bug #775214: On python 2.7: String or Integer object expected for key, unicode found <Testrepository:Fix Committed by lifeless> < https://launchpad.net/bugs/775214 >
[12:13]  * frankban lunch
[12:19] <frankban> gary_poster: buildbot tests are currently running on ec2, access granted for you and benji (user: ubuntu)
[12:19] <frankban> zk: ec2-107-20-6-115.compute-1.amazonaws.com
[12:20] <frankban> master: ec2-50-16-78-27.compute-1.amazonaws.com
[12:20] <frankban> slave: ec2-23-20-98-182.compute-1.amazonaws.com
[12:20]  * frankban really lunches
[12:23] <benji> precise... <sigh>
[12:23]  * benji goes to report a bug on precise
[12:35] <gary_poster> frankban, ec2: great thank you
[12:36] <gary_poster> frankban, testrepository: I *guess*...if we can easily fix the subunit dependency issue (and I expect we can) then using trunk would be good
[12:37] <gary_poster> that makes me a bit nervous
[12:37] <gary_poster> but it is probably good for the long run and hopefully ok for the short run
[12:39] <gary_poster> I would guess we would have to make our own branch of the ubuntu debian bits
[12:55]  * gary_poster thinks aptitude rocks
[13:02] <benji> I've considered starting to use aptitude, and there have been a couple of times I wish I already had.
[13:09] <gary_poster> bac benji frankban gmb call in 2 or sooner
[13:11] <gmb> A tumbleweed rolls through goldenhorde
[13:26] <gary_poster> https://docs.google.com/a/canonical.com/document/d/19Zn7fGkQH5oOpJkaU2lGCpt8RK5KiDpPBTKJKs50wWw/edit
[13:48] <gary_poster> benji, I'm going to restart post update, then I'd like to discuss strategy for tackling the test hangs when you have a moment
[13:48] <benji> gary_poster: sure
[13:48] <gary_poster> thanks
[14:04] <gary_poster> OK, benji.  https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde when you get a chance. camera appears to be working post-update
[14:51] <gary_poster> benji, fwiw, this is the command I am running (no confirmation yet that it is working):
[14:51] <gary_poster> xvfb-run --error-file=/var/tmp/xvfb-errors.log --server-args='-screen 0 1024x768x24' -a /home/gary/launchpad/lp/sandbox/bin/test --subunit --load-list /home/gary/temp/tmp0d4ZXs
[14:51] <benji> k
[14:51] <gary_poster> yeah it seems to be working
[14:52] <gary_poster> I probably should have teed the output
[14:52] <gary_poster> but I didn't
[14:53] <benji> good point, I'll tee mine
[15:02] <bac> sorry about the email churn with failed PPA builds.  i'm trying to get the packaging to work with the new name spelling.  hopefully the next one will work.
[15:03] <gary_poster> thanks, np
[15:03]  * gary_poster needs to go babysit and such.  biab
[15:22] <gary_poster> benji, are you getting a lot of "SilentLaunchpadScriptFailure: 1" errors?
[15:22] <benji> gary_poster: nope
[15:22] <gary_poster> k
[15:23] <gary_poster> a lot of failures on my side.  going away again
[15:23] <benji> gary_poster: I started later than you did, so I might not be there yet.
[15:23] <gary_poster> maybe so
[15:23] <gary_poster> hm
[15:24] <gary_poster> saw schema related error
[15:24] <gary_poster> going to stop, make schema, and retry
[15:30] <benji> yeah, I did a pull and make schema before my run, just in case
[15:32] <gary_poster> ok, restarted on new ephemeral with changes made
[15:33] <gary_poster> now really babysitting :-P
[15:33] <gary_poster> argh
[15:33] <gary_poster> fell over
[15:34] <benji> :(
[15:34] <gary_poster> dumb mistake
[15:34] <gary_poster> retrying
[15:34] <gary_poster> there we go
[15:34] <gary_poster> ok, now leaving :-)
[15:35] <frankban> benji: http://ec2-50-16-78-27.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/1/steps/shell_8/logs/stdio
[15:36] <bac> "baby sitting" + "fell over" is not good
[15:36] <benji> frankban: what am I looking for?
[15:36] <frankban> the results of a parallel test run
[15:37] <frankban> benji: and it's doing another run...
[15:37] <frankban> have you started it?
[15:38] <benji> frankban: have I started what?  Another run? no.  That was probably triggered by a commit.
[15:38] <frankban> benji: ah... ok
[15:39] <benji> frankban: I'm still not sure what you would like for me to notice about the output.  That it finished without hanging, perhaps?
[15:41] <frankban> benji: yes, and only 4 failures... is the hang happening only using 8 cores?
[15:41] <benji> frankban: ah, ok.  Nope, we've seen a hang with just two, so the fact that you didn't get one is interesting.
[15:42] <benji> frankban: note that the xx-bug.txt failure is a known issue in the trunk, the production buildbot reported the same failure a few hours ago
[15:44] <frankban> benji: yes, I've seen it
[16:10] <gary_poster> actually, benji, frankban, I have not seen a hang lately with two cores
[16:10] <gary_poster> only 4 failures is great
[16:11] <gary_poster> benji, do you have failures on your run?  I definitely do
[16:18] <gary_poster> benji, maybe worth noting is that testrepository had not reported any errors.
[16:19] <gary_poster> I wonder if this is some kind of "buffer filling too fast" problem
[16:19] <gary_poster> triggered by having so many errors
[16:20] <gary_poster> I'm not sure how many errors I'm going to end up with on this run, but "a lot" looks like a rough guess
[16:32] <gary_poster> benji, no hangs for me.  trying to figure out a quick way to get results of run
[16:38] <gary_poster> "subunit2pyunit < testoutput.txt" yields a fairly confusing result: only one error?
[16:45] <gary_poster> benji, ok, yeah, I'm confused.  I thought I saw a lot of errors flying by, but now when I look at the teed document, I see very few tracebacks.  The only error I get from the command above is one for an issue that subunit itself seems to show as...successful?
[16:45] <gary_poster> test: lib/lp/app/javascript/overlay/tests/test_overlay.html
[16:45] <gary_poster> test: Could not communicate with subprocess
[16:45] <gary_poster> tags: zope:error_with_banner
[16:45] <gary_poster> successful: Could not communicate with subprocess
[16:46] <gary_poster> ...riiiiight...
[16:49] <benji> gary_poster: was eating lunch; reading backlog now
[16:52] <benji> gary_poster: I have no failures in my non-ephemeral run
[16:53] <benji> my run took just under an hour and had no errors or failures at all
[16:53] <benji> I'm going to start another in an ephemeral container and see what that does
[16:55] <gary_poster> benji, interesting
[17:05] <gary_poster> benji, so, maybe my "I have tons of errors" was confused by the fact that I ended up searching into the previous run.  not sure.  in any case, the only issue I see in the tee'd file is the one I gave above.  So I'm wondering what to do now, since I was unable to dupe.  I was considering hacking testr to only start one process, and to include the --load-list that we are using, and see how that goes.  Thoughts?
[17:06] <benji> gary_poster: so the intent of your hack would be to run in a normal environment, but serialize instead of parallelize in order to see if we get failures or not, right?
[17:09] <gary_poster> benji, not exactly.  The intent would be to run a single process/container of what the eight core machine did, but exactly as it did.  Specifically, I'm going to hack testr to make it think I only have one core (which will mean that it will run all the tests it is supposed to run in a single ephemeral lxc container); *and* I'll include --load-list=/home/gary/temp/tmp0... when I start testr, so only those tests are run
[17:11] <gary_poster> If I succeed in triggering a hang, I at least have a recipe for triggering it locally.  If I do not succeed, then it implies that not only does testr need to run those tests in an ephemeral lxc container, but also they must be in parallel; *or* my machine is sufficiently different from the ec2 machine that it doesn't trigger.
[17:16] <gmb> gary_poster, So, I've lost a bunch of time this afternoon to my lp setup being hideously broken. I've now rebuilt it. Do you have any guidance for me re: bug 609986?
[17:16] <_mup_> Bug #609986: layer setup failures don't output a failure message (they spew straight to console) <lp-foundations> <paralleltest> <Launchpad itself:Triaged> < https://launchpad.net/bugs/609986 >
[17:16] <gary_poster> actually maybe I don't have to hack testr to not run in parallel; just don't use --parallel
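(A sketch of the serialized run gary describes, reusing the load list from the 14:51 command; this assumes testr is invoked from the LP tree with its usual .testr.conf:)

    # run only the listed tests, without --parallel, so testr uses a
    # single worker (and hence a single ephemeral container)
    testr run --load-list /home/gary/temp/tmp0d4ZXs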
[17:16] <gary_poster> gmb sure, lemme get that back in my head.  want to hang out for just a bit?
[17:17] <gmb> gary_poster, Sure. Let me get Firefox running
[17:17] <gary_poster> k
[17:18] <gmb> Ah, crap, updates..
[17:19] <gary_poster> gmb, https://code.launchpad.net/~launchpad/zope.testing/3.9.4-p5 is something to talk about when you are ready
[17:19]  * gmb looks
[17:19] <gmb> gary_poster, goldenhorde?
[17:19] <gary_poster> gmb, yeah
[17:19] <gmb> k
[17:56] <benji> gary_poster: the ephemeral run completed with one failure: lp.services.job.tests.test_runner.TestTwistedJobRunner.test_memory_hog_job
[18:02] <gary_poster> benji, I got that one in my testr run so far
[18:03] <gary_poster> so, benji, we have an apparently intermittent test isolation error...
[18:03] <gary_poster> and we are unable to trigger the hang with merely an lxc or an ephemeral lxc.
[18:03] <gary_poster> I'm now adding testr to the mix
[18:03] <gary_poster> and if that does not hang
[18:04] <gary_poster> then we only have the two options that I mentioned above as the possible causes, afaik
[18:04] <gary_poster> I ended up only hacking my .testr.conf for what I wanted
[18:05] <gary_poster> and then running testr run
[18:06] <gary_poster> but to try and dupe the eight-way parallel run...I'm not sure how to do that, except to merely force my two-core machine to be treated as an eight-core machine by testr
[18:06] <gary_poster> which does not necessarily use the same test divisions
[18:06] <benji> yep, I agree with your evaluation of what the different outcomes suggest
[18:06] <gary_poster> and also demands more RAM than I have, according to the experience I had yesterday
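(Rather than hacking core detection, testr accepts an explicit worker count; a sketch, assuming the testr version in use supports the --concurrency option:)

    # pretend the two-core box has eight workers, mimicking the ec2 slave
    testr run --parallel --concurrency=8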
[18:06] <gary_poster> tests are still running here
[18:07] <gary_poster> the new run is only about 20 minutes old
[18:07] <benji> k
[18:08] <gary_poster> so we may need to discuss how to instrument the ec2 machine
[18:08] <gary_poster> while we are waiting for the test results here
[18:08] <gary_poster> they might inform any result really
[18:08] <gary_poster> you mentioned the signal handler, and the debug build
[18:09] <gary_poster> I like the signal handler better than the debug build, because it changes less
[18:09] <gary_poster> and yet might still give us what we need.
[18:09] <gary_poster> (it almost seems like something that one always ought to run with)
[18:09] <gary_poster> we could also try that gdb hook trick
[18:10] <gary_poster> that lets you get into a Python process
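(The gdb trick, sketched; it assumes the python debug symbols and gdb helpers discussed below are installed, and <pid> is a placeholder for the hung test process:)

    sudo gdb -p <pid>              # attach to the hung Python process
    (gdb) py-bt                    # Python-level traceback, current thread
    (gdb) thread apply all py-bt   # the same for every thread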
[18:11] <frankban> gary_poster: EOD, my ec2 test run is still going, do you want me to leave those instances up?
[18:11] <gary_poster> frankban, ack.  benji, I think he can kill them.  what do you think?
[18:13] <benji> gary_poster, frankban: yeah I say kill them; I don't think we'll need them.
[18:13] <gary_poster> frankban, thank you.  Have a great evening.
[18:13] <frankban> gary_poster, benji: ok, have a nice evening
[18:13] <benji> same to you, frankban
[18:14] <gary_poster> ah right, we can just sudo apt-get install python-dbg to get the debug build, can't we
[18:14] <benji> gary_poster: this looks like what we're looking for http://pypi.python.org/pypi/faulthandler/
[18:14] <benji> gary_poster: I believe so.
[18:15] <benji> gary_poster: I think your point about holding off on using the debug build is a good one
[18:16] <gary_poster> benji, nice package.  the only thing that strikes me that might bite us there is testr/subunit eating things...
[18:16] <gary_poster> it might still work
[18:16] <gary_poster> but the dance would be this:
[18:17] <benji> there is an option for making it write to a file
[18:17] <gary_poster> send signal to Python process
[18:17] <gary_poster> ah!
[18:17] <gary_poster> mhm...we would need access to the ephemeral container
[18:17] <benji> register(signum, file=sys.stderr, all_threads=False, chain=False)
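(A minimal sketch of wiring that up with the register() call benji quotes; the log path is hypothetical:)

    import signal
    import faulthandler

    # on SIGUSR1, dump every thread's traceback to a file we can
    # retrieve from the container afterwards
    tracebacks = open('/var/tmp/tracebacks.log', 'w')
    faulthandler.register(signal.SIGUSR1, file=tracebacks, all_threads=True)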
[18:18] <benji> we can use the console for that
[18:18] <gary_poster> in order to look at the file
[18:18] <benji> since they hang, they don't go away :)
[18:18] <gary_poster> as long as we do the root passwd/shadow file trick yeah
[18:18] <benji> yep
[18:18] <gary_poster> need to remember to do that first
[18:18] <gary_poster> tests are still rolling along here
[18:19] <gary_poster> ok... at 2:19, if I start a machine and get it initialized, it would be ready by 3:20ish
[18:19] <benji> actually, don't we have access to the "upper" directory?  we can do the shadow trick there if we need to (but doing it before launch would be easiest)
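(A sketch of the shadow trick as read here, i.e. blanking root's password hash so the container console will accept a root login; that reading is an assumption, not something the log spells out:)

    # empty root's password field in the container's /etc/shadow
    sudo sed -i 's/^root:[^:]*:/root::/' /etc/shadow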
[18:19] <benji> gary_poster: I have a slave ready
[18:19] <gary_poster> benji, ooh, 8 core?
[18:19] <benji> gary_poster: darn, no
[18:20] <gary_poster> :-/
[18:20] <gary_poster> benji, want me to start one, or you?
[18:21] <gary_poster> it may be trickier than I want it to be
[18:21] <gary_poster> because of the new juju changes announced last night
[18:21] <benji> gary_poster: have at it
[18:21] <gary_poster> I was relying on something they just ripped out and replaced
[18:21] <gary_poster> ok
[18:22] <benji> actually, there has been some discussion about not ripping out the old options but leaving them for backward compatibility
[18:22] <gary_poster> yeah I saw that
[18:23] <gary_poster> it makes sense to me, not just for backwards compatibility but also for setting defaults
[18:27] <gary_poster> ok, slave is starting.  tests are still running locally.  stepping away.
[18:28] <bac> hey gary_poster, can we have a quick chat to see if i can lend you a hand?
[18:28] <gary_poster> bac, hey, 1 sec
[18:28] <bac> gary_poster: np, i'll grab some tea
[18:28] <gary_poster> cool
[18:35] <gary_poster> bac, I await you in the horde (https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde)
[18:37] <gary_poster> benji, I have a call at 4 with Francis, btw, so you will need to take over
[18:37] <gary_poster> then
[18:37] <benji> ok
[19:38] <gary_poster> benji, http://ec2-174-129-101-121.compute-1.amazonaws.com:8010/waterfall
[19:38] <gary_poster> benji, I have added root passwd
[19:40] <gary_poster> benji, so I should start a test run
[19:40] <benji> gary_poster: sounds good
[19:40] <benji> after that we'll just be watching it, right?
[19:40] <gary_poster> benji, now I will add you to ssh
[19:40] <gary_poster> yeah
[19:40] <gary_poster> this may have been foolish :-(
[19:41] <benji> how so?
[19:41] <gary_poster> if yesterday is any indication, hang is in > 2 hours
[19:41] <gary_poster> past both of our EoDs
[19:42] <gary_poster> maybe one of the cores will hang sooner
[19:42] <gary_poster> benji, oh argh
[19:42] <gary_poster> should I not have installed the package, or at least python-dbg, before starting a test?
[19:44] <benji> gary_poster: we need to install faulthandler and register a handler for USR1 that will write all thread's stacktraces to a file; oh and tweak /etc/shadow
[19:44] <benji> the "register a hanlder" bit might be interesting, probably hacking bin/test would be the easiest way
[19:45] <benji> oh, and the package needs to be installed in the container
[19:45] <benji> gary_poster: you may be right, we might be too late in the day to do this right; maybe we should scrub
[19:45] <gary_poster> benji, I already changed /etc/shadow in root and added you to authorized keys.  try ubuntu@ec2-174-129-101-121.compute-1.amazonaws.com
[19:45] <gary_poster> or scrub ;-)
[19:46] <benji> gary_poster: Permission denied (publickey).
[19:46] <gary_poster> benji, one thing you could do is make an LP branch that has the package installed
[19:46] <gary_poster> for tomorrow
[19:46] <gary_poster> I would suggest actually hooking it in at a different location
[19:47] <gary_poster> benji, try adding it to lib/lp_sitecustomize.py
[19:47] <gary_poster> that will register it for every LP process
[19:47] <gary_poster> which is what we want I think
[19:48] <gary_poster> and is easy to do
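(A hypothetical sketch of that lp_sitecustomize.py hook; the function name and log path are invented, and the import is guarded so hosts without faulthandler are unaffected:)

    import os
    import signal

    def _install_faulthandler():
        # register a SIGUSR1 handler in every LP process at startup
        try:
            import faulthandler
        except ImportError:
            return  # package not installed; do nothing
        out = open('/var/tmp/lp-tracebacks-%d.log' % os.getpid(), 'w')
        faulthandler.register(signal.SIGUSR1, file=out, all_threads=True)

    _install_faulthandler()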
[19:48] <gary_poster> If I were going to do it today I would actually use python-dbg
[19:48] <gary_poster> because that's simpler
[19:48] <gary_poster> I just kill the test
[19:48] <gary_poster> install -dbg
[19:48] <gary_poster> and restart
[19:49] <gary_poster> when there is a hang, give it a whirl with gdb
[19:50] <benji> gary_poster: I'd really like to finish at least a section of my review today, so I prefer the option of killing the slave; I'll make a branch and tweak lp_sitecustomize
[19:51] <gary_poster> cool
[19:51] <gary_poster> I wonder what I did wrong with authorized_keys...
[19:52] <gary_poster> benji, for future reference, I simply added your key (the line from https://launchpad.net/~benji/+sshkeys) to /home/ubuntu/.ssh/authorized_keys .  Did I need to do anything else?
[19:53] <gary_poster> Maybe ec2 security thing...
[19:54] <benji> gary_poster: I would have thought that would work
[19:54] <benji> I'm pretty sure I've done just that in the past
[19:56] <gary_poster> yeah, already set to allow 22 through in ec2
[19:56] <gary_poster> for everyone
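(One cause worth ruling out, though the log never confirms it: with StrictModes on, sshd silently ignores authorized_keys when the home directory or .ssh permissions are too open. A sketch of the usual fix:)

    chmod 700 /home/ubuntu/.ssh
    chmod 600 /home/ubuntu/.ssh/authorized_keys
    chown -R ubuntu:ubuntu /home/ubuntu/.ssh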
[20:05] <gary_poster> benji, still watching while waiting for call.  does top lie?  this is what it is saying
[20:05] <gary_poster> Tasks: 505 total,   1 running, 504 sleeping,   0 stopped,   0 zombie
[20:05] <gary_poster> Cpu(s):  0.1%us,  0.1%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
[20:05] <gary_poster> this is while I have eight parallel tests going
[20:05] <benji> gary_poster: I haven't seen top lie lately.  (I do remember on some old red hat machines ZC once had...)
[20:07] <gary_poster> some tests are still running
[20:07] <gary_poster> just not very fast