[09:48] gmb: good morning, I am starting juju charms to run tests
[09:49] frankban, Morning. Cool. I'm just looking at Gary's email now...
[09:49] gmb: select/read hangs, scary
[09:49] Yeah :/
[09:53] si\
[09:53] ops
[11:36] * bac \o
[11:36] Wow. Shows how long it's been since I did any LP development work. My env is completely broken.
[11:37] gmb: no kidding. and it seems things are moving around a lot over the last few weeks.
[12:06] gary_poster: good morning, now I have a better understanding of what's going on with testrepository
[12:07] frankban, awesome. good afternoon.
[12:08] what does that mean practically for our issues?
[12:08] gary_poster: the debian control file in http://bazaar.launchpad.net/~testrepository/debian/sid/python-testrepository/sid/view/head:/debian/control (the one we use) is different from the one in ubuntu upstream: http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/testrepository/precise/view/head:/debian/control
[12:09] ahhh!
[12:09] so if we switch, all will be well? easy?
[12:10] need to step away; back soon.
[12:13] gary_poster: yes, but I have a question: my branch patches testrepository trunk. So our ppa installs the trunk revision (maybe we could use that to see if https://bugs.launchpad.net/testrepository/+bug/775214 is fixed). However, no problem for me to update the ppa to use a patched upstream release.
[12:13] <_mup_> Bug #775214: On python 2.7: String or Integer object expected for key, unicode found < https://launchpad.net/bugs/775214 >
[12:13] * frankban lunch
[12:19] gary_poster: buildbot tests are currently running on ec2, access granted for you and benji (user: ubuntu)
[12:19] zk: ec2-107-20-6-115.compute-1.amazonaws.com
[12:20] master: ec2-50-16-78-27.compute-1.amazonaws.com
[12:20] slave: ec2-23-20-98-182.compute-1.amazonaws.com
[12:20] * frankban really lunches
[12:23] precise...
[12:23] * benji goes to report a bug on precise
[12:35] frankban, ec2: great, thank you
[12:36] frankban, testrepository: I *guess*...if we can easily fix the subunit dependency issue (and I expect we can) then using trunk would be good
[12:37] that makes me a bit nervous
[12:37] but it is probably good for the long run and hopefully ok for the short run
[12:39] I would guess we would have to make our own branch of the ubuntu debian bits
[12:55] * gary_poster thinks aptitude rocks
[13:02] I've considered starting to use aptitude, and there have been a couple of times I wish I already had.
[13:09] bac benji frankban gmb call in 2 or sooner
[13:11] A tumbleweed rolls through goldenhorde
[13:26] https://docs.google.com/a/canonical.com/document/d/19Zn7fGkQH5oOpJkaU2lGCpt8RK5KiDpPBTKJKs50wWw/edit
[13:48] benji, I'm going to restart post-update, then I'd like to discuss strategy for tackling the test hangs when you have a moment
[13:48] gary_poster: sure
[13:48] thanks
[14:04] OK, benji. https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde when you get a chance. camera appears to be working post-update
[14:51] benji, fwiw, this is the command I am running (no confirmation yet that it is working):
[14:51] xvfb-run --error-file=/var/tmp/xvfb-errors.log --server-args='-screen 0 1024x768x24' -a /home/gary/launchpad/lp/sandbox/bin/test --subunit --load-list /home/gary/temp/tmp0d4ZXs
[14:51] k
[14:51] yeah it seems to be working
[14:52] I probably should have teed the output
[14:52] but I didn't
[14:53] good point, I'll tee mine
[15:02] sorry about the email churn with failed PPA builds. i'm trying to get the packaging to work with the new name spelling. hopefully the next one will work.
[15:03] thanks, np
[15:03] * gary_poster needs to go babysit and such. biab
[15:22] benji, are you getting a lot of "SilentLaunchpadScriptFailure: 1" errors?
[15:22] gary_poster: nope
[15:22] k
[15:23] a lot of failures on my side. going away again
[15:23] gary_poster: I started later than you did, so I might not be there yet.
[15:23] maybe so
[15:23] hm
[15:24] saw schema related error
[15:24] going to stop, make schema, and retry
[15:30] yeah, I did a pull and make schema before my run, just in case
[15:32] ok, restarted on new ephemeral with changes made
[15:33] now really babysitting :-P
[15:33] argh
[15:33] fell over
[15:34] :(
[15:34] dumb mistake
[15:34] retrying
[15:34] there we go
[15:34] ok, now leaving :-)
[15:35] benji: http://ec2-50-16-78-27.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/1/steps/shell_8/logs/stdio
[15:36] "baby sitting" + "fell over" is not good
[15:36] frankban: what am I looking for?
[15:36] the results of a parallel test run
[15:37] benji: and it's doing another run...
[15:37] have you started it?
[15:38] frankban: have I started what? Another run? no. That was probably triggered by a commit.
[15:38] benji: ah... ok
[15:39] frankban: I'm still not sure what you would like for me to notice about the output. That it finished without hanging, perhaps?
[15:41] benji: yes, and only 4 failures... is the hang happening only using 8 cores?
[15:41] gary_poster: ah, ok. Nope, we've seen a hang with just two, so the fact that you didn't get one is interesting.
[15:42] frankban: note that the xx-bug.txt failure is a known issue in the trunk, the production buildbot reported the same failure a few hours ago
[15:44] benji: yes, I've seen
[16:10] actually, benji, frankban, I have not seen a hang lately with two cores
[16:10] only 4 failures is great
[16:11] benji, do you have failures on your run? I definitely do
[16:18] benji, maybe worth noting: testrepository had not reported any errors.
[16:19] I wonder if this is some kind of "buffer filling too fast" problem
[16:19] triggered by having so many errors
[16:20] I'm not sure how many errors I'm going to end up with on this run, but "a lot" looks like a rough guess
[16:32] benji, no hangs for me. trying to figure out a quick way to get results of run
[16:38] "subunit2pyunit < testoutput.txt" yields a fairly confusing result: only one error?
[16:45] benji, ok, yeah, I'm confused. I thought I saw a lot of errors flying by, but now when I look at the teed document, I see very few tracebacks. The only error I get from the command above is one for an issue that subunit itself seems to show as...successful?
[16:45] test: lib/lp/app/javascript/overlay/tests/test_overlay.html
[16:45] test: Could not communicate with subprocess
[16:45] tags: zope:error_with_banner
[16:45] successful: Could not communicate with subprocess
[16:46] ...riiiiight...
[16:49] gary_poster: was eating lunch; reading backlog now
[16:52] gary_poster: I have no failures in my non-ephemeral run
[16:53] my run took just under an hour and had no errors or failures at all
[16:53] I'm going to start another in an ephemeral container and see what that does
[16:55] benji, interesting
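As an aside on the "subunit2pyunit < testoutput.txt" check above: a rough way to tally a teed subunit stream, assuming the v1 ProtocolTestCase API that python-subunit shipped at the time; the script name and output format are made up for illustration:

    # subunit_counts.py -- hypothetical helper, not part of the LP tree
    import sys
    import unittest

    import subunit

    result = unittest.TestResult()
    # Replay the subunit stream from stdin into a plain TestResult so we can
    # count outcomes instead of relying on subunit2pyunit's console output.
    subunit.ProtocolTestCase(sys.stdin).run(result)
    print("ran: %d  errors: %d  failures: %d" % (
        result.testsRun, len(result.errors), len(result.failures)))

Usage would be something like: python subunit_counts.py < testoutput.txt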
[17:05] benji, so, maybe my "I have tons of errors" was confused by the fact that I ended up searching into the previous run. not sure. in any case, the only issue I see in the tee'd file is the one I gave above. So I'm wondering what to do now, since I was unable to dupe. I was considering hacking testr to only start one process, and to include the --load-list that we are using, and see how that goes. Thoughts?
[17:06] gary_poster: so the intent of your hack would be to run in a normal environment, but serialize instead of parallelize in order to see if we get failures or not, right?
[17:09] benji, not exactly. The intent would be to run a single process/container of what the eight core machine did, but exactly as it did. Specifically, I'm going to hack testr to make it think I only have one core (which will mean that it will run all the tests it is supposed to run in a single ephemeral lxc container); *and* I'll include --load-list=/home/gary/temp/tmp0... when I start testr, so only those tests are run
[17:11] If I succeed in triggering a hang, I at least have a recipe for triggering it locally. If I do not succeed, then it implies that not only does testr need to run those tests in an ephemeral lxc container, but also they must be in parallel; *or* my machine is sufficiently different from the ec2 machine that it doesn't trigger.
[17:16] gary_poster, So, I've lost a bunch of time this afternoon to my lp setup being hideously broken. I've now rebuilt it. Do you have any guidance for me re: bug 609986?
[17:16] <_mup_> Bug #609986: layer setup failures don't output a failure message (they spew straight to console) < https://launchpad.net/bugs/609986 >
[17:16] actually maybe I don't have to hack testr to not run in parallel; just don't use --parallel
[17:16] gmb sure, lemme get that back in my head. want to hang out for just a bit?
[17:17] gary_poster, Sure. Let me get Firefox running
[17:17] k
[17:18] Ah, crap, updates...
[17:19] gmb, https://code.launchpad.net/~launchpad/zope.testing/3.9.4-p5 is something to talk about when you are ready
[17:19] * gmb looks
[17:19] gary_poster, goldenhorde?
[17:19] gmb, yeah
[17:19] k
[17:56] gary_poster: the ephemeral run completed with one failure: lp.services.job.tests.test_runner.TestTwistedJobRunner.test_memory_hog_job
[18:02] benji, I got that one in my testr run so far
[18:03] so, benji, we have an apparently intermittent test isolation error...
[18:03] and we are unable to trigger the hang with merely an lxc or an ephemeral lxc.
[18:03] I'm now adding testr to the mix
[18:03] and if that does not hang
[18:04] then we only have the two options that I mentioned above as the possible causes, afaik
[18:04] I ended up only hacking my .testr.conf for what I wanted
[18:05] and then running testr run
[18:06] but to try and dupe the eight-way parallel run...I'm not sure how to do that, except to merely force my two-core machine to be treated as an eight-core machine by testr
[18:06] which does not necessarily use the same test divisions
[18:06] yep, I agree with your evaluation of what the different outcomes suggest
[18:06] and also demands more RAM than I have, according to the experience I had yesterday
[18:06] tests are still running here
[18:07] the new run is only about 20 minutes old
[18:07] k
[18:08] so we may need to discuss how to instrument the ec2 machine
[18:08] while we are waiting for the test results here
[18:08] they might inform any result really
[18:08] you mentioned the signal handler, and the debug build
[18:09] I like the signal handler better than the debug build, because it changes less
[18:09] and yet might still give us what we need.
[18:09] (it almost seems like something that one always ought to run with)
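A minimal sketch of that hand-rolled signal-handler idea, using only the standard library; the log path and where the handler gets installed are illustrative, not what was actually used:

    # Dump every thread's stack to a file when the process receives SIGUSR1.
    import signal
    import sys
    import traceback

    def dump_stacks(signum, frame):
        with open('/var/tmp/test-stacks.log', 'a') as log:
            for thread_id, stack in sys._current_frames().items():
                log.write('Thread %s:\n' % thread_id)
                log.write(''.join(traceback.format_stack(stack)))
                log.write('\n')

    signal.signal(signal.SIGUSR1, dump_stacks)

From outside the hung process, kill -USR1 <pid> would then make it write its stacks without otherwise disturbing the run.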
[18:09] we could also try that gdb hook trick
[18:10] that lets you get into a Python process
[18:11] gary_poster: EOD, my ec2 test run is still going, do you want me to leave those instances up?
[18:11] frankban, ack. benji, I think he can kill them. what do you think?
[18:13] gary_poster, frankban: yeah I say kill them; I don't think we'll need them.
[18:13] frankban, thank you. Have a great evening.
[18:13] gary_poster, benji: ok, have a nice evening
[18:13] same to you, frankban
[18:14] ah right, we can just sudo apt-get install python-dbg to get the debug build, can't we
[18:14] gary_poster: this looks like what we're looking for http://pypi.python.org/pypi/faulthandler/
[18:14] gary_poster: I believe so.
[18:15] gary_poster: I think your point about holding off on using the debug build is a good one
[18:16] benji, nice package. the only thing that strikes me that might bite us there is testr/subunit eating things...
[18:16] it might still work
[18:16] but the dance would be this:
[18:17] there is an option for making it write to a file
[18:17] send signal to Python process
[18:17] ah!
[18:17] mhm...we would need access to the ephemeral container
[18:17] register(signum, file=sys.stderr, all_threads=False, chain=False)
[18:18] we can use the console for that
[18:18] in order to look at the file
[18:18] since they hang, they don't go away :)
[18:18] as long as we do the root passwd/shadow file trick yeah
[18:18] yep
[18:18] need to remember to do that first
[18:18] tests are still rolling along here
[18:19] ok...at 2:19, if I start a machine and get it initialized it would be ready by 3:20 ish
[18:19] actually, don't we have access to the "upper" directory? we can do the shadow trick there if we need to (but doing it before launch would be easiest)
[18:19] gary_poster: I have a slave ready
[18:19] benji, ooh, 8 core?
[18:19] gary_poster: darn, no
[18:20] :-/
[18:20] benji, want me to start one, or you?
[18:21] it may be trickier than I want it to be
[18:21] because of the new juju changes announced last night
[18:21] gary_poster: have at it
[18:21] I was relying on something they just ripped out and replaced
[18:21] ok
[18:22] actually, there has been some discussion about not ripping out the old options but leaving them for backward compatibility
[18:22] yeah I saw that
[18:23] it makes sense to me, not even for backwards compatibility but for setting defaults
[18:27] ok, slave is starting. tests are still running locally. stepping away.
[18:28] hey gary_poster, can we have a quick chat to see if i can lend you a hand?
[18:28] bac, hey, 1 sec
[18:28] gary_poster: np, i'll grab some tea
[18:28] cool
[18:35] bac, I await you in the horde (https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde_
[18:35] https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde
[18:37] benji, I have a call at 4 with Francis, btw, so you will need to take over
[18:37] then
[18:37] ok
[19:38] benji, http://ec2-174-129-101-121.compute-1.amazonaws.com:8010/waterfall
[19:38] benji, I have added root passwd
[19:40] benji, so I should start a test run
[19:40] gary_poster: sounds good
[19:40] after that we'll just be watching it, right?
[19:40] benji, now I will add you to ssh
[19:40] yeah
[19:40] this may have been foolish :-(
[19:41] how so?
[19:41] if yesterday is any indication, hang is in > 2 hours
[19:41] past both of our EoDs
[19:42] maybe one of the cores will hang sooner
[19:42] benji, oh argh
[19:42] should I not have installed the package, or at least python-dbg, before starting a test?
[19:44] gary_poster: we need to install faulthandler and register a handler for USR1 that will write all threads' stacktraces to a file; oh and tweak /etc/shadow
[19:44] the "register a handler" bit might be interesting, probably hacking bin/test would be the easiest way
[19:45] oh, and the package needs to be installed in the container
[19:45] gary_poster: you may be right, we might be too late in the day to do this right; maybe we should scrub
[19:45] benji, I already changed /etc/shadow in root and added you to authorized keys. try ubuntu@ec2-174-129-101-121.compute-1.amazonaws.com
[19:45] or scrub ;-)
[19:46] gary_poster: Permission denied (publickey).
[19:46] benji, one thing you could do is make an LP branch that has the package installed
[19:46] for tomorrow
[19:46] I would suggest actually hooking it in at a different location
[19:47] benji, try adding it to lib/lp_sitecustomize.py
[19:47] that will register it for every LP process
[19:47] which is what we want I think
[19:48] and is easy to do
[19:48] If I were going to do it today I would actually use python-dbg
[19:48] because that's simpler
[19:48] I just kill the test
[19:48] install -dbg
[19:48] and restart
[19:49] when there is a hang, give it a whirl with gdb
[19:50] gary_poster: I'd really like to finish at least a section of my review today, so I'd prefer the option of killing the slave; I'll make a branch and tweak lp_sitecustomize
[19:51] cool
[19:51] I wonder what I did wrong with authorized_keys...
[19:52] benji, for future reference, I simply added your key (the line from https://launchpad.net/~benji/+sshkeys) to /home/ubuntu/.ssh/authorized_keys . Did I need to do anything else?
[19:53] Maybe ec2 security thing...
[19:54] gary_poster: I would have thought that would work
[19:54] I'm pretty sure I've done just that in the past
[19:56] yeah, already set to allow 22 through in ec2
[19:56] for everyone
[20:05] benji, still watching while waiting for call. does top lie? this is what it is saying
[20:05] Tasks: 505 total, 1 running, 504 sleeping, 0 stopped, 0 zombie
[20:05] Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
[20:05] this is while I have eight parallel tests going
[20:05] gary_poster: I haven't seen top lie lately. (I do remember on some old red hat machines ZC once had...)
[20:07] some tests are still running
[20:07] just not very fast
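For reference, the faulthandler plan discussed above (register a handler for USR1 that writes all threads' stacktraces to a file, wired in via something like lib/lp_sitecustomize.py) would look roughly like this; the log path is illustrative and the snippet assumes the faulthandler package from PyPI is installed inside the container:

    # Hypothetical site-customize style hook using the register() signature
    # quoted in the discussion above.
    import signal

    import faulthandler

    _stack_log = open('/var/tmp/faulthandler.log', 'w')
    # On SIGUSR1, dump every thread's traceback to the file rather than
    # stderr, so testr/subunit cannot eat the output.
    faulthandler.register(signal.SIGUSR1, file=_stack_log, all_threads=True)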