frankban | gmb: good morning, I am starting juju charms to run tests | 09:48 |
---|---|---|
gmb | frankban, Morning. Cool. I'm just looking at Gary's email now... | 09:49 |
frankban | gmb: select/read hangs, scary | 09:49 |
gmb | Yeah :/ | 09:49 |
frankban | si\ | 09:53 |
frankban | ops | 09:53 |
* bac \o | 11:36 | |
gmb | Wow. Shows how long it's been since I did any LP development work. My env is completely broken. | 11:36 |
bac | gmb: no kidding. and it seems things are moving around a lot over the last few weeks. | 11:37 |
frankban | gary_poster: good morning, now I have a better understanding of what's going on with testrepository | 12:06 |
gary_poster | frankban, awesome. good afternoon. | 12:07 |
gary_poster | what does that mean practically for our issues? | 12:08 |
frankban | gary_poster: the debian control file in http://bazaar.launchpad.net/~testrepository/debian/sid/python-testrepository/sid/view/head:/debian/control (the one we use) is different from the one in ubuntu upstream: http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/testrepository/precise/view/head:/debian/control | 12:08 |
gary_poster | ahhh! | 12:09 |
gary_poster | so if we switch, all will be well? easy? | 12:09 |
gary_poster | need to step away; back soon. | 12:10 |
frankban | gary_poster: yes, but I have a question: my branch patches testrepository trunk. So our ppa installs the trunk revision (maybe we could use that to see if https://bugs.launchpad.net/testrepository/+bug/775214 is fixed) However, no problem for me to update the ppa to use a patched upstream release. | 12:13 |
_mup_ | Bug #775214: On python 2.7: String or Integer object expected for key, unicode found <Testrepository:Fix Committed by lifeless> < https://launchpad.net/bugs/775214 > | 12:13 |
* frankban lunch | 12:13 | |
frankban | gary_poster: buildbot tests are currently running on ec2, access granted for you and benji (user: ubuntu) | 12:19 |
frankban | zk: ec2-107-20-6-115.compute-1.amazonaws.com | 12:19 |
frankban | master: ec2-50-16-78-27.compute-1.amazonaws.com | 12:20 |
frankban | slave: ec2-23-20-98-182.compute-1.amazonaws.com | 12:20 |
* frankban really lunches | 12:20 | |
benji | precise... <sigh> | 12:23 |
* benji goes to report a bug on precise | 12:23 | |
gary_poster | frankban, ec2: great thank you | 12:35 |
gary_poster | frankban, testrepository: I *guess*...if we can easily fix the subunit dependency issue (and I expect we can) then using trunk would be good | 12:36 |
gary_poster | that makes me a bit nervous | 12:37 |
gary_poster | but it is probably good for the long run and hopefully ok for the short run | 12:37 |
gary_poster | I would guess we would have to make our own branch of the ubuntu debian bits | 12:39 |
* gary_poster thinks aptitude rocks | 12:55 | |
benji | I've considered starting to use aptitude, and there have been a couple of times I wish I already had. | 13:02 |
gary_poster | bac benji frankban gmb call in 2 or sooner | 13:09 |
gmb | A tumbleweed rolls through goldenhorde | 13:11 |
gary_poster | https://docs.google.com/a/canonical.com/document/d/19Zn7fGkQH5oOpJkaU2lGCpt8RK5KiDpPBTKJKs50wWw/edit | 13:26 |
gary_poster | benji, I'm going to restart post update, then I'd like to discuss strategy for tackling the test hangs when you have a moment | 13:48 |
benji | gary_poster: sure | 13:48 |
gary_poster | thanks | 13:48 |
gary_poster | OK, benji. https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde when you get a chance. camera appears to be working post-update | 14:04 |
gary_poster | benji, fwiw, this is the command I am running (no confirmation yet that it is working): | 14:51 |
gary_poster | xvfb-run --error-file=/var/tmp/xvfb-errors.log --server-args='-screen 0 1024x768x24' -a /home/gary/launchpad/lp/sandbox/bin/test --subunit --load-list /home/gary/temp/tmp0d4ZXs | 14:51 |
benji | k | 14:51 |
gary_poster | yeah it seems to be working | 14:51 |
gary_poster | I probably should have teed the output | 14:52 |
gary_poster | but I didn't | 14:52 |
benji | good point, I'll tee mine | 14:53 |
bac | sorry about the email churn with failed PPA builds. i'm trying to get the packaging to work with the new name spelling. hopefully the next one will work. | 15:02 |
gary_poster | thanks, np | 15:03 |
* gary_poster needs to go babysit and such. biab | 15:03 | |
gary_poster | benji, are you getting a lot of "SilentLaunchpadScriptFailure: 1" errors? | 15:22 |
benji | gary_poster: nope | 15:22 |
gary_poster | k | 15:22 |
gary_poster | a lot of failures on my side. going away again | 15:23 |
benji | gary_poster: I started later than you did, so I might not be there yet. | 15:23 |
gary_poster | maybe so | 15:23 |
gary_poster | hm | 15:23 |
gary_poster | saw schema related error | 15:24 |
gary_poster | going to stop, make schema, and retry | 15:24 |
benji | yeah, I did a pull and make schema before my run, just in case | 15:30 |
gary_poster | ok, restarted on new ephemeral with changes made | 15:32 |
gary_poster | now really babysitting :-P | 15:33 |
gary_poster | argh | 15:33 |
gary_poster | fell over | 15:33 |
benji | :( | 15:34 |
gary_poster | dumb mistake | 15:34 |
gary_poster | retrying | 15:34 |
gary_poster | there we go | 15:34 |
gary_poster | ok, now leaving :-) | 15:34 |
frankban | benji: http://ec2-50-16-78-27.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/1/steps/shell_8/logs/stdio | 15:35 |
bac | "baby sitting" + "fell over" is not good | 15:36 |
benji | frankban: what am I looking for? | 15:36 |
frankban | the results of a parallel test run | 15:36 |
frankban | benji: and it's doing another run... | 15:37 |
frankban | have you started it? | 15:37 |
benji | frankban: have I started what? Another run? no. That was probably triggered by a commit. | 15:38 |
frankban | benji: ah... ok | 15:38 |
benji | frankban: I'm still not sure what you would like for me to notice about the output. That it finished without hanging, perhaps? | 15:39 |
frankban | benji: yes, and only 4 failures... is the hang happening only using 8 cores? | 15:41 |
benji | gary_poster: ah, ok. Nope, we've seen a hang with just two, so the fact that you didn't get one is interesting. | 15:41 |
benji | frankban: note that the xx-bug.txt failure is a known issue in the trunk, the production buildbot reported the same failure a few hours ago | 15:42 |
frankban | benji: yes, I've seen | 15:44 |
gary_poster | actually, benji, frankban, I have not seen a hang lately with two cores | 16:10 |
gary_poster | only 4 failures is great | 16:10 |
gary_poster | benji, do you have failures on your run? I definitely do | 16:11 |
gary_poster | benji, maybe worth noting is that testrepository had not reported any errors. | 16:18 |
gary_poster | I wonder if this is some kind of "buffer filling too fast" problem | 16:19 |
gary_poster | triggered by having so many errors | 16:19 |
gary_poster | I'm not sure how many errors I'm going to end up with on this run, but "a lot" looks like a rough guess | 16:20 |
gary_poster | benji, no hangs for me. trying to figure out a quick way to get results of run | 16:32 |
gary_poster | "subunit2pyunit < testoutput.txt" yields a fairly confusing result: only one error? | 16:38 |
gary_poster | benji, ok, yeah, I'm confused. I thought I saw a lot of errors flying by, but now when I look at the teed document, I see very few tracebacks. The only error I get from the command above is one for an issue that subunit itself seems to show as...successful? | 16:45 |
gary_poster | test: lib/lp/app/javascript/overlay/tests/test_overlay.html | 16:45 |
gary_poster | test: Could not communicate with subprocess | 16:45 |
gary_poster | tags: zope:error_with_banner | 16:45 |
gary_poster | successful: Could not communicate with subprocess | 16:45 |
gary_poster | ...riiiiight... | 16:46 |
benji | gary_poster: was eating lunch; reading backlog now | 16:49 |
benji | gary_poster: I have no failures in my non-ephemeral run | 16:52 |
benji | my run took just under an hour and had no errors or failures at all | 16:53 |
benji | I'm going to start another in an ephemeral container and see what that does | 16:53 |
gary_poster | benji, interesting | 16:55 |
gary_poster | benji, so, maybe my "I have tons of errors" was confused by the fact that I ended up searching into the previous run. not sure. in any case, the only issue I see in the tee'd file is the one I gave above. So I'm wondering what to do now, since I was unable to dupe. I was considering hacking testr to only start one process, and to include the --load-list that we are using, and see how that goes. Thoughts? | 17:05 |
benji | gary_poster: so the intent of your hack would be to run in a normal environment, but serialize instead of parallelize in order to see if we get failures or not, right? | 17:06 |
gary_poster | benji, not exactly. The intent would be to run a single process/container of what the eight core machine did, but exactly as it did. Specifically, I'm going to hack testr to make it think I only have one core (which will mean that it will run all the tests it is supposed to run in a single ephemeral lxc container); *and* I'll include --load-list=/home/gary/temp/tmp0... when I start testr, so only those tests are run | 17:09 |
gary_poster | If I succeed in triggering a hang, I at least have a recipe for triggering it locally. If I do not succeed, then it implies that not only does testr need to run those tests in an ephemeral lxc container, but also they must be in parallel; *or* my machine is sufficiently different from the ec2 machine that it doesn't trigger. | 17:11 |
gmb | gary_poster, So, I've lost a bunch of time this afternoon to my lp setup being hideously broken. I've now rebuilt it. Do you have any guidance for me re: bug 609986? | 17:16 |
_mup_ | Bug #609986: layer setup failures don't output a failure message (they spew straight to console) <lp-foundations> <paralleltest> <Launchpad itself:Triaged> < https://launchpad.net/bugs/609986 > | 17:16 |
gary_poster | actually maybe I don't have to hack testr to not run in parallel; just don't use --parallel | 17:16 |
gary_poster | gmb sure, lemme get that back in my head. want to hang out for just a bit? | 17:16 |
gmb | gary_poster, Sure. Let me get Firefox running | 17:17 |
gary_poster | k | 17:17 |
gmb | Ah, crap, updates.. | 17:18 |
gary_poster | gmb, https://code.launchpad.net/~launchpad/zope.testing/3.9.4-p5 is something to talk about when you are ready | 17:19 |
* gmb looks | 17:19 | |
gmb | gary_poster, goldenhorde? | 17:19 |
gary_poster | gmb, yeah | 17:19 |
gmb | k | 17:19 |
benji | gary_poster: the ephemeral run completed with one failure: lp.services.job.tests.test_runner.TestTwistedJobRunner.test_memory_hog_job | 17:56 |
gary_poster | benji, I got that one in my testr run so far | 18:02 |
gary_poster | so, benji, we have an apparently intermittent test isolation error... | 18:03 |
gary_poster | and we are unable to trigger the hang with merely an lxc or an ephemeral lxc. | 18:03 |
gary_poster | I'm now adding testr to the mix | 18:03 |
gary_poster | and if that does not hang | 18:03 |
gary_poster | then we only have the two options that I mentioned above as the possible causes, afaik | 18:04 |
gary_poster | I ended up only hacking my .testr.conf for what I wanted | 18:04 |
gary_poster | and then running testr run | 18:05 |
gary_poster | but to try and dupe the eight-way parallel run...I'm not sure how to do that, except to merely force my two-core machine to be treated as an eight-core machine by testr | 18:06 |
gary_poster | which does not necessarily use the same test divisions | 18:06 |
benji | yep, I agree with your evaluation of what the different outcomes suggest | 18:06 |
gary_poster | and also demands more RAM than I have, according to the experience I had yesterday | 18:06 |
gary_poster | tests are still running here | 18:06 |
gary_poster | the new run is only about 20 minutes old | 18:07 |
benji | k | 18:07 |
gary_poster | so we may need to discuss how to instrument the ec2 machine | 18:08 |
gary_poster | while we are waiting for the test results here | 18:08 |
gary_poster | they might inform any result really | 18:08 |
gary_poster | you mentioned the signal handler, and the debug build | 18:08 |
gary_poster | I like the signal handler better than the debug build, because it changes less | 18:09 |
gary_poster | and yet might still give us what we need. | 18:09 |
gary_poster | (it almost seems like something that one always ought to run with) | 18:09 |
gary_poster | we could also try that gdb hook trick | 18:09 |
gary_poster | that lets you get into a Python process | 18:10 |
frankban | gary_poster: EOD, my ec2 test run is still going, do you want me to leave those instances up? | 18:11 |
gary_poster | frankban, ack. benji, I think he can kill them. what do you think? | 18:11 |
benji | gary_poster, frankban: yeah I say kill them; I don't think we'll need them. | 18:13 |
gary_poster | frankban, thank you. Have a great evening. | 18:13 |
frankban | gary_poster, benji: ok, have a nice evening | 18:13 |
benji | same to you, frankban | 18:13 |
gary_poster | ah right, we can just sudo apt-get install python-dbg to get the debug build, can't we | 18:14 |
benji | gary_poster: this looks like what we're looking for http://pypi.python.org/pypi/faulthandler/ | 18:14 |
benji | gary_poster: I believe so. | 18:14 |
benji | gary_poster: I think your point about holding off on using the debug build is a good one | 18:15 |
gary_poster | benji, nice package. the only thing that strikes me that might bite us there is testr/subunit eating things... | 18:16 |
gary_poster | it might still work | 18:16 |
gary_poster | but the dance would be this: | 18:16 |
benji | there is an option for making it write to a file | 18:17 |
gary_poster | send signal to Python process | 18:17 |
gary_poster | ah! | 18:17 |
gary_poster | mhm...we would need access to the ephemeral container | 18:17 |
benji | register(signum, file=sys.stderr, all_threads=False, chain=False) | 18:17 |
benji | we can use the console for that | 18:18 |
gary_poster | in order to look at the file | 18:18 |
benji | since they hang, they don't go away :) | 18:18 |
gary_poster | as long as we do the root passwd/shadow file trick yeah | 18:18 |
benji | yep | 18:18 |
gary_poster | need to remember to do that first | 18:18 |
gary_poster | tests are still rolling along here | 18:18 |
gary_poster | ok...at 2:19, if I start a machine and get it initialized it would be ready by 3:20 ish | 18:19 |
benji | actually, don't we have access to the "upper" directory? we can do the shadow trick there if we need to (but doing it before launch would be easiest) | 18:19 |
benji | gary_poster: I have a slave ready | 18:19 |
gary_poster | benji, ooh, 8 core? | 18:19 |
benji | gary_poster: darn, no | 18:19 |
gary_poster | :-/ | 18:20 |
gary_poster | benji, want me to start one, or you? | 18:20 |
gary_poster | it may be trickier than I want it to be | 18:21 |
gary_poster | because of the new juju changes announced last night | 18:21 |
benji | gary_poster: have at it | 18:21 |
gary_poster | I was relying on something they just ripped out and replaced | 18:21 |
gary_poster | ok | 18:21 |
benji | actually, there has been some discussion about not ripping out the old options but leaving them for backward compatibility | 18:22 |
gary_poster | yeah I saw that | 18:22 |
gary_poster | it makes sense to me not even for backwards compatibility but for setting defaults | 18:23 |
gary_poster | ok, slave is starting. tests are still running locally. stepping away. | 18:27 |
bac | hey gary_poster, can we have a quick chat to see if i can lend you a hand? | 18:28 |
gary_poster | bac, hey, 1 sec | 18:28 |
bac | gary_poster: np, i'll grab some tea | 18:28 |
gary_poster | cool | 18:28 |
gary_poster | bac, I await you in the horde (https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde_ | 18:35 |
gary_poster | https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde | 18:35 |
gary_poster | benji, I have a call at 4 with Francis, btw, so you will need to take over | 18:37 |
gary_poster | then | 18:37 |
benji | ok | 18:37 |
gary_poster | benji, http://ec2-174-129-101-121.compute-1.amazonaws.com:8010/waterfall | 19:38 |
gary_poster | benji, I have added root passwd | 19:38 |
gary_poster | benji, so I should start a test run | 19:40 |
benji | gary_poster: sounds good | 19:40 |
benji | after that we'll just be watching it, right? | 19:40 |
gary_poster | benji, now I will add you to ssh | 19:40 |
gary_poster | yeah | 19:40 |
gary_poster | this may have been foolish :-( | 19:40 |
benji | how so? | 19:41 |
gary_poster | if yesterday is any indication, hang is in > 2 hours | 19:41 |
gary_poster | past both of our EoDs | 19:41 |
gary_poster | maybe one of the cores will hang sooner | 19:42 |
gary_poster | benji, oh argh | 19:42 |
gary_poster | should I not have installed the package, or at least python-dbg, before starting a test? | 19:42 |
benji | gary_poster: we need to install faulthandler and register a handler for USR1 that will write all threads' stack traces to a file; oh and tweak /etc/shadow | 19:44 |
benji | the "register a handler" bit might be interesting, probably hacking bin/test would be the easiest way | 19:44 |
benji | oh, and the package needs to be installed in the container | 19:45 |
benji | gary_poster: you may be right, we might be too late in the day to do this right; maybe we should scrub | 19:45 |
gary_poster | benji, I already changed /etc/shadow in root and added you to authorized keys. try ubuntu@ec2-174-129-101-121.compute-1.amazonaws.com | 19:45 |
gary_poster | or scrub ;-) | 19:45 |
benji | gary_poster: Permission denied (publickey). | 19:46 |
gary_poster | benji, one thing you could do is make an LP branch that has the package installed | 19:46 |
gary_poster | for tomorrow | 19:46 |
gary_poster | I would suggest actually hooking it in at a different location | 19:46 |
gary_poster | benji, try adding it to lib/lp_sitecustomize.py | 19:47 |
gary_poster | that will register it for every LP process | 19:47 |
gary_poster | which is what we want I think | 19:47 |
gary_poster | and is easy to do | 19:48 |
gary_poster | If I were going to do it today I would actually use python-dbg | 19:48 |
gary_poster | because that's simpler | 19:48 |
gary_poster | I just kill the test | 19:48 |
gary_poster | install -dbg | 19:48 |
gary_poster | and restart | 19:48 |
gary_poster | when there is a hang, give it a whirl with gdb | 19:49 |
benji | gary_poster: I'd really like to finish at least a section of my review today, so I prefer the option of killing the slave and I make a branch and tweak lp_sitecustomize | 19:50 |
gary_poster | cool | 19:51 |
gary_poster | I wonder what I did wrong with authorized_keys... | 19:51 |
gary_poster | benji, for future reference, I simply added your key (the line from https://launchpad.net/~benji/+sshkeys) to /home/ubuntu/.ssh/authorized_keys . Did I need to do anything else? | 19:52 |
gary_poster | Maybe ec2 security thing... | 19:53 |
benji | gary_poster: I would have thought that would work | 19:54 |
benji | I'm pretty sure I've done just that in the past | 19:54 |
gary_poster | yeah, already set to allow 22 through in ec2 | 19:56 |
gary_poster | for everyone | 19:56 |
gary_poster | benji, still watching while waiting for call. does top lie? this is what it is saying | 20:05 |
gary_poster | Tasks: 505 total, 1 running, 504 sleeping, 0 stopped, 0 zombie | 20:05 |
gary_poster | Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st | 20:05 |
gary_poster | this is while I have eight parallel tests going | 20:05 |
benji | gary_poster: I haven't seen top lie lately. (I do remember on some old red hat machines ZC once had...) | 20:05 |
gary_poster | some tests are still running | 20:07 |
gary_poster | just not very fast | 20:07 |
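
Editor's sketch of the faulthandler approach discussed above (register a handler for USR1 that writes every thread's stack to a file, hooked in somewhere like lib/lp_sitecustomize.py so each process gets it). This is only an illustration, not what the team actually committed: the function name `install_stack_dump_handler` and the log file name are hypothetical, and it assumes a Python where `faulthandler` is importable (it was a PyPI package at the time of this log and is in the standard library from Python 3.3 on) on a POSIX system.

```python
import faulthandler
import os
import signal
import tempfile


def install_stack_dump_handler(log_dir=None):
    """Dump all threads' tracebacks to a per-process file on SIGUSR1.

    Hypothetical helper: the name and file layout are assumptions made
    for this sketch, not something taken from the Launchpad tree.
    """
    log_dir = log_dir or tempfile.gettempdir()
    path = os.path.join(log_dir, "stack-dump-%d.log" % os.getpid())
    # faulthandler keeps only the file descriptor, so the file object
    # must stay open (and referenced) for the life of the process.
    log_file = open(path, "w")
    # Writing to a real file sidesteps anything that swallows stderr
    # (the worry raised above about testr/subunit eating output).
    faulthandler.register(
        signal.SIGUSR1, file=log_file, all_threads=True, chain=False)
    return path, log_file
```

With something like this in place, a hung process could be poked with `kill -USR1 <pid>` from inside the container and the dump read from the returned path, without needing a debug build or gdb.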