[09:40] gmb: good morning, what do you think about starting to pair on Gary's email after some coffee?
[10:38] frankban, Hi, sorry, was afk and missed your ping...
[10:39] frankban, Have you had a chance to look at Gary's changes to lxc-start-ephemeral yet?
[10:40] gmb: no, I was trying to start buildbot master without success
[10:41] frankban, Ah, okay. Well, I was thinking that we might be better off splitting the tasks rather than pairing... what are the problems you've been having with the -master?
[10:42] gmb: I started a juju oneiric instance, apt-get update doesn't work: errors are like:
[10:42] W: Failed to fetch copy:/var/lib/apt/lists/partial/us-east-1.ec2.archive.ubuntu.com_ubuntu_dists_oneiric_main_i18n_Index Encountered a section with no Package: header
[10:42] Wow.
[10:43] gmb: is oneiric the right choice for juju instances?
[10:43] frankban, Might be worth checking with the guys in #is to see if this is a wider problem. That looks like a broken archive.
[10:44] frankban, I don't know; I've got precise as the default-series for my ec2 environment. Let me see if I can bring one up.
[10:50] frankban, Is this error happening during the charm's install hook, then?
[10:50] gmb: yes
[10:50] Okay.
[10:50] the install hook adds a ppa and then runs apt-get update
[10:50] * gmb keeps watching the precise instance he just created.
[10:53] Huh. So, the instance never seems to get out of "pending"
[10:53] I can't ssh to it.
[10:53] * gmb tries oneiric
[10:53] thanks gmb, and please use a large instance
[10:54] frankban, Okay... what do I need to do to make sure that I get a large instance?
[10:54] gmb: in ~/.juju/environments.yaml I have:
[10:55] default-instance-type: m1.large and default-image-id: ami-ff975496
[10:55] Ah.
[10:55] Thanks.
[10:55] (inside the ec2 env)
[11:12] gmb: I've got the master running using precise: default-instance-type: m1.large, default-image-id: ami-e0ca1689
[11:15] frankban, Hmm. My machines aren't even coming out of "pending".
And that's not for the charm, that's for juju itself.
[11:18] gmb: a time will come when what works today will work tomorrow...
[11:19] Hah, yes.
[11:19] however, I have started the slave installation (with setuplxc), and that will take about an hour
[11:21] Okay.
[11:23] frankban, I need to go and do some more work on the packaging - I may have solved my problems over the weekend. Can you check out Gary's changes to lxc-start-ephemeral? I've looked at the diff and it looks fine, but I haven't actually tried using it yet.
[11:24] gmb: sure
[11:24] I'll also keep kicking at juju, see if I can get something working. Maybe I need to update and upgrade...
[12:01] * gary_poster is still sick, and now two out of three children have it too. :-/ Meanwhile, upgrading. Will restart and then prep for call
[12:01] gary_poster, Um, isn't the call at 13:10 UTC?
[12:02] gmb, oh! we had daylight savings time, or whatever you all call it on that side of the pond ("summer time"?)
[12:02] you have that next week gmb?
[12:03] gary_poster, Yeah, I think it's the 25th that ours go forward.
[12:03] In the UK anyway.
[12:03] * gmb checks
[12:03] yup
[12:03] ok so this week still 1310
[12:03] ok
[12:03] we'll switch when you all switch
[12:04] In that case, upgrades & lunch...
[12:04] k :-)
[12:04] benji, we'll have call @ 9:10 (europe didn't switch yet and this is their lunch time)
[12:05] k
[12:05] * benji hates changing time zones and wishes we'd do daylight saving time all year.
[12:05] :-)
[12:06] * gary_poster can imagine the political slogans:
[12:06] "save more daylight!"
[12:06] "Won't someone think of the chi...daylight!"
[12:08] I've considered pushing to ban stop signs on the (made up) basis that yields are more environmentally friendly. I'm sure I could roll DST in there somehow.
[12:08] heh
[12:12] gmb: should I fire up a slave instance or are you guys lovingly preparing one for us to use later?
[12:15] benji: I have started juju master and slave, I am adding your ssh key to both, ok?
[12:16] gary_poster: please see https://code.launchpad.net/~frankban/ubuntu/precise/lxc/bug-951150 for a working version of start-ephemeral (some small fixes)
[12:17] frankban: cool; with even more overlap between our days today, I wonder how we're going to collaborate. Ideas?
[12:21] benji: master: ec2-107-21-145-254.compute-1.amazonaws.com
[12:21] slave: ec2-23-20-53-135.compute-1.amazonaws.com
[12:22] benji: please add gary_poster's key, going to lunch now
[12:22] frankban: "Permission denied (publickey)." I wonder if my LP key(s) are correct; checking.
[12:28] overlap is only higher for one week
[12:30] frankban, cool. Are those changes actually fixes (things didn't work without them) or just cleanups?
[12:30] * benji tries to figure out a polite way of saying "I knew that." :)
[12:30] (I know you are at lunch, just queuing questions :-) )
[12:30] heh
[12:30] ok
[12:31] benji, I'd be surprised if your LP keys were incorrect, because IIRC that's what Canonical's IS uses to set you up on machines
[12:32] gary_poster: yeah, I've verified they are right. I'm still looking at why I can't log in. I'm only assuming my user name is "benji", but that seems like a safe assumption.
[12:32] that being said, even if I could log in, I don't know what I'm expected to do
[12:32] yeah...no idea. You could try "ubuntu," benji
[12:32] ooh, good idea
[12:32] you are supposed to add my key too, of course! ;-)
[12:32] (don't ask me what to do after that)
[12:33] gary_poster: you are a genious
[12:33] heh
[12:33] and I am not a good speller
[12:33] :-)
[12:37] benji, one question would be to see how the tests are running. that would be http://ec2-107-21-145-254.compute-1.amazonaws.com:8010/ right? that's not resolving for me yet...
[12:37] gary_poster: if exposed (which is likely), yes
[12:37] t'ain't visible to me
[12:56] gary_poster, benji: sorry, I forgot to add the relation between charms, doing now
[12:56] frankban: does the slave have your lxc-start-ephemeral fixes?
[12:57] benji: no
[13:00] frankban, you'll want to change lxc-start-ephemeral and also the test script as I described in the email (removing -b). Do this after lunch though :-)
[13:02] gary_poster: the test script should be the correct one, since I've used your new version of setuplxc in the charm config file
[13:03] oh, cool
[13:08] benji, frankban gmb call in 2
[13:34] if anyone finds the waterfall display suboptimal, I prefer the build info page: http://ec2-107-21-145-254.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/0
[13:35] And now I have a link to refer to once I return from restarting my machine! :-)
[13:35] :)
[13:37] I wonder what I did to get into this "Partial Upgrade" state.
[13:37] benji, does --list usually take this long (http://ec2-107-21-145-254.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/0/steps/shell_8/logs/stdio)
[13:38] I suppose we could be waiting for stdout's buffer to fill with something...
[13:38] gary_poster: it takes longer than I would expect, but I think it should be done by now
[13:38] hm. The canary was fine
[13:38] we can look to see if it is in a select() live-lock
[13:38] ah right
[13:39] that would be a mite disheartening
[13:39] I'll look.
[13:39] benji, have you added my key, btw?
and frankban please don't forget to add us (or at least benji) to the other two machines, so we can shut them off
[13:40] gary_poster: nope; I'll do that too
[13:40] ty
[13:40] benji, it made progress
[13:41] heh; ok
[13:41] so far so good
[13:44] gary_poster: you should be set up on the master and slave; I don't have access to the ZK machine yet
[13:44] benji, great thank you
[13:44] gary_poster and benji: you are allowed on the zookeeper instance: ubuntu@ec2-50-17-161-43.compute-1.amazonaws.com
[13:44] great, thanks frankban
[13:44] frankban: thanks
[13:58] * benji reboots
[13:59] * gmb -> reboot, tea
[14:16] you guys may have remarked on this while I was away fighting the "Partial Upgrade" dragon: I'm seeing the same tmp-dir-centric failure we saw earlier: http://ec2-107-21-145-254.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/0/steps/shell_8/logs/stdio
[14:18] benji, we had not discussed, but I was thinking about it. I was just about to log onto the slave and see what the container's /var/tmp looks like.
[14:18] * benji fills his coffee cup.
[14:27] benji, tmpCXj9IX is the missing part. everything else is there. hallyn is calling me...
[14:27] it is encouraging that the tests seem to be CPU bound
[14:27] gary_poster: missing part?
[14:27] benji, "OSError: [Errno 2] No such file or directory: '/var/tmp/ppa/joe/myppa/tmpCXj9IX'"
[14:28] ah!
[14:28] hmm
[14:28] everything is there except the last part
[14:30] uh-oh, time to change the cat litter!
[14:30] (I figure everyone would want to know that)
[14:30] biab
[14:39] (back, btw)
[14:42] frankban: I was going to review https://code.launchpad.net/~frankban/lpsetup/split-files/+merge/97028 but since there were code changes and moves mixed together, I don't think I can realistically figure out what code actually changed.
I'm fine with rubber-stamping it (i.e., approve it without actually seeing what has changed) or you can make a branch with just the moves and make that a prerequisite branch so this MP will show th
[14:46] benji, afaict only one test process is running :-( investigating to confirm...
[14:46] why did that happen on friday again? can't remember
[14:47] benji: thank you, actually that branch is just about splitting the lpsetup script into several files. The code is already reviewed, but I'd like suggestions on the project structure.
[14:48] gary_poster: I don't think we know why it happened, the symptom was a select() live-lock, if I recall correctly
[14:48] right
[14:48] frankban: oh; the MP says there were other changes
[14:48] I had hoped that this parallel thing would fix it :-( :-(
[14:48] I mean, ephemeral thing
[14:48] me too
[14:49] gary_poster: earlier when viewing top output I got the impression that two test processes were running
[14:49] gary_poster: see processes 23716 and 22426
[14:50] benji, maybe my expectations are broken then--I expected to see two files in .testrepository, one for each process; maybe that's the combo
[14:50] gary_poster: I would have (baselessly) expected the same thing.
[14:52] benji: the other changes are really minor fixes, and only in how file_append is used.
[14:58] frankban: is the subcommand structure new? I don't see the value-add in doing it that way versus runnable scripts.
[14:59] benji, I confirmed that the .testrepository/tmp... file contains tests from both lists. So, yay, afaict
[15:01] cool
[15:03] benji: the file structure is new, the subcommands layer over argparse was already present.
[15:04] frankban: ok, thanks
[15:04] benji: thank you
[15:16] benji, http://ec2-107-21-145-254.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/0/steps/shell_8/logs/stdio readonly issues remain
[15:16] as well as others
[15:16] looks very similar
[15:17] >:(
[15:19] Nothing has changed in the root directory...
[15:24] benji, uh-oh: http://pastebin.ubuntu.com/880494/
[15:25] gary_poster: what have you done?! ;)
[15:25] :-)
[15:26] benji, uh, any ideas?
[15:26] gary_poster: only the obvious: there was a problem binding
[15:26] tests are spewing wildly
[15:26] we should look in the fstab
[15:26] benji, well it worked initially
[15:26] or else the tests would not have started
[15:26] yeah, that is odd
[15:26] so it fell over, it seems
[15:27] gary_poster: what's the problem?
[15:27] frankban, the mounted directories have disappeared
[15:28] we had the Daniel Silverstone error and then things really went off the rails
[15:28] benji, if you look in syslog you see overlayfs talking about being unable to whiteout files
[15:29] darn, a whiteout underflow
[15:30] which corresponds to errors we see in our test log
[15:30] postheld.txt comes right after daniel silverstone
[15:31] and is in syslog
[15:32] that syslog looks kind of unhealthy also just with lines seeming to get munged together
[15:33] benji, I'm also concerned about "non-accessible hardlink creation was attempted by: Xvfb (fsuid 110)": it looks a lot like a variant of that overlayfs bug I filed with the chmod 0444 + ln story
[15:33] gary_poster: that error seems to be associated with not having the kernel config CONFIG_TMPFS_XATTR enabled
[15:34] benji, the whiteout or the hardlink?
[15:34] gary_poster: it does; is that just a warning, or an error?
[15:34] gary_poster: white
[15:34] out
[15:34] ah ok.
that sounds promising then
[15:35] warning or error: neither, simply reported
[15:36] syslog not being healthy: look at the first line of these three as an example:
[15:36] Mar 12 13:40:54 ip-10-78-193-250 kernel: [12073197.514949] eth0: no IPv6 outers peent
[15:36] Mar 12 13:40:54 ip-10-78-193-250 kernel: [12073197.766450] vethVsavoK: no IPv6 routers present
[15:36] Mar 12 13:40:55 ip-10-78-193-250 kernel: [12073198.355046] vethexeo2M: no IPv6 routers present
[15:36] someone ate two "r"s and an "s"
[15:36] other similar examples in there too
[15:38] benji, did you see/do you know how to check current value of CONFIG_TMPFS_XATTR?
[15:38] that is quite odd
[15:38] gary_poster: nope, let me see
[15:39] gary_poster: are we using a tmpfs as the upper filesystem?
[15:39] benji, yes. also saw "Xattrs are also needed for overlayfs."
[15:40] gary_poster: it seems that tmpfs doesn't support xattrs; so... we need a different upper fs
[15:40] (someone at least proposed adding it, but it apparently hasn't happened yet)
[15:40] benji, it does if that thing you found is turned on. it was a patch specifically for this purpose
[15:41] ah!
[15:41] (the discussion I am reading is from 2011: http://www.serverphorums.com/read.php?12,301386)
[15:41] It looks like it is available: http://cateee.net/lkddb/web-lkddb/TMPFS_XATTR.html
[15:42] but...not sure...
[15:43] also http://kernel.xc.net/html/linux-2.6.11/i386/TMPFS_XATTR
[15:43] gary_poster: it looks like a compile-time option :(
[15:43] I wondered about that; that's what it looked like to me too...
[15:47] benji, what's our kernel version in precise?
[15:48] gary_poster: 3.2.0-18-virtual #29-Ubuntu
[15:48] ah 3.2.0-17.27
[15:48] or thereabouts
[15:48] cool
[15:48] :)
[15:48] how does one check that
[15:48] I looked in release notes
[15:48] uname -a
[15:48] ah right uname
[15:51] gary_poster: it seems that xattr is enabled for tmpfs: grep TMPFS_XATTR /boot/config-3.2.0-18-virtual
[15:51] ah, good call frankban
[15:52] benji: thanks for the review
[15:53] frankban: my pleasure, I hope it was helpful
[15:53] frankban: ooh, good find! in that case, we're back to trying to figure out why we're getting whiteout errors
[15:55] benji: about the author, that was something I wanted to ask, thank you... Can I use launchpad as maintainer and driver for the lp project too?
[15:57] benji, frankban, the one thing that I know I did in a crazy way is that we are using an overlayfs as the upper part of an overlayfs
[15:57] there's an easy fix for that
[15:57] make a new tmpfs
[15:57] and use that
[15:58] frankban: I /think/ so, for the lazr projects we have a maintainer of https://launchpad.net/~lazr-developers and no driver; it couldn't hurt to use https://launchpad.net/~launchpad as the maintainer
[15:59] biab
[15:59] frankban: we could also set Owner to https://launchpad.net/~launchpad-leader, like LP
[15:59] benji: ok
[16:00] benji, I need to step away. Want to try adjusting the branch to make a separate tmpfs for the bound bits? should be relatively easy
[16:00] or I can tackle when I return
[16:01] gary_poster: I'm stepping away for lunch too. The first one back gets to make as many tmpfs-s as he likes.
[16:13] cool
[16:13] I'm giving it a try
[16:37] pycon us, all the videos: http://pyvideo.org/category/17/pycon-us-2012
[16:38] gary_poster, benji, frankban: Can one of you run `sudo apt-add-repository ppa:gmb/canonical-ppa && sudo apt-get update` and then tell me what the latest version reported by `apt-cache show charm-tools` is please?
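(The CONFIG_TMPFS_XATTR check from 15:51 could also be scripted. What follows is only a rough sketch, not something anyone ran in this session: the helper names `kernel_option` and `running_kernel_option` are made up for illustration, and it assumes the distribution ships the kernel config as /boot/config-<release>, as the grep above suggests.)

```python
import platform
from pathlib import Path
from typing import Optional


def kernel_option(config_text: str, name: str) -> Optional[str]:
    """Return the value of CONFIG_<name> (e.g. 'y' or 'm'), or None if unset.

    Hypothetical helper: config_text is the content of a kernel config file.
    """
    key = f"CONFIG_{name}"
    for line in config_text.splitlines():
        line = line.strip()
        # Enabled options look like "CONFIG_TMPFS_XATTR=y".
        if line.startswith(key + "="):
            return line.split("=", 1)[1]
        # Disabled options look like "# CONFIG_TMPFS_XATTR is not set".
        if line == f"# {key} is not set":
            return None
    return None


def running_kernel_option(name: str) -> Optional[str]:
    """Look the option up for the running kernel (assumes /boot/config-<release>)."""
    path = Path("/boot") / f"config-{platform.release()}"
    return kernel_option(path.read_text(), name)
```

(On the slave, `running_kernel_option("TMPFS_XATTR")` would be the equivalent of the grep, provided the config file exists at that path.)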
[16:38] sure gmb
[16:38] Thanks
[16:40] 0.3+bzr130-1-pythonhelpers~precise1
[16:41] Argh.
[16:41] frankban: Thanks.
[16:41] gmb ^^^^, but I've got 131 for the source package
[16:41] frankban: Ah, cool. So it's probably just that the recipe's built but that doesn't actually mean that the binary has built.
[16:41] E_CONFUSED_GMB
[16:42] gmb: I think so
[16:42] I've seen that the binary takes more time
[16:42] Okay, I can live with that.
[16:42] * gmb digs a bit to find out more
[16:43] gmb: you should find a cheating countdown in launchpad
[16:43] frankban: I have "Start in 11 minutes" for precise
[16:43] I can live with that.
[16:43] I'll go and do some admin stuff in the meantime.
[16:44] gary_poster: my understanding is that the rationale/justification for overlayfs's simplicity is precisely that you can overlay on top of an overlay
[16:44] * koolhead17|away (~beermon@117.193.251.230) has joined #ubuntu-server
[16:44] gary_poster: so doing what you suggest is good for verifying that that's the problem, but if there's a problem then it's a bug
[16:45] not sure that's particularly reassuring
[16:59] * benji is back.
[17:06] benji, hallyn said using overlayfs within overlayfs is fine (see immediately above), but I could experiment anyway. I have done so, and I have a version that makes a tmpfs on the slave now
[17:07] gary_poster: cool; is that version on the slave (or easily transferable) so we can test it?
[17:07] benji ^^ on the slave now :-)
[17:08] benji, I am still getting the "I'm not really mounted" weirdness
[17:08] gary_poster: where "I'm not really mounted" is the empty /var/lib/buildbot?
[17:09] benji, right
[17:09] benji, pretty sure it was working before
[17:09] just with my /home/gary
[17:09] but should be the same
[17:14] benji, mind is blown. Completely confused.
[17:14] gary_poster: do you want to pair on this?
[17:14] oh! of course!
[17:14] benji, sure
[17:14] mm, confused again
[17:15] :)
[17:15] benji, https://talkgadget.google.com/hangouts/extras/canonical.com/goldenhorde
[17:40] https://code.launchpad.net/~gary/ubuntu/precise/lxc/bug-951150/+merge/97021
[17:51] benji, lp:~launchpad/zope.testing/3.9.4-p5
[18:18] * gary_poster lunches
[18:19] benji, btw, hallyn added a -d (daemon) to script which changed the look of it significantly. The version of the script with my most recent changes is https://code.launchpad.net/~gary/ubuntu/precise/lxc/bug-951150-2/+merge/97077
[18:21] I suspect lxc-start, lxc-start-ephemeral, and lxc-clone are on a collision course (i.e., there should be a refactoring project that looks at all three and how they are related to one another)
[19:24] gary_poster: I've been doing reviews and haven't really made any progress on bug 9slhlsdffjlisdhdf
[19:24] <_mup_> Bug #9: Rosetta's po parser is too strict < https://launchpad.net/bugs/9 >
[19:24] pfft, thanks mup
[19:24] benji :-) on call
[19:34] so far this looks virtually identical...we are not to the crazy bits yet, I guess: http://ec2-107-21-145-254.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/1/steps/shell_8/logs/stdio
[19:46] gary_poster: I've verified that the testDryrunOption failure is because of lack of test isolation (running the test by itself produces the same failure)
[20:01] both lp.archivepublisher.tests.test_generate_ppa_htaccess.TestPPAHtaccessTokenGeneration.testDryrunOption
[20:01] and lp.archivepublisher.tests.test_generate_ppa_htaccess.TestPPAHtaccessTokenGeneration.testGenerateHtpasswd
[20:06] benji, it blew up again, with the same errors. :-/ Mm, idea...
[20:08] in the last test run, lp.services.webapp.tests.test_dbpolicy.LayerDatabasePolicyTestCase.test_WebServiceRequest_uses_LaunchpadDatabasePolicy
[20:08] is the first test that doesn't fail when run in isolation (on a regular dev machine)
[20:08] well, that's good-ish
[20:09] the way it fails does suggest some inter-container state bleeding: AssertionError: newInteraction called while another interaction is active.
[20:11] That's in-memory though!
[20:11] that's a security thing
[20:12] Mar 12 17:16:33 is the last time we had a "failed to whiteout" problem
[20:12] that's a good point
[20:12] and we are now at 20:12
[20:13] that's good!
[20:13] so I tentatively suggest that that particular problem might be resolved
[20:13] yeah
[20:13] I still see a lot of these:
[20:14] non-accessible hardlink creation was attempted by: Xvfb
[20:15] I think we ought to try
[20:15] non-accessible hardlink creation was attempted by: Xvfb
[20:15] eh
[20:15] gary_poster: the word "attempted" worries me
[20:15] I think we ought to try
[20:15] echo 0 > /proc/sys/kernel/yama/protected_nonaccess_hardlinks
[20:15] per bug 944386
[20:15] <_mup_> Bug #944386: Making a hard link of a 0444 permission file fails in overlayfs [Precise] < https://launchpad.net/bugs/944386 >
[20:15] gary_poster: can't hurt
[20:16] yeah
[20:16] gary_poster: do you want to do that and I'll kill the current run?
[20:16] ok sure benji
[20:17] done benji
[20:19] new build running: http://ec2-107-21-145-254.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/2
[20:20] data point: my machine has not yet hung today!
[20:23] benji, did you try running lp.services.webapp.tests.test_dbpolicy.LayerDatabasePolicyTestCase.test_WebServiceRequest_uses_ReadOnlyDatabasePolicy in isolation?
[20:23] gary_poster: I think so, let me check.
[20:24] gary_poster: it passes
[20:24] benji, darn. it could so easily be explained by isolation also
[20:47] benji, the last time we saw the xvfb error was 18:11:58. Now 20:46.
I'm hopeful that the echo removed that error message at least, even if it does not actually fix any of these test failures.
[20:48] I hope so.
[20:48] I'm trying to reproduce the first non-simple isolation failure (test_read_only_mode_uses_ReadOnlyLaunchpadDatabasePolicy)
[20:49] cool benji. It would be nice to see what order the processes ran tests. Actually...
[20:49] you know, we could copy over those lists of tests
[20:50] and use them to specify what tests to run
[20:50] and run them normally, without the parallel stuff
[20:50] and see if they fail that way
[20:51] the testrunner does not appear to run the tests in the file in first-to-last order, or last-to-first, though I could be wrong.
[20:52] gary_poster: yep, that's what I'm trying
[20:52] ah, cool!
[20:55] I suspect the division of the tests into the two lists and the order within those lists is stable between runs
[20:56] the first non-reproducible failure is about one-eighth of the way into the file
[21:00] yes, it is stable, I'm pretty sure. benji, which is the non-reproducible one?
[21:00] and benji, are you running them in isolation, or all together with the list?
[21:00] gary_poster: lp.services.webapp.tests.test_dbpolicy.LayerDatabasePolicyTestCase.test_read_only_mode_uses_ReadOnlyLaunchpadDatabasePolicy
[21:01] I tried running just the one before it and it together, then a few more before it, but now I'm running the list of all 1000-odd tests that lead up to it (plus it) to see if I get the error
[21:02] if so, I'm tempted to write a little script to search for the minimal set of tests that reproduce the error
[21:02] I'm also tempted to stop work now and make dinner. :)
[21:02] benji, go for dinner. :-) I'll shut down the machines in a bit
[21:03] thanks & have a good evening
[21:03] gary_poster: you too, see you tomorrow
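(For reference, the "little script" idea from 21:02 might look roughly like the sketch below. It is not the script benji wrote, if he wrote one. It is a simplified delta-debugging style reduction: given the ordered tests that ran before the failing one, it drops chunks of predecessors for as long as the target still fails. The names `minimize` and `fails` are hypothetical; `fails` stands in for whatever actually runs the testrunner over a list of tests and reports whether the target test failed, which is assumed here rather than shown.)

```python
from typing import Callable, List, Sequence


def minimize(tests: Sequence[str], target: str,
             fails: Callable[[List[str]], bool]) -> List[str]:
    """Return a 1-minimal ordered list of predecessors that still makes target fail.

    fails(test_list) is an assumed callback: it runs test_list in order and
    returns True when the target test (the last entry) fails.
    """
    prefix = list(tests)
    if not fails(prefix + [target]):
        raise ValueError("the full list does not reproduce the failure")
    chunk = max(len(prefix) // 2, 1)
    while prefix:
        removed_any = False
        i = 0
        while i < len(prefix):
            candidate = prefix[:i] + prefix[i + chunk:]
            if fails(candidate + [target]):
                prefix = candidate   # this chunk was not needed; drop it
                removed_any = True
            else:
                i += chunk           # this chunk is needed; keep it, move on
        if chunk == 1 and not removed_any:
            break                    # no single test can be removed any more
        chunk = max(chunk // 2, 1)
    return prefix
```

(Each call to `fails` is a full test run over the candidate list, so on a list of 1000-odd tests this could take a long time; it also assumes the failure is deterministic for a given order, which the stable-ordering observation at 20:55 makes plausible but does not guarantee.)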