=== danilo_ is now known as danilos
[11:10] morning
[11:12] http://mozillamemes.tumblr.com/post/19333515188/we-love-you-nigelb
[11:27] :-) hi bac
[12:22] gary_poster: bug 959352 seems to be getting stranger and stranger
[12:23] benji, yeah, really
[12:23] benji, I have a machine up now
[12:23] ran tests on it through the night
[12:24] I'm confused
[12:24] gary_poster: I just replied to your other email about the blocking, you may want to read it now if you're getting ready for a run.
[12:24] entropy went down to 0 a couple of times
[12:24] yeah I did
[12:24] but giving you more data now :-)
[12:24] entropy went down to 0 a few times
[12:25] but didn't stay there for long
[12:25] while it was hanging it had gotten up to 51
[12:25] down to 0 again now
[12:26] Do I read this correctly?
[12:26] read(11, "6\351d=\310\235\274\300", 4096) = 8
[12:27] program: "give me 4k bytes from fd 11"
[12:27] kernel: "I can give you 8"
[12:28] (and they are the ones in the buffer)
[12:28] the second arg
[12:28] I'm pretty sure that is right
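A minimal Python sketch of the read() semantics being discussed here, assuming nothing beyond a Linux /dev/random: ask for 4096 bytes, as the strace'd process did, and take however many the entropy pool can supply.

    import os

    # Ask /dev/random for 4096 bytes, mirroring the strace line above.
    # read() returns as soon as *some* bytes are available, so with a
    # nearly empty entropy pool the result can be as short as 8 bytes;
    # with an empty pool the call blocks entirely.
    fd = os.open("/dev/random", os.O_RDONLY)
    try:
        data = os.read(fd, 4096)
        print("asked for 4096, got %d" % len(data))
    finally:
        os.close(fd)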
[12:28] I have another test run now
[12:29] I'm really confused about the mapping though
[12:29] what was it doing when it was (according to lsof) trying to get something from /rootfs/dev/random?
[12:29] well, supposedly it is still trying to do that now
[12:29] but I have symlinks in
[12:30] but...it should have fallen over because the file didn't exist, not hung waiting for output
[12:30] my suspicion is that it is getting the random values actually from the proper place
[12:30] and lsof is the one that is confused
[12:31] if that (or any other scenario in which that dns code is actually reading /dev/random despite what lsof says) is happening
[12:31] then doing mapping games won't help
[12:36] gary_poster: do you know if /dev/random inside ephemerals points to /dev/random of the host?
[12:37] frankban, oh, interesting. I assumed it would, because I thought dev was mounted, but it is not. proc and sysfs are mounted. I don't know what the dev mechanism is.
[12:38] I'm trying to kill everything now but I'll look at that again in a moment
[12:38] gary_poster: I am trying now a test run using a nasty trick to feed the entropy in the host
[12:40] frankban, heh, cool. I saw you could write to /dev/random and then manipulate...something to convince the system that it had enough. good experiment, thanks! :-)
[12:40] well, I am using rng-tools to feed /dev/random with data from /dev/urandom
[12:41] ok, everything is sufficiently dead...
[12:41] heh, cool
[12:45] gary_poster: yes, you're reading that correctly; yeah, I bet your suspicion is right that /dev/random is really being read, but strace is reporting the file name incorrectly
[12:46] gary_poster: there is a mapping game that will help, remove /dev/random and recreate it but use the urandom kernel device
[12:52] frankban: thanks for pointing out rng-tools...i didn't know about it. have you seen http://www.howtoforge.com/helping-the-random-number-generator-to-gain-enough-entropy-with-rng-tools-debian-lenny
[12:53] so apparently generating enough entropy on servers is a known problem and we've made it worse by throwing 8 cores at it
[12:53] bac: yes, that's the trick.
[12:54] gary_poster: here is the incantation that will point /dev/random at /dev/urandom: rm /dev/random; mknod -m 0444 /dev/random c 1 9
[12:54] benji ack. one more minute away...
[12:56] ok... benji that should be done on the host I'm assuming?
[12:57] gary_poster: good question. I was assuming we'd do it on the containers, but assuming the host's /dev/random feeds the containers, then at least we'd only have to do it in one place (and we can do it live)
[12:59] well, I could do it in the base container
[12:59] that seems mildly safer
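benji's incantation as a Python sketch, for anyone scripting it across containers. This assumes root (mknod needs it) and relies on 1,8 and 1,9 being the kernel's random and urandom character devices, which they are on standard Linux.

    import os
    import stat

    # Equivalent of "rm /dev/random; mknod -m 0444 /dev/random c 1 9":
    # replace the blocking random device (char 1,8) with a node that
    # points at the non-blocking urandom device (char 1,9).
    os.remove("/dev/random")
    os.mknod("/dev/random", 0o444 | stat.S_IFCHR, os.makedev(1, 9))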
[13:05] ok build started
[13:09] bac benji frankban gmb call in 1 or 2
[13:09] (as soon as we all show up)
[13:16] benji starting without you
[13:16] * benji runs toward the horde!
[13:27] https://docs.google.com/a/canonical.com/document/d/19Zn7fGkQH5oOpJkaU2lGCpt8RK5KiDpPBTKJKs50wWw/edit
[13:37] ok benji, I'm going to kill the test run and try again after changing the host's /dev/random
[14:10] gary_poster: document updated wrt generic code packaging
[14:12] benji, frankban, I have host and container using the urandom and it's not unblocking. going to try duping frankban's rng-tools now. apt-get install failed--perhaps it didn't like the /dev/random change. undoing and retrying that.
[14:12] bac, good, thank you
[14:13] hmm, it should not block
[14:15] benji, frankban's loop ("for i in $(seq 1000); do head -1 /dev/random > /dev/null; done") blocks too. oh...wow
[14:15] /dev/random is 1, 8 again!
[14:15] who did that!
[14:15] heh
[14:16] the mknod really should "stick", as far as I know those device files are created at boot and not touched afterward
[14:16] seriously, though...yeah, it did not stick
[14:16] I verified history
[14:16] I changed it
[14:16] it was 1, 9
[14:17] then when I looked again just now
[14:17] it was 1, 8
[14:18] gary_poster: maybe the rng-tools installation changes them?
[14:18] yeah I was wondering about that too
[14:19] there's not a command to see when a command was run, is there?
[14:19] no, that wouldn't help anyway
[14:20] in any case, this was hanging before I tried to install rng-tools
[14:20] and is still hanging now
[14:20] after I re-switched /dev/random
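A small diagnostic sketch for the two things being watched in this exchange: which device node /dev/random currently is (so you can tell if it has silently reverted to the blocking 1,8 device), and how deep the entropy pool is. Assumes the standard Linux /proc layout.

    import os

    # 1,8 is the blocking random device; 1,9 means it has been
    # pointed at urandom.
    st = os.stat("/dev/random")
    print("/dev/random is %d, %d" % (os.major(st.st_rdev), os.minor(st.st_rdev)))

    # The number that was seen dropping to 0 while the tests hung.
    with open("/proc/sys/kernel/random/entropy_avail") as f:
        print("entropy_avail: %s" % f.read().strip())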
[14:23] frankban, am I right in assuming that you did not see this?
[14:23] Setting up rng-tools (2-unofficial-mt.14-1ubuntu1) ...
[14:23] Starting Hardware RNG entropy gatherer daemon: (failed).
[14:23] invoke-rc.d: initscript rng-tools, action "start" failed.
[14:23] yes, you are
[14:23] :-/
[14:23] that's after I reinstated the old /dev/random
[14:24] not very informative
[14:28] gary_poster: if the file was open, switching it out won't make a difference (just like a regular file)
[14:28] benji, sure
[14:28] benji, I had made the switch before starting the test run
[14:29] ah, sorry; I got the impression that it was mid-hang
[14:29] I did a variety of switches mid-hang :-) but I did the start of the experiment correctly, at least in this regard
[14:29] the mid-hang switches were of the "it's not working, so what shall I do now" variety
[14:54] gary_poster: another build without hangs: http://ec2-50-17-161-214.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/1/steps/shell_8/logs/stdio
[14:55] around 50 minutes this time too, but results are... weird...
[15:06] frankban, results look ok (roughly expected) to me. what's the weird part?
[15:06] gary_poster: 3 failures, I expected 5, always seen 5
[15:09] but that's maybe because tests are split and ordered differently each time
[15:10] frankban, yeah, that's my assumption. difference between first and second is especially big
[15:12] gary_poster: started my third run, if it goes well, I can conclude that 1. my trick works or 2. I've got a holy blessed ec2 instance
[15:16] frankban, :-) well I think based on your evidence we can conclude (1) the exhausted /dev/random hypothesis seems proven and then (2) one of your two options. I'm working on duping your approach. I think one can ignore the error message I got. I modified /etc/default/rng-tools and then it started fine. starting a run now
[15:17] wow, I think that immediately kicked things into high gear
[15:51] cool, third run completed in 51 minutes
[15:56] great frankban. frankban, since you are ahead of me, how's this for an idea. Hack testr (I can give a pointer, but easy to find) to make it think the machine only has 1 core so we can start to get a baseline for our tests. Then, gather the three times and the three error reports for your three test runs. We'll want the times to make averages (make sure you note which was the first, second and third please) and we'll want the failures to know what to fix
[15:56] does that sound ok frankban? or do you have a better idea?
[16:02] frankban, I have duped a successful run on my ec2 box, using your approach. yay!
[16:02] running another one.
[16:08] gary_poster: so, do you want me to save test reports and send them to yellow? and for running using one core: maybe hack testcommand.local_concurrency?
[16:22] frankban, I was thinking of you sending the buildbot report, but actually...
[16:22] yes, testcommand.local_concurrency
[16:22] ...anyway, in build/.testrepository there should be a few files there now
[16:22] some numbered ones are the subunit files
[16:22] I think they might be the most useful
[16:23] in addition to the timings, of course, which I would get from the equivalent of http://ec2-174-129-101-121.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/6/steps/shell_8
[16:24] for that, I get to the page this way:
[16:24] from waterfall, click on build link for the given build. Then click on the "shell 8" link
[16:24] that will give the timing
[16:24] as measured by buildbot
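A sketch of the one-core hack gary_poster and frankban settle on above. It takes frankban's pointer (testcommand.local_concurrency) at face value but does not assume which class in that module owns the method, since that detail is not confirmed here; it simply patches every class in testrepository/testcommand.py that defines it.

    import inspect
    from testrepository import testcommand

    # Run this before kicking off the suite (e.g. from a small
    # wrapper script): make testr's core-count probe claim a single
    # core so every test run is scheduled on one worker.
    for _name, cls in inspect.getmembers(testcommand, inspect.isclass):
        if "local_concurrency" in vars(cls):
            cls.local_concurrency = lambda self: 1

Editing the method in place in testcommand.py to "return 1" would accomplish the same thing for a throwaway baseline run.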
[16:26] ah, I see gary_poster. btw, I only have those files in /var/lib/buildbot/slaves/slave/lucid-devel/build/.testrepository:
[16:26] 0 failing format next-stream times.dbm
[16:27] frankban, huh :-/
[16:28] OK frankban, maybe the buildbot output is the thing to get. oh...interesting...
[16:29] actually, we are wiping out the .testrepository directory every build
[16:29] this means that every single run is round-robin
[16:29] because testrepository has no data to work with
[16:29] that's probably something we should fix
[16:30] gary_poster: maybe that's because we use an overlayfs?
[16:30] for now, frankban, get that "0" file, which should be for your most recent (current?) run
[16:30] frankban, actually even more than that
[16:30] or use the overlayfs, *and* buildbot wipes away the build directory for every build--we do a fresh checkout every time
[16:31] gary_poster: i just got sucked into an investigation of gpg key importing from a six month old question that laura asked me to follow up on. but, good news, i figured out what was going on.
[16:31] bac, yeah, I happened to see those emails, cool
[16:32] short answer, the key server doesn't support old v3 keys and neither does LP, but we lie to the user and say his key doesn't exist.
[16:32] right, I saw you say that this was a bad error message. sure sounds like it
[16:33] frankban, I'm not sure why my last sentence started with the word "or". :-P This conveys more of my meaning: "We use the overlayfs, *and* buildbot wipes away the build directory for every build--we do a fresh checkout every time"
[16:34] so anyway, short term, grab the "0" file and grab the buildbot output
[16:35] gary_poster: ok, I will send them together with the results of the single core test run.
[16:35] frankban, cool, thank you.
[16:37] gary_poster: updated the lpsetup part of the document
[16:37] frankban, great, thank you very much. bac, I saw you did the same for your part. Thank you.
[16:37] actually frankban. overlayfs doesn't have anything to do with it; it is entirely the buildbot issue
[16:38] testr runs in host
[16:38] sorry, talking about the .testrepository directory
[17:54] gary_poster: if you have a minute, I have a draft of an LXC blog post that would appreciate any input you have
[17:54] benji, sure. how would you like me to look at it
[17:55] gary_poster: how about this: https://pastebin.canonical.com/62744/
[17:55] cool, on it
[17:57] benji, I'd replace "three or four hours, depending on the hardware" with "six hours on our current continuous integration machines, for instance" or similar. It's six hours there.
[17:58] I wasn't aware of the exact time, thanks
[18:00] After sentence "The ephemeral containers can then write to their local file systems
[18:00] without interfering with the others running simultaneously." I suggest something like this: "Because the file system changes are stored in memory, IO doesn't slow us down or block us, as it would in other similar situations."
[18:01] I hadn't thought of highlighting that aspect, good idea.
[18:03] Last sentence, suggest replacing "Even so we have already shortened a full test run on an
[18:03] eight-core EC2 instance down to 45 minutes." with something like "Even so, our current results, with only a handful of test failures per run, are running on an eight-core EC2 instance in about 55 minutes." I'd also suggest waiting on frankban's numbers for running the tests on a single core on the same machine: that will give us a real statistic to give in the blog post, which is probably highly pertinent given our audience. That should be done today, hopefully, so it won't block you long.
[18:04] Benji, very nice post. Thank you.
[18:04] s/Benji/benji/ :-P
[18:04] heh
[18:05] The blog post got me in full-sentence mode.
[19:45] well, at least my PC waited until the call was almost over to crash
[19:53] heh
[20:06] benji, reread blog post. I wonder if you should omit mention of ssh ("ssh ubuntu@test"), and tell people to log in with ubuntu:ubuntu. Maybe you should also mention that we are doing this on precise?
[20:39] gary_poster: things look good on the blog post front, danhg says "this is just what I was after, much appreciated"
[20:39] benji, excellent, thank you
[20:47] gary_poster: i've determined the translations failure is not an isolation bug but a spurious failure. in fact, running the set of tests present in the original merge proposal where the test was introduced (33 tests), the test fails frequently.
[20:47] bac, huh. Disable?
[20:48] if all else fails, perhaps
[20:48] my question is why is it not being seen in buildbot?
[20:49] gary_poster: yes, if i can't figure it out by tomorrow morning i'll punt and disable it
[20:50] bac, I figured. Spurious bugs can have different characteristics in different environments, of course
[20:50] I figured that's what you were wondering, I mean
[20:51] The nature of a spurious bug is that it is tied to something environmental that it should not be, I'd argue
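For the record, a hypothetical repro loop of the kind bac describes, to gauge how often a suspected spurious test fails when re-run: bin/test is Launchpad's usual runner with a -t name filter, but the test pattern and run count here are placeholders, not taken from the chat.

    import subprocess

    # Re-run the suspect test repeatedly from the LP tree root and
    # count failures; a spurious test fails some runs and passes
    # others with no code change in between.
    runs, failures = 20, 0
    for _ in range(runs):
        rc = subprocess.call(["bin/test", "-t", "TestTranslationsExport"])
        if rc != 0:
            failures += 1
    print("%d of %d runs failed" % (failures, runs))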