/srv/irclogs.ubuntu.com/2012/03/21/#launchpad-yellow.txt

=== danilo_ is now known as danilos
bacmorning11:10
bachttp://mozillamemes.tumblr.com/post/19333515188/we-love-you-nigelb11:12
frankban:-) hi bac11:27
benjigary_poster: bug 959352 seems to be getting stranger and stranger12:22
gary_posterbenji, yeah, really12:23
gary_posterbenji, I have a machine up now12:23
gary_posterran tests on it through the night12:23
gary_posterI'm confused12:24
benjigary_poster: I just replied to your other email about the blocking, you may want to read it now if you're getting ready for a run.12:24
gary_posterentropy went down to 0 a couple of times12:24
gary_posteryeah I did12:24
gary_posterbut giving you more data now :-)12:24
gary_posterentropy went down to 0 a few times12:24
gary_posterbut didn't stay there for long12:25
gary_posterwhile it was hanging it had gotten up to 5112:25
gary_posterdown to 0 again now12:25
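For anyone following along, the entropy figures quoted above (0, 51) can be sampled with a few lines of Python. This is only a sketch: it assumes the numbers come from the kernel's /proc/sys/kernel/random/entropy_avail file, which the log does not actually state, and the `threshold` value is a hypothetical cut-off chosen for illustration.

```python
from pathlib import Path

# Assumption: this is where the entropy figures in the chat were read from.
ENTROPY_PATH = "/proc/sys/kernel/random/entropy_avail"


def entropy_avail(path: str = ENTROPY_PATH) -> int:
    """Return the kernel's current entropy estimate, in bits."""
    return int(Path(path).read_text().strip())


def is_starved(bits: int, threshold: int = 64) -> bool:
    """True when the pool is below a (hypothetical) comfort threshold."""
    return bits < threshold
```

Calling `is_starved(entropy_avail())` in a loop while the tests run would show whether hangs line up with the pool dropping towards 0.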
gary_posterDo I read this correctly?12:26
gary_posterread(11, "6\351d=\310\235\274\300", 4096) = 812:26
gary_posterprogram: "give me 4k bytes from fd 11"12:27
gary_posterkernel: "I can give you 8"12:27
gary_poster(and they are the ones in the buffer)12:28
gary_posterthe second arg12:28
gary_posterI'm pretty sure that is right12:28
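The reading of that strace line can be reproduced with a small sketch. It uses a pipe rather than /dev/random (an assumption made so the example never blocks) and shows that asking for 4096 bytes returns only the 8 that are ready.

```python
import os


def short_read_demo(available: bytes, requested: int = 4096) -> bytes:
    """Put `available` into a pipe, then ask for `requested` bytes back."""
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, available)
        # Like read(2): `requested` is an upper bound; the call returns
        # whatever is ready, which may be far fewer bytes.
        return os.read(read_fd, requested)
    finally:
        os.close(read_fd)
        os.close(write_fd)


data = short_read_demo(b"\x00" * 8)
print(len(data))  # number of bytes actually returned
```

The return value of the system call (the "= 8" in the strace output) is the count of bytes placed in the buffer, and the second argument shows those bytes.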
gary_posterI have another test run now12:28
gary_posterI'm really confused about the mapping though12:29
gary_posterwhat was it doing when it was (according to lsof) trying to get something from /rootfs/dev/random?12:29
gary_posterwell, supposedly it is still trying to do that now12:29
gary_posterbut I have symlinks in12:29
gary_posterbut...it should have fallen over because the file didn't exist, not hung waiting for output12:30
gary_postermy suspicion is that it is getting the random values actually from the proper place12:30
gary_posterand lsof is the one that is confused12:30
gary_posterif that (or any other scenario in which that dns code is actually reading /dev/random despite what lsof says) is happening12:31
gary_posterthen doing mapping games won't help12:31
frankbangary_poster: do you know if /dev/random inside ephemerals points to /dev/random of the host?12:36
gary_posterfrankban, oh, interesting.  I assumed it would, because I thought dev was mounted, but it is not.  proc and sysfs are mounted.  I don't know what the dev mechanism is.12:37
gary_posterI'm trying to kill everything now but I'll look at that again in a moment12:38
frankbangary_poster: I am trying now a test run using a nasty trick to feed the entropy in the host12:38
gary_posterfrankban, heh, cool.  I saw you could write to /dev/random and then manipulate...something to convince the system that it had enough.  good experiment, thanks! :-)12:40
frankbanwell, I am using rng-tools to feed /dev/random with data from /dev/urandom12:40
gary_posterok, everything is sufficiently dead...12:41
gary_posterheh, cool12:41
benjigary_poster: yes, you're reading that correctly; yeah, I bet your suspicion is right that /dev/random is really being read, but strace is reporting the file name incorrectly12:45
benjigary_poster: there is a mapping game that will help, remove /dev/random and recreate it but use the urandom kernel device12:46
bacfrankban: thanks for pointing out rng-tools...i didn't know about it.  have you seen http://www.howtoforge.com/helping-the-random-number-generator-to-gain-enough-entropy-with-rng-tools-debian-lenny12:52
bacso apparently generating enough entropy on servers is a known problem and we've made it worse by throwing 8 cores at it12:53
frankbanbac: yes, that's the trick.12:53
benjigary_poster: here is the incantation that will point /dev/random at /dev/urandom: rm /dev/random; mknod -m 0444 /dev/random c 1 912:54
gary_posterbenji ack.  one more minute away...12:54
gary_posterok... benji that should be done on the host I'm assuming?12:56
benjigary_poster: good question.  I was assuming we'd do it on the containers, but assuming the host's /dev/random feeds the containers, then at least we'd only have to do it in one place (and we can do it live)12:57
gary_posterwell, I could do it in the base container12:59
gary_posterthat seems mildly safer12:59
gary_posterok build started13:05
gary_posterbac benji frankban gmb call in 1 or 213:09
gary_poster(as soon as we all show up)13:09
gary_posterbenji starting without you13:16
* benji runs toward the horde!13:16
gary_posterhttps://docs.google.com/a/canonical.com/document/d/19Zn7fGkQH5oOpJkaU2lGCpt8RK5KiDpPBTKJKs50wWw/edit13:27
gary_posterok benji, I'm going to kill the test run and try again after changing the host's /dev/random13:37
bacgary_poster: document updated wrt generic code packaging14:10
gary_posterbenji, frankban, I have host and container using the urandom and it's not unblocking.  going to try duping frankban's rng-tools now.  apt-get install failed--perhaps it didn't like the /dev/random change.  undoing and retrying that.14:12
gary_posterbac, good, thank you14:12
benjihmm, it should not block14:13
gary_posterbenji, frankban's loop ("for i in $(seq 1000); do head -1 /dev/random > /dev/null; done") blocks too.  oh...wow14:15
gary_poster/dev/random is 1, 8 again!14:15
gary_posterwho did that!14:15
benjiheh14:15
benjithe mknod really should "stick", as far as I know those device files are created at boot and not touched afterward14:16
gary_posterseriously, though...yeah, it did not stick14:16
gary_posterI verified history14:16
gary_posterI changed it14:16
gary_posterit was 1, 914:16
gary_posterthen when I looked again just now14:17
gary_posterit was 1, 814:17
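A quick way to check which kernel device a path such as /dev/random really points at is to look at its major and minor numbers. The sketch below assumes Linux numbering, where character device 1, 8 is random and 1, 9 is urandom; the function names are made up for illustration.

```python
import os
import stat

# Linux character device numbers (major, minor) for the two RNG devices.
KNOWN = {(1, 8): "random", (1, 9): "urandom"}


def describe_rdev(mode: int, rdev: int) -> str:
    """Name the RNG device behind a stat result's st_mode and st_rdev."""
    if not stat.S_ISCHR(mode):
        return "not a character device"
    pair = (os.major(rdev), os.minor(rdev))
    return KNOWN.get(pair, "other character device %d, %d" % pair)


def describe_path(path: str) -> str:
    """Stat `path` (following symlinks) and describe the device behind it."""
    st = os.stat(path)
    return describe_rdev(st.st_mode, st.st_rdev)
```

Running `describe_path("/dev/random")` before and after a package install would show whether something recreated the node as 1, 8.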
frankbangary_poster: maybe the rng-tools installation changes them?14:18
gary_posteryeah I was wondering about that too14:18
gary_posterthere's not a command to see when a command was run, is there?14:19
gary_posterno, that wouldn't help anyway14:19
gary_posterin any case, this was hanging before I tried to install rng-tools14:20
gary_posterand is still hanging now14:20
gary_posterafter I re-switched /dev/random14:20
gary_posterfrankban, am I right in assuming that you did not see this?14:23
gary_posterSetting up rng-tools (2-unofficial-mt.14-1ubuntu1) ...14:23
gary_posterStarting Hardware RNG entropy gatherer daemon: (failed).14:23
gary_posterinvoke-rc.d: initscript rng-tools, action "start" failed.14:23
frankbanyes, you are14:23
gary_poster:-/14:23
gary_posterthat's after I reinstated the old /dev/random14:23
gary_posternot very informative14:24
benjigary_poster: if the file was open, switching it out won't make a difference (just like a regular file)14:28
gary_posterbenji, sure14:28
gary_posterbenji, I had made the switch before starting the test run14:28
benjiah, sorry; I got the impression that it was mid-hang14:29
gary_posterI did a variety of switches mid-hang :-) but I did the start of the experiment correctly, at least in this regard14:29
gary_posterthe mid-hang switches were of the "it's not working, so what shall I do now" variety14:29
frankbangary_poster: another build without hangs: http://ec2-50-17-161-214.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/1/steps/shell_8/logs/stdio14:54
frankbanaround 50 minutes this time too, but results are... weird...14:55
gary_posterfrankban, results look ok (roughly expected) to me.  what's the weird part?15:06
frankbangary_poster: 3 failures, I expected 5, always seen 515:06
frankbanbut that's maybe because tests are split and ordered differently each time15:09
gary_posterfrankban, yeah, that's my assumption.  difference between first and second is especially big15:10
frankbangary_poster: started my third run, if it goes well, I can conclude that 1. my trick works or 2. I've got a holy blessed ec2 instance15:12
gary_posterfrankban, :-) well I think based on your evidence we can conclude (1) the exhausted /dev/random hypothesis seems proven and then (2) one of your two options.  I'm working on duping your approach.  I think one can ignore the error message I got.  I modified /etc/default/rng-tools and then it started fine.  starting a run now15:16
gary_posterwow, I think that immediately kicked things into high gear15:17
frankbancool, third run completed in 51 minutes15:51
gary_postergreat frankban.  frankban, since you are ahead of me, how's this for an idea.  Hack testr (I can give a pointer, but easy to find) to make it think the machine only has 1 core so we can start to get a baseline for our tests.  Then, gather the three times and the three error reports for your three test runs.  We'll want the times to make averages (make sure you note which was the first, second and third please) and we'll want the failures to know what to fix15:56
gary_posterdoes that sound ok frankban?  or do you have better idea?15:56
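The arithmetic being asked for here is small enough to sketch. The function names are hypothetical, and no real measurements are assumed: the run times and the single-core baseline are inputs to be filled in once the runs finish.

```python
from statistics import mean


def mean_minutes(run_minutes):
    """Average wall-clock time over runs, given in first, second, third order."""
    return mean(run_minutes)


def speedup(single_core_minutes, parallel_minutes):
    """How many times faster the parallel run is than the single-core baseline."""
    return single_core_minutes / parallel_minutes
```

Keeping the list in run order preserves which run came first, which matters if the first run behaves differently from the later ones.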
gary_posterfrankban, I have duped a successful run on my ec2 box, using your approach.  yay!16:02
gary_posterrunning another one.16:02
frankbangary_poster: so, do you want me to save test reports and send them to yellow? and for running using one core: maybe hack testcommand.local_concurrency?16:08
gary_posterfrankban, I was thinking of you sending the buildbot report, but actually...16:22
gary_posteryes, testcommand.local_concurrency16:22
gary_poster...anyway, in build/.testrepository there should be a few files there now16:22
gary_postersome numbered ones are the subunit files16:22
gary_posterI think they might be the most useful16:22
gary_posterin addition to the timings, of course, which I would get from the equivalent of http://ec2-174-129-101-121.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/6/steps/shell_816:23
gary_posterfor that, I get to the page this way:16:24
gary_posterfrom waterfall, click on build link for the given build.  Then click on the "shell 8" link16:24
gary_posterthat will give the timing16:24
gary_posteras measured by buildbot16:24
frankbanah, I see gary_poster. btw, I've only those files in /var/lib/buildbot/slaves/slave/lucid-devel/build/.testrepository:16:26
frankban0  failing  format  next-stream  times.dbm16:26
gary_posterfrankban, huh :-/16:27
gary_posterOK frankban, maybe the buildbot output is the thing to get.  oh...interesting...16:28
gary_posteractually, we are wiping out the .testrepository directory every build16:29
gary_posterthis means that every single run is round-robin16:29
gary_posterbecause testrepository has no data to work with16:29
gary_posterthat's probably something we should fix16:29
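To illustrate why the missing timing data matters, here is a sketch comparing a round-robin split with a split that uses recorded durations. It is not testrepository's actual code: the function names and the greedy strategy are assumptions made for illustration.

```python
import heapq


def round_robin(tests, workers):
    """Deal tests out in order, ignoring how long each one takes."""
    return [tests[i::workers] for i in range(workers)]


def partition_by_duration(durations, workers):
    """Greedy split: give the longest remaining test to the least-loaded worker."""
    heap = [(0.0, i) for i in range(workers)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(workers)]
    # Longest first; ties broken by name so the result is deterministic.
    for name, secs in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (load + secs, idx))
    return buckets
```

With durations available, the workers finish at roughly the same time; without them, one unlucky worker can end up with most of the slow tests.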
frankbangary_poster: maybe that's because we use an overlayfs?16:30
gary_posterfor now, frankban, get that "0" file, which should be for your most recent (current?) run16:30
gary_posterfrankban, actually even more than that16:30
gary_posteror use the overlayfs, *and* buildbot wipes away the build directory for every build--we do a fresh checkout every time16:30
bacgary_poster: i just got sucked into an investigation of gpg key importing from a six month old question that laura asked me to follow up on.  but, good news, i figured out what was going on.16:31
gary_posterbac, yeah, I happened to see those emails, cool16:31
bacshort answer, the key server doesn't support old v3 keys and neither does LP, but we lie to the user and say his key doesn't exist.16:32
gary_posterright, I saw you say that this was a bad error message.  sure sounds like it16:32
gary_posterfrankban, I'm not sure why my last sentence started with the word "or". :-P  This conveys more of my meaning: "We use the overlayfs, *and* buildbot wipes away the build directory for every build--we do a fresh checkout every time"16:33
gary_posterso anyway, short term, grab the "0" file and grab the buildbot output16:34
frankbangary_poster: ok, I will send them together with the results of the single core test run.16:35
gary_posterfrankban, cool, thank you.16:35
frankbangary_poster: updated the lpsetup part of the document16:37
gary_posterfrankban, great, thank you very much.  bac, I saw you did the same for your part.  Thank you.16:37
gary_posteractually frankban. overlayfs doesn't have anything to do with it; it is entirely the buildbot issue16:37
gary_postertestr runs in host16:38
gary_postersorry, talking about the .testrepository directory16:38
benjigary_poster: if you have a minute, I have a draft of an LXC blog post that would appreciate any input you have17:54
gary_posterbenji, sure.  how would you like me to look at it17:54
benjigary_poster: how about this: https://pastebin.canonical.com/62744/17:55
gary_postercool, on it17:55
gary_posterbenji, I'd replace "three or four hours, depending on the hardware" with "six hours on our current continuous integration machines, for instance" or similar.  It's six hours there.17:57
benjiI wasn't aware of the exact time, thanks17:58
gary_posterAfter sentence "The ephemeral containers can then write to their local file systems without interfering with the others running simultaneously." I suggest something like this: "Because the file system changes are stored in memory, IO doesn't slow us down or block us, as it would in other similar situations."18:00
benjiI hadn't thought of highlighting that aspect, good idea.18:01
gary_posterLast sentence, suggest replacing "Even so we have already shortened a full test run on an eight-core EC2 instance down to 45 minutes." with something like "Even so, our current results, with only a handful of test failures per run, are running on an eight-core EC2 instance in about 55 minutes."  I'd also suggest waiting on frankban's numbers for running the tests on a single core on the same machine: that will give us a real statistic to give in the blog post, which is probably highly pertinent given our audience.  That should be done today, hopefully, so it won't block you long.18:03
gary_posterBenji, very nice post.  Thank you.18:04
gary_posters/Benji/benji/ :-P18:04
benjiheh18:04
gary_posterThe blog post got me in full-sentence mode.18:05
benjiwell, at least my PC waited until the call was almost over to crash19:45
gary_posterheh19:53
gary_posterbenji, reread blog post. I wonder if you should omit mention of ssh ("ssh ubuntu@test"), and tell people to log in with ubuntu:ubuntu.  Maybe you should also mention that we are doing this on precise?20:06
benjigary_poster: things look good on the blog post front, danhg says "this is just what I was after, much appreciated"20:39
gary_posterbenji, excellent, thank you20:39
bacgary_poster: i've determined the translations bug is not an isolation bug but a spurious failure.  in fact, running the set of tests present in the original merge proposal where the test was introduced (33 tests), the test fails frequently.20:47
gary_posterbac, huh.  Disable?20:47
gary_posterif all else fails, perhaps20:48
bacmy question is why is it not being seen in buildbot?20:48
bacgary_poster: yes, if i can't figure it out by tomorrow morning i'll punt and disable it20:49
gary_posterbac, I figured.  Spurious bugs can have different characteristics in different environments, of course20:50
gary_posterI figured that's what you were wondering, I mean20:50
gary_posterThe nature of a spurious bug is that it is tied to something environmental that it should not be, I'd argue20:51
