/srv/irclogs.ubuntu.com/2012/03/21/#launchpad-yellow.txt

=== danilo_ is now known as danilos
bac	morning	11:10
bac	http://mozillamemes.tumblr.com/post/19333515188/we-love-you-nigelb	11:12
frankban	:-) hi bac	11:27
benji	gary_poster: bug 959352 seems to be getting stranger and stranger	12:22
gary_poster	benji, yeah, really	12:23
gary_poster	benji, I have a machine up now	12:23
gary_poster	ran tests on it through the night	12:23
gary_poster	I'm confused	12:24
benji	gary_poster: I just replied to your other email about the blocking, you may want to read it now if you're getting ready for a run.	12:24
gary_poster	entropy went down to 0 a couple of times	12:24
gary_poster	yeah I did	12:24
gary_poster	but giving you more data now :-)	12:24
gary_poster	entropy went down to 0 a few times	12:24
gary_poster	but didn't stay there for long	12:25
gary_poster	while it was hanging it had gotten up to 51	12:25
gary_poster	down to 0 again now	12:25
gary_poster	Do I read this correctly?	12:26
gary_poster	read(11, "6\351d=\310\235\274\300", 4096) = 8	12:26
gary_poster	program: "give me 4k bytes from fd 11"	12:27
gary_poster	kernel: "I can give you 8"	12:27
gary_poster	(and they are the ones in the buffer)	12:28
gary_poster	the second arg	12:28
gary_poster	I'm pretty sure that is right	12:28
gary_poster	I have another test run now	12:28
gary_poster	I'm really confused about the mapping though	12:29
gary_poster	what was it doing when it was (according to lsof) trying to get something from /rootfs/dev/random?	12:29
gary_poster	well, supposedly it is still trying to do that now	12:29
gary_poster	but I have symlinks in	12:29
gary_poster	but...it should have fallen over because the file didn't exist, not hung waiting for output	12:30
gary_poster	my suspicion is that it is getting the random values actually from the proper place	12:30
gary_poster	and lsof is the one that is confused	12:30
gary_poster	if that (or any other scenario in which that dns code is actually reading /dev/random despite what lsof says) is happening	12:31
gary_poster	then doing mapping games won't help	12:31
frankban	gary_poster: do you know if /dev/random inside ephemerals points to /dev/random of the host?	12:36
gary_poster	frankban, oh, interesting. I assumed it would, because I thought dev was mounted, but it is not. proc and sysfs are mounted. I don't know what the dev mechanism is.	12:37
gary_poster	I'm trying to kill everything now but I'll look at that again in a moment	12:38
frankban	gary_poster: I am trying now a test run using a nasty trick to feed the entropy in the host	12:38
gary_poster	frankban, heh, cool. I saw you could write to /dev/random and then manipulate...something to convince the system that it had enough. good experiment, thanks! :-)	12:40
frankban	well, I am using rng-tools to feed /dev/random with data from /dev/urandom	12:40
gary_poster	ok, everything is sufficiently dead...	12:41
gary_poster	heh, cool	12:41
benji	gary_poster: yes, you're reading that correctly; yeah, I bet your suspicion is right that /dev/random is really being read, but strace is reporting the file name incorrectly	12:45
benji	gary_poster: there is a mapping game that will help, remove /dev/random and recreate it but use the urandom kernel device	12:46
bac	frankban: thanks for pointing out rng-tools...i didn't know about it. have you seen http://www.howtoforge.com/helping-the-random-number-generator-to-gain-enough-entropy-with-rng-tools-debian-lenny	12:52
bac	so apparently generating enough entropy on servers is a known problem and we've made it worse by throwing 8 cores at it	12:53
frankban	bac: yes, that's the trick.	12:53
benji	gary_poster: here is the incantaion that will point /dev/random at /dev/urandom: rm /dev/random; mknod -m 0444 /dev/random c 1 9	12:54
gary_poster	benji ack. one more minute away...	12:54
gary_poster	ok... benji that should be done on the host I'm assuming?	12:56
benji	gary_poster: good question. I was assuming we'd do it on the containers, but assuming the host's /dev/random feeds the containers, then at least we'd only have to do it in one place (and we can do it live)	12:57
gary_poster	well, I could do it in the base container	12:59
gary_poster	that seems mildly safer	12:59
gary_poster	ok build started	13:05
gary_poster	bac benji frankban gmb call in 1 or 2	13:09
gary_poster	(as soon as we all show up)	13:09
gary_poster	benji starting without you	13:16
* benji runs toward the horde!		13:16
gary_poster	https://docs.google.com/a/canonical.com/document/d/19Zn7fGkQH5oOpJkaU2lGCpt8RK5KiDpPBTKJKs50wWw/edit	13:27
gary_poster	ok benji, I'm going to kill the test run and try again after changing the host's /dev/random	13:37
bac	gary_poster: document updated wrt generic code packaging	14:10
gary_poster	benji, frankban, I have host and container using the urandom and it's not unblocking. going to try duping frankban's rng-tools now. apt-get install failed--perhaps it didn't like the /dev/random change. undoing and retrying that.	14:12
gary_poster	bac, good, thank you	14:12
benji	hmm, it should not block	14:13
gary_poster	benji, frankban's loop ("for i in $(seq 1000); do head -1 /dev/random > /dev/null; done") blocks too. oh...wow	14:15
gary_poster	/dev/random is 1, 8 again!	14:15
gary_poster	who did that!	14:15
benji	heh	14:15
benji	the mknod really should "stick", as far as I know those device files are created at boot and not touched afterward	14:16
gary_poster	seriously, though...yeah, it did not stick	14:16
gary_poster	I verified history	14:16
gary_poster	I changed it	14:16
gary_poster	it was 1, 9	14:16
gary_poster	then when I looked again just now	14:17
gary_poster	it was 1, 8	14:17
frankban	gary_poster: maybe the rng-tools installation changes them?	14:18
gary_poster	yeah I was wondering about that too	14:18
gary_poster	there's not a command to see when a command was run, is there?	14:19
gary_poster	no, that wouldn't help anyway	14:19
gary_poster	in any case, this was hanging before I tried to install rng-tools	14:20
gary_poster	and is still hanging now	14:20
gary_poster	after I re-switched /dev/random	14:20
gary_poster	frankban, am I right in assuming that you did not see this?	14:23
gary_poster	Setting up rng-tools (2-unofficial-mt.14-1ubuntu1) ...	14:23
gary_poster	Starting Hardware RNG entropy gatherer daemon: (failed).	14:23
gary_poster	invoke-rc.d: initscript rng-tools, action "start" failed.	14:23
frankban	yes, you are	14:23
gary_poster	:-/	14:23
gary_poster	that's after I reinstated the old /dev/random	14:23
gary_poster	not very informative	14:24
benji	gary_poster: if the file was open, switching it out won't make a difference (just like a regular file)	14:28
gary_poster	benji, sure	14:28
gary_poster	benji, I had made the switch before starting the test run	14:28
benji	ah, sorry; I got the impression that it was mid-hang	14:29
gary_poster	I did a variety of switches mid-hang :-) but I did the start of the experiment correctly, at least in this regard	14:29
gary_poster	the mid-hang switches were of the "it's not working, so what shall I do now" variety	14:29
frankban	gary_poster: another build without hangs: http://ec2-50-17-161-214.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/1/steps/shell_8/logs/stdio	14:54
frankban	around 50 minutes this time too, but results are... weird...	14:55
gary_poster	frankban, results look ok (roughly expected) to me. what's the weird part?	15:06
frankban	gary_poster: 3 failures, I expected 5, always seen 5	15:06
frankban	but that's maybe because tests are split and ordered differently each time	15:09
gary_poster	frankban, yeah, that's my assumption. difference between first and second is especially big	15:10
frankban	gary_poster: started my third run, if it goes well, I can conclude that 1. my trick works or 2. I've got a holy blessed ec2 instance	15:12
gary_poster	frankban, :-) well I think based on your evidence we can conclude (1) the exhausted /dev/random hypothesis seems proven and then (2) one of your two options. I'm working on duping your approach. I think one can ignore the error message I got. I modified /etc/default/rng-tools and then it started fine. starting a run now	15:16
gary_poster	wow, I think that immediately kicked things into high gear	15:17
frankban	cool, third run completed in 51 minutes	15:51
gary_poster	great frankban. frankban, since you are ahead of me, how's this for an idea. Hack testr (I can give a pointer, but easy to find) to make it think the machine only has 1 core so we can start to get a baseline for our tests. Then, gather the three times and the three error reports for your three test runs. We'll want the times to make averages (make sure you note which was the first, second and third please) an	15:56
gary_poster	d we'll want the failures to know what to fix	15:56
gary_poster	does that sound ok frankban? or do you have better idea?	15:56
gary_poster	frankban, I have duped a successful run on my ec2 box, using your approach. yay!	16:02
gary_poster	running another one.	16:02
frankban	gary_poster: so, do you want me to save test reports and send them to yellow? and for running using one core: maybe hack testcommand.local_concurrency?	16:08
gary_poster	frankban, I was thinking of you sending the buildbot report, but actually...	16:22
gary_poster	yes, testcommand.local_concurrency	16:22
gary_poster	...anyway, in build/.testrepository there should be a few files there now	16:22
gary_poster	some numbered ones are the subunit files	16:22
gary_poster	I think they might be the most useful	16:22
gary_poster	in addition to the timings, of course, which I would get from the equivalent of http://ec2-174-129-101-121.compute-1.amazonaws.com:8010/builders/lucid_lp/builds/6/steps/shell_8	16:23
gary_poster	for that, I get to the page this way:	16:24
gary_poster	from waterfall, click on build link for the given build. Then click on the "shell 8" link	16:24
gary_poster	that will give the timing	16:24
gary_poster	as measured by buildbot	16:24
frankban	ah, I see gary_poster. btw, I've only those files in /var/lib/buildbot/slaves/slave/lucid-devel/build/.testrepository:	16:26
frankban	0 failing format next-stream times.dbm	16:26
gary_poster	frankban, huh :-/	16:27
gary_poster	OK frankban, maybe the buildbot output is the thing to get. oh...interesting...	16:28
gary_poster	actually, we are wiping out the .testrepository directory every build	16:29
gary_poster	this means that every single run is round-robin	16:29
gary_poster	because testrepository has no data to work with	16:29
gary_poster	that's probably something we should fix	16:29
frankban	gary_poster: maybe that's because we use an overlayfs?	16:30
gary_poster	for now, frankban, get that "0" file, which should be for your most recent (current?) run	16:30
gary_poster	frankban, actually even more than that	16:30
gary_poster	or use the overlayfs, and buildbot wipes away the build directory for every build--we do a fresh checkout every time	16:30
bac	gary_poster: i just got sucked into an investigation of gpg key importing from a six month old question that laura asked me to follow up on. but, good news, i figured out what was going on.	16:31
gary_poster	bac, yeah, I happened to see those emails, cool	16:31
bac	short answer, the key server doesn't support old v3 keys and neither does LP, but we lie to the user and say his key doesn't exist.	16:32
gary_poster	right, I saw you say that this was a bad error message. sure sounds like it	16:32
gary_poster	frankban, I'm not sure why my last sentence sentence started with the word "or". :-P This conveys more of my meaning: "We use the overlayfs, and buildbot wipes away the build directory for every build--we do a fresh checkout every time"	16:33
gary_poster	so anyway, short term, grab the "0" file and grab the buildbot output	16:34
frankban	gary_poster: ok, I will send them together with the results of the single core test run.	16:35
gary_poster	frankban, cool, thank you.	16:35
frankban	gary_poster: updated the lpsetup part of the document	16:37
gary_poster	frankban, great, thank you very much. bac, I saw you did the same for your part. Thank you.	16:37
gary_poster	actually frankban. overlayfs doesn't have anything to do with it; it is entirely the buildbot issue	16:37
gary_poster	testr runs in host	16:38
gary_poster	sorry, talking about the .testrepository directory	16:38
benji	gary_poster: if you have a minute, I have a draft of an LXC blog post that would appreciate any input you have	17:54
gary_poster	benji, sure. how would you like me to look at it	17:54
benji	gary_poster: how about this: https://pastebin.canonical.com/62744/	17:55
gary_poster	cool, on it	17:55
gary_poster	benji, I'd r eplace "three or four hours, depending on the hardware" with "six hours on our current continuous integration machines, for instance" or similar. It's six hours there.	17:57
benji	I wasn't aware of the exact time, thanks	17:58
gary_poster	After sentence "The ephemeral containers can then write to their local file systems	18:00
gary_poster	without interfering with the others running simultaneously." I suggest something like this: "Because the file system changes are stored in memory, IO doesn't slow us down or block us, as it would in other similar situations."	18:00
benji	I hadn't thought of higlighting that aspect, good idea.	18:01
gary_poster	Last sentence, suggest replacing "Even so we have already shortened a full test run on an	18:03
gary_poster	eight-core EC2 instance down to 45 minutes." with something like "Even so, our current results, with only a handful of test failures per run, are running on an eight-core EC2 instance in about 55 minutes." I'd also suggest waiting on frankban's numbers for running the tests on a single core on the same machine: that will give us a real statistic to give in the blog post, wich is probably highly pertinent given o	18:03
gary_poster	ur audience. That should be done today, hopefully, so it won't block you long.	18:03
gary_poster	Benji, very nice post. Thank you.	18:04
gary_poster	s/Benji/benji/ :-P	18:04
benji	heh	18:04
gary_poster	The blog post got me in full-sentence mode.	18:05
benji	well, at least my PC waited until the call was almost over to crash	19:45
gary_poster	heh	19:53
gary_poster	benji, reread blog post. I wonder if you should omit mention of ssh ("ssh ubuntu@test"), and tell people to log in with ubuntu:ubuntu. Maybe you should also mention that we are doing this on precise?	20:06
benji	gary_poster: things look good on the blog post front, danhg says "this is just what I was after, much appreciated"	20:39
gary_poster	benji, excellent, thank you	20:39
bac	gary_poster: i've determined the translations but is not an isolation bug but a spurious failure. in fact, running the set of tests present in the original merge proposal where the test was introduced (33 tests), the test fails frequently.	20:47
gary_poster	bac, huh. Disable?	20:47
gary_poster	if all else fails, perhaps	20:48
bac	my question is why is it not being seen in buildbot?	20:48
bac	gary_poster: yes, if i can't figure it out my tomorrow morning i'll punt and disable it	20:49
gary_poster	bac, I figured. Spurious bugs can have different characteristics in different environments, of course	20:50
gary_poster	I figured that's what you were wondering, I mean	20:50
gary_poster	The nature of a spurious bug is that it is tied to something environmental that it should not be, I'd argue	20:51

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!