/srv/irclogs.ubuntu.com/2010/11/17/#launchpad-dev.txt

cody-somerville	wgrant, https://edge.launchpad.net/builders	00:14
thumper	StevenK: ^^	00:17
wgrant	fuuuuuuuuuu	00:19
wgrant	That was meant to be fixed.	00:19
wgrant	And we can't rollback after tomorrow. :/	00:20
wgrant	spm: Around?	00:20
elmo	we can cherrypick rollback	00:20
elmo	he doesn't appear to be	00:20
elmo	what do you need?	00:20
wgrant	We have no StevenK this week, AIUI.	00:20
wgrant	The new buildd-manager is still horribly broken. Can you see if there's anything interesting in its log?	00:21
elmo	yes, all sorts of crap	00:21
wgrant	Well, that's relatively more opaque than I would have hoped.	00:24
thumper	are we running the new buildmanager then?	00:24
wgrant	We are.	00:24
pjdc	wgrant: are we looking for anything in particular in the buildd manager log?	00:25
wgrant	pjdc: I have a section of it. It's not very helpful.	00:25
thumper	any idea where the log is on devpad?	00:25
thumper	wgrant: do you know which machine the buildd manager runs on?	00:25
wgrant	thumper: cesium	00:26
wgrant	I have bits of the log already.	00:26
pjdc	thumper: looks like they land in devpad:/x/launchpad.net-logs/production/cesium	00:26
thumper	pjdc: yep, there	00:26
wgrant	Twisted errors make me sad.	00:27
wgrant	I wonder if the fermium connection error at 2010-11-17 00:08:13 is the root.	00:28
wgrant	Oh, no.	00:28
wgrant	It was a few minutes before that.	00:28
wgrant	Sad.	00:28
wgrant	Can you see where it started?	00:29
wgrant	pjdc: There have been no known network glitches this morning?	00:29
pjdc	wgrant: not to my knowledge.	00:30
wgrant	:(	00:30
thumper	2010-11-16 00:01:03+0000 [-] <titanium:http://titanium.ppa:8221/> communication failed (User timeout caused connection failure.)	00:31
thumper	2010-11-16 00:01:03+0000 [-] <titanium:http://titanium.ppa:8221/> failure (None)	00:31
thumper	2010-11-16 00:01:03+0000 [-] <nannyberry:http://nannyberry.ppa:8221/> communication failed (User timeout caused connection failure.)	00:31
thumper	2010-11-16 00:01:03+0000 [-] <nannyberry:http://nannyberry.ppa:8221/> failure (None)	00:31
thumper	2010-11-16 00:01:03+0000 [-] <charichuelo:http://charichuelo.ppa:8221/> communication failed (User timeout caused connection failure.)	00:31
pjdc	hmm, i could check if fermium was airlocked around the	00:31
thumper	2010-11-16 00:01:03+0000 [-] <charichuelo:http://charichuelo.ppa:8221/> failure (None)	00:31
thumper	2010-11-16 00:01:03+0000 [-] <thorium:http://thorium.ppa:8221/> communication failed (User timeout caused connection failure.)	00:31
thumper	2010-11-16 00:01:03+0000 [-] <thorium:http://thorium.ppa:8221/> failure (None)	00:31
thumper	seeing things like that	00:31
thumper	hmm...	00:31
thumper	perhaps a pastebin would be better	00:31
wgrant	thumper: Isn't that >24 hours ago?	00:31
thumper	wgrant: yep, I'm looking for failures	00:31
thumper	hmm...	00:34
thumper	buildd-manager.log-20101116 has up to 2010-11-16 00:01:05+0000	00:34
wgrant	Yeah, they're named sort of wrongly.	00:35
thumper	buildd-manager.log starts at 2010-11-17 00:00:45+0000	00:35
wgrant	They're named after the day they're rotated.	00:35
thumper	and no log file in the middle	00:35
wgrant	There's no -20101117?	00:35
thumper	nope	00:35
wgrant	Um.	00:35
thumper	pjdc: can you see a -20101117 file on cesium?	00:36
pjdc	thumper: looking	00:36
pjdc	thumper: yes, there's a buildd-manager.log-20101117	00:36
thumper	pjdc: can you get that to devpad plz?	00:36
thumper	according to the graph, we started losing builders approx 12 hours ago	00:37
wgrant	Does LPS say when cesium was updated?	00:38
pjdc	thumper: landed in /tmp	00:38
thumper	pjdc: ta	00:39
thumper	Scanning failed with: <Fault 8002: 'error'> <-- look suspect	00:40
wgrant	That's fairly normal.	00:40
thumper	Unhandled error in Deferred:	00:40
wgrant	That's not.	00:40
thumper	that's probably not	00:40
thumper	2010-11-16 23:51:34+0000 [-] builder promethium failure count: 5, job 'amd64 build of widelands 1:15bzr5723-ppa1~natty1 in ubuntu natty RELEASE' failure count: 1	00:41
thumper	2010-11-16 23:51:34+0000 [-] Scanning failed with: User timeout caused connection failure.	00:41
wgrant	There are hundreds of those.	00:41
wgrant	And that's not the first.	00:41
thumper	Failure: twisted.internet.defer.CancelledError:	00:42
thumper	??	00:42
thumper	no	00:42
thumper	it isn't the first	00:42
thumper	I'm looking for suspect lines	00:42
thumper	is there a log entry for disabling a builder?	00:43
wgrant	There used to be.	00:44
wgrant	I never got around to reviewing the full 6000 line diff of this branch, though :/	00:44
* wgrant looks.		00:44
thumper	Miserable failure when trying to examine failure counts: :-)	00:45
wgrant	.... wow.	00:45
wgrant	except:	00:45
wgrant	GRGWOGIFJEWF	00:45
wgrant	Anyway, what was the miserable failure?	00:46
wgrant	thumper: You'll be pleased to know that not all failBuilder callsites log.	00:47
thumper	wgrant: I'm grepping	00:47
thumper	wgrant: I can kinda tell	00:47
thumper	I suppose we should do a launchpad status thingy	00:47
wgrant	But we should be able to tell from the logs.	00:47
wgrant	Since only two of the callsites are unobvious, and one has enough logging around it that we should be able to work it out.	00:48
thumper	wgrant: can you see devpad?	00:48
wgrant	thumper: Not for another few weeks :(	00:48
thumper	just wondering, I'll copy the logfile to people.c.c	00:48
wgrant	Thanks.	00:49
thumper	hmm...	00:50
thumper	I'm getting refused ssh	00:50
wgrant	That's excellent.	00:50
thumper	hmm... something screwy is going on...	00:52
thumper	ok, it's on its way up	00:52
thumper	wgrant: I'm not entirely sure what to look for	00:53
wgrant	thumper: Thanks.	00:53
* wgrant examines.		00:53
wgrant	It's tempting to add more explicit logging, restart it, and hope it breaks again.	00:54
thumper	test rollout of 11926 on 2010-11-16	00:54
thumper	that's from LPS	00:55
wgrant	I was hoping for higher granularity. But I guess the log should help with that.	00:55
thumper	bug https://bugs.launchpad.net/soyuz/+bug/671242	00:55
_mup_	Bug #671242: New buildd-manager disabling everything in sight <qa-ok> <Soyuz:Fix Committed by julian-edwards> <https://launchpad.net/bugs/671242>	00:55
wgrant	10:03, by the look of things.	00:55
wgrant	thumper: Right, 11888 was deployed, but that broke with that bug.	00:56
wgrant	11926 fixed that.	00:56
wgrant	But we now have another one.	00:56
thumper	this is a different problem?	00:56
wgrant	Yes.	00:57
wgrant	I believe.	00:57
=== thumper changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 | firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting
wgrant	I think this problem may be partially described in that bug, but it's not the one that was identified and fixed.	00:57
wgrant	So, we have major problems at 14:41:45, 15:03:20, 16:31:22 and 23:39:11, and possibly 18:11:22 and 19:52:18.	01:03
wgrant	23:39:11 is the big one which took out everything.	01:03
wgrant	Each major failure starts with a single scan failure, then a huge number 9 seconds later.	01:04
thumper	elmo: ping?	01:04
wgrant	He left a while ago.	01:05
thumper	:(	01:05
thumper	I think we are losaless	01:05
wgrant	pjdc should be around, though?	01:05
thumper	pjdc: what do you know about LP deployment?	01:05
wgrant	If you want to revert, we don't need a full deployment. It's just a symlink change and restart of buildd-manager.	01:06
pjdc	thumper: not much. i assisted elmo with an emergency cowboy about a year ago, but that's about it.	01:06
thumper	:(	01:06
thumper	lifeless: are you really there?	01:06
thumper	pjdc: on cesium there was a rollout yesterday	01:06
thumper	pjdc: I'm hoping that they kept the old code around	01:06
wgrant	thumper: Last time that was the case.	01:07
thumper	pjdc: as the new code is disabling all the builders	01:07
wgrant	thumper: A LOSA just flipped the symlink back to the 10.10 rollout.	01:07
wgrant	soyuz-production-rev-9886	01:07
thumper	wgrant: if the buildd-manager is restarted, will it recheck the disabled builders?	01:07
pjdc	thumper: if i'm looking in the right place, there are two trees, 11926 and 9886	01:07
lifeless	thumper: yes	01:07
pjdc	thumper: 9886 looks pretty old	01:07
thumper	pjdc: 11926 is the broken one	01:08
thumper	pjdc: 9886 will be from db-devel	01:08
thumper	so... probably the last rollout	01:08
thumper	Wed 13th of Oct	01:08
thumper	that's the date on rev 9886 of db-devel	01:08
wgrant	thumper: No, we'll have to flip a flag on each to get them back.	01:09
wgrant	9886 is what we reverted to after 11888 failed. It's the last rollout.	01:09
thumper	pjdc: can you do that?	01:09
thumper	wgrant: how do we re-enable the builders?	01:10
pjdc	thumper: so that'd be change the symlink, restart the service?	01:11
thumper	pjdc: AFAIK	01:11
pjdc	thumper: i take it things can't get worse at this point?	01:11
lifeless	thumper: again? nooo	01:11
lifeless	this bodes badly for the rollout tomorrow.	01:11
lifeless	like really.	01:11
lifeless	thumper: 9886 is fine, its the last db-stable deploy	01:11
lifeless	wgrant: you got log files etc - whats up ?	01:11
lifeless	it was stable for hours - did this happen recently? could it be librarian changes? [hope not]	01:11
lifeless	.	01:11
wgrant	Yeah, cesium is as broken as it can be.	01:11
wgrant	lifeless: I have the log change.	01:12
wgrant	s/change//	01:12
lifeless	whats causing this	01:12
lifeless	before we change stuff	01:12
wgrant	It was mostly OK for 4 hours.	01:12
wgrant	After 13 hours it just completely melted down.	01:12
wgrant	Still looking to see if I can get anything useful from the logs.	01:13
lifeless	can we just restart it in the meantime and toggle the builds back on ?	01:13
wgrant	thumper: The 'Builder OK' flag on Builder:+edit does it. Otherwise there might be a script around.	01:13
lifeless	or will it kill them immediately ?	01:13
wgrant	lifeless: I guess we could try that.	01:13
lifeless	we need to figure out what happens tomorrow @ rollout time	01:13
lifeless	which is what, 8 hours away	01:14
thumper	lifeless: right now it is a release blocker IMO	01:14
thumper	I'm starting an incident report	01:14
wgrant	lifeless: I think we probably cherrypick 11808, 11815 and 11926 off cesium.	01:14
lifeless	wgrant: 11926 itself is a problem ?	01:14
wgrant	lifeless: No, but 11808 probably won't revert unless we revert 11926 first.	01:15
wgrant	Er.	01:15
wgrant	Not 11926.	01:15
wgrant	That other one.	01:15
lifeless	sorry, I left my mind reader in asia	01:15
wgrant	The fix for the issue that caused us to roll cesium back from 11888.	01:16
wgrant	11898	01:16
wgrant	So 11808, 11815 and 11898.	01:17
wgrant	Do we know if enablement has pulled any buildds today?	01:19
lifeless	cody-somerville: ^	01:19
pjdc	wgrant: it's been quiet since the 12th, as far as i can tell	01:19
wgrant	It doesn't explain everything (since non-virt buildds had the same error), but it might be something.	01:20
lifeless	wgrant: whats the error	01:20
wgrant	lifeless: 2010-11-16 16:31:13+0000 [-] Scanning failed with: User timeout caused connection failure.	01:20
wgrant	2010-11-16 16:31:13+0000 [-] Traceback (most recent call last):	01:20
wgrant	2010-11-16 16:31:13+0000 [-] Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.	01:20
wgrant	lifeless: In most of the major failures in the log, there is one of those followed by dozens 9 seconds later.	01:21
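The pattern wgrant describes (one "Scanning failed with" line, then dozens about 9 seconds later) can be pulled out of a buildd-manager log mechanically. A minimal Python sketch, assuming the line format quoted above with a +0000 offset; the function names and the burst threshold of 10 are illustrative and not part of Launchpad:

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# 2010-11-16 16:31:13+0000 [-] Scanning failed with: User timeout caused connection failure.
# The bracketed field may be "-" or a name like "Uninitialized".
SCAN_FAILURE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\+0000 \[[^\]]*\] Scanning failed with: "
)


def failures_per_second(lines):
    """Count 'Scanning failed with' lines for each one-second timestamp."""
    counts = Counter()
    for line in lines:
        match = SCAN_FAILURE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            counts[stamp] += 1
    return counts


def find_bursts(counts, threshold=10):
    """Return (timestamp, count) pairs, in time order, with at least `threshold` failures."""
    return [(stamp, n) for stamp, n in sorted(counts.items()) if n >= threshold]
```

Running find_bursts(failures_per_second(open(path))) over a log file would list the seconds in which many builders failed at once; the lone failure that precedes each burst then shows up in the per-second counts a few seconds earlier.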
thumper	lifeless: do you object to rolling back the buildd-manager code on cesium?	01:22
wgrant	Perhaps we should try enabling things and see if they stay alive for long.	01:22
mwhudson	does the buildd-manager still do blocking things?	01:23
lifeless	thumper: I want to be sure we understand it	01:23
wgrant	Only when downloading files from slaves, I believe.	01:23
thumper	mwhudson: I believe it is twisted now	01:23
wgrant	mwhudson: ^^	01:23
lifeless	pjdc: can you please:	01:23
thumper	wgrant: I thought that jelmer fixed that	01:23
mwhudson	thumper: fully? it's been somewhat twisted for a long time	01:24
* thumper needs food		01:24
lifeless	- restart the builddmanager	01:24
wgrant	thumper: jelmer fixed it so it uploads the downloaded files asynchronously.	01:24
lifeless	- reenable a couple of fast buildds	01:24
lifeless	- see what happens over a few minutes	01:24
wgrant	thumper: A branch is coming to download them async, too, but it's not done yet.	01:24
lifeless	thumper: go eat, nothing will change radically while you eat	01:24
pjdc	i'm not too familiar with the buildd pool. can someone suggest candidates?	01:24
* pjdc picks 3 amd64 official builders		01:25
wgrant	pjdc: A fairly random pick of the various categories: roseapple, allspice, doubah, samarium	01:25
wgrant	Couple of new non-virt, and an old and new virt.	01:26
pjdc	works for me	01:26
pjdc	restarting buildd-manager	01:26
pjdc	started, doing the buildds now	01:27
wgrant	Hm.	01:28
wgrant	Maybe we should turn the logging up.	01:28
wgrant	(lib/lp/buildmaster/manager.py, s/logging.INFO/logging.DEBUG/)	01:28
pjdc	re-enabled those four, plus yellow and crested since i had the tabs all ready	01:28
wgrant	Great.	01:29
wgrant	There are some odd five minute gaps in the log, and it would be nice to know if it actually does anything in them.	01:29
thumper	pjdc: the queue for amd64 is empty though	01:42
thumper	not sure if that'll show much	01:42
pjdc	thumper: looks like doubah's done the business though, showing as disabled again	01:43
wgrant	The builders were failed regardless of whether there was anything to build or not.	01:43
wgrant	Oh, already?	01:43
thumper	oh ok	01:43
lifeless	wgrant: does the buildd manager read from the librarian ?	01:44
wgrant	lifeless: I don't think so.	01:44
wgrant	I can't think why it would.	01:44
lifeless	do builders ?	01:44
wgrant	Yes.	01:44
lifeless	what code path do they use to get their urls ?	01:44
wgrant	Ahh. cesium provides them, I believe.	01:45
wgrant	But it doesn't use the restricted librarian.	01:45
lifeless	even for security builds etc?	01:45
wgrant	No -- private build files are retrieved from the archive.	01:45
wgrant	Since the builders can't have restricted librarian access.	01:46
wgrant	(well, I guess they could now)	01:46
lifeless	just to be sure	01:46
lifeless	pjdc: are we seeing access denied for 91.189.89.189 or 91.189.89.188 from the builders that fail (or from cesium for that matter)	01:46
lifeless	wgrant: what time was the first builder disabled ?	01:46
wgrant	lifeless: Can't tell. But the first major incident was probably 14:41:45. 23:39:11 was the really big one.	01:47
pjdc	lifeless: cesium can't connect to those IPs on 80 and 443	01:47
elmo	oh	01:48
elmo	is this the restricted librarian	01:48
lifeless	pjdc: ah, but is it trying	01:48
elmo	aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa	01:48
elmo	god damn it	01:48
pjdc	lifeless: i'll check	01:48
lifeless	elmo: there are two code paths for it in lp. internal stuff like the merge proposal diff code will use lp.internal still	01:48
wgrant	cesium will be wanting to upload to it. Is that also going to be broken?	01:48
elmo	188 IN PTR wildcard-restricted-launchpadlibrarian-net.banana.canonical.com.	01:49
lifeless	elmo: I really really wouldn't expect this to be connected, but Just In Case.	01:49
lifeless	wgrant: no, uploads have not altered at all	01:49
wgrant	That's what I thought.	01:49
lifeless	wgrant: are those times UTC?	01:49
wgrant	lifeless: Yes.	01:49
lifeless	ok, so 9 hours apart.	01:50
wgrant	There were some in between.	01:50
lifeless	(range of) unless we're seeing two different things	01:50
lifeless	then its not the publicrestricted librarian work	01:50
wgrant	When did that happen?	01:50
pjdc	lifeless: a few rejects (5 total) for cesium going to both IPs on 80 and 443	01:50
lifeless	wgrant: and really? sed to change log levels ?	01:50
wgrant	lifeless: Yup :D	01:51
lifeless	pjdc: thats -very- interesting	01:51
lifeless	wgrant: I can has bug, and fix.	01:51
lifeless	elmo: did you just remove the publicrestricted feature flag?	01:51
pjdc	lifeless: sorry, false alarm. those were my test attempts.	01:52
lifeless	elmo: from https://launchpad.net/+feature-rules	01:52
elmo	lifeless: I haven't touched anything	01:52
lifeless	interesting	01:52
lifeless	cause the setting is gone ;)	01:52
wgrant	Hmm.	01:53
wgrant	It's possible that translations jobs might read from the librarian.	01:53
wgrant	I don't know them well.	01:53
* wgrant looks.		01:53
wgrant	But they wouldn't be private.	01:53
lifeless	pjdc: did you?	01:55
lifeless	nb we really need that audit log	01:55
pjdc	lifeless: sorry, did i what?	01:55
lifeless	sounds like a no to me	01:56
lifeless	ok, one thing at a time	01:57
lifeless	uhm	01:57
wgrant	We should probably have an 'assert not libraryfilealias.restricted' in BuilderSlave.cacheFile.	01:57
lifeless	pjdc: please reenable doubah again.	01:57
wgrant	But I doubt that's the problem here.	01:57
pjdc	lifeless: ok	01:58
lifeless	wgrant: or enable it	01:58
* thumper has to go get kids		01:58
wgrant	lifeless: Hm?	01:58
lifeless	pjdc: if this fails, its not the publicrestricted librarian.	01:58
pjdc	lifeless: doubah re-enabled	01:58
elmo	https://launchpad.net/ubuntu/+source/fglrx-installer/2:8.780-0ubuntu3/+build/2049941 <-- roseapple worked (for one build) - did we know that?	01:58
wgrant	Yeah, most things work for a while.	01:58
elmo	ok	01:58
wgrant	It's not related to what the builder is doing at the time.	01:58
wgrant	It may be related to what others are doing, but who knows.	01:59
lifeless	wgrant: how do we know that?	01:59
wgrant	lifeless: Because it affects dozens of builders at a time, whether they're idle or building recipes or building binaries.	01:59
wgrant	doubah's gone.	01:59
lifeless	wgrant: what if the timeout or some such is pseudo global, and one hung builder breaks all the ones open for the time window involved	01:59
wgrant	lifeless: Exactly.	02:00
wgrant	12:59:04 < wgrant> It may be related to what others are doing, but who knows.	02:00
wgrant	So, doubah is dead with a TCP timeout.	02:00
lifeless	wgrant: in which case its related to what one does	02:00
wgrant	I wish we had a traceback.	02:00
wgrant	It would be sorta helpful to know what timed out.	02:00
wgrant	lifeless: buildd-manager won't have cached the old FF?	02:04
lifeless	wgrant: it wasn't restarted when the problem happened	02:05
wgrant	True.	02:05
wgrant	Can we disable everything, enable doubah, and see what happens?	02:05
lifeless	wgrant: it was 3 hours ago now that the ff was turned on (and apparently off again)	02:06
wgrant	Oh, so ages after the world exploded. I see.	02:06
lifeless	wgrant: yeah, I'm convinced we're clear	02:07
mwhudson	lifeless: Chex turned the feature flag off after i complained that private codebrowse wasn't working	02:07
lifeless	mwhudson: did that make it work?	02:07
mwhudson	(which seems entirely unrelated to me, but after he did it, private codebrowse started working again)	02:08
lifeless	wtf	02:08
mwhudson	yes	02:08
lifeless	codebrowse uses the librarian?	02:08
mwhudson	i'm betting some kind of coincidence	02:08
lifeless	pjdc: can you please turn the flag on again - its listed under sql queries etc on LPS.	02:08
pjdc	lifeless: that's a long page. what am i looking for exactly?	02:09
lifeless	publicrestrictedlibrarian default 0 on	02:10
pjdc	that doesn't mean much to me. is that a command?	02:11
lifeless	its a line you put in https://launchpad.net/+feature-rules	02:11
lifeless	https://dev.launchpad.net/LEP/FeatureFlags has plenty of docs - see the bottom of the page in particular	02:11
pjdc	ah, ok. so bung it in at the end, hit "Change"?	02:12
lifeless	yes	02:12
pjdc	done	02:12
lifeless	mwhudson: still working ?	02:13
mwhudson	lifeless: will check	02:13
lifeless	'Server denied check_authentication' is what you saw?	02:13
mwhudson	lifeless: yes	02:13
lifeless	zomg	02:14
lifeless	pjdc: and remove it?	02:14
mwhudson	lifeless: it works now	02:14
lifeless	oh	02:14
lifeless	pjdc: don't remove it	02:14
mwhudson	lifeless: oh, it failed for you?	02:14
pjdc	lifeless: ok, doing nothing :)	02:14
lifeless	mwhudson: failed once	02:15
lifeless	mwhudson: worked on second url	02:15
lifeless	I think its stubs openid change	02:15
mwhudson	lifeless: random	02:15
lifeless	mwhudson: so coincidence that that was the first private branch url you tried since 11926 was deployed.	02:15
lifeless	pjdc: thanks	02:16
lifeless	ok	02:16
lifeless	so back to the buildd	02:16
lifeless	pjdc: did doubah do day carooba?	02:16
wgrant	So, I'd like to see this happen:	02:16
wgrant	- Disable all builders.	02:17
wgrant	- Shut down buildd-manager.	02:17
pjdc	lifeless: you lost me at "do	02:17
wgrant	- Change log level.	02:17
pjdc	"	02:17
wgrant	- Enable doubah.	02:17
wgrant	- Start buildd-manager	02:17
lifeless	pjdc: you reenabled doubah	02:17
lifeless	pjdc: did it die again?	02:17
wgrant	It did.	02:17
lifeless	ok	02:18
lifeless	pjdc: could you do what wgrant just described	02:18
pjdc	disable all builders including those currently building? or just the idle ones?	02:18
wgrant	All, ideally.	02:19
lifeless	all	02:19
lifeless	we're wondering if doubah is broken	02:19
pjdc	the alls have it	02:19
lifeless	and a bug is making all the others get nuked when it goes if they happen to be lined up with it in the polling period	02:19
wgrant	Well, I'm mostly hoping we can get a minimal case to fail.	02:19
wgrant	I doubt there's anything wrong with doubah.	02:19
lifeless	ok. I'm thinking that.	02:19
lifeless	what is doubah - virt i386?	02:20
wgrant	Since I picked four semi-randomly from a pool of 60.	02:20
wgrant	Yeah.	02:20
wgrant	A fairly beefy one, too.	02:20
pjdc	all disabled, stopping the buildd-manager	02:22
pjdc	wgrant: how is the log level changed?	02:22
wgrant	pjdc: Aherm.	02:23
wgrant	pjdc: s/logging.INFO/logging.DEBUG/ in lib/lp/buildmaster/manager.py.	02:23
lifeless	wgrant: you are going to fix that.	02:23
wgrant	lifeless: It is Twisted evil.	02:24
wgrant	Which I don't know awfully well.	02:24
lifeless	hasn't stopped you in the past	02:24
wgrant	True.	02:25
pjdc	wgrant: like so? http://paste.ubuntu.com/533321/	02:25
wgrant	pjdc: Yup.	02:25
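Editing manager.py by hand to change the level is what lifeless objects to above. One possible alternative, sketched with the Python standard library only, is to read the level name from an environment variable. The variable name BUILDD_MANAGER_LOG_LEVEL and the helper level_from_env are hypothetical; how manager.py actually configures its logger is not shown in this log.

```python
import logging
import os


def level_from_env(var="BUILDD_MANAGER_LOG_LEVEL", default="INFO", environ=None):
    """Translate a level name such as 'DEBUG' into a logging constant.

    `var` is a hypothetical variable name; `environ` can be passed in
    explicitly so the function is easy to test.
    """
    environ = os.environ if environ is None else environ
    name = environ.get(var, default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        raise ValueError("unknown log level: %r" % name)
    return level
```

A logger could then be set up with logging.getLogger("buildd-manager").setLevel(level_from_env()), where the logger name is again only an assumption. Restarting the service with the variable set to DEBUG would replace the in-place edit.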
lifeless	wgrant: and you have help	02:25
wgrant	Flip doubah back on, and start b-m up.	02:25
wgrant	And let's hope it fails.	02:25
pjdc	doubah enabled, b-m starting	02:25
lifeless	wgrant: what time do recipe builds auto create ?	02:25
wgrant	lifeless: Probably a couple of hours ago.	02:26
pjdc	b-m started	02:26
wgrant	lifeless: They were happening around the time it was noticed.	02:26
wgrant	Er.	02:26
wgrant	doubah's dead already?	02:26
wgrant	Hm, no.	02:26
wgrant	Must have been cached.	02:27
pjdc	shows as building here	02:27
wgrant	Yeah, it is now.	02:27
wgrant	OK, it's started.	02:28
wgrant	Only died once.	02:28
wgrant	:( it seems to be happy.	02:32
lifeless	should we bring up another virt i386 ?	02:32
lifeless	=== Top 10 Time Out Counts by Page ID ===	02:33
lifeless	Hard / Soft Page ID	02:33
lifeless	230 / 59 Person:+commentedbugs	02:33
lifeless	111 / 5615 Archive:+index	02:33
lifeless	76 / 295 BugTask:+index	02:33
lifeless	12 / 398 Distribution:+bugtarget-portlet-bugfilters-stats	02:33
wgrant	Worth a try.	02:33
lifeless	12 / 341 Distribution:+bugs	02:33
lifeless	10 / 5 Person:+bugs	02:33
lifeless	9 / 7 ProjectGroup:+milestones	02:33
wgrant	(virt i386 is good because it gives us all job types)	02:33
lifeless	8 / 2 BugTask:+create-question	02:33
lifeless	5 / 47 Distribution:+archivemirrors	02:33
lifeless	5 / 17 DistributionSourcePackage:+publishinghistory	02:33
pjdc	wgrant: shall i enable actinium then?	02:35
wgrant	pjdc: Sure.	02:35
pjdc	wgrant: ok, enabled	02:35
wgrant	Really?	02:37
wgrant	Maybe I'm on a slave, but actinium looks dead.	02:37
wgrant	If it has just died, this is great news indeed.	02:37
pjdc	looking dead here too	02:37
wgrant	!!!	02:37
wgrant	We may have some hope of untangling the logs this time.	02:38
wgrant	Could you throw the log since the restart somewhere I can see it?	02:38
pjdc	will do	02:40
pjdc	see query	02:43
wgrant	Thanks.	02:43
wgrant	...	02:44
wgrant	2010-11-17 02:26:50+0000 [Uninitialized] ForbiddenAttribute: ('build', <TRANSLATION_TEMPLATES_BUILD branch job (2146072) for ~gwibber-committers/gwibber/trunk>)	02:44
thumper	???	02:44
wgrant	This logging is a lot more descriptive :)	02:45
wgrant	Hm, so actinium was aborted.	02:47
wgrant	It was resumed, then just a few seconds later a dispatch was attempted... that's far too quick.	02:48
wgrant	So, actinium probably wasn't hit by the root issue. :(	02:49
wgrant	We don't wait long enough for the resume to complete.	02:49
wgrant	But that doesn't explain the 'User timeout caused connection failure' thing, or why non-virt builders were broken too.	02:49
wgrant	OK. I think we should try to get it to break horribly again. So we should reset the failure counts and reenable everything, I suppose.	02:51
wgrant	Sigh.	02:52
lifeless	if we bring everything up	02:54
lifeless	will we log useful data?	02:54
wgrant	I hope so.	02:54
wgrant	Maybe we should make failBuilder log before we do that, though.	02:55
wgrant	So we can see when things are disabled.	02:55
lifeless	wgrant: can you prep a cowboy	02:56
wgrant	Doing so.	02:56
* thumper has to head afk		02:57
wgrant	pjdc, lifeless: http://pastebin.ubuntu.com/533329/ should do it.	02:58
* wgrant pelts buildd-manager with rocks and sets it on fire.		03:00
pjdc	wgrant: so, shut down, apply patch, enable all (this might take a while), start up?	03:03
wgrant	pjdc: Yup.	03:04
wgrant	pjdc: Do you know if there's a script to enable them all?	03:04
wgrant	Otherwise there's SQL...	03:04
pjdc	wgrant: no idea, i've only ever done them manually	03:04
wgrant	We may need SQL to reset the failure counts anyway. We'll see shortly.	03:05
pjdc	b-m stopped, patch applied	03:06
lifeless	wgrant: api script :)	03:06
wgrant	lifeless: Yeah, yeah, on my todo list.	03:07
wgrant	It's reasonably unfortunate that all this has happened when we have no available LOSAs in this TZ, no available Soyuz developers in this TZ, and both of the buildd admins in this TZ also unavailable.	03:08
lifeless	s/unfortunate/normal/	03:09
wgrant	No... We'd normally have a LOSA, a Soyuz developer, and two buildd admins.	03:10
pjdc	okay, that's all of them enabled	03:13
pjdc	anything else before b-m is started?	03:13
wgrant	So we are now running with the log level change and the additional failBuilder logging?	03:14
pjdc	yep, left the loglevel change in place, and applied your cowboy	03:14
wgrant	Start it up!	03:14
wgrant	I expect most of them will disable themselves again in about 30 seconds :(	03:15
pjdc	started	03:15
wgrant	:(	03:17
wgrant	So, everything seems to be happy now.	03:23
wgrant	I guess we just leave it until it explodes in a few hours, and hope the new logging tells us something useful.	03:24
pjdc	that shouldn't be far off when the UK wakes up, so that might work out	03:25
wgrant	Given that we've failed to reproduce it elsewhere, it is tempting to let the rollout go ahead and just automatically undisable builders until we work out what's going on :/	03:25
wgrant	The 14 builders that are disabled now probably just need their failure count reset (it's already over the threshold, so the initial failure to connect because the builder is still resuming causes them to be disabled).	03:26
wgrant	Something like this:	03:26
wgrant	UPDATE builder SET failure_count=0, builderok=true WHERE name IN ('hawthorn', 'actinium', 'hassium', 'lansones', 'muntries', 'radium', 'rosehip', 'sandpaperfig', 'terranova', 'fermium', 'lawrencium', 'nobelium', 'papaya', 'plutonium');	03:26
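Why the UPDATE resets failure_count as well as builderok can be shown with a toy model of the behaviour wgrant describes. This is only a sketch: the Builder class, the reenable helper and the threshold value of 5 are hypothetical and are not taken from the Launchpad code.

```python
FAILURE_THRESHOLD = 5  # hypothetical; the real threshold is not given in the log


class Builder:
    """Toy stand-in for a builder row with failure_count and builderok."""

    def __init__(self, name, failure_count=0, builderok=True):
        self.name = name
        self.failure_count = failure_count
        self.builderok = builderok

    def record_failure(self):
        """Count one failed scan and disable the builder at the threshold."""
        self.failure_count += 1
        if self.failure_count >= FAILURE_THRESHOLD:
            self.builderok = False


def reenable(builder, reset_count):
    """Flip the builder back on, optionally clearing its failure count."""
    builder.builderok = True
    if reset_count:
        builder.failure_count = 0
```

In this model a builder re-enabled without a reset is disabled again by its first failed connection (for example while it is still resuming), whereas one whose count was reset survives that failure.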
pjdc	if it's not critical, that's probably best left for a losa	03:27
wgrant	Probably, yeah.	03:27
wgrant	Not critical. Just makes it harder to see if it's broken without watching logs.	03:27
wgrant	Thanks for your help.	03:27
pjdc	you're welcome!	03:28
wgrant	adare and ross are now broken in other ways :(	03:29
wgrant	But that can wait.	03:29
lifeless	mwhudson: https://bugs.launchpad.net/launchpad-foundations/+bug/676372	05:18
_mup_	Bug #676372: "Server denied check_authentication" from bazaar.launchpad.net private branch since 11926 deployed <regression> <Launchpad Foundations:Triaged> <https://launchpad.net/bugs/676372>	05:18
=== jtv is now known as jtv-eat
poolie	hi all	06:47
poolie	i am running './bin/test' in a vm, and it has been stuck for hours, with the last output being	06:47
poolie	Started ['/tmp/tmpecWY0y.mozrunner/mozrunner-firefox', '-profile', '/tmp/tmpecWY0y.mozrunner', 'http://bugs.launchpad.dev:8085/windmill-serv/start.html']	06:47
poolie	in 1.109 seconds.	06:47
poolie		06:47
poolie	halp?	06:47
wgrant	Is there a firefox window lurking around?	06:47
poolie	not that i can see	06:48
poolie	i'm ssh'd in to the vm without -X	06:48
poolie	i will see if there's a firefox process	06:48
poolie	there is not, though there is a zombie	06:48
wgrant	mthaddon: Around yet?	07:28
henninge	Hi wgrant!	07:48
henninge	;)	07:48
wgrant	Morning henninge.	07:48
henninge	wgrant: heard you got engaged	07:48
wgrant	Oh?	07:49
henninge	That Kate really is a nice girl	07:49
henninge	oh sorry, wrong W... ;-)	07:49
henninge	wgrant: What's that about the buildmanager?	07:49
wgrant	henninge: Well, it may or may not be a release blocker.	07:50
wgrant	henninge: We have a not utterly terrible workaround, so it's probably OK.	07:50
henninge	wgrant: what does the workaround include?	07:50
wgrant	henninge: Uhh, frequently reenabling all the builders manually.	07:50
henninge	;-)	07:51
henninge	How frequently?	07:51
wgrant	Unsure. It was OK for 4 hours yesterday. And it's been OK for 4 hours so far today.	07:51
henninge	hours, wow ...	07:51
henninge	The affected code is on cesium, right?	07:52
wgrant	Yeah.	07:52
wgrant	Hopefully jml and bigjools will save the world tonight.	07:52
henninge	So that is (again) part of the nodowntime hosts	07:52
henninge	so a fix can be deployed any time.	07:52
wgrant	Yeah.	07:53
henninge	wgrant: I am sure they will! ;-)	07:53
wgrant	So, it's a pretty terrible bug, but we can work around it easily enough with a script.	07:53
henninge	The only reason I can imagine this being a blocker for the roll-out would be if any fix would include db changes.	07:55
henninge	which is not that far fetched, I guess.	07:56
wgrant	It won't.	07:56
poolie	wgrant: congrats!	08:04
wgrant	poolie: Hm?	08:04
poolie	or is he just totally confused?	08:05
wgrant	I hope he's just totally confused.	08:05
poolie	ah, me clicks	08:05
wgrant	Or there's some news about me that I don't know.	08:05
poolie	William Soxe-Gotha-Coburg-Windsor	08:05
wgrant	Ahhhhhhh, of course.	08:06
bac	hi henninge	08:35
henninge	hi bac!	08:35
adeuring	good morning	08:50
henninge	poolie, wgrant: Yeah, I messed up the joke. I meant to say "sorry, wrong prince" ... ;)	08:54
henninge	Moin adeuring!	08:54
adeuring	hi henninge	08:54
wgrant	Haha.	08:55
wgrant	bigjools: Morning...	09:08
bigjools	morning	09:12
wgrant	bigjools: Have you heard the wonderful news?	09:12
bigjools	which?	09:12
wgrant	bigjools: We're about to release with a pretty screwed buildd-manager :)	09:13
bigjools	fuck sake	09:13
wgrant	It disabled 60 or so this morning.	09:13
=== mthaddon changed the topic of #launchpad-dev to: Launchpad down/read-only from 10:00-12:00 UTC for DB update | Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 | firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting
wgrant	It seems to be reasonably happy now, since we restarted everything 7 hours ago.	09:14
wgrant	But it was OK for a few hours yesterday too :/	09:14
bigjools	it was disabling builders because they were unresponsive	09:14
bigjools	it's supposed to do that	09:14
wgrant	TCP timeouts and no route to host errors are different.	09:15
bigjools	how?	09:15
wgrant	This was "User timeout caused connection failure" or something like that.	09:15
bigjools	that's because they don't respond within the timeout	09:15
wgrant	Dozens of them in one second?	09:16
bigjools	what sort of time did this happen?	09:16
wgrant	14:41:45, 15:03:20 and 16:31:22, 23:39:11 are some that I saw.	09:17
wgrant	23:39:11 was the big one.	09:17
bigjools	that's when the daily recipes kick off	09:17
wgrant	But the last two incidents there start with a single error, then 9 seconds later dozens.	09:17
mrevell	Hello	09:17
wgrant	There were also a few other odd errors in the logs.	09:18
wgrant	And it's not waiting long enough for builders to resume.	09:18
wgrant	But apart from that it's happy now.	09:18
bigjools	that's a problem because there's nothing we can do to fix that	09:18
wgrant	Hmm?	09:18
bigjools	the connection timeout is hard-coded in the python libs :/	09:18
wgrant	Odd...	09:18
bigjools	the reset script waits until some event in the builder, which is supposed to be when it's ready to accept a connection	09:19
bigjools	then that connection often times out	09:19
wgrant	2010-11-17 02:35:58+0000 [QueryProtocol,client] Resuming actinium (http://actinium.ppa:8221/)	09:19
wgrant	2010-11-17 02:36:04+0000 [-] Asking builder on http://actinium.ppa:8221/filecache to ensure it has file chroot-ubuntu-lucid-i386.tar.bz2 (http://launchpadlibrarian.net/51974282/chroot-ubuntu-lucid-i386.tar.bz2, d267a7b39544795f0e98d00c3cf7862045311464)	09:19
bigjools	we're seeing the fruits of that now because I am actually disabling stuff	09:19
wgrant	2010-11-17 02:36:25+0000 [Uninitialized] Scanning failed with: TCP connection timed out: 110: Connection timed out.	09:19
bigjools	whereas the old one never disabled anything	09:19
wgrant	It waited 6 seconds from firing the resume trigger.	09:20
wgrant	Maybe the script is buggy.	09:20
bigjools	no	09:20
bigjools	6 seconds is about right	09:20
bigjools	they reset very quickly	09:20
wgrant	The VM is created and boots in 6 seconds!?	09:20
bigjools	yes	09:20
wgrant	Nice.	09:20
bigjools	the first connection is to send the chroot, and that's why you see it timing out	09:21
bigjools	we can get around this for now by removing the code that fails builders	09:21
bigjools	which is essentially what the old b-m was not doing	09:21
wgrant	I think we need to disable failure counting.	09:21
wgrant	It took out lots of builds as well.	09:22
wgrant	(and fourteen or so builders need their failure counts manually reset)	09:22
bigjools	digh	09:22
wgrant	I still find it unlikely that dozens of builders failed to respond all in the same second, several times, unless there were network glitches that nobody knows about.	09:23
wgrant	The 9 second delay between the first failure and subsequent stream on at least two occasions is also rather suspicious.	09:23
bigjools	if it's a network glitch then it's more likely that they all go at once	09:23
wgrant	Anyway, cesium is currently running the new code with two cowboys: one setting loglevel to DEBUG, and another to log whenever a builder is failed.	09:25
wgrant	We also need to fix the failure counts of those builders, and probably do a mass-giveback :/	09:25
bigjools	failure counts are reset on a successful dispatch	09:25
wgrant	They are.	09:26
wgrant	Hmm.	09:26
bigjools	for a builder to get failed it has to go wrong on 5 consecutive occasions	09:27
wgrant	But the issue is that the first failure will immediately knock them out again.	09:27
bigjools	no, that's not true	09:27
wgrant	It will, since the count is currently 5.	09:27
wgrant	We reenable, they time out, and are immediately disabled.	09:27
wgrant	No five strikes rule for them.	09:27
bigjools	ok, re-enabling should reset the count	09:27
bigjools	that's a bug	09:27
wgrant	It should.	09:27
wgrant	But it doesn't.	09:27
wgrant	And we were LOSAless today, so we couldn't do it manually.	09:27
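The failure-counting behaviour argued over above can be sketched roughly as follows. This is an illustration only, not the buildd-manager code: the `Builder` class, `FAILURE_THRESHOLD`, and every method name are hypothetical. Only three things come from the discussion: a builder is failed after 5 consecutive failures, a successful dispatch resets the count, and re-enabling ought to reset it too but currently does not.

```python
FAILURE_THRESHOLD = 5  # consecutive failures before disabling (figure from the discussion)


class Builder:
    """Hypothetical stand-in for a builder record; not Launchpad's model."""

    def __init__(self, name):
        self.name = name
        self.failure_count = 0
        self.enabled = True

    def record_failure(self):
        """Count one failed scan or dispatch; disable at the threshold."""
        self.failure_count += 1
        if self.failure_count >= FAILURE_THRESHOLD:
            self.enabled = False

    def record_successful_dispatch(self):
        """A successful dispatch clears the consecutive-failure count."""
        self.failure_count = 0

    def reenable(self, reset_count=True):
        """Re-enable the builder.

        With reset_count=False this mimics the bug described above: the
        count stays at the threshold, so a single further failure
        disables the builder again with no five-strikes grace.
        """
        self.enabled = True
        if reset_count:
            self.failure_count = 0
```

Calling `reenable(reset_count=False)` and then `record_failure()` once leaves the builder disabled again, which is the "immediately knocked out" case wgrant describes.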
bigjools	I think the recipe builds are thoroughly screwing the builders	09:29
wgrant	So "User timeout caused connection failure" occurs when the TCP connection is accepted, but there's no HTTP response?	09:29
bigjools	everything works fine until they come along	09:29
bigjools	that happens when the connect() fails	09:30
wgrant	We're still running the old lp-buildd with in-chroot bzr-builder, aren't we?	09:30
bigjools	yes, we rolled them back	09:30
wgrant	If that happens when connect() fails, then why this:	09:30
wgrant	"TCP connection timed out: 110: Connection timed out."	09:30
wgrant	That's a separate error.	09:31
bigjools	I think I'm going to just remove the failure counting stuff for now	09:33
wgrant	Sounds like a good idea.	09:34
bigjools	wgrant: did you ask someone to restart it at 0126 UTC?	09:36
wgrant	I think the first one was lifeless, but yeah, it was around then.	09:36
bigjools	there were no problems with it at that time	09:36
wgrant	It had taken out all but a few buildds an hour earlier. We wanted to see if we could reproduce it fresh with just a couple of active builders, to see if we needed to roll back and work out what to do about the release.	09:39
bigjools	I think the problem is recipe builds for sure, I just need to reproduce on DF	09:42
bigjools	the builder is doing something that makes it unresponsive	09:42
wgrant	That's not the whole thing.	09:43
wgrant	palmer was disabled. It is non-virt and had been idle for 30 minutes.	09:43
bigjools	hmmm	09:43
=== Guest8056 is now known as jelmer
bigjools	oh jeez the log is massive with debug on	09:44
wgrant	So we knew it was either several undetected network glitches throughout the day manifesting without any TCP timeouts, or something with one builder was glitching everything else out.	09:44
wgrant	So we turned up logging and hoped it would reappear, since the INFO logging is sort of completely sparse.	09:45
wgrant	We can't tell when the problematic scans were triggered, and there are five minute gaps in the log :/	09:46
wgrant	And I can't reproduce it locally however much I try :(	09:46
bigjools	it's a nightmare	09:46
wgrant	Yeah, just a bit.	09:47
bigjools	from the log, it starts going wrong at the exact same time the faily (sic) recipe builds get kicked off	09:48
bigjools	around 23:35Z	09:48
wgrant	That's the big incident, yeah.	09:48
wgrant	But there are several smaller ones in the preceding 9 hours.	09:48
bigjools	other incidents are almost certainly another batch	09:48
wgrant	Possibly.	09:48
bigjools	there are some Fault 8002:	09:49
wgrant	Yeah, but they're everywhere...	09:51
bigjools	that's a protocol fault	09:51
bigjools	hmmm /me sees something	09:52
wgrant	What has been seen?	09:52
bigjools	this might have something to do with the huge blocking file fetch	09:53
wgrant	I considered that.	09:53
wgrant	But the 23:39 incident suggests not.	09:53
wgrant	The nearest fetch before that was about 6 minutes earlier.	09:54
bigjools	I think it's a number of different things that cause blocks	09:54
=== henninge changed the topic of #launchpad-dev to: Launchpad down/read-only from 10:00-12:00 UTC for DB update | Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 (but closed during the roll-out)| firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting
wgrant	bigjools: So, just going to cowboy out failure counting after the rollout and hope that we can work it out?	10:03
bigjools	yes	10:03
wgrant	:/	10:03
bigjools	one of the things that the failure counting did was to remove in-progress jobs from builders if they failed a poll	10:04
bigjools	I might have to rethink how that work	10:04
bigjools	s	10:04
bigjools	damn, this stuff is hard	10:05
wgrant	It should all be fine.	10:06
bigjools	"should"	10:06
wgrant	Except for those unexplained User blah blah blah errors, and the reset script lying.	10:06
wgrant	Apart from that and the occasional other translations exception, it seems to be OK.	10:06
wgrant	2010-11-17 02:26:50+0000 [Uninitialized] ForbiddenAttribute: ('build', <TRANSLATION_TEMPLATES_BUILD branch job (2146072) for ~gwibber-committers/gwibber/trunk>)	10:07
wgrant	That's the translations exception.	10:07
bigjools	sigh	10:09
wgrant	Yes.	10:10
wgrant	Does the reset script wait until the slave responds to HTTP?	10:15
wgrant	How hard is readonly bazaar.launchpad.net?	10:25
wgrant	Surely not that bad?	10:25
lifeless	wgrant: we tested it on qastaging yesterday. it works with one small bug	10:28
lifeless	wgrant: however, we're doing machine maintenance.	10:28
wgrant	lifeless: What's the bug? It's not read-only?	10:28
lifeless	https://bugs.launchpad.net/launchpad-code/+bug/676124	10:29
lifeless	if we weren't doing maintenance on that machine, we'd have tried keeping it up this time.	10:29
wgrant	Ah.	10:29
wgrant	Great.	10:29
lifeless	adeuring: ping	10:46
wgrant	Huh, codebrowse works?	10:51
jml	LP seems to be r/w for me now	10:52
wgrant	Ah, so it is.	10:52
lifeless	morning jml	10:54
jml	lifeless: hello	10:54
wgrant	Indeed, morning jml.	10:54
=== danilo_ is now known as danilos
wgrant	Could someone please ec2 https://code.launchpad.net/~wgrant/launchpad/bug-654372-optimise-domination/+merge/40854?	11:06
jml	wgrant: on it	11:07
wgrant	jml: Thanks.	11:07
wgrant	bigjools: re. bug #676262, I suspect they were both ABORTING (since abort() doesn't actually end up killing sbuild). That's a situation we ran into a few hours ago.	11:10
_mup_	Bug #676262: launchpad lost track of a build <Soyuz:Incomplete> <https://launchpad.net/bugs/676262>	11:10
wgrant	(with those same two builders)	11:10
wgrant	Damn ppc :(	11:11
jml	wow	11:11
jml	I got a crazy error when doing ec2 land	11:12
jml	http://paste.ubuntu.com/533423/	11:12
=== mthaddon changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 (but closed during the roll-out)| firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting
danilos	henninge, https://pastebin.canonical.com/39840/	11:12
adeuring	lifeless: pong (sorry, did not look at the IRC windows after returning from the kitchen...)	11:14
lifeless	adeuring: hey	11:15
lifeless	adeuring: remember how in APIs and restricted files we hard coded handing out the internal url ?	11:15
adeuring	lifeless: not exactly... let me check again	11:15
lifeless	adeuring: the token based librarian is deployed now	11:15
jml	lifeless: https://bugs.launchpad.net/launchpad-code/+bug/554206 might be relevant to some stuff you are doing	11:15
_mup_	Bug #554206: Need a read-only version of bazaar.launchpad.net for codehosting and codebrowse <canonical-losa-lp> <codebrowse> <codehosting-ssh> <Launchpad Bazaar Integration:Triaged> <https://launchpad.net/bugs/554206>	11:15
adeuring	lifeless: I remember that firewall settings in the DC needed some teaking	11:16
adeuring	...tweaking...	11:16
lifeless	right	11:17
wgrant	Why is [ui=none] in every commit message? Can't it just be omitted?	11:20
adeuring	lifeless: mizuho needed access to private Librarian files, and that machine "saw" a librarian URL having a host name with an "internal" domain part	11:20
jml	wgrant: the [ui=foo] field was added as a way of strongly encouraging UI reviews for any UI change	11:22
jml	wgrant: a huge number of changes do not affect the UI	11:22
jml	wgrant: and I suspect that many people skip UI reviews	11:22
lifeless	adeuring: yes	11:22
wgrant	jml: Is it more than 1% of commits that have ui=somethingelse?	11:22
lifeless	adeuring: right, so you did a patch for the API to show the internal url	11:22
jml	wgrant: you can run log & grep as easily as I	11:23
lifeless	adeuring: but its not needed now	11:23
adeuring	lifeless: did I? seems that I need a memory refresh.... looking now	11:23
wgrant	jml: True.	11:24
lifeless	adeuring: you did :)	11:26
lifeless	adeuring: rev 11506	11:28
jml	henninge: now that the rollout is done, can we fix canonical/launchpad/interfaces/__init__?	11:30
henninge	jml: oh. ...	11:30
adeuring	lifeless: thanks! so, time to fix bug 629804	11:30
_mup_	Bug #629804: implement access to private Librarian files for launchpadlib clients <Launchpad Foundations:New> <https://launchpad.net/bugs/629804>	11:30
henninge	jml: well, it's still on the list to do post-rollout but you can prepare a branch. By the time it gets deployed from stable, that should all be done ;-)	11:34
henninge	jml: "it" is "fixing +inbound-email-config.zcml"	11:34
henninge	;-)	11:34
=== matsubara-afk is now known as matsubara
jml	henninge: ok. will do.	11:35
matsubara	maxb, misclicked	11:35
henninge	jml: just check again before marking the revision as deployable.	11:36
jml	henninge: nod. do you recall the bug number?	11:36
henninge	I am not sure it had a bug.	11:36
henninge	jml: nm, it's fixed. ;-)	11:37
henninge	so I guess you can just submit it [no-qa]	11:37
jml	henninge: will do. ta.	11:38
henninge	which is true because we already know it works on qa/staging ... ;-)	11:38
jml	aeoueoia	11:40
jml	lp-land has a bad token, but I don't know where to find it	11:40
lifeless	adeuring: I've unduplicated it	11:43
jml	how do I work around this problem? http://paste.ubuntu.com/533431/	11:43
adeuring	ok	11:43
adeuring	lifeless: I'll do it once I've finished my current work	11:43
adeuring	...i mean; I'll fix the bug...	11:44
lifeless	adeuring: do you have an estimate for when that will be?	11:44
lifeless	adeuring: if its going to be not-immediate, I might just do it	11:44
lifeless	s/not-immediate/not-today	11:44
adeuring	lifeless: i think I can probably start tomorrow	11:44
adeuring	lifeless: you beat me ;)	11:44
adeuring	problem is that I am quite slow with context switches...	11:45
lifeless	adeuring: I'll drop you a mail to let you know if I get to it or not.	11:45
adeuring	lifeless: coool	11:45
jml	wgrant: your branch is being tested in ec2: http://ec2-50-16-92-112.compute-1.amazonaws.com/	11:48
wgrant	jml: I can't see that, but thanks!	11:49
jml	it'd be kind of neat to add a phone-home thing to the ec2 script so we could have a page showing what's being built (as well as test results)	11:49
deryck	Morning, all.	11:57
adeuring	morning deryck	11:59
jml	bigjools: I added something to the derived distributions LEP about opening vs initialization; do you need anything more?	12:01
bigjools	jml: inspiration	12:01
bigjools	thanks :)	12:01
jml	bigjools: np.	12:02
jml	bigjools: also, I notice that https://launchpad.net/launchpad-project/+bugs?field.tag=buildd-scalability has no bugs.	12:03
bigjools	it should do	12:03
bigjools	I tagged loads	12:03
bigjools	jml: ah it's because they've all been released	12:06
jml	bigjools: nice.	12:06
bigjools	jml: https://bugs.launchpad.net/soyuz/+bugs?field.searchtext=&orderby=-importance&search=Search&field.status%3Alist=NEW&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=FIXRELEASED&assignee_option=any&field.assignee=&field.bug_reporter=&field.	12:06
bigjools	bug_supervisor=&field.bug_commenter=&field.subscriber=&field.tag=buildd-scalability&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on	12:06
bigjools	aiieee sorry	12:06
jml	bigjools: looking at the LEP and based on random IRC sampling, I'm guessing we're still missing "When a builder becomes free, we must dispatch a queued build to it within a maximum of 30 seconds.", "Design for a system with 200 builders" and "Not starve low-scored builds when there are higher-scored builds in the queue"	12:07
stub	Having trouble following https://dev.launchpad.net/LaunchpadPpa. debsign -S fails with 'debsign: Can't find or can't read changes file !'	12:07
bigjools	jml: missing from where?	12:07
jml	bigjools: what I mean is, have we met those requirements?	12:08
bigjools	jml: I need to have a call with you about that	12:08
jml	bigjools: ah, ok :)	12:08
bigjools	:)	12:08
bigjools	but later	12:08
bigjools	I am up to my neck in buildd-manager issues	12:09
bigjools	right after a dispatch of 10 or more recipes, there's nothing in the log for 4 minutes	12:09
bigjools	which is somewhat suspicious	12:09
jml	yeah, later is good	12:09
wgrant	The queue isn't just empty?	12:10
bigjools	no, it's the gap between "startBuild" and the "RESULT" stuff	12:10
wgrant	This is why I wanted better logging :(	12:10
bigjools	in fact the latter never appears	12:10
bigjools	yes we all want better logging	12:11
wgrant	Ah.	12:11
bigjools	but one thing at a time	12:11
wgrant	That's very interesting indeed.	12:11
stub	Shouldn't bzr builddeb actually create a .deb?	12:11
jelmer	stub: You have to go back to the parent directory or ../result where the changes file was added.	12:11
jelmer	stub: By default it creates binary packages (.deb's), with -S it creates a source package.	12:12
stub	But where?	12:12
bigjools	wgrant: something is blocking too long when it's dispatching a recipe build	12:12
jelmer	stub: In the parent directory or ../result	12:12
stub	jelmer: I don't have a ../result and nothing new in the parent directory	12:12
wgrant	bigjools: After the "Initiating build foo on bar"?	12:12
jelmer	stub: you can specify a directory manually with --result-dir	12:13
bigjools	wgrant: in Builder.startBuild() it logs the build start (behavior.logStartBuild)	12:13
bigjools	then there's nothing logged until it fails	12:13
bigjools	at that point, there's a few things that could have gone wrong but the lack of logging means it's hard to tell	12:14
stub	jelmer: Garh. They were in my branch, not my checkout of the branch	12:15
stub	jelmer: Guess that would be a bug...	12:15
jelmer	stub: yeah, that seems a bit strange	12:16
wgrant	bigjools: So we don't even know if it made it into resume_done?	12:16
bigjools	I suspect it has, that's the most reliable part of the process	12:16
wgrant	True.	12:16
bigjools	my suspicions lie in the file dispatching and initiation	12:16
wgrant	But it never made it to got_cache_file... hmm.	12:16
bigjools	we don't know	12:18
bigjools	there's no info level logging	12:18
wgrant	got_cache_file logs fairly obviously.	12:19
wgrant	Ohh, crap.	12:19
wgrant	True.	12:19
jml	deryck: there are a couple of LEPs about bug duplication...	12:19
* bigjools is changing some debug to info		12:19
jml	deryck: one's in drafting (https://dev.launchpad.net/LEP/DisableFilebugDuplicateSearchOption) and the other (https://dev.launchpad.net/LEP/ACLMarkAsDuplicate) isn't on the LEP page	12:19
lifeless	wgrant: We Can Haz Runtime Log Changing Please	12:20
wgrant	lifeless: debug 4 eva	12:20
_mup_	Bug #4: Importing finished po doesn't change progressbar <Launchpad Translations:Fix Released by carlos> <Ubuntu:Invalid> <https://launchpad.net/bugs/4>	12:20
wgrant	Ahem.	12:20
lifeless	rotfl	12:20
lifeless	ok foods	12:20
jml	lifeless: I guess there's https://bugs.edge.launchpad.net/soyuz/+bug/667958	12:21
_mup_	Bug #667958: Web diagnostic tool for build manager <buildd-manager> <Soyuz:Triaged> <https://launchpad.net/bugs/667958>	12:21
jml	but that's not quite the same thing	12:21
bigjools	dynamically changeable log levels is totally essential for decent production debugging	12:23
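One common way to get runtime-changeable log levels in a long-running Python daemon is to flip the level from a signal handler. The sketch below uses only the standard `logging` and `signal` modules. The helper names and the choice of toggling between INFO and DEBUG are assumptions for illustration, and nothing here is taken from the buildd-manager itself.

```python
import logging
import signal


def toggle_log_level(logger):
    """Switch the logger between INFO and DEBUG and return the new level."""
    if logger.getEffectiveLevel() > logging.DEBUG:
        new_level = logging.DEBUG
    else:
        new_level = logging.INFO
    logger.setLevel(new_level)
    return new_level


def install_toggle_handler(logger, signum):
    """Toggle the logger's level whenever the process receives signum."""
    def handler(received_signum, frame):
        toggle_log_level(logger)
    signal.signal(signum, handler)
```

On a Unix host, `install_toggle_handler(log, signal.SIGUSR1)` followed by `kill -USR1 <pid>` from a shell would turn debug output on, and a second signal would turn it off again, without restarting the process.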
wgrant	bigjools: Is there anything in the current debug level that isn't interesting, except for the hundreds of "Scanning foo" messages?	12:25
wgrant	Given the frequency and obscurity of issues, it'd be nice to keep as much data as possible...	12:26
bigjools	the problem is that I don't want the log swamped	12:26
bigjools	it makes it harder to notice issues	12:26
bigjools	so I am trying to carefully select important messages for the info logging	12:27
bigjools	but hindsight is awesome	12:27
wgrant	Heh.	12:27
deryck	Hi jml. Yeah, the first should be done. And the second was meant to sketch out the idea and go back to marjo et al and get feedback....	12:28
deryck	jml, remember, we talked about this and said, let's do what everyone agrees on and is easy first, and get consensus on if the second is even required.	12:28
deryck	unfortunately, I didn't ping anyone about the second yet. I'll do that today.	12:29
jml	deryck: ahh right. I forgot to refactor that new knowledge into the LEP page :)	12:29
jml	deryck: so I'll bump the first LEP to the Deployed section?	12:30
deryck	jml, in progress. I think I assumed approval and moved ahead.	12:30
deryck	jml, sorry to assume ;)	12:30
jml	deryck: no, that's all good :)	12:30
deryck	thanks!	12:30
lifeless	jml: gary has a variant of the LEP template with stuff specific to his team; I've suggested you might be amenable to folding those into the main template	12:33
jml	lifeless: sure, I'll have a look	12:34
jml	lifeless: if someone points me at a thing :)	12:34
lifeless	sure	12:35
* jml is also thinking (again!) about tracking LEPs at blueprints.launchpad.net/launchpad		12:35
lifeless	dunno when he'll do that	12:35
lifeless	jml: lets fix it first.	12:35
lifeless	jml: -please-	12:35
jml	lifeless: I reckon I could do a useful muck-around experiment that wouldn't affect anyone other than me.	12:36
lifeless	would it be a good use of your time?	12:37
lifeless	also, can we chat about reset (voice) ?	12:37
jml	lifeless: sure. gimme a couple of minutes to put my phones back together	12:37
maxb	Is the "builders are being disabled" topic comment in #launchpad still valid after the rollout?	12:39
lifeless	yes	12:39
jml	lifeless: and yes, it would be a good use of my time.	12:39
lifeless	hmm, didn't mean that to be snarky. Sorry	12:39
jml	lifeless: it wasn't at all snarky. I was going to elaborate but got distracted by yet another networking problem.	12:41
bigjools	jml: you remember how we added timeouts to the async xmlrpc by cancelling the Deferred?	12:45
jml	bigjools: yes	12:45
bigjools	jml: in those cases we get a CancelledError, but I am seeing hundreds of " User timeout caused connection failure."	12:45
bigjools	what causes those?	12:45
bigjools	it's a TimeoutError, sorry. I can't fathom how that would happen before the cancel	12:46
=== salgado is now known as salgado-physio
bigjools	huh actually - that's the 30 second connection issue	12:48
bigjools	which is much lower than our configured value for everything else	12:49
bigjools	jml: I'm tempted to inherit from Proxy and override stuff	13:20
jml	bigjools: yeah. I can't think of anything better right now. You ought to file a ticket and submit a patch too.	13:22
bigjools	jml: there's already a ticket, but the fix needs to go in quite a few places I think	13:22
bigjools	I'll file another anyway	13:22
bigjools	right - I need vittles	13:22
jml	bigjools: yeah, a specific ticket for xmlrpc.py would be great. thanks.	13:23
bigjools	nod	13:23
=== mrevell is now known as mrevell-lunch
lifeless	maxb: hey	13:41
maxb	hi	13:42
lifeless	maxb: what do you think of us having a custom python build - with http://bugs.python.org/issue10440 applied	13:42
=== Ursinha-dinner is now known as Ursinha
maxb	If it really is just an integer constant, why do we need to modify python for that?	13:43
maxb	Instead of just defining the value locally	13:43
lifeless	it can be different in different libcs, by definition.	13:44
lifeless	we can hardcode '1' as the constant, but its less portable and thus a bit ugly.	13:44
maxb	Well, it's a tiny patch, so it's hardly much effort to roll a modified package. The question then is the ongoing maintenance effort and how long it would be needed for	13:45
lifeless	yeah	13:46
maxb	I'd be tempted to consider putting the constant in a tiny module of its own, to avoid needing to rebuild every time there's an Ubuntu update out	13:46
maxb	Also, given Launchpad only targets Ubuntu, and a fairly narrow range of distroseries, even the non-portable solution is probably viable	13:47
lifeless	true on both counts	13:47
lifeless	will mull on it	13:47
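maxb's "define the value locally" option might look like the sketch below. It rests on an assumption the log does not state: that the constant in question is `RUSAGE_THREAD` from the `resource` module. The fallback of 1 is the hardcoded value lifeless mentions, so it is only meaningful on the libc it was taken from. `thread_cpu_seconds` is a hypothetical helper.

```python
import resource

# Prefer the interpreter's own constant when it exists; otherwise fall
# back to the hardcoded value (assumed correct only for the target libc).
RUSAGE_THREAD = getattr(resource, "RUSAGE_THREAD", 1)


def thread_cpu_seconds():
    """Return user + system CPU time consumed by the calling thread."""
    usage = resource.getrusage(RUSAGE_THREAD)
    return usage.ru_utime + usage.ru_stime
```

Keeping the fallback in one small module, as maxb suggests, means nothing needs rebuilding when the packaged Python is updated.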
=== henninge changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 | firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting
=== salgado-physio is now known as salgado
bigjools	lifeless: can you think of a way of creating a tcp endpoint that doesn't reply in a twisted test? I need to test a timeout and winding the reactor forwards is no good if the tcp connects or refuses to connect immediately	14:10
lifeless	sure	14:10
lifeless	bind, listen, but don't accept	14:10
bigjools	in real life I'd suspend a process but that's not ideal in a test	14:10
lifeless	Actually, that might not work. But its worth a go	14:11
bigjools	I suspect it would get connection refused wouldn't it?	14:11
bigjools	hmmm	14:11
lifeless	no	14:11
lifeless	accept is what takes a queued connection and gives you the new fd for it	14:12
elmo	alternatively iptables + -j DROP	14:12
bigjools	ah right	14:12
elmo	(although that requires root)	14:12
bigjools	not ideal for LP's test suite	14:13
elmo	sure, was just giving it as an option as a one off	14:14
=== mrevell-lunch is now known as mrevell
bigjools	elmo: how evil is it to try and connect to something like 10.255.255.1 ?	14:32
lifeless	bigjools: evil; some machines it will error immediately ;)	14:52
bigjools	lifeless: grar	14:53
lifeless	bigjools: because someone, somewhere has that ip	14:53
lifeless	bigjools: or routers that will see it and REJECT	14:54
bigjools	it doesn't get past my own router	14:54
bigjools	oh well it'll do as a stub for now	14:54
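lifeless's "bind, listen, but don't accept" idea can be tried with plain sockets. A minimal sketch, with hypothetical helper names: on Linux at least, the kernel completes the TCP handshake for queued connections, so the client's connect succeeds and it is the wait for a reply that times out. That makes this a stand-in for an unresponsive builder rather than for a failed connect, which may be the caveat lifeless had in mind.

```python
import socket


def make_silent_listener(host="127.0.0.1"):
    """Bind and listen on an ephemeral port, but never call accept()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0: let the kernel pick a free port
    sock.listen(1)
    return sock, sock.getsockname()[1]


def got_reply(host, port, timeout):
    """Connect, send a request, and report whether any reply arrived in time."""
    client = socket.create_connection((host, port), timeout=timeout)
    try:
        client.sendall(b"GET / HTTP/1.0\r\n\r\n")
        try:
            return client.recv(1024) != b""
        except socket.timeout:
            # Nobody accepted the connection, so nobody answered.
            return False
    finally:
        client.close()
```

Because the listener lives on localhost and needs no root, it avoids both the iptables approach and the dependence on how a router treats an address like 10.255.255.1.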
=== matsubara is now known as matsubara-lunch
bac	Reviewers Meeting starting at top of the hour: abentley, adeuring, allenap , bac, danilo, sinzui, deryck, EdwinGrubbs, flacoste, gary, gmb, henninge, jelmer, jtv, bigjools, leonardr, mars, salgado, jcsackett, benji	14:59
deryck	thanks bac	14:59
flacoste	bac: apologies from me	14:59
bac	np flacoste	15:01
=== matsubara-lunch is now known as matsubara
henninge	what's this?	16:25
henninge	http://paste.ubuntu.com/533506/	16:25
henninge	No handlers could be found for logger "librarian"	16:26
bigjools	henninge: you already have a librarian running	16:26
henninge	seriously? didn't know that ...	16:27
bigjools	kill it and the pid file and /var/tmp/fatsam.test	16:27
henninge	what's the process called?	16:28
bigjools	it's a twistd	16:28
henninge	ps ax | grep libra returns nothing	16:28
henninge	ps ax | grep twist - nada	16:28
henninge	:-(	16:28
bigjools	ummm then I dunno, I've only ever seen that when there's another librarian hanging around	16:29
henninge	thanks	16:29
henninge	why is the librarian logging in a +0530 time zone anyway???	16:32
henninge	India?	16:33
jml	henninge: there are no Canonical LP developers in that tz	16:38
henninge	outsourcing?	16:39
jml	henninge: we set the TZ there to avoid accidental TZ assumptions	16:40
jml	henninge: or something	16:40
henninge	;-)	16:40
henninge	but do you have an idea why the librarian layer might be failing?	16:40
lifeless	henninge: rm /var/tmp/fatsam.test/librarian.pid	16:43
henninge	already done. twice ;)	16:44
lifeless	ps fux | grep twistd	16:44
lifeless	?	16:44
lifeless	oh	16:44
lifeless	netstat -n | grep 58085	16:45
henninge	nothing	16:45
lifeless	or something like that	16:45
lifeless	is the second upload port that's barfing	16:45
henninge	maybe I should mention that this is not devel ? It's the recife branch	16:49
henninge	but the test worked yesterday	16:50
lifeless	sinzui: btw your script to close bugs is closing bugs that shouldn't be closed - because of RFWTAD	16:52
henninge	a second run always gives me "TacException: Could not kill stale process /var/tmp/fatsam.test/librarian.pid.	16:52
henninge	so I remove that dir and try again.	16:52
lifeless	nothing changed overnight	16:53
lifeless	I think you have another process using the port	16:53
lifeless	thus the netstat - check lazr-schema / the test schema to see what port it will be using	16:53
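The port check lifeless suggests can also be done from Python. A minimal sketch, assuming the librarian listens on localhost: `port_in_use` is a hypothetical helper, and 58085 is only the port guessed at above, so the real number should come from the test schema.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
    finally:
        sock.close()
```

For example, `port_in_use(58085)` returning True while no librarian shows up in `ps` would point at some other process holding the port.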
sinzui	lifeless, they were fix committed in 10.11, but were not intended to be released?	16:53
thumper	bigjools: did you get to the bottom of the problem?	16:53
lifeless	sinzui: no, our process assigns bugs to milestones before they are fixed, not after	16:54
sinzui	lifeless, are these really 10.12 bugs	16:54
lifeless	sinzui: they are 'some work done, but not finished'	16:54
lifeless	sinzui: things like:	16:54
bigjools	thumper: I *think* so - I think it's slow builders that don't respond to connection requests within Twisted's 30 second default timeout. The recipe builds hammer the builders.	16:54
lifeless	- landed code but it didn't fix it	16:54
lifeless	- needs a cronscript enabled via an RT ticket	16:54
thumper	bigjools: so why does it take down all types of builders then?	16:55
bigjools	thumper: thanks for doing the incident report	16:55
bigjools	thumper: I don't know, it might be a coincidence.	16:55
lifeless	who is looking at the 'report a bug' feature not working ?	16:56
* thumper doesn't believe in coincidence		16:56
bigjools	I am putting in a fix that increases the connection timeout - copy & paste from Twisted FTW :/	16:56
sinzui	lifeless, I think that is a bug. The engineer should know when he intends to release. Auto-assigning is convenient, but it does not exempt the person from correcting the milestone when he knows it will not be released with the milestone. e.g. we knew this when PQM was frozen	16:56
bigjools	thumper: I've seen slow builders doing exactly that for a while now - it's just that we never disabled them before this release.	16:56
lifeless	sinzui: sure, I'm not blaming the script or you :) - getting info on how to address - what policies we need to change	16:56
sinzui	lifeless, I can add a sanity check (qa-ok in tags)	16:57
lifeless	sinzui: I think thats an excellent idea	16:57
lifeless	sinzui: also I'm closing most bugs - those that are linked from revs - when we do incremental deploys	16:58
lifeless	I have to go eat or miss out, bbiab	16:58
sinzui	lifeless, i will have script for you by the end of my lunch	16:58
bigjools	jml: I guess you're not near your PC then	17:00
=== jam1 is now known as jam
=== benji is now known as benji-lunch
dobey	leonardr: around?	17:36
leonardr	dobey: yes	17:36
dobey	leonardr: http://pastebin.ubuntu.com/533530/ <- am getting this as a result of a getMembersByStatus() on a team with status=u'Administrator'	17:37
dobey	leonardr: any idea why that would be?	17:38
leonardr	dobey, what is the code in allowedcontributors.py?	17:38
lifeless	deryck: ping	17:39
deryck	hi lifeless. on tl call	17:39
dobey	leonardr: http://bazaar.launchpad.net/~rockstar/tarmac/main/annotate/head%3A/tarmac/plugins/allowedcontributors.py#L62	17:39
lifeless	deryck: are you aware that bug filing is reportedly broken ?	17:39
deryck	lifeless, no. how so?	17:40
lifeless	deryck: two independent reports	17:40
lifeless	1) apport user filed a bug in launchpad	17:40
lifeless	2) james hunt mailed tom who forwarded it in the lp rollout thread	17:40
leonardr	dobey: so the 'approved' one succeeds but the 'administrator' one fails?	17:41
dobey	leonardr: that appears to be the case, yes	17:41
deryck	lifeless, I believe allenap is looking into that.	17:42
deryck	I'll follow up after tl call to make sure, and cover if not	17:42
lifeless	cool	17:42
dobey	leonardr: and unfortunately i have to call it twice, because i can't do status=[u'Approved', u'Administrator']; like i can do with other similar get APIs, but i guess that wouldn't fix this specific problem either :)	17:42
leonardr	dobey: i have no clue why it should work once and then fail. just for fun, you might try assigning launchpad.people[team] to a variable	17:47
leonardr	so you're not using it twice	17:47
leonardr	and if that doesn't work, try assigning to a variable and then printing out its name before invoking those named operations	17:47
leonardr	i'm just seeing if various known problems are in play here (in which case upgrading would help)	17:48
dobey	leonardr: what would i upgrade to exactly?	17:51
leonardr	dobey: a later launchpadlib/lazr.restfulclient	17:51
dobey	leonardr: is there one newer than what is in 11.04 already?	17:52
leonardr	dobey: there is, but the one in 11.04 should have the fix i'm thinking about already	17:53
dobey	ok	17:54
leonardr	dobey: my only suggestion is to put a breakpoint in get_representation_definition and see what it does differently the first time vs. the second	17:57
dobey	leonardr: ok; i've changed it to assign the team to a variable and print the team twice as suggested; will see what happens next time that code gets hit	18:00
rockstar	launchpad is being very slow today. :(	18:00
rockstar	abentley, are there any issues with the new lp-serve happening right now?	18:06
=== benji-lunch is now known as benji
=== EdwinGrubbs is now known as Edwin-lunch
thumper	rockstar: the new forking lp-serve isn't enabled yet	18:38
rockstar	thumper, oh, the bug was marked as Fix Released. :(	18:39
mars	sinzui, ping	18:41
thumper	rockstar: yes, I know. jam commented on it too saying as much	18:41
rockstar	Ah, I hadn't seen the comment, just the status change.	18:41
jam	rockstar: right, still trying to work through getting everything qa'd, etc. It isn't considered a qa blocker because it is disabled in production	18:52
jam	I'm noticing that my download-cache has grown to about 500MB, anyone know what files I can nuke?	18:52
jam	I'd like to think that I don't need 12 versions of "zope.testing-*"	18:52
rockstar	jam, basically, you can nuke any files that aren't in versions.cfg	18:53
jam	rockstar: which is in the lp root?	18:53
rockstar	jam, yes	18:53
jam	well, that isn't particularly fun to cross-reference...	18:54
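The cross-referencing that jam mentions can be scripted. The following is a minimal sketch, assuming that versions.cfg has a buildout-style `[versions]` section of `name = version` lines and that files in the cache are named `name-version` plus an extension; the function names `read_pins` and `unpinned_files` are hypothetical and not part of the Launchpad tree. It only lists candidates and deletes nothing.

```python
import configparser


def read_pins(versions_cfg_text):
    """Return {name: version} from the [versions] section of a buildout file."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(versions_cfg_text)
    # configparser lowercases option names, which suits case-insensitive matching.
    return dict(parser["versions"])


def unpinned_files(pins, filenames):
    """Return the filenames that do not match any pinned name-version pair."""
    prefixes = [f"{name}-{version}".lower() for name, version in pins.items()]
    stale = []
    for filename in filenames:
        lowered = filename.lower()
        # Keep a file if it starts with "name-version" followed by a separator
        # such as "." (".tar.gz") or "-" ("-py2.6.egg"). This errs on the side
        # of keeping: a pin of "3.9" would also keep "name-3.9.4.tar.gz".
        keep = any(
            lowered.startswith(prefix)
            and lowered[len(prefix):len(prefix) + 1] in (".", "-")
            for prefix in prefixes
        )
        if not keep:
            stale.append(filename)
    return stale
```

To try it against a real tree, pass the text of versions.cfg to `read_pins` and the result of `os.listdir("download-cache/dist")` to `unpinned_files`, then review the list by hand before removing anything. Distributions whose file names differ from their pinned names (for example underscores versus hyphens) would be reported as unpinned.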
rockstar	thumper, urbanape just pointed out to me that when diff is too big, it says "Truncated for viewing." That's wrong, because if it was really for viewing, it wouldn't be truncated...	18:54
=== deryck is now known as deryck[lunch]
jam	rockstar: so why is download-cache a bzr branch that is versioning all of these tarballs? seems odd to me	19:02
jam	especially given that it is storing all old versions together in the same working tree	19:02
rockstar	jam, I am not the one to ask about that, but I think it was supposed to be a temporary solution we concocted two years ago.	19:02
jam	(for example, it contains 20 bzr tarballs)	19:02
jam	the .bzr/repository is actually bigger than the launchpad repo at this point	19:05
abentley	jam: you do not need to convince us. We know it's wonky.	19:06
jam	another quick question. Anyone know how lp-production-configs are placed at runtime so I can simulate a runtime environment locally?	19:09
jam	(how does the launchpad codebase find the values in lp-production-configs)	19:10
lifeless	it's put at the configs directory in the root I think	19:11
lifeless	and then LPCONFIG=configname	19:11
thumper	rockstar: in an email you mean/	19:13
thumper	?	19:13
thumper	rockstar: I thought it just said that on the page itself	19:13
thumper	rockstar: and in that case you are viewing it and it is truncated	19:13
rockstar	thumper, in the view, you are viewing it, and it is truncated, but it's not truncated FOR viewing. It's truncated FROM viewing. :)	19:15
lifeless	maxb: so, python 3.2 will have my patch :)	19:16
thumper	rockstar: it is truncated to allow you to view it otherwise it times out :-)	19:16
maxb	lifeless: And when are we migrating LP to Python 3? :-)	19:17
thumper	I'd not approve a textual change to "truncated from viewing" as it doesn't make grammatical sense	19:17
rockstar	thumper, yeah, it was pedantry from the start.	19:17
* thumper closes laptop to go and buy a 3g stick		19:17
thumper	rockstar: well we do work for pedantical :)	19:17
rockstar	thumper, although the fact that it's truncated drastically reduces its usefulness.	19:18
thumper	rockstar: the download link still works	19:18
thumper	rockstar: the fact that it is over 5000 lines drastically reduces its usefulness :)	19:19
sinzui	hi mars	19:21
rockstar	thumper, this is true as well.	19:23
jam	lifeless: I know about LPCONFIG=xxxx, but how is the "qastaging.conf" file found?	19:36
jam	it is just copied into the launchpad source tree?	19:36
jam	or is schema-lazr.conf (the symlink) pointed to something else, or?	19:37
rockstar	jam, it's symlinked.	19:40
jam	rockstar: to what file?	19:40
rockstar	jam, it's a file from lp-production-configs.	19:40
jam	rockstar: so they explicitly point schema-lazr.conf to schema-qastaging.conf for example?	19:40
jam	If so, why do you also need LPCONFIG=qastaging?	19:40
=== Edwin-lunch is now known as EdwinGrubbs
mwhudson	morning	19:47
=== deryck[lunch] is now known as deryck
jam	morning mwhudson	19:52
lifeless	jam: qastaging says 'the qastaging' dir which has a launchpad-lazr.conf file	20:14
jam	lifeless: sure, but there are 4 schema-XXX.conf files	20:15
jam	and no "schema-lazr.conf" or "schema-launchpad.conf", etc in the top of the dir	20:15
jam	anyway, I'm getting my problem solved without using it yet	20:15
jam	but still, I don't know yet how to set up something that resembles production	20:15
lifeless	jam: schema-xxx is irrelevant	20:16
jam	lifeless: so you still haven't answered how launchpad finds lp-production-configs/*.conf then	20:16
lifeless	I think it's	20:17
lifeless	rm configs	20:17
lifeless	mv lp-production-configs configs	20:17
lifeless	IMBW	20:17
lifeless	losa can tell you though - ask chex	20:17
jam	k	20:17
jam	lifeless: any idea of a 'clean' way to invoke the bzr that is packaged with the launchpad tree? or should we just be invoking /usr/bin/bzr ?	20:20
jam	(IOW, how are the dependencies found in production)	20:21
jam	`pwd`/eggs/bzr-2.2.0-py2.6-linux-i686.egg/EGG-INFO/scripts/bzr is obviously not a long-term solution	20:21
jam	or Chex ^^	20:21
mwhudson	um	20:24
mwhudson	i think launchpad looks for lp-production-configs/$LPCONFIG/launchpad-lazr.conf then for configs/$LPCONFIG/launchpad-lazr.conf	20:25
mwhudson	the other config files get brought in by extends: ../foo.conf in those config files	20:26
jam	mwhudson: so it is just 'lp-production-configs' in a generic sibling dir?	20:26
mwhudson	jam: pretty sure, let me look at some code	20:26
jam	mwhudson: doing that, I get "Can't find qastaging in ..."	20:27
jam	in a traceback	20:27
mwhudson	jam: "production-configs", not lp-production-configs	20:28
mwhudson	my mistake	20:28
jam	mwhudson: confirmed that it works	20:28
jam	(via symlink at least)	20:28
mwhudson	cool	20:29
jam	mwhudson: and ./production-configs is also in .bzrignore	20:30
mwhudson	heh heh	20:30
=== salgado is now known as salgado-afk
jam	losa ping. I don't know if you have time, but mthaddon was looking at rt#42199 last night, and I think I've responded to what he needed. I don't know whether that means there is a hand-off or whether it is just going to wait for him to get back.	20:40
_mup_	Bug #42199: evolution causes gpg stale locks <Evolution:Fix Released> <evolution (Ubuntu):Fix Released by desktop-bugs> <https://launchpad.net/bugs/42199>	20:40
lifeless	jam: not a sibling dir, child dir	20:45
jam	lifeless: nope, at the root "launchpad/configs launchpad/production-configs"	20:45
jam	at least, that worked for me	20:46
jam	and that is what is in .bzrignore	20:46
lifeless	kk	20:46
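The lookup order described above (a `production-configs` directory in the tree root, then `configs`, each containing a directory named after `$LPCONFIG`) can be sketched as follows. This is an illustration of the behaviour as recalled in the channel, not Launchpad's actual config loader; `find_config_dir` and its `exists` parameter are hypothetical names.

```python
import os


def find_config_dir(tree_root, instance_name, exists=os.path.isdir):
    """Return the first directory that holds the named config instance.

    Checks <tree_root>/production-configs/<instance_name> first, then
    <tree_root>/configs/<instance_name>, mirroring the order described
    in the discussion. `exists` is injectable so the lookup can be
    exercised without touching the filesystem.
    """
    for parent in ("production-configs", "configs"):
        candidate = os.path.join(tree_root, parent, instance_name)
        if exists(candidate):
            return candidate
    raise LookupError(
        "Can't find %s under production-configs or configs in %s"
        % (instance_name, tree_root)
    )
```

With `LPCONFIG=qastaging`, a call such as `find_config_dir(".", os.environ["LPCONFIG"])` would return `./production-configs/qastaging` when that directory (or a symlink to it) exists, and otherwise fall back to `./configs/qastaging`.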
wgrant	Were we in testfix overnight?	20:57
=== Ursinha is now known as Ursinha-bbk
=== Ursinha-bbk is now known as Ursinha-bbl
weather15	Hello Everyone	21:02
wallyworld	abentley: thumper: now?	21:03
gary_poster	lifeless: https://bazaar.launchpad.net/~launchpad-pqm/launchpad/production-stable/revision/9000	21:03
abentley	wallyworld: sure.	21:03
thumper	wallyworld: just here	21:04
wallyworld	abentley: %@$!!#$ audio died again.	21:05
wallyworld	brb	21:05
weather15	I have a quick question about the Launchpad source	21:06
weather15	When running make schema is this part of a normal output? Unknown entry URL: ScalarValue Unknown entry URL: archive_dependency Unknown entry URL: archive_subscriber Unknown entry URL: binary_package_release_download_count Unknown entry URL: branch_merge_queue Unknown entry URL: branch_subscription Unknown e	21:06
wgrant	That's normal.	21:07
weather15	Okay, thanks wgrant	21:07
=== matsubara is now known as matsubara-afk
weather15	Wgrant: is this a typical end output: make[1]: Leaving directory `/home/weather15/launchpad/lp-branches/devel/database/schema' rm -f -r /var/tmp/fatsam	21:10
wgrant	weather15: Yes.	21:10
weather15	wgrant: Thanks	21:10
weather15	wgrant: I'm running Ubuntu Server	21:11
weather15	In this case how can I access Launchpad.dev?	21:11
weather15	SSH Tunnel?	21:11
weather15	or are there Apache settings to change?	21:12
wgrant	weather15: Have a look at https://dev.launchpad.net/Running/RemoteAccess	21:12
weather15	Also should I follow these instructions? 2010-11-17T16:11:40 WARNING root Developer mode is enabled: this is a security risk and should NOT be enabled on production servers. Developer mode can be turned off in etc/zope.conf	21:13
weather15	I plan on going into production	21:13
wgrant	Running a production Launchpad instance is not a simple task.	21:14
weather15	wgrant: Do I need to have more than 1 IP?	21:14
wgrant	weather15: Only if you want to be able to browse private branches.	21:15
weather15	Okay I do	21:15
weather15	two IPs on my local net or on the Internet?	21:15
wgrant	Wherever you want it to be accessible from.	21:16
weather15	okay	21:17
weather15	I guess if I run it on my local net then I will have all public repos	21:17
weather15	Then I only need 1 IP	21:17
maxb	weather15: OOI, which LP applications do you intend to use in production?	21:18
weather15	Pretty much all	21:19
maxb	Interesting, I'd only imagined people using bugs & code in a local setting	21:19
weather15	That's most likely what will happen but I'm not sure yet	21:20
weather15	I'm focused on getting it working now	21:20
maxb	You know about the whole image licence pain, right?	21:20
weather15	no	21:21
maxb	https://dev.launchpad.net/LaunchpadLicense	21:22
maxb	especially the 4th paragraph	21:22
weather15	"The image and icon files in Launchpad are copyright Canonical, but unlike the source code they are not licensed under the AGPLv3. Canonical grants you the right to use them for testing and development purposes only, but not to use them in production (commercially or non-commercially). "	21:23
weather15	That Part	21:23
wgrant	That part.	21:24
weather15	I know about that	21:29
weather15	I was wondering how to change those images	21:30
weather15	rather 1 IP i guess Ubuntu is not getting IP's on my second interface	21:33
weather15	what do you do when launchpad.dev will not resolve on the network?	21:34
weather15	I guess because I have only 1 IP bazaar will not work	21:35
weather15	is this true?	21:35
weather15	It seems I have 2 IPs now	21:41
weather15	do you replace a.b.c.d here <VirtualHost a.b.c.d:80> with your ip?	21:41
leonardr	james_w, who's the best person to talk to about getting new versions of launchpadlib and friends included in natty?	21:42
james_w	leonardr, Luca probably	21:42
leonardr	james_w: ok, makes sense, thanks	21:42
bigjools	wgrant: so, in case I am asleep at 2330 (highly likely), I've put another cowboy on cesium to fix the buildd manager.	21:48
weather15	Anyone know the answer to my previous question?	21:48
weather15	It says "Or, if you did allocate a suitable second IP address: * Change the <VirtualHost 127.0.0.99:80> line to <VirtualHost a.b.c.d:80> * Change the <VirtualHost 127.0.0.99:443> line to <VirtualHost a.b.c.d:443>"	21:49
wgrant	bigjools: Removing failure counting?	21:50
weather15	is this what I should use or replace a.b.c.d with the IP on my second NIC	21:50
wgrant	weather15: The latter.	21:50
wgrant	bigjools: Do we also have more logging now?	21:50
bigjools	wgrant: some more yes	21:51
weather15	wgrant: with my IP correct?	21:51
wgrant	weather15: Yes.	21:51
weather15	Thanks	21:51
wgrant	bigjools: Well, I guess we'll see how it goes!	21:51
wgrant	bigjools: Did you and jml work anything out?	21:51
jml	wgrant: I didn't!	21:52
bigjools	wgrant: default connection timeout on twisted xmlrpc is 30 seconds, I've made it use socket_timeout instead	21:52
bigjools	I am seeing some builders still failing with that though	21:53
wgrant	bigjools: Hmm. I don't think that really explains everything, but it might fix the resume thing.	21:53
bigjools	not just resume, all xmlrpc requests	21:53
wgrant	jml: Also, why did PQM eat my branch?	21:53
jml	wgrant: I don't know. I didn't see that it got eaten	21:53
bigjools	and no it does not explain everything	21:53
bigjools	but it's a start	21:53
wgrant	jml: It said it submitted, but then nothing :/	21:54
wgrant	bigjools: Yeah, I guess.	21:54
jml	wgrant: I don't know. I won't be able to get around to looking into it tonight – sorry.	21:55
jml	wgrant: maybe you can convince someone else to land it. the tests all pass. if not, I'll do it first thing tomorrow	21:55
mwhudson	a builder taking 30 seconds to accept a connection seems pretty crazy too	21:55
wgrant	jml: Sure, no rush.	21:55
mwhudson	is the listen queue overflowing on the slave side or something?	21:56
mwhudson	i guess that's pretty hard to tell	21:56
wgrant	mwhudson: The builder is an archaic Twisted mess gluing together shoddy shell scripts.	21:56
wgrant	It's allowed to be crazy, I think.	21:56
jml	meh	21:56
jml	we allow it to be crazy	21:56
mwhudson	wgrant: even so	21:56
mwhudson	wgrant: is the builder one of these half twisted things that does blocking operations in the reactor thread?	21:57
wgrant	mwhudson: Sometimes.	21:57
jml	the build manager is	21:57
bigjools	but not for long	21:57
bigjools	mwhudson: I think >30 seconds happens when the slave manager was swapped out under load or something	21:58
mwhudson	oh right	21:58
bigjools	that's my guess....	21:58
jml	bigjools: db queries are blocking calls	21:58
wgrant	bigjools: Doesn't explain all the non-virt failures :(	21:58
bigjools	wgrant: it might, actually	21:58
bigjools	jml: true, very true.	21:59
wgrant	bigjools: How? Unless buildd-manager leaks exceptions across multiple builders, I don't see how...	21:59
weather15	for the allow for	21:59
bigjools	wgrant: if the previous build went into swap ...	21:59
weather15	would this work for 10.0.0.1? 10.0.0. or 10.0.0?	22:00
bigjools	on the same builder, I mean	22:00
mwhudson	... but you'd still need to fill up the listen queue, right? connecting to a listening socket doesn't involve the userspace process doing the listening iiuc	22:00
weather15	For the Allow from	22:00
wgrant	weather15: That's just normal Apache configuration.	22:00
wgrant	mwhudson: Hmmm? It needs to call accept(), right?	22:00
weather15	Yes but I need to set the sllow from	22:00
weather15	*allow	22:00
bigjools	mwhudson: I don't know	22:00
bigjools	some people have said that it needs to accept()	22:00
weather15	would 10.0.0 work or would I have to use 10.0.0. to allow my local network?	22:01
weather15	on 10.0.0.x	22:01
bigjools	it's been a while since I did socket stuff	22:01
mwhudson	i can't remember either	22:01
bigjools	weather15: I suggest you ask Apache questions in the right channel	22:01
bigjools	you will almost certainly get a more knowledgeable answer	22:02
wgrant	bigjools: Hmmm. I see that palmer had been aborted 10 minutes before the failure. So it was probably still building. So that's plausible.	22:03
weather15	looks to me like Allow from 10.0.0.0/255.255.255.0 will work	22:03
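One way to sanity-check which client addresses a network/netmask pair such as 10.0.0.0/255.255.255.0 covers is Python's standard `ipaddress` module. This is only a sketch of the address arithmetic; it says nothing about how Apache itself parses `Allow from`, and the helper name `allowed` is hypothetical.

```python
import ipaddress


def allowed(client_ip, network="10.0.0.0/255.255.255.0"):
    """Return True if client_ip falls inside the given network/netmask."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(network)
```

For example, `allowed("10.0.0.3")` is true, while an address on a neighbouring subnet such as 10.0.1.3 falls outside the range.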
mwhudson	science suggests that i am right about accept	22:03
bigjools	science rocks	22:03
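The experiment behind "science suggests that i am right about accept" is easy to reproduce. Below is a minimal sketch using only the standard `socket` module: it opens a listening socket that never calls `accept()` and checks whether a client can still connect. The function name `connect_without_accept` is hypothetical, and the result only holds while the listen backlog has room, which is the caveat mwhudson raises about the queue filling up.

```python
import socket


def connect_without_accept(timeout=2.0):
    """Return True if a client can connect to a listener that never accepts."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)               # small backlog of pending connections
        address = listener.getsockname()
        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        client.settimeout(timeout)
        try:
            # The kernel completes the TCP handshake on behalf of the
            # listening process; accept() only dequeues the connection.
            client.connect(address)
            return True
        except OSError:
            return False
        finally:
            client.close()
    finally:
        listener.close()
```

If the backlog were already full of un-accepted connections, further connection attempts could stall or time out, which would be consistent with a slave that is too busy or swapped out to call `accept()` for a long time.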
wgrant	Although the fact that it timed out at the same time as the rest is a bit suspicious, perhaps buildd-manager was blocking for the preceding couple of minutes. Insufficient logging :/	22:03
bigjools	yeah, impossible to tell	22:04
bigjools	although if it was slow with the DB ...	22:04
wgrant	siiigh	22:04
mwhudson	i guess you can turn on statement tracing in buildd-manager	22:05
bigjools	log armageddon!	22:06
mwhudson	yeah	22:06
wgrant	bigjools: Ah, this is why we needed to clean out accepted... so we can have hundreds of gigabytes of logs!	22:07
mwhudson	more realistically, you can probably have a tracer log any statement that takes longer than say 5 s	22:07
bigjools	not sure that will help if there's a cumulative effect of 10*1s for example	22:07
mwhudson	... or collect aggregate stats, min, max, mean, stddev kind of thing	22:08
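The two ideas here (log only statements slower than a threshold, and keep aggregate statistics so that many moderately slow statements still show up) can be combined in a small collector. This is a sketch under stated assumptions: the class `StatementTimings` is hypothetical, and how durations would be fed in from the database layer's tracing hooks is left out entirely.

```python
import math


class StatementTimings:
    """Collect per-statement durations and remember the slow ones."""

    def __init__(self, slow_threshold=5.0):
        self.slow_threshold = slow_threshold  # seconds
        self.durations = []
        self.slow = []

    def record(self, statement, seconds):
        """Record one executed statement and how long it took."""
        self.durations.append(seconds)
        if seconds >= self.slow_threshold:
            self.slow.append((statement, seconds))

    def summary(self):
        """Return count, total, min, max, mean and population stddev."""
        count = len(self.durations)
        if count == 0:
            return None
        total = sum(self.durations)
        mean = total / count
        variance = sum((d - mean) ** 2 for d in self.durations) / count
        return {
            "count": count,
            "total": total,
            "min": min(self.durations),
            "max": max(self.durations),
            "mean": mean,
            "stddev": math.sqrt(variance),
        }
```

The `total` field is there for the cumulative case bigjools describes: ten statements of one second each never cross a 5 s threshold individually, but they do show up as ten seconds of total time.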
wgrant	If it happens again today, I think we should run with full logging tomorrow.	22:08
bigjools	I am too tired to think straight now	22:08
mwhudson	fair enough :-)	22:08
bigjools	we are full logging now, except the madness of statement tracing	22:08
wgrant	Even the 'Scanning foo' messages?	22:09
wgrant	And the extra logging in failBuilder that was cowboyed in earlier?	22:09
bigjools	everything	22:09
wgrant	Great.	22:09
bigjools	not that one - because we're not currently failing builders	22:09
wgrant	Ah, heh.	22:09
bigjools	assessFailureCounts is commented out	22:09
bigjools	so it will report on the counts but never do anything about it	22:10
wgrant	Perfect.	22:10
bigjools	I need to split the failure count stuff in two though	22:10
bigjools	1 set for dispatch attempts and 1 set for contact attempts	22:10
weather15	Okay, my Launchpad install can be accessed, with one problem	22:18
weather15	What do you do about this error? Error code: ssl_error_rx_record_too_long	22:19
weather15	SSL received a record that exceeded the maximum permissible length.	22:19
wgrant	Your Apache configuration is broken. It's probably serving normal HTTP on 443.	22:19
weather15	Okay I'll check it again	22:20
weather15	Any idea as to where to look?	22:20
weather15	I don't see anything wrong with it	22:22
weather15	is there something wrong with the keys?	22:22
wgrant	No.	22:22
wgrant	Have you tried restarting Apache?	22:22
thumper	wallyworld: I've pulled your branch and am looking at it...	22:22
weather15	I have a problem	22:24
jml	phwoar	22:24
weather15	my Apache config no longer exists	22:24
jml	food helps	22:24
weather15	what do you do in this case?	22:24
thumper	wallyworld: found it	22:29
thumper	I wish we had different root objects for each virtual domain	22:30
jml	bigjools: did you file a patch upstream for the xmlrpc timeout thingy?	22:41
wallyworld	thumper: just finished breakfast. what was it?	22:43
thumper	wallyworld: I told you wrong, the canonical_url of IBazaarApplication is http://code.launchpad.dev/+code	22:44
thumper	wallyworld: so... we should hang off ILaunchpadRoot	22:44
thumper	or whatever it is	22:44
wallyworld	thumper: ah ok. i saw some other stuff hanging off that and was wondering.....	22:44
wallyworld	i'll fix it	22:45
wallyworld	thanks	22:45
thumper	wallyworld: also, the location of the link on the code homepage needs to be fixed	22:47
wallyworld	thumper: where would you like me to stick it? :-)	22:47
weather15	is this good or bad? WARNING Bad object name 'public.todrop_branchmergerobot' 2010-11-17 22:48:01 WARNING No permissions specified for [u'public.lp_openididentifier'] * Disabling autovacuum	22:48
thumper	wallyworld: I think we should have some nice text below the import text mentioning recipes	22:48
thumper	wallyworld: they are going to be one of our prime features	22:48
thumper	wallyworld: let's mock something up and get it to mrevell to check	22:48
wallyworld	thumper: also, i added a 30 day window to the query. not sure if we want that or not or make it user selectable	22:49
thumper	wallyworld: that may be fine for now	22:49
thumper	we may want to give the users an option	22:49
thumper	later	22:49
wallyworld	thumper: ack the mockup. the initial intent was just to get something working :-)	22:49
bigjools	jml: no, only a bug so far	22:49
thumper	wallyworld: yeah, understand that	22:49
wallyworld	thumper: +1 on the option. i was going to have a selection on the listing page itself, like we do for branch listings	22:50
jml	bigjools: ok.	22:50
bigjools	jml: I'll do one tomorrow	22:50
weather15	any idea what to do about this? Traceback (most recent call last): * Module zope.publisher.publish, line 134, in publish result = publication.callObject(request, obj) * Module canonical.launchpad.webapp.publication, line 483, in callObject return mapply(ob, request.getPositionalArguments(), request) * Module zope.publisher.publish, line 109, in mapply return debug_call(obj, args) __trace	22:52
jml	bigjools: neat.	22:52
weather15	??	22:55
weather15	More output: File "/home/weather15/launchpad/lp-sourcedeps/eggs/zope.publisher-3.12.0-py2.6.egg/zope/publisher/publish.py", line 134, in publish result = publication.callObject(request, obj) File "/home/weather15/launchpad/lp-branches/devel/lib/canonical/launchpad/webapp/publication.py", line 483, in callObject return mapply(ob, request.getPositionalArguments(), request) File "/home/weather15/launchpad/lp-sourcede	22:56
bigjools	jml: maybe someone will write a test if I attach a patch :)	22:58
weather15	????	22:58
bigjools	good night	22:58
jml	bigjools: g'night.	22:58
wgrant	Night bigjools.	22:59
weather15	What's this mean? No such file or directory: '/var/tmp/mailman/data/master-qrunner.pid' Is qrunner even running? rm -f logs/thread*.request bin/run -r librarian,google-webservice,memcached -i development	22:59
weather15	mailman not running	22:59
weather15	causing this problem?	22:59
weather15	Traceback (most recent call last): * Module zope.publisher.publish, line 134, in publish result = publication.callObject(request, obj) * Module canonical.launchpad.webapp.publication, line 483, in callObject return mapply(ob, request.getPositionalArguments(), request) * Module zope.publisher.publish, line 109, in mapply return debug_call(obj, args) __traceback_info__: <bound method OpenID	23:00
mars	weather15, what command did you run to get that output?	23:01
weather15	mars:I went to the login page: https://launchpad.dev/+login	23:01
mars	weather15, are you using 'make run' in the launchpad source tree?	23:02
weather15	yes	23:02
mars	and it did not produce obvious errors about starting mailman?	23:03
weather15	no it did	23:03
weather15	here's the full output: make run utilities/shhh.py PYTHONPATH= python bootstrap.py\ --setup-source=ez_setup.py \ --download-base=download-cache/dist --eggs=eggs \ --version=1.5.1 mkdir -p /var/tmp/vostok-archive utilities/shhh.py make -C sourcecode build PYTHON=python \ LPCONFIG=development utilities/shhh.py LPCONFIG=development /home/weather15/launchpad/lp-branches/d	23:03
weather15	Apparently I can't paste it all	23:04
mars	weather15, pastebin.ubuntu.com	23:04
weather15	http://pastebin.ubuntu.com/533640/	23:06
wallyworld	thumper: just wondering aloud, to me it's bad that the tests passed (2 different page/view creation steps too) but the app failed to run in practice. agree? something to fix?	23:07
mars	weather15, on line 28, that looks like an error when the server is first run - it tried to clean up a PID file that doesn't exist. I wouldn't worry about it.	23:07
thumper	wallyworld: the problem is that you weren't loading the page, and clicking on the link	23:07
thumper	wallyworld: we had page tests for things like that	23:07
mars	weather15, what did you see when you tried launchpad.dev/+login ?	23:08
thumper	wallyworld: the unit tests were going directly to the page	23:08
thumper	wallyworld: so you never saw the actual url	23:08
weather15	mars: http://pastebin.ubuntu.com/533641/	23:08
thumper	wallyworld: you could add a test that gets the browser for the page	23:08
thumper	wallyworld: and tests the browser.url	23:08
thumper	wallyworld: that would have caught it	23:08
wallyworld	thumper: ok. i assumed that calls like create_initialized_view(root, "+daily-builds", rootsite='code') would use the same zope infrastructure as is used to load a page etc	23:09
thumper	wallyworld: it does	23:09
thumper	wallyworld: but the code root page was using a relative url hard coded	23:10
mars	weather15, that is new. https://launchpad.dev works?	23:10
thumper	wallyworld: it wasn't generating the url in the same way that the tests were	23:10
wallyworld	ok	23:10
weather15	for me using the source and by setting it in my /etc/hosts file	23:10
wgrant	weather15: Your Apache config for testopenid.dev is still broken.	23:10
weather15	the documentation never mentioned that	23:11
weather15	what do I have to do?	23:11
wgrant	You must have broken it when you were changing the config.	23:11
wgrant	It's in with the rest.	23:11
mars	weather15, read the rocketfuel-setup script, it has a bash Here Document inside that sets up the /etc/hosts file. You can compare with that.	23:13
weather15	there's no mention of openid in the apache config	23:14
weather15	this is what the Launchpad part of /etc/hosts looks like: 10.0.0.3 launchpad.dev answers.launchpad.dev archive.launchpad.dev api.launchpad.dev bazaar-internal.launchpad.dev beta.launchpad.dev blueprints.launchpad.dev bugs.launchpad.dev code.launchpad.dev feeds.launchpad.dev id.launchpad.dev keyserver.launchpad.dev lists.launchpad.dev openid.launchpad.dev ubuntu-openid.launchpad.dev ppa.launchpad.dev private-ppa.launchpa	23:15
wgrant	It will probably go to the first matching vhost, then.	23:15
wgrant	weather15: Try adding 'ServerAlias testopenid.dev' to the bottom two sections in the Apache config.	23:16
wgrant	Alongside launchpad.dev and *.launchpad.dev	23:16
weather15	okay done	23:18
mars	wgrant, weather15, on my system, the only location of testopenid.dev is in the /etc/hosts file	23:18
wgrant	mars: Right.	23:19
wgrant	mars: So it uses the default vhost.	23:19
weather15	I have launchpad starting now lets see what happens	23:19
wgrant	flacoste: http://paste.ubuntu.com/533638/ fixes the .htpasswd thing.	23:19
wgrant	flacoste: Not sure why.	23:19
wgrant	(it reverts part of the problematic rev)	23:20
* jml off		23:20
flacoste	wgrant: weird	23:20
wgrant	flacoste: Just a little.	23:20
flacoste	i thought that umask played only when creating a file	23:20
weather15	That doesn't explain why this is not working	23:20
wgrant	In both cases this creates a file.	23:20
wgrant	But somehow O_TRUNC changes things.	23:21
wgrant	Or Python is doing something stupid.	23:21
weather15	still: Traceback (most recent call last): * Module zope.publisher.publish, line 134, in publish result = publication.callObject(request, obj) * Module canonical.launchpad.webapp.publication, line 483, in callObject return mapply(ob, request.getPositionalArguments(), request) * Module zope.publisher.publish, line 109, in mapply return debug_call(obj, args) __traceback_info__: <bound method	23:21
wgrant	weather15: Does accessing testopenid.dev in a browser work?	23:22
flacoste	wgrant: 'w' would use O_TRUNC?	23:22
wgrant	flacoste: Yes.	23:23
flacoste	wgrant: ok	23:23
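flacoste's point that the umask only matters when a file is created can be checked directly. The sketch below creates a file under a restrictive umask, loosens its mode with `chmod`, then reopens it with mode `'w'` (which truncates) and reports the permissions after each step. It is an illustration of general POSIX behaviour, not a diagnosis of the .htpasswd regression discussed here; the function name and file name are hypothetical.

```python
import os
import shutil
import stat
import tempfile


def modes_after_create_and_rewrite(umask=0o077, chmod_to=0o644):
    """Return (mode after creation, mode after truncating rewrite)."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "htpasswd-demo")
    old_umask = os.umask(umask)
    try:
        # First open: the file does not exist, so it is created and the
        # umask is applied to the default 0o666 creation mode.
        with open(path, "w") as handle:
            handle.write("first\n")
        created = stat.S_IMODE(os.stat(path).st_mode)

        # chmod sets the mode exactly; the umask is not consulted.
        os.chmod(path, chmod_to)

        # Second open: the file exists, so 'w' truncates it in place and
        # the existing permission bits are left alone.
        with open(path, "w") as handle:
            handle.write("second\n")
        rewritten = stat.S_IMODE(os.stat(path).st_mode)
        return created, rewritten
    finally:
        os.umask(old_umask)
        shutil.rmtree(directory, ignore_errors=True)
```

On a POSIX system this returns 0o600 for the freshly created file and 0o644 after the rewrite. Code that deletes and recreates a file will therefore pick up the current umask, whereas code that truncates an existing file will keep whatever mode it already had.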
weather15	server side yes	23:23
weather15	client side no	23:23
flacoste	wgrant: wallyworld is going to coordinate deploying that as a cow-boy	23:23
wgrant	flacoste: Great.	23:24
* flacoste updates incident report		23:25
weather15	is that the problem	23:25
=== flacoste changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 | firefighting: buildd-manager is disabling things again & https://wiki.canonical.com/IncidentReports/2010-11-17-LP-Private-PPA-500-errors | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting
wgrant	Yay Soyuz.	23:29
weather15	Problem still exists	23:31
weather15	Oops! Sorry, something just went wrong in Launchpad. We’ve recorded what happened, and we’ll fix it as soon as possible. Apologies for the inconvenience. (Error ID: OOPS-1782X11) Traceback (most recent call last): * Module zope.publisher.publish, line 134, in publish result = publication.callObject(request, obj) * Module canonical.launchpad.webapp.publication, line 483, in callObject return mappl	23:31
flacoste	wgrant: shouldn't we set the umask explicitly there? instead of relying on the env	23:33
weather15	this URL works: http://testopenid.dev/	23:33
weather15	it returns:Test OpenID provider for launchpad.dev	23:33
weather15	I wonder if this has something to do with it: https://code.launchpad.net/~bac/launchpad/bug-524302/+merge/22180	23:34
weather15	????	23:36
weather15	output on server side is different:	23:40
weather15	does this make more sense? http://pastebin.ubuntu.com/533653/	23:42
weather15	mars wgrant?	23:43
mars	weather15, try stopping the service, then running 'make clean && make' in the source tree.	23:44
weather15	okay will do	23:44
mars	weather15, https://launchpad.dev/+icing/rev5/build/lp/lp.js should be a real file when the server is running	23:44
weather15	okay it's executing now	23:45
mars	the build system should create that JavaScript file for you. You may want to check the source tree to see that it was created	23:45
mars	(when make finishes)	23:45
weather15	when I run make run I get this http://pastebin.ubuntu.com/533655/	23:51
weather15	mars wgrant?	23:53
wallyworld	abentley: ping	23:53
wallyworld	mars: ping - i need a cowboy eyeballed before asking a losa to deploy it	23:55
weather15	??	23:56
wallyworld	StevenK: ping?	23:57
wgrant	wallyworld: How many eyeballs does it need?	23:57
wgrant	Is yours insufficient?	23:57
wallyworld	wgrant: just one. the change as per your pastebin just reverts it to as it was before 11982 landed	23:58
wallyworld	https://code.edge.launchpad.net/~wallyworld/launchpad/htpasswd-access-permissions/+merge/41115	23:58
wallyworld	wgrant: i wasn't sure if i needed to ask a reviewer to eyeball it or not	23:58
wgrant	Oh, right, forgot you weren't a reviewer yet.	23:59
wallyworld	:-)	23:59
