cody-somervillewgrant, https://edge.launchpad.net/builders00:14
thumperStevenK: ^^00:17
wgrantThat was meant to be fixed.00:19
wgrantAnd we can't rollback after tomorrow. :/00:20
wgrantspm: Around?00:20
elmowe can cherrypick rollback00:20
elmohe doesn't appear to be00:20
elmowhat do you need?00:20
wgrantWe have no StevenK this week, AIUI.00:20
wgrantThe new buildd-manager is still horribly broken. Can you see if there's anything interesting in its log?00:21
elmoyes, all sorts of crap00:21
wgrantWell, that's relatively more opaque than I would have hoped.00:24
thumperare we running the new buildmanager then?00:24
wgrantWe are.00:24
pjdcwgrant: are we looking for anything in particular in the buildd manager log?00:25
wgrantpjdc: I have a section of it. It's not very helpful.00:25
thumperany idea where the log is on devpad?00:25
thumperwgrant: do you know which machine the buildd manager runs on?00:25
wgrantthumper: cesium00:26
wgrantI have bits of the log already.00:26
pjdcthumper: looks like they land in devpad:/x/launchpad.net-logs/production/cesium00:26
thumperpjdc: yep, there00:26
wgrantTwisted errors make me sad.00:27
wgrantI wonder if the fermium connection error at 2010-11-17 00:08:13 is the root.00:28
wgrantOh, no.00:28
wgrantIt was a few minutes before that.00:28
wgrantCan you see where it started?00:29
wgrantpjdc: There have been no known network glitches this morning?00:29
pjdcwgrant: not to my knowledge.00:30
thumper2010-11-16 00:01:03+0000 [-] <titanium:http://titanium.ppa:8221/> communication failed (User timeout caused connection failure.)00:31
thumper2010-11-16 00:01:03+0000 [-] <titanium:http://titanium.ppa:8221/> failure (None)00:31
thumper2010-11-16 00:01:03+0000 [-] <nannyberry:http://nannyberry.ppa:8221/> communication failed (User timeout caused connection failure.)00:31
thumper2010-11-16 00:01:03+0000 [-] <nannyberry:http://nannyberry.ppa:8221/> failure (None)00:31
thumper2010-11-16 00:01:03+0000 [-] <charichuelo:http://charichuelo.ppa:8221/> communication failed (User timeout caused connection failure.)00:31
pjdchmm, i could check if fermium was airlocked around the00:31
thumper2010-11-16 00:01:03+0000 [-] <charichuelo:http://charichuelo.ppa:8221/> failure (None)00:31
thumper2010-11-16 00:01:03+0000 [-] <thorium:http://thorium.ppa:8221/> communication failed (User timeout caused connection failure.)00:31
thumper2010-11-16 00:01:03+0000 [-] <thorium:http://thorium.ppa:8221/> failure (None)00:31
thumperseeing things like that00:31
thumperperhaps a pastebin would be better00:31
wgrantthumper: Isn't that >24 hours ago?00:31
thumperwgrant: yep, I'm looking for failures00:31
thumperbuildd-manager.log-20101116 has up to 2010-11-16 00:01:05+000000:34
wgrantYeah, they're named sort of wrongly.00:35
thumperbuildd-manager.log starts at 2010-11-17 00:00:45+000000:35
wgrantThey're named after the day they're rotated.00:35
thumperand no log file in the middle00:35
wgrantThere's no -20101117?00:35
thumperpjdc: can you see a -20101117 file on cesium?00:36
pjdcthumper: looking00:36
pjdcthumper: yes, there's a buildd-manager.log-2010111700:36
thumperpjdc: can you get that to devpad plz?00:36
thumperaccording to the graph, we started losing builders approx 12 hours ago00:37
wgrantDoes LPS say when cesium was updated?00:38
pjdcthumper: landed in /tmp00:38
thumperpjdc: ta00:39
thumperScanning failed with: <Fault 8002: 'error'> <-- look suspect00:40
wgrantThat's fairly normal.00:40
thumperUnhandled error in Deferred:00:40
wgrantThat's not.00:40
thumperthat's probably not00:40
thumper2010-11-16 23:51:34+0000 [-] builder promethium failure count: 5, job 'amd64 build of widelands 1:15bzr5723-ppa1~natty1 in ubuntu natty RELEASE' failure count: 100:41
thumper2010-11-16 23:51:34+0000 [-] Scanning failed with: User timeout caused connection failure.00:41
wgrantThere are hundreds of those.00:41
wgrantAnd that's not the first.00:41
thumperFailure: twisted.internet.defer.CancelledError:00:42
thumperit isn't the first00:42
thumperI'm looking for suspect lines00:42
thumperis there a log entry for disabling a builder?00:43
wgrantThere used to be.00:44
wgrantI never got around to reviewing the full 6000 line diff of this branch, though :/00:44
* wgrant looks.00:44
thumperMiserable failure when trying to examine failure counts: :-)00:45
wgrant.... wow.00:45
wgrant        except:00:45
wgrantAnyway, what was the miserable failure?00:46
wgrantthumper: You'll be pleased to know that not all failBuilder callsites log.00:47
thumperwgrant: I'm grepping00:47
thumperwgrant: I can kinda tell00:47
thumperI suppose we should do a launchpad status thingy00:47
wgrantBut we should be able to tell from the logs.00:47
wgrantSince only two of the callsites are unobvious, and one has enough logging around it that we should be able to work it out.00:48
thumperwgrant: can you see devpad?00:48
wgrantthumper: Not for another few weeks :(00:48
thumperjust wondering, I'll copy the logfile to people.c.c00:48
thumperI'm getting refused ssh00:50
wgrantThat's excellent.00:50
thumperhmm... something screwy is going on...00:52
thumperok, it's on its way up00:52
thumperwgrant: I'm not entirely sure what to look for00:53
wgrantthumper: Thanks.00:53
* wgrant examines.00:53
wgrantIt's tempting to add more explicit logging, restart it, and hope it breaks again.00:54
thumpertest rollout of 11926 on 2010-11-1600:54
thumperthat's from LPS00:55
wgrantI was hoping for higher granularity. But I guess the log should help with that.00:55
thumperbug https://bugs.launchpad.net/soyuz/+bug/67124200:55
_mup_Bug #671242: New buildd-manager disabling everything in sight <qa-ok> <Soyuz:Fix Committed by julian-edwards> <https://launchpad.net/bugs/671242>00:55
wgrant10:03, by the look of things.00:55
wgrantthumper: Right, 11888 was deployed, but that broke with that bug.00:56
wgrant11926 fixed that.00:56
wgrantBut we now have another one.00:56
thumperthis is a different problem?00:56
wgrantI believe.00:57
=== thumper changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 | firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting
wgrantI think this problem may be partially described in that bug, but it's not the one that was identified and fixed.00:57
wgrantSo, we have major problems at 14:41:45, 15:03:20, 16:31:22 and 23:39:11, and possibly 18:11:22 and 19:52:18.01:03
wgrant23:39:11 is the big one which took out everything.01:03
wgrantEach major failure starts with a single scan failure, then a huge number 9 seconds later.01:04
thumperelmo: ping?01:04
wgrantHe left a while ago.01:05
thumperI think we are losaless01:05
wgrantpjdc should be around, though?01:05
thumperpjdc: what do you know about LP deployment?01:05
wgrantIf you want to revert, we don't need a full deployment. It's just a symlink change and restart of buildd-manager.01:06
pjdcthumper: not much. i assisted elmo with an emergency cowboy about a year ago, but that's about it.01:06
thumperlifeless: are you really there?01:06
thumperpjdc: on cesium there was a rollout yesterday01:06
thumperpjdc: I'm hoping that they kept the old code around01:06
wgrantthumper: Last time that was the case.01:07
thumperpjdc: as the new code is disabling all the builders01:07
wgrantthumper: A LOSA just flipped the symlink back to the 10.10 rollout.01:07
thumperwgrant: if the buildd-manager is restarted, will it recheck the disabled builders?01:07
pjdcthumper: if i'm looking in the right place, there are two trees, 11926 and 988601:07
lifelessthumper: yes01:07
pjdcthumper: 9886 looks pretty old01:07
thumperpjdc: 11926 is the broken one01:08
thumperpjdc: 9886 will be from db-devel01:08
thumperso... probably the last rollout01:08
thumperWed 13th of Oct01:08
thumperthat's the date on rev 9886 of db-devel01:08
wgrantthumper: No, we'll have to flip a flag on each to get them back.01:09
wgrant9886 is what we reverted to after 11888 failed. It's the last rollout.01:09
thumperpjdc: can you do that?01:09
thumperwgrant: how do we re-enable the builders?01:10
pjdcthumper: so that'd be change the symlink, restart the service?01:11
thumperpjdc: AFAIK01:11
pjdcthumper: i take it things can't get worse at this point?01:11
lifelessthumper: again? nooo01:11
lifelessthis bodes badly for the rollout tomorrow.01:11
lifelesslike really.01:11
lifelessthumper: 9886 is fine, its the last db-stable  deploy01:11
lifelesswgrant: you got log files etc - whats up ?01:11
lifelessit was stable for hours - did this happen recently? could it be librarian changes? [hope not]01:11
wgrantYeah, cesium is as broken as it can be.01:11
wgrantlifeless: I have the log change.01:12
lifelesswhats causing this01:12
lifelessbefore we change stuff01:12
wgrantIt was mostly OK for 4 hours.01:12
wgrantAfter 13 hours it just completely melted down.01:12
wgrantStill looking to see if I can get anything useful from the logs.01:13
lifelesscan we just restart it in the meantime and toggle the builds back on ?01:13
wgrantthumper: The 'Builder OK' flag on Builder:+edit does it. Otherwise there might be a script around.01:13
lifelessor will it kill them immediately ?01:13
wgrantlifeless: I guess we could try that.01:13
lifelesswe need to figure out what happens tomorrow @ rollout time01:13
lifelesswhich is what, 8 hours away01:14
thumperlifeless: right now it is a release blocker IMO01:14
thumperI'm starting an incident report01:14
wgrantlifeless: I think we probably cherrypick 11808, 11815 and 11926 off cesium.01:14
lifelesswgrant: 11926 itself is a problem ?01:14
wgrantlifeless: No, but 11808 probably won't revert unless we revert 11926 first.01:15
wgrantNot 11926.01:15
wgrantThat other one.01:15
lifelesssorry, I left my mind reader in asia01:15
wgrantThe fix for the issue that caused us to roll cesium back from 11888.01:16
wgrantSo 11808, 11815 and 11898.01:17
wgrantDo we know if enablement has pulled any buildds today?01:19
lifelesscody-somerville: ^01:19
pjdcwgrant: it's been quiet since the 12th, as far as i can tell01:19
wgrantIt doesn't explain everything (since non-virt buildds had the same error), but it might be something.01:20
lifelesswgrant: whats the error01:20
wgrantlifeless: 2010-11-16 16:31:13+0000 [-] Scanning failed with: User timeout caused connection failure.01:20
wgrant2010-11-16 16:31:13+0000 [-] Traceback (most recent call last):01:20
wgrant2010-11-16 16:31:13+0000 [-] Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.01:20
wgrantlifeless: In most of the major failures in the log, there is one of those followed by dozens 9 seconds later.01:21
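The pattern wgrant describes (one scan failure, then dozens about 9 seconds later) can be checked mechanically. The following is a minimal sketch, assuming log lines shaped like the excerpts pasted above; the function names and the 15-second grouping window are illustrative choices, not part of buildd-manager.

```python
import re
from datetime import datetime, timedelta

# Timestamp, logger tag in brackets, then the message, as in the pasted excerpts.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\+0000 \[[^\]]*\] (.*)$"
)


def scan_failure_times(lines):
    """Return the timestamp of every 'Scanning failed with:' line."""
    times = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2).startswith("Scanning failed with:"):
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def group_bursts(times, window=timedelta(seconds=15)):
    """Group timestamps into bursts; a gap longer than `window` starts a new burst.

    The 15-second window is an assumption chosen to span the 9-second
    delay mentioned in the discussion.
    """
    bursts = []
    for stamp in sorted(times):
        if bursts and stamp - bursts[-1][-1] <= window:
            bursts[-1].append(stamp)
        else:
            bursts.append([stamp])
    return bursts
```

Running `group_bursts(scan_failure_times(open(path)))` over a rotated log would list each burst with its first timestamp and size, which makes it easier to see whether every large burst is preceded by a lone failure.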
thumperlifeless: do you object to rolling back the buildd-manager code on cesium?01:22
wgrantPerhaps we should try enabling things and see if they stay alive for long.01:22
mwhudsondoes the buildd-manager still do blocking things?01:23
lifelessthumper: I want to be sure we understand it01:23
wgrantOnly when downloading files from slaves, I believe.01:23
thumpermwhudson: I believe it is twisted now01:23
wgrantmwhudson: ^^01:23
lifelesspjdc: can you please:01:23
thumperwgrant: I thought that jelmer fixed that01:23
mwhudsonthumper: fully?  it's been somewhat twisted for a long time01:24
* thumper needs food01:24
lifeless - restart the builddmanager01:24
wgrantthumper: jelmer fixed it so it uploads the downloaded files asynchronously.01:24
lifeless - reenable a couple of fast buildds01:24
lifeless - see what happens over a few minutes01:24
wgrantthumper: A branch is coming to download them async, too, but it's not done yet.01:24
lifelessthumper: go eat, nothing will change radically while you eat01:24
pjdci'm not too familiar with the buildd pool. can someone suggest candidates?01:24
* pjdc picks 3 amd64 official builders01:25
wgrantpjdc: A fairly random pick of the various categories: roseapple, allspice, doubah, samarium01:25
wgrantCouple of new non-virt, and an old and new virt.01:26
pjdcworks for me01:26
pjdcrestarting buildd-manager01:26
pjdcstarted, doing the buildds now01:27
wgrantMaybe we should turn the logging up.01:28
wgrant(lib/lp/buildmaster/manager.py, s/logging.INFO/logging.DEBUG/)01:28
pjdcre-enabled those four, plus yellow and crested since i had the tabs all ready01:28
wgrantThere are some odd five minute gaps in the log, and it would be nice to know if it actually does anything in them.01:29
thumperpjdc: the queue for amd64 is empty though01:42
thumpernot sure if that'll show much01:42
pjdcthumper: looks like doubah's done the business though, showing as disabled again01:43
wgrantThe builders were failed regardless of whether there was anything to build or not.01:43
wgrantOh, already?01:43
thumperoh ok01:43
lifelesswgrant: does the buildd manager *read* from the librarian ?01:44
wgrantlifeless: I don't think so.01:44
wgrantI can't think why it would.01:44
lifelessdo builders ?01:44
lifelesswhat code path do they use to get their urls ?01:44
wgrantAhh. cesium provides them, I believe.01:45
wgrantBut it doesn't use the restricted librarian.01:45
lifelesseven for security builds etc?01:45
wgrantNo -- private build files are retrieved from the archive.01:45
wgrantSince the builders can't have restricted librarian access.01:46
wgrant(well, I guess they could now)01:46
lifelessjust to be sure01:46
lifelesspjdc: are we seeing access denied for or from the builders that fail (or from cesium for that matter)01:46
lifelesswgrant: what time was the first builder disabled ?01:46
wgrantlifeless: Can't tell. But the first major incident was probably 14:41:45. 23:39:11 was the really big one.01:47
pjdclifeless: cesium can't connect to those IPs on 80 and 44301:47
elmois this the restricted librarian01:48
lifelesspjdc: ah, but is it *trying*01:48
elmogod damn it01:48
pjdclifeless: i'll check01:48
lifelesselmo: there are two code paths for it in lp. internal stuff like the merge proposal diff code will use lp.internal still01:48
wgrantcesium will be wanting to upload to it. Is that also going to be broken?01:48
elmo188     IN      PTR     wildcard-restricted-launchpadlibrarian-net.banana.canonical.com.01:49
lifelesselmo: I really really wouldn't expect this to be connected, but Just In Case.01:49
lifelesswgrant: no, uploads have not altered *at all*01:49
wgrantThat's what I thought.01:49
lifelesswgrant: are those times UTC?01:49
wgrantlifeless: Yes.01:49
lifelessok, so 9 hours apart.01:50
wgrantThere were some in between.01:50
lifeless(range of) unless we're seeing two different things01:50
lifelessthen its not the publicrestricted librarian work01:50
wgrantWhen did that happen?01:50
pjdclifeless: a few rejects (5 total) for cesium going to both IPs on 80 and 44301:50
lifelesswgrant: and really? sed to change log levels ?01:50
wgrantlifeless: Yup :D01:51
lifelesspjdc: thats -very- interesting01:51
lifelesswgrant: I can has bug, and fix.01:51
lifelesselmo: did you just remove the publicrestricted feature flag?01:51
pjdclifeless: sorry, false alarm. those were my test attempts.01:52
lifelesselmo: from https://launchpad.net/+feature-rules01:52
elmolifeless: I haven't touched anything01:52
lifelesscause the setting is gone ;)01:52
wgrantIt's possible that translations jobs might read from the librarian.01:53
wgrantI don't know them well.01:53
* wgrant looks.01:53
wgrantBut they wouldn't be private.01:53
lifelesspjdc: did you?01:55
lifelessnb we really need that audit log01:55
pjdclifeless: sorry, did i what?01:55
lifelesssounds like a no to me01:56
lifelessok, one thing at a time01:57
wgrantWe should probably have an 'assert not libraryfilealias.restricted' in BuilderSlave.cacheFile.01:57
lifelesspjdc: please reenable doubah again.01:57
wgrantBut I doubt that's the problem here.01:57
pjdclifeless: ok01:58
lifelesswgrant: or enable it01:58
* thumper has to go get kids01:58
wgrantlifeless: Hm.01:58
lifelesspjdc: if this fails, its not the publicrestricted librarian.01:58
pjdclifeless: doubah re-enabled01:58
elmohttps://launchpad.net/ubuntu/+source/fglrx-installer/2:8.780-0ubuntu3/+build/2049941 <-- roseapple worked (for one build) - did we know that?01:58
wgrantYeah, most things work for a while.01:58
wgrantIt's not related to what the builder is doing at the time.01:58
wgrantIt may be related to what others are doing, but who knows.01:59
lifelesswgrant: how do we know that?01:59
wgrantlifeless: Because it affects dozens of builders at a time, whether they're idle or building recipes or building binaries.01:59
wgrantdoubah's gone.01:59
lifelesswgrant: what if the timeout or some such is pseudo global, and one hung builder breaks all the ones open for the time window involved01:59
wgrantlifeless: Exactly.02:00
wgrant12:59:04 < wgrant> It may be related to what others are doing, but who knows.02:00
wgrantSo, doubah is dead with a TCP timeout.02:00
lifelesswgrant: in which case its related to what *one* does02:00
wgrantI wish we had a traceback.02:00
wgrantIt would be sorta helpful to know what timed out.02:00
wgrantlifeless: buildd-manager won't have cached the old FF?02:04
lifelesswgrant: it wasn't restarted when the problem happened02:05
wgrantCan we disable everything, enable doubah, and see what happens?02:05
lifelesswgrant: it was 3 hours ago now that the ff was turned on (and apparently off again)02:06
wgrantOh, so ages after the world exploded. I see.02:06
lifelesswgrant: yeah, I'm convinced we're clear02:07
mwhudsonlifeless: Chex turned the feature flag off after i complained that private codebrowse wasn't working02:07
lifelessmwhudson: did that make it work?02:07
mwhudson(which seems entirely unrelated to me, but after he did it, private codebrowse started working again)02:08
lifelesscodebrowse uses the librarian?02:08
mwhudsoni'm betting some kind of coincidence02:08
lifelesspjdc: can you please turn the flag on again - its listed under sql queries etc on LPS.02:08
pjdclifeless: that's a long page. what am i looking for exactly?02:09
lifelesspublicrestrictedlibrarian default 0 on02:10
pjdcthat doesn't mean much to me. is that a command?02:11
lifelessits a line you put in https://launchpad.net/+feature-rules02:11
lifelesshttps://dev.launchpad.net/LEP/FeatureFlags has plenty of docs - see the bottom of the page in particular02:11
pjdcah, ok. so bung it in at the end, hit "Change"?02:12
lifelessmwhudson: still working ?02:13
mwhudsonlifeless: will check02:13
lifeless'Server denied check_authentication' is what you saw?02:13
mwhudsonlifeless: yes02:13
lifelesspjdc: and remove it?02:14
mwhudsonlifeless: it works now02:14
lifelesspjdc: don't remove it02:14
mwhudsonlifeless: oh, it failed for you?02:14
pjdclifeless: ok, doing nothing :)02:14
lifelessmwhudson: failed once02:15
lifelessmwhudson: worked on second url02:15
lifelessI think its stubs openid change02:15
mwhudsonlifeless: random02:15
lifelessmwhudson: so coincidence that that was the first private branch url you tried since 11926 was deploy.02:15
lifelesspjdc: thanks02:16
lifelessso back to the buildd02:16
lifelesspjdc: did doubah do day carooba?02:16
wgrantSo, I'd like to see this happen:02:16
wgrant - Disable all builders.02:17
wgrant - Shut down buildd-manager.02:17
pjdclifeless: you lost me at "do02:17
wgrant - Change log level.02:17
wgrant - Enable doubah.02:17
wgrant - Start buildd-manager02:17
lifelesspjdc: you reenabled doubah02:17
lifelesspjdc: did it die again?02:17
wgrantIt did.02:17
lifelesspjdc: could you do what wgrant just described02:18
pjdcdisable all builders including those currently building? or just the idle ones?02:18
wgrantAll, ideally.02:19
lifelesswe're wondering if doubah is broken02:19
pjdcthe alls have it02:19
lifelessand a bug is making all the others get nuked when it goes *if* they happen to be lined up with it in the polling period02:19
wgrantWell, I'm mostly hoping we can get a minimal case to fail.02:19
wgrantI doubt there's anything wrong with doubah.02:19
lifelessok. I'm thinking that.02:19
lifelesswhat is doubah - virt i386?02:20
wgrantSince I picked four semi-randomly from a pool of 60.02:20
wgrantA fairly beefy one, too.02:20
pjdcall disabled, stopping the buildd-manager02:22
pjdcwgrant: how is the log level changed?02:22
wgrantpjdc: Aherm.02:23
wgrantpjdc: s/logging.INFO/logging.DEBUG/ in lib/lp/buildmaster/manager.py.02:23
lifelesswgrant: you *are* going to fix that.02:23
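One way to avoid editing manager.py with sed every time more detail is needed is to read the level from the environment. This is only a sketch of that idea: the variable name `BUILDD_MANAGER_LOG_LEVEL` and the helper `resolve_log_level` are hypothetical and do not exist in Launchpad.

```python
import logging
import os

# Level names accepted by this sketch; anything else falls back to INFO.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_log_level(environ=None, variable="BUILDD_MANAGER_LOG_LEVEL"):
    """Return a logging level taken from `variable`, defaulting to INFO.

    `variable` is a hypothetical environment variable name used only
    for illustration.
    """
    environ = os.environ if environ is None else environ
    name = environ.get(variable, "INFO").strip().upper()
    return LEVELS.get(name, logging.INFO)


if __name__ == "__main__":
    logging.basicConfig(level=resolve_log_level())
    logging.getLogger("buildd-manager").debug("debug logging is on")
```

With something like this in place, turning logging up would be a matter of setting the variable and restarting the service rather than patching the deployed tree.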
wgrantlifeless: It is Twisted evil.02:24
wgrantWhich I don't know awfully well.02:24
lifelesshasn't stopped you in the past02:24
pjdcwgrant: like so? http://paste.ubuntu.com/533321/02:25
wgrantpjdc: Yup.02:25
lifelesswgrant: and you have help02:25
wgrantFlip doubah back on, and start b-m up.02:25
wgrantAnd let's hope it fails.02:25
pjdcdoubah enabled, b-m starting02:25
lifelesswgrant: what time do recipe builds auto create ?02:25
wgrantlifeless: Probably a couple of hours ago.02:26
pjdcb-m started02:26
wgrantlifeless: They were happening around the time it was noticed.02:26
wgrantdoubah's dead already?02:26
wgrantHm, no.02:26
wgrantMust have been cached.02:27
pjdcshows as building here02:27
wgrantYeah, it is now.02:27
wgrantOK, it's started.02:28
wgrantOnly died once.02:28
wgrant:( it seems to be happy.02:32
lifelessshould we bring up another virt i386 ?02:32
lifeless=== Top 10 Time Out Counts by Page ID ===02:33
lifeless    Hard / Soft  Page ID02:33
lifeless     230 /   59  Person:+commentedbugs02:33
lifeless     111 / 5615  Archive:+index02:33
lifeless      76 /  295  BugTask:+index02:33
lifeless      12 /  398  Distribution:+bugtarget-portlet-bugfilters-stats02:33
wgrantWorth a try.02:33
lifeless      12 /  341  Distribution:+bugs02:33
lifeless      10 /    5  Person:+bugs02:33
lifeless       9 /    7  ProjectGroup:+milestones02:33
wgrant(virt i386 is good because it gives us all job types)02:33
lifeless       8 /    2  BugTask:+create-question02:33
lifeless       5 /   47  Distribution:+archivemirrors02:33
lifeless       5 /   17  DistributionSourcePackage:+publishinghistory02:33
pjdcwgrant: shall i enable actinium then?02:35
wgrantpjdc: Sure.02:35
pjdcwgrant: ok, enabled02:35
wgrantMaybe I'm on a slave, but actinium looks dead.02:37
wgrantIf it has just died, this is great news indeed.02:37
pjdclooking dead here too02:37
wgrantWe may have some hope of untangling the logs this time.02:38
wgrantCould you throw the log since the restart somewhere I can see it?02:38
pjdcwill do02:40
pjdcsee query02:43
wgrant2010-11-17 02:26:50+0000 [Uninitialized] ForbiddenAttribute: ('build', <TRANSLATION_TEMPLATES_BUILD branch job (2146072) for ~gwibber-committers/gwibber/trunk>)02:44
wgrantThis logging is a lot more descriptive :)02:45
wgrantHm, so actinium was aborted.02:47
wgrantIt was resumed, then just a few seconds later a dispatch was attempted... that's far too quick.02:48
wgrantSo, actinium probably wasn't hit by the root issue. :(02:49
wgrantWe don't wait long enough for the resume to complete.02:49
wgrantBut that doesn't explain the 'User timeout caused connection failure' thing, or why non-virt builders were broken too.02:49
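The "dispatch attempted a few seconds after resume" problem amounts to not waiting for the slave's port to accept connections. A minimal standalone sketch of such a wait follows; it is blocking code for illustration only (buildd-manager itself is Twisted, where a blocking sleep would be wrong), and the function name, retry count and delays are assumptions.

```python
import socket
import time


def wait_for_port(host, port, attempts=5, delay=2.0, timeout=5.0):
    """Try to open a TCP connection up to `attempts` times.

    Returns True as soon as one connection succeeds, False if every
    attempt fails. `delay` is the pause between attempts and `timeout`
    is the per-attempt connect timeout, both in seconds.
    """
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Connection refused, timed out, or no route: retry unless
            # this was the last attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

For example, `wait_for_port("actinium.ppa", 8221)` would poll the slave port seen in the logs before the first dispatch is attempted, so a builder that is still resuming is not counted as a failure.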
wgrantOK. I think we should try to get it to break horribly again. So we should reset the failure counts and reenable everything, I suppose.02:51
lifelessif we bring everything up02:54
lifelesswill we log useful data?02:54
wgrantI hope so.02:54
wgrantMaybe we should make failBuilder log before we do that, though.02:55
wgrantSo we can see when things are disabled.02:55
lifelesswgrant: can you prep a cowboy02:56
wgrantDoing so.02:56
* thumper has to head afk02:57
wgrantpjdc, lifeless: http://pastebin.ubuntu.com/533329/ should do it.02:58
* wgrant pelts buildd-manager with rocks and sets it on fire.03:00
pjdcwgrant: so, shut down, apply patch, enable all (this might take a while), start up?03:03
wgrantpjdc: Yup.03:04
wgrantpjdc: Do you know if there's a script to enable them all?03:04
wgrantOtherwise there's SQL...03:04
pjdcwgrant: no idea, i've only ever done them manually03:04
wgrantWe may need SQL to reset the failure counts anyway. We'll see shortly.03:05
pjdcb-m stopped, patch applied03:06
lifelesswgrant: api script :)03:06
wgrantlifeless: Yeah, yeah, on my todo list.03:07
wgrantIt's reasonably unfortunate that all this has happened when we have no available LOSAs in this TZ, no available Soyuz developers in this TZ, and both of the buildd admins in this TZ also unavailable.03:08
wgrantNo... We'd normally have a LOSA, a Soyuz developer, and two buildd admins.03:10
pjdcokay, that's all of them enabled03:13
pjdcanything else before b-m is started?03:13
wgrantSo we are now running with the log level change and the additional failBuilder logging?03:14
pjdcyep, left the loglevel change in place, and applied your cowboy03:14
wgrantStart it up!03:14
wgrantI expect most of them will disable themselves again in about 30 seconds :(03:15
wgrantSo, everything seems to be happy now.03:23
wgrantI guess we just leave it until it explodes in a few hours, and hope the new logging tells us something useful.03:24
pjdcthat shouldn't be far off when the UK wakes up, so that might work out03:25
wgrantGiven that we've failed to reproduce it elsewhere, it is tempting to let the rollout go ahead and just automatically undisable builders until we work out what's going on :/03:25
wgrantThe 14 builders that are disabled now probably just need their failure count reset (it's already over the threshold, so the initial failure to connect because the builder is still resuming causes them to be disabled).03:26
wgrantSomething like this:03:26
wgrantUPDATE builder SET failure_count=0, builderok=true WHERE name IN ('hawthorn', 'actinium', 'hassium', 'lansones', 'muntries', 'radium', 'rosehip', 'sandpaperfig', 'terranova', 'fermium', 'lawrencium', 'nobelium', 'papaya', 'plutonium');03:26
pjdcif it's not critical, that's probably best left for a losa03:27
wgrantProbably, yeah.03:27
wgrantNot critical. Just makes it harder to see if it's broken without watching logs.03:27
wgrantThanks for your help.03:27
pjdcyou're welcome!03:28
wgrantadare and ross are now broken in other ways :(03:29
wgrantBut that can wait.03:29
lifelessmwhudson: https://bugs.launchpad.net/launchpad-foundations/+bug/67637205:18
_mup_Bug #676372: "Server denied check_authentication" from bazaar.launchpad.net private branch since 11926 deployed <regression> <Launchpad Foundations:Triaged> <https://launchpad.net/bugs/676372>05:18
=== jtv is now known as jtv-eat
pooliehi all06:47
pooliei am running './bin/test' in a vm, and it has been stuck for hours, with the last output being06:47
poolieStarted ['/tmp/tmpecWY0y.mozrunner/mozrunner-firefox', '-profile', '/tmp/tmpecWY0y.mozrunner', 'http://bugs.launchpad.dev:8085/windmill-serv/start.html']06:47
pooliein 1.109 seconds.06:47
poolie 06:47
wgrantIs there a firefox window lurking around?06:47
poolienot that i can see06:48
pooliei'm ssh'd in to the vm without -X06:48
pooliei will see if there's a firefox process06:48
pooliethere is not, though there is a zombie06:48
wgrantmthaddon: Around yet?07:28
henningeHi wgrant!07:48
wgrantMorning henninge.07:48
henningewgrant: heard you got engaged07:48
henningeThat Kate really is a nice girl07:49
henningeoh sorry, wrong W... ;-)07:49
henningewgrant: What's that about the buildmanager?07:49
wgranthenninge: Well, it may or may not be a release blocker.07:50
wgranthenninge: We have a not utterly terrible workaround, so it's probably OK.07:50
henningewgrant: what does the workaround include?07:50
wgranthenninge: Uhh, frequently reenabling all the builders manually.07:50
henningeHow frequently?07:51
wgrantUnsure. It was OK for 4 hours yesterday. And it's been OK for 4 hours so far today.07:51
henningehours, wow ...07:51
henningeThe affected code is on cesium, right?07:52
wgrantHopefully jml and bigjools will save the world tonight.07:52
henningeSo that is (again) part of the nodowntime hosts07:52
henningeso a fix can be deployed any time.07:52
henningewgrant: I am sure they will! ;-)07:53
wgrantSo, it's a pretty terrible bug, but we can work around it easily enough with a script.07:53
henningeThe only reason I can imagine this being a blocker for the roll-out would be if any fix would include db changes.07:55
henningewhich is not that far fetched, I guess.07:56
wgrantIt won't.07:56
pooliewgrant: congrats!08:04
wgrantpoolie: Hm?08:04
poolieor is he just totally confused?08:05
wgrantI hope he's just totally confused.08:05
poolieah, me clicks08:05
wgrantOr there's some news about me that I don't know.08:05
poolieWilliam Soxe-Gotha-Coburg-Windsor08:05
wgrantAhhhhhhh, of course.08:06
bachi henninge08:35
henningehi bac!08:35
adeuringgood morning08:50
henningepoolie, wgrant: Yeah, I messed up the joke. I meant to say "sorry, wrong prince" ... ;)08:54
henningeMoin adeuring!08:54
adeuringhi henninge08:54
wgrantbigjools: Morning...09:08
wgrantbigjools: Have you heard the wonderful news?09:12
wgrantbigjools: We're about to release with a pretty screwed buildd-manager :)09:13
bigjoolsfuck sake09:13
wgrantIt disabled 60 or so this morning.09:13
=== mthaddon changed the topic of #launchpad-dev to: Launchpad down/read-only from 10:00-12:00 UTC for DB update | Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 | firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting
wgrantIt seems to be reasonably happy now, since we restarted everything 7 hours ago.09:14
wgrantBut it was OK for a few hours yesterday too :/09:14
bigjoolsit was disabling builders because they were unresponsive09:14
bigjoolsit's supposed to do that09:14
wgrantTCP timeouts and no route to host errors are different.09:15
wgrantThis was "User timeout caused connection failure" or something like that.09:15
bigjoolsthat's because they don't respond within the timeout09:15
wgrantDozens of them in one second?09:16
bigjoolswhat sort of time did this happen?09:16
wgrant14:41:45, 15:03:20 and 16:31:22, 23:39:11 are some that I saw.09:17
wgrant23:39:11 was the big one.09:17
bigjoolsthat's when the daily recipes kick off09:17
wgrantBut the last two incidents there start with a single error, then 9 seconds later dozens.09:17
wgrantThere were also a few other odd errors in the logs.09:18
wgrantAnd it's not waiting long enough for builders to resume.09:18
wgrantBut apart from that it's happy now.09:18
bigjoolsthat's a problem because there's nothing we can do to fix that09:18
bigjoolsthe connection timeout is hard-coded in the python libs :/09:18
bigjoolsthe reset script waits until some event in the builder, which is supposed to be when it's ready to accept a connection09:19
bigjoolsthen that connection often times out09:19
wgrant2010-11-17 02:35:58+0000 [QueryProtocol,client] Resuming actinium (http://actinium.ppa:8221/)09:19
wgrant2010-11-17 02:36:04+0000 [-] Asking builder on http://actinium.ppa:8221/filecache to ensure it has file chroot-ubuntu-lucid-i386.tar.bz2 (http://launchpadlibrarian.net/51974282/chroot-ubuntu-lucid-i386.tar.bz2, d267a7b39544795f0e98d00c3cf7862045311464)09:19
bigjoolswe're seeing the fruits of that now because I am actually disabling stuff09:19
wgrant2010-11-17 02:36:25+0000 [Uninitialized] Scanning failed with: TCP connection timed out: 110: Connection timed out.09:19
bigjoolswhereas the old one never disabled anything09:19
wgrantIt waited 6 seconds from firing the resume trigger.09:20
wgrantMaybe the script is buggy.09:20
bigjools6 seconds is about right09:20
bigjoolsthey reset very quickly09:20
wgrantThe VM is created and boots in 6 seconds!?09:20
bigjoolsthe first connection is to send the chroot, and that's why you see it timing out09:21
bigjoolswe can get around this for now by removing the code that fails builders09:21
bigjoolswhich is essentially what the old b-m was not doing09:21
wgrantI think we need to disable failure counting.09:21
wgrantIt took out lots of builds as well.09:22
wgrant(and fourteen or so builders need their failure counts manually reset)09:22
wgrantI still find it unlikely that dozens of builders failed to respond all in the same second, several times, unless there were network glitches that nobody knows about.09:23
wgrantThe 9 second delay between the first failure and subsequent stream on at least two occasions is also rather suspicious.09:23
bigjoolsif it's a network glitch then it's more likely that they all go at once09:23
wgrantAnyway, cesium is currently running the new code with two cowboys: one setting loglevel to DEBUG, and another to log whenever a builder is failed.09:25
wgrantWe also need to fix the failure counts of those builders, and probably do a mass-giveback :/09:25
bigjoolsfailure counts are reset on a successful dispatch09:25
wgrantThey are.09:26
bigjoolsfor a builder to get failed it has to go wrong on 5 consecutive occasions09:27
wgrantBut the issue is that the first failure will immediately knock them out again.09:27
bigjoolsno, that's not true09:27
wgrantIt will, since the count is currently 5.09:27
wgrantWe reenable, they time out, and are immediately disabled.09:27
wgrantNo five strikes rule for them.09:27
bigjoolsok, re-enabling should reset the count09:27
bigjoolsthat's a bug09:27
wgrantIt should.09:27
wgrantBut it doesn't.09:27
wgrantAnd we were LOSAless today, so we couldn't do it manually.09:27
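The failure-counting behaviour discussed above (five consecutive failures disable a builder, a successful dispatch resets the count, and re-enabling ought to reset it too) can be sketched roughly as follows. This is a minimal illustration only: the `Builder` class, its method names and the `reset_count` flag are hypothetical and are not taken from the Launchpad code.

```python
FAILURE_THRESHOLD = 5  # "5 consecutive occasions", per the discussion


class Builder:
    """Hypothetical stand-in for a builder record; not Launchpad's model."""

    def __init__(self, name):
        self.name = name
        self.enabled = True
        self.failure_count = 0

    def record_failure(self):
        # Each consecutive failure bumps the count; at the threshold the
        # builder is disabled.
        self.failure_count += 1
        if self.failure_count >= FAILURE_THRESHOLD:
            self.enabled = False

    def record_successful_dispatch(self):
        # A successful dispatch clears the run of consecutive failures.
        self.failure_count = 0

    def reenable(self, reset_count=True):
        # reset_count=False mimics the bug described in the log: the count
        # stays at the threshold, so the next failure disables the builder
        # again immediately.
        self.enabled = True
        if reset_count:
            self.failure_count = 0
```

With `reset_count=False` a re-enabled builder is knocked out by its very first timeout, which is the "no five strikes rule" situation; with the reset in place it gets a fresh run of five.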
bigjoolsI think the recipe builds are thoroughly screwing the builders09:29
wgrantSo "User timeout caused connection failure" occurs when the TCP connection is accepted, but there's no HTTP response?09:29
bigjoolseverything works fine until they come along09:29
bigjoolsthat happens when the connect() fails09:30
wgrantWe're still running the old lp-buildd with in-chroot bzr-builder, aren't we?09:30
bigjoolsyes, we rolled them back09:30
wgrantIf that happens when connect() fails, then why this:09:30
wgrant"TCP connection timed out: 110: Connection timed out."09:30
wgrantThat's a separate error.09:31
bigjoolsI think I'm going to just remove the failure counting stuff for now09:33
wgrantSounds like a good idea.09:34
bigjoolswgrant: did you ask someone to restart it at 0126 UTC?09:36
wgrantI think the first one was lifeless, but yeah, it was around then.09:36
bigjoolsthere were no problems with it at that time09:36
wgrantIt had taken out all but a few buildds an hour earlier. We wanted to see if we could reproduce it fresh with just a couple of active builders, to see if we needed to roll back and work out what to do about the release.09:39
bigjoolsI think the problem is recipe builds for sure, I just need to reproduce on DF09:42
bigjoolsthe builder is doing something that makes it unresponsive09:42
wgrantThat's not the whole thing.09:43
wgrantpalmer was disabled. It is non-virt and had been idle for 30 minutes.09:43
=== Guest8056 is now known as jelmer
bigjoolsoh jeez the log is massive with debug on09:44
wgrantSo we knew it was either several undetected network glitches throughout the day manifesting without any TCP timeouts, or something with one builder was glitching everything else out.09:44
wgrantSo we turned up logging and hoped it would reappear, since the INFO logging is sort of completely sparse.09:45
wgrantWe can't tell when the problematic scans were triggered, and there are five minute gaps in the log :/09:46
wgrantAnd I can't reproduce it locally however much I try :(09:46
bigjoolsit's a nightmare09:46
wgrantYeah, just a bit.09:47
bigjoolsfrom the log, it starts going wrong at the exact same time the faily (sic) recipe builds get kicked off09:48
bigjoolsaround 23:35Z09:48
wgrantThat's the big incident, yeah.09:48
wgrantBut there are several smaller ones in the preceding 9 hours.09:48
bigjoolsother incidents are almost certainly another batch09:48
bigjoolsthere are some Fault 8002:09:49
wgrantYeah, but they're everywhere...09:51
bigjoolsthat's a protocol fault09:51
bigjoolshmmm /me sees something09:52
wgrantWhat has been seen?09:52
bigjoolsthis might have something to do with the huge blocking file fetch09:53
wgrantI considered that.09:53
wgrantBut the 23:39 incident suggests not.09:53
wgrantThe nearest fetch before that was about 6 minutes earlier.09:54
bigjoolsI think it's a number of different things that cause blocks09:54
=== henninge changed the topic of #launchpad-dev to: Launchpad down/read-only from 10:00-12:00 UTC for DB update | Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 (but closed during the roll-out)| firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting
wgrantbigjools: So, just going to cowboy out failure counting after the rollout and hope that we can work it out?10:03
bigjoolsone of the things that the failure counting did was to remove in-progress jobs from builders if they failed a poll10:04
bigjoolsI might have to rethink how that works10:04
bigjoolsdamn, this stuff is hard10:05
wgrantIt should all be fine.10:06
wgrantExcept for those unexplained User blah blah blah errors, and the reset script lying.10:06
wgrantApart from that and the occasional other translations exception, it seems to be OK.10:06
wgrant2010-11-17 02:26:50+0000 [Uninitialized] ForbiddenAttribute: ('build', <TRANSLATION_TEMPLATES_BUILD branch job (2146072) for ~gwibber-committers/gwibber/trunk>)10:07
wgrantThat's the translations exception.10:07
wgrantDoes the reset script wait until the slave responds to HTTP?10:15
wgrantHow hard is readonly bazaar.launchpad.net?10:25
wgrantSurely not that bad?10:25
lifelesswgrant: we tested it on qastaging yesterday. it works with one small bug10:28
lifelesswgrant: however, we're doing machine maintenance.10:28
wgrantlifeless: What's the bug? It's not read-only?10:28
lifelessif we weren't doing maintenance on that machine, we'd have tried keeping it up this time.10:29
lifelessadeuring: ping10:46
wgrantHuh, codebrowse works?10:51
jmlLP seems to be r/w for me now10:52
wgrantAh, so it is.10:52
lifelessmorning jml10:54
jmllifeless: hello10:54
wgrantIndeed, morning jml.10:54
=== danilo_ is now known as danilos
wgrantCould someone please ec2 https://code.launchpad.net/~wgrant/launchpad/bug-654372-optimise-domination/+merge/40854?11:06
jmlwgrant: on it11:07
wgrantjml: Thanks.11:07
wgrantbigjools: re. bug #676262, I suspect they were both ABORTING (since abort() doesn't actually end up killing sbuild). That's a situation we ran into a few hours ago.11:10
_mup_Bug #676262: launchpad lost track of a build <Soyuz:Incomplete> <https://launchpad.net/bugs/676262>11:10
wgrant(with those same two builders)11:10
wgrantDamn ppc :(11:11
jmlI got a crazy error when doing ec2 land11:12
=== mthaddon changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 (but closed during the roll-out)| firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting
daniloshenninge, https://pastebin.canonical.com/39840/11:12
adeuringlifeless: pong (sorry, did not look at the IRC windows after returning from the kitchen...)11:14
lifelessadeuring: hey11:15
lifelessadeuring: remember how in APIs and restricted files we hard coded handing out the internal url ?11:15
adeuringlifeless: not exactly... let me check again11:15
lifelessadeuring: the token based librarian is deployed now11:15
jmllifeless: https://bugs.launchpad.net/launchpad-code/+bug/554206 might be relevant to some stuff you are doing11:15
_mup_Bug #554206: Need a read-only version of bazaar.launchpad.net for codehosting and codebrowse <canonical-losa-lp> <codebrowse> <codehosting-ssh> <Launchpad Bazaar Integration:Triaged> <https://launchpad.net/bugs/554206>11:15
adeuringlifeless: I remember that firewall settings in the DC needed some tweaking11:16
wgrantWhy is [ui=none] in every commit message? Can't it just be omitted?11:20
adeuringlifeless: mizuho needed access to private Librarian files, and that machine "saw" a librarian URL having a host name with an "internal" domain part11:20
jmlwgrant: the [ui=foo] field was added as a way of strongly encouraging UI reviews for any UI change11:22
jmlwgrant: a huge number of changes do not affect the UI11:22
jmlwgrant: and I suspect that many people skip UI reviews11:22
lifelessadeuring: yes11:22
wgrantjml: Is it more than 1% of commits that have ui=somethingelse?11:22
lifelessadeuring: right, so you did a patch for the API to show the internal url11:22
jmlwgrant: you can run log & grep as easily as I11:23
lifelessadeuring: but its not needed now11:23
adeuringlifeless: did I? seems that I need a memory refresh.... looking now11:23
wgrantjml: True.11:24
lifelessadeuring: you did :)11:26
lifelessadeuring: rev 1150611:28
jmlhenninge: now that the rollout is done, can we fix canonical/launchpad/interfaces/__init__?11:30
henningejml: oh. ...11:30
adeuringlifeless: thanks! so, time to fix bug 62980411:30
_mup_Bug #629804: implement access to private Librarian files for launchpadlib clients <Launchpad Foundations:New> <https://launchpad.net/bugs/629804>11:30
henningejml: well, it's still on the list to do post-rollout but you can prepare a branch. By the time it gets deployed from stable, that should all be done ;-)11:34
henningejml: "it" is "fixing +inbound-email-config.zcml"11:34
=== matsubara-afk is now known as matsubara
jmlhenninge: ok. will do.11:35
matsubaramaxb, misclicked11:35
henningejml: just check again before marking the revision as deployable.11:36
jmlhenninge: *nod*. do you recall the bug number?11:36
henningeI am not sure it had a bug.11:36
henningejml: nm, it's fixed. ;-)11:37
henningeso I guess you can just submit it [no-qa]11:37
jmlhenninge: will do. ta.11:38
henningewhich is true because we already know it works on qa/staging ... ;-)11:38
jmllp-land has a bad token, but I don't know where to find it11:40
lifelessadeuring: I've unduplicated it11:43
jmlhow do I work around this problem? http://paste.ubuntu.com/533431/11:43
adeuringlifeless: I'll do it once I've finished my current work11:43
adeuring...i mean; I'll fix the bug...11:44
lifelessadeuring: do you have an estimate for when that will be?11:44
lifelessadeuring: if its going to be not-immediate, I might just do it11:44
adeuringlifeless: i think I can probably start tomorrow11:44
adeuringlifeless: you beat me ;)11:44
adeuringproblem is that I am quite slow with context switches...11:45
lifelessadeuring: I'll drop you a mail to let you know if I get to it or not.11:45
adeuringlifeless: coool11:45
jmlwgrant: your branch is being tested in ec2: http://ec2-50-16-92-112.compute-1.amazonaws.com/11:48
wgrantjml: I can't see that, but thanks!11:49
jmlit'd be kind of neat to add a phone-home thing to the ec2 script so we could have a page showing what's being built (as well as test results)11:49
deryckMorning, all.11:57
adeuringmorning deryck11:59
jmlbigjools: I added something to the derived distributions LEP about opening vs initialization; do you need anything more?12:01
bigjoolsjml: inspiration12:01
bigjoolsthanks :)12:01
jmlbigjools: np.12:02
jmlbigjools: also, I notice that https://launchpad.net/launchpad-project/+bugs?field.tag=buildd-scalability has no bugs.12:03
bigjoolsit should do12:03
bigjoolsI tagged loads12:03
bigjoolsjml: ah it's because they've all been released12:06
jmlbigjools: nice.12:06
bigjoolsjml: https://bugs.launchpad.net/soyuz/+bugs?field.searchtext=&orderby=-importance&search=Search&field.status%3Alist=NEW&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=FIXRELEASED&assignee_option=any&field.assignee=&field.bug_reporter=&field.12:06
bigjoolsaiieee sorry12:06
jmlbigjools: looking at the LEP and based on random IRC sampling, I'm guessing we're still missing "When a builder becomes free, we must dispatch a queued build to it within a maximum of 30 seconds.", "Design for a system with 200 builders" and "Not starve low-scored builds when there are higher-scored builds in the queue"12:07
stubHaving trouble following https://dev.launchpad.net/LaunchpadPpa. debsign -S fails with 'debsign: Can't find or can't read changes file !'12:07
bigjoolsjml: missing from where?12:07
jmlbigjools: what I mean is, have we met those requirements?12:08
bigjoolsjml: I need to have a call with you about that12:08
jmlbigjools: ah, ok :)12:08
bigjoolsbut later12:08
bigjoolsI am up to my neck in buildd-manager issues12:09
bigjoolsright after a dispatch of 10 or more recipes, there's nothing in the log for 4 minutes12:09
bigjoolswhich is somewhat suspicious12:09
jmlyeah, later is good12:09
wgrantThe queue isn't just empty?12:10
bigjoolsno, it's the gap between "startBuild" and the "RESULT" stuff12:10
wgrantThis is why I wanted better logging :(12:10
bigjoolsin fact the latter never appears12:10
bigjoolsyes we all want better logging12:11
bigjoolsbut one thing at a time12:11
wgrantThat's very interesting indeed.12:11
stubShouldn't bzr builddeb actually create a .deb?12:11
jelmerstub: You have to go back to the parent directory or ../result where the changes file was added.12:11
jelmerstub: By default it creates binary packages (.deb's), with -S it creates a source package.12:12
stubBut where?12:12
bigjoolswgrant: something is blocking too long when it's dispatching a recipe build12:12
jelmerstub: In the parent directory or ../result12:12
stubjelmer: I don't have a ../result and nothing new in the parent directory12:12
wgrantbigjools: After the "Initiating build foo on bar"?12:12
jelmerstub: you can specify a directory manually with --result-dir12:13
bigjoolswgrant: in Builder.startBuild() it logs the build start (behavior.logStartBuild)12:13
bigjoolsthen there's nothing logged until it fails12:13
bigjoolsat that point, there's a few things that could have gone wrong but the lack of logging means it's hard to tell12:14
stubjelmer: Garh. They were in my branch, not my checkout of the branch12:15
stubjelmer: Guess that would be a bug...12:15
jelmerstub: yeah, that seems a bit strange12:16
wgrantbigjools: So we don't even know if it made it into resume_done?12:16
bigjoolsI suspect it has, that's the most reliable part of the process12:16
bigjoolsmy suspicions lie in the file dispatching and initiation12:16
wgrantBut it never made it to got_cache_file... hmm.12:16
bigjoolswe don't know12:18
bigjoolsthere's no info level logging12:18
wgrantgot_cache_file logs fairly obviously.12:19
wgrantOhh, crap.12:19
jmlderyck: there are a couple of LEPs about bug duplication...12:19
* bigjools is changing some debug to info12:19
jmlderyck: one's in drafting (https://dev.launchpad.net/LEP/DisableFilebugDuplicateSearchOption) and the other (https://dev.launchpad.net/LEP/ACLMarkAsDuplicate) isn't on the LEP page12:19
lifelesswgrant: We Can Haz Runtime Log Changing Please12:20
wgrantlifeless: debug 4 eva12:20
_mup_Bug #4: Importing finished po doesn't change progressbar <Launchpad Translations:Fix Released by carlos> <Ubuntu:Invalid> <https://launchpad.net/bugs/4>12:20
lifelessok foods12:20
jmllifeless: I guess there's https://bugs.edge.launchpad.net/soyuz/+bug/66795812:21
_mup_Bug #667958: Web diagnostic tool for build manager <buildd-manager> <Soyuz:Triaged> <https://launchpad.net/bugs/667958>12:21
jmlbut that's not quite the same thing12:21
bigjoolsdynamically changeable log levels is totally essential for decent production debugging12:23
wgrantbigjools: Is there anything in the current debug level that isn't interesting, except for the hundreds of "Scanning foo" messages?12:25
wgrantGiven the frequency and obscurity of issues, it'd be nice to keep as much data as possible...12:26
bigjoolsthe problem is that I don't want the log swamped12:26
bigjoolsit makes it harder to notice issues12:26
bigjoolsso I am trying to carefully select important messages for the info logging12:27
bigjoolsbut hindsight is awesome12:27
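The wish for runtime-changeable log levels could be met in several ways. One minimal sketch, using only the standard library, is a signal handler that flips a logger between INFO and DEBUG. The logger name, the choice of SIGUSR1 and the function names here are assumptions for illustration; nothing in the log says buildd-manager does or would do it this way.

```python
import logging
import signal

logger = logging.getLogger("buildd-manager-sketch")  # hypothetical name
logger.setLevel(logging.INFO)


def toggle_debug(signum=None, frame=None):
    """Flip the logger between INFO and DEBUG; usable as a signal handler."""
    if logger.level == logging.DEBUG:
        logger.setLevel(logging.INFO)
    else:
        logger.setLevel(logging.DEBUG)
    return logger.level


def install_toggle_handler():
    """Register toggle_debug for SIGUSR1 (POSIX only, main thread only)."""
    signal.signal(signal.SIGUSR1, toggle_debug)
```

After calling `install_toggle_handler()` in the daemon's main thread, sending `kill -USR1 <pid>` would switch the verbosity without a restart, so the noisy "Scanning foo" output only needs to be on while an incident is being chased.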
deryckHi jml.  Yeah, the first should be done.  And the second was meant to sketch out the idea and go back to marjo et al and get feedback....12:28
deryckjml, remember, we talked about this and said, let's do what everyone agrees on and is easy first, and get consensus on if the second is even required.12:28
deryckunfortunately, I didn't ping anyone about the second yet.  I'll do that today.12:29
jmlderyck: ahh right. I forgot to refactor that new knowledge into the LEP page :)12:29
jmlderyck: so I'll bump the first LEP to the Deployed section?12:30
deryckjml, in progress.  I think I assumed approval and moved ahead.12:30
deryckjml, sorry to assume ;)12:30
jmlderyck: no, that's all good :)12:30
lifelessjml: gary has a variant of the LEP template with stuff specific to his team; I've suggested you might be amenable to folding those into the main template12:33
jmllifeless: sure, I'll have a look12:34
jmllifeless: if someone points me at a thing :)12:34
* jml is also thinking (again!) about tracking LEPs at blueprints.launchpad.net/launchpad12:35
lifelessdunno when he'll do that12:35
lifelessjml: lets fix it first.12:35
lifelessjml: -please-12:35
jmllifeless: I reckon I could do a useful muck-around experiment that wouldn't affect anyone other than me.12:36
lifelesswould it be a good use of your time?12:37
lifelessalso, can we chat about reset (voice) ?12:37
jmllifeless: sure. gimme a couple of minutes to put my phones back together12:37
maxbIs the "builders are being disabled" topic comment in #launchpad still valid after the rollout?12:39
jmllifeless: and yes, it would be a good use of my time.12:39
lifelesshmm,  didn't mean that to be snarky. Sorry12:39
jmllifeless: it wasn't at all snarky. I was going to elaborate but got distracted by yet another networking problem.12:41
bigjoolsjml: you remember how we added timeouts to the async xmlrpc by cancelling the Deferred?12:45
jmlbigjools: yes12:45
bigjoolsjml: in those cases we get a CancelledError, but I am seeing hundreds of " User timeout caused connection failure."12:45
bigjoolswhat causes those?12:45
bigjoolsit's a TimeoutError, sorry.  I can't fathom how that would happen before the cancel12:46
=== salgado is now known as salgado-physio
bigjoolshuh actually - that's the 30 second connection issue12:48
bigjoolswhich is much lower than our configured value for everything else12:49
bigjoolsjml: I'm tempted to inherit from Proxy and override stuff13:20
jmlbigjools: yeah. I can't think of anything better right now. You ought to file a ticket and submit a patch too.13:22
bigjoolsjml: there's already a ticket, but the fix needs to go in quite a few places I think13:22
bigjoolsI'll file another anyway13:22
bigjoolsright - I need vittles13:22
jmlbigjools: yeah, a specific ticket for xmlrpc.py would be great. thanks.13:23
=== mrevell is now known as mrevell-lunch
lifelessmaxb: hey13:41
lifelessmaxb: what do you think of us having a custom python build - with http://bugs.python.org/issue10440 applied13:42
=== Ursinha-dinner is now known as Ursinha
maxbIf it really is just an integer constant, why do we need to modify python for that?13:43
maxbInstead of just defining the value locally13:43
lifelessit can be different in different libcs, by definition.13:44
lifelesswe can hardcode '1' as the constant, but its less portable and thus a bit ugly.13:44
maxbWell, it's a tiny patch, so it's hardly much effort to roll a modified package. The question then is the ongoing maintenance effort and how long it would be needed for13:45
maxbI'd be tempted to consider putting the constant in a tiny module of its own, to avoid needing to rebuild every time there's an Ubuntu update out13:46
maxbAlso, given Launchpad only targets Ubuntu, and a fairly narrow range of distroseries, even the non-portable solution is probably viable13:47
lifelesstrue on both counts13:47
lifelesswill mull on it13:47
=== henninge changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 | firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting
=== salgado-physio is now known as salgado
bigjoolslifeless: can you think of a way of creating a tcp endpoint that doesn't reply in a twisted test?  I need to test a timeout and winding the reactor forwards is no good if the tcp connects or refuses to connect immediately14:10
lifelessbind, listen, but don't accept14:10
bigjoolsin real life I'd suspend a process but that's not ideal in a test14:10
lifelessActually, that might not work. But its worth a go14:11
bigjoolsI suspect it would get connection refused wouldn't it?14:11
lifelessaccept is what takes a queued connection and gives you the new fd for it14:12
elmoalternatively iptables  + -j DROP14:12
bigjoolsah right14:12
elmo(although that requires root)14:12
bigjoolsnot ideal for LP's test suite14:13
elmosure, was just giving it as an option as a one off14:14
=== mrevell-lunch is now known as mrevell
bigjoolselmo: how evil is it to try and connect to something like ?14:32
lifelessbigjools: evil; some machines it will error immediately ;)14:52
bigjoolslifeless: grar14:53
lifelessbigjools: because someone, somewhere has that ip14:53
lifelessbigjools: or routers that will see it and REJECT14:54
bigjoolsit doesn't get past my own router14:54
bigjoolsoh well it'll do as a stub for now14:54
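The "bind, listen, but don't accept" idea can be tried with plain sockets, without root and without relying on an unroutable address. The sketch below is an assumption-laden illustration rather than the test that was actually written; the function names are made up. On Linux the kernel typically completes the TCP handshake for connections waiting in the listen backlog, so this simulates an endpoint that accepts the connection but never replies. It does not make connect() itself time out, which may be why the suggestion was hedged with "that might not work".

```python
import socket


def make_silent_listener():
    """Bind and listen on a free localhost port, but never call accept()."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    return server


def reply_times_out(address, timeout=0.2):
    """Return True if connect succeeds but no reply arrives within timeout."""
    client = socket.create_connection(address, timeout=timeout)
    try:
        client.sendall(b"GET / HTTP/1.0\r\n\r\n")
        try:
            client.recv(1)
        except socket.timeout:
            # Nobody accepted the connection, so nothing ever answers.
            return True
        return False
    finally:
        client.close()
```

A test would call `make_silent_listener()`, point the client under test at `server.getsockname()`, and close the listener afterwards.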
=== matsubara is now known as matsubara-lunch
bacReviewers Meeting starting at top of the hour: abentley, adeuring, allenap , bac, danilo, sinzui, deryck, EdwinGrubbs, flacoste, gary, gmb, henninge, jelmer, jtv, bigjools, leonardr, mars, salgado, jcsackett, benji14:59
deryckthanks bac14:59
flacostebac: apologies from me14:59
bacnp flacoste15:01
=== matsubara-lunch is now known as matsubara
henningewhat's this?16:25
henningeNo handlers could be found for logger "librarian"16:26
bigjoolshenninge: you already have a librarian running16:26
henningeseriously? didn't know that ...16:27
bigjoolskill it and the pid file and /var/tmp/fatsam.test16:27
henningewhat's the process called?16:28
bigjoolsit's a twistd16:28
henningeps ax | grep libra returns nothing16:28
henningeps ax | grep twist - nada16:28
bigjoolsummm then I dunno, I've only ever seen that when there's another librarian hanging around16:29
henningewhy is the librarian logging in a +0530 time zone anyway???16:32
jmlhenninge: there are no Canonical LP developers in that tz16:38
jmlhenninge: we set the TZ there to avoid accidental TZ assumptions16:40
jmlhenninge: or something16:40
henningebut do you have an idea why the librarian layer might be failing?16:40
lifelesshenninge: rm /var/tmp/fatsam.test/librarian.pid16:43
henningealready done. twice ;)16:44
lifelessps fux | grep twistd16:44
lifelessnetstat -n | grep 5808516:45
lifelessor something like that16:45
lifelessis the second upload port that's barfing16:45
henningemaybe I should mention that this is not devel ? It's the recife branch16:49
henningebut the test worked yesterday16:50
lifelesssinzui: btw your script to close bugs is closing bugs that shouldn't be closed - because of RFWTAD16:52
henningea second run always gives me "TacException: Could not kill stale process /var/tmp/fatsam.test/librarian.pid.16:52
henningeso I remove that dir and try again.16:52
lifelessnothing changed overnight16:53
lifelessI think you have another process using the port16:53
lifelessthus the netstat - check lazr-schema / the test schema to see what port it will be using16:53
sinzuilifeless, they were fix committed in 10.11, but were not intended to be released?16:53
thumperbigjools: did you get to the bottom of the problem?16:53
lifelesssinzui: no, our process assigns bugs to milestones *before* they are fixed, not *after*16:54
sinzuilifeless, are these really 10.12 bugs16:54
lifelesssinzui: they are 'some work done, but not finished'16:54
lifelesssinzui: things like:16:54
bigjoolsthumper: I *think* so - I think it's slow builders that don't respond to connection requests within Twisted's 30 second default timeout.  The recipe builds hammer the builders.16:54
lifeless - landed code but it didn't fix it16:54
lifeless - needs a cronscript enabled via an RT ticket16:54
thumperbigjools: so why does it take down all types of builders then?16:55
bigjoolsthumper: thanks for doing the incident report16:55
bigjoolsthumper: I don't know, it might be a coincidence.16:55
lifelesswho is looking at the 'report a bug' feature not working ?16:56
* thumper doesn't believe in coincidence16:56
bigjoolsI am putting in a fix that increases the connection timeout - copy & paste from Twisted FTW :/16:56
sinzuilifeless, I think that is a bug. The engineer should know when he intends to release. Auto-assigning is convenient, but it does not exempt the person from correcting the milestone when he knows it will not be released with the milestone. eg we knew this when PQM was frozen16:56
bigjoolsthumper: I've seen slow builders doing exactly that for a while now - it's just that we never disabled them before this release.16:56
lifelesssinzui: sure, I'm not blaming the script or you :) - getting info on how to address - what policies we need to change16:56
sinzuilifeless, I can add a sanity check (qa-ok in tags)16:57
lifelesssinzui: I think thats an excellent idea16:57
lifelesssinzui: also I'm closing most bugs - those that are linked from revs - when we do incremental deploys16:58
lifelessI have to go eat or miss out, bbiab16:58
sinzuilifeless, i will have script for you by the end of my lunch16:58
bigjoolsjml: I guess you're not near your PC then17:00
=== jam1 is now known as jam
=== benji is now known as benji-lunch
dobeyleonardr: around?17:36
leonardrdobey: yes17:36
dobeyleonardr: http://pastebin.ubuntu.com/533530/ <- am getting this as a result of a getMembersByStatus() on a team with status=u'Administrator'17:37
dobeyleonardr: any idea why that would be?17:38
leonardrdobey, what is the code in allowedcontributors.py?17:38
lifelessderyck: ping17:39
deryckhi lifeless.  on tl call17:39
dobeyleonardr: http://bazaar.launchpad.net/~rockstar/tarmac/main/annotate/head%3A/tarmac/plugins/allowedcontributors.py#L6217:39
lifelessderyck: are you aware that bug filing is reportedly broken ?17:39
derycklifeless, no.  how so?17:40
lifelessderyck: two independent reports17:40
lifeless1) apport user filed a bug in launchpad17:40
lifeless2) james hunt mailed tom who forwarded it in the lp rollout thread17:40
leonardrdobey: so the 'approved' one succeeds but the 'administrator' one fails?17:41
dobeyleonardr: that appears to be the case, yes17:41
derycklifeless, I believe allenap is looking into that.17:42
deryckI'll follow up after tl call to make sure, and cover if not17:42
dobeyleonardr: and unfortunately i have to call it twice, because i can't do status=[u'Approved', u'Administrator']; like i can do with other similar get APIs, but i guess that wouldn't fix this specific problem either :)17:42
leonardrdobey: i have no clue why it should work once and then fail. just for fun, you might try assigning launchpad.people[team] to a variable17:47
leonardrso you're not using it twice17:47
leonardrand if that doesn't work, try assigning to a variable and then printing out its name before invoking those named operations17:47
leonardri'm just seeing if various known problems are in play here (in which case upgrading would help)17:48
dobeyleonardr: what would i upgrade to exactly?17:51
leonardrdobey: a later launchpadlib/lazr.restfulclient17:51
dobeyleonardr: is there one newer than what is in 11.04 already?17:52
leonardrdobey: there is, but the one in 11.04 should have the fix i'm thinking about already17:53
leonardrdobey: my only suggestion is to put a breakpoint in get_representation_definition and see what it does differently the first time vs. the second17:57
dobeyleonardr: ok; i've changed it to assign the team to a variable and print the team twice as suggested; will see what happens next time that code gets hit18:00
rockstarlaunchpad is being very slow today. :(18:00
rockstarabentley, are there any issues with the new lp-serve happening right now?18:06
=== benji-lunch is now known as benji
=== EdwinGrubbs is now known as Edwin-lunch
thumperrockstar: the new forking lp-serve isn't enabled yet18:38
rockstarthumper, oh, the bug was marked as Fix Released.  :(18:39
marssinzui, ping18:41
thumperrockstar: yes, I know.  jam commented on it too saying as much18:41
rockstarAh, I hadn't seen the comment, just the status change.18:41
jamrockstar: right, still trying to work through getting everything qa'd, etc. It isn't considered a qa blocker because it is disabled in production18:52
jamI'm noticing that my download-cache has grown to about 500MB, anyone know what files I can nuke?18:52
jamI'd like to think that I don't need 12 versions of "zope.testing-*"18:52
rockstarjam, basically, you can nuke any files that aren't in versions.cfg18:53
jamrockstar: which is in the lp root?18:53
rockstarjam, yes18:53
jamwell, that isn't particularly fun to cross-reference...18:54
rockstarthumper, urbanape just pointed out to me that when diff is too big, it says "Truncated for viewing."  That's wrong, because if it was really for viewing, it wouldn't be truncated...18:54
=== deryck is now known as deryck[lunch]
jamrockstar: so why is download-cache a bzr branch that is versioning all of these tarballs? seems odd to me19:02
jamespecially given that it is storing all old versions together in the same working tree19:02
rockstarjam, I am not the one to ask about that, but I *think* it was supposed to be a temporary solution we concocted two years ago.19:02
jam(for example, it contains 20 bzr tarballs)19:02
jamthe .bzr/repository is actually bigger than the launchpad repo at this point19:05
abentleyjam: you do not need to convince us.  We know it's wonky.19:06
jamanother quick question. Anyone know how lp-production-configs are placed at runtime so I can simulate a runtime environment locally?19:09
jam(how does the launchpad codebase find the values in lp-production-configs)19:10
lifelessits put at the configs directory in the root I think19:11
lifelessand then LPCONFIG=configname19:11
thumperrockstar: in an email you mean/19:13
thumperrockstar: I thought it just said that on the page itself19:13
thumperrockstar: and in that case you are viewing it and it is truncated19:13
rockstarthumper, in the view, you are viewing it, and it is truncated, but it's not truncated FOR viewing.  It's truncated FROM viewing.  :)19:15
lifelessmaxb: so, python 3.2 will have my patch :)19:16
thumperrockstar: it is truncated to allow you to view it otherwise it times out :-)19:16
maxblifeless: And when are we migrating LP to Python 3? :-)19:17
thumperI'd not approve a textual change to "truncated from viewing" as it doesn't make grammatical sense19:17
rockstarthumper, yeah, it was pedantry from the start.19:17
* thumper closes laptop to go and buy a 3g stick19:17
thumperrockstar: well we do work for pedantical :)19:17
rockstarthumper, although the fact that it's truncated drastically reduces its usefulness.19:18
thumperrockstar: the download link still works19:18
thumperrockstar: the fact that it is over 5000 lines drastically reduces its usefulness :)19:19
sinzuihi mars19:21
rockstarthumper, this is true as well.19:23
jamlifeless: I know about LPCONFIG=xxxx, but how is the "qastaging.conf" file found?19:36
jamit is just copied into the launchpad source tree?19:36
jamor is schema-lazr.conf (the symlink) pointed to something else, or?19:37
rockstarjam, it's symlinked.19:40
jamrockstar: to what file?19:40
rockstarjam, it's a file from lp-production-configs.19:40
jamrockstar: so they explicitly point schema-lazr.conf to schema-qastaging.conf for example?19:40
jamIf so, why do you also need LPCONFIG=qastaging?19:40
=== Edwin-lunch is now known as EdwinGrubbs
=== deryck[lunch] is now known as deryck
jammorning mwhudson19:52
lifelessjam: qastaging says 'the qastaging' dir which has a launchpad-lazr.conf file20:14
jamlifeless: sure, but there are 4 schema-XXX.conf files20:15
jamand no "schema-lazr.conf" or "schema-launchpad.conf", etc in the top of the dir20:15
jamanyway, I'm getting my problem solved without using it yet20:15
jambut still, I don't know yet how to set up something that resembles production20:15
lifelessjam: schema-xxx is irrelevant20:16
jamlifeless: so you still haven't answered how launchpad finds lp-production-configs/*.conf then20:16
lifelessI think it's20:17
lifelessrm configs20:17
lifelessmv lp-production-configs configs20:17
lifelesslosa can tell you though - ask chex20:17
jamlifeless: any idea of a 'clean' way to invoke the bzr that is packaged with the launchpad tree? or should we just be invoking /usr/bin/bzr ?20:20
jam(IOW, how are the dependencies found in production)20:21
jam`pwd`/eggs/bzr-2.2.0-py2.6-linux-i686.egg/EGG-INFO/scripts/bzr is obviously not a long-term solution20:21
jamor Chex ^^20:21
mwhudsoni think launchpad looks for lp-production-configs/$LPCONFIG/launchpad-lazr.conf then for configs/$LPCONFIG/launchpad-lazr.conf20:25
mwhudsonthe other config files get brought in by extends: ../foo.conf in those config files20:26
jammwhudson: so it is just 'lp-production-configs' in a generic sibling dir?20:26
mwhudsonjam: pretty sure, let me look at some code20:26
jammwhudson: doing that, I get "Can't find qastaging in ..."20:27
jamin a traceback20:27
mwhudsonjam: "production-configs", not lp-production-configs20:28
mwhudsonmy mistake20:28
jammwhudson: confirmed that it works20:28
jam(via symlink at least)20:28
jammwhudson: and ./production-configs is also in .bzrignore20:30
mwhudsonheh heh20:30
=== salgado is now known as salgado-afk
jamlosa ping. I don't know if you have time, but mthaddon was looking at rt#42199 last night, and I think I've responded to what he needed. I don't know whether that means there is a hand-off or whether it is just going to wait for him to get back.20:40
_mup_Bug #42199: evolution causes gpg stale locks <Evolution:Fix Released> <evolution (Ubuntu):Fix Released by desktop-bugs> <https://launchpad.net/bugs/42199>20:40
lifelessjam: not a sibling dir, child dir20:45
jamlifeless: nope, at the root "launchpad/configs launchpad/production-configs"20:45
jamat least, that worked for me20:46
jamand that is what is in .bzrignore20:46
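The lookup order described above can be sketched roughly as follows. This is a minimal illustration and not Launchpad's actual code: the helper name `find_lazr_config` is hypothetical, and the search order (production-configs first, then configs, both at the root of the tree) is only what was reported in the channel.

```python
import os


def find_lazr_config(tree_root, instance_name):
    """Return the first launchpad-lazr.conf found for instance_name, or None.

    Hypothetical helper: the search order mirrors what was described in the
    channel, not Launchpad's real implementation.
    """
    for config_root in ("production-configs", "configs"):
        candidate = os.path.join(
            tree_root, config_root, instance_name, "launchpad-lazr.conf")
        if os.path.isfile(candidate):
            return candidate
    return None
```

With LPCONFIG=qastaging, a call such as `find_lazr_config(tree, os.environ["LPCONFIG"])` would pick up a symlinked `production-configs` directory before falling back to `configs`.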
wgrantWere we in testfix overnight?20:57
=== Ursinha is now known as Ursinha-bbk
=== Ursinha-bbk is now known as Ursinha-bbl
weather15Hello Everyone21:02
wallyworldabentley: thumper: now?21:03
gary_posterlifeless: https://bazaar.launchpad.net/~launchpad-pqm/launchpad/production-stable/revision/900021:03
abentleywallyworld: sure.21:03
thumperwallyworld: just here21:04
wallyworldabentley: %@$!!#$ audio died again.21:05
weather15I have a quick question about the Launchpad source21:06
weather15When running make schema is this part of a normal output? Unknown entry URL:                     ScalarValue Unknown entry URL:                     archive_dependency Unknown entry URL:                     archive_subscriber Unknown entry URL:                     binary_package_release_download_count Unknown entry URL:                     branch_merge_queue Unknown entry URL:                     branch_subscription Unknown e21:06
wgrantThat's normal.21:07
weather15Okay Thanks wgrant21:07
=== matsubara is now known as matsubara-afk
weather15Wgrant:  is this a typical end output: make[1]: Leaving directory `/home/weather15/launchpad/lp-branches/devel/database/schema' rm -f -r /var/tmp/fatsam21:10
wgrantweather15: Yes.21:10
weather15wgrant: Thanks21:10
weather15wgrant: I'm running Ubuntu Server21:11
weather15In this case how can I access Launchpad.dev?21:11
weather15SSH Tunnel?21:11
weather15or is there Apache settings to change?21:12
wgrantweather15: Have a look at https://dev.launchpad.net/Running/RemoteAccess21:12
weather15Also should I follow these instructions? 2010-11-17T16:11:40 WARNING root Developer mode is enabled: this is a security risk and should NOT be enabled on production servers. Developer mode can be turned off in etc/zope.conf21:13
weather15I plan on going into production21:13
wgrantRunning a production Launchpad instance is not a simple task.21:14
weather15wgrant: Do I need to have more than 1 IP?21:14
wgrantweather15: Only if you want to be able to browse private branches.21:15
weather15Okay I do21:15
weather15two IP's on my local net or on the Internet?21:15
wgrantWherever you want it to be accessible from.21:16
weather15I guess if I run it on my local net then I will have all public repos21:17
weather15Then I only need 1 IP21:17
maxbweather15: OOI, which LP applications do you intend to use in production?21:18
weather15Pretty much all21:19
maxbInteresting, I'd only imagined people using bugs & code in a local setting21:19
weather15That's most likely what will happen but I'm not sure yet21:20
weather15I'm focused on getting it working now21:20
maxbYou know about the whole image licence pain, right?21:20
maxbespecially the 4th paragraph21:22
weather15"The image and icon files in Launchpad are copyright Canonical, but unlike the source code they are not licensed under the AGPLv3. Canonical grants you the right to use them for testing and development purposes only, but not to use them in production (commercially or non-commercially). "21:23
weather15That Part21:23
wgrantThat part.21:24
weather15I know about that21:29
weather15I was wondering how to change those images21:30
weather15rather 1 IP i guess Ubuntu is not getting IP's on my second interface21:33
weather15what do you do when launchpad.dev will not resolve on the network?21:34
weather15I guess because I have only 1 IP bazaar will not work21:35
weather15is this true?21:35
weather15It seems I have 2 IP's now21:41
weather15do you replace a.b.c.d here <VirtualHost a.b.c.d:80> with your ip?21:41
leonardrjames_w, who's the best person to talk to about getting new versions of launchpadlib and friends included in natty?21:42
james_wleonardr, Luca probably21:42
leonardrjames_w: ok, makes sense, thanks21:42
bigjoolswgrant: so, in case I am asleep at 2330 (highly likely), I've put another cowboy on cesium to fix the buildd manager.21:48
weather15Any one know the answer to my previous question?21:48
weather15It says "Or, if you did allocate a suitable second IP address:      *        Change the <VirtualHost> line to <VirtualHost a.b.c.d:80>     *        Change the <VirtualHost> line to <VirtualHost a.b.c.d:443>"21:49
wgrantbigjools: Removing failure counting?21:50
weather15is this what I should use or replace a.b.c.d with the IP on my second NIC21:50
wgrantweather15: The latter.21:50
wgrantbigjools: Do we also have more logging now?21:50
bigjoolswgrant: some more yes21:51
weather15wgrant: with my IP correct?21:51
wgrantweather15: Yes.21:51
wgrantbigjools: Well, I guess we'll see how it goes!21:51
wgrantbigjools: Did you and jml work anything out?21:51
jmlwgrant: I didn't!21:52
bigjoolswgrant: default connection timeout on twisted xmlrpc is 30 seconds, I've made it use socket_timeout instead21:52
bigjoolsI am seeing some builders *still* failing with that though21:53
wgrantbigjools: Hmm. I don't think that really explains everything, but it might fix the resume thing.21:53
bigjoolsnot just resume, all xmlrpc requests21:53
wgrantjml: Also, why did PQM eat my branch?21:53
jmlwgrant: I don't know. I didn't see that it got eaten21:53
bigjoolsand no it does not explain everything21:53
bigjoolsbut it's a start21:53
wgrantjml: It said it submitted, but then nothing :/21:54
wgrantbigjools: Yeah, I guess.21:54
jmlwgrant: I don't know. I won't be able to get around to looking into it tonight – sorry.21:55
jmlwgrant: maybe you can convince someone else to land it. the tests all pass. if not, I'll do it first thing tomorrow21:55
mwhudsona builder taking 30 seconds to accept a connection seems pretty crazy too21:55
wgrantjml: Sure, no rush.21:55
mwhudsonis the listen queue overflowing on the slave side or something?21:56
mwhudsoni guess that's pretty hard to tell21:56
wgrantmwhudson: The builder is an archaic Twisted mess gluing together shoddy shell scripts.21:56
wgrantIt's allowed to be crazy, I think.21:56
jmlwe allow it to be crazy21:56
mwhudsonwgrant: even so21:56
mwhudsonwgrant: is the builder one of these half twisted things that does blocking operations in the reactor thread?21:57
wgrantmwhudson: Sometimes.21:57
jmlthe build manager is21:57
bigjoolsbut not for long21:57
bigjoolsmwhudson: I think >30 seconds happens when the slave manager was swapped out under load or something21:58
mwhudsonoh right21:58
bigjoolsthat's my guess....21:58
jmlbigjools: db queries are blocking calls21:58
wgrantbigjools: Doesn't explain all the non-virt failures :(21:58
bigjoolswgrant: it might, actually21:58
bigjoolsjml: true, very true.21:59
wgrantbigjools: How? Unless buildd-manager leaks exceptions across multiple builders, I don't see how...21:59
weather15for the allow for21:59
bigjoolswgrant: if the previous build went into swap ...21:59
weather15would this work for 10.0.0. or 10.0.0?22:00
bigjoolson the same builder, I mean22:00
mwhudson... but you'd still need to fill up the listen queue, right?  connecting to a listening socket doesn't involve the userspace process doing the listening iiuc22:00
weather15For the Allow from22:00
wgrantweather15: That's just normal Apache configuration.22:00
wgrantmwhudson: Hmmm? It needs to call accept(), right?22:00
weather15Yes but I need to set the Allow from22:00
bigjoolsmwhudson: I don't know22:00
bigjoolssome people have said that it needs to accept()22:00
weather15would 10.0.0 work or would I have to use 10.0.0. to allow my local network?22:01
weather15on 10.0.0.x22:01
bigjoolsit's been a while since I did socket stuff22:01
mwhudsoni can't remember either22:01
bigjoolsweather15: I suggest you ask Apache questions in the right channel22:01
bigjoolsyou will almost certainly get a more knowledgeable answer22:02
wgrantbigjools: Hmmm. I see that palmer had been aborted 10 minutes before the failure. So it was probably still building. So that's plausible.22:03
weather15looks to me like Allow from will work22:03
mwhudsonscience suggests that i am right about accept22:03
bigjoolsscience rocks22:03
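The point about accept() can be checked with a small experiment. This is a minimal sketch using only the standard library, assuming a typical Linux or BSD TCP stack; the variable names are illustrative. The server listens but never calls accept(), and the client's connect still completes.

```python
import socket

# Listen on an ephemeral localhost port but never call accept().
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
address = server.getsockname()

# The kernel completes the TCP handshake and parks the connection in the
# listen queue, so connect() returns without the server process doing anything.
client = socket.create_connection(address, timeout=5)
connected_to = client.getpeername()

client.close()
server.close()
```

A connect timeout would therefore suggest that the listen queue was already full or that the host itself was not responding, rather than the listening process merely being slow to call accept(). What happens once the queue is full depends on the platform.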
wgrantAlthough the fact that it timed out at the same time as the rest is a bit suspicious, perhaps buildd-manager was blocking for the preceding couple of minutes. Insufficient logging :/22:03
bigjoolsyeah, impossible to tell22:04
bigjoolsalthough if it was slow with the DB ...22:04
mwhudsoni guess you can turn on statement tracing in buildd-manager22:05
bigjoolslog armageddon!22:06
wgrantbigjools: Ah, this is why we needed to clean out accepted... so we can have hundreds of gigabytes of logs!22:07
mwhudsonmore realistically, you can probably have a tracer log any statement that takes longer than say 5 s22:07
bigjoolsnot sure that will help if there's a cumulative effect of 10*1s for example22:07
mwhudson... or collect aggregate stats, min, max, mean, stddev kind of thing22:08
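Both ideas, a slow-statement threshold and aggregate statistics, can be sketched together. This is a hypothetical collector and not a real Storm tracer: the class name `StatementTimer` and its methods are made up for illustration, and the durations would have to be fed in by whatever tracing hook buildd-manager actually uses.

```python
import math


class StatementTimer:
    """Collect per-statement durations; hypothetical, not a Storm tracer."""

    def __init__(self, slow_threshold=5.0):
        self.slow_threshold = slow_threshold
        self.durations = []
        self.slow = []

    def record(self, statement, seconds):
        # Keep every duration for the aggregates, and remember the
        # statement text only when it crosses the slow threshold.
        self.durations.append(seconds)
        if seconds >= self.slow_threshold:
            self.slow.append((statement, seconds))

    def summary(self):
        count = len(self.durations)
        if count == 0:
            return None
        total = sum(self.durations)
        mean = total / count
        # Population variance over all recorded statements.
        variance = sum((d - mean) ** 2 for d in self.durations) / count
        return {
            "count": count,
            "total": total,
            "min": min(self.durations),
            "max": max(self.durations),
            "mean": mean,
            "stddev": math.sqrt(variance),
        }
```

Ten one-second statements never trip a five-second threshold, but the `total` in the summary still shows ten seconds spent, which is the cumulative case raised above.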
wgrantIf it happens again today, I think we should run with full logging tomorrow.22:08
bigjoolsI am too tired to think straight now22:08
mwhudsonfair enough :-)22:08
bigjoolswe are full logging now, except the madness of statement tracing22:08
wgrantEven the 'Scanning foo' messages?22:09
wgrantAnd the extra logging in failBuilder that was cowboyed in earlier?22:09
bigjoolsnot that one - because we're not currently failing builders22:09
wgrantAh, heh.22:09
bigjoolsassessFailureCounts is commented out22:09
bigjoolsso it will report on the counts but never do anything about it22:10
bigjoolsI need to split the failure count stuff in two though22:10
bigjools1 set for dispatch attempts and 1 set for contact attempts22:10
weather15OKay My Launchpad install can be accessed with one problem22:18
weather15What do you do about this error? Error code: ssl_error_rx_record_too_long22:19
weather15SSL received a record that exceeded the maximum permissible length.22:19
wgrantYour Apache configuration is broken. It's probably serving normal HTTP on 443.22:19
weather15Okay I'll check it again22:20
weather15Any idea as to where to look?22:20
weather15I don't see anything wrong with it22:22
weather15is there something wrong with the keys?22:22
wgrantHave you tried restarting Apache?22:22
thumperwallyworld: I've pulled your branch and am looking at it...22:22
weather15I have a problem22:24
weather15my Apache config no-longer exists22:24
jmlfood helps22:24
weather15what do you do in this case?22:24
thumperwallyworld: found it22:29
thumperI wish we had different root objects for each virtual domain22:30
jmlbigjools: did you file a patch upstream for the xmlrpc timeout thingy?22:41
wallyworldthumper: just finished breakfast. what was it?22:43
thumperwallyworld: I told you wrong, the canonical_url of IBazaarApplication is http://code.launchpad.dev/+code22:44
thumperwallyworld: so... we should hang off ILaunchpadRoot22:44
thumperor whatever it is22:44
wallyworldthumper: ah ok. i saw some other stuff hanging off that and was wondering.....22:44
wallyworldi'll fix it22:45
thumperwallyworld: also, the location of the link on the code homepage needs to be fixed22:47
wallyworldthumper: where would you like me to stick it? :-)22:47
weather15is this good or bad?  WARNING Bad object name 'public.todrop_branchmergerobot' 2010-11-17 22:48:01 WARNING No permissions specified for [u'public.lp_openididentifier'] * Disabling autovacuum22:48
thumperwallyworld: I think we should have some nice text below the import text mentioning recipes22:48
thumperwallyworld: they are going to be one of our prime features22:48
thumperwallyworld: let's mock something up and get it to mrevell to check22:48
wallyworldthumper: also, i added a 30 day window to the query. not sure if we want that or not or make it user selectable22:49
thumperwallyworld: that may be fine for now22:49
thumperwe may want to give the users an option22:49
wallyworldthumper: ack the mockup. the initial intent was just to get something working :-)22:49
bigjoolsjml: no, only a bug so far22:49
thumperwallyworld: yeah, understand that22:49
wallyworldthumper: +1 on the option. i was going to have a selection on the listing page itself, like we do for branch listings22:50
jmlbigjools: ok.22:50
bigjoolsjml: I'll do one tomorrow22:50
weather15any idea what to do about this? Traceback (most recent call last):      * Module zope.publisher.publish, line 134, in publish       result = publication.callObject(request, obj)     * Module canonical.launchpad.webapp.publication, line 483, in callObject       return mapply(ob, request.getPositionalArguments(), request)     * Module zope.publisher.publish, line 109, in mapply       return debug_call(obj, args)       __trace22:52
jmlbigjools: neat.22:52
weather15More Output: File "/home/weather15/launchpad/lp-sourcedeps/eggs/zope.publisher-3.12.0-py2.6.egg/zope/publisher/publish.py", line 134, in publish     result = publication.callObject(request, obj)   File "/home/weather15/launchpad/lp-branches/devel/lib/canonical/launchpad/webapp/publication.py", line 483, in callObject     return mapply(ob, request.getPositionalArguments(), request)   File "/home/weather15/launchpad/lp-sourcede22:56
bigjoolsjml: maybe someone will write a test if I attach a patch :)22:58
bigjoolsgood night22:58
jmlbigjools: g'night.22:58
wgrantNight bigjools.22:59
weather15What's this mean? No such file or directory: '/var/tmp/mailman/data/master-qrunner.pid' Is qrunner even running? rm -f logs/thread*.request bin/run -r librarian,google-webservice,memcached -i development22:59
weather15mailman not running22:59
weather15causing this problem?22:59
weather15Traceback (most recent call last):      * Module zope.publisher.publish, line 134, in publish       result = publication.callObject(request, obj)     * Module canonical.launchpad.webapp.publication, line 483, in callObject       return mapply(ob, request.getPositionalArguments(), request)     * Module zope.publisher.publish, line 109, in mapply       return debug_call(obj, args)       __traceback_info__: <bound method OpenID23:00
marsweather15, what command did you run to get that output?23:01
weather15mars:I went to the login page: https://launchpad.dev/+login23:01
marsweather15, are you using 'make run' in the launchpad source tree?23:02
marsand it did not produce obvious errors about starting mailman?23:03
weather15no it did23:03
weather15here's the full output: make run utilities/shhh.py PYTHONPATH= python bootstrap.py\                 --setup-source=ez_setup.py \                 --download-base=download-cache/dist --eggs=eggs \                 --version=1.5.1 mkdir -p /var/tmp/vostok-archive utilities/shhh.py make -C sourcecode build PYTHON=python \             LPCONFIG=development utilities/shhh.py LPCONFIG=development /home/weather15/launchpad/lp-branches/d23:03
weather15Apparently I can't paste it all23:04
marsweather15, pastebin.ubuntu.com23:04
wallyworldthumper: just wondering aloud, to me it's bad that the tests passed (2 different page/view creation steps too) but the app failed to run in practice. agree? something to fix?23:07
marsweather15, on line 28, that looks like an error when the server is first run - it tried to clean up a PID file that doesn't exist.  I wouldn't worry about it.23:07
thumperwallyworld:  the problem is that you weren't loading the page, and clicking on the link23:07
thumperwallyworld: we had page tests for things like that23:07
marsweather15, what did you see when you tried launchpad.dev/+login ?23:08
thumperwallyworld: the unit tests were going directly to the page23:08
thumperwallyworld: so you never saw the actual url23:08
weather15mars: http://pastebin.ubuntu.com/533641/23:08
thumperwallyworld: you could add a test that gets the browser for the page23:08
thumperwallyworld: and tests the browser.url23:08
thumperwallyworld: that would have caught it23:08
wallyworldthumper: ok. i assumed that calls like create_initialized_view(root, "+daily-builds", rootsite='code') would use the same zope infrastructure as is used to load a page etc23:09
thumperwallyworld: it does23:09
thumperwallyworld: but the code root page was using a relative url hard coded23:10
marsweather15, that is new.  https://launchpad.dev works?23:10
thumperwallyworld: it wasn't generating the url in the same way that the tests were23:10
weather15for me using the source and by setting it in my /etc/hosts file23:10
wgrantweather15: Your Apache config for testopenid.dev is still broken.23:10
weather15the documentation never mentioned that23:11
weather15what do I have to do?23:11
wgrantYou must have broken it when you were changing the config.23:11
wgrantIt's in with the rest.23:11
marsweather15, read the rocketfuel-setup script, it has a bash Here Document inside that sets up the /etc/hosts file.  You can compare with that.23:13
weather15there's no mention of openid in the apache config23:14
weather15this is what the Launchpad part of /etc/hosts looks like:      launchpad.dev answers.launchpad.dev archive.launchpad.dev api.launchpad.dev bazaar-internal.launchpad.dev beta.launchpad.dev blueprints.launchpad.dev bugs.launchpad.dev code.launchpad.dev feeds.launchpad.dev id.launchpad.dev keyserver.launchpad.dev lists.launchpad.dev openid.launchpad.dev ubuntu-openid.launchpad.dev ppa.launchpad.dev private-ppa.launchpa23:15
wgrantIt will probably go to the first matching vhost, then.23:15
wgrantweather15: Try adding 'ServerAlias testopenid.dev' to the bottom two sections in the Apache config.23:16
wgrantAlongside launchpad.dev and *.launchpad.dev23:16
weather15okay done23:18
marswgrant, weather15, on my system, the only location of testopenid.dev is in the /etc/hosts file23:18
wgrantmars: Right.23:19
wgrantmars: So it uses the default vhost.23:19
weather15I have launchpad starting now lets see what happens23:19
wgrantflacoste: http://paste.ubuntu.com/533638/ fixes the .htpasswd thing.23:19
wgrantflacoste: Not sure why.23:19
wgrant(it reverts part of the problematic rev)23:20
* jml off23:20
flacostewgrant: weird23:20
wgrantflacoste: Just a little.23:20
flacostei thought that umask played only when creating a file23:20
weather15That doesn't explain why this is not working23:20
wgrantIn both cases this creates a file.23:20
wgrantBut somehow O_TRUNC changes things.23:21
wgrantOr Python is doing something stupid.23:21
weather15still: Traceback (most recent call last):      * Module zope.publisher.publish, line 134, in publish       result = publication.callObject(request, obj)     * Module canonical.launchpad.webapp.publication, line 483, in callObject       return mapply(ob, request.getPositionalArguments(), request)     * Module zope.publisher.publish, line 109, in mapply       return debug_call(obj, args)       __traceback_info__: <bound method23:21
wgrantweather15: Does accessing testopenid.dev in a browser work?23:22
flacostewgrant: 'w' would use O_TRUNC?23:22
wgrantflacoste: Yes.23:23
flacostewgrant: ok23:23
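The usual POSIX behaviour behind this exchange can be checked with a short experiment. This is a sketch of the general rule only and does not reproduce the .htpasswd problem itself; the file name is hypothetical. The umask is applied when open() creates a file, while reopening an existing file with 'w' (which truncates it via O_TRUNC) leaves its permission bits alone.

```python
import os
import stat
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "htpasswd-demo")  # hypothetical file name

old_umask = os.umask(0o077)
try:
    # Creation: Python requests mode 0o666, the umask strips group/other bits.
    with open(path, "w") as f:
        f.write("first\n")
    mode_after_create = stat.S_IMODE(os.stat(path).st_mode)

    # Loosen the permissions explicitly; chmod is not affected by the umask.
    os.chmod(path, 0o644)

    # Truncation: the file already exists, so the umask plays no part.
    with open(path, "w") as f:
        f.write("second\n")
    mode_after_truncate = stat.S_IMODE(os.stat(path).st_mode)
finally:
    # Always restore the process-wide umask.
    os.umask(old_umask)
```

If a file must end up with specific permissions regardless of the environment, setting them explicitly with os.chmod after writing avoids depending on the inherited umask.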
weather15server side yes23:23
weather15client side no23:23
flacostewgrant: wallyworld is going to coordinate deploying that as a cow-boy23:23
wgrantflacoste: Great.23:24
* flacoste updates incident report23:25
weather15is that the problem23:25
=== flacoste changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 | firefighting: buildd-manager is disabling things again & https://wiki.canonical.com/IncidentReports/2010-11-17-LP-Private-PPA-500-errors | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting
wgrantYay Soyuz.23:29
weather15Problem still exists23:31
weather15Oops!  Sorry, something just went wrong in Launchpad.  We’ve recorded what happened, and we’ll fix it as soon as possible. Apologies for the inconvenience.  (Error ID: OOPS-1782X11)  Traceback (most recent call last):      * Module zope.publisher.publish, line 134, in publish       result = publication.callObject(request, obj)     * Module canonical.launchpad.webapp.publication, line 483, in callObject       return mappl23:31
flacostewgrant: shouldn't we set the umask explicitly there? instead of relying on the env23:33
weather15this URL works: http://testopenid.dev/23:33
weather15it returns:Test OpenID provider for launchpad.dev23:33
weather15I wonder if this has something to do with it: https://code.launchpad.net/~bac/launchpad/bug-524302/+merge/2218023:34
weather15output on server side is different:23:40
weather15does this make more sense? http://pastebin.ubuntu.com/533653/23:42
weather15mars wgrant?23:43
marsweather15, try stopping the service, then running 'make clean && make' in the source tree.23:44
weather15okay will do23:44
marsweather15, https://launchpad.dev/+icing/rev5/build/lp/lp.js should be a real file when the server is running23:44
weather15okay it's executing now23:45
marsthe build system should create that JavaScript file for you.  You may want to check the source tree to see that it was created23:45
mars(when make finishes)23:45
weather15when I run make run I get this http://pastebin.ubuntu.com/533655/23:51
weather15mars wgrant?23:53
wallyworldabentley: ping23:53
wallyworldmars: ping - i need a cowboy eyeballed before asking a losa to deploy it23:55
wallyworldStevenK: ping?23:57
wgrantwallyworld: How many eyeballs does it need?23:57
wgrantIs yours insufficient?23:57
wallyworldwgrant: just one. the change as per your pastebin just reverts it to as it was before 11982 landed23:58
wallyworldwgrant: i wasn't sure if i needed to ask a reviewer to eyeball it or not23:58
wgrantOh, right, forgot you weren't a reviewer yet.23:59
