[00:14] wgrant, https://edge.launchpad.net/builders [00:17] StevenK: ^^ [00:19] fuuuuuuuuuu [00:19] That was meant to be fixed. [00:20] And we can't rollback after tomorrow. :/ [00:20] spm: Around? [00:20] we can cherrypick rollback [00:20] he doesn't appear to be [00:20] what do you need? [00:20] We have no StevenK this week, AIUI. [00:21] The new buildd-manager is still horribly broken. Can you see if there's anything interesting in its log? [00:21] yes, all sorts of crap [00:24] Well, that's relatively more opaque than I would have hoped. [00:24] are we running the new buildmanager then? [00:24] We are. [00:25] wgrant: are we looking for anything in particular in the buildd manager log? [00:25] pjdc: I have a section of it. It's not very helpful. [00:25] any idea where the log is on devpad? [00:25] wgrant: do you know which machine the buildd manager runs on? [00:26] thumper: cesium [00:26] I have bits of the log already. [00:26] thumper: looks like they land in devpad:/x/launchpad.net-logs/production/cesium [00:26] pjdc: yep, there [00:27] Twisted errors make me sad. [00:28] I wonder if the fermium connection error at 2010-11-17 00:08:13 is the root. [00:28] Oh, no. [00:28] It was a few minutes before that. [00:28] Sad. [00:29] Can you see where it started? [00:29] pjdc: There have been no known network glitches this morning? [00:30] wgrant: not to my knowledge. [00:30] :( [00:31] 2010-11-16 00:01:03+0000 [-] communication failed (User timeout caused connection failure.) [00:31] 2010-11-16 00:01:03+0000 [-] failure (None) [00:31] 2010-11-16 00:01:03+0000 [-] communication failed (User timeout caused connection failure.) [00:31] 2010-11-16 00:01:03+0000 [-] failure (None) [00:31] 2010-11-16 00:01:03+0000 [-] communication failed (User timeout caused connection failure.) [00:31] hmm, i could check if fermium was airlocked around the [00:31] 2010-11-16 00:01:03+0000 [-] failure (None) [00:31] 2010-11-16 00:01:03+0000 [-] communication failed (User timeout caused connection failure.) [00:31] 2010-11-16 00:01:03+0000 [-] failure (None) [00:31] seeing things like that [00:31] hmm... [00:31] perhaps a pastebin would be better [00:31] thumper: Isn't that >24 hours ago? [00:31] wgrant: yep, I'm looking for failures [00:34] hmm... [00:34] buildd-manager.log-20101116 has up to 2010-11-16 00:01:05+0000 [00:35] Yeah, they're named sort of wrongly. [00:35] buildd-manager.log starts at 2010-11-17 00:00:45+0000 [00:35] They're named after the day they're rotated. [00:35] and no log file in the middle [00:35] There's no -20101117? [00:35] nope [00:35] Um. [00:36] pjdc: can you see a -20101117 file on cesium? [00:36] thumper: looking [00:36] thumper: yes, there's a buildd-manager.log-20101117 [00:36] pjdc: can you get that to devpad plz? [00:37] according to the graph, we started losing builders approx 12 hours ago [00:38] Does LPS say when cesium was updated? [00:38] thumper: landed in /tmp [00:39] pjdc: ta [00:40] Scanning failed with: <-- look suspect [00:40] That's fairly normal. [00:40] Unhandled error in Deferred: [00:40] That's not. [00:40] that's probably not [00:41] 2010-11-16 23:51:34+0000 [-] builder promethium failure count: 5, job 'amd64 build of widelands 1:15bzr5723-ppa1~natty1 in ubuntu natty RELEASE' failure count: 1 [00:41] 2010-11-16 23:51:34+0000 [-] Scanning failed with: User timeout caused connection failure. [00:41] There are hundreds of those. [00:41] And that's not the first. [00:42] Failure: twisted.internet.defer.CancelledError: [00:42] ??
[00:42] no [00:42] it isn't the first [00:42] I'm looking for suspect lines [00:43] is there a log entry for disabling a builder? [00:44] There used to be. [00:44] I never got around to reviewing the full 6000 line diff of this branch, though :/ [00:44] * wgrant looks. [00:45] Miserable failure when trying to examine failure counts: :-) [00:45] .... wow. [00:45] except: [00:45] GRGWOGIFJEWF [00:46] Anyway, what was the miserable failure? [00:47] thumper: You'll be pleased to know that not all failBuilder callsites log. [00:47] wgrant: I'm grepping [00:47] wgrant: I can kinda tell [00:47] I suppose we should do a launchpad status thingy [00:47] But we should be able to tell from the logs. [00:48] Since only two of the callsites are unobvious, and one has enough logging around it that we should be able to work it out. [00:48] wgrant: can you see devpad? [00:48] thumper: Not for another few weeks :( [00:48] just wondering, I'll copy the logfile to people.c.c [00:49] Thanks. [00:50] hmm... [00:50] I'm getting refused ssh [00:50] That's excellent. [00:52] hmm... something screwy is going on... [00:52] ok, it's on its way up [00:53] wgrant: I'm not entirely sure what to look for [00:53] thumper: Thanks. [00:53] * wgrant examines. [00:54] It's tempting to add more explicit logging, restart it, and hope it breaks again. [00:54] test rollout of 11926 on 2010-11-16 [00:55] that's from LPS [00:55] I was hoping for higher granularity. But I guess the log should help with that. [00:55] bug https://bugs.launchpad.net/soyuz/+bug/671242 [00:55] <_mup_> Bug #671242: New buildd-manager disabling everything in sight [00:55] 10:03, by the look of things. [00:56] thumper: Right, 11888 was deployed, but that broke with that bug. [00:56] 11926 fixed that. [00:56] But we now have another one. [00:56] this is a different problem? [00:57] Yes. [00:57] I believe. === thumper changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 | firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting [00:57] I think this problem may be partially described in that bug, but it's not the one that was identified and fixed. [01:03] So, we have major problems at 14:41:45, 15:03:20 and 16:31:22, 23:39:11, and possibly 18:11:22, 19:52:18, [01:03] 23:39:11 is the big one which took out everything. [01:04] Each major failure starts with a single scan failure, then a huge number 9 seconds later. [01:04] elmo: ping? [01:05] He left a while ago. [01:05] :( [01:05] I think we are LOSA-less [01:05] pjdc should be around, though? [01:05] pjdc: what do you know about LP deployment? [01:06] If you want to revert, we don't need a full deployment. It's just a symlink change and restart of buildd-manager. [01:06] thumper: not much. i assisted elmo with an emergency cowboy about a year ago, but that's about it. [01:06] :( [01:06] lifeless: are you really there? [01:06] pjdc: on cesium there was a rollout yesterday [01:06] pjdc: I'm hoping that they kept the old code around [01:07] thumper: Last time that was the case. [01:07] pjdc: as the new code is disabling all the builders [01:07] thumper: A LOSA just flipped the symlink back to the 10.10 rollout. [01:07] soyuz-production-rev-9886 [01:07] wgrant: if the buildd-manager is restarted, will it recheck the disabled builders?
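The pattern wgrant describes above — one scan failure, then dozens roughly nine seconds later — can be surfaced mechanically from the log lines quoted earlier. A minimal sketch (not a tool from the Launchpad tree) that parses buildd-manager.log timestamps and groups "Scanning failed" lines into incidents:

    import re
    from datetime import datetime, timedelta

    # Matches log lines such as:
    # 2010-11-16 23:51:34+0000 [-] Scanning failed with: User timeout ...
    FAILURE = re.compile(
        r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\+0000 .*Scanning failed')

    def failure_clusters(log_path, window=timedelta(seconds=10)):
        """Group scan failures whose timestamps fall within `window`."""
        clusters, current = [], []
        for line in open(log_path):
            match = FAILURE.match(line)
            if match is None:
                continue
            stamp = datetime.strptime(match.group(1), '%Y-%m-%d %H:%M:%S')
            if current and stamp - current[-1] > window:
                clusters.append(current)
                current = []
            current.append(stamp)
        if current:
            clusters.append(current)
        return clusters

    # The big incidents (14:41:45, 23:39:11, ...) show up as clusters of
    # dozens of failures within a second or two of each other.
    for cluster in failure_clusters('buildd-manager.log'):
        print cluster[0], len(cluster)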
[01:07] thumper: if i'm looking in the right place, there are two trees, 11926 and 9886 [01:07] thumper: yes [01:07] thumper: 9886 looks pretty old [01:08] pjdc: 11926 is the broken one [01:08] pjdc: 9886 will be from db-devel [01:08] so... probably the last rollout [01:08] Wed 13th of Oct [01:08] that's the date on rev 9886 of db-devel [01:09] thumper: No, we'll have to flip a flag on each to get them back. [01:09] 9886 is what we reverted to after 11888 failed. It's the last rollout. [01:09] pjdc: can you do that? [01:10] wgrant: how do we re-enable the builders? [01:11] thumper: so that'd be change the symlink, restart the service? [01:11] pjdc: AFAIK [01:11] thumper: i take it things can't get worse at this point? [01:11] thumper: again? nooo [01:11] this bodes badly for the rollout tomorrow. [01:11] like really. [01:11] thumper: 9886 is fine, its the last db-stable deploy [01:11] wgrant: you got log files etc - whats up ? [01:11] it was stable for hours - did this happen recently? could it be librarian changes? [hope not] [01:11] . [01:11] Yeah, cesium is as broken as it can be. [01:12] lifeless: I have the log change. [01:12] s/change// [01:12] whats causing this [01:12] before we change stuff [01:12] It was mostly OK for 4 hours. [01:12] After 13 hours it just completely melted down. [01:13] Still looking to see if I can get anything useful from the logs. [01:13] can we just restart it in the meantime and toggle the builds back on ? [01:13] thumper: The 'Builder OK' flag on Builder:+edit does it. Otherwise there might be a script around. [01:13] or will it kill them immediately ? [01:13] lifeless: I guess we could try that. [01:13] we need to figure out what happens tomorrow @ rollout time [01:14] which is what, 8 hours away [01:14] lifeless: right now it is a release blocker IMO [01:14] I'm starting an incident report [01:14] lifeless: I think we probably cherrypick 11808, 11815 and 11926 off cesium. [01:14] wgrant: 11926 itself is a problem ? [01:15] lifeless: No, but 11808 probably won't revert unless we revert 11926 first. [01:15] Er. [01:15] Not 11926. [01:15] That other one. [01:15] sorry, I left my mind reader in Asia [01:16] The fix for the issue that caused us to roll cesium back from 11888. [01:16] 11898 [01:17] So 11808, 11815 and 11898. [01:19] Do we know if enablement has pulled any buildds today? [01:19] cody-somerville: ^ [01:19] wgrant: it's been quiet since the 12th, as far as i can tell [01:20] It doesn't explain everything (since non-virt buildds had the same error), but it might be something. [01:20] wgrant: whats the error [01:20] lifeless: 2010-11-16 16:31:13+0000 [-] Scanning failed with: User timeout caused connection failure. [01:20] 2010-11-16 16:31:13+0000 [-] Traceback (most recent call last): [01:20] 2010-11-16 16:31:13+0000 [-] Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure. [01:21] lifeless: In most of the major failures in the log, there is one of those followed by dozens 9 seconds later. [01:22] lifeless: do you object to rolling back the buildd-manager code on cesium? [01:22] Perhaps we should try enabling things and see if they stay alive for long. [01:23] does the buildd-manager still do blocking things? [01:23] thumper: I want to be sure we understand it [01:23] Only when downloading files from slaves, I believe. [01:23] mwhudson: I believe it is twisted now [01:23] mwhudson: ^^ [01:23] pjdc: can you please: [01:23] wgrant: I thought that jelmer fixed that [01:24] thumper: fully?
it's been somewhat twisted for a long time [01:24] * thumper needs food [01:24] - restart the builddmanager [01:24] thumper: jelmer fixed it so it uploads the downloaded files asynchronously. [01:24] - reenable a couple of fast buildds [01:24] - see what happens over a few minutes [01:24] thumper: A branch is coming to download them async, too, but it's not done yet. [01:24] thumper: go eat, nothing will change radically while you eat [01:24] i'm not too familiar with the buildd pool. can someone suggest candidates? [01:25] * pjdc picks 3 amd64 official builders [01:25] pjdc: A fairly random pick of the various categories: roseapple, allspice, doubah, samarium [01:26] Couple of new non-virt, and an old and new virt. [01:26] works for me [01:26] restarting buildd-manager [01:27] started, doing the buildds now [01:28] Hm. [01:28] Maybe we should turn the logging up. [01:28] (lib/lp/buildmaster/manager.py, s/logging.INFO/logging.DEBUG/) [01:28] re-enabled those four, plus yellow and crested since i had the tabs all ready [01:29] Great. [01:29] There are some odd five minute gaps in the log, and it would be nice to know if it actually does anything in them. [01:42] pjdc: the queue for amd64 is empty though [01:42] not sure if that'll show much [01:43] thumper: looks like doubah's done the business though, showing as disabled again [01:43] The builders were failed regardless of whether there was anything to build or not. [01:43] Oh, already? [01:43] oh ok [01:44] wgrant: does the buildd manager *read* from the librarian ? [01:44] lifeless: I don't think so. [01:44] I can't think why it would. [01:44] do builders ? [01:44] Yes. [01:44] what code path do they use to get their urls ? [01:45] Ahh. cesium provides them, I believe. [01:45] But it doesn't use the restricted librarian. [01:45] even for security builds etc? [01:45] No -- private build files are retrieved from the archive. [01:46] Since the builders can't have restricted librarian access. [01:46] (well, I guess they could now) [01:46] just to be sure [01:46] pjdc: are we seeing access denied for 91.189.89.189 or 91.189.89.188 from the builders that fail (or from cesium for that matter) [01:46] wgrant: what time was the first builder disabled ? [01:47] lifeless: Can't tell. But the first major incident was probably 14:41:45. 23:39:11 was the really big one. [01:47] lifeless: cesium can't connect to those IPs on 80 and 443 [01:48] oh [01:48] is this the restricted librarian [01:48] pjdc: ah, but is it *trying* [01:48] aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa [01:48] god damn it [01:48] lifeless: i'll check [01:48] elmo: there are two code paths for it in lp. internal stuff like the merge proposal diff code will use lp.internal still [01:48] cesium will be wanting to upload to it. Is that also going to be broken? [01:49] 188 IN PTR wildcard-restricted-launchpadlibrarian-net.banana.canonical.com. [01:49] elmo: I really really wouldn't expect this to be connected, but Just In Case. [01:49] wgrant: no, uploads have not altered *at all* [01:49] That's what I thought. [01:49] wgrant: are those times UTC? [01:49] lifeless: Yes. [01:50] ok, so 9 hours apart. [01:50] There were some in between. [01:50] (range of) unless we're seeing two different things [01:50] then its not the publicrestricted librarian work [01:50] When did that happen? [01:50] lifeless: a few rejects (5 total) for cesium going to both IPs on 80 and 443 [01:50] wgrant: and really? sed to change log levels ? 
[01:51] lifeless: Yup :D [01:51] pjdc: thats -very- interesting [01:51] wgrant: I can has bug, and fix. [01:51] elmo: did you just remove the publicrestricted feature flag? [01:52] lifeless: sorry, false alarm. those were my test attempts. [01:52] elmo: from https://launchpad.net/+feature-rules [01:52] lifeless: I haven't touched anything [01:52] interesting [01:52] cause the setting is gone ;) [01:53] Hmm. [01:53] It's possible that translations jobs might read from the librarian. [01:53] I don't know them well. [01:53] * wgrant looks. [01:53] But they wouldn't be private. [01:55] pjdc: did you? [01:55] nb we really need that audit log [01:55] lifeless: sorry, did i what? [01:56] sounds like a no to me [01:57] ok, one thing at a time [01:57] uhm [01:57] We should probably have an 'assert not libraryfilealias.restricted' in BuilderSlave.cacheFile. [01:57] pjdc: please reenable doubah again. [01:57] But I doubt that's the problem here. [01:58] lifeless: ok [01:58] wgrant: or enable it [01:58] * thumper has to go get kids [01:58] lifeless: Hm? [01:58] pjdc: if this fails, its not the publicrestricted librarian. [01:58] lifeless: doubah re-enabled [01:58] https://launchpad.net/ubuntu/+source/fglrx-installer/2:8.780-0ubuntu3/+build/2049941 <-- roseapple worked (for one build) - did we know that? [01:58] Yeah, most things work for a while. [01:58] ok [01:58] It's not related to what the builder is doing at the time. [01:59] It may be related to what others are doing, but who knows. [01:59] wgrant: how do we know that? [01:59] lifeless: Because it affects dozens of builders at a time, whether they're idle or building recipes or building binaries. [01:59] doubah's gone. [01:59] wgrant: what if the timeout or some such is pseudo global, and one hung builder breaks all the ones open for the time window involved [02:00] lifeless: Exactly. [02:00] 12:59:04 < wgrant> It may be related to what others are doing, but who knows. [02:00] So, doubah is dead with a TCP timeout. [02:00] wgrant: in which case its related to what *one* does [02:00] I wish we had a traceback. [02:00] It would be sorta helpful to know what timed out. [02:04] lifeless: buildd-manager won't have cached the old FF? [02:05] wgrant: it wasn't restarted when the problem happened [02:05] True. [02:05] Can we disable everything, enable doubah, and see what happens? [02:06] wgrant: it was 3 hours ago now that the ff was turned on (and apparently off again) [02:06] Oh, so ages after the world exploded. I see. [02:07] wgrant: yeah, I'm convinced we're clear [02:07] lifeless: Chex turned the feature flag off after i complained that private codebrowse wasn't working [02:07] mwhudson: did that make it work? [02:08] (which seems entirely unrelated to me, but after he did it, private codebrowse started working again) [02:08] wtf [02:08] yes [02:08] codebrowse uses the librarian? [02:08] i'm betting some kind of coincidence [02:08] pjdc: can you please turn the flag on again - its listed under sql queries etc on LPS. [02:09] lifeless: that's a long page. what am i looking for exactly? [02:10] publicrestrictedlibrarian default 0 on [02:11] that doesn't mean much to me. is that a command? [02:11] its a line you put in https://launchpad.net/+feature-rules [02:11] https://dev.launchpad.net/LEP/FeatureFlags has plenty of docs - see the bottom of the page in particular [02:12] ah, ok. so bung it in at the end, hit "Change"? [02:12] yes [02:12] done [02:13] mwhudson: still working ?
[02:13] lifeless: will check [02:13] 'Server denied check_authentication' is what you saw? [02:13] lifeless: yes [02:14] zomg [02:14] pjdc: and remove it? [02:14] lifeless: it works now [02:14] oh [02:14] pjdc: don't remove it [02:14] lifeless: oh, it failed for you? [02:14] lifeless: ok, doing nothing :) [02:15] mwhudson: failed once [02:15] mwhudson: worked on second url [02:15] I think its stubs openid change [02:15] lifeless: random [02:15] mwhudson: so coincidence that that was the first private branch url you tried since 11926 was deployed. [02:16] pjdc: thanks [02:16] ok [02:16] so back to the buildd [02:16] pjdc: did doubah do day carooba? [02:16] So, I'd like to see this happen: [02:17] - Disable all builders. [02:17] - Shut down buildd-manager. [02:17] lifeless: you lost me at "do [02:17] - Change log level. [02:17] " [02:17] - Enable doubah. [02:17] - Start buildd-manager [02:17] pjdc: you reenabled doubah [02:17] pjdc: did it die again? [02:17] It did. [02:18] ok [02:18] pjdc: could you do what wgrant just described [02:18] disable all builders including those currently building? or just the idle ones? [02:19] All, ideally. [02:19] all [02:19] we're wondering if doubah is broken [02:19] the alls have it [02:19] and a bug is making all the others get nuked when it goes *if* they happen to be lined up with it in the polling period [02:19] Well, I'm mostly hoping we can get a minimal case to fail. [02:19] I doubt there's anything wrong with doubah. [02:19] ok. I'm thinking that. [02:20] what is doubah - virt i386? [02:20] Since I picked four semi-randomly from a pool of 60. [02:20] Yeah. [02:20] A fairly beefy one, too. [02:22] all disabled, stopping the buildd-manager [02:22] wgrant: how is the log level changed? [02:23] pjdc: Aherm. [02:23] pjdc: s/logging.INFO/logging.DEBUG/ in lib/lp/buildmaster/manager.py. [02:23] wgrant: you *are* going to fix that. [02:24] lifeless: It is Twisted evil. [02:24] Which I don't know awfully well. [02:24] hasn't stopped you in the past [02:25] True. [02:25] wgrant: like so? http://paste.ubuntu.com/533321/ [02:25] pjdc: Yup. [02:25] wgrant: and you have help [02:25] Flip doubah back on, and start b-m up. [02:25] And let's hope it fails. [02:25] doubah enabled, b-m starting [02:25] wgrant: what time do recipe builds auto create ? [02:26] lifeless: Probably a couple of hours ago. [02:26] b-m started [02:26] lifeless: They were happening around the time it was noticed. [02:26] Er. [02:26] doubah's dead already? [02:26] Hm, no. [02:27] Must have been cached. [02:27] shows as building here [02:27] Yeah, it is now. [02:28] OK, it's started. [02:28] Only died once. [02:32] :( it seems to be happy. [02:32] should we bring up another virt i386 ? [02:33] === Top 10 Time Out Counts by Page ID === [02:33] Hard / Soft Page ID [02:33] 230 / 59 Person:+commentedbugs [02:33] 111 / 5615 Archive:+index [02:33] 76 / 295 BugTask:+index [02:33] 12 / 398 Distribution:+bugtarget-portlet-bugfilters-stats [02:33] Worth a try. [02:33] 12 / 341 Distribution:+bugs [02:33] 10 / 5 Person:+bugs [02:33] 9 / 7 ProjectGroup:+milestones [02:33] (virt i386 is good because it gives us all job types) [02:33] 8 / 2 BugTask:+create-question [02:33] 5 / 47 Distribution:+archivemirrors [02:33] 5 / 17 DistributionSourcePackage:+publishinghistory [02:35] wgrant: shall i enable actinium then? [02:35] pjdc: Sure. [02:35] wgrant: ok, enabled [02:37] Really? [02:37] Maybe I'm on a slave, but actinium looks dead. [02:37] If it has just died, this is great news indeed.
[02:37] looking dead here too [02:37] !!! [02:38] We may have some hope of untangling the logs this time. [02:38] Could you throw the log since the restart somewhere I can see it? [02:40] will do [02:43] see query [02:43] Thanks. [02:44] ... [02:44] 2010-11-17 02:26:50+0000 [Uninitialized] ForbiddenAttribute: ('build', ) [02:44] ??? [02:45] This logging is a lot more descriptive :) [02:47] Hm, so actinium was aborted. [02:48] It was resumed, then just a few seconds later a dispatch was attempted... that's far too quick. [02:49] So, actinium probably wasn't hit by the root issue. :( [02:49] We don't wait long enough for the resume to complete. [02:49] But that doesn't explain the 'User timeout caused connection failure' thing, or why non-virt builders were broken too. [02:51] OK. I think we should try to get it to break horribly again. So we should reset the failure counts and reenable everything, I suppose. [02:52] Sigh. [02:54] if we bring everything up [02:54] will we log useful data? [02:54] I hope so. [02:55] Maybe we should make failBuilder log before we do that, though. [02:55] So we can see when things are disabled. [02:56] wgrant: can you prep a cowboy [02:56] Doing so. [02:57] * thumper has to head afk [02:58] pjdc, lifeless: http://pastebin.ubuntu.com/533329/ should do it. [03:00] * wgrant pelts buildd-manager with rocks and sets it on fire. [03:03] wgrant: so, shut down, apply patch, enable all (this might take a while), start up? [03:04] pjdc: Yup. [03:04] pjdc: Do you know if there's a script to enable them all? [03:04] Otherwise there's SQL... [03:04] wgrant: no idea, i've only ever done them manually [03:05] We may need SQL to reset the failure counts anyway. We'll see shortly. [03:06] b-m stopped, patch applied [03:06] wgrant: api script :) [03:07] lifeless: Yeah, yeah, on my todo list. [03:08] It's reasonably unfortunate that all this has happened when we have no available LOSAs in this TZ, no available Soyuz developers in this TZ, and both of the buildd admins in this TZ also unavailable. [03:09] s/unfortunate/normal/ [03:10] No... We'd normally have a LOSA, a Soyuz developer, and two buildd admins. [03:13] okay, that's all of them enabled [03:13] anything else before b-m is started? [03:14] So we are now running with the log level change and the additional failBuilder logging? [03:14] yep, left the loglevel change in place, and applied your cowboy [03:14] Start it up! [03:15] I expect most of the them will disable themselves again in about 30 seconds :( [03:15] started [03:17] :( [03:23] So, everything seems to be happy now. [03:24] I guess we just leave it until it explodes in a few hours, and hope the new logging tells us something useful. [03:25] that shouldn't be far off when the UK wakes up, so that might work out [03:25] Given that we've failed to reproduce it elsewhere, it is tempting to let the rollout go ahead and just automatically undisable builders until we work out what's going on :/ [03:26] The 14 builders that are disabled now probably just need their failure count reset (it's already over the threshold, so the initial failure to connect because the builder is still resuming causes them to be disabled). 
[03:26] Something like this: [03:26] UPDATE builder SET failure_count=0, builderok=true WHERE name IN ('hawthorn', 'actinium', 'hassium', 'lansones', 'muntries', 'radium', 'rosehip', 'sandpaperfig', 'terranova', 'fermium', 'lawrencium', 'nobelium', 'papaya', 'plutonium'); [03:27] if it's not critical, that's probably best left for a losa [03:27] Probably, yeah. [03:27] Not critical. Just makes it harder to see if it's broken without watching logs. [03:27] Thanks for your help. [03:28] you're welcome! [03:29] adare and ross are now broken in other ways :( [03:29] But that can wait. [05:18] mwhudson: https://bugs.launchpad.net/launchpad-foundations/+bug/676372 [05:18] <_mup_> Bug #676372: "Server denied check_authentication" from bazaar.launchpad.net private branch since 11926 deployed === jtv is now known as jtv-eat [06:47] hi all [06:47] i am running './bin/test' in a vm, and it has been stuck for hours, with the last output being [06:47] Started ['/tmp/tmpecWY0y.mozrunner/mozrunner-firefox', '-profile', '/tmp/tmpecWY0y.mozrunner', 'http://bugs.launchpad.dev:8085/windmill-serv/start.html'] [06:47] in 1.109 seconds. [06:47] [06:47] halp? [06:47] Is there a firefox window lurking around? [06:48] not that i can see [06:48] i'm ssh'd in to the vm without -X [06:48] i will see if there's a firefox process [06:48] there is not, though there is a zombie [07:28] mthaddon: Around yet? [07:48] Hi wgrant! [07:48] ;) [07:48] Morning henninge. [07:48] wgrant: heard you got engaged [07:49] Oh? [07:49] That Kate really is a nice girl [07:49] oh sorry, wrong W... ;-) [07:49] wgrant: What's that about the buildmanager? [07:50] henninge: Well, it may or may not be a release blocker. [07:50] henninge: We have a not utterly terrible workaround, so it's probably OK. [07:50] wgrant: what does the workaround include? [07:50] henninge: Uhh, frequently reenabling all the builders manually. [07:51] ;-) [07:51] How frequently? [07:51] Unsure. It was OK for 4 hours yesterday. And it's been OK for 4 hours so far today. [07:51] hours, wow ... [07:52] The affected code is on cesium, right? [07:52] Yeah. [07:52] Hopefully jml and bigjools will save the world tonight. [07:52] So that is (again) part of the nodowntime hosts [07:52] so a fix can be deployed any time. [07:53] Yeah. [07:53] wgrant: I am sure they will! ;-) [07:53] So, it's a pretty terrible bug, but we can work around it easily enough with a script. [07:55] The only reason I can imagine this being a blocker for the roll-out would be if any fix would include db changes. [07:56] which is not that far fetched, I guess. [07:56] It won't. [08:04] wgrant: congrats! [08:04] poolie: Hm? [08:05] or is he just totally confused? [08:05] I hope he's just totally confused. [08:05] ah, me clicks [08:05] Or there's some news about me that I don't know. [08:05] William Soxe-Gotha-Coburg-Windsor [08:06] Ahhhhhhh, of course. [08:35] hi henninge [08:35] hi bac! [08:50] good morning [08:54] poolie, wgrant: Yeah, I messed up the joke. I meant to say "sorry, wrong prince" ... ;) [08:54] Moin adeuring! [08:54] hi henninge [08:55] Haha. [09:08] bigjools: Morning... [09:12] morning [09:12] bigjools: Have you heard the wonderful news? [09:12] which? [09:13] bigjools: We're about to release with a pretty screwed buildd-manager :) [09:13] fuck sake [09:13] It disabled 60 or so this morning. 
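The "api script" lifeless asked for a few minutes earlier would replace that SQL with something like the launchpadlib sketch below. The top-level builders collection and the builderok attribute (the 'Builder OK' flag from Builder:+edit) are assumptions about what the web service exports, and the API would not reset failure_count, so the SQL above may still be needed alongside it:

    from launchpadlib.launchpad import Launchpad

    DISABLED = [
        'hawthorn', 'actinium', 'hassium', 'lansones', 'muntries',
        'radium', 'rosehip', 'sandpaperfig', 'terranova', 'fermium',
        'lawrencium', 'nobelium', 'papaya', 'plutonium']

    lp = Launchpad.login_with('builder-reenabler', 'production')
    for builder in lp.builders:
        if builder.name in DISABLED and not builder.builderok:
            builder.builderok = True
            builder.lp_save()
            print 'Re-enabled %s' % builder.name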
=== mthaddon changed the topic of #launchpad-dev to: Launchpad down/read-only from 10:00-12:00 UTC for DB update | Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 | firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting [09:14] It seems to be reasonably happy now, since we restarted everything 7 hours ago. [09:14] But it was OK for a few hours yesterday too :/ [09:14] it was disabling builders because they were unresponsive [09:14] it's supposed to do that [09:15] TCP timeouts and no route to host errors are different. [09:15] how? [09:15] This was "User timeout caused connection failure" or something like that. [09:15] that's because they don't respond within the timeout [09:16] Dozens of them in one second? [09:16] what sort of time did this happen? [09:17] 14:41:45, 15:03:20 and 16:31:22, 23:39:11 are some that I saw. [09:17] 23:39:11 was the big one. [09:17] that's when the daily recipes kick off [09:17] But the last two incidents there start with a single error, then 9 seconds later dozens. [09:17] Hello [09:18] There were also a few other odd errors in the logs. [09:18] And it's not waiting long enough for builders to resume. [09:18] But apart from that it's happy now. [09:18] that's a problem because there's nothing we can do to fix that [09:18] Hmm? [09:18] the connection timeout is hard-coded in the python libs :/ [09:18] Odd... [09:19] the reset script waits until some event in the builder, which is supposed to be when it's ready to accept a connection [09:19] then that connection often times out [09:19] 2010-11-17 02:35:58+0000 [QueryProtocol,client] Resuming actinium (http://actinium.ppa:8221/) [09:19] 2010-11-17 02:36:04+0000 [-] Asking builder on http://actinium.ppa:8221/filecache to ensure it has file chroot-ubuntu-lucid-i386.tar.bz2 (http://launchpadlibrarian.net/51974282/chroot-ubuntu-lucid-i386.tar.bz2, d267a7b39544795f0e98d00c3cf7862045311464) [09:19] we're seeing the fruits of that now because I am actually disabling stuff [09:19] 2010-11-17 02:36:25+0000 [Uninitialized] Scanning failed with: TCP connection timed out: 110: Connection timed out. [09:19] whereas the old one never disabled anything [09:20] It waited 6 seconds from firing the resume trigger. [09:20] Maybe the script is buggy. [09:20] no [09:20] 6 seconds is about right [09:20] they reset very quickly [09:20] The VM is created and boots in 6 seconds!? [09:20] yes [09:20] Nice. [09:21] the first connection is to send the chroot, and that's why you see it timing out [09:21] we can get around this for now by removing the code that fails builders [09:21] which is essentially what the old b-m was not doing [09:21] I think we need to disable failure counting. [09:22] It took out lots of builds as well. [09:22] (and fourteen or so builders need their failure counts manually reset) [09:22] sigh [09:23] I still find it unlikely that dozens of builders failed to respond all in the same second, several times, unless there were network glitches that nobody knows about. [09:23] The 9 second delay between the first failure and subsequent stream on at least two occasions is also rather suspicious. [09:23] if it's a network glitch then it's more likely that they all go at once [09:25] Anyway, cesium is currently running the new code with two cowboys: one setting loglevel to DEBUG, and another to log whenever a builder is failed.
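For the record, that second cowboy (http://pastebin.ubuntu.com/533329/, long since expired) amounts to a one-line logging addition. A reconstruction from the discussion rather than a copy of the tree — builderok comes from the SQL quoted earlier, failnotes is an assumption about the model, and the log call is illustrative:

    from twisted.python import log

    class Builder:
        # Sketch of the relevant slice of the Builder model class.
        def failBuilder(self, reason):
            # The cowboy: make every disable event visible in the log,
            # closing the gap thumper hit when grepping for disables.
            log.msg('Disabling builder %s: %s' % (self.name, reason))
            self.builderok = False
            self.failnotes = reason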
[09:25] We also need to fix the failure counts of those builders, and probably do a mass-giveback :/ [09:25] failure counts are reset on a successful dispatch [09:26] They are. [09:26] Hmm. [09:27] for a builder to get failed it has to go wrong on 5 consecutive occasions [09:27] But the issue is that the first failure will immediately knock them out again. [09:27] no, that's not true [09:27] It will, since the count is currently 5. [09:27] We reenable, they time out, and are immediately disabled. [09:27] No five strikes rule for them. [09:27] ok, re-enabling should reset the count [09:27] that's a bug [09:27] It should. [09:27] But it doesn't. [09:27] And we were LOSA-less today, so we couldn't do it manually. [09:29] I think the recipe builds are thoroughly screwing the builders [09:29] So "User timeout caused connection failure" occurs when the TCP connection is accepted, but there's no HTTP response? [09:29] everything works fine until they come along [09:30] that happens when the connect() fails [09:30] We're still running the old lp-buildd with in-chroot bzr-builder, aren't we? [09:30] yes, we rolled them back [09:30] If that happens when connect() fails, then why this: [09:30] "TCP connection timed out: 110: Connection timed out." [09:31] That's a separate error. [09:33] I think I'm going to just remove the failure counting stuff for now [09:34] Sounds like a good idea. [09:36] wgrant: did you ask someone to restart it at 0126 UTC? [09:36] I think the first one was lifeless, but yeah, it was around then. [09:36] there were no problems with it at that time [09:39] It had taken out all but a few buildds an hour earlier. We wanted to see if we could reproduce it fresh with just a couple of active builders, to see if we needed to roll back and work out what to do about the release. [09:42] I think the problem is recipe builds for sure, I just need to reproduce on DF [09:42] the builder is doing something that makes it unresponsive [09:43] That's not the whole thing. [09:43] palmer was disabled. It is non-virt and had been idle for 30 minutes. [09:43] hmmm === Guest8056 is now known as jelmer [09:44] oh jeez the log is massive with debug on [09:44] So we knew it was either several undetected network glitches throughout the day manifesting without any TCP timeouts, or something with one builder was glitching everything else out. [09:45] So we turned up logging and hoped it would reappear, since the INFO logging is sort of completely sparse. [09:46] We can't tell when the problematic scans were triggered, and there are five minute gaps in the log :/ [09:46] And I can't reproduce it locally however much I try :( [09:46] it's a nightmare [09:47] Yeah, just a bit. [09:48] from the log, it starts going wrong at the exact same time the faily (sic) recipe builds get kicked off [09:48] around 23:35Z [09:48] That's the big incident, yeah. [09:48] But there are several smaller ones in the preceding 9 hours. [09:48] other incidents are almost certainly another batch [09:48] Possibly. [09:49] there are some Fault 8002: [09:51] Yeah, but they're everywhere... [09:51] that's a protocol fault [09:52] hmmm /me sees something [09:52] What has been seen? [09:53] this might have something to do with the huge blocking file fetch [09:53] I considered that. [09:53] But the 23:39 incident suggests not. [09:54] The nearest fetch before that was about 6 minutes earlier.
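Pulling the failure-counting rules from the exchanges above into one place — five consecutive failures disable a builder, a successful dispatch resets the count, and manual re-enablement should reset it but doesn't. A sketch of that logic with illustrative names (the real code lives around failBuilder in the buildmaster tree):

    FAILURE_THRESHOLD = 5

    def record_scan_failure(builder, logger):
        builder.failure_count += 1
        logger.info('builder %s failure count: %d'
                    % (builder.name, builder.failure_count))
        if builder.failure_count >= FAILURE_THRESHOLD:
            builder.failBuilder('Too many consecutive scan failures.')

    def record_successful_dispatch(builder):
        builder.failure_count = 0

    def reenable(builder):
        builder.builderok = True
        # The piece bigjools calls a bug: without this reset, a freshly
        # re-enabled builder is one timeout away from being disabled again.
        builder.failure_count = 0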
[09:54] I think it's a number of different things that cause blocks === henninge changed the topic of #launchpad-dev to: Launchpad down/read-only from 10:00-12:00 UTC for DB update | Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 (but closed during the roll-out)| firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting [10:03] bigjools: So, just going to cowboy out failure counting after the rollout and hope that we can work it out? [10:03] yes [10:03] :/ [10:04] one of the things that the failure counting did was to remove in-progress jobs from builders if they failed a poll [10:04] I might have to rethink how that work [10:04] s [10:05] damn, this stuff is hard [10:06] It should all be fine. [10:06] "should" [10:06] Except for those unexplained User blah blah blah errors, and the reset script lying. [10:06] Apart from that and the occasional other translations exception, it seems to be OK. [10:07] 2010-11-17 02:26:50+0000 [Uninitialized] ForbiddenAttribute: ('build', ) [10:07] That's the translations exception. [10:09] sigh [10:10] Yes. [10:15] Does the reset script wait until the slave responds to HTTP? [10:25] How hard is readonly bazaar.launchpad.net? [10:25] Surely not that bad? [10:28] wgrant: we tested it on qastaging yesterday. it works with one small bug [10:28] wgrant: however, we're doing machine maintenance. [10:28] lifeless: What's the bug? It's not read-only? [10:29] https://bugs.launchpad.net/launchpad-code/+bug/676124 [10:29] if we weren't doing maintenance on that machine, we'd have tried keeping it up this time. [10:29] Ah. [10:29] Great. [10:46] adeuring: ping [10:51] Huh, codebrowse works? [10:52] LP seems to be r/w for me now [10:52] Ah, so it is. [10:54] morning jml [10:54] lifeless: hello [10:54] Indeed, morning jml. === danilo_ is now known as danilos [11:06] Could someone please ec2 https://code.launchpad.net/~wgrant/launchpad/bug-654372-optimise-domination/+merge/40854? [11:07] wgrant: on it [11:07] jml: Thanks. [11:10] bigjools: re. bug #676262, I suspect they were both ABORTING (since abort() doesn't actually end up killing sbuild). That's a situation we ran into a few hours ago. [11:10] <_mup_> Bug #676262: launchpad lost track of a build [11:10] (with those same two builders) [11:11] Damn ppc :( [11:11] wow [11:12] I got a crazy error when doing ec2 land [11:12] http://paste.ubuntu.com/533423/ === mthaddon changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 (but closed during the roll-out)| firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting [11:12] henninge, https://pastebin.canonical.com/39840/ [11:14] lifeless: pong (sorry, did not look at the IRC windows after returning from the kitchen...) [11:15] adeuring: hey [11:15] adeuring: remember how in APIs and restricted files we hard coded handing out the internal url ? [11:15] lifeless: not exactly... let me check again [11:15] adeuring: the token based librarian is deployed now [11:15] lifeless: https://bugs.launchpad.net/launchpad-code/+bug/554206 might be relevant to some stuff you are doing [11:15] <_mup_> Bug #554206: Need a read-only version of bazaar.launchpad.net for codehosting and codebrowse [11:16] lifeless: I remember that firewall settings in the DC needed some teaking [11:16] ...tweaking...
[11:17] right [11:20] Why is [ui=none] in every commit message? Can't it just be omitted? [11:20] lifeless: mizuho needed access to private Librarian files, and that machine "saw" a librarian URL having a host name with an "internal" domain part [11:22] wgrant: the [ui=foo] field was added as a way of strongly encouraging UI reviews for any UI change [11:22] wgrant: a huge number of changes do not affect the UI [11:22] wgrant: and I suspect that many people skip UI reviews [11:22] adeuring: yes [11:22] jml: Is it more than 1% of commits that have ui=somethingelse? [11:23] wgrant: you can run log & grep as easily as I [11:23] adeuring: but its not needed now [11:23] lifeless: did I? seems that I need a memory refresh.... looking now [11:24] jml: True. [11:26] adeuring: you did :) [11:28] adeuring: rev 11506 [11:30] henninge: now that the rollout is done, can we fix canonical/launchpad/interfaces/__init__? [11:30] jml: oh. ... [11:30] lifeless: thanks! so, time to fix bug 629804 [11:30] <_mup_> Bug #629804: implement access to private Librarian files for launchpadlib clients [11:34] jml: well, it's still on the list to do post-rollout but you can prepare a branch. By the time it gets deployed from stable, that should all be done ;-) [11:34] jml: "it" is "fixing +inbound-email-config.zcml" [11:34] ;-) === matsubara-afk is now known as matsubara [11:35] henninge: ok. will do. [11:35] maxb, misclicked [11:36] jml: just check again before marking the revision as deployable. [11:36] henninge: *nod*. do you recall the bug number? [11:36] I am not sure it had a bug. [11:37] jml: nm, it's fixed. ;-) [11:37] so I guess you can just submit it [no-qa] [11:38] henninge: will do. ta. [11:38] which is true because we already know it works on qa/staging ... ;-) [11:40] aeoueoia [11:40] lp-land has a bad token, but I don't know where to find it [11:43] adeuring: I've unduplicated it [11:43] how do I work around this problem? http://paste.ubuntu.com/533431/ [11:43] ok [11:43] lifeless: I'll do it once I've finished my current work [11:44] ...i mean; I'll fix the bug... [11:44] adeuring: do you have an estimate for when that will be? [11:44] adeuring: if its going to be not-immediate, I might just do it [11:44] s/not-immediate/not-today [11:44] lifeless: i think I can probably start tomorrow [11:44] lifeless: you beat me ;) [11:45] problem is that I am quite slow with context switches... [11:45] adeuring: I'll drop you a mail to let you know if I get to it or not. [11:45] lifeless: coool [11:48] wgrant: your branch is being tested in ec2: http://ec2-50-16-92-112.compute-1.amazonaws.com/ [11:49] jml: I can't see that, but thanks! [11:49] it'd be kind of neat to add a phone-home thing to the ec2 script so we could have a page showing what's being built (as well as test results) [11:57] Morning, all. [11:59] morning deryck [12:01] bigjools: I added something to the derived distributions LEP about opening vs initialization; do you need anything more? [12:01] jml: inspiration [12:01] thanks :) [12:02] bigjools: np. [12:03] bigjools: also, I notice that https://launchpad.net/launchpad-project/+bugs?field.tag=buildd-scalability has no bugs. [12:03] it should do [12:03] I tagged loads [12:06] jml: ah it's because they've all been released [12:06] bigjools: nice.
[12:06] jml: https://bugs.launchpad.net/soyuz/+bugs?field.searchtext=&orderby=-importance&search=Search&field.status%3Alist=NEW&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=FIXRELEASED&assignee_option=any&field.assignee=&field.bug_reporter=&field. [12:06] bug_supervisor=&field.bug_commenter=&field.subscriber=&field.tag=buildd-scalability&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on [12:06] aiieee sorry [12:07] bigjools: looking at the LEP and based on random IRC sampling, I'm guessing we're still missing "When a builder becomes free, we must dispatch a queued build to it within a maximum of 30 seconds.", "Design for a system with 200 builders" and "Not starve low-scored builds when there are higher-scored builds in the queue" [12:07] Having trouble following https://dev.launchpad.net/LaunchpadPpa. debsign -S fails with 'debsign: Can't find or can't read changes file !' [12:07] jml: missing from where? [12:08] bigjools: what I mean is, have we met those requirements? [12:08] jml: I need to have a call with you about that [12:08] bigjools: ah, ok :) [12:08] :) [12:08] but later [12:09] I am up to my neck in buildd-manager issues [12:09] right after a dispatch of 10 or more recipes, there's nothing in the log for 4 minutes [12:09] which is somewhat suspicious [12:09] yeah, later is good [12:10] The queue isn't just empty? [12:10] no, it's the gap between "startBuild" and the "RESULT" stuff [12:10] This is why I wanted better logging :( [12:10] in fact the latter never appears [12:11] yes we all want better logging [12:11] Ah. [12:11] but one thing at a time [12:11] That's very interesting indeed. [12:11] Shouldn't bzr builddeb actually create a .deb? [12:11] stub: You have to go back to the parent directory or ../result where the changes file was added. [12:12] stub: By default it creates binary packages (.deb's), with -S it creates a source package. [12:12] But where? [12:12] wgrant: something is blocking too long when it's dispatching a recipe build [12:12] stub: In the parent directory or ../result [12:12] jelmer: I don't have a ../result and nothing new in the parent directory [12:12] bigjools: After the "Initiating build foo on bar"? [12:13] stub: you can specify a directory manually with --result-dir [12:13] wgrant: in Builder.startBuild() it logs the build start (behavior.logStartBuild) [12:13] then there's nothing logged until it fails [12:14] at that point, there's a few things that could have gone wrong but the lack of logging means it's hard to tell [12:15] jelmer: Garh. They were in my branch, not my checkout of the branch [12:15] jelmer: Guess that would be a bug... [12:16] stub: yeah, that seems a bit strange [12:16] bigjools: So we don't even know if it made it into resume_done? [12:16] I suspect it has, that's the most reliable part of the process [12:16] True. [12:16] my suspicions lie in the file dispatching and initiation [12:18] we don't know [12:18] there's no info level logging [12:19] got_cache_file logs fairly obviously. [12:19] Ohh, crap. [12:19] True. [12:19] deryck: there are a couple of LEPs about bug duplication...
[12:19] * bigjools is changing some debug to info [12:19] deryck: one's in drafting (https://dev.launchpad.net/LEP/DisableFilebugDuplicateSearchOption) and the other (https://dev.launchpad.net/LEP/ACLMarkAsDuplicate) isn't on the LEP page [12:20] wgrant: We Can Haz Runtime Log Changing Please [12:20] lifeless: debug 4 eva [12:20] <_mup_> Bug #4: Importing finished po doesn't change progressbar [12:20] Ahem. [12:20] rotfl [12:20] ok foods [12:21] lifeless: I guess there's https://bugs.edge.launchpad.net/soyuz/+bug/667958 [12:21] <_mup_> Bug #667958: Web diagnostic tool for build manager [12:21] but that's not quite the same thing [12:23] dynamically changeable log levels is totally essential for decent production debugging [12:25] bigjools: Is there anything in the current debug level that isn't interesting, except for the hundreds of "Scanning foo" messages? [12:26] Given the frequency and obscurity of issues, it'd be nice to keep as much data as possible... [12:26] the problem is that I don't want the log swamped [12:26] it makes it harder to notice issues [12:27] so I am trying to carefully select important messages for the info logging [12:27] but hindsight is awesome [12:27] Heh. [12:28] Hi jml. Yeah, the first should be done. And the second was meant to sketch out the idea and go back to marjo et al and get feedback.... [12:28] jml, remember, we talked about this and said, let's do what everyone agrees on and is easy first, and get consensus on if the second is even required. [12:29] unfortunately, I didn't ping anyone about the second yet. I'll do that today. [12:29] deryck: ahh right. I forgot to refactor that new knowledge into the LEP page :) [12:30] deryck: so I'll bump the first LEP to the Deployed section? [12:30] jml, in progress. I think I assumed approval and moved ahead. [12:30] jml, sorry to assume ;) [12:30] deryck: no, that's all good :) [12:30] thanks! [12:33] jml: gary has a variant of the LEP template with stuff specific to his team; I've suggested you might be amenable to folding those into the main template [12:34] lifeless: sure, I'll have a look [12:34] lifeless: if someone points me at a thing :) [12:35] sure [12:35] * jml is also thinking (again!) about tracking LEPs at blueprints.launchpad.net/launchpad [12:35] dunno when he'll do that [12:35] jml: lets fix it first. [12:35] jml: -please- [12:36] lifeless: I reckon I could do a useful muck-around experiment that wouldn't affect anyone other than me. [12:37] would it be a good use of your time? [12:37] also, can we chat about reset (voice) ? [12:37] lifeless: sure. gimme a couple of minutes to put my phones back together [12:39] Is the "builders are being disabled" topic comment in #launchpad still valid after the rollout? [12:39] yes [12:39] lifeless: and yes, it would be a good use of my time. [12:39] hmm, didn't mean that to be snarky. Sorry [12:41] lifeless: it wasn't at all snarky. I was going to elaborate but got distracted by yet another networking problem. [12:45] jml: you remember how we added timeouts to the async xmlrpc by cancelling the Deferred? [12:45] bigjools: yes [12:45] jml: in those cases we get a CancelledError, but I am seeing hundreds of " User timeout caused connection failure." [12:45] what causes those? [12:46] it's a TimeoutError, sorry. 
I can't fathom how that would happen before the cancel === salgado is now known as salgado-physio [12:48] huh actually - that's the 30 second connection issue [12:49] which is much lower than our configured value for everything else [13:20] jml: I'm tempted to inherit from Proxy and override stuff [13:22] bigjools: yeah. I can't think of anything better right now. You ought to file a ticket and submit a patch too. [13:22] jml: there's already a ticket, but the fix needs to go in quite a few places I think [13:22] I'll file another anyway [13:22] right - I need vittles [13:23] bigjools: yeah, a specific ticket for xmlrpc.py would be great. thanks. [13:23] nod === mrevell is now known as mrevell-lunch [13:41] maxb: hey [13:42] hi [13:42] maxb: what do you think of us having a custom python build - with http://bugs.python.org/issue10440 applied === Ursinha-dinner is now known as Ursinha [13:43] If it really is just an integer constant, why do we need to modify python for that? [13:43] Instead of just defining the value locally [13:44] it can be different in different libcs, by definition. [13:44] we can hardcode '1' as the constant, but its less portable and thus a bit ugly. [13:45] Well, it's a tiny patch, so it's hardly much effort to roll a modified package. The question then is the ongoing maintenance effort and how long it would be needed for [13:46] yeah [13:46] I'd be tempted to consider putting the constant in a tiny module of its own, to avoid needing to rebuild every time there's an Ubuntu update out [13:47] Also, given Launchpad only targets Ubuntu, and a fairly narrow range of distroseries, even the non-portable solution is probably viable [13:47] true on both counts [13:47] will mull on it === henninge changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 | firefighting: buildd-manager is disabling things again | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting === salgado-physio is now known as salgado [14:10] lifeless: can you think of a way of creating a tcp endpoint that doesn't reply in a twisted test? I need to test a timeout and winding the reactor forwards is no good if the tcp connects or refuses to connect immediately [14:10] sure [14:10] bind, listen, but don't accept [14:10] in real life I'd suspend a process but that's not ideal in a test [14:11] Actually, that might not work. But its worth a go [14:11] I suspect it would get connection refused wouldn't it? [14:11] hmmm [14:11] no [14:12] accept is what takes a queued connection and gives you the new fd for it [14:12] alternatively iptables + -j DROP [14:12] ah right [14:12] (although that requires root) [14:13] not ideal for LP's test suite [14:14] sure, was just giving it as an option as a one off === mrevell-lunch is now known as mrevell [14:32] elmo: how evil is it to try and connect to something like 10.255.255.1 ?
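Both halves of the timeout-testing problem above have small answers. lifeless's bind-listen-never-accept trick yields an endpoint whose TCP connect succeeds (the kernel completes handshakes up to the listen backlog) but which never sends a byte, so it exercises response timeouts rather than connect timeouts — the caveat behind his "that might not work". Paired with the Deferred-cancellation timeout jml and bigjools discussed earlier, a sketch with illustrative names:

    import socket
    from twisted.internet import reactor

    def silent_endpoint():
        """A local port that accepts connections but never replies."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind(('127.0.0.1', 0))
        sock.listen(1)  # never accept(); keep sock referenced in the test
        return sock, sock.getsockname()[1]

    def add_timeout(d, seconds):
        """Cancel `d` (raising CancelledError) if it hasn't fired in time."""
        delayed = reactor.callLater(seconds, d.cancel)
        def stop_timer(result):
            if delayed.active():
                delayed.cancel()
            return result
        return d.addBoth(stop_timer)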
[14:52] bigjools: evil; some machines it will error immediately ;) [14:53] lifeless: grar [14:53] bigjools: because someone, somewhere has that ip [14:54] bigjools: or routers that will see it and REJECT [14:54] it doesn't get past my own router [14:54] oh well it'll do as a stub for now === matsubara is now known as matsubara-lunch [14:59] Reviewers Meeting starting at top of the hour: abentley, adeuring, allenap, bac, danilo, sinzui, deryck, EdwinGrubbs, flacoste, gary, gmb, henninge, jelmer, jtv, bigjools, leonardr, mars, salgado, jcsackett, benji [14:59] thanks bac [14:59] bac: apologies from me [15:01] np flacoste === matsubara-lunch is now known as matsubara [16:25] what's this? [16:25] http://paste.ubuntu.com/533506/ [16:26] No handlers could be found for logger "librarian" [16:26] henninge: you already have a librarian running [16:27] seriously? didn't know that ... [16:27] kill it and the pid file and /var/tmp/fatsam.test [16:28] what's the process called? [16:28] it's a twistd [16:28] ps ax | grep libra returns nothing [16:28] ps ax | grep twist - nada [16:28] :-( [16:29] ummm then I dunno, I've only ever seen that when there's another librarian hanging around [16:29] thanks [16:32] why is the librarian logging in a +0530 time zone anyway??? [16:33] India? [16:38] henninge: there are no Canonical LP developers in that tz [16:39] outsourcing? [16:40] henninge: we set the TZ there to avoid accidental TZ assumptions [16:40] henninge: or something [16:40] ;-) [16:40] but do you have an idea why the librarian layer might be failing? [16:43] henninge: rm /var/tmp/fatsam.test/librarian.pid [16:44] already done. twice ;) [16:44] ps fux | grep twistd [16:44] ? [16:44] oh [16:45] netstat -n | grep 58085 [16:45] nothing [16:45] or something like that [16:45] is the second upload port thats barfing [16:49] maybe I should mention that this is not devel ? It's the recife branch [16:50] but the test worked yesterday [16:52] sinzui: btw your script to close bugs is closing bugs that shouldn't be closed - because of RFWTAD [16:52] a second run always gives me "TacException: Could not kill stale process /var/tmp/fatsam.test/librarian.pid. [16:52] so I remove that dir and try again. [16:53] nothing changed overnight [16:53] I think you have another process using the port [16:53] thus the netstat - check lazr-schema / the test schema to see what port it will be using [16:53] lifeless, they were fix committed in 10.11, but were not intended to be released? [16:53] bigjools: did you get to the bottom of the problem? [16:54] sinzui: no, our process assigns bugs to milestones *before* they are fixed, not *after* [16:54] lifeless, are these really 10.12 bugs [16:54] sinzui: they are 'some work done, but not finished' [16:54] sinzui: things like: [16:54] thumper: I * think* so - I think it's slow builders that don't respond to connection requests within Twisted's 30 second default timeout. The recipe builds hammer the builders. [16:54] - landed code but it didn't fix it [16:54] - needs a cronscript enabled via an RT ticket [16:55] bigjools: so why does it take down all types of builders then? [16:55] thumper: thanks for doing the incident report [16:55] thumper: I don't know, it might be a coincidence. [16:56] who is looking at the 'report a bug' feature not working ? [16:56] * thumper doesn't believe in coincidence [16:56] I am putting in a fix that increases the connection timeout - copy & paste from Twisted FTW :/ [16:56] lifeless, I think that is a bug.
The engineer should know when he intends to release. Auto-assigning is convenient, but it does not exempt the person from correcting the milestone when he knows it will not be released with the milestone. eg we knew this when PQM was frozen [16:56] thumper: I've seen slow builders doing exactly that for a while now - it's just that we never disabled them before this release. [16:56] sinzui: sure, I'm not blaming the script or you :) - getting info on how to address - what policies we need to change [16:57] lifeless, I can add a sanity check (qa-ok in tags) [16:57] sinzui: I think thats an excellent idea [16:58] sinzui: also I'm closing most bugs - those that are linked from revs - when we do incremental deploys [16:58] I have to go eat or miss out, bbiab [16:58] lifeless, i will have script for you by the end of my lunch [17:00] jml: I guess you're not near your PC then === jam1 is now known as jam === benji is now known as benji-lunch [17:36] leonardr: around? [17:36] dobey: yes [17:37] leonardr: http://pastebin.ubuntu.com/533530/ <- am getting this as a result of a getMembersByStatus() on a team with status=u'Administrator' [17:38] leonardr: any idea why that would be? [17:38] dobey, what is the code in allowedcontributors.py? [17:39] deryck: ping [17:39] hi lifeless. on tl call [17:39] leonardr: http://bazaar.launchpad.net/~rockstar/tarmac/main/annotate/head%3A/tarmac/plugins/allowedcontributors.py#L62 [17:39] deryck: are you aware that bug filing is reportedly broken ? [17:40] lifeless, no. how so? [17:40] deryck: two independent reports [17:40] 1) apport user filed a bug in launchpad [17:40] 2) james hunt mailed tom who forwarded it in the lp rollout thread [17:41] dobey: so the 'approved' one succeeds but the 'administrator' one fails? [17:41] leonardr: that appears to be the case, yes [17:42] lifeless, I believe allenap is looking into that. [17:42] I'll follow up after tl call to make sure, and cover if not [17:42] cool [17:42] leonardr: and unfortunately i have to call it twice, because i can't do status=[u'Approved', u'Administrator']; like i can do with other similar get APIs, but i guess that wouldn't fix this specific problem either :) [17:47] dobey: i have no clue why it should work once and then fail. just for fun, you might try assigning launchpad.people[team] to a variable [17:47] so you're not using it twice [17:47] and if that doesn't work, try assigning to a variable and then printing out its name before invoking those named operations [17:48] i'm just seeing if various known problems are in play here (in which case upgrading would help) [17:51] leonardr: what would i upgrade to exactly? [17:51] dobey: a later launchpadlib/lazr.restfulclient [17:52] leonardr: is there one newer than what is in 11.04 already? [17:53] dobey: there is, but the one in 11.04 should have the fix i'm thinking about already [17:54] ok [17:57] dobey: my only suggestion is to put a breakpoint in get_representation_definition and see what it does differently the first time vs. the second [18:00] leonardr: ok; i've changed it to assign the team to a variable and print the team twice as suggested; will see what happens next time that code gets hit [18:00] launchpad is being very slow today. :( [18:06] abentley, are there any issues with the new lp-serve happening right now? === benji-lunch is now known as benji === EdwinGrubbs is now known as Edwin-lunch [18:38] rockstar: the new forking lp-serve isn't enabled yet [18:39] thumper, oh, the bug was marked as Fix Released. :(
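leonardr's debugging suggestion to dobey, spelled out as a sketch — the team name is hypothetical, and the statuses are the ones the tarmac plugin linked above passes:

    from launchpadlib.launchpad import Launchpad

    lp = Launchpad.login_with('tarmac-debug', 'production')
    team = lp.people['some-team']  # bind once instead of indexing twice
    print repr(team)  # force the first retrieval before the named operations

    approved = team.getMembersByStatus(status='Approved')
    admins = team.getMembersByStatus(status='Administrator')  # the failing call
    print [person.name for person in admins]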
[18:41] sinzui, ping [18:41] rockstar: yes, I know. jam commented on it too saying as much [18:41] Ah, I hadn't seen the comment, just the status change. [18:52] rockstar: right, still trying to work through getting everything qa'd, etc. It isn't considered a qa blocker because it is disabled in production [18:52] I'm noticing that my download-cache has grown to about 500MB, anyone know what files I can nuke? [18:52] I'd like to think that I don't need 12 versions of "zope.testing-*" [18:53] jam, basically, you can nuke any files that aren't in versions.cfg [18:53] rockstar: which is in the lp root? [18:53] jam, yes [18:54] well, that isn't particularly fun to cross-reference... [18:54] thumper, urbanape just pointed out to me that when diff is too big, it says "Truncated for viewing." That's wrong, because if it was really for viewing, it wouldn't be truncated... === deryck is now known as deryck[lunch] [19:02] rockstar: so why is download-cache a bzr branch that is versioning all of these tarballs? seems odd to me [19:02] especially given that it is storing all old versions together in the same working tree [19:02] jam, I am not the one to ask about that, but I *think* it was supposed to be a temporary solution we concocted two years ago. [19:02] (for example, it contains 20 bzr tarballs) [19:05] the .bzr/repository is actually bigger than the launchpad repo at this point [19:06] jam: you do not need to convince us. We know it's wonky. [19:09] another quick question. Anyone know how lp-production-configs are placed at runtime so I can simulate a runtime environment locally? [19:10] (how does the launchpad codebase find the values in lp-production-configs) [19:11] it's put in the configs directory in the root I think [19:11] and then LPCONFIG=configname [19:13] rockstar: in an email you mean/ [19:13] ? [19:13] rockstar: I thought it just said that on the page itself [19:13] rockstar: and in that case you are viewing it and it is truncated [19:15] thumper, in the view, you are viewing it, and it is truncated, but it's not truncated FOR viewing. It's truncated FROM viewing. :) [19:16] maxb: so, python 3.2 will have my patch :) [19:16] rockstar: it is truncated to allow you to view it otherwise it times out :-) [19:17] lifeless: And when are we migrating LP to Python 3? :-) [19:17] I'd not approve a textual change to "truncated from viewing" as it doesn't make grammatical sense [19:17] thumper, yeah, it was pedantry from the start. [19:17] * thumper closes laptop to go and buy a 3g stick [19:17] rockstar: well we do work for pedantical :) [19:18] thumper, although the fact that it's truncated drastically reduces its usefulness. [19:18] rockstar: the download link still works [19:19] rockstar: the fact that it is over 5000 lines drastically reduces its usefulness :) [19:21] hi mars [19:23] thumper, this is true as well. [19:36] lifeless: I know about LPCONFIG=xxxx, but how is the "qastaging.conf" file found? [19:36] is it just copied into the launchpad source tree? [19:37] or is schema-lazr.conf (the symlink) pointed to something else, or? [19:40] jam, it's symlinked. [19:40] rockstar: to what file? [19:40] jam, it's a file from lp-production-configs. [19:40] rockstar: so they explicitly point schema-lazr.conf to schema-qastaging.conf for example? [19:40] If so, why do you also need LPCONFIG=qastaging?
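jam's cross-referencing complaint is scriptable. A throwaway sketch, assuming versions.cfg pins packages with buildout-style 'name = version' lines and that the cached files are named name-version.ext; it only lists candidates and deletes nothing:

    import os
    import re

    def pinned(versions_cfg):
        # Collect 'name-version' stems from 'name = version' lines.
        stems = set()
        for line in open(versions_cfg):
            match = re.match(r'\s*([A-Za-z0-9_.\-]+)\s*=\s*(\S+)\s*$', line)
            if match:
                stems.add(('%s-%s' % match.groups()).lower())
        return stems

    def stale(dist_dir, stems):
        # Yield cached files whose stem is not pinned in versions.cfg.
        for name in sorted(os.listdir(dist_dir)):
            stem = re.sub(r'\.(tar\.gz|tar\.bz2|tgz|zip|egg)$', '', name)
            if stem.lower() not in stems:
                yield name

    for name in stale('download-cache/dist', pinned('versions.cfg')):
        print name  # review the list before nuking anything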
=== Edwin-lunch is now known as EdwinGrubbs [19:47] morning === deryck[lunch] is now known as deryck [19:52] morning mwhudson [20:14] jam: qastaging says 'the qastaging' dir which has a launchpad-lazr.conf file [20:15] lifeless: sure, but there are 4 schema-XXX.conf files [20:15] and no "schema-lazr.conf" or "schema-launchpad.conf", etc in the top of the dir [20:15] anyway, I'm getting my problem solved without using it yet [20:15] but still, I don't know yet how to set up something that resembles production [20:16] jam: schema-xxx is irrelevant [20:16] lifeless: so you still haven't answered how launchpad finds lp-production-configs/*.conf then [20:17] I think it's [20:17] rm configs [20:17] mv lp-production-configs configs [20:17] IMBW [20:17] losa can tell you though - ask chex [20:17] k [20:20] lifeless: any idea of a 'clean' way to invoke the bzr that is packaged with the launchpad tree? or should we just be invoking /usr/bin/bzr? [20:21] (IOW, how are the dependencies found in production) [20:21] `pwd`/eggs/bzr-2.2.0-py2.6-linux-i686.egg/EGG-INFO/scripts/bzr is obviously not a long-term solution [20:21] or Chex ^^ [20:24] um [20:25] i think launchpad looks for lp-production-configs/$LPCONFIG/launchpad-lazr.conf then for configs/$LPCONFIG/launchpad-lazr.conf [20:26] the other config files get brought in by extends: ../foo.conf in those config files [20:26] mwhudson: so it is just 'lp-production-configs' in a generic sibling dir? [20:26] jam: pretty sure, let me look at some code [20:27] mwhudson: doing that, I get "Can't find qastaging in ..." [20:27] in a traceback [20:28] jam: "production-configs", not lp-production-configs [20:28] my mistake [20:28] mwhudson: confirmed that it works [20:28] (via symlink at least) [20:29] cool [20:30] mwhudson: and ./production-configs is also in .bzrignore [20:30] heh heh === salgado is now known as salgado-afk [20:40] losa ping. I don't know if you have time, but mthaddon was looking at rt#42199 last night, and I think I've responded to what he needed. I don't know whether that means there is a hand-off or whether it is just going to wait for him to get back. [20:40] <_mup_> Bug #42199: evolution causes gpg stale locks [20:45] jam: not a sibling dir, child dir [20:45] lifeless: nope, at the root "launchpad/configs launchpad/production-configs" [20:46] at least, that worked for me [20:46] and that is what is in .bzrignore [20:46] kk [20:57] Were we in testfix overnight? === Ursinha is now known as Ursinha-bbk === Ursinha-bbk is now known as Ursinha-bbl [21:02] Hello Everyone [21:03] abentley: thumper: now? [21:03] lifeless: https://bazaar.launchpad.net/~launchpad-pqm/launchpad/production-stable/revision/9000 [21:03] wallyworld: sure. [21:04] wallyworld: just here [21:05] abentley: %@$!!#$ audio died again. [21:05] brb [21:06] I have a quick question about the Launchpad source [21:06] When running make schema is this part of a normal output? Unknown entry URL: ScalarValue Unknown entry URL: archive_dependency Unknown entry URL: archive_subscriber Unknown entry URL: binary_package_release_download_count Unknown entry URL: branch_merge_queue Unknown entry URL: branch_subscription Unknown e [21:07] That's normal. [21:07] Okay, thanks wgrant === matsubara is now known as matsubara-afk [21:10] Wgrant: is this a typical end output: make[1]: Leaving directory `/home/weather15/launchpad/lp-branches/devel/database/schema' rm -f -r /var/tmp/fatsam [21:10] weather15: Yes.
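Pulling together mwhudson's description and correction above: production-configs/ is consulted before configs/, and both sit directly under the tree root. A sketch of that lookup order, illustrative rather than the actual config machinery:

    import os

    def find_lazr_config(lpconfig, tree_root='.'):
        # Search order per the discussion above: production-configs/
        # first, then configs/; both are children of the tree root.
        for base in ('production-configs', 'configs'):
            candidate = os.path.join(
                tree_root, base, lpconfig, 'launchpad-lazr.conf')
            if os.path.exists(candidate):
                return candidate
        raise RuntimeError("Can't find %s in %s" % (lpconfig, tree_root))

    # Mirrors running with LPCONFIG=qastaging:
    print find_lazr_config(os.environ.get('LPCONFIG', 'development'))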
[21:10] wgrant: Thanks [21:11] wgrant: I'm running Ubuntu Server [21:11] In this case how can I access Launchpad.dev? [21:11] SSH Tunnel? [21:12] or are there Apache settings to change? [21:12] weather15: Have a look at https://dev.launchpad.net/Running/RemoteAccess [21:13] Also should I follow these instructions? 2010-11-17T16:11:40 WARNING root Developer mode is enabled: this is a security risk and should NOT be enabled on production servers. Developer mode can be turned off in etc/zope.conf [21:13] I plan on going into production [21:14] Running a production Launchpad instance is not a simple task. [21:14] wgrant: Do I need to have more than 1 IP? [21:15] weather15: Only if you want to be able to browse private branches. [21:15] Okay I do [21:15] two IPs on my local net or on the Internet? [21:16] Wherever you want it to be accessible from. [21:17] okay [21:17] I guess if I run it on my local net then I will have all public repos [21:17] Then I only need 1 IP [21:18] weather15: OOI, which LP applications do you intend to use in production? [21:19] Pretty much all [21:19] Interesting, I'd only imagined people using bugs & code in a local setting [21:20] That's most likely what will happen but I'm not sure yet [21:20] I'm focused on getting it working now [21:20] You know about the whole image licence pain, right? [21:21] no [21:22] https://dev.launchpad.net/LaunchpadLicense [21:22] especially the 4th paragraph [21:23] "The image and icon files in Launchpad are copyright Canonical, but unlike the source code they are not licensed under the AGPLv3. Canonical grants you the right to use them for testing and development purposes only, but not to use them in production (commercially or non-commercially)." [21:23] That Part [21:24] That part. [21:29] I know about that [21:30] I was wondering how to change those images [21:33] rather 1 IP i guess Ubuntu is not getting IPs on my second interface [21:34] what do you do when launchpad.dev will not resolve on the network? [21:35] I guess because I have only 1 IP bazaar will not work [21:35] is this true? [21:41] It seems I have 2 IPs now [21:41] do you replace a.b.c.d here with your ip? [21:42] james_w, who's the best person to talk to about getting new versions of launchpadlib and friends included in natty? [21:42] leonardr, Luca probably [21:42] james_w: ok, makes sense, thanks [21:48] wgrant: so, in case I am asleep at 2330 (highly likely), I've put another cowboy on cesium to fix the buildd manager. [21:48] Anyone know the answer to my previous question? [21:49] It says "Or, if you did allocate a suitable second IP address: * Change the line to * Change the line to " [21:50] bigjools: Removing failure counting? [21:50] is this what I should use or replace a.b.c.d with the IP on my second NIC [21:50] weather15: The latter. [21:50] bigjools: Do we also have more logging now? [21:51] wgrant: some more yes [21:51] wgrant: with my IP correct? [21:51] weather15: Yes. [21:51] Thanks [21:51] bigjools: Well, I guess we'll see how it goes! [21:51] bigjools: Did you and jml work anything out? [21:52] wgrant: I didn't! [21:52] wgrant: default connection timeout on twisted xmlrpc is 30 seconds, I've made it use socket_timeout instead [21:53] I am seeing some builders *still* failing with that though [21:53] bigjools: Hmm. I don't think that really explains everything, but it might fix the resume thing. [21:53] not just resume, all xmlrpc requests [21:53] jml: Also, why did PQM eat my branch? [21:53] wgrant: I don't know. I didn't see that it got eaten
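A sketch of the change bigjools describes: recent Twisted releases let the caller override the 30-second default by passing connectTimeout to twisted.web.xmlrpc.Proxy (older releases effectively hard-wired it, hence the copy-and-paste fix). The URL, method name, and timeout value below are illustrative, not the actual buildd-manager patch:

    from twisted.internet import reactor
    from twisted.web.xmlrpc import Proxy

    SOCKET_TIMEOUT = 120  # stand-in for the config value the real fix reads

    # Hypothetical slave endpoint.
    proxy = Proxy('http://builder.example.com:8221/rpc',
                  connectTimeout=SOCKET_TIMEOUT)

    def ok(result):
        print 'slave status:', result
        reactor.stop()

    def failed(failure):
        print 'call failed:', failure.getErrorMessage()
        reactor.stop()

    proxy.callRemote('status').addCallbacks(ok, failed)
    reactor.run()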
[21:53] and no it does not explain everything [21:53] but it's a start [21:54] jml: It said it submitted, but then nothing :/ [21:54] bigjools: Yeah, I guess. [21:55] wgrant: I don't know. I won't be able to get around to looking into it tonight – sorry. [21:55] wgrant: maybe you can convince someone else to land it. the tests all pass. if not, I'll do it first thing tomorrow [21:55] a builder taking 30 seconds to accept a connection seems pretty crazy too [21:55] jml: Sure, no rush. [21:56] is the listen queue overflowing on the slave side or something? [21:56] i guess that's pretty hard to tell [21:56] mwhudson: The builder is an archaic Twisted mess gluing together shoddy shell scripts. [21:56] It's allowed to be crazy, I think. [21:56] meh [21:56] we allow it to be crazy [21:56] wgrant: even so [21:57] wgrant: is the builder one of these half twisted things that does blocking operations in the reactor thread? [21:57] mwhudson: Sometimes. [21:57] the build manager is [21:57] but not for long [21:58] mwhudson: I think >30 seconds happens when the slave manager was swapped out under load or something [21:58] oh right [21:58] that's my guess.... [21:58] bigjools: db queries are blocking calls [21:58] bigjools: Doesn't explain all the non-virt failures :( [21:58] wgrant: it might, actually [21:59] jml: true, very true. [21:59] bigjools: How? Unless buildd-manager leaks exceptions across multiple builders, I don't see how... [21:59] for the allow for [21:59] wgrant: if the previous build went into swap ... [22:00] would this work for 10.0.0.1? 10.0.0. or 10.0.0? [22:00] on the same builder, I mean [22:00] ... but you'd still need to fill up the listen queue, right? connecting to a listening socket doesn't involve the userspace process doing the listening iiuc [22:00] For the Allow from [22:00] weather15: That's just normal Apache configuration. [22:00] mwhudson: Hmmm? It needs to call accept(), right? [22:00] Yes but I need to set the sllow from [22:00] *allow [22:00] mwhudson: I don't know [22:00] some people have said that it needs to accept() [22:01] would 10.0.0 work or would I have to use 10.0.0. to allow my local network? [22:01] on 10.0.0.x [22:01] it's been a while since I did socket stuff [22:01] i can't remember either [22:01] weather15: I suggest you ask Apache questions in the right channel [22:02] you will almost certainly get a more knowledgeable answer [22:03] bigjools: Hmmm. I see that palmer had been aborted 10 minutes before the failure. So it was probably still building. So that's plausible. [22:03] looks to me like Allow from 10.0.0.0/255.255.255.0 will work [22:03] science suggests that i am right about accept [22:03] science rocks [22:03] Although the fact that it timed out at the same time as the rest is a bit suspicious, perhaps buildd-manager was blocking for the preceding couple of minutes. Insufficient logging :/ [22:04] yeah, impossible to tell [22:04] although if it was slow with the DB ... [22:04] siiigh [22:05] i guess you can turn on statement tracing in buildd-manager [22:06] log armageddon! [22:06] yeah [22:07] bigjools: Ah, this is why we needed to clean out accepted... so we can have hundreds of gigabytes of logs! [22:07] more realistically, you can probably have a tracer log any statement that takes longer than say 5 s [22:07] not sure that will help if there's a cumulative effect of 10*1s for example
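The accept() question is easy to test, which is presumably the 'science' above: on Linux the kernel completes TCP handshakes up to roughly the listen backlog even if the server never calls accept(); only once that queue fills do further connects hang. A sketch (exact backlog behaviour varies by OS):

    import socket

    # A listener that never calls accept(), with a tiny backlog.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(('127.0.0.1', 0))
    server.listen(2)
    addr = server.getsockname()

    clients = []
    for i in range(6):
        c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        c.settimeout(3)
        try:
            c.connect(addr)
            print 'client %d: handshake completed without accept()' % i
        except socket.timeout:
            print 'client %d: backlog full, connect timed out' % i
        clients.append(c)

So a swapped-out slave manager would only cause connect timeouts once its accept queue had already filled.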
[22:08] ... or collect aggregate stats, min, max, mean, stddev kind of thing [22:08] If it happens again today, I think we should run with full logging tomorrow. [22:08] I am too tired to think straight now [22:08] fair enough :-) [22:08] we are full logging now, except the madness of statement tracing [22:09] Even the 'Scanning foo' messages? [22:09] And the extra logging in failBuilder that was cowboyed in earlier? [22:09] everything [22:09] Great. [22:09] not that one - because we're not currently failing builders [22:09] Ah, heh. [22:09] assessFailureCounts is commented out [22:10] so it will report on the counts but never do anything about it [22:10] Perfect. [22:10] I need to split the failure count stuff in two though [22:10] 1 set for dispatch attempts and 1 set for contact attempts [22:18] Okay, my Launchpad install can be accessed, with one problem [22:19] What do you do about this error? Error code: ssl_error_rx_record_too_long [22:19] SSL received a record that exceeded the maximum permissible length. [22:19] Your Apache configuration is broken. It's probably serving normal HTTP on 443. [22:20] Okay I'll check it again [22:20] Any idea as to where to look? [22:22] I don't see anything wrong with it [22:22] is there something wrong with the keys? [22:22] No. [22:22] Have you tried restarting Apache? [22:22] wallyworld: I've pulled your branch and am looking at it... [22:24] I have a problem [22:24] phwoar [22:24] my Apache config no longer exists [22:24] food helps [22:24] what do you do in this case? [22:29] wallyworld: found it [22:30] I wish we had different root objects for each virtual domain [22:41] bigjools: did you file a patch upstream for the xmlrpc timeout thingy? [22:43] thumper: just finished breakfast. what was it? [22:44] wallyworld: I told you wrong, the canonical_url of IBazaarApplication is http://code.launchpad.dev/+code [22:44] wallyworld: so... we should hang off ILaunchpadRoot [22:44] or whatever it is [22:44] thumper: ah ok. i saw some other stuff hanging off that and was wondering..... [22:45] i'll fix it [22:45] thanks [22:47] wallyworld: also, the location of the link on the code homepage needs to be fixed [22:47] thumper: where would you like me to stick it? :-) [22:48] is this good or bad? WARNING Bad object name 'public.todrop_branchmergerobot' 2010-11-17 22:48:01 WARNING No permissions specified for [u'public.lp_openididentifier'] * Disabling autovacuum [22:48] wallyworld: I think we should have some nice text below the import text mentioning recipes [22:48] wallyworld: they are going to be one of our prime features [22:48] wallyworld: let's mock something up and get it to mrevell to check [22:49] thumper: also, i added a 30 day window to the query. not sure if we want that or not or make it user selectable [22:49] wallyworld: that may be fine for now [22:49] we may want to give the users an option [22:49] later [22:49] thumper: ack the mockup. the initial intent was just to get something working :-) [22:49] jml: no, only a bug so far [22:49] wallyworld: yeah, understand that [22:50] thumper: +1 on the option. i was going to have a selection on the listing page itself, like we do for branch listings [22:50] bigjools: ok. [22:50] jml: I'll do one tomorrow
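mwhudson's tracer idea, sketched standalone: time every statement, log the ones over a threshold, and keep aggregate stats. This shows the shape of the measurement only; it is not Storm's actual tracer API:

    import math
    import time

    class TimingTracer(object):
        def __init__(self, slow_threshold=5.0):
            self.slow_threshold = slow_threshold
            self.times = []

        def run(self, statement, execute):
            # Wrap one statement execution, recording wall-clock time.
            start = time.time()
            try:
                return execute()
            finally:
                elapsed = time.time() - start
                self.times.append(elapsed)
                if elapsed > self.slow_threshold:
                    print 'SLOW (%.1fs): %s' % (elapsed, statement)

        def stats(self):
            # The min/max/mean/stddev aggregates suggested above.
            n = len(self.times)
            mean = sum(self.times) / n
            stddev = math.sqrt(
                sum((t - mean) ** 2 for t in self.times) / n)
            return {'min': min(self.times), 'max': max(self.times),
                    'mean': mean, 'stddev': stddev}

Aggregates would also surface the cumulative 10x1s case that a plain slow-statement threshold misses.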
[22:52] any idea what to do about this? Traceback (most recent call last): * Module zope.publisher.publish, line 134, in publish result = publication.callObject(request, obj) * Module canonical.launchpad.webapp.publication, line 483, in callObject return mapply(ob, request.getPositionalArguments(), request) * Module zope.publisher.publish, line 109, in mapply return debug_call(obj, args) __trace [22:52] bigjools: neat. [22:55] ?? [22:56] More output: File "/home/weather15/launchpad/lp-sourcedeps/eggs/zope.publisher-3.12.0-py2.6.egg/zope/publisher/publish.py", line 134, in publish result = publication.callObject(request, obj) File "/home/weather15/launchpad/lp-branches/devel/lib/canonical/launchpad/webapp/publication.py", line 483, in callObject return mapply(ob, request.getPositionalArguments(), request) File "/home/weather15/launchpad/lp-sourcede [22:58] jml: maybe someone will write a test if I attach a patch :) [22:58] ???? [22:58] good night [22:58] bigjools: g'night. [22:59] Night bigjools. [22:59] What's this mean? No such file or directory: '/var/tmp/mailman/data/master-qrunner.pid' Is qrunner even running? rm -f logs/thread*.request bin/run -r librarian,google-webservice,memcached -i development [22:59] mailman not running [22:59] causing this problem? [23:00] Traceback (most recent call last): * Module zope.publisher.publish, line 134, in publish result = publication.callObject(request, obj) * Module canonical.launchpad.webapp.publication, line 483, in callObject return mapply(ob, request.getPositionalArguments(), request) * Module zope.publisher.publish, line 109, in mapply return debug_call(obj, args) __traceback_info__: weather15, what command did you run to get that output? [23:01] mars: I went to the login page: https://launchpad.dev/+login [23:02] weather15, are you using 'make run' in the launchpad source tree? [23:02] yes [23:03] and it did not produce obvious errors about starting mailman? [23:03] no it did [23:03] here's the full output: make run utilities/shhh.py PYTHONPATH= python bootstrap.py\ --setup-source=ez_setup.py \ --download-base=download-cache/dist --eggs=eggs \ --version=1.5.1 mkdir -p /var/tmp/vostok-archive utilities/shhh.py make -C sourcecode build PYTHON=python \ LPCONFIG=development utilities/shhh.py LPCONFIG=development /home/weather15/launchpad/lp-branches/d [23:04] Apparently I can't paste it all [23:04] weather15, pastebin.ubuntu.com [23:06] http://pastebin.ubuntu.com/533640/ [23:07] thumper: just wondering aloud, to me it's bad that the tests passed (2 different page/view creation steps too) but the app failed to run in practice. agree? something to fix? [23:07] weather15, on line 28, that looks like an error when the server is first run - it tried to clean up a PID file that doesn't exist. I wouldn't worry about it. [23:07] wallyworld: the problem is that you weren't loading the page, and clicking on the link [23:07] wallyworld: we had page tests for things like that [23:08] weather15, what did you see when you tried launchpad.dev/+login ? [23:08] wallyworld: the unit tests were going directly to the page [23:08] wallyworld: so you never saw the actual url [23:08] mars: http://pastebin.ubuntu.com/533641/ [23:08] wallyworld: you could add a test that gets the browser for the page [23:08] wallyworld: and tests the browser.url [23:08] wallyworld: that would have caught it [23:09] thumper: ok. i assumed that calls like create_initialized_view(root, "+daily-builds", rootsite='code') would use the same zope infrastructure as is used to load a page etc
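thumper's suggestion in test form: drive the real page through a test browser and assert on browser.url, which catches a hard-coded relative URL that instantiating the view directly cannot. A sketch using zope.testbrowser inside a TestCase (Launchpad's own page-test harness wires the browser up differently); the link text and URLs are illustrative:

    from zope.testbrowser.browser import Browser

    def test_daily_builds_link_resolves(self):
        # Load the page and click the link, as a user would.
        browser = Browser()
        browser.open('http://code.launchpad.dev/')
        browser.getLink('Daily builds').click()
        # Asserting on the URL actually served catches the bad
        # canonical_url that the direct-view unit tests missed.
        self.assertEqual(
            'http://code.launchpad.dev/+daily-builds', browser.url)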
[23:09] wallyworld: it does [23:10] wallyworld: but the code root page was using a relative url hard coded [23:10] weather15, that is new. https://launchpad.dev works? [23:10] wallyworld: it wasn't generating the url in the same way that the tests were [23:10] ok [23:10] for me using the source and by setting it in my /etc/hosts file [23:10] weather15: Your Apache config for testopenid.dev is still broken. [23:11] the documentation never mentioned that [23:11] what do I have to do? [23:11] You must have broken it when you were changing the config. [23:11] It's in with the rest. [23:13] weather15, read the rocketfuel-setup script, it has a bash Here Document inside that sets up the /etc/hosts file. You can compare with that. [23:14] there's no mention of openid in the apache config [23:15] this is what the Launchpad part of /etc/hosts looks like: 10.0.0.3 launchpad.dev answers.launchpad.dev archive.launchpad.dev api.launchpad.dev bazaar-internal.launchpad.dev beta.launchpad.dev blueprints.launchpad.dev bugs.launchpad.dev code.launchpad.dev feeds.launchpad.dev id.launchpad.dev keyserver.launchpad.dev lists.launchpad.dev openid.launchpad.dev ubuntu-openid.launchpad.dev ppa.launchpad.dev private-ppa.launchpa [23:15] It will probably go to the first matching vhost, then. [23:16] weather15: Try adding 'ServerAlias testopenid.dev' to the bottom two sections in the Apache config. [23:16] Alongside launchpad.dev and *.launchpad.dev [23:18] okay done [23:18] wgrant, weather15, on my system, the only location of testopenid.dev is in the /etc/hosts file [23:19] mars: Right. [23:19] mars: So it uses the default vhost. [23:19] I have launchpad starting now let's see what happens [23:19] flacoste: http://paste.ubuntu.com/533638/ fixes the .htpasswd thing. [23:19] flacoste: Not sure why. [23:20] (it reverts part of the problematic rev) [23:20] * jml off [23:20] wgrant: weird [23:20] flacoste: Just a little. [23:20] i thought that umask played only when creating a file [23:20] That doesn't explain why this is not working [23:20] In both cases this creates a file. [23:21] But somehow O_TRUNC changes things. [23:21] Or Python is doing something stupid. [23:21] still: Traceback (most recent call last): * Module zope.publisher.publish, line 134, in publish result = publication.callObject(request, obj) * Module canonical.launchpad.webapp.publication, line 483, in callObject return mapply(ob, request.getPositionalArguments(), request) * Module zope.publisher.publish, line 109, in mapply return debug_call(obj, args) __traceback_info__: weather15: Does accessing testopenid.dev in a browser work? [23:22] wgrant: 'w' would use O_TRUNC? [23:23] flacoste: Yes. [23:23] wgrant: ok [23:23] server side yes [23:23] client side no [23:23] wgrant: wallyworld is going to coordinate deploying that as a cowboy [23:24] flacoste: Great. [23:25] * flacoste updates incident report [23:25] is that the problem === flacoste changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 4 of 10.11 | PQM open for 10.12 | firefighting: buildd-manager is disabling things again & https://wiki.canonical.com/IncidentReports/2010-11-17-LP-Private-PPA-500-errors | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting [23:29] Yay Soyuz. [23:31] Problem still exists [23:31] Oops! Sorry, something just went wrong in Launchpad. We’ve recorded what happened, and we’ll fix it as soon as possible. Apologies for the inconvenience. (Error ID: OOPS-1782X11) Traceback (most recent call last): * Module zope.publisher.publish, line 134, in publish result = publication.callObject(request, obj) * Module canonical.launchpad.webapp.publication, line 483, in callObject return mappl
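On the .htpasswd permissions puzzle: mode bits are applied only when a file is created; reopening an existing file with O_TRUNC (which is what 'w' does) truncates it but keeps its old permission bits, which may be why O_TRUNC appeared to 'change things'. A sketch of writing with explicit permissions rather than relying on the environment's umask, as flacoste goes on to suggest; the function name is made up, and the real fix is the merge proposal linked below:

    import os

    def write_private_file(path, data, mode=0644):
        # Clearing the umask around os.open makes the mode deterministic;
        # plain open(path, 'w') inherits whatever umask the process has,
        # and changes nothing if the file already exists.
        old_umask = os.umask(0)
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
        finally:
            os.umask(old_umask)
        try:
            os.write(fd, data)
        finally:
            os.close(fd)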
[23:33] wgrant: shouldn't we set the umask explicitly there? instead of relying on the env [23:33] this URL works: http://testopenid.dev/ [23:33] it returns: Test OpenID provider for launchpad.dev [23:34] I wonder if this has something to do with it: https://code.launchpad.net/~bac/launchpad/bug-524302/+merge/22180 [23:36] ???? [23:40] output on server side is different: [23:42] does this make more sense? http://pastebin.ubuntu.com/533653/ [23:43] mars wgrant? [23:44] weather15, try stopping the service, then running 'make clean && make' in the source tree. [23:44] okay will do [23:44] weather15, https://launchpad.dev/+icing/rev5/build/lp/lp.js should be a real file when the server is running [23:45] okay it's executing now [23:45] the build system should create that JavaScript file for you. You may want to check the source tree to see that it was created [23:45] (when make finishes) [23:51] when I run make run I get this http://pastebin.ubuntu.com/533655/ [23:53] mars wgrant? [23:53] abentley: ping [23:55] mars: ping - i need a cowboy eyeballed before asking a losa to deploy it [23:56] ?? [23:57] StevenK: ping? [23:57] wallyworld: How many eyeballs does it need? [23:57] Is yours insufficient? [23:58] wgrant: just one. the change as per your pastebin just reverts it to how it was before 11982 landed [23:58] https://code.edge.launchpad.net/~wallyworld/launchpad/htpasswd-access-permissions/+merge/41115 [23:58] wgrant: i wasn't sure if i needed to ask a reviewer to eyeball it or not [23:59] Oh, right, forgot you weren't a reviewer yet. [23:59] :-)