=== Ursinha-afk is now known as Ursinha
=== jcsackett is now known as jcsackett|afk
[05:23] Yay for having IS all in the same sleeping timezone.
[05:25] wgrant: Hi!
[05:27] It looks like borium is lost and the abort is failing with Fault 8002: 'error' when xmlrpclib tries to cleanup after parsing the response.
[05:27] cody-somerville: Hm, and that kills everything?
[05:28] It shouldn't.
[05:28] appears so as the log keeps saying the same thing over and over and over
[05:28] *bohrium
[05:30] in fact, it appears caught in a loop trying to do this
[05:30] Ahh.
[05:30] Can you paste a full loop?
[05:31] wgrant, http://pastebin.ubuntu.com/477761/
[05:32] cody-somerville: Hmm, and it's doing that constantly, with no other log entries? How often?
[05:32] We probably just have to disable bohrium to get things running again, but there's nobody around who can do that :/
[05:33] wgrant, yes
[05:33] wgrant, multiple times per second
[05:33] Aha.
[05:33] So, yes, disabling it will fix it.
[05:33] StevenK: ^^?
[05:34] Interesting that it keeps retrying that one, though...
[05:35] probably has something to do with twisted
[05:35] Oh yes, it's all a nice Twisted mess.
[05:36] You know...
[05:36] It wouldn't surprise me if it was aborting the transaction because of the failed scan.
[05:36] So it sets to the builder as not-OK, then aborts before it commits it.
[05:36] lol
[05:36] Wouldn't surprise me either
[05:36] Soyuz has a habit of making mistakes like that
[05:36] Hm, no, it should be committing.
[05:37] This is, of course, brand new code :/
[05:38] Oh, damn, it's in requestAbort instead.
[05:39] Ah, no, but the whole thing is wrapped.
[05:39] So it does commit immediately afterwards.
[07:37] wgrant: we can always escalate
[07:37] whats up
=== lifeless_ is now known as lifeless
[07:52] lifeless, buildd-manager is hung up
[07:55] whats the impat
[07:55] impact
[07:59] lifeless, nothing is getting built
[08:00] lifeless, not for the Ubuntu archive or for any PPAs
[08:00] will it recover on its own?
[08:00] It doesn't appear so.
[08:01] lifeless, the log is filled with this: http://pastebin.ubuntu.com/477761/
[08:01] Might be able to fix it by disabling the bohrium builder (which a buildd admin can do) but no guarantee.
[08:02] ok
[08:02] its only bohrium showing like that ?
[08:03] looks that way, yea
[08:03] ok
[08:03] have you considered escalating to IS ?
[08:05] IS should be almost awake now...
[08:05] 9am for those that are sprinting
[08:05] which isn't everyone
[08:05] (AFAIK)
[08:27] cody-somerville: ^
[08:30] I considered it, yes. Probably should have but haven't since I didn't have a pressing reason to do so personally.
[08:31] plus I'm tired of writing incident reports for Launchpad downtime :P
[08:32] cody-somerville: heh
[08:33] so I think we should escalate
[08:33] because otherwise its going to stay down all weekend
[08:47] lifeless, agreed
[08:47] Right.
[08:47] It *probably* just needs a buildd admin to disable bohrium. But it may be more broken than that...
[09:26] wgrant: cody-somerville: its being looked at
[09:32] lifeless: Thanks.
[09:38] wgrant: can you file a bug please
[09:38] wgrant: the builder row was deadlocked
[09:39] Builder row?
[09:39] Wait, in the DB?
[09:39] yes
[09:39] Wow.
[09:39] I've not seen that before.
[09:43] wgrant: so, airlock was apparently doing something to the builder
[09:43] and hung waiting on a lock
[09:43] whee
[09:43] so lp then was timing out trying to disable the builder
[09:43] it's broken again
[09:44] elmo: have we bounced the builddmanager?
[09:44] Airlock?
[09:44] wgrant: the thing that steals buildds and gives them back
[09:44] Ah.
[09:44] it predates APIs and writes to the DB
[09:44] lifeless: yes; I'm going to try the update SQL, if that's locked, face stab the buildd-manager and try again
[09:45] is there an API to disable a builder and enable it again ?
[09:45] update SQL to get the fuck rid of bohrium
[09:45] lifeless: Not at the moment.
[09:45] wgrant: if you were to make one, it would help with this
[09:45] I've considered it. It's not hard.
[09:45] because we have timeouts set in the webapp ;)
[09:45] But we've not run into this contention before.
[09:45] wgrant: -please-
[09:45] ok, so I can't run the SQL again
[09:45] buildd-manager's transaction usage changed massively a couple of days ago. I'd suspect there's something a little wrong with it.
[09:46] I think it's because b-m is in a tight loop failing on bohrium
[09:46] elmo seemed to think it's occurred before but perhaps not as violently
[09:46] elmo: yeah. Take the b-m down as gracefully as possible.
[09:46] haha, gracefully
[09:46] the init script tries TERM which always fails
[09:46] then it KILLs
[09:46] TERM normally works.
[09:46] It can take a few seconds, though.
[09:47] 'always' may be slightly hyperbolic; but I haven't seen TERM work for me since the latest round of implosions started happening
[09:47] Ew.
[09:47] Anyway, I also don't see how a DB deadlock could result in this loop... unless the commit is failing, and this isn't logged?
[09:48] oh
[09:48] ok, bohrium disabled; b-m back up
[09:48] so we found an interesting xmlrpc thing the other day
[09:48] returning a Fault -> doesn't abort transactions
[09:48] raising one does.
[09:48] probably not the thing here, but a good thing to remember until we fix it
[09:49] wgrant: in general don't we structure things so that 'unhandled exception -> rollback' ?
[09:49] lifeless: Yes. But the code here catches the Fault, disables the builder, then commits.
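(Editorial note, not part of the log.) The 09:48 remark about returning versus raising a Fault can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Launchpad's code: `ToyTransaction`, `dispatch`, `returns_fault`, `raises_fault` and `run` are hypothetical names invented for the example, and it uses Python 3's `xmlrpc.client.Fault` in place of the Python 2 `xmlrpclib` mentioned in the log. It assumes the usual "unhandled exception -> rollback" wrapper described at 09:49.

```python
from xmlrpc.client import Fault


class ToyTransaction:
    """Hypothetical stand-in for a real transaction manager."""

    def __init__(self):
        self.pending = []
        self.committed = []

    def write(self, change):
        self.pending.append(change)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []

    def abort(self):
        self.pending = []


def dispatch(handler, txn):
    """Run a handler; roll back only if it raises, commit otherwise."""
    try:
        result = handler(txn)
    except Fault as fault:
        # A raised Fault is an exception, so the wrapper aborts.
        txn.abort()
        return fault
    # A returned Fault is just a return value, so the wrapper commits.
    txn.commit()
    return result


def returns_fault(txn):
    txn.write("builder marked not-OK")
    return Fault(8002, "error")


def raises_fault(txn):
    txn.write("builder marked not-OK")
    raise Fault(8002, "error")


def run(handler):
    """Dispatch a handler on a fresh transaction and report what survived."""
    txn = ToyTransaction()
    outcome = dispatch(handler, txn)
    return outcome.faultCode, txn.committed
```

In both cases the caller sees the same fault code; the only difference is whether the handler's writes were committed or discarded, which is why the distinction is easy to miss.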
[09:50] I have to go and pack, but I'll leave my laptop up as late as I can and keep an eye on the b-m log
[09:50] wgrant: given that for the last 90 minutes there was a db backend waiting for a lock
[09:50] wgrant: I highly doubt that it's working as advertised
[09:51] lifeless: The codepath is really short and clear.
[09:51] Anyway, dinner.
[09:52] elmo: thanks heaps
[11:46] night all
=== jcsackett|afk is now known as jcsacket
=== jcsacket is now known as jcsackett
[20:25] grah rosetta is unhappy
[20:26] hmm, time for incident report about last night's soyuz thing
[20:42] lifeless, there was another incident, or is this the EINTR one?
[21:20] jelmer: there was another one
[21:21] IncidentReports/2010-08-14-Soyuz-Airlock-Deadlock
[21:22] jelmer: ^
[21:22] thanks, reading
[21:33] jkakar: https://bugs.edge.launchpad.net/storm/+bug/617973 btw
[21:33] <_mup_> Bug #617973: timeouterror could be more clear about the implications
[21:41] bbiab
[22:51] jml: https://devpad.canonical.com/~jml/lp-doc/index.html might be better as wiki pages