[05:23] <wgrant> Yay for having IS all in the same sleeping timezone.
[05:25] <MTecknology> wgrant: Hi!
[05:27] <cody-somerville> It looks like borium is lost and the abort is failing with Fault 8002: 'error' when xmlrpclib tries to cleanup after parsing the response.
[05:27] <wgrant> cody-somerville: Hm, and that kills everything?
[05:28] <wgrant> It shouldn't.
[05:28] <cody-somerville> appears so as the log keeps saying the same thing over and over and over
[05:28] <cody-somerville> *bohrium
[05:30] <cody-somerville> in fact, it appears caught in a loop trying to do this
[05:30] <wgrant> Ahh.
[05:30] <wgrant> Can you paste a full loop?
[05:31] <cody-somerville> wgrant, http://pastebin.ubuntu.com/477761/
[05:32] <wgrant> cody-somerville: Hmm, and it's doing that constantly, with no other log entries? How often?
[05:32] <wgrant> We probably just have to disable bohrium to get things running again, but there's nobody around who can do that :/
[05:33] <cody-somerville> wgrant, yes
[05:33] <cody-somerville> wgrant, multiple times per second
[05:33] <wgrant> Aha.
[05:33] <wgrant> So, yes, disabling it will fix it.
[05:33] <wgrant> StevenK: ^^?
[05:34] <wgrant> Interesting that it keeps retrying that one, though...
[05:35] <cody-somerville> probably has something to do with twisted
[05:35] <wgrant> Oh yes, it's all a nice Twisted mess.
[05:36] <wgrant> You know...
[05:36] <wgrant> It wouldn't surprise me if it was aborting the transaction because of the failed scan.
[05:36] <wgrant> So it sets the builder as not-OK, then aborts before it commits it.
[05:36] <cody-somerville> lol
[05:36] <cody-somerville> Wouldn't surprise me either
[05:36] <cody-somerville> Soyuz has a habit of making mistakes like that
[05:36] <wgrant> Hm, no, it should be committing.
[05:37] <wgrant> This is, of course, brand new code :/
[05:38] <wgrant> Oh, damn, it's in requestAbort instead.
[05:39] <wgrant> Ah, no, but the whole thing is wrapped.
[05:39] <wgrant> So it does commit immediately afterwards.
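The hypothesis wgrant floats above (flag the builder as not-OK, then abort before committing) can be sketched with the standard-library sqlite3 module. This is only an illustration: the `builder` table, the `builderok` column and the `mark_failed` helper are assumptions made for the example, not the actual buildd-manager code or the Launchpad schema.

```python
import sqlite3

# Hypothetical schema for illustration: one 'builder' table with an OK flag.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE builder (name TEXT PRIMARY KEY, builderok INTEGER)")
conn.execute("INSERT INTO builder VALUES ('bohrium', 1)")
conn.commit()


def builder_ok(name):
    """Return the builder's OK flag as currently visible in the database."""
    row = conn.execute(
        "SELECT builderok FROM builder WHERE name = ?", (name,)
    ).fetchone()
    return bool(row[0])


def mark_failed(name, commit):
    """Flag the builder as not-OK, then either commit or abort the change."""
    conn.execute("UPDATE builder SET builderok = 0 WHERE name = ?", (name,))
    if commit:
        conn.commit()
    else:
        conn.rollback()  # the flag change is discarded


# Aborting after the update leaves the builder looking healthy, so a
# scanner would pick it up again on the next pass.
mark_failed("bohrium", commit=False)
ok_after_abort = builder_ok("bohrium")

# Committing makes the change stick.
mark_failed("bohrium", commit=True)
ok_after_commit = builder_ok("bohrium")
```

If the abort really did win, the builder would stay marked OK and be retried on every scan, which would match a loop repeating multiple times per second.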
[07:37] <lifeless_> wgrant: we can always escalate
[07:37] <lifeless_> whats up
[07:52] <cody-somerville> lifeless, buildd-manager is hung up
[07:55] <lifeless> what's the impact?
[07:59] <cody-somerville> lifeless, nothing is getting built
[08:00] <cody-somerville> lifeless, not for the Ubuntu archive or for any PPAs
[08:00] <lifeless> will it recover on its own?
[08:00] <cody-somerville> It doesn't appear so.
[08:01] <cody-somerville> lifeless, the log is filled with this: http://pastebin.ubuntu.com/477761/
[08:01] <cody-somerville> Might be able to fix it by disabling the bohrium builder (which a buildd admin can do) but no guarantee.
[08:02] <lifeless> ok
[08:02] <lifeless> it's only bohrium showing like that?
[08:03] <cody-somerville> looks that way, yea
[08:03] <lifeless> ok
[08:03] <lifeless> have you considered escalating to IS ?
[08:05] <wgrant> IS should be almost awake now...
[08:05] <lifeless> 9am for those that are sprinting
[08:05] <lifeless> which isn't everyone
[08:05] <lifeless> (AFAIK)
[08:27] <lifeless> cody-somerville: ^
[08:30] <cody-somerville> I considered it, yes. Probably should have but haven't since I didn't have a pressing reason to do so personally.
[08:31] <cody-somerville> plus I'm tired of writing incident reports for Launchpad downtime :P
[08:32] <lifeless> cody-somerville: heh
[08:33] <lifeless> so I think we should escalate
[08:33] <lifeless> because otherwise it's going to stay down all weekend
[08:47] <cody-somerville> lifeless, agreed
[08:47] <wgrant> Right.
[08:47] <wgrant> It *probably* just needs a buildd admin to disable bohrium. But it may be more broken than that...
[09:26] <lifeless> wgrant: cody-somerville: it's being looked at
[09:32] <wgrant> lifeless: Thanks.
[09:38] <lifeless> wgrant: can you file a bug please
[09:38] <lifeless> wgrant: the builder row was deadlocked
[09:39] <wgrant> Builder row?
[09:39] <wgrant> Wait, in the DB?
[09:39] <lifeless> yes
[09:39] <wgrant> Wow.
[09:39] <wgrant> I've not seen that before.
[09:43] <lifeless> wgrant: so, airlock was apparently doing something to the builder
[09:43] <lifeless> and hung waiting on a lock
[09:43] <elmo> whee
[09:43] <lifeless> so lp then was timing out trying to disable the builder
[09:43] <elmo> it's broken again
[09:44] <lifeless> elmo: have we bounced the buildd-manager?
[09:44] <wgrant> Airlock?
[09:44] <lifeless> wgrant: the thing that steals buildds and gives them back
[09:44] <wgrant> Ah.
[09:44] <lifeless> it predates APIs and writes to the DB
[09:44] <elmo> lifeless: yes; I'm going to try the update SQL, if that's locked, face stab the buildd-manager and try again
[09:45] <lifeless> is there an API to disable a builder and enable it again ?
[09:45] <elmo> update SQL to get the fuck rid of bohrium
[09:45] <wgrant> lifeless: Not at the moment.
[09:45] <lifeless> wgrant: if you were to make one, it would help with this
[09:45] <wgrant> I've considered it. It's not hard.
[09:45] <lifeless> because we have timeouts set in the webapp ;)
[09:45] <wgrant> But we've not run into this contention before.
[09:45] <lifeless> wgrant: -please-
[09:45] <elmo> ok, so I can't run the SQL again
[09:45] <wgrant> buildd-manager's transaction usage changed massively a couple of days ago. I'd suspect there's something a little wrong with it.
[09:46] <elmo> I think it's because b-m is in a tight loop failing on bohrium
[09:46] <lifeless> elmo seemed to think it's occurred before but perhaps not as violently
[09:46] <lifeless> elmo: yeah. Take the b-m down as gracefully as possible.
[09:46] <elmo> haha, gracefully
[09:46] <elmo> the init script tries TERM which always fails
[09:46] <elmo> then it KILLs
[09:46] <wgrant> TERM normally works.
[09:46] <wgrant> It can take a few seconds, though.
[09:47] <elmo> 'always' may be slightly hyperbolic; but I haven't seen TERM work for me since the latest round of implosions started happening
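The TERM-then-KILL behaviour elmo describes for the init script can be sketched in Python as follows. The `stop_process` helper, the grace period and the sleeping child process are assumptions for illustration; this is not the actual init script.

```python
import subprocess
import sys


def stop_process(proc, grace_seconds=5.0):
    """Ask proc to exit with SIGTERM; fall back to SIGKILL after a grace period."""
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX; cannot be caught or ignored
        return proc.wait()


# Stand-in for a long-running daemon: a child process that just sleeps.
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"]
)
exit_code = stop_process(child)
```

A process that is stuck and never gets around to handling TERM will only go away at the KILL step, which is consistent with wgrant's note that TERM normally works but can take a few seconds.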
[09:47] <wgrant> Ew.
[09:47] <wgrant> Anyway, I also don't see how a DB deadlock could result in this loop.. unless the commit is failing, and this isn't logged?
[09:48] <lifeless> oh
[09:48] <elmo> ok, bohrium disabled; b-m back up
[09:48] <lifeless> so we found an interesting xmlrpc thing the other day
[09:48] <lifeless> returning a Fault -> doesn't abort transactions
[09:48] <lifeless> raising one does.
[09:48] <lifeless> probably not the thing here, but a good thing to remember until we fix it
[09:49] <lifeless> wgrant: in general don't we structure things so that 'unhandled exception -> rollback' ?
[09:49] <wgrant> lifeless: Yes. But the code here catches the Fault, disables the builder, then commits.
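The difference lifeless mentions between returning and raising a Fault can be sketched with a wrapper that follows the "unhandled exception -> rollback" convention. `RecordingTransaction` and `call_in_transaction` are hypothetical names for illustration, not Launchpad's publisher code, and `xmlrpc.client` is the Python 3 name for the `xmlrpclib` module mentioned earlier in the log.

```python
from xmlrpc.client import Fault


class RecordingTransaction:
    """Hypothetical stand-in for a transaction manager; records the outcome."""

    def __init__(self):
        self.outcome = None

    def commit(self):
        self.outcome = "commit"

    def abort(self):
        self.outcome = "abort"


def call_in_transaction(txn, method):
    """Commit if method returns normally, abort if it raises."""
    try:
        result = method()
    except Exception:
        txn.abort()
        raise
    txn.commit()
    return result


def returns_fault():
    # A returned Fault is just a value, so the wrapper sees a normal return.
    return Fault(8002, "error")


def raises_fault():
    # A raised Fault is an exception, so the wrapper aborts.
    raise Fault(8002, "error")


txn_returned = RecordingTransaction()
call_in_transaction(txn_returned, returns_fault)

txn_raised = RecordingTransaction()
try:
    call_in_transaction(txn_raised, raises_fault)
except Fault:
    pass
```

Under this convention a method that reports an error by returning a Fault still has its changes committed, while one that raises has them rolled back.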
[09:50] <elmo> I have to go and pack, but I'll leave my laptop up as late as I can and keep an eye on the b-m log
[09:50] <lifeless> wgrant: given that for the last 90 minutes there was a db backend waiting for a lock
[09:50] <lifeless> wgrant: I highly doubt that it's working as advertised
[09:51] <wgrant> lifeless: The codepath is really short and clear.
[09:51] <wgrant> Anyway, dinner.
[09:52] <lifeless> elmo: thanks heaps
[11:46] <lifeless> night all
[20:25] <lifeless> grah rosetta is unhappy
[20:26] <lifeless> hmm, time for incident report about last night's soyuz thing
[20:42] <jelmer> lifeless, there was another incident, or is this the EINTR one?
[21:20] <lifeless> jelmer: there was another one
[21:21] <lifeless> IncidentReports/2010-08-14-Soyuz-Airlock-Deadlock
[21:22] <lifeless> jelmer: ^
[21:22] <jelmer> thanks, reading
[21:33] <lifeless> jkakar: https://bugs.edge.launchpad.net/storm/+bug/617973 btw
[21:33] <_mup_> Bug #617973: timeouterror could be more clear about the implications <Storm:New> <https://launchpad.net/bugs/617973>
[21:41] <lifeless> bbiab
[22:51] <lifeless> jml: https://devpad.canonical.com/~jml/lp-doc/index.html might be better as wiki pages