[16:00] <matsubara> #startmeeting
[16:00] <MootBot> Meeting started at 10:00. The chair is matsubara.
[16:00] <MootBot> Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
[16:00] <matsubara> Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues.
[16:00] <matsubara> [TOPIC] Roll Call
[16:00] <MootBot> New Topic:  Roll Call
[16:00] <sinzui> me
[16:00] <matsubara> Not on the Launchpad Dev team? Welcome! Come "me" with the rest of us!
[16:00] <Ursinha> me
[16:01] <herb> me
[16:01] <intellectronica> me
[16:02] <matsubara> rockstar, hi
[16:02] <matsubara> al-maisan, hi
[16:02] <matsubara> flacoste, hi
[16:02] <danilos> me (so-so)
[16:03] <matsubara> ok, foundations, code and soyuz missing. they can join in later
[16:03] <matsubara> [TOPIC] Agenda
[16:03] <MootBot> New Topic:  Agenda
[16:03] <matsubara>  * Actions from last meeting
[16:03] <matsubara>  * Oops report & Critical Bugs
[16:03] <matsubara>  * Operations report (mthaddon/herb/spm)
[16:03] <matsubara>  * DBA report (stub)
[16:03] <matsubara> [TOPIC] * Actions from last meeting
[16:03] <MootBot> New Topic:  * Actions from last meeting
[16:03] <rockstar> me
[16:04] <cprov> me
[16:04] <matsubara>     * matsubara to file a bug about the missing select permissions that delayed the rollout
[16:04] <matsubara>         * https://bugs.edge.launchpad.net/launchpad-foundations/+bug/353926
[16:04] <matsubara>     * cprov to look up soyuz bugs 353568
[16:04] <matsubara>     * matsubara to include francis suggestion to bug 353530 and ursinha to summarize what spm told her
[16:04] <matsubara>         * matsubara commented on the bug.
[16:04] <matsubara>     * salgado to debug and fix bug 353863
[16:04] <matsubara>     * sinzui to email the list how we should address critical bugs on unmaintained apps (e.g. blueprint)
[16:04] <matsubara>     * matsubara to talk to mrevell to announce a maintenance in the DB for about 10 min outage in the next 2 weeks. ask mrevell to talk to stub about it
[16:04] <matsubara>         * matsubara emailed mrevell about this.
[16:04] <sinzui> matsubara: not done
[16:04] <Ursinha> the info I had was useless to the bug report
[16:04] <Ursinha> very superficial and not helpful, so I didn't add
[16:05] <flacoste> me
[16:05] <matsubara> cprov and salgado bugs are fix released. so that's done
[16:06] <matsubara> sinzui, do you want me to add another action for your item?
[16:06] <matsubara> for next week
[16:06] <sinzui> matsubara: please do
[16:06] <matsubara> [action] sinzui to email the list how we should address critical bugs on unmaintained apps (e.g. blueprint)
[16:06] <MootBot> ACTION received:  sinzui to email the list how we should address critical bugs on unmaintained apps (e.g. blueprint)
[16:07] <matsubara> [TOPIC] * Oops report & Critical Bugs
[16:07] <MootBot> New Topic:  * Oops report & Critical Bugs
[16:07] <matsubara> go ahead Ursinha
[16:07] <Ursinha> all right!
[16:07] <Ursinha> one puzzle for losas/stub, three bugs for foundations, three bugs for registry
[16:07] <Ursinha> flacoste, bug 354593, bug 353926, openid resetting password, bug 358498
[16:07] <Ursinha> sinzui: bug 357307, bug 358486, bug 358492
[16:07] <Ursinha> herb/stub: we're having *lots* of oopses like
[16:07] <Ursinha> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1194D1005. I've sent one email to lp list and spoken with jtv about that, something is killing the db connections, so when a request tries to re-use a connection that died it oopses like that. So herb/stub: do you know what can be possibly happening to the db?
[16:07] <sinzui> Ursinha: Account and LoginToken are Foundations issues
[16:07] <herb> Ursinha: looking at the oops now...
[16:08] <Ursinha> sinzui, looking..
[16:08] <Ursinha> sinzui, thanks for changing that
[16:08] <stub> Ursinha: The OOPS is showing that the database reconnection isn't working as it should. Why that connection died isn't on that OOPS - it would have happened on the previous request handled by that thread.
[16:08] <Ursinha> so that's flacoste's too
[16:08]  * sinzui did it a few minutes ago
[16:09] <flacoste> sinzui, for AuthToken: +resetpassword, do you think salgado could look into this one?
[16:10] <sinzui> flacoste: Not this week, he is gone. I can take the +resetpassword in a few hours
[16:10] <flacoste> Ursinha: for the branding bug, how often does it happen?
[16:10] <flacoste> Ursinha: reworking these templates is going to happen, but it's not easy to fix now
[16:10] <stub> Ursinha: We have watchdogs that kill bad connections. I don't recall seeing any reaped connections from the appservers recently.
[16:11] <Ursinha> flacoste, about 6 or 7 a day
[16:11] <Ursinha> stub, we had 2 thousand oopses like this one I showed
[16:11] <flacoste> Ursinha: actually, i think these are related to the DisallowedStore error i'm seeing
[16:12] <flacoste> Ursinha: because these links don't appear on normal SSO pages
[16:12] <flacoste> stub: what's the permission bug in 358492?
[16:13] <flacoste> nm, i know what this is about
[16:14] <Ursinha> stub, jtv said he saw a lot of "administrator terminated connection" errors on lp-errors-report list
[16:14] <stub> When?
[16:15] <Ursinha> yesterday
[16:15] <Ursinha> I couldn't find them
[16:16] <stub> Anyway - the reason the OOPS count is so high is the appserver isn't recovering like it should.
[16:16] <flacoste> stub: could you look at this bug tomorrow?
[16:17] <stub> I can look at the oops - I don't know if there is a bug yet (I think there is - not sure though)
[16:17] <Ursinha> for the db killing spree no, there isn't
[16:17] <stub> What db killing spree?
[16:17] <Ursinha> at least I didn't open one, I'll do now
[16:17] <Ursinha> ah
[16:18] <Ursinha> I mean, the oopses
[16:18] <Ursinha> the lots of oopses, because the appserver isn't recovering like it should
[16:19] <stub> ok
[16:19] <Ursinha> anyway, it's that bug you're talking about flacoste?
[16:20] <flacoste> yes
[16:20] <Ursinha> I'll open one and let you know
[16:20] <Ursinha> that's all for me
[16:20] <Ursinha> we have one critical bug, in progress
[16:21] <Ursinha> so, if matsubara has nothing else to say, oops section is closed
[16:21] <matsubara> [action] ursinha to file a bug about "appserver isn't recovering like it should causing too many oopses"
[16:21] <MootBot> ACTION received:  ursinha to file a bug about "appserver isn't recovering like it should causing too many oopses"
[16:21] <Ursinha> thanks sinzui, flacoste, stub and herb
[16:21] <Ursinha> and matsubara, of course
[16:21] <matsubara> intellectronica, can you move bug 269538 from fix committed to fix released?
[16:22] <matsubara> or at least chase why it's not fix released yet?
[16:22] <matsubara> that bug has been in fix committed for ages
[16:22] <intellectronica> matsubara: i have no idea what's going on with that. i'll talk to gmb about it
[16:22] <matsubara> thanks intellectronica
[16:22] <matsubara> thanks everyone
[16:22] <matsubara> let's move on
[16:23] <matsubara> [action] intellectronica to talk to gmb about bug 269538
[16:23] <MootBot> ACTION received:  intellectronica to talk to gmb about bug 269538
[16:23] <matsubara> [TOPIC] * Operations report (mthaddon/herb/spm)
[16:23] <MootBot> New Topic:  * Operations report (mthaddon/herb/spm)
[16:23] <herb> 2009-04-04 - Launchpad experienced an outage most likely due to hitting some connection limits on the DB. Some users may have experienced issues for up to 90 minutes.
[16:23] <herb> 2009-04-08 - Deployed r7947 to soyuz and xmlrpc servers.
[16:23] <herb> Bug 156453 and bug 118625 continue to be problematic for us. Just want to make sure I'm keeping them on your radar.
[16:23] <herb> That's all for this week, unless there are questions.
[16:24] <rockstar> herb, I have something to report here again!
[16:24] <herb> woohoo!
[16:24] <rockstar> herb, so we've identified the real memory pig.  Unfortunately, it won't be trivial to change.
[16:24] <matsubara> cool!
[16:24] <rockstar> herb, so we know where the issue is, and now we just need to schedule about two weeks and re-write loggerhead.
[16:24] <herb> haha
[16:24] <matsubara> hehe
[16:25] <flacoste> the problem is that he is serious :-/
[16:26] <herb> mine was a laugh of despair
[16:27] <matsubara> rockstar, so are you tackling that for 2.2.4 and maybe 2.2.5?
[16:27] <sinzui> flacoste: I believe the problem with +resetpassword is that it sends logintokens to users who have not setup a person yet.
[16:27] <rockstar> matsubara, well, I doubt it'll be 2.2.4, because mwhudson is on leave for so much of it.
[16:27] <rockstar> matsubara, what really needs to happen is that we need to be sequestered again for a week to do nothing but fix it.
[16:28] <flacoste> sinzui: sounds about right, that shouldn't happen :-)
[16:28] <sinzui> flacoste: I'll get this fix today
[16:28] <matsubara> by we, you mean you and mwhudson or the whole code team?
[16:28] <flacoste> sinzui: thanks a lot
[16:29] <rockstar> matsubara, mwhudson and I.
[16:29] <rockstar> matsubara, we got some really good work done at the Pycon sprints last week.
[16:29] <matsubara> there's all hands and uds coming, maybe during that?
[16:31] <matsubara> anyway, that's beyond the scope of this meeting.
[16:31] <matsubara> I think that's all. anything else for herb?
[16:31] <matsubara> [TOPIC] * DBA report (stub)
[16:31] <MootBot> New Topic:  * DBA report (stub)
[16:31] <matsubara> thanks herb
[16:31] <stub> Can you describe the memory leak?
[16:32] <herb> thanks matsubara
[16:32] <stub> During the last rollout, one of the database patches turned out to be relying on database row ordering for some data migration, with the end result being some newly created rows on the slaves had different primary key values to the master and each other.
[16:32] <stub> This caused replication to block later when changes to the data on the master could not be duplicated on the slaves due to constraint violations, alerting us to the problem. We rebuilt the slave databases to correct the problem (the safest way of recovering the situation).
[16:32] <stub> The corruption was not noticeable to end users and did not infect the master, as only the internal database ids were affected.
[16:32] <stub> I was hoping to switch our master to the 16 core box, but public holidays and illness have put a hold on that this week.
[16:32] <stub> On the 6th and 7th, some batch jobs erroneously had their database connections terminated. Sorry about that. It is unlikely this was end user visible.
[16:35] <stub> echo... echo...
[16:35] <sinzui> oi oi
[16:35] <matsubara> stub, you're coordinating the downtime announcement with mrevell, right?
[16:35] <stub> I will
[16:36] <matsubara> stub, ok. thanks.
[16:36] <matsubara> anything else for stub?
[16:36] <matsubara> thanks stub.
[16:36] <matsubara> I think that's all for today.
[16:36] <matsubara> Thank you all for attending this week's Launchpad Production Meeting.
[16:37] <matsubara> #endmeeting
[16:37] <MootBot> Meeting finished at 10:37.
[16:37] <Ursinha> thanks all