[15:00] <matsubara> #startmeeting
[15:00] <MootBot> Meeting started at 09:00. The chair is matsubara.
[15:00] <MootBot> Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
[15:00] <matsubara> Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues.
[15:00] <matsubara> [TOPIC] Roll Call
[15:00] <MootBot> New Topic:  Roll Call
[15:00] <matsubara> Not on the Launchpad Dev team? Welcome! Come "me" with the rest of us!
[15:00] <henninge> me
[15:00] <Ursinha> me
[15:00] <matsubara> Ursinha, flacoste, bigjools, intellectronica, herb
[15:00] <bigjools> me
[15:00] <herb> me
[15:00] <matsubara> bac, ping
[15:00] <flacoste> me
[15:01] <Ursinha> matsubara, already answered
[15:01] <intellectronica> me
[15:01] <matsubara> rockstar, hi
[15:01] <rockstar> me
[15:01] <rockstar> matsubara, hi
[15:01] <bac> me
[15:03] <matsubara> ok, stub can join later. everyone else is here.
[15:03] <matsubara> [TOPIC] Agenda
[15:03] <MootBot> New Topic:  Agenda
[15:03] <matsubara>  * Actions from last meeting
[15:03] <matsubara>  * Oops report & Critical Bugs
[15:03] <matsubara>  * Operations report (mthaddon/herb/spm)
[15:03] <matsubara>  * DBA report (DBA contact)
[15:03] <matsubara> [TOPIC] * Actions from last meeting
[15:03] <MootBot> New Topic:  * Actions from last meeting
[15:03] <matsubara>  * stub to investigate the fix to avoid staging restore problems
[15:03] <matsubara>  * matsubara to chase rockstar about a fix for OOPS-1138CEMAIL12
[15:03] <matsubara>     * asked jml about this. It's bug 326056 and had importance raised.
[15:03] <matsubara>  * cprov and bigjools to investigate OOPS-1145EA14
[15:03] <matsubara>  * Ursinha to file bugs:
[15:03] <matsubara>     * Bug 333072: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1143EB189
[15:03] <matsubara>     * Bug 333071: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1145EA14
[15:04] <bigjools> 333072 is invalid
[15:04] <matsubara> bigjools, any news about 333071?
[15:04] <bigjools> yes, it's not too serious, we've set it for 2.2.3
[15:04] <bigjools> it's a corner case in the copying
[15:05] <bigjools> despite the doom-mongering error message
[15:05] <matsubara> ok. thanks bigjools
[15:06] <matsubara> [action] matsubara to chase stub about staging restore problems
[15:06] <MootBot> ACTION received:  matsubara to chase stub about staging restore problems
[15:06] <matsubara> [TOPIC] * Oops report & Critical Bugs
[15:06] <MootBot> New Topic:  * Oops report & Critical Bugs
[15:06]  * matsubara hands Ursinha the mic
[15:06]  * Ursinha looks
[15:06]  * rockstar runs
[15:07] <Ursinha> registry, foundations, code and bugs: oopses for you
[15:07] <Ursinha> Registry:-
[15:07] <Ursinha> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153E919
[15:07] <Ursinha> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153A1135 (or foundations, not sure)
[15:07] <Ursinha> Foundations:-
[15:07] <Ursinha> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153D667
[15:07] <Ursinha> Code:
[15:07] <Ursinha> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1152XMLP1
[15:07] <Ursinha> Bugs:
[15:07] <Ursinha> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1152EA162
[15:07] <Ursinha> ~
[15:07] <Ursinha> rockstar, ha!
[15:07] <Ursinha> rockstar, have you seen this one: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1152XMLP1?
[15:08] <rockstar> Ursinha, looking at all of them now.
[15:08] <Ursinha> rockstar, you can just look at code's one :)
[15:08] <Ursinha> sinzui, hi
[15:09] <Ursinha> sinzui, I'm not sure if https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153A1135 is foundations or registry
[15:09] <matsubara> Ursinha, looks like registry
[15:09] <intellectronica> Ursinha: strange. do you see lots of those?
[15:09] <bac> Ursinha: yes, looks like registry
[15:09] <Ursinha> intellectronica, no, actually not
[15:09] <Ursinha> intellectronica, but never saw one of those before
[15:10] <Ursinha> so better bring to attention
[15:10] <matsubara> intellectronica, Ursinha that one looks like it was caused by the rollout
[15:10] <sinzui> Ursinha I don't know the answer either. I will look into it and assign it. I suspect salgado-afk is working on
[15:10] <Ursinha> matsubara, even in the time it happened?
[15:10] <intellectronica> matsubara: i also thought so
[15:10] <intellectronica> but it is quite early
[15:11] <Ursinha> intellectronica, I've discarded the rollout possibility because of its timestamp
[15:11] <Ursinha> sinzui, thanks for that
[15:11] <matsubara> yeah, too early to be caused by the rollout.
[15:11] <Ursinha> intellectronica, can you take a look then, please?
[15:11] <rockstar> Ursinha, I'll have to investigate our oops.  It's the XML-RPC server, and it requires the sacrifice of a virgin goat.
[15:11] <matsubara> check OSAs incident log to see if something happened during that time
[15:11] <intellectronica> so, this isn't really a bugs oops, but i don't know whether it's rollout-related or not. fwiw it's more than three hours before rollout, so it's hard to see how it would be related
[15:11] <Ursinha> rockstar, oh, I have a bunch here in my backyard if you need some
[15:12] <rockstar> Ursinha, :)
[15:12] <Ursinha> intellectronica, I'll do what matsubara suggested
[15:13] <matsubara> [action] ursinha to check OSAs incident log to help identify cause of OOPS-1152EA162
[15:13] <MootBot> ACTION received:  ursinha to check OSAs incident log to help identify cause of OOPS-1152EA162
[15:14] <Ursinha> thanks intellectronica and matsubara
[15:14] <matsubara> [action] rockstar to investigate xmlrpc oops OOPS-1152XMLP1
[15:14] <MootBot> ACTION received:  rockstar to investigate xmlrpc oops OOPS-1152XMLP1
[15:14] <Ursinha> flacoste, hi
[15:14] <henninge> Translations is happy that POFile:+translate has dropped out of the timeout top ten now ..
[15:14] <henninge> btw
[15:15] <henninge> ;)
[15:15] <Ursinha> henninge, indeed, congrats to translate team :)
[15:15] <Ursinha> translations
[15:15] <Ursinha> there he is :)
[15:15] <henninge> Ursinha: thank you, I will pass it on.
[15:15] <Ursinha> sinzui, about the other oops
[15:16] <stub> Sorry - on a call and didn't realize the time
[15:16] <sinzui> bac: can you look at it.
[15:16] <bac> Ursinha: they seem to be related (acting for sinzui today)
[15:16]  * sinzui is in another meeting
[15:16] <flacoste> hmm
[15:16] <flacoste> i'd say registry
[15:17] <bac> yes, i think registry for both
[15:17] <flacoste> Ursinha are you talking about OOPS-1153A1135?
[15:17] <Ursinha> bac, hi :) so, can you take a look at both oopses? do you need me to file bugs about them?
[15:17]  * Ursinha looks
[15:17] <bac> Ursinha: yes i'll look at them both
[15:17] <bac> i can open the bugs
[15:17] <bac> unless you need the karma
[15:17] <Ursinha> flacoste, no, https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153D667
[15:17] <Ursinha> bac, haha, no
[15:18] <flacoste> Ursinha: that's also a registry query
[15:18] <Ursinha> [action] bac to file bugs and take care of https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153E919 and https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153A1135
[15:18] <matsubara> [action] bac to file bugs for OOPS-1153E919 and OOPS-1153A1135
[15:18] <MootBot> ACTION received:  bac to file bugs for OOPS-1153E919 and OOPS-1153A1135
[15:19] <bac> wow, y'all are insistent today!  :)
[15:19] <Ursinha> :)
[15:19] <Ursinha> flacoste, hm.
[15:19] <Ursinha> thanks
[15:19] <Ursinha> bac, can you take a look at that too?
[15:20] <bac> which?
[15:20] <Ursinha> promise not to paste the oops again
[15:20] <Ursinha> bac, https://devpad.canonical.com/~jamesh/oops.cgi/1153D667
[15:20] <Ursinha> I tried :)
[15:20]  * bac looks
[15:20] <bac> yes
[15:21] <Ursinha> bac, thanks
[15:21] <Ursinha> that's all from me from the oops land
[15:21] <matsubara> [action] bac to also file a bug and take care of OOPS-1153D667
[15:22] <MootBot> ACTION received:  bac to also file a bug and take care of OOPS-1153D667
[15:22] <matsubara> ok, thanks everyone.
[15:22] <matsubara> [TOPIC] * Operations report (mthaddon/herb/spm)
[15:22] <MootBot> New Topic:  * Operations report (mthaddon/herb/spm)
[15:22] <Ursinha> there's one critical bug, though
[15:22] <Ursinha> argh
[15:22] <Ursinha> bad bad timing
[15:23] <herb> shall I wait for the critical bug?
[15:23] <Ursinha> herb, just a second, let me check with henninge
[15:23] <matsubara> danilo is handling the critical bug, so won't duplicate what's in the bug report.
[15:23] <matsubara> it's bug 334787
[15:23] <Ursinha> matsubara, okay, if you say so
[15:23] <matsubara> let's move on
[15:24] <Ursinha> go ahead herb, thanks
[15:24] <herb> 2009-02-20 - We had an issue that may have caused some users to experience intermittent outages on Launchpad. I worked with joey and flacosted to find the issue. joey's notes were sent to the list. I would be interested in hearing any updates we might have on this issue.
[15:24] <herb> 2009-02-21 and 2009-02-22 - It appears we had a bit of buggy code land on edge that caused a performance problem on both edge and production. The revision was backed out and I believe the code has been fixed.
[15:24] <herb> 2009-02-26 - We rolled out 2.2.2 based on r7763
[15:24] <herb> We continue to see problems relating to bug #156453 and bug #118625. So much so that we're going to start bouncing codebrowse regularly to hopefully head off any issues. I want to emphasize that this will be masking the problem and we really do need to find the root cause and fix it.
[15:24] <herb> Bug #260171 continues to creep up regularly (every few days). This is already marked as high and I know that mwhudson's plate is full with codebrowse issues, but can we get an update on this one?
[15:24]  * herb somehow managed to change flacoste into a verb.
[15:24] <danilos> matsubara, Ursinha: I am running tests on the critical bug fix, will let you know once it has landed
[15:25] <flacoste> i saw!
[15:25] <bac> i've been flacosted!
[15:25] <matsubara> danilos, thanks
[15:25] <Ursinha> thanks danilos
[15:25] <matsubara> rockstar, can you bring up the codebrowse issue to the code team?
[15:25] <rockstar> matsubara, everyday.  :)
[15:25] <matsubara> rockstar, thanks :-)
[15:26] <rockstar> Codebrowse is being ACTIVELY worked on.  It'd be nice if we knew what the issue is.  Right now, we're just fixing things and hoping that was the problem.
[15:26] <herb> rockstar: let the losas know if there is anything we can do to help.
[15:26] <rockstar> herb, we certainly will.
[15:27] <stub> Should we be bringing in any outside help to instrument, test and diagnose the issue?
[15:27] <matsubara> herb, anything happened to the DB during the time of this OOPS-1152EA162?
[15:28] <matsubara> or maybe stub might know ^
[15:28] <herb> matsubara: nothing in the incident log.
[15:28] <stub> matsubara: That is one of the connection reaper scripts kicking in
[15:29] <herb> matsubara: I think that's also on the void between LOSAs.
[15:29] <herb> ah, there we go.
[15:29] <stub> We kill connections idle in a transaction more than a few hours (and should be more aggressive), and appserver connections that have been in a transaction for more than 2 minutes.
[15:30] <Ursinha> stub, I see
[15:30] <matsubara> stub, ok. so if we start seeing too many of those, we have a problem somewhere and a few is kinda normal?
[15:30] <stub> The notification gets sent to the error-reports list (where we can confirm that this is indeed what happened)
[15:31] <matsubara> stub, aha. that's better. I'll chase the lp-errors for that one
[15:31] <matsubara> s/lp-errors/lp-errors list/
[15:31] <stub> If we see many of them, we have a problem. One is probably a problem - appserver requests taking two minutes on the db means we need to investigate why the normal timeout mechanisms didn't work.
[15:31] <matsubara> [action] matsubara to look at the lp-errors list to determine cause of OOPS-1152EA162
[15:31] <MootBot> ACTION received:  matsubara to look at the lp-errors list to determine cause of OOPS-1152EA162
[15:32] <matsubara> right. thanks for the explanation
[15:32] <stub> -1 second non-sql time, 0 seconds total time indicates a problem at the appserver? The request never got started?
[15:33] <matsubara> I'll file a bug about that one and we can discuss there
[15:34] <stub> hmm... might be a reconnection bug - perhaps the previous request handled by that thread got killed?
[15:34] <stub> I don't know if we Retry on DisconnectionError exceptions, or if it is a good idea in all cases.
[15:35] <matsubara> ok
[15:35] <matsubara> [TOPIC] * DBA report (stub)
[15:35] <MootBot> New Topic:  * DBA report (stub)
[15:35] <matsubara> and thanks herb and stub
[15:36] <stub> New hardware exists and is being brought online by IS. I've realized I might need to tweak the db maintenance scripts (upgrade.py, security.py etc.) to cope with a third replica - I think it only copes with a single master and slave at the moment.
[15:36] <stub> Staging can be moved by the LOSAs as soon as the hardware is available and they have time, which will move that load from the production systems.
[15:36] <stub> I assume the rollout went fine as far as the db upgrade procedure goes.
[15:37] <herb> I assume it did too. I didn't hear any complaints from my colleagues.
[15:37] <matsubara> stub, great news! with the new hardware we won't have the staging restore problems anymore?
[15:37] <herb> stub: what's the plan with the 3rd replica?
[15:38] <stub> The staging restore problems should no longer be a problem.
[15:38]  * herb feels like he missed something
[15:38] <stub> herb: We can start by pointing half the appservers at the new slave when it is online. We really should get a connection pool/load balancer thingy running though, like pgbouncer or pgpool 1 or 2.
[15:39] <herb> stub: gotcha
[15:39] <stub> herb: I realized just now though that upgrade.py won't apply patches to a third replica, which would be bad. So that needs to be fixed.
[15:40] <herb> yeah. that's important.
[15:40] <stub> Or actually, slonik may take care of all that. I need to confirm anyway.
[15:40] <stub> I forget and it is too late for my brain :)
[15:40] <stub> erm... late as in evening
[15:42] <matsubara> all right. I guess that's all unless there are questions for stub
[15:42] <matsubara> thanks stub
[15:42] <matsubara> Thank you all for attending this week's Launchpad Production Meeting. See the channel topic for the location of the logs.
[15:42] <matsubara> #endmeeting
[15:42] <MootBot> Meeting finished at 09:42.
[15:42] <intellectronica> thanks matsubara
[15:42] <flacoste> hey
[15:42] <flacoste> matsubara: question
[15:43] <flacoste> do we need a new roll-out?
[15:43] <flacoste> and i think it applies to everyone here
[15:43] <flacoste> anyone requires a new roll-out?
[15:43] <matsubara> flacoste, I was on vacation and need to check that
[15:43] <matsubara> but I think there's at least danilos' bug to re roll
[15:43] <bac> flacoste: i don't know of any issues for us
[15:43] <danilos> matsubara, flacoste: yes
[15:43] <stub> I thought it was policy to let enough bugs through qa to require a rerollout?
[15:44] <flacoste> we're getting better at QA stub
[15:44] <flacoste> even the code team weren't that late this cycle :-)
[15:45] <matsubara> ok, so we'll need a re-roll for translations. need to check for the other teams, but so far, there's nothing on the radar
[15:46] <stub> We need a counter somewhere - 'Launchpad has been running for n days without the need for a release-critical patch'
[15:46] <Ursinha> stub, :)
[15:47] <matsubara> I think that's all then. thanks everyone
[15:48] <Ursinha> thanks matsubara