=== salgado is now known as salgado-afk
=== danilo_ is now known as danilos
=== danilos is now known as danilo-afk
=== danilo-afk is now known as danilos
=== danilos is now known as danilo-afk
=== Ursinha__ is now known as Ursinha
=== Ursinha is now known as Guest41422
=== mrevell is now known as mrevell-dejeuner
=== mrevell-dejeuner is now known as mrevell
[15:00] #startmeeting
[15:00] Meeting started at 09:00. The chair is matsubara.
[15:00] Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
[15:00] Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues.
[15:00] [TOPIC] Roll Call
[15:00] New Topic: Roll Call
[15:00] Not on the Launchpad Dev team? Welcome! Come "me" with the rest of us!
[15:00] me
[15:00] me
[15:00] Ursinha, flacoste, bigjools, intellectronica, herb
[15:00] me
[15:00] me
[15:00] bac, ping
[15:00] me
[15:01] matsubara, already answered
[15:01] me
[15:01] rockstar, hi
[15:01] me
[15:01] matsubara, hi
[15:01] me
[15:03] ok, stub can join later. everyone else is here.
[15:03] [TOPIC] Agenda
[15:03] New Topic: Agenda
[15:03] * Actions from last meeting
[15:03] * Oops report & Critical Bugs
[15:03] * Operations report (mthaddon/herb/spm)
[15:03] * DBA report (DBA contact)
[15:03] [TOPIC] * Actions from last meeting
[15:03] New Topic: * Actions from last meeting
[15:03] * stub to investigate the fix to avoid staging restore problems
[15:03] * matsubara to chase rockstar about a fix for OOPS-1138CEMAIL12
[15:03] * asked jml about this. It's bug 326056 and had importance raised.
[15:03] * cprov and bigjools to investigate OOPS-1145EA14
[15:03] * Ursinha to file bugs:
[15:03] * Bug 333072: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1143EB189
[15:03] Launchpad bug 326056 in launchpad-bazaar "OOPS on BadStateTransition when reviewing code by mail" [High,Triaged] https://launchpad.net/bugs/326056
[15:03] * Bug 333071: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1145EA14
[15:03] Launchpad bug 333072 in soyuz "AttributeError OOPS on Build:+index" [Undecided,Invalid] https://launchpad.net/bugs/333072
[15:03] Launchpad bug 333071 in soyuz "AssertionError OOPS on +copy-packages" [High,Triaged] https://launchpad.net/bugs/333071
[15:04] 333072 is invalid
[15:04] bigjools, any news about 333071?
[15:04] yes, it's not too serious, we've set it for 2.2.3
[15:04] it's a corner case in the copying
[15:05] despite the doom-mongering error message
[15:05] ok. thanks bigjools
[15:06] [action] matsubara to chase stub about staging restore problems
[15:06] ACTION received: matsubara to chase stub about staging restore problems
[15:06] [TOPIC] * Oops report & Critical Bugs
[15:06] New Topic: * Oops report & Critical Bugs
[15:06] * matsubara hands Ursinha the mic
[15:06] * Ursinha looks
[15:06] * rockstar runs
[15:07] registry, foundations, code and bugs: oopses for you
[15:07] Registry:-
[15:07] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153E919
[15:07] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153A1135 (or foundations, not sure)
[15:07] Foundations:-
[15:07] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153D667
[15:07] Code:
[15:07] https://devpad.canonical.com/~jamesh/oops.cgi/1153E919
[15:07] https://devpad.canonical.com/~jamesh/oops.cgi/1153A1135
[15:07] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1152XMLP1
[15:07] https://devpad.canonical.com/~jamesh/oops.cgi/1153D667
[15:07] Bugs:
[15:07] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1152EA162
[15:07] ~
[15:07] rockstar, ha!
[15:07] rockstar, have you seen this one: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1152XMLP1?
[15:08] Ursinha, looking at all of them now.
[15:08] rockstar, you can just look at code's one :)
[15:08] sinzui, hi
[15:09] sinzui, I'm not sure if https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153A1135 is foundations or registry
[15:09] https://devpad.canonical.com/~jamesh/oops.cgi/1153A1135
[15:09] Ursinha, looks like registry
[15:09] Ursinha: strange. do you see lots of those?
[15:09] Ursinha: yes, looks like registry
[15:09] intellectronica, no, actually not
[15:09] intellectronica, but never saw one of those before
[15:10] so better bring to attention
[15:10] intellectronica, Ursinha that one looks like caused by the rollout
[15:10] Ursinha I don't know the answer either. I will look into it and assign it. I suspect salgado-afk is working on
[15:10] matsubara, even in the time it happened?
[15:10] matsubara: i also thought so
=== salgado-afk is now known as salgado
[15:10] but it is quite early
[15:11] intellectronica, I've discarded the rollout possibility because of its timestamp
[15:11] sinzui, thanks for that
[15:11] yeah, too early to be caused by the rollout.
[15:11] intellectronica, can you take a look then, please?
[15:11] Ursinha, I'll have to investigate our oops. It's the XML-RPC server, and it requires the sacrifice of a virgin goat.
[15:11] check OSAs incident log to see if something happened during that time
[15:11] so, this isn't really a bugs oops, but i don't know whether it's rollout-related or not. fwiw it's more than three hours before rollout, so it's hard to see how it would be related
[15:11] rockstar, oh, I have a bunch here in my backyard if you need some
[15:12] Ursinha, :)
[15:12] intellectronica, I'll do what matsubara suggested
[15:13] [action] ursinha to check OSAs incident log to help identify cause of OOPS-1152EA162
[15:13] ACTION received: ursinha to check OSAs incident log to help identify cause of OOPS-1152EA162
[15:14] thanks intellectronica and matsubara
[15:14] [action] rockstar to investigate xmlrpc oops OOPS-1152XMLP1
[15:14] ACTION received: rockstar to investigate xmlrpc oops OOPS-1152XMLP1
[15:14] flacoste, hi
[15:14] Translations is happy, that POFile:+translate dropped from the timeout top ten now ..
[15:14] btw
[15:15] ;)
[15:15] henninge, indeed, congrats to translate team :)
[15:15] translations
[15:15] there he is :)
[15:15] Ursinha: thank you, I will pass it on.
[15:15] sinzui, about the other oops
[15:16] Sorry - on a call and didn't realize the time
[15:16] bac: can you look at it.
[15:16] Ursinha: they seem to be related (acting for sinzui today)
[15:16] * sinzui is in another meeting
[15:16] hmm
[15:16] i'd say registry
[15:17] yes, i think registry for both
[15:17] Ursinha are you talking about OOPS-1153A1135?
[15:17] https://devpad.canonical.com/~jamesh/oops.cgi/1153A1135
[15:17] bac, hi :) so, can you take a look in both oopses? do you need me to file bugs about them?
[15:17] * Ursinha looks
[15:17] Ursinha: yes i'll look at them both
[15:17] i can open the bugs
[15:17] unless you need the karma
[15:17] flacoste, no, https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153D667
[15:17] https://devpad.canonical.com/~jamesh/oops.cgi/1153D667
[15:17] bac, haha, no
[15:18] Ursinha: that's also a registry query
[15:18] [action] bac to file bugs and take care of https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153E919 and https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1153A1135
[15:18] https://devpad.canonical.com/~jamesh/oops.cgi/1153E919
[15:18] https://devpad.canonical.com/~jamesh/oops.cgi/1153A1135
[15:18] [action] bac to file bugs for OOPS-1153E919 and OOPS-1153A1135
[15:18] ACTION received: bac to file bugs for OOPS-1153E919 and OOPS-1153A1135
[15:18] https://devpad.canonical.com/~jamesh/oops.cgi/1153E919
[15:18] https://devpad.canonical.com/~jamesh/oops.cgi/1153A1135
[15:18] https://devpad.canonical.com/~jamesh/oops.cgi/1153E919
[15:18] https://devpad.canonical.com/~jamesh/oops.cgi/1153A1135
[15:19] wow, y'all are insistent today! :)
[15:19] :)
[15:19] flacoste, hm.
[15:19] thanks
[15:19] bac, can you take a look at that too?
[15:20] which?
[15:20] promise not to paste the oops again
=== danilo-afk is now known as danilos
[15:20] bac, https://devpad.canonical.com/~jamesh/oops.cgi/1153D667
[15:20] I tried :)
[15:20] * bac looks
[15:20] yes
[15:21] bac, thanks
[15:21] that's all from me from the oops land
[15:21] [action] bac to also file a bug and take care of OOPS-1153D667
[15:22] https://devpad.canonical.com/~jamesh/oops.cgi/1153D667
[15:22] ACTION received: bac to also file a bug and take care of OOPS-1153D667
[15:22] https://devpad.canonical.com/~jamesh/oops.cgi/1153D667
[15:22] ok, thanks everyone.
[15:22] [TOPIC] * Operations report (mthaddon/herb/spm)
[15:22] New Topic: * Operations report (mthaddon/herb/spm)
[15:22] there's one critical bug, though
[15:22] argh
[15:22] bad bad timing
[15:23] shall I wait for the critical bug?
[15:23] herb, just a second, let me check with henninge
[15:23] danilo is handling the critical bug, so won't duplicate what's in the bug report.
[15:23] it's bug 334787
[15:23] matsubara, okay, if you say so
[15:23] Launchpad bug 334787 in rosetta "Ubuntu packagers are not translation editors (assertion error)" [Critical,In progress] https://launchpad.net/bugs/334787
[15:23] let's move on
[15:24] go ahead herb, thanks
[15:24] 2009-02-20 - We had an issue that may have caused some users to experience intermittent outages on Launchpad. I worked with joey and flacosted to find the issue. joey's notes were sent to the list. I would be interested in hearing any updates we might have on this issue.
[15:24] 2009-02-21 and 2009-02-22 - It appears we had a bit of buggy code land on edge that caused a performance problem on both edge and production. The revision was backed out and I believe the code has been fixed.
[15:24] 2009-02-26 - We rolled out 2.2.2 based on r7763
[15:24] We continue to see problems relating to bug #156453 and bug #118625. So much so that we're going to start bouncing codebrowse regularly to hopefully head off any issues. I want to emphasize that this will be masking the problem and we really do need to find the root cause and fix it.
[15:24] Launchpad bug 156453 in loggerhead "production loggerhead branch leaks memory" [Critical,Triaged] https://launchpad.net/bugs/156453
[15:24] Launchpad bug 118625 in launchpad-bazaar "codebrowse sometimes hangs" [High,Triaged] https://launchpad.net/bugs/118625
[15:24] Bug #260171 continues to creep up regularly (every few days). This is already marked as high and I know that mwhudson's plate is full with codebrowse issues, but can we get an update on this one?
[15:24] Bug 260171 on http://launchpad.net/bugs/260171 is private
[15:24] * herb somehow managed to change flacoste into a verb.
[15:24] matsubara, Ursinha: I am running tests on the critical bug fix, will let you know once it has landed
[15:25] i saw!
[15:25] i've been flacosted!
[15:25] danilos, thanks
[15:25] thanks danilos
[15:25] rockstar, can you bring up the codebrowse issue to the code team?
[15:25] matsubara, everyday. :)
[15:25] rockstar, thanks :-)
[15:26] Codebrowse is being ACTIVELY worked on. It'd be nice if we knew what the issue is. Right now, we're just fixing things and hoping that was the problem.
[15:26] rockstar: let the losas know if there is anything we can do to help.
[15:26] herb, we certainly will.
[15:27] Should we be bringing in any outside help to instrument, test and diagnose the issue?
[15:27] herb, anything happened to the DB during the time of this OOPS-1152EA162?
[15:28] or maybe stub might know ^
[15:28] matsubara: nothing in the incident log.
[15:28] matsubara: That is one of the connection reaper scripts kicking in
[15:29] matsubara: I think that's also on the void between LOSAs.
[15:29] ah, there we go.
[15:29] We kill connections idle in a transaction more than a few hours (and should be more aggressive), and appserver connections that have been in a transaction for more than 2 minutes.
[15:30] stub, I see
[15:30] stub, ok. so if we start seeing too many of those, we have a problem somewhere and a few is kinda normal?
[15:30] The notification gets sent to the error-reports list (where we can confirm that this is indeed what happened)
[15:31] stub, aha. that's better. I'll chase the lp-errors for that one
[15:31] s/lp-errors/lp-errors list/
[15:31] If we see many of them, we have a problem. One is probably a problem - appserver requests taking two minutes on the db means we need to investigate why the normal timeout mechanisms didn't work.
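The reaper policy stub describes above can be sketched roughly as follows. This is a minimal illustration, not Launchpad's actual reaper script: the Connection record, its field names, and the three-hour value standing in for "a few hours" are assumptions made for the example. Only the two-minute appserver limit comes from the discussion.

```python
from dataclasses import dataclass
from datetime import timedelta

# Assumed value: the log only says "a few hours".
IDLE_IN_TRANSACTION_LIMIT = timedelta(hours=3)
# Stated in the discussion: appserver transactions older than 2 minutes.
APPSERVER_TRANSACTION_LIMIT = timedelta(minutes=2)


@dataclass
class Connection:
    """Hypothetical snapshot of one database connection."""
    pid: int
    is_appserver: bool
    in_transaction: bool
    idle: bool
    transaction_age: timedelta


def should_reap(conn: Connection) -> bool:
    """Return True if the connection matches either reaping rule."""
    if not conn.in_transaction:
        return False
    # Rule 1: appserver connections held in a transaction too long.
    if conn.is_appserver and conn.transaction_age > APPSERVER_TRANSACTION_LIMIT:
        return True
    # Rule 2: any connection sitting idle inside a long-open transaction.
    return conn.idle and conn.transaction_age > IDLE_IN_TRANSACTION_LIMIT


def connections_to_reap(connections):
    """Return the pids of all connections that should be killed."""
    return [c.pid for c in connections if should_reap(c)]
```

In this sketch a single reaped appserver connection already signals that a request ran for two minutes on the db, which matches the point that even one occurrence deserves a look.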
[15:31] [action] matsubara to look at the lp-errors list to determine cause of OOPS-1152EA162
[15:31] ACTION received: matsubara to look at the lp-errors list to determine cause of OOPS-1152EA162
[15:32] right. thanks for the explanation
[15:32] -1 second non-sql time, 0 seconds total time indicates a problem at the appserver? The request never got started?
[15:33] I'll file a bug about that one and we can discuss there
[15:34] hmm... might be a reconnection bug - perhaps the previous request handled by that thread got killed?
[15:34] I don't know if we Retry on DisconnectionError exceptions, or if it is a good idea in all cases.
[15:35] ok
[15:35] [TOPIC] * DBA report (stub)
[15:35] New Topic: * DBA report (stub)
[15:35] and thanks herb and stub
[15:36] New hardware exists and is being brought online by IS. I've realized I might need to tweak the db maintenance scripts (upgrade.py, security.py etc.) to cope with a third replica - I think it only copes with a single master and slave at the moment.
[15:36] Staging can be moved by the LOSAs as soon as the hardware is available and they have time, which will move that load from the production systems.
[15:36] I assume the rollout went fine as far as the db upgrade procedure goes.
[15:37] I assume it did too. I didn't hear any complaints from my colleagues.
[15:37] stub, great news! with the new hardware we won't have the staging restore problems anymore?
[15:37] stub: what's the plan with the 3rd replica?
[15:38] The staging restore problems should no longer be a problem.
[15:38] * herb feels like he missed something
[15:38] herb: We can start by pointing half the appservers at the new slave when it is online. We really should get a connection pool/load balancer thingy though running like pgbouncer, pgpool 1 or 2.
[15:39] stub: gotcha
[15:39] herb: I realized just now though that upgrade.py won't apply patches to a third replica, which would be bad. So that needs to be fixed.
[15:40] yeah. that's important.
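The idea of pointing half the appservers at the new slave, mentioned above, can be sketched as a simple round-robin assignment. This is only an illustration under assumptions: the function name and the appserver and slave names are made up, and nothing here reflects how Launchpad's configuration is actually managed.

```python
from itertools import cycle


def assign_slaves(appservers, slaves):
    """Map each appserver to a slave database, alternating between slaves.

    With two slaves, every other appserver points at the second one,
    so each slave serves roughly half of the appservers.
    """
    if not slaves:
        raise ValueError("at least one slave is required")
    # zip stops when the (finite) appserver list is exhausted.
    return dict(zip(appservers, cycle(slaves)))


# Hypothetical host names, for illustration only.
mapping = assign_slaves(
    ["app1", "app2", "app3", "app4"],
    ["old-slave", "new-slave"],
)
```

A static split like this is the stopgap; a connection pooler or load balancer such as pgbouncer or pgpool, as suggested in the discussion, would take over that job.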
[15:40] Or actually, slonik may take care of all that. I need to confirm anyway.
[15:40] I forget and it is too late for my brain :)
[15:40] erm... late as in evening
[15:42] all right. I guess that's all unless there are questions for stub
[15:42] thanks stub
[15:42] Thank you all for attending this week's Launchpad Production Meeting. See the channel topic for the location of the logs.
[15:42] #endmeeting
[15:42] Meeting finished at 09:42.
[15:42] thanks matsubara
[15:42] hey
[15:42] matsubara: question
[15:43] do we need a new roll-out?
[15:43] and i think it applies to everyone here
[15:43] anyone requires a new roll-out?
[15:43] flacoste, I was on vacation and need to check that
[15:43] but I think there's at least danilos' bug to re roll
[15:43] flacoste: i don't know of any issues for us
[15:43] matsubara, flacoste: yes
[15:43] I thought it was policy to let enough bugs through qa to require a rerollout?
[15:44] we're getting better at QA stub
[15:44] even the code team weren't that late this cycle :-)
[15:45] ok, so we'll need a re-roll for translations. need to check for the other teams, but so far, there's nothing on the radar
[15:46] We need a counter somewhere - 'Launchpad has been running for n days without need for a release critical patch'
[15:46] stub, :)
[15:47] I think that's all then. thanks everyone
[15:48] thanks matsubara
=== matsubara is now known as matsubara-lunch
=== matsubara-lunch is now known as matsubara
=== salgado is now known as salgado-lunch
=== salgado-lunch is now known as salgado
=== thumper_laptop is now known as thumper
=== Ursinha is now known as Ursinha-fud
=== salgado is now known as salgado-afk