[16:00] <matsubara> #startmeeting
[16:00] <MootBot> Meeting started at 10:00. The chair is matsubara.
[16:00] <MootBot> Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
[16:00] <matsubara> Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues.
[16:00] <matsubara> [TOPIC] Roll Call
[16:00] <MootBot> New Topic:  Roll Call
[16:00] <rockstar> me
[16:01] <herb> me
[16:01] <cprov> me
[16:01] <sinzui> me
[16:01] <matsubara> Ursinha:
[16:01] <Ursinha> me
[16:01] <stub> me (on the right server this time)
[16:02] <danilos> me (if no call)
[16:02] <flacoste> me
[16:02] <matsubara> intellectronica: hi
[16:03] <intellectronica> me
[16:03] <matsubara> all right, everyone here
[16:03] <matsubara> [TOPIC] Agenda
[16:03] <MootBot> New Topic:  Agenda
[16:03] <matsubara>  * Actions from last meeting
[16:03] <matsubara>  * Oops report & Critical Bugs
[16:03] <matsubara>  * Operations report (mthaddon/herb/spm)
[16:03] <matsubara>  * DBA report (stub)
[16:03] <matsubara> [TOPIC] * Actions from last meeting
[16:03] <MootBot> New Topic:  * Actions from last meeting
[16:03] <matsubara>   * intellectronica to make efforts to take a look at bug 329908
[16:03] <matsubara>   * sinzui to talk to kiko about pending cp requests
[16:04] <intellectronica> matsubara: that's fixed
[16:04] <matsubara> well, sinzui's one is not needed anymore since that's been released
[16:04] <matsubara> thanks intellectronica
[16:04] <sinzui> matsubara: I removed the requests because it was close to the rollout and the items were not critical
[16:04] <matsubara> sinzui: sure. thanks for checking
[16:04] <matsubara> moving on
[16:04] <matsubara> [TOPIC] * Oops report & Critical Bugs
[16:04] <MootBot> New Topic:  * Oops report & Critical Bugs
[16:05]  * sinzui has a question about what is critical for unmaintained apps
[16:05] <matsubara> Ursinha: ?
[16:05] <Ursinha> me
[16:05] <Ursinha> 4 bugs to talk about
[16:06] <Ursinha> matsubara wants to talk about bug 353530
[16:06] <Ursinha> • bigjools, bug 347194, fixed as RC but still appears on lpnet
[16:06] <Ursinha> • sinzui: bug 353863
[16:06] <Ursinha> • bigjools, bug 353568, timeout at +source/package page
[16:06] <matsubara> sinzui: good question. You mean blueprint stuff?
[16:06] <Ursinha> should we raise bug 353568 to critical?
[16:06] <matsubara> sinzui: I think we need to raise that question in the list
[16:07] <matsubara> cprov: what's up with the ones bigjools fixed?
[16:07] <flacoste> me again
[16:07] <matsubara> hi francis
[16:07] <flacoste> another X lock-up
[16:07] <flacoste> what did i miss?
[16:07] <matsubara> we're doing the oops section
[16:08] <Ursinha> flacoste, the bugs we'll discuss
[16:08] <sinzui> Ursinha: That looks like a critical bug to me
[16:08] <cprov> matsubara: I don't know, AFAICT it's not fixed.
[16:08] <matsubara> so far nothing for foundations
[16:08] <sinzui> Ursinha: I will give it to salgado who is already looking into login/account issues
[16:08] <Ursinha> sinzui, I couldn't reproduce that, don't know if matsubara tried that
[16:08] <matsubara> those oopses are likely to be candidates for RC and next re-roll
[16:08] <Ursinha> for sure
[16:08] <matsubara> Ursinha: I did not
[16:09] <Ursinha> thanks sinzui
[16:09] <flacoste> what login/account issues are we having?
[16:09] <sinzui> Ursinha: salgado saw many oopses he could not reproduce, but I think he can at least explain why
[16:09] <cprov> matsubara: I will look at it this afternoon, maybe I can do something quick to stop the timeout in production
[16:09] <Ursinha> flacoste, bug 353863
[16:09] <salgado> I'll need help with this one
[16:09] <matsubara> re: bug 353530, intellectronica could you take a look? it's about the OOPS in filing a bug using the email interface, but I'm not sure that specific oops is under Bugs' responsibility
[16:10] <matsubara> cprov: cool. thanks
[16:10] <intellectronica> matsubara: according to steve's comment that's another case of missing permissions
[16:10] <intellectronica> but i'm not clear whether it was dealt with. i'll check
[16:10] <matsubara> I'm going to add those to the CurrentRolloutBlockers page and use that page to coordinate things that will go in for the re-roll
[16:10] <Ursinha> matsubara, afaik that was just fixed by adding the user to the conf file in the server
[16:10] <matsubara> intellectronica: seems to be dealt with, but my question is more in the sense on how we can avoid that in the future
[16:11] <Ursinha> as per spm explanations
[16:11] <Ursinha> to me
[16:11] <matsubara> so, apparently it was an unusual rollout requirement but nobody added it there
[16:12] <matsubara> Ursinha: don't say server, we have at least 10 "servers" out there :-)
[16:12] <Ursinha> matsubara, sorry :) s/server/server in which the conf was missing/
[16:12] <matsubara> anyway, glancing at it, could be that the slaves were missing the right config?
[16:13] <intellectronica> so it seems
[16:13] <rockstar> matsubara, might that be a question for the db report section?
[16:13] <flacoste> Ursinha, matsubara: we should add a test for missing permissions
[16:13] <flacoste> matsubara: did you file a bug about the one you wanted me to discuss with stub?
[16:14] <matsubara> flacoste: nope, but I have the pastebin here. I'll file a bug about it right after the meeting
[16:14] <matsubara> [action] matsubara to file a bug about the missing select permissions that delayed the rollout
[16:14] <MootBot> ACTION received:  matsubara to file a bug about the missing select permissions that delayed the rollout
[16:14] <flacoste> thanks
[16:15] <matsubara> [action] cprov to look up soyuz bugs 347194, 353568
[16:15] <MootBot> ACTION received:  cprov to look up soyuz bugs 347194, 353568
[16:15] <cprov> matsubara: the first one is fixed
[16:15] <matsubara> err, sorry about that, I'll edit that entry
[16:16] <matsubara> [action] matsubara to edit #347194 out of the last action :-)
[16:16] <MootBot> ACTION received:  matsubara to edit #347194 out of the last action :-)
[16:16] <cprov> matsubara: some errors happened yesterday because I had to reprocess a bunch of binary uploads that failed after the rollout (due to the absence of the launchpad_auth DB user)
[16:17] <Ursinha> cprov, now it makes sense
[16:17] <matsubara> ah, so that also affected other things other than the email interface.
[16:17] <Ursinha> thanks :)
[16:18] <cprov> Ursinha: yes, it was a nightmare, because the buildfarm was full and binaries could not be processed due to the lack of DB access
[16:18] <matsubara> [action] matsubara to include francis suggestion to bug 353530 and ursinha to summarize what spm told her
[16:18] <MootBot> ACTION received:  matsubara to include francis suggestion to bug 353530 and ursinha to summarize what spm told her
[16:18] <Ursinha> indeed
[16:19] <matsubara> salgado: how can we help you with that one?
[16:19] <salgado> matsubara, I'll let you know once I know. :)
[16:20] <matsubara> [action] salgado to debug and fix bug 353863
[16:20] <MootBot> ACTION received:  salgado to debug and fix bug 353863
[16:20] <matsubara> I think I addressed everything
[16:21] <danilos> Ursinha: has there been any outcome of the timeout discussion?
[16:21] <matsubara> so, as usual after the release we are going to monitor the oops reports constantly and coordinate with the teams about any new oopses
[16:21] <Ursinha> danilos, I'm going to talk about it with stub in his section
[16:21] <danilos> Ursinha: ok, thanks
[16:21] <danilos> sorry for not following the script, I forgot my lines :)
[16:21] <Ursinha> danilos, :)
[16:22] <matsubara> [action] sinzui to email the list how we should address critical bugs on unmaintained apps (e.g. blueprint)
[16:22] <MootBot> ACTION received:  sinzui to email the list how we should address critical bugs on unmaintained apps (e.g. blueprint)
[16:22] <matsubara> sinzui: ^ is that correct?
[16:22] <sinzui> matsubara: yes
[16:22] <matsubara> ok, I think that's all for this section. All the critical ones are being handled
[16:23] <matsubara> thanks everyone
[16:23] <matsubara> [TOPIC] * Operations report (mthaddon/herb/spm)
[16:23] <MootBot> New Topic:  * Operations report (mthaddon/herb/spm)
[16:23] <herb> 2009-03-30 - Experienced some DB problems that affected the service. Launchpad was unavailable for approximately 9 minutes. stub sent out an email summarizing the issues.
[16:23] <herb> 2009-03-30 - Cherry picked r8054 and part of r7999.
[16:23] <herb> 2009-04-01 - Rollout of 2.2.3. Total downtime was approximately 100 minutes. I think there were a few hiccups on some DB permissions, but I haven't had an opportunity to catch up with mthaddon and spm on the details.
[16:23] <herb> Bug 156453 and bug 118625 continue to be a source of discomfort. I think rockstar has an update on these though.
[16:23] <herb> Bug 80895 and bug 119420 are a pain point for the LOSAs. I think something may have been scheduled for this cycle on this front. If so that's a total win from our point of view.
[16:23] <herb> When do we think we'll be doing a re-roll?
[16:24] <rockstar> herb, I can has update!
[16:24] <rockstar> :)
[16:24] <herb> woo!
[16:24] <rockstar> So we have a memory middleware currently that's allowing us to track down memory issues.
[16:24] <rockstar> herb, also, mwhudson and jam have been writing a C-based memory profiler as well, so we can track refs even better in bzrlib itself.
[16:25] <herb> excellent
[16:25] <matsubara> herb: I'll let you know about the re-roll once we know. :-)
[16:25] <herb> matsubara: appreciated.
[16:26] <rockstar> herb, unfortunately, I can't really tell if the "sometimes hangs" bug is related to the "leaks memory" bug.
[16:26] <matsubara> herb: re: the DB permission, I'm going to file a bug about it and flacoste and stub will discuss it :-)
[16:26] <herb> rockstar: I suspect so, but fixing the memory issue would be a huge win.
[16:26] <stub> it's not a bug, it was an operational issue
[16:26] <Ursinha> indeed
[16:27] <rockstar> herb, yes.  If they are unrelated, it's probably a bug in one of our dependencies.
[16:27] <stub> erm... if you are talking about the same one I'm thinking of.
[16:27] <matsubara> stub: I'm talking about the permission for the SSO user
[16:27] <stub> ok. different ;)
[16:27] <matsubara> :-)
[16:28] <matsubara> ok, anything else for herb?
[16:28] <matsubara> thanks herb.
[16:28] <herb> thanks matsubara
[16:29] <matsubara> and thank mthaddon and spm for handling the rollout so well too!
[16:29] <matsubara> moving on.
[16:29] <herb> matsubara: will do
[16:29] <matsubara> [TOPIC] * DBA report (stub)
[16:29] <MootBot> New Topic:  * DBA report (stub)
[16:29] <stub> Today's database update ran in about 100 mins with all replicas enabled. Earlier calculations indicated the downtime would be a bit under three hours. The discrepancy is because staging isn't as powerful, and normal staging operations are underway during the restore.
[16:29] <stub> This was good from a downtime perspective, but does mean we can no longer get reliable rollout timings from staging. When rollout times are a concern, we might have to test the database upgrade process on a production server and calculate the time from there.
[16:29] <stub> I want to switch our master database to the new 16 core box from the current 8 core box in the next two weeks. This will require a few minutes downtime - I think a scheduled 10 minute outage will suffice. We might want to double up if there is other downtime required in the near future.
[16:29] <stub> A few days ago, generating a table bloat report managed to mess up PostgreSQL, causing all queries to the master to generate nothing but errors. A forced restart was required, causing a few minutes of downtime in total. The cause has been tracked down and is being worked on upstream, and we can avoid it now we know what it is (don't feed temporary tables to pgstattuple).
[16:29] <stub> I've opened a couple of bugs about batch jobs that are taking too long. I generally don't care how long things take as long as their impact is light, but staging updates and post rollout processes are approaching 24 hours...
[16:30] <stub> A number of problems were caused by missing PostgreSQL authorization for the new launchpad_auth user on production. This authorization was added to staging, but missed getting into the production rollout tasks. spm sorted it a few hours after the rollout as I understand it. This is a purely operational issue outside the scope of our test suite (staging is the test bed for database connection authorizations). Ignore OOPSes and bugs like 3535
[16:30] <stub> All from me.
[16:30] <stub> Bug 353530
[16:31] <Ursinha> stub, I have one oops, I don't know if it was just a hiccup
[16:31] <Ursinha> stub, https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1188D1214
[16:31] <matsubara> [action] matsubara to talk to mrevell to announce a maintenance in the DB for about 10 min outage in the next 2 weeks. ask mrevell to talk to stub about it
[16:31] <MootBot> ACTION received:  matsubara to talk to mrevell to announce a maintenance in the DB for about 10 min outage in the next 2 weeks. ask mrevell to talk to stub about it
[16:31] <stub> Ursinha: That's a bug needing fixing.
[16:32] <Ursinha> stub, I'll file a bug about it now
[16:32] <Ursinha> about the timeouts we mentioned during the week
[16:33] <Ursinha> it seems they indeed dropped
[16:33] <Ursinha> the main culprit now is the source package index page
[16:33] <Ursinha> danilos, ^
[16:33] <stub> Ok. So we need to be even less aggressive doing mass data migration.
[16:34] <Ursinha> if the timeouts continue over the next few days, we'll have to chase another cause.
[16:34] <danilos> stub, Ursinha: we'll have something similar coming up, how can we make sure the impact is not felt on our production machines?
[16:35] <stub> danilos: Either set the acceptable lag setting lower, or a cooldown time after each batch.
[16:36] <herb> stub: or both?
[16:36] <danilos> stub: ok, I guess we'll have to experiment with these
[16:36] <stub> or both
[16:36] <matsubara> ok. I guess that's all for stub?
[16:36] <matsubara> thanks stub
[16:37] <Ursinha> thanks stub
[16:37] <matsubara> I have a minor announcement that I forgot to add to the agenda
[16:37] <matsubara> Next week is our second performance week
[16:37] <matsubara> so, please add the bugs you're going to work on in https://dev.launchpad.net/PerformanceWeeks/April2009
[16:38] <matsubara> and I think that's all
[16:38] <matsubara> anything else before I close?
[16:38] <matsubara> 3
[16:38] <matsubara> 2
[16:38] <matsubara> 1
[16:38] <matsubara> Thank you all for attending this week's Launchpad Production Meeting. See the channel topic for the location of the logs.
[16:38] <Ursinha> stub, bug 353897
[16:39] <matsubara> #endmeeting
[16:39] <MootBot> Meeting finished at 10:39.
[16:39] <flacoste> stub: do you know why that bug is happening?
[16:39] <flacoste> Ursinha: i guess we should fix this before the re-roll?
[16:40] <stub> flacoste: a login.launchpad.net page is trying to access the MAIN_STORE, MASTER_FLAVOR which is disallowed (because it needs to keep running when lp is down for maintenance)
[16:40] <Ursinha> flacoste, 5 occurrences we have registered
[16:40] <flacoste> on loging!
[16:40] <flacoste> ok, this needs to be fixed
[16:40] <Ursinha> on loging
[16:41] <flacoste> create_unique_token_for_table
[16:41] <stub> He is spelling login with a Canadian accent
[16:41] <Ursinha> lol
[16:41] <flacoste> lol
[16:41] <flacoste> French Canadian accent!
[16:43] <stub> flacoste: Or more precisely, a login.launchpad.net page is attempting to create a LoginToken (which it can't) instead of an AuthToken (which it can)
[16:43] <flacoste> ok
[16:43] <flacoste> stub: i'll try to give it a shot this afternoon
[16:45] <stub> flacoste: It is a twisty maze
[16:46] <flacoste> stub: but i might punt it to you if i cannot complete it :-)
[16:46] <stub> flacoste: Salgado loves the authentication system.
[16:46] <flacoste> i think he has his share of problems already