[16:00] <matsubara> #startmeeting
[16:00] <MootBot> Meeting started at 10:00. The chair is matsubara.
[16:00] <MootBot> Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
[16:00] <matsubara> Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues.
[16:00] <matsubara> [TOPIC] Roll Call
[16:00] <MootBot> New Topic:  Roll Call
[16:00] <matsubara> Not on the Launchpad Dev team? Welcome! Come "me" with the rest of us!
[16:00] <sinzui> me
[16:00] <al-maisan> me
[16:00] <danilo_> me
[16:01] <mrjazzcat> me
[16:01] <mbarnett> me
[16:01] <matsubara> sorry mrjazzcat, I always forget to ping you about the meeting. I'll add you to the "Who should be here?" section if you don't mind
[16:01] <mrjazzcat> yes, please
[16:01] <matsubara> on the MeetingAgenda page, I mean
[16:01] <mrjazzcat> no worries
[16:02] <matsubara> [action] add brian to the list of attendees in the MeetingAgenda page
[16:02] <MootBot> ACTION received:  add brian to the list of attendees in the MeetingAgenda page
[16:02] <matsubara> Ursula won't be around today
[16:02] <matsubara> and I'll be standing in for Gary
[16:02] <matsubara> rockstar, hi, around?
[16:03] <matsubara> allenap, hi
[16:03] <matsubara> well, let's move on and then Gavin and Paul can join in later
[16:03] <matsubara> [TOPIC] Agenda
[16:03] <MootBot> New Topic:  Agenda
[16:03] <matsubara>  * Actions from last meeting
[16:03] <matsubara>  * Oops report & Critical Bugs & Broken scripts
[16:03] <matsubara>  * Operations report (mthaddon/Chex/spm/mbarnett)
[16:03] <matsubara>  * DBA report (stub)
[16:03] <matsubara>  * Proposed items
[16:03] <matsubara> [TOPIC] * Actions from last meeting
[16:03] <MootBot> New Topic:  * Actions from last meeting
[16:04] <matsubara>  * allenap to dig the master bug of OOPS-1474EA771
[16:04] <matsubara>  * salgado to take a look in the TypeError oopses (OOPS-1479S1000)
[16:04] <matsubara>    * already did that, this is bug 403281, it happened because mthaddon was testing the new read-only switch on staging.
[16:04] <matsubara>  * rockstar to take a look in OOPS-1480CMP1
[16:04] <matsubara> ok, so I'll re-add both items for allenap and rockstar
[16:05] <matsubara> [action] * allenap to dig the master bug of OOPS-1474EA771
[16:05] <MootBot> ACTION received:  * allenap to dig the master bug of OOPS-1474EA771
[16:05] <matsubara> [action] * rockstar to take a look in OOPS-1480CMP1
[16:05] <MootBot> ACTION received:  * rockstar to take a look in OOPS-1480CMP1
[16:05] <matsubara> [TOPIC] * Oops report & Critical Bugs & Broken scripts
[16:05] <MootBot> New Topic:  * Oops report & Critical Bugs & Broken scripts
[16:05] <matsubara> we have some oops reports but most of them are foundations issues
[16:06] <matsubara> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA884
[16:06] <matsubara> Looks like an anonymous user is trying to do some operation which (s)he's not allowed. Should we really log an oops for this?
[16:06] <matsubara> maybe related to https://bugs.edge.launchpad.net/launchpad-foundations/+bug/271029
[16:06] <matsubara> More non-informational disconnectionerrors https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489J147
[16:06] <matsubara> InternalError after the rollout https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489C1094
[16:06] <matsubara> code team, BranchMergeProposalExists https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA174
[16:06] <matsubara> so, that's it and there's no one from Code to take a look at the BranchMergeProposalExists one
[16:06] <matsubara> [action] matsubara to email Tim about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA174
[16:06] <MootBot> ACTION received:  matsubara to email Tim about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA174
[16:07] <matsubara> [action] matsubara to talk to leonard about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA884
[16:07] <MootBot> ACTION received:  matsubara to talk to leonard about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1488EA884
[16:07] <matsubara> [action] matsubara to talk to salgado about More non-informational disconnectionerrors https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489J147
[16:07] <MootBot> ACTION received:  matsubara to talk to salgado about More non-informational disconnectionerrors https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489J147
[16:07] <matsubara> [action] matsubara to talk to stub or gary about InternalError after the rollout https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489C1094
[16:07] <MootBot> ACTION received:  matsubara to talk to stub or gary about InternalError after the rollout https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1489C1094
[16:07] <matsubara> lovely, looks like I'm running the meeting all by myself heheh
[16:07] <rockstar> me
[16:08] <allenap> me
[16:08] <al-maisan> :)
[16:08] <matsubara> on the broken scripts side
[16:08] <matsubara> sinzui, Scripts failed to run: loganberry:send-person-notifications seems to be broken
[16:09] <matsubara> sinzui, could you take a look and reply to the list?
[16:09] <sinzui> matsubara: all scripts appear to be broken
[16:09] <matsubara> all?
[16:09] <sinzui> They are not running and I am tempted to say something new was added that is taking forever and a day
[16:09] <matsubara> I only see notifications for send-person-notifications and garbo-hourly
[16:10] <matsubara> sinzui, can you confirm and reply to the list that's the case, at least for the send-person-notifications one?
[16:10] <matsubara> I'll ask losas and/or stub about garbo-hourly not running as well
[16:10] <allenap> matsubara: Re. OOPS-1474EA771, it's bug 508302, and deryck is working on it today.
[16:10] <matsubara> thanks allenap, I'll adjust the bug link on that oops report
[16:11] <matsubara> [action] matsubara to fix bug link on OOPS-1474EA771 to point to bug 508302
[16:11] <MootBot> ACTION received:  matsubara to fix bug link on OOPS-1474EA771 to point to bug 508302
[16:11] <matsubara> [action] sinzui to investigate failure on send-person-notifications and reply to the list with his findings
[16:11] <MootBot> ACTION received:  sinzui to investigate failure on send-person-notifications and reply to the list with his findings
[16:13] <matsubara> btw, updatebranches script also failed recently but that's been fixed by spm. the new rollout changed the script name and losas updated the notification thing to recognize the new name
[16:13] <matsubara> on the critical bugs side
[16:13] <rockstar> matsubara, updatebranches no longer runs.
[16:13] <mthaddon> matsubara: er, not quite
[16:13] <matsubara> we have 3 critical bugs
[16:13] <rockstar> It's been replaced by scan_branches
[16:13] <mthaddon> matsubara: we've had to revert it a bunch of times
[16:14] <matsubara> mthaddon, hmm no? spm's email seems to indicate that
[16:14] <mthaddon> matsubara: spm went to bed a while ago - a new problem was discovered since then
[16:14] <mthaddon> matsubara: abentley and Chex have been working on it
[16:14] <matsubara> oh, I was looking at this latest email to the list replying to one of the script failures notification
[16:15] <matsubara> well, if they're already working on it, it's ok. :-)
[16:15] <mthaddon> matsubara: not really...
[16:15] <matsubara> mthaddon, no? what else is expected?
[16:15] <mthaddon> matsubara: as I understand it, we've reverted to the old script because we still don't know what was wrong
[16:15] <mthaddon> matsubara: and the fact that we've reverted between the old and new scripts twice now on production is a problem in itself
[16:16] <mthaddon> matsubara: and also the fact that the first we heard about the problem was from a user report
[16:16] <matsubara> mthaddon, I meant it's ok in the sense that people are already working on a solution and there's nothing much to be done during this meeting to have people act on it
[16:16] <mthaddon> i.e. we don't have a good measure of when this problem is even happening
[16:17] <mthaddon> matsubara: maybe not, but I'd like a bit of discussion about this class of problem and what can be done to prevent it in the future
[16:17] <danilo_> mthaddon: what is the exact problem that we need to be able to track? (sorry, I am not fully up to date on what broke)
[16:17] <mthaddon> danilo_: aiui email notifications of branch updates failed to be sent out
[16:18] <gary_poster> mthaddon: "reverted ...twice...on production": I think we all agree this sucks.  However, AIUI, this was successfully QAd.  Either the QA was bad, or staging is not close enough to prod in some way.  I don't think we know yet.
[16:18] <matsubara> mthaddon, I'm unaware of the details as well. My expectation is that an IncidentLog will be filed and actions to prevent it will be included in the incident log
[16:18] <danilo_> mthaddon: ah, right, that could have a bigger impact (it might be harming us in translations as well)
[16:19] <mthaddon> matsubara: this doesn't really qualify as an incident log item since there's no measurable service that's been interrupted (we don't have any kind of nagios monitoring of this) - I guess I'm asking how we plan to approach it from here
[16:20] <mthaddon> and how we got into this situation
[16:21] <danilo_> mthaddon, gary_poster, matsubara: we are obviously missing a dedicated "communications person" for this specific item (someone to keep the entire situation in check); we've discussed that approach before, it'd be nice to find someone who can offload the communication side from abentley and others working on it
[16:21] <gary_poster> danilo_: to the degree there's a failure there (communications), it'd probably be mine as RM
[16:21] <gary_poster> maybe we can have somebody else too
[16:21] <gary_poster> but that's RM stuff
[16:22] <gary_poster> but AIUI that's not the prob
[16:22] <danilo_> gary_poster, not necessarily, we discussed this in a TL call a few weeks (months?) back where we need someone to communicate with everyone
[16:22] <gary_poster> maybe so
[16:22] <gary_poster> but probs I see:
[16:23] <danilo_> gary_poster, it's mostly about having someone take responsibility for making sure problems are visible and we know what's going on
[16:23] <gary_poster> - we didn't catch this on staging.  Why?
[16:23] <gary_poster> either QA was bad or staging is too diff
[16:23] <gary_poster> we need to know why
[16:23] <gary_poster> and fix it
[16:23] <mthaddon> yep, I agree with that
[16:24] <gary_poster> then also, unless I misunderstand, mthaddon is saying that we don't have an automated nagios-like process verifying basic success on production for this thing
[16:24] <danilo_> gary_poster, neither of those is easy to fix (one depends on people always DTRT, another on machines always DTRT), so we need to be able to easily find out when it's broken rather than wait for users to report it
[16:24] <gary_poster> danilo_: but doesn't that depend on one of the three things I said?  (people DTRT, machines DTRT, nagios-like-thing DTRT)
[16:25] <danilo_> gary_poster, it does, I was typing before you typed the last one :)
[16:25] <mthaddon> gary_poster: it's possible we can't do that for *everything*, but if we decide this is a sufficiently important thing that we care about it if it fails, it sounds like we need to monitor it somehow, yeah (possibly we are already with OOPSes, but why didn't we catch it til a user told us about it?)
[16:25] <gary_poster> :-) ok
[16:25] <danilo_> gary_poster, the 4th is lack of coordination and communication :)
[16:26] <gary_poster> mthaddon: right. For me, this gets to my "too many different kinds of moving parts" in our architecture. If we have fewer moving parts then we can institute more uniform nagios-like-checks.
[16:26] <gary_poster> maybe the jobs system can help with this
[16:26] <danilo_> anyway, gary_poster, I think we should just raise the importance of ensuring sufficient monitoring of this part of code-hosting by thumper, and we can be done with the topic
[16:27] <gary_poster> maybe we can architect the jobs system to give us a nagios-like hook
[16:27] <danilo_> gary_poster, we don't have to solve the problem here :)
[16:27] <gary_poster> because doing it with cron scripts is a one-per job
[16:27] <matsubara> danilo_, can you raise the topic in the next TL meeting?
[16:28] <gary_poster> danilo_: ack.  I kind of disagree with your summary though, and your action item, so that's why I'm continuing to blather :-)
[16:28] <danilo_> matsubara, we are having a week long TL meeting next week, so it'd be best to action it for someone from code team to pass it on to thumper, imho :)
[16:28] <gary_poster> (IOW, this is not a problem for thumper, it is a problem for Björn, team leads, etc.)
[16:29] <danilo_> gary_poster, well, sure, I agree, but one step at a time
[16:29] <gary_poster> matsubara: two action items: :-)
[16:29] <matsubara> [action] rockstar to raise the importance of ensuring sufficient monitoring of this part (i.e. branch updates emails failing to be delivered) of code-hosting by thumper
[16:29] <MootBot> ACTION received:  rockstar to raise the importance of ensuring sufficient monitoring of this part (i.e. branch updates emails failing to be delivered) of code-hosting by thumper
[16:29] <danilo_> gary_poster, there's immediate problem and then there's the elegant solution; I'm always for fixing the immediate problem first and having the elegant solution come out of that
[16:29] <gary_poster> yeah, that's number one
[16:30] <gary_poster> number two is gary to bring up architecture concerns to team lead mtg :-)
[16:30] <danilo_> gary_poster, as for the other one, I think it ties in well with what we discussed today and what we'll want to discuss anyway
[16:30] <matsubara> [action] TLs + Bjorn to talk about "too many different kinds of moving parts" in our architecture. If we have fewer moving parts then we can institute more uniform nagios-like-checks.
[16:30] <MootBot> ACTION received:  TLs + Bjorn to talk about "too many different kinds of moving parts" in our architecture. If we have fewer moving parts then we can institute more uniform nagios-like-checks.
[16:30] <matsubara> does that summarize it well?
[16:31] <gary_poster> yeah thank you.  though it's probably my action, since I'm the one with the bee in my bonnet :-)  but that's fine
[16:31] <danilo_> gary_poster, matsubara: I don't like action items like that because they put no responsibility on anyone in particular, thus meaning that if they get done, they get done unrelated to the action item; thus, you don't really need it
[16:31] <gary_poster> so give it to me :-)
[16:31] <matsubara> danilo_, I'll add it to gary's queue when I add the summary to the MeetingAgenda page
[16:31] <danilo_> gary_poster, heh, that's ok, I am certain we would have discussed this regardless of us having any particular action item
[16:32] <danilo_> matsubara, sure, thanks
[16:32] <gary_poster> :-)
[16:32] <matsubara> it serves as a reminder as well
[16:32] <matsubara> anyway, thanks for the comments
[16:32] <rockstar> In fairness, the "not getting branch update emails" thing was because a rather large part of the code hosting system was made into a job.
[16:33] <gary_poster> To whom are you being fair? :-)
[16:33] <gary_poster> Never mind, I'll be quiet :-)
[16:33] <danilo_> :)
[16:33] <matsubara> we have 3 critical bugs, one in progress, one fix committed
[16:33] <rockstar> I'm not sure how "sufficient monitoring" would have fixed this.
[16:33] <matsubara> the other one is triaged, bug 511567
[16:33] <rockstar> gary_poster, to the code team in general.
[16:33] <matsubara> hmm
[16:33] <danilo_> rockstar, sufficient monitoring of scripts that do this
[16:33] <matsubara> that's a dupe
[16:33] <matsubara> and I filed that bug a few days ago
[16:34] <rockstar> danilo_, howso?
[16:34] <matsubara> or maybe I filed the dupe
[16:34] <gary_poster> rockstar: ah, gotcha.  Tim can beat us into shape at the TL sprint so we understand.
[16:34] <rockstar> gary_poster, yeah, I'll talk to him.
[16:34] <gary_poster> cool
[16:34] <danilo_> rockstar, monitoring should have caught the problem (i.e. "hey, this script is failing"); I won't pretend to understand the entire problem, so we might be entirely off base, but we should be able to check our service level
[16:34] <rockstar> danilo_, there wasn't a script failing.
[16:35] <rockstar> It ran fine, it was just a new script that had apparently left out some old functionality.
[16:36] <danilo_> rockstar, right, never mind the "implementation details", the problem is: "why we didn't catch it before someone told us it's failing"; there's not necessarily a technical solution
[16:38] <danilo_> matsubara, am I still on the channel?
[16:38] <matsubara> yes
[16:38] <danilo_> oh, ok, it's just everybody being quiet :)
[16:38] <danilo_> matsubara, I think we should go on
[16:38] <matsubara> sorry, I was looking for a bug report to dupe against 511567
[16:38] <matsubara> anyway
[16:38] <matsubara> thanks
[16:39] <matsubara> [TOPIC] * Operations report (mthaddon/Chex/spm/mbarnett)
[16:39] <MootBot> New Topic:  * Operations report (mthaddon/Chex/spm/mbarnett)
[16:41] <matsubara> hello?
[16:41] <matsubara> Chex, mbarnett ?
[16:41] <mbarnett> sorry
[16:42] <Chex> sorry
[16:42] <Chex> here is the report
[16:42] <Chex> - LP rollout 10.01 Wednesday was successful:
[16:42] <Chex>     : See https://wiki.canonical.com/InformationInfrastructure/OSA/LPRollout20100127 for more details.
[16:42] <Chex>     : The read-only switch left idle connections to the master DB, it is currently being investigated
[16:42] <Chex> - New LP Appserver is online, some issues with internal access, but now everything is OK.
[16:42] <Chex> - New branch-scanner having issues, just reverted back to old again.  Based on meeting discussion here,
[16:42] <Chex>         continuing to address.
[16:42] <Chex> and that's all for us.  Any questions/comments?
[16:43] <matsubara> Chex, what's this new LP appserver online? I guess I'll have to tell oops-tools about oops reports from it?
[16:43] <matsubara> [action] matsubara to update oops-tools to know about the new lp appserver
[16:43] <MootBot> ACTION received:  matsubara to update oops-tools to know about the new lp appserver
[16:43] <noodles775> Chex: do you know if the new servers have access to the private librarian?
[16:43] <mbarnett> matsubara: soybean was recently put online as a replacement for gangotri +
[16:44] <mbarnett> noodles775: that was resolved earlier today
[16:44] <matsubara> mbarnett, oh, so it's using the same config files?
[16:44] <noodles775> A user was seeing about 1 in 4 requests to download a... ah, great, thanks!
[16:44] <mbarnett> matsubara: it took over lpnet1, lpnet2, and edge1 from gangotri, stole lpnet9 from gandwana, and added a sparkly new lpnet15 standard lpnet appserver
[16:45] <matsubara> mbarnett, ok, it's the new lpnet15 instance I care about. I'll check the configs and update oops-tools accordingly
[16:45] <matsubara> thanks
[16:45] <matsubara> moving on
[16:45] <mbarnett> matsubara: thank you.
[16:45] <matsubara> [TOPIC] * DBA report (stub)
[16:45] <MootBot> New Topic:  * DBA report (stub)
[16:45] <matsubara> stub sent the report to the list
[16:46] <matsubara> allenap, he mentioned something about checkwatches being very cpu intensive. it's probably of interest to the Bugs team
[16:46] <allenap> mars: deryck has just forwarded the message to me.
[16:46] <allenap> matsubara: ^
[16:46] <matsubara> thanks allenap
[16:46] <matsubara> [TOPIC] * Proposed items
[16:47] <MootBot> New Topic:  * Proposed items
[16:47] <matsubara> no proposed items
[16:47] <matsubara> which brings this meeting to a close
[16:47] <matsubara> Thank you all for attending this week's Launchpad Production Meeting. See https://dev.launchpad.net/MeetingAgenda for the logs.
[16:47] <matsubara> and sorry for the delay
[16:47] <matsubara> #endmeeting
[16:47] <MootBot> Meeting finished at 10:47.