=== bigjools is now known as bigjools-lunch
=== mrevell-lunch is now known as mrevell
=== bigjools-lunch is now known as bigjools
=== danilo_ is now known as danilos
[16:00] #startmeeting
[16:00] Meeting started at 10:00. The chair is matsubara.
[16:00] Commands Available: [TOPIC], [IDEA], [ACTION], [AGREED], [LINK], [VOTE]
[16:00] Welcome to this week's Launchpad Production Meeting. For the next 45 minutes or so, we'll be coordinating the resolution of specific Launchpad bugs and issues.
[16:00] [TOPIC] Roll Call
[16:00] New Topic: Roll Call
[16:00] ni!
[16:00] me
[16:01] me
[16:01] (allenap is sick)
[16:01] (or "coo", if anyone knows about kin dza dza ;)
[16:01] gary_poster, Chex, bigjools: hi
[16:01] hello
[16:01] sinzui, hi
[16:01] me
[16:01] and hi
[16:01] :-)
[16:01] me
[16:01] me
[16:01] apologies from Stuart and Ursula
[16:01] me
[16:02] hello
[16:02] [TOPIC] Agenda
[16:02] New Topic: Agenda
[16:02] * Actions from last meeting
[16:02] * Oops report & Critical Bugs & Broken scripts
[16:02] * Operations report (mthaddon/Chex/spm/mbarnett)
[16:02] * DBA report (stub)
[16:02] * Proposed items
[16:02] [TOPIC] * Actions from last meeting
[16:02] New Topic: * Actions from last meeting
[16:02] * matsubara to trawl logs related to high load on edge on 2009-09-09 ~1830UTC and ping Chex about it
[16:02] * matsubara to email the devel list about the new ErrorReportingUtility method
[16:02] * done
[16:02] * matsubara to file a bug to have the HWSubmissionMissingFields oopses as informational only (note to self: see bug 438671 for more details)
[16:02] * filed https://bugs.edge.launchpad.net/malone/+bug/446660
[16:02] Launchpad bug 438671 in checkbox "HWSubmissionMissingFields OOPS on +hwdb/+submit" [Undecided,Confirmed] https://launchpad.net/bugs/438671
[16:02] * matsubara to look in lp-production-configs for the new oops prefixes.
[16:02] Launchpad bug 446660 in malone "HWSubmissionMissingFields exceptions should be updated to be informational only" [High,Triaged]
[16:02] * all QA contacts to inform their teams about the new QA column and what they should do about it.
[16:02] * Chex to email the list about the new QA column in https://wiki.canonical.com/InformationInfrastructure/OSA/LPIncidentLog
[16:04] I still haven't checked the high load logs. Chex or mthaddon, did you notice high loads after the 2009-09-09?
[16:04] matsubara: we still have been seeing some high loads, yes
[16:05] I did look the new prefixes on lp-productions-configs. I need to update oops-tools to recognize those
[16:06] I also noticed that some oops prefixes will conflict with existing ones, so I need to sort that out with...
[16:06] losas I guess
[16:07] [action] matsubara to file a bug on oops-tools to recognize new oops prefixes and sort out conflicting prefixes with losas
[16:07] ACTION received: matsubara to file a bug on oops-tools to recognize new oops prefixes and sort out conflicting prefixes with losas
[16:09] Chex, re: the high load, could you take on the task of analysing the logs? my idea was to correlate information from the app servers logs with the apache logs and see if that could shed some light.
[16:09] mthaddon emailed the list about the new QA column, so everyone, read it and spread the word to your teams, please.
[16:10] Chex: yes sure, I can look at that.
[16:10] matsubara: it has just been discussed in the TL call as well, flacoste will champion the process
[16:10] matsubara: ^^ I mean..
[16:10] matsubara: (about QA Info column on LP incident log)
[16:10] Chex, cool, thanks a lot. ping me if you need any info on that
[16:11] danilos, cool. thanks!
[16:11] [action] Chex to check app server logs and apache logs to see if it can shed any light in the high load issue.
[16:11] ACTION received: Chex to check app server logs and apache logs to see if it can shed any light in the high load issue.
[16:12] [TOPIC] * Oops report & Critical Bugs & Broken scripts
[16:12] New Topic: * Oops report & Critical Bugs & Broken scripts
[16:12] we're seeing a bunch on DisconnectionErrors which are not informational only
[16:13] which means, the Retry mechanism is not enough for those cases.
[16:13] matsubara: are these the ones on the xmlrpc server?
[16:13] or something else?
[16:13] gary_poster, yes, most of them on xmlrpc server
[16:13] but there are a few, like OOPS-1383I246, in login.launchpad.net
[16:13] https://lp-oops.canonical.com/oops.py/?oopsid=1383I246
[16:14] matsubara: right. I investigated and could not duplicate the ones in the xmlrpc server. Kicking the xmlrpc server made them go away. There's a bug number which I can get in a moment. After discussing with flacoste, I think the best we can hope for is to figure out a way to add more diagnostic information should the problem happen again
[16:17] gary_poster, ok, I take this is a foundations taks then. let me know the bug number please (or I'll file a new one for the more diagnostic info needed issue, if that's not what the bug you mentioned is about)
[16:17] bug 450593 . Stuart has a follow up: check with losas if there were any unusual activity ATM
[16:17] Launchpad bug 450593 in launchpad-foundations "Lots of DisconnectionErrors on xmlrpc server - staging" [Undecided,New] https://launchpad.net/bugs/450593
[16:17] I think a comment saying that we should address by adding diagnostic information in case there is a repeat would be sufficient. I'll do that.
[16:18] thanks gary_poster
[16:19] apart from that we have a bunch of oopses that will need fixing given the new zero oops policy.
[16:19] Ursula will keep an eye on those for now and let the teams lead which ones are happening more frequently
[16:21] we had some script failures last week
[16:21] the main one seems to be the branch-puller which was already discussed in the list
[16:22] checkwatches failed on the 13th, but since no other email came out, I assume it was a blip. adeuring, can you confirm?
[16:22] matsubara: erm, I have no idea...
[16:22] and the product-release finder and update-cache failed to run on the 14th
[16:23] matsubara: I'll ask Graham
[16:23] sinzui, do you know what's up with the product release finder script?
[16:23] who's owns the update-cache script?
[16:23] s/'s//
[16:23] thanks adeuring
[16:24] I don't know; looking
[16:24] [action] adeuring to check with gmb about checkwatches failure
[16:24] ACTION received: adeuring to check with gmb about checkwatches failure
[16:24] matsubara: No, but I think the issue is not that it failed, bu that a long process prevented it from running
[16:24] sinzui, right, that'd explain. could you check that's the root cause and reply to the list?
[16:25] matsubara: okay
[16:25] maybe the update-cache failure happened for the same reason
[16:25] matsubara: I don't see an update-cache script in the LP tree. (I do see variants like update-download-cache)
[16:27] just a reminder to everyone, if a script fails and your team owns that script, please reply to the failure email saying that someone is taking a look at it.
[16:27] gary_poster, all I see is: "The script 'update-cache' didn't run on 'loganberry' between 2009-10-14 04:00:08 and 2009-10-14 22:00:08 (last seen 2009-10-13 11:36:51.345188)" not sure which script that one is monitoring.
[16:29] for the critical bugs section, we have 4 bugs, 3 fix committed and 1 in progress
[16:30] danilos, the one in progress is assigned to henning but he's on vacation
[16:30] is it really critical?
[16:31] matsubara: I'd have to check, sorry for not being on top of this
[16:32] (I also looked for update-cache in lp-production-configs. not there either.)
[16:34] gary_poster, I think it's cronscripts/update-pkgcache.py. IIRC, the losas script monitoring tool uses the script name defined in LaunchpadCronScript
[16:35] [action] danilos to check bug 438039, assess if it's really critical. if it's is, land a fix, if it's not, update the importance
[16:35] ACTION received: danilos to check bug 438039, assess if it's really critical. if it's is, land a fix, if it's not, update the importance
[16:35] Launchpad bug 438039 in rosetta "bzr branch import script oopses sometimes" [Critical,In progress] https://launchpad.net/bugs/438039
[16:36] matsubara: oh ok, thanks. that script is either the one salgado was talking about that he owns, or something for soyuz, seems to me.
[16:36] it's traditionally maintained by soyuz
[16:36] bigjools: ok, thanks
[16:36] but in the new world order it could be registry
[16:36] bigjools, can you confirm that update-cache failure described in the "Subject: Scripts failed to run: loganberry:productreleasefinder, loganberry:update-cache" refers to the update-pckg.py and reply back to the email sent to the list?
[16:37] ok, you just did :-)
[16:37] it doesn't look like update-packagecache
[16:38] errr ah it is
[16:38] bigjools, it's the only script that has update-cache string in cronscripts/
[16:38] sorry got confused by seeing productreleasefinder
[16:38] [action] bigjools to investigate update-cache failure and reply back to the list
[16:38] ACTION received: bigjools to investigate update-cache failure and reply back to the list
[16:39] bigjools, you might want to coordinate with sinzui since he'll check the product release failure one and suspects it might have failed because of a long running process
[16:39] matsubara: is there an oops?
[16:39] only an email that it did not run
[16:39] bigjools, nope
[16:39] and it was a one-off?
[16:40] bigjools: it did not start, and that is 99% of the time the fault of a long running process
[16:40] ok
[16:40] * sinzui really does not think about the issue until it happens two days in a row
[16:40] and me
[16:40] sinzui, perhaps the script monitoring should have such a feature
[16:41] but anyway, sorry for taking so long on this section
[16:41] thanks eveyrone
[16:41] [TOPIC] * Operations report (mthaddon/Chex/spm/mbarnett)
[16:41] New Topic: * Operations report (mthaddon/Chex/spm/mbarnett)
[16:41] hello everyone
[16:42] hi
[16:43] sorry, notes failure:
[16:43] - LP Ship-it progress:
[16:43] ; LP shipit is live on the new servers
[16:43] ; Nigel Pugh is now in charge of approving CPs to those servers
[16:43] ; We are still working on the new front-ends for LP Login and LP itself
[16:44] - Buildd-manager DB restart issue/bugs: Bugs 451351 & 451349 have been
[16:44] Launchpad bug 451351 in soyuz "buildd-manager doesn't give us a good way of determining it's in a failed state" [High,Triaged] https://launchpad.net/bugs/451351
[16:44] filed to address this issue, any movement to fix this problem?
[16:45] - QA column in Incident Log: Tom sent a email to LP list on Oct 12, has
[16:45] anyone reviewed the email and have comments/concerns about it?
[16:45] Chex, are oops reports from those new servers going to be rsync'ed to devpad? such oopses are supposed to be included in LP oops summaries?
[16:45] LP Incidents of note: ; Applied: CP 9660 to lpnet, CP 9679 to lpnet
[16:45] ; Small LP outage (8 mins) : App servers (and
[16:45] librarians) didn't reconnect & had to be restarted after LP DBs
[16:45] were restarted: Bug filed: 451093
[16:46] and thats our report for this week. sorry for the troubles there
[16:46] Chex: I am looking into 451351 but don't expect anything soon, it's a hard problem
[16:47] matsubara: I am not sure on the status of oops summaries on the new servers, I will check on that
[16:47] Chex, cool, thanks.
[16:47] bigjools: ok, thanks, just looking for status of progress.
[16:48] Chex, danilos mentioned that QA column things was discussed today in the TL meeting and flacoste will champion the process.
[16:48] matsubara: ok, that is great to hear.
[16:48] Chex, thanks for the report
[16:48] let me move on as we are overdue
[16:48] [TOPIC] * DBA report (stub)
[16:48] New Topic: * DBA report (stub)
[16:48] The new replica to become the master for the authentication service has been taken offline, as the hardware was showing signs of strain keeping up with Launchpad's write load. The hardware is being beefed up to cope. The alternative is to just put the authdb replication set on this server and have the authentication service appservers connect to the main launchpad databases for the data they need to pull from the lpmain replication set.
[16:48] Nothing else to report.
[16:49] that came from Stuart. any questions about dba's report?
[16:50] ok, I'll take that as a no :-)
[16:50] [TOPIC] * Proposed items
[16:50] New Topic: * Proposed items
[16:50] no new proposed items
[16:50] Thank you all for attending this week's Launchpad Production Meeting. See https://dev.launchpad.net/MeetingAgenda for the logs.
[16:50] sorry for overrunning
[16:51] #endmeeting
[16:51] Meeting finished at 10:51.
[16:51] thanks matsubara
=== salgado is now known as salgado-lunch
=== matsubara is now known as matsubara-lunch
=== salgado-lunch is now known as salgado
=== matsubara-lunch is now known as matsubara
=== salgado_ is now known as salgado
=== salgado is now known as salgado-afk
=== matsubara is now known as matsubara-afk