[00:12] lifeless: Can we defer bugheat like bugsummary?
[00:13] (no, it shouldn't be expensive, but it is, so it should GTFO)
[00:13] wgrant: we can do something
[00:13] however its inline in the rows being edited usually
[00:13] or is there a project row being updated?
[00:14] It's mostly the max_heat, I think.
[00:14] yes, we should
[00:14] garbo it up
[00:14] Let me just add that to the critical queue... oh wait
[00:15] :(
[00:15] if it causes timeouts its already in there
[00:21] Ooh shiny.
[00:21] https://qastaging.launchpad.net/ubuntu/oneiric/+localpackagediffs?batch=10
[00:21] ?
[00:21] The packagesets column.
[00:22] nice
[00:24] OOPS-2032QASTAGING19
[00:29] 'This is the first version of the web service ever published. Its end-of-life date is April 2011, the same as the Ubuntu release "Karmic Koala".'
[00:33] wgrant: 809786 timed out for me
[00:34] when I filtered by core
[00:35] lifeless: The page times out a lot anyway.
[00:35] Without packageset filtering.
[00:36] (drop the ?batch=10 and try to have it render)
[00:36] Putting &batch=10 on the packageset-filtered URL works fine.
[00:38] hmm, it should have preserved the batch size
[00:38] I have closed the tab, but perhaps thats it
[01:18] headdesk. There are too many ways to send mail to teams.
[01:33] which is why procmail was invented. so as to auto delete the vast majority of email that LP sends, because there's no other way to manage it.
[01:36] Harsh
[01:37] But true
[01:45] I got sufficiently pissed off with some mail - 815623
[01:45] bug 815623
[01:45] <_mup_> Bug #815623: Mail notifications sent to team admins on joins / leaves to open teams < https://launchpad.net/bugs/815623 >
[02:33] grah
[02:33] the way deactivated memberships are special cased for joining can be surprising ... and tis buggy
[02:37] \o/ Fix released :)
[02:39] man, that took way too long.
[02:40] btw, New Zealand has snow?
[02:41] * nigelb was surprised of a friend stuck because of snow
[02:41] http://www.stuff.co.nz/national/5334310/Wintry-blast-brings-country-to-a-standstill
[02:42] woah.
[02:42] can has review ? https://code.launchpad.net/~lifeless/launchpad/bug-815623/+merge/69021
[03:02] diff is up
[03:13] wgrant: ^ ?
=== wgrant changed the topic of #launchpad-dev to: https://dev.launchpad.net/ | On call reviewer: - | Critical bugs: 239 - 0:[#######=]:256
[03:13] * wgrant looks.
[03:13] or stevenk ^ ?
[03:13] lifeless: I think we should talk to teams like LoCos first.
[03:14] The LoCo Council has been furious with us a few times lately for making team policy changes like this.
[03:15] this seems entirely different to the things they were (reasonably) unhappy with
[03:15] Yes, but still.
[03:15] but sure
[03:15] whats their contact ?
[03:16] There's loco-council@lists.u.c, but pinging czajkowski or popey directly might also work better.
[03:22] BugTask.target is worse than Soyuz.
[03:22] Haha
[03:22] wgrant: anyhow, you can review the change; I'll hold off landing for loco response
[03:22] s/you can/can you
[03:22] Sure.
[03:23] whats left on my todo list....
[03:24] lifeless: Done.
[03:25] * wgrant lunches.
=== almaisan-away is now known as al-maisan
[04:45] quick supermarket run
[05:23] I am fairly sure that nobody who ever touched anything related to BugTask.target knew about the concepts of encapsulation or layering.
[05:24] are you creating a BugTaskTarget table ?
[05:24] No.
[05:24] That would be slow.
[05:25] But any time non-model code puts its fingers near the target key attributes, they are being cut off.
[05:25] actually
[05:25] stub and I think it would be fast
[05:26] it would shrink the task table size
[05:26] make its constraints easier to read
[05:26] Well, it will be very possible to do in a few branches.
[05:26] / evaluate
[05:26] "in a few branches" == "once I've finished this current stack of refactoring", that is.
[05:26] we'd probably want CTE's for the dimensions
[05:26] cool
[05:26] lifeless: I'm worried how slow it would be to query for, eg, all bugs in Ubuntu.
[05:27] Because you'd need all DSP targets.
[05:27] And conceivably all SP targets too.
[05:27] CTEs may work.
[05:27] But maybe not.
[05:27] wgrant: that is indeed a factor; however...
[05:27] there are 20K * small-int targets
[05:27] Yeah.
[05:28] this is a small number when you're dealing with table scans already
[05:28] and a moderate number otherwise.
[05:29] Anyway, my branch to remove access to sourcepackagename/product/productseries/distribution/distroseries from the security declarations came back with ~50 failures, most of which just need to be changed to use transitionToTarget instead.
[05:29] awesome
[05:29] Some of the other failures reveal about four or five branches of further necessary refactorings, but we're getting there...
[05:30] def maybeAddNotificationOrTeleport(self):
[05:30] So it might add a notification or possibly jump somewhere else?
[05:30] Yay
[05:31] rotfl
[05:35] wgrant: hi
[05:35] wgrant: SPRB's - you were involved with the current impl right ? the wellington sprint ?
[05:36] lifeless: Yes.
[05:36] in the -way-back- plans for no more source packages
[05:36] I did various bits, then grabbed everyone else's bits and mashed them together at the end into something that almost worked with only about 10 breakage points along the way...
[05:36] we were going to record a manifest describing each build
[05:36] That was an amusing afternoon.
[05:36] Ah, yes.
[05:36] sprdata seems to match that design
[05:36] We send it back.
[05:36] I forget if we store it.
[05:37] Let me check.
[05:37] but we're not recording a version per build
[05:37] we only have 1300 in prod
[05:37] I want to know if thats something we planned to do when we need it
[05:37] or if the actual intent changed
[05:37] https://code.launchpad.net/~jelmer/launchpad/bfbia-db/+merge/68990 for context
[05:38] jelmer and I discussed this a bit in Dallas.
[05:38] But not details like this.
[05:38] IIRC we already send the manifest back, but don't do anything with it.
[05:38] The intent was to parse it back into a SourcePackageRecipeData, and store it in SourcePackageRecipeBuild.manifest (which already exists).
[05:39] ok
[05:50] wgrant: so in short - original plan unchanged, but bits not implemented
[05:51] Yes.
[06:11] HI STUB
[06:11] *cough*
[06:11] Hi stub
[06:11] * stub rubs his ears
[06:12] stub: hey, so -0 patches; am I right that we no longer use them at all in the new world ?
[06:13] stub: I've optimistically updated the schema process page to say that
[06:15] So previously, -0 were the ones being run during a full rollout. But now we are not going to have full rollouts (db or code, not both at the same time)
[06:15] yeah
[06:15] and -0 requires code and db to be in sync
[06:15] We might as well just pull out the schema version detection code, or tune it for the new world.
[06:15] in that the code says a -0 in the db the code doesn't know is boom, and vice versa
[06:16] Can you think of a better rule? How about 'if it is in my tree but not applied to the db, fail'.
[06:16] stub: well, we still have, in extreme cases, synced deployments - at least in principle, until we're past the first few of these deploys and can be sure we've got the kinks out
[06:16] Or just switch it off entirely.
[06:17] stub: I believe that 'if it is in my tree but not applied to the db, fail' is what -non-0 patches enforce
[06:17] ok. so for now, no -0 patches. But we should fix that, as leaving it in there unused could bite us.
[06:17] stub: so i guess in a few weeks or a month, we should make it the same as the -non-0 rule.
[06:18] Sure. I don't see the point of supporting '-0 patches cause things to explode' :-)
[06:18] me neither :)
[06:18] * stub opens a bug
=== al-maisan is now known as almaisan-away
[06:29] stub: so you just need to stop and start pgbouncer for the tests right ?
[06:29] lifeless: what is the secondary fastdowntime tag?
[06:29] -later IIRC
[06:29] its linked from the LEP
[06:29] lifeless: Yes, that covers it.
[06:29] https://dev.launchpad.net/LEP/FastDowntime -> https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=fastdowntime-later
=== jam1 is now known as jam
[06:56] hmm
[06:56] my less-mail change may hide some mails we care about - transitions to-from admin of members.
[06:56] still, less of a wart than what we have today, IMNSHO
[06:56] Oh, blah, that counts as joining too, doesn't it.
[06:57] wgrant: setStatus too
[06:57] wgrant: I think
[06:59] hi lifeless, hi wgrant
[06:59] Evening jtv.
[06:59] jtv: generate-contents-files may have just finished.
[06:59] jtv: For the first time.
[07:00] wgrant: still morning here :)
[07:00] It must have taken a while because it didn't run for ages.
[07:00] Not so much.
[07:00] It started slightly under three hours ago.
[07:00] Oh
[07:00] So not too much slower.
[07:01] Phew. I thought you were saying it had been running all weekend.
[07:01] It failed to run over the weekend, because it didn't have permissions to do the move.
[07:01] !
[07:01] So you spotted that and fixed it? I am in your debt.
[07:02] Ah, no, it's still going.
[07:02] Must be nearly there.
[07:02] excited…
[07:06] morning rvba!
[07:06] Morning jtv, morning all!
[07:06] Oh, it's not done powerpc yet.
[07:06] So it's got a while left.
[07:07] It was just being very quiet :(
[07:08] More things are LaunchpadCronScripts now, and weird things happen to logging when one of those instantiates another.
[07:08] That's something I'm looking into.
[07:08] I think it's just apt-ftparchive being itself.
[07:09] However, there is one significant bug in the new script.
[07:09] Killing archivepublisher(5321) from launchpad_prod_3:
[07:09] query: in transaction
[07:09] backend start: 2011-07-25 04:02:12.038005+00:00
[07:09] query start: 2011-07-25 04:02:12.207793+00:00
[07:09] age: 0:57:49.467450
[07:09] It seems to be continuing OK, but it remains to be seen how badly it blows up at the end.
[07:10] meep
[07:11] If it blows up it won't escape from its own little /srv/launchpad.net/ubuntu-archive/contents-generation world, so I'm not too concerned about it leaving the archive in a bad state.
[07:12] A more interesting risk is that it might generate different output for a frozen suite.
[07:19] So apt-ftparchive is taking long enough that its transaction gets reaped?
=== stub1 is now known as stub
[07:20] uhm, please tell me you're not holding a transaction open while you run apt-ftparchive ?
[07:20] Yes, but what lifeless said.
[07:20] No need for a transaction, and it takes hours so it's silly.
[07:21] So any time at all is long enough.
=== stub1 is now known as stub
[07:21] this runs as a different user to the publisher, right ?
[07:22] No.
[07:22] ok, so this needs to be critical then, because its going to break downtime deploys (of any sort)
[07:23] losas know that quiescing things (time period) in advance is enough, and time period is not (IIRC) 1 hour
[07:23] assuming that this is a transaction-open-around-ftparchive
[07:23] Er, you know about the 24h translations scripts, right? :)
[07:23] when did they suddenly speed up to 24h?
[07:23] wgrant: I was just going to ask what lifeless asked. :)
[07:23] wgrant: not on the whitelist to abort
[07:23] That script has been tweaked to avoid holding transactions open.
[07:24] wgrant: they'll get slaughtered, and we've had no response beyond the publisher to the requests asking for things to whitelist, both on-list and to team-leads
[07:24] How do we get certainty about what stage the script was in when it lost its transaction?
[07:24] quite a few times aiui. it use to be a minor bane of our existence. but haven't seen any woes from it for ... well years.
[07:24] jtv: yes, that script is fine
[07:24] * lifeless wants to move all scripts to internal API clients
[07:25] 2011-07-25 04:58:39 WARNING dists/maverick/restricted/binary-amd64/: 28 files 112MB 0s
[07:25] 2011-07-25 05:00:58 WARNING dists/maverick/universe/binary-amd64/: 24080 files 24.7GB 2min 18s
[07:25] 2011-07-25 05:01:03 WARNING dists/maverick/multiverse/binary-amd64/: 700 files 2871MB 4s
[07:25] It was kill at 05:00 UTC
[07:25] It was during a-f.
[07:25] killed.
[07:25] OK, so we need to make sure there's no transaction open at that point.
[07:25] It may have been the ORM reloading objects.
[07:26] That won't allocate a full transaction in modern-day postgres, but I'm not sure our scripts would know that.
[07:26] I'm filing a bug.
[07:28] BTW since it's monday morning: this is generate-contents-files, right? Or is it publish-distro (or publish-ftpmaster running publish-distro)?
[07:28] generate-contents-files
[07:30] bug 815725
[07:30] <_mup_> Bug #815725: Long-running transaction in generate-contents-files < https://launchpad.net/bugs/815725 >
=== stub1 is now known as stub
=== wgrant changed the topic of #launchpad-dev to: https://dev.launchpad.net/ | On call reviewer: - | Critical bugs: 240 - 0:[#######=]:256
[07:39] jtv: thanks!
[07:39] wgrant: there are no transaction boundaries at all in that script, yet I haven't found any traces yet of it running in auto-commit previously.
[07:39] jtv: That script didn't exist three months ago.
[07:39] It's never run before.
[07:39] jtv: wgrant: is there any reason we can't change the db user for this script as well ?
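
Illustrative aside (not part of the channel traffic): the reaper output and the apt-ftparchive timings quoted above can be cross-checked with a little arithmetic. This is only a sketch. It assumes that the apt-ftparchive log timestamps are in UTC and mark the moment each step finished, which the log does not state.

```python
from datetime import datetime, timedelta

# Values copied from the reaper output quoted in the log.
query_start = datetime(2011, 7, 25, 4, 2, 12, 207793)
age = timedelta(minutes=57, seconds=49, microseconds=467450)

# The kill happened when the transaction reached the reported age.
killed_at = query_start + age

# Assumption: the universe/binary-amd64 line was logged when that step
# finished, so the step started its reported duration earlier.
universe_done = datetime(2011, 7, 25, 5, 0, 58)
universe_start = universe_done - timedelta(minutes=2, seconds=18)

during_universe = universe_start <= killed_at <= universe_done
print(killed_at.isoformat(), during_universe)
```

Under those assumptions the kill lands just after 05:00, inside the universe/binary-amd64 step, which agrees with "It was during a-f."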
[07:40] Well.
[07:40] It was a shell script.
[07:40] lifeless: Maybe.
[07:40] lifeless: Just needs work to find out DB users.
[07:40] what I want is stub's whitelist to get no false positives
[07:40] Ah, drat, this was a shell script.
[07:40] We really need something like User-Agent.
[07:40] things we need to abort deploys on need to be precisely and accurately identified by the whitelist code
[07:42] good morning
[07:42] We can probably change that without any problems… focusing on the other thing right now though.
[07:42] hi adeuring
[07:42] hi jtv!
[07:43] jtv: might need to focus sooner than later if you don't want your script killed every few days
[07:43] stub: thanks for distracting me from just that. :)
[07:43] Is archivepublisher whitelisted for reapability?
[07:44] It does a lot of commits, so might not need it.
[07:44] wgrant: it will be - bigjools flagged that. So rollout will abort until that stuff has been shut down manually.
[07:44] stub: actually its the other way around; this script runs as something we aren't willing to interrupt, so we need to move it out of the way or cause deploys to refuse to run
[07:44] wgrant: can you think of any reason why generate-contents-files shouldn't run against the read-only store?
[07:44] jtv: There is no read-only store.
[07:44] Which could be an issue.
[07:45] (read-only mode has gone away)
[07:45] How is there no read-only store?
[07:45] ISlaveStore(DbClass) is the read-only store
[07:45] jtv: FYI we'll be interrupting transactions on the slaves too, not just the master
[07:45] Yes, but that doesn't help deployments.
[07:45] wgrant: read-only store, not read-only mode.
[07:45] And using it makes things go faster which helps everything
[07:45] Not thinking of deployments, thinking of not getting the script killed.
[07:45] It's still going to be killed.
[07:45] jtv: it still will, same rules
[07:46] But much easier to deal with.
[07:46] Oh?
[07:46] Database changes are the real bastard.
[07:46] There should be a decorator/contextmanager around somewhere to ensure a lack of transaction.
[07:46] That'd be nice.
[07:47] checkwatches uses one, but it's Twisted.
[07:47] abently wrote it I believe
[07:47] That'
[07:47] s right.
[07:47] For upgrade-branches
[07:47] TransactionFreeOperation or something
[07:47] There is a database policy we use to guarantee no db access is being made.
[07:47] Anyway, to repeat the question: does anyone know of any reason why this script shouldn't run against the read-only store?
[07:47] As long as it's up to date, no, that's fine.
[07:47] jtv: unless it changes things, running against a fresh slave should be fine.
[07:47] (and database policies are context managers)
[07:48] lifeless: I'm asking wgrant whether he can think of anything it changes.
[07:48] It shouldn't.
[07:48] It will probably need to eventually.
[07:48] But right now it doesn't.
[07:48] Or at least shouldn't.
[07:48] it should make API calls anyway
[07:48] Eventually, yes.
[07:48] if it doesn't change anything now, there is no good reason for it to ever.
[07:48] (via the db)
[07:50] stub: is SlaveOnlyDatabasePolicy what I need? (What I care about really is a read-only store to catch db changes, not necessarily an actual slave database).
[07:51] jtv: Sounds like that is what you want.
[07:51] We use DatabaseBlockedPolicy to confirm no db access at all for pages we need to work when the db is unavailable.
[07:52] OK, I'll do that first. If I understand correctly, that'll keep the reaper off its back _for the time being_, so we can land it separately, and then feel much more confident changing transactionality later. Right?
[07:52] Can also be used for narrowing down long-transaction issues
[07:52] Which is what this is.
[07:52] jtv: I think that still takes out a transaction
[07:52] jtv: so that it gets consistent reads
[07:52] jtv: which is why wgrant and I were saying it won't help.
[07:52] Yes it does.
[07:52] because it will still get reaped.
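
Illustrative aside (not part of the channel traffic): a context manager that ensures a lack of transaction could look roughly like the sketch below. It is not Launchpad's TransactionFreeOperation or any of its database policies. It uses the standard library's sqlite3 module as a stand-in for the real store, and `transaction_free`, `TransactionOpenError` and `long_running_step` are hypothetical names.

```python
import sqlite3
from contextlib import contextmanager


class TransactionOpenError(RuntimeError):
    """Raised when a transaction is open where none is allowed."""


@contextmanager
def transaction_free(conn):
    """Guard a block of code that must not hold a database transaction."""
    if conn.in_transaction:
        raise TransactionOpenError("transaction open on entry")
    yield
    if conn.in_transaction:
        raise TransactionOpenError("transaction opened inside the block")


def long_running_step():
    """Stand-in for an hours-long external tool run."""
    return "done"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE config (key TEXT, value TEXT)")
conn.execute("INSERT INTO config VALUES ('suite', 'maverick')")
conn.commit()  # Finish all database work before the slow part starts.

with transaction_free(conn):
    result = long_running_step()
```

The guard fails loudly if the slow step is entered with uncommitted work, which makes a forgotten commit show up in testing rather than as a reaped transaction in production.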
[07:53] Oh, I thought you were saying the reaper currently only kills master transactions.
[07:53] The regular reaper may well only kill master transactions.
[07:53] The deployment reaper kills everything.
[07:54] jtv: we're starting fastdowntime deployments very soon - stub has made fantastic progress, it may be as early as wednesday.
[07:54] We have a reaper installed on hackberry too now
[07:54] jtv: and that will nuke all connections on all replicas
[07:54] lifeless: I think we should JFDI and watch what breaks.
[07:54] stub: choke as well ?
[07:55] Okay, then I'll have to do both right now.
[07:55] wgrant: oh, we will
[07:55] lifeless: chokecherry too
[07:56] wgrant: I'm mainly worried about this causing a fastdowntime to abort if it happens to have a transaction open at just the wrong time
[07:57] lifeless: Yeah.
[07:58] maybe we should change the db user for the archivepublisher
[07:58] jtv: So I'm not sure what the original problem is, but the fallback is we whitelist the db user your script is connecting as. This will block rollouts until the script has been shut down manually. Update Bug #809123 if we need this.
[07:58] <_mup_> Bug #809123: we cannot deploy DB schema changes live < https://launchpad.net/bugs/809123 >
[07:59] stub: I suspect this will be a rollout blocker regardless, since it works on the filesystem — wgrant will know.
[08:00] this works in a staging area
[08:00] its not a blocker
[08:00] As lifeless says, it's OK.
[08:00] whats the cron time for this beastie ?
[08:00] 04:02
[08:01] lifeless: it will not abort the fastdowntime deployment. We will check the whitelist, and if no whitelisted connections, we shut down pgbouncer. If a script managed to sneak in a connection between the two steps, it will die.
[08:01] wgrant: do you think it will take 4 hours routinely ?
[08:01] lifeless: Rarely more than 2.5, I expect.
[08:01] lifeless: But we will see.
[08:01] ok, we're probably safe then.
[08:01] stub: Well, it will abort it, just before we are down.
[08:02] stub: this script we're talking about runs as the same user as the archivepublisher which is whitelisted
[08:02] stub: which is why it could abort the deploy; if it had a non-idle connection at the whitelist check time
[08:02] stub: (or do you check for -any- connection ?)
[08:03] ok. I'd call that blocking the deploy, not aborting it. I expect all the whitelisted systems would be shutdown manually before kicking off the update. The checks are just there to confirm that all that stuff really has disconnected.
[08:04] stub: agreed
[08:04] stub: so I'm thinking we should make this script be on a different user; or move the whitelisted script to a dedicated just-for-it-user
[08:04] (I'm considering an abort as aborting mid-way, which is a problem as we might be partially updated)
[08:04] stub: and make that a clear policy for anything that gets whitelisted
[08:04] yes, it should be a different user per existing policies :-)
[08:05] Every separate script should be connecting as a unique user already.
[08:05] jtv: wgrant: bigjools: how many different scripts connect as archivepublisher?
[08:05] Everywhere that is not happening is a bug.
[08:05] lifeless: Let's not go there.
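
Illustrative aside (not part of the channel traffic): the whitelist check described above, and the reason a shared database user causes trouble, can be sketched as follows. This is not the actual deployment code. The `Connection` shape, the function names and the `generate_contents_files` user name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """One open database connection (hypothetical shape)."""
    user: str
    pid: int


def blocking_connections(connections, whitelist):
    """Return the connections owned by whitelisted database users."""
    return [conn for conn in connections if conn.user in whitelist]


def can_start_downtime(connections, whitelist):
    """The deploy may proceed only if no whitelisted user is connected."""
    return not blocking_connections(connections, whitelist)


WHITELIST = {"archivepublisher"}

# Shared user: the long-running script looks exactly like the publisher.
shared = [Connection("archivepublisher", 5321)]
# Dedicated user (hypothetical name): the script no longer blocks the deploy.
dedicated = [Connection("generate_contents_files", 5321)]

print(can_start_downtime(shared, WHITELIST))
print(can_start_downtime(dedicated, WHITELIST))
```

The check only sees user names, so a harmless script that shares a whitelisted user blocks the deploy. Giving either the harmless script or the whitelisted script its own user removes the ambiguity.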
[08:05] (and they know it ;) )
[08:06] lifeless: afaik, lots.
[08:06] lifeless: But at least 14.
[08:06] But that's a matter for archeology.
[08:06] ok
[08:06] Predates the policy, I think. :)
[08:06] so, action item here is ot move the actual fragile script to its own user.
[08:09] jtv: Can you point me at the script?
[08:09] stub: cronscripts/generate-contents-files.py
[08:10] lifeless: "not" move or "to" move?
[08:10] jtv: *to* move.
[08:10] finger-fail.
[08:10] jtv: so quick fix just hard code a new user in there, and add the new user in security.cfg to be an alias to archivepublisher.
[08:11] OK, I'll include that.
[08:11] (4 lines, not including whitespace)
[08:12] stub: jtv: I suggest doing the new user in the archivepublisher script, not this one.
[08:12] Is this really an either-or choice?
[08:12] given there are so many scripts on the same user already.
[08:12] jtv: not at all.
[08:12] rephrasing
[08:12] stub: jtv: I suggest doing a new user in the archivepublisher script, because thats the one we care about for deploys.
[08:13] which archivepublisher script?
[08:13] Hi bigjools
[08:13] (morning!)
[08:14] publish-distro and probably process-death-row need care.
[08:14] lifeless: this is beginning to sound like a very different problem from the one I'm currently dealing with — can we do it as a separate bug (though presumably critical)?
[08:15] p-d-r sometimes takes a while to run, because it doesn't look for PPAs that need work: it just looks at all of them.
[08:15] jtv: yes
[08:17] what is the problem?
[08:17] bigjools: the contents generation script takes more than 1 hour to run
[08:17] Uh, it has for a while?
[08:17] bigjools: the pre-deploy suspension of crontabs is done one hour before
[08:18] That seems premature and disruptive, but OK.
[08:18] StevenK: i didn't say 'now takes' :P...
[08:18] you can just kill it, it won't break anything
[08:18] bigjools: the contents generation script runs as the same db user as the archivepublisher, which needs to be whitelisted
[08:19] bigjools: so, the deploy script can't tell the difference
[08:19] wtf does it need a DB user?
[08:19] it used to be a shell script
[08:19] I have work in progress as a in-my-spare-time project to move contents generation from cocoplum's disk to the DB
[08:19] bigjools: you'll need to talk to jtv and wgrant for that question.
[08:19] bigjools: So we can tell which script it is, so we don't stop deploys for it.
[08:20] bigjools: At present it is indistinguishable from publish-distro.
[08:20] bigjools: but do you see the issue ? only the things that can't be interrupted can use a db user which is whitelisted.
[08:20] wgrant: no, I mean *why* does it need to touch the DB at all
[08:20] lifeless: yes
[08:20] bigjools: It used to use lp-query-distro and stuff.
[08:20] gah
[08:20] bigjools: Now it is Python.
[08:21] bigjools: so I'm proposing we move the must-not-interrupt stuff to a new dedicated user which we whitelist.
[08:21] bigjools: it always got bits and bobs from the DB, such as configs.
[08:21] It's just it used lots of short scripts, which also ran as archivepublisher.
[08:21] right
[08:21] Hi
[08:21] morning mrevell
[08:21] hi mrevell
[08:21] So it probably needs to grab configs and then delete the store
[08:21] bigjools: do you see any gotchas or issues with doing that ?
[08:21] lifeless: +1
[08:21] at the worst, we just inherit DB permissions in the cfg
[08:22] Since the rest of it does not require DB access
[08:22] adding a new user is trivial
[08:22] There's slightly more than just configs, but I'm currently making the script grab data first, then commit & apply a "no DB" policy.
[08:22] ok, can someone that knows which scripts are relevant, file a bug for this? as jtv says its a different issue from the generation script having too-long transactions.
[08:22] jtv: it might be worth re-writing it to shell out to the short-running scripts that need db access
[08:22] And wait for ZCML to be parsed for each?
[08:23] no bigdeal
[08:23] compared to the hour-long txn :)
[08:23] I've already got that isolated here.
[08:23] Please no.
[08:23] Maybe add XML-RPC calls.
[08:23] But no shelling out :(
[08:23]