[00:52] <lifeless> wgrant: I'll do oops-tools if you haven't
[01:04] <wgrant> lifeless: I haven't
[01:18] <StevenK> wgrant: So, what's going on is the bugtask subscriber stuff is poking in the event and getting a set of new_subscribers and passes that into add_bug_change_notifications. The BugTaskAssignee case is pawing through that list and removing the assignee -- which assumes that everyone listed in new_subscribers is already in recipients. The structsub's owner isn't, so bang.
[01:19] <wgrant> Yep
[01:19] <bigjools> Bug 1017293 \o/
[01:19] <_mup_> Bug #1017293: Commit message is not shown on the initial merge proposal email <code-review> <email> <qa-ok> <Launchpad itself:Fix Released by wallyworld> < https://launchpad.net/bugs/1017293 >
[01:19] <wallyworld_> bigjools: i accept credit card or prefer cash
[01:20] <bigjools> wallyworld_: coffee?
[01:20] <StevenK> wgrant: So new_subscribers isn't used anywhere else, and that's why it's only an impact for LIFECYCLE -- otherwise they get a reason
[01:20] <wallyworld_> ok, that will do
[01:20] <StevenK> wallyworld_: Act now, and bigjools might throw in his twins
[01:20] <wallyworld_> and a set of steak knives
[01:21]  * StevenK prepares a shallow grave for Tim Shaw
[01:21] <bigjools> twinings only do tea
[01:22] <StevenK> wgrant: Given we're only interested in the assignee, I'm tempted to catch the exception here and 'continue'
[01:23] <wgrant> StevenK: I'm not sure that's valid.
[01:24] <StevenK> wgrant: Why? This function does nothing else with new_subscribers and assumes that all of them are already in recipients
[01:25] <wgrant> StevenK: Can you point me at the problematic code?
[01:25] <StevenK> wgrant: lib/lp/bugs/subscribers/bug.py. The function starts at line 139; the block I'm talking about is at line 160
[01:27] <StevenK> wgrant: new_subscribers is passed in from subscribers/bugtask.py and is the difference of event.object.bug_subscribers and event.object_before_modification.bug_subscribers
[01:28] <StevenK> In the usual case, it's only going to be the assignee or the empty set
[01:28] <wgrant> Oh no
[01:28] <wgrant> Not this function :(
[01:30] <StevenK> wgrant: Hahaha
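[editor's note] The failure StevenK describes can be sketched in isolation. This is a hypothetical stand-in, not the real Launchpad code: `Recipients`, `UnknownRecipientError`, and `drop_assignee` are simplified models of the recipients set and of the block at line 160 of `lib/lp/bugs/subscribers/bug.py`, showing why a structural subscriber leaking into `new_subscribers` blows up, and the catch-and-continue fix he proposes:

```python
class UnknownRecipientError(KeyError):
    """Raised when a person is not in the recipients set."""


class Recipients:
    """Toy model of the notification recipients collection."""

    def __init__(self, people):
        self._people = set(people)

    def remove(self, person):
        if person not in self._people:
            raise UnknownRecipientError(person)
        self._people.discard(person)


def drop_assignee(recipients, new_subscribers):
    # The buggy version assumed everyone in new_subscribers was already
    # a recipient.  A LIFECYCLE-level structural subscription owner can
    # appear in new_subscribers without being a recipient, so we catch
    # the error and skip: if they aren't a recipient, they aren't the
    # assignee and won't be mailed anyway.
    for person in new_subscribers:
        try:
            recipients.remove(person)
        except UnknownRecipientError:
            continue


recipients = Recipients(["assignee", "reporter"])
drop_assignee(recipients, ["assignee", "structsub-owner"])
```

With the try/except in place the structsub owner is silently skipped instead of raising.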
[01:46] <wallyworld_> wgrant: thanks for looking at my branch, not sure i 100% agree with the project group reversions. the pg filebug page will list the reason why projects cannot have bugs filed, forbidden is but one reason so the mp sticks to the current conventions in that area
[01:47] <wgrant> wallyworld_: Well, doesn't it only check default_product now?
[01:48] <wgrant> Blocking it early is also far less necessary now that we just redirect to Product:+filebug
[01:48] <wgrant> And this is an exceptional misconfiguration situation
[01:48] <wgrant> So I'm not sure adding the extra confusing UI text to the existing situation improves things.
[01:48] <wgrant> It'll be very clear when they get redirected to Product:+filebug
[01:49] <wallyworld_> i think so, but what it does is display a message for each product saying eg "not malone" and i've added a new message "forbidden" to go with that
[01:50] <wallyworld_> reverting those bits would make the code simpler for sure
[01:50] <wgrant> It would make the code and UI simpler
[01:50] <wallyworld_> the ui is already not simple
[01:50] <wallyworld_> it already lists each product with reasons
[01:50] <wgrant> And I'm pretty sure the ProjectGroup contextAllowsNewBugs is buggy
[01:50] <wgrant> Does it?
[01:50] <wallyworld_> yes
[01:51] <wgrant> The text you've changed seems to only be shown when no projects are selectable.
[01:51] <wallyworld_> yes
[01:51] <wgrant> So it doesn't list each product
[01:51] <wgrant> 245	There are no projects registered for <span
[01:51] <wgrant> 246	tal:replace="context/displayname">project displayname</span>
[01:51] <wgrant> 247	- that use Launchpad to track bugs.
[01:51] <wgrant> 248	+ that either use Launchpad to track bugs or allow new bugs to be filed.
[01:51] <wallyworld_> it does, i tested it
[01:51] <wgrant> 256	- but none of them use Launchpad as their bug tracker.
[01:51] <wgrant> 257	+ but either none of them use Launchpad as their bug tracker or none
[01:51] <wgrant> 258	+ allow new bugs to be filed.
[01:51] <wgrant> 272	There is 1 project registered as part of
[01:51] <wgrant> 273	<tal:project replace="context/displayname" /> but it
[01:51] <wgrant> 274	- does not use Launchpad as its bug tracker.
[01:51] <wgrant> 275	+ does not use Launchpad as its bug tracker or does not allow new
[01:51] <wgrant> 276	+ bugs to be filed.
[01:51] <wallyworld_> yes, that's the summary message at the top
[01:52] <wallyworld_> and then each product is listed
[01:52] <wallyworld_> with the reason
[01:52] <wgrant> https://launchpad.net/launchpad-project/+filebug
[01:52] <wgrant> There's just a <select>
[01:53] <wallyworld_> but that's not relevant here since launchpad-project has products that allow new bugs
[01:53] <wallyworld_> on a .dev system, update firefox to forbidden and then go to mozilla/+filebug
[01:54] <wgrant> Oh, that's all wrapped in !contextUsesMalone
[01:54] <wgrant> How odd
[01:54] <wgrant> I think we can just delete that
[01:54]  * wgrant checks the DB
[01:56] <wgrant> Heh
[01:56] <wgrant> What a buggy little-known page
[01:56] <wgrant> https://launchpad.net/mercurial/+filebug
[01:57] <wgrant> it has a persistent notification and accidentally nested bullets somehow
[01:57] <wgrant> (and it's not linked from anywhere)
[01:57] <wallyworld_> that's why i took the approach i did - to be consistent with what was there
[01:57] <wgrant> So I would suggest we delete most of the complexity from that block
[01:57] <wgrant> It's not worth it
[01:57] <wgrant> The page is unlinked and already buggy
[01:57] <StevenK> wgrant: So, "not this function" and no other discussion? :-(
[01:57] <wgrant> StevenK: I was reviewing wallyworld's branch and that function is horrible :)
[01:59] <wallyworld_> wgrant: it may be buggy but this branch is already large. i think it best to be consistent with what's there and do any tidy up in another branch
[02:00] <wgrant> wallyworld_: If you fix the contextAllowsNewBugs bug, can your new text on ProjectGroup:+filebug ever be true?
[02:00] <wallyworld_> i do think it's reasonable that it doesn't allow you to enter all this info, and then say "ha ha, no you can't really file a bug here"
[02:00] <lifeless> does it get hits?
[02:00] <wgrant> wallyworld_: "all this info" is a project and bug title.
[02:00] <wgrant> wallyworld_: The description etc. is after the redirect
[02:01] <wallyworld_> sure, but still, it allows you to proceed when you can't
[02:01] <wgrant> wallyworld_: Your contextAllowsNewBugs is buggy because it will block the page if the default project, not all projects, happen to be forbidden
[02:01] <wallyworld_> i was following the pattern used for contextUsesMalone
[02:02] <wgrant> Unless you do a very complex fix for that, I don't think the added text on ProjectGroup:+filebug can be true
[02:02] <wallyworld_> which will do the same thing
[02:02] <wgrant> wallyworld_: contextUsesMalone works because default_product is the first product in the project group that uses malone
[02:02] <wgrant> So if it's present, there's a project that uses malone
[02:02] <wallyworld_> ok, it's easy to change contextAllowsNewBugs to see if one of the projects allows bugs
[02:03] <wgrant> It's easy but completely not worth it.
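[editor's note] The fix wgrant alludes to can be sketched with hypothetical stand-ins (the real `contextAllowsNewBugs` lives in the Launchpad filebug view; `Product` and the attribute names here are illustrative only): the buggy version checked only `default_product`, while a correct version must pass if *any* product in the group both uses Launchpad for bugs and accepts new ones.

```python
from collections import namedtuple

# Illustrative stand-in for a project-group member; attribute names are
# assumptions, not the real Launchpad model.
Product = namedtuple("Product", "name uses_malone accepts_new_bugs")


def context_allows_new_bugs(products):
    # Mirrors the contextUsesMalone logic wgrant describes: true iff at
    # least one product in the group can actually take a new bug.
    return any(p.uses_malone and p.accepts_new_bugs for p in products)


group = [
    Product("firefox", True, False),      # forbidden
    Product("thunderbird", True, True),   # still accepts bugs
]
```

With one accepting product the group-level check passes; if every product were forbidden it would fail, matching the "at least one project uses malone" semantics of `default_product`.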
[02:03] <wallyworld_> i don't agree
[02:03] <wgrant> This is an exceptional situation where the worst that can happen is you enter a bug summary and get told "lol no"
[02:03] <wgrant> It's not worth the extra code on this little-used view
[02:04] <wgrant> And tests
[02:05] <wallyworld_> the view isn't inlinked
[02:05] <wallyworld_> unlinked
[02:05] <wallyworld_> it's linked from the project group page
[02:05] <wgrant> The complex product listing bit is unlinked today
[02:05] <wgrant> https://launchpad.net/ggi
[02:05] <wgrant> spot the link
[02:06] <wallyworld_> you get the view if you +filebug on a project group
[02:06] <wgrant> wallyworld_: how do you get there if no projects use malone?
[02:06] <wgrant> The link is not shown in that case
[02:06] <wgrant> The page needs to tell you "lol no" if you URL hack, but it doesn't need dozens of lines of complex template logic to be pretty
[02:06] <wallyworld_> i didn't check the tal - is the link wrapped in a condition?
[02:11] <wgrant> wallyworld_: The +filebug link is only shown if bug_tracking_usage == LAUNCHPAD, and a project group's bug_tracking_usage == LAUNCHPAD iff at least one of its projects does.
[02:12] <wgrant> So the link is only shown iff contextUsesMalone is true
[02:12] <wallyworld_> so it seems that view is only there if people url hack
[02:12] <wgrant> So the !contextUsesMalone case is unimportant
[02:12] <wgrant> (there's 105 lines of template for it, despite it being unimportant except for URL hackers)
[02:13] <wgrant> That's 20% of the entire template dedicated to a case that's unlinked.
[02:13] <wgrant> Delete :)
[02:13] <wallyworld_> but people may have bookmarks etc, so we should still display something nice
[02:14] <StevenK> I think wgrant and I agree that 'Not found' is something nice for this case
[02:14] <wallyworld_> what? really?
[02:14] <wallyworld_> with forbidden, the view will not be unlinked anymore either
[02:15] <wgrant> It's not "Not found"
[02:15] <StevenK> It isn't?
[02:15] <wgrant> It's "No projects in this project group use Launchpad for bug tracking."
[02:15] <wgrant> Or thereabouts
[02:15] <wallyworld_> so you could click on +filebug, and you should see a page which says, sorry, xxx project does not accept new bugs
[02:15] <wgrant> In roughly 3 lines of template
[02:15] <wgrant> Without any logic
[02:17] <wgrant> wallyworld_: Well, in the exceptional, rare, and usually temporary case of subscription expiration (which usually warrants immediate notification of and action by the project owner), you'll be able to see the +filebug link, click it, select the product, enter a summary, and then be told that your project is in a perilous state
[02:17] <wallyworld_> for the addition of a few lines of tal
[02:18] <wgrant> For the deletion of 100 lines of TAL
[02:18] <wallyworld_> the tal that's there does seem reasonable though for the !uses malone case
[02:19] <wallyworld_> since people may have the page bookmarked
[02:19] <wallyworld_> and the page suggests where they might go instead
[02:21] <wallyworld_> so if one believes we should scrap that nicety, then yes delete the tal, if not then adding the extra few lines for this new case is reasonable
[02:23] <wgrant> wallyworld_: ProjectGroup:+filebug is hit a total of ~200 times a month
[02:23] <wgrant> That's already tiny
[02:23] <wgrant> The fraction of people with bookmarks to +filebug on project groups with no projects that use LP is going to be even tinier
[02:24] <wallyworld_> hmmmm
[02:25] <wgrant> (and that 200 hits includes search engines and such)
[02:25] <StevenK> I was under the impression it was a whole template.
[02:25] <StevenK> If it's part of one, kill, kill, kill
[02:25] <wallyworld_> there's a bunch of macros
[02:26] <StevenK> Excellent, even more things to kill
[02:26] <wallyworld_> i can look to kill, but i really dislike unilaterally removing functionality with the stroke of a pen
[02:26] <wallyworld_> without understanding the impact to users
[02:27] <wallyworld_> but 200 hits a month seems small
[02:27] <StevenK> With the press of a key even? :-)
[02:27] <wallyworld_> yeah, that too
[02:28] <StevenK> wgrant: http://pastebin.ubuntu.com/1216037/
[02:29] <StevenK> wgrant: Do you disagree with the verbose comment I've added as well as my fix?
[02:51] <wgrant> wallyworld_: Right, we shouldn't just rip out functionality, but the bit we're removing is unlinked, unlikely to be bookmarked, and the whole page is only viewed a few times a day.
[02:52] <wgrant> It's an excellent candidate for just being deleted
[03:07] <StevenK> wallyworld_, wgrant: https://code.launchpad.net/~stevenk/launchpad/destroy-cause-unknownrecipienterror/+merge/125397
[03:49] <StevenK> wallyworld_, wgrant: No review? :-(
[03:57] <wallyworld_> StevenK: sorry, was having coffee with bigjools
[04:01]  * StevenK rings the blood bank
[04:02] <wallyworld_> StevenK: "due to level or so" doesn't quite make sense to me
[04:02] <wallyworld_> when you are not otp
[04:03] <wallyworld_> the comment also says "we are only interested in dropping the assignee out" and yet we then proceed to drop anyone else not in the recipients list, so it's a little confusing as written perhaps
[04:08] <StevenK> wallyworld_: Yes, exactly. If they aren't in the recipients list, they aren't the assignee and they aren't going to be mailed and can be ignored.
[04:10] <StevenK> wallyworld_: Due to level or so -- the people in new_subscribers might be BugNotificationLevel.LIFECYCLE -- if we aren't closing the bug, they won't get mailed anyway, but they may leak into new_subscribers.
[04:11] <wallyworld_> the recipients list contains other people besides assignee so saying that not being in the recipients list equates to not being the assignee doesn't quite add up for me
[04:12] <wallyworld_> maybe you could reword the level bit to work into the comment something along the lines of what you posted just above?
[04:12] <StevenK> wallyworld_: Do you want to get back onto mumble? Words are obviously failing me.
[04:13] <wallyworld_> ok
[04:15]  * wgrant tries to fix translation search timeouts
[04:18] <StevenK> wgrant: Are you fixing it by removing rosetta?
[04:19] <wgrant> Hopefully not
[04:19] <lifeless> hah, cute
[04:21] <StevenK> wallyworld_: http://pastebin.ubuntu.com/1216114/
[05:30] <lifeless> wgrant: did you add the new -repo tarball to the download cache?
[05:31] <wgrant> lifeless: No, I felt I'd pushed my luck enough by uploading to PyPI. But I can if you want.
[05:31] <lifeless> nope can does
[05:54] <StevenK> wallyworld_, wgrant: https://code.launchpad.net/~stevenk/launchpad/catch-incomingemailerror/+merge/125416
[05:56] <wallyworld_> StevenK: could you monkey patch a test? i guess as you say, it's a trivial change
[05:57] <StevenK> wallyworld_: I've been looking at the end to end tests, and I don't think so, as I said.
[05:58] <wallyworld_> np
[05:58] <wallyworld_> r=me
[06:05] <lifeless> wgrant: can you coordinate getting oops-tools deployed ?
[06:05] <lifeless> wgrant: release is done, dependency bumped etc
[06:06] <wgrant> lifeless: That's just a matter of poking ops, right?
[06:06] <lifeless> yes
[06:07] <lifeless> and helping if there should be fallout / conflicts etc.
[06:07] <lifeless> You'll find their local diff instructive.
[06:07] <lifeless> I need to go help lynne now is all, or I'd do this.
[06:08] <wgrant> Well, I know I want to clobber at least one bit of the local diff
[07:04] <StevenK> wallyworld_: Still here?
[07:04] <lifeless> wgrant: how did the deploy go ?
[07:07] <wgrant> lifeless: I was going to wait until we had ops that weren't alone and near the end of their shift
[07:09] <stub1> So I'm just about finished with the new deployment updates, and I'm now thinking most of it is useless.
[07:09] <wgrant> stub1: Oh?
[07:10] <stub1> Because streaming replication will automatically cancel transactions that conflict with WAL that needs replaying
[07:11] <stub1> So we disable access to the master, killing master connections, and apply the patches. That gets replayed on the slaves and if that transaction happens to conflict with what slave transactions are trying to do, they get cancelled.
[07:12] <stub1> There is a timeout for how long they are given, which is how long lag is allowed to get.
[07:12] <wgrant> Hmmm, that's a reasonable point. Have you tried it locally?
[07:12] <stub> Not yet, just clicked
[07:13] <stub> By default, that timeout is 30 seconds. Should be easy enough to test by locking a table on the slave.
[07:14] <stub> bah... can't lock on a slave :)
[07:14] <wgrant> Well
[07:14] <wgrant> Select from a table on the slave
[07:14] <wgrant> Drop a column on the master
[07:14] <wgrant> See what happens
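[editor's note] wgrant's three-step test can be written out as a two-session psql sketch. This is illustrative only: the table and column names are stand-ins, and it assumes a streaming-replication pair with hot standby enabled.

```sql
-- Session 1: on the slave, open a snapshot and keep it alive.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM person;
-- ...leave the transaction open...

-- Session 2: on the master, make a conflicting schema change.
ALTER TABLE person DROP COLUMN some_column;

-- Back in session 1: once the WAL conflict has outlived the standby's
-- grace period (max_standby_streaming_delay, 30s by default), the slave
-- cancels the conflicting backend, e.g.:
--   FATAL: terminating connection due to conflict with recovery
```

This matches what stub observes a few lines later: the conflicting slave transaction is killed rather than allowed to block replay indefinitely.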
[07:15] <wgrant> My local replication setup has mysteriously evaporated...
[07:15] <stub> Yeah, just need to ensure that select keeps running for longer than the timeout, plus time for my fat fingers.
[07:15] <wgrant> stub: Well, not necessarily
[07:15] <wgrant> stub: As long as it's one transaction it should be fine
[07:15] <wgrant> Where "fine" means "cancelled"
[07:15] <stub> nah, that would require repeatable read
[07:15] <wgrant> Since it's not legal to have a subsequent SELECT * in a REPEATABLE READ transaction return fewer columns...
[07:15] <wgrant> Right
[07:16] <stub> no serializable on the slaves
[07:16] <wgrant> But you can get REPEATABLE READ on a slave
[07:16] <wgrant> Just not SSI SERIALIZABLE, since that requires feedback
[07:17] <stub> yeah, that should work
[07:17] <stub> Silly new isolation levels confusing me :(
[07:18] <wallyworld_> StevenK: yes
[07:19] <stub> Yup: FATAL:  terminating connection due to conflict with recovery
[07:20] <stub> Just need to make sure Storm handles that one :)
[07:20] <stub> It will need to in any case
[07:20] <wgrant> That's handy, if it kills the connection rather than aborting the transaction
[07:20] <StevenK> wallyworld_: I thought we showed OOPS for AJAX timeouts via JS?
[07:20] <stub> Yeah, except when the connection being killed is my backups ;)
[07:20] <stub> the connection is killed
[07:20] <wallyworld_> StevenK: we should do so long as the client js is correctly constructed
[07:21] <stub> should have clicked earlier with that problem staring me in the face
[07:21] <StevenK> wallyworld_: Ah ha, so could you look at https://bugs.launchpad.net/launchpad/+bug/925937 ?
[07:21] <_mup_> Bug #925937: Does this bug affect you: Timeout error popup <timeout> <Launchpad itself:Triaged> < https://launchpad.net/bugs/925937 >
[07:21] <wgrant> stub: Yeah
[07:21] <wgrant> stub: There's probably a gotcha here
[07:21]  * wallyworld_ looks
[07:21] <wgrant> stub: But it seems like it should work...
[07:21] <lifeless> so pending schema changes won't cause locks on the slaves?
[07:21] <StevenK> wallyworld_: I don't want you to fix it, mind you, just figure out why we don't get the OOPS so we can fix that first.
[07:21] <wgrant> lifeless: They won't cause locks. They'll cause massacres.
[07:21] <stub> wgrant: Only if they improve the feedback the slaves are able to give the master in future pg releases
[07:21] <lifeless> which we want to avoid
[07:21]  * wallyworld_ nods
[07:22] <stub> At the moment, the slave only sends back enough information to stop vacuum cleaning up records that might still be in use. And even that seems busted at the moment.
[07:22] <lifeless> stub: so what happens to *new* transactions after the schema change is in the WAL
[07:22] <lifeless> stub: but not yet applied.
[07:22] <wgrant> stub: Still, we need the disable/enable stuff to kill off the master connections when we want HA clients to fall back to the slaves
[07:22] <stub> So the slave has no choice but to either lag, or cancel transactions
[07:22] <stub> It does both, allowing lag up to the configurable max then cancels connections.
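[editor's note] The "configurable max" stub refers to is controlled by the standby settings in postgresql.conf. A minimal sketch for a PostgreSQL 9.1-era primary/standby pair (values shown are the defaults or illustrative):

```
# on the master
wal_level = hot_standby            # log enough WAL detail for standby queries

# on the standby
hot_standby = on                   # allow read-only queries during recovery
max_standby_streaming_delay = 30s  # lag tolerated before conflicting
                                   # standby queries are cancelled
```

Raising `max_standby_streaming_delay` trades replication lag for fewer cancelled slave transactions, which is exactly the lag-vs-cancel tradeoff discussed above.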
[07:23] <stub> wgrant: Yes
[07:23] <wallyworld_> StevenK: from memory, the js for that affects me widget predates work to standardise js error handling, so it should just be a bit of a tweak to make it behave properly
[07:23] <wgrant> stub: So most of the work is still completely relevant.
[07:23] <stub> lifeless: All transactions before the WAL is applied are the same
[07:23] <lifeless> stub: so, it sounds to me like your wrk is very valuable
[07:23] <lifeless> stub: because the goal was 0 interrupted reads
[07:23] <stub> wgrant: half of it is. But I've written the whole thing :)
[07:23] <StevenK> wallyworld_: Right, okay. I'll grab you tomorrow to look it over.
[07:23] <wallyworld_> StevenK: basically, an ErrorHandler needs to be constructed and used in conjunction with the io call
[07:23] <wallyworld_> ok
[07:24] <adeuring> good morning
[07:24] <wgrant> lifeless: I thought the goal was 0 read interruptions, not 0 interrupted reads.
[07:24] <StevenK> wallyworld_: Oh, right. That should be pretty easy, then.
[07:24] <wgrant> lifeless: They're somewhat different
[07:24] <wallyworld_> StevenK: my statements above are made without looking at the code, just an educated guess
[07:24] <StevenK> wallyworld_: Sure.
[07:24] <stub> It means we don't need to disable replication, so less work recovering from a failed upgrade and fewer things to put us in OMG ring the dba mode.
[07:24] <wallyworld_> but yes, should be easy
[07:24] <wgrant> stub: We do need to disable replication if we want to let transactions on the slaves finish.
[07:25] <wgrant> But I don't think we do
[07:25] <StevenK> wallyworld_: So my two step plan is to instrument the call so we can actually get an OOPS and then see WTF is going on.
[07:25] <wallyworld_> yes, good idea
[07:25] <stub> wgrant: That involves switching pgbouncer to transactional mode
[07:26] <lifeless> wgrant: the goal is that the new PPA infrastructure be reliable.
[07:26] <stub> I don't think we have that option, at least on the master, because we are making use of temporary tables
[07:26] <wgrant> lifeless: That means 0 read interruptions, not 0 interrupted reads
[07:26] <wgrant> lifeless: We can interrupt reads as long as they can be immediately retried.
[07:27] <lifeless> ok
[07:27] <lifeless> I'll let you and stub get back into it :)
[07:28] <stub> With pgbouncer running in session mode, there is no way to say 'let the connections finish the current transaction, then terminate them'. We would have to add that. I'm not sure if it is possible.
[07:28] <wgrant> Yeah, that's a good point
[07:28] <wgrant> Well
[07:28] <wgrant> The webapp is easy
[07:28] <wgrant> Once replication is paused lag will increase
[07:28] <stub> And since faster is better, I think terminating and letting them retry is just fine. These services need to handle network outages, and that is exactly what the update looks like to them
[07:29] <wgrant> And the webapp will run away from the slave after the last transaction
[07:29] <wgrant> Right, exactly.
[07:29] <wgrant> I don't think we need to preserve the transactions.
[07:29] <wgrant> But we can if we need to
[07:30] <stub> Haha... most of my work really is unnecessary.
[07:30] <wgrant> Oh?
[07:30] <stub> We could run a second pgbouncer in transaction mode. Set all the slave connections to use that.
[07:31] <stub> Then we run the existing script which shuts down the master pgbouncer
[07:32] <wgrant> We could
[07:32] <wgrant> though there's a pull request for per-DB pool_mode
[07:32] <stub> Yeah, the new approach is nicer and doesn't require sudo loving.
[07:32] <wgrant> And it's faster
[07:32] <stub> And I certainly want per-DB pool_mode
[07:32] <wgrant> And it's also less sickening :)
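[editor's note] The per-database `pool_mode` wgrant mentions was only a pull request at the time of this log; it landed in later pgbouncer releases. A hypothetical pgbouncer.ini fragment for the split stub describes (hostnames and database names are made up):

```ini
[databases]
; master connections keep session pooling (temp tables need it)
launchpad_main  = host=master-db port=5432 pool_mode=session
; slave connections use transaction pooling, so "finish the current
; transaction then terminate" comes for free
launchpad_slave = host=slave-db port=5432 pool_mode=transaction

[pgbouncer]
pool_mode = session   ; default for databases that don't override it
```

With transaction pooling on the slaves, pausing them between transactions replaces the second-pgbouncer workaround discussed above.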
[07:33] <stub> Hey, at least I used sudo rather than kill -9 ;)
[07:33] <wgrant> True, true.
[07:35] <wgrant> Ah
[07:36] <wgrant> Slaves don't like it much when there's a recovery.done rsynced from the master alongside their recovery.conf :)
[07:37] <wgrant> stub: Do you know how the killing works internally?
[07:37] <wgrant> I guess the RO transactions take the usual minimal locks
[07:37] <lifeless> i imagine the wal segment
[07:37] <wgrant> And then recovery checks locks as it goes, killing anything that might conflict
[07:37] <lifeless> lists the relations that locks are needed on
[07:38] <lifeless> perhaps implicitly
[07:38] <wgrant> Right.
[07:44] <wgrant> I think I have fixed translation searching
[07:44] <wgrant> Yay
[07:44] <wgrant> Although one of the new indices is 5GB, which probably pushes sourcherry over the limit
[07:44] <wgrant> So there goes that idea
[07:46] <lifeless> if its needed, its needed
[07:47] <wgrant> Yeah, but if it pushes sourcherry over the edge then I can't do it tomorrow :)
[07:53] <StevenK> wgrant: Is it just an index, or a query fix as well?
[07:54] <wgrant> Both
[07:58] <stub> wgrant: I think that after you set wal_level=hot_standby, the WAL files include enough information to block or cancel transactions using resources that might change, so lock information at a minimum.
[07:59] <wgrant> stub: Right, it would make sense if that was the difference between archive and hot_standby
[07:59] <stub> wgrant: And the RO transactions will take the usual minimal locks so this actually works
[07:59] <wgrant> Right, exactly
[07:59] <stub> And because then the hot standbys are using the same well tested code path :)
[08:02] <stub> wgrant: We still have 170GB free I think. It might set off alerts, but it should fit.
[08:03] <stub> But given that partition is already at 91%... I suspect the alerts have already been adjusted.
[08:03] <wgrant> Last I heard there was ~100GB
[08:03] <wgrant> Ah
[08:03] <wgrant> So some space has been freed
[08:03] <stub> Maybe fewer backups at the moment :)
[08:03] <wgrant> This'll consume about 22GB all up
[08:03] <wgrant> Heh
[08:03] <wgrant> True
[08:03] <wgrant> (about 5.5GB of new indices per instance)
[08:03] <stub> right
[08:29] <stub> 2012-09-20 08:28:57 INFO    Outage complete. 0:00:00.913263
[08:29] <stub> That is local
[08:30] <wgrant> stub: Have you tried disabling security.py and comments.sql?
[08:30] <wgrant> That should get it down below 500ms
[08:30] <wgrant> Unless something's going wrong
[08:30] <wgrant> No, I'm never satisfied :)
[08:31] <stub> When do we reset permissions then?
[08:31] <wgrant> We do it then, just with a faster version of security.py
[08:31] <stub> We currently only grant permissions live, we don't revoke them live
[08:32]  * stub tries to recall why that is
[08:32] <wgrant> I think the only thing we really need to do during the outage is relation ownership changes.
[08:32] <wgrant> And we avoid revoking during nodowntimes because security.py is run before the code is deployed
[08:33] <wgrant> I don't believe revocations take any notable locks, but table ownership changes need something hideous
[08:33] <wgrant> It used to be ACCESS EXCLUSIVE; not sure if that remains true in 91.
[08:33] <wgrant> 9.1
[08:33] <stub> Oh, we need to grant permissions in the same transaction as we create new objects
[08:33] <stub> Which this simplified ignore-the-slave methodology doesn't cope with, but can with a little refactoring.
[08:34] <wgrant> Ah, you mean we might replicate new objects that require new permissions before we replicate the permissions? Indeed, that's a potential issue
[08:34] <wgrant> Well, s/potential/real/
[08:34] <stub> Yeah, window is so tiny we could pretend it doesn't exist... but it is there.
[08:35] <stub> I need to share the transaction between upgrade.py and security.py, committing at the end.
[08:36] <wgrant> I think security.py should sensibly run entirely within a transaction now that we've removed cluster mode
[08:36] <stub> It does I believe
[08:37] <stub> It makes the failure mode nicer too, all the updates applied or they didn't. Currently we might get schema updates but no permission changes if things explode.
[08:37] <wgrant> Right
[08:38] <wgrant> Although that's only an issue in a couple of situations
[08:38] <wgrant> SECURITY INVOKER functions, Person foreign keys, LFA foreign keys.
[08:38] <wgrant> I think
[08:38] <wgrant> But it could still be an issue if we're really unlucky :)
[08:51] <wgrant> OK
[08:51] <wgrant> Now this is odd
[08:51] <wgrant> stub: Ahem
[08:52] <wgrant> stub: Locks replicate
[08:52] <wgrant> stub: We still need to disable access
[08:52] <wgrant> stub: On the master, set a table owner in a transaction
[08:52] <wgrant> Don't commit
[08:52] <wgrant> On the slave a couple of seconds later, try to select from that table
[08:52] <wgrant> It will hang
[08:52] <wgrant> abort the master transaction, the slave will unhang half a second later
[08:53] <stub> wgrant: Unless I disable replication
[08:53] <wgrant> stub: But we still have to kill everybody off before we restart replication if we don't want them to hang
[08:54] <wgrant> stub: eg. I'm just starting my transaction, and replication unpauses... person's owner changes, and I try to select from it a couple of milliseconds later
[08:54] <wgrant> Now I time out waiting for the lock
[08:54] <wgrant> Because some other transaction on the slave was holding person, and recovery was waiting for it
[08:54] <stub> If you are waiting for the lock, you will get it until replication propagates and you are kicked off.
[08:55] <wgrant> But I won't be kicked off immediately
[08:55] <stub> No, it is a little nicer than that.
[08:56]  * stub wonders what is nicer
[08:56] <stub> Sucks I have to disable replication in any case
[08:56] <wgrant> Hey, it's nice and easy and less unreliable this time
[08:56] <wgrant> Much nicer than the slony days
[08:57] <stub> Yeah, it is still an improvement
[08:57] <stub> Just wondering if I should disable slave access for a while in any case.
[08:58] <stub> I think we can get away with not doing it, and I think it is nicer on the clients, but I'd rather be sure.
[08:58] <wgrant> I think we have to disable slave access, or we'll run into the same locking issues as normal
[08:59] <wgrant> It'll automatically undeadlock after 30s, though
[09:00] <wgrant> Hmm
[09:00] <wgrant> Thinking
[09:00] <stub> So real transactions are being run while the WAL is being replayed. I guess if the WAL replay grabs a lock before a slave db connection does, then the slave db connection blocks until the update transaction has finished being replayed.
[09:02] <stub> If vice versa, the WAL replay blocks until it times out. And that WAL replay potentially might have locks open.
[09:02] <stub> I'm not sure if that is reality, but if that seems logical it is enough to go back to disabling replication and slave db connections
[09:03] <wgrant> Ah
[09:03] <wgrant> I think it depends on whether all the locks are taken atomically
[09:03] <wgrant> But they won't be
[09:03] <wgrant> So it's not safe
[09:03] <stub> My confusion is because I don't know how transactional WAL replay is.
[09:03] <stub> yer
[09:03] <stub> sod it, I'll take the safe route
[09:04] <stub> code is already there in any case :)
[09:04] <wgrant> Yeah, we need to do the planned full slave disablement dance to be safe
[09:04] <wgrant> Yeah
[09:04] <wgrant> I didn't realise it replicated locks like that
[09:05] <wgrant> I guess it probably only needs to replicate ACCESS EXCLUSIVEs, since all the slave txns are RO
[09:07] <wgrant> Yeah, only access exclusives are logged in hot_standby
[09:09] <wgrant> stub: Ah
[09:09] <wgrant> stub: It gets messier
[09:09] <wgrant> Only idle sessions get killed
[09:09] <wgrant> If there's a query executing, the query gets cancelled and the transaction aborted
[09:09] <wgrant> But the session survives
[09:10] <wgrant> So yeah, I think we really want to do the full disable and kill dance
[09:11] <stub> Because of Storm our code won't care if the session is killed or just the transaction, but yeah.
[09:11] <stub> oh.. might have that wrong
[09:11] <wgrant> It will care
[13:52] <deryck> looks like maybe rick_h_ was dropped from our stand-up, which is why it was listed as updated.
[13:52] <deryck> I really do think that was Zoe and Siri and not me. ;)
[13:53] <rick_h_> I've been kicked out by Siri?!
[13:53] <rick_h_> come on, just because I don't love the apples :P
[14:30] <deryck> rick_h_, I replied on email, but I may have misunderstood you -- did you not remember we were on an old webkit, or is it that you still find it hard to believe this name resolution could be the problem?
[14:31] <deryck> I think you and allenap probably have the same WTF lobe going off. ;)
[14:31] <rick_h_> deryck: yea, I replied with the test results
[14:31] <rick_h_> deryck: once that lands it'll help perhaps.
[14:31] <rick_h_> deryck: I still have a wtf since there were maybe 10 lookups in that diff that was changed
[14:32] <rick_h_> and in my test I'm counting to 1M and such
[14:32] <rick_h_> but the effect is demonstrated
[14:32] <deryck> rick_h_, right, but sinzui confirmed the webkit on ec2 test is 20x slower than post-lucid.
[14:32] <rick_h_> deryck: I know we're on older webkit, but just the time diff between a namespace lookup 10 times seems it wouldn't be but a handful of ms, but that's what I've not verified
[14:32] <rick_h_> right
[14:33] <rick_h_> still, let's say it's a full ms slower per nesting, and we removed 10 cases that were two levels too deep. That saved us 20ms
[14:33] <rick_h_> and if that was triggering a timeout it still wtf lobes me
[14:35] <rick_h_> deryck: but yea, always known there was a hit and the jsperf charts in my email show it in IE and FF, looks like Chrome optimizes it away automatically
[14:35] <rick_h_> and older machines will show it for sure
[14:35] <deryck> chrome is the best for this, indeed.
[14:36] <rick_h_> I'm past it, it was a red flag I got namespace happy
[14:36] <deryck> and without profiling, it's all speculation how much the actual hit is in lucid webkit.
[14:36] <rick_h_> so pita to debug, still kind of wtf...but moving on atm
[14:36] <deryck> yeah, I didn't think you were dwelling on it.
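The micro-benchmark rick_h_ describes (counting to 1M through a nested namespace) can be sketched in Python as well: repeatedly resolving a deep attribute chain inside a loop versus hoisting the lookup into a local variable. The namespace names below are made up for illustration, not taken from the Launchpad JavaScript:

```python
# Illustrative version of the nested-namespace-lookup benchmark discussed
# above. The `root.lp.app` chain is a stand-in for a Y.lp.foo.bar style
# JavaScript namespace; the effect (per-level lookup cost in a hot loop)
# is analogous in Python.
import timeit

class NS:
    """Empty holder object used to build a nested namespace."""

root = NS()
root.lp = NS()
root.lp.app = NS()
root.lp.app.func = lambda: None

def deep_lookup(n=100000):
    # Resolves root -> lp -> app -> func on every iteration.
    for _ in range(n):
        root.lp.app.func()

def cached_lookup(n=100000):
    # Hoists the lookup out of the loop, as the diff under discussion did.
    func = root.lp.app.func
    for _ in range(n):
        func()

deep = timeit.timeit(deep_lookup, number=5)
cached = timeit.timeit(cached_lookup, number=5)
print(f"deep={deep:.3f}s cached={cached:.3f}s")
```

As in the jsperf results mentioned above, the absolute difference is small per call and an optimizing runtime (Chrome, or PyPy on the Python side) may hide it entirely, which is why profiling on the actual slow webkit is the only way to know the real cost.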
[15:19] <cjohnston> Trying my wording change again until we can get it sorted out to make more substantial changes to participation essential. :-) https://code.launchpad.net/~chrisjohnston/launchpad/part-essential-2/+merge/125501
[15:58] <sinzui> jcsackett, do you have time to review https://code.launchpad.net/~sinzui/launchpad/incomplete-api/+merge/125515
[15:58] <jcsackett> sinzui: sure. :-)
[15:58] <jcsackett> sinzui: looking now.
[16:02] <jcsackett> sinzui: looks good, r=me.
[16:02] <sinzui> thanks jcsackett
[16:02] <jcsackett> and curse your 3000 LoC credit. :-P
[16:10] <sinzui> jcsackett, you can remove the portlet that suggests Ubuntu packages on project pages: https://launchpad.net/kran
[16:11] <sinzui> jcsackett, The feature is ineffective now because we have matched the existing projects to packages. The top 1,200 packages that need upstream linking require a project to be registered.
[16:12] <jcsackett> sinzui: ah, cool.
[16:12] <sinzui> So removing this portlet + view + tests will get you a few 1000 lines
[16:12] <jcsackett> sinzui: ah, sounds like a plan.
[16:12] <cjohnston> sinzui: I created a new MP for just the text change.. how do we go about starting the process of further investigating the usage of participation essential?
[16:12] <sinzui> jcsackett, there is a counterpart to this portlet on package pages; it will remain because that is how we get the expedited project registration
[16:12]  * sinzui looks
[16:17] <sinzui> cjohnston, I will land your branch
[16:17] <cjohnston> ty
[16:48] <deryck> jcsackett, hi.  I have a branch for review if you have the time.
[16:50] <jcsackett> deryck: sure.
[16:51] <deryck> jcsackett, thanks!  https://code.launchpad.net/~deryck/launchpad/honor-default-for-spec-sharing-policy-1052521/+merge/125039
[17:06] <jcsackett> deryck: r=me.
[17:06] <deryck> jcsackett, awesome, thanks
[21:11] <sinzui> wgrant, StevenK, wallyworld_, jcsackett, I am summoned to visit the local institution that claims to be preparing my daughter for a place in society. I cannot make our meeting. I will send an email with what I did.
[21:12] <wallyworld_> ok, have fun :-)
[21:12] <wallyworld_> i may be a few minutes late myself
[21:33] <czajkowski> cjwatson: nice mail to -devl, found it easier to read than the blog post
[22:05] <cjwatson> czajkowski: thanks
[22:07] <czajkowski> cjwatson: what is the reason for the fedora patches being included?
[22:09] <cjwatson> czajkowski: either those patches or the moral equivalent are likely to be necessary to avoid our key being revoked
[22:09] <czajkowski> ahh so theyve done some of the leg work already ?
[22:09] <cjwatson> And there's no point reinventing the wheel for the sake of it
[22:09] <cjwatson> Yes
[22:09] <czajkowski> gotcha
[22:09] <czajkowski> thanks
[22:10] <cjwatson> They aren't particularly long
[22:38] <StevenK> wgrant: Can you create a structsub on https://qastaging.launchpad.net/auditorfixture/+milestone/milestone-foo ?
[22:38] <StevenK> wgrant: Of open/close, etc ...
[22:39] <StevenK> cjwatson: I'm guessing r15988 is hard to QA on mawson?
[22:40] <wgrant> StevenK: Done
[22:41] <StevenK> And no OOPS. Excellent.
[22:42]  * StevenK ponders marking r15990 as qa-untestable
[23:26] <StevenK> wallyworld_: Can you peer at lib/lp/bugs/javascript/bugtask_index.js ? The affects me too stuff is in there, and it looks like there already is an error handler?
[23:27] <wallyworld_> sure
[23:29] <wallyworld_> StevenK: so remind me the problem - did the error message that was displayed not contain the oops id? was it a timeout error?
[23:32] <StevenK> wallyworld_: It was a timeout error with no OOPS id, right
[23:32] <wallyworld_> StevenK: the error handler does not show the oops id for timeouts
[23:32] <wallyworld_> only for other errors
[23:32] <lifeless> wgrant: did you deploy oops ?
[23:33] <wallyworld_>                 if (o.status
[23:33] <wallyworld_>                     self.showError(
[23:33] <wallyworld_>                         'Timeout error, please try again in a few minutes.');
[23:33] <wallyworld_>                 // If it was a server error...
[23:33] <wallyworld_>                 } else if (o.status >= 500) {
[23:33] <wallyworld_>                     var server_error =
[23:33] <wallyworld_>                         'Server error, please contact an administrator.';
[23:33] <wgrant> lifeless: Argh, no, sorry
[23:33] <wallyworld_>                     var oops_id = self.get_oops_id(o);
[23:33] <wallyworld_> StevenK: perhaps it should show the oops id for timeouts
[23:34] <StevenK> wallyworld_: Is that the error handler for affects me too, or the generic one?
[23:34] <wallyworld_> StevenK: generic one in client.js
[23:34] <wallyworld_> affectsmetoo uses it
[23:36] <StevenK> Oh yah, the OOPS id turns up in AJAX events
[23:37] <StevenK> Hahahahaha
[23:37] <StevenK> OOPSes in the body link to oops.canonical.com
[23:37] <StevenK> OOPSes in the AJAX log link to pad.lv, which redirects to lp-oops.canonical.com
[23:37] <StevenK> lifeless: ^
[23:37] <wallyworld_> yes, oops id is in ajax response
[23:38] <StevenK> Except I can't find the OOPS
[23:41] <lifeless> they should link to oops.canonical.com then :)
[23:41] <lifeless> StevenK: ^
[23:42] <StevenK> lifeless: Obviously :-)
[23:43] <StevenK> Hmm, the URL is in our config. Pity I don't think you can fetch that in JS
[23:50] <StevenK> lifeless: I also can't find the OOPS that the ajax log showed
[23:56] <StevenK> wallyworld_: https://code.launchpad.net/~stevenk/launchpad/ajax-oops-link/+merge/125609
[23:58] <wallyworld_> StevenK: r=me, good that this is at least correct even if it's hardcoded