[00:11] wgrant: tells haproxy the service hasn't crashed
[00:16] lifeless: What's the difference?
[00:18] The only conceivable difference I can see is that in the case of a crashed service you might possibly want to kill leftover connections, but I doubt it does that.
[00:20] uhm, I looked into it, I don't recall now.
[00:20] it may have been to do with zope awfulness
[00:20] Right, but that only affects appservers.
[00:20] feel free to dig, I don't want to try and page that in right now.
[00:21] It would be nice to not have to add an HTTP service to every service.
[00:21] Sure.
[00:23] wgrant: persistent connections
[00:23] wgrant: and no alerts generated
[00:27] lifeless: Do we use persistent connections?
[00:29] we were; we should on twisted services; we don't on the appservers
[00:29] s/twisted/non-ultra-tuned/
[00:29] Why would we use them on Twisted services?
[00:30] also non-cpu-bound
[00:30] Why do we care which service it gets dispatched to?
[00:30] less overheads
[00:31] tcp will already be open on the second request
[00:31] Hm? That sounds like HTTP keepalive, not haproxy persistent connections.
[00:32] same same
[00:32] at this point I'm going to say 'shoo, go read' - haproxy is terrifyingly evil in some ways
[00:32] yes we have room to rejigger things
[00:32] AFAICT the maintenance-mode persistence behaviour affects cookie-based persistence.
[00:33] also 'how can we tell the service is really down'
[00:34] It's not listening.
[00:34] I don't have any requirements that all our services use haproxy the same way
[00:34] wgrant: that's not at all the same as really down
[00:34] wgrant: (if we don't have a status page)
[00:34] lifeless: Does that matter?
[00:34] hell yes
[00:36] .. how?
[00:36] Assuming I'm using haproxy as a round-robin load balancer, not as a nagios or cookie-based load balancer.
[00:36] so, again, I don't care if some services are configured differently; the requirements are that it's got a sane no-downtime upgrade mechanism, including all the aspects like telling the difference between crashed (so just start it) and going down (do not touch), and getting metrics out
[00:38] I expect different answers for different stacks and different services
[00:38] And I think our assumptions for appservers and codehosting are flawed.
[00:38] right now we have two services with status pages
[00:38] * lifeless shrugs
[00:39] it works, it's introspectable, and aided us in identifying and fixing the root cause of hang-on-restart; preserve the functionality and I'm happy.
[00:40] Huh, how was it relevant to that?
[00:40] The regular requests?
[00:42] I seem to recall you polling it to watch what was going on
[00:43] Oh, for codehosting, not appservers.
[00:43] what has you pulling on this thread ?
[00:43] why is it interesting to talk about now ?
[00:43] Well, I don't want to have to add a pointless HTTP service to poppy.
[00:44] I don't think status pages are pointless even if not used for haproxy
[00:45] they provide a good place to hook useful metrics we (have in the past) wanted to poke at (vs publishing to *statsd which is also a good idea but not as watchable-by-F5)
[00:46] conversely, I don't think everything has to have one.
[00:46] Have you been told you have to add one?
[00:47] No.
[00:48] But if there's a reason that codehosting needs one for haproxy, then poppy needs it too.
[00:48] However, I suspect that neither does, and probably neither do the appservers.
[00:48] poppy and codehosting aren't the same service, so I don't see that that holds
[00:58] hah
[00:58] ?
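The status pages being argued about above are just small HTTP endpoints that haproxy's health checks (or a person with a browser) can poll. A minimal sketch of what one might look like for a Twisted-based service follows; the port, the response bodies, and the service_available hook are illustrative assumptions, not Launchpad's actual configuration:

```python
# A minimal sketch (not Launchpad's real code) of an HTTP status page that
# haproxy's health checks could poll on a Twisted service.  The port and
# the service_available hook are illustrative assumptions.
from twisted.internet import reactor
from twisted.web.resource import Resource
from twisted.web.server import Site


class StatusResource(Resource):
    """Answer 200 OK while the service is willing to take traffic."""

    isLeaf = True

    def __init__(self, service_available):
        Resource.__init__(self)
        # Callable returning False while the service is draining for an
        # upgrade ("going down: do not touch"), True otherwise.
        self.service_available = service_available

    def render_GET(self, request):
        if self.service_available():
            return 'OK'
        request.setResponseCode(503)
        return 'DISABLED'


if __name__ == '__main__':
    reactor.listenTCP(8022, Site(StatusResource(lambda: True)))
    reactor.run()
```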
[00:59] I've filed 6.2% of open LP bugs
[00:59] Timeouts do that :(
[01:01] Bah, I'm just below 3% now.
[01:02] could you add some technical info to bug 528459 please?
[01:02] <_mup_> Bug #528459: PPA does not delete old packages after new build < https://launchpad.net/bugs/528459 >
[01:03] lifeless: NBS
[01:03] I was hoping for a paragraph, to point future folk straight at the issue
[01:03] 2-3 lines
[01:04] But all Soyuz people will know basic stuff like NBS.
[01:04] Oh wait :(
[01:06] lifeless: NBS == Not Built from Source. If you have a 'user' source package that builds 'libuser1', and then a new version of user is uploaded that now builds 'libuser2', libuser1 is now NBS.
[01:07] StevenK: yes, I am slightly familiar with packaging jargon
[01:07] StevenK: (thanks for chiming in :P)
[01:07] I was providing said paragraph
[01:08] StevenK: mmm; Something like 'ppas do not handle NBS binaries, look at for how they are handled in primary archives' - that would be said paragraph
[01:08] StevenK: though it sounds like wgrant suspects it's not NBS handling
[01:09] Hm?
[01:09] It is NBS.
[01:09] you said 'Oh wait :('
[01:09] But all Soyuz people will know basic stuff like NBS. *Oh wait :(*
[01:09] That was imitating realisation that there is no longer a Soyuz team to know things.
[01:09] That is what wgrant meant
[01:16] and there are now no lp medium bugs
[01:16] that are not incomplete
[01:16] + we need a release of loggerhead
[01:16] Just lots of bugspam :P
[01:16] Indeed.
[01:35] erk
[01:35] are we able to use amqplib 1.0.2 ?
[01:35] germanium:process-uploaded failed to run for at least 35 minutes.
[01:35] * wgrant investigates.
[01:35] lifeless: What are we using now?
[01:35] 0.6.1 is in the downloadcache
[01:37] Seems like 1.0 was released at the end of July.
[01:37] Two years after the previous release.
[01:38] Upgrade away.
[01:46] Morning!
[01:46] Evening nigelb.
[01:49] wgrant: It's already evening out there?
[01:49] No.
[01:50] But correctly timed greetings are silly.
[02:02] Heh, true.
[02:03] bwah, going to have to thread-sanitise errorlog.py
[02:04] unless; do we get one instance per thread? I seem to recall a hateful global in there.
[02:12] quick, stash it on the interaction!
=== Ursinha is now known as Ursula__
[02:26] mwhudson: separate non-threadsafe library
[02:26] mwhudson: giving it a callback to the interaction seems even uglier than a callback to make a connection
[02:26] ah ok
[02:26] mwhudson: however, it won't have global tls; just per-object tls
[02:27] (oops_amqp.Publisher will hold a threading.local() containing the amqplib Connection object for a given thread's oops reporting)
=== micahg_ is now known as micahg
[03:00] can has revu?
[03:01] As long as it doesn't involve oops_amqp and thread-globals, probably.
[03:02] oops-datedir-repo - support to act as a component in oops-tools
[03:02] https://code.launchpad.net/~lifeless/python-oops-datedir-repo/0.0.9/+merge/78793
[03:02] the amqp thing I will want you to read, as one of the more amqp-fluent folk around
[03:02] but that's still largely scifi.
[03:05] hi all
[03:06] wgrant: Can haz mumble?
[03:06] StevenK: Why?
[03:06] wgrant: I'd like a pre-impl about the enum that sinzui was talking about
[03:06] Which enum?
[03:06] The privacy enum
[03:07] The one that probably isn't an enum at all?
[03:07] Oh.
[03:07] Team privacy?
[03:07] Right
[03:08] I think we really need Curtis for this as well.
[03:08] Do you know the history around private and private membership teams?
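A rough sketch of the per-thread connection arrangement lifeless describes for oops_amqp.Publisher above: amqplib's objects are not thread-safe, so each thread lazily creates and caches its own Connection and channel in a threading.local(). The class shape, exchange name, and connection details below are assumptions for illustration, not the real oops_amqp code:

```python
# Rough sketch of the per-thread connection idea described above: amqplib
# objects are not thread-safe, so each thread lazily creates and caches its
# own Connection/channel in a threading.local().  Class shape, exchange
# name and connection details are illustrative, not the real oops_amqp API.
import threading

from amqplib import client_0_8 as amqp


def connect():
    # Connection details are assumptions for the sketch.
    return amqp.Connection(
        host='localhost:5672', userid='guest', password='guest',
        virtual_host='/')


class Publisher(object):

    def __init__(self, connection_factory=connect, exchange='oopses'):
        self.connection_factory = connection_factory
        self.exchange = exchange
        self._tls = threading.local()

    def _channel(self):
        # Each thread sees its own attributes on self._tls, so connections
        # are never shared between threads.
        if getattr(self._tls, 'channel', None) is None:
            self._tls.connection = self.connection_factory()
            self._tls.channel = self._tls.connection.channel()
        return self._tls.channel

    def publish(self, serialized_oops):
        self._channel().basic_publish(
            amqp.Message(serialized_oops),
            exchange=self.exchange, routing_key='')
```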
[03:08] Not really, so we probably want to do it after the stand-up
[03:09] wallyworld_: Your three cards in Deployment-Ready can probably be tossed into Done-Done.
[03:09] enum ?
[03:10] StevenK: ok
[03:10] wallyworld_: But check the linked bugs are Fix Released.
[03:10] * wallyworld_ checks
[03:10] lifeless: A less-private privacy level for private teams.
[03:10] lifeless: To reveal their name.
[03:11] garbo-hourly is *still* crashing?
[03:11] StevenK: Yes.
[03:11] Somebody needs to fix it.
[03:11] Fail.
[03:11] * StevenK looks
[03:11] It started crashing on Friday night, and will crash for weeks if one of us doesn't fix the code, I expect.
[03:12] but isn't that a maintenance squad role?
[03:12] Yes.
[03:12] so let them do it
[03:12] 2011-10-10 00:13:10 ERROR [BugTaskIncompleteMigrator] Unhandled exception
[03:12] -> http://launchpadlibrarian.net/82446649/vWceIFzREh8RMuyErP2OjLr1n4s.txt (can't compare datetime.datetime to NoneType)
[03:12] wallyworld_: So we can get spammed with script-activity for weeks?
[03:13] Bug #871038
[03:13] <_mup_> Bug #871038: BugTaskIncompleteMigrator blows up on bugs that were filed Incomplete < https://launchpad.net/bugs/871038 >
[03:13] StevenK: no, if someone else keeps doing it for them then nothing will change
[03:26] wgrant: so we believe some teams will want to be islands and standalone ?
[03:26] wgrant: (I would question that, particularly for canonical ones)
[03:27] I suspect brad will see it on monday, as he is working that arc
[03:27] lifeless: I assume there was a reason that completely private teams were implemented.
[03:27] wgrant: that's a nice assumption
[03:28] the history is messy
[03:28] rather than design a model that scales up as folk need more, we've special-cased things with the team-type enum, and it's been traumatic
[03:28] I would like to butt into this discussion
[03:31] lifeless: I've found a bunch of lines in BugTaskIncompleteMigrator that are either indented incorrectly, or >77 characters. :-(
[03:34] * StevenK nails Ursinha to the channel.
[03:34] StevenK, sorry
[03:35] setting up my bip server
[03:36] Ursinha: It's okay. Hopefully my nail didn't hit anything vital. :-P
[03:36] StevenK, haha
[03:36] lifeless, hi, i might have a go at the affectsmetoo timeouts later
[03:37] i would appreciate some advice or reassurance though
[03:37] because with ~20x variations in query speed on production
[03:37] i feel a bit pessimistic about being able to write something that will be consistently fast :/
[03:37] and trying it out by landing changes has a long latency
[03:45] Can haz review? https://code.launchpad.net/~stevenk/launchpad/fix-bugtask-incomplete-migrator/+merge/78794
[03:50] poolie: it's tricky, yes.
[03:51] poolie: basically needs a holistic approach - look at the whole work happening, what it's doing and how, and come up with a way to answer it efficiently
[03:51] hm
[03:52] it doesn't seem like it should be that hard of a query to answer
[03:52] it's not retrieving a large amount of data
[03:52] poolie: I wouldn't underestimate the warm-up cost for doing that; you can easily spend a couple of days on a low-hanging-fruit timeout, and a week or more on one that has been previously optimised
[03:52] mm
[03:52] i hate to leave this half baked
[03:52] i really like the feature but it's pretty flakey now
[03:52] poolie: so, as a for-instance: it's going to be cold data, nearly every time
[03:52] i don't suppose we can up the timeout?
[03:53] so the question is 'how can we answer this query when the data is being read off of disk each time'
[03:53] oh, and it got ~4 bug dupes about the timeout since it launched
[03:53] poolie: yah, making it visible tends to do that :>
[03:53] poolie: no, if we can't answer this efficiently, we should drop the feature.
[03:53] re 'cold' - i had heard that the db fitted in memory
[03:53] but, it's not guaranteed to be all in memory, just mostly?
[03:53] Like fun it does
[03:53] poolie: nope, not at all.
[03:54] poolie: The DB is 300GiB, and the DB servers have 128GiB of RAM
[03:54] There's math in there.
[03:54] poolie: the DB exceeds main memory on our prod servers by -lots-
[03:54] StevenK: re line length - meh; sure; shrug.
[03:54] 'drop the feature' - doing a bug search at all also generally times out :/
[03:54] lifeless: We have coding standards for a reason. :-(
[03:54] StevenK: if it wasn't in PEP8, I would be arguing that we should expand it
[03:55] It makes me unhappy to see them not used.
[03:55] StevenK: there are standards and standards; amongst other things pragmatism over purity.
[03:55] anyhow
[03:55] StevenK: line length violations don't cause crashes, exceptions, unbound variables, etc
[03:55] so this stuff might be cold
[03:56] poolie: it's pretty much guaranteed to be
[03:56] ok
[03:56] poolie: it's a separate table with no reason for it to be read until someone consults it for $user-FOO
[03:56] lifeless: Right, they don't, but we still have the standards for a reason, so they should be followeed.
[03:56] s/ee/e/
[03:56] is it reasonable to do a query that may do 'a bit' of hard io, but not much, or will even that often be too slow?
[03:57] poolie: it's about 1 to 2 ms per row of IO (that's a terrible rule of thumb but close enough to be useful)
[03:57] thanks
[03:57] so it'd be reasonable to expect someone affected by 5 bugs would always be able to see this without timing out?
[03:58] StevenK: so fix them; if you're saying I put them there, 'sorry' - I have my editor configured to wrap at 80 and don't usually break the limit, but OTOH if it's clearer as it is, then I think it's better: standards are meant to help with clarity after all
[03:58] lifeless: I have fixed them -- I'm not complaining that you put them -- I'm just saying that they make me sad.
[03:59] poolie: possibly; once it gets enough use for the table statistics to be hot, then it will probably be able to plan the queries quickly and execute in a reasonable timeframe
[03:59] StevenK: k
[03:59] StevenK: thanks for fixing the migrator
[03:59] StevenK: can I ask that you also comment it out :)
[03:59] StevenK: (see the bug about ProductSeries:+index and migrated bugs)
[04:00] Er?
[04:00] Now I'm confused
[04:00] there were two issues with the branch
[04:00] lifeless: ok so if you were going to work on this, how, generally, would you go about it?
[04:00] one is the migrator going boom
[04:00] try different queries in the hope of finding one that's fast?
[04:00] the other is that a migrated bug in a series causes a manually-mapping function on ProductSeries:+index to time out.
[04:01] or, do some kind of more architectural change as far as ajax loading, or freshening them for active users, or something?
[04:01] lifeless: So we should revert the migrator wholesale?
[04:01] or something about trying to create different indexes - and if so which?
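As a back-of-the-envelope check of the 1-2 ms-per-row rule of thumb quoted above, a tiny sketch; the figures are the ones mentioned in the conversation, not measurements:

```python
# Back-of-the-envelope use of the rule of thumb quoted above: roughly 1-2 ms
# of IO per cold row/page.  The counts here (a handful of rows for one user,
# ~1K pages for a batch) are the ones mentioned in the conversation, not
# measured figures.
COLD_IO_MS_PER_PAGE = 1.5   # midpoint of the quoted 1-2 ms
TIMEOUT_MS = 5000           # the 5 second ceiling being driven towards

for pages in (5, 100, 1000):
    estimate = pages * COLD_IO_MS_PER_PAGE
    verdict = 'fine' if estimate < TIMEOUT_MS else 'uncomfortably close'
    print('%4d cold pages -> ~%4.0f ms (%s)' % (pages, estimate, verdict))
```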
[04:01] StevenK: I don't think the whole branch needs reverting; but the garbo job shouldn't convert any more until the web UI issue is fixed
[04:01] s/which/can they just be tested on staging to see how they perform on real data?
[04:02] poolie: those are all valid strategies; doing ajax loading of just one number is something I'm suspicious of (when I started we had ajax things that themselves time out) - I figure if we can't do the whole page sanely in 5s time, splitting it into bits just lets us be inefficient in more places
[04:03] i'm talking about loading the whole page, like bugs.l.n/~/+affectingbugs
[04:03] not the number
[04:03] ok
[04:03] and yes, i share that view: doing it in multiple roundtrips just seems more inefficient
[04:03] ah
[04:03] so for that, ajax is irrelevant anyhow, you need the batch to be efficient
[04:03] well
[04:04] we could load most of the page, then have the actual bug rows arrive later
[04:04] that would give a bit better of an impression
[04:04] but it would be a lot of work
[04:04] the timeout though is coming from the batch
[04:04] right, and it would still need a long timeout to actually load the rows
[04:04] which we won't do - long queries affect liveness throughout the system
[04:04] we're driving *everything* down to 5s tops
[04:05] there is a quote request in with IS to examine new hardware
[04:05] which would have more memory
[04:05] I think that that will make a significant difference to the 'this works if I keep hitting refresh' cases
[04:06] StevenK: I think something like removing the job from the garbo list, or putting it in the experimental list or something, is appropriate
[04:06] StevenK: Can't you just add another disjunct to the existing if?
[04:06] StevenK: reverting the whole branch would break existing migrated bugs in more ways
[04:07] StevenK: If either date is None, -> WITHOUT_RESPONSE
[04:07] in case you don't know, i landed this with no count next to it, so there is no performance impact unless you actually click through to the new pages
[04:07] poolie: yeah, I know :)
[04:07] Is there a guide somewhere on writing garbo jobs?
[04:07] nigelb: there are docs on it in the system about how to use the API etc
[04:07] lifeless: Product:+series OOPSes, it doesn't time out.
[04:07] nigelb: there isn't a 'for dummies' one AFAIK
[04:07] wgrant: I know
[04:07] lifeless: HA.
[04:08] lifeless: lol :)
[04:08] wgrant: why do you think I didn't ?
[04:08] "causes a manually-mapping function on ProductSeries:+index to time out"
[04:08] wgrant: ah, because I wrote bad words.
[04:08] wgrant: I meant go boom
[04:09] Heh
[04:09] wgrant: Right, fixed.
[04:09] Is there a bug for Product:+series bangness?
[04:10] lifeless: Is it on the wiki?
[04:10] StevenK: Also, rather than using deprecated switchDbUser and testadmin, can you use 'with dbuser' and launchpad?
[04:10] poolie: so the problem here is you need to consult 4 tables to generate the batch, one of which will have a lot of pages loaded
[04:10] poolie: it's not a clustered table IIRC, so take me - 1K rows in the table
[04:10] StevenK: Bug #871076, but that can be left for maintenance people.
[04:10] <_mup_> Bug #871076: Product:+series OOPSes on INCOMPLETE_WITH_RESPONSE and INCOMPLETE_WITHOUT_RESPONSE bug tasks < https://launchpad.net/bugs/871076 >
[04:10] Since it isn't cronspamming incessantly.
[04:11] LP can choke and die, as long as it's quiet about it.
[04:11] wgrant: But then I'm just spreading the problem by fixing garbo-hourly ...
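The fix wgrant sketches above ('if either date is None, -> WITHOUT_RESPONSE') amounts to one extra disjunct before the date comparison that was raising "can't compare datetime.datetime to NoneType". A hypothetical sketch of that guard; the function and status names are illustrative, not the actual BugTaskIncompleteMigrator code:

```python
# Hypothetical sketch of the extra disjunct suggested above: if either date
# is missing, treat the task as incomplete-without-response instead of
# comparing, avoiding the datetime-vs-None comparison that crashed the
# garbo job.  Names are illustrative, not the real migrator code.
def migrated_status(date_incomplete, date_last_message):
    if date_incomplete is None or date_last_message is None:
        return 'INCOMPLETE_WITHOUT_RESPONSE'
    if date_last_message >= date_incomplete:
        return 'INCOMPLETE_WITH_RESPONSE'
    return 'INCOMPLETE_WITHOUT_RESPONSE'
```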
[04:11] poolie: because they are spread out over the table, and there are 1M rows, it's probably loading 1K pages of table data plus 10 or so pages of index, to calculate the query result
[04:11] right
[04:11] StevenK: Yeah, but it's likely they'll be deployed together.
[04:12] wgrant: Hopefully.
[04:12] StevenK: As it's not as if we have any big series things happening this week.
[04:12] Hah
[04:12] Hahahahahaha
[04:12] poolie: so, in short, personalisation of the bug tracker is -hard-, this is going to be very tricky.
[04:12] poolie: probably we want a personalisation service on its own hardware that can run in-RAM totally and not get paged out, vs the one big DB which just LRU's everything.
[04:13] wgrant: I'd like to push back on switchDbUser -- I'm following the established pattern, and I'd rather not re-indent the entire function.
[04:13] obviously there's going to be strong locality across actually-active users
[04:13] StevenK: You should only need to indent the makeBug.
[04:13] StevenK: It is the established pattern, but I deprecated it weeks ago.
[04:13] poolie: the crazy thing here is that the table is only 40MB packed and 21MB in the index I would predict will be used.
[04:13] * StevenK looks for 'with dbuser' in the code
[04:13] StevenK: from lp.testing.dbuser import dbuser
[04:13] with dbuser('launchpad'):
[04:14] so i guess we *could* cluster on user id, but it's probably also used to count up affected users?
[04:14] # ahaha I am god
[04:14] or maybe that's cached? otherwise it seems it would always be hot
[04:14] poolie: for instance, I have only 64K of the data in that table
[04:14] I have a suspicion that it'd be faster to do it in a few separate queries.
[04:14] have it where?
[04:14] The enforce optimisation boundaries.
[04:14] select * into temporary table lifelessbap from bugaffectsperson where person=2;
[04:14] SELECT
[04:14] s/The/To/
[04:14] \dt+ lifelessbap
[04:14] List of relations
[04:14] Schema | Name | Type | Owner | Size | Description
[04:14] ------------+-------------+-------+-------+-------+-------------
[04:14] pg_temp_40 | lifelessbap | table | ro | 64 kB |
[04:15] so how can it get so cold?
[04:15] wgrant: Fixed, re-running tests.
[04:15] poolie: we have cron scripts that walk entire tables end to end
[04:15] poolie: for some very big tables
[04:15] StevenK: I'll do a batch fix-up and remove switchDbUser eventually.
[04:15] And then pull some tests back to Functional layers.
[04:15] poolie: Revision, for instance, is nearly a billion rows
[04:16] poolie: the point that that 64K is spread over everything is still valid
[04:16] poolie: however - have you deployed stub's query fix yet ?
[04:16] nup
[04:16] poolie: it was just a crazy query initially, and there's no point stressing until that is landed
[04:16] my point in talking to you was, whether it was too much of a shot in the dark
[04:16] since even the improved one is sometimes quite bad (>3000ms)
[04:16] but perhaps the typical case is better
[04:16] the bad query is 9s; the improved one is 3x better to start with
[04:17] i'll take that
[04:21] Bring on longpoll
[04:22] wgrant: Change pushed, diff updated.
[04:25] StevenK: Approved.
[04:26] wgrant: Thanks, tossing through ec2
[04:28] wallyworld_: Your 518 ec2 AMI can be deleted.
[04:28] * StevenK looks at deleting 519
[04:29] Ah, yes, was going to poke people about that.
[04:29] But related stuff on Saturday didn't quite go smoothly.
[04:29] * wgrant thwacks buildbot and stuff.
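The 'with dbuser' pattern wgrant points StevenK at above replaces the deprecated switchDbUser call in tests. A hedged sketch of how a test might use it; only the import and the context manager come from the conversation, while the test class, layer, and factory call (and their import paths) are assumptions:

```python
# Hedged sketch of the test pattern being discussed: run the data setup as
# the 'launchpad' database user via the dbuser context manager instead of
# the deprecated switchDbUser.  Only the import and the context manager are
# from the conversation; the test class, layer and factory call (and their
# import paths) are assumptions.
from canonical.testing.layers import LaunchpadZopelessLayer
from lp.testing import TestCaseWithFactory
from lp.testing.dbuser import dbuser


class TestIncompleteBugMigration(TestCaseWithFactory):

    layer = LaunchpadZopelessLayer

    def test_migrates_incomplete_bug(self):
        with dbuser('launchpad'):
            # Only the setup needs the more privileged user; the rest of
            # the test runs as the layer's default user.
            bug = self.factory.makeBug()
        # ... run the garbo job here and assert on the bug's migrated
        # status.
```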
[04:30] Yes, the build of that old revision was lol-worthy
[04:30] And then the one afterwards caused a bzrlib exception.
[04:30] Come on, AWS! Surely you're not slower than SSO!
[04:30] lol
[04:31] You'd be surprised.
[04:31] I would not.
[04:32] StevenK: Easiest way of handling a garbo job you don't want to run on production but might want to run on staging is to stick it in the 'experimental' list.
[04:34] lib/lp/scripts/garbo.py is where garbo jobs go?
[04:34] Yes
[04:34] * StevenK kicks AWS and re-loads
[04:35] nigelb: For now. It is growing a little unwieldy.
[04:35] stub: I'm still trying to figure out how it works :)
[04:37] wgrant: 519 binned
[04:37] StevenK: Great.
[04:49] * StevenK tries to understand bug 307539
[04:49] <_mup_> Bug #307539: bug attachment HostedFiles refuse to be deleted < https://launchpad.net/bugs/307539 >
[04:53] Ah. HostedFile is in lazr.restfulclient
[05:04] lifeless: You commented on bug 307539 -- writing a quick API script: bug = launchpad.bugs[17] ; bug.attachments[0].data.delete() ; results in a 405, not a 500
[05:04] <_mup_> Bug #307539: bug attachment HostedFiles refuse to be deleted < https://launchpad.net/bugs/307539 >
[05:06] StevenK: sounds like it's Fix Released already then
[05:07] lifeless: Do you have opinions on how to do per-artifact observer modelling?
[05:09] wgrant: some testing required but either partitioned FK columns or a single generic-reference intermediary table (which we may want to move BugTask etc to use as well)
[05:10] the testing basically being tossing variations of existing prod data at both and assessing query performance
[05:10] lifeless: I was thinking it's probably best (for indices) to just have bugtask/branch FKs on the observer table for now.
[05:10] do you observe a task or a bug ?
[05:11] what happens if you observe something that's not in your pillar anymore ?
[05:11] I'm not sure yet, but I think a task.
[05:11] Bugs are horrible and will probably need several triggers anyway.
[05:11] I lean towards a separate table
[05:12] Oh?
[05:12] surrogate keys are a common part of query schemas, and generally good for performance
[05:12] but like I say, needs testing
[05:12] What would this table look like?
[05:13] id, bugtask, branch, blueprint, question, milestone, series, sourcepackagename, distroseries
[05:13] Oh, a universal generic reference table?
[05:14] possibly a little less ambitious than totally generic, but yes.
[05:14] immutable rows
[05:14] (except when we truly are moving the referenced thing around)
[05:15] by which I mean, if the observer changes from observing bug X to bug Y, you would add a row for bug Y and change the observer FK to its PK, rather than editing the dereference table
[05:15] Also, I can't see any way around using triggers to ensure that an observer exists for every task if there is an observer for one of them.
[05:16] Right.
[05:16] you will probably need a chunk of triggers, yes
[05:16] And I'm not sure how exactly we'll handle task retargeting, but I guess we can sort that out somehow.
=== almaisan-away is now known as al-maisan
[05:17] do you mean observer policy ?
[05:17] '18:15 < wgrant> Also, I can't see any way around using triggers to ensure that an observer exists for every task if there is an observer for one of them.'
[05:17] lifeless: Well, that too, but that's far simpler.
[05:17] ok, so why would you want to ensure observers exist for every task ?
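The quick API check StevenK describes at 05:04 above can be run as a small launchpadlib script. A sketch of that verification, assuming launchpadlib's login_with against the production service root; the application name and the error handling are illustrative:

```python
# Sketch of the quick API check described above for bug 307539: try to
# delete a bug attachment's hosted file and see whether the server answers
# 405 or the originally-reported 500.  The login_with application name and
# the choice of 'production' are assumptions for the sketch.
from launchpadlib.launchpad import Launchpad
from lazr.restfulclient.errors import HTTPError

launchpad = Launchpad.login_with('attachment-delete-check', 'production')
bug = launchpad.bugs[17]
try:
    bug.attachments[0].data.delete()
except HTTPError as error:
    # At the time of the conversation this came back as 405, not 500.
    print(error.response.status)
```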
[05:17] lifeless: The issue is that if I'm an observer for project A, and there's also a project B task on one of A's bugs, I need a restricted observer record for the project B task as well.
[05:18] Or disclosure views will have to query across bugtask.
[05:18] or are you suggesting for a 200-task bug, that the special exemption given to let A see the bug, will result in 200 (A, taskN) rules ?
[05:18] Yes.
[05:18] uhm
[05:18] my immediate reaction is to suggest two schemas
[05:19] a single non-duplicate schema
[05:19] and a derived query schema
[05:19] Why?
[05:19] depends on exactly what the page needs to show I guess
[05:20] why? so that web and API transactions to change things write a small amount of data
[05:20] We need to be able to show everyone that has access to the project artifacts.
[05:20] you have two very different statements here though
[05:20] one is a statement that (person or team X) has access to (asset Y)
[05:21] the other is a statement that (anyone that has access to asset Y in project B) has access to (asset Y too)
[05:22] I presume you're not suggesting that you would expand out all the former statements from project B into restricted statements on project A
[05:22] because that would make any write to project B's observer list also write to project A
[05:22] We need to do something along those lines.
[05:22] Or we have to join across bugtask, which would be even worse.
[05:23] so, don't conflate primary storage with query storage
[05:23] you need somewhere to record intentions *once*
[05:23] otherwise you can never audit and tell what things are meant to be where.
[05:23] DRY etc etc
[05:23] Why can we never audit?
[05:24] separately, you can do the math and see whether joining through task.bug to get the observer rules for project B is fast
[05:24] however, unless you're - ELOCAL
[05:24] EPARSE
[05:26] skype?
[05:26] Sure
[05:37] wgrant: Are you still skyping?
[05:40] StevenK: Yes.
[07:48] > ImportError: cannot import name OfficialBugTag
[07:48] halp?
[07:48] ZopeXMLConfigurationError: File "/home/mbp/launchpad/lp-branches/work/lib/lp/services/messages/configure.zcml", line 44.2-47.6
[07:51] Do you remove it?
[07:51] Or move it?
[08:00] good morning
[08:09] StevenK: no
[08:09] perhaps it's a knock-on import error, i'll shelve my changes and try again
[08:16] poolie: We have lots of potential circular imports.
[08:16] Because any sense of layering is for the weak.
[08:16] yep, it was circularity
[08:16] well
[08:16] requiring the imports be a strictly acyclic graph even between closely related modules would be a pretty high bar
[08:17] Sure, but we are far worse than we could be :)
[08:17] 'we' meaning humans :)
[08:17] also launchpad
[08:17] i will try not to make it worse
[08:18] i would like to clean up the per-person bug views to be not entirely unrelated to the main ones
[08:22] \o/
[08:22] amqp oops publishing working
[08:22] just need the receiver loop and am done
[08:23] -> family for a bit
[08:23] lifeless: and then zero latency?
[08:46] gmb: yo
[08:47] poolie: then open-source oops-tools, and then add an oops-tools script using the bits to receive oopses over amqplib
[08:48] poolie: somewhere in there I need to add spilling to disk if rabbit is down
[08:49] lifeless 'sup?
[08:51] gmb: wanted to check with you - I retriaged some bugs down to high; but if they were critical due to priority inheritance then I can put them back
[08:52] lifeless: I've not seen any yet that I disagreed with. Have you got any particular bugs in mind?
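The 'spilling to disk if rabbit is down' behaviour lifeless mentions above is essentially a try/except around the AMQP publish with a disk-based fallback. A hypothetical sketch of that shape; the callables and the exceptions caught are assumptions, not the real oops-amqp implementation:

```python
# Hypothetical sketch of "spill to disk if rabbit is down": try to publish
# the serialized OOPS over AMQP, and fall back to a local datedir-style
# repository if the broker is unreachable.  The two callables and the
# exceptions caught are assumptions, not the real oops-amqp implementation.
import socket


def publish_oops(serialized_oops, amqp_publish, disk_publish):
    """Publish via AMQP, falling back to disk when rabbit is unavailable.

    :param amqp_publish: callable sending the bytes to the broker.
    :param disk_publish: callable writing the bytes to a datedir repository.
    """
    try:
        amqp_publish(serialized_oops)
    except (socket.error, IOError):
        # Broker down or connection reset.  A real implementation would
        # also need to catch amqplib's own connection/channel exceptions.
        disk_publish(serialized_oops)
```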
[08:53] just the ajax fallout ones
[08:53] Okay. I'm happy with those being High rather than critical.
[09:21] short review please? https://code.launchpad.net/~mbp/launchpad/866100-affectsme-timeout/+merge/78806
[09:21] do i need a db review for that?
[09:24] lifeless, for your interest (realize it's late) i did do the change to a join, in that mp, and it seems it may help
[09:31] stub: could you review it?
=== al-maisan is now known as almaisan-away
=== almaisan-away is now known as al-maisan
[09:47] k
=== al-maisan is now known as almaisan-away
=== benji changed the topic of #launchpad-dev to: https://dev.launchpad.net/ | On call reviewer: benji | Critical bugs: 269 - 0:[########Segmentation fault (core dumped)
=== danilos changed the topic of #launchpad-dev to: https://dev.launchpad.net/ | On call reviewer: - | Critical bugs: 269 - 0:[########Segmentation fault (core dumped)
=== danilos changed the topic of #launchpad-dev to: https://dev.launchpad.net/ | On call reviewer: benji | Critical bugs: 269 - 0:[########Segmentation fault (core dumped)
[12:11] benji, oh, you are actually in already, can you please take a look https://code.launchpad.net/~danilo/launchpad/bug-869089/+merge/78819 when you find some time? (mostly mechanical, replacing remaining uses of official_rosetta throughout the code)
[12:11] danilos: sure
[12:16] benji, thanks
[13:01] Good morning, all.
=== almaisan-away is now known as al-maisan
[14:23] has anyone seen "UploadFailed: Server said: 500 Internal server error" coming from the librarian when running tests?
[14:37] Hmm.
[14:37] bigjools: Yes.
[14:37] bigjools: But I can't remember what caused it.
[14:37] it's oneiric
[14:37] https://bugs.launchpad.net/launchpad/+bug/871596
[14:37] <_mup_> Bug #871596: Can't run tests involving Librarian < https://launchpad.net/bugs/871596 >
[14:37] Hmm.
[14:39] bigjools: no, I haven't seen that thing from the librarian log
[14:39] bigjools: but I've only done one or two LP patches in oneiric, and that was weeks ago.
[14:40] jml: I only get it when running more than one test
[14:40] single runs are fine
[14:43] bigjools: it takes quite a while to get an updated tree with runnable tests.
[14:43] jml: not sure what you mean
[14:44] bigjools: bzr pull; bzr up download-cache; ./utilities/update-sourcecode; make schema
[14:44] takes time
[14:46] ah, that :)
=== al-maisan is now known as almaisan-away
[15:30] * bigjools nails 5-digit bug to the wall
=== micahg_ is now known as micahg
[15:51] Thanks for the reviews benji!
[15:52] rvba: my pleasure
=== matsubara is now known as matsubara-lunch
=== deryck is now known as deryck[lunch]
=== matsubara-lunch is now known as matsubara
=== deryck[lunch] is now known as deryck
[19:20] flacoste: 'yo'
[19:23] gary_poster: wheee, nice bug there
[19:23] lifeless, :-)
=== benji changed the topic of #launchpad-dev to: https://dev.launchpad.net/ | On call reviewer: - | Critical bugs: 269 - 0:[########Segmentation fault (core dumped)
=== lifeless changed the topic of #launchpad-dev to: Performance Tuesday | https://dev.launchpad.net/ | On call reviewer: - | Critical bugs: 269 - 0:[########Segmentation fault (core dumped)
=== matsubara is now known as matsubara-afk
[20:57] wgrant: medical schedule change; I have to pop out quite a bit sooner; I should be on 3g successfully though, once I'm at the place we're going.
=== jpds_ is now known as jpds
[21:45] \o/ and handling of rabbit down done too. :)
[21:57] wgrant: can you do a review of the first two commits of python-oops-amqp ?
We don't have a sensible review-first-commit thing yet
[22:02] Oh, bah, no Curtis today, I suppose.
[22:03] wgrant: I'm specifically looking for thinkos in my amqp handling
[22:03] Yep, am looking, since it seems we don't have a standup without the US :)
[22:03] thanks
[22:07] just drop me mail; I'll try to get back online ~your 10am
[22:07] k
[22:10] lifeless: Also, http://www.rabbitmq.com/faq.html#which-entity-should-declare-exchanges
[22:15] thanks
[22:15] * lifeless goes
[22:53] wgrant: StevenK: have you had any issues landing stuff via ec2? The last few things I've thrown at it all fail with hung tests
[22:53] wallyworld: Which tests?
[22:54] wgrant: doesn't specify. just says "A test appears to be hung..."
[22:54] Can you forward me the emails?
[22:54] the last successful test appears to be lp.archivepublisher.tests.test_publisher.TestPublisher.testReleaseFile
[22:54] Oh.
[22:55] Merge devel.
[22:55] You need my new AMI from Saturday.
[22:55] 522
[22:55] i think i'm using 521
[22:55] thanks
[22:56] wallyworld: And 518 can be binned ...
[22:56] StevenK: how do i do that?
[22:56] * wallyworld looks at the wiki
[22:56] wallyworld: https://dev.launchpad.net/EC2Test/Image
[22:56] wallyworld: Last section is 'Deleting AMIs'
[22:56] thanks
[23:12] and on 3g
[23:40] lifeless: File "/tmp/python-oops-amqp/eggs/testtools-0.9.12_r228-py2.7.egg/testtools/testcase.py", line 134, in gather_details
[23:40] for name, content_object in source_dict.items():
[23:40] AttributeError: 'RabbitServerRunner' object has no attribute 'items'
[23:40] wgrant: reproduction instructions ?
[23:41] lifeless: bzr branch lp:python-oops-amqp; cd python-oops-amqp; ln -s ~/launchpad/lp-sourcedeps/download-cache; ./bootstrap.py; bin/buildout; bin/py -m testtools.run oops_amqp.tests.test_suite
[23:41] wgrant: on lucid ?
[23:42] lifeless: oneiric
[23:42] I could try lucid, I suppose.
[23:42] that's a mismatch with changed testtools api
[23:43] between fixtures and testtools
[23:43] lifeless: But both fixtures and testtools are from eggs.
[23:43] just reproducing on my laptop (in my lucid lxc)
[23:44] Ah, works.
[23:45] On Lucid.
[23:45] Hmmm.
[23:45] I suppose it could be a minor problem that I don't actually have rabbitmq-server installed on my oneiric host.
[23:45] Still, rather opaque error :)
[23:46] suggests there's a mismatch in the specified versions then?
[23:47] yes