[00:57] lifeless: How old is staging's DB? [01:00] (I need to get a rough idea of the publications demolished by bug #653382) [01:00] <_mup_> Bug #653382: BinaryPackagePublishingHistory._getOtherPublications fails to restrict the distroseries context [01:09] wgrant: I'm not sure [01:11] It appears to be from the 16th. [01:13] lifeless: Could you please run http://paste.ubuntu.com/505395/? [01:13] (hm, it has a shiny new theme) [01:23] wgrant: terrible query [01:23] wgrant: not exists is probably better than count(*) == 0; [01:24] lifeless: Fair point. [01:25] also its probably flattenable into a single query rather than correlated subquery [01:25] which will tend to be much faster [01:25] How could I flatten that? [01:26] using group by? [01:26] Oh, true. [01:27] or possibly [01:27] just use a left outer where bpph.id is NULL [01:27] but I'd have to actually think to suggest more [01:32] wgrant: its going to be a while, the query is expressed poorly I think. [01:33] wgrant: by a while, I mean its running and has been for 10 minutes [01:36] I'll be back in a bit [01:44] 25 [01:45] lifeless: Yeah, 3am SQL isn't optimal, as it turns out. [01:47] http://paste.ubuntu.com/505410/ removes the subquery, and gets rid of one of the BPPH scans. [01:47] And is generally a whole lot less hostile. [01:50] But it will still probably take a while, given what it's doing. [01:51] I could reduce the dataset to publications superseded since the bug was introduced, but I know there have been similar bugs in the past, so I'd like to see that there is no older broken data too. [01:57] http://paste.ubuntu.com/505415/ [02:10] 123744 [02:10] (1 row) [02:10] Time: 768829.442 ms [02:10] thats using 'not exists (select binarypackagepublishinghistory.id ... [02:10] Hm, that's a few. [02:11] 505415 is running now [02:11] Thanks. [02:13] Er, 505415 doesn't have a COUNT, so it's going to give a crapload of output. [02:16] when it finishes :P [02:16] I'll hit END [02:18] wgrant: the 505415 query is odd [02:18] Howso? [02:18] wgrant: if you want 'no other_bpph' then use join, not left join [02:18] and drop the having COUNT [02:19] Er. Won't a JOIN against something that doesn't exist result in... nothing? [02:19] right, so those rows are skipped. [02:19] think sets [02:19] not procedure [02:21] 3662 rows [02:22] Hmm. I wonder what the other 120000 are. [02:23] mwhudson: 'test' passed ec2, of course testfix then fcked us. [02:23] lifeless: \o/, at least a little bit [02:23] yeah [02:24] will bomb them in at EOD [02:25] lifeless: How would you rewrite that query to not left join? I want the IDs, so joining against an empty set is not going to help me much. [02:26] I'd start by writing out in english what I want [02:26] I'm not clear what you want [02:27] the two versions I've seen are doing different things. (thus the different results) [02:28] I want to find superseded binary publications where the dominant build has no publications in that context. [02:28] (context == (archive, distroarchseries, pocket)) [02:29] Er, (archive, distroseries, pocket) for this particular case. [02:29] so 'superseded but nothing superseding it' ? [02:29] Right. [02:30] We record the build that supersedes each publication. [02:30] so [02:30] to first address the left join/join lemma [02:30] if I have two tables [02:31] SUPERSEDED SUPERSEDES [02:31] (imaginary) [02:31] with supereseded.id == supersedes.superseded_id [02:32] then ED JOIN ES group by ED.id having count(ES.id) == 0 [02:32] is equivalent to [02:32] ED LEFT JOIN ES group by ed.id [02:32] bah [02:32] got the LEFT in the wrong example [02:32] then ED LEFT JOIN ES group by ED.id having count(ES.id) == 0 [02:33] is equivalent to [02:33] ED JOIN ES group by ed.id [02:33] Won't ED JOIN ES return the opposite of what we want? [02:33] oh right [02:33] It's the same as HAVING COUNT(ES.id) > 0 [02:34] left join where ed.id is NULL [02:34] I guess that might work. [02:35] now, one question is whether ES is better correlated or not; if we can make it very fast to satisfy for a given ED [02:35] wgrant: I meanT Left join where es.ID IS null [02:36] Right, that makes more sense than JOIN. [02:37] * rockstar slaps buildout [02:37] now one issue here [02:37] there is what, 100 times as many ED as ES [02:37] so it might be better to start with dominant builds [02:38] and for each dominant build return: content, publication, (a) superseded build [02:39] which should look at 1% of the data we're considering today, no ? [02:39] awol for a bit [02:40] bbiab [02:43] lifeless: Where did you pull the 100 figure from? [02:44] I guess starting at BPB might be a little smaller, but we would still need to select every single superseded BPPH. [02:44] ES's intersection with ED is massive. [02:44] wgrant: well, gimme a query to get stats [02:45] wgrant: we'll see what my ass looks like [02:45] It's not clear what the stats should be. [02:45] It's also not clear why we're trying to optimise a query that will hopefully only be run once :P [02:56] yao, could you put LP: 653316 on your list please? [02:56] * michaelh1 types on the wrong channel [03:21] wallyworld__, how's things? [03:21] rockstar: pretty quiet. everyone's away [03:22] i'm having "fun" with YUI [03:23] wallyworld__, me too! [03:23] :-) [03:23] wallyworld__, it's quite possible my fun will be breaking yours. :) [03:23] wallyworld__, are you working on the thing you and I talked about last week? [03:24] yeah, got the endpoint all set up. just having fun figuring out the YUI object model so I can get the info I need from client side and send it to the server [03:24] rockstar: i just got a call from my son who is sick at school. i have to duck out briefly to pick him up [03:26] wallyworld__, okay. I just wanted to make sure you had someone around with thumper out this week. === Ursinha is now known as Ursinha-afk [04:28] wgrant: several reasons: [04:28] - if the data model doesn't support this sort of query, we're going to be doing it more than once, eventually. [04:29] - e.g. in the garbo. And 10 minute queries are flat bad [in the absence of dedicated data mining clusters] [04:29] - You're not going to want to fix up the prod data? We can't sensibly run a 10 minute query on lpprod [04:34] True. [04:38] hmm [04:51] lifeless: How long did 505415 take on staging? [05:21] 600941.564 ms [05:23] statik: don't cringe at the ui, but - https://edge.launchpad.net/+feature-rules [05:24] statik: thats our 'takes effect immediately' config system, up and live [05:24] statik: we can also use it to drive A/B tests and similar experiments [05:24] * jtv peeks as well [05:25] lifeless: Hmm. Not fantastic. But it'd probably be faster on a production slave, and we do already have some regular multi-minute queries (on the master, no less). [05:25] wgrant: please file bugs for all those queries that you know of. [05:25] wgrant: any query over 5 seconds is a bug. [05:25] (although I know how to optimise those queries down to a few seconds, I haven't got around to doing it yet) [05:25] They're in the publisher. [05:25] wgrant: any *write* query over 1 second is a bug. [05:25] wgrant: doesn't matter. [05:26] wgrant: I should say, any write transaction, from the first update/insert/delete through to commit, taking more than 1 second, is a bug. [05:26] Right. [05:26] awesome [05:26] The publisher is itself pretty much a bug. We all know that. [05:26] hey jtv! how about a phone call? [05:26] wgrant: I like having bugs if *someone* knows there is a problem. [05:26] statik! Not sleeping yet? [05:26] * jtv scrambles around for headsets [05:27] jtv: nosir, i hung around specially for you [05:27] :) [05:27] * jtv takes leisurely sip of coffee [05:28] wgrant: so, please file them? [05:28] wgrant: if you get pushback, I'll go to bat over their existing. [05:35] I suppose I might be able to use that same optimisation in the repair query. [05:36] Mm, no, not really. [05:36] It makes too many assumptions that are now broken. === almaisan-away is now known as al-maisan [06:35] stub: https://lpbuildbot.canonical.com/builders/prod_lp/builds/120/steps/shell_7/logs/summary [06:36] stub: thats a prod-lp run, the same passed ec2 earlier [06:36] stub: I'm wondering if its a python/twisted version issue? [06:38] lifeless: No such file lib/lp/services/scripts/tests/cronscripts.ini [06:38] lifeless: That is a relative path, so a current working directory issue [06:39] *blink* [06:39] No idea how it got through ec2 though if that is really the case\ [06:39] stub: I landed the CP using ec2land [06:39] stub: its 50 or so revs that have all passed qc [06:40] https://code.edge.launchpad.net/~lifeless/launchpad/cp/+merge/37345 [06:41] lifeless: maybe a isolation/test ordering issue with an earlier test changing the cwd? [06:41] stub: could be; maybe BaseLayer should check that cwd hasn't been fucked with [06:42] in bzrlib we ensure its not messed up in the base class, *and* we don't use chdir except in very controlled circumstances (because its not nice as a library to use it :P) [06:42] It does already :-/ [06:42] >< [06:43] mwhudson: wow, something chomped on the encoding in that mail [06:43] Does that file still exist in the branch that failed? [06:44] its production-devel [06:44] checking [06:45] stub: yes [06:46] bzr ls lib/lp/services/scripts/tests [06:46] lib/lp/services/scripts/tests/__init__.py [06:46] lib/lp/services/scripts/tests/cronscripts.ini [06:46] ... [06:47] Can add the cwd to the error if that file isn't found to check my initial theory (although I still can't see how it would be possible, it is my best guess). [06:47] it could be a dict order thing between python versions [06:47] causing the layers to be grouped in a different order. [06:48] stub: alternatively, where are we at in terms of getting to python 2.6 everywhere? [db server upgrades] [06:49] forwarded you the ec2run results for the branch [06:49] We have done hackberry, we have chokecherry, wildcherry, poha and plantain to go. As far as Python 2.6 everywhere, the important one is wildcherry [06:49] whats the eta on wildcherry? [06:49] I've never heard of poha and plaintain :( [06:50] I was planning to do it after chokecherry, but there is nothing stopping us doing it next now I think about it. [06:50] poha and plantain are the SSO database servers. [06:50] ah kk [06:57] stub: rs=me if you wanted to land something to debug this, but I think upgraded bb to python2.6 for prod_lp is probably best, if we can get wildcherry done. [06:57] stub: its past EOD for me, so I'll only be sporadically around now. [07:04] (you'd need to send it straight to production-devel) [07:34] lifeless: you wouldn't happen to know where the branch scanner does its get_r[ow]_server() and start_server(), would you? [07:34] and hi, btw :) [07:34] * jtv now sees the eod note [08:45] Hi jtv1, is there one particular view in translations that displays multiple selectable items for form processing which you'd recommend looking at? I want to compare with teh way we do it for copy/delete packages in case there are better examples. [08:45] Specifically, creating a vocabulary based on the current view's search terms... [08:46] noodles775, we've got some stuff like that, but none of that is, as you put it, "recommended to look at" :) [08:46] heh, OK. Thanks danilos. [08:47] noodles775, looking at import queue stuff is of that type (lp.translations.model.translationimportqueueentry and on from there) [08:47] * noodles775 looks [08:48] good morning [08:48] noodles775: whatever you do, don't look at our languages list. :) === jtv1 is now known as jtv [08:48] hi adeuring [08:49] hi jtv! [08:50] danilos, jtv: OK, it looks like you guys similarly use a VocabularyFactory when its dependent on the context object, ah, but you pass the view into __init__, that's what I was looking for. Thanks! [08:51] Morning adeuring, bigjools [08:51] hi noodles775! [08:51] danilos: uh-oh, that sounds like he may imitate something we did. Should we stop him? [08:51] morning all [08:52] morning bigjools! [08:52] jtv, heh, in general yes, but let's run an experiment and see where it takes him :) [08:52] bigjools, morning [08:52] jtv: heh, afaics, it's the most zope-like way to do it (using a vocabulary factory which is evaluated when the view is initiated etc. [08:52] danilos: good point… this way he gets the blame or we get the credit [08:52] lol [08:52] Take. [08:53] Did I say "get" the credit? [08:54] mwhudson: do you know if getting, starting, and stopping servers as returned by get_ro_server() is something we can do very frequently? [09:00] hurrah for code that fixes things for vague and (to me) confusing reasons. [09:01] jtv: hi, 'sup ? [09:01] lifeless: hi, I _think_ I found the answer more or less the second you answered. [09:01] hah, kk [09:02] I was trying to figure out whether a transport or server (I'm not sure of the terminology) for lp-internal:/// branch URLs was running when the branch scanner triggers a particular kind of work. [09:02] lifeless: hi [09:02] bigjools: evening! [09:02] lifeless: did you fix prod_lp by any chance? [09:03] the builder yes; the build decided to fail on some of stubs new stuff; looks like a test that is doing chdir or something [09:03] we *think* upgrading the builder to py 2.6 will fix it (because the cp passed ec2) [09:03] lifeless: I've been trying to get a CP done for a week now, this is crazy. I am going to do a cowboy instead. [09:03] bigjools: have you landed the revision ? [09:03] yes [09:03] then it should be deployable [09:04] from prod-devel? [09:04] we can deploy from any branch [09:04] ok [09:04] if the rev is specified [09:04] lemme find it [09:04] but, I thought your stuff hit prod-stable last week [09:04] rev 9777 in prod-stable [09:05] isn't that your thing? [09:05] no [09:05] 9779 in devel [09:05] oh, 9778 and 9779 [09:06] yeah, 9779 had the oops isolation fix in it to see if that helped [09:06] that fix is critical for the platform team [09:06] got a link to a failure from last week? [09:06] did you land your cp's via ec2? [If so that implies real seriously annoying fuckage in the hardy builder] [09:07] lifeless: https://lpbuildbot.canonical.com/builders/prod_lp/builds/119/steps/shell_7/logs/summary [09:07] no, withoyt [09:07] without [09:07] ahhh [09:07] I don't use ec2 [09:07] I -hate- the openid glue on buildbot [09:08] Morning [09:09] that does look like the oops issue doesn't it [09:09] aye [09:09] bigjools: so my branch included your stuff and passed ec2 [09:09] * bigjools brb [09:09] bigjools: but i'd -really- like it if we consistently use ec2 when landing stuff on production-stable [09:09] bigjools: anyhow [09:10] bigjools: I'm +1 on a cowboy of your revisions (e.g. deploy from prod-devel rev 9779) [09:10] bigjools: stub: is working on wildcherry now I think. [09:10] I'll send mail then off again. [09:10] bigjools: Feel like some DB surgery? [09:10] eh? [09:11] stub: well, 'you' meaning you're coordinating, no ? [09:13] lifeless: I've never used ec2, and I'm not about to start :) [09:14] wgrant: wassup? [09:14] bigjools: that sounds unhelpful and negative [09:14] lifeless: what? [09:14] bigjools: but its after 9pm, so I'm going to ignore it and go unstrap more boxes [09:14] that's a particularly unhelpful comment yourself [09:14] blah it is [09:14] sorry [09:14] bigjools: See https://edge.launchpad.net/ubuntu/lucid/i386/python-imaging-doc -- ignoring the new Proposed publication, spot the issue. [09:15] so, ec2test gives us important guarantees for bottleneck branhces [09:15] lifeless: I run the test suite locally, always have done. [09:15] bigjools: so one of the things that ec2 does is run in a consistent clean state each time; do you arrange that locally as well? [09:16] lifeless: yes [09:16] bigjools: I'm fine with folk running the test suite locally, it just seems particularly hard to make local match 'what buildbot does' [09:16] hell its hard to even make buildbot match buildbot at the moment. [09:16] lifeless: if nobody ran it locally (there's 2 of us) then you'd not see local issues either [09:17] bigjools: I think you should ignore my previous snarky comment; am very tired, been battling a sinus infection all week, terrible beds at the hotel in sydney, and *yawn* [09:17] bigjools: -sorry- [09:17] wgrant: groan [09:17] lifeless: no worries, it's not good to work when tired eh? :) [09:17] bigjools: Yes. My fault, due to unobvious Storm quirks plus shitty old tests which missed it. Bug #653382. [09:18] <_mup_> Bug #653382: BinaryPackagePublishingHistory._getOtherPublications fails to restrict the distroseries context [09:18] Easy enough to fix, fortunately. [09:18] bigjools: I wasn't ;) I wandered past and saw a ping... then you pinged as well ;) [09:18] ah yes I saw that [09:18] lifeless: ok :) thanks for the +1 on the cowboy anyway [09:19] wgrant: how many packages are affected? [09:20] bigjools: lifeless ran a query for me on staging (which has data from 3ish weeks ago), and there were around 3000 publications. I'm not sure if the query was perfect, nor how many of those were in the primary archive. [09:21] For those in the primary archive we just need to set them back to Published, and unset datesuperseded and supersededby, since they'll all be in Release. [09:21] And there should be just about no PPA publications affected, since the use cases there are different. [09:22] this is .... not good [09:22] ... yes. [09:22] can you show me the query please? [09:22] http://paste.ubuntu.com/505415/ [09:23] (the code rolled out on 2010-08-11 or so) [09:23] It finds publications superseded by a build that's not published in the context. [09:24] what about deletions? [09:24] that'll pick up valid deletions [09:24] It won't. [09:24] Deletion doesn't set supersededby. [09:24] aha [09:25] umm has death row reaped these? [09:25] When I discovered this on Saturday, I initially thought that mass expiration last week would have killed lots of binaries that were hit by this. But it turns out that it wouldn't have. [09:25] No. [09:26] Because the dominator doesn't run over frozen pockets. [09:26] So they're Superseded, but have not been scheduled for removal. [09:26] that's lucky :) [09:26] If they were deathrow candidates, we would have noticed almost immediately that something was wrong. [09:26] what about maverick though? [09:27] Release pocket changes can only happen in maverick. [09:27] So it doesn't matter that they leak into maverick through this bug, because they were performed in maverick anyway. [09:28] So only frozen release pockets (and possibly some PPAs, but that's probably tiny) are affected. [09:28] ah maverick only has a release pocket at the moment [09:28] The query still respected pockets. [09:28] Maverick has -proposed too. [09:29] really? so why didn't the dominator kill maverick's packages then? [09:29] It did! [09:29] But they were meant to be killed. [09:29] true [09:30] All of the problematic dominations were meant to remove things from maverick, since that's the only place they can happen. [09:31] let's get the code fix in before doing this then [09:31] I think this is about the fourth time this code has broken :/ [09:31] Certainly, yes. [09:31] Was just letting you know the details. [09:32] wgrant: thanks [09:32] the dominator code scares me :) [09:32] It is a bit like that. [09:38] This was only noticed when a lucid-proposed upload landed in NEW, which strongly suggests that nothing bad made it further than the DB. [09:40] hello [09:51] morning jml [09:53] bigjools: good morning. [09:58] What merriment and hijinx are in store for me today, I wonder. [09:58] * wgrant lures jml back to engineering. [09:59] I suspect I've got more hiring than engineering to be doing. [09:59] There seems to have been a bit of that lately. [10:02] it never seems to stop [10:14] bigjools: Thanks. [10:15] np [10:17] wgrant: I'm trying to figure out if we can make your query faster [10:17] bigjools: So was I. [10:18] maybe a subselect? [10:18] It's possible we could use an optimisation like the one in Dominator.judgeAndDominate/ [10:18] What sort of subselect? [10:18] My initial version had a NOT EXISTS subselect, and was even slower. [10:18] urgh [10:19] maybe stub has some time to help :) [10:20] bigjools: Does the query look otherwise correct? [10:20] wgrant: it's hard for me to tell, to be frank [10:20] Heh, yes. [10:20] the left join is confusing me [10:20] and it doesn't help that my sql is shite [10:21] The subselect was a bit more obvious. [10:22] But I'm taking each publication and left joining any publications of the supersededby build in the original publication's context. [10:22] Then finding those that have no publications. [11:18] danilos: just to let you know, the bzr plugin does what we expect (even more so than other API stuff): uses production xmlrpc on production, edge xmlprc on edge, and staging xmlrpc on staging. [11:19] jtv1, ok, cool, thanks for checking [11:23] bigjools: has demand for PPA builds dropped or is jelmer's improvement really as awesome as the graphs make it look? [11:23] jml: the latter [11:23] It could easily have increased throughput 300%. [11:23] bear in mind that we used to spend up to around 20 minutes blocking on uploads per scan... [11:24] it's the most awesome change on soyuz I've seen in 3 years [11:24] wgrant: I've only really been looking at queue length. it's much shorter, which presumably means that people are getting what they want much faster. [11:25] bigjools: yeah. it's amazing. [11:25] It will be interesting to see how it copes in a couple of days when all the builders disappear. [11:25] bigjools: also shows that sabdfl was probably right to refuse more hardware [11:25] jml: once we get this change done that we're working on too, then it'll be more awesomererererer than anything [11:25] jml: totally [11:25] we always knew utilisation was crap [11:26] although, as wgrant says, we'll find out if we need more *dedicated* hardware fairly soon. [11:26] yip [11:28] So the graphs all look healthy? My occasional /builders checks have had the queues pretty much constantly empty. [11:28] wgrant: if bigjools doesn't mind, I'll post a public screenshot. [11:28] jml: I was going to blog about it [11:28] meant to do it Friday and forgot [11:29] bigjools: I reckon it's definitely blog-worthy. [11:29] I have seen a couple of reports of builds stuck in Uploading, though. [11:29] jml: if you want to paste something for wgrant that's ok [11:29] wgrant: yeah, known issue, it's when they fail to upload [11:29] Ah, k. [11:30] Apart from that, it seems to have gone incredibly smoothly. [11:30] it had 2 months of dogfooding :) [11:32] wgrant: http://people.canonical.com/~jml/Active-Builders.png [11:33] wgrant: we changed the way the green bit was calculated on the 30th [11:33] jml: It was originally the total? [11:33] yeah. [11:33] the graphing app is a little restrictive [11:34] That certainly looks blog-worthy. [11:34] bigjools: I spent a few minutes doing some research the other day. Looks like the only decent web-based graphing app is hosted in a google data center. [11:35] heh [11:36] Although the graph would be more impressive if the old data was corrected, so the reduction in utilisation was more obvious. [11:37] Hmm. I suppose this also means that we'll be able to do rebuilds without destroying the world for two weeks? [11:37] the important bit is the queue length [11:37] It is. [11:37] we want as much utilisation as possible [11:37] rebuilds will certainly be better! [11:37] yay for codebounce being down [11:38] bigjools: that seems a non sequitur [11:38] * bigjools has secateurs and is not afraid to use them [11:38] bigjools: if you wanted as much utilization as possible, then the important thing would be the red bit – not seeing any green. [11:39] jml: yes - however, the current graph will never do that since the builders are not marked as building until the files are dispatched [11:39] that'll change when we release the stuff we did [11:40] How well does your new thing survive restarts? [11:40] bigjools: right, but that means that the queue length is *not* the important part. [11:41] jml: it was, but as we get better at utilisation, I agree [11:42] bigjools: well, I happen to only care about increased utilization insofar as it reduces the wait time for our users. [11:42] I can't say I trust tuolumne's data, though... how is the maximum number of active builders not integral? [11:42] jml: But wait time is better estimated by queue size, as bigjools says. [11:42] jml: yes - but remember the graph works off the data in the DB - so although we are making great utilisation of the builders now, the graph is incorrect. [11:44] wgrant: yeah, I know. that's what I'm getting at: wait time is the important thing; queue length is a good proxy measure; statements about utilization do not follow from this [11:45]