[00:20] err! [00:20] why is branchChanged hitting AssertionErrors? [00:21] And no visible OOPS ID in the traceback sent to my 'bzr push' either... [00:21] yeah [00:22] On the other hand, LP did seem to successfully notice that my branch changed. [00:23] thumper: hello :-) [00:24] well [00:24] the assertionerror is because the transaction is doomed [00:25] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777XMLP119 [00:26] ah no, being doomed [00:27] its in a timeout block [00:28] wth, there's a gap of 15s between recorded queries [00:29] * wgrant stabs qastaging. [00:34] wgrant: What did qastaging ever do to you? [00:35] spiv: https://bugs.launchpad.net/launchpad-code/+bug/674305 <- feel free to hit the affects me too thing :-) [00:35] <_mup_> Bug #674305: bzr push occasionally reports AssertionError on terminal [00:35] StevenK: Timed out lots. [00:35] Although it may just be that those pages are broken now. [00:35] (Archive:+index, +packages, +delete-packages, that sort of thing) [00:39] Hmm. [00:39] It'd be nice if daily builds didn't all hit and DoS the build farm at the same time. [00:41] mwhudson: done, thanks! [00:45] wgrant: https://bugs.launchpad.net/soyuz/+bug/672371 [00:45] <_mup_> Bug #672371: Archive:+packages timeouts [00:46] mwhudson: hey [00:46] mwhudson: whazzup? [00:47] thumper: that bug [00:48] mwhudson: I think [00:48] thumper: https://bugs.launchpad.net/launchpad-code/+bug/674305 [00:48] <_mup_> Bug #674305: bzr push occasionally reports AssertionError on terminal [00:48] mwhudson: I think that may be the xmlrpc fuckage [00:48] mwhudson: not sure why there are massive gaps [00:48] thumper: the xmlrpc fuckage? [00:48] the same as for getJobForMachine? [00:48] mwhudson: all the timeouts on the xmlrpc server [00:48] mwhudson: exactly [00:48] hm, ok [00:49] I've not been able to find out why we have 8s gaps [00:49] with no obvious reason [00:49] :/ [00:49] I spent almost a week chasing it [00:50] and I've nothing to show for it [00:50] lifeless: Yeah, but isn't that in theory fixed? [00:50] wgrant: see my last comment [00:50] Oh. [00:50] iz single slow query [00:50] well [00:50] there are other slow queries [00:50] but thats the smoking gun [00:51] does that also take forever on a real DB? [00:51] lifeless: ah... no [00:51] it isn't a slow query [00:52] it is the 15s gap between query execution and the next one that bothers me [00:53] mwhudson: I'd love some help chasing that down as I've exhausted my understanding on that problem [00:53] wgrant: don't know [00:53] thumper: huh, what are you talking about? 
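(Background on the "transaction is doomed" diagnosis above: once a request's transaction has been doomed, for instance by timeout handling, it can no longer be committed, and code that keeps using it fails. A minimal sketch of that behaviour using the Zope `transaction` package on its own; this only illustrates the concept, not Launchpad's actual code path, and the AssertionError that bzr push surfaces comes from Launchpad's own glue rather than from the snippet below.)

    import transaction

    txn = transaction.get()
    txn.doom()                  # e.g. what timeout handling might do
    print(txn.isDoomed())       # True

    try:
        transaction.commit()    # a doomed transaction refuses to commit
    except Exception as exc:    # transaction.interfaces.DoomedTransaction in practice
        print("commit refused: %r" % (exc,))

    transaction.abort()         # aborting is the only way to move on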
[00:53] thumper: I'm talking about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1776QS51 [00:53] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777XMLP119 [00:53] query 33 [00:54] lifeless: ^^ [00:54] thumper: I'll have a look [00:55] thumper: that looks like thread starvation to me [00:55] lifeless: but it is only a guess [00:55] lifeless: and why is it starved [00:55] we don't know [00:55] we are just guessing [00:56] thumper: the losas have the xml server split out as the highest ticket [00:56] thumper: when thats done we'll have more resources for xmlrpc [00:56] thumper: and after that the single threaded experiment will kick in [00:57] thumper: if you want to work on this today, I suggest implementing the per thread stats [00:57] lifeless: no, I'm in the middle of something else [01:01] https://bugs.edge.launchpad.net/launchpad-foundations/+bug/243554 for reference [01:01] <_mup_> Bug #243554: oops report should record information about the running environment [01:02] wgrant: I have two problems answering for 'on a real db' [01:02] wgrant: firstly, we don't have the substituted ids to reproduce [01:02] wgrant: secondly don't have access and we're short staffed losa-wise. [01:03] wgrant: where are you up to exam wise? [01:05] lifeless: On the first day of a 12 day break. [01:05] So not doing much. [01:06] wgrant: Are you interested in tackling this perf issue? I have a trip on sunday for the cassandra training [01:06] lifeless: We should have a stub soon, shouldn't we? [01:06] and shoppinh/prep to do today [01:06] wgrant: in a few hours yes [01:07] wgrant: I'm strictly on leave, but I'm pretty bad at unwinding for < several-week periods. [01:07] Heh. [01:07] right now though, I have to do a shop-run. bbs. [01:07] So, 11888 made it bad, and the fix in iforgetwhat didn't help? [01:21] it helped [01:21] but not enough [01:21] we have two options [01:22] fix the query - its taking 200ms per SPPH at the moment. [01:22] rollback both 11888 and 11903(?) [01:22] note that rolling back leaves the page at 10 seconds and the ajax status updating timing out. [02:15] thumper: hello, mcfly [02:15] wallyworld: whazzup? [02:16] i can't get branch lp:~wallyworld/launchpad/invalid-branch-link-message to merge properly [02:16] it's not in the codebase either locally or on loggerhead and any merge attempts via pqm or lp-land claim there is nothing to do [02:17] wallyworld: this is the revision that was backed out wasn't it? [02:17] yes [02:17] but i fixed it [02:17] ie backed out the bad yui stuff [02:17] right [02:17] it's gone past ec2 again no probs [02:18] did you reverse the reversed merge? [02:18] no. not sure what to do [02:18] right [02:18] what you need to do is to merge devel into you branch [02:18] then do a reverse merge of the revision that backed out your change [02:19] the guts of the problem is that most of your branch has been merged [02:19] and the files were then reverted [02:19] so you need to revert the revert [02:19] ok. noob alert. how do i do a reverse merge? [02:19] do you know the devel revision that reverted your merge? [02:19] wallyworld: it is a cherry pick merge [02:20] i did just after it happened :-) [02:20] wallyworld: merge -r NEW..OLD (rather than merge -r OLD..NEW) [02:20] wallyworld: I [02:20] i can see if i can find it [02:20] wallyworld: I'll leave you in spiv's capable hands [02:20] wallyworld: "bzr help revert" has an example: [02:20] wallyworld: to test the merge locally [02:21] “For example, "merge . 
--revision -2..-3" will remove the [02:21] changes introduced by -2, without affecting the changes introduced by -1.” [02:21] wallyworld: get an up to date devel, and go bzr merge --preview ../my-branch [02:21] wallyworld: that way you can see what pqm will be attempting to merge into devel [02:21] wallyworld: in the way of changes [02:22] ok. i'll have a wee looksy. thanks. i'll grab a quick bite first. suddenly i'm hungry [02:28] * thumper finally has the recipe index builds looking nice [02:28] now for the tests... [02:34] F**K ME - 150 / 1593 CodeImportSchedulerApplication:CodeImportSchedulerAPI [02:34] hard / soft timeouts [02:34] 36 / 131 CodehostingApplication:CodehostingAPI [02:34] mwhudson: ^^^ that'll be contributing to the push issues [02:35] thumper: yep [02:35] also :( [02:43] * thumper has push failures like mwhudson had [02:58] wgrant: so ;) [03:01] lifeless: Hi. Just reinstalled and trying to get Launchpad running. [03:01] meep! [03:01] poolie: ping [03:01] Desktp + Soyuz on amd64 with lp-buildd in a VM does not fit well in 4GiB. :/ [03:01] hi there wallyworld [03:01] hi wgrant, lifeless [03:02] Afternoon poolie. [03:02] hi poolie [03:03] hey, with the bzr 2.2.2 upgrade, we talked about doing it today from tip to avoid 2 lots of downtime. but i don't really think we should package trunk prior to official release. what downtime is involved? when i did the 2.2.1 upgrade, was there any downtime there? [03:03] so two things: [03:04] firstly, i wasn't really saying "you should package tip", just "it's safe to jump to tip if you want to" [03:04] we shouldn't normally need to [03:04] There is a few seconds of downtime for codehosting upgrades. [03:05] and if there's a bug there for which you need an urgent deployment, it could be better to just do a release immediately [03:05] secondly i don't think it's really relevant to downtime [03:05] wgrant: although if you are a user 90% of the way through an hour long push the cost to you will be more than a few seconds... [03:05] i probably said "to avoid lag between us landing a fix and you running it" [03:05] hm iwbni it didn't interrupt running connections [03:06] Hmm, true. [03:06] poolie: hmm, and in this case hypothetically it wouldn't need to; we don't need to restart the ssh server, just provide a new bzr so that new connections will get a fixed lp-serve... [03:07] so, me thinks it's better to wait for bzr 2.2.2 to be released next week deal schedule a small outage [03:07] if needed at all [03:07] We have a downtime window next week for the DB upgrade anyway. [03:07] right [03:07] otherwise we have to schedule downtime [03:07] unless its zomg time [03:08] we will once the relevant RT is done have no-downtime deploys to codehosting. [03:08] but its (I think) third in the queue. [03:08] and we're getting one item done every 2-3 weeks. [03:08] there's that cpu spin/wait issue that 2.2.2 fixes and a few people get hit by hit but not so many that we shouldn't wait till next week... [03:08] Tangentially, I see https://lpstats.canonical.com/graphs/CodehostingPerformance/ looks a bit alarming ? [03:09] it does [03:09] fortunately its friday and noone will care about it till Monday [03:09] [03:10] lifeless: you shouldn't care about it either. so much for you taking the day off. my wife would kill me if i worked too much on my "day off" [03:11] hm, is is that a repeating pattern over the last 24h? [03:11] spm, are you back at work? 
[03:12] poolie: I am, but seriously considering tking the rest off - having a horrible hayfever attack atm - has triggered a very nasty asthma response. :-/ [03:12] spm: :( [03:13] spm: taken claratyne? [03:13] indeed [03:13] spm: saline solutionas suggested can help a lot - gets the pollen out [03:13] aye [03:13] mmm, neti pots. [03:13] spiv: I ordered one wed [03:14] nasonex is great (prescription only) [03:14] Hmm. It'd be nice if we had tracebacks for each SQL statement. [03:15] poolie: yeah, mine runs out in a few days [03:15] I've been given a (different) thing - I haven't read up to see if its equivalent yet. [03:16] rofl [03:16] 'allonase' or something like that [03:16] 'I also suggest renaming "incomplete" to "need info", as it's much more [03:16] descriptive. "Incomplete" makes it sound like the bug is in progress of [03:16] being fixed, but not yet done.' [03:16] wgrant: https://bugs.launchpad.net/launchpad-foundations/+bug/606959 [03:16] <_mup_> Bug #606959: oops should record the short traceback that caused each query? [03:16] lifeless: heh [03:17] lifeless: what's nice about that idea is that although capturing tracebacks is a touch expensive, that shouldn't matter if you only do a reasonable number of queries ;) [03:17] spiv: http://ecoyogastore.co.nz/eco-yoga-gear/neti-pot [03:17] spiv: yeah [03:17] i saw, linked from the discussion of Go, google have a final bug status of "unfortunate" [03:17] that's nice [03:17] lol [03:18] "suckstobeyou" :) [03:18] I thought they added that specially for the naming bug. [03:18] lifeless: what web stores need for neti pots are photos more like http://www.flickr.com/photos/debrisdesign/502255811/ [03:18] But I may be wrong. [03:18] oh, maybe [03:18] it could be freeform for all i know [03:19] spiv: yeah, I hope it has a manual [03:19] but it's a bit more precise for some things than 'wontfix' [03:19] lifeless: the internet can provide a guide or twenty, I'm sure. [03:19] what we need is a closure-space [03:19] N dimensions and a slider. [03:20] like the colour-space pickers [03:20] poolie: That's what Opinion is for! [03:20] *cough* [03:20] wgrant: thats an opinion! [03:24] lifeless: :( [03:25] seriously [03:25] its still an experiment as far as I've heard [03:25] Ah. [03:29] OK, with Unity defeated, it is now time to look at that query. [03:29] heh [03:30] wallyworld: if you want to discuss https://bugs.launchpad.net/bugs/674329 further I'm happy to do so - I didn't mean to prevent discussion about whatever symptoms you ran into. [03:30] <_mup_> Bug #674329: DecoratedResultSet eagerly fetches all results [03:32] lifeless: hmmm. seems at first glance the whole concept of iterable results sets which load records in batches is not supported? [03:33] what is the query returns 10000000 records. and the user only wants to see 100 at a time? [03:33] wallyworld: thats what batch navigator is for [03:34] wallyworld: we do a count(*) [we should estimate instead, but thats orthogonal) and then use a slice (OFFSET X LIMIT Y in SQL) to only retrieve 100 at a time. [03:34] i realise that's what it is supposed to be for, but isn't the pirpose defauted if __iter__ loads the whole lot anyway [03:34] wallyworld: __iter__ is /not/ for 'do partial work' [03:34] wallyworld: (neither in general, nor in this specific case) [03:35] wallyworld: in this specific case its because the database server will do all the work requested, always. [03:35] so we have to ask for the right amount of work up front rather than do some, do some more, and then say that we're done. 
[03:36] wallyworld: if you consider the implications of ORDER BY/GROUP BY on the work required in the db, this should make a lot of sense [03:36] sorry for my dumbness, but isn;t the whole concept of yield to avoid eagerly realising the entire list? [03:36] uhm [03:37] so, iterators, generators and lazy evaluation [03:37] why does the server do all the work? other databases don't enforce this? [03:37] wallyworld: good question. Pg definitely does; others I won't speculate on. [03:38] sure, the database has to do some work to satisfy order by etc, but the step of extracting the data from the db into the result set needn't be done unless required [03:38] nevertheless [03:38] python-pgsql has a single large buffer with the results, no further network access occurs as we iterate the rows. [03:39] Or so I am assured by Smart People. [03:39] [specifically jamesh who dug into this in the past too] [03:39] ok then. [03:39] by python-pgsql, you mean psycopg2? [03:39] jamesh: blah - yes [03:41] lifeless: so to recap, if the result set has 10000000 rows, it's ok to do a list(rs) which effectively constructs an in memory data structure with all that data even if we only want to process 100 at a time? [03:41] wallyworld: if you stop reading the result set early, the only effort you're going to save is the conversion of the result buffer to Python objects on the client side. [03:41] or am i missing something? [03:41] wallyworld: You'll slice first. [03:41] yes, and for a large result set, that's significant and a potential performance issue [03:41] wallyworld: The slice affects the issued query. [03:42] if you know you will only need a subset of the rows, tell the database so that it can send you less info. [03:43] jamesh: i'm talking about say batch navigator which allows the user to scroll through the results 100 at a time. [03:43] we may want the whole lot eventually, but not all at once [03:43] That slices, so the DB only sends those 100 rows. [03:43] And only those 100 are turned into objects. [03:43] wgrant: not if a list(rs) is done?? [03:43] which is what happens in DecoratedResultSet [03:43] wallyworld: no, to recap, slice the resultset. [03:43] wallyworld: __iter__ will only be called on the sliced version, right? [03:43] wallyworld: how do you know you'll want them all eventually? [03:44] slicing returns a new resultset. [03:44] And __iter__ is called on *that*. [03:44] for example, how often do people go to the second page of results from a bug search? [03:44] jamesh: i said we *may* want them all eventually, say if the user scrolls to the end [03:44] wallyworld: general principle: specify all the work you want within a *transaction* - call it 2 seconds of processing time. [03:44] :-) [03:45] wallyworld: and ask for, and process that. No more (would be wasted). No less (would result in additional queries - lowers efficiency) [03:45] wallyworld: the batch navigator does this slicing for you [03:45] wallyworld: how about we get concrete. 'I'm trying to do X, and Y is happening' [03:46] ok. i think my problem is i misunderstood how the batch navigator works. [03:46] thanks for setting me straight :-) [03:46] the batch navigator uses count() on the base result set to estimate the number of pages [03:46] * wallyworld crawls back to his hole [03:47] and a slice to get the data for the current page [03:47] makes sense [03:47] the count() is a performance issue with huge datasets [03:47] we need to switch to estimators [03:47] yeah. 
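(To make the slicing point concrete: creating or ordering a Storm result set fetches nothing, and slicing it returns a new result set whose SQL carries OFFSET/LIMIT, so only the requested batch is transferred and turned into Python objects. A small self-contained sketch of the batch-navigator pattern lifeless describes, using an in-memory SQLite database and a made-up Task table rather than anything from Launchpad.)

    from storm.locals import Int, Store, Unicode, create_database

    class Task(object):
        __storm_table__ = "task"
        id = Int(primary=True)
        title = Unicode()

    store = Store(create_database("sqlite:"))
    store.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, title VARCHAR)")
    for i in range(1000):
        task = Task()
        task.title = u"task %d" % i
        store.add(task)
    store.flush()

    result = store.find(Task).order_by(Task.id)   # no rows fetched yet
    total = result.count()          # one COUNT(*) query; the costly part an
                                    # estimator would replace on huge tables
    batch = result[200:300]         # a new ResultSet: adds OFFSET 200 LIMIT 100
    rows = list(batch)              # only these 100 rows are fetched and deserialised
    print("%d total, %d in this batch" % (total, len(rows)))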
[03:48] but thats orthogonal [03:48] also, in my case, i had a query with a group by so had to override count() [03:48] erm [03:48] the default storm rs barfs [03:48] :( [03:48] I thought that was fixed in 0.18 [03:49] you can't say select (*) from xxxx with a group by in it [03:49] no [03:49] i fixed it quite simply [03:49] but i also found a bug in Count() [03:49] it messes up count(distinct xxx) [03:49] lifeless, do you go to the losa meetings? [03:49] it leaves out () around the columns [03:50] i don't know the speciic name for it, but i mean the one where francis asks them to do things [03:50] s/select(*)/select count(*) [03:50] poolie: no, tz fail. I get minutes, and have a separate meeting with ISF [03:50] k [03:50] poolie: I do when I'm in a workable tz [03:51] i'll mail him then [03:51] thanks [03:53] jam, did you file an RT for starting lp-serve? [03:53] bug 660264 [03:54] <_mup_> Bug #660264: bzr+ssh on launchpad should fork, not exec [03:54] I've had an rt for a while now, 41340 IIRC, but I'm not positive [03:54] thanks, i'll check that [03:54] sorry, 42156 [03:55] * wallyworld goes to make a coffee and get his fire proof suit [03:56] poolie: https://rt.admin.canonical.com/Ticket/Display.html?id=41791 [03:56] that's not exactly the same as getting it running though [03:57] is there a ticket or bug for that? [03:57] iirc you need them to change some configuration scripts that you don't yourself have access to? [03:58] poolie: the lp-serve thing is moving; jam needed to land more code [03:58] to do what? [03:59] poolie: there is one, but I keep shooting blind as to the rt number [04:00] Let me find the email [04:00] thanks [04:00] Could someone run http://paste.ubuntu.com/530449/ on staging? [04:01] lifeless, while jam's, looking, what do you understand the state of this to be? [04:01] i'd just like to make the bug accurate and work out where if anywhere it's getting stuck [04:02] poolie: its in a back and forth discussion with the losas as they figure all the bits out [04:02] poolie: its low priority (relatively that is) so I wouldn't expect it to happen rapidly [04:02] poolie: 42199 [04:03] poolie: mwhudson was landing the init script for jam, and with that it should be able to be enabled on staging [04:03] and then qad [04:04] epic fail [04:04] 3142 OOPS-1776B79 BugTask:+index [04:04] so from that rt it looks like the next action is still 'get the service running on qastaging'? [04:04] === Top 10 Time Out Counts by Page ID === [04:04] Hard / Soft Page ID [04:04] 238 / 35 Person:+commentedbugs [04:04] 150 / 1593 CodeImportSchedulerApplication:CodeImportSchedulerAPI [04:04] 50 / 188 BugTask:+index [04:04] 36 / 131 CodehostingApplication:CodehostingAPI [04:04] 16 / 9 Person:+bugs [04:04] 14 / 352 Distribution:+bugs [04:04] poolie: right, this whole week there haven't been enough l-osas, and there have been some critical things going on [04:04] 9 / 70 Archive:EntryResource:getBuildSummariesForSourceIds [04:04] 9 / 8 Archive:+copy-packages [04:04] 8 / 396 Distribution:+bugtarget-portlet-bugfilters-stats [04:04] today there was only Ch-ex [04:04] 7 / 0 BugTask:+addcomment [04:04] poolie: yes [04:05] k, i don't want to preempt the critical things, of course, i just want it to not stay stuck after that [04:05] poolie: so in my queue its: [04:05] - after RFWTAD stuff - thats important to finish getting single revs deployed and finish eliminating operation risk [04:06] - after token librarian - thats old inventory which fixes timeouts for many private attachments (e.g. 
security builds) [04:06] in terms of LOSA time [04:07] ok [04:07] short interrupts to move it along are of course reasonable [04:07] so it's off john's plate until they get to it? [04:07] poolie: John can best answer that [04:12] lifeless, poolie: I'm at least pending them telling me what I need to do next [04:12] the last round I didn't know I needed until they asked for it [04:12] mm there seem to be a few problems like that [04:21] RFC: http://people.canonical.com/~tim/recipe-latest-builds.png [04:22] it is using factory generated fake data, so I have multiple binary builds for the same arch [04:22] but the basics are there [04:23] this is up for review now [04:28] poolie: hi [04:28] poolie: we have another urgent need for committing to stacked branches [04:28] hi thumper [04:29] i think francis mentioned this... [04:29] poolie: bzr-builder commits to the branch [04:29] it was for.. right [04:29] and why does it want a stacked branch not a checkout? [04:29] poolie: and getting a branch for some big projects was using much more memory than the virtual builders had [04:29] poolie: because it never pushes [04:30] thumper: Not a fan of the triplicated spr name and version, but apart from that it looks great. [04:30] poolie: apparently an alternative solution is to change the merge code [04:30] poolie: abentley wrote it all up [04:31] onto the bug about commit? [04:31] on the incident report [04:31] for the buildd failures [04:31] that was an email or a wiki page? [04:31] wiki page I believe [04:32] I could forward you the email if you like [04:32] aaron wrote solutions up for me [04:32] i can probably find it [04:32] rockstar: ping? [04:33] * thumper EODs [04:34] thumper, is that https://wiki.canonical.com/IncidentReports/2010-10-28-LP-build-manager-not-dispatching ? [04:35] poolie: ah, I see it isn't all on the incident report [04:40] thumper: if its not pushing [04:40] thumper: why commit at all? [04:42] stub: what do you think of the idea of capturing query params in oops [04:42] stub: it seems to me it will help reproducing issues lot [04:43] lifeless: We will be logging private information, including information lp devs technically shouldn't have access to. [04:45] Some of that already leaks via the URL of course (so LP devs can learn about private teams they shouldn't know about) [04:45] But that hasn't been a problem so far, as private stuff has been company internal rather than private to a subset of the company. [04:46] stub: well, in theory :) [04:46] stub: so, we also manually create many queries today [04:46] so at least - today - we already leak that [04:50] Content of some of the private bugs could be an issue, as that would violate vendorsec [04:51] yeah [04:51] all disclosure stuff is serious [04:52] stub: when would we use content from a private bug in a query ? [04:52] stub: INSERT I guess [04:52] stub: + 'bugs like this' [04:52] stub: uhm, fo rthe INSERT case we could choose not to substitute [04:52] s/substitute/include/ [04:53] stub: we're trying to figure out why https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777QS12 has multi second queries [04:53] stub: doing them by hand with plausible ids is extremely fast - 130ms for the main lookup in the page [04:55] stub: could it be the something odd like the isolation level (what level does appserver run as), or is it just the specific ids that will be at issue? 
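(A sketch of the idea behind recording more per-query context in an OOPS, as in the bug 606959 discussion earlier and the parameter-capture question here: wrap statement execution so every query is logged with its parameters, duration, and a short traceback of the Python code that issued it. This is a stand-alone illustration using sqlite3, not Launchpad's actual tracer or OOPS machinery; the privacy concern stub raises is exactly the "params" field below.)

    import sqlite3
    import time
    import traceback

    query_log = []   # what an OOPS report could carry for one request

    def traced_execute(cursor, statement, params=()):
        start = time.time()
        cursor.execute(statement, params)
        query_log.append({
            "sql": statement,
            "params": params,          # the contested bit: may contain private data
            "duration": time.time() - start,
            "stack": traceback.format_stack(limit=5),   # short caller traceback
        })

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    traced_execute(cur, "CREATE TABLE t (id INTEGER, body TEXT)")
    traced_execute(cur, "INSERT INTO t VALUES (?, ?)", (1, "possibly-private text"))

    for entry in query_log:
        print("%0.6fs %s" % (entry["duration"], entry["sql"]))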
[05:41] stub: ping [05:41] hmm, nvm for a sec [05:42] wallyworld: qastaging-slave vs main [05:42] perhaps [05:42] bah [05:42] wgrant: ^ [05:42] lifeless: Could be, I suppose. [05:42] lifeless: ECONTEXT [06:04] wallyworld: I was talking to wgrant ; tab fail. [06:04] lifeless: np. i figured that when i saw the rest of the conversation come through :-) [06:14] stub: ping === almaisan-away is now known as al-maisan [06:14] lifeless: pong [06:14] hi [06:14] I need your help [06:14] we've got a very odd thing happening [06:15] have a look at these two oopses [06:15] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777QS12 [06:15] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777QS19 [06:15] this is the +packages page which is a current blocker for deploying [06:15] isolation level doesn't cause slowdowns [06:15] this page [06:15] https://qastaging.launchpad.net/~yavdr/+archive/stable-vdr/+packages?start=0&batch=204 [06:16] in 1777QS12 query 34 takes 6.3 seconds [06:16] in 19 it takes 202ms [06:16] and 39 takes 20 seconds [06:16] we've shutdown cronscripts on asuka [06:17] so the load should be tolerable (about 2 I believe - spm can confirm ?) [06:18] running query 34 by hand, it takes about 200ms consistently, every time [06:18] lifeless: Are the oopses from the first batch? The way we currently do batching means that it you have a large set of results, the later batches will always timeout. [06:18] stub: same batch in both oopses [06:18] stub: same exact url [06:22] ahh [06:22] I think I've managed to get a slow query [06:22] \o/ finally [06:23] !! [06:25] OOPS-1777QS19 q39 is slow and comes with all the parameters (obviously we are not sanitizing the aborted query...) [06:25] stub: yeah, its also genuinely slow locally [06:25] by which I mean ro user on qastaging [06:26] stub: thanks [06:28] And that is slow because it is returning 1.35 million rows [06:28] \o/ [06:29] wgrant: ^ [06:29] Hmm. Is that the newer version query? [06:29] I think I just reused the existing grouped version of it [06:29] sounds like it was inefficient already [06:29] :) [06:30] or buggy [06:30] 1.35 million rows sounds buggy. [06:30] wgrant: this give you what you need to make a test, isolate n fix? [06:31] lifeless: its missing a join condition [06:31] lifeless: Maybe. [06:31] Hah, so it is. [06:31] lifeless: Its missing a 'AND sourcepackagename.id = sourcepackagerelease.sourcepackagename [06:31] SPN [06:31] Yeah. [06:32] stub: in the inner or outer? [06:32] The outer [06:33] 2.7 seconds [06:33] tolerable with just one [06:33] So every matched row is being expanded to 38k rows. [06:33] Ah. [06:33] I think in fact that it shouldn't be joining against SPN at all. [06:33] wgrant: still badly needs tuning [06:34] oh, I did chage that, I removed spn.... but I bet storm is putting it back in. [06:34] bastardo. [06:34] how do you disable autotables? [06:34] jamesh: ^ [06:34] lifeless: It's still explicitly there. [06:34] clauseTables=[ [06:34] 'SourcePackageName', 'SourcePackagePublishingHistory']) [06:34] s/Name/Release/, I suspect. [06:34] So we might be able to avoid the subselect using DISTINCT ON [06:34] lifeless: what's the context? [06:35] jamesh: nvm :) [06:35] jamesh: I was thinking storm was seeing a table ref from an inner query and autotables adding it to the outer FROM [06:35] jamesh: but I was wrong [06:36] lifeless: So, how does it go if you remove the SPN join from the query? [06:36] ah. 
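(The 1.35 million rows are the classic symptom stub pins down here: a table left in the FROM clause with no condition joining it to the rest of the query, so every genuinely matching row is multiplied by every row of the unconstrained table, roughly 38k SourcePackageNames in this case. A tiny self-contained demonstration with sqlite3 and toy tables standing in for SourcePackageName/SourcePackageRelease.)

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE pkgname (id INTEGER PRIMARY KEY)")
    cur.execute("CREATE TABLE pkgrelease (id INTEGER PRIMARY KEY, pkgname_id INTEGER)")
    cur.executemany("INSERT INTO pkgname VALUES (?)", [(i,) for i in range(1000)])
    cur.executemany("INSERT INTO pkgrelease VALUES (?, ?)",
                    [(i, i % 1000) for i in range(50)])

    # Table listed but never joined: 50 releases x 1000 names = 50000 rows.
    cur.execute("SELECT count(*) FROM pkgrelease, pkgname")
    print(cur.fetchone()[0])    # 50000

    # Join condition restored: one row per release again.
    cur.execute("SELECT count(*) FROM pkgrelease, pkgname"
                " WHERE pkgname.id = pkgrelease.pkgname_id")
    print(cur.fetchone()[0])    # 50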
[06:36] wgrant: fine [06:36] wgrant: what file is tha tin [06:36] wgrant: 2.6 seconds [06:37] lifeless: lib/lp/registry/model/distroseries.py [06:37] 2.6 seconds sounds sort of excessive. [06:38] http://pastebin.com/bJ2TxmFc [06:41] Hmm... distinct on makes it worse. [06:43] ok [06:43] thats up in PQM [06:43] immediate fix [06:44] And that hopefully makes it non-critical. [06:49] yeah [06:49] assuming theres nothing hiding behind it [06:49] let me get the change cowboyed to see [06:51] This explains why even trivial archives were timing out. [07:17] wgrant: indeed [07:18] ok its landed [08:58] good morning [08:59] morning [09:10] Hey up, by the way [09:51] lifeless: It looks like your fix for bug 672371 did not help. +packages still times out on qastaing. [09:51] <_mup_> Bug #672371: Archive:+packages timeouts [09:52] What's next? Revert r11888? [09:55] jml: Hi! Any chance you could QA bug 673015? [09:55] <_mup_> Bug #673015: Code of Conduct requirement for PPA upload rights is unnecessary [09:56] allenap: Hi! Any luck figuring out bug 667340? [09:56] <_mup_> Bug #667340: Trac status of "Verified" confuses bug watcher [09:59] stub: Can you please QA bug 673874 before starting on your weekend? [09:59] <_mup_> Bug #673874: Improve bug comment caching [09:59] henninge: No, not yet. It hasn't caused any regressions, so it's actually safe to go. [09:59] henninge: I'll mark it as qa-ok but continue to investigate. [09:59] allenap: thanks a lot! [10:04] gmb: can you please QA bug 672507 ? [10:04] <_mup_> Bug #672507: Add bug_notification_level to the structural +subscribe view [10:04] henninge: Sure. [10:06] henninge: Done [10:06] gmb: thanks a lot! [10:10] henninge: jml needs my help to QA that [10:10] bigjools: thanks for offering it ;) [10:11] henninge: see the comment in the bug [10:11] he can't QA without it since it needs dogfood :) [10:11] henninge: we're waiting on https://lpbuildbot.canonical.com/waterfall [10:14] lifeless: ah yes, thank you. [10:19] hello. [10:19] yes QA, I know I know [10:26] bigjools: where do I need to point .dput.cf at? [10:26] jml: http://pastebin.ubuntu.com/530615/ [10:28] * bigjools processes your upload [10:29] jml: rejected [10:30] bigjools: why so? [10:30] jml: can I help you make a dummy package that I know works [10:30] "Unable to find python-testtools_0.9.6.orig.tar.gz" [10:30] and it was a mixed upload it seems [10:30] meaning? [10:30] binaries and source [10:31] jml: I normally use the "hello" package [10:31] apt-get source hello [10:32] cd hello-2.5 [10:32] dch -i [10:32] [10:32] yeah, that's what I did with testtools [10:32] (so far so good) [10:33] * bigjools sighs at stuck keys [10:33] heh [10:33] ok, then you need to "debuild -S" [10:34] ahh [10:34] it's the -S that I didn't do [10:34] uploaded [10:35] accepting it this time [10:35] yay [10:36] you cleared the CoC from ~jml? [10:36] bigjools: I did, but I'd like to double check with getUtility(IPersonSet).getByName('jml').is_ubuntu_coc_signer [10:36] * bigjools checks [10:37] False [10:37] qa-ok! [10:37] sweet. [10:37] bigjools: thanks! [10:37] my pleasure [10:38] * bigjools goes to celebrate with caffeine [10:41] henninge: what's the word on the crazy non-vc managed file that refers to class paths? [10:43] jml: It cannot be updated outside of a roll-out - at least not without Tom around ... [10:44] jml: So I am preparing a branch that adds the required import to c.l.i again with an XXX to remove it again after the roll-out. 
[10:44] henninge: that seems unsatisfactory [10:44] and a special roll-out requirement to update that file [10:45] henninge: can't we just add the requirement and leave c.l.i as-is? [10:45] jml: only if we go without a further deployment today [10:45] henninge: so it needs a rollout-with-downtime? [10:45] so spm says, yes. [10:46] henninge: did he say what it's needed for? [10:46] jml: hang on, I'll forward the mail [10:46] henninge: thanks :) [10:55] henninge: ok. I find this whole thing colossally annoying, but it looks like you guys are making the best of a bad situation. [10:55] jml: we are trying hard ... ;) thanks [10:55] and yes, it is annoying === al-maisan is now known as almaisan-away [10:59] henninge: it passed buildbot [10:59] henninge: when 914 hits qastaging [10:59] then [10:59] https://qastaging.launchpad.net/~yavdr/+archive/stable-vdr/+packages?start=0&batch=204 [10:59] should start working [11:00] that should be anytime now [11:00] lifeless: thanks! But it will be another 4 hours or so ... [11:01] henninge: why? [11:01] https://lpbuildbot.canonical.com/builders/lucid_lp/builds/355 [11:01] It just entered buildbot not passed it yet [11:01] oh crumbs 913 I see [11:01] ah well [11:01] gl [11:01] ! [11:01] and gnight all [11:02] lifeless: good night and thanks again. === matsubara_ is now known as matsubara [11:23] hello - I am having trouble getting the webservice to work on dogfood. When I try and log in, there's a rejection because it can't traverse to '1.0'. HALP? [11:25] bigjools: Have you tried using 'devel' rather than 1.0 ? [11:25] yes, that's what I am using - which makes the error more odderer [11:25] launchpad = Launchpad.login_with('testing', 'https://api.dogfood.launchpad.net/devel/') [11:25] Shouldn't there be another /api/ in there? [11:26] yes [11:26] still fails! [11:26] then I'm out of ideas :-) [11:27] hmm using https://api.dogfood.launchpad.net/api worked [11:32] ah you need to write version='devel' in the login_with params [11:59] jml: got a sec? [11:59] bigjools: sure [12:00] jml: I'm probably doing something very very stupid but I have code blindness. See http://pastebin.ubuntu.com/530655/ [12:00] there's a code snippet and a pdb session [12:00] the inner function callback can't see all of the outer method's variables.... [12:01] Project devel build (220): FAILURE in 2 hr 4 min: https://hudson.wedontsleep.org/job/devel/220/ [12:01] * Launchpad Patch Queue Manager: [r=lifeless][ui=none][no-qa] Remove StartsWith matcher from [12:01] lp.testing.matchers in favour of one from testtools & fix some [12:01] assertions that always passed. [12:01] * Launchpad Patch Queue Manager: [r=lifeless][ui=none][no-qa] Really drop Sourcepackagename from getNewerSourceReleases - fixing massive timeouts on +packages. [12:02] ./me looks [12:02] Morning, all. [12:02] morning deryck [12:03] bigjools: you are masking them in scope, I think. [12:03] bigjools: let me knock up a simpler example... === henninge changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 3 of 10.11 | PQM is open | firefighting: Lots of timeouts on qastaging!! | https:/​/​dev.launchpad.net/​ | Get the code: https:/​/​dev.launchpad.net/​Getting [12:04] bigjools: http://paste.ubuntu.com/530657/ [12:04] OK, qastaging is timing out left and right ... :( [12:05] Ubuntu pages seem to work fine but any project page times out. 
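(For reference, the launchpadlib incantation the dogfood discussion above converged on: the service root is the API host itself, with no /api and no version segment in the path, and the web service version is passed separately. A rough sketch only; it assumes a launchpadlib recent enough to accept the version keyword, as noted at 11:32, and ends with a trivial call just to show the object is usable.)

    from launchpadlib.launchpad import Launchpad

    # No "/api" and no "/1.0" or "/devel" in the URL; pick the version here instead.
    launchpad = Launchpad.login_with(
        "testing", "https://api.dogfood.launchpad.net/", version="devel")

    # launchpadlib.uris also exposes DOGFOOD_SERVICE_ROOT (mentioned at 17:17),
    # which can be passed as the service root instead of the literal URL.

    print(launchpad.me.name)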
[12:05] bigjools: in "if file_sha1 == 'buildlog':", you are overriding out_file, out_file_name and out_file_fd [12:05] bigjools: probably the thing to do is pass them in. [12:05] e.g. [12:05] jml: it's not got that far yet [12:05] d.addCallback(got_file, out_file_name, out_file) [12:05] bigjools: it doesn't matter. [12:05] ok [12:05] bigjools: run the python I pasted [12:06] that's special [12:06] bigjools: simply having an assignment in the scope masks the outer scope, whether or not the assignment has been evaluated. [12:07] bigjools: I'm not sure it would be sensible to do anything else. [12:07] jml: ok thanks , I'll pass 'em in [12:07] bigjools: np. [12:14] Argh! [12:14] I think I never realized how widespread the problems are that r11888 caused. [12:15] Maybe it's just that. [12:17] henninge: It should be limited to pages on IArchive. [12:17] henninge: can you please subscribe me to whatever bug you file for the XXX in c/l/interfaces/__init__? [12:18] jml: oh bug, right ... ;) [12:18] henninge: Anything outside Archive:+(index|packages|copy-packages|delete-packages) is probably not 11888. [12:18] wgrant: thanks [12:19] although I wish it was ... (because there is a fix coming) [12:26] bigjools: should I put that API gotcha on a wiki page somewhere? [12:26] jml: not yet - I can't get it working still [12:26] bigjools: ok. [12:26] jml: there's an error from wadllib about "Can't look up definition in another url" === mrevell is now known as mrevell-lunch [12:27] I've not seen that one before [12:27] and I suspect I need leonardr [12:27] yeah, it's doing something weird so that the /api is stripped somewhere [12:27] The URL shouldn't have /api in it. [12:27] but later depends on it being there [12:28] /api is used to traverse from the webapp to the API -- you don't use it on api.launchpad.net. [12:28] ......... [12:28] and so it works [12:28] thanks wgrant [12:28] Heh. [12:30] jml: bug 674476 I failed to mention it in the XXX, though... :/ [12:30] <_mup_> Bug #674476: Files outside the LP tree reference LP code [12:30] henninge: thanks. that's ok. [12:31] and you are subscribed [12:32] jml: and thanks for reminding me about the bug [12:33] henninge: np. === didrocks1 is now known as didrocks [12:57] # === mrevell-lunch is now known as mrevell === almaisan-away is now known as al-maisan === beuno_ is now known as beuno [13:14] Started in 15 minutes 27 seconds! [13:24] jml: ha - remember how we added Deferred to lp_sitecustomise.py? [13:25] yeah? [13:25] jml: looks like I need DeferredList too :) [13:25] bigjools: I thought DeferredList subclassed Deferred [13:25] ForbiddenAttribute: ('addCallback', meh [13:26] I guess zope doesn't care so much about that [13:39] why might bugtask.date_closed be none, even though its status is one of Fix Released, Wontfix or Inprogress? [14:03] jam: was my qtwebkit build fix0red? === matsubara is now known as matsubara-lunch === Ursinha is now known as Ursinha-lunch [14:24] henninge: where are we with the qastaging slowdown? I see that only qastaging is affected; staging and production are fine. The timeout exception I see is within database code, but since that's where we check for timeouts, that's not necessarily indicative. [14:24] Has anyone looked at qastaging logs? Has anyone looked at performance graphs? Has anyone tried to correlate performance graphs with revisions deployed on qastaging? [14:25] and, are we coordinating here or on -ops? 
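(The scoping gotcha jml reproduces for bigjools just above, in miniature: any assignment to a name inside a nested function makes that name local to the nested function for its entire body, so the outer value becomes invisible even on lines before the assignment runs. The names below echo the pastebin discussion but are otherwise made up.)

    def outer():
        out_file_name = "build-log.txt"

        def callback_broken(file_sha1):
            # The assignment below makes out_file_name local to callback_broken
            # for its *whole* body, so this read fails with UnboundLocalError
            # even though the outer variable exists.
            print("starting with %s" % out_file_name)
            if file_sha1 == "buildlog":
                out_file_name = "buildlog.txt"
            return out_file_name

        def callback_ok(file_sha1, out_file_name):
            # The fix suggested above: pass the outer values in as arguments.
            if file_sha1 == "buildlog":
                out_file_name = "buildlog.txt"
            return out_file_name

        try:
            callback_broken("buildlog")
        except UnboundLocalError as exc:
            print("broken: %s" % exc)
        print("ok: %s" % callback_ok("buildlog", out_file_name))

    outer()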
[14:26] Hm, no tuolomne graphs of qastaging AFAICT :-/ [14:27] Maybe I need to know the machine name(s) [14:34] qastaging is the same machines as staging, and staging is not timing out (as badly?) so machine load doesn't seem likely... [14:34] trying logs [14:36] gary_poster: staging is a lot of revisions behind qastaging atm [14:36] henninge: I figured it was something like that, yeah [14:36] so what has been done? [14:37] I saw stub's reply, but that didn't tell us much [14:37] I was about to grope around in logs [14:38] gary_poster: logs is good [14:39] I was hoping that the authors of the revision could check if any of their code could be causing this. [14:40] henninge: you identified the revision? [14:41] no, it's just any of the later ones. [14:41] ah :-) [14:41] bug I could narrow down the range because it only started today. [14:44] henninge: you up for that while I do log groping? I can share log groping fun here. So far the only thing that does not look like chatter in the qastaging librarian log is "Exception KeyError: ((, (1890638,)),) in ignored" [14:44] seems to be mostly happy thoug [14:45] h [14:45] gary_poster: I am looking at the revs atm, yes. [14:46] cool thanks [14:51] There are boatloads of "No handlers could be found for logger "librarian"" things in logs, which do make me a nit nervous [14:51] bit [14:52] gary_poster: I noticed earlier that getting images from the librarian had a long delay. [14:53] mrevell: flacoste: http://paste.ubuntu.com/530734/ [14:53] henninge: yeah. Maybe. It doesn't smell like the cause to me. This is interesting though: qastaging app log is *swamped* with these: http://pastebin.ubuntu.com/530735/ [14:54] gary_poster: What is a "DoomedTransaction"? [14:54] a transaction that must not be restarted [14:56] henninge: may be an unrelated problem. This started after the the restart 2010-10-21T15:22:39, so it's been happening a looong time [14:57] ah, ok [15:07] jkakar: around? ResultSet.set is generating bad SQL. === Ursinha-lunch is now known as Ursinha [15:13] jkakar: http://pastebin.ubuntu.com/530742/ === matsubara-lunch is now known as matsubara [15:17] Can you paste the code that generates this, please? [15:18] abentley: ^^ Also, am on a call, will be laggy. [15:22] jkakar: http://pastebin.ubuntu.com/530747/ [15:23] abentley: What's the __storm_table__ for SourcePackageRecipeBuild. [15:24] jkakar: __storm_table__ = 'SourcePackageRecipeBuild' [15:25] abentley: Is that right? [15:25] jkakar: Yes. [15:25] abentley: Can you do a JOIN in an UPDATE statement? It looks like you're building a bad query, not that Storm is generating a bad one. [15:27] jkakar: I'm not an expert on SQL syntax. It's possible that I'm asking Storm to do the impossible, but if I am, I expect Storm to tell me. [15:29] abentley: Storm won't tell you if you're trying to do the impossible. [15:29] abentley: It's reasonable to expect it, but Storm is just a "thin" *cough* layer with an expression compiler that generates SQL exactly as you specify [15:30] abentley: The database is telling you that you're trying to do the impossible. Which, given that different backends have different definitions of "impossible", is probably right anyway. [15:30] jkakar: This doesn't seem like it *should* be impossible. Can't one do a subselect or something? [15:31] abentley: Sure. The best thing to do is first, figure out what query you want to run. The second step is to figure out how to make Storm generate it. 
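(Context for the ResultSet discussion that follows: a plain SQL UPDATE targets a single table, so a result set built over a join cannot be compiled into one, which is what PostgreSQL is complaining about. The query abentley actually wants is single-table, and Storm's bulk update on result sets (ResultSet.set) handles that fine. A self-contained sketch with made-up Recipe/RecipeBuild classes and an in-memory SQLite database, not Launchpad's real model code.)

    from storm.locals import Int, Reference, Store, create_database

    class Recipe(object):
        __storm_table__ = "recipe"
        id = Int(primary=True)

    class RecipeBuild(object):
        __storm_table__ = "recipe_build"
        id = Int(primary=True)
        recipe_id = Int()
        recipe = Reference(recipe_id, Recipe.id)

    store = Store(create_database("sqlite:"))
    store.execute("CREATE TABLE recipe (id INTEGER PRIMARY KEY)")
    store.execute("CREATE TABLE recipe_build (id INTEGER PRIMARY KEY, recipe_id INTEGER)")
    store.execute("INSERT INTO recipe VALUES (5)")
    store.execute("INSERT INTO recipe_build VALUES (1, 5)")
    store.execute("INSERT INTO recipe_build VALUES (2, 5)")

    # Single-table result set; Storm can express this as
    #   UPDATE recipe_build SET recipe_id = NULL WHERE recipe_id = 5
    store.find(RecipeBuild, RecipeBuild.recipe_id == 5).set(recipe_id=None)
    print(store.get(RecipeBuild, 1).recipe_id)    # None

    # A result set that also constrains on Recipe columns (a join) has no such
    # single-table UPDATE, which is why set() on the joined getBuilds() result
    # blew up with the error pasted in the log.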
[15:31] abentley: If you can write out the query you want I can help you figure out the second part. [15:32] jkakar: If that is actually how it is done, why not write the query directly? [15:32] abentley: A few reasons: [15:32] - Storm will expand a class name into a series of column names in a query, such as in a SELECT. [15:33] - When you use Storm you get a result set that gives you powerful capabilities, like union, max, count, etc. [15:33] - When your class changes, because you added or removed a column, you don't have to change your queries unless they involve one of the modified attributes. [15:34] - Most of the time you already know what query you want, so it isn't hard to get from what you want to a store.find() call with Storm expressions. [15:34] abentley: This sounds like a case where the problem is not knowing what query you want to run. With Storm you're always expected to know what query you want to run. [15:34] jkakar: What makes you say that? [15:34] abentley: It was designed explicitly not to hide SQL from you, but in fact, to make it possible to generate the exact query you want. [15:34] abentley: Because the query you're generating doesn't work (according to the database)? [15:35] abentley: Sorry, that probably sounded offensive, but I mean no offense. [15:35] jkakar: I didn't set out to generate a query. I set out to use an existing function that returns a collection that provides functionality that should do what I want. [15:36] abentley: Okay. [15:37] The query I actually want is, "Find all SourcePackageRecipeBuilds where recipe = X and set recipe to NULL", which I can work out in SQL if you like. [15:37] abentley: That's the next step, yes, working it out in SQL so you know what you need Storm to generate for you. [15:40] jkakar: UPDATE SourcePackageRecipeBuild SET recipe = NULL WHERE recipe = 5 [15:40] jkakar: 5 actually being a variable. [15:41] css question [15:41] if I want to have a heading that has an image followed by some text aligned to the middle of that img, how do I do that? [15:41] abentley: store.find(SourcePackageRecipeBuild, SourcePackageRecipeBuild.recipe == $value).update(recipe=None) [15:43] jkakar: I don't want to have two definitions of how you get the builds associated with a recipe, so how do I update getBuilds to return a ResultSet that works? [15:44] abentley: Let me read some PostgreSQL documentation for a sec... [15:47] abentley: Hmm, it looks like you could include multiple tables in an UPDATE... at least on PostgreSQL. [15:52] abentley: I'm not sure exactly what you need... but I think you would benefit by writing a specialized query for the case you have. [15:53] abentley: For two reasons, (1) it's a simpler query than the one from getBuilds and will probably run faster and (2) you'll run one less query than you do now (by specifying pending=True and pending=False). [15:53] henninge: how big do the tarballs get that are produced by the TTBJs on the builders? [15:54] good question [15:54] bigjools: well, it's all text files so they should compress nicely. [15:54] jkakar: I disagree that it's a benefit. I'd rather have clearer code than simpler queries, and I think two queries is acceptable, and if I cared, I could update getBuilds so that I could get all builds at once. [15:55] henninge: it's just that the code that jtv wrote reads them into memory ... 
[15:55] it obviously works but I'd rather not have a time bomb [15:55] bigjools: They should not become very big, most projects don't have many templates [15:56] and if they have many, they are each small [15:56] abentley: Okay. Updating getBuilds to optionally include the pending clauses then would do what you want... ie: use it in a way that doesn't include the pending clauses. [15:56] henninge: typically what sort of size? [15:56] I'd have to research that. danilos, do you have a figure off the top of your head? [15:56] henninge: the change I am making will mean we could potentially be reading as many of these as there are builders [15:56] in parallel [15:57] henninge, 17 [15:57] thanks danilos [15:57] danilos is ever helpful :) [15:57] henninge, uhm, let me read the backscroll then [15:58] bigjools, if the tarball only includes translations, they should be small (never more than say 50M for the biggest case, but probably around 1M for most) [15:58] jkakar: So it would only support set if pending was not supplied? [15:58] abentley: Yep. [15:58] jkakar: gross. [15:58] abentley: Unless we change the way UPDATE statements are generated. [15:59] abentley: So you were probably right in the beginning, there probably is a bug in Storm. [15:59] danilos, henninge: aieeeee, I just looked at addOrUpdateEntriesFromTarball [15:59] tarball_io = StringIO(content) [16:00] if I have the file on disk is there a different method that will work? [16:00] bigjools, well, by that time, they are already in memory :) where is "content" initialized? [16:00] bigjools: actually, that's my code ;) [16:00] danilos: either in the upload processor or from the builder [16:01] bigjools, we can as easily parse the file directly on-disk using the tarfile module, if I am not mistaken [16:01] sorry but arbitrarily sized files going into stringio scares me [16:01] I'm going to file a bug about this, it'll need fixes in a few places [16:01] bigjools, uhm, what I am trying to say is that StringIO is a shallow wrapper, entire file is already in the memory [16:01] danilos: yes, it should not be :) [16:02] I get it [16:02] just some figures [16:02] bigjools, agreed, perhaps we need to save it to a tmp file before we process it [16:02] all of gimps templates are 736k [16:02] all of gtk+ templates (2) are 264k [16:02] danilos: well I can make a tmp file available in the buildd-manager and the upload processor before it calls that method [16:03] it currently has to read the file into memory before passing it [16:03] if the template generation goes a bit wonky then it can easily take out the buildd-manager [16:03] which Is Bad (TM) [16:04] bigjools, then it'd be a very simple fix on "our side" [16:04] excellent, I'll file the bug and put some pointers to soyuz/buildmaster code in it [16:04] cheers [16:04] bigjools, don't do it before you make the tmp file available :P [16:05] bigjools, also, note that we are using the same thing for actual Ubuntu package builds, so we'd want to fix that as well [16:05] danilos: yes, that's what I was referring to above about the upload processore [16:06] bigjools, ok consigliere ;) [16:06] heh [16:06] was about to make a joke about an offer you can't refuse [16:06] heh [16:07] jkakar: Here's a version that seems to work: http://pastebin.ubuntu.com/530766/ [16:07] abentley: Yeah, not surprisingly. I wonder how that query performs compared to the other one, though? [16:08] jkakar: For the cases where both work, I bet they both perform the same. That's got to be trivial to optimize. 
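(On the tarball-in-memory concern above: tarfile can read straight from a file on disk, so the buildd-manager and upload processor only need to hand over a path, or spool the upload to a temporary file, instead of slurping the whole archive into a StringIO. A rough sketch under those assumptions; the function names and the .pot filter are made up and do not reflect the real addOrUpdateEntriesFromTarball interface.)

    import os
    import shutil
    import tarfile
    import tempfile

    def template_names_from_path(tarball_path):
        # Reads the tarball directly from disk; only tar metadata and the member
        # currently being inspected need to live in memory.
        tar = tarfile.open(name=tarball_path, mode="r:*")
        try:
            return [m.name for m in tar.getmembers() if m.name.endswith(".pot")]
        finally:
            tar.close()

    def handle_upload(fileobj):
        # Instead of fileobj.read() -> StringIO(content), spool to a temp file
        # in chunks and pass the path along.
        fd, path = tempfile.mkstemp(suffix=".tar.gz")
        try:
            with os.fdopen(fd, "wb") as out:
                shutil.copyfileobj(fileobj, out)
            return template_names_from_path(path)
        finally:
            os.unlink(path)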
[16:10] abentley: Probably, yes. Though, in practice, I've occasionally seen dramatically different performance when a query uses a subselect vs. when it doesn't. [16:10] It's hard to understand when that will be the case or why, though. [16:16] gary_poster: staging is timing out, too, now. It has been updated from 9955 to 9965 [16:17] henninge: well, that seem to point a pretty stong finger at code then, which simplifies things in some ways. how is the revert going on qastaging? [16:17] *strong [16:17] gary_poster: it's taking it's time [16:17] :-) [16:17] its [16:17] ok [16:18] * gary_poster carefully replaces the second "it's" but leaves the first intact ;-) [16:19] thanks for being careful ;) [16:21] jkakar: filed as https://bugs.edge.launchpad.net/storm/+bug/674582 [16:21] <_mup_> Bug #674582: Storm may generate SQL errors on ResultSets.set for otherwise-working ResultSets. [16:22] abentley: Thanks! === al-maisan is now known as almaisan-away [16:27] henninge, gary_poster: I agree that staging is now as useless as qastaging, but I do not see what has changed to make SPR/SPPH queries slower. [16:29] sinzui, I am no longer actively investigating, because henninge's summary that revisions 11888 -> 11899 -> 11914 are a likely cause sounded like a good hypothesis. We are waiting to see if reverting these clears up qastaging. [16:33] 11914 is not on staging, so I discount that [16:33] yes, but it's part of the logical set [16:44] Yippie, build fixed! [16:44] Project devel build (221): FIXED in 4 hr 3 min: https://hudson.wedontsleep.org/job/devel/221/ [16:53] gary_poster, henninge, I do not understand the "set" point. I do not see that revision on staging 11914. I suspect that 11914 fixes the issue. I think the origin of the issue is 11899 [16:54] sinzui: 11914 does not fix it, it had already been on qastaging and did not help [16:55] sinzui: we are currently reverting 914 and 899 [16:59] gary_poster, sinzui: do you know if the revision display at the bottom of the LP page is dynamic or static? [16:59] i.e. Does it need a "make build" to be updated or is the information straight from the branch? [17:00] people.c.c has an old launchpadlib :( [17:03] henninge, it requires make build [17:03] so Chex just told me [17:06] jml: I added a test to directly use downloadPage against a real slave in a test and it gets a "405 Method not allowed". Do you know if Twisted has the equivalent of urllib2.debug = True ? [17:07] EdwinGrubbs, I am looking at distroseries.getCurrentSourceReleases() I think the subquery for max(spph.id) is doing a full table scan of SPNs because there is no constraint to return only the SPNs passed to the method [17:08] bigjools: I don't know what urllib2.debug=True is. [17:08] bigjools: and I don't know of any debugging foo off the top of my head [17:08] jml: it dumps the http comms to stdout - I'm trying to work out what methods it's using that's not allowed [17:08] EdwinGrubbs, I suspect that moving 'SourcePackageRelease.sourcepackagename IN %s" into the subquery will make the query faster [17:11] sinzui: that shouldn't be necessary since it looks like "spr.sourcepackagename = SourcePackageRelease.sourcepackagename" makes it search for all the spph/spr records for a single sourcepackagename. [17:12] bigjools: nothing obviously like that in t.web [17:12] jml: yeah, I looked too [17:12] bigjools: wireshark maybe? [17:12] tcpdump ... :) [17:17] bigjools: ooh, did you know about from launchpadlib.uris import DOGFOOD_SERVICE_ROOT? 
[17:17] yes [17:17] I think I put it there and shamefully forgot [17:17] EdwinGrubbs, that assumes that the query planner built that set first [17:20] sinzui: I've never seen an instance where the query planner thought that it would be faster to run a correlated subquery first and then limit the results of the outer query. [17:20] why is it that bazaar.launchpad.net is so hard for dns servers to resolve? [17:20] s/limit/filter/ [17:22] EdwinGrubbs, since we are looking at a PG 8.4 change + the removal of the SPN table from the query. I think I should get sometimes based on where that constraint is placed [17:23] mrevell: still around [17:23] ? [17:24] Hi jml, sure am [17:24] mrevell: I don't know where best to put this link on the beautifully presented https://dev.launchpad.net/BugJam – http://mumak.net/lp-bugjam-2010/ [17:24] mrevell: it's a count of the number of bugs fixed during the bug jam so far [17:24] sinzui: how many sourcepackagenames are passed in as an argument to getCurrentSourceReleases() [17:24] jml, I love it :) [17:25] jml, I'll put a link under "Tracking progress" [17:25] mrevell: thanks. [17:26] EdwinGrubbs, 1, but get get 38536 where we would expect 1 from natty, maybe 3 for maverick [17:29] Edwin 1, my move of the constraint does not fix the issue, 2 I feel pretty good that getting what looks like getting a match for every SPN in natty implies an open join [17:30] sinzui: can I see the query plan? [17:31] I will get it for you [17:33] jml: well, that flushed out a nice bug in the tests we wrote a few weeks ago :) [17:34] bigjools: which was? [17:34] jml: it was constructing a url of the form /rpc/rpc [17:34] bigjools: heh [17:34] jml, abentley: Woah: http://paste.ubuntu.com/530794/ [17:35] jkakar: yeah, it's filed as a critical bug. [17:36] jkakar: https://bugs.launchpad.net/launchpad-code/+bug/674305 [17:36] <_mup_> Bug #674305: bzr push occasionally reports AssertionError on terminal [17:37] EdwinGrubbs, this is the plan to get the current release of bzr, 1 SPN provided and only 1 expected: http://pastebin.ubuntu.com/530797/ [17:39] jml: Cool. [17:40] jml: Dunno if it helps debugging, but this was with a bound branch, it wasn't a push (explicitly). [17:40] jkakar: I'm not at all involved in fixing it [17:40] <- part of the problem [17:40] Heh [17:40] EdwinGrubbs, sourcepackagename is still listed in the FROM. It was removed several revisions ago [17:41] removing it from the query fixes everything [17:41] * sinzui looks at code again [17:41] sinzui: yeah, I was wondering where that table came from. [17:42] sinzui: so, is the code not broken? Was it just an old oops? [17:43] EdwinGrubbs, It was removed a few days ago, lifeless removed it from clauseTables in r11914, but I suspect something else is putting the table in the from clause [17:44] Edwin to be clear, the SPN joins were removed a few days ago, Lifeless then landed another branch to remove it fix clauseTables. But this oops shows that the SPN table is still in the from clause [17:46] EdwinGrubbs, ^ [17:49] EdwinGrubbs, sorry. I am looking at too may oopses. That oops was for an older revision [17:50] * sinzui tries query from r11915 [17:51] EdwinGrubbs, This is the correct plan for qastaging: http://pastebin.ubuntu.com/530801/ [17:58] sinzui, Thanks for your post wrt strategies for the bug jam. [17:59] mrevell, your welcome [17:59] Have a wonderful weekend people. See you Monday. [18:01] sinzui: ok, the problem is that there are 1138 spr records for a single sourcepackagename. [18:03] edwin I agree. 
I am looking for a constraint or a revised subquery that removes the loop or 1138 [18:04] moin [18:05] sinzui: rev 11914 [18:05] EdwinGrubbs: ^ [18:06] lifeless: yes, but 11915 still times out [18:06] lifeless we want to reduce the loop of SPRs in the query [18:07] good bye, have a nice weekend [18:07] sinzui: since, there is only one valid spph record for all the spr records, you would get good performance by just eliminating the subquery and moving the conditions into the outer query. You will just have to eliminate the duplicates. DISTINCT won't let you choose the spph record with the max id, so you would have to do that in python, if it is important to get that spph record and not a random one. [18:08] * sinzui nods [18:08] sinzui: works for me [18:09] https://qastaging.launchpad.net/~yavdr/+archive/stable-vdr/+packages?start=0&batch=204 [18:09] At least 782 queries/external actions issued in 17.77 seconds [18:09] little slower than ideal [18:09] trying again to remove cold cache effects [18:09] * sinzui just went from 9695.618 to 33.446 ms using a subquery table of just current ids [18:10] At least 782 queries/external actions issued in 12.63 seconds [18:10] sinzui: ^ https://qastaging.launchpad.net/~yavdr/+archive/stable-vdr/+packages?start=0&batch=204 [18:10] sinzui: I welcome further improvements here [18:10] EdwinGrubbs: bringing too much back and filtering in python will almost always be slower [18:10] lifeless yes, we want to see a source package page load a single spr. [18:11] storm is (relatively) slow at deserialisation, due to the cache coherency logic [18:11] https://qastaging.launchpad.net/ubuntu/natty/+source/bzr [18:11] At least 49 queries/external actions issued in 1.91 seconds [18:11] view-source:https://qastaging.launchpad.net/~yavdr/+archive/stable-vdr/+packages?start=0&batch=1 [18:12] interestingly that page is not flat yet, the binaries must be the cause because there is a test that its flat with sources...and the binary test seemed surprisingly low to me [18:13] lifeless: it won't bring too much back since there is only one spph record for 1300 spr records that meets the condition. So, the filtering in python might only have to deal with eliminating a handful of records. [18:13] EdwinGrubbs, I essentially did the reverse, of your suggestion. I converted the subquery to get the max id to be a table of only viable candidates: http://pastebin.ubuntu.com/530811/ [18:14] * sinzui now tries to do it the EdwinGrubbs approved way [18:15] ah right, deep history leading to a slow query [18:15] EdwinGrubbs: when we query for 200 rows [18:15] EdwinGrubbs: what would happen then [18:16] EdwinGrubbs: e.g. for http://pastebin.com/7jC2vD7G [18:20] lifeless: yes, I would like to do it in the database, but I don't know if getting just the max(spph.id) for each sourcepackagename is important or not. To do that in the database would require using a temp table in order to get rid of the subquery. [18:20] EdwinGrubbs, lifeless. I think this is the solution we want to achieve in the code http://pastebin.ubuntu.com/530817/ [18:21] sinzui: that only works for a single sourcepackagename. [18:22] sinzui: oh wait [18:22] Edwin why? I see the table controls the SPNs [18:22] me tries a list [18:23] Edwin it does work with multiple SPNs [18:24] sinzui: ok, that makes sense. I was thinking that you would run into problems with group by, but you are just grouping by the spr columns, so it all works out. 
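(For reference, the shape of the rewrite sinzui and EdwinGrubbs converge on above: rather than a correlated max(spph.id) subquery evaluated per row, build one small grouped subquery that yields only the current publishing ids for the requested package names and series, and have the outer query join against just those. A self-contained sqlite3 toy with heavily simplified stand-in tables; it shows the query shape only, not Launchpad's schema or the exact SQL that landed.)

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE spn (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("CREATE TABLE spr (id INTEGER PRIMARY KEY, spn INTEGER)")
    cur.execute("CREATE TABLE spph (id INTEGER PRIMARY KEY, spr INTEGER,"
                " series INTEGER, status INTEGER)")
    cur.execute("INSERT INTO spn VALUES (1, 'bzr')")
    # Lots of releases for one name (like the 1138 bzr SPRs), each published once:
    cur.executemany("INSERT INTO spr VALUES (?, 1)", [(i,) for i in range(1, 1139)])
    cur.executemany("INSERT INTO spph VALUES (?, ?, 1, 2)",
                    [(i, i) for i in range(1, 1139)])

    # The fast shape: a grouped subquery restricted to the wanted names/series
    # produces just the current publishing id per name; the outer query then
    # touches only those rows.
    cur.execute("""
        SELECT spr.id, spr.spn
          FROM spr JOIN spph ON spph.spr = spr.id
         WHERE spph.id IN (
                   SELECT max(spph.id)
                     FROM spph JOIN spr ON spph.spr = spr.id
                    WHERE spph.series = 1 AND spph.status = 2
                      AND spr.spn IN (1)
                 GROUP BY spr.spn)
    """)
    print(cur.fetchall())    # [(1138, 1)]: one current release for 'bzr'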
[18:24] well, it certainly did not work until I added that [18:25] EdwinGrubbs, I do not need the outer "SourcePackageRelease.sourcepackagename IN ()" do I? [18:25] sinzui: no [18:26] This is wicked fast [18:26] I am going to start a branch and watch the tests pass [18:27] gary_poster, henninge: I have a very fast query that fixes distroseries.getCurrentSourceReleases() === almaisan-away is now known as al-maisan [18:42] (37 rows) [18:42] Time: 186.584 ms [18:42] sinzui: that's for the big page [18:42] using your branch [18:43] sinzui: love your work [18:43] wow. I feel good [18:43] this is going to knock +packages right back to zilch on the timeouts chart I think [18:43] This will have to be stormified. I know how to write this in storm, but not sqlobject [18:44] hmm? [18:44] I mean that's great [18:44] but I can help you do it in situ if you want [18:44] sinzui, lifeless: Is what you are doing related to the qastaging timeouts? [18:44] yes [18:45] henninge: do you mean on +packages? [18:45] this looks like it will also fix many other timeouts in production too [18:45] no, the general timeouts we get on all kinds of pages. [18:45] henninge: no [18:45] :( [18:45] henninge: we get timeouts because of a few reasons [18:45] a) cold cache effects in the db - it's much smaller in memory than production [18:46] b) we have inefficient code and staging hardware shows this up [18:46] this is a case in point - sinzui is shaving many seconds off of a routine page [18:46] c) contention/thrashing in the appserver due to all the scripts running on the appserver staging host asuka [18:46] there is an rt open to address (c) [18:47] (a) - retry a few times, if it eventually works prod will probably chew it up happily [18:47] (b) - we need to fix our code. Which will help with (a) too [18:48] but it seems to be related to certain revisions of the code [18:48] it started on qastaging and when staging got updated with the same revisions it showed the same timeouts whereas before it (staging) was working fine. [18:48] henninge: what pages specifically [18:49] all project homepages [18:49] launchpad.net/anyproject [18:49] all source packages [18:49] from 11888 to 11914 we had a very broken query for getCurrentSourceReleases [18:49] launchpad.net/ubuntu/maverick/+source/anypackage [18:49] all the pages you're listing are covered by it; it should be tolerable now - the same as before 11888 [18:49] 11914 did not fix it, though [18:50] what EdwinGrubbs and sinzui are doing is about to make it much better [18:50] henninge: this is one reason those pages are all slow on lpnet too [18:51] lifeless, henninge: the method is used in soyuz, translations, registry, and bugs pages. Anything that wants to know the current release of a package is going to be between 50 and 100 times faster [18:52] sinzui: yah [18:52] sinzui: note that production db is much faster [18:52] sinzui: so not all pages will zoom as much [18:52] https://launchpad.net/ubuntu/natty/+source/bzr [18:52] Does this error message in rabbitmq mean anything to anyone? It's preventing me from installing launchpad-developer-dependencies. http://pastebin.ubuntu.com/530828/ [18:53] but there are many pages which do this query that will benefit a great deal [18:53] rockstar: is it already running? [18:53] lifeless, no, it won't start. [18:54] lifeless: I don't find those pages particularly slow but maybe I am just so accustomed to LP slowness ... [18:54] ;-) [18:54] lifeless, when the package gets installed, it explodes and prevents anything else from being installed.
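A minimal sketch of the rewrite sinzui describes (the pastebins above are the authoritative versions; as before, the column names, status values, and literal ids are assumptions): the per-row subquery becomes one derived table of current publications, grouped so that each sourcepackagename contributes its single newest publication, which is then joined back to its release. The derived table runs once instead of once per SPR row, and the outer "sourcepackagename IN ()" filter becomes redundant because the derived table already restricts the names.

    -- Illustrative rewrite: one pass over current publications,
    -- one row per sourcepackagename, then join back to the release.
    SELECT spr.*
    FROM SourcePackageRelease spr
    JOIN SourcePackagePublishingHistory spph
        ON spph.sourcepackagerelease = spr.id
    JOIN (
        SELECT inner_spr.sourcepackagename,
               MAX(inner_spph.id) AS newest_spph
        FROM SourcePackagePublishingHistory inner_spph
        JOIN SourcePackageRelease inner_spr
            ON inner_spph.sourcepackagerelease = inner_spr.id
        WHERE inner_spph.distroseries = 1234         -- assumed series id
            AND inner_spph.status IN (1, 2)          -- assumed Pending/Published values
            AND inner_spr.sourcepackagename IN (42)  -- requested name ids (assumed)
        GROUP BY inner_spr.sourcepackagename
    ) AS latest_pub ON latest_pub.newest_spph = spph.id;

Because the GROUP BY yields one MAX(id) per sourcepackagename, the same shape works for a list of names, which is the point Edwin and sinzui settle at the end of the exchange above.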
[18:54] rockstar: on maverick ? [18:54] lifeless, yes. [18:54] hmm [18:54] I don't know sorry [18:54] OK, let's wait and see the outcome of that work. [18:54] inet_tcp",{{badmatch,{error,duplicate_name [18:54] makes me think the socket is in use [18:54] lifeless, hm... [18:54] which would happen if you had a rabbit instance already running [18:55] e.g. if the devscript is buggy on upgrades [18:55] sinzui: you have my r-c approval for landing that if it gets too late. [18:55] lifeless, oh! Yeah, the other change on this laptop is the u1 setup, so I guess that makes sense. I completely spaced that. [18:56] henninge, thanks. [18:56] PQM is scheduled to close in 3 hours [18:56] lifeless, everything is happy now. [18:56] so you will need r-c ;) === henninge changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 3 of 10.11 | PQM is closing at 22 UTC | firefighting: Lots of timeouts on qastaging!! | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting [18:56] lifeless, thanks for intervening between my head and the wall. [18:58] rockstar: that was it ? [18:58] rockstar: if so, please file a bug ... buggy package ;) [18:59] lifeless, well, I should also say that I run launchpad in a chroot. [19:00] rockstar: that's fodder for the bug report [19:03] henninge, sinzui: could these slow pages be related to the changes to add latest releases to the source package pages? [19:03] henninge, sinzui: that was added in a recent revision [19:03] flacoste, I do not think so. The method was unchanged this year except for lifeless's changes this week [19:03] flacoste I think this is PG 8.4 [19:04] sinzui: henninge says that timeouts increased with a recent revision [19:04] as staging is now seeing the same timeouts as qastaging [19:04] whereas it wasn't until it was updated [19:04] and qastaging wasn't either yesterday [19:04] flacoste yes, we had an open join, but there were timeouts nonetheless [19:05] flacoste: we had two landings to fix the issue, neither was substantial [19:05] I fluffed one [19:05] removed an unneeded table *constraint*, left the table in by mistake. [19:05] that went boom badly ;) [19:06] pages affected are: [19:06] all project homepages [19:06] all sourcepackages [19:06] according to henninge again [19:06] flacoste: we did just discuss this [19:06] 20 minutes ago [19:07] right, i read the backlog [19:07] kk [19:07] but it's not clear that we have identified the issue [19:07] flacoste: we have an 8 second query [19:07] that will come down to 140ms [19:07] on qastaging [19:07] sure [19:07] we know they all use it [19:08] until it's fixed we have no data about what lies behind it [19:08] well there is an alternative, which is to do a binary search to find the revision introducing the slowness [19:09] flacoste: 11888 [19:10] lifeless: i was under the impression that we tried reverting 11888 and its two following fixes from qastaging, but that still resulted in all these pages timing out [19:10] but now, i'm not sure, it's possible that only the follow-up fixes were reverted...
=== shadeslayer is now known as evilshadeslayer [19:11] flacoste: 11888 is a confounding factor [19:11] henninge: could you confirm/inform the above?^^^ [19:11] flacoste: with 11888 present, any other flaws would have been magnified [19:11] right, but without it, we shouldn't see any more timeouts than before [19:11] flacoste: even if 11888 isn't the cause of all the issues, we can't be sure without running with 11888 reverted and the others present [19:12] flacoste: we're running 11887 live [19:12] i know [19:12] flacoste: so yes, I agree. [19:12] we've always seen more timeouts on qastaging [19:12] it has a 10 second timeout [19:12] but i thought we had made that test on qastaging (no 11888, others present) and found that it was still timing out all over the place [19:12] but maybe, that's not the test that took place [19:13] let me check the branch... [19:13] lp:~henninge/launchpad/stable-revert [19:13] ah, no [19:14] only 11899 and 11914 were reverted [19:14] so your hypothesis holds [19:15] ok [19:16] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777QS139 is qastaging.launchpad.net/bzr [19:46] sinzui: can you look at these screenshots of the involvement portlet for the bugsupervisor versus the admins? https://devpad.canonical.com/~egrubbs/configuration/ [19:48] Edwin It was easier to write an exception? [19:49] EdwinGrubbs, We wanted to hide the link once the tracker was configured [19:52] sinzui: well, the progress bar doesn't make sense with just one link being shown, so an exception seems like the cleaner solution. It would also be odd to have a single link hidden under the "Configuration options" expander. [19:53] okay, I agree. Your approach is correct [19:55] flacoste: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777QS139 [19:56] flacoste: I did not revert 11888 in that branch because it has been on qastaging for a while without any trouble. [19:56] qastaging.launchpad.net/bzr - it's the query sinzui is redoing [19:56] flacoste: I am sorry for the misunderstanding [19:56] lifeless: ack [19:57] I am fixing the distroseries/source package problem illustrated by qastaging.launchpad.net/ubuntu/natty/+source/bzr [19:57] * henninge has to go away for a bit again [20:07] sinzui: can you review https://code.edge.launchpad.net/~edwin-grubbs/launchpad/bug-664788-configure-bugtracker-link-permission/+merge/40754 [20:07] I will [20:13] thumper: perhaps we can cowboy in a squelch for the xmlrpc Fault === al-maisan is now known as almaisan-away === matsubara is now known as matsubara-afk [20:30] sinzui: I have a favour to ask [20:31] sinzui: add [rollback=11888] to your landing for the new query [20:31] yes lifeless [20:31] I will [20:31] sinzui: it will tell qatagger that https://bugs.launchpad.net/soyuz/+bug/662523 can be unblocked so the deploy report is accurate [20:31] <_mup_> Bug #662523: Archive:EntryResource:getBuildSummariesForSourceIds times out [20:31] sinzui: thank you! [21:57] lifeless: So that SPN fix worked? [21:59] wgrant: yes, and sinzui has an even more effective fix to make other uses of the query much more efficient [21:59] wgrant: https://qastaging.launchpad.net/~yavdr/+archive/stable-vdr/+packages?start=0&batch=204 [21:59] lifeless: How fast is sinzui's? [21:59] 8700->100ms [22:00] wgrant: on production this is already tolerably fast, db server size yada yada yada [22:00] Nice. [22:00] Yep. [22:00] wgrant: but I expect a positive improvement all over.
[22:01] wgrant: my mp has time summaries and SQL explains https://code.launchpad.net/~sinzui/launchpad/ds-getcurrentreleases/+merge/40756 [22:01] sinzui: I'm really glad you guys dug into this [22:02] Making the SP and DS pages faster really has required half a dozen engineers looking at the same number of objects === henninge changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 3 of 10.11 | PQM is in release-critical mode | firefighting: - | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting [22:08] sinzui: we've 22 or so, looking all across the board [22:09] I fear milestones will be the last to fix :( I have time to return to that one next week. [22:10] +commentedbugs is the current most severe timeout [22:10] and stub has a fix \o/ [22:10] I won't have time to do anything with it till week after next [22:16] Let me guess... it's querying badly to try to find comments with index != 0? [22:17] read the bug :) [22:32] http://www.jacobian.org/writing/buildbot/ci-is-hard/ <-- lmao. "Django's big. The test suite is around 40,000 lines of code in something like 3,000 individual tests. We work constantly to speed up the test suite, but best case it still takes about 5 minutes to run. This means that our CI absolutely needs to be distributed — a single test server won't cut it." [22:51] flacoste: ping [22:52] hi lifeless [22:52] flacoste: I think we need to treat this bzr thing as an emergency [22:52] flacoste: it's very frequent [22:52] lifeless, which? [22:52] poolie: the backtrace on push [22:52] lifeless: my understanding is that it's only annoying, not a real error [22:52] flacoste: our users don't know this [22:52] flacoste: perception [22:53] this is the zope error being shown to the user? [22:53] is there a bug? [22:53] bug number [22:53] poolie: flacoste: https://bugs.launchpad.net/launchpad-code/+bug/674305 [22:53] <_mup_> Bug #674305: bzr push occasionally reports AssertionError on terminal [22:54] Also, doesn't it stop a scan from being requested? [22:55] lifeless: any idea of how we could fix this apart from escalating this RT? [22:55] wgrant: that would be new information [22:55] flacoste: here are the options I know about [22:55] wgrant: that's my understanding too [22:55] flacoste: a) escalate the RT [22:56] If it doesn't mean that no scan is requested, then we have bigger problems. [22:56] b) wedge in some retry code here - high risk [22:56] wgrant: there are multiple routes to trigger scans [22:56] wgrant: it's possible a redundant route is saving us [22:56] wgrant: e.g. the disconnect hook [22:56] Possibly. [22:57] c) push the mailman improvement and hope it's enough [22:57] d) disable other services like codeimport that use the same service [23:02] c and d look like the main options at this time [23:02] can we get confirmation that scan isn't triggered? [23:06] i don't get any errors from here fwiw [23:07] it seems to be a couple of users an hour - which, because it's not (apparently) localised to product/bug like other timeouts, is particularly confusing and harmful to our users.
[23:08] there's no obvious rationale they can connect it to [23:09] wgrant: we're talking with ops now
[23:33] Time Out Counts by Page ID
[23:33] Hard  Soft  Page ID
[23:33]  570  7384  CodeImportSchedulerApplication:CodeImportSchedulerAPI
[23:33]  211    32  Person:+commentedbugs
[23:33]  164   561  CodehostingApplication:CodehostingAPI
[23:33]   44   156  BugTask:+index
[23:33]    8    10  ProjectGroup:+milestones
[23:33]    6   305  Distribution:+bugtarget-portlet-bugfilters-stats
[23:33]    5   259  Distribution:+bugs
[23:33]    5    14  Person:+bugs
[23:33]    5     7  DistroSeries:+queue
[23:33]    5     4  Archive:EntryResource:getBuildSummariesForSourceIds
[23:33] bah sorry for the formatting [23:33] flacoste: ^ turning off code imports is probably the fastest thing we can do [23:33] mbarnett: how hard is it to disable all code imports ? [23:34] mbarnett: we should be able to see an immediate drop in that netstat over a couple of minutes if that were to help [23:36] flacoste: 570 7384 CodeImportSchedulerApplication:CodeImportSchedulerAPI [23:36] flacoste: 164 561 CodehostingApplication:CodehostingAPI [23:36] lifeless: good suggestion [23:37] mbarnett: please turn off the importds [23:37] mbarnett: And keep watching that netstat [23:41] flacoste: https://devpad.canonical.com/~lpqateam/lpnet-oops.html#time-outs is where I'm looking [23:43] no more imports should be fired off after any currently running complete. [23:43] flacoste: look at this: [23:43] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777XMLP1011 [23:43] SQL time: 17 ms [23:43] Non-sql time: 15074 ms [23:43] Ow. [23:44] flacoste: this is why I want a) single threaded appservers and b) in the main cluster [23:46] right [23:46] the GIL hypothesis [23:46] yes [23:46] for instance [23:46] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777XMLP4675 [23:46] but i have another hypothesis [23:47] if we start the timer too early [23:47] a deep queue could look like this as well [23:47] see the 4675 in particular, it's a soft timeout