[00:20] err! [00:20] why is branchChanged hitting AssertionErrors? [00:21] And no visible OOPS ID in the traceback sent to my 'bzr push' either... [00:21] yeah [00:22] On the other hand, LP did seem to successfully notice that my branch changed. [00:23] thumper: hello :-) [00:24] well [00:24] the assertionerror is because the transaction is doomed [00:25] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777XMLP119 [00:26] ah no, being doomed [00:27] its in a timeout block [00:28] wth, there's a gap of 15s between recorded queries [00:29] * wgrant stabs qastaging. [00:34] wgrant: What did qastaging ever do to you? [00:35] spiv: https://bugs.launchpad.net/launchpad-code/+bug/674305 <- feel free to hit the affects me too thing :-) [00:35] <_mup_> Bug #674305: bzr push occasionally reports AssertionError on terminal [00:35] StevenK: Timed out lots. [00:35] Although it may just be that those pages are broken now. [00:35] (Archive:+index, +packages, +delete-packages, that sort of thing) [00:39] Hmm. [00:39] It'd be nice if daily builds didn't all hit and DoS the build farm at the same time. [00:41] mwhudson: done, thanks! [00:45] wgrant: https://bugs.launchpad.net/soyuz/+bug/672371 [00:45] <_mup_> Bug #672371: Archive:+packages timeouts [00:46] mwhudson: hey [00:46] mwhudson: whazzup? [00:47] thumper: that bug [00:48] mwhudson: I think [00:48] thumper: https://bugs.launchpad.net/launchpad-code/+bug/674305 [00:48] <_mup_> Bug #674305: bzr push occasionally reports AssertionError on terminal [00:48] mwhudson: I think that may be the xmlrpc fuckage [00:48] mwhudson: not sure why there are massive gaps [00:48] thumper: the xmlrpc fuckage? [00:48] the same as for getJobForMachine? [00:48] mwhudson: all the timeouts on the xmlrpc server [00:48] mwhudson: exactly [00:48] hm, ok [00:49] I've not been able to find out why we have 8s gaps [00:49] with no obvious reason [00:49] :/ [00:49] I spent almost a week chasing it [00:50] and I've nothing to show for it [00:50] lifeless: Yeah, but isn't that in theory fixed? [00:50] wgrant: see my last comment [00:50] Oh. [00:50] iz single slow query [00:50] well [00:50] there are other slow queries [00:50] but thats the smoking gun [00:51] does that also take forever on a real DB? [00:51] lifeless: ah... no [00:51] it isn't a slow query [00:52] it is the 15s gap between query execution and the next one that bothers me [00:53] mwhudson: I'd love some help chasing that down as I've exhausted my understanding on that problem [00:53] wgrant: don't know [00:53] thumper: huh, what are you talking about? 
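(Background on the "transaction is doomed" diagnosis above: once a request's transaction has been doomed, for instance by timeout handling, it can no longer be committed, and code that keeps using it fails. A minimal sketch of that behaviour using the Zope `transaction` package on its own; this only illustrates the concept, not Launchpad's actual code path, and the AssertionError that bzr push surfaces comes from Launchpad's own glue rather than from the snippet below.)

    import transaction

    txn = transaction.get()
    txn.doom()                  # e.g. what timeout handling might do
    print(txn.isDoomed())       # True

    try:
        transaction.commit()    # a doomed transaction refuses to commit
    except Exception as exc:    # transaction.interfaces.DoomedTransaction in practice
        print("commit refused: %r" % (exc,))

    transaction.abort()         # aborting is the only way to move on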
[00:53] thumper: I'm talking about https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1776QS51 [00:53] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777XMLP119 [00:53] query 33 [00:54] lifeless: ^^ [00:54] thumper: I'll have a look [00:55] thumper: that looks like thread starvation to me [00:55] lifeless: but it is only a guess [00:55] lifeless: and why is it starved [00:55] we don't know [00:55] we are just guessing [00:56] thumper: the losas have the xml server split out as the highest ticket [00:56] thumper: when thats done we'll have more resources for xmlrpc [00:56] thumper: and after that the single threaded experiment will kick in [00:57] thumper: if you want to work on this today, I suggest implementing the per thread stats [00:57] lifeless: no, I'm in the middle of something else [01:01] https://bugs.edge.launchpad.net/launchpad-foundations/+bug/243554 for reference [01:01] <_mup_> Bug #243554: oops report should record information about the running environment [01:02] wgrant: I have two problems answering for 'on a real db' [01:02] wgrant: firstly, we don't have the substituted ids to reproduce [01:02] wgrant: secondly don't have access and we're short staffed losa-wise. [01:03] wgrant: where are you up to exam wise? [01:05] lifeless: On the first day of a 12 day break. [01:05] So not doing much. [01:06] wgrant: Are you interested in tackling this perf issue? I have a trip on sunday for the cassandra training [01:06] lifeless: We should have a stub soon, shouldn't we? [01:06] and shoppinh/prep to do today [01:06] wgrant: in a few hours yes [01:07] wgrant: I'm strictly on leave, but I'm pretty bad at unwinding for < several-week periods. [01:07] Heh. [01:07] right now though, I have to do a shop-run. bbs. [01:07] So, 11888 made it bad, and the fix in iforgetwhat didn't help? [01:21] it helped [01:21] but not enough [01:21] we have two options [01:22] fix the query - its taking 200ms per SPPH at the moment. [01:22] rollback both 11888 and 11903(?) [01:22] note that rolling back leaves the page at 10 seconds and the ajax status updating timing out. [02:15] thumper: hello, mcfly [02:15] wallyworld: whazzup? [02:16] i can't get branch lp:~wallyworld/launchpad/invalid-branch-link-message to merge properly [02:16] it's not in the codebase either locally or on loggerhead and any merge attempts via pqm or lp-land claim there is nothing to do [02:17] wallyworld: this is the revision that was backed out wasn't it? [02:17] yes [02:17] but i fixed it [02:17] ie backed out the bad yui stuff [02:17] right [02:17] it's gone past ec2 again no probs [02:18] did you reverse the reversed merge? [02:18] no. not sure what to do [02:18] right [02:18] what you need to do is to merge devel into you branch [02:18] then do a reverse merge of the revision that backed out your change [02:19] the guts of the problem is that most of your branch has been merged [02:19] and the files were then reverted [02:19] so you need to revert the revert [02:19] ok. noob alert. how do i do a reverse merge? [02:19] do you know the devel revision that reverted your merge? [02:19] wallyworld: it is a cherry pick merge [02:20] i did just after it happened :-) [02:20] wallyworld: merge -r NEW..OLD (rather than merge -r OLD..NEW) [02:20] wallyworld: I [02:20] i can see if i can find it [02:20] wallyworld: I'll leave you in spiv's capable hands [02:20] wallyworld: "bzr help revert" has an example: [02:20] wallyworld: to test the merge locally [02:21] “For example, "merge . 
--revision -2..-3" will remove the [02:21] changes introduced by -2, without affecting the changes introduced by -1.” [02:21] wallyworld: get an up to date devel, and go bzr merge --preview ../my-branch [02:21] wallyworld: that way you can see what pqm will be attempting to merge into devel [02:21] wallyworld: in the way of changes [02:22] ok. i'll have a wee looksy. thanks. i'll grab a quick bite first. suddenly i'm hungry [02:28] * thumper finally has the recipe index builds looking nice [02:28] now for the tests... [02:34] F**K ME - 150 / 1593 CodeImportSchedulerApplication:CodeImportSchedulerAPI [02:34] hard / soft timeouts [02:34] 36 / 131 CodehostingApplication:CodehostingAPI [02:34] mwhudson: ^^^ that'll be contributing to the push issues [02:35] thumper: yep [02:35] also :( [02:43] * thumper has push failures like mwhudson had [02:58] wgrant: so ;) [03:01] lifeless: Hi. Just reinstalled and trying to get Launchpad running. [03:01] meep! [03:01] poolie: ping [03:01] Desktp + Soyuz on amd64 with lp-buildd in a VM does not fit well in 4GiB. :/ [03:01] hi there wallyworld [03:01] hi wgrant, lifeless [03:02] Afternoon poolie. [03:02] hi poolie [03:03] hey, with the bzr 2.2.2 upgrade, we talked about doing it today from tip to avoid 2 lots of downtime. but i don't really think we should package trunk prior to official release. what downtime is involved? when i did the 2.2.1 upgrade, was there any downtime there? [03:03] so two things: [03:04] firstly, i wasn't really saying "you should package tip", just "it's safe to jump to tip if you want to" [03:04] we shouldn't normally need to [03:04] There is a few seconds of downtime for codehosting upgrades. [03:05] and if there's a bug there for which you need an urgent deployment, it could be better to just do a release immediately [03:05] secondly i don't think it's really relevant to downtime [03:05] wgrant: although if you are a user 90% of the way through an hour long push the cost to you will be more than a few seconds... [03:05] i probably said "to avoid lag between us landing a fix and you running it" [03:05] hm iwbni it didn't interrupt running connections [03:06] Hmm, true. [03:06] poolie: hmm, and in this case hypothetically it wouldn't need to; we don't need to restart the ssh server, just provide a new bzr so that new connections will get a fixed lp-serve... [03:07] so, me thinks it's better to wait for bzr 2.2.2 to be released next week deal schedule a small outage [03:07] if needed at all [03:07] We have a downtime window next week for the DB upgrade anyway. [03:07] right [03:07] otherwise we have to schedule downtime [03:07] unless its zomg time [03:08] we will once the relevant RT is done have no-downtime deploys to codehosting. [03:08] but its (I think) third in the queue. [03:08] and we're getting one item done every 2-3 weeks. [03:08] there's that cpu spin/wait issue that 2.2.2 fixes and a few people get hit by hit but not so many that we shouldn't wait till next week... [03:08] Tangentially, I see https://lpstats.canonical.com/graphs/CodehostingPerformance/ looks a bit alarming ? [03:09] it does [03:09] fortunately its friday and noone will care about it till Monday [03:09] [03:10] lifeless: you shouldn't care about it either. so much for you taking the day off. my wife would kill me if i worked too much on my "day off" [03:11] hm, is is that a repeating pattern over the last 24h? [03:11] spm, are you back at work? 
[03:12] poolie: I am, but seriously considering tking the rest off - having a horrible hayfever attack atm - has triggered a very nasty asthma response. :-/ [03:12] spm: :( [03:13] spm: taken claratyne? [03:13] indeed [03:13] spm: saline solutionas suggested can help a lot - gets the pollen out [03:13] aye [03:13] mmm, neti pots. [03:13] spiv: I ordered one wed [03:14] nasonex is great (prescription only) [03:14] Hmm. It'd be nice if we had tracebacks for each SQL statement. [03:15] poolie: yeah, mine runs out in a few days [03:15] I've been given a (different) thing - I haven't read up to see if its equivalent yet. [03:16] rofl [03:16] 'allonase' or something like that [03:16] 'I also suggest renaming "incomplete" to "need info", as it's much more [03:16] descriptive. "Incomplete" makes it sound like the bug is in progress of [03:16] being fixed, but not yet done.' [03:16] wgrant: https://bugs.launchpad.net/launchpad-foundations/+bug/606959 [03:16] <_mup_> Bug #606959: oops should record the short traceback that caused each query? [03:16] lifeless: heh [03:17] lifeless: what's nice about that idea is that although capturing tracebacks is a touch expensive, that shouldn't matter if you only do a reasonable number of queries ;) [03:17] spiv: http://ecoyogastore.co.nz/eco-yoga-gear/neti-pot [03:17] spiv: yeah [03:17] i saw, linked from the discussion of Go, google have a final bug status of "unfortunate" [03:17] that's nice [03:17] lol [03:18] "suckstobeyou" :) [03:18] I thought they added that specially for the naming bug. [03:18] lifeless: what web stores need for neti pots are photos more like http://www.flickr.com/photos/debrisdesign/502255811/ [03:18] But I may be wrong. [03:18] oh, maybe [03:18] it could be freeform for all i know [03:19] spiv: yeah, I hope it has a manual [03:19] but it's a bit more precise for some things than 'wontfix' [03:19] lifeless: the internet can provide a guide or twenty, I'm sure. [03:19] what we need is a closure-space [03:19] N dimensions and a slider. [03:20] like the colour-space pickers [03:20] poolie: That's what Opinion is for! [03:20] *cough* [03:20] wgrant: thats an opinion! [03:24] lifeless: :( [03:25] seriously [03:25] its still an experiment as far as I've heard [03:25] Ah. [03:29] OK, with Unity defeated, it is now time to look at that query. [03:29] heh [03:30] wallyworld: if you want to discuss https://bugs.launchpad.net/bugs/674329 further I'm happy to do so - I didn't mean to prevent discussion about whatever symptoms you ran into. [03:30] <_mup_> Bug #674329: DecoratedResultSet eagerly fetches all results [03:32] lifeless: hmmm. seems at first glance the whole concept of iterable results sets which load records in batches is not supported? [03:33] what is the query returns 10000000 records. and the user only wants to see 100 at a time? [03:33] wallyworld: thats what batch navigator is for [03:34] wallyworld: we do a count(*) [we should estimate instead, but thats orthogonal) and then use a slice (OFFSET X LIMIT Y in SQL) to only retrieve 100 at a time. [03:34] i realise that's what it is supposed to be for, but isn't the pirpose defauted if __iter__ loads the whole lot anyway [03:34] wallyworld: __iter__ is /not/ for 'do partial work' [03:34] wallyworld: (neither in general, nor in this specific case) [03:35] wallyworld: in this specific case its because the database server will do all the work requested, always. [03:35] so we have to ask for the right amount of work up front rather than do some, do some more, and then say that we're done. 
[03:36] wallyworld: if you consider the implications of ORDER BY/GROUP BY on the work required in the db, this should make a lot of sense [03:36] sorry for my dumbness, but isn;t the whole concept of yield to avoid eagerly realising the entire list? [03:36] uhm [03:37] so, iterators, generators and lazy evaluation [03:37] why does the server do all the work? other databases don't enforce this? [03:37] wallyworld: good question. Pg definitely does; others I won't speculate on. [03:38] sure, the database has to do some work to satisfy order by etc, but the step of extracting the data from the db into the result set needn't be done unless required [03:38] nevertheless [03:38] python-pgsql has a single large buffer with the results, no further network access occurs as we iterate the rows. [03:39] Or so I am assured by Smart People. [03:39] [specifically jamesh who dug into this in the past too] [03:39] ok then. [03:39] by python-pgsql, you mean psycopg2? [03:39] jamesh: blah - yes [03:41] lifeless: so to recap, if the result set has 10000000 rows, it's ok to do a list(rs) which effectively constructs an in memory data structure with all that data even if we only want to process 100 at a time? [03:41] wallyworld: if you stop reading the result set early, the only effort you're going to save is the conversion of the result buffer to Python objects on the client side. [03:41] or am i missing something? [03:41] wallyworld: You'll slice first. [03:41] yes, and for a large result set, that's significant and a potential performance issue [03:41] wallyworld: The slice affects the issued query. [03:42] if you know you will only need a subset of the rows, tell the database so that it can send you less info. [03:43] jamesh: i'm talking about say batch navigator which allows the user to scroll through the results 100 at a time. [03:43] we may want the whole lot eventually, but not all at once [03:43] That slices, so the DB only sends those 100 rows. [03:43] And only those 100 are turned into objects. [03:43] wgrant: not if a list(rs) is done?? [03:43] which is what happens in DecoratedResultSet [03:43] wallyworld: no, to recap, slice the resultset. [03:43] wallyworld: __iter__ will only be called on the sliced version, right? [03:43] wallyworld: how do you know you'll want them all eventually? [03:44] slicing returns a new resultset. [03:44] And __iter__ is called on *that*. [03:44] for example, how often do people go to the second page of results from a bug search? [03:44] jamesh: i said we *may* want them all eventually, say if the user scrolls to the end [03:44] wallyworld: general principle: specify all the work you want within a *transaction* - call it 2 seconds of processing time. [03:44] :-) [03:45] wallyworld: and ask for, and process that. No more (would be wasted). No less (would result in additional queries - lowers efficiency) [03:45] wallyworld: the batch navigator does this slicing for you [03:45] wallyworld: how about we get concrete. 'I'm trying to do X, and Y is happening' [03:46] ok. i think my problem is i misunderstood how the batch navigator works. [03:46] thanks for setting me straight :-) [03:46] the batch navigator uses count() on the base result set to estimate the number of pages [03:46] * wallyworld crawls back to his hole [03:47] and a slice to get the data for the current page [03:47] makes sense [03:47] the count() is a performance issue with huge datasets [03:47] we need to switch to estimators [03:47] yeah. 
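(To make the slicing point concrete: creating or ordering a Storm result set fetches nothing, and slicing it returns a new result set whose SQL carries OFFSET/LIMIT, so only the requested batch is transferred and turned into Python objects. A small self-contained sketch of the batch-navigator pattern lifeless describes, using an in-memory SQLite database and a made-up Task table rather than anything from Launchpad.)

    from storm.locals import Int, Store, Unicode, create_database

    class Task(object):
        __storm_table__ = "task"
        id = Int(primary=True)
        title = Unicode()

    store = Store(create_database("sqlite:"))
    store.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, title VARCHAR)")
    for i in range(1000):
        task = Task()
        task.title = u"task %d" % i
        store.add(task)
    store.flush()

    result = store.find(Task).order_by(Task.id)   # no rows fetched yet
    total = result.count()          # one COUNT(*) query; the costly part an
                                    # estimator would replace on huge tables
    batch = result[200:300]         # a new ResultSet: adds OFFSET 200 LIMIT 100
    rows = list(batch)              # only these 100 rows are fetched and deserialised
    print("%d total, %d in this batch" % (total, len(rows)))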
[03:48] but thats orthogonal [03:48] also, in my case, i had a query with a group by so had to override count() [03:48] erm [03:48] the default storm rs barfs [03:48] :( [03:48] I thought that was fixed in 0.18 [03:49] you can't say select (*) from xxxx with a group by in it [03:49] no [03:49] i fixed it quite simply [03:49] but i also found a bug in Count() [03:49] it messes up count(distinct xxx) [03:49] lifeless, do you go to the losa meetings? [03:49] it leaves out () around the columns [03:50] i don't know the speciic name for it, but i mean the one where francis asks them to do things [03:50] s/select(*)/select count(*) [03:50] poolie: no, tz fail. I get minutes, and have a separate meeting with ISF [03:50] k [03:50] poolie: I do when I'm in a workable tz [03:51] i'll mail him then [03:51] thanks [03:53] jam, did you file an RT for starting lp-serve? [03:53] bug 660264 [03:54] <_mup_> Bug #660264: bzr+ssh on launchpad should fork, not exec [03:54] I've had an rt for a while now, 41340 IIRC, but I'm not positive [03:54] thanks, i'll check that [03:54] sorry, 42156 [03:55] * wallyworld goes to make a coffee and get his fire proof suit [03:56] poolie: https://rt.admin.canonical.com/Ticket/Display.html?id=41791 [03:56] that's not exactly the same as getting it running though [03:57] is there a ticket or bug for that? [03:57] iirc you need them to change some configuration scripts that you don't yourself have access to? [03:58] poolie: the lp-serve thing is moving; jam needed to land more code [03:58] to do what? [03:59] poolie: there is one, but I keep shooting blind as to the rt number [04:00] Let me find the email [04:00] thanks [04:00] Could someone run http://paste.ubuntu.com/530449/ on staging? [04:01] lifeless, while jam's, looking, what do you understand the state of this to be? [04:01] i'd just like to make the bug accurate and work out where if anywhere it's getting stuck [04:02] poolie: its in a back and forth discussion with the losas as they figure all the bits out [04:02] poolie: its low priority (relatively that is) so I wouldn't expect it to happen rapidly [04:02] poolie: 42199 [04:03] poolie: mwhudson was landing the init script for jam, and with that it should be able to be enabled on staging [04:03] and then qad [04:04] epic fail [04:04] 3142 OOPS-1776B79 BugTask:+index [04:04] so from that rt it looks like the next action is still 'get the service running on qastaging'? [04:04] === Top 10 Time Out Counts by Page ID === [04:04] Hard / Soft Page ID [04:04] 238 / 35 Person:+commentedbugs [04:04] 150 / 1593 CodeImportSchedulerApplication:CodeImportSchedulerAPI [04:04] 50 / 188 BugTask:+index [04:04] 36 / 131 CodehostingApplication:CodehostingAPI [04:04] 16 / 9 Person:+bugs [04:04] 14 / 352 Distribution:+bugs [04:04] poolie: right, this whole week there haven't been enough l-osas, and there have been some critical things going on [04:04] 9 / 70 Archive:EntryResource:getBuildSummariesForSourceIds [04:04] 9 / 8 Archive:+copy-packages [04:04] 8 / 396 Distribution:+bugtarget-portlet-bugfilters-stats [04:04] today there was only Ch-ex [04:04] 7 / 0 BugTask:+addcomment [04:04] poolie: yes [04:05] k, i don't want to preempt the critical things, of course, i just want it to not stay stuck after that [04:05] poolie: so in my queue its: [04:05] - after RFWTAD stuff - thats important to finish getting single revs deployed and finish eliminating operation risk [04:06] - after token librarian - thats old inventory which fixes timeouts for many private attachments (e.g. 
security builds) [04:06] in terms of LOSA time [04:07] ok [04:07] short interrupts to move it along are of course reasonable [04:07] so it's off john's plate until they get to it? [04:07] poolie: John can best answer that [04:12] lifeless, poolie: I'm at least pending them telling me what I need to do next [04:12] the last round I didn't know I needed until they asked for it [04:12] mm there seem to be a few problems like that [04:21] RFC: http://people.canonical.com/~tim/recipe-latest-builds.png [04:22] it is using factory generated fake data, so I have multiple binary builds for the same arch [04:22] but the basics are there [04:23] this is up for review now [04:28] poolie: hi [04:28] poolie: we have another urgent need for committing to stacked branches [04:28] hi thumper [04:29] i think francis mentioned this... [04:29] poolie: bzr-builder commits to the branch [04:29] it was for.. right [04:29] and why does it want a stacked branch not a checkout? [04:29] poolie: and getting a branch for some big projects was using much more memory than the virtual builders had [04:29] poolie: because it never pushes [04:30] thumper: Not a fan of the triplicated spr name and version, but apart from that it looks great. [04:30] poolie: apparently an alternative solution is to change the merge code [04:30] poolie: abentley wrote it all up [04:31] onto the bug about commit? [04:31] on the incident report [04:31] for the buildd failures [04:31] that was an email or a wiki page? [04:31] wiki page I believe [04:32] I could forward you the email if you like [04:32] aaron wrote solutions up for me [04:32] i can probably find it [04:32] rockstar: ping? [04:33] * thumper EODs [04:34] thumper, is that https://wiki.canonical.com/IncidentReports/2010-10-28-LP-build-manager-not-dispatching ? [04:35] poolie: ah, I see it isn't all on the incident report [04:40] thumper: if its not pushing [04:40] thumper: why commit at all? [04:42] stub: what do you think of the idea of capturing query params in oops [04:42] stub: it seems to me it will help reproducing issues lot [04:43] lifeless: We will be logging private information, including information lp devs technically shouldn't have access to. [04:45] Some of that already leaks via the URL of course (so LP devs can learn about private teams they shouldn't know about) [04:45] But that hasn't been a problem so far, as private stuff has been company internal rather than private to a subset of the company. [04:46] stub: well, in theory :) [04:46] stub: so, we also manually create many queries today [04:46] so at least - today - we already leak that [04:50] Content of some of the private bugs could be an issue, as that would violate vendorsec [04:51] yeah [04:51] all disclosure stuff is serious [04:52] stub: when would we use content from a private bug in a query ? [04:52] stub: INSERT I guess [04:52] stub: + 'bugs like this' [04:52] stub: uhm, fo rthe INSERT case we could choose not to substitute [04:52] s/substitute/include/ [04:53] stub: we're trying to figure out why https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777QS12 has multi second queries [04:53] stub: doing them by hand with plausible ids is extremely fast - 130ms for the main lookup in the page [04:55] stub: could it be the something odd like the isolation level (what level does appserver run as), or is it just the specific ids that will be at issue? 
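(A sketch of the idea behind recording more per-query context in an OOPS, as in the bug 606959 discussion earlier and the parameter-capture question here: wrap statement execution so every query is logged with its parameters, duration, and a short traceback of the Python code that issued it. This is a stand-alone illustration using sqlite3, not Launchpad's actual tracer or OOPS machinery; the privacy concern stub raises is exactly the "params" field below.)

    import sqlite3
    import time
    import traceback

    query_log = []   # what an OOPS report could carry for one request

    def traced_execute(cursor, statement, params=()):
        start = time.time()
        cursor.execute(statement, params)
        query_log.append({
            "sql": statement,
            "params": params,          # the contested bit: may contain private data
            "duration": time.time() - start,
            "stack": traceback.format_stack(limit=5),   # short caller traceback
        })

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    traced_execute(cur, "CREATE TABLE t (id INTEGER, body TEXT)")
    traced_execute(cur, "INSERT INTO t VALUES (?, ?)", (1, "possibly-private text"))

    for entry in query_log:
        print("%0.6fs %s" % (entry["duration"], entry["sql"]))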
[05:41] stub: ping [05:41] hmm, nvm for a sec [05:42] wallyworld: qastaging-slave vs main [05:42] perhaps [05:42] bah [05:42] wgrant: ^ [05:42] lifeless: Could be, I suppose. [05:42] lifeless: ECONTEXT [06:04] wallyworld: I was talking to wgrant ; tab fail. [06:04] lifeless: np. i figured that when i saw the rest of the conversation come through :-) [06:14] stub: ping === almaisan-away is now known as al-maisan [06:14] lifeless: pong [06:14] hi [06:14] I need your help [06:14] we've got a very odd thing happening [06:15] have a look at these two oopses [06:15] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777QS12 [06:15] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777QS19 [06:15] this is the +packages page which is a current blocker for deploying [06:15] isolation level doesn't cause slowdowns [06:15] this page [06:15] https://qastaging.launchpad.net/~yavdr/+archive/stable-vdr/+packages?start=0&batch=204 [06:16] in 1777QS12 query 34 takes 6.3 seconds [06:16] in 19 it takes 202ms [06:16] and 39 takes 20 seconds [06:16] we've shutdown cronscripts on asuka [06:17] so the load should be tolerable (about 2 I believe - spm can confirm ?) [06:18] running query 34 by hand, it takes about 200ms consistently, every time [06:18] lifeless: Are the oopses from the first batch? The way we currently do batching means that it you have a large set of results, the later batches will always timeout. [06:18] stub: same batch in both oopses [06:18] stub: same exact url [06:22] ahh [06:22] I think I've managed to get a slow query [06:22] \o/ finally [06:23] !! [06:25] OOPS-1777QS19 q39 is slow and comes with all the parameters (obviously we are not sanitizing the aborted query...) [06:25] stub: yeah, its also genuinely slow locally [06:25] by which I mean ro user on qastaging [06:26] stub: thanks [06:28] And that is slow because it is returning 1.35 million rows [06:28] \o/ [06:29] wgrant: ^ [06:29] Hmm. Is that the newer version query? [06:29] I think I just reused the existing grouped version of it [06:29] sounds like it was inefficient already [06:29] :) [06:30] or buggy [06:30] 1.35 million rows sounds buggy. [06:30] wgrant: this give you what you need to make a test, isolate n fix? [06:31] lifeless: its missing a join condition [06:31] lifeless: Maybe. [06:31] Hah, so it is. [06:31] lifeless: Its missing a 'AND sourcepackagename.id = sourcepackagerelease.sourcepackagename [06:31] SPN [06:31] Yeah. [06:32] stub: in the inner or outer? [06:32] The outer [06:33] 2.7 seconds [06:33] tolerable with just one [06:33] So every matched row is being expanded to 38k rows. [06:33] Ah. [06:33] I think in fact that it shouldn't be joining against SPN at all. [06:33] wgrant: still badly needs tuning [06:34] oh, I did chage that, I removed spn.... but I bet storm is putting it back in. [06:34] bastardo. [06:34] how do you disable autotables? [06:34] jamesh: ^ [06:34] lifeless: It's still explicitly there. [06:34] clauseTables=[ [06:34] 'SourcePackageName', 'SourcePackagePublishingHistory']) [06:34] s/Name/Release/, I suspect. [06:34] So we might be able to avoid the subselect using DISTINCT ON [06:34] lifeless: what's the context? [06:35] jamesh: nvm :) [06:35] jamesh: I was thinking storm was seeing a table ref from an inner query and autotables adding it to the outer FROM [06:35] jamesh: but I was wrong [06:36] lifeless: So, how does it go if you remove the SPN join from the query? [06:36] ah. 
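(The 1.35 million rows are the classic symptom stub pins down here: a table left in the FROM clause with no condition joining it to the rest of the query, so every genuinely matching row is multiplied by every row of the unconstrained table, roughly 38k SourcePackageNames in this case. A tiny self-contained demonstration with sqlite3 and toy tables standing in for SourcePackageName/SourcePackageRelease.)

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE pkgname (id INTEGER PRIMARY KEY)")
    cur.execute("CREATE TABLE pkgrelease (id INTEGER PRIMARY KEY, pkgname_id INTEGER)")
    cur.executemany("INSERT INTO pkgname VALUES (?)", [(i,) for i in range(1000)])
    cur.executemany("INSERT INTO pkgrelease VALUES (?, ?)",
                    [(i, i % 1000) for i in range(50)])

    # Table listed but never joined: 50 releases x 1000 names = 50000 rows.
    cur.execute("SELECT count(*) FROM pkgrelease, pkgname")
    print(cur.fetchone()[0])    # 50000

    # Join condition restored: one row per release again.
    cur.execute("SELECT count(*) FROM pkgrelease, pkgname"
                " WHERE pkgname.id = pkgrelease.pkgname_id")
    print(cur.fetchone()[0])    # 50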
[06:36] wgrant: fine [06:36] wgrant: what file is tha tin [06:36] wgrant: 2.6 seconds [06:37] lifeless: lib/lp/registry/model/distroseries.py [06:37] 2.6 seconds sounds sort of excessive. [06:38] http://pastebin.com/bJ2TxmFc [06:41] Hmm... distinct on makes it worse. [06:43] ok [06:43] thats up in PQM [06:43] immediate fix [06:44] And that hopefully makes it non-critical. [06:49] yeah [06:49] assuming theres nothing hiding behind it [06:49] let me get the change cowboyed to see [06:51] This explains why even trivial archives were timing out. [07:17] wgrant: indeed [07:18] ok its landed [08:58] good morning [08:59] morning [09:10] Hey up, by the way [09:51] lifeless: It looks like your fix for bug 672371 did not help. +packages still times out on qastaing. [09:51] <_mup_> Bug #672371: Archive:+packages timeouts [09:52] What's next? Revert r11888? [09:55] jml: Hi! Any chance you could QA bug 673015? [09:55] <_mup_> Bug #673015: Code of Conduct requirement for PPA upload rights is unnecessary [09:56] allenap: Hi! Any luck figuring out bug 667340? [09:56] <_mup_> Bug #667340: Trac status of "Verified" confuses bug watcher [09:59] stub: Can you please QA bug 673874 before starting on your weekend? [09:59] <_mup_> Bug #673874: Improve bug comment caching [09:59] henninge: No, not yet. It hasn't caused any regressions, so it's actually safe to go. [09:59] henninge: I'll mark it as qa-ok but continue to investigate. [09:59] allenap: thanks a lot! [10:04] gmb: can you please QA bug 672507 ? [10:04] <_mup_> Bug #672507: Add bug_notification_level to the structural +subscribe view [10:04] henninge: Sure. [10:06] henninge: Done [10:06] gmb: thanks a lot! [10:10] henninge: jml needs my help to QA that [10:10] bigjools: thanks for offering it ;) [10:11] henninge: see the comment in the bug [10:11] he can't QA without it since it needs dogfood :) [10:11] henninge: we're waiting on https://lpbuildbot.canonical.com/waterfall [10:14] lifeless: ah yes, thank you. [10:19] hello. [10:19] yes QA, I know I know [10:26] bigjools: where do I need to point .dput.cf at? [10:26] jml: http://pastebin.ubuntu.com/530615/ [10:28] * bigjools processes your upload [10:29] jml: rejected [10:30] bigjools: why so? [10:30] jml: can I help you make a dummy package that I know works [10:30] "Unable to find python-testtools_0.9.6.orig.tar.gz" [10:30] and it was a mixed upload it seems [10:30] meaning? [10:30] binaries and source [10:31] jml: I normally use the "hello" package [10:31] apt-get source hello [10:32] cd hello-2.5 [10:32] dch -i [10:32] [10:32] yeah, that's what I did with testtools [10:32] (so far so good) [10:33] * bigjools sighs at stuck keys [10:33] heh [10:33] ok, then you need to "debuild -S" [10:34] ahh [10:34] it's the -S that I didn't do [10:34] uploaded [10:35] accepting it this time [10:35] yay [10:36] you cleared the CoC from ~jml? [10:36] bigjools: I did, but I'd like to double check with getUtility(IPersonSet).getByName('jml').is_ubuntu_coc_signer [10:36] * bigjools checks [10:37] False [10:37] qa-ok! [10:37] sweet. [10:37] bigjools: thanks! [10:37] my pleasure [10:38] * bigjools goes to celebrate with caffeine [10:41] henninge: what's the word on the crazy non-vc managed file that refers to class paths? [10:43] jml: It cannot be updated outside of a roll-out - at least not without Tom around ... [10:44] jml: So I am preparing a branch that adds the required import to c.l.i again with an XXX to remove it again after the roll-out. 
[10:44] henninge: that seems unsatisfactory [10:44] and a special roll-out requirement to update that file [10:45] henninge: can't we just add the requirement and leave c.l.i as-is? [10:45] jml: only if we go without a further deployment today [10:45] henninge: so it needs a rollout-with-downtime? [10:45] so spm says, yes. [10:46] henninge: did he say what it's needed for? [10:46] jml: hang on, I'll forward the mail [10:46] henninge: thanks :) [10:55] henninge: ok. I find this whole thing colossally annoying, but it looks like you guys are making the best of a bad situation. [10:55] jml: we are trying hard ... ;) thanks [10:55] and yes, it is annoying === al-maisan is now known as almaisan-away [10:59] henninge: it passed buildbot [10:59] henninge: when 914 hits qastaging [10:59] then [10:59] https://qastaging.launchpad.net/~yavdr/+archive/stable-vdr/+packages?start=0&batch=204 [10:59] should start working [11:00] that should be anytime now [11:00] lifeless: thanks! But it will be another 4 hours or so ... [11:01] henninge: why? [11:01] https://lpbuildbot.canonical.com/builders/lucid_lp/builds/355 [11:01] It just entered buildbot not passed it yet [11:01] oh crumbs 913 I see [11:01] ah well [11:01] gl [11:01] ! [11:01] and gnight all [11:02] lifeless: good night and thanks again. === matsubara_ is now known as matsubara [11:23] hello - I am having trouble getting the webservice to work on dogfood. When I try and log in, there's a rejection because it can't traverse to '1.0'. HALP? [11:25] bigjools: Have you tried using 'devel' rather than 1.0 ? [11:25] yes, that's what I am using - which makes the error more odderer [11:25] launchpad = Launchpad.login_with('testing', 'https://api.dogfood.launchpad.net/devel/') [11:25] Shouldn't there be another /api/ in there? [11:26] yes [11:26] still fails! [11:26] then I'm out of ideas :-) [11:27] hmm using https://api.dogfood.launchpad.net/api worked [11:32] ah you need to write version='devel' in the login_with params [11:59] jml: got a sec? [11:59] bigjools: sure [12:00] jml: I'm probably doing something very very stupid but I have code blindness. See http://pastebin.ubuntu.com/530655/ [12:00] there's a code snippet and a pdb session [12:00] the inner function callback can't see all of the outer method's variables.... [12:01] Project devel build (220): FAILURE in 2 hr 4 min: https://hudson.wedontsleep.org/job/devel/220/ [12:01] * Launchpad Patch Queue Manager: [r=lifeless][ui=none][no-qa] Remove StartsWith matcher from [12:01] lp.testing.matchers in favour of one from testtools & fix some [12:01] assertions that always passed. [12:01] * Launchpad Patch Queue Manager: [r=lifeless][ui=none][no-qa] Really drop Sourcepackagename from getNewerSourceReleases - fixing massive timeouts on +packages. [12:02] ./me looks [12:02] Morning, all. [12:02] morning deryck [12:03] bigjools: you are masking them in scope, I think. [12:03] bigjools: let me knock up a simpler example... === henninge changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 3 of 10.11 | PQM is open | firefighting: Lots of timeouts on qastaging!! | https:/​/​dev.launchpad.net/​ | Get the code: https:/​/​dev.launchpad.net/​Getting [12:04] bigjools: http://paste.ubuntu.com/530657/ [12:04] OK, qastaging is timing out left and right ... :( [12:05] Ubuntu pages seem to work fine but any project page times out. 
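(For reference, the launchpadlib incantation the dogfood discussion above converged on: the service root is the API host itself, with no /api and no version segment in the path, and the web service version is passed separately. A rough sketch only; it assumes a launchpadlib recent enough to accept the version keyword, as noted at 11:32, and ends with a trivial call just to show the object is usable.)

    from launchpadlib.launchpad import Launchpad

    # No "/api" and no "/1.0" or "/devel" in the URL; pick the version here instead.
    launchpad = Launchpad.login_with(
        "testing", "https://api.dogfood.launchpad.net/", version="devel")

    # launchpadlib.uris also exposes DOGFOOD_SERVICE_ROOT (mentioned at 17:17),
    # which can be passed as the service root instead of the literal URL.

    print(launchpad.me.name)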
[12:05] bigjools: in "if file_sha1 == 'buildlog':", you are overriding out_file, out_file_name and out_file_fd [12:05] bigjools: probably the thing to do is pass them in. [12:05] e.g. [12:05] jml: it's not got that far yet [12:05] d.addCallback(got_file, out_file_name, out_file) [12:05] bigjools: it doesn't matter. [12:05] ok [12:05] bigjools: run the python I pasted [12:06] that's special [12:06] bigjools: simply having an assignment in the scope masks the outer scope, whether or not the assignment has been evaluated. [12:07] bigjools: I'm not sure it would be sensible to do anything else. [12:07] jml: ok thanks , I'll pass 'em in [12:07] bigjools: np. [12:14] Argh! [12:14] I think I never realized how widespread the problems are that r11888 caused. [12:15] Maybe it's just that. [12:17] henninge: It should be limited to pages on IArchive. [12:17] henninge: can you please subscribe me to whatever bug you file for the XXX in c/l/interfaces/__init__? [12:18] jml: oh bug, right ... ;) [12:18] henninge: Anything outside Archive:+(index|packages|copy-packages|delete-packages) is probably not 11888. [12:18] wgrant: thanks [12:19] although I wish it was ... (because there is a fix coming) [12:26] bigjools: should I put that API gotcha on a wiki page somewhere? [12:26] jml: not yet - I can't get it working still [12:26] bigjools: ok. [12:26] jml: there's an error from wadllib about "Can't look up definition in another url" === mrevell is now known as mrevell-lunch [12:27] I've not seen that one before [12:27] and I suspect I need leonardr [12:27] yeah, it's doing something weird so that the /api is stripped somewhere [12:27] The URL shouldn't have /api in it. [12:27] but later depends on it being there [12:28] /api is used to traverse from the webapp to the API -- you don't use it on api.launchpad.net. [12:28] ......... [12:28] and so it works [12:28] thanks wgrant [12:28] Heh. [12:30] jml: bug 674476 I failed to mention it in the XXX, though... :/ [12:30] <_mup_> Bug #674476: Files outside the LP tree reference LP code [12:30] henninge: thanks. that's ok. [12:31] and you are subscribed [12:32] jml: and thanks for reminding me about the bug [12:33] henninge: np. === didrocks1 is now known as didrocks [12:57] # === mrevell-lunch is now known as mrevell === almaisan-away is now known as al-maisan === beuno_ is now known as beuno [13:14] Started in 15 minutes 27 seconds! [13:24] jml: ha - remember how we added Deferred to lp_sitecustomise.py? [13:25] yeah? [13:25] jml: looks like I need DeferredList too :) [13:25] bigjools: I thought DeferredList subclassed Deferred [13:25] ForbiddenAttribute: ('addCallback', meh [13:26] I guess zope doesn't care so much about that [13:39] why might bugtask.date_closed be none, even though its status is one of Fix Released, Wontfix or Inprogress? [14:03] jam: was my qtwebkit build fix0red? === matsubara is now known as matsubara-lunch === Ursinha is now known as Ursinha-lunch [14:24] henninge: where are we with the qastaging slowdown? I see that only qastaging is affected; staging and production are fine. The timeout exception I see is within database code, but since that's where we check for timeouts, that's not necessarily indicative. [14:24] Has anyone looked at qastaging logs? Has anyone looked at performance graphs? Has anyone tried to correlate performance graphs with revisions deployed on qastaging? [14:25] and, are we coordinating here or on -ops? 
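(The scoping gotcha jml reproduces for bigjools just above, in miniature: any assignment to a name inside a nested function makes that name local to the nested function for its entire body, so the outer value becomes invisible even on lines before the assignment runs. The names below echo the pastebin discussion but are otherwise made up.)

    def outer():
        out_file_name = "build-log.txt"

        def callback_broken(file_sha1):
            # The assignment below makes out_file_name local to callback_broken
            # for its *whole* body, so this read fails with UnboundLocalError
            # even though the outer variable exists.
            print("starting with %s" % out_file_name)
            if file_sha1 == "buildlog":
                out_file_name = "buildlog.txt"
            return out_file_name

        def callback_ok(file_sha1, out_file_name):
            # The fix suggested above: pass the outer values in as arguments.
            if file_sha1 == "buildlog":
                out_file_name = "buildlog.txt"
            return out_file_name

        try:
            callback_broken("buildlog")
        except UnboundLocalError as exc:
            print("broken: %s" % exc)
        print("ok: %s" % callback_ok("buildlog", out_file_name))

    outer()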
[14:26] Hm, no tuolomne graphs of qastaging AFAICT :-/ [14:27] Maybe I need to know the machine name(s) [14:34] qastaging is the same machines as staging, and staging is not timing out (as badly?) so machine load doesn't seem likely... [14:34] trying logs [14:36] gary_poster: staging is a lot of revisions behind qastaging atm [14:36] henninge: I figured it was something like that, yeah [14:36] so what has been done? [14:37] I saw stub's reply, but that didn't tell us much [14:37] I was about to grope around in logs [14:38] gary_poster: logs is good [14:39] I was hoping that the authors of the revision could check if any of their code could be causing this. [14:40] henninge: you identified the revision? [14:41] no, it's just any of the later ones. [14:41] ah :-) [14:41] bug I could narrow down the range because it only started today. [14:44] henninge: you up for that while I do log groping? I can share log groping fun here. So far the only thing that does not look like chatter in the qastaging librarian log is "Exception KeyError: ((, (1890638,)),) in ignored" [14:44] seems to be mostly happy thoug [14:45] h [14:45] gary_poster: I am looking at the revs atm, yes. [14:46] cool thanks [14:51] There are boatloads of "No handlers could be found for logger "librarian"" things in logs, which do make me a nit nervous [14:51] bit [14:52] gary_poster: I noticed earlier that getting images from the librarian had a long delay. [14:53] mrevell: flacoste: http://paste.ubuntu.com/530734/ [14:53] henninge: yeah. Maybe. It doesn't smell like the cause to me. This is interesting though: qastaging app log is *swamped* with these: http://pastebin.ubuntu.com/530735/ [14:54] gary_poster: What is a "DoomedTransaction"? [14:54] a transaction that must not be restarted [14:56] henninge: may be an unrelated problem. This started after the the restart 2010-10-21T15:22:39, so it's been happening a looong time [14:57] ah, ok [15:07] jkakar: around? ResultSet.set is generating bad SQL. === Ursinha-lunch is now known as Ursinha [15:13] jkakar: http://pastebin.ubuntu.com/530742/ === matsubara-lunch is now known as matsubara [15:17] Can you paste the code that generates this, please? [15:18] abentley: ^^ Also, am on a call, will be laggy. [15:22] jkakar: http://pastebin.ubuntu.com/530747/ [15:23] abentley: What's the __storm_table__ for SourcePackageRecipeBuild. [15:24] jkakar: __storm_table__ = 'SourcePackageRecipeBuild' [15:25] abentley: Is that right? [15:25] jkakar: Yes. [15:25] abentley: Can you do a JOIN in an UPDATE statement? It looks like you're building a bad query, not that Storm is generating a bad one. [15:27] jkakar: I'm not an expert on SQL syntax. It's possible that I'm asking Storm to do the impossible, but if I am, I expect Storm to tell me. [15:29] abentley: Storm won't tell you if you're trying to do the impossible. [15:29] abentley: It's reasonable to expect it, but Storm is just a "thin" *cough* layer with an expression compiler that generates SQL exactly as you specify [15:30] abentley: The database is telling you that you're trying to do the impossible. Which, given that different backends have different definitions of "impossible", is probably right anyway. [15:30] jkakar: This doesn't seem like it *should* be impossible. Can't one do a subselect or something? [15:31] abentley: Sure. The best thing to do is first, figure out what query you want to run. The second step is to figure out how to make Storm generate it. 
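(Context for the ResultSet discussion that follows: a plain SQL UPDATE targets a single table, so a result set built over a join cannot be compiled into one, which is what PostgreSQL is complaining about. The query abentley actually wants is single-table, and Storm's bulk update on result sets (ResultSet.set) handles that fine. A self-contained sketch with made-up Recipe/RecipeBuild classes and an in-memory SQLite database, not Launchpad's real model code.)

    from storm.locals import Int, Reference, Store, create_database

    class Recipe(object):
        __storm_table__ = "recipe"
        id = Int(primary=True)

    class RecipeBuild(object):
        __storm_table__ = "recipe_build"
        id = Int(primary=True)
        recipe_id = Int()
        recipe = Reference(recipe_id, Recipe.id)

    store = Store(create_database("sqlite:"))
    store.execute("CREATE TABLE recipe (id INTEGER PRIMARY KEY)")
    store.execute("CREATE TABLE recipe_build (id INTEGER PRIMARY KEY, recipe_id INTEGER)")
    store.execute("INSERT INTO recipe VALUES (5)")
    store.execute("INSERT INTO recipe_build VALUES (1, 5)")
    store.execute("INSERT INTO recipe_build VALUES (2, 5)")

    # Single-table result set; Storm can express this as
    #   UPDATE recipe_build SET recipe_id = NULL WHERE recipe_id = 5
    store.find(RecipeBuild, RecipeBuild.recipe_id == 5).set(recipe_id=None)
    print(store.get(RecipeBuild, 1).recipe_id)    # None

    # A result set that also constrains on Recipe columns (a join) has no such
    # single-table UPDATE, which is why set() on the joined getBuilds() result
    # blew up with the error pasted in the log.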
[15:31] abentley: If you can write out the query you want I can help you figure out the second part. [15:32] jkakar: If that is actually how it is done, why not write the query directly? [15:32] abentley: A few reasons: [15:32] - Storm will expand a class name into a series of column names in a query, such as in a SELECT. [15:33] - When you use Storm you get a result set that gives you powerful capabilities, like union, max, count, etc. [15:33] - When your class changes, because you added or removed a column, you don't have to change your queries unless they involve one of the modified attributes. [15:34] - Most of the time you already know what query you want, so it isn't hard to get from what you want to a store.find() call with Storm expressions. [15:34] abentley: This sounds like a case where the problem is not knowing what query you want to run. With Storm you're always expected to know what query you want to run. [15:34] jkakar: What makes you say that? [15:34] abentley: It was designed explicitly not to hide SQL from you, but in fact, to make it possible to generate the exact query you want. [15:34] abentley: Because the query you're generating doesn't work (according to the database)? [15:35] abentley: Sorry, that probably sounded offensive, but I mean no offense. [15:35] jkakar: I didn't set out to generate a query. I set out to use an existing function that returns a collection that provides functionality that should do what I want. [15:36] abentley: Okay. [15:37] The query I actually want is, "Find all SourcePackageRecipeBuilds where recipe = X and set recipe to NULL", which I can work out in SQL if you like. [15:37] abentley: That's the next step, yes, working it out in SQL so you know what you need Storm to generate for you. [15:40] jkakar: UPDATE SourcePackageRecipeBuild SET recipe = NULL WHERE recipe = 5 [15:40] jkakar: 5 actually being a variable. [15:41] css question [15:41] if I want to have a heading that has an image followed by some text aligned to the middle of that img, how do I do that? [15:41] abentley: store.find(SourcePackageRecipeBuild, SourcePackageRecipeBuild.recipe == $value).update(recipe=None) [15:43] jkakar: I don't want to have two definitions of how you get the builds associated with a recipe, so how do I update getBuilds to return a ResultSet that works? [15:44] abentley: Let me read some PostgreSQL documentation for a sec... [15:47] abentley: Hmm, it looks like you could include multiple tables in an UPDATE... at least on PostgreSQL. [15:52] abentley: I'm not sure exactly what you need... but I think you would benefit by writing a specialized query for the case you have. [15:53] abentley: For two reasons, (1) it's a simpler query than the one from getBuilds and will probably run faster and (2) you'll run one less query than you do now (by specifying pending=True and pending=False). [15:53] henninge: how big do the tarballs get that are produced by the TTBJs on the builders? [15:54] good question [15:54] bigjools: well, it's all text files so they should compress nicely. [15:54] jkakar: I disagree that it's a benefit. I'd rather have clearer code than simpler queries, and I think two queries is acceptable, and if I cared, I could update getBuilds so that I could get all builds at once. [15:55] henninge: it's just that the code that jtv wrote reads them into memory ... 
[15:55] it obviously works but I'd rather not have a time bomb [15:55] bigjools: They should not become very big, most projects don't have many templates [15:56] and if they have many, they are each small [15:56] abentley: Okay. Updating getBuilds to optionally include the pending clauses then would do what you want... ie: use it in a way that doesn't include the pending clauses. [15:56] henninge: typically what sort of size? [15:56] I'd have to research that. danilos, do you have a figure off the top of your head? [15:56] henninge: the change I am making will mean we could potentially be reading as many of these as there are builders [15:56] in parallel [15:57] henninge, 17 [15:57] thanks danilos [15:57] danilos is ever helpful :) [15:57] henninge, uhm, let me read the backscroll then [15:58] bigjools, if the tarball only includes translations, they should be small (never more than say 50M for the biggest case, but probably around 1M for most) [15:58] jkakar: So it would only support set if pending was not supplied? [15:58] abentley: Yep. [15:58] jkakar: gross. [15:58] abentley: Unless we change the way UPDATE statements are generated. [15:59] abentley: So you were probably right in the beginning, there probably is a bug in Storm. [15:59] danilos, henninge: aieeeee, I just looked at addOrUpdateEntriesFromTarball [15:59] tarball_io = StringIO(content) [16:00] if I have the file on disk is there a different method that will work? [16:00] bigjools, well, by that time, they are already in memory :) where is "content" initialized? [16:00] bigjools: actually, that's my code ;) [16:00] danilos: either in the upload processor or from the builder [16:01] bigjools, we can as easily parse the file directly on-disk using the tarfile module, if I am not mistaken [16:01] sorry but arbitrarily sized files going into stringio scares me [16:01] I'm going to file a bug about this, it'll need fixes in a few places [16:01] bigjools, uhm, what I am trying to say is that StringIO is a shallow wrapper, entire file is already in the memory [16:01] danilos: yes, it should not be :) [16:02] I get it [16:02] just some figures [16:02] bigjools, agreed, perhaps we need to save it to a tmp file before we process it [16:02] all of gimps templates are 736k [16:02] all of gtk+ templates (2) are 264k [16:02] danilos: well I can make a tmp file available in the buildd-manager and the upload processor before it calls that method [16:03] it currently has to read the file into memory before passing it [16:03] if the template generation goes a bit wonky then it can easily take out the buildd-manager [16:03] which Is Bad (TM) [16:04] bigjools, then it'd be a very simple fix on "our side" [16:04] excellent, I'll file the bug and put some pointers to soyuz/buildmaster code in it [16:04] cheers [16:04] bigjools, don't do it before you make the tmp file available :P [16:05] bigjools, also, note that we are using the same thing for actual Ubuntu package builds, so we'd want to fix that as well [16:05] danilos: yes, that's what I was referring to above about the upload processore [16:06] bigjools, ok consigliere ;) [16:06] heh [16:06] was about to make a joke about an offer you can't refuse [16:06] heh [16:07] jkakar: Here's a version that seems to work: http://pastebin.ubuntu.com/530766/ [16:07] abentley: Yeah, not surprisingly. I wonder how that query performs compared to the other one, though? [16:08] jkakar: For the cases where both work, I bet they both perform the same. That's got to be trivial to optimize. 
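(On the tarball-in-memory concern above: tarfile can read straight from a file on disk, so the buildd-manager and upload processor only need to hand over a path, or spool the upload to a temporary file, instead of slurping the whole archive into a StringIO. A rough sketch under those assumptions; the function names and the .pot filter are made up and do not reflect the real addOrUpdateEntriesFromTarball interface.)

    import os
    import shutil
    import tarfile
    import tempfile

    def template_names_from_path(tarball_path):
        # Reads the tarball directly from disk; only tar metadata and the member
        # currently being inspected need to live in memory.
        tar = tarfile.open(name=tarball_path, mode="r:*")
        try:
            return [m.name for m in tar.getmembers() if m.name.endswith(".pot")]
        finally:
            tar.close()

    def handle_upload(fileobj):
        # Instead of fileobj.read() -> StringIO(content), spool to a temp file
        # in chunks and pass the path along.
        fd, path = tempfile.mkstemp(suffix=".tar.gz")
        try:
            with os.fdopen(fd, "wb") as out:
                shutil.copyfileobj(fileobj, out)
            return template_names_from_path(path)
        finally:
            os.unlink(path)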
[16:10] abentley: Probably, yes. Though, in practice, I've occasionally seen dramatically different performance when a query uses a subselect vs. when it doesn't. [16:10] It's hard to understand when that will be the case or why, though. [16:16] gary_poster: staging is timing out, too, now. It has been updated from 9955 to 9965 [16:17] henninge: well, that seem to point a pretty stong finger at code then, which simplifies things in some ways. how is the revert going on qastaging? [16:17] *strong [16:17] gary_poster: it's taking it's time [16:17] :-) [16:17] its [16:17] ok [16:18] * gary_poster carefully replaces the second "it's" but leaves the first intact ;-) [16:19] thanks for being careful ;) [16:21] jkakar: filed as https://bugs.edge.launchpad.net/storm/+bug/674582 [16:21] <_mup_> Bug #674582: Storm may generate SQL errors on ResultSets.set for otherwise-working ResultSets. [16:22] abentley: Thanks! === al-maisan is now known as almaisan-away [16:27] henninge, gary_poster: I agree that staging is now as useless as qastaging, but I do not see what has changed to make SPR/SPPH queries slower. [16:29] sinzui, I am no longer actively investigating, because henninge's summary that revisions 11888 -> 11899 -> 11914 are a likely cause sounded like a good hypothesis. We are waiting to see if reverting these clears up qastaging. [16:33] 11914 is not on staging, so I discount that [16:33] yes, but it's part of the logical set [16:44] Yippie, build fixed! [16:44] Project devel build (221): FIXED in 4 hr 3 min: https://hudson.wedontsleep.org/job/devel/221/ [16:53] gary_poster, henninge, I do not understand the "set" point. I do not see that revision on staging 11914. I suspect that 11914 fixes the issue. I think the origin of the issue is 11899 [16:54] sinzui: 11914 does not fix it, it had already been on qastaging and did not help [16:55] sinzui: we are currently reverting 914 and 899 [16:59] gary_poster, sinzui: do you know if the revision display at the bottom of the LP page is dynamic or static? [16:59] i.e. Does it need a "make build" to be updated or is the information straight from the branch? [17:00] people.c.c has an old launchpadlib :( [17:03] henninge, it requires make build [17:03] so Chex just told me [17:06] jml: I added a test to directly use downloadPage against a real slave in a test and it gets a "405 Method not allowed". Do you know if Twisted has the equivalent of urllib2.debug = True ? [17:07] EdwinGrubbs, I am looking at distroseries.getCurrentSourceReleases() I think the subquery for max(spph.id) is doing a full table scan of SPNs because there is no constraint to return only the SPNs passed to the method [17:08] bigjools: I don't know what urllib2.debug=True is. [17:08] bigjools: and I don't know of any debugging foo off the top of my head [17:08] jml: it dumps the http comms to stdout - I'm trying to work out what methods it's using that's not allowed [17:08] EdwinGrubbs, I suspect that moving 'SourcePackageRelease.sourcepackagename IN %s" into the subquery will make the query faster [17:11] sinzui: that shouldn't be necessary since it looks like "spr.sourcepackagename = SourcePackageRelease.sourcepackagename" makes it search for all the spph/spr records for a single sourcepackagename. [17:12] bigjools: nothing obviously like that in t.web [17:12] jml: yeah, I looked too [17:12] bigjools: wireshark maybe? [17:12] tcpdump ... :) [17:17] bigjools: ooh, did you know about from launchpadlib.uris import DOGFOOD_SERVICE_ROOT? 
[17:17] yes [17:17] I think I put it there and shamefully forgot [17:17] EdwinGrubbs, that assumes that the query planner built that set first [17:20] sinzui: I've never seen an instance where the query planner thought that it would be faster to run a correlated subquery first and then limit the results of the outer query. [17:20] why is it that bazaar.launchpad.net is so hard for dns servers to resolve? [17:20] s/limit/filter/ [17:22] EdwinGrubbs, since we are looking at a PG 8.4 change + the removal of the SPN table from the query. I think I should get sometimes based on where that constraint is placed [17:23] mrevell: still around [17:23] ? [17:24] Hi jml, sure am [17:24] mrevell: I don't know where best to put this link on the beautifully presented https://dev.launchpad.net/BugJam – http://mumak.net/lp-bugjam-2010/ [17:24] mrevell: it's a count of the number of bugs fixed during the bug jam so far [17:24] sinzui: how many sourcepackagenames are passed in as an argument to getCurrentSourceReleases() [17:24] jml, I love it :) [17:25] jml, I'll put a link under "Tracking progress" [17:25] mrevell: thanks. [17:26] EdwinGrubbs, 1, but get get 38536 where we would expect 1 from natty, maybe 3 for maverick [17:29] Edwin 1, my move of the constraint does not fix the issue, 2 I feel pretty good that getting what looks like getting a match for every SPN in natty implies an open join [17:30] sinzui: can I see the query plan? [17:31] I will get it for you [17:33] jml: well, that flushed out a nice bug in the tests we wrote a few weeks ago :) [17:34] bigjools: which was? [17:34] jml: it was constructing a url of the form /rpc/rpc [17:34] bigjools: heh [17:34] jml, abentley: Woah: http://paste.ubuntu.com/530794/ [17:35] jkakar: yeah, it's filed as a critical bug. [17:36] jkakar: https://bugs.launchpad.net/launchpad-code/+bug/674305 [17:36] <_mup_> Bug #674305: bzr push occasionally reports AssertionError on terminal [17:37] EdwinGrubbs, this is the plan to get the current release of bzr, 1 SPN provided and only 1 expected: http://pastebin.ubuntu.com/530797/ [17:39] jml: Cool. [17:40] jml: Dunno if it helps debugging, but this was with a bound branch, it wasn't a push (explicitly). [17:40] jkakar: I'm not at all involved in fixing it [17:40] <- part of the problem [17:40] Heh [17:40] EdwinGrubbs, sourcepackagename is still listed in the FROM. It was removed several revisions ago [17:41] removing it from the query fixes everything [17:41] * sinzui looks at code again [17:41] sinzui: yeah, I was wondering where that table came from. [17:42] sinzui: so, is the code not broken? Was it just an old oops? [17:43] EdwinGrubbs, It was removed a few days ago, lifeless removed it from clauseTables in r11914, but I suspect something else is putting the table in the from clause [17:44] Edwin to be clear, the SPN joins were removed a few days ago, Lifeless then landed another branch to remove it fix clauseTables. But this oops shows that the SPN table is still in the from clause [17:46] EdwinGrubbs, ^ [17:49] EdwinGrubbs, sorry. I am looking at too may oopses. That oops was for an older revision [17:50] * sinzui tries query from r11915 [17:51] EdwinGrubbs, This is the correct plan for qastaging: http://pastebin.ubuntu.com/530801/ [17:58] sinzui, Thanks for your post wrt strategies for the bug jam. [17:59] mrevell, your welcome [17:59] Have a wonderful weekend people. See you Monday. [18:01] sinzui: ok, the problem is that there are 1138 spr records for a single sourcepackagename. [18:03] edwin I agree. 
I am looking for a constraint or a revised subquery that removes the loop or 1138 [18:04] moin [18:05] sinzui: rev 11914 [18:05] EdwinGrubbs: ^ [18:06] lifeless: yes, but 11915 still times out [18:06] lifeless we want to reduce the loop of SPRs in the query [18:07] good bye, have a nice weekend [18:07] sinzui: since, there is only one valid spph record for all the spr records, you would get good performance by just eliminating the subquery and moving the conditions into the outer query. You will just have to eliminate the duplicates. DISTINCT won't let you choose the spph record with the max id, so you would have to do that in python, if it is important to get that spph record and not a random one. [18:08] * sinzui nods [18:08] sinzui: works for me [18:09] https://qastaging.launchpad.net/~yavdr/+archive/stable-vdr/+packages?start=0&batch=204 [18:09] At least 782 queries/external actions issued in 17.77 seconds [18:09] little slower than ideal [18:09] trying again to remove cold cache effects [18:09] * sinzui just went from 9695.618 to 33.446 ms using a subquery table of just current ids [18:10] At least 782 queries/external actions issued in 12.63 seconds [18:10] sinzui: ^ https://qastaging.launchpad.net/~yavdr/+archive/stable-vdr/+packages?start=0&batch=204 [18:10] sinzui: I welcome further improvements here [18:10] EdwinGrubbs: bringing too much back and filtering in python will almost always be slower [18:10] lifeless yes, we want to see a source package page load a single spr. [18:11] storm is (relatively) slow at deserialisation, due to the cache coherency logic [18:11] https://qastaging.launchpad.net/ubuntu/natty/+source/bzr [18:11] At least 49 queries/external actions issued in 1.91 seconds [18:11] view-source:https://qastaging.launchpad.net/~yavdr/+archive/stable-vdr/+packages?start=0&batch=1 [18:12] interestingly that page is not flat yet, the binaries must be the cause because there is a test that its flat with sources...and the binary test seemed surprisingly low to me [18:13] lifeless: it won't bring too much back since there is only one spph record for 1300 spr records that meets the condition. So, the filtering in python might only have to deal with eliminating a handful of records. [18:13] EdwinGrubbs, I essentially did the reverse, of your suggestion. I converted the subquery to get the max id to be a table of only viable candidates: http://pastebin.ubuntu.com/530811/ [18:14] * sinzui now tries to do it the EdwinGrubbs approved way [18:15] ah right, deep history leading to a slow query [18:15] EdwinGrubbs: when we query for 200 rows [18:15] EdwinGrubbs: what would happen then [18:16] EdwinGrubbs: e.g. for http://pastebin.com/7jC2vD7G [18:20] lifeless: yes, I would like to do it in the database, but I don't know if getting just the max(spph.id) for each sourcepackagename is important or not. To do that in the database would require using a temp table in order to get rid of the subquery. [18:20] EdwinGrubbs, lifeless. I think this is the solution we want to achieve in the code http://pastebin.ubuntu.com/530817/ [18:21] sinzui: that only works for a single sourcepackagename. [18:22] sinzui: oh wait [18:22] Edwin why? I see the table controls the SPNs [18:22] me tries a list [18:23] Edwin it does work with multiple SPNs [18:24] sinzui: ok, that makes sense. I was thinking that you would run into problems with group by, but you are just grouping by the spr columns, so it all works out. 
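(For reference, the shape of the rewrite sinzui and EdwinGrubbs converge on above: rather than a correlated max(spph.id) subquery evaluated per row, build one small grouped subquery that yields only the current publishing ids for the requested package names and series, and have the outer query join against just those. A self-contained sqlite3 toy with heavily simplified stand-in tables; it shows the query shape only, not Launchpad's schema or the exact SQL that landed.)

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE spn (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("CREATE TABLE spr (id INTEGER PRIMARY KEY, spn INTEGER)")
    cur.execute("CREATE TABLE spph (id INTEGER PRIMARY KEY, spr INTEGER,"
                " series INTEGER, status INTEGER)")
    cur.execute("INSERT INTO spn VALUES (1, 'bzr')")
    # Lots of releases for one name (like the 1138 bzr SPRs), each published once:
    cur.executemany("INSERT INTO spr VALUES (?, 1)", [(i,) for i in range(1, 1139)])
    cur.executemany("INSERT INTO spph VALUES (?, ?, 1, 2)",
                    [(i, i) for i in range(1, 1139)])

    # The fast shape: a grouped subquery restricted to the wanted names/series
    # produces just the current publishing id per name; the outer query then
    # touches only those rows.
    cur.execute("""
        SELECT spr.id, spr.spn
          FROM spr JOIN spph ON spph.spr = spr.id
         WHERE spph.id IN (
                   SELECT max(spph.id)
                     FROM spph JOIN spr ON spph.spr = spr.id
                    WHERE spph.series = 1 AND spph.status = 2
                      AND spr.spn IN (1)
                 GROUP BY spr.spn)
    """)
    print(cur.fetchall())    # [(1138, 1)]: one current release for 'bzr'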
[18:24] well, it certainly did not work until I added that [18:25] EdwinGrubbs, I do not need the outer "SourcePackageRelease.sourcepackagename IN ()" do I? [18:25] sinzui: no [18:26] This is wicked fast [18:26] I am going to start a branch and watch the tests pass [18:27] gary_poster, henninge: I have a very fast query that fixes distroseries.getCurrentSourceReleases() === almaisan-away is now known as al-maisan [18:42] (37 rows) [18:42] Time: 186.584 ms [18:42] sinzui: that's for the big page [18:42] using your branch [18:43] sinzui: love your work [18:43] wow. I feel good [18:43] this is going to knock +packages right back to zilch on the timeouts chart I think [18:43] This will have to be stormified. I know how to write this in storm, but not sqlobject [18:44] hmm? [18:44] I mean that's great [18:44] but I can help you do it in situ if you want [18:44] sinzui, lifeless: Is what you are doing related to the qastaging timeouts? [18:44] yes [18:45] henninge: do you mean on +packages? [18:45] this looks like it will also fix many other timeouts in production too [18:45] no, the general timeouts we get on all kinds of pages. [18:45] henninge: no [18:45] :( [18:45] henninge: we get timeouts because of a few reasons [18:45] a) cold cache effects in the db - it's much smaller in memory than production [18:46] b) we have inefficient code and staging hardware shows this up [18:46] this is a case in point - sinzui is shaving many seconds off of a routine page [18:46] c) contention/thrashing in the appserver due to all the scripts running on the appserver staging host asuka [18:46] there is an rt open to address (c) [18:47] (a) - retry a few times, if it eventually works prod will probably chew it up happily [18:47] (b) - we need to fix our code. Which will help with (a) too [18:48] but it seems to be related to certain revisions of the code [18:48] it started on qastaging and when staging got updated with the same revisions it showed the same timeouts whereas before it (staging) was working fine. [18:48] henninge: what pages specifically [18:49] all project homepages [18:49] launchpad.net/anyproject [18:49] all source packages [18:49] from 11888 to 11914 we had a very broken query for getCurrentSourceReleases [18:49] launchpad.net/ubuntu/maverick/+source/anypackage [18:49] all the pages you're listing are covered by it; it should be tolerable now - the same as before 11888 [18:49] 11914 did not fix it, though [18:50] what EdwinGrubbs and sinzui are doing is about to make it much better [18:50] henninge: this is one reason those pages are all slow on lpnet too [18:51] lifeless, henninge: the method is used in soyuz, translations, registry, and bugs pages. Anything that wants to know the current release of a package is going to be between 50 and 100 times faster [18:52] sinzui: yah [18:52] sinzui: note that production db is much faster [18:52] sinzui: so not all pages will zoom as much [18:52] https://launchpad.net/ubuntu/natty/+source/bzr [18:52] Does this error message in rabbitmq mean anything to anyone? It's preventing me from installing launchpad-developer-dependencies. http://pastebin.ubuntu.com/530828/ [18:53] but there are many pages which do this query that will benefit a great deal [18:53] rockstar: is it already running? [18:53] lifeless, no, it won't start. [18:54] lifeless: I don't find those pages particularly slow but maybe I am just so accustomed to LP slowness ... [18:54] ;-) [18:54] lifeless, when the package gets installed, it explodes and prevents anything else from being installed.
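A minimal sketch of the rewrite sinzui describes (the pastebins above are the authoritative versions; as before, the column names, status values, and literal ids are assumptions): the per-row subquery becomes one derived table of current publications, grouped so that each sourcepackagename contributes its single newest publication, which is then joined back to its release. The derived table runs once instead of once per SPR row, and the outer "sourcepackagename IN ()" filter becomes redundant because the derived table already restricts the names.

    -- Illustrative rewrite: one pass over current publications,
    -- one row per sourcepackagename, then join back to the release.
    SELECT spr.*
    FROM SourcePackageRelease spr
    JOIN SourcePackagePublishingHistory spph
        ON spph.sourcepackagerelease = spr.id
    JOIN (
        SELECT inner_spr.sourcepackagename,
               MAX(inner_spph.id) AS newest_spph
        FROM SourcePackagePublishingHistory inner_spph
        JOIN SourcePackageRelease inner_spr
            ON inner_spph.sourcepackagerelease = inner_spr.id
        WHERE inner_spph.distroseries = 1234         -- assumed series id
            AND inner_spph.status IN (1, 2)          -- assumed Pending/Published values
            AND inner_spr.sourcepackagename IN (42)  -- requested name ids (assumed)
        GROUP BY inner_spr.sourcepackagename
    ) AS latest_pub ON latest_pub.newest_spph = spph.id;

Because the GROUP BY yields one MAX(id) per sourcepackagename, the same shape works for a list of names, which is the point Edwin and sinzui settle at the end of the exchange above.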
[18:54] rockstar: on maverick ? [18:54] lifeless, yes. [18:54] hmm [18:54] I don't know sorry [18:54] OK, let's wait and see the outcome of that work. [18:54] inet_tcp",{{badmatch,{error,duplicate_name [18:54] makes me think the socket is in use [18:54] lifeless, hm... [18:54] which would happen if you had a rabbit instance already running [18:55] e.g. if the devscript is buggy on upgrades [18:55] sinzui: you have my r-c approval for landing that if it gets too late. [18:55] lifeless, oh! Yeah, the other change on this laptop is the u1 setup, so I guess that makes sense. I completely spaced that. [18:56] henninge, thanks. [18:56] PQM is scheduled to close in 3 hours [18:56] lifeless, everything is happy now. [18:56] so you will need r-c ;) === henninge changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 3 of 10.11 | PQM is closing at 22 UTC | firefighting: Lots of timeouts on qastaging!! | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting [18:56] lifeless, thanks for intervening between my head and the wall. [18:58] rockstar: that was it ? [18:58] rockstar: if so, please file a bug ... buggy package ;) [18:59] lifeless, well, I should also say that I run launchpad in a chroot. [19:00] rockstar: that's fodder for the bug report [19:03] henninge, sinzui: could these slow pages be related to the changes to add latest releases to the source package pages? [19:03] henninge, sinzui: that was added in a recent revision [19:03] flacoste, I do not think so. The method was unchanged this year except for lifeless's changes this week [19:03] flacoste I think this is PG 8.4 [19:04] sinzui: henninge says that timeouts increased with a recent revision [19:04] as staging is now seeing the same timeouts as qastaging [19:04] whereas it wasn't until it was updated [19:04] and qastaging wasn't either yesterday [19:04] flacoste yes, we had an open join, but there were timeouts nonetheless [19:05] flacoste: we had two landings to fix the issue, neither was substantial [19:05] I fluffed one [19:05] removed an unneeded table *constraint*, left the table in by mistake. [19:05] that went boom badly ;) [19:06] pages affected are: [19:06] all project homepages [19:06] all sourcepackages [19:06] according to henninge again [19:06] flacoste: we did just discuss this [19:06] 20 minutes ago [19:07] right, i read the backlog [19:07] kk [19:07] but it's not clear that we have identified the issue [19:07] flacoste: we have an 8 second query [19:07] that will come down to 140ms [19:07] on qastaging [19:07] sure [19:07] we know they all use it [19:08] until it's fixed we have no data about what lies behind it [19:08] well there is an alternative, which is to do a binary search to find the revision introducing the slowness [19:09] flacoste: 11888 [19:10] lifeless: i was under the impression that we tried reverting 11888 and its two following fixes from qastaging, but that still resulted in all these pages timing out [19:10] but now, i'm not sure, it's possible that only the follow-up fixes were reverted...
=== shadeslayer is now known as evilshadeslayer [19:11] flacoste: 11888 is a confounding factor [19:11] henninge: could you confirm/inform the above?^^^ [19:11] flacoste: with 11888 present, any other flaws would have been magnified [19:11] right, but without it, we shouldn't see any more timeouts than before [19:11] flacoste: even if 11888 isn't the cause of all the issues, we can't be sure without running with 11888 reverted and the others present [19:12] flacoste: we're running 11887 live [19:12] i know [19:12] flacoste: so yes, I agree. [19:12] we've always seen more timeouts on qastaging [19:12] it has a 10 second timeout [19:12] but i thought we had made that test on qastaging (no 11888, others present) and found that it was still timing out all over the place [19:12] but maybe, that's not the test that took place [19:13] let me check the branch... [19:13] lp:~henninge/launchpad/stable-revert [19:13] ah, no [19:14] only 11899 and 11914 were reverted [19:14] so your hypothesis holds [19:15] ok [19:16] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777QS139 is qastaging.launchpad.net/bzr [19:46] sinzui: can you look at these screenshots of the involvement portlet for the bugsupervisor versus the admins? https://devpad.canonical.com/~egrubbs/configuration/ [19:48] Edwin It was easier to write an exception? [19:49] EdwinGrubbs, We wanted to hide the link once the tracker was configured [19:52] sinzui: well, the progress bar doesn't make sense with just one link being shown, so an exception seems like the cleaner solution. It would also be odd to have a single link hidden under the "Configuration options" expander. [19:53] okay, I agree. Your approach is correct [19:55] flacoste: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777QS139 [19:56] flacoste: I did not revert 11888 in that branch because it has been on qastaging for a while without any trouble. [19:56] qastaging.launchpad.net/bzr - it's the query sinzui is redoing [19:56] flacoste: I am sorry for the misunderstanding [19:56] lifeless: ack [19:57] I am fixing the distroseries/source package problem illustrated by qastaging.launchpad.net/ubuntu/natty/+source/bzr [19:57] * henninge has to go away for a bit again [20:07] sinzui: can you review https://code.edge.launchpad.net/~edwin-grubbs/launchpad/bug-664788-configure-bugtracker-link-permission/+merge/40754 [20:07] I will [20:13] thumper: perhaps we can cowboy in a squelch for the xmlrpc Fault === al-maisan is now known as almaisan-away === matsubara is now known as matsubara-afk [20:30] sinzui: I have a favour to ask [20:31] sinzui: add [rollback=11888] to your landing for the new query [20:31] yes lifeless [20:31] I will [20:31] sinzui: it will tell qatagger that https://bugs.launchpad.net/soyuz/+bug/662523 can be unblocked so the deploy report is accurate [20:31] <_mup_> Bug #662523: Archive:EntryResource:getBuildSummariesForSourceIds times out [20:31] sinzui: thank you! [21:57] lifeless: So that SPN fix worked? [21:59] wgrant: yes, and sinzui has an even more effective fix to make other uses of the query much more efficient [21:59] wgrant: https://qastaging.launchpad.net/~yavdr/+archive/stable-vdr/+packages?start=0&batch=204 [21:59] lifeless: How fast is sinzui's? [21:59] 8700->100ms [22:00] wgrant: on production this is already tolerably fast, db server size yada yada yada [22:00] Nice. [22:00] Yep. [22:00] wgrant: but I expect a positive improvement all over.
[22:01] wgrant: my mp has time summaries and SQL explains https://code.launchpad.net/~sinzui/launchpad/ds-getcurrentreleases/+merge/40756 [22:01] sinzui: I'm really glad you guys dug into this [22:02] Making the SP and DS pages faster really has required half a dozen engineers looking at the same number of objects === henninge changed the topic of #launchpad-dev to: Launchpad Development Channel | Week 3 of 10.11 | PQM is in release-critical mode | firefighting: - | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting [22:08] sinzui: we've 22 or so, looking all across the board [22:09] I fear milestones will be the last to fix :( I have time to return to that one next week. [22:10] +commentedbugs is the current most severe timeout [22:10] and stub has a fix \o/ [22:10] I won't have time to do anything with it till week after next [22:16] Let me guess... it's querying badly to try to find comments with index != 0? [22:17] read the bug :) [22:32] http://www.jacobian.org/writing/buildbot/ci-is-hard/ <-- lmao. "Django's big. The test suite is around 40,000 lines of code in something like 3,000 individual tests. We work constantly to speed up the test suite, but best case it still takes about 5 minutes to run. This means that our CI absolutely needs to be distributed — a single test server won't cut it." [22:51] flacoste: ping [22:52] hi lifeless [22:52] flacoste: I think we need to treat this bzr thing as an emergency [22:52] flacoste: it's very frequent [22:52] lifeless, which? [22:52] poolie: the backtrace on push [22:52] lifeless: my understanding is that it's only annoying, not a real error [22:52] flacoste: our users don't know this [22:52] flacoste: perception [22:53] this is the zope error being shown to the user? [22:53] is there a bug? [22:53] bug number [22:53] poolie: flacoste: https://bugs.launchpad.net/launchpad-code/+bug/674305 [22:53] <_mup_> Bug #674305: bzr push occasionally reports AssertionError on terminal [22:54] Also, doesn't it stop a scan from being requested? [22:55] lifeless: any idea of how we could fix this apart from escalating this RT? [22:55] wgrant: that would be new information [22:55] flacoste: here are the options I know about [22:55] wgrant: that's my understanding too [22:55] flacoste: a) escalate the RT [22:56] If it doesn't mean that no scan is requested, then we have bigger problems. [22:56] b) wedge in some retry code here - high risk [22:56] wgrant: there are multiple routes to trigger scans [22:56] wgrant: it's possible a redundant route is saving us [22:56] wgrant: e.g. the disconnect hook [22:56] Possibly. [22:57] c) push the mailman improvement and hope it's enough [22:57] d) disable other services like codeimport that use the same service [23:02] c and d look like the main options at this time [23:02] can we get confirmation that scan isn't triggered? [23:06] i don't get any errors from here fwiw [23:07] it seems to be a couple of users an hour - which, because it's not (apparently) localised to product/bug like other timeouts, is particularly confusing and harmful to our users.
[23:08] there's no obvious rationale they can connect it to [23:09] wgrant: we're talking with ops now
[23:33] Time Out Counts by Page ID
[23:33] Hard  Soft  Page ID
[23:33]  570  7384  CodeImportSchedulerApplication:CodeImportSchedulerAPI
[23:33]  211    32  Person:+commentedbugs
[23:33]  164   561  CodehostingApplication:CodehostingAPI
[23:33]   44   156  BugTask:+index
[23:33]    8    10  ProjectGroup:+milestones
[23:33]    6   305  Distribution:+bugtarget-portlet-bugfilters-stats
[23:33]    5   259  Distribution:+bugs
[23:33]    5    14  Person:+bugs
[23:33]    5     7  DistroSeries:+queue
[23:33]    5     4  Archive:EntryResource:getBuildSummariesForSourceIds
[23:33] bah sorry for the formatting [23:33] flacoste: ^ turning off code imports is probably the fastest thing we can do [23:33] mbarnett: how hard is it to disable all code imports ? [23:34] mbarnett: we should be able to see an immediate drop in that netstat over a couple of minutes if that were to help [23:36] flacoste: 570 7384 CodeImportSchedulerApplication:CodeImportSchedulerAPI [23:36] flacoste: 164 561 CodehostingApplication:CodehostingAPI [23:36] lifeless: good suggestion [23:37] mbarnett: please turn off the importds [23:37] mbarnett: And keep watching that netstat [23:41] flacoste: https://devpad.canonical.com/~lpqateam/lpnet-oops.html#time-outs is where I'm looking [23:43] no more imports should be fired off after any currently running complete. [23:43] flacoste: look at this: [23:43] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777XMLP1011 [23:43] SQL time: 17 ms [23:43] Non-sql time: 15074 ms [23:43] Ow. [23:44] flacoste: this is why I want a) single threaded appservers and b) in the main cluster [23:46] right [23:46] the GIL hypothesis [23:46] yes [23:46] for instance [23:46] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1777XMLP4675 [23:46] but i have another hypothesis [23:47] if we start the timer too early [23:47] a deep queue could look like this as well [23:47] see the 4675 in particular, it's a soft timeout