[03:15] <wgrant> spm: Hi. I'm trying to QA my fix for bug #592573. The differences between the right panels on https://launchpad.net/builders and https://edge.launchpad.net/builders show that there are ~146 builds in some sort of limbo. Do you have a moment to do a bit of DB poking to work out what they are and why?
[03:15] <_mup_> Bug #592573: BuilderSet.getBuildQueueSizes doesn't consider non-binary builds <Soyuz:Fix Committed by wgrant> <https://launchpad.net/bugs/592573>
[03:19] <spm> wgrant: hrm. curious. sure. I guess the ones you have via the bug is a good starter?
[03:27] <wgrant> spm: SELECT id, builder, lastscore, job, job_type, processor FROM buildqueue WHERE virtualized=true;
[03:27] <wgrant> At the moment that query *should* be empty.
[03:28] <wgrant> But it will probably return around 61 rows
[03:28] <spm> (23461 rows) ho ho ho ho
[03:28] <wgrant> Oh, oops. Forgot a join.
[03:28] <spm> heh, np
[03:28] <wgrant> SELECT id, builder, lastscore, job, job_type, processor FROM buildqueue JOIN job ON job.id = buildqueue.job WHERE virtualized=true AND job.status = 0;
[03:29] <spm> you did get the '61', just missed the 23,400 as well. so.. not too shabby?
[03:29] <wgrant> SELECT buildqueue.id, builder, lastscore, job, job_type, processor FROM buildqueue JOIN job ON job.id = buildqueue.job WHERE virtualized=true AND job.status = 0;
[03:29] <spm> yarp. 61 rows
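The fix between the second and third queries above is qualifying `id` as `buildqueue.id`: both tables carry an `id` column, so the bare name is ambiguous once the JOIN is added. A minimal sqlite3 sketch of the failure mode, using an illustrative stand-in schema rather than Launchpad's real tables:

```python
import sqlite3

# Illustrative stand-in schema (not Launchpad's real tables): both
# tables have an "id" column, so an unqualified "SELECT id" across
# the JOIN is ambiguous and the database refuses to guess.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job (id INTEGER PRIMARY KEY, status INTEGER);
    CREATE TABLE buildqueue (id INTEGER PRIMARY KEY,
                             job INTEGER REFERENCES job(id));
    INSERT INTO job VALUES (1, 0);
    INSERT INTO buildqueue VALUES (10, 1);
""")

try:
    conn.execute("SELECT id FROM buildqueue JOIN job ON job.id = buildqueue.job")
    ambiguous_ok = True
except sqlite3.OperationalError as e:
    ambiguous_ok = False
    print(e)  # sqlite reports the ambiguity instead of guessing

# Qualifying the column, as in the corrected query, resolves it.
rows = conn.execute(
    "SELECT buildqueue.id FROM buildqueue JOIN job ON job.id = buildqueue.job"
).fetchall()
print(rows)
```

PostgreSQL rejects the unqualified form the same way ("column reference \"id\" is ambiguous"), which is presumably what prompted the corrected query.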
[03:30] <spm> one looks suspiciously old. very low job# compared to the rest
[03:30] <wgrant> There should be nothing sensitive there (I excluded logtail). Can you pastebin, please?
[03:31] <spm> http://paste.ubuntu.com/466237/
[03:31] <wgrant> Oh wow :(
[03:31] <spm> hrm? in what sense?
[03:31] <wgrant> Can you 'SELECT MAX(id) FROM buildqueue;' just to see how old those are?
[03:32] <spm> 3708959 eek
[03:32] <wgrant> Ah, so nothing new. Good.
[03:32] <wgrant> Just from the early days of the move to the job system, I suspect.
[03:33] <wgrant> (those are all queued builds that are being ignored, so are somehow corrupt)
[03:35] <spm> ahh I see. is this something that should be cleaned up? or can be happily ignored?
[03:36] <wgrant> We need to clean it up. I guess I'll talk to Julian about it.
[03:37] <spm> oki, ta
[03:37] <wgrant> SELECT * FROM buildpackagejob WHERE job=2691238;
[03:38] <wgrant> I suspect it's the BuildPackageJobs that are missing.
[03:38] <spm>    id   |   job   |  build
[03:38] <spm> --------+---------+---------
[03:38] <spm>  495749 | 2691238 | 1684155
[03:38] <wgrant> Damn.
[03:45] <mtaylor> spm: ola amigo. como estas
[03:45] <mtaylor> spm: or, should I say - ¿como estas?
[03:46] <spm> mtaylor: alas, my spanish is limited to that picked up via overhearing Dora the Explorer. So beyond Hola and Gracios(sp?) you've lost me :-)
[03:46] <mtaylor> spm: ¿como estas? means "what's up?" - although I honestly don't speak much myself
[03:47] <spm> heh
[03:47] <mtaylor> but I'm in cozumel, mx, so I'm trying to make an effort past "uno mas magarita, por favor"
[03:48] <wgrant> spm: One last one:
[03:48] <wgrant> SELECT buildqueue.id, builder, lastscore, buildqueue.job, job_type, processor, build FROM buildqueue JOIN job ON job.id = buildqueue.job JOIN buildpackagejob ON buildpackagejob.job = job.id WHERE virtualized=true AND job.status = 0;
[03:48] <wgrant> Then I can examine the builds myself.
[03:48] <spm> mtaylor: ha!
[03:49]  * mtaylor cringes that you're implementing queues in the database... then decides to shut his mouth before he's asked to fix it
[03:49] <spm> wgrant: http://paste.ubuntu.com/466245/
[03:49] <spm> what do you mean *implementing*?!? implemented. :-)
[03:50] <mtaylor> *shudder* :)
[03:50] <wgrant> mtaylor: Before my time :(
[03:50] <wgrant> This is five years old :(
[03:50] <spm> I believe plans are to move to something like rabbit, or whatever. but nfi.
[03:50] <mtaylor> wgrant: it's ok - in my days as a mysql consultant, I saw _many_ _MANY_ things implemented inside an RDBMS that didn't belong there
[03:51] <mtaylor> but my favorite things people mis-use dbs for are queues and email
[03:52] <mtaylor> free tip #1 from formerly over-paid db consultant - if your app needs some sort of message system that's similar to email ... USE EMAIL SERVERS .. don't half-way implement private broken email in database tables
[03:52] <mtaylor> spm: :)
[03:52] <wgrant> spm: Ah, it's not a bug after all. Most of those are builds that were cancelled by manual SQL to mark them superseded.
[03:52] <spm> ah
[03:53] <wgrant> Not all, but most. I'll work out how to clean them up.
[03:53] <wgrant> Thanks.
[03:53] <spm> mtaylor: that sounds suspiciously like heresy. email for email!?!? I mean, srsly!?!?
[03:54] <spm> wgrant: np
[03:55] <mtaylor> spm: I know - right? you should have seen the look on the client's face the first time I suggested that as the fix for their performance issues
[03:55] <mtaylor> "um, hai! you've implemented email in php ... try learning the internet"
[03:55] <spm> heh
[03:56] <spm> NIH. Alive and well.
[03:56] <mtaylor> yup. also "I don't want to learn anything"
[04:58] <lifeless> morning
[05:19]  * wgrant is still looking for someone to land three branches.
[05:29] <mwhudson> here's two fun lines to have next to each other:
[05:29] <mwhudson> from canonical.launchpad.layers import WebServiceLayer
[05:29] <mwhudson> from canonical.testing.layers import FunctionalLayer
[05:43] <lifeless> mwhudson: can you help wgrant out?
[05:43] <lifeless> pretty please?
[05:46] <mwhudson> ah right
[05:46] <mwhudson> wgrant: url me up
[05:47] <wgrant> mwhudson: lp:~wgrant/launchpad/bug-598345-restrict-dep-contexts, lp:~wgrant/launchpad/refactor-_dominateBinary and lp:~wgrant/launchpad/really-publish-ppa-ddebs
[05:48]  * mwhudson grrs at not being able to give branch urls to ec2 land
[05:49] <wgrant> Can't you?
[05:49] <wgrant> I thought you could now.
[05:49] <mwhudson> well it didn't work for these
[05:55] <wgrant> Does anyone happen to know if BFB works for Hardy yet?
[05:59] <mwhudson> wgrant: ok, three instances started
[06:00] <mwhudson> wgrant: did it not work for hardy at some point?
[06:01] <wgrant> mwhudson: Oh, right, I remember now.
[06:01] <wgrant> Its bzr was old, so it didn't work with 2a.
[06:03] <wgrant> Thanks for sending those off.
[06:23] <wgrant> mwhudson: Ah, no, it is really broken like this: http://launchpadlibrarian.net/52193804/buildlog.txt.gz
[06:23] <wgrant>   bzr-builder: Depends: python-debian but it is not going to be installed
[06:24] <mwhudson> wgrant: weee
[06:26]  * wgrant builds it the old way :(
[08:03] <lifeless> wgrant: hi
[08:39] <wgrant> lifeless: Hi.
[08:40] <rockstar> mtaylor, re: Tarmac needing commit message set... The reasoning is because Tarmac doesn't know what to set the commit message to when it commits if you don't specify it.
[08:40] <rockstar> mtaylor, I realize that this is rather awkward. Merge queues will fix that.
[08:40]  * rockstar keeps rolling the squeaky wheel
[08:40] <poolie> what advice do we normally get to people who don't receive the account confirmation mail?
[08:41] <lifeless> check spam
[08:41] <wgrant> SSO does hate some domains, though.
[08:43] <poolie> or they hate us?
[08:43] <poolie> is there any escape?
[08:44] <wgrant> Well, OK, one of my email aliases that is handled by a Canonical machine does not get LP or SSO email.
[08:44] <lifeless> it becomes a losa ping
[08:44] <wgrant> Everything else works.
[08:44] <wgrant> So there's something not quite right about some email setup somewhere.
[08:44] <lifeless> wgrant: orly? have you filed an rt?
[08:45] <wgrant> lifeless: The address isn't important to me, so I never really bothered.
[08:48] <wgrant> lifeless: What was that ping before about?
[08:48] <lifeless> buildds
[08:48] <lifeless> seen the thread on deployment?
[08:48] <wgrant> Yes.
[08:48] <wgrant> They don't care if the connection dies.
[08:48] <lifeless> can you confirm or deny slave behaviour ?
[08:49] <wgrant> You can kill buildd-manager at any time, and it will be fine.
[08:49] <lifeless> \o/
[08:49] <lifeless> elmo: ^
[08:49] <lifeless> wgrant: what other things can go wrong
[08:49] <wgrant> buildd-manager and the protocol are pretty stateless.
[08:49] <wgrant> (apart from DB build state)
[08:49] <lifeless> sure
[08:50] <lifeless> wgrant: is there anything that can be done to make the publisher runs able to be interrupted without causing havoc ?
[08:50] <wgrant> lifeless: Um.
[08:50] <wgrant> They shouldn't be tooo bad at the moment.
[08:50] <lifeless> any manual recovery == too bad
[08:50] <wgrant> For PPAs it would be really bad.
[08:50] <wgrant> For primary it should just about work.
[08:51] <wgrant> (primary has atomic dists/ update; PPAs do not)
[08:54] <wgrant> Phase A (file publishing) is fine, since it doesn't matter if we publish something again on the next run.
[08:55] <wgrant> B (domination) is all in one transaction, and doesn't touch the filesystem at all.
[08:55] <wgrant> But C and D touch indices, so will leave the archive inconsistent.
[08:56] <wgrant> Hm, actually, it's slightly worse.
[08:56] <wgrant> If we get through A fine, commit, then get terminated, there'll be no dirty pockets to force index regeneration when everything is rerun.
[08:57] <lifeless> 2PC needed ?
[08:57] <wgrant> That would mean a half-hour transaction.
[08:57] <lifeless> ugh
[08:57] <wgrant> (alternatively we could make the publisher less goddamn slow)
[08:57] <lifeless> or start domination later ?
[08:58] <wgrant> So, there are two problems:
[08:58] <lifeless> there are two sorts of people in the world ....
[08:58] <wgrant> 1) We rely on state internal to the publisher process to work out whether we need to regenerate indices.
[08:58] <wgrant> 2) Termination during index generation will leave the archive inconsistent for at least a few minutes.
[08:59] <lifeless> seems like they are both addressed by making index regeneration into a separate task
[08:59] <lifeless> pipeline-like
[08:59] <wgrant> Applying the atomic dists/ update system to all archives would solve #2 simply.
[09:00] <wgrant> For #1... we may need to store dirty pockets in the DB.
[09:00] <lifeless> also less code variation - ++
[09:03] <wgrant> Perhaps.
[09:03] <wgrant> But the way it's done now is not exactly... acceptable.
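The "store dirty pockets in the DB" idea from this exchange can be sketched briefly. Assuming a hypothetical `dirty_pocket` table (not Launchpad's actual schema), the publisher would record dirtiness in the same transaction as phase A, so being killed before index generation leaves the marker behind for the next run:

```python
import sqlite3

# Hedged sketch of the idea above: record "dirty" pockets in the
# database instead of in publisher process memory, so a publisher
# killed after phase A still regenerates indices on its next run.
# The table and names here are hypothetical, not Launchpad's schema.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE dirty_pocket ("
    " archive TEXT, pocket TEXT, PRIMARY KEY (archive, pocket))"
)

def publish_files(db, archive, pocket):
    # Phase A: publish files and mark the pocket dirty in one transaction.
    with db:
        db.execute("INSERT OR IGNORE INTO dirty_pocket VALUES (?, ?)",
                   (archive, pocket))

def regenerate_indices(db):
    # Any pocket still marked dirty gets its indices rebuilt; the marker
    # is cleared only once that work has been done.
    dirty = db.execute("SELECT archive, pocket FROM dirty_pocket").fetchall()
    for archive, pocket in dirty:
        pass  # ... write Packages/Sources/Release for (archive, pocket) ...
    with db:
        db.execute("DELETE FROM dirty_pocket")
    return dirty

publish_files(db, "ppa:example", "Release")
# Even if the process died right here, the marker would survive in the
# database, so the next run still knows which indices to rebuild.
print(regenerate_indices(db))
```

The point is only that the "which pockets need indices" state crosses process restarts; the real fix would live in PostgreSQL alongside the publishing records.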
[09:03] <mrevell> Guten Morgen.
[09:06] <lifeless> wgrant: are bugs filed to fix it?
[09:07] <wgrant> lifeless: It's cron.publish-ftpmaster. Everyone is terrified.
[09:07] <lifeless> is it risky ?
[09:09] <wgrant> It's evil and does some things that nobody understands any more.
[09:09] <wgrant> Well, I guess that's just dsync.
[09:13] <wgrant> Morning bigjools.
[09:13] <wgrant> You may be, er, interested to know that we have somewhere around 146 inconsistent builds on production.
[09:13] <wgrant> They have BuildQueues and Jobs, but are not actually pending.
[09:15] <bigjools> sigh
[09:15] <bigjools> givebacks?
[09:15] <wgrant> Some are SUPERSEDED, others are FULLYBUILT.
[09:15] <bigjools> did I just catch lifeless looking at cron.publish-ftpmaster?
[09:15] <wgrant> The former can be explained by LOSAs manually cancelling builds, I suppose.
[09:16] <bigjools> I expect so - doing it wrong.  Are they package builds?
[09:16] <bigjools> binary that is
[09:16] <wgrant> They're all BPBs, yes.
[09:16] <bigjools> are they rebuilds?
[09:16] <wgrant> http://paste.ubuntu.com/466245/ are the virtualized ones. Although there are a couple of legitimate builds that snuck in there.
[09:17] <bigjools> how was this discovered?
[09:17] <wgrant> The ones with score == 0 are the buggy ones.
[09:17] <wgrant> I was QAing my getBuildQueueSizes change.
[09:17] <wgrant> Compare the build queue sizes on edge and production.
[09:17] <bigjools> sigh
[09:18] <bigjools> lastscore=0 indicates a retry
[09:18] <wgrant> Ah, true. I guess it would be NULL if they'd just finished naturally.
[09:19] <wgrant> They're fortunately all fairly old.
[09:19] <bigjools> I need to clean up the rebuild anyway
[09:20] <wgrant> There are 80 or so non-virt builds which I didn't query for, since the non-virt build queues are large.
[09:20] <bigjools> how old?
[09:21] <wgrant> Well, we're up to BQ 3700000 or so.
[09:21] <wgrant> Most of them are around 3330000
[09:21] <bigjools> did you get any dates?
[09:22] <wgrant> They should be in Job, but I didn't query for them, no.
[09:22] <wgrant> I decided I'd bothered people enough.
[09:23] <bigjools> it's so much harder to query across jobs now :(
[09:24] <wgrant> http://paste.ubuntu.com/466245/  has the query
[09:24] <bigjools> I have some pre-potted ones
[09:24] <wgrant> Just throw a couple of extra columns like BPB.status and Job.date_created in, I suppose.
[09:25] <bigjools> argh, I can't even do this on staging any more
[09:25] <wgrant> No perms?
[09:25] <bigjools> no, we wipe the queues when restoring
[09:26] <bigjools> to stop staging collecting builds from production
[09:26] <wgrant> Oh, true.
[09:26] <wgrant> So, the few builds on there from before 3300000 are all SUPERSEDED.
[09:26] <wgrant> Then there are a couple around 3300000
[09:26] <wgrant> And the rest around 3330000
[09:27] <wgrant> I haven't checked the statuses for the last two categories, though
[09:32] <bigjools> wgrant: there's one (!) like that on dogfood
[09:33] <wgrant> bigjools: What's its status?
[09:33] <wgrant> Build status, not Job status.
[09:37]  * bigjools rides SQL wild horses
[09:38] <lifeless> bigjools: asking about things that will make deployment hard
[09:38] <lifeless> bigjools: easy deployment is a really important thing for iterating on performance at faster than monthly cycles.
[09:38] <bigjools> lifeless: do you know how it will make it hard?
[09:39] <lifeless> bigjools: no idea, wgrant mentioned it, is all
[09:39] <bigjools> other than being a PoS bash script, I don't see the problem
[09:39] <lifeless> bigjools: for the publisher specifically I'm told that interrupting it causes damaged PPA archives until its run again, and in some cases until the PPA has a new build and the publisher runs again.
[09:40] <bigjools> yes it's possible, but easy to fix
[09:40] <wgrant> bigjools: If we die during !primary index generation, the indices will be inconsistent until the next run.
[09:40] <wgrant> And if we die before index generation, there'll be no dirty pockets, so no index regeneration will occur.
[09:40] <bigjools> we need to write to a tmp dir like ubuntu does
[09:40] <wgrant> That's what I suggested.
[09:41] <wgrant> But it's not trivial to port, given how it's done now.
[09:41] <bigjools> and remove the stupid partial commits
[09:41] <wgrant> We can't do that until we publish everything really quickly.
[09:41] <lifeless> bigjools: the impact of that on the deployment story is that we have to be careful about when we start deployments, shutting down the task well in advance, which adds latency and that adds perceived downtime.
[09:41] <wgrant> I think a half-hour transaction over the whole primary archive would make stub cry a bit.
[09:42] <lifeless> none of this is intractable, but the more steps that are needed, the more coordination, the slower the process is.
[09:42] <bigjools> yeah, whoever wrote the publisher knew nothing about recoverability
[09:43] <lifeless> yup
[09:43] <wgrant> bigjools: Alternatively, we could write indices atomically like the primary archive, and store dirty pockets somewhere persistent.
[09:43] <wgrant> Everything else should work.
[09:43] <lifeless> wgrant: everything you do to make the system more resilient, j'adore
[09:43] <bigjools> me likey atomic
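The atomic scheme referred to here (build the complete index tree off to the side, then swap it in) can be sketched with plain renames. Directory names like `dists.in-progress` and `dists.old` are illustrative, not what the real publisher uses:

```python
import os
import shutil
import tempfile

def atomic_publish(archive_root, new_tree):
    """Swap new_tree in as archive_root/dists via renames.

    Each rename is atomic on POSIX, so a reader (or a killed
    publisher) sees either the old tree or the new one, never a
    half-written mixture.
    """
    live = os.path.join(archive_root, "dists")
    old = os.path.join(archive_root, "dists.old")
    if os.path.exists(old):
        shutil.rmtree(old)       # discard the previous generation
    if os.path.exists(live):
        os.rename(live, old)     # retire the live tree
    os.rename(new_tree, live)    # the new tree appears all at once

# Usage: write the complete index tree into a staging directory first.
root = tempfile.mkdtemp()
staging = os.path.join(root, "dists.in-progress")
os.makedirs(os.path.join(staging, "lucid", "main"))
with open(os.path.join(staging, "lucid", "Release"), "w") as f:
    f.write("Origin: example\n")

atomic_publish(root, staging)
print(sorted(os.listdir(os.path.join(root, "dists"))))
```

There is still a brief window between the two final renames where `dists/` is absent; a single `os.rename` over a symlink is the usual refinement if even that window matters.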
[09:44] <wgrant> Or we make the publisher complete in a few seconds, and do it all in one transaction.
[09:44] <wgrant> But I can't get it much below two minutes.
[09:44] <lifeless> can you make it pipeline / incremental ?
[09:45] <wgrant> The thing that takes all the time (after optimisation) is serialising the indices.
[09:45] <lifeless> does the db know the indices ?
[09:45] <bigjools> pipelining is possible but has ramifications
[09:46] <lifeless> could you, just commit, then do the indices as a read-back from the db ?
[09:46] <lifeless> (in principle)
[09:48] <bigjools> lifeless: you mean lazy generation?
[09:48] <bigjools> not sure I follow
[09:49] <jml> I'm fixing the conflict, btw
[09:50] <jml> is there any meaningful difference between IStore(self) and Store.of(self)?
[09:50] <jml> IStore uses c.l.webapp.adapter.get_store
[09:51] <lifeless> I think the second reads more easily
[09:51] <wgrant> Does the former respect the master/slave policy?
[09:51] <lifeless> bigjools: I mean doing the transaction commit as soon as possible
[09:51] <lifeless> bigjools: and generating the indices from the result of the commit, not within the commit
[09:52] <wgrant> lifeless: The problem is that if we commit before doing indices, we don't know that we need to regenerate them later.
[09:52] <bigjools> lifeless: what wgrant said
[09:52] <lifeless> wgrant: that seems fixable
[09:52] <bigjools> hehe - take a look at the publisher :)
[09:52] <wgrant> lifeless: Right, by storing dirty pockets in the DB, which fixes it all.
[09:52] <wgrant> But is very ugly.
[09:53] <lifeless> compared to 30 minute db transactions with uninterruptable cron scripts that prevent rollouts for 30% of the day.
[09:53] <lifeless> not ugly at all.
[09:53] <wgrant> Shhhhh.
[09:54] <lifeless> I'm very keen to see things here improved.
[09:54] <lifeless> The general principle of 'do small amounts of work, often' and 'delay till outside of transactions things that don't need to be in the transaction' are very dear to me.
[09:56] <jml> wgrant, the former goes straight to the master, I think.
[09:56] <wgrant> jml: Ah.
[10:04] <bigjools> wgrant: does this look sane to display the affected builds? http://pastebin.ubuntu.com/466361/
[10:06] <mwhudson> wgrant: your branches were 1 of 3, did you get the emails?
[10:12] <wgrant> mwhudson: Yeah, saw that. Thanks.
[10:12] <wgrant> bigjools: I would have said buildfarmjob.status NOT IN (0, 6), but looks OK.
[10:13] <bigjools> wgrant: using NOT IN makes queries slow
[10:13] <wgrant> bigjools: Ah, I guess so.
[10:13] <wgrant> bigjools: Also, grab buildqueue.virtualized.
[10:13] <bigjools> why?
[10:14] <wgrant> Why not?
[10:14] <wgrant> Might as well get all the categorisation information.
[10:16] <wgrant> I only omitted it from the initial query because I was restricting to virt jobs.
[10:23]  * wgrant grumbles about the incomplete Registry<->Soyuz split.
[10:41] <wgrant> mwhudson: Tests fixed. Can you please rerun, if you're still around?
[10:42] <wgrant> I guess not.
[10:43] <wgrant> Can someone else please re-EC2 lp:~wgrant/launchpad/refactor-_dominateBinary and lp:~wgrant/launchpad/bug-598345-restrict-dep-contexts?
[10:49] <bigjools> wgrant: problematic queue rows blitzed, how's it looking?
[10:49] <wgrant> bigjools: So none of them were actually pending?
[10:50] <bigjools> some were but I left those alone ;)
[10:50] <wgrant> That looks OK.
[10:50] <wgrant> The numbers match now.
[10:50] <bigjools> \\o/
[10:50] <wgrant> I will be really glad when we get the model rework done, and such inconsistency becomes impossible.
[10:50] <bigjools> yarp
[10:51] <wgrant> Because we will be able to avoid storing the status in four or five places...
[10:51] <bigjools> although notice that all of those were from march/april
[10:51] <wgrant> Yeah.
[10:52] <bigjools> wgrant: https://edge.launchpad.net/~oem-archive/+archive/budapest/+build/1880257
[10:52] <bigjools> can you see that?
[10:53] <wgrant> bigjools: No. Is that the one that failed to upload this morning?
[10:54] <bigjools> yes
[10:54] <wgrant> I considered going hunting for the upload log, but decided the search space was slightly too big.
[10:54] <bigjools> only it says it's built properly so it looks like it dispatched twice :/
[10:54] <wgrant> What was the upload error?
[10:54] <bigjools> PM
[10:54] <wgrant> Or is it FULLYBUILT now?
[10:54] <wgrant> ?
[10:55] <wgrant> Oh.
[10:55] <wgrant> Right.
[10:55] <bigjools> it's built
[10:55] <wgrant> So it was built four times.
[10:55] <wgrant> Succeeded the first and last.
[10:55] <wgrant> But somehow was retried after the first.
[10:55] <wgrant> Yay.
[10:55] <wgrant> I thought we'd weeded out all of that :(
[11:05] <bigjools> wgrant: I suspect double clicking on UI buttons
[11:05] <wgrant> bigjools: But... transactions.
[11:05] <bigjools> wgrant: that already causes havoc on copy packages
[11:05] <wgrant> For copies, sure.
[11:05] <bigjools> wgrant: they don't help if the db constraints don't catch the problem
[11:05] <wgrant> bigjools: But a retry resets the BFJ status.
[11:06] <wgrant> They both have to update the same row.
[11:06] <lifeless> I have a suspicion something is double-forwarding every now and then
[11:06] <lifeless> or something
[11:06] <bigjools> to the same thing
[11:06] <lifeless> see the bug I filed about getting two bugs
[11:06] <wgrant> bigjools: I really hope that Postgres doesn't accept that.
[11:06] <bigjools> wgrant: unless there's a constraint, it will
[11:06] <bigjools> lifeless: really? ugh
[11:08] <lifeless> elmo: https://bugs.edge.launchpad.net/soyuz/+bug/607397
[11:08] <_mup_> Bug #607397: buildds need to survive the buildd master being upgraded <Soyuz:Incomplete> <https://launchpad.net/bugs/607397>
[11:08] <lifeless> elmo: can you please describe there the build farm issue you related to me - or perhaps its no longer an issue ?
[11:09] <wgrant> bigjools: With SERIALIZABLE as the isolation level, a concurrent update like that is prevented.
[11:09] <wgrant> I just checked.
[11:09] <wgrant> I wonder what Storm uses, though.
[11:10] <lifeless> elmo: nvm
[11:10] <wgrant> It defaults to SERIALIZABLE, as I'd hoped.
[11:12] <wgrant> Ahh.
[11:12] <wgrant> But LP overrides it to READ COMMITTED, and that allows it.
[11:12] <wgrant> Damn.
[11:13] <lifeless> bwah
[11:13] <lifeless> all of lp  is read committed ?
[11:13] <wgrant> I think so. Need to check Postgres logs harder.
[11:13] <bigjools> that's.... not good
[11:13]  * wgrant looks harder.
[11:14] <wgrant> At least the appserver transactions immediately set it to READ COMMITTED, or so the postgresql logs show.
[11:15] <lifeless> bah
[11:15] <lifeless> wgrant: you broke our builds
[11:15] <lifeless> :P
[11:15] <wgrant> lifeless: It was too quick to be my fault :(
[11:15] <lifeless> Build Reason:
[11:15]  * wgrant blames the build system.
[11:15] <lifeless> Build Source Stamp: [branch bzr+ssh://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel] HEAD
[11:15] <lifeless> Blamelist: William Grant <me@williamgrant.id.au>
[11:15] <lifeless> BUILD FAILED: failed shell_6 compile
[11:15] <wgrant> Oh yes, I got the email.
[11:15] <wgrant> But the test suite doesn't take an hour.
[11:16] <wgrant> So either I broke things really badly, or the build system is broken as usual.
[11:16] <wgrant> Or someone turned on parallelisation while I wasn't looking.
[11:16] <lifeless> no
[11:17] <lifeless> not done yet
[11:18] <wgrant> bigjools: Ah, wait, there's a UNIQUE on buildpackagejob.build. So you can't queue the build twice regardless of how screwed LP's default transaction level may or may not be.
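The protection that UNIQUE gives here can be demonstrated with a simplified stand-in table (sqlite3 below; the real table is PostgreSQL, but UNIQUE rejects the duplicate the same way, regardless of isolation level):

```python
import sqlite3

# Simplified stand-in for buildpackagejob, not the real schema: the
# UNIQUE constraint on "build" means a second attempt to queue the
# same build fails at the database, whatever the transaction
# isolation level happens to be.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE buildpackagejob (
        id INTEGER PRIMARY KEY,
        job INTEGER NOT NULL,
        build INTEGER NOT NULL UNIQUE
    )
""")
db.execute("INSERT INTO buildpackagejob (job, build) VALUES (2691238, 1684155)")

try:
    # A double click / double dispatch tries to queue the same build again.
    db.execute("INSERT INTO buildpackagejob (job, build) VALUES (9999999, 1684155)")
    print("second insert succeeded")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Which is why the double-build observed above has to have some other cause than a simple duplicate queue row.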
[11:19] <bigjools> that's the kind of constraint I like
[11:19] <wgrant> So we have more hunting to do.
[11:19] <bigjools> grar
[11:19] <wgrant> Anything in the logs yet, or must we go librarian diving?
[11:20] <bigjools> hang on
[11:22] <wgrant> Hmm. The librarian GC is rather aggressive :/
[11:22] <lifeless> mpt: hai
[11:22] <lifeless> mpt: lunch is @ 1
[11:22] <lifeless> mpt: are you planning to starve me?
[11:22] <wgrant> It will immediately kill anything unreferenced and with no expiry date set.
[11:23] <mpt> lifeless, clearly, by "1" I meant "2"
[11:23] <lifeless> awesome
[11:23] <lifeless> Ursinha-afk: how does the oops <-> bug stuff work ?
[11:24] <lifeless> wgrant: are you working on 78 SELECT COUNT(*) FROM Archive, BinaryPackageBuild, BuildFarmJob, PackageBuild WHERE distro_arch_se ... tus=$INT AND Archive.purpose IN ($INT,$INT) AND Archive.id = PackageBuild.archive AND ($INT=$INT):
[11:24] <lifeless>  ?
[11:24] <wgrant> jml: Um, canonical.uuid has been gone for more than a hundred devel revisions...
[11:24] <wgrant> lifeless: I don't really know how to fix it.
[11:25] <wgrant> There's nothing obviously wrong with it that I can see.
[11:25] <wgrant> http://paste.ubuntu.com/465800/ is the EXPLAIN ANALYZE of the other slow query, which is just about the same.
[11:26] <lifeless> right
[11:26] <lifeless> so dropping the count * separate query will save 5 seconds
[11:26] <lifeless> sorry 7.7
[11:26] <wgrant> But that's lazr.restful.
[11:26] <wgrant> Not sure we can do much about that.
[11:26] <wgrant> And the query shouldn't take long at all anyway.
[11:26] <lifeless> sure we can
[11:26] <lifeless> is there a bug for it ?
[11:26] <wgrant> The slow queries?
[11:26] <jml> wgrant, I'm just the messenger.
[11:26] <lifeless> the problem
[11:26] <wgrant> I filed one a month or two ago.
[11:26] <lifeless> whats the number
[11:27]  * wgrant is hunting.
[11:27] <wgrant> If bug search was a little better...
[11:27] <lifeless> hush
[11:27] <wgrant> Bug #590708
[11:27] <_mup_> Bug #590708: DistroSeries.getBuildRecords often timing out <api> <oops> <soyuz-build> <timeout> <Soyuz:Triaged by michael.nelson> <https://launchpad.net/bugs/590708>
[11:28] <wgrant> jml: Well, since I have no recent shipit, I can do nothing.
[11:29] <lifeless> bigjools: hi - https://bugs.edge.launchpad.net/soyuz/+bug/590708
[11:29] <_mup_> Bug #590708: DistroSeries.getBuildRecords often timing out <api> <oops> <soyuz-build> <timeout> <Soyuz:Triaged by michael.nelson> <https://launchpad.net/bugs/590708>
[11:30]  * wgrant just commented with the paste.
[11:30] <wgrant> lifeless: I wonder if we should test kernel delayed copies and acceptance from +queue before taking the timeout down permanently. Those are done infrequently, take ages, and it's pretty bad if they stop working.
[11:34] <lifeless> wgrant: can you add a test plan for testing them on staging ?
[11:34] <lifeless> wgrant: staging is at 12 seconds.
[11:34] <wgrant> lifeless: Um, I'm not sure if testing on staging is valid.
[11:35] <lifeless> wgrant: why wouldn't it be ?
[11:35] <wgrant> lifeless: It sucks performance-wise.
[11:36] <lifeless> right
[11:36] <lifeless> so if it works on staging, we're set for prod.
[11:36] <wgrant> True.
[11:45] <wgrant> bigjools: Any luck? It'd be nice to get onto it before librarian-gc deletes the evidence.
[11:45] <bigjools> wgrant: no, best start diving
[11:50] <wgrant> bigjools: I guess you could just look for any recent restricted 'uploader.log's...
[11:51] <bigjools> urh
[11:51] <wgrant> Since buildd-manager's log doesn't seem to be much help in this sort of situation.
[12:00] <lifeless> wgrant: so, I've commented and escalated this
[12:00] <wgrant> lifeless: Thanks.
[12:01] <lifeless> wgrant: I think lazr restful is hurting us here and we may want to change it.
[12:01] <wgrant> lifeless: Possibly, that probably breaks all clients in the wild.
[12:01] <lifeless> separately fixing up the query to not do table scans - +1
[12:01] <lifeless> flacoste: http://people.canonical.com/~flacoste/tags-burndown-report.html is not updating ?
[12:01] <lifeless> wgrant: on this url, they are already broken.
[12:02] <deryck> Morning, all.
[12:02] <lifeless> it was timing out regularly on prod before
[12:02] <lifeless> now it's just -clear-  :)
[12:02] <lifeless> hey deryck
[12:02] <wgrant> Can someone please re-EC2 lp:~wgrant/launchpad/refactor-_dominateBinary and lp:~wgrant/launchpad/bug-598345-restrict-dep-contexts?
[12:08] <bigjools> wgrant: I can do it locally if nobody volunteers ec2
[12:08] <bigjools> btw I haz logs
[12:08] <wgrant> Ooh, logs.
[12:08] <wgrant> Are the logs useful?
[12:08] <bigjools> somewhat, I'll PM you
[13:15] <poolie> lifeless, hi?
[13:19] <danilos> adiroiban, hi, it seems there's a conflict in +templates fix of yours now (I'm trying to land it); can you please take a look and fix it :)
[13:21]  * wgrant still needs someone to land those two branches.
[13:21] <adiroiban> danilos: Hi. Looking...
[13:23] <danilos> wgrant, got MPs for me that I can just pass into "ec2 land"? (fwiw, I had some problems with ec2 land lately, so I am not promising it will work)
[13:24] <wgrant> danilos: Thanks. https://code.edge.launchpad.net/~wgrant/launchpad/refactor-_dominateBinary/+merge/29667 and https://code.edge.launchpad.net/~wgrant/launchpad/bug-598345-restrict-dep-contexts/+merge/30203 are the MPs.
[13:34] <jml> mars, hi
[13:34] <mars> Hi jml, what's up?
[13:34] <jml> mars, I don't understand your recent email.
[13:35] <mars> jml, the "Hurray for failing fast" one?
[13:35] <jml> mars, yes. the build *did* go on to fail with indecipherable errors.
[13:36] <mars> build 1066, right?
[13:36] <mars> According to the waterfall, I see "pull new revisions [failed]", and "compile [failed]", and that's it
[13:36] <mars> According to this it never ran the test suite
[13:37] <jml> mars, it still tried to.
[13:37] <jml> mars, and the error the compile fails with is: zope.configuration.xmlconfig.ZopeXMLConfigurationError: File "/srv/buildbot/slaves/launchpad/devel/build/script.zcml", line 7.4-7.35
[13:37] <jml>     ZopeXMLConfigurationError: File "/srv/buildbot/slaves/launchpad/devel/build/lib/canonical/configure.zcml", line 157.4-158.42
[13:37] <jml>     ZopeXMLConfigurationError: File "/srv/buildbot/slaves/launchpad/devel/build/lib/canonical/shipit/configure.zcml", line 55.4
[13:37] <jml>     ImportError: No module named uuid
[13:37] <jml> mars, I mean, it's definitely better than running the whole test suite.
[13:38] <jml> mars, which is wonderful :)
[13:38] <mars> hmm
[13:39] <mars> I don't know why it moved on to compile_6.  Just a sec, checking the config
[13:39] <jml> but if a dependent branch had changed in a subtler way, it still would have gone on to run the whole suite
[13:41] <mars> interesting
[13:41] <mars> my fix did *not* land
[13:42] <mars> the compile steps are supposed to halt the build by default
[13:42] <mars> perhaps that is what caught it
[13:42] <jml> what caught it was that it's an import error
[13:42] <mars> mthaddon, ping, was there any word on landing my buildbot "fail fast" config change?
[13:42] <jml> and the apidoc generation has to import everything
[13:42] <mthaddon> mars: we landed it but didn't restart the builder - want me to do that now?
[13:43] <jml> oh I see, there were no steps beyond the compile one.
[13:43] <jml> actually, no, there were
[13:43] <mars> mthaddon, sure, there are three branches in there, but we have to do it sometime :/
[13:45] <mars> jml, yes, but you are right about the subtle-changes thing.  If it misses pulling GPG or something, then it happily goes onward into the suite.  We are lucky it happened to fail and halt on the compile step.
[13:45] <adiroiban> danilos: conflict solved and I have pushed the changes
[13:46] <adiroiban> danilos: the branch should be ready for ec2-test and landing
[13:46] <jml> mars, ok cool. I'll gladly watch my two branches get delayed if your fix for that gets deployed & works.
[13:46] <jml> (will I have to resubmit?)
[13:46] <jml> actually, I guess it's just force another rebuild
[13:46] <mars> jml, nope, it will be pulled into the next build
[13:46] <mars> right
[13:47]  * jml moves on to his next problem
[13:47] <jml> how can I run pyflakes on doctests?
[13:47] <jml> I seem to remember being able to do so
[13:49] <Ursinha> or, lifeless, hi
[13:49] <mars> jml, if you don't find something in the list archive, it might be on the Hacking pages
[13:50] <mars> jml, check your "Sent" folder, dated 13/7/2009, "[EMACS] Another flymake trick"
[13:51] <lifeless> Ursinha: hi! ;)
[13:51] <Ursinha> lifeless, you asked about the oops <-> bug link, I assume you're talking about the bug link in the oopses?
[13:51] <jml> mars, :) thanks.
[13:51] <Ursinha> lifeless, it was too early in this timezone when you pinged me :)
[13:51] <lifeless> Ursinha: yeah
[13:51] <lifeless> Ursinha: and yes, I asked for your backscroll :)
[13:52] <lifeless> Ursinha: I also filed a bug
[13:52] <jml> mars, actually, nothing useful in there :\
[13:52] <Ursinha> lifeless, let me see
[13:53] <barry> losa ping: bazaar.lp.net seems unhappy. can i haz restart?
[13:53] <lifeless> barry: hi
[13:53] <lifeless> barry: whats your bug # ?
[13:53] <mars> jml, Ah, sorry, thought that mail was right on target.  Maybe the great Warsaw knows
[13:53] <mthaddon> barry: unhappy in what way?
[13:53] <barry> mthaddon, can't connect
[13:54] <mthaddon> barry: via bzr+ssh?
[13:54] <lifeless> barry: its happy for me
[13:54] <barry> lifeless, bug #?
[13:54] <lifeless> barry: try turning on your network
[13:54] <barry> http://bazaar.launchpad.net/~ubuntu-dev/ubuntu-dev-tools/trunk/files/head:/doc/
[13:54] <lifeless> ?
[13:54] <lifeless> barry: you said you had a page timing out
[13:54] <mthaddon> barry: that's codebrowse - was just restarted
[13:54] <mars> barry, works for me
[13:54] <barry> lifeless, ah, yes
[13:54] <lifeless> barry: also if that page comes up
[13:54] <barry> lifeless, https://edge.launchpad.net/~pythoneers/+archive/py27stack4/+packages?start=0&batch=200
[13:54] <Ursinha> lifeless, bug 607087 ?
[13:54] <_mup_> Bug #607087: enable 'search by method' <OOPS Tools:New> <https://launchpad.net/bugs/607087>
[13:54] <lifeless> barry: wait 60 seconds and hit ctrl-R
[13:54] <barry> mthaddon, thanks! works now
[13:54] <lifeless> Ursinha: no, a new one ;)
[13:55] <lifeless> Ursinha: I'd love that one fixed too, of course ;)
[13:55] <Ursinha> hehe
[13:55] <lifeless> Ursinha: I filed the other one in launchpad-foundations
[13:55] <Ursinha> ah
[13:55] <Ursinha> lifeless, do you have the #:
[13:55] <lifeless> because its about our docs
[13:55] <lifeless> uhm
[13:55] <wgrant> barry: Is this another of those 2700-build monsters that gets deleted a couple of hours after it finishes using days of build farm time?
[13:55] <lifeless> sec
[13:55] <Ursinha> lifeless, I can find it here, no prob
[13:55] <barry> wgrant, ;) no
[13:56] <lifeless> barry: so is that your hacked url
[13:56] <barry> wgrant, just 150-ish packages
[13:56] <mars> mthaddon, btw, can we update the buildbot configs trunk with my 'fail fast' patch?  Or do you want to wait for it to run successfully first?
[13:56] <lifeless> barry: or the original
[13:56] <barry> lifeless, it is hacked
[13:56] <lifeless> barry: what url fails
[13:56] <Ursinha> lifeless, bug 607680
[13:56] <_mup_> Bug #607680: documentation needed on oops<->bugs linking <Launchpad Foundations:New for matsubara> <https://launchpad.net/bugs/607680>
[13:56] <wgrant> barry: Ah, good. Sanity.
[13:56] <lifeless> Ursinha: yes!
[13:56] <barry> lifeless, that url.  iow.  normally you get batches of 50 but i want to see all packages in one page
[13:56] <Ursinha> lifeless, what links oopses to bugs is part of the oops-tools
[13:57] <jml> barry, do you have a copy of pyflakes-doctest, or know where I can find one?
[13:57] <barry> jml, atm, i don't
[13:57] <jml> barry, thanks anyway.
[13:57] <barry> jml, yeah, sorry
[13:57] <lifeless> Ursinha: ok, thats cool. However most devs don't have that on their disk :)
[13:57]  * barry -> reboots.  hopefully will brb
[13:58] <Ursinha> lifeless, exactly :) afaiu it's a simple mechanism that associates the "oops signature" with a bug number
[13:58] <jml> ahh
[13:58] <jml> it's in old versions of the tree
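What jml is after — running pyflakes over doctest code — amounts to extracting the example source from the doctest and handing it to a checker. A minimal sketch using only the stdlib parser (the original pyflakes-doctest script may have worked differently; feeding the result to `pyflakes.api.check` is an assumption, valid only if pyflakes is installed):

```python
import doctest

def extract_doctest_source(text):
    """Join the source lines of every doctest example in `text` into
    one module-like string, suitable for handing to a checker such as
    pyflakes (e.g. pyflakes.api.check, if pyflakes is available)."""
    parser = doctest.DocTestParser()
    return ''.join(example.source for example in parser.get_examples(text))

# Two doctest examples become two plain source lines.
DOC = """
>>> x = 1
>>> print(x)
1
"""
```

The prompts and expected output are stripped, so line numbers reported by the checker map onto the concatenated example source rather than the original file — the old script presumably handled that offset mapping too.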
[13:59] <Ursinha> that's why we have incorrect links sometimes
[13:59] <Ursinha> lifeless, what exactly are you aiming for with that?
[14:00] <flacoste> lifeless: that graph was moved to lpstats
[14:00] <danilos> adiroiban, wgrant: all your branches are on ec2 now
[14:01] <flacoste> lifeless: https://lpstats.canonical.com/graphs/LPQA/
[14:01] <flacoste> lifeless: https://lpstats.canonical.com/graphs/LPQAByTeam/
[14:01] <danilos> lifeless, I have a suggestion for bug 590708, I've added it to the bug and emailed some reasoning to the list as well
[14:01] <_mup_> Bug #590708: DistroSeries.getBuildRecords often timing out <api> <oops> <soyuz-build> <timeout> <Soyuz:Triaged by michael.nelson> <https://launchpad.net/bugs/590708>
[14:01] <wgrant> danilos: Thanks.
[14:01] <danilos> bigjools, ^
[14:02] <danilos> lifeless, ignore the timing differences on the bug (that's with explain analyze which usually doubles the times) and in the email though ;)
[14:02] <wgrant> danilos: Ooh, that's good.
[14:02] <danilos> wgrant, the traditional translations tricks fwiw :)
[14:03] <danilos> wgrant, hopefully the queries are compatible :)
[14:03] <danilos> or, let's say equivalent
[14:07] <mthaddon> mars: which branch is that again?
[14:09] <lifeless> Ursinha: I want to be able to make the associations, so I want the way I should do that documented :)
[14:09] <lifeless> flacoste: ok cool, its linked from https://dev.launchpad.net/PolicyAndProcess/ZeroOOPSPolicy
[14:10] <lifeless> flacoste: how does oops tie into that graph ?
[14:10] <lifeless> danilos: actually the explain in the bug case adds about 1000ms if you compare to the oopses
[14:12] <danilos> lifeless, well, even explain analyze runs quickly (300ms) on an optimized limit 50 query for me
[14:12] <Ursinha> lifeless, ah, that's bloody simple :) to make the association when there's no association, I mean, because we still cannot edit that other than using sql directly on the oops-tools application db
[14:13] <Ursinha> lifeless, I'll make sure that's documented somewhere
[14:13] <danilos> lifeless, but anyway, I was just trying to point out that the times I recorded are not really correct (I've run it a number of times, both with explain analyze and without), but my conclusions should generally hold for these queries
[14:13] <lifeless> danilos: yeah
[14:13] <lifeless> danilos: want to make a patch ?
[14:14] <lifeless> Ursinha: ok, so how does one do it ?
[14:14] <danilos> lifeless, not really, I've got a few branches in my queue already :)
[14:14] <danilos> lifeless, it'd be good to test queries on production DB first as well
[14:14] <lifeless> flacoste: is zerooopspolicy still active? or died-a-quiet-death ?
[14:15] <lifeless> mthaddon: could you try danilos query on a slave please? from the bug https://launchpad.net/bugs/590708
[14:15] <_mup_> Bug #590708: DistroSeries.getBuildRecords often timing out <api> <oops> <soyuz-build> <timeout> <Soyuz:Triaged by michael.nelson> <https://launchpad.net/bugs/590708>
[14:15] <mthaddon> lifeless: am in the middle of some ISD deployments at the moment - can it wait a little?
[14:15] <lifeless> mthaddon: of course
[14:16] <mthaddon> thx
[14:16] <flacoste> lifeless: it's kind of a not fully implemented policy
[14:16] <lifeless> ok
[14:16] <flacoste> lifeless: the intent is still there, but we lack the tools to fully back it up
[14:16] <Ursinha> lifeless, if an oops doesn't have a bug associated, add the bug number to the text box where the number usually is, click "Bug #", and it's done
[14:16] <lifeless> but in principle folk have the goahead to shelve stuff to work on oops
[14:16] <lifeless> Ursinha: oh, on the web ui ?
[14:16] <flacoste> lifeless: the OOPS report don't make it easy to get the "list of things" that need fixing
[14:16] <Ursinha> lifeless, as you can see in this one, for example: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1661CCW1
[14:17] <Ursinha> lifeless, yes sir
[14:17] <lifeless> ah doh! I see
[14:17] <Ursinha> :)
[14:17] <flacoste> lifeless: yes
[14:17] <Ursinha> lifeless, it clearly needs to be documented! :)
[14:17] <lifeless> Ursinha: nah, i'm still a little tired here, its been busy ;)
[14:18] <lifeless> Ursinha: close with 'lifeless is blind', if you like.
[14:18] <lifeless> flacoste: ok, so lets talk about making it easy to see what needs to be fixed.
[14:18] <Ursinha> lifeless, hehe, no, it really needs to be somewhere, so we can make it easy to see what needs to be done
[14:18] <lifeless> flacoste: As I get the policy, every oops in the oops report - say just the top 15 timeouts for now - should be a high/critical bug right ?
[14:19] <lifeless> Ursinha: does the oops report sent to the list list the bugs when there is one ?
[14:19] <Ursinha> lifeless, yes, right below the oops signature/count
[14:20] <Ursinha> lifeless, as you can see in the oops reports
[14:21] <Ursinha> I guess I just repeated what you said, but nevermind :)
[14:21] <flacoste> lifeless: there should be clear list of things to fix for each team
[14:22] <flacoste> currently, they have to look for that list across several reports and several sections
[14:23] <Ursinha> flacoste, I've discussed that with diogo a few times, and we need to improve the oops signature in order to better group the oopses
[14:23] <Ursinha> once that's done, I guess the per-team reports will be more accurate
[14:23] <Ursinha> I also would like to know from the teams which oopses are real ones and which ones are only tainting the summaries
[14:23] <Ursinha> like the checkwatches one, for example
[14:24] <flacoste> Ursinha: best way to do that is to book a call with each TL and go over a couple of reports with them
[14:24] <Ursinha> like bug 592345
[14:25] <_mup_> Bug #592345: Checkwatches produces a lot of OOPSes that aren't real LP failures <oops> <Launchpad Bugs:Triaged> <https://launchpad.net/bugs/592345>
[14:25] <Ursinha> flacoste, I started with bugs team
[14:25] <Ursinha> but I'll have to do that with every other one, yes
[14:25] <lifeless> Ursinha: they're all real.
[14:25] <lifeless> In my simple simple opinion
[14:26] <flacoste> lifeless: not really
[14:26] <flacoste> for example a 404 isn't a real OOPS
[14:26] <flacoste> and a lot of checkwatches failures are of that sort
[14:26] <lifeless> flacoste: here - <gmb> Ursinha, We already have a backoff mechanism in place. However, now that we're tracking BugWatchActivity we can probably stop recording OOPSes for some things.
[14:26] <flacoste> right
[14:27] <lifeless> flacoste: I agree a 404 isn't a real oops, but the oops system already knows about that one and could trivially skip it
[14:27] <lifeless> flacoste: (except where we have an internal ref that makes a 404)
[14:28] <lifeless> barry: so what url was timing out again? not the hacked one, the real one.
[14:30] <lifeless> Ursinha: help me out with the oops reports here -
[14:30] <lifeless> the 19th report, edge errors
[14:30] <lifeless> [14:30] <lifeless>  876 SELECT BugTask.assignee, BugTask.bug, BugTask.bugwatch, BugTask.date_assigned, BugTask.date_close ... ON BugTask.bugwatch = "_prejoin5".id
[14:31] <lifeless> when I open the first oops up
[14:31] <lifeless> it has a bug #
[14:31] <lifeless> but I didn't see the bug reference on the page
[14:31] <Ursinha> lifeless, it's wrong
[14:31] <Ursinha> hm
[14:31] <barry> lifeless, i don't think the unhacked one was timing out :/
[14:31] <Ursinha> lifeless, it has?
[14:31] <lifeless> barry: ah. Don't hack the url
[14:31] <Ursinha> lifeless, let me just do my daily call with foundations, I'll get back to you in a moment
[14:31] <barry> >:-#
[14:31] <lifeless> https://lp-oops.canonical.com/oops.py/?oopsid=1661A1010
[14:31] <barry> :)
[14:31] <lifeless> Ursinha: ^
[14:33] <lifeless> Ursinha: but it thinks it is a translations problem, yet its a bugs issue
[14:33] <Ursinha> lifeless, I'll explain
[14:34] <lifeless> after your call :)
[14:37] <lifeless> gmb: are you back ?
[14:41] <Ursinha> lifeless, so
[14:41] <Ursinha> lifeless, this bug is linked incorrectly because of the "oops signature" oops-tools uses to identify the uniqueness of a problem, and link it to the bug
[14:42] <Ursinha> lifeless, we're aware that the way it is now isn't good because it doesn't work for timeouts
[14:42] <lifeless> Ursinha: can we purge the old one then? I mean, we need an improvement in that eventually.
[14:42] <Ursinha> lifeless, it's a known issue
[14:42] <lifeless> but right now its steering folk wrong
[14:42] <Ursinha> lifeless, sure, I can do that
[14:45] <lifeless> sweet
[14:47] <lifeless> Ursinha: do you think we'll be ready to switch to the new workflow this cycle ?
[14:47] <Ursinha> lifeless, that will require changes in the scripts we use today
[14:47] <Ursinha> I'm not sure
[14:48] <Ursinha> lifeless, when the new cycle starts? in three weeks?
[14:48] <lifeless> 1 week I think, we're in 4 of 10.07 according to topic.
[14:48] <lifeless> of course, topic could be lying
[14:48] <lifeless> Ursinha: if you could have a set of bugs tagged release-features-when-they-are-done, I might try to help out on the change
[14:49] <wgrant> This is 10.08 week 1.
[14:50] <mars> lifeless, do you mean switch to the new merge workflow at the end of this cycle?
[14:50] <lifeless> mars: thats the question
[14:50] <Ursinha> lifeless, we still need to create the blesser, and integrate that with the merger infrastructure
[14:51] <Ursinha> and mars knows about the second part better
[14:51] <lifeless> given its non trivial, its not jfdi-able
[14:51] <lifeless> I'd like to be able to see where I should help make this happen
[14:51] <mars> yes, I can think of three separate code changes (including one entirely new script or application)
[14:51] <lifeless> either by begging some time from team leads, or by helping out and doing, as appropriate.
[14:52] <lifeless> its a crucial change to make our production story cleaner and better
[14:53] <mars> well, it will happen - as to an alpha of the new cycle, I think getting that for 10.08 is unknown (I don't know our velocity), but I would think 10.09 for an alpha or even a beta could definitely happen
[14:53] <Ursinha> I agree with mars
[14:56] <mars> lifeless, I'll keep this moving forward, and I will let you know how things are going, so you can jump in where you think it's best
[14:56] <lifeless> if you make it visible to me
[14:56] <lifeless> I'll help in some fashion
[14:56] <lifeless> better deployment is a key enabler for overall velocity
[14:57] <lifeless> and this is a necessary condition for better deployment
[14:57] <mars> lifeless, it should be on the Foundations Kanban board - I'll make a lane now
[14:57] <lifeless> wow
[14:57] <lifeless> https://lp-oops.canonical.com/oops.py/?oopsid=1661C183
[14:58] <lifeless> mars: so kanban is great too - my main point is if *I* can figure out what is still todo, I can help somehow :)
[14:58] <lifeless> 14 seconds of non-sql time
[14:59] <mars> heh, ok
[15:03] <jml> just to be clear, are our buildbots running Python 2.6?
[15:03] <jml> (because ec2 test certainly isn't)
[15:04] <mars> jml, only the lucid_db_lp one is
[15:04] <jml> mars, thanks.
[15:04] <jml> next question
[15:04] <mars> jml, ok, so 'lucid ec2 image' goes on my ToDo list
[15:04] <jml> how do I use "with" statements in doctests in Python 2.5?
[15:04] <mars> benji might know?
[15:05] <gary_poster> from __future__ in a >>> line doesn't work?
[15:05]  * benji reads the scrollback
[15:07] <jml> gary_poster, apparently not.
[15:07] <lifeless> gary_poster: from __future__ is magic IIRC
[15:07] <lifeless> gary_poster: it changes the parser
[15:07] <lifeless> IIRC to do that with a non .py compile you need to pass a compile flag in
[15:08] <benji> jml: I assume you tried "from __future__ import with_statement" and it didn't work.
[15:08] <jml> benji, that's correct.
[15:08] <lifeless> deryck: whats 'null bug task' all about
[15:08] <benji> hmm, there is some support for future in doctests, let me look at the source real quick
[15:09] <jml> my experimentation cycle is quite long, since I don't know how to build Launchpad locally for Python 2.5
[15:09] <deryck> lifeless, null is a workaround for not being able to delete a task, since marking it invalid means you continue to receive mail
[15:09] <lifeless> not the null project
[15:09] <lifeless> the code path
[15:09] <lifeless> https://lp-oops.canonical.com/oops.py/?oopsid=1661C183
[15:09] <lifeless> Page-id: NullBugTask:+index
[15:10] <lifeless> time: 4388 ms
[15:10] <lifeless> non-sql time: 14274 ms
[15:10] <lifeless> Statement Count: 499
[15:10] <wgrant> erm, shouldn't NullBugTask just redirect now?
[15:12] <gary_poster> lifeless and bigjools, leonardr and I are discussing his reply to https://bugs.edge.launchpad.net/soyuz/+bug/590708 .
[15:12] <gary_poster> We are considering the backwards compatibility issues of what he described, because we feel we're the ones who are most likely to care about that, and we are responsible for it.
[15:12] <gary_poster> If we decide that leonardr's proposal is acceptable, I have the understanding that you are calling this a critical issue and that we should proceed to work on it, pushing our other tasks aside per the usual "critical" behavior.
[15:12] <gary_poster> (1) Do I understand correctly?  (2) If so, feedback on his reply would be appreciated, particularly if you have concerns.
[15:12] <_mup_> Bug #590708: DistroSeries.getBuildRecords often timing out <api> <oops> <soyuz-build> <timeout> <Soyuz:Triaged by michael.nelson> <https://launchpad.net/bugs/590708>
[15:13] <bigjools> gary_poster: sounds good to me - AIUI we are supposed to treat OOPSes as "stop the line"
[15:14]  * bigjools reading the reply
[15:14] <gary_poster> I don't think that policy is practically acceptable for Foundations on a global basis, but that's a different conversation for a different forum
[15:15] <lifeless> mthaddon: https://bugs.edge.launchpad.net/soyuz/+bug/590708 - another for the queue, equal basis with doing the queries from danilo on a slave
[15:15] <_mup_> Bug #590708: DistroSeries.getBuildRecords often timing out <api> <oops> <soyuz-build> <timeout> <Soyuz:Triaged by michael.nelson> <https://launchpad.net/bugs/590708>
[15:16] <lifeless> mthaddon: leonardr needs to correspond with the folk having trouble
[15:16] <lifeless> leonardr: wgrant is one of those people
[15:16] <wgrant> Hi.
[15:16] <mthaddon> lifeless: can you ping losa, I've passed the baton to another losa for interrupt queries
[15:16] <leonardr> wgrant, i'd like to see the program that's getting the timeouts
[15:17] <leonardr> so that i can see if your program will break if we apply the fix i've proposed
[15:17] <lifeless> mthaddon: sure, sorry I should have in that case too.
[15:17] <lifeless> losa ping
[15:17] <Chex> lifeless: morning
[15:17] <wgrant> leonardr: http://qa.ubuntuwire.org/ftbfs/source/build_status.py
[15:17] <Chex> lifeless: or evening in your case?!
[15:17] <lifeless> Chex: currently mid avo
[15:17] <lifeless> Chex: as I'm in prague
[15:18] <Chex> lifeless: ah yes, you're sprinting, cool.
[15:18] <lifeless> Chex: if you look up a bit
[15:18] <lifeless> there is a bug - one of several - high frequency timeouts
[15:18] <wgrant> leonardr: It's currently timing out every time it runs.
[15:18] <leonardr> gary: wgrant's script doesn't use len(), it just iterates over the resultset
[15:19] <gary_poster> jml: I wanted to abstract the Python selection so that it wouldn't have to always be the system's Python, but it was regarded as unnecessary.  The "test with another version of Python" story is another way in which that feature would be valuable, though.  Maybe it will come to life at some point.
[15:19] <jml> gary_poster, *nod*
[15:19] <lifeless> Chex: we'd like two bits of losa assistance in the short term on this - a) danilo has a faster proposed query, we'd like to validate its performance on a production slave.
[15:19] <jml> gary_poster, I guess it's only really necessary during interim phases like this one.
[15:19] <gary_poster> right
[15:19] <lifeless> Chex: b) we may need to get leonardr in contact with some users via api keys
[15:20] <lifeless> Chex: a) should be cheap - can we do that please; b) ask leonardr in a minute :)
[15:20] <benji> jml: any futures included in the test globs are respected
[15:20] <leonardr> wgrant: did it ever work?
[15:20] <wgrant> leonardr: Until the edge timeout reduction, it worked about 75% of the time.
[15:20] <wgrant> leonardr: Until 10.05, it worked 100% of the time.
[15:21] <bigjools> wgrant: it broke with the build farm model changes?
[15:21] <jml> benji, cool. how would I add a future to a test glob?
[15:21] <wgrant> bigjools: Somewhat, yes.
[15:21] <lifeless> gary_poster: I don't have a position-specific opinion on who should do the work; LEAN dictates team accountability (all of LP devs being a single team), not smaller granularity; but we don't seem to do that at the moment, and I'm finding my feet.
[15:21] <wgrant> bigjools: I suspect it was sitting just under the threshold, or something has gone really wrong plan-wise.
[15:21] <benji> I'll figure out an example.
[15:21] <jml> benji, thank you.
[15:22] <bigjools> ideally we should make the query very very fast
[15:22] <gary_poster> lifeless: sure, fair enough on all counts
[15:22] <wgrant> Certainly.
[15:22] <lifeless> we want to both make it fast, and avoid unnecessary work.
[15:22] <Chex> lifeless: ok, looking at the bug, and your request
[15:22] <lifeless> the count(*) there seems to be unnecessary for many cases.
[15:22] <lifeless> Chex: thanks
[15:22] <bigjools> unless we make the query(ies) faster, LP will never get faster
[15:23] <lifeless> right
[15:23] <Chex> lifeless: I'm a little confused about what you would like me to do for you?
[15:23] <lifeless> Chex: run the query in https://bugs.edge.launchpad.net/soyuz/+bug/590708/comments/8
[15:23] <_mup_> Bug #590708: DistroSeries.getBuildRecords often timing out <api> <oops> <soyuz-build> <timeout> <Soyuz:Triaged by michael.nelson> <https://launchpad.net/bugs/590708>
[15:23] <bigjools> so I think the count(*) is a red herring
[15:23] <lifeless> Chex: on a slave
[15:23] <lifeless> bigjools: its not the root cause here, but its not a non-issue.
[15:23] <bigjools> agreed
[15:24] <wgrant> COUNT is always going to be fairly slow.
[15:24] <Chex> lifeless: ok, so one of the DB slaves, seems ok, and the SQL seems to be a SELECT only, so thats ok, too
[15:24] <wgrant> And it
[15:24] <lifeless> fixing *either* the query, or the count(*) will fix this issue.
[15:24] <wgrant> it's unnecessary.
[15:24] <Chex> lifeless: any idea on the performance hit for the query?
[15:24] <lifeless> Chex: by ok, can you paste the output please :)
[15:24] <lifeless> Chex: its being hit by bots at least once an hour
[15:24] <bigjools> count should be as quick as the query itself
[15:24] <lifeless> bigjools: hell no
[15:24] <bigjools> ?
[15:25] <Chex> lifeless: you mean the bots are generating that query once an hour?
[15:25] <lifeless> Chex: something very like it, but causing table scans.
[15:25] <lifeless> bigjools: sec, let me get the query tested first.
[15:25] <Chex> lifeless: understood, ok, will run on one of the DB slaves, chokecherry
[15:25] <lifeless> Chex: thanks
[15:26] <lifeless> bigjools: ok, so count(*) has to complete the entire thing, ignoring offsets and limits
[15:27] <lifeless> bigjools: this is always more work except the special case when the limit of the first chunk matches the total work
[15:27] <bigjools> lifeless: if the query has an order, that doesn't apply
[15:27] <lifeless> bigjools: why do you say that?
[15:27] <bigjools> it has to complete the query to order it!
[15:27] <wgrant> Even with indices?
[15:27] <lifeless> that depends on the query
[15:27] <lifeless> very very much depends on the query
[15:28] <bigjools> yes
[15:28]  * bigjools goes back to buildd-manager hacking
[15:28] <lifeless> anyhow, my point is just - don't assume count(*) is effectively free: its not. :)
[15:28] <bigjools> lifeless: hell no
[15:29] <bigjools> but if the original query is quick, and it bloody well should be, then the count should not matter in the bigger picture
[15:29] <Chex> lifeless: bigjools: https://pastebin.canonical.com/34829/
[15:29] <lifeless> I agree it would be lost in noise today
[15:30] <lifeless> so, we may have stale statistics or something
[15:30] <lifeless> Chex: thanks!
[15:30] <bigjools> lifeless: "date_finished IS NOT NULL" kills that query
[15:30] <lifeless> bigjools: is date_finished indexed ?
[15:31] <bigjools> yes
[15:31] <bigjools> but it does an index scan over 103k rows
[15:31] <Chex> lifeless: you're welcome
[15:31] <bigjools> when you use NOT IN it has no choice
[15:31] <bigjools> Chex: thanks from me too :)
[15:32] <lifeless> bigjools: not in - yes, I get that. I don't see 'is not null' being == to 'not in', but I may be missing a specific technicality.
[15:32] <bigjools> my tiny brain conflated the two, my bad
[15:33] <lifeless> hehe no worries.
[15:33] <benji> jml: import __future__
[15:33] <lifeless> is not null can be badly affected by index statistics and index selectivity though
[15:33] <benji> and then in your test setUp, add a global like...
[15:34] <bigjools> hmmm status
[15:34] <benji> test.globs['with_statement'] = __future__.with_statement
[15:34] <lifeless> leonardr: so, I have a small separate idea for you.
[15:34] <leonardr> lifeless, ok
[15:34] <lifeless> leonardr: what if, when the result set is < pagination size
[15:34] <lifeless> leonardr: lazr restful simply *does not* call len() on the result set
[15:34] <jml> benji, sweet. thanks.
[15:34] <benji> jml: I don't have a Python 2.5 so I can't test it, so hopefully it'll work as-is
[15:34] <lifeless> leonardr: in this particular case we have 19 results
[15:34] <leonardr> lifeless, makes sense
[15:34] <bigjools> lifeless: we can improve it with an index btree(date_finished, status)
[15:35] <jml> benji, well, I'll try that and resubmit via ec2
[15:35] <bigjools> lifeless: it's scanning over status
[15:35] <lifeless> leonardr: I know it won't help in the greater case
[15:35] <wgrant> lifeless: Only 19? There should be more than that...
[15:35] <jml> benji, in ~3hrs I'll know if it worked.
[15:35] <benji> heh
[15:35] <benji> it might be quicker for one of us to install 2.5
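benji's trick works because doctest scans the test globals for `__future__` feature objects and compiles each example with the matching flags. A self-contained sketch of the mechanism (on Python 2.5 this is what would enable `with`; on modern Pythons the statement is always on, so the run simply passes — the names here are illustrative, not Launchpad's actual test setup):

```python
import doctest
import __future__

# A doctest whose examples use the `with` statement.
EXAMPLE = '''
>>> from io import StringIO
>>> with StringIO('hello') as f:
...     print(f.read())
hello
'''

def run_example():
    # Placing the feature object in the globs under its own name is
    # what doctest looks for when choosing compile flags for examples.
    globs = {'with_statement': __future__.with_statement}
    test = doctest.DocTestParser().get_doctest(
        EXAMPLE, globs, 'with_example', None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test)
    return runner.failures
```

In a real test suite the glob would go in `setUp`, exactly as benji describes above, so every doctest in the layer picks up the flag.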
[15:37] <lifeless> wgrant: privmsg
[15:37] <lifeless> leonardr: how big a fix would that short hack I'm proposing be ?
[15:37] <leonardr> lifeless: very easy
[15:37] <lifeless> fix/workaround
[15:37] <lifeless> leonardr: I propose, if you think its sensible, that we:
[15:37] <lifeless>  - do this tiny hack; get that cowboyed to prod
[15:38] <lifeless>  - do the larger one you proposed, if you think its tolerable
[15:38] <lifeless>  - add the index bigjools is proposing, after evaluating it on staging
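The tiny hack proposed above — don't call len() (and hence COUNT(*)) when the result set fits in one batch — can be sketched as follows. This is a hypothetical helper, not the actual lazr.restful code; it fetches one row beyond the batch so it can tell whether the result set is exhausted:

```python
def batch_without_count(resultset, start, batch_size):
    """Return (rows, total_size) for one batch of a sliceable result set.

    Fetches batch_size + 1 rows; if fewer than a full batch come back,
    the total size is known exactly and no COUNT(*) is needed.
    Otherwise total_size is None and the caller must fall back to a
    real count (or omit it, per the larger proposal).
    """
    rows = list(resultset[start:start + batch_size + 1])
    if len(rows) <= batch_size:
        return rows, start + len(rows)   # exact total, count avoided
    return rows[:batch_size], None       # full batch: total unknown

# The oops discussed above had 19 results against a batch size of 50,
# so the expensive count would be skipped entirely:
rows, total = batch_without_count(list(range(19)), 0, 50)
```

As lifeless notes, this only helps the small-result case; the 2614-row query wgrant describes still needs the query itself (or the count) made faster.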
[15:39] <wgrant> lifeless: The 'Failed to build' query that normally times out should return 2614 results.
[15:40] <lifeless> wgrant: thats odd :P
[15:40] <wgrant> That's larger than the batch size.
[15:40] <lifeless> wgrant: yes, we have to deal with bigger things
[15:40] <leonardr> lifeless et al: a quick analysis of the oopses shows that we seem to have three users
[15:40] <leonardr> 1. leann ogasawara
[15:40] <leonardr> 2. someone at ubuntuwire.org
[15:40] <bigjools> lifeless: one thing I think we need to do is to document the queries we're using and the indexes that they need.  They're currently disjoint and we have no idea what needs what and what indexes are obsolete (which are a waste of processing time)
[15:41] <leonardr> 3. someone at cranberry.canonical.com (wgrant?)
[15:41] <bigjools> lifeless: a bit like the prejoins too
[15:41] <wgrant> leonardr: I have access to no Canonical machines. I manage the script on ubuntuwire.org.
[15:41] <leonardr> no, sorry, leann ogasawara is the person at cranberry.canonical.com
[15:41] <leonardr> our third client is someone from optusnet.com.au
[15:41] <wgrant> That's me.
[15:42] <lifeless> bigjools: I think that's a great idea. Start doing it however you like; we can iterate to make it structured later.
[15:42] <jml> wgrant, not internode?
[15:42] <wgrant> jml: No. Too far for reasonable ADSL performance.
[15:43] <bigjools> lifeless: one thing that really scares me is changing prejoins - we've simply got no idea what it will affect.
[15:43] <wgrant> jml: And Optus' caps are about an order of magnitude larger now than they were 6 months ago.
[15:43] <wgrant> So it's not that bad.
[15:43] <lifeless> bigjools: we'll be a lot safer once staging's timeout limit is down to 5 seconds
[15:43] <leonardr> wgrant: ok, so we have two users, you and leann
[15:43] <lifeless> we can rev leann's launchpadlib pretty easily
[15:43] <lifeless> let me go ask her
[15:44] <jml> wgrant, fair enough :)
[15:44] <jml> leann is being asked now in person
[15:45] <leonardr> ah, i just asked her on irc, but ok
[15:50] <lifeless> she was walking past the door to a meeting
[15:50] <lifeless> she is running in the dc on the platform lp lib - hardies I suspect
[15:50] <lifeless> but she can RT an upgrade anytime.
[15:50] <leonardr> lifeless, have we run the queries that oopsed to see how many results they actually return?
[15:50] <lifeless> there are 800 or so
[15:51] <lifeless> I'd rather not do that by hand
[15:51] <lifeless> wgrant says one in particular he does routinely returns 2.7K
[15:52] <james_w> isn't backwards compatibility not really an issue as len() didn't work for a long time?
[15:53] <lifeless> leonardr: when doing 'approve' on an MP please approve the overall proposal too - so that the queue is representative of the work reviewers have left to do
[15:53] <leonardr> sorry
[15:53]  * wgrant sleeps.
[15:53] <wgrant> Thanks for looking at this.
[15:55] <lifeless> leonardr: no probs
[15:55] <lifeless> leonardr: its a small thing, but it helps the flow.
[15:55] <leonardr> james_w: it's somewhat unlikely, but there was a published workaround for a long time, so i want to make sure
[16:00] <lifeless> flacoste: is there a burndown chart of oops and timeout bugs ?
[16:00] <lifeless> flacoste: or should I perhaps ask jml with his mad graphing skills to write one
[16:01] <jml> I only do graphs with wobbly lines that go upwards and to the right.
[16:02] <lifeless> don't worry, we can make this one do that
[16:02] <lifeless> jml: how hard would it be for you to do this?
[16:03] <lifeless> hmm meeting time
[16:03] <jml> lifeless, about as hard as it would be for you.
[16:03] <jml> maybe a little bit less if I used lpstats.
[16:05]  * bigjools chuckles
[16:08] <lifeless> jml: I think you underestimate startup cost/activation energy
[16:09] <jml> lifeless, maybe.
[16:09] <jml> lifeless, if you email me with exactly what you want, I can give it a go.
[16:10] <jml> lifeless, but if you want it today or tomorrow, you're genuinely better off finding someone else.
[16:11] <lifeless> I have mailed you
[16:15] <mars> rockstar, ping
[16:15] <rockstar> mars, distracted pong
[16:16] <mars> rockstar, just wondering what the progress was on your YUI 3.1 upgrade.
[16:16] <rockstar> mars, ah, very close.  Dealing with a bug where YUI 3.1 doesn't like the lp.client.
[16:17] <mars> ok
[16:17] <rockstar> mars, I'll send it to you for review.
[16:17] <mars> I'm going to get someone to test 1.0, then we can look at rippling the change upward through the lazr-js tree
[16:17] <mars> rockstar, sure
[16:18] <rockstar> mars, 1.0?
[16:18] <leonardr> lifeless et al: ogasawara is indeed accessing the total_size (using the workaround since len() doesn't work in old wadllib)
[16:19] <leonardr> in fact, that's all she's doing with the data
[16:19] <mars> rockstar, yes, there are three lines of development right now: trunk (dev), a 1.0 release branch that will become 1.0-dev, and the 2.0-dev line
[16:20] <rockstar> mars, is there anyone else outside of Canonical using lazr-js?
[16:20] <mars> rockstar, not to my knowledge
[16:20] <rockstar> mars, I guess I'm asking "is there much point in the overhead required for coordinating various lines of development" ?
[16:20] <rockstar> I think there should be trunk, and then the project using it can maintain their own branch of that.
[16:20] <mars> yes, because we have projects on 1.0, and also projects on 2.0
[16:21] <rockstar> mars, what projects are those?
[16:21] <mars> ISD and LP are on 1.0, U1 and Landscape are on 2.0
[16:22] <rockstar> mars, afaik, LP is on 0.9.2, which means very little, since we're not really on a branch at all.
[16:22] <mars> rockstar, I'm working to get it down to two branches: 1.0 dev, and 2.0 dev
[16:22] <mars> yes, the fact we are not on a branch is a problem as well
[16:22] <rockstar> mars, what I'm saying is "the branch we're based off has little to do with what we're actually running"
[16:22] <rockstar> mars, landscape, for instance, maintains their own branch.
[16:23] <leonardr> ogasawara says that when it worked, her script found a total_size of 2614. given that neither of our users will benefit from the small-batch optimization, i think i should be doing something else (if we have a consensus about what else should be done)
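A minimal sketch (hypothetical classes, not the real wadllib/launchpadlib API) of why `total_size` matters here: a batched web-service collection can report its total size from the first response, whereas `len()` would have to walk every batch over the network.

```python
# Hypothetical stand-in for a lazr.restful-style batched collection; the
# real wadllib objects and the workaround leonardr mentions are not shown.
class BatchedCollection:
    def __init__(self, batches, total_size):
        self._batches = batches        # each batch is one server round trip
        self.total_size = total_size   # reported by the first response

    def __iter__(self):
        for batch in self._batches:    # every extra batch = another request
            for entry in batch:
                yield entry

entries = ["build-%d" % i for i in range(2614)]
builds = BatchedCollection(
    batches=[entries[j:j + 75] for j in range(0, len(entries), 75)],
    total_size=2614,  # the figure ogasawara's script reported
)

# Cheap: a single attribute read, no extra requests.
print(builds.total_size)
# Expensive: counting by iteration pulls all ~35 batches.
print(sum(1 for _ in builds))
```

Both print 2614, but only the first is free; that is why exposing `total_size` alone would already help her script.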
[16:23] <rockstar> If we land something on the 1.0 branch, we shouldn't be risking the breakage of a bunch of other projects.
[16:23] <mars> rockstar, yes, and that leads to everything being one big hairball :)
[16:23] <rockstar> mars, how's that?
[16:24] <mars> because changes like the YUI 3.0/3.1 split or the distribute debacle force other projects to maintain branches
[16:24] <rockstar> mars, if we make changes in trunk and then the projects then pull when they're ready, then we really only have one branch to worry about as lazr-js (trunk), and two branches to worry about as LP (trunk and whatever we're running)
[16:24] <mars> 4 projects with 4 branches and mainline is 5 times the maintenance work needed
[16:24] <rockstar> mars, the YUI 3.0/3.1 debacle wouldn't have happened if sidnei had landed in trunk.
[16:25] <mars> and no one else would have been able to use or patch trunk
[16:25] <mars> you either have everyone maintaining a private fork, or you consolidate
[16:25] <mars> I want to get everything consolidated
[16:25] <rockstar> mars, everyone needs to maintain a fork anyway.
[16:25] <mars> rockstar, why?
[16:25] <rockstar> Hopefully it's a "pull only" fork.
[16:26] <rockstar> mars, because if we update 1.0, we can break other projects.
[16:27] <rockstar> Allowing the other projects to pull in changes when they are ready is (IMHO) the best option.  We only have one line of development, and when they're ready to upgrade, they do.
[16:27] <mars> that leads to real problems with versioning and contribution
[16:27] <rockstar> mars, howso?
[16:28] <mars> you have to write two patches (one for you, one for mainline), and you can't just pull trunk to get some new feature - there could be massive changes, meaning you have to backport and maintain yet another patch for your private fork
[16:28] <mars> People should just be able to skip between releases
[16:28] <mars> and releases should be documented in the changes they perform
[16:29] <rockstar> mars, I think you ought to propose this to a mailing list somewhere, and find out what other projects are doing.  I have my doubts about whether it needs to be this complicated.
[16:30] <rockstar> mars, having two lines of development might mean that I need to patch my private fork, 1.0, and 2.0.
[16:30] <mars> rockstar, I don't see the complexity - we'll have one mainline (2.0), and one legacy line (1.0)
[16:30] <mars> rockstar, then you should drop your private fork, and fix mainline
[16:32] <rockstar> mars, I think this is better for the mailing list.  I suspect that the private fork is a feature other projects are a bit attached to (Launchpad historically is)
[16:38] <mars> rockstar, would you be willing to test the 1.0 branch to see if it builds on your system?  I would like to make lazr-js hackable again.
[16:39] <rockstar> mars, do you need it right now, or can it be in 1 hour?
[16:39] <mars> rockstar, better make that ~3 hours then, I'll be taking lunch around 12:30
[16:40] <mars> err, "I'll be taking lunch in 1 hour"
[16:40] <rockstar> mars, okay, that will work better, because I need to eat dinner soon.  I'm happy to test it though.
[16:40] <mars> cool
[16:40] <rockstar> bigjools, are you still around?
[16:40] <bigjools> yep
[16:42] <lifeless> leonardr: lol!
[16:42] <lifeless> leonardr: so just exposing the size only thing would help her
[16:43] <leonardr> lifeless: yes, if we implemented the full solution she would have to change her script but we could get it to work
[16:44] <lifeless> \o/
[16:50] <rockstar> bigjools, I'm having a pretty hard time changing the status of a SPRBuild...  The security proxy is only slightly the problem.
[16:51] <rockstar> bigjools, do you have methods for flipping the switches?  All I see is handleStatus, and then things like "_handleStatus_OK" etc.
[16:51] <bigjools> rockstar: what are you trying to do?
[16:51] <leonardr> lifeless, i posted an update to the bug. i think we should shelve the 'don't run the count(*)' optimization since it's somewhat difficult and it won't solve the big problems. do you want me to work on the annotation-based solution?
[16:51] <leonardr> (shelve for purposes of this problem, not permanently)
[16:51] <rockstar> bigjools, make a SourcePackageRelease tied to a recipe
[16:52] <rockstar> (it's a hole in our testing currently)
[16:52] <bigjools> rockstar: in a test or in the code?  if the latter, at what point in the pipeline?
[16:53] <rockstar> bigjools, in a test.
[16:53] <bigjools> ok
[16:53] <rockstar> bigjools, it won't let me set ISourcePackageRecipeBuild.source_package_release
[16:53] <bigjools> ok let me check
[16:56] <bigjools> rockstar: dude, ISourcePackageRecipeBuild.source_package_release is a property so I'm not surprised :)
[16:57] <bigjools> rockstar: set SourcePackageRelease.source_package_recipe_build
[16:58] <rockstar> bigjools, *facepalm* I was looking at the interface thinking it'd give me all I needed...
[16:58] <rockstar> :)
[16:58] <bigjools> :)
[16:58] <bigjools> rockstar: if it makes you feel any better, I've done this exact same thing myself
[16:58] <rockstar> bigjools, it's a sign that we should delete all interfaces.
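The trap rockstar hit is a general Python one: a read-only property rejects assignment, so you set the writable side of the relationship instead. A minimal sketch with hypothetical classes (not Launchpad's real models):

```python
# Hypothetical models illustrating the property-vs-attribute mistake; the
# real Launchpad classes are more involved.
class SourcePackageRelease:
    def __init__(self):
        self.source_package_recipe_build = None  # plain attribute: writable


class SourcePackageRecipeBuild:
    @property
    def source_package_release(self):
        # Read-only: no setter is defined, so assignment raises
        # AttributeError -- which is exactly what the test ran into.
        return None


build = SourcePackageRecipeBuild()
release = SourcePackageRelease()

try:
    build.source_package_release = release  # what the test tried to do
except AttributeError:
    pass  # a property without a setter cannot be assigned

# The working approach bigjools points to: set the writable reverse link.
release.source_package_recipe_build = build
```

The interface declares both as fields, which is why reading the interface alone is misleading: it does not say which side is derived and which side is stored.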
[17:01] <poolie> how do i do an api query for bugs containing a particular tag (and maybe other tags)?
[17:01] <lifeless> leonardr: please
[17:12] <lifeless> bigjools: BuildFarmJob.status <> 1 - thats an issue
[17:14] <lifeless> losa ping
[17:15] <mthaddon> lifeless: hi
[17:15] <lifeless> hi, we'd like to run an analyze on packagebuild after checking how big the table is
[17:15] <lifeless> on each db server
[17:16] <lifeless> and then check explain analyze SELECT BinaryPackageBuild.distro_arch_series, BinaryPackageBuild.id, BinaryPackageBuild.package_build, BinaryPackageBuild.source_package_release FROM Archive, BinaryPackageBuild, BuildFarmJob, PackageBuild WHERE distro_arch_series IN (109, 110, 111, 112, 113, 114) AND BinaryPackageBuild.package_build = PackageBuild.id AND PackageBuild.build_farm_job = BuildFarmJob.id AND (BuildFarmJob.status <> 1
[17:16] <lifeless> again
[17:16] <lifeless> if the table is -huge- we obviously don't want to wedge things.
[17:18] <mthaddon> lifeless: erm, why do we need to do this?
[17:18] <lifeless> also I'd like to know the range of values in BuildFarmJob.status - select distinct status from buildfarmjob;
[17:19] <lifeless> mthaddon: because we have an API call timing out - taking 18 seconds - and the query plan suggests a mismatch between statistics and actual data.
[17:19] <lifeless> http://paste.ubuntu.com/465800/
[17:19] <lifeless> we've identified a few issues all at once related to this:
[17:19] <lifeless>  - the api is doubling the db load by one of its bits of magic
[17:19] <lifeless>  - the query is extremely slow itself
[17:20] <mthaddon> lifeless: I'd like to get stub's input on that (particularly why it's out of whack in the first place) ideally
[17:20] <lifeless>  - the query contains an exclude - status <>1  rather than status in (2,3,4,5,6) or whatever it should be
[17:20] <lifeless> mthaddon: hes on a plane.
[17:20] <lifeless> mthaddon: I tried to ring him just before :(
[17:20] <mthaddon> lifeless: sure - is this a critical issue now?
[17:21] <lifeless> its timing out for ogasawara and wgrant every time; they aren't able to retry to fix it.
[17:21] <lifeless> if its simply stale statistics, that would be a very easy bandaid.
[17:21] <lifeless> we can also raise the timeout again
[17:22] <lifeless> by which I mean, 'retries also time out'
[17:24] <mthaddon> I think increasing the timeout is a better short term fix - what should we increase it to?
[17:24] <mthaddon> lifeless: and it times out for them on both edge and lpnet?
[17:24] <lifeless> yes
[17:24] <lifeless> thats my [weak] understanding
[17:25] <lifeless> edge is doing a timeout every 120 seconds
[17:25] <lifeless> prod is a lot more unhappy, but that is primarily the bug attachment oops which gmb has been working on.
[17:25] <lifeless> which reminds me
[17:25] <mthaddon> lifeless: so which one are we changing? edge or lpnet? and to what?
[17:26] <lifeless> mthaddon: do you know how to generate a manual oops report for the oops since this morning on edge and lpnet?
[17:26] <lifeless> mthaddon: that would help me answer your question
[17:26] <lifeless> because I know what fixes are in-progress
[17:27] <mthaddon> lifeless: I don't, no :(
[17:28] <lifeless> mthaddon: then I'd say lets raise it back to 14 seconds on edge
[17:28] <lifeless> I know that most of the prod ones are the bug attachment script
[17:29] <lifeless> and it is in progress
[17:29] <mthaddon> lifeless: like this? https://pastebin.canonical.com/34844/
[17:29] <lifeless> mthaddon: could we run an analyze on staging at least, see how long it takes, and if it improves the query ?
[17:30] <lifeless> mthaddon: yes, that patch will raise the edge timeout.
[17:33] <mthaddon> lifeless: it's hard to say if that will match production since the load on the DBs is so different though (having never been asked to do this before for LP is throwing up a minor red flag as "doing it wrong" as well)
[17:33] <lifeless> mthaddon: I'm positive stub has done analyzes to fix statistics many times. A grep of the lp-code logs will probably find some ;)
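The ANALYZE being requested refreshes the planner's row-count statistics, which is what the rows=702 vs rows=28253 mismatch suggests are stale. A tiny stand-in using sqlite3, whose ANALYZE plays the same role as PostgreSQL's (table and index names here are illustrative, not Launchpad's real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packagebuild (id INTEGER PRIMARY KEY, status INTEGER)")
conn.executemany(
    "INSERT INTO packagebuild (status) VALUES (?)",
    [(n % 5,) for n in range(1000)],  # 1000 rows, 5 distinct statuses
)
conn.execute("CREATE INDEX packagebuild_status ON packagebuild (status)")

# ANALYZE scans the table and records statistics the planner will use to
# choose between index and sequential scans.
conn.execute("ANALYZE")

# In sqlite the statistics land in sqlite_stat1 as "rows rows-per-key";
# PostgreSQL keeps the equivalent in pg_statistic / pg_stats.
stats = conn.execute(
    "SELECT stat FROM sqlite_stat1 WHERE tbl = 'packagebuild'"
).fetchone()
print(stats[0])
```

When the recorded counts drift far from reality (as on this table), the planner picks plans sized for the wrong data, which is how an 18-second query happens.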
[17:33] <mthaddon> in any case, I'm pushing out the cowboy to edge with the higher timeout now
[17:33] <lifeless> thanks
[17:34] <lifeless> mthaddon: I'm curious how, since its the same revno ...
[17:34] <mthaddon> lifeless: I landed the branch that allows me to specify a revno
[17:34] <Ursinha> lifeless, lpnet oopses since 00utc: https://devpad.canonical.com/~lpqateam/lpnet-oops.html#time-outs
[17:34] <lifeless> mthaddon: \o/
[17:34] <lifeless> mthaddon: thats awesome
[17:34] <mthaddon> s/specify a revno/specify a custom directory name/
[17:34] <lifeless> Ursinha: thanks
[17:35] <Ursinha> lifeless, same for edge and staging: https://devpad.canonical.com/~lpqateam/edge-oops.html#time-outs https://devpad.canonical.com/~lpqateam/staging-oops.html#time-outs
[17:41] <nigelb> bryceh: where is the code for it?
[17:41] <nigelb> (I wish bluprints had a comments area too for each action item)
[17:41] <bryceh> nigelb, hang on I'm composing an email
[17:41] <nigelb> heh :D
[17:41] <bryceh> nigelb, damn you're quick ;-)
[17:41] <nigelb> haha
[17:47] <mthaddon> lifeless: timeout increased on edge
[17:48] <mthaddon> lifeless: although the config change hasn't been landed, so it'll be overwritten on next rollout unless that happens
[17:57] <lifeless> mthaddon: ok, can you do that too - or should I just land an r=mthaddon to increase it ?
[17:57] <lifeless> Ursinha: thanks
[17:58] <mthaddon> lifeless: r=mthaddon would be great, thx
[17:58] <mthaddon> lifeless: has it fixed the issue?
[18:06] <lifeless> mthaddon: don't know
[18:06] <lifeless> mthaddon: wgrant may have gone to sleep
[18:08] <lifeless> mthaddon: yes its fixed
[18:08] <mthaddon> cool
[18:14] <lifeless> at least for leanne
[18:15] <lifeless> but I think they're looking at the same think
[18:15] <lifeless> *thing*
[18:27] <lifeless> sinzui: bug 607879 - if you want to discuss with me, gimme a shout
[18:27] <_mup_> Bug #607879: https://bugs.edge.launchpad.net/~person/+participation timeouts <oops> <timeout> <Launchpad Registry:Triaged> <https://launchpad.net/bugs/607879>
[18:57] <lifeless> losa ping
[18:58] <bryceh> nigelb, ok finally got that email out
[18:58] <Chex> lifeless: hi there
[18:59] <lifeless> Chex: hi, uhm channel confusion - query plan tweaking on staging
[18:59] <Chex> lifeless: ok, run that query on staging DB, then?
[19:00] <lifeless> Chex: so an analyze of packagebuild on staging
[19:00] <lifeless> and then
[19:00] <lifeless> # explain analyze SELECT BinaryPackageBuild.distro_arch_series, BinaryPackageBuild.id, BinaryPackageBuild.package_build, BinaryPackageBuild.source_package_release FROM Archive, BinaryPackageBuild, BuildFarmJob, PackageBuild WHERE distro_arch_series IN (109, 110, 111, 112, 113, 114) AND BinaryPackageBuild.package_build = PackageBuild.id AND PackageBuild.build_farm_job = BuildFarmJob.id AND (BuildFarmJob.status <> 1 OR BuildFarm
[19:01] <lifeless> on staging
[19:01] <danilos> lifeless, you don't mind me doing the query I suggested two times on production slave? (or do you still have reasons to believe it would hurt us?)
[19:02] <danilos> lifeless, fwiw, your query above was cut-off
[19:02] <Chex> lifeless: ERROR:  column "buildfarm" does not exist
[19:02] <Chex> LINE 5: ... BuildFarmJob.id AND (BuildFarmJob.status <> 1 OR BuildFarm)...
[19:02] <lifeless> Chex: http://paste.ubuntu.com/465800/
[19:02] <lifeless> Chex: top line
[19:02] <Chex> lifeless: oops, yeah pastebin is better, thanks
[19:03] <Chex> lifeless: http://pastebin.ubuntu.com/466583/
[19:04] <lifeless> Chex: and you analyzed packagebuild first ?
[19:04] <Chex> lifeless: sorry, no I did not
[19:05] <danilos> lifeless, fwiw, I wasn't thinking of doing analyze on production DB :)
[19:05] <lifeless> Chex: please do :) - there is a mismatch between rows=702 and rows=28253 in the middle of the explain
[19:05] <lifeless> that jtv pointed out
[19:06] <jtv> danilos: this is "analyze," not "explain analyze"
[19:07] <danilos> jtv, right, lifeless stopped me from doing explain analyze on production slave because it would "mess up caches on production DBs"
[19:07] <lifeless> danilos: well no, you were saying something that I interpreted to mean 'drop caches'
[19:07] <lifeless> danilos: which is rather different from 'run twice to eliminate cold cache effects'
[19:08] <danilos> lifeless, well, I was saying exactly this: "losa quick ping: hi, can you please check how caches on DB server affect executing a query at https://bugs.edge.launchpad.net/soyuz/+bug/590708/comments/8 (i.e. do it a few times on the same production slave DB)"; I'd never interpret it the way you did, but that's not up for debate :)
[19:08] <_mup_> Bug #590708: DistroSeries.getBuildRecords often timing out <api> <oops> <soyuz-build> <timeout> <Soyuz:Triaged by michael.nelson> <https://launchpad.net/bugs/590708>
[19:09] <lifeless> danilos: crossed wires happen :)
[19:09] <danilos> lifeless, yeah, you were doing something very similar here so I guess that's where the confusion comes from ;)
[19:10] <danilos> anyway, Chex, can you please try the query above twice on a single production slave to compare the results?
[19:10] <Chex> lifeless: http://pastebin.ubuntu.com/466585/
[19:10] <Chex> lifeless: this look any better?
[19:11] <jtv> Nope
[19:11] <lifeless> well, 2 seconds better
[19:12] <lifeless> Chex: so, to check - you did 'analyze packagebuild' then the query from the pastebin I linked earlier ?
[19:12] <lifeless> jtv: Nested Loop  (cost=0.00..11792.41 rows=2783 width=8) (actual time=0.068..2155.895 rows=904092 loops=1)
[19:12] <lifeless> jtv: thats another row expectation mismatch ?
[19:12] <Chex> lifeless: that is correct
[19:13] <Chex> the analyze packagebuild then the query you pasted.
[19:13] <lifeless> cool
[19:13] <Chex> danilos: ok, sure, hang on
[19:13] <lifeless> can you please do analyze archive and analyze binarypackagebuild too
[19:13] <Chex> then analyze packagebuild, then the query?
[19:14] <lifeless> analyze archive; analyze binarypackagebuild; the query
[19:14] <jtv> lifeless: that looks like a mismatch, yes...  but that looks like a definite but
[19:14] <jtv> bug
[19:14] <Chex> lifeless: ok.
[19:14] <jtv> I mean, why a million rows there?
[19:15] <Chex> lifeless: http://pastebin.ubuntu.com/466592/
[19:16] <lifeless> oh!
[19:16] <jtv> Ah, it's a highly unfortunate thing... those million rows are the Archive × PackageBuild records
[19:17] <lifeless> Chex: lets do this differently.
[19:17] <lifeless> Chex: analyze packagebuild; analyze archive; analyze binarypackagebuild; - outside of a transaction
[19:17] <lifeless> Chex: we don't want to rollback, we want the analyzes committed
[19:18] <lifeless> I'm not 100% sure about the impact of analyze + rollback :-
[19:19] <Chex> lifeless: oh, fair enough, hang on
[19:19] <Chex> lifeless: done
[19:19] <Chex> lifeless: now try your query?
[19:19] <lifeless> now, the explain query please :)
[19:21] <Chex> lifeless: http://pastebin.ubuntu.com/466597/
[19:21] <lifeless> ok, well that fairly definitely answers that
[19:21] <lifeless> thanks
[19:21] <lifeless> danilos: I've finished monopolising Chex for now :)
[19:22] <lifeless> jtv: I agree that that 900K loop finding 0 rows is an issue
[19:23] <jtv> It was only expecting 2.0358 iterations there I guess.
[19:23] <bryceh> nigelb, had a chance to try it out?  thoughts so far?
[19:23] <jtv> Sorry, 20,358
[19:24] <lifeless> so there are lots of bpb records
[19:24] <lifeless> and pb records
[19:24] <lifeless> I guess
[19:24] <lifeless> please tell me we haven't split out a common table to join that has 1:1 mapping to the table we filter on ?
[19:24] <Chex> danilos: now _your_ query..
[19:25] <danilos> Chex, mine should be quick ;)
[19:26] <lifeless> bigjools: are you still around ?
[19:27] <Chex> danilos: http://pastebin.ubuntu.com/466604/
[19:27] <Chex> danilos: note the much quicker run the 2nd time
[19:31] <danilos> Chex, excellent, just what I suspected :)
[19:31] <danilos> lifeless, jtv: the times with the above query seem much better this time, see http://pastebin.ubuntu.com/466604/ :)
[19:31] <jtv> danilos: why do you join PackageBuild twice?
[19:32] <jtv> I mean, not that I'm arguing against the speedup...  :-)
[19:32] <danilos> jtv, because I am smart :)
[19:32] <jtv> that can't be it
[19:32] <danilos> jtv, that's the same trick we use in translations: note how packagebuild-archive join takes the most time in the original query
[19:33] <danilos> jtv, because it joins entire packagebuild with archive (across all rows); this forces postgres to avoid that so it's much faster :)
[19:34] <jtv> It almost looks as if the original query intended something like this... two of the join conditions occurred double, just like with yours.
[19:35] <lifeless> its joining to workaround the split out
[19:35] <lifeless> I think the split out should not have been done at the table level: N separate tables with a common columnprefix
[19:36] <lifeless> databases tables are not classes :)
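The split lifeless is complaining about is visible in the slow query itself: filtering builds by status means traversing BinaryPackageBuild → PackageBuild → BuildFarmJob before the WHERE clause can apply. A toy reproduction of that join chain using sqlite3 (columns trimmed to the ones the pastebin query touches):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE buildfarmjob (id INTEGER PRIMARY KEY, status INTEGER);
    CREATE TABLE packagebuild (id INTEGER PRIMARY KEY, build_farm_job INTEGER);
    CREATE TABLE binarypackagebuild (
        id INTEGER PRIMARY KEY,
        package_build INTEGER,
        distro_arch_series INTEGER
    );
    -- One completed job (status 1) and one that is not.
    INSERT INTO buildfarmjob VALUES (1, 1), (2, 3);
    INSERT INTO packagebuild VALUES (10, 1), (11, 2);
    INSERT INTO binarypackagebuild VALUES (100, 10, 109), (101, 11, 109);
""")

# Two joins are needed just to reach the status column -- the shape the
# EXPLAIN ANALYZE runs above are struggling with at production row counts.
rows = conn.execute("""
    SELECT binarypackagebuild.id
    FROM binarypackagebuild
    JOIN packagebuild ON packagebuild.id = binarypackagebuild.package_build
    JOIN buildfarmjob ON buildfarmjob.id = packagebuild.build_farm_job
    WHERE buildfarmjob.status <> 1
""").fetchall()
print(rows)
```

With a column-prefix design the status filter would hit the build table directly; with the table-per-class split, every such query pays for the intermediate joins.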
[19:37] <danilos> lifeless, I think both of these are much faster simply because the caches are already warm (because of your test :)
[19:38] <danilos> anyway, now I go away and will be able to sleep at night :)
[19:38] <danilos> cheers
[19:38]  * danilos goes
[19:38] <Ursinha> sinzui, hello
[19:39] <Ursinha> sinzui, we had a bunch of oopses like https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1661XMLP111, MailingListAPIView
[19:39] <Ursinha> sinzui, I see there's a bug for this oops, bug 531371, which is already fix released
[19:39] <_mup_> Bug #531371: oops MailingListAPIView email already in use <mailing-lists> <oops> <Launchpad Registry:Fix Released by sinzui> <https://launchpad.net/bugs/531371>
[19:40] <lifeless> daniloff: g'night. I think your query is very clever; it would be good for storm to do that for us
[19:40] <sinzui> Ursinha, see my email about how I hate Launchpad developers who offer crap services for free
[19:41] <Ursinha> :)
[19:41] <Ursinha> I will
[19:41] <sinzui> Ursinha, I am sick and cannot help, the LOSAs corrupted the DB because they thought they were being nice to monte
[19:41] <Ursinha> oh, argh
[19:42] <lifeless> sinzui: get well
[19:42] <lifeless> I forgot you were ill
[19:42] <Ursinha> sinzui, please, get well
[19:42] <sinzui> Ursinha, There is a question tracking how to fix the data. Someone needs to kill the private project that should never have been modified
[19:43] <Ursinha> oh, that one with the pending mailing list? I see.
[19:45] <Ursinha> thanks sinzui, sorry to ping
[20:23] <lifeless> Ursinha: so, the grouping
[20:23] <lifeless> Ursinha: does it just use the exception type, or the exception type + the string ?
[20:26] <Ursinha> lifeless, exception type + value
[20:26] <Ursinha> lifeless, bug 461269
[20:26] <_mup_> Bug #461269: oops reports should be grouped by oops signature not exception type and exception value <OOPS Tools:Triaged> <https://launchpad.net/bugs/461269>
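A minimal sketch of the grouping Ursinha describes: reports keyed by (exception type, exception value). The sample data below is invented, but it shows the weakness bug #461269 complains about, since variable text in the value splits one underlying problem into several groups.

```python
from collections import Counter

# Invented sample OOPSes: (exception type, exception value) pairs, the key
# the report pages currently group on.
oopses = [
    ("TimeoutError", "timeout after 14s on /builders"),
    ("TimeoutError", "timeout after 14s on /builders"),
    ("TimeoutError", "timeout after 14s on /~person/+participation"),
    ("AttributeError", "can't set attribute"),
]

groups = Counter(oopses)
for (etype, value), count in sorted(groups.items()):
    print(count, etype, value)
```

The two TimeoutErrors with different URLs baked into the value land in separate groups even though they are probably one bug; a signature-based key would need to normalise that variable text away first.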
[20:28] <Ursinha> lifeless, I was about to leave to have some food
[20:28] <lifeless> ciao
[20:28] <lifeless> I'm just opportunistic on asking stuff
[20:28] <lifeless> no need to hang around for me
[20:28] <Ursinha> lifeless, okay :) anything else, just ask and I'll answer when I return
[20:28] <lifeless> kk
[20:28] <flacoste> lifeless: your lp:~lifeless/launchpad/soyuz mp diff is screwed up
[20:30] <flacoste> and lifeless, i had a test failure back from my ec2 land (feedparser branch)
[20:30] <flacoste> this is the fix I applied: http://pastebin.ubuntu.com/466624/
[20:30] <flacoste> do you have a better suggestion?
[20:32] <lifeless> flacoste: thats fine with me, its not hugely beautiful, but its not ugly.
[20:33] <flacoste> yeah, my feeling also, wondered if there was a better known idiom
[20:33] <lifeless> its essentially mocking; we could use an official mock, but it wouldn't be any smaller.
[20:34] <flacoste> right
[20:35] <flacoste> what library would you recommend for mocking (unrelated to this branch, asking for another project)
[20:35] <flacoste> do you use something in bzr?
[20:36] <lifeless> we don't routinely mock
[20:36] <lifeless> mocking has some risks
[20:36] <lifeless> and some rewards
[20:36] <lifeless> uhm
[20:36] <lifeless> for your line there, even with a mocking library, I'd probably just do the lambda :)
[20:37] <flacoste> ok
[20:38] <lifeless> from the school of 'simplest is often clearest'
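flacoste's actual patch is in the pastebin above and is not reproduced here; this is only a generic sketch of the "just use the lambda" style lifeless endorses, where the substitute collaborator is a plain lambda rather than a mocking-library object. The function names are hypothetical:

```python
def _download(url):
    # Stands in for a real network-touching helper; tests never call it.
    raise RuntimeError("no network in tests")


def fetch_feed(url, download=_download):
    """Hypothetical function under test: fetch a feed and tidy it up."""
    return download(url).strip()


# In the test, the "mock" is just a lambda with the right signature --
# simplest is often clearest, and there is no framework to learn.
result = fetch_feed("http://example.com/feed",
                    download=lambda url: "  <feed/>  \n")
assert result == "<feed/>"
```

A mocking library buys you call-recording and expectation checking; for a one-line substitution like this it would not make the test any smaller.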
[20:40] <lifeless> flacoste: speaking of reviews
[20:40] <lifeless> I got the queue down to 0
[20:41] <lifeless> for devel anyhow
[20:41] <lifeless> ah the soyuz brnach is messed up because db-devel exists
[20:43] <benji> flacoste: I've enjoyed using Gustavo Niemeyer's Mocker on another project (http://labix.org/mocker)
[20:44] <benji> many other options at http://pycheesecake.org/wiki/PythonTestingToolsTaxonomy#MockTestingTools
[20:44] <lifeless> I much prefer verified fakes to mocks
[20:44] <lifeless> less skew
[20:44] <lifeless> but this is not a late-at-night discussion I think; its been ... intense today
[20:44] <flacoste> thanks benji
[20:45] <benji> lifeless: I suspect Mocker can do what you want; it's quite full-featured.
[20:45] <lifeless> benji: the point is to not mock.
[20:45] <lifeless> benji: so no, it can't :)
[20:46] <benji> what do you mean by "verified fakes"?
[20:46] <lifeless> just that
[20:46] <lifeless> a fake (not a mock or stub) that is verified to behave the same
[20:46] <lifeless> as a full implementation
[20:46] <lifeless> e.g. sqlite in-memory db's are a pretty good verified fake for disk databases.
[20:47] <benji> ok, Martin Fowler's definition of "fake"
[20:48] <lifeless> yes, I find the definition to be usefully precise
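lifeless's idea made concrete: a fake is *verified* by running the same contract checks against it as against the real implementation, so the two cannot silently drift apart. In this self-contained sketch both "implementations" are sqlite connections, one in memory (the fast fake) and one on disk (standing in for the real thing):

```python
import os
import sqlite3
import tempfile


def check_store_contract(conn):
    """Assertions every conforming store must pass (the verification)."""
    conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
    conn.commit()
    conn.execute("INSERT INTO kv VALUES ('a', '1')")
    assert conn.execute("SELECT v FROM kv WHERE k = 'a'").fetchone() == ("1",)
    conn.rollback()  # uncommitted rows must disappear
    assert conn.execute("SELECT count(*) FROM kv").fetchone() == (0,)


# The fake passes exactly the checks the real implementation does; tests can
# then use the in-memory fake with confidence that its behaviour matches.
check_store_contract(sqlite3.connect(":memory:"))
check_store_contract(
    sqlite3.connect(os.path.join(tempfile.mkdtemp(), "real.db")))
print("fake verified against the real implementation")
```

This avoids the skew risk of mocks: a mock encodes the test author's beliefs about the collaborator, while a verified fake is held to the collaborator's actual behaviour.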
[21:01] <mars> rockstar, ping
[21:27] <lifeless> hmm
[21:27] <lifeless> more count(*) taking ages
[21:27] <lifeless> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1662EA433
[21:33] <deryck> ok, I need to break for awhile.  Until later on then......
[21:33] <lifeless> flacoste: may need to think about making oopses critical rather than high... teams have lots of high already :)
[21:34] <flacoste> lifeless: that was the idea of zerooopspolicy
[21:34] <lifeless> flacoste: I thought it said high, not critical
[21:34] <lifeless> yes, it says high on the wiki
[21:37] <flacoste> hmm, ok
[21:37] <lifeless> flacoste: If the goal is 'in front of the queue', critical would seem appropriate to me.
[21:37] <lifeless> flacoste: but I wasn't part of the discussion for the policy, I don't want to just jump in here
[21:43] <lifeless> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1662EA447 is fun
[21:43] <lifeless> anyone have ideas on what to do with *that* ?
[21:44] <lifeless> I guess its getting the wadl ?
[21:46] <benji> hmm; I did a rocketfuel-get and now my tests fail
[21:46] <benji> is the devel branch broken?
[21:47] <lifeless> shouldn't be
[21:54] <lifeless> Ursinha-nom: when you return, if you could regenerate the edge oops-since-utc0 page, I think I've got bugs made for most of them now
[22:17] <lifeless> rinze: please set the MP status when reviewing as well
[22:25] <rockstar> mars, sorry, hi.
[22:29] <lifeless> I'm -> sleep
[22:29] <lifeless> gary_poster: so you know, edge is back up to 14 seconds, 12 seconds is past the knee and unsafe
[22:30] <gary_poster> heh, ok, thanks for update
[22:30] <lifeless> https://lpstats.canonical.com/graphs/OopsEdgeHourly/ shows it quite graphically
[22:31] <lifeless> prod is still unhappy - https://lpstats.canonical.com/graphs/OopsLpnetHourly/ -  but the pending fixes should make a dramatic difference to that
[22:31] <lifeless> and your pqm hack is on my kanban todo, but I've been bouncing from thing to thing all day.
[22:31] <lifeless> on the bright side I seem to have gotten past the stomach ache part of this lurgy, so I can actually concentrate again.
[22:32] <lifeless> and with that, I bid you all asnore.
[22:35] <gary_poster> thank you and good night
[23:25] <wgrant> Can someone please ec2 land https://code.edge.launchpad.net/~wgrant/launchpad/refactor-_dominateBinary/+merge/29667? danilos tried to do it last night, but apparently ended up starting two instances for the *other* branch.
[23:38] <nigelb> bryceh: trying out now.  Do I get to confirm on the upstream tracker before it gets submitted?
[23:39] <bryceh> nigelb, yes
[23:39] <bryceh> nigelb, btw do you find that aspect important?  I've considered eliminating that as an extraneous step if it isn't considered important
[23:41] <nigelb> Since I'm testing now, I'd find it important.  But when I'm using the tool, I'd find it extraneous
[23:42] <nigelb> I keep getting, "Sory produce xorg in ubuntu does not exist or you're not allowed to report a bug in it" :/
[23:42] <bryceh> yeah try another package.  'xorg' isn't supported, but e.g. 'xserver-xorg-video-intel' is
[23:43] <bryceh> (there isn't actually an 'xorg' package upstream, it's a non-source debian package only)
[23:43] <nigelb> ahh
[23:52] <nigelb> bryceh: same error with xserver-xory-video-intel
[23:54] <bryceh> hrm
[23:56] <bryceh> nigelb, ok try now
[23:56] <bryceh> weird, I was sure I'd fixed that already
[23:59] <nigelb> bryceh: wow, just WOW