/srv/irclogs.ubuntu.com/2013/03/19/#launchpad-dev.txt

=== almaisan-away is now known as al-maisan
=== al-maisan is now known as almaisan-away
=== yofel_ is now known as yofel
StevenK	wgrant: So, WRT to the Archive:+repository-size critical, what if we denorm LFC + some squinting onto {B,S}PPH. A new size column directly on the publication, which is the sum of the 3 LFC sizes for a source, and a copy of the LFC size for a binary? Then we just need the publications for the archive, and can SUM() the sizes easily	01:29
wgrant	StevenK: That's still a good number of rows. It would improve things, but only by a smallish constant factor.	01:33
wgrant	Remember that those joins aren't really the big problem here; we're only filtering on one end.	01:34
StevenK	The number of LFC rows is a problem	01:34
StevenK	Which that denorming will solve	01:34
StevenK	And if we add columns onto Archive, we need to deal with population issues, locking and updating issues	01:35
wgrant	You still have to look up all the publications, which is less bad, but still not good.	01:35
wgrant	Sure	01:35
StevenK	Surely this problem isn't intractable :-P	01:36
wgrant	It's not intractable, but there's no easy solution	01:36
StevenK	At the moment, there is no solution I can think of that doesn't have you or I poking holes in from the very start. :-)	01:36
wgrant	Your suggestion would improve things substantially, but it's a fair bit of work for something that isn't a proper fix.	01:36
wgrant	And it's not clear that even that will scale.	01:37
wgrant	To what we have now	01:37
StevenK	wgrant: ArchiveSize table? That removes the locking problems, is probably scaleable with indexing, but leaves the population and updating issues.	01:38
wgrant	StevenK: The problem is not storing the int somewhere -- the entire problem is calculating it	01:41
wgrant	StevenK: What does the query look like today?	01:50
wgrant	Whoa	02:12
wgrant	StevenK: Have you... have you read the code?	02:13
wgrant	size = sum([lfc.filesize for lfc in result])	02:14
wgrant	No wonder it's dreadful for even moderately sized archives	02:14
StevenK	Well, yeah	02:24
wgrant	Fixing that should be a substantial and just about trivial improvement	02:24
wgrant	It certainly won't fix everything, but it's much less work than the other solution which also won't fix everything.	02:24
wgrant	So it may be tolerable for now.	02:24
StevenK	wgrant: Which is {S,B}PPH.size ?	02:25
wgrant	Yeah	02:25
wgrant	StevenK: Also, I just remembered why I completely discounted that last time	02:25
wgrant	orig.tar.gz sharing	02:25
wgrant	And arch-indep	02:25
StevenK	arch-indep is not a issue, since they have BPPHs	02:27
wgrant	Exactly	02:28
wgrant	So each arch-indep binary will be counted for each arch	02:28
wgrant	Rather than once	02:28
StevenK	We don't count per arch	02:28
StevenK	Just all binary + all source	02:28
wgrant	You do if you use BPPH.size	02:28
StevenK	Why? Fetch all BPPH's and SUM() their size	02:29
StevenK	We don't care about their archtag, we only care about the archive	02:29
wgrant	Publications regularly share files, so the sum of the publication sizes is not equivalent to the size of the archives.	02:29
StevenK	We could use a FK straight to LFC, and distinct on that, but I like that even less	02:30
wgrant	Have you checked how terrible the current queries are?	02:30
wgrant	On a large PPA	02:30
wgrant	You can't get that info from the OOPS reliably, due to the overhead of returning all the objects.	02:31
StevenK	They are pretty terrible, yes	02:32
wgrant	They're not terrible in the usual sense	02:32
wgrant	They just deal with a lot of data	02:32
StevenK	Yeah	02:32
StevenK	Dear qas, load Archive:+delete-packages :-(	04:06
StevenK	wgrant: So, rewriting a query	04:44
wgrant	StevenK: Which query? +delete-packages?	04:48
StevenK	wgrant: No, that query is actually performant on qas, but the page doesn't load. Not sure if it's due to the current issues.	04:49
wgrant	StevenK: what needs rewriting?	04:49
StevenK	wgrant: Archive:+repository-size	04:49
wgrant	Oh	04:49
wgrant	Work the sum into the query	04:49
StevenK	That was your suggestion, no?	04:49
wgrant	Rather than pulling back a bazillion Storm objects and summing an attribute of them	04:49
wgrant	in Python	04:50
StevenK	Yeah	04:50
wgrant	There's a comment about that not working	04:50
wgrant	But we can make it work	04:50
StevenK	How do we do the distinct for SHA1/filename, though?	04:51
StevenK	That bit I can't work out	04:51
wgrant	StevenK: See the comment in sources_size	04:52
wgrant	It already does the distinct	04:52
StevenK	Yeah, I saw that	04:52
StevenK	Sorry, let me be more clear.	04:52
wgrant	Hopefully storm's sum() knows to do it after a subselect	04:52
wgrant	I haven't used it like that before	04:53
StevenK	SELECT SUM(lfc.filesize) FROM ....	04:53
wgrant	You will most likely need a subselect	04:54
wgrant	Storm should automatically construct it	04:54
StevenK	wgrant: I thought there was a 'make sure that this is distinct' thing	05:02
wgrant	StevenK: What do you mean?	05:03
StevenK	Ah, I can't use DISTINCT ON	05:03
wgrant	No, because that will happen after the SUM	05:04
wgrant	Unless you use a subselect	05:04
StevenK	SELECT SUM(filesize) FROM libraryfilecontent WHERE id IN (SELECT DISTINCT(id, lfa.filename, lfc.sha1) ....) ?	05:06
wgrant	Nonono	05:06
wgrant	Slow	05:06
wgrant	SELECT SUM(filesize) FROM (SELECT DISTINCT ON (id, filename, sha1) filesize FROM ...) AS donotcare;	05:06
StevenK	Do we have a similar callsite? So I can try and work out how to fit the Storm we have into that pattern?	05:08
wgrant	- return sum(result.values(LibraryFileContent.filesize))	05:09
wgrant	+ return result.sum(LibraryFileContent.filesize)	05:09
wgrant	+ return result.sum(LibraryFileContent.filesize) or 0	05:10
wgrant	perhaps	05:10
wgrant	As it'll otherwise be None when there are no rows.	05:10
wgrant	SELECT SUM("_expr") FROM (SELECT DISTINCT LibraryFileAlias.filename, [...]	05:10
wgrant	As expected	05:10
StevenK	46 queries/external actions issued in 11.74 seconds	05:24
StevenK	Success	05:24
wgrant	Why so slow?	05:24
StevenK	The queries look good, at least	05:24
StevenK	wgrant: From the stack of OOPSes, it's all in non-SQL time	05:26
StevenK	Maybe we should move to SRF	05:26
wgrant	SRF doesn't affect non-SQL time	05:26
wgrant	And it's not a good fit for this problem	05:26
wgrant	How long is the primary query?	05:26
StevenK	400ms	05:26
wgrant	On ppa:ubuntu-langpack/ppa?	05:26
StevenK	Yeah	05:27
wgrant	Not dreadful	05:27
StevenK	It's what we expected when we tested on sourcherry	05:27
StevenK	wgrant: http://pastebin.ubuntu.com/5627366/	05:51
StevenK	wgrant: Would you prefer DISTINCT ON filename, SHA1 or it doesn't hurt?	05:55
wgrant	StevenK: It should behave identically, so no need to change	05:56
StevenK	wgrant: Do we need to go through the EXPLAIN ANALYZE dance, or shall I just push this branch up?	05:57
wgrant	StevenK: The plan is pretty trivial, but still something of potential interest	05:59
wgrant	So I'd get one just in case	05:59
StevenK	wgrant: http://pastebin.ubuntu.com/5627380/	06:01
StevenK	Using my favourite archive ever	06:04
StevenK	wgrant: It's 20ms for archive 71, which is one of yours	06:05
wgrant	StevenK: How is a subsequent run?	06:08
wgrant	Some of those indices will be pretty cold	06:08
StevenK	wgrant: ~ 380ms	06:10
StevenK	I ran it a few times afterwards without the EXPLAIN ANALYZE, and that was as fast as it got	06:10
wgrant	Right	06:11
wgrant	Right	06:12
wgrant	Yeah	06:12
wgrant	So	06:12
wgrant	Since it's not pulling back $lots of storm objects it should be much faster	06:12
StevenK	Yeah	06:12
wgrant	The underlying query still sucks	06:12
wgrant	But it can't be fixed without a lot of design work	06:12
wgrant	And it should be fast after a couple of refreshes	06:12
wgrant	Unlike the Storm bits that are problematic before your branch	06:13
StevenK	It's loaded by AJAX already	06:13
wgrant	Yes	06:13
StevenK	It won't end up realizing any from those methods, right?	06:13
StevenK	Since it will return a number, which might be 0	06:13
wgrant	Right	06:17
wgrant	.sum(...) will return an int or None	06:17
StevenK	wgrant: https://code.launchpad.net/~stevenk/launchpad/archive-repo-size/+merge/153997	06:32
wgrant	StevenK: r=me	06:48
=== almaisan-away is now known as al-maisan
=== al-maisan is now known as almaisan-away
=== almaisan-away is now known as al-maisan
=== al-maisan is now known as almaisan-away
=== BradCrittenden is now known as bac
=== wedgwood_away is now known as wedgwood
=== teknico__ is now known as teknico
=== Ursinha_ is now known as Ursinha
=== teknico_ is now known as teknico
=== deryck is now known as deryck[lunch]
=== deryck[lunch] is now known as deryck
=== yofel_ is now known as yofel
=== BradCrittenden is now known as bac
=== coffeedude_ is now known as coffeedude
=== wedgwood is now known as wedgwood_away
StevenK	wgrant: I think the test failure in buildbot is actually a bong test.	23:59

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!