/srv/irclogs.ubuntu.com/2015/03/16/#launchpad-dev.txt

cjwatson	wgrant: git-ref-scanner bulk updateRef - do you just mean having it be a single method call for many refs? There's no actual concise way to do a bulk update in SQL, AFAICS, unless I want to bother creating a temporary table which I think is probably overkill here	12:35
wgrant	cjwatson: VALUES	12:38
wgrant	cjwatson: We should have a few bulk updates around, using Update and Values storm expressions. Let me find one.	12:38
wgrant	Oh, BulkUpdate.	12:39
wgrant	PopulateLatestPersonSourcePackageReleaseCache	12:39
wgrant	cjwatson: But I'd consider just having an upsert, like Python's dict.update.	12:40
cjwatson	Oh huh, OK	12:45
cjwatson	Thanks	12:46
wgrant	I'm afraid you're going to have to learn a bit of non-trivial SQL.	12:47
wgrant	Though it can probably be cargo-culted.	12:47
cjwatson	Yeah, fine with learning it, just needed the starting point	12:48
cjwatson	BulkUpdate seems like a reasonable interface, though, and it's not going to be a problem for the upsert to be sometimes two queries rather than one.	12:49
wgrant	Two is still O(1) :)	12:50
wgrant	IIRC Ian wrote that with a tonne of sqlvalues originally and I rewrote it to use the custom Storm expression. But it's only been used in that one place, so it may need a bit of adjustment to fit well for you.	12:52
cjwatson	I suppose I could do the http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE thing, but it seems like overkill.	12:52
wgrant	We should have retry logic in the job infrastructure anyway, though I don't think we do now.	12:53
wgrant	(the webapp does, though)	12:53
cjwatson	I guess concurrent notifies could be a thing, even without any other way to create refs.	12:54
stub	We can do upsert in python now I think, using savepoints.	12:54
cjwatson	https://wiki.postgresql.org/wiki/UPSERT oww my head	12:59
stub	cjwatson: Will there potentially be concurrent connections trying to insert the same row? If not, the writable-CTE variant is easiest.	13:02
cjwatson	It's possible, yes: two notifies might be scheduled in rapid succession due to two git pushes, and those might be run concurrently by different celery workers.	13:02
cjwatson	(Although we could work around this by having notify check whether there's already a job and not schedule another one, which might be worthwhile if it makes this significantly easier)	13:03
stub	mmm, so the stored procedure or savepoints...	13:03
cjwatson	Savepoints seem cumbersome for a bulk upsert	13:04
stub	Indeedy	13:04
cjwatson	I think the stored procedure needs a temporary table, though, if I'm reading this right	13:06
wgrant	cjwatson: Branch scans use postgres advisory locks to avoid two running on the same branch simultaneously.	13:06
stub	Or do the CTE variant if you don't mind two jobs potentially conflicting and one aborting with a violation	13:06
wgrant	It may be useful to do the same here. It's not going to totally exclude all possible conflicts, but it should fix most of them.	13:07
stub	or yeah, advisory lock.	13:07
cjwatson	what's the CTE variant?	13:07
stub	The stored procedure version doesn't need a temporary table. https://dba.stackexchange.com/questions/13468/most-idiomatic-way-to-implement-upsert-in-postgresql-nowadays has a simple example of that	13:07
stub	http://www.xzilla.net/blog/2011/Mar/Upserting-via-Writeable-CTE.html has an example of the CTE variant	13:08
stub	CTE is the fastest approach	13:09
wgrant	I don't see a huge problem with the easy client-side select+insert+update approach.	13:09
stub	speed, no of queries. Depends on how bulk is bulk.	13:12
stub	if you have a lock, it is faster doing the update first, then if now rows modified, do the insert	13:13
cjwatson	Advisory lock plus something like https://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql/8702291#8702291 looks tolerable	13:14
cjwatson	And maybe a retry loop around the whole thing	13:14
cjwatson	The worst case without a retry loop AFAICS is that you might have to send another notify to get it to catch up properly.	13:15
cjwatson	The "if no rows modified" strategy surely only works if I'm trying to upsert one row at a time	13:16
cjwatson	(?)	13:16
wgrant	Right, all of this is just a cache, so the worst case is we have to request the cache be refreshed again.	13:16
wgrant	UPDATE ... RETURNING	13:16
wgrant	You can't just use rowcount, but you can use RETURNING to work out which rows were changed.	13:17
cjwatson	"Bulk" is unlikely to be more than 100 rows or so, usually much less.	13:17
wgrant	However, SELECT+INSERT+UPDATE is just as correct as the CTE and probably quicker to implement now.	13:17
stub	cjwatson: yeah, "if no rows modified" is one row at a time.	13:17
cjwatson	Well, maybe more for lots of tags in a newly-pushed repo.	13:17
wgrant	Mm, lots of repos have lots of tags	13:17
cjwatson	Yeah, my linux tree here has 2648 tags	13:18
cjwatson	Rare for most of them not to be either "insert and never change again", though.	13:19
cjwatson	s/either //	13:19
cjwatson	wgrant: I'd appreciate a re-review of https://code.launchpad.net/~cjwatson/launchpad/git-ref-scanner/+merge/252763, since the changes were non-trivial.	16:24
blr	cjwatson: we have the same string -> GITOBJ.type mapping in turnip - would it be useful to have that in pygit2?	19:30
blr	bizarre that the library doesn't provide that... might send them a PR this week, time allowing.	19:31
blr	cjwatson: context - in your branch scanning branch.	19:34
cjwatson	blr: Hm, possibly, but I think LP wants to have its own enumeration for anything that's in the database, really.	21:43
cjwatson	blr: Although doing the string parsing bit wouldn't hurt, I suppose.	21:44

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!