/srv/irclogs.ubuntu.com/2015/03/16/#launchpad-dev.txt

cjwatsonwgrant: git-ref-scanner bulk updateRef - do you just mean having it be a single method call for many refs?  There's no actual concise way to do a bulk update in SQL, AFAICS, unless I want to bother creating a temporary table which I think is probably overkill here12:35
wgrantcjwatson: VALUES12:38
wgrantcjwatson: We should have a few bulk updates around, using Update and Values storm expressions. Let me find one.12:38
wgrantOh, BulkUpdate.12:39
wgrantPopulateLatestPersonSourcePackageReleaseCache12:39
wgrantcjwatson: But I'd consider just having an upsert, like Python's dict.update.12:40
cjwatsonOh huh, OK12:45
cjwatsonThanks12:46
wgrantI'm afraid you're going to have to learn a bit of non-trivial SQL.12:47
wgrantThough it can probably be cargo-culted.12:47
cjwatsonYeah, fine with learning it, just needed the starting point12:48
cjwatsonBulkUpdate seems like a reasonable interface, though, and it's not going to be a problem for the upsert to be sometimes two queries rather than one.12:49
wgrantTwo is still O(1) :)12:50
wgrantIIRC Ian wrote that with a tonne of sqlvalues originally and I rewrote it to use the custom Storm expression. But it's only been used in that one place, so it may need a bit of adjustment to fit well for you.12:52
cjwatsonI suppose I could do the http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE thing, but it seems like overkill.12:52
wgrantWe should have retry logic in the job infrastructure anyway, though I don't think we do now.12:53
wgrant(the webapp does, though)12:53
cjwatsonI guess concurrent notifies could be a thing, even without any other way to create refs.12:54
stubWe can do upsert in python now I think, using savepoints.12:54
cjwatsonhttps://wiki.postgresql.org/wiki/UPSERT  oww my head12:59
stubcjwatson: Will there potentially be concurrent connections trying to insert the same row? If not, the writable-CTE variant is easiest.13:02
cjwatsonIt's possible, yes: two notifies might be scheduled in rapid succession due to two git pushes, and those might be run concurrently by different celery workers.13:02
cjwatson(Although we could work around this by having notify check whether there's already a job and not schedule another one, which might be worthwhile if it makes this significantly easier)13:03
stubmmm, so the stored procedure or savepoints...13:03
cjwatsonSavepoints seem cumbersome for a bulk upsert13:04
stubIndeedy13:04
cjwatsonI think the stored procedure needs a temporary table, though, if I'm reading this right13:06
wgrantcjwatson: Branch scans use postgres advisory locks to avoid two running on the same branch simultaneously.13:06
stubOr do the CTE variant if you don't mind two jobs potentially conflicting and one aborting with a violation13:06
wgrantIt may be useful to do the same here. It's not going to totally exclude all possible conflicts, but it should fix most of them.13:07
stubor yeah, advisory lock.13:07
cjwatsonwhat's the CTE variant?13:07
stubThe stored procedure version doesn't need a temporary table. https://dba.stackexchange.com/questions/13468/most-idiomatic-way-to-implement-upsert-in-postgresql-nowadays has a simple example of that13:07
stubhttp://www.xzilla.net/blog/2011/Mar/Upserting-via-Writeable-CTE.html has an example of the CTE variant13:08
stubCTE is the fastest approach13:09
wgrantI don't see a huge problem with the easy client-side select+insert+update approach.13:09
stubspeed, no of queries. Depends on how bulk is bulk.13:12
stubif you have a lock, it is faster doing the update first, then if now rows modified, do the insert13:13
cjwatsonAdvisory lock plus something like https://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql/8702291#8702291 looks tolerable13:14
cjwatsonAnd maybe a retry loop around the whole thing13:14
cjwatsonThe worst case without a retry loop AFAICS is that you might have to send another notify to get it to catch up properly.13:15
cjwatsonThe "if no rows modified" strategy surely only works if I'm trying to upsert one row at a time13:16
cjwatson(?)13:16
wgrantRight, all of this is just a cache, so the worst case is we have to request the cache be refreshed again.13:16
wgrantUPDATE ... RETURNING13:16
wgrantYou can't just use rowcount, but you can use RETURNING to work out which rows were changed.13:17
cjwatson"Bulk" is unlikely to be more than 100 rows or so, usually much less.13:17
wgrantHowever, SELECT+INSERT+UPDATE is just as correct as the CTE and probably quicker to implement now.13:17
stubcjwatson: yeah, "if no rows modified" is one row at a time.13:17
cjwatsonWell, maybe more for lots of tags in a newly-pushed repo.13:17
wgrantMm, lots of repos have lots of tags13:17
cjwatsonYeah, my linux tree here has 2648 tags13:18
cjwatsonRare for most of them not to be either "insert and never change again", though.13:19
cjwatsons/either //13:19
cjwatsonwgrant: I'd appreciate a re-review of https://code.launchpad.net/~cjwatson/launchpad/git-ref-scanner/+merge/252763, since the changes were non-trivial.16:24
blrcjwatson: we have the same string -> GITOBJ.type mapping in turnip - would it be useful to have that in pygit2?19:30
blrbizarre that the library doesn't provide that... might send them a PR this week, time allowing.19:31
blrcjwatson: context - in your branch scanning branch.19:34
cjwatsonblr: Hm, possibly, but I think LP wants to have its own enumeration for anything that's in the database, really.21:43
cjwatsonblr: Although doing the string parsing bit wouldn't hurt, I suppose.21:44

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!