[12:35] <cjwatson> wgrant: git-ref-scanner bulk updateRef - do you just mean having it be a single method call for many refs?  There's no actual concise way to do a bulk update in SQL, AFAICS, unless I want to bother creating a temporary table which I think is probably overkill here
[12:38] <wgrant> cjwatson: VALUES
[12:38] <wgrant> cjwatson: We should have a few bulk updates around, using Update and Values storm expressions. Let me find one.
[12:39] <wgrant> Oh, BulkUpdate.
[12:39] <wgrant> PopulateLatestPersonSourcePackageReleaseCache
[12:40] <wgrant> cjwatson: But I'd consider just having an upsert, like Python's dict.update.
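The `dict.update` analogy pins down the semantics being asked for. A minimal sketch in plain Python (ref names and SHA-1s invented for illustration, not Launchpad's schema): existing keys are overwritten, new keys are inserted, everything else is left alone.

```python
# Desired upsert semantics, sketched with plain dicts.  Values are
# illustrative seven-character SHA-1 prefixes, not real data.
refs = {
    "refs/heads/master": "1111111",
    "refs/tags/v1.0": "2222222",
}
scanned = {
    "refs/heads/master": "3333333",   # moved: should be updated
    "refs/heads/feature": "4444444",  # new: should be inserted
}
refs.update(scanned)  # the upsert: overwrite-or-insert, leave the rest alone

assert refs == {
    "refs/heads/master": "3333333",
    "refs/tags/v1.0": "2222222",
    "refs/heads/feature": "4444444",
}
```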
[12:45] <cjwatson> Oh huh, OK
[12:46] <cjwatson> Thanks
[12:47] <wgrant> I'm afraid you're going to have to learn a bit of non-trivial SQL.
[12:47] <wgrant> Though it can probably be cargo-culted.
[12:48] <cjwatson> Yeah, fine with learning it, just needed the starting point
[12:49] <cjwatson> BulkUpdate seems like a reasonable interface, though, and it's not going to be a problem for the upsert to be sometimes two queries rather than one.
[12:50] <wgrant> Two is still O(1) :)
[12:52] <wgrant> IIRC Ian wrote that with a tonne of sqlvalues originally and I rewrote it to use the custom Storm expression. But it's only been used in that one place, so it may need a bit of adjustment to fit well for you.
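For reference, the single-statement shape that a `BulkUpdate`/`Values` expression compiles to is roughly the following, sketched here with sqlite3 standing in for Postgres (`UPDATE ... FROM` needs SQLite >= 3.33; the two-column schema is made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ref (path TEXT PRIMARY KEY, sha1 TEXT)")
conn.executemany(
    "INSERT INTO ref VALUES (?, ?)",
    [("refs/heads/master", "1111"), ("refs/tags/v1.0", "2222")],
)

# One UPDATE joined against an inline VALUES list, instead of issuing one
# UPDATE statement per ref.
conn.execute(
    "WITH new(path, sha1) AS (VALUES (?, ?), (?, ?)) "
    "UPDATE ref SET sha1 = new.sha1 FROM new WHERE ref.path = new.path",
    ("refs/heads/master", "3333", "refs/tags/v1.0", "4444"),
)
rows = sorted(conn.execute("SELECT path, sha1 FROM ref"))
```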
[12:52] <cjwatson> I suppose I could do the http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE thing, but it seems like overkill.
[12:53] <wgrant> We should have retry logic in the job infrastructure anyway, though I don't think we do now.
[12:53] <wgrant> (the webapp does, though)
[12:54] <cjwatson> I guess concurrent notifies could be a thing, even without any other way to create refs.
[12:54] <stub> We can do upsert in python now I think, using savepoints.
[12:59] <cjwatson> https://wiki.postgresql.org/wiki/UPSERT  oww my head
[13:02] <stub> cjwatson: Will there potentially be concurrent connections trying to insert the same row? If not, the writable-CTE variant is easiest.
[13:02] <cjwatson> It's possible, yes: two notifies might be scheduled in rapid succession due to two git pushes, and those might be run concurrently by different celery workers.
[13:03] <cjwatson> (Although we could work around this by having notify check whether there's already a job and not schedule another one, which might be worthwhile if it makes this significantly easier)
[13:03] <stub> mmm, so the stored procedure or savepoints...
[13:04] <cjwatson> Savepoints seem cumbersome for a bulk upsert
[13:04] <stub> Indeedy
[13:06] <cjwatson> I think the stored procedure needs a temporary table, though, if I'm reading this right
[13:06] <wgrant> cjwatson: Branch scans use postgres advisory locks to avoid two running on the same branch simultaneously.
[13:06] <stub> Or do the CTE variant if you don't mind two jobs potentially conflicting and one aborting with a violation
[13:07] <wgrant> It may be useful to do the same here. It's not going to totally exclude all possible conflicts, but it should fix most of them.
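The Postgres side of that is `pg_advisory_xact_lock(key)`: workers that derive the same key (e.g. from the repository ID) serialize, and the lock is released automatically at transaction end. A rough pure-Python analogy of the per-key mutual exclusion it buys (the real thing is one SQL call on the scan's connection; names here are invented for illustration):

```python
import threading
from collections import defaultdict

# Invented stand-in for SELECT pg_advisory_xact_lock(key): one lock per
# repository key, handed out from a shared registry.
_registry_guard = threading.Lock()
_scan_locks = defaultdict(threading.Lock)

def scan_lock(repository_id):
    """Return the lock that serializes scans of one repository."""
    with _registry_guard:
        return _scan_locks[repository_id]

events = []

def scan(repository_id, worker):
    # Two workers scanning repository 1 block each other here; a worker
    # scanning repository 2 proceeds in parallel.
    with scan_lock(repository_id):
        events.append((worker, repository_id))

threads = [
    threading.Thread(target=scan, args=(repo, name))
    for repo, name in [(1, "a"), (1, "b"), (2, "c")]
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```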
[13:07] <stub> or yeah, advisory lock.
[13:07] <cjwatson> what's the CTE variant?
[13:07] <stub> The stored procedure version doesn't need a temporary table. https://dba.stackexchange.com/questions/13468/most-idiomatic-way-to-implement-upsert-in-postgresql-nowadays has a simple example of that
[13:08] <stub> http://www.xzilla.net/blog/2011/Mar/Upserting-via-Writeable-CTE.html has an example of the CTE variant
[13:09] <stub> CTE is the fastest approach
[13:09] <wgrant> I don't see a huge problem with the easy client-side select+insert+update approach.
[13:12] <stub> speed, number of queries. Depends on how bulk is bulk.
[13:13] <stub> if you have a lock, it is faster doing the update first, then if no rows were modified, do the insert
[13:14] <cjwatson> Advisory lock plus something like https://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql/8702291#8702291 looks tolerable
[13:14] <cjwatson> And maybe a retry loop around the whole thing
[13:15] <cjwatson> The worst case without a retry loop AFAICS is that you might have to send another notify to get it to catch up properly.
[13:16] <cjwatson> The "if no rows modified" strategy surely only works if I'm trying to upsert one row at a time
[13:16] <cjwatson> (?)
[13:16] <wgrant> Right, all of this is just a cache, so the worst case is we have to request the cache be refreshed again.
[13:16] <wgrant> UPDATE ... RETURNING
[13:17] <wgrant> You can't just use rowcount, but you can use RETURNING to work out which rows were changed.
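A sketch of that RETURNING shape for the bulk case, again with sqlite3 standing in for Postgres (`RETURNING` needs SQLite >= 3.35; schema and names invented): the single UPDATE reports exactly which paths it changed, and the INSERT covers the complement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ref (path TEXT PRIMARY KEY, sha1 TEXT)")
conn.execute("INSERT INTO ref VALUES ('refs/heads/master', '1111')")

incoming = [("refs/heads/master", "2222"), ("refs/tags/v1.0", "3333")]
params = [v for pair in incoming for v in pair]
values = ", ".join(["(?, ?)"] * len(incoming))

# The bulk UPDATE names the rows it actually changed via RETURNING; a bare
# rowcount could only say how many, not which.
updated = {
    path
    for (path,) in conn.execute(
        f"WITH new(path, sha1) AS (VALUES {values}) "
        "UPDATE ref SET sha1 = new.sha1 FROM new WHERE ref.path = new.path "
        "RETURNING path",
        params,
    )
}
# Everything RETURNING didn't mention is new: insert it.
conn.executemany(
    "INSERT INTO ref VALUES (?, ?)",
    [(p, s) for p, s in incoming if p not in updated],
)
```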
[13:17] <cjwatson> "Bulk" is unlikely to be more than 100 rows or so, usually much less.
[13:17] <wgrant> However, SELECT+INSERT+UPDATE is just as correct as the CTE and probably quicker to implement now.
[13:17] <stub> cjwatson: yeah, "if no rows modified" is one row at a time.
[13:17] <cjwatson> Well, maybe more for lots of tags in a newly-pushed repo.
[13:17] <wgrant> Mm, lots of repos have lots of tags
[13:18] <cjwatson> Yeah, my linux tree here has 2648 tags
[13:19] <cjwatson> Rare for most of them not to be either "insert and never change again", though.
[13:19] <cjwatson> s/either //
[16:24] <cjwatson> wgrant: I'd appreciate a re-review of https://code.launchpad.net/~cjwatson/launchpad/git-ref-scanner/+merge/252763, since the changes were non-trivial.
[19:30] <blr> cjwatson: we have the same string -> GITOBJ.type mapping in turnip - would it be useful to have that in pygit2?
[19:31] <blr> bizarre that the library doesn't provide that... might send them a PR this week, time allowing.
[19:34] <blr> cjwatson: context - in your branch scanning branch.
[21:43] <cjwatson> blr: Hm, possibly, but I think LP wants to have its own enumeration for anything that's in the database, really.
[21:44] <cjwatson> blr: Although doing the string parsing bit wouldn't hurt, I suppose.