[12:35] <cjwatson> wgrant: git-ref-scanner bulk updateRef - do you just mean having it be a single method call for many refs?  There's no actual concise way to do a bulk update in SQL, AFAICS, unless I want to bother creating a temporary table which I think is probably overkill here
[12:38] <wgrant> cjwatson: VALUES
[12:38] <wgrant> cjwatson: We should have a few bulk updates around, using Update and Values storm expressions. Let me find one.
[12:39] <wgrant> Oh, BulkUpdate.
[12:39] <wgrant> PopulateLatestPersonSourcePackageReleaseCache
[12:40] <wgrant> cjwatson: But I'd consider just having an upsert, like Python's dict.update.
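The `dict.update` analogy pins down the semantics being asked for. A minimal sketch in plain Python (ref names and SHA-1s invented for illustration, not Launchpad's schema): existing keys are overwritten, new keys are inserted, everything else is left alone.

```python
# Desired upsert semantics, sketched with plain dicts.  Values are
# illustrative seven-character SHA-1 prefixes, not real data.
refs = {
    "refs/heads/master": "1111111",
    "refs/tags/v1.0": "2222222",
}
scanned = {
    "refs/heads/master": "3333333",   # moved: should be updated
    "refs/heads/feature": "4444444",  # new: should be inserted
}
refs.update(scanned)  # the upsert: overwrite-or-insert, leave the rest alone

assert refs == {
    "refs/heads/master": "3333333",
    "refs/tags/v1.0": "2222222",
    "refs/heads/feature": "4444444",
}
```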
[12:45] <cjwatson> Oh huh, OK
[12:46] <cjwatson> Thanks
[12:47] <wgrant> I'm afraid you're going to have to learn a bit of non-trivial SQL.
[12:47] <wgrant> Though it can probably be cargo-culted.
[12:48] <cjwatson> Yeah, fine with learning it, just needed the starting point
[12:49] <cjwatson> BulkUpdate seems like a reasonable interface, though, and it's not going to be a problem for the upsert to be sometimes two queries rather than one.
[12:50] <wgrant> Two is still O(1) :)
[12:52] <wgrant> IIRC Ian wrote that with a tonne of sqlvalues originally and I rewrote it to use the custom Storm expression. But it's only been used in that one place, so it may need a bit of adjustment to fit well for you.
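For reference, the single-statement shape that a `BulkUpdate`/`Values` expression compiles to is roughly the following, sketched here with sqlite3 standing in for Postgres (`UPDATE ... FROM` needs SQLite >= 3.33; the two-column schema is made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ref (path TEXT PRIMARY KEY, sha1 TEXT)")
conn.executemany(
    "INSERT INTO ref VALUES (?, ?)",
    [("refs/heads/master", "1111"), ("refs/tags/v1.0", "2222")],
)

# One UPDATE joined against an inline VALUES list, instead of issuing one
# UPDATE statement per ref.
conn.execute(
    "WITH new(path, sha1) AS (VALUES (?, ?), (?, ?)) "
    "UPDATE ref SET sha1 = new.sha1 FROM new WHERE ref.path = new.path",
    ("refs/heads/master", "3333", "refs/tags/v1.0", "4444"),
)
rows = sorted(conn.execute("SELECT path, sha1 FROM ref"))
```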
[12:52] <cjwatson> I suppose I could do the http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE thing, but it seems like overkill.
[12:53] <wgrant> We should have retry logic in the job infrastructure anyway, though I don't think we do now.
[12:53] <wgrant> (the webapp does, though)
[12:54] <cjwatson> I guess concurrent notifies could be a thing, even without any other way to create refs.
[12:54] <stub> We can do upsert in python now I think, using savepoints.
[12:59] <cjwatson> https://wiki.postgresql.org/wiki/UPSERT  oww my head
[13:02] <stub> cjwatson: Will there potentially be concurrent connections trying to insert the same row? If not, the writable-CTE variant is easiest.
[13:02] <cjwatson> It's possible, yes: two notifies might be scheduled in rapid succession due to two git pushes, and those might be run concurrently by different celery workers.
[13:03] <cjwatson> (Although we could work around this by having notify check whether there's already a job and not schedule another one, which might be worthwhile if it makes this significantly easier)
[13:03] <stub> mmm, so the stored procedure or savepoints...
[13:04] <cjwatson> Savepoints seem cumbersome for a bulk upsert
[13:04] <stub> Indeedy
[13:06] <cjwatson> I think the stored procedure needs a temporary table, though, if I'm reading this right
[13:06] <wgrant> cjwatson: Branch scans use postgres advisory locks to avoid two running on the same branch simultaneously.
[13:06] <stub> Or do the CTE variant if you don't mind two jobs potentially conflicting and one aborting with a violation
[13:07] <wgrant> It may be useful to do the same here. It's not going to totally exclude all possible conflicts, but it should fix most of them.
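The Postgres side of that is `pg_advisory_xact_lock(key)`: workers that derive the same key (e.g. from the repository ID) serialize, and the lock is released automatically at transaction end. A rough pure-Python analogy of the per-key mutual exclusion it buys (the real thing is one SQL call on the scan's connection; names here are invented for illustration):

```python
import threading
from collections import defaultdict

# Invented stand-in for SELECT pg_advisory_xact_lock(key): one lock per
# repository key, handed out from a shared registry.
_registry_guard = threading.Lock()
_scan_locks = defaultdict(threading.Lock)

def scan_lock(repository_id):
    """Return the lock that serializes scans of one repository."""
    with _registry_guard:
        return _scan_locks[repository_id]

events = []

def scan(repository_id, worker):
    # Two workers scanning repository 1 block each other here; a worker
    # scanning repository 2 proceeds in parallel.
    with scan_lock(repository_id):
        events.append((worker, repository_id))

threads = [
    threading.Thread(target=scan, args=(repo, name))
    for repo, name in [(1, "a"), (1, "b"), (2, "c")]
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```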
[13:07] <stub> or yeah, advisory lock.
[13:07] <cjwatson> what's the CTE variant?
[13:07] <stub> The stored procedure version doesn't need a temporary table. https://dba.stackexchange.com/questions/13468/most-idiomatic-way-to-implement-upsert-in-postgresql-nowadays has a simple example of that
[13:08] <stub> http://www.xzilla.net/blog/2011/Mar/Upserting-via-Writeable-CTE.html has an example of the CTE variant
[13:09] <stub> CTE is the fastest approach
[13:09] <wgrant> I don't see a huge problem with the easy client-side select+insert+update approach.
[13:12] <stub> speed, number of queries. Depends on how bulk is bulk.
[13:13] <stub> if you have a lock, it is faster doing the update first, then if no rows were modified, do the insert
[13:14] <cjwatson> Advisory lock plus something like https://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql/8702291#8702291 looks tolerable
[13:14] <cjwatson> And maybe a retry loop around the whole thing
[13:15] <cjwatson> The worst case without a retry loop AFAICS is that you might have to send another notify to get it to catch up properly.
[13:16] <cjwatson> The "if no rows modified" strategy surely only works if I'm trying to upsert one row at a time
[13:16] <cjwatson> (?)
[13:16] <wgrant> Right, all of this is just a cache, so the worst case is we have to request the cache be refreshed again.
[13:16] <wgrant> UPDATE ... RETURNING
[13:17] <wgrant> You can't just use rowcount, but you can use RETURNING to work out which rows were changed.
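A sketch of that RETURNING shape for the bulk case, again with sqlite3 standing in for Postgres (`RETURNING` needs SQLite >= 3.35; schema and names invented): the single UPDATE reports exactly which paths it changed, and the INSERT covers the complement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ref (path TEXT PRIMARY KEY, sha1 TEXT)")
conn.execute("INSERT INTO ref VALUES ('refs/heads/master', '1111')")

incoming = [("refs/heads/master", "2222"), ("refs/tags/v1.0", "3333")]
params = [v for pair in incoming for v in pair]
values = ", ".join(["(?, ?)"] * len(incoming))

# The bulk UPDATE names the rows it actually changed via RETURNING; a bare
# rowcount could only say how many, not which.
updated = {
    path
    for (path,) in conn.execute(
        f"WITH new(path, sha1) AS (VALUES {values}) "
        "UPDATE ref SET sha1 = new.sha1 FROM new WHERE ref.path = new.path "
        "RETURNING path",
        params,
    )
}
# Everything RETURNING didn't mention is new: insert it.
conn.executemany(
    "INSERT INTO ref VALUES (?, ?)",
    [(p, s) for p, s in incoming if p not in updated],
)
```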
[13:17] <cjwatson> "Bulk" is unlikely to be more than 100 rows or so, usually much less.
[13:17] <wgrant> However, SELECT+INSERT+UPDATE is just as correct as the CTE and probably quicker to implement now.
[13:17] <stub> cjwatson: yeah, "if no rows modified" is one row at a time.
[13:17] <cjwatson> Well, maybe more for lots of tags in a newly-pushed repo.
[13:17] <wgrant> Mm, lots of repos have lots of tags
[13:18] <cjwatson> Yeah, my linux tree here has 2648 tags
[13:19] <cjwatson> Rare for most of them not to be either "insert and never change again", though.
[13:19] <cjwatson> s/either //
[16:24] <cjwatson> wgrant: I'd appreciate a re-review of https://code.launchpad.net/~cjwatson/launchpad/git-ref-scanner/+merge/252763, since the changes were non-trivial.
[19:30] <blr> cjwatson: we have the same string -> GITOBJ.type mapping in turnip - would it be useful to have that in pygit2?
[19:31] <blr> bizarre that the library doesn't provide that... might send them a PR this week, time allowing.
[19:34] <blr> cjwatson: context - in your branch scanning branch.
[21:43] <cjwatson> blr: Hm, possibly, but I think LP wants to have its own enumeration for anything that's in the database, really.
[21:44] <cjwatson> blr: Although doing the string parsing bit wouldn't hurt, I suppose.