cjwatson | wgrant: git-ref-scanner bulk updateRef - do you just mean having it be a single method call for many refs? There's no actual concise way to do a bulk update in SQL, AFAICS, unless I want to bother creating a temporary table which I think is probably overkill here | 12:35 |
---|---|---|
wgrant | cjwatson: VALUES | 12:38 |
wgrant | cjwatson: We should have a few bulk updates around, using Update and Values storm expressions. Let me find one. | 12:38 |
wgrant | Oh, BulkUpdate. | 12:39 |
wgrant | PopulateLatestPersonSourcePackageReleaseCache | 12:39 |
wgrant | cjwatson: But I'd consider just having an upsert, like Python's dict.update. | 12:40 |
cjwatson | Oh huh, OK | 12:45 |
cjwatson | Thanks | 12:46 |
wgrant | I'm afraid you're going to have to learn a bit of non-trivial SQL. | 12:47 |
wgrant | Though it can probably be cargo-culted. | 12:47 |
cjwatson | Yeah, fine with learning it, just needed the starting point | 12:48 |
cjwatson | BulkUpdate seems like a reasonable interface, though, and it's not going to be a problem for the upsert to be sometimes two queries rather than one. | 12:49 |
wgrant | Two is still O(1) :) | 12:50 |
wgrant | IIRC Ian wrote that with a tonne of sqlvalues originally and I rewrote it to use the custom Storm expression. But it's only been used in that one place, so it may need a bit of adjustment to fit well for you. | 12:52 |
cjwatson | I suppose I could do the http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE thing, but it seems like overkill. | 12:52 |
wgrant | We should have retry logic in the job infrastructure anyway, though I don't think we do now. | 12:53 |
wgrant | (the webapp does, though) | 12:53 |
cjwatson | I guess concurrent notifies could be a thing, even without any other way to create refs. | 12:54 |
stub | We can do upsert in python now I think, using savepoints. | 12:54 |
cjwatson | https://wiki.postgresql.org/wiki/UPSERT oww my head | 12:59 |
stub | cjwatson: Will there potentially be concurrent connections trying to insert the same row? If not, the writable-CTE variant is easiest. | 13:02 |
cjwatson | It's possible, yes: two notifies might be scheduled in rapid succession due to two git pushes, and those might be run concurrently by different celery workers. | 13:02 |
cjwatson | (Although we could work around this by having notify check whether there's already a job and not schedule another one, which might be worthwhile if it makes this significantly easier) | 13:03 |
stub | mmm, so the stored procedure or savepoints... | 13:03 |
cjwatson | Savepoints seem cumbersome for a bulk upsert | 13:04 |
stub | Indeedy | 13:04 |
cjwatson | I think the stored procedure needs a temporary table, though, if I'm reading this right | 13:06 |
wgrant | cjwatson: Branch scans use postgres advisory locks to avoid two running on the same branch simultaneously. | 13:06 |
stub | Or do the CTE variant if you don't mind two jobs potentially conflicting and one aborting with a violation | 13:06 |
wgrant | It may be useful to do the same here. It's not going to totally exclude all possible conflicts, but it should fix most of them. | 13:07 |
stub | or yeah, advisory lock. | 13:07 |
cjwatson | what's the CTE variant? | 13:07 |
stub | The stored procedure version doesn't need a temporary table. https://dba.stackexchange.com/questions/13468/most-idiomatic-way-to-implement-upsert-in-postgresql-nowadays has a simple example of that | 13:07 |
stub | http://www.xzilla.net/blog/2011/Mar/Upserting-via-Writeable-CTE.html has an example of the CTE variant | 13:08 |
stub | CTE is the fastest approach | 13:09 |
wgrant | I don't see a huge problem with the easy client-side select+insert+update approach. | 13:09 |
stub | speed, no of queries. Depends on how bulk is bulk. | 13:12 |
stub | if you have a lock, it is faster doing the update first, then if now rows modified, do the insert | 13:13 |
cjwatson | Advisory lock plus something like https://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql/8702291#8702291 looks tolerable | 13:14 |
cjwatson | And maybe a retry loop around the whole thing | 13:14 |
cjwatson | The worst case without a retry loop AFAICS is that you might have to send another notify to get it to catch up properly. | 13:15 |
cjwatson | The "if no rows modified" strategy surely only works if I'm trying to upsert one row at a time | 13:16 |
cjwatson | (?) | 13:16 |
wgrant | Right, all of this is just a cache, so the worst case is we have to request the cache be refreshed again. | 13:16 |
wgrant | UPDATE ... RETURNING | 13:16 |
wgrant | You can't just use rowcount, but you can use RETURNING to work out which rows were changed. | 13:17 |
cjwatson | "Bulk" is unlikely to be more than 100 rows or so, usually much less. | 13:17 |
wgrant | However, SELECT+INSERT+UPDATE is just as correct as the CTE and probably quicker to implement now. | 13:17 |
stub | cjwatson: yeah, "if no rows modified" is one row at a time. | 13:17 |
cjwatson | Well, maybe more for lots of tags in a newly-pushed repo. | 13:17 |
wgrant | Mm, lots of repos have lots of tags | 13:17 |
cjwatson | Yeah, my linux tree here has 2648 tags | 13:18 |
cjwatson | Rare for most of them not to be either "insert and never change again", though. | 13:19 |
cjwatson | s/either // | 13:19 |
cjwatson | wgrant: I'd appreciate a re-review of https://code.launchpad.net/~cjwatson/launchpad/git-ref-scanner/+merge/252763, since the changes were non-trivial. | 16:24 |
blr | cjwatson: we have the same string -> GITOBJ.type mapping in turnip - would it be useful to have that in pygit2? | 19:30 |
blr | bizarre that the library doesn't provide that... might send them a PR this week, time allowing. | 19:31 |
blr | cjwatson: context - in your branch scanning branch. | 19:34 |
cjwatson | blr: Hm, possibly, but I think LP wants to have its own enumeration for anything that's in the database, really. | 21:43 |
cjwatson | blr: Although doing the string parsing bit wouldn't hurt, I suppose. | 21:44 |
Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!