[10:24] <cjwatson> wgrant: Can you think of any faster way to do the RTM "which SPPHs to copy" calculation than to basically do ubuntu.main_archive.getPublishedSources(distro_series=utopic, pocket="Release") and walk through the whole collection?  There are about 35000 elements in that right now, and I guess maybe a couple of thousand more by August.  I'm sure that's faster than doing lots of individual getPublishedSources calls, but wondering if I ...
[10:25] <cjwatson> ... should be adding new API first
[10:25] <wgrant> cjwatson: grep-dctrl?
[10:26] <cjwatson> On what?  I'm not necessarily forking from today's state
[10:26] <wgrant> xnox has an app for that.
[10:26] <cjwatson> And that still leaves me with querying for all the SPPHs anyway
[10:26] <wgrant> But I wouldn't be averse to enabling filtering on datepublished > X and (datesuperseded IS NULL OR datesuperseded > Y)
[10:27] <wgrant> You don't need the SPPHs, just the versions.
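The approach being settled on here — one bulk getPublishedSources call, walked once, keeping only the name/version pairs — can be sketched in plain Python. The attribute names below follow launchpadlib's SourcePackagePublishingHistory entries (`source_package_name`, `source_package_version`); treat them as an assumption, since any objects carrying those two fields would do:

```python
def versions_to_copy(spphs):
    """Walk a collection of source publications once and reduce it to
    the (source name -> version) mapping the fork copy needs.

    A single pass over one large collection avoids thousands of
    per-package API round trips.
    """
    versions = {}
    for spph in spphs:
        # Later entries for the same source name win; if the collection
        # is ordered newest-first, use setdefault() instead.
        versions[spph.source_package_name] = spph.source_package_version
    return versions
```
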
[10:27] <cjwatson> Oh, true
[10:27] <cjwatson> xnox: Remind me where your archive wayback machine is?
[10:31] <cjwatson> The datepublished > X component of that wouldn't be very useful, incidentally.  Some of the SPPHs in question might well just have been published when utopic was created.
[10:31] <wgrant> Er yeah.
[10:31] <wgrant> datepublished < X
[10:32] <cjwatson> Ah yes
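With the sign fixed, the proposed filter — published before the snapshot time, and not yet superseded at it — is just a predicate over two dates. A minimal sketch, collapsing wgrant's X and Y into a single snapshot timestamp (the field names are illustrative, not the actual API):

```python
from datetime import datetime
from typing import Optional

def current_at(datepublished: datetime,
               datesuperseded: Optional[datetime],
               snapshot: datetime) -> bool:
    """True if a publication was live at `snapshot`: already
    published, and either never superseded or superseded only
    after the snapshot time."""
    return (datepublished <= snapshot
            and (datesuperseded is None or datesuperseded > snapshot))
```
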
[10:32]  * cjwatson tests materialising the whole gPS collection to see whether this is worth optimising in the first place
[10:32] <wgrant> My condolences.
[10:33] <wgrant> though sources might not be so bad, I guess.
[10:33] <wgrant> Possibly only a thousand requests.
[10:33] <cjwatson> That terminal window wasn't doing anything else anyway
[10:37] <wgrant> SPPHs scoped to series and archive might be doable without any special indices, but we might need to investigate GiST over a tsrange to get adequate performance.
[10:38] <cjwatson> Hopefully we can get it from already-published Sources.
[10:38] <wgrant> That's the ideal.
[10:38] <cjwatson> Failing xnox's wayback machine, I could hack archive-reports to stash copies for a while
[10:38] <wgrant> Exactly.
[10:39] <xnox> cjwatson: i have one locally, what dates are you interested in?
[10:39] <cjwatson> xnox: Roughly August 1-15
[10:39] <wgrant> Argh, I need to sort out overrides this week.
[10:39] <cjwatson> I do not expect you to have this yet :-)
[10:39] <xnox> cjwatson: utopic?
[10:39] <cjwatson> Yes
[10:39] <cjwatson> xnox: This is for forking ubuntu-rtm in about a month
[10:40] <cjwatson> xnox: If you don't have it somewhere public already, maybe it's easier for me to just start stashing Sources files now
[10:40] <xnox> cjwatson: i'm like, hm, which year =))) ah. right. there is github.com:xnox/apt-mirror.git
[10:41] <xnox> cjwatson: or, i need a machine which at times uses up to 8GB of RAM (an efficient git repack requires storing the largest blob in RAM, so as not to use too much disk space)
[10:41] <xnox> cjwatson: i could run it on e.g. snakefruit.
[10:41] <cjwatson> Hum.  Maybe this is overkill.
[10:42] <xnox> otherwise it eats up disk-space quickly
[10:42] <wgrant> xnox: Huh, what's the big blob?
[10:42] <xnox> well this is archiving *all* pockets though.
[10:42] <wgrant> Unless you're storing gz/bz2, this should compress well and easily.
[10:42]  * xnox should measure how much it takes to archive just one series.
[10:42] <xnox> at the moment my .git is 3.3GB + 4.6GB current tree
[10:43] <cjwatson> dists/utopic/*/source/Sources.bz2 is 8M total, snakefruit has 356G free
[10:43] <xnox> it's all dists/ for all ubuntu suites, and only uncompressed files are committed into history.
[10:43] <cjwatson> I could just stash them all
[10:43]  * jpds wonders if xnox has heard of git-annex.
[10:43] <xnox> wgrant: the .gpg files do not compress at all, as they are fully rewritten on each publish cycle.
[10:43] <cjwatson> wgrant: customs maybe?
[10:43] <wgrant> customs would be much bigger than that, surely.
[10:44] <wgrant> Though I guess the isos might compress well.
[10:44] <xnox> cjwatson: i believe the right solution is to do a round-robin type of thing somehow, with e.g. rsync/rsnapshot/hardlinks?! Cause it doesn't make sense to store 15-minute resolution indefinitely.
[10:44] <xnox> and that would keep disk/memory usage constant.
[10:45] <cjwatson> We don't have to store indefinitely; for this purpose we're interested in a fairly narrow window, we just don't know exactly when in that window.
[10:45] <wgrant> If I were doing this I'd just store the non-custom, non-compressed bits in a git repo forever.
[10:45] <cjwatson> I'd have to get git installed on snakefruit, but we could run apt-mirror-snapshot out of archive-reports for a shortish period of time.
[10:45] <wgrant> Apart from the small OpenPGP sigs they should compress very well.
[10:46] <cjwatson> Or indeed forever if it works well enough, yeah.
[10:46] <xnox> with my silly git thing, I essentially do 2x rsyncs (archive & ports), verify all .gpg to get a consistent tree, commit *.gpg Packages Release, and have a mini front-end to query timestamps and generate .gz/.bz2 on the fly.
[10:46] <xnox> or one can check them out.
[10:46] <cjwatson> Doing it from archive-reports guarantees the right granularity.
[10:46] <xnox> (frontend is separate script, from the snapshotter)
[10:46] <cjwatson> And we could discard the first two steps of that.
[10:48] <xnox> well, all you need then is just $ git init .; git add -A; git commit -m 'auto'. In that directory. And then repack/rewrite to discard useless stuff.
[10:48] <xnox> and a proper .gitignore to skip useless things.
[10:48] <xnox> (that can be recreated)
[10:48] <xnox> (*.gz *.bz2)
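The snapshotting loop described above amounts to three git commands plus an ignore file for the regenerable compressed indexes. A rough sketch of driving it from a Python job such as archive-reports (the directory layout, commit message, and committer identity are all illustrative assumptions):

```python
import subprocess
from pathlib import Path

def snapshot_commit(mirror: Path) -> None:
    """Commit the current state of a dists/ mirror into git,
    skipping *.gz/*.bz2 files that can be recreated from the
    uncompressed indexes."""
    gitignore = mirror / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text("*.gz\n*.bz2\n")
    if not (mirror / ".git").exists():
        subprocess.run(["git", "init", "."], cwd=mirror, check=True)
    subprocess.run(["git", "add", "-A"], cwd=mirror, check=True)
    # Identity set inline so the job works without global git config.
    # Note: exits non-zero (raising here) if there is nothing to commit.
    subprocess.run(
        ["git", "-c", "user.name=snapshot",
         "-c", "user.email=snapshot@localhost",
         "commit", "-m", "auto"],
        cwd=mirror, check=True)
```
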
[10:49] <cjwatson> Materialising gPS for utopic release takes about 20 minutes on my ADSL, BTW.
[10:50] <xnox> if we have proper dists/ for the right publisher cycle, we are done. Or I can bring up canonistack instances and run them from now till September, and stash copies somewhere, e.g. people.canonical.com
[10:51] <xnox> jpds: i haven't used git-annex, as it's typically never installable in devel releases =)))))) </heretic>
[10:52] <cjwatson> It's typically installable in devel, just not in devel-proposed :-)
[10:52] <xnox> i know :-P
[10:52] <cjwatson> OK, so it sounds like I just want to get git on snakefruit and then do roughly as you suggest above
[10:54]  * cjwatson files an RT for the former
[10:55] <xnox> cjwatson: and if you make that .git repository clonable by me, I can pull it to my servers & provide nice public frontends from my servers to query it on a per-timestamp basis etc.
[10:56] <xnox> reliable snapshotting that doesn't get OOMed is the thing i'm missing to make the snapshotter interface public.
[10:57] <cjwatson> snakefruit has 6G of RAM; if this requires a ton of RAM I can't guarantee that
[10:59] <xnox> cjwatson: so, git commit will always succeed (it only needs RAM to hash the largest file), but git repack may fail, and thus .git may keep growing in size. If you don't go $ git repack -A -d --window 9999 --depth 9999 you should be fine.
[10:59] <wgrant> Heh
[10:59] <wgrant> That's going to OOM on just about any repo.
[11:00] <xnox> if disk space becomes an issue, and the repack needed to save disk space OOMs, then we'd need to do something, e.g. split/graft/offload history.
[11:02]  * xnox should think of a round-robin solution and estimate the required disk space there. That would have small memory requirements.
[16:25] <cjwatson> wgrant: Do PackageBuildFormatterAPI and ArchiveFormatterAPI perhaps want to gain the distribution name?
[20:42] <cjwatson> wgrant: Mind if I take the "Optimise publish-distro phase A" Asana task?  I think I understand what shape things ought to be
[23:40] <wgrant> cjwatson: Lovely, that's exactly the first step I was going to do.
[23:40] <wgrant> cjwatson: re. the formatter APIs, they'll all use the new Archive.reference that I'm about to land.