=== cpaelzer_ is now known as cpaelzer
[12:08] cjwatson, wgrant Regarding delta indexes, I built a fake Deltas index for xenial->xenial-updates with 3 deltas per update, and I landed at 563 KB for main's Deltas.xz vs. 829 KB for its Packages.xz
[12:08] The fields were Package, Old-ID, New-ID, Size, SHA256
[12:08] Not convinced
[12:10] I wonder if I should "just" inject SHA256 of complete debs in the dpkg status database
[12:11] and then I can save one hashsum
[12:11] because new-id = sha256 of delta
[12:11] um no
[12:12] I can however save one ID, by using SHA256(old id | new id) as a field
[12:12] brings us down to 485 KB
[12:14] So, with Deltas index files, we'd be looking at a 60% update-time increase for an 80% install download-time decrease
[12:14] There was in fact the idea to have "smart" delta indexes once
[12:14] So update would download Packages files, then calculate upgrade download size
[12:15] and if size(Delta indexes) < some% of upgrade size, it would fetch Delta indexes and look for deltas
[12:15] oh, we need 4 deltas per update
[12:15] this does not scale well
[12:16] sizes of Deltas.xz ~ Packages.xz
[12:17] Wondering if a bloom filter might be worth it
[12:18] So update gets a bloom filter file for which updates have deltas
[12:18] "might have deltas"
[12:18] to reduce the number of fetches during install/upgrade
[12:18] when doing a per-source-package signature
[12:22] A SHA256 hash has sixteen 16-bit values we can use as hashes
[12:22] we then just need 2**16 bytes for the filter
[12:56] xenial-updates needs a 12.3 KB bloom filter for a 0.75% false positive rate
[12:56] sounds OK
[12:58] 1.5% increase in update time to reduce failing delta lookups by about 40% or so
[14:14] juliank: this sounds exciting. Is there more information somewhere?
[14:15] sladen: Yes, at https://debconf18.debconf.org/talks/66-delta-upgrades-revisited/, https://people.debian.org/~jak/dc18-deltas.pdf for example
[14:15] sladen: unless you mean the bloom filter specifically
[14:15] that just popped up today
[14:34] juliank: one possible optimisation of this is to provide N diff scripts, and 1x literal data per file. This would allow one deltadeb to match a scalable number of the last $N package versions
[14:35] juliank: so for the moment you could just ship a 1:1 upgrade mapping, but it leaves open the possibility for future optimisation
[14:35] sladen: complicated
[14:35] I mean, script is the wrong word
[14:36] I'm not sure where you'd integrate this
[14:36] on the ddelta level?
[14:37] Add a CRC32 of input data to each command, and then provide alternating blocks?
[14:37] * sladen re-finds https://www.uknof.org.uk/uknof6/Sladen-Delta-upgrades.pdf from a decade ago.
[14:38] juliank: https://people.debian.org/~jak/dc18-deltas.pdf page 8, each file has 1x "diff data" and 1x "extra data"
[14:38] Yes
[14:38] juliank: the suggestion would be that each file has Nx "diff data" and 1x "extra data".
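
A minimal sketch of the membership test this implies, assuming the filter's hash functions are taken as consecutive 16-bit words of the SHA-256 of the update's new ID; the filter size and the number of hash functions below are illustrative guesses, not part of any agreed format:

    /* Sketch only: FILTER_BITS and NUM_HASHES are assumed, not specified. */
    #include <stdint.h>
    #include <string.h>
    #include <openssl/sha.h>

    #define FILTER_BITS (1u << 16)   /* assumed filter size: 2^16 bits */
    #define NUM_HASHES  8            /* assumed number of hash functions */

    /* Returns 1 if the update identified by new_id *may* have a delta,
     * 0 if it definitely does not. */
    static int bloom_may_have_delta(const uint8_t *filter, const char *new_id)
    {
        unsigned char digest[SHA256_DIGEST_LENGTH];
        SHA256((const unsigned char *)new_id, strlen(new_id), digest);

        for (int i = 0; i < NUM_HASHES; i++) {
            /* use the i-th 16-bit word of the digest as a bit index */
            uint32_t bit = ((uint32_t)digest[2 * i] << 8 | digest[2 * i + 1]) % FILTER_BITS;
            if (!(filter[bit / 8] & (1u << (bit % 8))))
                return 0;
        }
        return 1;
    }

With such a filter, apt would only attempt a delta fetch when the test returns 1; a false positive merely costs one failed request.
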
[14:38] And there will be a CRC for diff data eventually (for the part we are adding the data to)
[14:39] So you could reasonably have an "or flag", too
[14:39] juliank: where the "bitmask" for the "extra data" can be expanded to cover the last N versions of a package
[14:39] But I think it gets too slow
[14:39] * sladen looks confused
[14:39] The problem is simple: If we have multiple diff data, we need to figure out which one to use
[14:40] for that we need to read the block / try to apply it
[14:40] if we fail, we'd have to seek back and start again
[14:40] unless you make the choice first
[14:40] per file, there is {1x input file (optionally validated by some hash), 1x "extra data", and 1x "diff data"}
[14:41] this gives a one-way mapping X+Y [transform via "diff data"] => Z
[14:42] you're still missing the point
[14:42] * sladen listens
[14:42] If I have n diffs for a given file
[14:42] how do I figure out which one to apply
[14:42] switch the question around
[14:43] if I have a starting input file with hash 0xdeadbeef, which diff do I *choose* to reach the end point
[14:43] you don't know anything about the file until you have started applying the delta
[14:43] (one assumes that you're already validating the input before processing it ;-)
[14:44] juliank: if you "don't know anything about the file", how did you find it on disk?
[14:44] I did not
[14:45] That's a different layer
[14:45] The real delta layer can only apply deltas given an old file
[14:45] yes
[14:46] On top of that there's the tree delta layer, which figures out the old name and the new one and stuff like that
[14:46] it's literally like a tarfile and a header with "old name" in it
[14:46] But of course
[14:47] we can just say "delta id 5 inside here is the delta for that combination"
[14:48] that said, you then increase the problem of finding the delta on the server
[14:48] * sladen raises an eyebrow
[14:49] your file name needs to identify which upgrade the delta belongs to
[14:49] then you end up with pkg_id1_id2_id3_ddelta.deltadeb or something
[14:50] I'm not sure on the exact details yet
[14:50] the latest draft stage is at https://lists.debian.org/debian-dpkg/2018/09/msg00019.html
[14:50] and even that is somewhat out of date
[14:50] juliank: perhaps think of it a different way. Let's say we have a package that is updated regularly (once a day), say the hypothetical "message-of-the-day.deb"
[14:51] as pkg_$old_$new_$algo.deltadeb might become $pkg_$hash($old|$new)_$algo.deltadeb
[14:51] 64 bytes less per delta
[14:51] juliank: only '/etc/motd' is changed in this package. the other files, such as /usr/share/message-of-the-day/README, remain unchanged
[14:53] juliank: for the last 5 versions, N .deltadebs are currently required, which are 98% the same.
[14:54] * sladen reads https://lists.debian.org/debian-dpkg/2018/09/msg00019.html
[14:54] I fully understand the argument for that approach
[14:54] I don't think it's easily doable given the requirements
[14:55] The requirements for deltas are (1) min. seeks on old file (2) no seeks on delta (3) no seeks on new file
[14:56] If you want each block to store $n$ diff data for $n$ old versions (or store $n$ control blocks for $n$ old versions)
[14:56] you can easily do that
[14:57] just assign indexes to them and map indexes to old packages in the control.tar
[14:57] but that's not elegant
[14:57] To get rid of that, you have to make the decision which diffs to pick in the delta itself
[14:57] and that gets nasty.
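
A sketch of the shorter naming scheme proposed just above, $pkg_$hash($old|$new)_$algo.deltadeb, assuming the two content IDs are simply concatenated before hashing; the helper and its argument layout are hypothetical, not taken from the draft:

    /* Sketch only: old_id/new_id are the content IDs from the index;
     * how they are encoded is an assumption. */
    #include <stdio.h>
    #include <string.h>
    #include <openssl/sha.h>

    static void delta_filename(char *out, size_t outlen, const char *pkg,
                               const char *old_id, const char *new_id,
                               const char *algo)
    {
        unsigned char digest[SHA256_DIGEST_LENGTH];
        char hex[2 * SHA256_DIGEST_LENGTH + 1];
        SHA256_CTX ctx;

        SHA256_Init(&ctx);
        SHA256_Update(&ctx, old_id, strlen(old_id));   /* hash old_id || new_id */
        SHA256_Update(&ctx, new_id, strlen(new_id));
        SHA256_Final(digest, &ctx);

        for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
            sprintf(hex + 2 * i, "%02x", digest[i]);

        snprintf(out, outlen, "%s_%s_%s.deltadeb", pkg, hex, algo);
    }

The name still uniquely identifies the old/new combination, but a single 64-character hex digest replaces the two embedded IDs, which is where the "64 bytes less per delta" comes from.
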
[14:57] can this avoid duplicating the common parts of new literal data ("extra data")?
[14:58] sladen: Not sure, but duplication does not matter much
[14:58] it's a very tiny overhead after zstd/xz compression
[14:58] you could in fact just include $n$ deltas I'd think for $n$ old versions
[14:59] ahhhh, okay, it relies on the enormous window size of zstd/xz/etc
[15:00] Now, what will happen is that we get CRC32 soon for the "old" data we are adding diffs to, to increase confidence that we are patching with the correct file
[15:01] !dmb-ping
[15:01] cyphermox, jbicha, micahg, rbasak, sil2100, slashd, tsimonq2: DMB ping.
[15:01] So my idea was that we could give each of these (diff, extra, seek) control triplets an "or" flag
[15:01] o/
[15:01] if the CRC does not match, we go to the next control entry
[15:02] that works, but it's somewhat inefficient
[15:02] because the diff block might be MBs large
[15:02] it also does not work after all, because we'd have to undo the write to the new file
[15:03] (i.e. seek back and truncate)
[15:03] but: Both the patch and the new file are a pipe
[15:04] (dpkg reads the patched file from the pipe, and calculates hashes before storing it on the system)
[15:05] You can solve that by just picking a fixed maximum window size you can keep in memory
[15:05] but um, for large files that slows things down a bit
[15:06] that said, for really big files, using windows during creation is a lot more effective
[15:07] I think bsdiff generation is O(n^2) atm
[15:07] if you use blocks of $B$ bytes, it becomes O((n/B) * B^2) = O(n*B)
[15:07] so it only grows linearly rather than quadratically
[15:07] juliank: in the msg00019 spec, "if file.mtime >= installed_version.install_time or file.mtime > file.ctime or file.size != expected_size" <-- probably a lot better to use a checksum to match before ingesting
[15:08] juliank: and once a checksum is used, that can become the primary matching key
[15:08] sladen: that's the paranoid mode
[15:08] our belief is that checksums are too slow to be practical here, and useless in most situations
[15:10] where "us" is me and some other people I talked to about this
[15:10] :)
[15:10] checksums are really slow if done on the server (rsync), but checksums on clients are $fast (cost of checksum is < cost of downloading precious bytes)
[15:10] What you already have checked is that the files, when unpacked, have the right content
[15:11] because of the IDs identifying the content of the package, and you looking up deltas based on IDs
[15:11] I'm unsure about the implications of actually hashing the files
[15:12] we will be reading them with a 60% probability
[15:12] because we'll be applying a delta
[15:12] and dpkg already hashes the files it is writing to the disk
[15:12] so the overhead should be less than 50%
[15:13] but let's measure
[15:14] I spent 47 seconds checking my 3326 packages
[15:14] on a fast Samsung SSD from this year
[15:15] this means that for an upgrade of 300 packages, I'd spend about 5 seconds
[15:15] *35 seconds with caches
[15:16] that's still presumably 99% disk I/O
[15:20] yes
[15:20] the non-paranoid mtime check only takes 1s
[15:21] that's not reading/checking any files, only filesystem structure
[15:21] that's the point
[15:21] we can be reasonably sure that the file has not been modified
[15:22] because it requires extreme stupidity or deviousness to change your mtime back to before install time
[15:22] or even build time
[15:24] When does that happen?
[15:25] Well, I'd say only if your clock is behind a few days/months, you have installed the package with the correct time, and you modify the file while keeping the bit length the same
[15:25] checking mtime sounds like an optimisation establishing that an upgrade is /probably/ possible -> fetch a set of hash(es) to see whether it is really truly possible
[15:25] but we can protect against clock resets too I think
[15:26] I think
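
A sketch of that non-paranoid check, i.e. the condition quoted from the msg00019 draft turned around: the installed file is only trusted as unmodified if its mtime predates the install time, its mtime is not newer than its ctime, and its size matches. Where install_time and expected_size come from (presumably they would have to be recorded in the dpkg database) is an assumption here:

    /* Sketch only: install_time/expected_size are assumed to come from
     * the dpkg database. */
    #include <stdbool.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <time.h>

    static bool file_looks_unmodified(const char *path, time_t install_time,
                                      off_t expected_size)
    {
        struct stat st;

        if (stat(path, &st) != 0)
            return false;
        if (st.st_mtime >= install_time)   /* touched after installation */
            return false;
        if (st.st_mtime > st.st_ctime)     /* mtime set back by hand */
            return false;
        if (st.st_size != expected_size)   /* length changed */
            return false;
        return true;
    }

Anything failing this test would presumably fall back to downloading the full .deb, or to the paranoid checksum mode.
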
[15:26] juliank: is there a directory somewhere, with an example set of .deltadebs to look at?
[15:27] no
[15:27] https://github.com/julian-klode/ddelta/tree/demo has a delta generation script
[15:27] ./build-delta.sh $old $new $delta
[15:27] builds a delta from $old to $new and stores it as $delta
[15:28] * delta deb
[15:28] it's very primitive
[15:30] one thing it does is generate a normal .deb with debian-binary instead of files with debian-delta
[15:30] but minor details
[15:30] ease of prototyping
[15:30] infinity, hi, could you have a look at bug 1794137 ?
[15:30] bug 1794137 in ubiquity (Ubuntu) "ubiquity FTBFS on Cosmic" [Undecided,New] https://launchpad.net/bugs/1794137
[15:31] sladen: I think windowing would definitely be worth it, the memory requirements go down a lot, effectively become constant, as https://github.com/julian-klode/ddelta/commit/05aecc28c2f049cad83d9c7ef2e2439d0ecae295 mentions
[15:31] but I think fixed windows of 8 MB would not work in practice
[15:32] you need some heuristics
[15:45] sladen: If you want to compare with xdelta3 as the algorithm, just use ddelta_generate="xdelta3 -s" instead
[15:49] juliank: where's the customised version of bsdiff() (specifically the search() function)?
[15:49] It's in that repo
[15:49] https://github.com/julian-klode/ddelta/blob/master/ddelta_generate.c
[15:51] it's mostly just Google's bsdiff fork plus some cleanup
[15:51] and without large file support
[15:53] (we don't really need large file support since we are deltaing individual files, and it reduces memory usage by about 50%)
[15:55] and well, memory usage with LFS is 9m + n
[15:55] so assuming you actually need large files >= 4 GB
[15:55] you end up with at least 36 GB of memory requirements
[15:56] unless you use the windowed branch; then you only need 50 MB (or well, 6w for window size w)
[15:57] well, some (variable) window should probably be used, and that should be stated in the debdelta
[15:58] so that e.g. an old ARM/whatever machine with 16 MB of RAM can choose to only look at deltadebs that declare they only use a 1 MB window
[15:58] and if not, download everything
[15:58] memory use when applying is already constant
[15:59] the memory requirement here is when generating
[16:00] not counting stdio, memory use is 64 KB of stack memory for the data blocks
[16:00] + a few bytes here and there, and three levels of recursion or so
[16:01] there are no heap allocations in the program (except for whatever libc does)
[16:01] I suggest reading ddelta_apply.c (https://github.com/julian-klode/ddelta/blob/master/ddelta_apply.c), it's quite easy to follow
[16:03] there's some wicked vectorization magic in apply_diff(), but apart from that, it's quite easy
[16:04] the be64toh is a bit ugly too
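
For reference, a simplified sketch of applying one of the (diff, extra, seek) control triplets mentioned above, in the spirit of ddelta_apply.c but not its actual code; the struct layout and byte-at-a-time I/O are illustrative (the real format uses big-endian 64-bit fields, hence the be64toh, and vectorizes the byte addition in apply_diff()):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* One control entry; field names and types are assumptions. */
    struct ddelta_ctrl {
        uint64_t diff_len;   /* bytes of diff data added onto the old file */
        uint64_t extra_len;  /* bytes of literal "extra data" for the new file */
        int64_t  seek;       /* offset to seek in the old file afterwards */
    };

    static int apply_one_block(FILE *old, FILE *patch, FILE *new_file,
                               const struct ddelta_ctrl *c)
    {
        /* diff data: new[i] = old[i] + patch[i] (bytewise, wrapping) */
        for (uint64_t i = 0; i < c->diff_len; i++) {
            int o = fgetc(old), p = fgetc(patch);
            if (o == EOF || p == EOF)
                return -1;
            if (fputc((uint8_t)(o + p), new_file) == EOF)
                return -1;
        }
        /* extra data: literal bytes that exist only in the new file */
        for (uint64_t i = 0; i < c->extra_len; i++) {
            int p = fgetc(patch);
            if (p == EOF || fputc(p, new_file) == EOF)
                return -1;
        }
        /* only the old file is seekable: patch and output are pipes */
        return fseeko(old, (off_t)c->seek, SEEK_CUR);
    }

Because the patch and the output are pipes, only the old file is ever seeked, which is also why the "or"-flag idea above is awkward: a rejected diff block would already have been written to the output pipe.
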
[17:37] juliank: this stuff is drastically simpler than what was being looked at ~10 years ago, which failed because of a desire to bit-for-bit recreate the target .deb; and the checksums to prove that were only available on the recompressed .deb, rather than purely the contents
[17:38] juliank: there did not seem to be an appetite for writing directly to the filesystem, partly because of the inability to roll back to the previous state on failure
[17:40] juliank: is there an appetite for directly writing, versus re-creating the .deb and getting dpkg -i to apply that?
[17:45] sladen: the rollback ability is exactly the same as for out-of-space errors and roughly the same as for failing preinsts
[17:46] After all, we apply the delta not to the file itself, but write the .dpkg-new file as before
[17:46] so if some delta fails, dpkg will revert precisely as it does with other unpack failures
[17:46] Or rather try to revert
[17:47] that is, it deletes the .dpkg-new files and calls some maintainer scripts which hopefully work
[17:48] With some modifications to dpkg and apt, it should be possible to make apt download the full deb if installing the delta fails
[17:48] For that, dpkg needs to communicate a failing unpack due to a delta
[17:48] via status fd
[17:49] apt needs to interpret that and acquire the full thing
[17:49] sladen: Also, the algorithm does support regenerating the .deb file bit-by-bit
[17:50] well, except for compressor changes, if any
[17:51] But the problem with regenerating the full deb is that you end up with a lot of space usage
[17:51] there is the intermediate step where you regenerate the tar while you're piping it to dpkg
[17:51] juliank: most compression tools can be used reproducibly with some set of flags.
[17:52] persia: not really
[17:52] Well, for a given point in time, they can
[17:52] But let's compress our deb now, and then try to regenerate it 5 years later
[17:55] juliank: I meant gz, xz, bz, etc., for which time is less important. For tar, etc. things like "--mtime @$$SOURCE_DATE_EPOCH" work. 5 years is trickier. Anyway, not important.
[17:56] persia: I think pristine-gz ships like 2 copies of different gzip versions now to enable that use case :)
[17:56] :(
[17:57] theoretically one needs to know every choice made by the encoder, and the problem is $impossible. In practice, it can be achieved in 99% of cases, but 90% of those are gzip -9, and there is code around that e.g. has snapshots of the different versions of bzip2
[17:57] yeah, like pristine-gz
[17:57] :)
[17:58] Now you can do bitwise regeneration of most debs
[17:58] but as they use xz it's quite slow
[17:58] ...
[17:58] If you don't care about bitwise, then you can go nuts
[17:58] e.g. regenerate full debs without compression
[17:58] or with zstd -1 if we do zstd
[17:58] which would not really be noticeable
[17:59] what would be better (and it is the same case as 10 years ago) would be to store the hash of the uncompressed tar in the .deb; then the compression algorithm becomes irrelevant (and can be changed as required)
[17:59] you can do that, yeah
[18:00] juliank: somewhere, there is the 'apt-sync' / 'apt-zsync' package code, which provides an APT method using zsync. It should be possible to copy and adapt that and have a more rounded and working demonstration system
[18:00] also, tardelta (or deltatar; aka a tarball of my deltas) is very easy to regenerate.
[18:00] sladen: Such stuff does not work
[18:00] Probably want to store multiple hashes (maybe the same set as in dsc files) to make it harder to game.
[18:00] juliank: why not?
[18:01] sladen: You cannot use rsync/zsync algorithms for binaries
[18:01] also, debs are not built rsyncable
[18:01] Now the first argument is a bit complicated, but should be easy to understand
[18:02] If you add a few bytes at the beginning of your file, all offsets change by that amount
[18:02] the entire file becomes unsyncable
[18:02] juliank: in this case, it is the apt-sync *APT method* that might be usable to get a more complete demonstration of this proposed *debdelta* system working
[18:02] I have a working dpkg that can install deltas directly, and I played around with that
[18:03] juliank: thus making the whole proposed *debdelta* system more demonstrable
[18:03] but yes, you can also build a method that reconstructs debs
[18:03] but we also need a repository layout anyway
[18:03] a PoC would not have any signature checks I think
[18:03] although
[18:04] the method could do that itself
[18:07] juliank: [regarding zsync] the gzip --rsyncable is about repeatedly restarting the gzip stream, limiting the window size, in order to better re-use existing literal data. In the proposed *debdelta*, that literal data is provided in the "extra data" block anyway. This already presumes that server-side disk space is no longer an issue
[18:09] if server-side disk space is indeed no longer an issue, then this makes the zsync side of stuff a lot easier (it becomes zsync minus the z); in addition to the compressed .deb, store an uncompressed .deb on the server, and allow HTTP/1.1 Range: Partial-Content access to this for missing pieces not available in the "extra data" (which could now be zero-length)
[18:10] sladen: rsyncable was about rsync/zsync
[18:10] obviously
[18:10] 2nd, the name is not debdelta
[18:10] 3rd, server-side disk space is an issue
[18:10] deltadeb?
[18:11] sladen: yeah, for now, but it's super confusing
[18:11] * sladen happy to use any name preferred
[18:11] would have called them ddebs, but we already have those :(
[18:12] pdeb was in the running as well (patch deb)
[18:12] debdiff and debpatch are both used as well
[18:12] naming software is *hard*
[18:13] call it "frank"
[18:13] I could call it voyager
[18:13] after all, related to delta
[18:14] well, the delta quadrant
[18:14] :D
[18:32] juliank: --rsyncable does two (unrelated) things. (1) resets the encoding/compression state on a particular input (zeros in the input stream); (2) more reset points in the stream where it can be entered without knowing state [back-reference window state + current Huffman tree in use].
[18:33] I don't care
[18:33] one *can* jump into the stream at any point if the current Huffman table is known, and the contents of the backreference window is available
[18:34] How's that relevant to the topic of deltas?
[18:35] choosing whether, and how much, literal data to duplicate, vs. trying to fetch the literal data from the original .deb
[18:35] (server-side disk space vs. bandwidth tradeoffs)
[18:48] hi all
[20:42] @help
[20:42] (help [<plugin>] [<command>]) -- This command gives a useful description of what <command> does. <plugin> is only necessary if the command is in more than one plugin.
[20:42] @help pilot
[20:42] (pilot (in|out)) -- Set yourself an in or out of patch pilot.
[20:42] @pilot in
=== udevbot_ changed the topic of #ubuntu-devel to: Archive: Feature Freeze (Cosmic Cuttlefish) | 18.04 released | Devel of Ubuntu (not support) | Build failures: http://qa.ubuntuwire.com/ftbfs/ | #ubuntu for support and discussion of Trusty-Bionic | If you can't send messages here, authenticate to nickserv first | Patch Pilots: cyphermox, xnox
[20:42] cyphermox, i think that keeping the patch-pilot name should be fine to be honest =)
[20:42] @pilot out
=== udevbot_ changed the topic of #ubuntu-devel to: Archive: Feature Freeze (Cosmic Cuttlefish) | 18.04 released | Devel of Ubuntu (not support) | Build failures: http://qa.ubuntuwire.com/ftbfs/ | #ubuntu for support and discussion of Trusty-Bionic | If you can't send messages here, authenticate to nickserv first | Patch Pilots: cyphermox
[20:43] xnox: *shrugs* either way. I was asking, since I figured only dholbach (or very few people) knew where the code was at all
[20:43] @pilot out
=== udevbot_ changed the topic of #ubuntu-devel to: Archive: Feature Freeze (Cosmic Cuttlefish) | 18.04 released | Devel of Ubuntu (not support) | Build failures: http://qa.ubuntuwire.com/ftbfs/ | #ubuntu for support and discussion of Trusty-Bionic | If you can't send messages here, authenticate to nickserv first | Patch Pilots:
[20:48] cyphermox, right, but i only had the calendar code, not the irc-bot code. and i think the feedback was that the calendar code is not wanted.
[21:31] I don't know
[21:31] some people like being scheduled, that way they know to do it at a specific time
[21:31] but I guess that works better for patch piloting than other things, maybe?
[21:31] brb
[21:32] ahasenack: congrats!
[21:33] ahasenack: congrats!
[22:32] is this channel in any way related to snapcraft.io?
[22:33] joelkraehemann: #snappy
[22:35] wxl: thank you
=== nyghtmeyr is now known as wxl
[23:14] c
[23:14] oops
[23:34] hmm is there an easy way to update to a new upstream with a git ubuntu clone-ed repo?
[23:34] uupdate is almost right, apart from the way it creates a new repo
[23:34] gbp import-orig is similarly almost but not quite what you want
[23:40] mwhudson: yeah, there's not a great way
[23:40] i think we have a bug for it
[23:40] mwhudson: here's what i've done in the past, it's not ideal
[23:41] mwhudson: clear out all non-debian/ directories and files from the git-ubuntu repo, move all the uupdate'd repo's files over
[23:41] mwhudson: `git add .` (should just be non-debian changes)
[23:42] mwhudson: insert a changelog entry (you can use the topmost from the uupdate'd one as a template for the version)
[23:42] mwhudson: i *think* that mostly works, that's how i would do the php updates by hand
[23:42] ah yeah
[23:42] i am reading how gbp import-orig is implemented now :)
[23:42] and i imagine it's actually similar to what gbp does, with probably more smarts.
[23:42] it's basically 'stash debian, update everything, unstash debian'
[23:42] (in my mind)
[23:43] it creates a new tree out of the tarball contents, commits that, then mashes the debian dir from the packaging branch into it
[23:43] yeah, that makes sense
[23:43] (with vaguely appropriate hashes as commit parents)
[23:43] we could do something similar, tbh, we have git-tree representations of any directory (or can) and can do temporary directory things
[23:44] i have an implementation of gbp's debian directory mashing in shell
[23:44] but not the other bit
[23:44] https://paste.ubuntu.com/p/qfd6wVksDy/
[23:44] there's LP: #1649940
[23:45] Launchpad bug 1649940 in usd-importer "can't prepare new upstream releases using gbp" [Medium,Confirmed] https://launchpad.net/bugs/1649940
[23:45] and LP: #1706219
[23:45] Launchpad bug 1706219 in usd-importer "`git ubuntu upstream-update` wrapper around uupdate/uscan" [Wishlist,Confirmed] https://launchpad.net/bugs/1706219
[23:45] i had some scaffolding in a branch to do the latter
[23:45] i think it's a 'future' item, because we are still stabilizing the importer ABI
[23:47] it's slightly frustrating that gbp has all the bits required for this
[23:47] but can't actually do it
[23:48] mwhudson: yeah, gbp has some very specific branch requirements, iirc. We were trying to avoid coding too much of that policy yet
[23:52] You can hack it by messing around with GIT_WORK_TREE
[23:52] Unpack the tarball somewhere
[23:52] ah good point
[23:52] yeah, that's what my scaffolding started to do
[23:52] basically, did uupdate in a specific place
[23:52] and took that working tree
[23:52] GIT_WORK_TREE=$there git add -A
[23:52] yep
[23:52] rbasak: could you spit that into one of the two bugs?
[23:53] Then some messing around with "git reset debian" (I'd have to think about it exactly)
[23:53] heh ok i have half a shell script for this now too
[23:54] rbasak: right, that was about where i got
[23:54] yeah, you want to commit changes to non-debian as part of the uupdate
[23:54] Done
[23:54] but it would be a nice feature :)