[03:36] <Guest47> Hello everyone! Interesting problem I'm wrestling with: I have a user with 60TB of video data in two locations, one properly organized, the other location...not. The data within is reasonably certain to be the same. Doing a recopy would take a ridiculous amount of time. Just wondering if there's a way to script walking the "correct" system into a
[03:36] <Guest47> text file (a "map") and then on the remote system, reading the "map" and recreating the original layout by simply moving inodes around
[03:38] <arraybolt3> Guest47: What you're trying to do is possible, but will require a bit of programming/scripting.
[03:38] <arraybolt3> I probably can't write the script now, but what I'd do is something like this:
[03:39] <arraybolt3> 1. On the "good" machine, build a tree with the file and directory structure. Get a SHA256 hash of every file in the tree.
[03:39] <arraybolt3> 2. On the "bad" machine, build a similar tree, computing the same hashes.
[03:39] <Guest47> If there isn't an existing bit of code you all might know of, what are your thoughts on: "find $1 > map.txt" run on both systems, then a diff between the two maps. Finally, parse the diff file so if correct-folder-Ab exists, mv file-a folder-A. If the folder doesn't exist, mkdir -p, then move... simplistic, I know :-/
[03:39] <Guest47> oh
[03:39] <arraybolt3> 3. Match the hashes up against each other.
[03:39] <arraybolt3> Then you can be sure what file is what and be able to move it where it goes.
[03:40] <arraybolt3> If the filenames are guaranteed to be unique *and* identical filenames on both systems indicate identical files, you can skip hashing and just use the filenames.
[03:40] <arraybolt3> Guest47: The way you're mentioning sounds fairly straightforward. Depending on how the data on the servers works, it might be that easy.
[03:41] <arraybolt3> I don't know for sure how exactly the data is going to behave though, so my suggestion should work in any event (so long as the videos are bit-for-bit identical).
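A minimal sketch of the map-and-move idea described above (paths and the map filename are placeholders; assumes GNU find/coreutils, bit-for-bit identical files, and no duplicate contents):

```shell
#!/bin/sh
# On the "good" machine: record "hash  relative/path" for every file.
cd /good/tree && find . -type f -exec sha256sum {} + > /tmp/map.txt

# On the "bad" machine (after copying map.txt over): hash each file,
# look its hash up in the map, and move it to the path the good
# machine had it at.
cd /bad/tree
find . -type f | while IFS= read -r f; do
    h=$(sha256sum "$f" | cut -d' ' -f1)
    dest=$(grep -m1 "^$h " /tmp/map.txt | sed 's/^[0-9a-f]*  //')
    [ -n "$dest" ] || { echo "no match for $f" >&2; continue; }
    mkdir -p "$(dirname "$dest")"
    mv -n "$f" "$dest"       # -n: never clobber an existing file
done
```

Files whose hash isn't in the map are left in place and logged, which matches the "log any collisions/oddballs" approach mentioned later.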
[03:41] <Guest47> arraybolt3 thanks for the thoughts! I'm *hopeful* file names are distinct, but without some kind of checksum, no way to be certain right?
[03:41] <arraybolt3> True.
[03:41] <arraybolt3> And sha256sum'ing 60 TB of data is going to take *looooooong*.
[03:42] <Guest47> Is there something faster to checksum than sha? Is crc32 still considered reasonable?
[03:42] <arraybolt3> If your code can cope with hash collisions, then something like a CRC or perhaps MD5 might be sufficient.
[03:42] <arraybolt3> But if you do that, you *have* to be ready for hash collisions.
[03:43] <sdeziel> Guest47: I'd also check if `rsync --fuzzy` would work for this. I don't know how flexible it is to find similar files in various directories
[03:43] <arraybolt3> MD5 is too likely to generate hash collisions to just be blindly trusted, whereas SHA256 is so unlikely to have a hash collision that it should be able to be trusted without any further checking.
[03:43] <arraybolt3> (At least this is what I have been led to believe from my research.)
[03:43] <arraybolt3> (If I'm wrong, someone please let me know.)
[03:46] <arraybolt3> Guest47: If rsync --fuzzy can't be made to work, I'd benchmark the speed of SHA256 on your hardware, then benchmark the network connection, and use whichever one's faster.
[03:46] <arraybolt3> If you have a 1 Gbps connection between the two, then probably the hashing will be quicker. If you have 100 Gbps, then you might want to just reclone and call it a day.
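A rough way to get those two numbers (sizes and hostname are arbitrary; this measures a single stream only):

```shell
# Hash throughput: time hashing 1 GiB of zeros straight from memory,
# so disk speed doesn't factor in.
time dd if=/dev/zero bs=1M count=1024 2>/dev/null | sha256sum

# Network throughput: e.g. iperf3 between the two hosts
# (requires `iperf3 -s` running on the other end).
iperf3 -c other-host
```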
[03:46] <Guest47> Odds are probably pretty low of a collision if comparing the simple sum and the filename... Any collisions still found I guess I'd just log
[03:47] <Guest47> Sadly, it's fully remote, one system in LA, other near San Diego
[03:47] <Guest47> Best one could hope for is 1gbps, but likely less
[03:48] <arraybolt3> Blah.
[03:48] <Guest47> Does copy time exceed programming/debugging time? It never ever does hahaha
[03:48] <sdeziel> arraybolt3: on some hardware, sha512sum is ~2x faster than sha256sum
[03:48] <arraybolt3> sdeziel: Oh really? Didn't know that. That would be quite handy.
[03:49] <Guest47> What is the "sum" command hashing?
[03:49] <arraybolt3> That's a 16-bit checksum.
[03:49] <arraybolt3> You are going to have so many many many collisions with that.
[03:49] <sdeziel> for 60TB I'd try the rsync approach of checking the metadata (size, name, etc)
[03:50] <sdeziel> comparing the size alone should give you a list of likely copies
[03:51] <Guest47> well... "man sum" seems to show the 16-bit sum plus a block count, so there's size info in there too. So that could be kind of useful (maybe?)
[03:51] <arraybolt3> sdeziel: Heh, what do you know, my system is one of the ones that can sha512 faster than it can sha256.
[03:51] <sdeziel> arraybolt3: re sha512sum being faster, it's apparently due to operating on larger blocks (https://crypto.stackexchange.com/questions/26336/sha-512-faster-than-sha-256)
[03:51] <arraybolt3> (It also makes my system make an odd whining noise? That's creepy.)
[03:52] <arraybolt3> There's also blake2b.
[03:52] <arraybolt3> That one beats the sap out of sha256 and sha512.
[03:53] <Guest47> brb laundry. Before I go, can I just say that I really appreciate your inputs on this? Thanks arraybolt3 and sdeziel
[03:53] <arraybolt3> Guest47: Sure thing!
[03:53] <arraybolt3> Yeah, seeing the speed of b2sum, I'd use that. It makes long hashes so it will probably not have hash collisions, and it's operating at mind-bending speeds for me - almost a full GiB/s when fed /dev/zero.
[03:54] <sdeziel> Guest47: np. Lastly, I wouldn't use `sum` as you end up reading 60TB from the disks anyway... checking the size would pull way less from the filesystem and give you a good enough first indicator
[03:54] <arraybolt3> And piping it an ISO file, still way way faster than SHA.
[03:54] <sdeziel> I'd keep the heavy hash computation for files that are close in size to disambiguate them
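A sketch of that size-first pass (GNU find's `-printf` assumed; paths are placeholders). Listing sizes only touches metadata, so it's nearly free compared to reading 60 TB; only the files whose sizes collide then need a content hash to disambiguate:

```shell
# Index files by size (metadata only, no file reads).
find /tree -type f -printf '%s\t%p\n' | sort -n > sizes.txt

# Sizes that occur more than once are the only candidates that
# actually need a (slow) content hash.
cut -f1 sizes.txt | uniq -d > dup-sizes.txt
```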
[03:58] <arraybolt3> (There's also b3sum which appears to be even faster, https://github.com/BLAKE3-team/BLAKE3)
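An easy way to compare the candidates on your own hardware (the file path is a placeholder; b3sum is a separate install, so the loop skips any tool that isn't present):

```shell
f=/path/to/big.file
for tool in sha256sum sha512sum b2sum b3sum; do
    command -v "$tool" > /dev/null || continue
    printf '%s: ' "$tool"
    ( time "$tool" "$f" > /dev/null ) 2>&1 | grep real
done
```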
[04:41] <Guest47> 2.54s for b2sum vs 1.95s for sum on the same 225MB file. Hilariously, sum took 1m33.99s on an 8.01GB file, b2sum took 1m18.26s on the same file
[04:47] <Guest47> I'll take a whack at writing the utility to do this tomorrow, I'll drop by and share it with you guys when I finish if you'd like
[04:50] <arraybolt3> Thanks! Good luck!
[05:35] <alkisg> Guest47: when doing such benchmarks, you should make sure they run with the same cache state. Either drop caches first, or pre-read the whole file so it's fully cached
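Both approaches alkisg mentions, sketched (the file path is a placeholder; writing to drop_caches requires root):

```shell
# Cold cache: flush the page cache before each timed run.
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time b2sum big.file

# Warm cache: pre-read the file once, then time each tool on equal footing.
cat big.file > /dev/null
time b2sum big.file
time sha256sum big.file
```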
[11:42] <pvh_sa> hey everyone... I have a server affected by this bug - https://bugs.launchpad.net/ubuntu/+source/cloud-initramfs-tools/+bug/1958260 - as the bug discussion points out, this only happens if the /lib/modules dir for your kernel version is empty in initramfs - see this code: https://git.launchpad.net/cloud-initramfs-tools/tree/copymods/scripts/init-bottom/copymods#n52
[11:42] -ubottu:#ubuntu-server- Launchpad bug 1958260 in cloud-initramfs-tools (Ubuntu) "cloud-initramfs-copymods hides the full list of modules from the system" [High, Incomplete]
[11:44] <pvh_sa> unfortunately since the server is now in this state it seems difficult to get it unstuck. I would like to manually intervene either in the initramfs or this script that is in there so that I can get my system back into a cleanly booting form...
[12:27] <athos> bryceh: this is an interesting git-ubuntu case in https://code.launchpad.net/~bryce/ubuntu/+source/nmap/+git/nmap/+merge/437077
[12:28] <athos> while it is my first time dealing with such a case, I suppose it is quite common...
[12:29] <athos> rbasak: when the patch gets uploaded, should we expect the current jammy-devel branch to be completely replaced by whatever is in bryceh's branch?
[14:13] <rbasak> athos: yes the branch pointer will just be updated
[14:13] <rbasak> athos: https://bugs.launchpad.net/git-ubuntu/+bug/1852389
[14:13] -ubottu:#ubuntu-server- Launchpad bug 1852389 in git-ubuntu "Branch pointers do not follow deletions, breaking ubuntu/devel and such" [Wishlist, New]
[17:21] <baldpope> good morning all, attempting to setup apt-cacher-ng, server is all setup and now adding auto-apt-proxy to our local nodes, have configured the hostname/dnsdomainname correctly on the node, but when running auto-apt-proxy, nothing is returned on my first node (it does return on the apt-cacher-ng node)
[17:23] <baldpope> running nslookup with a SRV type for the _apt_proxy._tcp.domain.name returns as expected
[17:27] <sdeziel> baldpope: if I understand correctly, you have to install the auto-apt-proxy package on your machines to (eventually) have them use your apt-cacher-ng, right? Instead of using a package to auto-detect, wouldn't it be simpler to deploy an apt.conf snippet saying to use a proxy?
[17:27] <sdeziel> `/etc/apt/apt.conf.d/01proxy`: `Acquire::http::proxy "http://apt-cacher-ng.domain.name:3142/";`
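Deploying the snippet sdeziel describes could be as simple as the following (the hostname is a placeholder for your apt-cacher-ng server):

```shell
printf 'Acquire::http::proxy "http://apt-cacher-ng.domain.name:3142/";\n' \
    | sudo tee /etc/apt/apt.conf.d/01proxy > /dev/null
```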
[17:29] <baldpope> looks like that would accomplish the same thing?
[17:32] <baldpope> thanks sdeziel 
[18:45] <bryceh> athos, I don't know if it's common, this is the first time I've come across it myself