=== polymorp- is now known as polymorphic | ||
=== remolej4 is now known as remolej | ||
Guest47 | Hello everyone! Interesting problem I'm wrestling with: I have a user with 60TB of video data in two locations, one properly organized, the other location...not. The data within is reasonably certain to be the same. Doing a recopy would take a ridiculous amount of time. Just wondering if there's a way to script walking the "correct" system into a | 03:36 |
Guest47 | text file (a "map") and then on the remote system, reading the "map" and recreating the original layout by simply moving inodes around | 03:36 |
arraybolt3 | Guest47: What you're trying to do is possible, but will require a bit of programming/scripting. | 03:38 |
arraybolt3 | I probably can't write the script now, but what I'd do is something like this: | 03:38 |
arraybolt3 | 1. On the "good" machine, build a tree with the file and directory structure. Get a SHA256 hash of every file in the tree. | 03:39 |
arraybolt3 | 2. On the "bad" machine, create a similar tree, making the hashes. | 03:39 |
Guest47 | If there isn't an existing bit of code you all might know of, what are your thoughts on: "find $1 > map.txt" run on both systems, then a diff between the two maps. Finally, parse the diff file so if correct-folder-Ab exists, mv file-a folder-A. If the folder doesn't exist, mkdir -p, then move... simplistic, I know :-/ | 03:39 |
Guest47 | oh | 03:39 |
arraybolt3 | 3. Match the hashes up against each other. | 03:39 |
arraybolt3 | Then you can be sure what file is what and be able to move it where it goes. | 03:39 |
arraybolt3 | If the filenames are guaranteed to be unique *and* identical filenames on both systems indicate identical files, you can skip hashing and just use the filenames. | 03:40 |
arraybolt3 | Guest47: The way you're mentioning sounds fairly straightforward. Depending on how the data on the servers works, it might be that easy. | 03:40 |
arraybolt3 | I don't know for sure how exactly the data is going to behave though, so my suggestion should work in any event (so long as the videos are bit-for-bit identical). | 03:41 |
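A minimal sketch of that manifest-and-match approach, assuming SHA256 hashes are effectively unique and using /srv/video as a placeholder for the real top-level directory on both machines:

```bash
#!/bin/bash
# Step 1, on the "good" machine (then copy map.txt to the other side, e.g. with scp):
#   cd /srv/video && find . -type f -print0 | xargs -0 sha256sum > map.txt
#
# Step 2, on the "bad" machine: hash every file and move it to the path the
# map says it should have. All paths here are placeholders.
set -u
cd /srv/video || exit 1

declare -A want                                  # hash -> correct relative path
while read -r hash path; do
    want["$hash"]="$path"
done < map.txt

find . -type f ! -name map.txt -print0 |
while IFS= read -r -d '' f; do
    hash=$(sha256sum "$f" | awk '{print $1}')
    dest=${want[$hash]:-}
    if [ -n "$dest" ] && [ "$f" != "$dest" ]; then
        mkdir -p "$(dirname "$dest")"
        mv -n "$f" "$dest"                       # -n: never overwrite an existing file
    fi
done
```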
Guest47 | arraybolt3 thanks for the thoughts! I'm *hopeful* file names are distinct, but without some kind of checksum, no way to be certain right? | 03:41 |
arraybolt3 | True. | 03:41 |
arraybolt3 | And sha256sum'ing 60 TB of data is going to take *looooooong*. | 03:41 |
Guest47 | Is there something faster to checksum than sha? Is crc32 still considered reasonable? | 03:42 |
arraybolt3 | If your code can cope with hash collisions, then something like a CRC or perhaps MD5 might be sufficient. | 03:42 |
arraybolt3 | But if you do that, you *have* to be ready for hash collisions. | 03:42 |
sdeziel | Guest47: I'd also check if `rsync --fuzzy` would work for this. I don't know how flexible it is to find similar files in various directories | 03:43 |
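For reference, a rough idea of what that invocation could look like (hostname and paths are placeholders, and -n keeps it a dry run until the plan looks right):

```bash
# --fuzzy tells rsync to look for a similar basis file (same size/mtime or a
# similar name) in the destination directory, so matching data is not re-sent.
# Note it only searches the same directory unless given twice together with
# --compare-dest/--copy-dest/--link-dest, which is the flexibility caveat above.
rsync -avn --fuzzy la-server:/srv/video-good/ /srv/video-bad/
```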
arraybolt3 | MD5 is too likely to generate hash collisions to just be blindly trusted, whereas SHA256 is so unlikely to have a hash collision that it should be able to be trusted without any further checking. | 03:43 |
arraybolt3 | (At least this is what I have been led to believe from my research.) | 03:43 |
arraybolt3 | (If I'm wrong, someone please let me know.) | 03:43 |
arraybolt3 | Guest47: If rsync --fuzzy can't be made to work, I'd benchmark the speed of SHA256 on your hardware, then benchmark the network connection, and use whichever one's faster. | 03:46 |
arraybolt3 | If you have a 1 Gbps connection between the two, then probably the hashing will be quicker. If you have 100 Gbps, then you might want to just reclone and call it a day. | 03:46 |
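Some back-of-the-envelope numbers for the recopy option, plus one way to measure what the link really delivers (iperf3 and the hostname are assumptions, not something from the setup described here):

```bash
# 1 Gbps ≈ 125 MB/s ≈ 0.45 TB/hour, so a full 60 TB recopy needs roughly
# 60 / 0.45 ≈ 133 hours ≈ 5.5 days even at line rate.
iperf3 -s                        # run on one site
iperf3 -c la-server.example.com  # run on the other; reports achievable throughput
```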
Guest47 | Odds are probably pretty low of a collision if comparing the simple sum, and the filename.. Any collisions still found I guess I'd just log | 03:46 |
Guest47 | Sadly, it's fully remote, one system in LA, other near San Diego | 03:47 |
Guest47 | Best one could hope for is 1gbps, but likely less | 03:47 |
arraybolt3 | Blah. | 03:48 |
Guest47 | Does copy time exceed programming/debugging time? It never ever does hahaha | 03:48 |
sdeziel | arraybolt3: on some hardware, sha512sum is ~2x faster than sha256sum | 03:48 |
arraybolt3 | sdeziel: Oh really? Didn't know that. That would be quite handy. | 03:48 |
Guest47 | What is the "sum" command hashing? | 03:49 |
arraybolt3 | That's a 16-bit checksum. | 03:49 |
arraybolt3 | You are going to have so many many many collisions with that. | 03:49 |
sdeziel | for 60TB I'd try the rsync approach of checking the metadata (size, name, etc) | 03:49 |
sdeziel | comparing the size alone should give you a list of likely copies | 03:50 |
Guest47 | well... "man sum" seems to show the 16bit sum, and the file size. So that could be kind of useful (maybe?) | 03:51 |
arraybolt3 | sdeziel: Heh, what do you know, my system is one of the ones that can sha512 faster than it can sha256. | 03:51 |
sdeziel | arraybolt3: re sha512sum being faster, it's apparently due to operating on larger blocks (https://crypto.stackexchange.com/questions/26336/sha-512-faster-than-sha-256) | 03:51 |
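A quick way to check which digest wins on a particular CPU, without touching the disks, is OpenSSL's built-in benchmark:

```bash
# Prints throughput at several block sizes; on many 64-bit CPUs without
# SHA extensions, sha512 comes out ahead for large blocks.
openssl speed sha256 sha512
```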
arraybolt3 | (It also makes my system make an odd whining noise? That's creepy.) | 03:51 |
arraybolt3 | There's also blake2b. | 03:52 |
arraybolt3 | That one beats the sap out of sha256 and sha512. | 03:52 |
Guest47 | brb laundry. Before I go, can I just say that I really appreciate your inputs on this? Thanks arraybolt3 and sdeziel | 03:53 |
arraybolt3 | Guest47: Sure thing! | 03:53 |
arraybolt3 | Yeah, seeing the speed of b2sum, I'd use that. It makes long hashes so it will probably not have hash collisions, and it's operating at mind-bending speeds for me - almost a full GiB/s when being piped in /dev/zero. | 03:53 |
sdeziel | Guest47: np. Lastly, I wouldn't use `sum` as you end up reading 60TB from the disks anyway... checking the size would pull way less from the filesystem and give you a good enough first indicator | 03:54 |
arraybolt3 | And piping it an ISO file, still way way faster than SHA. | 03:54 |
sdeziel | I'd keep the heavy hash computation for files that are close in size to disambiguate them | 03:54 |
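A cheap first pass along those lines, assuming GNU find (paths are placeholders): only files whose sizes occur more than once need an expensive hash to tell them apart.

```bash
# On each machine: size (bytes) and path for every file; this reads only metadata.
find /srv/video -type f -printf '%s\t%p\n' | sort -n > sizes.txt

# Sizes that appear more than once are the only candidates that need hashing.
cut -f1 sizes.txt | uniq -d > ambiguous-sizes.txt
```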
arraybolt3 | (There's also b3sum which appears to be even faster, https://github.com/BLAKE3-team/BLAKE3) | 03:58 |
Guest47 | 2.54s for b2sum vs 1.95s for sum on the same 225 MB file. Hilariously, sum took 1m33.99s on an 8.01 GB file, b2sum took 1m18.26s on the same file | 04:41 |
Guest47 | I'll take a whack at writing the utility to do this tomorrow, I'll drop by and share it with you guys when I finish if you'd like | 04:47 |
arraybolt3 | Thanks! Good luck! | 04:50 |
alkisg | Guest47: when doing such benchmarks, you should make sure they run with the same cache state: either drop the caches first, or cache the whole file beforehand | 05:35 |
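That caveat matters because whichever tool runs second usually reads the file straight from RAM. A fairer comparison, assuming root access for the cache drop and a placeholder file name:

```bash
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches   # cold-cache run
time sha256sum sample.mp4
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
time b2sum sample.mp4

# Or the other way around: warm the cache once, then time each tool so both
# measure pure CPU cost rather than disk speed.
cat sample.mp4 > /dev/null
time sha256sum sample.mp4 && time b2sum sample.mp4
```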
=== xispita is now known as Guest4041 | ||
=== xispita_ is now known as xispita | ||
pvh_sa | hey everyone... I have a server affected by this bug - https://bugs.launchpad.net/ubuntu/+source/cloud-initramfs-tools/+bug/1958260 - as the bug discussion points out, this only happens if the /lib/modules dir for your kernel version is empty in initramfs - see this code: https://git.launchpad.net/cloud-initramfs-tools/tree/copymods/scripts/init-bottom/copymods#n52 | 11:42 |
-ubottu:#ubuntu-server- Launchpad bug 1958260 in cloud-initramfs-tools (Ubuntu) "cloud-initramfs-copymods hides the full list of modules from the system" [High, Incomplete] | 11:42 | |
pvh_sa | unfortunately since the server is now in this state it seems difficult to get it unstuck. I would like to manually intervene either in the initramfs or this script that is in there so that I can get my system back into a cleanly booting form... | 11:44 |
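Not from the bug thread, just a hedged sketch of the generic recovery path, assuming a shell can still be reached (an older kernel, the rescue target, or live media plus a chroot):

```bash
# Inspect what the current initramfs actually ships for this kernel; the
# modules directory may sit under lib/ or usr/lib/ depending on the release.
unmkinitramfs /boot/initrd.img-"$(uname -r)" /tmp/initrd-contents
find /tmp/initrd-contents -type d -name "$(uname -r)"

# If the copymods hook is what is hiding the modules, removing the package and
# regenerating the initramfs takes that script out of the boot path entirely.
sudo apt-get remove cloud-initramfs-copymods
sudo update-initramfs -u -k "$(uname -r)"
```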
athos | bryceh: this is an interesting git-ubuntu case in https://code.launchpad.net/~bryce/ubuntu/+source/nmap/+git/nmap/+merge/437077 | 12:27 |
athos | while it is my first time dealing with such a case, I suppose it is quite common... | 12:28 |
athos | rbasak: when the patch gets uploaded, should we expect the current jammy-devel branch to be completely replaced by whatever is in bryceh's branch? | 12:29 |
=== justache is now known as deliriumt | ||
=== deliriumt is now known as justache | ||
=== cpaelzer_ is now known as cpaelzer | ||
=== sdeziel_ is now known as sdeziel | ||
rbasak | athos: yes the branch pointer will just be updated | 14:13 |
rbasak | athos: https://bugs.launchpad.net/git-ubuntu/+bug/1852389 | 14:13 |
-ubottu:#ubuntu-server- Launchpad bug 1852389 in git-ubuntu "Branch pointers do not follow deletions, breaking ubuntu/devel and such" [Wishlist, New] | 14:13 | |
=== otisolsen70_ is now known as otisolsen70 | ||
baldpope | good morning all, attempting to set up apt-cacher-ng; the server is all set up and I'm now adding auto-apt-proxy to our local nodes. I have configured the hostname/dnsdomainname correctly on the node, but when running auto-apt-proxy, nothing is returned on my first node (it does return on the apt-cacher-ng node) | 17:21 |
baldpope | running nslookup with a SRV type for the _apt_proxy._tcp.domain.name returns as expected | 17:23 |
sdeziel | baldpope: if I understand correctly, you have to install the auto-apt-proxy package on your machines to (eventually) have them use your apt-cacher-ng, right? Instead of using a package to auto-detect, wouldn't it be simpler to deploy an apt.conf snippet saying to use a proxy? | 17:27 |
sdeziel | `/etc/apt/apt.conf.d/01proxy`: `Acquire::http::proxy "http://apt-cacher-ng.domain.name:3142/";` | 17:27 |
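A quick way to roll that snippet out on a node and confirm apt picked it up (the hostname is whatever actually resolves in this environment):

```bash
echo 'Acquire::http::proxy "http://apt-cacher-ng.domain.name:3142/";' |
    sudo tee /etc/apt/apt.conf.d/01proxy

apt-config dump | grep -i proxy   # should show the Acquire::http::proxy value
sudo apt-get update               # requests should now appear in apt-cacher-ng's log
```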
baldpope | looks like that would accomplish the same thing? | 17:29 |
baldpope | thanks sdeziel | 17:32 |
bryceh | athos, I don't know if it's common, this is the first time I've come across it myself | 18:45 |
=== chris15 is now known as chris14 | ||
=== remolej4 is now known as remolej |