[07:22] that was a quickfix :p [07:23] lotuspsychje: indeed :) [07:27] ok I managed to boot without u-boot [07:28] that's pretty cool. saved me some suffering === Wryhder is now known as Lucas_Gray [09:28] i have a pxe installation with initrd.gz and linux file for the installation. the installation does not find the nic of the server. i have the source files for the driver but i dont know how to integrate that in my initrd.gz any good guides to that? [10:21] Hi all. I am getting very frustrated with ZFS at the moment. I made a RAIDZ (RAID 5) with 4x SSD drives. Every time that I reboot or shutdown and boot up, I get a different status. Now, "zpool status" shows all drives "ONLINE" and CKSUM = 0. However, sometimes they are all "DEGRADED" and yesterday I had one degraded and the rest online. [10:22] I have tried S.M.A.R.T. I have tried scrubbing and I have rebuilt the pool from scratch. [10:22] I don't know how to diagnose this further, to work out where the problem lies, particular disk, ZFS itself, the hard drive controller, BIOS or some sort of kernel problem. [10:23] Corruption is happening, "zpool status -v" listing the files, I go to them and I can't open them. [10:25] (I also checked that all the drive cables are firmly and fully plugged in) [11:51] Anyone got some terms I could search around to try and figure this out? [12:06] Jenshae: is this native zfs or fuse-zfs? [12:06] Jenshae: also, please pastebin zpool status -v [12:06] I don't know what fuse-zfs is. [12:06] Jenshae: I've seen zfs finding errors the drive didn't know about (so-called "silent" errors) [12:07] zfs-fuse, perhaps - anyway - that's zfs running under fuse, that is, in usermode. 
If you have the spl/zfs kernel modules, it's native [12:08] https://pastebin.ubuntu.com/p/wK3nSVfMFX/ === tinwood is now known as tinwood-afk [12:18] RoyK: Sorry, took awhile to sanitise it, most of my folders and file names are a bit too descriptive, so this is just a snippet https://pastebin.ubuntu.com/p/Yj4PHpTZbP/ [12:19] hm… that's wierd - too many errors on all four drives? [12:19] which version of ubuntu and zfs/zpool is this? [12:20] Yeah, it is weird, some boot ups, all listed as "ONLINE" without "DEGRADED" against them, sometimes all drives are "DEGRADED" and only yesterday did I have a single drive result. [12:20] weird, even [12:21] how are these connected? standard sata from onboard controllers or something more fancy? [12:21] Ubuntu 18 ... what is the -v / -version, etc to get the ZFS version? [12:21] Standard board SATA, yes. [12:21] probably --version [12:21] I haven't worked with zfs for a while [12:22] *in* a while [12:22] damn [12:22] * RoyK complains about "bad language day" [12:23] zfsutils-linux is already the newest version (0.7.5-1ubuntu16.9). [12:23] I find that I get temporary brain damage and my typing goes to hell if I deprive myself of sleep. [12:24] Like, 3 hours sleep on day 1, 14 hours on day 2, 3 hours on day 3 but on the fourth day, still typing like I am malfunctioning. [12:25] (Have had one of those weeks, hence the slow responses and "like" twice in a sentence." [12:25] ... and a " instead of a ) ... [12:26] https://xkcd.com/859/ [12:27] Are you trying to cast a hex on me?! :o [12:28] What are these about? " vol2:<0x241ec> " I could restore the other files but I don't know how to manually fix what ever that is. [12:29] hehe [12:29] does it allow you to start a scrub? [12:53] scan: scrub in progress since Wed Jul 8 13:52:36 2020 [12:53] 2.13G scanned out of 593G at 545M/s, 0h18m to go [12:53] 0B repaired, 0.36% done [12:53] Yes. [12:55] I am not too concerned about the data. I have backups. 
I just don't want to throw away money on a motherboard or a set of drives without knowing where the fault is and I am getting tired of restoring data while this problem persists. [13:13] Jenshae: I can understand - it's rare to get errors on all drives at the same time [13:14] Jenshae: btw, can you pastebin smart data from the drives? [13:16] RoyK: Is there a terminal interface for SMART? I can't copy out the GUI results. [13:17] smartctl -a /dev/sdX | pastebinit [13:17] that is - wait [13:17] for dev in sd[abcd] ; do echo ====== $dev ====== ; smartctl -a /dev/$dev ; done | pastebinit [13:17] for instance [13:18] replace [abcd] with the real device names [13:18] Hmm ... "sudo: smartctl: command not found". I have Gnome Disks installed. [13:18] apt install smartmontools [13:19] "sudo apt install msartmontools" I am so msart [13:19] hehehe [13:19] Oh no, I am infected! M$-Art ... [13:21] this has nothing to do with M$ ;) [13:21] https://help.ubuntu.com/community/Smartmontools describes the three tests but doesn't say how to initiate one. [13:22] smartctl -t short /dev/sdX [13:23] " smartctl --test=long /dev/sda /dev/sdb /dev/sdc /dev/sdbd " will work?
[13:23] smartctl -l selftest /dev/sdX (or smartctl -a /dev/sdX - it'll normally show progress as well) [13:23] no, it doesn't take more than one argument, so better 'for dev in /dev/sd{a..d} ; do smartctl -t long $dev ; done [13:24] ' [13:24] well, it takes several arguments, but only one drive, for some reason [13:24] " sudo smartctl --test=long /dev/sda && sudo smartctl --test=long /dev/sdb && sudo smartctl --test=long /dev/sdc && sudo smartctl --test=long /dev/sdd " [13:24] btw, if you have /dev/sdbd, it means you have a *lot* of drives :D [13:25] *Flex* [13:25] that also works, obviously, but I prefer a little for loop [13:26] just remember that with &&, it will only run the next command (etc) if the first succeeds [13:26] "RoyK: btw, if you have /dev/sdbd, it means you have a *lot* of drives :D" https://www.youtube.com/watch?v=YFk2_5RkwlA [13:27] hehe [13:28] linux starts off with sda, then sdb and so on until it reaches sdz and then starts over with sdaa, sdab etc, so if you have sdbd, it means you should have at least 55 drives === tinwood-afk is now known as tinwood [14:05] what, vimoutliner package was removed in focal? [14:15] https://pastebin.ubuntu.com/p/3ZZPDJCkxJ/ https://pastebin.ubuntu.com/p/9F9MRWT99c/ https://pastebin.ubuntu.com/p/3HhZtmGMrP/ https://pastebin.ubuntu.com/p/25zJRCVd8G/ [14:15] RoyK: Results are in. [14:17] Extended tests, all passed, no failures shown. [14:18] hm - those "unknown attributes" - I wonder if you can find out something more if you compile smartmontools from scratch [14:18] no need to install it - just run it from the source dir [14:18] Got a cmd for that? [14:19] Jenshae: https://github.com/smartmontools/smartmontools [14:19] Jenshae: mkdir -p src/git [14:19] cd src/git [14:19] git clone https://github.com/smartmontools/smartmontools.git [14:20] cd smartmontools/smartmontools/ [14:20] ./autogen.sh ; ./configure ; make [14:22] https://pastebin.ubuntu.com/p/3HBKrX9JQp/ [14:24] nvm, step by step. 
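RoyK's for-loop written out, for reference. It assumes the four device names from the conversation (adjust `drives` to your own); `smartctl` only starts a test on one drive per invocation, which is why the loop is needed, and `-t long` returns immediately because the self-test runs inside the drive firmware.

```shell
# Kick off long self-tests on each drive, then (later) read results.
drives="/dev/sda /dev/sdb /dev/sdc /dev/sdd"

if command -v smartctl >/dev/null; then
  for dev in $drives; do
    sudo smartctl -t long "$dev"       # returns at once; the drive
  done                                 # runs the test internally
  for dev in $drives; do
    echo "====== $dev ======"
    sudo smartctl -l selftest "$dev"   # read back results/progress
  done
else
  echo "smartctl not installed; would test: $drives"
fi
```

Checking results with `smartctl -l selftest` afterwards does not re-run anything; as noted later in the conversation, the drive keeps its own self-test log.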
[14:24] you may need some packages like the build-essential metapackage [14:24] It wouldn't "sudo ./autogen.sh" and needed "sudo sh ./autogen.sh" after that then each one with sudo in front of it. [14:24] no need for sudo there [14:25] It didn't work without it. Done now. Command to run the compiled one? [14:25] unless you ran 'sudo mkdir -p src/git' [14:26] ./smartctl [14:26] add a sudo in front of that and use the same parameters as last time [14:27] looks like 6.6 is the one installed on my machine (debian buster 10) while the one from git is 7.2, so probably some new stuff there [14:27] just check smartctl --version and ./smartctl --version [14:28] * Jenshae crosses fingers [14:29] See you again in 30+ minutes [14:29] Do you work with servers or do it as a hobby? [14:29] a bit of both :D [14:30] I've worked with IT and servers since around 1996 [14:30] * Jenshae waves a feather duster around >;P [14:30] Actually, I started work in IT in 1998. IT cafe after school and weekends. [14:31] I haven't managed to get through my Linux+ manual yet. >.> [14:32] Keep going a few chapters on, then realising I can't remember anything from a chapter or two back. Go back and re-read. [14:33] I also keep having a background thought of, "I am unlikely to use that. I could look it up." [14:34] I am getting to the point where I might save up for a System76 work station. Building and configuring my own ones ... I am starting to doubt myself or the industry as a whole. [14:35] hehe [14:35] I first started working "mosttime" with linux in 1998 and fulltime in 2000 [14:36] linux has luckily evolved a bit since the first slackware 2.1 I installed in 1994 [14:47] Jenshae: btw, you don't need to run a new test - that's done by the drive itself. Just run ./smartctl -l selftest /dev/sdX [15:42] hey fellas, I've a server setup w/ldap access. What/how would you suggest I give a single user from ldap ssh/sudo access?
[15:55] Muligan: Personally, I would create a group, assign permissions to it and work from there but I am not an expert. [15:55] RoyK: Better results? https://pastebin.ubuntu.com/p/Tq3NB965rq/ [15:57] They all passed with nothing in the Failed column. [15:59] Muligan: using visudo should do [15:59] Jenshae, I would agree [15:59] I just have limited visibility/knowledge into our ldap server(s) [15:59] * Muligan is an old school AD guy [15:59] :\ [16:00] anyhow, I'll get it figured out [16:00] Muligan: but check with getent passwd first to see if the users are visible there [16:00] Jenshae: looks good. How is the scrub going? [16:01] When you get used to " sudo chown user:group /folder[and/or file] " then it can be quite handy. [16:02] "scan: scrub repaired 1.08M in 0h8m with 120 errors on Wed Jul 8 14:01:23 2020" [16:02] damn [16:03] can you pastebin zpool status -v ? [16:03] perhaps censor the filenames, they are irrelevant to me [16:03] All the really weird-looking ones are gone. [16:05] I don't think it could have repaired any of these files if the data was missing. [16:06] It has actually managed to reduce errors this time, instead of finding more. [16:07] Going to restore the files, run the scrub and see if I can clear down the log. [16:18] Jenshae: it's good you have a backup. Now, restore those files and reboot the thing, preferably cutting the power suddenly or something to see if you can provoke the error that way [16:18] or just reboot -f [16:19] if it fails again, well, there's a bug or something [16:19] I doubt this is a hardware error [16:27] Jenshae: also, if you experience this in the future, make sure to check dmesg. you should be able to find issues in the logs too, as in /var/log/kern.log.something [16:27] Thank you. Will let you know how it goes.
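A sketch of the visudo-plus-getent approach discussed above for Muligan's LDAP question. The user name `jdoe` is hypothetical; the file is staged in /tmp here, whereas in production it would go to `/etc/sudoers.d/` with mode 0440, always validated with `visudo -c` first since a broken sudoers file can lock you out.

```shell
# Stage a per-user sudoers drop-in and syntax-check it.
printf 'jdoe ALL=(ALL:ALL) ALL\n' > /tmp/sudoers-jdoe

# visudo -cf parses the file without installing it:
if command -v visudo >/dev/null; then
  visudo -cf /tmp/sudoers-jdoe
fi

# And first confirm the LDAP account is actually visible to NSS,
# as RoyK suggests with getent:
getent passwd jdoe || echo "jdoe not visible via NSS/LDAP"
```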
[16:28] Jenshae: You could check the old kernel logs now to see if you find anything from the last crash [16:28] as in whence the errors originated [16:31] Is there a way using the installer's busybox implementation to get disk manufacturer? [16:31] and model? [16:33] I don't think so [16:33] I just saw that smartctl provided that. [16:33] better use a live boot [16:33] Inxi also gives you the model number [16:34] you can find it with 'ls -l /dev/disk/by-id/' too, but that requires udev and I'm not sure if busybox installs have that [16:34] Royk: These are listed as the important errors in my logs - https://pastebin.ubuntu.com/p/rnJBHB8qcR/ [16:34] Jenshae: from which file was that? [16:38] Other: Lightdm sends the pam error, Hardware: is the amdgpu error, Applications: spice-vdagent sent the redhat.spice one, Other: systemd sent the Postfix one. [16:39] Is there a ZFS filter I can apply to look for something related to my problem? [16:42] again, from which file were these error messages_ [16:42] ? [16:43] I don't know, as usually looking at a GUI, it has Important, All, Applications, System, Security and Hardware down the side as category options. [16:44] oh [16:44] learn the terminal, dude ;) [16:46] So much in the logs. :( [16:48] Jenshae: just pastebin output of 'ls /var/log' and I'll do my best to guide you :) [17:24] RoyK: Sent via IM. Could it be a hard drive driver problem since SMART didn't pick everything up the first time? [17:27] Could it also be my DE not doing a polite shutdown because it isn't notifying the ZFS subsystem that it is shutting down? [17:34] Jenshae: nothing really suspicious there [17:34] some wee errors, but that's normal [17:42] rbasak: thanks for handling that bug triage. It's nice when we have defined triage cases that come up regularly, isn't it xD [17:54] Is there a command for a very polite shutdown, such as, "zfs umount pool2 && poweroff"?
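On the "is there a ZFS filter for the logs" question above: a rough grep over the current, rotated, and compressed kernel logs works as a starting point. The pattern below is an assumption, not an exhaustive list of ZFS/disk error strings.

```shell
# Scan kernel logs for disk/ZFS-related messages.
pattern='zfs|zio|raidz|ata[0-9]+|i/o error|blk_update_request'
for f in /var/log/kern.log /var/log/kern.log.1 /var/log/kern.log.*.gz; do
  [ -e "$f" ] || continue
  echo "====== $f ======"
  case "$f" in
    *.gz) zgrep -iE "$pattern" "$f" || true ;;  # rotated, compressed
    *)    grep  -iE "$pattern" "$f" || true ;;  # plain text
  esac
done
# The same pattern works on the live kernel ring buffer:
dmesg 2>/dev/null | grep -iE "$pattern" || true
```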
[17:56] teward: :) [18:08] Jenshae: poweroff should do that for you anyway [18:40] RoyK: I don't seem to have all of these options, only zpool status? https://docs.joyent.com/private-cloud/troubleshooting/disk-replacement [18:44] Jenshae: zfs should anyway be robust enough to allow you to pull the plug at any time without the whole raidz going down. I started working with zfs on solaris some 12 years ago and it really works. I've done lots of "oops" reboots and similar. I guess an upgrade to 20.04 may help this if there's a more recent version of zfs there [18:45] Is there a checklist for in-place upgrades? I haven't had any success with them so far. [18:47] not really a checklist - just remember to back up what's important. I've done it several times without major issues. Some hiccups sometimes, but not really anything that breaks major stuff [18:47] that's with do-release-upgrade [18:47] with 20.04 today, you'll need -d with that since 20.04.1 isn't out yet [19:42] It would also be the first time I was on the cutting edge of Ubuntu, I normally wait 6+ months for them to iron out bugs after general release. [19:42] hehe [19:42] The other work station is still rocking 14.04 [19:42] well, good luck [19:42] that isn't supported anymore, though - hope it's not available on the net [19:45] https://ubuntu.com/about/release-cycle still gets security updates. [19:45] I should try a kernel update on it when this one is fixed. [19:46] that's Extended Security Maintenance (ESM), which you have to pay for [20:00] Sorry, getting my versions confused. Getting tired. It is on 16 and this one is 18. Thought for a moment it was 14 and 16 [20:02] ... would manual upgrading work? Kernel, then repositories? [20:03] it's usually better to just run do-release-upgrade and let the upgrader sort it out [20:03] Piece by piece to see where it fails? [20:03] but I'm not aware of the context of your zfs problems [20:03] I am not sure about my ZFS problems.
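For the "very polite shutdown" asked about above: as noted, a plain `poweroff` unmounts cleanly, but `zpool export` is the explicit way to unmount a pool's datasets and flush everything before release. Shown as a dry run since both commands are disruptive; the pool name `vol2` is taken from the earlier `zpool status` snippets.

```shell
# Dry run: print rather than execute. Remove the echoes to run it
# for real (requires root).
pool=vol2
echo "would run: sudo zpool export $pool"
echo "would run: sudo poweroff"
```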
I seem to get corrupted files every time I reboot or shutdown and boot back up again. [20:05] ouch :/ [20:05] What is this resilvering thing about? Would it help? [20:06] 'resilvering' is when zfs detects errors on a drive and repairs the errors with data from other drives or copies [20:07] Is it automatic and therefore won't help? [20:07] if you've got errors in files visible to userspace but eg zpool status doesn't report errors, then perhaps it's bad memory or something similar? [20:07] As it has tried it? [20:07] you could kick off a scrub, zpool scrub, to look for errors [20:07] I have a boot drive and then a ZFS pool. System files should be safe. [20:08] Yup, about to go and delete the remaining results of "zpool scrub -v", saved the list then reboot. [20:10] Umm ... I am discovering single corrupt files in empty folders that should be full ... === kierank_ is now known as kierank [20:13] hmm, lone files in directories that used to be full *might* also be files in a *mountpoint*, not in zfs; normally zfs refuses to mount datasets if the mountpoint has any files or directories in it, which means it's pretty easy for *one* failed mount for whatever reason to lead to data being in multiple places .. [20:14] but if mount output shows a zfs dataset mounted in that location, then it's very strange, and might reinforce my thought that you've got bad memory, or perhaps memory errors due to power supply problems, etc [20:17] It is a fairly simple setup, four drives, freshly built into a RAIDZ (RAID 5), mounted to user folder point and data dumped in there. [20:18] Same goes for the other work station. [20:18] zfs can withstand an unexpected reboot quite well [20:18] this must be either a bug or some hardware issue [20:19] Which and where is the question. 
[20:19] with four drives tagged TOO MANY ERRORS at once, well, there's the controller that might be the problem [20:19] who knows [20:19] it may also be the memory, unless it's ECC [20:20] (which reminds me that producing ECC costs a cent or two more than non-ECC-memory, but costs double, since they want to skim off a lot for "server" stuff) [20:20] It is not. This is a fairly budget build at about £800 [20:21] Jenshae: so better visit http://memtest.org/ and do a memory test [20:21] you just download an iso and put it on a usb drive or something and boot directly into it [20:21] some distros even come with it preinstalled in grub [20:22] seems ubuntu is one of them [20:22] Will do. [20:23] Will run it overnight. [20:23] I really wish that terminal was smart enough to read a space and figure out if it was a file name. [20:23] Jenshae: I know a guy that works with supermicro machines and he said it usually failed before test 4 or 5 and the early tests are rather quick [20:24] scp /folder/file name.something can be such a nuisance. [20:24] 'scp /folder/file\ name' or just 'scp "/folder/file name"' [20:26] Being able to do a find and replace to change "/home/etc" into "rm /home/etc" then straight up copy and paste them would be handy [20:27] I think I am picking up a pattern, it looks like a particular main folder within the pool had a problem. [20:27] Maybe something went wrong when I copied it over. [20:27] Could scp be at fault? [20:28] nah [20:28] Does it have a buffer limit? Do I need to move subfolders individually? [20:28] it works in userspace - the filesystem is in kernelspace [20:39] Could the other workstation have put its disk to sleep or scp's permission timed out? [20:40] If we figure this out and I ever end up lecturing IT, I will try and recreate the problem to give them a tough assignment. [21:22] Thoughts and prayers, thoughts and prayers. Please send me them, they are apparently all that is needed. 
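The space-in-filename quoting above, demonstrated locally with `cp` (the quoting rules are the shell's, so they apply to the local side of `scp` in exactly the same way):

```shell
# Create a file with a space in its name, then copy it using both
# quoting styles from the conversation.
dir=$(mktemp -d)
touch "$dir/file name.something"

cp "$dir"/file\ name.something "$dir/copy1"   # backslash-escaped space
cp "$dir/file name.something"  "$dir/copy2"   # whole path quoted

ls "$dir"
```

Note that a *remote* scp path goes through the remote shell too, so it needs an extra layer of quoting (e.g. `scp 'host:"/folder/file name"' .`).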
https://pastebin.ubuntu.com/p/D9ttcHmFnk/ [21:43] RoyK: Okay, shutdown, waited a while, booted up, ran a scrub and still no errors. So, I am thinking it might be a write or incomplete copy problem? [21:44] that would have been fixed in the journal [21:44] and not affected the filesystem itself, just the file [21:45] so something is fishy there [21:45] "RoyK: that would have been fixed in the journal" I need some more context. [21:45] Are there any I/O stress tests that will tell me if there is an error reading or writing with ZFS? [21:46] the filesystem has a journal, like all modern filesystems. when something is written, a journal entry is made and synced. when the file is committed to disk, another journal entry is made and committed, saying "this block is ok" [21:47] you have to go back to FAT32 or ext2 and similar stuff to find filesystems lacking a journal [21:48] If it is a write problem, as a workaround, can scrub and rsync work together? [21:49] I won't attempt to restore tonight, do a very cold boot in the morning. [21:49] if there's a write problem, it'll be reported in dmesg and the kernel log [21:49] and if there are no silent errors, the error is elsewhere [21:49] I'll put my money on the memory [21:50] Haven't had any random application crashes. [21:51] Everything to, from and using the boot drive has given me no problems. [21:51] so try a memory test first - if it fails, clean the memory connectors, both the motherboard and the chips, with isopropyl alcohol (if you can find it somewhere in these coronean times) and run a new test [21:51] using a brush to clean the motherboard sockets is usually the way to go [21:52] I use the boot drive for my personal stuff because it isn't shared. [21:52] but check the memory first, so that you can see if there's something there [21:52] but I'm tired - ttyl [21:52] Sleep well. [21:52] Thank you for all the help. 
[21:52] thanks [21:53] I have been asking around this problem for months on and off, this has gotten me a lot further than anywhere or anyone else. [21:54] i guess, zfs for linux is still pretty new, so not a lot of people know stuff Jenshae [21:55] Trying to give Roy-K more of an accolade than criticise the community. [21:56] I never thought you would criticize the community [21:57] Sorry, read it more as an excuse or defence than an explanation. [22:06] See you all tomorrow / Friday. Thanks again for the help. o7 === eth01_ is now known as eth01