/srv/irclogs.ubuntu.com/2020/07/08/#ubuntu-server.txt

lotuspsychjethat was a quickfix :p07:22
iceylotuspsychje: indeed :)07:23
nb-benok I managed to boot without u-boot07:27
nb-benthat's pretty cool. saved me some suffering07:28
=== Wryhder is now known as Lucas_Gray
DK2i have a pxe installation with initrd.gz and linux file for the installation. the installation does not find the nic of the server. i have the source files for the driver but i dont know how to integrate that in my initrd.gz any good guides to that?09:28
JenshaeHi all. I am getting very frustrated with ZFS at the moment. I made a RAIDZ (RAID 5) with 4x SSD drives. Every time that I reboot or shutdown and boot up, I get a different status. Now, "zpool status" shows all drives "ONLINE" and CKSUM = 0. However, sometimes they are all "DEGRADED" and yesterday I had one degraded and the rest online.10:21
JenshaeI have tried S.M.A.R.T. I have tried scrubbing and I have rebuilt the pool from scratch.10:22
JenshaeI don't know how to diagnose this further, to work out where the problem lies, particular disk, ZFS itself, the hard drive controller, BIOS or some sort of kernel problem.10:22
JenshaeCorruption is happening, "zpool status -v" listing the files, I go to them and I can't open them.10:23
Jenshae(I also checked that all the drive cables are firmly and fully plugged in)10:25
JenshaeAnyone got some terms I could search around to try and figure this out?11:51
RoyKJenshae: is this native zfs or fuse-zfs?12:06
RoyKJenshae: also, please pastebin zpool status -v12:06
JenshaeI don't know what fuse-zfs is.12:06
RoyKJenshae: I've seen zfs finding errors the drive didn't know about (so-called "silent" errors)12:06
RoyKzfs-fuse, perhaps - anyway - that's zfs running under fuse, that is, in usermode. If you have the spl/zfs kernel modules, it's native12:07
Jenshaehttps://pastebin.ubuntu.com/p/wK3nSVfMFX/12:08
=== tinwood is now known as tinwood-afk
JenshaeRoyK: Sorry, took awhile to sanitise it, most of my folders and file names are a bit too descriptive, so this is just a snippet https://pastebin.ubuntu.com/p/Yj4PHpTZbP/12:18
RoyKhm… that's wierd - too many errors on all four drives?12:19
RoyKwhich version of ubuntu and zfs/zpool is this?12:19
JenshaeYeah, it is weird, some boot ups, all listed as "ONLINE" without "DEGRADED" against them, sometimes all drives are "DEGRADED" and only yesterday did I have a single drive result.12:20
RoyKweird, even12:20
RoyKhow are these connected? standard sata from onboard controllers or something more fancy?12:21
JenshaeUbuntu 18 ... what is the -v / -version, etc to get the ZFS version?12:21
JenshaeStandard board SATA, yes.12:21
RoyKprobably --version12:21
RoyKI haven't worked with zfs for a while12:21
RoyK*in* a while12:22
RoyKdamn12:22
* RoyK complains about "bad language day"12:22
Jenshaezfsutils-linux is already the newest version (0.7.5-1ubuntu16.9).12:23
JenshaeI find that I get temporary brain damage and my typing goes to hell if I deprive myself of sleep.12:23
JenshaeLike, 3 hours sleep on day 1, 14 hours on day 2, 3 hours on day 3 but on the fourth day, still typing like I am malfunctioning.12:24
Jenshae(Have had one of those weeks, hence the slow responses and "like" twice in a sentence."12:25
Jenshae... and a " instead of a ) ...12:25
RoyKhttps://xkcd.com/859/12:26
JenshaeAre you trying to cast a hex on me?! :o12:27
JenshaeWhat are these about? " vol2:<0x241ec> " I could restore the other files but I don't know how to manually fix what ever that is.12:28
RoyKhehe12:29
RoyKdoes it allow you to start a scrub?12:29
Jenshaescan: scrub in progress since Wed Jul  8 13:52:36 202012:53
Jenshae2.13G scanned out of 593G at 545M/s, 0h18m to go12:53
Jenshae0B repaired, 0.36% done12:53
JenshaeYes.12:53
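The figures in the scrub status above can be cross-checked with a little arithmetic. This is a rough sketch only: `scrub_eta` is a hypothetical helper, it assumes 1G = 1024M, and ZFS may compute its own estimate differently.

```shell
# Sketch: recompute the percentage and time remaining from a scrub status line.
# Arguments: scanned (G), total (G), rate (M/s). Assumes 1G = 1024M.
scrub_eta() {
    awk -v done_g="$1" -v total_g="$2" -v rate_m="$3" 'BEGIN {
        pct = done_g / total_g * 100
        secs = int((total_g - done_g) * 1024 / rate_m)
        printf "%.2f%% done, %dh%dm to go\n", pct, int(secs / 3600), int((secs % 3600) / 60)
    }'
}
```

Called as `scrub_eta 2.13 593 545`, it takes the three numbers quoted in the status output.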
JenshaeI am not too concerned about the data. I have backups. I just don't want to throw away money on a motherboard or a set of drives without knowing where the fault is and I am getting tired of restoring data while this problem persists.12:55
RoyKJenshae: I can understand - it's rare to get errors on all drives at the same time13:13
RoyKJenshae: btw, can you pastebin smart data from the drives?13:14
JenshaeRoyK: Is there a terminal interface for SMART? I can't copy out the GUI results.13:16
RoyKsmartctl -a /dev/sdX | pastebinit13:17
RoyKthat is - wait13:17
RoyKfor dev in sd[abcd] ; do echo ====== $dev ====== ; smartctl -a /dev/$dev ; done | pastebinit13:17
RoyKfor instance13:17
RoyKreplace [abcd] with the real device names13:18
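The loop above could be wrapped in a small function, sketched here. `smart_report` and the `SMARTCTL` override are hypothetical names added for illustration; only `smartctl -a` comes from the conversation, and the real command needs root.

```shell
# Sketch: print a labelled SMART report for each listed drive.
# SMARTCTL is a hypothetical override so the loop can be tried with a stub;
# by default it calls the real smartctl from smartmontools.
SMARTCTL="${SMARTCTL:-smartctl}"

smart_report() {
    for dev in "$@"; do
        echo "====== $dev ======"
        # -a prints all SMART information for one device
        "$SMARTCTL" -a "$dev"
    done
}
```

As root, something like `smart_report /dev/sda /dev/sdb /dev/sdc /dev/sdd | pastebinit` would then upload one combined report; the device names are assumptions and need replacing with the real ones.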
JenshaeHmm ... "sudo: smartctl: command not found". I have Gnome Disks installed.13:18
RoyKapt install smartmontools13:18
Jenshae"sudo apt install msartmontools" I am so msart13:19
RoyKhehehe13:19
JenshaeOh no, I am infected! M$-Art ...13:19
RoyKthis has nothing to do with M$ ;)13:21
Jenshaehttps://help.ubuntu.com/community/Smartmontools describes the three tests but doesn't say how to initiate one.13:21
RoyKsmartctl -t short /dev/sdX13:22
Jenshae" smartctl --test=long /dev/sda /dev/sdb /dev/sdc /dev/sdbd " will work?13:23
RoyKsmartctl -l selftest /dev/sdX (or smartctl -a /dev/sdX - it'll normally show progress as well)13:23
RoyKno, it doesn't take more than one argument, so better 'for dev in /dev/sd{a..d} ; do smartctl -t long $dev ; done13:23
RoyK'13:24
RoyKwell, it takes several arguments, but only one drive, for some reason13:24
Jenshae" sudo smartctl --test=long /dev/sda && sudo smartctl --test=long /dev/sdb && sudo smartctl --test=long /dev/sdc && sudo smartctl --test=long /dev/sdd "13:24
RoyKbtw, if you have /dev/sdbd, it means you have a *lot* of drives :D13:24
Jenshae*Flex*13:25
RoyKthat also works, obviously, but I prefer a little for loop13:25
RoyKjust remember that with &&, it will only run the next command (etc) if the first succeeds13:26
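The difference between `&&` chaining and a loop shows up only when one command fails. A minimal sketch, in which `run_test` is a hypothetical stand-in for `smartctl -t long DEV` that fails for /dev/sdb on purpose:

```shell
# Sketch: "&&" stops at the first failure, a for loop carries on.
# run_test is a hypothetical stand-in for "smartctl -t long DEV";
# it fails for /dev/sdb only, to make the difference visible.
run_test() {
    [ "$1" != "/dev/sdb" ] && echo "started $1"
}

chained() {
    # sdc is never reached because sdb fails
    run_test /dev/sda && run_test /dev/sdb && run_test /dev/sdc
}

looped() {
    for dev in /dev/sda /dev/sdb /dev/sdc; do
        run_test "$dev" || echo "failed $dev"
    done
}
```

With the chain, a failure on one drive silently skips every drive after it; the loop reports the failure and still starts the remaining tests.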
Jenshae"RoyK: btw, if you have /dev/sdbd, it means you have a *lot* of drives :D" https://www.youtube.com/watch?v=YFk2_5RkwlA13:26
RoyKhehe13:27
RoyKlinux starts off with sda, then sdb and so on until it reaches sdz and then starts over with sdaa, sdab etc, so if you have sdbd, it means you should have at least 55 drives13:28
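The naming scheme described above behaves like counting in letters (a=1 ... z=26, aa=27 ...). A small sketch that turns a name into its position in that order; `drive_index` is a hypothetical helper, and it assumes the plain `sd` prefix followed by lowercase letters only:

```shell
# Sketch: convert an sdX name to its 1-based position in the naming order.
drive_index() {
    suffix="${1#sd}"
    index=0
    while [ -n "$suffix" ]; do
        rest="${suffix#?}"
        letter="${suffix%"$rest"}"
        # printf with a leading quote yields the character code (a = 97)
        code=$(printf '%d' "'$letter")
        index=$(( index * 26 + code - 96 ))
        suffix="$rest"
    done
    echo "$index"
}
```

By this count `sdbd` is the 56th name, which agrees with "at least 55 drives".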
=== tinwood-afk is now known as tinwood
hallynwhat, vimoutliner package was removed in focal?14:05
Jenshaehttps://pastebin.ubuntu.com/p/3ZZPDJCkxJ/ https://pastebin.ubuntu.com/p/9F9MRWT99c/ https://pastebin.ubuntu.com/p/3HhZtmGMrP/ https://pastebin.ubuntu.com/p/25zJRCVd8G/14:15
JenshaeRoyK: Results are in.14:15
JenshaeExtended tests, all passed, no failures shown.14:17
RoyKhm - those "unknown attributes" - I wonder if you can find out something more if you compile smartmontools from scratch14:18
RoyKno need to install it - just run it from the source dir14:18
JenshaeGot a cmd for that?14:18
RoyKJenshae: https://github.com/smartmontools/smartmontools14:19
RoyKJenshae: mkdir -p src/git14:19
RoyKcd src/git14:19
RoyKgit clone https://github.com/smartmontools/smartmontools.git14:19
RoyKcd smartmontools/smartmontools/14:20
RoyK./autogen.sh ; ./configure ; make14:20
Jenshaehttps://pastebin.ubuntu.com/p/3HBKrX9JQp/14:22
Jenshaenvm, step by step.14:24
RoyKyou may need some packages like the build-essential metapackage14:24
JenshaeIt wouldn't "sudo ./autogen.sh" and needed "sudo sh ./autogen.sh" after that then each one with sudo in front of it.14:24
RoyKno need for sudo there14:24
JenshaeIt didn't work without it. Done now. Command to run the compiled one?14:25
RoyKunless you ran 'sudo mkdir -p src/git'14:25
RoyK./smartctl14:26
RoyKadd a sudo in front of that and use the same parameters as last time14:26
RoyKlooks like 6.6 is the one installed on my machine (debian buster 10) while the one from git is 7.2, so probably some new stuff there14:27
RoyKjust check smartctl --version and ./smartctl --version14:27
* Jenshae crosses fingers14:28
JenshaeSee you again in 30+ minutes14:29
JenshaeDo you work with servers or do it as a hobby?14:29
RoyKa bit of both :D14:29
RoyKI've worked with IT and servers since around 199614:30
* Jenshae waves a feather duster around >;P14:30
JenshaeActually, I started work in IT in 1998. IT cafe after school and weekends.14:30
JenshaeI haven't managed to get through my Linux+ manual yet. >.>14:31
JenshaeKeep going a few chapters on, then realising I can't remember anything from a chapter or two back. Go back and re-read.14:32
JenshaeI also keep having a background thought of, "I am unlikely to use that. I could look it up."14:33
JenshaeI am getting to the point where I might save up for a System76 work station. Building and configuring my own ones ... I am starting to doubt myself or the industry as a whole.14:34
RoyKhehe14:35
RoyKI first started working "mosttime" with linux in 1998 and fulltime in 200014:35
RoyKlinux has luckily evolved a bit since the first slackware 2.1 I installed in 199414:36
RoyKJenshae: btw, you don't need to run a new test - that's done by the drive itself. Just run ./smartctl -l selftest /dev/sdX14:47
Muliganhey fellas, I've a server setup w/ldap access.  What/how would you suggest I give a single user from ldap ssh/sudo access?15:42
JenshaeMuligan: Personally, I would create a group, assign permissions to it and work from there but I am not an expert.15:55
JenshaeRoyK: Better results? https://pastebin.ubuntu.com/p/Tq3NB965rq/15:55
JenshaeThey all passed with nothing in the Failed column.15:57
RoyKMuligan: using visudo should do15:59
MuliganJenshae, I would agree15:59
MuliganI've just limited visibility/knowledge into our ldap server(s)15:59
* Muligan is an old school AD guy15:59
Muligan:\15:59
Muligananyhow, I'll get it figured out16:00
RoyKMuligan: but check with getent passwd first to see if the users are visible there16:00
RoyKJenshae: looks good. How is the scrub going?16:00
JenshaeWhen you get used to " sudo chown user:group /folder[and/or file] " then it can be quite handy.16:01
Jenshae"scan: scrub repaired 1.08M in 0h8m with 120 errors on Wed Jul  8 14:01:23 2020"16:02
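For tracking whether the error count goes up or down between scrubs, the number can be pulled out of that line. A sketch only: `scrub_errors` is a hypothetical helper, and the pattern matches the line format quoted in this log (zfsutils-linux 0.7.5), which may differ in other versions.

```shell
# Sketch: extract the error count from the "scan:" line that zpool status
# prints once a scrub has finished. Reads the status text on stdin.
scrub_errors() {
    sed -n 's/.*scrub repaired .* with \([0-9][0-9]*\) errors.*/\1/p'
}
```

Usage would be along the lines of `zpool status POOL | scrub_errors`, with POOL replaced by the actual pool name.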
RoyKdamn16:02
RoyKcan you pastebin zpool status -v ?16:03
RoyKperhaps censor the filenames, they are irrelevant to me16:03
JenshaeAll the really weird ones with <hexcode> looking stuff is gone.16:03
JenshaeI don't think it could have repaired any of these files if the data was missing.16:05
JenshaeIt has actually managed to reduce errors this time, instead of finding more.16:06
JenshaeGoing to restore the files, run the scrub and see if I can clear down the log.16:07
RoyKJenshae: it's good you have a backup. Now, restore those files and reboot the thing, preferably cutting the power suddenly or something to see if you can provoke the error that way16:18
RoyKor just reboot -f16:18
RoyKif it fails again, well, there's a bug or something16:19
RoyKI doubt this is a hardware error16:19
RoyKJenshae: also, if you experience this in the future, make sure to check dmesg. you should be able to find issues in the logs too, as in /var/log/kern.log.something16:27
JenshaeThank you. Will let you know how it goes.16:27
RoyKJenshae: You could check the old kernel logs now to see if you find anything from last crash16:28
RoyKas in whence the errors originated16:28
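One way to narrow down dmesg or the kernel log is a filter for storage-related lines. This is a sketch under assumptions: `storage_errors` is a hypothetical helper, and the pattern list is a guess at useful keywords, not an exhaustive set of failure messages.

```shell
# Sketch: keep only log lines that mention ZFS, an ATA port, or an I/O error.
# Reads log text on stdin; matching is case-insensitive.
storage_errors() {
    grep -Ei 'zfs|(^|[^[:alnum:]])ata[0-9]+|i/o error'
}
```

Typical use would be `dmesg | storage_errors` or `storage_errors < /var/log/kern.log`, run with enough privileges to read the log.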
Rusty_AlmightyIs there a way using the installer's busybox implementation to get disk manufacturer?16:31
Rusty_Almightyand model?16:31
RoyKI don't think so16:33
JenshaeI just saw that smartctl provided that.16:33
RoyKbetter use a live boot16:33
JenshaeInxi also gives you the model number16:33
RoyKyou can find it with 'ls -l /dev/disk/by-id/' too, but that requires udev and I'm not sure if busybox installs have that16:34
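Where udev is missing, the kernel's sysfs tree may still expose vendor and model for SATA/SCSI disks. A sketch, assuming /sys is mounted in the installer environment (not confirmed in this conversation); `disk_model` and its optional second argument are hypothetical, the latter existing only so the function can be tried against a fake tree.

```shell
# Sketch: read a disk's vendor and model from sysfs, without udev or smartctl.
# Usage: disk_model sda [sysfs-root]
disk_model() {
    dev="$1"
    sys="${2:-/sys}"
    vendor=$(cat "$sys/block/$dev/device/vendor" 2>/dev/null)
    model=$(cat "$sys/block/$dev/device/model" 2>/dev/null)
    # sysfs pads these fields with spaces, so squeeze and trim them
    printf '%s: %s %s\n' "$dev" "$vendor" "$model" | sed 's/  */ /g; s/ $//'
}
```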
JenshaeRoyk: These are listed as the important errors in my logs - https://pastebin.ubuntu.com/p/rnJBHB8qcR/16:34
RoyKJenshae: from which file was that?16:34
JenshaeOther: Lightdm sends the pam error, Hardware: is the amdgpu error, Applications: spice-vdagent send the redhat.spice one, Other: systemd sent the Postfix one.16:38
JenshaeIs there a ZFS filter I can apply to look for something related to my problem?16:39
RoyKagain, from which file were these error messages_16:42
RoyK?16:42
JenshaeI don't know, as usually looking at a GUI, it has Important, All, Applications, System, Security and Hardware down the side as category options.16:43
RoyKoh16:44
RoyKlearn the terminal, dude ;)16:44
JenshaeSo much in the logs. :(16:46
RoyKJenshae: just pastebin output of 'ls /var/log' and I'll do my best to guide you :)16:48
JenshaeRoyK: Sent via IM. Could it be a hard drive driver problem since SMART didn't pick every thing up the first time?17:24
JenshaeCould it also be my DE not doing a polite shutdown because it isn't notifying the ZFS subsystem that it is shutting down?17:27
RoyKJenshae: nothing really suspicious there17:34
RoyKsome wee errors, but that's normal17:34
tewardrbasak: thanks for handling that bug triage.  It's nice when we have defined triage cases that come up regularly isn't it xD17:42
JenshaeIs there a command for a very polite shutdown, such as, "zfs umount pool2 && poweroff"?17:54
rbasakteward: :)17:56
RoyKJenshae: poweroff should do that for you anyway18:08
JenshaeRoyK: I don't seem to have all of these options, only zpool status? https://docs.joyent.com/private-cloud/troubleshooting/disk-replacement18:40
RoyKJenshae: zfs should anyway be robust enough to allow you to pull the plug at any time without the whole raidz going down. I started working with zfs on solaris some 12 years ago and it really works. I've done lots of "oops" reboots and similar. I guess an upgrade to 20.04 may help this if there's a more recent version of zfs there18:44
JenshaeIs there a checklist for in place upgrades? I haven't had any success with them so far.18:45
RoyKnot really a checklist - just remember to backup what's important. I've done it several times without major issues. Some hiccups sometimes, but not really anything that breaks major stuff18:47
RoyKthat's with do-release-upgrade18:47
RoyKwith 20.04 today, you'll need -d with that since 20.04.1 isn't out yet18:47
JenshaeIt would also be the first time I was on the cutting edge of Ubuntu, I normally wait 6+ months for them to iron out bugs after general release.19:42
RoyKhehe19:42
JenshaeThe other work station is still rocking 14.0419:42
RoyKwell, good luck19:42
RoyKthat isn't supported anymore, though - hope it's not available on the net19:42
Jenshaehttps://ubuntu.com/about/release-cycle still gets security updates.19:45
JenshaeI should try a kernel update on it when this one is fixed.19:45
RoyKthat's Extended Security Maintenance (ESM), which you have to pay for19:46
JenshaeSorry, getting my versions confused. Getting tired. It is on 16 and this one is 18. Thought for a moment it was 14 and 1620:00
Jenshae... would manual upgrading work? Kernel, then repositories?20:02
sarnoldit's usually better to just run do-release-upgrade and let the upgrader sort it out20:03
JenshaePiece by piece to see where it fails?20:03
sarnoldbut I'm not aware of the context of your zfs problems20:03
JenshaeI am not sure about my ZFS problems. I seem to get corrupted files every time I reboot or shutdown and boot back up again.20:03
sarnoldouch :/20:05
JenshaeWhat is this resilvering thing about? Would it help?20:05
sarnold'resilvering' is when zfs detects errors on a drive and repairs the errors with data from other drives or copies20:06
JenshaeIs it automatic and therefore won't help?20:07
sarnoldif you've got errors in files visible to userspace but eg zpool status doesn't report errors, then perhaps it's bad memory or something similar?20:07
JenshaeAs it has tried it?20:07
sarnoldyou could kick off a scrub, zpool scrub, to look for errors20:07
JenshaeI have a boot drive and then a ZFS pool. System files should be safe.20:07
JenshaeYup, about to go and delete the remaining results of "zpool scrub -v", saved the list then reboot.20:08
JenshaeUmm ... I am discovering single corrupt files in empty folders that should be full ...20:10
=== kierank_ is now known as kierank
sarnoldhmm, lone files in directories that used to be full *might* also be files in a *mountpoint*, not in zfs; normally zfs refuses to mount datasets if the mountpoint has any files or directories in it, which means it's pretty easy for *one* failed mount for whatever reason to lead to data being in multiple places ..20:13
sarnoldbut if mount output shows a zfs dataset mounted in that location, then it's very strange, and might reinforce my thought that you've got bad memory, or perhaps memory errors due to power supply problems, etc20:14
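The mountpoint check described above can be scripted. A sketch: `mount_fstype` is a hypothetical helper that reads /proc/mounts (overridable through a second argument purely for testing) and does not handle paths containing spaces, which that file escapes.

```shell
# Sketch: report the filesystem type mounted exactly at a given directory,
# or "none" if nothing is mounted there.
# /proc/mounts fields: source mountpoint fstype options dump pass
mount_fstype() {
    target="$1"
    table="${2:-/proc/mounts}"
    # the last matching line wins, since later mounts shadow earlier ones
    awk -v t="$target" '$2 == t { fs = $3 } END { if (fs != "") print fs; else print "none" }' "$table"
}
```

If this prints `none` for a directory that should hold a dataset, any files seen there are sitting on the underlying filesystem rather than in the pool.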
JenshaeIt is a fairly simple setup, four drives, freshly built into a RAIDZ (RAID 5), mounted to user folder point and data dumped in there.20:17
JenshaeSame goes for the other work station.20:18
RoyKzfs can withstand an unexpected reboot quite well20:18
RoyKthis must be either a bug or some hardware issue20:18
JenshaeWhich and where is the question.20:19
RoyKwith four drives tagged TOO MANY ERRORS at once, well, there's the controller that might be the problem20:19
RoyKwho knows20:19
RoyKit may also be the memory, unless it's ECC20:19
RoyK(which reminds me that producing ECC costs a cent or two more than non-ECC-memory, but costs the double, since they want to skim off a lot for "server" stuff)20:20
JenshaeIt is not. This is a fairly budget build at about £80020:20
RoyKJenshae: so better visit http://memtest.org/ and do a memory test20:21
RoyKyou just download an iso and put in on a usb drive or something and boot directly into it20:21
RoyKsome distros even come with it preinstalled in grub20:21
RoyKseems ubuntu is one of them20:22
JenshaeWill do.20:22
JenshaeWill run it over night.20:23
JenshaeI really wish that terminal was smart enough to read a space and figure out if it was a file name.20:23
RoyKJenshae: I know a guy that works with supermicro machines and he said it usually failed before test 4 or 5 and the early tests are rather quick20:23
Jenshaescp /folder/file name.something can be such a nuisance.20:24
RoyK'scp /folder/file\ name' or just 'scp "/folder/file name"'20:24
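The effect of the two quoting styles above can be seen by counting how many arguments a command actually receives. `count_args` is a hypothetical helper used only for this illustration:

```shell
# Sketch: print the number of arguments the shell passed in.
count_args() {
    echo "$#"
}

count_args /folder/file name        # unquoted space: two arguments
count_args /folder/file\ name       # escaped space: one argument
count_args "/folder/file name"      # quoted: one argument
```

The shell splits on the space before scp ever sees the path, which is why the program itself cannot guess that the two words belong to one file name.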
JenshaeBeing able to do a find and replace to change "/home/etc" into "rm /home/etc" then straight up copy and paste them would be handy20:26
JenshaeI think I am picking up a pattern, it looks like a particular main folder within the pool had a problem.20:27
JenshaeMaybe something went wrong when I copied it over.20:27
JenshaeCould scp be at fault?20:27
RoyKnah20:28
JenshaeDoes it have a buffer limit? Do I need to move subfolders individually?20:28
RoyKit works in userspace - the filesystem is in kernelspace20:28
JenshaeCould the other workstation have put its disk to sleep or scp's permission timed out?20:39
JenshaeIf we figure this out and I ever end up lecturing IT, I will try and recreate the problem to give them a tough assignment.20:40
JenshaeThoughts and prayers, thoughts and prayers. Please send me them, they are apparently all that is needed. https://pastebin.ubuntu.com/p/D9ttcHmFnk/21:22
JenshaeRoyK: Okay, shutdown, waited awhile, booted up, ran a scrub and still no errors. So, I am thinking it might be a write or incomplete copy problem?21:43
RoyKthat would have been fixed in the journal21:44
RoyKand not affected the filesystem itself, just the file21:44
RoyKso something is fishy there21:45
Jenshae"RoyK: that would have been fixed in the journal" I need some more context.21:45
JenshaeAre there any I/O stress tests that will tell me if there is an error reading or writing with ZFS?21:45
RoyKthe filesystem has a journal, like all modern filesystems. when something is written, a journal entry is made and synced. when the file is committed to disk, another journal entry is made and committed, saying "this block is ok"21:46
RoyKyou have to go back to FAT32 or ext2 and similar stuff to find filesystems lacking a journal21:47
JenshaeIf it is a write problem, as a work around, can scrub and rsync work together?21:48
JenshaeI won't attempt to restore tonight, do a very cold boot in the morning.21:49
RoyKif there's a write problem, it'll be reported in dmesg and the kernel log21:49
RoyKand if there are no silent errors, the error is elsewhere21:49
RoyKI'll put my money on the memory21:49
JenshaeHaven't had any random application crashes.21:50
JenshaeEverything to, from and using the boot drive has given me no problems.21:51
RoyKso try a memory test first - if it fails, clean the memory connectors, both the motherboard and the chips, with isopropyl alcohol (if you can find it somewhere in these coronean times) and run a new test21:51
RoyKusing a brush to clean the motherboard sockets is usually the way to go21:51
JenshaeI use the boot drive for my personal stuff because it isn't shared.21:52
RoyKbut check the memory first, so that you can see if there's something there21:52
RoyKbut I'm tired - ttyl21:52
JenshaeSleep well.21:52
JenshaeThank you for all the help.21:52
RoyKthanks21:52
JenshaeI have been asking around this problem for months on and off, this has gotten me a lot further than anywhere or anyone else.21:53
quadrathoch2i guess, zfs for linux is still pretty new, so not a lot of people know stuff Jenshae21:54
JenshaeTrying to give Roy-K more of an accolade than criticise the community.21:55
quadrathoch2I never thought you would criticize the community21:56
JenshaeSorry, read it more as an excuse or defence than an explanation.21:57
JenshaeSee you all tomorrow / Friday. Thanks again for the help. o722:06
=== eth01_ is now known as eth01
