[17:58] <fretegi> morning
[17:59] <fretegi> anyone around to help with a raid issue?
[18:00] <fretegi> have a raid 1 array, that i just cannot seem to properly repair.  so thought is to make a new array, degraded.  mount it, copy data over, then add a second disk
[18:29] <RoyK> fretegi: one drive down?
[18:30] <fretegi> RoyK, yea so funny story, parted was run against 1 drive in the array
[18:31] <fretegi> RoyK, and no matter what when i re-add the drive, raid then shows active with both devices, but wont survive a reboot
[18:32] <RoyK> fretegi: did you try to remove the bad-ish drive and zero its superblock and then add it again?
[18:32] <RoyK> as in no --re-add
[18:33] <RoyK> iirc --remove will remove the superblock, so probably not needed to --zero-superblock
[18:33] <fretegi> RoyK, so what ive done this time is, on the device that was removed.. i zero'd the superblocks, ran DD for the first 1024, ran parted to make a new label and new partition scheme, built a new degraded raid device, mounted and am rsync'ing data over now from the working device in the old raid.  then zero that drive and add to the new md1
[18:33] <RoyK> fretegi: can you check and pastebin 'mdadm --examine /dev/sdX' where x is the name of each member of the raid?
[18:34] <fretegi> RoyK, yea i did.. zero'd superblocks, tried to remove device, updated the mdadm.conf.  all this 5x or so.  same thing.. it would add the device, rebuild raid.  show active..  then die on reboot
[18:34] <RoyK> fretegi: also, have you checked smart data?
[18:34] <RoyK> smartctl -a /dev/sdX
[18:34] <fretegi> RoyK, yup checked smart data, both involved drives are good
[18:34] <RoyK> and /etc/mdadm/mdadm.conf is fine?
[18:34] <RoyK> and initramfs is updated with its contents?
[18:35] <fretegi> RoyK, yup, mdadm was even updated after the last 2 tries to just make sure the proper UUID was referenced..
[18:36] <fretegi> RoyK, ah... i did NOT update initramfs...
[18:36] <fretegi> is that necessary if the md device in question is only a data logical volume, no OS related data on it?
[18:37] <RoyK> update-initramfs -u
[18:37] <RoyK> update-initramfs -u -k all
[18:37] <RoyK> perhaps if you want to update it for all installed kernels
[18:37] <RoyK> but the former should do as well
[18:37] <fretegi> right but is that needed if the raid does not contain OS partitions?
[18:37] <fretegi> err... logical vols
[18:38] <RoyK> probably not, but it won't hurt ;)
[18:38] <fretegi> good thinking
[18:38] <RoyK> fretegi: you can generate the mdadm.conf stuff with mdadm --detail --scan
[18:38] <fretegi> RoyK, that is exactly what i did
[18:39] <fretegi> RoyK, so when remove drives from a mdadm raid, is there anything else you have to do besides zero out the superblocks to make the drive available for use again?
[18:40] <RoyK> fretegi: can you please pastebin output of 'cat /proc/mdstat' and 'mdadm --examine' for both drives first?
[18:40] <fretegi> RoyK, i was fiddling with that disk and no matter what mdadm would still reference evidence of a prior xfs filesystem, but yet could not build the damn array
[18:40] <fretegi> RoyK, sure, but the second drive is now part of a second degraded array
[18:41] <RoyK> hm - that's weird
[18:42] <fretegi> RoyK, https://dpaste.org/wERH
[18:43] <fretegi> RoyK, no i did that intentionally...  seemed everyone was gone for Sunday so i just decided to take another approach.. made a new degraded array using the disk that has been a pita, copy the data over, remove old array and add that disk to the new array.
[18:43] <RoyK> hm
[18:43] <RoyK> on which one is the rootfs now?
[18:43] <fretegi> RoyK, md0 old array, md1 new array, rootfs not on either
[18:44] <RoyK> any data on them at all?
[18:45] <fretegi> https://dpaste.org/yysq
[18:45] <fretegi> RoyK, oh yea, data perfectly intact in md0 (and backed up).  currently rsync'ing to md1
[18:46] <RoyK> fretegi: good, but you can probably stop that rsync
[18:47] <RoyK> you won't be able to add sdc1 to md0 as member disk, it's too small
[18:47] <RoyK> better remove that partition and try with the full disk instead
[18:47] <RoyK> they should be the same size
[18:48] <fretegi> RoyK, so that leads me to where i am now.... several people in here mentioned that having the md device set up on the full disk and not a partition was a problem
[18:48] <fretegi> that was the primary reason for my saying heck with it and just building a new array...
[18:49] <RoyK> but md1 is your new array?
[18:49] <fretegi> right
[18:49] <RoyK> which is on a partition
[18:49] <fretegi> using a partition, because the old array was just on the full disks, and folks were saying that was a bad approach
[18:49] <fretegi> although this raid array been working fine for like 5 years
[18:50] <RoyK> AFAIK the only reason to use a partition is to have grub work with it, since that can be rather troublesome without it
[18:50] <RoyK> fretegi: itæs *not* a bad approach
[18:50] <RoyK> partitions aren't needed
[18:50] <RoyK> s/itæs/it's/
[18:51] <fretegi> RoyK, see that is exactly my understanding, unless of course there was some reason u did need a partition, but mdadm was not that reason
[18:51] <fretegi> so to back up a tick...
[18:51] <RoyK> I generally use partitions if I 1: need to boot off the device and thus need grub or 2: for some reason can't use lvm
[18:52] <RoyK> so my typical raid is a bunch of disks with raid put on top of the disks directly and then lvm on there and lastly, xfs (or perhaps ext4 if it's smaller stuff or somewhere I might want to shrink the fs) on top
[18:53] <RoyK> I've used this approach for work and home machines for at least a decade - works well
[18:55] <fretegi> RoyK, originally i had md0 raid 1 built on 2 entire disks (sdb & sdc).  The array is just for a data volume, all os components on another disk.  the wrong disk had parted run against it (sdc) which of course took it out of array.  md0 has lvm with xfs.  i re-added that sdc to md0 5 or 6x, wont survive boot.  so i said hell with it and made a new degraded array from sdc, am copying data over and intend to break down md0, adding disk
[18:55] <fretegi> sdb into md1
[18:55] <RoyK> with "smaller", I mean less than say 5TiB or something I know won't need to grow and/or is sufficiently fast for fsck to finish within a short while (which isn't necessarily what happens with large ext4 filesystems)
[18:56] <RoyK> did you re-add or add? also, did you add sdc1 or sdc?
[18:56] <fretegi> ok so very similar setups to what i am doing here
[18:56] <RoyK> better remove those partitions before you add the whole drive, though - I've seen leftovers hanging around in some cases and that's not pretty
[18:56] <fretegi> RoyK, so at that time, sdc had no partitions.  since building a new array, and many people in here criticized that lack of a partition, i created one for this array
[18:57] <fretegi> RoyK, well the partition is the entire disk
[18:57] <RoyK> whoever criticized you for that should have his or her head examined by moths
[18:57] <fretegi> so was gonna partition sdb and add to md1
[18:58] <RoyK> just don't use partitions
[18:58] <fretegi> RoyK, yea made no sense to me, but was like 4 people in here lol
[18:58] <RoyK> I really can't understand why - partitions have no function there
[18:58] <RoyK> all md needs is a chunk of storage
[19:00] <fretegi> RoyK, so am i better off trying to re add sdc to md0? just takes hours to rebuild and hate to waste that all over again to not have it survive a reboot
[19:01] <RoyK> so, ok - my advice: the working 1-drive mirror of md0 is the one with data, right? if so, stop md1 and zero the drive's superblock and better dd some zeros on it as well. Add it to md0 and wait for it to resync. While waiting, run mdadm --detail --scan to get the config line for mdadm.conf and copy it or redirect it into the file. run editor /etc/mdadm/mdadm.conf and remove whatever leftovers from old
[19:02] <RoyK> stuff there. update initramfs as mentioned above and wait for resync to finish. this should do it.
[19:03] <fretegi> RoyK, do i have to do anything else to remove md1?
[19:04] <RoyK> mdadm --stop /dev/md1
[19:05] <fretegi> RoyK, no i get that, i mean between stopping it and pulling any reference from mdadm.conf, thats all i need to thoroughly delete md1 right? just dont want this thing to try to spin up md1 later is all
[19:05] <RoyK> doublecheck that all data is in place and mdadm --zero-superblock /dev/sdX (was it c?) and remove the partition table, preferably with dd if=/dev/zero of=/dev/sdX bs=1M count=1k or something - that one writes a gig of zeros, so overkill, but what the hell
[19:05] <RoyK> fretegi: let's fix mdadm.conf when you have added the last drive first
[19:13] <fretegi> RoyK, ok, md1 stopped, lvm volumes etc. that were mounted are gone, dev/sdc superblocks zero'd dd'd 2k count on sdc.. mdadm.conf has md1 references #'d and heres some output
[19:13] <fretegi> https://dpaste.org/YpcS
[19:14] <fretegi> so just grow md0 and add sdc?
[19:15] <fretegi> RoyK, data confirmed to still be active on md0
[19:15] <RoyK> fretegi: add sdc first
[19:17] <fretegi> RoyK, since it shows no unused devices dont i need to grow it?
[19:17] <fretegi> RoyK, md0 that is
[19:18] <RoyK> just add the new one first and it'll become a spare
[19:19] <fretegi> RoyK, done https://dpaste.org/gJxU
[19:20] <fretegi> RoyK, mdadm --detail https://dpaste.org/U05w
[19:21] <RoyK> good - mdadm --grow /dev/md0 --raid-devices 2 # iirc ;)
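The sequence worked through up to this point (stop md1, wipe sdc, add it to md0, grow md0 back to two devices) can be sketched as a dry-run script. This is only a sketch under assumptions: the device names are the ones used in this conversation, the `run` and `recover_plan` helpers are made up for illustration, and by default it just prints the commands instead of executing them.

```shell
#!/bin/sh
# Dry-run sketch of the recovery sequence discussed above.
# ASSUMPTIONS: /dev/md0 is the degraded array holding the data, /dev/md1 is
# the temporary array being torn down, and TARGET is the disk going back
# into md0. Nothing is executed unless DRY_RUN=0 is set explicitly.
TARGET=${TARGET:-/dev/sdc}
DRY_RUN=${DRY_RUN:-1}

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_plan() {
    run mdadm --stop /dev/md1                          # tear down the temporary array
    run mdadm --zero-superblock "$TARGET"              # clear old md metadata
    run dd if=/dev/zero of="$TARGET" bs=1M count=1024  # wipe any partition table (1 GiB, overkill)
    run mdadm /dev/md0 --add "$TARGET"                 # joins as a spare at first
    run mdadm --grow /dev/md0 --raid-devices=2         # promote the spare, resync starts
    run update-initramfs -u                            # refresh the initramfs copy of mdadm.conf
}

recover_plan
```

Read the printed plan and double-check that TARGET is not the disk holding the data before running it for real as root with DRY_RUN=0.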
[19:22] <fretegi> RoyK, building https://dpaste.org/qT5Y
[19:24] <fretegi> RoyK, mdadm --detail --scan outputs the same UUID as what is already in the mdadm.conf
[19:24] <RoyK> that is correct
[19:24] <RoyK> just update initramfs, then
[19:26] <fretegi> RoyK, actually isnt that mdadm.conf line supposed to reference the # of devices?
[19:26] <fretegi> output shows the spare drive still, guess because its rebuilding
[19:27] <RoyK> fretegi: pastebin /proc/mdstat again
[19:28] <fretegi> https://dpaste.org/UYZ4
[19:28] <RoyK> ok, finished in two and a half hours
[19:29] <fretegi> right
[19:29] <fretegi> then update initramfs
[19:29] <fretegi> and should be good
[19:29] <RoyK> nah - it should be fine now
[19:29] <RoyK> the uuid won't change
[19:29] <fretegi> then why was this thing not started on boot before?
[19:29] <RoyK> just remove old references to old arrays and then add the current one
[19:29] <fretegi> have not done anything different on this go then the last 5x
[19:29] <RoyK> I have no idea :)
[19:30] <fretegi> remove from mdadm.conf?
[19:30] <RoyK> I just followed my own playbook on how to debug these sort of things
[19:30] <RoyK> yes, just remove those arrays defined there and take the output from mdadm --detail --scan and add it to the end (perhaps with >>)
[19:30] <fretegi> RoyK, oh i get it, and your process exactly lines up with my understanding... i just could not get the damned md0 to start on boot, couldnt figure out why
[19:31] <fretegi> RoyK, even when the output still shows a spare?
[19:31] <fretegi> https://dpaste.org/x1NT
[19:31] <RoyK> oh - remove that part
[19:32] <RoyK> just remove "spares=1 "
[19:32] <RoyK> but you could run it again later if you like, but it'll probably show the same
[19:32] <RoyK> without that spare
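Dropping the transient "spares=1 " field from the scan output could be done with a small filter like the one below. This is a minimal sketch: `strip_spares` is a made-up helper name, and the sample ARRAY line (array name and UUID) is invented purely for illustration.

```shell
#!/bin/sh
# strip_spares: remove a transient " spares=N" field from ARRAY lines on stdin.
strip_spares() {
    sed 's/ spares=[0-9][0-9]*//'
}

# Hypothetical sample line; the name and UUID below are made up.
sample='ARRAY /dev/md0 metadata=1.2 spares=1 name=host:0 UUID=00000000:00000000:00000000:00000000'
printf '%s\n' "$sample" | strip_spares
```

On the real machine the input would come from `mdadm --detail --scan`, with the filtered result appended to /etc/mdadm/mdadm.conf after the old ARRAY lines have been removed.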
[19:32] <fretegi> once fully recovered
[19:33] <RoyK> good luck :)
[19:33] <fretegi> now we just wait and see i suppose ha
[19:33] <fretegi> letcha know in 2.5 hours
[19:33] <fretegi> i mean there is nothing i need to do to have this thing start on boot right?
[19:33] <RoyK> I'll probably be awake :)
[19:34] <RoyK> nah - not really
[19:34] <fretegi> yea thats what i thought...  so weird
[19:36] <fretegi> and worst of it was... md0 wouldnt start as a device was missing, which means that LVM couldnt load the fs on md0, but oddly that prevented a boot...  guess it caused LVM to freak out and since i have /boot and / on lvm volumes, just in a dif. group, machine would not boot
[19:36] <fretegi> out comes the live cd, mounting all the file systems blah blah
[19:36] <RoyK> partitions were invented in the eighties when filesystems didn't support big drives (such as MSDOS 3.3's max 32MB). AFAIK they are still needed for grub to work, but that might change over time or perhaps has already. We have stuff like LVM today that does this *way* more flexibly
[19:37] <fretegi> lvm is awesome, just never had it choke like this lol
[19:37] <fretegi> and didnt think it choking on my data volumes would take the system down
[19:37] <fretegi> well... prevent a boot anyway
[19:40] <RoyK> there's a kernel setting to allow boot degraded
[19:41] <RoyK> imho it should be on by default
[19:41] <RoyK> which distro is this?
[19:41] <fretegi> ubuntu 16.04
[19:43] <RoyK> hm
[19:44] <RoyK> that seems to be fixed https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/1635049
[19:44] <fretegi> in 16.04?
[19:44] <RoyK> fretegi: the bug is for 1604
[19:44] <fretegi> RoyK, ah gotcha well there ya have it ;)
[19:44] <RoyK> fretegi: pastebin output of lsb_release -a
[19:45] <fretegi> No LSB modules are available.
[19:45] <fretegi> Distributor ID:	Ubuntu
[19:45] <fretegi> Description:	Ubuntu 16.04.7 LTS
[19:45] <fretegi> Release:	16.04
[19:45] <fretegi> Codename:	xenial
[19:45] <fretegi> sorry, thought would be one line
[19:45] <RoyK> np
[19:45] <RoyK> but that should be updated nicely
[19:45] <fretegi> yea shes old
[19:48] <fretegi> https://dpaste.org/VGEq
[19:48] <fretegi> all looks good
[19:52] <RoyK> and nothing nasty in dmesg?
[19:52] <RoyK> preferably dmesg -T if 16.04 supports that flag
[20:05] <fretegi> [Wed Sep  9 16:32:06 2020] cgroup: new mount options do not match the existing superblock, will be ignored
[20:07] <fretegi> actually u know what... i shrank that raid to just 1 device before this most recent reboot..  DMESG doesnt go back far enough now to see the issue
[20:21] <RoyK> fretegi: then /var/log/kern.log.something should show it
[20:21] <RoyK> or whatever that was called in 16.04 ;)
[20:33] <RoyK> fretegi: still running?
[20:37] <fretegi> yup another 100 min
[20:39] <RoyK> it usually slows down towards the end - there's about double the amount of sectors per track on the outside compared to the inside of the disk, so half the speed, since the spin rate is the same
[20:42] <fretegi> gotcha
[20:43] <fretegi> gonna run an errand while this is building, bbiab, appreciate all the help buddy!  see ya soon
[21:46] <fridtjof[m]> Alright, finally found time to go after the qemu-img issue again!
[21:47] <fridtjof[m]> sarnold: so far, i can also reproduce with upstream qemu-img. Time to bisect!
[22:25] <RoyK> fretegi: how's it going?
[23:15] <fridtjof[m]> found the bad commit! 34fa110e424e9a6a9b7e0274c3d4bfee766eb7ed
[23:38] <fretegi> RoyK, rebooting now
[23:38] <RoyK> fretegi: did you update initramfs first?
[23:39] <fretegi> RoyK, yup and same thing
[23:39] <fretegi> damn md0 is inactive
[23:39] <RoyK> damn
[23:40] <fretegi> https://dpaste.org/bQkv
[23:40] <RoyK> are you in the rescue thing?
[23:40] <fretegi> seriously have no idea what the heck is the deal here
[23:40] <RoyK> fretegi: does lsblk show the other drive?
[23:41] <RoyK> it really shouldn't be (S) anyway
[23:41] <RoyK> if a disk fell out, well, no big deal
[23:41] <RoyK> fretegi: have you considered updating the kernel?
[23:41] <RoyK> it might be that there's a stone old bug around that no one bothers to fix
[23:42] <fretegi> https://dpaste.org/9rLd
[23:42] <fretegi> shows the 2 disks
[23:42] <RoyK> ok, and mdadm --examine for those?
[23:43] <RoyK> and pastebin output of uname -a as well
[23:43] <fretegi> https://dpaste.org/B3iP
[23:44] <fretegi> see sdc all eff'd up, you saw the post when it was building.. i sent --examine output and all was well
[23:44] <fretegi> https://dpaste.org/N5P8
[23:45] <RoyK> hm - never seen that
[23:46] <fretegi> so im thinking either 1 i just nuke md0 start over from backup (but seems kinda like cheating, we are linux guys afterall) or degrade md0, make md1 with sdc only, copy data over, then nuke md0 and add sdb into md1..
[23:46] <RoyK> anything in BIOS flagging sdc as a raid drive or something?
[23:46] <fretegi> theres a thought, have not looked, but have not made any bios changes either tho
[23:47] <fretegi> pita to confirm, not easy to get a screen hooked up to this thing
[23:47] <RoyK> take a look or just reset to defaults and turn off anything that looks like raid in there
[23:47] <fretegi> but never made a bios changes so...
[23:47] <fretegi> should all be ahci
[23:47] <RoyK> I don't know - I can just guess
[23:48] <RoyK> but it seems like the superblock has been overwritten by something and that something is either the BIOS or some nasty virus of sorts - no idea
[23:49] <fretegi> well the way i broke it was by simply running parted
[23:50] <fretegi> made it a gpt label
[23:50] <fretegi> is that wrong
[23:50] <RoyK> not sure it's relevant, but http://mbrwizard.com/thembr.php
[23:50] <RoyK> as I said - there's no need for partitions
[23:51] <RoyK> be it MBR or GPT
[23:51] <fretegi> so i ran parted after dd'ing the drive, was that wrong?
[23:52] <RoyK> but try to hook up a monitor to that thing and check BIOS settings
[23:52] <RoyK> fretegi: I can't understand why you'd want to do that
[23:52] <RoyK> anyway - I guess parted maybe asked you if you wanted to save your changes or similar?
[23:53] <fretegi> RoyK, well, superblock is gone.. so ave to reassemble anyway right?
[23:53] <fretegi> as a best case
[23:53] <fretegi> have to reassemble*
[23:54] <RoyK> if you want a partition table there, add one, but then you'll need to rsync the stuff over to the new array
[23:54] <RoyK> I'll suggest not using a partition table at all, since you don't need it
[23:54] <fretegi> i mean to fix this array..  to get this disk back in... gonna have to reassemble anyway
[23:55] <RoyK> then first try to reassemble the raid with the working drive
[23:55] <RoyK> mdadm --assemble /dev/md0 after mdadm --stop /dev/md0
[23:56] <fretegi> yea got that already
[23:56] <RoyK> it should tell you 'assembled with 1 out of 2 drives' or something like that
[23:56] <fretegi> mdadm --assemble --scan didnt work.. but i could force it with mdadm --assemble /dev/md0 /dev/sdb -f
[23:56] <fretegi> then started with just sdb
[23:57] <RoyK> then zero the first megs on the target drive (not the one in the raid, it could be rather messy) and try to add it. better check lsblk /dev/sdc to check
[23:57] <RoyK> or just mdadm --examine
[23:58] <fretegi> so i dont have to shrink md0?
[23:58] <fretegi> wont let me fail or remove sdc, doesnt even see it as a valid raid device
[23:59] <fretegi> RoyK, kinda curious...  what if i nuke the superblock and dd the first few k bytes of sdc.  make md1 degraded and reboot.. see if it starts md1 on boot