[11:14] <MacSlow> Greetings everybody!
[16:38] <eest> anyone here involved with the ubuntu installer? i have run into an issue where the 22.04 installer will crash when trying to set up software RAID where the two disks share the same WWN (which seems like like something that should  should not happen, but alas).
[16:38] <eest> i could make the installer work by building my own with a patched curtin, where it will just not use the wwn: https://github.com/eest/curtin/commit/1052baa37b81cd348e17d17a2ef545c93ce26505
[16:38] -ubottu:#ubuntu-devel- Commit 1052baa in eest/curtin "Skip looking for wwn for testing"
[16:40] <eest> i am thinking a proper fix would be to at some point iterate over all disks and check for duplicate WWNs and add this info somewhere to the disk, then you could check this when iterating over the different disk_key values and just continue to the next key if it the case
[16:41] <arraybolt3> eest: Have you filed a bug report on this? That's usually the quickest way to get the attention of relevant people.
[16:42] <eest> i have not, but it seems not unlike https://bugs.launchpad.net/curtin/+bug/1955511 (but there it is the serial being duplicate)
[16:42] -ubottu:#ubuntu-devel- Launchpad bug 1955511 in curtin "curtin cannot identify nvme namespaces because they share a serial number" [Undecided, New]
[16:42]  * arraybolt3 looks
[16:43] <eest> i dont have nvme drives, and my serials are unique
[16:43] <arraybolt3> So... I don't know one of the terms you're using... what is a WWN?
[16:43] <eest> it is a "World Wide Name", and you can query for it using a SCSI command
[16:44] <eest> for example
[16:44] <eest> lsblk -S -d -o TRAN,NAME,TYPE,MODEL,SERIAL,SIZE,WWN
[16:44] <arraybolt3> Ah, OK.
[16:44] <arraybolt3> This does indeed look like the same bug possibly, since it has to do with disk detection based on not-necessarily-unique IDs.
[16:45] <eest> yes
[16:45] <arraybolt3> eest: Out of curiosity, *how* did you end up with multiple disks with the same WWN?
[16:45] <arraybolt3> (I realize this is irrelevant to the bug itself, even if this is weird, it can in theory work, and if it can be made to work in practice then it probably should be made to work IMO.)
[16:45] <eest> the machine in question is a Dell VEP4600, i dont know why the WWNs are shared, they just came like that
[16:46] <arraybolt3> Hmm. OK.
[16:46] <arraybolt3> What I'd do at this point is:
[16:46] <eest> i have even seen that same shared WWN being present on a whole other server of the same make and model
[16:46] <eest> so i guess the drive manufacturer reuses WWN for series of drives or something, i have no idea really
[16:47] <arraybolt3> 1: File a new bug. Yes, I know that we think this is the same bug, but Ubuntu doesn't care about having duplicate bug reports, unlike virtually every single other open source project. We actually like duplicates. And too often, things that look like duplicates aren't.
[16:47] <arraybolt3> 2: In that bug report, record everything you know about the bug, including your suggested fix.
[16:48] <arraybolt3> 3: Paste the bug link here, I'll see if I can find the relevant people to talk to about the suggested fix and see what we can come up with.
[16:49] <eest> sure, i can open a bug report, i was hoping i could find someone to discuss the best place to solve it becuase my fix just skips using wwn, and it is not super clear to me at what layer above it makes most sense to do this iteration over the discovered drives
[16:50] <arraybolt3> That makes sense. I'll take a look real quick (I'm not very familiar with curtin but I do enough Python programming to have a general idea of what I'm looking at)
[16:52] <eest> fram what i can tell that function gets passed a config and gets stuff from there, but then there is jsonschema declarations in https://github.com/canonical/curtin/blob/master/curtin/block/schemas.py so i guess those would need to be extended, but then there is the probert stuff i ave not looked at yet etc
[16:53] <arraybolt3> That seems complicated, I hope it doesn't end up being that involved.
[16:54] <arraybolt3> Hmm... so here's what I'm seeing.
[16:55] <arraybolt3> if disk_key in ['wwn', 'serial']:
[16:55] <arraybolt3>     volume_path = block.lookup_disk(vol_value)
[16:55] <arraybolt3> The code appears to try WWN first, if it finds that it uses it. Then it tries serial, if it finds that it uses it, etc.
[16:56] <eest> correct
[16:56] <arraybolt3> I think it needs some way of deduplicating things, but I'm not sure how to do that quite yet, so bear with me...
[16:57] <arraybolt3> (I'm thinking, if it finds duplicate WWNs, forget them and use serial numbers, and vice versa.)
[16:57] <arraybolt3> (And if it finds duplicates of both, then all hope is lost anyway :P)
[16:57] <eest> i think the problem is that at this point in the code you are only operating on a per-disk basis, so unless you want to "for each disk, check all other disks" repeatedly, you would need to have knowledge that things are duplicated beforehand
[16:58] <arraybolt3> True. So maybe the fix is higher, wherever get_path_to_storage_volume() gets called.
[16:58] <eest> so there should be some point in the code where the config info has been collected, and you can do a single pass to check for duplicates
[16:58] <eest> it gets called at multiple places
[16:58] <eest> it seems
[16:59] <arraybolt3> I'm not sure if that would work, because it seems like, if it finds multiple disks with the same WWN, it will return the same (first) disk both times.
[16:59] <arraybolt3> So deduplication at that point would result in a missing disk, I think.
[17:01] <eest> i dont mean deduplicating stuff, just includng an additional field or something that the wwn is duplicated so it can skip it and contiue to serial etc
[17:01] <arraybolt3> What I'm thinking is, perhaps an additional array can be added that stores a list of WWNs and serials that have already been seen. The get_path_to_storage_volume() code can then add to that list each time it see a WWN or serial number, and if it finds one that it's already seen before, it can also check serial number. If both are duplicates, it can assume it's looking at the same drive as
[17:01] <arraybolt3> before, otherwise, it will use whichever one is unique this time.
[17:01] <arraybolt3> eest: I assume you are able to manually set up software RAID on these disks without using the installer, and it works?
[17:02] <eest> with the patch i linked i can use the installer to do it
[17:02] <arraybolt3> Very good. OK, so maybe we're not thinking *too* far outside the box :P
[17:02] <eest> but my patch is just an ugly "dont care about wwn", trying to figure out how to do it better for something that would be accepted by upstream
[17:03] <arraybolt3> True.
[17:03] <arraybolt3> Since the bug you linked to talks about a difference between path and serial number, I'm thinking maybe we need to do this duplicate checking more extensively than just for WWN and serial.
[17:04] <eest> yes, i was kinda thinking you could have a dict of "keys found to be duplcicate" so you could dod "if disk_key in duplicate_disk_keys, continue" kinda
[17:04] <eest> or something like that
[17:05] <eest> but it goes back to "in what layer does it make sense to discover and fill in that information", i am already one step away from the installer code with curtin, but i notice there is also probert involved in finding stuff
[17:07] <arraybolt3> Probert... I have no idea what that is :D
[17:07] <eest> https://github.com/canonical/probert
[17:07] <eest> i didnt either until last week :P
[17:07] <arraybolt3> Hardware discovery tool... hmm.
[17:09] <arraybolt3> I don't think that is probably involved, though, because for one, you're able to fix the problem by patching curtin, and for two, I think the code in question has to do with taking hardware that is known about and figuring out data about it, as opposed to probing for hardware.
[17:09] <arraybolt3> (I could be wrong, this is just my first impression from a brief skim of the code.)
[17:10] <eest> well, i was thinking curtin was operating on information that was collected by probert
[17:10] <eest> but i am unsure as well
[17:13] <arraybolt3> All the info should be there, I think, it's just using that info in a way that makes sense. Perhaps probert could help fix the problem, but if ignoring WWN fixes the problem for you, I doubt probert is doing anything wrong.
[17:14] <eest> i dont think it is doing something wrong, its just that one could argue if that is the correct location to also discoer that WWNs or serials are duplicate and add addional fields that downstream consumers can use
[17:14] <arraybolt3> OK, I'll have to look into this further later, but I'll put this on the "to be looked into further" later.
[17:14] <eest> ill report back with a link to the bug report when i have created it, thanks for discussing
[17:14] <arraybolt3> +1, glad to help!
[17:24] <arraybolt3> One possible concern with my current plan is, assume that you have drive wwn1serial1 and drive wwn1serial2. If the function is told to find the path of drive wwn1serial2, and it uses only the WWN to find it, it *may* get the path of wwn1serial1 on accident. Then when it is handed wwn1serial1, it will detect the duplicate WWN, and use the serial number instead, but it will still accidentally
[17:24] <arraybolt3> pass the same drive.
[17:24] <arraybolt3> s/pass/return/
[17:25] <arraybolt3> Hrm. This will probably be tricker than I thought :P
[17:59] <eest> https://bugs.launchpad.net/curtin/+bug/2003654
[17:59] -ubottu:#ubuntu-devel- Launchpad bug 2003654 in curtin "Installer crashes when setting up software RAID and disks have duplicate WWN" [Undecided, New]
[18:00] <eest> getting the wrong drive is actaully what happens now
[18:01] <eest> from what i can see curtin looks up the wwn in /dev/disk/by-id/wwn-* and in my machine there will only one such link, pointing to one of the drives
[18:08] <eest> should add that to the ticket
[18:10] <eest> done
[18:12] <eest> arraybolt3: yeah this is what i have been thinking as well, it seems a bit more involved to fix this for real, that is why i was hoping i could find someone who has a better grasp of the overall flow to know where this should even be fixed
[18:38] <arraybolt3> Well, the bug report should help that. Someone should at least get notified.
[18:47] <eest> it's funny how you can never assume anything ever, "WWN, guaranteed to be unqiue", nope!
[18:47] <arraybolt3> Right!
[18:47] <arraybolt3> Welcome to manufacturers departing from standards (whether by accident or intentionally).
[18:50] <eest> also funny with this "software excusing for hardware" kludge we are discussing, but even if i wanted to send these disks back to prove a point i dont think anyone would care :P
[18:53] <eest> as i saw another server (possible from the same batch) having identical WWN for its disks as this server i am working on i dont think it is a one-off glitch either
[18:54] <arraybolt3> The ideal solution would be for manufacturers to do things the way standards say. That obviously doesn't happen and has pretty much never happened AFAIK, thus you end up with all sorts of hacks like this in, say, the Linux kernel and whatnot. But I'm going off-topic. Yeah, it stinks, but hopefully there's a good way around it.
[19:01] <arraybolt3> eest: Are you able to do more testing?
[19:01] <eest> yes
[19:01] <arraybolt3> If so, rather than removing the 'wwn' field entirely, try moving the 'path' field to the *start*. If it's supposed to be working on /dev/sdc anyway, and it knows it, why does it have to detect it? It already knows what disk to work on, just do that.
[19:02] <arraybolt3> Then if there isn't any provided device path, it can continue on to try WWN, then serial number, then device-id.
[19:02] <eest> yeah, that have occured to me also, like mentioned in the report there is a 'path' field present
[19:03] <arraybolt3> Really that might be enough to fix the whole problem. And fstab relies on partition UUIDs, not any device-specific numbers, so that shouldn't be a problem either.
[19:04] <eest> it is mentioned in passing here: https://github.com/canonical/curtin/blob/b08eecd68cf5f1bccf4255b3d00a77af51c159f7/curtin/storage_config.py#L722-L726
[19:04] <eest> (that wwn/serial will be prefere) unfortunately not why that is
[19:05] <arraybolt3> Wow, so sometimes the path isn't always accurate?
[19:05] <eest> not sure!
[19:06] <arraybolt3> And "asdict" isn't a very descriptive function name :-/
[19:06] <arraybolt3> Ah, but thankfully it is documented.
[19:09] <arraybolt3> I'll definitely have to wait to do this anymore, there's a bunch of other stuff I have to do, but we have a bug report, and we have some digging done, so hopefully we're closer!
[19:09] <eest> hehe, understood, thanks again for taking some time to discuss it
[19:29] <rbasak> eest: thank you for filing the bug. Feel free to shout me here if you don't get a response in a week or two, and I can try to follow up with a colleague.
[19:32] <eest> rbasak: thanks for the info!