[13:08] hello. i have a RHEL8 image for azure that i'm trying to harden. cloud-init works at first, but after i run the ansible-lockdown RHEL8-CIS job to apply CIS hardening, it breaks. if i disable all of section 1 in CIS, it works again...but i can't pin down exactly what could be breaking it in that section
[13:08] anyone here familiar with CIS hardening and has maybe run into this issue before? any idea how to solve it without disabling the whole section, but maybe one or two individual measures?
[13:12] here are the CIS section 1 hardening measures; the ones that have pass are being enforced. https://dpaste.com/8WWM5ANAU
[13:12] "it breaks" is a pretty broad description. logs are a useful way to see what went wrong
[13:13] waldi, basically cloud-init runs, but it doesn't run my custom script...i don't see any mention of it in the logs
[13:13] it's like customdata from azure is not being passed to the vm maybe?
[13:13] this is because of udf
[13:14] The CIS hardening script probably disables the UDF module
[13:14] i have a script that partitions an additional disk and mounts it, as well as installs an agent...but that doesn't run with the hardening
[13:14] Which is used by azure to pass data to cloud-init
[13:14] udf is failed, i disabled that
[13:14] and noexec on var is also fail, because i disabled that as well
[13:15] CorvetteZR1: cloud-init will log what it does. use this information
[13:16] i looked and i see it doing lots of stuff but not executing my script. what exactly should i look for? the logs are very verbose and i don't know what a failed customdata should look like...
[13:16] gjolly: and the error message is pretty likely to show up in the log. no need to guess
[13:18] CorvetteZR1: maybe you can just show it? you have no idea how to interpret it
[13:21] waldi: sure, I was guessing because we had (what seems to be) the exact same issue reported on Ubuntu this week.
[13:22] ok. i'll build another image and paste the logs.
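A quick way to test the udf theory on an affected image is to look for a modprobe rule that disables the module. The sketch below is a minimal example, assuming the hardening role disables modules through `install <module> /bin/true` (or `/bin/false`) or `blacklist <module>` lines in `/etc/modprobe.d`; the function name `module_disabled` is hypothetical. A match only shows that such a rule exists, not that it is the cause of the missing customdata.

```shell
#!/bin/sh
# module_disabled MODULE [CONF_DIR]
# Succeeds if a modprobe.d-style *.conf file in CONF_DIR (default
# /etc/modprobe.d) contains "install MODULE /bin/true", "install MODULE
# /bin/false" or "blacklist MODULE". Assumption: the hardening tool
# disables modules this way.
module_disabled() {
    mod=$1
    dir=${2:-/etc/modprobe.d}
    grep -Eqs "^[[:space:]]*(install[[:space:]]+$mod[[:space:]]+/(usr/)?bin/(true|false)|blacklist[[:space:]]+$mod)[[:space:]]*$" "$dir"/*.conf
}

# Example:
#   module_disabled udf && echo "udf is disabled by modprobe configuration"
```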
even though udf says failed, it's actually being enforced in cis...so maybe when generating the report, it looks somewhere else. will give this a shot
[14:06] ok, that appears to be it, cis was messing with udf
[14:06] but still failed the check in the report
[14:06] thanks a lot waldi and gjolly !
[14:41] hi. i have another question related to disk config with cloud-init. in my scenario, the system boots up and disk gets attached some seconds after; but i need cloudinit to partition/format/mount that disk on boot
[14:42] i have this script, but it works on boot like 50% of the time ... lately less than 50%: https://dpaste.com/F4HBP7Q2N
[14:42] is there a better way to handle this?
[14:44] using cloudinit...since azure/terraform are not giving me many options. the reason for needing this workaround is this issue in tf: https://github.com/hashicorp/terraform-provider-azurerm/issues/6117
[14:44] -ubottu:#cloud-init- Issue 6117 in hashicorp/terraform-provider-azurerm "Support for disk attachment to VMs at creation time" [Open]
[14:49] CorvetteZR1: when it doesn't work does /var/log/cloud-init.log show anything?
[14:50] also what exactly does not work? the partitioning? the formatting? the mounting?
[14:50] the partition is not created
[14:51] and does the logfile show any errors?
[14:51] the mountpoint gets added, i see fstab entry...but without the partition, it doesn't mount it of course
[14:51] i'll check again. booting a new image now, will see if it fails again or not
[14:52] also which distro (and its version) and which version of cloud-init?
[14:54] rhel 8.6 ...
cloudinit 21.2.x i think, but i'll confirm shortly...taking a while to boot now
[14:54] i might have broken something else :)
[14:55] there have been quite a few azure-related changes between 21.2.x and 23.1.1, wondering if those could be a factor
[14:56] cloud-init 22.1-6.el8_7.2
[14:57] if cloud-init.log does not show any useful info for the problem then it would probably be a good idea to enable debugging as then cloud-init.log will have more detailed info of what is happening
[14:57] ok, so yea, it booted up, again no partition
[14:58] this sounds like a race condition. for that cloud-init would need hotplug support
[14:58] waldi: I thought he was using the "until" entry (for bootcmd) in his user-data to handle this
[15:00] yea, that's the idea...but it's very intermittent
[15:00] so was wondering if there is a better way to add that delay?
[15:00] have you checked cloud-init.log?
[15:00] yes, looking at it now...no idea what to look for
[15:00] well any warning/errors
[15:00] it looks like there is another azure disk (not managed by me) that is trying to take that partition
[15:01] take that partition? you mean that device name?
[15:02] yes
[15:02] wasn't there some discussion or bug a while ago related to the order of multiple disks appearing?
[15:02] it's trying to mount some ntfs disk...i'm formatting mine as xfs
[15:03] right, so how do you know what the disk you want will be called?
[15:06] i'm trying to paste the log, but it's too big
[15:06] anyone know where i can paste it? dpaste and pastebin.com don't like it
[15:06] i expect the device to be /dev/sdb1
[15:11] your user-data references /dev/disk/azure/scsi1/lun0 though, not /dev/sdb
[15:12] you don't need to paste the full log, only the disk_setup related portions
[15:13] right, i'm using the lun because sometimes sdb will be something else
[15:14] isn't the "ntfs" disk an Azure temporary disk?
[15:14] i don't know what to paste because this appears to be a different disk than what i expect
[15:14] https://learn.microsoft.com/en-us/azure/virtual-machines/managed-disks-overview#temporary-disk
[15:14] yes, but that's not the device i'm targeting...at least not trying to
[15:15] ok, well device names will be allocated by kernel in the order it sees the disks
[15:15] so if it sees that disk 1st then it will be sdb and the other disk sdc
[15:16] ah, found something...
[15:17] is this not related to https://learn.microsoft.com/en-us/azure/virtual-machines/linux/azure-to-guest-disk-mapping ?
[15:17] so you can't assume lun0, you have to see the lun allocated to a disk and use that in the user-data?
[15:22] ok, here is the relevant log: https://dpaste.com/3WCTR68UL
[15:27] did you read the link about disk mapping?
[15:28] not yet, will have a look now
[15:28] so from the log it looks like lun0 is an "ephemeral" disk, aka Azure temporary disk, not the disk you added
[15:29] so i should use another lun id in my tf? are the two colliding?
[15:29] so he uses an old waagent. the udev rules come from there and it properly uses a special name for the azure ephemeral disk
[15:31] because if i look at that disk now, it's the one i added...i can format and mount it manually
[15:34] does the log i pasted have anything meaningful?
[15:35] did you check the lun info in Azure for that disk as mentioned in the link I provided?
[15:35] is it lun0?
[15:35] yes
[15:36] and is there also a temporary disk?
[15:37] i see 2 disks...sda which is the os, and sdb which is the data disk i'm trying to format
[15:37] so did you create this disk when you created the VM? or is it actually a temporary disk created automatically?
[15:38] https://dpaste.com/AEZTM5GL6
[15:38] yes, i create it in terraform with the vm
[15:38] and attach it as lun 0
[15:39] but it gets attached AFTER the vm boots up...which is why i need that sleep
[15:40] well from the log you provided that device did NOT exist at the time cloud-init tried to partition it:
[15:40] 2023-03-10 14:53:59,103 - util.py[WARNING]: Failed partitioning operation
[15:40] Device /dev/disk/azure/scsi1/lun0 did not exist and was not created with a udevadm settle.
[15:40] exactly! which is what i'm trying to solve
[15:40] my solution appears to be broken, and i'm asking if there is a better way to implement it
[15:41] well perhaps as waldi mentioned there is a waagent issue?
[15:41] this is my script, and it's not working: https://dpaste.com/F4HBP7Q2N
[15:41] is there a way to fix it that you are aware of?
[15:42] no idea...not referencing waagent in the terraform anywhere. it's a new vm every time, so shouldn't it just grab the latest agent?
[15:43] i'm just wondering if there is a better way to wait for the disk before cloud init runs. better way than this: "until [ -e /dev/disk/azure/scsi1/lun0 ]; do sleep 1; done" any ideas?
[15:43] is waagent installed as part of the RHEL disk image you are using?
[15:45] meena: did you get freebsd booting on lxd? you shared the kernel MR before fixing the divide-by-zero, was that all that was required?
[15:51] minimal, yes. WALinuxAgent-2.7.0.6 running on redhat 8.6
[15:52] the image was built using latest rhel86 from azure, so it should have all the latest agents and such...during the image baking we fully update it too
[15:52] and did you check what waldi mentioned about waagent naming/renaming temporary disks?
[15:58] race condition, hotplug support? sorry, not following, is this something i can configure?
[15:59] udev rules?
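The `until ... sleep 1` loop quoted above never gives up, so a disk that does not show up would stall boot indefinitely. A bounded variant is sketched below; the function name `wait_for_path` and the 120-second default are hypothetical choices, and the loop only has any effect if the bootcmd entry that calls it is actually executed.

```shell
#!/bin/sh
# wait_for_path PATH [TIMEOUT_SECONDS]
# Poll once per second until PATH exists. Returns 0 when it appears,
# 1 if it is still missing after TIMEOUT_SECONDS (default 120, an
# arbitrary value chosen for this sketch).
wait_for_path() {
    path=$1
    remaining=${2:-120}
    until [ -e "$path" ]; do
        if [ "$remaining" -le 0 ]; then
            echo "timed out waiting for $path" >&2
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Example:
#   wait_for_path /dev/disk/azure/scsi1/lun0 300
```

A non-zero return leaves a visible message on stderr, which makes a missing disk easier to tell apart from a loop that never ran.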
[16:01] this is kinda too low level for me...was hoping it's just something that can be fixed within the cloudinit script since it does work sometimes
[16:04] if cloud-init does not see a device (sometimes) for the disk then what do you expect cloud-init to do?
[16:12] like i said, i just want to know if there is a better way to delay cloudinit than what i'm doing now. is there a better solution than doing this? "until [ -e /dev/disk/azure/scsi1/lun0 ]; do sleep 1; done"
[16:13] you can set fstab "by hand" and use x-systemd.makefs, if your systemd is new enough. but if you have an old waagent that does assign incorrect names for the resource disk, then this will break as well
[16:15] please show "udevadm info -q all -n sda -n sdb -n sdc -n sdd", so we can see if this information is actually wrong
[16:17] CorvetteZR1: and you can use https://gist.github.com/ to provide pretty large logs
[16:18] waldi, this is the waagent version...i think it's newest, no? WALinuxAgent-2.7.0.6
[16:19] i'll run that command shortly, i'm using lun1 instead of lun0 now and increased sleep to 30, will see if that makes a difference
[16:20] as debian already uses 2.7.3, this is by far not the newest
[16:22] that udevadm command just says "device already specified"
[16:22] changing to lun1 didn't help
[16:23] i think that sleep condition is either being ignored or not working
[16:23] which is kinda what i was suspecting when i asked this initially. is it the right syntax?
[16:27] have you checked the cloud-init.log for "Running module bootcmd" ?
[16:29] Running module bootcmd () with frequency always
[16:30] and the following lines that indicate what it may have done and if it succeeded or failed?
[16:31] next line is start...don't see success or fail
[16:31] Skipping module named bootcmd, no 'bootcmd' key in configuration
[16:31] and then success
[16:31] so it never saw any bootcmd section in user-data
[16:31] so it never ran your command
[16:32] ok. so is my syntax broken?
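The log check described above can be scripted. This is a minimal sketch that relies only on the two log messages quoted in the conversation; the function name `bootcmd_status` and its three output words are hypothetical. "attempted" means the module was started and not skipped, which says nothing about whether the commands succeeded.

```shell
#!/bin/sh
# bootcmd_status LOGFILE
# Prints one of:
#   skipped   - cloud-init found no 'bootcmd' key in the configuration
#   attempted - the bootcmd module was started and not reported as skipped
#   unknown   - neither message is present (or the log is unreadable)
bootcmd_status() {
    log=$1
    if grep -q "Skipping module named bootcmd" "$log" 2>/dev/null; then
        echo "skipped"
    elif grep -q "Running module bootcmd" "$log" 2>/dev/null; then
        echo "attempted"
    else
        echo "unknown"
    fi
}

# Example:
#   bootcmd_status /var/log/cloud-init.log
```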
[16:32] "no 'bootcmd' key in configuration"
[16:32] so is this wrong? https://dpaste.com/F4HBP7Q2N
[16:33] you don't need to keep pasting the same thing
[16:33] is that actually being used as user-data?
[16:34] i don't know...
[16:34] see /var/lib/cloud/instance/user-data* or so
[16:35] looks like someone was having this issue a couple of years ago: https://bugzilla.redhat.com/show_bug.cgi?id=1925213
[16:35] -ubottu:#cloud-init- bugzilla.redhat.com bug 1925213 in Red Hat Enterprise Linux 8 "[Azure][RHEL-8] cloud-init.service costs 2 minutes in DSv4 VM" [Medium, Closed: Currentrelease]
[16:36] waldi, i see the same file in user-data
[16:36] with that bootcmd and all
[16:37] this script also installs a monitor agent, which gets installed and that works
[16:37] it's just this disk partition/mounting which fails
=== paride4 is now known as paride
[17:23] holmanb: the fix is committed, but we still need to fix virtio_random, which is otherwise eating one whole CPU while pulling no entropy at all
[17:26] meena: ack, not ideal
[17:27] we have this: https://reviews.freebsd.org/D38898 accepted but not committed
[17:27] i think the boot fix patch was also merged on the current release branch
[17:33] the virtio thing is great, cuz once you add this patch, it no longer spins on the CPU, but when you look at dtrace you can then better see what it does: nothing
[17:39] meena: does not producing entropy block boot in some other way?
[17:39] meena: or are you just concerned about it being incorrect (and insecure) because of the lack of entropy?
[17:39] nah, FreeBSD has loads of entropy providers
[17:41] it just hints at some underlying bug in either the virtio Subsystem or just virtio_random
[17:49] gotcha, agreed
[18:24] when testing NoCloud with "ds=" on cmdline I've noticed that in the Grub configuration I need to enclose the ds value inside quotes as otherwise when passed to kernel it is truncated at 1st semicolon, e.g.
passing ds=nocloud-net;s=http:1.2.3.4/seed results in ds=nocloud-net being passed to kernel whereas ds='nocloud-net;s=http:1.2.3.4/seed' results in the full value being passed
[18:26] minimal: that's a feature of grub's parsing, right?
[18:27] minimal: same thing happens with qemu on the commandline: https://cloudinit.readthedocs.io/en/latest/tutorial/qemu.html#launch-a-virtual-machine-with-our-user-data
[18:39] minimal: I suppose that should be documented
[18:44] holmanb: not sure, it's not to do with Grub's handling of /etc/default/grub variables or grub.cfg, if I boot and edit the relevant Grub menu item then what appears there requires the quotes
[18:45] ah, I'd missed the qemu example in the docs
[18:49] holmanb: I'm also trying to figure out network interface setup with "ds=" and seed URL, it seems like the fallback network config is creating /e/n/i for DHCP, then the OS ifup's using that config, c-i then fetches docs from seed url (with meta-data including network), the revised network config is written to /e/n/i, then cloud-init does an "ifup", which does nothing as interface is already up, and so machine ends up with "temporary" (DHCP) IP settings
[18:49] rather than that from seed meta-data - looks like it should be doing an "ifdown" before the 2nd "ifup"
[18:49] am still investigating though...
[18:51] minimal: iirc both /etc/default/grub and grub.cfg are shell (posix?) scripts, right?
I'd assume setting the url in /etc/default/grub via kernel cmdline would also fail if not double or single quoted
[18:52] minimal: haven't followed the codepath for eni specifically, but I think the ephemeral code is supposed to do teardown using the context manager after temp dhcp is used to grab configs
[18:52] holmanb: yes but I "bypassed" any shell-related issues by modifying/testing it in the Grub "edit boot menu entry"
[18:54] holmanb: it doesn't look like NoCloud uses the ephemeral DHCP stuff at all, it uses the fallback network config (at least in 23.1 which I happen to be testing)
[18:59] minimal: ah, you're right
[18:59] it just uses util.read_seeded() directly
[19:00] holmanb: a good question is whether it should use the Ephemeral stuff
[19:00] I think it should, rather than writing to /e/n/i twice
[19:00] minimal: gut feeling is that it _could_ use some of the ephemeral stuff, but shouldn't use the dhcp portion of it
[19:01] maybe just EphemeralIPNetwork()
[19:01] so use the fallback dhcp instead of ephemeral dhcp? what's the difference?
[19:03] minimal: nevermind, my suggestion doesn't make sense
[19:03] minimal: I'd have to dig a bit more to understand what's going on
[19:03] I think what is missing is that the interface is not being brought down once the seeded URLs are fetched
[19:08] I'll need to do some testing with git master, also to check if another issue is still present (retrying of vendor-data seed url multiple times despite webserver returning 404)
[19:09] +1 sounds right to me without digging into it too much
[19:11] right to retry?
[19:11] right to ifdown before ifup
[19:12] ah ok
[19:12] sorry for the ad hoc questions, I'll formalise things more once I've finished investigating
[19:13] as for the retry, those are typically workarounds when the imds isn't up before cloud-init tries, which is annoying and shouldn't be required, but I'd be hesitant to remove, esp since NoCloud is so general so we don't have a specific cloud to shake a fist at
[19:13] minimal: all good
[19:15] minimal: to avoid the vendordata retry in the tutorial we just create an empty file
[19:15] minimal: https://cloudinit.readthedocs.io/en/latest/tutorial/qemu.html#define-our-vendor-data
[19:16] yeah, it just seems "strange" that a 404 (indeed any HTTP response code) is treated as a "connectivity" failure
[19:16] minimal: fair, I see what you're saying now
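On the earlier point about quoting the "ds=" value: the truncation at the first semicolon is easy to reproduce in a POSIX shell, where an unquoted `;` ends the command. The sketch below only demonstrates the shell behaviour. GRUB's scripting syntax is shell-like, so the same rule may explain what was seen in the Grub menu editor, but that is an assumption and was not settled in the discussion. The helper name `show_args` is hypothetical.

```shell
#!/bin/sh
# Print each argument in brackets so word boundaries are visible.
show_args() { printf '[%s]\n' "$@"; }

# Unquoted: the shell ends the command at ';' and then runs
# "s=http:1.2.3.4/seed" as a separate variable assignment, so
# show_args only receives the part before the semicolon.
unquoted=$(show_args ds=nocloud-net;s=http:1.2.3.4/seed)

# Quoted: the whole value reaches show_args as a single argument.
quoted=$(show_args ds='nocloud-net;s=http:1.2.3.4/seed')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```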