/srv/irclogs.ubuntu.com/2023/03/10/#cloud-init.txt

[13:08] <CorvetteZR1> hello.  i have a RHEL8 image for azure that i'm trying to harden.  cloud-init works at first, but after i run the ansible-lockdown RHEL8-CIS job to apply CIS hardening, it breaks.  if i disable all of section 1 in CIS, it works again...but i can't pin down exactly what could be breaking it in that section
[13:08] <CorvetteZR1> anyone here familiar with CIS hardening and has maybe run into this issue before?  any idea how to solve it without disabling the whole section, but maybe one or two individual measures?
[13:12] <CorvetteZR1> here are the CIS section 1 hardening measures; the ones that show pass are being enforced.  https://dpaste.com/8WWM5ANAU
[13:12] <waldi> "it breaks" is a pretty broad description. logs are a useful way to see what went wrong
[13:13] <CorvetteZR1> waldi, basically cloud-init runs, but it doesn't run my custom script...i don't see any mention of it in the logs
[13:13] <CorvetteZR1> it's like customdata from azure is not being passed to the vm maybe?
[13:13] <gjolly> this is because of udf
[13:14] <gjolly> The CIS hardening script probably disables the UDF module
[13:14] <CorvetteZR1> i have a script that partitions an additional disk and mounts it, as well as installs an agent...but that doesn't run with the hardening
[13:14] <gjolly> Which is used by azure to pass data to cloud-init
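
gjolly's suggestion can be checked directly on the hardened image. The following is a minimal sketch, assuming the hardening disables the module through modprobe.d-style `install udf ...` or `blacklist udf` lines, which is a common approach but not confirmed anywhere in this log. `find_udf_blocks` is a hypothetical helper name and the directory is passed in by the caller.

```shell
#!/bin/sh
# Report modprobe.d-style lines that would stop the udf module from loading.
# find_udf_blocks is a hypothetical helper; the directory argument is supplied
# by the caller (on a real host /etc/modprobe.d is the usual place to look).
find_udf_blocks() {
    dir="$1"
    # Match "install udf <cmd>" or "blacklist udf"; commented-out lines start
    # with '#' and therefore do not match the anchored pattern.
    grep -rhE '^[[:space:]]*(install[[:space:]]+udf[[:space:]]|blacklist[[:space:]]+udf([[:space:]]|$))' "$dir" 2>/dev/null
}
```

Running `find_udf_blocks /etc/modprobe.d` prints any matching lines and prints nothing when none are found. This only does text matching on config files; it does not prove whether the module is loaded at runtime.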
[13:14] <CorvetteZR1> udf is failed, i disabled that
[13:14] <CorvetteZR1> and noexec on var is also fail, because i disabled that as well
[13:15] <waldi> CorvetteZR1: cloud-init will log what it does. use this information
[13:16] <CorvetteZR1> i looked and i see it doing lots of stuff but not executing my script.  what exactly should i look for?  the logs are very verbose and i don't know what a failed customdata should look like...
[13:16] <waldi> gjolly: and the error message is pretty likely to show up in the log. no need to guess
[13:18] <waldi> CorvetteZR1: maybe you can just show it? you have no idea how to interpret it
[13:21] <gjolly> waldi: sure, I was guessing because we had (what seems to be) the exact same issue reported on Ubuntu this week.
[13:22] <CorvetteZR1> ok.  i'll build another image and paste the logs.  even though udf says failed, it's actually being enforced in cis...so maybe when generating the report, it looks somewhere else.  will give this a shot
[14:06] <CorvetteZR1> ok, that appears to be it, cis was messing with udf
[14:06] <CorvetteZR1> but still failed the check in the report
[14:06] <CorvetteZR1> thanks a lot waldi and gjolly !
[14:41] <CorvetteZR1> hi.  i have another question related to disk config with cloud-init.  in my scenario, the system boots up and disk gets attached some seconds after; but i need cloudinit to partition/format/mount that disk on boot
[14:42] <CorvetteZR1> i have this script, but it works on boot like 50% of the time ... lately less than 50%: https://dpaste.com/F4HBP7Q2N
[14:42] <CorvetteZR1> is there a better way to handle this?
[14:44] <CorvetteZR1> using cloudinit...since azure/terraform are not giving me many options.  the reason for needing this workaround is this issue in tf: https://github.com/hashicorp/terraform-provider-azurerm/issues/6117
[14:44] -ubottu:#cloud-init- Issue 6117 in hashicorp/terraform-provider-azurerm "Support for disk attachment to VMs at creation time" [Open]
[14:49] <minimal> CorvetteZR1: when it doesn't work does /var/log/cloud-init.log show anything?
[14:50] <minimal> also what exactly does not work? the partitioning? the formatting? the mounting?
[14:50] <CorvetteZR1> the partition is not created
[14:51] <minimal> and does the logfile show any errors?
[14:51] <CorvetteZR1> the mountpoint gets added, i see fstab entry...but without the partition, it doesn't mount it of course
[14:51] <CorvetteZR1> i'll check again.  booting a new image now, will see if it fails again or not
[14:52] <minimal> also which distro (and its version) and which version of cloud-init?
[14:54] <CorvetteZR1> rhel 8.6 ... cloudinit 21.2.x i think, but i'll confirm shortly...taking a while to boot now
[14:54] <CorvetteZR1> i might have broken something else :)
[14:55] <minimal> there have been quite a few azure-related changes between 21.2.x and 23.1.1, wondering if those could be a factor
[14:56] <CorvetteZR1> cloud-init 22.1-6.el8_7.2
[14:57] <minimal> if cloud-init.log does not show any useful info for the problem then it would probably be a good idea to enable debugging as then cloud-init.log will have more detailed info of what is happening
[14:57] <CorvetteZR1> ok, so yea, it booted up, again no partition
[14:58] <waldi> this sounds like a race condition. for that cloud-init would need hotplug support
[14:58] <minimal> waldi: I thought he was using the "until" entry (for bootcmd) in his user-data to handle this
[15:00] <CorvetteZR1> yea, that's the idea...but it's very intermittent
[15:00] <CorvetteZR1> so was wondering if there is a better way to add that delay?
[15:00] <minimal> have you checked cloud-init.log?
[15:00] <CorvetteZR1> yes, looking at it now...no idea what to look for
[15:00] <minimal> well any warning/errors
[15:00] <CorvetteZR1> it looks like there is another azure disk (not managed by me) that is trying to take that partition
[15:01] <minimal> take that partition? you mean that device name?
[15:02] <CorvetteZR1> yes
[15:02] <minimal> wasn't there some discussion or bug a while ago related to the order of multiple disks appearing?
[15:02] <CorvetteZR1> it's trying to mount some ntfs disk...i'm formatting mine as xfs
[15:03] <minimal> right, so how do you know what the disk you want will be called?
[15:06] <CorvetteZR1> i'm trying to paste the log, but it's too big
[15:06] <CorvetteZR1> anyone know where i can paste it?  dpaste and pastebin.com don't like it
[15:06] <CorvetteZR1> i expect the device to be /dev/sdb1
[15:11] <minimal> your user-data references /dev/disk/azure/scsi1/lun0 though, not /dev/sdb
[15:12] <minimal> you don't need to paste the full log, only the disk_setup related portions
[15:13] <CorvetteZR1> right, i'm using the lun because sometimes sdb will be something else
[15:14] <minimal> isn't the "ntfs" disk an Azure temporary disk?
[15:14] <CorvetteZR1> i don't know what to paste because this appears to be a different disk than what i expect
[15:14] <minimal> https://learn.microsoft.com/en-us/azure/virtual-machines/managed-disks-overview#temporary-disk
[15:14] <CorvetteZR1> yes, but that's not the device i'm targeting...at least not trying to
[15:15] <minimal> ok, well device names will be allocated by kernel in the order it sees the disks
[15:15] <minimal> so if it sees that disk 1st then it will be sdb and the other disk sdc
[15:16] <CorvetteZR1> ah, found something...
[15:17] <minimal> is this not related to https://learn.microsoft.com/en-us/azure/virtual-machines/linux/azure-to-guest-disk-mapping ?
[15:17] <minimal> so you can't assume lun0, you have to see the lun allocated to a disk and use that in the user-data?
[15:22] <CorvetteZR1> ok, here is the relevant log: https://dpaste.com/3WCTR68UL
[15:27] <minimal> did you read the link about disk mapping?
[15:28] <CorvetteZR1> not yet, will have a look now
[15:28] <minimal> so from the log it looks like lun0 is an "ephemeral" disk, aka Azure temporary disk, not the disk you added
[15:29] <CorvetteZR1> so i should use another lun id in my tf?  are the two colliding?
[15:29] <waldi> so he uses an old waagent. the udev rules come from there and it properly uses a special name for the azure ephemeral disk
[15:31] <CorvetteZR1> because if i look at that disk now, it's the one i added...i can format and mount it manually
[15:34] <CorvetteZR1> does the log i pasted have anything meaningful?
[15:35] <minimal> did you check the lun info in Azure for that disk as mentioned in the link I provided?
[15:35] <minimal> is it lun0?
[15:35] <CorvetteZR1> yes
[15:36] <minimal> and is there also a temporary disk?
[15:37] <CorvetteZR1> i see 2 disks...sda which is the os, and sdb which is the data disk i'm trying to format
[15:37] <minimal> so did you create this disk when you created the VM? or is it actually a temporary disk created automatically?
[15:38] <CorvetteZR1> https://dpaste.com/AEZTM5GL6
[15:38] <CorvetteZR1> yes, i create it in terraform with the vm
[15:38] <CorvetteZR1> and attach it as lun 0
[15:39] <CorvetteZR1> but it gets attached AFTER the vm boots up...which is why i need that sleep
[15:40] <minimal> well from the log you provided that device did NOT exist at the time cloud-init tried to partition it:
[15:40] <minimal> 2023-03-10 14:53:59,103 - util.py[WARNING]: Failed partitioning operation
[15:40] <minimal> Device /dev/disk/azure/scsi1/lun0 did not exist and was not created with a udevadm settle.
[15:40] <CorvetteZR1> exactly!  which is what i'm trying to solve
[15:40] <CorvetteZR1> my solution appears to be broken, and i'm asking if there is a better way to implement it
[15:41] <minimal> well perhaps as waldi mentioned there is a waagent issue?
[15:41] <CorvetteZR1> this is my script, and it's not working:  https://dpaste.com/F4HBP7Q2N
[15:41] <CorvetteZR1> is there a way to fix it that you are aware of?
[15:42] <CorvetteZR1> no idea...not referencing waagent in the terraform anywhere.  it's a new vm everytime, so shouldn't it just grab the latest agent?
[15:43] <CorvetteZR1> i'm just wondering if there is a better way to wait for the disk before cloud init runs.  better way than this:  until [ -e /dev/disk/azure/scsi1/lun0 ]; do sleep 1; done     any ideas?
[15:43] <minimal> is waagent installed as part of the RHEL disk image you are using?
[15:45] <holmanb> meena: did you get freebsd booting on lxd? you shared the kernel MR before fixing the divide-by-zero, was that all that was required?
[15:51] <CorvetteZR1> minimal, yes.  WALinuxAgent-2.7.0.6 running on redhat 8.6
[15:52] <CorvetteZR1> the image was built using latest rhel86 from azure, so it should have all the latest agents and such...during the image baking we fully update it too
[15:52] <minimal> and did you check what waldi mentioned about waagent naming/renaming temporary disks?
[15:58] <CorvetteZR1> race condition, hotplug support?  sorry, not following, is this something i can configure?
[15:59] <CorvetteZR1> udev rules?
[16:01] <CorvetteZR1> this is kinda too low level for me...was hoping it's just something that can be fixed within the cloudinit script since it does work sometimes
[16:04] <minimal> if cloud-init does not see a device (sometimes) for the disk then what do you expect cloud-init to do?
[16:12] <CorvetteZR1> like i said, i just want to know if there is a better way to delay cloudinit than what i'm doing now.  is there a better solution than doing this?  until [ -e /dev/disk/azure/scsi1/lun0 ]; do sleep 1; done
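
The `until` loop quoted above never gives up and leaves no trace of whether it ran. A bounded variant is sketched below, assuming a POSIX shell is available where the command runs. `wait_for_path` is a hypothetical helper, not something cloud-init provides, and the 60 second default is an arbitrary choice.

```shell
#!/bin/sh
# Wait until a path exists, giving up after a timeout given in seconds.
# wait_for_path is a hypothetical helper, not part of cloud-init.
wait_for_path() {
    path="$1"
    timeout="${2:-60}"   # assumed default; tune for the environment
    waited=0
    until [ -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            # A message on stderr makes the failure visible in captured output.
            echo "timed out after ${timeout}s waiting for $path" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For example, `wait_for_path /dev/disk/azure/scsi1/lun0 120` returns 0 as soon as the symlink appears and returns 1 with a message after two minutes otherwise. This only helps if the command is actually executed, so it is worth confirming that first.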
[16:13] <waldi> you can set fstab "by hand" and use x-systemd.makefs. if your systemd is new enough. but if you have an old waagent that does assign incorrect names for the resource disk, then this will break as well
[16:15] <waldi> please show "udevadm info -q all -n sda -n sdb -n sdc -n sdd", so we can see if this information is actually wrong
[16:17] <waldi> CorvetteZR1: and you can use https://gist.github.com/ to provide pretty large logs
[16:18] <CorvetteZR1> waldi, this is the waagent version...i think it's newest, no?  WALinuxAgent-2.7.0.6
[16:19] <CorvetteZR1> i'll run that command shortly, i'm using lun1 instead of lun0 now and increased sleep to 30, will see if that makes a difference
[16:20] <waldi> as debian already uses 2.7.3, this is not the newest by far
[16:22] <CorvetteZR1> that udevadm command just says "device already specified"
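
The "device already specified" reply suggests this udevadm build accepts only one `-n` per invocation. A minimal sketch that collects the same information with one call per device follows. `udev_info_each` and the `UDEVADM` override variable are hypothetical names introduced here for illustration; the `info -q all -n` arguments are the ones from the command quoted above.

```shell
#!/bin/sh
# Query udev properties for several devices, one udevadm call per device.
# udev_info_each and the UDEVADM override are hypothetical, for illustration.
udev_info_each() {
    for dev in "$@"; do
        echo "### $dev"
        # UDEVADM lets the binary be swapped out (for example in a dry run).
        ${UDEVADM:-udevadm} info -q all -n "$dev"
    done
}
```

Calling `udev_info_each sda sdb sdc sdd` prints a header line per device followed by that device's properties, which can then be pasted in one piece.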
[16:22] <CorvetteZR1> changing to lun1 didn't help
[16:23] <CorvetteZR1> i think that sleep condition is either being ignored or not working
[16:23] <CorvetteZR1> which is kinda what i was suspecting when i asked this initially.  is it the right syntax?
[16:27] <minimal> have you checked the cloud-init.log for "Running module bootcmd" ?
[16:29] <CorvetteZR1> Running module bootcmd (<module 'cloudinit.config.cc_bootcmd' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_bootcmd.py'>) with frequency always
[16:30] <minimal> and the following lines that indicate what it may have done and if it succeeded or failed?
[16:31] <CorvetteZR1> next line is start...don't see success or fail
[16:31] <CorvetteZR1> Skipping module named bootcmd, no 'bootcmd' key in configuration
[16:31] <CorvetteZR1> and then success
[16:31] <minimal> so it never saw any bootcmd section in user-data
[16:31] <minimal> so it never ran your command
[16:32] <CorvetteZR1> ok.  so is my syntax broken?
[16:32] <minimal> "no 'bootcmd' key in configuration"
[16:32] <CorvetteZR1> so is this wrong?  https://dpaste.com/F4HBP7Q2N
[16:33] <minimal> you don't need to keep pasting the same thing
[16:33] <minimal> is that actually being used as user-data?
[16:34] <CorvetteZR1> i don't know...
[16:34] <waldi> see /var/lib/cloud-init/instance/user-data* or so
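
A quick way to act on this is to look at the stored user-data for the two things the bootcmd module depends on. The sketch below assumes the file is plain cloud-config text rather than a MIME multipart or compressed payload, and it only does text matching, not YAML parsing. `check_user_data` is a hypothetical helper. waldi's path was given from memory; on many installs the stored copy is `/var/lib/cloud/instance/user-data.txt`, but that is an assumption for this particular image.

```shell
#!/bin/sh
# check_user_data is a hypothetical helper for eyeballing a user-data file.
# It only does text matching; it does not parse YAML or MIME multipart.
check_user_data() {
    file="$1"
    first_line=$(head -n 1 "$file")
    # cloud-config documents are expected to start with this exact header.
    if [ "$first_line" = "#cloud-config" ]; then
        echo "header: ok"
    else
        echo "header: first line is not #cloud-config"
    fi
    # A usable bootcmd key sits at the top level, i.e. with no indentation.
    if grep -qE '^bootcmd:' "$file"; then
        echo "bootcmd: top-level key found"
    else
        echo "bootcmd: no top-level key found"
    fi
}
```

Both lines reporting a problem would be consistent with the "no 'bootcmd' key in configuration" message in the log, though other causes are possible.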
[16:35] <CorvetteZR1> looks like someone was having this issue a couple of years ago:  https://bugzilla.redhat.com/show_bug.cgi?id=1925213
[16:35] -ubottu:#cloud-init- bugzilla.redhat.com bug 1925213 in Red Hat Enterprise Linux 8 "[Azure][RHEL-8] cloud-init.service costs 2 minutes in DSv4 VM" [Medium, Closed: Currentrelease]
[16:36] <CorvetteZR1> waldi, i see the same file in user-data
[16:36] <CorvetteZR1> with that bootcmd and all
[16:37] <CorvetteZR1> this script also installs a monitor agent, which gets installed and that works
[16:37] <CorvetteZR1> it's just this disk partition/mounting which fails
=== paride4 is now known as paride
[17:23] <meena> holmanb: the fix is committed, but we still need to fix virtio_random, which is otherwise eating one whole CPU while pulling no entropy at all
[17:26] <holmanb> meena: ack, not ideal
[17:27] <meena> we have this: https://reviews.freebsd.org/D38898 accepted but not committed
[17:27] <meena> i think the boot fix patch was also merged on the current release branch
[17:33] <meena> the virtio thing is great, cuz once you add this patch, it no longer spins on the CPU, but when you look at dtrace you can then better see what it does: nothing
[17:39] <holmanb> meena: does not producing entropy block boot in some other way?
[17:39] <holmanb> meena: or are you just concerned about it being incorrect (and insecure) because of the lack of entropy?
[17:39] <meena> nah, FreeBSD has loads of entropy providers
[17:41] <meena> it just hints at some underlying bug in either the virtio Subsystem or just virtio_random
[17:49] <holmanb> gotcha, agreed
[18:24] <minimal> when testing NoCloud with "ds=" on cmdline I've noticed that in the Grub configuration I need to enclose the ds value inside quotes as otherwise when passed to kernel it is truncated at 1st semicolon, e.g. passing ds=nocloud-net;s=http:1.2.3.4/seed results in ds=nocloud-net being passed to kernel whereas ds='nocloud-net;s=http:1.2.3.4/seed' results in the full value being passed
[18:26] <holmanb> minimal: that's a feature of grub's parsing, right?
[18:27] <holmanb> minimal: same thing happens with qemu on the commandline: https://cloudinit.readthedocs.io/en/latest/tutorial/qemu.html#launch-a-virtual-machine-with-our-user-data
[18:39] <holmanb> minimal: I suppose that should be documented
[18:44] <minimal> holmanb: not sure, it's not to do with Grub's handling of /etc/default/grub variables or grub.cfg, if I boot and edit the relevant Grub menu item then what appears there requires the quotes
[18:45] <minimal> ah, I'd missed the qemu example in the docs
[18:49] <minimal> holmanb: I'm also trying to figure out network interface setup with "ds=" and seed URL, it seems like the fallback network config is creating /e/n/i for DHCP, then the OS ifup's using that config, c-i then fetches docs from seed url (with meta-data including network), the revised network config is written to /e/n/i, then cloud-init does an "ifup", which does nothing as interface is already up, and so machine ends up with "temporary" (DHCP) IP settings
[18:49] <minimal> rather than that from seed meta-data - looks like it should be doing an "ifdown" before the 2nd "ifup"
[18:49] <minimal> am still investigating though...
[18:51] <holmanb> minimal: iirc both /etc/default/grub and grub.cfg are shell (posix?) scripts, right? I'd assume setting the url in /etc/default/grub via kernel cmdline would also fail if not double or single quoted
[18:52] <holmanb> minimal: haven't followed the codepath for eni specifically, but I think the ephemeral code is supposed to do teardown using the context manager after temp dhcp is used to grab configs
[18:52] <minimal> holmanb: yes but I "bypassed" any shell-related issues by modifying/testing it in the Grub "edit boot menu entry"
[18:54] <minimal> holmanb: it doesn't look like NoCloud uses the ephemeral DHCP stuff at all, it uses the fallback network config (at least in 23.1 which I happen to be testing)
[18:59] <holmanb> minimal: ah, you're right
[18:59] <holmanb> it just uses util.read_seeded() directly
[19:00] <minimal> holmanb: a good question is whether it should use the Ephemeral stuff
[19:00] <minimal> I think it should, rather than writing to /e/n/i twice
[19:00] <holmanb> minimal: gut feeling is that it _could_ use some of the ephemeral stuff, but shouldn't use the dhcp portion of it
[19:01] <holmanb> maybe just EphemeralIPNetwork()
[19:01] <minimal> so use the fallback dhcp instead of ephemeral dhcp? what's the difference?
[19:03] <holmanb> minimal: nevermind, my suggestion doesn't make sense
[19:03] <holmanb> minimal: I'd have to dig a bit more to understand what's going on
[19:03] <minimal> I think what is missing is that the interface is not being brought down once the seeded URLs are fetched
[19:08] <minimal> I'll need to do some testing with git master, also to check if another issue is still present (retrying of vendor-data seed url multiple times despite webserver returning 404)
[19:09] <holmanb> +1 sounds right to me without digging into it too much
[19:11] <minimal> right to retry?
[19:11] <holmanb> right to ifdown before ifup
[19:12] <minimal> ah ok
[19:12] <minimal> sorry for the adhoc questions, I'll formalise things more once I've finished investigating
[19:13] <holmanb> as for the retry, those are typically workarounds when the imds isn't up before cloud-init tries, which is annoying and shouldn't be required, but I'd be hesitant to remove, esp since NoCloud is so general so we don't have a specific cloud to shake a fist at
[19:13] <holmanb> minimal: all good
[19:15] <holmanb> minimal: to avoid the vendordata retry in the tutorial we just create an empty file
[19:15] <holmanb> minimal: https://cloudinit.readthedocs.io/en/latest/tutorial/qemu.html#define-our-vendor-data
[19:16] <minimal> yeah, it just seems "strange" that a 404 (indeed any HTTP response code) is treated as a "connectivity" failure
[19:16] <holmanb> minimal: fair, I see what you're saying now
