[17:09] <[42]> i finally got around to spending a bit more time debugging my cloud-init network issues on debian
[17:09] <[42]> so i'm trying to use netplan rendering with systemd-networkd
[17:11] <[42]> as i also have ifupdown installed i'm trying to force the renderer to netplan via "network: renderers: ['netplan']" in /etc/cloud/cloud.cfg
[17:13] <[42]> with renderer netplan set i get https://paste.debian.net/plainh/f8dfa149 - without it it renders as expected into /etc/network/interfaces.d
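For reference, the renderer override being described is a fragment like this at the top level of /etc/cloud/cloud.cfg (a sketch of the configuration under discussion, not a verbatim copy of the paste):

```yaml
# /etc/cloud/cloud.cfg -- renderer override as described above,
# placed at the root level of the file (sketch)
network:
  renderers: ['netplan']
```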
[17:14] <[42]> example config: https://paste.debian.net/plain/1178087
[17:14] <minimal> what's your network-config.yaml contents? Is it a v1 or v2 config?
[17:14] <[42]> cloud-init 20.2 on debian buster
[17:15] <[42]> fwiw both don't work
[17:15] <[42]> i'm trying with a simplified v1 config right now generated by proxmox but my custom v2 config yields the same result
[17:16] <minimal> tried turning on debug for cloud-init, might give more info on what its doing just before the error
[17:21] <[42]> https://paste.debian.net/plainh/1bcf59dc
[17:24] <minimal> where's it getting the network config from? Which DataSource are you using? doesn't look like NoCloud as no sign of it mounting fs with the YAML files
[17:24] <minimal> did you put the config straight into the cloud.cfg file?
[17:24] <[42]> it's NoCloud
[17:25] <minimal> strange, i'd expect to see lines where it runs blkid to find a FS with label 'cidata' before mounting it.
[17:25] <minimal> is this not the 1st boot for this machine?
[17:26] <[42]> 2nd iirc but i've been running `cloud-init clean` and `cloud-init --debug init`
[17:27] <[42]> https://paste.debian.net/plainh/1ee03e90 grabbed the full log files
[17:28] <minimal> right, full log has what I expected for NoCloud:
[17:28] <minimal> 2020-12-23 17:25:25,057 - DataSourceNoCloud.py[DEBUG]: Attempting to use data from /dev/sr0
[17:30] <minimal> and same config worked before, you just forced netplan render now?
[17:31] <[42]> i haven't used this with cloud-init before but if i don't force netplan renderer i get no error and it creates a file in /etc/network/interfaces.d with the expected contents
[17:34] <minimal> haven't used netplan myself either. Wasn't sure if it was supported by c-i on Debian.
[17:35] <minimal> the log line: "RuntimeError: Unknown network config version: None" shows its very confused about the value of version. Might need to try adding some debug prints to the code around the error to figure out what's going on
[17:35] <[42]> which file should i be looking at for that?
[17:36] <[42]> nvm it's in the traceback
[17:37] <minimal> checking cloudinit/distros/debian.py I do see netplan mentioned in there, so I guess it is supported on Debian
[17:37] <[42]> oh
[17:38] <[42]> added a debug statement for the netcfg section
[17:38] <[42]> so it considers {'renderers': ['netplan']} to be the entirety of my network config
[17:38] <minimal> netplan and network-config v2 are basically the same thing. Is the error the same when supplying v2 config?
[17:39] <[42]> it doesn't even see my network config
[17:39] <[42]> only takes the network block from /etc/cloud/cloud.cfg
[17:39] <[42]> which is supposed to set the renderer
[17:41] <minimal> 2020-12-23 17:25:25,058 - util.py[DEBUG]: Reading from /mnt//network-config (quiet=False)
[17:41] <minimal> 2020-12-23 17:25:25,058 - util.py[DEBUG]: Read 318 bytes from /mnt//network-config
[17:41] <minimal> that's it reading your YAML once ISO is mounted
[17:42] <[42]> cloudinit.net tries version = netcfg.get('version')
[17:42] <[42]> but netcfg is literally {'renderers': ['netplan']}
[17:42] <minimal> try looking at /run/cloud-init/instance-data.json should (from memory) contain what its read from the ISO
[17:42] <[42]> i added a debug print there
[17:43] <[42]> in def extract_physdevs(netcfg)
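The failing code path can be sketched like this (a simplification for illustration, not the actual cloudinit/net source):

```python
# Simplified sketch of the version check in cloudinit.net's
# extract_physdevs() that produces the traceback above -- not the
# real cloud-init implementation.
def extract_physdevs(netcfg):
    version = netcfg.get('version')  # None when the key is absent
    if version == 1:
        return 'parse v1 physical devices'
    if version == 2:
        return 'parse v2 ethernets'
    raise RuntimeError('Unknown network config version: %s' % version)

# With only the renderer override present there is no 'version' key,
# hence "Unknown network config version: None":
try:
    extract_physdevs({'renderers': ['netplan']})
except RuntimeError as e:
    print(e)  # Unknown network config version: None
```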
[17:43] <minimal> yeah it seems like the YAML is being somehow "lost" after being loaded
[17:46] <[42]> there's nothing network related in instance-data.json
[17:46] <minimal> strange as your log has this:
[17:46] <minimal> 2020-12-23 17:25:25,064 - handlers.py[DEBUG]: finish: init-network/search-NoCloudNet: SUCCESS: found network data from DataSourceNoCloudNet
[17:47] <minimal> and that's just logged after it writes stuff to /run/cloud-init/instance-data.json
[17:48] <[42]> ah it's in -sensitive.json
[17:49] <minimal> ah, I didn't suggest that as I thought only passwords and the like were in there
[17:49] <minimal> so its loaded it ok
[17:49] <[42]> https://paste.debian.net/plainh/6b9ef9c3
[17:49] <minimal> its somehow getting "lost" after that
[17:49] <[42]> that also just has the renderer defined
[17:49] <[42]> so it's lost before that
[17:52] <[42]> > 2020-12-23 17:51:37,447 - stages.py[INFO]: loaded network config from system_cfg
[17:53] <[42]> if i understand it correctly it's not loading from the datasource but just system config
[17:54] <minimal> just checked a VM here which uses ISO - sorry, those files don't contain the network data.
[17:54] <minimal> am comparing your logs with the ones here
[17:57] <[42]> self.datasource.network_config does contain the config in _find_networking_config
[17:59] <minimal> your logs refer to DataSourceNoCloudNet whereas mine refer to DataSourceNoCloud
[17:59] <[42]> it tries cmdline, initramfs, system_cfg
[17:59] <[42]> in this order
[17:59] <[42]> and system_cfg returns
[18:01] <[42]> it reads the order ('cmdline', 'initramfs', 'system_cfg', 'ds') from self.datasource.network_config_sources
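The search order described here behaves as a first-match-wins loop, which explains why the renderer-only block shadows the datasource config (an illustration loosely modelled on stages.py's _find_networking_config, not the real code):

```python
# Illustrative first-match-wins selection of the network config --
# a sketch, not the actual _find_networking_config implementation.
def find_networking_config(sources):
    """sources: ordered mapping of source name -> config (possibly empty)."""
    for name, cfg in sources.items():
        if cfg:  # first non-empty config wins; no merging happens
            return name, cfg
    return None, None

# Because 'system_cfg' (the network: block in /etc/cloud/cloud.cfg)
# is consulted before 'ds', a renderer-only block there shadows the
# real config supplied by the datasource:
name, cfg = find_networking_config({
    'cmdline': None,
    'initramfs': None,
    'system_cfg': {'renderers': ['netplan']},
    'ds': {'version': 2, 'ethernets': {'ens18': {'dhcp4': False}}},
})
print(name)  # system_cfg
```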
[18:02] <minimal> where does your YAML come from exactly? your logs mention sr0 but I don't see any mount/umount logged
[18:02] <[42]> i manually mounted it before already
[18:02] <minimal> its mounted via /etc/fstab?
[18:02] <[42]> nope
[18:03] <minimal> you're manually running cloud-init on a running system?
[18:03] <[42]> yes
[18:03] <[42]> `cloud-init clean && cloud-init --debug init`
[18:03] <minimal> ah ok, normally during 1st time boot I'd expect to see c-i mount the cidata FS
[18:06] <[42]> class DataSource(metaclass=abc.ABCMeta) defaults to preferring NetworkConfigSource.system_cfg over NetworkConfigSource.ds
[18:06] <[42]> and i guess it just doesn't merge network config but can only replace instead?
[18:06] <minimal> its obviously crashing as the data structure that should contain the network config is empty
[18:06] <minimal> where's that class?
[18:06] <[42]> /usr/lib/python3/dist-packages/cloudinit/sources/__init__.py
[18:09] <minimal> your /etc/cloud/cloud.cfg contains: datasource_list: [ 'NoCloud' ] ?
[18:10] <minimal> well actually I guess [ 'NoCloud', 'None' ]
[18:11] <minimal> and you are manually running the 4 init scripts in sequence?
[18:12] <[42]> i'm not manually running individual init scripts, i'm running `cloud-init init` which as far as i understand should take care of that for me?
[18:13] <minimal> there's a sequence of 4 scripts run during boot
[18:15] <[42]> :q
[18:15] <minimal> cloud-init-local -> cloud-init -> cloud-config -> cloud-final
[18:15] <[42]> datasource_list is not actually present in cloud.cfg
[18:15] <minimal> I think there's a built-in default DS list, not sure
[18:16] <minimal> but the sequence of running those scripts during boot is important. As you're on Debian I assume there will be systemd equivalent service files
[18:17] <minimal> cloud-init-local is run early to do things like bring up temporary networking (i.e. on AWS to talk to Metadata Service) and for NoCloud to fetch YAML config with network info
[18:17] <minimal> so that networking is up-and-running before cloud-init script runs next
[18:18] <minimal> https://cloudinit.readthedocs.io/en/latest/topics/boot.html
[18:24] <minimal> remember cloud-init is designed to be run *during* system boot, not manually (unless you're debugging it) ;-)
[18:25] <[42]> (which i am)
[18:26] <minimal> that link I posted shows the sequence of scripts
[18:26] <minimal> so you need to run those manually when testing
[18:26] <minimal> after first doing a "cloud-init clean"
[18:27] <[42]> i may have just found the issue
[18:27] <[42]> give me a few min to test
[18:49] <[42]> so the main issue was having the network block at root level in cloud.cfg
[18:50] <[42]> i'm still working on remaining issues but it should have been under system_info
[18:50] <[42]> i'm getting a netplan config now
[18:59] <minimal> what was the cloud.cfg issue?
[18:59] <minimal> where you specified the renderer?
[19:00] <[42]> yes
[19:00] <[42]> it shouldn't be on root level
[19:00] <[42]> because then it conflicts with the network cfg i'm passing in the ds
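In other words, the fix was to move the renderer preference under system_info, something like this (a sketch assuming otherwise-default Debian contents):

```yaml
# /etc/cloud/cloud.cfg -- corrected placement (sketch)
system_info:
  distro: debian
  network:
    renderers: ['netplan']
# A top-level "network:" block here would instead be treated as the
# network config itself, shadowing what the datasource supplies.
```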
[19:00] <[42]> network is coming up now but for some reason it's stuck for a minute here: [  *** ] A start job is running for Raise network interfaces (18s / 5min 1s)
[19:01] <[42]> oh nvm i was too fast on that
[19:01] <[42]> after rebuilding it's borked again
[19:01] <[42]> hm, last attempt it used eth0, now it's back to ens18
[19:01] <[42]> strange
[19:02] <minimal> yupe, in system_info is also the distro value - that's used to select the relevant distro-specific Python file, which defines which renderers are used automatically by a distro (i.e. debian.py has 'eni' and 'netplan' in that order)
[19:04] <minimal> if your network has mac address specified then it can/should use that to rename an interface
[19:04] <minimal> with debug turned on the cloud-init.log will show if it decides to rename
[19:04] <[42]> i was trying to avoid having to manually put the mac in there as well :)
[19:05] <minimal> you could use the Linux kernel cmdline option to disable the "new style" interface naming
[19:07] <minimal> net.ifnames=0
[19:07] <minimal> that will force the use of old style eth0, eth1, etc form
[19:08] <[42]> i don't mind the new names
[19:08] <[42]> they're still predictable
[19:08] <minimal> BTW the v1 network-config is more limited than v2, I recommend using v2
[19:08] <[42]> actually even better predictable without having to know the mac
[19:08] <[42]> yeah i'm already using v2
[19:08] <[42]> my network config requires that anyways
[19:08] <[42]> (also reason why i use netplan)
[19:08] <minimal> there are some things I can't do using v1
[19:09] <[42]> my network cfg for example has different source ips for different routes
[19:09] <minimal> v2 will work with /etc/network/interfaces
[19:10] <minimal> hmm, haven't had to look at source IPs before - what's the v2 format for specifying that? routes are just dest net, via, and metric from memory
[19:12] <[42]> e.g. {"to":"10.120.0.0/16","via":"168.119.13.244","from":"10.120.18.12"}
[19:13] <minimal> 'from' isn't mentioned in the cloud-init docs
[19:13] <[42]> and then in the default route i'd have the vms public ip
[19:13] <[42]> no it's netplan config
[19:13] <[42]> so i'm not sure if it would render correctly for ifupdown
[19:13] <[42]> even though it's not documented in ci it's passed to netplan
[19:14] <[42]> https://paste.debian.net/plainh/40e5318a
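A network-config v2 fragment along those lines might look like this (a sketch built around the route shown above; the address prefix is made up, and `from` is netplan's key rather than something documented by cloud-init at the time):

```yaml
version: 2
ethernets:
  ens18:
    addresses:
      - 10.120.18.12/32
    routes:
      - to: 10.120.0.0/16
        via: 168.119.13.244
        from: 10.120.18.12
```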
[19:14] <minimal> but you're providing a v2 network config for cloud-init to convert to netplan, not a netplan config
[19:15] <minimal> hmm, it could "break" in future - unless you submit a PR to get it officially documented
[19:15] <[42]> the entire body of the network: block is passed to netplan
[19:15] <[42]> so i'm practically giving it a netplan config
[19:17] <minimal> true, but its not guaranteed to always work - either its intended cloud-init behaviour but someone forgot to document it or its unintended behaviour and therefore could change at any time
[19:17] <[42]> https://paste.debian.net/plainh/40e5318a for some reason there's a dhcpclient running in this image
[19:18] <[42]> and that seems to be what's blocking that long
[19:18] <minimal> I have to find some time to raise a MR/PR for a minor fix in cloud-init's /etc/network/interfaces renderer for static routes
[19:20] <minimal> isn't the dhclient started by systemd? or does it handle that itself natively these days?
[19:21] <[42]> i don't have a dhclient service
[19:22] <[42]> as in dhclient.service does not exist
[19:22] <[42]> (nor does anything dhc*)
[19:24] <[42]> it seems that dhclient is started before systemd-networkd logs ens18: Configured
[19:27] <minimal> NetworkManager or something like that?
[19:28] <[42]> ifup@ens18.service
[19:28] <[42]> that's it
[19:30] <[42]> https://bugs.launchpad.net/cloud-init/+bug/1909138 should solve the documentation issue
[19:30] <ubot5`> Ubuntu bug 1909138 in cloud-init "cloud-init should officially support routes with source ip" [Undecided,New]
[19:32] <minimal> ok
[19:33] <minimal> I guess the underlying issue is whether it would be accepted as an optional entry or not, if there's no way to map it from v2 to eni and the other (RedHat/Suse) renderers
[19:34] <[42]> > Dec 23 19:03:42 debian cloud-ifupdown-helper[285]: Generated configuration for ens18
[19:35] <[42]> so that's
[19:35] <[42]> being generated via /etc/network/cloud-ifupdown-helper
[19:36] <minimal> Actually I have 2 issues regarding static routes still to raise (both for /etc/network/interfaces): (1) with both IPv4 and IPv6 static routes the renderer puts them together in the IPv6 interface definition rather than separately in the IPv4 and IPv6 sections, and (2) I want to modify the renderer to use "ip" rather than "route" if it's installed locally - using "ip" might also by chance mean that source routing could be specified
[19:36] <[42]> i guess i'll adjust my patch script to nuke /etc/udev/rules.d/75-cloud-ifupdown.rules
[19:36] <[42]> then that should be fixed
[19:36] <[42]> does it technically make a difference if the route is set in v4 vs v6 section?
[19:37] <[42]> https://bugs.launchpad.net/cloud-init/+bug/925145 would be your second issue
[19:37] <ubot5`> Ubuntu bug 925145 in Fedora "Use ip instead of ifconfig and route" [Medium,Confirmed]
[19:38] <minimal> yeah, I meant I was intending to raise an MR to actually implement it
[19:39] <[42]> ah
[19:40] <minimal> yeah mixed v4/v6 static - well in theory an ifup (or equivalent) could successfully bring up IPv4 on an interface but not IPv6, and then the IPv4 static routes would be missing... not a big issue, more of a minor irritation
[19:40] <minimal> I'm the Alpine cloud-init maintainer :-) We use /etc/network/interfaces
[19:41] <[42]> i knew they were separated blocks but i didn't know it internally has independent states for v4 and v6
[19:42] <[42]> alpine - fits your nick :D
[19:44] <minimal> with /e/n/i each "iface" section is a separate stanza and so logically self contained
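The stanza separation being described looks like this in /etc/network/interfaces terms (an illustrative sketch with made-up addresses; the `ip` commands reflect the renderer change discussed above rather than current cloud-init output):

```text
# /etc/network/interfaces (sketch): v4 and v6 are separate,
# self-contained stanzas, so each should carry its own static routes
iface eth0 inet static
    address 192.0.2.10/24
    post-up ip -4 route add 198.51.100.0/24 via 192.0.2.1

iface eth0 inet6 static
    address 2001:db8::10/64
    post-up ip -6 route add 2001:db8:1::/48 via 2001:db8::1
```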
[19:45] <[42]> > [   18.966169] cloud-init[487]: Cloud-init v. 20.2 running 'modules:final' at Wed, 23 Dec 2020 19:45:35 +0000. Up 17.67 seconds.
[19:45] <[42]> yay
[19:46] <minimal> all sorted?
[19:46] <[42]> yeah
[19:46] <minimal> building your own Debian disk images?
[19:46] <[42]> not really
[19:46] <[42]> patching prebuilt ones
[19:46] <[42]> https://cdimage.debian.org/cdimage/cloud/buster/20201214-484/debian-10-generic-amd64-20201214-484.tar.xz
[19:47] <[42]> reorder partitions, install netplan, make it use systemd-networkd and systemd-resolved
[19:47] <[42]> and a second patched image that uses btrfs instead of ext4
[19:47] <minimal> so you're creating a franken-Ubuntu in other words? ;-)
[19:48] <[42]> lol
[19:48] <[42]> still a regular debian :P
[19:48] <minimal> I build my own images, have a nice Ansible playbook for cranking out Virtual, Physical, and RPI images
[19:49] <minimal> need to find time to get back to doing the Debian ones
[19:49] <[42]> i was thinking about it but this way i already have the cloud-optimized configuration
[19:50] <[42]> and i don't have to fiddle with the debootstrap or whatever stuff and build a disk image from that
[19:50] <minimal> yeah I'm using cloud-init even for physical machines - small partition with the YAML config so just DD the image onto a box and edit the static IPs etc and boot
[19:51] <minimal> I guess it depends on how stripped down / tailored you want it to be
[19:51] <[42]> https://paste.debian.net/plainh/85d9c249
[19:54] <minimal> looks fine. Did you figure out where the 75-cloud-ifupdown.rules came from?
[19:55] <[42]> it's part of debians cloud image
[19:56] <[42]> https://salsa.debian.org/cloud-team/debian-cloud-images/-/blob/master/config_space/files/etc/udev/rules.d/75-cloud-ifupdown.rules/CLOUD
[19:57] <minimal> right. You've taken the tweak-it approach, I've taken the 1001 Flavours approach (do you want LWN Y/N? encryption Y/N? physical or virtual? etc) lol
[19:58] <[42]> i've also used preseeds in the past
[19:58] <[42]> but they're more painful to integrate when not using dhcp than cloud-init
[19:58] <minimal> me too, I've been a Debian user for a very long time :-)
[20:00] <minimal> I'm building ready-to-rock-out-of-the-box images to avoid need to run Ansible/Chef/Salt/Puppet once up for typical configuration - so locked down SSHd config, basic services etc already done.
[20:02] <[42]> i only have a very limited base config but it'd probably be a good idea to include that in my base image too now that i'm patching them anyways
[20:04] <minimal> when you start thinking about central syslogging, SSHd, Prometheus node-exporter, disabling unrequired kernel modules, etc its a neverending source of effort
[20:05] <[42]> one of the sshd_config lockdowns i like the most is limiting server keys to ed25519 and disabling all others - most login attempts already fail like this: Unable to negotiate with 112.85.42.30 port 32646: no matching host key type found.
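The lockdown being described amounts to something like this in sshd_config (a sketch; the path is the Debian default):

```text
# /etc/ssh/sshd_config (sketch): offer only the ed25519 host key
HostKey /etc/ssh/ssh_host_ed25519_key
# With no other HostKey lines present, clients that cannot negotiate
# ssh-ed25519 fail with "no matching host key type found".
```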
[20:06] <minimal> yupe, have done that already :-) Still need to raise an Alpine MR to change the init.d script to stop it creating all (missing) types at startup
[20:08] <minimal> what about installing/running rng? for hardware random (or virtio-rng for VMs)
[20:08] <[42]> --args "-device virtio-rng-pci"
[20:10] <minimal> no machines or VMs without hardware random that need haveged or jitter-entropy via rngd?
[20:10] <[42]> all of my personal vms with virtio rng passthru
[20:11] <minimal> modifying your SSHd initial startup to wait if entropy is too low before creating the host key?
[20:11] <[42]> don't think i've done that yet
[20:12] <[42]> but that should only apply on fresh machines
[20:12] <[42]> so they should normally have enough entropy
[20:12] <minimal> true, I use virtio-rng too, but as I'm using a mix of physical and VMs and some physical boxes don't have hw rng, I've catered for varying scenarios
[20:12] <[42]> i don't create vms that often
[20:12] <minimal> they *should* have enough entropy but safer to check (and wait) before creating keys
[20:13] <minimal> thank you for visiting the "1001 things to do to harden your server channel" ;-)
[20:14] <[42]> hehe
[20:16] <[42]> how do you block the ssh key generation until enough entropy is ensured?
[20:16] <minimal> cat /proc/sys/kernel/random/entropy_avail
[20:17] <minimal> and decide based on the value
[20:17] <[42]> so you modify the hostkey generation script?
[20:17] <minimal> below 1000 is not great
[20:18] <minimal> on this laptop for example I'm seeing 3800 now - HW rng + haveged + jitterentropy-rngd
[20:19] <minimal> modify the init.d script - already had to do that to stop it creating more than just ED25519 keys
[20:19] <[42]> yeah
[20:19] <[42]> well it doesn't hurt to have the extra keys if they're not referenced but yeah :D
[20:20] <[42]> cleaner without
[20:20] <minimal> init.d script (or systemd service) runs ssh-keygen so just wrap that
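A wrapper along the lines being discussed might look like this (a hypothetical sketch: the threshold, retry budget, and echoing instead of actually running ssh-keygen are illustrative choices, not from the conversation):

```sh
# Hypothetical init-script wrapper: wait (with a retry budget) for the
# kernel's entropy estimate before generating the one allowed host key.
wait_for_entropy() {
    threshold="$1" budget="$2" tries=0
    while [ "$(cat /proc/sys/kernel/random/entropy_avail)" -lt "$threshold" ]; do
        [ "$tries" -ge "$budget" ] && return 1
        tries=$((tries + 1))
        sleep 1
    done
    return 0
}

# An init.d script would use something like `wait_for_entropy 1000 120`
# and then run:
#   ssh-keygen -t ed25519 -f /etc/ssh/ssh_host_ed25519_key -N ''
wait_for_entropy 1 5 && echo "entropy ok"
```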
[20:21] <minimal> well if the other keys are there, there's a risk the sshd config, now or in future, could allow them to be used... whereas if they don't exist they can't be used.
[20:22] <[42]> doesn't sshd_config default to all supported types unless you specify one or more in which case only the specified ones are used?
[20:23] <[42]> ssh_genkeytypes in cloud-init allows specifying which host keys are generated with that
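In cloud-init config terms that's a fragment like this (a minimal sketch; ssh_deletekeys shown for completeness):

```yaml
# cloud config fragment: only generate the ed25519 host key
ssh_genkeytypes: ['ed25519']
ssh_deletekeys: true
```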
[20:24] <minimal> yeah. My point was if you unintentionally changed sshd_config (i.e. by mistake) you could end up re-enabling one of the other key types
[20:24] <[42]> good point
[20:25] <minimal> whereas if the other keys don't exist then such a mistake in config doesn't matter
[20:25] <[42]> e.g. when a new dist version of a config is shipped
[20:25] <minimal> indeed.......you end up with a default config file with lots enabled
[20:26] <minimal> though that's really a separate issue of whether you're running anything (e.g. InSpec or Ansible) to check/enforce secure configs
[20:27] <minimal> ssh_genkeytypes: yes, that's not the issue - the usual sshd init.d/systemd file at startup will still create any "missing" hostkeys
[20:28] <minimal> so even if c-i creates just the one the sshd service will create the rest
[20:28] <[42]> not in the cloud-init debian image :)
[20:28] <minimal> in yours or stock? haven't checked their cloud images for a while
[20:29] <[42]> stock
[20:29] <[42]> i didn't patch it out
[20:30] <minimal> ok, don't have sshd on this Debian box, must check another to see. Anyway as I'm working on 2+ distros its a general thing to address/ensure
[20:31] <[42]> interestingly /usr/lib/systemd/system/cloud-init.service:Wants=sshd-keygen.service
[20:31] <[42]> but sshd-keygen.service doesn't exist
[20:31] <[42]> funnily that's enough to enable sshd-keygen.service in systemctl bash-completion
[20:33] <minimal> oh, word of advice: if you specify MAC addresses in network config it's best to quote them
[20:33] <[42]> i remember that due to yaml being funny
[20:34] <[42]> but i don't remember what exactly was the issue
[20:34] <minimal> yes, I raised a MR a couple of months ago to get the c-i docs fixed regarding this and it opened up a whole can of worms...
[20:35] <[42]> :D
[20:35] <[42]> i really like yaml but some things are just... special
[20:35] <[42]> (others call me crazy for liking yaml)
[20:35] <minimal> the c-i YAML configs are YAML 1.1
[20:36] <minimal> however there's no version declared in the docs anywhere (same issue applies to netplan as its basically network-config v2)
[20:36] <[42]> nice
[20:36] <[42]> i do actually have another project where i'm using mac addresses
[20:37] <[42]> this reminds me to double check if they're quoted there :D
[20:37] <minimal> now there's a type in YAML 1.1 called Sexagesimal, which is base 60
[20:37] <[42]> and they're not :(
[20:37] <minimal> that is *not* present in YAML 1.2
[20:38] <[42]> yeah my other project uses pyyaml which is yaml 1.1
[20:38] <[42]> and *surprise* the macs are not currently quoted
[20:38] <minimal> if you have a MAC address that is completely numeric (no 'a-f' present) it can possibly be mistaken for a base-60 number
[20:39] <minimal> so for someone using cloud-init in VMs where any MAC address is "made up" (and the typical KVM prefix is all numeric) it's a potential problem - which I hit
[20:40] <minimal> to avoid the problem you need to quote any values that can be misinterpreted as Base 60
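The hazard is easy to reproduce with PyYAML (assuming PyYAML is installed; the MAC below is a typical KVM-style address whose 52:54:00 prefix is all numeric):

```python
import yaml  # PyYAML, a YAML 1.1 parser

# Unquoted, an all-numeric MAC matches YAML 1.1's sexagesimal (base-60)
# integer pattern and is silently converted to an int...
unquoted = yaml.safe_load("mac: 52:54:00:12:34:56")
print(unquoted["mac"], type(unquoted["mac"]))

# ...while the quoted form survives as the string you intended.
quoted = yaml.safe_load('mac: "52:54:00:12:34:56"')
print(quoted["mac"])  # 52:54:00:12:34:56
```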
[20:40] <[42]> yeah
[20:40] <[42]> in my other project i'll just switch to a yaml 1.2 parser
[20:40] <[42]> although almost all macs in there are from actual physical devices
[20:41] <minimal> so as part of my MR to fix the docs I also "fixed" their testcases to quote all MACs for uniformity - and some tests failed
[20:41] <[42]> lol
[20:42] <minimal> as in their test framework some of the test YAML with quoted values was read into Python, and the stubs that create the resultant netplan YAML didn't quote those values (as not required for them) - so test result vs expected result mismatch
[20:43] <[42]> nice
[20:44] <minimal> pyyaml (which c-i uses) *does* recognise the "%YAML" directive with different values, however it doesn't appear to convert data accordingly
[20:46] <minimal> all fun and games
[20:47] <minimal> I've hit several corner cases with c-i as I guess I use it in non-typical ways