[11:45] hi all! I have a question regarding user-data update. I have updated the user-data via openstack nova and now I see the new value when I check it via http://169.254.169.254/2009-04-04/user-data. Also I added [scripts_users, always] in the config file. But it still doesn't work on reboot. When I delete obj.pkl manually and reboot it works fine. But I think that's not the best way
[13:47] smoser: <3
[13:48] just noticed you merged the scaleway datasource for xenial in December
[13:48] thanks for the xmas gift :)
[14:40] niluje: \o/
[15:10] "caribou" joined us on monday :p
[19:10] @smoser, can you take a look at my latest iteration? I proposed and implemented a solution that works post xenial: https://code.launchpad.net/~dojordan/cloud-init/+git/cloud-init/+merge/334341
[19:18] I found what I think is an interesting (bug?) behavior...
[19:19] If you are using NoCloud (via an ISO attached to a VM) and you remove the ISO, the next time you boot it will decide you're using None instead of NoCloud and it will re-initialize.
[19:20] I'm not sure how to improve this behavior, but I wonder if using the DMI UUID or serial for both None and NoCloud would fix it.
[19:21] I suspect similar behavior would be shown if you denied access to the instance metadata on an EC2 instance early enough that cloud-init couldn't read it.
[19:22] ajorg: that is correct.
[19:22] yeah, we could use the dmi uuid (as some other clouds do). and keep that.
[19:23] there is also 'manual_cache_clean' that you can set
[19:23] and then you can remove the disk.
[19:23] If we use the same ID for both NoCloud and None will it detect that cloud-init has already run even though it has changed clouds?
[19:25] A note that says "please don't remove the ISO or this will happen" in the docs would also be a good start.
[20:29] blackboxsw:
[20:29] https://git.launchpad.net/~dojordan/cloud-init/commit/?id=7f23e5c4808a9c647cd4d5277625a723a58b132b
[20:29] "on ubuntu release > xenial we rely on systemd-networkd"
[20:29] do you know if that is correct ?
[20:31] reading
[20:34] smoser: for artful and bionic I know that's true... not sure about zesty. I don't recall, but I'll spin one up now
[20:34] I left nearly the same log message on dropping ifupdown Azure stuff in my commit
[20:34] " In artful and bionic ifupdown package is no longer installed in default
[20:34] cloud images. As such, Azure can't use those tools to bounce the network
[20:34] informing DDNS about hostname changes."
[20:35] blackboxsw: i just commented there
[20:35] mwhudson says that you and i are blowing smoke
[21:09] " the data there is new and will get documented more as we go.
[21:09] That function will make its way back to 16.04 in a cloud-init SRU in the next month or so." .... smoser I'll make a card for that.
[21:09] I should have shoved something into RTD when we added instance-data.json, but we can get it in soon
[21:10] I kinda figured we needed to do it with the jinja-template handling too
[21:11] @blackboxsw unrelated, but do you know where the SendHostname=true flag lives?
[21:12] smoser: yeah you pointed me at his comment there that systemd isn't handling publishing dhcp info on hostname change..... I thought I ran on azure artful and bionic and the update did happen, but I don't know what triggered it. hrm.
[21:12] I actually can't find that flag anywhere actually being set
[21:14] I'm looking at an artful azure VM now, and I checked /lib/systemd/network, /run/systemd/network, /etc/systemd/network
[21:15] dojordan: cloud-init doesn't set it, I only read the docs on it as default behavior if unset. I can see in systemd during dhcp client config refresh they check for said flag... but I can't confirm whether something sets that up.
[21:15] * blackboxsw needs to look at this again it seems.
my recall is dusty on what I originally tested/saw on artful and bionic. I'll spin up azure vms for testing now
[21:16] Gotcha, I'll enable the flag and look for a dhcp request
[21:24] what's the policy in cloud-init about restarting another service? Manual testing seems to confirm what michael said in bug 1739516 that changing the hostname doesn't appear to actually send a DHCP request
[21:24] bug 1739516 in cloud-init "networking comes up before hostname is set" [Medium,Confirmed] https://launchpad.net/bugs/1739516
[21:25] dojordan: well if we have to bounce it, we can
[21:25] dojordan: so systemd-networkd docs say the SendHostname=true needs to be in the [DHCP] section of the configs. I see that DHCP section in /run/systemd/network/10-netplan-eth0.network.... but I'm guessing we could provide a /etc/systemd/network/11-something.network file containing [DHCP]\nSendHostname=true if needed? (shot in the dark as I'm working off docs at the moment)
[21:25] as we have before
[21:25] it'd be nicer if you could bounce *that* interface
[21:25] rather than restarting the service
[21:26] i'm somewhat concerned about restarting the service during boot, but we will see.
[21:26] that's just me RTFMing though. poking at an azure instance in a min here to confirm
[21:27] smoser: rharper do we know if netplan configs allow you to post arbitrary dhcp config options, or just dhcp4: on dhcp6: on etc
[21:27] s/on/true
[21:28] * blackboxsw doesn't see anything that looks like it at http://people.canonical.com/~mtrudel/netplan/
[21:28] not sure if that's the 'canonical' source for netplan docs though
[21:28] blackboxsw: netplan does not expose arbitrary dhcp options; I think accept-ra is the exception at this time
[21:29] what was the flag we were interested in ?
[21:29] SendHostname; but I do wonder if we can figure out what's going on before we start exposing these sorts of things in the top-level yaml;
[21:30] is this some sort of dynamic dns update based on hostname from a dhclient request ? or is it doing something else with this specific mechanism
[21:30] I think the flag is simply whether or not to send the hostname when sending a dhcp request, but not actually sending dhcp every time the hostname is changed
[21:31] smoser, rharper: SendHostname is default in systemd.
[21:31] it's just sent when the client attempts to obtain (or renew) a lease IIUC
[21:32] (so is UseHostname=, which is meant to use the hostname that DHCP hands out)
[21:32] gotcha, so we won't be able to rely on that to force a new dhcp
[21:32] what is this about?
[21:33] basically we (azure) need a way to force a dhcp request to acquire a new IP in certain circumstances
[21:33] we used to use ifdown/ifup but post xenial we no longer have those binaries installed by default
[21:34] dojordan: wouldn't the "right way" be to have the DHCP server send a DHCPNAK?
[21:34] assuming that works "out of line"
[21:37] or the other right way would be to use leases less than 2^32 - 1 seconds... but unfortunately in our stack the dhcp server is not aware of when to send the DHCPNAK
[21:37] right
[21:38] short leases obviate the issue of "we need to force the client to re-configure now", at the cost of more packets happening on the network
[21:38] whereas DHCPNAK or doing stuff on the client means there needs to be code that decides it's time to reconfigure outside of the lease expiry time
[21:40] is the client or infrastructure more likely to know what set of circumstances the client should do DHCP again?
[21:40] client
[21:41] (my guess is the server should always be authoritative, but I don't know of your use case)
[21:41] ah, interesting
[21:41] i guess we could just run dhclient -r?
[21:41] i can elaborate on the scenario:
[21:41] dojordan: dhclient isn't what's doing DHCP when using systemd-networkd.
[21:41] oh right... and systemd-networkd doesn't expose the same api?
[21:42] I don't think it exposes that, but I'm not sure :)
[21:43] thinking outside the box, can we delete the dhcp lease files? does that retrigger it?
[21:43] dojordan: this might become easier (and more easily use dhclient sandboxed) if we get to running in local time.
[21:44] sorry didn't quite catch that
[21:44] hlelo
[21:44] oh. hm.
[21:44] *hello
[21:44] wait.
[21:44] how does this project relate to CoreOS cloud-init and RancherOS cloud-init?
[21:45] now i'm confused.
[21:45] dojordan: I don't think it would trigger it, but you might be able to just restart systemd-networkd without adverse consequences
[21:45] nazarewk: no. coreos is not here.
[21:45] * smoser googles rancheros
[21:45] smoser: i'm researching cloud operating systems
[21:45] that was my thought too, just wanted to make sure it would be safe
[21:45] dojordan: we run azure datasource at local
[21:46] meaning proper system configured networking isn't up yet.
[21:46] after going through coreos (deprecated cloud-init), RancherOS (forked from CoreOS cloud-init) i stumbled upon configuring Project Atomic with cloud-init
[21:46] now i'm confused.
[21:46] and saw that is some wide standard
[21:46] smoser: again?
[21:46] ;)
[21:46] any idea how those 3 (or 2 i guess) relate to each other?
[21:47] well, project atomic to my knowledge uses this cloud-init
[21:47] (as contributors from redhat have added that support)
[21:48] i already know that, i am more interested in how cloud-init came to be
[21:48] i can't find any info on where it came from
[21:49] ok looks like i found this
[21:49] https://github.com/coreos/coreos-cloudinit#configuration-with-cloud-config
[21:51] dojordan: so, I guess what you would need to do depends on what the circumstances are in which a client needs to restart DHCP, if it's because new information was received from the datasource, maybe the best is for cloud-init to restart systemd-networkd when it finds out, if systemd-networkd is even running
[21:52] OTOH, if it's potentially hours after boot, then some agent would need to do it, if it's safe.
[21:52] so the reason this matters is if we hit our instance metadata service with a stale IP, it doesn't recognize it and the client will throw an exception
[21:52] so we catch it and bounce the nic (on xenial)
[21:53] so this isn't actually doing any real DHCP?
[21:53] otherwise you'd have the lease times to tell you when to renew, so you theoretically never have a stale IP
[21:53] our leases are infinite
[21:54] but it is doing real DHCP in the sense we are getting a new ip,dns,subnet, etc
[21:54] the reason it could be stale is our platform is moving a vm from one vnet to another
[21:54] yes, but I mean even if the IP belongs to you forever, a short lease will have you ping the DHCP server periodically and possibly get new information
[21:54] ah
[21:57] @smoser, not entirely sure when systemd-networkd is running, but I have confirmed in testing that in my current PR we do successfully hit the instance metadata server in our data source. So I guess some parts of the networking stack are setup?
[22:01] dojordan: right.
[22:01] thats what is confusing to me
[22:01] i think you are actually ending up "bouncing" the network that wasnt up yet
[22:02] but not sure. i'm looking at that.
[22:05] what stage is cloud-init at when it's time to bounce? has cloud-init net already run? is that what we're trying to determine ?
[22:05] yeah, this is during _get_data within the AzureDataSource, so if it is in local then network has yet to run
[22:07] then it won't yet be up; yeah
[22:07] right
[22:07] and then the resume *shouldn't* need to kick dhcp since it never leased anything anyhow; except how are we polling metadata service to know when it's time to come up again ?
[22:08] right
[22:08] i think it came up because we bounced it
[22:09] so it ran 'ifdown; ifup' and it came up. since the ifdown didn't do anything
[22:09] thats my hypothesis
[22:09] on xenial
[22:09] i think you're right
[22:10] so theoretically it wouldn't work > xenial as there is no ifup
[22:10] w.r.t the systemd path; if we're in the same boat; I think an ip link set down on the interface that needs to bounce may be enough for systemd-networkd to allow a restart to kick DHCP again
[22:10] that's something testable
[22:11] yes i am actually testing that later this week or early next
[22:11] waiting for the networking team...
[22:11] but they claimed changing the link state of the interface (removing and re adding the nic from our hypervisor) didn't trigger dhcp in the vm.
[22:12] dojordan: since we're running at local time frame, we dont have to "bounce"
[22:12] and we can use something more like the ec2 datasource
[22:12] sounds good
[22:12] dojordan: that makes sense as it never DHCP'd in the first place;
[22:12] which does a dhclient, gets its network info, then we can use the 'EphemeralIPv4Network' context manager
[22:12] to hit the datasource, and timeout and do it all again
[22:13] beauty
[22:13] i like it
[22:13] but what remains is, while it's sleeping, how can it know when it's time to wake-up if we don't have a network interface up to poll a URL ?
[22:13] we bring up a temporary one
[22:13] rharper: right now, if you're looking at his branch it runs
[22:13] self.bounce_network_with_azure_hostname
[22:13] in _reprovision
[22:13] and that does
[22:13] (just to confirm) the ec2 ephemeralipv4network hits the dhcp server?
[22:13] sh -c 'ifdown eth0; ifup eth0'
[22:13] which ends up bringing it up the first time through
[22:14] and relying on "stale" networking configuration to do so
[22:14] dojordan: look at the ec2 code you should be able to see.
[22:14] but, the ifup eth0 isn't sufficient in local; there's no eth0 network config to indicate it needs to DHCP
[22:14] at least in > xenial
[22:15] or at least I'm not sure we write out the network config for it starts "sleep waiting"
[22:15] this hunk in _get_data
[22:15] http://paste.ubuntu.com/26368525/
[22:15] we'd just do
[22:16] oh, EC2 does, but not in Azure at this time
[22:16] with net.EphemeralIPv4Network(**net_params):
[22:16] right
[22:16] so i'm saying something like:
[22:16] while True:
[22:17]     dhcp_leases = dhcp.maybe_perform_dhcp_discovery(self.fallback_interface)
[22:17]     if not dhcp_leases:
[22:17]         something_bad
[22:17]     with net.Ephem....
[22:17]         hit MD service
[22:17]         on happy path break
[22:18] that is terrible i realize, but i think maybe explains it.
[22:18] right; the net.Eph dhcp stuff; how did we work around requiring dhclient ? or do we keep that present for Artful + Bionic ?
[22:19] we require dhclient
[22:20] nazarewk: I believe cloud-init originated here, with these folks, originally a Canonical internal product that garnered broad adoption and was picked up and supported by other OSes and clouds. :) https://www.podcastinit.com/cloud-init-with-scott-moser-episode-126/
[22:20] :)
[22:25] smoser: per the suggestion about the poll_imds looping.... right we should be able to call maybe_perform_dhcp_discovery which attempts dhclient queries using the fallback nic.
and returns an empty list if dhclient doesn't exist or can't get an ip address
[22:26] your pastebin explains what we do currently in ec2 which could apply here in azure too https://paste.ubuntu.com/26368525/
[22:27] i hope dojordan follows. i think blackboxsw and rharper do, but i have to run.
[22:27] i'll look in tomorrow on it a bit dojordan
[22:27] dojordan: (just to confirm) the ec2 ephemeralipv4network hits the dhcp server?.... Ephemeral ipv4 network uses the response from maybe_perform_dhcp_discovery
[22:30] you provide it with a params interface, ip, prefix_or_mask broadcast and router all of which maybe_perform_dhcp_discovery returns
[22:30] you provide it with the params (interface, ip, prefix_or_mask broadcast and router) all of which maybe_perform_dhcp_discovery returns
[22:33] the specific use (which could be nearly the same for Azure) is here https://git.launchpad.net/cloud-init/tree/cloudinit/sources/DataSourceEc2.py?id=78372f16d2711812793196aa8003ad51693ca472#n105
[22:34] though within the with net.EphemeralIPv4Network context you could do your polling of IMDS
[22:35] as that EphemeralIPv4Network context manager serves only to temporarily bring up a network interface to allow you to hit an external URL. Then it tears that interface back down
[22:37] so it basically performs whatever static network setup is required (including routes) on a given interface for your context and then any setup it needed to perform it tears down upon __exit__
[22:39] yeah I will play around with it this afternoon, I might make a modification to add retry support though
[22:39] I'm +1 on retries, I like those more than while Trues :)
[22:51] haha yeah a little scary typing that
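
A minimal sketch of the bounded-retry idea from the end of this discussion, as an alternative to `while True`. It does not import cloud-init: `ephemeral_network`, `poll_with_retries`, `discover` and `fetch` are hypothetical stand-ins for the temporary-network context manager, the DHCP discovery call and the metadata request, and the lease dict shape (`{"ip": ...}`) is an assumption made only for illustration.

```python
import time
from contextlib import contextmanager

events = []  # records bring-up/tear-down order so the flow can be inspected


@contextmanager
def ephemeral_network(interface, ip):
    """Hypothetical stand-in for a temporary network context manager."""
    events.append(f"up {interface} {ip}")
    try:
        yield
    finally:
        # Teardown runs even if the body raises, like __exit__ would.
        events.append(f"down {interface}")


def poll_with_retries(discover, fetch, interface, max_tries=5,
                      delay=1, sleep=time.sleep):
    """Bounded loop: DHCP, poll once on a temporary network, tear down, repeat."""
    for attempt in range(1, max_tries + 1):
        leases = discover(interface)  # assumed to return [] on failure
        if leases:
            lease = leases[-1]
            with ephemeral_network(interface, lease["ip"]):
                result = fetch()  # assumed to return None when not ready
            if result is not None:
                return result, attempt
        sleep(delay)
    raise RuntimeError(f"metadata not available after {max_tries} tries")


# Fake DHCP and metadata service: no lease on try 1, no data on try 2.
lease_answers = iter([[], [{"ip": "10.0.0.4"}], [{"ip": "10.0.0.5"}]])
metadata_answers = iter([None, "provisioned"])

result, tries = poll_with_retries(
    discover=lambda iface: next(lease_answers),
    fetch=lambda: next(metadata_answers),
    interface="eth0",
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(result, tries)
print(events)
```

Passing `sleep` and the two callables in as parameters keeps the loop testable without a real NIC or metadata service; a real datasource would presumably swap in its own discovery and HTTP calls.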