[06:12] smoser: I've added some comments on https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1603222; your thoughts would be much appreciated. :)
[13:48] Odd_Bloke, so i think that at this point more of the stuff is in cloud-init, right ?
[13:48] i think what you're saying is that cloud-init in xenial does not depend on udev rules from walinux agent
[13:49] so i'd prefer if we fix something to have that be consistent between trusty -> xenial +
[13:58] smoser: The problem is orthogonal to udev rules, really.
[13:58] smoser: One thing I wasn't sure about is how introducing the udev rules to existing instances would affect them.
[13:58] But both sets would apply, so I think it would be fine...
[13:59] right. they both should set up symlinks.
[13:59] why is it orthogonal?
=== rangerpbzzzz is now known as rangerpb
[15:05] Because the problem is that they don't use the udev rules properly; it doesn't matter which set of udev rules are there.
[15:06] But, yeah, I'm happy to backport the udev rules if that seems safe.
[15:06] I was just thinking in terms of minimising the backport diff.
[16:44] smoser: will test your fix for bond right now
[16:44] however I believe 3.2) described in bug still isn't fixed.
[18:14] mgagne, looking
[18:18] in a meeting but so far, only the auto stanza on bond0 looks to be missing
[18:19] mgagne, i'm not sure it's required.
[18:19] it is as far as I know, I added it and it worked
[18:19] i'll compare the network config against other stuff we have examples of (curtin vmtest) where we actually verify
[18:19] will test after my meeting
[18:19] thanks mgagne
[18:52] rharper, look at https://bugs.launchpad.net/cloud-init/+bug/1605749 and have a think.
[18:53] smoser: is that your bonding fix ?
[18:53] yeah; I read the branch earlier, my initial thought was that we'd generically want to "Resolve links" at render time
[18:53] rather than be bond specific
[18:55] the mechanism to use link['id'] as the interface name key is common for all types; that is in the network state we only have link/ids and then at runtime we'd do a replacement of link-id with get_name_from_macaddr(link_to_mac[link-id]) sort of lookup
[18:55] rharper, well, do other things have referenced links ?
[18:55] any of the combined types
[18:55] bridges, bonds, and vlans
[18:56] rharper, well, if we want, it's easy enough now with the same generic mechanism
[18:57] w.r.t auto on bond0; that is needed; there's a bug related to not getting auto on stanzas without network config (the subnet has the 'control' value and default)
[18:57] so a bond with no subnets, but then vlans on top misses the 'auto bondX' line; I've found this in curtin since Friday, working on a fix there;
[18:57] ok.
[18:57] then the 2 things are
[18:57] a.) auto bond [necessary]
[18:58] b.) resolve links generically
[18:58] where b is not strictly necessary at this time...
[18:58] can you give another example of where it is?
[18:58] a) bonds default to auto, unless a subnet with 'control' says otherwise
[18:59] for b) just replace the vlan_raw_interface value from bond.X to interface0
[18:59] the underlying device, if it's type 'physical', will refer to another "links" element which may not have a 'name' key set
[19:00] think, eth0.123; the eth0 would be the link.id; and that may not be the name of the device, like bond_interfaces contains link.ids
[19:06] where did you come up with the string 'vlan_raw_interface' ?
[19:06] i dont see that anywhere.
[19:06] that's the eni name
[19:07] sorry, vlan_link is the network_data.json field
[19:07] in cloudinit/sources/helpers/openstack.py:593
[19:08] we generate a nic name based on the vlan_link and vlan_id, which is OK since vlan is a constructed interface, but the vlan_link points to the underlying device (this is a link.id) and needs to be replaced
[19:12] rharper, ok. i can add a test for vlan on that also.
[19:12] if you need a config yaml, I have one
[19:14] http://paste.ubuntu.com/23059320/
[19:15] i think that is what you were saying
[19:15] that shows the error.
[19:15] in 'eth0.602', does 'eth0' actually matter ?
[19:17] for our internal state no
[19:17] but it's typical shorthand for underlying device . vlan_id
[19:17] so, the vlan scripts in ifupdown split on . and call vconfig with the first segment and pass the second as the vlan_id
[19:18] you can instead say iface vlan1 and underneath specify the vlan_raw_device (eth0), and vlan_id (123)
[19:23] well that sucks
[19:25] harder to get at that
[19:27] no, we just need to tag the elements of state that use a link.id
[19:28] and when we're rendering the interface, do an id lookup by mac
[19:28] we already repeat the vlan_id as an interface attribute
[19:28] we can set the vlan interface name to vlan{index} instead of {link_id}.{vlan_id}
[19:28] which is what we're doing now;
[19:46] well, https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+ref/bond_name is updated to now have a vlan test case that i think renders correctly
[19:46] i dont love the mechanism, but seems to work
[19:51] looking
[19:53] still missing the auto up change, assuming you're looking at that.
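The "resolve link ids at render time" idea from 18:55 and 19:27-19:28 could be sketched roughly as below. This is only an illustration, not the code in the bond_name branch: resolve_link_names is a hypothetical helper, the link dictionaries assume the network_data.json fields id, type, ethernet_mac_address, vlan_link and vlan_id, every link type other than bond and vlan is assumed to be a physical device, and the bond{index} naming is an assumption.

```python
# Hypothetical helper: map network_data.json link ids to interface names.
CONSTRUCTED_TYPES = ("bond", "vlan")  # assumption: all other types are physical


def resolve_link_names(links, mac_to_name):
    """Return {link_id: interface_name} for every entry in links.

    mac_to_name maps lower-case MAC addresses to the names the running
    system gave the physical devices (for example from /sys/class/net).
    """
    id_to_name = {}
    # Physical devices: only these are looked up by MAC address.
    for link in links:
        if link["type"] not in CONSTRUCTED_TYPES:
            mac = link["ethernet_mac_address"].lower()
            id_to_name[link["id"]] = mac_to_name[mac]
    # Bonds are constructed devices, so their names are arbitrary.
    bonds = [link for link in links if link["type"] == "bond"]
    for index, link in enumerate(bonds):
        id_to_name[link["id"]] = "bond%d" % index
    # VLANs refer to their raw device by link id, which is resolved here.
    for link in links:
        if link["type"] == "vlan":
            raw = id_to_name[link["vlan_link"]]
            id_to_name[link["id"]] = "%s.%s" % (raw, link["vlan_id"])
    return id_to_name


def resolve_members(link_ids, id_to_name):
    """Translate a list of link ids (such as bond members) into names."""
    return [id_to_name[link_id] for link_id in link_ids]
```

With this shape, a vlan on top of a bond renders as bond0.602 even though the config only ever mentions link ids, and the MAC lookup is never attempted for the bond or the vlan itself.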
[19:54] yeah, that's a one-liner in eni.py
[19:55] basically in the case that we don't have a subnet configured, if we have 'bond-masters' or 'bond-slaves' we emit an auto $iface
[19:55] I need to think a bit more
[19:55] as it's a general issue for interfaces that don't include a subnet config since 'control' is a property of a subnet
[19:56] yeah.
[19:56] smoser: so I patched on my side to get auto added. It now fails as described in the bug description in 3.1)
[19:57] mgagne, well, how are you running that ?
[19:57] smoser: running what?
[19:58] smoser: install cloud-init from repo, apply patches found in your branch, build image, upload image, boot
[20:00] http://imgur.com/yA3eslq
[20:09] I am wondering if I can use cloud-init in my home lab running with vmware fusion. Is there a metadata server I could run locally that's fairly simple/lightweight?
[20:14] rocket, there is some vmware support, but i'm not familiar enough with their product line to know if 'fusion' supports it.
[20:15] I was just hoping to start up a pythonic based webserver or something .. or should I be looking at creating my own that produces the yaml files I am seeing in documentation?
[20:16] I just didn't know what was required for a really simple setup
[20:16] I *think* I just need random hostnames and point that towards a saltmaster etc..
[20:19] rocket, there's two things that provide difficulty i think
[20:19] a.) you need data per instance-id ... each instance needs to somehow get different data
[20:19] b.) you have to tell cloud-init where the metadata service is.
[20:20] you could mock the ec2 one by mocking 169.254.169.254 and plumbing that network in
[20:20] smoser: 3.1 is another run-time variant; bonds inherit mac address of the slaves ; when we're doing the lookup, we can filter by type (we only need to look up names by macs of 'physical' devices)
[20:23] rharper, right. but he's getting a stack trace there
[20:23] which is interesting and i can't reproduce.
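For reference, the 'auto' rule described at 19:55 (emit an auto line for bond stanzas even when no subnet is configured) might look something like the sketch below. It is not the actual one-liner in eni.py: needs_auto and render_stanza are hypothetical names, the interface is assumed to be a plain dict, and the key names in BOND_KEYS and the default 'control' value of 'auto' are assumptions.

```python
# Assumed key names marking an interface as part of a bond.
BOND_KEYS = ("bond-master", "bond-slaves")


def needs_auto(iface):
    """Decide whether an 'auto <name>' line should precede the stanza."""
    subnets = iface.get("subnets") or []
    if subnets:
        # 'control' is a property of the subnet; assume it defaults to auto.
        return any(s.get("control", "auto") == "auto" for s in subnets)
    # No subnet at all: bond masters and members still need to come up.
    return any(key in iface for key in BOND_KEYS)


def render_stanza(iface):
    """Render a minimal ENI stanza for an interface without addresses."""
    lines = []
    if needs_auto(iface):
        lines.append("auto %s" % iface["name"])
    lines.append("iface %s inet manual" % iface["name"])
    return "\n".join(lines)
```

The point of splitting needs_auto out is that the same check can cover the general case raised at 19:55, where any interface without a subnet has nowhere to carry a 'control' value.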
[20:23] smoser: we are not yet at 3.2, we are still stuck at 3.1
[20:24] smoser: how are you testing? do you have access to an openstack cloud?
[20:24] python3 NotADirectory and FileNotFound are obnoxious
[20:25] smoser: we need to not try to look up the bond mac address; it can be called *whatever* bond{index} ; only the bond_interfaces lists of link_ids need to be resolved (and actually) we need to check the type of the links to see if they're physical, otherwise we can ignore the mac lookup
[20:25] mgagne, well, i do, yes, but was not focused there yet, and i dont have an openstack cloud that would ask me to bind like that
[20:25] this line is problematic: https://git.launchpad.net/cloud-init/tree/cloudinit/net/__init__.py#n99
[20:26] it makes the assumption that all devices found in this folder are a real device and file is a directory
[20:26] this is why this line fails: https://git.launchpad.net/cloud-init/tree/cloudinit/net/__init__.py#n350
[20:26] but could be related to python3 as you said
[20:31] mgagne, well, it doesnt really make the assumption
[20:31] it accepts an OSError and an IOError and does the right thing
[20:32] I'm not sure why one would list those devices and only filter them later
[20:42] rharper: I'm not sure why cloud-init tries to configure the network a second time. The 2nd time it is run, the slaves' mac addresses might be updated and no longer match the ones found in config-drive.
[20:42] smoser: it does the right thing (read_sys_net) however, the interfaces_by_mac does not like 'bonding_masters' file and throws exception; this prevents creating the mac_to_ifname; we can handle the NotADirectoryError and continue
[20:43] rharper, we are trying to handle that.
[20:43] that's the thing
[20:43] NotADirectoryError is an OSError
[20:43] interesting
[20:43] when I test it, it's not handled
[20:43] but apparently does not have errno = 2
[20:43] where do you test this ?
[20:43] xenial vm
[20:43] with bond added
[20:44] if you're on diglett you can ssh into the vm
[20:44] smoser: ssh ubuntu@192.168.122.178
[20:45] this is not with your branch, so if you've updated it's just what's in xenial (cloud-init level)
[20:45] current code is testing for ENOENT, not ENOTDIR
[20:45] probably need ENOTDIR
[20:45] yeah.
[20:45] http://paste.ubuntu.com/23059495/
[20:48] rharper, http://paste.ubuntu.com/23059501/
[20:49] obnoxious
[20:49] so open("/sys/class/net/bonding_masters/address") throws a NotADirectoryError with an errno of 20
[20:49] try with an existing file and append a filename to it and try opening it
[20:50] mgagne, right. that's it. ok. thank you
[20:51] yeah; it's the full path that included not-a-dir-element
[20:52] http://paste.ubuntu.com/23059511/
[20:52] y
[20:53] there is a question in my mind if there could be 2 nics with the same address
[20:54] wow
[20:54] $ cat /sys/class/net/bond0/address
[20:54] 52:54:00:f2:5a:35
[20:54] $ cat /sys/class/net/ens3/address
[20:54] 52:54:00:b2:5a:27
[20:55] $ cat /sys/class/net/ens5/address
[20:55] cat: /sys/class/net/ens5/address: No such file or directory
[20:55] so the answer is that you can't have 2, but if this were to run after a bond were set up, we'd get the bond as the device with that mac
[20:55] which is odd
[20:55] root@localhost:/sys/class/net# cat bond0/address
[20:55] 0c:c4:7a:34:6e:3c
[20:55] root@localhost:/sys/class/net# cat eno1/address
[20:55] 0c:c4:7a:34:6e:3c
[20:55] root@localhost:/sys/class/net# cat eno2/address
[20:55] 0c:c4:7a:34:6e:3c
[20:56] as I mentioned before; for naming, we can ignore non-physical devices;
[20:57] bonds/bridges/vlans have various configs that inherit mac of underlying devices; we really want to know the physical nic and mac pairing
[20:57] that is odd.
[20:58] well, rharper we dont *always* want the physical nic. we could put a bond on two vlans
[20:58] i think though, that the code i have in that tree is actually right.
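The ENOENT versus ENOTDIR point from 20:45-20:51 can be reproduced with a sketch like the one below. It is not the cloud-init implementation: read_address is a hypothetical helper, interfaces_by_mac only borrows the name used in the discussion, and the sys_net parameter is added here so the functions can be pointed at a scratch directory.

```python
import errno
import os


def read_address(devname, sys_net="/sys/class/net"):
    """Return the MAC in <sys_net>/<devname>/address, or None if unreadable."""
    path = os.path.join(sys_net, devname, "address")
    try:
        with open(path) as fp:
            return fp.read().strip().lower()
    except (OSError, IOError) as e:
        # ENOENT: the entry has no 'address' file.
        # ENOTDIR: the entry is a plain file such as 'bonding_masters', so
        # a path element of <entry>/address is not a directory.
        if e.errno in (errno.ENOENT, errno.ENOTDIR):
            return None
        raise


def interfaces_by_mac(sys_net="/sys/class/net"):
    """Build {mac: name}, skipping entries whose address cannot be read."""
    ret = {}
    for name in os.listdir(sys_net):
        mac = read_address(name, sys_net)
        if mac:
            ret[mac] = name
    return ret
```

Checking e.errno against both constants, rather than catching NotADirectoryError by name, keeps the same code path usable on interpreters where only OSError and IOError exist.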
[20:58] vlan names are arbitrary
[20:58] as are bond names
[20:58] sure. but if we're looking to get a mapping of mac to interface name, then the path is valid.
[20:59] that is, we can always set them
[20:59] but i think the code is doing the right thing at this point.
[20:59] as it looks through the links and sets up the name, and only overwrites it with what it found in /sys if it does not yet have a mac from the links table.
[20:59] you can find the original mac address in /sys/class/net//bonding_slave/perm_hwaddr
[20:59] to configure the bond or vlan correctly, we only need to lookup link_ids of physical devices;
[21:00] and emit those names in the config
[21:00] oh wait. it doesnt do that. but it should.
[21:02] yeah, it's ok as it is right now i think.
[21:03] mgagne, so i think what i just pushed will fix all but the 'auto'
[21:03] i have to run now.
[21:04] https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+ref/bond_name ?
[21:04] right
[21:04] no, this won't fix the auto
[21:04] the one 2dee860 should fix the NotADirectory
[21:04] right
[21:04] it should fix all *but* the auto
[21:04] http://paste.openstack.org/show/557628/
[21:05] and the point 3.2)
[21:05] mgagne: yeah
[21:05] I'm stuck at 3.2 since last week
[21:05] everything else has been fixed on my side since
[21:05] maybe not in the way you would have done it
[21:05] when you say later, do you mean subsequent boots ?
[21:06] let me check
[21:06] "3.2) Once 3.1) is fixed, configuration fails again later"
[21:06] Once 3.1 is fixed, cloud-init will fail again at a different place further down
[21:06] will reword the description
[21:07] i dont understand 3.2 failure though
[21:07] as when this runs, the bond should not be set up yet
[21:07] cloud-init looks to be rerun twice. Since bond is already configured at this time, mac isn't found and it crashes
[21:07] it is
[21:08] it should not run the network config step on the second time through if it successfully ran it on the first.
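One way to keep the MAC lookup stable after a bond is already up, following the perm_hwaddr hint at 20:59 and the "ignore non-physical devices" remark at 20:56, is sketched below. This is an assumption-laden illustration rather than what the branch does: permanent_mac, is_bond and physical_macs are hypothetical names, and treating a 'bonding' subdirectory as the marker of a bond master is an assumption about the sysfs layout.

```python
import os


def permanent_mac(devname, sys_net="/sys/class/net"):
    """Prefer the pre-enslavement MAC of a bond member over 'address'."""
    for rel in (os.path.join("bonding_slave", "perm_hwaddr"), "address"):
        try:
            with open(os.path.join(sys_net, devname, rel)) as fp:
                return fp.read().strip().lower()
        except OSError:
            continue  # file missing or unreadable: try the next candidate
    return None


def is_bond(devname, sys_net="/sys/class/net"):
    """Assumption: bond masters expose a 'bonding' directory in sysfs."""
    return os.path.isdir(os.path.join(sys_net, devname, "bonding"))


def physical_macs(sys_net="/sys/class/net"):
    """Build {mac: name}, skipping plain files and bond masters."""
    ret = {}
    for name in sorted(os.listdir(sys_net)):
        if not os.path.isdir(os.path.join(sys_net, name)):
            continue  # e.g. the 'bonding_masters' file
        if is_bond(name, sys_net):
            continue  # the bond inherits a member's MAC, so ignore it
        mac = permanent_mac(name, sys_net)
        if mac:
            ret[mac] = name
    return ret
```

With the eno1/eno2/bond0 situation shown at 20:55, where all three report the same 'address', this would still map each original MAC to its own physical nic, which is what a second pass over the config-drive data would need.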
[21:08] ie, cloud-init local should set up networking before anything is allowed to come up
[21:08] and then cloud-init should come up and see it's not first instance boot (cloud-init local was) and not run that section
[21:08] well, I didn't see failure in dsmode=local and cloud-init still tries to (re)configure network anyway
[21:09] I can rebuild with latest patches and pull out logs
[21:09] mgagne, sure. i do agree if it ran twice the second one could well fail
[21:09] i do have to run now.
[21:10] the ENOTDIR error is caused by this second run
[21:13] we should debug why you get a second run, but certainly the 'bonding_masters' file is only added after a bond is configured
[21:19] image is rebuilding with all patches, gonna take a while before it builds and then a baremetal is booted
[21:20] maybe I can pull the logs from the current baremetal, it ran twice anyway
[21:24] rharper: http://paste.openstack.org/show/GHjpr6jMq1uxoCtRl922/
[21:24] k
[21:25] "Execution continuing, no previous run detected that would allow us to stop early."
[21:25] is config provided in two places? or ConfigDrive only?
[21:26] we only have configdrive in place
[21:26] "no-net is written by upstart cloud-init-nonet when network failed" I'm not sure xenial still has upstart
[21:27] the file /var/lib/cloud/data/no-net does not exist on the machine
[21:28] ok
[21:28] yeah, when init runs, it reads the data; that somehow is definitely re-running net config
[21:31] I don't understand the logic here... if no-net exists, it's because network config failed. The logic makes it so network config in dsmode=net will STOP if network previously failed. in our case, it sure didn't fail... so it will try to configure it again? o_O
[21:32] I only found occurrences of no-net in cmd/main.py and upstart config file so I'm not sure what to think here
[21:32] so, dsmode=net, IIUC, is for things like AWS where the datasource is over the network
[21:33] in that case, cloud-init has to bring up something (fallback networking, dhcp on an interface) and attempt to find a datasource over the networking
[21:33] after acquiring a datasource, networking config may be included, in which case, we'd need to update the network config (override the fallback generated one)
[21:35] I have those ds loaded: NoCloud, ConfigDrive, OVF, MAAS
[21:35] Should I remove all but ConfigDrive?
[21:35] in the ConfigDrive case, it's not via network but local file
[21:35] you can configure them off but it will only select one
[21:37] so I think DataSourceConfigDrive supports both local and net.
[21:38] and there is no flag to prevent it from running twice. But it could also be by design but didn't plan for bonding support and all its side effects
[21:38] I think I can see it applied twice; it was only with bonding configs that we trip up
[21:38] yea
[21:38] I think there was a lot of logic that didn't account for bonding being configured and enabled at that time
[21:39] like mac address changing
[21:39] yeah; I don't think we want to apply whole-sale net config twice;
[21:39] but this *could* be a valid scenario (I don't know yet how)
[21:40] like boot with APIPA address in dsmode=local and later get an IP with metadata service or whatever.
[21:41] I just don't want to bulldozer my way to make bonding work and break use cases I didn't know existed
[21:41] yeah, so init [net] mode does re-read the json data and attempts to create network_state which invokes the openstack conversion, which fails when the initial state of the system is already configured
[21:42] I don't think you're breaking anything; smoser or harlowja will have to help me understand why they get parsed twice
[21:42] only because it can't find the link mac address (after ENOTDIR is fixed of course)
[21:43] sure; but in general, I'd like to know why we convert it twice (no need, it was already rendered into the instance_id object IIUC)
[21:43] so, if it didn't fail converting due to the ENOTDIR, then it's attached to the stage object and you see:
[21:43] stages.py[DEBUG]: not a new instance. network config is not applied.
[21:43] you
[21:44] yours never gets that far; maybe the rebuild will
[21:45] * rharper steps out for a bit
=== rangerpb is now known as rangerpbzzzz
[22:27] rharper: ok I fixed the last issue. baremetal is now booting fine
[22:28] rharper: all patches: http://paste.ubuntu.com/23059836/