[06:12] <Odd_Bloke> smoser: I've added some comments on https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1603222; your thoughts would be much appreciated. :)
[13:48] <smoser> Odd_Bloke, so i think that at this point more of the stuff is in cloud-init, right ?
[13:48] <smoser> i think what you're saying is that cloud-init in xenial does not depend on udev rules from walinux agent
[13:49] <smoser> so i'd prefer if we fix something to have that be consistent between trusty -> xenial +
[13:58] <Odd_Bloke> smoser: The problem is orthogonal to udev rules, really.
[13:58] <Odd_Bloke> smoser: One thing I wasn't sure about is how introducing the udev rules to existing instances would affect them.
[13:58] <Odd_Bloke> But both sets would apply, so I think it would be fine...
[13:59] <smoser> right. they both should set up symlinks.
[13:59] <smoser> why is it orthogonal?
[15:05] <Odd_Bloke> Because the problem is that they don't use the udev rules properly; it doesn't matter which set of udev rules are there.
[15:06] <Odd_Bloke> But, yeah, I'm happy to backport the udev rules if that seems safe.
[15:06] <Odd_Bloke> I was just thinking in terms of minimising the backport diff.
[16:44] <mgagne> smoser: will test your fix for bond right now
[16:44] <mgagne> however I believe 3.2) described in bug still isn't fixed.
[18:14] <smoser> mgagne, looking
[18:18] <mgagne> in a meeting but so far, only the auto stanza on bond0 looks to be missing
[18:19] <smoser> mgagne, i'm not sure it's required.
[18:19] <mgagne> it is as far as I know, I added it and it worked
[18:19] <smoser> i'll compare the network config against other stuff we have examples of (curtin vmtest) where we actually verify
[18:19] <mgagne> will test after my meeting
[18:19] <smoser> thanks mgagne
[18:52] <smoser> rharper, look at https://bugs.launchpad.net/cloud-init/+bug/1605749 and have a think.
[18:53] <rharper> smoser: is that your bonding fix ?
[18:53] <rharper> yeah; I read the branch earlier, my initial thought was that we'd generically want to "Resolve links" at render time
[18:53] <rharper> rather than be bond specific
[18:55] <rharper> the mechanism to use link['id'] as the interface name key is common for all types; that is in the network state we only have link/ids and then at runtime we'd do a  replacement of link-id with get_name_from_macaddr(link_to_mac[link-id]) sort of lookup
[18:55] <smoser> rharper, well, do other things have referenced links ?
[18:55] <rharper> any of the combined types
[18:55] <rharper> bridges, bonds, and vlans
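The generic "resolve links at render time" idea rharper describes can be sketched as below. This is only an illustration, not the code in smoser's branch: `resolve_link_ids`, the `link_to_mac` dict, the `mac_to_name` dict and the `tap...` ids are all assumed names and shapes, standing in for the link-id to MAC table from the network data and the MAC to interface-name table read from the running system.

```python
def resolve_link_ids(link_ids, link_to_mac, mac_to_name):
    """Replace each link id with the local interface name that owns its MAC.

    Hypothetical helper: ids that cannot be resolved are passed through
    unchanged, so constructed names (bond0, vlan1) survive as they are.
    """
    names = []
    for link_id in link_ids:
        mac = link_to_mac.get(link_id)
        names.append(mac_to_name.get(mac, link_id))
    return names


# Assumed example data: two link ids from the datasource, two local nics.
link_to_mac = {"tap-a": "52:54:00:aa:00:01", "tap-b": "52:54:00:aa:00:02"}
mac_to_name = {"52:54:00:aa:00:01": "ens3", "52:54:00:aa:00:02": "ens4"}
print(resolve_link_ids(["tap-a", "tap-b"], link_to_mac, mac_to_name))
```

The same call would serve bond members, bridge ports and a vlan's underlying device, which is the point of doing the lookup generically instead of per type.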
[18:56] <smoser> rharper, well, if we want, its easy enough now with the same generic mechanism
[18:57] <rharper> w.r.t auto on bond0; that is needed; there's a bug related to not getting auto on stanzas without network config (the subnet has the 'control' value and default)
[18:57] <rharper> so a bond with no subnets, but then vlans on top misses the 'auto bondX' line;  I've found this in curtin since Friday, working on a fix there;
[18:57] <smoser> ok.
[18:57] <smoser> then the 2 things are
[18:57] <smoser> a.) auto bond [necessary]
[18:58] <smoser> b.) resolve links generically
[18:58] <smoser> where b is not strictly necessary at this time...
[18:58] <smoser> can you give another example of where it is?
[18:58] <rharper> a) bonds default to auto, unless a subnet with 'control' says otherwise
[18:59] <rharper> for b) just replace the vlan_raw_interface value from bond.X to interface0
[18:59] <rharper> the underlying device, if it's type 'physical', will refer to another "links" element which may not have a 'name' key set
[19:00] <rharper> think, eth0.123;  the eth0 would be the link.id; and that may not be the name of the device, like bond_interfaces contains link.ids
[19:06] <smoser> where did you come up with the string 'vlan_raw_interface' ?
[19:06] <smoser> i dont see that anywhere.
[19:06] <rharper> that's the eni name
[19:07] <rharper> sorry, vlan_link is the network_data.json field
[19:07] <rharper> in cloudinit/sources/helpers/openstack.py:593
[19:08] <rharper> we generate a nic name based on the vlan_link and vlan_id, which is OK since vlan is a constructed interface, but the vlan_link points to the underlying device (this is a link.id) and needs to be replaced
[19:12] <smoser> rharper, ok. i can add a test for vlan on that also.
[19:12] <rharper> if you need a config yaml, I have one
[19:14] <smoser> http://paste.ubuntu.com/23059320/
[19:15] <smoser> i think that is what you were saying
[19:15] <smoser> that shows the error.
[19:15] <smoser> in 'eth0.602', does 'eth0' actually matter ?
[19:17] <rharper> for our internal state no
[19:17] <rharper> but it's typical shorthand for underlying device . vlan_id
[19:17] <rharper> so, the vlan scripts in ifupdown split on . and call vconfig with the first segment and pass the second as the vlan_id
[19:18] <rharper> you can instead  say iface vlan1 and underneath specify the vlan_raw_device (eth0), and vlan_id (123)
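The `<device>.<vlan_id>` shorthand rharper describes amounts to a split on the dot. A minimal sketch, assuming a hypothetical helper name `split_vlan_name` (it is not an ifupdown or cloud-init function):

```python
def split_vlan_name(ifname):
    """Split a name like 'eth0.602' into its raw device and numeric vlan id.

    Hypothetical helper illustrating the shorthand; names without a
    '<device>.<number>' shape (for example 'vlan1') are rejected.
    """
    raw_device, sep, vlan_id = ifname.rpartition(".")
    if not sep or not raw_device or not vlan_id.isdigit():
        raise ValueError("not a <device>.<vlan_id> name: %r" % ifname)
    return raw_device, int(vlan_id)


print(split_vlan_name("eth0.602"))
```

A name such as `vlan1` carries no device segment, which is why that form needs the raw device and vlan id spelled out explicitly in the stanza.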
[19:23] <smoser> well that sucks
[19:25] <smoser> harder to get at that
[19:27] <rharper> no, we just need to tag the elements of state that use a link.id
[19:28] <rharper> and when we're rendering the interface, do an id lookup by mac
[19:28] <rharper> we already repeat the vlan_id as an interface attribute
[19:28] <rharper> we can set the vlan interface name to vlan{index} instead of {link_id}.{vlan_id}
[19:28] <rharper> which is what we're doing now;
[19:46] <smoser> well, https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+ref/bond_name is updated to now have a vlan test case that i think renders correctly
[19:46] <smoser> i dont love the mechanism, but seems to work
[19:51] <rharper> looking
[19:53] <smoser> still missing the auto up change, assuming youre looking at that.
[19:54] <rharper> yeah, that's a one-liner in eni.py
[19:55] <rharper> basically in the case that we don't have a subnet configured, if we have 'bond-masters' or 'bond-slaves' we emit an auto $iface
[19:55] <rharper> I need to think a bit more
[19:55] <rharper> as it's a general issue for interfaces that don't include a subnet config since 'control' is a property of a subnet
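The rule rharper outlines for emitting `auto $iface` could look roughly like this. It is a sketch under assumptions, not the one-liner in eni.py: `needs_auto` is a made-up name, the interface is assumed to be a plain dict, and the `bond-masters` / `bond-slaves` keys are copied from the chat line above.

```python
def needs_auto(iface):
    """Decide whether an 'auto <name>' line should be emitted.

    Assumed layout: iface is a dict with an optional 'subnets' list whose
    entries may carry a 'control' value (treated as 'auto' when absent).
    """
    subnets = iface.get("subnets") or []
    if subnets:
        # 'control' is a property of the subnet, so the subnets decide.
        return any(s.get("control", "auto") == "auto" for s in subnets)
    # No subnet at all: fall back to the bond keys, as described above.
    return "bond-masters" in iface or "bond-slaves" in iface


bond = {"name": "bond0", "bond-slaves": "none"}
if needs_auto(bond):
    print("auto %s" % bond["name"])
```

As written this only covers bonds; the more general case of any interface without a subnet config is the part rharper says still needs thought.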
[19:56] <smoser> yeah.
[19:56] <mgagne> smoser: so I patched on my side to get auto added. It now fails as described in the bug description in 3.1)
[19:57] <smoser> mgagne, well, how are you running that ?
[19:57] <mgagne> smoser: running what?
[19:58] <mgagne> smoser: install cloud-init from repo, apply patches found in your branch, build image, upload image, boot
[20:00] <mgagne> http://imgur.com/yA3eslq
[20:09] <rocket> I am wondering if I can use cloud-init in my home lab running with vmware fusion.  Is there a metadata server I could run locally thats fairly simple/lightweight?
[20:14] <smoser> rocket, there is some vmware support, but i'm not familiar enough with their product line to know if 'fusion' supports it.
[20:15] <rocket> I was just hoping to start up a pythonic based webserver or something .. or should I be looking at creating my own that produces the yaml files I am seeing in documentation?
[20:16] <rocket> I just didn't know what was required for a really simple setup
[20:16] <rocket> I *think* I just need random hostnames and point that towards a saltmaster etc..
[20:19] <smoser> rocket, theres two things that provide difficulty i think
[20:19] <smoser> a.) you need data per instance-id ... each instance needs to somehow get different data
[20:19] <smoser> b.) you have to tell cloud-init where the metadata service is.
[20:20] <smoser> you could mock the ec2 one by mocking 169.254.169.254 and plumbing that network in
[20:20] <rharper> smoser: 3.1 is another run-time variant;  bonds inherit mac address of the slaves ;  when we're doing the lookup, we can filter by type (we only need to look up names by macs of 'physical' devices)
[20:23] <smoser> rharper, right. but he's getting a stack trace there
[20:23] <smoser> which is interesting and i can't reproduce.
[20:23] <mgagne> smoser: we are not yet at 3.2, we are still stuck at 3.1
[20:24] <mgagne> smoser: how are you testing? do you have access to an openstack cloud?
[20:24] <smoser> python3 NotADirectory and FileNotFound are obnoxious
[20:25] <rharper> smoser: we need to not try to look up the bond mac address; it can be called *whatever* bond{index} ; only the bond_interfaces lists of link_ids need to be resolved (and actually) we need to check the type of the links to see if they're physical, otherwise we can ignore the mac lookup
[20:25] <smoser> mgagne, well, i do, yes, but was not focused there yet, and i dont have an openstack cloud that would ask me to bind like that
[20:25] <mgagne> this line is problematic: https://git.launchpad.net/cloud-init/tree/cloudinit/net/__init__.py#n99
[20:26] <mgagne> it makes the assumption that all devices found in this folder are a real device and file is a directory
[20:26] <mgagne> this is why this line fails: https://git.launchpad.net/cloud-init/tree/cloudinit/net/__init__.py#n350
[20:26] <mgagne> but could be related to python3 as you said
[20:31] <smoser> mgagne, well, it doesnt really make the assumption
[20:31] <smoser> it accepts an OSError and a IOError and does the right thing
[20:32] <mgagne> I'm not sure why one would list those devices and only filter them later
[20:42] <mgagne> rharper: I'm not sure why cloud-init tries to configure the network a second time. The 2nd time it is run, slaves mac address might be updated and no longer match the ones found in config-drive.
[20:42] <rharper> smoser: it does the right thing (read_sys_net) however, the interfaces_by_mac does not like 'bonding_masters' file and throws exception; this prevents creating the mac_to_ifname;  we can handle the NotADirectoryError and continue
[20:43] <smoser> rharper, we are trying to handle that.
[20:43] <smoser> thats the thing
[20:43] <smoser> NotADirectoryError is an OSError
[20:43] <rharper> interesting
[20:43] <rharper> when I test it, it's not handled
[20:43] <smoser> but apparently does not have errno = 2
[20:43] <smoser> where do you test this ?
[20:43] <rharper> xenial vm
[20:43] <rharper> with bond added
[20:44] <rharper> if you're on diglett you can ssh into the vm
[20:44] <rharper> smoser: ssh ubuntu@192.168.122.178
[20:45] <rharper> this is not with your branch, so if you've updated it's just what's in xenial (cloud-init level)
[20:45] <mgagne> current code is testing for ENOENT, not ENOTDIR
[20:45] <smoser> probably need ENOTDIR
[20:45] <smoser> yeah.
[20:45] <rharper> http://paste.ubuntu.com/23059495/
[20:48] <smoser> rharper, http://paste.ubuntu.com/23059501/
[20:49] <smoser> obnoxious
[20:49] <smoser> so open("/sys/class/net/bonding_masters/address") throws a NotADirectoryError with an errno of 20
[20:49] <mgagne> try with an existing file and append a filename to it and try opening it
[20:50] <smoser> mgagne, right. thats it. ok. thank you
[20:51] <rharper> yeah; it's the full path that included not-a-dir-element
[20:52] <smoser> http://paste.ubuntu.com/23059511/
[20:52] <rharper> y
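The failure worked out above can be reproduced without a bond: opening a path that runs through a regular file raises NotADirectoryError, whose errno is ENOTDIR and not ENOENT. A minimal sketch on Linux, where `read_address` is a hypothetical helper and not the cloud-init function:

```python
import errno
import os
import tempfile


def read_address(path):
    """Return the stripped file contents, or None if the path is unusable.

    Hypothetical helper: both a missing file (ENOENT) and a path that runs
    through a non-directory (ENOTDIR) are treated as "no address here".
    """
    try:
        with open(path) as fp:
            return fp.read().strip()
    except OSError as e:
        if e.errno in (errno.ENOENT, errno.ENOTDIR):
            return None
        raise


with tempfile.NamedTemporaryFile() as plain_file:
    # plain_file.name is a regular file, playing the role of
    # /sys/class/net/bonding_masters; appending a filename to it
    # makes open() fail with ENOTDIR.
    print(read_address(os.path.join(plain_file.name, "address")))
```

Checking only for ENOENT lets the ENOTDIR case propagate, which matches the unhandled exception seen in the xenial vm.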
[20:53] <smoser> there is  a question in my mind if there could be 2 nics with the same address
[20:54] <smoser> wow
[20:54] <smoser> $ cat /sys/class/net/bond0/address
[20:54] <smoser> 52:54:00:f2:5a:35
[20:54] <smoser> $ cat /sys/class/net/ens3/address
[20:54] <smoser> 52:54:00:b2:5a:27
[20:55] <smoser> $ cat /sys/class/net/ens5/address
[20:55] <smoser> cat: /sys/class/net/ens5/address: No such file or directory
[20:55] <smoser> so the answer is that you can't have 2, but if this were to run after a bond were set up, we'd get the bond as the device with that mac
[20:55] <smoser> which is odd
[20:55] <mgagne> root@localhost:/sys/class/net# cat bond0/address
[20:55] <mgagne> 0c:c4:7a:34:6e:3c
[20:55] <mgagne> root@localhost:/sys/class/net# cat eno1/address
[20:55] <mgagne> 0c:c4:7a:34:6e:3c
[20:55] <mgagne> root@localhost:/sys/class/net# cat eno2/address
[20:55] <mgagne> 0c:c4:7a:34:6e:3c
[20:56] <rharper> as I mentioned before; for naming, we can ignore non-physical devices;
[20:57] <rharper> bonds/bridges/vlans have various configs that inherit mac of underlying devices;  we really want to know the physical nic and mac pairing
[20:57] <smoser> that is odd.
[20:58] <smoser> well, rharper we dont *always* want the physical nic. we could put a bond on two vlans
[20:58] <smoser> i think though, that the code i have in that tree is actually right.
[20:58] <rharper> vlan names are arbitrary
[20:58] <rharper> as are bond names
[20:58] <smoser> sure. but if we're looking to get a mapping of mac to interface name, then the path is valid.
[20:59] <rharper> that is, we can always set them
[20:59] <smoser> but i think the code is doing the right thing at this point.
[20:59] <smoser> as it looks through the links and sets up the name, and only overwrites it with what it found in /sys if it does not yet have a mac from the links table.
[20:59] <mgagne> you can find the original mac address in /sys/class/net/<ifname>/bonding_slave/perm_hwaddr
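mgagne's hint about perm_hwaddr could be used along these lines. This is a sketch only: `permanent_mac` and its `sys_net` parameter are assumptions for illustration, and the only path taken from the chat is `/sys/class/net/<ifname>/bonding_slave/perm_hwaddr`.

```python
import os


def permanent_mac(ifname, sys_net="/sys/class/net"):
    """Return the original MAC of ifname, even while it is enslaved to a bond.

    Hypothetical helper: prefers bonding_slave/perm_hwaddr and falls back
    to the plain address file; returns None if neither can be read.
    """
    candidates = [
        os.path.join(sys_net, ifname, "bonding_slave", "perm_hwaddr"),
        os.path.join(sys_net, ifname, "address"),
    ]
    for path in candidates:
        try:
            with open(path) as fp:
                return fp.read().strip().lower()
        except OSError:
            # Missing file or non-directory in the path: try the next one.
            continue
    return None
```

With this, eno1 and eno2 in the output above would each report their own hardware address even though their address files currently show the bond's MAC.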
[20:59] <rharper> to configure the bond or vlan correctly, we only need to lookup link_ids of physical devices;
[21:00] <rharper> and emit those names in the config
[21:00] <smoser> oh wait. it doesnt do that. but it should.
[21:02] <smoser> yeah, its ok as it is right now i think.
[21:03] <smoser> mgagne, so i think what i just pushed will fix all but the 'auto'
[21:03] <smoser> i have to run now.
[21:04] <mgagne> https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+ref/bond_name ?
[21:04] <smoser> right
[21:04] <mgagne> no, this won't fix the auto
[21:04] <smoser> the one 2dee860 should fix the NotADirectory
[21:04] <smoser> right
[21:04] <smoser> it should fix all *but* the auto
[21:04] <mgagne> http://paste.openstack.org/show/557628/
[21:05] <mgagne> and the point 3.2)
[21:05] <rharper> mgagne: yeah
[21:05] <mgagne> I'm stuck at 3.2 since last week
[21:05] <mgagne> everything else has been fixed on my side since
[21:05] <mgagne> maybe not in the way you would have done it
[21:05] <rharper> when you say later, do you mean subsequent boots ?
[21:06] <mgagne> let me check
[21:06] <mgagne> "3.2) Once 3.1) is fixed, configuration fails again later"
[21:06] <mgagne> Once 3.1 is fixed, cloud-init will fail again at a different place further down
[21:06] <mgagne> will reword the description
[21:07] <smoser> i dont understand 3.2 failure though
[21:07] <smoser> as when this runs, the bond should not be set up yet
[21:07] <mgagne> cloud-init looks to be rerun twice. Since bond is already configured at this time, mac isn't found and crash
[21:07] <mgagne> it is
[21:08] <smoser> it should not run the network config step on the second time through if it successfully ran it on the first.
[21:08] <smoser> ie, cloud-init local should set up networking before anything is allowed to come up
[21:08] <smoser> and then cloud-init should come up and see its not first instance boot (cloud-init local was) and not run that section
[21:08] <mgagne> well, I didn't see failure in dsmode=local and cloud-init still try to (re)configure network anyway
[21:09] <mgagne> I can rebuild with latest patches and pull out logs
[21:09] <smoser> mgagne, sure. i do agree if it ran twice the second one well could fail
[21:09] <smoser> i do have to run now.
[21:10] <mgagne> the ENOTDIR error is caused by this second run
[21:13] <rharper> we should debug why you get a second run, but certainly the 'bonding_masters' file is only added after a bond is configured
[21:19] <mgagne> image is rebuilding with all patches, gonna take a while before it builds and then a baremetal is booted
[21:20] <mgagne> maybe I can pull the logs from the current baremetal, it ran twice anyway
[21:24] <mgagne> rharper: http://paste.openstack.org/show/GHjpr6jMq1uxoCtRl922/
[21:24] <rharper> k
[21:25] <mgagne> "Execution continuing, no previous run detected that would allow us to stop early."
[21:25] <rharper> is config provided in two places? or ConfigDrive only?
[21:26] <mgagne> we only have configdrive in place
[21:26] <mgagne> "no-net is written by upstart cloud-init-nonet when network failed" I'm not sure xenial still has upstart
[21:27] <mgagne> the file /var/lib/cloud/data/no-net does not exist on the machine
[21:28] <rharper> ok
[21:28] <rharper> yeah, when init runs, it reads the data; that somehow is definitely re-running net config
[21:31] <mgagne> I don't understand the logic here... if no-net exists, it's because network config failed. The logic makes it so network config in dsmode=net will STOP if network previously failed. in our case, it sure didn't fail... so it will try to configure it again? o_O
[21:32] <mgagne> I only found occurrences of no-net in cmd/main.py and upstart config file so I'm not sure what to think here
[21:32] <rharper> so, dsmode=net, IIUC, is for things like AWS where the datasource is over the network
[21:33] <rharper> in that case, cloud-init has to bring up something (fallback networking, dhcp on an interface) and attempt to find a datasource over the networking
[21:33] <rharper> after acquiring a datasource, networking config may be included, in which case, we'd need to update the network config (override the fallback generated one)
[21:35] <mgagne> I have those ds loaded: NoCloud, ConfigDrive, OVF, MAAS
[21:35] <mgagne> Should I remove all but ConfigDrive?
[21:35] <rharper> in the ConfigDrive case, it's not via network but local file
[21:35] <rharper> you can configure them off but it will only select one
[21:37] <mgagne> so I think DataSourceConfigDrive supports both local and net.
[21:38] <mgagne> and there is no flag to prevent it from running twice. But it could also be by design but didn't plan for bonding support and all its side effects
[21:38] <rharper> I think I can see it applied twice; it was only with bonding configs that we trip up
[21:38] <mgagne> yea
[21:38] <mgagne> I think there was a lot of logic that didn't account for bonding being configured and enabled at that time
[21:39] <mgagne> like mac address changing
[21:39] <rharper> yeah; I don't think we want to apply whole-sale net config twice;
[21:39] <mgagne> but this *could* be a valid scenario (I don't know yet how)
[21:40] <mgagne> like boot with APIPA address in dsmode=local and later get an IP with metadata service or whatever.
[21:41] <mgagne> I just don't want to bulldozer my way to make bonding work and break use cases I didn't know existed
[21:41] <rharper> yeah, so init [net] mode does re-read the json data and attempts to create network_state which invokes the openstack conversion, which fails when the initial state of the system is already configured
[21:42] <rharper> I don't think you're breaking anything;  smoser or harlowja will have to help me understand why they get parsed twice
[21:42] <mgagne> only because it can't find the link mac address (after ENOTDIR is fixed of course)
[21:43] <rharper> sure; but in general, I'd like to know why we convert it twice (no need, it was already rendered into the instance_id object IIUC)
[21:43] <rharper> so, if it didn't fail converting due to the ENOTDIR, then it's attached to the stage object and you see:
[21:43] <rharper> stages.py[DEBUG]: not a new instance. network config is not applied.
[21:43] <rharper> you
[21:44] <rharper> yours never gets that far; maybe the rebuild will
[21:45]  * rharper steps out for a bit 
[22:27] <mgagne> rharper: ok I fixed the last issue. baremetal is now booting fine
[22:28] <mgagne> rharper: all patches: http://paste.ubuntu.com/23059836/