[02:15] smoser: what's the difference between _write_network, _write_network_config, _bring_up_interfaces and _bring_up_interface?
[03:38] smoser: I have gentoo networking working, at least for dhcp, probably for static as well
[03:39] smoser: last two commits here https://github.com/prometheanfire/cloud-init
[03:46] only 'error' that happens doesn't seem to hurt anything
[03:46] 2016-08-22 03:44:21,495 - __init__.py[WARNING]: apply_network_config is not currently implemented for distribution ''. Attempting to use apply_network
=== Takumo is now known as Jagmilleurs
=== Jagmilleurs is now known as Takumo
[13:34] prometheanfire, here now. reading your comments.
[13:34] you gave the system bonding config ?
[13:34] smoser: hi
[13:34] smoser: no, the kernel just supports it
[13:34] and this breaks things
[13:35] hm..
[13:35] because bonding_masters exists as a file within /sys/class/net
[13:35] well, there is another fix for that, that i do need to get in. but i didn't believe that it was as simple as you say.
[13:35] the ubuntu kernels *do* support bonding
[13:35] well, for us the module is loaded static
[13:35] probably its as a module though, and likely it isnt loaded when that runs.
[13:36] maybe that's why we hit it
[13:36] "module loaded static"
[13:36] ?
[13:36] you mean builtin
[13:36] ya
[13:36] :)
[13:36] static kernel :D
[13:36] it's how I (personally) compile my kernels
[13:36] ok.
[13:36] easy to ship
[13:37] this is the same code used in glean if it helps
[13:37] glean ?
[13:38] simple-init?
[13:38] same thing
[13:38] https://github.com/openstack-infra/glean/
[13:38] https://github.com/openstack-infra/glean/blob/master/glean/cmd.py#L686
[13:45] prometheanfire, so https://git.launchpad.net/~smoser/cloud-init/log/?h=bond_name has the fix for bond_master stuff
[13:46] and wrt glean, i can't just take that, as license is not compatible. cloud-init requires signing cla for contribution.
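A minimal sketch of the bonding_masters problem raised at 13:35, assuming the breakage comes from treating every entry in /sys/class/net as a device. The function name `list_net_devices` is hypothetical and is neither the cloud-init fix nor the glean code; it only shows that a regular file such as bonding_masters can be skipped by checking for directories (real device entries are symlinks to directories).

```python
import os


def list_net_devices(sys_class_net="/sys/class/net"):
    """Return sorted device names, skipping non-directory entries.

    Hypothetical helper for illustration only. With a builtin bonding
    driver, /sys/class/net also holds a regular file named
    bonding_masters, which is not a network device.
    """
    names = []
    for entry in os.listdir(sys_class_net):
        # os.path.isdir follows symlinks, so real device entries pass
        # while the plain bonding_masters file does not.
        if os.path.isdir(os.path.join(sys_class_net, entry)):
            names.append(entry)
    return sorted(names)
```

The directory argument is a parameter so the function can be pointed at a scratch directory in a unit test instead of the live sysfs tree.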
[13:46] right
[13:46] :-(
[13:46] about the code
[13:46] I wrote it
[13:47] so I should be able to contribute it to both projects
[13:47] I'd think at least
[13:47] oh. yeah, then you can.
[13:47] yes.
[13:47] :D
[13:48] the other comment is that '_write_network' is kind of the "legacy" mechanism
[13:48] in any case we should probably ignore the other interfaces, but if your bonding update code works then that solves the immediate problem
[13:50] so, should I convert to _write_networks?
[13:51] we want "_write_network_config", and using a Renderer like ubuntu/debian and rhel does
[13:53] unfortunately i think you're going to want more doc / info on the "network state" :)
[13:53] is netconfig the same thing as settings?
[13:54] no. its more like what is described at http://people.canonical.com/~rharper/curtin/topics/networking.html
[13:54] and ubuntu and rhel basically load that thing into a 'network_state' and then render from it.
[13:54] ya, that's doable
[13:54] a useable datastructure
[13:55] the one thing that i can tell you is that there are unit tests that you can easily run to poke around at what is happening
[13:55] (and please do contribute unit tests for your new code)
[13:55] shouldn't need unit tests for that, but ok :P
[13:55] prometheanfire, thank you. i'm really happy to have your contributions
[13:55] well, for rendering... you can have quite complex state
[13:56] yarp
[13:56] at the moment I'll be using this patch on 0.7.7
[13:56] or will soon
[13:57] but I've tested it at least and it fulfills the simple use case
[13:57] and even better, it actually works
[13:57] current code is broken for networking on gentoo, it lays down debian style configs
[13:59] right now the code to use the new method looks over complicated
[14:04] prometheanfire, yeah. i'm not really happy with it either. note though that it supports much more complex config
[14:05] multiple ips per nic, ipv4 ipv6, bond, vlan..
[14:05] it does
[14:05] so yes, it is more complicated
[14:05] I just don't even know where to start
[14:05] it's that hairy
[14:05] yeah.
[14:07] i'd suggest looking at rhel, and how that works.
[14:08] and then just working from unit tests to get something sane.
[14:08] her... i'll try to get some scaffolding in place for you
[14:08] s/her/here/
[14:08] current code actually started from arch
[14:09] what would you suggest as the name for the network config style ?
[14:09] we have a name, sec
[14:09] ie, 'eni', 'sysconfig' ...
[14:10] https://wiki.gentoo.org/wiki/Netifrc
[14:10] netifrc
[14:11] looking at rhel's stuff in net/sysconfig
[14:11] k. thanks..
[14:17] ok, this is an example of the netconfig
[14:17] {'version': 1, 'config': [{'name': 'eth0', 'subnets': [{'type': 'dhcp'}], 'mac_address': 'fa:16:3e:00:10:ee', 'type': 'physical'}]} [ ok ]
[14:18] right. thats netconfig, then it gets loaded into 'NetworkState', and the renderers actually take the NetworkState.. a middle ground
[14:18] right
[14:18] give me 10 more minutes, and i'll try to hand you a unit test thing to fill in
[14:18] cool
[14:19] have been looking at sysconfig.py, looks like it's just a longer version of my if stuff
[14:20] does netconfig allow you to have vlans in your bonds or bonds in your vlans, etc?
[14:23] oh, it's listed explicitly
[14:23] neat
[14:32] http://paste.ubuntu.com/23078466/
[14:33] prometheanfire, ^ that should at least allow you to easily poke through the code and see your results easily.
[14:33] i did not add the _write_network_config to gentoo distro but i'm guessing you can figure that out.
[14:34] one thing to point out, the renderer should be idempotent.
whatever network config is there, the one provided is what the system *should* do
[14:37] ya, that's the easy part
[14:37] I've already started on it
[15:31] mgagne_, i'd appreciate your thoughts on https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/303563
[15:31] ideally i'd like to have that merged today, you mentioned it had some issues still i think
[15:34] i think you still say 3.2 is busted.
[16:06] smoser: I will test the whole patch set. Is it complete yet?
[16:06] smoser: 3.2 got fixed with the bonding_slaves detection
[16:08] mgagne_, then i think it is complete ...
[16:08] pending your test :)
[16:08] and my fuzzy memory
[16:08] smoser: for some reason, cloud-init configures network twice. At this point, I don't care about the reason, it's beyond my expertise. So detecting bonding should do the job.
[16:08] it configures networking once.
[16:08] ok, will launch a build with your patches against 0.7.7~bzr1256-0ubuntu1~16.04.1
[16:08] it renames devices twice.
[16:08] smoser: It goes through the network json config parsing twice at least
[16:08] i'm pretty sure thats the case, and i have to think again about why we rename network devices after we've configured.
[16:15] so I will be applying this patch http://paste.ubuntu.com/23078862/ verbatim to cloud-init 0.7.7~bzr1256-0ubuntu1~16.04.1 and will boot on a baremetal with bonding+vlans to validate.
[16:19] hi
[17:03] thanks
[17:09] prometheanfire, merged your branch
[17:09] yarp
[17:09] will rebase my integration branch when your stuff merges
[17:23] I'll see if I can test your patch
[17:55] smoser: so I tested the patches. Somehow I managed to reproduce an intermittent bug my coworker had. Default gateway fails to configure and server doesn't ping.
[17:55] hmm... i think maybe rharper might know something.
[17:55] mgagne_, xenial, right ?
[17:56] yes
[17:56] ifupdown is very "fun".
[17:56] smoser: is there anything I can look at?
default gw is configured in post-up with route add || true
[17:56] so I'm not sure how I'm supposed to debug that
[17:56] and i know that rharper has been doing some hair pulling over bonds recently.
[17:56] so you can get into it, mgagne ?
[17:57] so we never supported ubuntu 16.04, even without bonding yet so I'm not sure if it's a known issue with bonding or cloud-init.
[17:57] any logs I can pull to help debug?
[18:10] smoser: reviewed your patch, worksforme
[18:11] mgagne_, grab /var/log/cloud-init.log
[18:12] also, closed the other merge request
[18:13] mgagne_, can you get to it while its up?
[18:13] or are you just able to shut down and collect files
[18:13] yea, will get the logs after my meeting =)
[18:13] will get the whole /var/log if needed
[18:14] if you can get while its up, please get output of
[18:14] ifconfig -a
[18:14] and any thing else you might find useful
[18:14] systemctl status
[18:14] would be good
[20:06] if post-up fails, will the output be logged?
[20:09] mgagne_: any stderr from the ifup will be captured in the networking.service log, so you should see something in systemctl status -l networking
[20:09] ok, I ran that command and didn't see anything
[20:09] mgagne_: from your gist though, the devices and routes all came up as configured
[20:09] will rerun
[20:09] to make sure
[20:09] check cloud-init-output.log, you won't see the default gw
[20:09] only link local routes
[20:11] in your gist, do you have the etc/network/interfaces.d/50-cloud-init.cfg file ?
[20:11] rharper, interfaces-cloud-init.txt
[20:11] yes
[20:11] https://gist.github.com/mgagne/fbc1b05412f41426f2e248acd5efad14#file-interfaces-cloud-init-txt
[20:12] smoser: ah right
[20:12] added systemctl status -l networking output
[20:12] so I suspect that *maybe* the default route is added but something removes it later?
[20:12] or will || true hide the failure?
[20:13] mgagne_: in your gist, the bond0.602 default gw is the one you expect ?
(the launchpad bug had other config)
[20:13] yes
[20:14] I thought it would be configured with the gateway stanza but ¯\_(ツ)_/¯
[20:17] Odd_Bloke, around ?
[20:18] test_exception_fetching_fabric_data_doesnt_propagate
[20:18] why would i not want that to propagate?
[20:21] rharper: is there anything I can do to help debug? I don't mind rebuilding an image with debug config or whatever.
[20:21] mgagne_: a plain route -n would be nice
[20:21] and the original network_data.json;
[20:22] rharper: I added the default route already so I can SSH and pull logs
[20:22] it's applying some routes; I just can't see why it wouldn't do the post up
[20:23] rharper: added to gist
[20:24] rharper: I will try to reboot a 2nd server which didn't have the issue and see if I can reproduce after multiple reboots
[20:24] ok
[20:24] rharper: let me know if you would prefer to get SSH access for further debug, this can be done
[20:24] ok
[20:30] rangerpbzzzz, around ?
[20:30] wonder if its ok if i open a bug and assign it to you.
[20:34] mgagne_: so I can recreate the case where the cloud-init-output does not contain the default route; but post-up on bond0.602 does run and work; maybe we could add a cloud-init final command to run route -n so we can see that? in xenial, cloud-init writes the files and networking.service is doing an ifup -a (which will bring up any non physical devices ; the physical devices with bond-master will create the bond0 and
[20:34] enslave them) and then the ifup -a will trigger an ifup on bond0.602 and bond0.612; they'll run and run the post-up which should add the default gw you need;
[20:35] rharper: so you think that: cloud-init runs route -n, doesn't see default gw at this point in time but later route should be configured by ifup?
[20:35] I'm not entirely certain but in my recreate; the output info doesn't show the bond.vlan route; but when I login and run route; it's fully up
[20:35] because it's true that the /32 route doesn't show in route output.
this means something is running later to add routes.
[20:36] I don't know when it runs to collect the network status
[20:36] but possibly too soon or some other reason
[20:36] not sure if smoser has more details
[20:36] could it be the slave links aren't fully up and therefore routes aren't applied yet since it's in post-up?
[20:36] yes
[20:36] slaves can take some time
[20:37] bonding scripts will wait up to 600 seconds for a bond to join
[20:37] err , 60 seconds
[20:37] hm.
[20:37] (60 * 0.1)
[20:37] 6 seconds? :D
[20:37] cloud-init writes network status during 'init' stage
[20:37] *sigh* 600 * 0.1
[20:37] which should be after static networking is up
[20:37] so if that runs before all of the 'ifup' stuff is finished, then that is a bug
[20:39] # systemctl cat cloud-init.service | grep networking
[20:39] After=cloud-init-local.service networking.service
[20:39] smoser: no, I can see the route info now; I was looking at the top of the -output file before I added the config; I definitely see the default routes running but; this is a VM versus baremetal;
[20:39] Requires=networking.service
[20:39] but Net device info shows the interface as up
[20:39] I don't know where it takes the status but it means slaves are up too?
[20:40] http://paste.ubuntu.com/23079501/
[20:40] it should look like that
[20:40] I only added the one bond with default route (and used active-backup on a second nic in a VM); but the output should look similar in number of routes
[20:42] but it is odd that during the dump of the route in mgagne_ case, there is nic message from kernel about being up;
[20:43] the info table runs at Up 48.87 , but the nic up message isn't until 57
[20:44] so I'm rebooting in loop to try to reproduce the problem and so far, no luck
[20:44] coworker says that if you reboot a node with the problem, gw is configured properly and your issue is fixed.
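The timeout arithmetic settled at 20:37 (600 tries at 0.1 seconds each, so about 60 seconds in total) can be sketched as a bounded polling loop. This is not the bonding script itself, which is not shown in the log; `wait_for` and its parameters are hypothetical names, and only the two constants come from the chat.

```python
import time


def wait_for(predicate, attempts=600, interval=0.1, sleep=time.sleep):
    """Poll predicate up to `attempts` times, sleeping `interval` between tries.

    Hypothetical helper: with the defaults the worst case is
    600 * 0.1 = 60 seconds, matching the figure corrected in the chat.
    Returns True as soon as predicate() is true, False on timeout.
    """
    for _ in range(attempts):
        if predicate():
            return True
        sleep(interval)
    return False
```

Anything that reads route or link state while such a loop is still waiting would see a partially configured bond, which is consistent with the ordering question raised in the log.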
[20:44] yeah, the switch delay
[20:45] so I'm wondering if it's something cloud-init does at that time
[20:45] which isn't done in next boot
[20:46] like renaming an interface
[20:46] you can force cloud-init to re-run by nuking /var/lib/cloud/* in the instance before rebooting
[20:46] renaming happens on each boot
[20:46] or an attempt
[20:46] by cloud-init?
[20:46] yes
[20:46] ok, will nuke /var/lib/cloud/* on another node I have
[20:47] and reboot forever
[20:47] that's how I test as well
[20:57] * rharper steps away for a bit
[22:30] so I rebooted and rebuilt 10+ times and I can't reproduce
[22:30] it looks to be a very unlucky race condition
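For reference, a minimal sketch of the renderer idea discussed between 14:17 and 14:34: take the version 1 netconfig dict pasted at 14:17 and produce netifrc-style text from it alone, so the output depends only on the input and rendering twice gives the same result. `render_netifrc` is a hypothetical name, not cloud-init's renderer API, and it skips the NetworkState step entirely; it handles only physical interfaces with a DHCP subnet and assumes the netifrc `config_<iface>="dhcp"` form.

```python
def render_netifrc(netconfig):
    """Render a version 1 netconfig dict into netifrc-style text.

    Hypothetical sketch: covers only physical interfaces that have a
    DHCP subnet. Bonds, vlans and static addressing are left out.
    """
    if netconfig.get("version") != 1:
        raise ValueError("only version 1 netconfig is handled in this sketch")
    lines = []
    for item in netconfig.get("config", []):
        if item.get("type") != "physical":
            continue  # out of scope for this sketch
        subnet_types = [s.get("type") for s in item.get("subnets", [])]
        if "dhcp" in subnet_types:
            lines.append('config_{}="dhcp"'.format(item["name"]))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


# The netconfig example from the log at 14:17.
example = {
    "version": 1,
    "config": [
        {
            "name": "eth0",
            "subnets": [{"type": "dhcp"}],
            "mac_address": "fa:16:3e:00:10:ee",
            "type": "physical",
        }
    ],
}
print(render_netifrc(example), end="")
```

Because the function returns the whole file content rather than patching an existing file, writing its result over whatever is already on disk is one simple way to get the idempotent behaviour asked for in the log.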