[02:15] <prometheanfire> smoser: what's the difference between _write_network, _write_network_config, _bring_up_interfaces and _bring_up_interface?
[03:38] <prometheanfire> smoser: I have gentoo networking working, at least for dhcp, probably for static as well
[03:39] <prometheanfire> smoser: last two commits here https://github.com/prometheanfire/cloud-init
[03:46] <prometheanfire> only 'error' that happens doesn't seem to hurt anything
[03:46] <prometheanfire> 2016-08-22 03:44:21,495 - __init__.py[WARNING]: apply_network_config is not currently implemented for distribution '<class 'cloudinit.distros.gentoo.Distro'>'.  Attempting to use apply_network
[13:34] <smoser> prometheanfire, here now. reading your comments.
[13:34] <smoser> you gave the system bonding config ?
[13:34] <prometheanfire> smoser: hi
[13:34] <prometheanfire> smoser: no, the kernel just supports it
[13:34] <prometheanfire> and this breaks things
[13:35] <smoser> hm..
[13:35] <prometheanfire> because bonding_masters exists as a file within /sys/class/net
[13:35] <smoser> well, there is another fix for that, that i do need to get in. but i didn't believe that it was as simple as you say.
[13:35] <smoser> the ubuntu kernels *do* support bonding
[13:35] <prometheanfire> well, for us the module is loaded static
[13:35] <smoser> probably it's as a module though, and likely it isn't loaded when that runs.
[13:36] <prometheanfire> maybe that's why we hit it
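The problem described above is that bonding_masters appears in /sys/class/net as a plain file next to the real interfaces, so code that treats every entry as a device trips over it. A minimal sketch of one way to skip it, assuming real interfaces always show up as directories (via symlink). The function name list_net_devices is hypothetical, and this is not the fix in smoser's bond_name branch:

```python
import os


def list_net_devices(sys_class_net="/sys/class/net"):
    """Return entries under sys_class_net that look like network devices."""
    devices = []
    for name in sorted(os.listdir(sys_class_net)):
        path = os.path.join(sys_class_net, name)
        # Real interfaces resolve to directories; bonding_masters is a
        # regular file, so os.path.isdir() filters it out.
        if os.path.isdir(path):
            devices.append(name)
    return devices
```

On a kernel with bonding built in, this should return the interface names without bonding_masters.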
[13:36] <smoser> "module loaded static"
[13:36] <smoser> ?
[13:36] <smoser> you mean builtin
[13:36] <prometheanfire> ya
[13:36] <smoser> :)
[13:36] <prometheanfire> static kernel :D
[13:36] <prometheanfire> it's how I (personally) compile my kernels
[13:36] <smoser> ok.
[13:36] <prometheanfire> easy to ship
[13:37] <prometheanfire> this is the same code used in glean if it helps
[13:37] <smoser> glean ?
[13:38] <prometheanfire> simple-init?
[13:38] <prometheanfire> same thing
[13:38] <prometheanfire> https://github.com/openstack-infra/glean/
[13:38] <prometheanfire> https://github.com/openstack-infra/glean/blob/master/glean/cmd.py#L686
[13:45] <smoser> prometheanfire, so https://git.launchpad.net/~smoser/cloud-init/log/?h=bond_name has the fix for bond_master stuff
[13:46] <smoser> and wrt glean, i can't just take that, as license is not compatible.  cloud-init requires signing cla for contribution.
[13:46] <prometheanfire> right
[13:46] <smoser> :-(
[13:46] <prometheanfire> about the code
[13:46] <prometheanfire> I wrote it
[13:47] <prometheanfire> so I should be able to contribute it to both projects
[13:47] <prometheanfire> I'd think at least
[13:47] <smoser> oh. yeah, then you can.
[13:47] <smoser> yes.
[13:47] <prometheanfire> :D
[13:48] <smoser> the other comment is that '_write_network' is kind of the "legacy" mechanism
[13:48] <prometheanfire> in any case we should probably ignore the other interfaces, but if your bonding update code works then that solves the immediate problem
[13:50] <prometheanfire> so, should I convert to _write_networks?
[13:51] <smoser> we want "_write_network_config", and using a Renderer like ubuntu/debian and rhel do
[13:53] <smoser> unfortunately i think you're going to want more doc / info on the "network state" :)
[13:53] <prometheanfire> is netconfig the same thing as settings?
[13:54] <smoser> no. it's more like what is described at http://people.canonical.com/~rharper/curtin/topics/networking.html
[13:54] <smoser> and ubuntu and rhel basically load that thing into a 'network_state' and then render from it.
[13:54] <prometheanfire> ya, that's doable
[13:54] <prometheanfire> a useable datastructure
[13:55] <smoser> the one thing that i can tell you is that there are unit tests that you can easily run to poke around at what is happening
[13:55] <smoser> (and please do contribute unit tests for your new code)
[13:55] <prometheanfire> shouldn't need unit tests for that, but ok :P
[13:55] <smoser> prometheanfire, thank you. i'm really happy to have your contributions
[13:55] <smoser> well, for rendering... you can have quite complex state
[13:56] <prometheanfire> yarp
[13:56] <prometheanfire> at the moment I'll be using this patch on 0.7.7
[13:56] <prometheanfire> or will soon
[13:57] <prometheanfire> but I've tested it at least and it fulfills the simple use case
[13:57] <prometheanfire> and even better, it actually works
[13:57] <prometheanfire> current code is broken for networking on gentoo, it lays down debian style configs
[13:59] <prometheanfire> right now the code to use the new method looks over complicated
[14:04] <smoser> prometheanfire, yeah. i'm not really happy with it either. note though that it supports much more complex config
[14:05] <smoser> multiple ips per nic, ipv4 ipv6, bond, vlan..
[14:05] <prometheanfire> it does
[14:05] <smoser> so yes, it is more complicated
[14:05] <prometheanfire> I just don't even know where to start
[14:05] <prometheanfire> it's that hairy
[14:05] <smoser> yeah.
[14:07] <smoser> i'd suggest looking at rhel, and how that works.
[14:08] <smoser> and then just working from unit tests to get something sane.
[14:08] <smoser> her... i'll try to get some scaffolding in place for you
[14:08] <smoser> s/her/here/
[14:08] <prometheanfire> current code actually started from arch
[14:09] <smoser> what would you suggest as the name for the network config style ?
[14:09] <prometheanfire> we have a name, sec
[14:09] <smoser> ie, 'eni', 'sysconfig' ...
[14:10] <prometheanfire> https://wiki.gentoo.org/wiki/Netifrc
[14:10] <prometheanfire> netifrc
[14:11] <prometheanfire> looking at rhel's stuff in net/sysconfig
[14:11] <smoser> k. thanks..
[14:17] <prometheanfire> ok, this is an example of the netconfig
[14:17] <prometheanfire> {'version': 1, 'config': [{'name': 'eth0', 'subnets': [{'type': 'dhcp'}], 'mac_address': 'fa:16:3e:00:10:ee', 'type': 'physical'}]}
[14:18] <smoser> right. that's netconfig, then it gets loaded into 'NetworkState', and the renderers actually take the NetworkState.. a middle ground
[14:18] <prometheanfire> right
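To make the netconfig example above concrete, here is a rough sketch of turning that version 1 structure into netifrc-style lines for /etc/conf.d/net. It is only an illustration: it reads the raw dict directly instead of going through NetworkState as the real renderers do, it handles physical interfaces only, and the function name render_netifrc plus the 'address' and 'gateway' subnet keys for static subnets are assumptions not taken from the log.

```python
def render_netifrc(netconfig):
    """Render a version 1 netconfig dict into netifrc-style lines."""
    lines = []
    for entry in netconfig.get("config", []):
        if entry.get("type") != "physical":
            continue  # bonds, vlans, etc. are out of scope for this sketch
        name = entry["name"]
        addresses = []
        routes = []
        for subnet in entry.get("subnets", []):
            if subnet["type"] == "dhcp":
                addresses.append("dhcp")
            elif subnet["type"] == "static":
                # Assumes the address already carries its prefix length.
                addresses.append(subnet["address"])
                if "gateway" in subnet:
                    routes.append("default via " + subnet["gateway"])
        if addresses:
            lines.append('config_%s="%s"' % (name, " ".join(addresses)))
        if routes:
            lines.append('routes_%s="%s"' % (name, "\n".join(routes)))
    return "".join(line + "\n" for line in lines)


example = {
    "version": 1,
    "config": [
        {
            "name": "eth0",
            "subnets": [{"type": "dhcp"}],
            "mac_address": "fa:16:3e:00:10:ee",
            "type": "physical",
        }
    ],
}
print(render_netifrc(example), end="")
```

For the dhcp example pasted in the channel this yields a single config_eth0="dhcp" line.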
[14:18] <smoser> give me 10 more minutes, and i'll try to hand you a unit test thing to fill in
[14:18] <prometheanfire> cool
[14:19] <prometheanfire> have been looking at sysconfig.py, looks like it's just a longer version of my if stuff
[14:20] <prometheanfire> does netconfig allow you to have vlans in your bonds or bonds in your vlans, etc?
[14:23] <prometheanfire> oh, it's listed explicitly
[14:23] <prometheanfire> neat
[14:32] <smoser>  http://paste.ubuntu.com/23078466/
[14:33] <smoser> prometheanfire, ^ that should at least allow you to easily poke through the code and see your results easily.
[14:33] <smoser> i did not add the _write_network_config to gentoo distro but i'm guessing you can figure that out.
[14:34] <smoser> one thing to point out, the renderer should be idempotent. whatever network config is there, the one provided is what the system *should* do
[14:37] <prometheanfire> ya, that's the easy part
[14:37] <prometheanfire> I've already started on it
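The idempotency point can be sketched as a write step that always replaces the target file with the freshly rendered content, so the result is the same no matter what was on disk before or how many times it runs. This is a generic pattern, not cloud-init's own file helper, and the name write_rendered_config is hypothetical:

```python
import os
import tempfile


def write_rendered_config(path, content):
    """Replace path with content, regardless of what was there before."""
    directory = os.path.dirname(path) or "."
    # Write to a temporary file in the same directory, then swap it in.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except Exception:
        os.unlink(tmp_path)
        raise
```

Calling it twice with the same content leaves the same file, and any stale config from an earlier run is discarded rather than merged.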
[15:31] <smoser> mgagne_, i'd appreciate your thoughts on https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/303563
[15:31] <smoser> ideally i'd like to have that merged today, you mentioned it had some issues still i think
[15:34] <smoser> i think you still say 3.2 is busted.
[16:06] <mgagne_> smoser: I will test the whole patch set. Is it complete yet?
[16:06] <mgagne_> smoser: 3.2 got fixed with the bonding_slaves detection
[16:08] <smoser> mgagne_, then i think it is complete ...
[16:08] <smoser> pending your test :)
[16:08] <smoser> and my fuzzy memory
[16:08] <mgagne_> smoser: for some reason, cloud-init configures network twice. At this point, I don't care about the reason, it's beyond my expertise. So detecting bonding should do the job.
[16:08] <smoser> it configures networking once.
[16:08] <mgagne_> ok, will launch a build with your patches against 0.7.7~bzr1256-0ubuntu1~16.04.1
[16:08] <smoser> it renames devices twice.
[16:08] <mgagne_> smoser: It goes through the network json config parsing twice at least
[16:08] <smoser> i'm pretty sure that's the case, and i have to think again about why we rename network devices after we've configured.
[16:15] <mgagne_> so I will be applying this patch http://paste.ubuntu.com/23078862/ verbatim to cloud-init 0.7.7~bzr1256-0ubuntu1~16.04.1 and will boot on a baremetal with bonding+vlans to validate.
[16:19] <Tim_> hi
[17:03] <prometheanfire> thanks
[17:09] <smoser> prometheanfire, merged your branch
[17:09] <prometheanfire> yarp
[17:09] <prometheanfire> will rebase my integration branch when your stuff merges
[17:23] <prometheanfire> I'll see if I can test your patch
[17:55] <mgagne_> smoser: so I tested the patches. Somehow I managed to reproduce an intermittent bug my coworker had. Default gateway fails to configure and server doesn't ping.
[17:55] <smoser> hmm... i think maybe rharper might know something.
[17:55] <smoser> mgagne_, xenial, right ?
[17:56] <mgagne_> yes
[17:56] <smoser> ifupdown is very "fun".
[17:56] <mgagne_> smoser: is there anything I can look at? default gw is configured in post-up with route add || true
[17:56] <mgagne_> so I'm not sure how I'm supposed to debug that
[17:56] <smoser> and i know that rharper has been doing some hair pulling over bonds recently.
[17:56] <smoser> so you can get into it, mgagne ?
[17:57] <mgagne_> so we never supported ubuntu 16.04, even without bonding yet so I'm not sure if it's a known issue with bonding or cloud-init.
[17:57] <mgagne_> any logs I can pull to help debug?
[18:10] <prometheanfire> smoser: reviewed your patch, worksforme
[18:11] <smoser> mgagne_, grab /var/log/cloud-init.log
[18:12] <prometheanfire> also, closed the other merge request
[18:13] <smoser> mgagne_, can you get to it while its up?
[18:13] <smoser> or are you just able to shut down and collect files
[18:13] <mgagne_> yea, will get the logs after my meeting =)
[18:13] <mgagne_> will get the whole /var/log if needed
[18:14] <smoser> if you can get while its up, please get output of
[18:14] <smoser> ifconfig -a
[18:14] <smoser> and any thing else you might find useful
[18:14] <smoser> systemctl status
[18:14] <smoser> would be good
[20:06] <mgagne_> if post-up fails, will the output be logged?
[20:09] <rharper> mgagne_: any stderr from the ifup will be capture in the networking.service log, so you should see something in systemctl status -l networking
[20:09] <mgagne_> ok, I ran that command and didn't see anything
[20:09] <rharper> mgagne_: from your gist though, the devices and routes all came up as configured
[20:09] <mgagne_> will rerun
[20:09] <mgagne_> to make sure
[20:09] <mgagne_> check cloud-init-output.log, you won't see the default gw
[20:09] <mgagne_> only link local routes
[20:11] <rharper> in your gist, do you have the etc/network/interfaces.d/50-cloud-init.cfg file ?
[20:11] <smoser> rharper, interfaces-cloud-init.txt
[20:11] <mgagne_> yes
[20:11] <smoser> https://gist.github.com/mgagne/fbc1b05412f41426f2e248acd5efad14#file-interfaces-cloud-init-txt
[20:12] <rharper> smoser: ah right
[20:12] <mgagne_> added systemctl status -l networking output
[20:12] <mgagne_> so I suspect that *maybe* the default route is added but something removes it later?
[20:12] <mgagne_> or will || true hide the failure?
[20:13] <rharper> mgagne_: in your gist, the bond0.602 default gw is the one you expect ?  (the launchpad bug had other config)
[20:13] <mgagne_> yes
[20:14] <mgagne_> I thought it would be configured with the gateway stanza but ¯\_(ツ)_/¯
[20:17] <smoser> Odd_Bloke, around ?
[20:18] <smoser>  test_exception_fetching_fabric_data_doesnt_propagate
[20:18] <smoser> why would i not want that to propagate?
[20:21] <mgagne_> rharper: is there anything I can do to help debug? I don't mind rebuilding an image with debug config or whatever.
[20:21] <rharper> mgagne_: a plain route -n  would be nice
[20:21] <rharper> and the original network_data.json;
[20:22] <mgagne_> rharper: I added the default route already so I can SSH and pull logs
[20:22] <rharper> it's applying some routes; I just can't see why it wouldn't do the post up
[20:23] <mgagne_> rharper: added to gist
[20:24] <mgagne_> rharper: I will try to reboot a 2nd server which didn't have the issue and see if I can reproduce after multiple reboots
[20:24] <rharper> ok
[20:24] <mgagne_> rharper: let me know if you would prefer to get SSH access for further debug, this can be done
[20:24] <rharper> ok
[20:30] <smoser> rangerpbzzzz, around ?
[20:30] <smoser> wonder if it's ok if i open a bug and assign it to you.
[20:34] <rharper> mgagne_: so I can recreate the case where the cloud-init-output does not contain the default route; but post-up on bond0.602 does run and work;  maybe we could add a cloud-init final command to run route -n so we can see that?  in xenial, cloud-init writes the files and networking.service is doing an ifup -a (which will bring up any non physical devices ;  the physical devices with bond-master will create the bond0 and
[20:34] <rharper> enslave them) and then the ifup -a will trigger an ifup on bond0.602 and bond0.612;  they'll run and run the post-up which should add the default gw you need;
[20:35] <mgagne_> rharper: so you think that: cloud-init runs route -n, doesn't see default gw at this point in time but later route should be configured by ifup?
[20:35] <rharper> I'm not entirely certain but in my recreate;  the output info doesn't show the bond.vlan route; but when I login and run route; it's fully up
[20:35] <mgagne_> because it's true that the /32 route doesn't show in route output. this means something is running later to add routes.
[20:36] <rharper> I don't know when it runs to collect the network status
[20:36] <rharper> but possibly too soon or some other reason
[20:36] <rharper> not sure if smoser has more details
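Since the question here is whether a default gateway exists at a given moment, one way to check from a script (as an alternative to eyeballing route -n) is to parse the kernel routing table in /proc/net/route. A minimal sketch, assuming a little-endian host. The function name default_gateways is hypothetical, the sample addresses are made up, and nothing here is what cloud-init itself uses to build its info table:

```python
import socket
import struct


def default_gateways(route_table_text):
    """Return (iface, gateway) pairs for default routes in /proc/net/route text."""
    gateways = []
    rows = route_table_text.strip().splitlines()[1:]  # skip the header row
    for row in rows:
        fields = row.split()
        if len(fields) < 8:
            continue
        iface, destination, gateway, mask = fields[0], fields[1], fields[2], fields[7]
        # A default route has an all-zero destination and an all-zero mask.
        if destination == "00000000" and mask == "00000000":
            # On little-endian hosts the hex value is byte-reversed.
            packed = struct.pack("<L", int(gateway, 16))
            gateways.append((iface, socket.inet_ntoa(packed)))
    return gateways


sample = (
    "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n"
    "bond0.602\t00000000\t0101A8C0\t0003\t0\t0\t0\t00000000\t0\t0\t0\n"
    "bond0.602\t0001A8C0\t00000000\t0001\t0\t0\t0\t00FFFFFF\t0\t0\t0\n"
)
print(default_gateways(sample))
```

An empty result at the time the check runs would match the symptom discussed above, where only link-local routes are present.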
[20:36] <mgagne_> could it be the slaves link aren't fully up and therefore routes aren't applied yet since it's in post-up?
[20:36] <rharper> yes
[20:36] <rharper> slaves can take some time
[20:37] <rharper> bonding scripts will wait up to 600 seconds for a bond to join
[20:37] <rharper> err , 60 seconds
[20:37] <smoser> hm.
[20:37] <rharper> (60 * 0.1)
[20:37] <mgagne_> 6 seconds? :D
[20:37] <smoser> cloud-init writes network status during 'init' stage
[20:37] <rharper> *sigh*  600 * 0.1
[20:37] <smoser> which should be after static networking is up
[20:37] <smoser> so if that runs before all of the 'ifup' stuff is finished, then that is a bug
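The wait rharper settles on above, 600 tries at 0.1 seconds each for a total of 60 seconds, is a plain bounded polling loop. A sketch of that pattern follows; it is not the actual ifupdown bonding script, and wait_for with its parameters is a hypothetical helper:

```python
import time


def wait_for(condition, attempts=600, interval=0.1, sleep=time.sleep):
    """Poll condition() up to `attempts` times, sleeping `interval` between tries."""
    for _ in range(attempts):
        if condition():
            return True
        sleep(interval)
    return False

# Worst-case wait with the defaults: 600 * 0.1 = 60 seconds.
```

The sleep argument is injectable only so the loop can be exercised without actually waiting a minute.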
[20:39] <smoser> # systemctl cat cloud-init.service | grep networking
[20:39] <smoser> After=cloud-init-local.service networking.service
[20:39] <rharper> smoser: no, I can see the route info now;  I was looking at the top of the -output file before I added the config;  I definitely see the default routes running but; this is a VM versus baremetal;
[20:39] <smoser> Requires=networking.service
[20:39] <mgagne_> but Net device info shows the interface as up
[20:39] <mgagne_> I don't know where it takes the status but it means slaves are up too?
[20:40] <rharper> http://paste.ubuntu.com/23079501/
[20:40] <rharper> it should look like that
[20:40] <rharper> I only added the one bond with default route (and used active-backup on a second nic in a VM);  but the output should look similar in number of routes
[20:42] <rharper> but it is odd that during the dump of the route in mgagne_ case, there is nic message from kernel about being up;
[20:43] <rharper> the info table runs at Up 48.87 , but the nic up message isn't until 57
[20:44] <mgagne_> so I'm rebooting in loop to try to reproduce the problem and so far, no luck
[20:44] <mgagne_> coworker says that if you reboot a node with the problem, gw is configured properly and your issue is fixed.
[20:44] <rharper> yeah, the switch delay
[20:45] <mgagne_> so I'm wondering if it's something cloud-init does at that time
[20:45] <mgagne_> which isn't done in next boot
[20:46] <mgagne_> like renaming an interface
[20:46] <rharper> you can force cloud-init to re-run by nuking /var/lib/cloud/*  in the instance before rebooting
[20:46] <rharper> renaming happens on each boot
[20:46] <rharper> or an attempt
[20:46] <mgagne_> by cloud-init?
[20:46] <rharper> yes
[20:46] <mgagne_> ok, will nuke /var/lib/cloud/* on another node I have
[20:47] <mgagne_> and reboot forever
[20:47] <prometheanfire> that's how I test as well
[20:57]  * rharper steps away for  a bit
[22:30] <mgagne_> so I rebooted and rebuilt 10+ times and I can't reproduce
[22:30] <mgagne_> it looks to be a very unlucky race condition