prometheanfire | smoser: what's the difference between _write_network, _write_network_config, _bring_up_interfaces and _bring_up_interface? | 02:15 |
---|---|---|
prometheanfire | smoser: I have gentoo networking working, at least for dhcp, probably for static as well | 03:38 |
prometheanfire | smoser: last two commits here https://github.com/prometheanfire/cloud-init | 03:39 |
prometheanfire | only 'error' that happens doesn't seem to hurt anything | 03:46 |
prometheanfire | 2016-08-22 03:44:21,495 - __init__.py[WARNING]: apply_network_config is not currently implemented for distribution '<class 'cloudinit.distros.gentoo.Distro'>'. Attempting to use apply_network | 03:46 |
=== Takumo is now known as Jagmilleurs | ||
=== Jagmilleurs is now known as Takumo | ||
smoser | prometheanfire, here now. reading your comments. | 13:34 |
smoser | you gave the system bonding config ? | 13:34 |
prometheanfire | smoser: hi | 13:34 |
prometheanfire | smoser: no, the kernel just supports it | 13:34 |
prometheanfire | and this breaks things | 13:34 |
smoser | hm.. | 13:35 |
prometheanfire | because bonding_masters exists as a file within /sys/class/net | 13:35 |
smoser | well, there is another fix for that, that i do need to get in. but i didn't believe that it was as simple as you say. | 13:35 |
smoser | the ubuntu kernels *do* support bonding | 13:35 |
prometheanfire | well, for us the module is loaded static | 13:35 |
smoser | probably its as a module though, and likely it isnt loaded when that runs. | 13:35 |
prometheanfire | maybe that's why we hit it | 13:36 |
smoser | "module loaded static" | 13:36 |
smoser | ? | 13:36 |
smoser | you mean builtin | 13:36 |
prometheanfire | ya | 13:36 |
smoser | :) | 13:36 |
prometheanfire | static kernel :D | 13:36 |
prometheanfire | it's how I (personally) compile my kernels | 13:36 |
smoser | ok. | 13:36 |
prometheanfire | easy to ship | 13:36 |
prometheanfire | this is the same code used in glean if it helps | 13:37 |
smoser | glean ? | 13:37 |
prometheanfire | simple-init? | 13:38 |
prometheanfire | same thing | 13:38 |
prometheanfire | https://github.com/openstack-infra/glean/ | 13:38 |
prometheanfire | https://github.com/openstack-infra/glean/blob/master/glean/cmd.py#L686 | 13:38 |
smoser | prometheanfire, so https://git.launchpad.net/~smoser/cloud-init/log/?h=bond_name has the fix for bond_master stuff | 13:45 |
smoser | and wrt glean, i can't just take that, as license is not compatible. cloud-init requires signing cla for contribution. | 13:46 |
prometheanfire | right | 13:46 |
smoser | :-( | 13:46 |
prometheanfire | about the code | 13:46 |
prometheanfire | I wrote it | 13:46 |
prometheanfire | so I should be able to contribute it to both projects | 13:47 |
prometheanfire | I'd think at least | 13:47 |
smoser | oh. yeah, then you can. | 13:47 |
smoser | yes. | 13:47 |
prometheanfire | :D | 13:47 |
smoser | the other comment is that '_write_network' is kind of the "legacy" mechanism | 13:48 |
prometheanfire | in any case we should probably ignore the other interfaces, but if your bonding update code works then that solves the immediate problem | 13:48 |
prometheanfire | so, should I convert to _write_networks? | 13:50 |
smoser | we want "_write_network_config", and using a Renderer like ubuntu/debian and rhel does | 13:51 |
smoser | unfortunately i think you're going to want more doc / info on the "network state" :) | 13:53 |
prometheanfire | is netconfig the same thing as settings? | 13:53 |
smoser | no. its more like what is described at http://people.canonical.com/~rharper/curtin/topics/networking.html | 13:54 |
smoser | and ubuntu and rhel basically load that thing into a 'network_state' and then render from it. | 13:54 |
prometheanfire | ya, that's doable | 13:54 |
prometheanfire | a useable datastructure | 13:54 |
smoser | the one thing that i can tell you is that there are unit tests that you can easily run to poke around at what is happening | 13:55 |
smoser | (and please do contribute unit tests for your new code) | 13:55 |
prometheanfire | shouldn't need unit tests for that, but ok :P | 13:55 |
smoser | prometheanfire, thank you. i'm really happy to have your contributions | 13:55 |
smoser | well, for rendering... you can have quite complex state | 13:55 |
prometheanfire | yarp | 13:56 |
prometheanfire | at the moment I'll be using this patch on 0.7.7 | 13:56 |
prometheanfire | or will soon | 13:56 |
prometheanfire | but I've tested it at least and it fulfills the simple use case | 13:57 |
prometheanfire | and even better, it actually works | 13:57 |
prometheanfire | current code is broken for networking on gentoo, it lays down debian style configs | 13:57 |
prometheanfire | right now the code to use the new method looks over complicated | 13:59 |
smoser | prometheanfire, yeah. i'm not really happy with it either. note though that it supports much more complex config | 14:04 |
smoser | multiple ips per nic, ipv4 ipv6, bond, vlan.. | 14:05 |
prometheanfire | it does | 14:05 |
smoser | so yes, it is more complicated | 14:05 |
prometheanfire | I just don't even know where to start | 14:05 |
prometheanfire | it's that hairy | 14:05 |
smoser | yeah. | 14:05 |
smoser | i'd suggest looking at rhel, and how that works. | 14:07 |
smoser | and then just working from unit tests to get something sane. | 14:08 |
smoser | her... i'll try to get some scaffolding in place for you | 14:08 |
smoser | s/her/here/ | 14:08 |
prometheanfire | current code actually started from arch | 14:08 |
smoser | what would you suggest as the name for the network config style ? | 14:09 |
prometheanfire | we have a name, sec | 14:09 |
smoser | ie, 'eni', 'sysconfig' ... | 14:09 |
prometheanfire | https://wiki.gentoo.org/wiki/Netifrc | 14:10 |
prometheanfire | netifrc | 14:10 |
prometheanfire | looking at rhel's stuff in net/sysconfig | 14:11 |
smoser | k. thanks.. | 14:11 |
prometheanfire | ok, this is an example of the netconfig | 14:17 |
prometheanfire | {'version': 1, 'config': [{'name': 'eth0', 'subnets': [{'type': 'dhcp'}], 'mac_address': 'fa:16:3e:00:10:ee', 'type': 'physical'}]} [ ok ] | 14:17 |
smoser | right. thats netconfig, then it gets loaded into 'NetworkState', and the renderers actually take the NetworkState.. a middle ground | 14:18 |
prometheanfire | right | 14:18 |
smoser | give me 10 more minutes, and i'll try to hand you a unit test thing to fill in | 14:18 |
prometheanfire | cool | 14:18 |
prometheanfire | have been looking at sysconfig.py, looks like it's just a longer version of my if stuff | 14:19 |
prometheanfire | does netconfig allow you to have vlans in your bonds or bonds in your vlans, etc? | 14:20 |
prometheanfire | oh, it's listed explicitly | 14:23 |
prometheanfire | neat | 14:23 |
smoser | http://paste.ubuntu.com/23078466/ | 14:32 |
smoser | prometheanfire, ^ that should at least allow you to easily poke through the code and see your results easily. | 14:33 |
smoser | i did not add the _write_network_config to gentoo distro but i'm guessing you can figure that out. | 14:33 |
smoser | one thing to point out, the renderer should be idempotent. whatever network config is there, the one provided is what the system *should* do | 14:34 |
prometheanfire | ya, that's the easy part | 14:37 |
prometheanfire | I've already started on it | 14:37 |
smoser | mgagne_, i'd appreciate your thoughts on https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/303563 | 15:31 |
smoser | ideally i'd like to have that merged today, you mentioned it had some issues still i think | 15:31 |
smoser | i think you still say 3.2 is busted. | 15:34 |
mgagne_ | smoser: I will test the whole patch set. Is it complete yet? | 16:06 |
mgagne_ | smoser: 3.2 got fixed with the bonding_slaves detection | 16:06 |
smoser | mgagne_, then i think it is complete ... | 16:08 |
smoser | pending your test :) | 16:08 |
smoser | and my fuzzy memory | 16:08 |
mgagne_ | smoser: for some reasons, cloud-init configures network twice. At this point, I don't care about the reason, it's beyond my expertise. So detecting bonding should do the job. | 16:08 |
smoser | it configures networking once. | 16:08 |
mgagne_ | ok, will launch a build with your patches against 0.7.7~bzr1256-0ubuntu1~16.04.1 | 16:08 |
smoser | it renames devices twice. | 16:08 |
mgagne_ | smoser: It goes through the network json config parsing twice at least | 16:08 |
smoser | i'm pretty sure thats the case, and i have to think again about why we rename network devices after we've configured. | 16:08 |
mgagne_ | so I will be applying this patch http://paste.ubuntu.com/23078862/ verbatim to cloud-init 0.7.7~bzr1256-0ubuntu1~16.04.1 and will boot on a baremetal with bonding+vlans to validate. | 16:15 |
Tim_ | hi | 16:19 |
prometheanfire | thanks | 17:03 |
smoser | prometheanfire, merged your branch | 17:09 |
prometheanfire | yarp | 17:09 |
prometheanfire | will rebase my integration branch when your stuff merges | 17:09 |
prometheanfire | I'll see if I can test your patch | 17:23 |
mgagne_ | smoser: so I tested the patches. Somehow I managed to reproduce an intermittent bug my coworker had. Default gateway fails to configure and server doesn't ping. | 17:55 |
smoser | hmm... i think maybe rharper might know something. | 17:55 |
smoser | mgagne_, xenial, right ? | 17:55 |
mgagne_ | yes | 17:56 |
smoser | ifupdown is very "fun". | 17:56 |
mgagne_ | smoser: is there anything I can look at? default gw is configured in post-up with route add || true | 17:56 |
mgagne_ | so I'm not sure how I'm supposed to debug that | 17:56 |
smoser | and i know that rharper has been doing some hair pulling over bonds recently. | 17:56 |
smoser | so you can get into it, mgagne ? | 17:56 |
mgagne_ | so we never supported ubuntu 16.04, even without bonding yet so I'm not sure if it's a known issue with bonding or cloud-init. | 17:57 |
mgagne_ | any logs I can pull to help debug? | 17:57 |
prometheanfire | smoser: reviewed your patch, worksforme | 18:10 |
smoser | mgagne_, grab /var/log/cloud-init.log | 18:11 |
prometheanfire | also, closed the other merge request | 18:12 |
smoser | mgagne_, can you get to it while its up? | 18:13 |
smoser | or are you just able to shut down and collect files | 18:13 |
mgagne_ | yea, will get the logs after my meeting =) | 18:13 |
mgagne_ | will get the whole /var/log if needed | 18:13 |
smoser | if you can get while its up, please get output of | 18:14 |
smoser | ifconfig -a | 18:14 |
smoser | and any thing else you might find useful | 18:14 |
smoser | systemctl status | 18:14 |
smoser | would be good | 18:14 |
mgagne_ | if post-up fails, will the output be logged? | 20:06 |
rharper | mgagne_: any stderr from the ifup will be capture in the networking.service log, so you should see something in systemctl status -l networking | 20:09 |
mgagne_ | ok, I ran that command and didn't see anything | 20:09 |
rharper | mgagne_: from your gist though, the devices and routes all came up as configured | 20:09 |
mgagne_ | will rerun | 20:09 |
mgagne_ | to make sure | 20:09 |
mgagne_ | check cloud-init-output.log, you won't see the default gw | 20:09 |
mgagne_ | only link local routes | 20:09 |
rharper | in your gist, do you have the etc/network/interfaces.d/50-cloud-init.cfg file ? | 20:11 |
smoser | rharper, interfaces-cloud-init.txt | 20:11 |
mgagne_ | yes | 20:11 |
smoser | https://gist.github.com/mgagne/fbc1b05412f41426f2e248acd5efad14#file-interfaces-cloud-init-txt | 20:11 |
rharper | smoser: ah right | 20:12 |
mgagne_ | added systemctl status -l networking output | 20:12 |
mgagne_ | so I suspect that *maybe* the default route is added but something removes it later? | 20:12 |
mgagne_ | or will || true hide the failure? | 20:12 |
rharper | mgagne_: in your gist, the bond0.602 default gw is the one you expect ? (the launchpad bug had other config) | 20:13 |
mgagne_ | yes | 20:13 |
mgagne_ | I thought it would be configured with the gateway stanza but ¯\_(ツ)_/¯ | 20:14 |
smoser | Odd_Bloke, around ? | 20:17 |
smoser | test_exception_fetching_fabric_data_doesnt_propagate | 20:18 |
smoser | why would i not want that to propagate? | 20:18 |
mgagne_ | rharper: is there anything I can do to help debug? I don't mind rebuilding an image with debug config or whatever. | 20:21 |
rharper | mgagne_: a plain route -n would be nice | 20:21 |
rharper | and the original network_data.json; | 20:21 |
mgagne_ | rharper: I added the default route already so I can SSH and pull logs | 20:22 |
rharper | it's applying some routes; I just can't see why it wouldn't do the post up | 20:22 |
mgagne_ | rharper: added to gist | 20:23 |
mgagne_ | rharper: I will try to reboot a 2nd server which didn't have the issue and see if I can reproduce after multiple reboots | 20:24 |
rharper | ok | 20:24 |
mgagne_ | rharper: let me know if you would prefer to get SSH access for further debug, this can be done | 20:24 |
rharper | ok | 20:24 |
smoser | rangerpbzzzz, around ? | 20:30 |
smoser | wonder if its ok if i open a bug and assign it to you. | 20:30 |
rharper | mgagne_: so I can recreate the case where the cloud-init-output does not contain the default route; but post-up on bond0.602 does run and work; maybe we could add a cloud-init final command to run route -n so we can see that? in xenial, cloud-init writes the files and networking.service is doing an ifup -a (which will bring up any non physical devices ; the physical devices with bond-master will create the bond0 and | 20:34 |
rharper | enslave them) and then the ifup -a will trigger an ifup on bond0.602 and bond0.612; they'll run and run the post-up which should add the default gw you need; | 20:34 |
mgagne_ | rharper: so you think that: cloud-init runs route -n, doesn't see default gw at this point in time but later route should be configured by ifup? | 20:35 |
rharper | I'm not entirely certain but in my recreate; the output info doesn't show the bond.vlan route; but when I login and run route; it's fully up | 20:35 |
mgagne_ | because it's true that the /32 route doesn't show in route output. this means something is running later to add routes. | 20:35 |
rharper | I don't know when it runs to collect the network status | 20:36 |
rharper | but possibly too soon or some other reason | 20:36 |
rharper | not sure if smoser has more details | 20:36 |
mgagne_ | could it be the slaves link aren't fully up and therefore routes aren't applied yet since it's in post-up? | 20:36 |
rharper | yes | 20:36 |
rharper | slaves can take some time | 20:36 |
rharper | bonding scripts will wait up to 600 seconds for a bond to join | 20:37 |
rharper | err , 60 seconds | 20:37 |
smoser | hm. | 20:37 |
rharper | (60 * 0.1) | 20:37 |
mgagne_ | 6 seconds? :D | 20:37 |
smoser | cloud-init writes network status during 'init' stage | 20:37 |
rharper | *sigh* 600 * 0.1 | 20:37 |
smoser | which should be after static networking is up | 20:37 |
smoser | so if that runs before all of the 'ifup' stuff is finished, then that is a bug | 20:37 |
smoser | # systemctl cat cloud-init.service | grep networking | 20:39 |
smoser | After=cloud-init-local.service networking.service | 20:39 |
rharper | smoser: no, I can see the route info now; I was looking at the top of the -output file before I added the config; I definitely see the default routes running but; this is a VM versus baremetal; | 20:39 |
smoser | Requires=networking.service | 20:39 |
mgagne_ | but Net device info shows the interface as up | 20:39 |
mgagne_ | I don't know where it takes the status but it means slaves are up too? | 20:39 |
rharper | http://paste.ubuntu.com/23079501/ | 20:40 |
rharper | it should look like that | 20:40 |
rharper | I only added the one bond with default route (and used active-backup on a second nic in a VM); but the output should look similar in number of routes | 20:40 |
rharper | but it is odd that during the dump of the route in mgagne_ case, there is nic message from kernel about being up; | 20:42 |
rharper | the info table runs at Up 48.87 , but the nic up message isn't until 57 | 20:43 |
mgagne_ | so I'm rebooting in loop to try to reproduce the problem and so far, no luck | 20:44 |
mgagne_ | coworker says that if you reboot a node with the problem, gw is configured properly and your issue is fixed. | 20:44 |
rharper | yeah, the switch delay | 20:44 |
mgagne_ | so I'm wondering if it's something cloud-init does at that time | 20:45 |
mgagne_ | which isn't done in next boot | 20:45 |
mgagne_ | like renaming an interface | 20:46 |
rharper | you can force cloud-init to re-run by nuking /var/lib/cloud/* in the instance before rebooting | 20:46 |
rharper | renaming happens on each boot | 20:46 |
rharper | or an attempt | 20:46 |
mgagne_ | by cloud-init? | 20:46 |
rharper | yes | 20:46 |
mgagne_ | ok, will nuke /var/lib/cloud/* on an other node I have | 20:46 |
mgagne_ | and reboot forever | 20:47 |
prometheanfire | that's how I test as well | 20:47 |
* rharper steps away for a bit | 20:57 | |
mgagne_ | so I reboot and rebuilt 10+ times and I can't reproduce | 22:30 |
mgagne_ | it looks to be a very unlucky race condition | 22:30 |
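
The bonding_masters problem raised at 13:35 (a plain file sitting in /sys/class/net next to the real interfaces) can be illustrated with a small Python sketch. This is not the fix in smoser's bond_name branch; it only assumes that real devices appear as directories while control files such as bonding_masters are regular files. The names `list_net_devices` and `demo` are hypothetical.

```python
import os
import tempfile


def list_net_devices(sys_class_net="/sys/class/net"):
    """Return the names under sys_class_net that look like network devices.

    Assumption: every real device is a directory (usually via a symlink),
    while control files such as bonding_masters are regular files.
    """
    return sorted(
        name
        for name in os.listdir(sys_class_net)
        if os.path.isdir(os.path.join(sys_class_net, name))
    )


def demo():
    # Build a fake sysfs-like tree so the sketch runs on any machine.
    with tempfile.TemporaryDirectory() as root:
        for dev in ("eth0", "lo"):
            os.mkdir(os.path.join(root, dev))
        # A builtin bonding driver exposes this as a plain file.
        with open(os.path.join(root, "bonding_masters"), "w") as fh:
            fh.write("")
        return list_net_devices(root)
```

Calling `demo()` lists only the two fake devices; the bonding_masters file is skipped because it is not a directory.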
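
The netconfig example pasted at 14:17 can be turned into netifrc-style lines with a rough Python sketch like the one below. It is a deliberate simplification: it reads the version 1 netconfig dict directly instead of going through NetworkState as the real renderers do, it handles only physical interfaces with dhcp or static subnets, and it assumes a static subnet carries its prefix in `address`. The function name `render_netifrc` is hypothetical and is not part of cloud-init.

```python
def render_netifrc(netconfig):
    """Render a version 1 netconfig dict as netifrc (/etc/conf.d/net) text.

    The whole file content is rebuilt from the input on every call, so
    rendering the same config twice yields the same text.
    """
    lines = []
    for iface in netconfig.get("config", []):
        if iface.get("type") != "physical":
            continue  # bonds, vlans and so on are out of scope here
        name = iface["name"]
        addrs = []
        routes = []
        for subnet in iface.get("subnets", []):
            if subnet["type"] == "dhcp":
                addrs.append("dhcp")
            elif subnet["type"] == "static":
                # Assumption: address already includes the prefix length.
                addrs.append(subnet["address"])
                if "gateway" in subnet:
                    routes.append("default via " + subnet["gateway"])
        if addrs:
            lines.append('config_%s="%s"' % (name, " ".join(addrs)))
        if routes:
            lines.append('routes_%s="%s"' % (name, "\n".join(routes)))
    return "\n".join(lines) + "\n"


# The netconfig pasted in the channel at 14:17.
EXAMPLE = {
    "version": 1,
    "config": [
        {
            "name": "eth0",
            "subnets": [{"type": "dhcp"}],
            "mac_address": "fa:16:3e:00:10:ee",
            "type": "physical",
        }
    ],
}
```

For `EXAMPLE`, `render_netifrc` returns a single line, `config_eth0="dhcp"`, followed by a newline.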