[14:14] Hi there. I
[14:14] https://gist.github.com/mazzy89/340dae524474ca01d4e8aa1ee6472598
[14:14] I have this use case. any suggestion?
[14:47] Hi! Does cloud-init update vendor-data on instance reboot?
[14:49] does cloud-init overwrite an existing file if the file is created in write_files?
=== stelucz_ is now known as stelucz
[15:53] \(´O`)/
[15:56] nice one
[15:58] mazzy: yes, write_files will overwrite pre-existing files with content provided in your write_files section
[15:59] blackboxsw thank you
[16:00] o/
[16:00] morning all
[16:07] ok I think it's time for our bi-weekly meeting. probably going to be a short one this week.
[16:08] more time for office hours then?
[16:08] #startmeeting Cloud-init bi-weekly status meeting
[16:08] Meeting started Mon Jan 22 16:08:22 2018 UTC. The chair is blackboxsw. Information about MeetBot at http://wiki.ubuntu.com/meetingology.
[16:08] Available commands: action commands idea info link nick
[16:08] certainly ajorg :) (on office hours)
[16:09] Welcome to another episode of cloud-init bi-weekly status. We'll chat about cloud-init updates and in progress work, and we'll drop into office hours for ongoing discussions/bug work etc.
[16:10] #topic Recent changes
[16:11] Just walking through git-log for what we have committed in the last couple of weeks, here's the brief summary
[16:12] thx smoser
[16:12] - shorten the message in the exception per powersj feedback
[16:12] - Use the same botocore session so the patched changes stick.
[16:12] - fix bad use of %
[16:12] - Fix console_log, improve comments and raise PlatformError on.
[16:12] - tests: Fix EC2 Platform to return console output as bytes.
[16:12] - tests: remove zesty as supported OS to test [Joshua Powers]
[16:12] - Do not log warning on config files that represent None.
(LP: #1742479)
[16:12] Launchpad bug 1742479 in cloud-init (Ubuntu) "setting manual_cache_clean causes warning" [Medium,Fix released] https://launchpad.net/bugs/1742479
[16:12] - tests: Use git hash pip dependency format for pylxd.
[16:12] - tests: add integration requirements text file [Joshua Powers]
[16:12] - MAAS: add check_instance_id based off oauth tokens. (LP: #1712680)
[16:12] - tests: update apt sources list test [Joshua Powers]
[16:12] - tests: clean up image properties [Joshua Powers]
[16:12] - tests: rename test ssh keys to avoid appearance of leaking private keys.
[16:12] Launchpad bug 1712680 in maas-images "cloud-init re-generates network config every reboot overwriting manual admin changes on CentOS." [Undecided,New] https://launchpad.net/bugs/1712680
[16:12] [Joshua Powers]
[16:12] - tests: Enable AWS EC2 Integration Testing [Joshua Powers]
[16:12] - cli: cloud-init clean handles symlinks (LP: #1741093)
[16:12] Launchpad bug 1741093 in cloud-init "cloud-init clean traceback on instance dir symlink" [Low,Fix committed] https://launchpad.net/bugs/1741093
[16:13] What's being patched in botocore?
[16:13] So a number of changes went into integration test related work, separating out requirements files.
[16:14] MAASDatasource now also has smarter cache handling based on oauth token renewal from the maas server
[16:14] so botocore is used by integration tests only as a mechanism to talk to the instance under test... looking back at the specifics here
[16:14] it might have just been shuffling out how and where we define the dependency
[16:14] blackboxsw: (my 'paste' to you was bad... http://paste.ubuntu.com/26438113/ is better, showing only those on master, not my local branch that was currently checked out )
[16:15] heh, oopsie daisy let's paste again inline then
[16:15] - tests: remove zesty as supported OS to test [Joshua Powers]
[16:15] - Do not log warning on config files that represent None.
(LP: #1742479)
[16:15] - tests: Use git hash pip dependency format for pylxd.
[16:15] - tests: add integration requirements text file [Joshua Powers]
[16:15] - MAAS: add check_instance_id based off oauth tokens. (LP: #1712680)
[16:15] - tests: update apt sources list test [Joshua Powers]
[16:15] - tests: clean up image properties [Joshua Powers]
[16:15] - tests: rename test ssh keys to avoid appearance of leaking private keys.
[16:15] [Joshua Powers]
[16:15] - tests: Enable AWS EC2 Integration Testing [Joshua Powers]
[16:15] - cli: cloud-init clean handles symlinks (LP: #1741093)
[16:15] Launchpad bug 1742479 in cloud-init (Ubuntu) "setting manual_cache_clean causes warning" [Medium,Fix released] https://launchpad.net/bugs/1742479
[16:15] Launchpad bug 1712680 in maas-images "cloud-init re-generates network config every reboot overwriting manual admin changes on CentOS." [Undecided,New] https://launchpad.net/bugs/1712680
[16:15] Launchpad bug 1741093 in cloud-init "cloud-init clean traceback on instance dir symlink" [Low,Fix committed] https://launchpad.net/bugs/1741093
[16:15] ok the real deal, that looks better
[16:16] ahh ajorg that interim commit message on botocore was about integration tests caching the session information during testing so we don't recreate that session with every ssh connection to the instance
[16:16] just a little time savings per review comments on powersj branch I believe
[16:17] okay, so nothing that needs to get upstreamed to botocore?
[16:17] I don't think so, powersj smoser I have vague recollection of someone filing an upstream botocore issue. did we have to do that for something else though?
[16:18] https://github.com/boto/botocore/issues/1351
[16:18] that was the issue smoser put in ^
[16:18] nice recall powersj thanks.
[16:18] #link https://github.com/boto/botocore/issues/1351
[16:20] ajorg: you can read that bug.
imo they have a data loss error, but not one that they can easily fix without causing failures in places that previously ran fine.
[16:20] I'll ask them to re-open it.
[16:21] At the very least they should answer your last.
[16:21] thanks.
[16:22] Generally anything significant that we have landed (and any in-progress work) should be available at the following link.
[16:22] #link https://trello.com/b/hFtWKUn3/daily-cloud-init-curtin
[16:23] anything else we should note over the last couple weeks?
[16:23] otherwise I'll switch to ongoing work topic
[16:24] #topic In-progress Development
[16:25] As you may have seen last week, we've gotten through a few passes and discussions around dojordan's branch to define pre-provisioning
[16:25] #link https://code.launchpad.net/~dojordan/cloud-init/+git/cloud-init/+merge/334341
[16:26] some of that discussion resulted in a new context manager: EphemeralDHCPv4 to support a sandboxed dhclient request on an instance.
[16:27] this context manager affects Ec2 datasource a bit as it encapsulates all of the dhcp request -> EphemeralIPV4Network calls that Ec2 was doing
[16:28] there may be a couple other datasources that follow suit with this type of sandboxed dhcp request in weeks to come
[16:28] glad it turned out to be generally useful rather than only specifically to ec2
[16:28] absolutely
[16:30] Some other in-progress bits look like we might try focusing a bit more on chrony support and getting robjo's branches some more eyes.
[16:31] and some work on Ubuntu snappy support per the snappy and snap config modules.
[16:31] dojordan: i just put one comment on your mp. /me thanks dojordan again for his patience.
[16:32] rharper: smoser powersj anything more in the immediate pipeline that I'm missing/
[16:32] ?
[16:32] blackboxsw: we should get the EphemeralDHCP thingy into the digital ocean datasource also.
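The context-manager idea discussed at 16:26 to 16:28 (set up a temporary network, use it, always tear it down) can be illustrated with a minimal sketch. This is not cloud-init's EphemeralDHCPv4 implementation: `ephemeral_network`, `bring_up`, `tear_down`, and the fake callables are hypothetical names introduced only for illustration, and the address is a documentation-range placeholder.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_network(bring_up, tear_down, nic):
    """Hypothetical sketch: hold a temporary network config for one block."""
    lease = bring_up(nic)          # e.g. a sandboxed DHCP request (assumed)
    try:
        yield lease                # the caller talks to the network here
    finally:
        tear_down(nic, lease)      # always undo the temporary config


# Stand-in callables that only record what happened, so the sketch runs
# without touching any real interface.
events = []


def fake_bring_up(nic):
    events.append(("up", nic))
    return {"interface": nic, "address": "192.0.2.10"}


def fake_tear_down(nic, lease):
    events.append(("down", nic))


with ephemeral_network(fake_bring_up, fake_tear_down, "eth0") as lease:
    events.append(("use", lease["address"]))
```

Because the teardown sits in a `finally` clause, it also runs when the body raises, which is the property that makes this pattern reusable across datasources.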
[16:32] blackboxsw: a reply to the network discussion on the list from the azure folks and robjo
[16:32] I took another look at https://code.launchpad.net/~yeazelm/cloud-init/+git/cloud-init/+merge/331897 and saw that origin/master seems to be failing some of the integration tests too.
[16:32] (at least for me, locally, on a 16.04 instance)
[16:32] ahhh right forgot about all your work there rharper, thanks!
[16:33] ajorg: https://jenkins.ubuntu.com/server/view/cloud-init/job/cloud-init-ci-nightly/
[16:33] that is nightly run of trunk
[16:33] #link https://jenkins.ubuntu.com/server/view/cloud-init/job/cloud-init-ci-nightly/
[16:34] I'll try blackholing IMDS on my instance. Could be that's interfering with something.
[16:35] it is red, but 218 (green) and 219 (red) used the same git hash on trunk (5cc0b19b8).
[16:35] I'll follow up during office hours
[16:36] can you give me example of your failures ? we had "disk full" errors recently on our jenkins, so that might be the cause of the issue for 291.
[16:36] s/291/219/
[16:36] I don't remember seeing that traceback recently. w/ warning messages present in cloud-init
[16:36] powersj: ?
can you explain lxc timeout failure at
[16:36] https://jenkins.ubuntu.com/server/view/cloud-init/job/cloud-init-ci-nightly/219/consoleFull
[16:37] smoser: we discovered that our qemu-migration test was installing lxd from the archive and causing conflicts with the snap installed lxd
[16:37] I have a message to christian to prevent it, and I have already cleaned it up
[16:37] so new runs should pass
[16:37] 2018-01-22 16:19:03,550 - tests.cloud_tests - WARNING - test case: modules/ssh_import_id failed TestSshImportId.test_no_stages_errors with: AssertionError: 1 != 0 : errors ['(\'ssh-import-id\', ProcessExecutionError("Unexpected error while running command.\\nCommand: [\'sudo\', \'-Hu\', \'ubuntu\', \'ssh-import-id\', \'gh:powersj\', \'lp:smoser\']\\nExit code: 1\\nReason: -\\nStdout: -\\nStderr: -",))'] were encountered in stage m
[16:38] hm.. well, that will hit launchpad.net over https
[16:38] cloud-init-output.log probably has more info (should be collected)
[16:38] the actual error is: File "/var/lib/jenkins/slaves/torkoal/workspace/cloud-init-ci-nightly/tests/cloud_tests/platforms/instances.py", line 142, in _wait_for_system
[16:38] raise OSError('timeout: after {}s system not started'.format(time))
[16:38] it is because when the qemu tests installed lxd it didn't initialize lxd networking
[16:39] so no IP is received
[16:39] ajorg: would you have had outbound access to launchpad https ? if not, then that'd be expected failure.
[16:40] oh, and i guess 'gh:powersj' (github)
[16:40] smoser: I'll check some things, but in short yes. Maybe lxc is being weird?
[16:40] i dont like our user names in that test though...
[16:40] smoser: we could use the bot instead
[16:42] smoser: it's a public ec2 instance with no special outbound rules, and I can connect to public https sites from a normal session.
[16:44] hrm, ok let's chat about what we can do to anonymize or drop that type of test data if we can
[16:44] probably time to kick over to office hours
[16:45] #topic Office Hours (next 30 minutes)
[16:45] powersj: well, i think i'd prefer some public key that we state "no one has the private key for this."
[16:46] obviously we could lie about that, but one would *expect* that you and I would gain access to the system using our public keys.
[16:46] it doesn't make me feel a lot better that a bot could/can.
[16:47] Is there a way to limit integration testing to a specific test?
[16:47] Feel free to bring up any topic/bugs/branches/features you'd like discussion on. We can also continue our discussion on the ssh key imports in testing
[16:47] (takes a long time to run the full suite)
[16:48] ajorg: yes
[16:48] (reverse-i-search)`cloud_t': python3 -m tests.cloud_tests run --os-name=artful --platform=nocloud-kvm --preserve-data --data-dir=../results --verbose -t modules/locale -t modules/set_password
[16:48] thanks, that should help
[16:48] ajorg: you can specify the test names (like modules/set_password) and modules/locale in this test
[16:48] yeah those are short ones I frequently test with
[16:49] http://paste.ubuntu.com/26438334/
[16:49] that is what i use. and yeah... we've discussed that integration test could be easier to run :)
[16:49] #link http://paste.ubuntu.com/26438334/
[16:50] nice 1
[16:51] smoser: to have a public key we know nobody has a private key for would that mean we'd need a separate github account (or maybe just an additional key associated w/ our bot account in gh
[16:51] * blackboxsw checks github for authorizing multiple keys.
[16:51] hrm, that wouldn't work as we need gh:ubuntu-server-bot (one key) n/m
[16:52] I've got meetings most of today, so I'll have to follow up later. thanks everyone!
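The `-t` invocation quoted at 16:48 can be assembled programmatically when the same subset of tests is run often. The sketch below only rebuilds that command line from the flags shown in the log; `build_cloud_tests_cmd` is a hypothetical helper name, not part of cloud-init, and the defaults simply mirror the quoted command.

```python
import shlex


def build_cloud_tests_cmd(tests, os_name="artful", platform="nocloud-kvm",
                          data_dir="../results"):
    """Hypothetical helper: build the argv for a limited integration run."""
    cmd = [
        "python3", "-m", "tests.cloud_tests", "run",
        "--os-name=" + os_name,
        "--platform=" + platform,
        "--preserve-data",
        "--data-dir=" + data_dir,
        "--verbose",
    ]
    # One "-t <name>" pair per selected test, as in the quoted command.
    for name in tests:
        cmd += ["-t", name]
    return cmd


cmd = build_cloud_tests_cmd(["modules/locale", "modules/set_password"])
# Print a copy-pasteable shell line; pass `cmd` to subprocess.run to execute.
print(" ".join(shlex.quote(part) for part in cmd))
```

Returning a list rather than a string keeps the arguments safe to hand to `subprocess.run` without shell quoting concerns.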
[16:52] thanks ajorg
[16:59] so, bot account for the time being is better than powersj owning the testing world ;)
[16:59] but I'm not too concerned about it as this are supposed to be throw away instances
[16:59] but I'm not too concerned about it as there instances under test are supposed to be throw away instances
[17:00] *these instances*.... anyway
[17:04] blackboxsw: right. it would require users on both those services .
=== hrybacki is now known as hrybacki_mtg
[17:14] alrighty. think we're at the close of office hours. Last call?
[17:16] Thanks for your time and contributions to cloud-init folks!
[17:16] #endmeeting
[17:16] Meeting ended Mon Jan 22 17:16:43 2018 UTC.
[17:16] Minutes: http://ubottu.com/meetingology/logs/cloud-init/2018/cloud-init.2018-01-22-16.08.moin.txt
[17:26] blackboxsw: thanks
[17:26] np, just posting the notes to cloud-init.github.io now
=== hrybacki_mtg is now known as hrybacki
[18:45] https://hackmd.io/OwBgpgxgRlwIwFooBMBMAzBAWYECcCAHIbgtIXgGyEDMIElcEQA=
[18:46] blackboxsw, powersj rharper . i just put that together in order to go fresh system to functional "test my branch".
[18:46] mainly for ajorg's request of "does trunk work".
[18:46] reading
[18:48] smoser: -t is super fancy sauce
[18:48] * rharper has learned something new for today
[18:49] is it worth folding a bit of that into readthedocs content for cloudinit? I'm not sure where to 'host' that info (as it's a good top-level view)
[18:49] yet it references tools that aren't cloudinit proper
[18:50] smoser: why not use lxd via snap?
[18:50] that way those instructions are not tied to xenial
[18:50] well, i first tried not getting a new one.
[18:50] just using what was in the image.
[18:50] but something failed... a container didnt get an IP address.
[18:50] so i thought "ok, just get something newer".
[18:51] apt still seems easier and since we already reference other apt things, seems just easier to recommend apt
[18:51] and i honestly didnt know how apt-installed versus snap installed get along
[18:51] rharper: you can also
[18:51] apt-get install lxd/xenial-backports
[18:52] ok
[18:52] also someone could use tree_run to build their local tree and run the tests all in one
[18:52] yeah, i knew there was something for that.
[18:52] i'm fine to change that.
[18:53] but "how to build a deb" seems useful doc anyway. and as it is right now it already recommended installing all these deps
[18:53] so... might as well use them. rather than launching a container and putting them in there.
[18:53] smoser: also nifty
[18:53] along the way i find
[18:53] http://paste.ubuntu.com/26438886/
[18:53] :-(
[18:55] goodness
[18:55] smoser: so, according to this: https://github.com/systemd/systemd/issues/2912 and just verifying this in bionic; systemd-networkd (configured via cloud-init/netplan); on ip link set down, the networkd v4 dhcp client will re-acquire leases; that's certainly new behavior w.r.t say xenial/ifupdown (confirming that now in a xenial image)
[18:56] s/set down/set down; set up/
[19:06] appears that dhclient does this just fine as well;
[19:15] rharper: ? you're saying that isc-dhcp does that ?
[19:16] powersj: ubuntu@ec2-18-218-147-181.us-east-2.compute.amazonaws.com
[19:16] there is failure there right now on ntp test.
[19:16] $ time ./tools/tox-venv citest python3 -m tests.cloud_tests run -v --os-name=xenial --preserve-data --data-dir=/tmp/results.short.d --test tests/cloud_tests/testcases/modules/ntp_servers.py
[19:16] i'm looking at it.
[19:21] smoser: yeah, what's in xenial appears to bring the interface back up.
[19:21] ah, but routing info isn't updated
[19:21] lemme compare that to networkd
[19:22] rharper: you're saying if i run:
[19:22] ip link set eth0 down
[19:22] that something will magically bring it back up?
[19:22] i dont see that in a container here.
[19:22] no, that bouncing the link state will restore the connectivity
[19:23] the unplug (set down); and replug (set up) restores interface config (but not routing in isc-dhclient); in networkd, I can see the networkd dhclient reacquire a lease and apply it
[19:24] rharper: well, in isc-dhcp its not so much that it "restores interface config", its just that nothing removed that config. so putting it back up keeps it.
[19:24] but the routes get dropped and do not get restored.
[19:25] are you suggesting that 'link set down eth0' is the same thing as if the hypervisor pulled the cable ?
[19:25] yeah, just walking through the link state change so we can capture what does and does not work here w.r.t the need for something
[19:26] that's what the azure email is suggesting, that they can toggle the link state
[19:26] you can look at the code they posted which watches the interface oper/link state and , I think to work around the isc-dhcp client not updating the route-interface, launches another dhclient instance
[19:27] yeah, under networkd; it does bring the route for the bounced interface back up;
[19:53] * smoser wishes ajorg would show up.
[20:20] @smoser, I answered your comment. During the re-init of the EphemeralDHCPv4 context manager we do actually want to look for the fallback nic again. It shouldn't change names but if it does we would never find it if we only run find_fallback_nic once.
[20:25] dojordan: it wont change names. and you'd see different errors for sure if it did.
[20:26] i hadn't considered the nic getting renamed ... its a bug if it does get renamed under us
[20:26] but i had considered only another nic coming online
[20:26] look once... first nic is eth1
[20:26] look again, first nic is eth0
[20:26] smoser: once we've renamed it, the specific nic cannot get renamed by udev; it would have to be some other program doing that
[20:26] when do we rename ?
[20:26] us being cloud init, or ubuntu?
[20:26] in local netconfig, apply_nic_names
[20:26] in the init (non network) no renaming has taken place.
[20:27] rharper: yeah, we're running before that here.
[20:27] got it, so during local we can assume it wont change
[20:27] I think it happens before networking stage otherwise the network config can come up
[20:27] ?*
[20:27] dojordan: in this case "us" == cloud-init.
[20:27] smoser: are we "paused" before stages calls apply_network_config() ?
[20:28] dojordan: we're kind of all sorts of foobarred if other things are changing nic names on us.
[20:28] at that point.
[20:28] the pause for IMDS (polling network metadata) is before we have rendered fallback networking.
[20:28] but what's the harm of looking for the fallback nic multiple times? It is up to the caller (of the context manager) to decide
[20:28] happening all in cloud-init.local ("pre-network")
[20:29] dojordan: i just think it'd be harder to figure it out. and opens up a failure case.
[20:29] boot, dhcp on eth0, hit MD, poll a bit...
[20:29] then get a 404
[20:30] dhcp accidentally on 'anic0'
[20:30] 404 won't call find_fallback_nic
[20:30] and then dead
[20:30] or dhcp or anything
[20:30] i thought it was 404 that caused it to re-try dhcp
[20:30] only non 200-299 / 400 or unhandled exception (timeout, socket error, etc)
[20:30] but whatever it is that causes that re-try dhcp. if the second time it goes off on another nic, then we're done.
[20:30] nope, 404 means we hit the metadata server, but nothing is available for us yet
[20:31] " if the second time it goes off on another nic, then we're done." - why
[20:31] i assumed that only one nic is connected to the network that has the MD on it.
[20:31] if all of them are, and a dhcp on any would be sufficient, then you're right.
[20:31] but it still is odd.
[20:31] to use one nic, and then use another the next time.
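The retry behaviour debated at 20:28 to 20:31 can be sketched as a polling loop that resolves the NIC once and reuses it for every DHCP retry, which is the option smoser argues for. This is a sketch under stated assumptions, not the code in dojordan's branch: `poll_metadata`, `find_nic`, `run_dhcp`, and `fetch` are hypothetical names, and the status handling follows one reading of the log (404 means the metadata server was reached but has nothing yet, so no new DHCP; other failures redo DHCP).

```python
def poll_metadata(find_nic, run_dhcp, fetch, max_attempts=5):
    """Hypothetical sketch: poll a metadata service over a single NIC."""
    nic = find_nic()               # resolved once, never re-evaluated
    run_dhcp(nic)
    for _ in range(max_attempts):
        try:
            status, body = fetch()
        except OSError:            # timeout, socket error, and similar
            run_dhcp(nic)          # retry DHCP on the *same* NIC
            continue
        if 200 <= status < 300:
            return nic, body       # data is ready
        if status == 404:
            continue               # server reachable, nothing for us yet
        run_dhcp(nic)              # any other status: renew the lease
    return nic, None
```

A real loop would sleep between attempts; that is left out to keep the sketch short. Resolving the NIC inside the loop instead would give the behaviour dojordan describes, at the cost of possibly switching interfaces between attempts.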
[20:32] so question, if we unplug a nic and plug it back in from hyper-v, we would need to guarantee it will keep the iface name
[20:33] otherwise i am fine with that change
[20:33] err, assuming disconnect / connect doesnt change the name
[20:33] oh right. i'd forgot that you might do that.
[20:33] disconnect as in "pull the cable" ?
[20:33] or as in "pull the nic"
[20:34] pull the cable wont rename.
[20:34] pull the nic, maybe.
[20:35] dojordan: so... will all nics ever plugged into the system be able to reach the MD if they dhcp ?
[20:35] ie, is that a feature of the platform ?
[20:35] I will confirm with the team doing that work. don't want to assume yes but i think so
[21:24] dojordan: http://paste.ubuntu.com/26440131/
[21:24] what do you think about that ?
[21:24] if i *did* get some debug messages there, that'd clearly indicate what nic we were using.
[21:25] I like it. sounds good to me. I'll fix some indentation errors too
[21:26] :)
[21:41] @smoser, pushed, running to lunch. Take a look when you get a chance
[21:59] https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/336458
[21:59] that took longer than it should have to diagnose.
[22:00] hmm ci isn't happy again
[22:04] seems like some lxd issue
[22:06] I'm looking at it
[22:07] thanks
[22:33] dojordan: run is looking better this time. sorry about that
[22:33] we had some lxd issues over the weekend where another project's tests messed with the config
[22:35] hrm hitting timeouts while pulling packages in ec2 testing the ec2 ntp fix Failed to fetch http://us-east-2.ec2.archive.ubuntu.com/ubuntu/pool/main/n/ntp/sntp_4.2.8p10+dfsg-5ubuntu3.1_amd64.deb 504 Gateway Time-out [IP: 52.15.107.13 80]
[22:37] hey.
[22:38] to get this out of my buffer on a ec2 system
[22:38] http://paste.ubuntu.com/26440452/
[22:38] pylxd returns string not bytes from execute.
[22:38] joy
[22:38] so i tried to have it collect systemd journal, but can't.
[22:38] i'll file bug on pylxd tomorrow.
[22:38] we can work around by lxc cmdline as shown in that paste.
[23:17] Thanks for restarting @powersj. @smoser, anything else, or can we ship it?
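The str-versus-bytes mismatch mentioned at 22:38 is the kind of thing a test harness can guard against at its boundary. The sketch below is a generic normalizer, not the workaround from the paste and not pylxd's API: `to_bytes` is a hypothetical helper name. It only unifies the type; it cannot recover binary data (such as journal output) that was already lost when something decoded it to text.

```python
def to_bytes(data, encoding="utf-8"):
    """Hypothetical helper: return command output as bytes."""
    if isinstance(data, bytes):
        return data                    # already bytes, pass through
    if isinstance(data, str):
        return data.encode(encoding)   # text came back, encode it
    raise TypeError("expected str or bytes, got %s" % type(data).__name__)
```

Callers that compare output against byte literals can then wrap every result in `to_bytes` regardless of which backend produced it.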