/srv/irclogs.ubuntu.com/2019/09/04/#cloud-init.txt

mruffellHi cloud-init team, I opened https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1842562 05:36
ubot5Launchpad bug 1842562 in cloud-init (Ubuntu Eoan) "AWS: Add udev rule to set Instance Store device IO timeouts" [Medium,In progress]05:36
mruffellYou can ping me, or ddstreet if you have any questions. I hope cloud-init is the right place for it05:36
mruffellThere's still some debate going on in the SF case, but I think cloud-init is the best place05:37
Mechanismusok idgi, I'm trying to use {{ v1.local-hostname }} within my cloud-config.txt13:16
Mechanismusbut when I do I get unicode rendering errors13:17
Odd_BlokeMechanismus: What version of cloud-init are you using?  Where do you see the errors?13:28
Mechanismusversion: /usr/bin/cloud-init 19.1-1-gbaa47854-0ubuntu1~18.04.1 13:31
Mechanismuserrors when I run `cloud-init query --list-keys`13:31
MechanismusUnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte13:31
Odd_BlokeMechanismus: Oh, that's not good!  Could you file a bug at https://bugs.launchpad.net/cloud-init/+filebug and attach the tarball that `cloud-init collect-logs` generates, please?13:33
Mechanismuseh... I'll have to desensitize the file13:36
Mechanismus...after I find it13:36
Mechanismusit's being generated via terraform and supplied to an azure vm in user data13:36
Odd_BlokeFind it?  `cloud-init collect-logs` gathers all the data we would need, so you shouldn't need to go find anything. :)13:37
Mechanismusoh13:39
MechanismusI thought you meant the gzip that was provided to the vm at launch13:39
Odd_Bloke:)13:39
Mechanismusactually I have to redeploy the VM to get logs with specifically this situation13:42
Mechanismusthat'll take a minute13:42
Odd_Blokerharper: blackboxsw: We don't have a template for Oracle yet; which other template would you suggest basing it on?14:16
rharperOdd_Bloke: the openstack one ?14:16
rharperat least for now, it's mostly Openstack API right ?14:16
rharperuntil we switch datasources ?14:16
rharperit's been a while since I used the cli14:17
rharpersomething else might fit better14:17
Odd_BlokeI think I'll just launch instances using the web interface.  (I don't believe Oracle's user-facing API is OpenStack-compatible.)14:18
Odd_BlokeSo I guess I was really asking which one is most recently updated?14:19
Odd_BlokeLooks like Openstack probably is the best one.14:20
blackboxswmorning, yeah I15:12
blackboxswmorning, yeah I'd agree on basing the manual verification script on openstack or azure since you'll probably be calling oracle's api instead of a launch-oracle.py script as we don't have that yet15:14
MechanismusOdd_Bloke: ok so it looks like Azure's part of the cloud config is running, but my custom bits fail to merge in and I still get the error with cloud-init query15:36
MechanismusIs there anything I can look at in the logs for something I might be doing wrong before I actually open a bug?15:37
Odd_BlokeMechanismus: It's hard to know, because I don't really understand the problem you're seeing.  Honestly, a bug would make this a lot easier to work through.  Is there a particular reason you don't want to file one?15:39
MechanismusOdd_Bloke: not really except that if it's a bug then the turnaround time to get this working is arbitrary and I'm trying to get this working today15:40
MechanismusI'm about to try an alternative approach in generating my cloud config in terraform though which would let me work around this for now15:40
Odd_BlokeI mean, the same people who would help you in IRC are telling you to file a bug, so I'm not sure why you think the turnaround would be any different. ;)15:41
Odd_Blokeblackboxsw: I'm actually not going to document exactly how to launch an Oracle instance; the UI makes it fairly easy to work out, and I don't want us to have out-of-date docs when they change things.  I _will_ document the one thing that caught me out (remembering to add your SSH key).15:42
MechanismusI mean that if it's a matter of I'm trying to do something unsupported then I can fix that.  If the fact that the UnicodeDecodeError shouldn't happen even when I'm being dumb constitutes a bug then I get that, but getting the _machine_ working is kind of my top priority right now15:42
blackboxswOdd_Bloke: thanks, makes sense15:43
blackboxswthat's all I was really hoping, was if it was a complicated instance launch in any manner15:43
Odd_BlokeMechanismus: Well, we can always close out a bug Invalid if it turns out that you are doing something unsupported, but we would never expect to see a traceback.15:44
Odd_BlokeSo I expect there is a valid bug, even if it's "we should message better when we see bad input" or whatever.15:44
Odd_BlokeAnd a bug gives us a place to attach logs etc. and have discussion that won't get lost in IRC backlog.15:44
blackboxswwe can reflect the bug link in channel here too to improve response time on resolution15:45
blackboxswUnicodeDecodeError rings a bell when handling some of the cloud metadata in the past. but we probably can/should address that in cloud-init proper if it's causing your general cloud-init query --all to fail.15:46
MechanismusOdd_Bloke: I fully agree with you and will be happy to look into it once I fix the issue I was working on when I ran into this15:46
* blackboxsw just realized I'm waay out of date on this discussion. I'll read the origin of this conversation to catch up15:47
blackboxswMechanismus: couple things you can/should try    to test whether your given jinja query syntax is valid, on a booted vm you can run:    cloud-init query --format "{{ v1.local-hostname }}"  to see if it's an accessible template variable.15:52
blackboxswMechanismus: specifically for our v1 standardized metadata,    I *think* you needed {{v1.local_hostname }} instead of {{ v1.local-hostname}}   as the hyphen gets interpreted as subtraction15:53
blackboxswon my system:      cloud-init query --format "{{v1.local-hostname}}"15:53
blackboxswWARNING: Ignoring jinja template for query commandline: Undefined jinja variable: "local-hostname". Jinja tried subtraction. Perhaps you meant "local_hostname"15:53
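A minimal sketch of the hyphen problem described above: Jinja reads `v1.local-hostname` as `v1.local` minus `hostname`, so a hyphenated key is only reachable through an underscore alias. The helper name `add_underscore_aliases` and the sample data are hypothetical, for illustration only; this is not cloud-init's own code.

```python
def add_underscore_aliases(data):
    """Return a copy of `data` where every hyphenated dict key is also
    available under an underscore spelling (hypothetical helper)."""
    if isinstance(data, dict):
        result = {}
        for key, value in data.items():
            value = add_underscore_aliases(value)
            result[key] = value
            if isinstance(key, str):
                # Keep the original key and add the alias only if it is free.
                result.setdefault(key.replace("-", "_"), value)
        return result
    if isinstance(data, list):
        return [add_underscore_aliases(item) for item in data]
    return data


# Assumed sample data, shaped like the v1 keys mentioned in the channel.
instance_data = {"v1": {"local-hostname": "my-e1"}}
aliased = add_underscore_aliases(instance_data)
```

With the alias in place, a template can use `{{ v1.local_hostname }}` while the original `local-hostname` key stays available to non-template consumers.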
Mechanismusblackboxsw: Good point, v1.local-hostname is in instance-data.json but doesn't work with the query.  However, v1.local_hostname works, though that's exactly what I have in cloud-config on this machine with the errors15:54
blackboxswalso Mechanismus    inside the template, you can use python as a workaround... so if you could do something like {{ v1.local_hostname.decode('utf-8') }}15:55
blackboxswor {{ v1.keys() }} to see available subkeys under v115:55
blackboxswMechanismus: you could also check cloud-init query userdata  (which would provide your cloud-config yaml)  and you'd be able to process that content for your hostname: <myhost> or fqdn: declarations15:59
blackboxswbut that's a bigger lift :/15:59
smoserif i had to guess..16:00
smoser 0dc3a77f41f4544e4cb5a41637af7693410d4cdf 16:00
smoserwould fix Mechanismus16:01
smoseralthough that should not occur with python 316:01
blackboxswhrm, though I thought he was on v. 19.1.1 and that commitish was in ~18.516:03
smoseroh. you're right. i just looked at the date16:04
smoserand assumed the april commit didn't get into 19.116:05
blackboxswyeah true initially.16:05
blackboxswI mean, yes I thought so too initially16:05
blackboxswheh interesting on Azure for my SRU test16:06
smoserhttps://bugs.launchpad.net/cloud-init/+bug/1801364 is related, but not really.16:06
ubot5Launchpad bug 1801364 in cloud-init "persisting OpenStack metadata fails" [Undecided,Confirmed]16:06
blackboxswubuntu@my-e1:~$ sudo cloud-init query userdata16:06
blackboxsw../sethostname.yaml16:06
smoseri'm assuming he is not python 216:06
blackboxswI expected the metadata service to actually report the user-data, not the file name I used when launching the instance16:06
smoserare you sure you have userdata ?16:06
smoserand not just the name of a file ?16:07
blackboxswchecking my azcli launch command16:07
smoserto my knowledge 'az' takes only a custom-data blob in --custom-data16:08
smosernot a reference to a file16:08
blackboxswahh interesting, I specified a nonexistent file on --custom-data16:08
blackboxswif file does exist I think it gets populated properly.. .checking our latest SRU run16:09
Odd_BlokeYeah, my shell history strongly suggests you can pass a file to --custom-data.16:09
blackboxswsmoser: https://github.com/cloud-init/ubuntu-sru/blob/master/manual/azure-sru-19.2.21.txt 16:09
blackboxswyeah if the file does exist (sethostname.yaml in that example ^) it works and sets SRU-worked-<cloud>16:10
blackboxswbut if file doesn't exist, azure just provides the string to the vm16:10
blackboxswand doesn't error (because it's 'flexible' in allowing blob or file)16:10
Odd_Blokegcloud has --metadata-from-file distinct from --metadata, which I prefer, I think.16:11
blackboxsw+1 Odd_Bloke, yeah explicit intent/failures16:11
smoserthe other better path is @filename16:12
smoserlike curl does16:12
blackboxswtrue16:12
* blackboxsw tries that @<file> w/ azcli to see if it'll fail on file absent 16:13
blackboxswor even succeed on file presence16:13
blackboxswok failure when providing --custom-data @<file>16:14
blackboxswDeployment failed. Correlation ID: 91ddbf3f-e296-4ea4-aab7-2189c314fe66. {16:14
blackboxsw  "error": {16:14
blackboxsw    "code": "PropertyChangeNotAllowed",16:14
blackboxsw    "message": "Changing property 'customData' is not allowed.",16:14
blackboxsw    "target": "customData"16:14
blackboxsw  }16:14
blackboxsw}16:14
blackboxsw:)16:14
blackboxswwell, we missed cloud-init status meeting Monday due to US holiday. We'll shift it to next Monday, and I'll send an email to the list16:16
=== blackboxsw changed the topic of #cloud-init to: Reviews: http://bit.ly/ci-reviews | Meeting minutes: https://goo.gl/mrHdaj | Next status meeting Sept 9 16:15 UTC | cloud-init v 19.2 (07/17) | https://bugs.launchpad.net/cloud-init/+filebug
* Odd_Bloke is going to look at SRU verification for https://bugs.launchpad.net/cloud-init/+bug/1812857 16:33
ubot5Launchpad bug 1812857 in cloud-init "RuntimeError: duplicate mac found! both 'ens4' and 'bond0' have mac '9c:XX:XX:46:5d:91'" [Medium,Fix released]16:33
blackboxswOdd_Bloke: good deal, reviewing oracle run now. I just pushed https://github.com/cloud-init/ubuntu-sru/pull/44 with correct mtu v1 and v2 inputs (which fixed the diffs of v2 output so it is now limited to just dict ordering diffs)16:41
blackboxswOdd_Bloke: I can't remember w/ Oracle. Upon reboot upgraded cloud-init changed from detecting DataSourceOpenStackLocal to detecting DataSourceOpenStack (net). I realize it's the same datasource, but was a bit surprised that it switched to !Local detection16:46
* blackboxsw tries to see if we have that same transition for Ec2Local -> Ec216:47
blackboxswit may be worth peeking at 'Crawl of metadata service' in cloud-init.log post the clean reboot to see why cloud-init balks on OpenStackLocal post upgrade16:50
blackboxswmaybe there was an ephemeral dhcp response issue there?16:50
Odd_BlokeLooking.17:28
rharperblackboxsw: on your  pull request with the netplan v2 bits;  shouldn't you pull the mtu values from the devices in the verification ?17:31
blackboxswrharper: btw, you were right that netplan raises a warning about missing definitions for the bond_interfaces17:32
rharperyeah17:32
rharperso for verification there, I'd read the MTU values on the bond and member interfaces pre-upgrade (v1) and post-upgrade (v2)17:33
rharperit's fine to hard code the paths in the test since we're constructing the config (and interface names);17:33
blackboxswsure rharper agreed, I can grep -B 2 -i mtu     and we'll see the interfaces in most cases17:33
rharperI'm grabbing, bug #1806701 17:33
ubot5bug 1806701 in cloud-init "cloud-init may hang OS boot process due to grep for the entire ISO file when it is attached" [Medium,Fix released] https://launchpad.net/bugs/1806701 17:33
rharperblackboxsw: you can read /sys/class/net/<iface>/mtu17:34
rharperif you're not playing with ipv6 mtu17:34
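A small sketch of the sysfs read suggested above, for comparing MTU values on the bond and its members before and after an upgrade. The function name `read_mtu` and the `sysfs_root` parameter are assumptions added so the path can be overridden; only the `/sys/class/net/<iface>/mtu` path comes from the discussion.

```python
from pathlib import Path


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Return the MTU of `iface` as an int, read from sysfs.

    `sysfs_root` is a hypothetical override, useful when testing
    against a fake directory tree.
    """
    return int(Path(sysfs_root, iface, "mtu").read_text().strip())
```

Calling it for each interface pre-upgrade and post-upgrade and comparing the two results gives a direct check that does not depend on parsing rendered config.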
blackboxswrharper: that would be if I created a vm with that config and applied. I didn't do that, I was just running net-convert in the test17:34
rharperah17:34
rharperok17:34
rharperI saw NoCloud17:34
rharperso was thinking you were doing a VM17:34
blackboxswright, only because I did check on an lxc with -proposed enabled17:35
blackboxswto make sure our -proposed bits had the logic17:35
blackboxswinstead of just testing ti17:35
blackboxswtip17:35
rharperpok17:35
rharperthat's fine actually since it's about the netconfig generated17:35
Odd_BlokeOK, so there is a change in behaviour, and I think it's to do with network configuration; digging in more now.17:48
Odd_Bloke(When I said there weren't new tracebacks at stand-up, I was mistaken.)17:49
Odd_BlokeOK, I think the issue is coming from the classless static route support we now have for ephemeral DHCP.17:55
Odd_BlokeAnd if the interface already has an address, we handle failing to set it gracefully, but we will still attempt to apply the routes to it.17:56
Odd_BlokeAnd that fails, causing the DS to not be considered.17:56
Odd_BlokeGood catch, Chad.17:58
Odd_BlokeOK, I've got a fairly small patch which seems correct to me.  I'll propose it and we can discuss it.18:05
rharperOdd_Bloke: ah, yes, we really should have a net_is_up check in the oracle/openstack ds18:23
rharperin that, if networking is up, no need to bring up ephemeral DHCP18:23
rharperthat said, it can't hurt to be more defensive in Ephemeral DHCP as well18:23
Odd_Blokerharper: blackboxsw: https://code.launchpad.net/~daniel-thewatkins/cloud-init/+git/cloud-init/+merge/372289 <-- what do you think of that?18:26
blackboxswOdd_Bloke: ahh good deal. hrm. ok, so we caught a potential regression then. Sure, let's review what you've got when available and we'll get that in18:27
blackboxswreading now.18:28
rharperOdd_Bloke: left a comment, not quite sure what to do;  to me, if we called EphemeralDHCP, then I really expect it to do a DHCP not skip the dhcp + setup if the interface already has an IP ... ;  should we raise an exception instead?  and for Oracle/OpenStack, (or any user of EphemeralDHCP) we should check net.is_up(self.fallback_interface) before using the EphemeralDHCP18:32
Odd_BlokeHaving to check net.is_up before using it leads to slightly awkward code like get_metadata_from_imds in DataSourceAzure.py; if net.is_up(): do_thing() else: with EphemeralDHCP: do_thing()18:41
Odd_BlokeBut I agree that there's no point getting the lease when we're going to throw it away immediately without using any of it.18:41
blackboxswrharper: Odd_Bloke, probably fair to think about things that way, though existing behavior is to bail on all other setup if the interface already has an IP, regardless of Odd_Bloke's fix18:41
blackboxswand Odd_Bloke agree it is awkward to have every call site use is_net_up() or EphemeralDHCP. it'd be nice to have that failsafe logic within the EphemeralDHCP contextmgr.. maybe we could have a EphemeralDHCP(force=True) if we really want to force a dhclient run on an interface even if it already has config18:43
Odd_BlokeIn fact, I think get_metadata_from_imds is wrong because of this; it will report errors differently depending on whether or not an ephemeral lease was needed.18:46
Odd_Bloke(Not a big deal, but this is why avoiding having to spell out do_thing() twice is good.)18:47
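One way to avoid spelling out do_thing() twice is to move the "is the network already up?" decision into the context manager itself. The sketch below assumes three hypothetical callables (`is_up`, `bring_up`, `tear_down`) standing in for net.is_up() and the ephemeral DHCP setup and teardown; it is not the EphemeralDHCP implementation.

```python
from contextlib import contextmanager


@contextmanager
def ensure_network(is_up, bring_up, tear_down):
    """Yield with networking available, setting it up only when needed.

    Yields True if an ephemeral setup was performed, False otherwise.
    All three arguments are hypothetical stand-ins.
    """
    if is_up():
        yield False  # nothing was set up, so nothing to clean up
        return
    bring_up()
    try:
        yield True
    finally:
        tear_down()


def run_with_network(network_already_up):
    """Record the order of operations for a single do_thing() call."""
    calls = []
    with ensure_network(
        lambda: network_already_up,
        lambda: calls.append("bring_up"),
        lambda: calls.append("tear_down"),
    ):
        calls.append("do_thing")  # written once, whichever path was taken
    return calls
```

Because the body runs inside the same `with` block in both cases, errors raised by do_thing() propagate the same way whether or not an ephemeral lease was needed.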
blackboxswOdd_Bloke: do we have the traceback you saw on Oracle somewhere18:47
Odd_BlokeYou mean you can't see it in my terminal?18:47
blackboxswOdd_Bloke: no, I'm just sniffing your browser traffic to your banks18:47
rharperlol18:47
Odd_BlokeThere we go: https://paste.ubuntu.com/p/jt8hNMJjKb/ 18:47
blackboxswthanks man18:47
Odd_BlokeOh, OK, you could just have sniffed that URL then.18:48
Odd_BlokeBut I'll make it easier for you.18:48
Odd_BlokeYeah, so I think my fix is probably too far down the stack.18:48
* blackboxsw wonders really if we should be checking the same failure condition we already are for ['ip', '-family', 'inet', 'addr', 'add', cidr, 'broadcast',18:48
* blackboxsw self.broadcast, 'dev', self.interface],18:48
* blackboxsw The 'File exists' in stderr18:48
blackboxswas in we can try all setup commands and only queue cleanup for the commands which succeed18:49
Odd_BlokeWe should perhaps do that, but I don't think that's the root of the problem here.18:49
blackboxswand ignore the setup commands for routes or addrs that already exist18:49
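The idea floated here (try every setup command, queue cleanup only for the ones that succeed, and skip addresses or routes that already exist) could be sketched roughly as follows. The function name `apply_setup`, the injectable `run` parameter, and the (setup, cleanup) pair layout are assumptions for illustration; only the 'File exists' check on stderr comes from the discussion.

```python
import subprocess


def apply_setup(commands, run=subprocess.run):
    """Run (setup_cmd, cleanup_cmd) pairs and return the cleanups owed.

    Cleanups are returned in reverse order and only for setup commands
    that actually succeeded. A failure whose stderr mentions
    'File exists' is treated as "already configured" and skipped.
    """
    cleanups = []
    for setup_cmd, cleanup_cmd in commands:
        result = run(setup_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            cleanups.insert(0, cleanup_cmd)
        elif "File exists" in (result.stderr or ""):
            continue  # someone else set this up; do not undo it later
        else:
            raise RuntimeError(
                "setup failed: %r: %s" % (setup_cmd, result.stderr)
            )
    return cleanups
```

A fuller version would also need to unwind the cleanups already queued when a later command fails for some other reason; that is left out to keep the sketch short.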
Odd_BlokeWe should be able to know that we don't need to do DHCP at all here.18:49
blackboxswOdd_Bloke: agreed there too18:50
Odd_BlokeFWIW, we already do have support for not-DHCP'ing in the context manager, if we pass in a connectivity_url.18:50
Odd_BlokeSo the context manager already doesn't _always_ DHCP.18:50
blackboxswhrm, as in we could pass connectivity_url=self.metadata_address to EphemeralDHCPv4 maybe?18:53
blackboxswhrm no that wouldn't work, doesn't get set until you _crawl_metadata18:54
Odd_BlokeWe could refactor that though, I think.  Regardless, connectivity_url is broken because it doesn't consider 403s to be an indication that you have connectivity.18:59
Odd_Bloke(Which you obviously do, to get any sort of response!)18:59
rharperdo IMDS return 403s ?   would you want your connectivity url to do that ?18:59
Odd_Bloke403 indicates connectivity.19:00
Odd_Bloke(Perhaps the argument is named incorrectly. :p)19:00
Odd_BlokeAnd yes, on Oracle, `curl http://169.254.169.254` gives a 403.19:01
rharper*sigh*19:02
rharper=)19:02
Odd_BlokeThe same thing would happen on Google, too, at least; they expect a specific header in their requests.19:02
Odd_BlokeAnd connectivity_url doesn't allow specifying anything other than the URL string, obvs.19:02
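The point that a 403 still proves connectivity can be illustrated with a small standard-library check. This is a sketch under assumptions: the function name `has_connectivity` and its `headers` parameter are hypothetical and are not the connectivity_url code in cloud-init.

```python
import urllib.error
import urllib.request


def has_connectivity(url, timeout=5, headers=None):
    """Return True if any HTTP response at all comes back from `url`.

    An HTTP error status such as 403 still means a server answered,
    so it counts as connectivity. `headers` is a hypothetical extra
    for services that expect a specific request header.
    """
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # Must be caught before URLError, which is its parent class.
        return True
    except (urllib.error.URLError, OSError):
        return False  # refused, unreachable, timed out, ...
```

Only transport-level failures are reported as "no connectivity"; what the status code means is left to whoever crawls the metadata afterwards.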
Odd_BlokeLooks like nothing has ever used connectivity_url, so it wouldn't be super-surprising for there to be wrinkles with it, actually.19:09
Odd_BlokeI guess, to step back for a minute, is it worth fixing the OpenStack DS for Oracle when we're about to switch over to their dedicated DS?19:13
blackboxswOdd_Bloke: I guess I'm still trying to understand why the network is up and configured already in local timeframe after a reboot19:18
rharperblackboxsw: iscsi19:20
blackboxswahh ahh19:20
blackboxswOdd_Bloke: rharper.  I *think* it probably makes sense for this to go with Odd_Bloke's branch to avoid the time cost of !detecting OpenstackLocal. As that issue could potentially affect other private openstack clouds using iscsi root or providing network config on the kernel cmdline wouldn't it?19:26
Odd_BlokeWell, my branch is really too far down the stack.19:26
Odd_BlokeThere, if we've been given routes then we should be applying them regardless.19:27
Odd_BlokeThe change should be at least one frame further up, so that we don't even DHCP if we already have networking.19:28
rharperfirst, is this a regression on Oracle, or has it been this way ?  ie, do we need to apply fix and respin the SRU ?19:28
Odd_BlokeThis is a regression19:28
rharperrelated to the rfc3442 stuff ?19:28
Odd_BlokeYep.19:28
rharperI guess I don't understand why if we never went down this path before19:29
Odd_BlokeBecause we try to apply routes that already exist and don't handle that erroring.19:29
rharperbut previously we didnt ?  are we really DHCP'ing again on top of iscsi root ?19:29
rharperhow does that even work ?19:29
rharperwhere did the lease response come from ?19:29
Odd_BlokeWe ephemerally DHCP, and then in EphemeralIPv4Network._bringup_device the first util.subp call fails.  That failure is handled gracefully and, before, was the last thing that __enter__ did.19:31
Odd_BlokeHowever, __enter__ now unconditionally continues on to apply the routes that the DHCP response included, and that's what fails.19:31
rharperI see19:32
Odd_BlokeAnd that failure means that DataSourceOpenStackLocal doesn't find metadata, so we fall through to DataSourceOpenStack later on.19:32
Odd_Bloke(Which just uses the networking that the system already has, of course.)19:32
rharperwell, we set try dhcp to false for non-local19:33
rharperin the datasource19:33
Odd_BlokeRight.19:33
rharperwe really shouldn't DHCP if network is up19:33
Odd_BlokeYeah, agreed.19:33
rharperso we have new errors in local ,but I don't think anything functional fails;19:34
blackboxswrharper: right, we still ultimately detect init-network DataSourceOpenstack, just a time cost of failing @ init-local timeframe19:35
blackboxswand seeing more traces19:35
rharperwell, I wonder if non-iscsi we'd see a network-config failure19:35
rharperin the iscsi case, we already use iscsi network-config instead of ds network-config19:36
rharperif local failed to render network-config, then we write out fallback I think, then at net time, we can crawl, all is well;19:36
Odd_Blokeblackboxsw: What's the time cost in your view, OOI?19:40
Odd_BlokeAFAICT, the time taken by the network traffic dominates, and we have to pay that cost in either route.19:40
Odd_BlokeBut I may be missing another consequence.19:40
Odd_Bloke(I didn't notice the two different data sources, so I'm clearly not on my top game today, lol)19:41
blackboxswOdd_Bloke: not big, certainly subsecond cost. lemme it looked like it was  +00.09700s19:44
Odd_BlokeOK, cool, just making sure I wasn't missing something else.19:45
Odd_Blokehttps://bugs.launchpad.net/cloud-init/+bug/1842752 <-- the bug we just discussed me filing20:18
ubot5Launchpad bug 1842752 in cloud-init "Additional traceback in logs when using DataSourceOpenStackLocal on Oracle" [Low,Triaged]20:18
blackboxswthanks Odd_Bloke20:19
Odd_BlokeThe traceback does appear on upgrade.20:40
rharperwhat do we want to do about verifying https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1833192 20:55
ubot5Launchpad bug 1833192 in cloud-init (Ubuntu) "VMware: post custom script isn't run correctly" [Undecided,Fix released]20:55
Odd_Blokerharper: Could we just comment on the bug asking for help?20:59
Odd_BlokeAnd maybe reach out to what VMWare contacts we do have?20:59
rharperwe can ping them directly21:00
rharperI'll send an email21:00
blackboxswrharper: Odd_Bloke so I can validate that cloud-init behaves as expected, for bug #1840080 in the SRU21:45
ubot5bug 1840080 in cloud-init (Ubuntu) "cloud-init cc_ubuntu_drivers does not set up /etc/default/linux-modules-nvidia" [High,Fix released] https://launchpad.net/bugs/1840080 21:45
rharper\o/21:45
blackboxswit emits proper debconf-set-selections, yet ubuntu-drivers-common doesn't actually install linux-modules-nvidia packages21:45
blackboxswso, not really sure what we should do on this front. I *think* that behavior of cloud-init is correct, but we have yet to see the plumbing from ubuntu-drivers-common21:49
powersjblackboxsw, hit him up early tomorrow and ask; for now move on21:51
blackboxswyeah nothing else remains to move on to, waiting on CDO QA review, validation of ubuntu-drivers behavior, on aws GPU eoan instance, and I think Odd_Bloke is working the last remaining bug: #1812857 21:54
ubot5bug 1812857 in cloud-init "RuntimeError: duplicate mac found! both 'ens4' and 'bond0' have mac '9c:XX:XX:46:5d:91'" [Medium,Fix released] https://launchpad.net/bugs/1812857 21:54
blackboxswpowersj: I'll publish to copr el-testing. But, I think the rest of validation is grabbed/blocked or done.21:55
blackboxswso, we can touch base tomorrow to see if there is anything else of note that would require SRU-regen21:56
powersjsounds good21:56