[14:53] hello. Can anyone help me debug a cloud init issue ? I am trying to get cloud-init to work with centos on Azure. The official cloudinit in centos (0.7.9) hangs for a while and does nothing, eventually. So i installed the very latest and greatest cloudinit from sources, however now it gives me "util.py[WARNING]: No instance datasource found! Likely bad things to come!"
[14:54] although i have a datasource_list: [Azure] in a config file in conf.d
[14:54] what changed in the newer releases of cloudinit ?
[14:55] it seems I do have a /usr/lib/python2.7/site-packages/cloudinit/sources containing a DataSourceAzure.py
[15:23] Hello. Is there an easy way to use macros/variables that contain the cloud-instance IP address and hostname in a write_files module?
[15:30] mornin folks. GreatSnoopy have you tried our daily copr builds? https://copr.fedorainfracloud.org/coprs/g/cloud-init/cloud-init-dev/
[15:31] actually yes
[15:31] that was before trying the source
[15:32] the same behavior applies with those builds
[15:32] I tried the source version as a last resort
[15:32] that is version 17.1.x off of our master. The no datasource found is intriguing. I'll look over our Azure changes to see if there's something indicative of a prob there. GreatSnoopy, so I'd be curious to see a paste of your /var/log/cloud-init.log. It should report that it attempts to run Azure datasource and complains if something is amiss
[15:33] just a moment
[15:35] http://pastebin.centos.org/435761/15111921/
[15:36] for the record, the above is produced by running cloud-init --debug init
[15:47] GreatSnoopy: +1 on that. ok so you upgraded from 0.7.5 -> 0.7.9 originally and saw this problem too?
[15:48] blackboxsw: so the succession of events is the following:
[15:49] first, used 0.7.9 that is available in epel. That hangs for like 2 minutes, does nothing in the end when run manually.
And if the machine is rebooted, it just hangs there indefinitely and with ssh stopped is basically a bricked VM
[15:49] so i went trying newer cloud-init
[15:49] first used that copr repo
[15:50] which gave me the behavior of finishing fast but also doing nothing and complaining about unavailable data sources
[15:50] as in the log
[15:50] then i tried the actual sources
[15:50] installed dependencies and the cloud-init distribution with pip install .
[15:51] (and -r requirements.txt)
[15:51] but this solved nothing
[15:51] so basically, the copr build and the raw, unpackaged build installed by hand (mis)behave in the same way
[15:51] but differently than 0.7.9 :)
[15:52] which is the current epel version available
[15:52] yeah the copr repo cloud-init-dev is actually only 4 hours old. Our CI builds after every commit I think. sorry for the thrashing you've experienced here. Would you be able to "cloud-init collect-logs" on the commandline and attach it to a new cloud-init bug @ https://bugs.launchpad.net/cloud-init/+filebug
[15:53] I find this peculiar in your logs https://bugs.launchpad.net/cloud-init/+filebug
[15:53] oops I mean this: 2017-11-20 15:34:09,308 - __init__.py[DEBUG]: Searching for network data source in: []
[15:53] I'd have expected to see Azure in that empty list
[15:54] the config I added is the following, maybe I am using it wrong:
[15:54] azure datasource though in cloud-init version 17.1 looks to be only run in init-local timeframe
[15:54] http://pastebin.centos.org/435771/15111932/
[15:55] and in your logs I only see the stage labelled 'init' which is actually cloud-init's 'init-network' stage
[15:55] which means Azure (as init-local stage Datasource) doesn't match as a network data source.
[15:56] I *think8
[15:56] I *think*... though haven't had my coffee yet to confirm
[15:56] because i ran it manually
[15:56] maybe ?
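A minimal sketch of how to pull the datasource search lines out of a log, assuming the message format quoted at 15:53 and the /var/log/cloud-init.log path mentioned earlier. The function name ds_search_lines is made up for illustration, and the pattern is deliberately loose because only the "network" wording appears in this conversation.

```shell
# Hypothetical helper: print the lines where cloud-init logged which
# datasources it searched. Assumes the message format quoted at 15:53.
# The first argument is the log file; it defaults to the usual log path.
ds_search_lines() {
    grep -E 'Searching for [a-z]+ data source in:' "${1:-/var/log/cloud-init.log}"
}
```

An empty list (`[]`) in a printed line means no datasource was considered for that stage, which is the symptom pointed out above.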
[15:56] if i let the machine boot, it will hang, actually
[15:57] s/boot/reboot
[15:59] cloud-init collect-logs seems to do nothing, /var/log/cloud-init* remain empty
[16:00] GreatSnoopy, there are semaphore files blocking cloud-init fresh re-runs. If you are willing to let cloud-init re-run on your system in entirety, you could 'sudo rm -rf /var/log/cloud-init* /var/lib/cloud; sudo reboot'
[16:00] that gives cloud-init the perception that it has never run before
[16:01] i can do that, but that would brick the machine
[16:01] var/lib/cloud/instances/ is actually empty
[16:01] i can ofc delete every trace of them, but shouldn't it be able to run in the cli otherwise ?
[16:01] i mean so that i do not need to reboot
[16:01] and lose control of the machine ?
[16:03] [root@democentfihn ~]# rm -Rf /var/log/cloud-init* /var/lib/cloud [root@democentfihn ~]# reboot
[16:17] GreatSnoopy: so Azure datasource is init-local only, so running on the commandline you'd need to try "cloud-init init --local" since Azure is only run during init-local. I'm thinking though this might be that hang you were talking about though
[16:18] ok, I'll try firing up a Centos image on Azure today to see if I can reproduce the prob
[16:19] well, i rebooted the machine, and quite as expected ssh now is stopped so i cannot log any more
[16:19] dunno if this is due to cloudinit or waagent
[16:19] for s in cloud-init-local cloud-init cloud-config cloud-final; do echo == $s.service ==; systemctl restart $s.service || break; done
[16:19] that would run everything in order that they would run.
[16:21] i will pick the next VM and try again, but basically I am back to square one when using 0.7.9: basically i boot the machine but although in the boot diagnostic i can see the familiar cloudinit table with network configuration, the rest of the config does not run and ssh does not get to be started - basically i cannot log into the machine
[16:22] GreatSnoopy: check this out.
https://bugs.launchpad.net/cloud-init/+bug/1717611 this change did land in azure which might be affecting you
[16:22] Launchpad bug 1717611 in cloud-init "Azure: Azure datasource needs to wait longer for SSH pubkey to be dropped by waagent" [Medium,Fix released]
[16:23] GreatSnoopy: do you have a pointer to a public centos image we can use on Azure?
[16:23] or is this custom
[16:29] I am not sure I understand what you are asking from me :) It is supposed to be a custom image that we are building to have cloudinit pre-enabled so that we can then provision other machines
[16:29] but we are not even there yet
[16:29] now we have a vm created from a regular Azure Centos base
[16:30] which I think is provided by OpenLogic
[16:54] gotcha, I was just wondering if you were using stock CentOS in azure or a custom image. I'll spin up an instance on azure to checkout
[16:55] for now I'm just trying to get to nonstandard but i cannot pass the standard phase :))
[16:56] ideally this should work with 0.7.9 from epel, but if needed I can make my images with a newer cloudinit as long as it works
[16:57] I'm with you, will ping you when I have progress on this. if you could file a bug in launchpad that'd help us reference progress on this.
[16:58] https://bugs.launchpad.net/cloud-init/+filebug your simple paste of cloud.cfg.d & cloud-init.log with the steps to reproduce the hang would be sufficient
[17:05] Just to simplify things, can we start by investigating the 0.7.9 issue ? Point being made is that ideally i should get it working with what is provided in the more mainstream OS repos
[17:05] also, can you validate the soundness of the config I gave to cloudinit ?
[17:05] i mean this http://pastebin.centos.org/435771/15111932/
[17:13] blackboxsw: or let's ask the other way around: what would be the preferred/recommended way to install cloud-init on centos in azure?
[17:22] GreatSnoopy: I know cloud-init in certain clouds is already baked into centos images.
Trying to confirm on azure now.
[17:23] unfortunately, not the case in azure - at least up to centos 7.3
[17:24] they only have cloudinit for ubuntu and coreos
[17:24] if a given cloud doesn't have an image that contains cloud-init. I'd install it then shut it down and take a snapshot or make an image of it in that cloud so I could reference that in subsequent VM creations
[17:24] that is exactly what I am trying to do :)
[17:25] hence the question, which is the recommended way to get cloudinit on that machine : the package in epel, slightly older - 0.7.9 or should i go and install the latest from source ?
[17:25] and I wouldn't want to run cloud-init on that image before I snapshotted it (or I'd remove /var/log/cloud-init* /var/lib/cloud before snapshotting)
[17:26] that's understood, i always delete those items
[17:27] GreatSnoopy: probably easiest for you to try to use 0.7.9 as it's in epel. But, if there are bugs with 0.7.9 the only fix we'd propose would land in upstream 17.X
[17:28] we don't backport fixes to centos epel (and it's up to centos when they want to pull in latest cloud-init)
[17:29] and it sounds like 0.7.9 and 17.1 are both causing probs for you. let's see what's up with that. I do think you might be hitting that infinite wait on ssh keys though on 0.7.9
[17:30] on systems like that, I'd expect you'd see "waiting for SSH public key files" in the logs. if you ever got there.
[17:30] GreatSnoopy: looks like that ssh key times out at 900 seconds
[17:30] so that's a 15 minute wait
[17:31] * blackboxsw had to hit the calculator
[17:35] what ssh keys does it expect and who is supposed to create those ?
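A rough sketch of the check suggested at 17:30, assuming the log contains the message text as quoted there (the exact wording in a real log may differ) and that the 900-second timeout applies. The function name ssh_key_wait_seen is hypothetical.

```shell
# Hypothetical helper: report whether cloud-init logged the SSH key wait.
# The search string is the text quoted at 17:30 and is an assumption about
# the real log wording; 900 seconds is the timeout mentioned there.
ssh_key_wait_seen() {
    log="${1:-/var/log/cloud-init.log}"
    timeout_seconds=900
    if grep -qi 'waiting for SSH public key files' "$log"; then
        echo "key wait logged; worst case about $((timeout_seconds / 60)) minutes"
    else
        echo "no key wait logged"
    fi
}
```

Calling it with no argument reads /var/log/cloud-init.log; pass another path to inspect a copied log.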
because although the source is a standard source image, i spin the instances via terraform
[17:35] and the only thing I pass to the instance is custom data with the cloudinit yaml
[17:41] GreatSnoopy: it looks like DatasourceAzure.py is waiting for ssh key from azure fabric to configure the instance (as the UI/api provides an ssh key that is used to contact the instance)
[17:47] that should not be normal behavior :
[17:47] because even in the GUI i can create a vm that has only password - no key
[17:48] or is it a different one ? system only ?
[17:48] that is provided no matter what the user actually provides ?
[18:03] GreatSnoopy: it only waits for files to appear which are listed on the cdrom in the metadata there.
[18:04] so... password only wont have any ssh keys listed in the metadata so it wont wait for anything
[18:05] can i manually retrieve the file so that i can check the data received before i reboot the machine ?
[18:05] where does that file "land" initially ?
[18:08] walinux-agent would put it into /var/lib/waagent
[18:08] for *.crt files in that directory
[18:10] one more question : waagent should be disabled so that it's cloud-init the one that starts it, or should be left enabled ?
[18:10] cloudinit's relationship with waagent seems a little bit of chicken and egg dilemma
[18:14] GreatSnoopy: cloud-init no longer needs walinux-agent.
[18:14] and so its default behavior is suggested.
[18:14] which is 'agent_command' of '__builtin__'
[18:15] interesting, but won't azure "see" the instance as failed if the cloud fabric cannot communicate with the agent ? or does cloudinit also create a replacement for that ?
[18:16] in my previous experience, not having waagent running results in the instance being marked as failed after reboot
[18:16] because it cannot communicate with the agent
[18:23] GreatSnoopy: i'm not sure when it went in, but yeah, you dont need walinux-agent anymore.
[18:24] yeah. and newer ubuntu instances do not use it...
let me check for sure
[19:09] filed this https://bugs.launchpad.net/cloud-init/+bug/1733403
[19:09] Launchpad bug 1733403 in cloud-init "cloud-init does not work reliably in Azure with Centos" [Undecided,New]
[19:11] thanks for this bug GreatSnoopy and the good context.
[19:19] I will come back tomorrow...for now I am out of VM's to brick :D
[19:20] you know the most stupid part.... we managed to get this step BEFORE for both centos7 and debian
[19:20] why this is not working any more I don't know
[19:36] GreatSnoopy: thx again, one thing I wonder is your datasource config represents agent_command :['systemctl', 'start', 'waagent' ] ... I wonder if it'd work with ['service', 'walinuxagent', 'start'] instead
[19:36] the datasource itself checks to see if agent_command == ['service', 'walinuxagent', 'start'] and grabs content from metadata in that case.
[19:37] lets see, although that would be, well... ugly :)
[19:39] yeah, think I misread the code. I think it checks to see if agent_command == '__builtin__' and then tries to get to metadata to pull in any ssh keys etc.
[19:39] I'm referencing docs at https://cloudinit.readthedocs.io/en/latest/topics/datasources/azure.html as well
[19:40] ... as I don't use azure too often :/
[19:41] for s in cloud-init-local cloud-init cloud-config cloud-final; do echo == $s.service ==; systemctl restart $s.service || break; done == cloud-init-local.service == == cloud-init.service == Job for cloud-init.service failed because the control process exited with error code. See "systemctl status cloud-init.service" and "journalctl -xe" for details.
[19:41] 2017-11-20 19:40:48,245 - util.py[DEBUG]: Running command ['blkid', '-tTYPE=udf', '-odevice'] with allowed return codes [0, 2] (shell=False, capture=True) 2017-11-20 19:40:48,378 - handlers.py[DEBUG]: finish: init-network/search-AzureNet: SUCCESS: no network data found from DataSourceAzureNet 2017-11-20 19:40:48,378 - util.py[WARNING]: No instance datasource found!
Likely bad things to come! 2017-11-20 19:40:48,378 - util.p
[19:42] i mean http://pastebin.centos.org/435866/
[19:43] in any case, i will be back tomorrow, and this time I will also rerun the whole process again (including the vm creation)
[19:44] good deal thx GreatSnoopy
[19:44] currently that was not made by me, and i will have to check if something got left out, just to be sure
[19:44] thanks a bunch, guys. See you tomorrow, have a nice day !
[19:45] you too
[20:57] blackboxsw: i'm grabbing merge of fix-ec2-fallback-nic now
[20:57] sweet, I'm on the #jinja2 stuff. no other changes needed
[20:57] ?
[21:00] smoser: with that fix-ec2-fallback-nic branch landed, shall we do a minor SRU?
[21:00] we could.
[21:00] we have 2 fixes for ec2 that'd be helpful.
[21:00] and it'd make the SRU simple
[21:03] are you thinking cherry-pick ?
[21:04] blackboxsw: ?
[21:04] https://hastebin.com/ugoyozasuz
[21:04] that is trunk -> bionic right now
[21:05] which we absolutely should do
[21:11] smoser: I was thinking master !cherry-pick
[21:11] forgot about the others
[21:11] but the thing about doing something painful, is to repeat the process often :)
[21:11] it can only get better with practice.
[21:42] blackboxsw: i'm fine with SRU
[21:42] but we need to do bionic first
[21:42] and that can happen "right now" if you want to propose, i'll upload
[21:51] ok will do smoser
[21:54] blackboxsw: i'm pushing the integration test one
[21:54] so wait on that ?
[21:54] i'm tox && git push on it
[21:54] pushed
[21:54] 7624348712b4502f0085d30c05b34dce3f2ceeae
[21:54] fire away
[21:55] thats in now
[21:56] ok grabbing
[22:01] smoser: this is what I see http://pastebin.ubuntu.com/
[22:01] should the AliYun have been "bionic"
[22:01] ?
[22:01] instead of UNRELEASED?
[22:03] blackboxsw: ah.
[22:03] just squash it into your new commit
[22:03] it was committed as UNRELEASED as it was not released.
[22:03] and then the next release would just pick it up
[22:03] kind of queueing things
[22:03] so you just drop that old changelog entry and pull the AliYun comment up
[22:03] make sense ?
[22:04] yeah squash, gotcha
[22:05] put the debian/ ones at the top.
[22:05] no real reason
[22:05] just how i've done it before
[22:05] https://hastebin.com/sezipunibi
[22:05] those will be your top two entries
[22:05] with your name instead of mine
[22:05] i'll get that in and uploaded later tonight if you MP it
[22:06] but have to run for now.
[22:08] smoser: https://code.launchpad.net/~chad.smith/cloud-init/+git/cloud-init/+merge/333998
[22:08] moving on to artful,zesty,xenial
[22:16] smoser: forgot with bionic(devel) should I remove bug #'s which don't affect ubuntu? or just on SRU series (artful, zesty, xenial)
[22:19] repushed with my name removed from changelog
[22:20] https://code.launchpad.net/~chad.smith/cloud-init/+git/cloud-init/+merge/333999
[22:21] blackboxsw: ah. i just leave them in for ubuntu-devel
[22:21] ok thx. seeing merge conflict on artful for some reason
[22:22] so re-push with the bug numbers on devel
[22:22] smoser: re-push contains all bug #'s only removed my name in brackets
[22:24] k
[22:25] blackboxsw: when you do artful, zesty, xenial
[22:25] cherry-pick the templates fix
[22:25] and i'm going to move your debian/cloud-init.templates to before the upstream snapshot comment
[22:25] +1
[22:28] I'm in hangout for a quick resolution
[22:28] cherry pick is good. just not sure about why I'm seeing merge conflict
[22:30] hm..
[22:30] ckonstanski
[22:30] i'll fix that
[22:30] he has username in changelog
[22:30] or, just leave it
[22:31] lets just leave it
[22:31] but we should lint those sorts of things on merge proposal
[22:33] blackboxsw: i'm not in a hurry to do the others tonight.
[22:33] just uploaded bionic
[22:33] kthx. sounds good
[22:33] have a good one