[14:53] <GreatSnoopy> hello. Can anyone help me debug a cloud-init issue? I am trying to get cloud-init to work with CentOS on Azure. The official cloud-init in CentOS (0.7.9) hangs for a while and eventually does nothing. So I installed the very latest and greatest cloud-init from source; however, now it gives me "util.py[WARNING]: No instance datasource found! Likely bad things to come!"
[14:54] <GreatSnoopy> although I have datasource_list: [Azure] in a config file in conf.d
[14:54] <GreatSnoopy> what changed in the newer releases of cloud-init?
[14:55] <GreatSnoopy> it seems I do have a /usr/lib/python2.7/site-packages/cloudinit/sources containing a DataSourceAzure.py
[15:23] <morgana2313> Hello. Is there an easy way to use macros/variables that contain the cloud instance's IP address and hostname in a write_files module?
[15:30] <blackboxsw> mornin folks.  GreatSnoopy have you tried our daily copr builds? https://copr.fedorainfracloud.org/coprs/g/cloud-init/cloud-init-dev/
[15:31] <GreatSnoopy> actually yes
[15:31] <GreatSnoopy> that was before trying the source
[15:32] <GreatSnoopy> the same behavior applies with those builds
[15:32] <GreatSnoopy> I tried the source version as a last resort
[15:32] <blackboxsw> that is version 17.1.x off of our master. The "No instance datasource found" warning is intriguing. I'll look over our Azure changes to see if there's something indicative of a problem there. GreatSnoopy, I'd be curious to see a paste of your /var/log/cloud-init.log. It should report that it attempts to run the Azure datasource and complain if something is amiss
[15:33] <GreatSnoopy> just a moment
[15:35] <GreatSnoopy> http://pastebin.centos.org/435761/15111921/
[15:36] <GreatSnoopy> for the record, the above is produced by running cloud-init --debug init
[15:47] <blackboxsw> GreatSnoopy: +1 on that. ok so you upgraded from 0.7.5 -> 0.7.9 originally and saw this problem too?
[15:48] <GreatSnoopy> blackboxsw: so the succession of events is the following:
[15:49] <GreatSnoopy> first, I used 0.7.9, which is available in EPEL. It hangs for about 2 minutes and in the end does nothing when run manually. And if the machine is rebooted, it just hangs there indefinitely, and with ssh stopped it is basically a bricked VM
[15:49] <GreatSnoopy> so i went trying newer cloud-init
[15:49] <GreatSnoopy> first used that copr repo
[15:50] <GreatSnoopy> which gave me the behavior of finishing fast but also doing nothing and complaining about unavailable data sources
[15:50] <GreatSnoopy> as in the log
[15:50] <GreatSnoopy> then i tried the actual sources
[15:50] <GreatSnoopy> installed dependencies and the cloud-init distribution with pip install .
[15:51] <GreatSnoopy> (and -r requirements.txt)
[15:51] <GreatSnoopy> but this solved nothing
[15:51] <GreatSnoopy> so basically, the copr build and the raw, unpackaged build installed by hand (mis)behave in the same way
[15:51] <GreatSnoopy> but differently than 0.7.9 :)
[15:52] <GreatSnoopy> which is the current epel version available
[15:52] <blackboxsw> yeah, the copr repo cloud-init-dev is actually only 4 hours old. Our CI builds after every commit, I think. Sorry for the thrashing you've experienced here. Would you be able to run "cloud-init collect-logs" on the command line and attach the result to a new cloud-init bug @ https://bugs.launchpad.net/cloud-init/+filebug
[15:53] <blackboxsw> I find this peculiar in your logs https://bugs.launchpad.net/cloud-init/+filebug
[15:53] <blackboxsw> oops I mean this: 2017-11-20 15:34:09,308 - __init__.py[DEBUG]: Searching for network data source in: []
[15:53] <blackboxsw> I'd have expected to see Azure in that empty list
[15:54] <GreatSnoopy> the config I added is the following, maybe I am using it wrong:
[15:54] <blackboxsw> the Azure datasource in cloud-init 17.1, though, looks like it only runs in the init-local stage
[15:54] <GreatSnoopy> http://pastebin.centos.org/435771/15111932/
[15:55] <blackboxsw> and in your logs I only see the stage labelled 'init' which is actually cloud-init's 'init-network' stage
[15:55] <blackboxsw> which means Azure (as init-local stage Datasource) doesn't match as a network data source.
[15:56] <blackboxsw> I *think*... though haven't had my coffee yet to confirm
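blackboxsw's theory (a datasource registered only for init-local is never a candidate in the network stage, which would explain "Searching for network data source in: []") can be sketched as below. This is a hypothetical simplification, not cloud-init's actual code: the DEP_* names, the REGISTRY table and the "ExampleNet" source are assumptions made up for illustration.

```python
# Hypothetical sketch: filter the configured datasource_list by the
# dependencies available in the current boot stage. Not cloud-init's code.
DEP_FILESYSTEM = "FILESYSTEM"
DEP_NETWORK = "NETWORK"

# Assumed registry: datasource name -> dependency sets it registers for.
REGISTRY = {
    "Azure": [{DEP_FILESYSTEM}],                     # init-local only (assumption)
    "ExampleNet": [{DEP_FILESYSTEM, DEP_NETWORK}],   # hypothetical network source
}


def candidates(datasource_list, stage_deps):
    """Return configured datasources registered for exactly this stage's deps."""
    wanted = set(stage_deps)
    found = []
    for name in datasource_list:
        if any(deps == wanted for deps in REGISTRY.get(name, [])):
            found.append(name)
    return found


print("init-local:", candidates(["Azure"], [DEP_FILESYSTEM]))
print("init (network):", candidates(["Azure"], [DEP_FILESYSTEM, DEP_NETWORK]))
```

With datasource_list: [Azure], the local stage finds Azure while the network stage (what a plain `cloud-init init` runs) gets an empty list, matching the log line quoted above.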
[15:56] <GreatSnoopy> because i ran it manually
[15:56] <GreatSnoopy> maybe ?
[15:56] <GreatSnoopy> if i let the machine boot, it will hang, actually
[15:57] <GreatSnoopy> s/boot/reboot
[15:59] <GreatSnoopy> cloud-init collect-logs seems to do nothing; /var/log/cloud-init* remain empty
[16:00] <blackboxsw> GreatSnoopy, there are semaphore files blocking fresh cloud-init re-runs. If you are willing to let cloud-init re-run on your system in its entirety, you could 'sudo rm -rf /var/log/cloud-init* /var/lib/cloud; sudo reboot'
[16:00] <blackboxsw> that gives cloud-init the perception that it has never run before
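The reset blackboxsw describes (also the cleanup to do before snapshotting an image) amounts to deleting the cloud-init logs and /var/lib/cloud. A minimal Python sketch of that `rm -rf`, assuming you want to point it at a scratch tree first; `reset_cloud_init_state` and `make_fake_root` are hypothetical helpers, not part of cloud-init.

```python
import glob
import os
import shutil
import tempfile


def reset_cloud_init_state(root="/"):
    """Delete cloud-init logs and /var/lib/cloud under `root`.

    Mirrors: rm -rf /var/log/cloud-init* /var/lib/cloud
    Returns the list of paths that were removed.
    """
    targets = glob.glob(os.path.join(root, "var/log/cloud-init*"))
    targets.append(os.path.join(root, "var/lib/cloud"))
    removed = []
    for path in targets:
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
            removed.append(path)
        elif os.path.lexists(path):
            os.remove(path)
            removed.append(path)
    return removed


def make_fake_root():
    """Build a throwaway tree that looks like an instance cloud-init has run on."""
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "var/lib/cloud/instances"))
    os.makedirs(os.path.join(root, "var/log"))
    for name in ("cloud-init.log", "cloud-init-output.log"):
        with open(os.path.join(root, "var/log", name), "w") as fh:
            fh.write("old run\n")
    return root
```

Running `reset_cloud_init_state(make_fake_root())` exercises the logic without touching the real system; calling it with the default root is the destructive equivalent of the shell command and, as GreatSnoopy points out, the following reboot may leave the VM unreachable.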
[16:01] <GreatSnoopy> I can do that, but that would brick the machine
[16:01] <GreatSnoopy> /var/lib/cloud/instances/ is actually empty
[16:01] <GreatSnoopy> I can of course delete every trace of them, but shouldn't it be able to run from the CLI otherwise?
[16:01] <GreatSnoopy> I mean, so that I do not need to reboot
[16:01] <GreatSnoopy> and lose control of the machine?
[16:03] <GreatSnoopy> [root@democentfihn ~]# rm -Rf /var/log/cloud-init* /var/lib/cloud [root@democentfihn ~]# reboot
[16:17] <blackboxsw> GreatSnoopy: the Azure datasource is init-local only, so on the command line you'd need to run "cloud-init init --local". I'm thinking, though, that this might be the hang you were talking about
[16:18] <blackboxsw> ok, I'll try firing up a Centos image on Azure today to see if I can reproduce the prob
[16:19] <GreatSnoopy> well, I rebooted the machine, and as expected ssh is now stopped, so I cannot log in anymore
[16:19] <GreatSnoopy> dunno if this is due to cloudinit or waagent
[16:19] <smoser> for s in cloud-init-local cloud-init cloud-config cloud-final; do echo == $s.service ==; systemctl restart $s.service || break; done
[16:19] <smoser> that would run every stage in the order it runs at boot.
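For readers who prefer it spelled out, here is smoser's loop as a Python sketch: restart each unit in boot order and stop at the first failure, like `|| break`. The injectable `runner` parameter is a hypothetical addition so the ordering logic can be tried without systemd; by default it shells out to `systemctl restart`.

```python
import subprocess

STAGES = ["cloud-init-local", "cloud-init", "cloud-config", "cloud-final"]


def _systemctl_restart(unit):
    """Run `systemctl restart <unit>` and return its exit code."""
    return subprocess.run(["systemctl", "restart", unit]).returncode


def restart_stages(stages=STAGES, runner=_systemctl_restart):
    """Restart cloud-init units in order; stop at the first non-zero exit.

    Returns (attempted_units, failed_unit_or_None).
    """
    attempted = []
    for stage in stages:
        unit = stage + ".service"
        print("== %s ==" % unit)
        attempted.append(unit)
        if runner(unit) != 0:
            return attempted, unit
    return attempted, None
```

In GreatSnoopy's later run, cloud-init-local.service succeeded and cloud-init.service failed, so the loop stopped there and cloud-config/cloud-final were never attempted.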
[16:21] <GreatSnoopy> I will pick the next VM and try again, but I am back to square one when using 0.7.9: I boot the machine, and although in the boot diagnostics I can see the familiar cloud-init table with the network configuration, the rest of the config does not run and ssh never gets started, so I cannot log into the machine
[16:22] <blackboxsw> GreatSnoopy: check this out. https://bugs.launchpad.net/cloud-init/+bug/1717611 this change did land in azure which might be affecting you
[16:23] <blackboxsw> GreatSnoopy: do you have a pointer to a public centos image we can use on Azure?
[16:23] <blackboxsw> or is this custom
[16:29] <GreatSnoopy> I am not sure I understand what you are asking of me :) It is supposed to be a custom image that we are building to have cloud-init pre-enabled so that we can then provision other machines
[16:29] <GreatSnoopy> but we are not even there yet
[16:29] <GreatSnoopy> now we have a VM created from a regular Azure CentOS base
[16:30] <GreatSnoopy> which I think is provided by OpenLogic
[16:54] <blackboxsw> gotcha, I was just wondering if you were using stock CentOS in Azure or a custom image. I'll spin up an instance on Azure to check it out
[16:55] <GreatSnoopy> for now I'm just trying to get to nonstandard but i cannot pass the standard phase :))
[16:56] <GreatSnoopy> ideally this should work with 0.7.9 from epel, but if needed I can make my images with a newer cloudinit as long as it works
[16:57] <blackboxsw> I'm with you, will ping you when I have progress on this. if you could file a bug in launchpad that'd help us reference progress on this.
[16:58] <blackboxsw> https://bugs.launchpad.net/cloud-init/+filebug   your simple paste of cloud.cfg.d & cloud-init.log with the steps to reproduce the hang would be sufficient
[17:05] <GreatSnoopy> Just to simplify things, can we start by investigating the 0.7.9 issue? The point is that ideally I should get it working with what is provided in the more mainstream OS repos
[17:05] <GreatSnoopy> also, can you validate the soundness of the config I gave to cloudinit ?
[17:05] <GreatSnoopy> i mean this http://pastebin.centos.org/435771/15111932/
[17:13] <GreatSnoopy> blackboxsw: or let's ask the other way around: what would be the preferred/recommended way to install cloud-init on centos in azure?
[17:22] <blackboxsw> GreatSnoopy: I know cloud-init in certain clouds is already baked into centos images. Trying to confirm on azure now.
[17:23] <GreatSnoopy> unfortunately, not the case in azure - at least up to centos 7.3
[17:24] <GreatSnoopy> they only have cloudinit for ubuntu and coreos
[17:24] <blackboxsw> if a given cloud doesn't have an image that contains cloud-init, I'd install it, then shut it down and take a snapshot or make an image of it in that cloud so I could reference that in subsequent VM creations
[17:24] <GreatSnoopy> that is exactly what I am trying to do :)
[17:25] <GreatSnoopy> hence the question, which is the recommended way to get cloudinit on that machine : the package in epel, slightly older - 0.7.9 or should i go and install the latest from source ?
[17:25] <blackboxsw> and I wouldn't want to run cloud-init on that image before I snapshotted it (or I'd remove /var/log/cloud-init* /var/lib/cloud before snapshotting)
[17:26] <GreatSnoopy> that's understood, i always delete those items
[17:27] <blackboxsw> GreatSnoopy: probably easiest for you to try to use 0.7.9 as it's in EPEL. But if there are bugs in 0.7.9, the only fix we'd propose would land in upstream 17.X
[17:28] <blackboxsw> we don't backport fixes to centos epel (and it's up to centos when they want to pull in latest cloud-init)
[17:29] <blackboxsw> and it sounds like 0.7.9 and 17.1 are both causing probs for you. let's see what's up with that. I do think you might be hitting that infinite wait on ssh keys though  on 0.7.9
[17:30] <blackboxsw> on systems like that, I'd expect you'd see "waiting for SSH public key files" in the logs, if you ever got there.
[17:30] <blackboxsw> GreatSnoopy: looks like that ssh key times out at 900 seconds
[17:30] <blackboxsw> so that's a 15 minute wait
[17:31]  * blackboxsw had to hit the calculator
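The calculator step is 900 s / 60 = 15 min. The wait itself is a poll-until-present-or-timeout loop; below is a minimal sketch of that behavior, assuming a fixed polling interval. Only the 900-second figure comes from the discussion; `wait_for_files` here is an illustrative stand-in and its defaults are assumptions.

```python
import os
import time


def wait_for_files(paths, maxwait=900.0, naplen=0.5):
    """Poll until every path exists or about `maxwait` seconds have elapsed.

    Returns the set of paths still missing (an empty set means success).
    An empty `paths` list returns immediately without sleeping.
    """
    need = set(paths)
    waited = 0.0
    while True:
        need = {p for p in need if not os.path.exists(p)}
        if not need or waited >= maxwait:
            return need
        time.sleep(naplen)
        waited += naplen


print(900 / 60)  # 15.0 minutes of waiting before giving up
```

Note the empty-list case: if the metadata lists no keys, there is nothing to wait for, which is the point smoser makes about password-only VMs.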
[17:35] <GreatSnoopy> what ssh keys does it expect, and who is supposed to create them? Although the source is a standard image, I spin up the instances via Terraform
[17:35] <GreatSnoopy> and the only thing I pass to the instance is custom data with the cloudinit yaml
[17:41] <blackboxsw> GreatSnoopy: it looks like DataSourceAzure.py is waiting for an ssh key from the Azure fabric to configure the instance (as the UI/API provides an ssh key that is used to contact the instance)
[17:47] <GreatSnoopy> that should not be normal behavior :
[17:47] <GreatSnoopy> because even in the GUI i can create a vm that has only password - no key
[17:48] <GreatSnoopy> or is it a different one ? system only ?
[17:48] <GreatSnoopy> that is provided no matter what the user actually provides ?
[18:03] <smoser> GreatSnoopy: it only waits for files to appear which are listed on the cdrom in the metadata there.
[18:04] <smoser> so... password only wont have any ssh keys listed in the metadata so it wont wait for anything
[18:05] <GreatSnoopy> can i manually retrieve the file so that i can check the data received before i reboot the machine ?
[18:05] <GreatSnoopy> where does that file "land" initially ?
[18:08] <smoser> walinux-agent would put it into /var/lib/waagent
[18:08] <smoser> for *.crt files in that directory
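Putting smoser's answers together: the files to wait for are derived from the keys listed in the metadata, and they appear as *.crt files under /var/lib/waagent. A hypothetical sketch of that mapping follows; the metadata shape (a list of dicts with a "fingerprint" field) and the fingerprint-plus-.crt file name are assumptions for illustration, not confirmed from the datasource source here.

```python
import os

AGENT_DIR = "/var/lib/waagent"  # directory named in the discussion above


def expected_key_files(public_keys, agent_dir=AGENT_DIR):
    """Map SSH key entries from the metadata to the *.crt files to wait for.

    `public_keys` is a hypothetical list of dicts with a "fingerprint" key.
    A password-only VM lists no keys, so the result is an empty list.
    """
    return [os.path.join(agent_dir, key["fingerprint"] + ".crt")
            for key in public_keys]
```

Before rebooting, listing /var/lib/waagent/*.crt and comparing it with the keys you expect is one way to check what the instance actually received.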
[18:10] <GreatSnoopy> one more question: should waagent be disabled so that cloud-init is the one that starts it, or should it be left enabled?
[18:10] <GreatSnoopy> cloud-init's relationship with waagent seems a bit of a chicken-and-egg dilemma
[18:14] <smoser> GreatSnoopy: cloud-init no longer needs walinux-agent.
[18:14] <smoser> and so its default behavior is suggested.
[18:14] <smoser> which is 'agent_command' of '__builtin__'
[18:15] <GreatSnoopy> interesting, but won't azure "see" the instance as failed if the cloud fabric cannot communicate with the agent ? or does cloudinit also create a replacement for that ?
[18:16] <GreatSnoopy> in my previous experience, not having waagent running results in the instance being marked as failed after reboot
[18:16] <GreatSnoopy> because it cannot communicate with the agent
[18:23] <smoser> GreatSnoopy: I'm not sure when it went in, but yeah, you don't need walinux-agent anymore.
[18:24] <smoser> yeah. And newer Ubuntu instances do not use it... let me check for sure
[19:09] <GreatSnoopy> filed this https://bugs.launchpad.net/cloud-init/+bug/1733403
[19:11] <blackboxsw> thanks for this bug GreatSnoopy and the good context.
[19:19] <GreatSnoopy> I will come back tomorrow...for now I am out of VM's to brick :D
[19:20] <GreatSnoopy> you know the most stupid part... we managed to get past this step BEFORE for both CentOS 7 and Debian
[19:20] <GreatSnoopy> why this is not working any more I don't know
[19:36] <blackboxsw> GreatSnoopy: thx again. One thing I wonder about: your datasource config specifies agent_command: ['systemctl', 'start', 'waagent'] ... I wonder if it'd work with ['service', 'walinuxagent', 'start'] instead
[19:36] <blackboxsw> the datasource itself checks to see if agent_command  == ['service', 'walinuxagent', 'start'] and grabs content from metadata in that case.
[19:37] <GreatSnoopy> lets see, although that would be, well... ugly :)
[19:39] <blackboxsw> yeah, think I misread the code. I think it checks to see if agent_command == '__builtin__' and then tries to get to metadata to pull in any ssh keys etc.
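The branch blackboxsw describes after re-reading the code (and smoser's note that '__builtin__' is the default) can be sketched like this. It is an illustration of the decision only, not the datasource's implementation; the "builtin"/"external" return labels are hypothetical.

```python
BUILTIN_AGENT = "__builtin__"


def provisioning_mode(cfg):
    """Decide which provisioning path a datasource config would take.

    "builtin": cloud-init reads the metadata (ssh keys etc.) itself.
    "external": agent_command is run (e.g. to start waagent) instead.
    """
    agent_command = cfg.get("agent_command", BUILTIN_AGENT)
    if agent_command == BUILTIN_AGENT:
        return "builtin"
    return "external"
```

Under this reading, GreatSnoopy's config with agent_command: ['systemctl', 'start', 'waagent'] takes the external path, while leaving agent_command unset selects the built-in one that no longer needs walinux-agent.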
[19:39] <blackboxsw> I'm referencing docs at https://cloudinit.readthedocs.io/en/latest/topics/datasources/azure.html as well
[19:40] <blackboxsw> ... as I don't use azure too often :/
[19:41] <GreatSnoopy> for s in cloud-init-local cloud-init cloud-config cloud-final; do echo == $s.service ==; systemctl restart $s.service || break; done == cloud-init-local.service == == cloud-init.service == Job for cloud-init.service failed because the control process exited with error code. See "systemctl status cloud-init.service" and "journalctl -xe" for details.
[19:41] <GreatSnoopy> 2017-11-20 19:40:48,245 - util.py[DEBUG]: Running command ['blkid', '-tTYPE=udf', '-odevice'] with allowed return codes [0, 2] (shell=False, capture=True) 2017-11-20 19:40:48,378 - handlers.py[DEBUG]: finish: init-network/search-AzureNet: SUCCESS: no network data found from DataSourceAzureNet 2017-11-20 19:40:48,378 - util.py[WARNING]: No instance datasource found! Likely bad things to come! 2017-11-20 19:40:48,378 - util.p
[19:42] <GreatSnoopy> i mean http://pastebin.centos.org/435866/
[19:43] <GreatSnoopy> in any case, i will be back tomorrow, and this time I will also rerun the whole process again (including the vm creation)
[19:44] <blackboxsw> good deal thx GreatSnoopy
[19:44] <GreatSnoopy> currently that was not made by me, and i will have to check if something got left out, just to be sure
[19:44] <GreatSnoopy> thanks a bunch, guys. See you tomorrow, have a nice day !
[19:45] <blackboxsw> you too
[20:57] <smoser> blackboxsw: i'm grabbing merge of fix-ec2-fallback-nic now
[20:57] <blackboxsw> sweet, I'm on the #jinja2 stuff. no other changes needed
[20:57] <blackboxsw> ?
[21:00] <blackboxsw> smoser: with that fix-ec2-fallback-nic branch landed, shall we do a minor SRU?
[21:00] <smoser> we could.
[21:00] <blackboxsw> we have 2 fixes for ec2 that'd be helpful.
[21:00] <blackboxsw> and it'd make the SRU simple
[21:03] <smoser> are you thinking cherry-pick ?
[21:04] <smoser> blackboxsw: ?
[21:04] <smoser> https://hastebin.com/ugoyozasuz
[21:04] <smoser> that is trunk -> bionic right now
[21:05] <smoser> which we absolutely should do
[21:11] <blackboxsw> smoser: I was thinking master !cherry-pick
[21:11] <blackboxsw> forgot about the others
[21:11] <blackboxsw> but the thing about doing something painful, is to repeat the process often :)
[21:11] <blackboxsw> it can only get better with practice.
[21:42] <smoser> blackboxsw: i'm fine with SRU
[21:42] <smoser> but we need to do bionic first
[21:42] <smoser> and that can happen "right now" if you want to propose, i'll upload
[21:51] <blackboxsw> ok will do smoser
[21:54] <smoser> blackboxsw: I'm pushing the integration test one
[21:54] <smoser> so wait on that ?
[21:54] <smoser> i'm tox && git push on it
[21:54] <smoser> pushed
[21:54] <smoser> 7624348712b4502f0085d30c05b34dce3f2ceeae
[21:54] <blackboxsw> fire away
[21:55] <smoser> thats in now
[21:56] <blackboxsw> ok grabbing
[22:01] <blackboxsw> smoser: this is what I see http://pastebin.ubuntu.com/
[22:01] <blackboxsw> should the AliYun have been "bionic"
[22:01] <blackboxsw> ?
[22:01] <blackboxsw> instead of UNRELEASED?
[22:03] <smoser> blackboxsw: ah.
[22:03] <smoser> just squash it into your new commit
[22:03] <smoser> it was committed as UNRELEASED as it was not released.
[22:03] <smoser> and then the next release would just pick it up
[22:03] <smoser> kind of queueing things
[22:03] <smoser> so you just drop that old changelog entry and pull the AliYun comment up
[22:03] <smoser> make sense ?
[22:04] <blackboxsw> yeah squash, gotcha
[22:05] <smoser> put the debian/ ones at the top.
[22:05] <smoser> no real reason
[22:05] <smoser> just how i've done it before
[22:05] <smoser> https://hastebin.com/sezipunibi
[22:05] <smoser> those will be your top two entries
[22:05] <smoser> with your name instead of mine
[22:05] <smoser> i'll get that in and uploaded later tonight if you MP it
[22:06] <smoser> but have to run for now.
[22:08] <blackboxsw> smoser: https://code.launchpad.net/~chad.smith/cloud-init/+git/cloud-init/+merge/333998
[22:08] <blackboxsw> moving on to artful,zesty,xenial
[22:16] <blackboxsw> smoser: forgot: with bionic (devel), should I remove bug #s that don't affect Ubuntu? Or only on the SRU series (artful, zesty, xenial)?
[22:19] <blackboxsw> repushed with my name removed from changelog
[22:20] <blackboxsw> https://code.launchpad.net/~chad.smith/cloud-init/+git/cloud-init/+merge/333999
[22:21] <smoser> blackboxsw: ah. i just leave them in for ubuntu-devel
[22:21] <blackboxsw> ok thx. seeing merge conflict on artful for some reason
[22:22] <smoser> so re-push with the bug numbers on devel
[22:22] <blackboxsw> smoser: re-push contains all bug #'s only removed my name in brackets
[22:24] <smoser> k
[22:25] <smoser> blackboxsw: when you do artful, zesty, xenial
[22:25] <smoser> cherry-pick the templates fix
[22:25] <smoser> and I'm going to move your debian/cloud-init.templates entry to before the upstream snapshot comment
[22:25] <blackboxsw> +1
[22:28] <blackboxsw> I'm in hangout for a quick resolution
[22:28] <blackboxsw> cherry pick is good. just not sure about why I'm seeing merge conflict
[22:30] <smoser> hm..
[22:30] <smoser> ckonstanski
[22:30] <smoser> i'll fix that
[22:30] <smoser> he has username in changelog
[22:30] <smoser> or, just leave it
[22:31] <smoser> let's just leave it
[22:31] <smoser> but we should lint those sorts of things on merge proposal
[22:33] <smoser> blackboxsw: i'm not in a hurry to do the others tonight.
[22:33] <smoser> just uploaded bionic
[22:33] <blackboxsw> kthx. sounds good
[22:33] <blackboxsw> have a good one