GreatSnoopy | hello. Can anyone help me debug a cloud-init issue? I am trying to get cloud-init to work with centos on Azure. The official cloud-init in centos (0.7.9) hangs for a while and eventually does nothing. So I installed the latest and greatest cloud-init from source, however now it gives me "util.py[WARNING]: No instance datasource found! Likely bad things to come!" | 14:53 |
---|---|---|
GreatSnoopy | although i have a datasource_list: [Azure] in a config file in conf.d | 14:54 |
GreatSnoopy | what changed in the newer releases of cloudinit ? | 14:54 |
GreatSnoopy | it seems I do have a /usr/lib/python2.7/site-packages/cloudinit/sources containing a DataSourceAzure.py | 14:55 |
morgana2313 | Hello. Is there an easy way to use macros/variables that contain the cloud-instance IP address and hostname in a write_files module? | 15:23 |
blackboxsw | mornin folks. GreatSnoopy have you tried our daily copr builds? https://copr.fedorainfracloud.org/coprs/g/cloud-init/cloud-init-dev/ | 15:30 |
GreatSnoopy | actually yes | 15:31 |
GreatSnoopy | that was before trying the source | 15:31 |
GreatSnoopy | the same behavior applies with those builds | 15:32 |
GreatSnoopy | I tried the source version as a last resort | 15:32 |
blackboxsw | that is version 17.1.x off of our master. The no datasource found is intriguing. I'll look over our Azure changes to see if there's something indicative of a prob there. GreatSnoopy, so I'd be curious to see a paste of your /var/log/cloud-init.log. It should report that it attempts to run Azure datasource and complains if something is amiss | 15:32 |
GreatSnoopy | just a moment | 15:33 |
GreatSnoopy | http://pastebin.centos.org/435761/15111921/ | 15:35 |
GreatSnoopy | for the record, the above is produced by running cloud-init --debug init | 15:36 |
blackboxsw | GreatSnoopy: +1 on that. ok so you upgraded from 0.7.5 -> 0.7.9 originally and saw this problem too? | 15:47 |
GreatSnoopy | blackboxsw: so the succession of events is the following: | 15:48 |
GreatSnoopy | first, used 0.7.9 that is available in epel. That hangs for like 2 minutes, does nothing in the end when run manually. And if the machine is rebooted, it just hangs there indefinitely and with ssh stopped is basically a bricked VM | 15:49 |
GreatSnoopy | so i went trying newer cloud-init | 15:49 |
GreatSnoopy | first used that copr repo | 15:49 |
GreatSnoopy | which gave me the behavior of finishing fast but also doing nothing and complaining about unavailable data sources | 15:50 |
GreatSnoopy | as in the log | 15:50 |
GreatSnoopy | then i tried the actual sources | 15:50 |
GreatSnoopy | installed dependencies and the cloud-init distribution with pip install . | 15:50 |
GreatSnoopy | (and -r requirements.txt) | 15:51 |
GreatSnoopy | but this solved nothing | 15:51 |
GreatSnoopy | so basically, the copr build and the raw, unpackaged build installed by hand (mis)behave in the same way | 15:51 |
GreatSnoopy | but differently than 0.7.9 :) | 15:51 |
GreatSnoopy | which is the current epel version available | 15:52 |
blackboxsw | yeah the copr repo cloud-init-dev is actually only 4 hours old. Our CI builds after every commit I think. sorry for the thrashing you've experienced here. Would you be able to run "cloud-init collect-logs" on the commandline and attach it to a new cloud-init bug @ https://bugs.launchpad.net/cloud-init/+filebug | 15:52 |
blackboxsw | I find this peculiar in your logs https://bugs.launchpad.net/cloud-init/+filebug | 15:53 |
blackboxsw | oops I mean this: 2017-11-20 15:34:09,308 - __init__.py[DEBUG]: Searching for network data source in: [] | 15:53 |
blackboxsw | I'd have expected to see Azure in that empty list | 15:53 |
GreatSnoopy | the config I added is the following, maybe I am using it wrong: | 15:54 |
blackboxsw | azure datasource though in cloud-init version 17.1 looks to be only run in init-local timeframe | 15:54 |
GreatSnoopy | http://pastebin.centos.org/435771/15111932/ | 15:54 |
blackboxsw | and in your logs I only see the stage labelled 'init' which is actually cloud-init's 'init-network' stage | 15:55 |
blackboxsw | which means Azure (as init-local stage Datasource) doesn't match as a network data source. | 15:55 |
blackboxsw | I *think8 | 15:56 |
blackboxsw | I *think*... though haven't had my coffee yet to confirm | 15:56 |
GreatSnoopy | because i ran it manually | 15:56 |
GreatSnoopy | maybe ? | 15:56 |
GreatSnoopy | if i let the machine boot, it will hang, actually | 15:56 |
GreatSnoopy | s/boot/reboot | 15:57 |
GreatSnoopy | cloud-init collect-logs seems to do nothing, /var/log/cloud-init* remain empty | 15:59 |
blackboxsw | GreatSnoopy, there are semaphore files blocking cloud-init fresh re-runs. If you are willing to let cloud-init re-run on your system in entirety, you could 'sudo rm -rf /var/log/cloud-init* /var/lib/cloud; sudo reboot' | 16:00 |
blackboxsw | that gives cloud-init the perception that it has never run before | 16:00 |
GreatSnoopy | i can do that, but that would brick the machine | 16:01 |
GreatSnoopy | var/lib/cloud/instances/ is actually empty | 16:01 |
GreatSnoopy | i can ofc delete every trace of them, but shouldn't it be able to run in the cli otherwise ? | 16:01 |
GreatSnoopy | i mean so that i do not need to reboot | 16:01 |
GreatSnoopy | and lose control of the machine ? | 16:01 |
GreatSnoopy | [root@democentfihn ~]# rm -Rf /var/log/cloud-init* /var/lib/cloud [root@democentfihn ~]# reboot | 16:03 |
blackboxsw | GreatSnoopy: so Azure datasource is init-local only, so running on the commandline you'd need to try "cloud-init init --local" since Azure is only run during init-local. I'm thinking though this might be that hang you were talking about though | 16:17 |
blackboxsw | ok, I'll try firing up a Centos image on Azure today to see if I can reproduce the prob | 16:18 |
GreatSnoopy | well, i rebooted the machine, and quite as expected ssh now is stopped so i cannot log any more | 16:19 |
GreatSnoopy | dunno if this is due to cloudinit or waagent | 16:19 |
smoser | for s in cloud-init-local cloud-init cloud-config cloud-final; do echo == $s.service ==; systemctl restart $s.service || break; done | 16:19 |
smoser | that would run everything in order that they would run. | 16:19 |
GreatSnoopy | i will pick the next VM and try again, but basically I am back to square one when using 0.7.9: basically i boot the machine but although in the boot diagnostic i can see the familiar cloudinit table with network configuration, the rest of the config does not run and ssh does not get to be started - basically i cannot log into the machine | 16:21 |
blackboxsw | GreatSnoopy: check this out. https://bugs.launchpad.net/cloud-init/+bug/1717611 this change did land in azure which might be affecting you | 16:22 |
ubot5 | Launchpad bug 1717611 in cloud-init "Azure: Azure datasource needs to wait longer for SSH pubkey to be dropped by waagent" [Medium,Fix released] | 16:22 |
blackboxsw | GreatSnoopy: do you have a pointer to a public centos image we can use on Azure? | 16:23 |
blackboxsw | or is this custom | 16:23 |
GreatSnoopy | I am not sure I understand what you are asking from me :) It is supposed to be a custom image that we are building to have cloudinit pre-enabled so that we can then provision other machines | 16:29 |
GreatSnoopy | but we are not even there yet | 16:29 |
GreatSnoopy | now we have a vm created from a regular Azure Centos base | 16:29 |
GreatSnoopy | which I think is provided by OpenLogic | 16:30 |
blackboxsw | gotcha, I was just wondering if you were using stock CentOS in azure or a custom image. I'll spin up an instance on azure to checkout | 16:54 |
GreatSnoopy | for now I'm just trying to get to nonstandard but i cannot pass the standard phase :)) | 16:55 |
GreatSnoopy | ideally this should work with 0.7.9 from epel, but if needed I can make my images with a newer cloudinit as long as it works | 16:56 |
blackboxsw | I'm with you, will ping you when I have progress on this. if you could file a bug in launchpad that'd help us reference progress on this. | 16:57 |
blackboxsw | https://bugs.launchpad.net/cloud-init/+filebug your simple paste of cloud.cfg.d & cloud-init.log with the steps to reproduce the hang would be sufficient | 16:58 |
GreatSnoopy | Just to simplify things, can we start by investigating the 0.7.9 issue ? Point being made is that ideally i should get it working with what is provided in the more mainstream OS repos | 17:05 |
GreatSnoopy | also, can you validate the soundness of the config I gave to cloudinit ? | 17:05 |
GreatSnoopy | i mean this http://pastebin.centos.org/435771/15111932/ | 17:05 |
GreatSnoopy | blackboxsw: or let's ask the other way around: what would be the preferred/recommended way to install cloud-init on centos in azure? | 17:13 |
blackboxsw | GreatSnoopy: I know cloud-init in certain clouds is already baked into centos images. Trying to confirm on azure now. | 17:22 |
GreatSnoopy | unfortunately, not the case in azure - at least up to centos 7.3 | 17:23 |
GreatSnoopy | they only have cloudinit for ubuntu and coreos | 17:24 |
blackboxsw | if a given cloud doesn't have an image that contains cloud-init, I'd install it then shut it down and take a snapshot or make an image of it in that cloud so I could reference that on subsequent VM creations | 17:24 |
GreatSnoopy | that is exactly what I am trying to do :) | 17:24 |
GreatSnoopy | hence the question, which is the recommended way to get cloudinit on that machine : the package in epel, slightly older - 0.7.9 or should i go and install the latest from source ? | 17:25 |
blackboxsw | and I wouldn't want to run cloud-init on that image before I snapshotted it (or I'd remove /var/log/cloud-init* /var/lib/cloud before snapshotting) | 17:25 |
GreatSnoopy | that's understood, i always delete those items | 17:26 |
blackboxsw | GreatSnoopy: probably easiest for you to try to use 0.7.9 as it's in epel. But, if there are bugs with 0.7.9 the only fix we'd propose would land in upstream 17.X | 17:27 |
blackboxsw | we don't backport fixes to centos epel (and it's up to centos when they want to pull in latest cloud-init) | 17:28 |
blackboxsw | and it sounds like 0.7.9 and 17.1 are both causing probs for you. let's see what's up with that. I do think you might be hitting that infinite wait on ssh keys though on 0.7.9 | 17:29 |
blackboxsw | on systems like that, I'd expect you'd see "waiting for SSH public key files" in the logs, if you ever got there. | 17:30 |
blackboxsw | GreatSnoopy: looks like that ssh key times out at 900 seconds | 17:30 |
blackboxsw | so that's a 15 minute wait | 17:30 |
* blackboxsw | had to hit the calculator | 17:31 |
GreatSnoopy | what ssh keys does it expect and who is supposed to create those ? because although the source is a standard source image, i spin the instances via terraform | 17:35 |
GreatSnoopy | and the only thing I pass to the instance is custom data with the cloudinit yaml | 17:35 |
blackboxsw | GreatSnoopy: it looks like DataSourceAzure.py is waiting for an ssh key from the azure fabric to configure the instance (as the UI/api provides an ssh key that is used to contact the instance) | 17:41 |
GreatSnoopy | that should not be normal behavior : | 17:47 |
GreatSnoopy | because even in the GUI i can create a vm that has only password - no key | 17:47 |
GreatSnoopy | or is it a different one ? system only ? | 17:48 |
GreatSnoopy | that is provided no matter what the user actually provides ? | 17:48 |
smoser | GreatSnoopy: it only waits for files to appear which are listed on the cdrom in the metadata there. | 18:03 |
smoser | so... password only wont have any ssh keys listed in the metadata so it wont wait for anything | 18:04 |
GreatSnoopy | can i manually retrieve the file so that i can check the data received before i reboot the machine ? | 18:05 |
GreatSnoopy | where does that file "land" initially ? | 18:05 |
smoser | walinux-agent would put it into /var/lib/waagent | 18:08 |
smoser | for *.crt files in that directory | 18:08 |
GreatSnoopy | one more question : waagent should be disabled so that its cloud-init the one that starts it, or should be left enabled ? | 18:10 |
GreatSnoopy | cloudinit's relationship with waagent seems a little bit of a chicken and egg dilemma | 18:10 |
smoser | GreatSnoopy: cloud-init no longer needs walinux-agent. | 18:14 |
smoser | and so its default behavior is suggested. | 18:14 |
smoser | which is 'agent_command' of '__builtin__' | 18:14 |
GreatSnoopy | interesting, but won't azure "see" the instance as failed if the cloud fabric cannot communicate with the agent ? or does cloudinit also create a replacement for that ? | 18:15 |
GreatSnoopy | in my previous experience, not having waagent running results in the instance being marked as failed after reboot | 18:16 |
GreatSnoopy | because it cannot communicate with the agent | 18:16 |
smoser | GreatSnoopy: i'm not sure when it went in, but yeah, you dont need walinux-agent anymore. | 18:23 |
smoser | yeah. and newer ubuntu instances do not use it... let me check for sure | 18:24 |
GreatSnoopy | filed this https://bugs.launchpad.net/cloud-init/+bug/1733403 | 19:09 |
ubot5 | Launchpad bug 1733403 in cloud-init "cloud-init does not work reliably in Azure with Centos" [Undecided,New] | 19:09 |
blackboxsw | thanks for this bug GreatSnoopy and the good context. | 19:11 |
GreatSnoopy | I will come back tomorrow...for now I am out of VM's to brick :D | 19:19 |
GreatSnoopy | you know the most stupid part.... we managed to get this step BEFORE for both centos7 and debian | 19:20 |
GreatSnoopy | why this is not working any more I don't know | 19:20 |
blackboxsw | GreatSnoopy: thx again, one thing I wonder is your datasource config represents agent_command :['systemctl', 'start', 'waagent' ] ... I wonder if it'd work with ['service', 'walinuxagent', 'start'] instead | 19:36 |
blackboxsw | the datasource itself checks to see if agent_command == ['service', 'walinuxagent', 'start'] and grabs content from metadata in that case. | 19:36 |
GreatSnoopy | lets see, although that would be, well... ugly :) | 19:37 |
blackboxsw | yeah, think I misread the code. I think it checks to see if agent_command == '__builtin__' and then tries to get to metadata to pull in any ssh keys etc. | 19:39 |
blackboxsw | I'm referencing docs at https://cloudinit.readthedocs.io/en/latest/topics/datasources/azure.html as well | 19:39 |
blackboxsw | ... as I don't use azure too often :/ | 19:40 |
GreatSnoopy | for s in cloud-init-local cloud-init cloud-config cloud-final; do echo == $s.service ==; systemctl restart $s.service || break; done == cloud-init-local.service == == cloud-init.service == Job for cloud-init.service failed because the control process exited with error code. See "systemctl status cloud-init.service" and "journalctl -xe" for details. | 19:41 |
GreatSnoopy | 2017-11-20 19:40:48,245 - util.py[DEBUG]: Running command ['blkid', '-tTYPE=udf', '-odevice'] with allowed return codes [0, 2] (shell=False, capture=True) 2017-11-20 19:40:48,378 - handlers.py[DEBUG]: finish: init-network/search-AzureNet: SUCCESS: no network data found from DataSourceAzureNet 2017-11-20 19:40:48,378 - util.py[WARNING]: No instance datasource found! Likely bad things to come! 2017-11-20 19:40:48,378 - util.p | 19:41 |
GreatSnoopy | i mean http://pastebin.centos.org/435866/ | 19:42 |
GreatSnoopy | in any case, i will be back tomorrow, and this time I will also rerun the whole process again (including the vm creation) | 19:43 |
blackboxsw | good deal thx GreatSnoopy | 19:44 |
GreatSnoopy | currently that was not made by me, and i will have to check if something got left out, just to be sure | 19:44 |
GreatSnoopy | thanks a bunch, guys. See you tomorrow, have a nice day ! | 19:44 |
blackboxsw | you too | 19:45 |
smoser | blackboxsw: i'm grabbing merge of fix-ec2-fallback-nic now | 20:57 |
blackboxsw | sweet, I'm on the #jinja2 stuff. no other changes needed | 20:57 |
blackboxsw | ? | 20:57 |
blackboxsw | smoser: with that fix-ec2-fallback-nic branch landed, shall we do a minor SRU? | 21:00 |
smoser | we could. | 21:00 |
blackboxsw | we have 2 fixes for ec2 that'd be helpful. | 21:00 |
blackboxsw | and it'd make the SRU simple | 21:00 |
smoser | are you thinking cherry-pick ? | 21:03 |
smoser | blackboxsw: ? | 21:04 |
smoser | https://hastebin.com/ugoyozasuz | 21:04 |
smoser | that is trunk -> bionic right now | 21:04 |
smoser | which we absolutely should do | 21:05 |
blackboxsw | smoser: I was thinking master !cherry-pick | 21:11 |
blackboxsw | forgot about the others | 21:11 |
blackboxsw | but the thing about doing something painful, is to repeat the process often :) | 21:11 |
blackboxsw | it can only get better with practice. | 21:11 |
smoser | blackboxsw: i'm fine with SRU | 21:42 |
smoser | but we need to do bionic first | 21:42 |
smoser | and that can happen "right now" if you want to propose, i'll upload | 21:42 |
blackboxsw | ok will do smoser | 21:51 |
smoser | blackboxsw: i'm pushing the integration test one | 21:54 |
smoser | so wait on that ? | 21:54 |
smoser | i'm tox && git push on it | 21:54 |
smoser | pushed | 21:54 |
smoser | 7624348712b4502f0085d30c05b34dce3f2ceeae | 21:54 |
blackboxsw | fire away | 21:54 |
smoser | thats in now | 21:55 |
blackboxsw | ok grabbing | 21:56 |
blackboxsw | smoser: this is what I see http://pastebin.ubuntu.com/ | 22:01 |
blackboxsw | should the AliYun have been "bionic" | 22:01 |
blackboxsw | ? | 22:01 |
blackboxsw | instead of UNRELEASED? | 22:01 |
smoser | blackboxsw: ah. | 22:03 |
smoser | just squash it into your new commit | 22:03 |
smoser | it was committed as UNRELEASED as it was not released. | 22:03 |
smoser | and then the next release would just pick it up | 22:03 |
smoser | kind of queueing things | 22:03 |
smoser | so you just drop that old changelog entry and pull the AliYun comment up | 22:03 |
smoser | make sense ? | 22:03 |
blackboxsw | yeah squash, gotcha | 22:04 |
smoser | put the debian/ ones at the top. | 22:05 |
smoser | no real reason | 22:05 |
smoser | just how i've done it before | 22:05 |
smoser | https://hastebin.com/sezipunibi | 22:05 |
smoser | those will be your top two entries | 22:05 |
smoser | with your name instead of mine | 22:05 |
smoser | i'll get that in and uploaded later tonight if you MP it | 22:05 |
smoser | but have to run for now. | 22:06 |
blackboxsw | smoser: https://code.launchpad.net/~chad.smith/cloud-init/+git/cloud-init/+merge/333998 | 22:08 |
blackboxsw | moving on to artful,zesty,xenial | 22:08 |
blackboxsw | smoser: forgot with bionic(devel) should I remove bug #'s which don't affect ubuntu? or just on SRU series (artful, zesty, xenial) | 22:16 |
blackboxsw | repushed with my name removed from changelog | 22:19 |
blackboxsw | https://code.launchpad.net/~chad.smith/cloud-init/+git/cloud-init/+merge/333999 | 22:20 |
smoser | blackboxsw: ah. i just leave them in for ubuntu-devel | 22:21 |
blackboxsw | ok thx. seeing merge conflict on artful for some reason | 22:21 |
smoser | so re-push with the bug numbers on devel | 22:22 |
blackboxsw | smoser: re-push contains all bug #'s only removed my name in brackets | 22:22 |
smoser | k | 22:24 |
smoser | blackboxsw: when you do artful, zesty, xenial | 22:25 |
smoser | cherry-pick the templates fix | 22:25 |
smoser | and i'm going to move your debian/cloud-init.templates to before the upstream snapshot comment | 22:25 |
blackboxsw | +1 | 22:25 |
blackboxsw | I'm in hangout for a quick resolution | 22:28 |
blackboxsw | cherry pick is good. just not sure about why I'm seeing merge conflic | 22:28 |
blackboxsw | cherry pick is good. just not sure about why I'm seeing merge conflict | 22:28 |
smoser | hm.. | 22:30 |
smoser | ckonstanski | 22:30 |
smoser | i'll fix that | 22:30 |
smoser | he has username in changelog | 22:30 |
smoser | or, just leave it | 22:30 |
smoser | lets just leave it | 22:31 |
smoser | but we should lint those sorts of things on merge proposal | 22:31 |
smoser | blackboxsw: i'm not in a hurry to do the others tonight. | 22:33 |
smoser | just uploaded bionic | 22:33 |
blackboxsw | kthx. sounds good | 22:33 |
blackboxsw | have a good one | 22:33 |
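
Note: the clue blackboxsw spotted in the pasted log was an empty search list (`Searching for network data source in: []`), and that can be checked for from the shell. What follows is only a minimal sketch and was not part of the conversation: the function name `find_empty_datasource_search` is hypothetical, and the matched string is taken solely from the log line quoted in the discussion, so other cloud-init versions may word it differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every log line where cloud-init reported
# searching an empty datasource list (the list is rendered as "[]").
# Usage: find_empty_datasource_search [LOGFILE]
find_empty_datasource_search() {
    # Default to the log path mentioned in the discussion.
    local logfile="${1:-/var/log/cloud-init.log}"
    # -F treats the pattern as a literal string, so "[]" is not a regex class.
    grep -F 'data source in: []' "$logfile"
}
```

Called with no argument it reads /var/log/cloud-init.log. It inherits grep's exit status, so 0 means at least one empty search list was logged and 1 means none was found.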