/srv/irclogs.ubuntu.com/2016/07/28/#maas.txt

mupBug #1607345 opened: Collect all logs needed to debug curtin/cloud-init for each deployment <oil> <cloud-init:New> <MAAS:Triaged> <https://launchpad.net/bugs/1607345>12:20
mupBug #1607112 changed: [2.0rc2] package installation fails when default gateway is not set <MAAS:Fix Released by andreserl> <https://launchpad.net/bugs/1607112>13:32
mupBug #1607112 opened: [2.0rc2] package installation fails when default gateway is not set <MAAS:Fix Released by andreserl> <https://launchpad.net/bugs/1607112>13:38
mupBug #1607112 changed: [2.0rc2] package installation fails when default gateway is not set <MAAS:Fix Released by andreserl> <https://launchpad.net/bugs/1607112>13:50
mupBug #1576427 opened: [1.9.1] Commissioning didn't discover storage devices <MAAS:In Progress> <https://launchpad.net/bugs/1576427>14:35
voidspacehey, I used to be able to commission KVM nodes without *having* to set power information - starting them manually14:39
voidspacethat's no longer possible14:39
voidspaceis that by design, or a regression?14:39
voidspaceor both...14:39
mupBug #1607403 opened: [trunk] WebUI unavailable due to new version of AngularJS <MAAS:New> <angular.js (Ubuntu):New> <https://launchpad.net/bugs/1607403>15:05
roaksoaxvoidspace: by design. There's a "manual" power type15:59
voidspaceroaksoax: it's annoying :-( at least I know how to fix it now though, thanks.17:20
=== arturt__ is now known as arturt
nturnerI'm seeing a strange issue where one of my hardware nodes often fails commissioning due to "Failed to power on node ..." --- however, IPMI power control appears to be working fine for this node. Whenever I click on "check now", power status updates. But it doesn't always update automatically.18:10
nturnerDoes this sound familiar to anyone?18:10
nturner(When I say power status doesn't always update automatically, I mean if I leave the UI up, this node often shows stale power state, but as soon as I click on "check now", the state is updated correctly. It's like it isn't polling correctly. But only this node.)18:11
nturnerCurrently running 2.0.0~rc3+bzr5180-0ubuntu2~16.04.118:12
nturnerBut this behavior has been like this all month.18:13
zeestratnturner: Yeah, I am seeing the same thing here, both with rc2 and rc3. Should probably open up a bug.18:32
roaksoaxnarindergupta: seems like BMC issues18:49
roaksoaxerr18:49
roaksoaxnturner:18:49
roaksoaxnturner: seems like BMC issues18:50
roaksoaxnturner: like flaky BMC's18:50
narinderguptaroaksoax, may i know the bug number18:52
narinderguptanturner, do you know which hardware?18:53
roaksoaxnarindergupta: you only deal with hp right ?19:03
roaksoaxnarindergupta: or dell too ?19:03
narinderguptaroaksoax, i deal in HP, lenovo, NEC, Ericsson19:08
narinderguptaroaksoax, no dell19:08
roaksoaxnarindergupta: k thanks!19:29
nturnerroaksoax: Hmm, wouldn't you expect to see power control errors if maas tried to update the power status and the BMC didn't respond?21:22
nturnerI don't see that. And every time I initiate a power status check by clicking on "check now", it works immediately.21:22
roaksoaxnturner: probably because maas does retry and your BMC says yes ?21:29
roaksoaxnturner: what does rackd.log  tell you ? does it tell you about any errors ?21:29
roaksoaxnturner: note that the power, in the UI, might not update immediately21:29
roaksoaxnturner: it may take a few more seconds to update21:30
nturnerroaksoax: Where can I find rackd.log? On the maas controller?21:32
nturnerroaksoax: here's an excerpt from the event log for this node: https://paste.ubuntu.com/21328195/21:33
nturnerIt looks to me like it concluded the deploy failed before it queried the BMC...21:34
roaksoaxnturner: Queried node's BMC - Power state queried: onWed, 27 Jul. 2016 19:03:4121:35
nturnerI found rackd.log on the maas controller. There are backtraces related to this. Will paste...21:35
roaksoaxcool21:36
roaksoaxnturner: also, are you using rc321:36
nturnerhttps://paste.ubuntu.com/21328779/21:36
nturnerYes, I upgraded today. Though the log entry I just posted was from yesterday21:37
nturnerhttps://paste.ubuntu.com/21328867/ is the same thing today, after upgrade21:38
nturnerroaksoax: That first event list is in reverse-chronological order; that "Queried node's BMC..." message is after the rest.21:40
roaksoaxnewell_: ^^21:40
roaksoaxnewell_: is there any debug logging that would shed some more light on that?21:41
nturnerI'd be happy to turn up tracing somewhere and try to reproduce this.21:42
newell_roaksoax: there isn't debug logging on the rack afaik21:43
roaksoaxnewell_: where can we inject some debug info to debug the above ?21:43
roaksoaxnturner: i'll look through the code to try to find a good place to insert a piece of code to debug21:43
newell_roaksoax: well it is weird because this is being thrown from the base class21:44
newell_roaksoax: is this with trunk?21:44
roaksoaxnewell_: 2.0rc321:44
roaksoaxnewell_: but where can we find the output of the power command21:44
roaksoaxnewell_: and whether a power command succeeds21:44
roaksoaxnewell_: and whether we are retrying21:44
newell_roaksoax: in the perform_power method that is seen in the traceback21:45
newell_this is where the retries happen21:45
newell_perform_power ultimately calls the "actual" power driver to perform either off, on, query, etc.21:45
newell_ah, I have never seen this error actually thrown in practice but if you look in provisioningserver.drivers.power.__init__ perform_power that error is thrown at the end if the state never transitions21:47
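The retry behaviour described here can be sketched roughly as follows. This is a minimal illustration, not the MAAS source: `perform_power_sketch`, the `change_state` and `query_state` callables, and the example waiting policy are hypothetical names and values. Only the `PowerError` name and the shape of the error message come from this log.

```python
import time


class PowerError(Exception):
    """Raised when the BMC never reaches the requested power state."""


def perform_power_sketch(change_state, query_state, desired, system_id,
                         waiting_policy=(1, 2, 2), sleep=time.sleep):
    """Request a power change, then poll until it takes effect or retries run out.

    change_state and query_state are hypothetical stand-ins for a real
    power driver; waiting_policy holds illustrative wait times in seconds.
    """
    previous = query_state()          # remember the state we started from
    for wait in waiting_policy:
        change_state(desired)         # ask the BMC to switch on/off
        sleep(wait)                   # give the hardware time to react
        if query_state() == desired:  # did the BMC report the new state?
            return desired
    # Every retry was used up and the state never matched.
    raise PowerError(
        "Failed to power %s. BMC never transitioned from %s to %s."
        % (system_id, previous, desired))
```

With a policy like this, a machine that needs longer than the sum of the waits to report "on" would hit the error even though it eventually powers up.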
roaksoaxnewell_: honestly, we need to add some debugging log there21:48
newell_nturner: what type of power driver are you using for this?21:49
roaksoaxnturner: if you try to do it just once, how many of "provisioningserver.drivers.power.PowerError: Failed to power 4y3h8d. BMC never transitioned from off to on." do you see... can you please share the logs just for 1 attempt ?21:56
nturnernewell_: this node is using LAN_2_0 [IPMI 2.0]22:12
nturnerroaksoax: Sure, will do one now.22:13
roaksoaxnturner: if you could apply this: http://paste.ubuntu.com/21335012/22:32
roaksoaxnturner: to /usr/lib/python3/dist-packages/provisioningserver/..../__init__.py22:32
roaksoaxnturner: restart maas-rackd22:32
roaksoaxnturner: and retry it would be great22:33
nturnerroaksoax: sure, will do22:37
nturnernaturally, the last 2 deploys succeeded without incident =\22:37
nturnerAh, a failure! Logs coming...22:45
nturnerroaksoax: newell_: here's syslog output (with verbose named entries elided): https://paste.ubuntu.com/21336662/22:49
nturnerBased on this tracing, I wonder if the problem is simply that this system is sometimes slow to power on.22:50
newell_nturner: yeah your hardware seems to be slow22:50
nturnerIs it possible to change those timeout values or increase the number of retries?22:50
newell_nturner: you can if you edit the python file manually22:51
roaksoaxyeah there's no setting to do it22:51
nturneryeah, in there now...22:52
roaksoaxbut strange... it takes more than 24+ seconds to power on ?22:52
newell_35 seconds to be exact22:52
newell_nturner: do you have physical access to the hardware?22:53
newell_nturner: if so, does it really take longer than 35 seconds to power on?22:53
nturnerwell, it does seem odd.22:53
newell_nturner: I am assuming that at some point the power actually does turn on22:53
nturnerwhen I looked at the maas UI after seeing this in the logs, the power state shows as on22:54
newell_nturner: and does the node boot up at that point?22:54
nturnerI can try again with more retries and will monitor the UI a little closer during that time22:54
nturneryeah, the node does boot22:55
newell_nturner: if you edit the DEFAULT_WAITING_POLICY tuple in /usr/lib/python3/dist-packages/provisioningserver/drivers/power/__init__.py, save the file, and restart rackd as mentioned above, you will have more retries22:57
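To see how a longer waiting policy changes the time budget, it helps to add up the waits. The tuples below are made-up examples for illustration, not the values shipped in MAAS, and `total_wait` is a hypothetical helper.

```python
def total_wait(waiting_policy):
    """Return (number of retries, total seconds spent waiting) for a policy."""
    return len(waiting_policy), sum(waiting_policy)


short_policy = (1, 2, 4, 8)             # illustrative values only
long_policy = short_policy + (12,) * 5  # five extra 12-second retries

print(total_wait(short_policy))
print(total_wait(long_policy))
```

Each extra entry in the tuple adds one more retry and that many more seconds before the driver gives up.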
roaksoaxnewell_: maybe there's a bug in the UI where it is saying it is ON when it is not and it is failing to check ....22:58
nturnerlooks like you actually have to edit ipmi.py ... running now22:59
nturnerOK...23:00
nturnerso I configured it to retry many times after 12 seconds each...23:01
nturnerand after 5 or so, I opened the UI and clicked "check now" -- the UI showed Power on within a second23:01
nturnerMeanwhile, the "Successfully checked power state, checking if it is desired... off" continued in the log23:01
nturnerseems like there are 2 paths being taken here23:02
roaksoaxnturner: what if you manually turn off your BMC, and then click on "Check Power"23:03
nturnerroaksoax: newell_: What happens when I click "check now" in the UI? It doesn't appear to enter that maas.drivers.power logic (no traces seen).23:03
nturnerIt shows as off.23:04
newell_nturner: okay so if the BMC is off and you check the UI, that is working23:05
nturnerIf I click on "check now" every second after deploying, it shows On after about 12 or so seconds.23:05
newell_nturner: when you click on check now it should query the BMC via the power_query method in ipmi.py23:13
nturnernewell_: thanks, i added some tracing there.... it's very odd; i see /usr/sbin/ipmipower being run with the same arguments when I click the 'check now' in the UI and during the polling during deploy23:27
nturnerbut again, it polls for many cycles, and then i click 'check now' and the UI instantly shows Power on23:27
mupBug #1607560 opened: switching rackd.conf maas_url back to localhost has no effect <MAAS:New> <https://launchpad.net/bugs/1607560>23:28
* nturner has to head out for a bit; more fun tomorrow!23:28
newell_nturner: so just to be clear, when you increase the wait times, it all works fine correct?23:29
nturnernewell_: no23:30
nturnerThe only thing that seems to work reliably in this particular node is clicking 'check now' in the UI23:30
nturnerthe polling seems to somehow get different results23:31
nturnerwhich seems really weird23:31
mupBug #1607560 changed: switching rackd.conf maas_url back to localhost has no effect <MAAS:New> <https://launchpad.net/bugs/1607560>23:31
newell_nturner: can you file a bug for this?23:31
newell_nturner: if you would be so kind, also list what type of hardware you are using23:31
nturnernewell_: can do; will probably do this tomorrow23:31
nturnersure, no problem.23:31
newell_nturner: thanks!23:32
nturnerthanks for the help today23:32
nturnernow that I know where the relevant code is, I can have some fun doing a little further debug too =)23:33
newell_nturner: np :)23:38
mupBug #1607560 opened: switching rackd.conf maas_url back to localhost has no effect <MAAS:New> <https://launchpad.net/bugs/1607560>23:43
