mup | Bug #1607345 opened: Collect all logs needed to debug curtin/cloud-init for each deployment <oil> <cloud-init:New> <MAAS:Triaged> <https://launchpad.net/bugs/1607345> | 12:20 |
---|---|---|
mup | Bug #1607112 changed: [2.0rc2] package installation fails when default gateway is not set <MAAS:Fix Released by andreserl> <https://launchpad.net/bugs/1607112> | 13:32 |
mup | Bug #1607112 opened: [2.0rc2] package installation fails when default gateway is not set <MAAS:Fix Released by andreserl> <https://launchpad.net/bugs/1607112> | 13:38 |
mup | Bug #1607112 changed: [2.0rc2] package installation fails when default gateway is not set <MAAS:Fix Released by andreserl> <https://launchpad.net/bugs/1607112> | 13:50 |
mup | Bug #1576427 opened: [1.9.1] Commissioning didn't discover storage devices <MAAS:In Progress> <https://launchpad.net/bugs/1576427> | 14:35 |
voidspace | hey, I used to be able to commission KVM nodes without *having* to set power information - starting them manually | 14:39 |
voidspace | that's no longer possible | 14:39 |
voidspace | is that by design, or a regression? | 14:39 |
voidspace | or both... | 14:39 |
mup | Bug #1607403 opened: [trunk] WebUI unavailable due to new version of AngularJS <MAAS:New> <angular.js (Ubuntu):New> <https://launchpad.net/bugs/1607403> | 15:05 |
roaksoax | voidspace: by design. There's a "manual" power type | 15:59 |
voidspace | roaksoax: it's annoying :-( at least I know how to fix it now though, thanks. | 17:20 |
=== arturt__ is now known as arturt | ||
nturner | I'm seeing a strange issue where one of my hardware nodes often fails commissioning due to "Failed to power on node ..." --- however, IPMI power control appears to be working fine for this node. Whenever I click on "check now", power status updates. But it doesn't always update automatically. | 18:10 |
nturner | Does this sound familiar to anyone? | 18:10 |
nturner | (When I say power status doesn't always update automatically, I mean if I leave the UI up, this node often shows stale power state, but as soon as I click on "check now", the state is updated correctly. It's like it isn't polling correctly. But only this node.) | 18:11 |
nturner | Currently running 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 | 18:12 |
nturner | But this behavior has been like this all month. | 18:13 |
zeestrat | nturner: Yeah, I am seeing the same thing here, both with rc2 and rc3. Should probably open up a bug. | 18:32 |
roaksoax | narindergupta: seems like BMC issues | 18:49 |
roaksoax | err | 18:49 |
roaksoax | nturner: | 18:49 |
roaksoax | nturner: seems like BMC issues | 18:50 |
roaksoax | nturner: like flaky BMCs | 18:50 |
narindergupta | roaksoax, may i know the bug number | 18:52 |
narindergupta | nturner, do you know which hardware? | 18:53 |
roaksoax | narindergupta: you only deal with hp right ? | 19:03 |
roaksoax | narindergupta: or dell too ? | 19:03 |
narindergupta | roaksoax, i deal in HP, lenovo, NEC, Ericsson | 19:08 |
narindergupta | roaksoax, no dell | 19:08 |
roaksoax | narindergupta: k thanks! | 19:29 |
nturner | roaksoax: Hmm, wouldn't you expect to see power control errors if maas tried to update the power status and the BMC didn't respond? | 21:22 |
nturner | I don't see that. And every time I initiate a power status check by clicking on "check now", it works immediately. | 21:22 |
roaksoax | nturner: probably because maas does retry and your BMC says yes ? | 21:29 |
roaksoax | nturner: what does rackd.log tell you ? does it tell you about any errors ? | 21:29 |
roaksoax | nturner: note that the power, in the UI, might not update immediately | 21:29 |
roaksoax | nturner: it may take a few more seconds to update | 21:30 |
nturner | roaksoax: Where can I find rackd.log? On the maas controller? | 21:32 |
nturner | roaksoax: here's an excerpt from the event log for this node: https://paste.ubuntu.com/21328195/ | 21:33 |
nturner | It looks to me like it concluded the deploy failed before it queried the BMC... | 21:34 |
roaksoax | nturner: Queried node's BMC - Power state queried: on - Wed, 27 Jul. 2016 19:03:41 | 21:35 |
nturner | I found rackd.log on the maas controller. There are backtraces related to this. Will paste... | 21:35 |
roaksoax | cool | 21:36 |
roaksoax | nturner: also, are you using rc3 | 21:36 |
nturner | https://paste.ubuntu.com/21328779/ | 21:36 |
nturner | Yes, I upgraded today. Though the log entry I just posted was from yesterday | 21:37 |
nturner | https://paste.ubuntu.com/21328867/ is the same thing today, after upgrade | 21:38 |
nturner | roaksoax: That first event list is in reverse-chronological order; that "Queried node's BMC..." message is after the rest. | 21:40 |
roaksoax | newell_: ^^ | 21:40 |
roaksoax | newell_: is there any debug logging that would shed some more light on that? | 21:41 |
nturner | I'd be happy to turn up tracing somewhere and try to reproduce this. | 21:42 |
newell_ | roaksoax: there isn't debug logging on the rack afaik | 21:43 |
roaksoax | newell_: where can we inject some debug info to debug the above ? | 21:43 |
roaksoax | nturner: I'll look through the code to try to find a good place to insert a piece of code to debug | 21:43 |
newell_ | roaksoax: well it is weird because this is being thrown from the base class | 21:44 |
newell_ | roaksoax: is this with trunk? | 21:44 |
roaksoax | newell_: 2.0rc3 | 21:44 |
roaksoax | newell_: but where can we find the output of the power command | 21:44 |
roaksoax | newell_: and whether a power command succeeds | 21:44 |
roaksoax | newell_: and whether we are retrying | 21:44 |
newell_ | roaksoax: in the perform_power method that is seen in the traceback | 21:45 |
newell_ | this is where the retries happen | 21:45 |
newell_ | perform_power ultimately calls the "actual" power driver to perform off, on, query, etc. | 21:45 |
newell_ | ah, I have never seen this error actually thrown in practice but if you look in provisioningserver.drivers.power.__init__ perform_power that error is thrown at the end if the state never transitions | 21:47 |
roaksoax | newell_: honestly, we need to add some debugging log there | 21:48 |
newell_ | nturner: what type of power driver are you using for this? | 21:49 |
roaksoax | nturner: if you try to do it just once, how many of "provisioningserver.drivers.power.PowerError: Failed to power 4y3h8d. BMC never transitioned from off to on." do you see... can you please share the logs just for 1 attempt ? | 21:56 |
nturner | newell_: this node is using LAN_2_0 [IPMI 2.0] | 22:12 |
nturner | roaksoax: Sure, will do one now. | 22:13 |
roaksoax | nturner: if you could apply this: http://paste.ubuntu.com/21335012/ | 22:32 |
roaksoax | nturner: to /usr/lib/python3/dist-packages/provisioningserver/..../__init__.py | 22:32 |
roaksoax | nturner: restart maas-rackd | 22:32 |
roaksoax | nturner: and retry it would be great | 22:33 |
nturner | roaksoax: sure, will do | 22:37 |
nturner | naturally, the last 2 deploys succeeded without incident =\ | 22:37 |
nturner | Ah, a failure! Logs coming... | 22:45 |
nturner | roaksoax: newell_: here's syslog output (with verbose named entries elided): https://paste.ubuntu.com/21336662/ | 22:49 |
nturner | Based on this tracing, I wonder if the problem is simply that this system is sometimes slow to power on. | 22:50 |
newell_ | nturner: yeah your hardware seems to be slow | 22:50 |
nturner | Is it possible to change those timeout values or increase the number of retries? | 22:50 |
newell_ | nturner: you can if you edit the python file manually | 22:51 |
roaksoax | yeah there's no setting to do it | 22:51 |
nturner | yeah, in there now... | 22:52 |
roaksoax | but strange... it takes 24+ seconds to power on ? | 22:52 |
newell_ | 35 seconds to be exact | 22:52 |
newell_ | nturner: do you have physical access to the hardware? | 22:53 |
newell_ | nturner: if so, does it really take longer than 35 seconds to power on? | 22:53 |
nturner | well, it does seem odd. | 22:53 |
newell_ | nturner: I am assuming that at some point the power actually does turn on | 22:53 |
nturner | when I looked at the maas UI after seeing this in the logs, the power state shows as on | 22:54 |
newell_ | nturner: and does the node boot up at that point? | 22:54 |
nturner | I can try again with more retries and will monitor the UI a little closer during that time | 22:54 |
nturner | yeah, the node does boot | 22:55 |
newell_ | nturner: if you edit the DEFAULT_WAITING_POLICY tuple in /usr/lib/python3/dist-packages/provisioningserver/drivers/power/__init__.py, save the file, and restart rackd as mentioned above, you will have more retries | 22:57 |
roaksoax | newell_: maybe there's a bug in the UI where it is saying it is ON when it is not and it is failing to check .... | 22:58 |
nturner | looks like you actually have to edit ipmi.py ... running now | 22:59 |
nturner | OK... | 23:00 |
nturner | so I configured it to retry many times after 12 seconds each... | 23:01 |
nturner | and after 5 or so, I opened the UI and clicked "check now" -- the UI showed Power on within a second | 23:01 |
nturner | Meanwhile, the "Successfully checked power state, checking if it is desired... off" continued in the log | 23:01 |
nturner | seems like there are 2 paths being taken here | 23:02 |
roaksoax | nturner: what if you manually turn off your BMC, and then click on "Check Power" | 23:03 |
nturner | roaksoax: newell_: What happens when I click "check now" in the UI? It doesn't appear to enter that maas.drivers.power logic (no traces seen). | 23:03 |
nturner | It shows as off. | 23:04 |
newell_ | nturner: okay so if the BMC is off and you check the UI, that is working | 23:05 |
nturner | If I click on "check now" every second after deploying, it shows On after about 12 or so seconds. | 23:05 |
newell_ | nturner: when you click on check now it should query the BMC via the power_query method in ipmi.py | 23:13 |
nturner | newell_: thanks, i added some tracing there.... it's very odd; i see /usr/sbin/ipmipower being run with the same arguments when I click the 'check now' in the UI and during the polling during deploy | 23:27 |
nturner | but again, it polls for many cycles, and then i click 'check now' and the UI instantly shows Power on | 23:27 |
mup | Bug #1607560 opened: switching rackd.conf maas_url back to localhost has no effect <MAAS:New> <https://launchpad.net/bugs/1607560> | 23:28 |
* nturner has to head out for a bit; more fun tomorrow! | 23:28 | |
newell_ | nturner: so just to be clear, when you increase the wait times, it all works fine correct? | 23:29 |
nturner | newell_: no | 23:30 |
nturner | The only thing that seems to work reliably in this particular node is clicking 'check now' in the UI | 23:30 |
nturner | the polling seems to somehow get different results | 23:31 |
nturner | which seems really weird | 23:31 |
mup | Bug #1607560 changed: switching rackd.conf maas_url back to localhost has no effect <MAAS:New> <https://launchpad.net/bugs/1607560> | 23:31 |
newell_ | nturner: can you file a bug for this? | 23:31 |
newell_ | nturner: if you would be so kind, also list what type of hardware you are using | 23:31 |
nturner | newell_: can do; will probably do this tomorrow | 23:31 |
nturner | sure, no problem. | 23:31 |
newell_ | nturner: thanks! | 23:32 |
nturner | thanks for the help today | 23:32 |
nturner | now that I know where the relevant code is, I can have some fun doing a little further debug too =) | 23:33 |
newell_ | nturner: np :) | 23:38 |
mup | Bug #1607560 opened: switching rackd.conf maas_url back to localhost has no effect <MAAS:New> <https://launchpad.net/bugs/1607560> | 23:43 |
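
The retry behavior discussed above (perform_power retrying according to DEFAULT_WAITING_POLICY, then raising PowerError if the state never transitions) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MAAS source: the function name `power_on_with_retries`, its parameters, and the `WAITING_POLICY` values are hypothetical, and the waits were picked only so that they add up to the 35 seconds mentioned in the log.

```python
import time
from typing import Callable, Sequence

# Hypothetical waits in seconds. Chosen only so they sum to the 35 s
# mentioned in the log; not copied from the MAAS source.
WAITING_POLICY = (1, 2, 2, 4, 6, 8, 12)


class PowerError(Exception):
    """Raised when the node never reaches the desired power state."""


def power_on_with_retries(
    power_on: Callable[[], None],
    power_query: Callable[[], str],
    waiting_policy: Sequence[float] = WAITING_POLICY,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Issue power-on, wait, then query; repeat once per policy entry.

    Returns the number of attempts used. Raises PowerError if the
    queried state is never "on" after the whole policy is exhausted.
    """
    for attempt, wait in enumerate(waiting_policy, start=1):
        power_on()            # ask the BMC to power the node on
        sleep(wait)           # give the hardware time to react
        if power_query() == "on":
            return attempt
    raise PowerError("BMC never transitioned from off to on.")
```

Passing a longer `waiting_policy` gives slow hardware more time before the error is raised, which is the effect of editing the tuple by hand as suggested above. It does not explain why, in nturner's case, the polling path and "check now" appeared to report different states.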