[12:20] <mup> Bug #1607345 opened: Collect all logs needed to debug curtin/cloud-init for each deployment <oil> <cloud-init:New> <MAAS:Triaged> <https://launchpad.net/bugs/1607345>
[13:32] <mup> Bug #1607112 changed: [2.0rc2] package installation fails when default gateway is not set <MAAS:Fix Released by andreserl> <https://launchpad.net/bugs/1607112>
[13:38] <mup> Bug #1607112 opened: [2.0rc2] package installation fails when default gateway is not set <MAAS:Fix Released by andreserl> <https://launchpad.net/bugs/1607112>
[13:50] <mup> Bug #1607112 changed: [2.0rc2] package installation fails when default gateway is not set <MAAS:Fix Released by andreserl> <https://launchpad.net/bugs/1607112>
[14:35] <mup> Bug #1576427 opened: [1.9.1] Commissioning didn't discover storage devices <MAAS:In Progress> <https://launchpad.net/bugs/1576427>
[14:39] <voidspace> hey, I used to be able to commission KVM nodes without *having* to set power information - starting them manually
[14:39] <voidspace> that's no longer possible
[14:39] <voidspace> is that by design, or a regression?
[14:39] <voidspace> or both...
[15:05] <mup> Bug #1607403 opened: [trunk] WebUI unavailable due to new version of AngularJS <MAAS:New> <angular.js (Ubuntu):New> <https://launchpad.net/bugs/1607403>
[15:59] <roaksoax> voidspace: by design. There's a "manual" power type
[17:20] <voidspace> roaksoax: it's annoying :-( at least I know how to fix it now though, thanks.
[18:10] <nturner> I'm seeing a strange issue where one of my hardware nodes often fails commissioning due to "Failed to power on node ..." --- however, IPMI power control appears to be working fine for this node. Whenever I click on "check now", power status updates. But it doesn't always update automatically.
[18:10] <nturner> Does this sound familiar to anyone?
[18:11] <nturner> (When I say power status doesn't always update automatically, I mean if I leave the UI up, this node often shows stale power state, but as soon as I click on "check now", the state is updated correctly. It's like it isn't polling correctly. But only this node.)
[18:12] <nturner> Currently running 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1
[18:13] <nturner> But this behavior has been like this all month.
[18:32] <zeestrat> nturner: Yeah, I am seeing the same thing here, both with rc2 and rc3. Should probably open up a bug.
[18:49] <roaksoax> narindergupta: seems like BMC issues
[18:49] <roaksoax> err
[18:49] <roaksoax> nturner:
[18:50] <roaksoax> nturner: seems like BMC issues
[18:50] <roaksoax> nturner: like flaky BMC's
[18:52] <narindergupta> roaksoax, may i know the bug number
[18:53] <narindergupta> nturner, do you know which hardware?
[19:03] <roaksoax> narindergupta: you only deal with hp right ?
[19:03] <roaksoax> narindergupta: or dell too ?
[19:08] <narindergupta> roaksoax, i deal in HP, lenovo, NEC, Ericsson
[19:08] <narindergupta> roaksoax, no dell
[19:29] <roaksoax> narindergupta: k thanks!
[21:22] <nturner> roaksoax: Hmm, wouldn't you expect to see power control errors if maas tried to update the power status and the BMC didn't respond?
[21:22] <nturner> I don't see that. And every time I initiate a power status check by clicking on "check now", it works immediately.
[21:29] <roaksoax> nturner: probably because maas does retry and your BMC says yes ?
[21:29] <roaksoax> nturner: what does rackd.log tell you? does it tell you about any errors?
[21:29] <roaksoax> nturner: note that the power, in the UI, might not update immediately
[21:30] <roaksoax> nturner: it may take a few more seconds to update
[21:32] <nturner> roaksoax: Where can I find rackd.log? On the maas controller?
[21:33] <nturner> roaksoax: here's an excerpt from the event log for this node: https://paste.ubuntu.com/21328195/
[21:34] <nturner> It looks to me like it concluded the deploy failed before it queried the BMC...
[21:35] <roaksoax> nturner: Queried node's BMC - Power state queried: on (Wed, 27 Jul. 2016 19:03:41)
[21:35] <nturner> I found rackd.log on the maas controller. There are backtraces related to this. Will paste...
[21:36] <roaksoax> cool
[21:36] <roaksoax> nturner: also, are you using rc3
[21:36] <nturner> https://paste.ubuntu.com/21328779/
[21:37] <nturner> Yes, I upgraded today. Though the log entry I just posted was from yesterday
[21:38] <nturner> https://paste.ubuntu.com/21328867/ is the same thing today, after upgrade
[21:40] <nturner> roaksoax: That first event list is in reverse-chronological order; that "Queried node's BMC..." message is after the rest.
[21:40] <roaksoax> newell_: ^^
[21:41] <roaksoax> newell_: is there any debug logging that would shed some more light on that?
[21:42] <nturner> I'd be happy to turn up tracing somewhere and try to reproduce this.
[21:43] <newell_> roaksoax: there isn't debug logging on the rack afaik
[21:43] <roaksoax> newell_: where can we inject some debug info to debug the above ?
[21:43] <roaksoax> nturner: i'll look through the code to try to find a good place to insert a piece of code to debug
[21:44] <newell_> roaksoax: well it is weird because this is being thrown from the base class
[21:44] <newell_> roaksoax: is this with trunk?
[21:44] <roaksoax> newell_: 2.0rc3
[21:44] <roaksoax> newell_: but where can we find the output of the power command
[21:44] <roaksoax> newell_: and whether a power command succeeds
[21:44] <roaksoax> newell_: and whether we are retrying
[21:45] <newell_> roaksoax: in the perform_power method that is seen in the traceback
[21:45] <newell_> this is where the retries happen
[21:45] <newell_> perform_power ultimately calls the "actual" power driver to perform off, on, query, etc.
[21:47] <newell_> ah, I have never seen this error actually thrown in practice but if you look in provisioningserver.drivers.power.__init__ perform_power that error is thrown at the end if the state never transitions
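The retry loop newell_ describes can be sketched roughly like this. The function name and the shape of `DEFAULT_WAITING_POLICY` come from the conversation; the callback names and the policy values here are assumptions, not the actual MAAS source:

```python
import time

class PowerError(Exception):
    """Raised when the BMC never reaches the desired power state."""

# Hypothetical waiting policy: each entry is the pause (in seconds)
# before the next power-state query. The real tuple lives in
# provisioningserver/drivers/power/__init__.py.
DEFAULT_WAITING_POLICY = (1, 2, 2, 4, 6, 8, 12)

def perform_power(query_state, change_state, desired, system_id,
                  waiting_policy=DEFAULT_WAITING_POLICY):
    """Issue a power command, then poll until the BMC reports `desired`."""
    change_state(desired)
    state = query_state()
    for wait in waiting_policy:
        if state == desired:
            return state
        time.sleep(wait)
        state = query_state()
    if state == desired:
        return state
    # The error nturner is seeing: the BMC never transitioned in time.
    raise PowerError(
        "Failed to power %s. BMC never transitioned from %s to %s."
        % (system_id, state, desired))
```

The key point for the bug at hand: if the hardware takes longer than the sum of the policy's pauses to power on, the loop exhausts and the error fires even though the node eventually boots.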
[21:48] <roaksoax> newell_: honestly, we need to add some debugging log there
[21:49] <newell_> nturner: what type of power driver are you using for this?
[21:56] <roaksoax> nturner: if you try to do it just once, how many of "provisioningserver.drivers.power.PowerError: Failed to power 4y3h8d. BMC never transitioned from off to on." do you see... can you please share the logs just for 1 attempt?
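roaksoax's question, how many of those PowerError lines a single attempt produces, can be answered with a short script. The log path is the usual MAAS 2.0 location; adjust if your install differs:

```python
# Count how many times rackd logged the "BMC never transitioned"
# PowerError, to judge how many retries a single attempt burns.
NEEDLE = 'BMC never transitioned'

def count_power_errors(lines, needle=NEEDLE):
    """Return how many log lines contain the failure message."""
    return sum(1 for line in lines if needle in line)

if __name__ == '__main__':
    # Typical rack controller log location for MAAS 2.0.
    with open('/var/log/maas/rackd.log') as log:
        print(count_power_errors(log))
```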
[22:12] <nturner> newell_: this node is using LAN_2_0 [IPMI 2.0]
[22:13] <nturner> roaksoax: Sure, will do one now.
[22:32] <roaksoax> nturner: if you could apply this: http://paste.ubuntu.com/21335012/
[22:32] <roaksoax> nturner: to /usr/lib/python3/dist-packages/provisioningserver/..../__init__.py
[22:32] <roaksoax> nturner: restart maas-rackd
[22:33] <roaksoax> nturner: and retry it would be great
[22:37] <nturner> roaksoax: sure, will do
[22:37] <nturner> naturally, the last 2 deploys succeeded without incident =\
[22:45] <nturner> Ah, a failure! Logs coming...
[22:49] <nturner> roaksoax: newell_: here's syslog output (with verbose named entries elided): https://paste.ubuntu.com/21336662/
[22:50] <nturner> Based on this tracing, I wonder if the problem is simply that this system is sometimes slow to power on.
[22:50] <newell_> nturner: yeah your hardware seems to be slow
[22:50] <nturner> Is it possible to change those timeout values or increase the number of retries?
[22:51] <newell_> nturner: you can if you edit the python file manually
[22:51] <roaksoax> yeah there's no setting to do it
[22:52] <nturner> yeah, in there now...
[22:52] <roaksoax> but strange... it takes more than 24+ seconds to power on ?
[22:52] <newell_> 35 seconds to be exact
[22:53] <newell_> nturner: do you have physical access to the hardware?
[22:53] <newell_> nturner: if so, does it really take longer than 35 seconds to power on?
[22:53] <nturner> well, it does seem odd.
[22:53] <newell_> nturner: I am assuming that at some point the power actually does turn on
[22:54] <nturner> when I looked at the maas UI after seeing this in the logs, the power state shows as on
[22:54] <newell_> nturner: and does the node boot up at that point?
[22:54] <nturner> I can try again with more retries and will monitor the UI a little closer during that time
[22:55] <nturner> yeah, the node does boot
[22:57] <newell_> nturner: if you edit the DEFAULT_WAITING_POLICY tuple in /usr/lib/python3/dist-packages/provisioningserver/drivers/power/__init__.py, save the file, and restart rackd as mentioned above, you will have more retries
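For reference, a waiting policy of the kind newell_ mentions is just a tuple of pauses (in seconds) between successive power-state queries. The values below are illustrative, not confirmed from the MAAS source, though they do sum to the 35 seconds cited earlier; appending entries gives a slow BMC more time:

```python
# Illustrative values only -- check
# /usr/lib/python3/dist-packages/provisioningserver/drivers/power/__init__.py
# for what your MAAS version actually ships. Each entry is the pause
# (in seconds) before the next power-state query.
DEFAULT_WAITING_POLICY = (1, 2, 2, 4, 6, 8, 12)             # sums to 35 s
PATIENT_WAITING_POLICY = DEFAULT_WAITING_POLICY + (12, 12)  # sums to 59 s
```

Remember this is a packaged file, so a MAAS upgrade will silently overwrite any hand edit.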
[22:58] <roaksoax> newell_: maybe there's a bug in the UI where it is saying it is ON when it is not and it is failing to check ....
[22:59] <nturner> looks like you actually have to edit ipmi.py ... running now
[23:00] <nturner> OK...
[23:01] <nturner> so I configured it to retry many times after 12 seconds each...
[23:01] <nturner> and after 5 or so, I opened the UI and clicked "check now" -- the UI showed Power on within a second
[23:01] <nturner> Meanwhile, the "Successfully checked power state, checking if it is desired... off" continued in the log
[23:02] <nturner> seems like there are 2 paths being taken here
[23:03] <roaksoax> nturner: what if you manually turn off your BMC, and then click on "Check Power"
[23:03] <nturner> roaksoax: newell_: What happens when I click "check now" in the UI? It doesn't appear to enter that maas.drivers.power logic (no traces seen).
[23:04] <nturner> It shows as off.
[23:05] <newell_> nturner: okay so if the BMC is off and you check the UI, that is working
[23:05] <nturner> If I click on "check now" every second after deploying, it shows On after about 12 or so seconds.
[23:13] <newell_> nturner: when you click on check now it should query the BMC via the power_query method in ipmi.py
[23:27] <nturner> newell_: thanks, i added some tracing there.... it's very odd; i see /usr/sbin/ipmipower being run with the same arguments when I click the 'check now' in the UI and during the polling during deploy
[23:27] <nturner> but again, it polls for many cycles, and then i click 'check now' and the UI instantly shows Power on
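Both the "check now" path and the deploy-time poll appear to shell out to FreeIPMI's ipmipower, so a rough sketch of such a query looks like the following. The exact flags MAAS passes are an assumption here; compare against the command line the added tracing prints:

```python
import subprocess

def parse_ipmipower_output(output):
    """Parse `ipmipower --stat` style output, e.g. '10.0.0.5: on'."""
    # Take the state token after the last colon on the first line.
    line = output.strip().splitlines()[0]
    return line.rsplit(':', 1)[-1].strip()

def query_power_state(ip, user, password):
    """Ask the BMC for its power state via FreeIPMI's ipmipower.

    Flag set is illustrative (LAN_2_0 matches nturner's power type);
    check rackd's tracing for the exact invocation on your install.
    """
    proc = subprocess.run(
        ['/usr/sbin/ipmipower', '--stat',
         '-h', ip, '-u', user, '-p', password,
         '--driver-type', 'LAN_2_0'],
        capture_output=True, text=True, timeout=30)
    return parse_ipmipower_output(proc.stdout)
```

If the same command with the same arguments returns different results on the two paths, the divergence is more likely in how each path schedules or interprets the query than in the BMC itself.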
[23:28] <mup> Bug #1607560 opened: switching rackd.conf maas_url back to localhost has no effect <MAAS:New> <https://launchpad.net/bugs/1607560>
[23:28]  * nturner has to head out for a bit; more fun tomorrow!
[23:29] <newell_> nturner: so just to be clear, when you increase the wait times, it all works fine correct?
[23:30] <nturner> newell_: no
[23:30] <nturner> The only thing that seems to work reliably in this particular node is clicking 'check now' in the UI
[23:31] <nturner> the polling seems to somehow get different results
[23:31] <nturner> which seems really weird
[23:31] <mup> Bug #1607560 changed: switching rackd.conf maas_url back to localhost has no effect <MAAS:New> <https://launchpad.net/bugs/1607560>
[23:31] <newell_> nturner: can you file a bug for this?
[23:31] <newell_> nturner: if you would be so kind, also list what type of hardware you are using
[23:31] <nturner> newell_: can do; will probably do this tomorrow
[23:31] <nturner> sure, no problem.
[23:32] <newell_> nturner: thanks!
[23:32] <nturner> thanks for the help today
[23:33] <nturner> now that I know where the relevant code is, I can have some fun doing a little further debug too =)
[23:38] <newell_> nturner: np :)
[23:43] <mup> Bug #1607560 opened: switching rackd.conf maas_url back to localhost has no effect <MAAS:New> <https://launchpad.net/bugs/1607560>