[00:17] <bradm> anyone able to help debug a maas node that's been added to juju that's hanging around in agent-state: pending?
[00:29] <thumper> mwhudson: which provider?
[00:29] <mwhudson> thumper: manual
[00:30] <thumper> mwhudson: ok, you just need to have preseeded the image with the lxc ubuntu-cloud template
[00:30] <mwhudson> thumper: lxc-create -n foo -t ubuntu-cloud; lxc-destroy -n foo?
[00:30] <thumper> mwhudson: that'd probably work
[00:30] <mwhudson> cool
[00:31] <thumper> mwhudson: with caveats around differing series...
[00:31] <thumper> but if you are just using the defaults
[00:31] <thumper> I think that'll work
[00:33] <mwhudson> all trusty all the time
[00:33] <mwhudson> (also arm64, trying to create a precise lxc not going to go well)
[01:27] <bradm> so no ideas on why a juju unit would be stuck in pending?
[01:35] <thumper> bradm: a unit would be stuck in pending if the unit agent never started
[01:35] <thumper> bradm: which meant the machine agent didn't deploy the unit properly
[01:35] <thumper> bradm: which means, most likely, the machine agent isn't running... ?
[01:36] <bradm> thumper: no, it has them installed and running
[01:36] <thumper> bradm: the machine agent is running?
[01:37] <bradm> /var/lib/juju/tools/machine-31/jujud machine --data-dir /var/lib/juju --machine-id 31 --debug
[01:37] <bradm> that's what you mean? ^^
[01:37] <thumper> bradm: yeah
[01:37] <thumper> bradm: is there a jujud unit running?
[01:37] <bradm> oddly, out of 16 hosts, only one succeeded
[01:38] <bradm> thumper: I've only done a juju add-machine so far
[01:39] <bradm> we're trying to work out what's going on here, we're having some fun with bug #1263181
[01:39] <mup> Bug #1263181: curtin discovers HP /dev/cciss/c0d0 incorrectly <canonical-bootstack> <curtin:Triaged> <https://launchpad.net/bugs/1263181>
[01:40] <bradm> thumper: but I can confirm all 16 nodes have the machine agent running
[01:41] <bradm> thumper: and only one is in a started state
[01:41] <thumper> bradm: but is there a unit agent running?
[01:41] <thumper> bradm: can you ssh to the machine and see if the charm has been deployed?
[01:41] <bradm> thumper: no, none of them have a unit agent
[01:41] <bradm> thumper: we haven't deployed charms yet, we're having booting issues
[01:41] <thumper> bradm: also, does juju status show a machine for the unit?
[01:41] <bradm> thumper: at this point we're just doing a juju add-machine
[01:41] <thumper> bradm: ok, I'm confused
[01:42] <thumper> you did say "unit stuck in pending"
[01:42] <thumper> did you mean machine?
[01:42] <bradm> ah, I did say unit too, sorry
[01:42] <bradm> I should have said machine
[01:42] <bradm> I was using unit in the generic sense, didn't remember that it had a juju specific meaning
[01:43] <thumper> bradm: ah, ok...
[01:43] <bradm> thumper: the other fun part is we're using a 3 node HA bootstrap, so maybe something's going on there
[01:43] <thumper> bradm: so the machines have been deployed, and the machine agent is running, but showing pending?
[01:44] <bradm> thumper: correct, on all but one of the hosts
[01:44] <thumper> bradm: so this would occur if the machine agent can't contact the api server
[01:44] <thumper> bradm: check the local logs on the machines
[01:45] <bradm> thumper: they're not particularly enlightening
[01:45] <bradm> thumper: let me pastebin one for you
[01:45] <thumper> kk
[01:46] <bradm> thumper: http://pastebin.ubuntu.com/8562185/ <- one that didn't work
[01:46] <bradm> thumper: foo-os-[123].maas is the HA'd bootstrap nodes
[01:47] <thumper> bradm: and that's it?
[01:47] <bradm> thumper: yup
[01:47] <bradm> thumper: like I said, not particularly enlightening
[01:47] <thumper> bradm: looks like the websocket handshake is failing for some reason
[01:48] <thumper> bradm: that's the first thing it does
[01:49] <thumper> bradm: if you look at the logs for the state servers, do you see the incoming connection from the other machines
[01:49] <thumper> ?
[01:49] <thumper> bradm: also, make sure the state servers have a decent logging-config
[01:49] <thumper> like DEBUG
[01:50] <thumper> bradm: by default the logging-config is set to WARN if you don't specify
[01:50] <thumper> if you bootstrapped with --debug, it should stay debug
[01:50] <bradm> thumper: http://pastebin.ubuntu.com/8562198/ <- one that did work
[01:50] <bradm> thumper: ok, let me see..
[01:52] <bradm> thumper: so it's definitely in debug
[01:54] <bradm> thumper: there's a lot going on, have you got an example of what an incoming connection would look like in the logs?
[02:03]  * thumper looks
[02:04] <bradm> debug is pretty noisy, as you'd expect, it's hard to tell what to look for
[02:05] <thumper> grr
[02:06] <bradm> HA bootstrap nodes don't help either, they throw enough of their own logs too
[02:08] <thumper> trying to get some info for you bradm
[02:08] <thumper> but hitting other local issues
[02:08] <thumper> gimmie a few minutes
[02:08] <bradm> thumper: sure, that's fine - I'm probably going to have some lunch soon anyway
[02:09] <bradm> I can hear my wife making something in the kitchen now..
[02:12] <bradm> thumper: in fact, lunchtime now!  will be back in a bit
[02:12] <thumper> kk
[02:22] <thumper> bradm: 2014-10-15 02:07:01 DEBUG juju.apiserver apiserver.go:156 <- [1] machine-0 {"RequestId":<id>, ... Entities":[{"Tag":"machine-1"}]}}
[02:22] <thumper> 2014-10-15 02:07:01 DEBUG juju.apiserver apiserver.go:163 -> [1] machine-0 311.032us {"RequestId":<id>,"Response":{...}}
[02:22] <thumper> bradm: that was for machine-1
[02:22] <thumper> bradm: use whichever number you have
[02:22] <thumper> I thought there was a login logged, but seems not
[03:00] <bradm> thumper: righto, let me see..
[03:05] <bradm> thumper: curious, I have juju.state.apiserver, not juju.apiserver
[03:05] <thumper> bradm: oh... which version of juju?
[03:05] <thumper> we did move it
[03:05] <thumper> but perhaps that was 1.21
[03:05] <bradm> thumper: juju 1.20.9
[03:07] <bradm> machine-1: 2014-10-15 00:34:45 INFO juju.state.apiserver apiserver.go:165 [32] machine-40 API connection terminated after 3m0.104868462s
[03:08] <bradm> aha, here we go
[03:14] <bradm> thumper: http://pastebin.ubuntu.com/8562529/
[03:15] <bradm> thumper: so it looks like machine-40 does a whole bunch of requests, there's a response sent back, and then it just times out
[03:15] <thumper> bradm: definitely looks like a bug... :-(
[03:15] <bradm> thumper: this is on a customer site too :(
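The search bradm describes, picking one agent's traffic out of noisy state-server logs, can be sketched in Python as follows. This is only an illustration: the regular expression is modeled on the two log lines quoted in this channel (the `juju.apiserver` form and the 1.20-era `juju.state.apiserver` form), it assumes the tag after the bracketed connection number names the connected agent, and `connections_for` is a hypothetical helper name, not part of juju.

```python
import re

# Pattern modeled on the apiserver log lines pasted in the channel.
# Assumption: real logs may contain other shapes that this does not match.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<module>juju\.(?:state\.)?apiserver) "
    r"\S+ "                          # source file and line, e.g. apiserver.go:165
    r"(?:(?P<direction><-|->) )?"    # request/response arrow, absent on INFO lines
    r"\[(?P<conn>[0-9A-Fa-f]+)\] "   # connection number
    r"(?P<tag>machine-\d+)"          # assumed to be the connected agent
    r"(?P<rest>.*)"
)


def connections_for(lines, tag):
    """Return (timestamp, connection, remaining text) for lines from `tag`."""
    matches = []
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("tag") == tag:
            matches.append((m.group("ts"), m.group("conn"), m.group("rest").strip()))
    return matches
```

Feeding it the lines of a state server's log with the tag "machine-40" would return the timestamp, connection number, and remaining text of each matching line, which makes a termination after three minutes easier to spot.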
[03:16] <thumper> bradm: do you see the same if not doing HA?
[03:16] <bradm> thumper: I can try that out easily enough.
[03:16] <thumper> bradm: please
[03:20] <bradm> huh, what, that just failed
[03:20] <bradm> maybe I tried a bit too quickly
[03:21] <bradm> ah, this is better, it was far too quick last time
[03:22] <bradm> thumper: will let you know when it's up, this is all HP kit and maas, so it's not exactly fast
[03:22] <thumper> kk
[03:31] <bradm> thumper: ok, bootstrap is done, doing the add-machine now
[03:35] <bradm> thumper: right, all 16 have started, now it's waiting time.
[03:44] <bradm> thumper: well look at that, a lot of them are coming back as started
[03:44] <thumper> bugger
[03:44] <thumper> bradm: looks like the HA bit is screwing things up
[03:44] <bradm> thumper: sure does - I'd like to try this again with HA once all these either hit started or we decide to give up
[03:45] <thumper> bradm: can I get you to file a bug, and one of us will mark it critical
[03:45] <thumper> bradm: once you've tested it again, that is
[03:45] <thumper> bradm: something is horribly wrong
[03:45] <thumper> bradm: please include all the version information you can
[03:46] <thumper> and provider info
[03:46] <bradm> thumper: sure, are there any particular logs or anything you need?  other than it not working with HA?
[03:46] <thumper> grab the logs from the state servers (the various HA machines), and at least one failing log for the machine agent on one showing as pending
[03:46] <bradm> thumper: interestingly we've done something similar in a staging environment, and using HA worked fine - although it's not as many hosts, and it's with softlayer rather than physical HP kit
[03:47] <bradm> so it's SeaMicro kit there, I think
[03:47] <thumper> bradm: either way, something is wrong...
[03:47] <thumper> bradm: and I'm not sure what
[03:47] <bradm> thumper: all 16 hosts are now in an agent-state of started.
[03:48] <thumper> bradm: when you did the test with HA, were all three HA machines up and running?
[03:48] <thumper> and stable?
[03:48] <bradm> thumper: yes
[03:48] <bradm> I'll quickly grab juju status output from this, and fire up the HA again
[03:48] <thumper> if you can confirm again, I bet it is something about the HA-ness of it all
[03:49] <thumper> not something I've had anything to do with I'm afraid
[03:49] <thumper> but let's get the bug filed and someone on it
[03:49] <bradm> right
[03:49] <thumper> cheers
[03:50] <bradm> thumper: this will be pretty nice once it's working though, we have openstack deployed in HA mode to the HA bootstrap nodes into LXC, it's working fairly well in testing
[06:44] <bradm> I've just filed bug #1381340 as per discussion with thumper, some kind of HA bootstrap node bug
[06:44] <mup> Bug #1381340: HA bootstrap mode causes machines stuck in agent-state pending <canonical-bootstack> <juju-core:New> <https://launchpad.net/bugs/1381340>
[09:12] <mattrae> hi, i'm doing debug-hooks and according to the docs, 'exit 1' will halt the queue. When i run 'exit 1' i see it moves on to the next queued hook, rather than halting. is there a step i'm missing? juju 1.18.4
[09:30] <marcoceppi_> mattrae: exit in debug-hooks is ignored in 1.18
[10:30] <mattrae> juju remove-relation doesn't appear to be completing. i still see the relation in juju status, and re-adding the relation says 'relation already exists'. what's the best way to debug?
[10:33] <mattrae> marcoceppi_: thanks for confirming the issue. sigh i feel like it should be fixed in the juju versions we ship with trusty.
[12:25] <catbus1> Hi, I ran juju deploy ceph on 3 physical nodes, each with 1TB on sdb. How long should I expect the nodes to take to finish the install?
[12:27] <catbus1> it's been at least 2 hours.
[12:27] <marcoceppi_> catbus1: shouldn't take more than a few mins
[12:27] <catbus1> ok, I will start over
[12:27] <marcoceppi_> catbus1: what is the status?
[12:27] <marcoceppi_> pending?
[12:27] <catbus1> pending
[12:27] <marcoceppi_> are the machines stuck?
[12:28] <catbus1> I can juju ssh in the unit and no busy process seen via top
[12:28] <marcoceppi_> catbus1: yeah, something is stuck
[12:28] <marcoceppi_> catbus1: can you `ps -aef | grep hooks` ?
[12:29] <catbus1> just one entry returned: ubuntu   13681 13661  0 12:29 pts/1    00:00:00 grep --color=auto hooks
[12:36] <marcoceppi_> catbus1: yeah, no hooks are running, something is borked
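The check marcoceppi_ asks for, `ps -aef | grep hooks` with the grep process itself discounted, might look like this in Python. It is a minimal sketch: it assumes the eight-column `ps -aef` layout seen in catbus1's output, and `hook_processes` and `running_hooks` are hypothetical helper names, not juju tools.

```python
import subprocess


def hook_processes(ps_lines):
    """Return command lines from `ps -aef` output that mention hooks."""
    found = []
    for line in ps_lines:
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        cmd = fields[7]
        # Skip the grep process itself, which was catbus1's only result
        if "hooks" in cmd and not cmd.startswith("grep"):
            found.append(cmd)
    return found


def running_hooks():
    """Run `ps -aef` on this machine and filter its output."""
    out = subprocess.run(
        ["ps", "-aef"], capture_output=True, text=True, check=True
    ).stdout
    return hook_processes(out.splitlines())
```

An empty list from `running_hooks()` on the unit's machine corresponds to the situation above: no hook is executing, so the pending state is not just a slow install.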
[14:19] <catbus11> marcoceppi_: I got ceph installed successfully and started now.
[14:19] <catbus11> it took about 5 minutes or so.
[14:26] <marcoceppi_> catbus11: sweet!
[14:27] <marcoceppi_> catbus11: rule of thumb, if nothing has happened for 10-20 mins, get suspicious
[14:28] <catbus11> ok