[00:17] anyone able to help debug a maas node that's been added to juju that's hanging around in agent-state: pending?
[00:29] mwhudson: which provider?
[00:29] thumper: manual
[00:30] mwhudson: ok, you just need to have preseeded the image with the lxc ubuntu-cloud template
[00:30] thumper: lxc-create -n foo -t ubuntu-cloud; lxc-destroy -n foo?
[00:30] mwhudson: that'd probably work
[00:30] cool
[00:31] mwhudson: with caveats around differing series...
[00:31] but if you are just using the defaults
[00:31] I think that'll work
[00:33] all trusty all the time
[00:33] (also arm64, trying to create a precise lxc is not going to go well)
[01:27] so no ideas on why a juju unit would be stuck in pending?
=== psivaa_ is now known as psivaa
=== allenap_ is now known as allenap
=== FourDollars_ is now known as FourDollars
=== nottrobin_ is now known as nottrobin
[01:35] bradm: a unit would be stuck in pending if the unit agent never started
[01:35] bradm: which means the machine agent didn't deploy the unit properly
[01:35] bradm: which means, most likely, the machine agent isn't running... ?
[01:36] thumper: no, it has them installed and running
[01:36] bradm: the machine agent is running?
[01:37] /var/lib/juju/tools/machine-31/jujud machine --data-dir /var/lib/juju --machine-id 31 --debug
[01:37] that's what you mean? ^^
[01:37] bradm: yeah
[01:37] bradm: is there a jujud unit running?
[01:37] oddly out of 16 hosts, only one succeeded
[01:38] thumper: I've only done a juju add-machine so far
[01:39] we're trying to work out what's going on here, we're having some fun with bug #1263181
[01:39] Bug #1263181: curtin discovers HP /dev/cciss/c0d0 incorrectly
[01:40] thumper: but I can confirm all 16 nodes have the machine agent running
[01:41] thumper: and only one is in a started state
[01:41] bradm: but is there a unit agent running?
[01:41] bradm: can you ssh to the machine and see if the charm has been deployed?
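mwhudson's one-liner above (create a throwaway ubuntu-cloud container, then destroy it) is enough to warm the local template cache. A minimal sketch wrapping it as a helper, assuming the lxc userspace tools are installed; the `-r` series flag passed after `--` is an assumption about the ubuntu-cloud template's options, so check `lxc-create -t ubuntu-cloud -h` on your lxc version:

```shell
#!/bin/bash
# Minimal sketch of the pre-seeding step discussed above: creating and
# immediately destroying a throwaway ubuntu-cloud container leaves the
# downloaded image in the local template cache. Assumes the lxc tools
# are installed; "-r" after "--" is an assumption about the template's
# flags (verify with `lxc-create -t ubuntu-cloud -h`).
preseed_lxc_cache() {
    local series="${1:-trusty}"
    sudo lxc-create -n preseed-tmp -t ubuntu-cloud -- -r "$series" &&
        sudo lxc-destroy -n preseed-tmp
}
# Example (not run here): preseed_lxc_cache trusty
```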
[01:41] thumper: no, none of them have a unit agent
[01:41] thumper: we haven't deployed charms yet, we're having booting issues
[01:41] bradm: also, does juju status show a machine for the unit?
[01:41] thumper: at this point we're just doing a juju add-machine
[01:41] bradm: ok, I'm confused
[01:42] you did say "unit stuck in pending"
[01:42] did you mean machine?
[01:42] ah, I did say unit too, sorry
[01:42] I should have said machine
[01:42] I was using unit in the generic sense, didn't remember that it had a juju-specific meaning
=== alpacaherder is now known as skellat
=== psivaa_ is now known as psivaa
[01:43] bradm: ah, ok...
[01:43] thumper: the other fun part is we're using a 3-node HA bootstrap, so maybe something's going on there
[01:43] bradm: so the machines have been deployed, and the machine agent is running, but showing pending?
[01:44] thumper: correct, on all but one of the hosts
[01:44] bradm: so this would occur if the machine agent can't contact the api server
[01:44] bradm: check the local logs on the machines
[01:45] thumper: they're not particularly enlightening
[01:45] thumper: let me pastebin one for you
[01:45] kk
[01:46] thumper: http://pastebin.ubuntu.com/8562185/ <- one that didn't work
[01:46] thumper: foo-os-[123].maas are the HA'd bootstrap nodes
[01:47] bradm: and that's it?
[01:47] thumper: yup
[01:47] thumper: like I said, not particularly enlightening
[01:47] bradm: looks like the websocket handshake is failing for some reason
[01:48] bradm: that's the first thing it does
[01:49] bradm: if you look at the logs for the state servers, do you see the incoming connection from the other machines
[01:49] ?
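thumper's checklist (machine pending → is the machine agent running → what do its local logs say) can be sketched as a quick on-box check. A hedged sketch: `check_pending_machine` is a hypothetical helper, not a juju command, and the `/var/log/juju` path matches the juju 1.x layout seen in this conversation:

```shell
#!/bin/bash
# Hedged sketch of the checks discussed above, run on the machine that
# is stuck in agent-state: pending. check_pending_machine is a
# hypothetical helper name; paths follow the juju 1.x layout.
check_pending_machine() {
    # 1. Is the jujud machine agent process up? The bracket in the
    #    pattern stops pgrep from matching this script's own text.
    if pgrep -f 'jujud machin[e]' > /dev/null; then
        echo "machine agent: running"
    else
        echo "machine agent: NOT running"
    fi
    # 2. Tail the local agent log; a failing websocket handshake to
    #    the API server is the first thing that shows up there.
    tail -n 20 /var/log/juju/machine-*.log 2>/dev/null || true
}
```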
[01:49] bradm: also, make sure the state servers have a decent logging-config
[01:49] like DEBUG
[01:50] bradm: by default the logging-config is set to WARN if you don't specify
[01:50] if you bootstrapped with --debug, it should stay debug
[01:50] thumper: http://pastebin.ubuntu.com/8562198/ <- one that did work
[01:50] thumper: ok, let me see..
[01:52] thumper: so it's definitely in debug
[01:54] thumper: there's a lot going on, have you got an example of what an incoming connection would look like in the logs?
[02:03] * thumper looks
[02:04] debug is pretty noisy, as you'd expect, it's hard to tell what to look for
[02:05] grr
[02:06] the HA bootstrap nodes don't help either, they throw out enough logs of their own too
[02:08] trying to get some info for you bradm
[02:08] but hitting other local issues
[02:08] gimme a few minutes
[02:08] thumper: sure, that's fine - I'm probably going to have some lunch soon anyway
[02:09] I can hear my wife making something in the kitchen now..
[02:12] thumper: in fact, lunchtime now! will be back in a bit
[02:12] kk
=== pjdc_ is now known as pjdc
[02:22] bradm: 2014-10-15 02:07:01 DEBUG juju.apiserver apiserver.go:156 <- [1] machine-0 {"RequestId":, ... Entities":[{"Tag":"machine-1"}]}}
[02:22] 2014-10-15 02:07:01 DEBUG juju.apiserver apiserver.go:163 -> [1] machine-0 311.032us {"RequestId":,"Response":{...}}
[02:22] bradm: that was for machine-1
[02:22] bradm: use whichever number you have
[02:22] I thought there was a login logged, but seems not
[03:00] thumper: righto, let me see..
[03:05] thumper: curious, I have juju.state.apiserver, not juju.apiserver
[03:05] bradm: oh... which version of juju?
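For the logging-config point thumper raises here, juju 1.x exposes it as an environment setting. A sketch of checking and raising it (the commands are not run here; syntax is the 1.x `get-env`/`set-env` CLI):

```shell
#!/bin/bash
# Sketch of raising the log level on a juju 1.x environment. Per the
# conversation, logging-config defaults to WARN unless you bootstrapped
# with --debug. "<root>=DEBUG" sets the root logger; semicolon-separated
# "<module>=LEVEL" pairs narrow it to specific modules.
enable_debug_logging() {
    juju get-env logging-config              # show the current setting
    juju set-env logging-config='<root>=DEBUG'
}
```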
[03:05] we did move it
[03:05] but perhaps that was 1.21
[03:05] thumper: juju 1.20.9
[03:07] machine-1: 2014-10-15 00:34:45 INFO juju.state.apiserver apiserver.go:165 [32] machine-40 API connection terminated after 3m0.104868462s
[03:08] aha, here we go
[03:14] thumper: http://pastebin.ubuntu.com/8562529/
[03:15] thumper: so it looks like machine-40 does a whole bunch of requests, there's a response sent back, and then it just times out
[03:15] bradm: definitely looks like a bug... :-(
[03:15] thumper: this is on a customer site too :(
[03:16] bradm: do you see the same if not doing HA?
[03:16] thumper: I can try that out easily enough.
[03:16] bradm: please
[03:20] huh, what, that just failed
[03:20] maybe I tried a bit too quickly
[03:21] ah, this is better, it was far too quick last time
[03:22] thumper: will let you know when it's up, this is all HP kit and maas, so it's not exactly fast
[03:22] kk
=== Guest90815 is now known as rcj
[03:31] thumper: ok, bootstrap is done, doing the add-machine now
[03:35] thumper: right, all 16 have started, now it's waiting time.
[03:44] thumper: well look at that, a lot of them are coming back as started
[03:44] bugger
[03:44] bradm: looks like the HA bit is screwing things up
[03:44] thumper: sure does - I'd like to try this again with HA once all these either hit started or we decide to give up
[03:45] bradm: can I get you to file a bug, and one of us will mark it critical
[03:45] bradm: when you test it again, that is
[03:45] bradm: something is horribly wrong
[03:45] bradm: please include all the version information you can
[03:46] and provider info
[03:46] thumper: sure, are there any particular logs or anything you need? other than it not working with HA?
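One way to dig the incoming-connection lines out of a noisy DEBUG log is a grep whose pattern tolerates the logger rename bradm ran into: juju 1.20.x logs `juju.state.apiserver` while later releases log `juju.apiserver`. A runnable sketch over sample lines adapted from the log excerpts in this conversation:

```shell
#!/bin/bash
# Runnable sketch: filter a state-server log for API-server traffic
# involving one machine agent. Sample lines are adapted from the
# conversation; the pattern matches both logger names, since juju
# 1.20.x logs "juju.state.apiserver" and later releases "juju.apiserver".
cat > /tmp/machine-0.log <<'EOF'
2014-10-15 02:07:01 DEBUG juju.apiserver apiserver.go:156 <- [1] machine-0 {"RequestId": 1}
2014-10-15 00:34:45 INFO juju.state.apiserver apiserver.go:165 [32] machine-40 API connection terminated after 3m0.104868462s
2014-10-15 02:07:02 DEBUG juju.worker.peergrouper unrelated noise
EOF

# Only the machine-40 apiserver line survives the filter.
grep -E 'juju(\.state)?\.apiserver .* machine-40' /tmp/machine-0.log
```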
[03:46] grab the logs from the state servers (the various HA machines), and at least one machine-agent log from a machine showing as pending
[03:46] thumper: interestingly we've done something similar in a staging environment, and using HA worked fine - although it's not as many hosts, and it's with softlayer rather than physical HP kit
[03:47] so it's SeaMicro kit there, I think
[03:47] bradm: either way, something is wrong...
[03:47] bradm: and I'm not sure what
[03:47] thumper: all 16 hosts are now in an agent-state of started.
[03:48] bradm: when you did the test with HA, were all three HA machines up and running?
[03:48] and stable?
[03:48] thumper: yes
[03:48] I'll quickly grab juju status output from this, and fire up the HA again
[03:48] if you can confirm again, I bet it is something about the HA-ness of it all
[03:49] not something I've had anything to do with I'm afraid
[03:49] but let's get the bug filed and someone on it
[03:49] right
[03:49] cheers
[03:50] thumper: this will be pretty nice once it's working though, we have openstack deployed in HA mode to the HA bootstrap nodes into LXC, it's working fairly well in testing
=== uru_ is now known as urulama
=== CyberJacob|Away is now known as CyberJacob
[06:44] I've just filed bug #1381340 as per discussion with thumper, some kind of HA bootstrap node bug
[06:44] Bug #1381340: HA bootstrap mode causes machines stuck in agent-state pending
=== CyberJacob is now known as CyberJacob|Away
=== uru_ is now known as urulama
[09:12] hi, i'm doing debug-hooks and according to the docs, 'exit 1' will halt the queue. When i run 'exit 1' i see it moves on to the next queued hook, rather than halting. is there a step i'm missing? juju 1.18.4
[09:30] mattrae: exit in debug-hooks is ignored in 1.18
[10:30] juju remove-relation doesn't appear to be completing. i still see the relation in juju status, and re-adding the relation says 'relation already exists'. what's the best way to debug?
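The log gathering thumper asks for at the top of this exchange could be sketched as below. This is hedged: the machine IDs (0, 1, 2 for the HA state servers; 40 for a stuck machine) are placeholders to substitute from your own `juju status`, and the commands are not run here:

```shell
#!/bin/bash
# Hedged sketch of collecting logs for the bug report: the machine
# agent logs from each HA state server plus one machine stuck in
# pending. Machine IDs below are assumptions - replace them with the
# numbers `juju status` shows in your environment.
collect_bug_logs() {
    mkdir -p /tmp/juju-bug-logs
    for m in 0 1 2 40; do
        juju scp "$m:/var/log/juju/machine-$m.log" /tmp/juju-bug-logs/
    done
    juju status > /tmp/juju-bug-logs/status.yaml
}
```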
[10:33] marcoceppi_: thanks for confirming the issue. sigh, i feel like it should be fixed in the juju versions we ship with trusty.
=== luca__ is now known as luca
[12:25] Hi, I juju deploy ceph on 3 physical nodes, each has 1TB on sdb. How long should I expect for the nodes to finish the install?
[12:27] it's been at least 2 hours.
[12:27] catbus1: shouldn't take more than a few mins
[12:27] ok, I will start over
[12:27] catbus1: what is the status?
[12:27] pending?
[12:27] pending
[12:27] are the machines stuck?
[12:28] I can juju ssh into the unit and no busy process seen via top
[12:28] catbus1: yeah, something is stuck
[12:28] catbus1: can you `ps -aef | grep hooks` ?
[12:29] just one entry returned: ubuntu 13681 13661 0 12:29 pts/1 00:00:00 grep --color=auto hooks
[12:36] catbus1: yeah, no hooks are running, something is borked
[14:19] marcoceppi_: I got ceph installed successfully and started now.
[14:19] it took about 5 minutes or so.
[14:26] catbus11: sweet!
[14:27] catbus11: rule of thumb, if nothing has happened for 10-20 mins, get suspicious
[14:28] ok
=== scuttle|afk is now known as scuttlemonkey
=== jcw4 is now known as jcw4|afk
=== jcw4|afk is now known as jcw4
=== mfa298_ is now known as mfa298
=== roadmr is now known as roadmr_afk
=== jcw4 is now known as jcw4|nomnom
=== beisner- is now known as beisner
=== roadmr_afk is now known as roadmr
=== BradCrittenden is now known as bac
=== CyberJacob|Away is now known as CyberJacob
=== CyberJacob is now known as CyberJacob|Away
=== jcw4|nomnom is now known as jcw4
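A side note on that `ps -aef | grep hooks` check: the single entry catbus1 saw was grep matching its own command line. The classic bracket trick avoids the false positive, so an empty result really does mean no hook is executing. A runnable sketch, demonstrated on the captured ps line rather than a live system:

```shell
#!/bin/bash
# The one entry above was grep matching itself. Bracketing a character
# in the pattern ("[h]ooks") keeps grep's own command line from
# matching the regex, since the literal text "[h]ooks" does not
# contain the substring "hooks".
sample='ubuntu 13681 13661 0 12:29 pts/1 00:00:00 grep --color=auto hooks'
echo "$sample" | grep -c 'hooks'                 # prints 1: naive pattern self-matches
bracketed='ubuntu 13681 13661 0 12:29 pts/1 00:00:00 grep --color=auto [h]ooks'
echo "$bracketed" | grep -c '[h]ooks' || true    # prints 0: bracketed pattern does not
# On a live unit, the equivalent check is: ps -aef | grep '[h]ooks'
```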