bradm | anyone able to help debug a maas node that's been added to juju that's hanging around in agent-state: pending? | 00:17 |
---|---|---|
thumper | mwhudson: which provider? | 00:29 |
mwhudson | thumper: manual | 00:29 |
thumper | mwhudson: ok, you just need to have preseeded the image with the lxc ubuntu-cloud template | 00:30 |
mwhudson | thumper: lxc-create -n foo -t ubuntu-cloud; lxc-destroy -n foo? | 00:30 |
thumper | mwhudson: that'd probably work | 00:30 |
mwhudson | cool | 00:30 |
thumper | mwhudson: with caveats around differing series... | 00:31 |
thumper | but if you are just using the defaults | 00:31 |
thumper | I think that'll work | 00:31 |
mwhudson | all trusty all the time | 00:33 |
mwhudson | (also arm64, trying to create a precise lxc not going to go well) | 00:33 |
bradm | so no ideas on why a juju unit would be stuck in pending? | 01:27 |
=== psivaa_ is now known as psivaa | ||
=== allenap_ is now known as allenap | ||
=== FourDollars_ is now known as FourDollars | ||
=== nottrobin_ is now known as nottrobin | ||
thumper | bradm: a unit would be stuck in pending if the unit agent never started | 01:35 |
thumper | bradm: which meant the machine agent didn't deploy the unit properly | 01:35 |
thumper | bradm: which means, most likely, the machine agent isn't running... ? | 01:35 |
bradm | thumper: no, it has them installed and running | 01:36 |
thumper | bradm: the machine agent is running? | 01:36 |
bradm | /var/lib/juju/tools/machine-31/jujud machine --data-dir /var/lib/juju --machine-id 31 --debug | 01:37 |
bradm | thats what you mean? ^^ | 01:37 |
thumper | bradm: yeah | 01:37 |
thumper | bradm: is there a jujud unit running? | 01:37 |
bradm | oddly, out of 16 hosts, only one succeeded | 01:37 |
bradm | thumper: I've only done a juju add-machine so far | 01:38 |
bradm | we're trying to work out what's going on here, we're having some fun with bug #1263181 | 01:39 |
mup | Bug #1263181: curtin discovers HP /dev/cciss/c0d0 incorrectly <canonical-bootstack> <curtin:Triaged> <https://launchpad.net/bugs/1263181> | 01:39 |
bradm | thumper: but I can confirm all 16 nodes have the machine agent running | 01:40 |
bradm | thumper: and only one is in a started state | 01:41 |
thumper | bradm: but is there a unit agent running? | 01:41 |
thumper | bradm: can you ssh to the machine and see if the charm has been deployed? | 01:41 |
bradm | thumper: no, none of them have a unit agent | 01:41 |
bradm | thumper: we haven't deployed charms yet, we're having booting issues | 01:41 |
thumper | bradm: also, does juju status show a machine for the unit? | 01:41 |
bradm | thumper: at this point we're just doing a juju add-machine | 01:41 |
thumper | bradm: ok, I'm confused | 01:41 |
thumper | you did say "unit stuck in pending" | 01:42 |
thumper | did you mean machine? | 01:42 |
bradm | ah, I did say unit too, sorry | 01:42 |
bradm | I should have said machine | 01:42 |
bradm | I was using unit in the generic sense, didn't remember that it had a juju specific meaning | 01:42 |
=== alpacaherder is now known as skellat | ||
=== psivaa_ is now known as psivaa | ||
thumper | bradm: ah, ok... | 01:43 |
bradm | thumper: the other fun part is we're using a 3 node HA bootstrap, so maybe something's going on there | 01:43 |
thumper | bradm: so the machines have been deployed, and the machine agent is running, but showing pending? | 01:43 |
bradm | thumper: correct, on all but one of the hosts | 01:44 |
thumper | bradm: so this would occur if the machine agent can't contact the api server | 01:44 |
thumper | bradm: check the local logs on the machines | 01:44 |
bradm | thumper: they're not particularly enlightening | 01:45 |
bradm | thumper: let me pastebin one for you | 01:45 |
thumper | kk | 01:45 |
bradm | thumper: http://pastebin.ubuntu.com/8562185/ <- one that didn't work | 01:46 |
bradm | thumper: foo-os-[123].maas are the HA'd bootstrap nodes | 01:46 |
thumper | bradm: and that's it? | 01:47 |
bradm | thumper: yup | 01:47 |
bradm | thumper: like I said, not particularly enlightening | 01:47 |
thumper | bradm: looks like the websocket handshake is failing for some reason | 01:47 |
thumper | bradm: that's the first thing it does | 01:48 |
thumper | bradm: if you look at the logs for the state servers, do you see the incoming connection from the other machines | 01:49 |
thumper | ? | 01:49 |
thumper | bradm: also, make sure the state servers have a decent logging-config | 01:49 |
thumper | like DEBUG | 01:49 |
thumper | bradm: by default the logging-config is set to WARN if you don't specify | 01:50 |
thumper | if you bootstrapped with --debug, it should stay debug | 01:50 |
bradm | thumper: http://pastebin.ubuntu.com/8562198/ <- one that did work | 01:50 |
bradm | thumper: ok, let me see.. | 01:50 |
bradm | thumper: so it's definitely in debug | 01:52 |
bradm | thumper: there's a lot going on, have you got an example of what an incoming connection would look like in the logs? | 01:54 |
* thumper looks | 02:03 | |
bradm | debug is pretty noisy, as you'd expect, it's hard to tell what to look for | 02:04 |
thumper | grr | 02:05 |
bradm | HA bootstrap nodes don't help either, they throw enough of their own logs too | 02:06 |
thumper | trying to get some info for you bradm | 02:08 |
thumper | but hitting other local issues | 02:08 |
thumper | gimmie a few minutes | 02:08 |
bradm | thumper: sure, thats fine - I'm probably going to have some lunch soon anyway | 02:08 |
bradm | I can hear my wife making something in the kitchen now.. | 02:09 |
bradm | thumper: in fact, lunchtime now! will be back in a bit | 02:12 |
thumper | kk | 02:12 |
=== pjdc_ is now known as pjdc | ||
thumper | brad: 2014-10-15 02:07:01 DEBUG juju.apiserver apiserver.go:156 <- [1] machine-0 {"RequestId":<id>, ... Entities":[{"Tag":"machine-1"}]}} | 02:22 |
thumper | 2014-10-15 02:07:01 DEBUG juju.apiserver apiserver.go:163 -> [1] machine-0 311.032us {"RequestId":<id>,"Response":{...}} | 02:22 |
thumper | bradm: that was for machine-1 | 02:22 |
thumper | bradm: use whichever number you have | 02:22 |
thumper | I thought there was a login logged, but seems not | 02:22 |
bradm | thumper: righto, let me see.. | 03:00 |
bradm | thumper: curious, I have juju.state.apiserver, not juju.apiserver | 03:05 |
thumper | bradm: oh... which version of juju? | 03:05 |
thumper | we did move it | 03:05 |
thumper | but perhaps that was 1.21 | 03:05 |
bradm | thumper: juju 1.20.9 | 03:05 |
bradm | machine-1: 2014-10-15 00:34:45 INFO juju.state.apiserver apiserver.go:165 [32] machine-40 API connection terminated after 3m0.104868462s | 03:07 |
bradm | aha, here we go | 03:08 |
bradm | thumper: http://pastebin.ubuntu.com/8562529/ | 03:14 |
bradm | thumper: so it looks like machine-40 does a whole bunch of requests, there's a response sent back, and then it just times out | 03:15 |
thumper | bradm: definitely looks like a bug... :-( | 03:15 |
bradm | thumper: this is on a customer site too :( | 03:15 |
thumper | bradm: do you see the same if not doing HA? | 03:16 |
bradm | thumper: I can try that out easily enough. | 03:16 |
thumper | bradm: please | 03:16 |
bradm | huh, what, that just failed | 03:20 |
bradm | maybe I tried a bit too quickly | 03:20 |
bradm | ah, this is better, it was far too quick last time | 03:21 |
bradm | thumper: will let you know when it's up, this is all HP kit and maas, so it's not exactly fast | 03:22 |
thumper | kk | 03:22 |
=== Guest90815 is now known as rcj | ||
bradm | thumper: ok, bootstrap is done, doing the add-machine now | 03:31 |
bradm | thumper: right, all 16 have started, now it's waiting time. | 03:35 |
bradm | thumper: well look at that, a lot of them are coming back as started | 03:44 |
thumper | bugger | 03:44 |
thumper | bradm: looks like the HA bit is screwing things up | 03:44 |
bradm | thumper: sure does - I'd like to try this again with HA once all these either hit started or we decide to give up | 03:44 |
thumper | bradm: can I get you to file a bug? one of us will mark it critical | 03:45 |
thumper | bradm: once you've tested it again, that is | 03:45 |
thumper | bradm: something is horribly wrong | 03:45 |
thumper | bradm: please include all the version information you can | 03:45 |
thumper | and provider info | 03:46 |
bradm | thumper: sure, are there any particular logs or anything you need? other than it not working with HA? | 03:46 |
thumper | grab the logs from the state servers (the various HA machines), and at least one failing log for the machine agent on one showing as pending | 03:46 |
bradm | thumper: interestingly we've done something similar in a staging environment, and using HA worked fine - although it's not as many hosts, and it's with softlayer rather than physical HP kit | 03:46 |
bradm | so its SeaMicro kit there, I think | 03:47 |
thumper | bradm: either way, something is wrong... | 03:47 |
thumper | bradm: and I'm not sure what | 03:47 |
bradm | thumper: all 16 hosts are now in an agent-state of started. | 03:47 |
thumper | bradm: when you did the test with HA, were all three HA machines up and running? | 03:48 |
thumper | and stable? | 03:48 |
bradm | thumper: yes | 03:48 |
bradm | I'll quickly grab juju status output from this, and fire up the HA again | 03:48 |
thumper | if you can confirm again, I bet it is something about the HA-ness of it all | 03:48 |
thumper | not something I've had anything to do with I'm afraid | 03:49 |
thumper | but let's get the bug filed and someone on it | 03:49 |
bradm | right | 03:49 |
thumper | cheers | 03:49 |
bradm | thumper: this will be pretty nice once it's working though, we have openstack deployed in HA mode to the ha bootstrap nodes into LXC, it's working fairly well in testing | 03:50 |
=== uru_ is now known as urulama | ||
=== CyberJacob|Away is now known as CyberJacob | ||
bradm | I've just filed bug #1381340 as per discussion with thumper, some kind of HA bootstrap node bug | 06:44 |
mup | Bug #1381340: HA bootstrap mode causes machines stuck in agent-state pending <canonical-bootstack> <juju-core:New> <https://launchpad.net/bugs/1381340> | 06:44 |
=== CyberJacob is now known as CyberJacob|Away | ||
=== uru_ is now known as urulama | ||
mattrae | hi, i'm doing debug-hooks and according to the docs, 'exit 1' will halt the queue. When i run 'exit 1' i see it moves on to the next queued hook, rather than halting. is there a step i'm missing? juju 1.18.4 | 09:12 |
marcoceppi_ | mattrae: exit in debug-hooks is ignored in 1.18 | 09:30 |
mattrae | juju remove-relation doesn't appear to be completing. i still see the relation in juju status, and re-adding the relation says 'relation already exists'. what's the best way to debug? | 10:30 |
mattrae | marcoceppi_: thanks for confirming the issue. sigh i feel like it should be fixed in the juju versions we ship with trusty. | 10:33 |
=== luca__ is now known as luca | ||
catbus1 | Hi, I juju deploy ceph on 3 physical nodes, each has 1TB on sdb. How long should I expect for nodes to finish the install? | 12:25 |
catbus1 | it's been at least 2 hours. | 12:27 |
marcoceppi_ | catbus1: shouldn't take more than a few mins | 12:27 |
catbus1 | ok, I will start over | 12:27 |
marcoceppi_ | catbus1: what is the status? | 12:27 |
marcoceppi_ | pending? | 12:27 |
catbus1 | pending | 12:27 |
marcoceppi_ | are the machines stuck? | 12:27 |
catbus1 | I can juju ssh in the unit and no busy process seen via top | 12:28 |
marcoceppi_ | catbus1: yeah, something is stuck | 12:28 |
marcoceppi_ | catbus1: can you `ps -aef | grep hooks` ? | 12:28 |
catbus1 | just one entry returned: ubuntu 13681 13661 0 12:29 pts/1 00:00:00 grep --color=auto hooks | 12:29 |
marcoceppi_ | catbus1: yeah, no hooks are running, something is borked | 12:36 |
catbus11 | marcoceppi_: I got ceph installed successfully and started now. | 14:19 |
catbus11 | it took about 5 minutes or so. | 14:19 |
marcoceppi_ | catbus11: sweet! | 14:26 |
marcoceppi_ | catbus11: rule of thumb, if nothing has happened for 10-20 mins, get suspicious | 14:27 |
catbus11 | ok | 14:28 |
=== scuttle|afk is now known as scuttlemonkey | ||
=== jcw4 is now known as jcw4|afk | ||
=== jcw4|afk is now known as jcw4 | ||
=== mfa298_ is now known as mfa298 | ||
=== roadmr is now known as roadmr_afk | ||
=== jcw4 is now known as jcw4|nomnom | ||
=== beisner- is now known as beisner | ||
=== roadmr_afk is now known as roadmr | ||
=== BradCrittenden is now known as bac | ||
=== CyberJacob|Away is now known as CyberJacob | ||
=== CyberJacob is now known as CyberJacob|Away | ||
=== jcw4|nomnom is now known as jcw4 |
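
Since the debug logs on the state servers were described as too noisy to scan by eye, here is a minimal Python sketch for pulling out the "API connection terminated" events per machine. The regular expression is modelled only on the single log line bradm quoted from juju 1.20.9 and accepts both the `juju.state.apiserver` and `juju.apiserver` module names mentioned in the chat. Other juju versions may format the line differently, and the function name `terminated_connections` and the file names in the usage note are hypothetical.

```python
import re
import sys
from collections import defaultdict

# Assumption: the line layout matches the one quoted in the chat, e.g.
# "... 2014-10-15 00:34:45 INFO juju.state.apiserver apiserver.go:165 [32]
#  machine-40 API connection terminated after 3m0.104868462s"
TERMINATED = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\w+ juju\.(?:state\.)?apiserver \S+ "
    r"\[(?P<conn>[0-9A-Fa-f]+)\] (?P<tag>machine-\d+) "
    r"API connection terminated after (?P<duration>\S+)"
)


def terminated_connections(lines):
    """Group 'API connection terminated' events by machine tag.

    Returns a dict mapping a tag such as "machine-40" to a list of
    (timestamp, duration) tuples, in the order they appear in the log.
    """
    events = defaultdict(list)
    for line in lines:
        match = TERMINATED.search(line)
        if match:
            when = match.group("date") + " " + match.group("time")
            events[match.group("tag")].append((when, match.group("duration")))
    return dict(events)


if __name__ == "__main__":
    # Read a saved state-server log from standard input and summarise it.
    for tag, hits in sorted(terminated_connections(sys.stdin).items()):
        for when, duration in hits:
            print(f"{tag}: connection terminated at {when} after {duration}")
```

Saved as, say, `terminated.py`, it could be run as `python3 terminated.py < state-server.log` against a copy of a state server's log. Machines that show up repeatedly here while still listed as pending in `juju status` would be the ones worth attaching to the bug report.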