[08:50] o/
[08:50] One question here: how can you force a juju agent to recover properly?
[08:52] here is the context: I wanted to do some "crash testing" on my openstack clusters deployed with MaaS/Juju and see how they behave in case of a full crash, so I simply killed all my machines, stopped the cluster completely, then restarted it.
[08:53] and now, everything is totally broken and I have plenty of "agent lost" messages for my units, some containers appear to be down while they are not ... well ... Juju-based deployments don't seem to play well with a full shutdown
[08:54] any ideas or best practices for that kind of situation?
[08:58] You might have to go around and bounce the agents on each unit
[09:02] tbh, MaaS + Juju makes deployments very enjoyable and efficient, but I'm not as confident about the operations part ... it doesn't feel very solid
[09:03] just restarting a machine can bring my cluster into a weird state, especially the juju agent ... I have to poke around way too much for something that is meant to nearly run by itself ... and this is just a test lab with no production workload
[09:04] can you explain that weird state?
[09:04] maybe it's just because I am doing something wrong, but I never encountered that kind of situation with other platforms
[09:06] well, plenty of juju "agent lost" messages, my status shows plenty of yellow and red states, mysql-innodb-cluster appears to be completely borked ... looks like everything is broken but it's not really the case
[09:07] when you drill down a little, you see that the juju agent is running without error on the units, mysql seems to be fine ... in fact, it looks like it's the juju status that is completely inconsistent
[09:07] so what should I trust?
[09:09] for example, I get this kind of error message: "agent lost, see 'juju show-status-log rabbitmq-server/0'"
[09:09] when I run "juju show-status-log rabbitmq-server/0", I see no error
[09:10] I get this at the end of the status log: "workload active Unit is ready and clustered"
[09:10] so, apparently, everything is OK except that juju says it is not
[09:11] is there some command that can force a "refresh" of the status by rechecking each agent?
[09:18] I have a simpler example: an "easyrsa" unit with the same "agent lost" message ... this is a LXD unit, so I just restarted the container, but still, the container is started and the status didn't change
[09:18] stickupkid how can I just "bounce the agents on each unit" as you said?
[09:19] Hybrid512: If you are not running in HA, having the controller down means the workload agents will try to connect a few times, but then fail with "connection impossible".
[09:20] In that case what is required is to restart the jujud-machine-x service.
[09:20] which controller? I didn't kill the juju controller, only the machines in a model
[09:21] I tried to restart the jujud-machine-x service as you said, but that didn't change anything, and systemctl status says everything is fine for this service
[09:21] Hybrid512: There are 2 status indicators. One is for the workload and one is for the agent. The agent can be lost (i.e. not connected, as seen here), but the last known workload status could be good.
[09:22] but why is it not connected?
[09:22] Hybrid512: And what does /var/log/juju/machine-x.log indicate for those units?
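A minimal sketch of what "bounce the agents on each unit" can look like in practice, assuming systemd-managed Juju agents; the machine number and the rabbitmq-server/0 unit are just the examples from this conversation, and the actual service names should be discovered on each machine first:

    # on the affected machine or container
    sudo systemctl list-units 'jujud-*'              # discover which agent services exist here
    sudo systemctl restart jujud-machine-0.service   # restart the machine agent (number varies per machine)
    # back on the client, check whether the agent reconnects
    juju status rabbitmq-server/0
    juju show-status-log rabbitmq-server/0           # recent agent/workload status history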
[09:23] nothing valuable
[09:23] it ends with this kind of log:
[09:23] 2020-11-18 08:43:46 INFO juju.worker.provisioner provisioner_task.go:744 maintainMachines: 0/lxd/4
[09:23] 2020-11-18 08:43:46 INFO juju.worker.provisioner provisioner_task.go:744 maintainMachines: 0/lxd/5
[09:23] 2020-11-18 08:43:46 INFO juju.worker.provisioner provisioner_task.go:744 maintainMachines: 0/lxd/6
[09:24] how can I check the connection between the agent and the controller?
[09:27] hummm ... I think I put my finger on something ...
[09:28] first, it seems that only the LXD-based units are having issues; everything that is deployed on bare metal seems to work just fine
[09:29] second: I think this might be related to some kind of race condition ... when those machines start, I have a lot of I/O and high load, I suppose that can break some things ... couldn't it be possible to be less aggressive when starting the units?
[09:32] Hybrid512: What version of Juju are you running here?
[09:32] latest stable
[09:32] 2.8.6
[09:33] with latest MaaS: 2.8.2
[09:38] Hey, some time back I proposed https://github.com/juju/charm-tools/pull/564 wondering if anyone would be able to peek?
[09:40] Hybrid512: On the metal machine (0), if you run `lxc exec bash`, what do the machine logs there indicate?
[09:40] chrome0, I'll ping cory_fu for you
[09:40] manadart I can get into it without any issue
[09:40] cheers
[09:42] Hybrid512: On those containers, the machine and/or agent logs in /var/log/juju might tell us something.
[09:42] chrome0, he won't be online till later in the day, but hopefully he'll see the message
[09:42] Right, understood
[09:43] manadart: 2020-11-18 09:36:28 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [4698e8] "machine-0-lxd-0" cannot open api: unable to connect to API: dial tcp 192.168.1.104:17070: i/o timeout
[09:43] looks like it can't connect to the juju controller
[09:43] (192.168.1.104 is the juju controller IP in my setup)
[09:45] ok, now there is something ...
[09:45] Hybrid512: And if you restart the agents there...
[09:45] I see something weird
[09:46] on machine 0, I have 7 LXD containers
[09:47] machine 0 and container 0/lxd/5 are in an OK state, but not the other containers
[09:47] the only difference is that 0/lxd/5 has 2 IPs corresponding to 2 different spaces
[09:48] my spaces are: ost-int = 192.168.211.0/24, ost-pub = 192.168.210.0/24
[09:49] only machines having an IP in ost-pub can talk to the juju controller; the others (inside ost-int) can't
[09:50] but that is not normal, ost-int is routed and can talk to the 192.168.1.0/24 network which is used for the juju controller
[09:52] Hybrid512: Can you check `ip a` and/or `/etc/netplan/xxx.yaml` in 0/lxd/5's container?
[09:52] WTF!! okay ... my bad, I found the issue
[09:53] 192.168.211.0/24 doesn't seem to be routed (at least not to 192.168.1.0/24), so this is perfectly normal
[09:54] humm ... well, not completely sure though ... my control plane has 3 VMs, all on the same hypervisor with the same subnets and they are managed by MaaS, but they don't all behave the same
[09:55] some of them are "talking" properly on these networks ... that's weird, I'll have to check my stuff
[09:55] anyway, thanks for pointing me in the right direction
[10:00] Hybrid512: NP. Glad you've got to the bottom of it.
[10:00] well ... hope so ... because by now, I don't see the difference between those VMs ... they are all the same ... very weird
[10:00] digging ...
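A rough checklist for the container-side debugging described above, assuming the controller API is at 192.168.1.104:17070 as in this log; the container name is a placeholder, and nc may need installing inside the container:

    lxc list                                  # find the LXD name of the affected container
    lxc exec <container-name> -- bash         # shell into it (name is a placeholder)
    ip a                                      # which subnets/spaces it has addresses on
    cat /etc/netplan/*.yaml                   # the network config rendered for the container
    nc -vz 192.168.1.104 17070                # is the controller API port reachable from here?
    tail -n 50 /var/log/juju/machine-*.log    # look for "cannot open api" / i/o timeout errors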
[10:01] thx anyway
[10:30] oh, and another question, totally unrelated: MaaS and Juju are great together, but the MaaS UI is not very practical when dealing with dozens of machines except by using filters, which it does well
[10:30] however, when you have a few models deployed with Juju, it is not very easy to see which machine is used by which model when looking at the MaaS UI
[10:31] MaaS has a label called "resource pool" which could very efficiently be used by Juju to map its models into
[10:32] is there a way to map juju models to MaaS resource pools? If so, how? If not, is there an alternative such as mapping models to MaaS tags?
[12:35] achilleasa, find version -> https://github.com/juju/juju/pull/12338
[12:38] looking
[12:39] achilleasa, I have a question about the facade version as I tweaked the output
[12:39] achilleasa, considering nobody is using it, it's behind a feature flag and it's omitempty, do we care?
[13:39] stickupkid_: would it break people who have enabled the flag while using the 2.9 RCs?
[13:39] achilleasa, nope
[13:40] and it's not in the pylibjuju client bits either, right?
[13:40] achilleasa, nope, we don't have any schema for it yet
[13:40] I guess it's fine then
[13:40] achilleasa, we do need to rebuild the schema to include it in the 2.9 branch
[13:44] stickupkid_: you should probably add a card so we don't forget :D
[13:44] achilleasa, YES!
[13:51] stickupkid_: stupid question... why do I need a controller to run juju find?
[13:52] achilleasa, because we're forcing charmhub to use controllers, one point of access...
[13:53] achilleasa, controllers are a bastion for connections to the store ;-)
[13:53] I get that deployment needs the controller, and perhaps find in some cases (e.g. when using an alternative store)
[13:53] the alternative store is done via juju add-model other --config charm-hub-url="https://api.staging.snapcraft.io"
[13:53] but requiring a bootstrapped thing to search a remote store seems odd from a UX perspective
[13:54] yes, but it means you can then only deploy what that model can
[13:54] this becomes more important with architectures etc.
[13:55] also we get all the advantages of api facade revisioning, normalised data...
[13:55] until you get a "juju find returns inconsistent results" bug :D
[13:55] this was my rant about bootstrapping taking too long and why I think that a daemon should just be `juju init`
[13:56] I get the reasoning, but I think it should also be able to work without an active controller (and use the active one if available)
[13:56] it's up to you to carve up a machine for juju to run
[13:56] but that's a fight for another day
[13:57] how do you reconcile the different types? you'd have to normalise the types the same way as the API server
[13:57] seems like a waste
[13:57] but you see, I can query the store from the web UI, so having such a constraint on a client makes no sense
[13:57] then query from the web UI ;)
[13:58] it's like telling people: you have to think about what you want to deploy (== find it in a web UI [on your headless box]), then set up the infra based on the arch you need, and then you are set?
[13:59] i'm a fan of people "thinking" :)
[14:00] or to rephrase it: use an external source to find what you can use with your CLI tool :D
[14:01] bring it up in standup
[14:01] :)
[14:01] not saying you're wrong, just that it was designed this way
[14:03] will do.
I get the arguments but I still think it's bad UX for a _search_ command (imagine apt requiring you to use SSO before you can search)
[14:03] also when private charms arrive you'll need to SSO anyway to see them
[14:03] so we won't handle that in the CLI
[14:04] so you don't SSO in the CLI and pass a macaroon to the controller?
[14:04] yeah, probably actually...
=== diddledan_ is now known as diddledan
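The resource-pool question from earlier in the log never gets an answer here. One workaround that should work, though it is not confirmed in this conversation, is to tag machines in MAAS and pin each model to that tag with a model-level constraint, so filtering on the tag in the MaaS UI roughly shows that model's machines; the model and tag names below are made up for illustration:

    # tag the machines in the MaaS UI (or via the MAAS CLI), then:
    juju add-model openstack-test
    juju set-model-constraints -m openstack-test tags=openstack-test   # MAAS allocates only tagged machines for new units
    juju get-model-constraints -m openstack-test                       # verify the model's default constraints

Note that constraints only affect machines provisioned after they are set; existing machines keep whatever tags they already have in MAAS.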