/srv/irclogs.ubuntu.com/2020/11/18/#juju.txt

Hybrid512o/08:50
Hybrid512One question here : how can you force a juju agent to recover properly ?08:50
Hybrid512here is the context: I wanted to do some "crash testing" on my OpenStack clusters deployed with MaaS/Juju and see how they behave in case of a full crash, so I simply killed all my machines, stopped the cluster completely, then restarted it.08:52
Hybrid512and now everything is totally broken and I have plenty of "agent lost" messages for my units, some containers appear to be down while they are not ... well ... Juju-based deployments don't seem to play well with a full shutdown08:53
Hybrid512any ideas or best practices towards that kind of situation ?08:54
stickupkidYou might have to go around and bounce the agents on each unit08:58
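A rough sketch of what "bouncing the agents" can look like on one of the machines, assuming the usual jujud-machine-<id> / jujud-unit-<app>-<n> service naming (check the exact names with systemctl first):

    # see which Juju agents run on this machine
    systemctl list-units 'jujud-*'
    # restart the machine agent and the unit agents it hosts, e.g.
    sudo systemctl restart jujud-machine-1
    sudo systemctl restart jujud-unit-rabbitmq-server-0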
Hybrid512tbh, MaaS + Juju makes deployments very enjoyable and efficient but I'm not as confident about the operations part ... it doesn't feel very reliable09:02
Hybrid512just restarting a machine can bring my cluster into a weird state, especially the juju agent ... I have to poke around way too much for something that is meant to pretty much run by itself ... and this is just a test lab with no production workload09:03
stickupkidcan you explain that weird state?09:04
Hybrid512tell me if this is just because I am doing something wrong, but I never encountered that kind of situation with other platforms09:04
Hybrid512well, plenty of juju "agent lost" messages, my status shows plenty of yellow and red states, mysql-innodb-cluster appears to be completely borked ... it looks like everything is broken but it's not really the case09:06
Hybrid512when you drill down a little, you see that the juju agent is running without error on the units, mysql seems to be fine ... in fact, it looks like it is the juju status output that is completely inconsistent09:07
Hybrid512so what should I trust ?09:07
Hybrid512for example, I have that kind of error message : "agent lost, see 'juju show-status-log rabbitmq-server/0'"09:09
Hybrid512when I do the "juju show-status-log rabbitmq-server/0", I have no error09:09
Hybrid512I get this at the end of the status log : "workload   active     Unit is ready and clustered"09:10
Hybrid512so, apparently, everything is OK except that juju says it is not09:10
Hybrid512is there some command that can force a "refresh" of the status by rechecking each agent ?09:11
Hybrid512I have a simpler example: I have an "easyrsa" unit with the same "agent lost" message ... this is an LXD unit, so I just restarted the container, but still, the container is started and the status didn't change09:18
Hybrid512stickupkid how can I just "bounce the agents on each unit" as you said ?09:18
manadartHybrid512: If you are not running in HA, having the controller down means the workload agents will try to connect a few times, but then fail with "connection impossible".09:19
manadartIn that case what is required is to restart the jujud-machine-x service.09:20
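For reference, restarting the machine agent mentioned here typically looks like the following, where the machine number is whatever `juju status` shows for that machine:

    sudo systemctl restart jujud-machine-0
    sudo systemctl status jujud-machine-0 --no-pager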
Hybrid512which controller ? I didn't kill the juju controller, only the machines in a model09:20
Hybrid512I tried to restart jujud-machine-x service as you said but that didn't change anything and the systemctl status says everything is fine for this service09:21
manadartHybrid512: There are 2 status indicators. One is for the workload and one is for the agent. The agent can be lost (i.e. not connected, as seen here), but the last known workload status could be good.09:21
Hybrid512but why is it not connected ?09:22
manadartHybrid512: And what does /var/log/juju/machine-x.log indicate for those units?09:22
Hybrid512nothing valuable09:23
Hybrid512it ends with that kind of log:09:23
Hybrid5122020-11-18 08:43:46 INFO juju.worker.provisioner provisioner_task.go:744 maintainMachines: 0/lxd/409:23
Hybrid5122020-11-18 08:43:46 INFO juju.worker.provisioner provisioner_task.go:744 maintainMachines: 0/lxd/509:23
Hybrid5122020-11-18 08:43:46 INFO juju.worker.provisioner provisioner_task.go:744 maintainMachines: 0/lxd/609:23
Hybrid512how can I check the connection between the agent and the controller ?09:24
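One way to check that, as a sketch assuming the default layout: the agent config on the machine lists the controller API addresses it dials, and the API port is 17070 by default, so reachability can be tested directly (<controller-ip> is a placeholder):

    # where the agent thinks the controller is
    sudo grep -A 3 apiaddresses /var/lib/juju/agents/machine-*/agent.conf
    # is that address/port actually reachable from here?
    nc -zv -w 5 <controller-ip> 17070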
Hybrid512hummm ... I think I've put my finger on something ...09:27
Hybrid512first, it seems that only the LXD-based units are having issues; everything that is deployed on bare metal seems to work just fine09:28
Hybrid512second: I think this might be related to some kind of race condition ... when those machines start, I have a lot of I/O and high load, and I suppose that can break some things there ... couldn't it be possible to be less aggressive when starting the units ?09:29
manadartHybrid512: What version of Juju are you running here?09:32
Hybrid512latest stable09:32
Hybrid5122.8.609:32
Hybrid512with latest MaaS : 2.8.209:33
chrome0Hey, sometime back I proposed https://github.com/juju/charm-tools/pull/564 wondering if anyone would be able to peek?09:38
manadartHybrid512: On the metal machine (0), if you run `lxc exec <container> bash`, what do the machine logs there indicate?09:40
stickupkidchrome0, I'll ping cory_fu for you09:40
Hybrid512manadart I can get into it without any issue09:40
chrome0cheers09:40
manadartHybrid512: On those containers, the machine and/or agent logs in /var/log/juju might tell us something.09:42
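A quick way to pull those up from the host machine, assuming the container names reported by `lxc list` and Juju's machine-<id>.log naming inside the container:

    lxc list
    lxc exec <container> -- tail -n 50 /var/log/juju/machine-0-lxd-4.log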
stickupkidchrome0, he won't be online till later on the day, but hopefully he'll see the message09:42
chrome0Right, understood09:42
Hybrid512manadart : 2020-11-18 09:36:28 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [4698e8] "machine-0-lxd-0" cannot open api: unable to connect to API: dial tcp 192.168.1.104:17070: i/o timeout09:43
Hybrid512looks like it can't connect to the juju controller09:43
Hybrid512(192.168.1.104 is the juju controller IP in my setup)09:43
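To confirm whether this is plain network reachability, a test from inside one of the affected containers against the controller's API port, compared with a container that still works, could look like this:

    lxc exec <container> -- nc -zv -w 5 192.168.1.104 17070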
Hybrid512ok, now there is something ...09:45
manadartHybrid512: And if you restart the agents there...09:45
Hybrid512I see something weird09:45
Hybrid512on machine 0, I have 7 LXD containers09:46
Hybrid512Machine 0 and container 0/lxd/5 are in an OK state but not the other containers09:47
Hybrid512only difference is that 0/lxd/5 has 2 IPs corresponding to 2 different spaces09:47
Hybrid512my spaces are : ost-int = 192.168.211.0/24   ost-pub = 192.168.210.0/2409:48
Hybrid512only machines having an IP in ost-pub can talk to the juju controller, others (inside ost-int) can't09:49
Hybrid512but that is not normal, ost-int is routed and can talk to the 192.168.1.0/24 network which is used for juju controller09:50
manadartHybrid512: Can you check `ip a` and/or `/etc/netplan/xxx.yaml` in 0/lxd/5's container?09:52
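For comparison across the containers, the checks being asked for here amount to something like this (netplan file names vary, hence the glob):

    lxc exec <container> -- ip a
    lxc exec <container> -- ip route get 192.168.1.104
    lxc exec <container> -- cat /etc/netplan/*.yaml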
Hybrid512WTF!! okay ... my bad, I found the issue09:52
Hybrid512192.168.211.0/24 doesn't seem to be routed (at least not to 192.168.1.0/24) so this is perfectly normal09:53
Hybrid512humm ... well, not completely sure though ... my control plane has 3 VMs, all on the same hypervisor with the same subnets and they are managed by MaaS but they don't all behave the same09:54
Hybrid512some of them are "talking" properly on these networks ... that's weird, I'll have to check my stuff09:55
Hybrid512anyway, thanks for pointing me in the right direction09:55
manadartHybrid512: NP. Glad you've got to the bottom of it.10:00
Hybrid512well ... hope so ... because by now, I don't see the difference between those VMs ... they are all the same ... very weird10:00
Hybrid512digging ...10:00
Hybrid512thx anyway10:01
Hybrid512oh, and another totally unrelated question: MaaS and Juju are great together, but the MaaS UI is not very practical when dealing with dozens of machines, except by using filters, which it does well10:30
Hybrid512however, when you have a few models deployed with Juju, it is not very easy to see which machine is used by which model when looking at the MaaS UI10:30
Hybrid512MaaS has a label called "Resource pool" which could very efficiently be used by Juju to map its models into10:31
Hybrid512is there a way to map juju models to MaaS resource pools ? If so, how ? if not, is there an alternative such as mapping models to MaaS tags ?10:32
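One possible workaround, sketched here as an assumption rather than a confirmed feature: tag the machines in MaaS and pin a model (or a single application) to those tags via Juju constraints:

    # after tagging the machines in MaaS, e.g. with a tag named after the model:
    juju set-model-constraints -m mymodel tags=mymodel
    juju deploy mysql --constraints "tags=mymodel"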
stickupkidachilleasa, find version -> https://github.com/juju/juju/pull/1233812:35
achilleasalooking12:38
stickupkidachilleasa, I have a question about the facade version as I tweaked the output12:39
stickupkidachilleasa, considering nobody is using it, it's behind a feature flag and it's omitempty, do we care?12:39
achilleasastickupkid_: would it break people who have enabled the flag while using the 2.9 rc's?13:39
stickupkid_achilleasa, nope13:39
achilleasaand it's not in the pylibjuju client bits either, right?13:40
stickupkid_achilleasa, nope, we don't have any schema for it yet13:40
achilleasaI guess it's fine then13:40
stickupkid_achilleasa, we do need to rebuild the schema to include it in the 2.9 branch13:40
achilleasastickupkid_: you should probably add a card so we don't forget :D13:44
stickupkid_achilleasa, YES!13:44
achilleasastickupkid_: stupid question... why do I need a controller to run juju find?13:51
stickupkid_achilleasa, because we're forcing charmhub to use controllers, one point of access...13:52
stickupkid_achilleasa, controllers are a bastion for connections to the store ;-)13:53
achilleasaI get that deployment needs the controller and perhaps find in some cases (e.g when using an alternative store)13:53
stickupkid_alternative store is done via juju add-model other --config charm-hub-url="https://api.staging.snapcraft.io"13:53
achilleasabut requiring a bootstrapped thing to search a remote store seems odd from a UX perspective13:53
stickupkid_yes, but it means you can then only deploy what that model can13:54
stickupkid_this becomes more important with architectures etc13:54
stickupkid_also we get all the advantages of api facade revisioning, normalised data...13:55
achilleasauntil you get a "juju find returns inconsistent results" bug :D13:55
stickupkid_this was my rant about bootstrapping taking too long and why I think that a daemon should just be `juju init`13:55
achilleasaI get the reasoning but I think it should also be able to work without an active controller (and use the active one if available)13:56
stickupkid_it's up to you to carve up a machine for juju to run13:56
stickupkid_but that's a fight for another day13:56
stickupkid_how do you reconcile the different types? you'd have to normalise the types the same way as the API server13:57
stickupkid_seems like a waste13:57
achilleasabut you see, I can query the store from the web UI, so having such a constraint on a client makes no sense13:57
stickupkid_then query from the web UI ;)13:57
achilleasait's like telling people: you have to think about what you want to deploy (== find it in a web UI [on your headless box]), then set up the infra based on the arch you need, and then you are set?13:58
stickupkid_i'm a fan of people "thinking" :)13:59
achilleasaor to rephrase it: use an external source to find what you can use with your CLI tool :D14:00
stickupkid_bring it up in standup14:01
stickupkid_:)14:01
stickupkid_not saying you're wrong, just that it was designed this way14:01
achilleasawill do. I get the arguments but I still think it's bad UX for a _search_ command (imagine apt requiring you to use SSO before you can search)14:03
stickupkid_also when private charms arrive you'll need to SSO anyway to see them14:03
stickupkid_so we won't handle that in the CLI14:03
achilleasaso you don't SSO in the CLI and pass a macaroon to the controller?14:04
stickupkid_yeah, probably actually...14:04
=== diddledan_ is now known as diddledan
