/srv/irclogs.ubuntu.com/2020/11/18/#juju.txt

Hybrid512o/08:50
Hybrid512One question here : how can you force a juju agent to recover properly ?08:50
Hybrid512here is the context: I wanted to do some "crash testing" on my OpenStack clusters deployed with MaaS/Juju and see how they behave in case of a full crash, so I simply killed all my machines, stopped the cluster completely, then restarted it.08:52
Hybrid512and now everything is totally broken and I have plenty of "agent lost" messages for my units, some containers appear to be down while they are not ... well ... Juju-based deployments don't seem to play well with a full shutdown08:53
Hybrid512any ideas or best practices towards that kind of situation ?08:54
stickupkidYou might have to go around and bounce the agents on each unit08:58
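A rough sketch of what "bouncing the agents" can look like on one of the machines, assuming the usual jujud-machine-<id> / jujud-unit-<app>-<n> service naming (check the exact names with systemctl first):

    # see which Juju agents run on this machine
    systemctl list-units 'jujud-*'
    # restart the machine agent and the unit agents it hosts, e.g.
    sudo systemctl restart jujud-machine-1
    sudo systemctl restart jujud-unit-rabbitmq-server-0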
Hybrid512tbh, MaaS + Juju makes deployments very enjoyable and efficient but I'm not as confident about the operations part ... it doesn't feel very reliable09:02
Hybrid512just restarting a machine can bring my cluster into a weird state, especially the juju agent ... I have to poke around way too much for something that is meant to pretty much run by itself ... and this is just a test lab with no production workload09:03
stickupkidcan you explain that weird state?09:04
Hybrid512tell me if this is just because I am doing something wrong, but I never encountered that kind of situation with other platforms09:04
Hybrid512well, plenty of juju "agent lost" messages, my status shows plenty of yellow and red states, mysql-innodb-cluster appears to be completely borked ... it looks like everything is broken but it's not really the case09:06
Hybrid512when you drill down a little, you see that the juju agent is running without error on the units, mysql seems to be fine ... in fact, it looks like it is the juju status output that is completely inconsistent09:07
Hybrid512so what should I trust ?09:07
Hybrid512for example, I have that kind of error message : "agent lost, see 'juju show-status-log rabbitmq-server/0'"09:09
Hybrid512when I do the "juju show-status-log rabbitmq-server/0", I have no error09:09
Hybrid512I get this at the end of the status log : "workload   active     Unit is ready and clustered"09:10
Hybrid512so, apparently, everything is OK except that juju says it is not09:10
Hybrid512is there some command that can force a "refresh" of the status by rechecking each agent ?09:11
Hybrid512I have a simpler example: I have an "easyrsa" unit with the same "agent lost" message ... this is an LXD unit, so I just restarted the container, but still, the container is started and the status didn't change09:18
Hybrid512stickupkid how can I just "bounce the agents on each unit" as you said ?09:18
manadartHybrid512: If you are not running in HA, having the controller down means the workload agents will try to connect a few times, but then fail with "connection impossible".09:19
manadartIn that case what is required is to restart the jujud-machine-x service.09:20
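For reference, restarting the machine agent mentioned here typically looks like the following, where the machine number is whatever `juju status` shows for that machine:

    sudo systemctl restart jujud-machine-0
    sudo systemctl status jujud-machine-0 --no-pager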
Hybrid512which controller ? I didn't kill the juju controller, only the machines in a model09:20
Hybrid512I tried to restart jujud-machine-x service as you said but that didn't change anything and the systemctl status says everything is fine for this service09:21
manadartHybrid512: There are 2 status indicators. One is for the workload and one is for the agent. The agent can be lost (i.e. not connected, as seen here), but the last known workload status could be good.09:21
Hybrid512but why is it not connected ?09:22
manadartHybrid512: And what does /var/log/juju/machine-x.log indicate for those units?09:22
Hybrid512nothing valuable09:23
Hybrid512it ends with that kind of log:09:23
Hybrid5122020-11-18 08:43:46 INFO juju.worker.provisioner provisioner_task.go:744 maintainMachines: 0/lxd/409:23
Hybrid5122020-11-18 08:43:46 INFO juju.worker.provisioner provisioner_task.go:744 maintainMachines: 0/lxd/509:23
Hybrid5122020-11-18 08:43:46 INFO juju.worker.provisioner provisioner_task.go:744 maintainMachines: 0/lxd/609:23
Hybrid512how can I check the connection between the agent and the controller ?09:24
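One way to check that, as a sketch assuming the default layout: the agent config on the machine lists the controller API addresses it dials, and the API port is 17070 by default, so reachability can be tested directly (<controller-ip> is a placeholder):

    # where the agent thinks the controller is
    sudo grep -A 3 apiaddresses /var/lib/juju/agents/machine-*/agent.conf
    # is that address/port actually reachable from here?
    nc -zv -w 5 <controller-ip> 17070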
Hybrid512hummm ... I think I've put my finger on something ...09:27
Hybrid512first, it seems that only the LXD-based units are having issues; everything that is deployed on bare metal seems to work just fine09:28
Hybrid512second: I think this might be related to some kind of race condition ... when those machines start, I have a lot of I/O and high load, and I suppose that can break some things there ... couldn't it be possible to be less aggressive when starting the units ?09:29
manadartHybrid512: What version of Juju are you running here?09:32
Hybrid512latest stable09:32
Hybrid5122.8.609:32
Hybrid512with latest MaaS : 2.8.209:33
chrome0Hey, sometime back I proposed https://github.com/juju/charm-tools/pull/564 wondering if anyone would be able to peek?09:38
manadartHybrid512: On the metal machine (0), if you run `lxc exec <container> bash`, what do the machine logs there indicate?09:40
stickupkidchrome0, I'll ping cory_fu for you09:40
Hybrid512manadart I can get into it without any issue09:40
chrome0cheers09:40
manadartHybrid512: On those containers, the machine and/or agent logs in /var/log/juju might tell us something.09:42
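A quick way to pull those up from the host machine, assuming the container names reported by `lxc list` and Juju's machine-<id>.log naming inside the container:

    lxc list
    lxc exec <container> -- tail -n 50 /var/log/juju/machine-0-lxd-4.log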
stickupkidchrome0, he won't be online till later on the day, but hopefully he'll see the message09:42
chrome0Right, understood09:42
Hybrid512manadart : 2020-11-18 09:36:28 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [4698e8] "machine-0-lxd-0" cannot open api: unable to connect to API: dial tcp 192.168.1.104:17070: i/o timeout09:43
Hybrid512looks like it can't connect to the juju controller09:43
Hybrid512(192.168.1.104 is the juju controller IP in my setup)09:43
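To confirm whether this is plain network reachability, a test from inside one of the affected containers against the controller's API port, compared with a container that still works, could look like this:

    lxc exec <container> -- nc -zv -w 5 192.168.1.104 17070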
Hybrid512ok, now there is something ...09:45
manadartHybrid512: And if you restart the agents there...09:45
Hybrid512I see something weird09:45
Hybrid512on machine 0, I have 7 LXD containers09:46
Hybrid512Machine 0 and container 0/lxd/5 are in an OK state but not the other containers09:47
Hybrid512only difference is that 0/lxd/5 has 2 IPs corresponding to 2 different spaces09:47
Hybrid512my spaces are : ost-int = 192.168.211.0/24   ost-pub = 192.168.210.0/2409:48
Hybrid512only machines having an IP in ost-pub can talk to the juju controller, others (inside ost-int) can't09:49
Hybrid512but that is not normal, ost-int is routed and can talk to the 192.168.1.0/24 network which is used for juju controller09:50
manadartHybrid512: Can you check `ip a` and/or `/etc/netplan/xxx.yaml` in 0/lxd/5's container?09:52
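For comparison across the containers, the checks being asked for here amount to something like this (netplan file names vary, hence the glob):

    lxc exec <container> -- ip a
    lxc exec <container> -- ip route get 192.168.1.104
    lxc exec <container> -- cat /etc/netplan/*.yaml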
Hybrid512WTF!! okay ... my bad, I found the issue09:52
Hybrid512192.168.211.0/24 doesn't seem to be routed (at least not to 192.168.1.0/24) so this is perfectly normal09:53
Hybrid512humm ... well, not completely sure though ... my control plane has 3 VMs, all on the same hypervisor with the same subnets and they are managed by MaaS but they don't all behave the same09:54
Hybrid512some of them are "talking" properly on these networks ... that's weird, I'll have to check my stuff09:55
Hybrid512anyway, thanks for pointing me in the right direction09:55
manadartHybrid512: NP. Glad you've got to the bottom of it.10:00
Hybrid512well ... hope so ... because by now, I don't see the difference between those VMs ... they are all the same ... very weird10:00
Hybrid512digging ...10:00
Hybrid512thx anyway10:01
Hybrid512oh, and another totally unrelated question: MaaS and Juju are great together, but the MaaS UI is not very practical when dealing with dozens of machines, except by using filters, which it does well10:30
Hybrid512however, when you have a few models deployed with Juju, it is not very easy to see which machine is used by which model when looking at the MaaS UI10:30
Hybrid512MaaS has a label called "Resource pool" which could very efficiently be used by Juju to map its models into10:31
Hybrid512is there a way to map juju models to MaaS resource pools ? If so, how ? if not, is there an alternative such as mapping models to MaaS tags ?10:32
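One possible workaround, sketched here as an assumption rather than a confirmed feature: tag the machines in MaaS and pin a model (or a single application) to those tags via Juju constraints:

    # after tagging the machines in MaaS, e.g. with a tag named after the model:
    juju set-model-constraints -m mymodel tags=mymodel
    juju deploy mysql --constraints "tags=mymodel"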
stickupkidachilleasa, find version -> https://github.com/juju/juju/pull/1233812:35
achilleasalooking12:38
stickupkidachilleasa, I have a question about the facade version as I tweaked the output12:39
stickupkidachilleasa, considering nobody is using it, it's behind a feature flag and it's omitempty, do we care?12:39
achilleasastickupkid_: would it break people who have enabled the flag while using the 2.9 rc's?13:39
stickupkid_achilleasa, nope13:39
achilleasaand it's not in the pylibjuju client bits either, right?13:40
stickupkid_achilleasa, nope, we don't have any schema for it yet13:40
achilleasaI guess it's fine then13:40
stickupkid_achilleasa, we do need to rebuild the schema to include it in the 2.9 branch13:40
achilleasastickupkid_: you should probably add a card so we don't forget :D13:44
stickupkid_achilleasa, YES!13:44
achilleasastickupkid_: stupid question... why do I need a controller to run juju find?13:51
stickupkid_achilleasa, because we're forcing charmhub to use controllers, one point of access...13:52
stickupkid_achilleasa, controllers are a bastion for connections to the store ;-)13:53
achilleasaI get that deployment needs the controller and perhaps find in some cases (e.g when using an alternative store)13:53
stickupkid_alternative store is done via juju add-model other --config charm-hub-url="https://api.staging.snapcraft.io"13:53
achilleasabut requiring a bootstrapped thing to search a remote store seems odd from a UX perspective13:53
stickupkid_yes, but it means you can then only deploy what that model can13:54
stickupkid_this becomes more important with architectures etc13:54
stickupkid_also we get all the advantages of api facade revisioning, normalised data...13:55
achilleasauntil you get a "juju find returns inconsistent results" bug :D13:55
stickupkid_this was my rant about bootstrapping taking too long and why I think that a daemon should just be `juju init`13:55
achilleasaI get the reasoning but I think it should also be able to work without an active controller (and use the active one if available)13:56
stickupkid_it's up to you to carve up a machine for juju to run13:56
stickupkid_but that's a fight for another day13:56
stickupkid_how do you reconcile the different types? you'd have to normalise the types the same way as the API server13:57
stickupkid_seems like a waste13:57
achilleasabut you see, I can query the store from the web UI, so having such a constraint on a client makes no sense13:57
stickupkid_then query from the web UI ;)13:57
achilleasait's like telling people: you have to think about what you want to deploy (== find it in a web UI [on your headless box]), then set up the infra based on the arch you need, and then you are set?13:58
stickupkid_i'm a fan of people "thinking" :)13:59
achilleasaor to rephrase it: use an external source to find what you can use with your CLI tool :D14:00
stickupkid_bring it up in standup14:01
stickupkid_:)14:01
stickupkid_not saying you're wrong, just that it was designed this way14:01
achilleasawill do. I get the arguments but I still think it's bad UX for a _search_ command (imagine apt requiring you to use SSO before you can search)14:03
stickupkid_also when private charms arrive you'll need to SSO anyway to see them14:03
stickupkid_so we won't handle that in the CLI14:03
achilleasaso you don't SSO in the CLI and pass a macaroon to the controller?14:04
stickupkid_yeah, probably actually...14:04
=== diddledan_ is now known as diddledan
