=== frankban|afk is now known as frankban
kjackalGood morning Juju world!07:20
bdxMLFQ scheduler with process budgeting for anyone interested, implemented by yours truly - https://gist.github.com/jamesbeedy/f97393235a06f878655c7eeace71750010:04
Zichi Juju world, deploying a new CDK cluster (in the latest version this time /!\) and one of my 5 etcd is stuck at "waiting/idle" with message "Waiting for unit to complete registration."10:32
Zicthis is the only etcd unit stuck at that, the 4 others are active/idle10:33
ZicI took a look at "juju debug-log" but there does not seem to be any special error10:34
admcleodZic: anything anywhere else? e.g. syslog, cdk logs, juju logs on the unit itself?10:54
Zicadmcleod: I just checked /var/log/juju/* and nothing in error, I will check syslog also11:05
ZicMay 24 11:04:24 mth-k8stestetcd-03 /snap/bin/etcd.etcdctl[1584]: cmd.go:114: DEBUG: not restarting into "/snap/core/current/usr/bin/snap" ([VERSION=2.24 2.24]): older than "/usr/bin/snap" (2.25)11:05
ZicI have many of it11:05
admcleodZic: not entirely sure if thats going to be related but perhaps worth updating?11:06
Zicwas a fresh deploying11:06
admcleodZic: hrm ok11:08
admcleodZic: well. in any case, 4/5 etcd are ok.. underlying network issue? can 'etcd 5' communicate w/ the others?11:10
kjackalZic: I am trying the deployment  now11:10
admcleodkjackal to the rescue11:10
kjackalnot sure admcleod11:10
kjackalZic: you deployed the two extra etcds after the initial deployment had finished? Or did you trigger the deployment with 5 units?11:13
Zickjackal: directly deploying with 5 etcd yep11:13
Zicnot after11:13
kjackalok , Zic, thanks11:13
Zicwe're deploying all our CDK cluster with N+2 redundancy11:13
Zicso 5 etcd to have at least a quorum of 311:14
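For reference, an etcd cluster of n members keeps quorum with a majority of floor(n/2)+1 votes, so 5 members tolerate 2 failures; a quick sketch of the arithmetic:

```shell
# etcd quorum: a cluster of n members needs floor(n/2)+1 votes,
# so 5 members tolerate 2 failures (quorum stays at 3).
quorum() { echo $(( $1 / 2 + 1 )); }
quorum 3   # prints 2
quorum 5   # prints 3
```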
Zicask me if you need further action from me to have more debug logs, I don't have that much for now :(11:14
ZicI checked the status of the etcd service, it's stopped11:15
Zicand cannot start:11:15
=== tinwood is now known as tinwood_afk
kjackalZic: strange... in any case redeploying to see if I can repro. Will let you know if I need more help11:19
Zickjackal: never had this one before, I think it's just a random bug :(11:20
Zicall etcd have the same connectivity configuration11:21
ZicI tried to reboot the buggy etcd unit's machine, with no new result11:21
Zicjuju debug-log prints a little more after the reboot11:22
Zicbut nothing new compared to the local /var/log/syslog of the unit machine11:22
=== zeus is now known as Guest41638
Zickjackal: can I tell that unit to restart the charm installation from the beginning?12:16
kjackalZic: no I do not think so. I couldn't repro this bug. My suggestion is to add another etcd unit so you have 5. Also if you could wait a bit lazyPower and tvansteenburgh will be up shortly, they might be able to offer their opinion12:19
Zicok, I can wait, this is the test cluster instance :)12:21
kjackalMany thanks Zic12:21
ZicI'm afraid of losing precious debug logs that could interest the CDK team if I go further and drop that silly unit, so I can wait :)12:22
Zic(I will go to dinner btw)12:22
lazyPowero/ morning12:45
tvansteenburghZic: i think the syslog msg is a red herring. might be a charm bug, i'm not sure12:45
tvansteenburghfound that status msg here https://github.com/juju-solutions/layer-etcd/blob/ca4c14c52822b41113e7df297e99097016494e07/reactive/etcd.py#L31812:45
tvansteenburgho/ lazyPower12:45
tvansteenburghmaybe you can help :)12:46
* lazyPower is reading backscroll12:46
lazyPowerhmm, smells like a race during unit registration12:47
lazyPower4/5 came up good, the 5th refuses to start.12:47
NazHi, I want to point to an error in the documentation about Constraints.12:53
Nazhttps://jujucharms.com/docs/2.1/charms-constraints says and I quote:12:53
NazIn the event that a constraint cannot be met, the unit will not be deployed.  Note: Constraints work on an "or better" basis: If you ask for 4 CPUs, you may get 8, but you won't get 212:53
NazThis is NOT true: I requested --constraints "cores=4 mem=16G" on my laptop, which physically has 2 cores and 8G, and it instantiated a machine with 2 cores and 8G instead of the 4 cores/16G RAM requested.12:54
BlackDexHello there12:55
BlackDexi have a juju 1.25 env where an subordinate is stuck12:55
BlackDexit hangs on (stop)12:55
BlackDexand doesn't get removed12:55
BlackDexanyone any ideas how to resolve that?12:56
tvansteenburghNaz: you are using the lxd provider?12:56
lazyPowerNaz: I dont think constraints like that are honored on the local/lxd provider. I know that behavior is true when using clouds like aws,gce,maas,openstack.12:56
BlackDexnever mind, i restarted the machine-xx jujud and it worked :S12:56
NazTvansteen, Yes, I am working on local cloud LXD12:56
BlackDexdid that before and it didn't work, now it does12:56
NazLazyPower, Yup, I am working locally over LXD12:57
lazyPowerBlackDex: its not immediately obvious but if a relationship fails during the subordinate charms teardown, it will halt.12:57
* lazyPower steps away to make coffee12:58
Naz@LazyPower, @TvansteenBurgh, You guessed it right, but I think it's better to mention in the docs that this applies to public clouds but not to local/LXD12:59
NazIs it also the case for Locally deployed OPENSTACK?12:59
lazyPowerNaz: well, thats a slightly different story. LXD is based around density and will do everything it can to colocate your workloads on the requested hardware. The only type of allocation you could do would be to set cgroup limits, so you can over-request a machine as it were.13:02
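A sketch of the cgroup-limit approach lazyPower mentions, using plain lxc commands; the container name `juju-machine-5` is a placeholder (find the real one with `lxc list`), and this assumes a running LXD:

```shell
# Cap resources via cgroup limits on the LXD container backing a
# juju machine; "juju-machine-5" is a hypothetical container name.
lxc config set juju-machine-5 limits.cpu 2
lxc config set juju-machine-5 limits.memory 4GB
lxc config show juju-machine-5 | grep limits   # verify
```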
rick_hbdx: you get he hangout link ok?13:02
lazyPowernow when you deploy openstack, and you use nova - you'd run into issues because nova wouldn't have enough vcpus to -dedicate- to that requested workload and i would suspect it stays in pending13:02
lazyPowerbut if you used nova-lxd, it would likely be happy trying to cram whatever you throw at it wherever you have nova-lxd.13:03
Naz@LazyPower, I see, thank you, I have another question, please, how to upgrade the memory when working on localCloud/LXD?13:05
rick_hbdx: bueller, bueller13:05
NazSo in other words, if I have a machine instantiated with constraints mem=2G, and based on metrics I find it's struggling, how do I upgrade it to 4G of RAM?13:06
lazyPowerNaz: so depending on your cloud provider. I know on GCE you can stop the unit and change the instance type, and once you start it back up, it will inform the controller of its new ip address (i'm unsure if it re-reports its hardware) but thats one way to do it.13:08
lazyPoweranother option would be to add a unit with different constraints, and then remove the unit with the lower constraints (not necessarily ideal with stateful workloads)13:08
=== tinwood_afk is now known as tinwood
Naz@LazyPower, yes, I understand. The first option induces an OUTAGE, the second is seamless from the end-user perspective. Still, I am interested in inspecting the first option further, please. What do you mean by stop? do you mean juju remove-unit?13:14
lazyPowerNaz: no not at all. When i say stop i refer to stopping the unit in the cloud control panel. Most clouds require you to have the instance in a stopped/power-off state in order to change hardware configuration13:14
Nazor juju remove-machine?13:15
lazyPowerNaz: and ideally, you would be running in a highly available scenario to mitigate any outage.13:15
NazI think you meant Remove-machine and recreate another new one with higher constraints?13:15
Naz@LazyPower, Agree with you on HA :)13:16
lazyPowerNaz: The first method would not involve any changes to the model using juju. You would be issuing these commands against your cloud provider to halt the instance and change its hardware profile.13:16
BlackDexhmm i have a problem with relations13:17
BlackDexthey won't complete13:17
BlackDexand this causes the services to restart every x seconds (if not less)13:17
BlackDexthe cluster-relation-changed keeps getting called13:21
Naz@LazyPower, on local LXD Cloud, can I do some orchestration on resources like the ones offered for LXC?13:24
lazyPowerNaz: i'm not sure what you're asking me13:24
lazyPowerBlackDex: what charm is this?13:24
Naz@LazyPower, I want to do the following scenario: start with a machine with limited resources, let's say 1 CPU, then during runtime, based on real-time metrics, increase the CPU to 2. For example, in LXC: lxc config set my-container limits.cpu 113:27
Naz@LazyPower, how could I do this in JUJU?13:27
lazyPowerNaz: thats a bit beyond me, i dont know if we offer anything like that. I dont think juju is setting any constraints on the local provider.13:27
lazyPowerNaz: sorry i really dont know, i'd rather tell you i dont know than misinform you. The best i can say at this time is to try it and inspect the lxc profile post deployment. if you dont see any resource limits, its not something we support today but with a feature-request we can look into it.13:29
ZiclazyPower: I'm back, if you need further info :)13:30
Zic(for the 1/5 etcd which stuck at registration)13:30
lazyPowerZic: just one unit didn't turn up?13:30
lazyPowerZic: juju debug-log --replay -i unit-etcd-#  | pastebinit13:30
lazyPoweron the stuck unit13:31
lazyPoweri suspect a race during registration13:31
Naz@LazyPower, Ok, I understand. Could you please point me to some scenario for how you think I can do orchestration in juju? (Orchestration: getting some metrics and reacting upon them to answer the demand)13:32
lazyPowerZic: i didn't get back home until late last night so i haven't had a chance to fetch the resources, but i'll def. dig into that tonight for tomorrow.13:32
Zichttp://paste.ubuntu.com/24644010/ lazyPower13:32
lazyPowerNaz: elastisys has modeled an orchestrator for autoscaling. 1 moment while i grab you the link13:32
lazyPowerNaz: https://jujucharms.com/u/elastisys/charmscaler/13:33
Naz@LazyPower, Great, I will have a look into that :)13:33
ZiclazyPower: yup, I'm deploying this new test cluster in 1.7.3 at least to let the customer test his pod in 1.7.3, but I'm not giving up on testing the 1.5.3 -> 1.7.X upgrade; will wait for what you discover with our specific charm rv13:34
lazyPowerZic: unit-etcd-2: 10:17:11 INFO unit.etcd/2.juju-log cluster:0: Invoking reactive handler: reactive/etcd.py:279:register_node_with_leader13:35
lazyPowerso it sent registration details to the leader. can you hop over to the leader and run `etcd.etcdctl member list`13:35
Zichttp://paste.ubuntu.com/24644023/ <= lazyPower13:36
lazyPowermember fd9d260cab7a11dc is unreachable: no available published client urls <-- it completed AND joined at one point13:37
lazyPowerif it had not joined it would say (unstarted)13:37
Zichmm, and more weird13:37
Zicwhere are 03 and 05?13:37
Zicone of them is missing because of error, ok13:37
Zicbut the other one? :D13:37
ZiclazyPower: wait, I missed something13:38
* lazyPower waits13:38
Zic2 is missing13:38
Zicbut one is in active/idle13:38
ZicI did not see the error message, so...13:38
Zicfrom the beginning I spoke about etcd/213:38
Zicbut etcd/4 is also in problem13:39
lazyPoweri would say units 2 and 4 raced13:39
lazyPowerand are now in a deadlock13:39
lazyPoweryou can juju remove-unit on those and re-add them and it should sort itself13:39
Zicthe two I just added to the default charm-bundle?13:39
Zic(I always scale etcd to 5 instead of 3 in CDK)13:39
Zicwill try that lazyPower13:40
lazyPowerso remove the errored unit, `juju remove-unit etcd/4`  wait for it to complete13:40
lazyPowerif the cluster still reports healthy, `juju remove-unit etcd/2`13:40
lazyPowerif cluster continues to report healthy, then you can juju add-unit etcd -n 213:40
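Collected into one sequence, the recovery lazyPower describes; unit numbers are from this incident, `etcd/0` stands in for any healthy unit, and you should check cluster health between steps:

```shell
# Remove the deadlocked units one at a time, verifying cluster
# health in between, then re-add capacity.
juju remove-unit etcd/4
juju status etcd                                      # wait until gone
juju run --unit etcd/0 'etcd.etcdctl cluster-health'
juju remove-unit etcd/2
juju run --unit etcd/0 'etcd.etcdctl cluster-health'
juju add-unit etcd -n 2
```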
Zicon which machines will "juju add-unit etcd -n 2" add the units?13:41
Zichmm btw: etcd/4                    error     idle       9        mth-k8stestetcd-05.aws-us-east-1    2379/tcp        hook failed: "cluster-relation-broken"13:41
lazyPowerthere's a marginal chance that units during turn-up will miss another unit's registration request and attempt to register; you managed to hit that13:41
lazyPowerits a known deficiency because the coordination relies on querying the leader for the member list before it attempts registration. it looks for a non-healthy, non-ready unit in the member list; if it finds one, it halts. if none is present it will declare that it's registering on the peer interface and attempt self-registration13:42
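A rough sketch of that coordination check, keying off the `etcdctl member list` text format seen earlier; the function name and logic are illustrative, not the charm's actual code:

```shell
# Skip self-registration while the leader's member list shows a
# member that joined but never became healthy (mid-registration).
safe_to_register() {
  printf '%s\n' "$1" | grep -qE 'unstarted|unreachable' && return 1
  return 0
}
```

Used before registering, e.g. `safe_to_register "$(etcd.etcdctl member list)" || exit 0`.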
lazyPowerZic: thats expected, the unit itself is in a broken state. the leader will deregister the unit from the cluster if it has any details in its registration data store13:43
lazyPowerZic: juju resolved --no-retry until the unit is gone.13:43
Zicit switched to "terminated" and then it's gone :)13:44
Zichmm, it also removes the machine from the controller :>13:46
lazyPowerZic: theres a way to change that, i think its the provisioner-harvest-mode model-config option13:52
lazyPowerZic: https://jujucharms.com/docs/2.1/models-config#juju-lifecycle-and-harvesting13:54
ZicI can just respawn the 2 bugged etcd machines afterwards, not important for this test cluster :)13:55
Zicdo I remove etcd/2 also now?13:55
lazyPowerthe ones that are stuck in registration limbo, yeah13:59
BlackDexlazyPower: The cinder charm keeps running the cluster-relation-changed hook :(14:04
BlackDexsometimes it restarts haproxy and apache2, and sometimes it tells that it is already running14:04
lazyPowerBlackDex: pop over to #openstack-charms, they have the most experience with those charms as the community maintaining them :)14:04
BlackDexoke :)14:04
=== scuttle|afk is now known as scuttlemonkey
kjackalQuestion: Can you get the models name from within a running charm?14:59
kjackalis ther an env variable?14:59
jrwrennot AFAIK, and why would you want to? a charm should definitely not behave differently based on the name of the model in which it is deployed.15:00
kjackaljrwren: thanks for the quick reply. Yes you are probably right on this15:01
dakj_Hello guys, I have a question for you: is the landscape-client service deployed on a node actually working? Because I don't know how it works after the deploy. thanks15:07
ZiclazyPower: redeploying two new etcd, they're still stuck in "waiting/idle Waiting for unit to complete registration." :o15:10
lazyPowerZic: hmm, that removal should have kicked the one thats stuck15:10
lazyPowerZic: can you remote into the master and issue a member list to pastebin again?15:11
lazyPoweri'm going to validate an assumption i have of whats blocking the other 2 units15:11
lazyPowerZic: i'll also be latent, i'm in sig-onprem taking notes.15:12
Zichttp://paste.ubuntu.com/24644681/ <= lazyPower15:15
Zicno problem, I'm off for the weekend (yaaaiii \o/) in two hours :)15:15
Zicso I will also be latent while I head back home15:16
=== scuttlemonkey is now known as scuttle|afk
=== scuttle|afk is now known as scuttlemonkey
lazyPowerZic: on the leader `etcd.etcdctl member remove fd9d260cab7a11dc`15:50
lazyPowershould unstick those pending units15:50
lazyPowerlooks like the unit that biffed registration didn't actually get removed, which is a whole different issue i'm going to have to look into if i can reproduce it15:51
ZiclazyPower: did that, got a new strange thing :(16:03
Zicone of the 2 new units is fully OK16:03
Zicetcd/5                    active    idle   14       mth-k8stestetcd-05.aws-us-east-1    2379/tcp        Healthy with 5 known peers16:03
Zicthe other one is not OK :16:04
Zicetcd/6                    active    idle   13       mth-k8stestetcd-03.aws-us-east-1    2379/tcp        Errored with 0 known peers16:04
Zicetcdctl cluster-health returns all healthy except this one:16:04
Zicmember 34f56278a8fdd1cf is unreachable: no available published client urls16:04
lazyPowerZic: ok, what happened?16:10
lazyPowerah, lag on my end, 1 sec16:11
lazyPowerZic: this is the latest revision of the charm?16:11
Zicdeployed from the latest bundle-charm16:15
Zicnever upgraded, all fresh16:15
Zic(from this morning)16:15
Zic(#38 revision of canonical-kubernetes)16:15
ryebotZic: Can we see the journalctl logs of snap.etcd.etcd on mth-k8stestetcd-03.aws-us-east-1 ?16:15
ryebotwow, no errors or anything, just implodes16:18
ryebotZic: can you see if any processes are currently listening to port 2379 on that box?16:19
ryebot`netstat -plant | grep LISTEN | grep 2379`16:20
Zican old man told me that "netstat is old, use 'ss' instead", but it's offtopic16:21
Zicwill try that16:21
ryebothaha sure whatever works :)16:21
* ryebot googles ss in a desperate attempt to recover lost youth.16:22
Zic(it was a joke at my office about an "old" coworker who always uses "ifconfig" and "netstat")16:22
Zicryebot: you can use the same parameters as with netstat16:22
Zicso ss -plant will work16:22
Zicthe output may differ a bit16:22
ryebotah cool16:22
ryebotTried it, got ESTABBED a bunch of times. Not sure how I feel about that.16:23
Zicin any case: this netstat/ss does not return anything16:23
ryebotokay, well that rules port conflicts out16:23
ryebothmm let me ruminate on these logs a bit16:23
ryebotZic: I'm guessing systemctl restarting etcd results in the same failure after a few moments?16:24
ryebotZic: Could I also see the logs from a good etcd?16:25
ryebotlazyPower: Is there a debug logging mode for etcd?16:25
ryebotnvm, --debug true seems to do it16:26
ryebotZic: can you also edit /etc/systemd/system/snap.etcd.etcd.service to add the --debug true flag?16:27
ryebotlazyPower: if you come at me with a lmgtfy, well, I deserve it. xD16:27
Zicryebot | Zic: I'm guessing systemctl restarting etcd results in the same failure after a few moments? →  yes16:28
Zicryebot | Zic: Could I also see the logs from a good etcd? → http://paste.ubuntu.com/24645416/16:29
Zichttp://paste.ubuntu.com/24645430/ <= for --debug16:31
ryebotZic: fantastic, thanks16:31
ryebotZic: There's an error in there I'm trying to get to the bottom of. Give me a little time to research.16:33
Zicnp :)16:34
ryebotZic: Can you ls the contents of /var/snap/etcd/common and /var/snap/etcd/current for me?16:43
=== grumble2 is now known as grumble
lazyPowerryebot: nah :) I'm loving the fact you stepped in to lend a hand <316:46
Zicryebot: ^16:46
ryebotlazyPower happy to ;)16:46
ryebotZic: Can you share the contents of /var/snap/etcd/common/etcd.conf?16:47
=== punk3r is now known as jojo
ryebotZic: thanks16:51
=== jojo is now known as badoit
Zic(root@mth-k8stestetcd-03:~# cat /var/snap/etcd/common/etcd.conf | nc termbin.com 9999 -> it's like pastebinit but without the pastebinit client :>)16:51
Zicdon't know why I didn't use that before16:51
ryebotZic: Can you get me the journalctl logs of the failing etcd again, but this time use `-o cat` in the journalctl flags?16:53
ryebotZic, can you also try replacing the ETCD_INITIAL_CLUSTER line in /var/snap/etcd/common/etcd.conf with the following, and then restart the snap.etcd.etcd service?16:56
ryebotZic: thanks16:57
Zichttp://paste.ubuntu.com/24645563/ <= ryebot for the ETCD_INITIAL_CLUSTER16:59
ryebotZic: thanks, same error in journalctl?17:01
Zicseems so17:01
Zic(need to back to home, will retrieve my backlog from there o/)17:02
ryebotZic: o/17:02
ryebotlazyPower: I need to grab lunch and continue with LF stuff, but I'll try to come back and help in a bit.17:03
lazyPowerty ryebot17:06
ryebotlazyPower Zic: We really need the end of that error line. I thought adding -o cat to the journalctl command would grab it, but it's still cut off17:06
ryebotlazyPower Zic: but the end of that line should have the actual error: https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L30617:07
lazyPowerryebot: journalctl -xn --no-pager17:07
lazyPowershould get you what you're looking for17:07
rick_hreminder Juju Show in 49 minutes! lazyPower hatch jrwren beisner jamespage kwmonroe and anyone else interested17:11
ryebotZic: ^ would be super helpful17:11
ryebotlazyPower: u rok thanks17:11
* rick_h goes to get coffee to prep17:12
beisner_hi rick_h - on deck!17:15
lazyPowerryebot: that was all stack overflow ;)17:16
lazyPoweranastasiamac: super happy to see you've got some joy on https://bugs.launchpad.net/juju/+bug/1627127. \o/17:21
mupBug #1627127: resource-get gets hung on charm store <cdo-qa> <cdo-qa-blocker> <juju:In Progress by anastasia-macmood> <https://launchpad.net/bugs/1627127>17:21
rick_hJuju Show watch it from https://www.youtube.com/watch?v=VDXolq_eGkU and "join the conversation" at https://hangouts.google.com/hangouts/_/ytl/iwKTaiURK50IjCs9jO1d72S5s8ULBmDFiC7D91F9MiQ=?eid=103184405956510785630&hl=en_US17:50
magicaltrouttalking of ecosystem stuff17:54
rick_hwoot woot17:54
rick_hand openstack stuff17:54
magicaltroutThomas from Tengu gave a good talk at Apachecon that included a bunch of juju slides17:54
rick_hbig topic of the day17:54
magicaltroutand a bunch of people i bumped into were discussing snaps as a packaging format for various apache projects17:54
rick_hmagicaltrout: openstack folks doing a lot of snapping as well17:55
rick_hmagicaltrout: seems like a good thing :)17:55
magicaltrouti'm gonna dive into it properly at some point soon17:55
rick_hlet us know if you need any help17:56
bdxrick_h: that link is borked17:56
rick_hbdx: which link?17:56
rick_hhttps://hangouts.google.com/hangouts/_/cfovp34gqrf2vliprctda575c4e bdx17:56
rick_hbdx: others are in on the link not sure what's up17:56
magicaltroutoh i also started charming up the world's fastest analytic database today as well for our data platform, which will be free to use17:58
magicaltroutwhich is good for BI apps on the Juju ecosystem17:58
magicaltroutmarcoceppi: can you go through the Developer credits backlog when you get a spare slot? :)17:59
magicaltroutalso can someone review my gitlab charm18:00
magicaltroutnot cause i'm overly bothered about the promotion, i just want to get a code review to see if I'm following the process for the review queue in general18:00
magicaltrout</adhoc requests>18:01
beisner_#link openstack charm guide:  http://bit.ly/2rUKpnR18:39
=== Guest41638 is now known as zeus
=== mup_ is now known as mup
Budgie^Smoredid someone miss me?19:48
Budgie^Smoreo/ juju world19:49
lutostagdoes the juju model-config no-proxy option support wildcarding or the netmasking?20:46
anastasiamaclazyPower: tyvm!20:49
* lutostag to myself, nope it doesnt juju model-config no-proxy=$(printf '%s,' 10.5.0.{1..255}; echo -n localhost) # from https://unix.stackexchange.com/a/2347820:55
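A seq-based variant of lutostag's one-liner (bash brace expansion like `{1..255}` isn't available in plain sh), shown just to illustrate the expansion:

```shell
# Build the comma-separated no-proxy list 10.5.0.1..10.5.0.255 plus
# localhost; per lutostag, no-proxy takes no wildcards or netmasks.
no_proxy_list="$(printf '10.5.0.%s,' $(seq 1 255); printf 'localhost')"
# then: juju model-config no-proxy="$no_proxy_list"
```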

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!