[07:20] <kjackal> Good morning Juju world!
[09:47] <bdx> http://imgur.com/a/blvN7
[09:47] <bdx> fml
[10:04] <bdx> MLFQ scheduler with process budgeting for anyone interested, implemented by yours truly - https://gist.github.com/jamesbeedy/f97393235a06f878655c7eeace717500
[10:32] <Zic> hi Juju world, deploying a new CDK cluster (in the latest version this time /!\) and one of my 5 etcd is stuck at "waiting/idle" with message "Waiting for unit to complete registration."
[10:33] <Zic> this is the only etcd unit stuck at that, the 4 others are active/idle
[10:34] <Zic> I took a look at "juju debug-log" but it does not seem to show any particular error
[10:54] <admcleod> Zic: anything anywhere else? e.g. syslog, cdk logs, juju logs on the unit itself?
[11:05] <Zic> admcleod: I just checked /var/log/juju/* and nothing in error, I will check syslog also
[11:05] <Zic> May 24 11:04:24 mth-k8stestetcd-03 /snap/bin/etcd.etcdctl[1584]: cmd.go:114: DEBUG: not restarting into "/snap/core/current/usr/bin/snap" ([VERSION=2.24 2.24]): older than "/usr/bin/snap" (2.25)
[11:05] <Zic> I have many of it
[11:06] <admcleod> Zic: not entirely sure if thats going to be related but perhaps worth updating?
[11:06] <Zic> it was a fresh deployment
[11:08] <admcleod> Zic: hrm ok
[11:10] <admcleod> Zic: well. in any case, 4/5 etcd are ok.. underlying network issue? can 'etcd 5' communicate w/ the others?
[11:10] <kjackal> Zic: I am trying the deployment now
[11:10] <admcleod> kjackal to the rescue
[11:10] <kjackal> not sure admcleod
[11:13] <kjackal> Zic: you deployed the two extra etcds after the initial deployment had finished? Or did you trigger the deployment with 5 units?
[11:13] <Zic> kjackal: directly deploying with 5 etcd yep
[11:13] <Zic> not after
[11:13] <kjackal> ok , Zic, thanks
[11:13] <Zic> we're deploying all our CDK cluster with N+2 redundancy
[11:14] <Zic> so 5 etcd to have at least a quorum of 3
[11:14] <Zic> ask me if you need further action from me to have more debug logs, I don't have that much for now :(
[11:15] <Zic> I checked the status of the etcd service, it's stopped
[11:15] <Zic> and cannot start:
[11:15] <Zic> http://paste.ubuntu.com/24643047/
[11:19] <kjackal> Zic: strange... in any case redeploying to see if I can repro. Will let you know if I need more help
[11:20] <Zic> kjackal: never had this one before, I'm thinking it's just a random bug :(
[11:21] <Zic> all etcd have the same connectivity configuration
[11:21] <Zic> I tried rebooting the machine of the bugged etcd unit, with no new result
[11:22] <Zic> http://paste.ubuntu.com/24643101/
[11:22] <Zic> juju debug-log prints a little more after the reboot
[11:22] <Zic> but nothing new compared to the local /var/log/syslog of the unit machine
[12:16] <Zic> kjackal: can I tell that unit to restart the charm installation from the beginning?
[12:19] <kjackal> Zic: no I do not think so. I couldn't repro this bug. My suggestion is to add another etcd unit so you have 5. Also if you could wait a bit, lazyPower and tvansteenburgh will be up shortly, they might be able to offer their opinion
[12:21] <Zic> ok, I can wait, this is the test cluster instance :)
[12:21] <kjackal> Many thanks Zic
[12:22] <Zic> I fear losing precious debug logs which could interest the CDK team if I go further and drop that silly unit, so I can wait :)
[12:22] <Zic> (I will go to dinner btw)
[12:24] <kjackal> enjoy
[12:45] <lazyPower> o/ morning
[12:45] <tvansteenburgh> Zic: i think the syslog msg is a red herring. might be a charm bug, i'm not sure
[12:45] <tvansteenburgh> found that status msg here https://github.com/juju-solutions/layer-etcd/blob/ca4c14c52822b41113e7df297e99097016494e07/reactive/etcd.py#L318
[12:45] <tvansteenburgh> o/ lazyPower
[12:46] <tvansteenburgh> maybe you can help :)
[12:46]  * lazyPower is reading backscroll
[12:47] <lazyPower> hmm, smells like a race during unit registration
[12:47] <lazyPower> 4/5 came up good, the 5th refuses to start.
[12:53] <Naz> Hi, I want to point to an error in the documentation about Constraints.
[12:53] <Naz> https://jujucharms.com/docs/2.1/charms-constraints says and I quote:
[12:53] <Naz> In the event that a constraint cannot be met, the unit will not be deployed.  Note: Constraints work on an "or better" basis: If you ask for 4 CPUs, you may get 8, but you won't get 2
[12:54] <Naz> This is NOT true: I requested --constraints "cores=4 mem=16G" on my laptop, which physically has cores=2 and mem=8G, and it instantiated a machine with 2 cores and 8G instead of the 4 cores/16G RAM requested as constraints.
[12:55] <BlackDex> Hello there
[12:55] <BlackDex> i have a juju 1.25 env where an subordinate is stuck
[12:55] <BlackDex> it hangs on (stop)
[12:55] <BlackDex> and doesn't get removed
[12:56] <BlackDex> anyone got ideas how to resolve that?
[12:56] <tvansteenburgh> Naz: you are using the lxd provider?
[12:56] <lazyPower> Naz: I don't think constraints like that are honored on the local/lxd provider. I know that behavior is true when using clouds like aws, gce, maas, openstack.
[12:56] <BlackDex> never mind, i restarted the machine-xx jujud and it worked :S
[12:56] <Naz> Tvansteen, Yes, I am working on local cloud LXD
[12:56] <BlackDex> did that before and it didn't work, now it does
[12:57] <Naz> LazyPower, Yup, I am working locally over LXD
[12:57] <lazyPower> BlackDex: it's not immediately obvious, but if a relation fails during the subordinate charm's teardown, it will halt.
[12:58]  * lazyPower steps away to make coffee
[12:59] <Naz> @LazyPower, @TvansteenBurgh, You guessed it right, but I think it's better to mention in the docs that this applies to public clouds but not to local/LXD...
[12:59] <Naz> Is it also the case for locally deployed OpenStack?
[13:02] <lazyPower> Naz: well, that's a slightly different story. LXD is based around density and will do everything it can to colocate your workloads on the requested hardware. The only type of allocation you could do would be to set cgroup limits, so you can over-request a machine as it were.
[13:02] <rick_h> bdx: you get he hangout link ok?
[13:02] <lazyPower> now when you deploy openstack, and you use nova - you'd run into issues because nova wouldn't have enough vCPUs to -dedicate- to that requested workload and I would suspect it stays in pending
[13:03] <lazyPower> but if you used nova-lxd, it would likely be happy trying to cram whatever you throw at it wherever you have nova-lxd.
[13:05] <Naz> @LazyPower, I see, thank you, I have another question, please, how to upgrade the memory when working on localCloud/LXD?
[13:05] <rick_h> bdx: bueller, bueller
[13:06] <Naz> So in other words: if I have a machine instantiated with a 2G RAM constraint, and based on metrics found it's struggling, can I upgrade it to 4G RAM?
[13:08] <lazyPower> Naz: so depending on your cloud provider. I know on GCE you can stop the unit and change the instance type, and once you start it back up, it will inform the controller of its new ip address (i'm unsure if it re-reports its hardware) but thats one way to do it.
[13:08] <lazyPower> another option would be to add a unit with different constraints, and then remove the unit with the lower constraints (not necessarily ideal with stateful workloads)
[13:14] <Naz> @LazyPower, yes, I understand; however, the first option induces an OUTAGE, while the second is seamless from the end-user perspective. Still, I am interested in inspecting the first option further please. What do you mean by stop? do you mean juju remove-unit?
[13:14] <lazyPower> Naz: no not at all. When I say stop I refer to stopping the unit in the cloud control panel. Most clouds require you to have the instance in a stopped/powered-off state in order to change hardware configuration
[13:15] <Naz> or juju remove-machine?
[13:15] <lazyPower> Naz: and ideally, you would be running in a highly available scenario to mitigate any outage.
[13:15] <Naz> I think you meant remove-machine and recreate a new one with higher constraints?
[13:16] <Naz> @LazyPower, Agree with you on HA :)
[13:16] <lazyPower> Naz: The first method would not involve any changes to the model using juju. You would be issuing these commands against your cloud provider to halt the instance and change its hardware profile.
[13:17] <BlackDex> hmm i have a problem with relations
[13:17] <BlackDex> they won't complete
[13:17] <BlackDex> and this causes the services to restart every x seconds (if not less)
[13:21] <BlackDex> the cluster-relation-changed keeps getting called
[13:24] <Naz> @LazyPower, on local LXD Cloud, can I do some orchestration on resources like the ones offered for LXC?
[13:24] <lazyPower> Naz: i'm not sure what you're asking me
[13:24] <lazyPower> BlackDex: what charm is this?
[13:26] <BlackDex> cinder
[13:27] <Naz> @LazyPower, I want to do the following scenario: start with a machine with limited resources, let's say 1 CPU, then during runtime, based on real-time metrics, increase the CPU to 2. For example, in LXC: lxc config set my-container limits.cpu 1
[13:27] <Naz> @LazyPower, how could I do this in JUJU?
[13:27] <lazyPower> Naz: that's a bit beyond me, I don't know if we offer anything like that. I don't think juju is setting any constraints on the local provider.
[13:29] <lazyPower> Naz: sorry, I really don't know; I'd rather tell you I don't know than misinform you. The best I can say at this time is to try it and inspect the lxc profile post-deployment. If you don't see any resource limits, it's not something we support today, but with a feature request we can look into it.
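For what it's worth, the LXC-side knobs Naz mentions can be applied to a container directly, outside juju (the container name below is illustrative; whether juju preserves such manual edits is exactly the open question, so inspecting the profile afterwards, as lazyPower suggests, is the way to check):

```shell
# Illustrative: cap a container at 2 CPUs and 4GiB of RAM; LXD applies
# these limits live. "juju-machine-3" is a made-up container name.
lxc config set juju-machine-3 limits.cpu 2
lxc config set juju-machine-3 limits.memory 4GiB
```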
[13:30] <Zic> lazyPower: I'm back, if you need further info :)
[13:30] <Zic> (for the 1/5 etcd which stuck at registration)
[13:30] <lazyPower> Zic: just one unit didn't turn up?
[13:30] <Zic> yup
[13:30] <lazyPower> Zic: juju debug-log --replay -i unit-etcd-#  | pastebinit
[13:31] <lazyPower> on the stuck unit
[13:31] <lazyPower> i suspect a race during registration
[13:32] <Naz> @LazyPower, Ok, I understand. Could you please point me to some scenario on how you think I can do orchestration in juju? (Orchestration being: getting some metrics and reacting upon them to answer the demand)
[13:32] <lazyPower> Zic: i didn't get back home until late last night so I haven't had a chance to fetch the resources, but I'll def. dig into that tonight for tomorrow.
[13:32] <Zic> http://paste.ubuntu.com/24644010/ lazyPower
[13:32] <lazyPower> Naz: elastisys has modeled an orchestrator for autoscaling. 1 moment while i grab you the link
[13:33] <lazyPower> Naz: https://jujucharms.com/u/elastisys/charmscaler/
[13:33] <Naz> @LazyPower, Great, I will have a look into that :)
[13:34] <Zic> lazyPower: yup, I'm deploying this new test cluster in 1.7.3 at least to let the customer test his pod in 1.7.3, but I'm not giving up on testing the 1.5.3 -> 1.7.X upgrade; will wait for what you've discovered so far with our specific charm rev
[13:35] <lazyPower> Zic: unit-etcd-2: 10:17:11 INFO unit.etcd/2.juju-log cluster:0: Invoking reactive handler: reactive/etcd.py:279:register_node_with_leader
[13:35] <lazyPower> so it sent registration details to the leader. can you hop over to the leader and run `etcd.etcdctl member list`
[13:36] <Zic> http://paste.ubuntu.com/24644023/ <= lazyPower
[13:37] <lazyPower> weird
[13:37] <lazyPower> member fd9d260cab7a11dc is unreachable: no available published client urls <-- it completed AND joined at one point
[13:37] <lazyPower> if it had not joined it would say (unstarted)
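For reference, the distinction lazyPower is drawing can be checked mechanically: `member list` marks never-registered members as `(unstarted)`, while `cluster-health` reports joined-but-down members as unreachable. A small sketch that pulls the IDs of the latter, using sample output modeled on the line quoted above (the healthy member's ID and URL are made up):

```shell
# Sample `etcdctl cluster-health` output; the unreachable line mirrors the
# one seen in this cluster, the healthy line is illustrative.
health='member aaaa111122223333 is healthy: got healthy result from https://10.0.0.1:2379
member fd9d260cab7a11dc is unreachable: no available published client urls'

# Print the IDs of members that joined the cluster at some point but are now down.
echo "$health" | awk '/is unreachable/ {print $2}'   # → fd9d260cab7a11dc
```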
[13:37] <Zic> hmm, and more weird
[13:37] <Zic> where is 03 and 05 ?
[13:37] <Zic> one of them is missing because of error, ok
[13:37] <Zic> but the other one? :D
[13:38] <Zic> OH
[13:38] <Zic> lazyPower: I missed something, wait
[13:38]  * lazyPower waits
[13:38] <Zic> 2 is missing
[13:38] <Zic> but one is in active/idle
[13:38] <Zic> I did not see the error message so...
[13:38] <Zic> http://paste.ubuntu.com/24644044/
[13:38] <Zic> from the beginning I spoke about etcd/2
[13:39] <Zic> but etcd/4 is also in problem
[13:39] <lazyPower> i would say unit 2 and 4 raced
[13:39] <lazyPower> and are now in a deadlock
[13:39] <lazyPower> you can juju remove-unit on those and re-add them and it should sort itself
[13:39] <Zic> the two I just added to the default charm-bundle?
[13:39] <Zic> (I always scale etcd to 5 instead of 3 in CDK)
[13:40] <Zic> will try that lazyPower
[13:40] <lazyPower> so remove the errored unit, `juju remove-unit etcd/4`  wait for it to complete
[13:40] <lazyPower> if the cluster still reports healthy, `juju remove-unit etcd/2`
[13:40] <lazyPower> if cluster continues to report healthy, then you can juju add-unit etcd -n 2
[13:41] <Zic> on which machines will "juju add-unit etcd -n 2" add them?
[13:41] <Zic> hmm btw: etcd/4                    error     idle       9        mth-k8stestetcd-05.aws-us-east-1    2379/tcp        hook failed: "cluster-relation-broken"
[13:41] <lazyPower> there's a marginal chance that a unit during turn-up will miss another unit's registration request and attempt to register itself; you managed to hit that
[13:42] <lazyPower> it's a known deficiency because the coordination relies on querying the leader for the member list before it attempts registration. it looks for a non-healthy non-ready unit in the member list; if it finds one, it halts. if none is present, it will declare it's registering on the peer interface and attempt self-registration
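The check lazyPower describes could be sketched roughly like this (illustrative only, not the charm's actual code; the sample member list and IDs are made up):

```shell
# Sample `etcdctl member list` output; an "[unstarted]" entry means a member
# was announced to the cluster but has not completed registration yet.
member_list='11aa22bb[unstarted]: peerURLs=https://10.1.0.5:2380
33cc44dd: name=etcd0 peerURLs=https://10.1.0.1:2380 clientURLs=https://10.1.0.1:2379 isLeader=true'

# If any member is still mid-registration, hold off; otherwise proceed.
if echo "$member_list" | grep -q 'unstarted'; then
    echo "halt: another unit is mid-registration"
else
    echo "register: announce on the peer interface and self-register"
fi
```

Two units running this check at the same moment can both see a clean list and both proceed; that is the race that apparently deadlocked etcd/2 and etcd/4 here.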
[13:43] <lazyPower> Zic: that's expected, the unit itself is in a broken state. the leader will deregister the unit from the cluster if it has any details in its registration data store
[13:43] <lazyPower> Zic: juju resolved --no-retry until the unit is gone.
[13:44] <Zic> ok
[13:44] <Zic> it switched to "terminated" and then it's gone :)
[13:46] <Zic> hmm, it also removes the machine from the controller :>
[13:52] <lazyPower> Zic: there's a way to change that, i think it's the provisioner-harvest-mode model-config option
[13:54] <lazyPower> Zic: https://jujucharms.com/docs/2.1/models-config#juju-lifecycle-and-harvesting
[13:55] <Zic> I can just respawn the 2 etcd bugged machine after, not important for this test cluster :)
[13:55] <Zic> do I remove etcd/2 also now?
[13:59] <lazyPower> the ones that are stuck in registration limbo, yeah
[14:04] <BlackDex> lazyPower: The cinder charm keeps running the cluster-relation-changed hook :(
[14:04] <BlackDex> sometimes it restarts haproxy and apache2, and sometimes it tells that it is already running
[14:04] <lazyPower> BlackDex: pop over to #openstack-charms, they have the most experience with those charms as the community maintaining them :)
[14:04] <BlackDex> oke :)
[14:59] <kjackal> Question: Can you get the models name from within a running charm?
[14:59] <kjackal> is there an env variable?
[15:00] <jrwren> not AFAIK, and why would you want to? a charm should definitely not behave differently based on the name of the model in which it is deployed.
[15:01] <kjackal> jrwren: thanks for the quick reply. Yes you are probably right on this
[15:07] <dakj_> Hello guys, I've a question for you: is the landscape-client service deployed on a node actually working? Because I don't know how it works after the deploy. thanks
[15:10] <Zic> lazyPower: redeploying two new etcd, it's always stuck in "waiting/idle Waiting for unit to complete registration." :o
[15:10] <lazyPower> Zic: hmm, that removal should have kicked the one that's stuck
[15:11] <lazyPower> Zic: can you remote into the master and issue a member list to pastebin again?
[15:11] <lazyPower> i'm going to validate an assumption i have of what's blocking the other 2 units
[15:12] <lazyPower> Zic: i'll also be latent, i'm in sig-onprem taking notes.
[15:15] <Zic> http://paste.ubuntu.com/24644681/ <= lazyPower
[15:15] <Zic> no problem, I'm on weekend (yaaaiii \o/) in two hours :)
[15:16] <Zic> so I will also be latent while I head back home
[15:50] <lazyPower> Zic: on the leader `etcd.etcdctl member remove fd9d260cab7a11dc`
[15:50] <lazyPower> should unstick those pending units
[15:51] <lazyPower> looks like the unit that biffed registration didn't actually get removed, which is a whole different issue i'm going to have to look into if i can reproduce it
[16:03] <Zic> lazyPower: did that, got a new strange thing :(
[16:03] <Zic> one of the 2 new is full OK
[16:03] <Zic> etcd/5                    active    idle   14       mth-k8stestetcd-05.aws-us-east-1    2379/tcp        Healthy with 5 known peers
[16:04] <Zic> the other one is not OK :
[16:04] <Zic> etcd/6                    active    idle   13       mth-k8stestetcd-03.aws-us-east-1    2379/tcp        Errored with 0 known peers
[16:04] <Zic> etcdctl cluster-health return all is healthy except this one with:
[16:04] <Zic> member 34f56278a8fdd1cf is unreachable: no available published client urls
[16:04] <Zic> :(
[16:10] <lazyPower> Zic: ok, what happened?
[16:11] <lazyPower> ah, lag on my end, 1 sec
[16:11] <lazyPower> Zic: this is the latest revision of the charm?
[16:15] <Zic> yup
[16:15] <Zic> deployed from the latest bundle-charm
[16:15] <Zic> never upgraded, all fresh
[16:15] <Zic> (from this morning)
[16:15] <Zic> (#38 revision of canonical-kubernetes)
[16:15] <ryebot> Zic: Can we see the journalctl logs of snap.etcd.etcd on mth-k8stestetcd-03.aws-us-east-1 ?
[16:16] <Zic> yup
[16:17] <Zic> http://paste.ubuntu.com/24645361/
[16:18] <ryebot> thanks
[16:18] <ryebot> wow, no errors or anything, just implodes
[16:19] <ryebot> Zic: can you see if any processes are currently listening to port 2379 on that box?
[16:20] <ryebot> `netstat -plant | grep LISTEN | grep 2379`
[16:21] <Zic> an old man told me that "netstat is old, use 'ss' instead", but it's offtopic
[16:21] <Zic> will try that
[16:21] <ryebot> haha sure whatever works :)
[16:22]  * ryebot googles ss in a desperate attempt to recover lost youth.
[16:22] <Zic> (it was a joke at my office about an "old" coworker who always uses "ifconfig" and "netstat")
[16:22] <ryebot> xD
[16:22] <Zic> ryebot: you can use the same parameters as netstat
[16:22] <Zic> so ss -plant will work
[16:22] <Zic> the output may differ a bit
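For completeness, the `ss` spelling of ryebot's port check might look like this (2379 is etcd's client port; whether anything shows up obviously depends on the machine, and `-p` for process names would need root):

```shell
port=2379
# -l listening sockets, -n numeric, -t TCP; column 4 is the local addr:port.
if ss -lnt | awk -v p=":$port" '$4 ~ p"$"' | grep -q .; then
    echo "something is listening on $port"
else
    echo "nothing is listening on $port"
fi
```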
[16:22] <ryebot> ah cool
[16:23] <ryebot> Tried it, got ESTABBED a bunch of times. Not sure how I feel about that.
[16:23] <Zic> in any case: this netstat/ss does not return anything
[16:23] <Zic> :(
[16:23] <ryebot> okay, well that rules port conflicts out
[16:23] <ryebot> hmm let me ruminate on these logs a bit
[16:24] <ryebot> Zic: I'm guessing systemctl restarting etcd results in the same failure after a few moments?
[16:25] <ryebot> Zic: Could I also see the logs from a good etcd?
[16:25] <ryebot> lazyPower: Is there a debug logging mode for etcd?
[16:26] <ryebot> nvm, --debug true seems to do it
[16:27] <ryebot> Zic: can you also edit /etc/systemd/system/snap.etcd.etcd.service to add the --debug true flag?
[16:27] <ryebot> lazyPower: if you come at me with a lmgtfy, well, I deserve it. xD
[16:28] <Zic> ryebot | Zic: I'm guessing systemctl restarting etcd results in the same failure after a few moments? →  yes
[16:28] <ryebot> +1
[16:29] <Zic> ryebot | Zic: Could I also see the logs from a good etcd? → http://paste.ubuntu.com/24645416/
[16:30] <ryebot> thanks
[16:31] <Zic> http://paste.ubuntu.com/24645430/ <= for --debug
[16:31] <ryebot> Zic: fantastic, thanks
[16:33] <ryebot> Zic: There's an error in there I'm trying to get to the bottom of. Give me a little time to research.
[16:34] <Zic> np :)
[16:43] <ryebot> Zic: Can you ls the contents of /var/snap/etcd/common and /var/snap/etcd/current for me?
[16:46] <lazyPower> ryebot: nah :) I'm loving the fact you stepped in to lend a hand <3
[16:46] <Zic> http://paste.ubuntu.com/24645509/
[16:46] <Zic> ryebot: ^
[16:46] <ryebot> thanks
[16:46] <ryebot> lazyPower happy to ;)
[16:47] <ryebot> Zic: Can you share the contents of /var/snap/etcd/common/etcd.conf?
[16:50] <Zic> http://termbin.com/pwyc
[16:51] <ryebot> Zic: thanks
[16:51] <Zic> (root@mth-k8stestetcd-03:~# cat /var/snap/etcd/common/etcd.conf | nc termbin.com 9999 -> it's like pastebinit but without the pastebinit client :>)
[16:51] <Zic> don't know why I didn't use that before
[16:52] <ryebot> cool
[16:53] <ryebot> Zic: Can you get me the journalctl logs of the failing etcd again, but this time use `-o cat` in the journalctl flags?
[16:56] <ryebot> Zic, can you also try replacing the ETCD_INITIAL_CLUSTER line in /var/snap/etcd/common/etcd.conf with the following, and then restart the snap.etcd.etcd service?
[16:56] <ryebot> ETCD_INITIAL_CLUSTER="etcd6=https://mth-k8stestetcd-03.aws-us-east-1:2380,etcd1=https://mth-k8stestetcd-01.aws-us-east-1:2380,etcd0=https://mth-k8stestetcd-02.aws-us-east-1:2380,etcd3=https://mth-k8stestetcd-04.aws-us-east-1:2380"
[16:57] <Zic> http://paste.ubuntu.com/24645558/
[16:57] <ryebot> Zic: thanks
[16:59] <Zic> http://paste.ubuntu.com/24645563/ <= ryebot for the ETCD_INITIAL_CLUSTER
[17:01] <ryebot> Zic: thanks, same error in journalctl?
[17:01] <Zic> http://paste.ubuntu.com/24645573/
[17:01] <Zic> seems so
[17:02] <Zic> (need to back to home, will retrieve my backlog from there o/)
[17:02] <ryebot> Zic: o/
[17:03] <ryebot> lazyPower: I need to grab lunch and continue with LF stuff, but I'll try to come back and help in a bit.
[17:06] <lazyPower> ty ryebot
[17:06] <ryebot> lazyPower Zic: We really need the end of that error line. I thought adding -o cat to the journalctl command would grab it, but it's still cut off
[17:07] <ryebot> lazyPower Zic: but the end of that line should have the actual error: https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L306
[17:07] <lazyPower> ryebot: journalctl -xn --no-pager
[17:07] <lazyPower> should get you what you're looking for
[17:11] <rick_h> reminder Juju Show in 49 minutes! lazyPower hatch jrwren beisner jamespage kwmonroe and anyone else interested
[17:11] <ryebot> Zic: ^ would be super helpful
[17:11] <ryebot> lazyPower: u rok thanks
[17:12]  * rick_h goes to get coffee to prep
[17:15] <beisner_> hi rick_h - on deck!
[17:16] <lazyPower> ryebot: that was all stack overflow ;)
[17:21] <lazyPower> anastasiamac: super happy to see you've got some joy on https://bugs.launchpad.net/juju/+bug/1627127. \o/
[17:21] <mup> Bug #1627127: resource-get gets hung on charm store <cdo-qa> <cdo-qa-blocker> <juju:In Progress by anastasia-macmood> <https://launchpad.net/bugs/1627127>
[17:50] <rick_h> Juju Show watch it from https://www.youtube.com/watch?v=VDXolq_eGkU and "join the conversation" at https://hangouts.google.com/hangouts/_/ytl/iwKTaiURK50IjCs9jO1d72S5s8ULBmDFiC7D91F9MiQ=?eid=103184405956510785630&hl=en_US
[17:54] <magicaltrout> talking of ecosystem stuff
[17:54] <rick_h> woot woot
[17:54] <rick_h> and openstack stuff
[17:54] <magicaltrout> Thomas from Tengu gave a good talk at Apachecon that included a bunch of juju slides
[17:54] <rick_h> big topic of the day
[17:54] <magicaltrout> and a bunch of people i bumped into were discussing snaps as a packaging format for various apache projects
[17:55] <rick_h> magicaltrout: openstack folks doing a lot of snapping as well
[17:55] <rick_h> magicaltrout: seems like a good thing :)
[17:55] <magicaltrout> indeed
[17:55] <magicaltrout> i'm gonna dive into it properly at some point soon
[17:56] <rick_h> let us know if you need any help
[17:56] <bdx> rick_h: that link is borked
[17:56] <bdx> https://hangouts.google.com/hangouts/_/ytl/iwKTaiURK50IjCs9jO1d72S5s8ULBmDFiC7D91F9MiQ=?eid=103184405956510785630&hl=en_US
[17:56] <rick_h> bdx: which link?
[17:56] <rick_h> https://hangouts.google.com/hangouts/_/cfovp34gqrf2vliprctda575c4e bdx
[17:56] <rick_h> bdx: others are in on the link not sure what's up
[17:58] <magicaltrout> oh i also started charming up the world's fastest analytic database today as well for our data platform & it will be free to use
[17:58] <magicaltrout> which is good for BI apps on the Juju ecosystem
[17:59] <magicaltrout> marcoceppi: can you go through the Developer credits backlog when you get a spare slot? :)
[18:00] <magicaltrout> also can someone review my gitlab charm
[18:00] <magicaltrout> not cause i'm overly bothered about the promotion, i just want to get a code review to see if I'm following the process for the review queue in general
[18:01] <magicaltrout> </adhoc requests>
[18:39] <beisner_> #link openstack charm guide:  http://bit.ly/2rUKpnR
[18:44] <rick_h> ty
[19:48] <Budgie^Smore> did someone miss me?
[19:49] <Budgie^Smore> o/ juju world
[20:46] <lutostag> does the juju model-config no-proxy option support wildcards or 10.0.0.0/21 netmask notation?
[20:49] <anastasiamac> lazyPower: tyvm!
[20:55]  * lutostag to myself, nope it doesn't: juju model-config no-proxy=$(printf '%s,' 10.5.0.{1..255}; echo -n localhost) # from https://unix.stackexchange.com/a/23478
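Spelling out lutostag's workaround: since no-proxy only takes literal entries, a range has to be expanded host by host, e.g. with bash brace expansion:

```shell
# Build "10.5.0.1,...,10.5.0.255,localhost": 255 addresses plus localhost,
# comma-separated, suitable for `juju model-config no-proxy=...`.
no_proxy_list=$(printf '%s,' 10.5.0.{1..255}; echo -n localhost)

echo "$no_proxy_list" | cut -d, -f1           # → 10.5.0.1
echo "$no_proxy_list" | tr ',' '\n' | wc -l   # → 256
```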