[06:06] <anrah_> Quick question about the juju instances hostnames
[06:07] <anrah_> Is the format always juju-<someid>-<modelname>-<instance number> ?
[06:08] <anrah_> the thing is that I need the name of the current model to be resolvable within instances, and the agent.conf file shows only the uuid; I can't find any other appropriate way (besides talking to the API) to get the model where the instance is running
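(For reference: a minimal sketch of pulling the model uuid out of agent.conf locally. The `model: model-<uuid>` line format is an assumption that may vary across Juju versions, the file excerpt below is synthetic, and, as noted, mapping the uuid to the model *name* still needs the API.)

```python
# Sketch: extract the model uuid from a Juju agent.conf.
# Assumes a line of the form "model: model-<uuid>" (format may vary
# by Juju version); resolving uuid -> model *name* still needs the API.
import re

# Synthetic excerpt standing in for /var/lib/juju/agents/machine-0/agent.conf
SAMPLE_AGENT_CONF = """\
tag: machine-0
model: model-1b8bf3a2-87f7-4f40-8cdf-9b89b2e3a44c
"""

def model_uuid(agent_conf_text):
    """Return the model uuid embedded in agent.conf, or None."""
    m = re.search(r"^model:\s*model-([0-9a-f-]+)\s*$", agent_conf_text, re.M)
    return m.group(1) if m else None

print(model_uuid(SAMPLE_AGENT_CONF))  # -> 1b8bf3a2-87f7-4f40-8cdf-9b89b2e3a44c
```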
[08:13] <Zic> lazyPower: saw your message from yesterday, ack, so I will need to downgrade docker on our production cluster too, before upgrading to K8s 1.6 :(
[08:13] <Zic> but I'm afraid of the garbage collection problem with Docker that we had with the version from the Ubuntu archive
[08:13] <Zic> the Docker version from docker.com private repository fixed that
[08:14] <Zic> a bug is open at Docker for that, but the error message is so generic that the bug has stayed open since April 2016 with no other news than "We upgraded to the latest version of Docker and that fixed everything!" :'(
[13:16] <iatrou> hi, I am testing cdk using snapd 2.22.6 conjure-up 2.1.5 and the kubernetes-master is stuck for a while in waiting:  http://paste.ubuntu.com/24420144/
[13:16] <stokachu> iatrou: yea ive seen that before, lazyPower ^
[13:19] <lazyPower> iatrou: can you pastebin the output of `kubectl get po --all-namespaces`? You can execute that on the master itself.
[13:20] <iatrou>  here is /var/log/juju/unit-kubernetes-master-0.log http://paste.ubuntu.com/24420169/
[13:20] <iatrou> lazyPower: "The connection to the server localhost:8080 was refused - did you specify the right host or port?"
[13:22] <lazyPower> iatrou: i see that
[13:23] <lazyPower> iatrou: do you get that same error message when attempting to issue the kubectl command on the master?
[13:23] <zeestrat> Does anyone know of a way to use dynamic variables such as env variables when deploying bundles with the native juju deploy in 2.x?
[13:23] <lazyPower> zeestrat: you would need to use something like envtemplate to render the bundle and substitute the variables, then deploy that generated bundle
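(The render-then-deploy approach lazyPower describes can be sketched with nothing more than Python's `string.Template`; envtemplate works along the same lines. The bundle content and variable names below are invented for illustration.)

```python
# Sketch of the substitute-then-deploy flow: render ${VARS} into a bundle
# template, then feed the result to `juju deploy`. Bundle content and
# variable names are hypothetical.
import os
from string import Template

BUNDLE_TEMPLATE = """\
applications:
  myapp:
    charm: cs:xenial/myapp
    num_units: ${NUM_UNITS}
    constraints: mem=${MEM}
"""

def render(template_text, variables=None):
    # safe_substitute leaves unknown ${VARS} untouched instead of raising
    return Template(template_text).safe_substitute(variables or os.environ)

rendered = render(BUNDLE_TEMPLATE, {"NUM_UNITS": "3", "MEM": "4G"})
print(rendered)
# write `rendered` to bundle.yaml, then: juju deploy ./bundle.yaml
```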
[13:25] <iatrou> lazyPower: that was from kubernetes-master, from the conjure-up host, for kubectl.conjure-canonical-kubern-f24 get po --all-namespaces I get 502 Bad Gateway
[13:25] <lazyPower> iatrou: whats the status of the apiserver when you `systemctl status kube-apiserver`?
[13:26] <iatrou> lazyPower: inactive (dead)
[13:27]  * magicalt1out wants to update to the latest CDK but my underlying openstack hardware is so woeful I don't currently dare =/
[13:27] <lazyPower> iatrou: we've found the culprit, can you attempt a restart of that service? `systemctl restart kube-apiserver`  then re-inspect to make sure it's active
[13:28] <stokachu> lazyPower: this problem has happened to me a few times too
[13:29] <lazyPower> stokachu: any indicator or collection of logs so we can scrutinize what actually happened?
[13:29] <lazyPower> i suspect a race condition but i'm not certain of that
[13:29] <stokachu> im running more tests today and will get you those logs
[13:30] <lazyPower> ok, thanks stokachu
[13:31] <iatrou> lazyPower: I must be doing something wrong: on kubernetes-master I get Failed to restart kube-apiserver.service: Unit kube-apiserver.service not found.
[13:32] <lazyPower> oh, wait, i see what i did there. my mistake
[13:32] <lazyPower> i gave you the wrong service name
[13:32] <lazyPower> iatrou: systemctl status snap.kube-apiserver.daemon
[13:33] <lazyPower> i'm going to have to un-learn that muscle memory
[13:34] <magicaltrout> or get a mac with those silly bars, so you can put it on a hot key ;)
[13:34] <lazyPower> ^ that
[13:34] <lazyPower> only, can we get an updated x1 carbon with a silly bar?
[13:34] <Zic> lazyPower: hey, I just upgraded a production-cluster this time and... except that I needed to kubectl delete -f <all_cdk_addons> && kubectl create -f <all_cdk_addons> like earlier, it works well
[13:34] <magicaltrout> hehe
[13:34] <Zic> it's one of my two K8s clusters, the smaller one
[13:35] <lazyPower> Zic: and this is with install_from_upstream=true?
[13:35] <iatrou> lazyPower: OK, still dead, but for the "right" reason this time, the restart "fixed" the waiting on the master, but  not the workers...
[13:35] <Zic> lazyPower: nope, I downgraded just before
[13:35] <lazyPower> iatrou: ok, so i suspect kubelet needs to be restarted on the workers
[13:35] <Zic> I didn't try with it :(
[13:35] <lazyPower> iatrou: it's that or wait for update-status to run and it should attempt to reconverge
[13:35] <lazyPower> Zic: ok, good
[13:35] <lazyPower> Zic: i know the GC issue is a thing for you, but we don't have the bandwidth to enable that level of testing at this time if the upstream project isn't doing it :(
[13:36] <lazyPower> Zic: i suspect those newer dockers will be addressed with 1.7/1.8, however at the release of 1.6 it was only vetted against docker 1.10-1.13
[13:36] <Zic> lazyPower: before the kubectl create/delete of cdk-addons, my kube-dns was stuck at ContainerCreating with a mount option error
[13:36] <Zic> lazyPower: for this smaller customer, he has no Docker garbage collection error
[13:36] <Zic> so I downgraded without any drawbacks
[13:36] <Zic> it's the bigger one where it's a problem
[13:37]  * lazyPower nods
[13:38] <Zic> in practice, the GC issue goes like this: some containers of a pod (the ZooKeeper one) have heavy usage which can keep the kubernetes-worker unit's machine busy at 100% of its capacity
[13:38] <Zic> - docker stuck
[13:38] <Zic> - docker eventually comes back alive, but with orphan container errors and GC looping
[13:39] <Zic> - kubelet cannot schedule any new pod on the node of this docker
[13:39] <Zic> -> solution, restart dockerd
[13:39] <Zic> it happens twice a day
[13:40] <Zic> the 17.04-ce version of Docker is not affected
[13:40] <lazyPower> Zic: catch-22, the 17.04-ce edition of docker breaks networking as we've discovered
[13:40] <Zic> yep :(
[13:40] <Zic> in 1.6.1
[13:41] <Zic> worked well in 1.5.3 :)
[13:41] <Zic> a possible mitigation: I recommended to my customer to set Kubernetes "limits" in his manifests for the ZooKeeper pods
[13:41] <Zic> of CPU/RAM
[13:41] <Zic> to never hit the 100% spot
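(The limits Zic describes would sit in the pod spec of the customer's manifests; a minimal sketch, with made-up image name and sizes:)

```yaml
# Illustrative pod-spec fragment: cap the ZooKeeper container so it can
# never pin the worker at 100% (image name and values are hypothetical).
containers:
- name: zookeeper
  image: zookeeper:3.4
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "2"      # hard CPU cap
      memory: 4Gi   # killed past this, rather than starving the node
```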
[13:43] <lazyPower> Zic: that sounds like the proper way to do this
[13:43] <lazyPower> you might have to scale zookeeper.... but if its related to resource consumption and why it gets stuck GC'ing...
[13:43] <Zic> as the GC issue with dockerd seems to trigger when the kubernetes-worker unit's machine is under heavy load-average
[13:43] <lazyPower> ah
[13:43] <lazyPower> ok
[13:44] <lazyPower> well thats certainly something to explore and try
[13:44] <lazyPower> i would def. give it resource limits and see if that resolves it
[13:44] <lazyPower> Zic: also if you want to file a bug to track this issue with us, i'm sure someone else is bound to hit this and it would be good to have it documented somewhere
[13:48] <Zic> https://github.com/kubernetes/kubernetes/issues/39028 <= it's this kind
[13:49] <Zic> with this kind of error:
[13:49] <Zic> Mar 31 00:08:15 mth-k8s-01 kubelet[21200]: E0331 00:08:15.911907   21200 kubelet.go:1128] Container garbage collection failed: operation timeout: context deadline exceeded
[13:54] <lazyPower> iatrou: any change in status on your workers?
[13:54]  * lazyPower looks @ linked issue from zic
[13:55] <Zic> no clear identification / solution of the problem :s
[13:55] <Zic> the OP seems to have this issue with a heavy Prometheus pod
[13:56] <lazyPower> ok i see the linked issue now as well relating to gc
[13:56] <lazyPower> this is a hairy issue if we're seeing a gc lock from the underlying container daemon. we're running heapster, which should be sending data back to kubelet to do GC at runtime
[13:56] <lazyPower> but if the container process fails to GC, yeah we're kind of dead in the water here too
[13:56]  * lazyPower sighs
[13:56] <lazyPower> if it's not one thing it's another, amirite Zic?
[13:58] <Zic> I needed to Google "amirite" xD </French_guy>
[14:00] <Zic> lazyPower: most of the time, it's Zookeeper pods
[14:00] <Zic> I never saw anything else, except one time, a Kibana one, but that seems to be an isolated case
[14:00] <lazyPower> Zic: fair enough, i'm speaking internet gibberish
[14:01] <Zic> :D
[14:01] <lazyPower> so i don't think it's english guy vs french guy lingo, it's just me being full of trash talk from the internet
[14:01] <Zic> kube-system pods or default pods like the ingress-controller never failed with this GC problem
[14:01] <lazyPower> Zic: that kibana pod gc seems unrelated, kibana is client side. I would expect that from ES though.
[14:01] <lazyPower> like kibana has *some* server side jruby code, but for the most part it's client side
[14:02] <Zic> yep
[14:02] <Zic> I also have ES pods
[14:02] <lazyPower> and those dont give you trouble w/ GC?
[14:02] <Zic> they never provoke this GC issue
[14:02]  * lazyPower does the schenanigans dance
[14:02] <Zic> it always starts with this ZK, and sometimes other pods begin to fail
[14:02] <lazyPower> i suspect its ZK bleeding into other areas because of the GC issue
[14:03] <Zic> (I think it's unrelated: they are just crashing, auto-healing tries to heal them, but cannot, as kubelet cannot reschedule new containers)
[14:03] <lazyPower> yeah
[14:03] <lazyPower> that
[14:03] <lazyPower> thats my bold prediction as well
[14:08] <Zic> in conclusion, with docker_from_upstream=false and for that smaller cluster which doesn't have any GC issue, the only "problem" I got after the upgrade was needing to delete/create all the cdk-addons YAML manifests
[14:08] <Zic> don't know if "all" was really required, it was just kube-dns which was failing
[14:08] <Zic> but just in case, I did it for all cdk-addons
[14:10] <Zic> (I think you already identified this problem yesterday)
[14:11] <iatrou> lazyPower: nope, the workers are still in waiting
[14:13] <Zic> don't know if it's good advice but, when juju status is stuck at something with "waiting", and I see no action at all in "juju debug-log" or in the units concerned by the "waiting", I just try to reboot the juju controller and/or the unit concerned :>
[14:13] <Zic> 90% of the time it works
[14:13] <Zic> the other 10% of the time, I keep this error until lazyPower wakes up :>
[14:13]  * Zic => exit
[14:14] <lazyPower> iatrou: ok, lets remote into one of the workers and see whats happening. do you have a minute to troubleshoot?
[14:17] <iatrou> lazyPower: sure, what am I looking for?
[14:17] <lazyPower> iatrou: systemctl status snap.kubelet.daemon
[14:17] <lazyPower> is it dead? any indicator in the logs of a failure scenario?
[14:19] <iatrou> inactive (dead), snap[1510]: cannot change profile for the next exec call: No such file or directory
[14:21] <iatrou> restart seems to recover it
[14:22] <iatrou> lazyPower: ^^
[14:27] <lazyPower> iatrou: excellent, sorry you ran into this. was this an upgrade or a fresh deploy?
[14:28] <erik_lonroth> Hello! I've spent some time writing a "Hello World" tutorial about getting started with juju development. If you'd like to give me feedback on it, I'd be glad. I intend to expand with more tutorials as I learn from this myself. https://github.com/erik78se/juju/wiki/The-hello-world-charm
[14:28] <lazyPower> erik_lonroth: certainly, thanks for sharing. Another good place to cross post that would be the juju mailing list "juju@lists.ubuntu.com"
[14:28] <erik_lonroth> Oh, thanx for that tip.
[14:29] <iatrou> lazyPower: fresh install
[14:30] <lazyPower> iatrou: ah interesting. What cloud was this deployed to?
[14:31] <iatrou> lazyPower: localhost
[14:33] <lazyPower> ok, thats interesting. There appears to be a race that you hit and i'm not sure why
[14:33] <lazyPower> iatrou: if you could snag the contents of the kube-apiserver logs, i think that's good enough to diagnose what happened
[14:36] <iatrou> lazyPower: from the master? where are they located?
[14:39] <lazyPower> iatrou: from master
[14:39] <lazyPower> iatrou: in standup give me 5
[14:39] <magicaltrout> bet you're not stood up
[14:41] <lazyPower> magicaltrout: i have a convertible desk
[14:42] <lazyPower> i stand for meetings sometimes :)
[14:50] <bdx> jam: ping
[14:59] <lazyPower> iatrou: ok, are you familiar with juju-crashdump?
[14:59] <lazyPower> iatrou: it would probably be easier to just collect the env so i don't have to come back for more logs down the road.
[15:02] <iatrou> lazyPower: apologies otp, getting back to you in a bit
[15:02] <lazyPower> iatrou: no worries, ping when you're available
[15:04] <magicaltrout> otp? off to the pub?
[15:05] <aisrael> on the phone, but the pub sounds good too. ;)
[15:09] <magicaltrout> ah
[15:09] <magicaltrout> yeah the pub is the better one
[15:09] <magicaltrout> or
[15:09] <magicaltrout> otpitp
[15:10] <magicaltrout> is acceptable if you need to combine work and alcohol
[15:28] <Zic> lazyPower: found a minor "bug" -> kube-apiloadbalancer drops my remotely-executed kubectl exec after 90s
[15:29] <Zic> maybe because of the proxy_read_timeout 90;
[15:29] <Zic> at /etc/nginx/sites-enabled/apilb
[15:30] <lazyPower> Zic: you can try increasing that to some absurdly large number like 99999999;
[15:30] <lazyPower> you cannot wholesale disable the read timeout though, that much i'm 90% certain of.
[15:31] <Zic> I switched it to 600s/10min instead
[15:31] <Zic> but I don't know if it will be overwritten at the next Juju check, or just at the next juju upgrade-charm kube-apiloadbalancer (not important)
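(For reference, the change Zic describes amounts to something like this in the nginx site config the charm renders; excerpt reconstructed from the discussion, not copied from an actual deployment:)

```nginx
# /etc/nginx/sites-enabled/apilb (excerpt) -- raise the read timeout so
# long-lived `kubectl exec` sessions aren't dropped after 90s
proxy_read_timeout 600;   # was 90
```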
[15:33] <mbruzek> Zic - Then you can edit the template used to write that file. /var/lib/juju/agents/unit-kubeapi ... find that template file and edit it there
[15:35] <Zic> if it's just at next-upgrade, it's not a very big problem, and actually, next-upgrade is switching to HAProxy right? :)
[15:35] <Zic> if it's actively overwritten, yeah, I will edit the template, thanks for the trick
[15:35] <mbruzek> Not sure when haproxy is coming, we had problems with the tls keys
[15:36] <mbruzek> passing through the traffic as a layer 4 router, the addresses were incorrect
[15:46] <lazyPower> o/ mbruzek
[15:46] <mbruzek> heyo
[15:49] <lazyPower> mbruzek: i saw you on steam last night and got giddy :D
[15:50] <mbruzek> Firewatch is a great linux game
[15:50] <lazyPower> +1 to that
[15:50] <lazyPower> so is *Drumroll* ARK
[15:50] <lazyPower> but i'm clearly biased ;)
[15:50] <mbruzek> Ah ark
[15:50] <lazyPower> i started playing the even more difficult version: Scorched Earth
[15:50] <lazyPower> i need to find some good persian music as a backdrop while playing
[15:51] <lazyPower> or maybe just bolero on a loop
[15:58] <kwmonroe> aisrael: marcoceppi:  will one of you confirm the right benchmark doohickey to import:
[15:58] <kwmonroe> >>> from charmhelpers.contrib.benchmark import Benchmark
[15:58] <kwmonroe> >>> from charms.benchmark import Benchmark
[15:58] <kwmonroe> it's the last one, right?
[15:58] <aisrael> the last one, yeah
[15:58] <kwmonroe> gracias
[15:58] <aisrael> charmhelpers.contrib.benchmark was the initial version, before we spun it out
[15:59] <kwmonroe> ack
[16:25] <Zic> the ambiance of Firewatch is brilliant :>
[16:36] <lazyPower> Zic: +1 to that, Panic! writes great software
[16:36] <lazyPower> a bit pricey, but excellent tools all the same
[17:29] <siva> I am seeing an issue where 'juju status' command just hangs
[17:30] <Guest31335> How do I resolve this issue?
[17:30] <Guest31335> Both the 'juju status' and 'juju list-models' commands just hang
[17:31] <Guest31335> ubuntu@juju-api-client:/var/log$ juju --version 2.0.2-xenial-amd64
[17:31] <Guest31335> How do I resolve this one?