[06:31] <kklimonda> how do I give someone ssh access to the controller?
[06:31] <kklimonda> I've done juju grant -c [controller] [user] superuser
[06:32] <kklimonda> and that did *something* but now I can't figure out how to access it
[06:32] <kklimonda> juju ssh -m controller 0 says that there is no such model ctrl:[user]/controller
[06:32] <kklimonda> I've tried juju ssh -m admin/controller 0 but that also didn't work
[13:24] <lazyPower> kklimonda: I dont think add-user actually adds the ssh key. There's a juju add-ssh-key command that has to be run in order for that to work. rick_h would know best though
[13:24] <rick_h> lazyPower: correct, atm the admin has to add the key for the user
[13:24] <rick_h> lazyPower: kklimonda it's a known issue and there's a task for the future to make keys end user manageable
[13:25] <lazyPower> ty for the alley oop rick_h
[13:42] <bdx> I'm experiencing some extreme craziness
[13:43] <bdx> c4 type instances have some issue with lxd from what I can tell
[13:43] <bdx> not sure if its juju, or lxd or what
[13:48] <bdx> the issue is happening with t2 instances too
[13:54] <Zic> lazyPower: hi, long time with no problems but today I have one :) on our test cluster (happily...) upgraded from 1.5.3 to 1.6.1, kube-dns keeps crashing with kubernetes-dashboard, saying that kind of things: http://paste.ubuntu.com/24413931/
[13:54] <Zic> juju status is all green
[13:54] <Zic> seems like a problem of endpoints services which does not respond (last part of my pastebin)
[13:55] <Zic> as it's a test cluster, I tried to reboot every single machines composing it, with no more luck
[13:57] <Zic> http://paste.ubuntu.com/24413949/ <= same kind of message for a kubectl logs on kube-dns container
[13:58] <bdx> lxd is failing across the board for me right now .... on aws instances
[14:00] <bdx> http://paste.ubuntu.com/24413976/
[14:00] <bdx> ^ is something I've been doing on a daily basis
[14:00] <magicalt1out> does look pretty broken
[14:01] <bdx> I woke up early to test out some newnew, and thats what I get
[14:01] <bdx> yeah ... at first I thought it was specific to instance type ... but its happening on all instance types (at least the 5 I've tried)
[14:02] <bdx> then I thought it might be a juju 2.1.2 thing .... as I just created my first model on 2.1.2 .... but I just verified its happening on 2.0.3 models as well
[14:03] <bdx> @team, what is going on here?
[14:03] <lazyPower> bdx: we're going to need bare minimum a bug report with a juju-crashdump log (you can report skinny, we dont need the charm artifacts)
[14:04] <bdx> lazyPower: is crashdump a plugin?
[14:04] <lazyPower> bdx: snap install juju-crashdump --classic, juju-crashdump -s      should get you moving
[14:04] <bdx> nice
[14:04] <bdx> thx
[14:09] <bdx> lazyPower: http://paste.ubuntu.com/24414014/
[14:09] <lazyPower> lutostag: ping
[14:09] <lutostag> pong
[14:09] <lutostag> son of a
[14:09] <lazyPower> lutostag: i think we found a scenario where crashdump is misbehaving because of unstarted units
[14:10] <lutostag> bdx: --edge
[14:10] <lazyPower> bdx: to be clear, snap refresh juju-crashdump --edge --classic
[14:10] <lutostag> (fixed that bug, need to release it to stable)
[14:10] <lazyPower> lutostag: ty <3
[14:12] <magicalt1out> bdx_: 16:10 < lazyPower> bdx: to be clear, snap refresh juju-crashdump --edge --classic
[14:14] <lazyPower> rip
[14:15] <bdx_> crashdump now spams me with
[14:15] <bdx_> http://paste.ubuntu.com/24414051/
[14:15] <bdx_> lol
[14:15] <bdx_> oh no
[14:15] <bdx_> lazyPower: I appreciate the willingness to help out none the less
[14:15] <lazyPower> bdx_: thats fine
[14:15] <lazyPower> bdx_: the spam is expected
[14:15] <bdx_> ok
[14:15] <lazyPower> its doing a lot of subprocess calls, and that cgo bit is golang doing what it does best
[14:15] <bdx_> gotcha .. nice
[14:16] <lazyPower> its that or snap, i'm unconvinced on which level is spamming that
[14:16] <lazyPower> but its known and expected all the same, it takes a bit to grab everything on a large deployment, i hope you passed -s or --skinny so it doesn't spend forever nabbing all the charm source
[14:17] <lazyPower> the idea behind crashdump is we've professionalized nabbing state and debug/status messaging so we can tease apart the deployment artifacts and find root causes. Feel free to inspect the package and see what we're grabbing
[14:17] <lazyPower> any ideas on improvement are welcome
[14:18] <bdx_> oh ...
[14:18] <bdx_> ha
[14:19] <bdx_> I shall, thx
[14:21] <Zic> (lazyPower: did you see my last messages, or did they scare you so much that I must have been cursed? :D)
[14:23] <bdx_> lazyPower: these models are on beta controller
[14:23] <lazyPower> Zic: totally missed it, whats up?
[14:23] <lazyPower> bdx_: so something went fubar during collection or...?
[14:23] <bdx_> lazyPower: do you think there is a possibility that juju-crashdump can't collect the info it needs because my user doesn't have permission?
[14:24] <Zic> lazyPower: (repasting my messages & pastes here: http://paste.ubuntu.com/24414104/)
[14:24] <lazyPower> lutostag: have we tested crashdump with jaas?
[14:24] <bdx_> no ... its just spamming hard though with "runtime/cgo: pthread_create failed: Resource temporarily unavailable"
[14:24] <lazyPower> bdx_: it takes a while, seriously. its nabbing a ton of data
[14:25] <bdx_> ok
[14:25] <lazyPower> on a 4 unit small k8s cluster the collection can take ~ 5 minutes.
[14:25] <Zic> to sum up: seems I have an Service/Endpoint problem on my K8s-test cluster upgraded to 1.6
[14:25] <lazyPower> but i didn't pass --skinny.
[14:25] <lazyPower> Zic: looking now
[14:25] <Zic> thx
[14:26] <lazyPower> Zic: check on flannel on the unit running the dashboard, is the flannel.1 interface up?
[14:27] <lazyPower> Zic: also, check kube-proxy service is started and not in error
[14:28] <Zic> http://paste.ubuntu.com/24414128/
[14:28] <Zic> Flannel is OK but kube-proxy is crashed
[14:28] <lazyPower> Zic: thats why its failing
[14:28] <lazyPower> lets dig into why kube-proxy is dead, anything in the logs?
[14:28] <bdx_> lazyPower, lutostag: the last message it gave after 5 mins of spam was "runtime/cgo: need to run as root or suid"
[14:29] <bdx_> I'm guessing it needs to be ran as root?
[14:29] <bdx_> hmmm
[14:29] <Zic> lazyPower: I'm running journalctl -u kube-proxy but nothing except start/stop/backoff of systemd, do I have a better logs somewhere else?
[14:29] <lazyPower> Zic: can you just recycle the daemon? does it stick or does it immediately crash?
[14:29] <bdx_> alright ... running again as root
[14:30] <lazyPower> bdx_: hang on, you shouldn't need to run it as root
[14:30] <lazyPower> lutostag: ^ wat
[14:30] <Zic> lazyPower: http://paste.ubuntu.com/24414138/ <= logs from a fresh restart
[14:30] <Zic> error code 203 :x
[14:31] <lazyPower> Cynerva: ryebot  -- post standup, lets dig into this together ^
[14:31] <ryebot> lazyPower: +1
[14:31] <lazyPower> Zic: need you on ice for a bit while we do standup and will return to ask more questions
[14:32] <bdx_> heres the bug https://bugs.launchpad.net/juju/+bug/1684143
[14:32] <mup> Bug #1684143: applications deployed to lxd on aws instances failing <juju:New> <https://launchpad.net/bugs/1684143>
[14:32] <bdx_> I'll attach crashdump output when I can get it working
[14:34] <Zic> lazyPower: no problem, thanks :)
[14:38] <Zic> lazyPower: I found this in plain-text syslog: syslog.1:Apr 18 15:20:10 ig1-k8s-04 systemd[1163]: kube-proxy.service: Failed at step EXEC spawning /usr/local/bin/kube-proxy: No such file or directory
[14:38] <lazyPower> Zic: oooo snap, that looks like a stale hash. it should be spawning from /snap/bin/kube-proxy
[14:39] <Zic> the log is from yesterday, I'm checking the systemd .service unit to see if that's really the case
[14:39] <Zic> hmm, I have similar logs for our restart test earlier
[14:40] <Zic> http://paste.ubuntu.com/24414178/
[14:40] <Zic> the ExecStart is wrong so :)
[14:41] <Zic> -r--r--r-- 1 root root 425 Feb 16 11:15 /lib/systemd/system/kube-proxy.service
[14:41] <Zic> not touched by the snap upgrade
[14:42] <Zic> seems I hit the spot! :D
[14:43] <lazyPower> Zic: before you update that hang on
[14:43] <lazyPower> the snaps have a different system exec scheme, they use bash wrappers and a systemd script that gets installed on snap install.
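A minimal sketch of the check Zic did by hand above: read a systemd unit file and report any `ExecStart=` binary that no longer exists on disk, which is what a leftover pre-snap unit pointing at /usr/local/bin/kube-proxy looks like. The function names are hypothetical, and the parsing only handles simple single-line `ExecStart=` entries.

```python
import os
import shlex


def exec_start_binaries(unit_text):
    """Return the executable path from each ExecStart= line of a unit file."""
    paths = []
    for line in unit_text.splitlines():
        line = line.strip()
        if not line.startswith("ExecStart="):
            continue
        # Drop systemd's optional prefix characters such as "-" or "@".
        command = line[len("ExecStart="):].lstrip("-@+!:")
        parts = shlex.split(command)
        if parts:  # an empty "ExecStart=" only resets the list in systemd
            paths.append(parts[0])
    return paths


def missing_binaries(unit_text, exists=os.path.exists):
    """List ExecStart binaries that are absent; `exists` is injectable for tests."""
    return [path for path in exec_start_binaries(unit_text) if not exists(path)]


if __name__ == "__main__":
    # Path taken from the log above; adjust for the unit you want to inspect.
    unit_path = "/lib/systemd/system/kube-proxy.service"
    if os.path.exists(unit_path):
        with open(unit_path) as handle:
            print(missing_binaries(handle.read()))
```

A non-empty result for the old unit would match the "Failed at step EXEC ... No such file or directory" syslog line, while the snap-managed unit (snap.kube-proxy.daemon) is the one expected to be in use after the upgrade.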
[14:47] <Zic> (for info kube-proxy is dead on all kubernetes-worker units, I just checked that, not only on kube-dns/kubernetes-dashboard nodes)
[14:57] <lazyPower> Zic: systemctl status snap.kube-proxy.daemon
[14:58] <Zic> http://paste.ubuntu.com/24414247/
[15:00] <lazyPower> xref with https://github.com/kubernetes/kubernetes/issues/26003
[15:00] <lazyPower> Zic: are you using network policies?
[15:01] <Zic> this test-cluster is not customized at all, the only parameter we change was docker_from_upstream
[15:01] <Zic> changed*
[15:01] <lazyPower> hmm
[15:02] <lazyPower> ok still in standup, will circle back in a sec
[15:02] <Zic> (docker_from_upstream was set to "true" before the upgrade to 1.6)
[15:04] <lazyPower> Zic: this is in reference to your workload objects
[15:05] <lazyPower> Zic: sudo iptables --list
[15:06] <lazyPower> lets see if it even created the iptables rulechains to do the serviceip forwarding
[15:06] <Zic> paste.ubuntu.com/24414272/
[15:06] <Zic> http://paste.ubuntu.com/24414272/
[15:10] <lutostag> hmm, bdx, this is with a jaas deployment?
[15:11] <lutostag> I'll see if I can get a one-off run to test that real quick
[15:17] <lazyPower> that seems fine...
[15:17]  * lazyPower ponders
[15:20] <Cynerva> Zic: i just remembered hitting something like this during my upgrade testing. What eventually got me in a working state was to recreate the pods that are failing
[15:21] <Zic> Cynerva: was my first attempt :)
[15:21] <lazyPower> Zic: which templates did you use?
[15:21] <lazyPower> Zic: the ones found in /etc/kubernetes?
[15:21] <Zic> the one at ~/cdk
[15:22] <Zic> oops
[15:22] <Zic> precisely at ~/snap/cdk-addons/current/addons :)
[15:22] <lazyPower> ok
[15:23] <Cynerva> dang, okay
[15:23] <Zic> through a kubectl replace -f
[15:23] <Cynerva> hmm i wonder if that recreates the pods? or just the deployment objects?
[15:24] <lazyPower> it *should* have recreated the pods
[15:24] <lazyPower> in nuke/repave style
[15:24] <lazyPower> it doesn't blue/green unless you specify a rolling update
[15:24] <Cynerva> okay
[15:25] <lazyPower> Zic: for grins on the worker
[15:25] <lazyPower> can you curl the http endpoint for your kubernetes-apiserver VIP?
[15:25] <lazyPower> curl https://10.152.183.1
[15:25] <Zic> (I tried something more aggressive: http://paste.ubuntu.com/24414342/)
[15:25] <Zic> (about kubectl replace)
[15:26] <lazyPower> ok
[15:26] <Zic> don't know if all this error are ignorable
[15:26] <lazyPower> so, that tells me any attempt to replace has failed
[15:26] <Zic> yup :(
[15:26] <lazyPower> you'll need to kubectl rm -f
[15:26] <lazyPower> and then reschedule
[15:26] <lazyPower> this *may* fix the issue
[15:26] <lazyPower> but i doubt it
[15:26] <Zic> ah, don't try this one, I will immediately
[15:27] <Zic> kubectl rm seems to not exist (?)
[15:27] <Zic> delete?
[15:27] <lazyPower> ya
[15:27] <lazyPower> just checking if you're awake ;)
[15:27] <Zic> :D
[15:28] <Zic> http://paste.ubuntu.com/24414356/
[15:28] <Zic> Container still creating
[15:28] <Zic> I'm waiting a bit
[15:29] <Zic> http://paste.ubuntu.com/24414359/ <= the second line is strange
[15:30] <Zic> about the curl test: root@ig1-k8s-01:~# curl https://10.152.183.1
[15:30] <Zic> Unauthorized
[15:30] <lazyPower> OH
[15:30] <lazyPower> Well thats good!
[15:31] <lazyPower> if your VIP is responding, its not a networking issue
[15:31] <lazyPower> and we expect that since you dont have the tls key on that curl command. had you included the k8s key(s) on that curl request it would 404 you (i think), as the API begins at /api.
[15:32] <Zic> http://paste.ubuntu.com/24414374/ ContainerCreating finished but... it's another bad state now :(
[15:32] <Zic> don't know why it reached an ImagePullBackOff
[15:33] <lazyPower> Zic: give it a sec
[15:33] <lazyPower> that can happen when there's issues hitting the gcr.io registry
[15:33] <lazyPower> temporary networking issue, saturation, noisy neighbors, etc.
[15:33] <Zic>   24s  24s  1  kubelet, ig1-k8s-05  spec.containers{influxdb}  Warning  Failed  Failed to pull image "gcr.io/google_containers/heapster-influxdb-amd64:v1.1.1": rpc error: code = 2 desc = Error pulling image (v1.1.1) from gcr.io/google_containers/heapster-influxdb-amd64, Get https://gcr.io/v1/images/55d63942e2eb6a74ea81cbfccd95ef0f44f599a04ed4a46a41dc868a639c847d/ancestry: dial tcp 64.233.166.82:443: i/o timeout
[15:33] <Zic> seems like
[15:33] <lazyPower> yeah
[15:34] <Zic> oh, except grafana, all pods are now Running
[15:34] <lazyPower> i suspect you're experiencing an outage atm. let me check here
[15:34] <lazyPower> \o/
[15:34] <lazyPower> nice
[15:34] <lazyPower> so it self resolved
[15:34] <Zic> kube-system   kube-dns-806549836-w842j                2/3       CrashLoopBackOff   3          6m        10.1.79.7    ig1-k8s-02
[15:34]  * lazyPower chalks it up to internet gremlins
[15:34] <Zic> kube-system   kubernetes-dashboard-2917854236-qmvn3   0/1       Error              5          6m        10.1.36.7    ig1-k8s-04
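Rows like the two pasted above can be filtered mechanically. The following is only a sketch, assuming the whitespace-separated column order shown in the log (namespace, name, READY, STATUS, ...) as printed by something like `kubectl get pods --all-namespaces -o wide`; `unhealthy_pods` is a hypothetical helper, not part of kubectl.

```python
def unhealthy_pods(listing):
    """Return (namespace, name, status) for pods that are not fully ready.

    A pod is reported when its STATUS is not "Running" or when the READY
    column (for example "2/3") shows fewer ready containers than expected.
    """
    bad = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        namespace, name, ready, status = fields[:4]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            bad.append((namespace, name, status))
    return bad
```

Note that this simple rule would also flag pods in a status such as Completed, so treat the result as a list of things to look at rather than a list of failures.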
[15:34] <Zic> speaks too fast :'(
[15:34] <lazyPower> Zic: you're playing with my heart man
[15:34] <Zic> :'(
[15:34] <lazyPower> ok, lets start with dns
[15:35] <Zic> was blocked in CLBO for so much time I was too happy to see a Running state :(
[15:35] <lazyPower> whats the story with dns clbo?
[15:35] <lazyPower> failed health check, failed to reach apiserver?
[15:35] <Zic>   1m  1m  2  kubelet, ig1-k8s-02  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "kubedns" with CrashLoopBackOff: "Back-off 20s restarting failed container=kubedns pod=kube-dns-806549836-w842j_kube-system(c5838bc9-2514-11e7-b7ef-005056949324)"
[15:35] <Zic> let me do a kubectl logs on it
[15:37] <Zic> http://paste.ubuntu.com/24414398/
[15:38] <Zic> grafana hits Running and stayed in Running. but kube-dns & kubernetes-dashboard are stuck in CLBO now
[15:38] <Zic> http://paste.ubuntu.com/24414411/ <= for dashboard
[15:40] <lazyPower> hmm
[15:40] <lazyPower> i'm uncertain why the dashboard isn't able to reach the VIP
[15:40] <lazyPower> but i'm still concerned about kube-dns
[15:40] <lazyPower> looks like the sidecar for dnsmasq metrics is whats causing it to fail
[15:42] <lazyPower> Zic: give me a repeat describe for the dns pods now that they are out of errimgpull
[15:42] <Zic> I got more info about kubernetes-dashboard through a direct `docker logs` at local worker: Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration) or the --apiserver-host param points to a server that does not exist. Reason: Get https://10.152.183.1:443/version: dial tcp
[15:42] <Zic> 10.152.183.1:443: i/o timeout
[15:43] <Zic> don't know why it got a timeout if I can curl it...
[15:43] <lazyPower> right
[15:43] <lazyPower> I'm not sure whats fishy there but somethings up
[15:43] <lazyPower> and to make this all the more interesting, our upgrade tests didn't surface this, the addons upgraded without issue
[15:44] <Zic> http://paste.ubuntu.com/24414438/
[15:44] <Zic> lazyPower: you know that I'm cursed and love to hit all the bug that nobody have :D
[15:45] <lazyPower> so the primary issue here is the kubednsmasq pod is still failing to pull.
[15:46] <Cynerva> Zic: can you paste journalctl logs for snap.kubelet.daemon, snap.kube-proxy.daemon, and flannel?
[15:47] <lazyPower> Zic: additionally, on any unit, try this:   docker pull gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1
[15:47] <lazyPower> well, any worker. the master doesn't have docker so you'll figure out real quick to not do it there.
[15:49] <Zic> Cynerva: http://paste.ubuntu.com/24414459/
[15:49] <Cynerva> thanks
[15:50] <Zic> lazyPower: http://paste.ubuntu.com/24414470/
[15:50]  * lazyPower blinks
[15:50] <lazyPower> thats *literally* the manual interaction of what that stupid kubelet operation is trying to make happen
[15:50]  * lazyPower flips tables
[15:51] <lazyPower> Zic: juju run --application kubernetes-worker "docker pull gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1"
[15:51] <lazyPower> pre-load all the workers with that image. if it resolves itself, again, i dont know why, but gremlins.
[15:52] <Zic> ok, it's loading :)
[15:53] <Cynerva> nothing interesting in the service logs aside from the stream errors, and those aren't telling us much O.o
[15:53] <lazyPower> Cynerva: i see we're missing the conntrack bin. we should probably add that and pack it into kube-proxy
[15:53] <lazyPower> that'll be needed for large scale deployments so it properly tracks and terminates stale connections. conntrack bits were causing rimas problems before on another distro. I want to learn from that mistake if we can.
[15:54] <Zic> lazyPower: http://paste.ubuntu.com/24414492/
[15:54] <Zic> problem on one of the units
[15:54] <Cynerva> hmm that's weird, kube-proxy is classic confinement
[15:54] <Zic> seems very likely that gcr.io has an issue
[15:54] <lazyPower> Zic:   UnitId: kubernetes-worker/1 <-- so we need to figure out why that unit is having connectivity issues
[15:55] <Zic> kubernetes-worker/1       active    idle   10       ig1-k8s-01
[15:55] <Zic> it's ig1-k8s-01, I will do a manual check
[15:55] <Zic> at least, it can ping 64.233.166.82
[15:56] <Zic> http://paste.ubuntu.com/24414505/
[15:56] <Zic> wtf :>
[15:56] <lazyPower> ah looks like it might have been 3
[15:56] <lazyPower> i misread the yaml
[15:56] <lazyPower> kubernetes-worker/3
[15:56] <Zic> oops, I did not check too :D
[15:57] <Zic> ok so it's kubernetes-worker/3*      active    idle   12       ig1-k8s-03
[15:57] <lazyPower> Zic: again, just making sure you're awake
[15:57] <Zic> :)
[15:58] <Zic> pinging is OK, it's pulling now, in progress...
[15:58] <Zic> http://paste.ubuntu.com/24414515/
[15:58] <Zic> it stopped
[15:58] <lazyPower> so either there's a network issue on that unit, or gcr.io is having trouble
[15:58] <lazyPower> i wouldn't be surprised of either
[15:59] <lazyPower> if you retry does it succeed or does it keep getting rejected?
[15:59] <Zic> ig1-k8s-03 has the exact same network configuration of other 4 kubernetes-worker units (they are all NATed by our hypervisor through the same public IP)
[15:59] <Zic> Status: Downloaded newer image for gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1
[15:59] <Zic> just work at the second attempt...
[15:59] <Zic> silly gcr.io
[16:01] <lazyPower> Zic: did that resolve the deployment?
[16:01] <Zic> lazyPower: hmm, saw in describe pod kube-dns that it tries to redownload the docker image
[16:01] <lazyPower> now that the image is cached on all the workers, it shouldn't be complaining about image pull sync
[16:01] <Zic> even if it's already pulled :(
[16:01] <Zic>   21s  21s  1  kubelet, ig1-k8s-03  spec.containers{kubedns}  Normal  Pulling  pulling image "gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1"
[16:01]  * lazyPower sighs
[16:01] <lazyPower> it probably has pull: always in the manifest
[16:02] <lazyPower> because lets DDOS our registry sounds like a great plan.
[16:02] <Zic> http://paste.ubuntu.com/24414538/
[16:02] <Zic> huhu
[16:02] <Zic> it's sidecar now
[16:03] <lazyPower> Zic: edit the manifest for kube-dns and set the stupid image pull policy from imagePullPolicy: Always  to    imagePullPolicy: IfNotPresent
[16:03] <lazyPower> and reschedule kubedns
[16:03] <lazyPower> (delete and recreate)
[16:04] <lazyPower> mind you this is all a work-around to whatever networking issue we're seeing
[16:04] <Zic> kubedns-cm.yaml          kubedns-controller.yaml  kubedns-sa.yaml          kubedns-svc.yaml
[16:04] <lazyPower> i'm not convinced
[16:04] <Zic> at the controller?
[16:04] <lazyPower> kubedns-controller.yaml
[16:04] <Budgie^Smore> o/ juju world :)
[16:04] <lazyPower> ya
[16:04] <lazyPower> Budgie^Smore: o/
[16:04] <lazyPower> Budgie^Smore: did you bring your rocket launcher? We're on a bug hunt
[16:05] <Zic> lazyPower: in fact there is 0 ImagePullPolicy at the controller :D
[16:05] <Zic> so it must be the default value, which is... IfNotPresent
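The defaulting Zic refers to can be written down as a small function. This is a sketch of the rule as I understand it from the Kubernetes documentation (an omitted `imagePullPolicy` becomes `Always` when the image tag is `latest` or missing, and `IfNotPresent` otherwise); the helper name is hypothetical and the digest branch is an assumption, so check it against the docs for your Kubernetes version.

```python
def default_image_pull_policy(image):
    """Guess the pull policy Kubernetes applies when none is set in the manifest."""
    # Assumption: an image pinned by digest is treated like a fixed tag.
    if "@" in image:
        return "IfNotPresent"
    # Only look at the last path segment so a registry port such as
    # "localhost:5000/app" is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    _name, separator, tag = last_segment.partition(":")
    if not separator or tag == "latest":
        return "Always"
    return "IfNotPresent"
```

For the image in this log, `default_image_pull_policy("gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1")` gives `IfNotPresent`, which agrees with Zic's reading: the repeated pull attempts are not explained by the pull policy.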
[16:05] <Zic> I don't understand :D
[16:05]  * lazyPower flips tables
[16:05] <lazyPower> Zic: i dont know what to recommend at this point
[16:05] <lazyPower> i've given every thought i can to work around this issue, the crux is the connectivity of grabbing that image for kubedns
[16:06] <Budgie^Smore> lazyPower no rocket launcher... pop corn to watch the show though ;-)
[16:06] <lazyPower> and i have no clue why the dashboard pod is unable to contact the VIP if the host machine can contact the VIP
[16:06] <lazyPower> you did however give us some clues that our removal was not working as expected and have a fix en-route for that
[16:07] <Zic> lazyPower: sidecar just finished to pull... but health check is not good: http://paste.ubuntu.com/24414621/
[16:07] <lazyPower> well, progress
[16:07] <lazyPower> whats in the logs for the pod?
[16:07] <lazyPower> (s)
[16:07] <lazyPower> same thing where dns cant reach the service vip of kube-apiserver?
[16:08] <Zic> http://paste.ubuntu.com/24414651/
[16:08] <Zic> seems like it yup
[16:08] <Zic> it times out on the VIP
[16:08] <Zic> like the dashboard :(
[16:08] <Zic> reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Service: Get https://10.152.183.1:443/api/v1/services?resourceVersion=0: dial tcp 10.152.183.1:443: i/o timeout
[16:09] <lazyPower> yah, i see that :(
[16:09] <lazyPower> so we've resolved the other nit-noid issues but the core of why it cant find the vip is still alien to me
[16:09] <lazyPower> if the host can see it, the container should see it
[16:09] <lazyPower> Zic: can you fire up an ubuntu pod and attempt the same curl test?
[16:09] <lazyPower> Zic: from within the container, via kubectl exec
[16:10] <Zic> http://paste.ubuntu.com/24414704/ <= re-doing the test, it answered
[16:10] <lazyPower> Budgie^Smore: share that popped corn
[16:10] <Zic> lazyPower: yup, trying that
[16:11] <Budgie^Smore> lazyPower come and get it :P
[16:12] <lazyPower> Open source the corn man
[16:12] <lazyPower> s/corn man/corn, man/
[16:12] <Zic> hum
[16:12] <Zic> lazyPower: no network inside a container
[16:12] <Zic> no network at all
[16:12] <lazyPower> Zic: boom
[16:12] <lazyPower> progress
[16:12] <Zic> can't do an apt install curl :(
[16:12] <lazyPower> now lets figure out why the container has no network
[16:12] <lazyPower> whats in /etc/default/docker?
[16:13] <Zic> http://paste.ubuntu.com/24414713/
[16:13] <Zic> pretty empty
[16:13] <Budgie^Smore> lazyPower lol now you going to get me to code a corny corn popper ;-) pun most assuredly intended!
[16:14] <Zic> hmm
[16:14] <Zic> lazyPower: strange things: this new container has no network
[16:14] <Zic> but I tried a kubectl exec at an ingress controller
[16:14] <Zic> network is up ghere
[16:14] <Zic> -g
[16:15] <Zic> why ingress-controller has network and the new container don't :o
[16:15] <Budgie^Smore> (me being lazy) you have tried killing the container and having it start on another node right?
[16:17] <Zic> lazyPower: http://paste.ubuntu.com/24414727/
[16:17] <Zic> don't understand this part...
[16:18] <Zic> ubuntu is running at kubernetes-worker/0
[16:18] <Zic> the ingress-controller I used is running at kubernetes-worker/4
[16:19] <lazyPower> Zic: whats the age on that ingress controller?
[16:19] <lazyPower> is it from pre-upgrade?
[16:19] <Zic> 1d
[16:19] <Zic> so after-upgrade
[16:19] <Zic> (was 62days before)
[16:19] <Zic> default       default-http-backend-35bpm              1/1       Running            1          62d       10.1.80.5    ig1-k8s-01
[16:19] <Budgie^Smore> Zic, lazyPower have you checked the IPs and iptables yet? could it be a flannel / overlay network?
[16:19] <Zic> this, however, is up since 62d
[16:20] <Zic> and is also located at ig1-k8s-01
[16:20] <lazyPower> Budgie^Smore: it could be, however it should fallback to the default docker network driver iirc.
[16:21] <lazyPower> Zic: try again but watch the kubelet log
[16:21] <lazyPower> see if anything leaps out at you there
[16:22] <Zic> lazyPower: for info, pods of kube-system has no network also
[16:22] <Zic> tried in the grafana-influxdb pod
[16:22] <Zic> no network
[16:22] <Zic> seems like just ingress-controllers have network :o
[16:23] <Cynerva> Zic: ingress controllers have hostNetwork: true, so i think they bypass flannel/cni entirely
[16:23] <lazyPower> Zic: i'm at an impasse now, but we've gotten deeper into the issue that seems like yet another symptom, but not the root cause
[16:24] <Cynerva> not entirely sure how that works, but they're definitely a special case
[16:24] <lazyPower> Cynerva: that would be the case, if it specifies host network it doesn't use any of the containerd networking bits. its binding on the host's tcp stack.
[16:24] <Zic> lazyPower: could it be tied to our use of docker_from_upstream ?
[16:24] <Zic> I can switch it to false if you want
[16:24] <lazyPower> Zic: quite possible, if you switch it back to archive, do things work?
[16:25] <Zic> I will try now
[16:25] <lazyPower> Cynerva: ryebot - i dont think we've tested with upstream docker in quite some time... is this true yeah?
[16:26] <Cynerva> lazyPower: yeah, we haven't that i'm aware of
[16:26] <lazyPower> I thought so, i might actually submit a PR this week to remove that option from the k8s charms as its inherited from layer-docker.
[16:26] <lazyPower> if we're not extensively testing it, we shouldn't offer it
[16:27] <Zic> lazyPower: we have a serious garbage collection in our prod-cluster with the Docker version of Ubuntu :(
[16:27] <Zic> it's why we switched to PPA version
[16:31] <lazyPower> Zic: thats unfortunate if this resolves the issue
[16:32] <Zic> yup :( with the docker version at Ubuntu Archive, we got a lot of dockerd stuck at garbage collecting
[16:32] <lazyPower> if it doesn't i'm not really sure where to go from here either, as this makes no sense to me that your container network just falls out
[16:32] <Zic> switching docker_from_upstream resolve immediately this issue
[16:32] <lazyPower> seriously?
[16:32] <Zic> yup :(
[16:32] <lazyPower> welp
[16:32] <lazyPower> nothing to do here
[16:32]  * lazyPower jetpacks away
[16:32] <Zic> it was Kibana containers which crashed Docker garbage collection
[16:33] <lazyPower> hmmm
[16:33] <lazyPower> Zic: its 1.11.x coming from archive correct?
[16:33] <Zic> lazyPower: careful, saying that for our production-cluster, for the test-cluster we're debugging, downgrading is in progress
[16:34] <Zic> lazyPower: downgrading is finished and... all my pods are Running and have network connectivity
[16:34] <Zic> :o
[16:34] <lazyPower> Zic: perhaps it was just recycling docker that did it?
[16:35] <Zic> to recap what I said: we used docker_from_upstream as we hit a severe garbage collection bug with dockerd on production with heavy-usage containers like Kibana; with the version from the docker.com PPA, it was fixed (in 1.5.3)
[16:36] <Zic> but it seems that this docker version of docker.com breaks network in 1.6
[16:36] <lazyPower> i'm running a deploy with install_from_upstream=true right now
[16:36] <Zic> (to be clear, as we mixed our conversation about two different clusters earlier)
[16:36] <lazyPower> yep, i follow you now
[16:36] <Zic> for now, the test-cluster we're debugging here is now fixed
[16:36] <Zic> with docker_from_upstream sets to false
[16:36] <lazyPower> Zic: prior to doing that, did you attempt to restart the docker daemon?
[16:37] <lazyPower> was that part of your troubleshooting?
[16:37] <Zic> it's a bit lame as we are now using docker_from_upstream=true at production :/
[16:37] <Zic> lazyPower: after the downgrading, yeah, I restarted docker
[16:37] <lazyPower> Zic: i meant before
[16:37] <Zic> ah, yeah, rebooted the whole cluster too
[16:37] <lazyPower> Zic: well, i just deployed and upgraded
[16:37] <lazyPower> so far so good
[16:38] <lazyPower> to be clear - deployed with docker from archive, enabled install_from_upstream, things are still running
[16:38] <Zic> did you enable docker_from_upstream at 1.5.3, then upgrade to 1.6 ? :D
[16:38] <Zic> was the correct path
[16:38] <Zic> don't know if that plays a part
[16:41] <Zic> lazyPower: the *exact* path was: switching to docker_from_upstream=true, look at juju status and when it's ended, restart docker on every kubernetes-worker units (as the juju scenario don't handle this part) <some days passed> -> upgrade to 1.6 with the Ubuntu Insights tutorial -> CLBO at kube-dns+kubernetes-dashboard after the upgrade / no network in container
[16:45] <lazyPower> Zic: running another deploy through the upgrade scenario
[16:45] <lazyPower> but i got networking with upstream docker from a fresh deployment
[16:45] <lazyPower> so, murky water here...
[16:57] <lazyPower> Zic: looks like Cynerva may have confirmed the behavior
[16:57] <lazyPower> still debugging but yeah, we're close to identifying the symptom
[17:04] <Zic> lazyPower: great! I'm leaving my office to go back home, I will read my backlog later if you find something else :)
[17:04] <bdx_> lazyPower: the issue seems to be with us-east-1a  ..... the only way I can get an instance to deploy to us-east-1a is by spaces constraint, where the subnet in the space is in 1a
[17:05] <bdx_> otherwise, `juju deploy ubuntu -n10` will not deploy anything to 1a
[17:05] <bdx_> its the instances that I get into 1a with the spaces constraint that exhibit the issue of failing lxd
[17:16] <lazyPower> Zic: the only thing to note here is that that upstream version of docker (1.28 API) is well beyond whats been tested by upstream. From the 1.6 release notes: "Drop the support for docker 1.9.x. Docker versions 1.10.3, 1.11.2, 1.12.6 have been validated." Anything outside of that is likely to have gremlins, as we're finding.
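The release-note wording quoted above can be turned into a quick pre-upgrade check. A minimal sketch, assuming only the three validated versions and the dropped 1.9.x series named in that message; the function name and the three status labels are hypothetical.

```python
# Versions named as validated in the quoted Kubernetes 1.6 release notes.
VALIDATED_DOCKER_VERSIONS = {"1.10.3", "1.11.2", "1.12.6"}


def docker_validation_status(version):
    """Classify a Docker engine version string against the 1.6 release notes."""
    version = version.strip()
    if version in VALIDATED_DOCKER_VERSIONS:
        return "validated"
    if version == "1.9" or version.startswith("1.9."):
        return "dropped"
    # Anything else was simply not listed, which is where gremlins are likely.
    return "untested"
```

Feeding it the engine version reported on each worker (for example from `docker version`) before upgrading would have flagged the upstream package in this log as "untested" rather than "validated".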
[19:41] <dockerer> Hi