/srv/irclogs.ubuntu.com/2017/04/19/#juju.txt

kklimondahow do I give someone ssh access to the controller?06:31
kklimondaI've done juju grant -c [controller] [user] superuser06:31
kklimondaand that did *something* but now I can't figure out how to access it06:32
kklimondajuju ssh -m controller 0 says that there is no such model ctrl:[user]/controller06:32
kklimondaI've tried juju ssh -m admin/controller 0 but that also didn't work06:32
=== frankban|afk is now known as frankban
=== caribou_ is now known as caribou
lazyPowerkklimonda: I dont think add-user actually adds the ssh key. There's a juju add-ssh-key command that has to be run in order for that to work. rick_h would know best though13:24
rick_hlazyPower: correct, atm the admin has to add the key for the user13:24
rick_hlazyPower: kklimonda it's a known issue and there's a task for the future to make keys end user manageable13:24
lazyPowerty for the alley oop rick_h13:25
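A minimal sketch of the flow rick_h describes, where the admin adds the user's public key by hand; the controller name ctrl and user alice are hypothetical, and the exact model reference for the controller model can vary by setup:

    # grant controller access (what kklimonda already did above)
    juju grant -c ctrl alice superuser
    # add alice's public key to the controller model so juju ssh works for her
    juju add-ssh-key -m ctrl:admin/controller "$(cat alice_id_rsa.pub)"
    # alice can then reach machine 0 of the controller model
    juju ssh -m ctrl:admin/controller 0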
bdxI'm experiencing some extreme craziness13:42
bdxc4 type instances have some issue with lxd from what I can tell13:43
bdxnot sure if its juju, or lxd or what13:43
bdxthe issue is happening with t2 instances too13:48
ZiclazyPower: hi, long time with no problems but today I have one :) on our test cluster (happily...) upgraded from 1.5.3 to 1.6.1, kube-dns keeps crashing with kubernetes-dashboard, saying that kind of things: http://paste.ubuntu.com/24413931/13:54
Zicjuju status is all green13:54
Zicseems like a problem of endpoints services which does not respond (last part of my pastebin)13:54
Zicas it's a test cluster, I tried to reboot every single machine composing it, with no luck13:55
Zichttp://paste.ubuntu.com/24413949/ <= same kind of message for a kubectl logs on kube-dns container13:57
bdxlxd is failing across the board for me right now .... on aws instances13:58
bdxhttp://paste.ubuntu.com/24413976/14:00
bdx^ is something I've been doing on a daily basis14:00
magicalt1outdoes look pretty broken14:00
bdxI woke up early to test out some newnew, and thats what I get14:01
bdxyeah ... at first I thought it was specific to instance type ... but its happening on all instance types (at least the 5 I've tried)14:01
bdxthen I thought it might be a juju 2.1.2 thing .... as I just created my first model on 2.1.2 .... but I just verified its happening on 2.0.3 models as well14:02
bdx@team, what is going on here?14:03
lazyPowerbdx: we're going to need bare minimum a bug report with a juju-crashdump log (you can report skinny, we dont need the charm artifacts)14:03
bdxlazyPower: is crashdump a plugin?14:04
lazyPowerbdx: snap install juju-crashdump --classic, juju-crashdump -s      should get you moving14:04
bdxnice14:04
bdxthx14:04
=== salmankhan1 is now known as salmankhan
bdxlazyPower: http://paste.ubuntu.com/24414014/14:09
lazyPowerlutostag: ping14:09
lutostagpong14:09
lutostagson of a14:09
lazyPowerlutostag: i think we found a scenario where crashdump is misbehaving because of unstarted units14:09
lutostagbdx: --edge14:10
lazyPowerbdx: to be clear, snap refresh juju-crashdump --edge --classic14:10
lutostag(fixed that bug, need to release it to stable)14:10
lazyPowerlutostag: ty <314:10
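A minimal sketch of the crashdump steps quoted above, combining the install with the refresh to the edge channel that carries lutostag's fix:

    # install the tool (classic confinement)
    sudo snap install juju-crashdump --classic
    # move to the edge channel for the unstarted-units fix
    sudo snap refresh juju-crashdump --edge --classic
    # collect a skinny dump (-s skips the charm artifacts) from the current model
    juju-crashdump -s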
magicalt1outbdx_: 16:10 < lazyPower> bdx: to be clear, snap refresh juju-crashdump --edge --classic14:12
lazyPowerrip14:14
bdx_crashdump now spams me with14:15
bdx_http://paste.ubuntu.com/24414051/14:15
bdx_lol14:15
bdx_oh no14:15
bdx_lazyPower: I appreciate the willingness to help out none the less14:15
lazyPowerbdx_: thats fine14:15
lazyPowerbdx_: the spam is expected14:15
bdx_ok14:15
lazyPowerits doing a lot of subprocess calls, and that cgo bit is golang doing what it does best14:15
bdx_gotcha .. nice14:15
lazyPowerits that or snap, i'm unconvinced on which level is spamming that14:16
lazyPowerbut its known and expected all the same, it takes a bit to grab everything on a large deployment, i hope you passed -s or --skinny so it doesn't spend forever nabbing all the charm source14:16
lazyPowerthe idea behind crashdump is we've professionalized nabbing state and debug/status messaging so we can tease apart the deployment artifacts and find root causes. Feel free to inspect the package and see what we're grabbing14:17
lazyPowerany ideas on improvement are welcome14:17
bdx_oh ...14:18
bdx_ha14:18
=== salmankhan1 is now known as salmankhan
bdx_I shall, thx14:19
Zic(lazyPower: did you see my last messages, or did they scare you so much that I must be cursed? :D)14:21
bdx_lazyPower: these models are on beta controller14:23
lazyPowerZic: totally missed it, whats up?14:23
lazyPowerbdx_: so something went fubar during collection or...?14:23
bdx_lazyPower: do you think there is a possibility that juju-crashdump can't collect the info it needs because my user doesn't have permission?14:23
ZiclazyPower: (repasting my messages & pastes here: http://paste.ubuntu.com/24414104/)14:24
lazyPowerlutostag: have we tested crashdump with jaas?14:24
bdx_no ... its just spamming hard though with "runtime/cgo: pthread_create failed: Resource temporarily unavailable"14:24
lazyPowerbdx_: it takes a while, seriously. its nabbing a ton of data14:24
bdx_ok14:25
lazyPoweron a 4 unit small k8s cluster the collection can take ~ 5 minutes.14:25
Zicto sum up: seems I have a Service/Endpoint problem on my K8s-test cluster upgraded to 1.614:25
lazyPowerbut i didn't pass --skinny.14:25
lazyPowerZic: looking now14:25
Zicthx14:25
lazyPowerZic: check on flannel on the unit running the dashboard, is the flannel.1 interface up?14:26
lazyPowerZic: also, check kube-proxy service is started and not in error14:27
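A sketch of those two checks run from the juju client; the unit number is hypothetical, and the service name differs between a pre-snap install (kube-proxy) and the 1.6 snap-based install (snap.kube-proxy.daemon):

    # is the flannel overlay interface up on the worker?
    juju ssh kubernetes-worker/0 "ip addr show flannel.1"
    # is kube-proxy running? query both unit names, only one will exist
    juju ssh kubernetes-worker/0 "systemctl status kube-proxy snap.kube-proxy.daemon"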
=== salmankhan1 is now known as salmankhan
Zichttp://paste.ubuntu.com/24414128/14:28
ZicFlannel is OK but kube-proxy is crashed14:28
lazyPowerZic: thats why its failing14:28
lazyPowerlets dig into why kube-proxy is dead, anything in the logs?14:28
bdx_lazyPower, lutostag: the last message it gave after 5 mins of spam was "runtime/cgo: need to run as root or suid"14:28
bdx_I'm guessing it needs to be run as root?14:29
bdx_hmmm14:29
ZiclazyPower: I'm running journalctl -u kube-proxy but there's nothing except start/stop/backoff from systemd, do I have better logs somewhere else?14:29
lazyPowerZic: can you just recycle the daemon? does it stick or does it immediately crash?14:29
bdx_alright ... running again as root14:29
lazyPowerbdx_: hang on, you shouldn't need to run it as root14:30
lazyPowerlutostag: ^ wat14:30
ZiclazyPower: http://paste.ubuntu.com/24414138/ <= logs from a fresh restart14:30
Zicerror code 203 :x14:30
lazyPowerCynerva: ryebot  -- post standup, lets dig into this together ^14:31
ryebotlazyPower: +114:31
lazyPowerZic: need you on ice for a bit while we do standup and will return to ask more questions14:31
bdx_heres the bug https://bugs.launchpad.net/juju/+bug/168414314:32
mupBug #1684143: applications deployed to lxd on aws instances failing <juju:New> <https://launchpad.net/bugs/1684143>14:32
bdx_I'll attach crashdump output when I can get it working14:32
ZiclazyPower: no problem, thanks :)14:34
ZiclazyPower: I found this in plain-text syslog: syslog.1:Apr 18 15:20:10 ig1-k8s-04 systemd[1163]: kube-proxy.service: Failed at step EXEC spawning /usr/local/bin/kube-proxy: No such file or directory14:38
lazyPowerZic: oooo snap, that looks like a stale hash. it should be spawning from /snap/bin/kube-proxy14:38
Zicthe log is from yesterday, I'm looking at the systemd .service unit to see if it's really the case14:39
Zichmm, I have similar logs for our restart test earlier14:39
Zichttp://paste.ubuntu.com/24414178/14:40
Zicso the ExecStart is wrong :)14:40
Zic-r--r--r-- 1 root root 425 Feb 16 11:15 /lib/systemd/system/kube-proxy.service14:41
Zicnot touched by the snap upgrade14:41
Zicseems I hit the spot! :D14:42
lazyPowerZic: before you update that hang on14:43
lazyPowerthe snaps have a different system exec scheme, they use bash wrappers and a systemd script that gets installed on snap install.14:43
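A sketch of how to compare the stale package-era unit with the one the snap installs, assuming the paths seen in the pastes above; only disable the old unit once the snap unit is confirmed healthy:

    # the old unit still points at /usr/local/bin/kube-proxy, which no longer exists
    systemctl cat kube-proxy
    # the snap ships its own unit with a bash wrapper
    systemctl status snap.kube-proxy.daemon
    # if the stale unit keeps failing with status=203/EXEC, stop it from restarting
    sudo systemctl disable --now kube-proxy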
Zic(for info kube-proxy is dead on all kubernetes-worker units, I just checked that, not only on kube-dns/kubernetes-dashboard nodes)14:47
lazyPowerZic: systemctl status snap.kube-proxy.daemon14:57
Zichttp://paste.ubuntu.com/24414247/14:58
lazyPowerxref with https://github.com/kubernetes/kubernetes/issues/2600315:00
lazyPowerZic: are you using network policies?15:00
Zicthis test-cluster is not customized at all, the only parameter we changed was docker_from_upstream15:01
lazyPowerhmm15:01
lazyPowerok still in standup, will circle back in a sec15:02
Zic(docker_from_upstream was set to "true" before the upgrade to 1.6)15:02
lazyPowerZic: this is in reference to your workload objects15:04
lazyPowerZic: sudo iptables --list15:05
lazyPowerlets see if it even created the iptables rulechains to do the serviceip forwarding15:06
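A sketch of where those chains actually live; kube-proxy in iptables mode programs the nat table, so a plain filter-table listing can look normal even when service forwarding is broken (chain names are kube-proxy's standard ones, not taken from the paste):

    # the service VIP forwarding rules live in the nat table
    sudo iptables -t nat -L KUBE-SERVICES -n | head -n 20
    # count the per-service chains; zero means kube-proxy never programmed them
    sudo iptables -t nat -S | grep -c KUBE-SVC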
Zicpaste.ubuntu.com/24414272/15:06
Zichttp://paste.ubuntu.com/24414272/15:06
lutostaghmm, bdx, this is with a jaas deployment?15:10
lutostagI'll see if I can get a one-off run to test that real quick15:11
lazyPowerthat seems fine...15:17
* lazyPower ponders15:17
CynervaZic: i just remembered hitting something like this during my upgrade testing. What eventually got me in a working state was to recreate the pods that are failing15:20
ZicCynerva: was my first attempt :)15:21
lazyPowerZic: which templates did you use?15:21
lazyPowerZic: the ones found in /etc/kubernetes?15:21
Zicthe one at ~/cdk15:21
Zicoops15:22
Zicprecisely at ~/snap/cdk-addons/current/addons :)15:22
lazyPowerok15:22
Cynervadang, okay15:23
Zicthrough a kubectl replace -f15:23
Cynervahmm i wonder if that recreates the pods? or just the deployment objects?15:23
lazyPowerit *should* have recreated the pods15:24
lazyPowerin nuke/repave style15:24
lazyPowerit doesn't blue/green unless you specify a rolling update15:24
Cynervaokay15:24
lazyPowerZic: for grins on the worker15:25
lazyPowercan you curl the http endpoint for your kubernetes-apiserver VIP?15:25
lazyPowercurl https://10.152.183.1 15:25
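A slightly expanded version of that probe, assuming 10.152.183.1 is the default kubernetes service VIP seen in the later pastes:

    # -k skips cert verification; a quick "Unauthorized" reply means the VIP answers,
    # a hang until the timeout fires means it does not
    curl -k --max-time 5 https://10.152.183.1/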
Zic(I tried something more aggressive: http://paste.ubuntu.com/24414342/)15:25
Zic(about kubectl replace)15:25
lazyPowerok15:26
Zicdon't know if all these errors are ignorable15:26
lazyPowerso, that tells me any attempt to replace has failed15:26
Zicyup :(15:26
lazyPoweryou'll need to kubectl rm -f15:26
lazyPowerand then reschedule15:26
lazyPowerthis *may* fix the issue15:26
lazyPowerbut i doubt it15:26
Zicah, I didn't try this one, I will immediately15:26
Zickubectl rm seems to not exist (?)15:27
Zicdelete?15:27
lazyPowerya15:27
lazyPowerjust checking if you're awake ;)15:27
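A sketch of the delete-and-recreate lazyPower is suggesting, using the addon templates Zic mentioned under ~/snap/cdk-addons/current/addons; the dashboard manifest can be handled the same way:

    cd ~/snap/cdk-addons/current/addons
    # tear down the kube-dns deployment and recreate it from the shipped template
    kubectl delete -f kubedns-controller.yaml
    kubectl create -f kubedns-controller.yaml
    # watch the pods get rescheduled
    kubectl --namespace=kube-system get pods -o wide -w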
Zic:D15:27
Zichttp://paste.ubuntu.com/24414356/15:28
ZicContainer still creating15:28
ZicI'm waiting a bit15:28
Zichttp://paste.ubuntu.com/24414359/ <= the second line is strange15:29
Zicabout the curl test: root@ig1-k8s-01:~# curl https://10.152.183.1 15:30
ZicUnauthorized15:30
lazyPowerOH15:30
lazyPowerWell thats good!15:30
lazyPowerif your VIP is responding, its not a networking issue15:31
lazyPowerand we expect that since you dont have the tls key on that curl command. had you included the k8s key(s) on that curl request it would 404 (i think) you. As it begins at /api.15:31
Zichttp://paste.ubuntu.com/24414374/ ContainerCreating finished but... it's another bad state now :(15:32
Zicdon't know why it reached an ImagePullBackOff15:32
lazyPowerZic: give it a sec15:33
lazyPowerthat can happen when there's issues hitting the gcr.io registry15:33
lazyPowertemporary networking issue, saturation, noisy neighbors, etc.15:33
Zic  24s  24s  1  kubelet, ig1-k8s-05  spec.containers{influxdb}  Warning  Failed  Failed to pull image "gcr.io/google_containers/heapster-influxdb-amd64:v1.1.1": rpc error: code = 2 desc = Error pulling image (v1.1.1) from gcr.io/google_containers/heapster-influxdb-amd64, Get https://gcr.io/v1/images/55d63942e2eb6a74ea81cbfccd95ef0f44f599a04ed4a46a41dc868a639c847d/ancestry: dial tcp 64.233.166.82:443: i/o timeout15:33
Zicseems like15:33
lazyPoweryeah15:33
Zicoh, except grafana, all pods are now Running15:34
lazyPoweri suspect you're experiencing an outage atm. let me check here15:34
lazyPower\o/15:34
lazyPowernice15:34
lazyPowerso it self resolved15:34
Zickube-system   kube-dns-806549836-w842j                2/3       CrashLoopBackOff   3          6m        10.1.79.7    ig1-k8s-0215:34
* lazyPower chalks it up to internet gremlins15:34
Zickube-system   kubernetes-dashboard-2917854236-qmvn3   0/1       Error              5          6m        10.1.36.7    ig1-k8s-0415:34
Zicspoke too fast :'(15:34
lazyPowerZic: you're playing with my heart man15:34
Zic:'(15:34
lazyPowerok, lets start with dns15:34
Zicit was stuck in CLBO for so long that I was too happy to see a Running state :(15:35
lazyPowerwhats the story with dns clbo?15:35
lazyPowerfailed health check, failed to reach apiserver?15:35
Zic  1m  1m  2  kubelet, ig1-k8s-02  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "kubedns" with CrashLoopBackOff: "Back-off 20s restarting failed container=kubedns pod=kube-dns-806549836-w842j_kube-system(c5838bc9-2514-11e7-b7ef-005056949324)"15:35
Ziclet me do a kubectl logs on it15:35
Zichttp://paste.ubuntu.com/24414398/15:37
Zicgrafana hits Running and stayed in Running. but kube-dns & kubernetes-dashboard are stuck in CLBO now15:38
Zichttp://paste.ubuntu.com/24414411/ <= for dashboard15:38
lazyPowerhmm15:40
lazyPoweri'm uncertain why the dashboard isn't able to reach the VIP15:40
lazyPowerbut i'm still concerned about kube-dns15:40
lazyPowerlooks like the sidecar for dnsmasq metrics is whats causing it to fail15:40
lazyPowerZic: give me a repeat describe for the dns pods now that they are out of errimgpull15:42
ZicI got more info about kubernetes-dashboard through a direct `docker logs` at local worker: Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration) or the --apiserver-host param points to a server that does not exist. Reason: Get https://10.152.183.1:443/version: dial tcp15:42
Zic10.152.183.1:443: i/o timeout15:42
Zicdon't know why it got a timeout if I can curl it...15:43
lazyPowerright15:43
lazyPowerI'm not sure whats fishy there but somethings up15:43
lazyPowerand to make this all the more interesting, our upgrade tests didn't surface this, the addons upgraded without issue15:43
Zichttp://paste.ubuntu.com/24414438/15:44
ZiclazyPower: you know that I'm cursed and love to hit all the bugs that nobody else has :D15:44
lazyPowerso the primary issue here is the kubednsmasq pod is still failing to pull.15:45
CynervaZic: can you paste journalctl logs for snap.kubelet.daemon, snap.kube-proxy.daemon, and flannel?15:46
lazyPowerZic: additionally, on any unit, try this:   docker pull gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.115:47
lazyPowerwell, any worker. the master doesn't have docker so you'll figure out real quick to not do it there.15:47
ZicCynerva: http://paste.ubuntu.com/24414459/15:49
Cynervathanks15:49
ZiclazyPower: http://paste.ubuntu.com/24414470/15:50
* lazyPower blinks15:50
lazyPowerthats *literally* the manual interaction of what that stupid kubelet operation is trying to make happen15:50
* lazyPower flips tables15:50
lazyPowerZic: juju run --application kubernetes-worker "docker pull gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1"15:51
lazyPowerpre-load all the workers with that image. if it resolves itself, again, i dont know why, but gremlins.15:51
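A quick way to confirm the pre-load landed everywhere, reusing the image name from the pull above:

    # list the cached dnsmasq-nanny image on every worker
    juju run --application kubernetes-worker \
        "docker images gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64"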
Zicok, it's loading :)15:52
Cynervanothing interesting in the service logs aside from the stream errors, and those aren't telling us much O.o15:53
lazyPowerCynerva: i see we're missing the conntrack bin. we should probably add that and pack it into kube-proxy15:53
lazyPowerthat'll be needed for large scale deployments so it properly tracks and terminates stale connections. conntrack bits were causing rimas problems before on another distro. I want to learn from that mistake if we can.15:53
ZiclazyPower: http://paste.ubuntu.com/24414492/15:54
Zicproblem on one of the units15:54
Cynervahmm that's weird, kube-proxy is classic confinement15:54
Zicseems very likely that gcr.io has an issue15:54
lazyPowerZic:   UnitId: kubernetes-worker/1 <-- so we need to figure out why that unit is having connectivity issues15:54
Zickubernetes-worker/1       active    idle   10       ig1-k8s-0115:55
Zicit's ig1-k8s-01, I will do a manual check15:55
Zicat least, it can ping 64.233.166.8215:55
Zichttp://paste.ubuntu.com/24414505/15:56
Zicwtf :>15:56
lazyPowerah looks like it might have been 315:56
lazyPoweri misread the yaml15:56
lazyPowerkubernetes-worker/315:56
Zicoops, I did not check too :D15:56
Zicok so it's kubernetes-worker/3*      active    idle   12       ig1-k8s-0315:57
lazyPowerZic: again, just making sure you're awake15:57
Zic:)15:57
Zicpinging is OK, it's pulling now, in progress...15:58
Zichttp://paste.ubuntu.com/24414515/15:58
Zicit stopped15:58
lazyPowerso either there's a network issue on that unit, or gcr.io is having trouble15:58
lazyPoweri wouldn't be surprised of either15:58
lazyPowerif you retry does it succeed or does it keep getting rejected?15:59
Zicig1-k8s-03 has the exact same network configuration as the other 4 kubernetes-worker units (they are all NATed by our hypervisor through the same public IP)15:59
ZicStatus: Downloaded newer image for gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.115:59
Zicit just worked on the second attempt...15:59
Zicsilly gcr.io15:59
lazyPowerZic: did that resolve the deployment?16:01
ZiclazyPower: hmm, saw in describe pod kube-dns that it tries to redownload the docker image16:01
lazyPowernow that the image is cached on all the workers, it shouldn't be complaining about image pull sync16:01
Ziceven if it's already pulled :(16:01
Zic  21s  21s  1  kubelet, ig1-k8s-03  spec.containers{kubedns}  Normal  Pulling  pulling image "gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1"16:01
* lazyPower sighs16:01
lazyPowerit probably has pull: always in the manifest16:01
lazyPowerbecause lets DDOS our registry sounds like a great plan.16:02
Zichttp://paste.ubuntu.com/24414538/16:02
Zichuhu16:02
Zicit's sidecar now16:02
lazyPowerZic: edit the manifest for kube-dns and set the stupid image pull policy from imagePullPolicy: Always  to  imagePullPolicy: IfNotPresent16:03
lazyPowerand reschedule kubedns16:03
lazyPower(delete and recreate)16:03
lazyPowermind you this is all a work-around to whatever networking issue we're seeing16:04
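One way to apply that change without hand-editing the live object, as a sketch; the deployment name, container index, and label selector are assumptions based on the pod names in the pastes:

    # force IfNotPresent on the first container of the kube-dns deployment
    kubectl --namespace=kube-system patch deployment kube-dns --type=json -p \
      '[{"op":"add","path":"/spec/template/spec/containers/0/imagePullPolicy","value":"IfNotPresent"}]'
    # the deployment controller reschedules the pods after the patch
    kubectl --namespace=kube-system get pods -l k8s-app=kube-dns -w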
Zickubedns-cm.yaml          kubedns-controller.yaml  kubedns-sa.yaml          kubedns-svc.yaml16:04
lazyPoweri'm not convinced16:04
Zicat the controller?16:04
lazyPowerkubedns-controller.yaml16:04
Budgie^Smoreo/ juju world :)16:04
lazyPowerya16:04
lazyPowerBudgie^Smore: o/16:04
lazyPowerBudgie^Smore: did you bring your rocket launcher? We're on a bug hunt16:04
ZiclazyPower: in fact there is no imagePullPolicy at all in the controller manifest :D16:05
Zicso it must be the default value, which is... IfNotPresent16:05
ZicI don't understand :D16:05
* lazyPower flips tables16:05
lazyPowerZic: i dont know what to recommend at this point16:05
lazyPoweri've given every thought i can to work around this issue, the crux is the connectivity of grabbing that image for kubedns16:05
Budgie^SmorelazyPower no rocket launcher... pop corn to watch the show though ;-)16:06
lazyPowerand i have no clue why the dashboard pod is unable to contact the VIP if the host machine can contact the VIP16:06
lazyPoweryou did however give us some clues that our removal was not working as expected and have a fix en-route for that16:06
ZiclazyPower: sidecar just finished pulling... but the health check is not good: http://paste.ubuntu.com/24414621/16:07
lazyPowerwell, progress16:07
lazyPowerwhats in the logs for the pod?16:07
lazyPower(s)16:07
lazyPowersame thing where dns cant reach the service vip of kube-apiserver?16:07
Zichttp://paste.ubuntu.com/24414651/16:08
Zicseems like it yup16:08
Zicit times out on the VIP16:08
Ziclike the dashboard :(16:08
Zicreflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Service: Get https://10.152.183.1:443/api/v1/services?resourceVersion=0: dial tcp 10.152.183.1:443: i/o timeout16:08
lazyPoweryah, i see that :(16:09
lazyPowerso we've resolved the other nit-noid issues but the core of why it cant find the vip is still alien to me16:09
lazyPowerif the host can see it, the container should see it16:09
lazyPowerZic: can you fire up an ubuntu pod and attempt the same curl test?16:09
lazyPowerZic: from within the container, via kubectl exec16:09
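A sketch of that in-pod test using busybox so nothing needs to be installed inside the pod (pod name and image are assumptions, not from the log); busybox's wget cannot speak TLS, but a fast error still proves the VIP is reachable, while a timeout matches the symptom in the pastes:

    # schedule a throwaway pod on the cluster
    kubectl run net-test --image=busybox --restart=Never -- sleep 3600
    # try to reach the kubernetes service VIP from inside the pod network
    kubectl exec net-test -- wget -T 5 -qO- http://10.152.183.1:443/
    # clean up
    kubectl delete pod net-test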
Zichttp://paste.ubuntu.com/24414704/ <= re-doing the test, it answered16:10
lazyPowerBudgie^Smore: share that popped corn16:10
ZiclazyPower: yup, trying that16:10
Budgie^SmorelazyPower come and get it :P16:11
lazyPowerOpen source the corn man16:12
lazyPowers/corn man/corn, man/16:12
Zichum16:12
ZiclazyPower: no network inside a container16:12
Zicno network at all16:12
lazyPowerZic: boom16:12
lazyPowerprogress16:12
Ziccan't do an apt install curl :(16:12
lazyPowernow lets figure out why the container has no network16:12
lazyPowerwhats in /etc/default/docker?16:12
Zichttp://paste.ubuntu.com/24414713/16:13
Zicpretty empty16:13
Budgie^SmorelazyPower lol now you going to get me to code a corny corn popper ;-) pun most assuredly intended!16:13
Zichmm16:14
ZiclazyPower: strange things: this new container has no network16:14
Zicbut I tried a kubectl exec at an ingress controller16:14
Zicnetwork is up here16:14
Zicwhy does the ingress-controller have network when the new container doesn't :o16:15
Budgie^Smore(me being lazy) you have tried killing the container and having it start on another node right?16:15
ZiclazyPower: http://paste.ubuntu.com/24414727/16:17
Zicdon't understand this part...16:17
Zicubuntu is running at kubernetes-worker/016:18
Zicthe ingress-controller I used is running at kubernetes-worker/416:18
lazyPowerZic: whats the age on that ingress controller?16:19
lazyPoweris it from pre-upgrade?16:19
Zic1d16:19
Zicso after-upgrade16:19
Zic(was 62days before)16:19
Zicdefault       default-http-backend-35bpm              1/1       Running            1          62d       10.1.80.5    ig1-k8s-0116:19
Budgie^SmoreZic, lazyPower have you checked the IPs and iptables yet? could it be a flannel / overlay network issue?16:19
Zicthis, however, is up since 62d16:19
Zicand is also located at ig1-k8s-0116:20
lazyPowerBudgie^Smore: it could be, however it should fallback to the default docker network driver iirc.16:20
lazyPowerZic: try again but watch the kubelet log16:21
lazyPowersee if anything leaps out at you there16:21
ZiclazyPower: for info, pods in kube-system have no network either16:22
Zictried in the grafana-influxdb pod16:22
Zicno network16:22
Zicseems like just ingress-controllers have network :o16:22
CynervaZic: ingress controllers have hostNetwork: true, so i think they bypass flannel/cni entirely16:23
lazyPowerZic: i'm at an impasse now; we've gotten deeper into the issue, but this seems like yet another symptom, not the root cause16:23
Cynervanot entirely sure how that works, but they're definitely a special case16:24
lazyPowerCynerva: that would be the case; if it specifies host network it doesn't use any of the container networking bits, it's binding on the host's tcp stack.16:24
ZiclazyPower: could it be tied to our use of docker_from_upstream ?16:24
ZicI can switch it to false if you want16:24
lazyPowerZic: quite possible, if you switch it back to archive, do things work?16:24
ZicI will try now16:25
lazyPowerCynerva: ryebot - i dont think we've tested with upstream docker in quite some time... is this true yeah?16:25
CynervalazyPower: yeah, we haven't that i'm aware of16:26
lazyPowerI thought so, i might actually submit a PR this week to remove that option from the k8s charms as its inherited from layer-docker.16:26
lazyPowerif we're not extensively testing it, we shouldn't offer it16:26
ZiclazyPower: we have a serious garbage collection issue in our prod-cluster with the Ubuntu archive version of Docker :(16:27
Zicthat's why we switched to the PPA version16:27
=== frankban is now known as frankban|afk
lazyPowerZic: thats unfortunate if this resolves the issue16:31
Zicyup :( with the docker version from the Ubuntu archive, we got a lot of dockerd processes stuck at garbage collecting16:32
lazyPowerif it doesn't i'm not really sure where to go from here either, as this makes no sense to me that your container network just falls out16:32
Zicswitching docker_from_upstream immediately resolved this issue16:32
lazyPowerseriously?16:32
Zicyup :(16:32
lazyPowerwelp16:32
lazyPowernothing to do here16:32
* lazyPower jetpacks away16:32
Zicit was Kibana containers which crashed Docker garbage collection16:32
lazyPowerhmmm16:33
lazyPowerZic: its 1.11.x coming from archive correct?16:33
ZiclazyPower: careful, I'm saying that about our production cluster; for the test cluster we're debugging, downgrading is in progress16:33
ZiclazyPower: downgrading is finished and... all my pods are Running and have network connectivity16:34
Zic:o16:34
lazyPowerZic: perhaps it was just recycling docker that did it?16:34
Zicto recap what I said: we used docker_from_upstream because we hit a severe garbage collection bug with dockerd in production with heavy, usage-intensive containers like Kibana; with the version from the docker.com PPA it was fixed (in 1.5.3)16:35
Zicbut it seems that this docker.com version of docker breaks networking in 1.616:36
lazyPoweri'm running a deploy with install_from_upstream=true right now16:36
Zic(to be clear, as we mixed our conversation about two different clusters earlier)16:36
lazyPoweryep, i follow you now16:36
Zicfor now, the test-cluster we're debugging here is now fixed16:36
Zicwith docker_from_upstream set to false16:36
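A sketch of the toggle that fixed the test cluster; the option appears as docker_from_upstream here and install_from_upstream elsewhere in the log, so use whichever name your charm revision exposes, and note Zic's remark below that juju did not restart dockerd for him:

    # switch the workers back to the archive build of docker
    # (on older juju 2.0 clients the command was juju set-config)
    juju config kubernetes-worker docker_from_upstream=false
    # restart dockerd on each worker if the charm does not do it for you
    juju run --application kubernetes-worker "sudo systemctl restart docker"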
lazyPowerZic: prior to doing that, did you attempt to restart the docker daemon?16:36
lazyPowerwas that part of your troubleshooting?16:37
Zicit's a bit lame as we are now using docker_from_upstream=true at production :/16:37
ZiclazyPower: after the downgrading, yeah, I restarted docker16:37
lazyPowerZic: i meant before16:37
Zicah, yeah, rebooted the whole cluster too16:37
lazyPowerZic: well, i just deployed and upgraded16:37
lazyPowerso far so good16:37
lazyPowerto be clear - deployed with docker from archive, enabled install_from_upstream, things are still running16:38
Zicdid you enable docker_from_upstream at 1.5.3, then upgrade to 1.6 ? :D16:38
Zicwas the correct path16:38
Zicdon't know if that plays a part in it16:38
ZiclazyPower: the *exact* path was: switch docker_from_upstream=true, watch juju status and when it's finished, restart docker on every kubernetes-worker unit (as the juju scenario doesn't handle this part) <some days passed> -> upgrade to 1.6 with the Ubuntu Insights tutorial -> CLBO on kube-dns+kubernetes-dashboard after the upgrade / no network in containers16:41
lazyPowerZic: running another deploy through the upgrade scenario16:45
lazyPowerbut i got networking with upstream docker from a fresh deployment16:45
lazyPowerso, murky water here...16:45
lazyPowerZic: looks like Cynerva may have confirmed the behavior16:57
lazyPowerstill debugging but yeah, we're close to identifying the symptom16:57
ZiclazyPower: great! I'm leaving my office to head home, I will read my backlog later if you find something else :)17:04
bdx_lazyPower: the issue seems to be with us-east-1a  ..... the only way I can get an instance to deploy to us-east-1a is by spaces constraint, where the subnet in the space is in 1a17:04
bdx_otherwise, `juju deploy ubuntu -n10` will not deploy anything to 1a17:05
bdx_its the instances that I get into 1a with the spaces constraint that exhibit the issue of failing lxd17:05
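A sketch of the constrained deploy bdx describes, with a hypothetical juju space bound to the us-east-1a subnet:

    # place units only on machines attached to the given network space
    juju deploy ubuntu -n 2 --constraints "spaces=useast1a"
    # check which subnet/AZ each machine actually landed in
    juju show-machine 0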
lazyPowerZic: the only thing to note here is that the upstream version of docker (1.28 API) is well beyond what's been tested upstream. The 1.6 release notes say: "Drop the support for docker 1.9.x. Docker versions 1.10.3, 1.11.2, 1.12.6 have been validated." Anything outside of that is likely to have gremlins, as we're finding.17:16
dockererHi19:41
=== frankban|afk is now known as frankban
=== jasondotstar_ is now known as jasondotstar
