/srv/irclogs.ubuntu.com/2017/04/19/#juju.txt

kklimondahow do I give someone ssh access to the controller?06:31
kklimondaI've done juju grant -c [controller] [user] superuser06:31
kklimondaand that did *something* but now I can't figure out how to access it06:32
kklimondajuju ssh -m controller 0 says that there is no such model ctrl:[user]/controller06:32
kklimondaI've tried juju ssh -m admin/controller 0 but that also didn't work06:32
=== frankban|afk is now known as frankban
=== caribou_ is now known as caribou
lazyPowerkklimonda: I dont think add-user actually adds the ssh key. There's a juju add-ssh-key command that has to be run in order for that to work. rick_h would know best though13:24
rick_hlazyPower: correct, atm the admin has to add the key for the user13:24
rick_hlazyPower: kklimonda it's a known issue and there's a task for the future to make keys end user manageable13:24
lazyPowerty for the alley oop rick_h13:25
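A minimal sketch of the flow rick_h describes, where the admin adds the user's public key by hand; the controller name ctrl and user alice are hypothetical, and the exact model reference for the controller model can vary by setup:

    # grant controller access (what kklimonda already did above)
    juju grant -c ctrl alice superuser
    # add alice's public key to the controller model so juju ssh works for her
    juju add-ssh-key -m ctrl:admin/controller "$(cat alice_id_rsa.pub)"
    # alice can then reach machine 0 of the controller model
    juju ssh -m ctrl:admin/controller 0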
bdxI'm experiencing some extreme craziness13:42
bdxc4 type instances have some issue with lxd from what I can tell13:43
bdxnot sure if its juju, or lxd or what13:43
bdxthe issue is happening with t2 instances too13:48
ZiclazyPower: hi, long time with no problems but today I have one :) on our test cluster (happily...) upgraded from 1.5.3 to 1.6.1, kube-dns keeps crashing with kubernetes-dashboard, saying that kind of things: http://paste.ubuntu.com/24413931/13:54
Zicjuju status is all green13:54
Zicseems like a problem of endpoints services which does not respond (last part of my pastebin)13:54
Zicas it's a test cluster, I tried to reboot every single machine composing it, with no luck13:55
Zichttp://paste.ubuntu.com/24413949/ <= same kind of message for a kubectl logs on kube-dns container13:57
bdxlxd is failing across the board for me right now .... on aws instances13:58
bdxhttp://paste.ubuntu.com/24413976/14:00
bdx^ is something I've been doing on a daily basis14:00
magicalt1outdoes look pretty broken14:00
bdxI woke up early to test out some newnew, and thats what I get14:01
bdxyeah ... at first I thought it was specific to instance type ... but its happening on all instance types (at least the 5 I've tried)14:01
bdxthen I thought it might be a juju 2.1.2 thing .... as I just created my first model on 2.1.2 .... but I just verified its happening on 2.0.3 models as well14:02
bdx@team, what is going on here?14:03
lazyPowerbdx: we're going to need bare minimum a bug report with a juju-crashdump log (you can report skinny, we dont need the charm artifacts)14:03
bdxlazyPower: is crashdump a plugin?14:04
lazyPowerbdx: snap install juju-crashdump --classic, juju-crashdump -s      should get you moving14:04
bdxnice14:04
bdxthx14:04
=== salmankhan1 is now known as salmankhan
bdxlazyPower: http://paste.ubuntu.com/24414014/14:09
lazyPowerlutostag: ping14:09
lutostagpong14:09
lutostagson of a14:09
lazyPowerlutostag: i think we found a scenario where crashdump is misbehaving because of unstarted units14:09
lutostagbdx: --edge14:10
lazyPowerbdx: to be clear, snap refresh juju-crashdump --edge --classic14:10
lutostag(fixed that bug, need to release it to stable)14:10
lazyPowerlutostag: ty <314:10
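A minimal sketch of the crashdump steps quoted above, combining the install with the refresh to the edge channel that carries lutostag's fix:

    # install the tool (classic confinement)
    sudo snap install juju-crashdump --classic
    # move to the edge channel for the unstarted-units fix
    sudo snap refresh juju-crashdump --edge --classic
    # collect a skinny dump (-s skips the charm artifacts) from the current model
    juju-crashdump -s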
magicalt1outbdx_: 16:10 < lazyPower> bdx: to be clear, snap refresh juju-crashdump --edge --classic14:12
lazyPowerrip14:14
bdx_crashdump now spams me with14:15
bdx_http://paste.ubuntu.com/24414051/14:15
bdx_lol14:15
bdx_oh no14:15
bdx_lazyPower: I appreciate the willingness to help out none the less14:15
lazyPowerbdx_: thats fine14:15
lazyPowerbdx_: the spam is expected14:15
bdx_ok14:15
lazyPowerits doing a lot of subprocess calls, and that cgo bit is golang doing what it does best14:15
bdx_gotcha .. nice14:15
lazyPowerits that or snap, i'm unconvinced on which level is spamming that14:16
lazyPowerbut its known and expected all the same, it takes a bit to grab everything on a large deployment, i hope you passed -s or --skinny so it doesn't spend forever nabbing all the charm source14:16
lazyPowerthe idea behind crashdump is we've professionalized nabbing state and debug/status messaging so we can tease apart the deployment artifacts and find root causes. Feel free to inspect the package and see what we're grabbing14:17
lazyPowerany ideas on improvement are welcome14:17
bdx_oh ...14:18
bdx_ha14:18
=== salmankhan1 is now known as salmankhan
bdx_I shall, thx14:19
Zic(lazyPower: did you see my last messages, or did they scare you so much that I must be cursed? :D)14:21
bdx_lazyPower: these models are on beta controller14:23
lazyPowerZic: totally missed it, whats up?14:23
lazyPowerbdx_: so something went fubar during collection or...?14:23
bdx_lazyPower: do you think there is a possibility that juju-crashdump can't collect the info it needs because my user doesn't have permission?14:23
ZiclazyPower: (repasting my messages & pastes here: http://paste.ubuntu.com/24414104/)14:24
lazyPowerlutostag: have we tested crashdump with jaas?14:24
bdx_no ... its just spamming hard though with "runtime/cgo: pthread_create failed: Resource temporarily unavailable"14:24
lazyPowerbdx_: it takes a while, seriously. its nabbing a ton of data14:24
bdx_ok14:25
lazyPoweron a 4 unit small k8s cluster the collection can take ~ 5 minutes.14:25
Zicto sum up: seems I have a Service/Endpoint problem on my K8s-test cluster upgraded to 1.614:25
lazyPowerbut i didn't pass --skinny.14:25
lazyPowerZic: looking now14:25
Zicthx14:25
lazyPowerZic: check on flannel on the unit running the dashboard, is the flannel.1 interface up?14:26
lazyPowerZic: also, check kube-proxy service is started and not in error14:27
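A sketch of those two checks run from the juju client; the unit number is hypothetical, and the service name differs between a pre-snap install (kube-proxy) and the 1.6 snap-based install (snap.kube-proxy.daemon):

    # is the flannel overlay interface up on the worker?
    juju ssh kubernetes-worker/0 "ip addr show flannel.1"
    # is kube-proxy running? query both unit names, only one will exist
    juju ssh kubernetes-worker/0 "systemctl status kube-proxy snap.kube-proxy.daemon"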
=== salmankhan1 is now known as salmankhan
Zichttp://paste.ubuntu.com/24414128/14:28
ZicFlannel is OK but kube-proxy is crashed14:28
lazyPowerZic: thats why its failing14:28
lazyPowerlets dig into why kube-proxy is dead, anything in the logs?14:28
bdx_lazyPower, lutostag: the last message it gave after 5 mins of spam was "runtime/cgo: need to run as root or suid"14:28
bdx_I'm guessing it needs to be run as root?14:29
bdx_hmmm14:29
ZiclazyPower: I'm running journalctl -u kube-proxy but there's nothing except start/stop/backoff from systemd, do I have better logs somewhere else?14:29
lazyPowerZic: can you just recycle the daemon? does it stick or does it immediately crash?14:29
bdx_alright ... running again as root14:29
lazyPowerbdx_: hang on, you shouldn't need to run it as root14:30
lazyPowerlutostag: ^ wat14:30
ZiclazyPower: http://paste.ubuntu.com/24414138/ <= logs from a fresh restart14:30
Zicerror code 203 :x14:30
lazyPowerCynerva: ryebot  -- post standup, lets dig into this together ^14:31
ryebotlazyPower: +114:31
lazyPowerZic: need you on ice for a bit while we do standup and will return to ask more questions14:31
bdx_heres the bug https://bugs.launchpad.net/juju/+bug/168414314:32
mupBug #1684143: applications deployed to lxd on aws instances failing <juju:New> <https://launchpad.net/bugs/1684143>14:32
bdx_I'll attach crashdump output when I can get it working14:32
ZiclazyPower: no problem, thanks :)14:34
ZiclazyPower: I found this in plain-text syslog: syslog.1:Apr 18 15:20:10 ig1-k8s-04 systemd[1163]: kube-proxy.service: Failed at step EXEC spawning /usr/local/bin/kube-proxy: No such file or directory14:38
lazyPowerZic: oooo snap, that looks like a stale hash. it should be spawning from /snap/bin/kube-proxy14:38
Zicthe log is from yesterday, I'm looking at the systemd .service unit to see if it's really the case14:39
Zichmm, I have similar logs for our restart test earlier14:39
Zichttp://paste.ubuntu.com/24414178/14:40
Zicso the ExecStart is wrong :)14:40
Zic-r--r--r-- 1 root root 425 Feb 16 11:15 /lib/systemd/system/kube-proxy.service14:41
Zicnot touched by the snap upgrade14:41
Zicseems I hit the spot! :D14:42
lazyPowerZic: before you update that hang on14:43
lazyPowerthe snaps have a different system exec scheme, they use bash wrappers and a systemd script that gets installed on snap install.14:43
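A sketch of how to compare the stale package-era unit with the one the snap installs, assuming the paths seen in the pastes above; only disable the old unit once the snap unit is confirmed healthy:

    # the old unit still points at /usr/local/bin/kube-proxy, which no longer exists
    systemctl cat kube-proxy
    # the snap ships its own unit with a bash wrapper
    systemctl status snap.kube-proxy.daemon
    # if the stale unit keeps failing with status=203/EXEC, stop it from restarting
    sudo systemctl disable --now kube-proxy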
Zic(for info kube-proxy is dead on all kubernetes-worker units, I just checked that, not only on kube-dns/kubernetes-dashboard nodes)14:47
lazyPowerZic: systemctl status snap.kube-proxy.daemon14:57
Zichttp://paste.ubuntu.com/24414247/14:58
lazyPowerxref with https://github.com/kubernetes/kubernetes/issues/2600315:00
lazyPowerZic: are you using network policies?15:00
Zicthis test-cluster is not customized at all, the only parameter we changed was docker_from_upstream15:01
lazyPowerhmm15:01
lazyPowerok still in standup, will circle back in a sec15:02
Zic(docker_from_upstream was set to "true" before the upgrade to 1.6)15:02
lazyPowerZic: this is in reference to your workload objects15:04
lazyPowerZic: sudo iptables --list15:05
lazyPowerlets see if it even created the iptables rulechains to do the serviceip forwarding15:06
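A sketch of where those chains actually live; kube-proxy in iptables mode programs the nat table, so a plain filter-table listing can look normal even when service forwarding is broken (chain names are kube-proxy's standard ones, not taken from the paste):

    # the service VIP forwarding rules live in the nat table
    sudo iptables -t nat -L KUBE-SERVICES -n | head -n 20
    # count the per-service chains; zero means kube-proxy never programmed them
    sudo iptables -t nat -S | grep -c KUBE-SVC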
Zicpaste.ubuntu.com/24414272/15:06
Zichttp://paste.ubuntu.com/24414272/15:06
lutostaghmm, bdx, this is with a jaas deployment?15:10
lutostagI'll see if I can get a one-off run to test that real quick15:11
lazyPowerthat seems fine...15:17
* lazyPower ponders15:17
CynervaZic: i just remembered hitting something like this during my upgrade testing. What eventually got me in a working state was to recreate the pods that are failing15:20
ZicCynerva: was my first attempt :)15:21
lazyPowerZic: which templates did you use?15:21
lazyPowerZic: the ones found in /etc/kubernetes?15:21
Zicthe one at ~/cdk15:21
Zicoops15:22
Zicprecisely at ~/snap/cdk-addons/current/addons :)15:22
lazyPowerok15:22
Cynervadang, okay15:23
Zicthrough a kubectl replace -f15:23
Cynervahmm i wonder if that recreates the pods? or just the deployment objects?15:23
lazyPowerit *should* have recreated the pods15:24
lazyPowerin nuke/repave style15:24
lazyPowerit doesn't blue/green unless you specify a rolling update15:24
Cynervaokay15:24
lazyPowerZic: for grins on the worker15:25
lazyPowercan you curl the http endpoint for your kubernetes-apiserver VIP?15:25
lazyPowercurl https://10.152.183.1 15:25
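A slightly expanded version of that probe, assuming 10.152.183.1 is the default kubernetes service VIP seen in the later pastes:

    # -k skips cert verification; a quick "Unauthorized" reply means the VIP answers,
    # a hang until the timeout fires means it does not
    curl -k --max-time 5 https://10.152.183.1/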
Zic(I tried something more aggressive: http://paste.ubuntu.com/24414342/)15:25
Zic(about kubectl replace)15:25
lazyPowerok15:26
Zicdon't know if all these errors are ignorable15:26
lazyPowerso, that tells me any attempt to replace has failed15:26
Zicyup :(15:26
lazyPoweryou'll need to kubectl rm -f15:26
lazyPowerand then reschedule15:26
lazyPowerthis *may* fix the issue15:26
lazyPowerbut i doubt it15:26
Zicah, I didn't try this one, I will immediately15:26
Zickubectl rm seems to not exist (?)15:27
Zicdelete?15:27
lazyPowerya15:27
lazyPowerjust checking if you're awake ;)15:27
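A sketch of the delete-and-recreate lazyPower is suggesting, using the addon templates Zic mentioned under ~/snap/cdk-addons/current/addons; the dashboard manifest can be handled the same way:

    cd ~/snap/cdk-addons/current/addons
    # tear down the kube-dns deployment and recreate it from the shipped template
    kubectl delete -f kubedns-controller.yaml
    kubectl create -f kubedns-controller.yaml
    # watch the pods get rescheduled
    kubectl --namespace=kube-system get pods -o wide -w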
Zic:D15:27
Zichttp://paste.ubuntu.com/24414356/15:28
ZicContainer still creating15:28
ZicI'm waiting a bit15:28
Zichttp://paste.ubuntu.com/24414359/ <= the second line is strange15:29
Zicabout the curl test: root@ig1-k8s-01:~# curl https://10.152.183.1 15:30
ZicUnauthorized15:30
lazyPowerOH15:30
lazyPowerWell thats good!15:30
lazyPowerif your VIP is responding, its not a networking issue15:31
lazyPowerand we expect that since you dont have the tls key on that curl command. had you included the k8s key(s) on that curl request it would 404 (i think) you. As it begins at /api.15:31
Zichttp://paste.ubuntu.com/24414374/ ContainerCreating finished but... it's another bad state now :(15:32
Zicdon't know why it reached an ImagePullBackOff15:32
lazyPowerZic: give it a sec15:33
lazyPowerthat can happen when there's issues hitting the gcr.io registry15:33
lazyPowertemporary networking issue, saturation, noisy neighbors, etc.15:33
Zic  24s  24s  1  kubelet, ig1-k8s-05  spec.containers{influxdb}  Warning  Failed  Failed to pull image "gcr.io/google_containers/heapster-influxdb-amd64:v1.1.1": rpc error: code = 2 desc = Error pulling image (v1.1.1) from gcr.io/google_containers/heapster-influxdb-amd64, Get https://gcr.io/v1/images/55d63942e2eb6a74ea81cbfccd95ef0f44f599a04ed4a46a41dc868a639c847d/ancestry: dial tcp 64.233.166.82:443: i/o timeout15:33
Zicseems like15:33
lazyPoweryeah15:33
Zicoh, except grafana, all pods are now Running15:34
lazyPoweri suspect you're experiencing an outage atm. let me check here15:34
lazyPower\o/15:34
lazyPowernice15:34
lazyPowerso it self resolved15:34
Zickube-system   kube-dns-806549836-w842j                2/3       CrashLoopBackOff   3          6m        10.1.79.7    ig1-k8s-0215:34
* lazyPower chalks it up to internet gremlins15:34
Zickube-system   kubernetes-dashboard-2917854236-qmvn3   0/1       Error              5          6m        10.1.36.7    ig1-k8s-0415:34
Zicspoke too fast :'(15:34
lazyPowerZic: you're playing with my heart man15:34
Zic:'(15:34
lazyPowerok, lets start with dns15:34
Zicit was stuck in CLBO for so long that I was too happy to see a Running state :(15:35
lazyPowerwhats the story with dns clbo?15:35
lazyPowerfailed health check, failed to reach apiserver?15:35
Zic  1m  1m  2  kubelet, ig1-k8s-02  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "kubedns" with CrashLoopBackOff: "Back-off 20s restarting failed container=kubedns pod=kube-dns-806549836-w842j_kube-system(c5838bc9-2514-11e7-b7ef-005056949324)"15:35
Ziclet me do a kubectl logs on it15:35
Zichttp://paste.ubuntu.com/24414398/15:37
Zicgrafana hits Running and stayed in Running. but kube-dns & kubernetes-dashboard are stuck in CLBO now15:38
Zichttp://paste.ubuntu.com/24414411/ <= for dashboard15:38
lazyPowerhmm15:40
lazyPoweri'm uncertain why the dashboard isn't able to reach the VIP15:40
lazyPowerbut i'm still concerned about kube-dns15:40
lazyPowerlooks like the sidecar for dnsmasq metrics is whats causing it to fail15:40
lazyPowerZic: give me a repeat describe for the dns pods now that they are out of errimgpull15:42
ZicI got more info about kubernetes-dashboard through a direct `docker logs` at local worker: Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration) or the --apiserver-host param points to a server that does not exist. Reason: Get https://10.152.183.1:443/version: dial tcp15:42
Zic10.152.183.1:443: i/o timeout15:42
Zicdon't know why it got a timeout if I can curl it...15:43
lazyPowerright15:43
lazyPowerI'm not sure whats fishy there but somethings up15:43
lazyPowerand to make this all the more interesting, our upgrade tests didn't surface this, the addons upgraded without issue15:43
Zichttp://paste.ubuntu.com/24414438/15:44
ZiclazyPower: you know that I'm cursed and love to hit all the bugs that nobody else has :D15:44
lazyPowerso the primary issue here is the kubednsmasq pod is still failing to pull.15:45
CynervaZic: can you paste journalctl logs for snap.kubelet.daemon, snap.kube-proxy.daemon, and flannel?15:46
lazyPowerZic: additionally, on any unit, try this:   docker pull gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.115:47
lazyPowerwell, any worker. the master doesn't have docker so you'll figure out real quick to not do it there.15:47
ZicCynerva: http://paste.ubuntu.com/24414459/15:49
Cynervathanks15:49
ZiclazyPower: http://paste.ubuntu.com/24414470/15:50
* lazyPower blinks15:50
lazyPowerthats *literally* the manual interaction of what that stupid kubelet operation is trying to make happen15:50
* lazyPower flips tables15:50
lazyPowerZic: juju run --application kubernetes-worker "docker pull gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1"15:51
lazyPowerpre-load all the workers with that image. if it resolves itself, again, i dont know why, but gremlins.15:51
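A quick way to confirm the pre-load landed everywhere, reusing the image name from the pull above:

    # list the cached dnsmasq-nanny image on every worker
    juju run --application kubernetes-worker \
        "docker images gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64"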
Zicok, it's loading :)15:52
Cynervanothing interesting in the service logs aside from the stream errors, and those aren't telling us much O.o15:53
lazyPowerCynerva: i see we're missing the conntrack bin. we should probably add that and pack it into kube-proxy15:53
lazyPowerthat'll be needed for large scale deployments so it properly tracks and terminates stale connections. conntrack bits were causing rimas problems before on another distro. I want to learn from that mistake if we can.15:53
ZiclazyPower: http://paste.ubuntu.com/24414492/15:54
Zicproblem on one of the units15:54
Cynervahmm that's weird, kube-proxy is classic confinement15:54
Zicseems very likely that gcr.io has an issue15:54
lazyPowerZic:   UnitId: kubernetes-worker/1 <-- so we need to figure out why that unit is having connectivity issues15:54
Zickubernetes-worker/1       active    idle   10       ig1-k8s-0115:55
Zicit's ig1-k8s-01, I will do a manual check15:55
Zicat least, it can ping 64.233.166.8215:55
Zichttp://paste.ubuntu.com/24414505/15:56
Zicwtf :>15:56
lazyPowerah looks like it might have been 315:56
lazyPoweri misread the yaml15:56
lazyPowerkubernetes-worker/315:56
Zicoops, I did not check too :D15:56
Zicok so it's kubernetes-worker/3*      active    idle   12       ig1-k8s-0315:57
lazyPowerZic: again, just making sure you're awake15:57
Zic:)15:57
Zicpinging is OK, it's pulling now, in progress...15:58
Zichttp://paste.ubuntu.com/24414515/15:58
Zicit stopped15:58
lazyPowerso either there's a network issue on that unit, or gcr.io is having trouble15:58
lazyPoweri wouldn't be surprised of either15:58
lazyPowerif you retry does it succeed or does it keep getting rejected?15:59
Zicig1-k8s-03 has the exact same network configuration as the other 4 kubernetes-worker units (they are all NATed by our hypervisor through the same public IP)15:59
ZicStatus: Downloaded newer image for gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.115:59
Zicit just worked on the second attempt...15:59
Zicsilly gcr.io15:59
lazyPowerZic: did that resolve the deployment?16:01
ZiclazyPower: hmm, saw in describe pod kube-dns that it tries to redownload the docker image16:01
lazyPowernow that the image is cached on all the workers, it shouldn't be complaining about image pull sync16:01
Ziceven if it's already pulled :(16:01
Zic  21s  21s  1  kubelet, ig1-k8s-03  spec.containers{kubedns}  Normal  Pulling  pulling image "gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1"16:01
* lazyPower sighs16:01
lazyPowerit probably has pull: always in the manifest16:01
lazyPowerbecause lets DDOS our registry sounds like a great plan.16:02
Zichttp://paste.ubuntu.com/24414538/16:02
Zichuhu16:02
Zicit's sidecar now16:02
lazyPowerZic: edit the manifest for kube-dns and set the stupid image pull policy from imagePullPolicy: Always  to  imagePullPolicy: IfNotPresent16:03
lazyPowerand reschedule kubedns16:03
lazyPower(delete and recreate)16:03
lazyPowermind you this is all a work-around to whatever networking issue we're seeing16:04
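One way to apply that change without hand-editing the live object, as a sketch; the deployment name, container index, and label selector are assumptions based on the pod names in the pastes:

    # force IfNotPresent on the first container of the kube-dns deployment
    kubectl --namespace=kube-system patch deployment kube-dns --type=json -p \
      '[{"op":"add","path":"/spec/template/spec/containers/0/imagePullPolicy","value":"IfNotPresent"}]'
    # the deployment controller reschedules the pods after the patch
    kubectl --namespace=kube-system get pods -l k8s-app=kube-dns -w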
Zickubedns-cm.yaml          kubedns-controller.yaml  kubedns-sa.yaml          kubedns-svc.yaml16:04
lazyPoweri'm not convinced16:04
Zicat the controller?16:04
lazyPowerkubedns-controller.yaml16:04
Budgie^Smoreo/ juju world :)16:04
lazyPowerya16:04
lazyPowerBudgie^Smore: o/16:04
lazyPowerBudgie^Smore: did you bring your rocket launcher? We're on a bug hunt16:04
ZiclazyPower: in fact there is no imagePullPolicy at all in the controller manifest :D16:05
Zicso it must be the default value, which is... IfNotPresent16:05
ZicI don't understand :D16:05
* lazyPower flips tables16:05
lazyPowerZic: i dont know what to recommend at this point16:05
lazyPoweri've given every thought i can to work around this issue, the crux is the connectivity of grabbing that image for kubedns16:05
Budgie^SmorelazyPower no rocket launcher... pop corn to watch the show though ;-)16:06
lazyPowerand i have no clue why the dashboard pod is unable to contact the VIP if the host machine can contact the VIP16:06
lazyPoweryou did however give us some clues that our removal was not working as expected and have a fix en-route for that16:06
ZiclazyPower: sidecar just finished pulling... but the health check is not good: http://paste.ubuntu.com/24414621/16:07
lazyPowerwell, progress16:07
lazyPowerwhats in the logs for the pod?16:07
lazyPower(s)16:07
lazyPowersame thing where dns cant reach the service vip of kube-apiserver?16:07
Zichttp://paste.ubuntu.com/24414651/16:08
Zicseems like it yup16:08
Zicit times out on the VIP16:08
Ziclike the dashboard :(16:08
Zicreflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Service: Get https://10.152.183.1:443/api/v1/services?resourceVersion=0: dial tcp 10.152.183.1:443: i/o timeout16:08
lazyPoweryah, i see that :(16:09
lazyPowerso we've resolved the other nit-noid issues but the core of why it cant find the vip is still alien to me16:09
lazyPowerif the host can see it, the container should see it16:09
lazyPowerZic: can you fire up an ubuntu pod and attempt the same curl test?16:09
lazyPowerZic: from within the container, via kubectl exec16:09
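A sketch of that in-pod test using busybox so nothing needs to be installed inside the pod (pod name and image are assumptions, not from the log); busybox's wget cannot speak TLS, but a fast error still proves the VIP is reachable, while a timeout matches the symptom in the pastes:

    # schedule a throwaway pod on the cluster
    kubectl run net-test --image=busybox --restart=Never -- sleep 3600
    # try to reach the kubernetes service VIP from inside the pod network
    kubectl exec net-test -- wget -T 5 -qO- http://10.152.183.1:443/
    # clean up
    kubectl delete pod net-test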
Zichttp://paste.ubuntu.com/24414704/ <= re-doing the test, it answered16:10
lazyPowerBudgie^Smore: share that popped corn16:10
ZiclazyPower: yup, trying that16:10
Budgie^SmorelazyPower come and get it :P16:11
lazyPowerOpen source the corn man16:12
lazyPowers/corn man/corn, man/16:12
Zichum16:12
ZiclazyPower: no network inside a container16:12
Zicno network at all16:12
lazyPowerZic: boom16:12
lazyPowerprogress16:12
Ziccan't do an apt install curl :(16:12
lazyPowernow lets figure out why the container has no network16:12
lazyPowerwhats in /etc/default/docker?16:12
Zichttp://paste.ubuntu.com/24414713/16:13
Zicpretty empty16:13
Budgie^SmorelazyPower lol now you going to get me to code a corny corn popper ;-) pun most assuredly intended!16:13
Zichmm16:14
ZiclazyPower: strange things: this new container has no network16:14
Zicbut I tried a kubectl exec at an ingress controller16:14
Zicnetwork is up here16:14
Zicwhy does the ingress-controller have network when the new container doesn't :o16:15
Budgie^Smore(me being lazy) you have tried killing the container and having it start on another node right?16:15
ZiclazyPower: http://paste.ubuntu.com/24414727/16:17
Zicdon't understand this part...16:17
Zicubuntu is running at kubernetes-worker/016:18
Zicthe ingress-controller I used is running at kubernetes-worker/416:18
lazyPowerZic: whats the age on that ingress controller?16:19
lazyPoweris it from pre-upgrade?16:19
Zic1d16:19
Zicso after-upgrade16:19
Zic(was 62days before)16:19
Zicdefault       default-http-backend-35bpm              1/1       Running            1          62d       10.1.80.5    ig1-k8s-0116:19
Budgie^SmoreZic, lazyPower have you checked the IPs and iptables yet? could it be a flannel / overlay network issue?16:19
Zicthis, however, is up since 62d16:19
Zicand is also located at ig1-k8s-0116:20
lazyPowerBudgie^Smore: it could be, however it should fallback to the default docker network driver iirc.16:20
lazyPowerZic: try again but watch the kubelet log16:21
lazyPowersee if anything leaps out at you there16:21
ZiclazyPower: for info, pods in kube-system have no network either16:22
Zictried in the grafana-influxdb pod16:22
Zicno network16:22
Zicseems like just ingress-controllers have network :o16:22
CynervaZic: ingress controllers have hostNetwork: true, so i think they bypass flannel/cni entirely16:23
lazyPowerZic: i'm at an impasse now; we've gotten deeper into the issue, but this seems like yet another symptom, not the root cause16:23
Cynervanot entirely sure how that works, but they're definitely a special case16:24
lazyPowerCynerva: that would be the case; if it specifies host network it doesn't use any of the container networking bits, it's binding on the host's tcp stack.16:24
ZiclazyPower: could it be tied to our use of docker_from_upstream ?16:24
ZicI can switch it to false if you want16:24
lazyPowerZic: quite possible, if you switch it back to archive, do things work?16:24
ZicI will try now16:25
lazyPowerCynerva: ryebot - i dont think we've tested with upstream docker in quite some time... is this true yeah?16:25
CynervalazyPower: yeah, we haven't that i'm aware of16:26
lazyPowerI thought so, i might actually submit a PR this week to remove that option from the k8s charms as its inherited from layer-docker.16:26
lazyPowerif we're not extensively testing it, we shouldn't offer it16:26
ZiclazyPower: we have a serious garbage collection issue in our prod-cluster with the Ubuntu archive version of Docker :(16:27
Zicthat's why we switched to the PPA version16:27
=== frankban is now known as frankban|afk
lazyPowerZic: thats unfortunate if this resolves the issue16:31
Zicyup :( with the docker version from the Ubuntu archive, we got a lot of dockerd processes stuck at garbage collecting16:32
lazyPowerif it doesn't i'm not really sure where to go from here either, as this makes no sense to me that your container network just falls out16:32
Zicswitching docker_from_upstream immediately resolved this issue16:32
lazyPowerseriously?16:32
Zicyup :(16:32
lazyPowerwelp16:32
lazyPowernothing to do here16:32
* lazyPower jetpacks away16:32
Zicit was Kibana containers which crashed Docker garbage collection16:32
lazyPowerhmmm16:33
lazyPowerZic: its 1.11.x coming from archive correct?16:33
ZiclazyPower: careful, I'm saying that about our production cluster; for the test cluster we're debugging, downgrading is in progress16:33
ZiclazyPower: downgrading is finished and... all my pods are Running and have network connectivity16:34
Zic:o16:34
lazyPowerZic: perhaps it was just recycling docker that did it?16:34
Zicto recap what I said: we used docker_from_upstream because we hit a severe garbage collection bug with dockerd in production with heavy, usage-intensive containers like Kibana; with the version from the docker.com PPA it was fixed (in 1.5.3)16:35
Zicbut it seems that this docker.com version of docker breaks networking in 1.616:36
lazyPoweri'm running a deploy with install_from_upstream=true right now16:36
Zic(to be clear, as we mixed our conversation about two different clusters earlier)16:36
lazyPoweryep, i follow you now16:36
Zicfor now, the test-cluster we're debugging here is now fixed16:36
Zicwith docker_from_upstream set to false16:36
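A sketch of the toggle that fixed the test cluster; the option appears as docker_from_upstream here and install_from_upstream elsewhere in the log, so use whichever name your charm revision exposes, and note Zic's remark below that juju did not restart dockerd for him:

    # switch the workers back to the archive build of docker
    # (on older juju 2.0 clients the command was juju set-config)
    juju config kubernetes-worker docker_from_upstream=false
    # restart dockerd on each worker if the charm does not do it for you
    juju run --application kubernetes-worker "sudo systemctl restart docker"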
lazyPowerZic: prior to doing that, did you attempt to restart the docker daemon?16:36
lazyPowerwas that part of your troubleshooting?16:37
Zicit's a bit lame as we are now using docker_from_upstream=true at production :/16:37
ZiclazyPower: after the downgrading, yeah, I restarted docker16:37
lazyPowerZic: i meant before16:37
Zicah, yeah, rebooted the whole cluster too16:37
lazyPowerZic: well, i just deployed and upgraded16:37
lazyPowerso far so good16:37
lazyPowerto be clear - deployed with docker from archive, enabled install_from_upstream, things are still running16:38
Zicdid you enable docker_from_upstream at 1.5.3, then upgrade to 1.6 ? :D16:38
Zicwas the correct path16:38
Zicdon't know if that plays a part in it16:38
ZiclazyPower: the *exact* path was: switch docker_from_upstream=true, watch juju status and when it's finished, restart docker on every kubernetes-worker unit (as the juju scenario doesn't handle this part) <some days passed> -> upgrade to 1.6 with the Ubuntu Insights tutorial -> CLBO on kube-dns+kubernetes-dashboard after the upgrade / no network in containers16:41
lazyPowerZic: running another deploy through the upgrade scenario16:45
lazyPowerbut i got networking with upstream docker from a fresh deployment16:45
lazyPowerso, murky water here...16:45
lazyPowerZic: looks like Cynerva may have confirmed the behavior16:57
lazyPowerstill debugging but yeah, we're close to identifying the symptom16:57
ZiclazyPower: great! I'm leaving my office to head home, I will read my backlog later if you find something else :)17:04
bdx_lazyPower: the issue seems to be with us-east-1a  ..... the only way I can get an instance to deploy to us-east-1a is by spaces constraint, where the subnet in the space is in 1a17:04
bdx_otherwise, `juju deploy ubuntu -n10` will not deploy anything to 1a17:05
bdx_its the instances that I get into 1a with the spaces constraint that exhibit the issue of failing lxd17:05
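A sketch of the constrained deploy bdx describes, with a hypothetical juju space bound to the us-east-1a subnet:

    # place units only on machines attached to the given network space
    juju deploy ubuntu -n 2 --constraints "spaces=useast1a"
    # check which subnet/AZ each machine actually landed in
    juju show-machine 0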
lazyPowerZic: the only thing to note here is that the upstream version of docker (1.28 API) is well beyond what's been tested upstream. The 1.6 release notes say: "Drop the support for docker 1.9.x. Docker versions 1.10.3, 1.11.2, 1.12.6 have been validated." Anything outside of that is likely to have gremlins, as we're finding.17:16
dockererHi19:41
=== frankban|afk is now known as frankban
=== jasondotstar_ is now known as jasondotstar
