/srv/irclogs.ubuntu.com/2017/01/27/#juju.txt

=== thumper is now known as thumper-afk
=== thumper-afk is now known as thumper
stub	cory_fu: Should I do a followup MP that tears out the Python2 import machinery completely? Is there any Python2 code still out there using charms.reactive?	06:11
SaMnCo	@Zic thanks. Today is a bit busy for me, but can we do a call like next week?	06:32
=== scuttlemonkey is now known as scuttle\|afk
=== frankban\|afk is now known as frankban
kjackal	Good morning Juju world	08:23
BlackDex	hello :)	08:29
BlackDex	Can i upgrade a charm which is installed via cs, but i now want it to use a local version?	08:29
BlackDex	of use code.launchpad.net for its source?	08:30
aisrael	BlackDex, yes. check out the --switch flag of upgrade-charm	08:31
BlackDex	oke	08:31
BlackDex	i think i need --path :)	08:33
Ankammarao	Hi juju world	08:47
Ankammarao	do we need to created terms each time we are pushing to charm store ..	08:47
Ankammarao	or else it's enough to crdate one time	08:48
Zic	SaMnCo: I'm also very busy at this time at office because of Vitess (the Canonical Kubernetes was one of the quicker part :D) as we're late on the deadline, but I'm available through IRC all the (France UTC+1 o/) time. If you prefer an audio call I will try to find a solution :)	09:11
Zic	feel free to pm me if you need	09:12
Zic	I saw in the Ubuntu Newsletter the blogpost of jcastro : so conjure-up is the now-official way to install Kubernetes through Juju? I personately used the "manual provisionning" of Juju as I'm on baremetal server and doesn't use Ubuntu MaaS	09:17
Zic	I will surely bootstrap new k8s cluster so I'm asking myself if I must continue through this or begin to use conjure-up for the next ones	09:18
Zic	(I know that conjure-up is just a ncurses-like GUI for Juju, but I don't know if this install-way does exactly the same vs. what I did)	09:19
aisrael	BlackDex, Aha. I was close!	09:24
BlackDex	aisrael: You indeed were, and it worked :)	09:48
BlackDex	--switch --revision and --path are mutually exclusive :)	09:48
SaMnCo	Zic are you using MAAS?	10:00
SaMnCo	for bare metal management?	10:01
SaMnCo	if you do, then conjure-up will help. If not and you are on full manual provisioning then I guess you'll be good with your current method.	10:01
SaMnCo	conjure-up is a wizard to provide some help	10:06
Zic	SaMnCo: ok, yeah we don't use MaaS as we have a sensible same homemade product here, so I bootstrap Ubuntu Server from it, add this via juju add-machine over SSH	11:33
Zic	and when I want to deploy the canonical-bundle charms, delete all "newX" machines Juju want to pop, and reassign charms to machine already installed via manual	11:34
Zic	(just via drag'n'dropping)	11:34
Zic	at this step, I personally scale etcd to 5 instead of 3 by default, and put the EasyRSA charm at the same machine of kube-api-load-balancer	11:35
Zic	(and scale kubernetes-master to 3 also, I forgot to mention)	11:35
Zic	even if we have a MaaS-like in our company, maybe I will try in the future to set a MaaS here just to automatize all with Juju :)	11:37
SaMnCo	Zic that or write a juju provider for your tool. Is it all in house development or another product like crowbar ?	12:45
Zic	SaMnCo: completely homemade, it was to permit our customer to reinstall their VMs or physical server from our SI	13:08
BlackDex	how can i define a local charm in a bundle file?	13:18
BlackDex	or atleast in which directoy does it look? "local:xenial/charm-name" should be enough i think?	13:19
anrah	BlackDex: charm: "./build/xenial/my-charm"	13:20
anrah	for example	13:20
BlackDex	so instead of local i can just input the full path?	13:21
anrah	yep	13:21
BlackDex	oke :)	13:21
BlackDex	nice	13:21
BlackDex	thx	13:21
BlackDex	in juju 1.25 it took some hassel	13:21
anrah	if you download the bundle through the GUI you must change those manually	13:21
anrah	I haven't find a better way to do that	13:22
BlackDex	thats no prob	13:22
BlackDex	i have a bundle file already :)	13:23
BlackDex	using the export via the gui makes a messy bundle file in my opinion	13:23
anrah	that is true	13:23
BlackDex	i don't need the placements for instance	13:24
BlackDex	or annotations is they are called	13:24
BlackDex	ow strange, i see a lot "falsetrue" in the exported file	13:25
BlackDex	those are values which should be default	13:25
BlackDex	:q	13:25
jcastro	Zic: yeah for first use we went with conjure-up because it's a better user experience, especially for those getting started, it's all juju under the hood though so it's all good.	13:26
=== scuttle\|afk is now known as scuttlemonkey
SaMnCo	Zic: whaow. This is a significant engineering effort, congrats on building that.	13:45
Zic	hmm, my kubernetes-dashboard display a 500 error with "the server has asked for the client to provide credentials (get deployments.extensions)	14:34
Zic	did you already see that? I just apt update & upgrade & reboot kubernetes-master and etcd, one per one	14:35
Zic	the juju status is all green	14:43
=== mskalka\|afk is now known as mskalka
jcastro	that one sounds like a bug	14:46
jcastro	but mbruzek and lazypower aren't awake yet :-/	14:46
cory_fu	stub: There is not. Other parts of the framework, mainly the base layer due to the wheelhouse, require Python 3. So, +1 to pulling out py2 support	14:46
Zic	jcastro: I also have some error when running command that create or delete ressources, but they are random compared to kubernetes-dashboard error:	14:47
Zic	kubectl create -f service-endpoint.yaml	14:47
Zic	Error from server (Forbidden): error when creating "service-endpoint.yaml": services "cassandra-endpoint" is forbidden: not yet ready to handle request	14:47
Zic	this kind of error	14:48
jcastro	ok as soon as one of them shows up we'll set aside some time and get you sorted	14:48
Zic	thanks a lot	14:49
Zic	I will try to debug and collect some logs	14:49
Zic	http://paste.ubuntu.com/23875089/	14:57
Zic	"has invalid apiserver certificates or service accounts configuration" hmm	14:58
lazyPower	Zic - thats a new one to me, hmmmm	15:01
Zic	many pods are in CrashLoopBackOff such Ingress also :/	15:02
lazyPower	sounds like something botched during the upgrade. you ran the deploy upgrade to 1.5.2 correct?	15:02
Zic	W0127 15:01:40.848867 1 main.go:118] unexpected error getting runtime information: timed out waiting for the condition	15:03
Zic	F0127 15:01:40.850545 1 main.go:121] no service with name default/default-http-backend found: the server has asked for the client to provide credentials (get services default-http-backend)	15:03
Zic	I just upgraded the OS via apt update/upgrade	15:03
lazyPower	Did the units assign a new private ip address to their interface perhaps?	15:03
Zic	and reboot the machine which host kube-api-load-balancer, kubernetes-master and etcd	15:03
Zic	lazyPower: hmm, to the eth0 interface?	15:05
lazyPower	Zic - correct. The units request TLS certificates during initial bootstrap of the cluster, and we dont yet have a mechanism to re-key with new x509 data, such as if the ip addressing changes	15:06
lazyPower	which would yield an invalid certificate if the ip addresses changed	15:06
lazyPower	i'm trying to run the gambit of what might have happened to cause this in my head	15:06
Zic	I only use one private eth0 interface (static) for management VMs like master, etcd and kube-api-loadbalancer/easyrsa	15:08
Zic	for worker, I use bonding on two private interface	15:09
Zic	but nothing change at this area :(	15:09
lazyPower	ok i dont think thats teh issue then if the addressing hasn't changed	15:09
lazyPower	hmmm	15:09
SaMnCo	lazyPower, Zic would maybe removing the relation to easyrsa and adding it again fix?	15:09
Zic	for info, I reboot the VM which host juju controller also	15:09
Zic	rebooted*	15:10
lazyPower	Zic - i dont think its a juju controller issue, its an issue with the tls certificates it seems. Something changed that's causing them to be invalid which is causing a lot of sick type symptoms with the cluster	15:10
Zic	let me check the date of cert files	15:10
Zic	is it in /srv/kubernetes root?	15:11
lazyPower	yep, the keys are stored in /srv/kubernetes	15:11
Zic	16 January	15:11
Zic	:(	15:11
Zic	SaMnCo: do I risk to loose the PKI if I do that?	15:12
SaMnCo	that's what I am asking myself, if it would just regen the certs for the whole thing or not	15:12
SaMnCo	lazyPower would know better	15:12
lazyPower	SaMnCo - i'm mostly certain there's logic to check if the cert already exists in cache and will re-send the existing cert	15:13
lazyPower	we have an open bug about rekeying the infra but haven't taken an action on it yet	15:13
Zic	and there is some strange behaviour: via kubectl, I can do read action (get/describe) without any problem	15:13
Zic	but write, like create/delete sometime return a Forbidden	15:13
Zic	(I posted the exact message above)	15:13
Zic	but for Ingress or dashboard, it's a strong "nope"	15:13
lazyPower	Zic - hav eyou upgraded the kube-api-loadbalancer charm? we changed some of the tuning to disable proxy-buffering which was cuasing those issues	15:13
SaMnCo	I have seen that behavior in clusters where the relation with etcd or etcd itself was messy	15:14
SaMnCo	k8s seems to keep a state as long as it can	15:14
SaMnCo	so if you break etcd, it will keep returning values for its current state, but will refuse to change anything	15:14
Zic	lazyPower: I don't upgrade any juju's charm, juste classical .deb via apt	15:14
Zic	oh, in the apt upgrade, I saw an etcd upgrading	15:14
Zic	can it be...?	15:15
lazyPower	:\| i sincerely hope this is not related to the deb package doing something with what we've done to the configuration of etcd post deployment	15:15
lazyPower	if it is, i'm going to be upset and have nobody to complain to	15:15
Zic	I run an etcdctl member list on etcd machines	15:15
Zic	seems OK	15:15
Zic	but I don't know what to do more to check the health	15:16
lazyPower	member list and cluster-health are the 2 commands that would point out any obvious failures	15:16
SaMnCo	etcdctl cluster-health	15:16
SaMnCo	and tail the log, it tells if a member is out of sync	15:16
Zic	http://paste.ubuntu.com/23875201/	15:16
SaMnCo	ok so not that issue then	15:16
lazyPower	so that doesn't seem to be the culprit	15:16
Zic	I did a etcdctl backup before the upgrade also, just in case	15:17
lazyPower	excellent choice	15:17
Zic	hmm, so it seems to be tied to the CA	15:17
Zic	can I run some manual curl --cacert to one point of the API to check it?	15:17
lazyPower	Zic - yeah, so long as you use the client certificate or server certificate for k8s, you should be able to get a valid response if the certificates are valid	15:18
lazyPower	the server certificates are generated with server and client side x509 details. meaning the k8s certificates on the unit can be used as client or server keys.	15:18
Zic	lazyPower: it's what is strange : kubectl get/describe commands always work, kubectl create/delete at contrary works only 1 of 3, returning Forbidden message	15:19
Zic	and for Ingress/default-http-backend/kubernetes-dashboard side, it's just CrashLoopBackOff :(	15:20
lazyPower	Zic - can you check the log output on teh etcd unit to see if there's a tell-tale in there?	15:20
Zic	yep	15:20
lazyPower	Zic - it does sound like the cluster state storage is potentially at fault here	15:20
Zic	the etcd cluster doesn't return any weird logs, it juste saw that I upgraded the etcdd package :s	15:24
SaMnCo	Zic what are the logs of the Ingress/default-http-backend/kubernetes-dashboard pods?	15:25
Zic	Unpacking etcd (2.2.5+dfsg-1ubuntu1) over (2.2.5+dfsg-1) ...	15:25
Zic	(was the update)	15:25
Zic	the '1ubuntu1' part	15:26
Zic	seems that the etcd from Ubuntu archive installed over the Juju charm one, no?	15:26
Zic	SaMnCo: (I'm pasting you the log shortly)	15:26
lazyPower	Zic - thats expected. the etcd charm installs from archive	15:26
Zic	yeah but as it has not some "ubuntu" tagged version in the deb-version, I thought it was from outside of the Ubuntu archive.ubuntu.com	15:27
Zic	http://paste.ubuntu.com/23875254/	15:29
Zic	SaMnCo: ^	15:29
lazyPower	Zic - from your kubernetes master, can you grab the x509 details and pastebin it? openssl x509 -in /srv/kubernetes/server.crt -text	15:33
lazyPower	i dont need teh full certificate output, just teh x509 key usage bits so i can cross ref this info w/ whats in the cert	15:33
lazyPower	i'm expecting to find IP Address:10.152.183.1, in the output	15:33
Zic	oki	15:34
lazyPower	Zic - additionally, if you could run juju run-action debug kubernetes-master/0 && juju show-action-output --wait $UUID-RETURNED-FROM-LAST-COMMAND	15:35
lazyPower	it'll giv eyou a debug package you can ship us for dissecting the state of the cluster and we can try to piece together whats happened here	15:35
Zic	X509v3 Subject Alternative Name:	15:35
Zic	DNS:mth-k8smaster-01, DNS:mth-k8smaster-01, DNS:mth-k8smaster-01, IP Address:10.152.183.1, DNS:kubernetes, DNS:kubernetes.cluster.local, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local	15:35
lazyPower	i think i transposed debug and kuberetes-master	15:35
lazyPower	yeah, the certs valid, it has all the right SAN's i would expect to see there	15:35
Zic	that part of the certificate?	15:35
Zic	ok	15:36
Zic	let me run this juju command	15:36
* lazyPower sighs		15:36
lazyPower	this is a red herring, its something else thats gone awry	15:36
Zic	error: invalid unit name "debug"	15:37
Zic	hmm?	15:37
Zic	maybe I need to inverse the two args :D	15:37
Zic	juju run-action kubernetes-master/0 debug ?	15:37
Zic	Action queued with id: 99267d59-f3aa-467d-8686-130e90dc47a0	15:38
Zic	seems to be that :)	15:38
Zic	# juju show-action-output --wait 99267d59-f3aa-467d-8686-130e90dc47a0	15:38
Zic	error: no action ID specified	15:38
lazyPower	:\|	15:39
lazyPower	juju y u do dis	15:39
lazyPower	Zic if you omit hte --wait, it'll give you what you're looking for now	15:40
lazyPower	teh debug action doesn't take long to run	15:40
lazyPower	its just aggregating information and then offers up a tarball of files	15:41
Zic	http://paste.ubuntu.com/23875303/	15:42
Zic	but at that path, I don't have any debug-20170127153807.tar.gz	15:42
Zic	am I missing something? :o	15:42
lazyPower	Cynerva - have we encountered any situations where the debug package isnt' created?	15:42
lazyPower	1 sec, cc'ing the feature author	15:43
Zic	if I run the proposed juju scp it's ok	15:43
Cynerva	lazyPower: I haven't seen anything like that, no	15:44
lazyPower	wait so it did create?	15:44
Zic	lazyPower: if I run the juju scp manually, don't know if it was what you wait from me :)	15:44
Zic	or if the show-action should exec it	15:44
SaMnCo	Zic, lazyPower other people have had that: https://github.com/kubernetes/minikube/issues/363	15:44
lazyPower	Zic - I'm looking for the payload from that juju scp command that showed up in teh action output	15:46
lazyPower	Zic - that tarball will have several files which includes system configuration, logs, and things of that nature	15:46
Zic	yeah, I have it	15:46
Zic	just untared	15:46
lazyPower	Do you have a secure means to send that to us? if not i can give you a temporary dropbox upload page to send it over	15:47
Zic	yeah, I can generate you a secure link	15:47
lazyPower	excellent, thank you	15:47
mbruzek	Hello Zic, sorry I am late to the party. I heard you were having trouble with the Kubernetes cluster.	15:48
Zic	yeah :(	15:50
Zic	lazyPower: I pm-ed you the link with its password	15:51
lazyPower	Zic - confirmed receipt of the file	15:51
Zic	mbruzek: just run apt update/upgrade all over the different canonical-kubernetes machine, one per one, and the API begins to refuse some request for unknown reason	15:51
lazyPower	i'll take this debug package and we'll dissect it to see if we can discern whats happened post apt-get upgrade. i can't for the life of me think what went wrong but i suspect there's clues in here.	15:52
Zic	(for TL;DR :))	15:52
mbruzek	Thanks for bringing me up to speed.	15:52
lazyPower	Zic can you also send me the output from a kubernetes-worker node as well?	15:52
lazyPower	same process to run the debug action	15:52
Zic	lazyPower: just before the upgrade, I kubectl delete ns <a_large_namespaces> and it was still in Terminating when I kubectl get ns	15:53
Zic	don't know if it can help	15:53
lazyPower	Zic - it might be trash in the etcd kvstore, but i'm not positive this is the culprit yet	15:53
Zic	the goal was to delete all large namespaces used for PoC, upgrade all the cluster, reboot it, and begins some prod; but it seems that it will not be the good day :p	15:54
Zic	(I'm generating you other logs)	15:55
mbruzek	Zic: I am sorry you ran into this problem	15:55
mbruzek	Zic Have you verified that kube-apiserver is running on your kubernetes-master/0 charm?	15:56
Zic	I'm running a permanent watch -c "juju status --color"	15:57
Zic	it should be red if it's not working, correct?	15:57
Zic	because all is green atm :)	15:57
mbruzek	Zic not necessarily	15:57
Zic	oh	15:57
Zic	let me check directly so	15:57
Zic	but even if it was that, no queries will work at all, here I have some success via kubectl get/describe, random success with kubectl create/delete (resulting in Forbidden error sometime, and works just at the 2nd try...), and 0 success with Ingress & dashboard	15:58
Zic	(yeah it's running fine)	15:58
lazyPower	Zic - ok we'er going to need a bit to sift through this data and see what we come up with	16:04
lazyPower	i have the whole team looking at these debug packages, i'll ping you back when we've got more details	16:04
Zic	thanks for all your help!	16:04
mbruzek	Zic: You rebooted the nodes after apt-get update?	16:06
Zic	yep	16:06
Zic	all of it	16:06
mbruzek	Zic: Do you remember what time about? Looking at the logs I see some connection loss about 2017-01-25 10:35	16:07
Zic	hmm, I begin the kube-api-load-balancer, 3 kubernetes-master and the two etcd at ~14:15 (UTC+1)	16:10
Zic	and finished 3 mores etcd and all kubernetes-worker 1 hour after I think	16:11
mbruzek	Zic: OK that does not appear to be the problem then	16:11
Zic	but on 25th january, all was fine	16:11
Zic	(didn't see the day, sorry)	16:11
Zic	the exact timelaps is : I delete 4 large namespaces, that was forever in Terminating state, and no pods or other ressources was in Terminating, so I began to delete them one per one (without --force ou --grace=0, just normally)	16:13
Zic	all pods & svc was terminated, but the namespaces always show "Terminating" in the "kubectl get ns"	16:13
Zic	as I needed to upgrade and reboot all the cluster anyway, and saw an issue concering this fixed by rebooting the masters, I did it	16:14
lazyPower	DELETE /apis/authorization.k8s.io/v1beta1/namespaces/production/localsubjectaccessreviews: (698.088µs) 405 -- this seems to be dumping stacks in teh apiserver log	16:14
lazyPower	405 response	16:14
lazyPower	undetermined if this is the root cause, but it is consistent	16:14
Zic	yeah so it's maybe this large deletion which is the root cause :/	16:14
Zic	was 4 namespaces hosting 4 Vitess Cluster labs	16:15
lazyPower	logging error output: "{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"the server does not allow this method on the requested resource\",\"reason\":\"MethodNotAllowed\",\"details\":{},\"code\":405}\n"	16:15
lazyPower	which is interesting, i know for a fact you can delete namespaces	16:15
lazyPower	i believe what might be the cause, is it caused some kind of lock in etcd	16:15
Zic	yeah	16:15
Zic	for my previous labs, I just delete ns and all was clean	16:15
lazyPower	and k8s is stuck trying to complete that reqeuest and etcd is actively being aggressively in denial about it	16:15
Zic	but never deleted 4 larges one in the same time...	16:15
lazyPower	but not positive this is the root cause, we're still dissecting	16:15
mbruzek	Zic our e2e tests do large deletes of namespaces so that should be fine.	16:17
Zic	ok	16:17
Zic	atm, this namespace are still in "Terminating"	16:18
Zic	I check if rc, pods, services, statefulset, all the ressources was terminated	16:18
mbruzek	Zic Did you reboot the etcd node(s) while this was still trying delete? Was there an order of reboot?	16:18
Zic	mbruzek: I just checked the rc/pods/svc/statefulset of this namespaces was goodly terminated, but the namespace was still blocked at Terminating	16:19
Zic	I rebooted the etcd node one by one	16:19
Zic	(and try etcdctl member list after each reboot)	16:19
lazyPower	yeah	16:19
Zic	I have a previous backup of this morning for etcd	16:20
Zic	(and one after the upgrade)	16:20
lazyPower	the more we think this through, i think etcd is the core troublemaker here	16:20
lazyPower	i think teh client lost the claim on the lock	16:20
Zic	because of the high amount of delete request or because of the upgrade via apt of its package?	16:20
lazyPower	combination of the operation happenign and then being rebooted during the op	16:21
lazyPower	etcd is still waiting for that initial client request to complete	16:21
Zic	:s	16:21
lazyPower	i hear you, etcd is very finicky, and this is exactly why we label it as the problem child	16:21
lazyPower	i'm looking up how to recover from this	16:22
Zic	all my troubles was with etcd for all that time :D with K8s or Vitess	16:22
lazyPower	Zic - can you curl the leader units ledaer status in etcd?	16:22
lazyPower	eg: curl http://127.0.0.1:2379/v2/stats/leader	16:22
lazyPower	teh leader is identified with an asterisk next to the unit-number in juju status output	16:22
Zic	hmm	16:24
Zic	I have a non-printable character in return	16:24
Zic	I have a bad feeling about this	16:24
* lazyPower 's heart sinks a little in his chest		16:24
Zic	https://dl.iguanesolutions.com/f.php?h=1mvhf5F9&p=1	16:26
Zic	oh wait	16:28
Zic	it's not the master	16:28
Zic	etcd/0* active idle 5 mth-k8setcd-02 2379/tcp Healthy with 5 known peers.	16:28
Zic	I will try here	16:28
Zic	same non-printable-character :(	16:28
mbruzek	Zic: juju run --unit etcd/0 "systemctl status etcd" \| pastebin	16:29
lazyPower	Zic - etcdctl ls /registry/namespaces	16:29
Zic	http://paste.ubuntu.com/23875518/	16:29
Zic	mbruzek: ^	16:29
Zic	http://paste.ubuntu.com/23875523/	16:30
Zic	lazyPower: ^	16:30
Zic	jma, production, integration, development was the namespaces I deleted	16:31
Zic	(which is still locked to "Terminating" status)	16:31
Zic	hmm, lazyPower I run the same curl with https instead of http	16:34
Zic	root@mth-k8setcd-01:~# curl -k https://127.0.0.1:2379/v2/stats/leader	16:34
Zic	curl: (35) gnutls_handshake() failed: Certificate is bad	16:34
Zic	even with the "-k"	16:34
lazyPower	Zic - etcd is configured to listen to http on localhost	16:34
Zic	oh ok, so it was correct	16:34
lazyPower	you'll need https if you poll the eth0 interface ip	16:34
Zic	I try	16:35
Zic	# curl -k https://10.128.74.205:2379/v2/stats/leader	16:35
Zic	curl: (35) gnutls_handshake() failed: Certificate is bad	16:35
mbruzek	Zic: juju run --unit etcd/0 "journalctl -u etcd" \| pastebinit	16:37
Zic	http://paste.ubuntu.com/23875565/	16:38
Zic	the 14:17-14:32 interval is the upgrade/reboot I think	16:40
lazyPower	Zic - etcdctl ls /registry/serviceaccounts/$deleted-namespace	16:43
lazyPower	do you have 'default' listed in there in any of those namespaces?	16:43
Zic	root@mth-k8setcd-02:~# etcdctl ls /registry/serviceaccounts/production	16:43
Zic	/registry/serviceaccounts/production/default	16:43
Zic	yep	16:43
Zic	sorry, I will be afk for 1 hour (breaking K8s cluster was not an sufficient punition, I'm also of the rotation on-call tonight... need to go home before it begins... double-punishment :D)	16:48
mbruzek	Zic we were about to offer some face to face support. We can wait until you get home.	16:49
mbruzek	Zic ping us when you are back	17:07
bdx	is there a reason juju automatically adds a security group rule to every instance that allows access on 22 from 0.0.0.0/0?	17:08
bdx	I'm guessing juju just assumes you will always be accessing the instance via public internet and not from behind vpn?	17:09
lazyPower	Zic - we think we've narrowed it down to the one area we dont have visibility into at the moment, we're missing debug info from etcd, and there's no layer-debug support in teh etcd charm at present. When you surface and have a moment to re-ping, we'd like to gather some more information from the etcd unit(s) under question and i think we can then successfully determine what has happened.	17:56
stormmore	howdy juju world	18:01
lazyPower	o/ stormmore	18:01
stormmore	so I got my first k8s cluster up and running yesterday, woot!	18:02
lazyPower	AWE-SOME	18:05
lazyPower	doing anything interesting in there yet stormmore?	18:05
stormmore	not yet, still teaching the Devs how to create containers	18:06
stormmore	it will get more interesting when I migrate out AWS to our own hw	18:06
stormmore	I just love all the "pretty" dashboards that I can "show off" to management	18:07
stokachu	stormmore, it's what promotions are built on :)	18:22
stormmore	stokachu not in a startup where you already report to the CEO	18:25
stormmore	job security maybe	18:26
Zic	ping back lazyPower and mbruzek	18:29
Zic	sorry transports was hell	18:29
stokachu	stormmore, hah, maybe some nice hunting retreats	18:31
stokachu	stormmore, that may just be here in the south though	18:31
stormmore	yeah that is a southern thing stokachu, definitely not a bay area thing	18:31
Zic	lazyPower: what do you need from etcd, just the journalctl entries?	18:32
lazyPower	Zic - can you grab that, the systemd unit file /var/lib/systemd/etcd.service and the defaults environment file /etc/defaults/etcd	18:37
lazyPower	er	18:37
lazyPower	sorry /lib/systemd/service/etcd.service	18:37
lazyPower	i clearly botched teh systemd unit file location. herp derp	18:37
Zic	/lib/systemd/system/etcd.service?	18:39
Zic	because /lib/systemd/service does not exist :)	18:39
lazyPower	correct	18:39
Zic	'k	18:39
Zic	http://paste.ubuntu.com/23876230/	18:40
Zic	etcd.service date of Dec 18th Jan 16th, if it can help	18:42
Zic	oops, missing copy/paste	18:42
Zic	etcd.service is Dec 18th and /etc/default/etcd is Jan 16th	18:42
lazyPower	ok these unit files appear to be in order. We found some issues that also look related to the core problem regarding flannel not actually running on the units	18:47
lazyPower	it failed contacting etcd	18:47
mbruzek	Zic are you able to hangout with us for a debug session?	18:47
Zic	so sorry but I can't, my wife and my child will kill me if I run into a debugging-session with audio :/ but really appreciate your kindness, thanks	18:50
Zic	I'm out-of-office actually but I can do some IRC discretely :)	18:50
Zic	lazyPower: yes, we always discuss about this, but I `systemctl start flannel` at all rebooted kubernetes-worker	18:51
Zic	or maybe the time Flannel is not running sets the problem?	18:51
Zic	(and I also did a `juju resolved` on the flannel unit which was in error)	18:52
Zic	as you taught me :)	18:52
lazyPower	Zic - seems like flannel is having an issue contacting etcd per the debug output from kubernetes-master	18:52
lazyPower	which in turn is causing the kubernetes api to not be available to pods, which is causing the pod crashloop	18:53
mbruzek	Zic: Can you pastebin the /var/run/flannel/subnet.env also ?	18:53
Zic	hmm, I remembered to start the flannel service after every kubernetes-worker reboot	18:53
Zic	not on master	18:53
Zic	(as Juju doesn't tell me anything is on error after master reboots)	18:54
mbruzek	Zic Were the flannel services not autostarting?	18:54
Zic	yes, on worker	18:54
ryebot	fwiw, that issue was fixed in 1.5.2	18:55
Zic	nice, noted	18:56
Zic	I will plane an upgrade if I'm able to recover from this crash	18:56
mbruzek	Zic: Can you get the /var/run/flannel/subnet.env file for us?	18:57
Zic	oh sorry, I missed your message	18:57
Zic	http://paste.ubuntu.com/23876307/	18:59
ryebot	Thanks, Zic	18:59
=== frankban is now known as frankban\|afk
Zic	lazyPower: at this point, you can tell me the truth, do you think I will be able to recover from this crash? :D not so important because it's not in prod yet, and it's easy to redeploy from scratch	19:24
Zic	but to know of what mu monday will be made :)	19:25
Zic	s/mu/my/	19:25
lazyPower	Zic - we have some ideas, but nothing definitive for the root cause so its hard to point at what a fix would be without access to real time debug.	19:25
lazyPower	Zic - we're trapped in a meeting atm thats starting to wind down, but we've been scrubbing through the logs you sent all morning, it all seems to point back to flannel + etcd as the core of the issue	19:25
lazyPower	mbruzek - any ideas left to try before we call it DOA?	19:26
Zic	don't hesitate to ping me for more info, I'm @home but lurking at IRC as usual	19:27
lazyPower	Zic - will do. just pending feedback from the remainder of the team actively cycling on the issue.	19:27
lazyPower	i did the same oeprations you outlined on my home cluster running 1.5.1 and i got the intermediary flannel connection issue	19:27
lazyPower	it resolved once i restarted the master however	19:27
Zic	I rebooted in this order : juju controller, kube-api-loadbalancer+easyrsa, kubernetes-master, kubernetes-worker	19:28
lazyPower	That all sounds correct to me in terms of ordering	19:29
Zic	I don't try to reboot anything since then	19:29
lazyPower	would you mind terribly trying to reboot the kubernetes-master unit one last time to see if it "unsticks" the error?	19:29
Zic	but I can if it's needed	19:29
Zic	yeah, I can	19:29
Zic	the 3 ones?	19:29
Zic	one by one?	19:29
lazyPower	Zic - i would pick one, and restart it yes. identified as the leader	19:30
lazyPower	start there, and lets see what results we get back from that single reboot	19:31
lazyPower	if it looks promising, then cycle one by one the other two nodes	19:31
lazyPower	s/nodes/units/	19:31
Zic	reboot launched on the active master	19:31
lazyPower	Zic - i need to cycle into another role at the moment, but i'm leaving you in the very capable hands of mbruzek, ryebot, and Cynerva - they're going to keep the stream alive and ask for details about post-reboot	19:32
ryebot	Ready and waiting, Zic :)	19:33
Zic	oki, thanks anyway lazyPower for your great help :)	19:33
lazyPower	Zic - no, thank YOU for the patience during this debugging session. I know its unnerving	19:33
Zic	ryebot: o/, reboot finished, do I start the flannel.service?	19:33
lazyPower	if we can fix it we'd like to do so	19:33
ryebot	Zic: yes, please	19:34
mbruzek	Zic: Yes	19:34
Zic	started, systemctl status seems correct	19:34
ryebot	great, and /var/run/flannel/subnet.env exists?	19:34
lazyPower	mbruzek ryebot - the flanneld unit not being up before kube-apiserver/scheudler/controller-manager might be a bigger portion of the error set as well.	19:34
lazyPower	during bootstrap that happens after flanneld has indiciated its running and available	19:35
stormmore	Hey lazyPower just out of curiousity, do you know why the k8s bundle adds flannel connection to the k8s master that takes a full /24? seems like a bit of a waste to me	19:35
lazyPower	stormmore - expedience, that seems like an area we can optimize	19:35
Zic	http://paste.ubuntu.com/23876577/	19:35
ryebot	lazyPower ack, we'll investigate	19:35
Zic	ryebot: ^	19:35
ryebot	Zic great, thanks	19:35
Zic	do I mark the flannel/0 unit as "resolved" with juju cli?	19:36
Zic	it's in error atm	19:36
stormmore	lazyPower - seems like it, still not a deal breaker for me ;-)	19:36
ryebot	Zic, yes, please	19:36
Zic	done, it's green	19:36
ryebot	great	19:36
ryebot	one sec	19:37
lazyPower	stormmore glad to hear it :) As you can see per the channel logs, we take deployments seriously and value all feedback. keep it comin. if you'd like to file a bug against github.com/kubernetes/kubernetes regarding the master service cider range we can angle to get it on the roadmap in the future.	19:37
mbruzek	Zic: Can you also reboot the other masters and start flanneld as well.	19:37
Zic	ok	19:38
Zic	ok for the 2nd master	19:40
Zic	also for the 3rd one	19:41
ryebot	okay, and you resolved the errors?	19:41
Zic	yep	19:41
mbruzek	Zic can you try to create a simple pod to see if that works?	19:41
stormmore	lazyPower this channel is one of the reasons I am driving adoption in my company of mass and juju to underpin the k8s environment instead of other options	19:41
mbruzek	Are you still in a Crash Loop back off?	19:41
lazyPower	stormmore <3	19:41
lazyPower	we appreciate you too	19:42
mbruzek	Thanks stormmore!	19:42
Zic	mbruzek: I will deploy a simple nginx pod, let me few seconds	19:42
ryebot	Thanks, Zic	19:42
Zic	$ kubectl run my-nginx --image=nginx --replicas=2 --port=80	19:44
Zic	Error from server (Forbidden): deployments.extensions "my-nginx" is forbidden: not yet ready to handle request	19:44
Zic	same from before	19:44
ryebot	alright, thanks, Zic, one moment	19:44
Zic	I run it a 2nd time, same error	19:45
Zic	but the 3rd time operation was executed	19:45
Zic	so, same as earlier	19:45
stormmore	so on a less work related topic, when is the next convention that I can try and twisted the boss' arm into letting me attend?	19:45
mbruzek	Zic what master does your kubectl point to?	19:45
Zic	the kube-api-load-balancer	19:46
Zic	I can test from a master directly	19:46
mbruzek	Zic: Do you have a bundle or description of how you deployed your cluster?	19:46
Zic	yeah : 1 machine for kube-api-loadbalancer and easyrsa (identified by mth-k8slb-01 in my infra), 3 machines for kubernetes-master (mth-k8smaster-0[123]), 8 kubernetes-worker (mth-k8s-0[123], mth-k8svitess-0[12], mth-k8sa-0[123])	19:49
Zic	and one Juju controller of course : mth-k8sjuju-01	19:49
mbruzek	Zic: Can you pastebin juju status?	19:49
Zic	of course	19:49
Zic	http://paste.ubuntu.com/23876663/	19:50
lazyPower	mbruzek - one thought just occured to me as well, if Zic hasn't updated to our 1.5.2 release, the api-lb stil has the proxy buffer issue which reared its ugly head on delete/put requests as well.	19:50
lazyPower	but not certain if thats pertinent	19:50
Zic	(oh, I forgot to mention mth-k8setcd-0[12345] which running etcd parts)	19:50
Zic	mbruzek: ^	19:50
Zic	all is added manually ("cloud-manual" provisionning of Juju, even the AWS-EC2 instance)	19:52
Zic	I don't use the AWS connector since I have baremetal servers and EC2 instances in the same juju controller	19:52
mbruzek	Zic would you be able to upgrade this cluster to see if our operations code corrects this problem?	19:55
Zic	yeah, it will be the first upgrade I conduct through Juju since the bootstrapping of this cluster :)	19:55
Zic	what is the recommended way?	19:55
mbruzek	We have that documented in this blog post: http://insights.ubuntu.com/2017/01/24/canonical-distribution-of-kubernetes-release-1-5-2/#how-to-upgrade	19:56
mbruzek	Since you don't seem to be using the bundle, I would recommend the "juju upgrade-charm" steps	19:56
mbruzek	We can walk you through it.	19:56
Zic	I use the canonical-kubernetes but just scale etcd from 3 to 5 and master from 1 to 3 in the Juju GUI	19:57
Zic	(and make easyrsa to be on the same machine as kube-api-loadbalancer)	19:58
Zic	the canonical-kubernets bundle*	19:58
mbruzek	Zic: Ah I see, still would recommend the upgrade-charm path	19:58
Zic	ok, I just read the how-to-upgrade section, never use the upgrad-charm one by one :(	19:59
mbruzek	There is a first time for everything!	19:59
Zic	:)	19:59
Zic	so I run `juju upgrade-charm <on_every_charm_one_by_one>` ?	20:00
mbruzek	Zic: Just the applications in the cluster, kubernetes-master, kubernetes-worker, etcd, flannel, easyrsa, and kubeapi-load-balancer	20:01
mbruzek	The _units_ will upgrade automatically when the _application_ does	20:02
mbruzek	So you don't need to use the /0 /1 /2 /3	20:02
ryebot	Zic: Right, so literally as it is in those docs :)	20:02
Zic	ok	20:02
Zic	just to let me know, for the future (~yay~), is it the classical recommended way in a cluster like mine?	20:03
Zic	so I will try to memorize this :)	20:03
ryebot	Zic: As far as I know, for custom deployments, yes.	20:05
ryebot	You might be able to use your own bundle without version numbers, but I'm not sure. I'll have to investigate.	20:05
mbruzek	Zic: Technically you could export a bundle of your current system and it would be reproducable, as well as you could edit the version numbers of the charms and upgrade your cluster in one step	20:05
mbruzek	Zic: But we are just trying to fix the cluster here, we can add automation and reproduction as following steps	20:06
Zic	ok, so as I modify the number of units and replace the easyrsa charm, the out-of-the-box way is via juju upgrade-charms, that's all I want to know, it's not the time to automation I agree :)	20:07
Zic	(upgrades is running btw)	20:07
Zic	s/replace/place in the same machine of kube-api-loadbalancer/	20:08
mbruzek	Zic: Actually you can export the deployment from the Juju GUI in one step, and it will have the machines and everything just as you have it now. We could then copy that yaml and edit it.	20:08
Zic	smooth, noted, I will study this after :)	20:09
Zic	(easyrsa, kube-api-load-balancer already OK, kubernetes-master in progress, I will follow then with etcd and kubernetes-master)	20:10
Zic	worker*	20:10
Zic	hmm	20:11
Zic	I had a `watch "kubectl get pods --all-namespaces"` running	20:11
mbruzek	Zic: The documenation describe this process: https://jujucharms.com/docs/2.0/charms-bundles	20:11
Zic	and suddenly all the pods switched to Running after the kubernetes-master upgrade	20:11
Zic	\o/	20:11
mbruzek	Excllent!	20:11
Zic	let me check carefully	20:11
lazyPower	@!#$%^@!#$%!@#$%	20:11
lazyPower	AWESOME mbruzek GREAT WORK!	20:11
Zic	yeah, all is running	20:12
mbruzek	sweet	20:12
Zic	I continue the upgrade-charms	20:12
lazyPower	i think this calls for a tradmarked WE DID IT!	20:12
Zic	(I will try to schedule a new pod deployments just after)	20:12
Zic	clap clap anycase o/	20:12
Zic	oh maybe "clap clap" is not sounding like a STANDING APPLAUSE in English, sorry :)	20:13
mbruzek	Zic: It translated in my head just fine	20:13
Zic	:)	20:13
Zic	etcd charms is upgrading	20:14
Zic	<crossfinger>	20:14
Zic	done for etcd, I'm ending with kubernetes-worker charm	20:15
Zic	all upgrades is finished	20:17
Zic	let me run a kubectl deployment	20:17
mbruzek	Zic: Can you create a small test deployment?	20:17
Zic	yep	20:17
Zic	(oh btw, kubectl get ns does not return any of the old locked Terminating namespace)	20:17
Zic	it's clean now	20:17
lazyPower	yassssss	20:18
ryebot	great	20:18
Zic	so, it works but something weird but maybe normal : some Ingress goes to CrashLoopBackOff but it looks to come back in Running atm	20:20
Zic	yeah, they are all running now	20:20
Zic	some strange flap	20:20
lazyPower	it the default-http-backend pod is under re-creating it'll crash loop the ingress controller until the backend stabilizes	20:20
lazyPower	thats known and normal	20:20
Zic	oh, so it's normal	20:20
Zic	COOL :D	20:21
lazyPower	phwew	20:21
lazyPower	man, fan-tastic	20:21
Zic	so your new version amazingly correct everything	20:21
Zic	and I don't know if it's tied to it, but Scheduling is much more quicker	20:21
lazyPower	thats a 1.5.2 fixup :)	20:21
Zic	I waited sometime up to 3min between Scheduling and ContainerCreating	20:22
lazyPower	plus you probably have less pressure on the etcd cluster without those large namespaces	20:22
Zic	here it was done in 10s	20:22
mbruzek	Zic: The upgrade process installed new things and reset the config files to what we would expect, that is why I think you are having so much success here.	20:24
mbruzek	Zic: Is everything ok? do you have any other problems?	20:24
Zic	I'm checking some more test but all seems OK now !!	20:25
mbruzek	Zic: There was also a fix for LB where we turned off the proxy buffering in the 1.5.2 update.	20:26
Zic	hmm, some Ingress stayed in CrashLoopBackOff and others are running correctly:	20:28
Zic	http://paste.ubuntu.com/23876828/	20:28
mbruzek	Zic it is possible that those are still being "restarted" by the operations	20:30
mbruzek	But good information.	20:30
Zic	hmm, kubernetes-dashboard is staying in CrashBackLoopBackOff, I forgot to check the kube-system namepace :/	20:31
Zic	(kube-dns was also Crashed, but it is Running since the upgrade)	20:31
ryebot	Zic, is your juju status all-green right now?	20:32
Zic	yep	20:32
ryebot	thanks	20:32
Zic	heapster also was CrashLoopBackOff and just Running fine now	20:36
mbruzek	Zic: so just the ingress ones are in CLBO ?	20:37
Zic	the only crashloopbackoff now is the dashboard and some of the Ingress	20:37
Zic	oh I anticipated the question :)	20:37
Zic	I'm trying to relaunch the pod of kubernetes-dashboard, maybe it will help	20:38
Zic	2m1s163{kubelet mth-k8svitess-01}WarningFailedSyncError syncing pod, skipping: failed to "SetupNetwork" for "kubernetes-dashboard-3697905830-qv6hv_kube-system" with SetupNetworkError: "Failed to setup network for pod \"kubernetes-dashboard-3697905830-qv6hv_kube-system(8785f143-e4cc-11e6-b87d-0050569e741e)\" using network plugins \"cni\": open /run/flannel/subnet.env: no such file or directory; Skipping pod"	20:40
mbruzek	Zic: Did you get to juju upgrade-charm flannel?	20:41
Zic	no, I just realize that, I thought it was a part of the kubernetes-worker charms	20:41
Zic	can I run its upgrade now?	20:41
mbruzek	Zic: Yes please	20:42
Zic	done	20:44
mbruzek	Zic: Can you please pastebin `kubectl logs nginx-ingress-controller-jlxr5` ?	20:45
mbruzek	We want to see why that is in a CLBO	20:45
Zic	http://paste.ubuntu.com/23876915/	20:46
Zic	hmm	20:47
Zic	I didn't have this error before	20:47
mbruzek	Are they still in crash loop back off?	20:47
Zic	yep	20:47
Zic	oh	20:47
Zic	dashboard is running	20:47
Zic	and 2 Ingress are back in Running	20:48
ryebot	6 total ingress running now?	20:48
Zic	2 are always in CLBO but I think it will combe back	20:48
Zic	hop, all Running	20:48
ryebot	8 total running ingress?	20:48
Zic	yep	20:48
ryebot	great	20:48
Zic	all is Running now, and I'm checking --all-namespaces	20:48
Zic	(I forgot kube-system earlier :/)	20:49
Zic	dashboard is working effectively	20:49
Zic	hmm, 2 Ingress are returning in CLBO atm	20:49
mbruzek	Zic: Send us pastebin logs for those	20:50
Zic	http://paste.ubuntu.com/23876940/	20:50
Zic	I have another log from the RC of Ingress:	20:52
Zic	Liveness probe failed: Get http://10.52.128.99:10254/healthz: dial tcp 10.52.128.99:10254: getsockopt: connection refused	20:52
ryebot	thanks, looking	20:53
=== wolverin_ is now known as wolverineav
Zic	so the only CLBO part now is that 2 Ingress:	20:53
Zic	default nginx-ingress-controller-vg9qc 0/1 CrashLoopBackOff 17 13h	20:53
Zic	default nginx-ingress-controller-w1dhl 0/1 CrashLoopBackOff 92 10d	20:53
lazyPower	Zic - which namespace is this in?	20:54
Zic	default, it's the builtin nginx-ingress of the canonical-kubernetes bundle	20:54
lazyPower	ack, ok	20:54
Zic	(I don't add any Ingress myself)	20:54
Zic	ingress controller* to use the right terminology	20:54
lazyPower	can you kubectl describe po nginx-ingress-controller-vg9qc \| pastebinit	20:55
Zic	(Actually, I add Ingress, but not Ingress controller)	20:55
Zic	http://paste.ubuntu.com/23876964/	20:56
ryebot	thanks	20:56
lazyPower	Zic - thats totally fine	20:57
lazyPower	hmm nothing is leaping out at me from the pod description.... it did say it was failing health checks	20:57
lazyPower	from earlier pastes it looked like it was running out of file descriptors	20:58
Zic	in fact, my Ingress deserve well my testing website	21:00
Zic	but this two controller stays in CLBO :/	21:00
Zic	the 6 others is well Running	21:00
cholcombe	is it possible to tell juju to give my instance 2 ip addrs when it starts them?	21:00
lazyPower	cholcombe - you have to sacrifice 40tb of data and do the chant of "conjure-man-rah"	21:01
Zic	lazyPower: if I reboot (yeah, it's a bit brutal) nodes which host this two CLBO, maybe it will repop as Running?	21:02
cholcombe	man i need to brush up on that chant haha	21:02
lazyPower	in short, i dont know but i think extra-bindings and spaces is what would introduce that functionality	21:02
Zic	lazyPower: it's what I did for kubernetes-dashboard	21:02
lazyPower	s/know/so/	21:02
lazyPower	no no, nevermind that last edit, it was right	21:02
lazyPower	Zic - its worth a shot.	21:02
stormmore	hmmm that is curious for an "idle" cluster, one of my nodes just dropped 4GB memory usage and increased network tx for potential no obvious reason!	21:03
Zic	lazyPower: it seems to work, weird but I like it xD	21:07
Zic	I will wait few minutes before confirming	21:07
lazyPower	Zic - i'm going to blame gremlins for that one	21:07
Zic	default nginx-ingress-controller-7qcsn 0/1 CrashLoopBackOff 10 13h 10.52.128.135 mth-k8sa-01	21:10
Zic	default nginx-ingress-controller-lx6kt 0/1 CrashLoopBackOff 16 13h 10.52.128.253 mth-k8sa-02	21:11
Zic	it don't work so long :)	21:11
ryebot	Zic, another option is destroying them; the charm should launch new ones to replace them on the next update (<5 mins)	21:11
Zic	with the same error: F0127 21:11:44.648841 1 main.go:121] no service with name default/default-http-backend found: the server has asked for the client to provide credentials (get services default-http-backend)	21:12
Zic	ryebot: thanks, I will try	21:12
Zic	it pops again and they are Running	21:13
Zic	let few minutes pass to confirm :)	21:14
ryebot	+1	21:14
Zic	CLBO for one of them	21:15
mbruzek	Zic: Please try one more thinkg	21:15
mbruzek	juju config kubernetes-worker ingress=false	21:15
mbruzek	wait for the pods to terminate.	21:15
Zic	the two new bringed up ingress-controller are now in CLBO again	21:15
Zic	mbruzek: ok	21:16
mbruzek	Then you should be able to juju config kubernetes-worker ingress=true	21:16
Zic	mbruzek: (it mades me remember that at the end, I need to shut off the debug I enabled earlier on the kubernetes-master?)	21:17
Zic	all pods are terminated	21:18
Zic	rewitch to true	21:19
Zic	all is Running, let's wait some minutes	21:20
Zic	they are all in CLBO now xD	21:20
Zic	but it switch again to Running	21:21
lazyPower	it will register as running when they initially come up, but have to pass the healthcheck	21:21
lazyPower	Zic - can you juju run --application kubernetes-worker "lsof \| wc -l" and pastebin the output of that juju run command?	21:21
Zic	http://paste.ubuntu.com/23877086/	21:23
Zic	All Ingress are flapping between CLBO and Running now	21:24
lazyPower	i'm not positive, but k8s worker/2 and k8s-worker/1 seem to have a ton of open file descriptors	21:24
lazyPower	looks like something is leaking file descriptors :\|	21:24
lazyPower	which would explain the crash loop backoff on a segment of the ingress controllers vs the handfull that succeed	21:25
lazyPower	Zic - at this point we'll need a bug about this, and can look into it further, but we're not in a position to recommend a fix at this time.	21:26
lazyPower	we have encountered this before but the last patch that landed should have both a) enlarged the file descriptor pool, and b) hopefully corrected that. we might be behind in a path release on the ingress manifest that fixed this	21:26
lazyPower	i'll take a look in a bit, but i think we're 1:1	21:27
Zic	in fact the Ingress is working even if the Ingress Controller are flapping	21:27
lazyPower	do you have any ingress controllers listed as running that are not in CLB?	21:27
Zic	strange	21:27
Zic	yeah, as they flap between the two state not at the same time	21:27
lazyPower	if thats the case, kubernetes will do its best to use a functional route throuth the ingress API	21:27
lazyPower	as it round-robin distributes them. you'll likely find some requests that get dropped if you run something like apache-bench or bees-with-machine-guns against it	21:28
lazyPower	but typical single user testing likely looks fine	21:28
ryebot	Zic, one last thing - can you pastebin /lib/systemd/system/kubelet.service on a kubernetes-worker node?	21:28
Zic	http://paste.ubuntu.com/23877117/	21:30
lazyPower	juju run --application kubernetes-worker "sysctl fs.file-nr" \| pastebinit -- as well would be helpful in ensuring it is indeed related to file descriptors. this will list the the number of allocated file handles, the number of unused-but-allocated file handles, and the system wide max number of file handles	21:30
Zic	http://paste.ubuntu.com/23877122/	21:31
lazyPower	well those fd numbers went way down	21:32
lazyPower	however there are very different configurations listed there	21:32
lazyPower	where some have 400409 and some have 6573449 listed as max	21:32
ryebot	They might be his baremetal/ec2 machines?	21:32
lazyPower	ah, you are correct	21:32
lazyPower	different substrates	21:32
lazyPower	different rules	21:32
Zic	yeah, only mth-k8s-* and mth-k8svitess-* are robust physical servers	21:33
lazyPower	all the units appear to be well within bounds of those numbers though	21:33
Zic	mth-k8sa-* are EC2 instances	21:33
lazyPower	so maybe its not FD Leakage	21:33
Zic	http://paste.ubuntu.com/23877130/ <= and this error about credential remembered some bad hours of this day	21:35
Zic	(which was the same error returned by some kubectl commands and the dashboard earlier)	21:35
Zic	21:28 is @UTC	21:36
Zic	so few minutes ago	21:36
mbruzek	Zic: Can you please file a bug about this problem on the kubernetes github issue tracker? https://github.com/kubernetes/kubernetes/issues	21:38
mbruzek	Maybe someone else knows why the ingress would be in CrashLoopBackOff.	21:38
mbruzek	Please list if you added any ingress things and what manifest you used for that.	21:39
Zic	hmm, about this, maybe I can delete my Ingress	21:39
mbruzek	Put Juju in the title and your best estimation on how to reproduce these errors.	21:39
mbruzek	Zic: The ingress=false would have deleted them no?	21:40
mbruzek	Did you put in your own _different_ ingress objects?	21:40
Zic	if you talk about Ingress controller, no, I stayed with the default nginx-ingress-controller of the charms bundle	21:41
Zic	but I create two Ingress yep	21:41
lazyPower	ingress objects are related to, and depend on the ingress controller, but have very little to do with ingress controller operations	21:41
lazyPower	unless i'm misinformed	21:41
Zic	ok, so I promise you to fill a bug tomorrow morning, it's late now and if I want to fill a well-written/described bugs I prefer to do it seriously :) I will post the Issue link here tomorrow	21:44
Zic	the step-to-reproduce part will be the most difficult part	21:44
mbruzek	Zic yes	21:44
mbruzek	Zic, I just don't know how to reproduce this	21:44
mbruzek	I realize it is late for you, sorry about the problems.	21:44
Zic	no worries, you were all formidable to help me and focus on this problem during hours today, I can at least pursue the debug on IRC even if I'm out of office!	21:46
Zic	thanks a lot mbruzek, ryebot, lazyPower, jcastro	21:49
ryebot	Happy to help, Zic!	21:50
Zic	I will pop again here tomorrow about the issue I will report	21:50
ryebot	thanks, feel free to me ping when you post it, I'd like to track it	21:51
ryebot	ping me* :)	21:51
Zic	huh	21:51
Zic	I deleted and recreated my Ingress, and controller are now Running without flapping since 6min	21:52
Zic	engineering is SO 2016, magic is the new way	21:52
Zic	:\|	21:52
ryebot	lol	21:52
ryebot	Zic, when you say you deleted your ingress, can you provide the command you executed?	21:52
Zic	kubectl delete ing <my_two_ingress>	21:53
Zic	(one was exposing a nginx-deployment-test with nodeSelector on machine labelled at our datacenter/Paris (baremetal servers))	21:53
Zic	(and one another on EC2 only)	21:54
Zic	8min without flapping, lol	21:54
ryebot	I wonder if there was a conflict with our automatic ingress scaling. Shouldn't be, but I should probably make sure.	21:54
Zic	I try to keep all traces to fill a bug anyway tomorrow	21:55
ryebot	Hmm, you said you recreated them, too, so I guess that can't be it.	21:55
ryebot	I guess you're right... magic!	21:55
Zic	and add how I resolved, if it don't flap from tommorrow	21:55
ryebot	+1 sounds good, thanks Zic.	21:55
Zic	ryebot: yeah, and it's really simple Ingress, with only one rule on the hostname	21:55
Zic	like baremetal.ndd.com for the first, ec2.ndd.com for the second	21:56
Zic	11min without flapping \o/	21:56
ryebot	\o/	21:56
* Zic will buy some magic powder before sleeping		21:57
ryebot	heh	21:57
Zic	hmm, I discovered some more information in kubectl describe ing	22:00
Zic	all the operation that the controller did	22:00
Zic	I didn't know where to find those	22:00
Zic	I have a lot of 42m42m1{nginx-ingress-controller }WarningUPDATEerror: Operation cannot be fulfilled on ingresses.extensions "nginx-aws": the object has been modified; please apply your changes to the latest version and try again	22:01
Zic	during the CLBO period	22:01
Zic	and now it's working, juste MAPPING action	22:01
Zic	just*	22:01
Zic	(no flapping for 38min :p I'm going to bed, g'night and thanks one more time to all of you :))	22:18
lazyPower	Zic : excelelnt news. have a good sleep and enjoy your weekend o/	22:18
Zic	I will ping back with the GitHub issue tomorrow o/	22:18
stormmore	wish there was a good recommendation guide for SDN hardware :-/	22:24
stormmore	(still think runninig some of the maas services on an sdn switch would rock!)	22:24
=== scuttlemonkey is now known as scuttle\|afk
stormmore	since it is Friday and I am not doing anhything to affect my cluster but I want to learn more about juju, is it possible run your own charm "store"?	22:46
lazyPower	stormmore - not at this time	22:49
stormmore	lazyPower another "future work" thing then :)	22:49
lazyPower	stormmore - you can keep a local charm repository which is just files on disk, but as far as running the actual store display + backend service(s), thats not available for an on-premise solution	22:49
bdx	yo whats up with the charmstore ?	23:37
bdx	ERROR can't upload resource	23:37
bdx	will we ever fix this?	23:37
bdx	killin' me here	23:37

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!