=== thumper is now known as thumper-afk === thumper-afk is now known as thumper [06:11] cory_fu: Should I do a followup MP that tears out the Python2 import machinery completely? Is there any Python2 code still out there using charms.reactive? [06:32] @Zic thanks. Today is a bit busy for me, but can we do a call like next week? === scuttlemonkey is now known as scuttle|afk === frankban|afk is now known as frankban [08:23] Good morning Juju world [08:29] hello :) [08:29] Can I upgrade a charm which is installed via cs, but I now want it to use a local version? [08:30] or use code.launchpad.net for its source? [08:31] BlackDex, yes. check out the --switch flag of upgrade-charm [08:31] oke [08:33] i think i need --path :) [08:47] Hi juju world [08:47] do we need to create terms each time we are pushing to the charm store .. [08:48] or is it enough to create them one time [09:11] SaMnCo: I'm also very busy at the office at the moment because of Vitess (the Canonical Kubernetes part was one of the quicker ones :D) as we're late on the deadline, but I'm available through IRC all the time (France, UTC+1 o/). If you prefer an audio call I will try to find a solution :) [09:12] feel free to pm me if you need [09:17] I saw jcastro's blog post in the Ubuntu Newsletter: so conjure-up is now the official way to install Kubernetes through Juju? I personally used the "manual provisioning" of Juju as I'm on bare-metal servers and don't use Ubuntu MAAS [09:18] I will surely bootstrap new k8s clusters, so I'm wondering if I should continue this way or begin to use conjure-up for the next ones [09:19] (I know that conjure-up is just an ncurses-like GUI for Juju, but I don't know if this install method does exactly the same as what I did) [09:24] BlackDex, Aha. I was close! [09:48] aisrael: You indeed were, and it worked :) [09:48] --switch --revision and --path are mutually exclusive :) [10:00] Zic are you using MAAS? [10:01] for bare metal management? [10:01] if you do, then conjure-up will help. If not and you are on full manual provisioning then I guess you'll be good with your current method. [10:06] conjure-up is a wizard to provide some help [11:33] SaMnCo: ok, yeah we don't use MAAS as we have a broadly similar homemade product here, so I bootstrap Ubuntu Server from it and add it via juju add-machine over SSH [11:34] and when I want to deploy the canonical-bundle charms, I delete all the "newX" machines Juju wants to spawn, and reassign charms to the machines already added manually [11:34] (just via drag'n'dropping) [11:35] at that step, I personally scale etcd to 5 instead of the default 3, and put the EasyRSA charm on the same machine as kube-api-load-balancer [11:35] (and scale kubernetes-master to 3 also, I forgot to mention) [11:37] even though we have a MAAS-like tool in our company, maybe I will try in the future to set up MAAS here just to automate everything with Juju :) [12:45] Zic that or write a juju provider for your tool. Is it all in-house development or another product like crowbar ? [13:08] SaMnCo: completely homemade, it was built to let our customers reinstall their VMs or physical servers from our information system [13:18] how can i define a local charm in a bundle file? [13:19] or at least which directory does it look in? "local:xenial/charm-name" should be enough i think? [13:20] BlackDex: charm: "./build/xenial/my-charm" [13:20] for example [13:21] so instead of local i can just input the full path?
[13:21] yep [13:21] oke :) [13:21] nice [13:21] thx [13:21] in juju 1.25 it took some hassle [13:21] if you download the bundle through the GUI you must change those manually [13:22] I haven't found a better way to do that [13:22] that's no prob [13:23] i have a bundle file already :) [13:23] using the export via the gui makes a messy bundle file in my opinion [13:23] that is true [13:24] i don't need the placements for instance [13:24] or annotations as they are called [13:25] oh strange, I see a lot of "falsetrue" in the exported file [13:25] those are values which should be default [13:25] :q [13:26] Zic: yeah for first use we went with conjure-up because it's a better user experience, especially for those getting started, it's all juju under the hood though so it's all good. === scuttle|afk is now known as scuttlemonkey [13:45] Zic: wow. This is a significant engineering effort, congrats on building that. [14:34] hmm, my kubernetes-dashboard displays a 500 error with "the server has asked for the client to provide credentials (get deployments.extensions)" [14:35] have you seen that before? I just ran apt update & upgrade & rebooted kubernetes-master and etcd, one by one [14:43] the juju status is all green === mskalka|afk is now known as mskalka [14:46] that one sounds like a bug [14:46] but mbruzek and lazypower aren't awake yet :-/ [14:46] stub: There is not. Other parts of the framework, mainly the base layer due to the wheelhouse, require Python 3. So, +1 to pulling out py2 support [14:47] jcastro: I also have some errors when running commands that create or delete resources, but they are random compared to the kubernetes-dashboard error: [14:47] kubectl create -f service-endpoint.yaml [14:47] Error from server (Forbidden): error when creating "service-endpoint.yaml": services "cassandra-endpoint" is forbidden: not yet ready to handle request [14:48] this kind of error [14:48] ok as soon as one of them shows up we'll set aside some time and get you sorted [14:49] thanks a lot [14:49] I will try to debug and collect some logs [14:57] http://paste.ubuntu.com/23875089/ [14:58] "has invalid apiserver certificates or service accounts configuration" hmm [15:01] Zic - that's a new one to me, hmmmm [15:02] many pods are in CrashLoopBackOff, such as Ingress also :/ [15:02] sounds like something botched during the upgrade. you ran the deploy upgrade to 1.5.2 correct? [15:03] W0127 15:01:40.848867 1 main.go:118] unexpected error getting runtime information: timed out waiting for the condition [15:03] F0127 15:01:40.850545 1 main.go:121] no service with name default/default-http-backend found: the server has asked for the client to provide credentials (get services default-http-backend) [15:03] I just upgraded the OS via apt update/upgrade [15:03] Did the units assign a new private ip address to their interface perhaps? [15:03] and rebooted the machines which host kube-api-load-balancer, kubernetes-master and etcd [15:05] lazyPower: hmm, to the eth0 interface? [15:06] Zic - correct.
The units request TLS certificates during initial bootstrap of the cluster, and we don't yet have a mechanism to re-key with new x509 data, such as if the ip addressing changes [15:06] which would yield an invalid certificate if the ip addresses changed [15:06] i'm trying to run the gamut of what might have happened to cause this in my head [15:08] I only use one private eth0 interface (static) for management VMs like master, etcd and kube-api-loadbalancer/easyrsa [15:09] for workers, I use bonding on two private interfaces [15:09] but nothing changed in this area :( [15:09] ok i don't think that's the issue then if the addressing hasn't changed [15:09] hmmm [15:09] lazyPower, Zic would maybe removing the relation to easyrsa and adding it again fix it? [15:09] for info, I reboot the VM which hosts the juju controller also [15:10] rebooted* [15:10] Zic - i don't think it's a juju controller issue, it's an issue with the tls certificates it seems. Something changed that's causing them to be invalid which is causing a lot of sick type symptoms with the cluster [15:10] let me check the date of the cert files [15:11] is it in /srv/kubernetes root? [15:11] yep, the keys are stored in /srv/kubernetes [15:11] 16 January [15:11] :( [15:12] SaMnCo: do I risk losing the PKI if I do that? [15:12] that's what I am asking myself, if it would just regen the certs for the whole thing or not [15:12] lazyPower would know better [15:13] SaMnCo - i'm mostly certain there's logic to check if the cert already exists in cache and will re-send the existing cert [15:13] we have an open bug about rekeying the infra but haven't taken action on it yet [15:13] and there is some strange behaviour: via kubectl, I can do read actions (get/describe) without any problem [15:13] but writes, like create/delete, sometimes return a Forbidden [15:13] (I posted the exact message above) [15:13] but for Ingress or dashboard, it's a strong "nope" [15:13] Zic - have you upgraded the kube-api-loadbalancer charm? we changed some of the tuning to disable proxy-buffering which was causing those issues [15:14] I have seen that behavior in clusters where the relation with etcd or etcd itself was messy [15:14] k8s seems to keep a state as long as it can [15:14] so if you break etcd, it will keep returning values for its current state, but will refuse to change anything [15:14] lazyPower: I didn't upgrade any Juju charm, just classical .debs via apt [15:14] oh, in the apt upgrade, I saw an etcd upgrade [15:15] can it be...? [15:15] :| i sincerely hope this is not related to the deb package doing something with what we've done to the configuration of etcd post deployment [15:15] if it is, i'm going to be upset and have nobody to complain to [15:15] I ran an etcdctl member list on the etcd machines [15:15] seems OK [15:16] but I don't know what more to do to check the health [15:16] member list and cluster-health are the 2 commands that would point out any obvious failures [15:16] etcdctl cluster-health [15:16] and tail the log, it tells if a member is out of sync [15:16] http://paste.ubuntu.com/23875201/ [15:16] ok so not that issue then [15:16] so that doesn't seem to be the culprit [15:17] I did an etcdctl backup before the upgrade also, just in case [15:17] excellent choice [15:17] hmm, so it seems to be tied to the CA [15:17] can I run some manual curl --cacert to one point of the API to check it?
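(A hedged sketch of the manual check Zic asks about here — the certificate file names and the :6443 apiserver port are assumptions layered on the /srv/kubernetes layout discussed above, not something confirmed in the channel:)

    # Poll the apiserver over TLS from a kubernetes-master unit, reusing the unit's
    # own certificates (which, per the next exchange, carry both client and server
    # x509 usage). A healthy apiserver answers "ok" on /healthz.
    juju ssh kubernetes-master/0 \
      'curl --cacert /srv/kubernetes/ca.crt \
            --cert /srv/kubernetes/server.crt \
            --key /srv/kubernetes/server.key \
            https://127.0.0.1:6443/healthz'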
[15:18] Zic - yeah, so long as you use the client certificate or server certificate for k8s, you should be able to get a valid response if the certificates are valid [15:18] the server certificates are generated with server and client side x509 details. meaning the k8s certificates on the unit can be used as client or server keys. [15:19] lazyPower: that's what's strange: kubectl get/describe commands always work, kubectl create/delete on the contrary work only 1 time in 3, returning a Forbidden message [15:20] and on the Ingress/default-http-backend/kubernetes-dashboard side, it's just CrashLoopBackOff :( [15:20] Zic - can you check the log output on the etcd unit to see if there's a tell-tale in there? [15:20] yep [15:20] Zic - it does sound like the cluster state storage is potentially at fault here [15:24] the etcd cluster doesn't return any weird logs, it just saw that I upgraded the etcd package :s [15:25] Zic what are the logs of the Ingress/default-http-backend/kubernetes-dashboard pods? [15:25] Unpacking etcd (2.2.5+dfsg-1ubuntu1) over (2.2.5+dfsg-1) ... [15:25] (was the update) [15:26] the '1ubuntu1' part [15:26] seems that the etcd from the Ubuntu archive installed over the Juju charm one, no? [15:26] SaMnCo: (I'm pasting you the log shortly) [15:26] Zic - that's expected. the etcd charm installs from the archive [15:27] yeah, but as it didn't have an "ubuntu"-tagged version in the deb version, I thought it was from outside of the Ubuntu archive.ubuntu.com [15:29] http://paste.ubuntu.com/23875254/ [15:29] SaMnCo: ^ [15:33] Zic - from your kubernetes master, can you grab the x509 details and pastebin it? openssl x509 -in /srv/kubernetes/server.crt -text [15:33] I don't need the full certificate output, just the x509 key usage bits so I can cross-ref this info w/ what's in the cert [15:33] i'm expecting to find IP Address:10.152.183.1, in the output [15:34] oki [15:35] Zic - additionally, if you could run juju run-action debug kubernetes-master/0 && juju show-action-output --wait $UUID-RETURNED-FROM-LAST-COMMAND [15:35] it'll give you a debug package you can ship us for dissecting the state of the cluster and we can try to piece together what's happened here [15:35] X509v3 Subject Alternative Name: [15:35] DNS:mth-k8smaster-01, DNS:mth-k8smaster-01, DNS:mth-k8smaster-01, IP Address:10.152.183.1, DNS:kubernetes, DNS:kubernetes.cluster.local, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local [15:35] i think i transposed debug and kubernetes-master [15:35] yeah, the cert's valid, it has all the right SANs i would expect to see there [15:35] that part of the certificate? [15:36] ok [15:36] let me run this juju command [15:36] * lazyPower sighs [15:36] this is a red herring, it's something else that's gone awry [15:37] error: invalid unit name "debug" [15:37] hmm? [15:37] maybe I need to swap the two args :D [15:37] juju run-action kubernetes-master/0 debug ? [15:38] Action queued with id: 99267d59-f3aa-467d-8686-130e90dc47a0 [15:38] seems to be that :) [15:38] # juju show-action-output --wait 99267d59-f3aa-467d-8686-130e90dc47a0 [15:38] error: no action ID specified [15:39] :| [15:39] juju y u do dis [15:40] Zic if you omit the --wait, it'll give you what you're looking for now [15:40] the debug action doesn't take long to run [15:41] it's just aggregating information and then offers up a tarball of files [15:42] http://paste.ubuntu.com/23875303/ [15:42] but at that path, I don't have any debug-20170127153807.tar.gz [15:42] am I missing something?
:o [15:42] Cynerva - have we encountered any situations where the debug package isn't created? [15:43] 1 sec, cc'ing the feature author [15:43] if I run the proposed juju scp it's ok [15:44] lazyPower: I haven't seen anything like that, no [15:44] wait so it did create? [15:44] lazyPower: if I run the juju scp manually, don't know if it was what you wanted from me :) [15:44] or if the show-action should exec it [15:44] Zic, lazyPower other people have had that: https://github.com/kubernetes/minikube/issues/363 [15:46] Zic - I'm looking for the payload from that juju scp command that showed up in the action output [15:46] Zic - that tarball will have several files which include system configuration, logs, and things of that nature [15:46] yeah, I have it [15:46] just untarred [15:47] Do you have a secure means to send that to us? if not i can give you a temporary dropbox upload page to send it over [15:47] yeah, I can generate you a secure link [15:47] excellent, thank you [15:48] Hello Zic, sorry I am late to the party. I heard you were having trouble with the Kubernetes cluster. [15:50] yeah :( [15:51] lazyPower: I pm-ed you the link with its password [15:51] Zic - confirmed receipt of the file [15:51] mbruzek: I just ran apt update/upgrade over the different canonical-kubernetes machines, one by one, and the API began to refuse some requests for unknown reasons [15:52] i'll take this debug package and we'll dissect it to see if we can discern what's happened post apt-get upgrade. i can't for the life of me think what went wrong but i suspect there's clues in here. [15:52] (for the TL;DR :)) [15:52] Thanks for bringing me up to speed. [15:52] Zic can you also send me the output from a kubernetes-worker node as well? [15:52] same process to run the debug action [15:53] lazyPower: just before the upgrade, I ran kubectl delete ns and it was still in Terminating when I ran kubectl get ns [15:53] don't know if that helps [15:53] Zic - it might be trash in the etcd kvstore, but i'm not positive this is the culprit yet [15:54] the goal was to delete all the large namespaces used for PoC, upgrade the whole cluster, reboot it, and begin some prod; but it seems that it will not be the right day :p [15:55] (I'm generating you the other logs) [15:55] Zic: I am sorry you ran into this problem [15:56] Zic Have you verified that kube-apiserver is running on your kubernetes-master/0 charm? [15:57] I'm running a permanent watch -c "juju status --color" [15:57] it should be red if it's not working, correct? [15:57] because all is green atm :) [15:57] Zic not necessarily [15:57] oh [15:57] let me check directly then [15:58] but even if it was that, no queries would work at all; here I have some success via kubectl get/describe, random success with kubectl create/delete (resulting in a Forbidden error sometimes, and working on the 2nd try...), and 0 success with Ingress & dashboard [15:58] (yeah it's running fine) [16:04] Zic - ok we're going to need a bit to sift through this data and see what we come up with [16:04] i have the whole team looking at these debug packages, i'll ping you back when we've got more details [16:04] thanks for all your help! [16:06] Zic: You rebooted the nodes after apt-get update? [16:06] yep [16:06] all of them [16:07] Zic: Do you remember about what time?
Looking at the logs I see some connection loss around 2017-01-25 10:35 [16:10] hmm, I began with the kube-api-load-balancer, the 3 kubernetes-masters and two of the etcd at ~14:15 (UTC+1) [16:11] and finished the 3 remaining etcd and all the kubernetes-workers 1 hour later, I think [16:11] Zic: OK that does not appear to be the problem then [16:11] but on the 25th of January, all was fine [16:11] (didn't notice the date, sorry) [16:13] the exact timeline is: I deleted 4 large namespaces, which were stuck forever in the Terminating state, and no pods or other resources were in Terminating, so I began to delete them one by one (without --force or --grace=0, just normally) [16:13] all pods & svc were terminated, but the namespaces still showed "Terminating" in "kubectl get ns" [16:14] as I needed to upgrade and reboot the whole cluster anyway, and saw an issue concerning this that was fixed by rebooting the masters, I did it [16:14] DELETE /apis/authorization.k8s.io/v1beta1/namespaces/production/localsubjectaccessreviews: (698.088µs) 405 -- this seems to be dumping stacks in the apiserver log [16:14] 405 response [16:14] undetermined if this is the root cause, but it is consistent [16:14] yeah so maybe it's this large deletion which is the root cause :/ [16:15] they were 4 namespaces hosting 4 Vitess Cluster labs [16:15] logging error output: "{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"the server does not allow this method on the requested resource\",\"reason\":\"MethodNotAllowed\",\"details\":{},\"code\":405}\n" [16:15] which is interesting, i know for a fact you can delete namespaces [16:15] i believe what might be the cause is that it caused some kind of lock in etcd [16:15] yeah [16:15] for my previous labs, I just deleted the ns and all was clean [16:15] and k8s is stuck trying to complete that request and etcd is actively being aggressively in denial about it [16:15] but I never deleted 4 large ones at the same time... [16:15] but not positive this is the root cause, we're still dissecting [16:17] Zic our e2e tests do large deletes of namespaces so that should be fine. [16:17] ok [16:18] atm, these namespaces are still in "Terminating" [16:18] I checked if rc, pods, services, statefulsets, all the resources were terminated [16:18] Zic Did you reboot the etcd node(s) while this was still trying to delete? Was there an order of reboot? [16:19] mbruzek: I just checked that the rc/pods/svc/statefulsets of these namespaces were properly terminated, but the namespaces were still stuck at Terminating [16:19] I rebooted the etcd nodes one by one [16:19] (and tried etcdctl member list after each reboot) [16:19] yeah [16:20] I have a previous backup from this morning for etcd [16:20] (and one after the upgrade) [16:20] the more we think this through, i think etcd is the core troublemaker here [16:20] i think the client lost the claim on the lock [16:20] because of the high amount of delete requests or because of the upgrade via apt of its package? [16:21] combination of the operation happening and then being rebooted during the op [16:21] etcd is still waiting for that initial client request to complete [16:21] :s [16:21] i hear you, etcd is very finicky, and this is exactly why we label it as the problem child [16:22] i'm looking up how to recover from this [16:22] all my troubles have been with etcd all that time :D with K8s or Vitess [16:22] Zic - can you curl the leader unit's leader status in etcd?
[16:22] eg: curl http://127.0.0.1:2379/v2/stats/leader [16:22] the leader is identified with an asterisk next to the unit-number in the juju status output [16:24] hmm [16:24] I get a non-printable character in return [16:24] I have a bad feeling about this [16:24] * lazyPower 's heart sinks a little in his chest [16:26] https://dl.iguanesolutions.com/f.php?h=1mvhf5F9&p=1 [16:28] oh wait [16:28] it's not the master [16:28] etcd/0* active idle 5 mth-k8setcd-02 2379/tcp Healthy with 5 known peers. [16:28] I will try here [16:28] same non-printable character :( [16:29] Zic: juju run --unit etcd/0 "systemctl status etcd" | pastebin [16:29] Zic - etcdctl ls /registry/namespaces [16:29] http://paste.ubuntu.com/23875518/ [16:29] mbruzek: ^ [16:30] http://paste.ubuntu.com/23875523/ [16:30] lazyPower: ^ [16:31] jma, production, integration, development were the namespaces I deleted [16:31] (which are still stuck in "Terminating" status) [16:34] hmm, lazyPower I ran the same curl with https instead of http [16:34] root@mth-k8setcd-01:~# curl -k https://127.0.0.1:2379/v2/stats/leader [16:34] curl: (35) gnutls_handshake() failed: Certificate is bad [16:34] even with the "-k" [16:34] Zic - etcd is configured to listen on http on localhost [16:34] oh ok, so it was correct [16:34] you'll need https if you poll the eth0 interface ip [16:35] I'll try [16:35] # curl -k https://10.128.74.205:2379/v2/stats/leader [16:35] curl: (35) gnutls_handshake() failed: Certificate is bad [16:37] Zic: juju run --unit etcd/0 "journalctl -u etcd" | pastebinit [16:38] http://paste.ubuntu.com/23875565/ [16:40] the 14:17-14:32 interval is the upgrade/reboot I think [16:43] Zic - etcdctl ls /registry/serviceaccounts/$deleted-namespace [16:43] do you have 'default' listed in there in any of those namespaces? [16:43] root@mth-k8setcd-02:~# etcdctl ls /registry/serviceaccounts/production [16:43] /registry/serviceaccounts/production/default [16:43] yep [16:48] sorry, I will be afk for 1 hour (breaking the K8s cluster was not a sufficient punishment, I'm also on the on-call rotation tonight... need to get home before it begins... double punishment :D) [16:49] Zic we were about to offer some face to face support. We can wait until you get home. [17:07] Zic ping us when you are back [17:08] is there a reason juju automatically adds a security group rule to every instance that allows access on 22 from 0.0.0.0/0? [17:09] I'm guessing juju just assumes you will always be accessing the instance via the public internet and not from behind a vpn? [17:56] Zic - we think we've narrowed it down to the one area we don't have visibility into at the moment, we're missing debug info from etcd, and there's no layer-debug support in the etcd charm at present. When you surface and have a moment to re-ping, we'd like to gather some more information from the etcd unit(s) in question and i think we can then successfully determine what has happened. [18:01] howdy juju world [18:01] o/ stormmore [18:02] so I got my first k8s cluster up and running yesterday, woot! [18:05] AWE-SOME [18:05] doing anything interesting in there yet stormmore?
[18:06] not yet, still teaching the Devs how to create containers [18:06] it will get more interesting when I migrate out of AWS to our own hw [18:07] I just love all the "pretty" dashboards that I can "show off" to management [18:22] stormmore, it's what promotions are built on :) [18:25] stokachu not in a startup where you already report to the CEO [18:26] job security maybe [18:29] ping back lazyPower and mbruzek [18:29] sorry, the commute was hell [18:31] stormmore, hah, maybe some nice hunting retreats [18:31] stormmore, that may just be here in the south though [18:31] yeah that is a southern thing stokachu, definitely not a bay area thing [18:32] lazyPower: what do you need from etcd, just the journalctl entries? [18:37] Zic - can you grab that, the systemd unit file /var/lib/systemd/etcd.service and the defaults environment file /etc/defaults/etcd [18:37] er [18:37] sorry /lib/systemd/service/etcd.service [18:37] i clearly botched the systemd unit file location. herp derp [18:39] /lib/systemd/system/etcd.service? [18:39] because /lib/systemd/service does not exist :) [18:39] correct [18:39] 'k [18:40] http://paste.ubuntu.com/23876230/ [18:42] etcd.service date of Dec 18th Jan 16th, if it can help [18:42] oops, missing copy/paste [18:42] etcd.service is Dec 18th and /etc/default/etcd is Jan 16th [18:47] ok these unit files appear to be in order. We found some issues that also look related to the core problem regarding flannel not actually running on the units [18:47] it failed contacting etcd [18:47] Zic are you able to hangout with us for a debug session? [18:50] so sorry but I can't, my wife and my child will kill me if I jump into a debugging session with audio :/ but I really appreciate your kindness, thanks [18:50] I'm out of the office actually but I can do some IRC discreetly :) [18:51] lazyPower: yes, we already discussed this, but I did `systemctl start flannel` on all the rebooted kubernetes-workers [18:51] or maybe the time Flannel was not running caused the problem? [18:52] (and I also did a `juju resolved` on the flannel unit which was in error) [18:52] as you taught me :) [18:52] Zic - seems like flannel is having an issue contacting etcd per the debug output from kubernetes-master [18:53] which in turn is causing the kubernetes api to not be available to pods, which is causing the pod crashloop [18:53] Zic: Can you pastebin the /var/run/flannel/subnet.env also ? [18:53] hmm, I remembered to start the flannel service after every kubernetes-*worker* reboot [18:53] not on the masters [18:54] (as Juju doesn't tell me anything is in error after the master reboots) [18:54] Zic Were the flannel services not autostarting? [18:54] yes, on the workers [18:55] fwiw, that issue was fixed in 1.5.2 [18:56] nice, noted [18:56] I will plan an upgrade if I'm able to recover from this crash [18:57] Zic: Can you get the /var/run/flannel/subnet.env file for us? [18:57] oh sorry, I missed your message [18:59] http://paste.ubuntu.com/23876307/ [18:59] Thanks, Zic === frankban is now known as frankban|afk [19:24] lazyPower: at this point, you can tell me the truth, do you think I will be able to recover from this crash? :D not so important because it's not in prod yet, and it's easy to redeploy from scratch [19:25] but to know of what mu monday will be made :) [19:25] s/mu/my/ [19:25] Zic - we have some ideas, but nothing definitive for the root cause so it's hard to point at what a fix would be without access to real time debug.
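(A minimal sketch of the post-reboot checks being discussed here, using only commands and paths that appear elsewhere in this log — flannel's systemd unit, its subnet.env file, and etcdctl's health commands:)

    # Is flanneld actually up on every worker, and did it write its subnet file?
    juju run --application kubernetes-worker 'systemctl is-active flannel'
    juju run --application kubernetes-worker 'cat /var/run/flannel/subnet.env'
    # Before blaming flannel, confirm the etcd cluster it talks to is healthy.
    juju run --unit etcd/0 'etcdctl cluster-health'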
[19:25] Zic - we're trapped in a meeting atm that's starting to wind down, but we've been scrubbing through the logs you sent all morning, it all seems to point back to flannel + etcd as the core of the issue [19:26] mbruzek - any ideas left to try before we call it DOA? [19:27] don't hesitate to ping me for more info, I'm @home but lurking on IRC as usual [19:27] Zic - will do. just pending feedback from the remainder of the team actively cycling on the issue. [19:27] i did the same operations you outlined on my home cluster running 1.5.1 and i got the intermittent flannel connection issue [19:27] it resolved once i restarted the master however [19:28] I rebooted in this order : juju controller, kube-api-loadbalancer+easyrsa, kubernetes-master, kubernetes-worker [19:29] That all sounds correct to me in terms of ordering [19:29] I haven't tried to reboot anything since then [19:29] would you mind terribly trying to reboot the kubernetes-master unit one last time to see if it "unsticks" the error? [19:29] but I can if it's needed [19:29] yeah, I can [19:29] all 3 of them? [19:29] one by one? [19:30] Zic - i would pick one, and restart it yes. identified as the leader [19:31] start there, and let's see what results we get back from that single reboot [19:31] if it looks promising, then cycle one by one the other two nodes [19:31] s/nodes/units/ [19:31] reboot launched on the active master [19:32] Zic - i need to cycle into another role at the moment, but i'm leaving you in the very capable hands of mbruzek, ryebot, and Cynerva - they're going to keep the stream alive and ask for details post-reboot [19:33] Ready and waiting, Zic :) [19:33] oki, thanks anyway lazyPower for your great help :) [19:33] Zic - no, thank YOU for the patience during this debugging session. I know it's unnerving [19:33] ryebot: o/, reboot finished, do I start the flannel.service? [19:33] if we can fix it we'd like to do so [19:34] Zic: yes, please [19:34] Zic: Yes [19:34] started, systemctl status seems correct [19:34] great, and /var/run/flannel/subnet.env exists? [19:34] mbruzek ryebot - the flanneld unit not being up before kube-apiserver/scheduler/controller-manager might be a bigger portion of the error set as well. [19:35] during bootstrap that happens after flanneld has indicated it's running and available [19:35] Hey lazyPower just out of curiosity, do you know why the k8s bundle adds a flannel connection to the k8s master that takes a full /24? seems like a bit of a waste to me [19:35] stormmore - expedience, that seems like an area we can optimize [19:35] http://paste.ubuntu.com/23876577/ [19:35] lazyPower ack, we'll investigate [19:35] ryebot: ^ [19:35] Zic great, thanks [19:36] do I mark the flannel/0 unit as "resolved" with the juju cli? [19:36] it's in error atm [19:36] lazyPower - seems like it, still not a deal breaker for me ;-) [19:36] Zic, yes, please [19:36] done, it's green [19:36] great [19:37] one sec [19:37] stormmore glad to hear it :) As you can see per the channel logs, we take deployments seriously and value all feedback. keep it comin. if you'd like to file a bug against github.com/kubernetes/kubernetes regarding the master service CIDR range we can angle to get it on the roadmap in the future. [19:37] Zic: Can you also reboot the other masters and start flanneld as well. [19:38] ok [19:40] ok for the 2nd master [19:41] also for the 3rd one [19:41] okay, and you resolved the errors? [19:41] yep [19:41] Zic can you try to create a simple pod to see if that works?
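(For reference, a smoke test of the kind mbruzek asks for here — it mirrors the exact command Zic runs in the next exchange; only the status check and cleanup lines are additions:)

    # Create a throwaway nginx deployment and watch it get scheduled.
    kubectl run my-nginx --image=nginx --replicas=2 --port=80
    kubectl get pods -o wide            # should reach Running on healthy workers
    # Clean up once the test is done.
    kubectl delete deployment my-nginx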
[19:41] lazyPower this channel is one of the reasons I am driving adoption in my company of MAAS and Juju to underpin the k8s environment instead of other options [19:41] Are you still in a CrashLoopBackOff? [19:41] stormmore <3 [19:42] we appreciate you too [19:42] Thanks stormmore! [19:42] mbruzek: I will deploy a simple nginx pod, give me a few seconds [19:42] Thanks, Zic [19:44] $ kubectl run my-nginx --image=nginx --replicas=2 --port=80 [19:44] Error from server (Forbidden): deployments.extensions "my-nginx" is forbidden: not yet ready to handle request [19:44] same as before [19:44] alright, thanks, Zic, one moment [19:45] I ran it a 2nd time, same error [19:45] but the 3rd time the operation was executed [19:45] so, same as earlier [19:45] so on a less work related topic, when is the next convention that I can try and twist the boss' arm into letting me attend? [19:45] Zic which master does your kubectl point to? [19:46] the kube-api-load-balancer [19:46] I can test from a master directly [19:46] Zic: Do you have a bundle or description of how you deployed your cluster? [19:49] yeah : 1 machine for kube-api-loadbalancer and easyrsa (identified by mth-k8slb-01 in my infra), 3 machines for kubernetes-master (mth-k8smaster-0[123]), 8 kubernetes-worker (mth-k8s-0[123], mth-k8svitess-0[12], mth-k8sa-0[123]) [19:49] and one Juju controller of course : mth-k8sjuju-01 [19:49] Zic: Can you pastebin juju status? [19:49] of course [19:50] http://paste.ubuntu.com/23876663/ [19:50] mbruzek - one thought just occurred to me as well, if Zic hasn't updated to our 1.5.2 release, the api-lb still has the proxy buffer issue which reared its ugly head on delete/put requests as well. [19:50] but not certain if that's pertinent [19:50] (oh, I forgot to mention mth-k8setcd-0[12345] which run the etcd parts) [19:50] mbruzek: ^ [19:52] everything was added manually ("cloud-manual" provisioning of Juju, even the AWS EC2 instances) [19:52] I don't use the AWS connector since I have baremetal servers and EC2 instances in the same juju controller [19:55] Zic would you be able to upgrade this cluster to see if our operations code corrects this problem? [19:55] yeah, it will be the first upgrade I conduct through Juju since the bootstrapping of this cluster :) [19:55] what is the recommended way? [19:56] We have that documented in this blog post: http://insights.ubuntu.com/2017/01/24/canonical-distribution-of-kubernetes-release-1-5-2/#how-to-upgrade [19:56] Since you don't seem to be using the bundle, I would recommend the "juju upgrade-charm" steps [19:56] We can walk you through it. [19:57] I used the canonical-kubernetes but just scaled etcd from 3 to 5 and master from 1 to 3 in the Juju GUI [19:58] (and made easyrsa be on the same machine as kube-api-loadbalancer) [19:58] the canonical-kubernetes bundle* [19:58] Zic: Ah I see, still would recommend the upgrade-charm path [19:59] ok, I just read the how-to-upgrade section, never used the upgrade-charm one-by-one approach :( [19:59] There is a first time for everything! [19:59] :) [20:00] so I run `juju upgrade-charm ` ? [20:01] Zic: Just the applications in the cluster, kubernetes-master, kubernetes-worker, etcd, flannel, easyrsa, and kubeapi-load-balancer [20:02] The _units_ will upgrade automatically when the _application_ does [20:02] So you don't need to use the /0 /1 /2 /3 [20:02] Zic: Right, so literally as it is in those docs :) [20:02] ok [20:03] just so I know, for the future (~yay~), is this the standard recommended way in a cluster like mine?
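(A sketch of the upgrade sequence being described — the application names are the ones mbruzek lists just above; the ordering here is only illustrative and is not prescribed anywhere in this log:)

    # Upgrade each application in the cluster; the units follow automatically.
    juju upgrade-charm easyrsa
    juju upgrade-charm kubeapi-load-balancer
    juju upgrade-charm kubernetes-master
    juju upgrade-charm etcd
    juju upgrade-charm flannel
    juju upgrade-charm kubernetes-worker
    juju status    # wait for everything to settle back to a green/idle state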
[20:03] so I will try to memorize this :) [20:05] Zic: As far as I know, for custom deployments, yes. [20:05] You might be able to use your own bundle without version numbers, but I'm not sure. I'll have to investigate. [20:05] Zic: Technically you could export a bundle of your current system and it would be reproducible, and you could also edit the version numbers of the charms and upgrade your cluster in one step [20:06] Zic: But we are just trying to fix the cluster here, we can add automation and reproduction as following steps [20:07] ok, so as I modify the number of units and replace the easyrsa charm, the out-of-the-box way is via juju upgrade-charm, that's all I wanted to know; it's not the time for automation, I agree :) [20:07] (upgrades are running btw) [20:08] s/replace/place in the same machine of kube-api-loadbalancer/ [20:08] Zic: Actually you can export the deployment from the Juju GUI in one step, and it will have the machines and everything just as you have it now. We could then copy that yaml and edit it. [20:09] smooth, noted, I will study this afterwards :) [20:10] (easyrsa, kube-api-load-balancer already OK, kubernetes-master in progress, I will follow with etcd and kubernetes-master) [20:10] worker* [20:11] hmm [20:11] I had a `watch "kubectl get pods --all-namespaces"` running [20:11] Zic: The documentation describes this process: https://jujucharms.com/docs/2.0/charms-bundles [20:11] and suddenly all the pods switched to Running after the kubernetes-master upgrade [20:11] \o/ [20:11] Excellent! [20:11] let me check carefully [20:11] @!#$%^@!#$%!@#$% [20:11] AWESOME mbruzek GREAT WORK! [20:12] yeah, all is running [20:12] sweet [20:12] I'm continuing the upgrade-charms [20:12] i think this calls for a trademarked WE DID IT! [20:12] (I will try to schedule a new pod deployment just after) [20:12] clap clap in any case o/ [20:13] oh maybe "clap clap" does not sound like a STANDING APPLAUSE in English, sorry :) [20:13] Zic: It translated in my head just fine [20:13] :) [20:14] etcd charm is upgrading [20:15] done for etcd, I'm finishing with the kubernetes-worker charm [20:17] all upgrades are finished [20:17] let me run a kubectl deployment [20:17] Zic: Can you create a small test deployment? [20:17] yep [20:17] (oh btw, kubectl get ns does not return any of the old stuck Terminating namespaces) [20:17] it's clean now [20:18] yassssss [20:18] great [20:20] so, it works, but something weird though maybe normal: some Ingress pods go to CrashLoopBackOff but they look to be coming back to Running atm [20:20] yeah, they are all running now [20:20] some strange flapping [20:20] if the default-http-backend pod is being re-created it'll crash loop the ingress controller until the backend stabilizes [20:20] that's known and normal [20:20] oh, so it's normal [20:21] COOL :D [20:21] phwew [20:21] man, fan-tastic [20:21] so your new version amazingly corrects everything [20:21] and I don't know if it's tied to it, but Scheduling is much quicker [20:21] that's a 1.5.2 fixup :) [20:22] I sometimes waited up to 3min between Scheduling and ContainerCreating [20:22] plus you probably have less pressure on the etcd cluster without those large namespaces [20:22] here it was done in 10s [20:24] Zic: The upgrade process installed new things and reset the config files to what we would expect, that is why I think you are having so much success here. [20:24] Zic: Is everything ok? do you have any other problems? [20:25] I'm running some more tests but all seems OK now !!
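(A small sketch of the post-upgrade verification Zic is doing here — the watch command is literally the one he mentions above; `kubectl get componentstatuses` is an assumption, just a conventional way to eyeball the control plane on a cluster of that era:)

    juju status                                  # charms should be back to green/idle
    kubectl get componentstatuses                # scheduler / controller-manager / etcd health
    watch "kubectl get pods --all-namespaces"    # pods should settle into Running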
[20:26] Zic: There was also a fix for the LB where we turned off the proxy buffering in the 1.5.2 update. [20:28] hmm, some Ingress stayed in CrashLoopBackOff and others are running correctly: [20:28] http://paste.ubuntu.com/23876828/ [20:30] Zic it is possible that those are still being "restarted" by the operations [20:30] But good information. [20:31] hmm, kubernetes-dashboard is staying in CrashLoopBackOff, I forgot to check the kube-system namespace :/ [20:31] (kube-dns was also crashed, but it has been Running since the upgrade) [20:32] Zic, is your juju status all-green right now? [20:32] yep [20:32] thanks [20:36] heapster was also in CrashLoopBackOff and is now Running fine [20:37] Zic: so just the ingress ones are in CLBO ? [20:37] the only crashloopbackoff now is the dashboard and some of the Ingress [20:37] oh I anticipated the question :) [20:38] I'm trying to relaunch the kubernetes-dashboard pod, maybe it will help [20:40] 2m1s163{kubelet mth-k8svitess-01}WarningFailedSyncError syncing pod, skipping: failed to "SetupNetwork" for "kubernetes-dashboard-3697905830-qv6hv_kube-system" with SetupNetworkError: "Failed to setup network for pod \"kubernetes-dashboard-3697905830-qv6hv_kube-system(8785f143-e4cc-11e6-b87d-0050569e741e)\" using network plugins \"cni\": open /run/flannel/subnet.env: no such file or directory; Skipping pod" [20:41] Zic: Did you get to juju upgrade-charm flannel? [20:41] no, I just realized that, I thought it was part of the kubernetes-worker charm [20:41] can I run its upgrade now? [20:42] Zic: Yes please [20:44] done [20:45] Zic: Can you please pastebin `kubectl logs nginx-ingress-controller-jlxr5` ? [20:45] We want to see why that is in a CLBO [20:46] http://paste.ubuntu.com/23876915/ [20:47] hmm [20:47] I didn't have this error before [20:47] Are they still in CrashLoopBackOff? [20:47] yep [20:47] oh [20:47] the dashboard is running [20:48] and 2 Ingress are back in Running [20:48] 6 total ingress running now? [20:48] 2 are still in CLBO but I think they will come back [20:48] there we go, all Running [20:48] 8 total running ingress? [20:48] yep [20:48] great [20:48] all is Running now, and I'm checking --all-namespaces [20:49] (I forgot kube-system earlier :/) [20:49] the dashboard is indeed working [20:49] hmm, 2 Ingress are going back to CLBO atm [20:50] Zic: Send us pastebin logs for those [20:50] http://paste.ubuntu.com/23876940/ [20:52] I have another log from the RC of the Ingress: [20:52] Liveness probe failed: Get http://10.52.128.99:10254/healthz: dial tcp 10.52.128.99:10254: getsockopt: connection refused [20:53] thanks, looking === wolverin_ is now known as wolverineav [20:53] so the only CLBO part now is those 2 Ingress: [20:53] default nginx-ingress-controller-vg9qc 0/1 CrashLoopBackOff 17 13h [20:53] default nginx-ingress-controller-w1dhl 0/1 CrashLoopBackOff 92 10d [20:54] Zic - which namespace is this in? [20:54] default, it's the built-in nginx-ingress of the canonical-kubernetes bundle [20:54] ack, ok [20:54] (I didn't add any Ingress myself) [20:54] ingress controller* to use the right terminology [20:55] can you kubectl describe po nginx-ingress-controller-vg9qc | pastebinit [20:55] (Actually, I added Ingress, but not an Ingress controller) [20:56] http://paste.ubuntu.com/23876964/ [20:56] thanks [20:57] Zic - that's totally fine [20:57] hmm nothing is leaping out at me from the pod description....
it did say it was failing health checks [20:58] from earlier pastes it looked like it was running out of file descriptors [21:00] in fact, my Ingress serves my testing website fine [21:00] but these two controllers stay in CLBO :/ [21:00] the 6 others are Running fine [21:00] is it possible to tell juju to give my instances 2 ip addrs when it starts them? [21:01] cholcombe - you have to sacrifice 40tb of data and do the chant of "conjure-man-rah" [21:02] lazyPower: if I reboot (yeah, it's a bit brutal) the nodes which host these two CLBO pods, maybe they will pop back up as Running? [21:02] man i need to brush up on that chant haha [21:02] in short, i dont know but i think extra-bindings and spaces are what would introduce that functionality [21:02] lazyPower: it's what I did for kubernetes-dashboard [21:02] s/know/so/ [21:02] no no, nevermind that last edit, it was right [21:02] Zic - it's worth a shot. [21:03] hmmm that is curious for an "idle" cluster, one of my nodes just dropped 4GB of memory usage and increased network tx for no obvious reason! [21:07] lazyPower: it seems to work, weird but I like it xD [21:07] I will wait a few minutes before confirming [21:07] Zic - i'm going to blame gremlins for that one [21:10] default nginx-ingress-controller-7qcsn 0/1 CrashLoopBackOff 10 13h 10.52.128.135 mth-k8sa-01 [21:11] default nginx-ingress-controller-lx6kt 0/1 CrashLoopBackOff 16 13h 10.52.128.253 mth-k8sa-02 [21:11] it didn't work for long :) [21:11] Zic, another option is destroying them; the charm should launch new ones to replace them on the next update (<5 mins) [21:12] with the same error: F0127 21:11:44.648841 1 main.go:121] no service with name default/default-http-backend found: the server has asked for the client to provide credentials (get services default-http-backend) [21:12] ryebot: thanks, I will try [21:13] they popped up again and are Running [21:14] let's let a few minutes pass to confirm :) [21:14] +1 [21:15] CLBO for one of them [21:15] Zic: Please try one more thing [21:15] juju config kubernetes-worker ingress=false [21:15] *wait* for the pods to terminate. [21:15] the two newly brought up ingress-controllers are now in CLBO again [21:16] mbruzek: ok [21:16] Then you should be able to juju config kubernetes-worker ingress=true [21:17] mbruzek: (it reminds me that at the end, I need to shut off the debug I enabled earlier on the kubernetes-master?) [21:18] all pods are terminated [21:19] switched back to true [21:20] all are Running, let's wait a few minutes [21:20] they are all in CLBO now xD [21:21] but they switch back to Running [21:21] it will register as running when they initially come up, but they have to pass the healthcheck [21:21] Zic - can you juju run --application kubernetes-worker "lsof | wc -l" and pastebin the output of that juju run command? [21:23] http://paste.ubuntu.com/23877086/ [21:24] All Ingress are flapping between CLBO and Running now [21:24] i'm not positive, but k8s worker/2 and k8s-worker/1 seem to have a ton of open file descriptors [21:24] looks like something is leaking file descriptors :| [21:25] which would explain the crash loop backoff on a segment of the ingress controllers vs the handful that succeed [21:26] Zic - at this point we'll need a bug about this, and can look into it further, but we're not in a position to recommend a fix at this time. [21:26] we have encountered this before but the last patch that landed should have both a) enlarged the file descriptor pool, and b) hopefully corrected that.
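(A hedged sketch of how one might confirm or rule out the file-descriptor theory on a worker — the per-process /proc checks are an assumption layered on top of the lsof and fs.file-nr commands already used above:)

    # System-wide: allocated / unused-but-allocated / maximum file handles.
    juju run --application kubernetes-worker 'sysctl fs.file-nr'
    # Per-process, e.g. for the kubelet: open FDs versus its own limit.
    juju run --unit kubernetes-worker/1 \
      'pid=$(pgrep -o kubelet); ls /proc/$pid/fd | wc -l; grep "Max open files" /proc/$pid/limits'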
we might be behind by a patch release of the ingress manifest that fixed this [21:27] i'll take a look in a bit, but i think we're 1:1 [21:27] in fact the Ingress is working even though the Ingress Controllers are flapping [21:27] do you have any ingress controllers listed as running that are not in CLBO? [21:27] strange [21:27] yeah, as they flap between the two states, not at the same time [21:27] if that's the case, kubernetes will do its best to use a functional route through the ingress API [21:28] as it round-robin distributes them. you'll likely find some requests that get dropped if you run something like apache-bench or bees-with-machine-guns against it [21:28] but typical single user testing likely looks fine [21:28] Zic, one last thing - can you pastebin /lib/systemd/system/kubelet.service on a kubernetes-worker node? [21:30] http://paste.ubuntu.com/23877117/ [21:30] juju run --application kubernetes-worker "sysctl fs.file-nr" | pastebinit -- as well would be helpful in ensuring it is indeed related to file descriptors. this will list the number of allocated file handles, the number of unused-but-allocated file handles, and the system-wide max number of file handles [21:31] http://paste.ubuntu.com/23877122/ [21:32] well those fd numbers went way down [21:32] however there are very different configurations listed there [21:32] where some have 400409 and some have 6573449 listed as max [21:32] They might be his baremetal/ec2 machines? [21:32] ah, you are correct [21:32] different substrates [21:32] different rules [21:33] yeah, only mth-k8s-* and mth-k8svitess-* are robust physical servers [21:33] all the units appear to be well within bounds of those numbers though [21:33] mth-k8sa-* are EC2 instances [21:33] so maybe it's not FD leakage [21:35] http://paste.ubuntu.com/23877130/ <= and this error about credentials reminded me of some bad hours from earlier today [21:35] (which was the same error returned by some kubectl commands and the dashboard earlier) [21:36] 21:28 is @UTC [21:36] so a few minutes ago [21:38] Zic: Can you please file a bug about this problem on the kubernetes github issue tracker? https://github.com/kubernetes/kubernetes/issues [21:38] Maybe someone else knows why the ingress would be in CrashLoopBackOff. [21:39] Please list whether you added any ingress things and what manifest you used for that. [21:39] hmm, about this, maybe I can delete my Ingress [21:39] Put Juju in the title and your best estimation of how to reproduce these errors. [21:40] Zic: The ingress=false would have deleted them no? [21:40] Did you put in your own _different_ ingress objects? [21:41] if you're talking about the Ingress controller, no, I stayed with the default nginx-ingress-controller of the charms bundle [21:41] but I created two Ingress, yep [21:41] ingress objects are related to, and depend on, the ingress controller, but have very little to do with ingress controller operations [21:41] unless i'm misinformed [21:44] ok, so I promise to file a bug tomorrow morning, it's late now and if I want to file a well-written/described bug I prefer to do it seriously :) I will post the Issue link here tomorrow [21:44] the steps-to-reproduce part will be the most difficult part [21:44] Zic yes [21:44] Zic, I just don't know how to reproduce this [21:44] I realize it is late for you, sorry about the problems. [21:46] no worries, you were all fantastic, helping me and focusing on this problem for hours today; I can at least continue the debugging on IRC even though I'm out of the office!
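(For context, a hypothetical Ingress object of the shape Zic describes just below — a single host rule, using the extensions/v1beta1 API of that era; the hostname comes from the following exchange and the backend service name is made up for illustration:)

    cat <<'EOF' | kubectl create -f -
    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: nginx-baremetal
    spec:
      rules:
      - host: baremetal.ndd.com            # one hostname rule, nothing else
        http:
          paths:
          - path: /
            backend:
              serviceName: nginx-deployment-test   # assumed service name
              servicePort: 80
    EOF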
[21:49] thanks a lot mbruzek, ryebot, lazyPower, jcastro [21:50] Happy to help, Zic! [21:50] I will pop in again here tomorrow about the issue I will report [21:51] thanks, feel free to me ping when you post it, I'd like to track it [21:51] ping me* :) [21:51] huh [21:52] I deleted and recreated my Ingress, and the controllers have now been Running without flapping for 6min [21:52] engineering is SO 2016, magic is the new way [21:52] :| [21:52] lol [21:52] Zic, when you say you deleted your ingress, can you provide the command you executed? [21:53] kubectl delete ing [21:53] (one was exposing an nginx-deployment-test with a nodeSelector on machines labelled for our Paris datacenter (baremetal servers)) [21:54] (and another one on EC2 only) [21:54] 8min without flapping, lol [21:54] I wonder if there was a conflict with our automatic ingress scaling. Shouldn't be, but I should probably make sure. [21:55] I'll keep all the traces to file a bug tomorrow anyway [21:55] Hmm, you said you recreated them, too, so I guess that can't be it. [21:55] I guess you're right... magic! [21:55] and add how I resolved it, if it doesn't flap by tomorrow [21:55] +1 sounds good, thanks Zic. [21:55] ryebot: yeah, and they're really simple Ingress, with only one rule on the hostname [21:56] like baremetal.ndd.com for the first, ec2.ndd.com for the second [21:56] 11min without flapping \o/ [21:56] \o/ [21:57] * Zic will buy some magic powder before sleeping [21:57] heh [22:00] hmm, I discovered some more information in kubectl describe ing [22:00] all the operations that the controller did [22:00] I didn't know where to find those [22:01] I have a lot of 42m42m1{nginx-ingress-controller }WarningUPDATEerror: Operation cannot be fulfilled on ingresses.extensions "nginx-aws": the object has been modified; please apply your changes to the latest version and try again [22:01] during the CLBO period [22:01] and now it's working, juste MAPPING actions [22:01] just* [22:18] (no flapping for 38min :p I'm going to bed, g'night and thanks one more time to all of you :)) [22:18] Zic : excellent news. have a good sleep and enjoy your weekend o/ [22:18] I will ping back with the GitHub issue tomorrow o/ [22:24] wish there was a good recommendation guide for SDN hardware :-/ [22:24] (still think running some of the maas services on an sdn switch would rock!) === scuttlemonkey is now known as scuttle|afk [22:46] since it is Friday and I am not doing anything to affect my cluster but I want to learn more about juju, is it possible to run your own charm "store"? [22:49] stormmore - not at this time [22:49] lazyPower another "future work" thing then :) [22:49] stormmore - you can keep a local charm repository which is just files on disk, but as far as running the actual store display + backend service(s), that's not available as an on-premise solution [23:37] yo what's up with the charmstore ? [23:37] ERROR can't upload resource [23:37] will we ever fix this? [23:37] killin' me here
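(Finally, a minimal sketch of the "local charm repository" approach lazyPower mentions just above — the ./build/xenial/... path mirrors the bundle snippet quoted earlier in the log and is only an example:)

    # Deploy straight from a charm directory on disk (Juju 2.x).
    juju deploy ./build/xenial/my-charm
    # Later, upgrade the running application from the same local path.
    juju upgrade-charm my-charm --path ./build/xenial/my-charm
    # Bundles can reference the same directory, as noted earlier in the channel:
    #   my-charm:
    #     charm: "./build/xenial/my-charm"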