[04:21] So is there a charm and / or a doc on standing up a private docker registry for k8s? [04:22] Budgie^Smore - there's an open PR that hasn't made the shift to the upstream repository that adds this functionality into k8s itself - https://github.com/juju-solutions/kubernetes/pull/97 [04:23] once that lands it'll get released with our next update to the charms, we have some additional prs that need to land to support that change. but its on the horizon [04:23] so again I am getting ahead of myself :) [05:13] I am pondering running nexus 3 ina container in the meantime (possible long term depending on the registry functionality) [07:23] blahdeblah: It's about this bug https://bugs.launchpad.net/nrpe-charm/+bug/1633517 [07:23] Bug #1633517: local checks arn't installed sinds nrpe-7 [07:30] BlackDex: You've caught me a little late in my day, but if I get a chance I'll have a look at that if I have some spare time. [08:09] Good morning Juju world! === frankban|afk is now known as frankban === junaidal1 is now known as junaidali [09:16] blahdeblah: Thx, i don't mind i just need a answer/help, i know the time-difference is there, so no prob, i'm glad someone wants to take a look [09:18] morning kjackal o/ [09:19] hell marcoceppi [09:22] BlackDex: Definitely keen to find out what's going on; what times (UTC) are you likely to be around most? [09:26] blahdeblah: im in utc+1 (netherlands) so that would be utc 8:00 till around 16:00 [09:26] BlackDex: ack - will try to catch you in your mornings [09:26] oke :) cool thx! [09:29] hi here, I'm asking myself if a complete teardown via Juju (and tearup) could permit a resolution of this issue: https://github.com/kubernetes/kubernetes/issues/40648 [09:30] because it does not seem that many people have encountered this one :/ === sk_ is now known as Guest67469 [09:37] (I'm also trying to see in the Kubernetes' Slack if somebody already encountered this issue) [09:42] Zic: it might be? Are you on 1.5.2? [09:46] marcoceppi: yep [09:47] Zic: I don't feel comfortable saying scrap and redeploy, esp if there's information we can capture from your deployment to improve CDK, but I also don't want you siting with a wedge'd cluster [09:48] Zic: lazyPower mbruzek & co should be online in the next few hours [09:48] yeah, they helped a lot with the first party of this problem friday :) [09:49] s/party/part/ :] [09:49] I'm cool with calling it a party instead of a problem ;) [09:50] the first part was about all my pods crashed with this kind of error, and even some kubectl command (which actually "do/write" something, like create/delete, as get/describe works) return this kind [09:50] upgrading to 1.5.2 pass all my Pods to Running [09:51] but this weekend, when I tried to reboot some kubernetes-worker to test the resilience and the eviction/respawn of pods, I fell again in a sort of same problem :/ [10:24] oh, I found something strange, cc @ lazyPower [10:24] http://paste.ubuntu.com/23892884/ [10:25] etcd again /o\ [10:26] I saw this via juju status, flannel/2 was marked as "waiting" indefinitely [10:26] Zic: you have 1,3, or 5 etcd machines? [10:26] 5 etcd [10:26] on 5 different VMs [10:26] doh, just saw the flanneld line [10:27] I didn't have Flannel in this state when I opened the GitHub issue [10:27] interesting [10:27] the last guilty for my first problem was also etcd [10:27] it seems it just started happening baesd on the logs [10:28] do you know if I can do a "fresh start" of an etcd database for canonical-kubernetes without redeploy it from scratch? 
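(A minimal sketch of how the etcd/flannel state discussed above can be inspected before deciding to wipe anything; the unit name, endpoint, and certificate paths are placeholders and will differ per deployment.)

```
# Ask Juju to report etcd health on one unit (unit name is an example)
juju run --unit etcd/0 'systemctl status etcd --no-pager'

# etcdctl v2 syntax, as shipped with CDK at the time; the TLS file paths are
# placeholders -- use whatever the etcd charm actually laid down
etcdctl --endpoints https://127.0.0.1:2379 \
        --ca-file /path/to/ca.crt \
        --cert-file /path/to/client.crt \
        --key-file /path/to/client.key \
        cluster-health
```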
I don't have any important data for now in this cluster [10:29] (= I have all my custom YAML file to redeploy all easily) [10:29] files* [10:32] Zic: you should be able to just remove the etcd application then redeploy etcd and re-create the relations [10:32] you might get some spurrious errors during removal, and I'm not sure if it's a tested path or not [10:33] theoretically, you should be able to, but distributed systems are always a bit interesting in practice [10:34] oh you mean via Juju? I thought to wipe the data directly in etcd, but even if I don't know well etcd, I suppose there is some "default-key/value" needed and provisionned via Juju deployment at bootstrap :/ [10:35] Zic: not sure via etcd, from a Juju perspective the "keys" for TLS are actually a charm, the easyrsa charm, so since there's a CA running still it'll just get new certs, distribute those via relations and k8s will be reconfigured to point at that etcd [10:36] Zic: as for the etcd portion, there probably is a way to wipe, I'm just not sure of one [10:41] marcoceppi: do you advice me to wait for lazyPower and ryebot to come up before smashing etcd in the head (it's not the first time etcd annoys me, even in other technologies than K8s/Vitess :)) [10:41] ? [10:42] Zic: It's probably a good idea to wait for them, but smashing etcd over the head might also be very theroputic. I'll make sure we have some people in EU/APAC timezone come up to speed with kuberentes knowledge so there's not so much a wait period [10:53] oh I will never complain about timezone as it's a community support channel :) but great to here it [10:54] hear* [11:12] new info: old NodePort service are always listening, but with a new NodePort service just deployed, no nodes are listening on this port :/ [11:12] I think Flannel is the guilty but because it cannot contact etcd [11:17] stokachu: ping when you're around [11:18] or mmcc but I doub't you'll be around before stokachu [12:45] Hey BlackDex, this might be a long shot, but the bug you posted on NRPE isn't related to https://bugs.launchpad.net/charms/+source/nagios/+bug/1605733 by any chance? [12:45] Bug #1605733: Nagios charm does not add default host checks to nagios [14:09] marcoceppi, ping [14:25] stokachu: hey man, what's conjurebr0 for? [14:25] stokachu: I'm doing some super weird things in a spell, and was curious [14:25] marcoceppi, it's mainly for openstack on novalxd to have that second nic for its neutron network [14:26] but it's always there so you could rely on it if need be [14:32] mbruzek: hi, are you around? just saw you joined, sorry if I disturb you [14:36] Zic I am here. What can I help with? [14:39] mbruzek: remember the last time with my Ingress controller in CLBO? I thought all was fixed after upgrading to 1.5.2, but when I rebooted some nodes, the problem came back... I continued to look at the problem today and saw Flannel is completely messed up: http://paste.ubuntu.com/23892884/ [14:39] actually, all new NodePort are not working :s [14:39] hrmm. [14:40] There must be a problem with the reboot sequence. Do you think you could reproduce this? [14:41] stokachu: is it routable, and is it connected to the controller? [14:41] marcoceppi, yea it's routable, but not connected to the controller [14:42] stokachu: cool [14:42] stokachu: second question, can I reference a local bundle.yaml file in the spell metadata? 
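(A rough sketch of the remove-and-redeploy path marcoceppi describes above; the charm URL and relation endpoint names are recalled from the CDK bundle of that era and may not match your revision, so treat them as assumptions.)

```
juju remove-application etcd                     # may show the spurious errors mentioned above
juju deploy cs:~containers/etcd --to <machine>   # placement is up to you
juju add-relation etcd easyrsa                   # fresh certs come from the still-running CA
juju add-relation etcd:db kubernetes-master:etcd
juju add-relation etcd:db flannel:etcd
```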
[14:42] mbruzek: I tried to restart the flannel service but with the same result, I didn't try to reboot another node to see if I can reproduce [14:42] marcoceppi, you would just place a bundle.yaml in the same directory as your metadata.yaml and make sure bundle-location isn't defined in metadata.yaml [14:43] stokachu: boss, thanks [14:43] np [14:43] Zic: I need to know how you are rebooting these systems. Are you doing them in a specific order? [14:43] stokachu: I also wrote this https://gist.github.com/marcoceppi/e74c10178d1b730a36debc1f1622b2ce [14:43] mbruzek: this morning, I just rebooted (via the `reboot` command) one kubernetes-worker [14:43] I'm using it in a modified step to merge kubeconfig files, this way the user only needs to set --context [14:43] no other machines [14:44] marcoceppi, nice! [14:45] stokachu: updated with the step-01 file, nothing major [14:45] stokachu: last question, for headless, any thoughts on allowing a final positional argument for model name? [14:46] I love rando names as much as the next person, but I have some explicit model names I want to use [14:46] marcoceppi, one thing that i need to address for kubernetes is https://github.com/conjure-up/conjure-up/issues/568#issuecomment-272379010 [14:46] stokachu: yeah, that's what my gist does [14:46] stokachu: it names the context, user, and cluster the same as the model name from juju [14:47] marcoceppi, very nice, how do you access that with kubectl? [14:47] so they can live side by side with others. it doesn't do de-duping or collision detection yet, but I'll test it with my local spell first [14:47] stokachu: `kubectl --context <model-name>` [14:47] stokachu: and `kubectl set-context <model-name>` <- this is like juju switch [14:47] marcoceppi, very nice, once you're ready I'll add those to the spells [14:48] marcoceppi, we have positional arguments for cloud and controller, so adding a third for model makes sense [14:48] stokachu: cool, I'll file a bug, not high priority but wanted to run by you first in person before throw'in another on the pile [14:49] marcoceppi, thanks, that's an easy one so it'll get addressed this week [14:49] marcoceppi, my other big todo is to make spell authoring cleaner with maybe a clean sdk or something [14:49] haven't quite figured out the best approach there for developer happiness [14:49] stokachu: yeah, I was taken aback by all the bash and python mixed [14:50] mbruzek Zic: I'm bringing up a cluster to attempt to repro [14:50] marcoceppi, yea I'd like to use something like charmhelpers for this [14:50] stokachu: you might be able to borrow a lot from the reactive style, where you use decorators in bash/python to trigger/register events [14:50] Zic: And how do you see the problem? Are you just watching the output of kubectl get pods?
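(For reference, a consolidated sketch of the per-model context juggling described above; note that the switch analogous to `juju switch` is `kubectl config use-context`, and the context name below is just an example.)

```
kubectl config get-contexts              # list the merged contexts
kubectl --context my-model get pods      # one-off command against a specific context
kubectl config use-context my-model      # make it the default, like `juju switch`
```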
[14:51] marcoceppi, ah that's a good idea, would clean up a lot of the code [14:51] stokachu: and with the bash bindings, best of both worlds [14:51] stokachu: I'll file a bug for you there with some initial thoughts [14:51] marcoceppi, cool man appreciate it, i want to get that done sooner than later as well [14:51] mbruzek: I'm running a permanent watch "kubectl get pods -o wide --all-namespaces" when I reboot a node, and watch at the pods state, during that, I also do some curl and telnet to various Ingress and NodePort of the cluster [14:54] I think that's a "relica" from my first problem, as the step-to-reproduce is hard to describe, I'm asking myself if there is a way to reset the etcd cluster to default value (= wipe all data of the K8s cluster) without reinstalling the Juju [14:54] the Juju full-cluster* [14:55] it may be more simple to set up a step-to-reproduce path, or to confirm that's tied to the problem of the last time and my actual data is corrupted :s [14:56] stokachu: https://github.com/conjure-up/conjure-up/issues/635 [14:56] marcoceppi, perfect thanks [14:56] stokachu: I'm onsite atm, but I'll file the developer one later tonight if you don't get to it before me [14:57] marcoceppi, cool man, yea ill file one [15:01] marcoceppi, fyi https://github.com/conjure-up/conjure-up/issues/636 [15:04] ryebot mbruzek: in the same strange behaviour, if I totally poweroff a node that was hosting some pods, this node is shown as NotReady in kubectl get nodes (this point is ok), but the pods stay saying "Running" on the poweroffed-node [15:05] I'm sure I didn't have this behaviour on the fresh bootstrapped cluster [15:07] stokachu: thanks, I'll dump my ideas there [15:07] marcoceppi, cool man [15:10] Zic: What problem(s) are you trying to solve by rebooting? What are else are you doing on the system to necessitate the reboot? [15:11] I'm trying to test the HA and the resilience of the cluster (= what happened and during what time) before going prod [15:14] with a fresh bootstrapped cluster, all pods hosted on a node passed "Completed" and finally disappeared and repop "Running" on another node [15:14] now, they just stayed in "Unknown state" [15:15] and as they are some variable I can't control like the disaster of friday, I cannot describe a clear step-to-reproduce without a full-reset I think :/ [15:19] Zic: We are looking into the problem here, trying to reproduce on our side [15:19] thanks [15:35] 11m11m1{controllermanager }NormalNodeControllerEvictionMarking for deletion Pod kube-dns-3216771805-w2853 from Node mth-k8svitess-02 [15:36] for example, this action last forever, the pod stayed in Unknown instead of switching to Completed and respawn somewhere else [15:36] (in fact it pops somewhere and was in state Running, but the old one stayed in Unknown forever) [15:38] Zic: And you have deployed 1.5.2 kubernetes right? [15:39] yep, was friday :) [15:39] I remember [15:41] Zic: Hmm, not able to repro. 
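(Roughly the resilience check Zic describes; the node name, worker IP, and NodePort are placeholders, and the eviction delay noted in the comments is the kube-controller-manager default.)

```
watch kubectl get pods -o wide --all-namespaces   # watch pods while a node reboots
kubectl get nodes                                 # a powered-off node should flip to NotReady
kubectl describe node <node-name>                 # check its Conditions block
# Pods on a dead node are only evicted and rescheduled after the controller-manager's
# --pod-eviction-timeout (5m by default), so a few minutes of stale "Running"/"Unknown"
# entries on a NotReady node are expected.
curl -v http://<worker-ip>:<nodeport>/            # spot-check a NodePort service
```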
[15:41] maybe I need to do a recap because I spoke so much, sorry :D 1) The first problem was some kube-system components and the Ingress controller in CLBO because etcd units were rebooted too quickly (operations were in progress, I think) because of a large namespace deletion 2) Upgrading to 1.5.2 immediately fixed the problem (I thought) 3) I rebooted just one node this weekend (planned to reboot all first, but as the first [15:41] triggered problems, I stopped) and finished here [15:41] Zic: We might need some detailed reproduction steps [15:42] Zic: ack, thanks [15:42] Let me try rebooting etcd [15:44] I'm sure if I do the same on a fresh canonical-kubernetes, I won't have any of these issues; something must not have totally recovered from the previous problem at the etcd level [15:44] Zic: Some of your systems are physical, yes? [15:44] Zic: We rebooted a worker here and did not have a problem coming back up [15:44] yep, 5 of 8 kubernetes-worker [15:45] all other components are VMs [15:45] mbruzek: yeah, just after the first installation of the bundle charms, all these operations were OK [15:45] it's since last Friday's incident; something must be only partially working [15:45] Zic mbruzek: rebooted all etcd nodes, no problems coming back up [15:46] Pods and nodes all intact [15:46] can I wipe the etcd cluster back to default data without tearing down the whole canonical-kubernetes cluster? [15:47] (the infra, I mean; I don't mind losing the settings of the K8s cluster, I can redeploy my pods & services easily) [15:48] Zic: We have not tested wiping out etcd, it holds some of the Software Defined Network settings. [15:50] Zic: We are unable to reproduce the failure you are seeing. It may be because of the manual operations you ran post deployment. Would it be possible to re-deploy the canonical-kubernetes cluster entirely and start there? [15:51] mbruzek: yeah, I think it's the only path now [15:52] mbruzek: do I need to reinstall everything or can I do a clean teardown with Juju and restart from the beginning? [15:52] Zic: As we spoke about on Friday, let's take a snapshot of your environment now. [15:52] Basically you need to use the Juju GUI to export the model of your environment now [15:53] Zic: To open the GUI: [15:54] Zic: If you haven't changed your admin password, run `juju show-controller --show-password` to get the randomly generated password [15:54] Zic: Next, run `juju gui` [15:54] ryebot Zic or just run `juju gui --show-credentials` ;) [15:55] marcoceppi: dangit, I always forget that [15:55] Zic: That'll start up the gui and give you a url to hit [15:56] Zic: Login with "admin" and your password, then look for the export button, which is at the top and looks like a box with an up-arrow [15:56] ryebot: yeah, it's the step I followed to bootstrap the cluster successfully [15:56] Zic: Click that, and it'll download a copy of the model. We'd like to see it. [15:56] it's the "teardown" part where I don't know what the best practice is :) [15:56] oh ok, I will do that now [15:57] Zic: The up-arrow button will download the model in YAML representation. You can save it and it will help you deploy the same environment again in a repeatable fashion [15:57] I wrote a more detailed step-to-not-reproduce-but-post-mortem: http://paste.ubuntu.com/23894187/ [15:57] mbruzek: do I need to reinstall the VMs and machines which host the cluster?
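(The export flow above, collected in one place; newer Juju releases can also export a model from the CLI with `juju export-bundle`, which did not exist at the time.)

```
juju gui --show-credentials        # prints the GUI URL plus the admin credentials
# or, equivalently:
juju show-controller --show-password
juju gui
# then log in as "admin" and use the export (up-arrow) button to save the model as bundle YAML
```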
[15:58] Zic: hey, just as an open invite, we'll be in Belgium next week if you feel like hopping on a train to talk face-to-face: http://summit.juju.solutions/ [15:58] Zic: The summit is free as in beer [16:00] Zic: Because you used a mixture of Amazon and the Manual Provider it may not be as easy as a juju deploy bundle.yaml, but after you manually provision those physical systems you can deploy the bundle. [16:00] Zic: Pastebin the model when you get that done [16:01] mbruzek: the Amazon machines were linked with the Manual provider too, I don't use the AWS credentials [16:02] (our AWS instances are popped by Terraform, for the anecdote) [16:03] so all I need to do is 1) reinstall all the OSes 2) relink to the Manual provider of the Juju controller 3) redeploy the YAML I'm exporting, or is the 1st step useless? [16:03] jcastro: will be happy to come, Belgium is not far away, I will try to discuss at our meeting if we can go with my company :) [16:05] bring as many people as you want too, it's a free event. [16:05] http://paste.ubuntu.com/23894269/ [16:05] Zic: Why are you installing the OS? [16:06] mbruzek: because we don't have MaaS and our homemade installer has no connector to Juju (but lazyPower let me know that I can write one :)) [16:06] am I missing something? [16:07] Zic: No, I just didn't understand your environment. I was about to tell you about MAAS but you already know. [16:08] VMs and physical servers at our datacenter are auto-installed by a homemade installer like MaaS (which autocompletes our internal Information System, registry, and some warranty support) [16:08] for AWS, we just use an AMI [16:08] so I told lazyPower that maybe, in the future if we have more Juju infra, I will install a MaaS [16:08] or maybe start to write a connector for Juju if it's not too hard for my level of knowledge [16:08] Zic: With that new information, it seems your steps are right. I was hoping to avoid having to reinstall the OS [16:09] ok, I was asking about the reinstallation in case Juju provides a clean way to tear down the cluster [16:09] if not, no big deal, I will just need to redo the manual-provider part, the reinstallation is fast and automatically done [16:10] Zic - as they were manually enlisted, there's no clean way to tear it down; once you juju remove-application the machines will be left behind and still be dirty. [16:10] Zic: so you can issue: juju destroy-environment <environment>, [16:10] oh hi lazyPower :) [16:10] Zic: From the sound of it, you don't need to reinstall juju [16:10] just remove your manual machines, reprovision, and add them back [16:11] *if you want to, though, go ahead :) [16:12] :) [16:12] I'm sure this new cluster will not have all these problems, as it was at the beginning, by the way [16:13] Zic: For reference: https://jujucharms.com/docs/stable/clouds-manual If you add the machines in manually you can use the bundle.yaml file you just downloaded to redeploy on the right systems using the to: (machine number) [16:13] it must be some sneaky thing that came up with Friday's incident, even if I don't know what it is [16:14] mbruzek: yup, I will re-add the machines in the same order (and will check it through juju status btw) [16:14] it's what I did the first time to match charms with the hostnames of the machines (as we use predictable, role-named hostnames) [16:17] Zic: Looking at the last pastebin I see the variety of machines you are using, some of your workers have 20 cores and some have 2. You should be able to identify the systems by their constraints.
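(A sketch of re-enlisting the reinstalled hosts with the manual provider and redeploying from the exported bundle; the host name is a placeholder.)

```
juju add-machine ssh:ubuntu@<host>   # repeat per host, in the same order as before
juju status                          # confirm the machine numbers line up with the bundle's to: entries
juju deploy ./bundle.yaml            # redeploys using the exported placements
```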
[16:19] mbruzek: do you think I can test restoring the etcd database from a backup taken before Friday's incident? Or is it worse and I just shouldn't spend that time? [16:21] I don't know how Kubernetes manages a restore of etcd when there is a delta between what is currently running in terms of pods, services... and what is restored in etcd [16:22] (I had 3 Vitess clusters deployed when I did the etcd backup, I have wiped all of them since then) [16:22] Zic - that is a problem. your snapshot will not contain the TTLs on the keys, so you'll restore to whatever the state was during that snapshot [16:22] this may have implications on running workloads [16:23] Zic: Here is what I would recommend. Redeploy this cluster, and once you get everything working and in a good state take a snapshot of the etcd data (before you do any non-juju operations) [16:24] because I have two options : 1) etcd backups 2) all management parts of the cluster (easyrsa, kube-api-loadbalancer, etcd and kubernetes-master, so all except kubernetes-worker) are snapshotted daily [16:25] so I set back to the past all the management part, I don't know how the kubernetes-worker part will act [16:25] I know that's complicating the problem instead of reinstalling everything, it's just to know what I could possibly do if it was in production [16:26] s/so I/so if I/ (did nothing actually for now :p) [16:26] Zic: Technically I think you could do both: backup etcd, and snapshot the Kubernetes control plane [16:27] Zic: the etcd charm has snapshot and restore actions provided in Juju; you can run that at the same time you snapshot the control plane [16:28] oh I didn't know, I did this backup manually via crontab and the etcdctl backup command [16:29] so, I will try to set all the K8s control plane back to thursday, and see if from there I can directly upgrade to Kubernetes 1.5.2 [16:30] mbruzek: if a component of the charms (like etcd) appears in APT's upgrades, should I apply or hold that package? [16:30] it's one of the first steps that led to my disaster last week [16:30] Zic - that's a great question, and I should probably be pinning etcd if delivered via charm and release charm upgrades when the package is upgraded. [16:30] don't know if it is for real, but it was in the steps [16:31] ok, I will pin etcd [16:31] to unpin and rev the etcd package [16:34] I think that all these problems came from the upgrade of etcd via APT *PLUS* the fact that I ran large delete operations on large namespaces just before, and maybe I didn't wait long enough [16:34] (concerning friday) and concerning today, maybe some parts have not been working perfectly since then [16:34] * mbruzek suspects that as well [16:34] and it's not reproducible unless you can do the exact same delete operation and upgrade etcd at the wrong time like me :D [16:35] as I said, before that, all my resilience and HA tests were perfect :) [16:35] I thought I would go to prod quickly ^^ [16:36] it's not the first time etcd f*cked me up, in other technologies than K8S (or even Vitess, as lazyPower knows), I know it's not your fault and I'm very happy with all the help you were able to provide me these last days ;) [17:03] lazyPower mbruzek ryebot: I successfully returned to the previous state before the incident via my backed-up snapshot of all the K8s control plane, so I'm going to immediately upgrade to 1.5.2 and will redo my own step-to-reproduce [17:03] I expect to... not reproduce my problem :) === petevg is now known as petevg_noms [17:03] why upgrade?
If you deploy new you should get 1.5.2 by default [17:04] Zic: ^ [17:05] Zic | I know that's complicating the problem istead of reinstall everything, it's just to know what can I possibly do if it was in production [17:05] ^ just to test that [17:06] I restored the VMs (which host master, etcd, apilb and easyrsa) of a ESX snapshot of wednesday [17:06] (my cluster works perfectly at this date) [17:06] I have just the upgrade to 1.5.2 to redo [17:07] and I'm sure that my step-to-reproduce the problem will not work, as it seems to be tied with the etcd disaster of friday [17:10] I can confirm, I can't reproduce my own previous problem \o/ [17:11] so it seems that something I did friday corrupt something (etcd I suppose) was the guilty part [17:12] I just restore all management part to wednesday, reupgrade to 1.5.2, restore some kubernetes-worker and etcd... all seems fine [17:12] the only difference is that I immediately upgrade to 1.5.2 before deleting my large namespaces [17:13] and that I did not upgrade etcd through APT this time [17:13] cc lazyPower ^ [17:13] Zic - thats good to hear. I'm going to circle back and file a bug if you dont beat me to it, against layer-etcd to pin the package or make it configurable. [17:14] :) in my side, I will do a simple apt-mark hold etcd for this time [17:14] i haven't had the pleasure of testing that scenario where etcd is upgraded out of band by an apt-get operation, so it may have been attributed to that, or it might have beena ttributed to broken key/val data in etcd due to the delete. [17:14] a mix of the two I think, upgrade via APT during broken key/val operation [17:15] it's the only part I didn't test in my step-to-reproduce [17:15] yeah, thats crappy that we werent' able to recover from that though [17:15] (I immediately upgrade, and then delete all my namespace) [17:15] (thinking of the buffer problem of kubeapilb) [17:16] Zic - i suppose moving forward, the suggestion is to snapshot your data in etcd, hten run the upgrade sequence. its going to stomp all over your resource versions doing the restore but its better to attain that prior state than to be completely broken. [17:16] yep [17:16] I will try to look precisely at the apt upgrade proposition also [17:16] to not upgrade anything that's managed by Juju carms [17:17] charms* [17:18] lazyPower: the juju etcd charms does not include auto-backup right? do you think it's a good idea, as etcd-operator do it? 
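(The pinning and charm-driven snapshot being discussed, roughly; the action name comes from the etcd charm of that era, so check `juju actions etcd` on your revision before relying on it.)

```
sudo apt-mark hold etcd               # on each etcd unit, until the charm pins the package itself
juju run-action etcd/0 snapshot       # returns an action id
juju show-action-output <action-id>   # fetch the result / snapshot location
```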
[17:18] currently I run the backup through a crontab on each etcd units, mbruzek told me that I can do the same with a juju action, I will go with that I think [17:18] Zic - i'm open to a contribution for auto backups, but as it stand today its an operator action [17:18] you can run that backup action in like a jenkins job and have it archive, and then you have an audit trail [17:19] i get leery of automatic things that have no visibliity (like cron) [17:19] personally I prefer to configure this type of backup on my own [17:19] the last thing i want is to assume its working, by wrapping it in a CI subsystem you have trace logs and know when it fails [17:19] but as Juju is here to help, it sounds like a good feature :) [17:20] haha, yeah, I can uderstand that part :) [17:20] and the whole juju action part ensures its repeatable :) [17:20] even if etcd-operator do the job, I will actively monitor what he does [17:20] plus those packages are what iv'e tested for restore, its effectively teh same thing, but i'd hate to think adding an extra dir to the tree or something would cause the restore action to tank. [17:21] and then its added gas to the fire [17:21] metaphorically speaking anyway [17:22] for now I do the both : daily snapshots of the VMs which host etcd units + etcdctl backup command [17:23] thats a good strategy [17:23] +1 [17:33] lazyPower: do you have any docs on how Juju and MaaS is connected, code-sided? [17:34] Zic: what do you mean? [17:34] I will lurk at if it's valuable for us to develop a connector for our own installation/provisionning infra or deploy a simple MaaS for Juju architecture [17:34] rick_h: oh hello, about this thing ^ [17:34] we have a kind of MaaS which is connected to all our services in my company [17:35] it's an homemade system and I don't know if I can write a simple new "provider" for Juju, or if I just go to MaaS [17:35] (will be a little redundant) [17:37] Zic: check out https://github.com/juju/gomaasapi and https://github.com/juju/juju/tree/staging/provider/maas === mskalka is now known as mskalka|afk [17:42] rick_h: thanks [17:48] mbruzek: hmm, the juju debug-log is kinda flooding since the new upgrade to 1.5.2 : http://paste.ubuntu.com/23894873/ [17:48] (I use the juju upgrade-charms command) [17:52] lazyPower: I'm reposting as you were offline : the juju debug-log looks really strange since my new upgrade to 1.5.2 : http://paste.ubuntu.com/23894873/ [17:53] Zic - most of that is normal. the leadership failure - if it continues to spam we'll want to get a bug filed against that [17:53] it floods in loop :) [17:53] thats the unit agent complaining about a process it needs to do leadership stuff for coordination. not related to teh charms however [17:54] Zic - i'm headed out for lunch will be back in a bit [17:54] hmm, I didn't do a kubectl version after juju upgrade-charm, I just take a look at juju status but I'm always on 1.5.1 in fact :o [17:59] (I confirm I stayed in 1.5.1 after the juju upgrade-charms, I was just looking at juju status to all return to green and didn't look at the application version -_-) === petevg_noms is now known as petevg [18:20] how juju world [18:20] howdy* === frankban is now known as frankban|afk === mskalka|afk is now known as mskalka [18:25] Zic - so you're saying juju upgrade-charm on the components either didn't run, or the resource was not upgraded? 
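(Backing up to the etcd backup discussion: a sketch of the cron-driven etcdctl backup Zic mentions; the data directory is an assumption, and lazyPower's point stands that wrapping the same command in a CI job gives you logs and failure visibility that plain cron does not.)

```
# /etc/cron.d/etcd-backup on each etcd unit (etcdctl v2 "backup" subcommand)
0 3 * * * root etcdctl backup --data-dir /var/lib/etcd/default --backup-dir /var/backups/etcd/$(date +\%F)
```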
[18:25] sorry for latency, i'm at terrible coffeeshop free wifi === scuttle|afk is now known as scuttlemonkey [18:27] lazyPower: the upgrade-charms command just changed to the lastest version of the charm, but the application was not upgraded [18:28] Zic - that seems strangely reminiscent of another user reporting at deploy time they didn't get the upgraded resource [18:28] this is recoverable [18:28] on the store display, you can fetch each resource and manually attach them to upgrade the components. [18:29] we actually just landed a doc update about this, 1 moment while i fetch the link [18:29] Zic - https://github.com/juju-solutions/bundle-canonical-kubernetes/pull/197 [18:30] the upgrade performs well the first time with the same cluster, don't know what happen :( [18:30] does the order of what charp is upgraded via upgrade-charm count? [18:30] charm* [18:31] oh I know what happen [18:31] in 1.5.1 Flannel does not start properly [18:31] I forgot to restart them :} [18:32] in juju status, if you ran the upgrade-charm step, you should still see 1.5.2 listed as your k8s component versions [18:32] assuming it went without error. if there's a unit(s) trapped in error state that are related, its possible that the upgrade hasn't completed [18:32] does the upgrade-charm start if Flannel is in error? [18:32] ah [18:32] if the charm is in error, it will halt the operations on related units [18:32] until the error is resolved [18:33] ok lunch is over for me, heading back to the office and will resume then Zic. [18:33] o/ [18:39] Zic: I am back. [18:40] Zic: What is the current issue? [18:40] mbruzek recap: I restored my old cluster to wednesday, I ran a "juju status", all was green, I ran juju upgrade-charms on easy charms, I did another juju status at the end and all was green, and the "Rev" column contains the latest version of the charm, but in fact, the software version was always 1.5.1 for kubernetes-master/worker for example. I remembered that Flannel don't start well on boot sequence on [18:40] 1.5.1 lately, started it on every node and the upgrade was unblocked. The only weird thing is that Flannel was shown as "active/green" in juju status so... [18:41] so all is fine actually, was my mistake with Flannel not autostarting well on old 1.5.1 [18:41] (and juju status which show me as active/green the first time) [18:41] Zic: Yes we fixed the flannel restart issue in 1.5.2 so I am confused why flannel didn't restart. [18:42] mbruzek: the restaured cluster was in 1.5.1 [18:42] restored* [18:42] ah [18:42] OK [18:42] :) [18:43] apparently, Flannel not started was blocking the upgrade [18:44] But everything has started now? [18:44] yep [18:44] it's my fault, when the cluster was restored to wednesday/1.5.1, I just did *one* juju status, all was green [18:44] normally I run a watch -c "juju status --color" [18:45] I didn't see between the first "juju status" and the upgrade that Flannel passed in "error/red" [18:46] and that's apparently what's blocked the upgrade as after Flannel was manually started, the upgrade begins instantly [18:53] lazyPower: TL;DR : it was Flannel (of the 1.5.1 version) which blocked my upgrade :) [18:53] ah good to know [18:53] it's OK now [18:54] i wish we could retroactively fix that [18:54] #fwp [18:54] right! [18:55] was my fault also as I just run onetime juju status, showed all green, then upgrade-charms, see it did not nothing at the juju debug-log, re-run a juju status, show that Flannel is in error... 
and remembered that on wednesday (date of the snapshot) I was still on 1.5.1 with the flannel issue :p [18:55] normally I monitor the upgrade-charm process through a watch -c "juju status --color" :p [18:56] s/it did not nothing/it did nothing/ [18:56] double-negation is dangerous. [18:57] IN CONCLUSION (sorry for the caps), I have a good-running 1.5.2 cluster, no CLBO, PodsEviction works well if a node goes down... [18:57] \o/ [18:57] yay [18:57] ... well, the only last point is, can I come to the Juju Summit? :D [18:57] I will discuss this with my company :) [18:57] it's a freebie event, and you're invited and can bring more [18:57] so, load up the posse and meet us in Ghent :) [18:57] Zic: Yes you are most welcome to join us [18:58] you will hear my perfect^WFrench accent \o/ [18:59] hmm, just one real last point: http://paste.ubuntu.com/23895203/ [19:00] all is up-to-date, isn't it? I have a doubt about k8s-master, which says 1.5.1 [19:00] (because: ERROR already running latest charm "cs:~containers/kubernetes-master-11" if I try the juju upgrade-charm kubernetes-master again) [19:01] kubectl version returns 1.5.2 [19:14] it seems to be a display bug SSHing directly to the master, all components are on 1.5.2 [19:14] s/bug/bug./ [19:20] Question: how can I FORCE destroy something on juju? I have 6 containers which will not go away [19:20] Teranet: You want to destroy everything juju? [19:20] yes [19:21] so I can redeploy from scratch [19:21] Teranet: OK here is the command, but you should be careful with this. [19:22] juju destroy-controller --destroy-all-models [19:22] teranet: juju kill-controller will tear it all down, including the controller node [19:22] ok let me see if that works [19:23] thx [19:23] Teranet: https://kubernetes.io/docs/getting-started-guides/ubuntu/decommissioning/ [19:23] covers everything [19:23] oh sorry, thought you were using kubes [19:24] nope but it's ok [19:24] the Cleaning Up the Controller part at the bottom should still apply [19:24] I would try to remove the model first though with juju destroy-model [19:24] https://jujucharms.com/docs/2.0/controllers [19:24] I use my own private cloud and had a broken relationship which broke my complete nova environment [19:24] juju destroy-model I had run and it was stuck [19:25] I had mistakenly applied an HA relationship which resulted in destroying all my compute nodes :-( [19:25] lucky I hadn't deployed VMs in OpenStack yet === scuttlemonkey is now known as scuttle|afk [19:53] lazyPowe_ mbruzek: hmm, the two additional (scaled from 3 to 5) etcd members seem unhealthy in etcdctl cluster-health [19:53] http://paste.ubuntu.com/23895436/ [19:54] (after an upgrade of the charm) [19:54] I just restarted the etcd service via systemctl and all is fine [19:54] (just to let you know in case it's a known issue) [19:55] all nodes are healthy after that [19:55] Zic - seems like it might have raced, i haven't seen any test failures doing scale testing [19:56] and there's logic to help prevent that in the charms [19:57] hmm I spoke too quickly, seems the restart does not suffice, it is unhealthy again, some etcd logs: http://paste.ubuntu.com/23895453/ [19:57] I have this problem just on my 04 and 05 etcd nodes [19:59] hmm, seems just a bit of flapping: http://paste.ubuntu.com/23895464/ [19:59] they are all healthy again [20:01] cory_fu: petevg: if matrix (or any other project) depends on juju-plugins, the setup.py PR (https://github.com/juju/plugins/pull/75) would make the crashdump PR (https://github.com/juju-solutions/layer-cwr/pull/46)
unnecessary, right? [20:02] kwmonroe: matrix doesn't depend on it. [20:02] kwmonroe: cory_fu pushed back on that, and I think that he's right. crashdump needs python2's yaml, and matrix just handles python3 stuff. [20:04] ok, well let's forget matrix for now petevg.. should juju-plugins be some kind of packaged citizens? [20:04] kwmonroe: yes. I think that we should merge the PR. I'm a little biased, though, on account of it being my PR :-) [20:04] kwmonroe: Yeah, if we can update crashdump to work in py3 (i.e., if that one bug is fixed upstream), then we could perhaps make plugins a dep for matrix. But I also kind of like having it as optional functionality that works if you have the lib installed and is otherwise a no-op [20:05] An optional dep, if you will [20:05] Somebody does need to go and clean up the merge conflicts, though. [20:05] kwmonroe, petevg: +1 to packaging juju-plugins for easier install. Could also be a snap [20:05] cory_fu: yeah. No matter what, matrix shouldn't fail if crashdump doesn't exist. [20:05] petevg: if only there were a recently gung-ho ~charmer that could propose a clean PR... [20:06] kwmonroe: yeah. It's on my list o' things to do today, once I finish running this double set of tests, where I'm confirming that I'm telling the truth about matrix running with an without the crashdump libs :-) [20:07] my beef is really that the juju-plugins readme says "clone this repo", and that's not enough [20:07] .. for runtime [20:07] .. sometimes [20:07] Yeah. Adding the repo to your PATH is adequate, but not pretty. :-) [20:07] hey! ^^ that's a java slogan right there, right mbruzek? for runtime, sometimes? [20:08] Heh. [20:08] hi === Salty is now known as SaltySolomon [20:09] except it's not petevg, doesn't crashdump need pyyaml at runtime? nothing about cloning that repo and adding to the path helps you there. [20:10] kwmonroe: true. nm [20:10] kwmonroe: I will ping people when I have the new nice PR :-) [20:12] very fine petevg -- fwiw, i'm really trying to say that j-p is all growed up and it's time to consider which format to deliver it in. [20:13] Cool :-) [20:13] your keyboard says smiles, your subtext says ugh. [20:14] Read into stuff much? :-p [20:14] you did it again! [20:15] It is in Python2 still. Silly bug. [20:26] kwmonroe: Stop arguing about it and create a snap. ;) [20:28] 90 seconds and i have no retort. you win this round cory_fu. [20:28] :) [20:49] ok quick question I had a host deployment failure it timed out in the bios setting how can I initiate a redeployment to this box ? [20:49] ryebot / mbruzek / lazyPowe_ : just a final word: I did all my resilience and HA test and this time, all is working [20:49] Zic: great [20:50] Zic: Awesome! [20:50] Zic awesome, glad you kept at it and had positive results :) [20:50] ^5 [20:51] the customer of this architecture leaves an old VMware ESX platform, these robust Hosts machines will be added as kubernetes-worker at term :) [20:52] EC2 instances are just from popping near their own customer, in their country for each endpoint [20:53] I will keep you updated as a testimonial of how good CDK will do the work for the coming-launch :) [20:54] at term, this cluster will run a 3 Vitess cluster, Cassandra/Spark/Zeppelin, some Nginx and php-fpm7 [20:54] thats a nice spread of workloads [20:55] got some presentation layer, some app layer, some business intelligence in there, and i dont know what vitess is but i assume its funky vegetables [20:55] lazyPowe_: you are on the Vitess' Slack, don't you? 
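(For context on the juju-plugins packaging discussion above: what "clone this repo" implies today, including the runtime dependency kwmonroe is pointing at; the clone path is illustrative.)

```
git clone https://github.com/juju/plugins.git ~/juju-plugins
export PATH="$HOME/juju-plugins:$PATH"   # adequate, but not pretty
sudo apt-get install -y python-yaml      # juju-crashdump still needs Python 2 PyYAML at runtime
```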
[20:56] Zic - negative, i'm on 7 slacks but that is not one of them [20:56] ah, I crossed you at the K8S Slack :) [20:57] (for a problem about Vitess, and so, I was invited to the Vitess Slack; I didn't remember where we'd crossed paths :)) [20:57] http://vitess.io [20:57] it's the way that YouTube uses MySQL in their infra, especially in Kubernetes/Borg [20:59] ahhhh ok [21:00] bookmarked for later reading [21:00] kwmonroe, cory_fu: PR for you https://github.com/juju-solutions/matrix/pull/73 [21:00] on a totally different subject, my colleague saw the OpenStack Juju bundle charms [21:01] it's pretty... dense :) [21:01] Zic there are lots of applications in OpenStack [21:02] Zic: but one can upgrade through the releases of OpenStack easily with those charms. [21:02] (he saw me do some drag'n'dropping in the Juju GUI and when he visited the jujucharms.com website, he saw the openstack bundle) [21:02] mbruzek: yeah, he is interested as we also have a PoC for OpenStack and... it was not so conclusive [21:03] if you want to build OpenStack from scratch on your own, and maintain that infra, it costs a lot of time, especially at the beginning when you're alone [21:03] (before the infra is up, running and... documented :)) [21:04] I think he will try the OpenStack bundle charm :D [21:05] going to sleep anyway, I've been at extra-unofficial hours for too long :) [21:05] Zic: I totally recommend it, I know lots of 1-3 person teams that maintain openstack in production with juju [21:06] yeah, from my own experience with Juju now, I can recommend it for other technologies :) [21:06] we mainly use Puppet here as our configuration-management tool, sometimes Ansible for particular work... [21:07] as a K8S module for Puppet does not exist and would be a headache to maintain by ourselves, I went with Juju :) [21:07] (kubeadm first, then I discovered CDK via Juju) [21:13] the second lesson I learned from Juju/K8S/Vitess is that I must learn Go someday :p [21:13] more and more of the technologies I use are written in Go [21:14] * Zic said he is going to sleep 10min ago, too talkative, g'night [21:16] cheers o/ === mskalka is now known as mskalka|afk [22:29] so how do I go about enabling elasticsearch and kibana for logging on my CDK cluster? [22:34] thedac, were you able to talk to jonh from CPLANE today [22:34] ? [22:35] narinder: yes [22:44] I keep coming across documents that talk about exporting a couple of env vars before bringing up a k8s cluster to get it to spin up elasticsearch and kibana pods. does anyone know how to get those pods deployed in a pre-existing cluster easily? [22:48] was add-metric ever added to the charmhelpers? [22:49] cmars, ^^ [22:49] cholcombe, no, i don't think it was [22:50] cmars, we should get that fixed so i don't have to keep calling subprocess.check_output to add metrics :) [22:50] cholcombe, could do, yeah [22:51] cholcombe, where is the charmhelpers project? [22:52] still on LP? [22:52] hmm, seems so. sure, i'll look into this [22:52] cmars, https://code.launchpad.net/charm-helpers [22:52] yeah [22:53] cmars, i'm working through a PR on gerrit and people are asking why I have to make a subprocess call to a juju function [22:54] cholcombe, it'll probably take all of a couple minutes to write, the rest of the day to document and test :) [22:55] lol yup [22:56] cmars, sorry to be a pain in the butt [22:57] cholcombe, :) it's fine, i'm just complaining [22:57] needs to be done..
layer:metrics isn't terribly efficient [22:59] stormmore: hey, so you can either deploy elastic search and kibana on your cluster, or you can deploy elasticsearch/kibana/beats along side it [23:00] marcoceppi at the moment I am thinking of on the cluster to minimize the number of "machines" in use. still in the process of architecting a bigger bare metal cluster [23:00] marcoceppi I already have a small k8s cluster deployed though [23:01] stormmore: makes sense. I don't have much experience in doing that, but you should be able to follow any online guide that walks through elastic on k8s [23:02] cmars, i'd use it but the ceph charms haven't gone layered yet [23:02] stormmore: I can't confirm, but this one looks promising: https://github.com/kayrus/elk-kubernetes [23:02] it's atleast been recently updated [23:03] marcoceppi and that is where the problem lies, it seems to assume that you are not adding it but enabling it before bringing up the cluster... from what I can see it is an addon at this point [23:03] https://kubernetes.io/docs/user-guide/logging/elasticsearch/ [23:28] lazyPowe_ are you around? do you have any input into adding elasticsearch & kibana pods to kube-system? [23:29] stormmore - our integration point was an external logging deployment using the beats-core bundle as a foundation for that effort [23:29] the idea is that if your k8s cluster is sick, you'd want some persistence around that data, and have it be accessable regardless of the kubernetes system sttate [23:29] so it uses beats to ship the data over, and then gets parsed-reinterpreted by the kibana dashboards [23:30] lazyPowe_ hmmm insteresting considering conjure-up docs suggest that it deploys 2 elasticsearch pods and a kibana one [23:30] wat [23:30] when did this happen? [23:30] stokachu - wat? [23:31] https://insights.ubuntu.com/2016/11/21/conjure-up-canonical-kubernetes-under-lxd-today/ [23:31] ooooo [23:31] this isn't the conjure-up docs or prompt, this is the upstream k8s guide [23:32] right, at this time, teh beats core bundle was part of CDK [23:32] its now an ancillary bundle, pending our v5 update of the elastic stack components [23:32] "conjure-up kubernetes" vs "conjure-up canonical-kubernetes"? [23:32] that work thats there still works and functions as it did then, but it could be better with the v5 updates as there were a ton of fixes and normalized versioning schema and etc. [23:33] stormmore - so in short, what you get today with canonical-kubernetes, is much more aligned with a smaller deploment, and you can then add the beats components and relate it all. 
we have a todo to get another bundle published since we moved to the fragments, but we're holding off until the v5 rev of the elastic stack iirc [23:34] stokachu - un-wat, miscommunication [23:36] stormmore - i'll take a line item to bring this up with the team about seeing if we can get you an elastic-enabled bundle tomorrow [23:36] most of the team has left, im' sticking around a little bit longer to check on this deployment i'm running, and then i'm out for the evening as well [23:37] lazyPowe_ awesome, no worries [23:37] i would build you one now, but that would be a "throw it over the fence good luck i'm behind 7 proxies" kind of thing to do [23:37] i'd rather at least run a test deployment before i put it in your hands [23:37] lazyPowe_ I am just trying to make sure I have everything in place so dev doesn't need access to the nodes and have a UI to get the logs from [23:37] yep, totally understand that [23:37] why give them admin when read-only works [23:38] have you been looking into RBAC k8s primitives perchance? [23:38] those seem like they are going to be right in your wheelhouse [23:38] you can assign roles to namespaces and scope what primitives they can interact with [23:38] rather roles to users, in a namespace, and .... [23:38] see above [23:39] stormmore - https://kubernetes.io/docs/admin/authorization/ [23:39] we haven't fully enabled this yet as its currently in BETA [23:40] but you'll def want ot read up on it and when we land teh feature set in the charm to make that configurable, you'll be in container-topia [23:42] yeah exactly [23:45] cholcombe, here you go, wasn't nearly as bad as i thought :) https://code.launchpad.net/~cmars/charm-helpers/add-metricenv/+merge/315952 [23:45] woo [23:46] cmars, nice. i forgot about the JUJU_METER thing [23:46] i should write more python tests for my charms... mock.patch is pretty easy to work with [23:47] gotta run now. if you could help me get this landed, or reviewed -- happy to fix things up however -- i'd much appreciate it! [23:56] cmars, sure. i can review it but i can't land it
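(To close the loop on the add-metric thread: a sketch of the Juju hook tool the new charmhelpers wrapper shells out to; it is only valid inside a charm's collect-metrics hook, and the metric names must be declared in the charm's metrics.yaml -- the names and values here are made up.)

```
# inside a charm's collect-metrics hook
add-metric users=42 connections=3
```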