[04:21] So is there a charm and / or a doc on standing up a private docker registry for k8s? [04:22] Budgie^Smore - there's an open PR that hasn't made the shift to the upstream repository that adds this functionality into k8s itself - https://github.com/juju-solutions/kubernetes/pull/97 [04:23] once that lands it'll get released with our next update to the charms, we have some additional prs that need to land to support that change. but its on the horizon [04:23] so again I am getting ahead of myself :) [05:13] I am pondering running nexus 3 ina container in the meantime (possible long term depending on the registry functionality) [07:23] blahdeblah: It's about this bug https://bugs.launchpad.net/nrpe-charm/+bug/1633517 [07:23] Bug #1633517: local checks arn't installed sinds nrpe-7 [07:30] BlackDex: You've caught me a little late in my day, but if I get a chance I'll have a look at that if I have some spare time. [08:09] Good morning Juju world! === frankban|afk is now known as frankban === junaidal1 is now known as junaidali [09:16] blahdeblah: Thx, i don't mind i just need a answer/help, i know the time-difference is there, so no prob, i'm glad someone wants to take a look [09:18] morning kjackal o/ [09:19] hell marcoceppi [09:22] BlackDex: Definitely keen to find out what's going on; what times (UTC) are you likely to be around most? [09:26] blahdeblah: im in utc+1 (netherlands) so that would be utc 8:00 till around 16:00 [09:26] BlackDex: ack - will try to catch you in your mornings [09:26] oke :) cool thx! [09:29] hi here, I'm asking myself if a complete teardown via Juju (and tearup) could permit a resolution of this issue: https://github.com/kubernetes/kubernetes/issues/40648 [09:30] because it does not seem that many people have encountered this one :/ === sk_ is now known as Guest67469 [09:37] (I'm also trying to see in the Kubernetes' Slack if somebody already encountered this issue) [09:42] Zic: it might be? Are you on 1.5.2? [09:46] marcoceppi: yep [09:47] Zic: I don't feel comfortable saying scrap and redeploy, esp if there's information we can capture from your deployment to improve CDK, but I also don't want you siting with a wedge'd cluster [09:48] Zic: lazyPower mbruzek & co should be online in the next few hours [09:48] yeah, they helped a lot with the first party of this problem friday :) [09:49] s/party/part/ :] [09:49] I'm cool with calling it a party instead of a problem ;) [09:50] the first part was about all my pods crashed with this kind of error, and even some kubectl command (which actually "do/write" something, like create/delete, as get/describe works) return this kind [09:50] upgrading to 1.5.2 pass all my Pods to Running [09:51] but this weekend, when I tried to reboot some kubernetes-worker to test the resilience and the eviction/respawn of pods, I fell again in a sort of same problem :/ [10:24] oh, I found something strange, cc @ lazyPower [10:24] http://paste.ubuntu.com/23892884/ [10:25] etcd again /o\ [10:26] I saw this via juju status, flannel/2 was marked as "waiting" indefinitely [10:26] Zic: you have 1,3, or 5 etcd machines? [10:26] 5 etcd [10:26] on 5 different VMs [10:26] doh, just saw the flanneld line [10:27] I didn't have Flannel in this state when I opened the GitHub issue [10:27] interesting [10:27] the last guilty for my first problem was also etcd [10:27] it seems it just started happening baesd on the logs [10:28] do you know if I can do a "fresh start" of an etcd database for canonical-kubernetes without redeploy it from scratch? 
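(A minimal sketch of how the etcd/flannel state discussed above can be inspected before deciding to wipe anything; the unit name, endpoint, and certificate paths are placeholders and will differ per deployment.)

```
# Ask Juju to report etcd health on one unit (unit name is an example)
juju run --unit etcd/0 'systemctl status etcd --no-pager'

# etcdctl v2 syntax, as shipped with CDK at the time; the TLS file paths are
# placeholders -- use whatever the etcd charm actually laid down
etcdctl --endpoints https://127.0.0.1:2379 \
        --ca-file /path/to/ca.crt \
        --cert-file /path/to/client.crt \
        --key-file /path/to/client.key \
        cluster-health
```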
I don't have any important data for now in this cluster [10:29] (= I have all my custom YAML file to redeploy all easily) [10:29] files* [10:32] Zic: you should be able to just remove the etcd application then redeploy etcd and re-create the relations [10:32] you might get some spurrious errors during removal, and I'm not sure if it's a tested path or not [10:33] theoretically, you should be able to, but distributed systems are always a bit interesting in practice [10:34] oh you mean via Juju? I thought to wipe the data directly in etcd, but even if I don't know well etcd, I suppose there is some "default-key/value" needed and provisionned via Juju deployment at bootstrap :/ [10:35] Zic: not sure via etcd, from a Juju perspective the "keys" for TLS are actually a charm, the easyrsa charm, so since there's a CA running still it'll just get new certs, distribute those via relations and k8s will be reconfigured to point at that etcd [10:36] Zic: as for the etcd portion, there probably is a way to wipe, I'm just not sure of one [10:41] marcoceppi: do you advice me to wait for lazyPower and ryebot to come up before smashing etcd in the head (it's not the first time etcd annoys me, even in other technologies than K8s/Vitess :)) [10:41] ? [10:42] Zic: It's probably a good idea to wait for them, but smashing etcd over the head might also be very theroputic. I'll make sure we have some people in EU/APAC timezone come up to speed with kuberentes knowledge so there's not so much a wait period [10:53] oh I will never complain about timezone as it's a community support channel :) but great to here it [10:54] hear* [11:12] new info: old NodePort service are always listening, but with a new NodePort service just deployed, no nodes are listening on this port :/ [11:12] I think Flannel is the guilty but because it cannot contact etcd [11:17] stokachu: ping when you're around [11:18] or mmcc but I doub't you'll be around before stokachu [12:45] Hey BlackDex, this might be a long shot, but the bug you posted on NRPE isn't related to https://bugs.launchpad.net/charms/+source/nagios/+bug/1605733 by any chance? [12:45] Bug #1605733: Nagios charm does not add default host checks to nagios [14:09] marcoceppi, ping [14:25] stokachu: hey man, what's conjurebr0 for? [14:25] stokachu: I'm doing some super weird things in a spell, and was curious [14:25] marcoceppi, it's mainly for openstack on novalxd to have that second nic for its neutron network [14:26] but it's always there so you could rely on it if need be [14:32] mbruzek: hi, are you around? just saw you joined, sorry if I disturb you [14:36] Zic I am here. What can I help with? [14:39] mbruzek: remember the last time with my Ingress controller in CLBO? I thought all was fixed after upgrading to 1.5.2, but when I rebooted some nodes, the problem came back... I continued to look at the problem today and saw Flannel is completely messed up: http://paste.ubuntu.com/23892884/ [14:39] actually, all new NodePort are not working :s [14:39] hrmm. [14:40] There must be a problem with the reboot sequence. Do you think you could reproduce this? [14:41] stokachu: is it routable, and is it connected to the controller? [14:41] marcoceppi, yea it's routable, but not connected to the controller [14:42] stokachu: cool [14:42] stokachu: second question, can I reference a local bundle.yaml file in the spell metadata? 
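(A rough sketch of the remove-and-redeploy path marcoceppi describes above; the charm URL and relation endpoint names are recalled from the CDK bundle of that era and may not match your revision, so treat them as assumptions.)

```
juju remove-application etcd                     # may show the spurious errors mentioned above
juju deploy cs:~containers/etcd --to <machine>   # placement is up to you
juju add-relation etcd easyrsa                   # fresh certs come from the still-running CA
juju add-relation etcd:db kubernetes-master:etcd
juju add-relation etcd:db flannel:etcd
```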
[14:42] mbruzek: I tried to restart the flannel service but with the same result, I didn't try to reboot another node to see if I can reproduce [14:42] marcoceppi, you would just place a bundle.yaml in the same directory as your metadata.yaml and make sure bundle-location isn't defined in metadata.yaml [14:43] stokachu: boss, thanks [14:43] np [14:43] Zic: I need to know how you are rebooting these systems. Are you doing them in a specific order? [14:43] stokachu: I also wrote this https://gist.github.com/marcoceppi/e74c10178d1b730a36debc1f1622b2ce [14:43] mbruzek: this morning, I just rebooted (via the `reboot` command) one kubernetes-worker [14:43] I'm using it in a modified step to merge kubeconfig files, this way the user only needs to set --context [14:43] no other machines [14:44] marcoceppi, nice! [14:45] stokachu: updated with the step-01 file, nothing major [14:45] stokachu: last question, for headless, any thoughts on allowing a final positional argument for model name? [14:46] I love rando names as much as the next person, but I have some explicit model names I want to use [14:46] marcoceppi, one thing that i need to address for kubernetes is https://github.com/conjure-up/conjure-up/issues/568#issuecomment-272379010 [14:46] stokachu: yeah, that's what my gist does [14:46] stokachu: it names the context, user, and cluster the same as the model name from juju [14:47] marcoceppi, very nice, how do you access that with kubectl? [14:47] so they can live side by side with others. it doesn't do de-duping or collision detection yet, but I'll test it with my local spell first [14:47] stokachu: `kubectl --context <model-name>` [14:47] stokachu: and `kubectl set-context <model-name>` <- this is like juju switch [14:47] marcoceppi, very nice, once you're ready I'll add those to the spells [14:48] marcoceppi, we have positional arguments for cloud and controller, so adding a third for model makes sense [14:48] stokachu: cool, I'll file a bug, not high priority but wanted to run by you first in person before throw'in another on the pile [14:49] marcoceppi, thanks, that's an easy one so it'll get addressed this week [14:49] marcoceppi, my other big todo is to make spell authoring cleaner with maybe a clean sdk or something [14:49] haven't quite figured out the best approach there for developer happiness [14:49] stokachu: yeah, I was taken aback by all the bash and python mixed [14:50] mbruzek Zic: I'm bringing up a cluster to attempt to repro [14:50] marcoceppi, yea I'd like to use something like charmhelpers for this [14:50] stokachu: you might be able to borrow a lot from the reactive style, where you use decorators in bash/python to trigger/register events [14:50] Zic: And how do you see the problem? Are you just watching the output of kubectl get pods?
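(For reference, a consolidated sketch of the per-model context juggling described above; note that the switch analogous to `juju switch` is `kubectl config use-context`, and the context name below is just an example.)

```
kubectl config get-contexts              # list the merged contexts
kubectl --context my-model get pods      # one-off command against a specific context
kubectl config use-context my-model      # make it the default, like `juju switch`
```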
[14:51] marcoceppi, ah that's a good idea, would clean up a lot of the code [14:51] stokachu: and with the bash bindings, best of both worlds [14:51] stokachu: I'll file a bug for you there with some initial thoughts [14:51] marcoceppi, cool man appreciate it, i want to get that done sooner than later as well [14:51] mbruzek: I'm running a permanent watch "kubectl get pods -o wide --all-namespaces" when I reboot a node, and watch at the pods state, during that, I also do some curl and telnet to various Ingress and NodePort of the cluster [14:54] I think that's a "relica" from my first problem, as the step-to-reproduce is hard to describe, I'm asking myself if there is a way to reset the etcd cluster to default value (= wipe all data of the K8s cluster) without reinstalling the Juju [14:54] the Juju full-cluster* [14:55] it may be more simple to set up a step-to-reproduce path, or to confirm that's tied to the problem of the last time and my actual data is corrupted :s [14:56] stokachu: https://github.com/conjure-up/conjure-up/issues/635 [14:56] marcoceppi, perfect thanks [14:56] stokachu: I'm onsite atm, but I'll file the developer one later tonight if you don't get to it before me [14:57] marcoceppi, cool man, yea ill file one [15:01] marcoceppi, fyi https://github.com/conjure-up/conjure-up/issues/636 [15:04] ryebot mbruzek: in the same strange behaviour, if I totally poweroff a node that was hosting some pods, this node is shown as NotReady in kubectl get nodes (this point is ok), but the pods stay saying "Running" on the poweroffed-node [15:05] I'm sure I didn't have this behaviour on the fresh bootstrapped cluster [15:07] stokachu: thanks, I'll dump my ideas there [15:07] marcoceppi, cool man [15:10] Zic: What problem(s) are you trying to solve by rebooting? What are else are you doing on the system to necessitate the reboot? [15:11] I'm trying to test the HA and the resilience of the cluster (= what happened and during what time) before going prod [15:14] with a fresh bootstrapped cluster, all pods hosted on a node passed "Completed" and finally disappeared and repop "Running" on another node [15:14] now, they just stayed in "Unknown state" [15:15] and as they are some variable I can't control like the disaster of friday, I cannot describe a clear step-to-reproduce without a full-reset I think :/ [15:19] Zic: We are looking into the problem here, trying to reproduce on our side [15:19] thanks [15:35] 11m11m1{controllermanager }NormalNodeControllerEvictionMarking for deletion Pod kube-dns-3216771805-w2853 from Node mth-k8svitess-02 [15:36] for example, this action last forever, the pod stayed in Unknown instead of switching to Completed and respawn somewhere else [15:36] (in fact it pops somewhere and was in state Running, but the old one stayed in Unknown forever) [15:38] Zic: And you have deployed 1.5.2 kubernetes right? [15:39] yep, was friday :) [15:39] I remember [15:41] Zic: Hmm, not able to repro. 
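(Roughly the resilience check Zic describes; the node name, worker IP, and NodePort are placeholders, and the eviction delay noted in the comments is the kube-controller-manager default.)

```
watch kubectl get pods -o wide --all-namespaces   # watch pods while a node reboots
kubectl get nodes                                 # a powered-off node should flip to NotReady
kubectl describe node <node-name>                 # check its Conditions block
# Pods on a dead node are only evicted and rescheduled after the controller-manager's
# --pod-eviction-timeout (5m by default), so a few minutes of stale "Running"/"Unknown"
# entries on a NotReady node are expected.
curl -v http://<worker-ip>:<nodeport>/            # spot-check a NodePort service
```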
[15:41] maybe I need to do a recap because I spoke so much, sorry :D 1) The first problem was some kube-system components and the Ingress controller in CLBO because etcd units were rebooted too quickly (operations were in progress, I think) because of a large namespace deletion 2) Upgrading to 1.5.2 immediately fixed the problem (I thought) 3) I rebooted just one node this weekend (planned to reboot all first, but as the first [15:41] triggered problems, I stopped) and finished here [15:41] Zic: We might need some detailed reproduction steps [15:42] Zic: ack, thanks [15:42] Let me try rebooting etcd [15:44] I'm sure if I do the same on a fresh canonical-kubernetes, I won't have any of these issues; something must not have totally recovered from the previous problem at the etcd level [15:44] Zic: Some of your systems are physical, yes? [15:44] Zic: We rebooted a worker here and did not have a problem coming back up [15:44] yep, 5 of 8 kubernetes-worker [15:45] all other components are VMs [15:45] mbruzek: yeah, just after the first installation of the bundle charms, all these operations were OK [15:45] it's since last Friday's incident; something must be only partially working [15:45] Zic mbruzek: rebooted all etcd nodes, no problems coming back up [15:46] Pods and nodes all intact [15:46] can I wipe the etcd cluster back to default data without tearing down the whole canonical-kubernetes cluster? [15:47] (the infra, I mean; I don't mind losing the settings of the K8s cluster, I can redeploy my pods & services easily) [15:48] Zic: We have not tested wiping out etcd, it holds some of the Software Defined Network settings. [15:50] Zic: We are unable to reproduce the failure you are seeing. It may be because of the manual operations you ran post deployment. Would it be possible to re-deploy the canonical-kubernetes cluster entirely and start there? [15:51] mbruzek: yeah, I think it's the only path now [15:52] mbruzek: do I need to reinstall everything or can I do a clean teardown with Juju and restart from the beginning? [15:52] Zic: As we spoke about on Friday, let's take a snapshot of your environment now. [15:52] Basically you need to use the Juju GUI to export the model of your environment now [15:53] Zic: To open the GUI: [15:54] Zic: If you haven't changed your admin password, run `juju show-controller --show-password` to get the randomly generated password [15:54] Zic: Next, run `juju gui` [15:54] ryebot Zic or just run `juju gui --show-credentials` ;) [15:55] marcoceppi: dangit, I always forget that [15:55] Zic: That'll start up the gui and give you a url to hit [15:56] Zic: Login with "admin" and your password, then look for the export button, which is at the top and looks like a box with an up-arrow [15:56] ryebot: yeah, it's the step I followed to bootstrap the cluster successfully [15:56] Zic: Click that, and it'll download a copy of the model. We'd like to see it. [15:56] it's the "teardown" part where I don't know what the best practice is :) [15:56] oh ok, I will do that now [15:57] Zic: The up-arrow button will download the model in YAML representation. You can save it and it will help you deploy the same environment again in a repeatable fashion [15:57] I wrote a more detailed step-to-not-reproduce-but-post-mortem: http://paste.ubuntu.com/23894187/ [15:57] mbruzek: do I need to reinstall the VMs and machines which host the cluster?
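(The export flow above, collected in one place; newer Juju releases can also export a model from the CLI with `juju export-bundle`, which did not exist at the time.)

```
juju gui --show-credentials        # prints the GUI URL plus the admin credentials
# or, equivalently:
juju show-controller --show-password
juju gui
# then log in as "admin" and use the export (up-arrow) button to save the model as bundle YAML
```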
[15:58] Zic: hey, just as an open invite, we'll be in Belgium next week if you feel like hopping on a train to talk face-to-face: http://summit.juju.solutions/ [15:58] Zic: The summit is free as in beer [16:00] Zic: Because you used a mixture of Amazon and the Manual Provider it may not be as easy as a juju deploy bundle.yaml, but after you manually provision those physical systems you can deploy the bundle. [16:00] Zic: Pastebin the model when you get that done [16:01] mbruzek: the Amazon machines were linked with the Manual provider too, I don't use the AWS credentials [16:02] (our AWS instances are popped by Terraform, for the anecdote) [16:03] so all I need to do is 1) reinstall all the OSes 2) relink to the Manual provider of the Juju controller 3) redeploy the YAML I'm exporting, or is the 1st step useless? [16:03] jcastro: will be happy to come, Belgium is not far away, I will try to discuss at our meeting if we can go with my company :) [16:05] bring as many people as you want too, it's a free event. [16:05] http://paste.ubuntu.com/23894269/ [16:05] Zic: Why are you installing the OS? [16:06] mbruzek: because we don't have MaaS and our homemade installer has no connector to Juju (but lazyPower let me know that I can write one :)) [16:06] am I missing something? [16:07] Zic: No, I just didn't understand your environment. I was about to tell you about MAAS but you already know. [16:08] VMs and physical servers at our datacenter are auto-installed by a homemade installer like MaaS (which autocompletes our internal Information System, registry, and some warranty support) [16:08] for AWS, we just use an AMI [16:08] so I told lazyPower that maybe, in the future if we have more Juju infra, I will install a MaaS [16:08] or maybe start to write a connector for Juju if it's not too hard for my level of knowledge [16:08] Zic: With that new information, it seems your steps are right. I was hoping to avoid having to reinstall the OS [16:09] ok, I was asking about the reinstallation in case Juju provides a clean way to tear down the cluster [16:09] if not, no big deal, I will just need to redo the manual-provider part, the reinstallation is fast and automatically done [16:10] Zic - as they were manually enlisted, there's no clean way to tear it down; once you juju remove-application the machines will be left behind and still be dirty. [16:10] Zic: so you can issue: juju destroy-environment <environment>, [16:10] oh hi lazyPower :) [16:10] Zic: From the sound of it, you don't need to reinstall juju [16:10] just remove your manual machines, reprovision, and add them back [16:11] *if you want to, though, go ahead :) [16:12] :) [16:12] I'm sure this new cluster will not have all these problems, as it was at the beginning, by the way [16:13] Zic: For reference: https://jujucharms.com/docs/stable/clouds-manual If you add the machines in manually you can use the bundle.yaml file you just downloaded to redeploy on the right systems using the to: (machine number) [16:13] it must be some sneaky thing that came up with Friday's incident, even if I don't know what it is [16:14] mbruzek: yup, I will re-add the machines in the same order (and will check it through juju status btw) [16:14] it's what I did the first time to match charms with the hostnames of the machines (as we use predictable, role-named hostnames) [16:17] Zic: Looking at the last pastebin I see the variety of machines you are using, some of your workers have 20 cores and some have 2. You should be able to identify the systems by their constraints.
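(A sketch of re-enlisting the reinstalled hosts with the manual provider and redeploying from the exported bundle; the host name is a placeholder.)

```
juju add-machine ssh:ubuntu@<host>   # repeat per host, in the same order as before
juju status                          # confirm the machine numbers line up with the bundle's to: entries
juju deploy ./bundle.yaml            # redeploys using the exported placements
```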
[16:19] mbruzek: do you think I can test restoring the etcd database from a backup taken before Friday's incident? Or is it worse and I just shouldn't spend that time? [16:21] I don't know how Kubernetes manages a restore of etcd when there is a delta between what is currently running in terms of pods, services... and what is restored in etcd [16:22] (I had 3 Vitess clusters deployed when I did the etcd backup, I have wiped all of them since then) [16:22] Zic - that is a problem. your snapshot will not contain the TTLs on the keys, so you'll restore to whatever the state was during that snapshot [16:22] this may have implications on running workloads [16:23] Zic: Here is what I would recommend. Redeploy this cluster, and once you get everything working and in a good state take a snapshot of the etcd data (before you do any non-juju operations) [16:24] because I have two options : 1) etcd backups 2) all management parts of the cluster (easyrsa, kube-api-loadbalancer, etcd and kubernetes-master, so all except kubernetes-worker) are snapshotted daily [16:25] so I set back to the past all the management part, I don't know how the kubernetes-worker part will act [16:25] I know that's complicating the problem instead of reinstalling everything, it's just to know what I could possibly do if it was in production [16:26] s/so I/so if I/ (did nothing actually for now :p) [16:26] Zic: Technically I think you could do both: backup etcd, and snapshot the Kubernetes control plane [16:27] Zic: the etcd charm has snapshot and restore actions provided in Juju; you can run that at the same time you snapshot the control plane [16:28] oh I didn't know, I did this backup manually via crontab and the etcdctl backup command [16:29] so, I will try to set all the K8s control plane back to thursday, and see if from there I can directly upgrade to Kubernetes 1.5.2 [16:30] mbruzek: if a component of the charms (like etcd) appears in APT's upgrades, should I apply or hold that package? [16:30] it's one of the first steps that led to my disaster last week [16:30] Zic - that's a great question, and I should probably be pinning etcd if delivered via charm and release charm upgrades when the package is upgraded. [16:30] don't know if it is for real, but it was in the steps [16:31] ok, I will pin etcd [16:31] to unpin and rev the etcd package [16:34] I think that all these problems came from the upgrade of etcd via APT *PLUS* the fact that I ran large delete operations on large namespaces just before, and maybe I didn't wait long enough [16:34] (concerning friday) and concerning today, maybe some parts have not been working perfectly since then [16:34] * mbruzek suspects that as well [16:34] and it's not reproducible unless you can do the exact same delete operation and upgrade etcd at the wrong time like me :D [16:35] as I said, before that, all my resilience and HA tests were perfect :) [16:35] I thought I would go to prod quickly ^^ [16:36] it's not the first time etcd f*cked me up, in other technologies than K8S (or even Vitess, as lazyPower knows), I know it's not your fault and I'm very happy with all the help you were able to provide me these last days ;) [17:03] lazyPower mbruzek ryebot: I successfully returned to the previous state before the incident via my backed-up snapshot of all the K8s control plane, so I'm going to immediately upgrade to 1.5.2 and will redo my own step-to-reproduce [17:03] I expect to... not reproduce my problem :) === petevg is now known as petevg_noms [17:03] why upgrade?
If you deploy new you should get 1.5.2 by default [17:04] Zic: ^ [17:05] Zic | I know that's complicating the problem istead of reinstall everything, it's just to know what can I possibly do if it was in production [17:05] ^ just to test that [17:06] I restored the VMs (which host master, etcd, apilb and easyrsa) of a ESX snapshot of wednesday [17:06] (my cluster works perfectly at this date) [17:06] I have just the upgrade to 1.5.2 to redo [17:07] and I'm sure that my step-to-reproduce the problem will not work, as it seems to be tied with the etcd disaster of friday [17:10] I can confirm, I can't reproduce my own previous problem \o/ [17:11] so it seems that something I did friday corrupt something (etcd I suppose) was the guilty part [17:12] I just restore all management part to wednesday, reupgrade to 1.5.2, restore some kubernetes-worker and etcd... all seems fine [17:12] the only difference is that I immediately upgrade to 1.5.2 before deleting my large namespaces [17:13] and that I did not upgrade etcd through APT this time [17:13] cc lazyPower ^ [17:13] Zic - thats good to hear. I'm going to circle back and file a bug if you dont beat me to it, against layer-etcd to pin the package or make it configurable. [17:14] :) in my side, I will do a simple apt-mark hold etcd for this time [17:14] i haven't had the pleasure of testing that scenario where etcd is upgraded out of band by an apt-get operation, so it may have been attributed to that, or it might have beena ttributed to broken key/val data in etcd due to the delete. [17:14] a mix of the two I think, upgrade via APT during broken key/val operation [17:15] it's the only part I didn't test in my step-to-reproduce [17:15] yeah, thats crappy that we werent' able to recover from that though [17:15] (I immediately upgrade, and then delete all my namespace) [17:15] (thinking of the buffer problem of kubeapilb) [17:16] Zic - i suppose moving forward, the suggestion is to snapshot your data in etcd, hten run the upgrade sequence. its going to stomp all over your resource versions doing the restore but its better to attain that prior state than to be completely broken. [17:16] yep [17:16] I will try to look precisely at the apt upgrade proposition also [17:16] to not upgrade anything that's managed by Juju carms [17:17] charms* [17:18] lazyPower: the juju etcd charms does not include auto-backup right? do you think it's a good idea, as etcd-operator do it? 
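(The pinning and charm-driven snapshot being discussed, roughly; the action name comes from the etcd charm of that era, so check `juju actions etcd` on your revision before relying on it.)

```
sudo apt-mark hold etcd               # on each etcd unit, until the charm pins the package itself
juju run-action etcd/0 snapshot       # returns an action id
juju show-action-output <action-id>   # fetch the result / snapshot location
```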
[17:18] currently I run the backup through a crontab on each etcd units, mbruzek told me that I can do the same with a juju action, I will go with that I think [17:18] Zic - i'm open to a contribution for auto backups, but as it stand today its an operator action [17:18] you can run that backup action in like a jenkins job and have it archive, and then you have an audit trail [17:19] i get leery of automatic things that have no visibliity (like cron) [17:19] personally I prefer to configure this type of backup on my own [17:19] the last thing i want is to assume its working, by wrapping it in a CI subsystem you have trace logs and know when it fails [17:19] but as Juju is here to help, it sounds like a good feature :) [17:20] haha, yeah, I can uderstand that part :) [17:20] and the whole juju action part ensures its repeatable :) [17:20] even if etcd-operator do the job, I will actively monitor what he does [17:20] plus those packages are what iv'e tested for restore, its effectively teh same thing, but i'd hate to think adding an extra dir to the tree or something would cause the restore action to tank. [17:21] and then its added gas to the fire [17:21] metaphorically speaking anyway [17:22] for now I do the both : daily snapshots of the VMs which host etcd units + etcdctl backup command [17:23] thats a good strategy [17:23] +1 [17:33] lazyPower: do you have any docs on how Juju and MaaS is connected, code-sided? [17:34] Zic: what do you mean? [17:34] I will lurk at if it's valuable for us to develop a connector for our own installation/provisionning infra or deploy a simple MaaS for Juju architecture [17:34] rick_h: oh hello, about this thing ^ [17:34] we have a kind of MaaS which is connected to all our services in my company [17:35] it's an homemade system and I don't know if I can write a simple new "provider" for Juju, or if I just go to MaaS [17:35] (will be a little redundant) [17:37] Zic: check out https://github.com/juju/gomaasapi and https://github.com/juju/juju/tree/staging/provider/maas === mskalka is now known as mskalka|afk [17:42] rick_h: thanks [17:48] mbruzek: hmm, the juju debug-log is kinda flooding since the new upgrade to 1.5.2 : http://paste.ubuntu.com/23894873/ [17:48] (I use the juju upgrade-charms command) [17:52] lazyPower: I'm reposting as you were offline : the juju debug-log looks really strange since my new upgrade to 1.5.2 : http://paste.ubuntu.com/23894873/ [17:53] Zic - most of that is normal. the leadership failure - if it continues to spam we'll want to get a bug filed against that [17:53] it floods in loop :) [17:53] thats the unit agent complaining about a process it needs to do leadership stuff for coordination. not related to teh charms however [17:54] Zic - i'm headed out for lunch will be back in a bit [17:54] hmm, I didn't do a kubectl version after juju upgrade-charm, I just take a look at juju status but I'm always on 1.5.1 in fact :o [17:59] (I confirm I stayed in 1.5.1 after the juju upgrade-charms, I was just looking at juju status to all return to green and didn't look at the application version -_-) === petevg_noms is now known as petevg [18:20] how juju world [18:20] howdy* === frankban is now known as frankban|afk === mskalka|afk is now known as mskalka [18:25] Zic - so you're saying juju upgrade-charm on the components either didn't run, or the resource was not upgraded? 
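(Backing up to the etcd backup discussion: a sketch of the cron-driven etcdctl backup Zic mentions; the data directory is an assumption, and lazyPower's point stands that wrapping the same command in a CI job gives you logs and failure visibility that plain cron does not.)

```
# /etc/cron.d/etcd-backup on each etcd unit (etcdctl v2 "backup" subcommand)
0 3 * * * root etcdctl backup --data-dir /var/lib/etcd/default --backup-dir /var/backups/etcd/$(date +\%F)
```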
[18:25] sorry for latency, i'm at terrible coffeeshop free wifi === scuttle|afk is now known as scuttlemonkey [18:27] lazyPower: the upgrade-charms command just changed to the lastest version of the charm, but the application was not upgraded [18:28] Zic - that seems strangely reminiscent of another user reporting at deploy time they didn't get the upgraded resource [18:28] this is recoverable [18:28] on the store display, you can fetch each resource and manually attach them to upgrade the components. [18:29] we actually just landed a doc update about this, 1 moment while i fetch the link [18:29] Zic - https://github.com/juju-solutions/bundle-canonical-kubernetes/pull/197 [18:30] the upgrade performs well the first time with the same cluster, don't know what happen :( [18:30] does the order of what charp is upgraded via upgrade-charm count? [18:30] charm* [18:31] oh I know what happen [18:31] in 1.5.1 Flannel does not start properly [18:31] I forgot to restart them :} [18:32] in juju status, if you ran the upgrade-charm step, you should still see 1.5.2 listed as your k8s component versions [18:32] assuming it went without error. if there's a unit(s) trapped in error state that are related, its possible that the upgrade hasn't completed [18:32] does the upgrade-charm start if Flannel is in error? [18:32] ah [18:32] if the charm is in error, it will halt the operations on related units [18:32] until the error is resolved [18:33] ok lunch is over for me, heading back to the office and will resume then Zic. [18:33] o/ [18:39] Zic: I am back. [18:40] Zic: What is the current issue? [18:40] mbruzek recap: I restored my old cluster to wednesday, I ran a "juju status", all was green, I ran juju upgrade-charms on easy charms, I did another juju status at the end and all was green, and the "Rev" column contains the latest version of the charm, but in fact, the software version was always 1.5.1 for kubernetes-master/worker for example. I remembered that Flannel don't start well on boot sequence on [18:40] 1.5.1 lately, started it on every node and the upgrade was unblocked. The only weird thing is that Flannel was shown as "active/green" in juju status so... [18:41] so all is fine actually, was my mistake with Flannel not autostarting well on old 1.5.1 [18:41] (and juju status which show me as active/green the first time) [18:41] Zic: Yes we fixed the flannel restart issue in 1.5.2 so I am confused why flannel didn't restart. [18:42] mbruzek: the restaured cluster was in 1.5.1 [18:42] restored* [18:42] ah [18:42] OK [18:42] :) [18:43] apparently, Flannel not started was blocking the upgrade [18:44] But everything has started now? [18:44] yep [18:44] it's my fault, when the cluster was restored to wednesday/1.5.1, I just did *one* juju status, all was green [18:44] normally I run a watch -c "juju status --color" [18:45] I didn't see between the first "juju status" and the upgrade that Flannel passed in "error/red" [18:46] and that's apparently what's blocked the upgrade as after Flannel was manually started, the upgrade begins instantly [18:53] lazyPower: TL;DR : it was Flannel (of the 1.5.1 version) which blocked my upgrade :) [18:53] ah good to know [18:53] it's OK now [18:54] i wish we could retroactively fix that [18:54] #fwp [18:54] right! [18:55] was my fault also as I just run onetime juju status, showed all green, then upgrade-charms, see it did not nothing at the juju debug-log, re-run a juju status, show that Flannel is in error... 
and remembered that on wednesday (date of the snapshot) I was still on 1.5.1 with the flannel issue :p [18:55] normally I monitor the upgrade-charm process through a watch -c "juju status --color" :p [18:56] s/it did not nothing/it did nothing/ [18:56] double-negation is dangerous. [18:57] IN CONCLUSION (sorry for the caps), I have a good-running 1.5.2 cluster, no CLBO, PodsEviction works well if a node goes down... [18:57] \o/ [18:57] yay [18:57] ... well, the only last point is, can I come to the Juju Summit? :D [18:57] I will discuss this with my company :) [18:57] it's a freebie event, and you're invited and can bring more [18:57] so, load up the posse and meet us in Ghent :) [18:57] Zic: Yes you are most welcome to join us [18:58] you will hear my perfect^WFrench accent \o/ [18:59] hmm, just one real last point: http://paste.ubuntu.com/23895203/ [19:00] all is up-to-date, isn't it? I have a doubt about k8s-master, which says 1.5.1 [19:00] (because: ERROR already running latest charm "cs:~containers/kubernetes-master-11" if I try the juju upgrade-charm kubernetes-master again) [19:01] kubectl version returns 1.5.2 [19:14] it seems to be a display bug SSHing directly to the master, all components are on 1.5.2 [19:14] s/bug/bug./ [19:20] Question: how can I FORCE destroy something on juju? I have 6 containers which will not go away [19:20] Teranet: You want to destroy everything juju? [19:20] yes [19:21] so I can redeploy from scratch [19:21] Teranet: OK here is the command, but you should be careful with this. [19:22] juju destroy-controller --destroy-all-models [19:22] teranet: juju kill-controller will tear it all down, including the controller node [19:22] ok let me see if that works [19:23] thx [19:23] Teranet: https://kubernetes.io/docs/getting-started-guides/ubuntu/decommissioning/ [19:23] covers everything [19:23] oh sorry, thought you were using kubes [19:24] nope but it's ok [19:24] the Cleaning Up the Controller part at the bottom should still apply [19:24] I would try to remove the model first though with juju destroy-model [19:24] https://jujucharms.com/docs/2.0/controllers [19:24] I use my own private cloud and had a broken relationship which broke my complete nova environment [19:24] juju destroy-model I had run and it was stuck [19:25] I had mistakenly applied an HA relationship which resulted in destroying all my compute nodes :-( [19:25] lucky I hadn't deployed VMs in OpenStack yet === scuttlemonkey is now known as scuttle|afk [19:53] lazyPowe_ mbruzek: hmm, the two additional (scaled from 3 to 5) etcd members seem unhealthy in etcdctl cluster-health [19:53] http://paste.ubuntu.com/23895436/ [19:54] (after an upgrade of the charm) [19:54] I just restarted the etcd service via systemctl and all is fine [19:54] (just to let you know in case it's a known issue) [19:55] all nodes are healthy after that [19:55] Zic - seems like it might have raced, i haven't seen any test failures doing scale testing [19:56] and there's logic to help prevent that in the charms [19:57] hmm I spoke too quickly, seems the restart does not suffice, it is unhealthy again, some etcd logs: http://paste.ubuntu.com/23895453/ [19:57] I have this problem just on my 04 and 05 etcd nodes [19:59] hmm, seems just a bit of flapping: http://paste.ubuntu.com/23895464/ [19:59] they are all healthy again [20:01] cory_fu: petevg: if matrix (or any other project) depends on juju-plugins, the setup.py PR (https://github.com/juju/plugins/pull/75) would make the crashdump PR (https://github.com/juju-solutions/layer-cwr/pull/46)
unnecessary, right? [20:02] kwmonroe: matrix doesn't depend on it. [20:02] kwmonroe: cory_fu pushed back on that, and I think that he's right. crashdump needs python2's yaml, and matrix just handles python3 stuff. [20:04] ok, well let's forget matrix for now petevg.. should juju-plugins be some kind of packaged citizens? [20:04] kwmonroe: yes. I think that we should merge the PR. I'm a little biased, though, on account of it being my PR :-) [20:04] kwmonroe: Yeah, if we can update crashdump to work in py3 (i.e., if that one bug is fixed upstream), then we could perhaps make plugins a dep for matrix. But I also kind of like having it as optional functionality that works if you have the lib installed and is otherwise a no-op [20:05] An optional dep, if you will [20:05] Somebody does need to go and clean up the merge conflicts, though. [20:05] kwmonroe, petevg: +1 to packaging juju-plugins for easier install. Could also be a snap [20:05] cory_fu: yeah. No matter what, matrix shouldn't fail if crashdump doesn't exist. [20:05] petevg: if only there were a recently gung-ho ~charmer that could propose a clean PR... [20:06] kwmonroe: yeah. It's on my list o' things to do today, once I finish running this double set of tests, where I'm confirming that I'm telling the truth about matrix running with an without the crashdump libs :-) [20:07] my beef is really that the juju-plugins readme says "clone this repo", and that's not enough [20:07] .. for runtime [20:07] .. sometimes [20:07] Yeah. Adding the repo to your PATH is adequate, but not pretty. :-) [20:07] hey! ^^ that's a java slogan right there, right mbruzek? for runtime, sometimes? [20:08] Heh. [20:08] hi === Salty is now known as SaltySolomon [20:09] except it's not petevg, doesn't crashdump need pyyaml at runtime? nothing about cloning that repo and adding to the path helps you there. [20:10] kwmonroe: true. nm [20:10] kwmonroe: I will ping people when I have the new nice PR :-) [20:12] very fine petevg -- fwiw, i'm really trying to say that j-p is all growed up and it's time to consider which format to deliver it in. [20:13] Cool :-) [20:13] your keyboard says smiles, your subtext says ugh. [20:14] Read into stuff much? :-p [20:14] you did it again! [20:15] It is in Python2 still. Silly bug. [20:26] kwmonroe: Stop arguing about it and create a snap. ;) [20:28] 90 seconds and i have no retort. you win this round cory_fu. [20:28] :) [20:49] ok quick question I had a host deployment failure it timed out in the bios setting how can I initiate a redeployment to this box ? [20:49] ryebot / mbruzek / lazyPowe_ : just a final word: I did all my resilience and HA test and this time, all is working [20:49] Zic: great [20:50] Zic: Awesome! [20:50] Zic awesome, glad you kept at it and had positive results :) [20:50] ^5 [20:51] the customer of this architecture leaves an old VMware ESX platform, these robust Hosts machines will be added as kubernetes-worker at term :) [20:52] EC2 instances are just from popping near their own customer, in their country for each endpoint [20:53] I will keep you updated as a testimonial of how good CDK will do the work for the coming-launch :) [20:54] at term, this cluster will run a 3 Vitess cluster, Cassandra/Spark/Zeppelin, some Nginx and php-fpm7 [20:54] thats a nice spread of workloads [20:55] got some presentation layer, some app layer, some business intelligence in there, and i dont know what vitess is but i assume its funky vegetables [20:55] lazyPowe_: you are on the Vitess' Slack, don't you? 
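(For context on the juju-plugins packaging discussion above: what "clone this repo" implies today, including the runtime dependency kwmonroe is pointing at; the clone path is illustrative.)

```
git clone https://github.com/juju/plugins.git ~/juju-plugins
export PATH="$HOME/juju-plugins:$PATH"   # adequate, but not pretty
sudo apt-get install -y python-yaml      # juju-crashdump still needs Python 2 PyYAML at runtime
```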
[20:56] Zic - negative, i'm on 7 slacks but that is not one of them [20:56] ah, I crossed you at the K8S Slack :) [20:57] (for a problem about Vitess, and so, I was invited to the Vitess Slack; I didn't remember where we'd crossed paths :)) [20:57] http://vitess.io [20:57] it's the way that YouTube uses MySQL in their infra, especially in Kubernetes/Borg [20:59] ahhhh ok [21:00] bookmarked for later reading [21:00] kwmonroe, cory_fu: PR for you https://github.com/juju-solutions/matrix/pull/73 [21:00] on a totally different subject, my colleague saw the OpenStack Juju bundle charms [21:01] it's pretty... dense :) [21:01] Zic there are lots of applications in OpenStack [21:02] Zic: but one can upgrade through the releases of OpenStack easily with those charms. [21:02] (he saw me do some drag'n'dropping in the Juju GUI and when he visited the jujucharms.com website, he saw the openstack bundle) [21:02] mbruzek: yeah, he is interested as we also have a PoC for OpenStack and... it was not so conclusive [21:03] if you want to build OpenStack from scratch on your own, and maintain that infra, it costs a lot of time, especially at the beginning when you're alone [21:03] (before the infra is up, running and... documented :)) [21:04] I think he will try the OpenStack bundle charm :D [21:05] going to sleep anyway, I've been at extra-unofficial hours for too long :) [21:05] Zic: I totally recommend it, I know lots of 1-3 person teams that maintain openstack in production with juju [21:06] yeah, from my own experience with Juju now, I can recommend it for other technologies :) [21:06] we mainly use Puppet here as our configuration-management tool, sometimes Ansible for particular work... [21:07] as a K8S module for Puppet does not exist and would be a headache to maintain by ourselves, I went with Juju :) [21:07] (kubeadm first, then I discovered CDK via Juju) [21:13] the second lesson I learned from Juju/K8S/Vitess is that I must learn Go someday :p [21:13] more and more of the technologies I use are written in Go [21:14] * Zic said he is going to sleep 10min ago, too talkative, g'night [21:16] cheers o/ === mskalka is now known as mskalka|afk [22:29] so how do I go about enabling elasticsearch and kibana for logging on my CDK cluster? [22:34] thedac, were you able to talk to jonh from CPLANE today [22:34] ? [22:35] narinder: yes [22:44] I keep coming across documents that talk about exporting a couple of env vars before bringing up a k8s cluster to get it to spin up elasticsearch and kibana pods. does anyone know how to get those pods deployed in a pre-existing cluster easily? [22:48] was add-metric ever added to the charmhelpers? [22:49] cmars, ^^ [22:49] cholcombe, no, i don't think it was [22:50] cmars, we should get that fixed so i don't have to keep calling subprocess.check_output to add metrics :) [22:50] cholcombe, could do, yeah [22:51] cholcombe, where is the charmhelpers project? [22:52] still on LP? [22:52] hmm, seems so. sure, i'll look into this [22:52] cmars, https://code.launchpad.net/charm-helpers [22:52] yeah [22:53] cmars, i'm working through a PR on gerrit and people are asking why I have to make a subprocess call to a juju function [22:54] cholcombe, it'll probably take all of a couple minutes to write, the rest of the day to document and test :) [22:55] lol yup [22:56] cmars, sorry to be a pain in the butt [22:57] cholcombe, :) it's fine, i'm just complaining [22:57] needs to be done..
layer:metrics isn't terribly efficient [22:59] stormmore: hey, so you can either deploy elastic search and kibana on your cluster, or you can deploy elasticsearch/kibana/beats along side it [23:00] marcoceppi at the moment I am thinking of on the cluster to minimize the number of "machines" in use. still in the process of architecting a bigger bare metal cluster [23:00] marcoceppi I already have a small k8s cluster deployed though [23:01] stormmore: makes sense. I don't have much experience in doing that, but you should be able to follow any online guide that walks through elastic on k8s [23:02] cmars, i'd use it but the ceph charms haven't gone layered yet [23:02] stormmore: I can't confirm, but this one looks promising: https://github.com/kayrus/elk-kubernetes [23:02] it's atleast been recently updated [23:03] marcoceppi and that is where the problem lies, it seems to assume that you are not adding it but enabling it before bringing up the cluster... from what I can see it is an addon at this point [23:03] https://kubernetes.io/docs/user-guide/logging/elasticsearch/ [23:28] lazyPowe_ are you around? do you have any input into adding elasticsearch & kibana pods to kube-system? [23:29] stormmore - our integration point was an external logging deployment using the beats-core bundle as a foundation for that effort [23:29] the idea is that if your k8s cluster is sick, you'd want some persistence around that data, and have it be accessable regardless of the kubernetes system sttate [23:29] so it uses beats to ship the data over, and then gets parsed-reinterpreted by the kibana dashboards [23:30] lazyPowe_ hmmm insteresting considering conjure-up docs suggest that it deploys 2 elasticsearch pods and a kibana one [23:30] wat [23:30] when did this happen? [23:30] stokachu - wat? [23:31] https://insights.ubuntu.com/2016/11/21/conjure-up-canonical-kubernetes-under-lxd-today/ [23:31] ooooo [23:31] this isn't the conjure-up docs or prompt, this is the upstream k8s guide [23:32] right, at this time, teh beats core bundle was part of CDK [23:32] its now an ancillary bundle, pending our v5 update of the elastic stack components [23:32] "conjure-up kubernetes" vs "conjure-up canonical-kubernetes"? [23:32] that work thats there still works and functions as it did then, but it could be better with the v5 updates as there were a ton of fixes and normalized versioning schema and etc. [23:33] stormmore - so in short, what you get today with canonical-kubernetes, is much more aligned with a smaller deploment, and you can then add the beats components and relate it all. 
we have a todo to get another bundle published since we moved to the fragments, but we're holding off until the v5 rev of the elastic stack iirc [23:34] stokachu - un-wat, miscommunication [23:36] stormmore - i'll take a line item to bring this up with the team about seeing if we can get you an elastic-enabled bundle tomorrow [23:36] most of the team has left, im' sticking around a little bit longer to check on this deployment i'm running, and then i'm out for the evening as well [23:37] lazyPowe_ awesome, no worries [23:37] i would build you one now, but that would be a "throw it over the fence good luck i'm behind 7 proxies" kind of thing to do [23:37] i'd rather at least run a test deployment before i put it in your hands [23:37] lazyPowe_ I am just trying to make sure I have everything in place so dev doesn't need access to the nodes and have a UI to get the logs from [23:37] yep, totally understand that [23:37] why give them admin when read-only works [23:38] have you been looking into RBAC k8s primitives perchance? [23:38] those seem like they are going to be right in your wheelhouse [23:38] you can assign roles to namespaces and scope what primitives they can interact with [23:38] rather roles to users, in a namespace, and .... [23:38] see above [23:39] stormmore - https://kubernetes.io/docs/admin/authorization/ [23:39] we haven't fully enabled this yet as its currently in BETA [23:40] but you'll def want ot read up on it and when we land teh feature set in the charm to make that configurable, you'll be in container-topia [23:42] yeah exactly [23:45] cholcombe, here you go, wasn't nearly as bad as i thought :) https://code.launchpad.net/~cmars/charm-helpers/add-metricenv/+merge/315952 [23:45] woo [23:46] cmars, nice. i forgot about the JUJU_METER thing [23:46] i should write more python tests for my charms... mock.patch is pretty easy to work with [23:47] gotta run now. if you could help me get this landed, or reviewed -- happy to fix things up however -- i'd much appreciate it! [23:56] cmars, sure. i can review it but i can't land it
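(To close the loop on the add-metric thread: a sketch of the Juju hook tool the new charmhelpers wrapper shells out to; it is only valid inside a charm's collect-metrics hook, and the metric names must be declared in the charm's metrics.yaml -- the names and values here are made up.)

```
# inside a charm's collect-metrics hook
add-metric users=42 connections=3
```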