/srv/irclogs.ubuntu.com/2017/01/30/#juju.txt

Budgie^SmoreSo is there a charm and / or a doc on standing up a private docker registry for k8s?04:21
lazyPowerBudgie^Smore - there's an open PR that hasn't made the shift to the upstream repository that adds this functionality into k8s itself  - https://github.com/juju-solutions/kubernetes/pull/9704:22
lazyPoweronce that lands it'll get released with our next update to the charms, we have some additional prs that need to land to support that change. but its on the horizon04:23
Budgie^Smoreso again I am getting ahead of myself :)04:23
Budgie^SmoreI am pondering running nexus 3 ina container in the meantime (possible long term depending on the registry functionality)05:13
BlackDexblahdeblah: It's about this bug https://bugs.launchpad.net/nrpe-charm/+bug/163351707:23
mupBug #1633517: local checks arn't installed sinds nrpe-7 <NRPE Charm:New> <https://launchpad.net/bugs/1633517>07:23
blahdeblahBlackDex: You've caught me a little late in my day, but if I get a chance I'll have a look at that if I have some spare time.07:30
kjackalGood morning Juju world!08:09
=== frankban|afk is now known as frankban
=== junaidal1 is now known as junaidali
BlackDexblahdeblah: Thx, i don't mind i just need a answer/help, i know the time-difference is there, so no prob, i'm glad someone wants to take a look09:16
marcoceppimorning kjackal o/09:18
kjackalhell marcoceppi09:19
blahdeblahBlackDex: Definitely keen to find out what's going on; what times (UTC) are you likely to be around most?09:22
BlackDexblahdeblah: im in utc+1 (netherlands) so that would be utc 8:00 till around 16:0009:26
blahdeblahBlackDex: ack - will try to catch you in your mornings09:26
BlackDexoke :) cool thx!09:26
Zichi here, I'm asking myself if a complete teardown via Juju (and tearup) could permit a resolution of this issue: https://github.com/kubernetes/kubernetes/issues/4064809:29
Zicbecause it does not seem that many people have encountered this one :/09:30
=== sk_ is now known as Guest67469
Zic(I'm also trying to see in the Kubernetes' Slack if somebody already encountered this issue)09:37
marcoceppiZic: it might be? Are you on 1.5.2?09:42
Zicmarcoceppi: yep09:46
marcoceppiZic: I don't feel comfortable saying scrap and redeploy, esp if there's information we can capture from your deployment to improve CDK, but I also don't want you siting with a wedge'd cluster09:47
marcoceppiZic: lazyPower mbruzek & co should be online in the next few hours09:48
Zicyeah, they helped a lot with the first party of this problem friday :)09:48
Zics/party/part/ :]09:49
marcoceppiI'm cool with calling it a party instead of a problem ;)09:49
Zicthe first part was about all my pods crashed with this kind of error, and even some kubectl command (which actually "do/write" something, like create/delete, as get/describe works) return this kind09:50
Zicupgrading to 1.5.2 pass all my Pods to Running09:50
Zicbut this weekend, when I tried to reboot some kubernetes-worker to test the resilience and the eviction/respawn of pods, I fell again in a sort of same problem :/09:51
Zicoh, I found something strange, cc @ lazyPower10:24
Zichttp://paste.ubuntu.com/23892884/10:24
Zicetcd again /o\10:25
ZicI saw this via juju status, flannel/2 was marked as "waiting" indefinitely10:26
marcoceppiZic: you have 1,3, or 5 etcd machines?10:26
Zic5 etcd10:26
Zicon 5 different VMs10:26
marcoceppidoh, just saw the flanneld line10:26
ZicI didn't have Flannel in this state when I opened the GitHub issue10:27
marcoceppiinteresting10:27
Zicthe last guilty for my first problem was also etcd10:27
marcoceppiit seems it just started happening baesd on the logs10:27
Zicdo you know if I can do a "fresh start" of an etcd database for canonical-kubernetes without redeploy it from scratch? I don't have any important data for now in this cluster10:28
Zic(= I have all my custom YAML file to redeploy all easily)10:29
Zicfiles*10:29
marcoceppiZic: you should be able to just remove the etcd application then redeploy etcd and re-create the relations10:32
marcoceppiyou might get some spurrious errors during removal, and I'm not sure if it's a tested path or not10:32
marcoceppitheoretically, you should be able to, but distributed systems are always a bit interesting in practice10:33
Zicoh you mean via Juju? I thought to wipe the data directly in etcd, but even if I don't know well etcd, I suppose there is some "default-key/value" needed and provisionned via Juju deployment at bootstrap :/10:34
marcoceppiZic: not sure via etcd, from a Juju perspective the "keys" for TLS are actually a charm, the easyrsa charm, so since there's a CA running still it'll just get new certs, distribute those via relations and k8s will be reconfigured to point at that etcd10:35
marcoceppiZic: as for the etcd portion, there probably is a way to wipe, I'm just not sure of one10:36
Zicmarcoceppi: do you advice me to wait for lazyPower and ryebot to come up before smashing etcd in the head (it's not the first time etcd annoys me, even in other technologies than K8s/Vitess :))10:41
Zic?10:41
marcoceppiZic: It's probably a good idea to wait for them, but smashing etcd over the head might also be very theroputic. I'll make sure we have some people in EU/APAC timezone come up to speed with kuberentes knowledge so there's not so much a wait period10:42
Zicoh I will never complain about timezone as it's a community support channel :) but great to here it10:53
Zichear*10:54
Zicnew info: old NodePort service are always listening, but with a new NodePort service just deployed, no nodes are listening on this port :/11:12
ZicI think Flannel is the guilty but because it cannot contact etcd11:12
marcoceppistokachu: ping when you're around11:17
marcoceppior mmcc but I doub't you'll be around before stokachu11:18
zeestratHey BlackDex, this might be a long shot, but the bug you posted on NRPE isn't related to https://bugs.launchpad.net/charms/+source/nagios/+bug/1605733 by any chance?12:45
mupBug #1605733: Nagios charm does not add default host checks to nagios <canonical-bootstack> <family> <nagios> <nrpe> <unknown> <nagios (Juju Charms Collection):New> <https://launchpad.net/bugs/1605733>12:45
stokachumarcoceppi, ping14:09
marcoceppistokachu: hey man, what's conjurebr0 for?14:25
marcoceppistokachu: I'm doing some super weird things in a spell, and was curious14:25
stokachumarcoceppi, it's mainly for openstack on novalxd to have that second nic for its neutron network14:25
stokachubut it's always there so you could rely on it if need be14:26
Zicmbruzek: hi, are you around? just saw you joined, sorry if I disturb you14:32
mbruzekZic I am here. What can I help with?14:36
Zicmbruzek: remember the last time with my Ingress controller in CLBO? I thought all was fixed after upgrading to 1.5.2, but when I rebooted some nodes, the problem came back... I continued to look at the problem today and saw Flannel is completely messed up: http://paste.ubuntu.com/23892884/14:39
Zicactually, all new NodePort are not working :s14:39
mbruzekhrmm.14:39
mbruzekThere must be a problem with the reboot sequence. Do you think you could reproduce this?14:40
marcoceppistokachu: is it routable, and is it connected to the controller?14:41
stokachumarcoceppi, yea it's routable, but not connected to the controller14:41
marcoceppistokachu: cool14:42
marcoceppistokachu: second question, can I reference a local bundle.yaml file in the spell metadata?14:42
Zicmbruzek: I tried to restart the flannel service but with the same result, I didn't try to reboot another node to see if I can reproduce14:42
stokachumarcoceppi, you would just place a bundle.yaml in the same directory as your metadata.yaml and make sure bundle-location isn't defined in metadata.yaml14:42
marcoceppistokachu: boss, thanks14:43
stokachunp14:43
mbruzekZic: I need to know how you are rebooting these systems. Are you doing them in a specific order.14:43
marcoceppistokachu: I also wrote this https://gist.github.com/marcoceppi/e74c10178d1b730a36debc1f1622b2ce14:43
Zicmbruzek: this morning, I just rebooted (via the `reboot` command) one kubernetes-worker14:43
marcoceppiI'm using it in a modified step to merge kubeconfig files, this way the user only needs to set --context14:43
Zicno other machines14:43
stokachumarcoceppi, nice!14:44
marcoceppistokachu: updated with the step-01 file, nothing major14:45
marcoceppistokachu: last question, for headless, any thoughts on allowing a final positional argument for model name?14:45
marcoceppiI love rando names as much as the next person, but I hvae some explicit model names I want to use14:46
stokachumarcoceppi, one thing that i need to address for kuberentes is https://github.com/conjure-up/conjure-up/issues/568#issuecomment-27237901014:46
marcoceppistokachu: yeah, that's what my gist does14:46
marcoceppistokachu: it names the context, user, and cluster the same as teh model name from juju14:46
stokachumarcoceppi, very nice, how do you access that with kubectl?14:47
marcoceppiso they can live side by side with others. it doesn't do de-duping or collision detection yet, but I'll test it with my local spell first14:47
marcoceppistokachu: `kubectl --context <model-name>`14:47
marcoceppistokachu: and `kubectl set-context <model-name>` <- this is like juju switch14:47
stokachumarcoceppi, very nice, once you're ready ill add those to the spells14:47
stokachumarcoceppi, we have positional arguments for cloud and controller, so adding a third for model makes sense14:48
marcoceppistokachu: cool, I'll file a bug, not high priority but wanted to run by you first in person before throw'in another on the pile14:48
stokachumarcoceppi, thanks, that's an easy one so it'll get addressed this week14:49
stokachumarcoceppi, my other big todo is to make spell authoring cleaner with maybe a clean sdk or something14:49
stokachuhaven't quite figured out the best approach there for developer happiness14:49
marcoceppistokachu: yeah, I was taken aback by all the bash and python mixed14:49
ryebotmbruzek Zic: I'm bringing up a cluster to attempt to repro14:50
stokachumarcoceppi, yea i'd to use like something like charmhelpers for this14:50
marcoceppistokachu: you might beable to borrow a lot from the reactive style, where you use decorators in bash/python to trigger/register events14:50
mbruzekZic: And how do you see the problem? Are you just watching the output of kubectl get pods ?14:50
stokachumarcoceppi, ah that's a good idea, would clean up a lot of the code14:51
marcoceppistokachu: and with the bash bindings, best of both worlds14:51
marcoceppistokachu: I'll file a bug for you there with some initial thoughts14:51
stokachumarcoceppi, cool man appreciate it, i want to get that done sooner than later as well14:51
Zicmbruzek: I'm running a permanent watch "kubectl get pods -o wide --all-namespaces" when I reboot a node, and watch at the pods state, during that, I also do some curl and telnet to various Ingress and NodePort of the cluster14:51
ZicI think that's a "relica" from my first problem, as the step-to-reproduce is hard to describe, I'm asking myself if there is a way to reset the etcd cluster to default value (= wipe all data of the K8s cluster) without reinstalling the Juju14:54
Zicthe Juju full-cluster*14:54
Zicit may be more simple to set up a step-to-reproduce path, or to confirm that's tied to the problem of the last time and my actual data is corrupted :s14:55
marcoceppistokachu: https://github.com/conjure-up/conjure-up/issues/63514:56
stokachumarcoceppi, perfect thanks14:56
marcoceppistokachu: I'm onsite atm, but I'll file the developer one later tonight if you don't get to it before me14:56
stokachumarcoceppi, cool man, yea ill file one14:57
stokachumarcoceppi, fyi https://github.com/conjure-up/conjure-up/issues/63615:01
Zicryebot mbruzek: in the same strange behaviour, if I totally poweroff a node that was hosting some pods, this node is shown as NotReady in kubectl get nodes (this point is ok), but the pods stay saying "Running" on the poweroffed-node15:04
ZicI'm sure I didn't have this behaviour on the fresh bootstrapped cluster15:05
marcoceppistokachu: thanks, I'll dump my ideas there15:07
stokachumarcoceppi, cool man15:07
mbruzekZic: What problem(s) are you trying to solve by rebooting? What are else are you doing on the system to necessitate the reboot?15:10
ZicI'm trying to test the HA and the resilience of the cluster (= what happened and during what time) before going prod15:11
Zicwith a fresh bootstrapped cluster, all pods hosted on a node passed "Completed" and finally disappeared and repop "Running" on another node15:14
Zicnow, they just stayed in "Unknown state"15:14
Zicand as they are some variable I can't control like the disaster of friday, I cannot describe a clear step-to-reproduce without a full-reset I think :/15:15
mbruzekZic: We are looking into the problem here, trying to reproduce on our side15:19
Zicthanks15:19
Zic  11m11m1{controllermanager }NormalNodeControllerEvictionMarking for deletion Pod kube-dns-3216771805-w2853 from Node mth-k8svitess-0215:35
Zicfor example, this action last forever, the pod stayed in Unknown instead of switching to Completed and respawn somewhere else15:36
Zic(in fact it pops somewhere and was in state Running, but the old one stayed in Unknown forever)15:36
mbruzekZic: And you have deployed 1.5.2 kubernetes right?15:38
Zicyep, was friday :)15:39
mbruzekI remember15:39
ryebotZic: Hmm, not able to repro.15:41
Zicmaybe I need to do a recap because I spoke so much, sorry :D 1) The first problem was some kube-system components and Ingress controller in CLBO because etcd units was rebooted too quickly (operations was in progress I think) because of large namespace deletion 2) Upgrading to 1.5.2 immediately fix the problem (I thought) 3) I rebooted just one node this weekend (planned to reboot all first, but as the first15:41
Zictrigger problems, I stopped) and finished here15:41
ryebotZic: We might need some detailed reproduction steps15:41
ryebotZic: ack, thanks15:42
ryebotLet me try rebooting etcd15:42
ZicI'm sure if a do the same on a fresh canonical-kubernetes, I will don't have any of this issues; something must not totally recover from the previous problem at etcd level15:44
mbruzekZic: Some of your systems are physical, yes?15:44
mbruzekZic: We rebooted a worker here and did not have a problem coming back up15:44
Zicyep, 5 of 8 kubernetes-worker15:44
Zicall others components are VMs15:45
Zicmbruzek: yeah, just after the first installation of the bundle charms, all this operation was OK15:45
Zicit's since my last friday's incident, something must be partially working15:45
ryebotZic mbruzek: rebooted all etcd nodes, no problems coming back up15:45
ryebotPods and nodes all intact15:46
Ziccan I wipe the etcd-cluster to default data without tearing-down all the canonical-kubernetes cluster?15:46
Zic(the infra, I don't mind to loose the settings of K8s cluster, I can redeploy my pods & services easily)15:47
mbruzekZic: We have not tested wiping out etcd, it holds some of the Software Defined Network settings.15:48
mbruzekZic: We are unable to reproduce  the failure you are seeing. It may be because of the manual operations you ran post deployment. Would it be possible to re-deploy the canonical-kubernetes-cluster entirely and start there?15:50
Zicmbruzek: yeah, I think it's the only path now15:51
Zicmbruzek: do I need to reinstall everything or can I do a clean teardown with Juju and re start from the beginning?15:52
mbruzekZic: As we spoke about on Friday lets take a snapshot of your environment now.15:52
mbruzekBasically you need to use the Juju GUI to export the model of your environment now15:52
ryebotZic: To open the GUI:15:53
ryebotZic: If you haven't changed your admin password, run `juju show-controller --show-password` to get the randomly generated password15:54
ryebotZic: Next, run `juju gui`15:54
marcoceppiryebot Zic or just run `juju gui --show-credentials` ;)15:54
ryebotmarcoceppi: dangit, I always forget that15:55
ryebotZic: That'll start up the gui and give you a url to hit15:55
ryebotZic: Login with "admin" and your password, then look for the export button, which is at the top and looks like a box with an up-arrow15:56
Zicryebot: yeah, it's the step I followed to bootstrap the cluster successfully15:56
ryebotZic: Click that, and it'll download a copy of the model. We'd like to see it.15:56
Zicit's the "teardown" part where I don't know what it's the best practice :)15:56
Zicoh ok, I will do that now15:56
mbruzekZic: The up-arrow button will download the model in YAML representation. you can save it and it will help you deploy the same environment again in a repeatable fashion15:57
ZicI wrote a more detailed step-to-not-reproduce-but-post-mortem: http://paste.ubuntu.com/23894187/15:57
Zicmbruzek: do I need to reinstall the VMs and machine which host the cluster?15:57
jcastroZic: hey, just as an open invite, we'll be in Belgium next week if you feel like hopping on a train to talk face-to-face: http://summit.juju.solutions/15:58
mbruzekZic: The summit is free as in beer15:58
mbruzekZic: Because you used a mixture of Amazon and Manual Provider it may not be as easy a juju deploy bundle.yaml, but after you manually provision those physical systems you can deploy the bundle.16:00
mbruzekZic: Pastebin the model when you get that done16:00
Zicmbruzek: Amazon machines was linked with the Manual provider also, I don't use the AWS credentials16:01
Zic(our AWS instances is popped by Terraform for the anecdote)16:02
Zicso all I need to do is 1) reinstall all OS 2) relink to the Manual provider of the Juju controller 3) redeploy the YAML I'm exporing, or the 1st step is useless?16:03
Zicjcastro: will be happy to come, Belgium is not far away, I will try to discuss at our meeting if we can go with my company :)16:03
jcastrobring as many people as you want too, it's a free event.16:05
Zichttp://paste.ubuntu.com/23894269/16:05
mbruzekZic: Why are you installing the OS?16:05
Zicmbruzek: because we don't have MaaS and our homemade-installer have no connector to Juju (but lazyPower let me know that I can write one :))16:06
Zicdo I miss something?16:06
mbruzekZic: No, I just didn't understand your environment. I was about to tell you about MAAS but you already know.16:07
ZicVMs and physical servers at our datacenter are auto-installed by a homemade-installer like MaaS (which autocomplete our internal Information System, registry, and some warranty support)16:08
Zicfor AWS, we just use AMI16:08
Zicso I told lazyPower that maybe, in the future if we have more Juju infra I will install a MaaS16:08
Zicor maybe start to write a connector for Juju if it's not so hard for my knowledge16:08
mbruzekZic: With that new information, it seems your steps are right. I was hoping to avoid having to reinstall the OS16:08
Zicok, I was asking about the reinstallation if Juju provides a clean way to teardown the cluster16:09
Zicif not, not so important, I will just need to redo the manual-provider part, the reinstallation is fast and automatically done16:09
lazyPowerZic - as they were manually enlisted, there's no clean way to tear it down, once you juju remove-application the machines will be left behind and still be dirty.16:10
mbruzekZic: so you can issue: juju destroy-environment <name>,16:10
Zicoh hi lazyPower :)16:10
ryebotZic: From the sound of it, you don't need to reinstall juju16:10
ryebotjust remove your manual machines, reprovision, and add them back16:10
ryebot*if you want to, though, go ahead :)16:11
Zic:)16:12
ZicI'm sure this new cluster will not have all this problems, as it was at the beginning by the way16:12
mbruzekZic: For reference: https://jujucharms.com/docs/stable/clouds-manual  If you add the machines in manually you can use the bundle.yaml file you just downloaded to redeploy on the right systems using the to: (machine number)16:13
Zicit must be some sneaky things that came up with the friday's incident, even if I don't know what it is16:13
Zicmbruzek: yup, I will re-add the machine in the same order (and will controll through juju status btw)16:14
Zicit's what I did the first time to match charms with the hostname of machine (as we use predictive and rolenamed hostname)16:14
mbruzekZic: Looking at the last pastebin I see the variety of machines you are using, some of your workers have 20 cores and some have 2. You should be able to identify the systems by their constraints.16:17
Zicmbruzek: do you think I can test to restore the etcd base with a backup before the friday's incident? or it's worse and I just don't need to spend that time?16:19
ZicI don't know how Kubernetes manage a restore of etcd when there is a delta from what is currently running in term of pods, services... and what is restored in etcd16:21
Zic(I had 3 Vitess cluster deployed when I did the etcd backup, I wipe all of them since then)16:22
lazyPowerZic - that is a problem. your snapshot will not contain the TTL's on the keys, so you'll restore to whatever the state waas during that snapshot16:22
lazyPowerthis may have implications on running workloads16:22
mbruzekZic: Here is what I would recommend. Redeploy this cluster, and once you get everything working and in good state take a snapshot of etcd data (before you do any non-juju operations)16:23
Zicbecause I have two options : 1) etcd backups 2) all management parts of the cluster (easycharm, kube-api-loadbalancer, etcd and kubernetes-master, so all except kubernetes-worker) are snapshotted daily16:24
Zicso I set back to the past all the management part, I don't know how the kubernetes-worker part will act16:25
ZicI know that's complicating the problem istead of reinstall everything, it's just to know what can I possibly do if it was in production16:25
Zics/so I/so if I/ (did nothing actually for now :p)16:26
mbruzekZic: Technically I think you could do both backup etcd, and snapshot the Kuberentes controll plane16:26
mbruzekZic: the etcd charm has a snapshot and restore action provided in Juju you can run that at the same time you snapshot the control plane16:27
Zicoh I didn't know, I did this backup action manually via crontab and etcdctl backup command16:28
Zicso, I will try to set all the K8s control plane back to thursday, and see if from here, I can directly upgrade to Kubernetes 1.5.216:29
Zicmbruzek: if one of a component of charms (like etcd) appeared in APT's upgrades, do I must validate or hold this package?16:30
Zicit's one of the first step leading to my disaster the last week16:30
lazyPowerZic - thats a great question, and I should probably be pinning etcd if delivered via charm and release charm upgrades when the package is upgraded.16:30
Zicdon't know if it is for real, but was in the steps16:30
Zicok, I will pin etcd16:31
lazyPowerto unpin and rev teh etcd package16:31
ZicI think that all this problems came from the uprading of etcd via APT *PLUS* the fact that I run larges delete operations of large namespaces just before, and maybe I don't wait enough16:34
Zic(concerning friday) and concerning today, maybe some parts that are not working perfectly since then16:34
* mbruzek suspects that as well16:34
Zicand it's not reproductible unless you can do the exact same delete operation and upgrade etcd at the wrong time like me :D16:34
Zicas I said, before that, all my resilience and HA tests was perfect :)16:35
ZicI thought I will go in prod quickly ^^16:35
Zicit's not the first time etcd f*cked me up, in other technologies than K8S (or even Vitess, as lazyPower known), I know it's not your fault and I'm very happy of all the help you was able to provide me during this last days ;)16:36
ZiclazyPower mbruzek ryebot: I successfully return to the previous state before the incident via my backupped snapshot of all the K8s controll plane, so I'm going to immediately upgrade to 1.5.2 and will redo my own step-to-reproduce17:03
ZicI expect to... not reproduce my problem :)17:03
=== petevg is now known as petevg_noms
mbruzekwhy upgrade? If you deploy new you should get 1.5.2 by default17:03
mbruzekZic: ^17:04
Zic Zic | I know that's complicating the problem istead of reinstall everything, it's just to know what can I possibly do if it was in production17:05
Zic^ just to test that17:05
ZicI restored the VMs (which host master, etcd, apilb and easyrsa) of a ESX snapshot of wednesday17:06
Zic(my cluster works perfectly at this date)17:06
ZicI have just the upgrade to 1.5.2 to redo17:06
Zicand I'm sure that my step-to-reproduce the problem will not work, as it seems to be tied with the etcd disaster of friday17:07
ZicI can confirm, I can't reproduce my own previous problem \o/17:10
Zicso it seems that something I did friday corrupt something (etcd I suppose) was the guilty part17:11
ZicI just restore all management part to wednesday, reupgrade to 1.5.2, restore some kubernetes-worker and etcd... all seems fine17:12
Zicthe only difference is that I immediately upgrade to 1.5.2 before deleting my large namespaces17:12
Zicand that I did not upgrade etcd through APT this time17:13
Ziccc lazyPower ^17:13
lazyPowerZic - thats good to hear. I'm going to circle back and file a bug if you dont beat me to it, against layer-etcd to pin the package or make it configurable.17:13
Zic:) in my side, I will do a simple apt-mark hold etcd for this time17:14
lazyPoweri haven't had the pleasure of testing that scenario where etcd is upgraded out of band by an apt-get operation, so it may have been attributed to that, or it might have beena ttributed to broken key/val data in etcd due to the delete.17:14
Zica mix of the two I think, upgrade via APT during broken key/val operation17:14
Zicit's the only part I didn't test in my step-to-reproduce17:15
lazyPoweryeah, thats crappy that we werent' able to recover from that though17:15
Zic(I immediately upgrade, and then delete all my namespace)17:15
Zic(thinking of the buffer problem of kubeapilb)17:15
lazyPowerZic - i suppose moving forward, the suggestion is to snapshot your data in etcd, hten run the upgrade sequence. its going to stomp all over your resource versions doing the restore but its better to attain that prior state than to be completely broken.17:16
Zicyep17:16
ZicI will try to look precisely at the apt upgrade proposition also17:16
Zicto not upgrade anything that's managed by Juju carms17:16
Ziccharms*17:17
ZiclazyPower: the juju etcd charms does not include auto-backup right? do you think it's a good idea, as etcd-operator do it?17:18
Ziccurrently I run the backup through a crontab on each etcd units, mbruzek told me that I can do the same with a juju action, I will go with that I think17:18
lazyPowerZic - i'm open to a contribution for auto backups, but as it stand today its an operator action17:18
lazyPoweryou can run that backup action in like a jenkins job and have it archive, and then you have an audit trail17:18
lazyPoweri get leery of automatic things that have no visibliity (like cron)17:19
Zicpersonally I prefer to configure this type of backup on my own17:19
lazyPowerthe last thing i want is to assume its working, by wrapping it in a CI subsystem you have trace logs and know when it fails17:19
Zicbut as Juju is here to help, it sounds like a good feature :)17:19
Zichaha, yeah, I can uderstand that part :)17:20
lazyPowerand the whole juju action part ensures its repeatable :)17:20
Ziceven if etcd-operator do the job, I will actively monitor what he does17:20
lazyPowerplus those packages are what iv'e tested for restore, its effectively teh same thing, but i'd hate to think adding an extra dir to the tree or something would cause the restore action to tank.17:20
lazyPowerand then its added gas to the fire17:21
lazyPowermetaphorically speaking anyway17:21
Zicfor now I do the both : daily snapshots of the VMs which host etcd units + etcdctl backup command17:22
lazyPowerthats a good strategy17:23
lazyPower+117:23
ZiclazyPower: do you have any docs on how Juju and MaaS is connected, code-sided?17:33
rick_hZic: what do you mean?17:34
ZicI will lurk at if it's valuable for us to develop a connector for our own installation/provisionning infra or deploy a simple MaaS for Juju architecture17:34
Zicrick_h: oh hello, about this thing ^17:34
Zicwe have a kind of MaaS which is connected to all our services in my company17:34
Zicit's an homemade system and I don't know if I can write a simple new "provider" for Juju, or if I just go to MaaS17:35
Zic(will be a little redundant)17:35
rick_hZic: check out https://github.com/juju/gomaasapi and https://github.com/juju/juju/tree/staging/provider/maas17:37
=== mskalka is now known as mskalka|afk
Zicrick_h: thanks17:42
Zicmbruzek: hmm, the juju debug-log is kinda flooding since the new upgrade to 1.5.2 : http://paste.ubuntu.com/23894873/17:48
Zic(I use the juju upgrade-charms command)17:48
ZiclazyPower: I'm reposting as you were offline : the juju debug-log looks really strange since my new upgrade to 1.5.2 : http://paste.ubuntu.com/23894873/17:52
lazyPowerZic - most of that is normal. the leadership failure - if it continues to spam we'll want to get a bug filed against that17:53
Zicit floods in loop :)17:53
lazyPowerthats the unit agent complaining about a process it needs to do leadership stuff for coordination. not related to teh charms however17:53
lazyPowerZic - i'm headed out for lunch will be back in a bit17:54
Zichmm, I didn't do a kubectl version after juju upgrade-charm, I just take a look at juju status but I'm always on 1.5.1 in fact :o17:54
Zic(I confirm I stayed in 1.5.1 after the juju upgrade-charms, I was just looking at juju status to all return to green and didn't look at the application version -_-)17:59
=== petevg_noms is now known as petevg
stormmorehow juju world18:20
stormmorehowdy*18:20
=== frankban is now known as frankban|afk
=== mskalka|afk is now known as mskalka
lazyPowerZic - so you're saying juju upgrade-charm on the components either didn't run, or the resource was not upgraded?18:25
lazyPowersorry for latency, i'm at terrible coffeeshop free wifi18:25
=== scuttle|afk is now known as scuttlemonkey
ZiclazyPower: the upgrade-charms command just changed to the lastest version of the charm, but the application was not upgraded18:27
lazyPowerZic - that seems strangely reminiscent of another user reporting at deploy time they didn't get the upgraded resource18:28
lazyPowerthis is recoverable18:28
lazyPoweron the store display, you can fetch each resource and manually attach them to upgrade the components.18:28
lazyPowerwe actually just landed a doc update about this, 1 moment while i fetch the link18:29
lazyPowerZic - https://github.com/juju-solutions/bundle-canonical-kubernetes/pull/19718:29
Zicthe upgrade performs well the first time with the same cluster, don't know what happen :(18:30
Zicdoes the order of what charp is upgraded via upgrade-charm count?18:30
Ziccharm*18:30
Zicoh I know what happen18:31
Zicin 1.5.1 Flannel does not start properly18:31
ZicI forgot to restart them :}18:31
lazyPowerin juju status, if you ran the upgrade-charm step, you should still see 1.5.2 listed as your k8s component versions18:32
lazyPowerassuming it went without error. if there's a unit(s) trapped in error state that are related, its possible that the upgrade hasn't completed18:32
Zicdoes the upgrade-charm start if Flannel is in error?18:32
Zicah18:32
lazyPowerif the charm is in error, it will halt the operations on related units18:32
lazyPoweruntil the error is resolved18:32
lazyPowerok lunch is over for me, heading back to the office and will resume then Zic.18:33
lazyPowero/18:33
mbruzekZic: I am back.18:39
mbruzekZic: What is the current issue?18:40
Zicmbruzek recap: I restored my old cluster to wednesday, I ran a "juju status", all was green, I ran juju upgrade-charms on easy charms, I did another juju status at the end and all was green, and the "Rev" column contains the latest version of the charm, but in fact, the software version was always 1.5.1 for kubernetes-master/worker for example. I remembered that Flannel don't start well on boot sequence on18:40
Zic1.5.1 lately, started it on every node and the upgrade was unblocked. The only weird thing is that Flannel was shown as "active/green" in juju status so...18:40
Zicso all is fine actually, was my mistake with Flannel not autostarting well on old 1.5.118:41
Zic(and juju status which show me as active/green the first time)18:41
mbruzekZic: Yes we fixed the flannel restart issue in 1.5.2 so I am confused why flannel didn't restart.18:41
Zicmbruzek: the restaured cluster was in 1.5.118:42
Zicrestored*18:42
mbruzekah18:42
mbruzekOK18:42
Zic:)18:42
Zicapparently, Flannel not started was blocking the upgrade18:43
mbruzekBut everything has started now?18:44
Zicyep18:44
Zicit's my fault, when the cluster was restored to wednesday/1.5.1, I just did *one* juju status, all was green18:44
Zicnormally I run a watch -c "juju status --color"18:44
ZicI didn't see between the first "juju status" and the upgrade that Flannel passed in "error/red"18:45
Zicand that's apparently what's blocked the upgrade as after Flannel was manually started, the upgrade begins instantly18:46
ZiclazyPower: TL;DR : it was Flannel (of the 1.5.1 version) which blocked my upgrade :)18:53
lazyPowerah good to know18:53
Zicit's OK now18:53
lazyPoweri wish we could retroactively fix that18:54
lazyPower#fwp18:54
mbruzekright!18:54
Zicwas my fault also as I just run onetime juju status, showed all green, then upgrade-charms, see it did not nothing at the juju debug-log, re-run a juju status, show that Flannel is in error... and remembered that on wednesday (date of the snapshot) I was still in 1.5.1 with the flannel's issue :p18:55
Zicnormally I monitor the upgrading-charm process through a watch -c "juju status --color" :p18:55
Zics/it did not nothing/it did nothing/18:56
Zicdouble-negation is dangerous.18:56
ZicIN CONCLUSION (sorry for the caps), I have a good-running 1.5.2 cluster, no CLBO, PodsEviction works well if node goes down...18:57
lazyPower\o/18:57
lazyPoweryay18:57
Zic... well, the only last point is, can I come to the Juju Summit? :D18:57
ZicI will discuss this with my company :)18:57
lazyPowerits a freebie event, and you're invited and can bring more18:57
lazyPowerso, load up the possee and meet us in ghent :)18:57
mbruzekZic: Yes you are most welcome to join us18:57
Zicyou will hear my perfect^WFrench accent \o/18:58
Zichmm, just a real last point: http://paste.ubuntu.com/23895203/18:59
Zicall is up-to-date isn't it? I have a doubt for k8s-master which say 1.5.119:00
Zic(because: ERROR already running latest charm "cs:~containers/kubernetes-master-11" if I try again the juju upgrade-charm kubernetes-master)19:00
Zickubectl version return 1.5.219:01
Zicit seems to be a display bug SSHing directly to the master, all composants are in 1.5.219:14
Zics/bug/bug./19:14
TeranetQuestion how can I FORCE destroy something on juju I have 6 container which will not go away19:20
mbruzekTeranet: You want to destroy everything juju?19:20
Teranetyes19:20
Teranetso I can redeploy from scratch19:21
mbruzekTeranet: OK here is the command, but you should be careful with this.19:21
mbruzekjuju destroy-controller <name> --destroy-all-models19:22
mskalkateranet: juju kill-controller <controller-name> will tear it all down, including the controller node19:22
Teranetok let me see if that works19:22
Teranetthx19:23
jcastroTeranet: https://kubernetes.io/docs/getting-started-guides/ubuntu/decommissioning/19:23
jcastrocovers everything19:23
jcastrooh sorry, thought you were using kubes19:23
Teranetnope but it's ok19:24
jcastrothe Cleaning Up the Controller part at the bottom should still apply19:24
mskalkaI would try to remove the model first though with juju destroy-model <model-name>19:24
mbruzekhttps://jujucharms.com/docs/2.0/controllers19:24
TeranetI use my own private cloud and had broken relationship which broke my complete nove enviroment19:24
Teranetjuju destroy-model I had ran and it was stuck19:24
TeranetI had false applied a HA relactionship which resulted in destroying all my compute nodes :-(19:25
Teranetlucky I hadn't deployed VM's in Openstack yet19:25
=== scuttlemonkey is now known as scuttle|afk
ZiclazyPowe_ mbruzek: hmm, the two additionnal (scale from 3 to 5) etcd members seems unhealthy in etcdctl cluster-health19:53
Zichttp://paste.ubuntu.com/23895436/19:53
Zic(after an upgrade of the charm)19:54
ZicI just restarted the etcd service via systemctl and all is fine19:54
Zic(just to let you know if it's a known issue)19:54
Zicall node are healthy after that19:55
lazyPowe_Zic - seems like it might have raaced, i haven't seen any test failures doing scale testing19:55
lazyPowe_and there's logic ot help prevent that in the charms19:56
Zichmm I spoke too quicky, seems the restart does not suffice, it is unhealthy again, some etcd logs: http://paste.ubuntu.com/23895453/19:57
ZicI have just this problem on my 04 and 05 etcd node19:57
Zichmm, seems just a bit of flapping : http://paste.ubuntu.com/23895464/19:59
Zicthey are all healthy again19:59
kwmonroecory_fu: petevg:  if matrix (or any other project) depends on juju-plugins, the setup.py PR (https://github.com/juju/plugins/pull/75) would make the crashdump PR (https://github.com/juju-solutions/layer-cwr/pull/46) unnecessary, right?20:01
petevgkwmonroe: matrix doesn't depend on it.20:02
petevgkwmonroe: cory_fu pushed back on that, and I think that he's right. crashdump needs python2's yaml, and matrix just handles python3 stuff.20:02
kwmonroeok, well let's forget matrix for now petevg.. should juju-plugins be some kind of packaged citizens?20:04
petevgkwmonroe: yes. I think that we should merge the PR. I'm a little biased, though, on account of it being my PR :-)20:04
cory_fukwmonroe: Yeah, if we can update crashdump to work in py3 (i.e., if that one bug is fixed upstream), then we could perhaps make plugins a dep for matrix.  But I also kind of like having it as optional functionality that works if you have the lib installed and is otherwise a no-op20:04
cory_fuAn optional dep, if you will20:05
petevgSomebody does need to go and clean up the merge conflicts, though.20:05
cory_fukwmonroe, petevg: +1 to packaging juju-plugins for easier install.  Could also be a snap20:05
petevgcory_fu: yeah. No matter what, matrix shouldn't fail if crashdump doesn't exist.20:05
kwmonroepetevg: if only there were a recently gung-ho ~charmer that could propose a clean PR...20:05
petevgkwmonroe: yeah. It's on my list o' things to do today, once I finish running this double set of tests, where I'm confirming that I'm telling the truth about matrix running with an without the crashdump libs :-)20:06
kwmonroemy beef is really that the juju-plugins readme says "clone this repo", and that's not enough20:07
kwmonroe.. for runtime20:07
kwmonroe.. sometimes20:07
petevgYeah. Adding the repo to your PATH is adequate, but not pretty. :-)20:07
kwmonroehey!  ^^ that's a java slogan right there, right mbruzek?  for runtime, sometimes?20:07
petevgHeh.20:08
mbruzekhi20:08
=== Salty is now known as SaltySolomon
kwmonroeexcept it's not petevg, doesn't crashdump need pyyaml at runtime?  nothing about cloning that repo and adding to the path helps you there.20:09
petevgkwmonroe: true. nm20:10
petevgkwmonroe: I will ping people when I have the new nice PR :-)20:10
kwmonroevery fine petevg -- fwiw, i'm really trying to say that j-p is all growed up and it's time to consider which format to deliver it in.20:12
petevgCool :-)20:13
kwmonroeyour keyboard says smiles, your subtext says ugh.20:13
petevgRead into stuff much? :-p20:14
kwmonroeyou did it again!20:14
petevgIt is in Python2 still. Silly bug.20:15
cory_fukwmonroe: Stop arguing about it and create a snap.  ;)20:26
kwmonroe90 seconds and i have no retort.  you win this round cory_fu.20:28
cory_fu:)20:28
Teranetok quick question I had a host deployment failure it timed out in the bios setting how can I initiate a redeployment to this box  ?20:49
Zicryebot / mbruzek / lazyPowe_ : just a final word: I did all my resilience and HA test and this time, all is working20:49
mbruzekZic: great20:49
ryebotZic: Awesome!20:50
lazyPowe_Zic awesome, glad you kept at it and had positive results :)20:50
lazyPowe_^520:50
Zicthe customer of this architecture leaves an old VMware ESX platform, these robust Hosts machines will be added as kubernetes-worker at term :)20:51
ZicEC2 instances are just from popping near their own customer, in their country for each endpoint20:52
ZicI will keep you updated as a testimonial of how good CDK will do the work for the coming-launch :)20:53
Zicat term, this cluster will run a 3 Vitess cluster, Cassandra/Spark/Zeppelin, some Nginx and php-fpm720:54
lazyPowe_thats a nice spread of workloads20:54
lazyPowe_got some presentation layer, some app layer, some business intelligence in there, and i dont know what vitess is but i assume its funky vegetables20:55
ZiclazyPowe_: you are on the Vitess' Slack, don't you?20:55
lazyPowe_Zic - negative, i'm on 7 slacks but that is not one of them20:56
Zicah, I crossed you at the K8S Slack :)20:56
Zic(for a problem about Vitess, and so, I was invite to the Slack of Vitess, I didn't remember where we crossed :))20:57
Zichttp://vitess.io20:57
Zicit's the way that YouTube use MySQL in their infra, especially in Kubernetes/Borg20:57
lazyPowe_ahhhh ok20:59
lazyPowe_bookmarked for later reading21:00
petevgkwmonroe, cory_fu: PR for you https://github.com/juju-solutions/matrix/pull/7321:00
Zicto a totally another subject, my colleague saw the OpenStack Juju bundle charms21:00
Zicit's pretty... dense :)21:01
mbruzekZic there are  lots of applications in OpenStack21:01
mbruzekZic: but one can upgrade through the releases of OpenStack easily with those charms.21:02
Zic(he saw me do some drag'n'dropping at the Juju GUI and when he visits the jujucharms.com website, he saw the openstack bundle)21:02
Zicmbruzek: yeah, he is interested as we have also a PoC for OpenStack and... it was not so concluent21:02
Zicif you want to build OpenStack from scratch on your own, and maintain this infra, it costs a lot of time, especially at the beginning when you're alone21:03
Zic(before the infra is up, running and... documented :))21:03
ZicI think he will try the OpenStack bundle charm :D21:04
Zicgoing to sleep anyway, I'm at extra-unofficial-hour for too long :)21:05
marcoceppiZic: I totally recommend it, I know lots of 1-3 people teams that maintain openstack in production with juju21:05
Zicyeah, as my own experience with Juju now, I can recommend it for another technologies :)21:06
Zicwe mainly use Puppet here as our configuration-management tool, sometime Ansible for particular work...21:06
Zicas K8S module for Puppet does not exist and will be a headache to maintain by ourselves, I run through Juju :)21:07
Zic(kubeadm first, then I discovered CDK via Juju)21:07
Zicthe second lesson I learn from Juju/K8S/Vitess is that I must learn Go someday :p21:13
Zicmore and more techno that I used are written in Go21:13
* Zic said he is going to sleep 10min ago, too talkative, g'night21:14
marcoceppicheers o/21:16
=== mskalka is now known as mskalka|afk
stormmoreso how I go about enabling elasticsearch and kibana for logging on my CDK cluster?22:29
narinderthedac, were you able to talk to jonh from CPLANE today22:34
narinder?22:34
thedacnarinder: yes22:35
stormmoreI keep coming across documents that talk about exporting a couple of env vars before bringing up a k8s to get it to spin up elasticsearch and kibana pods. does anyone know how to get those pods deployed in a pre-existing cluster easily?22:44
cholcombewas add-metric ever added to the charmhelpers?22:48
cholcombecmars, ^^22:49
cmarscholcombe, no, i don't think it was22:49
cholcombecmars, we should get that fixed so i don't have to keep calling subprocess.check_output to add metrics :)22:50
cmarscholcombe, could do, yeah22:50
cmarscholcombe, where is the charmhelpers project?22:51
cmarsstill on LP?22:52
cmarshmm, seems so. sure, i'll look into this22:52
cholcombecmars,  https://code.launchpad.net/charm-helpers22:52
cholcombeyeah22:52
cholcombecmars, i'm working through a PR on gerrit and people are asking why I have to subprocess call to a juju function22:53
cmarscholcombe, it'll probably take all of a couple minutes to write, the rest of the day to document and test :)22:54
cholcombelol yup22:55
cholcombecmars, sorry to be a pain in the butt22:56
cmarscholcombe, :) its fine, i'm just complaining22:57
cmarsneeds to be done.. layer:metrics isn't terribly efficient22:57
marcoceppistormmore: hey, so you can either deploy elastic search and kibana on your cluster, or you can deploy elasticsearch/kibana/beats along side it22:59
stormmoremarcoceppi at the moment I am thinking of on the cluster to minimize the number of "machines" in use. still in the process of architecting a bigger bare metal cluster23:00
stormmoremarcoceppi I already have a small k8s cluster deployed though23:00
marcoceppistormmore: makes sense. I don't have much experience in doing that, but you should be able to follow any online guide that walks through elastic on k8s23:01
cholcombecmars, i'd use it but the ceph charms haven't gone layered yet23:02
marcoceppistormmore: I can't confirm, but this one looks promising: https://github.com/kayrus/elk-kubernetes23:02
marcoceppiit's atleast been recently updated23:02
stormmoremarcoceppi and that is where the problem lies, it seems to assume that you are not adding it but enabling it before bringing up the cluster... from what I can see it is an addon at this point23:03
stormmorehttps://kubernetes.io/docs/user-guide/logging/elasticsearch/23:03
stormmorelazyPowe_ are you around? do you have any input into adding elasticsearch & kibana pods to kube-system?23:28
lazyPowe_stormmore - our integration point was an external logging deployment using the beats-core bundle as a foundation for that effort23:29
lazyPowe_the idea is that if your k8s cluster is sick, you'd want some persistence around that data, and have it be accessable regardless of the kubernetes system sttate23:29
lazyPowe_so it uses beats to ship the data over, and then gets parsed-reinterpreted by the kibana dashboards23:29
stormmorelazyPowe_ hmmm insteresting considering conjure-up docs suggest that it deploys 2 elasticsearch pods and a kibana one23:30
lazyPowe_wat23:30
lazyPowe_when did this happen?23:30
lazyPowe_stokachu - wat?23:30
stormmorehttps://insights.ubuntu.com/2016/11/21/conjure-up-canonical-kubernetes-under-lxd-today/23:31
lazyPowe_ooooo23:31
lazyPowe_this isn't the conjure-up docs or prompt, this is the upstream k8s guide23:31
lazyPowe_right, at this time, teh beats core bundle was part of CDK23:32
lazyPowe_its now an ancillary bundle, pending our v5 update of the elastic stack components23:32
stormmore"conjure-up kubernetes" vs "conjure-up canonical-kubernetes"?23:32
lazyPowe_that work thats there still works and functions as it did then, but it could be better with the v5 updates as there were a ton of fixes and normalized versioning schema and etc.23:32
lazyPowe_stormmore - so in short, what you get today with canonical-kubernetes, is much more aligned with a smaller deploment, and you can then add the beats components and relate it all. we have a todo to get another bundle published since we moved to the fragments, but we're holding off until the v5 rev of the elastic stack iirc23:33
lazyPowe_stokachu - un-wat, miscommunication23:34
lazyPowe_stormmore - i'll take a line item to bring this up with the team about seeing if we can get you an elastic-enabled bundle tomorrow23:36
lazyPowe_most of the team has left, im' sticking around a little bit longer to check on this deployment i'm running, and then i'm out for the evening as well23:36
stormmorelazyPowe_ awesome, no worries23:37
lazyPowe_i would build you one now, but that would be a "throw it over the fence good luck i'm behind 7 proxies" kind of thing to do23:37
lazyPowe_i'd rather at least run a test deployment before i put it in your hands23:37
stormmorelazyPowe_ I am just trying to make sure I have everything in place so dev doesn't need access to the nodes and have a UI to get the logs from23:37
lazyPowe_yep, totally understand that23:37
lazyPowe_why give them admin when read-only works23:37
lazyPowe_have you been looking into RBAC k8s primitives perchance?23:38
lazyPowe_those seem like they are going to be right in your wheelhouse23:38
lazyPowe_you can assign roles to namespaces and scope what primitives they can interact with23:38
lazyPowe_rather roles to users, in a namespace, and ....23:38
lazyPowe_see above23:38
lazyPowe_stormmore - https://kubernetes.io/docs/admin/authorization/23:39
lazyPowe_we haven't fully enabled this yet as its currently in BETA23:39
lazyPowe_but you'll def want ot read up on it and when we land teh feature set in the charm to make that configurable, you'll be in container-topia23:40
stormmoreyeah exactly23:42
cmarscholcombe, here you go, wasn't nearly as bad as i thought :) https://code.launchpad.net/~cmars/charm-helpers/add-metricenv/+merge/31595223:45
cholcombewoo23:45
cholcombecmars, nice. i forgot about the JUJU_METER thing23:46
cmarsi should write more python tests for my charms... mock.patch is pretty easy to work with23:46
cmarsgotta run now. if you could help me get this landed, or reviewed -- happy to fix things up however -- i'd much appreciate it!23:47
cholcombecmars, sure.  i can review it but i can't land it23:56

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!