veebers | anastasiamac: can do :-) | 00:43 |
---|---|---|
veebers | anastasiamac: LGTM | 00:48 |
veebers | wallyworld: I'm seeing "agent lost, see 'juju show-status-log mariadb/0'" you mentioned this last week but I can't recall the context. Is this something that's new, I wasn't seeing it earlier (before I rebased develop onto my branch) | 00:50 |
wallyworld | veebers: the new presence implementation breaks k8s status. it's something i need to fix | 00:51 |
veebers | wallyworld: ack, ok that makes sense that I'm seeing it then :-) As long as I know the reason I'm comfortable that I'm sane (ish) | 00:51 |
veebers | wallyworld: in other news: https://pastebin.canonical.com/p/tWCjdgMCH7/ (re: units being terminated) | 00:52 |
wallyworld | veebers: good, so that means we don't need to explicitly set the status in the api facade | 00:53 |
veebers | indeed | 00:53 |
hloeung | is it safe to upgrade juju from 2.3.8 to 2.4.3? | 00:54 |
anastasiamac | wow, so provisioner unit tests now started failing as frequently as 1 out 4 times... working thru these | 01:43 |
anastasiamac | hloeung: fwiw, yes it should be safe. if u know otherwise, please let us know :) | 01:53 |
hloeung | ok, will let you know when I get to upgrading our CI environment. Thanks | 01:53 |
thumper | hloeung: there are no known issues upgrading from 2.3.8 to 2.4.3 | 01:54 |
hloeung | ack, thanks | 01:56 |
thumper | babbageclunk: is your recent fix for the sometimes timeout with the raft worker test? | 02:20 |
veebers | wallyworld: I'm confused, I'm trying to deploy cs:~wallyworld/caas-mariadb (without doing the 'juju trust aws-integrator' step to create an error). With "kubectl -n message log -f juju-operator-mariadb-744bb855-vtvbd" I see no complaints; with juju debug-log -m controller I see "ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: resource name may not be empty" every | 02:26 |
veebers | 4 seconds. I must be missing something obvious | 02:27 |
wallyworld | veebers: did you use the --storage deploy arg? | 02:28 |
wallyworld | the way it is erroring is a bug also | 02:28 |
veebers | wallyworld: aye, "ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: resource name may not be empty" | 02:28 |
veebers | sorry, juju deploy cs:~wallyworld/mariadb-k8s --storage database=10M,k8s-ebs | 02:28 |
wallyworld | did you create the storage pool? | 02:29 |
wallyworld | juju create-storage-pool k8s-ebs kubernetes storage-class=juju-ebs storage-provisioner=kubernetes.io/aws-ebs parameters.type=gp2 | 02:29 |
veebers | yep, as per discourse post | 02:29 |
wallyworld | maybe there's a bug if operator storage is missing | 02:30 |
wallyworld | juju create-storage-pool operator-storage kubernetes storage-class=juju-operator-storage storage-provisioner=kubernetes.io/aws-ebs parameters.type=gp2 | 02:30 |
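The storage setup discussed in the exchange above can be collected into one sketch. This is illustrative only, based on the commands quoted in the log: the pool names, storage classes, the `gp2` parameter, and the deploy line are as given there, for a Kubernetes model backed by AWS EBS.

```shell
# Sketch of the storage-pool setup quoted above (AWS-EBS-backed k8s model).
# Names and parameters are taken from the log; adjust for your cloud.

# Pool backing the operator's storage:
juju create-storage-pool operator-storage kubernetes \
    storage-class=juju-operator-storage \
    storage-provisioner=kubernetes.io/aws-ebs \
    parameters.type=gp2

# Pool backing the workload (mariadb) storage:
juju create-storage-pool k8s-ebs kubernetes \
    storage-class=juju-ebs \
    storage-provisioner=kubernetes.io/aws-ebs \
    parameters.type=gp2

# Then deploy, referencing the workload pool (charm and size from the log):
juju deploy cs:~wallyworld/mariadb-k8s --storage database=10M,k8s-ebs
```

As the log notes, volumes are created on demand, so provisioning only completes once `juju trust aws-integrator` has granted the cluster credentials.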
veebers | wallyworld: I'll run that now? | 02:31 |
wallyworld | yeah | 02:31 |
* veebers makes it so | 02:32 | |
wallyworld | you may need to deploy a new app though | 02:32 |
veebers | ah, ack | 02:33 |
wallyworld | but i expect it should work | 02:33 |
wallyworld | it should bounce the worker and read the new storage pool info | 02:33 |
veebers | I'm trying a new deploy, was still sing the manifold errors in logs every 4sec | 02:34 |
* anastasiamac imagines veebers singing errors | 02:36 | |
anastasiamac | like a snake charmer | 02:36 |
veebers | ugh, I'm seeing "Warning FailedScheduling 6s (x8 over 1m) default-scheduler pod has unbound PersistentVolumeClaims" for the new deploy, but that's not reflected in juju status. it's not getting pushed through the cloud container status properly it seems | 02:36 |
veebers | anastasiamac: hah ^_^ | 02:36 |
thumper | anastasiamac: I'm looking at another intermittent test failure | 02:47 |
thumper | and I think I have worked out the race in that... | 02:47 |
thumper | it may be similar to yours... | 02:47 |
thumper | I'm thinking through a solution | 02:48 |
anastasiamac | thumper: which test and what race? | 02:48 |
* thumper dabbles | 02:48 | |
anastasiamac | thumper: mine seems to be all in provisioner_task | 02:48 |
thumper | http://10.125.0.203:8080/view/Unit%20tests/job/RunUnittests-s390x/lastCompletedBuild/testReport/github/com_juju_juju_apiserver_facades_client_client/TestPackage/ | 02:48 |
thumper | FAIL: client_test.go:762: clientSuite.TestClientWatchAllAdminPermission | 02:48 |
thumper | the fundamental problem is the test goes: | 02:48 |
thumper | do something | 02:48 |
thumper | do something | 02:48 |
thumper | start watcher | 02:48 |
thumper | expect X changes | 02:48 |
thumper | there is the expectation that the second do something had been processed before the watcher started | 02:49 |
thumper | so there is a race there | 02:49 |
anastasiamac | thumper: oh k... i hope it's similar to mine... | 02:49 |
thumper | there is another bug... and a kinda big one... | 02:50 |
anastasiamac | thumper: m working at the moment on "start controller machine, start another machine, remove 1 machine... oooh... both machines removed"... | 02:50 |
thumper | related to CMR | 02:50 |
anastasiamac | so mayb our failures are similar but mayb not... | 02:50 |
anastasiamac | ouch, cmr bugs are scary :) | 02:50 |
thumper | that I'm not sure whether it has real world impact or not | 02:50 |
* anastasiamac looks in direction of FTP and wallyworld :D | 02:50 | |
wallyworld | huh? | 02:51 |
anastasiamac | wallyworld: nothing | 02:51 |
anastasiamac | thumper and i are having fun with watcher intermittent test failures :D ignore me | 02:51 |
thumper | wallyworld: the multiwatcher interaction with CMR is questionable | 02:51 |
wallyworld | multiwatcher does report on remote apps | 02:53 |
thumper | yes, but it seems more by good luck | 02:56 |
thumper | wallyworld: did you want to chat about caas presence at some stage? | 02:58 |
wallyworld | nah, fixed it | 02:58 |
wallyworld | just testing | 02:58 |
veebers | sigh, I almost tried to put my glass of water in my pocket so I could carry my muesli bar back to the office :-| | 03:11 |
anastasiamac | veebers: it does get better... at least it was not a hot drink like coffee or tea | 03:12 |
veebers | hah ^_^ true that | 03:12 |
wallyworld | thumper: fyi, the presence fix https://github.com/juju/juju/pull/9150 | 03:13 |
anastasiamac | thumper: https://github.com/juju/juju/pull/9151 (one test fix) ... m chasing the 2nd one | 03:19 |
anastasiamac | turns out we r just way too efficient now sometimes | 03:20 |
* thumper looks at both | 03:26 | |
thumper | I'm testing a fix for mine too | 03:27 |
thumper | anastasiamac: I think my fix would be more appropriate | 03:28 |
thumper | anastasiamac: I think yours is adjusting the timing by side-effect | 03:28 |
thumper | the start sync method doesn't do any syncing with the underlying txn watcher | 03:28 |
* thumper sighs | 03:28 | |
thumper | mine just failed too | 03:28 |
thumper | FFS | 03:29 |
thumper | I made the race much smaller... but it is still there | 03:29 |
* thumper thinks some more | 03:29 | |
thumper | testing async code is hard... | 03:30 |
anastasiamac | thumper: k, m chasing the 2nd failure... m sure that the 1st failure is not with code but with test setup... | 03:30 |
anastasiamac | thumper: hence, the sync felt appropriate | 03:30 |
thumper | the StartSync doesn't do anything for the JujuConnSuite | 03:30 |
thumper | except poking the presence worker | 03:31 |
* thumper thinks | 03:31 | |
thumper | and something else | 03:31 |
* thumper goes to look at the something else | 03:31 | |
anastasiamac | thumper: k | 03:31 |
thumper | pingBatcher | 03:32 |
anastasiamac | thumper: what about it? | 03:32 |
thumper | that is the other thing StartSync pokes | 03:32 |
thumper | presenceWatcher and pingBatcher | 03:32 |
thumper | nothing to do with the normal watchers | 03:32 |
anastasiamac | thumper: right. so the first failure was because we were creating a machine, setting harvest mode and removing in hopes that harvest mode will b respected... occasionally, and now more often, harvest mode was not set when we came to remove... hence we failed... | 03:33 |
* thumper nods | 03:33 | |
anastasiamac | thumper: as soon as sync was added before removal, the failure disappeared | 03:33 |
thumper | but that was just due to a change in timing | 03:34 |
thumper | if you added sleep 10ms it would probably do the same | 03:34 |
thumper | we work really hard to have workers work asynchronously | 03:34 |
thumper | then want control in tests | 03:34 |
anastasiamac | thumper: k... can we ho? | 03:35 |
thumper | sure | 03:35 |
veebers | wallyworld, kelvinliu__ : any idea what might cause the error; pod has unbound PersistentVolumeClaims? | 04:10 |
wallyworld | if the underlying volume cannot be created | 04:11 |
veebers | wallyworld: ok, so I did create-storage-pool, is it likely something aws related? Perhaps previously storage bits weren't cleaned up? | 04:13 |
wallyworld | new volumes are created on demand | 04:13 |
wallyworld | did you deploy the aws-integrator? | 04:13 |
wallyworld | and used juju trust? | 04:13 |
wallyworld | kubectl get all,pv,pvc | 04:14 |
wallyworld | will show status of volumes and claims | 04:14 |
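The checks wallyworld suggests can be sketched as a short diagnostic sequence. The namespace and pod names here are placeholders (a Juju k8s model gets a namespace named after the model), not values from the log:

```shell
# Diagnostic sketch for the "pod has unbound PersistentVolumeClaims"
# warning. "mariadb" as namespace and "mariadb-0" as pod name are
# illustrative placeholders; substitute your model's namespace.

# Status of pods, persistent volumes, and claims, as suggested above:
kubectl -n mariadb get all,pv,pvc

# Events for a pending pod usually name the unbound claim directly:
kubectl -n mariadb describe pod mariadb-0
```

If the claims stay `Pending`, the usual cause in this exchange is that the cluster cannot create the underlying volume, e.g. because `juju trust aws-integrator` has not been run.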
kelvinliu__ | veebers, Is the aws-integrator/0 in active status? | 04:14 |
veebers | wallyworld: ah hah right, no I didn't do the juju trust part. So this is the failure I was expecting to see right? | 04:14 |
wallyworld | yup | 04:15 |
wallyworld | that's what should be surfaced in juju status | 04:15 |
veebers | wallyworld: ok cool. I'm not seeing it surfaced, need to debug why | 04:15 |
wallyworld | veebers: fyi, "lost" status fix just landed | 04:26 |
veebers | wallyworld: yay, thanks! | 04:28 |
veebers | gah, why is "Running machine config. script" taking so damn long in aws/ap-southeast-1. It would be quicker to deploy locally in lxd :-| | 04:41 |
veebers | wallyworld (sorry to pester) Am I reading this right in that the storage error is stopping the operator pod from being deployed and thus the updateStateUnits et al. won't be in operation? (https://pastebin.canonical.com/p/rQrDsx7KRM/) | 04:55 |
wallyworld | yes, that's right. but you should be able to deploy the operator without any storage unless there's a bug | 04:57 |
wallyworld | if there is a bug and the operator does need storage, you could always create the mariadb storage pool with a dud provisioner | 04:58 |
wallyworld | that should induce an error in deploying the mariadb unit | 04:58 |
veebers | wallyworld: hmm, so I had created both operator-storage and k8s-ebs before deploying mariadb (still without having run juju trust for the k8s cluster) | 05:02 |
wallyworld | when you run juju trust the storage will come good and thing will provision | 05:02 |
wallyworld | so you can create a new different storage pool with a dud provisioner | 05:03 |
veebers | wallyworld: right, but the intention is to be able to surface the fact that the storage is borked right? | 05:03 |
wallyworld | and deploy a new mariadb with an alias using that dud pool | 05:03 |
veebers | ah shoot I also (somehow) misspelled the image path (caas-operator-image-path=veebers/caas-operator...) :-\ | 05:04 |
wallyworld | that would explain things a bit | 05:04 |
wallyworld | you shouldn't need to create a storage pool for the operator | 05:04 |
wallyworld | hence you can leave off the trust step | 05:04 |
wallyworld | and the operator will deploy | 05:04 |
veebers | but, it's not trying to install that as far as I can tell. At any rate I'll fix that and re-deploy | 05:04 |
wallyworld | and the app itself will fail | 05:04 |
veebers | ack | 05:05 |
veebers | argh, it's still complaining about storage with the proper image url | 05:07 |
veebers | wallyworld: does that suggest a bug where juju is putting storage constraints on the operator pod that shouldn't be there? | 05:08 |
wallyworld | i'd have to see the error. but you can deploy the operator with storage and poison the app storage pool to get by | 05:09 |
veebers | wallyworld: deploy op with storage as in run 'juju trust aws-integrator'? | 05:11 |
wallyworld | yeah | 05:11 |
veebers | ack ok cheers | 05:11 |
wallyworld | just set up the mariadb storage pool with a typo in the provisioner attribute | 05:11 |
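The "dud pool" trick wallyworld describes might look like the following. The pool name, the alias, and the deliberately broken provisioner value are all made up for illustration; only the general shape follows the commands quoted earlier in the log:

```shell
# Sketch of deliberately poisoning a storage pool so unit storage
# provisioning fails and the failure surfaces in juju status.
# "bad-ebs", "mariadb-dud", and the misspelled provisioner are
# illustrative, not from the log.
juju create-storage-pool bad-ebs kubernetes \
    storage-class=juju-ebs \
    storage-provisioner=kubernetes.io/aws-ebs-TYPO \
    parameters.type=gp2

# Deploy under an alias using the dud pool, per the suggestion above:
juju deploy cs:~wallyworld/mariadb-k8s mariadb-dud --storage database=10M,bad-ebs
```

The point of the exercise is that the unit's storage can never provision, which should induce exactly the kind of error veebers is trying to see reflected in status.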
veebers | ah right, ack will do | 05:12 |
veebers | I've just enabled trust, waiting for the scheduling to succeed | 05:12 |
stickupkid | jam: I just want to amend some stuff in here, before we merge https://github.com/juju/juju/pull/9148# | 08:22 |
jam | stickupkid: sorry about that, I had 2 PR up, and accidentally submitted the wrong one | 09:09 |
stickupkid | jam: haha, you can merge away now :p | 09:10 |
manadart | Need a review: https://github.com/juju/juju/pull/9153 | 09:35 |
manadart | Small change, with easy Q/A. | 09:35 |
stickupkid | manadart: looking | 09:46 |
manadart | stickupkid: Ta. | 09:48 |
stickupkid | manadart: done | 09:48 |
manadart | Cheers. | 09:49 |
manadart | stickupkid: As discussed - https://github.com/juju/juju/pull/9155 | 13:24 |
stickupkid | manadart: nice, will have a look now | 13:28 |
wallyworld | babbageclunk: have you tried bootstrapping lately? | 22:28 |
babbageclunk | wallyworld: not today | 22:29 |
babbageclunk | wallyworld: y? | 22:29 |
wallyworld | since late yesterday it's hung for me | 22:29 |
wallyworld | just wondering if it's just me | 22:30 |
veebers | wallyworld: in aws I see "Running machine config. script" take *ages* | 22:31 |
wallyworld | for me on aws or lxd it just hangs at that point | 22:31 |
babbageclunk | wallyworld: ok, having a go myself after pushing this change. | 22:37 |
wallyworld | ok, let's see how it goes | 22:38 |
veebers | wallyworld, babbageclunk: I got a successful bootstrap, took almost 40 minutes | 22:45 |
babbageclunk | crazy | 22:45 |
wallyworld | there's got to be something that's changed. it could be slow apt get of image updates or mongo or something | 22:46 |
veebers | maybe cloud-init taking a while when it's apt installing? | 22:46 |
veebers | heh | 22:46 |
thumper | wallyworld: I've worked out this bug, but would like to talk to you if you have a chance | 22:46 |
wallyworld | sure, give me 5 | 22:46 |
wallyworld | otp | 22:46 |
thumper | ack | 22:49 |
thumper | wallyworld: actually, never mind | 23:01 |
wallyworld | thumper: sorry, still in 1:1 | 23:01 |
babbageclunk | wallyworld: bootstrap was about normal speed for me | 23:13 |
wallyworld | damn ok | 23:13 |
veebers | babbageclunk: where were you bootstrapping to? | 23:14 |
veebers | into? at? into probably | 23:15 |
babbageclunk | veebers: localhost. | 23:15 |
babbageclunk | I'll try aws | 23:15 |
babbageclunk | ooh, meeting | 23:15 |
thumper | wallyworld: this one is for you https://github.com/juju/juju/pull/9156 | 23:25 |
wallyworld | ok, will look after standup | 23:25 |
veebers | babbageclunk: which region did you bootstrap aws? I'm using ap-southeast-1 | 23:27 |
babbageclunk | I used ap-southeast-2 | 23:28 |
wallyworld | thumper: lgtm, nice pickup | 23:35 |
thumper | wallyworld: took me a while because I had the assumption that the initial state was wrong and we weren't waiting | 23:37 |
wallyworld | seems obvious now | 23:37 |
thumper | but it turned out the initial state was right, and a subsequent update fubared it | 23:37 |
wallyworld | always is after the fact | 23:37 |
thumper | sure is | 23:37 |
Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!