[00:43] anastasiamac: can do :-)
[00:48] anastasiamac: LGTM
[00:50] wallyworld: I'm seeing "agent lost, see 'juju show-status-log mariadb/0'". You mentioned this last week but I can't recall the context. Is this something new? I wasn't seeing it earlier (before I rebased develop onto my branch)
[00:51] veebers: the new presence implementation breaks k8s status. it's something i need to fix
[00:51] wallyworld: ack, ok, that makes sense that I'm seeing it then :-) As long as I know the reason, I'm comfortable that I'm sane (ish)
[00:52] wallyworld: in other news: https://pastebin.canonical.com/p/tWCjdgMCH7/ (re: units being terminated)
[00:53] veebers: good, so that means we don't need to explicitly set the status in the api facade
[00:53] indeed
[00:54] is it safe to upgrade juju from 2.3.8 to 2.4.3?
[01:43] wow, so provisioner unit tests now started failing as frequently as 1 out of 4 times... working through these
[01:53] hloeung: fwiw, yes it should be safe. if u know otherwise, please let us know :)
[01:53] ok, will let you know when I get to upgrading our CI environment. Thanks
[01:54] hloeung: there are no known issues upgrading from 2.3.8 to 2.4.3
[01:56] ack, thanks
[02:20] babbageclunk: is your recent fix for the occasional timeout in the raft worker test?
[02:26] wallyworld: I'm confused. I'm trying to deploy cs:~wallyworld/caas-mariadb (without doing the 'juju trust aws-integrator' step, to create an error). With "kubectl -n message log -f juju-operator-mariadb-744bb855-vtvbd" I see no complaints; with juju debug-log -m controller I see "ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: resource name may not be empty" every 4 seconds. I must be missing something obvious
[02:28] veebers: did you use the --storage deploy arg?
[02:28] the way it is erroring is a bug also
[02:28] wallyworld: aye, "ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: resource name may not be empty"
[02:28] sorry, juju deploy cs:~wallyworld/mariadb-k8s --storage database=10M,k8s-ebs
[02:29] did you create the storage pool?
[02:29] juju create-storage-pool k8s-ebs kubernetes storage-class=juju-ebs storage-provisioner=kubernetes.io/aws-ebs parameters.type=gp2
[02:29] yep, as per discourse post
[02:30] maybe there's a bug if operator storage is missing
[02:30] juju create-storage-pool operator-storage kubernetes storage-class=juju-operator-storage storage-provisioner=kubernetes.io/aws-ebs parameters.type=gp2
[02:31] wallyworld: I'll run that now?
[02:31] yeah
[02:32] * veebers makes it so
[02:32] you may need to deploy a new app though
[02:33] ah, ack
[02:33] but i expect it should work
[02:33] it should bounce the worker and read the new storage pool info
[02:34] I'm trying a new deploy, was still sing the manifold errors in logs every 4sec
[02:36] * anastasiamac imagines veebers singing errors
[02:36] like a snake charmer
[02:36] ugh, I'm seeing "Warning FailedScheduling 6s (x8 over 1m) default-scheduler pod has unbound PersistentVolumeClaims" for the new deploy, but that's not reflected in juju status. it's not getting pushed through the cloud container status properly, it seems
[02:36] anastasiamac: hah ^_^
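
(For reference - a rough recap, not an authoritative set of steps - the deploy-with-storage flow being debugged above amounts to roughly the following. The pool definitions and the mariadb-k8s deploy line are the commands quoted in the log; the exact aws-integrator charm reference is not given there, so that line is an assumption.)

    # give juju credentials to provision EBS volumes (the step deliberately skipped above)
    juju deploy aws-integrator    # charm name as referred to in the log; exact charm URL assumed
    juju trust aws-integrator
    # storage pools for the operator and the workload (verbatim from the log)
    juju create-storage-pool operator-storage kubernetes storage-class=juju-operator-storage storage-provisioner=kubernetes.io/aws-ebs parameters.type=gp2
    juju create-storage-pool k8s-ebs kubernetes storage-class=juju-ebs storage-provisioner=kubernetes.io/aws-ebs parameters.type=gp2
    # deploy the workload with its database storage backed by the k8s-ebs pool
    juju deploy cs:~wallyworld/mariadb-k8s --storage database=10M,k8s-ebs
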
[02:47] anastasiamac: I'm looking at another intermittent test failure
[02:47] and I think I have worked out the race in that...
[02:47] it may be similar to yours...
[02:48] I'm thinking through a solution
[02:48] thumper: which test and what race?
[02:48] * thumper dabbles
[02:48] thumper: mine seems to be all in provisioner_task
[02:48] http://10.125.0.203:8080/view/Unit%20tests/job/RunUnittests-s390x/lastCompletedBuild/testReport/github/com_juju_juju_apiserver_facades_client_client/TestPackage/
[02:48] FAIL: client_test.go:762: clientSuite.TestClientWatchAllAdminPermission
[02:48] the fundamental problem is the test goes:
[02:48] do something
[02:48] do something
[02:48] start watcher
[02:48] expect X changes
[02:49] there is the expectation that the second "do something" had been processed before the watcher started
[02:49] so there is a race there
[02:49] thumper: oh k... i hope it's similar to mine...
[02:50] there is another bug... and a kinda big one...
[02:50] thumper: m working at the moment on "start controller machine, start another machine, remove 1 machine... oooh... both machines removed"...
[02:50] related to CMR
[02:50] so mayb our failures are similar but mayb not...
[02:50] ouch, cmr bugs are scary :)
[02:50] that I'm not sure whether it has real world impact or not
[02:50] * anastasiamac looks in direction of FTP and wallyworld :D
[02:51] huh?
[02:51] wallyworld: nothing
[02:51] thumper and i are having fun with watcher intermittent test failures :D ignore me
[02:51] wallyworld: the multiwatcher interaction with CMR is questionable
[02:53] multiwatcher does report on remote apps
[02:56] yes, but it seems more by good luck
[02:58] wallyworld: did you want to chat about caas presence at some stage?
[02:58] nah, fixed it
[02:58] just testing
[03:11] sigh, I almost tried to put my glass of water in my pocket so I could carry my muesli bar back to the office :-|
[03:12] veebers: it does get better... at least it was not a hot drink like coffee or tea
[03:12] hah ^_^ true that
[03:13] thumper: fyi, the presence fix https://github.com/juju/juju/pull/9150
[03:19] thumper: https://github.com/juju/juju/pull/9151 (one test fix) ... m chasing the 2nd one
[03:20] turns out we r just way too efficient now sometimes
[03:26] * thumper looks at both
[03:27] I'm testing a fix for mine too
[03:28] anastasiamac: I think my fix would be more appropriate
[03:28] anastasiamac: I think yours is adjusting the timing by side-effect
[03:28] the start sync method doesn't do any syncing with the underlying txn watcher
[03:28] * thumper sighs
[03:28] mine just failed too
[03:29] FFS
[03:29] I made the race much smaller... but it is still there
[03:29] * thumper thinks some more
[03:30] testing async code is hard...
[03:30] thumper: k, m chasing the 2nd failure... m sure that the 1st failure is not with code but with test setup..
[03:30] thumper: hence, the sync felt appropriate
[03:30] the StartSync doesn't do anything for the JujuConnSuite
[03:31] except poking the presence worker
[03:31] * thumper thinks
[03:31] and something else
[03:31] * thumper goes to look at the something else
[03:31] thumper: k
[03:32] pingBatcher
[03:32] thumper: what about it?
[03:32] that is the other thing StartSync pokes
[03:32] presenceWatcher and pingBatcher
[03:32] nothing to do with the normal watchers
[03:33] thumper: right. so the first failure was because we were creating a machine, setting harvest mode and removing, in the hope that harvest mode would be respected... occasionally, and now more often, harvest mode was not set when we came to remove... hence we failed...
[03:33] * thumper nods
[03:33] thumper: as soon as sync was added before removal, the failure disappeared
[03:34] but that was just due to a change in timing
[03:34] if you added sleep 10ms it would probably do the same
[03:34] we work really hard to have workers work asynchronously
[03:34] then want control in tests
[03:35] thumper: k... can we ho?
[03:35] sure
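
(A minimal, self-contained sketch of the racy test shape thumper describes above - do something, do something, start watcher, expect both changes - together with the kind of explicit wait that removes the race rather than shifting the timing, which is his objection to leaning on StartSync. Every name here - fakeState, apply, snapshot, the test names - is invented for illustration; this is not the actual juju test code.)

    package watcherrace

    import (
        "sync"
        "testing"
        "time"
    )

    // fakeState stands in for model state that background workers update
    // asynchronously, the way juju workers do.
    type fakeState struct {
        mu     sync.Mutex
        events []string
    }

    // apply queues a change that only becomes visible after a worker-side delay.
    func (s *fakeState) apply(event string, delay time.Duration) {
        go func() {
            time.Sleep(delay)
            s.mu.Lock()
            defer s.mu.Unlock()
            s.events = append(s.events, event)
        }()
    }

    // snapshot plays the role of starting the watcher: it only sees changes
    // that have already been processed.
    func (s *fakeState) snapshot() []string {
        s.mu.Lock()
        defer s.mu.Unlock()
        return append([]string(nil), s.events...)
    }

    // TestRacyShape mirrors the failing pattern: the expectation only holds if
    // the second change happens to be processed before the snapshot is taken.
    func TestRacyShape(t *testing.T) {
        st := &fakeState{}
        st.apply("add machine", time.Millisecond)
        st.apply("set harvest mode", 5*time.Millisecond)
        if got := st.snapshot(); len(got) != 2 {
            t.Logf("intermittent: snapshot saw %d of 2 changes", len(got))
        }
    }

    // TestWithExplicitWait removes the race by waiting until the second change
    // is observable before the "watcher" starts, instead of relying on timing
    // side effects.
    func TestWithExplicitWait(t *testing.T) {
        st := &fakeState{}
        st.apply("add machine", time.Millisecond)
        st.apply("set harvest mode", 5*time.Millisecond)
        deadline := time.Now().Add(time.Second)
        for len(st.snapshot()) < 2 && time.Now().Before(deadline) {
            time.Sleep(time.Millisecond)
        }
        if got := st.snapshot(); len(got) != 2 {
            t.Fatalf("changes not processed within deadline: %v", got)
        }
    }
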
[04:10] wallyworld, kelvinliu__: any idea what might cause the error "pod has unbound PersistentVolumeClaims"?
[04:11] if the underlying volume cannot be created
[04:13] wallyworld: ok, so I did create-storage-pool, is it likely something aws related? Perhaps previously storage bits weren't cleaned up?
[04:13] new volumes are created on demand
[04:13] did you deploy the aws-integrator?
[04:13] and used juju trust?
[04:14] kubectl get all,pv,pvc
[04:14] will show status of volumes and claims
[04:14] veebers, is the aws-integrator/0 in active status?
[04:14] wallyworld: ah hah, right, no I didn't do the juju trust part. So this is the failure I was expecting to see, right?
[04:15] yup
[04:15] that's what should be surfaced in juju status
[04:15] wallyworld: ok cool. I'm not seeing it surfaced, need to debug why
[04:26] veebers: fyi, the "lost" status fix just landed
[04:28] wallyworld: yay, thanks!
[04:41] gah, why is "Running machine config. script" taking so damn long in aws/ap-southeast-1? It would be quicker to deploy locally in lxd :-|
[04:55] wallyworld: (sorry to pester) Am I reading this right, in that the storage error is stopping the operator pod from being deployed and thus updateStateUnits et al. won't be in operation? (https://pastebin.canonical.com/p/rQrDsx7KRM/)
[04:57] yes, that's right. but you should be able to deploy the operator without any storage unless there's a bug
[04:58] if there is a bug and the operator does need storage, you could always create the mariadb storage pool with a dud provisioner
[04:58] that should induce an error in deploying the mariadb unit
[05:02] wallyworld: hmm, so I had created both operator-storage and k8s-ebs before deploying mariadb (still without having run juju trust for the k8s cluster)
[05:02] when you run juju trust the storage will come good and things will provision
[05:03] so you can create a new, different storage pool with a dud provisioner
[05:03] wallyworld: right, but the intention is to be able to surface the fact that the storage is borked, right?
[05:03] and deploy a new mariadb with an alias using that dud pool
[05:04] ah shoot, I also (somehow) misspelled the image path (caas-operator-image-path=veebers/caas-operator...) :-\
[05:04] that would explain things a bit
[05:04] you shouldn't need to create a storage pool for the operator
[05:04] hence you can leave off the trust step
[05:04] and the operator will deploy
[05:04] but, it's not trying to install that as far as I can tell. At any rate, I'll fix that and re-deploy
[05:04] and the app itself will fail
[05:05] ack
[05:07] argh, it's still complaining about storage with the proper image url
[05:08] wallyworld: does that suggest a bug where juju is putting storage constraints on the operator pod that shouldn't be there?
[05:09] i'd have to see the error. but you can deploy the operator with storage and poison the app storage pool to get by
[05:11] wallyworld: deploy op with storage as in run 'juju trust aws-integrator'?
[05:11] yeah
[05:11] ack ok, cheers
[05:11] just set up the mariadb storage pool with a typo in the provisioner attribute
[05:12] ah right, ack, will do
[05:12] I've just enabled trust, waiting for the scheduling to succeed
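
(A sketch of the "dud provisioner" trick wallyworld suggests for surfacing the storage error without waiting on real provisioning. The pool name, application alias and the deliberately wrong provisioner value below are invented; the command forms match the ones used earlier in the log.)

    # pool whose provisioner is intentionally misspelled, so claims can never bind
    juju create-storage-pool dud-ebs kubernetes storage-class=juju-ebs storage-provisioner=kubernetes.io/aws-ebs-typo parameters.type=gp2
    # deploy another copy of the app under an alias, using the dud pool
    juju deploy cs:~wallyworld/mariadb-k8s mariadb-dud --storage database=10M,dud-ebs
    # compare what kubernetes reports for the volumes and claims...
    kubectl get all,pv,pvc
    # ...with what juju surfaces (the part veebers is checking)
    juju status
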
[08:22] jam: I just want to amend some stuff in here before we merge https://github.com/juju/juju/pull/9148#
[09:09] stickupkid: sorry about that, I had 2 PRs up and accidentally submitted the wrong one
[09:10] jam: haha, you can merge away now :p
[09:35] Need a review: https://github.com/juju/juju/pull/9153
[09:35] Small change, with easy QA.
[09:46] manadart: looking
[09:48] stickupkid: Ta.
[09:48] manadart: done
[09:49] Cheers.
[13:24] stickupkid: As discussed - https://github.com/juju/juju/pull/9155
[13:28] manadart: nice, will have a look now
[22:28] babbageclunk: have you tried bootstrapping lately?
[22:29] wallyworld: not today
[22:29] wallyworld: y?
[22:29] since late yesterday it's hung for me
[22:30] just wondering if it's just me
[22:31] wallyworld: in aws I see "Running machine config. script" take *ages*
[22:31] for me on aws or lxd it just hangs at that point
[22:37] wallyworld: ok, having a go myself after pushing this change.
[22:38] ok, let's see how it goes
[22:45] wallyworld, babbageclunk: I got a successful bootstrap, took almost 40 minutes
[22:45] crazy
[22:46] there's got to be something that's changed. it could be a slow apt-get of image updates or mongo or something
[22:46] maybe cloud-init taking a while when it's apt installing?
[22:46] heh
[22:46] wallyworld: I've worked out this bug, but would like to talk to you if you have a chance
[22:46] sure, give me 5
[22:46] otp
[22:49] ack
[23:01] wallyworld: actually, never mind
[23:01] thumper: sorry, still in 1:1
[23:13] wallyworld: bootstrap was about normal speed for me
[23:13] damn ok
[23:14] babbageclunk: where were you bootstrapping to?
[23:15] into? at? into probably
[23:15] veebers: localhost.
[23:15] I'll try aws
[23:15] ooh, meeting
[23:25] wallyworld: this one is for you https://github.com/juju/juju/pull/9156
[23:25] ok, will look after standup
[23:27] babbageclunk: which region did you bootstrap aws in? I'm using ap-southeast-1
[23:28] I used ap-southeast-2
[23:35] thumper: lgtm, nice pickup
[23:37] wallyworld: took me a while because I had the assumption that the initial state was wrong and we weren't waiting
[23:37] seems obvious now
[23:37] but it was: the initial state was right, and a subsequent update fubared it
[23:37] always is, after the fact
[23:37] sure is