[00:43] <veebers> anastasiamac: can do :-)
[00:48] <veebers> anastasiamac: LGTM
[00:50] <veebers> wallyworld: I'm seeing "agent lost, see 'juju show-status-log mariadb/0'". You mentioned this last week but I can't recall the context. Is this something new? I wasn't seeing it earlier (before I rebased develop onto my branch)
[00:51] <wallyworld> veebers: the new presence implementation breaks k8s status. it's something i need to fix
[00:51] <veebers> wallyworld: ack, ok that makes sense that I'm seeing it then :-) As long as I know the reason I'm comfortable that I'm sane (ish)
[00:52] <veebers> wallyworld: in other news: https://pastebin.canonical.com/p/tWCjdgMCH7/ (re: units being terminated)
[00:53] <wallyworld> veebers: good, so that means we don't need to explicitly set the status in the api facade
[00:53] <veebers> indeed
[00:54] <hloeung> is it safe to upgrade juju from 2.3.8 to 2.4.3?
[01:43] <anastasiamac> wow, so provisioner unit tests now started failing as frequently as 1 out of 4 times... working thru these
[01:53] <anastasiamac> hloeung: fwiw, yes it should be safe. if u know otherwise, please let us know :)
[01:53] <hloeung> ok, will let you know when I get to upgrading our CI environment. Thanks
[01:54] <thumper> hloeung: there are no known issues upgrading from 2.3.8 to 2.4.3
[01:56] <hloeung> ack, thanks
[02:20] <thumper> babbageclunk: is your recent fix for the intermittent timeout in the raft worker test?
[02:26] <veebers> wallyworld: I'm confused, I'm trying to deploy cs:~wallyworld/caas-mariadb (without doing the 'juju trust aws-integrator' step to create an error). With "kubectl -n message log -f juju-operator-mariadb-744bb855-vtvbd" I see no complaints; with juju debug-log -m controller I see "ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: resource name may not be empty" every
[02:27] <veebers> 4 seconds. I must be missing something obvious
[02:28] <wallyworld> veebers: did you use the --storage deploy arg?
[02:28] <wallyworld> the way it is erroring is a bug also
[02:28] <veebers> wallyworld: aye, "ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: resource name may not be empty"
[02:28] <veebers> sorry, juju deploy cs:~wallyworld/mariadb-k8s --storage database=10M,k8s-ebs
[02:29] <wallyworld> did you create the storage pool?
[02:29] <wallyworld> juju create-storage-pool k8s-ebs kubernetes storage-class=juju-ebs storage-provisioner=kubernetes.io/aws-ebs parameters.type=gp2
[02:29] <veebers> yep, as per discourse post
[02:30] <wallyworld> maybe there's a bug if operator storage is missing
[02:30] <wallyworld> juju create-storage-pool operator-storage kubernetes storage-class=juju-operator-storage storage-provisioner=kubernetes.io/aws-ebs parameters.type=gp2
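Putting wallyworld's commands together, the full pre-deploy sequence being discussed would look something like this sketch (pool names, provisioner settings and the charm URL are taken from the conversation; it assumes a bootstrapped controller with a k8s model selected, so it is not runnable standalone):

```shell
# Storage pool for the operator itself, plus one for the charm's data storage
# (kubernetes.io/aws-ebs provisioner with gp2 volumes, as given above).
juju create-storage-pool operator-storage kubernetes \
    storage-class=juju-operator-storage \
    storage-provisioner=kubernetes.io/aws-ebs parameters.type=gp2
juju create-storage-pool k8s-ebs kubernetes \
    storage-class=juju-ebs \
    storage-provisioner=kubernetes.io/aws-ebs parameters.type=gp2

# Deploy referencing the data pool, as veebers ran it.
juju deploy cs:~wallyworld/mariadb-k8s --storage database=10M,k8s-ebs
```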
[02:31] <veebers> wallyworld: I'll run that now?
[02:31] <wallyworld> yeah
[02:32]  * veebers makes it so
[02:32] <wallyworld> you may need to deploy a new app though
[02:33] <veebers> ah, ack
[02:33] <wallyworld> but i expect it should work
[02:33] <wallyworld> it should bounce the worker and read the new storage pool info
[02:34] <veebers> I'm trying a new deploy, was still sing the manifold errors in logs every 4sec
[02:36]  * anastasiamac imagines veebers singing errors
[02:36] <anastasiamac> like a snake charmer
[02:36] <veebers> ugh, I'm seeing "Warning  FailedScheduling  6s (x8 over 1m)  default-scheduler  pod has unbound PersistentVolumeClaims" for the new deploy, but that's not reflected in juju status.  it's not getting pushed through the cloud container status properly it seems
[02:36] <veebers> anastasiamac: hah ^_^
[02:47] <thumper> anastasiamac: I'm looking at another intermittent test failure
[02:47] <thumper> and I think I have worked out the race in that...
[02:47] <thumper> it may be similar to yours...
[02:48] <thumper> I'm thinking through a solution
[02:48] <anastasiamac> thumper: which test and what race?
[02:48]  * thumper dabbles
[02:48] <anastasiamac> thumper: mine seems to be all in provisioner_task
[02:48] <thumper> http://10.125.0.203:8080/view/Unit%20tests/job/RunUnittests-s390x/lastCompletedBuild/testReport/github/com_juju_juju_apiserver_facades_client_client/TestPackage/
[02:48] <thumper> FAIL: client_test.go:762: clientSuite.TestClientWatchAllAdminPermission
[02:48] <thumper> the fundamental problem is the test goes:
[02:48] <thumper> do something
[02:48] <thumper> do something
[02:48] <thumper> start watcher
[02:48] <thumper> expect X changes
[02:49] <thumper> there is the expectation that the second do something had been processed before the watcher started
[02:49] <thumper> so there is a race there
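The flaky pattern thumper describes can be sketched as a toy Go program (illustrative only, not juju code: `worker`, `Write` and `Sync` are invented names). State changes are applied by a background goroutine, so a snapshot taken straight after the second "do something" races with it; a barrier like the one `StartSync` is meant to provide makes the test deterministic.

```go
package main

import "fmt"

// worker applies writes asynchronously, like a watcher backend.
type worker struct {
	in    chan interface{}
	state []string
}

// syncReq is a barrier request: the worker closes it once every
// earlier message on the channel has been processed.
type syncReq chan struct{}

func newWorker() *worker {
	w := &worker{in: make(chan interface{})}
	go func() {
		for m := range w.in {
			switch v := m.(type) {
			case string:
				w.state = append(w.state, v) // apply a change
			case syncReq:
				close(v) // all prior writes are now applied
			}
		}
	}()
	return w
}

func (w *worker) Write(s string) { w.in <- s }

// Sync blocks until every Write sent before it has been applied.
func (w *worker) Sync() {
	done := make(syncReq)
	w.in <- done
	<-done
}

func main() {
	w := newWorker()
	w.Write("do something")
	w.Write("do something else")
	// Without this barrier, reading w.state here is exactly the race:
	// the second write may or may not have been applied yet.
	w.Sync()
	fmt.Println(len(w.state)) // prints 2
}
```

As the discussion below notes, a sleep only shrinks the window; the barrier is what closes it.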
[02:49] <anastasiamac> thumper: oh k... i hope it's similar to mine...
[02:50] <thumper> there is another bug... and a kinda big one...
[02:50] <anastasiamac> thumper: m working at the moment on "start controller machine, start another machine, remove 1 machine... oooh... both machines removed"...
[02:50] <thumper> related to CMR
[02:50] <anastasiamac> so mayb our failures are similar but mayb not...
[02:50] <anastasiamac> ouch, cmr bugs are scary :)
[02:50] <thumper> that I'm not sure whether it has real world impact or not
[02:50]  * anastasiamac looks in direction of FTP and wallyworld :D
[02:51] <wallyworld> huh?
[02:51] <anastasiamac> wallyworld: nothing
[02:51] <anastasiamac> thumper and i are having fun with watcher intermittent test failures :D ignore me
[02:51] <thumper> wallyworld: the multiwatcher interaction with CMR is questionable
[02:53] <wallyworld> multiwatcher does report on remote apps
[02:56] <thumper> yes, but it seems more by good luck
[02:58] <thumper> wallyworld: did you want to chat about caas presence at some stage?
[02:58] <wallyworld> nah, fixed it
[02:58] <wallyworld> just testing
[03:11] <veebers> sigh, I almost tried to put my glass of water in my pocket so I could carry my muesli bar back to the office :-|
[03:12] <anastasiamac> veebers: it does get better... at least it was not a hot drink like coffee or tea
[03:12] <veebers> hah ^_^ true that
[03:13] <wallyworld> thumper: fyi, the presence fix https://github.com/juju/juju/pull/9150
[03:19] <anastasiamac> thumper: https://github.com/juju/juju/pull/9151 (one test fix) ... m chasing the 2nd one
[03:20] <anastasiamac> turns out we r just way too efficient now sometimes
[03:26]  * thumper looks at both
[03:27] <thumper> I'm testing a fix for mine too
[03:28] <thumper> anastasiamac: I think my fix would be more appropriate
[03:28] <thumper> anastasiamac: I think yours is adjusting the timing by side-effect
[03:28] <thumper> the start sync method doesn't do any syncing with the underlying txn watcher
[03:28]  * thumper sighs
[03:28] <thumper> mine just failed too
[03:29] <thumper> FFS
[03:29] <thumper> I made the race much smaller... but it is still there
[03:29]  * thumper thinks some more
[03:30] <thumper> testing async code is hard...
[03:30] <anastasiamac> thumper: k, m chasing the 2nd failure... m sure that the 1st failure is not with code but with test setup..
[03:30] <anastasiamac> thumper: hence, the sync felt appropriate
[03:30] <thumper> the StartSync doesn't do anything for the JujuConnSuite
[03:31] <thumper> except poking the presence worker
[03:31]  * thumper thinks 
[03:31] <thumper> and something else
[03:31]  * thumper goes to look at the something else
[03:31] <anastasiamac> thumper: k
[03:32] <thumper> pingBatcher
[03:32] <anastasiamac> thumper: what about it?
[03:32] <thumper> that is the other thing StartSync pokes
[03:32] <thumper> presenceWatcher and pingBatcher
[03:32] <thumper> nothing to do with the normal watchers
[03:33] <anastasiamac> thumper: right. so the first failure was because we were creating a machine, setting harvest mode and removing in hopes that harvest mode will b respected... occasionally, and now more often, harvest mode was not set when we came to remove... hence we failed...
[03:33]  * thumper nods
[03:33] <anastasiamac> thumper: as soon as sync was added before removal, the failure disappeared
[03:34] <thumper> but that was just due to a change in timing
[03:34] <thumper> if you added sleep 10ms it would probably do the same
[03:34] <thumper> we work really hard to have workers work asynchronously
[03:34] <thumper> then want control in tests
[03:35] <anastasiamac> thumper: k... can we HO?
[03:35] <thumper> sure
[04:10] <veebers> wallyworld, kelvinliu__ : any idea what might cause the error; pod has unbound PersistentVolumeClaims?
[04:11] <wallyworld> if the underlying volume cannot be created
[04:13] <veebers> wallyworld: ok, so I did create-storage-pool, is it likely something aws related? Perhaps previously storage bits weren't cleaned up?
[04:13] <wallyworld> new volumes are created on demand
[04:13] <wallyworld> did you deploy the aws-integrator?
[04:13] <wallyworld> and used juju trust?
[04:14] <wallyworld> kubectl get all,pv,pvc
[04:14] <wallyworld> will show status of volumes and claims
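The kubectl side of that check would look roughly like this (a sketch: the namespace matches the juju model name, and the claim name is illustrative):

```shell
# List pods, persistent volumes and claims in the model's namespace.
kubectl -n <model-namespace> get all,pv,pvc

# A claim stuck in Pending usually carries the provisioner error
# in its Events section.
kubectl -n <model-namespace> describe pvc <claim-name>
```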
[04:14] <kelvinliu__> veebers, Is the aws-integrator/0 in active status?
[04:14] <veebers> wallyworld: ah hah right, no I didn't do the juju trust part. So this is the failure I was expecting to see right?
[04:15] <wallyworld> yup
[04:15] <wallyworld> that's what should be surfaced in juju status
[04:15] <veebers> wallyworld: ok cool. I'm not seeing it surfaced, need to debug why
[04:26] <wallyworld> veebers: fyi, "lost" status fix just landed
[04:28] <veebers> wallyworld:  yay, thanks!
[04:41] <veebers> gah, why is "Running machine config. script" taking so damn long in aws/ap-southeast-1. It would be quicker to deploy locally in lxd :-|
[04:55] <veebers> wallyworld (sorry to pester) Am I reading this right in that the storage error is stopping the operator pod from being deployed and thus updateStateUnits et al. won't be in operation? (https://pastebin.canonical.com/p/rQrDsx7KRM/)
[04:57] <wallyworld> yes, that's right. but you should be able to deploy the operator without any storage unless there's a bug
[04:58] <wallyworld> if there is a bug and the operator does need storage, you could always create the mariadb storage pool with a dud provisioner
[04:58] <wallyworld> that should induce an error in deploying the mariadb unit
[05:02] <veebers> wallyworld: hmm, so I had created both operator-storage and k8s-ebs before deploying mariadb (still without having run juju trust for the k8s cluster)
[05:02] <wallyworld> when you run juju trust the storage will come good and things will provision
[05:03] <wallyworld> so you can create a new different storage pool with a dud provisioner
[05:03] <veebers> wallyworld: right, but the intention is to be able to surface the fact that the storage is borked right?
[05:03] <wallyworld> and deploy a new mariadb with an alias using that dud pool
[05:04] <veebers> ah shoot I also (somehow) misspelled the image path (caas-operator-image-path=veebers/caas-operator...) :-\
[05:04] <wallyworld> that would explain things a bit
[05:04] <wallyworld> you shouldn't need to create a storage pool for the operator
[05:04] <wallyworld> hence you can leave off the trust step
[05:04] <wallyworld> and the operator will deploy
[05:04] <veebers> but, it's not trying to install that as far as I can tell. At any rate I'll fix that and re-deploy
[05:04] <wallyworld> and the app itself will fail
[05:05] <veebers> ack
[05:07] <veebers> argh, it's still complaining about storage with the proper image url
[05:08] <veebers> wallyworld: does that suggest a bug where juju is putting storage constraints on the operator pod that shouldn't be there?
[05:09] <wallyworld> i'd have to see the error. but you can deploy the operator with storage and poison the app storage pool to get by
[05:11] <veebers> wallyworld: deploy op with storage as in run 'juju trust aws-integrator'?
[05:11] <wallyworld> yeah
[05:11] <veebers> ack ok cheers
[05:11] <wallyworld> just set up the mariadb storage pool with a typo in the provisioner attribute
[05:12] <veebers> ah right, ack will do
[05:12] <veebers> I've just enabled trust, waiting for the scheduling to succeed
[08:22] <stickupkid> jam: I just want to amend some stuff in here, before we merge https://github.com/juju/juju/pull/9148#
[09:09] <jam> stickupkid: sorry about that, I had 2 PR up, and accidentally submitted the wrong one
[09:10] <stickupkid> jam: haha, you can merge away now :p
[09:35] <manadart> Need a review: https://github.com/juju/juju/pull/9153
[09:35] <manadart> Small change, with easy Q/A.
[09:46] <stickupkid> manadart: looking
[09:48] <manadart> stickupkid: Ta.
[09:48] <stickupkid> manadart: done
[09:49] <manadart> Cheers.
[13:24] <manadart> stickupkid: As discussed - https://github.com/juju/juju/pull/9155
[13:28] <stickupkid> manadart: nice, will have a look now
[22:28] <wallyworld> babbageclunk: have you tried bootstrapping lately?
[22:29] <babbageclunk> wallyworld: not today
[22:29] <babbageclunk> wallyworld: why?
[22:29] <wallyworld> since late yesterday it's hung for me
[22:30] <wallyworld> just wondering if it's just me
[22:31] <veebers> wallyworld: in aws I see "Running machine config. script" take *ages*
[22:31] <wallyworld> for me on aws or lxd it just hangs at that point
[22:37] <babbageclunk> wallyworld: ok, having a go myself after pushing this change.
[22:38] <wallyworld> ok, let's see how it goes
[22:45] <veebers> wallyworld, babbageclunk: I got a successful bootstrap, took almost 40 minutes
[22:45] <babbageclunk> crazy
[22:46] <wallyworld> there's got to be something that's changed. it could be slow apt get of image updates or mongo or something
[22:46] <veebers> maybe cloud-init taking a while when it's apt installing?
[22:46] <veebers> heh
[22:46] <thumper> wallyworld: I've worked out this bug, but would like to talk to you if you have a chance
[22:46] <wallyworld> sure, give me 5
[22:46] <wallyworld> otp
[22:49] <thumper> ack
[23:01] <thumper> wallyworld: actually, never mind
[23:01] <wallyworld> thumper: sorry, still in 1:1
[23:13] <babbageclunk> wallyworld: bootstrap was about normal speed for me
[23:13] <wallyworld> damn ok
[23:14] <veebers> babbageclunk: where were you bootstrapping to?
[23:15] <veebers> into? at? into probably
[23:15] <babbageclunk> veebers: localhost.
[23:15] <babbageclunk> I'll try aws
[23:15] <babbageclunk> ooh, meeting
[23:25] <thumper> wallyworld: this one is for you https://github.com/juju/juju/pull/9156
[23:25] <wallyworld> ok, will look after standup
[23:27] <veebers> babbageclunk: which region did you bootstrap aws? I'm using ap-southeast-1
[23:28] <babbageclunk> I used ap-southeast-2
[23:35] <wallyworld> thumper: lgtm, nice pickup
[23:37] <thumper> wallyworld: took me a while because I had the assumption that the initial state was wrong and we weren't waiting
[23:37] <wallyworld> seems obvious now
[23:37] <thumper> but the initial state was right, and a subsequent update fubared it
[23:37] <wallyworld> always is after the fact
[23:37] <thumper> sure is