[00:49] <anastasiamac> axw_: this is on 2.0.0 - https://bugs.launchpad.net/bugs/1636634... do we recommend to upgrade? and to what version?
[00:49] <mup> Bug #1636634: azure controller becomes unusable after a few days <juju:Triaged by alexis-bruemmer> <https://launchpad.net/bugs/1636634>
[01:04] <axw_> anastasiamac: 2.0.2 has several related fixes in it
[01:06] <anastasiamac> axw_: awesome \o/ but it's not out yet AFAIK
[01:10] <axw_> anastasiamac: ah, I thought it was. why did it not get released? the binaries are all up there AFAICS
[01:12] <anastasiamac> axw_: m trying to find u to PM
[01:14]  * babbageclunk goes for a run
[03:05] <menn0> thumper: sigh... so cs:@
[03:05]  * menn0 tries again
[03:06] <menn0> thumper: so cs:~user/name style urls have always been broken
[03:06] <thumper> wat?
[03:06] <menn0> thumper: for migrations
[03:06] <thumper> ugh
[03:06] <menn0> thumper: the name field is read from the charm's metadata to reconstruct the charm URL
[03:06] <menn0> thumper: and that doesn't include the username part
[03:07] <menn0> thumper: will have to add an extra query arg to the upload endpoint
[03:07] <thumper> urgle
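
A minimal sketch (in Go, assuming gopkg.in/juju/charm.v6-unstable as the charm package of the era) of the reconstruction problem menn0 describes: a charm's metadata carries only the bare name, so rebuilding the URL from metadata alone silently drops the ~user part.

```go
package main

import (
	"fmt"

	charm "gopkg.in/juju/charm.v6-unstable"
)

func main() {
	original := charm.MustParseURL("cs:~someuser/xenial/mycharm-1")

	// The charm's metadata.yaml only knows the bare name.
	nameFromMetadata := original.Name // "mycharm"

	// Reconstructing the URL from metadata alone loses ~someuser,
	// which is why an extra query arg on the upload endpoint is
	// needed to carry the user through a migration.
	rebuilt := &charm.URL{
		Schema:   "cs",
		Name:     nameFromMetadata,
		Series:   original.Series,
		Revision: original.Revision,
	}
	fmt.Println(original) // cs:~someuser/xenial/mycharm-1
	fmt.Println(rebuilt)  // cs:xenial/mycharm-1 -- user is gone
}
```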
[03:15] <menn0> thumper: ignore that problem for now
[03:16] <thumper> I keep adding debug
[03:16] <thumper> and presence is looking more and more crackful
[03:16] <menn0> thumper: now testing resource migration with the cs:etcd charm which also uses a resource
[03:16] <menn0> thumper: it's so hairy I wouldn't be surprised if there's a bug hiding in there
[03:20] <thumper> menn0: interestingly though, the precheck machine presence failure finds the controller machine down
[03:20] <thumper> which is somewhat ironic
[03:20] <thumper> because that is the machine talking to the client
[03:20] <thumper> it isn't the model agents down
[03:20] <menn0> so it's clearly wrong
[03:25] <thumper> hmm
[03:27] <thumper> menn0: apparently not
[03:27] <thumper> precheck.go : 110
[03:28] <thumper> menn0: why do we check all the machines of the controller?
[03:28] <menn0> thumper: my line counts are different to yours it seems
[03:28] <thumper> 208?
[03:28] <thumper> check that one
[03:28] <thumper> checkController
[03:28] <thumper> calls checkMachines
[03:28] <thumper> which checks all the machines
[03:28] <thumper> now obviously there is still a problem
[03:28] <thumper> because the machine is clearly not down
[03:29] <menn0> thumper: b/c the migrationmaster and API server are on the controllers
[03:29] <thumper> sure, but if there are any workloads deployed in the controller, we clearly don't care
[03:29] <thumper> only care about the apiserver machines
[03:29] <thumper> so if we have one apiserver machine down
[03:30] <thumper> we can't migrate off?
[03:31] <menn0> thumper: point taken... the check is probably overly cautious
[03:31] <menn0> thumper: my thinking was that we want a good place to start from
[03:31] <thumper> yeah
[03:31] <menn0> thumper: but that's probably unnecessary
[03:31] <thumper> I'll keep looking at this failure
[03:31] <menn0> thumper: feel free to remove that check
[03:31] <thumper> it is clearly wrong
[03:31] <thumper> but unclear as to why it thinks the machine is down
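
A hypothetical sketch of the narrower precheck thumper is arguing for — gate migration only on machines that actually host the API server, not every machine in the controller model. PrecheckMachine, IsAPIServer, and AgentAlive are invented names, not the real precheck.go types.

```go
package prechecks

import "fmt"

// PrecheckMachine is a stand-in for the real precheck machine type.
type PrecheckMachine interface {
	Id() string
	IsAPIServer() bool
	AgentAlive() (bool, error)
}

// checkAPIServerMachines only requires the API server machines to be
// up; workloads deployed into the controller model don't block a
// migration off that controller.
func checkAPIServerMachines(machines []PrecheckMachine) error {
	for _, m := range machines {
		if !m.IsAPIServer() {
			continue // workload-only machine: irrelevant to migration
		}
		alive, err := m.AgentAlive()
		if err != nil {
			return err
		}
		if !alive {
			return fmt.Errorf("API server machine %s is down", m.Id())
		}
	}
	return nil
}
```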
[03:36] <thumper> menn0: this is just a timing bug
[03:36] <thumper> menn0: sometimes the presence main loop hasn't done a sync fully before the request for alive comes through
[03:36] <thumper> in those cases, it finds the machine down
[03:37] <thumper> the reason we see different results for status
[03:37] <thumper> is that the full status call uses state connections from the pool
[03:37] <thumper> and the migration endpoint does not
[03:37] <thumper> it creates new state instances
[03:37] <thumper> where the presence isn't fully synced before we ask it if the entities are alive
[03:38] <thumper> AFAICT
[03:38] <menn0> thumper: that makes a whole lot of sense
[03:38] <thumper> yeah, because this was as confusing as hell
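
A sketch of the fix direction implied by the diagnosis above, with entirely hypothetical names (WaitForInitialSync is not a real state/presence API): a freshly created State owns a freshly started presence watcher, and querying liveness before its first full sync completes reports agents as down, so block on that sync first.

```go
package presence

// Watcher and State are hypothetical stand-ins for the real types.
type Watcher interface {
	WaitForInitialSync() error // block until one full sync has run
	Alive(tag string) (bool, error)
}

type State struct {
	presence Watcher
}

// agentAlive forces an initial presence sync before asking whether
// the agent is alive, avoiding the false "down" seen by endpoints
// that create fresh State instances instead of using the pool.
func agentAlive(st *State, tag string) (bool, error) {
	if err := st.presence.WaitForInitialSync(); err != nil {
		return false, err
	}
	return st.presence.Alive(tag)
}
```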
[03:39] <menn0> thumper: I just found another migration issue
[03:39] <thumper> yet another?
[03:39] <menn0> thumper: the resource migration code needs to handle placeholder resources
[03:40] <menn0> thumper: this is where the resource is defined but hasn't been up/downloaded yet
[03:40] <menn0> thumper: I just ran into that with the etcd charm
[03:41] <thumper> hmm...
[03:41] <thumper> or precheck?
[03:44] <menn0> thumper: no that won't work. it's perfectly normal for a resource to be a placeholder.
[03:44] <menn0> thumper: i'll deal with it.
[03:44] <thumper> oh
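
A hypothetical sketch of the placeholder handling menn0 intends; the Resource and Uploader types are invented, though the zero-timestamp convention matches the description above (a resource defined in charm metadata but never up/downloaded has no content to transfer).

```go
package migration

import "time"

// Resource is a stand-in for the real resource type.
type Resource struct {
	Name      string
	Timestamp time.Time // zero => placeholder, never up/downloaded
}

// Uploader is a stand-in for the target-controller upload client.
type Uploader interface {
	UploadResource(Resource) error
}

// migrateResources skips placeholders: there is nothing to transfer,
// and the target model recreates the placeholder from the charm's
// metadata.
func migrateResources(resources []Resource, target Uploader) error {
	for _, res := range resources {
		if res.Timestamp.IsZero() {
			continue // placeholder: nothing to transfer
		}
		if err := target.UploadResource(res); err != nil {
			return err
		}
	}
	return nil
}
```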
[04:16] <natefinch> gah, the maas api is so weird
[05:05] <babbageclunk> natefinch: true that
[05:09] <natefinch> babbageclunk: mostly it's the gomaasapi package that I'm complaining about... it gives some really weird errors
[05:10] <babbageclunk> natefinch: well, that's ours - you should be complaining to thumper!
[05:10] <natefinch> well, I think it was inherited
[05:10] <babbageclunk> natefinch: I mean, I did (little) bits of it too.
[05:12] <babbageclunk> natefinch: oh yeah - I'm thinking about the new stuff for talking to maas 2. There's all that weird bit for the 1.0 api.
[05:16] <natefinch> the error messages are just written in such a way that you know no one ever actually expected anyone to read the errors
[05:18] <natefinch> like, if you try to connect to a server that responds to <endpoint>/version with something unexpected, you get the error "Requested Map, got <nil>"  .... uh.... what?
[05:29] <natefinch> so like.. I'm stuck - do I hide the crazy maas errors entirely, which may hide some genuinely useful info?  Or do I show the crazy errors?  Or do I recreate some of the logic externally so I can catch some reasonably common errors, like typoing IP addresses or something, and return an actual reasonable error message?
[05:36] <babbageclunk> can someone review this formatting fix please? https://github.com/juju/juju/pull/6627
[05:36] <babbageclunk> The bad formatting is stopping me from pushing
[05:36] <natefinch> I almost forgot parens around if statements was a thing
[05:37] <natefinch> ship it
[05:37] <babbageclunk> natefinch: ta!
[05:40] <babbageclunk> natefinch: I guess the right thing to do is to fix the gomaasapi error handling? It shouldn't be letting JSON "traversal" (I think?) errors get up to the client like that.
[05:41] <babbageclunk> natefinch: Although I recognise that might be a much bigger task. Maybe just a targeted fix in the place you're hitting now?
[05:41] <natefinch> yeah, I was thinking that
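
A sketch of what such a targeted fix might look like, assuming github.com/juju/errors for annotation; checkAPIVersion and the getVersion callback are hypothetical, standing in for the gomaasapi call whose JSON traversal failure surfaces as "Requested Map, got <nil>".

```go
package maas

import (
	"github.com/juju/errors"
)

// checkAPIVersion annotates the opaque traversal error at the point
// we hit it, so the user learns what actually went wrong (e.g. a
// typoed address that isn't a MAAS endpoint at all).
func checkAPIVersion(getVersion func() (map[string]interface{}, error), endpoint string) (map[string]interface{}, error) {
	v, err := getVersion()
	if err != nil {
		return nil, errors.Annotatef(err,
			"unexpected response from %s/version; is the endpoint really a MAAS server?",
			endpoint)
	}
	return v, nil
}
```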
[07:01] <mup> Bug #1644331 opened: juju-deployer failed on SSL3_READ_BYTES <oil> <uosci> <juju-core:Triaged> <juju-deployer:New> <OPNFV:New> <https://launchpad.net/bugs/1644331>
[09:21] <voidspace> frobware: did you see you had tests fail on your merge this morning?
[09:21] <voidspace> frobware: PR 6618
[09:21] <frobware> voidspace: I did.
[09:21] <voidspace> frobware: cool, just checking you'd seen
[09:22] <voidspace> in other news I got ordained last week
[09:22] <voidspace> frobware: I'm now a priest of the church of the latter day dude
[09:22] <voidspace> frobware: http://dudeism.com/
[09:25] <frobware> voidspace: :)
[09:29] <frobware> voidspace: congrats!
[09:40] <frobware> voidspace: it's difficult to tell whether it is just my PR or a general failure.
[09:40] <frobware> voidspace: macgreagoir was having trouble too
[09:42] <macgreagoir> I see tests pass but lxd deployment fails, I think. I'm trying some local lxd testing to see if it's a branch problem.
[09:42] <macgreagoir> (If I can get some disk space.)
[09:50] <mgz> mornin' all
[09:59] <macgreagoir> \o mgz
[10:04] <voidspace> mgz: o/
[10:38] <jam> mgz: do you know what is up with the merge juju check failing on https://github.com/juju/juju/pull/6620 ?
[10:40] <mgz> jam: let me have a look
[10:42] <mgz> lxd failed with "x509: certificate signed by unknown authority"
[10:42] <mgz> trying to talk back to the api server
[10:48] <jam> mgz: any chance the LXD containers on that machine are bridged onto the hosts network and the bug about LXD provider using gateway as its host machine is acting up?
[10:50] <mgz> jam: so, a run afterwards passed
[10:50] <mgz> so, it's either something intermittent or the branch really had an effect
[11:26] <perrito666> fun way to start the morning, my machine would not boot and while trying to fix it I un-sudoed myself
[11:26]  * perrito666 downloads livecd and goes get a coffee
[11:27] <voidspace> perrito666: fun way to start the day...
[11:27] <perrito666> voidspace: and the week
[11:27] <perrito666> that must be worth something
[11:28] <voidspace> perrito666: heh
[12:11] <perrito666> bbl, errands
[12:15] <rick_h> mgz: ping
[12:16] <rick_h> mgz: can you please pair up with voidspace and help look at how the testing is setup/working on this OIL problem and see if there's anything that jumps out to you about why it works with 2.0.1 and fails with the version bump commit right after?
[12:17] <mgz> rick_h: heya
[12:18] <mgz> sure, voidspace, maybe we do hangout and hang out?
[12:31] <rick_h> voidspace: ping for standup
[12:35] <voidspace> rick_h: sorry, omw
[12:38] <voidspace> mgz: yep, cool
[13:03] <voidspace> mgz: ok, this time I see tags/juju-2.0.1 with a custom version number failing
[13:03] <voidspace> mgz: which makes more sense to me
[13:03] <voidspace> mgz: because then the problem is consistent - use a non-standard stream version in this environment and it fails
[13:04] <voidspace> mgz: trying again with this version and then trying vanilla 2.0.1 to confirm
[13:04] <voidspace> mgz: and then writing it up
[13:06] <mgz> voidspace: ace, thanks
[14:02] <voidspace> mgz: I'm now seeing the same failure with vanilla 2.0.1
[14:02] <voidspace> mgz: so having to repeat
[14:03] <voidspace> mgz: vanilla 2.0.1 worked for me earlier today
[14:03] <voidspace> mgz: if it continues to fail I will try 2.0.2 and if the failure mode is the same then I will conclude that I am fully unable to reproduce the problem
[14:03] <mgz> voidspace: I feel this repro is just not reliable enough...
[14:04] <voidspace> mgz: I'll send you the email I *was* going to send you when I thought vanilla 2.0.1 would work
[14:04] <voidspace> mgz: I think you might be right
[14:07] <voidspace> rick_h: I have updated bug 1642609 and continue to work on it
[14:07] <mup> Bug #1642609: [2.0.2] many maas nodes left undeployed when deploying multiple models simultaneously on single controller <oil> <oil-2.0> <regression> <juju:Triaged by mfoord> <https://launchpad.net/bugs/1642609>
[14:08] <mup> Bug #1645729 opened: environment unstable after 1.25.8 upgrade <juju-core:New> <https://launchpad.net/bugs/1645729>
[14:31] <rick_h> voidspace: ty /me looks
[14:32] <voidspace> rick_h: I am right in the *middle* of another vanilla 2.0.1 deploy
[14:32] <voidspace> rick_h: which I'm sure yesterday worked fine and today the last one just failed in the same way as the custom versions fail for me
[14:32] <rick_h> voidspace: k, can we run this test on other hardware?
[14:32] <voidspace> rick_h: so I am almost back to knowing nothing I think
[14:32] <rick_h> voidspace: can we test it on a public cloud, or another MAAS?
[14:33] <voidspace> rick_h: it's 50 odd machines
[14:33] <rick_h> voidspace: see if we can isolate it to the OIL hardware or something?
[14:33] <voidspace> rick_h: I can try it on a public cloud
[14:33] <voidspace> rick_h: I can't do it on my maas
[14:33] <rick_h> voidspace: k, at 50 machines we'll need big credentials I think.
[14:33] <rick_h> voidspace: let me know and I can get you some gce creds that might work
[14:33]  * rick_h hasn't tried 50 machines on there yet
[14:34] <voidspace> hah
[14:34] <rick_h> voidspace: can you shoot me the instructions for replicating please?
[14:34] <rick_h> voidspace: I'd like to see how involved it is
[14:39] <voidspace> rick_h: sent
[14:39] <voidspace> rick_h: what's the current state of the art encrypted messaging service - is it still telegram or something else
[14:40] <rick_h> voidspace: yea, telegram is the usual thing we use at sprints/etc
[14:42] <voidspace> rick_h: must install that before we leave for the sprint
[14:43] <voidspace> rick_h: mgz: vanilla 2.0.1 failed for me a second time, now retrying (again) 2.0.2 to check the failure mode is the same
[14:43] <voidspace> rick_h: (6 models, 49 machines, 63 applications)
[14:43] <rick_h> voidspace: k
[14:44] <voidspace> even tearing down the environment takes time
[14:45] <rick_h> voidspace: rgr, I'd like to talk to larry on this I think.
[14:45] <voidspace> rick_h: yep
[14:45] <voidspace> rick_h: I was going to copy him in on that email I sent martin as I assumed 2.0.1 would *work* and then I would have some actual data
[14:46] <voidspace> rick_h: as it is I am back to having no useful data I don't think
[14:46] <voidspace> rick_h: other than maybe that the repro I have been given isn't reliable
[14:46] <rick_h> voidspace: rgr, let's hold off atm
[14:46] <rick_h> voidspace: take a break off it for a bit while we sort out some cross team bits I think
[14:47] <rick_h> mgz: please let me know if anything there's looked fishy
[14:47] <mgz> I'm reading over the details now
[14:47] <voidspace> rick_h: ok, I have a 2.0.2 deploy in progress I will leave running
[14:47] <rick_h> voidspace: rgr
[14:50] <mgz> we really only have two candidate changes in the 2.0.1 to 2.0.2 range
[14:51] <mgz> pr #6537 (bump gomaasapi)
[14:51] <mgz> pr #6527 (constrain to 3.5GB mem for controllers)
[14:51] <rick_h> voidspace: do we have the ability to track controller metrics, cpu/ram/etc during the deploy?
[14:52] <mgz> and that second one would need to be some weird maas machine selection problem to be relevant
[14:53] <rick_h> voidspace: if you have a sec can you jump in the standup early
[14:53] <voidspace> rick_h: yep, coming now
[15:42]  * frobware first attempt at bridging only single interfaces ... does not work. :(
[15:43] <frobware> boo
[15:43] <frobware> though this may only be related to bonds.
[15:43] <rick_h> doh
[15:44] <frobware> rick_h: the behaviour is different to what we had before.
[15:45] <frobware> rick_h: and it could be MAAS 2.1 specific.
[15:45] <frobware> so many permutations. so few automated baselines. :(
[15:46]  * frobware wanders off to see if there's any chocolate in the house.
[15:49] <frobware> rick_h: a little pricey, but dual NICs with vPro - http://www.logicsupply.com/uk-en/mc500-51/
[15:51] <rick_h> frobware: ping Mike and Andres on a hardware suggestion and will OK it
[15:51] <frobware> rick_h: are you saying definitely not that one? or just choose what Mike+A already use?
[15:52] <rick_h> I think asking what they use/test Maas on seems like a good way to go about it and make sure it's something that will work on Maas.
[15:52] <frobware> rick_h: ok
[15:52] <rick_h> frobware: ^
[15:53] <frobware> rick_h: I'm wondering if my bond issues are a vMAAS issue only. If I manage to ssh in and run `ifdown br-bond0; ifup br-bond0` it springs into life.
[15:53] <frobware> rick_h: but that's the exact same sequence that has just run - and you do see sensible values for routes, configured interfaces, addresses, et al.
[15:56] <rick_h> frobware: k, can we put together a test case and see if we can get help verifying it with someone with real hardware?
[16:01] <frobware> rick_h: can do. just need to clean stuff up. will ping mike and andres in the meantime.
[18:23] <alexisb> perrito666, ping
[18:26] <perrito666> alexisb: pong
[20:32] <menn0> thumper: this is last night's work. it undoes an early design decision regarding resource migrations: https://github.com/juju/juju/pull/6628
[20:33] <thumper> will look
[20:34] <menn0> thumper: cheers
[20:40] <thumper> menn0: looks good
[20:41] <menn0> thumper:  thanks
[20:41] <thumper> menn0: https://github.com/juju/juju/pull/6629
[20:41]  * menn0 looks
[20:49] <menn0> thumper: done
[20:50] <thumper> ta
[20:52] <babbageclunk> menn0, thumper: could you take another look at https://github.com/juju/juju/pull/6622 please?
[21:12] <menn0> babbageclunk: forgot to say, ship it
[21:13] <menn0> babbageclunk: with a couple of comments
[21:13] <babbageclunk> menn0: thanks! looking now
[21:14] <menn0> babbageclunk: once this lands can you also send an email to veebers and torsten about this being ready for the planned CI test?
[21:14] <babbageclunk> menn0: sure
[21:17] <menn0> babbageclunk: thinking about it, is ShortWait enough time for the goroutine to start waiting on the clock?
[21:17] <menn0> babbageclunk: that seems like a flaky test waiting to happen
[21:18] <menn0> babbageclunk: I think you need to wait up to LongWait
[21:18] <babbageclunk> menn0: ShortWait's 50ms - that should be *heaps* of time for it to catch up, shouldn't it?
[21:19] <babbageclunk> menn0: Ok, I'll bump it up to be on the safe side.
[21:19] <menn0> babbageclunk: we see things that *should* happen in ms take seconds on overcommitted test machines all the time
[21:19] <menn0> babbageclunk: LongWait is the time we wait for things that *should* happen
[21:20] <menn0> babbageclunk: if you change the wait to LongWait (10s) then WaitAdvance will take at least 1s each call
[21:20] <babbageclunk> menn0: Even with LongWait the test still takes 0s - I guess in the usual case on a not-really-loaded machine the other side's already waiting.
[21:20] <menn0> babbageclunk: I think WaitAdvance needs to be changed so that pause is a fixed amount
[21:20] <menn0> instead of w / 10
[21:21] <menn0> maybe pause should just be ShortWait
[21:21] <menn0> that would remove my concern about WaitAdvance blowing out test times
[21:21] <menn0> b/c it'll finish within ShortWait of the correct number of waiters turning up
[21:21] <babbageclunk> menn0: Yeah, I think so too - I had an idea about how to do it without polling but I haven't had a chance to try it out. We can talk about it at the sprint.
[21:22] <menn0> babbageclunk: sounds good. a non polling approach would be preferable
[21:24] <babbageclunk> menn0: Ooh, I thought it was racy but a nice tweak just occurred to me - if notifyAlarms was a chan int and always got the number of waiters (instead of a struct{}{}) that might do it.
[21:25] <menn0> babbageclunk: interesting... worth playing with
[21:53] <thumper> the test clock alarms channel was designed exactly to let a test wait on the signal that showed that something had called clock.After()
[21:59] <alexisb> perrito666, I am running a little late
[22:00] <perrito666> no worries, I am still logging in
[22:06] <mup> Bug #1644331 changed: juju-deployer failed on SSL3_READ_BYTES <deployer> <oil> <python> <uosci> <OpenStack Charm Test Infra:Confirmed> <juju:Won't Fix> <juju-core:Won't Fix> <juju-deployer:New> <OPNFV:New> <https://launchpad.net/bugs/1644331>
[22:07] <thumper> menn0: https://github.com/juju/juju/pull/6631
[22:16] <babbageclunk> thumper: yeah, but you can't rely on the fact that there's a message on the alarms channel to mean there's something waiting, because multiple waiters can be removed with one advance.
[22:17] <thumper> babbageclunk: ok
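
A sketch of babbageclunk's chan-int idea under babbageclunk's own caveat above; this is an illustration, not the real juju testing clock, and the alarm registration and Advance bookkeeping are elided. Because every After call reports the updated waiter count, WaitAdvance can block until the right number of waiters exists instead of polling in pause = w/10 slices.

```go
package testclock

import (
	"sync"
	"time"
)

// Clock is a minimal test clock whose notifyAlarms channel carries
// the current waiter count (rather than struct{}{}). A real
// implementation would also decrement waiters as Advance fires alarms.
type Clock struct {
	mu           sync.Mutex
	now          time.Time
	waiters      int
	notifyAlarms chan int
}

func NewClock(start time.Time) *Clock {
	return &Clock{now: start, notifyAlarms: make(chan int, 1024)}
}

func (c *Clock) After(d time.Duration) <-chan time.Time {
	c.mu.Lock()
	c.waiters++
	n := c.waiters
	c.mu.Unlock()
	c.notifyAlarms <- n // report the new waiter count
	// Registering the alarm so Advance can fire it is elided.
	return make(chan time.Time, 1)
}

// WaitAdvance blocks until n goroutines are waiting on the clock (or
// the timeout expires); the caller would then advance time by d.
// No polling loop, and it finishes as soon as the count is reached.
func (c *Clock) WaitAdvance(d, timeout time.Duration, n int) bool {
	deadline := time.After(timeout)
	for {
		select {
		case count := <-c.notifyAlarms:
			if count >= n {
				return true // advance-by-d logic elided
			}
		case <-deadline:
			return false
		}
	}
}
```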
[22:23] <menn0> thumper: ship it
[22:24] <thumper> menn0: what about a later check?
[22:24] <thumper> menn0: it is possible that during initiation a hook may be executing
[22:24] <thumper> which may then put the charm into a failed state
[22:24] <thumper> I'm trying to remember
[22:24] <thumper> is that OK?
[22:24] <thumper> I *think* it is...
[22:27] <alexisb> menn0, did I see that correctly, did MM resources land?
[22:27] <menn0> alexisb: no, just a step towards it
[22:28] <menn0> alexisb: during my testing yesterday I realised that an early design decision wasn't going to work out - that PR reverses it
[22:28] <menn0> well changes it
[23:15] <babbageclunk> menn0, thumper - what's the timestamp granularity of our log messages? Is it nanos?
[23:22] <thumper> babbageclunk: maybe
[23:22] <babbageclunk> thumper: :) thanks
[23:22] <babbageclunk> thumper: looks like it from the code.
[23:22] <babbageclunk> thumper: or at least, the storage won't chop off any nanos that are there.
[23:22] <thumper> omg this pie is good
[23:44] <axw> wallyworld jam menn0 katco: I won't be able to make tech board today, doing the roster at my son's kindy