[00:49] axw_: this is on 2.0.0 - https://bugs.launchpad.net/bugs/1636634... do we recommend to upgrade? and to what version?
[00:49] Bug #1636634: azure controller becomes unusable after a few days
[01:04] anastasiamac: 2.0.2 has several related fixes in it
[01:06] axw_: awesome \o/ but it's not out yet AFAIK
[01:10] anastasiamac: ah, I thought it was. why did it not get released? the binaries are all up there AFAICS
[01:12] axw_: m trying to find u to PM
[01:14] * babbageclunk goes for a run
[03:05] thumper: sigh... so cs:@
[03:05] * menn0 tries again
[03:06] thumper: so cs:~user/name style urls have always been broken
[03:06] wat?
[03:06] thumper: for migrations
[03:06] ugh
[03:06] thumper: the name field is read from the charm's metadata to reconstruct the charm URL
[03:06] thumper: and that doesn't include the username part
[03:07] thumper: will have to add an extra query arg to the upload endpoint
[03:07] urgle
[03:15] thumper: ignore that problem for now
[03:16] I keep adding debug
[03:16] and presence is looking more and more crackful
[03:16] thumper: now testing resource migration with the cs:etcd charm which also uses a resource
[03:16] thumper: it's so hairy I wouldn't be surprised if there's a bug hiding in there
[03:20] menn0: interestingly though, the precheck machine presence failure finds the controller machine down
[03:20] which is somewhat ironic
[03:20] because that is the machine talking to the client
[03:20] it isn't the model agents down
[03:20] so it's clearly wrong
[03:25] hmm
[03:27] menn0: apparently not
[03:27] precheck.go : 110
[03:28] menn0: why do we check all the machines of the controller?
[03:28] thumper: my line counts are different to yours it seems
[03:28] 208?
[03:28] check that one
[03:28] checkController
[03:28] calls checkMachines
[03:28] which checks all the machines
[03:28] now obviously there is still a problem
[03:28] because the machine is clearly not down
[03:29] thumper: b/c the migrationmaster and API server is on the controllers
[03:29] sure, but if there are any workloads deployed in the controller, we clearly don't care
[03:29] only care about the apiserver machines
[03:29] so if we have one apiserver machine down
[03:30] we can't migrate off?
[03:31] thumper: point taken... the check is probably overly cautious
[03:31] thumper: my thinking was that we want a good place to start from
[03:31] yeah
[03:31] thumper: but that's probably unnecessary
[03:31] I'll keep looking at this failure
[03:31] thumper: feel free to remove that check
[03:31] it is clearly wrong
[03:31] but unclear as to why it thinks the machine is down
[03:36] menn0: this is just a timing bug
[03:36] menn0: sometimes the presence main loop hasn't done a sync fully before the request for alive comes through
[03:36] in those cases, it finds the machine down
[03:37] the reason we see different results for status
[03:37] is that the full status call uses state connections from the pool
[03:37] and the migration endpoint does not
[03:37] it creates new state instances
[03:37] where the presence isn't fully synced before we ask it if the entities are alive
[03:38] AFAICT
[03:38] thumper: that makes a whole lot of sense
[03:38] yeah, because this was as confusing as hell
[03:39] thumper: I just found another migration issue
[03:39] yet another?
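A minimal, self-contained sketch of the presence race described above: a per-connection presence view that has not completed its first sync reports every agent as down, so any liveness check has to wait for that initial sync. The names here (presenceView, applySync, WaitSynced) are invented for illustration and are not the actual juju/state presence API.

```go
package main

import (
	"fmt"
	"sync"
)

// presenceView models a per-connection view of agent presence. Until the
// first sync completes it has seen no pings, so every agent looks dead,
// which is the state a freshly created (non-pooled) state instance is in.
type presenceView struct {
	mu     sync.Mutex
	synced bool
	alive  map[string]bool
	done   chan struct{}
}

func newPresenceView() *presenceView {
	return &presenceView{alive: make(map[string]bool), done: make(chan struct{})}
}

// applySync simulates the presence main loop pulling ping records in.
func (p *presenceView) applySync(pings map[string]bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for agent, ok := range pings {
		p.alive[agent] = ok
	}
	if !p.synced {
		p.synced = true
		close(p.done)
	}
}

// Alive reports liveness from whatever has been synced so far.
func (p *presenceView) Alive(agent string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.alive[agent]
}

// WaitSynced blocks until the first sync has happened, which is what a
// precheck would need to do before trusting Alive.
func (p *presenceView) WaitSynced() { <-p.done }

func main() {
	pv := newPresenceView()
	fmt.Println("before sync, machine-0 alive:", pv.Alive("machine-0")) // false: looks down

	go pv.applySync(map[string]bool{"machine-0": true})

	pv.WaitSynced()
	fmt.Println("after sync, machine-0 alive:", pv.Alive("machine-0")) // true
}
```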
[03:39] thumper: the resource migration code needs to handle placeholder resources
[03:40] thumper: this is where the resource is defined but hasn't been up/downloaded yet
[03:40] thumper: I just ran into that with the etcd charm
[03:41] hmm...
[03:41] or precheck?
[03:44] thumper: no that won't work. it's perfectly normal for a resource to be a placeholder.
[03:44] thumper: i'll deal with it.
[03:44] oh
[04:16] gah, the maas api is so weird
[05:05] natefinch: true that
[05:09] babbageclunk: mostly it's the gomaasapi package that I'm complaining about... it gives some really weird errors
[05:10] natefinch: well, that's ours - you should be complaining to thumper!
[05:10] well, I think it was inherited
[05:10] natefinch: I mean, I did (little) bits of it too.
[05:12] natefinch: oh yeah - I'm thinking about the new stuff for talking to maas 2. There's all that weird bit for the 1.0 api.
[05:16] the error messages are just written in such a way that you know no one ever actually expected anyone to read the errors
[05:18] like, if you try to connect to a server that response to /version with something unexpected, you get the error "Requested Map, got " .... uh.... what?
[05:22] s/response/responds
[05:29] so like.. I'm stuck - do I hide the crazy maas errors entirely, which may hide some genuinely useful info? Or do I show the crazy errors? Or do I recreate some of the logic externally so I can catch some reasonably common errors, like typoing IP addresses or something, and return an actual reasonable error message?
[05:36] can someone review this formatting fix please? https://github.com/juju/juju/pull/6627
[05:36] The bad formatting is stopping me from pushing
[05:36] I almost forgot parens around if statements was a thing
[05:37] ship it
[05:37] natefinch: ta!
[05:40] natefinch: I guess the right thing to do is to fix the gomaasapi error handling? It shouldn't be letting JSON "traversal" (I think?) errors get up to the client like that.
[05:41] natefinch: Although I recognise that might be a much bigger task. Maybe just a targeted fix in the place you're hitting now?
[05:41] yeah, I was thinking that
[07:01] Bug #1644331 opened: juju-deployer failed on SSL3_READ_BYTES
=== frankban|afk is now known as frankban
[09:21] frobware: did you see you had tests fail on your merge this morning?
[09:21] frobware: PR 6618
[09:21] voidspace: I did.
[09:21] frobware: cool, just checking you'd seen
[09:22] in other news I got ordained last week
[09:22] frobware: I'm now a priest of the church of the latter day dude
[09:22] frobware: http://dudeism.com/
[09:25] voidspace: :)
[09:29] voidspace: congrats!
[09:40] voidspace: it's difficult to separate whether it is just my PR or a general failure.
[09:40] voidspace: macgreagoir was having trouble too
[09:42] I see tests pass but lxd deployment fail, I think. I'm trying some local lxd testing to see if it's a branch problem.
[09:42] (If I can get some disk space.)
[09:50] mornin' all
[09:59] \o mgz
[10:04] mgz: o/
[10:38] mgz: do you know what is up with the merge juju check failing on https://github.com/juju/juju/pull/6620 ?
[10:40] jam: let me have a look
[10:42] lxd failed with "x509: certificate signed by unknown authority"
[10:42] trying to talk back to the api server
[10:48] mgz: any chance the LXD containers on that machine are bridged onto the host's network and the bug about LXD provider using gateway as its host machine is acting up?
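A sketch of the targeted fix suggested above for the gomaasapi "Requested Map, got ..." errors: wrap the raw JSON-traversal error at the call site with a message a user can act on, while keeping the original error for debugging. parseVersion and checkMAASEndpoint are hypothetical names used for illustration, not functions in gomaasapi or the Juju MAAS provider.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// parseVersion stands in for the gomaasapi JSON traversal that produces the
// opaque "Requested Map, got ..." error when /version returns something
// unexpected (for example, an HTML page instead of the expected JSON map).
func parseVersion(body []byte) (map[string]interface{}, error) {
	var v interface{}
	if err := json.Unmarshal(body, &v); err != nil {
		return nil, err
	}
	m, ok := v.(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("requested map, got %T", v)
	}
	return m, nil
}

// checkMAASEndpoint is the call-site fix: lead with a message a user can act
// on, but keep the underlying error wrapped for debugging.
func checkMAASEndpoint(endpoint string, body []byte) error {
	if _, err := parseVersion(body); err != nil {
		return fmt.Errorf("%q does not appear to be a MAAS API endpoint (check the address): %w", endpoint, err)
	}
	return nil
}

func main() {
	// A server that answers /version with something other than the expected JSON map.
	err := checkMAASEndpoint("http://10.0.0.99/MAAS", []byte(`"<html>not maas</html>"`))
	fmt.Println(err)
	fmt.Println("underlying error kept:", errors.Unwrap(err) != nil)
}
```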
[10:50] jam: so, a run afterwards passed
[10:50] so, it's either something intermittent or the branch really had an effect
[11:26] fun way to start the morning, my machine would not boot and while trying to fix it I un-sudoed myself
[11:26] * perrito666 downloads livecd and goes get a coffee
[11:27] perrito666: fun way to start the day...
[11:27] voidspace: and the week
[11:27] that must be worth something
[11:28] perrito666: heh
[12:11] bbl, errands
[12:15] mgz: ping
[12:16] mgz: can you please pair up with voidspace and help look at how the testing is set up/working on this OIL problem and see if there's anything that jumps out to you about why it works with 2.0.1 and fails with the version bump commit right after?
[12:17] rick_h: heya
[12:18] sure, voidspace, maybe we do hangout and hang out?
[12:31] voidspace: ping for standup
[12:35] rick_h: sorry, omw
[12:38] mgz: yep, cool
[13:03] mgz: ok, this time I see tags/juju-2.0.1 with a custom version number failing
[13:03] mgz: which makes more sense to me
[13:03] mgz: because then the problem is consistent - use a non-standard stream version in this environment and it fails
[13:04] mgz: trying again with this version and then trying vanilla 2.0.1 to confirm
[13:04] mgz: and then writing it up
[13:06] voidspace: ace, thanks
=== benji_ is now known as benji
[14:02] mgz: I'm now seeing the same failure with vanilla 2.0.1
[14:02] mgz: so having to repeat
[14:03] mgz: vanilla 2.0.1 worked for me earlier today
[14:03] mgz: if it continues to fail I will try 2.0.2 and if the failure mode is the same then I will conclude that I am fully unable to reproduce the problem
[14:03] voidspace: I feel this repro is just not reliable enough...
[14:04] mgz: I'll send you the email I *was* going to send you when I thought vanilla 2.0.1 would work
[14:04] mgz: I think you might be right
[14:07] rick_h: I have updated bug 1642609 and continue to work on it
[14:07] Bug #1642609: [2.0.2] many maas nodes left undeployed when deploying multiple models simultaneously on single controller
[14:08] Bug #1645729 opened: environment unstable after 1.25.8 upgrade
[14:31] voidspace: ty /me looks
[14:32] rick_h: I am right in the *middle* of another vanilla 2.0.1 deploy
[14:32] rick_h: which I'm sure yesterday worked fine and today the last one just failed in the same way as the custom versions fail for me
[14:32] voidspace: k, can we run this test on other hardware?
[14:32] rick_h: so I am almost back to knowing nothing I think
[14:32] voidspace: can we test it on a public cloud, or another MAAS?
[14:33] rick_h: it's 50 odd machines
[14:33] voidspace: see if we can isolate it to the OIL hardware or something?
[14:33] rick_h: I can try it on a public cloud
[14:33] rick_h: I can't do it on my maas
[14:33] voidspace: k, at 50 machines we'll need big credentials I think.
[14:33] voidspace: let me know and I can get you some gce creds that might work
[14:33] * rick_h hasn't tried 50 machines on there yet
[14:34] hah
[14:34] voidspace: can you shoot me the instructions for replicating please?
[14:34] voidspace: I'd like to see how involved it is
[14:39] rick_h: sent
[14:39] rick_h: what's the current state of the art encrypted messaging service - is it still telegram or something else
[14:40] voidspace: yea, telegram is the usual thing we use at sprints/etc
[14:42] rick_h: must install that before we leave for the sprint
[14:43] rick_h: mgz: vanilla 2.0.1 failed for me a second time, now retrying (again) 2.0.2 to check the failure mode is the same
[14:43] rick_h: (6 models, 49 machines, 63 applications)
[14:43] voidspace: k
[14:44] even tearing down the environment takes time
[14:45] voidspace: rgr, I'd like to talk to larry on this I think.
[14:45] rick_h: yep
[14:45] rick_h: I was going to copy him in on that email I sent martin as I assumed 2.0.1 would *work* and then I would have some actual data
[14:46] rick_h: as it is I am back to having no useful data I don't think
[14:46] rick_h: other than maybe that the repro I have been given isn't reliable
[14:46] voidspace: rgr, let's hold off atm
[14:46] voidspace: take a break off it for a bit while we sort out some cross team bits I think
[14:47] mgz: please let me know if anything there looks fishy
[14:47] I'm reading over the details now
[14:47] rick_h: ok, I have a 2.0.2 deploy in progress I will leave running
[14:47] voidspace: rgr
[14:50] we really only have two candidate changes in the 2.0.1 to 2.0.2 range
[14:51] pr #6537 (bump gomaasapi)
[14:51] pr #6527 (constrain to 3.5GB mem for controllers)
[14:51] voidspace: do we have the ability to track controller metrics, cpu/ram/etc during the deploy?
[14:52] and that second one would need to be some weird maas machine selection problem to be relevant
[14:53] voidspace: if you have a sec can you jump in the standup early
[14:53] rick_h: yep, coming now
[15:42] * frobware first attempt at bridging only single interfaces ... does not work. :(
[15:43] boo
[15:43] though this may only be related to bonds.
[15:43] doh
[15:44] rick_h: the behaviour is different to what we had before.
[15:45] rick_h: and it could be MAAS 2.1 specific.
[15:45] so many permutations. so few automated baselines. :(
[15:46] * frobware wanders off to see if there's any chocolate in the house.
[15:49] rick_h: a little pricy, but dual NICS with vPro - http://www.logicsupply.com/uk-en/mc500-51/
[15:51] frobware: ping Mike and Andres on a hardware suggestion and will OK it
[15:51] rick_h: are you saying definitely not that one? or just choose what Mike+A already use?
[15:52] I think asking what they use/test Maas on seems like a good way to go about it and make sure it's something that will work on Maas.
[15:52] rick_h: ok
[15:52] frobware: ^
[15:53] rick_h: I'm wondering if my bond issues are a vMAAS issue only. If I manage to ssh in and run `ifdown br-bond0; ifup br-bond0` it springs into life.
[15:53] rick_h: but that's the exact same sequence that has just run - and you do see sensible values for routes, configured interfaces, addresses, et al.
[15:56] frobware: k, can we put together a test case and see if we can get help verifying it with someone with real hardware?
[16:01] rick_h: can do. just need to clean stuff up. will ping mike and andres in the meantime.
[18:23] perrito666, ping
[18:26] alexisb: pong
[20:32] thumper: this is last night's work. It undoes an early design decision regarding resource migrations: https://github.com/juju/juju/pull/6628
[20:33] will look
[20:34] thumper: cheers
[20:40] menn0: looks good
[20:41] thumper: thanks
[20:41] menn0: https://github.com/juju/juju/pull/6629
[20:41] * menn0 looks
[20:49] thumper: done
[20:50] ta
[20:52] menn0, thumper: could you take another look at https://github.com/juju/juju/pull/6622 please?
[21:12] babbageclunk: forgot to say, ship it
[21:13] babbageclunk: with a couple of comments
[21:13] menn0: thanks! looking now
[21:14] babbageclunk: once this lands can you also send an email to veebers and torsten about this being ready for the planned CI test?
[21:14] menn0: sure
[21:17] babbageclunk: thinking about it, is ShortWait enough time for the goroutine to start waiting on the clock?
[21:17] babbageclunk: that seems like a flaky test waiting to happen
[21:18] babbageclunk: I think you need to wait up to LongWait
[21:18] menn0: ShortWait's 50ms - that should be *heaps* of time for it to catch up, shouldn't it?
[21:19] menn0: Ok, I'll bump it up to be on the safe side.
[21:19] babbageclunk: we see things that *should* happen in ms take seconds on overcommitted test machines all the time
[21:19] babbageclunk: LongWait is the time we wait for things that *should* happen
[21:20] babbageclunk: if you change the wait for LongWait (10s) then WaitAdvance will take at least 1s each call
[21:20] menn0: Even with LongWait the test still takes 0s - I guess in the usual case on a not-really-loaded machine the other side's already waiting.
[21:20] babbageclunk: I think WaitAdvance needs to be changed so that pause is a fixed amount
[21:20] instead of w / 10
[21:21] maybe pause should just be ShortWait
[21:21] that would remove my concern about WaitAdvance blowing out test times
[21:21] b/c it'll finish within ShortWait of the correct number of waiters turning up
[21:21] menn0: Yeah, I think so too - I had an idea about how to do it without polling but I haven't had a chance to try it out. We can talk about it at the sprint.
[21:22] babbageclunk: sounds good. a non-polling approach would be preferable
[21:24] menn0: Ooh, I thought it was racy but a nice tweak just occurred to me - if notifyAlarms was a chan int and always got the number of waiters (instead of a struct{}{}) that might do it.
[21:25] babbageclunk: interesting... worth playing with
[21:53] the test clock alarms channel was designed exactly so that a test could wait on the signal showing that something had called clock.After()
[21:59] perrito666, I am running a little late
[22:00] no worries, I am still logging in
[22:06] Bug #1644331 changed: juju-deployer failed on SSL3_READ_BYTES
[22:07] menn0: https://github.com/juju/juju/pull/6631
[22:16] thumper: yeah, but you can't rely on the fact that there's a message on the alarms channel to mean there's something waiting, because multiple waiters can be removed with one advance.
[22:17] babbageclunk: ok
[22:23] thumper: ship it
[22:24] menn0: what about a later check?
=== frankban is now known as frankban|afk
[22:24] menn0: it is possible that during initiation a hook may be executing
[22:24] which may then put the charm into a failed state
[22:24] I'm trying to remember
[22:24] is that OK?
[22:24] I *think* it is...
[22:27] menn0, did I see that correctly, did MM resources land?
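A rough sketch of the non-polling WaitAdvance idea discussed above, assuming a cut-down test clock where the notify channel carries the waiter count rather than struct{}{}. This is illustrative only and is not the actual juju/testing clock implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// testClock is a cut-down sketch: every After call reports the new waiter
// count on notify, so WaitAdvance can block until enough waiters exist
// instead of polling every w/10.
type testClock struct {
	mu      sync.Mutex
	now     time.Time
	waiters []waiter
	notify  chan int // number of waiters after each After call
}

type waiter struct {
	deadline time.Time
	ch       chan time.Time
}

func newTestClock() *testClock {
	return &testClock{notify: make(chan int, 16)}
}

// After registers a waiter and reports the new waiter count on notify.
func (c *testClock) After(d time.Duration) <-chan time.Time {
	c.mu.Lock()
	defer c.mu.Unlock()
	ch := make(chan time.Time, 1)
	c.waiters = append(c.waiters, waiter{c.now.Add(d), ch})
	c.notify <- len(c.waiters)
	return ch
}

// WaitAdvance blocks (on notify, no sleep loop) until at least n waiters are
// registered or the timeout expires, then advances the clock and fires any
// waiters whose deadlines have passed.
func (c *testClock) WaitAdvance(d, timeout time.Duration, n int) error {
	expired := time.After(timeout)
	for {
		c.mu.Lock()
		enough := len(c.waiters) >= n
		c.mu.Unlock()
		if enough {
			break
		}
		select {
		case <-c.notify:
		case <-expired:
			return fmt.Errorf("timed out waiting for %d waiters", n)
		}
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.now = c.now.Add(d)
	var remaining []waiter
	for _, w := range c.waiters {
		if !w.deadline.After(c.now) {
			w.ch <- c.now
		} else {
			remaining = append(remaining, w)
		}
	}
	c.waiters = remaining
	return nil
}

func main() {
	clk := newTestClock()
	done := make(chan struct{})
	go func() {
		<-clk.After(5 * time.Second) // the code under test
		close(done)
	}()
	if err := clk.WaitAdvance(10*time.Second, time.Second, 1); err != nil {
		fmt.Println(err)
		return
	}
	<-done
	fmt.Println("waiter fired without any polling loop")
}
```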
[22:27] alexisb: no, just a step towards it
[22:28] alexisb: during my testing yesterday I realised that an early design decision wasn't going to work out - that PR reverses it
[22:28] well changes it
[23:15] menn0, thumper - what's the timestamp granularity of our log messages? Is it nanos?
[23:22] babbageclunk: maybe
[23:22] thumper: :) thanks
[23:22] thumper: looks like it from the code.
[23:22] thumper: or at least, the storage won't chop off any nanos that are there.
[23:22] omg this pie is good
[23:44] wallyworld jam menn0 katco: I won't be able to make tech board today, doing the roster at my son's kindy
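On the timestamp granularity question above: a tiny check, under the assumption that log timestamps are kept as an int64 of nanoseconds, showing that UnixNano round-trips without losing precision. This is not Juju's actual log storage code, just an illustration of why nanos survive if the storage keeps the full int64.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A timestamp with a non-zero nanosecond component.
	t := time.Date(2016, 11, 29, 23, 15, 0, 123456789, time.UTC)

	stored := t.UnixNano()                 // what would be written out
	restored := time.Unix(0, stored).UTC() // what would be read back

	fmt.Println(t.Equal(restored)) // true: nanosecond precision intact
}
```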