anastasiamac | axw_: this is on 2.0.0 - https://bugs.launchpad.net/bugs/1636634... do we recommend to upgrade? and to what version? | 00:49 |
mup | Bug #1636634: azure controller becomes unusable after a few days <juju:Triaged by alexis-bruemmer> <https://launchpad.net/bugs/1636634> | 00:49 |
axw_ | anastasiamac: 2.0.2 has several related fixes in it | 01:04 |
anastasiamac | axw_: awesome \o/ but it's not out yet AFAIK | 01:06 |
axw_ | anastasiamac: ah, I thought it was. why did it not get released? the binaries are all up there AFAICS | 01:10 |
anastasiamac | axw_: m trying to find u to PM | 01:12 |
* babbageclunk goes for a run | 01:14 | |
menn0 | thumper: sigh... so cs:@ | 03:05 |
* menn0 tries again | 03:05 | |
menn0 | thumper: so cs:~user/name style urls have always been broken | 03:06 |
thumper | wat? | 03:06 |
menn0 | thumper: for migrations | 03:06 |
thumper | ugh | 03:06 |
menn0 | thumper: the name field is read from the charm's metadata to reconstruct the charm URL | 03:06 |
menn0 | thumper: and that doesn't include the username part | 03:06 |
menn0 | thumper: will have to add an extra query arg to the upload endpoint | 03:07 |
thumper | urgle | 03:07 |
menn0 | thumper: ignore that problem for now | 03:15 |
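The round-trip problem menn0 describes can be sketched as follows. This is a hypothetical simplified model, not Juju's actual charm.URL type: rebuilding a charm URL from only the metadata name drops the `~user` part of `cs:~user/name` URLs.

```go
package main

import (
	"fmt"
	"strings"
)

// parseCharmURL splits a charm store URL like "cs:~user/name" into
// its user and name parts (simplified; the real type has more fields).
func parseCharmURL(url string) (user, name string) {
	rest := strings.TrimPrefix(url, "cs:")
	if strings.HasPrefix(rest, "~") {
		if parts := strings.SplitN(rest[1:], "/", 2); len(parts) == 2 {
			return parts[0], parts[1]
		}
	}
	return "", rest
}

// reconstructFromMetadata rebuilds the URL from only the charm
// metadata name, as the migration upload path did. The metadata has
// no username, so "cs:~user/name" round-trips to "cs:name" -- hence
// the need for an extra query argument on the upload endpoint.
func reconstructFromMetadata(name string) string {
	return "cs:" + name
}

func main() {
	user, name := parseCharmURL("cs:~bob/etcd")
	fmt.Println(user, name)                    // bob etcd
	fmt.Println(reconstructFromMetadata(name)) // cs:etcd -- user lost
}
```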
thumper | I keep adding debug | 03:16 |
thumper | and presence is looking more and more crackful | 03:16 |
menn0 | thumper: now testing resource migration with the cs:etcd charm which also uses a resource | 03:16 |
menn0 | thumper: it's so hairy I wouldn't be surprised if there's a bug hiding in there | 03:16 |
thumper | menn0: interestingly though, the precheck machine presence failure finds the controller machine down | 03:20 |
thumper | which is somewhat ironic | 03:20 |
thumper | because that is the machine talking to the client | 03:20 |
thumper | it isn't the model agents down | 03:20 |
menn0 | so it's clearly wrong | 03:20 |
thumper | hmm | 03:25 |
thumper | menn0: apparently not | 03:27 |
thumper | precheck.go : 110 | 03:27 |
thumper | menn0: why do we check all the machines of the controller? | 03:28 |
menn0 | thumper: my line counts are different to yours it seems | 03:28 |
thumper | 208? | 03:28 |
thumper | check that one | 03:28 |
thumper | checkController | 03:28 |
thumper | calls checkMachines | 03:28 |
thumper | which checks all the machines | 03:28 |
thumper | now obviously there is still a problem | 03:28 |
thumper | because the machine is clearly not down | 03:28 |
menn0 | thumper: b/c the migrationmaster and API server are on the controllers | 03:29 |
thumper | sure, but if there are any workloads deployed in the controller, we clearly don't care | 03:29 |
thumper | only care about the apiserver machines | 03:29 |
thumper | so if we have one apiserver machine down | 03:29 |
thumper | we can't migrate off? | 03:30 |
menn0 | thumper: point taken... the check is probably overly cautious | 03:31 |
menn0 | thumper: my thinking was that we want a good place to start from | 03:31 |
thumper | yeah | 03:31 |
menn0 | thumper: but that's probably unnecessary | 03:31 |
thumper | I'll keep looking at this failure | 03:31 |
menn0 | thumper: feel free to remove that check | 03:31 |
thumper | it is clearly wrong | 03:31 |
thumper | but unclear as to why it thinks the machine is down | 03:31 |
thumper | menn0: this is just a timing bug | 03:36 |
thumper | menn0: sometimes the presence main loop hasn't done a sync fully before the request for alive comes through | 03:36 |
thumper | in those cases, it finds the machine down | 03:36 |
thumper | the reason we see different results for status | 03:37 |
thumper | is that the full status call uses state connections from the pool | 03:37 |
thumper | and the migration endpoint does not | 03:37 |
thumper | it creates new state instances | 03:37 |
thumper | where the presence isn't fully synced before we ask it if the entities are alive | 03:37 |
thumper | AFAICT | 03:38 |
menn0 | thumper: that makes a whole lot of sense | 03:38 |
thumper | yeah, because this was as confusing as hell | 03:38 |
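The race thumper is describing can be sketched with a toy model (hypothetical, not Juju's real presence package): an entity looks "down" purely because the question arrives before the presence loop's first sync, which is why fresh state instances and pooled ones gave different answers.

```go
package main

import "fmt"

// pinger is a minimal model of a presence tracker. Alive reports
// false until the main loop has completed its first sync, even if
// the entity has in fact been pinging the whole time.
type pinger struct {
	pings  map[string]bool // raw pings recorded in the store
	synced bool            // has the main loop processed them yet?
}

func (p *pinger) Alive(entity string) bool {
	if !p.synced {
		// We haven't ingested the pings yet, so everything
		// looks down -- the precheck false positive.
		return false
	}
	return p.pings[entity]
}

func main() {
	p := &pinger{pings: map[string]bool{"machine-0": true}}
	fmt.Println(p.Alive("machine-0")) // false: fresh instance, not yet synced
	p.synced = true                   // pooled connections are past this point
	fmt.Println(p.Alive("machine-0")) // true
}
```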
menn0 | thumper: I just found another migration issue | 03:39 |
thumper | yet another? | 03:39 |
menn0 | thumper: the resource migration code needs to handle placeholder resources | 03:39 |
menn0 | thumper: this is where the resource is defined but hasn't been up/downloaded yet | 03:40 |
menn0 | thumper: I just ran into that with the etcd charm | 03:40 |
thumper | hmm... | 03:41 |
thumper | or precheck? | 03:41 |
menn0 | thumper: no that won't work. it's perfectly normal for a resource to be a placeholder. | 03:44 |
menn0 | thumper: i'll deal with it. | 03:44 |
thumper | oh | 03:44 |
natefinch | gah, the maas api is so weird | 04:16 |
babbageclunk | natefinch: true that | 05:05 |
natefinch | babbageclunk: mostly it's the gomaasapi package that I'm complaining about... it gives some really weird errors | 05:09 |
babbageclunk | natefinch: well, that's ours - you should be complaining to thumper! | 05:10 |
natefinch | well, I think it was inherited | 05:10 |
babbageclunk | natefinch: I mean, I did (little) bits of it too. | 05:10 |
babbageclunk | natefinch: oh yeah - I'm thinking about the new stuff for talking to maas 2. There's all that weird bit for the 1.0 api. | 05:12 |
natefinch | the error messages are just written in such a way that you know no one ever actually expected anyone to read the errors | 05:16 |
natefinch | like, if you try to connect to a server that responds to <endpoint>/version with something unexpected, you get the error "Requested Map, got <nil>" .... uh.... what? | 05:18 |
natefinch | so like.. I'm stuck - do I hide the crazy maas errors entirely, which may hide some genuinely useful info? Or do I show the crazy errors? Or do I recreate some of the logic externally so I can catch some reasonably common errors, like typoing IP addresses or something, and return an actual reasonable error message? | 05:29 |
babbageclunk | can someone review this formatting fix please? https://github.com/juju/juju/pull/6627 | 05:36 |
babbageclunk | The bad formatting is stopping me from pushing | 05:36 |
natefinch | I almost forgot parens around if statements was a thing | 05:36 |
natefinch | ship it | 05:37 |
babbageclunk | natefinch: ta! | 05:37 |
babbageclunk | natefinch: I guess the right thing to do is to fix the gomaasapi error handling? It shouldn't be letting JSON "traversal" (I think?) errors get up to the client like that. | 05:40 |
babbageclunk | natefinch: Although I recognise that might be a much bigger task. Maybe just a targeted fix in the place you're hitting now? | 05:41 |
natefinch | yeah, I was thinking that | 05:41 |
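The targeted fix being discussed amounts to wrapping the raw traversal error with actionable context instead of hiding it. A minimal sketch, in which `traverseVersion`, `checkEndpoint`, and the example endpoint URL are all hypothetical stand-ins rather than real gomaasapi calls:

```go
package main

import (
	"errors"
	"fmt"
)

// traverseVersion stands in for the low-level gomaasapi call that
// surfaces raw JSON-traversal errors like "Requested Map, got <nil>".
func traverseVersion() error {
	return errors.New("Requested Map, got <nil>")
}

// checkEndpoint wraps the low-level error with context a user can
// act on (e.g. a typoed IP address), while keeping the original
// error text for debugging rather than hiding it entirely.
func checkEndpoint(endpoint string) error {
	if err := traverseVersion(); err != nil {
		return fmt.Errorf("%s/version returned an unexpected response (check the MAAS endpoint address): %v", endpoint, err)
	}
	return nil
}

func main() {
	fmt.Println(checkEndpoint("http://example.com/MAAS/api"))
}
```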
mup | Bug #1644331 opened: juju-deployer failed on SSL3_READ_BYTES <oil> <uosci> <juju-core:Triaged> <juju-deployer:New> <OPNFV:New> <https://launchpad.net/bugs/1644331> | 07:01 |
=== frankban|afk is now known as frankban | ||
voidspace | frobware: did you see you had tests fail on your merge this morning? | 09:21 |
voidspace | frobware: PR 6618 | 09:21 |
frobware | voidspace: I did. | 09:21 |
voidspace | frobware: cool, just checking you'd seen | 09:21 |
voidspace | in other news I got ordained last week | 09:22 |
voidspace | frobware: I'm now a priest of the church of the latter day dude | 09:22 |
voidspace | frobware: http://dudeism.com/ | 09:22 |
frobware | voidspace: :) | 09:25 |
frobware | voidspace: congrats! | 09:29 |
frobware | voidspace: it's difficult to separate whether it is just my PR or a general failure. | 09:40 |
frobware | voidspace: macgreagoir was having trouble too | 09:40 |
macgreagoir | I see tests pass but lxd deployment fail, I think. I'm trying some local lxd testing to see if it's a branch problem. | 09:42 |
macgreagoir | (If I can get some disk space.) | 09:42 |
mgz | mornin' all | 09:50 |
macgreagoir | \o mgz | 09:59 |
voidspace | mgz: o/ | 10:04 |
jam | mgz: do you know what is up with the merge juju check failing on https://github.com/juju/juju/pull/6620 ? | 10:38 |
mgz | jam: let me have a look | 10:40 |
mgz | lxd failed with "x509: certificate signed by unknown authority" | 10:42 |
mgz | trying to talk back to the api server | 10:42 |
jam | mgz: any chance the LXD containers on that machine are bridged onto the hosts network and the bug about LXD provider using gateway as its host machine is acting up? | 10:48 |
mgz | jam: so, a run afterwards passed | 10:50 |
mgz | so, it's either something intermittent or the branch really had an effect | 10:50 |
perrito666 | fun way to start the morning, my machine would not boot and while trying to fix it I un-sudoed myself | 11:26 |
* perrito666 downloads livecd and goes get a coffee | 11:26 | |
voidspace | perrito666: fun way to start the day... | 11:27 |
perrito666 | voidspace: and the week | 11:27 |
perrito666 | that must be worth something | 11:27 |
voidspace | perrito666: heh | 11:28 |
perrito666 | bbl, errands | 12:11 |
rick_h | mgz: ping | 12:15 |
rick_h | mgz: can you please pair up with voidspace and help look at how the testing is setup/working on this OIL problem and see if there's anything that jumps out to you about why it works with 2.0.1 and fails with the version bump commit right after? | 12:16 |
mgz | rick_h: heya | 12:17 |
mgz | sure, voidspace, maybe we do hangout and hang out? | 12:18 |
rick_h | voidspace: ping for standup | 12:31 |
voidspace | rick_h: sorry, omw | 12:35 |
voidspace | mgz: yep, cool | 12:38 |
voidspace | mgz: ok, this time I see tags/juju-2.0.1 with a custom version number failing | 13:03 |
voidspace | mgz: which makes more sense to me | 13:03 |
voidspace | mgz: because then the problem is consistent - use a non-standard stream version in this environment and it fails | 13:03 |
voidspace | mgz: trying again with this version and then trying vanilla 2.0.1 to confirm | 13:04 |
voidspace | mgz: and then writing it up | 13:04 |
mgz | voidspace: ace, thanks | 13:06 |
=== benji_ is now known as benji | ||
voidspace | mgz: I'm now seeing the same failure with vanilla 2.0.1 | 14:02 |
voidspace | mgz: so having to repeat | 14:02 |
voidspace | mgz: vanilla 2.0.1 worked for me earlier today | 14:03 |
voidspace | mgz: if it continues to fail I will try 2.0.2 and if the failure mode is the same then I will conclude that I am fully unable to reproduce the problem | 14:03 |
mgz | voidspace: I feel this repro is just not reliable enough... | 14:03 |
voidspace | mgz: I'll send you the email I *was* going to send you when I thought vanilla 2.0.1 would work | 14:04 |
voidspace | mgz: I think you might be right | 14:04 |
voidspace | rick_h: I have updated bug 1642609 and continue to work on it | 14:07 |
mup | Bug #1642609: [2.0.2] many maas nodes left undeployed when deploying multiple models simultaneously on single controller <oil> <oil-2.0> <regression> <juju:Triaged by mfoord> <https://launchpad.net/bugs/1642609> | 14:07 |
mup | Bug #1645729 opened: environment unstable after 1.25.8 upgrade <juju-core:New> <https://launchpad.net/bugs/1645729> | 14:08 |
rick_h | voidspace: ty /me looks | 14:31 |
voidspace | rick_h: I am right in the *middle* of another vanilla 2.0.1 deploy | 14:32 |
voidspace | rick_h: which I'm sure yesterday worked fine and today the last one just failed in the same way as the custom versions fail for me | 14:32 |
rick_h | voidspace: k, can we run this test on other hardware? | 14:32 |
voidspace | rick_h: so I am almost back to knowing nothing I think | 14:32 |
rick_h | voidspace: can we test it on a public cloud, or another MAAS? | 14:32 |
voidspace | rick_h: it's 50 odd machines | 14:33 |
rick_h | voidspace: see if we can isolate it to the OIL hardware or something? | 14:33 |
voidspace | rick_h: I can try it on a public cloud | 14:33 |
voidspace | rick_h: I can't do it on my maas | 14:33 |
rick_h | voidspace: k, at 50 machines we'll need big credentials I think. | 14:33 |
rick_h | voidspace: let me know and I can get you some gce creds that might work | 14:33 |
* rick_h hasn't tried 50 machines on there yet | 14:33 | |
voidspace | hah | 14:34 |
rick_h | voidspace: can you shoot me the instructions for replicating please? | 14:34 |
rick_h | voidspace: I'd like to see how involved it is | 14:34 |
voidspace | rick_h: sent | 14:39 |
voidspace | rick_h: what's the current state of the art encrypted messaging service - is it still telegram or something else | 14:39 |
rick_h | voidspace: yea, telegram is the usual thing we use at sprints/etc | 14:40 |
voidspace | rick_h: must install that before we leave for the sprint | 14:42 |
voidspace | rick_h: mgz: vanilla 2.0.1 failed for me a second time, now retrying (again) 2.0.2 to check the failure mode is the same | 14:43 |
voidspace | rick_h: (6 models, 49 machines, 63 applications) | 14:43 |
rick_h | voidspace: k | 14:43 |
voidspace | even tearing down the environment takes time | 14:44 |
rick_h | voidspace: rgr, I'd like to talk to larry on this I think. | 14:45 |
voidspace | rick_h: yep | 14:45 |
voidspace | rick_h: I was going to copy him in on that email I sent martin as I assumed 2.0.1 would *work* and then I would have some actual data | 14:45 |
voidspace | rick_h: as it is I am back to having no useful data I don't think | 14:46 |
voidspace | rick_h: other than maybe that the repro I have been given isn't reliable | 14:46 |
rick_h | voidspace: rgr, let's hold off atm | 14:46 |
rick_h | voidspace: take a break off it for a bit while we sort out some cross team bits I think | 14:46 |
rick_h | mgz: please let me know if anything there looks fishy | 14:47 |
mgz | I'm reading over the details now | 14:47 |
voidspace | rick_h: ok, I have a 2.0.2 deploy in progress I will leave running | 14:47 |
rick_h | voidspace: rgr | 14:47 |
mgz | we really only have two candidate changes in the 2.0.1 to 2.0.2 range | 14:50 |
mgz | pr #6537 (bump gomaasapi) | 14:51 |
mgz | pr #6527 (constrain to 3.5GB mem for controllers) | 14:51 |
rick_h | voidspace: do we have the ability to track controller metrics, cpu/ram/etc during the deploy? | 14:51 |
mgz | and that second one would need to be some weird maas machine selection problem to be relevant | 14:52 |
rick_h | voidspace: if you have a sec can you jump in the standup early | 14:53 |
voidspace | rick_h: yep, coming now | 14:53 |
* frobware first attempt at bridging only single interfaces ... does not work. :( | 15:42 | |
frobware | boo | 15:43 |
frobware | though this may only be related to bonds. | 15:43 |
rick_h | doh | 15:43 |
frobware | rick_h: the behaviour is different to what we had before. | 15:44 |
frobware | rick_h: and it could be MAAS 2.1 specific. | 15:45 |
frobware | so many permutations. so few automated baselines. :( | 15:45 |
* frobware wanders off to see if there's any chocolate in the house. | 15:46 | |
frobware | rick_h: a little pricy, but dual NICS with vPro - http://www.logicsupply.com/uk-en/mc500-51/ | 15:49 |
rick_h | frobware: ping Mike and Andres on a hardware suggestion and will OK it | 15:51 |
frobware | rick_h: are you saying definitely not that one? or just choose what Mike+A already use? | 15:51 |
rick_h | I think asking what they use/test Maas on seems like a good way to go about it and make sure it's something that will work on Maas. | 15:52 |
frobware | rick_h: ok | 15:52 |
rick_h | frobware: ^ | 15:52 |
frobware | rick_h: I'm wondering if my bond issues are a vMAAS issue only. If I manage to ssh in and run `ifdown br-bond0; ifup br-bond0` it springs into life. | 15:53 |
frobware | rick_h: but that's the exact same sequence that has just run - and you do see sensible values for routes, configured interfaces, addresses, et al. | 15:53 |
rick_h | frobware: k, can we put together a test case and see if we can get help verifying it with someone with real hardware? | 15:56 |
frobware | rick_h: can do. just need to clean stuff up. will ping mike and andres in the meantime. | 16:01 |
alexisb | perrito666, ping | 18:23 |
perrito666 | alexisb: pong | 18:26 |
menn0 | thumper: this is last night's work. it undoes an early design decision regarding resource migrations: https://github.com/juju/juju/pull/6628 | 20:32 |
thumper | will look | 20:33 |
menn0 | thumper: cheers | 20:34 |
thumper | menn0: looks good | 20:40 |
menn0 | thumper: thanks | 20:41 |
thumper | menn0: https://github.com/juju/juju/pull/6629 | 20:41 |
* menn0 looks | 20:41 | |
menn0 | thumper: done | 20:49 |
thumper | ta | 20:50 |
babbageclunk | menn0, thumper: could you take another look at https://github.com/juju/juju/pull/6622 please? | 20:52 |
menn0 | babbageclunk: forgot to say, ship it | 21:12 |
menn0 | babbageclunk: with a couple of comments | 21:13 |
babbageclunk | menn0: thanks! looking now | 21:13 |
menn0 | babbageclunk: once this lands can you also send an email to veebers and torsten about this being ready for the planned CI test? | 21:14 |
babbageclunk | menn0: sure | 21:14 |
menn0 | babbageclunk: thinking about it, is ShortWait enough time for the goroutine to start waiting on the clock? | 21:17 |
menn0 | babbageclunk: that seems like a flaky test waiting to happen | 21:17 |
menn0 | babbageclunk: I think you need to wait up to LongWait | 21:18 |
babbageclunk | menn0: ShortWait's 50ms - that should be *heaps* of time for it to catch up, shouldn't it? | 21:18 |
babbageclunk | menn0: Ok, I'll bump it up to be on the safe side. | 21:19 |
menn0 | babbageclunk: we see things that *should* happen in ms take seconds on overcommitted test machines all the time | 21:19 |
menn0 | babbageclunk: LongWait is the time we wait for things that *should* happen | 21:19 |
menn0 | babbageclunk: if you change the wait to LongWait (10s) then WaitAdvance will take at least 1s each call | 21:20 |
babbageclunk | menn0: Even with LongWait the test still takes 0s - I guess in the usual case on a not-really-loaded machine the other side's already waiting. | 21:20 |
menn0 | babbageclunk: I think WaitAdvance needs to be changed so that pause is a fixed amount | 21:20 |
menn0 | instead of w / 10 | 21:20 |
menn0 | maybe pause should just be ShortWait | 21:21 |
menn0 | that would remove my concern about WaitAdvance blowing out test times | 21:21 |
menn0 | b/c it'll finish within ShortWait of the correct number of waiters turning up | 21:21 |
babbageclunk | menn0: Yeah, I think so too - I had an idea about how to do it without polling but I haven't had a chance to try it out. We can talk about it at the sprint. | 21:21 |
menn0 | babbageclunk: sounds good. a non polling approach would be preferable | 21:22 |
babbageclunk | menn0: Ooh, I thought it was racy but a nice tweak just occurred to me - if notifyAlarms was a chan int and always got the number of waiters (instead of a struct{}{}) that might do it. | 21:24 |
menn0 | babbageclunk: interesting... worth playing with | 21:25 |
thumper | the test clock alarms was designed exactly to have a test wait on the signal that showed that something had called clock.After() | 21:53 |
alexisb | perrito666, I am running a little late | 21:59 |
perrito666 | no worries, I am still logging in | 22:00 |
mup | Bug #1644331 changed: juju-deployer failed on SSL3_READ_BYTES <deployer> <oil> <python> <uosci> <OpenStack Charm Test Infra:Confirmed> <juju:Won't Fix> <juju-core:Won't Fix> <juju-deployer:New> <OPNFV:New> <https://launchpad.net/bugs/1644331> | 22:06 |
thumper | menn0: https://github.com/juju/juju/pull/6631 | 22:07 |
babbageclunk | thumper: yeah, but you can't rely on the fact that there's a message on the alarms channel to mean there's something waiting, because multiple waiters can be removed with one advance. | 22:16 |
thumper | babbageclunk: ok | 22:17 |
menn0 | thumper: ship it | 22:23 |
thumper | menn0: what about a later check? | 22:24 |
=== frankban is now known as frankban|afk | ||
thumper | menn0: it is possible that during initiation a hook may be executing | 22:24 |
thumper | which may then put the charm into a failed state | 22:24 |
thumper | I'm trying to remember | 22:24 |
thumper | is that OK? | 22:24 |
thumper | I *think* it is... | 22:24 |
alexisb | menn0, did I see that correctly, did MM resources land? | 22:27 |
menn0 | alexisb: no, just a step towards it | 22:27 |
menn0 | alexisb: during my testing yesterday I realised that an early design decision wasn't going to work out - that PR reverses it | 22:28 |
menn0 | well changes it | 22:28 |
babbageclunk | menn0, thumper - what's the timestamp granularity of our log messages? Is it nanos? | 23:15 |
thumper | babbageclunk: maybe | 23:22 |
babbageclunk | thumper: :) thanks | 23:22 |
babbageclunk | thumper: looks like it from the code. | 23:22 |
babbageclunk | thumper: or at least, the storage won't chop off any nanos that are there. | 23:22 |
thumper | omg this pie is good | 23:22 |
axw | wallyworld jam menn0 katco: I won't be able to make tech board today, doing the roster at my son's kindy | 23:44 |
Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!