/srv/irclogs.ubuntu.com/2016/11/29/#juju-dev.txt

anastasiamacaxw_: this is on 2.0.0 - https://bugs.launchpad.net/bugs/1636634 ... do we recommend to upgrade? and to what version?00:49
mupBug #1636634: azure controller becomes unusable after a few days <juju:Triaged by alexis-bruemmer> <https://launchpad.net/bugs/1636634>00:49
axw_anastasiamac: 2.0.2 has several related fixes in it01:04
anastasiamacaxw_: awesome \o/ but it's not out yet AFAIK01:06
axw_anastasiamac: ah, I thought it was. why did it not get released? the binaries are all up there AFAICS01:10
anastasiamacaxw_: m trying to find u to PM01:12
* babbageclunk goes for a run01:14
menn0thumper: sigh... so cs:@03:05
* menn0 tries again03:05
menn0thumper: so cs:~user/name style urls have always been broken03:06
thumperwat?03:06
menn0thumper: for migrations03:06
thumperugh03:06
menn0thumper: the name field is read from the charm's metadata to reconstruct the charm URL03:06
menn0thumper: and that doesn't include the username part03:06
menn0thumper: will have to add an extra query arg to the upload endpoint03:07
thumperurgle03:07
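For illustration, here is a minimal, self-contained Go sketch of the problem menn0 describes: rebuilding a charm URL from only the metadata name drops the ~user portion of a cs:~user/name URL, so the original URL (or at least the owner) has to travel separately, for example as an extra query argument on the charm upload endpoint. All names and the query parameter below are hypothetical, not Juju's actual migration code.

```go
// A minimal sketch of the cs:~user/name problem: the charm metadata only
// knows the bare name, so reconstructing the URL from it loses the owner.
package main

import (
	"fmt"
	"net/url"
)

// charmMetadata stands in for the data available on the target side.
type charmMetadata struct {
	Name   string // e.g. "etcd" - no owner information here
	Series string
}

// reconstructURL mimics the broken behaviour: the ~user part is gone.
func reconstructURL(meta charmMetadata) string {
	return fmt.Sprintf("cs:%s/%s", meta.Series, meta.Name) // cs:xenial/etcd, not cs:~user/xenial/etcd
}

// uploadURL shows the proposed fix: carry the full original charm URL as an
// explicit query argument so nothing is lost in transit ("curl" is a made-up
// parameter name for this sketch).
func uploadURL(endpoint, originalCharmURL string) string {
	v := url.Values{}
	v.Set("curl", originalCharmURL)
	return endpoint + "?" + v.Encode()
}

func main() {
	meta := charmMetadata{Name: "etcd", Series: "xenial"}
	fmt.Println(reconstructURL(meta))                                   // owner is lost
	fmt.Println(uploadURL("/migrate/charms", "cs:~user/xenial/etcd-3")) // owner preserved
}
```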
menn0thumper: ignore that problem for now03:15
thumperI keep adding debug03:16
thumperand presence is looking more and more crackful03:16
menn0thumper: now testing resource migration with the cs:etcd charm which also uses a resource03:16
menn0thumper: it's so hairy I wouldn't be surprised if there's a bug hiding in there03:16
thumpermenn0: interestingly though, the precheck machine presence failure finds the controller machine down03:20
thumperwhich is somewhat ironic03:20
thumperbecause that is the machine talking to the client03:20
thumperit isn't the model agents down03:20
menn0so it's clearly wrong03:20
thumperhmm03:25
thumpermenn0: apparently not03:27
thumperprecheck.go : 110 03:27
thumpermenn0: why do we check all the machines of the controller?03:28
menn0thumper: my line counts are different to yours it seems03:28
thumper208?03:28
thumpercheck that one03:28
thumpercheckController03:28
thumpercalls checkMachines03:28
thumperwhich checks all the machines03:28
thumpernow obviously there is still a problem03:28
thumperbecause the machine is clearly not down03:28
menn0thumper: b/c the migrationmaster and API server are on the controllers03:29
thumpersure, but if there are any workloads deployed in the controller, we clearly don't care03:29
thumperonly care about the apiserver machines03:29
thumperso if we have one apiserver machine down03:29
thumperwe can't migrate off?03:30
menn0thumper: point taken... the check is probably overly cautious03:31
menn0thumper: my thinking was that we want a good place to start from03:31
thumperyeah03:31
menn0thumper: but that's probably unnecessary03:31
thumperI'll keep looking at this failure03:31
menn0thumper: feel free to remove that check03:31
thumperit is clearly wrong03:31
thumperbut unclear as to why it thinks the machine is down03:31
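A rough sketch of the narrower check being suggested here: only require presence for machines that actually host the API server, and ignore any workload machines deployed into the controller model. The types below are hypothetical stand-ins, not the real state package; in Juju the API-server role would be derived from the machine's jobs and liveness from the presence subsystem.

```go
// Sketch: a migration precheck that only cares about API-server machines.
package main

import "fmt"

type machine struct {
	id           string
	hasAPIServer bool // would come from the machine's jobs in Juju
	alive        bool // would come from the presence subsystem in Juju
}

// checkControllerMachines fails only if an API-server machine appears down;
// workload machines hosted in the controller model are skipped.
func checkControllerMachines(machines []machine) error {
	for _, m := range machines {
		if !m.hasAPIServer {
			continue
		}
		if !m.alive {
			return fmt.Errorf("API server machine %s is not alive", m.id)
		}
	}
	return nil
}

func main() {
	machines := []machine{
		{id: "0", hasAPIServer: true, alive: true},
		{id: "1", hasAPIServer: false, alive: false}, // workload host: ignored
	}
	fmt.Println(checkControllerMachines(machines)) // <nil>
}
```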
thumpermenn0: this is just a timing bug03:36
thumpermenn0: sometimes the presence main loop hasn't done a sync fully before the request for alive comes through03:36
thumperin those cases, it finds the machine down03:36
thumperthe reason we see different results for status03:37
thumperis that the full status call uses state connections from the pool03:37
thumperand the migration endpoint does not03:37
thumperit creates new state instances03:37
thumperwhere the presence isn't fully synced before we ask it if the entities are alive03:37
thumperAFAICT03:38
menn0thumper: that makes a whole lot of sense03:38
thumperyeah, because this was as confusing as hell03:38
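A simplified sketch of the timing issue thumper describes, under the assumption that a freshly created presence watcher reports entities as down until its first sync completes. The API below is made up for illustration and is not Juju's presence package; the point is only that callers should wait for the initial sync before trusting Alive().

```go
// Sketch: asking a presence watcher "is this alive?" before its first sync
// gives a false negative; waiting for the sync gives the real answer.
package main

import (
	"fmt"
	"sync"
	"time"
)

type presenceWatcher struct {
	mu     sync.Mutex
	synced chan struct{} // closed once the first sync has run
	alive  map[string]bool
}

func newPresenceWatcher() *presenceWatcher {
	w := &presenceWatcher{synced: make(chan struct{}), alive: map[string]bool{}}
	go func() {
		// Pretend the initial sync against the pings data takes a while.
		time.Sleep(50 * time.Millisecond)
		w.mu.Lock()
		w.alive["machine-0"] = true
		w.mu.Unlock()
		close(w.synced)
	}()
	return w
}

// WaitForSync blocks until the watcher has loaded real presence data.
func (w *presenceWatcher) WaitForSync() { <-w.synced }

func (w *presenceWatcher) Alive(entity string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.alive[entity]
}

func main() {
	w := newPresenceWatcher()
	fmt.Println(w.Alive("machine-0")) // false: asked before the first sync
	w.WaitForSync()
	fmt.Println(w.Alive("machine-0")) // true: same entity, after syncing
}
```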
menn0thumper: I just found another migration issue03:39
thumperyet another?03:39
menn0thumper: the resource migration code needs to handle placeholder resources03:39
menn0thumper: this is where the resource is defined but hasn't been up/downloaded yet03:40
menn0thumper: I just ran into that with the etcd charm03:40
thumperhmm...03:41
thumperor precheck?03:41
menn0thumper: no that won't work. it's perfectly normal for a resource to be a placeholder.03:44
menn0thumper: i'll deal with it.03:44
thumperoh03:44
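An illustrative sketch of the placeholder handling menn0 mentions, assuming a placeholder is a resource that is defined in charm metadata but has no uploaded blob yet: it gets exported as a placeholder rather than treated as an error. The types below are hypothetical, not the juju/resource package.

```go
// Sketch: exporting a model's resources while tolerating placeholders.
package main

import "fmt"

type resource struct {
	name     string
	uploaded bool // false means a placeholder: defined, but no blob yet
}

type exportedResource struct {
	name        string
	placeholder bool
}

func exportResources(resources []resource) []exportedResource {
	var out []exportedResource
	for _, r := range resources {
		if !r.uploaded {
			// Keep the definition so the target model knows the resource
			// exists, but don't try to copy a blob that isn't there.
			out = append(out, exportedResource{name: r.name, placeholder: true})
			continue
		}
		out = append(out, exportedResource{name: r.name})
	}
	return out
}

func main() {
	rs := []resource{
		{name: "snapshot", uploaded: true},
		{name: "etcd-binary", uploaded: false}, // e.g. the etcd charm's resource
	}
	fmt.Println(exportResources(rs))
}
```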
natefinchgah, the maas api is so weird04:16
babbageclunknatefinch: true that05:05
natefinchbabbageclunk: mostly it's the gomaasapi package that I'm complaining about... it gives some really weird errors05:09
babbageclunknatefinch: well, that's ours - you should be complaining to thumper!05:10
natefinchwell, I think it was inherited05:10
babbageclunknatefinch: I mean, I did (little) bits of it too.05:10
babbageclunknatefinch: oh yeah - I'm thinking about the new stuff for talking to maas 2. There's all that weird bit for the 1.0 api.05:12
natefinchthe error messages are just written in such a way that you know no one ever actually expected anyone to read the errors05:16
natefinchlike, if you try to connect to a server that response to <endpoint>/version with something unexpected, you get the error "Requested Map, got <nil>"  .... uh.... what?05:18
natefinchs/response/responds05:22
natefinchso like.. I'm stuck - do I hide the crazy maas errors entirely, which may hide some genuinely useful info?  Or do I show the crazy errors?  Or do I recreate some of the logic externally so I can catch some reasonably common errors, like typoing IP addresses or something, and return an actual reasonable error message?05:29
babbageclunkcan someone review this formatting fix please? https://github.com/juju/juju/pull/6627 05:36
babbageclunkThe bad formatting is stopping me from pushing05:36
natefinchI almost forgot parens around if statements was a thing05:36
natefinchship it05:37
babbageclunknatefinch: ta!05:37
babbageclunknatefinch: I guess the right thing to do is to fix the gomaasapi error handling? It shouldn't be letting JSON "traversal" (I think?) errors get up to the client like that.05:40
babbageclunknatefinch: Although I recognise that might be a much bigger task. Maybe just a targeted fix in the place you're hitting now?05:41
natefinchyeah, I was thinking that05:41
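One possible shape for the targeted fix babbageclunk suggests (a sketch, not the actual provider code): annotate the confusing low-level gomaasapi error at the call site with a message the user can act on, while keeping the original error attached as the cause. checkMAASVersion and versionGetter are made up for illustration; errors.Annotatef is from github.com/juju/errors.

```go
// Sketch: wrap a raw gomaasapi error ("Requested Map, got <nil>") with a
// message that points at the likely cause, without discarding the cause.
package maascheck

import (
	"github.com/juju/errors"
)

// versionGetter stands in for the piece of gomaasapi that fetches
// <endpoint>/version and may fail with JSON-traversal style errors.
type versionGetter interface {
	GetVersion() (string, error)
}

// checkMAASVersion hides the raw noise behind an actionable message; the
// underlying error still shows up in verbose/debug output via the cause.
func checkMAASVersion(endpoint string, g versionGetter) (string, error) {
	version, err := g.GetVersion()
	if err != nil {
		return "", errors.Annotatef(err,
			"could not read the MAAS version from %q; "+
				"check that the endpoint address is correct and the MAAS server is reachable",
			endpoint)
	}
	return version, nil
}
```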
mupBug #1644331 opened: juju-deployer failed on SSL3_READ_BYTES <oil> <uosci> <juju-core:Triaged> <juju-deployer:New> <OPNFV:New> <https://launchpad.net/bugs/1644331>07:01
=== frankban|afk is now known as frankban
voidspacefrobware: did you see you had tests fail on your merge this morning?09:21
voidspacefrobware: PR 661809:21
frobwarevoidspace: I did.09:21
voidspacefrobware: cool, just checking you'd seen09:21
voidspacein other news I got ordained last week09:22
voidspacefrobware: I'm now a priest of the church of the latter day dude09:22
voidspacefrobware: http://dudeism.com/ 09:22
frobwarevoidspace: :)09:25
frobwarevoidspace: congrats!09:29
frobwarevoidspace: it's difficult to separate whether it is just my PR or a general failure.09:40
frobwarevoidspace: macgreagoir was having trouble too09:40
macgreagoirI see tests pass but lxd deployment fail, I think. I'm trying some local lxd testing to see if it's a branch problem.09:42
macgreagoir(If I can get some disk space.)09:42
mgzmornin' all09:50
macgreagoir\o mgz09:59
voidspacemgz: o/10:04
jammgz: do you know what is up with the merge juju check failing on https://github.com/juju/juju/pull/6620 ?10:38
mgzjam: let me have a look10:40
mgzlxd failed with "x509: certificate signed by unknown authority"10:42
mgztrying to talk back to the api server10:42
jammgz: any chance the LXD containers on that machine are bridged onto the hosts network and the bug about LXD provider using gateway as its host machine is acting up?10:48
mgzjam: so, a run afterwards passed10:50
mgzso, it's either something intermittent or the branch really had an effect10:50
perrito666fun way to start the morning, my machine would not boot and while trying to fix it I un-sudoed myself11:26
* perrito666 downloads livecd and goes get a coffee11:26
voidspaceperrito666: fun way to start the day...11:27
perrito666voidspace: and the week11:27
perrito666that must be worth something11:27
voidspaceperrito666: heh11:28
perrito666bbl, errands12:11
rick_hmgz: ping12:15
rick_hmgz: can you please pair up with voidspace and help look at how the testing is setup/working on this OIL problem and see if there's anything that jumps out to you about why it works with 2.0.1 and fails with the version bump commit right after?12:16
mgzrick_h: heya12:17
mgzsure, voidspace, maybe we do hangout and hang out?12:18
rick_hvoidspace: ping for standup12:31
voidspacerick_h: sorry, omw12:35
voidspacemgz: yep, cool12:38
voidspacemgz: ok, this time I see tags/juju-2.0.1 with a custom version number failing13:03
voidspacemgz: which makes more sense to me13:03
voidspacemgz: because then the problem is consistent - use a non-standard stream version in this environment and it fails13:03
voidspacemgz: trying again with this version and then trying vanilla 2.0.1 to confirm13:04
voidspacemgz: and then writing it up13:04
mgzvoidspace: ace, thanks13:06
=== benji_ is now known as benji
voidspacemgz: I'm now seeing the same failure with vanilla 2.0.114:02
voidspacemgz: so having to repeat14:02
voidspacemgz: vanilla 2.0.1 worked for me earlier today14:03
voidspacemgz: if it continues to fail I will try 2.0.2 and if the failure mode is the same then I will conclude that I am fully unable to reproduce the problem14:03
mgzvoidspace: I feel this repo is just not reliable enough...14:03
voidspacemgz: I'll send you the email I *was* going to send you when I thought vanilla 2.0.1 would work14:04
voidspacemgz: I think you might be right14:04
voidspacerick_h: I have updated bug 1642609 and continue to work on it14:07
mupBug #1642609: [2.0.2] many maas nodes left undeployed when deploying multiple models simultaneously on single controller <oil> <oil-2.0> <regression> <juju:Triaged by mfoord> <https://launchpad.net/bugs/1642609>14:07
mupBug #1645729 opened: environment unstable after 1.25.8 upgrade <juju-core:New> <https://launchpad.net/bugs/1645729>14:08
rick_hvoidspace: ty /me looks14:31
voidspacerick_h: I am right in the *middle* of another vanilla 2.0.1 deploy14:32
voidspacerick_h: which I'm sure yesterday worked fine and today the last one just failed in the same way as the custom versions fail for me14:32
rick_hvoidspace: k, can we run this test on other hardware?14:32
voidspacerick_h: so I am almost back to knowing nothing I think14:32
rick_hvoidspace: can we test it on a public cloud, or another MAAS?14:32
voidspacerick_h: it's 50 odd machines14:33
rick_hvoidspace: see if we can isolate it to the OIL hardware or something?14:33
voidspacerick_h: I can try it on a public cloud14:33
voidspacerick_h: I can't do it on my maas14:33
rick_hvoidspace: k, at 50 machines we'll need big credentials I think.14:33
rick_hvoidspace: let me know and I can get you some gce creds that might work14:33
* rick_h hasn't tried 50 machines on there yet14:33
voidspacehah14:34
rick_hvoidspace: can you shoot me the instructions for replicating please?14:34
rick_hvoidspace: I'd like to see how involved it is14:34
voidspacerick_h: sent14:39
voidspacerick_h: what's the current state of the art encrypted messaging service - is it still telegram or something else14:39
rick_hvoidspace: yea, telegram is the usual thing we use at sprints/etc14:40
voidspacerick_h: must install that before we leave for the sprint14:42
voidspacerick_h: mgz: vanilla 2.0.1 failed for me a second time, now retrying (again) 2.0.2 to check the failure mode is the same14:43
voidspacerick_h: (6 models, 49 machines, 63 applications)14:43
rick_hvoidspace: k14:43
voidspaceeven tearing down the environment takes time14:44
rick_hvoidspace: rgr, I'd like to talk to larry on this I think.14:45
voidspacerick_h: yep14:45
voidspacerick_h: I was going to copy him in on that email I sent martin as I assumed 2.0.1 would *work* and then I would have some actual data14:45
voidspacerick_h: as it is I am back to having no useful data I don't think14:46
voidspacerick_h: other than maybe that the repro I have been given isn't reliable14:46
rick_hvoidspace: rgr, let's hold off atm14:46
rick_hvoidspace: take a break off it for a bit while we sort out some cross team bits I think14:46
rick_hmgz: please let me know if anything there looks fishy14:47
mgzI'm reading over the details now14:47
voidspacerick_h: ok, I have a 2.0.2 deploy in progress I will leave running14:47
rick_hvoidspace: rgr14:47
mgzwe really only have two candidate changes in the 2.0.1 to 2.0.2 range14:50
mgzpr #6537 (bump gomaasapi)14:51
mgzpr #6527 (constrain to 3.5GB mem for controllers)14:51
rick_hvoidspace: do we have the ability to track controller metrics, cpu/ram/etc during the deploy?14:51
mgzand that second one would need to be some weird maas machine selection problem to be relevant14:52
rick_hvoidspace: if you have a sec can you jump in the standup early14:53
voidspacerick_h: yep, coming now14:53
* frobware first attempt at bridging only single interfaces ... does not work. :(15:42
frobwareboo15:43
frobwarethough this may only be related to bonds.15:43
rick_hdoh15:43
frobwarerick_h: the behaviour is different to what we had before.15:44
frobwarerick_h: and it could be MAAS 2.1 specific.15:45
frobwareso many permutations. so few automated baselines. :(15:45
* frobware wanders off to see if there's any chocolate in the house.15:46
frobwarerick_h: a little pricy, but dual NICS with vPro - http://www.logicsupply.com/uk-en/mc500-51/ 15:49
rick_hfrobware: ping Mike and Andres on a hardware suggestion and will OK it15:51
frobwarerick_h: are you saying definitely not that one? or just choose what Mike+A already use?15:51
rick_hI think asking what they use/test Maas on seems like a good way to go about it and make sure it's something that will work on Maas.15:52
frobwarerick_h: ok15:52
rick_hfrobware: ^15:52
frobwarerick_h: I'm wondering if my bond issues are a vMAAS issue only. If I manage to ssh in and run `ifdown br-bond0; ifup br-bond0` it springs into life.15:53
frobwarerick_h: but that's the exact same sequence that has just run - and you do see sensible values for routes, configured interfaces, addresses, et al.15:53
rick_hfrobware: k, can we put together a test case and see if we can get help verifying it with someone with real hardware?15:56
frobwarerick_h: can do. just need to clean stuff up. will ping mike and andres in the meantime.16:01
alexisbperrito666, ping18:23
perrito666alexisb: pong18:26
menn0thumper: this is last night's work. it undoes an early design decision regarding resource migrations: https://github.com/juju/juju/pull/6628 20:32
thumperwill look20:33
menn0thumper: cheers20:34
thumpermenn0: looks good20:40
menn0thumper:  thanks20:41
thumpermenn0: https://github.com/juju/juju/pull/6629 20:41
* menn0 looks20:41
menn0thumper: done20:49
thumperta20:50
babbageclunkmenn0, thumper: could you take another look at https://github.com/juju/juju/pull/6622 please?20:52
menn0babbageclunk: forgot to say, ship it21:12
menn0babbageclunk: with a couple of comments21:13
babbageclunkmenn0: thanks! looking now21:13
menn0babbageclunk: once this lands can you also send an email to veebers and torsten about this being ready for the planned CI test?21:14
babbageclunkmenn0: sure21:14
menn0babbageclunk: thinking about it, is ShortWait enough time for the the goroutine to start waiting on the clock?21:17
menn0babbageclunk: that seems like a flaky test waiting to happen21:17
menn0babbageclunk: I think you need to wait up to LongWait21:18
babbageclunkmenn0: ShortWait's 50ms - that should be *heaps* of time for it to catch up, shouldn't it?21:18
babbageclunkmenn0: Ok, I'll bump it up to be on the safe side.21:19
menn0babbageclunk: we see things that *should* happen in ms take seconds on overcommitted test machines all the time21:19
menn0babbageclunk: LongWait is the time we wait for things that *should* happen21:19
menn0babbageclunk: if you change the wait to LongWait (10s) then WaitAdvance will take at least 1s each call21:20
babbageclunkmenn0: Even with LongWait the test still takes 0s - I guess in the usual case on a not-really-loaded machine the other side's already waiting.21:20
menn0babbageclunk: I think WaitAdvance needs to be changed so that pause is a fixed amount21:20
menn0instead of w / 1021:20
menn0maybe pause should just be ShortWait21:21
menn0that would remove my concern about WaitAdvance blowing out test times21:21
menn0b/c it'll finish within ShortWait of the correct number of waiters turning up21:21
babbageclunkmenn0: Yeah, I think so too - I had an idea about how to do it without polling but I haven't had a chance to try it out. We can talk about it at the sprint.21:21
menn0babbageclunk: sounds good. a non polling approach would be preferable21:22
babbageclunkmenn0: Ooh, I thought it was racy but a nice tweak just occurred to me - if notifyAlarms was a chan int and always got the number of waiters (instead of a struct{}{}) that might do it.21:24
menn0babbageclunk: interesting... worth playing with21:25
thumperthe test clock alarms was designed exactly to have a test wait on the signal that showed that something had called clock.After()21:53
alexisbperrito666, I am running a little late21:59
perrito666no worries, I am still logging in22:00
mupBug #1644331 changed: juju-deployer failed on SSL3_READ_BYTES <deployer> <oil> <python> <uosci> <OpenStack Charm Test Infra:Confirmed> <juju:Won't Fix> <juju-core:Won't Fix> <juju-deployer:New> <OPNFV:New> <https://launchpad.net/bugs/1644331>22:06
thumpermenn0: https://github.com/juju/juju/pull/6631 22:07
babbageclunkthumper: yeah, but you can't rely on the fact that there's a message on the alarms channel to mean there's something waiting, because multiple waiters can be removed with one advance.22:16
thumperbabbageclunk: ok22:17
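A sketch of the non-polling approach being discussed, under the assumption that notifyAlarms becomes a chan int reporting the waiter count after each After() call: WaitAdvance can then block on the channel instead of sleeping in a polling loop, and it re-checks the real count under the lock so it stays correct when a single Advance removes several waiters at once. This is hypothetical code, not the real juju testing clock.

```go
// Sketch of a fake test clock whose After() reports the waiter count, so
// WaitAdvance needs no polling pause (only a notification or the timeout).
package fakeclock

import (
	"errors"
	"sync"
	"time"
)

type waiter struct {
	deadline time.Time
	ch       chan time.Time
}

type Clock struct {
	mu           sync.Mutex
	now          time.Time
	waiters      []waiter
	notifyAlarms chan int
}

func New(start time.Time) *Clock {
	return &Clock{now: start, notifyAlarms: make(chan int, 100)}
}

// After registers a waiter and reports the new waiter count on notifyAlarms.
func (c *Clock) After(d time.Duration) <-chan time.Time {
	c.mu.Lock()
	ch := make(chan time.Time, 1)
	c.waiters = append(c.waiters, waiter{deadline: c.now.Add(d), ch: ch})
	count := len(c.waiters)
	c.mu.Unlock()
	select {
	case c.notifyAlarms <- count:
	default: // buffer full; WaitAdvance re-checks the real count anyway
	}
	return ch
}

// Advance moves the clock forward and fires every waiter that has expired.
func (c *Clock) Advance(d time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.now = c.now.Add(d)
	remaining := c.waiters[:0]
	for _, w := range c.waiters {
		if !w.deadline.After(c.now) {
			w.ch <- c.now
		} else {
			remaining = append(remaining, w)
		}
	}
	c.waiters = remaining
}

// WaitAdvance blocks until at least n waiters exist (or timeout expires),
// then advances the clock. There is no polling pause: it only wakes up when
// notifyAlarms says another waiter has arrived.
func (c *Clock) WaitAdvance(d, timeout time.Duration, n int) error {
	expired := time.After(timeout)
	for {
		c.mu.Lock()
		enough := len(c.waiters) >= n
		c.mu.Unlock()
		if enough {
			c.Advance(d)
			return nil
		}
		select {
		case <-c.notifyAlarms:
			// Another After() call happened; re-check the count.
		case <-expired:
			return errors.New("timed out waiting for clock waiters")
		}
	}
}
```

With this shape, the fixed pause discussed above becomes unnecessary: the only waiting is on the notification channel or on the overall timeout.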
menn0thumper: ship it22:23
thumpermenn0: what about a later check?22:24
=== frankban is now known as frankban|afk
thumpermenn0: it is possible that during initiation a hook may be executing22:24
thumperwhich may then put the charm into a failed state22:24
thumperI'm trying to remember22:24
thumperis that OK?22:24
thumperI *think* it is...22:24
alexisbmenn0, did I see that correctly, did MM resources land?22:27
menn0alexisb: no, just a step towards it22:27
menn0alexisb: during my testing yesterday I realised that an early design decision wasn't going to work out - that PR reverses it22:28
menn0well changes it22:28
babbageclunkmenn0, thumper - what's the timestamp granularity of our log messages? Is it nanos?23:15
thumperbabbageclunk: maybe23:22
babbageclunkthumper: :) thanks23:22
babbageclunkthumper: looks like it from the code.23:22
babbageclunkthumper: or at least, the storage won't chop off any nanos that are there.23:22
thumperomg this pie is good23:22
axwwallyworld jam menn0 katco: I won't be able to make tech board today, doing the roster at my son's kindy23:44
