[00:01] <babbageclunk> wallyworld: Can you take another look at https://github.com/juju/juju/pull/7340? I think I've resolved your comments.
[00:01] <wallyworld> sure
[00:05] <babbageclunk> Thanks
[00:19] <wallyworld> babbageclunk: done
[00:19] <babbageclunk> wallyworld: cheers.
[00:22] <babbageclunk> wallyworld: I've already done the migration/dump-model work! :) Although migration_import refuses to import a model with remote applications - I don't want to just enable it because it probably won't just work.
[00:22] <wallyworld> sure, ok
[00:24] <babbageclunk> I guess mainly because at the moment we're talking about cross-model/same-controller, and migration is inherently between controllers.
[01:20]  * thumper takes dog for a walk
[02:40] <babbageclunk> wallyworld, menn0: huh, haven't seen a test fail like this before: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10891/artifact/artifacts/xenial.log
[02:41] <babbageclunk> Do you think it's a problem?
[02:42] <menn0> babbageclunk: well it's certainly a problem :p
[02:42] <menn0> babbageclunk: maybe mention it to mwhudson ?
[02:42] <mwhudson> eh that's just a nil pointer deference isn't it?
[02:42] <wallyworld> menn0: babbageclunk: that's a known bug, let me look up the number
[02:42] <mwhudson> dereference
[02:43] <wallyworld> bug 1686720
[02:43] <mup> Bug #1686720: worker/lease: NewDeadManager returns manager that crashes <juju:Triaged> <https://launchpad.net/bugs/1686720>
[02:43] <wallyworld> it's not a go thing, it's a juju NPE IIANM
[02:44] <babbageclunk> wallyworld: ok, if it's got a bug I'ma just rerun. :/
[03:48] <menn0> wallyworld, babbageclunk or anastasiamac: quick help fix. https://github.com/juju/juju/pull/7343
[03:48] <wallyworld> sure
[03:49] <menn0> axw: ping
[03:49] <axw> menn0: pong
[03:50] <menn0> axw: I had a poke at this: http://reports.vapour.ws/releases/5227/job/run-unit-tests-win2012-amd64/attempt/3720
[03:50] <menn0> i know what the problem is but I don't know what the correct fix is
[03:50] <babbageclunk> menn0: reading now
[03:50] <axw> doh
[03:50] <axw> menn0: I can take it over if you like
[03:50] <menn0> axw: ok, happy to hand it over (there's a LK card already)
[03:51] <axw> okey dokey
[03:51] <menn0> axw: i'm curious though, are those paths supposed to be unix only?
[03:51] <axw> menn0: yep
[03:51] <axw> menn0: storage is not currently supported on windows
[03:51] <menn0> axw: ok, then the tests are slightly wrong
[03:52] <menn0> axw: the tests shouldn't use filepath.FromSlash
[03:52] <wallyworld> menn0: i added a suggestion, but lgtm
[03:52] <menn0> axw: and BlockDevicePath shouldn't use filepath.Join
[03:52] <menn0> wallyworld: cheers
[03:53] <axw> menn0: yup. I'll fix that up, thanks.
[03:53] <menn0> axw: cheers
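For context, menn0's point comes down to filepath being OS-aware while these storage paths are always unix-style. A minimal sketch of the difference (the device path below is made up for illustration, not taken from the juju code):

```go
package main

import (
	"fmt"
	"path"          // always slash-separated, OS-independent
	"path/filepath" // uses the host OS separator
)

func main() {
	// On Windows, filepath.Join and filepath.FromSlash produce backslashes,
	// which is wrong for a path that only ever exists on a Linux machine.
	fmt.Println(filepath.Join("/dev/disk/by-id", "wwn-0xexample"))   // \dev\disk\by-id\wwn-0xexample on Windows
	fmt.Println(filepath.FromSlash("/dev/disk/by-id/wwn-0xexample")) // same problem
	fmt.Println(path.Join("/dev/disk/by-id", "wwn-0xexample"))       // /dev/disk/by-id/wwn-0xexample everywhere
}
```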
[03:56] <menn0> wallyworld: good call. doing that now.
[04:03] <babbageclunk> menn0: D'oh, took too long with my comments.
[04:07] <anastasiamac> babbageclunk: but ur comments were totally worth it :) i wonder if i could give menn0 a resources-related bug for jumping the gun?... babbageclunk, want to negotiate it for me? :D
[04:09] <axw> menn0: mind reviewing? https://github.com/juju/juju/pull/7344
[04:10] <menn0> axw: otp but yes afterwards
[04:10] <menn0> babbageclunk: good comments, i'll follow up
[04:10] <babbageclunk> wallyworld: got a moment for a sanity check? Or are you also otp?
[04:10] <babbageclunk> menn0: ok thanks!
[04:10] <babbageclunk> menn0: especially on the diaeresis one.
[04:11] <wallyworld> babbageclunk: otp, give me a little time
[04:15] <babbageclunk> wallyworld: I mean, you're my second choice in this case anyway - is thumper-afk in the same call?
[04:25] <axw> thanks jam
[04:29] <wallyworld> babbageclunk: tim isn't on this call
[04:30] <babbageclunk> ah, ok - he's actually afk then, presumably
[04:56] <wallyworld> babbageclunk: free now
[04:57] <menn0> babbageclunk: luckily the merge attempt failed so I can get your suggestions in
[04:57] <menn0> but not the diaeresis one :p
[04:57] <babbageclunk> menn0: tsk!
[04:57] <menn0> pre-existing maybe?
[04:59] <babbageclunk> menn0: I mean, I was just kidding with the tsk!
[05:00] <babbageclunk> I'm fine with no hyphen.
[05:01] <babbageclunk> wallyworld: hey, so my confusion was about this bit of code: https://github.com/juju/juju/blob/develop/worker/provisioner/provisioner_task.go#L739
[05:01] <babbageclunk> I'd expect that I'd be able to see those "failed to start instance" messages from the provisioner in the log, but I don't
[05:02] <babbageclunk> (Oh, this is re:that maas timeout bug btw)
[05:02] <wallyworld> did you figure it out?
[05:02] <babbageclunk> wallyworld: no
[05:02] <wallyworld> are you sure that code is being run?
[05:03] <babbageclunk> wallyworld: Well, that's the bit that thumper-afk was saying was the problem, and the thing that would be affected by changing the retry count.
[05:05] <wallyworld> sure, but maybe the code path isn't hitting that line in your deployment. since when do you believe what tim says :-P
[05:05] <babbageclunk> wallyworld: it makes sense that that would be the problem, but if it is I don't understand why I can't see that in the log.
[05:05] <babbageclunk> wallyworld: Oh, I mean in the log file attached to the bug.
[05:05] <wallyworld> in the past i've had similar issues and it did turn out the code wasn't running. are you sure the right agent binary is being used?
[05:05] <thumper-afk> :P
[05:06] <wallyworld> have you updated your source code post beta4
[05:06] <wallyworld> otherwise you'll be using agent from simplestreams
[05:06] <wallyworld> unless you used --build-agent
[05:06] <babbageclunk> wallyworld: I'm not running anything - I don't know how to make my maas timeout in the same way.
[05:08] <wallyworld> babbageclunk: does the log indicate that this error bubbles up? "cannot start instance for machine"
[05:08] <wallyworld> i'm trying to establish if we really do get into that retry loop
[05:08] <babbageclunk> Right - I don't think we do, because we don't see that logging.
[05:09] <babbageclunk> Which would indicate that the retry count isn't the problem. (Unless we wouldn't expect to see the message in this log for some reason.)
[05:10] <wallyworld> or maybe we do and err is nil
[05:10] <wallyworld> maas says it will start the instance
[05:10] <wallyworld> it then goes through a lifecycle of allocating->deployed or something
[05:10] <babbageclunk> wallyworld: in which case changing the retry count won't help.
[05:10] <wallyworld> so juju never gets to retry as it doesn't think it needs to
[05:10] <wallyworld> exactly
[05:11] <wallyworld> it may be the wrong bit of code is being looked at here?
[05:11] <wallyworld> what does juju do after starting the instance - does it poll it until it goes to the deployed state?
[05:12] <wallyworld> doesn't look like it
[05:12] <wallyworld> it just assumes it is started
[05:12] <babbageclunk> wallyworld: hang on - trying to chase through the provider code.
[05:13] <wallyworld> maybe that's a bad assumption. i know with openstack StartInstance returns but that just means the start request is accepted
[05:13] <wallyworld> the instance itself then goes through a lifecycle - building, blah....
[05:13] <wallyworld> before it is ready
[05:19] <babbageclunk> wallyworld: ok, thanks for confirming my thinking - still chasing then
[05:19] <wallyworld> np, good luck!
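For reference, a self-contained sketch of the hypothesis being discussed here (the provider type and its methods are made up, not the real juju API): if StartInstance reports success as soon as the provider accepts the request, there is no error for the retry branch to act on, and only polling the instance state would catch a node stuck in allocating.

```go
package main

import (
	"fmt"
	"time"
)

// fakeMAAS stands in for a provider that accepts start requests immediately
// but only reaches the "deployed" state some time later.
type fakeMAAS struct{ deployedAt time.Time }

// StartInstance returns nil as soon as the request is accepted, even though
// the node has not actually been deployed yet.
func (m *fakeMAAS) StartInstance() error { return nil }

func (m *fakeMAAS) State() string {
	if time.Now().After(m.deployedAt) {
		return "deployed"
	}
	return "allocating"
}

func main() {
	maas := &fakeMAAS{deployedAt: time.Now().Add(300 * time.Millisecond)}

	// This mirrors the suspected code path: err is nil, so the
	// "failed to start instance" retry branch never runs, and raising the
	// retry count changes nothing.
	if err := maas.StartInstance(); err != nil {
		fmt.Println("failed to start instance, would retry:", err)
		return
	}
	fmt.Println("start accepted; node state is still:", maas.State())

	// To know the machine really came up, something would have to poll.
	for maas.State() != "deployed" {
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("node state now:", maas.State())
}
```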
[07:11] <jam> axw: ping
[07:11] <axw> jam: pong
[07:11] <jam> axw: you have a PR #6326 that is about 8 months old
[07:11] <jam> axw: azure using simplestreams
[07:11] <jam> is there something we should close/unblock/???
[07:12] <axw> jam: yeah, it got stalled numerous times. we need to ensure streams.canonical.com has the azure bits. probably not going to happen until 2.3 at the earliest now
[07:12] <axw> jam: we were close to a resolution at barcelona, but now aaron's gone and he was handling the streams side
[07:12] <jam> axw: should we close that PR and come back to it?
[07:13] <axw> jam: can do I guess, though the code shouldn't change much
[07:13] <axw> jam: is it harmful leaving it there? distracting?
[07:13] <jam> I'd like to get us to PRs that are actionable
[07:13] <jam> so when we do a scan we can know rather than having a bunch of stuff we have to ignore
[07:13] <axw> jam: fair enough. no drama to recreate it later, I'll close
[07:13] <jam> we're a bit far from that, but it would be nice
[07:14] <axw> I agree
[07:14] <jam> axw: you also have a 1.8 branch that didn't land 3 days ago?
[07:15] <axw> jam: yep, I will get back to that one real soon. needs some fixes in the 1.25 branch
[07:15]  * axw creates a card to get it done
[09:44] <axw> wallyworld_: can you please review https://github.com/juju/juju/pull/7346 tomorrow, unless some kind soul does it overnight? the meat and potatoes are in the first two commits
[11:16] <wallyworld_> axw: looking
[13:35] <wpk> Any volunteers? https://github.com/juju/juju/pull/7348
[16:28] <gsamfira> heya
[16:29] <gsamfira> anyone want to look over a small PR? https://github.com/juju/utils/pull/279
[21:50] <balloons> menn0, not sure who all else is in the tech board, but I'd appreciate a look at https://github.com/juju/juju/pull/7350
[22:04] <wallyworld> babbageclunk: did you find where juju is not sufficiently retrying the provisioning?
[22:13] <babbageclunk> wallyworld: the Juju log attached to that bug doesn't show the provisioning failure - I've asked for another one that does. I want to check with thumper that he's sure that's the problem - it seems likely, but it'd be good to have the log showing it.
[22:15] <wallyworld> babbageclunk: it would be good yes. we could though make a preemptive fix and increase retry timeout based on knowledge that machines can take a while to provision; this will unblock the fix and if/when we get better logs we can confirm
[22:15] <babbageclunk> wqll
[22:15] <babbageclunk> oops
[22:16] <babbageclunk> wallyworld: yeah, that's my thinking too.
[22:16] <babbageclunk> wallyworld: https://github.com/juju/juju/pull/7351
[22:16] <wallyworld> looking
[22:17] <babbageclunk> It's pretty short
[22:17] <wallyworld> babbageclunk: i could comment but i won't :-)
[22:17]  * babbageclunk sniggers
[22:18] <wallyworld> babbageclunk: is 10 x 10s enough?
[22:18] <wallyworld> babbageclunk: i wonder also, we should surface the retry attempts in status
[22:19] <wallyworld> there's an api to surface progress
[22:19] <wallyworld> we could ensure the message says "Waiting for machine to be provisioned, attempt 1 or 10" or something
[22:19] <babbageclunk> wallyworld: machine.SetInstanceStatus - yeah, could do. I'll add that.
[22:19] <wallyworld> so the user can see stuff is happening
[22:20] <wallyworld> 1 of 10
[22:22] <babbageclunk> According to thumper, this happens occasionally (I think he said something like 10% of the time?), so normally 3 tries is enough. 10 seems like a reasonable number to bump it to in that case?
[22:24] <babbageclunk> wallyworld: oh, it turns out it already does that: failed to start instance (%s), retrying in %v (%d more attempts)
[22:24] <wallyworld> cool, ok
[22:25] <wallyworld> let's land it and take a view
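For illustration, a hedged sketch of surfacing retry progress in instance status. The statusSetter interface and its signature are invented for the sketch; the chat only establishes that machine.SetInstanceStatus exists and that the provisioner's retry log line already reports remaining attempts, and the format string below is quoted from that log line.

```go
package main

import (
	"fmt"
	"time"
)

// statusSetter is a stand-in for whatever surfaces instance status; the
// signature here is made up for the sketch.
type statusSetter interface {
	SetInstanceStatus(status, message string) error
}

type printStatus struct{}

func (printStatus) SetInstanceStatus(status, message string) error {
	fmt.Printf("instance status: %s - %s\n", status, message)
	return nil
}

// startWithRetries retries a start function, surfacing each failed attempt
// in instance status so the user can see that something is happening.
func startWithRetries(m statusSetter, start func() error, attempts int, delay time.Duration) error {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = start(); err == nil {
			return nil
		}
		msg := fmt.Sprintf("failed to start instance (%v), retrying in %v (%d more attempts)",
			err, delay, attempts-i)
		_ = m.SetInstanceStatus("provisioning", msg)
		time.Sleep(delay)
	}
	return err
}

func main() {
	calls := 0
	start := func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("node still allocating")
		}
		return nil
	}
	// 10 attempts, short delay for the demo (the PR discussion is 10 x 10s).
	_ = startWithRetries(printStatus{}, start, 10, 10*time.Millisecond)
}
```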
[22:26] <babbageclunk> wallyworld: kind of an annoying amount of digging around in logs for a 1-line PR
[22:26] <wallyworld> it's the old mechanic's paradigm - the fix is trivial, you just have to know where to apply the hammer
[22:27] <babbageclunk> sure
[22:28] <babbageclunk> wallyworld: ok, I'll move on to getting rid of that annoying unknown collection "remoteApplications" message.
[22:29] <wallyworld> +1 ty
[22:29] <wallyworld> assuming it was the last pr to land which introduced it
[22:29] <wallyworld> i think it must have been because it was all ok before then IIANM
[22:34] <babbageclunk> wallyworld: I don't think it was the last PR - it's in the logs on bug 1688028 too
[22:34] <mup> Bug #1688028: Juju fails to request deploy after busy node returns OK <cdo-qa> <cdo-qa-blocker> <juju:In Progress by 2-xtian> <https://launchpad.net/bugs/1688028>
[22:34] <wallyworld> ok, otp, will catch up soon
[22:35] <gsamfira> anybody want to take a stab at a couple of really tiny PRs? :)
[22:52] <babbageclunk> gsamfira: sure!
[22:57] <wallyworld> thumper: i had also pinged ante and andres in #juju to confirm that maas storage issue is fixed so hopefully we'll be able to close the issue real soon
[22:58] <thumper> wallyworld: I updated the bug to incomplete :)
[22:58] <wallyworld> yay :-)
[22:58] <wallyworld> hopefully nothing more to do
[23:10] <gsamfira> babbageclunk: https://github.com/juju/utils/pull/279 and https://github.com/juju/charm/pull/228
[23:10] <gsamfira> babbageclunk: thanks! ^_^
[23:22] <babbageclunk> gsamfira: Why do you think using defer might lead to leaked file handles?
[23:27] <gsamfira> babbageclunk: It was an improper use of the word. The whole defer fh.Close() pattern, while I get it's useful for closing a file even if a panic happens along the way, can cause ulimit errors when, for example, unarchiving an archive with many files.
[23:28] <babbageclunk> gsamfira: ok, I thought it might be something like that - so in a function with a loop extracting many files.
[23:28] <gsamfira> babbageclunk: yap
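For context, a minimal sketch of the ulimit problem being described (names are illustrative, not the juju/utils code): defers only run when the surrounding function returns, so deferring Close inside an extraction loop keeps every handle open until the whole archive is done, and a per-file helper function is the usual fix.

```go
package main

import (
	"fmt"
	"os"
)

// extractAllLeaky defers every Close inside the loop, so no handle is
// released until the function returns; with a large archive this can hit
// the process ulimit.
func extractAllLeaky(names []string) error {
	for _, name := range names {
		f, err := os.Open(name)
		if err != nil {
			return err
		}
		defer f.Close() // not run until ALL files have been processed
		// ... write f out to disk ...
	}
	return nil
}

// extractOne holds the handle only for the duration of one file; its defer
// runs as soon as this function returns.
func extractOne(name string) error {
	f, err := os.Open(name)
	if err != nil {
		return err
	}
	defer f.Close()
	// ... write f out to disk ...
	return nil
}

func extractAll(names []string) error {
	for _, name := range names {
		if err := extractOne(name); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(extractAll(nil)) // no files: prints <nil>
}
```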
[23:30] <gsamfira> babbageclunk: there is another issue with that pattern. On Windows for example, if you try something like (pseudo code follows): fh := Open("myfile"); defer fh.Close(); fh.Write("hello"); io.Move("myfile", "awesome_name")
[23:30] <gsamfira> it will fail
[23:30] <babbageclunk> gsamfira: That said, I think the places you're changing in the utils PR would be better done by changing the deferred closes to be sync+closes. The loop problem doesn't apply there.
[23:30] <gsamfira> (the move at least)
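For illustration, a sketch of the close-before-rename pattern that avoids the Windows issue described above (names are illustrative, not the juju code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeAndRename writes a temp file and renames it into place. The explicit
// Close before os.Rename matters: with `defer f.Close()` the handle would
// still be open when the rename runs, which the chat reports fails on
// Windows because open files there can't be renamed over.
func writeAndRename(tmp, final string) error {
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write([]byte("hello")); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil { // release the handle before renaming
		return err
	}
	return os.Rename(tmp, final)
}

func main() {
	dir, err := os.MkdirTemp("", "rename-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	fmt.Println(writeAndRename(filepath.Join(dir, "tmp"), filepath.Join(dir, "final")))
}
```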
[23:30] <babbageclunk> gsamfira: oh, right. Is that why you moved the sync+close to be before the call to the change func in AtomicWriteFileAndChange?
[23:30] <gsamfira> would it not be useful to return any errors while doing a sync/close?
[23:31] <gsamfira> yap
[23:31] <gsamfira> also in zip.go, in the case of copyTo
[23:33] <babbageclunk> gsamfira: for returning errors, yes, but not if we're already returning an error - we wouldn't want to clobber it.
[23:34] <thumper> menn0, wallyworld: I won't be at the tech-board meeting today
[23:34] <thumper> have taxi duty :-|
[23:34] <menn0> thumper: ok
[23:34] <wallyworld> thumper: no worries, we'll cover for you :-)
[23:35] <gsamfira> babbageclunk:  not really clobbering, I think. We still return the original error if that's where it happens. But we also get an error returned if we fail to flush to disk, or close the file...which I think is an important error to know :)
[23:35] <gsamfira> otherwise we cannot guarantee that the config file or state file is consistent
[23:36] <babbageclunk> Oh, I mean, it would be better to make the defer do sync+close and return any error (avoiding clobbering an error already coming back) than just putting the sync+close on the happy path.
[23:37] <babbageclunk> gsamfira: ^
[23:37] <babbageclunk> does that make sense?
[23:40] <babbageclunk> gsamfira: I'll put what I mean into the review.
[23:40] <gsamfira> babbageclunk: yup. Makes sense
[23:40] <babbageclunk> gsamfira: cool
[23:40] <gsamfira> babbageclunk: sounds good. helps to add context :)
[23:41] <gsamfira> I'll address it in the morning. Grabbing some sleep for tonight :)
[23:41] <gsamfira> thanks for looking over it!
[23:41] <gsamfira> babbageclunk: ^
[23:49] <babbageclunk> gsamfira: no worries! I started typing it up and it's unwieldy enough that I don't think it's worth it - I think yours is fine in that case.
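For reference, the pattern suggested above looks roughly like this (an illustrative sketch, not the actual juju/utils code): the deferred closure syncs and closes on every exit path and assigns its error to the named return only when no earlier error is already being returned, so the original failure is never clobbered.

```go
package main

import (
	"fmt"
	"os"
)

// writeFile syncs and closes in a deferred closure, reporting those errors
// through the named return value without overwriting an earlier error.
func writeFile(name string, data []byte) (err error) {
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	defer func() {
		if syncErr := f.Sync(); syncErr != nil && err == nil {
			err = syncErr
		}
		if closeErr := f.Close(); closeErr != nil && err == nil {
			err = closeErr
		}
	}()

	_, err = f.Write(data)
	return err
}

func main() {
	fmt.Println(writeFile("demo.txt", []byte("hello")))
	os.Remove("demo.txt")
}
```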