[00:01] wallyworld: Can you take another look at https://github.com/juju/juju/pull/7340? I think I've resolved your comments.
[00:01] sure
[00:05] Thanks
[00:19] babbageclunk: done
[00:19] wallyworld: cheers.
[00:22] wallyworld: I've already done the migration/dump-model work! :) Although migration_import refuses to import a model with remote applications - I don't want to just enable it because it probably won't just work.
[00:22] sure, ok
[00:24] I guess mainly because at the moment we're talking about cross-model/same-controller, and migration is inherently between controllers.
[01:20] * thumper takes dog for a walk
[02:40] wallyworld, menn0: huh, haven't seen a test fail like this before: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10891/artifact/artifacts/xenial.log
[02:41] Do you think it's a problem?
[02:42] babbageclunk: well it's certainly a problem :p
[02:42] babbageclunk: maybe mention it to mwhudson ?
[02:42] eh that's just a nil pointer deference isn't it?
[02:42] menn0: babbageclunk: that's a known bug, let me look at the number
[02:42] dereference
[02:43] bug 1686720
[02:43] Bug #1686720: worker/lease: NewDeadManager returns manager that crashes
[02:43] it's not a go thing, it's a juju npe IIANM
[02:44] wallyworld: ok, if it's got a bug I'mm'a just rerun. :/
=== thumper is now known as thumper-afk
[03:48] wallyworld, babbageclunk or anastasiamac: quick help fix. https://github.com/juju/juju/pull/7343
[03:48] sure
[03:49] axw: ping
[03:49] menn0: pong
[03:50] axw: I had a poke at this: http://reports.vapour.ws/releases/5227/job/run-unit-tests-win2012-amd64/attempt/3720
[03:50] i know what the problem is but I don't know what the correct fix is
[03:50] menn0: reading now
[03:50] doh
[03:50] menn0: I can take it over if you like
[03:50] axw: ok, happy to hand it over (there's a LK card already)
[03:51] okey dokey
[03:51] axw: i'm curious though, are those paths supposed to be unix only?
[03:51] menn0: yep
[03:51] menn0: storage is not currently supported on windows
[03:51] axw: ok, then the tests are slightly wrong
[03:52] axw: the tests shouldn't use filepath.FromSlash
[03:52] menn0: i added a suggestion, but lgtm
[03:52] axw: and BlockDevicePath shouldn't use filepath.Join
[03:52] wallyworld: cheers
[03:53] menn0: yup. I'll fix that up, thanks.
[03:53] axw: cheers
[03:56] wallyworld: good call. doing that now.
[04:03] menn0: D'oh, took too long with my comments.
[04:07] babbageclunk: but ur comments were totally worth it :) i wonder if i could give menn0 a resources related bug for jumping the gun?... babbageclunk, want to negotiate it for me? :D
[04:09] menn0: mind reviewing? https://github.com/juju/juju/pull/7344
[04:10] axw: otp but yes afterwards
[04:10] babbageclunk: good comments, i'll follow up
[04:10] wallyworld: got a moment for a sanity check? Or are you also otp?
[04:10] menn0: ok thanks!
[04:10] menn0: especially on the diaeresis one.
[04:11] babbageclunk: otp, give me a little time
[04:15] wallyworld: I mean, you're my second choice in this case anyway - is thumper-afk in the same call?
[04:25] thanks jam
[04:29] babbageclunk: tim isn't on this call
[04:30] ah, ok - he's actually afk then, presumably
[04:56] babbageclunk: free now
[04:57] babbageclunk: luckily the merge attempt failed so I can get your suggestions in
[04:57] but not the diaeresis one :p
[04:57] menn0: tsk!
[04:57] pre-existing maybe?
[04:59] menn0: I mean, I was just kidding with the tsk!
[05:00] I'm fine with no hyphen.
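
For context on the path exchange above: on Windows, filepath.Join and filepath.FromSlash rewrite separators to backslashes, while path.Join always produces forward slashes, which is what a unix-only storage path needs. A minimal sketch of the point (blockDevicePath here is an illustrative stand-in, not juju's actual BlockDevicePath):

package main

import (
	"fmt"
	"path"          // always uses forward slashes
	"path/filepath" // uses the host OS separator (backslash on Windows)
)

// blockDevicePath builds a Linux device path. Because these paths are
// unix-only regardless of where the client or test runs, path.Join is the
// right choice: on Windows, filepath.Join would produce `\dev\sdb` instead
// of `/dev/sdb`.
func blockDevicePath(deviceName string) string {
	return path.Join("/dev", deviceName)
}

func main() {
	fmt.Println(blockDevicePath("sdb"))                    // /dev/sdb on every platform
	fmt.Println(filepath.Join("/dev", "sdb"))              // \dev\sdb on Windows
	fmt.Println(filepath.FromSlash("/dev/disk/by-id/abc")) // separators rewritten on Windows
}
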
[05:01] wallyworld: hey, so my confusion was about this bit of code: https://github.com/juju/juju/blob/develop/worker/provisioner/provisioner_task.go#L739
[05:01] I'd expect that I'd be able to see those "failed to start instance" messages from the provisioner in the log, but I don't
[05:02] (Oh, this is re: that maas timeout bug btw)
[05:02] did you figure it out?
[05:02] wallyworld: no
[05:02] are you sure that code is being run?
[05:03] wallyworld: Well, that's the bit that thumper-afk was saying was the problem, and the thing that would be affected by changing the retry count.
[05:05] sure, but maybe the code path isn't hitting that line in your deployment. since when do you believe what tim says :-P
[05:05] wallyworld: it makes sense that that would be the problem, but if it is I don't understand why I can't see that in the log.
[05:05] wallyworld: Oh, I mean in the log file attached to the bug.
[05:05] in the past i've had similar issues and it did turn out the code wasn't running. are you sure the right agent binary is being used?
[05:05] :P
[05:06] have you updated your source code post beta4?
[05:06] otherwise you'll be using the agent from simplestreams
[05:06] unless you used --build-agent
[05:06] wallyworld: I'm not running anything - I don't know how to make my maas timeout in the same way.
[05:08] babbageclunk: does the log indicate that this error bubbles up? "cannot start instance for machine"
[05:08] i'm trying to establish if we really do get into that retry loop
[05:08] Right - I don't think we do, because we don't see that logging.
[05:09] Which would indicate that the retry count isn't the problem. (Unless we wouldn't expect to see the message in this log for some reason.)
[05:10] or maybe we do and err is nil
[05:10] maas says it will start the instance
[05:10] it then goes through a lifecycle of allocating->deployed or something
[05:10] wallyworld: in which case changing the retry count won't help.
[05:10] so juju never gets to retry as it doesn't think it needs to
[05:10] exactly
[05:11] it may be that the wrong bit of code is being looked at here?
[05:11] what does juju do after starting the instance - does it poll it until it goes to deployed state?
[05:12] doesn't look like it
[05:12] it just assumes it is started
[05:12] wallyworld: hang on - trying to chase through the provider code.
[05:13] maybe that's a bad assumption. i know with openstack startinstance returns but that just means the start request is accepted
[05:13] the instance itself then goes through a lifecycle - building, blah....
[05:13] before it is ready
[05:19] wallyworld: ok, thanks for confirming my thinking - still chasing then
[05:19] np, good luck!
[07:11] axw: ping
[07:11] jam: pong
[07:11] axw: you have a PR #6326 that is about 8 months old
[07:11] axw: azure using simplestreams
[07:11] is there something we should close/unblock/???
[07:12] jam: yeah, it got stalled numerous times. we need to ensure streams.canonical.com has the azure bits. probably not going to happen until 2.3 at the earliest now
[07:12] jam: we were close to a resolution at barcelona, but now aaron's gone and he was handling the streams side
[07:12] axw: should we close that PR and come back to it?
[07:13] jam: can do I guess, though the code shouldn't change much
[07:13] jam: is it harmful leaving it there? distracting?
[07:13] I'd like to get us to PRs that are actionable
[07:13] so when we do a scan we can know rather than having a bunch of stuff we have to ignore
[07:13] jam: fair enough. no drama to recreate it later, I'll close
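
A minimal sketch of the retry shape discussed in the provisioner conversation above (illustrative only, not the actual provisioner_task.go code; startInstance and the parameter names are stand-ins). The point under debate is that the retry branch and its "failed to start instance" log line only run when the provider returns an error; if MAAS accepts the start request and returns nil, juju assumes the machine started, so the retry count never comes into play and no retry log lines appear:

package main

import (
	"fmt"
	"log"
	"time"
)

// startInstance is a stand-in for a provider StartInstance call. With MAAS
// it can return nil as soon as the start request is accepted, even though
// the node still has to work through its allocating -> deployed lifecycle.
func startInstance(machineID string) error {
	return nil // request accepted; the node may still fail to deploy later
}

// provisionMachine mirrors the shape of the loop being discussed: the retry
// path is only reached when startInstance returns an error, and the
// deployed state is never polled after a nil return.
func provisionMachine(machineID string, attempts int, delay time.Duration) error {
	var lastErr error
	for remaining := attempts; remaining > 0; remaining-- {
		err := startInstance(machineID)
		if err == nil {
			return nil // assumed started; no further polling here
		}
		lastErr = err
		log.Printf("failed to start instance (%v), retrying in %v (%d more attempts)",
			err, delay, remaining-1)
		if remaining > 1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("cannot start instance for machine %q: %v", machineID, lastErr)
}

func main() {
	if err := provisionMachine("0", 3, 10*time.Second); err != nil {
		log.Print(err)
	}
}
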
[07:13] we're a bit far from that, but it would be nice
[07:14] I agree
[07:14] axw: you also have a 1.8 branch that didn't land 3 days ago?
[07:15] jam: yep, I will get back to that one real soon. needs some fixes in the 1.25 branch
[07:15] * axw creates a card to get it done
=== frankban|afk is now known as frankban
[09:44] wallyworld_: can you please review https://github.com/juju/juju/pull/7346 tomorrow, unless some kind soul does it overnight? the meat and potatoes are in the first two commits
[11:16] axw: looking
[13:35] Any volunteers? https://github.com/juju/juju/pull/7348
[16:28] heya
[16:29] anyone want to look over a small PR? https://github.com/juju/utils/pull/279
=== frankban is now known as frankban|afk
[21:50] menn0, not sure who all else is in the tech board, but I'd appreciate a look at https://github.com/juju/juju/pull/7350
[22:04] babbageclunk: did you find where juju is not sufficiently retrying the provisioning?
[22:13] wallyworld: the Juju log attached to that bug doesn't show the provisioning failure - I've asked for another one that does. I want to check with thumper that he's sure that's the problem - it seems likely, but it'd be good to have the log showing it.
[22:15] babbageclunk: it would be good, yes. we could though make a preemptive fix and increase the retry timeout based on the knowledge that machines can take a while to provision; this will unblock the fix and if/when we get better logs we can confirm
[22:15] wqll
[22:15] oops
[22:16] wallyworld: yeah, that's my thinking too.
[22:16] wallyworld: https://github.com/juju/juju/pull/7351
[22:16] looking
[22:17] It's pretty short
[22:17] babbageclunk: i could comment but i won't :-)
[22:17] * babbageclunk sniggers
[22:18] babbageclunk: is 10 x 10s enough?
[22:18] babbageclunk: i wonder also, we should surface the retry attempts in status
[22:19] there's an api to surface progress
[22:19] we could ensure the message says "Waiting for machine to be provisioned, attempt 1 or 10" or something
[22:19] wallyworld: machine.SetInstanceStatus - yeah, could do. I'll add that.
[22:19] so the user can see stuff is happening
[22:20] 1 of 10
[22:22] According to thumper, this happens occasionally (I think he said something like 10% of the time?), so normally 3 tries is enough. 10 seems like a reasonable number to bump it to in that case?
[22:24] wallyworld: oh, it turns out it already does that: failed to start instance (%s), retrying in %v (%d more attempts)
[22:24] cool, ok
[22:25] let's land it and take a view
[22:26] wallyworld: kind of an annoying amount of digging around in logs for a 1-line PR
[22:26] it's the old mechanic paradigm - the fix is trivial, you just have to know where to apply the hammer
[22:27] sure
[22:28] wallyworld: ok, I'll move on to getting rid of that annoying unknown collection "remoteApplications" message.
[22:29] +1 ty
[22:29] assuming it was the last pr to land which introduced it
[22:29] i think it must have been because it was all ok before then IIANM
[22:34] wallyworld: I don't think it was the last PR - it's in the logs on bug 1688028 too
[22:34] Bug #1688028: Juju fails to request deploy after busy node returns OK
[22:34] ok, otp, will catch up soon
[22:35] anybody want to take a stab at a couple of really tiny PRs? :)
[22:52] gsamfira: sure!
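
The attempt-surfacing idea above could look roughly like this (a hedged sketch: setInstanceStatus is a hypothetical stand-in for the machine.SetInstanceStatus API mentioned in the conversation, and 10 attempts at 10s intervals are just the values proposed here, not a statement of what landed):

package main

import (
	"errors"
	"fmt"
	"time"
)

// setInstanceStatus is a hypothetical stand-in for machine.SetInstanceStatus;
// the real juju API lives in the state package and has a different signature.
func setInstanceStatus(machineID, status, message string) {
	fmt.Printf("machine %s: %s (%s)\n", machineID, status, message)
}

// reportRetryAttempt surfaces a provisioning retry in instance status so the
// user can see that something is happening, reusing the provisioner's existing
// log format "failed to start instance (%s), retrying in %v (%d more attempts)".
func reportRetryAttempt(machineID string, attempt, totalAttempts int, delay time.Duration, cause error) {
	msg := fmt.Sprintf("failed to start instance (%v), retrying in %v (%d more attempts)",
		cause, delay, totalAttempts-attempt)
	setInstanceStatus(machineID, "provisioning error", msg)
}

func main() {
	// Example: attempt 1 of 10, retrying every 10 seconds; the error text is made up.
	reportRetryAttempt("3", 1, 10, 10*time.Second, errors.New("maas did not allocate a node in time"))
}
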
[22:57] thumper: i had also pinged ante and andres in #juju to confirm that the maas storage issue is fixed so hopefully we'll be able to close the issue real soon
[22:58] wallyworld: I updated the bug to incomplete :)
[22:58] yay :-)
[22:58] hopefully nothing more to do
[23:10] babbageclunk: https://github.com/juju/utils/pull/279 and https://github.com/juju/charm/pull/228
[23:10] babbageclunk: thanks! ^_^
[23:22] gsamfira: Why do you think using defer might lead to leaked file handles?
[23:27] babbageclunk: It was an improper use of the word. The whole defer fh.Close() pattern, while I get that it is useful for closing a file even if a panic happens along the way, can cause ulimit errors when, for example, unarchiving an archive with many files.
[23:28] gsamfira: ok, I thought it might be something like that - so in a function with a loop extracting many files.
[23:28] babbageclunk: yap
[23:30] babbageclunk: there is another issue with that pattern. On Windows for example, if you try something like (pseudo code follows): fh := Open("myfile"); defer fh.Close(); fh.Write("hello"); io.Move("myfile", "awesome_name")
[23:30] it will fail
[23:30] gsamfira: That said, I think the places you're changing in the utils PR would be better done by changing the deferred closes to be sync+closes. The loop problem doesn't apply there.
[23:30] (the move at least)
[23:30] gsamfira: oh, right. Is that why you moved the sync+close to be before the call to the change func in AtomicWriteFileAndChange?
[23:30] would it not be useful to return any errors while doing a sync/close?
[23:31] yap
[23:31] also in zip.go, in the case of copyTo
[23:33] gsamfira: for returning errors, yes, but not if we're already returning an error - we wouldn't want to clobber it.
[23:34] menn0, wallyworld: I won't be at the tech-board meeting today
[23:34] have taxi duty :-|
[23:34] thumper: ok
[23:34] thumper: no worries, we'll cover for you :-)
[23:35] babbageclunk: not really clobbering, I think. We still return the original error if that's where it happens. But we also get an error returned if we fail to flush to disk, or close the file... which I think is an important error to know :)
[23:35] otherwise we cannot guarantee that the config file or state file is consistent
[23:36] Oh, I mean, it would be better to make the defer do sync+close and return any error (avoiding clobbering an error already coming back) than just putting the sync+close on the happy path.
[23:37] gsamfira: ^
[23:37] does that make sense?
[23:40] gsamfira: I'll put what I mean into the review.
[23:40] babbageclunk: yup. Makes sense
[23:40] gsamfira: cool
[23:40] babbageclunk: sounds good. helps to add context :)
[23:41] I'll address it in the morning. Grabbing some sleep for tonight :)
[23:41] thanks for looking over it!
[23:41] babbageclunk: ^
[23:49] gsamfira: no worries! I started typing it up and it's unwieldy enough that I don't think it's worth it - I think yours is fine in that case.
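
The deferred sync+close pattern babbageclunk describes above could look roughly like this (a minimal sketch under those assumptions, not the actual AtomicWriteFileAndChange from juju/utils):

package main

import (
	"io/ioutil"
	"log"
	"os"
)

// writeTempFile writes data to a temp file in dir and returns its name. The
// deferred function runs on every return path (including panics), flushes and
// releases the handle, and reports a Sync/Close failure - but only when no
// earlier error is already being returned, so it never clobbers the original
// failure.
func writeTempFile(dir string, data []byte) (name string, err error) {
	f, err := ioutil.TempFile(dir, "example-")
	if err != nil {
		return "", err
	}
	defer func() {
		if syncErr := f.Sync(); err == nil {
			err = syncErr
		}
		if closeErr := f.Close(); err == nil {
			err = closeErr
		}
	}()
	if _, err = f.Write(data); err != nil {
		return "", err
	}
	return f.Name(), nil
}

func main() {
	// The handle is already closed when writeTempFile returns, so a follow-up
	// rename also works on Windows, and no handles pile up if this is called
	// in a loop over many files.
	name, err := writeTempFile(os.TempDir(), []byte("hello\n"))
	if err != nil {
		log.Fatal(err)
	}
	if err := os.Rename(name, name+".done"); err != nil {
		log.Fatal(err)
	}
	log.Printf("wrote %s.done", name)
}

Because the deferred function only assigns to err when it is still nil, a failing Write keeps its own error, while a Sync or Close failure on an otherwise successful path is still surfaced rather than silently dropped.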