babbageclunk | wallyworld: Can you take another look at https://github.com/juju/juju/pull/7340? I think I've resolved your comments. | 00:01 |
---|---|---|
wallyworld | sure | 00:01 |
babbageclunk | Thanks | 00:05 |
wallyworld | babbageclunk: done | 00:19 |
babbageclunk | wallyworld: cheers. | 00:19 |
babbageclunk | wallyworld: I've already done the migration/dump-model work! :) Although migration_import refuses to import a model with remote applications - I don't want to just enable it because it probably won't just work. | 00:22 |
wallyworld | sure, ok | 00:22 |
babbageclunk | I guess mainly because at the moment we're talking about cross-model/same-controller, and migration is inherently between controllers. | 00:24 |
* thumper takes dog for a walk | 01:20 | |
babbageclunk | wallyworld, menn0: huh, haven't seen a test fail like this before: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10891/artifact/artifacts/xenial.log | 02:40 |
babbageclunk | Do you think it's a problem? | 02:41 |
menn0 | babbageclunk: well it's certainly a problem :p | 02:42 |
menn0 | babbageclunk: maybe mention it to mwhudson ? | 02:42 |
mwhudson | eh that's just a nil pointer deference isn't it? | 02:42 |
wallyworld | menn0: babbageclunk: that's a known bug, let me look at the number | 02:42 |
mwhudson | dereference | 02:42 |
wallyworld | bug 1686720 | 02:43 |
mup | Bug #1686720: worker/lease: NewDeadManager returns manager that crashes <juju:Triaged> <https://launchpad.net/bugs/1686720> | 02:43 |
wallyworld | it's not a go thing, it's a juju npe IIANM | 02:43 |
babbageclunk | wallyworld: ok, if it's got a bug I'mm'a just rerun. :/ | 02:44 |
=== thumper is now known as thumper-afk | ||
menn0 | wallyworld, babbageclunk or anastasiamac: quick help fix. https://github.com/juju/juju/pull/7343 | 03:48 |
wallyworld | sure | 03:48 |
menn0 | axw: ping | 03:49 |
axw | menn0: pong | 03:49 |
menn0 | axw: I had a poke at this: http://reports.vapour.ws/releases/5227/job/run-unit-tests-win2012-amd64/attempt/3720 | 03:50 |
menn0 | i know what the problem is but I don't know what the correct fix is | 03:50 |
babbageclunk | menn0: reading now | 03:50 |
axw | doh | 03:50 |
axw | menn0: I can take it over if you like | 03:50 |
menn0 | axw: ok, happy to hand it over (there's a LK card already) | 03:50 |
axw | okey dokey | 03:51 |
menn0 | axw: i'm curious though, are those paths supposed to be unix only? | 03:51 |
axw | menn0: yep | 03:51 |
axw | menn0: storage is not currently supported on windows | 03:51 |
menn0 | axw: ok, then the tests are slightly wrong | 03:51 |
menn0 | axw: the tests shouldn't use filepath.FromSlash | 03:52 |
wallyworld | menn0: i added a suggestion, but lgtm | 03:52 |
menn0 | axw: and BlockDevicePath shouldn't use filepath.Join | 03:52 |
menn0 | wallyworld: cheers | 03:52 |
axw | menn0: yup. I'll fix that up, thanks. | 03:53 |
menn0 | axw: cheers | 03:53 |
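For context, a minimal Go sketch of the distinction behind menn0's suggestion: `filepath.Join` and `filepath.FromSlash` follow the host OS separator, whereas `path.Join` always uses forward slashes, so the latter suits paths that only ever exist on a Linux machine. The device path below is purely illustrative, not the real storage code.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
)

func main() {
	// filepath.Join uses the host OS separator: on Windows this prints
	// \dev\sda1, which is not a meaningful Linux device path.
	fmt.Println(filepath.Join("/dev", "sda1"))

	// filepath.FromSlash likewise rewrites separators on Windows, which is
	// why the tests are better off without it for unix-only paths.
	fmt.Println(filepath.FromSlash("/dev/sda1")) // \dev\sda1 on Windows

	// path.Join always joins with forward slashes, regardless of the OS
	// the tests (or the code under test) happen to run on.
	fmt.Println(path.Join("/dev", "sda1"))
}
```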
menn0 | wallyworld: good call. doing that now. | 03:56 |
babbageclunk | menn0: D'oh, took too long with my comments. | 04:03 |
anastasiamac | babbageclunk: but ur comments were totally worth it :) i wonder if i could give menn0 a resources related bug for jumping the gun?... babbageclunk, want to negotiate it for me? :D | 04:07 |
axw | menn0: mind reviewing? https://github.com/juju/juju/pull/7344 | 04:09 |
menn0 | axw: otp but yes afterwards | 04:10 |
menn0 | babbageclunk: good comments, i'll follow up | 04:10 |
babbageclunk | wallyworld: got a moment for a sanity check? Or are you also otp? | 04:10 |
babbageclunk | menn0: ok thanks! | 04:10 |
babbageclunk | menn0: especially on the diaeresis one. | 04:10 |
wallyworld | babbageclunk: otp, give me a little time | 04:11 |
babbageclunk | wallyworld: I mean, you're my second choice in this case anyway - is thumper-afk in the same call? | 04:15 |
axw | thanks jam | 04:25 |
wallyworld | babbageclunk: tim isn't on this call | 04:29 |
babbageclunk | ah, ok - he's actually afk then, presumably | 04:30 |
wallyworld | babbageclunk: free now | 04:56 |
menn0 | babbageclunk: luckily the merge attempt failed so I can get your suggestions in | 04:57 |
menn0 | but not the diaeresis one :p | 04:57 |
babbageclunk | menn0: tsk! | 04:57 |
menn0 | pre-existing maybe? | 04:57 |
babbageclunk | menn0: I mean, I was just kidding with the tsk! | 04:59 |
babbageclunk | I'm fine with no hyphen. | 05:00 |
babbageclunk | wallyworld: hey, so my confusion was about this bit of code: https://github.com/juju/juju/blob/develop/worker/provisioner/provisioner_task.go#L739 | 05:01 |
babbageclunk | I'd expect that I'd be able to see those "failed to start instance" messages from the provisioner in the log, but I don't | 05:01 |
babbageclunk | (Oh, this is re:that maas timeout bug btw) | 05:02 |
wallyworld | did you figure it out? | 05:02 |
babbageclunk | wallyworld: no | 05:02 |
wallyworld | are you sure that code is being run? | 05:02 |
babbageclunk | wallyworld: Well, that's the bit that thumper-afk was saying was the problem, and the thing that would be affected by changing the retry count. | 05:03 |
wallyworld | sure, but maybe the code path isn't hitting that line in your deployment. since when do you believe what tim says :-P | 05:05 |
babbageclunk | wallyworld: it makes sense that that would be the problem, but if it is I don't understand why I can't see that in the log. | 05:05 |
babbageclunk | wallyworld: Oh, I mean in the log file attached to the bug. | 05:05 |
wallyworld | in the past i've had similar issues and it did turn out the code wasn't running. are you sure the right agent binary is being used? | 05:05 |
thumper-afk | :P | 05:05 |
wallyworld | have you updated your source code post beta4 | 05:06 |
wallyworld | otherwise you'll be using agent from simplestreams | 05:06 |
wallyworld | unless you used --build-agent | 05:06 |
babbageclunk | wallyworld: I'm not running anything - I don't know how to make my maas timeout in the same way. | 05:06 |
wallyworld | babbageclunk: does the log indicate that this error bubbles up? "cannot start instance for machine" | 05:08 |
wallyworld | i'm trying to establish if we really do get into that retry loop | 05:08 |
babbageclunk | Right - I don't think we do, because we don't see that logging. | 05:08 |
babbageclunk | Which would indicate that the retry count isn't the problem. (Unless we wouldn't expect to see the message in this log for some reason.) | 05:09 |
wallyworld | or maybe we do and err is nil | 05:10 |
wallyworld | maas says it will start the instance | 05:10 |
wallyworld | it then goes through a lifecycle of allocating->deployed or something | 05:10 |
babbageclunk | wallyworld: in which case changing the retry count won't help. | 05:10 |
wallyworld | so juju never gets to retry as it doesn't think it needs to | 05:10 |
wallyworld | exactly | 05:10 |
wallyworld | it may be the wrong bit of code is being looked at here? | 05:11 |
wallyworld | what does juju do after starting the instance - does it poll it until it goes to deployed state | 05:11 |
wallyworld | doesn't look like it | 05:12 |
wallyworld | it just assumes it is started | 05:12 |
babbageclunk | wallyworld: hang on - trying to chase through the provider code. | 05:12 |
wallyworld | maybe that's a bad assumption. i know with openstack startinstance returns but that just means the start request is accepted | 05:13 |
wallyworld | the instance itself then goes through a lifecycle - building, blah.... | 05:13 |
wallyworld | before it is ready | 05:13 |
babbageclunk | wallyworld: ok, thanks for confirming my thinking - still chasing then | 05:19 |
wallyworld | np, good luck! | 05:19 |
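To make the diagnosis in this exchange concrete, here is a self-contained sketch of the general shape of the retry loop under discussion; the names are hypothetical and this is not the actual provisioner_task.go code. The point is that the "failed to start instance" log line and the retries only happen when the provider call returns an error, so a MAAS that accepts the request and only fails later never enters the loop at all - consistent with the message being absent from the attached log.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// provisionWithRetry sketches the retry shape being discussed. start stands
// in for the provider's StartInstance call; in the MAAS case it can return
// nil as soon as the request is accepted, before the node has actually
// reached the deployed state.
func provisionWithRetry(start func() error, attempts int, delay time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		err := start()
		if err == nil {
			// Request accepted: no retry happens here, even if deployment
			// subsequently times out on the MAAS side.
			return nil
		}
		lastErr = err
		fmt.Printf("failed to start instance (%v), retrying in %v (%d more attempts)\n",
			err, delay, attempts-i-1)
		time.Sleep(delay)
	}
	return fmt.Errorf("cannot start instance for machine: %v", lastErr)
}

func main() {
	err := provisionWithRetry(func() error { return errors.New("no matching nodes") }, 3, time.Second)
	fmt.Println(err)
}
```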
jam | axw: ping | 07:11 |
axw | jam: pong | 07:11 |
jam | axw: you have a PR #6326 that is about 8 months old | 07:11 |
jam | axw: azure using simplestreams | 07:11 |
jam | is there something we should close/unblock/??? | 07:11 |
axw | jam: yeah, it got stalled numerous times. we need to ensure streams.canonical.com has the azure bits. probably not going to happen until 2.3 at the earliest now | 07:12 |
axw | jam: we were close to a resolution at barcelona, but now aaron's gone and he was handling the streams side | 07:12 |
jam | axw: should we close that PR and come back to it? | 07:12 |
axw | jam: can do I guess, though the code shouldn't change much | 07:13 |
axw | jam: is it harmful leaving it there? distracting? | 07:13 |
jam | I'd like to get us to PRs that are actionable | 07:13 |
jam | so when we do a scan we can know rather than having a bunch of stuff we have to ignore | 07:13 |
axw | jam: fair enough. no drama to recreate it later, I'll close | 07:13 |
jam | we're a bit far from that, but it would be nice | 07:13 |
axw | I agree | 07:14 |
jam | axw: you also have a 1.8 branch that didn't land 3 days ago? | 07:14 |
axw | jam: yep, I will get back to that one real soon. needs some fixes in the 1.25 branch | 07:15 |
* axw creates a card to get it done | 07:15 | |
=== frankban|afk is now known as frankban | ||
axw | wallyworld_: can you please review https://github.com/juju/juju/pull/7346 tomorrow, unless some kind soul does it overnight? the meat and potatoes are in the first two commits | 09:44 |
wallyworld_ | axw: looking | 11:16 |
wpk | Any volunteers? https://github.com/juju/juju/pull/7348 | 13:35 |
gsamfira | heya | 16:28 |
gsamfira | anyone want to look over a small PR? https://github.com/juju/utils/pull/279 | 16:29 |
=== frankban is now known as frankban|afk | ||
balloons | menn0, not sure who all else is in the tech board, but I'd appreciate a look at https://github.com/juju/juju/pull/7350 | 21:50 |
wallyworld | babbageclunk: did you find where juju is not sufficiently retrying the provisioning? | 22:04 |
babbageclunk | wallyworld: the Juju log attached to that bug doesn't show the provisioning failure - I've asked for another one that does. I want to check with thumper that he's sure that's the problem - it seems likely, but it'd be good to have the log showing it. | 22:13 |
wallyworld | babbageclunk: it would be good yes. we could though make a preemptive fix and increase retry timeout based on knowledge that machines can take a while to provision; this will unblock the fix and if/when we get better logs we can confirm | 22:15 |
babbageclunk | wqll | 22:15 |
babbageclunk | oops | 22:15 |
babbageclunk | wallyworld: yeah, that's my thinking too. | 22:16 |
babbageclunk | wallyworld: https://github.com/juju/juju/pull/7351 | 22:16 |
wallyworld | looking | 22:16 |
babbageclunk | It's pretty short | 22:17 |
wallyworld | babbageclunk: i could comment but i won't :-) | 22:17 |
* babbageclunk sniggers | 22:17 | |
wallyworld | babbageclunk: is 10 x 10s enough? | 22:18 |
wallyworld | babbageclunk: i wonder also, we should surface the retry attempts in status | 22:18 |
wallyworld | there's an api to surface progress | 22:19 |
wallyworld | we could ensure the message says "Waiting for machine to be provisioned, attempt 1 or 10" or something | 22:19 |
babbageclunk | wallyworld: machine.SetInstanceStatus - yeah, could do. I'll add that. | 22:19 |
wallyworld | so the user can see stuff is happening | 22:19 |
wallyworld | 1 of 10 | 22:20 |
babbageclunk | According to thumper, this happens occasionally (I think he said something like 10% of the time?), so normally 3 tries is enough. 10 seems like a reasonable number to bump it to in that case? | 22:22 |
babbageclunk | wallyworld: oh, it turns out it already does that: failed to start instance (%s), retrying in %v (%d more attempts) | 22:24 |
wallyworld | cool, ok | 22:24 |
wallyworld | let's land it and take a view | 22:25 |
babbageclunk | wallyworld: kind of an annoying amount of digging around in logs for a 1-line PR | 22:26 |
wallyworld | it's the old mechanic paradigm - the fix is trivial, you just have to know where to apply the hammer | 22:26 |
babbageclunk | sure | 22:27 |
babbageclunk | wallyworld: ok, I'll move on to getting rid of that annoying unknown collection "remoteApplications" message. | 22:28 |
wallyworld | +1 ty | 22:29 |
wallyworld | assuming it was the last pr to land which introduced it | 22:29 |
wallyworld | i think it must have been because it was all ok before then IIANM | 22:29 |
babbageclunk | wallyworld: I don't think it was the last PR - it's in the logs on bug 1688028 too | 22:34 |
mup | Bug #1688028: Juju fails to request deploy after busy node returns OK <cdo-qa> <cdo-qa-blocker> <juju:In Progress by 2-xtian> <https://launchpad.net/bugs/1688028> | 22:34 |
wallyworld | ok, otp, will catch up soon | 22:34 |
gsamfira | anybody want to take a stab at a couple of really tiny PRs? :) | 22:35 |
babbageclunk | gsamfira: sure! | 22:52 |
wallyworld | thumper: i had also pinged ante and andres in #juju to confirm that maas storage issue is fixed so hopefully we'll be able to close the issue real soon | 22:57 |
thumper | wallyworld: I updated the bug to incomplete :) | 22:58 |
wallyworld | yay :-) | 22:58 |
wallyworld | hopefully nothing more to do | 22:58 |
gsamfira | babbageclunk: https://github.com/juju/utils/pull/279 and https://github.com/juju/charm/pull/228 | 23:10 |
gsamfira | babbageclunk: thanks! ^_^ | 23:10 |
babbageclunk | gsamfira: Why do you think using defer might lead to leaked file handles? | 23:22 |
gsamfira | babbageclunk: It was an improper use of the word. The whole defer fh.Close() pattern, while I get that it's useful for closing a file even if a panic happens along the way, can cause ulimit errors when, for example, unarchiving an archive with many files. | 23:27 |
babbageclunk | gsamfira: ok, I thought it might be something like that - so in a function with a loop extracting many files. | 23:28 |
gsamfira | babbageclunk: yap | 23:28 |
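A minimal illustration of the ulimit problem gsamfira is describing, with made-up helper names rather than the code from the PR: a `defer f.Close()` issued inside a loop does not run until the whole function returns, so extracting a large archive holds every file handle open at once; closing per iteration keeps at most one open.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// extractAllLeaky defers every close, so nothing is closed until the function
// returns - with thousands of archive entries this can exhaust the process's
// open-file limit (the ulimit errors mentioned above).
func extractAllLeaky(names []string) error {
	for _, name := range names {
		f, err := os.Create(name)
		if err != nil {
			return err
		}
		defer f.Close() // piles up; only runs when extractAllLeaky returns
		// ... write this entry's contents to f ...
	}
	return nil
}

// extractAll closes each file as soon as it is written, so only one handle is
// open at a time and a failed Close is reported instead of being dropped.
func extractAll(names []string) error {
	for _, name := range names {
		f, err := os.Create(name)
		if err != nil {
			return err
		}
		// ... write this entry's contents to f ...
		if err := f.Close(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	dir, _ := os.MkdirTemp("", "extract-demo")
	defer os.RemoveAll(dir)
	names := []string{filepath.Join(dir, "a"), filepath.Join(dir, "b")}
	fmt.Println(extractAll(names))
}
```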
gsamfira | babbageclunk: there is another issue with that pattern. On Windows for example, if you try something like (pseudo code follows: fh := Open("myfile"); defer fh.Close(); fh.Write("hello"); io.Move("myfile", "awesome_name")) | 23:30 |
gsamfira | it will fail | 23:30 |
babbageclunk | gsamfira: That said, I think the places you're changing in the utils PR would be better done by changing the deferred closes to be sync+closes. The loop problem doesn't apply there. | 23:30 |
gsamfira | (the move at least) | 23:30 |
babbageclunk | gsamfira: oh, right. Is that why you moved the sync+close to be before the call to the change func in AtomicWriteFileAndChange? | 23:30 |
gsamfira | would it not be useful to return any errors while doing a sync/close? | 23:30 |
gsamfira | yap | 23:31 |
gsamfira | also in zip.go. in the case of copyTo | 23:31 |
babbageclunk | gsamfira: for returning errors, yes, but not if we're already returning an error - we wouldn't want to clobber it. | 23:33 |
thumper | menn0, wallyworld: I won't be at the tech-board meeting today | 23:34 |
thumper | have taxi duty :-| | 23:34 |
menn0 | thumper: ok | 23:34 |
wallyworld | thumper: no worries, we'll cover for you :-) | 23:34 |
gsamfira | babbageclunk: not really clobbering, I think. We still return the original error if that's where it happens. But we also get an error returned if we fail to flush to disk, or close the file...which I think is an important error to know :) | 23:35 |
gsamfira | otherwise we cannot guarantee that the config file or state file is consistent | 23:35 |
babbageclunk | Oh, I mean, it would be better to make the defer do sync+close and return any error (avoiding clobbering an error already coming back) than just putting the sync+close on the happy path. | 23:36 |
babbageclunk | gsamfira: ^ | 23:37 |
babbageclunk | does that make sense? | 23:37 |
babbageclunk | gsamfira: I'll put what I mean into the review. | 23:40 |
gsamfira | babbageclunk: yup. Makes sense | 23:40 |
babbageclunk | gsamfira: cool | 23:40 |
gsamfira | babbageclunk: sounds good. helps to add context :) | 23:40 |
gsamfira | I'll address it in the morning. Grabbing some sleep for tonight :) | 23:41 |
gsamfira | thanks for looking over it! | 23:41 |
gsamfira | babbageclunk: ^ | 23:41 |
babbageclunk | gsamfira: no worries! I started typing it up and it's unwieldy enough that I don't think it's worth it - I think yours is fine in that case. | 23:49 |
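For completeness, roughly the "unwieldy" version babbageclunk was typing up: a deferred sync+close that reports its own error through a named return value, but never overwrites an error that is already being returned. The function name and signature here are illustrative, not the utils API.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileSafely is a hypothetical example of the deferred sync+close
// pattern: Sync and Close errors are surfaced via the named return value,
// but only when nothing earlier in the function has already failed, so the
// original error is never clobbered.
func writeFileSafely(name string, data []byte) (err error) {
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	defer func() {
		syncErr := f.Sync()
		closeErr := f.Close()
		if err == nil {
			if syncErr != nil {
				err = syncErr
			} else {
				err = closeErr
			}
		}
	}()
	_, err = f.Write(data)
	return err
}

func main() {
	if err := writeFileSafely(filepath.Join(os.TempDir(), "demo.txt"), []byte("hello\n")); err != nil {
		fmt.Println("write failed:", err)
	}
}
```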