/srv/irclogs.ubuntu.com/2012/09/13/#juju-dev.txt

wrtpdavecheney, niemeyer: you'll be glad to know that the recent cloudinit changes and traffic auth changes mean that we can actually run a uniter live.00:27
niemeyerwrtp: WOAH00:27
niemeyerwrtp: No kidding!00:27
* niemeyer grabs some port wine to celebrate :)00:27
wrtpdavecheney: in a few minutes, i *should* see the first live uniter upgrade00:27
wrtpniemeyer: ^00:27
wrtpniemeyer: the first time failed because i'd forgotten to add the UpgradedError check...00:28
* wrtp does the same00:28
wrtpah, interesting case. it *would* work, but it doesn't get as far as starting the upgrader because the uniter is returning an error when initialised.00:37
wrtphmm, not easily fixed either.00:38
wrtpniemeyer: this an interesting case that i hadn't anticipated; i'm not quite sure what to do00:48
niemeyerwrtp: Hmm01:08
niemeyerwrtp: What's the error?01:08
niemeyerwrtp: and at which point?01:08
wrtpniemeyer: it actually doesn't really matter so much what the error is in fact01:08
wrtpniemeyer: although it's actually ModeInit: bad http response 404 Not Found01:09
wrtpniemeyer: the problem is that if one of the workers consistently dies early, the task exits, and all the other tasks (including the upgrader) get killed off01:10
wrtpniemeyer: so the upgrader never has a chance to upgrade01:10
wrtpniemeyer: i've worked out what i think is a decent solution though01:10
niemeyerwrtp: If the uniter is consistently dying early, sounds like something is wrong?01:10
wrtpniemeyer: indeed it is, but our code should be able to cope with that, even if it is wrong.01:11
wrtpniemeyer: otherwise we have a situation where we're running some bad code and we can't upgrade out of that, even though nothing has crashed01:11
niemeyerwrtp: If it's dying consistently before even starting, I don't see how we'd be able to do anything?01:12
niemeyerwrtp: We can't go "Hey, I think I'll upgrade anyway, just in case!"01:12
wrtpniemeyer: it was, but my first thought was: put the checks in *after* we've started the uniter01:12
wrtpniemeyer: so we're running the upgrader concurrently.01:13
wrtpniemeyer: but then you run into the case above.01:13
wrtpniemeyer: but as i said, i think i have a nice solution.01:13
niemeyerwrtp: Oh/01:16
niemeyer?01:16
wrtpniemeyer: 1) we change the runTasks function so that it gives UpgradedError priority over other errors01:17
wrtpniemeyer: 2) we change the upgrader so that if it's stopped while downloading, it waits a while for the download to complete before actually dying01:17
niemeyerwrtp: Hmm01:17
niemeyerwrtp: Feels easy to fall onto races01:28
niemeyerwrtp: I wish we had a more deterministic way to say that01:28
wrtpniemeyer: i think it's not too difficult01:28
wrtpniemeyer: obviously we can't wait forever because then a download that hung up would mean that an upgrader would never be killable, but i reckon a minute would be fine01:29
niemeyerwrtp: Feels like guess work01:29
niemeyerwrtp: It could also take 10 in a slow network, or 2001:30
wrtpniemeyer: point of reference: in ec2 it takes 4s01:30
wrtpniemeyer: we could add something to the downloader that signals that some progress is being made01:31
niemeyerwrtp: So S3 to EC2 is your reference of "slow network"!? :-)01:31
wrtpniemeyer: it'd still be guesswork, but not quite so wild01:31
niemeyerwrtp: That's probably as good as it gets01:31
wrtpniemeyer: fair enough01:31
wrtpniemeyer: we could make the timeout 1 day if we liked01:31
niemeyerwrtp: I don't like any of that.. :(01:31
niemeyerwrtp: We need a more deterministic way to define whether we want to stop *now* or whether we think we should check for an upgrade01:32
wrtpniemeyer: when do we ever want to stop *now* ?01:32
niemeyerwrtp: When we say "stop".. when there's an unrecoverable error in the state connection, ..01:33
wrtpniemeyer: yes, we could add code to enable stopping immediately, but that doesn't solve our problem01:33
wrtpniemeyer: ah yes, if we could know when we had such an error, that would be good01:34
wrtpniemeyer: we'd have to lose lots of errorContextfs though01:34
wrtpniemeyer: or actually, maybe errorContextf could be changed to deal with it01:35
niemeyerwrtp: Hm?01:36
wrtpniemeyer: ignore me01:36
wrtpniemeyer: i'm on crack01:36
wrtpniemeyer: we just need a way of asking whether the state connection is still ok01:36
niemeyerwrtp: Hmm01:38
niemeyerwrtp: Did you get to the point of seeing the upgrader behavior when the issue was happening?01:39
niemeyerwrtp: Was it bailing out before ever checking for any upgrade?01:39
wrtpniemeyer: i'm not sure. it doesn't currently print a log message when a download starts01:40
niemeyerwrtp: I was just having a look at it, a twist on the first idea you had sounds like an interesting direction01:41
wrtpniemeyer: i've actually finished writing the code for the first idea already BTW01:42
niemeyerwrtp: It's not just about "if a download is in progress", though01:43
wrtpniemeyer: it's also "have we seen an initial environ config"01:43
niemeyerwrtp: Right01:43
wrtpniemeyer: i've already got that logic too01:43
niemeyerwrtp: We have to ignore Dying and, if we have a valid environ, *go check for an upgrade*.. only then, put the Dying() consideration back on the table01:44
niemeyerwrtp: Oh, you rock :)01:44
wrtpniemeyer: http://paste.ubuntu.com/1201690/01:44
niemeyerdying = time.After(1 * time.Minute)01:46
niemeyer!?01:46
wrtpniemeyer: oops, well obviously we can't do *that* :-)01:46
wrtpniemeyer: but something that sends on dying after a minute would do the job01:46
niemeyerwrtp: I'd say we can ignore dying on the first round, and go try an upgrade (facilitated by factoring out somethings onto a function).01:48
niemeyerwrtp: If we have a timeout, that timeout should be on the downloader itself, not on that loop01:49
niemeyerwrtp: It's the downloader that should be responsible for complaining that things are stuck01:49
niemeyerwrtp: if we don't do that, we have problems either way (what if a charm download stops, what if ...)01:50
wrtpniemeyer: what about waiting for the initial config?01:50
niemeyerwrtp: Expand please?01:51
wrtpniemeyer: if we haven't yet received the inital environ config, there's no download to complain01:52
wrtpniemeyer: so we would need a top-level timeout anyway01:52
wrtpniemeyer: but that's not hard to arrange01:53
wrtpniemeyer: we can use both01:53
niemeyerwrtp: We can ask the state for an environment upfront, without using WaitForEnviron01:53
niemeyerwrtp: and run one attempt to upgrade01:54
niemeyerwrtp: By calling a function explicitly that contains the block that is within the select today01:54
wrtpniemeyer: we actually don't need to use WaitForEnviron at all. we don't use the environ.01:54
wrtperm, maybe we do01:55
niemeyer                        tools, err := environs.FindTools(environ, binary, flags)01:55
niemeyerWe need it01:55
wrtpniemeyer: ah yes, good point.01:56
niemeyerwrtp: Either way, your plan is a great quick solution01:57
niemeyerwrtp:        dying = u.tomb.Dying + 5 min delay01:57
niemeyerwrtp: Once we know there are no downloads available, reset it to pure Dying01:58
wrtpniemeyer: once we know there are no downloads available, we can just return01:58
niemeyerwrtp: Uh?01:59
wrtpi think01:59
niemeyerwrtp: What if there is a new upgrade?01:59
niemeyerwrtp: It should still work as usual01:59
niemeyerwrtp: The loop waits for environ changes02:00
wrtpniemeyer: this is *after* we've already received a dying signal, right?02:00
niemeyerwrtp: No, I was talking about upgrader behavior at all times02:00
wrtpniemeyer: ok, i'm not sure what you mean by "dying = u.tomb.Dying + 5 min delay" then02:00
niemeyerwrtp: On entrance, a dying channel is built that will fire if u.tomb.Dying fires, but with a 5 mins delay02:00
wrtpniemeyer: ahhh02:01
niemeyerwrtp: Everything works as usual then.. we get into the loop, Changes fires with the first env config,02:01
niemeyerwrtp: THen, when we get the first download-not-found, we reset dying to take off that Delay02:01
niemeyerwrtp: From then on it's fine for it to die at any point02:02
wrtpniemeyer: we still want to try to complete a download if we're killed, no?02:02
niemeyer<niemeyer> wrtp: THen, when we get the first download-not-found, we reset dying to take off that Delay02:02
niemeyerwrtp: By definition, to get a download-not-found, we've checked if there was a download, and noticed there wasn't any02:03
niemeyerwrtp: Before that, dying is Dying + 5 mins delay02:03
niemeyerwrtp: Which means we have the 5 mins to complete the download02:03
wrtpniemeyer: yes, that sounds good02:03
wrtpniemeyer: because it's only the first download opportunity we really care about02:04
niemeyerwrtp: We can tweak as we go, but it *sounds* like this is a pretty trivial change on top of what we have today02:04
niemeyerwrtp: Right02:04
niemeyerwrtp: If we break in the middle later, we'll loop and get into that again in either case02:04
wrtpniemeyer: yeah, that sounds great actually02:04
niemeyerwrtp: If there's no upgrade to do we reset it as well, of course02:05
wrtpniemeyer: ah, i thought that's what you meant by download-not-found actually02:06
wrtpniemeyer: but it's all good02:06
niemeyerwrtp: Yeah, both that and actual failures in download (we just break right now)02:06
wrtpgod i love how easy it is to reason about this stuff with channels02:08
niemeyerwrtp: +1!02:09
wrtpniemeyer: e.g. http://paste.ubuntu.com/1201718/02:09
wrtpniemeyer: a little better: http://paste.ubuntu.com/1201720/02:12
niemeyerwrtp: Yeah, that's nice02:17
niemeyerwrtp: A bit unfortunate that it hangs a goroutine forever for the 5 mins delay02:18
niemeyerwrtp: But I guess that's minor02:18
wrtpniemeyer: well, it can be 5 minutes only if the channel is buffered02:18
niemeyerwrtp: Hm?02:19
wrtpniemeyer: oh no, i'm talking rubbish02:19
wrtpniemeyer: it hangs a goroutine for 5 minutes, as you say, minor.02:19
niemeyerwrtp: No, it hangs forever02:19
wrtpniemeyer: oh yeah, i see what you mean02:20
wrtpniemeyer: i can't get very worked up about it :-)02:21
niemeyerwrtp: If we use a local var t *Tomb assigning to the u.tomb, we could use something like this: http://paste.ubuntu.com/1201731/02:21
niemeyerwrtp: So we Kill the delayed tomb after we have the first round acknowledged02:22
niemeyerwrtp: and set t back to the real tomb02:22
wrtpniemeyer: i see what you're doing, but i think it's overkill02:22
wrtpniemeyer: we really couldn't care less about one goroutine02:22
niemeyerwrtp: Up to you.. the two functions have the same size, are ready, and one of them doesn't leak memory02:23
niemeyerAlright, it's time for me to take some shower02:25
wrtpniemeyer: "leaking memory" is a little strong. it cleans up when the upgrader quits02:25
niemeyerwrtp: Dude, and it's super late for you too02:25
wrtpniemeyer: indeed02:26
niemeyerwrtp: Yes, when the process dies, it will release the memory :-)02:26
wrtpniemeyer: or 5 minutes after the upgrader returns.02:26
niemeyerwrtp: Which is done when the process dies!?02:27
niemeyerwrtp: Otherwise we'll always have an upgrader running02:27
niemeyerwrtp: Either way.. I'm not keen or trashing memory like that when it's trivial to avoid, but I'm not keen on bikeshedding on it either.. have a great sleep there02:27
wrtpniemeyer: will do, thanks02:28
wrtpniemeyer: enjoy your shower...02:28
fwereadehey, anyone who's around, cath is not well and I'm popping out to get some medicine... and I may in general be a bit sketchier than usual today07:03
davecheneyright o07:05
wrtpfwereade: mornin'08:00
fwereadewrtp, heyhey08:06
fwereadewrtp, looks like you got a lot done last night -- nice :D08:07
wrtpfwereade: you'll be glad to know i got the uniter running live last night08:07
wrtpfwereade: it broke immediately, but that's not the point :-)08:07
wrtpfwereade: it was started and connected to the state etc08:07
fwereadewrtp, oh, bugger, I didn't read everything I missed08:07
fwereadewrtp, what fell over?08:07
wrtpfwereade: it failed to download the charm08:07
wrtpfwereade: it's probably something i'm doing wrong in the test08:08
fwereadewrtp, huh, weird08:08
wrtpfwereade: one mo, i'll show you its log08:08
wrtpfwereade: it repeats forever doing this: http://paste.ubuntu.com/1202116/08:09
wrtpfwereade: (it would be nice if we could see the url that's failed, don't you think?)08:09
wrtpfwereade: it's actually quite good that it failed because it exposed me to an upgrader issue that i hadn't considered before08:09
fwereadewrtp, huh, very odd08:10
fwereadewrtp, but ModeInit is I think before install time, isn't it?08:10
fwereadewrtp, that should be ModeInstalling08:10
fwereadewrtp, most likely something to do with getting the unit address08:11
wrtpfwereade: this is a classic example of why errorcontextf is not always sufficient BTW08:14
fwereadewrtp, I'm not denying your general point about ErrorContextf, but I think that had it been used in the provider methods it would have been fine... surely?08:15
fwereadewrtp, your point that it makes it inconvenient to return specific error values holds far more water IMO :)08:16
wrtpfwereade: well, yeah, it must be used in every single function08:16
fwereadewrtp, it must be used mindfully, at any rate :)08:17
fwereadewrtp, IMO it comes down to a question of who has responsibility for producing sane errors -- the thing that fails, or its client08:17
fwereadewrtp, despite the drawbacks, of which I am aware, I still come down on the thing-that-failed side08:18
wrtpfwereade: it depends what you mean by a sane error i think08:18
fwereadewrtp, probably, yeah :)08:18
wrtpfwereade: with the current scheme, we're always passing the buck - when we see a function like the above, we say it's fine because it adds error context. but it's only fine if all the things it calls do the same.08:19
wrtpfwereade: and in many of those functions it doesn't really look like a context is necessary.08:20
wrtpfwereade: so nobody calls it out in review08:20
fwereadewrtp, hmm, I think that statement is flawed in exactly the same way as "if it's up to the client, it's only ok if every single error everywhere is annotated" would be08:20
fwereadewrtp, in either case it's essentially advocating tracebacks08:20
wrtpfwereade: i think some kind of (fairly minimal) traceback is exactly what we need for diagnosing problems like we've just encountered08:21
wrtpfwereade: we need just enough to walk us through the tree of possible error paths.08:22
wrtpfwereade: what i'm going to have to do now is find out which of the functions that ModeInit is calling might be able to generate the error we saw. that's slow and error prone in itself.08:23
fwereadewrtp, AFAICS there are only two things that could have failed there, and the actual problem is the same in either case, right?08:23
fwereadewrtp, one is PrivateAddress, the other is PublicAddress, and (almost certainly) they'd each need to be fixed in the same way08:24
fwereadewrtp, I totally understand there are cases where it could be much less pleasant08:24
wrtpfwereade: ah yeah, that's true. i'd forgotten they accessed an http link08:24
fwereadewrtp, and such a case is certainly evidence for inadequate annotation08:25
fwereadewrtp, but I *think* I'm comfortable with the tradeoff08:25
wrtpfwereade: i'll change their error messages to be a little better for a start08:25
fwereadewrtp, the number of cases where I've actually been baffled by an error is pretty small08:25
wrtpfwereade: just wait until we've encountered more of these kinds of errors reported by people in the wild :-)08:26
fwereadewrtp, fair point -- but all the same, I'm uncomfortable with the idea that the way to handle errors right is to hand-hack traceback functionality everywhere08:27
wrtpfwereade: i don't think of it as traceback functionality. i think of it as describing the error :-)08:27
fwereadewrtp, it's the "everywhere" that bugs me more than the precise term we use for the mechanism by which we describe the error (but, yes, good descriptions will be better than tracebacks)08:38
wrtpfwereade: i really feel there's no shortcut. but i'm rolling with it for now. if we decide to change the policy in the future, it won't be too hard.08:39
fwereadewrtp, yeah :)08:40
fwereadewrtp, btw, are your current branches messing with PATH in upstart?08:52
wrtpfwereade: no08:53
fwereadewrtp, awesome :)08:53
fwereadewrtp, but I do have to fix something now then08:53
wrtpfwereade: at least... there might be a dreg in the tests. i'll check.08:53
fwereadewrtp, yeah, the problem is that the uniter tests mess with PATH08:54
fwereadewrtp, it's not too bad though, I can do a trivial and notsonice fix quickly, and then sort out the long-running-hook-server thing which will let me do it nicely08:54
wrtpfwereade: i've found the metadata problem BTW08:54
fwereadewrtp, oh yes?08:54
wrtpfwereade: it's using the version 1.0, but public-hostname didn't exist in that version08:55
fwereadewrtp, ha :)08:55
wrtpfwereade: tbh i think it should probably use "latest".08:55
fwereadewrtp, mmmmmmaybe, I'm just not sure what guarantees they make re "latest" compatibility08:55
wrtpfwereade: grr08:56
wrtp04.15.979 ... value *errors.errorString = &errors.errorString{s:"cannot put charm: cannot write file \"local_3a_precise_2f_dummy-1\" to control bucket: remote error: handshake failure"} ("cannot put charm: cannot write file \"local_3a_precise_2f_dummy-1\" to control bucket: remote error: handshake failure")08:56
fwereadewrtp, yeah, those piss me off :/08:56
wrtpfwereade: ok, i'll use 2012-01-12 then08:56
fwereadewrtp, brb cig08:56
fwereadewrtp, cool08:56
wrtpfwereade: actually i think "latest" should be fine08:59
fwereadewrtp, I'm happy to trust your research/judgment :)09:03
wrtpfwereade: for metadata categories it talks about "version introduced" which to me implies that a category can't be retracted once introduced09:04
wrtpfwereade: and all their examples use "latest"09:05
fwereadewrtp, sounds sane to me :)09:05
Aramhello.09:12
fwereadeAram, heyhey09:16
wrtpfwereade: holy shit, it worked!09:17
* fwereade cheers09:17
fwereadewrtp, running unit agent?09:17
wrtpfwereade: unit agent upgraded09:17
fwereadewrtp, sweeeet09:18
fwereadewrtp, and I'm just about to land charm upgrades too09:18
fwereadewrtp, although I haven't actually implemented upgrade-charm yet09:18
wrtpfwereade: i need to think of a nice test so we can live test that the uniter can run a charm09:18
wrtpfwereade: how about a trivial charm that starts a web server? then we can poll for a while after the charm has started to see if the web server comes up09:19
wrtpAram: morning!09:19
fwereadewrtp, yeah, sounds sensible, just checking for unit status is not adequate to tell that the *charm* is working09:20
* wrtp is quite happy now09:20
wrtpfwereade: indeed. i want to check that a hook has been executed for real09:20
wrtpfwereade: this can lead into other tests that test more sophisticated functionality, i think09:20
fwereadewrtp, sgtm09:20
fwereadewrtp, will also test expose, which is nice09:21
fwereadewrtp, maybe just an echo server?09:21
wrtpfwereade: in fact, can i tell from the status that a charm has finished executing its start hook?09:21
fwereadewrtp, yeah, once unit status is started09:21
wrtpfwereade: sweet. that means that i won't have to wait too long after that09:22
wrtpfwereade: and i've just implemented the unit watcher, so i can use that to watch the status09:22
wrtpfwereade: lovely jubbly09:22
fwereadewrtp, you could write an echo server with plain/pirate/3117 transforms, to test config-changed as ell :)09:22
fwereades/ ell/ well/09:22
wrtpfwereade: i think i'll just write a server that executes a specified jujuc hook and sends back the result09:23
wrtps/hook/callback/09:23
fwereadewrtp, how do you plan to do that?09:23
wrtpfwereade: hmm, good point!09:23
fwereadewrtp, :p09:23
wrtpfwereade: darn, out of context callbacks09:24
fwereadewrtp, yeah, I'm strongly inclined to do the preliminary work for that today, just because it makes the uniter much cleaner (and will, I *think*, improve performance a little)09:24
fwereadewrtp, but you shouldn't count on the full thing being available any time soon at all09:25
fwereadewrtp, but even a trivial echo server will verify that open-port works, and by extension (probably) the rest of the jujuc stuff09:25
wrtpfwereade: maybe it's silly, but i'd prefer something that provides some actual info from the other side, so we *know* that we're talking to the right thing.09:27
fwereadewrtp, well, anything you want to send back from jujuc can be grabbed and stored when you run the start hook, right?09:28
wrtpfwereade: hmm, can i assume that python will be installed?09:29
fwereadewrtp, but ISTM that config-changed is a useful mechanism for checking end-to-endness09:29
fwereadewrtp, sorry I don't know09:29
wrtpfwereade: yeah, i was thinking that09:29
wrtpfwereade: what's the simplest web server i can write?09:29
wrtpfwereade: that doesn't need to pull in any more resources than we already have09:30
fwereadewrtp, if you have python, SimpleHTTPServer is pretty damn trivial09:31
wrtpfwereade: yeah, if09:31
fwereadewrtp, *but* the default python version is changing, so you want to be careful of that,09:31
wrtpfwereade: oh pish09:31
fwereadewrtp, I actually think you probably can guarantee *some* python09:31
fwereadewrtp, but yeah, exactly09:31
wrtpfwereade: perl is probably installed09:31
wrtpfwereade: (although i've never written a line of perl in my life)09:32
fwereadewrtp, whatever you need, it's one line in theinstall hook, surely?09:32
fwereadewrtp, in my very limited experience you're not missing much ;p09:32
wrtpfwereade: i firmly believe that to be the case :-)09:32
wrtpfwereade: i miss the inferno shell where i could write: listen 'tcp!*!12345' {echo hello there}09:37
wrtpfwereade: and instantly have a working server09:37
wrtpfwereade: i think we must have python because cloud-init is written in python.09:38
wrtpfwereade: i'll just have to be a teeny bit careful of version changes09:38
wrtpfwereade: does hook output get logged into /var/log/juju/unit*.log ?09:41
fwereadewrtp, nope, but juju-log will09:41
wrtpfwereade: hmm. where *does* hook output go?09:41
fwereadewrtp, I suspect it goes nowhere at the moment09:42
wrtpfwereade: i think that's wrong09:42
fwereadewrtp, fair point, we just want to be careful not to do it how we did in python, where every line on stderr gets prefixed with ERROR: ;)09:42
wrtpfwereade: just trying to deploy the real wordpress charm, just to see what happens...09:43
wrtpfwereade: http://paste.ubuntu.com/1202263/09:45
wrtpfwereade: also: http://paste.ubuntu.com/1202265/09:45
wrtpfwereade: it's looking good!09:46
wrtpfwereade: no way of knowing why the install hook failed though09:46
fwereadewrtp, that's pretty cool :D09:46
fwereadewrtp, I kinda would like to separate hook output from uniter output, but that's just an instinct, not sure it holds water09:47
fwereadewrtp, yeah, debug-hooks would be useful for that09:47
wrtpfwereade: i'm not sure it does. they do need to be readily distinguishable though09:47
wrtpfwereade: what does debug-hooks do again?09:48
fwereadewrtp, yeah, that's the heart of it09:48
fwereadewrtp, whenever the uniter runs a hook it just gives the client a shell instead09:48
wrtpfwereade:09:49
wrtp2012/09/13 09:42:35 JUJU HOOK some hook output09:49
wrtp?09:49
wrtpfwereade: i don't think we need to know if it was written to stdout or stderr09:50
fwereadewrtp, sounds reasonable09:50
wrtpfwereade: i might just go and do that to scratch my itch.09:51
fwereadewrtp, +109:51
fwereadewrtp, btw, I think I have some idea why I wanted the agent tools to go in the agent tools dir -- because the uniter does write to it09:53
wrtpfwereade: i don't think that's a problem. a given version of the uniter will always write the same thing, no?09:54
fwereadewrtp, yeah, indeed, I'm not sure it's really a big deal09:54
wrtpfwereade: ah, is it likely the reason for the failure is that the PATH isn't set up yet?10:04
fwereadewrtp, ha!10:04
fwereadewrtp, yes, very likely indeed10:04
wrtpfwereade: hook output logging is done BTW, with the small matter of a few tests to go10:05
wrtpfwereade: very happy with how easy it was to navigate the uniter code10:05
fwereadewrtp, sweet :D10:24
fwereadewrtp, sorry, I think I'm being dense... what's the use case for the unit watcher?10:46
wrtpfwereade: the test uses it to watch for the agent tools version changing10:46
wrtpfwereade: it can also be used to watch for the unit status changing10:47
wrtpfwereade: otherwise we'll have to poll, which seems wrong.10:47
fwereadewrtp, hmm, I'm -1 on test-only code polluting the API, where we can help it10:47
fwereadewrtp, I'm comfortable polling in tests (although ofc I do appreciate it when I don'rt have to)10:47
wrtpfwereade: i don't think it's strictly test only. we could easily have some client functionality that wants to watch some aspect of a unit.10:48
fwereadewrtp, however, it doesn't cover status changes properly (down?)10:48
fwereadewrtp, I dunno, I feel we have too much speculative code in state anyway10:49
wrtpfwereade: FWIW we've already got MachineWatcher which is basically the same thing10:50
fwereadewrtp, is MachineWatcher used? (or does it have a concrete use case in mind?)10:51
wrtpfwereade: it's used to watch the instance id for one10:51
fwereadewrtp, ok, so it has a use, and I'm fine with that (although *really* that STM like a job for a more specialized watcher)10:51
wrtpfwereade: this was at niemeyer's request. i started with a more specialized watcher.10:52
fwereadewrtp, ha, ok10:52
wrtpfwereade: i think the UnitWatcher is sufficiently general that it doesn't really count as test-only code, even though that's how we're using it currently.10:53
fwereadewrtp, I kinda feel a watcher that doesn't fire on certain important changes is a bit of a bad thing though10:53
wrtpfwereade: we will eventually use it to do major-version upgrades10:53
fwereadewrtp, if it's meant to watch status, it ought to actually watch status10:53
wrtpfwereade: yeah, maybe it should watch the alive status too.10:54
fwereadewrtp, a ToolsVersionWatcher that can handle both units and machines, and strips out unwanted changes, feels much cleaner to me10:54
wrtpfwereade: i agree, but i've been told that's wrong. so here we are.10:55
fwereadewrtp, (btw, tangentially related: with the provisioner bool in machine, can we drop the PA now and just run the appropriate workers in the MA?)10:56
fwereadewrtp, bah10:56
fwereadewrtp, this is the trouble with the "we should do more reviews ourselves thing"10:56
wrtpfwereade: yes, that's on the todo list, maybe today10:56
wrtpfwereade: i know10:56
fwereadewrtp, sweet10:56
wrtpfwereade: what do you think is the right behaviour for logging output from hooks that continue to run something printing in the background after exiting?11:23
fwereadewrtp, hum, I rather feel that hooks probably shouldn't do that, but good question11:24
wrtpfwereade: i'm tempted to throw away any output after the hook exits11:24
wrtpfwereade: that way at least it's clean and we can't get any intermingling11:24
fwereadewrtp, yeah, I'd assumed it would just be a CombinedOutput11:24
fwereadewrtp, wait a mo11:24
wrtpfwereade: i don't want to use CombinedOutput as i want to log lines as they happen11:25
fwereadewrtp, won't plain hook output go to logdir/agentname.out anyway?11:25
wrtpfwereade: why would it?11:26
* fwereade goes to peer at docs11:26
fwereadewrtp, ok, it wouldn't11:27
fwereadewrtp, FWIW, log-until-it-ends matches the python11:30
fwereadewrtp, but I agree that logging lines as they happen is a good thing11:30
fwereadewrtp, (that is also what py does011:30
fwereadewrtp, bbs, updates need restart11:30
fwereaderogpeppe, btw, when you said the machine/unit watchers would be used for major version upgrades... how?12:00
fwereaderogpeppe, ah ok, sorry, now I see12:00
rogpeppefwereade: cool, np12:00
fwereaderogpeppe, except12:00
* fwereade thinks12:00
fwereaderogpeppe, ok, so we upgrade everything, and everything knows not to hit state (apart from that bit?) until after the state itself has been upgraded?12:02
rogpeppefwereade: that's right12:02
rogpeppefwereade: well, actually we probably make a new state server12:03
fwereaderogpeppe, doesn't that put quite a major restriction on the sort of state changes we can actually make?12:03
rogpeppefwereade: i'm not sure of the details12:03
rogpeppefwereade: how do you mean?12:03
fwereaderogpeppe, well, where's it going to write the change? in the old document (created/managed by old code)? or in the new document, which doesn't exist yet and is on a different server?12:06
rogpeppefwereade: where is what going to write the change?12:07
rogpeppefwereade: the upgrader?12:07
fwereaderogpeppe, yeah12:07
fwereaderogpeppe, oh, wait, it writes the code version before it's running the new code, right?12:07
fwereaderogpeppe, feels a bit off somehow12:07
rogpeppefwereade: it uploads the new tools, writes the tools version, waits until all agents are in "pending" state, transforms the database, then triggers all the agents to have at it12:09
rogpeppefwereade: by pending, i mean "major-upgrade-pending" of course12:09
fwereaderogpeppe, ok, thanks :)12:10
fwereadeAram, pre-review delivered on https://codereview.appspot.com/6492110/ -- but I'm worried I've got completely the wrong end of the stick12:18
fwereadeAram, I thought the one thing we *weren't* meant to be notified of was a remove? (except when the watcher happens to have missed a Dead, in which case we should get a Dead change..?)12:19
Aramfwereade: I only implemented current behavior, not what we want it to be in the future.12:21
Aramyou are right about the future.12:21
fwereadeAram, my concern is that the future is *now*12:22
fwereadeAram, AIUI we dropped the life stuff in state so we could focus on it in mstate12:23
Aramwe can never switch, nor test everything if the API is not the same though.12:24
fwereadeAram, but this API is useless12:24
fwereadeAram, as is the one in state12:24
fwereadeAram, actually, how many of these watchers have non-test clients currently?12:27
fwereadeAram, I presume that at least the firewaller/provisioner/machiner does use some of them, maybe most of them12:28
Aramwhite:juju-core$ lsr | egrep 'go$' | egrep -v 'state1|mstate|test' | xargs egrep 'Watch[A-Z][A-Za-z]*' | wc12:28
Aram     64     394    547012:28
Aramwhite:juju-core$ lsr | egrep 'go$' | egrep -v 'state1|mstate|test' | xargs egrep '[A-Z][A-Za-z]*Change' |  lsr | egrep 'go$' | egrep -v 'state1|mstate|test' | xargs egrep '[A-Z][A-Za-z]+Change[^A-Za-z]' | wc12:30
Aram    206    1304   1858912:30
Aramnah, that's wrong12:31
Aramwhite:juju-core$ lsr | egrep 'go$' | egrep -v 'state|mstate|test' | xargs egrep '[A-Z][A-Za-z]*Change[^A-Za-z]' | wc12:31
Aram      8      52     79212:31
Aramwhite:juju-core$12:31
Aramwhite:juju-core$12:31
Aramwhite:juju-core$ lsr | egrep 'go$' | egrep -v 'state|mstate|test' | xargs egrep 'Watch[A-Z][A-Za-z]*' | wc12:31
Aram     22      78    179912:32
fwereadeAram, ok, so the plan is to implement broken watchers and only fix them once we've finished the switch (this is not *necessarily* a criticism, but I am disappointed if that's actually our only path to where we need to be)12:33
AramI don't see how it can be any other way. we use watchers therefore we can't change our client code because we don't have "fixed" watchers. I can't do "fixed" watcher first since we would not have an easy path to transition to mstate since client code would want "broken" patches.12:34
Aramto make the transition fluid the APIs have to be the same.12:35
Aramthe real fix would have been to have fixed watchers in state as well.12:35
fwereadeAram, yeah, I guess so, I can't come up with anything better :(12:35
* fwereade quietly freaks out in a corner somewhere12:36
* fwereade tries not to disturb anyone else12:36
fwereadeAram, well, hmm, actually I'm starting to feel it *would* be better to break what we currently have working in the interest of getting some real code running against mstate12:37
fwereadeAram, but... I suspect this is the cowboy-nature coming to the fore, and I should probably shut up12:38
rogfwereade: https://codereview.appspot.com/649413212:43
rogfwereade: i'm becoming convinced that stopping logging at the end of a hook is not right12:43
fwereaderog, I'm becoming convinced that if we get output after the end of a hook we should spam the user with UR DOIN IT WRONG messages12:44
rogfwereade: i think there's potentially great benefit in having all logs funnel into one place12:44
rogfwereade: as it is, every server that anyone starts logs to a.n.other log file somewhere12:44
rogfwereade: i remember having difficulty with that when writing a charm12:45
fwereaderog, and I think that any sane service will have its own logging, and we are not in the business of indulging craziness like expecting hook child processes to work properly :)12:45
fwereaderog, isn't that a job for logging charms?12:45
fwereaderog, if I'm looking in the juju logs I care about juju12:46
rogfwereade: i think logging should be more fundamental to the system than that12:46
fwereaderog, if I care about what (say) riak is doing, surely I can look at riak's logs12:46
rogfwereade: if i'm a deploying a charm, i care about juju *and* what my stuff is doing, because the two interact12:46
rogfwereade: we want to be able to see a time line of the whole system if possible12:46
rogfwereade: i'm pretty sure that doing this right is a major key to getting juju easier to use and debug.12:48
fwereaderog, that is fine by me :)12:57
rogfwereade: there are various web services that do that at low cost12:57
fwereaderog, having juju agents themselves mediate that sending directly feels bad though12:57
rogfwereade: perhaps you're right.12:58
fwereaderog, I'm not actually against logging, I just feel like there is more than one concern here (even if not quite as many as 2, IYSWIM)13:02
rogfwereade: although when a daemon dies 1 second after you've started it in the background, saying "cannot do something", i didn't think so13:02
rogfwereade: perhaps we could provide a standard way to say "watch this log file please"13:02
rogfwereade: though i suppose logger(1) isn't bad for that13:02
fwereaderog, feels to me like a feature tailor-made for a relation with eg rsyslog13:02
rogfwereade: sadly subordinate charms don't work too well for logging currently13:02
fwereaderog, ha, ok, I haven't investigated :(13:02
fwereaderog, why not?13:02
rogfwereade: you can't guarantee that the subordinate has started before the principal13:02
rogfwereade: which i think you want if you want to set up a logging destination so that every message from the principal is logged.13:02
fwereaderog, it doesn't sound like it would be *too* hard for a logging charm to catch up with messages from the past, but maybe that's crazy, it could certainly have pathological behaviour in some situations13:04
rogfwereade: FWIW i think that we *should* guarantee that if a service has subordinates, their start hooks should complete before the principal's install hook gets run.13:04
rogfwereade: depends if the messages are going onto disk or straight through the network13:04
fwereaderog, the best we can do is guarantee that happens *some* of the time, though, right?13:04
fwereaderog, why would they be going across the network before the subordinate is configured to send them there?13:05
rogfwereade: of course if you add a subordinate after deploying a unit of service we can't guarantee it :-)13:05
rogfwereade: it might not be possible to switch destinations mid-flow. ok, i admit i don't know anything about syslog :-)13:05
fwereaderog, nor do I really :)13:06
rogfwereade: but as a general point, i think it would make subordinates much more useful13:06
rogfwereade: because they could be used to configure some aspect of the system for the principal, *before* the principal starts13:06
rogfwereade: for instance, to tweak kernel paramaters13:07
rogeters13:07
fwereaderog, I still don't really see why it matters for the principal to run in an unhelpful mode for a few seconds before being bounced by the subordinate when it comes up13:07
fwereaderog, and the subordinates will always have to be written so as to be able to deal with suddenly being deployed against a running unit anyway, surely?13:08
rogfwereade: you're assuming that whatever it is the subordinate is setting up can be changed *after* the principal starts13:08
rogfwereade: sure - some subordinates might not be able to work in that scenario13:08
fwereaderog, I dunno, that level of interference feels to me like it would be better addressed by letting people launch on custom images13:09
rogfwereade: it's a combinatorial problem - there are n-dimensions of tweaks, and we can't provide images that address all points in the space.13:10
fwereaderog, I'm not suggesting that *we* should, I'm just suggesting that allowing people a mechanism for choosing ther own images addresses this use case without introducing a whole new class of subordinate charms (ie these ones that *cannot* be started aftertrhe principal)13:11
rogfwereade: i did think of a way around it using the current system primitives BTW13:12
rogfwereade: i might have mentioned it before13:12
* fwereade isn't sure whether or not he can hear a bell ringing13:12
fwereaderog, go on :)13:12
rogfwereade: a "sync" interface13:13
* fwereade can hear a bell now, but can't remember how it was meant to work13:14
rogfwereade: a principal provides it; a subordinate requires it; the principal only starts when it gets a "sync done" message from the subordinate13:14
fwereaderog, how does the unit know whether or not it's got a subordinate when it runs its start hook in the first place13:15
fwereade?13:15
fwereaderog, and can we have 2 synced charms?13:15
rogfwereade: you'd set "required-subordinate-count" in the service config, or something like that13:15
fwereaderog, hmmm, and how do we deal with people deployed syncing subordinates when the services are already running?13:16
fwereaderog, sorry, I'm way off on a derail here13:17
rogfwereade: i don't think that's a problem actually, but yeah13:17
rogfwereade: it's an interesting example, i think13:17
fwereaderog, yeah, but I should be writing actual code right now ;p13:18
rogfwereade: indeed13:18
rogfwereade: so i'll leave the logging as it is for now. review appreciated :-)13:18
fwereaderog, oh yeah, that was why I started talking about this ;P13:18
fwereaderog, LGTM13:23
rogfwereade: thanks13:23
rogniemeyer: morning!13:29
niemeyerrog: Morning!13:29
rogniemeyer: i got the uniter one stage closer this morning - the unit upgrading worked.13:29
niemeyerrog: Woah!13:30
rogniemeyer: and it failed because the PATH was not set up rather than anything else, i believe13:30
niemeyerrog: Aw.. almost there then :)13:30
rogniemeyer: although charm stdout was lost, so i don't know (hence this morning's CL: https://codereview.appspot.com/6494132)13:30
niemeyerrog: Cool, looking13:31
niemeyerrog: Hmm13:33
rogniemeyer: hmm?13:33
niemeyerrog: Why are we seeing this issue? When a process exits we can generally determine deterministically that we have all its output13:34
rogniemeyer: no we can't13:34
rogniemeyer: i know this because i see the issue all the time13:34
niemeyerrog: Ah, I see the issue, backgrounding.. ok13:35
rogniemeyer: actually it happens with no backgrounding at all13:35
fwereadehey, I just noticed: we just hit r500 :)13:37
rogniemeyer: or... it's possible that os/exec works around the problem for us actually13:37
niemeyerrog: http://paste.ubuntu.com/1202638/13:38
niemeyerrog: How does that work?13:38
rogniemeyer: good question; i'm looking in the source13:38
niemeyerrog: How does it know to stop reading, more specifically13:38
niemeyerrog: Cool13:38
niemeyerrog: My understanding is that if a process has exited, we necessarily have all output that came from *it* into buffers13:39
niemeyerrog: So we can simply read13:39
niemeyerrog: If there's more, there's backgrounding going on that may be considered misbehaving13:39
rogniemeyer: do we specifically disallow starting any background process from a hook without redirecting its stdout & stderr?13:40
niemeyerrog: This is bad practice, but I don't think we care in this case13:41
niemeyerrog: What we care, I think, is that we're logging everything the hook ever gave us by itself13:42
rogniemeyer: we *do* still have the issue, i think13:42
niemeyerrog: So what *is* the issue? :)13:42
rogniemeyer: the issue is that the process can write some stuff into the pipe, exit, we see the exit, but the pipe still has data in13:43
niemeyerrog: I think that's exactly what I mentioned above13:43
niemeyerniemeyer> rog: My understanding is that if a process has exited, we necessarily have all output that came from *it* into buffer13:43
rogniemeyer: it happens because we're using a pipe rather than sending the data into a local buffer13:43
rogniemeyer: in this case, we're not using a local buffer.13:43
rogniemeyer: an alternative would be to block forever waiting for EOF, which is what would happen if we used CombinedOutput, for example13:44
rogniemeyer: i prefer to be a little more robust, i think.13:44
=== hazmat is now known as 45PAA1GWO
niemeyerrog: Hm.. I don't understand.. those two things are exactly the same13:46
niemeyerrog: The only way to send data into local buffer is by using a pipe13:46
niemeyerrog: We certainly have to finish reading after the process has exited, not before13:46
niemeyerrog: Okay.. (?)13:46
niemeyerrog: I'd prefer to be robust, but meaningful.. (?)13:46
niemeyerrog: What is the actual symptom of the problem?13:46
rogniemeyer: we'd see log messages produced by the hook after RunHook had returned13:46
rogniemeyer: for a few microseconds at most13:47
niemeyerrog: How can the hook produce output after it has returned?13:47
niemeyerrog: Sorry, I'm pretty confused by now13:47
rogniemeyer: because there's data still in the pipe13:47
niemeyerrog: It's not *producing* output13:47
niemeyerrog: There's data in buffers.. that's normal unix behavior13:47
rogniemeyer: no, it's produced it, but we haven't read it yet13:48
niemeyerrog: Exactly.. that's handled without arbitrary delays13:48
rogniemeyer: so we'll probably get EOF immediately and everything is hunky dory13:48
rogniemeyer: but if we always wait for EOF then we get bitten if there *is* a background process that's keeping the pipe open13:48
rogniemeyer: we could make the delay 2s or so if that would make you happier13:49
niemeyerrog: We don't get bitten, because we don't have to wait..13:49
rogniemeyer: i'm afraid we do. i'm just writing some code to demo the issue.13:49
niemeyerrog: Thanks, that'd be awesome13:49
rogniemeyer: essentially the OS doesn't guarantee that the pipe has been drained when wait(2) returns13:51
niemeyerrog: The pipe may never get drained if there's a process holding it13:52
rogniemeyer: exactly13:52
niemeyerrog: We only care that the process has exited, and we can't read anything from it anymore13:52
rogniemeyer: the process can exit *and* there still be data in the pipe.13:53
rogniemeyer: which we can read.13:53
rogniemeyer: anyway, one mo13:53
niemeyerrog: Then we read it.. (!?)13:53
rogniemeyer: yeah, but we need to wait for it to be read, hence the <-logger.done13:54
niemeyerrog: What I mean is that we don't have to put a boundary on the reader to output its data in 50ms13:56
niemeyerrog: There's a deterministic event happening..13:56
rogniemeyer: only if there's no background process13:56
niemeyerrog: Always13:56
niemeyerrog: "I can't read data now" is deterministic13:57
rogniemeyer: how do we know when that event has completed?13:57
niemeyerrog: No matter if there's something in background or not13:57
rogniemeyer: we know there *might* be some data, but we don't have any way of knowing for sure, or how much there is13:57
niemeyerrog: Hence, "I can't read data now"13:57
niemeyerrog: If we can, doesn't matter how much there is, or how much time has passed, we can read it13:58
rogniemeyer: how do we know "I can't read data now"?13:58
rogniemeyer: there's no non-blocking read13:58
niemeyerrog: Because the system call tells us that?13:58
niemeyerrog: Oh, there isn't?13:58
rogniemeyer: i don't think so13:58
* niemeyer looks13:58
rogniemeyer: we could probably use syscall.SetNonblock14:00
rogniemeyer: but i'm reluctant to do that14:00
niemeyerThis sucks :(14:00
rogniemeyer: tbh i don't mind waiting a bit longer if someone starts a process in the background and that triggers the timeout14:02
rogniemeyer: given they're not meant to be doing that anyway14:03
niemeyerrog: Well, the consequence isn't great.. besides the lack of determinism (we can be closing the door on the logger before we've read the *real output*) we're holding resources, perhaps consecutively, for things that may not exit14:04
rogniemeyer: that's all true14:04
niemeyerrog: Anyway, thanks for the coverage14:06
niemeyerrog: It's a great workaround and we should definitely move on with it for now14:06
rogniemeyer: np. i'm not sure what you'd like to do about the issue though.14:07
rogniemeyer: ok, cool14:07
niemeyerrog: I'll finish the review so we can integrate this14:07
rogniemeyer: thanks14:07
rogniemeyer: FWIW i verified that CombinedOutput does block until EOF, even if the process has exited.14:10
niemeyerrog: Cheers. That's indeed the right thing for it to do14:10
rogniemeyer: yup14:10
rogniemeyer: but not what we want, i think14:11
niemeyer( echo foo; sleep 5&; echo bar) | cat14:11
rogniemeyer: yup14:11
niemeyerThis is the issue, in summary :)14:11
niemeyerrog: Agreed14:11
niemeyerGosh, and the bank calls me again.. the government is asking for new docs for the exchange contracts14:30
niemeyerJust what I was asking for14:30
niemeyerrog: Done, sorry for the delay14:39
rogniemeyer: tyvm14:39
rogniemeyer: do you think we just just truncate long lines?14:40
rogniemeyer: or should we send them as several log messages?14:41
niemeyerrog: I don't think we need to do much, to be honest14:43
niemeyerrog: ReadLine already truncates them14:43
rogniemeyer: so you'd go with that?14:43
rogniemeyer: (and ignore continuation lines)14:44
niemeyerrog: No, just log them14:44
rogniemeyer: ok, so we'll split lines14:44
rogniemeyer: fair enough14:44
niemeyerrog: Yeah, ReadLine already does it..14:45
rogniemeyer: yeah - it's always a difficult call whether to discard continuations of a split line though14:45
rogniemeyer: i'm mildly tempted to print [%d bytes truncated at end of line]14:46
rogniemeyer: in case someone cats an executable or something14:46
niemeyerrog: It's not truncated14:47
rogniemeyer: but... if someone's printing a large lump of json, you probably want all of it, split or not14:47
niemeyerrog: ReadLine will return on the next line14:47
niemeyerrog: Ah, I see, you're thinking about changing the behavior14:47
rogniemeyer: yeah, i'm concerned about changing the meaning of log messages14:48
rogniemeyer: but i can make the buffer pretty big, so it's really not much of an issue14:48
niemeyerrog: Change the meaning of log messages?14:48
niemeyerrog: What's the meaning of log messages? :)14:48
rogniemeyer: well, if we're grepping out messages by prefix (not uncommon), then we'll miss continuation lines14:49
rogniemeyer: but i think it's common to have some line-length restriction, and better that than lose information14:50
niemeyerrog: Grepping for prefixes would still work fine, I believe14:50
rogniemeyer: i'll use a buffer size of 128K or something14:51
niemeyerrog: Why? Please just keep the default14:51
rogniemeyer: not if the prefix is part of the message logged, i believe14:51
niemeyerrog: The prefix will end up in the same place14:51
rogniemeyer: yes, but if we grep for it, the continuation lines won't have it14:51
niemeyerrog: Yes, they won't, because they are continuation lines that don't have the prefix.. duh?14:52
rogniemeyer: so our grep output won't have all messages with that prefix14:52
niemeyerrog: Man14:52
niemeyerrog: Okay, please do whatever14:52
rogniemeyer: it depends how people are using the logging. 4K is probably well sufficient for almost all purposes, i'd guess14:53
rogniemeyer: so i'll leave the default14:53
fwereadeniemeyer, https://codereview.appspot.com/6496120 should be near-enough trivial, assuming you're comfortable with the approach in the short/medium term14:57
rogniemeyer: "s/0.1/0.2/ probably, given the internal change?" - is there a timeout change you intended to suggest but didn't?14:58
niemeyerrog: Huh.. indeed15:03
niemeyerrog: I typed it, but it got eaten somehow15:03
niemeyerrog: I was suggesting a small bump to 100ms15:04
rogniemeyer: sounds good15:04
niemeyerrog: Cheers15:04
niemeyerfwereade: Looking15:04
fwereadebrb, cig15:04
fwereadeb15:09
niemeyerfwereade: Seems mostly sane15:11
niemeyerfwereade: There's a small bit that is feeling contrived that sounds easy to simplify15:11
fwereadeniemeyer, ha, now that's faint praise ;p15:11
fwereadeniemeyer, cool15:11
fwereadeniemeyer, I kept finding myself wanting to extract a whole new type but it doesn't quite feel like the right time yet15:12
niemeyerfwereade: We're sending (dataDir, agentName) aaaaall the way down to EnsureJujuSymlinks, so it can call AgentToolsDir, and then we send that information aaaaaall the way up so we can use it15:12
niemeyerfwereade: When all these functions really care about is the directory, not the dataDir, not the agentName15:13
fwereadeniemeyer, fair point... so, make EJC just take a dir then? or do I misapprehend?15:14
niemeyerfwereade: Yeah, both ensureFs and EJC can take the toolsDir themselves15:15
niemeyerfwereade: and we can generate that info at the top level by calling it explicitly from NewUniter15:15
niemeyerfwereade: Neither of them, I think, care about dataDir or agentName15:15
fwereadeniemeyer, I had considered dropping ensureFs entirely, how would you feel about that? -- just ensure the state dir, and EJC the toolsDir directly?15:16
niemeyerfwereade: +115:16
fwereadeniemeyer, cool, on it :)15:16
niemeyerfwereade: The other thing I quickly pondered was whether we should just stick filepath.Dir(os.Args[0]) in the PATH rather than passing toolsDir into RunHooks, but I'm not so sure about that15:17
fwereadeniemeyer, not so keen really15:18
niemeyerfwereade: Cool15:20
rogniemeyer: i exported var LineBufferSize to make it easy for the tests to check line splitting. hope that seems ok, before i submit.15:24
rogniemeyer:15:24
rog// LineBufferSize holds the maximum length of a line read15:24
rog// from a hook. Lines longer than this will overflow into15:24
rog// subsequent log messages.15:24
rogvar LineBufferSize = 409615:24
rogniemeyer: oops, i meant to paste this: https://codereview.appspot.com/649413215:24
rogactually, it could easily be const15:24
niemeyerrog: It doesn't have to be public as well15:26
rogniemeyer: ok, i'll add export_test.go15:26
niemeyerrog: Ugh.. why?15:26
rogniemeyer: because the testing code needs access to the value15:26
rogniemeyer: so it can verify the split easily15:26
niemeyerrog: Just hardcode the number in the test as well.. this is not nearly as interesting as it sounds :)15:27
rogniemeyer: ok; thought you might not like that :-)15:27
rogfwereade: i'm seeing sporadic uniter test failures: http://paste.ubuntu.com/1202839/15:28
fwereaderog, aw poo, I'll take a look15:28
* fwereade is baffled, offhand15:30
fwereaderog, I'll investigate more closely in a mo15:30
rogfwereade: i only get it when i do go test ./... (i.e. it seems like a race of some kind)15:30
fwereaderog, thanks, good to know15:30
* niemeyer => lunch!15:33
rogfwereade: i've just reproduced the behaviour in trunk (i was slightly concerned it was a change i'd made). it happens about 30% of the time for me.15:43
fwereaderog, excellent -- I have not seen that one myself, but I'll see if I can figure it out15:44
fwereaderog, ...but... maybe tomorrow :( cath kinda needs support asap, she's been ill all day and laura's been around too15:45
rogfwereade: np15:45
rogfwereade: go tend your flock :-)15:45
fwereaderog, I'l hanging on for dear life to repropose one branch then that's probably it for me15:45
rogfwereade: do you mind if i make the PATH change?15:46
fwereaderog, kinda, because that's what I'm proposing ;)15:46
rogfwereade: ah, do it, DO IT!15:46
fwereaderog, just forcing myself to witness a full test run ;)15:46
rogfwereade: aw, i know that feeling very well15:47
rogfwereade: live?15:47
fwereaderog, afraid not, I should probably do that too, but for *that* I definitely don't have time :(15:48
rogfwereade: 'sok, i'm sure local is fine15:48
fwereaderog, https://codereview.appspot.com/6496120 and goodnight :)15:53
rogfwereade: nnight. hope cath gets better before you have to leave...15:56
niemeyerrog: ping16:53
rogniemeyer: pong16:53
niemeyerrog: Oh, hmm16:53
niemeyerrog: I guess we'll need to talk anyway, given fwereade's comment16:53
niemeyerrog: Was just looking at https://codereview.appspot.com/6498124/16:53
niemeyerrog: and pondering about where mstate is16:54
niemeyerrog: But we'll need further info it seems16:54
rogniemeyer: i thought about doing mstate but thought i'd leave it until there were more watchers and the pattern was fully established16:54
niemeyerrog: I think it'd be fairly equivalent to the machine one16:55
rogniemeyer: yeah. i wondered if there was going to be some equivalent of the generic machinery that was in state, where each watcher took less lines16:55
niemeyerrog: I'd prefer to add it together.. hopefully we can transition entirely next week during the sprint (fingers crossed)16:55
niemeyerrog: But we'll need to talk to fwereade anyway16:55
rogniemeyer: what do you think about the UnitWatcher as a concept?16:56
niemeyerrog: Probably, but we'll need to evolve a bit before that, I suspect at least16:56
rogniemeyer: shall i just poll for the time being?16:56
niemeyerrog: I personally think it's fine16:56
niemeyerrog: We'll need to watch it anyway for other reasons, I believe (e.g. life)16:56
rogniemeyer: yeah16:56
rogniemeyer: given fwereade's remarks, i wondered about making the unit watcher watch the unit's presence too16:57
rogniemeyer: then at least it would signal all status changes16:57
niemeyerrog: Initial feeling is that this would cross two unrelated issues together16:58
niemeyerrog: I'd be curious to know more about his feelings16:58
rogniemeyer: the difficulty is that you watch *almost* all unit status changes.16:58
rogniemeyer: but yes, i tend to agree.16:59
niemeyerrog: Why did you need it again, for reference?17:00
rogniemeyer: for watching the agent version17:00
rogniemeyer: so that i can tell when the unit agent upgrades17:00
niemeyerrog: Ah, for testing only, ok17:00
rogniemeyer: yes. although when we do major version upgrades, we'll probably need it too17:00
niemeyerrog: Interestingly, I notice we have a bunch of unrelated watchers on the unit.. ports, resolved, ..17:04
niemeyerrog: and a bunch of properties that can't be observed17:05
niemeyerrog: All of those are very rarely changed.. I wonder if we should move towards observing the unit itself17:05
rogniemeyer: yeah. that has made me a bit uncomfortable. we can watch some properties of the unit because they happen to be stored in a certain way. that might not be true in mstate tho'17:05
rogniemeyer: what do you mean by that?17:06
niemeyerrog: It's even the opposite, which kind of sucks17:06
niemeyerrog: We stored them in a certain way because we wanted to observe them17:06
niemeyerrog: let's say we then decide to observe an existing property with state (pre-mstate).. we'd be forced to observe the unit node itself17:07
rogniemeyer: i'm not that familiar with mstate. what do you mean by the unit "node" ?17:07
rogoh, sorry17:07
rogpre-mstate17:07
niemeyerrog: I mean the document (in mstate) or node (in zk) that contains most unit information17:08
niemeyerrog: We've split certain bits out, originally, because we felt it'd be more efficient to observe it that way17:08
niemeyerrog: It now feels like this is a poor design17:08
rogniemeyer: most but not all (thinking of the presence state, for example)17:08
niemeyerrog: Because we're attempting to anticipate every use case17:09
niemeyerrog: That's a very particular one17:09
niemeyerrog: That will always be special17:09
rogniemeyer: true17:09
niemeyerrog: Because we'll have significant volume of small bits of information going on and off regularly17:09
rogniemeyer: so you think the unit watcher (in mstate) should observe the entire unit document?17:11
rogniemeyer: (i don't even know if that makes sense actually!)17:11
niemeyerrog: Right, we have already unified all settings in a single document even17:11
niemeyerrog: Because the cost of loading the extra settings is nothing compared to the cost of everything else17:12
niemeyerrog: (make query, transfer ir, lookup index, lookup data, transfer back, unmarshal in, unmarshal out, blah blah)17:12
rogniemeyer: so in fact the unit watcher in mstate *would* automatically see all unit changes, e.g. ports etc?17:12
niemeyerrog: Exactly, and so will every other watcher that watches the unit, even if we do have separate watchers17:13
niemeyerrog: We can even filter out the specific settings, if we do want that, but again it's pretty irrelevant17:13
niemeyerrog: The cost of the operation is not there17:13
niemeyerrog: In summary, my point is that I think we could do pretty well with a single unit watcher, instead of tons of per-setting watchers for the unit17:14
rogniemeyer: i think it's still worth having typed channels rather than always passing around an object that you must invoke a method on to get the value17:14
rogniemeyer: because channels are generic. i had to jump through hoops to watch both the machine agent tools and the unit agent tools with the same code.17:15
niemeyerrog: I don't know.. it doesnt' feel worth it.. I believe these watchers will each have a single use case in the whole code17:15
rog"generic" is maybe a bad way of saying  it.17:15
rogniemeyer: it also means that every client needs to check if the setting it's after has actually changed17:16
rogniemeyer: i think part of the problem is that our watchers are so darn heavyweight17:16
rogniemeyer: if we did it right, a new watcher should only be 6 or 7 lines of code, i think.17:16
niemeyerrog: Indeed, but instead of checking it, we're doing a lot of heavy lifting to do exactly the same in a generic way17:17
niemeyerrog: Even then, it doesn't feel great17:17
niemeyerrog: We need a new method on the type, a new type to put on the channel, a new function to handle it, etc etc, for *each* detail we want to watch17:18
rogniemeyer: part of the problem for me is that interfaces don't work when you've got chan T, but only when you've got T17:19
niemeyerrog: That is unfortunate indeed..17:20
niemeyerrog: at the same time, it's not reaaally an issue in this case.. we have type-specific channels in both cases17:21
rogniemeyer: if i've got a chan *state.Unit, it's type specific, but it doesn't really represent what i'm watching.17:26
niemeyerrog: Well, if you're watching the unit it does represent what you're watching17:26
rogniemeyer: this is what i had to do in livetests: http://paste.ubuntu.com/1203085/17:27
rogniemeyer: so i could treat units and machines in the same way17:27
rogniemeyer: for the purposes of waiting for them to upgrade17:28
niemeyerrog: That seems pretty involved17:30
niemeyerrog: Isn't it just a matter of having a function that looks like17:30
niemeyerrog: func machineToolsHolder(ch <-chan *Machine) <-chan ToolsHolder17:32
niemeyerrog: ?17:32
niemeyerrog: To overcome exactly the issue you described above?17:32
niemeyerrog: The logic itself is exactly the same for both17:32
rogniemeyer: that's essentially what i pasted17:33
niemeyerrog: Heh17:33
niemeyerrog: Except completely unrelated, right? :)17:33
niemeyerrog: I'm saying converting a *channel type* onto *another channel type*17:33
niemeyerrog: The *logic* is the same17:34
niemeyerrog: You're converting a *watcher* onto anohter *watcher*17:34
rogniemeyer: well, it stores the machine too, so that the generic code can inspect the error if it wants17:34
rogniemeyer: i'm not convinced it would be less code overall17:35
rogniemeyer: ah, i see what you mean actually.17:36
niemeyerrog: It removes the duplication that you have there, with two watchers that look exactly the same..17:36
rogniemeyer: where MachineToolsHolder is something like:17:37
rogtype MachineToolHolder interface {17:37
rogAgentTools() (*state.Tools, error)17:37
rog}17:37
niemeyerYep17:37
rogi mean ToolsHolder of course17:37
rogniemeyer: good enough for testing anyway17:38
rogniemeyer: i still think it's a pity that we require that every stage in a channel pipeline have its own tomb.17:40
rogniemeyer: but i lost that one ages ago.17:40
niemeyerrog: I don't think we require that17:45
rogniemeyer: i think we do if we want to maintain the invariant that calling Stop means no more values will be produced on a channel.17:45
niemeyerrog: In fact, the mstate watcher mechanism explicitly does not have a per-channel tomb17:46
niemeyerrog: What we require in the design, since the beginning, is that we deterministically stop background activity17:46
rogniemeyer: are you talking about the MachineWatcher? i should have a look then17:46
niemeyerrog: That sounds sane17:46
niemeyerrog: No, I'm talking about the watcher foundation and the way that MachineWatcher uses it17:47
niemeyerrog: We do not have per-channel tombs there17:47
rogniemeyer: so the MachineWatcher doesn't have a tomb? cool!17:47
niemeyerrog: Heh17:47
niemeyerrog: The MachineWatcher has a tomb, because it has a goroutine that we want to stop deterministically17:48
rogniemeyer: i think we'll usually find that every stage in a pipeline is like that17:48
niemeyer<niemeyer> rog: No, I'm talking about the watcher foundation and the way that MachineWatcher uses it17:48
niemeyer<niemeyer> rog: We do not have per-channel tombs there17:48
rogniemeyer: a machine watcher is one stage down from the watcher foundation.17:48
rogniemeyer: i don't see the watcher foundation as a pipeline.17:48
niemeyerrog: Okay, sorry17:49
rogniemeyer: we *can* deterministically stop goroutines without a tomb per-stage if we're prepared to wait on the channel for the goroutine to acknowledge that it's been stopped (by closing the channel)17:49
rogniemeyer: i know that was rejected ages ago, but i still think it's a nice way to go about things.17:50
niemeyerrog: We can't close a channel from the consumer side17:50
rogniemeyer: it means the stop message can percolate deterministically down the pipeline17:50
rogniemeyer: each stage in the pipeline can delegate the stop handling to the stage upstream17:51
rogniemeyer: when that sees a stop, it closes the upstream channel, which then flows stage-by-stage back to the most downstream channel17:52
niemeyerrog: Doesn't work.. it means we have background activity because closing the channel leaves things running17:52
niemeyerrog: Deterministically stopping means Stop() returned, nothing else is happening17:52
niemeyerrog: It's one of the things we've fixed from Python, and it makes me very happy to be honest17:53
rogniemeyer: a stop would be a two-stage process - we close the stop channel, then wait for to see the downstream channel closed.17:53
rogniemeyer: there's nothing left running17:53
niemeyerrog: Heh.. I'm sure you can create many other mechanisms to do the same17:53
niemeyerrog: We have one, that works, and works well17:54
rogniemeyer: it means it's really heavyweight to do things that should be throwaway pieces of code, IMHO17:54
rogniemeyer: the only thing that *needs* to do a select is the head of the stream.17:55
niemeyerrog: Sorry, but this isn't helpful at all.. you're building a new design as I mention the requirements, and saying that what we do is heavyweight despite the fact it *works*.17:55
rogniemeyer: anyway, sorry, derail17:55
rogniemeyer: yes it does17:55
niemeyerrog: If you want to reimplement juju again, I won't be around17:55
rog:-)17:55
rogniemeyer: i'm off for the evening. see ya tomorrow!18:06
niemeyerrog: Have a good time18:06
rogniemeyer: when do you set off?18:06
niemeyerrog: Sometime Saturday only, I'll be around tomorrow still18:07
rogniemeyer: cool18:07
rogniemeyer: if you manage to remove fwereade's toolsDir branch tonight, that would be awesome. i reckon we can get it running actual commands tomorrow possibly.18:08
niemeyerrog: I'm sure it must be ready to go in.. it was really a trivial detail, and it was great18:09
niemeyerjimbaker: ping18:14
jimbakerniemeyer, hi18:14
niemeyerjimbaker: Heya18:14
niemeyerjimbaker: I'm wondering about format 2 stuff18:14
niemeyerjimbaker: Were you the one pushing it?18:14
jimbakerniemeyer, yes18:14
jimbakerit's in trunk now18:14
jimbakerin terms of the raw string changes18:15
niemeyerjimbaker: Okay.. I've noticed there's a "-Not set-" value being used by "config get"18:15
niemeyerjimbaker: Have you killed that?18:15
jimbakerniemeyer, i didn't notice that in the code or usage... i'll see if it's still there18:15
niemeyerjimbaker: Ouch18:15
niemeyerjimbaker: config_get.py18:16
niemeyer_jimbaker: Sorry, connection broken18:18
jimbakerniemeyer, yes that's unfortunate... i didn't touch juju get in the merged branch, but we should definitely address so it is also raw18:19
niemeyer_jimbaker: It should be nil or something.. "-Not set-" is a perfectly valid *set* value18:20
jimbakerniemeyer_, agreed18:20
jimbakeri understand that the attempt here is to provide some human readable info on how the settings have been made (such as the default flag). i do think it would be better if it were to simply roundtrip with juju set, but that's too late i suppose18:23
niemeyer_jimbaker: Can you please fix it so it's "None" instead?18:24
jimbakerniemeyer_, np. so should we also have the new behavior respect format 2 for that service?18:24
=== niemeyer_ is now known as niemeyer
niemeyerjimbaker: No, this is client side18:25
niemeyerjimbaker: format 2 is about the charm18:25
jimbakerniemeyer, i understand18:25
niemeyerjimbaker: This is a bug in the juju get command, that I suggest actually fixing18:25
niemeyerjimbaker: I'll suggest the same in the Go side18:25
niemeyerjimbaker: yaml has a perfectly valid idea of "nil"18:26
niemeyerjimbaker: Which is what this means18:26
jimbakerok, i'll propose this fix18:26
niemeyerjimbaker: Cheers18:26
=== steev_ is now known as steev
fwereadeniemeyer, heyhey, sorry I missed the conversation above19:10
fwereadeniemeyer, I did repropose the toolsDir PATH branch, maybe you missed it19:10
fwereadeniemeyer, https://codereview.appspot.com/649612019:11
jimbakerniemeyer, juju get is now fixed in trunk r577, and the change will be part of the 0.6 release19:11
niemeyerfwereade: Ah, sorry, I was postponing it because I didn't think you'd do it right now, but if you're active I'll have a look right away.. I'm sure it's good19:12
niemeyerjimbaker: Woohay, cheers19:12
fwereadejimbaker, cool19:12
fwereade(I wondered about that in the golang implementation :))19:13
jimbakerfwereade, no need to port that bug :)19:13
* fwereade suddenly has first-job flashbacks... bug-compatible reimplementation of directx... fun :)19:14
fwereadewe called it indirectx, of course19:14
niemeyerLOL19:21
niemeyerfwereade: Done, thanks!19:22
fwereadeniemeyer, sweet, tyvm19:27
fwereadeniemeyer, btw, I'm not 100% sure whether this is just me seeing patterns where none exist... but ISTM that the 50ms spinning while waiting for the uniter makes things a lot flakier than the 200ms spinning when I'm running full test suites, and I have a theory that this is *because* I'm spending more time spinning rather than sitting back and allowing the uniter to do its thing19:32
niemeyerfwereade: Hmm.. this sounds a bit like we don't know what's actually going on19:33
fwereadeniemeyer, how would you feel about dialing them back a little, at least to, say, 100ms... (until I decide that I'm seeing just as many just-missed timeouts, and change it back)19:33
fwereadeniemeyer, ha, yes, that is a good way of putting it19:33
niemeyerfwereade: I'd rather dial them to 25ms so we can actually find out what's going on :)19:34
fwereadeniemeyer, haha :)19:34
niemeyerfwereade: The delay is preceding a pull.. if returning earlier makes it break anyhow, something is necessarily not ok19:34
niemeyerfwereade: In some cases this reflects background activity that is missing a Stop, for example19:35
fwereadeniemeyer, hmmm, my theory was no more sophisticated than "the more I do on the test goroutine, the less time is spent letting real things actually happen"19:37
fwereadeniemeyer, (I'm defiitely not contradicting your suggestion)19:37
niemeyerfwereade: I find that unlikely.. 50ms is an eternity in computing time19:38
niemeyerfwereade: I find it more likely that we're simply lucky enough to have spotted a way to observe a problem19:38
fwereadeniemeyer, you are probably right... many uniter things take longer than I would naively expect them to19:39
niemeyerfwereade: Can I help you to debug the issue anyhow?19:40
fwereadeniemeyer, not obviously, except perhaps to guess at why this chunk of code might take 200ms:19:41
fwereade        if err = u.deployer.Deploy(u.charm); err != nil {19:42
fwereade            return err19:42
fwereade        }19:42
fwereade        if err = u.sf.Write(reason, Done, hi, url); err != nil {19:42
fwereade            return err19:42
fwereade        }19:42
fwereade    }19:42
fwereade    log.Printf("charm %q is deployed", url)19:42
fwereadeniemeyer, from the final deferred log in Deploy to the log at the bottom19:42
fwereadeniemeyer, but, yeah, that's not a helpful thing to ask19:42
niemeyerfwereade: I'll have a look at the code19:43
fwereadeniemeyer, specifically in the context of the jujud test19:43
fwereadeniemeyer, but I'll be staring at verbose runs of that on its own and seeing if I can spot anything, too19:43
fwereadeniemeyer, hmm, yeah, that's more like 10ms when run on its own19:44
niemeyerfwereade: Ah, ok19:44
niemeyerfwereade: That sounds about right for a disk seek19:45
fwereadeniemeyer, hmm, just got a couple of 100ms~s in a row19:46
niemeyerfwereade: Ok, that may be the machine working.. keep in mind that we're fsyncing on the writes19:47
fwereadeniemeyer, ha! yes, I totally forgot that19:47
fwereadeniemeyer, ok, I feel a little less insane now, and in that case actually more comfortable just bumping timeouts across the board19:48
niemeyerfwereade: Maybe I misunderstand the original issue then19:48
niemeyerfwereade: I thought the 50ms was inside a loop19:48
fwereadeniemeyer, yeah, just when polling for things to have happened19:48
niemeyerfwereade: That would wait further for a few seconds until the desired state actually arrived19:48
niemeyerfwereade: It'd certainly be too little to be deterministically waiting *just* that, but it's not too little if it's inside a loop19:49
niemeyerfwereade: Do I misunderstand what you're actually covering?19:49
fwereadeniemeyer, the issue is really just that, when things fail, I see a lot of "waiting..." spam from inside the tests; specifically, much more than I intuitively expected19:51
fwereadeniemeyer, being reminded that we're fsyncing every write makes that behaviour seem a lot less weird19:52
fwereadeniemeyer, ie I now have a mechanism for the uniter taking its own sweet time over things19:52
fwereadeniemeyer, rather than the half-baked "overeager tests + gc pauses + handwaving" theory19:53
niemeyerfwereade: LOL19:53
niemeyerfwereade: This is cool :)19:53
fwereadeniemeyer, and I'm now comfortable just bumping the top-level timeouts a little and relaxing19:53
niemeyerfwereade: We can certainly disable the fsyncing for tests later19:55
niemeyerfwereade: But we should make sure things actually work  first :)19:55
fwereadeniemeyer, v good point19:55
fwereadeniemeyer, quite so :)19:56
=== 45PAA1GWO is now known as hazmat
* niemeyer steps out for a while20:57

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!