[00:27] davecheney, niemeyer: you'll be glad to know that the recent cloudinit changes and traffic auth changes mean that we can actually run a uniter live. [00:27] wrtp: WOAH [00:27] wrtp: No kidding! [00:27] * niemeyer grabs some port wine to celebrate :) [00:27] davecheney: in a few minutes, i *should* see the first live uniter upgrade [00:27] niemeyer: ^ [00:28] niemeyer: the first time failed because i'd forgotten to add the UpgradedError check... [00:28] * wrtp does the same [00:37] ah, interesting case. it *would* work, but it doesn't get as far as starting the upgrader because the uniter is returning an error when initialised. [00:38] hmm, not easily fixed either. [00:48] niemeyer: this is an interesting case that i hadn't anticipated; i'm not quite sure what to do [01:08] wrtp: Hmm [01:08] wrtp: What's the error? [01:08] wrtp: and at which point? [01:08] niemeyer: it actually doesn't really matter so much what the error is in fact [01:09] niemeyer: although it's actually ModeInit: bad http response 404 Not Found [01:10] niemeyer: the problem is that if one of the workers consistently dies early, the task exits, and all the other tasks (including the upgrader) get killed off [01:10] niemeyer: so the upgrader never has a chance to upgrade [01:10] niemeyer: i've worked out what i think is a decent solution though [01:10] wrtp: If the uniter is consistently dying early, sounds like something is wrong? [01:11] niemeyer: indeed it is, but our code should be able to cope with that, even if it is wrong. [01:11] niemeyer: otherwise we have a situation where we're running some bad code and we can't upgrade out of that, even though nothing has crashed [01:12] wrtp: If it's dying consistently before even starting, I don't see how we'd be able to do anything? [01:12] wrtp: We can't go "Hey, I think I'll upgrade anyway, just in case!" [01:12] niemeyer: it was, but my first thought was: put the checks in *after* we've started the uniter [01:13] niemeyer: so we're running the upgrader concurrently. [01:13] niemeyer: but then you run into the case above. [01:13] niemeyer: but as i said, i think i have a nice solution. [01:16] wrtp: Oh/ [01:16] ? [01:17] niemeyer: 1) we change the runTasks function so that it gives UpgradedError priority over other errors [01:17] niemeyer: 2) we change the upgrader so that if it's stopped while downloading, it waits a while for the download to complete before actually dying [01:17] wrtp: Hmm [01:28] wrtp: Feels easy to fall onto races [01:28] wrtp: I wish we had a more deterministic way to say that [01:28] niemeyer: i think it's not too difficult [01:29] niemeyer: obviously we can't wait forever because then a download that hung up would mean that an upgrader would never be killable, but i reckon a minute would be fine [01:29] wrtp: Feels like guess work [01:30] wrtp: It could also take 10 in a slow network, or 20 [01:30] niemeyer: point of reference: in ec2 it takes 4s [01:31] niemeyer: we could add something to the downloader that signals that some progress is being made [01:31] wrtp: So S3 to EC2 is your reference of "slow network"!? :-) [01:31] niemeyer: it'd still be guesswork, but not quite so wild [01:31] wrtp: That's probably as good as it gets [01:31] niemeyer: fair enough [01:31] niemeyer: we could make the timeout 1 day if we liked [01:31] wrtp: I don't like any of that.. 
:( [01:32] wrtp: We need a more deterministic way to define whether we want to stop *now* or whether we think we should check for an upgrade [01:32] niemeyer: when do we ever want to stop *now* ? [01:33] wrtp: When we say "stop".. when there's an unrecoverable error in the state connection, .. [01:33] niemeyer: yes, we could add code to enable stopping immediately, but that doesn't solve our problem [01:34] niemeyer: ah yes, if we could know when we had such an error, that would be good [01:34] niemeyer: we'd have to lose lots of errorContextfs though [01:35] niemeyer: or actually, maybe errorContextf could be changed to deal with it [01:36] wrtp: Hm? [01:36] niemeyer: ignore me [01:36] niemeyer: i'm on crack [01:36] niemeyer: we just need a way of asking whether the state connection is still ok [01:38] wrtp: Hmm [01:39] wrtp: Did you get to the point of seeing the upgrader behavior when the issue was happening? [01:39] wrtp: Was it bailing out before ever checking for any upgrade? [01:40] niemeyer: i'm not sure. it doesn't currently print a log message when a download starts [01:41] wrtp: I was just having a look at it, a twist on the first idea you had sounds like an interesting direction [01:42] niemeyer: i've actually finished writing the code for the first idea already BTW [01:43] wrtp: It's not just about "if a download is in progress", though [01:43] niemeyer: it's also "have we seen an initial environ config" [01:43] wrtp: Right [01:43] niemeyer: i've already got that logic too [01:44] wrtp: We have to ignore Dying and, if we have a valid environ, *go check for an upgrade*.. only then, put the Dying() consideration back on the table [01:44] wrtp: Oh, you rock :) [01:44] niemeyer: http://paste.ubuntu.com/1201690/ [01:46] dying = time.After(1 * time.Minute) [01:46] !? [01:46] niemeyer: oops, well obviously we can't do *that* :-) [01:46] niemeyer: but something that sends on dying after a minute would do the job [01:48] wrtp: I'd say we can ignore dying on the first round, and go try an upgrade (facilitated by factoring out some things into a function). [01:49] wrtp: If we have a timeout, that timeout should be on the downloader itself, not on that loop [01:49] wrtp: It's the downloader that should be responsible for complaining that things are stuck [01:50] wrtp: if we don't do that, we have problems either way (what if a charm download stops, what if ...) [01:50] niemeyer: what about waiting for the initial config? [01:51] wrtp: Expand please? [01:52] niemeyer: if we haven't yet received the initial environ config, there's no download to complain about [01:52] niemeyer: so we would need a top-level timeout anyway [01:53] niemeyer: but that's not hard to arrange [01:53] niemeyer: we can use both [01:53] wrtp: We can ask the state for an environment upfront, without using WaitForEnviron [01:54] wrtp: and run one attempt to upgrade [01:54] wrtp: By calling a function explicitly that contains the block that is within the select today [01:54] niemeyer: we actually don't need to use WaitForEnviron at all. we don't use the environ. [01:55] erm, maybe we do [01:55] tools, err := environs.FindTools(environ, binary, flags) [01:55] We need it [01:56] niemeyer: ah yes, good point. [01:57] wrtp: Either way, your plan is a great quick solution [01:57] wrtp: dying = u.tomb.Dying + 5 min delay [01:58] wrtp: Once we know there are no downloads available, reset it to pure Dying [01:58] niemeyer: once we know there are no downloads available, we can just return [01:59] wrtp: Uh? 
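
[Reader's note: the paste links in this log have long since expired. The plan floated at 01:17 has two halves; a minimal sketch of the first half (letting the upgrade error win when runTasks collects worker errors) could look like the Go below. The Task interface and UpgradedError type are illustrative stand-ins, not the juju-core API of the time.]

    // Task is a hypothetical stand-in for a worker run by the agent.
    type Task interface {
        Stop() error // Stop kills the task and waits for it to exit.
        Wait() error // Wait blocks until the task exits.
    }

    // UpgradedError is a hypothetical stand-in for the error the
    // upgrader returns once new tools are installed.
    type UpgradedError struct{ Binary string }

    func (e *UpgradedError) Error() string {
        return "agent must restart: upgraded to " + e.Binary
    }

    // runTasks waits for every task; the first failure stops the rest.
    // An UpgradedError takes priority over any other error, so a worker
    // that consistently dies early can never mask a pending upgrade.
    func runTasks(tasks ...Task) error {
        done := make(chan error, len(tasks))
        for _, t := range tasks {
            t := t
            go func() { done <- t.Wait() }()
        }
        var first error
        for range tasks {
            if err := <-done; err != nil {
                for _, t := range tasks {
                    t.Stop() // any failure stops the remaining tasks
                }
                if _, ok := err.(*UpgradedError); ok || first == nil {
                    first = err
                }
            }
        }
        return first
    }
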
[01:59] i think [01:59] wrtp: What if there is a new upgrade? [01:59] wrtp: It should still work as usual [02:00] wrtp: The loop waits for environ changes [02:00] niemeyer: this is *after* we've already received a dying signal, right? [02:00] wrtp: No, I was talking about upgrader behavior at all times [02:00] niemeyer: ok, i'm not sure what you mean by "dying = u.tomb.Dying + 5 min delay" then [02:00] wrtp: On entrance, a dying channel is built that will fire if u.tomb.Dying fires, but with a 5 mins delay [02:01] niemeyer: ahhh [02:01] wrtp: Everything works as usual then.. we get into the loop, Changes fires with the first env config, [02:01] wrtp: Then, when we get the first download-not-found, we reset dying to take off that delay [02:02] wrtp: From then on it's fine for it to die at any point [02:02] niemeyer: we still want to try to complete a download if we're killed, no? [02:02] wrtp: Then, when we get the first download-not-found, we reset dying to take off that delay [02:03] wrtp: By definition, to get a download-not-found, we've checked if there was a download, and noticed there wasn't any [02:03] wrtp: Before that, dying is Dying + 5 mins delay [02:03] wrtp: Which means we have the 5 mins to complete the download [02:03] niemeyer: yes, that sounds good [02:04] niemeyer: because it's only the first download opportunity we really care about [02:04] wrtp: We can tweak as we go, but it *sounds* like this is a pretty trivial change on top of what we have today [02:04] wrtp: Right [02:04] wrtp: If we break in the middle later, we'll loop and get into that again in either case [02:04] niemeyer: yeah, that sounds great actually [02:05] wrtp: If there's no upgrade to do we reset it as well, of course [02:06] niemeyer: ah, i thought that's what you meant by download-not-found actually [02:06] niemeyer: but it's all good [02:06] wrtp: Yeah, both that and actual failures in download (we just break right now) [02:08] god i love how easy it is to reason about this stuff with channels [02:09] wrtp: +1! [02:09] niemeyer: e.g. http://paste.ubuntu.com/1201718/ [02:12] niemeyer: a little better: http://paste.ubuntu.com/1201720/ [02:17] wrtp: Yeah, that's nice [02:18] wrtp: A bit unfortunate that it hangs a goroutine forever for the 5 mins delay [02:18] wrtp: But I guess that's minor [02:18] niemeyer: well, it can be 5 minutes only if the channel is buffered [02:19] wrtp: Hm? [02:19] niemeyer: oh no, i'm talking rubbish [02:19] niemeyer: it hangs a goroutine for 5 minutes, as you say, minor. [02:19] wrtp: No, it hangs forever [02:20] niemeyer: oh yeah, i see what you mean [02:21] niemeyer: i can't get very worked up about it :-) [02:21] wrtp: If we use a local var t *Tomb assigning to the u.tomb, we could use something like this: http://paste.ubuntu.com/1201731/ [02:22] wrtp: So we Kill the delayed tomb after we have the first round acknowledged [02:22] wrtp: and set t back to the real tomb [02:22] niemeyer: i see what you're doing, but i think it's overkill [02:22] niemeyer: we really couldn't care less about one goroutine [02:23] wrtp: Up to you.. the two functions have the same size, are ready, and one of them doesn't leak memory [02:25] Alright, it's time for me to take some shower [02:25] niemeyer: "leaking memory" is a little strong. it cleans up when the upgrader quits [02:25] wrtp: Dude, and it's super late for you too [02:26] niemeyer: indeed [02:26] wrtp: Yes, when the process dies, it will release the memory :-) [02:26] niemeyer: or 5 minutes after the upgrader returns. 
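
[Reader's note: pastes 1201690, 1201718, 1201720, and 1201731 are gone, so here is a minimal reconstruction of the delayed-dying idea under discussion, assuming only a tomb-style dying channel. The upgrader selects on the returned channel until the first download-not-found, then swaps back to the real dying channel. The goroutine parked on <-dying is the one the conversation above worries about leaking.]

    import "time"

    // delayedDying returns a channel that fires five minutes after
    // dying fires, giving an in-flight tools download time to finish
    // before the upgrader gives up.
    func delayedDying(dying <-chan struct{}) <-chan struct{} {
        ch := make(chan struct{})
        go func() {
            <-dying // parked here for the upgrader's whole lifetime
            time.Sleep(5 * time.Minute)
            close(ch)
        }()
        return ch
    }
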
[02:27] wrtp: Which is done when the process dies!? [02:27] wrtp: Otherwise we'll always have an upgrader running [02:27] wrtp: Either way.. I'm not keen on trashing memory like that when it's trivial to avoid, but I'm not keen on bikeshedding on it either.. have a great sleep there [02:28] niemeyer: will do, thanks [02:28] niemeyer: enjoy your shower... [07:03] hey, anyone who's around, cath is not well and I'm popping out to get some medicine... and I may in general be a bit sketchier than usual today [07:05] right o [08:00] fwereade: mornin' [08:06] wrtp, heyhey [08:07] wrtp, looks like you got a lot done last night -- nice :D [08:07] fwereade: you'll be glad to know i got the uniter running live last night [08:07] fwereade: it broke immediately, but that's not the point :-) [08:07] fwereade: it was started and connected to the state etc [08:07] wrtp, oh, bugger, I didn't read everything I missed [08:07] wrtp, what fell over? [08:07] fwereade: it failed to download the charm [08:08] fwereade: it's probably something i'm doing wrong in the test [08:08] wrtp, huh, weird [08:08] fwereade: one mo, i'll show you its log [08:09] fwereade: it repeats forever doing this: http://paste.ubuntu.com/1202116/ [08:09] fwereade: (it would be nice if we could see the url that's failed, don't you think?) [08:09] fwereade: it's actually quite good that it failed because it exposed me to an upgrader issue that i hadn't considered before [08:10] wrtp, huh, very odd [08:10] wrtp, but ModeInit is I think before install time, isn't it? [08:10] wrtp, that should be ModeInstalling [08:11] wrtp, most likely something to do with getting the unit address [08:14] fwereade: this is a classic example of why errorcontextf is not always sufficient BTW [08:15] wrtp, I'm not denying your general point about ErrorContextf, but I think that had it been used in the provider methods it would have been fine... surely? [08:16] wrtp, your point that it makes it inconvenient to return specific error values holds far more water IMO :) [08:16] fwereade: well, yeah, it must be used in every single function [08:17] wrtp, it must be used mindfully, at any rate :) [08:17] wrtp, IMO it comes down to a question of who has responsibility for producing sane errors -- the thing that fails, or its client [08:18] wrtp, despite the drawbacks, of which I am aware, I still come down on the thing-that-failed side [08:18] fwereade: it depends what you mean by a sane error i think [08:18] wrtp, probably, yeah :) [08:19] fwereade: with the current scheme, we're always passing the buck - when we see a function like the above, we say it's fine because it adds error context. but it's only fine if all the things it calls do the same. [08:20] fwereade: and in many of those functions it doesn't really look like a context is necessary. [08:20] fwereade: so nobody calls it out in review [08:20] wrtp, hmm, I think that statement is flawed in exactly the same way as "if it's up to the client, it's only ok if every single error everywhere is annotated" would be [08:20] wrtp, in either case it's essentially advocating tracebacks [08:21] fwereade: i think some kind of (fairly minimal) traceback is exactly what we need for diagnosing problems like we've just encountered [08:22] fwereade: we need just enough to walk us through the tree of possible error paths. [08:23] fwereade: what i'm going to have to do now is find out which of the functions that ModeInit is calling might be able to generate the error we saw. that's slow and error prone in itself. 
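
[Reader's note: for context, the ErrorContextf being debated was essentially the helper below, a deferred annotation of a named return error. The sketch is illustrative rather than the exact juju-core code of the time.]

    import "fmt"

    // ErrorContextf prefixes *err with context if it is non-nil.
    // It is designed to be deferred against a named return value.
    func ErrorContextf(err *error, format string, args ...interface{}) {
        if *err != nil {
            *err = fmt.Errorf(format+": %v", append(args, *err)...)
        }
    }

    // Typical call site: every fallible step inside deployCharm now
    // reports as "cannot deploy charm <url>: <cause>".
    func deployCharm(curl string) (err error) {
        defer ErrorContextf(&err, "cannot deploy charm %q", curl)
        // ...
        return nil
    }
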
[08:23] wrtp, AFAICS there are only two things that could have failed there, and the actual problem is the same in either case, right? [08:24] wrtp, one is PrivateAddress, the other is PublicAddress, and (almost certainly) they'd each need to be fixed in the same way [08:24] wrtp, I totally understand there are cases where it could be much less pleasant [08:24] fwereade: ah yeah, that's true. i'd forgotten they accessed an http link [08:25] wrtp, and such a case is certainly evidence for inadequate annotation [08:25] wrtp, but I *think* I'm comfortable with the tradeoff [08:25] fwereade: i'll change their error messages to be a little better for a start [08:25] wrtp, the number of cases where I've actually been baffled by an error is pretty small [08:26] fwereade: just wait until we've encountered more of these kinds of errors reported by people in the wild :-) [08:27] wrtp, fair point -- but all the same, I'm uncomfortable with the idea that the way to handle errors right is to hand-hack traceback functionality everywhere [08:27] fwereade: i don't think of it as traceback functionality. i think of it as describing the error :-) [08:38] wrtp, it's the "everywhere" that bugs me more than the precise term we use for the mechanism by which we describe the error (but, yes, good descriptions will be better than tracebacks) [08:39] fwereade: i really feel there's no shortcut. but i'm rolling with it for now. if we decide to change the policy in the future, it won't be too hard. [08:40] wrtp, yeah :) [08:52] wrtp, btw, are your current branches messing with PATH in upstart? [08:53] fwereade: no [08:53] wrtp, awesome :) [08:53] wrtp, but I do have to fix something now then [08:53] fwereade: at least... there might be a dreg in the tests. i'll check. [08:54] wrtp, yeah, the problem is that the uniter tests mess with PATH [08:54] wrtp, it's not too bad though, I can do a trivial and not-so-nice fix quickly, and then sort out the long-running-hook-server thing which will let me do it nicely [08:54] fwereade: i've found the metadata problem BTW [08:54] wrtp, oh yes? [08:55] fwereade: it's using the version 1.0, but public-hostname didn't exist in that version [08:55] wrtp, ha :) [08:55] fwereade: tbh i think it should probably use "latest". [08:55] wrtp, mmmmmmaybe, I'm just not sure what guarantees they make re "latest" compatibility [08:56] fwereade: grr [08:56] 04.15.979 ... value *errors.errorString = &errors.errorString{s:"cannot put charm: cannot write file \"local_3a_precise_2f_dummy-1\" to control bucket: remote error: handshake failure"} ("cannot put charm: cannot write file \"local_3a_precise_2f_dummy-1\" to control bucket: remote error: handshake failure") [08:56] wrtp, yeah, those piss me off :/ [08:56] fwereade: ok, i'll use 2012-01-12 then [08:56] wrtp, brb cig [08:56] wrtp, cool [08:59] fwereade: actually i think "latest" should be fine [09:03] wrtp, I'm happy to trust your research/judgment :) [09:04] fwereade: for metadata categories it talks about "version introduced" which to me implies that a category can't be retracted once introduced [09:05] fwereade: and all their examples use "latest" [09:05] wrtp, sounds sane to me :) [09:12] hello. [09:16] Aram, heyhey [09:17] fwereade: holy shit, it worked! [09:17] * fwereade cheers [09:17] wrtp, running unit agent? 
[09:17] fwereade: unit agent upgraded [09:18] wrtp, sweeeet [09:18] wrtp, and I'm just about to land charm upgrades too [09:18] wrtp, although I haven't actually implemented upgrade-charm yet [09:18] fwereade: i need to think of a nice test so we can live test that the uniter can run a charm [09:19] fwereade: how about a trivial charm that starts a web server? then we can poll for a while after the charm has started to see if the web server comes up [09:19] Aram: morning! [09:20] wrtp, yeah, sounds sensible, just checking for unit status is not adequate to tell that the *charm* is working [09:20] * wrtp is quite happy now [09:20] fwereade: indeed. i want to check that a hook has been executed for real [09:20] fwereade: this can lead into other tests that test more sophisticated functionality, i think [09:20] wrtp, sgtm [09:21] wrtp, will also test expose, which is nice [09:21] wrtp, maybe just an echo server? [09:21] fwereade: in fact, can i tell from the status that a charm has finished executing its start hook? [09:21] wrtp, yeah, once unit status is started [09:22] fwereade: sweet. that means that i won't have to wait too long after that [09:22] fwereade: and i've just implemented the unit watcher, so i can use that to watch the status [09:22] fwereade: lovely jubbly [09:22] wrtp, you could write an echo server with plain/pirate/3117 transforms, to test config-changed as well :) [09:23] fwereade: i think i'll just write a server that executes a specified jujuc callback and sends back the result [09:23] wrtp, how do you plan to do that? [09:23] fwereade: hmm, good point! [09:23] wrtp, :p [09:24] fwereade: darn, out of context callbacks [09:24] wrtp, yeah, I'm strongly inclined to do the preliminary work for that today, just because it makes the uniter much cleaner (and will, I *think*, improve performance a little) [09:25] wrtp, but you shouldn't count on the full thing being available any time soon at all [09:25] wrtp, but even a trivial echo server will verify that open-port works, and by extension (probably) the rest of the jujuc stuff [09:27] fwereade: maybe it's silly, but i'd prefer something that provides some actual info from the other side, so we *know* that we're talking to the right thing. [09:28] wrtp, well, anything you want to send back from jujuc can be grabbed and stored when you run the start hook, right? [09:29] fwereade: hmm, can i assume that python will be installed? [09:29] wrtp, but ISTM that config-changed is a useful mechanism for checking end-to-endness [09:29] wrtp, sorry I don't know [09:29] fwereade: yeah, i was thinking that [09:29] fwereade: what's the simplest web server i can write? [09:30] fwereade: that doesn't need to pull in any more resources than we already have [09:31] wrtp, if you have python, SimpleHTTPServer is pretty damn trivial [09:31] fwereade: yeah, if [09:31] wrtp, *but* the default python version is changing, so you want to be careful of that, [09:31] fwereade: oh pish [09:31] wrtp, I actually think you probably can guarantee *some* python [09:31] wrtp, but yeah, exactly [09:31] fwereade: perl is probably installed [09:32] fwereade: (although i've never written a line of perl in my life) [09:32] wrtp, whatever you need, it's one line in the install hook, surely? 
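
[Reader's note: a minimal sketch of the live test being designed here: poll the deployed charm's web server until it serves the expected content, which proves the start hook really ran and the port really got exposed. The port, URL, and expected body are hypothetical.]

    import (
        "fmt"
        "io"
        "net/http"
        "time"
    )

    // waitForCharm polls the unit's web server until it answers with
    // the expected content, proving the charm's start hook has run.
    func waitForCharm(addr string, timeout time.Duration) error {
        deadline := time.Now().Add(timeout)
        for time.Now().Before(deadline) {
            resp, err := http.Get("http://" + addr + ":12345/")
            if err == nil {
                body, _ := io.ReadAll(resp.Body)
                resp.Body.Close()
                if string(body) == "hello from the start hook\n" {
                    return nil
                }
            }
            time.Sleep(5 * time.Second)
        }
        return fmt.Errorf("charm never came up on %s", addr)
    }
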
[09:32] wrtp, in my very limited experience you're not missing much ;p [09:32] fwereade: i firmly believe that to be the case :-) [09:37] fwereade: i miss the inferno shell where i could write: listen 'tcp!*!12345' {echo hello there} [09:37] fwereade: and instantly have a working server [09:38] fwereade: i think we must have python because cloud-init is written in python. [09:38] fwereade: i'll just have to be a teeny bit careful of version changes [09:41] fwereade: does hook output get logged into /var/log/juju/unit*.log ? [09:41] wrtp, nope, but juju-log will [09:41] fwereade: hmm. where *does* hook output go? [09:42] wrtp, I suspect it goes nowhere at the moment [09:42] fwereade: i think that's wrong [09:42] wrtp, fair point, we just want to be careful not to do it how we did in python, where every line on stderr gets prefixed with ERROR: ;) [09:43] fwereade: just trying to deploy the real wordpress charm, just to see what happens... [09:45] fwereade: http://paste.ubuntu.com/1202263/ [09:45] fwereade: also: http://paste.ubuntu.com/1202265/ [09:46] fwereade: it's looking good! [09:46] fwereade: no way of knowing why the install hook failed though [09:46] wrtp, that's pretty cool :D [09:47] wrtp, I kinda would like to separate hook output from uniter output, but that's just an instinct, not sure it holds water [09:47] wrtp, yeah, debug-hooks would be useful for that [09:47] fwereade: i'm not sure it does. they do need to be readily distinguishable though [09:48] fwereade: what does debug-hooks do again? [09:48] wrtp, yeah, that's the heart of it [09:48] wrtp, whenever the uniter runs a hook it just gives the client a shell instead [09:49] fwereade: [09:49] 2012/09/13 09:42:35 JUJU HOOK some hook output [09:49] ? [09:50] fwereade: i don't think we need to know if it was written to stdout or stderr [09:50] wrtp, sounds reasonable [09:51] fwereade: i might just go and do that to scratch my itch. [09:51] wrtp, +1 [09:53] wrtp, btw, I think I have some idea why I wanted the agent tools to go in the agent tools dir -- because the uniter does write to it [09:54] fwereade: i don't think that's a problem. a given version of the uniter will always write the same thing, no? [09:54] wrtp, yeah, indeed, I'm not sure it's really a big deal [10:04] fwereade: ah, is it likely the reason for the failure is that the PATH isn't set up yet? [10:04] wrtp, ha! [10:04] wrtp, yes, very likely indeed [10:05] fwereade: hook output logging is done BTW, with the small matter of a few tests to go [10:05] fwereade: very happy with how easy it was to navigate the uniter code [10:24] wrtp, sweet :D [10:46] wrtp, sorry, I think I'm being dense... what's the use case for the unit watcher? [10:46] fwereade: the test uses it to watch for the agent tools version changing [10:47] fwereade: it can also be used to watch for the unit status changing [10:47] fwereade: otherwise we'll have to poll, which seems wrong. [10:47] wrtp, hmm, I'm -1 on test-only code polluting the API, where we can help it [10:47] wrtp, I'm comfortable polling in tests (although ofc I do appreciate it when I don't have to) [10:48] fwereade: i don't think it's strictly test only. we could easily have some client functionality that wants to watch some aspect of a unit. [10:48] wrtp, however, it doesn't cover status changes properly (down?) [10:49] wrtp, I dunno, I feel we have too much speculative code in state anyway [10:50] fwereade: FWIW we've already got MachineWatcher which is basically the same thing [10:51] wrtp, is MachineWatcher used? 
(or does it have a concrete use case in mind?) [10:51] fwereade: it's used to watch the instance id for one [10:51] wrtp, ok, so it has a use, and I'm fine with that (although *really* that STM like a job for a more specialized watcher) [10:52] fwereade: this was at niemeyer's request. i started with a more specialized watcher. [10:52] wrtp, ha, ok [10:53] fwereade: i think the UnitWatcher is sufficiently general that it doesn't really count as test-only code, even though that's how we're using it currently. [10:53] wrtp, I kinda feel a watcher that doesn't fire on certain important changes is a bit of a bad thing though [10:53] fwereade: we will eventually use it to do major-version upgrades [10:53] wrtp, if it's meant to watch status, it ought to actually watch status [10:54] fwereade: yeah, maybe it should watch the alive status too. [10:54] wrtp, a ToolsVersionWatcher that can handle both units and machines, and strips out unwanted changes, feels much cleaner to me [10:55] fwereade: i agree, but i've been told that's wrong. so here we are. [10:56] wrtp, (btw, tangentially related: with the provisioner bool in machine, can we drop the PA now and just run the appropriate workers in the MA?) [10:56] wrtp, bah [10:56] wrtp, this is the trouble with the "we should do more reviews ourselves" thing [10:56] fwereade: yes, that's on the todo list, maybe today [10:56] fwereade: i know [10:56] wrtp, sweet [11:23] fwereade: what do you think is the right behaviour for logging output from hooks that continue to run something printing in the background after exiting? [11:24] wrtp, hum, I rather feel that hooks probably shouldn't do that, but good question [11:24] fwereade: i'm tempted to throw away any output after the hook exits [11:24] fwereade: that way at least it's clean and we can't get any intermingling [11:24] wrtp, yeah, I'd assumed it would just be a CombinedOutput [11:24] wrtp, wait a mo [11:25] fwereade: i don't want to use CombinedOutput as i want to log lines as they happen [11:25] wrtp, won't plain hook output go to logdir/agentname.out anyway? [11:26] fwereade: why would it? [11:26] * fwereade goes to peer at docs [11:27] wrtp, ok, it wouldn't [11:30] wrtp, FWIW, log-until-it-ends matches the python [11:30] wrtp, but I agree that logging lines as they happen is a good thing [11:30] wrtp, (that is also what py does) [11:30] wrtp, bbs, updates need restart [12:00] rogpeppe, btw, when you said the machine/unit watchers would be used for major version upgrades... how? [12:00] rogpeppe, ah ok, sorry, now I see [12:00] fwereade: cool, np [12:00] rogpeppe, except [12:00] * fwereade thinks [12:02] rogpeppe, ok, so we upgrade everything, and everything knows not to hit state (apart from that bit?) until after the state itself has been upgraded? [12:02] fwereade: that's right [12:03] fwereade: well, actually we probably make a new state server [12:03] rogpeppe, doesn't that put quite a major restriction on the sort of state changes we can actually make? [12:03] fwereade: i'm not sure of the details [12:03] fwereade: how do you mean? [12:06] rogpeppe, well, where's it going to write the change? in the old document (created/managed by old code)? or in the new document, which doesn't exist yet and is on a different server? [12:07] fwereade: where is what going to write the change? [12:07] fwereade: the upgrader? [12:07] rogpeppe, yeah [12:07] rogpeppe, oh, wait, it writes the code version before it's running the new code, right? 
[12:07] rogpeppe, feels a bit off somehow [12:09] fwereade: it uploads the new tools, writes the tools version, waits until all agents are in "pending" state, transforms the database, then triggers all the agents to have at it [12:09] fwereade: by pending, i mean "major-upgrade-pending" of course [12:10] rogpeppe, ok, thanks :) [12:18] Aram, pre-review delivered on https://codereview.appspot.com/6492110/ -- but I'm worried I've got completely the wrong end of the stick [12:19] Aram, I thought the one thing we *weren't* meant to be notified of was a remove? (except when the watcher happens to have missed a Dead, in which case we should get a Dead change..?) [12:21] fwereade: I only implemented current behavior, not what we want it to be in the future. [12:21] you are right about the future. [12:22] Aram, my concern is that the future is *now* [12:23] Aram, AIUI we dropped the life stuff in state so we could focus on it in mstate [12:24] we can never switch, nor test everything if the API is not the same though. [12:24] Aram, but this API is useless [12:24] Aram, as is the one in state [12:27] Aram, actually, how many of these watchers have non-test clients currently? [12:28] Aram, I presume that at least the firewaller/provisioner/machiner does use some of them, maybe most of them [12:28] white:juju-core$ lsr | egrep 'go$' | egrep -v 'state1|mstate|test' | xargs egrep 'Watch[A-Z][A-Za-z]*' | wc [12:28] 64 394 5470 [12:30] white:juju-core$ lsr | egrep 'go$' | egrep -v 'state1|mstate|test' | xargs egrep '[A-Z][A-Za-z]*Change' | lsr | egrep 'go$' | egrep -v 'state1|mstate|test' | xargs egrep '[A-Z][A-Za-z]+Change[^A-Za-z]' | wc [12:30] 206 1304 18589 [12:31] nah, that's wrong [12:31] white:juju-core$ lsr | egrep 'go$' | egrep -v 'state|mstate|test' | xargs egrep '[A-Z][A-Za-z]*Change[^A-Za-z]' | wc [12:31] 8 52 792 [12:31] white:juju-core$ [12:31] white:juju-core$ [12:31] white:juju-core$ lsr | egrep 'go$' | egrep -v 'state|mstate|test' | xargs egrep 'Watch[A-Z][A-Za-z]*' | wc [12:32] 22 78 1799 [12:33] Aram, ok, so the plan is to implement broken watchers and only fix them once we've finished the switch (this is not *necessarily* a criticism, but I am disappointed if that's actually our only path to where we need to be) [12:34] I don't see how it can be any other way. we use watchers therefore we can't change our client code because we don't have "fixed" watchers. I can't do "fixed" watchers first since we would not have an easy path to transition to mstate since client code would want "broken" watchers. [12:35] to make the transition fluid the APIs have to be the same. [12:35] the real fix would have been to have fixed watchers in state as well. [12:35] Aram, yeah, I guess so, I can't come up with anything better :( [12:36] * fwereade quietly freaks out in a corner somewhere [12:36] * fwereade tries not to disturb anyone else [12:37] Aram, well, hmm, actually I'm starting to feel it *would* be better to break what we currently have working in the interest of getting some real code running against mstate [12:38] Aram, but... 
I suspect this is the cowboy-nature coming to the fore, and I should probably shut up [12:43] fwereade: https://codereview.appspot.com/6494132 [12:43] fwereade: i'm becoming convinced that stopping logging at the end of a hook is not right [12:44] rog, I'm becoming convinced that if we get output after the end of a hook we should spam the user with UR DOIN IT WRONG messages [12:44] fwereade: i think there's potentially great benefit in having all logs funnel into one place [12:44] fwereade: as it is, every server that anyone starts logs to a.n.other log file somewhere [12:45] fwereade: i remember having difficulty with that when writing a charm [12:45] rog, and I think that any sane service will have its own logging, and we are not in the business of indulging craziness like expecting hook child processes to work properly :) [12:45] rog, isn't that a job for logging charms? [12:46] rog, if I'm looking in the juju logs I care about juju [12:46] fwereade: i think logging should be more fundamental to the system than that [12:46] rog, if I care about what (say) riak is doing, surely I can look at riak's logs [12:46] fwereade: if i'm deploying a charm, i care about juju *and* what my stuff is doing, because the two interact [12:46] fwereade: we want to be able to see a time line of the whole system if possible [12:48] fwereade: i'm pretty sure that doing this right is a major key to getting juju easier to use and debug. [12:57] rog, that is fine by me :) [12:57] fwereade: there are various web services that do that at low cost [12:57] rog, having juju agents themselves mediate that sending directly feels bad though [12:58] fwereade: perhaps you're right. [13:02] rog, I'm not actually against logging, I just feel like there is more than one concern here (even if not quite as many as 2, IYSWIM) [13:02] fwereade: although when a daemon dies 1 second after you've started it in the background, saying "cannot do something", i didn't think so [13:02] fwereade: perhaps we could provide a standard way to say "watch this log file please" [13:02] fwereade: though i suppose logger(1) isn't bad for that [13:02] rog, feels to me like a feature tailor-made for a relation with eg rsyslog [13:02] fwereade: sadly subordinate charms don't work too well for logging currently [13:02] rog, ha, ok, I haven't investigated :( [13:02] rog, why not? [13:02] fwereade: you can't guarantee that the subordinate has started before the principal [13:02] fwereade: which i think you want if you want to set up a logging destination so that every message from the principal is logged. [13:04] rog, it doesn't sound like it would be *too* hard for a logging charm to catch up with messages from the past, but maybe that's crazy, it could certainly have pathological behaviour in some situations [13:04] fwereade: FWIW i think that we *should* guarantee that if a service has subordinates, their start hooks should complete before the principal's install hook gets run. [13:04] fwereade: depends if the messages are going onto disk or straight through the network [13:04] rog, the best we can do is guarantee that happens *some* of the time, though, right? [13:05] rog, why would they be going across the network before the subordinate is configured to send them there? [13:05] fwereade: of course if you add a subordinate after deploying a unit of service we can't guarantee it :-) [13:05] fwereade: it might not be possible to switch destinations mid-flow. 
ok, i admit i don't know anything about syslog :-) [13:06] rog, nor do I really :) [13:06] fwereade: but as a general point, i think it would make subordinates much more useful [13:06] fwereade: because they could be used to configure some aspect of the system for the principal, *before* the principal starts [13:07] fwereade: for instance, to tweak kernel parameters [13:07] rog, I still don't really see why it matters for the principal to run in an unhelpful mode for a few seconds before being bounced by the subordinate when it comes up [13:08] rog, and the subordinates will always have to be written so as to be able to deal with suddenly being deployed against a running unit anyway, surely? [13:08] fwereade: you're assuming that whatever it is the subordinate is setting up can be changed *after* the principal starts [13:08] fwereade: sure - some subordinates might not be able to work in that scenario [13:09] rog, I dunno, that level of interference feels to me like it would be better addressed by letting people launch on custom images [13:10] fwereade: it's a combinatorial problem - there are n-dimensions of tweaks, and we can't provide images that address all points in the space. [13:11] rog, I'm not suggesting that *we* should, I'm just suggesting that allowing people a mechanism for choosing their own images addresses this use case without introducing a whole new class of subordinate charms (ie these ones that *cannot* be started after the principal) [13:12] fwereade: i did think of a way around it using the current system primitives BTW [13:12] fwereade: i might have mentioned it before [13:12] * fwereade isn't sure whether or not he can hear a bell ringing [13:12] rog, go on :) [13:13] fwereade: a "sync" interface [13:14] * fwereade can hear a bell now, but can't remember how it was meant to work [13:14] fwereade: a principal provides it; a subordinate requires it; the principal only starts when it gets a "sync done" message from the subordinate [13:15] rog, how does the unit know whether or not it's got a subordinate when it runs its start hook in the first place? [13:15] rog, and can we have 2 synced charms? [13:15] fwereade: you'd set "required-subordinate-count" in the service config, or something like that [13:16] rog, hmmm, and how do we deal with people deploying syncing subordinates when the services are already running? [13:17] rog, sorry, I'm way off on a derail here [13:17] fwereade: i don't think that's a problem actually, but yeah [13:17] fwereade: it's an interesting example, i think [13:18] rog, yeah, but I should be writing actual code right now ;p [13:18] fwereade: indeed [13:18] fwereade: so i'll leave the logging as it is for now. review appreciated :-) [13:18] rog, oh yeah, that was why I started talking about this ;P [13:23] rog, LGTM [13:23] fwereade: thanks [13:29] niemeyer: morning! [13:29] rog: Morning! [13:29] niemeyer: i got the uniter one stage closer this morning - the unit upgrading worked. [13:30] rog: Woah! [13:30] niemeyer: and it failed because the PATH was not set up rather than anything else, i believe [13:30] rog: Aw.. almost there then :) [13:30] niemeyer: although charm stdout was lost, so i don't know (hence this morning's CL: https://codereview.appspot.com/6494132) [13:31] rog: Cool, looking [13:33] rog: Hmm [13:33] niemeyer: hmm? [13:34] rog: Why are we seeing this issue? 
When a process exits we can generally determine deterministically that we have all its output [13:34] niemeyer: no we can't [13:34] niemeyer: i know this because i see the issue all the time [13:35] rog: Ah, I see the issue, backgrounding.. ok [13:35] niemeyer: actually it happens with no backgrounding at all [13:37] hey, I just noticed: we just hit r500 :) [13:37] niemeyer: or... it's possible that os/exec works around the problem for us actually [13:38] rog: http://paste.ubuntu.com/1202638/ [13:38] rog: How does that work? [13:38] niemeyer: good question; i'm looking in the source [13:38] rog: How does it know to stop reading, more specifically [13:38] rog: Cool [13:39] rog: My understanding is that if a process has exited, we necessarily have all output that came from *it* into buffers [13:39] rog: So we can simply read [13:39] rog: If there's more, there's backgrounding going on that may be considered misbehaving [13:40] niemeyer: do we specifically disallow starting any background process from a hook without redirecting its stdout & stderr? [13:41] rog: This is bad practice, but I don't think we care in this case [13:42] rog: What we care, I think, is that we're logging everything the hook ever gave us by itself [13:42] niemeyer: we *do* still have the issue, i think [13:42] rog: So what *is* the issue? :) [13:43] niemeyer: the issue is that the process can write some stuff into the pipe, exit, we see the exit, but the pipe still has data in [13:43] rog: I think that's exactly what I mentioned above [13:43] niemeyer> rog: My understanding is that if a process has exited, we necessarily have all output that came from *it* into buffer [13:43] niemeyer: it happens because we're using a pipe rather than sending the data into a local buffer [13:43] niemeyer: in this case, we're not using a local buffer. [13:44] niemeyer: an alternative would be to block forever waiting for EOF, which is what would happen if we used CombinedOutput, for example [13:44] niemeyer: i prefer to be a little more robust, i think. [13:46] rog: Hm.. I don't understand.. those two things are exactly the same [13:46] rog: The only way to send data into local buffer is by using a pipe [13:46] rog: We certainly have to finish reading after the process has exited, not before [13:46] rog: Okay.. (?) [13:46] rog: I'd prefer to be robust, but meaningful.. (?) [13:46] rog: What is the actual symptom of the problem? [13:46] niemeyer: we'd see log messages produced by the hook after RunHook had returned [13:47] niemeyer: for a few microseconds at most [13:47] rog: How can the hook produce output after it has returned? [13:47] rog: Sorry, I'm pretty confused by now [13:47] niemeyer: because there's data still in the pipe [13:47] rog: It's not *producing* output [13:47] rog: There's data in buffers.. that's normal unix behavior [13:48] niemeyer: no, it's produced it, but we haven't read it yet [13:48] rog: Exactly.. that's handled without arbitrary delays [13:48] niemeyer: so we'll probably get EOF immediately and everything is hunky dory [13:48] niemeyer: but if we always wait for EOF then we get bitten if there *is* a background process that's keeping the pipe open [13:49] niemeyer: we could make the delay 2s or so if that would make you happier [13:49] rog: We don't get bitten, because we don't have to wait.. [13:49] niemeyer: i'm afraid we do. i'm just writing some code to demo the issue. 
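
[Reader's note: the demo code is lost, but the behaviour being argued over, plus the workaround eventually agreed on below, can be sketched like this: log each line of hook output as it arrives, and once the hook has exited give the reader only a short grace period instead of blocking for EOF, because a backgrounded child inherits the write end of the pipe and can hold it open indefinitely. The log prefix and the 50ms figure are illustrative.]

    import (
        "bufio"
        "log"
        "os"
        "os/exec"
        "time"
    )

    // runHook runs a hook, logging its combined output line by line.
    func runHook(c *exec.Cmd) error {
        r, w, err := os.Pipe()
        if err != nil {
            return err
        }
        c.Stdout = w
        c.Stderr = w // interleave; we don't care which stream it was
        if err := c.Start(); err != nil {
            r.Close()
            w.Close()
            return err
        }
        w.Close() // only the hook (and any children) now hold the write end
        done := make(chan struct{})
        go func() {
            defer close(done)
            scanner := bufio.NewScanner(r)
            for scanner.Scan() {
                log.Printf("JUJU HOOK %s", scanner.Text())
            }
        }()
        err = c.Wait()
        // The hook has exited, but wait(2) returning does not guarantee
        // the pipe is drained, and a background child may never close it.
        select {
        case <-done:
        case <-time.After(50 * time.Millisecond):
            log.Printf("JUJU HOOK (discarding output produced after hook exit)")
        }
        r.Close() // unblocks the reader goroutine if it is still going
        return err
    }
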
[13:49] rog: Thanks, that'd be awesome [13:51] niemeyer: essentially the OS doesn't guarantee that the pipe has been drained when wait(2) returns [13:52] rog: The pipe may never get drained if there's a process holding it [13:52] niemeyer: exactly [13:52] rog: We only care that the process has exited, and we can't read anything from it anymore [13:53] niemeyer: the process can exit *and* there still be data in the pipe. [13:53] niemeyer: which we can read. [13:53] niemeyer: anyway, one mo [13:53] rog: Then we read it.. (!?) [13:54] niemeyer: yeah, but we need to wait for it to be read, hence the <-logger.done [13:56] rog: What I mean is that we don't have to put a boundary on the reader to output its data in 50ms [13:56] rog: There's a deterministic event happening.. [13:56] niemeyer: only if there's no background process [13:56] rog: Always [13:57] rog: "I can't read data now" is deterministic [13:57] niemeyer: how do we know when that event has completed? [13:57] rog: No matter if there's something in background or not [13:57] niemeyer: we know there *might* be some data, but we don't have any way of knowing for sure, or how much there is [13:57] rog: Hence, "I can't read data now" [13:58] rog: If we can, doesn't matter how much there is, or how much time has passed, we can read it [13:58] niemeyer: how do we know "I can't read data now"? [13:58] niemeyer: there's no non-blocking read [13:58] rog: Because the system call tells us that? [13:58] rog: Oh, there isn't? [13:58] niemeyer: i don't think so [13:58] * niemeyer looks [14:00] niemeyer: we could probably use syscall.SetNonblock [14:00] niemeyer: but i'm reluctant to do that [14:00] This sucks :( [14:02] niemeyer: tbh i don't mind waiting a bit longer if someone starts a process in the background and that triggers the timeout [14:03] niemeyer: given they're not meant to be doing that anyway [14:04] rog: Well, the consequence isn't great.. besides the lack of determinism (we can be closing the door on the logger before we've read the *real output*) we're holding resources, perhaps consecutively, for things that may not exit [14:04] niemeyer: that's all true [14:06] rog: Anyway, thanks for the coverage [14:06] rog: It's a great workaround and we should definitely move on with it for now [14:07] niemeyer: np. i'm not sure what you'd like to do about the issue though. [14:07] niemeyer: ok, cool [14:07] rog: I'll finish the review so we can integrate this [14:07] niemeyer: thanks [14:10] niemeyer: FWIW i verified that CombinedOutput does block until EOF, even if the process has exited. [14:10] rog: Cheers. That's indeed the right thing for it to do [14:10] niemeyer: yup [14:11] niemeyer: but not what we want, i think [14:11] ( echo foo; sleep 5&; echo bar) | cat [14:11] niemeyer: yup [14:11] This is the issue, in summary :) [14:11] rog: Agreed [14:30] Gosh, and the bank calls me again.. the government is asking for new docs for the exchange contracts [14:30] Just what I was asking for [14:39] rog: Done, sorry for the delay [14:39] niemeyer: tyvm [14:40] niemeyer: do you think we just truncate long lines? [14:41] niemeyer: or should we send them as several log messages? [14:43] rog: I don't think we need to do much, to be honest [14:43] rog: ReadLine already truncates them [14:43] niemeyer: so you'd go with that? [14:44] niemeyer: (and ignore continuation lines) [14:44] rog: No, just log them [14:44] niemeyer: ok, so we'll split lines [14:44] niemeyer: fair enough [14:45] rog: Yeah, ReadLine already does it.. 
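
[Reader's note: the ReadLine behaviour being relied on, in miniature: bufio.Reader.ReadLine hands back at most one buffer's worth of a line per call, so an over-long line simply overflows into several log messages, each carrying its own log prefix. The 4096 matches the LineBufferSize quoted below.]

    import (
        "bufio"
        "io"
        "log"
    )

    // logLines logs every line read from r. A line longer than the
    // reader's buffer arrives in several ReadLine calls, and hence
    // as several log messages.
    func logLines(r io.Reader) {
        br := bufio.NewReaderSize(r, 4096)
        for {
            line, _, err := br.ReadLine()
            if len(line) > 0 {
                log.Printf("JUJU HOOK %s", line)
            }
            if err != nil {
                return // io.EOF or a real read error; either way, stop
            }
        }
    }
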
[14:45] niemeyer: yeah - it's always a difficult call whether to discard continuations of a split line though [14:46] niemeyer: i'm mildly tempted to print [%d bytes truncated at end of line] [14:46] niemeyer: in case someone cats an executable or something [14:47] rog: It's not truncated [14:47] niemeyer: but... if someone's printing a large lump of json, you probably want all of it, split or not [14:47] rog: ReadLine will return on the next line [14:47] rog: Ah, I see, you're thinking about changing the behavior [14:48] niemeyer: yeah, i'm concerned about changing the meaning of log messages [14:48] niemeyer: but i can make the buffer pretty big, so it's really not much of an issue [14:48] rog: Change the meaning of log messages? [14:48] rog: What's the meaning of log messages? :) [14:49] niemeyer: well, if we're grepping out messages by prefix (not uncommon), then we'll miss continuation lines [14:50] niemeyer: but i think it's common to have some line-length restriction, and better that than lose information [14:50] rog: Grepping for prefixes would still work fine, I believe [14:51] niemeyer: i'll use a buffer size of 128K or something [14:51] rog: Why? Please just keep the default [14:51] niemeyer: not if the prefix is part of the message logged, i believe [14:51] rog: The prefix will end up in the same place [14:51] niemeyer: yes, but if we grep for it, the continuation lines won't have it [14:52] rog: Yes, they won't, because they are continuation lines that don't have the prefix.. duh? [14:52] niemeyer: so our grep output won't have all messages with that prefix [14:52] rog: Man [14:52] rog: Okay, please do whatever [14:53] niemeyer: it depends how people are using the logging. 4K is probably well sufficient for almost all purposes, i'd guess [14:53] niemeyer: so i'll leave the default [14:57] niemeyer, https://codereview.appspot.com/6496120 should be near-enough trivial, assuming you're comfortable with the approach in the short/medium term [14:58] niemeyer: "s/0.1/0.2/ probably, given the internal change?" - is there a timeout change you intended to suggest but didn't? [15:03] rog: Huh.. indeed [15:03] rog: I typed it, but it got eaten somehow [15:04] rog: I was suggesting a small bump to 100ms [15:04] niemeyer: sounds good [15:04] rog: Cheers [15:04] fwereade: Looking [15:04] brb, cig [15:09] b [15:11] fwereade: Seems mostly sane [15:11] fwereade: There's a small bit that is feeling contrived that sounds easy to simplify [15:11] niemeyer, ha, now that's faint praise ;p [15:11] niemeyer, cool [15:12] niemeyer, I kept finding myself wanting to extract a whole new type but it doesn't quite feel like the right time yet [15:12] fwereade: We're sending (dataDir, agentName) aaaaall the way down to EnsureJujuSymlinks, so it can call AgentToolsDir, and then we send that information aaaaaall the way up so we can use it [15:13] fwereade: When all these functions really care about is the directory, not the dataDir, not the agentName [15:14] niemeyer, fair point... so, make EJC just take a dir then? or do I misapprehend? [15:15] fwereade: Yeah, both ensureFs and EJC can take the toolsDir themselves [15:15] fwereade: and we can generate that info at the top level by calling it explicitly from NewUniter [15:15] fwereade: Neither of them, I think, care about dataDir or agentName [15:16] niemeyer, I had considered dropping ensureFs entirely, how would you feel about that? -- just ensure the state dir, and EJC the toolsDir directly? 
[15:16] fwereade: +1 [15:16] niemeyer, cool, on it :) [15:17] fwereade: The other thing I quickly pondered was whether we should just stick filepath.Dir(os.Args[0]) in the PATH rather than passing toolsDir into RunHooks, but I'm not so sure about that [15:18] niemeyer, not so keen really [15:20] fwereade: Cool [15:24] niemeyer: i exported var LineBufferSize to make it easy for the tests to check line splitting. hope that seems ok, before i submit. [15:24] niemeyer: [15:24] // LineBufferSize holds the maximum length of a line read [15:24] // from a hook. Lines longer than this will overflow into [15:24] // subsequent log messages. [15:24] var LineBufferSize = 4096 [15:24] niemeyer: oops, i meant to paste this: https://codereview.appspot.com/6494132 [15:24] actually, it could easily be const [15:26] rog: It doesn't have to be public as well [15:26] niemeyer: ok, i'll add export_test.go [15:26] rog: Ugh.. why? [15:26] niemeyer: because the testing code needs access to the value [15:26] niemeyer: so it can verify the split easily [15:27] rog: Just hardcode the number in the test as well.. this is not nearly as interesting as it sounds :) [15:27] niemeyer: ok; thought you might not like that :-) [15:28] fwereade: i'm seeing sporadic uniter test failures: http://paste.ubuntu.com/1202839/ [15:28] rog, aw poo, I'll take a look [15:30] * fwereade is baffled, offhand [15:30] rog, I'll investigate more closely in a mo [15:30] fwereade: i only get it when i do go test ./... (i.e. it seems like a race of some kind) [15:30] rog, thanks, good to know [15:33] * niemeyer => lunch! [15:43] fwereade: i've just reproduced the behaviour in trunk (i was slightly concerned it was a change i'd made). it happens about 30% of the time for me. [15:44] rog, excellent -- I have not seen that one myself, but I'll see if I can figure it out [15:45] rog, ...but... maybe tomorrow :( cath kinda needs support asap, she's been ill all day and laura's been around too [15:45] fwereade: np [15:45] fwereade: go tend your flock :-) [15:45] rog, I'm hanging on for dear life to repropose one branch, then that's probably it for me [15:46] fwereade: do you mind if i make the PATH change? [15:46] rog, kinda, because that's what I'm proposing ;) [15:46] fwereade: ah, do it, DO IT! [15:46] rog, just forcing myself to witness a full test run ;) [15:47] fwereade: aw, i know that feeling very well [15:47] fwereade: live? [15:48] rog, afraid not, I should probably do that too, but for *that* I definitely don't have time :( [15:48] fwereade: 'sok, i'm sure local is fine [15:53] rog, https://codereview.appspot.com/6496120 and goodnight :) [15:56] fwereade: nnight. hope cath gets better before you have to leave... 
hopefully we can transition entirely next week during the sprint (fingers crossed) [16:55] rog: But we'll need to talk to fwereade anyway [16:56] niemeyer: what do you think about the UnitWatcher as a concept? [16:56] rog: Probably, but we'll need to evolve a bit before that, I suspect at least [16:56] niemeyer: shall i just poll for the time being? [16:56] rog: I personally think it's fine [16:56] rog: We'll need to watch it anyway for other reasons, I believe (e.g. life) [16:56] niemeyer: yeah [16:57] niemeyer: given fwereade's remarks, i wondered about making the unit watcher watch the unit's presence too [16:57] niemeyer: then at least it would signal all status changes [16:58] rog: Initial feeling is that this would cross two unrelated issues together [16:58] rog: I'd be curious to know more about his feelings [16:58] niemeyer: the difficulty is that you watch *almost* all unit status changes. [16:59] niemeyer: but yes, i tend to agree. [17:00] rog: Why did you need it again, for reference? [17:00] niemeyer: for watching the agent version [17:00] niemeyer: so that i can tell when the unit agent upgrades [17:00] rog: Ah, for testing only, ok [17:00] niemeyer: yes. although when we do major version upgrades, we'll probably need it too [17:04] rog: Interestingly, I notice we have a bunch of unrelated watchers on the unit.. ports, resolved, .. [17:05] rog: and a bunch of properties that can't be observed [17:05] rog: All of those are very rarely changed.. I wonder if we should move towards observing the unit itself [17:05] niemeyer: yeah. that has made me a bit uncomfortable. we can watch some properties of the unit because they happen to be stored in a certain way. that might not be true in mstate tho' [17:06] niemeyer: what do you mean by that? [17:06] rog: It's even the opposite, which kind of sucks [17:06] rog: We stored them in a certain way because we wanted to observe them [17:07] rog: let's say we then decide to observe an existing property with state (pre-mstate).. we'd be forced to observe the unit node itself [17:07] niemeyer: i'm not that familiar with mstate. what do you mean by the unit "node" ? [17:07] oh, sorry [17:07] pre-mstate [17:08] rog: I mean the document (in mstate) or node (in zk) that contains most unit information [17:08] rog: We've split certain bits out, originally, because we felt it'd be more efficient to observe it that way [17:08] rog: It now feels like this is a poor design [17:08] niemeyer: most but not all (thinking of the presence state, for example) [17:09] rog: Because we're attempting to anticipate every use case [17:09] rog: That's a very particular one [17:09] rog: That will always be special [17:09] niemeyer: true [17:09] rog: Because we'll have significant volume of small bits of information going on and off regularly [17:11] niemeyer: so you think the unit watcher (in mstate) should observe the entire unit document? [17:11] niemeyer: (i don't even know if that makes sense actually!) [17:11] rog: Right, we have already unified all settings in a single document even [17:12] rog: Because the cost of loading the extra settings is nothing compared to the cost of everything else [17:12] rog: (make query, transfer it, lookup index, lookup data, transfer back, unmarshal in, unmarshal out, blah blah) [17:12] niemeyer: so in fact the unit watcher in mstate *would* automatically see all unit changes, e.g. ports etc? 
[17:13] rog: Exactly, and so will every other watcher that watches the unit, even if we do have separate watchers [17:13] rog: We can even filter out the specific settings, if we do want that, but again it's pretty irrelevant [17:13] rog: The cost of the operation is not there [17:14] rog: In summary, my point is that I think we could do pretty well with a single unit watcher, instead of tons of per-setting watchers for the unit [17:14] niemeyer: i think it's still worth having typed channels rather than always passing around an object that you must invoke a method on to get the value [17:15] niemeyer: because channels are generic. i had to jump through hoops to watch both the machine agent tools and the unit agent tools with the same code. [17:15] rog: I don't know.. it doesn't feel worth it.. I believe these watchers will each have a single use case in the whole code [17:15] "generic" is maybe a bad way of saying it. [17:16] niemeyer: it also means that every client needs to check if the setting it's after has actually changed [17:16] niemeyer: i think part of the problem is that our watchers are so darn heavyweight [17:16] niemeyer: if we did it right, a new watcher should only be 6 or 7 lines of code, i think. [17:17] rog: Indeed, but instead of checking it, we're doing a lot of heavy lifting to do exactly the same in a generic way [17:17] rog: Even then, it doesn't feel great [17:18] rog: We need a new method on the type, a new type to put on the channel, a new function to handle it, etc etc, for *each* detail we want to watch [17:19] niemeyer: part of the problem for me is that interfaces don't work when you've got chan T, but only when you've got T [17:20] rog: That is unfortunate indeed.. [17:21] rog: at the same time, it's not reaaally an issue in this case.. we have type-specific channels in both cases [17:26] niemeyer: if i've got a chan *state.Unit, it's type specific, but it doesn't really represent what i'm watching. [17:26] rog: Well, if you're watching the unit it does represent what you're watching [17:27] niemeyer: this is what i had to do in livetests: http://paste.ubuntu.com/1203085/ [17:27] niemeyer: so i could treat units and machines in the same way [17:28] niemeyer: for the purposes of waiting for them to upgrade [17:30] rog: That seems pretty involved [17:30] rog: Isn't it just a matter of having a function that looks like [17:32] rog: func machineToolsHolder(ch <-chan *Machine) <-chan ToolsHolder [17:32] rog: ? [17:32] rog: To overcome exactly the issue you described above? [17:32] rog: The logic itself is exactly the same for both [17:33] niemeyer: that's essentially what i pasted [17:33] rog: Heh [17:33] rog: Except completely unrelated, right? :) [17:33] rog: I'm saying converting a *channel type* onto *another channel type* [17:34] rog: The *logic* is the same [17:34] rog: You're converting a *watcher* onto another *watcher* [17:34] niemeyer: well, it stores the machine too, so that the generic code can inspect the error if it wants [17:35] niemeyer: i'm not convinced it would be less code overall [17:36] niemeyer: ah, i see what you mean actually. [17:36] rog: It removes the duplication that you have there, with two watchers that look exactly the same.. 
[17:37] niemeyer: where MachineToolsHolder is something like: [17:37] type MachineToolHolder interface { [17:37] AgentTools() (*state.Tools, error) [17:37] } [17:37] Yep [17:37] i mean ToolsHolder of course [17:38] niemeyer: good enough for testing anyway [17:40] niemeyer: i still think it's a pity that we require that every stage in a channel pipeline have its own tomb. [17:40] niemeyer: but i lost that one ages ago. [17:45] rog: I don't think we require that [17:45] niemeyer: i think we do if we want to maintain the invariant that calling Stop means no more values will be produced on a channel. [17:46] rog: In fact, the mstate watcher mechanism explicitly does not have a per-channel tomb [17:46] rog: What we require in the design, since the beginning, is that we deterministically stop background activity [17:46] niemeyer: are you talking about the MachineWatcher? i should have a look then [17:46] rog: That sounds sane [17:47] rog: No, I'm talking about the watcher foundation and the way that MachineWatcher uses it [17:47] rog: We do not have per-channel tombs there [17:47] niemeyer: so the MachineWatcher doesn't have a tomb? cool! [17:47] rog: Heh [17:48] rog: The MachineWatcher has a tomb, because it has a goroutine that we want to stop deterministically [17:48] niemeyer: i think we'll usually find that every stage in a pipeline is like that [17:48] rog: No, I'm talking about the watcher foundation and the way that MachineWatcher uses it [17:48] rog: We do not have per-channel tombs there [17:48] niemeyer: a machine watcher is one stage down from the watcher foundation. [17:48] niemeyer: i don't see the watcher foundation as a pipeline. [17:49] rog: Okay, sorry [17:49] niemeyer: we *can* deterministically stop goroutines without a tomb per-stage if we're prepared to wait on the channel for the goroutine to acknowledge that it's been stopped (by closing the channel) [17:50] niemeyer: i know that was rejected ages ago, but i still think it's a nice way to go about things. [17:50] rog: We can't close a channel from the consumer side [17:50] niemeyer: it means the stop message can percolate deterministically down the pipeline [17:51] niemeyer: each stage in the pipeline can delegate the stop handling to the stage upstream [17:52] niemeyer: when that sees a stop, it closes the upstream channel, which then flows stage-by-stage back to the most downstream channel [17:52] rog: Doesn't work.. it means we have background activity because closing the channel leaves things running [17:52] rog: Deterministically stopping means Stop() returned, nothing else is happening [17:53] rog: It's one of the things we've fixed from Python, and it makes me very happy to be honest [17:53] niemeyer: a stop would be a two-stage process - we close the stop channel, then wait to see the downstream channel closed. [17:53] niemeyer: there's nothing left running [17:53] rog: Heh.. I'm sure you can create many other mechanisms to do the same [17:54] rog: We have one, that works, and works well [17:54] niemeyer: it means it's really heavyweight to do things that should be throwaway pieces of code, IMHO [17:55] niemeyer: the only thing that *needs* to do a select is the head of the stream. [17:55] rog: Sorry, but this isn't helpful at all.. you're building a new design as I mention the requirements, and saying that what we do is heavyweight despite the fact it *works*.
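(To make the adapter idea concrete: Go channels are invariant, so a chan *Machine is not a chan ToolsHolder even though *Machine implements the interface, and a small copying goroutine is the standard workaround. The stand-in types below are assumptions for illustration; only the AgentTools method comes from the interface sketched above.)

    package live

    // Stand-ins for the real state types, just enough for the sketch.
    type Tools struct{ URL string }

    type Machine struct{ tools *Tools }

    func (m *Machine) AgentTools() (*Tools, error) { return m.tools, nil }

    // ToolsHolder is the interface from the chat: anything that can
    // report its agent tools, be it a unit or a machine.
    type ToolsHolder interface {
        AgentTools() (*Tools, error)
    }

    // machineToolsHolder adapts a channel of machines into a channel of
    // ToolsHolder; a unitToolsHolder twin would be line-for-line the
    // same, which is exactly the duplication being debated.
    func machineToolsHolder(ch <-chan *Machine) <-chan ToolsHolder {
        out := make(chan ToolsHolder)
        go func() {
            defer close(out)
            for m := range ch {
                out <- m
            }
        }()
        return out
    }

The generic wait-for-upgrade loop can then range over a <-chan ToolsHolder without knowing which entity it is watching.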
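(And, for contrast, a sketch of the close-propagation scheme being argued against: only the head stage selects on a stop channel; every downstream stage just drains its input and closes its output, so a close percolates stage by stage to the tail. This is the rejected alternative, shown only to clarify the disagreement, not how juju's watchers work.)

    package live

    // head is the only stage that watches the stop channel; closing
    // stop makes it close its output.
    func head(stop <-chan struct{}) <-chan int {
        out := make(chan int)
        go func() {
            defer close(out)
            for i := 0; ; i++ {
                select {
                case out <- i:
                case <-stop:
                    return
                }
            }
        }()
        return out
    }

    // double is a throwaway stage: no tomb, no select; it drains its
    // input and closes its output when the input closes.
    func double(in <-chan int) <-chan int {
        out := make(chan int)
        go func() {
            defer close(out)
            for v := range in {
                out <- v * 2
            }
        }()
        return out
    }

    // stopPipeline is the two-stage stop described above: close the
    // stop channel, then drain the tail until it closes, at which
    // point no stage is producing anything.
    func stopPipeline(stop chan<- struct{}, tail <-chan int) {
        close(stop)
        for range tail {
        }
    }

The objection stands regardless: the tomb-per-stage convention gives every watcher the same Stop-means-stopped contract, and it is the one mechanism the rest of the codebase already relies on.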
[17:55] niemeyer: anyway, sorry, derail [17:55] niemeyer: yes it does [17:55] rog: If you want to reimplement juju again, I won't be around [17:55] :-) [18:06] niemeyer: i'm off for the evening. see ya tomorrow! [18:06] rog: Have a good time [18:06] niemeyer: when do you set off? [18:07] rog: Sometime Saturday only, I'll be around tomorrow still [18:07] niemeyer: cool [18:08] niemeyer: if you manage to review fwereade's toolsDir branch tonight, that would be awesome. i reckon we can get it running actual commands tomorrow possibly. [18:09] rog: I'm sure it must be ready to go in.. it was really a trivial detail, and it was great [18:14] jimbaker: ping [18:14] niemeyer, hi [18:14] jimbaker: Heya [18:14] jimbaker: I'm wondering about format 2 stuff [18:14] jimbaker: Were you the one pushing it? [18:14] niemeyer, yes [18:14] it's in trunk now [18:15] in terms of the raw string changes [18:15] jimbaker: Okay.. I've noticed there's a "-Not set-" value being used by "config get" [18:15] jimbaker: Have you killed that? [18:15] niemeyer, i didn't notice that in the code or usage... i'll see if it's still there [18:15] jimbaker: Ouch [18:16] jimbaker: config_get.py [18:18] jimbaker: Sorry, connection broken [18:19] niemeyer, yes that's unfortunate... i didn't touch juju get in the merged branch, but we should definitely address it so it is also raw [18:20] jimbaker: It should be nil or something.. "-Not set-" is a perfectly valid *set* value [18:20] niemeyer_, agreed [18:23] i understand that the attempt here is to provide some human readable info on how the settings have been made (such as the default flag). i do think it would be better if it were to simply roundtrip with juju set, but that's too late i suppose [18:24] jimbaker: Can you please fix it so it's "None" instead? [18:24] niemeyer_, np. so should we also have the new behavior respect format 2 for that service? === niemeyer_ is now known as niemeyer [18:25] jimbaker: No, this is client side [18:25] jimbaker: format 2 is about the charm [18:25] niemeyer, i understand [18:25] jimbaker: This is a bug in the juju get command, that I suggest actually fixing [18:25] jimbaker: I'll suggest the same in the Go side [18:26] jimbaker: yaml has a perfectly valid idea of "nil" [18:26] jimbaker: Which is what this means [18:26] ok, i'll propose this fix [18:26] jimbaker: Cheers === steev_ is now known as steev [19:10] niemeyer, heyhey, sorry I missed the conversation above [19:10] niemeyer, I did repropose the toolsDir PATH branch, maybe you missed it [19:11] niemeyer, https://codereview.appspot.com/6496120 [19:11] niemeyer, juju get is now fixed in trunk r577, and the change will be part of the 0.6 release [19:12] fwereade: Ah, sorry, I was postponing it because I didn't think you'd do it right now, but if you're active I'll have a look right away.. I'm sure it's good [19:12] jimbaker: Woohay, cheers [19:12] jimbaker, cool [19:13] (I wondered about that in the golang implementation :)) [19:13] fwereade, no need to port that bug :) [19:14] * fwereade suddenly has first-job flashbacks... bug-compatible reimplementation of directx... fun :) [19:14] we called it indirectx, of course [19:21] LOL [19:22] fwereade: Done, thanks! [19:27] niemeyer, sweet, tyvm [19:32] niemeyer, btw, I'm not 100% sure whether this is just me seeing patterns where none exist...
but ISTM that the 50ms spinning while waiting for the uniter makes things a lot flakier than the 200ms spinning when I'm running full test suites, and I have a theory that this is *because* I'm spending more time spinning rather than sitting back and allowing the uniter to do its thing [19:33] fwereade: Hmm.. this sounds a bit like we don't know what's actually going on [19:33] niemeyer, how would you feel about dialing them back a little, at least to, say, 100ms... (until I decide that I'm seeing just as many just-missed timeouts, and change it back) [19:33] niemeyer, ha, yes, that is a good way of putting it [19:34] fwereade: I'd rather dial them to 25ms so we can actually find out what's going on :) [19:34] niemeyer, haha :) [19:34] fwereade: The delay is preceding a pull.. if returning earlier makes it break anyhow, something is necessarily not ok [19:35] fwereade: In some cases this reflects background activity that is missing a Stop, for example [19:37] niemeyer, hmmm, my theory was no more sophisticated than "the more I do on the test goroutine, the less time is spent letting real things actually happen" [19:37] niemeyer, (I'm definitely not contradicting your suggestion) [19:38] fwereade: I find that unlikely.. 50ms is an eternity in computing time [19:38] fwereade: I find it more likely that we're simply lucky enough to have spotted a way to observe a problem [19:39] niemeyer, you are probably right... many uniter things take longer than I would naively expect them to [19:40] fwereade: Can I help you to debug the issue anyhow? [19:41] niemeyer, not obviously, except perhaps to guess at why this chunk of code might take 200ms: [19:42] if err = u.deployer.Deploy(u.charm); err != nil { [19:42] return err [19:42] } [19:42] if err = u.sf.Write(reason, Done, hi, url); err != nil { [19:42] return err [19:42] } [19:42] } [19:42] log.Printf("charm %q is deployed", url) [19:42] niemeyer, from the final deferred log in Deploy to the log at the bottom [19:42] niemeyer, but, yeah, that's not a helpful thing to ask [19:43] fwereade: I'll have a look at the code [19:43] niemeyer, specifically in the context of the jujud test [19:43] niemeyer, but I'll be staring at verbose runs of that on its own and seeing if I can spot anything, too [19:44] niemeyer, hmm, yeah, that's more like 10ms when run on its own [19:44] fwereade: Ah, ok [19:45] fwereade: That sounds about right for a disk seek [19:46] niemeyer, hmm, just got a couple of ~100ms ones in a row [19:47] fwereade: Ok, that may be the machine working.. keep in mind that we're fsyncing on the writes [19:47] niemeyer, ha! yes, I totally forgot that [19:48] niemeyer, ok, I feel a little less insane now, and in that case actually more comfortable just bumping timeouts across the board [19:48] fwereade: Maybe I misunderstand the original issue then [19:48] fwereade: I thought the 50ms was inside a loop [19:48] niemeyer, yeah, just when polling for things to have happened [19:48] fwereade: That would wait further for a few seconds until the desired state actually arrived [19:49] fwereade: It'd certainly be too little to be deterministically waiting *just* that, but it's not too little if it's inside a loop [19:49] fwereade: Do I misunderstand what you're actually covering? [19:51] niemeyer, the issue is really just that, when things fail, I see a lot of "waiting..."
spam from inside the tests; specifically, much more than I intuitively expected [19:52] niemeyer, being reminded that we're fsyncing every write makes that behaviour seem a lot less weird [19:52] niemeyer, ie I now have a mechanism for the uniter taking its own sweet time over things [19:53] niemeyer, rather than the half-baked "overeager tests + gc pauses + handwaving" theory [19:53] fwereade: LOL [19:53] fwereade: This is cool :) [19:53] niemeyer, and I'm now comfortable just bumping the top-level timeouts a little and relaxing [19:55] fwereade: We can certainly disable the fsyncing for tests later [19:55] fwereade: But we should make sure things actually work first :) [19:55] niemeyer, v good point [19:56] niemeyer, quite so :) === 45PAA1GWO is now known as hazmat [20:57] * niemeyer steps out for a while
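(As a closing footnote on the timeout discussion: the pattern at issue is roughly the polling helper below, and the fix being settled on is to raise the overall timeout rather than shrink the interval, since fsynced writes legitimately make individual uniter steps take tens of milliseconds. The names are illustrative, not jujud's actual test helpers.)

    package live

    import (
        "fmt"
        "time"
    )

    // waitFor polls cond at the given interval until it holds or the
    // timeout elapses; each idle round prints the "waiting..." spam
    // mentioned above.
    func waitFor(cond func() bool, interval, timeout time.Duration) error {
        deadline := time.Now().Add(timeout)
        for !cond() {
            if time.Now().After(deadline) {
                return fmt.Errorf("condition not met after %v", timeout)
            }
            fmt.Println("waiting...")
            time.Sleep(interval)
        }
        return nil
    }

With a 50ms interval and fsync-bound uniter steps of ~100ms each, several "waiting..." rounds per check are expected even when nothing is wrong, which matches the behaviour described above.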