[00:13] niemeyer, there is some redundancy with the new format since we're showing the service relations : db : [myblog, teamblog] at the service level and showing the same for each unit as well in the relations/relation-errors block [00:23] we had talked about collapsing them, but per unit rel status is important [00:24] * hazmat looks for previous context [01:42] hazmat: Wasn't the redundancy always there? [01:43] hazmat: The distinction is we show the active relations [01:43] hazmat: Maybe all we need is relation-errors? [01:43] niemeyer, yeah.. that sounds reasonable [01:43] there's not really any point to showing it duplicated, and we flag errors separately [01:44] s/their [01:44] hazmat: yeah.. and if we regret we can always go back [01:44] sounds good, i'll do it up that way [01:45] niemeyer, thanks [01:45] hazmat: np [01:55] hmm.. some of the dot rendering needs that info [02:00] nm [04:03] * hazmat yawns [07:04] mornin' to anyone silly enough to be up at this hour [07:04] andrewsmedina: pong [07:12] speak for yourself, it's 5:12pm here :) [07:16] bigjools: :-) i knew it was a silly thing to say. [07:17] bigjools: australia? [07:17] yup [07:18] gah, this whole juju branching from LP thing has really screwed my day [07:18] bigjools: how's that? [07:19] I waited for a bootstrap to finish while it was setting up a node (takes ages) and then: [07:19] bzr: ERROR: http://bazaar.launchpad.net/%2Bbranch-id/348938/.bzr/repository/packs/c77bcca4ec91aee207a8f9d37b1a8608.pack is redirected to https://launchpad.net [07:19] which is a bug in LP/bzr somewhere [07:19] and the whole bootstrap grinds to a halt [07:20] aren't web services marvellous? [07:20] not today :) [07:20] bigjools: BTW how long does it take for you to bootstrap an environment? [07:21] (usually) [07:21] but I'm still gobsmacked at juju branching off LP at all during deployment/bootstrap :/ [07:21] bigjools: depends on the environment settings, right? [07:21] it depends. I am testing maas, and if I wait for a machine to install, an hour, otherwise it depends on network speed for apt-get update/install etc. [07:22] are you talking about juju-origin? [07:22] bigjools: yeah [07:22] I still think that's crazy [07:22] bigjools: that you should have the option? [07:23] that it tries to check out a branch [07:23] bigjools: does it do that even when juju-origin is distro or ppa? [07:23] is this an artifact of my dev environment, or does it always do that by default? [07:24] bigjools: i couldn't say i'm afraid [07:24] let me rephrase - does it always bzr branch/checkout? or does it try to use other ways to get the code down? [07:25] bigjools: i was under the impression that if you used juju-origin=distro or ppa that it would just do apt-get [07:25] ah ok, I didn't know that [07:25] so just the default is crazy then :) [07:26] bigjools: ah, the default depends on your local environment [07:26] bigjools: i'd forgotten about that. [07:26] ah, that's what I was getting at [07:26] * rog looks at the code [07:27] bigjools: it looks at the output of apt-cache policy [07:28] ok [07:29] bigjools: what does "apt-cache policy juju" produce for you when you run it? [07:30] bigjools: to be honest, i didn't really understand the motivation behind the code when i ported it to Go. i just copied the semantics and the tests :-) insight into why it's good or bad would be useful
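A minimal sketch of the origin-detection logic being discussed: run `apt-cache policy juju` and fall back to branching the source when the package is not installed locally. Only the "Installed: (none)" trigger and the branch/distro/ppa origins come from the conversation; the function and constant names, and the PPA heuristic, are illustrative assumptions, not juju's actual API.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

const (
	originBranch = "branch" // bzr-branch the juju source from Launchpad
	originDistro = "distro" // apt-get install from the Ubuntu archive
	originPPA    = "ppa"    // apt-get install from the juju PPA
)

// detectOrigin inspects `apt-cache policy juju` output. "Installed: (none)"
// is what triggers the branch origin, as rog explains above. Distinguishing
// distro from ppa by the source URL is a guess at the real heuristic.
func detectOrigin() (string, error) {
	out, err := exec.Command("apt-cache", "policy", "juju").Output()
	if err != nil {
		return "", err
	}
	for _, line := range strings.Split(string(out), "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "Installed:") && strings.Contains(line, "(none)") {
			return originBranch, nil
		}
		if strings.Contains(line, "ppa.launchpad.net") {
			return originPPA, nil
		}
	}
	return originDistro, nil
}

func main() {
	origin, err := detectOrigin()
	if err != nil {
		fmt.Println("apt-cache not available:", err)
		return
	}
	fmt.Println("juju-origin default:", origin)
}
```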
[07:31] I really have no idea :/ [07:31] juju is not installed locally, anyway, which is why I guess it branches it [07:36] bigjools: no idea what happened there [07:36] bigjools: last thing i saw from you was: [07:36] [07:31] I really have no idea :/ [07:37] wrtp: I just said that juju is not installed from a package [07:37] bigjools: could you paste the exact output of apt-cache policy juju, please? [07:38] bigjools: i'm just looking to see how the code would deal with it - it's got quite a few cases. [07:38] wrtp: http://pastebin.ubuntu.com/891862/ [07:41] bigjools: thanks [07:42] bigjools: yeah, it's the "Installed: (none)" that triggers it [07:43] wrtp: right [07:43] bigjools: set juju-origin and you'll be fine, i guess [07:43] wrtp: I am using it to pull my dev branch from LP [07:43] but got caught by that LP bug [07:43] bigjools: the underlying problem is outlined in this email: https://lists.ubuntu.com/archives/juju/2012-March/001337.html [07:43] :) [07:50] rog, fwereade: moin [07:50] heya wrtp, TheMue, bigjools [07:51] fwereade: yo! [07:51] hello fwereade [07:55] wrtp: I found what looks like a bug, can you verify this: [07:55] the user-data contains lines that append to the /etc/init/..conf files [07:56] so on multiple boots you end up with it saying to start the provisioning agent etc multiple times [08:07] bigjools: multiple juju bootstraps? or multiple boots of the machine? [08:08] bigjools: BTW it's ironic that the problem i'm currently trying to debug with the Go port is directly to do with the issue you had above... [08:16] wrtp: heh - each boot uses the cloud-init user-data to append the same stuff to the conf files [08:17] bigjools: hmm. i think juju usually assumes a fresh machine each time. [08:17] wrtp: not sure why it is using >> in the bash script then [08:18] bigjools: is that in juju or in cloudinit.py ? [08:19] wrtp: the user-data's runcmd coming via cloudinit [08:20] afk for a while [08:21] bigjools: i don't see that, but maybe i'm looking in the wrong place. [08:21] oh yeah, i do now. [08:24] bigjools: i think that might have changed since i did the port. hmm. [08:27] fwereade: you did the upstart stuff, right? [08:28] wrtp, yeah, reading back... [08:28] fwereade: from a test data file (cloud_init_ppa): [08:29] /var/log/juju, 'cat >> /etc/init/juju-machine-agent.conf <<EOF ...' -- why the append? [08:29] wrtp, because, apparently, the crack was strong with me that day, but let me check something [08:30] wrtp, yeah, crack [08:30] bigjools: ^ [08:30] :-) [08:31] fwereade: do any of the tests try rebooting the instances? [08:31] wrtp, hm, no [08:31] fwereade: it seems to me that it might be a good idea to try to test this stuff [08:31] wrtp, but rebooting them has AFAICT always worked in the past [08:32] fwereade: lucky :-) [08:32] wrtp, well, it tests that it has behaviour that has been experimentally verified to work, even if that was by sheer luck :/ [08:34] fwereade: :-) [08:34] wrtp, it seems that ec2 runs user scripts only once anyway [08:35] fwereade: that would be very sensible [08:35] fwereade: maybe we can say it's a bug in MaaS? [08:35] wrtp, nah, my bug [08:35] fwereade: but... isn't it up to the OS to decide whether to run the init scripts or not?
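A tiny sketch of the fix implied above, assuming a hypothetical helper that renders the cloud-init runcmd entry: truncating with > rather than appending with >> keeps the upstart job file identical across reboots on providers that replay the user data's commands, which is the failure mode bigjools hit on maas.

```go
package main

import "fmt"

// upstartRunCmd renders a cloud-init runcmd entry that writes an upstart
// job. Using > (truncate) instead of >> (append) makes the command
// idempotent if it runs again on a later boot. The helper name and the
// heredoc shape are illustrative, not juju's actual code.
func upstartRunCmd(name, content string) string {
	return fmt.Sprintf("cat > /etc/init/%s.conf <<'EOF'\n%sEOF\n", name, content)
}

func main() {
	fmt.Print(upstartRunCmd("juju-machine-agent", "exec python -m juju.agents.machine\n"))
}
```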
[08:36] wrtp, I think of it as being up to cloud-init, and it's not not-a-bug just because ec2 machines happen to be set to run them only once [08:36] wrtp, it was hitherto a latent bug [08:37] fwereade: BTW, looking at juju/providers/common/tests/data/cloud_init_ppa, it looks like the "--session-file" argument is on a different line to the "exec python -m juju.agents.machine" command. [08:37] fwereade: how can that work then? [08:38] wrtp, isn't that yaml being "clever"? [08:38] wrtp, in that section line breaks are represented as pairs of line breaks, AFAICT [08:39] fwereade: quite probably. i haven't found a good explanation of all yaml's cleverness anywhere yet [08:39] fwereade: i'd have thought that stuff inside ' ... ' is treated literally. [08:41] fwereade: jeeze who could ever think of calling it "simple"? [08:41] wrtp, I basically treat yaml as a binary format [08:41] wrtp, it passes through a library that causes it to make sense, somehow, and I'm happy enough with that :p [08:41] fwereade: can we move to json sometime, please? [08:42] wrtp, I have no idea, I think that's one to punt to niemeyer [08:42] wrtp, I don't know where the original yaml dependency came from [08:42] fwereade: yeah. i think niemeyer likes the indentation-based format. [08:42] wrtp, tbh I'm not *so* bothered by it -- in the places where users might use it, it's usually pretty clear and obvious [08:43] wrtp, but, yeah, it gets ungainly with our own data [08:45] "In addition, it is only possible to break a long single-quoted line where a space character is surrounded by non-spaces." [08:45] from: http://www.yaml.org/spec/1.2/spec.html#id2788097 [08:45] wrtp, OTOH note that cloud-init is not our code, and that needs to be yaml anyway [08:45] fwereade: yeah, i realise that. [08:45] fwereade: unfortunately. [08:46] wrtp, so this specific example is not really directly relevant to the cause anyway ;) [08:46] fwereade: yeah, it was more of a by-the-by. [08:48] fwereade: i can't see where in that section it says that newlines are removed. [08:49] fwereade: i suppose example 7.9 implies that though. [08:49] wrtp, I'm afraid I have little stomach for a deep dive into the yaml spec, but the way it converted all the \ns to \n\n (in that section only) seems to me to be strong-enough circumstantial evidence for my position ;) [08:50] fwereade: true 'nuff :-) [08:59] wrtp, hmm, I wonder whether deploying machine constraints will break everyone's jujus again :/ [08:59] fwereade: it'll probably break the hacks that everyone's been using to get around the lack of machine constraints... [09:00] wrtp, oh, *that* is guaranteed, but *hopefully* people have known they were going away since last year [09:00] wrtp, I'm just worried about the impact on stuff that's already deployed from PPA, at the point it hits the PPA [09:02] fwereade: hmm, yeah. is the zk schema backwardly compatible? [09:03] wrtp, I need to double-check what happens if the constraints key is ever not found... but actually, that should be broken *already* if it's a problem === jelmer_ is now known as jelmer [09:34] wrtp, fwereade_: a simple review: https://codereview.appspot.com/5843068 [09:34] wrtp: btw, why today wrtp and not rog? [09:45] TheMue, the big question is "what will this be used for?" [09:46] fwereade_: inside WaitAgentAlive() of Unit and Machine, like discussed yesterday with niemeyer [09:49] TheMue, ok, and we definitely need a timeout for that?
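For reference on the single-quoted folding wrtp and fwereade puzzle over earlier: inside a flow scalar a plain line break folds into one space, while a blank line folds into a literal newline, which is why the dump doubled every \n. A small demonstration of the parse direction (the gopkg.in/yaml.v3 library is an assumption; the folding rules are from the spec section quoted above):

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

func main() {
	// A single-quoted scalar split across lines: the blank line between
	// the two content lines folds back into a single newline on parse.
	src := "cmd: 'exec python -m juju.agents.machine\n\n  --session-file /tmp/x'\n"
	var doc struct{ Cmd string }
	if err := yaml.Unmarshal([]byte(src), &doc); err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", doc.Cmd) // "exec python -m juju.agents.machine\n--session-file /tmp/x"
}
```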
[09:51] fwereade_: we should leave waiting users the option to have a timeout (as usual in concurrent and distributed computing). what concerns do you have about timeouts? [09:52] TheMue, those waiting users being? [09:53] TheMue, my only concern is that it's code that won't be used, and we don't have enough information to choose a sensible timeout [09:53] TheMue, the only things that wait for agents are command-line tools [09:54] TheMue, I was under the impression that we deliberately *didn't* time those out so that people can write scripts that bootstrap and just keep going with the next command [09:55] TheMue, I can understand that there are other plausible scenarios in which we might want to have a timeout, but they don't exist yet [09:55] (sorry mangled english) [09:58] fwereade_: one moment please, doing esta in parallel ;) back in a few seconds [10:23] fwereade_: so, back again, started ESTA for UDS [10:23] TheMue, ah cool [10:24] TheMue, anyway it's not bad code; it's not a bad idea; it's just not something we have a reasonable certainty of needing imminently, so I'd prefer it if it just didn't exist [10:24] TheMue, every line of code is a small weight on each of our brains ;) [10:25] fwereade_: WaitAlive() is the same functionality as it has been before in Unit and Machine. afaik a low level func like this one should provide the possibility for timeouts and it's the task of its calling code to determine how long it is willing to wait [10:26] fwereade_: no more code duplication later for selects with time.After [10:26] fwereade_: this argument by rog yesterday led to the reduction to one func [10:26] fwereade_: btw, why no concerns before when the same code has been in Unit and Machine? [10:26] TheMue, my point is that nobody's going to be doing anything other than "alive, ok = <-watch", without select, regardless [10:27] fwereade_: even you in your tests worked with a timeout [10:27] TheMue, very specifically, only in the tests [10:27] fwereade_: timeouts and their handling are essential in distributed and concurrent programming [10:28] TheMue, by adding this mechanism, you *force* me to choose a timeout [10:28] fwereade_: so you think it's ok that parts of the software block forever until anyone kills the process? [10:29] TheMue, exactly so [10:29] fwereade_: i could add a -1 for automatic max duration ... [10:29] fwereade_: but again, why hasn't that question come up before, when the same code has been in Agent or later in Unit or Machine? [10:30] TheMue, that's still forcing me to make a choice, it's just that there's still only 1 meaningful choice [10:30] fwereade_: and btw, nobody is forced to use this function, anyone can use the other functions, they are unchanged [10:31] TheMue, if I failed to spot and complain about the timeout before, I apologise for my inattention [10:31] TheMue, sure; and if nobody does use the function, why have it in the first place? [10:31] fwereade_: i will use it, in Unit and Machine [10:31] TheMue, to do what? [10:32] fwereade_: like discussed yesterday with niemeyer and rog [10:32] fwereade_: wait for an agent [10:32] TheMue, I'm sorry I missed that; what are the new use cases in which a timeout doesn't break user expectations? [10:32] TheMue: out of interest, BTW, what code waits for an agent? [10:33] rogpeppe: would have to look again who is calling it [10:33] TheMue, it's juju-ssh and juju debug-hooks [10:33] TheMue, and that's it [10:34] fwereade_: interesting. why do they wait for an agent? (and which agent do they wait for?)
[10:34] TheMue, in both cases it's the command line and it will break existing behaviour that users expect [10:34] rogpeppe: there are several callers of watch_agent() in today's py code [10:35] TheMue, I see two [10:35] fwereade_: just counted the search results, they include the tests [10:35] rogpeppe, ssh waits for the machine agent [10:36] fwereade_: both of those occurrences just seem to be there to get the ip address of the machine [10:37] TheMue, yeah; there are 2 non-test uses [10:37] rogpeppe, I don't entirely agree with it in juju ssh [10:37] rogpeppe, but that's what was agreed [10:37] fwereade_: i suppose it's better than polling ec2 for the ip address to appear [10:38] rogpeppe, but it's needed for debug-hooks because it's utterly meaningless without an active unit agent at the other end effectively forwarding you a session [10:38] fwereade_: ah [10:39] TheMue: the existing watch_agent doesn't seem to have a timeout [10:39] fwereade_, rogpeppe: is either of you interested in continuing the agent method implementation (2 x 3 simple methods)? it's real fun, with a lot of discussions of principle over different timezones and languages ... [10:40] :-) [10:40] rogpeppe: i implemented exactly (!) today's behavior as a first draft. and that had to be changed. [10:40] TheMue, sorry, I am honestly trying to save you work, but I don't think I'm succeeding :( [10:41] TheMue: i seem to remember interfaces and embedding in the first draft :-) [10:42] actually, no interface, probably [10:42] rogpeppe: one interface (that was only there to verify that Unit and Machine provide the right methods, else useless) and three simple functions to directly use them [10:42] rogpeppe: embedding has already been another way, after also switching to presence [10:44] fwereade_: which work do you wanna save? it's already done and changing it is new work. each time. [10:47] fwereade_: the wish of reducing code duplication (yes, the one with (!) timeout) came from rog, niemeyer followed it and my only part has been to move it to presence instead of one single func in any of the state files to provide this functionality (waiting with timeout) also for other users in future (afaik there are several more watches). [10:48] TheMue, I just don't understand why you need a timeout at all [10:48] TheMue, perhaps one day you will [10:48] TheMue, but I don't see what it gains us except lines of code [10:49] TheMue, I clearly failed to adequately express my concerns with the original watcher type [10:49] Unit.WatchHookDebug(), Unit.WatchResolved(), but i don't know if they will be based on presence, only some notes of methods that still have to be implemented [10:50] TheMue, they certainly won't wait on presence [10:50] TheMue: i was under the impression that only agents will be based on the presence package [10:50] TheMue, they're called by the unit agent itself [10:50] TheMue, it *knows* it exists ;) [10:50] rogpeppe, we also have presence nodes for unit relations as I recall [10:51] fwereade_: interesting. [10:51] rogpeppe, used to signal active participation in a relation (as distinct from "it may be working now by coincidence, but the unit won't react to changes") [10:53] TheMue, I seem to recall pointing out the very limited use cases for agent watching before, but perhaps that got lost in the noise [10:53] fwereade_: is that implied by the unit agent's presence, but duplicated in a different place for convenience? i haven't looked into how this stuff works at all.
[10:53] rogpeppe, unit relations have state, of which "up" and "down" are generally relevant [10:54] fwereade_: a unit relation can be down when its unit agent is up? [10:54] rogpeppe, certainly; failed hook? [10:54] ah, sure [10:55] rogpeppe, the unit relation state has the last value written by the agent [10:55] rogpeppe, but if the agent isn't even well enough to maintain its presence node we can be pretty sure that something is rotten in the state of... um, the service [10:55] fwereade_: i suppose what i'm trying to work out is if the pinger thing is necessary in this case, or whether we can just use zk as usual. [10:56] ah, ephemeral nodes. [10:56] rogpeppe, we need some way to know that a remote thing is active [10:56] rogpeppe, indeed :) [10:56] fwereade_: so the presence package is only for agents? [10:57] TheMue, no, it's a general replacement for ephemeral nodes, which we have decided aren't worth the trouble [10:57] fwereade_: if we know the unit agent is alive, doesn't that imply the unit relation is, erm, actively inactive? [10:57] fwereade_: and there is no such usage of ephemeral nodes where the watcher isn't willing to wait endlessly? [10:58] rogpeppe, if the unit agent is alive we hope/trust that it's also watching its watches [10:58] TheMue, we don't wait on ephemeral nodes except in the 2 cases I mentioned [10:59] fwereade_: if the unit agent is dead, we can assume its unit relations are dead? [10:59] TheMue, and we only directly care about unit relation ephemeral nodes in the context of `juju status` in which we definitely don't want to wait on them [10:59] i'm speculating wildly. please ignore me. [11:00] rogpeppe, if it's dead then the service may well still be working correctly underneath [11:00] rogpeppe, but we can be sure that it won't respond correctly to settings changes etc [11:01] TheMue, the trouble is that this is not obvious from reading the python code :( [11:03] fwereade_: i'm trying to think slightly deeper about why we use a pinger. AFAICS it's to signify that there's something active pinging the node. given that (i think) the unit relation node is managed by the same code that manages the unit agent presence node, the activeness of the former could be used to imply the activeness of the latter, perhaps, is what i'm thinking. [11:03] oops [11:03] fwereade_: how do those other watches work? what do they watch? the presence of nodes or the change of node contents? [11:03] activeness of the latter could... activeness of the former [11:04] rogpeppe, I think that's what we already do [11:04] fwereade_: so perhaps an ephemeral/presence node is unnecessary for the unit relation? [11:04] TheMue, the only ephemeral watches I am aware of are those in ssh and debug-hooks, which simply wait for the presence of the thing at the other end [11:05] rogpeppe, hm, that's interesting [11:05] fwereade_: so you say it's ok to wait forever there? [11:06] rogpeppe, something's knocking at the corner of my mind, gimme a sec [11:06] TheMue, yes, absolutely [11:06] fwereade_: i can imagine there might be race conditions that make it difficult [11:06] fwereade_: how shall i test it without blocking the test forever? [11:06] TheMue, we explicitly got rid of timeouts on command line tools [11:07] TheMue, you do something like I did in the original presence node tests? [11:07] TheMue: what i tend to do is to make sure that it blocks by waiting for a short period, then unblock it and check it unblocks.
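The block-then-unblock test pattern just described, as a runnable toy; the channel-backed waitAlive below is a stand-in for the real presence API, which is an assumption.

```go
package main

import (
	"testing"
	"time"
)

// alive simulates the presence node; waitAlive blocks until it is closed.
var alive = make(chan struct{})

func waitAlive() { <-alive }

func TestWaitAliveBlocks(t *testing.T) {
	done := make(chan struct{})
	go func() { waitAlive(); close(done) }()

	// Assert it blocks: wait a short period and check it hasn't returned.
	select {
	case <-done:
		t.Fatal("waitAlive returned before the node was alive")
	case <-time.After(50 * time.Millisecond):
	}

	close(alive) // simulate the agent's pinger bringing the node alive

	// Assert it unblocks promptly afterwards, without hanging the suite.
	select {
	case <-done:
	case <-time.After(5 * time.Second):
		t.Fatal("waitAlive failed to unblock")
	}
}
```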
[11:08] fwereade_: hmm, yes, could do so too, indeed [11:09] TheMue: in fact you'll be testing almost exactly the same code... [11:10] rogpeppe, ServiceRelationState.get_unit_state also checks the ephemeral node, not going to analyse all the callers of that at this stage ;) [11:11] fwereade_: the unit relation ephemeral node? [11:11] rogpeppe, yeah [11:11] rogpeppe, (also we use an ephemeral node to signal presence of an active debug-hooks session, watched by the unit agent; that one should also wait forever) [11:11] fwereade_: still got pain with code waiting for external events endlessly. it disregards the knowledge of > 20 yrs of distributed and concurrent systems and has caused so much pain. [11:12] TheMue, it *is* basically the underlying model of juju though [11:12] fwereade_: but maybe indeed it's irrelevant for our system [11:12] TheMue, I can't actually think of many places where timeouts are appropriate at the juju level [11:13] fwereade_: yeah, seems so [11:13] TheMue, ZK is keepalive like hell underneath, but from our perspective we can wait forever [11:13] we've already got timeouts and retries happening at a low level [11:13] TheMue, if there are problems we trust ZK to tell us about it [11:13] fwereade_: yeah [11:13] fwereade_: once again i'm driven by my history where this had been a bad behavior [11:14] TheMue, yeah, we all carry baggage [11:14] TheMue, it's hard for me to separate "what the python does" from "how juju should actually do it" ;) [11:14] TheMue, but I think this is a juju-level property not a code-level one [11:14] TheMue, if you see what I mean [11:15] perhaps the question is not: "should there be a timeout?" but "where should the timeout be?" [11:15] fwereade_: but i would like you to discuss it with niemeyer as he said "yes, implement it in presence" and i don't want to follow your advice now and get another one from him in the evening. [11:15] TheMue, I think it comes down to "do you trust zookeeper" ;) [11:15] TheMue, I can totally understand that [11:16] TheMue: for now, why not just make a function in the state package, as suggested by niemeyer? [11:16] TheMue, niemeyer has firm opinions and the final say and while they are not arbitrary or capricious they are hard to predict with 100% accuracy [11:16] rogpeppe: the one in presence as its own branch is by niemeyer [11:17] rogpeppe: we talked about it yesterday when you stepped out [11:17] rogpeppe: the problems of time zones ;) [11:17] TheMue, did he implement it because he knew it was needed himself, or as an alternative to your watcher proposal? [11:17] TheMue, on the assumption that it *was* a necessary feature [11:17] rogpeppe: btw, now rogpeppe, this morning wrtp, yesterday rog. why this? [11:18] TheMue, I always assumed it was low-level psyops to contribute to an aura of glamorous mystery [11:18] TheMue: better than rogpeppe_ and rogpeppe__, i thought [11:18] fwereade_: that too [11:18] fwereade_: no, the implementation is by me. niemeyer and i talked about moving it to presence yesterday. [11:19] rogpeppe: and why not only one? [11:19] TheMue: because my irc client sometimes reconnects when the irc server already thinks i'm connected, so it has to choose a different one [11:19] rogpeppe: ic [11:20] rogpeppe: thankfully it's seldom here. and a short time after reconnect i get my nick back automatically [11:20] rogpeppe: irc is sometimes really unstable, yep [11:21] fwereade_: i think we should let presence.WaitAlive through, despite misgivings. it can go later if we find it's never used.
[11:21] rogpeppe, TheMue: I'm ok with that [11:23] rog, fwereade_: hehe, and when i use it in Unit and Machine i'll get the next comments by you? *LOL* [11:23] TheMue: i'll be interested to see what value you choose for the timeout :-) [11:23] TheMue, well, yes, because IMO if you're using timeouts by default then you're using ZK wrong [11:24] * TheMue again wonders why the timeout hasn't been a topic before? [11:24] TheMue, and I don't think you can come up with a timeout value that is a 100% accurate indicator of "something is wrong" as opposed to "ec2 is taking ages, whaddayagonnado?" [11:25] TheMue, I'm pretty certain I have already talked with you about the very small set of clients for watch_agent, and their very limited use cases [11:25] fwereade_: but not about the timeout [11:26] yeah, i don't think we've really discussed timeouts before [11:26] TheMue, no; when I said that it was all unnecessary, and the presence package covers all the use cases already, that's what I meant [11:26] fwereade_: i'm not good in intention reading ;) [11:27] TheMue, nor does it have a custom memory allocator, because it doesn't need that either ;p [11:27] TheMue, I think I honestly did try to direct you to the places you needed to see to understand the use cases [11:28] fwereade_: indeed [11:28] TheMue, the fact that all user interactions must block forever is I suspect the crucial bit of floating context that you were missing [11:29] TheMue, I know that only because I was around when it was discussed and changed to that behaviour [11:30] TheMue, let's put it this way [11:30] fwereade_: exactly, this "block forever" is hard to get. i know different behaviors from those systems i've done in the past. [11:30] TheMue, sorry I forgot what I was going to say [11:31] fwereade_: hehe [11:31] TheMue, really it comes down to trusting ZK [11:31] TheMue, if we do, we should generally assume that the events will land according to ZK's limited guarantees; and if we don't, we should panic [11:32] fwereade_: a software written in java? no, never! *scnr* [11:32] TheMue, haha :) [11:33] * TheMue has done JEE for > 7 yrs, it has been no (!) good time [11:33] TheMue, I can imagine [11:34] rogpeppe, TheMue: lunchtime :) [11:34] fwereade_: enjoy [11:34] fwereade_: likewise [12:42] TheMue, fwereade_: i'm off to get the train down to london now. will probably be incommunicado tomorrow morning too. see you tomorrow! [12:43] rogpeppe: have fun, take videos, copy slides, publish everything. ;) [13:14] fwereade_, ping [13:14] fwereade_, i'm a little concerned about the ambiguity around the ec2-instance-type arch.. esp as it's the most common way people will use constraints [13:14] on ec2 [13:15] hazmat, pong [13:15] hazmat, there was a quiet discussion on the lists about how a 64-bit default would be a good idea, let's do it [13:16] Mornings [13:16] hazmat, is there some other ambiguity? [13:16] niemeyer, mornings [13:16] heya niemeyer [13:18] fwereade_, ah right, yeah..
given 64bit on all types it's not much of a concern [13:18] hazmat, I can accept that it's a bit annoying not to be able to type "ec2-instance-type=m1.medium arch=i386" [13:18] hazmat, but there are what, 4 arch-choice types [13:19] hazmat, t1.micro is a bit of a joke really [13:19] hazmat, m1.small is the default in many people's minds, and is what you get if you specify bare "arch=i386" [13:20] hazmat, I think the proportion of our user base who specifically need 32-bit m1.medium, c1.medium, and t1.micro may have to bear it [13:21] fwereade_, fair enough [13:21] hazmat, in fact, they get to experience the awesome productivity shortcut of typing fewer characters in total! ("ec2-instance-type" is a bit of a mouthful...) [13:22] hazmat, niemeyer: of more concern is the HVM image thing [13:23] fwereade_: hm? [13:23] fwereade_, people will choose the explicit when possible [13:23] fwereade_ are there ubuntu images for hvm? [13:23] hazmat, niemeyer: I misread the EC2 information back in the day and somehow got the impression that hvm images were a nice improvement on cluster machines, not a hard requirement [13:23] hazmat, I think so, just a mo [13:24] hazmat, http://uec-images.ubuntu.com/query/oneiric/server/released.current.txt [13:24] hazmat, oneiric server release 20120222 ebs amd64 us-east-1 ami-beba68d7 hvm [13:25] fwereade_, it looks like it's only in us-east-1 [13:25] hazmat, just the one, but hopefully that's good enough [13:26] hazmat, cluster instances are only available in us-east-1 [13:28] hazmat, it does mean that get_image_id and get_instance_type are no longer independent, but they were already slightly uncomfortably linked [13:29] hazmat, if I manage to do that quickly this afternoon, would you try to review it in your afternoon? [13:29] fwereade_, cool, that makes more sense then, as for the image_id/instance_type.. that's fine. [13:29] hazmat, I also wanted to ask about your branch [13:29] fwereade_, i could, but i'm wondering if it's worth the trouble [13:30] re cc larges [13:30] it would be nice i guess [13:30] hazmat, I'd rather spend a couple of hours to make it not be *guaranteed* to break [13:30] hazmat, it's just way too shoddy to put up with [13:30] fwereade_, if you're up for it, it would be nice to round out our ec2 constraints support with the support for the biggest baddest vm on the block ;-) [13:30] hazmat, exactly :) [13:31] hazmat, I already need to check that field anyway, so I don't *accidentally* get an hvm image that won't work with normal instances ;) [13:31] jimbaker: You've got a review on https://codereview.appspot.com/5836049 [13:31] fwereade_, the series from charm branch? [13:32] hazmat, yeah [13:32] hazmat, just let me find it [13:32] fwereade_, https://codereview.appspot.com/5845073/ [13:32] hazmat, AFAICT it's a no-op [13:32] fwereade_: So, what's the deal about 386 vs. amd64? [13:33] niemeyer, you can choose arch on more instance types than just t1.micro now [13:33] fwereade_, hmm. yeah. with_constraints returns the new constraint.. [13:33] niemeyer, for that, I was already defaulting to 64-bit [13:33] fwereade_: Ok, so you were just wondering if it was fine to default to amd64 for all? [13:34] niemeyer, I think we discussed that on the lists; seemed to get a muted "yeah, sounds good" sort of response [13:34] fwereade_: Yeah, it certainly sounds good to me [13:34] fwereade_: Is there any other contentious point?
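A toy sketch of the overlap being worked out here: an explicit arch has to be checked against what the chosen ec2 instance type supports, with amd64 as the agreed default. The table and function names below are illustrative (and dated), not an authoritative list or juju's actual code.

```go
package main

import "fmt"

// instanceArches is an illustrative snapshot of which 2012-era EC2
// instance types supported which architectures.
var instanceArches = map[string][]string{
	"t1.micro":  {"i386", "amd64"},
	"m1.small":  {"i386", "amd64"},
	"m1.medium": {"i386", "amd64"},
	"c1.medium": {"i386", "amd64"},
	"m1.large":  {"amd64"},
	"c1.xlarge": {"amd64"},
}

// checkConstraints rejects combinations like "arch=i386
// ec2-instance-type=c1.xlarge", the nonsensical request mentioned above.
func checkConstraints(instanceType, arch string) error {
	if arch == "" {
		arch = "amd64" // the agreed default
	}
	for _, a := range instanceArches[instanceType] {
		if a == arch {
			return nil
		}
	}
	return fmt.Errorf("instance type %s does not support arch %s", instanceType, arch)
}

func main() {
	fmt.Println(checkConstraints("c1.xlarge", "i386")) // conflict: rejected
	fmt.Println(checkConstraints("m1.small", ""))      // defaults to amd64: ok
}
```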
[13:34] niemeyer, it's a little ungainly specifying 32-bit instances other than m1.smalls, but I think that's an acceptable price [13:36] niemeyer, you'd have to ask for "arch=i386 cpu=5" instead of "arch=i386 instance-type=c1.medium" [13:36] fwereade_, hmm.. ic, it's part of the service state api [13:36] s/instance-type/ec2-instance-type/ [13:36] k, i'll yank that branch [13:36] fwereade_: Uh, why? [13:36] hazmat, yeah: the service knows the charm, the series isn't *important* until you've got an actual unit that needs to find a machine [13:37] niemeyer, because of the overlapping behaviour that I think we agreed on [13:37] niemeyer, I could just as easily decouple arch from ec2-instance-type [13:38] niemeyer, but that opens us up to "arch=i386 ec2-instance-type=c1.xlarge" which is still a nonsensical request [13:38] fwereade_: If there is a possibility of selecting the architecture, it means that there isn't an overlap [13:39] niemeyer, true [13:39] niemeyer, yep, that's what I should do then [13:40] niemeyer, when it was just t1.micro defaulting to 64-bit seemed to be sensible [13:40] fwereade_: Yeah, t1.micro is ridiculous enough that it doesn't matter much indeed [13:40] fwereade_: Please feel free to not fix it now, though [13:41] fwereade_: I don't know what stage this is, and it feels like a minor bug that could be filed on Launchpad and wait until someone has time to fix it [13:41] niemeyer, it should be no more than 2 lines of core code to fix the conflict and make arch explicitly default to amd64 [13:41] fwereade_: Ah, ok, sounds less painful than filing a bug even! :) [13:42] niemeyer, there will be tests to fix ofc, but it really will be cheap :) [13:42] niemeyer, and it's a semantic bug really which always feels worse to let out into the wild [13:42] fwereade_: Superb [13:43] niemeyer, the other question that perhaps you missed is about cluster instances [13:43] fwereade_: Yeah, I saw it, but it wasn't clear to me what it was about [13:44] niemeyer, cluster instances require HVM images [13:44] fwereade_: They need a specific image, I suppose [13:44] niemeyer, I mistakenly thought it was an option rather than a requirement [13:44] niemeyer, it's not quite so cheap to fix but it's still less than an afternoon's work all told [13:45] fwereade_: That's going to be interesting.. it means that getting an image has to cross over the charm URL and the constraints [13:45] fwereade_: I'd not fix that now.. [13:45] niemeyer, they're combined at unit deploy time by the service [13:45] niemeyer, series is a constraint, just not one exposed directly to the user, because we know we'll get it from the charm [13:46] fwereade_: We can wait until someone reports a bug about it.. I suspect it will take a while, perhaps enough for it to be fixed already [13:49] niemeyer, I'm not really confident pressing too far on with the go code until I have a couple of reviews for the early steps... can I please take a little time to polish it up until I have a couple of reviews in? [13:50] fwereade_: Sure.. I was actually going to ask you about these branches in review [13:50] niemeyer, cool [13:50] fwereade_: In one of them you seem to have reverted back to kill the whole idea? [13:50] niemeyer, the hook context one?
[13:50] fwereade_: Yeah [13:51] niemeyer, it became clear after sleeping on our discussion that it is *not* a *hook* context [13:51] niemeyer, it's a tool context [13:51] niemeyer, which happens to be useful for hooks [13:51] niemeyer, the hook package itself became unjustifiable; but it fits really nicely IMO as one of a number of things in a cmd/jujuc/server package [13:52] niemeyer, jujuc is the dumb main.main that calls the server package [13:52] niemeyer, the server package contains the various tools implemented as Command, and the RPC server that executes them [13:53] niemeyer, it's a good place for that particular set of, er, synergistic functionality [13:53] fwereade_: Sounds reasonable [13:53] niemeyer, and it'll be as convenient as anything to use once we have an agent far enough along that it's *actually* needing to run hooks [13:53] fwereade_: +1 [13:53] niemeyer, sweet [13:53] niemeyer, well, that's the point of that change :) [13:54] fwereade_: Cool, thanks [13:55] niemeyer, as for the others: go-add-cmd-context you looked at once and seemed generally happy with, but there may be contention on FlagSet output [13:55] niemeyer, go-tweak-supercommand sits on that and is pretty trivial really [13:56] niemeyer, and go-testing-charm has I think been fixed [13:56] fwereade_: I'll have a pass at it again as my next task right away [13:57] niemeyer, awesome, thanks [13:57] fwereade_: np, sorry for the delay.. I should probably have reviewed these before other things I've been reviewing [13:58] fwereade_: Btw, this change probably deserves a better summary [13:58] niemeyer, it's ok, I haven't been entirely blocked, there's always something worth my time :) [13:58] fwereade_: It's saying "remove hook package", but it's actually pushing things forward [13:59] niemeyer, haha, I hope the second part is generally assumed to be my intent ;) [14:00] fwereade_: right :) [14:11] niemeyer, thanks [14:11] niemeyer: morning! [14:11] rogpeppe: Heya [14:11] jimbaker: np [14:13] hmm, not surprising the connection is unreliable on the train... [14:14] fwereade_: I see I didn't actually review the test files in the branch. I'll have a quick pass at that post LGTM [14:16] fwereade_: Cool, delivered [14:16] fwereade_: LGTM still [14:17] fwereade_: Thanks for the ride through that branch :) [14:17] niemeyer, excellent [14:17] niemeyer, and thank you, it was fun :) [14:25] launching ec2 instances from the train feels kinda cool [14:25] fwereade_: split the WaitAlive() into two variants. so i can use the one w/o timeout in Unit and Machine but we keep the one with timeout. if it never gets used it can be removed. [14:37] TheMue, cool [14:55] TheMue: Wait, what? [14:55] TheMue: Why do we need two variants? [14:56] TheMue: You actually needed the timeout, right? [14:57] TheMue: There's no other use of this logic today.. if we're finding it incorrect, let's fix the one we have [14:58] niemeyer: this morning fwereade_ convinced me that we don't need a timeout [14:59] TheMue: Ok, I'm not disputing either way.. let's just make the one function we have suit our use case [15:00] niemeyer: yes, i've done it that way and am currently integrating it. so it behaves as we have it today. [15:00] TheMue: Awesome, thanks [15:02] niemeyer: additionally i kept a version with timeout based on the one w/o. so if we ever use presence nodes and need timeouts we have it. i could also remove it, but i think the situation will come up. ;) [15:02] TheMue: Reading it..
[15:09] TheMue: Reviewed [15:09] TheMue: Let's drop the function we don't need.. we can merge the timeout on WaitAlive itself when we need it (if ever) [15:10] niemeyer: ok *sigh* [15:10] TheMue: Yeah, sorry for coercing you into removing code you don't need.. [15:12] niemeyer: it's just that i still have a problem with systems waiting endlessly. that's based on my experiences with other systems. it indeed seems to be ok for juju, but it will take a while until i'm feeling happy with it [15:12] TheMue: That's a different problem that I'm not trying to convince you on [15:13] TheMue: I'm happy to have a timeout, and it may indeed be the best thing.. maybe even a timeout by default? [15:13] TheMue: We'll see when we actually have to use this function [15:13] niemeyer: the test for WaitAlive() is hidden in WaitAliveTimeout(). so now i'll move that internal code into the test. ;) [15:13] TheMue: What I'm unhappy about is having two functions just because we have no idea about what we need [15:13] TheMue: Heh [15:14] TheMue: and now you know in practice why "hidden tests" are a bad idea [15:14] niemeyer: yes, you're right, having timeouts also needs to know what to do if a timeout happens [15:16] TheMue: I'm not against having a timeout.. we'll probably have to add one soon enough. [15:18] niemeyer: but you also say that today the callers of WaitAlive() can wait endlessly? [15:18] TheMue: In fact, I start to wonder whether WaitAlive is even a good plan [15:19] niemeyer: oh [15:19] TheMue: When are we going to be calling those methods on unit and machine? [15:19] TheMue: Have you had a look at the current code base to get an ide? [15:19] idea [15:20] niemeyer: we found two places [15:21] niemeyer: in juju.control.debug_hooks.py and juju.control.ssh [15:23] niemeyer: debug_hooks waits in a while 1 loop [15:23] TheMue: Looking [15:23] niemeyer: and ssh.py waits for the watch [15:26] TheMue: Why don't we just expose the results of AliveW? [15:27] TheMue: WatchAgentAlive(...) { return presence.AliveW(...) } [15:28] niemeyer: can do so, yes, it's only a bit more inconvenient for the caller (handling err, alive and watch, like it is now done in one function) [15:28] TheMue: Ok.. I'm happy either wya [15:28] way [15:29] TheMue: Let's just not sprawl functions we have no use case for [15:30] niemeyer: ok [15:31] niemeyer: would only have expected to have no function w/o timeout. too much erlang influence, where receive has a timeout statement. ;) [15:32] * wrtp thought that unit, machine, etc could just return a string path to the agent presence node. then clients could use presence on it to their hearts' desire. [15:32] but that's probably a bit subversive to say at this point, sorry. [15:33] wrtp: shut up, you're sitting in a train, no proper working place. *rofl* [15:33] wrtp: seems your connection is too good ;) [15:35] wrtp: but as long as we're moving inside the state package it's no problem to get the path, yes. we have zkAgentPath() for it [15:36] wrtp: only for callers outside of state would it be difficult. we would have to make the function public. [15:36] lol [15:36] TheMue: that was my thought [15:36] TheMue: Let's have a timeout there then.. [15:36] TheMue: Sounds totally reasonable.. [15:37] TheMue: WaitAlive(conn, path, timeout) [15:38] TheMue: The only thing I've been saying is that we have to decide what we want, and go with it. Having multiple functions that have no use case is no good.
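The shape finally agreed here, WaitAlive(conn, path, timeout), reduces to a select over the watch and an optional timer; a nil timer channel blocks forever, giving the wait-forever behaviour the CLI callers need. A self-contained sketch, where the bool watch channel stands in for whatever presence.AliveW actually returns (that signature is an assumption):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("presence: node still not alive after timeout")

// waitAlive blocks until the watch reports alive, or until timeout
// elapses; timeout <= 0 means wait forever, which is what juju ssh and
// debug-hooks want.
func waitAlive(alive bool, watch <-chan bool, timeout time.Duration) error {
	var expired <-chan time.Time // nil channel: blocks forever in select
	if timeout > 0 {
		expired = time.After(timeout)
	}
	for !alive {
		select {
		case alive = <-watch:
		case <-expired:
			return errTimeout
		}
	}
	return nil
}

func main() {
	watch := make(chan bool, 1)
	watch <- true // simulate the agent's pinger coming up
	fmt.Println(waitAlive(false, watch, 0)) // <nil>: waited forever, node alive
}
```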
[15:38] niemeyer: totally agree [15:39] * TheMue hugs niemeyer for the timeout variant ;) [15:40] :) [15:41] bcsaller1, ping [15:55] i wondered why environs tests were taking 45s to run locally - then realised that FindImageSpec does a http request on uec-images.ubuntu.com several times. perhaps i should allow tests to run connectionless somehow. [15:59] bcsaller1, whenever you're up... looks like a late night. i was checking out the subordinates [16:05] wrtp: Definitely [16:05] wrtp: The Python tests have a dump of the content locally to check that out [16:06] * niemeyer => lunch! [16:06] bcsaller1, there's a missing yield in process new relations around subordinate deploy.. and the other problem is that it looks like it ends up attempting nested containers [16:07] because unit deployer picks up the provider type local and tries to use containers, which we don't want for subordinates [16:10] allenap, ping [16:10] niemeyer: we also have a local dump of the context to allow that (it's used in to test FindImageSpec), but i'm not quite sure what the best way is to divert only the image requests. perhaps have a variable holding the address of the image server, and change it for local tests. seems a pity to pollute the real code for testing, but maybe worth it. [16:10] s/in to/to/ [16:14] bcsaller1, nevermind, it is picking up the right deployer [16:15] hazmat, niemeyer: SpamapS makes a good point on the lists about default-instance-type, default-image-id [16:16] hazmat, niemeyer: namely that changing them at this stage will justifiably piss people off [16:16] hazmat, niemeyer: and that we should probably allow them but print deprecation warnings :(( [16:16] hazmat, niemeyer: thoughts? [16:18] fwereade_, he does make a good point, deprecation warnings, and using them still sounds reasonable, but it seems less of a debt to see that folded into defaults for environment constraints rather than remaining as global overrides [16:18] hazmat, crap, good point, we *have* to break environments.yaml anyway [16:18] hazmat, so we should at least make sure to do it only once [16:19] hazmat, when can we expect that change? [16:19] fwereade_, not sure [16:20] i wish i could get out of my juju talk this evening [16:20] i'll push forward on the specs [16:20] fwereade_, btw thanks for the reviews [16:20] hazmat, cool, thanks [16:20] hazmat, a pleasure, I still haven't really figured out what I'm thinking re GC [16:21] hazmat, I have an incoherent draft email gathering dust :/ [16:21] fwereade_, for the most part it's just watching the topology and cleaning out non referenced things, along with recorded actions that have been completed [16:21] hazmat, yeah, but the devil is in the details [16:22] fwereade_, always :-) [16:22] hazmat, unit relations still referenced by other units need to survive until all other units have acked their departure [16:22] fwereade_, speaking of which if you have a moment, i was hoping to get a review of https://code.launchpad.net/~hazmat/juju/scheduler-peek-list/+merge/98104 from either you or niemeyer.. i basically rewrote the relation hook scheduler, it's much simpler and more robust now [16:23] hazmat, otherwise we'll get missing data in relation hooks on the other units [16:23] and much better tested [16:23] hazmat, cool, I'll take a look [16:24] fwereade_, yeah.. relations don't get cleaned up till they're not referenced in the topology, which basically happens when either service endpoint leaves the rel [16:24] fwereade_, thanks [16:33] bcsaller1, can a subordinate open a port on the container?
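The test seam rog proposes above ("a variable holding the address of the image server") might look like this; fetchImageList and the variable name are illustrative, and the canned response line is the one quoted earlier from uec-images.ubuntu.com.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// imagesBaseURL is the seam: production code uses the real host, while
// tests point it at a local server with canned data.
var imagesBaseURL = "http://uec-images.ubuntu.com"

func fetchImageList(path string) (string, error) {
	resp, err := http.Get(imagesBaseURL + path)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	return string(data), err
}

func main() {
	// In a test: serve a local dump regardless of path, no network needed.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "oneiric server release 20120222 ebs amd64 us-east-1 ami-beba68d7 hvm")
	}))
	defer srv.Close()
	imagesBaseURL = srv.URL

	out, _ := fetchImageList("/query/oneiric/server/released.current.txt")
	fmt.Print(out)
}
```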
[16:34] Hey, I was trying to look at how maas calculates unit addresses, but I don't see it listed in juju.unit.address.get_unit_address .. also.. that if/elif chain seems really wrong.. this method should be moved to the providers themselves. [16:35] hazmat, https://codereview.appspot.com/5841080 [16:37] hazmat: they return information from their container when asked about networking. There might need to be a change in expose but really it's the container service doing the work at that point. I hadn't considered whether the implications of that are confusing in status wrt expose [16:37] SpamapS, concur === bcsaller1 is now known as bcsaller [16:39] hazmat, re unit relations: surely "leave them until the service endpoint dies" is problematic... it means that the number of unit relations for a long-lived service with lots of occasional clients will just grow and grow forever [16:40] fwereade_, why? if the client leaves the relation is broken [16:41] either the relation exists between the services, or it doesn't if one side leaves it in which case it can be gc'd [16:42] hazmat, but it can't, can it? not until we know that the units on the other side have run the appropriate broken/departed hooks [16:42] hazmat, or possibly have been GCed themselves [16:42] hazmat, if we don't wait for that we'll have hooks falling over needlessly [16:43] hazmat, maybe it doesn't matter if we know they're going to die soon anyway [16:44] hazmat, but I don't like the idea that a hook could fail to get state that exists according to its view of the world (ie the output of relation-list) [16:47] hazmat, actually, wait, I'm going to WIP that and reinstate the default-image-id and default-instance-type stuff [16:48] right, train just arriving. [16:48] hazmat, SpamapS: wait again, I think we should talk about that [16:50] hazmat, SpamapS: actually, no, I think it's OK: we can just do a deprecation warning saying that using this field silently stomps over any constraints you may set at a later date [17:09] fwereade_, they'll need to remove their presence nodes in the rel [17:09] empty role containers are probably the threshold for gc [17:10] hazmat, something will need to but if an agent itself becomes unresponsive before removal then it won't be able to clean out its own presence nodes [17:10] hazmat, sorry not presence nodes [17:10] fwereade_, then its session ephemeral/pinger will expire naturally [17:11] hazmat, I misspoke, presence nodes are not relevant [17:11] hazmat, I meant settings nodes [17:11] hazmat, that's what other units may still be looking at for an arbitrarily long time [17:12] hazmat, we can't delete those until we know nobody will be looking at them any more [17:12] hazmat, a working agent can register disinterest itself, that's fine [17:12] fwereade_, the settings nodes are gc material not coordination.. the presence nodes are for coordination, if they're dead and the rel has been marked for removal.. [17:13] hazmat, a GC/machine agent that cleans up after a unit that won't die cleanly will have to clear those out itself [17:13] hazmat, the trouble is that a lack of presence doesn't *necessarily* imply that the unit agent won't come back, does it? [17:14] hazmat, ...but if it also doesn't exist in the topology it's a pretty safe bet [17:14] hazmat, hmm [17:14] hazmat, and it's easy if the rel has been marked for removal [17:14] fwereade_, two scenarios, clean exit from unit relation by running unit, stalled exit by badly behaving unit, the former is easy [17:14] TheMue: LGTM, thanks!
[17:15] hazmat, agreed [17:15] so on the latter what does it see when it eventually comes alive.. or is terminated [17:15] we could have clean stop kill the settings node [17:16] and then only gc the structure, and on bad unit, its broken rel context has the required local data [17:16] from its settings node [17:16] hazmat, but surely other units can be behind? ie still running a relation-changed hook that requires that unit's settings [17:17] hazmat, if the unit clears out its own nodes it still has to wait for other units [17:17] hazmat, regardless of where the functionality lies, *something* has to keep unit relation settings around until it's safe to delete them [17:17] fwereade_, true it has to wait for the stop of its execution [17:18] hazmat, no, *other* units' hook execution [17:18] er. hook [17:18] fwereade_, ick ;-) [17:18] hazmat, that's the problem [17:19] fwereade_, we can start with a much more pessimistic base version of gc [17:19] ie both endpoint services removed [17:19] and by removed, i mean the services are destroyed [17:19] hazmat, true enough, nobody said this has to be the final version [17:20] hazmat, yeah, service destruction is easy [17:20] hazmat, I'm much more comfortable ignoring potential hook errors when *everything* related is going down as well ;) [17:23] SpamapS, hazmat, niemeyer: this is interesting, default-ami doesn't work [17:24] SpamapS, hazmat, niemeyer: this implies to me that we can basically just drop it without warning [17:25] SpamapS, hazmat, niemeyer: the ec2 provider was looking for default-image-id instead, and therefore never finding it :/ [17:26] fwereade_: Nice, that makes things easier :) [17:27] niemeyer, ok, I'll just drop default-ami [17:27] niemeyer, is there a specific way I should be doing a deprecation warning? [17:27] niemeyer, I thought we had some already but I can't find them [17:29] fwereade_: default-ami is not valid. its always been default-image-id [17:32] fwereade_: If it doesn't work, there's no need to warn :) [17:34] SpamapS, are you 100% sure? it was default-ami in config back in july, and it's default-ami now [17:34] SpamapS, I suspect that in fact nobody has ever used it [17:34] fwereade_, people have definitely used it, but with default-image-id [17:35] fwereade_, that's whats documented [17:35] SpamapS, niemeyer: oh, wait: does config accept other random keys? if it does that would make sense [17:35] fwereade_, it does the validation is only against known keys [17:36] fwereade_: default-image-id was always a hack, though.. I'd just send a polite message to the list pointing that the hack is being replaced by the real implementation [17:36] fwereade_, SpamapS has a branch out there to fix it, but i think the notion was that it was going away... [17:36] the schema validation of it that is [17:36] hazmat, niemeyer, SpamapS: I see, thank you all [17:37] niemeyer, SpamapS's point is that dumping *any* enforced change on our users at this point is a bad thing [17:38] niemeyer, and I think he's right; and furthermore, the global env change will change environments.yaml as well [17:39] niemeyer, I don't think it's sensible to land this and then make everyone change *again* imminently [17:39] fwereade_: This isn't just "any change". This is a bug in juju that I've been pointing out since that flag was first introduced. 
[17:39] fwereade_: It *has* to be fixed, or juju won't work [17:40] niemeyer, yes: however much people hate it, we *must* land at least one environments.yaml change [17:41] Also, https://code.launchpad.net/~clint-fewbar/juju/remove-default-ami/+merge/71278 [17:41] niemeyer, however, dumping *multiple* changes, over a few days, at this stage in the cycle, is surely going to lead to calls for our heads on pikes etc [17:41] This was the fix to the schema, that never landed.. [17:42] niemeyer, the fix to the schema that was deferred until we had a replacement -- constraints -- which then sat in review purgatory for, what, 2 months [17:43] fwereade_: I don't have anything positive to say about that, I'm afraid.. [17:43] niemeyer, in hindsight I should clearly have been making a big stink about it myself [17:45] niemeyer, but, well, I didn't :(. and we have to deal with the situation as we find it in addition to figuring out how we can avoid blundering into it again in the future [17:46] niemeyer, I think that we have left it too late to break them, but we can discourage their usage by complaining every time they're used [17:48] niemeyer, however... if I apply this same argument to the global-settings change... I get worried [17:48] niemeyer, because I think it *does* still apply [17:49] fwereade_: default-image-id must die, now.. there's no way to implement support for multiple distributions while supporting it [17:50] niemeyer, with default-series as well, you get something close enough: you just have to make sure the series matches the image [17:51] niemeyer, if I could be sure nobody was using it I would 100% agree [17:51] fwereade_: If someone is depending on it, they'll need to tell us, and we'll need to find something else [17:51] fwereade_: default-image-id can't be supported as it is. [17:52] fwereade_: We can talk about how to avoid introducing changes late in the cycle, we can talk about how to prioritize tasks appropriately over a cycle, but these are unrelated to the simple fact this has to be fixed. [17:53] niemeyer, would it not be sensible to add a deprecation warning and kill it for 12.04.1? [17:54] niemeyer, which robbiew tells me is the actual target for constraints [17:54] fwereade_: It's not about constraints.. it's about series. If I deploy cs:~fwereade/precise/foobar [17:54] fwereade_: foobar is in precise [17:54] fwereade_: default-image-id is broken, and has always been. I was against supporting it at first, and now it's time to kill it and implement the real thing. [17:55] niemeyer, I have no argument against "it must be fixed", and "it must die" [17:55] fwereade_: At least you haven't brought them up so far. :) [17:55] fwereade_: I can buy arguments like "we need to allow people to customize image selection" [17:56] fwereade_: I can't take "let's continue supporting default-image-id" as it is, because it's a relevant bug to have it. [17:56] niemeyer, nah, I *really* don't want anyone doing that [17:57] niemeyer, I'm suggesting we leave it in with a suitable dire deprecation warning now, and actually remove it in the release 3 months down the line [17:57] fwereade_: default-image-id = "ami-for-oneiric", deploy cs:~fwereade/precise/foobar, BOOM!
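The point niemeyer is making rests on the charm URL already carrying the series, so a fixed default-image-id can contradict it. A toy parser for the two URL shapes that appear in this discussion, cs:series/name and cs:~user/series/name; the helper itself is illustrative, not juju's actual charm URL code.

```go
package main

import (
	"fmt"
	"strings"
)

// seriesFromCharmURL extracts the series from URLs like
// cs:oneiric/wordpress or cs:~fwereade/precise/foobar.
func seriesFromCharmURL(url string) (string, error) {
	rest := strings.TrimPrefix(url, "cs:")
	parts := strings.Split(rest, "/")
	switch len(parts) {
	case 2: // series/name
		return parts[0], nil
	case 3: // ~user/series/name
		return parts[1], nil
	}
	return "", fmt.Errorf("cannot parse charm URL %q", url)
}

func main() {
	s, _ := seriesFromCharmURL("cs:~fwereade/precise/foobar")
	fmt.Println(s) // precise: an oneiric default-image-id would conflict
}
```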
[17:57] niemeyer, yeah, and that's bad, no argument [17:57] well more than a deprecation warning [17:58] we can issue the error message with the solution, and just refuse to honor it [17:58] +1 [17:58] hazmat: so, a specific check during env parsing for those 2 keys, and immediate explosion saying "you must now use constraints" [17:59] fwereade_: That's my opinion, though. To be honest, this will bite hazmat and SpamapS more than me, so I'm leaving it up for them to decide what's the best approach. [17:59] fwereade_, i'd keep it context relevant, to deploy/add-unit [18:00] hazmat, hmm, ok [18:00] hmm.. and bootstrap [18:00] hazmat, tbh it still feels like too sudden and painful a change for anyone depending on it [18:00] hazmat: That's a brand new feature being developed. -1 on deciding/supporting that at this stage in the cycle. [18:01] fwereade_, i don't think people are depending on it.. it was mostly used as a hack for lack of arch support [18:01] hazmat, if we can be reasonably sure that's the case, then great [18:02] hazmat, in that case I'm happy saying "tough, use constraints" [18:02] fwereade_, default-instance-type is much more commonly used [18:03] There's a collection of opinions on this here: [18:03] https://bugs.launchpad.net/juju/+bug/830995 [18:03] hazmat, but I kinda feel it should happen at env-parsing time [18:03] "default-instance-type and default-image-id are inadequate" - 2011-08-22 [18:03] hazmat, if we're going to force immediate action then we should probably just force immediate action [18:04] "I currently use this in development and testing of formulas in different ubuntu versions" [18:04] -- Juan [18:04] "Often need stacks with different types of machines deployed for different services." [18:04] -- Mark [18:04] fwereade_, i'm just thinking about the existing environments, nothing to do unless you actually conflict [18:04] all of which are addressed by constraints [18:04] etc .. etc.. [18:05] Right.. they're all claiming for constraints. [18:05] And series selection [18:05] Which is done via charm URL [18:05] We're giving them what they want, not what they've asked for.. [18:05] fwereade_, but parse time is fine. [18:06] hazmat, OK [18:06] hazmat, now wrt to braking environments.yaml *twice* [18:06] breaking [18:06] hazmat, I really think we should avoid doing that [18:07] hazmat, so I think we have to coordinate it with the env-settings change [18:08] fwereade_, the env-settings change is mostly relating to storage and syncing.. with the removal of image-id, instance-type, and default-series.. what remains in env.yaml is probably the same [18:09] ie. it's a behavioral change not a structural config change [18:09] although one more env.yaml that comes to mind is apt-proxy [18:09] url [18:10] hazmat, IIRC niemeyer was keen that we remove non-access settings from env.yaml [18:10] fwereade_, apt-proxy-url is an access setting [18:11] in envs that need it, we can't even bootstrap correctly without [18:12] hazmat, huh, seems you're right, I was sure we had more stuff gunking it up [18:12] hazmat, I wasn't thinking of apt-proxy, I was thinking we had other non-access ones in there [18:13] hazmat, OK, so we can be sure this is the *only* breaking change we can know we'll need to introduce? [18:14] hmm..
[18:14] the subordinates increment the topology version, but there's a transparent migration there
[18:14] hazmat, apart from anything else the env change will change zookeeper storage, so people will have to bounce the whole env to upgrade
[18:15] hmm, actually that won't be meaningful unless the cluster is on the same code rev.. ie. it would still be problematic
[18:15] hazmat, (I think?)
[18:15] fwereade_, absent consistent code versions, yeah.
[18:16] we can migrate state, but unless everything deployed understands the newer version it doesn't really resolve anything
[18:20] hazmat, so, OK, we do have 2 inevitable breaking changes
[18:20] hazmat, I think it's very important that we reduce that to 1
[18:25] hazmat, opinion?
[18:27] SpamapS, hazmat, niemeyer: just to confirm, btw: default-instance-type should be treated in exactly the same way we do default-image-id; right?
[18:32] fwereade_, yes imo.
[18:34] SpamapS, negronjl: in light of your perspective, and beyond making sure there is only one of them: is there any way we can mitigate the impact of a necessary breakage?
[18:47] hazmat, I have been speaking to niemeyer: consider me at your disposal for the next couple of weeks
[18:47] hazmat, how can I have the most impact?
[18:48] hazmat, I *want* to do constraints-get and unit constraints as part of it, but I'm not sure they're our highest priority
[18:49] fwereade_, awesome!
[18:50] fwereade_, wrt constraints.. + env-constraints
[18:51] hazmat, yep, the environment change seems to me to be the most critical thing
[18:51] hazmat, once we have that we can add env constraints
[18:51] fwereade_, that sounds like a good place to start, env-constraints and settings changes
[18:51] we should land those soon
[18:52] hazmat, yep, landing them is the big one
[18:52] i'm going to work with bcsaller on subordinates, i've been testing them last night and today, i see a few problems that need resolving
[18:53] hazmat, ok, I will aim for a series of branches that culminate in env-constraints, along with complete removal of default-image-id and default-instance-type
[18:54] fwereade_, awesome, thanks
[18:54] hazmat, I think that I should push my current branch, with dii/dit printing loud deprecation warnings but continuing to work as before (and completely overriding constraints)
[18:55] hazmat, people on the bleeding edge thereby get a few days' warning, and they can start using constraints as soon as they cut the cord there
[18:55] fwereade_, that sounds reasonable, what's the default for the bootstrap node?
[18:56] oh.. it's going to honor them still
[18:56] hazmat, same as everything: effectively an m1.small
[18:56] fwereade_, for series?
[18:56] hazmat, it's actually quite easy to keep them working
[18:56] hazmat, IIRC it's taken from default-series
[18:56] hazmat, hmmm
[18:56] hazmat, that's not an access setting
[18:57] fwereade_, right.. but that's one of the things that's going away..
[18:57] hazmat, I know there was something
[18:57] hazmat, oh!
[18:57] and keep in mind osx clients can't always take the series from the current host
[18:57] we'll need to pass it on the cli for bootstrap, i think
[18:57] hazmat, what are we going to use instead?
[18:57] or use the latest current
[18:57] hazmat, required?
[18:58] hazmat, env constraints will need to be settable at bootstrap time
[18:59] hazmat, as will default series
[18:59] fwereade_, i'd suggest required in the absence of host inspection, but that still feels implicit
[18:59] SpamapS, any thoughts wrt bootstrap node release series specification?
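A sketch of the transitional behaviour fwereade_ describes for his branch, under stated assumptions: the function and message wording are invented, and the dii/dit keys (default-image-id, default-instance-type, plus default-series) keep working and trump constraints for now, each use printing a loud warning.

```python
# Illustrative only, not fwereade_'s actual branch: legacy default-*
# keys continue to work, completely override constraints, and warn on
# every use ahead of their removal.

import sys

DEPRECATED = {
    "default-image-id": "an image constraint",
    "default-instance-type": "an instance-type constraint",
    "default-series": "a bootstrap-time series option",
}

def effective_machine_spec(env_config, constraints):
    """Merge constraints with legacy defaults; legacy wins, with a warning."""
    spec = dict(constraints)
    for key, replacement in DEPRECATED.items():
        if key in env_config:
            print("WARNING: %s is deprecated, is incompatible with "
                  "constraints, and will be REMOVED; use %s instead"
                  % (key, replacement), file=sys.stderr)
            spec[key.replace("default-", "")] = env_config[key]
    return spec
```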
[19:00] hazmat, I kinda feel we should default to the latest released series
[19:00] fwereade_, custom introspection off the cloud-images url?
[19:00] hazmat, *however*, whatever we do, moving default-series is *another* env.yaml change
[19:01] fwereade_, yes, and it should be part of the warnings
[19:01] hazmat, yep, I'll add that now
[19:02] hazmat, so, for now: warnings on all three default-* keys, behaviour unchanged
[19:02] hazmat, blast, I have to go
[19:02] hazmat, I'll try to be on later
[19:05] fwereade_, cheers, i'm out in 2hrs, back in 7hr
[19:24] hazmat: Just submitted a review for https://codereview.appspot.com/5849054/
[19:24] hazmat: Good stuff.. just a few details
[19:24] hazmat: Let me know if you want to sync up on any of the points
[19:24] fwereade_: This may be relevant to you as well ^
[19:26] niemeyer, thanks
[19:26] i've switched tracks to pulling together a presentation
[19:42] hazmat: was at lunch.. just read your question..
[19:43] IMO, all those default-* settings that are being obsoleted have to stay for a little while, and where they contradict the new way, warnings should be emitted.
[19:44] fwereade_: we should probably rename relation-list to relation-units in the future (keeping the former for compatibility)
[19:44] If somebody has asked for a default-image-id, we should honor that request, and warn them that we can't guarantee we're deploying the right charm on that image.
[19:44] fwereade_: We'll likely add something like relation-ids, which will make relation-list ambiguous, and potentially error-prone
[19:44] If somebody has a default-series of oneiric, so be it. I see no reason to *remove* that setting.
[19:45] And for default-instance-type.. same deal.. leave it be and let it override the hard-coded 'm1.small' as a default.. but let constraints override both of those.
[19:45] SpamapS: It will be removed, because the behavior you just described is a bug
[19:45] SpamapS: It doesn't have to be removed now, if you and hazmat agree that's the best approach
[19:45] SpamapS: But the behavior you just described is a bug, which has to be fixed
[19:46] I'm not asking you to keep cruft around forever. Just take on the pattern of deprecation before removal so people have time to adapt.
[19:47] jimbaker: https://codereview.appspot.com/5837050/ reviewed
[19:47] SpamapS, so we can keep but deprecate instance-type, but image-id and series are more fundamentally problematic, because they're just broken.. it's not going to break an existing environment though, we're just forcing folks to update the config file.
[19:47] niemeyer, cool
[19:48] niemeyer: if the documentation says "this is deprecated, don't use it", and it warns and says "this is deprecated, don't use it", and the software still does what the user asked.. that's a reasonable method to fix a very sticky bug.
[19:49] hazmat: *that* is breaking an automated environment.
[19:49] SpamapS: That's the approach I'd use for a silly option that I wanted to deprecate. I'd not do that on an option that creates a bug for scenarios I care about.
[19:49] SpamapS: Again, that's just my opinion, though.
[19:49] SpamapS, forcing any user interaction, even keeping existing apps and services running, is breaking?
[19:50] SpamapS: I'm happy to have you and hazmat decide how to handle the removal
[19:50] hazmat: at this point, if you error out because of options that were ok before.. you are breaking automated systems.
[19:50] And at *least* deprecate the options in the documentation first.
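A minimal sketch of the precedence SpamapS proposes above (the function name is invented): an explicit constraint beats default-instance-type, which in turn beats the hard-coded m1.small fallback. Note this is the opposite override order to the transitional branch fwereade_ describes later, where the legacy keys win.

```python
# Illustrative precedence, per SpamapS's suggestion:
# constraints > default-instance-type > hard-coded m1.small.

HARD_CODED_DEFAULT = "m1.small"

def instance_type(env_config, constraints):
    if "instance-type" in constraints:            # explicit constraint wins
        return constraints["instance-type"]
    if "default-instance-type" in env_config:     # deprecated, but honored
        return env_config["default-instance-type"]
    return HARD_CODED_DEFAULT

assert instance_type({}, {}) == "m1.small"
assert instance_type({"default-instance-type": "m1.large"}, {}) == "m1.large"
assert instance_type({"default-instance-type": "m1.large"},
                     {"instance-type": "c1.xlarge"}) == "c1.xlarge"
```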
[19:52] SpamapS, done
[19:52] I'm probably overreacting. But I've been through this 4 or 5 times now.. backward-incompatible breaks have been coming at a pretty steady rate.. and at this point, the impact causes ripples through quite a few people's workflows
[20:01] niemeyer, thanks for the review. it does seem to me that relation-ids should take an --interface option, as i note in my response in the mp
[20:02] otherwise, this looks like we are in consensus on this
[20:03] jimbaker: Do you have a real use case where that's relevant?
[20:03] niemeyer, i mention it in the accompanying keystone example
[20:03] jimbaker: That example seems wrong to me
[20:04] jimbaker: You're blindly iterating over a list of relation ids, without any mention of what these ids are actually associated with
[20:04] jimbaker: An interface defines a protocol
[20:04] jimbaker: There's generally little reason to list things based on protocol
[20:04] jimbaker: You'll generally want to know "where's my cache mongodb"
[20:05] jimbaker: Not "where's any mongodb"
[20:07] niemeyer, sounds good to me
[20:08] niemeyer, i will further update the proposal, especially the example, with this in mind
[20:34] SpamapS, hazmat: not sure if I can get back on properly today; but the first branch I propose will include: constraints updates; functioning default-* keys which override all constraints (so those who didn't specify them can use constraints straight off, but those who did don't have to deal with a *really* sudden behaviour change); and dire deprecation warnings of constraints incompatibility and imminent removal for all 3 default-* keys
[20:35] SpamapS, hazmat: I think this will be safe to land quickly and is a necessary first step
[20:35] SpamapS, hazmat: so ppa users who don't keep up with the lists get *some* warning
[20:35] SpamapS, hazmat: if you foresee any problems with the above, please let me know :)
[20:36] fwereade_: sounds very reasonable
[20:37] SpamapS, cool, thanks
[20:37] SpamapS, I'm sorry about all this :(
[20:38] SpamapS, it also crosses my mind that nobody will actually be able to start using constraints without restarting their env
[20:38] fwereade_: don't be! I appreciate the volume and quality of change, and I know it's not possible to cover every case every time.
[20:38] fwereade_: that's just the nature of the beast until we have an 'upgrade-environment' command
[20:39] SpamapS, but we're cutting it fine with this one, and realistically we will have at least one more release which an environment cannot live through usefully
[20:39] SpamapS, yeah, exactly, my feeling is that that's a critical for the next release
[20:40] SpamapS, shame we couldn't get to it this cycle, but so it goes
[20:41] SpamapS, once we have an upgrade-environment mechanism it will at least be *possible* to have non-breaking releases
[20:44] fwereade_: breaking releases are different than not having a feature available to you in an existing environment, though.
[20:45] SpamapS, only slightly different, it's not like we can replace the PA
[20:46] SpamapS, and, blast, I have thought of a possible breakage: the client (which does get updated) could try to get constraints info for a state that was never created with it
[20:47] SpamapS, not hard to fix, but not optional either
[20:47] SpamapS, anyway, sorry, gtg again
[20:48] fwereade_: cheers!
[21:47] Time for some outsiding..
[22:19] if anyone's around, do you recall why default-series is optional for some providers but not apparently all?
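A sketch of the hook-usage pattern niemeyer argues for in the relation-ids discussion above, with the caveat that the tool is still a proposal at this point: a charm asks for the ids of a *named* relation ("where's my cache mongodb"), rather than iterating every relation that happens to speak a given interface. The relation names "cache" and the exact CLI shape are assumptions for illustration.

```python
# Illustrative hook snippet: address relations by name, not interface.
# Assumes a relation-ids tool taking a relation name, and the existing
# relation-list tool accepting -r <relation-id>.

import subprocess

def relation_ids(relation_name):
    """Return the ids of all relations established under a given name."""
    out = subprocess.check_output(["relation-ids", relation_name])
    return out.decode().split()

# A charm with separate "cache" and "storage" mongodb relations can
# target each role explicitly, instead of blindly listing every
# relation implementing the mongodb interface:
for rid in relation_ids("cache"):
    units = subprocess.check_output(
        ["relation-list", "-r", rid]).decode().split()
    print("cache relation %s has units: %s" % (rid, units))
```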