[00:02] SpamapS: merge done [00:02] resubmitted [00:03] * nijaba off to see morpheus [00:04] nijaba: merging now, THANKS! [00:05] SpamapS: my pleasure [00:05] * nijaba really having fun [00:13] <_mup_> juju/ssh-known_hosts r427 committed by jim.baker@canonical.com [00:13] <_mup_> Initial commit [00:22] SpamapS: was wondering why having a readme is not mandatory for charms. Aren't they a bit dry without built-in docs? Shouldn't juju offer a "man" option to access documentation about charms? [00:23] nijaba: I've thought about exactly that, having a 'charm info xxx' command that intelligently looks for README* readme* and cats them together into less would be cool. :) [00:23] s/less/$PAGER/ [00:24] nijaba: I think they also need a maintainer: field in metadata.yaml [00:24] SpamapS: this last bit was acked a few days ago, IIRC [00:24] SpamapS: I'll open a bug :) [00:35] SpamapS: hey clint, should i push this lonely precise branch (lp:~gandelman-a/charm/precise/rabbitmq-server/900440) directly to lp:charm/precise/rabbitmq-server since there's nowhere to file a proper merge proposal? [00:40] adam_g: how about you push the oneiric branch into precise, then do a MP against that? [00:40] <_mup_> juju/ssh-known_hosts r428 committed by jim.baker@canonical.com [00:40] <_mup_> Support machine recycling [00:40] <_mup_> Bug #901017 was filed: Juju should have a "info" or "man" option < https://launchpad.net/bugs/901017 > [00:41] SpamapS: ahh [00:45] SpamapS: hmm. no dice pushing the current lp:charm/rabbitmq-server up to lp:charm/precise/rabbitmq-server http://paste.ubuntu.com/762302/ [00:46] adam_g: right.. I guess I have to do the "initialize" step before we can push to that series. :( [00:46] adam_g: can you at least push it to ~gandelman-a/charm/precise/... ? [00:47] SpamapS: yeah, already there: lp:~gandelman-a/charm/precise/rabbitmq-server/900440 [00:48] adam_g: so you can probably push the oneiric one to lp:~charmers/..... [00:48] adam_g: then do the MP [00:48] adam_g: meanwhile I'll try to figure out how we initialize the series [00:49] that might work [00:53] i could push to lp:~charmers/charm/precise/rabbitmq-server/trunk, suppose i'll propose against that... [00:54] yeah that will work [02:26] <_mup_> juju/upgrade-config-defaults r429 committed by kapil.thangavelu@canonical.com [02:26] <_mup_> use lazy computation of default values instead of recording them to config state [02:32] <_mup_> juju/upgrade-config-defaults r430 committed by kapil.thangavelu@canonical.com [02:32] <_mup_> config value validation no longer returns defaults [02:35] <_mup_> juju/upgrade-config-defaults r431 committed by kapil.thangavelu@canonical.com [02:35] <_mup_> no longer explicitly touch defaults in upgrade, the lazy computation suffices. [02:47] <_mup_> Bug #901043 was filed: switch charm subcommand to change origin of charm and upgrade < https://launchpad.net/bugs/901043 > [02:51] SpamapS, is bug 900517 different than the upgrade config defaults issue? [02:51] <_mup_> Bug #900517: config-get on an int set to 0 does not return '0' but an empty string < https://launchpad.net/bugs/900517 > [02:52] * SpamapS reads [02:52] hazmat: it's entirely possible that this was actually the same effect.
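[Aside: the likely mechanism behind bug 900517 above is the classic Python pitfall of testing a config value for truthiness instead of comparing it against None, so an int explicitly set to 0 looks "unset". A minimal, hypothetical sketch of the difference -- the function and option names below are illustrative, not juju's actual code:]

```python
# Hypothetical illustration only: why an int option set to 0 can come back
# as an empty string when code checks truthiness instead of None.

def render_value_buggy(settings, name, default=None):
    value = settings.get(name, default)
    # BUG: 0, 0.0, False and "" all fail this test, so a value that was
    # explicitly set to 0 is rendered exactly like an unset one.
    if value:
        return str(value)
    return ""

def render_value_fixed(settings, name, default=None):
    value = settings.get(name, default)
    # Only a genuinely unset value should render as empty.
    if value is None:
        return ""
    return str(value)

settings = {"worker-count": 0}
print render_value_buggy(settings, "worker-count")   # prints an empty line
print render_value_fixed(settings, "worker-count")   # prints 0
```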
[02:52] hazmat: easy to test that hypothesis [02:53] * hazmat does a UTSL [02:55] SpamapS, i still haven't managed to reproduce bug 861928, i suspect its timing dependent, if you do manage to reproduce, it would be helpful to attach the entire provisioning agent log [02:55] <_mup_> Bug #861928: provisioning agent gets confused when machines are terminated < https://launchpad.net/bugs/861928 > [02:58] hazmat: interesting [02:58] hazmat: you know.. kees was experiencing it on the oneiric version (r398) .. its possible that its been fixed inadvertently with some of the ZK / API fixes [02:59] SpamapS, yeah.. jimbaker fixed another provisioning agent bug post oneiric afaicr [02:59] have we broken backward compatibility with r398 at all? I have half a mind to propose that we just put r427 in oneiric-updates [03:00] features be damned. ;) [03:00] only problem is.. we can't actually upgrade deployed environments [03:01] SpamapS, i doubt that's an issue in practice [03:01] that we can't upgrade the provisioning agent? [03:01] SpamapS, that their are long lived juju environments extant [03:01] kees had one very long lived for doing sbuild fanout [03:01] SpamapS, but fair enough [03:02] SpamapS, he shut it down though.. 45usd spend [03:02] until it stopped working [03:02] Anyway, I agree, nobody should have a long lived 11.10 juju cluster. :) [03:03] would be good to come up with an upgrade story for 12.04's juju [03:03] if william finishes the upstart job stuff.. we can at least put in the packages to stop/start the agents on upgrade [03:05] indeed that will be key, we can probably do some of dance around that, but the biggest question mark on the upgrade story, is just coordinating a code drop/rev across a cluster of different release series [03:05] ideally just a binary drop.. [03:12] Which is why I think we're going to eventually have to host juju packages on the juju service nodes [03:12] Otherwise precise won't be able to play in a "Q" managed cluster [03:14] should be fairly easy... each juju package just needs to include a script which builds itself for every series you want to support [03:15] and of course, we have to build a test suite which makes sure that actually works ;) [03:38] SpamapS, re the config set to 0, afaics its not an issue [03:39] hmm.. maybe it is [03:39] something sounds familiar [03:41] I was thinking it might be an issue where the value might not be carefully checked for None [04:49] <_mup_> juju/sshclient-refactor r428 committed by kapil.thangavelu@canonical.com [04:49] <_mup_> refactor the sshclient (zk over ssh tunnel) [04:55] <_mup_> juju/sshclient-refactor r429 committed by kapil.thangavelu@canonical.com [04:55] <_mup_> increase the default timeout [04:55] <_mup_> juju/sshclient-refactor r430 committed by kapil.thangavelu@canonical.com [04:55] <_mup_> robust zk conn [04:59] * SpamapS cheers hazmat on [05:06] * hazmat falls asleep [07:41] mornin' [08:19] moo rog [08:21] TheMue: yo [10:32] rog: is the example in the example dir of zookeeper working for you (that is, once you've replaced Init with Dial and fixed the err.String() calls)? [10:32] mpl: i'll try it [10:34] rog: here I have two problems with it. 1) it doesn't return as it should if I don't have any zookeeper server running. 2) I get loads of error messages for error, coming apparently from this point: event := <-session (it doesn't get past there apparently). 
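[Aside: the gozk behaviour rog describes below -- Dial almost never failing because the connection is established asynchronously, success only arriving later on the session event channel, and the client library logging noisily by default -- has a direct counterpart in the Python zkpython bindings that juju itself builds on. A rough sketch under those assumptions; the server address is a placeholder and error handling is minimal:]

```python
# Rough zkpython equivalent of the Go example being discussed: the connect
# call returns immediately, and success or failure only shows up later as a
# session event delivered to the watcher callback.
import threading
import zookeeper

zookeeper.set_debug_level(zookeeper.LOG_LEVEL_ERROR)   # quiet the default logging

connected = threading.Event()

def watcher(handle, event_type, state, path):
    # Connection state changes arrive asynchronously as session events.
    if event_type == zookeeper.SESSION_EVENT and state == zookeeper.CONNECTED_STATE:
        connected.set()

# The timeout here is the requested session timeout in milliseconds; gozk's
# Dial takes nanoseconds, hence the 5e9-vs-5000 confusion below.
handle = zookeeper.init("127.0.0.1:2181", watcher, 5000)

connected.wait(10)
if not connected.is_set():
    raise RuntimeError("no ZooKeeper connection after 10 seconds")

print zookeeper.get_children(handle, "/")
zookeeper.close(handle)
```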
[10:34] s/for error// [10:38] mpl: yeah, me too - loads of time out errors [10:38] mpl: i think the timeout must be wrong [10:39] mpl: yeah, the timeout should be 5e9 not 5000 [10:40] mpl: BTW i'm not sure what it should do if there's no zk server running [10:40] mpl: here's my updated version: http://paste.ubuntu.com/762565/ [10:40] rog: well, I don't know what it should do, but err should be != nil when Dial fails, and it seems it's not the case for me. [10:41] mpl: i'm not sure that Dial can ever fail [10:41] oh [10:41] how come? [10:41] mpl: because the connection itself is asynchronous [10:41] ah yes [10:41] good point, thx [10:41] so that err check is pretty moot [10:41] mpl: i think that's wrong, and gustavo and i have talked about changing it in the past, but the changes haven't been made yet [10:42] ok, another thing I don't get, why do I get tons of messages and not just one? that chan read is not in a loop. [10:45] mpl: looking at zk C source, it looks like the only way it can return an error is if the hosts arg is malformed [10:45] mpl: the messages are printed by the zk client code [10:46] mpl: (logging is turned on by default, which i think is wrong too) [10:49] rog: you mean they come from underlying calls of Dial? [10:49] mpl: yeah - they come from within the C API [10:49] rog: and not in any case as a result of this: "event := <-session" ? [10:50] mpl: indeed - that blocks until the connection is made. i don't know if zk ever decides that it can't connect. [10:50] rog: ok, that's reassuring then, thx. [10:50] mpl: you can turn the debugging messages off [10:52] ah cool, it finally worked. [10:52] mpl: zookeeper.SetLogLevel(0) [10:53] good to know, thx. [10:54] rog: ok, I'll elaborate from that example to play with ssh. [11:48] mpl: sounds good [12:30] re [12:45] g'monring [12:55] moo hazmat [13:14] for documentation purposes: are there some special bazaar configuration settings for juju? [13:35] TheMue: not as far as i know [13:35] fine, makes it easier [13:36] I'm working on a "Getting Started" [13:38] hazmat, is there some reason you know of for the particular shape of the code around CharmUpgradeOperation? [13:40] hazmat, because the workflow is perfectly capable of synchronising the state if we make the charm upgrade much more like a normal transition, but it's much hairier if there's a reason *not* to do it as a normal transition [13:52] fwereade, not sure what you mean [13:52] fwereade, you mean push more of the operation out of the watch callback and into the transition? [13:54] hazmat, that everything done CharmUpgradeOperation ought IMO to be done on the lifecycle, like the other things that happen as part of of a state transition [13:55] hazmat, and if we do that we can easily just call "self.workflow.synchronize(executor)" in place of the boolean tangle in the original MP [13:55] fwereade, hmm. so my thought there its not something that is manageable completely internal to the lifecycle, it depends on external mutable persistent settings, which is very different then anything else in the lifecycle [13:56] hazmat, on the service's charm id? [13:56] ie. you can't just call lifecycle.upgrade() and expect it to work, the external state needed to be put in place first.. where as you can call any of the other lifecycle methods [13:56] fwereade, on the upgrade flag [13:56] hazmat, hmm, hadn't had that perspective [13:57] fwereade, i thought the plan was not to do anything on upgrade_error [13:57] fwereade, how does this issue arrise? 
[13:58] hazmat, you recall the plan to make the workflow know how to set up the lifecycle and executor to match the current state [13:59] hazmat, to do so, we need to be able to detect the errors which occur while the executor is paused, so we can restore it correctly [13:59] fwereade, i thought we'd moved on to its an easy thing to distinguish in the upgrade transition, and we'll be dealing with disconnected op sync anyways, so exact match isn't nesc (queueing in the background) [14:00] fwereade, the error from the executor is paused is noted in the state [14:00] hazmat, how is it noted? [14:00] hazmat, we don't even try to fire a transition until some time after we've stopped the executor [14:00] fwereade, although juju could probably use a more robust setup there from pause, to enclose the rest in a try/except block [14:02] fwereade, so from pause to transition, its set a zk value, and extract a charm to disk [14:02] if the transition/hook fails we'll get into a recorded error state [14:02] hazmat, and if anything goes wrong during the extract or the zookeeper set, we'll be in a weird state [14:03] fwereade, a try/except around the others can manually fire transition to an error state [14:03] on error [14:03] fwereade, its an odd scenario regardless if we have a half extracted charm on disk [14:04] * hazmat ponders [14:04] hazmat, agreed, but I don't think we can guarantee that that will *never* happen [14:10] fwereade, agreed, although we can do a better job of minimizing, but its not clear that encompassing more to the error state, is helpful wrt to retry, the coordination state is gone on retry [14:11] the flag is cleared, and we don't know that we can safely execute the upgrade hook again, because we don't know the state on disk or zk of the charm [14:12] and if we renter the entire ugprade operation, we don't have the coordination state to trigger any changes, and it will early exit [14:12] hazmat, isn't it just down to the order of operations? [14:12] perhaps [14:13] hazmat, if we extract, then set in ZK, then fire the hook [14:13] fwereade, i don't see how that helps, the flag is cleared [14:14] fwereade, and you can't set the flag in an error state [14:14] fwereade, your right though, an error here should be recorded as a charm upgrade error [14:16] hazmat, because we can know by the unit charm id whether or not the extraction of the latest charm has completed; if it has we can move straight on to firing the hooks(or not) according to the "resolved" command [14:16] hazmat, if the charm ids don;t match, we start the operation from scratch [14:16] hazmat, (when we retry) [14:17] fwereade, so right now error states always refer to hook errors.. [14:18] hazmat, from the POV of the workflow state, which represents what the unit is actually doing, I feel that "half-extracted charm that's 100% broken" should absolutely represent an error [14:20] fwereade, it definitely should, i'm just trying to work through the implications of changing the meaning of an error state, what retry means in this context, and changing the interactions/responsibilities of lifecycle compared to any extant uses. [14:20] fwereade, there's a notion that upgrades flags shouldn't survive restarts, which is one reason why we cleared the flag early [14:21] i'm trying to recall if there was more to it that [14:21] * SpamapS stretches and yawns [14:22] fwereade, so when would the upgrade flag get cleared? 
[14:22] hazmat, my idea is that we clear the upgrade flag as soon as we see it, but we kick off an upgrade_charm transition, which is "started"->"started" [14:23] hazmat, if we're not in a started state we just bail before we even try the transition [14:23] * hazmat nods [14:23] hazmat, the lifecycle.upgrade_charm will do the early parts before stopping the hooks and quietly bail out on errors, equivalently to now [14:24] hazmat, but once we hit the stop-hooks-start-messing-with-disk-state point, any subsequent errors should come out and be detected as transition failures [14:24] fwereade, how do you re-enter the upgrade charm state? [14:24] error state that is [14:25] on a process restart [14:25] hazmat: just got an email from fernanda...TZ mixup? [14:25] hazmat, it's just an existing workflow state, I'm already in that state when I come up [14:26] robbiew, doh.. indeed that is tz mixup, i thought it was +1 hr [14:26] fwereade, but the process mem state is different [14:26] fwereade, ah.. so the executor is still stopped [14:27] because we never started the lifecycle, and we're not listening to any rel lifecycles [14:27] hazmat, lifecycle.running and executor.running are not especially closely related [14:28] fwereade, yup.. so if we restart in a charm upgrade error state.. the lifecycle is stopped, the exec is running, but nothing feeding into it [14:29] hazmat, the executor needs to be stopped during upgrade error states [14:29] hazmat, all the rest of the time it's fine [14:29] fwereade, how does it get stopped on restart [14:30] hazmat, we just don't start it explicitly, we let the workflow do so if it's in a state which needs it [14:31] fwereade, and how is it any different than the lifecycle just being stopped [14:31] hazmat, so it's just "self.workflow.synchronize(self.executor)" and then we're in the state we must have been in when we left off last time [14:33] hazmat, from outside perspective no different, I guess -- no hooks are executing -- but... well, why exactly are we explicitly stopping the executor when we could just stop the lifecycle like we do with, say, configure? [14:34] hazmat, ...only just thought of that :/ [14:34] robbiew, just rescheduled for 20m from now [14:34] hazmat: cool [14:37] fwereade, because the ability to run a hook now (ahead of any queued hooks) has a safety notion that the executor is stopped, in part to guarantee that there are no other currently executing hooks [14:38] fwereade, i need to switch tracks for a little bit, but i'll definitely ponder this some more [14:38] hazmat, isn't the reason that the unit relation lifecycles' schedulers could still be busily executing queued hooks at any stage? [14:39] fwereade, not sure if you saw this.. because the ability to run a hook now (ahead of any queued hooks) has a safety notion that the executor is stopped, in part to guarantee that there are no other currently executing hooks [14:39] hazmat, exactly so [14:40] fwereade, i need to switch tracks for a little bit, but i'll definitely ponder this some more.. 
part of the issue though on either an extract failure or a state change failure, is that its signals a signficant problem [14:41] hazmat, I think it comes down to my conviction that we're better off restoring process state on startup -- which state can be encapsulated in 2 bools -- than we are by complicating the logic we run all the time [14:41] hazmat, ok, ttyl -- ping me to continue when you're free :) [14:42] fwereade_, isn't restoring the state as simple as is -> if not self.running: self.lifecycle.start, else self.executor.start() [14:42] hazmat, well, "started" implies both running, but yeah, it's not complicated [14:42] hazmat, you seemed at one stage to be arguing against it [14:48] fwereade_, actually i was hoping for that since it was the simplest thing, but the notion that upgrade error should encapsulate non hook errors has some merit [14:49] fwereade_, definitely worth exploring, and i think a good track [14:49] hazmat, I think it is the simplest thing [15:33] http://www.ustream.tv/channel/vclug-venturaphp .. me.. talking about juju to a local LUG ... unfortunately, the demo failed because I had a lucid AMI in my environments.yaml [15:33] totally forgot that I had been monkeying around with the AMI. :-P [15:33] Pretty much flies off the rails at 22:00 [15:33] * SpamapS goes off to get the family out so he can get work done. [16:12] fwereade, connectivity problems? [16:13] hazmat, yeah, sorry about that, didn't actually notice it happening until just now [16:13] fwereade, no worries [16:21] * kees waves "hi" [16:22] so, I discussed some of the trouble I had with the provision here last sunday. not the best time for catching people, i realize. [16:22] *provisioner [16:22] SpamapS pointed me to where cloud-init does it's work, but ultimately I wasn't able to get the provisioner back on its feet. [16:23] hazmat: what's the best way for me to help debug the troubles I ran into? [16:25] kees: using the PPA version would go a long way to figuring out if this is already fixed or not.. which I suspect it may have been [16:25] kees: we still need to make the agents more robust and restartable, which fwereade is working on right now.. but I think some of the ZK stuff has been fixed since 11.10 released [16:26] SpamapS: how do I find AMIs with the PPA version built-in? [16:26] SpamapS: and why not SRU these fixes to Oneiric? [16:26] kees: you don't need an AMI.. you just add 'juju-origin: ppa' to your environment settings [16:27] kees: Its hard to isolate the fixes because there have been massive changes. [16:28] SpamapS: hrm, let me try... [16:28] kees: also if your client version is from the PPA, it will automatically deploy with the PPA [16:29] * SpamapS curses himself for forgetting to run the test suite before commit to trunk.. https://launchpadlibrarian.net/86855907/buildlog_ubuntu-precise-i386.juju_0.5%2Bbzr428-1juju2~precise1_FAILEDTOBUILD.txt.gz [16:29] SpamapS: if I just set "juju-origin: ppa", is that sufficient, or do I need to also install juju from the PPA? [16:29] * SpamapS puts on the cowboy hat [16:30] kees, that's sufficient [16:30] Has the ZK schema bumped since r398? [16:30] SpamapS, no [16:30] * kees attempts a bootstrap... [16:30] SpamapS, there's been some minor additions, but no changes to the cli interactions [16:31] one of the really goofy bugs I ran into was that --environment seemed to be ignored by a lot of commands [16:32] kees, that's odd just about every command takes that option [16:32] kees, it has to be specified after the sub command. [16:32] i.e. 
I tried to do juju bootstrap --environment sample2 after my "sample" environment's provisioner freaked out. [16:32] and then juju status --environment sample2 always failed. [16:32] then I destroyed sample2, and then juju status couldn't find sample any more [16:33] so I had to hard-code the instance list in the source to get control back. [16:33] kees: did they both have the same control-bucket ? [16:33] what is a control-bucket? :) [16:34] kees: the thing that uniquely identifies an environment in the provider... [16:34] kees, its an s3 bucket that's spec'd in environments.yaml.. its env specific [16:34] it gets autogenerated the first time around, but it can't be copied between multiple environments, without causing issues [16:34] ah, I see that now. does that get added automatically? I don't remember adding that or admin-secret [16:35] yeah, that would totally be what happened then [16:35] I just copied the entire "sample" section and changed the name. [16:35] hmm.. we should probably warn/error if we see that come up [16:35] heh, d'oh. [16:35] hazmat: yeah control-bucket should have the env name in it.. so we should be able to error out.. "control bucket foo has env name X not Y" [16:35] seems like that should be stored somewhere else instead of injected into environment.yaml [16:36] okay, well, that explains that glitch at least. :) [16:36] kees: its used by clients to find the ZK server, so it has to be in environments.yaml [16:37] Tho one thing that would work is to change it to control-bucket-prefix: .. and by default just prepend that to the env name. [16:37] SpamapS, that would create an implicit fail scenario around changing an env name [16:37] it might be nice to have the finding of the master instance show up in --verbose (i.e. the processing of the ec2 instance list, etc) [16:37] although for local provider it already is [16:38] since we use the env name on disk [16:38] I spent a lot of time trying to figure out how juju was deciding which was a master instance when I broke it with sample2. [16:38] kees, its always machine 0 atm [16:38] hazmat: err, env name can't be changed AFAICT, its used for so many things... ec2 group names for one. [16:38] SpamapS, ugh.. good point [16:39] hazmat: I mean the stuff before "Connecting to environment". [16:39] SpamapS, that sounds quite sensible then.. along with a nice warning in the doc about it [16:39] hazmat: when I bootstrapped using the same control-bucket, suddenly juju would only talk to the new instance [16:39] could we derive the control bucket name by combining the env name and the access id in some way? [16:40] thus removing the need for a user to invent another name [16:40] https://juju.ubuntu.com/docs/getting-started.html#configuring-your-environment <- this could add some details about what the control bucket is. [16:41] <_mup_> Bug #901311 was filed: automatically prefix control bucket with the environment name < https://launchpad.net/bugs/901311 > [16:42] hazmat: could that work? [16:42] <_mup_> juju/ssh-known_hosts r429 committed by jim.baker@canonical.com [16:42] <_mup_> Merged trunk [16:42] rog: I'm a little hesitant to make use of the access key id in any permanent context [16:43] robbiew, not sure we want to include access id its tied to an external/provider notion [16:43] e.g. envname + salt + hash(salt+accessid) [16:43] rog: they can be created and discarded quite often [16:43] rog, & [16:43] is there documentation on the potential contents of environment.yaml? [16:43] rog, take orchestra for example.. 
what's an access id.. or local provider, its a provider specific notion [16:43] e.g. how would I discover "juju-origin: ppa" otherwise? [16:44] hazmat: does orchestra have a control-bucket field? [16:44] kees, https://juju.ubuntu.com/docs/provider-configuration-ec2.html?highlight=origin [16:44] rog, doh.. good point [16:44] hazmat: ah-ha! thanks. I knew I'd found that before at some point. [16:44] * SpamapS goes OTP [16:44] hazmat: maybe link to that from https://juju.ubuntu.com/docs/getting-started.html#configuring-your-environment ? [16:45] rog, the other thing with access id, is it assumes the identity is shared across all users of the env [16:45] rog, which is true/required atm for bootstrap/destroy-environment [16:45] what about making "juju-origin" be "PPA" by default, since that should always be the latest/greatest? that could be SRUed to oneiric. [16:45] hazmat: that's true. [16:46] hazmat: but it might be a useful default [16:46] hazmat: if there's no entry for control-bucket, for example [16:46] rog, maybe not though.. they need access to bucket, which we have setup as private by default atm.. i just want to leave options open for delegation of access [16:47] hazmat: if we want multiuser access, the bucket must be readable by other users, right? [16:47] rog, yeah.. i'm not sure we'd ever make that not a required arg for ec2, if its an auto on deterministic setting.. well you can change your id or switch accounts, and then poof your env is gone [16:47] hazmat: okay, so, I spawned a bunch of units, and I've hit exactly what I saw on Sunday. [16:47] machines: [16:47] ... [16:47] 6: {dns-name: '', instance-id: i-0b65094c} [16:48] ... [16:48] builder-debian/5: [16:48] machine: 6 [16:48] public-address: null [16:48] relations: {} [16:48] state: null [16:48] machine 6 hasn't been noticed, and the unit stays "public-address: null" [16:48] hazmat: isn't that already true? (given that the bucket is private) [16:48] kees, also fwiw the latest client btw shows more information on status regarding machine state (pending from the provider, running, etc) [16:49] kees, public-address is null till the machine actually comes up and starts the machine agent.. [16:49] its not instaneous [16:49] it takes a minute, for the machine to launch, and have packages installed and to be available [16:49] ah, well, it just came up. heh. sunday I waited though. it wasn't up after an hour. [16:50] kees, definitely broken then, but its not something you can determine instaneously is all i'm saying [16:50] hazmat: right, absolutely. [16:50] kees, what i'm trying to verify though is.. A) is the bug something we've already fixed in the ppa B) if not what's the provisioning agent log look like [16:50] * kees nods [16:50] let me try to trigger the missing machine fault, one sec. [16:52] kaboooom [16:52] here was my steps: [16:52] $ juju terminate-machine 10 [16:52] oops, ignore that [16:52] steps: [16:52] $ juju remove-unit builder-debian/7 [16:52] $ juju terminate-machine 10 [16:52] $ juju add-unit builder-debian [16:53] at which point the provisioner explodes with python backtraces [16:53] 2011-12-07 08:52:01,217 provision:ec2: twisted ERROR: KeyError: 'Message' [16:53] 2011-12-07 08:52:01,217 provision:ec2: twisted ERROR: Logged from file provision.py, line 156 [16:53] what logs can I provide? :) [16:54] kees, awesome. the log is in /var/log/juju .. 
i think it's provisioning-agent.log but i'm not sure of the exact filename [16:54] kees, it's on machine 0 of the env [16:55] i think i kept using destroy-service instead of remove-unit when i was trying to reproduce this [16:56] So I am trying to use orchestra as per http://cloud.ubuntu.com/2011/09/oneiric-server-deploy-server-fleets-p2/ and this is failing because following those instructions does not appear to result in pxe booting being set up correctly on the provisioning server. [16:56] hazmat: http://paste.ubuntu.com/762905/ [16:56] I now have got the dhcp server running but it still isn't configured to do pxe things properly. [16:57] and so I get "No filename" errors when trying to boot client VMs [16:57] kees, thanks, that's very helpful [16:57] that looks like a bug in txaws [16:57] (this is on oneiric VMs) [16:58] hazmat: cool, excellent. [16:58] hazmat: I assume that moving to the ppa fixed the bring-up bug, or it's a hard race to lose and I just got "lucky" on sunday [17:06] kees, it's really not racy normally, i'm sorry that was your first juju experience. the client cli status reporting is much better now about keeping the user informed about what's going on (is the provider machine up, is juju ready on the machine). the provisioning bug in particular has been a little hard to reproduce, and it's been unclear what version and what the bug is.. but i think thanks to your help we should be able to fix that in the next day or two. and indeed it seems to be a bug in txaws in that it varies/reproduces based on ec2 error response variation. [17:09] cool, thanks for looking into it! [17:10] hazmat: you wrote about a presentation about juju. would you please send it to me? [17:10] it was frustrating for sure, but it was still _way_ easier to bring up a bunch of identical instances this way. [17:10] the charm stuff is nice :) [17:17] TheMue, we have them shared in an ubuntu one folder atm [17:19] hazmat: ah, ok. still have not used my account. so I'll try it now. [17:20] hazmat: does it cover the dependencies of external components (like zk) and internal/external modules and libraries [17:20] ? [17:20] TheMue, no [17:20] TheMue, it's a very high level architecture diagram [17:21] hazmat: ok, but I think it will help [17:48] hazmat, btw, need an opinion on how it's acceptable to detect unexpected shutdowns during the critical window of filesystem-screwage during upgrade-charm [17:50] hazmat, the workflow state seems like such an obvious place to put it, but I don't think it's a good idea to fire a transition while midway through executing another transition [17:51] hazmat, so if I were to do that I'd have to have a callback on workflow that called set_state on itself explicitly [17:51] hazmat, which feels like a bit of a perversion of the state machine [17:52] hazmat, hm, I have to stop now :( I'll pop back on later [17:53] fwereade, doh.. sorry.. 
definitely i think your idea is good (collapse part of upgrade op into the transition), go for it [17:54] hazmat, the issue is that I feel I should be able to handle the fact that the process could suddenly die while we're half way through extracting the charm [17:54] fwereade, that's independent really of the workflow aspect [17:56] hazmat, well, the trouble is it's intimately bound up with it, because if we come up from an incomplete upgrade we need to go into upgrade_charm_error state [17:56] hazmat, it certainly can't go on the lifecycle, we don't want that explicitly controlling the workflow [17:56] hazmat, SpamapS: if you're interested, I've got another set of juju blog posts up now: [17:56] http://www.outflux.net/blog/archives/2011/12/07/juju-bug-fixing/ [17:56] http://www.outflux.net/blog/archives/2011/12/07/how-to-throw-an-ec2-party/ [17:57] fwereade, sure, that state can be signaled by the error handler, but the aspect of doing the upgrade in such a way as to handle unexpected errors is independent of the location of the code [17:59] hazmat, I guess it could go on the unit agent itself, but it's a step in he opposite direction from the (IMO nice) move of state-reconciliation from unit agent to workflow [17:59] hazmat, teh workflow really feels like the right place for it [18:00] fwereade, so what happens on a retry? [18:00] of upgrade_error [18:01] hazmat, the usual: if unit charm id doesn't match service charm id, download and unpack before running the hooks [18:02] hazmat, and if it does, we know we're recovering from a state post-successful-replace, and we just fire the hooks if we're asked [18:02] fwereade, sounds good [18:02] hazmat, I'm just trying to figure out whether an "unlicensed" state transition, that doesn't go through the normal transition logic, is in any way acceptable [18:02] fwereade, just make an additional transition [18:02] fwereade, what's the scenario? [18:02] hazmat, and it's explicitly OK to fire a transition in the course of another transition? [18:03] fwereade, no.. but the lifecycle can call other lifecycle methods [18:03] hazmat, when we hit the point of no return *something* needs to record the fact that we're in a risky state [18:04] hi! [18:04] hazmat, as said above I think the workflow is the right place for it [18:04] kees: ty, reading your posts now. ;) [18:04] fwereade, huh? the transition handler itself is supposed to be risky/failable.. that's the benefit it and i thought the point.. it will record failures [18:04] May I ask aquestion? [18:05] kees: btw, you should be able to use us-west-2 now ;) [18:05] kickinz1_, sure [18:05] kees, cool post. i'm working on the ssh key management now, so that will take one step out of your process [18:05] I'm in the process of using juju with orchestra [18:06] When creating the boot strap, it fails with this error: [18:06] /root/.juju/environments.yaml: environments.orchestra.default-series: [18:06] The only place I see this is onbugs, but while using etckeeper. [18:07] fwereade, ah.. i think we're agreeing.. i think the point of no return stuff should be in the transition handler with a conditional guard, hence failures there record state, and can be retried. sounds good. 
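[Aside: a toy sketch of the design fwereade and hazmat converge on above: clear the upgrade flag in the watch callback, then do all the risky work inside a retryable transition handler guarded by the unit's recorded charm id, so a retry after a half-finished extract can tell what still needs doing, and leave the hook executor stopped whenever the unit sits in a charm-upgrade error state. This is illustrative pseudocode of the idea being discussed, not juju's real workflow/lifecycle API:]

```python
# Illustrative pseudocode only: `unit`, `service`, `download_and_extract` and
# `run_upgrade_hook` are placeholders for the real juju objects and helpers.

def do_upgrade_charm(unit, service, executor):
    """Handler for the started -> started 'upgrade_charm' transition.

    The watch callback has already cleared the upgrade flag; anything that
    fails in here is recorded by the workflow as a charm-upgrade error, and
    the whole handler is safe to run again on a retry.
    """
    executor.stop()   # no hooks may run while the charm directory is replaced
    # Guard: if the unit already records the service's charm id, the download
    # and extract finished on a previous attempt and can be skipped.
    if unit.get_charm_id() != service.get_charm_id():
        download_and_extract(service.get_charm_id(), unit.charm_directory)
        unit.set_charm_id(service.get_charm_id())
    run_upgrade_hook(unit)
    # Only reached on success: a failure leaves the executor stopped, and the
    # unit stays in a charm-upgrade error state until the error is resolved.
    executor.start()

def synchronize_on_startup(state, lifecycle, executor):
    # Restore process state from the persisted workflow state ("2 bools"):
    # "started" implies both running; a charm-upgrade error keeps the hook
    # executor stopped; other states leave the lifecycle down but hooks usable.
    if state == "started":
        lifecycle.start()
        executor.start()
    elif state != "charm_upgrade_error":
        executor.start()
```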
[18:07] (https://bugs.launchpad.net/bugs/872553) [18:07] <_mup_> Bug #872553: [SRU] upon creating a node via juju & orchestra, etckeeper hangs < https://launchpad.net/bugs/872553 > [18:08] hazmat, I'm not totally certain whether we're talking past one another or not, 1 sec [18:08] kickinz1_: can you maybe pastebin the whole error, like from $ juju .... to the next $ ? [18:08] ok [18:09] hazmat, I'm talking about something like this: http://paste.ubuntu.com/762978/ [18:10] hazmat, on UnitWorkflowState [18:10] hazmat, damn, really must go, bbl [18:10] http://pastebin.com/NNqBkiNn [18:15] any idea? [18:16] I'm using precise [18:16] fwereade, the state changes should go in the watch callback not the workflow [18:18] fwereade, the existing upgradecharm op will continue to exist, and it can do some basic checks, but it will kick off the state change after clearing the upgrade flag, the transition handler holds the rest of the code to the upgrade, it should be retryable cleanly, if it fails the unit goes into an upgrade_charm_error. [18:19] kickinz1_, do you have a default-series set in your environments.yaml ? [18:20] <_mup_> Bug #901343 was filed: juju.control.tests.test_status.StatusTest.test_render_dot broken < https://launchpad.net/bugs/901343 > [18:20] no [18:21] I'm getting the source of juju to look at what it expect. [18:24] Funny names...."astounding, mgnificent, overridden, puissant"... [18:25] thanks! default-series: oneiric made it work! [18:38] Hello! [18:39] o/ [18:42] mainerror: Yo [18:42] rog: You'll like some of the upcoming improvements on lbox.. [18:42] niemeyer: cool [18:42] Just need to test them now.. no Launchpad connection on the flight :) [18:43] niemeyer: a couple of new reviews for you BTW [18:44] rog: and you just got one [18:44] niemeyer: yay! [18:45] niemeyer: make that 3 new reviews - i'd forgotten about that one! [18:45] niemeyer: i've updated the cloudinit package merge proposal [18:45] rog: Sorry, btw, I did a big mess before leaving while working on lbox.. [18:45] rog: Repeatedly sending the same message [18:46] niemeyer: that's fine. i just ignored 'em all :-) [18:46] niemeyer: was there any signal in there, in fact? [18:46] solution to my problem: run sudo orchestra-import-isos and then add and remove the cobbler server configuration [18:48] rog: Any signal? How do you mean? [18:49] niemeyer: did any of the messages mean anything? [18:49] rog: No, in the end I was on crack consistently, because both changesets were already merged [18:49] niemeyer: i thought so. just checking. [18:50] niemeyer: http://codereview.appspot.com/5444043/ in case you didn't get a notification email [18:50] niemeyer: (that one's independent of the others) [18:51] rog: Thanks [18:51] rog: That was one of the things I fixed in the plane, btw [18:51] niemeyer: cool [18:51] rog: It should now send a ptal [18:51] rog: The other is to detect the -cr automatically after first use [18:51] niemeyer: ideally it should let me look at the codereview page before sending any mail [18:52] rog: and the other is to checkout target branches automatically for diffing [18:52] niemeyer: just to do a last sanity check [18:52] rog: and finally I've added support for default flags [18:52] niemeyer: with Go reviews, i often end up uploading several times before mailing [18:52] rog: Most of that is untested, though, obviously.. will be fun to see what works :-) [18:52] rog: That was done too [18:53] niemeyer: +50 for auto downloading! 
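[Aside: pulling together the environments.yaml threads above -- kickinz1_'s missing default-series, the control-bucket that must be unique per environment, and juju-origin: ppa -- a typical EC2 stanza of that era looked roughly like the following. Values are placeholders, and only keys mentioned in the conversation or the linked provider docs are shown:]

```yaml
environments:
  sample:
    type: ec2
    # must be unique per environment; sharing it between environments is
    # exactly what broke kees' setup above
    control-bucket: juju-sample-20c8cbd2a2b3f8a6
    admin-secret: 86f1d2cbbd9c4bdf9c17e60a1f4e7a62
    # deploy juju from the PPA on bootstrapped machines
    juju-origin: ppa
    # release to run on new machines; omitting this caused kickinz1_'s error
    default-series: oneiric
```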
[18:53] rog: there's a new -prep flag now [18:53] oh yes, this one often bites me [18:53] : [18:53] rog: You can use at any time to upload without requesting the review [18:53] i'll do -target ../foo-trunk [18:53] rog: It will also leave the Merge Proposal in Launchpad as Work In Progress, rather than Needs Review [18:53] rog: We should put that in the branch itself [18:53] and lbox propose doesn't check that the dir exists until after the file's been edited [18:53] rog: It looks for ".lbox" [18:53] (the description) [18:54] Ok, let me give you some quick reviews [18:56] SpamapS: us-west-2> yay! I will save a little money and a little latency. :) [18:56] jimbaker: excellent! I look forward to that. :) [19:00] kees: [19:01] niemeyer: ? [19:01] niemeyer: oh, that I have an ec2 region in my state? [19:01] kees: Yeah :-D [19:02] https://code.launchpad.net/~clint-fewbar/juju/fix-dot-test/+merge/84827 [19:02] Woudl appreciate a quick review cycle on that.. fixes the test suite on trunk. [19:02] Would even [19:03] bye! [19:11] SpamapS: Hello. Can you think of anything else that would be needed for Limesurvey's charm, or should I move on to roundcube? [19:11] nijaba: if it has the ability to make use of readonly slaves so we can scale it out even more, that would be cool, but its not really necessary. ;) [19:12] SpamapS: I do not think this is possible in Limesurvey [19:13] nijaba: I plan to write a mysql-proxy charm when subordinate charms land that will direct SELECT to a slave, and all others to a master. Should be interesting. :) [19:13] SpamapS: sounds really cool :) [19:13] I wonder if MySQL cluster works in any useful way on EC2.. probably not with the latency spikes. [19:14] 7.2 will have memcache protocol access built in, that should be cool. :) [19:14] SpamapS: ok, so I'll move on to Roundcube, making a first version of it that carries smtp/imap server address in the config. Will update it once someone will have charmed a mail server to depend on it [19:15] * nijaba wonders if dependencies can be made optional [19:15] nijaba: really even after the mail server is charmed, it will be useful to be able to just set it in the configs and not have to relate anything. [19:15] nijaba: yes, optional: true can be added as an attribute after the interface: xxx [19:15] which I find completely brain imploding.. requires: optional.. [19:15] :-P [19:15] SpamapS: so that' what we should do [19:16] niemeyer: off for the day, see ya tomorrow? [19:16] ttfn all [19:16] rog: Yeah, have a good evening [19:16] rog: and you have another review [21:13] <_mup_> juju/sshclient-refactor r431 committed by kapil.thangavelu@canonical.com [21:13] <_mup_> cleanup cli output when connection refused [21:22] bcsaller: I noticed you had some subordinate branches in review. How close are we to having some things to play with? I had this *crazy* idea for a charm.. [21:23] mk-query-digest can take tcpdump output, and tell you what queries sucked [21:23] So.. throw that on your apps for 5 minutes, related back to somewhere to storage the output.. and you can get like, an instant picture of your app [21:23] and where it sucks [21:24] SpamapS: thats cool. While the feature set is getting closer to alpha its still at the starting gate in terms of reviews. [21:25] SpamapS: and history shows that always takes a whiel [21:25] Yeah [21:25] I'm eager [21:25] I have a bunch of cool ideas and I want to try them out. 
;) [21:27] negronjl: btw, would appreciate a review on this https://code.launchpad.net/~clint-fewbar/charm/oneiric/mysql/add-config/+merge/84697 [21:27] SpamapS: ok ... working on it now [21:34] <_mup_> juju/ssh-known_hosts r430 committed by jim.baker@canonical.com [21:34] <_mup_> Do not create known_hosts files in actual home directory when testing [21:35] <_mup_> juju/ssh-find-zk r430 committed by jim.baker@canonical.com [21:35] <_mup_> Initial refactoring [21:36] <_mup_> juju/ssh-find-zk r431 committed by jim.baker@canonical.com [21:36] <_mup_> Merged upstream [21:41] SpamapS: looks good ... deployed with multiple changes in config.yaml and it all works good as far as I can tell. Approved. [21:41] SpamapS: should I merge it as well ? [21:49] negronjl: no I think the proposer should merge if they're a member of charmers [21:50] negronjl: and thanks for the review! [21:50] * SpamapS is eager to start writing tests as well [21:50] SpamapS: no prob. [22:06] Pushed up to revision 69. [22:06] * SpamapS giggles like beavis and butthead [22:13] niemeyer, i wanted to try one of the lbox -cr reviews, but it seems to have trouble.. it wants the -for branch to point something on disk, but it doesn't like a trunk checkout, any ideas? [22:15] <_mup_> Bug #901463 was filed: SSH Client code and output cleanups < https://launchpad.net/bugs/901463 > [22:16] also wasn't clear on the -bp if it wants the name of a blueprint or a link, [22:21] * SpamapS goes to get a RedBull from 7-11... [22:24] bcsaller, can subordinate charms talk to each other in the same container level isolation that they can with the master? [22:25] hazmat: they need a relationship defined [22:25] bcsaller, but would that be a normal s2s rel? or a container scoped one [22:27] hazmat: we talked about that, its wasn't clear that it was a high priority use case, I would think we'd honor the subordinate flag on the relationship. It would be a special case though as they are not subordinate to each other [22:28] hazmat: what use case do you see? [22:28] bcsaller, in terms of impl if the container scope is just another type of relation [22:28] bcsaller, i was thinking about doing something with volume management and backup for cassandra, a volume manager that can attach volumes to the node, and a backup cassandra plugin, that could snapshot and transfer data to the volume [22:29] <_mup_> juju/ssh-find-zk r432 committed by jim.baker@canonical.com [22:29] <_mup_> Fix tests to support refactoring [22:29] hazmat: so even without special support those could both be subordinate with a normal relationship and they would be able to filter for the right pair [22:29] but I think we can do better and you seem to as well [22:32] * SpamapS chugs redbull... [22:32] http://bit.ly/surjbS [22:33] BUG TRIAGE RAMPAGE!!! [22:33] * SpamapS storms off into launchpad [22:35] louel, Gotenks. [22:39] hazmat: bug 804203, is https://issues.apache.org/jira/browse/HBASE-2418 related? [22:39] <_mup_> Bug #804203: Juju needs to communicate securely with Zookeeper < https://launchpad.net/bugs/804203 > [22:44] Daviey: its related in that HBASE needs to implement the same level of auth controls as Juju does when working with zookeeper. [22:45] yup [22:46] basically node level acls to protect portions of the zk tree from anon clients [22:51] * hazmat heads out to dinner [22:52] Daviey, its not on the roadmap atm for 12.04 [22:53] *blink* [22:53] seriously? 
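[Aside: "node level acls to protect portions of the zk tree from anon clients", as hazmat describes bug 804203 above, would look roughly like this with the zkpython bindings. A sketch only: the credentials, paths and data are made up, and a real client would wait for the session event before issuing calls, as in the earlier connection sketch:]

```python
# Sketch: protect part of the ZooKeeper tree with a digest ACL instead of the
# wide-open world/anyone default, so anonymous clients cannot touch it.
import base64
import hashlib
import zookeeper

def digest_identity(user, password):
    # ZooKeeper's digest scheme expects "user:base64(sha1('user:password'))".
    digest = base64.b64encode(hashlib.sha1("%s:%s" % (user, password)).digest())
    return "%s:%s" % (user, digest)

OPEN_ACL = [{"perms": zookeeper.PERM_ALL, "scheme": "world", "id": "anyone"}]
ADMIN_ACL = [{"perms": zookeeper.PERM_ALL, "scheme": "digest",
              "id": digest_identity("admin", "secret")}]

handle = zookeeper.init("127.0.0.1:2181")
# Authenticate this session so it satisfies the digest ACL below.
zookeeper.add_auth(handle, "digest", "admin:secret", lambda h, rc: None)

zookeeper.create(handle, "/public", "anyone may read or write this", OPEN_ACL, 0)
zookeeper.create(handle, "/protected", "only the admin identity may", ADMIN_ACL, 0)
```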
[22:56] SpamapS / hazmat: thanks [22:58] * SpamapS is somewhat frustrated about that one as well [22:58] hazmat: Hmm [22:58] hazmat: What does bzr info print for your checkout [22:58] ? [23:50] Hmm, so config settings can't contain non-ASCII data [23:50] print >>stream, str(result) [23:50] UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-10: ordinal not in range(128) [23:51] seems like just changing that to unicode(result) would work [23:52] of course, really, I just want the raw bytes no matter what.. [23:56] Woohay.. new features of lbox working well [23:56] SpamapS: Ugh.. [23:56] SpamapS: That's a super well known wart of Python :-( [23:57] wart in what way? [23:57] unicode is tricky? [23:57] SpamapS: Luckily 3.0 is fixing it, so people will stop doing it all the time [23:57] SpamapS: It's not on itself [23:57] SpamapS: The problem is how Unicode evolved within the language [23:57] Seems like a wart of all programming done before 2005 :-P [23:57] Java tried, even they got it wrong. :-P [23:57] SpamapS: >>> u"é" + "é" [23:57] Traceback (most recent call last): [23:57] File "", line 1, in [23:57] UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128) [23:58] SpamapS: Just too easy to get wrong [23:58] SpamapS: 3.0 is fixing that by separating raw bytes from *human text* more clearly [23:58] oh thats good [23:58] SpamapS: 3.X, that is [23:59] so how do you get raw bytes with 2.7 ? unicode(var) ? [23:59] that seems wrong [23:59] SpamapS: "é" [23:59] SpamapS: That's raw bytes [23:59] SpamapS: But was also the correct way to do human text until several years ago [23:59] SpamapS: Hence the mess
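[Aside: a compact Python 2 illustration of the wart niemeyer describes -- mixing byte strings (str) with human text (unicode) triggers an implicit ASCII conversion -- and of why encoding explicitly beats reaching for str() or unicode() when what you actually want is raw bytes:]

```python
# -*- coding: utf-8 -*-
# Python 2: str is raw bytes, unicode is human text, and mixing the two makes
# the interpreter attempt an implicit ASCII conversion that fails on non-ASCII.

text = u"héllo"              # human text (unicode)
raw = "h\xc3\xa9llo"         # the same text as raw UTF-8 bytes (str)

try:
    text + raw               # implicit ASCII decode of `raw`
except UnicodeDecodeError as e:
    print "unicode + str:", e

try:
    str(text)                # implicit ASCII encode of `text`
except UnicodeEncodeError as e:
    print "str() on unicode:", e

# If raw bytes are what you actually want, encode explicitly at the edge:
print text.encode("utf-8")   # -> 'h\xc3\xa9llo', safe to write to any stream
# unicode(result) only moves the problem to whoever eventually writes the
# stream; keeping values as unicode and encoding once on output avoids it.
```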