[00:00] <hazmat> hmm.. it looks like trunk is broken
[00:02] <hazmat> oh.. i had ensemble set to use a different branch
[00:10] <niemeyer> hazmat: Sounds right re. the catch up
[00:14] <adam_g> actually python-software-properties changed
[00:15] <hazmat> jimbaker, is unexpose not implemented?
[00:15] <adam_g> and needs to be fixed, and so will cloud-init
[00:19] <niemeyer> adam_g: What happened?
[00:20] <adam_g> niemeyer: add-apt-repository now requires user confirmation when adding a PPA
[00:20] <adam_g> just submitted a patch to software-properties to have a '-y' option similar to apt-get
[00:20] <niemeyer> adam_g: Oh no
[00:20] <adam_g> either way it breaks lots of formulas. :\
[00:20] <niemeyer> Man..
[00:22] <niemeyer> hazmat: I understood it was ready as well
[00:30] <niemeyer> adam_g: Sent a note about the conceptual problem there
[00:31] <niemeyer> adam_g: We can fix with -y.. but we need to increase awareness about the importance of not introducing interactivity randomly
[00:34] <adam_g> niemeyer: agreed
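[Editor's aside: the '-y' workaround adam_g describes can be applied from a formula hook. A minimal sketch, assuming the patched add-apt-repository; the helper name and PPA are illustrative, only the '-y' flag comes from the proposed software-properties patch.]

```python
import subprocess

def add_ppa_cmd(ppa, assume_yes=True):
    # Build the add-apt-repository invocation; '-y' skips the interactive
    # confirmation prompt that broke unattended formula runs.
    cmd = ["add-apt-repository"]
    if assume_yes:
        cmd.append("-y")
    cmd.append(ppa)
    return cmd

# In an install hook you would run it non-interactively (not executed here):
# subprocess.check_call(add_ppa_cmd("ppa:example/some-ppa"))
```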
[02:53] <jimbaker> hazmat, unexpose is implemented
[02:57] <jimbaker> hazmat, i would be very curious if you have found any issues of course
[03:09] <niemeyer> hazmat: The full diff on the gozk update: http://paste.ubuntu.com/669664/
[03:12] <hazmat> jimbaker, i unexposed a service (current trunk) and was still able to access it
[03:13] <jimbaker> hazmat, did you delete the environment security group first?
[03:13] <hazmat> jimbaker, ah
[03:13] <hazmat> jimbaker, probably not
[03:13] <jimbaker> sorry, in the release notes :)
[03:13] <jimbaker> my expose-cleanup branch will take care of that cleanup
[03:14] <jimbaker> but i considered it a helpful feature in the transition ;)
[03:15] <hazmat> jimbaker, cool, thanks for clarifying
[03:15] <jimbaker> hazmat, no worries
[03:16] <SpamapS> hazmat: on plane.. inflight wifi is .. amazing. :)
[03:18] <hazmat> SpamapS, nice
[03:19] <SpamapS> hazmat: anyway re the traceback... I am trying to find where the 404 is coming from
[03:20] <jimbaker> SpamapS, i remember flying back from europe with internet service a few years back. too bad that program was scrapped because of expense
[03:21] <SpamapS> jimbaker: lufthansa is rolling it out on 100% of flights now
[03:21] <jimbaker> but ground stations are cheaper than satellites, so domestic only for the immediate future
[03:21] <hazmat> SpamapS, where you heading?
[03:21] <SpamapS> Seattle.. just showing my boy something different than the SW US
[03:21] <jimbaker> SpamapS, really, there's another satellite player now? iirc, it was boeing that scrapped it. maybe someone did pick it up
[03:22] <SpamapS> He's only ever been to CA, NM, and TX ... WA is *quite* a different place. :)
[03:22] <jimbaker> and mid aug is the best time of year to visit WA
[03:22] <SpamapS> jimbaker: Not sure, they've been squawking about it that they're rolling it out.
[03:22] <SpamapS> jimbaker: yeah, nice and green.. mid 70's :)
[03:22] <jimbaker> sounds like our mountains
[03:23] <jimbaker> really, we should not have any more sprints in central tx in august
[03:24] <SpamapS> haha
[03:24] <SpamapS> I thought it was nice.. for keeping us in our rooms working
[03:26] <jimbaker> SpamapS, that's one theory, butts in seats and all. i find my brain works better if the body has moved however
[03:33] <SpamapS> oh I see the problem with the tracebacks. :-/
[03:36] <hazmat> SpamapS, i'm curious.. sometimes we get tracebacks in twisted and sometimes not.. 
[03:37] <hazmat> mostly depends on whether the code has yielded to the reactor, but it's not always clear
[03:37] <SpamapS> This seems to be because the error isn't raised until inside the select loop
[03:37] <hazmat> yeah
[03:38] <SpamapS> I'm guessing it's coming from txaws.. I can't seem to find where it's raised in the provider
[03:45] <SpamapS> It's really not clear at all where the 404 from Amazon is made "ok" ..
[03:52] <SpamapS> Yeah this twisted error handling is *maddening*
[03:53] <SpamapS> the error that is happening is not an S3Error, but a twisted.web.error.Error ...
[03:53] <hazmat> SpamapS, it should be that line  i pointed out earlier
[03:53] <SpamapS> hazmat: that line doesn't seem to be reached
[03:53] <hazmat> hmm
[03:54] <hazmat> i guess i should bite the bullet and do a nova install
[03:54] <SpamapS> just get on canonistack
[03:54] <SpamapS> it's company wide
[03:54] <SpamapS> and dead dumb ass simple
[03:54] <hazmat> requires something i don't have... though i should get one
[03:54] <hazmat> actually swift is much easier to setup
[03:54] <SpamapS> err.. what? a SSO login?
[03:55] <SpamapS> that's all you need
[03:56] <hazmat> SpamapS, i thought i needed to set up a shell account?
[03:57] <SpamapS> nope
[03:57] <SpamapS> maybe if you don't want to use public IPs to ssh into the instances
[04:00] <SpamapS> hazmat: good luck.. time to shut down electronics
[04:04] <hazmat> SpamapS, thanks.. i'll see if i can get this running
[04:04] <hazmat> getting set up with canonistack was really easy
[04:27] <hazmat> hmm.. so after fixing up the error trapping.. it still looks like a 404 on creating a bucket
[05:38] <hazmat> got it
[05:38] <hazmat> SpamapS, so getting past the error trapping code, it looks like the problem is a missing trailing slash
[05:38] <hazmat> on bucket names in txaws
[05:38] <hazmat> at least that's the only delta between boto and txaws when i try to do bucket ops with them
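[Editor's aside: a hedged sketch of the delta hazmat found. S3 accepts bucket-level requests with or without a trailing slash, while nova-objectstore 404s without one; the helper below is illustrative, not the actual txaws change.]

```python
def bucket_path(bucket, key=None):
    # Illustrative path builder: nova-objectstore wants "/<bucket>/" for
    # bucket operations (create/list), where AWS also accepts "/<bucket>".
    if key is not None:
        return "/%s/%s" % (bucket, key)
    return "/%s/" % bucket  # trailing slash keeps openstack happy
```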
[05:40] <hazmat> woot! 
[05:46] <hazmat> hmm.. not quite
[05:46] <hazmat> well at least past all the s3 errors
[05:46] <hazmat> now onto the ec2 errors
[05:50] <SpamapS> hazmat: :)
[05:50] <SpamapS> hazmat: so txaws needs some testing against openstack. :)
[05:51] <hazmat> SpamapS, definitely, the delta was tiny.. the hard part was finding it ;-)
[05:51] <hazmat> but clearly some additional testing.. 
[05:51] <hazmat> it's rather funny.. boto has such minimal testing.. but lots of prod use.. txaws lots of testing.. little prod use..
[05:56] <hazmat> hmm.. that's a little questionable
[05:57] <hazmat> i didn't have my access key/secret key setup correctly, but i could still create buckets..
[05:57] <hazmat> i don't think there's any validation in nova-objectstore
[05:57] <hazmat> woot! bootstrap success
[05:58] <hazmat> ugh.. 12 roundtrips for bootstrap...
[06:00] <hazmat> SpamapS, so i can't actually test this, since i need shell access for the ssh tunnel to zk
[06:00] <hazmat> but it bootstraps at least
[06:04] <hazmat> and shuts down ;-)
[06:04]  * hazmat heads to bed
[06:06] <_mup_> ensemble/stack-crack r322 committed by kapil.thangavelu@canonical.com
[06:06] <_mup_> openstack compatibility fixes for ec2 provider.
[06:10] <SpamapS> hazmat: you shouldn't need shell access to talk to nova
[06:10] <SpamapS> hazmat: and you can allocate/attach public ips to ssh in
[06:12]  * SpamapS tries out branch
[06:14] <SpamapS> hazmat: argh! where is your branch?
[08:59] <kim0> huh .. ensemble upgrade causes syntax errors? http://paste.ubuntu.com/669870/
[09:33] <TeTeT> kim0: agreed, see the same
[09:44] <kim0> Any idea if the CF formula has been made to work?
[12:48] <_mup_> Bug #829397 was filed: Link a service to a type of hardware <Ensemble:New> < https://launchpad.net/bugs/829397 >
[12:53] <_mup_> Bug #829402 was filed: Deploy 2 services on the same hardware <Ensemble:New> < https://launchpad.net/bugs/829402 >
[12:57] <_mup_> Bug #829412 was filed: Deploy a service on a service <Ensemble:New> < https://launchpad.net/bugs/829412 >
[13:00] <_mup_> Bug #829414 was filed: Fail over services <Ensemble:New> < https://launchpad.net/bugs/829414 >
[13:04] <_mup_> Bug #829420 was filed: Declare and consume external services <Ensemble:New> < https://launchpad.net/bugs/829420 >
[13:49] <m_3> kim0: CF unknown still... looking at it now
[13:50] <kim0> m_3: thanks :)
[13:50] <kim0> I'm doing hpcc instead 
[13:50] <kim0> horribly complex language
[13:55] <m_3> kim0: ok, I'll go back to adding/testing nfs mounts into our standard formulas... fun fun :)
[13:56] <kim0> yeah all fun :)
[14:21] <botchagalupe> Newbie question… Can formulas be written in Ruby?
[14:22] <hazmat> botchagalupe, definitely
[14:22] <botchagalupe> very cool… 
[14:23] <hazmat> botchagalupe, formulas can be written in any language, from c, shell, haskell, ruby, etc.
[14:24] <botchagalupe> So far it looks pretty cool… Weird coming from a chef background though.  Just looked at it over the last hour.  Need to learn more… 
[14:25] <hazmat> botchagalupe, ensemble will call the hooks at the right time.. which are just executables to ensemble, and the hooks can interact with ensemble via some command line tools (relation-set, open-port, etc) that are provided.
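[Editor's aside: a sketch of what hazmat describes, as a hypothetical hook body. The relation-set and open-port CLI tools are the ones named above; the helper name and settings are made up.]

```python
#!/usr/bin/env python
# Hypothetical formula hook written in Python: to ensemble it is just an
# executable, and it reports back through the provided command-line tools.
import subprocess

def relation_set_cmd(**settings):
    # relation-set takes key=value pairs for this unit's relation settings.
    return ["relation-set"] + ["%s=%s" % kv for kv in sorted(settings.items())]

if __name__ == "__main__":
    # e.g. in a db-relation-joined hook (not run here):
    # subprocess.check_call(relation_set_cmd(host="10.0.0.5", port="5432"))
    # subprocess.check_call(["open-port", "5432/tcp"])
    pass
```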
[14:27] <botchagalupe> Are there any good examples of running it outside of EC2?  e.g., openstack…. 
[14:28] <hazmat> botchagalupe, not at the moment, we're still working on openstack compatibility (i was just working on it late last night), and cobbler/orchestra (physical machine) integration, but that likely won't be finished till the end of the month.
[14:29] <hazmat> well.. sooner.. but as far as having blog posts and docs go.
[14:31] <botchagalupe> I look forward to it :) 
[14:31] <hazmat> kim0, that upgrade error looks like some python 2.7isms that fwereade introduced while refactoring the provider storage... previously it was 2.6 compatible... it's probably worth a bug report.
[14:32] <kim0> hazmat: ok filing it
[14:35] <_mup_> Bug #829531 was filed: broken python 2.6 compatbility <Ensemble:Confirmed> < https://launchpad.net/bugs/829531 >
[14:41] <hazmat> SpamapS, open stack compatible branches..  lp:~hazmat/ensemble/stack-crack  and lp:~hazmat/txaws/fix-s3-port-and-bucket-op
[14:43] <niemeyer> Hey all!
[14:43] <niemeyer> botchagalupe: Hey!  Good to have you here..
[14:46] <niemeyer> hazmat: Thanks for pushing that man
[14:46] <highvoltage> hello niemeyer 
[14:47] <niemeyer> hazmat: Have you checked if openstack returned anything at all in that put?
[14:47] <niemeyer> hazmat: Was mostly curious if it was a bit off, or entirely off
[14:47] <niemeyer> highvoltage: Hey!
[14:48] <hazmat> niemeyer, well bootstrap works. i need to see if assigning the elastic ip address will change the address as reported by describe instances; if so then i should be able to actually use ensemble against the instance, else it will need an ssh account into this particular private openstack installation
[14:49] <hazmat> niemeyer, it was just a few bits off.. the error capture in ensemble needed to be more generic, and the bucket operations needed a trailing slash
[14:49] <niemeyer> hazmat: Hmm.. how's EIP involved there?
[14:49] <hazmat> niemeyer, the actual diff was only like 10 lines
[14:49] <hazmat> niemeyer, only against this private installation of openstack
[14:49] <niemeyer> hazmat: I meant openstack itself.. was it returning anything at all, or just 404ing
[14:50] <niemeyer> hazmat: Yeah, but I don't get how's it involved even then
[14:52] <hazmat> niemeyer, so two different topics.. openstack was returning 404s without ec2 error information, which means the error transformation in txaws wasn't working, and the error capture in ensemble wasn't working either. updating the error capture in ensemble to catch twisted.web.error.Error and check the status against 404 solved that.. there was an additional compatibility issue which required bucket operations to have a trailing slash
[14:52] <niemeyer> hazmat: I got that yesterday.. the question is:
[14:53] <niemeyer> hazmat: OpenStack is obviously not returning the same message as AWS.. what is it returning instead?
[14:53] <hazmat> niemeyer, empty contents on a 404
[14:53] <niemeyer> hazmat: Ok.. :(
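[Editor's aside: the error-capture change hazmat describes can be sketched without txaws itself. Since openstack returns a bare 404 with an empty body, the S3Error transformation never fires and the code must fall back to the generic HTTP error. The exception class below is a stand-in for twisted.web.error.Error (which stores the HTTP status as a string); all names here are illustrative, not Ensemble or txaws code.]

```python
class HTTPError(Exception):
    """Stand-in for twisted.web.error.Error: keeps the status as a string."""
    def __init__(self, status, message=""):
        super().__init__(message)
        self.status = status

def file_not_found(error):
    # Generic capture: with an empty 404 body there is no parseable
    # S3Error, so inspect the raw HTTP status code instead.
    return isinstance(error, HTTPError) and error.status == "404"
```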
[14:54] <hazmat> niemeyer, on the EIP topic.. the problem is that this particular openstack installation is private, so we launch a bootstrap instance of ensemble and then can't actually use the ensemble commands against it, because we use an ssh tunnel to the public ip addr of the node.. which isn't routable
[14:55] <hazmat> niemeyer, the openstack implementation, both swift and nova-objectstore, are very simple if we want to send patches upstream re this
[14:56] <niemeyer> hazmat: Sure, I totally get that we can fix it ourselves.. ;-)
[14:56] <hazmat> finite time sucks :-)
[14:56] <botchagalupe> niemeyer: Good looking tool… Gonna give it some kicks this weekend to podcast about Monday… 
[14:56] <niemeyer> botchagalupe: Neat!
[14:57] <niemeyer> botchagalupe: We have a lot happening right now.. if you want to include details about what else is going on, just let us know
[14:57] <niemeyer> hazmat: Re. the EIP.. I see.. so our setup does not actually expose the machines unless an EIP is assigned
[14:58] <hazmat> niemeyer, exactly
[14:58] <niemeyer> hazmat: Can we proxy the client itself through SSH?
[14:59] <hazmat> niemeyer, yes, that requires a shell account that i don't have.. i'm curious if openstack maintains compatibility to the point of readjusting describe instances output when an eip is assigned to an instance; that would obviate the need for shell credentials to this private openstack instance.
[15:00] <niemeyer> hazmat: Are you sure?  Have you tried to route through people or chinstrap, for instance?
[15:00] <hazmat> niemeyer, i haven't setup that shell account
[15:01] <hazmat> niemeyer, just finding some new errors as well with the ec2 group stuff on subsequent bootstraps
[15:01] <niemeyer> hazmat: Hmm, good stuff
[15:06]  * hazmat grabs some caffeine.. bbiam
[15:13] <botchagalupe> niemeyer Please send me what you have  john at dtosolutions com 
[15:14] <niemeyer> botchagalupe: I don't have anything readily packed to mail you..
[15:16] <niemeyer> botchagalupe: Right _now_ we're working on the formula store, physical deployments, and local development.. have just deployed EC2 firewall management for formulas and dynamic service configuration.
[15:23] <jimbaker> fwereade, i took more of a look at the cobbler-zk-connect branch
[15:23] <fwereade> jimbaker: heyhey
[15:23] <jimbaker> 1. test_wait_for_initialize lacks an inlineCallbacks decorator, so it's not testing what it says it's testing :)
[15:24] <niemeyer> Nice catch
[15:24] <jimbaker> 2. the poke i mentioned is TestCase.poke_zk
[15:24] <jimbaker> note that to use it, you need to follow the convention of setting self.client
[15:25] <jimbaker> fwereade, in general, you don't want to be using sleeps in tests. they have a nasty habit of eventually failing
[15:26] <fwereade> jimbaker: cool, tyvm for the pointers :)
[15:26] <jimbaker> fwereade, in this particular case, poke_zk will definitely work, and make the test deterministic. which is what we want
[15:26] <fwereade> jimbaker: I'll look up poke_zk
[15:26] <fwereade> jimbaker: sweet
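[Editor's aside: the "poke" trick jimbaker describes generalizes beyond zookeeper: instead of sleeping a fixed interval and hoping the server has caught up, submit a cheap no-op request and wait for its reply; everything submitted earlier is then guaranteed processed. A stdlib sketch of the idea, with a worker thread standing in for the zookeeper session; this is not Ensemble's actual poke_zk.]

```python
import queue
import threading

def worker(q, log):
    # Processes requests strictly in order, like the zookeeper session does.
    while True:
        item = q.get()
        if item is None:
            break
        item(log)
        q.task_done()

def poke(q):
    # Deterministic barrier: enqueue a no-op and block until it runs,
    # proving every earlier request has already been handled.
    done = threading.Event()
    q.put(lambda log: done.set())
    done.wait()

q, log = queue.Queue(), []
threading.Thread(target=worker, args=(q, log), daemon=True).start()
q.put(lambda log: log.append("created /initialized"))
poke(q)  # no sleep(): earlier requests are provably complete
assert log == ["created /initialized"]
q.put(None)  # shut the worker down
```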
[15:33]  * niemeyer => lunch
[15:44] <hazmat> hmm.. it looks like  openstack has a compatibility issue here regarding describe group for the environment group
[15:50] <hazmat> niemeyer, i'm starting to realize compatibility for openstack might be a larger task, but it's also unclear, as there are lots of bugs marked fix committed but not released.
[16:06] <adam_g> hazmat: i believe those bugs aren't marked as fixed released until the next milestone is released, but packages in ppa:nova-core/trunk are built on every commit
[16:07] <_mup_> Bug #829609 was filed: EC2 compatibility describe security group returns erroneous value for group ip permissions <Ensemble:New> <OpenStack Compute (nova):New> < https://launchpad.net/bugs/829609 >
[16:11] <hazmat> adam_g, understood, it's just not clear what version canonistack is running
[16:13] <kim0> woohoo just pushed a lexisnexis vid → http://cloud.ubuntu.com/2011/08/crunching-bigdata-with-hpcc-and-ensemble/
[16:13]  * kim0 prepares to start the weekend 
[16:14] <hazmat> smoser, just to verify.. cloud-init is running on the canonistack images?
[16:22]  * hazmat lunches
[16:27] <niemeyer> hazmat: What was the other issue you found? (sry, just back from lunch now)
[16:54] <hazmat> niemeyer, see the bug report
[16:54] <hazmat> niemeyer, i committed a work around to the txzookeeper branch
[16:54] <hazmat> er. txaws that is
[16:54] <hazmat> now i have a new error to decipher.. it doesn't appear my ssh keys from cloud-init made it onto the machine
[16:55] <hazmat> http://pastebin.ubuntu.com/670216/
[16:56] <hazmat> looking at the console-output it looks like cloud-init runs though .. http://pastebin.ubuntu.com/670217/
[16:58] <niemeyer> hazmat: Hmm.. the metadata service must be hosed
[16:58] <hazmat> niemeyer, true, that likely is hosed.. i think that's only recently been done, and perhaps differently. that was an issue i hit earlier looking at rackspace support
[16:59] <hazmat> there wasn't any way to identify the machine api identifier from within the machine
[17:11] <fwereade> niemeyer: quick confirm
[17:11] <niemeyer> fwereade: Sure
[17:12] <fwereade> niemeyer: if a foolish developer had ended up with a 3000-line diff, would it be appreciated if he reconstructed the end result as a pipeline, even if the individual steps each ended up seeming a bit forced/redundant?
[17:13] <fwereade> niemeyer: a lot of the problem is decent-sized chunks of code moving from one file to another, but sadly the structure isn't really apparent from the diff
[17:13] <fwereade> niemeyer: however, I think it could be clearer if it were broken up into steps like
[17:13] <fwereade> 1) add new way to do X
[17:14] <fwereade> 2) remove old way to do X, use new way
[17:14] <fwereade> etc...
[17:14] <_mup_> Bug #829642 was filed: expose relation lifecycle state to 'ensemble status' <Ensemble:New> < https://launchpad.net/bugs/829642 >
[17:14] <niemeyer> fwereade: Yeah, that sounds a lot more reasonable, and less painful for both sides
[17:15] <fwereade> niemeyer: cool, they'll land sometime on monday then
[17:15] <niemeyer> fwereade: Sounds good.. even though there's some work involved, I'm willing to bet that the overall time for the changes to land will be reduced
[17:16] <fwereade> niemeyer: I swear it was a 1kline diff, and then I went and made everything neat and consistent :/
[17:16] <fwereade> niemeyer: anyway, thanks :)
[17:16] <fwereade> happy weekends everyone :)
[17:16] <niemeyer> fwereade: I can believe that
[17:16] <niemeyer> fwereade: Have a great one!
[17:16] <fwereade> niemeyer: and you :)
[17:17] <niemeyer> fwereade: Thanks
[17:18] <hazmat> fwereade, cheers
[17:27] <hazmat> hmm.. afaics it's working. the console output diff between ec2 and openstack looks sensible, except cloud-init waiting on the metadata service
[17:30] <hazmat> machine and provisioning agents running normally
[17:31] <hazmat> woot it works
[17:31] <hazmat> doh
[17:31] <niemeyer> hazmat: Woah!
[17:32]  * niemeyer dances around the chair
[17:32] <hazmat> niemeyer, well i got status output and agents are running
[17:32] <hazmat> niemeyer, new error on deploy, but getting closer i think
[17:32]  * niemeyer sits down
[17:32]  * hazmat grabs some more caffeine
[17:47] <RoAkSoAx> fwereade: \o/
[17:47] <RoAkSoAx> fwereade: any updates on the merges in trunk?
[17:50] <_mup_> ensemble/fix-pyflakes r322 committed by jim.baker@canonical.com
[17:50] <_mup_> Remove dict comprehension, pyflakes doesn't understand it yet
[17:54] <_mup_> ensemble/fix-pyflakes r323 committed by jim.baker@canonical.com
[17:54] <_mup_> Remove remaining dict comprehension
[17:59] <hazmat> jimbaker, a later version of pyflakes seems to understand it for me.. if your pyflakes is using a 2.6 python, then it could be an issue
[17:59] <hazmat> jimbaker, definitely valid to do.. but the fix is really python 2.6 compatibility
[18:02] <hazmat> jimbaker, nevermind.. i hadn't realized but the latest pyflakes package seems to be broken
[18:11] <niemeyer> jimbaker: ping
[18:15] <niemeyer> bcsaller: ping
[18:15] <bcsaller> niemeyer: whats up?
[18:15] <niemeyer> bcsaller: You, looking for a vict^Wcandidate for a review
[18:15] <niemeyer> bcsaller: s/You/Yo/
[18:15] <bcsaller> sure
[18:15] <niemeyer> bcsaller: https://code.launchpad.net/~hazmat/ensemble/formula-state-with-url/+merge/71291
[18:16] <bcsaller> on it
[18:16] <niemeyer> bcsaller: Cheers!
[18:28] <niemeyer> bcsaller: https://code.launchpad.net/~hazmat/ensemble/machine-agent-uses-formula-url/+merge/71923
[18:28] <niemeyer> bcsaller: Oh, sorry, nm
[18:28] <niemeyer> bcsaller: William has already looked at that latter one
[18:28] <hazmat> niemeyer, so with dynamic port opening, one question i had is how do we go about solving placement onto multiple machines when reusing machines
[18:29] <hazmat> we need static analysis to determine port conflicts for placement afaics
[18:29] <hazmat> s/analysis/metadata
[18:29] <hazmat> something along the lines of describing a security group, port-ranges, protocols, etc
[18:30] <hazmat> directly in a formula
[18:31] <heckj> I have a sort of random ensemble question: when you create a relation, is that a bidirectional concept, or unidirectional? i.e. do both pieces know about each other when you make the relationship for purposes of setting up configs, etc?
[18:31] <hazmat> heckj, bi-directional
[18:31] <niemeyer> hazmat: ROTFL
[18:31] <hazmat> heckj, each side is informed when units of the other side join, depart or change their relation settings
[18:31] <niemeyer> hazmat: Didn't we cover that issue at least 3 times? :-)
[18:32] <heckj> hazmat: thanks!
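[Editor's aside: a hedged sketch of the bi-directional relation lifecycle hazmat describes. For a relation named "db", each side ships executables that ensemble invokes as units on the other side join, depart, or change settings, following the <relation>-relation-<event> naming pattern; the helper below is illustrative, not Ensemble internals.]

```python
# Lifecycle events each side of a relation receives about the other side.
EVENTS = ("joined", "changed", "departed")

def hook_name(relation, event):
    # Hooks are just executables named after the relation and the event.
    assert event in EVENTS
    return "%s-relation-%s" % (relation, event)

# e.g. a database formula would ship hooks/db-relation-joined, etc.
```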
[18:32] <hazmat> niemeyer, yeah.. we probably did, but i'm looking at doing a more flexible placement alg to respect max, min machines.. and i don't recall what solution we came up with
[18:32] <hazmat> actually i know we did several times
[18:33] <niemeyer> hazmat: I don't understand how that changes the outcome we reached in Austin
[18:33] <hazmat> niemeyer, i don't recall we discussed this in austin, we discussed network setups in austin
[18:34] <hazmat> for lxc bridging
[18:34] <niemeyer> hazmat: We certainly discussed port conflicts and how we'd deal with them in the short term and in the long term
[18:34] <hazmat> niemeyer, in the short term we  said we wouldn't, and the long term?
[18:35] <niemeyer> hazmat: We have all the data we need to do anything we please..
[18:36] <hazmat> niemeyer, okay.. so i'm deploying a new formula, i can inspect which ports are open/used on a machine, but i can't tell which ones the new formula needs.. so i lack knowledge of what it's going to be using in advance of deploying it.
[18:36] <hazmat> if i knew in advance i could select a machine with non-conflicting port usage
[18:37] <niemeyer> hazmat: open-port communicates to Ensemble what port that is.. we don't need to tell in advance
[18:37] <niemeyer> hazmat: Ensemble will happily take the open port and move on with it
[18:38] <niemeyer> Woohay
[18:38] <hazmat> niemeyer, and in the case of a port usage conflict between two formulas?
[18:39] <hazmat> s/formulas/service units
[18:40] <niemeyer> hazmat: Yes, as we debated the same port can't be put in the same address.. it's a limit of IP
[18:41] <niemeyer> hazmat: If people try to force two services on the same machine to be *exposed*, it will fail
[18:41] <niemeyer> hazmat: If they have the same port..
[18:41] <niemeyer> hazmat: If they're not exposed, that's fine..
[18:41] <hazmat> niemeyer, yes.. but if i knew in advance i could avoid conflicts when doing machine placement. with dynamic ports we just allow conflicts in the short term.. but what's the long term solution here?
[18:41] <hazmat> niemeyer, doesn't matter if they're exposed or not
[18:42] <niemeyer> hazmat: Of course it matters
[18:42] <bcsaller1> hazmat: you mean they bind it even if it's not addressable outside the firewall, right?
[18:42] <hazmat> niemeyer, if i have two unexposed services trying to use port 80.. its a conflict regardless of the expose
[18:42] <niemeyer> hazmat: It's not.. each service has its own network space
[18:42] <hazmat> bcsaller1 exactly.. i have my web apps behind a load balancer for example
[18:43] <hazmat> niemeyer, ah, assuming lxc and bridges
[18:43] <niemeyer> hazmat: Yes, assuming the feature we've been talking about :-)
[18:43] <hazmat> ah.. right so this is where we get to ipv6, ic
[18:46] <hazmat> each service gets its own ipv6 address, we route ipv4-to-ipv6 internally, and expose still can't deal with port conflicts, which we can't detect/avoid
[18:46] <bcsaller> hazmat: prove it ;)
[18:47] <RoAkSoAx> nijaba: ping?
[18:47] <niemeyer> hazmat: Yes..
[18:47] <hazmat> bcsaller, it's runtime dynamic; placement is prior to instantiation.. what's to prove
[18:48] <bcsaller> I just haven't seen ipv6 -> ipv4 routing work this way yet
[18:48] <niemeyer> hazmat: In practice, that's a lot of ifs..
[18:48] <bcsaller> not saying it can't, just haven't seen how it plays out yet
[18:48] <hazmat> bcsaller, yeah.. there's a pile of magic dust somewhere
[18:48] <bcsaller> and I think of IBM all of a sudden
[18:49] <niemeyer> bcsaller: Why?
[18:49] <hazmat> i think of nasa.. millions for a space pen that works.. russians use a pencil 
[18:49] <bcsaller> niemeyer: they did commercials with self healing servers and magic pixie dust you sprinkle around the machine room
[18:50] <niemeyer> Nice :)
[18:50] <niemeyer> hazmat: Exactly.. let's design a pencil
[18:50] <hazmat> niemeyer, a pencil is static metadata imo
[18:51] <bcsaller> niemeyer: http://www.youtube.com/watch?v=3nbEeU2dRBg
[18:51] <niemeyer> hazmat: A pencil to me is something that is already working fine today
[18:51] <niemeyer> hazmat: Rather than going after a different fancy pen
[18:52] <hazmat> niemeyer, we can rip out significant parts of code base and simplify them. its development either way.. the point is a pencil is simple
[18:52] <niemeyer> hazmat: You're trying to design the pen that works without gravity..
[18:52] <niemeyer> hazmat: Very easy to write once you have it
[18:52] <niemeyer> hazmat: The pencil is ready
[18:53] <hazmat> niemeyer, so i think we've taken the analogies as far as they go.. the question is what's the problem with static metadata? besides the fact we've already implemented something with known problems
[18:53] <niemeyer> hazmat: I thought the analogy was clear.. static metadata doesn't exist
[18:54] <niemeyer> hazmat: How do you allow a service to offer another port to a different service?
[18:54] <niemeyer> hazmat: How many ports do we put in the static metadata?
[18:54] <niemeyer> hazmat: What if another port is to be opened?
[18:54] <hazmat> niemeyer, the formula declares what it enables via metadata.. allowing for port ranges etc, perhaps associated to a name
[18:55] <niemeyer> hazmat: Yeah.. what if the range is too small for the number of services someone wants to connect to?
[18:55] <niemeyer> hazmat: What if the service could actually work dynamically?
[18:55] <niemeyer> hazmat: And pick a port that is actually open in the current machine rather than forcing a given one?
[18:56] <hazmat> niemeyer, the metadata is only for listen ports a formula offers
[18:56] <niemeyer> hazmat: Since it doesn't really care
[18:56] <niemeyer> hazmat: That's what I'm talking about too
[18:56] <hazmat> it can reserve a range if it wants.. like less than 1% of services are truly dynamic that way
[18:57] <niemeyer> hazmat: All services are dynamic that way.. a single formula can manage multiple services for multiple clients
[18:57] <hazmat> i'd rather design for the rule than the exception, if i get a pencil ;-)
[18:57] <niemeyer> hazmat: Multiple processes
[18:57] <niemeyer> hazmat: We have the pencil.. services are dynamic by nature.. open-port is dynamic by nature
[18:58] <niemeyer> hazmat: it works, today..
[18:58] <hazmat> niemeyer, right.. i can have a formula managing wsgi-app servers, but i can also pick a range of 100, and reserve that block for the processes i'll create
[18:58] <niemeyer> hazmat: Until botchagalupe1 wants to use it for 101 services in his data center
[18:59] <niemeyer> hazmat: Then, even the static allocation doesn't solve the problem you mentioned..
[18:59] <niemeyer> hazmat: Which is interesting
[19:00] <niemeyer> hazmat: Scenario:
[19:00] <hazmat> niemeyer, so you're saying a service has a port per relation
[19:00] <niemeyer> hazmat: 1) User deploys frontend nginx and backend app server in the same machine
[19:00] <niemeyer> hazmat: 2) Both use port 80
[19:00] <niemeyer> hazmat: 3) nginx is the only one exposed..
[19:01] <niemeyer> That's a perfectly valid scenario
[19:01] <niemeyer> hazmat: 4) User decides to expose the app server for part of the traffic
[19:01] <niemeyer> hazmat: Boom..
[19:01] <niemeyer> hazmat: Static allocation didn't help
[19:01] <hazmat> in the static metadata case, we prevent the units from co-existing on the same machine
[19:02] <niemeyer> hazmat: Why?
[19:02] <hazmat> when placing them.. to avoid conflicts
[19:02] <niemeyer> hazmat: The scenario above works..
[19:02] <niemeyer> hazmat: 1-3 is perfectly fine
[19:03] <hazmat> say i end up with varnish or haproxy on the same instance for a different service and i want to expose it.. 
[19:04] <hazmat> same problem
[19:04] <niemeyer> hazmat: Yep.. that's my point.. it's an inherent problem.. it exists with open-port or with dynamic allocation
[19:04] <hazmat> in the static scenario we prevent by not placing it on a machine with conflicting port metadata
[19:04] <niemeyer> hazmat: We need to solve it in a different way
[19:04] <niemeyer> hazmat: Again, 1-3 is perfectly fine
[19:04] <hazmat> 1) is not the case, they won't be deployed on the same machine with static metadata
[19:04] <niemeyer> hazmat: There's no reason to prevent people from doing it
[19:05] <hazmat> hmm
[19:06] <hazmat> it is rather limiting to get true density
[19:06] <hazmat> with static metadata
[19:07] <niemeyer> hazmat: My suggestion is that we address this problem within the realm of placement semantics
[19:07] <niemeyer> hazmat: In more realistic stacks (!) admins will be fine-tuning aggregation
[19:08] <hazmat> niemeyer, that's the problem/pov i'm looking at this from.. placement has no data about the thing it's about to deploy, just about the current ports of each machine.
[19:08] <hazmat> niemeyer, you mean moving units?
[19:09] <niemeyer> hazmat: No, I mean more fine-tuned aggregation
[19:09] <hazmat> or just doing manual machine selection placement
[19:09] <niemeyer> hazmat: Not manual machine selection per se
[19:09] <niemeyer> hazmat: Machines have no names.. don't develop love for them.. ;)
[19:10] <hazmat> niemeyer, absolutely.. they're so unreliable ;-)
[19:10] <niemeyer> LOL
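[Editor's aside: the static-metadata placement hazmat argues for above can be sketched in a few lines: if formulas declared their listen ports up front, unit placement could skip machines whose resident units already claim a conflicting port. This is entirely illustrative of the proposal under debate, not Ensemble code.]

```python
def find_machine(machine_ports, wanted_ports):
    # machine_ports: machine id -> set of ports already claimed by units.
    # Return the first conflict-free machine, else None (provision a new one).
    for machine_id in sorted(machine_ports):
        if not machine_ports[machine_id] & wanted_ports:
            return machine_id
    return None

machines = {"m0": {80, 443}, "m1": {5432}}
assert find_machine(machines, {80}) == "m1"    # m0 already binds 80
assert find_machine(machines, {8080}) == "m0"  # no conflict anywhere
assert find_machine(machines, {80, 5432}) is None
```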
[19:37] <_mup_> Bug #829734 was filed: PyFlakes cannot check Ensemble source <Ensemble:New> < https://launchpad.net/bugs/829734 >
[19:43] <_mup_> ensemble/fix-pyflakes r322 committed by jim.baker@canonical.com
[19:43] <_mup_> Remove dict comprehension usage to support PyFlakes
[19:49] <jimbaker`> bcsaller, hazmat - i have a trivial in lp:~jimbaker/ensemble/fix-pyflakes that allows pyflakes to work again for the entire source tree
[19:50] <hazmat> jimbaker`, awesome, there's a bug for py 2.6 compatibility that it can link to as well
[19:50] <hazmat> afaics
[19:50] <jimbaker`> hazmat, yeah, that's probably the source of the 2.6 bug
[19:50] <hazmat> dict comprehensions were the only 2.7 feature we were using
[19:50] <jimbaker`> hazmat, they're nice, but just not yet unfortunately
[19:51] <jimbaker`> i'll mention this to fwereade so we can avoid it for the time being
[19:55] <_mup_> ensemble/stack-crack r323 committed by kapil.thangavelu@canonical.com
[19:55] <_mup_> allow config of an ec2 keypair used for launching machines
[19:58] <jimbaker`> hazmat, so if that trivial looks good, i will commit and mark those bugs as fix released
[19:58] <jimbaker`> (to trunk)
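[Editor's aside: the rewrite in the fix-pyflakes branch is mechanical: Python 2.7's dict comprehension has a 2.6-compatible spelling using dict() over a generator, which is what the branch switches to. The example data is made up.]

```python
ports = [("http", 80), ("https", 443)]

# Python 2.7+ only, and rejected by the packaged pyflakes:
#   mapping = {name: port for name, port in ports}

# Python 2.6-compatible equivalent, builds the same mapping:
mapping = dict((name, port) for name, port in ports)
assert mapping == {"http": 80, "https": 443}
```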
[19:59] <jcastro_> negronjl: about how long does it take the mongo formula to deploy?
[19:59] <jcastro_> like, if I ssh in and I type mongo and it doesn't find it, then I've obviously ssh'ed in too early? :)
[20:02] <jcastro_> also, rs.status() returns "{ "errmsg" : "not running with --replSet", "ok" : 0 }"
[20:02] <hazmat> jcastro_, if ensemble status says running it should be running
[20:03] <hazmat> er. started
[20:04] <jcastro_> aha, it takes about a minute
[20:05] <negronjl> jcastro:  you got it .... about a minute or so
[20:07] <jcastro_> negronjl: ok, the second db.ubuntu.find() shows the same results as the first one, how do I know that's on other nodes?
[20:07] <jcastro_> or do you just know because that's what rs.status() already showed?
[20:08] <negronjl> jcastro:  you don't really know ( without a bunch of digging ) what's on which node 
[20:08] <jcastro_> right, I see, that's the point. :)
[20:08] <hazmat> jimbaker`, also this has a fix for the cli help.. ignoring the plugin implementation http://pastebin.ubuntu.com/670338/
[20:08] <hazmat> jimbaker`, sans it, the default help is bloated out by config-set on ./bin/ensemble -h
[20:12] <jimbaker`> hazmat, you mean lines 46-50 of the paste?
[20:12] <jimbaker`> sure, we should pull that in
[20:13] <jimbaker`> can also use the docstring cleanup too
[20:13] <hazmat> jimbaker`, well pretty much all the changes to commands in that diff are docstring cleanup
[20:13] <hazmat> the stuff in __init__ and tests can be ignored
[20:17] <hazmat> jimbaker`, fix-pyflakes looks good +1
[20:17] <jimbaker`> hazmat, thanks!
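The general pattern in that paste — keep the top-level listing terse and push the long docstrings into per-command help — can be sketched with argparse (a sketch only; Ensemble's actual option parsing may have differed, and the command text here is invented):

```python
import argparse

# Sketch: a short `help` string appears in the top-level `ensemble -h`
# listing, while the long `description` only shows up when the user
# asks for `ensemble config-set -h`, so one verbose command can't
# bloat the default help.
parser = argparse.ArgumentParser(prog="ensemble")
subparsers = parser.add_subparsers(dest="command")

config_set = subparsers.add_parser(
    "config-set",
    help="set service options",
    description="Set one or more configuration options for a service. "
                "Options are validated against the formula's config schema.")

top_help = parser.format_help()   # compact: one line per command
sub_help = config_set.format_help()  # full description for this command
```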
[20:21] <hazmat> hmm.. looks like the nova objectstore namespace is flat
[20:23] <hazmat> odd, the code looks like it should work, it's storing against the hexdigest of the name
[20:24] <_mup_> ensemble/trunk r322 committed by jim.baker@canonical.com
[20:24] <_mup_> merge fix-pyflakes [r=hazmat][f=829531,829734]
[20:24] <_mup_> [trivial] Remove use of dict comprehensions to preserve Python 2.6
[20:24] <_mup_> compatibility and enable PyFlakes to work with Ensemble source.
[21:03] <niemeyer> hazmat: Any chance of a second review here: https://code.launchpad.net/~fwereade/ensemble/cobbler-shutdown/+merge/71391
[21:03] <niemeyer> With that one handled, we'll have a clean Friday! :-)
[21:04] <niemeyer> sidnei: People are begging for your Ensemble talk at Python Brasil :)
[21:16] <_mup_> Bug #828885 was filed: 'relation-departed' hook not firing when relation is set to 'error' state <Ensemble:New> < https://launchpad.net/bugs/828885 >
[21:21] <hazmat> niemeyer, sure
[21:21] <hazmat> ugh.. it's big
[21:21] <niemeyer> hazmat: Why was this reopened ^?
[21:22] <hazmat> niemeyer, just had a talk with mark about it .. it's not really about relation-broken being invoked, it's more about whether, if a service unit is in an error state, the other side should know about it
[21:22] <hazmat> take a broken service out of a rotation
[21:22] <hazmat> i guess we try not to associate relation state to overall service status
[21:23] <niemeyer> hazmat: That's what I understood from the original description
[21:23] <niemeyer> hazmat: As I've mentioned in the bug, I don't think killing a service like this is the right thing to do
[21:23] <niemeyer> hazmat: A _hook_ has failed, not a connection
[21:23] <niemeyer> hazmat: In other words, we take a slightly bad situation, and make the worst out of it by actually killing the service
[21:24] <hazmat> niemeyer, yeah.. fair enough, i forget if resolved handles that
[21:24] <hazmat> niemeyer, it's not about killing the service though
[21:24] <hazmat> niemeyer, it's about informing the other end of the relation that something is wrong
[21:24] <hazmat> other relations of the service continue to operate normally
[21:24] <niemeyer> hazmat: It definitely is.. that's what relation-departed does
[21:24] <niemeyer> hazmat: The relation wasn't departed
[21:25] <niemeyer> hazmat: There's an erroneous situation due to a human bug
[21:25] <hazmat> niemeyer, relation-depart is just saying a unit has been removed..
[21:25] <hazmat> it can re-appear later with a join
[21:25] <niemeyer> hazmat: Exactly, and it has not
[21:25] <niemeyer> hazmat: Imagine the situation.. blog up.. small bug in relation-changed
[21:26] <niemeyer> hazmat: "Oh, hey! There's a typo in your script! BOOM! Kill database connection."
[21:26] <hazmat> niemeyer, but conversely do we allow for the other scenario to be true.. a web app server and proxy, the web app server is dead, its rel hook errors, and the proxy continues to serve traffic to it
[21:27] <niemeyer> hazmat: Yes, that sounds like the most likely way to have things working
[21:27] <hazmat> m_3, anything to add?
[21:27] <niemeyer> hazmat: We can't assume it's fine to take services down at will without user consent
[21:28] <niemeyer> hazmat: The user desire was to have that relation up..
[21:28] <m_3> the web app <-> proxy relationship you described is a good example
[21:28] <niemeyer> hazmat: There was an error because of an improper handling of state that can't be implied as "impossible to serve"
[21:28] <m_3> the one I was seeing was at spinup
[21:28] <hazmat> niemeyer, indeed, i remember now why it was done this way
[21:29] <niemeyer> m_3, hazmat: Note that this is different from a machine going off
[21:29] <niemeyer> m_3, hazmat: Or network connectivity being disrupted, etc
[21:29] <m_3> spin up 20 units of a related service
[21:29] <m_3> a third of them failed, but the primary service still had configured state for the failed units
[21:29] <m_3> that cleanup is what I'm targeting
[21:30] <niemeyer> m_3: Define "failed"
[21:30] <m_3> test case was a relation-changed hook that just "exit 1"
[21:30] <m_3> the one where a third were failing was NFS clients trying to mount
[21:31] <niemeyer> m_3: We can try to be smart about this in the future, and take down relations if there is more than one unit in it, for instance
[21:31] <niemeyer> m_3: That situation is not a good default, though
[21:32] <niemeyer> m_3: Note how your exit 1 does not imply in any way that the software running the service was broken
[21:32] <m_3> understand... we can choose to not implement... just wanted to surface the issue
[21:32] <m_3> so bringing clients up slowly works fine
[21:32] <niemeyer> m_3: It implies relation-changed was unable to run correctly for whatever reason
[21:32] <m_3> rewriting clients to retry a couple of times works
[21:32] <niemeyer> m_3: Right, but do you understand where I'm coming from?
[21:32] <m_3> yes, totally
[21:33] <m_3> turning a machine off in a physical infrastructure is a good example
[21:33] <m_3> haproxy and varnish are written to be tolerant against this eventuality
[21:33] <m_3> would be nice if we could provide this though
[21:34] <niemeyer> m_3: Hmm.. it sounds like we're still talking about different things
[21:34] <niemeyer> m_3: Ensemble _will_ handle disconnections, and _will_ take the relation down
[21:34] <m_3> sorry if I'm not explaining this well
[21:34] <niemeyer> m_3: you're explaining it well, but I feel like we're making disjoint points
[21:34] <m_3> it leaves the relation in an "error" state for the units where relation-changed hook exited poorly
[21:35] <m_3> that's not taking the relation down
[21:35] <m_3> there's no way for the "server" to know that anything wrong has happened
[21:35] <niemeyer> m_3: This is not a disconnection.. an error in a relation-changed script doesn't imply in any way that the service is down
[21:35] <m_3> it could do a relation-list and check on things... if something got fired
[21:36] <m_3> hmmm... yes, I've been focusing on relation-changed during startup
[21:36] <niemeyer> m_3: But if you turn the network down on the service, or if say, the kernel wedges.. Ensemble will take the relation down.
[21:36] <m_3> for services that often don't start until relation-changed (not in start)
[21:37] <niemeyer> m_3: Even in those cases, we can't tell whether the service is necessarily down or not
[21:37] <_mup_> ensemble/stack-crack r324 committed by kapil.thangavelu@canonical.com
[21:37] <_mup_> don't use namespaced storage keys, use a flat namespace
[21:37] <niemeyer> m_3: Since we don't know what happened
[21:38] <niemeyer> m_3: In a situation where that was a critical service, the most likely scenario to have it working is to allow the relation to stay up while the admin sorts it out
[21:39]  * m_3 wheels turning
[21:39] <m_3> how does ensemble respond to a kernel wedge (your example above)
[21:40] <niemeyer> m_3: That situation puts the machine agent and the unit agent unresponsive, which will eventually cause a timeout that will force all of its relations down
[21:40] <hazmat> m_3, it will get disconnected from zookeeper and then the opposite end of the relation will see a 'relation-depart' hook exec
[21:40] <m_3> right... so "framework" or "infrastructure"-wise... that change is registered
[21:41] <hazmat> m_3, more than framework.. the opposite relation endpoints see the disconnection
[21:41] <m_3> but it tries to stay ignorant of service semantics
[21:41] <niemeyer> m_3: For now..
[21:41] <m_3> right, I can clean up when that happens
[21:41] <niemeyer> m_3: We want to go there, eventually
[21:42] <m_3> ok, this really goes to all of the bugs about relation-status
[21:42] <m_3> thanks for the discussion guys!
[21:42] <m_3> s/bugs/feature requests/
[21:47] <niemeyer> m_3: np!
[21:47] <niemeyer> m_3: I think there's more we need to talk about in this area
[21:47] <_mup_> ensemble/stack-crack r325 committed by kapil.thangavelu@canonical.com
[21:47] <_mup_> allow txaws branch usage from an ensemble env
[21:48] <niemeyer> m_3, hazmat: I'm personally concerned about even that scenario, for instance, when the unit agent goes off
[21:48] <m_3> niemeyer: I'll write up my use cases that need relation state info
[21:48] <hazmat> niemeyer, how so?
[21:48] <niemeyer> hazmat: We need to find a way to restart the unit agent without killing relations
[21:48] <hazmat> niemeyer, we can do that now, we just need to reconnect to the same session
[21:48] <niemeyer> hazmat: In the next incarnation, the logic that puts the ephemeral nodes in place must take into account they might already be there
[21:49] <niemeyer> hazmat: Kind of
[21:49] <niemeyer> hazmat: We don't expect to find previous state, I believe
[21:49] <hazmat> niemeyer, let's be clear it's not killing a relation, it's a transient depart and join for the same unit
[21:49] <hazmat> niemeyer, we do find the same relation state
[21:50] <hazmat> the unit's relation state is the same across a depart/join... even if the client is disconnected, the relation settings are persistent
[21:50] <hazmat> there's a separate ephemeral node for active presence
[21:50] <m_3> 1.) formula tests need to know when hooks execute, 2.) relations that depend on another relation's state, and 3.) various kinds of relation failures
[21:50] <niemeyer> hazmat: That's how a relation is killed!
[21:51] <niemeyer> hazmat: Formulas take state down on depart
[21:51] <hazmat> niemeyer, that's how a service unit's participation in a relation is killed and resurrected
[21:51] <niemeyer> hazmat: Yes.. and correct formulas will clean state/block firewall/etc on depart!
[21:51] <hazmat> the relation itself is a semantic notion between services, its only killed when the user removes the relation
[21:52] <hazmat> niemeyer, and they will open it back up when it comes back
[21:52] <niemeyer> hazmat: The way that the formula knows a relation has been removed is through the relation-joined/departed! :-)
[21:52] <niemeyer> hazmat: A bit shocked to be stating this :)
[21:52] <hazmat> :-)
[21:52] <hazmat> niemeyer, to a formula, a relation has been removed upon execution of relation-broken
[21:53] <hazmat> and created upon first execution of any join
[21:53] <niemeyer> hazmat: No, relation-broken means it has been taken down by itself
[21:53] <niemeyer> hazmat: relation-departed means "The remote end left.. clean up after yourself."
[21:53] <hazmat> right, but if i have 5 other units in a relation, and one goes away, i don't say the relation is removed
[21:53] <niemeyer> hazmat: The relation between the two units has been _dropped_...
[21:53] <m_3> I'm confused about difference between relation taken down and related unit taken down
[21:54] <niemeyer> hazmat: State may be removed.. etc
[21:54] <hazmat> niemeyer, the state is service level typically, unit level state about remote ends is access, and that can be granted/restored
[21:54] <niemeyer> m_3: A relation is established between services.. that's the ideal model the admin has stated he wanted
[21:55] <hazmat> in general though it should be possible that a unit transiently departs a relation and comes back to find things working with the same access and state
[21:55] <niemeyer> m_3: Service units join and depart the relation based on realistic behavior
[21:55] <m_3> right, but all of my examples above retain the relation and just drop units
[21:55] <niemeyer> hazmat: Agreed on the first point, disagreed strongly on the second one.
[21:55] <hazmat> niemeyer, for example consider a network split.. it's a transient disconnect and reconnect.. the relation isn't dead, that's between the services, the disconnected unit's participation in the relation is temporarily removed
[21:56] <niemeyer> hazmat: """
[21:56] <niemeyer> <relation name>-relation-departed - Runs upon each time a remote service unit leaves a relation. This could happen because the service unit has been removed, its service has been destroyed, or the relation between this service and the remote service has been removed.
[21:56] <niemeyer> An example usage is that HAProxy needs to be aware of web servers when they are no longer available. It can remove each web server from its configuration as the corresponding service unit departs the relation.
[21:56] <niemeyer> """
[21:56] <niemeyer> hazmat: This is our documentation.
[21:56] <niemeyer> hazmat: It's been designed that way.. relation-departed runs, connection should be _down_..
[21:57] <hazmat> hmm.. that's unfortunate, if a service has been destroyed that should be under relation-broken
[21:57] <niemeyer> hazmat: Nope
[21:57] <niemeyer> hazmat: """
[21:57] <niemeyer> <relation name>-relation-broken - Runs when a relation which had at least one other relation hook run for it (successfully or not) is now unavailable. The service unit can then clean up any established state.
[21:57] <niemeyer> An example might be cleaning up the configuration changes which were performed when HAProxy was asked to load-balance for another service unit.
[21:57] <niemeyer> """
[21:57] <niemeyer> hazmat: That's how it's been designed
[21:58] <niemeyer> Which is why I bring my original point back: we need to ensure that restarts keep the relation up
[21:58] <hazmat> well i have some doubts that it's implemented that way ... broken is always the final step of cleanup when destroying a relation
[21:59] <niemeyer> hazmat: If it's not that way, it's a serious bug we should fix.. I certainly reviewed it against that assumption
[21:59] <niemeyer> hazmat: We wrote that document jointly as well
[21:59] <hazmat> niemeyer, i think that doc needs changing... depart is called when a unit is removed
[22:00] <hazmat> niemeyer, i think some editing and updating got done on it post implementation
[22:00] <niemeyer> hazmat: "This could happen because the service unit has been removed"
[22:00] <niemeyer> hazmat: ?
[22:00] <hazmat> it can happen for any number of reasons
[22:00] <niemeyer> hazmat: Yes, they seem listed there.. what's wrong specifically?
[22:00] <hazmat> network split, explicit removal of unit, etc.. the only significance is that the remote end isn't there
[22:01] <hazmat> one of them that is
[22:01] <hazmat> relation level cleanup.. removing a database, etc. should happen in relation-broken
[22:01] <hazmat> only unit level cleanup should happen in depart
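The split hazmat is arguing for — unit-level cleanup in relation-departed, relation-level cleanup in relation-broken — could be sketched like this (names and data shapes are illustrative, not Ensemble's actual hook environment):

```python
# Illustrative sketch of the departed/broken split: departed removes
# only the one remote unit's state, broken tears down everything the
# relation established. All names here are hypothetical.
def relation_departed(backends, remote_unit):
    """Unit-level cleanup: drop just the departing unit; the
    relation between the services is still alive and the unit
    may rejoin later."""
    return [b for b in backends if b["unit"] != remote_unit]

def relation_broken(config):
    """Relation-level cleanup: the user severed the relation, so
    remove everything established for it."""
    config.pop("db_relation", None)
    return config

backends = [{"unit": "wordpress/0"}, {"unit": "wordpress/1"}]
remaining = relation_departed(backends, "wordpress/1")
config = relation_broken({"db_relation": {"host": "10.0.0.9"}, "port": 80})
```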
[22:01] <niemeyer> hazmat: We'll have to talk on monday about this..
[22:01] <m_3> is there any difference between the events fired for timeouts -vs- those fired for remove-relation calls?
[22:01] <niemeyer> hazmat: That's not how it's been designed, and is certainly not what we talked about when we planned it
[22:02] <hazmat> niemeyer, if i do a remove-unit, the remote end will get a depart
[22:02] <hazmat> that doesn't mean blow up the database
[22:02] <niemeyer> hazmat: It means remove the access from the other end
[22:02] <niemeyer> hazmat: Nothing should mean "blow up the database", ever
[22:02] <hazmat> niemeyer, right, but not the five other units that are still in the relation
[22:03] <niemeyer> hazmat: Yes.. remove the access from the unit that has departed
[22:03] <hazmat> but if i see broken, the relation is finished.. it won't ever come back
[22:03] <niemeyer> hazmat: Not at all
[22:03] <hazmat> and i can do service level relation cleanup
[22:03] <hazmat> niemeyer, it will be a new relation if it does
[22:03] <niemeyer> hazmat: If network connectivity terminates, it should get relation-broken
[22:04] <hazmat> niemeyer, who gets it and why?
[22:04] <niemeyer> hazmat: Again, the docs explain
[22:04] <hazmat> niemeyer, if they see a network split from a single related unit, they get a depart
[22:04]  * hazmat goes to read
[22:06] <hazmat> niemeyer, don't see it
[22:07] <hazmat> a relation is never broken till the user severs it
[22:07] <niemeyer> hazmat: Who gets it:
[22:07] <niemeyer> """
[22:07] <niemeyer> Runs when a relation which had at least one other relation hook run for it (successfully or not) is now unavailable. The service unit can then clean up any established state.
[22:07] <niemeyer> """
[22:07] <niemeyer> and why too, in fact..
[22:08] <hazmat> like i said the docs need cleanup.. we can discuss design considerations on monday if need be.. but afaics the semantics are correct
[22:09] <hazmat> relation-broken is effectively a relation-destroyed hook
[22:10] <hazmat> m_3, no there isn't
[22:10] <niemeyer> hazmat: Regardless, the original point remains..
[22:11] <niemeyer> hazmat: relation-joined should be sustained across restarts
[22:11] <hazmat> niemeyer, you mean it shouldn't be executed across an agent restart?
[22:11] <niemeyer> hazmat: Right.. the relation should remain up
[22:11] <hazmat> niemeyer, like i said originally if we can reattach the session that's trivial as is
[22:12] <niemeyer> hazmat: I didn't say otherwise.. I pointed out the behavior of relation-joined, pointed out it doesn't work, and pointed out we should watch out next
[22:12] <niemeyer> hazmat: You seem to agree now, so that's a good base to move on
[22:14] <hazmat> niemeyer, indeed we do need to check for the ephemeral nodes before blindly recreating them
[22:14] <hazmat> which would fail currently
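The fix being agreed on — tolerate an already-existing ephemeral presence node when the agent restarts and reattaches its session — might look roughly like this (a sketch; `FakeZookeeper` stands in for the real client, and the node path is made up):

```python
# Sketch of restart-safe presence-node setup: instead of failing when
# the ephemeral node already exists (e.g. the previous session is
# being reattached), reuse it. The client here is an in-memory fake.
class NodeExistsException(Exception):
    pass

class FakeZookeeper(object):
    def __init__(self):
        self.nodes = {}

    def create(self, path, data="", ephemeral=False):
        if path in self.nodes:
            raise NodeExistsException(path)
        self.nodes[path] = data

def establish_presence(client, path):
    """Create the unit's ephemeral presence node, reusing one left
    over from the previous (reattached) session if present."""
    try:
        client.create(path, ephemeral=True)
        return "created"
    except NodeExistsException:
        return "reused"

zk = FakeZookeeper()
first = establish_presence(zk, "/relations/rel-1/unit-0/presence")
second = establish_presence(zk, "/relations/rel-1/unit-0/presence")
```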
[22:14] <niemeyer> hazmat: Phew.. woohay agreement
[22:14] <hazmat> niemeyer, i never disagreed with that, the conversation went sideways to something different
[22:14] <niemeyer> Exactly
[22:15] <niemeyer> hazmat: You disagreed with the behavior of joined, but it doesn't really matter now.
[22:15] <niemeyer> hazmat: re. broken.. reading the code.. it sounds like the behavior you described is actually more useful indeed
[22:16] <hazmat> niemeyer, agreed
[22:16] <niemeyer> Double agreement! Score! :-)
[22:16] <hazmat> :-) the docs need updating
[22:17] <hazmat> just in time for the weekend, i should head out on that note ;-)
[22:17] <m_3> later man... thanks for the help
[22:17] <hazmat> more openstack to do.. needed to adjust to deploy a txaws branch for ensemble
[22:17] <hazmat> m_3, cheers
[22:18]  * hazmat grabs some caffeine
[22:18] <niemeyer> hazmat: Not entirely surprised about that debate on broken
[22:19] <niemeyer> hazmat: Looking through my mail, we've had very little debate on it
[22:19] <hazmat> niemeyer, i think we discussed it in brazil sprint and voice meetings
[22:20] <niemeyer> hazmat: Hmm
[22:20] <niemeyer> hazmat: I'm still not sure about it
[22:20] <hazmat> niemeyer, looks like we had a long discussion on list oct 2010 re
[22:20] <niemeyer> hazmat: relation-broken seems to be called on stop()
[22:21] <hazmat> hmm
[22:21] <niemeyer> hazmat: Which would put its behavior closer to the documented
[22:22] <hazmat> niemeyer, where do you see that?
[22:22] <hazmat> i'm looking at unit/lifecycle
[22:22] <niemeyer> Me too
[22:22] <niemeyer>                 yield workflow.transition_state("down")
[22:22] <hazmat> on stop we do a rel down transition
[22:23] <hazmat> niemeyer, right that doesn't execute broken
[22:23] <hazmat> niemeyer, it actually doesn't execute anything on a relation
[22:23] <niemeyer> hazmat: Ah, there's down_departed
[22:23] <hazmat> ah, you're looking at the workflow
[22:24] <hazmat> niemeyer, those are for when the relation is broken while the relation was down
[22:24] <hazmat> we still execute the relation-broken hook to give a final chance of cleanup 
[22:25] <m_3> sorry... relation broken while down?
[22:25] <hazmat> m_3, if the relation is in a down/error state, we still execute the relation-broken hook on a unit if the relation between the services is removed
[22:26] <m_3> ah, gotcha
[22:31] <niemeyer> hazmat: There's some name clashing in the code.. we call depart when we mean break in a few cases
[22:32] <hazmat> niemeyer, depart is always break
[22:32] <niemeyer> hazmat: Except when it's not.. :-)
[22:32] <niemeyer> hazmat: relation-departed
[22:32] <hazmat> niemeyer, ah.. right.. yeah. there's a name indirection there
[22:33] <hazmat> niemeyer, yeah.. i see what you mean
[22:33] <niemeyer> hazmat: It's all good, though.. you are right, we need to fix docs for broken
[22:35] <niemeyer> hazmat: I wonder if we can simplify the logic around that workflow significantly in the future, with a more direct state machine
[22:36] <niemeyer> hazmat: self._current_state.. self.relation_joined().. self.relation_changed().. etc
[22:36] <hazmat> niemeyer, you mean fold the lifecycle and workflows together?
[22:36] <niemeyer> hazmat: Yeah
[22:36] <hazmat> yeah.. possibly it was useful for some contexts like resolved where having the separate decision points was very useful
[22:37] <hazmat> to distinguish things like with hooks retry vs. not but that could be encapsulated differently
[22:37] <niemeyer> hazmat: Still.. we could probably come up with a way to encode the changes into functions themselves
[22:38] <hazmat> or when we decided to execute change after join always
[22:38] <nijaba> RoAkSoAx: pong (late)
[22:38] <niemeyer> hazmat: Anyway.. random wish to make it simpler really.. maybe not possible, don't know..
[22:39] <hazmat> niemeyer, yeah.. it does feel like a redundant layer through most of the workflow
[22:39] <hazmat> workflow.py that is
[22:39] <niemeyer> Right
[22:39] <hazmat> niemeyer, yeah.. i thought about just having functions attached as transition actions directly on the state machine
[22:40] <hazmat> that was actually one of the original designs, but per discussion we wanted to keep things to as pure of a state machine as possible
[22:40] <hazmat> i just went with something as static and simple as possible in the workflow def 
[22:40] <hazmat> but the extra layer there hasn't really proved useful.. 
[22:41] <hazmat> its always effectively a one liner to the lifecycle method from the workflow
[22:41] <niemeyer> hazmat: Yeah.. I mean really having two layers.. e.g.
[22:41] <niemeyer> def relation_joined():
[22:41] <niemeyer>     ... do stuff
[22:42] <niemeyer> def start():
[22:42] <niemeyer>     ... call start hook ...
[22:42] <niemeyer>     self._state = "started"
[22:42] <niemeyer> etc
[22:42] <hazmat> there's global state to manage on some of these though
[22:42] <niemeyer> Then, another class
[22:42] <niemeyer> err = hooks.install()
[22:42] <niemeyer> if err == nil:
[22:42] <niemeyer>     hooks.start()
[22:42] <niemeyer> etc
[22:43] <niemeyer> This feels easier to grasp/manipulate somehow
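A minimal runnable version of the flattened design niemeyer sketches above — hook execution and state transitions in one class rather than a workflow layer delegating to a separate lifecycle object (method, state, and hook names are illustrative, not Ensemble's actual workflow):

```python
class UnitLifecycle(object):
    """Single-layer sketch: the state lives on the same object that
    runs the hooks, instead of workflow.py holding one-liner actions
    that call into a separate lifecycle class."""

    def __init__(self):
        self.state = "installed"
        self.log = []

    def _run_hook(self, name):
        # Stand-in for invoking the formula's hook script.
        self.log.append(name)

    def start(self):
        self._run_hook("start")
        self.state = "started"

    def relation_joined(self, relation):
        self._run_hook("%s-relation-joined" % relation)
        # Per the discussion: changed always runs after joined.
        self._run_hook("%s-relation-changed" % relation)

unit = UnitLifecycle()
unit.start()
unit.relation_joined("db")
```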
[22:43] <hazmat> the lifecycle methods should correspond directly to those hooks.*
[22:43] <hazmat> we could hook them up directly to the workflow def
[22:43] <niemeyer> hazmat: Yeah, I know it's not too far.. we just have a few "padding layers" there
[22:44] <niemeyer> hazmat: But I think we also need some separation in a few cases.. we don't have that external driver that says what to do
[22:44] <hazmat> yeah.. it should be easy to drop all the action methods on workflow, and have the transition action directly invoke the lifecycle method
[22:44] <niemeyer> hazmat: Feels a bit like inverting responsibility
[22:44] <niemeyer> hazmat: Right, that's what I'm trying to get to if I see what you mean
[22:44] <hazmat> anyways.. i should get back to openstack.. i need to sign off soon
[22:44] <hazmat> niemeyer, i do
[22:44] <niemeyer> hazmat: Awesome, have a good weekend.. I should be off in a bit too
[22:45] <hazmat> niemeyer, have a good weekend
[22:45] <niemeyer> Cheers!
[22:46] <m_3> great weekend guys... thanks
[22:57] <hazmat> nice.. txaws running from branch.. 
[22:57]  * hazmat crosses fingers on openstack deploy
[22:57] <hazmat> sweet, deploy working!
[22:57]  * hazmat does a dance
[23:30] <niemeyer> hazmat: WOOT!
[23:30] <niemeyer> hazmat: Man.. that requires beer
[23:31] <niemeyer> I'll step out immediately to get some :-)
[23:31] <niemeyer> A good weekend to all!
[23:51] <_mup_> Bug #829829 was filed: test_service_unit_removed eventually fails <Ensemble:New> < https://launchpad.net/bugs/829829 >