jason_ | SpamapS, I'm getting an invalid ssh key now -- I did juju status, and it told me the key had been changed (from my many reinstalls no doubt) and asked to accept or no -- I said no, meaning to cancel out and delete the known hosts file and retry, and now it's Invalid SSH key each time | 00:16 |
---|---|---|
jason_ | ok, I copied all my .ssh files from the client I'd been working on to the server -- seems to have gotten me past that bit | 00:24 |
jason_ | juju status completed -- looks like my first system is in place | 00:30 |
hazmat | jason_, woot! | 00:52 |
_mup_ | juju/go-store r18 committed by gustavo@niemeyer.net | 01:08 |
_mup_ | Implemented URL.WithRevision. | 01:08 |
jason_ | hazmat, mysql deploy success, too... | 01:11 |
niemeyer | jason_: ho ho | 01:26 |
_mup_ | juju/go-store r19 committed by gustavo@niemeyer.net | 03:58 |
_mup_ | New store package with AddCharm and OpenCharm interface. | 03:58 |
_mup_ | The interface to the package is trivial, but internally it actually | 03:58 |
_mup_ | handles all the necessary logic for concurrent runs of the algorithm, | 03:58 |
_mup_ | including mongo-based atomic locks with expiration, multi-URL synchronous | 03:58 |
_mup_ | revision bumping as described in the charm specification, GridFS-based | 03:58 |
_mup_ | memory-friendly uploading for large files, and ponies too. | 03:58 |
_mup_ | Lacks documentation and sha256 handling, though.. but I need some sleep. | 03:58 |
niemeyer | Night all | 04:11 |
_mup_ | juju/expose-retry r402 committed by jim.baker@canonical.com | 06:04 |
_mup_ | Support retrying port mgmt ops in periodic machine check | 06:04 |
_mup_ | Bug #872164 was filed: [Oneiric] Cannot deply services - store.juju.ubuntu.com not found <juju:New> < https://launchpad.net/bugs/872164 > | 08:21 |
jamespage | morning - I took the liberty of pointing the bug reporter for bug 872164 in the right direction and marked the bug as invalid | 08:48 |
_mup_ | Bug #872164: [Oneiric] Cannot deply services - store.juju.ubuntu.com not found <juju:Invalid> < https://launchpad.net/bugs/872164 > | 08:48 |
fwereade_ | thanks jamespage, I just saw, much better response than mine | 08:52 |
jamespage | fwereade_, np | 08:52 |
jamespage | I think I must be missing something: should the stop hook be called when a unit is removed from a service using remove-unit? | 09:37 |
rog | where can i find documentation for txaws? | 11:19 |
rog | oops, LMGTFY | 11:19 |
hazmat | good morning | 12:07 |
hazmat | fwereade_, the docs still look out of date.. https://juju.ubuntu.com/docs/user-tutorial.html#deploying-service-units | 12:09 |
hazmat | i think jimbaker mentioned yesterday they weren't regenerating | 12:09 |
hazmat | jamespage, on bug 871966 when you say local juju environment you mean a local provider? | 12:15 |
_mup_ | Bug #871966: FQDN written to /etc/hosts causes problems for clustering systems <cloud-init (Ubuntu):Confirmed> <cassandra (juju Charms Collection):New> < https://launchpad.net/bugs/871966 > | 12:15 |
hazmat | jamespage, the stop hook is not called | 12:17 |
hazmat | jamespage, pretty much everything that deals with remove/destroy works one level up from the supervisor of the thing being killed | 12:18 |
hazmat | with the notion that even if the thing is AWOL, the action will happen | 12:18 |
rog | hazmat: hiya | 12:20 |
hazmat | rog, txaws is pretty much UTSL for most questions imo | 12:20 |
rog | hazmat: yeah, i discovered that. thanks. | 12:22 |
rog | foundations of sand :-) | 12:22 |
hazmat | rog, not really.. its well tested. but yeah.. its a consequence of using twisted, vs using the standard python library for aws (boto ) | 12:24 |
rog | uh huh | 12:24 |
hazmat | hmm.. interesting | 12:25 |
jamespage | hazmat: the comment on bug 871966 does refer to the local provider - but that provides an IP address for private-address anyway | 12:31 |
_mup_ | Bug #871966: FQDN written to /etc/hosts causes problems for clustering systems <cloud-init (Ubuntu):Confirmed> <cassandra (juju Charms Collection):New> < https://launchpad.net/bugs/871966 > | 12:31 |
hazmat | jamespage, yup and private-address==public-address there | 12:31 |
hazmat | and it shows up in juju status | 12:31 |
jamespage | hazmat: I now have something that works with the local provider, and on ec2 and openstack | 12:32 |
hazmat | jamespage, nice | 12:32 |
hazmat | jamespage, comments about the local provider probably aren't relevant on a cloud-init bug, since the local provider doesn't use cloud-init.. fwiw | 12:32 |
jamespage | hazmat: they more referred to the fix for cassandra | 12:33 |
hazmat | ah.. ic. its linked | 12:33 |
jamespage | yep | 12:33 |
jamespage | hazmat: with regards to units leaving a service/not calling stop I was trying to figure out the best way to remove a node from a cassandra cluster | 12:36 |
jamespage | because the node does not get shutdown, it remains in the ring | 12:36 |
=== plars-holiday is now known as plars | ||
robbiew | rog: ping | 12:45 |
rog | robbiew: pong | 12:46 |
robbiew | rog: have you registered for UDS? | 12:46 |
rog | robbiew: i think so.... but i'll just check | 12:46 |
rog | robbiew: yes, i have | 12:47 |
robbiew | rog: -> http://uds.ubuntu.com/register/ :) | 12:47 |
rog | robbiew: i did it on 15th Sep... | 12:47 |
rog | and flights all booked too | 12:47 |
robbiew | rog: hmm, okay. I'll talk to our admins then, thx | 12:48 |
rog | robbiew: at any rate, i've got a confirmation email from marianna | 12:49 |
rog | robbiew: i'll just check the web site directly | 12:49 |
robbiew | rog: ah, cool | 12:50 |
robbiew | nevermind then | 12:50 |
robbiew | :) | 12:50 |
rog | robbiew: ah, maybe i didn't register on the linaro web site. i think i only did the UDS registration. | 12:53 |
hazmat | jamespage, hmm | 12:55 |
hazmat | jamespage, yeah.. i guess we really should be calling stop on units | 12:55 |
jamespage | hazmat: I need to deal with two scenarios - one where its a controlled removal | 12:56 |
jamespage | and one where the node goes AWOL | 12:56 |
hazmat | jamespage, pls file a bug | 12:56 |
hazmat | i can look at that today | 12:56 |
jamespage | hazmat: ack - doing now | 12:56 |
hazmat | for stopping a machine its almost irrelevant, since we shutdown the machine, but for a unit if we don't call stop, there isn't any thing to keep it from continuing to run | 12:57 |
hazmat | at least till all units are containers | 12:57 |
hazmat | and then the container is killed | 12:57 |
robbiew | rog: UDS is all you need ;0 | 12:57 |
robbiew | ;) | 12:57 |
rog | robbiew: ok, i'll ignore the FAQ then... | 12:58 |
hazmat | but we really can't do the latter on ec2, till we figure out some magical networking solution, or stop doing dynamic port management | 12:58 |
hazmat | unless we assume a single unit per machine in ec2 and do a targeted forward rule per exposed port | 13:00 |
_mup_ | Bug #872264 was filed: stop hook does not fire when units removed from service <juju:New> < https://launchpad.net/bugs/872264 > | 13:04 |
jamespage | hazmat: ^^ | 13:05 |
jamespage | I tried to document the two challenges I have specifically with the cassandra charm | 13:05 |
hazmat | jamespage, thanks | 13:05 |
jamespage | I guess they may apply to other charms that have similar ring storage methods | 13:06 |
hazmat | jamespage, so on 2) and 1) the other units should both detect the removal | 13:07 |
jamespage | hazmat: yes - they do | 13:07 |
rog | just realised that "canonical/linaro employee" means "(canonical AND linaro) employee" not "(canonical OR linaro) employee"... | 13:07 |
rog | doh | 13:07 |
jamespage | hazmat: and I could use the hook on the remaining nodes to deal with both situations | 13:11 |
jamespage | I would need to write it such that only one node completes the action | 13:11 |
* jamespage thinks about that one | 13:12 | |
* SpamapS awakens.. far too early | 13:15 | |
niemeyer | Good morning all | 13:17 |
rog | niemeyer: yo! | 13:18 |
SpamapS | jamespage: I think there's another bug asking for similar functionality.. | 13:18 |
SpamapS | jamespage: bug 862422 | 13:19 |
_mup_ | Bug #862422: Provide a way for services to protect units during dangerous operations <juju:Confirmed> < https://launchpad.net/bugs/862422 > | 13:19 |
SpamapS | jamespage: swift is a similar ring service and has times where adding or removing is a bad idea | 13:20 |
jamespage | SpamapS, agreed - it looks very similar | 13:21 |
SpamapS | Does seem like the stop hook should handle this | 13:27 |
jamespage | SpamapS: it would do for controlled removal | 13:27 |
SpamapS | jamespage: not sure I understand the AWOL case | 13:28 |
jamespage | SpamapS, thats more of a housekeeping case | 13:28 |
jamespage | in cassandra if you never moved entries for nodes that had gone away ('Down' status) it gets very crufty | 13:29 |
jamespage | also you want to ensure that loadbalancing etc.. get re-adjusted as the node won't be coming back | 13:29 |
hazmat | jamespage, but don't you get a departed event at all other nodes when one goes AWOL? | 13:29 |
jamespage | SpamapS, yes | 13:29 |
jamespage | sorry - I mean hazmat | 13:29 |
* hazmat checks the bug report | 13:30 | |
SpamapS | jamespage: yeah that should be detected in the peer relations | 13:30 |
SpamapS | cassandra has a prescribed procedure for removing a dead node from the ring | 13:31 |
rog | niemeyer: i'm porting the ec2 launch code and i'm not sure how goamz's AuthorizeSecurityGroup is supposed to work the way it's being used in the python code. here's a comparison: http://paste.ubuntu.com/706060/ | 13:31 |
jamespage | SpamapS, it does | 13:31 |
SpamapS | so on departed.. you would run that procedure for the departed unit | 13:32 |
hazmat | jamespage, so in the case of 1) the desire is for the actual termination of the unit to hang till the stop (which is potentially a long running op) completes? | 13:32 |
hazmat | and of course to execute stop as part of 1 | 13:32 |
jamespage | hazmat: ideally yes | 13:32 |
jamespage | SpamapS: what information is provided when the -departed hook fires about the remote service unit? | 13:33 |
hazmat | jamespage, doesn't the same problem exist in reverse when adding units.. as i recall for cassandra (might be outdated), you're supposed to only add a single unit at a time | 13:33 |
niemeyer | rog: Looks like there's a protocol setting missing | 13:33 |
hazmat | jamespage, just the unit name and that it departed | 13:33 |
niemeyer | rog: Check out the docs and the implementation | 13:33 |
SpamapS | hazmat: +1 for that, let stop be proactive about locally stored data | 13:34 |
hazmat | SpamapS, niemeyer g'morning | 13:34 |
rog | niemeyer: the python code doesn't seem to set a proto - i was just checking that it wasn't an obvious bug | 13:34 |
* hazmat just up the ante on his war against rodents, bring in the exterminator | 13:34 | |
* SpamapS wishes the time would change, its pitch black here in LA at 6:30am :-P | 13:34 | |
niemeyer | rog: Maybe it has a default? | 13:34 |
SpamapS | we're porting the ec2 launch code? | 13:34 |
jamespage | hazmat, there is a restriction on adding units - N+N rather than N+1 | 13:35 |
rog | niemeyer: it seems to have two distinct modes of operation | 13:35 |
rog | there's no obvious default in the python code | 13:35 |
rog | i'll recheck though | 13:35 |
niemeyer | rog: They're both backed by the same implementation | 13:35 |
niemeyer | rog: The same API | 13:35 |
niemeyer | rog: If one of them is failing, the call is different.. just figure how it's different and you'll understand the problem | 13:35 |
SpamapS | hazmat: bug 862422 has a case where swift requires that nodes wait to be added until rebalance is done | 13:36 |
_mup_ | Bug #862422: Provide a way for services to protect units during dangerous operations <juju:Confirmed> < https://launchpad.net/bugs/862422 > | 13:36 |
jamespage | SpamapS, hazmat: Cassandra has a similar requirement | 13:37 |
hazmat | hmm | 13:37 |
SpamapS | Its not that hard on the add-unit case though | 13:37 |
SpamapS | you can error out the joined event | 13:37 |
hazmat | they can't really scan for a rebalance attribute since its being set by the same hook that's doing it | 13:37 |
hazmat | and the hook values are only flushed at the end of the hook | 13:37 |
SpamapS | and admins will just have to resolve --retry | 13:37 |
SpamapS | hazmat: the services should protect themselves | 13:38 |
SpamapS | hazmat: there's somewhere that an admin has to look to see if a re-balance is going on | 13:38 |
SpamapS | thats where the hook should look | 13:38 |
hazmat | SpamapS, there isn't any service level logic.. atm.. its got to be what the units can coordinate among themselves | 13:38 |
jamespage | so - just to flip back to my -departed thinking | 13:39 |
jamespage | ATM I will need to a) detect which node needs to be removed from the ring | 13:39 |
SpamapS | hazmat: yeah, I don't think preventing it is juju's problem. Handling failures gracefully should be all it needs to do. | 13:39 |
jamespage | and b) elect which of the remaining units is going to execute the removal | 13:40 |
jamespage | in the -departed hook | 13:40 |
SpamapS | Though this does go back to the --wait argument where as an admin I'd like to get feedback from the command's intended actions. | 13:40 |
hazmat | jamespage, so a leader election/detection cli api for hooks | 13:41 |
jcastro | Does anyone want to volunteer to do a juju session for ubuntu openweek? https://wiki.ubuntu.com/UbuntuOpenWeek | 13:41 |
rog | niemeyer: hmm, it looks like the python code is using an undocumented feature of aws. | 13:42 |
jamespage | hazmat, that would be nice | 13:42 |
jamespage | as it would prevent some fragile hack in the charm hook | 13:42 |
hazmat | rog, that api has several different spellings, they are documented | 13:42 |
jamespage | I'm doing something similar at the moment for unit bootstrapping - which is not 100% reliable | 13:43 |
jamespage | when units join the peer relation | 13:43 |
SpamapS | jcastro: I'm down for it. | 13:43 |
jcastro | SpamapS: can you claim a block please? | 13:43 |
jcastro | SpamapS: I'll do it with you if you want | 13:44 |
SpamapS | Yeah at least be there to help me with the bot. ;) | 13:44 |
hazmat | rog, txaws is a poor reference impl to look at.. https://github.com/boto/boto/blob/master/boto/ec2/connection.py#L1917 | 13:44 |
hazmat | is much better at api coverage and docs, notice right above that impl there is support for a deprecated mechanism with slightly different spelling | 13:44 |
lynxman | hazmat: SpamapS: got the juju macports done and working, just a versioning question, let me paste here the versions of the python packages I'm using and let me know which ones would you deem as "need upgrading" | 13:45 |
rog | hazmat: the name "SourceSecurityGroupName" is used as a parameter. i'd have thought that should be documented in http://docs.amazonwebservices.com/AWSEC2/latest/APIReference/index.html?ApiReference-query-AuthorizeSecurityGroupIngress.html | 13:46 |
rog | given that seems to be the entry point. | 13:46 |
lynxman | argparse (1.2.1), zookeeper (3.3.0), python-regex (0.8.0), python-txaws (0.2), pydot (1.0.25), python-argparse (1.2.1) | 13:47 |
lynxman | hazmat: maybe we should upgrade txaws? | 13:49 |
rog | niemeyer: looks like a new entry point is warranted. perhaps the original call would be better named AuthorizeSecurityGroupIP. hmm. | 13:51 |
hazmat | rog its quite possible txaws is not targeting the latest api | 13:51 |
hazmat | rog, actually highly likely given its lack of dev | 13:52 |
rog | hazmat: txaws has the call. as does boto. but the AWS documentation doesn't mention that variant AFAICS | 13:52 |
rog | it looks like all the language APIs have that variant. do you know what it's actually doing? authorizing one group with the privileges of another? | 13:54 |
rog | that would be my guess, but it would be nice to know for sure, so that i can choose a good name. | 13:54 |
hazmat | rog, aws supports both because they have a versioned api, boto has separate implementations for each version one marked deprecated. | 13:56 |
hazmat | rog, it is documented, but not under the latest version of the api docs which document the latest | 13:56 |
jcastro | SpamapS: which slot do you want? | 13:57 |
lynxman | hazmat: so what do you reckon :) | 13:57 |
rog | hazmat: ah, so... we have to ask: what's the equivalent of that old call in the new API? | 13:57 |
rog | i'll try and find the old docs | 13:58 |
hazmat | lynxman, so txaws doesn't have a release with the openstack fixes atm | 13:58 |
hazmat | and i should probably push out a new version of txzookeeper | 13:59 |
hazmat | lynxman, give me a moment, i'll cut releases for both | 13:59 |
lynxman | hazmat: cool :) | 13:59 |
hazmat | lynxman, besides that.. what's python-regex? | 13:59 |
hazmat | lynxman, we use the builtin re module not a third party lib | 13:59 |
hazmat | unless a dep needs it like pydot.. | 14:00 |
hazmat | rog, it should be pretty clear from context how to translate | 14:00 |
lynxman | hazmat: I can drop it as a dependency then, pydot has its own :) | 14:02 |
rog | hazmat: perhaps. this page talks about a "user/group pair permission", but perhaps that's just code for "allow all IP access". http://docs.amazonwebservices.com/AmazonEC2/dg/2007-01-03/ApiReference-Query-AuthorizeSecurityGroupIngress.html | 14:03 |
hazmat | lynxman, so python-txzookeeper 0.8.0 is needed as well | 14:04 |
hazmat | lynxman, and zookeeper 3.3.3 .. there are definitely bug fixes in the py bindings we need | 14:05 |
lynxman | hazmat: alright, I'll upgrade both then, ty | 14:05 |
hazmat | lynxman, np.. the latest pypi release for txzookeeper looks good, off to push out a 0.2.1 txaws release | 14:06 |
lynxman | hazmat: lovely, thanks! :D | 14:06 |
rog | tcp port numbers are 16 bit even with IPv6, right? | 14:17 |
* niemeyer looks at rog with the eye | 14:18 | |
rog | ok, ok, i should know that. | 14:19 |
SpamapS | jcastro: sorry, family stuff, I'll grab one in the next 2 hrs | 14:36 |
rog | niemeyer: just checking: have you already written some Go code to parse environments.yaml? | 14:41 |
niemeyer | rog: No, that was the first bit I suggested you could start with | 14:41 |
rog | ok, cool | 14:42 |
rog | (BTW the instance starting and group set up code is all working now) | 14:42 |
niemeyer | rog: Please follow the existing convention in the charm package | 14:42 |
niemeyer | rog: Wow, neat! | 14:42 |
niemeyer | rog: How're you testing it? | 14:42 |
rog | niemeyer: it's just a stub file currently, no tests written so far | 14:43 |
niemeyer | rog: Heh | 14:43 |
niemeyer | rog: So there's nothing.. | 14:43 |
rog | niemeyer: just running it and going to the aws console to check | 14:43 |
niemeyer | rog: :) | 14:43 |
niemeyer | rog: Please write tests with the logic, rather than retrofitting them | 14:43 |
niemeyer | rog: We should follow a similar model to what was done with goamz itself | 14:44 |
niemeyer | rog: Rather than the mocking craziness we have in the Python side | 14:44 |
rog | niemeyer: yes, tests are the next thing i'm putting in. the code isn't even in a package yet. | 14:44 |
niemeyer | rog: Ok, it's a spike then | 14:45 |
rog | niemeyer: a spike? | 14:45 |
niemeyer | rog: yeah, a temporary hack to get a feeling of the problem | 14:45 |
rog | niemeyer: yeah, although i've ported a lot of the logic from the original python, so it should be trivial to do it right. | 14:45 |
rog | niemeyer: this is all i've got so far: http://paste.ubuntu.com/706139/ | 14:46 |
niemeyer | rog: Nice | 14:48 |
hazmat | lynxman, latest txaws release @ http://launchpad.net/txaws/trunk/0.2/+download/txAWS-0.2.1.tar.gz | 14:50 |
lynxman | hazmat: lovely, thanks :) | 14:50 |
rog | niemeyer: what's the best approach to testing with ec2? actually interact with ec2 directly? | 14:50 |
niemeyer | rog: No, we can follow a similar model from goamz | 14:53 |
rog | ok, i'll have a look. | 14:53 |
rog | niemeyer: BTW is this the only spec for the environment yaml? https://juju.ubuntu.com/docs/getting-started.html#configuring-your-environment | 14:55 |
niemeyer | rog: Please read the Python code | 14:55 |
rog | ok | 14:55 |
lynxman | hazmat: new ports submitted, contacted one of the maintainers and it's *possible* that juju will be in the archive by next week | 15:05 |
hazmat` | lynxman, sweet! | 15:07 |
=== hazmat` is now known as hazmat | ||
SpamapS | lynxman: is there an artifact somewhere where I can test and provide positive feedback to the maintainers? | 15:26 |
lynxman | SpamapS: I can send you my portindex branch if you want | 15:27 |
jimbaker | SpamapS, this branch should hopefully fix the problem you saw on openstack with expose failing: lp:~jimbaker/juju/expose-retry | 15:45 |
SpamapS | Hah, I love this code | 15:53 |
SpamapS | self.mocker.call(simulate_random_failure) | 15:53 |
SpamapS | :) | 15:53 |
SpamapS | jimbaker: indeed that should retry those ops. There are many others.. I think we just have to get defensive about txaws | 15:54 |
* hazmat lunches | 15:59 | |
niemeyer | I'm off to lunch too. | 16:00 |
jimbaker | SpamapS, :). we need to be defensive about txaws because it needs work and it necessarily deals with bad stuff. in general, txaws will fail early, if it has a bad payload it can't parse | 16:06 |
jimbaker | for commands like destroy-environment that can be repeated, this may be ok. for agents, we need to do retries | 16:07 |
jimbaker | i'm pretty certain that the provisioning agent retry mechanism (ignoring that it's a SPOF for now) seems to be robust, so long as we have errbacks defined such that stuff doesn't just stop. in the case of expose, the only place where txaws can be called is that one method (open_close_ports_on_machine), so trapping there and then using the existing resync mechanism for retries would seem to suffice | 16:10 |
SpamapS | Are there any operations that the provisioning agent does w/ txaws where it shouldn't retry on error? | 16:11 |
SpamapS | expose/unexpose was just the most common fail we had | 16:12 |
SpamapS | there were others | 16:12 |
SpamapS | any time listing instances returned empty ... things were likely to just grind to a halt | 16:12 |
jimbaker | SpamapS, i suspect the problem with that is seen here: http://pastebin.ubuntu.com/706206/, specifically lines 17-21 | 16:14 |
jimbaker | i need to check that get_machines will always raise a ProviderError if it fails | 16:15 |
jimbaker | SpamapS, no, it only catches EC2Error, but txaws will raise other errors | 16:16 |
SpamapS | jimbaker: yeah seems like we should be able to trust our internal libraries to always raise only ProviderError. :) | 16:17 |
jimbaker | SpamapS, that's definitely not the convention we have | 16:18 |
jimbaker | no catchalls | 16:18 |
SpamapS | seems like catchalls at external libraries would be a good idea, but not for internal ones. | 16:18 |
jimbaker | except perhaps in some twisted code where we use an errback setup, and then that does catch everything | 16:18 |
jimbaker | SpamapS, yeah, i don't know. i think i can defend the existing mechanism by stating that for nonagent code, it's better to failfast, so any unknown errors bubbling up is fine | 16:20 |
jimbaker | SpamapS, but if i look at periodic_machine_check, it does the right thing: it always reschedules itself, even if there's an error (equiv to inlineCallbacks with a finally) | 16:22 |
jimbaker | SpamapS, so it should be resilient. and of course, if txaws is bad here, vs just getting an occasional bad payload, there's nothing that can be done anyway except to repeatedly log the problem | 16:23 |
SpamapS | jimbaker: thats really what I'm wondering.. I don't know of any action the provisioning agent takes that shouldn't just be retried over and over. I will say that we need a better way than debug-log to track provisioning operations. | 16:26 |
jimbaker | SpamapS, i think this would be helpful, bug 769120 | 16:28 |
_mup_ | Bug #769120: Ensemble status shouldn't report dead units based soley on state, but also on presence. <juju:New> < https://launchpad.net/bugs/769120 > | 16:28 |
hazmat | niemeyer, the doc builds on juju docs have been broken for a while.. they're still referencing old ways of deploying | 16:32 |
jimbaker | SpamapS, ok, i think i see one bug here however: watch_machine_changes is a watch, and it calls process_machines. so this watch would stop working if process_machines fails because of some random exception from txaws | 16:33 |
niemeyer | hazmat: Can you please raise that up in #is? | 16:33 |
jimbaker | SpamapS, we would still see the resync from the periodic_machine_check, but the provisioning agent wouldn't respond to changes to ZK as they happen | 16:35 |
SpamapS | jimbaker: exactly! | 16:35 |
jimbaker | SpamapS, cool, glad to see your evidence corresponds to what i'm seeing here :) | 16:35 |
SpamapS | jimbaker: did we ever open an actual bug for this? | 16:35 |
SpamapS | I suppose you can just lpad it :) | 16:36 |
jimbaker | SpamapS, i'll just open it conventionally, since i don't have a branch in place to fix it | 16:37 |
hazmat | niemeyer, done.. is there any one i should ping about it? | 16:37 |
niemeyer | hazmat: Hmm.. #is? Who did you ping if you're wondering about who to ping? | 16:38 |
hazmat | niemeyer, i just put the message about the problem on #is.. just wondering if i should bring it to a particular person's attention on #is | 16:38 |
niemeyer | hazmat: Ah, gotcha | 16:39 |
niemeyer | hazmat: No, I'd just wait to see if someone there is able to help | 16:39 |
niemeyer | hazmat: Otherwise mail rt | 16:39 |
hazmat | niemeyer, k, thanks | 16:39 |
_mup_ | juju/go-store r20 committed by gustavo@niemeyer.net | 16:54 |
_mup_ | Introduced revision key tracking so that we can detect whether a | 16:54 |
_mup_ | charm update is already the current tip across all requested URLs | 16:54 |
_mup_ | or not. If at least one of the URLs are out-of-date, the update | 16:54 |
_mup_ | will proceed and bump a revision on all of them. | 16:54 |
rog | i'm off for the day. see y'all tomorrow. | 17:00 |
niemeyer | rog: Cheers! | 17:03 |
_mup_ | Bug #872378 was filed: Provisioning agent stops watching machine changes in ZK <juju:New> < https://launchpad.net/bugs/872378 > | 17:05 |
jimbaker | SpamapS, i just filed bug 872378 | 17:05 |
_mup_ | Bug #872378: Provisioning agent stops watching machine changes in ZK <juju:New> < https://launchpad.net/bugs/872378 > | 17:05 |
SpamapS | jimbaker: thanks, will confirm and mark High | 17:05 |
jimbaker | SpamapS, thanks, just what i was going to ask :) | 17:05 |
SpamapS | oh you did that :) | 17:06 |
jimbaker | i did the high part, you can still confirm it however | 17:06 |
SpamapS | need to raise a txaws bug too | 17:06 |
jimbaker | i'll get the bug dance better next time | 17:06 |
SpamapS | well I am pretty religious about not confirming my own bugs :) | 17:06 |
jimbaker | SpamapS, it's an interesting question about txaws, but given that it's a closely related project, worth seeing their philosophy here - do they handle bad payloads or not? | 17:07 |
SpamapS | no | 17:07 |
SpamapS | the project expects its AWS partner to be well behaved | 17:07 |
SpamapS | so there's also a nova bug to raise | 17:08 |
SpamapS | as nova shouldn't be returning empty ever | 17:08 |
SpamapS | heh.. we should probably have a little triage party to clean up txaws's bug list. | 17:08 |
jimbaker | got it. but regardless we would still expect to see TimeoutError, so there's some class of errors txaws will likely not handle | 17:08 |
SpamapS | 34 new, 72 open, 3 high.. | 17:08 |
_mup_ | juju/go-store r21 committed by gustavo@niemeyer.net | 17:51 |
_mup_ | Track sha256 and store next to the charm information so we can answer | 17:51 |
_mup_ | related API requests in the future. | 17:51 |
_mup_ | juju/go-store r22 committed by gustavo@niemeyer.net | 18:01 |
_mup_ | Copied log.go from personal project (mgo). | 18:01 |
jcastro | lynxman: heya, any update on the macports thing? | 18:44 |
jcastro | hazmat: hey is there an easy way to tell the local provider to use my existing apt cache instead of installing all this apt-cacher-ng business? | 18:51 |
hazmat | jcastro, i think he mentioned updating the portfile, he's going to ping one of the maintainers, with luck soon | 18:52 |
hazmat | jcastro, sadly no | 18:53 |
hazmat | jcastro, is the initial download a problem? | 18:55 |
jcastro | yeah, this close to release the mirrors are hammered, I'll suffer and find something else to do | 18:55 |
m_3 | SpamapS: did you mention you had pending MW charm changes? | 19:05 |
SpamapS | m_3: everything I had is in lp:charm/mediawiki | 19:06 |
m_3 | SpamapS: cool thanks | 19:07 |
_mup_ | juju/go-store r23 committed by gustavo@niemeyer.net | 19:09 |
_mup_ | Added info/debug logging across the charm storage operations. | 19:09 |
hazmat | jamespage, ping | 19:18 |
hazmat | jamespage, i'm wondering how problematic it is to always kill the unit's processes on removal instead of a controlled termination via stop | 19:19 |
SpamapS | hazmat: stop needs to be able to *cancel* the removal | 19:20 |
hazmat | SpamapS, there's not much distinguishing a unit removal to a service removal at that level | 19:21 |
SpamapS | It would be awesome if charms could prevent data loss without a --force flag by simply refusing to stop the service while it is vulnerable. | 19:21 |
hazmat | and units overriding the user express commands.. | 19:21 |
hazmat | hmm | 19:22 |
SpamapS | is this only happening on destroy-service, not on remove-unit ? | 19:22 |
SpamapS | I do kind of think destroy-* should be more heavy handed | 19:22 |
hazmat | SpamapS, it would happen on either one, the mechanics are the same atm | 19:22 |
hazmat | SpamapS, how does the service know if its redundant or not? | 19:23 |
hazmat | service unit | 19:23 |
_mup_ | juju/config-get r393 committed by kapil.thangavelu@canonical.com | 19:25 |
_mup_ | juju get for service config/schema inspection | 19:25 |
SpamapS | hazmat: in the case of any clustered service, it will have some way to determine if removing this node is safe or not. | 19:27 |
SpamapS | hazmat: stop would also be a decent place for a single node service to signal some kind of snapshot or backup. | 19:29 |
SpamapS | so blocking until its done would be cool | 19:30 |
hazmat | SpamapS, the converse question is how to prevent problems with problematic charms, that might for example have a broken stop... or even well meaning ones that go out of control | 19:31 |
hazmat | decommissioning a node in cassandra is potentially a fairly long operation afaicr | 19:32 |
hazmat | we'll need intermediary states to properly convey status to a ui | 19:32 |
hazmat | ie. 'stopping' | 19:32 |
hazmat | we only have nouns now.. not verbs | 19:32 |
SpamapS | hazmat: --force ? | 19:36 |
hazmat | sounds reasonable | 19:36 |
SpamapS | hazmat: I see what you mean. Yes it would be cool if we followed upstart's model there and had a goal state, and the in-between states with hooks available for each state. | 19:37 |
hazmat | SpamapS, exactly | 19:37 |
hazmat | hmm.. well maybe not hooks available for each state, but at least the same re status | 19:37 |
hazmat | effectively it would be a hook per verb | 19:37 |
SpamapS | stop/running -> stop/hook-stop-running -> (if hook says so, stop/deferred-stop) -> stop/stopping-unit -> oblivion | 19:38 |
SpamapS | Like if a hook exits 100 , that means it is running the safe stop in the background | 19:38 |
SpamapS | then you can just keep trying to stop it, and getting back 100 until its done decommissioning | 19:39 |
SpamapS | and you can still have a short timeout to deal with misbehaving charms | 19:39 |
hazmat | i'm going to capture this discussion into the bug | 19:39 |
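
Editor's note: the deferred-stop idea from the end of the log (a stop hook that exits 100 while a safe shutdown is still running, a short timeout for misbehaving charms, and `--force` as an override) can be sketched roughly as below. This only illustrates the proposal as discussed; it is not juju's actual behaviour. The function name `stop_unit`, its parameters, and the returned state names are hypothetical, and the hook is modelled as a plain callable returning an exit code.

```python
import time
from typing import Callable

# Exit code SpamapS proposes for "safe stop is still running in the background".
DEFERRED = 100


def stop_unit(run_stop_hook: Callable[[], int],
              max_attempts: int = 5,
              delay: float = 0.0,
              force: bool = False) -> str:
    """Drive a unit towards the 'stopped' goal state (hypothetical sketch).

    run_stop_hook stands in for executing the charm's stop hook and
    returning its exit code.
    """
    if force:
        # --force skips the hook entirely and kills the unit.
        return "killed"
    for _ in range(max_attempts):
        code = run_stop_hook()
        if code == 0:
            return "stopped"      # hook finished its controlled shutdown
        if code != DEFERRED:
            return "error"        # broken stop hook; admin must intervene
        time.sleep(delay)         # still decommissioning; ask again later
    # Bounded retries guard against charms that defer forever.
    return "timed-out"
```

A caller would treat "timed-out" and "error" as the cases where an admin decides whether to retry or fall back to `--force`, which matches the concern hazmat raises about charms with a broken stop hook.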
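
Editor's note: jamespage's plan for the -departed hook (detect the departed node, then have exactly one of the remaining units run the ring removal) needs some deterministic way to pick that unit. One simple convention, sketched here as an assumption rather than anything juju provides, is "lowest unit number wins". The helper names are hypothetical, unit names are assumed to look like `cassandra/3`, and how a hook obtains the peer list is left out. As jamespage notes about his similar bootstrapping logic, this kind of charm-level election is not 100% reliable, since it assumes every unit sees the same membership.

```python
from typing import Iterable


def unit_number(unit_name: str) -> int:
    """Extract the numeric suffix from a unit name such as 'cassandra/3'."""
    return int(unit_name.rsplit("/", 1)[1])


def elect_leader(local_unit: str, peer_units: Iterable[str]) -> str:
    """Pick the unit with the lowest number among this unit and its peers."""
    candidates = set(peer_units) | {local_unit}
    return min(candidates, key=unit_number)


def should_run_removal(local_unit: str,
                       peer_units: Iterable[str],
                       departed_unit: str) -> bool:
    """True if this unit is the one that should remove the departed node."""
    # The departed unit can no longer act, so exclude it from the election.
    remaining = [u for u in peer_units if u != departed_unit]
    return elect_leader(local_unit, remaining) == local_unit
```

Comparing by number rather than by string keeps `cassandra/2` ahead of `cassandra/10`. A built-in leader election API for hooks, as hazmat suggests, would remove the need for this kind of convention.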