/srv/irclogs.ubuntu.com/2011/04/25/#ubuntu-ensemble.txt

niemeyerGood morning!14:14
hazmatniemeyer, welcome back14:40
hazmatniemeyer, the sky fell after you left ;-)14:40
niemeyerhazmat: Danke!14:40
hazmatthe clouds rained14:40
niemeyerhazmat: Yeah, I was half-following the news.. man, that was interesting14:40
niemeyerhazmat: Any news on what actually happened?14:40
hazmatniemeyer, some interesting issues came up wrt ensemble14:40
hazmatniemeyer, in a nutshell, the us-east-1 data center experienced a multi-availability-zone outage, affecting anything touching ebs14:41
niemeyerhazmat: Do they know why?  Or rather.. have they published why?14:42
hazmatwith internal network saturation due to ebs replication that impacted pretty much all services14:42
niemeyerAh, ok14:42
hazmatniemeyer, they haven't published outside of saying 'network event'14:42
niemeyerThat's so lame14:42
hazmatwhich triggered the ebs remirroring14:42
niemeyerIncreases the distrust14:42
hazmatniemeyer, i fixed up ensemble's region portability during the outage ;-)14:42
niemeyerhazmat: That's awesome :-)14:43
hazmatniemeyer, we have some other interesting problems as well, i was just doing a write up for the list14:43
niemeyerhazmat: I can imagine some of it14:43
hazmatmostly relating to the fact that we're still using ubuntu zk packages14:43
hazmatniemeyer, it's unrelated14:43
niemeyerAh, ok14:43
niemeyerI can't, then :-)14:43
hazmatand that causes random segfaults in our agents now14:43
niemeyerUgh14:43
hazmatour unit agents have reached sufficient complexity14:43
hazmatniemeyer, it's actually a really nice opportunity14:44
niemeyerLet's see if we can get a hand to get that fixed14:44
hazmatto test fault resilience with a random fault injector14:44
niemeyer:-)14:44
hazmatniemeyer, two separate multi-step tracks, fix the packaging, fix the fault resilience14:44
niemeyer"Fix the fault resilience" feels like a long chain14:45
hazmatniemeyer, i paused on the resolved work as well  (two branches in review), i wanted to discuss options for the implementation of the sans-hook transitions14:45
niemeyerOk14:45
hazmatniemeyer, well three parts afaics, with some details..14:45
niemeyerhazmat: But what's in the queue is good to go, right?14:45
hazmatre fault.. agents monitoring launched agents, queue with fs durability14:46
hazmatniemeyer, yes14:46
niemeyerhazmat: Cool, sounds like good topics14:47
hazmatniemeyer, i'm gonna run a quick errand, but if you're game in 15m, i'd like to do a quick skype on the resolved stuff14:49
niemeyerhazmat: Sounds good14:49
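[editor's note] hazmat's "queue with fs durability" is not spelled out in the log. The following is only a minimal sketch of the general idea, assuming a JSON-lines file as the backing store; the DurableQueue class and its methods are hypothetical names, not ensemble code. Each put is fsynced before it is acknowledged, so a respawned agent can reload pending work from disk.

```python
import json
import os
import tempfile


class DurableQueue:
    """Hypothetical FIFO queue persisted as one JSON document per line."""

    def __init__(self, path):
        self.path = path
        self.items = []
        # Reload whatever a previous (possibly crashed) process left behind.
        if os.path.exists(path):
            with open(path) as f:
                self.items = [json.loads(line) for line in f if line.strip()]

    def put(self, item):
        # Append and fsync before updating memory, so the item survives a crash.
        with open(self.path, "a") as f:
            f.write(json.dumps(item) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.items.append(item)

    def get(self):
        # Pop the oldest item, then atomically rewrite the file without it.
        item = self.items.pop(0)
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            for remaining in self.items:
                f.write(json.dumps(remaining) + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        return item
```

A restarted process that constructs DurableQueue on the same path sees the items that were put but never fetched. Rewriting the whole file on every get is a simplification that only suits short queues.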
hazmatniemeyer, i'm on skype, ping me when you're ready15:11
niemeyerreturn self._invoke_lifecycle(self._lifecycle.start, nohooks=True)15:25
hazmatniemeyer, def start(self, fire_hooks=True)15:25
niemeyerhazmat: ^15:25
hazmatniemeyer, yup15:25
hazmatdef start(self, transition_context)15:26
hazmattransition_context == object with origin_state, destination_state, state variables, transition arguments15:26
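[editor's note] The three signatures pasted above (a nohooks=True keyword, a fire_hooks=True flag, and a single transition_context argument) are alternatives for the sans-hook transitions. A rough sketch of the third option follows, assuming the fields hazmat lists; the TransitionContext and Lifecycle classes and the "fire_hooks" argument key are illustrative guesses, not the ensemble implementation.

```python
class TransitionContext:
    """Hypothetical bundle of everything a transition handler needs."""

    def __init__(self, origin_state, destination_state,
                 state_variables=None, transition_arguments=None):
        self.origin_state = origin_state
        self.destination_state = destination_state
        self.state_variables = state_variables or {}
        self.transition_arguments = transition_arguments or {}


class Lifecycle:
    """Hypothetical stand-in for a unit lifecycle; records the hooks it fires."""

    def __init__(self):
        self.hooks_fired = []

    def start(self, transition_context):
        # Hooks fire by default; a caller opts out via a transition argument
        # instead of a dedicated keyword on every lifecycle method.
        args = transition_context.transition_arguments
        if args.get("fire_hooks", True):
            self.hooks_fired.append("start")
        return transition_context.destination_state


lifecycle = Lifecycle()
ctx = TransitionContext("stopped", "started",
                        transition_arguments={"fire_hooks": False})
new_state = lifecycle.start(ctx)  # moves to "started" without firing hooks
```

Passing one context object keeps the method signatures stable if more per-transition options show up later, at the cost of a less explicit call site than start(fire_hooks=False).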
=== deryck is now known as deryck[lunch]
=== niemeyer is now known as niemeyer_lunch
jimbakerhazmat, nice analysis of the problem we were seeing16:47
hazmatjimbaker, yeah... it's nicer to think of it as a random fault injector than crappy code ;-)16:47
jimbaker:)16:47
=== niemeyer_lunch is now known as niemeyer
hazmathmm... looks like my network provider is blocking post commit message to labix17:25
niemeyerhazmat: Huh17:36
niemeyerhttp://blog.rightscale.com/2011/04/25/amazon-ec2-outage-summary-and-lessons-learned/17:38
niemeyerPretty good write up17:38
hazmatniemeyer, lots of good write ups, the rightscale has a nice set of links, the joyeur/joyent one is nice as well17:54
niemeyerhazmat: Yeah, the RS one feels the closest to what I would expect the *official* post from Amazon to look like18:06
niemeyerOne of the funny aspects of EBS volumes is that they keep the actual machine disk more available for those that choose to use it18:18
niemeyer"How SmugMug survived the Amazonpocalypse >> (...) Third, we don’t use Elastic Block Storage (EBS), which is the main component that failed last week."18:19
niemeyerMajor DUH18:19
hazmatniemeyer, yeah.. that and the joyent post got me thinking about rethinking persistence and opening up the choice to formula authors18:41
hazmatone thing at a time18:41
niemeyerhazmat: Wasn't that the plan since the very early conversations?18:41
niemeyerhazmat: IIRC the EBS-only strategy was introduced just because it was a simple way for us to get started without risking blowing people's data18:42
hazmatniemeyer, it was, but last i mentioned a month or two back, you were suggesting just using ebs instances and not worrying about it18:42
hazmatinstead of spec'ing persistent directories, not clear if that was intended from a priority perspective or was a long term plan18:43
niemeyerhazmat: For now that still feels like a good plan18:43
niemeyerhazmat: I see.. FWIW I don't see inherent problems with supporting non-EBS formulas18:44
hazmatniemeyer, the goal i was considering is not requiring ec2 ebs instances for such formulas18:45
niemeyerhazmat: That's what I'm talking about as well18:46
hazmatniemeyer, great18:50
hazmatniemeyer, is the endpoint to the post commit publishing bot on labix running?18:56
niemeyerhazmat: I don't know.. have to check that18:57
niemeyerhazmat: FWIW, it's not actually labix.. I just hosted the domain there.. the bot lives within one of the Landscape test servers18:57
hazmatniemeyer, ah.. right on i was wondering about that18:58
SpamapShazmat: can you explain why there is an "ensemble ami" ?19:08
hazmatSpamapS, good question, ideally there shouldn't be one19:08
SpamapSShould be able to do anything w/ cloud-init that you need to do.19:09
hazmatSpamapS, we ended up creating one because the bootstrap time was significant if we installed from scratch19:09
hazmatie. downloading java and updating packages, added several minutes to our startup19:09
hazmatSpamapS, plus checking out all the ensemble repos19:10
SpamapSYeah.. that's a valid reason to go AMI vs. cloud-init19:10
hazmatcloud-init had some failings in that regard as well, wrt only logging output to the console log in the maverick cycle19:10
hazmatwe'd be on the machine and wondering what happened for like 10m till it showed in the ec2 get-console-output api19:10
SpamapSI've even wondered if it would be a worthy later optimization for machine providers to be able to rebundle after the install hook fires. :)19:10
hazmatthat's better now19:10
hazmatyeah.. unit snapshotting would be nice, and a viable strategy for some services19:11
hazmati'm really interested in serge's work with btrfs and lxc, to be discussed at uds-o19:11
SpamapSyeah very cool stuff there19:12
SpamapSSo the fault tolerance of the agents.. is this just as simple as respawning it if it dies?19:12
=== bcsaller1 is now known as bcsaller
SpamapSWe have this thing in Ubuntu called upstart that does that. ;)19:12
hazmatSpamapS, it's two things, it's making sure state is on disk, and respawning19:13
SpamapSbcsaller: hey! I spent this past weekend reading the first section of my new copy of "4 hour body" btw.. Thanks for the recommendation.. great book so far.19:13
hazmatSpamapS, but the respawn is potentially a machine not just a process19:13
SpamapShazmat: oh.19:13
bcsallerSpamapS: glad you liked it19:13
hazmatSpamapS, ie. if we kill a machine agent, the provisioning agent may have to start a new machine to recover19:14
hazmatSpamapS, also upstart is fairly static is my understanding19:14
hazmatie you don't load new services to be managed at runtime19:15
hazmathmm. actually i guess you do19:15
hazmathence package installs using upstart19:15
SpamapShazmat: upstart just makes a best effort at keeping it running. It will give up after a while too.. so it's not a perfect solution.19:17
hazmatSpamapS, is that based on total number of restarts or restarts within a timespan?20:28
SpamapS       respawn limit COUNT INTERVAL20:49
SpamapS              Respawning  is  subject to a limit, if the job is respawned more than COUNT times in INTERVAL seconds20:49
SpamapShazmat: 'man 5 init'20:49
hazmatSpamapS, thanks20:49
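[editor's note] The man page excerpt answers hazmat's question: the limit applies to restarts within a timespan, not to a lifetime total. As a rough illustration of that rule, here is a sliding-window limiter; the RespawnLimiter class is hypothetical and is not how Upstart itself tracks respawns, it only models "more than COUNT times in INTERVAL seconds".

```python
from collections import deque


class RespawnLimiter:
    """Hypothetical model of 'respawn limit COUNT INTERVAL'."""

    def __init__(self, count, interval):
        self.count = count          # respawns tolerated per window
        self.interval = interval    # window length in seconds
        self._respawns = deque()    # timestamps of recent respawns

    def allow_respawn(self, now):
        # Forget respawns that have fallen out of the window.
        while self._respawns and now - self._respawns[0] >= self.interval:
            self._respawns.popleft()
        # COUNT respawns inside the window are tolerated; one more is refused.
        if len(self._respawns) >= self.count:
            return False
        self._respawns.append(now)
        return True


limiter = RespawnLimiter(count=3, interval=5)
decisions = [limiter.allow_respawn(t) for t in (0, 1, 2, 3, 6)]
```

With count=3 and interval=5, the crashes at t=0, 1 and 2 are respawned, the one at t=3 is refused, and by t=6 enough of the window has expired for a respawn to be allowed again. A real supervisor would stop the job for good at the first refusal, which is the "give up after a while" behaviour SpamapS describes.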
robbiewSpamapS: ping21:07
* robbiew goes from room to room21:07
robbiewtrying to get SpamapS attention...must be running a crap irc client21:08
robbiewlol21:08
jimbakerrobbiew, i know what you mean... i like xchat, but it tends to only work w/ one room at a time from being able to see stuff going on of interest21:12
jimbakerincluding being pinged :)21:12
robbiewjimbaker: pidgin for the world!!!!!!21:12
robbiew:P21:12
jimbakerrobbiew, not pidgin!!! ok i was unaware of that capability... the naive install i did simply opened lots and lots of windows, i couldn't take it21:13
SpamapSI tend to hide my IRC until I am ready to be interrupted21:14
robbiewjimbaker: oh..the tabbed view rocks21:14
robbiewI put mine on the side21:14
jimbakerSpamapS, good rendezvous protocol ;)21:15
robbiewif you have a ThinkPad...the ThinkLight plugin is AWESOME21:16
robbiewflashes the light when my nick is spoken...so I get notified, while muted ;)21:16
SpamapSwow21:17
SpamapSthat actually sounds cool21:17
SpamapSI wonder if I can do that w/ the MBP's light21:17
niemeyerI'll get a bite21:18
=== niemeyer is now known as niemeyer_biab
jimbakerwe have reinvented the circa 90s office phone21:19
hazmatSpamapS, re deb packaging for ensemble deps, i've got a script in ensemble/debian/ec2-build.. not sure if you've looked at it, but it basically just pulls the 3.3 branch of zk and builds the deb on an ec2 machine.. we should be fine with just a deb from the 3.3.3  release tarball.. i'm interested in learning more about what the process is.21:21
SpamapShazmat: Yeah that's cool actually. :) I have to run now, but let's talk in about an hour.21:23
hazmatSpamapS, awesome, let's pick it up tomorrow21:23
hazmatSpamapS, we can talk later today.. but as far as digging into doing it, tomorrow would be better21:23
SpamapShazmat: ack21:23
hazmathmm. the ubuntu packaging docs are much better than the debian new maintainer guide21:49
=== niemeyer_biab is now known as niemeyer
koolhead17kim0: around?22:35
koolhead17hi hazmat22:35
jimbakerhazmat, running trunk with test, i'm getting a failure on ensemble.providers.ec2.tests.test_utils.EC2UtilsTest.test_get_machine_options_defaults (http://paste.ubuntu.com/598962/)23:39
jimbakerbcsaller, i'm also seeing failures with your refactor-to-yamlstate branch (which is why i looked at trunk and did a fresh install of our dependencies so i could move to python 2.7 in my virtualenv)23:41
bcsallerjimbaker: any tracebacks?23:42
jimbakerbcsaller, here's the full traceback from test - http://paste.ubuntu.com/598963/23:44
jimbakermost of those look spurious23:44
jimbakerbut in isolating, http://paste.ubuntu.com/598965/ looks relevant to the changes you made23:45
bcsallerjim: thanks, I think some of those were in a later branch, I didn't think what I pushed was impacted by that 23:45
jimbakerbcsaller, cool. i'm trying to base a branch on refactor-to-yamlstate23:46
bcsalleryeah, that's the set taking a dict rather than a YAML dict string change23:46
jimbakeri know it's sort of early to do so, but it seemed the easiest way to keep our work from conflicting on HookContext23:47
jimbakerright now, i'm just going to hold off on the hook command changes23:47
