[14:14] Good morning!
[14:40] niemeyer, welcome back
[14:40] niemeyer, the sky fell after you left ;-)
[14:40] hazmat: Danke!
[14:40] the clouds rained
[14:40] hazmat: Yeah, I was half-following the news.. man, that was interesting
[14:40] hazmat: Any news on what actually happened?
[14:40] niemeyer, some interesting issues came up wrt to ensemble
[14:41] niemeyer, nutshell us-east-1 data center experienced a multi availability zone outage, affecting anything touching ebs
[14:42] hazmat: Do they know why? Or rather.. have they published why?
[14:42] with internal network saturation due to ebs replication that impacted pretty much all services
[14:42] Ah, ok
[14:42] niemeyer, they haven't published outside of saying 'network event'
[14:42] That's so lame
[14:42] which triggered the ebs remirroring
[14:42] Increases the distrust
[14:42] niemeyer, i fixed up ensemble's region portability during the outage ;-)
[14:43] hazmat: That's awesome :-)
[14:43] niemeyer, we have some other interesting problems as well, i was just doing a write up for the list
[14:43] hazmat: I can imagine some of it
[14:43] mostly relating to the fact that we're still using ubuntu zk packages
[14:43] niemeyer, its unrelated
[14:43] Ah, ok
[14:43] I can't, then :-)
[14:43] and that causes random segfaults in our agents now
[14:43] Ugh
[14:43] our unit agents have reached sufficient complexity
[14:44] niemeyer, its actually a really nice opportunity
[14:44] Let's see if we can get a hand to get that fixed
[14:44] to test fault resilience with a random fault injector
[14:44] :-)
[14:44] niemeyer, two separate multi-step tracks, fix the packaging, fix the fault resilience
[14:45] "Fix the fault resilience" feels like a long chain
[14:45] niemeyer, i paused on the resolved work as well (two branches in review), i wanted to discuss options for the implementation of the sans-hook transitions
[14:45] Ok
[14:45] niemeyer, well three parts afaics, with some details..
[14:45] hazmat: But what's in the queue is good to go, right?
[14:46] re fault.. agents monitoring launched agents, queue with fs durability
[14:46] niemeyer, yes
[14:47] hazmat: Cool, sound like good topics
[14:49] niemeyer, i'm gonna run a quick errand, but if you're game in 15m, i'd like to do a quick skype on the resolved stuff
[14:49] hazmat: Sounds good
[15:11] niemeyer i'm on skype, ping me when you're ready
[15:25] return self._invoke_lifecycle(self._lifecycle.start, nohooks=True)
[15:25] niemeyer, def start(self, fire_hooks=True)
[15:25] hazmat: ^
[15:25] niemeyer, yup
[15:26] def start(self, transition_context)
[15:26] transition_context == object with origin_state, destination_state, state variables, transition arguments
=== deryck is now known as deryck[lunch]
=== niemeyer is now known as niemeyer_lunch
[16:47] hazmat, nice analysis of the problem we were seeing
[16:47] jimbaker, yeah... its nicer to think of as a random fault injector than crappy code ;-)
[16:47] :)
=== niemeyer_lunch is now known as niemeyer
[17:25] hmm... looks like my network provider is blocking post commit message to labix
[17:36] hazmat: Huh
[17:38] http://blog.rightscale.com/2011/04/25/amazon-ec2-outage-summary-and-lessons-learned/
[17:38] Pretty good write up
[17:54] niemeyer, lots of good write ups, the rightscale has a nice set of links, the joyeur/joyent one is nice as well
[18:06] hazmat: Yeah, the RS one feels the closest from what I would expect the *official* post from Amazon to look like
[18:18] One of the funny aspects of EBS volumes is that they keep the actual machine disk more available for those that choose to use it
[18:19] "How SmugMug survived the Amazonpocalypse >> (...) Third, we don’t use Elastic Block Storage (EBS), which is the main component that failed last week."
[18:19] Major DUH
[18:41] niemeyer, yeah.. that and the joyent post got me thinking about rethinking persistence and opening up the choice to formula authors
[18:41] one thing at a time
[18:41] hazmat: Wasn't that the plan since the very early conversations?
[18:42] hazmat: IIRC the EBS-only strategy was introduced just because it was a simple way for us to get started without risking blowing people's data
[18:42] niemeyer, it was, but last i mentioned a month or two back, you were suggesting just using ebs instances and not worrying about it
[18:43] instead of spec'ing persistent directories, not clear if that was intended from a priority perspective or was a long term plan
[18:43] hazmat: For now that still feels like a good plan
[18:44] hazmat: I see.. FWIW I don't see inherent problems with supporting non-EBS formulas
[18:45] niemeyer, the goal i was considering is not requiring ec2 ebs instances for such formulas
[18:46] hazmat: That's what I'm talking about as well
[18:50] niemeyer, great
[18:56] niemeyer, is the endpoint to the post commit publishing bot on labix running?
[18:57] hazmat: I don't know.. have to check that
[18:57] hazmat: FWIW, it's not actually labix.. I just hosted the domain there.. the bot lives within one of the Landscape test servers
[18:58] niemeyer, ah.. right on i was wondering about that
[19:08] hazmat: can you explain why there is an "ensemble ami" ?
[19:08] SpamapS, good question, ideally there shouldn't be one
[19:09] Should be able to do anything w/ cloud-init that you need to do.
[19:09] SpamapS, we ended up creating one because the bootstrap time was significant if we installed from scratch
[19:09] ie. downloading java and updating packages, added several minutes to our startup
[19:10] SpamapS, plus checking out all the ensemble repos
[19:10] Yeah.. thats a valid reason to go AMI vs. cloud-init
[19:10] cloud-init had some failings in that regard as well, wrt to only logging output to the console log in the maverick cycle
[19:10] we'd be on the machine and wondering what happened for like 10m till it showed in the ec2 get-console-output api
[19:10] I've even wondered if it would be a worthy later optimization for machine providers to be able to rebundle after the install hook fires. :)
[19:10] that's better now
[19:11] yeah.. unit snapshotting would be nice, and a viable strategy for some services
[19:11] i'm really interested in serge's work with btrfs and lxc, to be discussed at uds-o
[19:12] yeah very cool stuff there
[19:12] So the fault tolerance of the agents.. is this just as simple as respawning it if it dies?
=== bcsaller1 is now known as bcsaller
[19:12] We have this thing in Ubuntu called upstart that does that. ;)
[19:13] SpamapS, its two things, its making sure state is on disk, and respawning
[19:13] bcsaller: hey! I spent this past weekend reading the first section of my new copy of "4 hour body" btw.. Thanks for the recommendation.. great book so far.
[19:13] SpamapS, but the respawn is potentially a machine not just a process
[19:13] hazmat: oh.
[19:13] SpamapS: glad you liked it
[19:14] SpamapS, ie. if we kill a machine agent, the provisioning agent may have to start a new machine to recover
[19:14] SpamapS, also upstart is fairly static is my understanding
[19:15] ie you don't load new services to be managed at runtime
[19:15] hmm. actually i guess you do
[19:15] hence package installs using upstart
[19:17] hazmat: upstart just makes a best effort at keeping it running. It will give up after a while too.. so its not a perfect solution.
[20:28] SpamapS, is that based on total number of restarts or restarts within a timespan?
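Note on the [19:13] point that agent fault tolerance is "making sure state is on disk, and respawning": the on-disk half could look roughly like the sketch below. This is an illustration only, not Ensemble code; `save_state`, `load_state`, the JSON format and the file layout are all assumptions. It writes to a temporary file, fsyncs, then renames over the target, so a crash mid-write leaves the previous state readable for the respawned agent.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Atomically write `state` (a JSON-serializable dict) to `path`.

    Hypothetical helper: write a temp file in the same directory,
    flush it to disk, then rename it over the target.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path, default=None):
    """Return the last saved state, or `default` if none was ever written."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A respawned agent would call `load_state` on startup and `save_state` after each change it must not lose. Recovering a whole machine, as raised at [19:14], is a separate problem this does not address.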
[20:49] respawn limit COUNT INTERVAL
[20:49] Respawning is subject to a limit, if the job is respawned more than COUNT times in INTERVAL seconds
[20:49] hazmat: 'man 5 init'
[20:49] SpamapS, thanks
[21:07] SpamapS: ping
[21:07] * robbiew goes from room to room
[21:08] trying to get SpamapS attention...must be running a crap irc client
[21:08] lol
[21:12] robbiew, i know what you mean... i like xchat, but it tends to only work w/ one room at a time from being able to see stuff going on of interest
[21:12] including being pinged :)
[21:12] jimbaker: pidgin for the world!!!!!!
[21:12] :P
[21:13] robbiew, not pidgin!!! ok i was unaware of that capability... the naive install i did simply opened lots and lots of windows, i couldn't take it
[21:14] I tend to hide my IRC until I am ready to be interrupted
[21:14] jimbaker: oh..the tabbed view rocks
[21:14] I put mine on the side
[21:15] SpamapS, good rendezvous protocol ;)
[21:16] if you have a ThinkPad...the ThinkLight plugin is AWESOME
[21:16] flashes the light when my nick is spoken...so I get notified, while muted ;)
[21:17] wow
[21:17] that actually sounds cool
[21:17] I wonder if I can do that w/ the MBP's light
[21:18] I'll get a bite
=== niemeyer is now known as niemeyer_biab
[21:19] we have reinvented the circa 90s office phone
[21:21] SpamapS, re deb packaging for ensemble deps, i've got a script in ensemble/debian/ec2-build.. not sure if you've looked at it, but it basically just pulls the 3.3 branch of zk and builds the deb on an ec2 machine.. we should be fine with just a deb from the 3.3.3 release tarball.. i'm interested in learning more about what the process is.
[21:23] hazmat: Yeah thats cool actually. :) I have to run now, but lets talk in about an hour.
[21:23] SpamapS, awesome, let's pick it up tomorrow
[21:23] SpamapS, we can talk later today.. but as far digging into doing it, tomorrow would be better
[21:23] hazmat: ack
[21:49] hmm. the ubuntu packaging docs are much better than the debian new maintainer guide
=== niemeyer_biab is now known as niemeyer
[22:35] kim0: around?
[22:35] hi hazmat
[23:39] hazmat, running trunk with test, i'm getting a failure on ensemble.providers.ec2.tests.test_utils.EC2UtilsTest.test_get_machine_options_defaults (http://paste.ubuntu.com/598962/)
[23:41] bcsaller, i'm also seeing failures with your refactor-to-yamlstate branch (which is why i looked at trunk and did a fresh install of our dependencies so i could move to python 2.7 in my virtualenv)
[23:42] jimbaker: any tracebacks?
[23:44] bcsaller, here's the full traceback from test - http://paste.ubuntu.com/598963/
[23:44] most of those look spurious
[23:45] but in isolating, http://paste.ubuntu.com/598965/ looks relevant to the changes you made
[23:45] jim: thanks, I think some of those were in a later branch, I didn't think what I pushed was impacted by that
[23:46] bcsaller, cool. i'm trying to base a branch on refactor-to-yamlstate
[23:46] yeah, thats the set taking a dict rather than a YAML dict string change
[23:47] i know it's sort of early to do so, but it seemed the easiest way to keep our work from conflicting on HookContext
[23:47] right now, i'm just going to hold off on the hook command changes
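
Note on the upstart exchange at [20:49]: the quoted `respawn limit COUNT INTERVAL` text means the limit applies to respawns within a timespan, not to a lifetime total. A minimal sketch of that rule follows, using a sliding window. The class name `RespawnLimiter` and its methods are made up for illustration, and upstart's own bookkeeping may differ in detail.

```python
import collections
import time


class RespawnLimiter:
    """Allow at most `count` respawns in any `interval`-second window.

    Hypothetical illustration of the rule quoted from 'man 5 init';
    not upstart's actual implementation.
    """

    def __init__(self, count, interval, clock=time.monotonic):
        self.count = count
        self.interval = interval
        self._clock = clock  # injectable so the logic can be tested
        self._respawns = collections.deque()

    def allow_respawn(self):
        """Record a respawn and return True, or return False to give up."""
        now = self._clock()
        # Forget respawns that have aged out of the window.
        while self._respawns and now - self._respawns[0] >= self.interval:
            self._respawns.popleft()
        if len(self._respawns) >= self.count:
            return False
        self._respawns.append(now)
        return True
```

A supervisor would call `allow_respawn()` each time the job dies and stop restarting it once the call returns False, which matches the "it will give up after a while" behaviour described at [19:17].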