[00:00] Oh, and then of course, add/remove stuff from your running ensemble environment [00:01] jimbaker`: ping [00:01] still needs to have the command changing added.. and show the units / machines.. but.. kind of cool. :) [00:03] (oh, and install gource) [00:13] bcsaller, interesting so it still doesn't work [00:13] part of the issue i think is that the container needs to do an apt-get update prior to the pkg install [00:14] else its referencing an old version of ensemble that's not upstream anymore [00:15] hazmat: ahh, that is strange, I have a script that updates the cache locally, but its just a chroot, apt-get update/upgrade [00:18] bcsaller, it should be part of the ensemble-create script, else its going to break during dev cycles every time there's an upload or prods on srus [00:25] bcsaller, i'm also seeing this alot.. ensemble.lib.lxc.LXCError: lxc-stop: failed to stop 'lxc_test': Operation not permitted [00:26] which causes other tests to fail because the container exists [00:28] bcsaller, any ideas on automated alternatives to apt-get upgrade in ensemble-create? [00:28] it does take quite a while [00:29] yeah, I've been thinking about that, its like there is another type of bootstrap before spawning a stack that you might want before spawning many nodes [00:30] I think they call that a "release" ;) [00:32] bcsaller, well its not even that.. the environment might live for quite a while.. and this will still cause breakage when adding a new unit to an existing env [00:32] hazmat: not so on a regular release of Ubuntu [00:32] the versions in the lists never "disappear" [00:32] SpamapS, all it takes is an sru? [00:32] nope [00:32] they stay there, all of them [00:33] SpamapS, ah cool.. so its only for dev versions that old versions get yanked? [00:33] this business of purged versions is only an issue during development [00:33] Yeah, for this exact reason. [00:36] bcsaller, so even updating the debbootstrap cache by hand, i'm still seeing a bunch of errors.. do the tests work for you on the lxc-lib branch? [00:36] hazmat: yes [00:37] hazmat: _cmd does return the output of the command if you want to make a change to look at it [00:37] jimbaker`: what is "butler" ? [00:37] sounds a lot like jenkins. :) [00:38] bcsaller, http://paste.ubuntu.com/685682/ [00:42] <_mup_> ensemble/lib-lxc-merge r339 committed by kapil.thangavelu@canonical.com [00:42] <_mup_> merge latest lxc-lib [00:46] hazmat: I'll see what I can do :-/ Just not sure there is a good place in the lifecycle for this. You think the tests timeout was what killed it and then it didn't clean up for the later tests? [00:48] bcsaller, hmm.. there's a couple of issues, i don't think the timeout is one of them [00:49] bcsaller, the container cleanup needs to wait for stopped state before proceeding to destroy [00:50] _cmd is spitting the output on error, without any context of what command it ran... although that's less functional [00:51] er. not a functional problem [00:56] hazmat: are you running the containers as pure daemons or foreground children? [00:56] SpamapS, daemons [00:56] hazmat: so you need to watch the cgroup then [00:57] or poll proc [00:57] SpamapS, lxc-wait does the trick [00:57] oh nice [00:57] didn't know that existed [00:58] SpamapS: took us a while to find it too [01:12] bcsaller, i don't see how the tests could be working [01:12] bcsaller, lxc-stop normally tosses an error [01:13] bcsaller, which will break per its integration with _cmd [01:13] and raise an exception [01:14] bcsaller, what version of lxc do you have? [01:14] hazmat: 0.7.5-0ubuntu7 [01:14] aha [01:17] i'm on the version in the ppa [01:21] hazmat: let me know if that changes anything [01:44] SpamapS, any chance we can the oneiric lxc into the ppa for natty [01:45] bcsaller, update-manager -d is broken for me at the moment.. [01:45] s/can/can get [02:23] SpamapS, historically a butler used to manage the buttery, which would in turn store the results of churning operations [02:24] SpamapS, that it is also similar to jenkins is not terribly coincidental either ;) [03:10] hazmat: you should be able to just upload it. === Aram_ is now known as Aram [12:56] hmm does open-port support a port range [13:11] Hallo Ensemblers [13:13] Hello [13:15] hmm .. Can I launch a long running program from an install hook [13:20] kim0: Hey man [13:20] kim0: Absolutely [13:20] I was imagining I'd need tricks [13:20] I'm doing an torrent download appliance on the cloud with Ensemble :) [13:20] hope this will go popular with many users [13:25] <_mup_> Bug #845604 was filed: ensemble should show ports that need to be exposed < https://launchpad.net/bugs/845604 > [13:28] kim0: It does sound cooL! [13:44] can open-port do a port-range [13:47] kim0: No, but that's an interesting idea [13:47] kim0: Can you please open a bug about this? [13:47] sure thingie [13:49] kim0, so i imagine at some point we might try to put some sort of sensible time outs on hooks [13:50] but if you fork something that should be fine [13:50] <_mup_> Bug #845616 was filed: open-port should support port ranges < https://launchpad.net/bugs/845616 > [13:50] hazmat: does that mean the install hook would not be considered "complete" [13:51] kim0, if it hasn't exited ... yes [13:51] hazmat: I suppose the better way is to double-fork my command [13:51] definitely [13:51] any advice on doing that, or should I google :) [13:52] kim0, probably google will be faster, what are you writing the program in? [13:52] bash shell script [13:52] it's a formula after all :) [13:53] It might actually not be a bad idea for my use-case .. to start the command in screen and detach it [13:53] I hope that would make the hooks happy [13:54] kim0: start-stop-daemon may help you as well [13:55] niemeyer: oh thanks looking at that [14:01] woot, bash's version of double fork is: ( command & ) [14:01] of course! [14:36] When ensemble launches an instance, can it get it's host SSH keys (ec2-get-console-output), such that I am not asked to confirm the machine's identity upon first login? worth a bug report? [14:37] kim0: We already have one open for that [14:37] ah okie [14:37] kim0: I *think* [14:37] kim0: At least we're very aware of the issue [14:38] * kim0 nods [14:38] kim0: The proper way is to send the host key [14:38] kim0: Rather than just ignoring [14:38] kim0: Otherwise it's a security issue [14:38] kim0: You'll likely ignoring it anyway, but then you're the security issue! ;-D [14:38] exactly :) [14:39] s/ignoring/be ignoring [14:39] kim0: We want to improve that, more seriously [14:48] jimbaker`: Please ping me when you're around [14:55] niemeyer, hi [14:55] jimbaker`: Yo [14:55] jimbaker`: So, are going to have a working waterfall today? [14:57] niemeyer, i should have butler working yes that runs the churns and generates the waterfall. to do so, i simply need to add code to walk the updates in a bzr branch [14:57] jimbaker`: Cool [14:57] this is pretty straightforward, compare bzr revno of a local branch with bzr revno lp:ensemble [14:57] (or whatever branch) [14:58] jimbaker`: Hmm.. not really.. the other point of comparison is the waterfall itself [14:59] niemeyer, what do you mean? in terms of the build runs in the waterfall directory? [14:59] jimbaker`: 1) update bzr to tip; 2) i := max revno in branch; 3) j := get max revno in waterfall; 3) for j < i: update bzr to j + run tests [15:00] jimbaker`: 1) update bzr to tip; 2) i := max revno in branch; 3) j := get max revno in waterfall; 4) for j < i: update bzr to j + run tests [15:00] niemeyer, sounds good, thanks! [15:23] hmm .. my ensemble deploy, results in "install_error", when I debaaaaapaaaa [15:23] hung irc window .. continuing .. [15:23] when I debug-hooks, and execute the hook manually .. it's exit code is 0 [15:24] any idea what could be going on [15:27] kim0: No, but you can check logs locally [15:27] * kim0 looks around [15:27] kim0: In the machine itself, that is [15:27] kim0: ensemble ssh [15:28] niemeyer: /var/log/ensemble/machine-agent.log ? [15:29] kim0: yeah [15:29] kim0: Wait, no [15:29] kim0: /var/lib/ensemble/units//formula.log [15:29] kim0: This is from the unit agent [15:29] thanks :) [15:29] kim0: What Mark says [15:47] m_3: hmm I forgot what's the cwd for a hook? [15:48] /var/lib/ensemble/units//formula/ [15:48] is that what you mean? [15:48] yes, thanks! [15:48] np [15:49] jimbaker`: so you've never explained how your butler relates to jenkins.. [15:50] jimbaker`: is this just a "jenkins is too complex and I don't want to use it" or "jenkins is missing something fundamental" ? [15:50] SpamapS, this is a project specific tool [15:50] Because its basically the industry standard.. and we already use it all over Ubuntu dev [15:51] Still sounds a lot like it was a project specific invented tool to do what jenkins does. :-P [15:52] I mean.. jenkins does code coverage analysis, distributed multi-platform testing, and a whole host of stuff I don't even understand yet.. so I'd like to understand why we're not just running bash scripts in jenkins. [15:57] SpamapS, these are all good points. the functional tests could be readily run by jenkins. however by being project specific, we can ensure that it can best meet our needs [15:59] LOL, ok, thats true, integrating it with several releases is more my problem than "yours" [16:01] jimbaker`: as long as I can also run it as part of our CI for upload to Ubuntu and maintenance of a "stable" PPA, I don't care what output it returns. :) [16:03] bike shed question: why does ensemble log in /var/lib/ensemble/whatever instead of just /var/log? [16:03] SpamapS, it will be very easy to take the output of churn and turn it into junitxml [16:05] jcastro: it does use /var/log for the "machine" wide logs [16:05] jcastro: the unit log ends up in /var/lib/ensemble because its eventually going to be in /var/lib/lxc/container/rootfs/.... [16:05] I think [16:05] yeah but I don't care about that, I care about the service being deployed and what not. [16:05] jimbaker`: I don't even care about junitxml [16:06] jimbaker`: just "pass/fail" [16:06] jcastro: its a stop gap until the unit agent runs inside a container. [16:06] oh I see [16:06] SpamapS, sure you just need some way of summarizing the churn results [16:07] jimbaker`: no, I need an exit code non 0 [16:07] jimbaker`: of course, I could just use run-parts on the same dir churn sees, why do I need churn? ;) [16:11] SpamapS, sounds like you don't need any part of the butler project to run the functional tests with jenkins. cool [16:22] SpamapS: I don't want to buy into the whole Jenkins and all of the things it does that we don't know before we need to [16:22] SpamapS: Right now our glorious functional test suite and Jenkins reinvention sums up to less than 100 lines [16:23] Hey I'm not complaining.. I *do* need something jenkins has that you don't, which is running multiple tests on multiple platform slaves. :) [16:23] SpamapS: Let's put that online ASAP and focus on the meat, which is the tests themselves and being able to see if trunk is working or not [16:24] SpamapS: We certainly have it.. these scripts can run anywhere [16:25] And one would need to coordinate the results of all of those tests. [16:25] SpamapS: I know you're not complaining.. I'm just stating the reasoning we're doing this because I've heard the "Oh, but that's Jenkins" argument a few times, so wanted to explain [16:26] SpamapS: Sure.. and nothing prevents us to use Jenkins when the threshold has been crossed [16:26] The setup that we need is, run tests on [ all supported releases ] then copy the package into the "stable" PPA. [16:26] s/to use/from using [16:26] and by we, I mean those of us integrating ensemble into Ubuntu and supporting people who use it for demos. :) [16:27] triggered by changes in bzr.. and showing those changes in all reports... [16:28] SpamapS: I bet I can do this with less than 100 lines of fabric logic or similar [16:28] SpamapS: But before even worrying about this, we need the tests [16:28] niemeyer: I'd think we'd want to rally around one tool.. like we have for everything else at Canonical. Jenkins has been in use for well over 8 months in the platform team for testing. [16:28] SpamapS: and being able to run them at all [16:29] SpamapS: That's great, and nothing we're doing prevents its use [16:30] SpamapS: But I don't want to buy a big truck when I need to walk next door [16:30] SpamapS: We should be able to run these tests in any machine, anywhere [16:30] SpamapS: checkout branch; run.. [16:30] SpamapS: With that covered, Jenkins support is trivial [16:31] indeed, jenkins tries very hard to be "any machine" :) [16:31] so getting that story right is the right focus. I was surprised to see a bunch of HTML output created and stuff. [16:32] SpamapS: It's less than 50 lines of code that converts a directory full of output files into HTML [16:32] SpamapS: and it's completely independent from the runner [16:32] SpamapS: Which is completely independent from the Bazaar updating logic [16:33] SpamapS: Again, trivial to do any of these steps in any other way.. [16:33] I need to get some food now.. biab [16:33] ciao! [17:08] jcastro, things which definitively live outside of a container do log to /var/log/ensemble .. the machine and provisioning agent atm [17:46] hazmat: so , given the impending release and such, I'm going to import your merge proposal as a patch to the oneiric txaws package.. [17:47] hazmat: even if you do make a release, there are other things in there that I'd rather just leave out of my sphere of concern [18:14] SpamapS, fair enough.. the biggest thing that's holding me up is i've seen an occasional regression against ec2 that i'm trying to track down [18:22] hazmat: *UGH* [18:23] hazmat: could you note that in the MP? That would be the suck to ship. [18:23] SpamapS, ugh indeed.. i'm doing some tests right now, but i'm not seeing any problems atm [18:23] i've definitely seen issues b4, but it might they no longer exist [18:24] do we do any extensive functional testing in txaws? Last I saw they weren't mocked up so they actually did hit Amazon [18:25] SpamapS, they are mocked up just differently [18:25] SpamapS, they have recorded responses from amazon that the parsing verifies against [18:26] and on the request side they verify the outbound request [18:26] *ah* [18:43] but its not act [18:49] biggest problem I keep running into is that canonistack's s3 is basically unusable 90% of the time [18:51] hazmat: As ahasenack would say, problems that magically disappear, magically reappear :-) [18:52] SpamapS: We should try to deploy Ceph there [19:09] <_mup_> ensemble/stack-crack r333 committed by kapil.thangavelu@canonical.com [19:09] <_mup_> merge trunk [19:10] SpamapS, its not that bad for me re canonistack [19:10] niemeyer, proper solution is to deploy swift [19:11] alternatively gluster [19:11] ceph + btrfs = chains of instability [19:11] hazmat: "proper" depends a lot of context [19:11] s/of/on [19:12] niemeyer, well we're talking about a machine provider storage that has an s3 front end and scales.. swift is that [19:12] hazmat: Really? Who's been using it at scale? [19:12] niemeyer, it powers rackspace cloud files today [19:12] its production code [19:12] hazmat: Interesting.. I'm curious about the stability of it [19:14] hazmat: Either way, Ceph is going to production soon as well [19:14] when we want to talk about volume/storage management by ensemble itself.. then tools like ceph/lustre/gluster are more appropriate, assuming an absence of a requisite provider capabilities (like orchestra) [19:14] niemeyer, i'm not sure how.. i still see lots of btrfs fails [19:14] hazmat: They've been in beta for quite a while [19:14] hazmat: objects.dreamhost.com [19:15] hazmat: This is the restricted beta site [19:15] niemeyer, internal server error ;-) [19:15] hazmat: Yeah, unfortunate timing [19:15] hazmat: It's down ATM [19:16] ceph has many more moving parts and code, and depends on other things that are not production ready (btrfs) [19:16] compared to swift for example, but swift isn't block storage [19:16] er. volume storage [19:17] its REST object storage [19:17] does ceph or gluster export block devices to clients? [19:17] to the user, lustre is just NFS on 'roids and a nightmare to the sys admins :P [19:19] adam_g: Yeah, Ceph has a kernel driver in the mainline [19:19] But that's a separate piece from the object storage and S3 interfces [19:19] inerfaces [19:19] niemeyer: right, a file system or a block driver? [19:20] adam_g: "Rados block device (RBD). The RBD driver provides a shared network block device via a Linux kernel block device driver (2.6.37+) or a Qemu/KVM storage driver based on librados. In contrast to alternatives like iSCSI or AoE, RBD images are striped and replicated across the Ceph object storage cluster, providing reliable, scalable, and thinly provisioned access to block storage. RBD supports read-on [19:20] ly snapshots with rollback." [19:20] oh, cool [19:21] * adam_g knows little about ceph [19:21] https://lists.launchpad.net/openstack/msg00053.html <- interesting [19:24] adam_g: I don't claim to know much either, but its features resemble science fiction [19:24] adam_g: Except it's real software backed by a real company that is doing that for quite a while [19:31] swift is definitely production ready stuff [19:32] adam_g: It's good to hear you guys feel confident on it [19:33] its the only openstack component thats seen production use. its too bad its lumped in and assumed to be as unstable as everything else under that umbrella [19:36] adam_g: So, it's not clear to me.. how does Swift handle storge? [19:36] storage [19:38] niemeyer: at what level? [19:38] hazmat: Perhaps you can answer that as well.. have you been following it? [19:39] adam_g: Replication, balancing, etc [19:43] niemeyer: http://swift.openstack.org/overview_architecture.html is a good overview [19:44] adam_g: Neat, thanks [19:47] hazmat: btw, re CEPH, its apparently ok to use it w/ ext3/4 now.. just not as performant. [19:49] niemeyer: file level. [19:49] niemeyer: swift is not a block store [19:50] hazmat: maybe I'm doing something wrong w/ canonistack's s3.. it has been timing out with every request all day [19:51] SpamapS: Yeah, was mostly wondering about the logic for replicating/load balancing the files [19:51] Its pretty simplistic [19:52] Thats a compliment to it btw. :) [19:52] Its a bit more clever than MogileFS, which simply keeps track of all files in an underlying database. [20:32] hazmat: heh, ignore my earlier comment about canonistack's s3 going slow.. I had left out my patch in the debian/patches/series file .. DOH! [20:41] hazmat: so, do you have a workaround for the keys not being set? [21:23] Stepping away.. have a good weekend folks [21:24] you too niemeyer! [21:41] SpamapS, so i think the issue i'm able to trigger on ocassion also exists in txaws trunk [21:41] happens when the security group gets removed [21:41] some sort of error happens, that txaws doesn't parse properly and then it gets a traceback [21:41] SpamapS, as for key not set workaround not sure.. smoser has a branch for openstack and cloud-init [21:42] SpamapS, gustavo suggested working around by bypassing cloud-init key installation.. [21:42] cloud-init is uploaded [21:42] smoser, nice, thanks [21:43] but regarding lucid support we either fix in openstack, ensemble, or sru cloud-init [21:49] <_mup_> Bug #846055 was filed: Occasional error when shutting down a machine from security group removal < https://launchpad.net/bugs/846055 > [22:25] hazmat: heh, well there's no lucid series of principia.. so we don't have to worry about lucid.. right? ;-) [22:26] I think fixing in nova is the right thing [22:26] and it looks like the trivial MP has been approved, so just needs to land in OpenStack. [22:36] SpamapS, yeah.. that's ideal, i'd rather not hardcoding things to bypass tools we already depend on [22:42] is there any plans to make the ensemble agents upstarted services instead of being spawned by cloud-init? [22:58] adam_g: yes, but there is some trouble to be tended to since the agents might miss changes in state if they're not running (something I think should be fine, but hazmat knows better than I do :) [22:59] IMO the state is the state, and the agent's job is just to make that state a reality.. and formulas should be written that way as well.. not written in such a way where their ordering matters. [23:11] adam_g, there are for the local dev [23:12] adam_g, we could go there for the provisioning and machine agent as they have no transient state, there's an issue for the unit agent that needs to be resolved for them to safely moved over to it [23:12] adam_g, i'm using upstart for unit agents on local dev.. but its a little dicey.. there's an open ticket/bug for it [23:19] hazmat: im running into an issue where the agent looses connection to zookeeper due to something that the formula is doing, and needs to be restarted manually [23:19] adam_g, do you have any logs for the agent you can upload to a bug? [23:20] hazmat: yeah, let me get something and you can tell me if its relevant, or if perhaps the formula shouldn't be doing anything that would cause connectivity to drop [23:20] adam_g, is the formula manipulating the firewall? [23:21] hazmat: the firewall, no. but basically doing an ifdown -a ; ifup -a [23:21] http://paste.ubuntu.com/686183/ [23:21] adam_g, hmm.. yeah.. we have some better reconnect capabilities in our zk api layer... but we haven't gone through and put the additional reconnect logic into the agents [23:22] adam_g, could you go ahead and file a bug for that... we should handle short disconnects a bit better [23:23] hazmat: ah.. i might end up not touching the network stack at all in these formulas, but thats not to say nothing else will. this problem didn't show up until deploying to a hardware on a "real" network :) [23:23] long disconnects are little more problematic (effectively the same problem for the upstart, transient state needs persistence, and needs to delta to remote on connect) [23:26] adam_g, interesting.. its definitely on my todo list for next cycle re better disconnect handling in agents [23:33] <_mup_> Bug #846106 was filed: Interruption of network connectivity should be handled gracefully < https://launchpad.net/bugs/846106 > [23:38] hazmat: explain transient state? Why can't we just look at whats there, and make it true? [23:39] hazmat: like, if I'm starting up, and I see that there's a relation.. I should just pretend its new and run the joined/changed hooks. [23:40] hazmat: likewise for install [23:40] all hooks must be idempotent [23:40] <_mup_> ensemble/stack-crack r334 committed by kapil.thangavelu@canonical.com [23:40] <_mup_> restore key name use temporarily [23:59] SpamapS, transient state like what have we informed the formula about regarding the upstream zk state.. [23:59] SpamapS, i'm trying not to assume any hooks are idempotent outside of config [23:59] SpamapS, ideally they should be