[00:00] * niemeyer gets some food [00:08] Its important to have a nice clear easy REST API to use.. but its vital that you also provide optimisations for batch operations. Its why SQL is so popular.. easy to get one row, easy to get all rows. [01:21] jimbaker: ping [01:40] hazmat: Still around? [02:14] niemeyer indeed [02:17] hazmat: Cool, sorted out already, again! :-) [02:17] hazmat: Review queue pretty much empty [02:17] niemeyer, nice [02:19] * hazmat crashes [02:25] hazmat: Cheers [02:47] <_mup_> juju/unit-info-cli r424 committed by kapil.thangavelu@canonical.com [02:47] <_mup_> remove the manual copy of host resolv.conf, since customize runs in a chroot, directly modify the resolv.conf output to point to dnsmasq, fix indentation problem [04:22] <_mup_> juju/env-origin r381 committed by jim.baker@canonical.com [04:22] <_mup_> Merged trunk [06:09] FYI r378 caused a segfault when building on natty [06:09] https://launchpadlibrarian.net/81865589/buildlog_ubuntu-natty-i386.juju_0.5%2Bbzr378-1juju1~natty1_FAILEDTOBUILD.txt.gz [09:43] * ejat just wondering … is someone doing charm for liferay :) [11:25] hmm [11:26] SpamapS, its a problem with the zk version there [11:26] 3.3.1 has known issues for juju [11:27] applies primarily to libzookeeper and python-libzookeeper [12:06] SpamapS, all the distro ppas (minus oneiric perhaps) should have 3.3.3 [12:10] <_mup_> Bug #867420 was filed: Add section mentioning expose to the user tutorial. < https://launchpad.net/bugs/867420 > [12:12] just updated my oneiric install, juju seems to have a problem: [12:12] Errors were encountered while processing: [12:12] /var/cache/apt/archives/juju_0.5+bzr361-0ubuntu1_all.deb [12:12] E: Sub-process /usr/bin/dpkg returned an error code (1) [12:17] was a transient problem, apt-get update, apt-get -f install seemed to have fixed it [12:29] interestingly simulating transient disconnection of a bootstrap node for extended periods of time seems to be fine [13:11] heya niemeyer [13:12] hazmat: ahh, we need to add a versioned build dep then [13:17] Hello! [13:19] niemeyer, g'morning [13:23] fwereade: How're things going there? [13:23] hazmat: Good stuff in these last few branches [13:23] niemeyer: tolerable :) [13:23] fwereade: ;-) [13:24] niemeyer, yeah.. finally fixed the local provider issue wrt to customization, so all is good there, still seem some occasionally lxc pty allocation errors, but haven't deduced to a reliable reproduction strategy for upstream [13:25] niemeyer, i did play around with the disconnect scenarios some more, at least for a period of no active usage (no hooks executing, etc), we tolerate zookeeper nodes going away transiently fairly well [13:25] hazmat: By zookeeper nodes you mean the server themselves? [13:25] servers [13:25] niemeyer, yeah.. the zookeeper server going away [13:25] hazmat: Neat! [13:26] It's a good beginning :) [13:26] hazmat: we should talk to rogpeppe about the issues we debated yesterday [13:26] hazmat: re. making things not fail when possible [13:26] i'm here! [13:26] niemeyer, for the single server case, the session stays alive, if the client reconnects within the the session timeout period after the server is back up. 
and the clients all go into poll mode every 3s when the zk server is down (roughly 1/3 session time i believe) [13:27] (afternoon, folks, BTW) [13:27] niemeyer, there's a few warnings in the zk docs about not trusting library implementations that do magic things for the app [13:27] regarding error handling [13:27] rogpeppe, hola [13:27] hazmat: Well, sure :) [13:27] hazmat: That's what the whole session concept is about, though [13:28] rogpeppe: This goes a bit in the direction you were already thinking [13:28] rogpeppe: You mentioned in our conversations that e.g. it'd be good that Dial would hold back until the connection is actually established [13:28] rogpeppe: This is something we should do, but we're talking about doing more than that [13:28] i don't know if this is relevant, or if it's a problem with gozk alone, but i never got a server-down notification from a zk server, even when i killed it and waited 15 mins. [13:29] rogpeppe: Try bringing it up afterwards! :-) [13:29] rogpeppe, session expiration is server governed, clients don't decide that [13:29] rogpeppe: It's a bit strange, but that's how it works.. the session times out in the next reconnection [13:29] niemeyer: yeah, i definitely think it should [13:30] rogpeppe, the clients go into a polling reconnect mode, turning up the zookeeper debug log verbosity will show the activity [13:30] hazmat: but what if there's no server? surely the client should fail eventually? [13:30] rogpeppe: So, in addition to this, when we are connected and zk disconnects, we should also block certain calls [13:30] rogpeppe: Well.. all the calls [13:30] rogpeppe, nope.. they poll endlessly in the background, attempting to use the connection will raise a connectionloss/error [13:31] rogpeppe, at least until the handle is closed [13:31] rogpeppe: So that we avoid these errors ^ [13:31] rogpeppe, that's why we have explicit timeouts for connect [13:31] rogpeppe: In other words, if we have a _temporary_ error (e.g. disconnection rather than session expiration), we should block client calls [13:31] above libzk [13:31] hazmat: but if all users are blocked waiting for one of {connection, state change}, then no one will try to use the connection, and the client will hang forever [13:32] rogpeppe: Not necessarily.. as you know it's trivial to timeout and close a connection [13:32] rogpeppe: I mean, on our side [13:32] so all clients should do timeout explicitly? [13:32] rogpeppe: <-time.After & all [13:32] sure, but what's an appropriate timeout? [13:33] rogpeppe: Whatever we choose [13:33] rogpeppe: But that's not what we're trying to solve now [13:33] sure [13:33] rogpeppe: What we have to do is make the gozk interface bearable [13:33] rogpeppe: Rather than a time bomb [13:33] so we're trying to make recoverable error handling subsumed into the client [13:33] [note to future: i'd argue for the timeout functionality to be inside the gozk interface, not reimplemented by every client] [13:34] [note to future: discuss timeout with rogpeppe] [13:34] by capturing a closure for any operation and, on connection error, waiting till the connection is reestablished and re-executing the closure (possibly with additional error detection semantics) [13:34] hazmat: are we talking about the gozk package level here? [13:34] hazmat: I think there's a first step before that even [13:35] or a higher juju-specific level?
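A rough sketch of the idea under discussion: gate client calls on connection state above the zk library, with an explicit timeout, so a temporary disconnection holds callers back instead of surfacing connection-loss errors. This is a hedged Python illustration against a hypothetical wrapped client; none of the names below come from gozk, txzookeeper, or libzookeeper.

    import threading

    class OperationTimeout(Exception):
        """Raised when the connection stays down longer than we are willing to wait."""

    class HoldingClient:
        """Gates operations of a hypothetical zk client on connection state."""

        def __init__(self, client, timeout=30):
            self._client = client          # assumed to expose the raw operations
            self._timeout = timeout
            self._connected = threading.Event()

        def session_event(self, connected):
            # Hook this up to the client's session watch: True on (re)connect,
            # False on a temporary disconnection.
            if connected:
                self._connected.set()
            else:
                self._connected.clear()

        def _call(self, op, *args):
            # Hold the call while the connection is down rather than letting it
            # fail immediately with a connection-loss error.
            if not self._connected.wait(self._timeout):
                raise OperationTimeout("connection down for over %ss" % self._timeout)
            return op(*args)

        def get(self, path):
            return self._call(self._client.get, path)

        def set(self, path, data, version):
            return self._call(self._client.set, path, data, version)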
[13:35] rogpeppe: Yeah, internal to gozk [13:35] rogpeppe, which pkg isn't relevant, but yes at the zk conn level [13:35] http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling [13:35] hazmat: Before we try to _redo_ operations, we should teach gozk to not _attempt_ them in the first place when it knows the connection is off [13:35] hmm [13:35] yeah.. thats better [13:36] we can basically watch session events and hold all operations [13:36] niemeyer, +1 [13:36] hazmat: Cool! [13:36] hmm [13:36] niemeyer, so there is still a gap [13:36] rogpeppe: Does that make sense to you as well? [13:37] * rogpeppe is thinking hard [13:37] hazmat: There is in some cases, when we attempt to do something and the tcp connection crashes on our face [13:37] niemeyer, internally libzk will do a heartbeat effectively to keep the session alive, if the op happens before the heartbeat detects dead we still get a conn error [13:37] hazmat: Let's handle that next by retrying certain operations intelligently [13:37] i think the first thing is to distinguish between recoverable and unrecoverable errors [13:38] rogpeppe, its a property of the handle [13:38] rogpeppe: That's the next thing, after the initial step we mentioned above [13:38] libzk exposes a method for it to return a bool [13:38] recoverable(handle) [13:38] rogpeppe: For blocking operations on certain connection states, we're actually preventing the error from even happening [13:39] preventing the error being exposed to the API-client code, that is, yes? [13:39] rogpeppe: No [13:39] rogpeppe, yup [13:39] :-) [13:39] lol [13:39] rogpeppe: Preventing it from happening at all [13:39] the error never happens [13:40] because we don't let the op go through while disconnected [13:40] rogpeppe: The error never happens if we don't try the call [13:40] ok, that makes sense. [13:40] but... what about an op that has already gone through [13:40] ? [13:40] next step is to auto recover the error for ops that we can do so without ambiguity, because there is still a gap on our detection of the client connectivity [13:40] and then the connection goes down [13:40] rogpeppe: That's the next case we were talking about above [13:41] rogpeppe: If the operation is idempotent, we can blindly retry it behind the lib client's back [13:41] niemeyer: do we need to? i thought it was important that clients be prepared to handle critical session events [13:41] rogpeppe: If the operation is not idempotent, too bad.. we'll have to let the app take care of it [13:41] rogpeppe, effectively the only only ops i've seen ambiguity around is the create scenario, and modifications without versions [13:42] rogpeppe: Do we need to what? [13:42] do we need to retry, was my question. [13:42] so this might be better structured as a library on top of the connection that's specific to juju [13:42] rogpeppe: Yeah, because otherwise we'll have to introduce error handling _everywhere_, doing exactly the same retry [13:43] hazmat: Nah.. let's do it internally and make a clean API.. we know what we're doing [13:43] does zookeeper do a 3 phase commit? [13:43] niemeyer, famous last words ;-) [13:43] i.e. for something like create with sequence number, does the client have to acknowledge the create before the node is actually created? [13:43] hazmat: Well, if we don't, we have larger problems ;-) [13:43] rogpeppe, its a paxos derivative internally. 
everything forwards to the active leader in the cluster [13:43] writes that is [13:44] it transparently does leader election as needed [13:44] rogpeppe: The _client_ cannot acknowledge the create [13:44] rogpeppe, the client doesn't ack the create, but the error recovery with a sequence node is hard, because without the server response, we have no idea what happened [13:45] niemeyer: why not? i thought the usual process was: write request; read response; write ack; server commits [13:45] rogpeppe: What's the difference? [13:45] rogpeppe: write ack; read response; write ack; read response; write ack; read response; server commits [13:45] niemeyer: the difference is that if the server doesn't see an ack from the client, the action never happened. [13:46] rogpeppe: Doesn't matter how many round trips.. at some point the server will commit, and if the connection crashes the client won't know if it was committed or not [13:46] ? there's client acks under the hood? [13:46] hazmat: There isn't.. and I'm explaining why it makes no difference [13:46] ah [13:47] * hazmat dogwalks back in 15 [13:47] hazmat: Cheers [13:48] if the connection crashes, the client can still force the commit by writing the ack. it's true that it doesn't know if the ack is received. hmm. byzantine generals. [13:48] Yeah [13:49] i'm slightly surprised the sequence-number create doesn't have a version argument, same as write [13:50] rogpeppe: Hmm.. seems to be sane to me? [13:50] that would fix the problem, at the expense of retries, no? [13:50] rogpeppe: It's atomic.. it's necessarily going to be version 0 [13:50] ah, child changes don't change a version number? [13:51] * rogpeppe goes back to look at the modify operation [13:51] rogpeppe: It changes, but it makes no sense to require a given version with a sequence number [13:51] rogpeppe: The point of using the sequence create is precisely to let the server make concurrent requests work atomically [13:52] Hmm [13:52] Weird [13:53] Abrupt disconnection [13:53] niemeyer: but we want to do that with node contents too - that's why the version number on Set [13:53] niemeyer_: and that's the main problem with the lack of Create idempotency [13:54] anyway, we could easily document that Create with SEQUENCE is a special case [13:54] and can return an error without retrying [13:55] rogpeppe: We don't even have to document it really.. the error itself is the notice [13:55] i think it would be good if the only time a session event arrived at a watcher was if the server went down unrecoverably [13:56] actually, that doesn't work [13:56] watchers will always have to restart [13:56] rogpeppe: That's how it is today, except for the session events in the session watch [13:56] rogpeppe: Not really [13:57] rogpeppe: If the watch was already established, zk will keep track of them and reestablish internally as long as the session survives [13:58] but what if the watch reply was lost when the connection went down? [13:59] rogpeppe: Good question.. worth confirming to see if it's handled properly [14:00] i'm not sure how it can be [14:00] the client doesn't ack watch replies AFAIK [14:01] rogpeppe: There are certainly ways it can be.. it really depends on how it's done [14:01] rogpeppe: E.g. the client itself can do the verification on connection reestablishment [14:01] Another alternative, which is perhaps a saner one, is to do a 180⁰ turn and ignore the existence of sessions completely [14:02] Hmmm.. 
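To make the retry point above concrete: reads and version-qualified modifications can in principle be re-run blindly after a connection loss, while a plain or SEQUENCE create cannot, because without the server's response the client has no way to know whether the node was already created. A minimal Python sketch, assuming an invented ConnectionLoss error and wait_for_reconnect() helper rather than any real gozk or txzookeeper API:

    class ConnectionLoss(Exception):
        """Stand-in for the library's connection-loss error."""

    # Operations treated as safe to re-run blindly: repeating them after an
    # unnoticed success gives the same outcome (reads and version-checked writes).
    IDEMPOTENT = {"get", "exists", "children", "set", "delete"}

    def call_with_retry(conn, name, *args, retries=3):
        """Retry idempotent operations across connection losses.

        Creates (plain or with SEQUENCE) are attempted once only: if the first
        attempt actually succeeded before the connection dropped, a retry could
        leave a duplicate or orphaned sequence node, so the error is surfaced
        to the caller instead.
        """
        op = getattr(conn, name)
        attempts = retries if name in IDEMPOTENT else 1
        for attempt in range(attempts):
            try:
                return op(*args)
            except ConnectionLoss:
                if attempt == attempts - 1:
                    raise
                conn.wait_for_reconnect()  # hypothetical: block until the session is live again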
[14:02] niemeyer_: that would look much nicer from a API user's perspective [14:02] I actually like the sound of that [14:03] rogpeppe: Not even thinking about API.. really thinking about how to build reliable software on top of it [14:03] aren't those closely related things? [14:04] rogpeppe: Not necessarily.. an API that reestablishes connections and knows how to hanndle problems internally is a lot nicer from an outside user's perspective === niemeyer_ is now known as niemeyer [14:06] niemeyer: don't quite follow [14:06] rogpeppe: Don't worry, it's fine either way [14:06] * hazmat catches up [14:07] hazmat: I think we should do a U turn [14:07] niemeyer, how so? [14:08] hmm.. verifying watch handling while down sounds good [14:08] connection down that is [14:08] hazmat: We're adding complexity in the middle layer, and reality is that no matter how complex and how much we prevent the session from "crashing", we _still_ have to deal with session termination correctly [14:08] session termination is effectively fatal [14:08] when does a session terminate? [14:08] the only sane thing to do is to restart the app [14:08] hazmat: we're also constantly saying "ah, but what if X happens?".. [14:09] rogpeppe, a client is disconnected from the quorum for the period of session timeout [14:09] hazmat: Not necessarily.. we have to restart the connection [14:09] niemeyer, and reinitialize any app state against the new connection [14:09] hazmat: Yes [14:09] ie. restart the app ;-) [14:09] hazmat: No, restart the app is something else [14:10] hazmat: Restart the app == new process [14:10] doesn't have to be a process restart to be effective, but it needs to go through the entire app init [14:10] hazmat: So, the point is that we have to do that anyway [14:10] hazmat: Because no matter how hard we try, that's a valid scenario [14:10] rogpeppe, the other way a session terminates is a client closes the handle, thats more explicit [14:11] rogpeppe, that can be abused in testing by connecting multiple clients via the same session id, to simulate session failures [14:11] niemeyer, absolutely for unrecoverable errors that is required [14:11] hazmat: So what about going to the other side, and handling any session hiccups as fatal? It feels a lot stronger as a general principle, and a lot harder to get it wrong [14:11] when you say "reinitialize any app state", doesn't that assume that no app state has already been stored on the server? [14:11] for recoverable errors local handling inline to the conn, seems worth exploring [14:11] or are we assuming that the server is now a clean slate? [14:12] we need to validate some of the watch state [14:12] rogpeppe, no the server has an existing state [14:12] hazmat: The problem is that, as we've been seeing above, "recoverable errors" are actually very hard to really figure [14:12] rogpeppe, the app needs to process the existing state against its own state needs and observation requirements [14:12] hazmat: rogpeppe makes a good point in terms of the details of watch establishment [14:12] so presumably we know almost all of that state, barring operations in progress? [14:12] hazmat: and I don't have a good answer for him [14:12] niemeyer, that's why i was going with a stop/reconnect/start for both error types as a simple mechanism [14:13] for now [14:13] * hazmat does a test to verify watch behavior [14:13] hazmat: Yeah, but the problem we have _today_ and that I don't feel safe doing that is that we don't have good-but-stay-alive semantics in the code base [14:13] erm.. 
[14:13] good stop-but-stay-alive [14:14] i *think* that the most important case is automatic retries of idempotent operations. [14:14] niemeyer, we do in the unit agents as a consequence of doing upgrades, we pause everything for it [14:14] but that's hard too. [14:14] hazmat: I seriously doubt that this will e.g. kill old watches [14:15] niemeyer, effectively the only thing that's not observation driven is the provider agent does some polling for runaway instances [14:15] niemeyer, it won't kill old watches, but we can close the handle explicitly [14:16] hazmat: and what happens to all the deferreds? [14:16] niemeyer, their dead, when the session is closed [14:16] at least for watches [14:17] hazmat: What means dead? Dead as in, they'll continue in memory, hanging? [14:17] niemeyer, yeah... their effectively dead, we can do things to clean them up if that's problematic [14:17] dead in memory [14:18] hazmat: Yeah.. so if we have something like "yield exists_watch", that's dead too.. [14:18] we can track open watches like gozk and kill them explicitly (errback disconnect) [14:18] hazmat: That's far from a clean termination [14:18] niemeyer, we can transition those to exceptions [14:18] hazmat: Sure, we can do everything we're talking about above.. the point is that it's not trivial [14:19] it seems straightforward at the conn level [14:19] to track watches, and on close kill them [14:19] hazmat: Heh.. it's straightforward to close() the connection, of course [14:19] hazmat: It's not straightforward to ensure that doing this will yield a predictable behavior [14:20] so back to process suicide ;-) [14:20] hazmat: Cinelerra FTW! [14:20] this is all talking about the situation when you need to explicitly restart a session, right? [14:20] rogpeppe, yes [14:21] rogpeppe: Yeah, control over fault scenarios in general [14:21] restart/open a new session [14:21] restart is different, i thought [14:21] because the library can do it behind the scenes [14:21] and reinstate watches [14:21] redo idempotent ops, etc [14:21] rogpeppe, but it can't reattach the watches to the all extant users? [14:22] i don't see why not [14:22] perhaps in go that's possible with channels and the channel bookeeping [14:22] against the watches [14:22] hazmat, rogpeppe: No, that doesn't work in any case [14:22] ? [14:22] The window between the watch being dead and the watch being alive again is lost === rogpeppe is now known as rog [14:23] of course [14:23] doh [14:23] except... [14:23] that the client *could* keep track of the last-returned state [14:23] and check the result when the new result arrives [14:24] and trigger the watcher itself if it's changed [14:25] rog: Yeah, we could try to implement the watch in the client side, but that's what I was talking above [14:25] expect... i don't know if {remove child; add child with same name} is legitimately a no-op [14:25] rog: We're going very far to avoid a situation that is in fact unavoidable [14:25] rog: Instead of doing that, I suggest we handle the unavoidable situation in all cases [14:26] force all clients to deal with any session termination as if it might be unrecoverable? [14:27] rog: Yeah [14:27] rog: Any client disconnection in fact [14:27] rog: let's also remove the hack we have in the code and allow watches to notice temporary disconnections [14:28] this is why proper databases have transactions [14:28] rog: Uh.. 
[14:28] rog: That was a shot in the dark :-) [14:29] if the connection dies halfway through modifying some complex state, then when retrying, you've got to figure out how far you previously got, then redo from there. [14:29] rog: We have exactly the same thing with zk. [14:30] rog: The difference is that unlike a database we're using this for coordination [14:30] rog: Which means we have live code waiting for state to change [14:30] rog: A database client that had to wait for state to change would face the same issues [14:31] rog, databases still have the same issue [14:31] yeah, i guess [14:32] what should a watcher do when it sees a temporary disconnection? [14:33] await reconnection and watch again, i suppose [14:33] so watches don't fire if the event happens while disconnected [14:34] i wonder if the watch should terminate even on temporary disconnection. [14:34] rog: It should error out and stop whatever is being done, recovering the surrounding state if it makes sense [14:34] rog: Right, exactly [14:36] and is that true of the Dial session events too? the session terminates after the first non-ok event? [14:37] i think that makes sense. [14:37] (and it also makes the use of Redial more ubiquitous). [of course i'm speaking from a gozk perspective here, as i'm not familiar with the py zk lib] [14:38] interesting, i get a session expired event.. just wrote a unit test for a watch firing while disconnected: two server cluster, two clients one connected to each, one client sets a watch, shut down its server, delete on the other client/server, resurrect the shut-down server with its client waiting on the watch, and it gets a session expired event [14:38] hmm. it's timing dependent though [14:39] rog: Yeah, I think so too [14:39] yeah.. this needs more thought [14:39] hazmat: Yeah, the more we talk, the more I'm convinced we should assume nothing from a broken connection [14:40] niemeyer, indeed [14:40] hazmat: This kind of positioning also has a non-obvious advantage.. it enables us to more easily transition to doozerd at some point [14:40] Perhaps not as a coincidence, it has no concept of sessions [14:40] * niemeyer looks at Aram [14:40] niemeyer, interesting.. i thought you gave up on doozerd [14:40] upstream seems to be dead afaik [14:40] hazmat: I have secret plans! [14:40] ;-) [14:41] niemeyer, cool, when i mentioned it before you seemed down on it [14:41] it would be nice for an ARM env to go java-less [14:41] hazmat: Yeah, because it sucks on several aspects right now [14:41] rog, on ReDial does gozk reuse a handle? [14:41] hazmat: But what if we.. hmmm.. provided incentives for the situation to change? :-) [14:41] niemeyer, yeah.. persistence and error handling there don't seem well known [14:42] niemeyer, indeed, things can change [14:43] hazmat: no, i don't think so, but i don't think it needs to. [14:43] that's one important difference between gozk/txzk.. the pyzk doesn't expose reconnecting with the same handle, which toasts extant watches (associated with the handle) when trying to reconnect to the same session explicitly [14:43] (paste coming up) [14:43] libzk in the background will do it, but if you want to change the server explicitly at the app/client level it's a hoser [14:46] if you get clients to explicitly negotiate with the central dialler when the connection is re-made, i think it can work. [14:47] i.e. get an error indicating that the server is down, ask the central thread for a new connection. [14:48] store that connection where you need to.
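The "central dialler" arrangement rog describes just above, where a client that hits a connection error hands its stale connection back and picks up the replacement, could look roughly like this. A hedged Python sketch with an invented dial() callable; it is not gozk's actual Redial machinery:

    import threading

    class ConnectionManager:
        """Hands out the current connection and remakes it on demand."""

        def __init__(self, dial):
            self._dial = dial              # any callable that establishes a fresh connection
            self._lock = threading.Lock()
            self._conn = dial()

        def current(self):
            with self._lock:
                return self._conn

        def replace(self, stale):
            # Called by a client that got a connection error on `stale`.  Only
            # the first caller actually redials; later callers reporting the
            # same stale connection just receive the replacement.
            with self._lock:
                if self._conn is stale:
                    self._conn = self._dial()
                return self._conn

A watcher that errors out would call replace() with the connection it was using, re-read the state it cares about, and re-establish its watch against the new connection, which is exactly the recheck-on-reconnect burden being debated here.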
[14:49] * niemeyer loves the idea of keeping it simple and not having to do any of that :) [14:51] yeah, me too. [14:51] but we can't. at least i think that's the conclusion we've come to, right? [14:52] rog: Hmm, not my understanding at least [14:52] rog: It's precisely the opposite.. we _have_ to do that anyway [14:52] rog: Because no matter how hard we try, the connection can break for real, and that situation has to be handled properly [14:53] rog: So I'd rather focus on that scenario all the time, and forget about the fact we even have sessions [14:54] niemeyer: so you're saying that we have to lose all client state when there's a reconnection? [14:54] rog: Yes, I'm saying we have to tolerate that no matter what [14:55] so the fact that zk has returned ok when we've created a node, we have to act as if that node might not have been created? [14:55] s/the fact/even if/ [14:56] rog: If zk returned ok, there's no disconnection [14:56] niemeyer: if it returned ok, and the next create returns an error; that's the scenario i'm thinking of [14:57] that's the situation where i think the node creator could wait for redial and then carry on from where it was [14:57] rog: If it returned an error, we have to handle it as an error and not assume that the session is alive, because it may well not be [14:58] rog: and what if the session dies? [14:58] i'm not saying that it should assume the session is still alive [14:58] i'm saying that when it gets an error, it could ask the central thread for the new connection - it might just get an error instead [14:59] the node creator is aware of the transition, but can carry on (knowingly) if appropriate [14:59] rog: and what about the several watches that are established? [14:59] niemeyer: same applies [14:59] rog: What applies? [15:00] the watch will return an error; the code doing the watch can ask for a new connection and redo the watch if it wishes. [15:00] niemeyer, so we're back to reinitializing the app on any connectoin error, disregarding recoverable [15:00] no redoing behind the scenes, but the possibility of carrying on where we left off [15:00] rog: The state on which the watch was requested has changed [15:00] rog: Check out the existing code base [15:01] niemeyer, so interestingly we can be disconnected, not know it, and miss a watch event [15:01] rog: It's not trivial to just "oh, redo it again" [15:01] niemeyer: it doesn't matter because the watcher is re-requesting the state, so it'll see both the state and any subsequent watch event [15:01] hazmat: Yeah, that's exactly the kind of very tricky scenario that I'm concerned about [15:01] the watcher has to deal with the "state just changed" scenario anyway when it first requests the watch [15:01] niemeyer, actually we get notification from a session event that we reconnected [15:01] hazmat: As Russ would say, I don't really want to think about whether it's correct or not [15:01] rog: No.. please look at the code base [15:02] niemeyer: sorry, which bit are you referring to? [15:02] rog: We're saying the same thing, in fact.. you're just underestimating the fact that "just retry" is more involved than "request the new connection and do it again" [15:02] rog: juju [15:02] rog: lp:juju [15:03] rog: This concept touches the whole application [15:03] niemeyer: i've been exploring it a bit this morning, but haven't found the crucial bits, i think. what's a good example file that would be strongly affected by this kind of thing? [15:03] hazmat: We do.. 
the real problem is ensuring state is as it should be when facing reconnections [15:04] rog: I'm serious.. this touches the whole app [15:04] niemeyer, right we always have to reconsider state on reconnection [15:04] rog: Check out the agents [15:05] niemeyer: ah, ok. i was looking in state [15:05] thanks [15:05] rog: state is good too [15:05] rog: Since it's what the agents use and touches this concept too [15:06] rog, hazmat: So, my suggestion is that the first thing we do is to unhide temporary failures in gozk [15:06] niemeyer, the test becomes a lot more reliable when we have multiple zks in the cluster setup for the client to connect to [15:06] sgtm [15:06] rog, hazmat: Then, let's watch out for that kind of issue very carefully in reviews and whatnot, as we build a reliable version [15:07] hazmat: The same problem exists, though.. [15:07] niemeyer, indeed [15:07] but it minimizes total disconnect scenarios with multiple zks [15:07] hazmat: Even if it _immediately_ reconnects, the interim problems may have created differences that are easily translated into bugs very hard to figure out [15:08] hazmat: and again, we seriously _have_ to handle the hard-reconnect across the board [15:08] niemeyer, agreed [15:08] niemeyer, i'm all in favor of simplifying and treating them the same [15:09] recoverable/unrecoverable conn errors [15:09] hazmat: So no matter how much we'd love to not break the session and have a pleasant API, the hard reconnects mean we'll need the good failure recovery either way [15:09] So we may as well plan for that at all times [15:09] niemeyer, it's detecting the conn error that i'm concerned about atm [15:09] hazmat: My understanding is that the client always notifies about temporary issues [15:09] niemeyer, based on its internal poll period to the server [15:10] niemeyer, a quick transient disconnect masks any client detection [15:10] hazmat: Really!? [15:11] it seems the server will attempt to expire the client session, but i've seen one case where instead it shows a reconnect [15:11] hazmat: I can't imagine how that'd be possible [15:11] hazmat: The client lib should hopefully notify the user that the TCP connection had to be remade [15:13] niemeyer, fwiw here's the test i'm playing with (can drop into test_session.py ).. http://paste.ubuntu.com/702290/ [15:13] for a package install of zk.. ZOOKEEPER_PATH=/usr/share/java [15:13] for the test runner [15:13] hazmat: Hmm [15:14] hazmat: That seems to test that watches work across reconnections [15:14] hazmat: We know they can work [15:14] niemeyer, they do but we miss the delete [15:14] hazmat: Or am I missing something? [15:14] hazmat: Ah, right! [15:14] with no notice [15:15] hazmat: So yeah, it's total crack [15:15] niemeyer, actually most of the time we get an expired session event in the client w/ the watch [15:15] like 99% [15:16] if i connect the client to multiple servers it sees the delete [15:16] w/ the watch that is [15:17] hazmat: Hmm.. interesting.. so does it keep multiple connections internally in that case, or is it redoing the connection more quickly? [15:17] niemeyer, not afaik, but it's been a while since i dug into that [15:18] niemeyer, but as an example here's one run http://paste.ubuntu.com/702291/ [15:19] where it does get the delete event [15:19] but that's not guaranteed in all ops [15:20] hazmat: If you _don't_ create it on restart, does it get the notification?
[15:20] hazmat: Just wondering if it might be joining the two events [15:21] niemeyer, no it still gets the deleted event if it gets an event, else it gets session expired [15:21] but it's easy to construct it so it only sees the created event [15:21] if i toss a sleep in [15:22] perhaps not [15:22] it seems to get the delete event or session expiration.. i need to play with this some more and do a more thought-out write-up [15:22] in some cases it does get the created event, obviously the pastebin has that [15:23] hazmat: I see, cool [15:24] On a minor note, filepath.Rel is in.. can remove our internal impl. now [15:24] niemeyer, cool [15:24] That was a tough one :) [15:25] fwereade: Leaving for lunch soon.. how's stuff going there? [15:25] fwereade: Can I do anything for you? [15:26] jimbaker: How's env-origin as well? [15:26] niemeyer, just need to figure out the specific text for the two scenarios you mention [15:27] jimbaker: Hmmm.. which text? [15:27] niemeyer, from apt-cache policy [15:27] jimbaker: Just copy & paste from the existing test? Do you want me to send a patch? [15:28] niemeyer, well it's close to being copy & paste, but the difference really matters here [15:28] if you have a simple patch, for sure that would be helpful [15:28] jimbaker: Sorry, I'm still not sure what you're talking about [15:29] jimbaker: It seems completely trivial to me [15:29] jimbaker: Sure.. just a sec [15:29] niemeyer, i was not familiar with apt-cache policy before this work. obviously once familiar, it is trivial [15:30] jimbaker: I'm actually talking about the request I made in the review.. [15:30] jimbaker: But since you mention it, I actually provided you with a scripted version saying exactly how it should work like 3 reviews ago [15:31] niemeyer, i'm going against http://carlo17.home.xs4all.nl/howto/debian.html#errata for a description of the output format [15:31] the python-apt bindings are pretty simple too.. i used them for the local provider.. although it's not clear how you identify a repo for a given package from it [15:31] niemeyer, if you have a better resource describing apt-cache policy, i would very much appreciate it [15:32] hazmat, one advantage of such bindings is the data model [15:33] jimbaker, well.. it's as simple as cache = apt.Cache().. pkg = cache["juju"].. pkg.isInstalled -> bool... but it doesn't tell you if it's a ppa or distro [15:33] and for natty/lucid installs without the ppa that's a KeyError on cache["juju"] [15:34] jimbaker: apt-get source apt [15:34] jimbaker: That's the best resource about apt-cache you'll find [15:34] niemeyer, ok, i will read the source, thanks [15:36] jimbaker: Turns out that *** only shows for the current version, so it's even easier [15:36] if (Pkg.CurrentVer() == V) [15:36] cout << " *** " << V.VerStr(); [15:36] else [15:36] cout << " " << V.VerStr(); [15:37] jimbaker, ideally the detection will also notice osx and do something sane, but we can do that later [15:37] jimbaker: http://paste.ubuntu.com/702301/ [15:37] more important to have this in now for the release [15:37] hazmat: Oh yeah, please stop giving ideas! :-) [15:38] hazmat, for osx, doesn't it make more sense to just set juju-origin? [15:38] Please, let's just get this branch fixed.. [15:38] jimbaker, probably does.. but the "/usr" in package path has faulty semantics with a /usr/local install on the branch as i recall [15:39] I'm stepping out for lunch [15:51] popping out for a bit, back later [16:35] i'm off for the evening. am still thinking hard about the recovery stuff.
see ya tomorrow. [16:35] niemeyer: PS ping re merge requests :-) [16:36] rog: Awesome, sorry for the delay there [16:37] rog: Yesterday was a bit busier than expected [16:43] jimbaker: How's it there? [16:56] niemeyer, it's a nice day [16:57] jimbaker: Excellent.. that should mean env-origin is ready? [16:57] niemeyer, i still need to figure out what specifically apt-cache policy would print [16:58] jimbaker: Ok.. let's do this.. just leave this branch with me. [16:58] niemeyer, i do have the source code for what prints it, but i need to understand the model backing it [16:58] jimbaker: No need.. I'll handle it, thanks. [16:58] niemeyer, ok, that makes sense, i know you have a great deal of background from your work on synaptic, thanks! [16:59] jimbaker: That work is completely irrelevant.. the whole logic is contained in the pastebin [16:59] niemeyer, ok [16:59] jimbaker: and I pointed the exact algorithm to you [17:39] <_mup_> juju/remove-sec-grp-do-not-ignore-exception r381 committed by jim.baker@canonical.com [17:39] <_mup_> Simplified remove_security_group per review point [17:40] <_mup_> juju/remove-sec-grp-do-not-ignore-exception r382 committed by jim.baker@canonical.com [17:40] <_mup_> Merged trunk [18:38] hazmat: Do you have time for a quick review on top of env-origin? [18:38] hazmat: http://paste.ubuntu.com/702373/ [18:38] hazmat: It's pretty much just that function I've shown you a while ago plus minor test tweaks [18:38] niemeyer, checking [18:38] hazmat: The test tweaks just try a bit harder to break the logic [18:39] hazmat: Hmm.. I'll also add an extra test with broken input, to ensure that's working [18:40] niemeyer, what's the file on disk it's parsing? [18:40] hazmat: output of apt-cache policy juju [18:40] or is that just apt-cache policy pkg? [18:41] hazmat: http://paste.ubuntu.com/702301/ [18:41] hazmat: Yeah [18:41] hazmat: That last paste has the logic generating the output [18:42] hazmat: Hmmm.. I'll also do an extra safety check there, actually [18:43] hazmat: It's assuming that any unknown output will fall back to branch.. that sounds dangerous [18:43] hazmat: I'll tweak it so it only falls back to branch on known inputs [18:47] hazmat: http://paste.ubuntu.com/702381/ [18:48] niemeyer, why is it returning a tuple if it only cares about the line from the line generator [18:48] niemeyer, in general it looks fine to me, there's two pieces in the branch that i have minor concern about [18:48] hazmat: Keep reading :) [18:49] ah. first indent [18:49] hazmat: It actually cares about the indent as well [18:49] hazmat: It's how we detect we've left a given version entry [18:51] hazmat: What's the other bit you're worried about? [18:52] niemeyer, basically how does it break on osx if apt-cache isn't found.. and the notion that if not juju.__name__.startswith("/usr") unconditionally means a package... if i check juju out and do a setup.py install it's still a source install.. hmm.. i guess that works with the apt-cache check on installed.. so looks like just what happens if not on ubuntu.. pick a sane default [18:52] if apt-cache isn't there this will raise an exception it looks like [18:52] hazmat: I'll take care of that [18:53] niemeyer, +1 then [18:53] hazmat: What should we default to? [18:53] * niemeyer thinks [18:53] distro, I guess [18:53] niemeyer, distro seems sane [18:53] Cool [19:01] is juju useful for deploying services like mongodb on my local dev machine?
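For reference, the detection being reviewed comes down to parsing apt-cache policy output: the *** marker picks out the installed version, and the indented archive lines beneath it say where that version came from, with distro as the agreed fallback when apt-cache is missing or the output is unrecognised. The real logic lives in the env-origin branch (parse_juju_origin / get_default_origin); the sketch below only illustrates the marker-and-indentation trick, and the PPA-matching string is a guess:

    import subprocess

    def guess_origin(package="juju"):
        """Guess whether the installed package came from the juju PPA or the distro."""
        try:
            output = subprocess.check_output(
                ["apt-cache", "policy", package], text=True)
        except (OSError, subprocess.CalledProcessError):
            return "distro"              # no apt-cache (e.g. OS X) or unexpected failure

        in_installed = False
        for line in output.splitlines():
            if "***" in line:
                in_installed = True      # entry for the currently installed version
                continue
            if in_installed:
                if not line.startswith("      "):
                    break                # indentation dropped: we left that version's entry
                if "ppa.launchpad.net" in line:   # assumed marker for the juju PPA archive
                    return "ppa"
        return "distro"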
[19:17] hazmat: http://paste.ubuntu.com/702393/ [19:17] lamalex: It is indeed [19:17] lamalex: We've just landed support for that, so we're still polishing it a bit, but that's already in and is definitely something we care about [19:20] niemeyer, awesome! [19:27] niemeyer +1 [19:27] hazmat: Woot, there we go [19:29] <_mup_> juju/env-origin r381 committed by gustavo@niemeyer.net [19:29] <_mup_> - Implementation redone completely. [19:29] <_mup_> - Do not crash on missing apt-cache. [19:29] <_mup_> - Exported and tested get_default_origin. [19:29] <_mup_> - Tests tweaked to explore edge cases. [19:45] niemeyer, should i be waiting on a second review for local-origin-passthrough or can i go ahead and merge? [19:45] bcsaller, if you have a moment and could look at local-origin-passthrough that would be awesome [19:46] I'll do it now [19:46] bcsaller, awesome, thanks [19:47] bcsaller, i had one fix that i accidentally pushed down to unit-cloud-cli but regarding the network setup in the chroot, the way it was working before modifying resolvconf/*/base wasn't going to work since that's not processed for a chroot, i ended up directly inserting dnsmasq into the output resolvconf/run/resolv.conf to ensure its active for the chroot [19:48] hazmat: why did it need to be active for the chroot? [19:48] bcsaller, because we install packages and software from there [19:48] bcsaller, most of the packages end up being cached, which caused some false starts, but doing it with juju-origin resurfaced the issue, since it had talk to lp to resolve the branch [19:48] yeah... just put that together. We might be better off with a 1 time job for juju-create [19:49] upstart job I mean [19:49] bcsaller, it is still a one time job, and dnsmasq is the correct resolver, i just changed it to be the active one during the chroot [19:49] hazmat: Hmm [19:49] k [19:49] <_mup_> juju/trunk r382 committed by gustavo@niemeyer.net [19:49] <_mup_> Merged env-origin branch [a=jimbaker,niemeyer] [r=hazmat,niemeyer] [19:49] <_mup_> This introduces a juju-origin option that may be set to "ppa", [19:49] <_mup_> "distro", or to a bzr branch URL. The new logic will also attempt [19:49] <_mup_> to find out the origin being used to run the local code and will [19:49] <_mup_> set it automatically if unset. [19:49] May be worth testing it against the tweaked env-origin [19:49] /etc/resolv.conf symlinks to /etc/resolvconf/run/resolv.conf .. its only on startup that it gets regen'd for the container via dhcp to be the dnsmasq.. [19:50] hazmat: thats why I was suggesting that it could happen in startup on the first run in a real lxc and not a chroot [19:50] niemeyer, good point.. i think i ended up calling get_default_origin to get a sane default for local provider to pass through [19:50] but the change you made should be fine [19:50] hazmat: Yeah.. I've exported it and tested it [19:51] hazmat: So it'll be easy to do that [19:51] cool [19:51] hazmat: Note that the interface has changed, though [19:51] noted, i'll do an end to end test [19:51] hazmat: It returns a tuple of two element in the same format of parse_juju_origin [19:52] hazmat, bcsaller: I've just unstuck the wtf too.. it was frozen on a "bzr update" of lp:juju for some unknown reason [19:52] We should have some input about the last 3 revisions merged soonish [19:55] I'm going outside for some exercising.. back later [20:00] niemeyer, cheers [20:00] Woot! 379 is good.. 3 to go [20:01] Alright, actually leaving now.. laters! [20:06] interesting.. 
Apache Ambari [20:10] hey is there a tutorial for using the local provider? [20:13] hmmmm [20:13] latest trunk failure on PPA build [20:13] https://launchpadlibrarian.net/81932606/buildlog_ubuntu-natty-i386.juju_0.5%2Bbzr378-1juju1~natty1_FAILEDTOBUILD.txt.gz [20:16] SpamapS, not yet [20:16] i'll put together some provider docs after i get these last bits merged [20:18] SpamapS, haven't seen those failures b4 [20:19] they work for me disconnected on trunk [20:20] SpamapS, is the s3 url endpoint being patched for the packaged? [20:21] i don't see how else that test could fail, perhaps bucket dns names [21:04] <_mup_> Bug #867877 was filed: revision in charm's metadata.yaml is inconvenient < https://launchpad.net/bugs/867877 > [21:20] <_mup_> juju/trunk-merge r343 committed by kapil.thangavelu@canonical.com [21:20] <_mup_> trunk merge [21:22] <_mup_> juju/local-origin-passthrough r418 committed by kapil.thangavelu@canonical.com [21:22] <_mup_> merge pipeline, resolve conflict [21:35] that's it for me, nn all [21:39] <_mup_> juju/trunk r383 committed by kapil.thangavelu@canonical.com [21:39] <_mup_> merge unit-relation-with-address [r=niemeyer][f=861225] [21:39] <_mup_> Unit relations are now prepopulated with the unit's private address [21:39] <_mup_> under the key 'private-address. This obviates the need for units to [21:39] <_mup_> manually set ip addresses on their relations to be connected to by the [21:39] <_mup_> remote side. [21:39] fwereade, cheers [21:44] * niemeyer waves [21:45] Woot.. lots of green on wtf [22:01] hazmat: re. local-origin-passthrough, once you're happy with it would you mind to do a run on EC2 just to make sure things are happy there? [22:01] niemeyer, sure, just in progress on that [22:01] hazmat: Cheers! === xzilla_ is now known as xzilla [22:12] although local-origin-passthrough doesn't work for me, hazmat believes he has a fix for it in the unit-info-cli branch [22:13] jimbaker, did you try it out? [22:13] hazmat, unit-info-cli has not yet come up [22:13] jimbaker, and to be clear that's not regarding ec2 [22:13] hazmat, of course not, it's local :) [22:14] jimbaker, pls pastebin the data-dir/units/master-customize.log [22:14] jimbaker, it would also be good to know if you have unit agents running or not [22:14] jimbaker, are you running oneiric containers? [22:14] jimbaker, yeah.. figuring out when its done basically needs to parse ps output [22:14] or check status [22:14] hazmat, unfortunately this is hitting a wall of time for me - need to take kids to get their shots momentarily [22:15] but incrementally its easier to look at ps output [22:15] hazmat, makes sense. i was just taking a look at juju status [22:15] jimbaker, k, i'll be around latter [22:15] ok, i will paste when i get back [22:16] jimbaker, juju status won't help if there's an error, looking at ps output shows the container creation and juju-create customization, all the output of customize goes to the customize log [22:21] * hazmat wonders if the cobbler api exposes available classes [22:24] ah.. get_mgmtclasses [23:33] <_mup_> juju/local-origin-passthrough r419 committed by kapil.thangavelu@canonical.com [23:33] <_mup_> incorporate non interactive apt suggestions, pull up indentation and resolv.conf fixes from the pipeline [23:37] <_mup_> juju/trunk r384 committed by kapil.thangavelu@canonical.com [23:37] <_mup_> merge local-origin-passthrough [r=niemeyer][f=861225] [23:37] <_mup_> local provider respects juju-origin settings. 
Allows for using [23:37] <_mup_> a published branch when deploying locally. [23:37] whoops forget the reviewers [23:38] <_mup_> juju/unit-info-cli r426 committed by kapil.thangavelu@canonical.com [23:38] <_mup_> merge local-origin-passthrough & resolve conflict [23:54] <_mup_> juju/unit-info-cli r427 committed by kapil.thangavelu@canonical.com [23:54] <_mup_> fix double typo pointed out by review