[00:00] * niemeyer gets some food [00:08] Its important to have a nice clear easy REST API to use.. but its vital that you also provide optimisations for batch operations. Its why SQL is so popular.. easy to get one row, easy to get all rows. [01:21] jimbaker: ping [01:40] hazmat: Still around? [02:14] niemeyer indeed [02:17] hazmat: Cool, sorted out already, again! :-) [02:17] hazmat: Review queue pretty much empty [02:17] niemeyer, nice [02:19] * hazmat crashes [02:25] hazmat: Cheers [02:47] <_mup_> juju/unit-info-cli r424 committed by kapil.thangavelu@canonical.com [02:47] <_mup_> remove the manual copy of host resolv.conf, since customize runs in a chroot, directly modify the resolv.conf output to point to dnsmasq, fix indentation problem [04:22] <_mup_> juju/env-origin r381 committed by jim.baker@canonical.com [04:22] <_mup_> Merged trunk [06:09] FYI r378 caused a segfault when building on natty [06:09] https://launchpadlibrarian.net/81865589/buildlog_ubuntu-natty-i386.juju_0.5%2Bbzr378-1juju1~natty1_FAILEDTOBUILD.txt.gz [09:43] * ejat just wondering … is someone doing charm for liferay :) [11:25] hmm [11:26] SpamapS, its a problem with the zk version there [11:26] 3.3.1 has known issues for juju [11:27] applies primarily to libzookeeper and python-libzookeeper [12:06] SpamapS, all the distro ppas (minus oneiric perhaps) should have 3.3.3 [12:10] <_mup_> Bug #867420 was filed: Add section mentioning expose to the user tutorial. < https://launchpad.net/bugs/867420 > [12:12] just updated my oneiric install, juju seems to have a problem: [12:12] Errors were encountered while processing: [12:12] /var/cache/apt/archives/juju_0.5+bzr361-0ubuntu1_all.deb [12:12] E: Sub-process /usr/bin/dpkg returned an error code (1) [12:17] was a transient problem, apt-get update, apt-get -f install seemed to have fixed it [12:29] interestingly simulating transient disconnection of a bootstrap node for extended periods of time seems to be fine [13:11] heya niemeyer [13:12] hazmat: ahh, we need to add a versioned build dep then [13:17] Hello! [13:19] niemeyer, g'morning [13:23] fwereade: How're things going there? [13:23] hazmat: Good stuff in these last few branches [13:23] niemeyer: tolerable :) [13:23] fwereade: ;-) [13:24] niemeyer, yeah.. finally fixed the local provider issue wrt to customization, so all is good there, still seem some occasionally lxc pty allocation errors, but haven't deduced to a reliable reproduction strategy for upstream [13:25] niemeyer, i did play around with the disconnect scenarios some more, at least for a period of no active usage (no hooks executing, etc), we tolerate zookeeper nodes going away transiently fairly well [13:25] hazmat: By zookeeper nodes you mean the server themselves? [13:25] servers [13:25] niemeyer, yeah.. the zookeeper server going away [13:25] hazmat: Neat! [13:26] It's a good beginning :) [13:26] hazmat: we should talk to rogpeppe about the issues we debated yesterday [13:26] hazmat: re. making things not fail when possible [13:26] i'm here! [13:26] niemeyer, for the single server case, the session stays alive, if the client reconnects within the the session timeout period after the server is back up. and the clients all go into poll mode every 3s when the zk server is down (roughly 1/3 session time i believe) [13:27] (afternoon, folks, BTW) [13:27] niemeyer, theres a few warnings in the zk docs about not trusting library implementation that magic things for the app [13:27] regarding error handling [13:27] rogpeppe, hola [13:27] hazmat: Well, sure :) [13:27] hazmat: That's what the whole session concept is about, though [13:28] rogpeppe: This goes a bit in the direction you were already thinking [13:28] rogpeppe: You mentioned in our conversations that e.g. it'd be good that Dial would hold back until the connection is actually established [13:28] rogpeppe: This is something we should do, but we're talking about doing more than that [13:28] i don't know if this is relevant, or if it's a problem with gozk alone, but i never got a server-down notification from a zk server, even when i killed it and waited 15 mins. [13:29] rogpeppe: Try bringing it up afterwards! :-) [13:29] rogpeppe, session expiration is server governed, clients don't decide that [13:29] rogpeppe: It's a bit strange, but that's how it works.. the session times out in the next reconnection [13:29] niemeyer: yeah, i definitely think it should [13:30] rogpeppe, the clients go into a polling reconnect mode, turning up the zookeeper debug log verbosity will show the activity [13:30] hazmat: but what if there's no server? surely the client should fail eventually? [13:30] rogpeppe: So, in addition to this, when we are connected and zk disconnects, we should also block certain calls [13:30] rogpeppe: Well.. all the calls [13:30] rogpeppe, nope.. they poll endlessley in the background, attempting to use the connection will raise a connectionloss/error [13:31] rogpeppe, at least until the handle is closed [13:31] rogpeppe: So that we avoid these errors ^ [13:31] rogpeppe, that's why we have explicit timeouts for connect [13:31] rogpeppe: In other words, if we have a _temporary_ error (e.g. disconnection rather than session expiration), we should block client calls [13:31] above libzk [13:31] hazmat: but if all users are blocked waiting for one of {connection, state change}, then no one will try to use the connection, and the client will hang forever [13:32] rogpeppe: Not necessarily.. as you know it's trivial to timeout and close a connection [13:32] rogpeppe: I mean, on our side [13:32] so all clients should do timeout explicitly? [13:32] rogpeppe: <-time.After & all [13:32] sure, but what's an appropriate time out? [13:33] rogpeppe: Whatever we choose [13:33] rogpeppe: But that's not what we're trying to solve now [13:33] sure [13:33] rogpeppe: What we have to do is make the gozk interface bearable [13:33] rogpeppe: Rather than a time bomb [13:33] so we're trying to make recoverable error handling subsumed into the client [13:33] [note to future: i'd argue for the timeout functionality to be inside the gozk interface, not reimplemented by every client] [13:34] [note to future: discuss timeout with rogpeppe] [13:34] by capturing a closure for any operation, and on connection error, wait till the connection is restablished, and rexecing the closure (possibly with additional error detection semantics) [13:34] hazmat: are we talking about the gozk package level here? [13:34] hazmat: I think there's a first step before that even [13:35] or a higher juju-specific level? [13:35] rogpeppe: Yeah, internal to gozk [13:35] rogpeppe, which pkg isn't relevant, but yes at the zk conn level [13:35] http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling [13:35] hazmat: Before we try to _redo_ operations, we should teach gozk to not _attempt_ them in the first place when it knows the connection is off [13:35] hmm [13:35] yeah.. thats better [13:36] we can basically watch session events and hold all operations [13:36] niemeyer, +1 [13:36] hazmat: Cool! [13:36] hmm [13:36] niemeyer, so there is still a gap [13:36] rogpeppe: Does that make sense to you as well? [13:37] * rogpeppe is thinking hard [13:37] hazmat: There is in some cases, when we attempt to do something and the tcp connection crashes on our face [13:37] niemeyer, internally libzk will do a heartbeat effectively to keep the session alive, if the op happens before the heartbeat detects dead we still get a conn error [13:37] hazmat: Let's handle that next by retrying certain operations intelligently [13:37] i think the first thing is to distinguish between recoverable and unrecoverable errors [13:38] rogpeppe, its a property of the handle [13:38] rogpeppe: That's the next thing, after the initial step we mentioned above [13:38] libzk exposes a method for it to return a bool [13:38] recoverable(handle) [13:38] rogpeppe: For blocking operations on certain connection states, we're actually preventing the error from even happening [13:39] preventing the error being exposed to the API-client code, that is, yes? [13:39] rogpeppe: No [13:39] rogpeppe, yup [13:39] :-) [13:39] lol [13:39] rogpeppe: Preventing it from happening at all [13:39] the error never happens [13:40] because we don't let the op go through while disconnected [13:40] rogpeppe: The error never happens if we don't try the call [13:40] ok, that makes sense. [13:40] but... what about an op that has already gone through [13:40] ? [13:40] next step is to auto recover the error for ops that we can do so without ambiguity, because there is still a gap on our detection of the client connectivity [13:40] and then the connection goes down [13:40] rogpeppe: That's the next case we were talking about above [13:41] rogpeppe: If the operation is idempotent, we can blindly retry it behind the lib client's back [13:41] niemeyer: do we need to? i thought it was important that clients be prepared to handle critical session events [13:41] rogpeppe: If the operation is not idempotent, too bad.. we'll have to let the app take care of it [13:41] rogpeppe, effectively the only only ops i've seen ambiguity around is the create scenario, and modifications without versions [13:42] rogpeppe: Do we need to what? [13:42] do we need to retry, was my question. [13:42] so this might be better structured as a library on top of the connection that's specific to juju [13:42] rogpeppe: Yeah, because otherwise we'll have to introduce error handling _everywhere_, doing exactly the same retry [13:43] hazmat: Nah.. let's do it internally and make a clean API.. we know what we're doing [13:43] does zookeeper do a 3 phase commit? [13:43] niemeyer, famous last words ;-) [13:43] i.e. for something like create with sequence number, does the client have to acknowledge the create before the node is actually created? [13:43] hazmat: Well, if we don't, we have larger problems ;-) [13:43] rogpeppe, its a paxos derivative internally. everything forwards to the active leader in the cluster [13:43] writes that is [13:44] it transparently does leader election as needed [13:44] rogpeppe: The _client_ cannot acknowledge the create [13:44] rogpeppe, the client doesn't ack the create, but the error recovery with a sequence node is hard, because without the server response, we have no idea what happened [13:45] niemeyer: why not? i thought the usual process was: write request; read response; write ack; server commits [13:45] rogpeppe: What's the difference? [13:45] rogpeppe: write ack; read response; write ack; read response; write ack; read response; server commits [13:45] niemeyer: the difference is that if the server doesn't see an ack from the client, the action never happened. [13:46] rogpeppe: Doesn't matter how many round trips.. at some point the server will commit, and if the connection crashes the client won't know if it was committed or not [13:46] ? there's client acks under the hood? [13:46] hazmat: There isn't.. and I'm explaining why it makes no difference [13:46] ah [13:47] * hazmat dogwalks back in 15 [13:47] hazmat: Cheers [13:48] if the connection crashes, the client can still force the commit by writing the ack. it's true that it doesn't know if the ack is received. hmm. byzantine generals. [13:48] Yeah [13:49] i'm slightly surprised the sequence-number create doesn't have a version argument, same as write [13:50] rogpeppe: Hmm.. seems to be sane to me? [13:50] that would fix the problem, at the expense of retries, no? [13:50] rogpeppe: It's atomic.. it's necessarily going to be version 0 [13:50] ah, child changes don't change a version number? [13:51] * rogpeppe goes back to look at the modify operation [13:51] rogpeppe: It changes, but it makes no sense to require a given version with a sequence number [13:51] rogpeppe: The point of using the sequence create is precisely to let the server make concurrent requests work atomically [13:52] Hmm [13:52] Weird [13:53] Abrupt disconnection [13:53] niemeyer: but we want to do that with node contents too - that's why the version number on Set [13:53] niemeyer_: and that's the main problem with the lack of Create idempotency [13:54] anyway, we could easily document that Create with SEQUENCE is a special case [13:54] and can return an error without retrying [13:55] rogpeppe: We don't even have to document it really.. the error itself is the notice [13:55] i think it would be good if the only time a session event arrived at a watcher was if the server went down unrecoverably [13:56] actually, that doesn't work [13:56] watchers will always have to restart [13:56] rogpeppe: That's how it is today, except for the session events in the session watch [13:56] rogpeppe: Not really [13:57] rogpeppe: If the watch was already established, zk will keep track of them and reestablish internally as long as the session survives [13:58] but what if the watch reply was lost when the connection went down? [13:59] rogpeppe: Good question.. worth confirming to see if it's handled properly [14:00] i'm not sure how it can be [14:00] the client doesn't ack watch replies AFAIK [14:01] rogpeppe: There are certainly ways it can be.. it really depends on how it's done [14:01] rogpeppe: E.g. the client itself can do the verification on connection reestablishment [14:01] Another alternative, which is perhaps a saner one, is to do a 180⁰ turn and ignore the existence of sessions completely [14:02] Hmmm.. [14:02] niemeyer_: that would look much nicer from a API user's perspective [14:02] I actually like the sound of that [14:03] rogpeppe: Not even thinking about API.. really thinking about how to build reliable software on top of it [14:03] aren't those closely related things? [14:04] rogpeppe: Not necessarily.. an API that reestablishes connections and knows how to hanndle problems internally is a lot nicer from an outside user's perspective === niemeyer_ is now known as niemeyer [14:06] niemeyer: don't quite follow [14:06] rogpeppe: Don't worry, it's fine either way [14:06] * hazmat catches up [14:07] hazmat: I think we should do a U turn [14:07] niemeyer, how so? [14:08] hmm.. verifying watch handling while down sounds good [14:08] connection down that is [14:08] hazmat: We're adding complexity in the middle layer, and reality is that no matter how complex and how much we prevent the session from "crashing", we _still_ have to deal with session termination correctly [14:08] session termination is effectively fatal [14:08] when does a session terminate? [14:08] the only sane thing to do is to restart the app [14:08] hazmat: we're also constantly saying "ah, but what if X happens?".. [14:09] rogpeppe, a client is disconnected from the quorum for the period of session timeout [14:09] hazmat: Not necessarily.. we have to restart the connection [14:09] niemeyer, and reinitialize any app state against the new connection [14:09] hazmat: Yes [14:09] ie. restart the app ;-) [14:09] hazmat: No, restart the app is something else [14:10] hazmat: Restart the app == new process [14:10] doesn't have to be a process restart to be effective, but it needs to go through the entire app init [14:10] hazmat: So, the point is that we have to do that anyway [14:10] hazmat: Because no matter how hard we try, that's a valid scenario [14:10] rogpeppe, the other way a session terminates is a client closes the handle, thats more explicit [14:11] rogpeppe, that can be abused in testing by connecting multiple clients via the same session id, to simulate session failures [14:11] niemeyer, absolutely for unrecoverable errors that is required [14:11] hazmat: So what about going to the other side, and handling any session hiccups as fatal? It feels a lot stronger as a general principle, and a lot harder to get it wrong [14:11] when you say "reinitialize any app state", doesn't that assume that no app state has already been stored on the server? [14:11] for recoverable errors local handling inline to the conn, seems worth exploring [14:11] or are we assuming that the server is now a clean slate? [14:12] we need to validate some of the watch state [14:12] rogpeppe, no the server has an existing state [14:12] hazmat: The problem is that, as we've been seeing above, "recoverable errors" are actually very hard to really figure [14:12] rogpeppe, the app needs to process the existing state against its own state needs and observation requirements [14:12] hazmat: rogpeppe makes a good point in terms of the details of watch establishment [14:12] so presumably we know almost all of that state, barring operations in progress? [14:12] hazmat: and I don't have a good answer for him [14:12] niemeyer, that's why i was going with a stop/reconnect/start for both error types as a simple mechanism [14:13] for now [14:13] * hazmat does a test to verify watch behavior [14:13] hazmat: Yeah, but the problem we have _today_ and that I don't feel safe doing that is that we don't have good-but-stay-alive semantics in the code base [14:13] erm.. [14:13] good stop-but-stay-alive [14:14] i *think* that the most important case is automatic retries of idempotent operations. [14:14] niemeyer, we do in the unit agents as a consequence of doing upgrades, we pause everything for it [14:14] but that's hard too. [14:14] hazmat: I seriously doubt that this will e.g. kill old watches [14:15] niemeyer, effectively the only thing that's not observation driven is the provider agent does some polling for runaway instances [14:15] niemeyer, it won't kill old watches, but we can close the handle explicitly [14:16] hazmat: and what happens to all the deferreds? [14:16] niemeyer, their dead, when the session is closed [14:16] at least for watches [14:17] hazmat: What means dead? Dead as in, they'll continue in memory, hanging? [14:17] niemeyer, yeah... their effectively dead, we can do things to clean them up if that's problematic [14:17] dead in memory [14:18] hazmat: Yeah.. so if we have something like "yield exists_watch", that's dead too.. [14:18] we can track open watches like gozk and kill them explicitly (errback disconnect) [14:18] hazmat: That's far from a clean termination [14:18] niemeyer, we can transition those to exceptions [14:18] hazmat: Sure, we can do everything we're talking about above.. the point is that it's not trivial [14:19] it seems straightforward at the conn level [14:19] to track watches, and on close kill them [14:19] hazmat: Heh.. it's straightforward to close() the connection, of course [14:19] hazmat: It's not straightforward to ensure that doing this will yield a predictable behavior [14:20] so back to process suicide ;-) [14:20] hazmat: Cinelerra FTW! [14:20] this is all talking about the situation when you need to explicitly restart a session, right? [14:20] rogpeppe, yes [14:21] rogpeppe: Yeah, control over fault scenarios in general [14:21] restart/open a new session [14:21] restart is different, i thought [14:21] because the library can do it behind the scenes [14:21] and reinstate watches [14:21] redo idempotent ops, etc [14:21] rogpeppe, but it can't reattach the watches to the all extant users? [14:22] i don't see why not [14:22] perhaps in go that's possible with channels and the channel bookeeping [14:22] against the watches [14:22] hazmat, rogpeppe: No, that doesn't work in any case [14:22] ? [14:22] The window between the watch being dead and the watch being alive again is lost === rogpeppe is now known as rog [14:23] of course [14:23] doh [14:23] except... [14:23] that the client *could* keep track of the last-returned state [14:23] and check the result when the new result arrives [14:24] and trigger the watcher itself if it's changed [14:25] rog: Yeah, we could try to implement the watch in the client side, but that's what I was talking above [14:25] expect... i don't know if {remove child; add child with same name} is legitimately a no-op [14:25] rog: We're going very far to avoid a situation that is in fact unavoidable [14:25] rog: Instead of doing that, I suggest we handle the unavoidable situation in all cases [14:26] force all clients to deal with any session termination as if it might be unrecoverable? [14:27] rog: Yeah [14:27] rog: Any client disconnection in fact [14:27] rog: let's also remove the hack we have in the code and allow watches to notice temporary disconnections [14:28] this is why proper databases have transactions [14:28] rog: Uh.. [14:28] rog: That was a shoot in the sky :-) [14:29] if the connection dies half way through modifying some complex state, then when retrying, you've got to figure out how far you previously got, then redo from there. [14:29] rog: We have exactly the same thing with zk. [14:30] rog: The difference is that unlike a database we're using this for coordination [14:30] rog: Which means we have live code waiting for state to change [14:30] rog: A database client that had to wait for state to change would face the same issues [14:31] rog, databaess still have the same issue [14:31] yeah, i guess [14:32] what should a watcher do when it sees a temporary disconnection? [14:33] await reconnection and watch again, i suppose [14:33] so watches don't fire if the event happens while disconnected [14:34] i wonder if the watch should terminate even on temporary disconnection. [14:34] rog: It should error out and stop whatever is being done, recovering the surrounding state if it makes sense [14:34] rog: Right, exactly [14:36] and is that true of the Dial session events too? the session terminates after the first non-ok event? [14:37] i think that makes sense. [14:37] (and it also makes use of Redial more ubiquitous). [of course i'm speaking from a gozk perspective here, as i'm not familiar with the py zk lib] [14:38] interesting, i get a session expired event.. just wrote a unit test for watch fire while disconnected, two server cluster, two clients one connected to each, one client sets a watch, shutdown its server, delete on the other client/server, resurrect the shutdown server with its client waiting on the watch, gets a session expired event [14:38] hmm. its timing dependent though [14:39] rog: Yeah, I think so too [14:39] yeah.. this needs more thought [14:39] hazmat: Yeah, the more we talk, the more I'm convinced we should assume nothing from a broken connection [14:40] niemeyer, indeed [14:40] hazmat: This kind of positioning also has a non-obvious advantage.. it enables us to more easily transition to doozerd at some point [14:40] Perhaps not as a coincidence, it has no concept of sessions [14:40] * niemeyer looks at Aram [14:40] niemeyer, interesting.. i thought you gave up on doozerd [14:40] upstream seems to be dead afaik [14:40] hazmat: I have secret plans! [14:40] ;-) [14:41] niemeyer, cool when i mentioned it b4 you seemed down on it [14:41] it would be nice for an arm env to go java-less [14:41] hazmat: Yeah, because it sucks on several aspects right now [14:41] rog, on ReDial does gozk reuse a handle? [14:41] hazmat: But what if we.. hmmm.. provided incentives for the situation to change? :-) [14:41] niemeyer, yeah.. persistence and error handling there don't seem well known [14:42] niemeyer, indeed, things can change [14:43] hazmat: no, i don't think so, but i don't think it needs to. [14:43] that's one important difference between gozk/txzk.. the pyzk doesn't expose reconnecting with the same handle, which toasts extant watches (associated to handle) when trying to reconnect to the same session explictly [14:43] (paste coming up) [14:43] libzk in the background will do it, but if you want to change the server explicitly at the app/client level its a hoser [14:46] if you get clients to explicitly negotiate with the central dialler when the connection is re-made, i think it can work. [14:47] i.e. get error indicating that server is down, ask central thread for new connection. [14:48] store that connection where you need to. [14:49] * niemeyer loves the idea of keeping it simple and not having to do any of that :) [14:51] yeah, me too. [14:51] but we can't. at least i think that's the conclusion we've come to, right? [14:52] rog: Hmm, not my understanding at least [14:52] rog: It's precisely the opposite.. we _have_ to do that anyway [14:52] rog: Because no matter how hard we try, the connection can break for real, and that situation has to be handled properly [14:53] rog: So I'd rather focus on that scenario all the time, and forget about the fact we even have sessions [14:54] niemeyer: so you're saying that we have to lose all client state when there's a reconnection? [14:54] rog: Yes, I'm saying we have to tolerate that no matter what [14:55] so the fact that zk has returned ok when we've created a node, we have to act as if that node might not have been created? [14:55] s/the fact/even if/ [14:56] rog: If zk returned ok, there's no disconnection [14:56] niemeyer: if it returned ok, and the next create returns an error; that's the scenario i'm thinking of [14:57] that's the situation where i think the node creator could wait for redial and then carry on from where it was [14:57] rog: If it returned an error, we have to handle it as an error and not assume that the session is alive, because it may well not be [14:58] rog: and what if the session dies? [14:58] i'm not saying that it should assume the session is still alive [14:58] i'm saying that when it gets an error, it could ask the central thread for the new connection - it might just get an error instead [14:59] the node creator is aware of the transition, but can carry on (knowingly) if appropriate [14:59] rog: and what about the several watches that are established? [14:59] niemeyer: same applies [14:59] rog: What applies? [15:00] the watch will return an error; the code doing the watch can ask for a new connection and redo the watch if it wishes. [15:00] niemeyer, so we're back to reinitializing the app on any connectoin error, disregarding recoverable [15:00] no redoing behind the scenes, but the possibility of carrying on where we left off [15:00] rog: The state on which the watch was requested has changed [15:00] rog: Check out the existing code base [15:01] niemeyer, so interestingly we can be disconnected, not know it, and miss a watch event [15:01] rog: It's not trivial to just "oh, redo it again" [15:01] niemeyer: it doesn't matter because the watcher is re-requesting the state, so it'll see both the state and any subsequent watch event [15:01] hazmat: Yeah, that's exactly the kind of very tricky scenario that I'm concerned about [15:01] the watcher has to deal with the "state just changed" scenario anyway when it first requests the watch [15:01] niemeyer, actually we get notification from a session event that we reconnected [15:01] hazmat: As Russ would say, I don't really want to think about whether it's correct or not [15:01] rog: No.. please look at the code base [15:02] niemeyer: sorry, which bit are you referring to? [15:02] rog: We're saying the same thing, in fact.. you're just underestimating the fact that "just retry" is more involved than "request the new connection and do it again" [15:02] rog: juju [15:02] rog: lp:juju [15:03] rog: This concept touches the whole application [15:03] niemeyer: i've been exploring it a bit this morning, but haven't found the crucial bits, i think. what's a good example file that would be strongly affected by this kind of thing? [15:03] hazmat: We do.. the real problem is ensuring state is as it should be when facing reconnections [15:04] rog: I'm serious.. this touches the whole app [15:04] niemeyer, right we always have to reconsider state on reconnection [15:04] rog: Check out the agents [15:05] niemeyer: ah, ok. i was looking in state [15:05] thanks [15:05] rog: state is good too [15:05] rog: Since it's what the agents use and touches this concept too [15:06] rog, hazmat: So, my suggestion is that the first thing we do is to unhide temporary failures in gozk [15:06] niemeyer, the test becomes alot more reliable when we have multiple zks in the cluster setup for the client to connect to [15:06] sgtm [15:06] rog, hazmat: Then, let's watch out for that kind of issue very carefully in reviews and whatnot, as we build a reliable version [15:07] hazmat: The same problem exists, though.. [15:07] niemeyer, indeed [15:07] but it minimizes total disconnect scenarios with multiple zks [15:07] hazmat: Even if it _immediately_ reconnects, the interim problems may have created differences that are easily translated into bugs very hard to figure out [15:08] hazmat: and again, we seriously _have_ to handle the hard-reconnect across the board [15:08] niemeyer, agreed [15:08] niemeyer, i'm all in favor of simplifying and treating them the same [15:09] recoverable/unrecoverable conn errors [15:09] hazmat: So no matter how much we'd love to not break the session and have a pleasant API, the hard reconnects means we'll need the good failure recovered either way [15:09] So we can as well plan for that at all times [15:09] niemeyer, its detecting the conn error that i'm concerned about atm [15:09] hazmat: My understanding is that the client always notifies about temporary issues [15:09] niemeyer, based on its internal poll period to the server [15:10] niemeyer, a quick disconnect masks any client detection [15:10] ^transient [15:10] hazmat: Really!? [15:11] it seems the server will attempt to expire the client session, but i've seen once we're instead it shows a reconnect [15:11] s/we're/where [15:11] hazmat: I can't imagine how that'd be possible [15:11] hazmat: The client lib should hopefully notify the user that the TCP connection had to be remade [15:13] niemeyer, fwiw here's the test i'm playing with (can drop into test_session.py ).. http://paste.ubuntu.com/702290/ [15:13] for a package install of zk.. ZOOKEEPER_PATH=/usr/share/java [15:13] for the test runner [15:13] hazmat: Hmm [15:14] hazmat: That seems to test that watches work across reconnections [15:14] hazmat: We know they can work [15:14] niemeyer, they do but we miss the delete [15:14] hazmat: Or am I missing something? [15:14] hazmat: Ah, right! [15:14]