/srv/irclogs.ubuntu.com/2011/10/04/#juju.txt

* niemeyer gets some food00:00
SpamapSIt's important to have a nice, clear, easy REST API to use.. but it's vital that you also provide optimisations for batch operations. It's why SQL is so popular.. easy to get one row, easy to get all rows.00:08
niemeyerjimbaker: ping01:21
niemeyerhazmat: Still around?01:40
hazmat niemeyer indeed02:14
niemeyerhazmat: Cool, sorted out already, again! :-)02:17
niemeyerhazmat: Review queue pretty much empty02:17
hazmatniemeyer, nice02:17
* hazmat crashes02:19
niemeyerhazmat: Cheers02:25
_mup_juju/unit-info-cli r424 committed by kapil.thangavelu@canonical.com02:47
_mup_remove the manual copy of host resolv.conf, since customize runs in a chroot, directly modify the resolv.conf output to point to dnsmasq, fix indentation problem02:47
_mup_juju/env-origin r381 committed by jim.baker@canonical.com04:22
_mup_Merged trunk04:22
SpamapSFYI r378 caused a segfault when building on natty06:09
SpamapShttps://launchpadlibrarian.net/81865589/buildlog_ubuntu-natty-i386.juju_0.5%2Bbzr378-1juju1~natty1_FAILEDTOBUILD.txt.gz06:09
* ejat just wondering … is someone doing charm for liferay :) 09:43
hazmathmm11:25
hazmatSpamapS, its a problem with the zk version there11:26
hazmat3.3.1 has known issues for juju11:26
hazmatapplies primarily to libzookeeper and python-libzookeeper11:27
hazmatSpamapS, all the distro ppas (minus oneiric perhaps) should have 3.3.312:06
_mup_Bug #867420 was filed: Add section mentioning expose to the user tutorial. <juju:In Progress by rogpeppe> < https://launchpad.net/bugs/867420 >12:10
TeTeTjust updated my oneiric install, juju seems to have a problem:12:12
TeTeTErrors were encountered while processing:12:12
TeTeT /var/cache/apt/archives/juju_0.5+bzr361-0ubuntu1_all.deb12:12
TeTeTE: Sub-process /usr/bin/dpkg returned an error code (1)12:12
TeTeTwas a transient problem, apt-get update, apt-get -f install seemed to have fixed it12:17
hazmatinterestingly simulating transient disconnection of a bootstrap node for extended periods of time seems to be fine12:29
fwereadeheya niemeyer13:11
SpamapShazmat: ahh, we need to add a versioned build dep then13:12
niemeyerHello!13:17
hazmatniemeyer, g'morning13:19
niemeyerfwereade: How're things going there?13:23
niemeyerhazmat: Good stuff in these last few branches13:23
fwereadeniemeyer: tolerable :)13:23
niemeyerfwereade: ;-)13:23
hazmatniemeyer, yeah.. finally fixed the local provider issue wrt customization, so all is good there, still see some occasional lxc pty allocation errors, but haven't narrowed them down to a reliable reproduction strategy for upstream13:24
hazmatniemeyer, i did play around with the disconnect scenarios some more, at least for a period of no active usage (no hooks executing, etc), we tolerate zookeeper nodes going away transiently fairly well13:25
niemeyerhazmat: By zookeeper nodes you mean the server themselves?13:25
niemeyerservers13:25
hazmatniemeyer, yeah.. the zookeeper server going away13:25
niemeyerhazmat: Neat!13:25
niemeyerIt's a good beginning :)13:26
niemeyerhazmat: we should talk to rogpeppe about the issues we debated yesterday13:26
niemeyerhazmat: re. making things not fail when possible13:26
rogpeppei'm here!13:26
hazmatniemeyer, for the single server case, the session stays alive, if the client reconnects within the session timeout period after the server is back up. and the clients all go into poll mode every 3s when the zk server is down (roughly 1/3 session time i believe)13:26
rogpeppe(afternoon, folks, BTW)13:27
hazmatniemeyer, there are a few warnings in the zk docs about not trusting library implementations that do magic things for the app13:27
hazmatregarding error handling13:27
hazmatrogpeppe, hola13:27
niemeyerhazmat: Well, sure :)13:27
niemeyerhazmat: That's what the whole session concept is about, though13:27
niemeyerrogpeppe: This goes a bit in the direction you were already thinking13:28
niemeyerrogpeppe: You mentioned in our conversations that e.g. it'd be good that Dial would hold back until the connection is actually established13:28
niemeyerrogpeppe: This is something we should do, but we're talking about doing more than that13:28
rogpeppei don't know if this is relevant, or if it's a problem with gozk alone, but i never got a server-down notification from a zk server, even when i killed it and waited 15 mins.13:28
niemeyerrogpeppe: Try bringing it up afterwards! :-)13:29
hazmatrogpeppe, session expiration is server governed, clients don't decide that13:29
niemeyerrogpeppe: It's a bit strange, but that's how it works.. the session times out in the next reconnection13:29
rogpeppeniemeyer: yeah, i definitely think it should13:29
hazmatrogpeppe, the clients go into a polling reconnect mode, turning up the zookeeper debug log verbosity will show the activity13:30
rogpeppehazmat: but what if there's no server? surely the client should fail eventually?13:30
niemeyerrogpeppe: So, in addition to this, when we are connected and zk disconnects, we should also block certain calls13:30
niemeyerrogpeppe: Well.. all the calls13:30
hazmatrogpeppe, nope.. they poll endlessly in the background, attempting to use the connection will raise a connectionloss/error13:30
hazmatrogpeppe, at least until the handle is closed13:31
niemeyerrogpeppe: So that we avoid these errors ^13:31
hazmatrogpeppe, that's why we have explicit timeouts for connect13:31
niemeyerrogpeppe: In other words, if we have a _temporary_ error (e.g. disconnection rather than session expiration), we should block client calls13:31
hazmatabove libzk13:31
rogpeppehazmat: but if all users are blocked waiting for one of {connection, state change}, then no one will try to use the connection, and the client will hang forever13:31
niemeyerrogpeppe: Not necessarily.. as you know it's trivial to timeout and close a connection13:32
niemeyerrogpeppe: I mean, on our side13:32
rogpeppeso all clients should do timeout explicitly?13:32
niemeyerrogpeppe: <-time.After & all13:32
rogpeppesure, but what's an appropriate time out?13:32
niemeyerrogpeppe: Whatever we choose13:33
niemeyerrogpeppe: But that's not what we're trying to solve now13:33
rogpeppesure13:33
niemeyerrogpeppe: What we have to do is make the gozk interface bearable13:33
niemeyerrogpeppe: Rather than a time bomb13:33
hazmatso we're trying to subsume recoverable error handling into the client13:33
rogpeppe[note to future: i'd argue for the timeout functionality to be inside the gozk interface, not reimplemented by every client]13:33
niemeyer[note to future: discuss timeout with rogpeppe]13:34
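A minimal sketch of the client-side timeout pattern niemeyer points at with `<-time.After`, in Go since gozk is the client in question. The callWithTimeout helper and the fake slow operation are illustrative assumptions, not gozk API; the point is only that a caller can bound any blocked operation and then decide to close the connection itself.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // errTimeout is returned when the wrapped call does not complete in time.
    var errTimeout = errors.New("zookeeper call timed out")

    // callWithTimeout runs op in a goroutine and gives up after the given
    // duration, leaving the caller free to close the connection.
    func callWithTimeout(op func() error, timeout time.Duration) error {
        done := make(chan error, 1)
        go func() { done <- op() }()
        select {
        case err := <-done:
            return err
        case <-time.After(timeout):
            return errTimeout
        }
    }

    func main() {
        // A fake operation standing in for e.g. a create or get on the connection.
        slowOp := func() error { time.Sleep(2 * time.Second); return nil }
        fmt.Println(callWithTimeout(slowOp, 500*time.Millisecond))
    }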
hazmatby capturing a closure for any operation, and on connection error, waiting till the connection is reestablished and re-executing the closure (possibly with additional error detection semantics)13:34
rogpeppehazmat: are we talking about the gozk package level here?13:34
niemeyerhazmat: I think there's a first step before that even13:34
rogpeppeor a higher juju-specific level?13:35
niemeyerrogpeppe: Yeah, internal to gozk13:35
hazmatrogpeppe, which pkg isn't relevant, but yes at the zk conn level13:35
hazmathttp://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling13:35
niemeyerhazmat: Before we try to _redo_ operations, we should teach gozk to not _attempt_ them in the first place when it knows the connection is off13:35
hazmathmm13:35
hazmatyeah.. thats better13:35
hazmatwe can basically watch session events and hold all operations13:36
hazmatniemeyer, +113:36
niemeyerhazmat: Cool!13:36
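A rough sketch of the "watch session events and hold all operations" idea: a session-event watcher flips a gate that every operation waits on before touching the wire, so the connection-loss error is never produced in the first place. All names below are assumptions for illustration, not the actual gozk internals.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // connGate parks callers while the underlying connection is down.
    type connGate struct {
        mu   sync.Mutex
        cond *sync.Cond
        up   bool
    }

    func newConnGate() *connGate {
        g := &connGate{}
        g.cond = sync.NewCond(&g.mu)
        return g
    }

    // setConnected would be driven by session events from the zk client.
    func (g *connGate) setConnected(up bool) {
        g.mu.Lock()
        g.up = up
        g.mu.Unlock()
        if up {
            g.cond.Broadcast()
        }
    }

    // wait blocks until the connection is usable, so an operation is never
    // attempted against a connection we already know is dead.
    func (g *connGate) wait() {
        g.mu.Lock()
        for !g.up {
            g.cond.Wait()
        }
        g.mu.Unlock()
    }

    func main() {
        gate := newConnGate()
        go func() {
            time.Sleep(100 * time.Millisecond)
            gate.setConnected(true) // simulated "session reconnected" event
        }()
        gate.wait()
        fmt.Println("connection is up; safe to issue the zk operation")
    }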
hazmathmm13:36
hazmatniemeyer, so there is still a gap13:36
niemeyerrogpeppe: Does that make sense to you as well?13:36
* rogpeppe is thinking hard13:37
niemeyerhazmat: There is in some cases, when we attempt to do something and the tcp connection crashes on our face13:37
hazmatniemeyer, internally libzk will do a heartbeat effectively to keep the session alive, if the op happens before the heartbeat detects dead we still get a conn error13:37
niemeyerhazmat: Let's handle that next by retrying certain operations intelligently13:37
rogpeppei think the first thing is to distinguish between recoverable and unrecoverable errors13:37
hazmatrogpeppe, its a property of the handle13:38
niemeyerrogpeppe: That's the next thing, after the initial step we mentioned above13:38
hazmatlibzk exposes a method for it to return a bool13:38
hazmatrecoverable(handle)13:38
niemeyerrogpeppe: For blocking operations on certain connection states, we're actually preventing the error from even happening13:38
rogpeppepreventing the error being exposed to the API-client code, that is, yes?13:39
niemeyerrogpeppe: No13:39
hazmatrogpeppe, yup13:39
hazmat:-)13:39
rogpeppelol13:39
niemeyerrogpeppe: Preventing it from happening at all13:39
hazmatthe error never happens13:39
hazmatbecause we don't let the op go through while disconnected13:40
niemeyerrogpeppe: The error never happens if we don't try the call13:40
rogpeppeok, that makes sense.13:40
rogpeppebut... what about an op that has already gone through13:40
rogpeppe?13:40
hazmatnext step is to auto-recover from the error for ops where we can do so without ambiguity, because there is still a gap in our detection of the client connectivity13:40
rogpeppeand then the connection goes down13:40
niemeyerrogpeppe: That's the next case we were talking about above13:40
niemeyerrogpeppe: If the operation is idempotent, we can blindly retry it behind the lib client's back13:41
rogpeppeniemeyer: do we need to? i thought it was important that clients be prepared to handle critical session events13:41
niemeyerrogpeppe: If the operation is not idempotent, too bad.. we'll have to let the app take care of it13:41
hazmatrogpeppe, effectively the only ops i've seen ambiguity around are the create scenario, and modifications without versions13:41
niemeyerrogpeppe: Do we need to what?13:42
rogpeppedo we need to retry, was my question.13:42
hazmatso this might be better structured as a library on top of the connection that's specific to juju13:42
niemeyerrogpeppe: Yeah, because otherwise we'll have to introduce error handling _everywhere_, doing exactly the same retry13:42
niemeyerhazmat: Nah.. let's do it internally and make a clean API.. we know what we're doing13:43
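A sketch of the closure-capture retry hazmat describes above: wrap an idempotent operation and re-execute it when a connection-loss style error comes back. The errConnectionLoss value and the retry policy are assumptions for illustration, not gozk's real error types, and as noted this must never be applied to sequence creates.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // errConnectionLoss stands in for the error a zk client reports when the
    // TCP connection drops mid-operation but the session may still be alive.
    var errConnectionLoss = errors.New("zookeeper: connection loss")

    // retryIdempotent re-runs op after transient connection errors.  Only safe
    // for idempotent calls (exists, get, versioned set/delete) -- with a
    // sequence create a lost reply leaves us unable to tell what happened.
    func retryIdempotent(op func() error, attempts int, backoff time.Duration) error {
        var err error
        for i := 0; i < attempts; i++ {
            err = op()
            if !errors.Is(err, errConnectionLoss) {
                return err // success, or an error we must surface to the app
            }
            time.Sleep(backoff)
        }
        return err
    }

    func main() {
        calls := 0
        op := func() error {
            calls++
            if calls < 3 {
                return errConnectionLoss // simulate two transient failures
            }
            return nil
        }
        fmt.Println(retryIdempotent(op, 5, 10*time.Millisecond), "after", calls, "calls")
    }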
rogpeppedoes zookeeper do a 3 phase commit?13:43
hazmatniemeyer, famous last words ;-)13:43
rogpeppei.e. for something like create with sequence number, does the client have to acknowledge the create before the node is actually created?13:43
niemeyerhazmat: Well, if we don't, we have larger problems ;-)13:43
hazmatrogpeppe, its a paxos derivative internally. everything forwards to the active leader in the cluster13:43
hazmatwrites that is13:43
hazmatit transparently does leader election as needed13:44
niemeyerrogpeppe: The _client_ cannot acknowledge the create13:44
hazmatrogpeppe, the client doesn't ack the create, but the error recovery with a sequence node is hard, because without the server response, we have no idea what happened13:44
rogpeppeniemeyer: why not? i thought the usual process was: write request; read response; write ack; server commits13:45
niemeyerrogpeppe: What's the difference?13:45
niemeyerrogpeppe: write ack; read response; write ack; read response; write ack; read response; server commits13:45
rogpeppeniemeyer: the difference is that if the server doesn't see an ack from the client, the action never happened.13:45
niemeyerrogpeppe: Doesn't matter how many round trips.. at some point the server will commit, and if the connection crashes the client won't know if it was committed or not13:46
hazmat? there's client acks under the hood?13:46
niemeyerhazmat: There isn't.. and I'm explaining why it makes no difference13:46
hazmatah13:46
* hazmat dogwalks back in 1513:47
niemeyerhazmat: Cheers13:47
rogpeppeif the connection crashes, the client can still force the commit by writing the ack. it's true that it doesn't know if the ack is received. hmm. byzantine generals.13:48
niemeyerYeah13:48
rogpeppei'm slightly surprised the sequence-number create doesn't have a version argument, same as write13:49
niemeyerrogpeppe: Hmm.. seems to be sane to me?13:50
rogpeppethat would fix the problem, at the expense of retries, no?13:50
niemeyerrogpeppe: It's atomic.. it's necessarily going to be version 013:50
rogpeppeah, child changes don't change a version number?13:50
* rogpeppe goes back to look at the modify operation13:51
niemeyerrogpeppe: It changes, but it makes no sense to require a given version with a sequence number13:51
niemeyerrogpeppe: The point of using the sequence create is precisely to let the server make concurrent requests work atomically13:51
niemeyer_Hmm13:52
niemeyer_Weird13:52
niemeyer_Abrupt disconnection13:53
rogpeppeniemeyer: but we want to do that with node contents too - that's why the version number on Set13:53
rogpeppeniemeyer_: and that's the main problem with the lack of Create idempotency13:53
rogpeppeanyway, we could easily document that Create with SEQUENCE is a special case13:54
rogpeppeand can return an error without retrying13:54
niemeyer_rogpeppe: We don't even have to document it really.. the error itself is the notice13:55
rogpeppei think it would be good if the only time a session event arrived at a watcher was if the server went down unrecoverably13:55
rogpeppeactually, that doesn't work13:56
rogpeppewatchers will always have to restart13:56
niemeyer_rogpeppe: That's how it is today, except for the session events in the session watch13:56
niemeyer_rogpeppe: Not really13:56
niemeyer_rogpeppe: If the watch was already established, zk will keep track of them and reestablish internally as long as the session survives13:57
rogpeppebut what if the watch reply was lost when the connection went down?13:58
niemeyer_rogpeppe: Good question.. worth confirming to see if it's handled properly13:59
rogpeppei'm not sure how it can be14:00
rogpeppethe client doesn't ack watch replies AFAIK14:00
niemeyer_rogpeppe: There are certainly ways it can be.. it really depends on how it's done14:01
niemeyer_rogpeppe: E.g. the client itself can do the verification on connection reestablishment14:01
niemeyer_Another alternative, which is perhaps a saner one, is to do a 180⁰ turn and ignore the existence of sessions completely14:01
niemeyer_Hmmm..14:02
rogpeppeniemeyer_: that would look much nicer from an API user's perspective14:02
niemeyer_I actually like the sound of that14:02
niemeyer_rogpeppe: Not even thinking about API.. really thinking about how to build reliable software on top of it14:03
rogpeppearen't those closely related things?14:03
niemeyer_rogpeppe: Not necessarily.. an API that reestablishes connections and knows how to handle problems internally is a lot nicer from an outside user's perspective14:04
=== niemeyer_ is now known as niemeyer
rogpeppeniemeyer: don't quite follow14:06
niemeyerrogpeppe: Don't worry, it's fine either way14:06
* hazmat catches up14:06
niemeyerhazmat: I think we should do a U turn14:07
hazmatniemeyer, how so?14:07
hazmathmm.. verifying watch handling while down sounds good14:08
hazmatconnection down that is14:08
niemeyerhazmat: We're adding complexity in the middle layer, and reality is that no matter how complex and how much we prevent the session from "crashing", we _still_ have to deal with session termination correctly14:08
hazmatsession termination is effectively fatal14:08
rogpeppewhen does a session terminate?14:08
hazmatthe only sane thing to do is to restart the app14:08
niemeyerhazmat: we're also constantly saying "ah, but what if X happens?"..14:08
hazmatrogpeppe, a client is disconnected from the quorum for the period of session timeout14:09
niemeyerhazmat: Not necessarily.. we have to restart the connection14:09
hazmatniemeyer, and reinitialize any app state against the new connection14:09
niemeyerhazmat: Yes14:09
hazmatie. restart the app ;-)14:09
niemeyerhazmat: No, restart the app is something else14:09
niemeyerhazmat: Restart the app == new process14:10
hazmatdoesn't have to be a process restart to be effective, but it needs to go through the entire app init14:10
niemeyerhazmat: So, the point is that we have to do that anyway14:10
niemeyerhazmat: Because no matter how hard we try, that's a valid scenario14:10
hazmatrogpeppe, the other way a session terminates is a client closes the handle, thats more explicit14:10
hazmatrogpeppe, that can be abused in testing by connecting multiple clients via the same session id, to simulate session failures14:11
hazmatniemeyer, absolutely for unrecoverable errors that is required14:11
niemeyerhazmat: So what about going to the other side, and handling any session hiccups as fatal?  It feels a lot stronger as a general principle, and a lot harder to get it wrong14:11
rogpeppewhen you say "reinitialize any app state", doesn't that assume that no app state has already been stored on the server?14:11
hazmatfor recoverable errors local handling inline to the conn, seems worth exploring14:11
rogpeppeor are we assuming that the server is now a clean slate?14:11
hazmatwe need to validate some of the watch state14:12
hazmatrogpeppe, no the server has an existing state14:12
niemeyerhazmat: The problem is that, as we've been seeing above, "recoverable errors" are actually very hard to really figure14:12
hazmatrogpeppe, the app needs to process the existing state against its own state needs and observation requirements14:12
niemeyerhazmat: rogpeppe makes a good point in terms of the details of watch establishment14:12
rogpeppeso presumably we know almost all of that state, barring operations in progress?14:12
niemeyerhazmat: and I don't have a good answer for him14:12
hazmatniemeyer, that's why i was going with a stop/reconnect/start for both error types as a simple mechanism14:12
hazmatfor now14:13
* hazmat does a test to verify watch behavior14:13
niemeyerhazmat: Yeah, but the problem we have _today_ and that I don't feel safe doing that is that we don't have good-but-stay-alive semantics in the code base14:13
niemeyererm..14:13
niemeyergood stop-but-stay-alive14:13
rogpeppei *think* that the most important case is automatic retries of idempotent operations.14:14
hazmatniemeyer, we do in the unit agents as a consequence of doing upgrades, we pause everything for it14:14
rogpeppebut that's hard too.14:14
niemeyerhazmat: I seriously doubt that this will e.g. kill old watches14:14
hazmatniemeyer, effectively the only thing that's not observation driven is that the provider agent does some polling for runaway instances14:15
hazmatniemeyer, it won't kill old watches, but we can close the handle explicitly14:15
niemeyerhazmat: and what happens to all the deferreds?14:16
hazmatniemeyer, they're dead, when the session is closed14:16
hazmatat least for watches14:16
niemeyerhazmat: What means dead?  Dead as in, they'll continue in memory, hanging?14:17
hazmatniemeyer, yeah... they're effectively dead, we can do things to clean them up if that's problematic14:17
hazmatdead in memory14:17
niemeyerhazmat: Yeah.. so if we have something like "yield exists_watch", that's dead too..14:18
hazmatwe can track open watches like gozk and kill them explicitly (errback disconnect)14:18
niemeyerhazmat: That's far from a clean termination14:18
hazmatniemeyer, we can transition those to exceptions14:18
niemeyerhazmat: Sure, we can do everything we're talking about above.. the point is that it's not trivial14:18
hazmatit seems straightforward at the conn level14:19
hazmatto track watches, and on close kill them14:19
niemeyerhazmat: Heh.. it's straightforward to close() the connection, of course14:19
niemeyerhazmat: It's not straightforward to ensure that doing this will yield a predictable behavior14:19
hazmatso back to process suicide ;-)14:20
niemeyerhazmat: Cinelerra FTW!14:20
rogpeppethis is all talking about the situation when you need to explicitly restart a session, right?14:20
hazmatrogpeppe, yes14:20
niemeyerrogpeppe: Yeah, control over fault scenarios in general14:21
hazmatrestart/open a new session14:21
rogpepperestart is different, i thought14:21
rogpeppebecause the library can do it behind the scenes14:21
rogpeppeand reinstate watches14:21
rogpepperedo idempotent ops, etc14:21
hazmatrogpeppe, but it can't reattach the watches to all extant users?14:21
rogpeppei don't see why not14:22
hazmatperhaps in go that's possible with channels and the channel bookkeeping14:22
hazmatagainst the watches14:22
niemeyerhazmat, rogpeppe: No, that doesn't work in any case14:22
rogpeppe?14:22
niemeyerThe window between the watch being dead and the watch being alive again is lost14:22
=== rogpeppe is now known as rog
rogof course14:23
rogdoh14:23
rogexcept...14:23
rogthat the client *could* keep track of the last-returned state14:23
rogand check the result when the new result arrives14:23
rogand trigger the watcher itself if it's changed14:24
niemeyerrog: Yeah, we could try to implement the watch in the client side, but that's what I was talking above14:25
rogexcept... i don't know if {remove child; add child with same name} is legitimately a no-op14:25
niemeyerrog: We're going very far to avoid a situation that is in fact unavoidable14:25
niemeyerrog: Instead of doing that, I suggest we handle the unavoidable situation in all cases14:25
rogforce all clients to deal with any session termination as if it might be unrecoverable?14:26
niemeyerrog: Yeah14:27
niemeyerrog: Any client disconnection in fact14:27
niemeyerrog: let's also remove the hack we have in the code and allow watches to notice temporary disconnections14:27
rogthis is why proper databases have transactions14:28
niemeyerrog: Uh..14:28
niemeyerrog: That was a shoot in the sky :-)14:28
rogif the connection dies half way through modifying some complex state, then when retrying, you've got to figure out how far you previously got, then redo from there.14:29
niemeyerrog: We have exactly the same thing with zk.14:29
niemeyerrog: The difference is that unlike a database we're using this for coordination14:30
niemeyerrog: Which means we have live code waiting for state to change14:30
niemeyerrog: A database client that had to wait for state to change would face the same issues14:30
hazmatrog, databases still have the same issue14:31
rogyeah, i guess14:31
rogwhat should a watcher do when it sees a temporary disconnection?14:32
rogawait reconnection and watch again, i suppose14:33
hazmatso watches don't fire if the event happens while disconnected14:33
rogi wonder if the watch should terminate even on temporary disconnection.14:34
niemeyerrog: It should error out and stop whatever is being done, recovering the surrounding state if it makes sense14:34
niemeyerrog: Right, exactly14:34
rogand is that true of the Dial session events too? the session terminates after the first non-ok event?14:36
rogi think that makes sense.14:37
rog(and it also makes use of Redial more ubiquitous). [of course i'm speaking from a gozk perspective here, as i'm not familiar with the py zk lib]14:37
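A sketch of the watcher pattern converging here: treat any watch error as terminal for that watch, wait for reconnection, re-read the state, and watch again, so changes that happened while the watch was dead are not missed. The conn type below is a hypothetical stand-in for a gozk-like client, not its actual API.

    package main

    import (
        "fmt"
        "time"
    )

    // conn is a minimal stand-in for a gozk-like client: getW returns the
    // node's current data plus a one-shot channel reporting the next change
    // (nil) or watch failure (error); waitConnected blocks until reconnected.
    type conn struct {
        getW          func(path string) (string, <-chan error, error)
        waitConnected func()
    }

    // watchLoop observes a node "forever".  Each iteration re-reads the state
    // before waiting, which covers events lost while the watch was dead.
    func watchLoop(c conn, path string, changed func(string)) {
        for {
            data, watch, err := c.getW(path)
            if err != nil {
                c.waitConnected() // watch could not be set: wait, then retry
                continue
            }
            changed(data) // deliver current state, then wait for the next event
            if err := <-watch; err != nil {
                c.waitConnected()
            }
        }
    }

    func main() {
        events := make(chan error)
        c := conn{
            getW:          func(string) (string, <-chan error, error) { return "value", events, nil },
            waitConnected: func() {},
        }
        go watchLoop(c, "/topology", func(d string) { fmt.Println("saw", d) })
        events <- nil                     // a change fires the watch once
        time.Sleep(50 * time.Millisecond) // let the loop re-read and report again
    }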
hazmatinteresting, i get a session expired event.. just wrote a unit test for watch fire while disconnected, two server cluster, two clients one connected to each, one client sets a watch, shutdown its server, delete on the other client/server, resurrect the shutdown server with its client waiting on the watch, gets a session expired event14:38
hazmathmm. its timing dependent though14:38
niemeyerrog: Yeah, I think so too14:39
hazmatyeah.. this needs more thought14:39
niemeyerhazmat: Yeah, the more we talk, the more I'm convinced we should assume nothing from a broken connection14:39
hazmatniemeyer, indeed14:40
niemeyerhazmat: This kind of positioning also has a non-obvious advantage.. it enables us to more easily transition to doozerd at some point14:40
niemeyerPerhaps not as a coincidence, it has no concept of sessions14:40
* niemeyer looks at Aram14:40
hazmatniemeyer, interesting.. i thought you gave up on doozerd14:40
hazmatupstream seems to be dead afaik14:40
niemeyerhazmat: I have secret plans!14:40
niemeyer;-)14:40
hazmatniemeyer, cool, when i mentioned it before you seemed down on it14:41
hazmatit would be nice for an arm env to go java-less14:41
niemeyerhazmat: Yeah, because it sucks on several aspects right now14:41
hazmatrog, on ReDial does gozk reuse a handle?14:41
niemeyerhazmat: But what if we.. hmmm.. provided incentives for the situation to change? :-)14:41
hazmatniemeyer, yeah.. persistence and error handling there don't seem well known14:41
hazmatniemeyer, indeed, things can change14:42
roghazmat: no, i don't think so, but i don't think it needs to.14:43
hazmatthat's one important difference between gozk/txzk.. the pyzk doesn't expose reconnecting with the same handle, which toasts extant watches (associated to the handle) when trying to reconnect to the same session explicitly14:43
rog(paste coming up)14:43
hazmatlibzk in the background will do it, but if you want to change the server explicitly at the app/client level its a hoser14:43
rogif you get clients to explicitly negotiate with the central dialler when the connection is re-made, i think it can work.14:46
rogi.e. get error indicating that server is down, ask central thread for new connection.14:47
rogstore that connection where you need to.14:48
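A small sketch of the "central dialler" rog is proposing: one owner holds the current connection, clients that hit a server-down error ask it for the replacement and carry on. The conn and dialler types are placeholders for illustration, not anything in gozk or juju.

    package main

    import (
        "fmt"
        "sync"
    )

    // conn is a placeholder for a gozk-style connection value.
    type conn struct{ id int }

    // dialler is the "central thread": it owns the current connection and
    // hands it out to clients that encounter an error.
    type dialler struct {
        mu      sync.Mutex
        current *conn
    }

    // get returns the connection clients should use right now.
    func (d *dialler) get() *conn {
        d.mu.Lock()
        defer d.mu.Unlock()
        return d.current
    }

    // redial would block until a new session is established, then publish it;
    // here it just swaps in a new placeholder value.
    func (d *dialler) redial() *conn {
        d.mu.Lock()
        defer d.mu.Unlock()
        d.current = &conn{id: d.current.id + 1}
        return d.current
    }

    func main() {
        d := &dialler{current: &conn{id: 1}}
        c := d.get()
        fmt.Println("using connection", c.id)
        // ... on a server-down error, ask the dialler for the replacement:
        c = d.redial()
        fmt.Println("carrying on with connection", c.id)
    }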
* niemeyer loves the idea of keeping it simple and not having to do any of that :)14:49
rogyeah, me too.14:51
rogbut we can't. at least i think that's the conclusion we've come to, right?14:51
niemeyerrog: Hmm, not my understanding at least14:52
niemeyerrog: It's precisely the opposite.. we _have_ to do that anyway14:52
niemeyerrog: Because no matter how hard we try, the connection can break for real, and that situation has to be handled properly14:52
niemeyerrog: So I'd rather focus on that scenario all the time, and forget about the fact we even have sessions14:53
rogniemeyer: so you're saying that we have to lose all client state when there's a reconnection?14:54
niemeyerrog: Yes, I'm saying we have to tolerate that no matter what14:54
rogso the fact that zk has returned ok when we've created a node, we have to act as if that node might not have been created?14:55
rogs/the fact/even if/14:55
niemeyerrog: If zk returned ok, there's no disconnection14:56
rogniemeyer: if it returned ok, and the next create returns an error; that's the scenario i'm thinking of14:56
rogthat's the situation where i think the node creator could wait for redial and then carry on from where it was14:57
niemeyerrog: If it returned an error, we have to handle it as an error and not assume that the session is alive, because it may well not be14:57
niemeyerrog: and what if the session dies?14:58
rogi'm not saying that it should assume the session is still alive14:58
rogi'm saying that when it gets an error, it could ask the central thread for the new connection - it might just get an error instead14:58
rogthe node creator is aware of the transition, but can carry on (knowingly) if appropriate14:59
niemeyerrog: and what about the several watches that are established?14:59
rogniemeyer: same applies14:59
niemeyerrog: What applies?14:59
rogthe watch will return an error; the code doing the watch can ask for a new connection and redo the watch if it wishes.15:00
hazmatniemeyer, so we're back to reinitializing the app on any connection error, disregarding recoverable15:00
rogno redoing behind the scenes, but the possibility of carrying on where we left off15:00
niemeyerrog: The state on which the watch was requested has changed15:00
niemeyerrog: Check out the existing code base15:00
hazmatniemeyer, so interestingly we can be disconnected, not know it, and miss a watch event15:01
niemeyerrog: It's not trivial to just "oh, redo it again"15:01
rogniemeyer: it doesn't matter because the watcher is re-requesting the state, so it'll see both the state and any subsequent watch event15:01
niemeyerhazmat: Yeah, that's exactly the kind of very tricky scenario that I'm concerned about15:01
rogthe watcher has to deal with the "state just changed" scenario anyway when it first requests the watch15:01
hazmatniemeyer, actually we get notification from a session event that we reconnected15:01
niemeyerhazmat: As Russ would say, I don't really want to think about whether it's correct or not15:01
niemeyerrog: No.. please look at the code base15:01
rogniemeyer: sorry, which bit are you referring to?15:02
niemeyerrog: We're saying the same thing, in fact.. you're just underestimating the fact that "just retry" is more involved than "request the new connection and do it again"15:02
niemeyerrog: juju15:02
niemeyerrog: lp:juju15:02
niemeyerrog: This concept touches the whole application15:03
rogniemeyer: i've been exploring it a bit this morning, but haven't found the crucial bits, i think. what's a good example file that would be strongly affected by this kind of thing?15:03
niemeyerhazmat: We do.. the real problem is ensuring state is as it should be when facing reconnections15:03
niemeyerrog: I'm serious.. this touches the whole app15:04
hazmatniemeyer, right we always have to reconsider state on reconnection15:04
niemeyerrog: Check out the agents15:04
rogniemeyer: ah, ok. i was looking in state15:05
rogthanks15:05
niemeyerrog: state is good too15:05
niemeyerrog: Since it's what the agents use and touches this concept too15:05
niemeyerrog, hazmat: So, my suggestion is that the first thing we do is to unhide temporary failures in gozk15:06
hazmatniemeyer, the test becomes a lot more reliable when we have multiple zks in the cluster setup for the client to connect to15:06
rogsgtm15:06
niemeyerrog, hazmat: Then, let's watch out for that kind of issue very carefully in reviews and whatnot, as we build a reliable version15:06
niemeyerhazmat: The same problem exists, though..15:07
hazmatniemeyer, indeed15:07
hazmatbut it minimizes total disconnect scenarios with multiple zks15:07
niemeyerhazmat: Even if it _immediately_ reconnects, the interim problems may have created differences that are easily translated into bugs very hard to figure out15:07
niemeyerhazmat: and again, we seriously _have_ to handle the hard-reconnect across the board15:08
hazmatniemeyer, agreed15:08
hazmatniemeyer, i'm all in favor of simplifying and treating them the same15:08
hazmatrecoverable/unrecoverable conn errors15:09
niemeyerhazmat: So no matter how much we'd love to not break the session and have a pleasant API, the hard reconnects mean we'll need good failure recovery either way15:09
niemeyerSo we can as well plan for that at all times15:09
hazmatniemeyer, its detecting the conn error that i'm concerned about atm15:09
niemeyerhazmat: My understanding is that the client always notifies about temporary issues15:09
hazmatniemeyer, based on its internal poll period to the server15:09
hazmatniemeyer, a quick disconnect masks any client detection15:10
hazmat^transient15:10
niemeyerhazmat: Really!?15:10
hazmatit seems the server will attempt to expire the client session, but i've seen once we're instead it shows a reconnect15:11
hazmats/we're/where15:11
niemeyerhazmat: I can't imagine how that'd be possible15:11
niemeyerhazmat: The client lib should hopefully notify the user that the TCP connection had to be remade15:11
hazmatniemeyer, fwiw here's the test i'm playing with (can drop into test_session.py ).. http://paste.ubuntu.com/702290/15:13
hazmatfor a package install of zk.. ZOOKEEPER_PATH=/usr/share/java15:13
hazmatfor the test runner15:13
niemeyerhazmat: Hmm15:13
niemeyerhazmat: That seems to test that watches work across reconnections15:14
niemeyerhazmat: We know they can work15:14
hazmatniemeyer, they do but we miss the delete15:14
niemeyerhazmat: Or am I missing something?15:14
niemeyerhazmat: Ah, right!15:14
hazmatwith no notice15:14
niemeyerhazmat: So yeah, it's total crack15:15
hazmatniemeyer, actually most of the time we get an expired session event in the client w/ the watch15:15
hazmatlike 99%15:15
hazmatif i connect the client to multiple servers it sees the delete15:16
hazmatw/ the watch that is15:16
niemeyerhazmat: Hmm.. interesting.. so does it keep multiple connections internally in that case, or is it redoing the connection more quickly?15:17
hazmatniemeyer, not afaick, but its been a while since i dug into that15:17
hazmatniemeyer, but as an example here's one run http://paste.ubuntu.com/702291/15:18
hazmatwhere it does get the delete event15:19
hazmatbut that's not guaranteed in all ops15:19
niemeyerhazmat: If you _don't_ create it on restart, does it get the notification?15:20
niemeyerhazmat: Just wondering if it might be joining the two events15:20
hazmatniemeyer, no it still gets the deleted event if it gets an event, else it gets session expired15:21
hazmatbut it's easy to construct it so it only sees the created event15:21
hazmatif i toss a sleep in15:21
hazmatperhaps not15:22
hazmatit seems to get the delete event or session expiration.. i need to play with this some more and do a more thought out write up15:22
hazmatin some cases it does get the created event, obviously the pastebin has that15:22
niemeyerhazmat: I see, cool15:23
niemeyerOn a minor note, filepath.Rel is in.. can remove our internal impl. now15:24
hazmatniemeyer, cool15:24
niemeyerThat was a tough one :)15:24
niemeyerfwereade: Leaving to lunch soon.. how's stuff going there?15:25
niemeyerfwereade: Can I do anything for you?15:25
niemeyerjimbaker: How's env-origin as well?15:26
jimbakerniemeyer, just need to figure out the specific text for the two scenarios you mention15:26
niemeyerjimbaker: Hmmm.. which text?15:27
jimbakerniemeyer, from apt-cache policy15:27
niemeyerjimbaker: Just copy & paste from the existing test? Do you want me to send a patch?15:27
jimbakerniemeyer, well it's close to being copy & paste, but the difference really matters here15:28
jimbakerif you have a simple patch, for sure that would be helpful15:28
niemeyerjimbaker: Sorry, I'm still not sure about what you're talking about15:28
niemeyerjimbaker: It seems completely trivial to me15:29
niemeyerjimbaker: Sure.. just a sec15:29
jimbakerniemeyer, i was not familiar with apt-cache policy before this work. obviously once familiar, it is trivial15:29
niemeyerjimbaker: I'm actually talking about the request I made in the review..15:30
niemeyerjimbaker: But since you mention it, I actually provided you with a scripted version saying exactly how it should work like 3 reviews ago15:30
jimbakerniemeyer, i'm going against http://carlo17.home.xs4all.nl/howto/debian.html#errata for a description of the output format15:31
hazmatthe python-apt bindings are pretty simple too.. i used them for the local provider.. although its not clear how you identify a repo for a given package from it15:31
jimbakerniemeyer, if you have a better resource describing apt-cache policy, i would very much appreciate it15:31
jimbakerhazmat, one advantage of such bindings is the data model15:32
hazmatjimbaker, well.. its as simple as cache = apt.Cache().. pkg = cache["juju"].. pkg.isInstalled -> bool... but it doesn't tell you if its a ppa or distro15:33
hazmatand for natty/lucid installs without the ppa thats a keyerror on cache["juju"]15:33
niemeyerjimbaker: apt-get source apt15:34
niemeyerjimbaker: That's the best resource about apt-cache you'll find15:34
jimbakerniemeyer, ok, i will read the source, thanks15:34
niemeyerjimbaker: Turns out that *** only shows for the current version, so it's even easier15:36
niemeyer         if (Pkg.CurrentVer() == V)15:36
niemeyer            cout << " *** " << V.VerStr();15:36
niemeyer         else15:36
niemeyer            cout << "     " << V.VerStr();15:36
hazmatjimbaker, ideally the detection will also notice osx and do something sane, but we can do that later15:37
niemeyerjimbaker: http://paste.ubuntu.com/702301/15:37
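For reference, a simplified sketch of the parsing just described: scan `apt-cache policy <pkg>` output for the version entry flagged with "***" (the installed version, per the apt source above) and classify the archive lines beneath it. It is written in Go purely to illustrate the algorithm; juju's actual env-origin code is Python, and the hostname-based classification below is an assumption.

    package main

    import (
        "fmt"
        "strings"
    )

    // originFromPolicy walks `apt-cache policy <pkg>` output: the "***" marker
    // flags the installed version, and the indented archive lines under it
    // (pin priority followed by a URL) say where it came from.
    func originFromPolicy(output string) string {
        inInstalled := false
        for _, line := range strings.Split(output, "\n") {
            trimmed := strings.TrimSpace(line)
            switch {
            case strings.HasPrefix(trimmed, "***"):
                inInstalled = true // following indented lines describe this version
            case inInstalled && strings.Contains(trimmed, "ppa.launchpad.net"):
                return "ppa"
            case inInstalled && strings.Contains(trimmed, "ubuntu.com/ubuntu"):
                return "distro"
            }
        }
        return "unknown" // the real code is more careful about fallbacks
    }

    func main() {
        sample := `juju:
      Installed: 0.5+bzr361-0ubuntu1
      Candidate: 0.5+bzr361-0ubuntu1
      Version table:
     *** 0.5+bzr361-0ubuntu1 0
            500 http://archive.ubuntu.com/ubuntu/ oneiric/universe i386 Packages
            100 /var/lib/dpkg/status`
        fmt.Println(originFromPolicy(sample)) // distro
    }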
hazmatmore important to have this in now for the release15:37
niemeyerhazmat: Oh yeah, please stop giving ideas! :-)15:37
jimbakerhazmat, for osx, doesn't it make more sense to just set juju-origin?15:38
niemeyerPlease, let's just get this branch fixed..15:38
hazmatjimbaker, probably does.. but the "/usr" in package path has faulty semantics with a /usr/local install on the branch as i recall15:38
niemeyerI'm stepping out for lunch15:39
fwereadepopping out for a bit, back later15:51
rogi'm off for the evening. am still thinking hard about the recovery stuff. see ya tomorrow.16:35
rogniemeyer: PS ping re merge requests :-)16:35
niemeyerrog: Awesome, sorry for the delay there16:36
niemeyerrog: Yesterday was a bit busier than expected16:37
niemeyerjimbaker: How's it there?16:43
jimbakerniemeyer, it's a nice day16:56
niemeyerjimbaker: Excellent.. that should mean env-origin is ready?16:57
jimbakerniemeyer, i still need to figure out what specifically apt-cache policy would print16:57
niemeyerjimbaker: Ok.. let's do this.. just leave this branch with me.16:58
jimbakerniemeyer, i do have the source code for what prints it, but i need to understand the model backing it16:58
niemeyerjimbaker: No need.. I'll handle it, thanks.16:58
jimbakerniemeyer, ok, that makes sense, i know you have a great deal of background from your work on synaptic, thanks!16:58
niemeyerjimbaker: That work is completely irrelevant.. the whole logic is contained in the pastebin16:59
jimbakerniemeyer, ok16:59
niemeyerjimbaker: and I pointed the exact algorithm to you16:59
_mup_juju/remove-sec-grp-do-not-ignore-exception r381 committed by jim.baker@canonical.com17:39
_mup_Simplified remove_security_group per review point17:39
_mup_juju/remove-sec-grp-do-not-ignore-exception r382 committed by jim.baker@canonical.com17:40
_mup_Merged trunk17:40
niemeyerhazmat: Do you have time for a quick review on top of env-origin?18:38
niemeyerhazmat: http://paste.ubuntu.com/702373/18:38
niemeyerhazmat: It's pretty much just that function I've shown you a while ago plus minor test tweaks18:38
hazmatniemeyer, checking18:38
niemeyerhazmat: The test tweaks just try a bit harder to break the logic18:38
niemeyerhazmat: Hmm.. I'll also add an extra test with broken input, to ensure that's working18:39
hazmatniemeyer, what's the file on disk its parsing?18:40
niemeyerhazmat: output of apt-cache policy juju18:40
hazmator is that just apt-cache policy pkg?18:40
niemeyerhazmat: http://paste.ubuntu.com/702301/18:41
niemeyerhazmat: Yeah18:41
niemeyerhazmat: That last paste has the logic generating the output18:41
niemeyerhazmat: Hmmm.. I'll also do an extra safety check there, actually18:42
niemeyerhazmat: It's assuming that any unknown output will fallback to branch.. that sounds dangerous18:43
niemeyerhazmat: I'll tweak it so it only falls back to branch in known inputs18:43
niemeyerhazmat: http://paste.ubuntu.com/702381/18:47
hazmatniemeyer, why is it returning a tuple if it only cares about the line from the line generator18:48
hazmatniemeyer, in general it looks fine to me, there's two pieces in the branch that i have minor concern about18:48
niemeyerhazmat: Keep reading :)18:48
hazmatah. first indent18:49
niemeyerhazmat: It actually cares about the indent as well18:49
niemeyerhazmat: It's how we detect we've left a given version entry18:49
niemeyerhazmat: What's the other bit you're worried about?18:51
hazmatniemeyer, basically how does it break on osx if apt-cache isn't found.. and the notion that if not juju.__name__.startswith("/usr") means unconditionally a package...if i check juju out and do a setup.py install it's still a source install.. hmm.. i guess that works with the apt-cache check on installed.. so it looks like just what happens if not on ubuntu.. pick a sane default18:52
hazmatif apt-cache isn't there this will raise an exception it looks like18:52
niemeyerhazmat: I'll take care of that18:52
hazmatniemeyer, +1 then18:53
niemeyerhazmat: What should we default to?18:53
* niemeyer thinks18:53
niemeyerdistro, I guess18:53
hazmatniemeyer, distro seems sane18:53
niemeyerCool18:53
lamalexis juju useful for deploying services like mongodb on my local dev machine?19:01
niemeyerhazmat: http://paste.ubuntu.com/702393/19:17
niemeyerlamalex: It is indeed19:17
niemeyerlamalex: We've just landed support for that, so we're still polishing it a bit, but that's already in and is definitely something we care about19:17
lamalexniemeyer, awesome!19:20
hazmat niemeyer +119:27
niemeyerhazmat: Woot, there we go19:27
_mup_juju/env-origin r381 committed by gustavo@niemeyer.net19:29
_mup_- Implementation redone completely.19:29
_mup_- Do not crash on missing apt-cache.19:29
_mup_- Exported and tested get_default_origin.19:29
_mup_- Tests tweaked to explore edge cases.19:29
hazmatniemeyer, should i be waiting on a second review for local-origin-passthrough or can i go ahead and merge?19:45
hazmatbcsaller, if you have a moment and could look at local-origin-passthrough that would be awesome19:45
bcsallerI'll do it now19:46
hazmatbcsaller, awesome, thanks19:46
hazmatbcsaller, i had one fix that i accidentally pushed down to unit-cloud-cli, but regarding the network setup in the chroot: the way it was working before (modifying resolvconf/*/base) wasn't going to work since that's not processed for a chroot, so i ended up directly inserting dnsmasq into the output resolvconf/run/resolv.conf to ensure it's active for the chroot19:47
bcsallerhazmat: why did it need to be active for the chroot?19:48
hazmatbcsaller, because we install packages and software from there19:48
hazmatbcsaller, most of the packages end up being cached, which caused some false starts, but doing it with juju-origin resurfaced the issue, since it had to talk to lp to resolve the branch19:48
bcsalleryeah... just put that together. We might be better off with a 1 time job for juju-create19:48
bcsallerupstart job I mean19:49
hazmatbcsaller, it is still a one time job, and dnsmasq is the correct resolver, i just changed it to be the active one during the chroot19:49
niemeyerhazmat: Hmm19:49
bcsallerk19:49
_mup_juju/trunk r382 committed by gustavo@niemeyer.net19:49
_mup_Merged env-origin branch [a=jimbaker,niemeyer] [r=hazmat,niemeyer]19:49
_mup_This introduces a juju-origin option that may be set to "ppa",19:49
_mup_"distro", or to a bzr branch URL.  The new logic will also attempt19:49
_mup_to find out the origin being used to run the local code and will19:49
_mup_set it automatically if unset.19:49
niemeyerMay be worth testing it against the tweaked env-origin19:49
hazmat/etc/resolv.conf symlinks to /etc/resolvconf/run/resolv.conf .. it's only on startup that it gets regen'd for the container via dhcp to be the dnsmasq..19:49
bcsallerhazmat: thats why I was suggesting that it could happen in startup on the first run in a real lxc and not a chroot19:50
hazmatniemeyer, good point.. i think i ended up calling get_default_origin to get a sane default for local provider to pass through19:50
bcsallerbut the change you made should be fine19:50
niemeyerhazmat: Yeah.. I've exported it and tested it19:50
niemeyerhazmat: So it'll be easy to do that19:51
hazmatcool19:51
niemeyerhazmat: Note that the interface has changed, though19:51
hazmatnoted, i'll do an end to end test19:51
niemeyerhazmat: It returns a tuple of two elements in the same format as parse_juju_origin19:51
niemeyerhazmat, bcsaller: I've just unstuck the wtf too.. it was frozen on a "bzr update" of lp:juju for some unknown reason19:52
niemeyerWe should have some input about the last 3 revisions merged soonish19:52
niemeyerI'm going outside for some exercising.. back later19:55
hazmatniemeyer, cheers20:00
niemeyerWoot! 379 is good.. 3 to go20:00
niemeyerAlright, actually leaving now.. laters!20:01
hazmatinteresting.. Apache Ambari20:06
SpamapShey is there a tutorial for using the local provider?20:10
SpamapShmmmm20:13
SpamapSlatest trunk failure on PPA build20:13
SpamapShttps://launchpadlibrarian.net/81932606/buildlog_ubuntu-natty-i386.juju_0.5%2Bbzr378-1juju1~natty1_FAILEDTOBUILD.txt.gz20:13
hazmatSpamapS, not yet20:16
hazmati'll put together some provider docs after i get these last bits merged20:16
hazmatSpamapS, haven't seen those failures before20:18
hazmatthey work for me disconnected on trunk20:19
hazmatSpamapS, is the s3 url endpoint being patched for the packaging?20:20
hazmati don't see how else that test could fail, perhaps bucket dns names20:21
_mup_Bug #867877 was filed: revision in charm's metadata.yaml is inconvenient <juju:New> < https://launchpad.net/bugs/867877 >21:04
_mup_juju/trunk-merge r343 committed by kapil.thangavelu@canonical.com21:20
_mup_trunk merge21:20
_mup_juju/local-origin-passthrough r418 committed by kapil.thangavelu@canonical.com21:22
_mup_merge pipeline, resolve conflict21:22
fwereadethat's it for me, nn all21:35
_mup_juju/trunk r383 committed by kapil.thangavelu@canonical.com21:39
_mup_merge unit-relation-with-address [r=niemeyer][f=861225]21:39
_mup_Unit relations are now prepopulated with the unit's private address21:39
_mup_under the key 'private-address'. This obviates the need for units to21:39
_mup_manually set ip addresses on their relations to be connected to by the21:39
_mup_remote side.21:39
hazmatfwereade, cheers21:39
* niemeyer waves21:44
niemeyerWoot.. lots of green on wtf21:45
niemeyerhazmat: re. local-origin-passthrough, once you're happy with it would you mind to do a run on EC2 just to make sure things are happy there?22:01
hazmatniemeyer, sure, just in progress on that22:01
niemeyerhazmat: Cheers!22:01
=== xzilla_ is now known as xzilla
jimbakeralthough local-origin-passthrough doesn't work for me, hazmat believes he has a fix for it in the unit-info-cli branch22:12
hazmatjimbaker, did you try it out?22:13
jimbakerhazmat, unit-info-cli has not yet come up22:13
hazmatjimbaker, and to be clear that's not  regarding ec222:13
jimbakerhazmat, of course not, it's local :)22:13
hazmatjimbaker, pls pastebin the data-dir/units/master-customize.log22:14
hazmatjimbaker, it would also be good to know if you have unit agents running or not22:14
hazmatjimbaker, are you running oneiric containers?22:14
hazmatjimbaker, yeah.. figuring out when it's done basically needs to parse ps output22:14
hazmator check status22:14
jimbakerhazmat, unfortunately this is hitting a wall of time for me - need to take kids to get their shots momentarily22:14
hazmatbut incrementally its easier to look at ps output22:15
jimbakerhazmat, makes sense. i was just taking a look at juju status22:15
hazmatjimbaker, k, i'll be around latter22:15
jimbakerok, i will paste when i get back22:15
hazmatjimbaker, juju status won't help if there's an error, looking at ps output shows the container creation and juju-create customization, all the output of customize goes to the customize log22:16
* hazmat wonders if the cobbler api exposes available classes22:21
hazmatah.. get_mgmtclasses22:24
_mup_juju/local-origin-passthrough r419 committed by kapil.thangavelu@canonical.com23:33
_mup_incorporate non interactive apt suggestions, pull up indentation and resolv.conf fixes from the pipeline23:33
_mup_juju/trunk r384 committed by kapil.thangavelu@canonical.com23:37
_mup_merge local-origin-passthrough [r=niemeyer][f=861225]23:37
_mup_local provider respects juju-origin settings. Allows for using23:37
_mup_a published branch when deploying locally.23:37
hazmatwhoops, forgot the reviewers23:37
_mup_juju/unit-info-cli r426 committed by kapil.thangavelu@canonical.com23:38
_mup_merge local-origin-passthrough & resolve conflict23:38
_mup_juju/unit-info-cli r427 committed by kapil.thangavelu@canonical.com23:54
_mup_fix double typo pointed out by review23:54
