/srv/irclogs.ubuntu.com/2011/10/04/#juju.txt

* niemeyer gets some food00:00
SpamapSIt's important to have a nice, clear, easy REST API to use.. but it's vital that you also provide optimisations for batch operations. It's why SQL is so popular.. easy to get one row, easy to get all rows.00:08
niemeyerjimbaker: ping01:21
niemeyerhazmat: Still around?01:40
hazmat niemeyer indeed02:14
niemeyerhazmat: Cool, sorted out already, again! :-)02:17
niemeyerhazmat: Review queue pretty much empty02:17
hazmatniemeyer, nice02:17
* hazmat crashes02:19
niemeyerhazmat: Cheers02:25
_mup_juju/unit-info-cli r424 committed by kapil.thangavelu@canonical.com02:47
_mup_remove the manual copy of host resolv.conf, since customize runs in a chroot, directly modify the resolv.conf output to point to dnsmasq, fix indentation problem02:47
_mup_juju/env-origin r381 committed by jim.baker@canonical.com04:22
_mup_Merged trunk04:22
SpamapSFYI r378 caused a segfault when building on natty06:09
SpamapShttps://launchpadlibrarian.net/81865589/buildlog_ubuntu-natty-i386.juju_0.5%2Bbzr378-1juju1~natty1_FAILEDTOBUILD.txt.gz06:09
* ejat just wondering … is someone doing charm for liferay :) 09:43
hazmathmm11:25
hazmatSpamapS, its a problem with the zk version there11:26
hazmat3.3.1 has known issues for juju11:26
hazmatapplies primarily to libzookeeper and python-libzookeeper11:27
hazmatSpamapS, all the distro ppas (minus oneiric perhaps) should have 3.3.312:06
_mup_Bug #867420 was filed: Add section mentioning expose to the user tutorial. <juju:In Progress by rogpeppe> < https://launchpad.net/bugs/867420 >12:10
TeTeTjust updated my oneiric install, juju seems to have a problem:12:12
TeTeTErrors were encountered while processing:12:12
TeTeT /var/cache/apt/archives/juju_0.5+bzr361-0ubuntu1_all.deb12:12
TeTeTE: Sub-process /usr/bin/dpkg returned an error code (1)12:12
TeTeTwas a transient problem, apt-get update, apt-get -f install seemed to have fixed it12:17
hazmatinterestingly simulating transient disconnection of a bootstrap node for extended periods of time seems to be fine12:29
fwereadeheya niemeyer13:11
SpamapShazmat: ahh, we need to add a versioned build dep then13:12
niemeyerHello!13:17
hazmatniemeyer, g'morning13:19
niemeyerfwereade: How're things going there?13:23
niemeyerhazmat: Good stuff in these last few branches13:23
fwereadeniemeyer: tolerable :)13:23
niemeyerfwereade: ;-)13:23
hazmatniemeyer, yeah.. finally fixed the local provider issue wrt customization, so all is good there, still see some occasional lxc pty allocation errors, but haven't narrowed them down to a reliable reproduction strategy for upstream13:24
hazmatniemeyer, i did play around with the disconnect scenarios some more, at least for a period of no active usage (no hooks executing, etc), we tolerate zookeeper nodes going away transiently fairly well13:25
niemeyerhazmat: By zookeeper nodes you mean the server themselves?13:25
niemeyerservers13:25
hazmatniemeyer, yeah.. the zookeeper server going away13:25
niemeyerhazmat: Neat!13:25
niemeyerIt's a good beginning :)13:26
niemeyerhazmat: we should talk to rogpeppe about the issues we debated yesterday13:26
niemeyerhazmat: re. making things not fail when possible13:26
rogpeppei'm here!13:26
hazmatniemeyer, for the single server case, the session stays alive, if the client reconnects within the session timeout period after the server is back up. and the clients all go into poll mode every 3s when the zk server is down (roughly 1/3 session time i believe)13:26
rogpeppe(afternoon, folks, BTW)13:27
hazmatniemeyer, there are a few warnings in the zk docs about not trusting library implementations that do magic things for the app13:27
hazmatregarding error handling13:27
hazmatrogpeppe, hola13:27
niemeyerhazmat: Well, sure :)13:27
niemeyerhazmat: That's what the whole session concept is about, though13:27
niemeyerrogpeppe: This goes a bit in the direction you were already thinking13:28
niemeyerrogpeppe: You mentioned in our conversations that e.g. it'd be good that Dial would hold back until the connection is actually established13:28
niemeyerrogpeppe: This is something we should do, but we're talking about doing more than that13:28
rogpeppei don't know if this is relevant, or if it's a problem with gozk alone, but i never got a server-down notification from a zk server, even when i killed it and waited 15 mins.13:28
niemeyerrogpeppe: Try bringing it up afterwards! :-)13:29
hazmatrogpeppe, session expiration is server governed, clients don't decide that13:29
niemeyerrogpeppe: It's a bit strange, but that's how it works.. the session times out in the next reconnection13:29
rogpeppeniemeyer: yeah, i definitely think it should13:29
hazmatrogpeppe, the clients go into a polling reconnect mode, turning up the zookeeper debug log verbosity will show the activity13:30
rogpeppehazmat: but what if there's no server? surely the client should fail eventually?13:30
niemeyerrogpeppe: So, in addition to this, when we are connected and zk disconnects, we should also block certain calls13:30
niemeyerrogpeppe: Well.. all the calls13:30
hazmatrogpeppe, nope.. they poll endlessly in the background, attempting to use the connection will raise a connectionloss/error13:30
hazmatrogpeppe, at least until the handle is closed13:31
niemeyerrogpeppe: So that we avoid these errors ^13:31
hazmatrogpeppe, that's why we have explicit timeouts for connect13:31
niemeyerrogpeppe: In other words, if we have a _temporary_ error (e.g. disconnection rather than session expiration), we should block client calls13:31
hazmatabove libzk13:31
rogpeppehazmat: but if all users are blocked waiting for one of {connection, state change}, then no one will try to use the connection, and the client will hang forever13:31
niemeyerrogpeppe: Not necessarily.. as you know it's trivial to timeout and close a connection13:32
niemeyerrogpeppe: I mean, on our side13:32
rogpeppeso all clients should do timeout explicitly?13:32
niemeyerrogpeppe: <-time.After & all13:32
rogpeppesure, but what's an appropriate time out?13:32
niemeyerrogpeppe: Whatever we choose13:33
niemeyerrogpeppe: But that's not what we're trying to solve now13:33
rogpeppesure13:33
niemeyerrogpeppe: What we have to do is make the gozk interface bearable13:33
niemeyerrogpeppe: Rather than a time bomb13:33
hazmatso we're trying to subsume recoverable error handling into the client13:33
rogpeppe[note to future: i'd argue for the timeout functionality to be inside the gozk interface, not reimplemented by every client]13:33
niemeyer[note to future: discuss timeout with rogpeppe]13:34
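A minimal sketch of the client-side timeout pattern niemeyer points at with `<-time.After`, in Go since gozk is the client in question. The callWithTimeout helper and the fake slow operation are illustrative assumptions, not gozk API; the point is only that a caller can bound any blocked operation and then decide to close the connection itself.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // errTimeout is returned when the wrapped call does not complete in time.
    var errTimeout = errors.New("zookeeper call timed out")

    // callWithTimeout runs op in a goroutine and gives up after the given
    // duration, leaving the caller free to close the connection.
    func callWithTimeout(op func() error, timeout time.Duration) error {
        done := make(chan error, 1)
        go func() { done <- op() }()
        select {
        case err := <-done:
            return err
        case <-time.After(timeout):
            return errTimeout
        }
    }

    func main() {
        // A fake operation standing in for e.g. a create or get on the connection.
        slowOp := func() error { time.Sleep(2 * time.Second); return nil }
        fmt.Println(callWithTimeout(slowOp, 500*time.Millisecond))
    }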
hazmatby capturing a closure for any operation, and on connection error, waiting till the connection is reestablished and re-executing the closure (possibly with additional error detection semantics)13:34
rogpeppehazmat: are we talking about the gozk package level here?13:34
niemeyerhazmat: I think there's a first step before that even13:34
rogpeppeor a higher juju-specific level?13:35
niemeyerrogpeppe: Yeah, internal to gozk13:35
hazmatrogpeppe, which pkg isn't relevant, but yes at the zk conn level13:35
hazmathttp://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling13:35
niemeyerhazmat: Before we try to _redo_ operations, we should teach gozk to not _attempt_ them in the first place when it knows the connection is off13:35
hazmathmm13:35
hazmatyeah.. thats better13:35
hazmatwe can basically watch session events and hold all operations13:36
hazmatniemeyer, +113:36
niemeyerhazmat: Cool!13:36
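A rough sketch of the "watch session events and hold all operations" idea: a session-event watcher flips a gate that every operation waits on before touching the wire, so the connection-loss error is never produced in the first place. All names below are assumptions for illustration, not the actual gozk internals.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // connGate parks callers while the underlying connection is down.
    type connGate struct {
        mu   sync.Mutex
        cond *sync.Cond
        up   bool
    }

    func newConnGate() *connGate {
        g := &connGate{}
        g.cond = sync.NewCond(&g.mu)
        return g
    }

    // setConnected would be driven by session events from the zk client.
    func (g *connGate) setConnected(up bool) {
        g.mu.Lock()
        g.up = up
        g.mu.Unlock()
        if up {
            g.cond.Broadcast()
        }
    }

    // wait blocks until the connection is usable, so an operation is never
    // attempted against a connection we already know is dead.
    func (g *connGate) wait() {
        g.mu.Lock()
        for !g.up {
            g.cond.Wait()
        }
        g.mu.Unlock()
    }

    func main() {
        gate := newConnGate()
        go func() {
            time.Sleep(100 * time.Millisecond)
            gate.setConnected(true) // simulated "session reconnected" event
        }()
        gate.wait()
        fmt.Println("connection is up; safe to issue the zk operation")
    }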
hazmathmm13:36
hazmatniemeyer, so there is still a gap13:36
niemeyerrogpeppe: Does that make sense to you as well?13:36
* rogpeppe is thinking hard13:37
niemeyerhazmat: There is in some cases, when we attempt to do something and the tcp connection crashes on our face13:37
hazmatniemeyer, internally libzk will do a heartbeat effectively to keep the session alive, if the op happens before the heartbeat detects dead we still get a conn error13:37
niemeyerhazmat: Let's handle that next by retrying certain operations intelligently13:37
rogpeppei think the first thing is to distinguish between recoverable and unrecoverable errors13:37
hazmatrogpeppe, its a property of the handle13:38
niemeyerrogpeppe: That's the next thing, after the initial step we mentioned above13:38
hazmatlibzk exposes a method for it to return a bool13:38
hazmatrecoverable(handle)13:38
niemeyerrogpeppe: For blocking operations on certain connection states, we're actually preventing the error from even happening13:38
rogpeppepreventing the error being exposed to the API-client code, that is, yes?13:39
niemeyerrogpeppe: No13:39
hazmatrogpeppe, yup13:39
hazmat:-)13:39
rogpeppelol13:39
niemeyerrogpeppe: Preventing it from happening at all13:39
hazmatthe error never happens13:39
hazmatbecause we don't let the op go through while disconnected13:40
niemeyerrogpeppe: The error never happens if we don't try the call13:40
rogpeppeok, that makes sense.13:40
rogpeppebut... what about an op that has already gone through13:40
rogpeppe?13:40
hazmatnext step is to auto-recover from the error for ops where we can do so without ambiguity, because there is still a gap in our detection of the client connectivity13:40
rogpeppeand then the connection goes down13:40
niemeyerrogpeppe: That's the next case we were talking about above13:40
niemeyerrogpeppe: If the operation is idempotent, we can blindly retry it behind the lib client's back13:41
rogpeppeniemeyer: do we need to? i thought it was important that clients be prepared to handle critical session events13:41
niemeyerrogpeppe: If the operation is not idempotent, too bad.. we'll have to let the app take care of it13:41
hazmatrogpeppe, effectively the only ops i've seen ambiguity around are the create scenario, and modifications without versions13:41
niemeyerrogpeppe: Do we need to what?13:42
rogpeppedo we need to retry, was my question.13:42
hazmatso this might be better structured as a library on top of the connection that's specific to juju13:42
niemeyerrogpeppe: Yeah, because otherwise we'll have to introduce error handling _everywhere_, doing exactly the same retry13:42
niemeyerhazmat: Nah.. let's do it internally and make a clean API.. we know what we're doing13:43
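A sketch of the closure-capture retry hazmat describes above: wrap an idempotent operation and re-execute it when a connection-loss style error comes back. The errConnectionLoss value and the retry policy are assumptions for illustration, not gozk's real error types, and as noted this must never be applied to sequence creates.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // errConnectionLoss stands in for the error a zk client reports when the
    // TCP connection drops mid-operation but the session may still be alive.
    var errConnectionLoss = errors.New("zookeeper: connection loss")

    // retryIdempotent re-runs op after transient connection errors.  Only safe
    // for idempotent calls (exists, get, versioned set/delete) -- with a
    // sequence create a lost reply leaves us unable to tell what happened.
    func retryIdempotent(op func() error, attempts int, backoff time.Duration) error {
        var err error
        for i := 0; i < attempts; i++ {
            err = op()
            if !errors.Is(err, errConnectionLoss) {
                return err // success, or an error we must surface to the app
            }
            time.Sleep(backoff)
        }
        return err
    }

    func main() {
        calls := 0
        op := func() error {
            calls++
            if calls < 3 {
                return errConnectionLoss // simulate two transient failures
            }
            return nil
        }
        fmt.Println(retryIdempotent(op, 5, 10*time.Millisecond), "after", calls, "calls")
    }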
rogpeppedoes zookeeper do a 3 phase commit?13:43
hazmatniemeyer, famous last words ;-)13:43
rogpeppei.e. for something like create with sequence number, does the client have to acknowledge the create before the node is actually created?13:43
niemeyerhazmat: Well, if we don't, we have larger problems ;-)13:43
hazmatrogpeppe, its a paxos derivative internally. everything forwards to the active leader in the cluster13:43
hazmatwrites that is13:43
hazmatit transparently does leader election as needed13:44
niemeyerrogpeppe: The _client_ cannot acknowledge the create13:44
hazmatrogpeppe, the client doesn't ack the create, but the error recovery with a sequence node is hard, because without the server response, we have no idea what happened13:44
rogpeppeniemeyer: why not? i thought the usual process was: write request; read response; write ack; server commits13:45
niemeyerrogpeppe: What's the difference?13:45
niemeyerrogpeppe: write ack; read response; write ack; read response; write ack; read response; server commits13:45
rogpeppeniemeyer: the difference is that if the server doesn't see an ack from the client, the action never happened.13:45
niemeyerrogpeppe: Doesn't matter how many round trips.. at some point the server will commit, and if the connection crashes the client won't know if it was committed or not13:46
hazmat? there's client acks under the hood?13:46
niemeyerhazmat: There isn't.. and I'm explaining why it makes no difference13:46
hazmatah13:46
* hazmat dogwalks back in 1513:47
niemeyerhazmat: Cheers13:47
rogpeppeif the connection crashes, the client can still force the commit by writing the ack. it's true that it doesn't know if the ack is received. hmm. byzantine generals.13:48
niemeyerYeah13:48
rogpeppei'm slightly surprised the sequence-number create doesn't have a version argument, same as write13:49
niemeyerrogpeppe: Hmm.. seems to be sane to me?13:50
rogpeppethat would fix the problem, at the expense of retries, no?13:50
niemeyerrogpeppe: It's atomic.. it's necessarily going to be version 013:50
rogpeppeah, child changes don't change a version number?13:50
* rogpeppe goes back to look at the modify operation13:51
niemeyerrogpeppe: It changes, but it makes no sense to require a given version with a sequence number13:51
niemeyerrogpeppe: The point of using the sequence create is precisely to let the server make concurrent requests work atomically13:51
niemeyer_Hmm13:52
niemeyer_Weird13:52
niemeyer_Abrupt disconnection13:53
rogpeppeniemeyer: but we want to do that with node contents too - that's why the version number on Set13:53
rogpeppeniemeyer_: and that's the main problem with the lack of Create idempotency13:53
rogpeppeanyway, we could easily document that Create with SEQUENCE is a special case13:54
rogpeppeand can return an error without retrying13:54
niemeyer_rogpeppe: We don't even have to document it really.. the error itself is the notice13:55
rogpeppei think it would be good if the only time a session event arrived at a watcher was if the server went down unrecoverably13:55
rogpeppeactually, that doesn't work13:56
rogpeppewatchers will always have to restart13:56
niemeyer_rogpeppe: That's how it is today, except for the session events in the session watch13:56
niemeyer_rogpeppe: Not really13:56
niemeyer_rogpeppe: If the watch was already established, zk will keep track of them and reestablish internally as long as the session survives13:57
rogpeppebut what if the watch reply was lost when the connection went down?13:58
niemeyer_rogpeppe: Good question.. worth confirming to see if it's handled properly13:59
rogpeppei'm not sure how it can be14:00
rogpeppethe client doesn't ack watch replies AFAIK14:00
niemeyer_rogpeppe: There are certainly ways it can be.. it really depends on how it's done14:01
niemeyer_rogpeppe: E.g. the client itself can do the verification on connection reestablishment14:01
niemeyer_Another alternative, which is perhaps a saner one, is to do a 180⁰ turn and ignore the existence of sessions completely14:01
niemeyer_Hmmm..14:02
rogpeppeniemeyer_: that would look much nicer from an API user's perspective14:02
niemeyer_I actually like the sound of that14:02
niemeyer_rogpeppe: Not even thinking about API.. really thinking about how to build reliable software on top of it14:03
rogpeppearen't those closely related things?14:03
niemeyer_rogpeppe: Not necessarily.. an API that reestablishes connections and knows how to handle problems internally is a lot nicer from an outside user's perspective14:04
=== niemeyer_ is now known as niemeyer
rogpeppeniemeyer: don't quite follow14:06
niemeyerrogpeppe: Don't worry, it's fine either way14:06
* hazmat catches up14:06
niemeyerhazmat: I think we should do a U turn14:07
hazmatniemeyer, how so?14:07
hazmathmm.. verifying watch handling while down sounds good14:08
hazmatconnection down that is14:08
niemeyerhazmat: We're adding complexity in the middle layer, and reality is that no matter how complex and how much we prevent the session from "crashing", we _still_ have to deal with session termination correctly14:08
hazmatsession termination is effectively fatal14:08
rogpeppewhen does a session terminate?14:08
hazmatthe only sane thing to do is to restart the app14:08
niemeyerhazmat: we're also constantly saying "ah, but what if X happens?"..14:08
hazmatrogpeppe, a client is disconnected from the quorum for the period of session timeout14:09
niemeyerhazmat: Not necessarily.. we have to restart the connection14:09
hazmatniemeyer, and reinitialize any app state against the new connection14:09
niemeyerhazmat: Yes14:09
hazmatie. restart the app ;-)14:09
niemeyerhazmat: No, restart the app is something else14:09
niemeyerhazmat: Restart the app == new process14:10
hazmatdoesn't have to be a process restart to be effective, but it needs to go through the entire app init14:10
niemeyerhazmat: So, the point is that we have to do that anyway14:10
niemeyerhazmat: Because no matter how hard we try, that's a valid scenario14:10
hazmatrogpeppe, the other way a session terminates is a client closes the handle, thats more explicit14:10
hazmatrogpeppe, that can be abused in testing by connecting multiple clients via the same session id, to simulate session failures14:11
hazmatniemeyer, absolutely for unrecoverable errors that is required14:11
niemeyerhazmat: So what about going to the other side, and handling any session hiccups as fatal?  It feels a lot stronger as a general principle, and a lot harder to get it wrong14:11
rogpeppewhen you say "reinitialize any app state", doesn't that assume that no app state has already been stored on the server?14:11
hazmatfor recoverable errors local handling inline to the conn, seems worth exploring14:11
rogpeppeor are we assuming that the server is now a clean slate?14:11
hazmatwe need to validate some of the watch state14:12
hazmatrogpeppe, no the server has an existing state14:12
niemeyerhazmat: The problem is that, as we've been seeing above, "recoverable errors" are actually very hard to really figure14:12
hazmatrogpeppe, the app needs to process the existing state against its own state needs and observation requirements14:12
niemeyerhazmat: rogpeppe makes a good point in terms of the details of watch establishment14:12
rogpeppeso presumably we know almost all of that state, barring operations in progress?14:12
niemeyerhazmat: and I don't have a good answer for him14:12
hazmatniemeyer, that's why i was going with a stop/reconnect/start for both error types as a simple mechanism14:12
hazmatfor now14:13
* hazmat does a test to verify watch behavior14:13
niemeyerhazmat: Yeah, but the problem we have _today_ and that I don't feel safe doing that is that we don't have good-but-stay-alive semantics in the code base14:13
niemeyererm..14:13
niemeyergood stop-but-stay-alive14:13
rogpeppei *think* that the most important case is automatic retries of idempotent operations.14:14
hazmatniemeyer, we do in the unit agents as a consequence of doing upgrades, we pause everything for it14:14
rogpeppebut that's hard too.14:14
niemeyerhazmat: I seriously doubt that this will e.g. kill old watches14:14
hazmatniemeyer, effectively the only thing that's not observation driven is that the provider agent does some polling for runaway instances14:15
hazmatniemeyer, it won't kill old watches, but we can close the handle explicitly14:15
niemeyerhazmat: and what happens to all the deferreds?14:16
hazmatniemeyer, they're dead, when the session is closed14:16
hazmatat least for watches14:16
niemeyerhazmat: What means dead?  Dead as in, they'll continue in memory, hanging?14:17
hazmatniemeyer, yeah... they're effectively dead, we can do things to clean them up if that's problematic14:17
hazmatdead in memory14:17
niemeyerhazmat: Yeah.. so if we have something like "yield exists_watch", that's dead too..14:18
hazmatwe can track open watches like gozk and kill them explicitly (errback disconnect)14:18
niemeyerhazmat: That's far from a clean termination14:18
hazmatniemeyer, we can transition those to exceptions14:18
niemeyerhazmat: Sure, we can do everything we're talking about above.. the point is that it's not trivial14:18
hazmatit seems straightforward at the conn level14:19
hazmatto track watches, and on close kill them14:19
niemeyerhazmat: Heh.. it's straightforward to close() the connection, of course14:19
niemeyerhazmat: It's not straightforward to ensure that doing this will yield a predictable behavior14:19
hazmatso back to process suicide ;-)14:20
niemeyerhazmat: Cinelerra FTW!14:20
rogpeppethis is all talking about the situation when you need to explicitly restart a session, right?14:20
hazmatrogpeppe, yes14:20
niemeyerrogpeppe: Yeah, control over fault scenarios in general14:21
hazmatrestart/open a new session14:21
rogpepperestart is different, i thought14:21
rogpeppebecause the library can do it behind the scenes14:21
rogpeppeand reinstate watches14:21
rogpepperedo idempotent ops, etc14:21
hazmatrogpeppe, but it can't reattach the watches to all extant users?14:21
rogpeppei don't see why not14:22
hazmatperhaps in go that's possible with channels and the channel bookkeeping14:22
hazmatagainst the watches14:22
niemeyerhazmat, rogpeppe: No, that doesn't work in any case14:22
rogpeppe?14:22
niemeyerThe window between the watch being dead and the watch being alive again is lost14:22
=== rogpeppe is now known as rog
rogof course14:23
rogdoh14:23
rogexcept...14:23
rogthat the client *could* keep track of the last-returned state14:23
rogand check the result when the new result arrives14:23
rogand trigger the watcher itself if it's changed14:24
niemeyerrog: Yeah, we could try to implement the watch in the client side, but that's what I was talking above14:25
rogexcept... i don't know if {remove child; add child with same name} is legitimately a no-op14:25
niemeyerrog: We're going very far to avoid a situation that is in fact unavoidable14:25
niemeyerrog: Instead of doing that, I suggest we handle the unavoidable situation in all cases14:25
rogforce all clients to deal with any session termination as if it might be unrecoverable?14:26
niemeyerrog: Yeah14:27
niemeyerrog: Any client disconnection in fact14:27
niemeyerrog: let's also remove the hack we have in the code and allow watches to notice temporary disconnections14:27
rogthis is why proper databases have transactions14:28
niemeyerrog: Uh..14:28
niemeyerrog: That was a shoot in the sky :-)14:28
rogif the connection dies half way through modifying some complex state, then when retrying, you've got to figure out how far you previously got, then redo from there.14:29
niemeyerrog: We have exactly the same thing with zk.14:29
niemeyerrog: The difference is that unlike a database we're using this for coordination14:30
niemeyerrog: Which means we have live code waiting for state to change14:30
niemeyerrog: A database client that had to wait for state to change would face the same issues14:30
hazmatrog, databases still have the same issue14:31
rogyeah, i guess14:31
rogwhat should a watcher do when it sees a temporary disconnection?14:32
rogawait reconnection and watch again, i suppose14:33
hazmatso watches don't fire if the event happens while disconnected14:33
rogi wonder if the watch should terminate even on temporary disconnection.14:34
niemeyerrog: It should error out and stop whatever is being done, recovering the surrounding state if it makes sense14:34
niemeyerrog: Right, exactly14:34
rogand is that true of the Dial session events too? the session terminates after the first non-ok event?14:36
rogi think that makes sense.14:37
rog(and it also makes use of Redial more ubiquitous). [of course i'm speaking from a gozk perspective here, as i'm not familiar with the py zk lib]14:37
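A sketch of the watcher pattern converging here: treat any watch error as terminal for that watch, wait for reconnection, re-read the state, and watch again, so changes that happened while the watch was dead are not missed. The conn type below is a hypothetical stand-in for a gozk-like client, not its actual API.

    package main

    import (
        "fmt"
        "time"
    )

    // conn is a minimal stand-in for a gozk-like client: getW returns the
    // node's current data plus a one-shot channel reporting the next change
    // (nil) or watch failure (error); waitConnected blocks until reconnected.
    type conn struct {
        getW          func(path string) (string, <-chan error, error)
        waitConnected func()
    }

    // watchLoop observes a node "forever".  Each iteration re-reads the state
    // before waiting, which covers events lost while the watch was dead.
    func watchLoop(c conn, path string, changed func(string)) {
        for {
            data, watch, err := c.getW(path)
            if err != nil {
                c.waitConnected() // watch could not be set: wait, then retry
                continue
            }
            changed(data) // deliver current state, then wait for the next event
            if err := <-watch; err != nil {
                c.waitConnected()
            }
        }
    }

    func main() {
        events := make(chan error)
        c := conn{
            getW:          func(string) (string, <-chan error, error) { return "value", events, nil },
            waitConnected: func() {},
        }
        go watchLoop(c, "/topology", func(d string) { fmt.Println("saw", d) })
        events <- nil                     // a change fires the watch once
        time.Sleep(50 * time.Millisecond) // let the loop re-read and report again
    }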
hazmatinteresting, i get a session expired event.. just wrote a unit test for watch fire while disconnected, two server cluster, two clients one connected to each, one client sets a watch, shutdown its server, delete on the other client/server, resurrect the shutdown server with its client waiting on the watch, gets a session expired event14:38
hazmathmm. its timing dependent though14:38
niemeyerrog: Yeah, I think so too14:39
hazmatyeah.. this needs more thought14:39
niemeyerhazmat: Yeah, the more we talk, the more I'm convinced we should assume nothing from a broken connection14:39
hazmatniemeyer, indeed14:40
niemeyerhazmat: This kind of positioning also has a non-obvious advantage.. it enables us to more easily transition to doozerd at some point14:40
niemeyerPerhaps not as a coincidence, it has no concept of sessions14:40
* niemeyer looks at Aram14:40
hazmatniemeyer, interesting.. i thought you gave up on doozerd14:40
hazmatupstream seems to be dead afaik14:40
niemeyerhazmat: I have secret plans!14:40
niemeyer;-)14:40
hazmatniemeyer, cool, when i mentioned it before you seemed down on it14:41
hazmatit would be nice for an arm env to go java-less14:41
niemeyerhazmat: Yeah, because it sucks on several aspects right now14:41
hazmatrog, on ReDial does gozk reuse a handle?14:41
niemeyerhazmat: But what if we.. hmmm.. provided incentives for the situation to change? :-)14:41
hazmatniemeyer, yeah.. persistence and error handling there don't seem well known14:41
hazmatniemeyer, indeed, things can change14:42
roghazmat: no, i don't think so, but i don't think it needs to.14:43
hazmatthat's one important difference between gozk/txzk.. the pyzk doesn't expose reconnecting with the same handle, which toasts extant watches (associated to the handle) when trying to reconnect to the same session explicitly14:43
rog(paste coming up)14:43
hazmatlibzk in the background will do it, but if you want to change the server explicitly at the app/client level its a hoser14:43
rogif you get clients to explicitly negotiate with the central dialler when the connection is re-made, i think it can work.14:46
rogi.e. get error indicating that server is down, ask central thread for new connection.14:47
rogstore that connection where you need to.14:48
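A small sketch of the "central dialler" rog is proposing: one owner holds the current connection, clients that hit a server-down error ask it for the replacement and carry on. The conn and dialler types are placeholders for illustration, not anything in gozk or juju.

    package main

    import (
        "fmt"
        "sync"
    )

    // conn is a placeholder for a gozk-style connection value.
    type conn struct{ id int }

    // dialler is the "central thread": it owns the current connection and
    // hands it out to clients that encounter an error.
    type dialler struct {
        mu      sync.Mutex
        current *conn
    }

    // get returns the connection clients should use right now.
    func (d *dialler) get() *conn {
        d.mu.Lock()
        defer d.mu.Unlock()
        return d.current
    }

    // redial would block until a new session is established, then publish it;
    // here it just swaps in a new placeholder value.
    func (d *dialler) redial() *conn {
        d.mu.Lock()
        defer d.mu.Unlock()
        d.current = &conn{id: d.current.id + 1}
        return d.current
    }

    func main() {
        d := &dialler{current: &conn{id: 1}}
        c := d.get()
        fmt.Println("using connection", c.id)
        // ... on a server-down error, ask the dialler for the replacement:
        c = d.redial()
        fmt.Println("carrying on with connection", c.id)
    }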
* niemeyer loves the idea of keeping it simple and not having to do any of that :)14:49
rogyeah, me too.14:51
rogbut we can't. at least i think that's the conclusion we've come to, right?14:51
niemeyerrog: Hmm, not my understanding at least14:52
niemeyerrog: It's precisely the opposite.. we _have_ to do that anyway14:52
niemeyerrog: Because no matter how hard we try, the connection can break for real, and that situation has to be handled properly14:52
niemeyerrog: So I'd rather focus on that scenario all the time, and forget about the fact we even have sessions14:53
rogniemeyer: so you're saying that we have to lose all client state when there's a reconnection?14:54
niemeyerrog: Yes, I'm saying we have to tolerate that no matter what14:54
rogso the fact that zk has returned ok when we've created a node, we have to act as if that node might not have been created?14:55
rogs/the fact/even if/14:55
niemeyerrog: If zk returned ok, there's no disconnection14:56
rogniemeyer: if it returned ok, and the next create returns an error; that's the scenario i'm thinking of14:56
rogthat's the situation where i think the node creator could wait for redial and then carry on from where it was14:57
niemeyerrog: If it returned an error, we have to handle it as an error and not assume that the session is alive, because it may well not be14:57
niemeyerrog: and what if the session dies?14:58
rogi'm not saying that it should assume the session is still alive14:58
rogi'm saying that when it gets an error, it could ask the central thread for the new connection - it might just get an error instead14:58
rogthe node creator is aware of the transition, but can carry on (knowingly) if appropriate14:59
niemeyerrog: and what about the several watches that are established?14:59
rogniemeyer: same applies14:59
niemeyerrog: What applies?14:59
rogthe watch will return an error; the code doing the watch can ask for a new connection and redo the watch if it wishes.15:00
hazmatniemeyer, so we're back to reinitializing the app on any connection error, disregarding recoverable15:00
rogno redoing behind the scenes, but the possibility of carrying on where we left off15:00
niemeyerrog: The state on which the watch was requested has changed15:00
niemeyerrog: Check out the existing code base15:00
hazmatniemeyer, so interestingly we can be disconnected, not know it, and miss a watch event15:01
niemeyerrog: It's not trivial to just "oh, redo it again"15:01
rogniemeyer: it doesn't matter because the watcher is re-requesting the state, so it'll see both the state and any subsequent watch event15:01
niemeyerhazmat: Yeah, that's exactly the kind of very tricky scenario that I'm concerned about15:01
rogthe watcher has to deal with the "state just changed" scenario anyway when it first requests the watch15:01
hazmatniemeyer, actually we get notification from a session event that we reconnected15:01
niemeyerhazmat: As Russ would say, I don't really want to think about whether it's correct or not15:01
niemeyerrog: No.. please look at the code base15:01
rogniemeyer: sorry, which bit are you referring to?15:02
niemeyerrog: We're saying the same thing, in fact.. you're just underestimating the fact that "just retry" is more involved than "request the new connection and do it again"15:02
niemeyerrog: juju15:02
niemeyerrog: lp:juju15:02
niemeyerrog: This concept touches the whole application15:03
rogniemeyer: i've been exploring it a bit this morning, but haven't found the crucial bits, i think. what's a good example file that would be strongly affected by this kind of thing?15:03
niemeyerhazmat: We do.. the real problem is ensuring state is as it should be when facing reconnections15:03
niemeyerrog: I'm serious.. this touches the whole app15:04
hazmatniemeyer, right we always have to reconsider state on reconnection15:04
niemeyerrog: Check out the agents15:04
rogniemeyer: ah, ok. i was looking in state15:05
rogthanks15:05
niemeyerrog: state is good too15:05
niemeyerrog: Since it's what the agents use and touches this concept too15:05
niemeyerrog, hazmat: So, my suggestion is that the first thing we do is to unhide temporary failures in gozk15:06
hazmatniemeyer, the test becomes a lot more reliable when we have multiple zks in the cluster setup for the client to connect to15:06
rogsgtm15:06
niemeyerrog, hazmat: Then, let's watch out for that kind of issue very carefully in reviews and whatnot, as we build a reliable version15:06
niemeyerhazmat: The same problem exists, though..15:07
hazmatniemeyer, indeed15:07
hazmatbut it minimizes total disconnect scenarios with multiple zks15:07
niemeyerhazmat: Even if it _immediately_ reconnects, the interim problems may have created differences that are easily translated into bugs very hard to figure out15:07
niemeyerhazmat: and again, we seriously _have_ to handle the hard-reconnect across the board15:08
hazmatniemeyer, agreed15:08
hazmatniemeyer, i'm all in favor of simplifying and treating them the same15:08
hazmatrecoverable/unrecoverable conn errors15:09
niemeyerhazmat: So no matter how much we'd love to not break the session and have a pleasant API, the hard reconnects mean we'll need good failure recovery either way15:09
niemeyerSo we can as well plan for that at all times15:09
hazmatniemeyer, its detecting the conn error that i'm concerned about atm15:09
niemeyerhazmat: My understanding is that the client always notifies about temporary issues15:09
hazmatniemeyer, based on its internal poll period to the server15:09
hazmatniemeyer, a quick disconnect masks any client detection15:10
hazmat^transient15:10
niemeyerhazmat: Really!?15:10
hazmatit seems the server will attempt to expire the client session, but i've seen once we're instead it shows a reconnect15:11
hazmats/we're/where15:11
niemeyerhazmat: I can't imagine how that'd be possible15:11
niemeyerhazmat: The client lib should hopefully notify the user that the TCP connection had to be remade15:11
hazmatniemeyer, fwiw here's the test i'm playing with (can drop into test_session.py ).. http://paste.ubuntu.com/702290/15:13
hazmatfor a package install of zk.. ZOOKEEPER_PATH=/usr/share/java15:13
hazmatfor the test runner15:13
niemeyerhazmat: Hmm15:13
niemeyerhazmat: That seems to test that watches work across reconnections15:14
niemeyerhazmat: We know they can work15:14
hazmatniemeyer, they do but we miss the delete15:14
niemeyerhazmat: Or am I missing something?15:14
niemeyerhazmat: Ah, right!15:14
hazmatwith no notice15:14
niemeyerhazmat: So yeah, it's total crack15:15
hazmatniemeyer, actually most of the time we get an expired session event in the client w/ the watch15:15
hazmatlike 99%15:15
hazmatif i connect the client to multiple servers it sees the delete15:16
hazmatw/ the watch that is15:16
niemeyerhazmat: Hmm.. interesting.. so does it keep multiple connections internally in that case, or is it redoing the connection more quickly?15:17
hazmatniemeyer, not afaick, but its been a while since i dug into that15:17
hazmatniemeyer, but as an example here's one run http://paste.ubuntu.com/702291/15:18
hazmatwhere it does get the delete event15:19
hazmatbut that's not guaranteed in all ops15:19
niemeyerhazmat: If you _don't_ create it on restart, does it get the notification?15:20
niemeyerhazmat: Just wondering if it might be joining the two events15:20
hazmatniemeyer, no it still gets the deleted event if it gets an event, else it gets session expired15:21
hazmatbut it's easy to construct it so it only sees the created event15:21
hazmatif i toss a sleep in15:21
hazmatperhaps not15:22
hazmatit seems to get the delete event or session expiration.. i need to play with this some more and do a more thought out write up15:22
hazmatin some cases it does get the created event, obviously the pastebin has that15:22
niemeyerhazmat: I see, cool15:23
niemeyerOn a minor note, filepath.Rel is in.. can remove our internal impl. now15:24
hazmatniemeyer, cool15:24
niemeyerThat was a tough one :)15:24
niemeyerfwereade: Leaving to lunch soon.. how's stuff going there?15:25
niemeyerfwereade: Can I do anything for you?15:25
niemeyerjimbaker: How's env-origin as well?15:26
jimbakerniemeyer, just need to figure out the specific text for the two scenarios you mention15:26
niemeyerjimbaker: Hmmm.. which text?15:27
jimbakerniemeyer, from apt-cache policy15:27
niemeyerjimbaker: Just copy & paste from the existing test? Do you want me to send a patch?15:27
jimbakerniemeyer, well it's close to being copy & paste, but the difference really matters here15:28
jimbakerif you have a simple patch, for sure that would be helpful15:28
niemeyerjimbaker: Sorry, I'm still not sure about what you're talking about15:28
niemeyerjimbaker: It seems completely trivial to me15:29
niemeyerjimbaker: Sure.. just a sec15:29
jimbakerniemeyer, i was not familiar with apt-cache policy before this work. obviously once familiar, it is trivial15:29
niemeyerjimbaker: I'm actually talking about the request I made in the review..15:30
niemeyerjimbaker: But since you mention it, I actually provided you with a scripted version saying exactly how it should work like 3 reviews ago15:30
jimbakerniemeyer, i'm going against http://carlo17.home.xs4all.nl/howto/debian.html#errata for a description of the output format15:31
hazmatthe python-apt bindings are pretty simple too.. i used them for the local provider.. although its not clear how you identify a repo for a given package from it15:31
jimbakerniemeyer, if you have a better resource describing apt-cache policy, i would very much appreciate it15:31
jimbakerhazmat, one advantage of such bindings is the data model15:32
hazmatjimbaker, well.. its as simple as cache = apt.Cache().. pkg = cache["juju"].. pkg.isInstalled -> bool... but it doesn't tell you if its a ppa or distro15:33
hazmatand for natty/lucid installs without the ppa thats a keyerror on cache["juju"]15:33
niemeyerjimbaker: apt-get source apt15:34
niemeyerjimbaker: That's the best resource about apt-cache you'll find15:34
jimbakerniemeyer, ok, i will read the source, thanks15:34
niemeyerjimbaker: Turns out that *** only shows for the current version, so it's even easier15:36
niemeyer         if (Pkg.CurrentVer() == V)15:36
niemeyer            cout << " *** " << V.VerStr();15:36
niemeyer         else15:36
niemeyer            cout << "     " << V.VerStr();15:36
hazmatjimbaker, ideally the detection will also notice osx and do something sane, but we can do that later15:37
niemeyerjimbaker: http://paste.ubuntu.com/702301/15:37
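For reference, a simplified sketch of the parsing just described: scan `apt-cache policy <pkg>` output for the version entry flagged with "***" (the installed version, per the apt source above) and classify the archive lines beneath it. It is written in Go purely to illustrate the algorithm; juju's actual env-origin code is Python, and the hostname-based classification below is an assumption.

    package main

    import (
        "fmt"
        "strings"
    )

    // originFromPolicy walks `apt-cache policy <pkg>` output: the "***" marker
    // flags the installed version, and the indented archive lines under it
    // (pin priority followed by a URL) say where it came from.
    func originFromPolicy(output string) string {
        inInstalled := false
        for _, line := range strings.Split(output, "\n") {
            trimmed := strings.TrimSpace(line)
            switch {
            case strings.HasPrefix(trimmed, "***"):
                inInstalled = true // following indented lines describe this version
            case inInstalled && strings.Contains(trimmed, "ppa.launchpad.net"):
                return "ppa"
            case inInstalled && strings.Contains(trimmed, "ubuntu.com/ubuntu"):
                return "distro"
            }
        }
        return "unknown" // the real code is more careful about fallbacks
    }

    func main() {
        sample := `juju:
      Installed: 0.5+bzr361-0ubuntu1
      Candidate: 0.5+bzr361-0ubuntu1
      Version table:
     *** 0.5+bzr361-0ubuntu1 0
            500 http://archive.ubuntu.com/ubuntu/ oneiric/universe i386 Packages
            100 /var/lib/dpkg/status`
        fmt.Println(originFromPolicy(sample)) // distro
    }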
hazmatmore important to have this in now for the release15:37
niemeyerhazmat: Oh yeah, please stop giving ideas! :-)15:37
jimbakerhazmat, for osx, doesn't it make more sense to just set juju-origin?15:38
niemeyerPlease, let's just get this branch fixed..15:38
hazmatjimbaker, probably does.. but the "/usr" in package path has faulty semantics with a /usr/local install on the branch as i recall15:38
niemeyerI'm stepping out for lunch15:39
fwereadepopping out for a bit, back later15:51
rogi'm off for the evening. am still thinking hard about the recovery stuff. see ya tomorrow.16:35
rogniemeyer: PS ping re merge requests :-)16:35
niemeyerrog: Awesome, sorry for the delay there16:36
niemeyerrog: Yesterday was a bit busier than expected16:37
niemeyerjimbaker: How's it there?16:43
jimbakerniemeyer, it's a nice day16:56
niemeyerjimbaker: Excellent.. that should mean env-origin is ready?16:57
jimbakerniemeyer, i still need to figure out what specifically apt-cache policy would print16:57
niemeyerjimbaker: Ok.. let's do this.. just leave this branch with me.16:58
jimbakerniemeyer, i do have the source code for what prints it, but i need to understand the model backing it16:58
niemeyerjimbaker: No need.. I'll handle it, thanks.16:58
jimbakerniemeyer, ok, that makes sense, i know you have a great deal of background from your work on synaptic, thanks!16:58
niemeyerjimbaker: That work is completely irrelevant.. the whole logic is contained in the pastebin16:59
jimbakerniemeyer, ok16:59
niemeyerjimbaker: and I pointed the exact algorithm to you16:59
_mup_juju/remove-sec-grp-do-not-ignore-exception r381 committed by jim.baker@canonical.com17:39
_mup_Simplified remove_security_group per review point17:39
_mup_juju/remove-sec-grp-do-not-ignore-exception r382 committed by jim.baker@canonical.com17:40
_mup_Merged trunk17:40
niemeyerhazmat: Do you have time for a quick review on top of env-origin?18:38
niemeyerhazmat: http://paste.ubuntu.com/702373/18:38
niemeyerhazmat: It's pretty much just that function I've shown you a while ago plus minor test tweaks18:38
hazmatniemeyer, checking18:38
niemeyerhazmat: The test tweaks just try a bit harder to break the logic18:38
niemeyerhazmat: Hmm.. I'll also add an extra test with broken input, to ensure that's working18:39
hazmatniemeyer, what's the file on disk its parsing?18:40
niemeyerhazmat: output of apt-cache policy juju18:40
hazmator is that just apt-cache policy pkg?18:40
niemeyerhazmat: http://paste.ubuntu.com/702301/18:41
niemeyerhazmat: Yeah18:41
niemeyerhazmat: That last paste has the logic generating the output18:41
niemeyerhazmat: Hmmm.. I'll also do an extra safety check there, actually18:42
niemeyerhazmat: It's assuming that any unknown output will fallback to branch.. that sounds dangerous18:43
niemeyerhazmat: I'll tweak it so it only falls back to branch in known inputs18:43
niemeyerhazmat: http://paste.ubuntu.com/702381/18:47
hazmatniemeyer, why is it returning a tuple if it only cares about the line from the line generator18:48
hazmatniemeyer, in general it looks fine to me, there's two pieces in the branch that i have minor concern about18:48
niemeyerhazmat: Keep reading :)18:48
hazmatah. first indent18:49
niemeyerhazmat: It actually cares about the indent as well18:49
niemeyerhazmat: It's how we detect we've left a given version entry18:49
niemeyerhazmat: What's the other bit you're worried about?18:51
hazmatniemeyer, basically how does it break on osx if apt-cache isn't found.. and the notion that if not juju.__name__.startswith("/usr") means unconditionally a package...if i check juju out and do a setup.py install it's still a source install.. hmm.. i guess that works with the apt-cache check on installed.. so it looks like just what happens if not on ubuntu.. pick a sane default18:52
hazmatif apt-cache isn't there this will raise an exception it looks like18:52
niemeyerhazmat: I'll take care of that18:52
hazmatniemeyer, +1 then18:53
niemeyerhazmat: What should we default to?18:53
* niemeyer thinks18:53
niemeyerdistro, I guess18:53
hazmatniemeyer, distro seems sane18:53
niemeyerCool18:53
lamalexis juju useful for deploying services like mongodb on my local dev machine?19:01
niemeyerhazmat: http://paste.ubuntu.com/702393/19:17
niemeyerlamalex: It is indeed19:17
niemeyerlamalex: We've just landed support for that, so we're still polishing it a bit, but that's already in and is definitely something we care about19:17
lamalexniemeyer, awesome!19:20
hazmat niemeyer +119:27
niemeyerhazmat: Woot, there we go19:27
_mup_juju/env-origin r381 committed by gustavo@niemeyer.net19:29
_mup_- Implementation redone completely.19:29
_mup_- Do not crash on missing apt-cache.19:29
_mup_- Exported and tested get_default_origin.19:29
_mup_- Tests tweaked to explore edge cases.19:29
hazmatniemeyer, should i be waiting on a second review for local-origin-passthrough or can i go ahead and merge?19:45
hazmatbcsaller, if you have a moment and could look at local-origin-passthrough that would be awesome19:45
bcsallerI'll do it now19:46
hazmatbcsaller, awesome, thanks19:46
hazmatbcsaller, i had one fix that i accidentally pushed down to unit-cloud-cli, but regarding the network setup in the chroot: the way it was working before (modifying resolvconf/*/base) wasn't going to work since that's not processed for a chroot, so i ended up directly inserting dnsmasq into the output resolvconf/run/resolv.conf to ensure it's active for the chroot19:47
bcsallerhazmat: why did it need to be active for the chroot?19:48
hazmatbcsaller, because we install packages and software from there19:48
hazmatbcsaller, most of the packages end up being cached, which caused some false starts, but doing it with juju-origin resurfaced the issue, since it had to talk to lp to resolve the branch19:48
bcsalleryeah... just put that together. We might be better off with a 1 time job for juju-create19:48
bcsallerupstart job I mean19:49
hazmatbcsaller, it is still a one time job, and dnsmasq is the correct resolver, i just changed it to be the active one during the chroot19:49
niemeyerhazmat: Hmm19:49
bcsallerk19:49
_mup_juju/trunk r382 committed by gustavo@niemeyer.net19:49
_mup_Merged env-origin branch [a=jimbaker,niemeyer] [r=hazmat,niemeyer]19:49
_mup_This introduces a juju-origin option that may be set to "ppa",19:49
_mup_"distro", or to a bzr branch URL.  The new logic will also attempt19:49
_mup_to find out the origin being used to run the local code and will19:49
_mup_set it automatically if unset.19:49
niemeyerMay be worth testing it against the tweaked env-origin19:49
hazmat/etc/resolv.conf symlinks to /etc/resolvconf/run/resolv.conf .. it's only on startup that it gets regen'd for the container via dhcp to be the dnsmasq..19:49
bcsallerhazmat: thats why I was suggesting that it could happen in startup on the first run in a real lxc and not a chroot19:50
hazmatniemeyer, good point.. i think i ended up calling get_default_origin to get a sane default for local provider to pass through19:50
bcsallerbut the change you made should be fine19:50
niemeyerhazmat: Yeah.. I've exported it and tested it19:50
niemeyerhazmat: So it'll be easy to do that19:51
hazmatcool19:51
niemeyerhazmat: Note that the interface has changed, though19:51
hazmatnoted, i'll do an end to end test19:51
niemeyerhazmat: It returns a tuple of two elements in the same format as parse_juju_origin19:51
niemeyerhazmat, bcsaller: I've just unstuck the wtf too.. it was frozen on a "bzr update" of lp:juju for some unknown reason19:52
niemeyerWe should have some input about the last 3 revisions merged soonish19:52
niemeyerI'm going outside for some exercising.. back later19:55
hazmatniemeyer, cheers20:00
niemeyerWoot! 379 is good.. 3 to go20:00
niemeyerAlright, actually leaving now.. laters!20:01
hazmatinteresting.. Apache Ambari20:06
SpamapShey is there a tutorial for using the local provider?20:10
SpamapShmmmm20:13
SpamapSlatest trunk failure on PPA build20:13
SpamapShttps://launchpadlibrarian.net/81932606/buildlog_ubuntu-natty-i386.juju_0.5%2Bbzr378-1juju1~natty1_FAILEDTOBUILD.txt.gz20:13
hazmatSpamapS, not yet20:16
hazmati'll put together some provider docs after i get these last bits merged20:16
hazmatSpamapS, haven't seen those failures before20:18
hazmatthey work for me disconnected on trunk20:19
hazmatSpamapS, is the s3 url endpoint being patched for the packaging?20:20
hazmati don't see how else that test could fail, perhaps bucket dns names20:21
_mup_Bug #867877 was filed: revision in charm's metadata.yaml is inconvenient <juju:New> < https://launchpad.net/bugs/867877 >21:04
_mup_juju/trunk-merge r343 committed by kapil.thangavelu@canonical.com21:20
_mup_trunk merge21:20
_mup_juju/local-origin-passthrough r418 committed by kapil.thangavelu@canonical.com21:22
_mup_merge pipeline, resolve conflict21:22
fwereadethat's it for me, nn all21:35
_mup_juju/trunk r383 committed by kapil.thangavelu@canonical.com21:39
_mup_merge unit-relation-with-address [r=niemeyer][f=861225]21:39
_mup_Unit relations are now prepopulated with the unit's private address21:39
_mup_under the key 'private-address'. This obviates the need for units to21:39
_mup_manually set ip addresses on their relations to be connected to by the21:39
_mup_remote side.21:39
hazmatfwereade, cheers21:39
* niemeyer waves21:44
niemeyerWoot.. lots of green on wtf21:45
niemeyerhazmat: re. local-origin-passthrough, once you're happy with it would you mind to do a run on EC2 just to make sure things are happy there?22:01
hazmatniemeyer, sure, just in progress on that22:01
niemeyerhazmat: Cheers!22:01
=== xzilla_ is now known as xzilla
jimbakeralthough local-origin-passthrough doesn't work for me, hazmat believes he has a fix for it in the unit-info-cli branch22:12
hazmatjimbaker, did you try it out?22:13
jimbakerhazmat, unit-info-cli has not yet come up22:13
hazmatjimbaker, and to be clear that's not  regarding ec222:13
jimbakerhazmat, of course not, it's local :)22:13
hazmatjimbaker, pls pastebin the data-dir/units/master-customize.log22:14
hazmatjimbaker, it would also be good to know if you have unit agents running or not22:14
hazmatjimbaker, are you running oneiric containers?22:14
hazmatjimbaker, yeah.. figuring out when it's done basically needs to parse ps output22:14
hazmator check status22:14
jimbakerhazmat, unfortunately this is hitting a wall of time for me - need to take kids to get their shots momentarily22:14
hazmatbut incrementally its easier to look at ps output22:15
jimbakerhazmat, makes sense. i was just taking a look at juju status22:15
hazmatjimbaker, k, i'll be around latter22:15
jimbakerok, i will paste when i get back22:15
hazmatjimbaker, juju status won't help if there's an error, looking at ps output shows the container creation and juju-create customization, all the output of customize goes to the customize log22:16
* hazmat wonders if the cobbler api exposes available classes22:21
hazmatah.. get_mgmtclasses22:24
_mup_juju/local-origin-passthrough r419 committed by kapil.thangavelu@canonical.com23:33
_mup_incorporate non interactive apt suggestions, pull up indentation and resolv.conf fixes from the pipeline23:33
_mup_juju/trunk r384 committed by kapil.thangavelu@canonical.com23:37
_mup_merge local-origin-passthrough [r=niemeyer][f=861225]23:37
_mup_local provider respects juju-origin settings. Allows for using23:37
_mup_a published branch when deploying locally.23:37
hazmatwhoops, forgot the reviewers23:37
_mup_juju/unit-info-cli r426 committed by kapil.thangavelu@canonical.com23:38
_mup_merge local-origin-passthrough & resolve conflict23:38
_mup_juju/unit-info-cli r427 committed by kapil.thangavelu@canonical.com23:54
_mup_fix double typo pointed out by review23:54
