TheMue | re | 08:25 |
---|---|---|
niemeyer | Good morning all! | 09:12 |
TheMue | niemeyer: hiya | 09:18 |
fwereade__ | heya niemeyer, you're on early :) | 09:30 |
niemeyer | fwereade__: Yeah, a bit :) | 09:30 |
* rog finally makes his network connection work again | 09:46 | |
niemeyer | rog: Welcome back | 09:47 |
rog | niemeyer: hi! | 09:48 |
rog | niemeyer: you're up earlier than usual... | 09:49 |
TheMue | rog: hiya | 09:49 |
rog | TheMue: morning! | 09:49 |
niemeyer | rog: Yeah, a bit | 09:50 |
rog | niemeyer: any chance you could have another look at those CLs? i'm hoping to get something pushed this week. | 09:50 |
niemeyer | rog: Not right now | 09:52 |
rog | niemeyer: k | 09:52 |
fwereade__ | niemeyer, btw, there was a conversation the other day I think you missed about gozk panicing in C when a Conn is closed (and something else is busy in C code) | 09:55 |
niemeyer | fwereade__: Ugh, that's not good | 09:55 |
fwereade__ | niemeyer, would fixing this just be a matter of a RWMutex on handle accesses, or is it more subtle? | 09:55 |
niemeyer | fwereade__: Well.. I'd need some more information about the crash | 09:55 |
fwereade__ | niemeyer, I've seen it in a couple of cases, but the common feature is that some goroutine is always somewhere like: launchpad.net/gozk/zookeeper._C2func_zoo_wexists(0x36edf80, 0x36ef380) | 09:57 |
niemeyer | fwereade__: Ok, do you have a full traceback? | 09:57 |
fwereade__ | niemeyer, here's the one I just saw | 09:58 |
rog | niemeyer: i had a brief look at this - it just happens when zk derefs the nil handle, because handle access isn't mutexed, as fwereade__ suggests | 09:58 |
fwereade__ | niemeyer, http://paste.ubuntu.com/855178/ | 09:58 |
fwereade__ | niemeyer, the set line shouldn't be the problem, because that conn is not closed | 09:59 |
niemeyer | fwereade__: Is this looping while you're calling Close on the zk Conn? | 10:01 |
fwereade__ | niemeyer, yes | 10:01 |
niemeyer | fwereade__: That's going back to the point we talked about | 10:02 |
niemeyer | fwereade__: We shouldn't do that | 10:02 |
niemeyer | fwereade__: We shouldn't ever close a connection behind the back of logic that we're managing ourselves | 10:02 |
fwereade__ | niemeyer, ok, maybe I'm missing something | 10:03 |
niemeyer | fwereade__: That's true for tests, and also for real logic | 10:03 |
niemeyer | fwereade__: We can even implement locking to prevent the nil pointer reference at some point, but that's not the most critical issue | 10:04 |
niemeyer | fwereade__: Ok, but do you recall that conversation? | 10:04 |
fwereade__ | niemeyer, ok, in that case I guess the tests I was trying to write are not actually relevant, because I'm really trying to test deal-with-session-event handling that doesn't exist yet | 10:04 |
niemeyer | Feb 22 12:09:11 <niemeyer> fwereade_: We should never, ever, ever, leave background logic unattended while we're messing around with the system state in disruptive ways | 10:05 |
niemeyer | fwereade__: Hmm, can you expand on that? | 10:05 |
fwereade__ | niemeyer, yes, I do; the reason I saw this one just now is because all I *thought* I had active on that conn was a watch | 10:05 |
fwereade__ | niemeyer, I was trying to verify that the watch channel got closed when the underlying ZK connection got yanked out from underneath | 10:06 |
niemeyer | fwereade__: Ok, that's an interesting case.. maybe we have something to fix then | 10:06 |
niemeyer | fwereade__: Can you please paste waitFor? | 10:06 |
fwereade__ | niemeyer, http://paste.ubuntu.com/855189/ | 10:07 |
niemeyer | fwereade__: Sorry, I thought that was where the loop was.. TestDisconnectAliveWatch then, I guess | 10:09 |
niemeyer | Ah, this is the guy blowing up: launchpad.net/gozk/zookeeper._C2func_zoo_set2(0x0, 0x36ef360) | 10:10 |
fwereade__ | niemeyer, yep | 10:10 |
niemeyer | fwereade__: Please paste the test and (*Pinger).run | 10:10 |
fwereade__ | niemeyer, before I paste this: I know the stuff in the goroutine comment is wrong :p | 10:10 |
fwereade__ | niemeyer, | 10:10 |
fwereade__ | http://paste.ubuntu.com/855193/ | 10:10 |
niemeyer | fwereade__: Cool :) | 10:11 |
fwereade__ | niemeyer, http://paste.ubuntu.com/855195/ | 10:11 |
fwereade__ | niemeyer, wait, you're right, it is set and not exists, bugger, I'm at least partially on crack (in a way that I was not hitherto aware of) | 10:13 |
niemeyer | fwereade__: You know it's wrong in which way? | 10:13 |
fwereade__ | niemeyer, "this should be equivalent" | 10:13 |
fwereade__ | niemeyer, ie, what we should actually be doing is *not* just closing the conn until we've dealt with everything | 10:16 |
fwereade__ | niemeyer, but if it is indeed the set that's blowing up, and it looks like it is, then I'm completely confused because that's using a different conn to the one that gets closed | 10:17 |
niemeyer | fwereade__: Not really | 10:17 |
niemeyer | fwereade__: Oh, wait.. yes | 10:17 |
fwereade__ | niemeyer, so I would appear to be crackful in an entirely fascinating and original way :/ | 10:18 |
niemeyer | fwereade__: But! | 10:18 |
niemeyer | fwereade__: Please paste Close | 10:18 |
niemeyer | fwereade__: I bet you have a race | 10:18 |
fwereade__ | niemeyer, oh balls, I try to hit the target node again | 10:18 |
fwereade__ | niemeyer, (to give clients as long a time as possible before they see a timeout) | 10:19 |
fwereade__ | niemeyer, (naively trusting that if the connection is borked it'll just error and I can ignore it) | 10:19 |
niemeyer | fwereade__: That's fine | 10:19 |
niemeyer | fwereade__: But a race is not fine.. please paste Close | 10:20 |
TheMue | Interesting, http://cocode.io/beta/ will provide collaborative coding. It's written in Go and knows Go as a highlighted language. Sadly it's GitHub-only so far, no Bazaar (or others). | 10:21 |
niemeyer | <niemeyer> fwereade_: That's fine | 10:22 |
niemeyer | <niemeyer> fwereade_: But a race is not fine.. please paste Close | 10:22 |
niemeyer | fwereade_, fwereade__: Are you having wifi issues too? | 10:23 |
niemeyer | fwereade_, fwereade__, fwereade: Are you having wifi issues too? :-) | 10:23 |
fwereade | niemeyer, seems so :p | 10:23 |
fwereade | niemeyer, http://paste.ubuntu.com/855206/ | 10:24 |
fwereade | niemeyer, if you didn't see above | 10:24 |
niemeyer | fwereade_, fwereade__, fwereade: I always knew you were not a single person.. | 10:24 |
niemeyer | fwereade: Yeah, you have a race indeed | 10:24 |
niemeyer | fwereade: Can you please have a look at this post: http://blog.labix.org/2011/10/09/death-of-goroutines-under-control | 10:25 |
fwereade | niemeyer, I'm a globally distributed collection of outsourced drones, but don't tell anyone | 10:25 |
niemeyer | fwereade: I'll highlight a few things afterwards, if they're not obvious | 10:25 |
fwereade | niemeyer, cool, thanks | 10:25 |
* fwereade reads | 10:25 | |
rog | niemeyer: where's the race? | 10:27 |
fwereade | niemeyer, I remember reading that when you wrote it, and totally forgot about it until now | 10:28 |
fwereade | niemeyer, it seems clear, but I'd appreciate the highlighting all the same | 10:28 |
rog | oh yeah, in TestDisconnectAliveWatch. i was looking in the pinger methods | 10:31 |
fwereade_ | grar | 10:36 |
niemeyer | fwereade_: Are you here? :) | 10:37 |
fwereade_ | niemeyer, er, probably | 10:38 |
niemeyer | fwereade_: Ok :) | 10:38 |
niemeyer | fwereade_: So, quickly! :-) | 10:38 |
niemeyer | fwereade_: Observe the way Stop is implemented in the blog post | 10:38 |
niemeyer | fwereade_: It sends a fatal error, and *waits* | 10:39 |
TheMue | niemeyer: next round of https://codereview.appspot.com/5671055/ | 10:39 |
niemeyer | TheMue: Thanks, I'll have a look at that next to rog's | 10:39 |
TheMue | niemeyer: yep, thx | 10:39 |
niemeyer | fwereade_: You're sending a stop signal, and ignoring the fact that, naturally, stopping isn't instantaneous | 10:40 |
fwereade_ | niemeyer, yeah; I had a "closed" channel I was waiting on originally, but it was pointed out that since closing was unbuffered I wasn't getting any benefit from waiting on closed after sending to closing | 10:40 |
niemeyer | fwereade_: Uh.. this is bogus | 10:41 |
niemeyer | fwereade_: Oh, or maybe not.. let me see again | 10:41 |
rog | niemeyer: it looked ok to me :-) | 10:41 |
fwereade_ | niemeyer, I'm not saying there weren't subtle bugs in what I was doing, or in what I am doing ;) | 10:41 |
rog | niemeyer: i *hope* it's not bogus otherwise my understanding of channels is fundamentally flawed... | 10:42 |
niemeyer | rog: No, you're right.. as long as it's unbuffered, that's fine | 10:43 |
rog | phew | 10:43 |
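
A minimal sketch of the stop handshake just discussed, with hypothetical names standing in for the presence branch's Pinger: because the stop channel is unbuffered, the send in Stop only completes at the moment the run loop receives it, which is why a separate "closed" channel adds nothing unless run does cleanup after the receive.

```go
package presence

import "time"

// pinger is a hypothetical stand-in for the background loop under
// discussion; the real implementation lives in the presence package.
type pinger struct {
	stop chan struct{} // unbuffered: a send synchronizes with the receive
}

func newPinger() *pinger {
	p := &pinger{stop: make(chan struct{})}
	go p.run()
	return p
}

func (p *pinger) run() {
	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	for {
		select {
		case <-p.stop:
			// The sender in Stop is released at the same moment this
			// receive fires, so Stop cannot return before the loop has
			// seen the request.
			return
		case <-tick.C:
			// ... touch the ZooKeeper presence node here ...
		}
	}
}

// Stop blocks until run has received the request. Because the channel
// is unbuffered, waiting on an extra "closed" channel buys nothing
// unless run performs cleanup *after* the receive.
func (p *pinger) Stop() {
	p.stop <- struct{}{}
}
```
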
niemeyer | fwereade_: The race is in the test itself | 10:44 |
rog | yup | 10:44 |
niemeyer | fwereade_: No, sorry.. crack again.. | 10:44 |
niemeyer | altConn is in a different end | 10:45 |
niemeyer | Hmm | 10:45 |
rog | niemeyer: isn't it a race that waitFor(c, altConn, path) is called concurrently with altConn.Close() ? | 10:46 |
rog | (not that i've looked at waitFor) | 10:46 |
niemeyer | rog: No.. the pinger is blowing up with a nil handle | 10:47 |
niemeyer | Despite anything else, there's a race here in that the handle is being cleaned up underneath its feet | 10:48 |
rog | niemeyer: that still looks like a race to me | 10:48 |
niemeyer | rog: :-) | 10:48 |
rog | niemeyer: even if it's not the thing that's blowing up | 10:48 |
niemeyer | rog: If the logic in the test is sane, the Close won't happen before the kill | 10:49 |
niemeyer | rog: if.. | 10:50 |
rog | niemeyer: oh yes, i hadn't seen connect - i was assuming that the session event was the connection event, not the teardown event. | 10:51 |
rog | but presumably connect waits for the first session event | 10:51 |
fwereade_ | rog, yes | 10:51 |
niemeyer | rog: That's what I'm assuming too | 10:51 |
rog | fwereade_: how reproducible is this? | 10:56 |
fwereade_ | rog, not very :( | 10:56 |
fwereade_ | rog, it'll happen eventually | 10:56 |
rog | fwereade_, niemeyer: AliveW is busy on altConn at the same time that altConn is closed, right? | 11:00 |
rog | so that looks like a race to me | 11:00 |
fwereade_ | rog, yes; so that Close could happen at any time, and that's definitely a problem | 11:01 |
rog | i.e. the server gets killed, the goroutine receives the session event and closes altConn, at the same time AliveW happens to do a Change | 11:01 |
fwereade_ | rog, there are similar tests that really are *only* watching when the conn gets closed, and they're fine; yes | 11:02 |
niemeyer | rog: If that's the case, then the connection is being Closed before kill() | 11:02 |
rog | niemeyer: i don't think so | 11:02 |
rog | niemeyer: the kill happens, then the close happens, then AliveW does something | 11:03 |
niemeyer | rog: AliveW is run before kill.. | 11:03 |
rog | niemeyer: AliveW has a goroutine | 11:03 |
fwereade_ | niemeyer, I guess it's not impossible that there's another session event confusing me | 11:03 |
niemeyer | rog: OH, does it? | 11:03 |
niemeyer | Why? | 11:03 |
niemeyer | To transform the event, I guess | 11:03 |
rog | niemeyer: yup | 11:03 |
niemeyer | But it should be listening only | 11:03 |
niemeyer | On a channel.. that's not supposed to blow up | 11:03 |
niemeyer | fwereade_: Can you please paste the whole file, or just push the branch? | 11:04 |
rog | yeah | 11:04 |
fwereade_ | niemeyer, sure | 11:04 |
fwereade_ | niemeyer, ~fwereade/juju/go-presence-nodes | 11:04 |
fwereade_ | niemeyer, when we're watching for aliveness we can't just dumbly watch a channel | 11:05 |
fwereade_ | niemeyer, we need to keep rewatching when we get changes before timeouts | 11:05 |
niemeyer | fwereade_: Ho ho ho | 11:05 |
fwereade_ | niemeyer, technically, above, when we're watching for *deadness* | 11:05 |
fwereade_ | niemeyer, have I misunderstood something fundamental? | 11:06 |
niemeyer | fwereade_: I'm just curious about the rewatching.. that may explain it | 11:06 |
fwereade_ | niemeyer, (also we may need to rewatch when waiting for liveness too, it's a bit of an edge case, but...) | 11:06 |
niemeyer | fwereade_: Yeah, there's a race there.. | 11:07 |
niemeyer | fwereade_: You're firing goroutines that use zk in a way that closing the connection will ignore completely | 11:07 |
fwereade_ | niemeyer, huh, I thought that closing the conn would close all the watches, and I'd see that | 11:08 |
rog | niemeyer: i don't think there's a way to do it without either having a way to Close an AliveW or making zk.Close concurrent-safe | 11:08 |
rog | niemeyer: personally, i think i'd opt for the latter | 11:08 |
rog | niemeyer: (after all, all the other zk methods are ok to call concurrently) | 11:08 |
niemeyer | rog: Yes, there may be something to change in gozk, but there's something to be handled in presence itself that I'm trying to understand | 11:10 |
rog | niemeyer: presence seems ok to me, but i'm probably missing something | 11:11 |
niemeyer | fwereade_: Why do we need awaitDead and awaitAlive to loop? | 11:11 |
fwereade_ | niemeyer, awaitDead to refresh the data watch until I get a timeout without the data watch having fired | 11:12 |
niemeyer | fwereade_: Ah, right.. it's the opposite.. it fires when it doesn't get a watch | 11:12 |
niemeyer | Hmm | 11:12 |
fwereade_ | niemeyer, awaitAlive to handle the case where a known-unresponsive node is deleted -- it's still dead but not in the same way and IMO I shouldn't really alert the client to say "yeah, nothing changed" | 11:12 |
niemeyer | Well, looks like we need our Watcher type back.. | 11:14 |
rog | niemeyer: if zk Close was concurrent-safe, everything would be fine | 11:15 |
niemeyer | So that we can stop this logic in a sane way | 11:15 |
rog | niemeyer: just like closing a net.Conn | 11:15 |
rog | niemeyer: then the logic just stops as a matter of course | 11:15 |
niemeyer | rog: What do you mean by concurrency safe? | 11:16 |
rog | niemeyer: i mean that you could Close a zk conn concurrently with executing some other operation on the conn | 11:16 |
fwereade_ | niemeyer, it feels like one of the advantages of gozk is that the watches can just get auto-closed, and it would be really nice if we could follow the same convention here | 11:16 |
fwereade_ | rog, nitpick, technically it's all the other methods that aren't concurrent-safe ;) | 11:16 |
niemeyer | rog: That's not enough | 11:17 |
rog | fwereade_: they're concurrent-safe with each other... | 11:17 |
rog | niemeyer: no? | 11:17 |
niemeyer | fwereade_: Exactly | 11:17 |
niemeyer | rog: What fwereade_ said | 11:17 |
rog | niemeyer: i don't understand. wouldn't a mutex around the handle access be acceptable? | 11:17 |
fwereade_ | niemeyer, it still seems to me that a RWMutex on handle around C calls would be enough; what am I missing? | 11:18 |
rog | niemeyer: (every method would have to respect it, of course) | 11:18 |
niemeyer | rog: This is something else, not what you originally said | 11:18 |
rog | niemeyer: that's what is necessary to make it ok to call Close concurrently, AFAICS | 11:18 |
rog | niemeyer: which is what i was suggesting | 11:18 |
niemeyer | rog: You're still missing the point.. | 11:18 |
fwereade_ | niemeyer, so am I I think | 11:19 |
niemeyer | rog: It's not just about concurrency | 11:19 |
* rog usually does | 11:19 | |
niemeyer | rog: Close(); Set(foo).. BOOM! | 11:19 |
rog | niemeyer: no. | 11:19 |
rog | niemeyer: not if Set checks to see if handle is nil and then returns an error | 11:19 |
niemeyer | rog: Dude.. please make up your mind | 11:19 |
fwereade_ | niemeyer, wouldn't we just grab a read lock, check for nil, and error back cleanly? | 11:20 |
rog | fwereade_: exactly | 11:20 |
niemeyer | fwereade_: YES! | 11:20 |
niemeyer | fwereade_: We'd just do that.. that's not what was being said so far! | 11:20 |
fwereade_ | niemeyer, ha, I thought it was | 11:20 |
niemeyer | <niemeyer> rog: What do you mean by concurrency safe? | 11:20 |
niemeyer | <rog> niemeyer: i mean that you could Close a zk conn concurrently with executing some other operation on the conn | 11:20 |
rog | niemeyer: i just said "make Close concurrent-safe". that doesn't imply that Close is the only method that needs to change. | 11:20 |
niemeyer | rog: This is not about concurrency! Serial operations blow up! | 11:21 |
rog | niemeyer: well, that's a different (although related) issue. | 11:21 |
fwereade_ | niemeyer, Close might block for a while (or possibly forever if enough is going on...) but that seems preferable to unrecoverable panics in C if a hamfisted amateur like myself is trying to use the library | 11:21 |
niemeyer | rog: Yes, it's the issue we're seeing | 11:21 |
rog | niemeyer: i think we're seeing an issue because of concurrency, no? | 11:22 |
niemeyer | <niemeyer> rog: This is not about concurrency! Serial operations blow up! | 11:22 |
rog | niemeyer: is that the case in this example? i thought we were doing things in two separate goroutines | 11:22 |
niemeyer | <niemeyer> rog: This is not about concurrency! Serial operations blow up! | 11:23 |
niemeyer | rog: You can inline.. Close(); Set(); BOOM! | 11:23 |
fwereade_ | rog, niemeyer: sorry, lamb chops are hot on the table, back shortly :) | 11:23 |
rog | niemeyer: that's not what's happening in this issue though, right? | 11:24 |
niemeyer | rog: Yes, it is | 11:24 |
rog | niemeyer: in the same goroutine? | 11:24 |
niemeyer | rog: No.. the fact there are two goroutines is completely irrelevant | 11:24 |
rog | niemeyer: to my mind, it's not. if we were in a single goroutine it would be easy to avoid doing an operation after Close. | 11:25 |
rog | niemeyer: it's the concurrency issue that makes it harder | 11:25 |
niemeyer | rog: Sorry, but a "concurrency issue" is something else than what you have in mind | 11:26 |
rog | niemeyer: ok, fair enough | 11:26 |
rog | niemeyer: so... | 11:26 |
niemeyer | rog: You can put a lock across every single function in gozk, and it will still blow up | 11:26 |
rog | niemeyer: only if they don't check that handle is not nil before calling into C, no? | 11:26 |
niemeyer | rog: Yes, and that's not a concurrency issue.. this is about using a connection after it's been closed. | 11:27 |
niemeyer | rog: and make that not panic | 11:27 |
rog | niemeyer: agreed. but just doing that won't fix our problem either. | 11:27 |
rog | niemeyer: because then there would be a race. | 11:27 |
niemeyer | rog: Sure, because you can't trust the check for the handle being valid without a lock | 11:29 |
rog | niemeyer: exactly | 11:29 |
niemeyer | fwereade_: So.. | 11:37 |
rog | niemeyer: i think a RWMutex would probably do the job just fine | 11:37 |
rog | on zk.Conn, that is | 11:37 |
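
A sketch of the locking scheme being converged on, with placeholder names standing in for gozk internals: Close takes the write lock and nils the handle; every other method takes the read lock and errors out cleanly if the handle is already gone, so neither concurrent nor serial use-after-close can hand a nil handle to C.

```go
package zookeeper

import (
	"errors"
	"sync"
)

var ErrClosed = errors.New("zookeeper: connection is closed")

type cHandle struct{} // placeholder; the real thing is a C zhandle_t*

// Conn is a sketch of the wrapper under discussion, not gozk's actual
// definition.
type Conn struct {
	mu     sync.RWMutex
	handle *cHandle
}

// Close takes the write lock, so it waits for in-flight operations
// (which hold the read lock) to finish before tearing the handle
// down. As noted above, that means Close may block for a while.
func (c *Conn) Close() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.handle == nil {
		return ErrClosed
	}
	// ... call the C zookeeper_close on c.handle ...
	c.handle = nil
	return nil
}

// Set shows the pattern every other method would follow: read-lock,
// check for nil, and only then call into C. A concurrent Close cannot
// invalidate the handle mid-call, and a Set after Close returns an
// error instead of panicking in C.
func (c *Conn) Set(path, value string) error {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.handle == nil {
		return ErrClosed
	}
	// ... call the C zoo_set2 with c.handle ...
	return nil
}
```
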
niemeyer | fwereade_: You're right, we must definitely fix gozk itself so that we protect from that kind of issue | 11:38 |
niemeyer | fwereade_: I'm just still wondering if we should change the interface for that | 11:38 |
niemeyer | fwereade_: I guess it's fine to keep it running in the background, and simply error out, given that this is purely a read operation without any side effects | 11:38 |
rog | niemeyer: that was my thought. | 11:40 |
niemeyer | fwereade_: Btw, I'd prefer to not have the (s state) as a parameter to the await functions | 11:40 |
niemeyer | fwereade_: There's actually a bug in that regard in the current logic | 11:41 |
niemeyer | fwereade_: No, there isn't, sorry | 11:42 |
niemeyer | fwereade_: You're passing the parameter as a copy, which is safe | 11:42 |
niemeyer | fwereade_: Still, I don't see why the parameter is necessary | 11:42 |
rog | niemeyer: it saves passing the path and timeout parameters as separate arguments | 11:43 |
rog | niemeyer: but i'm with you | 11:44 |
niemeyer | rog: Yeah, I think the organization there needs some tweaking for clarification | 11:47 |
rog | niemeyer: i wondered if things might be clearer if there was an internal watcher type which held the params that are being passed around. | 11:51 |
niemeyer | rog: I was wondering about that too, but this makes it seem like there wouldn't be much benefit compared to what is there now: | 11:52 |
niemeyer | s, zkWatch, err = newStateW(conn, s.path) | 11:52 |
rog | niemeyer: s, err := w.getStateW() ? | 11:52 |
niemeyer | rog: The only bit that is actually constant is the path | 11:52 |
rog | yeah | 11:53 |
niemeyer | rog: Right, w is the path.. | 11:53 |
niemeyer | rog: Which renders fwereade_'s design elegant | 11:53 |
rog | it was the other methods i was really thinking about (waitAlive and waitDead) | 11:53 |
niemeyer | rog: Right.. as methods, they would have to live on a type that is simply the path | 11:54 |
niemeyer | rog: Because timeout and aliveness changes | 11:54 |
rog | niemeyer: that's fine. they can still be fields in the waiter type. it will change when they change. | 11:55 |
rog | niemeyer: (it's not used concurrently) | 11:55 |
niemeyer | rog: There's no point in having them as fields.. it's only used in that one function | 11:55 |
rog | maybe "node" might be a better name for the type | 11:55 |
niemeyer | rog: Yeah, path :) | 11:56 |
niemeyer | rog: type node string, where node is the path | 11:56 |
niemeyer | rog: That won't improve much what fwereade_ has there now, though. The design feels sound on that level | 11:57 |
* rog is thinking about it | 11:57 | |
niemeyer | rog: Well.. I guess we could have the connection on it too | 11:57 |
niemeyer | Which would make it more interesting | 11:58 |
rog | niemeyer: yeah | 11:58 |
niemeyer | type node string | 11:58 |
rog | niemeyer: and the watch channel to send to | 11:58 |
niemeyer | func (n node) readState() (alive, timeout, error) | 11:58 |
niemeyer | func (n node) readStateW() (alive, timeout, watch, error) | 11:58 |
rog | func (n node) get() error | 11:58 |
rog | sorry | 11:58 |
rog | func (n *node) get() error | 11:58 |
niemeyer | func (n node) awaitAlive | 11:58 |
niemeyer | func (n node) awaitDead | 11:58 |
rog | func (n *node) getW() error | 11:58 |
niemeyer | rog: What's get? | 11:59 |
niemeyer | rog: get error? :) | 11:59 |
rog | niemeyer: gets the current state of the node. stores it in n. | 11:59 |
rog | niemeyer: getW also gets a wait channel and stores that in n too. | 11:59 |
niemeyer | rog: That's a bad name then.. update() maybe | 11:59 |
rog | sure | 11:59 |
niemeyer | and updateW() | 12:01 |
niemeyer | rog: Yeah, that should improve things indeed, +1 | 12:01 |
rog | cool | 12:02 |
* fwereade_ reads back... | 12:02 | |
niemeyer | rog, fwereade_: So.. I'd just like to take an eagle's view on the whole issue to close it down, if that's ok | 12:02 |
niemeyer | We have one bug, and one design approach we're agreeing on that breaks a prior rule | 12:02 |
* rog feels those sharp eyes on him from far far above. | 12:02 | |
niemeyer | :) | 12:02 |
niemeyer | The bug is, as we discussed at length, the fact that closing and then using a connection is a guaranteed blow-up in gozk. We must fix that. | 12:03 |
niemeyer | The design decision is a more subtle one | 12:03 |
niemeyer | The prior goal I was personally attempting to ensure is that we never have background logic under our control running in parallel with other stuff that will break down the concurrently executing logic | 12:04 |
niemeyer | Under that design approach, we'd need a Stop() method on the watcher | 12:04 |
niemeyer | So that we can stop it.. that would avoid the gozk bug too | 12:05 |
niemeyer | But, we're agreeing to not do that, and instead fix gozk, and establish a new condition: | 12:05 |
niemeyer | If the background logic is entirely innocuous, it's fine to allow it to die a cold death | 12:05 |
rog | sounds good to me. as long as the logic *will* die, and not be left around as garbage | 12:07 |
fwereade_ | niemeyer, that sounds sensible to me personally, but then I would say that | 12:07 |
niemeyer | AliveW is one example of that.. by itself, it's doing nothing. If it blows up in the background, there are zero side effects, _as long as the logic that is using the watch does nothing_!! | 12:07 |
fwereade_ | niemeyer, this feels consistent with every other ZK watch | 12:07 |
niemeyer | It's _NOT_ ok to have someone waiting on the watch resulting from AliveW, and doing further actions like changing the FS or whatever else, despite the fact the connection has been closed | 12:08 |
rog | and the logic using the watch will just see a close on the channel, which means an error, so hopefully this kind of logic can cascade | 12:08 |
niemeyer | fwereade_: Agreed | 12:08 |
fwereade_ | niemeyer, excellent | 12:08 |
rog | fwereade_: agreed | 12:08 |
niemeyer | rog: No, that's exactly what we don't want | 12:08 |
niemeyer | rog: We'd have to reevaluate the use of the watch, if it's also in background | 12:09 |
rog | niemeyer: no? i'm imagining some other innocuous logic layered on top of AliveW | 12:09 |
niemeyer | rog: For it to be alright, it'd have to be innocuous too | 12:09 |
rog | niemeyer: yes, definitely | 12:09 |
niemeyer | rog: Cool, we're in agreement then | 12:09 |
rog | yup | 12:09 |
niemeyer | So, those are the relevant decisions | 12:10 |
niemeyer | Glad we're in agreement, phew :) | 12:10 |
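
For illustration, a hypothetical consumer following the convention just agreed: it only listens, and when the watch channel is closed underneath it, it cascades an error upward instead of taking further disruptive action.

```go
package presence

import (
	"errors"
	"log"
)

// monitor is a hypothetical consumer of the channel returned by an
// AliveW-style call. It does nothing but observe; when the channel is
// closed (connection gone), the only safe reaction is to propagate
// the failure without side effects like touching the FS.
func monitor(watch <-chan bool) error {
	for {
		alive, ok := <-watch
		if !ok {
			// Watch closed underneath us: don't touch any state,
			// just report upward so the failure can cascade.
			return errors.New("presence: watch closed, connection lost")
		}
		log.Printf("aliveness changed: %v", alive)
	}
}
```
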
niemeyer | fwereade_: In addition to that, we were nitpicking over the interface around the state type | 12:10 |
niemeyer | fwereade_: Which is not terribly important, but that might be interesting to sort out if you still have the energy | 12:11 |
fwereade_ | niemeyer, I saw that and I'm not strongly invested in what I currently have; the fields seemed to go well together, and to stop the function signatures breaking lines, and so I decided it was good :) | 12:11 |
niemeyer | fwereade_: rog was suggesting a node type, which feels nice to clarify the state handling | 12:11 |
niemeyer | fwereade_: Something like this is the status quo of the discussion: | 12:12 |
niemeyer | type node struct { conn, path, timeout, alive } | 12:12 |
niemeyer | func (node) update() error | 12:12 |
niemeyer | func (node) updateW() (watch, error) | 12:12 |
rog | *node | 12:13 |
niemeyer | Sorry, yes | 12:13 |
niemeyer | func (*node) waitAlive() | 12:13 |
niemeyer | func (*node) waitDead() | 12:13 |
niemeyer | (the last two with the necessary return types and parameters) | 12:13 |
fwereade_ | niemeyer, LGTM | 12:13 |
fwereade_ | thanks guys, productive discussion :-) | 12:13 |
niemeyer | fwereade_: waitAlive and waitDead would be implemented on top of the former two | 12:14 |
niemeyer | fwereade_: Likewise! | 12:14 |
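
Putting the agreed design together, a sketch of the node type with the signatures from above (bodies are stubs, and the exact return types are guesses; only the shape was agreed in the discussion):

```go
package presence

import (
	"time"

	"launchpad.net/gozk/zookeeper"
)

// node bundles the state that was previously threaded through the
// await functions as separate parameters.
type node struct {
	conn    *zookeeper.Conn
	path    string
	alive   bool
	timeout time.Duration
}

// update refreshes alive and timeout from the node's current content.
func (n *node) update() error {
	// ... read the node, derive n.alive and n.timeout ...
	return nil
}

// updateW does the same but also leaves a data watch in place.
func (n *node) updateW() (<-chan zookeeper.Event, error) {
	// ... as update, but via the *W variant of the read call ...
	return nil, nil
}

// waitAlive blocks until the node looks alive, re-watching on edge
// cases; implemented on top of update/updateW, as discussed.
func (n *node) waitAlive() error { /* ... */ return nil }

// waitDead blocks until a timeout passes with no change arriving on
// the data watch, refreshing the watch after each change.
func (n *node) waitDead() error { /* ... */ return nil }
```
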
rog | i keep on wondering if waitAlive and waitDead might not be better named as deadWait and liveWait respectively | 12:14 |
rog | i.e. name them after the state they start in rather than the state they're waiting for. | 12:14 |
rog | dunno | 12:14 |
niemeyer | rog: -1.. these are nouns | 12:14 |
niemeyer | we need verbs | 12:14 |
rog | yeah | 12:14 |
rog | i want to call waitDead waitDeath | 12:15 |
rog | but can't think of an appropriate name for the other one | 12:15 |
fwereade_ | rog, I considered similar but "waitBirth" wasn't right | 12:15 |
fwereade_ | rog, indeed | 12:15 |
niemeyer | waitAlive and waitDead feels great to me | 12:15 |
rog | yeah | 12:15 |
niemeyer | And on that node, I'll do some reviews.. | 12:16 |
rog | that's fine. just thought i'd mention it in case others had similar thoughts. | 12:16 |
fwereade_ | niemeyer, would you like me to take a look at gozk as well? | 12:16 |
niemeyer | Actually, let me finish answering emails first | 12:16 |
fwereade_ | niemeyer, I don't want to overload you :) | 12:16 |
rog | i could do gozk | 12:16 |
niemeyer | fwereade_: Yes, please | 12:16 |
niemeyer | or rog.. whatever works | 12:16 |
rog | or fwereade_ sure | 12:16 |
rog | i'm a bit ahead of myself anyway currently | 12:17 |
* fwereade_ bows out gracefully | 12:17 | |
rog | fwereade_: how about you refactor presence (again!) and i'll do zk. | 12:18 |
fwereade_ | rog, give me a shout if it becomes a hassle in any way and I can grab it | 12:18 |
fwereade_ | rog, but that sounds good to me | 12:18 |
* rog thinks it shouldn't be too much problem. | 12:18 |
andrewsmedina | niemeyer: hi | 12:27 |
niemeyer | andrewsmedina: Hey | 12:27 |
rog | fwereade_, niemeyer: ok, zk changes made. now for the harder bit... the testing. | 12:35 |
fwereade_ | rog, good luck :) | 12:36 |
* rog thinks of a devious way of forcing a zk request to take a long time. | 12:38 | |
andrewsmedina | niemeyer: in the last weekly version Go doesn't work with two packages in the same project | 12:44 |
andrewsmedina | niemeyer: in some of your projects you put an example.go which is a different package. | 12:46 |
fwereade_ | rog, just a thought; I have a couple of `switch event.Type`s without defaults; if I get totally unexpected events, it would probably be sensible to `close(watch)`, right? | 13:08 |
rog | fwereade_: you could probably panic | 13:08 |
rog | fwereade_: it's part of the contract that the correct event types should be returned | 13:09 |
rog | fwereade_: if they're not, then something's very broken | 13:09 |
rog | the contract with zk, that is | 13:09 |
fwereade_ | rog, in that case I think explicitly panicking may be excessively defensive | 13:09 |
rog | fwereade_: i'd prefer to see a panic rather than a silent failure in that case | 13:10 |
fwereade_ | rog, ok, just `panic(event)` then? | 13:10 |
rog | fwereade_: panic(fmt.Errorf("unexpected event %v", event)) | 13:10 |
rog | or something like that | 13:10 |
fwereade_ | rog, ok then :) | 13:11 |
rog | it's nice to see the thing that caused the panic | 13:11 |
fwereade_ | rog, sounds good | 13:12 |
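
The resulting switch might look like this minimal sketch, assuming gozk's event constants and an illustrative handler name:

```go
package presence

import (
	"fmt"

	"launchpad.net/gozk/zookeeper"
)

// handleEvent sketches the default case being suggested; the real
// watch loops live elsewhere in the presence package.
func handleEvent(event zookeeper.Event) {
	switch event.Type {
	case zookeeper.EVENT_CHANGED:
		// ... re-establish the data watch ...
	case zookeeper.EVENT_DELETED:
		// ... the node is gone; report deadness ...
	default:
		// The zk contract promises only the watched event types here;
		// anything else means something is very broken, so panic with
		// the offending event rather than fail silently.
		panic(fmt.Errorf("unexpected event %v", event))
	}
}
```
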
fwereade_ | rog, hm, I guess I can lose the forward() stuff once we have a Close-safe gozk | 13:16 |
rog | fwereade_: yeah. that'll be good. | 13:16 |
rog | fwereade_: i could push a version that you could use until i've done the tests, if you like. | 13:16 |
fwereade_ | rog, a good thing but also sort of a shame, that stuff made me very happy :) | 13:17 |
fwereade_ | rog, I guess that would be useful | 13:17 |
rog | fwereade_: it's still there for when you need it again... | 13:17 |
rog | fwereade_: in fact i think i might be using that technique in the zk tests... (my "devious way") | 13:17 |
fwereade_ | rog, indeed, the pleasure of deleting code still outweighs the pleasure I took in the code to begin with | 13:18 |
fwereade_ | rog, nice :) | 13:18 |
rog | fwereade_: i'm gonna have a forwarder and stop all incoming traffic. then i know that the reply can't come back quickly... | 13:18 |
rog | fwereade_: in fact i might steal your code directly, since it's still there... | 13:20 |
fwereade_ | rog, go for it :) | 13:20 |
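
The "devious way" might look something like this sketch: a TCP forwarder between the test's zk client and the real server that can be frozen, guaranteeing an in-flight request cannot complete quickly. All names here are hypothetical test helpers, not existing code.

```go
package zktest

import (
	"io"
	"net"
)

// freezeProxy relays a single client connection to a real ZooKeeper
// server. Requests always flow through; closing the freeze channel
// stops replies from being relayed, so a pending request hangs
// instead of erroring out.
func freezeProxy(listenAddr, serverAddr string, freeze <-chan struct{}) error {
	l, err := net.Listen("tcp", listenAddr)
	if err != nil {
		return err
	}
	defer l.Close()
	client, err := l.Accept()
	if err != nil {
		return err
	}
	server, err := net.Dial("tcp", serverAddr)
	if err != nil {
		return err
	}
	go io.Copy(server, client) // client -> server: requests flow freely
	buf := make([]byte, 4096)
	for {
		select {
		case <-freeze:
			// Stop relaying but leave both sockets open: the client's
			// in-flight request now simply takes "a long time".
			return nil
		default:
		}
		n, err := server.Read(buf)
		if err != nil {
			return err
		}
		if _, err := client.Write(buf[:n]); err != nil {
			return err
		}
	}
}
```
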
TheMue | So, solved a merging prob, 76 tests passed | 13:24 |
rog | TheMue: nice | 13:25 |
TheMue | rog: yeah, it goes on, piece by piece | 13:26 |
rog | TheMue: i bet you're glad you speeded up the tests now :-) | 13:26 |
TheMue | rog: definitely | 13:35 |
TheMue | rog: when zk is cold (first start) about 11 sec, then about 6.3 sec. | 13:36 |
rog | TheMue: how long if you run just one test? | 13:37 |
TheMue | not yet tested | 13:37 |
fwereade_ | rog, TheMue, IME live-zookeeper tests have a 5s overhead | 13:39 |
fwereade_ | (when hot) | 13:39 |
rog | so about 1.3 seconds to run the tests. that's not bad | 13:39 |
rog | TheMue: try gotest -gocheck.run patternMatchingOneTest :-) | 13:40 |
TheMue | rog: with a lot of i/o | 13:40 |
rog | oh, i think it might be gocheck.f actually | 13:40 |
fwereade_ | rog, it is | 13:40 |
rog | fwereade_: i'm going to change that soon - it should be the same as the equivalent test option | 13:41 |
rog | fwereade_: i just haven't pushed the change yet | 13:41 |
rog | well, the merge request | 13:41 |
fwereade_ | rog, fine with me; I'll discover it, swear, check the options and be happy :) | 13:41 |
TheMue | rog: a single test is 5.11 vs 6.36 for all tests | 13:45 |
rog | ain't java wonderful? | 13:45 |
TheMue | fwereade_: your guess with 5 sec seems good | 13:45 |
TheMue | rog: haha | 13:45 |
fwereade_ | niemeyer, should ExpectFailure (gocheck) possibly output something in test runs? as it is, its "don't forget this test exists" value seems a touch limited | 13:52 |
niemeyer | fwereade_: The purpose is precisely to not forget that there are broken tests without having a failure that we'll learn to ignore | 13:54 |
fwereade_ | niemeyer, agree that ignored-by-convention failures are poisonous; I just don't see how it stops us forgetting any better than renaming the methods to DontTest* would | 13:56 |
niemeyer | fwereade_: It mentions that there are expected failures at the end of the run, doesn't it? | 13:57 |
fwereade_ | niemeyer, perhaps I'm Doing It Wrong, but not afaics | 13:57 |
fwereade_ | william@diz:~/code/go/src/launchpad.net/juju/go$ go test launchpad.net/juju/go/state/presence | 13:57 |
fwereade_ | ok   launchpad.net/juju/go/state/presence   5.890s | 13:57 |
niemeyer | Hmm | 13:57 |
niemeyer | fwereade_: Run it without the package name | 13:58 |
niemeyer | fwereade_: Within the directory | 13:58 |
niemeyer | fwereade_: Or with -v | 13:58 |
niemeyer | fwereade_: go test is eating the output | 13:58 |
fwereade_ | niemeyer, no dice either way | 13:58 |
fwereade_ | niemeyer, ah, somehow run it without using "go test"? | 13:58 |
niemeyer | No | 13:59 |
niemeyer | fwereade_: There's something wrong indeed | 13:59 |
fwereade_ | niemeyer, it's not critical, shall I just open a bug? | 14:01 |
fwereade_ | niemeyer, (sorry, I feel bad distracting you all the time while the store's still in progress) | 14:02 |
niemeyer | fwereade_: No worries really | 14:02 |
niemeyer | fwereade_: I haven't even managed to get to that yet | 14:02 |
fwereade_ | niemeyer, heh, that's a large part of why I feel bad about it ;) | 14:03 |
niemeyer | fwereade_: Yes, please file a bug.. it's supposed to print it | 14:03 |
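
For reference, a minimal gocheck usage of ExpectFailure as discussed (suite and reason are hypothetical); the runner is supposed to report the expected-failure count in its summary, which is the output that appears to be missing above.

```go
package presence_test

import (
	"testing"

	. "launchpad.net/gocheck"
)

func Test(t *testing.T) { TestingT(t) }

type expectSuite struct{}

var _ = Suite(&expectSuite{})

// The failure below is recorded as expected rather than failing the
// run, keeping the broken test visible without a failure that people
// learn to ignore.
func (s *expectSuite) TestKnownBroken(c *C) {
	c.ExpectFailure("hypothetical reason: teardown handling not implemented yet")
	c.Assert(1, Equals, 2)
}
```
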
niemeyer | Hey cliff-hm | 14:25 |
niemeyer | cliff-hm: How're things going in the RedHat side of the world? | 14:25 |
cliff-hm | Pretty OK thanks. :) | 14:27 |
cliff-hm | Just trying to understand and play with Juju a bit to see how it fits itself together. | 14:28 |
niemeyer | cliff-hm: Nice, please let us know what you think | 14:28 |
cliff-hm | (I prefer not to hide away) :) | 14:29 |
cliff-hm | Sure | 14:31 |
cliff-hm | I may try to do that contest, not sure yet. | 14:32 |
cliff-hm | (In my spare, non-company time) | 14:32 |
niemeyer | cliff-hm: That'd be great | 14:38 |
niemeyer | cliff-hm: When can we expect juju to be integrated into RedHat's products? :) | 14:38 |
niemeyer | rog: Can you please sort out this branch so we can take it out of the review list: https://code.launchpad.net/~rogpeppe/juju/fix-tutorial-with-expose-2/+merge/75379 | 14:43 |
rog | niemeyer: jeeze, i thought i did that months ago :-) | 14:43 |
niemeyer | rog: and this one too: https://codereview.appspot.com/5674051/ | 14:44 |
rog | niemeyer: TheMue wanted to do something different there, so i was thinking of abandoning it | 14:45 |
niemeyer | rog: That's a fine solution, but we need to reject it so that it goes out of the review list | 14:45 |
rog | niemeyer: ok. i'll reject it. | 14:46 |
niemeyer | rog: Thanks | 14:46 |
cliff-hm | niemeyer, I don't know ;) I'm doing this on personal time ... out of my own interests. | 14:46 |
niemeyer | cliff-hm: Ah, understood. I thought it was related to you being a manager of the Spacewalk/Satellite/RHN/etc team | 14:48 |
niemeyer | cliff-hm: Please drop us a note if you have any suggestions then | 14:49 |
cliff-hm | My interests stem from that background, yes. But I'm not under company orders to spy and look at what others do :) My day job is too busy for those tasks. | 14:50 |
cliff-hm | And yeap. I will try to give feedback. Though likely to have more questions :) | 14:53 |
cliff-hm | (been reading so far; hoping to get hands-on tonight and over the weekend) | 14:53 |
niemeyer | cliff-hm: Sweet | 14:54 |
niemeyer | cliff-hm: I have good friends at RH, btw.. | 14:54 |
rog | niemeyer: both rejected - the earlier one was submitted in some other form at some point. | 14:54 |
niemeyer | cliff-hm: Probably not very close to you, though | 14:55 |
cliff-hm | we have a lot of employees now. I've met a bunch of people in my time here, but there are a lot more people I don't know than ones I do. | 14:58 |
niemeyer | cliff-hm: Can imagine.. most of the people I know are around kernel development, I believe | 14:58 |
niemeyer | Rietveld cross-changes diffing is such a huuuge aid | 15:02 |
niemeyer | cliff-hm: Do you work close to the Openshift guys? | 15:02 |
niemeyer | Suddenly rietveld is crawling | 15:09 |
niemeyer | Which makes this a good time to have lunch | 15:09 |
niemeyer | rog: I did manage to LGTM one of the pending branches, though | 15:09 |
rog | niemeyer: magic, thanks | 15:09 |
niemeyer | Will finish pending reviews when I'm back, and then will step out to do store work | 15:10 |
cliff-hm | I know several, but not close with them. Closer to the Aeolus, Katello, and Pulp project guys. | 15:13 |
rog | niemeyer, fwereade_: https://codereview.appspot.com/5694068/ | 15:18 |
rog | probably the most significant one yet. | 15:18 |
rog | fwereade_: if you wanna have a look, probably the best place to start is in util.go, which has the basic robustness primitive. | 15:19 |
rog | fwereade__: just checking: did you see my CL above? | 16:16 |
rog | fwereade__: or is your network playing up? | 16:16 |
rog | hmm, can anyone see this? | 16:17 |
niemeyer | It's funny that every time Mark starts a rant in canonical-tech, there's a bunch of people asking to subscribe to the list | 16:29 |
rog | niemeyer: i hadn't looked at that list recently | 16:41 |
fwereade__ | rog, sorry, I keep not seeing notifications | 16:49 |
rog | fwereade__: that's fine. this is what i said: | 16:50 |
fwereade__ | rog, I did, and thank you for the advice to start looking in util, would have been a little mystifying otherwise ;) | 16:50 |
rog | [15:18] <rog> niemeyer, fwereade_: https://codereview.appspot.com/5694068/ | 16:50 |
rog | [15:18] <rog> probably the most significant one yet. | 16:50 |
rog | [15:19] <rog> fwereade_: if you wanna have a look, probably the best place to start is in util.go, which has the basic robustness primitive. | 16:50 |
rog | fwereade__: np | 16:50 |
niemeyer | rog: Yeah, I'm on reviews | 16:51 |
rog | niemeyer: brill! | 16:51 |
niemeyer | Man.. why is codereview *so slow* | 16:54 |
niemeyer | rog: "there's deliberately only one securityGroup instance for each security group." | 16:54 |
niemeyer | rog: Where's that enforced? | 16:54 |
rog | niemeyer: there's only one occurrence of "&securityGroup" in the file, which is in newSecurityGroup, which is only called when group() has returned nil | 16:57 |
rog | niemeyer: actually, i think the newSecurityGroup call in NewInstances (even though it conforms to that rule) might be an anachronism | 16:58 |
niemeyer | rog: There are two different calls to newSecurityGroup | 16:58 |
rog | niemeyer: they're both called only when group() has returned nil | 16:59 |
rog | niemeyer: i.e. they never create a group with the same contents twice | 16:59 |
rog | niemeyer: but see above too | 16:59 |
rog | i think the occurrence in NewInstances might be from before i implemented CreateSecurityGroup | 17:00 |
niemeyer | rog: Still, this is a problem.. the current design is error prone, and there's a design assumption that isn't documented anywhere and is scattered across the code | 17:00 |
rog | niemeyer: if i deleted it from NewInstances (which i think i probably should) then it would not be scattered. | 17:01 |
niemeyer | rog: It would still not be documented, and it would still be error prone | 17:01 |
niemeyer | rog: All it takes is a single line: &securityGroup{}.. BOOM, bug. | 17:01 |
rog | niemeyer: so you don't like comparing pointers? | 17:02 |
rog | niemeyer: or are you asking for a comment? | 17:02 |
niemeyer | rog: I like comparing pointers.. I don't like error prone code that will break behind our back in hard to notice ways | 17:02 |
niemeyer | rog: I'm asking to make the code not error prone | 17:02 |
niemeyer | rog: I don't want to remind everybody that touches that file that they can't create security groups | 17:02 |
rog | niemeyer: if i put a comment next to the securityGroup type saying "only create this through newSecurityGroup", would that be ok? | 17:03 |
rog | niemeyer: (and have newSecurityGroup do the check) | 17:03 |
niemeyer | rog: The comment + the latter sounds good | 17:03 |
rog | niemeyer: cool, will do. | 17:04 |
niemeyer | rog: This will prevent people from making silly but reasonable mistakes, I hope | 17:04 |
niemeyer | rog: Thanks, and sorry for the trouble | 17:04 |
rog | niemeyer: np. it was worth it to find the bug in NewInstances | 17:04 |
rog | well, not precisely a bug | 17:05 |
niemeyer | rog: Please also add a short note in the permKey struct, next to the pointer definition | 17:09 |
fwereade__ | EOD, and I'm exhausted :). happy weekends all! | 17:11 |
niemeyer | rog: Nice comments, btw, thanks | 17:11 |
niemeyer | fwereade__: have a good time man :) | 17:11 |
rog | fwereade__: happy weekend to you too | 17:12 |
rog | niemeyer: something like this? | 17:15 |
rog | // permKey represents permission for a given security | 17:15 |
rog | // group or IP address (but not both) to access a given range of | 17:15 |
rog | // ports. Equality of permKeys is used in the implementation of | 17:15 |
rog | // permission sets, relying on the uniqueness of securityGroup | 17:15 |
rog | // instances. | 17:15 |
niemeyer | rog: Sounds great, thanks | 17:17 |
rog | niemeyer: cool. | 17:28 |
rog | with those changes, does it LGTY? | 17:29 |
rog | niemeyer: here's the other comment, BTW: | 17:29 |
rog | // securityGroup holds a simulated ec2 security group. | 17:29 |
rog | // Instances of securityGroup should only be created through | 17:29 |
rog | // Server.newSecurityGroup to ensure that groups can be | 17:29 |
rog | // compared by pointer value. | 17:29 |
rog | niemeyer: PTAL | 17:35 |
niemeyer | rog: // It returns nil if a security group with the given name already exists. | 17:43 |
niemeyer | rog: Feels like a trap | 17:43 |
rog | niemeyer: really? | 17:43 |
rog | niemeyer: do you want it to return an error? | 17:44 |
rog | niemeyer: i considered that, but i haven't got an error type yet | 17:44 |
rog | niemeyer: and the caller needs to call fatalf with an appropriate error code | 17:44 |
rog | niemeyer: i could just delete the newSecurityGroup function | 17:45 |
niemeyer | rog: Yeah, an error sounds good.. or make it panic internally and delegate the responsibility of checking to the call site | 17:45 |
rog | niemeyer: what about my last suggestion? | 17:46 |
rog | niemeyer: just inline it inside CreateSecurityGroup | 17:46 |
niemeyer | rog: The abstraction is fine.. just make it panic internally if the group already exists | 17:47 |
niemeyer | rog: We just want to make the error visible rather than a hidden issue | 17:47 |
rog | niemeyer: it won't be hidden for long... anything that tries to use the returned group will panic. | 17:47 |
rog | niemeyer: but ok | 17:47 |
niemeyer | rog: Not really. Returning a nil group means people could get away with setting a permKey.group to it, for instance. | 17:48 |
rog | given that it's only called in one place, i don't really see why the creation needs its own function. | 17:49 |
niemeyer | rog: Sure, inline then.. | 17:49 |
rog | something in me slightly balks at doing two linear searches of the group list... (i know that's a ridiculous concern, too!) | 17:49 |
niemeyer | rog: Keep a comment about the problem, please | 17:50 |
rog | niemeyer: will do | 17:50 |
rog | niemeyer: done. | 17:54 |
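
The pattern that came out of this review might look like the following sketch (illustrative only; the real ec2test code may differ): uniqueness is enforced in newSecurityGroup, which panics loudly on violation, making it safe for permKey to compare groups by pointer.

```go
package ec2test

import "fmt"

// securityGroup holds a simulated ec2 security group.
// Instances of securityGroup should only be created through
// Server.newSecurityGroup, to ensure that groups can be
// compared by pointer value.
type securityGroup struct {
	name string
	// ... other simulated attributes ...
}

// Server is a stand-in for the simulated EC2 server.
type Server struct {
	groups map[string]*securityGroup
}

func NewServer() *Server {
	return &Server{groups: make(map[string]*securityGroup)}
}

// newSecurityGroup panics if a group with the same name already
// exists, so a violation of the uniqueness rule is a visible error
// rather than a bug hiding behind a nil return.
func (srv *Server) newSecurityGroup(name string) *securityGroup {
	if srv.groups[name] != nil {
		panic(fmt.Errorf("security group %q already exists", name))
	}
	g := &securityGroup{name: name}
	srv.groups[name] = g
	return g
}

// permKey represents permission for a given security group or IP
// address (but not both) to access a given range of ports. Equality
// of permKeys relies on the uniqueness of securityGroup instances.
type permKey struct {
	group    *securityGroup // nil if ipAddr is set
	ipAddr   string         // "" if group is set
	fromPort int
	toPort   int
}
```
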
rog | weird | 17:56 |
rog | my irc client appears to have logged me in twice | 17:57 |
rog | ah! that must've been carmen opening my laptop | 17:57 |
rog | niemeyer, TheMue: i'm off now. have a good weekend. | 17:59 |
niemeyer | rog: LGTM.. thanks | 17:59 |
TheMue | rog: have a good weekend too | 17:59 |
niemeyer | rog: Have a good weekend | 17:59 |
rog | niemeyer: thanks for the reviews. | 17:59 |
rog | niemeyer: always nice to finish with a submit. thanks. | 18:00 |
niemeyer | rog: Indeed! :-) | 18:00 |
niemeyer | rog: and np | 18:00 |