[09:00] <niemeyer> fwereade: Morning
[09:00] <fwereade> heya niemeyer
[09:03] <niemeyer> rogpeppe: Morning
[09:05] <rogpeppe> niemeyer: hiya, up early...
[09:07] <niemeyer> rogpeppe: Yeah, lost sleep and decided to push some pending tasks forward
[09:09] <rogpeppe> niemeyer: i've been up about 2 hours myself, but went for a nice long bike ride.
[09:09] <niemeyer> rogpeppe: Oh, that sounds very nice
[09:09] <niemeyer> rogpeppe: Haven't been biking in a long while
[09:10] <niemeyer> rogpeppe: I'm a one-sport man.. squash took over.. ;-)
[09:10] <rogpeppe> niemeyer: i've been going out in the mornings 'cos i realised that moving from bedroom to desk wasn't getting me much exercise!
[09:10] <niemeyer> rogpeppe: Nice that you've handled that quickly.. indeed it's an amazing recipe for couch potatoing :)
[09:11] <rogpeppe> niemeyer: rock climbing was always my big thing, but it's harder to get out these days...
[09:11] <rogpeppe> been going out on the bike most days for 2 weeks now, and i haven't repeated a route once...
[09:29] <niemeyer> Wow
[09:30] <niemeyer> rogpeppe: Just submitted a review for the robustness branch
[09:30] <niemeyer> Will step out to have coffee with Ale
[09:30] <niemeyer> Back soonish
[09:30] <rogpeppe> niemeyer: thanks
[09:45] <fwereade> popping out for a coffee myself, might kickstart my brain
[10:05]  * niemeyer returns
[10:13] <niemeyer> TheMue: Morning
[10:13] <TheMue> niemeyer: morning, back again
[10:36] <fwereade> TheMue, btw, you mentioned a blocker on relation state; do you have a mo to fill me in on that?
[10:37] <TheMue> fwereade: niemeyer told me that there are some open discussions and i should defer it and continue with other open parts meanwhile
[10:37] <fwereade> TheMue, ah yeah, cool, that makes sense
[10:38] <niemeyer> fwereade: Subordinates are ongoing in Python.. I suggested waiting for that so that the implementation may be done once
[10:38] <TheMue> fwereade: right now i'm going through presence package. should be difficult to switch my agent code to use it.
[10:38] <fwereade> TheMue, should be? :(
[10:38] <TheMue> fwereade: oh, sorry, typo
[10:38] <fwereade> TheMue, phew :)
[10:38] <TheMue> fwereade: ahoulsn't ;)
[10:39] <fwereade> TheMue, now *that's* a typo :p
[10:39] <TheMue> fwereade: hehe
[10:39] <fwereade> TheMue, what's next on your roadmap?
[10:39] <fwereade> TheMue, just want to make sure we don't collide again ;)
[10:40] <TheMue> fwereade: no, indeed, your package looks good. maybe an integration of tomb for safe goroutine handling would make sense
[10:40] <fwereade> TheMue, the intent was to have as much as possible the same semantics as normal ZK watches
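The tomb integration TheMue suggests can be approximated with plain channels; a minimal stdlib sketch of the kill-and-wait lifecycle that the tomb package formalises (all names here are illustrative, not the actual package API):

```go
package main

import (
	"fmt"
	"sync"
)

// watcher mimics the lifecycle the tomb package provides: a dying
// channel requests shutdown, and Stop waits for the goroutine to exit
// before returning its error.
type watcher struct {
	dying chan struct{}
	wg    sync.WaitGroup
	err   error
}

func newWatcher(events <-chan string) *watcher {
	w := &watcher{dying: make(chan struct{})}
	w.wg.Add(1)
	go w.loop(events)
	return w
}

func (w *watcher) loop(events <-chan string) {
	defer w.wg.Done()
	for {
		select {
		case <-w.dying:
			return // shutdown requested, exit cleanly
		case ev := <-events:
			_ = ev // a real watcher would forward the event
		}
	}
}

// Stop plays the role of tomb's Kill(nil) followed by Wait().
func (w *watcher) Stop() error {
	close(w.dying)
	w.wg.Wait()
	return w.err
}

func main() {
	w := newWatcher(make(chan string))
	fmt.Println("stop:", w.Stop()) // stop: <nil>
}
```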
[10:41] <TheMue> fwereade: i've got three reviews in the pipe, one is the agent proposal. once that one is ready for integration i would extend the entities i already have and which shall have agents with it
[10:42] <TheMue> fwereade: beside that i've got to compare what's still open in state
[10:42] <fwereade> TheMue, cool; I'm looking into hook execution so I don't think we'll cause each other too much trouble ;)
[10:42] <TheMue> fwereade: i'll post here what i'll start next
[10:43] <TheMue> fwereade: did you take a look into my agent code?
[10:43] <fwereade> TheMue, one thing I noticed is that we don't seem to have state.Unit.OpenPort yet
[10:43] <fwereade> TheMue, I started; I have a couple of draft comments, I'll try to get them finished for you now
[10:43] <TheMue> fwereade: OpenPort() is one of the proposals
[10:44] <fwereade> TheMue, oh, fantastic, sorry I missed that one
[10:44] <TheMue> fwereade: the code is submitted for review
[10:44] <fwereade> TheMue, I'll take another look at +activereviews
[10:47] <fwereade> TheMue, hasn't go-state-continued-machine been merged already?
[10:48] <fwereade> TheMue, and go-state-gocheck-change too?
[10:49] <TheMue> fwereade: they are all missing an LGTM
[10:50] <TheMue> fwereade: the latter is also needless; it's part of the others. rog and i thought it had to be done earlier, so i made an extra branch, but the changes were so few that another branch already had them
[10:50] <fwereade> TheMue, ah cool, maybe reject that one then?
[10:51] <TheMue> fwereade: how to do that?
[10:51] <fwereade> TheMue, go to the LP merge proposal and click the status field at the top
[10:52] <TheMue> fwereade: important are the two go-state-continued-… and the go-state-first-agent-draft
[10:52] <fwereade> TheMue, cool, I'll get up to date on those then
[10:52] <fwereade> TheMue, thanks
[10:54] <TheMue> fwereade: hmm, i'm somehow lost and don't find the mentioned status field at the top
[10:55] <fwereade> TheMue, it's currently an orange "needs review"
[10:59] <TheMue> fwereade: ah, have not gone far enough in lp
[10:59] <fwereade> TheMue, ah sorry you were looking at the branch?
[11:00] <TheMue> fwereade: yep, i still have my troubles with the usability of lp
[11:00] <fwereade> TheMue, haha, LP seemed like sweetness and light to me after the rubbish we were using at my last place
[11:01] <fwereade> TheMue, the tool was *worse* than doing bug/iteration tracking with google docs spreadsheets
[11:01] <TheMue> fwereade: i've worked with JIRA. it's not as powerful as LP, but the UI is more clear
[11:01] <TheMue> fwereade: uh, sounds horrible
[11:01] <fwereade> TheMue, (I know that to be true because that was what we were using before)
[11:01]  * fwereade cries quietly in a corner
[11:02]  * TheMue dcc's fwereade a tissue
[11:02] <fwereade> :)
[11:05] <TheMue> fwereade: so, it's rejected but still with status "Development". can i now change this to "Abandoned"?
[11:05] <fwereade> TheMue, sounds good to me (I usually forget to do that myself)
[11:06] <TheMue> fwereade: ok, i like a clean workplace ;)
[11:13] <TheMue> niemeyer: oh, i just rejected go-state-gocheck-change after discussion with fwereade, those changes have already been in a former branch and are merged into state
[11:13] <TheMue> niemeyer: this overlapped, sorry
[11:15] <fwereade> TheMue, review on https://codereview.appspot.com/5690051/
[11:16] <TheMue> fwereade: thx, take a look
[11:16] <niemeyer> TheMue: That's cool, no worries
[11:23] <TheMue> fwereade: in your review you say that in today's code the only caller of RemoveMachine() cares if it doesn't exist?
[11:23] <fwereade> TheMue, seems so to me
[11:25] <TheMue> fwereade: do you know the reason behind it? from state's perspective it's ok if there's nothing to remove.
[11:26] <TheMue> fwereade: but state has only a very limited view.
[11:26] <fwereade> TheMue, tbh I'd be just as happy if it silently ignored nonexistent machines
[11:26] <fwereade> TheMue, it's just the mismatch between the state's view of what's acceptable and the single client's view
[11:27] <fwereade> TheMue, I'd rather just return an error than return a value (false) that's always turned into an error
[11:34] <TheMue> fwereade: hmm, it's just today's code ported to go
[11:35] <TheMue> fwereade: should ask niemeyer
[11:36] <fwereade> TheMue, understood; and there may be reasons for it that I've missed; but I'm not sure that we should feel too bound by python implementation details
[11:37] <TheMue> fwereade: if i got it right we shall first capture the current behaviour and later optimize it for go
[11:37] <fwereade> TheMue, my view is that that particular implementation detail is either a misprediction of future needs, or an appendix left over after the removal of some pre-existing need
[11:38] <TheMue> fwereade: maybe, you are more at home in the existing code than me.
[11:40] <fwereade> TheMue, that hadn't been my perception -- while we should maintain externally observable characteristics wherever possible, I'd prefer to strip ickiness where possible
[11:40] <fwereade> TheMue, for example, presence nodes vs ephemeral nodes
[11:41] <fwereade> niemeyer, opinion?
[11:41] <niemeyer> fwereade++
[11:42] <niemeyer> TheMue: If the only use of this method actually does what fwereade suggests, turning around and saying "Hah! Actually, that's an error!", then it feels like the API was mispredicted when first written
[11:43] <niemeyer> I didn't even know that was the case, FWIW
[11:44] <TheMue> niemeyer: yep, i do code scans to see where a meth/func is used, to see how it is intended when it's unclear to me
[11:45] <TheMue> niemeyer: in this particular case i felt no need so i ported it 1:1
[11:45] <TheMue> niemeyer: maybe scanning for each func is useful
[11:46] <rogpeppe> niemeyer: wanna talk about DNS name timeouts?
[11:47] <niemeyer> TheMue: It may be hard in some cases without further context
[11:47] <niemeyer> TheMue: Reviews help with that too
[11:47] <niemeyer> rogpeppe: Sure
[11:48] <TheMue> niemeyer: definitely
[11:48] <rogpeppe> niemeyer: i don't quite understand this: "
[11:48] <rogpeppe> There's an application running somewhere, we shutdown a
[11:48] <rogpeppe> machine, and then that app hangs for 3 minutes just because it asked for a DNS
[11:48] <rogpeppe> address?
[11:48] <rogpeppe> "
[11:48] <rogpeppe> niemeyer: what's the "application" in this context?
[11:48] <niemeyer> rogpeppe: A binary
[11:48] <niemeyer> rogpeppe: Any binary that links against this library of code
[11:49] <rogpeppe> niemeyer: ok.
[11:49] <fwereade> TheMue, reviewed https://codereview.appspot.com/5727045/
[11:49] <rogpeppe> niemeyer: i think there are two scenarios
[11:49] <rogpeppe> niemeyer: one is when you really need the address
[11:49] <rogpeppe> niemeyer: the other you don't care too much
[11:49] <TheMue> fwereade: already seen an interesting comment regarding retryTopologyChange()
[11:49] <rogpeppe> niemeyer: for instance, when we ask for the DNS address of the bootstrap machine, we really need it
[11:50] <rogpeppe> niemeyer: but, as you point out, in StateInfo we don't care too much
[11:50] <rogpeppe> niemeyer: i'm wondering about a "wait as long as necessary" bool flag on Environ.DNSName
[11:51] <rogpeppe> niemeyer: i'm trying to resist forcing clients of Environ to do their own polling.
[11:52] <rogpeppe> niemeyer: because i think the strategy might be quite different for different providers.
[11:52] <rogpeppe> niemeyer: (some providers will always have a DNS name instantly available, for example)
[11:53] <niemeyer> rogpeppe: Hmm
[11:53] <niemeyer> rogpeppe: I don't think so, to be honest
[11:54] <niemeyer> rogpeppe: The case where you say "give me a machine" to an IaaS and you have an address immediately may not be so common
[11:55] <niemeyer> rogpeppe: I'm happy to experiment with a waiting API, but let's please keep an eye on this rather than taking it as a general approach from now on
[11:55] <niemeyer> rogpeppe: I suggest we introduce WaitDNSName() returning (addr, error)
[11:55] <rogpeppe> niemeyer: immediately, perhaps not. but a 3 minute wait? many DHCP clients give an address in < 1s
[11:56] <fwereade> TheMue, I definitely don't think that's something to be addressed in this branch
[11:56] <niemeyer> rogpeppe: And keep DNSName() returning the already known address only
[11:56] <rogpeppe> niemeyer: i'm wondering if it should be the other way around.
[11:56] <fwereade> TheMue, but State "feels" like the right place for it to me
[11:56] <niemeyer> rogpeppe: The three minute wait is completely unrelated to DHCP
[11:56] <rogpeppe> niemeyer: DNSNameNoWait
[11:57] <rogpeppe> so DNSName always gives you a valid DNS name if it can
[11:57] <niemeyer> rogpeppe: This is waiting for a machine to be booted
[11:57] <niemeyer> rogpeppe: Which is why it's wrong to be holding on DNSName specifically
[11:57] <rogpeppe> niemeyer: why can't the DNS name be allocated before the machine boots?
[11:57] <niemeyer> rogpeppe: The machine is off.. the DNS name isn't special
[11:57] <rogpeppe> niemeyer: doesn't the infrastructure allocate the DNS name?
[11:57] <niemeyer> rogpeppe: It's not under our control why people do this or not
[11:57] <rogpeppe> niemeyer: not the machine itself?
[11:58] <niemeyer> rogpeppe: The fact is that the most well known IaaS doesn't do that
[11:58] <rogpeppe> niemeyer: ok. seems a bit weird to me, but i'm sure there's a good reason.
[11:58] <niemeyer> rogpeppe: Likely because they're still working out where to allocate the machine by the time it's answered
[11:58] <niemeyer> rogpeppe: But, again, there's no point for us
[11:59] <TheMue> fwereade: will comment on your review, most of it sounds good. the one with the goto had a problem with a break, but i've got to look again which kind it was
[11:59] <rogpeppe> niemeyer: anyway, i'd prefer the default to be to wait, if that's ok.
[11:59] <niemeyer> rogpeppe: WaitDNSName please.. this is the special case that will yield misbehaving code if we forget that it can take *several minutes* to return
[12:00] <niemeyer> rogpeppe: and let's have DNSName() returning (attr, error) too, so that it's clear that we may not have a DNS address at that time
[12:00] <niemeyer> Erm, (addr, error)
[12:00] <rogpeppe> niemeyer: it does that currently
[12:00] <niemeyer> rogpeppe: It didn't do before your branch, and I'm suggesting both WaitDNSName and DNSName do that
[12:00] <niemeyer> rogpeppe: If that's what you had in mind already, great.. :)
[12:00] <rogpeppe> niemeyer: sure, that's unavoidable
[12:01] <niemeyer> rogpeppe: It's not unavoidable.. what we have in trunk today is not like that
[12:01] <rogpeppe> niemeyer: that's because it doesn't work :-)
[12:01] <niemeyer> Heh
[12:01] <fwereade> TheMue, cool, thanks; https://codereview.appspot.com/5782053/ reviewed as well
[12:04] <rogpeppe> "
[12:04] <rogpeppe> This logic is error prone. When we get ErrMissingInstance, we'll have to
[12:04] <rogpeppe> *always* check len(insts) to see if it's 0 or not before iterating
[12:04] <rogpeppe> "
[12:04] <rogpeppe> niemeyer: the logic was deliberate
[12:05] <niemeyer> rogpeppe: Cool, let's fix it then
[12:05] <rogpeppe> niemeyer: it means that if you've asked for some instances, you can easily check (len != 0) if at least some instances have been returned
[12:05] <niemeyer> rogpeppe: That's error prone, as described
[12:06] <rogpeppe> niemeyer: if you're iterating, you don't need to check len(insts) for 0
[12:06] <rogpeppe> niemeyer: because the iteration will never go anywhere
[12:08] <niemeyer> insts, err := e.Instances([]ec2.Instance{a, b, c})
[12:09] <niemeyer> if err == ec2.ErrMissingInstance && len(insts) != 0 && insts[0] == nil { ... }
[12:10] <niemeyer> rogpeppe: Please don't force me to remember to do that every time
[12:12] <rogpeppe> niemeyer: most of the time you just check the error, in which case it doesn't matter. the other way around (and probably the more common one) is that if you ask for some instances and get some, but not all of them, you have to scan the list to see if you've got any.
[12:12] <rogpeppe> niemeyer: whereas the way it is currently, you can just check the length - you'll never return a slice with no non-nil instances in.
[12:13] <rogpeppe> niemeyer: i started off always returning a slice, but changed it because it was easier to use this way
[12:14] <rogpeppe> s/you'll never return a slice/you'll never get a slice/
[12:14] <niemeyer> rogpeppe: Easy and error prone.. there are two different cases to handle
[12:15] <niemeyer> rogpeppe: and we have to remember them when using this method to avoid a panic
[12:15] <niemeyer> rogpeppe: Please, either return the slice at all times, or let's have two different error types
[12:15] <niemeyer> rogpeppe: No gotchas
[12:16] <rogpeppe> niemeyer: ok, i'll do two error types.
[12:18] <niemeyer> rogpeppe: Thanks
[12:21] <rogpeppe> niemeyer:
[12:21] <rogpeppe> 	// Instances returns a slice of instances corresponding to the
[12:21] <rogpeppe> 	// given instance ids.  If no instances were found, but there
[12:21] <rogpeppe> 	// was no other error, it will return ErrInstanceNotFound.  If
[12:21] <rogpeppe> 	// some but not all the instances were found, the returned slice
[12:21] <rogpeppe> 	// will have some nil slots, and an ErrInstancesIncomplete error
[12:21] <rogpeppe> 	// will be returned.
[12:22] <rogpeppe> 	Instances(ids []string) ([]Instance, error)
[12:24] <niemeyer> rogpeppe: Looks good, thanks. Just plural for ErrInstancesNotFound too, please
[12:24] <rogpeppe> niemeyer: oh yeah, that was a typo - the actual error was spelled that way in fact.
[12:25] <niemeyer> rogpeppe: Sweet.. "Partial" might also be a shorter term for Incomplete
[12:26] <niemeyer> rogpeppe: Up to you, though
[12:26] <rogpeppe> niemeyer: ErrInstancesPartial doesn't sound great, and neither does ErrNotFoundInstances IMHO.
[12:26] <rogpeppe> niemeyer: but ErrInstancesNotFound and ErrPartialInstances seem inconsistent
[12:27] <niemeyer> rogpeppe: PartialInstances + NoInstances?
[12:27] <rogpeppe> +1
[12:27] <niemeyer> Super, thanks
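With the names settled above, the two-error shape might look like this (a sketch; the map-backed lookup stands in for the real provider call):

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors as agreed: PartialInstances + NoInstances.
var (
	ErrNoInstances      = errors.New("no instances found")
	ErrPartialInstances = errors.New("some instances not found")
)

type Instance struct{ ID string }

// Instances returns a slice sized len(ids), with nil slots for missing
// instances. It returns ErrNoInstances if none were found, and
// ErrPartialInstances if only some were, so callers never need to
// inspect slice contents just to distinguish the two cases.
func Instances(ids []string, known map[string]*Instance) ([]*Instance, error) {
	insts := make([]*Instance, len(ids))
	found := 0
	for i, id := range ids {
		if inst, ok := known[id]; ok {
			insts[i] = inst
			found++
		}
	}
	switch {
	case found == 0:
		return nil, ErrNoInstances
	case found < len(ids):
		return insts, ErrPartialInstances
	}
	return insts, nil
}

func main() {
	known := map[string]*Instance{"i-1": {ID: "i-1"}}
	insts, err := Instances([]string{"i-1", "i-2"}, known)
	fmt.Println(len(insts), err) // 2 some instances not found
}
```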
[12:36] <TheMue> lunchtime
[12:59] <rogpeppe> niemeyer: i'm slightly concerned about doing this: "
[12:59] <rogpeppe> It should wait, but stop on the first batch that has
[12:59] <rogpeppe> at least one valid DNSName.
[12:59] <rogpeppe> "
[12:59] <rogpeppe> (of StateInfo)
[13:00] <rogpeppe> it means that if we call StateInfo when, say, only one zk instance has been allocated, and that instance later goes down, we won't have the redundancy that we should.
[13:01] <niemeyer> rogpeppe: I don't understand.. if the only instance that was allocated goes down, there's no redundancy
[13:01] <rogpeppe> niemeyer: sorry, i was ambiguous
[13:02] <rogpeppe> niemeyer: when i said "only one zk instance has been allocated" i meant "the DNS name of only one of several zk instances has been allocated".
[13:02] <niemeyer> rogpeppe: The scenario is this: we have three instances, one is terminated for whatever reason.. it shouldn't wait for 3 minutes when it knows about other instances.
[13:02] <rogpeppe> niemeyer: i don't think that scenario would make it wait
[13:03] <rogpeppe> niemeyer: the scenario i'm concerned about is: we start 3 instances. the DNS name for only one of them is used, because it boots fractionally faster than the others.
[13:05] <niemeyer> rogpeppe: I don't see how it would not wait.. you're picking the first entry in the list and calling DNSName on it.. with the logic in the branch, it will wait for as long as it takes for that one machine to have an address, no matter how many machines are there.
[13:05] <niemeyer> rogpeppe: Is this the DNSName catching you already? :-)
[13:06] <rogpeppe> niemeyer: so we get *one* machine from the three possible zk machines. then that zk machine goes down. if we'd waited for the names of the others, the zk client could redial one of them, but as it is, it can't because it doesn't know about them
[13:07] <niemeyer> rogpeppe: I don't see what you mean..
[13:07] <niemeyer> for i, inst := range insts {
[13:07] <niemeyer>     addr, err := inst.DNSName()
[13:07] <niemeyer> *BOOM*.. three minutes..
[13:07] <niemeyer> That can't happen..
[13:08] <rogpeppe> niemeyer: ok, so you're suggesting we wait for them all in parallel, presumably, and then return the first one that has a DNS name?
[13:08] <rogpeppe> niemeyer: (assuming none of them have DNS names to start with)
[13:08] <niemeyer> rogpeppe: I'm saying that the problem above can't happen, so far.
[13:08] <rogpeppe> niemeyer: because we only have one zk server, right?
[13:08] <niemeyer> rogpeppe: If you understand the problem, we can talk about a solution
[13:09] <rogpeppe> i'm not sure i do understand the problem. is it that we might be launching one zk machine much later than the others?
[13:10] <rogpeppe> so we don't want to wait for that while the others are available anyway?
[13:10] <rogpeppe> niemeyer: we've got to wait for at least one DNS name, right?
[13:10] <niemeyer> rogpeppe: Yes..
[13:11] <niemeyer> rogpeppe: You have N machines.. waiting for several minutes for a random one to be alive == BAD
[13:11] <rogpeppe> niemeyer: so we don't mind if the zk client doesn't know about the other zk servers?
[13:11] <niemeyer> rogpeppe: I don't know what you're talking about
[13:12] <niemeyer> rogpeppe: I'm saying that if we know about N machines, waiting for several minutes for a single arbitrary machine == REALLY BAD
[13:12] <rogpeppe> it's not waiting for a single arbitrary machine, it's waiting for all of the zk machines.
[13:12] <niemeyer> rogpeppe: Yes.. that's equally bad
[13:13] <niemeyer> rogpeppe: It's pointless to have three machines for redundancy if the application waits for several minutes for all of them to be available
[13:13] <rogpeppe> niemeyer: the problem is that there's no way of telling the zk client when another zk machine *does* become available
[13:14] <rogpeppe> niemeyer: which means that if the only machine that we found does go down, we haven't got any redundancy any more.
[13:14] <niemeyer> rogpeppe: That's a completely independent problem
[13:14] <rogpeppe> oh?
[13:14] <niemeyer> rogpeppe: Yep.. dynamic member sets is a different problem
[13:15] <niemeyer> rogpeppe: We can't hold on forever waiting for everybody to be available or the redundancy is over
[13:15] <rogpeppe> niemeyer: it's not a dynamic member set though - we know we've started 3 zk machines, we just want to wait for them to be up
[13:15] <niemeyer> rogpeppe: A single machine terminated would kill juju
[13:15] <niemeyer> rogpeppe: If a single one of them terminates, it will never be up
[13:15] <rogpeppe> niemeyer: we're not waiting for the *machine* to be available, we're waiting for its DNS name. isn't that different?
[13:16] <rogpeppe> niemeyer: the DNS name is still available if a machine is terminated, i think.
[13:16] <niemeyer> rogpeppe: If a machine is terminated the entire instance record will go away in a bit
[13:18] <rogpeppe> niemeyer: that takes hours. and we can guard against that easily too.
[13:18] <rogpeppe> niemeyer: by only doing a short timeout when calling Instances in WaitDNSName
[13:19] <niemeyer> rogpeppe: I'm sure there are many workarounds for the problem. I was just describing it, and wishing that we didn't introduce it
[13:20] <rogpeppe> niemeyer: i think we *should* wait for all the zk addresses. but i think we shouldn't do a 3 minute timeout if the instances aren't there.
[13:20] <niemeyer> rogpeppe: The Instances call above is equally problematic, btw.. it's aborting if any of the instances are not found
[13:20] <niemeyer> rogpeppe: Uh oh.. :(
[13:21] <rogpeppe> niemeyer: good point. that *is* a bug.
[13:21] <niemeyer> rogpeppe: Man.. the issue is trivial: if a machine in a set of three never starts, we can't block *all of juju* because of that..
[13:21] <niemeyer> rogpeppe: Same thing if it terminates
[13:22] <niemeyer> rogpeppe: It simply makes no sense to have multiple machines if that's what we're doing
[13:22] <rogpeppe> niemeyer: terminating isn't a problem.
[13:22] <niemeyer> rogpeppe: The logic in your branch breaks if one terminates.
[13:22] <niemeyer> rogpeppe: That is a problem.
[13:23] <niemeyer> rogpeppe: Let's please fix that.
[13:23] <rogpeppe> niemeyer: agreed. but that's a different problem.
[13:23] <rogpeppe> niemeyer: that doesn't mean we shouldn't try to wait for all the DNS names while the instances are still around.
[13:24] <rogpeppe> niemeyer: (and running)
[13:24] <niemeyer> rogpeppe: The logic in your branch breaks if a machine never gets allocated.
[13:24] <niemeyer> rogpeppe: This is a bug. Let's fix it.
[13:24] <rogpeppe> niemeyer: yes, i've agreed about that.
[13:24] <niemeyer> rogpeppe: No, you just said otherwise.
[13:24] <niemeyer> "that doesn't mean we shouldn't try to wait for all the DNS names while the instances are still around."
[13:25] <rogpeppe> niemeyer: if a machine never gets allocated, the instance won't be around
[13:25] <rogpeppe> niemeyer: so we won't wait for its DNS name
[13:25] <niemeyer> rogpeppe: It will stay in "pending" mode..
[13:26] <niemeyer> rogpeppe: What happens then?
[13:26] <rogpeppe> niemeyer: so a machine can stay in pending mode forever?
[13:26] <niemeyer> rogpeppe: Bad things happen.. machines go from pending to terminated.. stay in pending for a very long time, etc
[13:27] <rogpeppe> niemeyer: if that's so, i agree it's an issue. but...
[13:27] <rogpeppe> niemeyer: if we go the other way, even if we have 3 zk instances, all the initial agents will only ever talk to one of them.
[13:27] <niemeyer> rogpeppe: The logic you're introducing is trying to handle the unfriendliness of EC2.. we must not go overboard with it. Having an application that hangs for several minutes in edge cases is not nice.
[13:27] <rogpeppe> niemeyer: so if that one goes down, the initial agents are stuffed
[13:28] <niemeyer> rogpeppe: If there are known machines in a set of zk instances, let's not block waiting forever when we *know* about *good working machines*
[13:28] <rogpeppe> niemeyer: ... which means we'll stop when we have exactly *one* machine, right?
[13:29] <niemeyer> rogpeppe: That's fine!  If we have to reconnect, well.. we have to reconnect!
[13:29] <niemeyer> rogpeppe: Purposefully introducing a resilience bug to fix resilience is.. hmm.. interesting.
[13:29] <rogpeppe> niemeyer: ok, so it's important for the state client to know that it should always re-get the StateInfo when it reconnects, right?
[13:30] <rogpeppe> actually, no this won't work
[13:30] <niemeyer> rogpeppe: I don't know.. but whatever we do, there's a bug locally, in that small subworld we're handling.
[13:30] <rogpeppe> because the zk client carries on redialling the set of known machines regardless of whether they're up or down
[13:31] <niemeyer> rogpeppe: Up to us.. we get the disconnection events, and as far as I understood, our approach was going to be reestablishing the connection.
[13:31] <rogpeppe> niemeyer: we don't get the disconnection events AFAIK
[13:32] <niemeyer> rogpeppe: ZooKeeper notifies about disconnections.
[13:32] <rogpeppe> ah, maybe it's just the initial connection attempt that goes on forever
[13:32] <niemeyer> rogpeppe: and our last agreement was that we were going to handle any connection issues as fatal.
[13:33] <niemeyer> rogpeppe: We might even be smarter at some point, and wait for longer if we knew that we had *just* bootstrapped
[13:33] <niemeyer> rogpeppe: Having that as a general rule over the lifetime of the environment is not reasonable, though
[13:34] <niemeyer> rogpeppe: ZooKeeper certainly tries to reestablish the connection.. that's not the same as not having notification of disconnections.
[13:34] <rogpeppe> niemeyer: yeah, given that it makes no difference if we pass 1 or 3 addresses to the zk client, there's no point in waiting for more than one.
[13:35] <niemeyer> rogpeppe: Arguably, we're doing more work on our side, but as far as I could perceive so far, that sounds like a good idea anyway
[13:35] <niemeyer> rogpeppe: Trusting on the internals of zk to achieve long term reliability hasn't been fruitful
[13:35] <rogpeppe> niemeyer: yeah. and it makes it easier to move to something else if we want to, perhaps.
[13:36] <niemeyer> rogpeppe: I guess we do have the instance start time, right?
[13:36]  * rogpeppe goes to have a look
[13:36] <niemeyer> rogpeppe: I'd be happy to wait in that loop as you suggest if we take that into consideration
[13:38] <rogpeppe> niemeyer: yeah, there's LaunchTime although it's not in goamz/ec2 yet
[13:38] <niemeyer> rogpeppe: Maybe let's just keep simple then, and return the first good set of machines we have
[13:39] <rogpeppe> niemeyer: i'll wait in parallel and return the first address we get. that's easiest i think.
[13:39] <rogpeppe> niemeyer: i could wait for a short time after the first one, in case several arrive together.
[13:39] <niemeyer> rogpeppe: I don't think it needs to be parallel
[13:39] <rogpeppe> niemeyer: oh, just poll DNSName on all of them?
[13:40] <niemeyer> rogpeppe: Put Instances itself in a loop.. once we have a batch with good DNSNames, return all of the ones that have an address assigned
[13:40] <niemeyer> rogpeppe: Instinctively, I believe there are good chances that we'll get multiple entries at once
[13:40] <niemeyer> rogpeppe: Since reservations are grouped
[13:40] <rogpeppe> yeah, that's better. i was going to call WaitDNSName in parallel.
[13:40] <rogpeppe> but that way is better
[13:40] <rogpeppe> niemeyer: all our instances are in separate reservations.
[13:41] <niemeyer> rogpeppe: Hmm, good point..
[13:41] <niemeyer> rogpeppe: Even then, there must be a queue in the server side
[13:42] <rogpeppe> niemeyer: who knows? given that we're waiting for a second before polling, we'll get any within that second.
[13:42] <niemeyer> rogpeppe: Btw, I'm also working a bit on a quick clean up of goamz this morning.. I'll put all the packages in a single branch
[13:42] <rogpeppe> niemeyer: yay!
[13:43] <rogpeppe> niemeyer: (i'm glad you got that fix in before go 1)
[13:43] <niemeyer> rogpeppe: Yeah, it'd be tricky later
[13:43] <rogpeppe> niemeyer: thanks for the discussion BTW. i found it very useful.
[13:43] <niemeyer> rogpeppe: I'm also including all the pending crack around mturk, sns, and sdb onto an exp/ subdir
[13:44] <rogpeppe> good plan.
[13:44] <niemeyer> They're not in a great state, but I want to make them into official branches since I've been doing a poor job reviewing/improving them
[13:44] <niemeyer> rogpeppe: Yeah, it was instructive for me too, thanks as well
[14:03] <fwereade> lunch,bbiab
[14:35] <rogpeppe> niemeyer: all done, i think
[14:36] <niemeyer> rogpeppe: Cheers
[15:05] <rogpeppe> TheMue: i just added my review to fwereade's.
[15:06] <TheMue> rogpeppe: yep, got the notification
[15:06] <TheMue> rogpeppe: thx
[15:06] <rogpeppe> TheMue: np
[15:09] <TheMue> rogpeppe: maybe rietveld should add a +1 or Like button ;)
[15:10] <rogpeppe> TheMue: +1 :-)
[15:28] <rogpeppe> lol "
[15:28] <rogpeppe> Don't worry.. I can't feel my toes anymore. They're already crushed
[15:28] <rogpeppe> from fixing code with bad error handling.
[15:28] <rogpeppe> "
[15:40] <TheMue> fwereade: thx, will change it from map to the value. but do you validate all data you retrieve from a backend (here zk) before returning it?
[15:41] <fwereade> TheMue, it just strikes me as sensibly paranoid ;)
[15:42] <TheMue> fwereade: :D
[15:42] <niemeyer> rogpeppe: You may want to delete this branch: https://code.launchpad.net/~rogpeppe/goamz/s3
[15:42] <niemeyer> rogpeppe: Has no content
[15:42] <fwereade> TheMue, there's certainly a case to be made that it's excessively paranoid ;)
[15:43] <rogpeppe> fwereade, TheMue: personally i treat all data that comes from outside the program itself as suspect.
[15:43] <rogpeppe> niemeyer: done
[15:43] <fwereade> rogpeppe, indeed, the additional cost is minimal
[15:44] <niemeyer> rogpeppe: Cheers
[15:44] <niemeyer> rogpeppe, fwereade, TheMue: goamz is now a single branch
[15:44] <fwereade> niemeyer, cool
[15:44] <niemeyer> You'll have to rm -rf $GOPATH/launchpad.net/goamz
[15:44] <rogpeppe> niemeyer: cool
[15:44] <niemeyer> and then
[15:44] <rogpeppe> go get -u won't work, presumably...
[15:44] <niemeyer> go get launchpad.net/goamz/aws
[15:44] <rogpeppe> that's a pity
[15:44] <niemeyer> rogpeppe: Yeah, unfortunate but necessary
[15:45] <rogpeppe> hmm, i'll just check i've got no outstanding branches before i rm -r...
[15:46] <TheMue> rogpeppe: so you validate all data you read from ZooKeeper too?
[15:46] <rogpeppe> TheMue: definitely. who knows what other dodgy programs have been at work there.
[15:46] <rogpeppe> TheMue: i'd only validate as necessary though. no need to fail if you don't need to.
[15:47] <TheMue> rogpeppe: how far does your validation go? how do you ensure that the whole topology stored in one node is valid? or the combination of all nodes, including their contents, that we use in state?
[15:49] <rogpeppe> TheMue: depends what i'm doing with it. i'd check the error with it's parsed. and i'd check that any expectations i have of it are true. but i wouldn't check what i don't need to.
[15:49] <TheMue> btw, do other applications write into our ZK instance?
[15:49] <rogpeppe> s/with it's/when it's/
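rogpeppe's "validate what you rely on, no more" stance might look like this in practice (a sketch; the node content format and field name are invented for illustration, not juju's actual topology format):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMachineID reads a "machine = N" line from a ZK node's content,
// checking only the expectations the caller actually relies on: the
// value parses as an int and is non-negative. Anything else in the
// content is deliberately ignored.
func parseMachineID(content string) (int, error) {
	for _, line := range strings.Split(content, "\n") {
		key, val, ok := strings.Cut(line, "=")
		if !ok || strings.TrimSpace(key) != "machine" {
			continue
		}
		id, err := strconv.Atoi(strings.TrimSpace(val))
		if err != nil {
			return 0, fmt.Errorf("malformed machine id %q: %v", val, err)
		}
		if id < 0 {
			return 0, fmt.Errorf("negative machine id %d", id)
		}
		return id, nil
	}
	return 0, fmt.Errorf("no machine id in node content")
}

func main() {
	id, err := parseMachineID("machine = 3\nstate = started")
	fmt.Println(id, err) // 3 <nil>
	_, err = parseMachineID("machine = banana")
	fmt.Println(err != nil) // true
}
```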
[15:59] <niemeyer> rogpeppe: There's at least nothing in review
[16:02] <rogpeppe> niemeyer:
[16:02] <rogpeppe> % go get -u launchpad.net/goamz/aws
[16:02] <rogpeppe> package launchpad.net/goamz/aws: directory "/home/rog/src/go/src" is not using a known version control system
[16:02] <rogpeppe> hmm
[16:02] <rogpeppe> TheMue: "other applications" might be just a buggy old version of our own code...
[16:03] <niemeyer> rogpeppe: Try again please
[16:03] <rogpeppe> niemeyer: same result.  but no delay - it seems like it's decided already...
[16:03] <niemeyer> rogpeppe: Remove your local stuff
[16:04] <rogpeppe> niemeyer: which local stuff? you mean change my GOPATH?
[16:05] <niemeyer> rogpeppe: rm -rf $GOPATH/launchpad.net/goamz
[16:05] <rogpeppe> niemeyer: i did that
[16:05] <rogpeppe> niemeyer: there's no goamz directory
[16:05] <rogpeppe> niemeyer: but i changed my GOPATH and it works
[16:05] <niemeyer> rogpeppe: I've just installed it locally, and it works
[16:05] <niemeyer> rogpeppe: So you probably have left over data
[16:06] <niemeyer> Lunch time
[16:06] <niemeyer> biab
[16:06] <TheMue> rogpeppe: that may well be. so i would like you and fwereade to do a walkthrough of the state code and take a look at where further verification of information retrieved from ZK is needed.
[16:07] <rogpeppe> niemeyer: it's worked now. bizarre.
[16:07] <rogpeppe> TheMue: i've been looking out for it, so you're probably ok.
[16:07] <TheMue> rogpeppe: ok, thx
[16:09] <rogpeppe> niemeyer: http://paste.ubuntu.com/876199/
[16:27] <TheMue> rogpeppe: how would you call the type for the Resolved… values?
[16:28]  * rogpeppe has a look
[16:29] <rogpeppe> TheMue: i can't say i really understand what they're about, but perhaps ResolvedKind?
[16:30] <TheMue> rogpeppe: inside the node's content they are stored as "retry = 1000" or "retry = 1001". so maybe RetryKind is better.
[16:31] <rogpeppe> TheMue: seems ok to me. fwereade should ok it though - he knows what's going on much better than i!
[16:32] <TheMue> fwereade: what do you say?
[16:32] <fwereade> TheMue, rogpeppe: I guess I'd say RetryKind or maybe RetryMode
[16:33] <rogpeppe> fwereade: out of interest, what do these values mean?
[16:33] <fwereade> rogpeppe, just jump to the next state, or try rerunning the hook and go back to error state if it fails
[16:34] <rogpeppe> fwereade: ok thanks
[16:36] <TheMue> rogpeppe, fwereade: thx, so i'll take RetryMode, looks best
[16:36] <rogpeppe> TheMue: sounds good
[17:02] <rogpeppe> this is weird. i've got one branch that's definitely different from another one, but running bzr diff between the two branches gives me no output.
[17:02] <rogpeppe> how can that happen?
[17:05] <rogpeppe> there must be some subtlety about "diff" that i don't get.
[17:05] <rogpeppe> fwereade: any idea?
[17:08] <fwereade> rogpeppe, no idea... I know "bzr diff" for uncommitted changes and "bzr diff --old lp:somewhere" and that's about it
[17:08] <niemeyer> rogpeppe: How are you sure that there are differences?
[17:08] <rogpeppe> niemeyer: i did md5sum on the files
[17:08] <rogpeppe> ah, i forgot the --old flag
[17:09] <rogpeppe> ah that worked!
[17:09] <rogpeppe> i wonder what it was doing when i didn't give it the --old flag.
[17:09] <rogpeppe> fwereade, niemeyer: thanks. i'm stupid.
[17:21] <niemeyer> Man, I just got the strongest espresso I can remember
[17:23] <rogpeppe> niemeyer: do those changes to go-ec2-robustness look ok, BTW?
[17:23] <niemeyer> rogpeppe: Haven't re-reviewed yet
[17:23] <rogpeppe> ok
[17:24] <niemeyer> rogpeppe: Just finishing the goamz stuff and will go back
[17:36] <niemeyer> rogpeppe, fwereade_, TheMue: FYI, https://groups.google.com/d/msg/goamz/cTZ5xmQeLQI/gFjSMMVrMbEJ
[17:37] <rogpeppe> niemeyer: pity all that history has gone
[17:38] <fwereade_> niemeyer, cheers
[17:38] <fwereade_> and I think I'm done for the week -- happy weekends all :)
[17:52] <niemeyer> fwereade_: Cheers
[17:52] <niemeyer> fwereade_, rogpeppe, TheMue: Early next week, it'd be nice to have a call with the four of us
[17:52] <niemeyer> To discuss a bit about how things are going in the port and align
[17:52] <rogpeppe> niemeyer: sounds good
[17:59] <niemeyer> rogpeppe: On the branch
[17:59] <rogpeppe>  niemeyer: thanks
[18:02] <niemeyer> rogpeppe: Nice reuse among DNSName and Wait&
[18:02] <rogpeppe> niemeyer: thanks. i was pleased how that turned out.
[18:03] <niemeyer> rogpeppe: Need to check for NoInstances too in StateInfo
[18:04] <rogpeppe> niemeyer: i think that's ok - it'll just return the error
[18:04] <niemeyer> rogpeppe: Sure, but that's not what we want, is it?
[18:05] <rogpeppe> niemeyer: i think it is - if the instances don't exist (remember we've already timed out for eventual consistency) then we don't want to wait for them
[18:05] <niemeyer> rogpeppe: Ah, you're right, thanks
[18:06] <niemeyer> rogpeppe: Need to check nil within the loop, though, right?
[18:06] <rogpeppe> niemeyer: oops yes, good catch!
[18:07] <niemeyer> rogpeppe: We'll have to pay special attention when coding/reviewing logic with ErrPartialInstances.. feels easy to miss indeed
[18:09] <rogpeppe> niemeyer: that's the hazard of returning incomplete data, i think. i'm not sure we can get around it.
[18:10] <rogpeppe> niemeyer: pushed a fix.
[18:11] <niemeyer> rogpeppe: Using ShortAttempt in live_test.go feels a bit like abusing private details of the implementation
[18:12] <niemeyer> rogpeppe: It'd be nice to have a trivial Sleep there, and remove all the hacking from export_test.go
[18:13] <rogpeppe> niemeyer: not sure. it's important that the local tests have a short sleep because otherwise the tests take much longer
[18:14] <rogpeppe> niemeyer: but maybe i should just have a variable and be done with it
[18:14] <niemeyer> rogpeppe: Only if they are broken, right?
[18:14] <rogpeppe> niemeyer: no. quite a few of the tests test stuff that's deliberately broken.
[18:14] <rogpeppe> niemeyer: so the timeout is exercised quite a bit
[18:14] <niemeyer> rogpeppe: No, I mean it will only take a while if the test is broken
[18:15] <rogpeppe> niemeyer: it makes the tests run for 20 or 30 seconds rather than 5.
[18:15] <rogpeppe> niemeyer: i'm not sure i understand
[18:15] <niemeyer> rogpeppe: I don't understand why..
[18:15] <rogpeppe> niemeyer: because if you're testing that getting a non-existent instance (for instance) actually returns an error, then the code has to time out
[18:17] <niemeyer> rogpeppe: There's a single use of ShortAttempt
[18:17] <niemeyer> rogpeppe: and it repeats until it breaks
[18:17] <rogpeppe> niemeyer: yes, but that test is run multiple times.
[18:17] <rogpeppe> niemeyer: one time for each scenario
[18:17] <niemeyer> rogpeppe: If it is run 10 times, and it takes 1 second to break on each, it's still 10 seconds
[18:18] <niemeyer> rogpeppe: How many times is it repeating, and why does it take so long?
[18:18] <rogpeppe> niemeyer: the timeout is 5 seconds not one second
[18:18] <niemeyer> rogpeppe: It doesn't reach the timeout, unless the test is broken!
[18:18] <rogpeppe> [18:15] <rogpeppe> niemeyer: because if you're testing that getting a non-existent instance (for instance) actually returns an error, then the code has to time out
 rogpeppe: There's a single use of ShortAttempt
 rogpeppe: and it repeats until it breaks
[18:20] <rogpeppe> niemeyer: oh! sorry, i'd forgotten the real reason. the timeouts *inside* the ec2 package are exercised.
[18:21] <rogpeppe> niemeyer: the ShortAttempt inside TestStopInstances isn't a problem. it could be hardwired as you suggest.
[18:21] <rogpeppe> niemeyer: (which would cut out some of the hackery in export_test)
[18:22] <niemeyer> rogpeppe: Ah, cool, we were talking about different things then
[18:22] <rogpeppe> niemeyer: but we'd still need SetShortTimeouts
[18:22] <niemeyer> rogpeppe: Yes, I was referring to those
[18:22] <niemeyer> rogpeppe: That's fine
[18:24] <rogpeppe> niemeyer: fix pushed.
[18:26] <rogpeppe> niemeyer: i'm going to have to go very shortly, BTW
[18:27] <niemeyer> rogpeppe: Cool, let me check that quickly so that hopefully you can get away with a pleasant submission :)
[18:28] <rogpeppe> niemeyer: that would be marvellous
[18:29] <rogpeppe> niemeyer: next two branches before actually running zookeeper and connecting to it are very small BTW. we're nearly there!
[18:29] <rogpeppe> s/before actually running/getting/
[18:29] <niemeyer> rogpeppe: Oh, that's great to hear.. I'm feeling so much behind on reviews
[18:32] <niemeyer> rogpeppe: AWESOME!
[18:32] <niemeyer> rogpeppe: LGTM
[18:33] <rogpeppe> niemeyer: brilliant, thanks a lot. i've gotta go and do a gig now!
[18:33] <rogpeppe> niemeyer: see ya monday.
[18:34] <niemeyer> rogpeppe: Have a great weekend!
[18:34] <rogpeppe> niemeyer: you too
[18:34] <niemeyer> rogpeppe: Cheers :)
[18:43]  * TheMue has changed https://codereview.appspot.com/5727045. now i can go. have a good weekend, niemeyer 
[18:49] <TheMue> niemeyer: bye, c u monday
[18:49] <niemeyer> TheMue: Have a good one.. will try to unblock you for next week
[18:50] <TheMue> niemeyer: getting good feedback by rog and fwereade, so it's ok
[18:51] <niemeyer> TheMue: Sweet, good to see team work playing there
[18:51] <TheMue> niemeyer: yep, absolutely
[18:51] <TheMue> niemeyer: so have a good weekend
[18:54] <niemeyer> TheMue: Thanks, a great one to you too!
[18:54] <TheMue> niemeyer: thx
[19:44]  * niemeyer breaks..