/srv/irclogs.ubuntu.com/2012/03/09/#juju-dev.txt

niemeyerfwereade: Morning09:00
fwereadeheya niemeyer09:00
niemeyerrogpeppe: Morning09:03
rogpeppeniemeyer: hiya, up early...09:05
niemeyerrogpeppe: Yeah, lost sleep and decided to push some pending tasks forward09:07
rogpeppeniemeyer: i've been up about 2 hours myself, but went for a nice long bike ride.09:09
niemeyerrogpeppe: Oh, that sounds very nice09:09
niemeyerrogpeppe: Haven't been biking in a long while09:09
niemeyerrogpeppe: I'm a one-sport man.. squash took over.. ;-)09:10
rogpeppeniemeyer: i've been going out in the mornings 'cos i realised that moving from bedroom to desk wasn't getting me much exercise!09:10
niemeyerrogpeppe: Nice that you've handled that quickly.. indeed it's an amazing recipe for couch potatoing :)09:10
rogpeppeniemeyer: rock climbing was always my big thing, but it's harder to get out these days...09:11
rogpeppebeen going out on the bike most days for 2 weeks now, and i haven't repeated a route once...09:11
niemeyerWow09:29
niemeyerrogpeppe: Just submitted a review for the robustness branch09:30
niemeyerWill step out to have coffee with Ale09:30
niemeyerBack soonish09:30
rogpeppeniemeyer: thanks09:30
fwereadepopping out for a coffee myself, might kickstart my brain09:45
* niemeyer returns10:05
niemeyerTheMue: Morning10:13
TheMueniemeyer: morning, back again10:13
fwereadeTheMue, btw, you mentioned a blocker on relation state; do you have a mo to fill me in on that?10:36
TheMuefwereade: niemeyer told me that there are some open discussions and i should defer it and continue with other open parts meanwhile10:37
fwereadeTheMue, ah yeah, cool, that makes sense10:37
niemeyerfwereade: Subordinates are ongoing in Python.. I suggested waiting for that so that the implementation may be done once10:38
TheMuefwereade: right now i'm going through presence package. should be difficult to switch my agent code to use it.10:38
fwereadeTheMue, should be? :(10:38
TheMuefwereade: oh, sorry, typo10:38
fwereadeTheMue, phew :)10:38
TheMuefwereade: ahoulsn't ;)10:38
fwereadeTheMue, now *that's* a typo :p10:39
TheMuefwereade: hehe10:39
fwereadeTheMue, what's next on your roadmap?10:39
fwereadeTheMue, just want to make sure we don't collide again ;)10:39
TheMuefwereade: no, indeed, your package looks good. maybe an integration of tomb for safe goroutine handling would make sense10:40
fwereadeTheMue, the intent was to have as much as possible the same semantics as normal ZK watches10:40
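
To make the tomb suggestion above concrete, here is a small self-contained sketch of the lifecycle pattern it implies, assuming the launchpad.net/tomb API of the time (Kill/Done/Dying/Wait); the ticker type and everything else below is invented for illustration and is not the presence package's actual code.

    package presence

    import (
        "time"

        "launchpad.net/tomb"
    )

    // ticker stands in for a watcher whose goroutine is managed by a tomb.
    type ticker struct {
        tomb    tomb.Tomb
        Changes chan time.Time
    }

    func newTicker(period time.Duration) *ticker {
        t := &ticker{Changes: make(chan time.Time)}
        go t.loop(period)
        return t
    }

    func (t *ticker) loop(period time.Duration) {
        defer t.tomb.Done()
        tick := time.NewTicker(period)
        defer tick.Stop()
        for {
            select {
            case <-t.tomb.Dying():
                return
            case now := <-tick.C:
                select {
                case t.Changes <- now:
                case <-t.tomb.Dying():
                    return
                }
            }
        }
    }

    // Stop asks the goroutine to finish and returns whatever error it died with.
    func (t *ticker) Stop() error {
        t.tomb.Kill(nil)
        return t.tomb.Wait()
    }
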
TheMuefwereade: i've got three reviews in the pipe, one is the agent proposal. once that one is ready for integration i would extend the entities i already have and which shall have agents with it10:41
TheMuefwereade: beside that i've got to compare what's still open in state10:42
fwereadeTheMue, cool; I'm looking into hook execution so I don't think we'll cause each other too much trouble ;)10:42
TheMuefwereade: i'll post here what i'll start next10:42
TheMuefwereade: did you take a look into my agent code?10:43
fwereadeTheMue, one thing I noticed is that we don't seem to have state.Unit.OpenPort yet10:43
fwereadeTheMue, I started; I have a couple of draft comments, I'll try to get them finished for you now10:43
TheMuefwereade: OpenPort() is one of the proposals10:43
fwereadeTheMue, oh, fantastic, sorry I missed that one10:44
TheMuefwereade: the code is submitted for review10:44
fwereadeTheMue, I'll take another look at +activereviews10:44
fwereadeTheMue, hasn't go-state-continued-machine been merged already?10:47
fwereadeTheMue, and go-state-gocheck-change too?10:48
TheMuefwereade: they are all missing an LGTM10:49
TheMuefwereade: the latter is also needless, it's part of the others. rog and i thought it had to be done earlier, so i made an extra branch. but the changes have been so few that another branch already had them10:50
fwereadeTheMue, ah cool, maybe reject that one then?10:50
TheMuefwereade: how to do that?10:51
fwereadeTheMue, go to the LP merge proposal and click the status field at the top10:51
TheMuefwereade: important are the two go-state-continued-… and the go-state-first-agent-draft10:52
fwereadeTheMue, cool, I'll get up to date on those then10:52
fwereadeTheMue, thanks10:52
TheMuefwereade: hmm, i'm somehow lost and don't find the mentioned status field at the top10:54
fwereadeTheMue, it's currently an orange "needs review"10:55
TheMuefwereade: ah, have not gone far enough in lp10:59
fwereadeTheMue, ah sorry you were looking at the branch?10:59
TheMuefwereade: yep, i still have my troubles with the usability of lp11:00
fwereadeTheMue, haha, LP seemed like sweetness and light to me after the rubbish we were using at my last place11:00
fwereadeTheMue, the tool was *worse* than doing bug/iteration tracking with google docs spreadsheets11:01
TheMuefwereade: i've worked with JIRA. it's not as powerful as LP, but the UI is more clear11:01
TheMuefwereade: uh, sounds horrible11:01
fwereadeTheMue, (I know that to be true because that was what we were using before)11:01
* fwereade cries quietly in a corner11:01
* TheMue dcc's fwereade a tissue11:02
fwereade:)11:02
TheMuefwereade: so, it's rejected but still with status "Development". can i now change this to "Abandoned"?11:05
fwereadeTheMue, sounds good to me (I usually forget to do that myself)11:05
TheMuefwereade: ok, i like a clean workplace ;)11:06
TheMueniemeyer: oh, i just rejected go-state-gocheck-change after discussion with fwereade, those changes have already been in a former branch and are merged into state11:13
TheMueniemeyer: this overlapped, sorry11:13
fwereadeTheMue, review on https://codereview.appspot.com/5690051/11:15
TheMuefwereade: thx, take a look11:16
niemeyerTheMue: That's cool, no worries11:16
TheMuefwereade: in your review you say that in today's code the only caller of RemoveMachine() cares if it doesn't exist?11:23
fwereadeTheMue, seems so to me11:23
TheMuefwereade: do you know the reason behind it? from state's perspective it's ok if there's nothing to remove.11:25
TheMuefwereade: but state has only a very limited view.11:26
fwereadeTheMue, tbh I'd be just as happy if it silently ignored nonexistent machines11:26
fwereadeTheMue, it's just the mismatch between the state's view of what's acceptable and the single client's view11:26
fwereadeTheMue, I'd rather just return an error than return a value (false) that's always turned into an error11:27
TheMuefwereade: hmm, it's just today's code, only ported to go11:34
TheMuefwereade: should ask niemeyer11:35
fwereadeTheMue, understood; and there may be reasons for it that I've missed; but I'm not sure that we should feel too bound by python implementation details11:36
TheMuefwereade: if i got it right we first shall catch the current status and later optimize it for go11:37
fwereadeTheMue, my view is that that particular implementation detail is either a misprediction of future needs, or an appendix left over after the removal of some pre-existing need11:37
TheMuefwereade: maybe, you are more sure than me in the existing code.11:38
fwereadeTheMue, that hadn't been my perception -- while we should maintain externally observable characteristics wherever possible, I'd prefer to strip ickiness where possible11:40
fwereadeTheMue, for example, presence nodes vs ephemeral nodes11:40
fwereadeniemeyer, opinion?11:41
niemeyerfwereade++11:41
niemeyerTheMue: If the only use of this method actually does what fwereade suggests, turning around and saying "Hah! Actually, that's an error!", then it feels like the API was mispredicted when first written11:42
niemeyerI didn't even know that was the case, FWIW11:43
TheMueniemeyer: yep, i do code scans to see where a meth/func is used to see how it is intended when it's unclear to me11:44
TheMueniemeyer: in this particular case i felt no need so i ported it 1:111:45
TheMueniemeyer: maybe scanning for each func is useful11:45
rogpeppeniemeyer: wanna talk about DNS name timeouts?11:46
niemeyerTheMue: It may be hard in some cases without further context11:47
niemeyerTheMue: Reviews help with that too11:47
niemeyerrogpeppe: Sure11:47
TheMueniemeyer: definitely11:48
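
As a toy illustration of the API shape fwereade argues for above -- report a missing machine as an error rather than as a bool that every caller immediately converts -- here is a self-contained sketch; nothing in it is the real state package, and all names are invented.

    package main

    import (
        "errors"
        "fmt"
    )

    var errMachineNotFound = errors.New("machine not found")

    type state struct {
        machines map[int]bool
    }

    // RemoveMachine reports a nonexistent machine as an error directly,
    // instead of returning (false, nil) and leaving the caller to decide.
    func (s *state) RemoveMachine(id int) error {
        if !s.machines[id] {
            return errMachineNotFound
        }
        delete(s.machines, id)
        return nil
    }

    func main() {
        s := &state{machines: map[int]bool{0: true}}
        fmt.Println(s.RemoveMachine(0)) // <nil>
        fmt.Println(s.RemoveMachine(7)) // machine not found
    }
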
rogpeppeniemeyer: i don't quite understand this: "11:48
rogpeppeThere's an application running somewhere, we shutdown a11:48
rogpeppemachine, and then that app hangs for 3 minutes just because it asked for a DNS11:48
rogpeppeaddress?11:48
rogpeppe"11:48
rogpeppeniemeyer: what's the "application" in this context?11:48
niemeyerrogpeppe: A binary11:48
niemeyerrogpeppe: Any binary that links against this library of code11:48
rogpeppeniemeyer: ok.11:49
fwereadeTheMue, reviewed https://codereview.appspot.com/5727045/11:49
rogpeppeniemeyer: i think there are two scenarios11:49
rogpeppeniemeyer: one is when you really need the address11:49
rogpeppeniemeyer: the other you don't care too much11:49
TheMuefwereade: already seen an interesting comment regarding retryTopologyChange()11:49
rogpeppeniemeyer: for instance, when we ask for the DNS address of the bootstrap machine, we really need it11:49
rogpeppeniemeyer: but, as you point out, in StateInfo we don't care too much11:50
rogpeppeniemeyer: i'm wondering about a "wait as long as necessary" bool flag on Environ.DNSName11:50
rogpeppeniemeyer: i'm trying to resist forcing clients of Environ to do their own polling.11:51
rogpeppeniemeyer: because i think the strategy might be quite different for different providers.11:52
rogpeppeniemeyer: (some providers will always have a DNS name instantly available, for example)11:52
niemeyerrogpeppe: Hmm11:53
niemeyerrogpeppe: I don't think so, to be honest11:53
niemeyerrogpeppe: The case where you say "give me a machine" to an IaaS and you have an address immediately may not be so common11:54
niemeyerrogpeppe: I'm happy to experiment with a waiting API, but let's please keep an eye on this rather than taking it as a general approach from now on11:55
niemeyerrogpeppe: I suggest we introduce WaitDNSName() returning (addr, error)11:55
rogpeppeniemeyer: immediately, perhaps not. but a 3 minute wait? many DHCP clients give an address in < 1s11:55
fwereadeTheMue, I definitely don't think that's something to be addressed in this branch11:56
niemeyerrogpeppe: And keep DNSName() returning the already known address only11:56
rogpeppeniemeyer: i'm wondering if it should be the other way around.11:56
fwereadeTheMue, but State "feels" like the right place for it to me11:56
niemeyerrogpeppe: The three minute wait is completely unrelated to DHCP11:56
rogpeppeniemeyer: DNSNameNoWait11:56
rogpeppeso DNSName always gives you a valid DNS name if it can11:57
niemeyerrogpeppe: This is waiting for a machine to be booted11:57
niemeyerrogpeppe: Which is why it's wrong to be holding on DNSName specifically11:57
rogpeppeniemeyer: why can't the DNS name be allocated before the machine boots?11:57
niemeyerrogpeppe: The machine is off.. the DNS name isn't special11:57
rogpeppeniemeyer: doesn't the infrastructure allocate the DNS name?11:57
niemeyerrogpeppe: It's not under our control why people do this or not11:57
rogpeppeniemeyer: not the machine itself?11:57
niemeyerrogpeppe: The fact is that the most well known IaaS doesn't do that11:58
rogpeppeniemeyer: ok. seems a bit weird to me, but i'm sure there's a good reason.11:58
niemeyerrogpeppe: Likely because they're still working out where to allocate the machine by the time it's answered11:58
niemeyerrogpeppe: But, again, there's no point for us11:58
TheMuefwereade: will comment on your review, most of it sounds good. the one with the goto had a problem with a break, but i've got to look again which kind it was11:59
rogpeppeniemeyer: anyway, i'd prefer the default to be to wait, if that's ok.11:59
niemeyerrogpeppe: WaitDNSName please.. this is the special case that will yield misbehaving code if we forget that it can take *several minutes* to return11:59
niemeyerrogpeppe: and let's have DNSName() returning (attr, error) too, so that it's clear that we may not have a DNS address at that time12:00
niemeyerErm, (addr, error)12:00
rogpeppeniemeyer: it does that currently12:00
niemeyerrogpeppe: It didn't do before your branch, and I'm suggesting both WaitDNSName and DNSName do that12:00
niemeyerrogpeppe: If that's what you had in mind already, great.. :)12:00
rogpeppeniemeyer: sure, that's unavoidable12:00
niemeyerrogpeppe: It's not unavoidable.. what we have in trunk today is not like that12:01
rogpeppeniemeyer: that's because it doesn't work :-)12:01
niemeyerHeh12:01
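
A minimal sketch of the split agreed here, as it might look on the provider's instance type: DNSName only reports an address that is already known, while WaitDNSName blocks until one is assigned. The method names come from the discussion above; the interface shape and comments are assumptions, not the branch's actual code.

    package environs

    // Instance is a sketch of a provider instance as discussed above.
    type Instance interface {
        // Id returns the provider-specific instance id.
        Id() string

        // DNSName returns the instance's DNS name if it is already
        // known, and an error otherwise; it never blocks.
        DNSName() (string, error)

        // WaitDNSName waits until the provider has assigned a DNS name
        // and returns it. Callers must keep in mind that this can take
        // several minutes on providers such as EC2.
        WaitDNSName() (string, error)
    }
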
fwereadeTheMue, cool, thanks; https://codereview.appspot.com/5782053/ reviewed as well12:01
rogpeppe"12:04
rogpeppeThis logic is error prone. When we get ErrMissingInstance, we'll have to12:04
rogpeppe*always* check len(insts) to see if it's 0 or not before iterating12:04
rogpeppe"12:04
rogpeppeniemeyer: the logic was deliberate12:04
niemeyerrogpeppe: Cool, let's fix it then12:05
rogpeppeniemeyer: it means that if you've asked for some instances, you can easily check (len != 0) if at least some instances have been returned12:05
niemeyerrogpeppe: That's error prone, as described12:05
rogpeppeniemeyer: if you're iterating, you don't need to check len(insts) for 012:06
rogpeppeniemeyer: because the iteration will never go anywhere12:06
niemeyerinsts, err := e.Instances([]ec2.Instance{a, b, c})12:08
niemeyerif err == ec2.ErrMissingInstance && len(insts) != 0 && insts[0] == nil { ... }12:09
niemeyerrogpeppe: Please don't force me to remember to do that every time12:10
rogpeppeniemeyer: most of the time you just check the error, in which case it doesn't matter. the other way around (and probably the more common one) is that if you ask for some instances and get some, but not all of them, you have to scan the list to see if you've got any.12:12
rogpeppeniemeyer: whereas the way it is currently, you can just check the length - you'll never return a slice with no non-nil instances in.12:12
rogpeppeniemeyer: i started off always returning a slice, but changed it because it was easier to use this way12:13
rogpeppes/you'll never return a slice/you'll never get a slice/12:14
niemeyerrogpeppe: Easy and error prone.. there are two different cases to handle12:14
niemeyerrogpeppe: and we have to remember them when using this method to avoid a panic12:15
niemeyerrogpeppe: Please, either return the slice at all times, or let's have two different error types12:15
niemeyerrogpeppe: No gotchas12:15
rogpeppeniemeyer: ok, i'll do two error types.12:16
niemeyerrogpeppe: Thanks12:18
rogpeppeniemeyer:12:21
rogpeppe// Instances returns a slice of instances corresponding to the12:21
rogpeppe// given instance ids.  If no instances were found, but there12:21
rogpeppe// was no other error, it will return ErrInstanceNotFound.  If12:21
rogpeppe// some but not all the instances were found, the returned slice12:21
rogpeppe// will have some nil slots, and an ErrInstancesIncomplete error12:21
rogpeppe// will be returned.12:21
rogpeppeInstances(ids []string) ([]Instance, error)12:22
niemeyerrogpeppe: Looks good, thanks. Just plural for ErrInstancesNotFound too, please12:24
rogpeppeniemeyer: oh yeah, that was a typo - the actual error was spelled that way in fact.12:24
niemeyerrogpeppe: Sweet.. "Partial" might also be a shorter term for Incomplete12:25
niemeyerrogpeppe: Up to you, though12:26
rogpeppeniemeyer: ErrInstancesPartial doesn't sound great, and neither does ErrNotFoundInstances IMHO.12:26
rogpeppeniemeyer: but ErrInstancesNotFound and ErrPartialInstances seem inconsistent12:26
niemeyerrogpeppe: PartialInstances + NoInstances?12:27
rogpeppe+112:27
niemeyerSuper, thanks12:27
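
A self-contained sketch of what the agreed contract looks like from the caller's side: nil slots mark ids that were not found, ErrPartialInstances says some were missing, and ErrNoInstances says none were found. The fake instances function below only exists to make the example runnable; it is not the ec2 provider code.

    package main

    import (
        "errors"
        "fmt"
    )

    var (
        ErrNoInstances      = errors.New("no instances found")
        ErrPartialInstances = errors.New("some instances not found")
    )

    type Instance struct{ Id string }

    // instances fakes the provider call: ids present in known come back
    // filled in, the rest stay as nil slots, and the error says which
    // of the three cases we are in.
    func instances(ids []string, known map[string]bool) ([]*Instance, error) {
        insts := make([]*Instance, len(ids))
        found := 0
        for i, id := range ids {
            if known[id] {
                insts[i] = &Instance{Id: id}
                found++
            }
        }
        switch found {
        case len(ids):
            return insts, nil
        case 0:
            return nil, ErrNoInstances
        default:
            return insts, ErrPartialInstances
        }
    }

    func main() {
        insts, err := instances([]string{"i-0", "i-1", "i-2"}, map[string]bool{"i-0": true, "i-2": true})
        if err != nil && err != ErrPartialInstances {
            fmt.Println("fatal:", err)
            return
        }
        for _, inst := range insts {
            if inst == nil {
                continue // nil slot: that id was not found
            }
            fmt.Println("found", inst.Id)
        }
    }
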
TheMuelunchtime12:36
rogpeppeniemeyer: i'm slightly concerned about doing this: "12:59
rogpeppeIt should wait, but stop on the first batch that has12:59
rogpeppeat least one valid DNSName.12:59
rogpeppe"12:59
rogpeppe(of StateInfo)12:59
rogpeppeit means that if we call StateInfo when, say, only one zk instance has been allocated, and that instance later goes down, we won't have the redundancy that we should.13:00
niemeyerrogpeppe: I don't understand.. if the only instance that was allocated goes down, there's no redundancy13:01
rogpeppeniemeyer: sorry, i was ambiguous13:01
rogpeppeniemeyer: when i said "only one zk instance has been allocated" i meant "the DNS name of only one of several zk instances has been allocated".13:02
niemeyerrogpeppe: The scenario is this: we have three instances, one is terminated for whatever reason.. it shouldn't wait for 3 minutes when it knows about other instances.13:02
rogpeppeniemeyer: i don't think that scenario would make it wait13:02
rogpeppeniemeyer: the scenario i'm concerned about is: we start 3 instances. the DNS name for only one of them is used, because it boots fractionally faster than the others.13:03
niemeyerrogpeppe: I don't see how it would not wait.. you're picking the first entry in the list and calling DNSName on it.. with the logic in the branch, it will wait for as long as it takes for that one machine to have an address, no matter how many machines are there.13:05
niemeyerrogpeppe: Is this the DNSName catching you already? :-)13:05
rogpeppeniemeyer: so we get *one* machine from the three possible zk machines. then that zk machine goes down. if we'd waited for the names of the others, the zk client could redial one of them, but as it is, it can't because it doesn't know about them13:06
niemeyerrogpeppe: I don't see what you mean..13:07
niemeyerfor i, inst := range insts {13:07
niemeyer    addr, err := inst.DNSName()13:07
niemeyer*BOOM*.. three minutes..13:07
niemeyerThat can't happen..13:07
rogpeppeniemeyer: ok, so you're suggesting we wait for them all in parallel, presumably, and then return the first one that has a DNS name?13:08
rogpeppeniemeyer: (assuming none of them have DNS names to start with)13:08
niemeyerrogpeppe: I'm saying that the problem above can't happen, so far.13:08
rogpeppeniemeyer: because we only have one zk server, right?13:08
niemeyerrogpeppe: If you understand the problem, we can talk about a solution13:08
rogpeppei'm not sure i do understand the problem. is it that we might be launching one zk machine much later than the others?13:09
rogpeppeso we don't want to wait for that while the others are available anyway?13:10
rogpeppeniemeyer: we've got to wait for at least one DNS name, right?13:10
niemeyerrogpeppe: Yes..13:10
niemeyerrogpeppe: You have N machines.. waiting for several minutes for a random one to be alive == BAD13:11
rogpeppeniemeyer: so we don't mind if the zk client doesn't know about the other zk servers?13:11
niemeyerrogpeppe: I don't know what you're talking about13:11
niemeyerrogpeppe: I'm saying that if we know about N machines, waiting for several minutes for a single arbitrary machine == REALLY BAD13:12
rogpeppeit's not waiting for a single arbitrary machine, it's waiting for all of the zk machines.13:12
niemeyerrogpeppe: Yes.. that's equally bad13:12
niemeyerrogpeppe: It's pointless to have three machines for redundancy if the application waits for several minutes for all of them to be available13:13
rogpeppeniemeyer: the problem is that there's no way of telling the zk client when another zk machine *does* become available13:13
rogpeppeniemeyer: which means that if the only machine that we found does go down, we haven't got any redundancy any more.13:14
niemeyerrogpeppe: That's a completely independent problem13:14
rogpeppeoh?13:14
niemeyerrogpeppe: Yep.. dynamic member sets is a different problem13:14
niemeyerrogpeppe: We can't hold on forever waiting for everybody to be available or the redundancy is over13:15
rogpeppeniemeyer: it's not a dynamic member set though - we know we've started 3 zk machines, we just want to wait for them to be up13:15
niemeyerrogpeppe: A single machine terminated would kill juju13:15
niemeyerrogpeppe: If a single one of them terminates, it will never be up13:15
rogpeppeniemeyer: we're not waiting for the *machine* to be available, we're waiting for its DNS name. isn't that different?13:15
rogpeppeniemeyer: the DNS name is still available if a machine is terminated, i think.13:16
niemeyerrogpeppe: If a machine is terminated the entire instance record will go away in a bit13:16
rogpeppeniemeyer: that takes hours. and we can guard against that easily too.13:18
rogpeppeniemeyer: by only doing a short timeout when calling Instances in WaitDNSName13:18
niemeyerrogpeppe: I'm sure there are many workarounds for the problem. I was just describing it, and wishing that we didn't introduce it13:19
rogpeppeniemeyer: i think we *should* wait for all the zk addresses. but i think we shouldn't do a 3 minute timeout if the instances aren't there.13:20
niemeyerrogpeppe: The Instances call above is equally problematic, btw.. it's aborting if any of the instances are not found13:20
niemeyerrogpeppe: Uh oh.. :(13:20
rogpeppeniemeyer: good point. that *is* a bug.13:21
niemeyerrogpeppe: Man.. the issue is trivial: if a machine in a set of three never starts, we can't block *all of juju* because of that..13:21
niemeyerrogpeppe: Same thing if it terminates13:21
niemeyerrogpeppe: It simply makes no sense to have multiple machines if that's what we're doing13:22
rogpeppeniemeyer: terminating isn't a problem.13:22
niemeyerrogpeppe: The logic in your branch breaks if one terminates.13:22
niemeyerrogpeppe: That is a problem.13:22
niemeyerrogpeppe: Let's please fix that.13:23
rogpeppeniemeyer: agreed. but that's a different problem.13:23
rogpeppeniemeyer: that doesn't mean we shouldn't try to wait for all the DNS names while the instances are still around.13:23
rogpeppeniemeyer: (and running)13:24
niemeyerrogpeppe: The logic in your branch breaks if a machine never gets allocated.13:24
niemeyerrogpeppe: This is a bug. Let's fix it.13:24
rogpeppeniemeyer: yes, i've agreed about that.13:24
niemeyerrogpeppe: No, you just said otherwise.13:24
niemeyer<rogpeppe> niemeyer: that doesn't mean we shouldn't try to wait for all the DNS names while the instances are still around.13:24
rogpeppeniemeyer: if a machine never gets allocated, the instance won't be around13:25
rogpeppeniemeyer: so we won't wait for its DNS name13:25
niemeyerrogpeppe: It will stay in "pending" mode..13:25
niemeyerrogpeppe: What happens then?13:26
rogpeppeniemeyer: so a machine can stay in pending mode forever?13:26
niemeyerrogpeppe: Bad things happen.. machines go from pending to terminated.. stay in pending for a very long time, etc13:26
rogpeppeniemeyer: if that's so, i agree it's an issue. but...13:27
rogpeppeniemeyer: if we go the other way, even if we have 3 zk instances, all the initial agents will only ever talk to one of them.13:27
niemeyerrogpeppe: The logic you're introducing is trying to handle the unfriendliness of EC2.. we must not go overboard with it. Having an application that hangs for several minutes in edge cases is not nice.13:27
rogpeppeniemeyer: so if that one goes down, the initial agents are stuffed13:27
niemeyerrogpeppe: If there are known machines in a set of zk instances, let's not block waiting forever when we *know* about *good working machines*13:28
rogpeppeniemeyer: ... which means we'll stop when we have exactly *one* machine, right?13:28
niemeyerrogpeppe: That's fine!  If we have to reconnect, well.. we have to reconnect!13:29
niemeyerrogpeppe: Purposefully introducing a resilience bug to fix resilience is.. hmm.. interesting.13:29
rogpeppeniemeyer: ok, so it's important for the state client to know that it should always re-get the StateInfo when it reconnects, right?13:29
rogpeppeactually, no this won't work13:30
niemeyerrogpeppe: I don't know.. but whatever we do, there's a bug locallly, in that small subworld we're handling.13:30
rogpeppebecause the zk client carries on redialling the set of known machines regardless of whether they're up or down13:30
niemeyerrogpeppe: Up to us.. we get the disconnection events, and as far as I understood, our approach was going to be reestablishing the connection.13:31
rogpeppeniemeyer: we don't get the disconnection events AFAIK13:31
niemeyerrogpeppe: ZooKeeper notifies about disconnections.13:32
rogpeppeah, maybe it's just the initial connection attempt that goes on forever13:32
niemeyerrogpeppe: and our last agreement was that we were going to handle any connection issues as fatal.13:32
niemeyerrogpeppe: We might even be smarter at some point, and wait for longer if we knew that we had *just* bootstrapped13:33
niemeyerrogpeppe: Having that as a general rule over the lifetime of the environment is not reasonable, though13:33
niemeyerrogpeppe: ZooKeeper certainly tries to reestablish the connection.. that's not the same as not having notification of disconnections.13:34
rogpeppeniemeyer: yeah, given that it makes no difference if we pass 1 or 3 addresses to the zk client, there's no point in waiting for more than one.13:34
niemeyerrogpeppe: Arguably, we're doing more work on our side, but as far as I could perceive so far, that sounds like a good idea anyway13:35
niemeyerrogpeppe: Trusting on the internals of zk to achieve long term reliability hasn't been fruitful13:35
rogpeppeniemeyer: yeah. and it makes it easier to move to something else if we want to, perhaps.13:35
niemeyerrogpeppe: I guess we do have the instance start time, right?13:36
* rogpeppe goes to have a look13:36
niemeyerrogpeppe: I'd be happy to wait in that loop as you suggest if we take that into consideration13:36
rogpeppeniemeyer: yeah, there's LaunchTime although it's not in goamz/ec2 yet13:38
niemeyerrogpeppe: Maybe let's just keep it simple then, and return the first good set of machines we have13:38
rogpeppeniemeyer: i'll wait in parallel and return the first address we get. that's easiest i think.13:39
rogpeppeniemeyer: i could wait for a short time after the first one, in case several arrive together.13:39
niemeyerrogpeppe: I don't think it needs to be parallel13:39
rogpeppeniemeyer: oh, just poll DNSName on all of them?13:39
niemeyerrogpeppe: Put Instances itself in a loop.. once we have a batch with good DNSNames, return all of the ones that have an address assigned13:40
niemeyerrogpeppe: Instinctively, I believe there are good chances that we'll get multiple entries at once13:40
niemeyerrogpeppe: Since reservations are grouped13:40
rogpeppeyeah, that's better. i was going to call WaitDNSName in parallel.13:40
rogpeppebut that way is better13:40
rogpeppeniemeyer: all our instances are in separate reservations.13:40
niemeyerrogpeppe: Hmm, good point..13:41
niemeyerrogpeppe: Even then, there must be a queue in the server side13:41
rogpeppeniemeyer: who knows? given that we're waiting for a second before polling, we'll get any within that second.13:42
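
A rough sketch of the polling loop just agreed on: keep calling Instances for the zookeeper machines and, as soon as at least one of them has a DNS name, return the addresses of every instance that already has one. Everything here (the Environ and Instance interfaces, the error value, the port suffix, the poll interval) is assumed for illustration rather than taken from the branch.

    package ec2

    import (
        "errors"
        "time"
    )

    var ErrPartialInstances = errors.New("some instances not found")

    type Instance interface {
        DNSName() (string, error)
    }

    type Environ interface {
        Instances(ids []string) ([]Instance, error)
    }

    // waitStateAddrs polls until at least one zookeeper machine has a DNS
    // name, then returns the addresses of all machines that have one so far.
    func waitStateAddrs(e Environ, ids []string) ([]string, error) {
        for {
            insts, err := e.Instances(ids)
            if err != nil && err != ErrPartialInstances {
                return nil, err
            }
            var addrs []string
            for _, inst := range insts {
                if inst == nil {
                    continue // this id has no instance record yet
                }
                if addr, err := inst.DNSName(); err == nil && addr != "" {
                    addrs = append(addrs, addr+":2181")
                }
            }
            if len(addrs) > 0 {
                return addrs, nil
            }
            time.Sleep(time.Second)
        }
    }
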
niemeyerrogpeppe: Btw, I'm also working a bit on a quick clean up of goamz this morning.. I'll put all the packages in a single branch13:42
rogpeppeniemeyer: yay!13:42
rogpeppeniemeyer: (i'm glad you got that fix in before go 1)13:43
niemeyerrogpeppe: Yeah, it'd be tricky later13:43
rogpeppeniemeyer: thanks for the discussion BTW. i found it very useful.13:43
niemeyerrogpeppe: I'm also including all the pending crack around mturk, sns, and sdb onto an exp/ subdir13:43
rogpeppegood plan.13:44
niemeyerThey're not in a great state, but I want to make them into official branches since I've been doing a poor job reviewing/improving them13:44
niemeyerrogpeppe: Yeah, it was instructional for me too, thanks as well13:44
fwereadelunch,bbiab14:03
rogpeppeniemeyer: all done, i think14:35
niemeyerrogpeppe: Cheers14:36
rogpeppeTheMue: i just added my review to fwereade's.15:05
TheMuerogpeppe: yep, got the notification15:06
TheMuerogpeppe: thx15:06
rogpeppeTheMue: np15:06
TheMuerogpeppe: maybe rietveld should add a +1 or Like button ;)15:09
rogpeppeTheMue: +1 :-)15:10
rogpeppelol "15:28
rogpeppeDon't worry.. I can't feel my toes anymore. They're already crushed15:28
rogpeppefrom fixing code with bad error handling.15:28
rogpeppe"15:28
TheMuefwereade: thx, will change it from map to the value. but do you validate every data you retrieve from a backend (here zk) before returning it?15:40
fwereadeTheMue, it just strikes me as sensibly paranoid ;)15:41
TheMuefwereade: :D15:42
niemeyerrogpeppe: You may want to delete this branch: https://code.launchpad.net/~rogpeppe/goamz/s315:42
niemeyerrogpeppe: Has no content15:42
fwereadeTheMue, there's certainly a case to be made that it's excessively paranoid ;)15:42
rogpeppefwereade, TheMue: personally i treat all data that comes from outside the program itself as suspect.15:43
rogpeppeniemeyer: done15:43
fwereaderogpeppe, indeed, the additional cost is minimal15:43
niemeyerrogpeppe: Cheers15:44
niemeyerrogpeppe, fwereade, TheMue: goamz is now a single branch15:44
fwereadeniemeyer, cool15:44
niemeyerYou'll have to rm -rf $GOPATH/launchpad.net/goamz15:44
rogpeppeniemeyer: cool15:44
niemeyerand then15:44
rogpeppego get -u won't work, presumably...15:44
niemeyergo get launchpad.net/goamz/aws15:44
rogpeppethat's a pity15:44
niemeyerrogpeppe: Yeah, unfortunate but necessary15:44
rogpeppehmm, i'll just check i've got no outstanding branches before i rm -r...15:45
TheMuerogpeppe: so you validate all data you read from ZooKeeper too?15:46
rogpeppeTheMue: definitely. who knows what other dodgy programs have been at work there.15:46
rogpeppeTheMue: i'd only validate as necessary though. no need to fail if you don't need to.15:46
TheMuerogpeppe: how far goes your validation? how do you take care that the whole topology stored in one node is valid? or the combination of all nodes including their contents we use in state?15:47
rogpeppeTheMue: depends what i'm doing with it. i'd check the error with it's parsed. and i'd check that any expectations i have of it are true. but i wouldn't check what i don't need to.15:49
TheMuebtw, do other applications write into our ZK instance?15:49
rogpeppes/with it's/when it's/15:49
niemeyerrogpeppe: There's at least nothing in review15:59
rogpeppeniemeyer:16:02
rogpeppe% go get -u launchpad.net/goamz/aws16:02
rogpeppepackage launchpad.net/goamz/aws: directory "/home/rog/src/go/src" is not using a known version control system16:02
rogpeppehmm16:02
rogpeppeTheMue: "other applications" might be just a buggy old version of our own code...16:02
niemeyerrogpeppe: Try again please16:03
rogpeppeniemeyer: same result.  but no delay - it seems like it's decided already...16:03
niemeyerrogpeppe: Remove your local stuff16:03
rogpeppeniemeyer: which local stuff? you mean change my GOPATH?16:04
niemeyerrogpeppe: rm -rf $GOPATH/launchpad.net/goamz16:05
rogpeppeniemeyer: i did that16:05
rogpeppeniemeyer: there's no goamz directory16:05
rogpeppeniemeyer: but i changed my GOPATH and it works16:05
niemeyerrogpeppe: I've just installed it locally, and it works16:05
niemeyerrogpeppe: So you probably have left over data16:05
niemeyerLunch time16:06
niemeyerbiab16:06
TheMuerogpeppe: that surely may be. so i would like you and fwereade to do a walkthrough of the state code and take a look at where further verification of information retrieved out of ZK is needed.16:06
rogpeppeniemeyer: it's worked now. bizarre.16:07
rogpeppeTheMue: i've been looking out for it, so you're probably ok.16:07
TheMuerogpeppe: ok, thx16:07
rogpeppeniemeyer: http://paste.ubuntu.com/876199/16:09
TheMuerogpeppe: how would you call the type for the Resolved… values?16:27
* rogpeppe has a look16:28
rogpeppeTheMue: i can't say i really understand what they're about, but perhaps ResolvedKind?16:29
TheMuerogpeppe: inside the node's content they are stored as "retry = 1000" or "retry = 1001". so maybe RetryKind is better.16:30
rogpeppeTheMue: seems ok to me. fwereade should ok it though - he knows what's going on much better than i!16:31
TheMuefwereade: what do you say?16:32
fwereadeTheMue, rogpeppe: I guess I'd say RetryKind or maybe RetryMode16:32
rogpeppefwereade: out of interest, what do these values mean?16:33
fwereaderogpeppe, just jump to the next state, or try rerunning the hook and go back to error state if it fails16:33
rogpeppefwereade: ok thanks16:34
TheMuerogpeppe, fwereade: thx, so i'll take RetryMode, looks best16:36
rogpeppeTheMue: sounds good16:36
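
For the record, a minimal sketch of how the RetryMode type settled on above might be declared. The numeric values 1000 and 1001 are the ones quoted from the node content earlier; which value maps to which behaviour is not stated in the discussion, so the constant names and the mapping are guesses.

    package state

    // RetryMode describes what the resolved flag asks the unit agent to do:
    // either just move on to the next state, or re-run the failed hook and
    // drop back to the error state if it fails again.
    type RetryMode int

    const (
        RetryNone  RetryMode = 1000 // assumed mapping: skip the hook
        RetryHooks RetryMode = 1001 // assumed mapping: re-run the hook
    )
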
=== asavu_ is now known as asavu
rogpeppethis is weird. i've got one branch that's definitely different from another one, but running bzr diff between the two branches gives me no output.17:02
rogpeppehow can that happen?17:02
rogpeppethere must be some subtlety about "diff" that i don't get.17:05
rogpeppefwereade: any idea?17:05
fwereaderogpeppe, no idea... I know "bzr diff" for uncommitted changes and "bzr diff --old lp:somewhere" and that's about it17:08
niemeyerrogpeppe: How are you sure that there are differences?17:08
rogpeppeniemeyer: i did md5sum on the files17:08
rogpeppeah, i forgot the --old flag17:08
rogpeppeah that worked!17:09
rogpeppei wonder what it was doing when i didn't give it the --old flag.17:09
rogpeppefwereade, niemeyer: thanks. i'm stupid.17:09
niemeyerMan, I just got the strongest espresso I can remember17:21
rogpeppeniemeyer: do those changes to go-ec2-robustness look ok, BTW?17:23
niemeyerrogpeppe: Haven't re-reviewed yet17:23
rogpeppeok17:23
niemeyerrogpeppe: Just finishing the goamz stuff and will go back17:24
niemeyerrogpeppe, fwereade_, TheMue: FYI, https://groups.google.com/d/msg/goamz/cTZ5xmQeLQI/gFjSMMVrMbEJ17:36
rogpeppeniemeyer: pity all that history has gone17:37
fwereade_niemeyer, cheers17:38
fwereade_and I think I'm done for the week -- happy weekends all :)17:38
niemeyerfwereade_: Cheers17:52
niemeyerfwereade_, rogpeppe, TheMue: Early next week, it'd be nice to have a call with the four of us17:52
niemeyerTo discuss a bit about how things are going in the port and align17:52
rogpeppeniemeyer: sounds good17:52
niemeyerrogpeppe: On the branch17:59
rogpeppe niemeyer: thanks17:59
niemeyerrogpeppe: Nice reuse among DNSName and WaitDNSName18:02
rogpeppeniemeyer: thanks. i was pleased how that turned out.18:02
niemeyerrogpeppe: Need to check for NoInstances too in StateInfo18:03
rogpeppeniemeyer: i think that's ok - it'll just return the error18:04
niemeyerrogpeppe: Sure, but that's not what we want, is it?18:04
rogpeppeniemeyer: i think it is - if the instances don't exist (remember we've already timed out for eventual consistency) then we don't want to wait for them18:05
niemeyerrogpeppe: Ah, you're right, thanks18:05
niemeyerrogpeppe: Need to check nil within the loop, though, right?18:06
rogpeppeniemeyer: oops yes, good catch!18:06
niemeyerrogpeppe: We'll have to pay special attention when coding/reviewing logic with ErrPartialInstances.. feels easy to miss indeed18:07
rogpeppeniemeyer: that's the hazard of returning incomplete data, i think. i'm not sure we can get around it.18:09
rogpeppeniemeyer: pushed a fix.18:10
niemeyerrogpeppe: Using ShortAttempt in live_test.go feels a bit like abusing private details of the implementation18:11
niemeyerrogpeppe: It'd be nice to have a trivial Sleep there, and remove all the hacking from export_test.go18:12
rogpeppeniemeyer: not sure. it's important that the local tests have a short sleep because otherwise the tests take much longer18:13
rogpeppeniemeyer: but maybe i should just have a variable and be done with it18:14
niemeyerrogpeppe: Only if they are broken, right?18:14
rogpeppeniemeyer: no. quite a few of the tests test stuff that's deliberately broken.18:14
rogpeppeniemeyer: so the timeout is exercised quite a bit18:14
niemeyerrogpeppe: No, I mean it will only take a while if the test is broken18:14
rogpeppeniemeyer: it makes the tests run for 20 or 30 seconds rather than 5.18:15
rogpeppeniemeyer: i'm not sure i understand18:15
niemeyerrogpeppe: I don't understand why..18:15
rogpeppeniemeyer: because if you're testing that getting a non-existent instance (for instance) actually returns an error, then the code has to time out18:15
niemeyerrogpeppe: There's a single use of ShortAttempt18:17
niemeyerrogpeppe: and it repeats until it breaks18:17
rogpeppeniemeyer: yes, but that test is run multiple times.18:17
rogpeppeniemeyer: one time for each scenario18:17
niemeyerrogpeppe: If it is run 10 times, and it takes 1 second to break on each, it's still 10 seconds18:17
niemeyerrogpeppe: How many times is it repeating, and why does it take so long?18:18
=== TheMue_ is now known as TheMue
rogpeppeniemeyer: the timeout is 5 seconds not one second18:18
niemeyerrogpeppe: It doesn't reach the timeout, unless the test is broken!18:18
rogpeppe[18:15] <rogpeppe> niemeyer: because if you're testing that getting a non-existent instance (for instance) actually returns an error, then the code has to time out18:18
niemeyer<niemeyer> rogpeppe: There's a single use of ShortAttempt18:19
niemeyer<niemeyer> rogpeppe: and it repeats until it breaks18:19
rogpeppeniemeyer: oh! sorry, i'd forgotten the real reason. the timeouts *inside* the ec2 package are exercised.18:20
rogpeppeniemeyer: the ShortAttempt inside TestStopInstances isn't a problem. it could be hardwired as you suggest.18:21
rogpeppeniemeyer: (which would cut out some of the hackery in export_test)18:21
niemeyerrogpeppe: Ah, cool, we were talking about different things then18:22
rogpeppeniemeyer: but we'd still need SetShortTimeouts18:22
niemeyerrogpeppe: Yes, I was referring to those18:22
niemeyerrogpeppe: That's fine18:22
rogpeppeniemeyer: fix pushed.18:24
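
For context, a self-contained sketch of the kind of retry strategy the ShortAttempt discussion is about: production code retries for seconds or minutes, while the tests swap in a strategy measured in milliseconds so that paths which are meant to time out do so quickly. The names and fields below are assumptions, not the branch's actual code.

    package ec2

    import "time"

    // attemptStrategy describes how long to keep retrying an operation
    // and how long to pause between tries.
    type attemptStrategy struct {
        total time.Duration // total time to keep trying
        delay time.Duration // pause between tries
    }

    // shortAttempt is what production code might use for quick retries;
    // tests would overwrite it with something like {total: 100ms, delay: 10ms}
    // so that deliberately failing calls finish fast.
    var shortAttempt = attemptStrategy{total: 5 * time.Second, delay: 200 * time.Millisecond}

    // do retries f until it succeeds or the strategy's total time is spent,
    // returning the last error seen.
    func (s attemptStrategy) do(f func() error) error {
        end := time.Now().Add(s.total)
        for {
            err := f()
            if err == nil || time.Now().After(end) {
                return err
            }
            time.Sleep(s.delay)
        }
    }
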
rogpeppeniemeyer: i'm going to have to go very shortly, BTW18:26
niemeyerrogpeppe: Cool, let me check that quickly so that hopefully you can get away with a pleasant submission :)18:27
rogpeppeniemeyer: that would be marvellous18:28
rogpeppeniemeyer: next two branches before actually running zookeeper and connecting to it are very small BTW. we're nearly there!18:29
rogpeppes/before actually running/getting/18:29
niemeyerrogpeppe: Oh, that's great to hear.. I'm feeling so much behind on reviews18:29
niemeyerrogpeppe: AWESOME!18:32
niemeyerrogpeppe: LGTM18:32
rogpeppeniemeyer: brilliant, thanks a lot. i've gotta go and do a gig now!18:33
rogpeppeniemeyer: see ya monday.18:33
niemeyerrogpeppe: Have a great weekend!18:34
rogpeppeniemeyer: toi aussi18:34
niemeyerrogpeppe: Cheers :)18:34
* TheMue has changed https://codereview.appspot.com/5727045. now i can go. have a good weekend, niemeyer 18:43
TheMueniemeyer: bye, c u monday18:49
niemeyerTheMue: Have a good one.. will try to unblock you for next week18:49
TheMueniemeyer: getting good feedback by rog and fwereade, so it's ok18:50
niemeyerTheMue: Sweet, good to see team work playing there18:51
TheMueniemeyer: yep, absolutely18:51
TheMueniemeyer: so have a good weekend18:51
niemeyerTheMue: Thanks, a great one to you too!18:54
TheMueniemeyer: thx18:54
* niemeyer breaks..19:44
