#juju-dev 2012-12-03
<davecheney> [ANN] juju-core 1.9.3 has been released
<davecheney> https://lists.ubuntu.com/archives/juju-dev/2012-December/000333.html
<wallyworld_> davecheney: g'day, i have a problem you might be able to help me solve if you have a moment to take a look.
<davecheney> wallyworld_: shoot
<wallyworld_> davecheney: see https://codereview.appspot.com/6874049/patch/1/7, the TODO about 2/3 of the way down // TODO: WHY DOESN't THIS WORK?
<wallyworld_> you may not have an idea of what's wrong. i can't see it right now
<wallyworld_> the array ends up filled with the last id
<wallyworld_> so if there are 2 servers, the array becomes [id2, id2] instead of [id1, id1]
<wallyworld_> [id1, id2] i mean
<davecheney> wallyworld_: yes, this is one of two gotchas with go
<wallyworld_> only 2?
<davecheney> (the other is also to do with scoping of variables)
<davecheney> (well, two that I consider gotchas)
<wallyworld_> so what am i missing?
<wallyworld_> i use append elsewhere in similar circumstances and it seems to work
<davecheney> if you do s := server ; insts = append( ...)
<davecheney> it will work
<wallyworld_> wtf? why?
<davecheney> hold on, i'll explain
<davecheney> just looking at why you're doing := range *servers
<wallyworld_> ok
<wallyworld_> the api returns a pointer to an array of structs
<wallyworld_> due to that big discussion i had on irc last week
<wallyworld_> about  returning nil if the json failed
<davecheney> ok, i don't want to get into that
<wallyworld_> me either :-)
<wallyworld_> just providing context
<wallyworld_> as to why it is a pointer
<davecheney> if we break down the for loop, what is happening behind the scenes is you're getting
<davecheney> var server Nova.server
<davecheney> for i := 0 ; i < len(*servers); i++ {
<davecheney>     server = *servers[j]
<davecheney>  ... then your code
<davecheney> }
<davecheney> (this is part of the explanation)
<wallyworld_> *servers[i]?
<davecheney> do you agree that this can be substituted for the range above
<wallyworld_> yes
<wallyworld_> with the i
<davecheney> yes, sorry
<wallyworld_> np, just making sure
<davecheney> so inside each loop iteration it becomes
<davecheney> server = *servers[i] (hidden)
<davecheney> insts = append(insts, &instance{e, &server})
<davecheney> &server is capturing the address of the local server variable every time
<davecheney> the same local server variable
<davecheney> which is why they all end up as the last value assigned to it
 * wallyworld_ thinks
<wallyworld_> davecheney: so by doing s:= server, it forces a copy to be made?
<davecheney> yes
<wallyworld_> ok, i think i get it. is the moral of the story not to use loop variables in certain circumstances?
<davecheney> when taking their address
<davecheney> sadly this is just one thing you have to be ever vigilant for
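A minimal, self-contained sketch of the gotcha described above. The novaServer and instance types are hypothetical stand-ins for the real goose/juju-core types, and the output shown is Go's behaviour at the time (Go 1.22 later gave range loops per-iteration variables):

```go
package main

import "fmt"

type novaServer struct{ id string }
type instance struct{ server *novaServer }

func main() {
	servers := []novaServer{{"id1"}, {"id2"}}

	// Buggy: &server is the address of the single loop variable, so
	// every appended element points at whatever it held last.
	var broken []*instance
	for _, server := range servers {
		broken = append(broken, &instance{server: &server})
	}
	fmt.Println(broken[0].server.id, broken[1].server.id) // id2 id2

	// Fixed: copy the loop variable into a fresh local each iteration
	// and take the address of that copy instead.
	var fixed []*instance
	for _, server := range servers {
		s := server
		fixed = append(fixed, &instance{server: &s})
	}
	fmt.Println(fixed[0].server.id, fixed[1].server.id) // id1 id2
}
```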
<wallyworld_> right, ok. thanks so much. i wasted a loooong time on this
<davecheney> sorry, it is one of the rites of passage
<wallyworld_> lol
<wallyworld_> i still don't feel like Go is "native" just yet, but it's getting there
<davecheney> that's good to hear
<davecheney> i'm constantly amazed at how quickly people can pick up the language
<davecheney> it tells me it's doing something right
<wallyworld_> yeah, there's some stuff i don't like, but it's interesting to work with
<davecheney> wallyworld_: btw, do you feel that any sort of decision was reached about the 'when and how should we meet' discussion ?
<davecheney> in the meeting I got the impression, and looking at the notes, that the decision to not change anything was made
<wallyworld_> my understanding was alternating times 21:00UTC and 09:00UTC
<davecheney> then the day after Mark posted an email saying we're going to do that
<davecheney> alternating times ?!
<davecheney> i read that as two meetings
<wallyworld_> it was explained to me as alternating and you would make whichever meeting you could
<wallyworld_> or both
<davecheney> but, alternating every other week ?
<wallyworld_> yes
<wallyworld_> unless i am mistaken
<davecheney> so we can go an entire milestone cycle without meeting or feedback ?
<wallyworld_> is each milestone cycle 2 weeks?
<wallyworld_> i think for you and me we can make both
<davecheney> you can see we run a pretty loose ship on juju-core, but yes, ish
<wallyworld_> to be honest, the current one that is 10pm for me i can make most weeks
<davecheney> it's 11pm for me
<wallyworld_> yeah, DST :-)
<davecheney> and i'm getting pressure from home
<wallyworld_> which i guess is why they changed it
<wallyworld_> timezones suck sometimes
<wallyworld_> happy wife, happy life
<wallyworld_> i'm not sure if tomorrow's meeting is still at 10pm, will have to check
<davecheney> DST is a bitch if you need to coordinate with western europe
<davecheney> 7pm UK time is 6am AEST
<davecheney> (or is that AEDT)
<wallyworld_> i thought it was 7am AEST
<davecheney> nup
<davecheney> http://everytimezone.com/
<wallyworld_> not 7pm, 9pm
<wallyworld_> 21:00UTC
<wallyworld_> so +10 is 7am for me
<TheMue> Morning
<davecheney> morning
<TheMue> davecheney: Had a nice weekend?
<rogpeppe2> morning all
<TheMue> Hiya rogpeppe
<fwereade> rogpeppe, TheMue, heyhey
<fwereade> rogpeppe, TheMue: I would appreciate your opinions on the Living interface
<TheMue> fwereade: Hi
<rogpeppe> fwereade: just jotting down some thoughts i had over the weekend. will have a look in a bit, thanks for the heads up.
<fwereade> rogpeppe, TheMue: it only seems to be used for the tests
<fwereade> rogpeppe, TheMue: and I have slight concern that it doesn't really fit our needs
<rogpeppe>  fwereade: what CL are we talking about?
<fwereade> rogpeppe, it's a piece that's already in place
<rogpeppe> fwereade: ah
<fwereade> rogpeppe, state/life.go
<fwereade> rogpeppe, it has EnsureDying/EnsureDead
<rogpeppe> fwereade: you're objecting to those methods?
<fwereade> rogpeppe, we took those methods off Relation because they didn't make sense
<TheMue> fwereade: If you say you have concerns that it fits our needs, then please start by defining what you think our needs are.
<fwereade> rogpeppe, and while they're ok for unit they don't express enough
<rogpeppe> fwereade: do you think Unit needs a richer interface than a simple "EnsureDead"?
<rogpeppe> fwereade: or EnsureDying
<fwereade> rogpeppe, yeah -- remove-unit --force
<fwereade> rogpeppe, we need a way to say, politely, "please kill yourself"; a way to say politely "I'm done, clean me up"; and a way to force dead immediately
<rogpeppe> fwereade: ah, so EnsureDying isn't allowed to take account of things that might block the EnsureDying
<fwereade> rogpeppe, yeah, nothing should block an EnsureDying
<rogpeppe> fwereade: emr
<fwereade> rogpeppe, depending on the circumstances, the existence of subordinates may or may not want to block EnsureDead
<rogpeppe> erm
<rogpeppe> fwereade: istm that EnsureDying *is* a way to say, politely, "please kill yourself"
<rogpeppe> fwereade: and that EnsureDead *is* a way to force dead immediately
<TheMue> rogpeppe: That's how I understood it too.
<fwereade> rogpeppe, yes... we have 3 things to express, and 2 0-arg methods
<rogpeppe> fwereade: that doesn't sound like the current interface is wrong, but that some things may need more
<fwereade> rogpeppe, maybe
<TheMue> fwereade: Please match the current methods to your three sentences above to make sure which one you don't see covered.
<fwereade> rogpeppe, bear in mind that it was already dropped from relation because it's useless
<rogpeppe> fwereade: how do you remove a relation then?
<fwereade> rogpeppe, TheMue: EnsureDying=please kill yourself, EnsureDead=die immediately; no way to politely signal your own death
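For context, a rough sketch of the interface being discussed, as inferred from the conversation (the real definition lives in state/life.go; the exact signatures here are assumptions):

```go
// Life is the lifecycle state of a state entity.
type Life int

const (
	Alive Life = iota
	Dying // politely asked to shut down; the entity cleans up and then marks itself Dead
	Dead  // finished; whatever is responsible for it may remove it from state
)

// Living is an assumed shape for entities with a lifecycle, based on
// the EnsureDying/EnsureDead methods mentioned above.
type Living interface {
	EnsureDying() error // "please kill yourself"
	EnsureDead() error  // "die immediately"
	Life() Life         // assumption: some way to read the current state
}
```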
<fwereade> rogpeppe, it depends
<fwereade> rogpeppe, I can dredge up the details if you like
<rogpeppe> fwereade: by "signal your own death", do you mean there's no way to tell others when you've died?
<fwereade> rogpeppe, there is no convenient way for me to coordinate the unit's death amongst various tasks
<TheMue> fwereade: So you want a kind of destructor. But that is something the entity has to signal. It isn't signalled from the outside to the entity.
<rogpeppe> fwereade: i'm not sure what you mean by that
<fwereade> rogpeppe, ok, maybe I should shift perspective around and see what sorts of statements we agree on
<fwereade> rogpeppe, the only thing that can sanely set a machine to Dead is a machine agent; agreed?
<fwereade> rogpeppe, and the only thing that can sanely set it to Dying is the user
<TheMue> fwereade: How would you name that third method? And who will call it?
<rogpeppe> fwereade: i'm not sure. what if the machine has gone away and isn't coming back?
<fwereade> rogpeppe, do you mean the "instance"
<rogpeppe> fwereade: yeah
<fwereade> rogpeppe, I don't think that's relevant then?
<fwereade> rogpeppe, juju will provision a new one and put the machine agent on it
<rogpeppe> fwereade: do we have to resurrect the instance and its machine agent in order to kill the Machine?
 * TheMue still thinks it's a wrong perspective. The first two methods are telling the entity that it has something to to, the missing one is used by the entity itself to signal the outer world that a state is reached and the outer world has something to do.
<TheMue> s/to to/to do/
<fwereade> rogpeppe, you're talking like we can *stop* juju from provisioning a new instance for a broken machine?
<fwereade> TheMue, ok
<fwereade> TheMue, what does service.EnsureDead do?
<fwereade> TheMue, btw you are wrong re methods: if an entity is dead, it must *not* do anything, and in the normal case of affairs it doesn't even see itself become dead because it sets and terminates itself
<TheMue> fwereade: It sets the entity to dead.
<fwereade> TheMue, the notion of EnsureDead from outside the entity is only sane in the case of units, and that's only because we carved out an exception for remove-unit--force
<fwereade> TheMue, ok, and what happens when you do that?
<fwereade> TheMue, and more to the point, when is it ok to do that?
<rogpeppe> fwereade: what if an instance goes down and the user calls terminate-machine before it's been reprovisioned?
<TheMue> fwereade: I would have to take a look what's done today. But IMHO after an entity is marked as dead it can be cleaned up.
<fwereade> rogpeppe, if we can detect that for sure, then we can remove it; no reason to make it dead
<fwereade> TheMue, think through what will happen if we set a service to Dead
<fwereade> TheMue, what should happen to the units of that service?
<TheMue> fwereade: I'm currently looking from a more abstract perspective about where methods have to apply.
<rogpeppe> fwereade: doesn't it have to be dead before we remove it?
<fwereade> rogpeppe, not at all, we don't do that with relations
<TheMue> fwereade: IMHO they should be set to dead too, before the service is marked as dead.
<fwereade> rogpeppe, I'm not sure the Dead state is even meaningful for relations or services
<fwereade> TheMue, how will you craft your request to set 100k units to Dead?
<TheMue> fwereade: I have no quick answer. But logically, if a service shall be set to dead, shouldn't the units be dead too?
<fwereade> TheMue, using the txn package, remember
<fwereade> TheMue, logically, yes
<fwereade> TheMue, but there are 2 possibilities here
<fwereade> TheMue, we can set services to dead, and get crack, or we can admit that service.EnsureDead is (probably) crack
<rogpeppe> fwereade: anyway, sorry that was a derail. say, for the sake of argument, we accept that only a machine agent can set its machine to dead. what then?
<rogpeppe> fwereade: are you saying that if we set something to dead, we may as well just remove it?
<fwereade> rogpeppe, ok, so machines are easy: Dying from outside, Dead from inside
<TheMue> rogpeppe: Do we mix topics here?
<fwereade> rogpeppe, not at all -- I'm saying that relations and services have no use for a Dead state, because we can just remove when appropriate
<fwereade> rogpeppe, units and machines *do* have a use for the dead state
 * TheMue would like to stay on the path of what should happen to units, independent of their number, when their service is told that it shall die.
<fwereade> rogpeppe, but the agreement thus far has been that the only one we should be setting to dead from outside was unit, and that in a not-especially-high-priority feature
<fwereade> TheMue, what we do is set the service to Dying
<fwereade> TheMue, every unit is then responsible for setting itself to dying and cleaning itself up
<fwereade> TheMue, the units' deployer waits for it to become dead
<TheMue> fwereade: Is every unit watching its service's lifecycle?
<fwereade> TheMue, yes
<TheMue> fwereade: Ah, ok, missed that, thanks.
<rogpeppe> fwereade: it sounds like you're arguing for removing EnsureDead
<fwereade> sorry 1 sec
<fwereade> rogpeppe, I'm not strongly arguing *for* anything, I'm just trying to raise the topic for discussion in the hope something will be synthesized therefrom
<fwereade> TheMue, to continue: then the deployer removes the unit
<TheMue> fwereade: And marks it as dead?
<fwereade> TheMue, the unit marked itself as dead
<fwereade> TheMue, the deployer is then responsible for uninstalling it and removing it from state
<TheMue> fwereade: How does it mark itself as dead?
<fwereade> TheMue, and removing its service from state if it was the last member of a dying service
<fwereade> TheMue, at the moment it calls EnsureDead
<TheMue> fwereade: The unit calls EnsureDead() on itself? Aha.
<fwereade> TheMue, but this will not work correctly in the presence of subordinates
<fwereade> TheMue, yes
<fwereade> TheMue, in general only an entity is meant to declare itself dead
<TheMue> fwereade: Could you give me a hint where in the code this happens?
<fwereade> TheMue, the EnsureDead? worker/uniter/modes.go
<fwereade> TheMue, ModeStop or something? ModeDie?
<TheMue> fwereade: So the uniter calls it, not the unit.
<fwereade> TheMue, what part of the unit has its own volition if not the uniter?
<rogpeppe> back in a mo
<TheMue> fwereade: It's only the wording I want to be sure about. If I talk about Unit it's state.Unit. Not about the uniter.
<fwereade> TheMue, sure, sorry, I didn't mean to be unclear
<TheMue> fwereade: I only wanted to clear that to not be confused.
<TheMue> fwereade: I've never been deep in the uniter. So that's important for me to get a similar context.
<fwereade> TheMue, I think of a Unit as a dumb type manipulated by interesting clients -- the clients are the agents and the command line
<rogpeppe> back
<TheMue> fwereade: OK, now it's more clear, thanks.
<fwereade> rogpeppe, actually, how do you expect we will set a machine to dead? thinking about it, that sounds like a job for the machiner
<fwereade> rogpeppe, and it sounds like exactly the same problem as the uniter has
<rogpeppe> fwereade: is the provisioner right to allocate a new instance for a machine that's set to dying?
<fwereade> rogpeppe, if it's still meant to be running a unit, yes, unless you have a better idea
<fwereade> rogpeppe, but I forget: did we agree that we couldn't set a machine to dying until it had no assigned units?
<rogpeppe> fwereade: i thought we could always set something to *dying*
<fwereade> rogpeppe, I haven't thought so much about machine lifecycles, which is probably why I perceive them as a simple problem.. I bet they aren't really ;p
<rogpeppe> fwereade: i have a suspicion we have some muddy thinking about the machine/instance/unit relationship
<fwereade> rogpeppe, ok, so terminate-machine shouldn't actually terminate the machine?
<fwereade> rogpeppe, ok, please explain your view of the situation
<fwereade> rogpeppe, is it your position that terminate-machine should also terminate all the units on that machine?
<rogpeppe> fwereade: i don't have a strong notion of *how* our thinking is muddy, just that i don't clearly understand things, and i'm not sure anyone else does
<fwereade> rogpeppe, because it has hitherto meant "I will never use this unused machine again", and been blocked by the machine's having units deployed
<rogpeppe> fwereade: ok, well that sounds reasonable
<fwereade> rogpeppe, ok, you know what? this makes me question EnsureDying/EnsureDead still more
<rogpeppe> fwereade: so... say a machine is unused but its instance has just died; someone calls terminate-machine before the provisioner has noticed that the instance is dead.
<rogpeppe> fwereade: when the provisioner notices that the instance is dead, should it allocate another one?
<rogpeppe> fwereade: i don't see any question about EnsureDying
<rogpeppe> fwereade: EnsureDead seems to be the problematic part of the interface
<fwereade> rogpeppe, ok, what does it do?
<fwereade> rogpeppe, please describe the chain of consequences from EnsureDying
<rogpeppe> fwereade: the provisioner can see when an instance dies, right?
<fwereade> rogpeppe, yeah, I think so, roughly
<rogpeppe> fwereade: (i'm not actually sure that it can, tbh, but this we seem to believe it)
<rogpeppe> s/this we/we/
<fwereade> rogpeppe, I think it is possible to infer it anyway
<rogpeppe> fwereade: i'm not sure it is
<rogpeppe> fwereade: in the presence of network failures
<rogpeppe> fwereade: anyway, assuming we can
<rogpeppe> fwereade: when the provisioner sees an instance go down, it allocates a new one and assigns the old Machine to it, right?
<fwereade> rogpeppe, yeah
<rogpeppe> fwereade: so if someone has called EnsureDying on that machine, should the provisioner really allocate a new instance for the machine?
<rogpeppe> fwereade: (surely not?!)
<fwereade> rogpeppe, if a machine is Dying it means "I will set myself to Dead when I have discharged my responsibilities"
<rogpeppe> fwereade: what is a machine without an instance?
<fwereade> rogpeppe, I am reluctant to stipulate that the machine agent can die with no responsibilities undischarged
<rogpeppe> fwereade: what responsibilities does a machine agent have if it has no units?
<fwereade> rogpeppe, I suspect storage handling will come into play mostly at the machine agent level
<fwereade> rogpeppe, but, ok -- that's a potential derail
<rogpeppe> fwereade: i just think it's weird that we'd allocate a new instance only to shut it down immediately
<fwereade> rogpeppe, what you say makes sense, but you're not taking it far enough
<rogpeppe> fwereade: i think the provisioner can proxy for the machine agent in this instance
<fwereade> rogpeppe, what does Dying actually mean if the unit agent can only go to Dying when it has no responsibilities left to discharge?
<fwereade> rogpeppe, it means Dying is not a sane state for a Machine to have
<fwereade> rogpeppe, it will become so if/when we change the rules ofc...
<rogpeppe> fwereade: sorry, you're mixing up machine dying and unit dying in a way i don't quite understand
<fwereade> rogpeppe, shit sorry s/unit/machine/
<fwereade> s/unit agent/machine/ actually
<rogpeppe> fwereade: it sounds reasonably sane to me
<fwereade> rogpeppe, what is the distinction between Dying and Dead for a machine?
<rogpeppe> fwereade: i think it's a pity we haven't actually implemented terminate-machine yet...
<fwereade> rogpeppe, although, sorry, I'm derailing a little and I have no particular interest in fighting this one -- I take a similar Dying shortcut in the Deployer, actually
<fwereade> rogpeppe, but I'm actually questioning it
<rogpeppe> fwereade: because you're saying that a machine can only go to Dying when it has no responsibilities left to discharge, but that's not the case currently
<rogpeppe> fwereade: anyone can set a machine to dying at any time
<fwereade> rogpeppe, ok, well then that is surely crack
<fwereade> rogpeppe, why timebomb your machines like that?
<rogpeppe> fwereade: i'm not sure why
<rogpeppe> fwereade: you might have a good reason
<rogpeppe> fwereade: you might not want a new machine to be allocated
<fwereade> rogpeppe, the only thing that does is to say "keep this machine around for an arbitrary length of time, but don't forget the irrevocable kill command that will take effect when its current purpose is fulfilled"
<rogpeppe> fwereade: so calling terminate-machine before removing a unit means that we can ensure that's true
<rogpeppe> fwereade: that seems quite plausibly useful to me
<fwereade> rogpeppe, but you're also suggesting that the unit should get no opportunity to shut down cleanly if the instance goes away
<fwereade> rogpeppe, it's a total niche case
<fwereade> rogpeppe, defaulting to shutting down machines when unused would be more useful
<fwereade> rogpeppe, weeks-delayed kill commands are pretty esoteric ;)
<rogpeppe> fwereade: if we think that automatically shutting down a machine when it's unused is a good thing, then sure. but i think we've decided that it's not.
<fwereade> rogpeppe, I am working on the assumption that we will have some sort of persistent storage generally available, such that moving from one instance to another can be accomplished almost trivially
<fwereade> rogpeppe, my point is that if we're going to get clever about machine termination, delayed-kill is not the top priority
<rogpeppe> fwereade: well maybe, but there is overhead in starting an instance, and state in an instance that may not be easily stored.
<rogpeppe> fwereade: i think it falls out naturally from our model.
<fwereade> rogpeppe, yeah, depends on the provider
<rogpeppe> fwereade: and it *is* potentially useful behaviour
<fwereade> rogpeppe, it is potentially useful behaviour that we should restrain ourselves from implementing until we understand the use cases
<fwereade> rogpeppe, you know what
<fwereade> rogpeppe, Unit.EnsureDead, implemented correctly, is just *horribly* complex
<rogpeppe> fwereade: because it's hard to check all the unit's dependencies before setting state to dead?
<fwereade> rogpeppe, it needs to clean up some or all of: machine assignment, subordinates Dead-setting, relation removal, and service removal
<fwereade> rogpeppe, if we do it correctly, we have to just build up some monster transaction
<fwereade> rogpeppe, not to mention just cleaning up relation membership
<rogpeppe> fwereade: that's not Unit.EnsureDead - that's RemoveUnit
<rogpeppe> fwereade: EnsureDead and EnsureDying don't do anything other than set life state
<fwereade> rogpeppe, ok, drop service removal
<fwereade> rogpeppe, that's the only one
<rogpeppe> fwereade: it's up to the caller to make sure that the other invariants are preserved correctly.
<fwereade> rogpeppe, ha ha
<fwereade> rogpeppe, how is that possible?
<rogpeppe> fwereade: well usually we do that by not allowing new things to be added when something is in a dying state.
<fwereade> rogpeppe, to make sane transitions in system state we have to do these things as single transactions
<Aram> good morning
<rogpeppe> Aram: hiya!
<fwereade> Aram, heyhey
<fwereade> Aram, nice holiday?
<Aram> yeah.
<fwereade> Aram, cool
<rogpeppe> Aram: is your holiday finished now?
<Aram> yeah.
<fwereade> Aram, I landed your watchers -- we've identified the firewaller bug and TheMue is looking into it this week AIUI
<TheMue> fwereade, Aram: +1
<Aram> fwereade, could you please link me to the commit that fixed it?
<TheMue> Aram: Hello btw.
<Aram> hi.
<rogpeppe> fwereade: i'm not sure we do need to do these things as single transactions
<rogpeppe> fwereade: and that's why we have the dying/dead distinction
<fwereade> rogpeppe, Dying/Dead is about handover of responsibility surely
<TheMue> Aram: It's not yet fixed, only better identified. I'm now working on it with two CLs.
<fwereade> rogpeppe, but any time we want to make a change to the state, we need to do it as a single transaction
<fwereade> rogpeppe, or write a CA
<rogpeppe> fwereade: CA?
<fwereade> rogpeppe, Corrective Agent ;)
<rogpeppe> fwereade: ha
<TheMue> Aram: And LXC is right now stopped, one CL has gone in, two are open. MAAS has gotten higher priority.
<rogpeppe> fwereade: but that's not true. we don't need to make every change to the state as a single transaction
<Aram> fwereade, I've seen a proposal to replace int with string, excellent.
<fwereade> Aram, (there was a MUW bug that was contributing to that test failing, I'll find that commit, just a mo)
<rogpeppe> fwereade: and in this case, after the unit is set to dying, if we use several transactions to find that all its dependencies have gone away and it's in dying state, then we know we can remove it
<rogpeppe> fwereade: or set it to dead
<Aram> meh, the damn unity panel can't be set in a different position, and auto hide doesn't work in vmware.
<rogpeppe> Aram: so run ubuntu natively :-)
<Aram> I bought a new mac.
<TheMue> Aram: It's possible to auto-hide it, have to find the switch again. Somwhere in the settings.
<Aram> TheMue, it's possible to auto hide it, it's impossible to appear again after you've done so.
<Aram> that's because compiz is too smart for its own good.
<TheMue> *lol
<Aram> it calculates how hard you've hit the screen edge.
<Aram> since vmware won't deliver mouse events after you've hit the edge, it can't calculate how hard you hit it.
<Aram> so it won't show up.
<fwereade> so, hey again all
<rogpeppe> fwereade: your network connection die?
<fwereade> rogpeppe, power cut
<rogpeppe> fwereade: ah, not useful
<fwereade> rogpeppe, didn't last long :0
<fwereade> rogpeppe, ok, so, the trouble with lifecycle discussions is that there are in the end fewer commonalities than one might think
<rogpeppe> fwereade: did you see my last commend ("... or set it to dead")?
<rogpeppe> comment
<fwereade> rogpeppe, ah, sorry, I didn't
<rogpeppe> fwereade: ah
<rogpeppe> [09:59:49] <rogpeppe> fwereade: but that's not true. we don't need to make every change to the state as a single transaction
<rogpeppe> [10:00:36] <rogpeppe> fwereade: and in this case, after the unit is set to dying, if we use several transactions to find that all its dependencies have gone away and it's in dying state, then we know we can remove it
<rogpeppe> [10:00:42] <rogpeppe> fwereade: or set it to dead
<fwereade> rogpeppe, ok, so what's the scenario you're considering there?
<rogpeppe> fwereade: i was replying to your "but any time we want to make a change to the state, we need to do it as a single transaction" comment
<fwereade> rogpeppe, yeah, and you're explaining a situation in which we don't have to worry about it, and I'm asking for more details
<fwereade> rogpeppe, I think :)
<fwereade> rogpeppe, who is setting the unit to dying, and why?
<rogpeppe> fwereade: i was talking about setting the unit to dead, not dying
<rogpeppe> fwereade: anyone can set the unit to dying
<mramm> network here is a bit flakey, but I'm around again for the next few hours today, and then traveling back to the US tomorrow.
<rogpeppe> mramm: i guess you'll miss the meeting then
<mgz> dimitern: want to catch up, even though it'll only be us two?
<mramm> rogpeppe: yea
<mramm> rogpeppe: I also need to figure out better meeting times
<fwereade> rogpeppe, ok, yes, that is true; it is not hard to set a unit to dead once we know it's done everything
<fwereade> rogpeppe, but the tricky case is EnsureDead from the outside
<rogpeppe> fwereade: i don't see that's a problem. the only thing that can call EnsureDead is something that knows it can act for the object.
<rogpeppe> fwereade: which is usually the agent responsible for the object, but may not be, in some cases.
<fwereade> rogpeppe, ok, so when someone uses remove-unit --force...
<fwereade> rogpeppe, that client is basically taking on all the responsibilities of the unit agent, right?
<rogpeppe> fwereade: i'm not sure
<fwereade> rogpeppe, well, the UA won't do anything else after it's dead
<rogpeppe> fwereade: i don't think the client can call EnsureDead in that case
<dimitern> mgz: sure, i'm starting mumble now
<fwereade> rogpeppe, its responsibilities must either be transferred elsewhere, or discharged, in the same transaction that makes it Dead
<rogpeppe> fwereade: i think it just means "unit: please remove yourself regardless of the dependencies"
<fwereade> rogpeppe, I don't think there's any command that means "please corrupt state"
<rogpeppe> fwereade: i don't see why it all needs to happen in the same transaction
<fwereade> rogpeppe, ok, so what happens when the first change is made and then the connection is dropped?
<rogpeppe> fwereade: all we need to do is set the unit to dying and some flag, say "force" that means that the unit will kill itself even when it has relations etc.
<rogpeppe> s/"force"/"force",/
<fwereade> rogpeppe, please expand on what you mean by "kill itself"
<fwereade> rogpeppe, it needs to exit the relations even if it isn't running hooks
<rogpeppe> fwereade: well, perhaps it might be good if you could explain the sematics of remove-unit --force.
<fwereade> rogpeppe, the idea was from possibly the first lisbon
<rogpeppe> semantics
<fwereade> rogpeppe, make the unit Dead
<fwereade> rogpeppe, the trouble is that "make the unit Dead" is not necessarily a simple change to state
<rogpeppe> fwereade: at a higher level, why do we want this?
<fwereade> rogpeppe, because a broken UA is a hugely awful failure, that otherwise blocks the removal of machines, services, and relations
<fwereade> rogpeppe, it's a feature we hope nobody will use
<rogpeppe> fwereade: ok. are we talking about a broken unit agent, or a dead instance?
<fwereade> rogpeppe, a broken unit agent
<fwereade> rogpeppe, that has bad code, and bad upgrading code too, and that we have no other way of removing
<rogpeppe> fwereade: so if we know it's broken, then we know we can act for it, right?
<fwereade> rogpeppe, sure, I'm 100% comfortable with the idea that we can take on its responsibilities
<fwereade> rogpeppe, I'm just saying that if we do so we must either discharge them all, or transfer responsibility for doing so elsewhere, and that it must be done in the same transaction as that in which we set the unit to dead
<rogpeppe> fwereade: ok, so something else must take on its responsibilities for a while and see the unit through its last phases.
<fwereade> rogpeppe, so you want two different sorts of Dying state
<rogpeppe> fwereade: i still don't see that last point
<rogpeppe> fwereade: no, i don't
<fwereade> rogpeppe, we are talking about setting the unit to dead
<fwereade> rogpeppe, there are a potentially large number of things in state that need to be changed in order for the unit to become dead and state to remain consistent
<rogpeppe> fwereade: i thought we were talking about forcing the unit to die even when its agent is broken
<rogpeppe> fwereade: that's a somewhat different thing
<fwereade> rogpeppe, er, "setting the unit to dead" is what the unit agent would have done were it not broken, right?
<rogpeppe> fwereade: i suspect we might want another worker/agent to manage this
<fwereade> rogpeppe, the whole point is setting it to dead
<fwereade> rogpeppe, that is how we communicate that the unit agent can be safely trashed
<rogpeppe> fwereade: well, presumably the unit agent will manage the "potentially large number of things" along the road to deadness, yes?
<rogpeppe> fwereade: setting it to dead indicates that the unit agent *is already* trashed
<fwereade> rogpeppe, when we set Dying, that is exactly what the uniter does, yes
<fwereade> rogpeppe, not even slightly
<fwereade> rogpeppe, setting it to dead is what lets its deployer know that it can be trashed
<rogpeppe> fwereade: ok, so we need something else to take on the responsibility for the work that the unit agent normally does
<fwereade> rogpeppe, <fwereade> rogpeppe, its responsibilities must either be transferred elsewhere, or discharged, in the same transaction that makes it Dead
<rogpeppe> fwereade: ok, that's a different level of trashing
<fwereade> rogpeppe, ok: AIUI the deployer is responsible for both kinds
<fwereade> rogpeppe, once something is Dead, the thing responsible for it releases any underlying resources and removes it from state
<rogpeppe> fwereade: AFAICS the deployer is responsible for maintaining the container of the unit
<rogpeppe> fwereade: yes
<rogpeppe> fwereade: and that's not the problem. the problem is the transition to death
<fwereade> rogpeppe, to Dead?
<rogpeppe> fwereade: yeah
<fwereade> rogpeppe, yes
<rogpeppe> fwereade: and we can't solve that by simply charging in and calling EnsureDead
<fwereade> rogpeppe, depending on what EnsureDead means, but yes
<rogpeppe> fwereade: but i don't see that it must all be called in the same transaction either
<mramm> hotel wifi at 25% packet loss :(
<fwereade> rogpeppe, it sounds like you are advocating for a corrective agent
<fwereade> rogpeppe, or task
<rogpeppe> fwereade: perhaps you could outline the things that the unit agent does after seeing Dying and before it calls EnsureDead?
<fwereade> rogpeppe, depart all its relations, basically
<rogpeppe> fwereade: perhaps; i'm not sure yet.
<fwereade> rogpeppe, this is not necessarily a simple operation
<rogpeppe> fwereade: each relation depart can be done as a single transaction?
<fwereade> rogpeppe, because some of those relations might themselves be dying
<fwereade> rogpeppe, yes, they could
<rogpeppe> fwereade: so essentially it does: for r in relations {r.Depart()}; u.EnsureDead() ?
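A Go-flavoured expansion of that pseudocode, under the assumptions being debated; the interfaces below are placeholders, not the real state API:

```go
// Placeholder interfaces standing in for the real state types.
type relation interface{ Depart() error }

type unit interface {
	Relations() []relation
	EnsureDead() error
}

// forceDead sketches the proposal: depart each relation in its own
// transaction, then mark the unit Dead. This is only safe if each
// depart is idempotent, since a revived unit agent may be departing
// (or re-entering) the same relations concurrently.
func forceDead(u unit) error {
	for _, r := range u.Relations() {
		if err := r.Depart(); err != nil {
			return err
		}
	}
	return u.EnsureDead()
}
```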
<fwereade> rogpeppe, you are proposing that we depart them each individually, and assuming that the unit agent is not currently running and will never come back?
<fwereade> rogpeppe, remove-unit --force needs to work even if the UA suddenly wakes up
<fwereade> rogpeppe, while the unit is not dead, the UA has responsibility for it
<rogpeppe> fwereade: if relation departing is idempotent, i think that could work
<fwereade> rogpeppe, I am open to discussion re when/how that responsibility is transferred/discharged
<mramm2> hotel wifi at 25% packet loss, falling back to cell phone data plan but on edge network, so only a half step above IPoAC
<fwereade> rogpeppe, I don't think so -- I can imagine a path by which the UA could rejoin one of those relations in between the depart and the ensuredead
<rogpeppe> fwereade: can a UA rejoin a relation when the unit is marked as dying?
<fwereade> rogpeppe, why would the unit be marked as dying?
<rogpeppe> fwereade: isn't that one of the transaction guards?
<fwereade> rogpeppe, we're talking about forcing Dead from any state
<rogpeppe> fwereade: because that's the first thing we must do when calling remove-unit --force, surely?
<rogpeppe> fwereade: that's how we manage to shut down cleanly in multiple transactions
<fwereade> rogpeppe, I guess that's probably not too harmful, although I am strongly -1 on state changes that are not expressed as single transactions
<niemeyer> fwereade: How would the deployer even know to kill the unit if it wasn't marked as dying
<niemeyer> Good mornings, btw
<fwereade> niemeyer, heyhey
<TheMue> niemeyer: Morning
<fwereade> niemeyer, the deployer removes the unit's it's responsible for when they become Dead
<fwereade> niemeyer, they shouldn't care that their units are Dying
<fwereade> sorry errant '
<niemeyer> fwereade: Ah, right, okay
<fwereade> niemeyer, Dying is a signal to the UA to clean itself up and mark itself Dead when it's ready
<niemeyer> fwereade: Yeah, you're right.. something else has to mark it as Dead in that edge case being debated
<fwereade> niemeyer, yeah, and the requirements for the two cases are rather different, so a single EnsureDead method with no parameters is not enough to express what we need to do
<rogpeppe> fwereade: if we want it to be expressed as a single transaction, then we need an agent to do the rest for us
<rogpeppe> fwereade: call it the CA if you want
<fwereade> niemeyer, in a clean shutdown, the uniter shouldn't set itself to Dead until it has no remaining subordinates
<fwereade> niemeyer, in a forced death, we need to make all the subordinates dead as well, I think -- but I guess that is a red herring, it's just where I started the conversation from
<rogpeppe> fwereade: i don't think that remove-unit --force should simply call EnsureDead to do the work
<fwereade> niemeyer, the relations question is much more interesting
<niemeyer> fwereade: It's an interesting case, actually
<fwereade> rogpeppe, ok, sorry, I have been talking a lot
<fwereade> rogpeppe, would you outline how how would expect the feature to work?
<fwereade> s/how how/how you/
<rogpeppe> fwereade: i'd expect the forced shutdown to go through the same phases as a non-forced shutdown
<rogpeppe> fwereade: and i would expect it to error out if the agent involved is actually alive.
<fwereade> rogpeppe, ok, but when the unit is Dying its agent is still the responsible entity
<rogpeppe> fwereade: mebbe
<fwereade> rogpeppe, shouldn't remove-unit --force be able to forcibly remove the unit whatever it's doing?
<rogpeppe> fwereade: yeah, probably
<rogpeppe> fwereade: in which case we need to make the shutdown phases work even when multiple entities are performing them
<niemeyer> fwereade: It should, but if that's done when the unit is not even dying, it feels like we should do something to not create unnecessary harm
<fwereade> rogpeppe, well, we should have been doing so anyway, right, what with the txn runners
<rogpeppe> fwereade: yes, except you were saying you can't do it all in one transaction. which seems a reasonable issue.
<fwereade> niemeyer, rogpeppe: yeah, I think it's fine to start remove-unit --force by setting Dying
<niemeyer> fwereade: and having a grace period, maybe
<fwereade> niemeyer, rogpeppe: I think it's ok to have remove-unit always set the unit to Dying
<rogpeppe> niemeyer: that, presumably, would mean that remove-unit --force would need to take as long as that grace period... or we'd need an agent to do it.
<niemeyer> rogpeppe: Assuming the unit wasn't already Dying
<fwereade> niemeyer, rogpeppe: and then if --force is set, aggressively go around trashing its stuff until it can call EnsureDead (or finds that the unit has done so itself)
<rogpeppe> fwereade: that's what i'm thinking
<rogpeppe> niemeyer: if it is already dying, we ignore the grace period?
<niemeyer> rogpeppe: Yeah, I think that sounds reasonable
<fwereade> niemeyer, I rather feel that the grace period is unnecessary -- if we're forcing then we're stating that we don't want or need a clean shutdown, we just want it gone
<rogpeppe> niemeyer: it seems slightly odd that {remove-unit; remove-unit --force} is different from {remove-unit --force}
<niemeyer> fwereade: That sounds like unnecessary mess in the system
<niemeyer> Alternatively, we could disallow --force to be used without first calling without force
<niemeyer> Which would set dying
<niemeyer> This would be a better teaching instrument than the grace period
<rogpeppe> niemeyer: yes, that sounds better
<fwereade> niemeyer, no remove-unit --force unless already dying SGTM
<niemeyer> fwereade: Yeah
<rogpeppe> niemeyer: then the user can choose their own grace
<niemeyer> rogpeppe: Right, the issue becomes obvious
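A sketch of the rule just agreed, as a hypothetical client-side check (reusing the Life type from the earlier sketch; this is not actual juju CLI code):

```go
import "errors"

// checkForce only accepts --force once a plain remove-unit has already
// set the unit to Dying, so the user chooses their own grace period.
func checkForce(life Life, force bool) error {
	if force && life == Alive {
		return errors.New("unit is still alive; run remove-unit without --force first")
	}
	return nil
}
```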
 * fwereade is a little worried that the UA *will* be going around re-entering relations
<rogpeppe> fwereade: can it reenter a relation if the unit is dying?
<fwereade> niemeyer, rogpeppe: and hence that --force might have to grind away at it a bit
 * fwereade check
<niemeyer> fwereade: How can it re-enter a relation if it's dying?
<niemeyer> Erm.. what Roger siad
<niemeyer> said
<fwereade> niemeyer, it shouldn't leave scope until it is completely done with the relation
<rogpeppe> fwereade: "leave scope"?
<niemeyer> fwereade: I still don't get it
<niemeyer> fwereade: How can it *re-enter* while dying?
<fwereade> niemeyer, because calling Join() when we're adding a relation we expect to do stuff with works right even if it's done twice
<fwereade> niemeyer, let me double-check when we leave
<niemeyer> fwereade: I still don't see the reason why it works while Dying there
<fwereade> niemeyer, ok, do you agree that a unit should not leave a relation scope until it has run the relation-broken hook?
<fwereade> niemeyer, (possibly when UA is down something else may leave scope for it, that is not the question right now)
<niemeyer> fwereade: Okay
<niemeyer> fwereade: I still don't see the connection though
<fwereade> niemeyer, the purpose of the scope doc is to indicate that a unit still needs the relation to be alive
<niemeyer> fwereade: I understand that
<fwereade> niemeyer, a Dying unit that has not run all its hooks needs that relation to be alive
<niemeyer> fwereade: Yes, and how did we get to re-entering relation
<niemeyer> s
<fwereade> niemeyer, so when it comes up, it ensures that (if it knows itself to be part of the relation) its scope doc exists
<niemeyer> fwereade: Yep
<niemeyer> fwereade: So..?
<fwereade> niemeyer, with the proposed style of remove-unit, we will be leaving its scope from outside, and it will be erroring out, coming back up, and rejoining scopes we've already exited
<fwereade> niemeyer, the 5s break should be enough to make that only happen once
<fwereade> niemeyer, maybe not at all
<niemeyer> fwereade: Do we allow it to create scope documents even when it is dying?
<niemeyer> fwereade: *create* scope documents
<fwereade> niemeyer, there's no distinction at the API level -- EnterScope could be EnsureInScope
<niemeyer> fwereade: Can you answer the question, for once? :-)
<fwereade> niemeyer, yes, we allow it to create scope documents at any time we are certain that it is appropriate for it to have a state document, which includes when the unit is Dying
<fwereade> s/state doc/scope doc/
<niemeyer> fwereade: Okay, i don't think we should allow it to create scope documents when it is dying
<niemeyer> fwereade: It makes no sense to be establishing new participation in relations when the unit is dying
<fwereade> niemeyer, I could just as easily say it makes no sense to be messing around with a unit's state while the unit is Dying and still responsible for its own state
<niemeyer> fwereade: You could, but I don't get what you mean by that
<niemeyer> fwereade: I'm pinpointing a very specific case which I was hoping would be easy to agree or disagree on itself
<niemeyer> fwereade: If the unit is not in a scope, and is in a dying state, it doesn't make sense to be joining a relation it wasn't in, AFAICS
<fwereade> niemeyer, why are we even considering relations it's not already in?
<niemeyer> * fwereade is a little worried that the UA *will* be going around re-entering relations
<fwereade> niemeyer, "at any time we are certain that it is appropriate for it to have a scope document" means "when we know we're already part of the relation"
<fwereade> niemeyer, if the unit agent is running and knows it should be in a relation it will call EnterScope
<niemeyer> fwereade: There's someone else that disagreed meanwhile.. sounds like a straightforward race
<niemeyer> fwereade: Yes, and then it's put to Dying, and that should fail
<fwereade> niemeyer, well, yeah, I am a bit blindsided by all of this
<fwereade> niemeyer, meh, no trouble, we can add that to the transaction
<niemeyer> fwereade: Exactly
<fwereade> niemeyer, induce a few errors while it's going down is not really a big deal I guess
<niemeyer> fwereade: We don't have to induce errors.. EnterScope already does verifications of that nature
<niemeyer> fwereade: Since we already disallow entering a dying relation
<niemeyer> fwereade: For similar reasons, I suspect
<fwereade> niemeyer, we will absolutely be inducing errors by purposefully corrupting the uniter's state
<niemeyer> fwereade: How's that "corrupting the uniter's state"?  Putting a unit to dying is a normal condition
<fwereade> niemeyer, looping through all its relations, leaving scope for each, while the uniter is valiantly trying to run its hooks *is* corrupting the uniter's state
<niemeyer> fwereade: I'm talking about the case above still
<fwereade> niemeyer, whether or not EnterScope should work when the unit is dying? I have no problem changing that, it's not a big deal, it will make no observable difference to the system AFAICT
<niemeyer> <fwereade> niemeyer, meh, no trouble, we can add that to the transaction
<niemeyer> <niemeyer> fwereade: Exactly
<niemeyer> <fwereade> niemeyer, induce a few errors while it's going down is not really a big deal I guess
<fwereade> niemeyer, sensible check, though, definitely a good thing
<niemeyer> fwereade: That's the context in which you mentioned corrupting state and which I answered to
<fwereade> niemeyer, ok, AIUI you are advocating roger's proposed style of remove-unit --force
<fwereade> niemeyer, which is all about corrupting the unit's state as it runs
<niemeyer> fwereade: I don't actually know. What was roger suggesting?
<fwereade> niemeyer, EnsureDying(); leave scope for each relation unit; EnsureDead
<niemeyer> fwereade: Is there an alternative you were suggesting?
<fwereade> niemeyer, I had felt that the sane thing to do was to either have an EnsureDead transaction that cleans up the necessary state in one go
<fwereade> niemeyer, well, not necessarily sane
<fwereade> niemeyer, the alternatives I saw were:
<fwereade> niemeyer, 1) a potentially large and unwieldy transaction
<fwereade> niemeyer, 2) some complicated corrective agent
<fwereade> niemeyer, and I was not very happy with either
<niemeyer> fwereade: I think there's another alternative that might be easier to implement
<fwereade> niemeyer, rogpeppe has suggested just taking joint responsibility while the UA is running, and trying to shut itself down cleanly, by shutting everything down for it -- surely inducing errors -- and then setting it to dead when that's done
<fwereade> niemeyer, I honestly don't like that solution much either
<niemeyer> fwereade: Having the deployer being responsible for double-checking that the unit resources have been cleared if death was forced
<niemeyer> fwereade: After killing the software, before removing from state
<fwereade> niemeyer, that STM like a special case of "complicated corrective agent"
<fwereade> niemeyer, but worth exploring, cool
<niemeyer> fwereade: That sounds like cleaning resources to me..
<niemeyer> fwereade: Can't get simpler than that
<niemeyer> fwereade: We can't clear resources without clearing resources
<fwereade> niemeyer, ok, this to me somewhat changes the balance of Dying/Dead
<niemeyer> fwereade: How? It doesn't seem to change them at all to me
<fwereade> niemeyer, but could probably work very nicely, once we have containerisation in place
<niemeyer> fwereade: I don't think waiting is necessary
<fwereade> niemeyer, I'm pretty sure we had agreed that eg a principal should not be able to become Dead while its subordinates were still around
<fwereade> niemeyer, because otherwise the subordinate would be holding a reference to a dead object it was not responsible for destroying
<fwereade> niemeyer, and that would be a Bad thing
<niemeyer> fwereade: Hmm.. yeah.. so how does that change the picture?
<fwereade> niemeyer, so setting something to Dead, and having its deployer clean up its various other connections to the system, seems wrong to me
<niemeyer> fwereade: Sorry for being slow, but I don't understand the underlying point being made
<fwereade> niemeyer, I had been working from a picture in which Dead was as good as removed from the perspective of the single entity responsible for actual removal
<niemeyer> fwereade: --force means the thing itself can't clean up its connection to the rest of the system
<niemeyer> fwereade: Something has to do that
<fwereade> niemeyer, ok, you are saying that the deployer can clean up a unit's relations after the unit is dead? or not at all?
<niemeyer> fwereade: If you want to do that at the command line, you'll be "corrupting the state"
<niemeyer> fwereade: The only thing that can be sure the thing is actually dead is its runner
<fwereade> niemeyer, yes, it is something else's responsibility to do the cleaning up
<niemeyer> fwereade: Which is the deployer
<niemeyer> fwereade: So that's the place we can clean up without corrupting the state
<niemeyer> fwereade: But you don't think that's right, because..?
<fwereade> niemeyer, a dead unit that's still participating in relations *is* corrupt state IMO
<niemeyer> fwereade: We can ask the deployer to force its death without putting its status to Dead
<mramm> BTW, just sent an e-mail with updated meeting times -- and moved them to thursday for this week since I'm traveling tomorrow.
<niemeyer> mramm: Cool, cheers
<mgz> mramm: saw that. won't be there for thursday (as I'm travelling) but the rest of our lot should be.
<mramm> mgz: no problem.   We are bound to be missing one or two folks most weeks in december!
<fwereade> niemeyer, ok, so: a flag on units, checked by their deployer, that when set (1) causes the uniter to treat its unit as Dead and (2) causes its deployer to clean up its state, set it to dead, and uninstall it?
<mramm> mgz: probably more most weeks.
<niemeyer> fwereade: The uniter doesn't have to change.. as far as it knows, it's already Dying, so it *should* be aborting its activities
<fwereade> niemeyer, and so it will be
<niemeyer> fwereade: The force flag action would happen purely on the deployer
<fwereade> niemeyer, ok, isn't it better for the uniter to see that someone else has taken responsibility, and just stop doing things?
<niemeyer> fwereade: It will stop doing things, because the deployer will shoot it in the head
<niemeyer> fwereade: There's no point in sending him a postcard telling him he's being shot in the head :-)
<niemeyer> (and by now I'm probably in a list of dangerous people somewhere)
<fwereade> niemeyer, haha
<fwereade> niemeyer, this somewhat complicates the UnitsWatcher
<fwereade> niemeyer, or the clients I guess, I just need it to send all changes
<niemeyer> fwereade: I don't think it should change
<fwereade> niemeyer, ok, so the deployer needs to set up kill-switch watches on each of its units itself?
<mramm> also, friendly reminder to folks: put your vacation on the Juju Calendar so we all know it.
<niemeyer> fwereade: Although, don't we already monitor all changes
<niemeyer> fwereade: On units
<fwereade> niemeyer, no, that watcher filters it to just lifecycle changes
<niemeyer> fwereade: What does that mean in practice? Life is a field in the unit
<fwereade> niemeyer, yes, when it gets a unit change it discards it unless Life has changed
<niemeyer> fwereade: Right
<niemeyer> fwereade: That sounds like a life-related change, and is in the same place as the other settings
<niemeyer> fwereade: We'd just be comparing Life+ForcedDeath or whatever, instead of Life
<niemeyer> fwereade: (just a strawman for now)
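That strawman in code form (field and function names are invented; Life is the type from the earlier sketch):

```go
// unitDoc sketches the relevant fields of the unit document; ForcedDeath
// is the hypothetical flag being proposed, not an existing field.
type unitDoc struct {
	Life        Life
	ForcedDeath bool
}

// lifecycleChanged is what the units watcher would compare instead of
// Life alone: a change to either field gets reported to the deployer.
func lifecycleChanged(prev, cur unitDoc) bool {
	return prev.Life != cur.Life || prev.ForcedDeath != cur.ForcedDeath
}
```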
<fwereade> niemeyer, sure, it's just a change to the watcher and a change to the deployer to handle the new possibilities
<niemeyer> fwereade: Right
<fwereade> niemeyer, I'm not suggesting it's the end of the world :)
<niemeyer> fwereade: Sure, I'm just testing waters by suggesting it's actually simple
<niemeyer> fwereade: and looking at your face while doing that ;-)
<fwereade> niemeyer, it's not wrong, but it's not quite conventional either
<fwereade> niemeyer, eg, if we're doing that, we should be handling ports changes on the firewaller similarly
<fwereade> niemeyer, ie the MUW should be filtering for Life and Ports changes, not just life
<fwereade> niemeyer, hmm, that will save us a goroutine for every unit and probably save us a lot of complexity actually
<niemeyer> fwereade: Uh.. that'd be a more controversial change
<fwereade> niemeyer, ok, this is interesting to me: what's the distinction?
<niemeyer> fwereade: Ports are not lifecycle related
<niemeyer> fwereade: Then we're designing an API purely based on how we designed who's using it
<niemeyer> fwereade: Which is not generally a good idea, as it means high-coupling, and changes on one side cascade on meaningless changes on the other
<fwereade> niemeyer, ok, in that case all I'm really bothered about is the change to the lifecycle -- either by adding a flag, or by adding new states
<fwereade> niemeyer, neither fills me with joy
<niemeyer> fwereade: Sure, but I've suggested the new flag to accommodate your other points.. we can't preserve the unit as dying, and yet know to kill its activities from outside, without knowing we should
<fwereade> niemeyer, I'm still not saying it's bad, just that I'm nervous about it :)
<niemeyer> fwereade: and who kills the subordinates in such circumstance?
<niemeyer> Must be the deployer as well, since the principal is out-of-action
<fwereade> niemeyer, this is entertaining, because it's actually a different deployer in that case
<fwereade> niemeyer, the principal had a deployer
<niemeyer> fwereade: Well, not really
<niemeyer> fwereade: Right
<niemeyer> fwereade: Sorry, yes, a different deployer from the one that originally fired things
<niemeyer> fwereade: Which brings back a point: the machiner is not necessarily exactly the same as a principaller (ugh)
<fwereade> niemeyer, so, all these concerns were what was leading me towards the idea that composing a transaction that asserted and did the Right Thing all in one place might be viable
<niemeyer> fwereade: I don't understand what that means.. transactions asserting things don't solve any of the previous points I think
<fwereade> niemeyer, how many ops would you consider to be an unacceptably large transaction?
<niemeyer> fwereade: You can assert whatever, and transact whatever, and it'd still be corrupting someone else's state (according to what you described as that meant)
<fwereade> niemeyer, composing a transaction that sets all the right things to Dead, and leaves their relation scopes, etc etc, is not intrinsically unreasonable, is it?
<fwereade> niemeyer, the uniter expects to become arbitrarily Dead at some point and stops calmly when it finds itself so to be
<niemeyer> fwereade: Depends on what you consider unreasonable.. it seemed that you were unhappy about changing state underlying running software
<niemeyer> fwereade: That's not solving anything about that
<niemeyer> fwereade: It'll still catch all uniters off-guard
<fwereade> niemeyer, the uniters will hiccup once at most, and then return some specific error
<fwereade> niemeyer, but suddenly becoming Dead is not an unexpected occurrence
<niemeyer> fwereade: Nothing is unexpected if we say we're fine with it :-)
<fwereade> niemeyer, right, we agreed some time ago that the uniter needed to cope with becoming unexpectedly dead, and that its role at that point was done
<niemeyer> fwereade: Moments ago it sounded like there was some unhappiness about the fact the uniter is doing activities such as joining relations and whatnot while someone else was changing the game underlying it
<dimitern> is there a difference between foo := &SomeType{} and foo := new(SomeType) ?
<niemeyer> dimitern: Now
<niemeyer> dimitern: No
<fwereade> niemeyer, ok, there is state created by the uniter, that it considers itself to own
<dimitern> niemeyer: 10x
<niemeyer> dimitern: We use the former
<niemeyer> dimitern: Just out of convention
<fwereade> niemeyer, and there is other state that it doesn't change; it just responds to changes in it
<dimitern> niemeyer: ok
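For the record, the two forms give the same result for an empty struct; only the composite literal can also set fields inline:

```go
type SomeType struct{ Name string }

func example() {
	foo := &SomeType{}          // juju-core convention
	bar := new(SomeType)        // identical: a pointer to a zeroed SomeType
	baz := &SomeType{Name: "x"} // only the literal form can initialise fields
	_, _, _ = foo, bar, baz
}
```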
<fwereade> niemeyer, ok, bah, this is not true of Dead because it *does* set its unit to Dead: but at that point it also stops watching it, or doing anything at all, so it feels kinda sensible
<fwereade> niemeyer, I am opposed to changing the unit's relation state from <something that does not change under the feet of a live uniter> to <something that does>
<niemeyer> fwereade: So the one-big-transaction idea doesn't solve it, right?
<niemeyer> rogpeppe: ping
<rogpeppe> niemeyer: pong
<fwereade> niemeyer, in my mind it does, but you are correct to point out that it is a pretty fuzzy distinction, and I'm having trouble putting my finger on it
<niemeyer> rogpeppe: Do you have a moment for a call at the top of the hour, with the cloud-green team, regarding some kind of interaction with the API?
<fwereade> niemeyer, I shall meditate upon this over lunch
<rogpeppe> niemeyer: yes
<niemeyer> rogpeppe: Cool
<niemeyer> rogpeppe: https://plus.google.com/hangouts/_/d94c1034338161320971329404422704b987f4f9
<niemeyer> rogpeppe: I'm not there yet
<rogpeppe> niemeyer: fetching macbook
<niemeyer> fwereade: Sounds good, let's come back to it after lunch then
<fwereade> niemeyer, I just had a thought on subordinate names
<fwereade> niemeyer, at the moment it seems we have no way to guarantee via txn that only one subordinate can be created per principal unit
<fwereade> niemeyer, ie we can't assert that no document with the same service and principal exists, right?
<fwereade> niemeyer, but we could name them, eg, logging/wordpress/0
<fwereade> niemeyer, which makes some sort of sense
<niemeyer> fwereade: The two issues seem independent
<fwereade> niemeyer, and doesn't impact anything else, because the only thing we can do to a unit is remove it
<TheMue> (late) lunchtime, biab
<fwereade> niemeyer, and there's no clear way to remove a subordinate without removing the relation
<niemeyer> fwereade:  bzr remove/destroy-unit?
<niemeyer> Erm
<niemeyer> juju
<fwereade> niemeyer, ok, sorry, how would you expect that to work?
<niemeyer> fwereade: Ah, indeed..
<niemeyer> fwereade: But that sounds pretty independent from the original point
<niemeyer> fwereade: We're again trying to solve three problems at once
<fwereade> niemeyer, yeah, just a supporting detail
<niemeyer> fwereade: A) How to name units
<niemeyer> fwereade: B) How to remove subordinate units
<niemeyer> fwereade: C) How to limit a single subordinate unit per service
<niemeyer> Per principal, rather
<fwereade> niemeyer, I don't think we need B at this point, but it may become relevant I suppose
<niemeyer> fwereade: I don't think we need any of them right now
<fwereade> niemeyer, so I'm saying: C is a problem, A may be a solution
<niemeyer> fwereade: Why is C a problem?
<fwereade> niemeyer, because I am terminally suspicious of things changing state I don't expect to change, and I want to write something that will react sanely if the subordinate it was trying to add showed up from another source
<niemeyer> fwereade: Sorry, I can't relate the answer to the question
<fwereade> niemeyer, I can check that no subordinate exists before I attempt the creation transaction, but I cannot assert within the transaction that no duplicate subordinate exists, and this leaves a window in which bad things can happen
<fwereade> niemeyer, what of the above statement seems lacking in appropriate sanity?
<fwereade> niemeyer, ofc we do not need to actually change unit names to do this
<fwereade> niemeyer, but I *think* we do need at least to change what we keep in the _id field
<niemeyer> fwereade: That seems like a pretty radical change
<fwereade> niemeyer, sure, so let's drop that bit
<niemeyer> fwereade: Should we try to look at the problem first?
<fwereade> niemeyer, yes
<fwereade> niemeyer, do you have any thoughts on it?
<niemeyer> fwereade: So the idea is we want a subordinate service and a principal service to spawn a single subordinate unit for each principal unit. Is that a good way to put it?
<fwereade> niemeyer, that every principal unit creates its own subordinate, yes
<niemeyer> fwereade: Hmm.. no, I didn't say anything about who creates what yet
<fwereade> niemeyer, er, ok, what did you mean by "spawn" then?
<niemeyer> fwereade: I mean existence
<fwereade> niemeyer, so did I...
<niemeyer> fwereade: Stuff creating its own foo seems to indicate otherwise, but okay, I'll suppose we're in agreement
<niemeyer> fwereade: So at which point do the subordinate units get created in the state?
<fwereade> niemeyer, I think it happens when a principal unit in a locally scoped relation successfully enters the relation's scope, and has no subordinate
<fwereade> niemeyer, and I'll just shut up now
<fwereade> niemeyer, I can check the principal unit doc's Subordinates, and I can assert no changes to the principal unit doc, and retry as appropriate
<niemeyer> fwereade: Hmmm, sounds sensible
<niemeyer> fwereade: It'd be nice if we could even assert specifically that there are no subordinates of the given service there
<niemeyer> fwereade: But I guess the language doesn't allow us to match on a prefix of a sub-entry, maybe
 * niemeyer looks
<fwereade> niemeyer, hmm, I'd expect we could but I'm not sure offhand
<niemeyer> fwereade: Hmm.. equality does unwind the array.. I wonder if regexes work the same way
 * niemeyer tries it
<niemeyer> fwereade: Works!
<niemeyer> fwereade: http://pastebin.ubuntu.com/1408009/
<fwereade> niemeyer, awesome
<fwereade> niemeyer, right, well, I know what I'm doing re adding them, anyway :)
<fwereade> niemeyer, I feel like I still need to think a bit about removal based on the discussion earlier
<fwereade> niemeyer, I will try to grab you sometime later this evening if I can
<fwereade> niemeyer, thanks :)
<niemeyer> fwereade: Actually, I think that's more like what we want: http://pastebin.ubuntu.com/1408022/
<niemeyer> fwereade: As we can assert it
<fwereade> niemeyer, yep, very true
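(A minimal sketch of the assertion being discussed, since the pastebin contents aren't reproduced in the log; the collection, field, and unit names are illustrative assumptions, not the real juju-core schema. The point verified above is that MongoDB applies equality and regex matches to array fields per element, so $not with a prefix regex asserts that no subordinate of the given service is already recorded.)

    // Sketch only; assumes labix.org/v2/mgo/bson and labix.org/v2/mgo/txn
    // imported as bson and txn, and a *txn.Runner in hand.
    func addLoggingSubordinate(runner *txn.Runner) error {
        ops := []txn.Op{{
            C:  "units",
            Id: "wordpress/0", // the principal unit's _id
            // Abort unless no element of the subordinates array starts with "logging/".
            Assert: bson.D{{"subordinates", bson.D{{"$not", bson.RegEx{Pattern: "^logging/"}}}}},
            Update: bson.D{{"$addToSet", bson.D{{"subordinates", "logging/0"}}}},
        }}
        // The real transaction would also insert the new unit document itself.
        return runner.Run(ops, "", nil)
    }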
<fwereade> blast, need to pop to the shops, back imminently
<niemeyer> fwereade_: Is there any follow up that justifies changing the type of relation ids to ints?
<niemeyer> fwereade_: Or from ints, I guess
<fwereade_> niemeyer, not imminent, I'm afraid
<fwereade_> niemeyer, do you not feel the zero-value change is worth it?
<niemeyer> fwereade_: If that's the motivation, just disallow 0?
<niemeyer> fwereade_: I'm mainly trying to understand what's going on
<niemeyer> fwereade_: It feels like we've been in a campaign against int keys, and I'm not sure why :)
<fwereade_> niemeyer, also, it feels more consistent to me -- but I agree that this proposal doesn't have the weight behind it that machine ids have
<fwereade_> niemeyer, it's fundamentally about making things that are similar look similar, because I believe that then the ways to use them better will be clearer
<fwereade_> niemeyer, but honestly, relation ids are mainly a problem for me because of id/_id
<niemeyer> fwereade_: I don't know.. I've never designed any software on which we had a database we tried to make ints look like strings so that they were similar
<fwereade_> niemeyer, if we called it something other than Id, and disallowed 0, I'd probably be just as happy
<Aram> consistent interfaces are a virtue, see OpenVMS vs. UNIX.
<Aram> good night
<niemeyer> Being a nice person too..
<fwereade_> niemeyer, whatever we call it I don't think we index it yet anyway
<niemeyer> But that's harder to get right..
<fwereade_> niemeyer, I can understand him not wanting to get drawn into the argument :)
<niemeyer> Sounding like an ass and then leaving is indeed a great way to make friends.
<fwereade_> niemeyer, anyway, the more I think about the relation IDs the less I care what type they are, so long as we clearly disambiguate them for Key/_id
<fwereade_> niemeyer, that's the core of my discombobulation here
<niemeyer> fwereade_: If we were to unify things and interfaces, to be honest I'd try to do the opposite of what we've been doing.. I'd make everything have a surrogate key that is an integer
<niemeyer> fwereade_: I appreciate the advantages of that novel idea that is binary numbers :)
<niemeyer> fwereade_: 400kb holds 100 thousand ids, in a properly balanced and aligned tree.
<fwereade_> niemeyer, do you recall me suggesting this when I first introduced the unification idea?
<niemeyer> fwereade_: Hmm.. not with any emphasis
<fwereade_> niemeyer, as it is, we have consistent string _ids throughout, and a surrogate integer key on relations
<fwereade_> niemeyer, which, it crosses my mind, we probably ought to index ;)
<niemeyer> fwereade_: Yep
<fwereade_> niemeyer, so my own consistency hunger is actually largely assuaged -- I just don;t like the inconsistent and/or confusing terminology :)
<fwereade_> niemeyer, ie Name/Id/Key for various types
<fwereade_> niemeyer, and one with an extra Id that doesn't mean what the other id means
<niemeyer> fwereade_: It was pretty consistent before
<niemeyer> fwereade_: Id => int
<niemeyer> fwereade_: Name => string
<niemeyer> fwereade_: Now it's somewhat arbitrary or consistent, depending on how you slice it
<fwereade_> niemeyer, I guess we place different weights on different kinds of consistency
<fwereade_> niemeyer, although I note you don't mention Key in there
<fwereade_> niemeyer, or the fact that a casual observer might expect that an Id would be indexed
<niemeyer> fwereade_: A non-casual observer (me) would expect it to be indexed as well
<fwereade_> niemeyer, so, well, there is at least agreement that we need to do that :)
<niemeyer> fwereade_: To me having different types on primary keys is really not a big deal, if that's what you meant by the weights
<niemeyer> fwereade_: Primary keys, at least where I've been lucky to observe, tend to reflect the application model
<niemeyer> fwereade_: A person will generally have an id, or a username.. the id is generally an integer, not a string.
<niemeyer> fwereade_: The username, if used, is a string
<niemeyer> fwereade_: Etc
<fwereade_> niemeyer, are you arguing that I should reverse the machine ID change?
<fwereade_> niemeyer, because I am not actually arguing for a type change on relation id at the moment, just a name change
<niemeyer> fwereade_: No, I'm having a pretty high-level conversation about what we're pursuing
<niemeyer> fwereade_: Well, there's a branch to change it up for review.. that's as close to arguing about a change as it gets :-)
<fwereade_> niemeyer, ISTM that you are clearly going to reject this change, and I've been fighting a desperate rearguard attempt to get you to admit that, at least for morons like me, it *is* confusing to have a field called Key that serializes as _id and one called Id that serializes as id
<fwereade_> niemeyer, and that maybe we could change it, or you could explain how we benefit by this choice of names?
<niemeyer> fwereade_: Right, that's where our conversation started..
<fwereade_> niemeyer, ok, you just told me that I had been arguing about the type of the field
<fwereade_> niemeyer, I thought I had clearly conceded that point
<niemeyer> fwereade_: You did, sorry
<fwereade_> niemeyer, I apologise if this was unclear
<niemeyer> fwereade_: It wasn't
<fwereade_> niemeyer, ok -- so, I have no technical arguments for changing the type of relation.Id, but I would *really* like to change the field name
<fwereade_> niemeyer, the Tag suggestion I made in the CL is pretty clearly bad
<niemeyer> fwereade_: Okay, what if we introduced surrogate keys for all types
<fwereade_> niemeyer, I would rather like that, I think, even if having separate getters might be slightly unattractive
<niemeyer> fwereade_: Id/_id
<niemeyer> fwereade_: ints
<fwereade_> niemeyer, I thought the _id field would not be suitable, because we don't get the uniqueness verification from name collisions?
<fwereade_> niemeyer, but I'd be willing to skip that derail for now
<niemeyer> fwereade_: We'll likely have to add some check on adds, yes
<fwereade_> niemeyer, Id/_id and Name/name, each of which must be unique, would feel to me like a good thing
<niemeyer> fwereade_: Right, and Name present only where it makes sense
<niemeyer> fwereade_: We can also rule out zeros
<fwereade_> niemeyer, are we thinking of a relation Key as a Name or not? :)
<niemeyer> fwereade_: Yeah, I think that'd be fine
<fwereade_> niemeyer, ok, cool
<niemeyer> fwereade_: It sounds good actually, rather than just fine
<fwereade_> niemeyer, so I *think* all the consequences here are good
<niemeyer> fwereade_: I originally thought we could not remove the idea of machine zero, but I think any software that trusts it to be special is already broken since we don't offer this guarantee and will most certainly break it in the future
<niemeyer> fwereade_: So we could start all valid counting from 1
<fwereade_> niemeyer, I'm +1 on making it completely unspecial :)
<niemeyer> fwereade_: Well, it is special, as it'd become invalid :-)
<fwereade_> niemeyer, I was sad to spot an explicit zero check in the code somewhere
<fwereade_> niemeyer, haha, reverse special :)
<niemeyer> fwereade_: Yeah, we've always had it, but it's always been consciously a temporary hack
<fwereade_> niemeyer, ok, so a big Id/_id/int change would be for the better; I'm not sure how soon I can pick that up while still in the throes of subordinates
<niemeyer> fwereade_: Understandable
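(A rough sketch of the document shape the Id/_id/int unification above would imply; the struct and field names are illustrative, not the actual juju-core schema, and mgo's bson tags are assumed for the serialization.)

    // Sketch only: a surrogate integer key plus a separate human-readable name,
    // both unique, with zero disallowed as a valid id.
    type relationDoc struct {
        Id   int    `bson:"_id"`  // surrogate key, counting from 1
        Name string `bson:"name"` // what is currently called Key
        // ... remaining fields unchanged
    }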
<fwereade_> niemeyer, ...and cath has just made supper, so I should leave you for a bit
<fwereade_> niemeyer, I shall try to be on later, but no guarantees I'm afraid
<niemeyer> fwereade_: np, thanks for the ideas
<niemeyer> fwereade_: and have fun there
<fwereade_> niemeyer, cheers, enjoy your evening :)
<niemeyer> fwereade_: Thanks
<rogpeppe> i'm also off for the evening.
<rogpeppe> night all
<niemeyer> rogpeppe: Have a good one too
<niemeyer> davecheney: Morning!
<davecheney> niemeyer: howdy!
<davecheney> niemeyer: thanks for your suggestion
<davecheney> our replies crossed in the ether
<davecheney> i'll look at that test now
<niemeyer> davecheney: np, not sure if there's something there, but I think there might be a path towards a test
<davecheney> niemeyer: % time juju status -e ap-southeast-2
<davecheney> machines:
<davecheney>   "0":
<davecheney>     agent-version: 1.9.4
<davecheney>     dns-name: ec2-54-252-2-107.ap-southeast-2.compute.amazonaws.com
<davecheney>     instance-id: i-ff7b0ac5
<davecheney> services: {}
<davecheney> real    0m2.231s
<davecheney> user    0m0.288s
<davecheney> sys     0m0.024s
<davecheney> two seconds !!
<niemeyer> davecheney: Woha!
<niemeyer> davecheney: That's the fastest I've seen it run, ever!
<davecheney> so many fewer round trips
<niemeyer> Okidoki
<niemeyer> I'm heading off for the day
<niemeyer> davecheney: Have a good time there
#juju-dev 2012-12-04
<TheMue> Good morning.
<fwereade_> TheMue, heyhey
<TheMue> fwereade_: Hiya
<TheMue> fwereade_: Will push the first CL for the firewaller in a few moments. It contains the port init in global mode for machined. Looks good.
<fwereade_> TheMue, cool
<fwereade_> TheMue, I'll take a look when it's there :)
<TheMue> fwereade_: I'll notify you, it's a small one.
<TheMue> fwereade_: Here you are: https://codereview.appspot.com/6875053/
<rogpeppe1> mornin' folks
<fwereade_> rogpeppe1, heyhey
<TheMue> rogpeppe: Hi
<TheMue> rogpeppe: Would you also take a look at https://codereview.appspot.com/6875053/ ?
<rogpeppe> TheMue: will do
<TheMue> rogpeppe: Thx
<fwereade_> TheMue, afaict this change makes the window smaller, but does not fix the bug... please remind me what the second branch will be?
<TheMue> fwereade_: The second one will create the unitd's when a machined is created before returning to the main firewaller loop.
<TheMue> fwereade_: See http://irclogs.ubuntu.com/2012/11/30/%23juju-dev.html at the end.
<fwereade_> TheMue, how will that change affect this one?
<fwereade_> TheMue, ah, ok I get it now -- we're never sending port *changes* out of the unitds, just the current set of ports
<fwereade_> TheMue, and so the window doesn't matter, so long as we don't hold inconsistent data
<fwereade_> TheMue, but wait...
<fwereade_> TheMue, isn't the window between initialization of globalPortOpen and globalPortsRef still enough to cause the bug?
<TheMue> fwereade_: Good question, that I asked myself.
<fwereade_> TheMue, I think it's just narrowed the window to the time between initGlobalPorts and the relevant machineLifeChanged
<TheMue> fwereade_: I already thought about a solution where the whole tree of machined, unitd and serviced is initially started when the firewaller starts.
<fwereade_> TheMue, I think that's the only viable solution, isn't it?
<fwereade_> TheMue, saves an awful lot of bandwidth too ;)
<fwereade_> TheMue, but I agree that it is not entirely a trivial solution
<TheMue> fwereade_: Yes, and it works independently of the mode. On a watcher event it's checked whether the corresponding *d is started. If not, it's started then.
<fwereade_> TheMue, ah, maybe I'm confused... ISTM that the only generally sane way to do it is to start watchers and consume their initial events to build up the state *before* handling anything in the FW's main loop
<TheMue> fwereade_: Oh, no, that's not the idea. You see today in global mode that we scan the state once. And that scan could also be used to start all machined's, unitd's and serviced's. And later events only check if they are already covered (*d exists) or if it's a change (add/remove).
<fwereade_> TheMue, ah, great, I misunderstood :)
<TheMue> fwereade_: So before entering the loop the fw is up and running. ;)
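(A schematic Go sketch of the initialization described above: a juju watcher's first event on Changes() reports everything that already exists, so consuming it before the main loop leaves the firewaller fully populated, and later events only need to be checked against the goroutines already running. startMachined is a stand-in helper; machineLifeChanged is the existing function mentioned above.)

    // Schematic only; not the firewaller's actual loop.
    func run(changes <-chan []string, startMachined, machineLifeChanged func(id string)) {
        for _, id := range <-changes { // initial event: every machine that already exists
            startMachined(id)
        }
        for ids := range changes { // subsequent events: genuine lifecycle changes
            for _, id := range ids {
                machineLifeChanged(id)
            }
        }
    }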
 * TheMue steps out for a few minutes, have to fetch medicine from the pharmacy. 
<Aram> morning.
<jam> morning Aram
<niemeyer> Good morning juju-devs
<fwereade_> niemeyer, heyhey
<TheMue> niemeyer: Hiya
<TheMue> niemeyer: The first firewaller CL regarding the machined.ports is in. But fwereade_ and I still see that there's a window between the init of the global ports and the machine life events. So even though the current branch has run stably when tested multiple times in a loop, there's still a risk.
<TheMue> niemeyer: What do you think about the idea of starting all machined's etc initially, before entering the firewaller loop?
<TheMue> niemeyer: So they are not lazy started by the initial watcher events. The events are then used to check if the goroutine for the entity is already running, if it has to be started or if it has to be stopped and removed.
<mgz> wallyworld_: looks from the nova code that what gets used when filtering is just whatever webob gives for GET:
<mgz> search_opts = {}
<mgz> search_opts.update(req.GET)
<niemeyer> TheMue: So there were two different changes, right?
<mgz> ...
<mgz> status = search_opts.pop('status', None)
<niemeyer> TheMue: When you say you and William still see a window, is that the window we discussed previously?
<wallyworld_> mgz: so if the request params have more than one status value?
<niemeyer> TheMue: and that was supposed to be the second CL?
<mgz> ah, webob has a neat MultiDict thing... which is not used by this code
<wallyworld_> \o/
<wallyworld_> not :-(
<mgz> wallyworld_: so, it will just pick one at dictionary order random it seems
<TheMue> niemeyer: We discussed it today. The second CL will add the init of the unitd's for each machined. But that starts with the first events for machines. And there is still a gap until those events arrive.
<wallyworld_> awesome
<TheMue> niemeyer: Both CLs narrow it, but there's still a window.
<niemeyer> TheMue, fwereade_: Isn't that exactly what we discussed before?
<wallyworld_> mgz: so until the openstack side of things does something sensible, we are hamstrung
<niemeyer> TheMue: What's the gap?
<mgz> right, I'm almost tempted to say supporting this level of (server-side) filtering is not useful
<wallyworld_> well, i think it may have some use
<TheMue> niemeyer: Then I maybe misinterpreted it. I now init machined.ports when a new machined is created. And that is done when a machine event is received.
<TheMue> niemeyer: Did you mean to immediately start the machined? Then I'll change it and I'm happy. ;)
<niemeyer> TheMue: Initialized from where?
<niemeyer> TheMue: No, I'm still trying to understand what's the issue.. I can't possibly make a suggestion before that :)
<TheMue> niemeyer: The machined is started in machineLifeChanged(). In global mode it retrieves the machine's units and their ports to init the ports field and the ref count.
<TheMue> niemeyer: It's so far in https://codereview.appspot.com/6875053/.
<niemeyer> TheMue: Right, there's already a moment where you obtain all open ports for all machines in global mode, which means we already know what all the open ports are, and all the units, and all the machines.
<niemeyer> TheMue: Why are we doing this a second time, with less information?
<TheMue> niemeyer: Yes, that's why I thought about starting the machined's already there. I would like to do so independent of the mode.
<niemeyer> TheMue: It's not independent of the mode, because currently open ports, and how to open or close the ports, depends on the mode
<niemeyer> TheMue: And so does port referencing
<niemeyer> TheMue: Do you understand what the gap that William described actually is?
<TheMue> niemeyer: OK, I have to specify it more. Starting the machined's etc would be done based on the initial state but for sure the port handling depends on the mode.
<TheMue> niemeyer: The gap is between the init of the global ports and the first machine life event. Here it may happen that the status of the firewaller doesn't match reality anymore.
<niemeyer> TheMue: Why is that a problem? How does the bug take place?
<niemeyer> TheMue?
<TheMue> niemeyer: Yeah, still thinking.
<TheMue> niemeyer: With the CL a machined retrieves its correct status regarding the ports (but with additional I/O).
<niemeyer> TheMue: It's still not clear what you think the problem is, and how the bug takes place
<TheMue> niemeyer: I've got to admit that I'm not sure anymore.
<TheMue> niemeyer: I'm currently walking different paths through the state (order of adding/removing units) to see where it would break.
<fwereade_> TheMue, niemeyer: I could easily be wrong -- but ISTM that the fundamental problem is that the port refcounts are initialised using different data to that used to initialize their openness
<niemeyer> TheMue: I think it's important to understand the semantics of the problem itself so we can drive towards a good solution
<TheMue> fwereade_: Yes, I already looked here too. But the problem is being sure whether a later change has already been caught by the initial state scan or whether it is an independent event.
<niemeyer> TheMue: I don't quite see what that means
<niemeyer> TheMue: The problem is simpler than it sounds
<TheMue> niemeyer: Any hint is welcome.
<fwereade_> TheMue, isn't the problem that the initial scan does not fully initialize the firewaller? the ideal is that *any* time you get an event in from the watcher you can check it against known-sane, and complete, state
<niemeyer> TheMue: There's a point in the system where we observe the actual state, as the provider reports it
<niemeyer> TheMue: Can you pinpoint that location?
<TheMue> fwereade_: That's my idea to fully initialize it before going into event handling. I wrote it on Friday.
<fwereade_> TheMue, ok, but the state you use to initialise it must itself be consistent
<fwereade_> TheMue, you can't get separate sets of ports in separate places and assume they match
<TheMue> fwereade_: Exactly.
<TheMue> niemeyer: Sorry, can't follow.
<TheMue> niemeyer: We start by watching the machines.
<niemeyer> TheMue: No, we don't
<niemeyer> TheMue: There's a point in the system where we observe the actual ports that are open, as the provider reports it. Can you pinpoint that location?
<TheMue> niemeyer: We retrieve it from Environ.OpenPorts() and Instance.OpenPort(), depending on the mode.
<niemeyer> TheMue: Yes, let's leave aside instance mode for a while
<niemeyer> TheMue: What's the line?
<TheMue> niemeyer: Eh, in the code?
<niemeyer> TheMue: Yes, in the code
<TheMue> niemeyer: It's in Firewaller.initGlobalMode(), used to check which ports are already open.
<niemeyer> TheMue: What's the line number?
<TheMue> niemeyer: In the CL it's 197.
<niemeyer> TheMue: No, it's not
<TheMue> niemeyer: Exactly, just seen. One moment.
<TheMue> niemeyer: I meant Ports() in 165.
<TheMue> niemeyer: The other one opens them.
<niemeyer> TheMue: Very well, now let's say that port 123 is open there
<niemeyer> TheMue: How do we tell we should close it or not?
<TheMue> niemeyer: After collecting the open ports in state we diff to see which ones are to be opened and which are to be closed.
<niemeyer> TheMue: Cool
<niemeyer> TheMue: Now, let's say that we finish running initGlobalMode.. port 123 was still open at that time, but the unit closes it right then, at the end of initGlobalMode
<niemeyer> TheMue: When do we close port 123?
<TheMue> niemeyer: We should do it after the last unit using 123 is removed, but at that moment we don't watch the units and so we don't become aware that its port isn't needed anymore.
<niemeyer> TheMue: THat's not it
<niemeyer> TheMue: We do watch the units
<niemeyer> TheMue: Why is it not working even though we do watch the units?
<TheMue> niemeyer: When the machined is started.
<TheMue> niemeyer: The machined is watching its units, but at that moment the machined isn't running, so it doesn't watch the unit.
<TheMue> niemeyer: Or what do I miss?
<niemeyer> TheMue: Nope.. that's not it
<niemeyer> TheMue: That's exactly what I'm trying to figure
<niemeyer> TheMue: Port 123 is in initial ports.. what else knows that port 123 is open?
<TheMue> niemeyer: IMHO per machined in line 495 we start watching the units of the machine.
<TheMue> niemeyer: The state knows. That's why we compare the environment to the state.
<niemeyer> TheMue: No, it doesn't
<niemeyer> TheMue: The unit has closed port 123 at the end of initGlobalMode, remember?
<TheMue> niemeyer: Yes.
<niemeyer> TheMue: So.. port 123 is open.. and it shouldn't be.. how do we know to close it
<TheMue> niemeyer: Hmm, if we watch the unit we should receive an event and look at what ports the unit has opened.
<TheMue> niemeyer: That's line 573.
<niemeyer> TheMue: Will port 123 be there?
<TheMue> niemeyer: A correct unitd should still know it; OpenedPorts() doesn't return it, so the change will be sent to the Firewaller.
<niemeyer> TheMue: Why? Where is port 123 in that logic?
<TheMue> niemeyer: unitData has a slice with the ports it uses. If it receives a unit change, and closing port 123 is a unit change, it retrieves the opened ports from state (without 123). That will be sent to the firewaller.
<niemeyer> TheMue: Which variable/code line will we find port 123 in?
<TheMue> niemeyer: And via the chain flushUnits() -> flushMachine() -> flushGlobalPorts() a diff is done and 123 should be closed.
<TheMue> niemeyer: change in 573 doesn't contain it anymore while fw.globalOpenPorts still has it as open.
<niemeyer> TheMue: and why does it matter? See line 574
<TheMue> niemeyer: unitd knows its ports and the opened ones in state are not the same.
<niemeyer> TheMue: Why are they not the same?  Port 123 was closed in the state at the end of initGlobalMode, remember?
<TheMue> niemeyer: Just found along the way that we don't use unitData.ports here.
<niemeyer> <niemeyer> TheMue: Why are they not the same?  Port 123 was closed in the state at the end of initGlobalMode, remember?
<TheMue> niemeyer: So the first event of the unit watch, which comes after the closing of port 123, initializes it without that port; they are the same and nothing is raised.
<TheMue> niemeyer: Why did you repeat the sentence?
<niemeyer> TheMue: Because it's been four minutes
<TheMue> niemeyer: OK.
<TheMue> niemeyer: Will remember.
<niemeyer> TheMue: So, is the issue clear now?
<TheMue> niemeyer: I hope so. But that's exactly why I wanted to start the machined etc before the loop is starting.
<TheMue> niemeyer: With the data from state.
 * fwereade_ => lunch
 * TheMue 's family wait for him to come to lunch too.
<niemeyer> TheMue: Argh
<niemeyer> TheMue: The issue isn't starting machined early with data from state
<niemeyer> TheMue: Okay, have a good lunch..
<niemeyer> fwereade_: And you too
<TheMue> niemeyer: Maybe the sentence is just too short and I'll outline it after lunch.
<niemeyer> TheMue: No, it's missing the point..
<niemeyer> TheMue: The bug isn't about how early you start machined with data from state
<niemeyer> TheMue: The bug is the lack of correlation with the known open ports
<niemeyer> TheMue: That's what we spent the last 2 hours going over
<TheMue> niemeyer: Last sentence before lunch: The bug is clear and I thank you for the help, really. It's only my thought of how to solve it.
<niemeyer> TheMue: Glad to hear it, thanks
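(To restate the bug in code form: a schematic sketch, not the firewaller's actual implementation. The provider reports port 123 open; a unit then closes it before the desired set of ports is fully assembled, so 123 never shows up on either side of the diff as something to close. The fix discussed above is correlation: build the complete desired set first, then diff it against what the provider reports.)

    // Schematic only. If providerPorts and desiredPorts describe different
    // moments in time, a port closed in between (like 123 above) is never
    // reconciled; if desiredPorts is complete, the second loop closes it.
    func reconcile(providerPorts, desiredPorts map[int]bool, openPort, closePort func(port int)) {
        for p := range desiredPorts {
            if !providerPorts[p] {
                openPort(p) // wanted open, but the provider doesn't have it open
            }
        }
        for p := range providerPorts {
            if !desiredPorts[p] {
                closePort(p) // open at the provider, but nothing wants it
            }
        }
    }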
<TheMue> So, back again.
<niemeyer> Lunch time here too
<TheMue> niemeyer: Enjoy.
<fwereade> niemeyer, btw, I am having a spot of bother with AddUnitSubordinateTo -- I don't really think that's the right API
<fwereade> niemeyer, I'm leaning towards something like RelationUnit.EnsureSubordinate instead
<fwereade> niemeyer, does that strike you as obviously mad?
<fwereade> oh, bother, lunch
<fwereade> TheMue, rogpeppe: sniff test on the above?
<TheMue> fwereade: Yes, second one reads better.
<fwereade> TheMue, cheers
<niemeyer> fwereade: I don't have as much context as you do. Is there any motivation for the change?
<fwereade> niemeyer, the motivation is (1) I think EnsureSubordinate is the natural way to do this (2) nobody uses AddUnitSubordinateTo (3) the name AddUnitSubordinateTo (to me) implies that failure to add a unit should be an error, while EnsureSubordinate implies that it's fine if a subordinate already exists
<niemeyer> fwereade: (2) is somewhat obvious, given that the feature is in development
<fwereade> niemeyer, (ok, tests use AddUnitSubordinateTo, but often they use it wrong -- it's not sane to have 2 subordinates of the same service with the same principal)
<niemeyer> fwereade: (3) doesn't sound like an issue.. it can be a well defined error
<niemeyer> fwereade: I find it quite appropriate that adding units is done in the same place always, and with similar interfaces
<fwereade> niemeyer, I dunno, the gulf between a principal service and a subordinate service is pretty significant
<niemeyer> fwereade: The only two functions that do that today are right next to one another, and they share the same implementation
<niemeyer> fwereade: That seems very compelling
<fwereade> niemeyer, if I try to keep them using the same implementation, it gets ugly fast
<fwereade> niemeyer, the ops and assertions end up reasonably different, and determining the failure is *very* different
<niemeyer> fwereade: I'm definitely missing background.. you don't have to try to do that. It's already in place
<fwereade> niemeyer, well
<fwereade> niemeyer, a method called AddUnitSubordinateTo does exist
<fwereade> niemeyer, but it doesn't have any checks for dupe subordinates
<fwereade> niemeyer, so, really, it's not in place
<niemeyer> fwereade: Right.. that's the assertion we talked about yesterday
<fwereade> niemeyer, yes; that bit is easy; but correct abort handling STM to get kinda ugly
<niemeyer> fwereade: Which seems pretty easy to add in addUnit, in the location that already exists to add a subordinate
<niemeyer> fwereade: It won't get any less ugly if you move that logic elsewhere, it sounds like
<niemeyer> fwereade: All the common logic for adding a unit that exists on this function sounds sensible, and necessary
<fwereade> niemeyer, moving up a level for a sec -- are you -1 on the very notion of RelationUnit.EnsureSubordinate(), on the basis that there's already some code that does something a bit like what we want?
<fwereade> niemeyer, or are we just arguing implementation details?
<niemeyer> fwereade: We're surely arguing implementation details, given that we're talking about logic placement
<fwereade> niemeyer, ok, that is not something I think I can honestly speak to until I've tried 2 or 3 different styles
<niemeyer> fwereade: I'm -1 on duplicating logic without a clear reason
<fwereade> niemeyer, right
<niemeyer> fwereade: and moving it away from other logic that looks very much alike
<niemeyer> fwereade: EnsureSubordinate sounds like something that can trivially be built upon AddUnitSubordinateTo
<niemeyer> fwereade: err := addUnit; err == AlreadyThere { Oh, okay. }
<fwereade> niemeyer, ok, but then we have a useful public state method and a useless one, for doing the same thing
<fwereade> niemeyer, which feels kinda bloaty
<fwereade> niemeyer, but perhaps you have some extra use in mind for AUST?
<niemeyer> fwereade: Okay, so let's not have EnsureSubordinate.. because that's the trivial one that woudl duplicate logic
<fwereade> niemeyer, can you explain the use case for the AddUnitSubordinateTo method?
<niemeyer> fwereade: I have both AddUnit and AddUnitSubordinateTo open up in my screen, in the *same* terminal window.. they look very much alike, and share an implementation.
<fwereade> niemeyer, I have been asking various people about this for months, and nobody has given me a reason other than "er, the python is like that"
<niemeyer> fwereade: I don't understand the question.. the answer would be self-obvious
<niemeyer> fwereade: It adds a unit that is a subordinate to a principal
<fwereade> niemeyer, right, and when do we need to do this?
<niemeyer> fwereade: When we want to add a subordinate unit
<fwereade> niemeyer, ISTM that you are assuming that there are a multiplicity of situations in which we want to do this
<niemeyer> fwereade: No, I'm not assuming anything..
<niemeyer> fwereade: There's a method in state that allows adding a unit, and there's a method for adding a unit that is a subordinate of a principal..
<niemeyer> fwereade: It sounds to me like straightforward design
<fwereade> niemeyer, right -- except it's seriously wrong in at least two ways
<niemeyer> fwereade: Okay?
<fwereade> niemeyer, and it is not a remotely convenient thing to do in the situation in which it's required
<niemeyer> fwereade: Sorry, how is it seriously wrong, and how is it not remotely convenient
<fwereade> niemeyer, it is seriously wrong in that there is no verification of relation state -- you can add anything to anything -- and in the way we've already discussed
<niemeyer> fwereade: So AddUnit is completely wrong too?
<fwereade> niemeyer, no -- AddUnit is AFAICT ok, although I haven't been looking at that side of it closely
<niemeyer> fwereade: Why? We can add anything to anything as well, without any care about relation state
<fwereade> niemeyer, why should AddUnit care about relation state?
<niemeyer> fwereade: Well, there are peer relations too as well, right?
<fwereade> niemeyer, so what?
<niemeyer> fwereade: Well, exactly.. :)
<niemeyer> fwereade: Subordinates aren't different..
<fwereade> niemeyer, we've already made the decision that broken peer relations are just fine
<fwereade> niemeyer, and that the user mustn't touch them
<fwereade> niemeyer, I consider this to be crack, but anyway
<fwereade> niemeyer, but still -- why would the existence or otherwise of a peer relation impact whether it's ok to add a unit?
<niemeyer> fwereade: Sorry, I don't understand what we've decided regarding broken peer relations
<niemeyer> fwereade: But okay, let's leave that aside
<niemeyer> fwereade: The relation state can be completely inconsistent after you added that unit
<niemeyer> fwereade: There should be a peer relation, and there isn't
<niemeyer> fwereade: This will be eventually established as the uniter runs
<fwereade> niemeyer, whaaaaa?
<niemeyer> fwereade: I'm trying to understand why you think that's so much different from the case of subordinates
<niemeyer> fwereade: What within AddUnitSubordinateTo and addUnit do you think shouldn't be there?
<fwereade> niemeyer, is it OK to have one unit of mongodb deployer?
<niemeyer> fwereade: Why is that wrong?
<fwereade> s/deployer/deployed/
<niemeyer> fwereade: Obviously
<niemeyer> fwereade: What within AddUnitSubordinateTo and addUnit you believe shouldn't be there?
<fwereade> niemeyer, can we back up?
<niemeyer> fwereade: That's what I'm trying to do
<fwereade> niemeyer, because you have thrown me a complete curveball regarding peer relations
<niemeyer> fwereade: Yes, sorry, please ignore me there
<niemeyer> fwereade: What within AddUnitSubordinateTo and addUnit you believe shouldn't be there?
<fwereade> niemeyer, why is the existence or otherwise of a peer relation relevant to whether or not it's ok to add a unit of a service?
<niemeyer> fwereade: Can we back up a bit?
<niemeyer> fwereade: What within AddUnitSubordinateTo and addUnit you believe shouldn't be there?
<fwereade> niemeyer, where on earth did that question come from? I am talking about *deficiencies* in AUST
<fwereade> niemeyer, things that are *not* there
<niemeyer> fwereade: It came from my mind
<fwereade> niemeyer, well, please, what does it have to do with what I have been saying?
<niemeyer> <fwereade> niemeyer, right -- except it's seriously wrong in at least two ways
<niemeyer> fwereade: I'm trying to understand what's so dramatically wrong about it
<niemeyer> fwereade: Because I've been pointing out that what's there is necessary, it's close to related logic, etc
<niemeyer> fwereade: It feels like we've been talking across each other
<fwereade> niemeyer, ok, so what does it mean for S1 to have a subordinate of S2 when there is no relation between S1 and S2?
<fwereade> niemeyer, and, what does it mean for S1 to have a unit when S1 is not in a peer relation?
<niemeyer> fwereade: It means that there are two units, one is subordinate of the other, and they'll both be deployed
<niemeyer> fwereade: S1 and S2 will live in the same container
<niemeyer> fwereade: and unless a relation is established between them, nothing else will happen
<fwereade> niemeyer, why would we deploy a subordinate in the first place without there being a relation between them?
<niemeyer> fwereade: Relations are completely orthogonal to whether something is subordinate or not.. we've decided to tweak the UI to make that common so that we could explain less to users
<niemeyer> fwereade: I believe everything actually does work, even if there's no relation between them
<fwereade> niemeyer, I thought that the existence of any subordinate unit was predicated on a locally scoped relation between the subordinate service and the principal service?
<niemeyer> fwereade: It's a sane building block, and one that seems to work pretty well thus far
<niemeyer> fwereade: That's how we implement the UI, yep
<niemeyer> fwereade: Feels great
<fwereade> niemeyer, ok, but it sounds like you have some use case for subordinates that are not in relations with their principals?
<niemeyer> fwereade: Maybe there's another reason why you believe OMG THATS TERRIBLE WE'LL ALL DIE if that method exists
<niemeyer> fwereade: Which is what I've been trying to extract so far
<niemeyer> fwereade: No, I don't have a reason to believe that the current logic is completely broken, yet.. that's all
<niemeyer> fwereade: We have tests, we have methods.. methods work well so far.. Python seems to work as well to some degree.. etc
<niemeyer> fwereade: You seem to think that the logic that is there is actually okay too, actually
<niemeyer> fwereade: But there's more logic that we'll need, to make things work well
<niemeyer> fwereade: Which sounds fine too
<niemeyer> fwereade: So perhaps what I don't understand is the strong feeling that things are totally broken
<niemeyer> fwereade: So, what's the case? EnsureSubordinate.. what would it do that AddUnitSubordinateTo can't be doing?
<fwereade> niemeyer, it would hopefully have the bugs fixed; but the main point is that it would be attached to an object we have available at the point we need it, and not require us to do a ridiculous little dance extracting info from the RU at the point we need to actually deploy a subordinate
<fwereade> niemeyer, but I am still confused about a number of the things you have been saying
<niemeyer> fwereade: Okay, please just submit a review then.. I really don't have nearly enough context and am probably on crack
<fwereade> niemeyer, I dunno, you're saying things that are surprising to me -- one of us probably is, I agree, but I'm not yet willing to assume it's you ;p
<niemeyer> fwereade: I don't mind changes myself, as long as they are improvements. A contentious point will be code duplication, but I'm sure you can avoid that.
<niemeyer> fwereade: Sorry, please ignore the half of what I said that you didn't understand. It really doesn't matter. If we have good APIs for doing what we do need to do, we can change stuff later if needed.
<fwereade> niemeyer, would you explain again, though, what use cases you imagine for subordinates that are not in relations with their principals?
<fwereade> niemeyer, because I have seen subordinates and relations as essentially inseparable
<fwereade> niemeyer, and I fear we're talking Big Redesign if you have plans for arbitrary subordinates without relations
<niemeyer> fwereade: What I pointed out is that this was a deliberate choice to simplify comprehension.. I guess we did a good job at that.
<niemeyer> fwereade: No, I don't have any impending plans at all.
<niemeyer> fwereade: Inside my own mind I just kept the original reasoning that these concepts are, in fact, orthogonal. Maybe that's my mistake.
<fwereade> niemeyer, ok, cool -- then ISTM that a subordinate that is not in a relation with its principal is essentially corrupt state
<niemeyer> fwereade: Sure, that sounds like a good way to put it for all valid purposes.
<fwereade> niemeyer, ok, so: do we then agree that the fact that AUST does not verify relation membership, and just adds a subordinate to an arbitrary principal, is wrong?
<fwereade> niemeyer, (ofc I would like to make the concerns entirely orthogonal, but I don't see a way to separate them that is not buggy)
<niemeyer> fwereade: Yes, I'm happy to have an API that takes that perspective.
<fwereade> niemeyer, ok, cool; so sane subordinate creation depends on (1) a locally scoped relation (2) a principal unit of that relation (and (3) the subordinate service itself)
<fwereade> niemeyer, ISTM that RelationUnit, which incorporates (1) and (2) and has easy access to (3), is a good place to expose this functionality
<fwereade> niemeyer, and that given a RelationUnit, what I (as the uniter) really want to do is to make sure that any subordinates that should exist do, and just move on
<fwereade> niemeyer, so ISTM that `ru.EnterScope(); ru.EnsureSubordinate()` is moderately sensible
<niemeyer> fwereade: Yep, sounds fine
<fwereade> niemeyer, (part of me wants to roll it into EnterScope, but that really is mixing concerns, so let's not go there)
<niemeyer> fwereade: Agreed, some separation is beneficial
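(A sketch of the uniter-side call site just agreed on; EnsureSubordinate does not exist yet, so its name, receiver and error-only signature are the proposal under discussion rather than settled API. EnterScope is the existing RelationUnit method mentioned above.)

    // Illustrative only, assuming launchpad.net/juju-core/state is imported as state.
    func joinScope(ru *state.RelationUnit) error {
        if err := ru.EnterScope(); err != nil {
            return err
        }
        // Create this principal's subordinate if it is missing; a no-op if a
        // subordinate of that service already exists for this principal.
        return ru.EnsureSubordinate()
    }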
<fwereade> niemeyer, since this is the *only* known situation in which it is meaningful to add a subordinate, I am -1 on the existence of duplicate functionality in the public API, ie AUST
<niemeyer> fwereade: Fair enough
<fwereade> niemeyer, I have no official position on which bits of code should or should not be common; I will only discover that in the course of further exploration ;)
<niemeyer> fwereade: Yep, I can understand that
<niemeyer> fwereade: I'll just note that all of the logic that is currently in addUnit remains needed
<fwereade> niemeyer, indeed -- I shall be extra careful to bear that in mind :)
<niemeyer> and so does the logic in AddSubordinate
<fwereade> niemeyer, you will find no disagreement on that front here
<niemeyer> fwereade: Cool, so we're in sync I think
<fwereade> niemeyer, yeah, agreed -- thanks :)
<fwereade> niemeyer, how would you feel about an exploration of my whining about missing peer relations now? :)
<niemeyer> fwereade: I think it's unnecessary at this point.. I suggest celebrating our agreement and trying to push that front forward
<fwereade> niemeyer, ok, fair enough :)
 * niemeyer breaks for an errand
<rogpeppe> i think i've written something quite nice, but i've got to the end of the day.
<rogpeppe> spiky spike though.
<rogpeppe> g'night all, see you tomorrow.
<mramm> Made it through my 20 hours of flights, and none the worse except for one phone with a broken screen
<mramm> so if you need to contact me email and IRC work, and I'll be heading out to get a new phone tomorrow
#juju-dev 2012-12-05
<hazmat> mramm, welcome back
<mramm> hazmat: thanks
<TheMue> Morning
<rogpeppe1> TheMue_, davecheney, fwereade: morning!
<TheMue_> rogpeppe1: Heyhey
<fwereade> rogpeppe1, TheMue_, davecheney: heyhey
<TheMue_> fwereade: Also for you a good morning
<jam> TheMue_: good morning
<TheMue_> jam: Hiya
<fwereade> TheMue, rogpeppe1: do either of you have any context on what the deal is with those vast terrifying incomprehensible tables in machine_test.go?
<fwereade> TheMue, rogpeppe1: ISTM that they should be completely rewritten and could probably be at least 50% smaller
 * rogpeppe looks
<fwereade> TheMue, rogpeppe: but I don't know that for sure, and maybe one of you will be able to point out some way this would be a foolish move
<rogpeppe> fwereade: you're talking about machinePrincipalsWatchTests, right?
<fwereade> rogpeppe, all the machine watchy tests really
<rogpeppe> fwereade: if you've got a nice way of making them prettier and more maintainable, i'm definitely all for that
<rogpeppe> fwereade: it would be nice if all the test cases were independent, right enough
<rogpeppe> fwereade: although i fear that we might slow down the testing considerably by doing that
<rogpeppe> fwereade: i'd prefer to avoid another test suite that takes 2 minutes to run
<fwereade> rogpeppe, yeah, it's work I don't really *want* to do, but since I need to mess with those tests destructively anyway (because they're doing totally nonsensical things with subordinates)
<fwereade> rogpeppe, indeed, I shall try to figure out a way to avoid setting up and tearing down all of state in one go
<fwereade> rogpeppe, incidentally, I would appreciate comments on https://codereview.appspot.com/6845120/ if you have a moment
<rogpeppe> fwereade: yes, sorry, i've been fully focusing on getting an API proof-of-concept spike going the last couple of days. (i'm hoping to have something for you to have a look at later this morning)
<fwereade> rogpeppe, awesome
<TheMue> fwereade: I'll take a look.
 * TheMue is puzzled about his firewaller test which is hanging after the last change
<fwereade> TheMue, cheers
<TheMue> fwereade: Didn't forget you, just stuck in the tests. ;) So far OK, I'm only a bit unhappy with Context.
<fwereade> TheMue, if you can think of a better name for it I'm all ears -- or are there deeper worries?
<TheMue> fwereade: No, it's exactly the name. Somehow too generic. I'm trying to find a better one, but so far I've got none. :/
<TheMue> fwereade: Where do you see the future implementations of the Context interface? At the providers?
<fwereade> TheMue, I was considering Target but that wasn't quite right
<fwereade> TheMue, no, I expect an LXC implementation and that's it
<fwereade> TheMue, I am only aware of those two deployment styles, but I guess there may be others one day?
<TheMue> fwereade: Hmm, here I'm missing the big picture. I would like to stand in front of a whiteboard with you now. :)
<fwereade> TheMue, ok, I'll try to do the quick version
<fwereade> TheMue, principal is to machine as subordinate is to principal
<fwereade> TheMue, ie the responsibilities of a machine (wrt deployment) are the same as those of a principal: the aspects that differ are (1) what set of units are relevant and (2) how those units should be deployed
<fwereade> TheMue, at the moment, I plan to add deployers for both machine agents and for principal unit agents, each of which will use the SimpleContext
<fwereade> TheMue, once we have LXC available, the machine's deployer just swaps out the Context, and we're done
<TheMue> fwereade: Thanks, makes it clearer.
<fwereade> TheMue, cool
<TheMue> fwereade: Sadly all the names that come into my mind now are generic too.
<TheMue> fwereade: Maybe workspace, that matches your method names.
<fwereade> TheMue, hmm, interesting, I think I might like that
<fwereade> TheMue, it makes me wonder whether the state info should be on the context/workspace or the Deployer
<TheMue> fwereade: Cheers
<fwereade> TheMue, with "context" that made sense, with "workspace" maybe less so
<fwereade> TheMue, but maybe the interface should be demanding the deployer name in every method, and state info for deploying with
<TheMue> fwereade: What exactly are you referring to with "state info"?
<fwereade> TheMue, *state.Info?
<TheMue> fwereade: Aargh, ok.
<TheMue> fwereade: I would see a Workspace implementation only responsible for that Workspace, but everything regarding the state should be handled by the Deployer.
<TheMue> fwereade: But for sure that information could be passed to the Workspace by the Deployer initially.
<fwereade> TheMue, yeah, I'll think about it :)
<TheMue> fwereade: Cheers++
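(A rough sketch of the interface shape under discussion; the method names are illustrative assumptions, not the CL's contents. Whether the *state.Info travels in these calls or lives on the Deployer is exactly the open question above, so it is left out here.)

    // Sketch only. "Context" is the name being debated ("Workspace" the
    // alternative); a machine agent and a principal unit agent would each drive
    // one of these through a Deployer, with the upstart-based SimpleContext
    // today and an LXC-based implementation later.
    type Context interface {
        // DeployUnit installs and starts an agent for the named unit.
        DeployUnit(unitName string) error
        // RecallUnit stops and removes the named unit's agent.
        RecallUnit(unitName string) error
        // DeployedUnits reports the units this context currently manages.
        DeployedUnits() ([]string, error)
    }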
<TheMue> fwereade: AddUnitSubordinateTo(unit) is on Service and adds the Unit as subordinate to the Service?
<fwereade> TheMue, it is currently, but I am changing that ATM
<TheMue> fwereade: What will the new name be?
<fwereade> TheMue, I want RelationUnit.EnsureSubordinate()
<TheMue> fwereade: What we talked about, ok. Thought that this would not come. Alternatively AddUnitAsSubordinate() would also be ok, not as elegant as Ensure…, but ok. :)
<fwereade> TheMue, this is unwieldy from a testing perspective, but I think right from every other perspective
<fwereade> TheMue, sorry, what type would that method be on?
<fwereade> TheMue, and what args?
<TheMue> fwereade: Eh, as I've seen it here in the tests on Service. But you do more than just a renaming, so I have to wait and see.
<fwereade> TheMue, my issue with putting it on Service is that, at the point we need to add a subordinate, we have a RelationUnit but no Service; and it seems silly to do the whole service-getting dance outside state, just so we can use a method with lots of opportunities for crackful usage, rather than to keep the fiddly details inside state and have a single method that can only be used when there's a reasonable chance of the operation being sane
<fwereade> TheMue, the tests are riddled with nasty things like principals with N subordinates of the same service, and subordinates existing without having relations with their principal
<fwereade> TheMue, EnsureSubordinate involves more work with the tests, but at least ensures (ha) that we can only add subordinates when it's sane to do so
<TheMue> fwereade: Thanks for the explanation, I only stumbled upon the name.
<fwereade> TheMue, ofc fixing all those tests is no fun, every time I look at machine_test.go my brain thinks of something else I have to do that is terribly important
<fwereade> TheMue, but I'm getting there :)
<TheMue> fwereade: *rofl*
<rogpeppe> fuck me, it works
<rogpeppe> :-)
<fwereade> rogpeppe, yay!
<rogpeppe> fwereade: here's the spike: https://codereview.appspot.com/6878052/
<rogpeppe> fwereade: i'm very interested to know what you think. in particular, i *think* the rpc package is quite nice, but mileages might vary significantly!
<rogpeppe> TheMue: likewise, i'd like your reaction
<rogpeppe> the docs for the rpc package can be viewed here: http://go.pkgdoc.org/launchpad.net/~rogpeppe/juju-core/176-rpc-spike/rpc
<TheMue> rogpeppe: I'll take a look.
<Aram> moin.
<rogpeppe> Aram: hiya
<rogpeppe> Aram: if you fancy it, you might wanna take a look at my API spike branch. https://codereview.appspot.com/6878052/
<TheMue> Aram: Moring
<TheMue> s/Moring/Morning/
<Aram> rogpeppe, right away.
<Aram> TheMue, cheers.
<TheMue> rogpeppe: First look is impressive, need more time to get it. Will look again after I've found my own problem here. ;)
<rogpeppe> TheMue: thanks
<rogpeppe> here comes the snow...
<TheMue> rogpeppe: Here we have a fine white cover since last night. Not very much, but white (and cold, -6° this morning).
<rogpeppe> TheMue: it's just started snowing quite hard... i wonder how long it'll keep going
<TheMue> rogpeppe: Those are the days I like working from home. ;)
<rogpeppe> TheMue: me too!
<fwereade> whoops, gotta dash, post office about to close, bbiab
<niemeyer> Morning all
<rogpeppe> niemeyer: hiya
<dimitern> jam, wallyworld_ here's the first part of the nova double https://codereview.appspot.com/6877054
<wallyworld_> cool
<dimitern> jam: ping
<dimitern> just a quick question - so I have that branch I proposed, but I want to keep working on the same feature, is the correct process like this: 1) create a new branch while I'm in the same branch I proposed, 2) continue working there, 3) once ready - propose it with a prerequisite=the old branch?
<jam> dimitern: it depends what "keep working" means, but generally I would say yes. I'm not sure if lbox supports prerequisite branches or not, though.
<dimitern> jam: yes it does
<jam> --req =
<dimitern> yep
<dimitern> jam: keep working means continue on the same feature, even though the first branch is not merged yet
<jam> dimitern: right, so it is possible that keep-working is tweaking the already proposed branch, such that you would just push up a new version. Or it can be "building stuff on top of the branch" which would be creating a new branch and using prereq
<jam> Most likely what you want is a prereq
<dimitern> jam: ok, so I'll do that (with the new branch of the old one) - I ask because once the proposed branch is merged, you cannot propose it again - it has to be a new one
<jam> dimitern: I think 'lbox propose' will refresh the current proposal
<jam> certainly you can resubmit in the lp ui
<jam> I'm not 100% sure what the reitveld bits are.
<dimitern> jam: not unless it's submitted (it will reopen the CL, but the merge won't work if you try submitting again)
<Aram> lbox propose refreshes both rietveld and launchpad, yes.
<Aram> basically the workflow you describe dimitern is sane, that's what we use.
<Aram> do something, propose, branch into something else, propose with prereq, repeat.
<dimitern> Aram: ok, thanks, good to know I'm doing it right :)
<niemeyer> Gosh
<niemeyer> I was trying to build a C++ program last night.. it's interesting how biased we get after we taste sane builds
<niemeyer> Not only did it take several hours and overheat the laptop, it also killed my disk space
<niemeyer> (which is why I'm reminded of it just now)
<Aram> what the hell did you try to build?
<jam> niemeyer: what were you building?
<niemeyer> jam: LLVM
<Aram> ah, yes.
<Aram> heh.
<niemeyer> 7GB on the build directory so far
<niemeyer> Hasn't finished, so I don't know how much it actually takes
<niemeyer> and some people complain Go binaries are large
<Aram> it's interesting. Solaris was 20 million lines of C code. It took two hours to build on T410, but LLVM is significantly smaller and takes longer just because it is C++ not C.
<Aram> oh, and Solaris was built twice in this time (once with gcc and once with sunpro).
<rogpeppe> niemeyer: i've made a spike branch as a proof of concept around the API stuff. i'd like to know what you think. it's sketchy in many places. https://codereview.appspot.com/6878052/
<niemeyer> rogpeppe: Cool, I'll just finish this test I'm writing and we can talk
<rogpeppe> niemeyer: cool
<niemeyer> Why is env.Destroy hanging on TearDown of the dummy live tests?
<rogpeppe> niemeyer: it works for me
<niemeyer> The lack of space probably destroyed my state
<niemeyer> Nope.. looks like something is actually wrong with the way dummy is managing MongoDB
<niemeyer> rogpeppe: We should go back to that idea of using a single binary at some point
<rogpeppe> niemeyer: agreed
<rogpeppe> niemeyer: it'll be quite simple to change i think. the only wrinkle is setting up the symlinks
<niemeyer> rogpeppe: How's it different?
<rogpeppe> niemeyer: we'd still need binaries called "juju", "jujud", and "jujuc" even if they all point to the same thing, no?
<niemeyer> rogpeppe: I don't think so
<rogpeppe> niemeyer: ah ok - we just have the same binary act as all things?
<niemeyer> rogpeppe: jujuc is already just a symlink target by itself
<niemeyer> rogpeppe: jujud is the only one that might require some tweaking
<rogpeppe> niemeyer: so the user types "juju help" and they get jujud subcommands too?
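(A minimal sketch of the single-binary idea: with juju, jujud and jujuc all symlinked to one executable, the program can dispatch on the name it was invoked as. The dispatch targets here are placeholders, not juju-core's actual command wiring.)

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    func main() {
        switch filepath.Base(os.Args[0]) {
        case "jujud":
            fmt.Println("run the machine/unit agent commands")
        case "jujuc":
            fmt.Println("run the hook tool commands")
        default: // invoked as "juju"
            fmt.Println("run the client commands")
        }
    }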
<niemeyer> [LOG] 81.27788 JUJU state: connection failed: dial tcp 107.21.168.241:37017: connection refused
<niemeyer> [LOG] 81.27794 JUJU state: connecting to 107.21.168.241:37017
<niemeyer> [LOG] 81.46296 JUJU state: connection failed: dial tcp 107.21.168.241:37017: connection refused
<niemeyer> [LOG] 82.02849 JUJU state: connecting to 107.21.168.241:37017
<niemeyer> [LOG] 82.23336 JUJU state: connection failed: dial tcp 107.21.168.241:37017: connection refused
<niemeyer> [LOG] 82.23343 JUJU state: connecting to 107.21.168.241:37017
<niemeyer> rogpeppe: Btw, we should slow that down a bit, I think
<rogpeppe> niemeyer: i agree. perhaps an option in mgo.Info ?
<niemeyer> rogpeppe: A sane default might be enough
<rogpeppe> niemeyer: yeah, true
<niemeyer> We're punching the server/services too hard it seems
<rogpeppe> niemeyer: is it causing problems?
<niemeyer> rogpeppe: It creates a fast-scrolling log, without much in return
<rogpeppe> niemeyer: agreed.
<niemeyer> rogpeppe: and I just got a "unauthorized access" error
<niemeyer> rogpeppe: Any hints when that happens?
<rogpeppe> niemeyer: did it cause a test failure?
<niemeyer> rogpeppe: Yeah
<rogpeppe> niemeyer: hmm. i thought i'd fixed that
<niemeyer> rogpeppe: What was the cause?
<rogpeppe> niemeyer: can you paste the test log?
<rogpeppe> niemeyer: the race between bootstrap-state and the client connection
<niemeyer> rogpeppe: The log is really uninteresting.. a long tail of attempts to connect followed by unauthorized
<rogpeppe> niemeyer: which test was it? and at which point in the test was it trying to connect? the log might still be useful to show that.
<rogpeppe> niemeyer: this was in the live tests?
<niemeyer> rogpeppe: It's the test I'm writing, and it's trying to connect after bootstrap
<niemeyer> rogpeppe: Yes
<niemeyer> rogpeppe: Maybe it's my fault then.. I'll have a deeper look
<rogpeppe> niemeyer: is it using juju.NewConn ?
<niemeyer> rogpeppe: Yeah
<niemeyer> Awww
<niemeyer> It kills the machine doesn't it
<rogpeppe> niemeyer: i bet bootstrap-state never ran
<niemeyer> rogpeppe: Yeah, possibly.. it'd be easy to figure it out with logs
<niemeyer> I mean, if I had access to the machine
<niemeyer> Hmm.. I wonder if I can still check the console logs
<rogpeppe> niemeyer: run the test, wait until it's started the machine, then ^C the test and ssh to the machine
<niemeyer> Yep!
<rogpeppe> niemeyer: do you actually see anything useful in the console logs?
<niemeyer> rogpeppe: All good, it's probably me doing something silly on the test
<niemeyer> rogpeppe: It couldn't find jujud
<niemeyer> rogpeppe: Which is most likely me
<rogpeppe> niemeyer: ah, that could explain it :-)
<niemeyer> rogpeppe: I'll keep the symptom in mind :)
<rogpeppe> niemeyer: there is one outstanding issue to do with unauthorized access, but i think that only happens when several clients are trying to connect concurrently.
<rogpeppe> niemeyer: yeah, it would be nice to give a more helpful error.
<rogpeppe> niemeyer: the only problem is that if your password is genuinely wrong, then that error message is actually correct.
<niemeyer> rogpeppe: Yeah, it's a bit tricky to be helpful in this case
<niemeyer> rogpeppe: Really, *anything* could go wrong
<niemeyer> There I go again
<niemeyer> Awesome, the failing test was entirely related to the fact the test was *supposed* to fail.
 * rogpeppe goes for some lunch
<niemeyer> rogpeppe: Enjoy
<niemeyer> I shall not be long either
<niemeyer> mramm: ping
<rogpeppe> back
<niemeyer> https://codereview.appspot.com/6868070
<niemeyer> I'll step out for lunch
<niemeyer> mramm: The meeting you scheduled overnight is over my lunch time.. I hope it's okay to move it forward.
<andrewdeane> niemeyer, rogpeppe Hello. I'm just going through the data available. Is it possible to detail the certificate(s) associated with an environment, machine, service, unit? Thanks.
<rogpeppe> andrewdeane: from what point of view are you asking?
<andrewdeane> When recording the current state I'd like to see the details of the 'person' that invoked the item.
<mramm> niemeyer: sure, we can move it, let's move it back though.
<mramm> just ping me when you get back from lunch
<rogpeppe> andrewdeane: i don't quite understand what you're saying there. what do you mean by "recording the current state" ?
<rogpeppe> andrewdeane: (BTW i generally won't see remarks on IRC unless they mention my IRC nick)
<rogpeppe> fwereade: ping
<fwereade> rogpeppe, pong
<rogpeppe> fwereade: i'm trying to understand what you're trying to protect against with the DeployerName check
<rogpeppe> fwereade: when might it be triggered?
<fwereade> rogpeppe, sorry, which check? the purpose of DeployerName is to distinguish between different contexts that happen to be using the same init dir
<rogpeppe> fwereade: when would that happen?
<andrewdeane> rogpeppe, Sorry. If I do a juju status is there a way to see who (certificates?) started the items?
<fwereade> rogpeppe, any agent that runs will have an upstart job for itself in that dir that it is not responsible for
<rogpeppe> andrewdeane: not currently, no.
<andrewdeane> rogpeppe, any plans to, or to get the information out some other way?
<fwereade> rogpeppe, and since we don't currently have LXC, we'll have (1) the machine agent which should not be removed; (2) N principals which should be touched by the machine agent only and (3) M subordinates which should each be touched only by their principal
<fwereade> rogpeppe, all in the same init dir
<rogpeppe> fwereade: but surely the deployer's unit watcher won't see the wrong units?
<rogpeppe> fwereade: (which is what we're trying to guard against, presumably)
<rogpeppe> andrewdeane: currently there are no notions of "ownership" in the state. we might move in that direction in the future, but maybe not for a while
<rogpeppe> andrewdeane: anyone with admin rights (the administrator password) can currently change anything in the state.
<fwereade> rogpeppe, hmm, I suppose we could just drop the ability to list deployed services
<rogpeppe> fwereade: or we could perhaps tag the upstart job name with the entity that's responsible.
<fwereade> rogpeppe, that is what you're complaining about, isn't it?
<andrewdeane> rogpeppe, is it true that on bootstrap we know which cert is being used? Is there a way to id those with admin rights? Incoming IP, say?
<rogpeppe> fwereade: i'm mumbling because i don't understand the logic in changed() and why it's there.
<rogpeppe> andrewdeane: with the API will come the ability to know who's doing what in the state.
<rogpeppe> andrewdeane: then it will be easy enough to tag items with the entity that touched them
<rogpeppe> andrewdeane: but first i've got to implement it!
<andrewdeane> rogpeppe, :) I won't hold you to anything but do you have ballpark timescales?
<rogpeppe> andrewdeane: currently there are only two certificates in the system - the server's certificate and the CA certificate.
<rogpeppe> andrewdeane: the API is due for 13/04, i think
<fwereade> rogpeppe, I'm not sure how to answer that -- changed() makes sure that the local deployed state is what it should be, for any given unit, and recalls anything that shouldn't be there even if by some poor luck it was not mentioned in an event
<fwereade> rogpeppe, eg on unit reassignment
<fwereade> rogpeppe, if a unit is assigned away from a machine while the agent is down, it won't know about the unit at all when it comes back up, except by examining local state
<fwereade> rogpeppe, I know we never unassign from machines in real code at the moment
<andrewdeane> rogpeppe, ok. thanks. I can put placeholder fields in, in the meantime.
<fwereade> rogpeppe, but since it's possible, I want to be sure it's handled sanely, even if it's kinda incomplete
<fwereade> rogpeppe, ie I'm trying to handle what's currently possible sanely, not to make the whole assign/unassign functionality sane and usable at this stage
<fwereade> rogpeppe, sorry gtg a mo
<rogpeppe> fwereade: k
<rogpeppe> fwereade: i'm trying to understand how a DeployedUnits can return the names of any units that weren't deployed by the context's deployer entity, given that the upstart job's name is tagged with that entity name.
<rogpeppe> fwereade: i.e. i think we could lose the "responsible" logic entirely, and we would never see any difference in behaviour.
<niemeyer> mramm: Hi
<mramm> hey
<niemeyer> mramm: Back and forward are interesting when talking about time
<niemeyer> mramm: I tend to think that "moving forward" is making it later
<niemeyer> mramm: But that's obviously subjective :)
<mramm> niemeyer: forward feels like pulling it towards me
<mramm> and back like pushing it away
<niemeyer> mramm: Time machines that move back in time tend to go to the past :)
<mramm> niemeyer: true enough
<Aram> I never understood forward and backward when talking about time. it's especially confusing during daylight changes. "how did the clock move?", "one hour forward", "wtf".
<niemeyer> Aram: Yeah, I'm totally lost on those too
<niemeyer> There might be some cartesian thinking there too.. I tend to think of the T axis as having positive values on the right-hand side
<fwereade> rogpeppe, sorry, I forgot I was talking to you :/
<fwereade> rogpeppe, ok, what should happen when a unit is unassigned while the machine agent is down?
<rogpeppe> fwereade: when it comes back up, it should remove it.
<fwereade> rogpeppe, how do we do that without the responsible logic?
<rogpeppe> fwereade: SimpleContainer doesn't return any units that weren't deployed by the responsible entity
<rogpeppe> fwereade: so, AFAICS, the responsible logic is redundant
<fwereade> rogpeppe, I don't understand how that can be the case -- I could rearrange how the deployer is written, but I still need to handle that case, don't I?
<fwereade> rogpeppe, and I still need to be able to get just the units I've deployed
<rogpeppe> fwereade: that's what SimpleContainer does, no?
<fwereade> rogpeppe, precisely how I reconcile is open for discussion, but I picked what I did because it seemed simplest
<rogpeppe> fwereade: (by including the deployer name in the upstart job name)
<rogpeppe> fwereade: i may well have got the wrong end of the stick!
<fwereade> rogpeppe, ah! got you! sorry, no, I did
<fwereade> rogpeppe, how can we guarantee that a unit reported by WatchPrincipalUnits is still assigned to the machine that reported it?
<fwereade> rogpeppe, all we know is that something changed wrt that unit
<rogpeppe> fwereade: the responsible logic isn't checking the assigned machine AFAICS
<fwereade> rogpeppe, `responsible = deployerName == d.ctx.DeployerName()`?
<rogpeppe> fwereade: how does that tell us anything about the assigned machine?
<fwereade> rogpeppe, what is deployerName?
<rogpeppe> fwereade: ah, gotcha.
<rogpeppe> fwereade: but if the unit *has* been assigned to a different machine, surely we'd want to kill it locally anyway?
<fwereade> rogpeppe, we want to recall it, but not to do anything else
<fwereade> rogpeppe, if it's deployed and we're not responsible, we recall it
<rogpeppe> fwereade: i don't like the "recall" name btw
<rogpeppe> fwereade: i think "kill" might be better
<fwereade> rogpeppe, whoa please no
<fwereade> rogpeppe, total semantic collision with ensuredead etc
<rogpeppe> fwereade: recall sounds like something fairly non-destructive
<niemeyer> rogpeppe: Replied to the comments, thanks for the review
<fwereade> rogpeppe, mmmaybe, although I think it is conceptually non-destructive -- put your mind 6 months in the future and consider, if the unit is unassigned and recalled, surely any persistent storage it uses will still be around ready for its new assignment
<rogpeppe> fwereade: rm -rf is pretty destructive, no matter how you look at it.
<fwereade> rogpeppe, won't that actually just be "unmount" in the future though?
<niemeyer> fwereade: I think unassignment is the same as a Dead unit, for all purposes of the deployer, except it shouldn't remove the unit from the state
<niemeyer> fwereade: I don't think we even need a noun for that.. "the unit was unassigned" sounds fine
<fwereade> niemeyer, is there some way in which what I have implemented differs from that?
<niemeyer> fwereade: I have no idea.. I'm commenting on top of the conversation above
<niemeyer> fwereade: I'll try to review the branch today still, though
<niemeyer> fwereade: I see you saying "we want to recall it", for example, and whether "kill" or "recall" is better
<niemeyer> fwereade: I think we want to remove the container, in the same way we do for a dead unit
<fwereade> niemeyer, yes
<fwereade> niemeyer, that is what I do
<fwereade> niemeyer, "recall" is the opposite of "deploy", in english, and it seems sensible to use words in juju to mean what they mean in english where possible
<niemeyer> fwereade: Cool, I'm probably just misunderstanding the context then
<fwereade> niemeyer, the semantic intent is precisely "opposite of deploy"
<fwereade> niemeyer, this is *currently* a destructive operation
<niemeyer> fwereade: That's what I was referring to
<niemeyer> fwereade: I don't *think* we need a new noun for "remove the container"
<fwereade> niemeyer, but I don't think that "undeploying" a unit is necessarily always going to be so
<niemeyer> fwereade: juju deploy is something pretty different
<fwereade> niemeyer, ok, we agreed that "deploy" was the term to use for "install a unit agent"
<fwereade> niemeyer, I implemented an Installer you told me was badly named, sight unseen, and now you're telling me that deploy is the wrong verb?
<niemeyer> fwereade: Erm
<niemeyer> fwereade: Sorry, please ignore me
<fwereade> niemeyer, np -- I think we do have terminology problems though, in hindsight you had no way of knowing what sort of "deploy" we were talking about :)
<niemeyer> fwereade: I think we have too much terminology.. I don't know what "recall" actually means, and I was hoping I wouldn't have to learn.
<fwereade> niemeyer, it is the opposite of "deploy", and is used entirely within the deployer package to indicate the operation that reverses a deploy
<niemeyer> fwereade: Okay
<fwereade> niemeyer, I'm not emotionally attached to the name, but "undeploy" just seemed overwhelmingly sucky so I spent some time with a dictionary :)
<fwereade> niemeyer, always happy to hear improvements :)
<rogpeppe> fwereade: remove?
<fwereade> rogpeppe, collides with other remove operations in the package
<fwereade> rogpeppe, words involving death collide with death-related operations used in the package
<rogpeppe> fwereade: RemoveDeployment ? :-)
<rogpeppe> niemeyer: replied https://codereview.appspot.com/6868070/
<rogpeppe> niemeyer: if you've got a little while at some point, i'd like to have a chat about the API stuff
<fwereade> rogpeppe, is life *so* boring that we need to overload terminology to keep ourselves on our toes?
<niemeyer> fwereade: recall is fine
<fwereade> rogpeppe, (and what makes you think that recalling intrinsically "removes" anything? ;))
<fwereade> niemeyer, cheers
<niemeyer> fwereade: I mainly wondered if we could avoid the term altogether.. maybe we can't
<niemeyer> fwereade: (rather than suggested there was another term that would fit better)
<rogpeppe> fwereade: "os.RemoveAll(agentDir)" :-)
<fwereade> niemeyer, it's not something that anyone needs to know unless they're directly working within the deployer package
<fwereade> rogpeppe, and?
<fwereade> rogpeppe, that is the only possible implementation?
<fwereade> rogpeppe, we have no persistent storage features planned?
<rogpeppe> fwereade: we do, but they might not be always active.
<rogpeppe> fwereade: "Recall" sounds like it's fetching something
<rogpeppe> fwereade: but it's actually removing local state pertaining to the deployed unit
<fwereade> rogpeppe, my point is that all your suggestions carry a "this is a destructive operation" payload, and I don't think that's necessarily appropriate
<niemeyer> Really, recall is fine.. the discussion I was trying to avoid is already happening :)
<fwereade> niemeyer, ha, yeah, I will be quiet :)
<niemeyer> We all know what "recall" means now :)
<rogpeppe> niemeyer: ok, if you're ok with it, i'll go with it too :-)
<fwereade> niemeyer, haha
<fwereade> rogpeppe, cheers
<fwereade> huh, cath has done a super-early supper, I'll be back in a sec
<mramm> I'm heading out to a lunch meeting, will be back in a bit.
<mramm> if you need me in the meantime my phone is operational again, so feel free to call/text
<rogpeppe> niemeyer: any chance of that chat some time today? (sorry to poke, but it would be nice to have an idea if i'm totally off in the wrong direction)
<niemeyer> rogpeppe: Oh yeah, thanks for poking, I forgot actually
<niemeyer> rogpeppe: What's up?
<rogpeppe> niemeyer: i did a spike branch, just to see if some ideas would work out
<rogpeppe> niemeyer: it's only a sketch in quite a few places, but it does give an idea of how it might work, end-to-end: https://codereview.appspot.com/6878052/
<rogpeppe> niemeyer: crucial pieces are:
<rogpeppe> http://bazaar.launchpad.net/+branch/~rogpeppe/juju-core/176-rpc-spike/view/head:/state/api/serve.go
<rogpeppe> http://go.pkgdoc.org/launchpad.net/~rogpeppe/juju-core/176-rpc-spike/rpc#NewServer
<rogpeppe> niemeyer: the former is how the API surface is implemented by the server. the latter is the piece of magic that makes it possible.
<niemeyer> rogpeppe: Well, there are thousands of lines in there.. I can't really comment on it without sitting down and reading what's going on
<rogpeppe> niemeyer: that's why i pointed you at the NewServer docs
<rogpeppe> niemeyer: which should give an idea, even without looking at the implementation
<niemeyer> rogpeppe: A Server with an Accept method.. yeah, that sounds fine :-)
<rogpeppe> niemeyer: the idea is that a single API implementation can provide access to both http and connection-based RPC interfaces.
<rogpeppe> niemeyer: i've modelled the rpc package quite strongly after net/rpc where appropriate
<niemeyer> rogpeppe: That kind of stuff is a bit dubious: "If a path element contains a hyphen (-) character, the method's argument type T must be string, and it will be supplied from any characters after the hyphen."
<rogpeppe> niemeyer: that's so you can use a path like: /State/Machine-0/InstanceId
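(A toy illustration of that path convention: split an element like "Machine-0" into the name and the string argument after the hyphen. Not the spike's actual code.)

    package api

    import "strings"

    // splitElem splits a path element such as "Machine-0" into its name
    // ("Machine") and the string argument following the hyphen ("0").
    func splitElem(elem string) (name, arg string) {
        if i := strings.Index(elem, "-"); i >= 0 {
            return elem[:i], elem[i+1:]
        }
        return elem, ""
    }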
<niemeyer> rogpeppe: There seems to be quite a bit of generality built on that interface
<niemeyer> rogpeppe: I'd feel more comfortable with an API that was highly tailored for our specific problem
<niemeyer> rogpeppe: with one encoding, that is known sensible for all use cases we have in mind
<niemeyer> rogpeppe: we're not building net/rpc, in that sense
<rogpeppe> niemeyer: i think we want to provide both http access and RPC-style access
<niemeyer> rogpeppe: We discussed this over UDS, I think
<niemeyer> rogpeppe: The strawman was to attempt to have https only
<niemeyer> rogpeppe: and see where that would lead
<rogpeppe> niemeyer: even if we don't provide access via multiple protocols, i still think this package is a good way to do it
<niemeyer> rogpeppe: Having stuff like XML there makes it pretty clear that we're over-engineering it, and stumbling upon that while trying to tailor it to our needs
<niemeyer> rogpeppe: (as the /State/Machine-0/InstanceId edge case indicates)
<rogpeppe> niemeyer: the XML stuff is like 10 lines of code
<niemeyer> rogpeppe: It could be 2
<niemeyer> rogpeppe: Or one
<niemeyer> rogpeppe: It's still XML, and we don't want to use that
<rogpeppe> niemeyer: sure. i just put it in there because it only took 10s to type
<niemeyer> rogpeppe: EC2 is a lesson in that sense.. there are two interfaces
<niemeyer> rogpeppe: And nobody but the heaviest Java pundits uses that
<rogpeppe> niemeyer: http://paste.ubuntu.com/1412851/
<niemeyer> rogpeppe: You see my underlying point, though?
<niemeyer> rogpeppe: It took 10 seconds to put it in there because there's a *LOT* of logic to make that possible
<rogpeppe> niemeyer: so do you think we should build it as a standard web server, with no websockets?
<rogpeppe> niemeyer: i hadn't thought we'd decided that at UDS.
<niemeyer> rogpeppe: Erm.. how did websockets get into the above context?
<rogpeppe> niemeyer: well, there's a significant difference between serving GET requests and serving messages from a websocket
<niemeyer> rogpeppe: I bet.. still.. !?
<rogpeppe> niemeyer: i'd be happy if we could do both.
<niemeyer> rogpeppe: Ah, I see
<niemeyer> rogpeppe: Rather than designing two entirely different API mechanisms at once, it feels like we could try to get one of them right
<rogpeppe> niemeyer: this package means we can serve them both with the same implementation
<niemeyer> rogpeppe: Understood, the comments above take that into account
<rogpeppe> niemeyer: agreed. i'd start with the websocket implementation, in that case.
<rogpeppe> niemeyer: and i *still* think this would be a useful package for that.
<rogpeppe> niemeyer: because net/rpc is not sufficient
<niemeyer> rogpeppe: I'm not sure.. I'd start with the http package, and the websocket package
<rogpeppe> niemeyer: so where do we deal with the message processing?
<rogpeppe> niemeyer: this rpc package is my way of trying to abstract that out
<rogpeppe> niemeyer: in the same way that net/rpc does, but a little more general
<rogpeppe> niemeyer: in that it allows rpc calls to know the context of the connection that's making the call
<rogpeppe> niemeyer: i *think* it makes for a very natural style of implementation of the API, while keeping all the RPC gunge out of the way.
<niemeyer> rogpeppe: I'm trying to understand what that would mean, but I can't see that yet
<rogpeppe> niemeyer: did you look at the server implementation? http://bazaar.launchpad.net/+branch/~rogpeppe/juju-core/176-rpc-spike/view/head:/state/api/serve.go
<niemeyer> rogpeppe: It seems that most of what is there is a generic mechanism to plug different codecs
<niemeyer> rogpeppe: Without any correlation to the needs of the underlying infrastructure
<rogpeppe> niemeyer: you could delete all the codec stuff and it would still be useful
<rogpeppe> niemeyer: for instance we could hard-code JSON
<rogpeppe> niemeyer: that would be fine
<niemeyer> rogpeppe: What are we getting out of it then?
<niemeyer> rogpeppe: http://gary.beagledreams.com/page/go-websocket-chat.html
<niemeyer> rogpeppe: This is where I would start
<rogpeppe> niemeyer: i've written something that sends and receives messages over websockets already
<niemeyer> rogpeppe: It seems a few orders of magnitude more straightforward than the logic that is in that rpc package
<niemeyer> rogpeppe: I'm not saying that stuff is impossible to do with your implementation.. I bet you can do anything with it. :-)
<rogpeppe> niemeyer: but we don't want to manually marshal and unmarshal messages for every single call in the API
<niemeyer> rogpeppe: No, we don't.. we also don't want to implement a generic net/rpc package that is even more generic than net/rpc
<niemeyer> rogpeppe: There's a sweet spot in between, and hopefully that sweet spot is actually trivial to implement
<rogpeppe> niemeyer: well, it *needs* to be more generic than net/rpc
<rogpeppe> niemeyer: because net/rpc doesn't provide what we need
<niemeyer> rogpeppe: It doesn't have to be generic at all, IMHO.. it has to solve our problem, our very specific problem.. not anyone else's problem
<rogpeppe> niemeyer: i thought it was quite neat :-(
<niemeyer> rogpeppe: It seems super neat, and a lot of people will love that I think..
<niemeyer> rogpeppe: But we have a smaller issue to solve
<rogpeppe> niemeyer: agreed, but our API surface is fairly broad. i was trying to think of something just sufficiently broad to solve our problem.
<niemeyer> rogpeppe: Okay, *that* is exactly the problem
<niemeyer> rogpeppe: I suggest we start by *having* the problem, and walking up from there
<rogpeppe> niemeyer: where do you think i should start then?
<niemeyer> rogpeppe: By implementing one tiny API call that actually works
<niemeyer> rogpeppe: Without significant infrastructure to support everything we think we may need for everything we think we'll do
<rogpeppe> niemeyer: ok, so don't use net/rpc at all then.
<niemeyer> rogpeppe: Right
<rogpeppe> niemeyer: ok, i'll start there. and i still think that what i've just made will be a good fit for what we'll need, but i take your point that it's perhaps overly general.
<rogpeppe> niemeyer: i built something that would make building what we need fairly trivial, which i thought was a reasonable approach.
<niemeyer> rogpeppe: I reserve my right to doubt that this net/rpc replacement makes building what we need fairly trivial.
<rogpeppe> niemeyer: but still, only three days lost.
<niemeyer> rogpeppe: We have enough track record about "fairly trivial" changes by now :)
<rogpeppe> niemeyer: yeah, "fairly trivial" means "i can see a clear path" :-)
<niemeyer> rogpeppe: Exactly, a path to somewhere.. but given that we don't have a working API, and are not even close to having one, we don't know where that path really leads
<niemeyer> rogpeppe: There's a lot of value in putting stuff to actually work, and learning about what it takes to do that
<rogpeppe> niemeyer: well, that branch does include at least one test from the state package that passes.
<rogpeppe> niemeyer: which seems not a million miles away from having a working API
<niemeyer> rogpeppe: In 20 minutes I can code a websocket that talks across the wire to the state package
<niemeyer> rogpeppe: I'd still consider myself completely ignorant about what it takes to make all the juju communication work across the wire
<niemeyer> rogpeppe: But that's just me.. perhaps I'm just pessimistic about myself.
<rogpeppe> niemeyer: in a few days, i reckon i could get most of the non-watcher state tests passing. but perhaps that would still be "not even close".
<niemeyer> rogpeppe: That's great to hear.. maybe we can have the API working even before christmas then
<rogpeppe> niemeyer: without some equivalent of net/rpc, i'm not sure.
<rogpeppe> niemeyer: oh, sorry, that was sarcasm, right?
<niemeyer> rogpeppe: Not at all, I was serious
<niemeyer> rogpeppe: If you can get all of state in three days, it doesn't feel off-the-park to have most of it in 2.5 weeks
<rogpeppe> niemeyer: i could only do that because my rpc package makes it straightforward
<niemeyer> rogpeppe: heh
<rogpeppe> niemeyer: because all the bookkeeping is hidden away
<niemeyer> rogpeppe: I'm only suggesting doing what we actually need, rather than what we need and what we don't
<niemeyer> rogpeppe: There's nothing about websockets there, there's nothing about auth there, there's nothing about a lot of stuff there
<rogpeppe> niemeyer: websockets are trivial to add, and so is auth
<rogpeppe> niemeyer: i designed it with both in mind
<niemeyer> rogpeppe: Can you explain how websockets is trivial to add there and is so hard without it?
<niemeyer> rogpeppe: Because this: http://gary.beagledreams.com/page/go-websocket-chat.html, is surprisingly similar to what we want
<niemeyer> rogpeppe: and it has a fraction of the size
<rogpeppe> niemeyer: most of the code that we will be writing will be working out what call has been made, marshalling and unmarshalling parameters, and checking authorisation.
<rogpeppe> niemeyer: not dealing with websockets directly
<rogpeppe> niemeyer: that example has only a single message type
<rogpeppe> niemeyer: and no authorisation
<rogpeppe> niemeyer: we've got about 60 calls in our API used outside of state
<niemeyer> rogpeppe: http://go.pkgdoc.org/code.google.com/p/go.net/websocket
<niemeyer> var JSON = Codec{jsonMarshal, jsonUnmarshal}
<niemeyer> type T struct {
<niemeyer> 	Msg string
<niemeyer> 	Count int
<niemeyer> }
<niemeyer> rogpeppe: You see the similarity in terminology?
<niemeyer> type Codec struct {
<niemeyer>     Marshal   func(v interface{}) (data []byte, payloadType byte, err error)
<niemeyer>     Unmarshal func(data []byte, payloadType byte, v interface{}) (err error)
<niemeyer> }
<niemeyer> func (cd Codec) Send(ws *Conn, v interface{}) (err error)
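(A minimal, self-contained example of the websocket JSON codec quoted above; the message shape and endpoint are illustrative only, not juju code.)

    package main

    import (
        "net/http"

        "code.google.com/p/go.net/websocket"
    )

    type msg struct {
        Msg   string
        Count int
    }

    // echo receives JSON messages over the websocket and sends them back
    // with Count incremented, using the Codec shown above.
    func echo(ws *websocket.Conn) {
        for {
            var m msg
            if err := websocket.JSON.Receive(ws, &m); err != nil {
                return
            }
            m.Count++
            if err := websocket.JSON.Send(ws, &m); err != nil {
                return
            }
        }
    }

    func main() {
        http.Handle("/api", websocket.Handler(echo))
        http.ListenAndServe(":8080", nil)
    }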
<rogpeppe> niemeyer: it can encode and decode json messages... and?
<niemeyer> rogpeppe: and!?
<rogpeppe> niemeyer: that's a fraction of what we need to do
<niemeyer> rogpeppe: Exactly, that's all pretty close to what that package implements
<niemeyer> rogpeppe: Codecs, marshalling, unmarshalling
<niemeyer> rogpeppe: There's missing logic.. let's add it
<rogpeppe> niemeyer: yes, and none of that is implemented by my rpc package
<rogpeppe> niemeyer: it just uses codecs from outside
<niemeyer> rogpeppe: Okay, sorry, if you think state is going to be ready before the end of the week, please go for it.
<niemeyer> rogpeppe: We're clearly talking across each other without much benefit
<niemeyer> rogpeppe: If state isn't ready by the end of the week, we can talk again
<rogpeppe> niemeyer: i will implement a direct websocket implementation to show where we might be headed without something like net/rpc.
<TheMue> Yiiiiiiiiehaaa! After a wrong direction and some fiddling with why that broke, I think the firewaller is stable now. I'll add some explicitly stressing tests tomorrow, but it looks good. *phew*
<rogpeppe> niemeyer: and then hopefully it'll be more clear what the advantages are
<niemeyer> rogpeppe: Please don't do that.. I'm not interested in you coding logic to prove how bad it will look
<rogpeppe> niemeyer: in which case i don't know what to do
<rogpeppe> niemeyer: i'd try to make it look as nice as possible
<niemeyer> rogpeppe: I was hoping to collaborate to a working prototype, without creating massive infrastructure which is obviously necessary, despite the fact we have no idea
<niemeyer> rogpeppe: The freaking thing supports XML for god's sake
<rogpeppe> niemeyer: only because it was ultra-trivial to do
<rogpeppe> niemeyer: (and actually it doesn't work :-])
<niemeyer> rogpeppe: Yeah, it's trivial to put a car within an airplane, once you have the airplane
 * rogpeppe should have known that typing the letters X-M-L was a mistake
<niemeyer> rogpeppe: But this isn't being productive.. there's no point in you working on stuff you're unhappy about
<rogpeppe> niemeyer: i think that putting in the infrastructure for the websocket stuff from the bottom up would actually be a good thing
<niemeyer> rogpeppe: My personal opinion is that the current super-net/rpc package is a pretty trivial part of the problem we have to solve, and I'd prefer to focus on the problem itself rather than trying to put yet another layer we don't even understand
<niemeyer> rogpeppe: But there's an easy way out: you said it's going to be ready in three days, so go for it.
<rogpeppe> niemeyer: then when we've sent a few messages back and forth, we can see if something like my rpc package might be a good fit.
<niemeyer> rogpeppe: Won't be much time spent, and will be a learning exercise anyway.
<niemeyer> rogpeppe: I think there's a non-trivial number of pieces missing, that will guarantee thousands of additional lines
<rogpeppe> niemeyer: it's a spike. even if i get things working, it's still a spike.
<rogpeppe> niemeyer: out of interest, what pieces are you thinking of particularly?
<rogpeppe> niemeyer: i'm wondering where you see the blocking points
<niemeyer> rogpeppe: Well, it's not a code spike.. a spike is something we code to understand a problem. This is a pretty polished API that is documented and that you're attached to already
<rogpeppe> niemeyer: it really was code to try to understand how i might solve the problem
<rogpeppe> niemeyer: even if i did write some doc comments
<rogpeppe> niemeyer: (which i actually wrote before i wrote the code, so i had an idea of what i was trying to do)
<rogpeppe> niemeyer: and it's true i think it turned out quite nice, but there y'go.
<rogpeppe> niemeyer: anyway, gotta go.
<rogpeppe> niemeyer: see you tomorrow
<niemeyer> rogpeppe: See ya
<niemeyer> This reminds me of Launchpad's API all over again
<niemeyer> rogpeppe: That sounds like a side effect of that design, for example:
<niemeyer> 	 102         path := fmt.Sprintf("/Machine-%s/SetInstanceId", m.Id)
<niemeyer> 	 103         return m.state.c.Call(path, instId, nil)
<niemeyer> rogpeppe: The natural design on that is to have the equivalent of {"method": "set-instance-id", "machine": m.Id, "instance-id": instId}
<niemeyer> (disregarding marshalling issues.. that's a map in whatever format we decide to use)
<niemeyer> Possibly with an additional "request-id" key too
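(For illustration, a sketch of the request shape being described; the struct and JSON keys are assumptions, not an agreed wire format.)

    package api

    // Request is one possible shape for the per-call message described
    // above; it would marshal to something like
    // {"request-id":1,"method":"set-instance-id","machine":"0","instance-id":"i-abc"}.
    type Request struct {
        RequestId  uint64 `json:"request-id"`
        Method     string `json:"method"`
        Machine    string `json:"machine,omitempty"`
        InstanceId string `json:"instance-id,omitempty"`
    }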
<niemeyer> I guess I suck at explaining.. I should try to do what I do best and code it
 * niemeyer => doc appointment.. back later
<fwereade> niemeyer, ping
<fwereade> niemeyer, in case you come back, please consider what the Dying state might mean for a Machine; I'm starting to believe it's meaningless (setting a machine to Dying should not, I think, affect its units; so a Dying machine generally just sits there forever (or at least until all its units are unassigned); but having a long-running machine which cannot have further units assigned feels like crack, as does the alternative of allowing units to be assigned to Dying machines)
<fwereade> niemeyer, grar, just thought of something: there's an additional constraint on subordinate relations, which is that the principal and the subordinate charms must have the same series
 * fwereade will stop dumping random thoughts into the channel and go to bed
#juju-dev 2012-12-06
<rogpeppe> davecheney: morning
<davecheney> rogpeppe: morning !
<davecheney> how's it going
<rogpeppe> not too bad
<davecheney> rogpeppe: here's one for ya! https://codereview.appspot.com/6862050/
<davecheney> rogpeppe: look at this http://code.google.com/p/rietveld/issues/detail?id=406#c2
<rogpeppe> davecheney: re the above CL: why not just rename serialisedLogger to logger and lose the interface type?
<rogpeppe> davecheney: (i was planning to run the race detector on our tests today too actually - nice idea)
<davecheney> rogpeppe: yeah, we can do that
<davecheney> rogpeppe: yes, i've been running it on the core state/* and worker classes
<davecheney> we're pretty good
<rogpeppe> davecheney: that's a relief to hear
<rogpeppe> davecheney: although it would be interesting to run some of the tests concurrently.
<davecheney> oh boy, don't even think about GOMAXPROCS > 1
<rogpeppe> i meant concurrently rather than in parallel, but that too
<davecheney> oooooooooooh, you mean all the state tests at once ?
<rogpeppe> davecheney: well, obviously that wouldn't currently be possible, but yeah, that kind of thing
<rogpeppe> davecheney: gocheck should allow the equivalent of t.Parallel() (if i remember that name right)
<rogpeppe> http://golang.org/pkg/testing/#T.Parallel
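(For reference, the stdlib facility being pointed at; a trivial example, not juju test code.)

    package something_test

    import "testing"

    // Tests that call t.Parallel are paused and then run concurrently
    // with the other parallel tests in the package.
    func TestSomethingSlow(t *testing.T) {
        t.Parallel()
        // ... test body ...
    }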
<rogpeppe> i really think that gocheck could use some modernisation
<rogpeppe> davecheney: you've got a review on that gocheck CL
<davecheney> rogpeppe: ta muchly
<fwereade> mornings
<TheMue> Mornings from a snowy Oldenburg too
<fwereade> TheMue, I forget, did I discuss the meaning of a Dying Machine with you? I'm trying to find someone who can explain in what circumstances a Machine should be able to become so, and how the consequences of that make sense
<fwereade> TheMue_, I forget, did I discuss the meaning of a Dying Machine with you? I'm trying to find someone who can explain in what circumstances a Machine should be able to become so, and how the consequences of that make sense
<fwereade> TheMue, ping
<TheMue> fwereade: Pong.
<fwereade> TheMue, I think you might have missed:
<fwereade> <fwereade> TheMue, I forget, did I discuss the meaning of a Dying Machine with you? I'm trying to find someone who can explain in what circumstances a Machine should be able to become so, and how the consequences of that make sense
<TheMue> fwereade: Oh, yes, I missed. Strange.
<fwereade> TheMue, because the "obvious" meaning is that the machine should start to shut itself down when it becomes Dying
<fwereade> TheMue, but actually, if it has units, it can't do that: it'll be blocked for as long as they exist, which could be forever
<fwereade> TheMue, but the machine will from that point on be kinda broken -- eg you won't be able to assign new units to it
<fwereade> TheMue, and the Alive=>Dying change is irrevocable
<fwereade> TheMue, and it really doesn't seem like a smart idea to ever put machines in that state
<TheMue> fwereade: So far following, yes.
<fwereade> TheMue, can you think of an alternative conception of what should happen? or, if not, do you then agree that a Dying machine is somewhere between meaningless and harmful?
<TheMue> fwereade: To take a step back, I always had my problems with the idea of setting a lifecycle state to trigger a reaction.
<TheMue> fwereade: My natural feeling is more vice versa.
<fwereade> TheMue, that's interesting, would you expand on how you'd prefer to see it?
<TheMue> fwereade: I somehow tell a machine to stop. During this, it sets its state to Dying to represent this. And at the end it sets its state to Dead.
<TheMue> fwereade: So the lifecycle state is only representative, informal information for others to see what's going on.
<fwereade> TheMue, I thought that was the model we did follow? except with the first bit slightly compressed -- ie that setting something to Dying is precisely how we communicate that it should stop
<TheMue> fwereade: Yes, and that's what I dislike.
<TheMue> fwereade: Because it isn't dying, we want it to do so.
<TheMue> fwereade: And itself can only tell: "Hey, guys, I'm dying."
<TheMue> fwereade: But that's only a personal preference and I won't start this discussion again. ;)
<fwereade> TheMue, if it's annoying for you to go over it again then I'll gladly drop it, but I'm interested in your perspective here
<fwereade> TheMue, what do you expect will be watching for the Dying status?
<TheMue> fwereade: No, it's not annoying. But we have an agreement, so that should be changed.
<fwereade> TheMue, ofc it's tedious to discuss this generally, because the various lifecycle entities are much more different than I think we ever imagined :/
<TheMue> fwereade: Currently I don't know for whom it could be interesting to watch the Dying status, but it could be interesting if I'm iterating and only want those with status Alive.
<TheMue> fwereade: Yes, exactly.
<fwereade> TheMue, I don't *think* that's a sane operation, but I could easily be wrong
<fwereade> TheMue, yeah, I'm sure I'm wrong somewhere, just because there are so many possibilities :)
<TheMue> fwereade: But let's come back to the current way to do so and what you've discovered.
<fwereade> TheMue, heh, much of this will be messy, I don't think I've ever tried to express it all at once before, but here goes
<fwereade> TheMue, so: the detect-self-Dying=>cleanup-and-set-self-Dead model is very clear when it comes to entities that have agents, ie units and machines
<fwereade> TheMue, notwithstanding my belief that it isn't *actually* sane wrt machines
<fwereade> TheMue, but in any case it's definitely different to service and relation lifecycle management
<fwereade> TheMue, because those entities rely on agents for other entities to manage them
<TheMue> fwereade: Yep.
<fwereade> TheMue, (and in fact units depend on machine agents (or sometimes other unit agents), and machine agents depend on other machine agents (ie those running provisioners))
<fwereade> TheMue, so, putting it that way, maybe it's not as different as I thought: ie *every* entity has its lifecycle partly managed by something else
<fwereade> TheMue, it's just that some of them don't have their own agents at all, so their lifecycles have to be exclusively managed by other things
<fwereade> TheMue, but, nonetheless
<fwereade> TheMue, Relation has already lost EnsureDead on the basis that that is not a meaningful change
<fwereade> TheMue, actually, it's lost EnsureDying too
<fwereade> TheMue, it just has Destroy
<fwereade> TheMue, are you familiar with that change and its context?
<fwereade> TheMue, (it should also be noted that there's no such thing as a Dead relation -- the possible paths are Alive -> removed and Alive -> Dying -> removed)
<TheMue> fwereade: Not yet, but I'm simply following what you're saying here and raise a question when I need more background (or at least I think so).
<TheMue> fwereade: Sounds reasonable.
<fwereade> TheMue, the issue is basically that the point at which a relation can be removed is the point at which the last relation unit leaves scope
<fwereade> TheMue, we can't enter scope if any of the involved entities are dying
<fwereade> TheMue, and in fact no entity will take it on itself to just remove a Dying relation -- if the relation is Dying when the uniter first sees it, it just ignores it
<TheMue> fwereade: Hmm, that's a bit too short and fast.
<TheMue> fwereade: Could you please explain it differently.
<fwereade> TheMue, I'll try :)
<fwereade> TheMue, a relation exists, and we don't want it to any more
<fwereade> TheMue, the "easy" answer is "set Dying, let the system handle it"
<TheMue> fwereade: That's the UI, but how shall the system do it.
<fwereade> TheMue, but if there are no units currently participating in that relation, there is no entity with responsibility for that relation's death
<fwereade> TheMue, it would ordinarily be the last unit
<fwereade> TheMue, to leave scope
<fwereade> TheMue, but if none are in scope this isn't sane
<fwereade> TheMue, so, in the absence of a CA, just setting Dying is not good enough
<TheMue> fwereade: Why does the relation exist if there's no unit in scope?
<fwereade> TheMue, because someone related two services before any of the services' machines were provisioned, for example
<fwereade> TheMue, relation scope is how units express their personal participation in a relation
<fwereade> TheMue, a scope is meaningless without a relation, but a relation on its own is just an expression of how we would like the units of certain services to communicate
<TheMue> fwereade: Ah, yes, forgot it.
<fwereade> TheMue, so, EnsureDying is useless
<fwereade> TheMue, and for similar reasons EnsureDead is useless, except even more so, because by definition a relation cannot become Dead while there is some unit in scope -- ie some unit willing to take responsibility for finally removing it
<fwereade> TheMue, so the actual lifecycle management (apart from creation) is managed by some slightly icky transactions in RU.EnterScope, RU.LeaveScope, and R.Destroy
<fwereade> TheMue, and nothing therein ever sets the relation to Dead -- once it's safe to make it Dead, it's safe to destroy it entirely
<fwereade> TheMue, making sense so far?
<TheMue> fwereade: Yes, can follow.
<fwereade> TheMue, ok, now apply what I have just said to Service
<fwereade> TheMue, I *think* that the only sane way to manage service is as above
<fwereade> TheMue, ie via careful txn management in S.AddUnit, S.RemoveUnit, and (mooted) s.Destroy
<fwereade> TheMue, for service, this renders EnsureDying and EnsureDead meaningless
<fwereade> TheMue, and in fact the Dead state is again meaningless for a Service
<TheMue> fwereade: It feels more natural than fiddling with life states.
<fwereade> TheMue, yeah, hopefully so :)
<fwereade> TheMue, the trouble is that the life model as we originally designed it does not apply to:
<fwereade> TheMue, 1) relations -- no meaningful Dead
<fwereade> TheMue, 2) services -- no meaningful Dead
<fwereade> TheMue, 3) machines -- no meaningful Dying
<fwereade> TheMue, 4) and units -- because Alive/Dying/Dead is (according to niemeyer) not expressive enough to accommodate forced removal
<fwereade> TheMue, and my big personal problem is that we still seem to have this idea that lifecycles are agreed, and make sense, and have already been implemented
<TheMue> fwereade: I don't hear you anymore. *whistling*
<fwereade> TheMue, when in fact they aren't, they don't, and they haven't
<fwereade> TheMue, haha :)
<TheMue> fwereade: Somehow you're hitting a nerve. I never thought about it in such a concentrated way.
<fwereade> TheMue, if you can see any holes in the above, please poke them, because I'm going to have to muster the emotional energy to bring this up with niemeyer at some point soon, and you know how it can be -- one missed detail means he dismisses your whole argument
<TheMue> fwereade: For me the life discussion is difficult. I prefer to collect facts/arguments/etc in a document and discuss that. It's simpler to refer to.
<fwereade> TheMue, (ofc I don't mean to say he always does that -- but it is a risk I am loath to assume)
<rogpeppe> fwereade: for me, the big advantage of Dying is that the entity enters a shutdown phase, meaning that other things can't interact with it while it's dying (which might take several transactions)
<fwereade> rogpeppe, yeah, I know that's what we think it means
<fwereade> rogpeppe, in practice, leaving aside the niche use case in which we render a machine crippled for an arbitrary length of time, Dying STM not to be a useful state for a machine... or, if it is, it's one we can only enter when the machine has no units
<fwereade> rogpeppe, but we certainly don't enforce that anywhere
<fwereade> rogpeppe, I'd be perfectly happy with that restriction, but it's pretty inconsistent with everything else
<rogpeppe> fwereade: the problem is: when can you remove a machine?
<rogpeppe> fwereade: and i guess the answer is: when it has no units assigned
<fwereade> rogpeppe, when and only when it has no units assigned, I think
<rogpeppe> fwereade: ok, in this case we can consider dying/dead redundant because they're compressed into one transaction ("remove iff no units")
<fwereade> rogpeppe, I think we still need Dead, so the provisioner can trash the instance and remove the machine
<rogpeppe> fwereade: ok, EnsureDead iff no units
<rogpeppe> fwereade: which presumably the Machine.EnsureDead checks anyway
<rogpeppe> fwereade: ha, it doesn't
<rogpeppe> fwereade: but it should
<fwereade> rogpeppe, yeah, like just about every other bit of lifecycle that we've "implemented" :/
<rogpeppe> fwereade: can we check for no units and set to dead all in one transaction?
<fwereade> rogpeppe, think so, yeah
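(A sketch, using mgo/txn-style operations, of checking for no assigned units and setting the machine to Dead in one transaction; the collection and field names here are assumptions, not state's actual schema.)

    package state

    import (
        "labix.org/v2/mgo/bson"
        "labix.org/v2/mgo/txn"
    )

    // ensureMachineDead sets a machine's life to dead only if it has no
    // principal units assigned; the assert and update run atomically.
    func ensureMachineDead(runner *txn.Runner, machineId string) error {
        ops := []txn.Op{{
            C:      "machines",
            Id:     machineId,
            Assert: bson.D{{"principals", bson.D{{"$size", 0}}}},
            Update: bson.D{{"$set", bson.D{{"life", "dead"}}}},
        }}
        return runner.Run(ops, "", nil)
    }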
<rogpeppe> fwereade: well, actually, given dying and dead, it's not a problem
<rogpeppe> fwereade: we don't need to do that
<rogpeppe> fwereade: but, as you've said, perhaps it doesn't really make sense to tag a machine to die in the future
<rogpeppe> fwereade: a unit could live its entire life on a Dying machine
<rogpeppe> fwereade: which seems kinda cool in a sf kinda way :-)
<fwereade> rogpeppe, yeah, exactly -- and I'm not even saying that use case never applies
<fwereade> rogpeppe, just that I'm not sure it should be our primary focus
<rogpeppe> fwereade: i still think that even if we don't leverage all the states, dying and dead is still a useful way to think about removing entities
<fwereade> rogpeppe, ok, but... Dying is (currently) useless for machines
<fwereade> rogpeppe, Dead is meaningless for relations and services
<rogpeppe> fwereade: even if for some things, we just use Remove instead of EnsureDead, and for others we don't use Dying at all.
<fwereade> rogpeppe, and niemeyer wants an extra lifecycley flag for units to accommodate --force
<rogpeppe> fwereade: i'm still not sure about --force and how it should be done. there are quite a few edge cases there that i'm wary of.
<fwereade> rogpeppe, well, yeah -- IMO the only sane way is with a single transaction, but meh, niemeyer wants a flag so he's probably going to get a flag
<rogpeppe> fwereade: can't it be a flag on the unit rather than an extra lifecycle state that everything needs to deal with?
<fwereade> rogpeppe, yeah, but IMO that kicks the last leg out from under the idea that there is anything usefully common about lifecycle handling
<rogpeppe> fwereade: i don't think so. the dying/dead distinction is still useful there
<fwereade> rogpeppe, the existence of the states as they are is, IMO, just fine
<fwereade> rogpeppe, but in practice no entity follows the model we've kidded ourselves that they all do
<fwereade> rogpeppe, the only features that STM to apply across the board are:
<fwereade>   * lifecycle state can only advance
<fwereade>   * entity lifecycles can sometimes skip states
<fwereade>   * no agent/entity is responsible for removing itself
<rogpeppe> that seems fine to me
<rogpeppe> fwereade: it doesn't say that lifecycle state is irrelevant
<fwereade> rogpeppe, when did I say it was?
<fwereade> rogpeppe, the states are fine
<fwereade> rogpeppe, but we're operating under a shared delusion that we've understood the problem
<rogpeppe> fwereade: ah, i thought you were objecting to Dying and Dead
<fwereade> rogpeppe, no -- just conceits like EnsureDying and EnsureDead
<rogpeppe> fwereade: they're just setting flags. i'd never expected them to be anything other than part of the solution.
<fwereade> rogpeppe, but just setting a flag is almost always crackful
<rogpeppe> fwereade: yeah, quite possibly. i think maybe they should not be exported
<rogpeppe> fwereade: then each entity gets methods that are appropriate
<fwereade> rogpeppe, I'm pretty sure they shouldn't even exist -- they're premature abstraction that's just trying to wedge everything into the same model, that doesn't actually ever apply to anything
<rogpeppe> fwereade: such as Remove (for a machine - only works if no units)
<fwereade> rogpeppe, I think entity.Destroy is sensible
<fwereade> rogpeppe, to go with the agreed CLI terminology
<fwereade> rogpeppe, that's what we have for relations anyway
<fwereade> rogpeppe, and what I *think* we need for services
<rogpeppe> fwereade: i dunno. i think EnsureDying is reasonable for things where the entity *will* take a while to clear up.
<fwereade> rogpeppe, and how do you determine that to be the case?
<rogpeppe> fwereade: but maybe Destroy is appropriate there too.
<rogpeppe> fwereade: well, if you're removing a unit, it's going to take a while for the unit to actually shut down
<rogpeppe> fwereade: but if there's no responsible entity yet, perhaps it *should* be removed immediately
<fwereade> rogpeppe, indeed so, but even then I'm not sure
<rogpeppe> fwereade: and Destroy could return before the entity is actually destroyed
<fwereade> rogpeppe, I am pretty sure about Relation.Destroy and Service.Destroy
<fwereade> rogpeppe, yeah, absolutely
<fwereade> rogpeppe, (and that neither of those should have EnsureD* methods)
<fwereade> rogpeppe, machines and units I am still very unsure about
<fwereade> rogpeppe, I think that if a unit has an assigned machine, we have to set it to Dying and let the MA make the more complex decisions
<rogpeppe> fwereade: i vote for having Destroy on all of them, and have Dying/Dead as an implementation detail (and something that's potentially watchable)
<rogpeppe> fwereade: definitely. a unit needs to be able to clean itself up, that's the whole point of all this
<rogpeppe> fwereade: i think Life makes sense, but EnsureDead/Dying not as much (or at all, perhaps)
<fwereade> rogpeppe, ok, I think we are in agreement
<rogpeppe> fwereade: but the difficulty here is the distinction between Destroy and Remove.
<rogpeppe> fwereade: but... maybe we don't make that distinction
<rogpeppe> fwereade: we just make Destroy idempotent
<fwereade> rogpeppe, I think it's ok -- Destroy might cause removal right away or it might not, but should always succeed absent corrupt state or network problems; removal
<rogpeppe> fwereade: so when a uniter sees Dying, it'll clean up stuff, then call Destroy again
<fwereade> rogpeppe, I surely hope not
<rogpeppe> fwereade: and Destroy will itself call ensureDying on itself
<rogpeppe> fwereade: ?
<fwereade> rogpeppe, I don't understand what you're expecting Destroy to do -- but I *think* that while the unit is still assigned to a machine, it should do nothing, and the machine is the thing responsible for calling RemoveUnit once the unit is dead
<fwereade> rogpeppe, so the unit does somehow cause itself to become Dead
<fwereade> rogpeppe, but that is, I think, not as simple as just setting a flag
<rogpeppe> fwereade: if the user calls juju remove-unit, what should that call in state?
<fwereade> rogpeppe, Unit.Destroy()
<rogpeppe> fwereade: " while the unit is still assigned to a machine, it should do nothing"
<rogpeppe> fwereade: so how does remove-unit work then?
<fwereade> rogpeppe, calling it *again* should do nothing, because no interesting state has changed apart from the unit being Dying rather than Dead, which I don't think is relevant
<rogpeppe> fwereade: ok, so how does the unit actually get removed from state?
<fwereade> rogpeppe, it depends... either by Destroy (if it's not assigned to anything and therefore has no responsible entity) or by the MA once the unit has set itself to Dead
<rogpeppe> fwereade: how does the MA do it?
<fwereade> rogpeppe, RemoveUnit?
<rogpeppe> fwereade: so we've got Destroy, which *may* call RemoveUnit, and RemoveUnit, which is *sometimes* appropriate to call
<rogpeppe> fwereade: i'm wondering if we need RemoveUnit
<fwereade> rogpeppe, we definitely need RemoveUnit, how else are we going to handle service removal?
<fwereade> rogpeppe, (that's not implemented either ofc :/)
<rogpeppe> fwereade: what if Destroy removes the unit when it can?
<fwereade> rogpeppe, it does -- in that case, that also has to be responsible for service cleanup
<rogpeppe> fwereade: why's that?
<fwereade> rogpeppe, who else is going to do it?
<rogpeppe> fwereade: (removing a unit never removes a service, does it?)
<fwereade> rogpeppe, it should
<rogpeppe> fwereade: i disagree
<rogpeppe> fwereade: it's perfectly ok to have a service with no units
<fwereade> rogpeppe, no shit
<rogpeppe> fwereade: sure, we can have that right now
<fwereade> rogpeppe, it's not ok to have a dying service with no units because nothing else is going to clean it up
<fwereade> rogpeppe, service is basically the same as relation
<rogpeppe> fwereade: ah, if the service is dying, sure
<fwereade> rogpeppe, we need the same txn dance in AddUnit/RemoveUnit/Destroy as we have for EnterScope/LeaveScope/Destroy with relations
<rogpeppe> fwereade: so what's wrong with having Unit.Destroy also responsible for service cleanup, if it finds that the service is dying.
<rogpeppe> ?
<fwereade> rogpeppe, (sorry, I am feeling extremely grumpy about this, because I feel like I've been misled repeatedly about the impact of various decisions and that I have basically spent the last 3 months clearing up the shit that has been thrown over the wall to me under pretence of being "finished", and AFAICT I am going to have to basically rewrite state in order to complete subordinates)
<fwereade> rogpeppe, Destroy can do that if the unit's unassigned, yes, but it's not the only thing that might need to do that
<rogpeppe> fwereade: to be fair, i think nobody understood the problem that well before, and it's difficult to write code to solve someone else's future problem.
<fwereade> rogpeppe, in practice it's going to be something like unit.removeOps() that gets composed into bigger txns in a couple of places
<rogpeppe> fwereade: it's just like clearing the last refcount, yes? when anything that can be blocking a dying thing from become dead becomes dead itself, it must try to destroy the things that it might be blocking
<fwereade> rogpeppe, yeah, exactly
<fwereade> rogpeppe, this is not a non-obvious consequence of the model
<fwereade> rogpeppe, it is something I was talking about back in lisbon 1
<rogpeppe> fwereade: it's actually very similar to the refcounting i was talking about a long time ago
<fwereade> rogpeppe, and I think I am somewhat legitimately pissed off that the problem has just been ignored
<rogpeppe> fwereade: i always felt a bit fuzzy (in a not good way) about the lifecycle stuff, but i felt i didn't have enough handle on the actual problem to propose something better.
<rogpeppe> fwereade: how about this: delete all Remove* calls, and replace them with Destroy calls.
<rogpeppe> fwereade: also remove all EnsureDead and EnsureDying methods
<rogpeppe> fwereade: then Destroy always has a clear responsibility - move towards removing the entity from state.
<fwereade> rogpeppe, I don't think I'm willing to put my weight behind an alternative one-size-fits-all solution -- I think that's the fundamental problem
<rogpeppe> fwereade: and any entities that might also be in the process of destruction that depend on it.
<rogpeppe> fwereade: can you think of anything that can't be made fit that model, in a reasonably sane way?
<fwereade> rogpeppe, different kinds of unit destruction?
<fwereade> rogpeppe, especially since niemeyer has rejected the txn idea for that
<rogpeppe> fwereade: i really think that's a mode of dying that can be implemented as a flag on the unit itself.
<fwereade> rogpeppe, and wants to add new lifecycle flags and have multiple responsible entities running concurrently instead
<fwereade> rogpeppe, well, that is a possibility
<rogpeppe> fwereade: i think that whether to use transactions or not is up to the implementation of each entity
<rogpeppe> fwereade: i think Destroy as a catch-all entry point could make sense.
<rogpeppe> fwereade: because that's what we're actually trying to do, after all
<fwereade> rogpeppe, I just don't get in what universe it is sane to have two different pieces of code fighting over which one is responsible for the backing state
<rogpeppe> fwereade: (well, we'd probably have ForceDestroy on units)
<fwereade> rogpeppe, as an entry point, yeah, I think I'm +1
<rogpeppe> fwereade: i think we're inevitably going to have two different pieces of code fighting over responsibility, because agents can become temporarily (and perhaps permanently) divorced from the state they're supposed to be responsible for
<rogpeppe> fwereade: and that's the big issue around --force
<fwereade> rogpeppe, right -- but we agreed a long time ago that the unit should watch for its own Deadness, and that that would be the signal that it was to Stop Doing Everything
<fwereade> rogpeppe, Life=Dead is, explicitly, the responsibility handover point
<rogpeppe> fwereade: sure. just as Dying is such a signal, right?
<rogpeppe> fwereade: (is there any difference between Dying and Dead there, in fact?)
<fwereade> rogpeppe, if we set a unit to Dead the worst that will happen is one spurious error and restart before it cleanly shuts down
<fwereade> rogpeppe, if we do some funky dying flag then we have two entities fighting over who gets to do what
<fwereade> rogpeppe, and I refuse to believe that that is a sensible way to manage things
<rogpeppe> fwereade: do you think you can do the entire unit clean up (leaving all relations) in a single transaction?
<fwereade> rogpeppe, I think it'll be hard, and ugly, but I don't think it'll be impossible, or noticeably uglier than any other options
<rogpeppe> fwereade: surely all you need to do is make leaving scope idempotent?
<rogpeppe> fwereade: then who cares if two entities are doing it at the same time?
<rogpeppe> fwereade: and it doesn't need to be hard, or ugly.
<rogpeppe> fwereade: and it's a sufficiently rare case that efficiency is of no concern at all
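For reference, one way "make leaving scope idempotent" could look with mgo/txn; the collection name and helper are assumptions, not the actual state code. An aborted transaction is read as "the scope doc is already gone", so two parties racing to leave the same scope do no harm.

    package sketch

    import "labix.org/v2/mgo/txn"

    // leaveScope removes a relation scope document if it still exists; a missing
    // document (reported as txn.ErrAborted) is treated as success.
    func leaveScope(runner *txn.Runner, scopeDocId string) error {
        ops := []txn.Op{{
            C:      "relationscopes", // assumed collection name
            Id:     scopeDocId,
            Assert: txn.DocExists,
            Remove: true,
        }}
        err := runner.Run(ops, "", nil)
        if err == txn.ErrAborted {
            return nil // already gone: someone else left first
        }
        return err
    }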
<fwereade> rogpeppe, it offends me that a unit agent can have state that it depends on secretly whipped out from under its feet by another entity
<rogpeppe> fwereade: but that's what force is all about, surely?
<rogpeppe> fwereade: and it's not great, but sometimes it's the only option
<fwereade> rogpeppe, it's about not giving the charm notice to clean up -- not about fucking up our internal state
<fwereade> rogpeppe, I think that if there's some way we can implement it without dropping ideas of ownership out of the window, we should do so
<rogpeppe> fwereade: we're talking force majeure here. that's not an issue, i believe.
<rogpeppe> fwereade: this will only be used when we really don't care about the unit's internal state any more
<fwereade> rogpeppe, I may be misunderstanding what you mean by "unit's internal state", but I don't think we can ever say we don't care about it...
<rogpeppe> fwereade: if i use remove-unit --force, surely that's giving license to the system to remove the unit regardless of its current state.
<fwereade> rogpeppe, yeah -- but that doesn't mean we can ignore that state, the removal still needs to keep state consistent
<rogpeppe> fwereade: even if you use transactions, you're still potentially whipping state out from under the unit agent's feet
<fwereade> rogpeppe, yeah, in a way that uniter has been designed to handle from the beginning
<fwereade> rogpeppe, because we discussed this, and agreed it, and so I implemented it
<rogpeppe> fwereade: so what are you worried that it *won't* deal with correctly?
<rogpeppe> fwereade: when a unit is dying, what needs to happen before it actually dies? is it just that it needs to leave relation scopes?
<fwereade> rogpeppe, and deal with its subordinates
<fwereade> rogpeppe, which will themselves have scopes to leave
<rogpeppe> fwereade: so if a unit finds itself dying, it sets its subords to dying and waits for them to die before dying itself?
<fwereade> rogpeppe, the necessary txn will not be entirely trivial to compose, but I really think that adding more states for more entities and distributing the cleanup responsibilities amongst more entities (each of which could themselves fail at arbitrary times) is a bad move
<fwereade> rogpeppe, I think that setting a unit to Dying should also set its subordinates to Dying
<rogpeppe> fwereade: is that possible to do in one transaction?
<fwereade> rogpeppe, I think a unit should not be able to become dead while it has subordinates...
<fwereade> rogpeppe, don't see why not
<fwereade> rogpeppe, what problems are you anticipating?
 * rogpeppe reminds himself of the transaction capabilities
<rogpeppe> fwereade: don't you need to know the doc id that will be updated before starting the transaction?
<rogpeppe> fwereade: so you can't do an equivalent of "UPDATE life=Dying WHERE principal==myUnit"
<fwereade> rogpeppe, no: I do it for everything in doc.Principals, assert no changes to what I composed the txn from, and retry on abort
<rogpeppe> fwereade: ah, so you could assert that subordinates haven't changed (by looking in unitDoc.Subordinates)
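To make that concrete, a hedged sketch of the kind of transaction being described; the doc shape and field names are assumptions rather than the real state code. The op on the principal asserts its subordinate list hasn't changed since the txn was composed, and on txn.ErrAborted the caller re-reads the doc and retries.

    package sketch

    import (
        "labix.org/v2/mgo/bson"
        "labix.org/v2/mgo/txn"
    )

    const Dying = 1 // illustrative value for the Dying lifecycle state

    type unitDoc struct {
        Name         string `bson:"_id"`
        Life         int
        Subordinates []string
    }

    // setDying marks a principal unit and all of its current subordinates as Dying
    // in one transaction; if the subordinate list changes underneath us the assert
    // fails, Run returns txn.ErrAborted, and the caller refreshes and retries.
    func setDying(runner *txn.Runner, doc *unitDoc) error {
        ops := []txn.Op{{
            C:      "units",
            Id:     doc.Name,
            Assert: bson.D{{"subordinates", doc.Subordinates}},
            Update: bson.D{{"$set", bson.D{{"life", Dying}}}},
        }}
        for _, name := range doc.Subordinates {
            ops = append(ops, txn.Op{
                C:      "units",
                Id:     name,
                Assert: txn.DocExists,
                Update: bson.D{{"$set", bson.D{{"life", Dying}}}},
            })
        }
        return runner.Run(ops, "", nil)
    }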
<rogpeppe> fwereade: so, just to be clear, what state *have* you anticipated changing under the feet of the uniter?
<fwereade> rogpeppe, Life becoming Dead
<fwereade> rogpeppe, this can cause arbitrary failures if we're unlucky, but I *think* that's fine, because when we come up again we handle Dead correctly
<fwereade> rogpeppe, basically, when the unit becomes Dead, it will complete its current operation if possible, and not do anything else ever again
<rogpeppe> fwereade: so to make that work, you need to set dead and remove all relations, all in the same transaction?
<rogpeppe> s/remove/leave/
<fwereade> rogpeppe, yeah
<fwereade> rogpeppe, but the relations are not the hard bit
<fwereade> rogpeppe, well, they're not trivial
<rogpeppe> fwereade: oh yeah, and set subords to dead and leave all their relations too
<rogpeppe> fwereade: all in the same transaction
<fwereade> rogpeppe, but nor are they going to be impossible to handle -- it's doing all that for N units that really bothers me
<rogpeppe> fwereade: istm this might be pushing the boundaries of the transaction system a little
<fwereade> rogpeppe, do you have any idea what those boundaries actually are? how many ops is too many?
<rogpeppe> fwereade: i suppose the uniter could see that the "force" flag is set, and treat that as the same as it currently sees Dead
<fwereade> rogpeppe, yeah, indeed, but niemeyer was -1 on that for no reason I could determine
<rogpeppe> fwereade: no idea at all. just that a thousand might be pushing it, and that isn't beyond the bounds of possibility
<rogpeppe> fwereade: apart from anything else, it'll act as a bottleneck in the system until the whole transaction goes through
<fwereade> rogpeppe, yeah, I don't have a clear enough idea of the details to know what sort of impact that'll have
<rogpeppe> fwereade: it seems like something we should avoid if we can
<fwereade> rogpeppe, yeah, maybe, I just feel like the alternatives are being handwaved a bit
<rogpeppe> fwereade: you're right, they are, 'cos we don't know the uniter details like you do :-)
<fwereade> rogpeppe, heh, yeah
<rogpeppe> fwereade: istm that if there was something the uniter could see that it would treat as it currently treats Dead but isn't *actually* Dead, then things could work ok
<rogpeppe> fwereade: and it wouldn't change too many of the current uniter assumptions
<fwereade> rogpeppe, yeah, but niemeyer rejected that suggestion :/
<rogpeppe> fwereade: perhaps you didn't phrase it in a sufficiently sympathetic way :-)
<fwereade> rogpeppe, it's really tedious doing development with all these invisible walls
<rogpeppe> fwereade: agreed. did you ever look at my api spike branch BTW?
<fwereade> rogpeppe, I did, but I didn't quite get my head around it properly I'm afraid :(
<rogpeppe> fwereade: that's fine. you're not the only one.
<fwereade> rogpeppe, :)
<rogpeppe> fwereade: probably means it really is crack after all
<rogpeppe> fwereade: the rpc thing is perhaps too magical
<rogpeppe> fwereade: but it does make it *really* easy to write rpc servers...
<fwereade> rogpeppe, I'm not sure -- I just wasn't able to flush my other thoughts and properly engage with it
<rogpeppe> fwereade: gustavo's not keen; i'm not entirely surprised.
<Aram> I've looked at your branch, though not as thoroughly to write a review. I like the way you are supposed to use the abstractions provided. feels very easy to me. I can't comment on the abstraction implementation yet, though.
<rogpeppe> Aram: the implementation is still (deliberately) very sketchy currently, although still reasonably neat. in particular it doesn't deal with any kind of concurrency at all yet.,
<rogpeppe> Aram: thanks for the response though - that's the first positive remark i've had yet!
<Aram> I don't think your branch is overkill. I believe that the problem at hand is not trivial, and doing it ad hoc is opening a can of worms. using your thing I'm sure we could change it in time as needed, but an ad hoc wrapper is bound by its initial implementation.
<rogpeppe> Aram: that's my thought too. but building general mechanisms and then using them seems to frowned upon in these parts.
<rogpeppe> Aram: there are a few things the hierarchical nature of the rpc thingy makes potentially possible too - caching being one of them. and you can get a reasonable-looking HTTP interface for free too, even if you decide to use gob for agent communication.
<Aram> going to buy some bread.
<niemeyer> Morning all
<niemeyer> Aram: Have a good, erm.. bread? :)
<rogpeppe> niemeyer: hiya
<niemeyer> rogpeppe: Yo
<niemeyer> fwereade: Pingous
<fwereade> niemeyer, heyhey
<niemeyer> fwereade: Heya
<fwereade> niemeyer, how's it going?
<niemeyer> fwereade: I think a machine should be Dyiable (ugh) when there are still units
<niemeyer> Erm
<niemeyer> Should not
<rogpeppe> niemeyer: +1
<niemeyer> fwereade: I thought we already enforced that constraint
<fwereade> niemeyer, cool, I think the consequences are unpleasantr
<niemeyer> fwereade: But perhaps it got lost in the changes
<fwereade> niemeyer, state enforces almost no constraints :/
<fwereade> niemeyer, ok, that is an overstatement
<niemeyer> fwereade: Really? .. Okay :-)
<rogpeppe> niemeyer: we've never set any constraints on setting something to Dying, only Dead, i think.
<niemeyer> rogpeppe: The constraint was on removal
<niemeyer> rogpeppe: I think
<niemeyer> At least I have fresh memories (perhaps fake) that it was impossible to remove a machine with units in it
<rogpeppe> niemeyer: that's not quite the same thing as setting it to dying though
<niemeyer> rogpeppe: It actually is, in terms of intent
<niemeyer> rogpeppe: I mean what we previously did as "remove" and what we now do as "set to dying"
<niemeyer>             if topology.has_machine(internal_id):
<niemeyer>                 if topology.machine_has_units(internal_id):
<niemeyer>                     raise MachineStateInUse(machine_id)
<niemeyer> Memories aren't entirely fake, at least
<rogpeppe> niemeyer: yeah, it doesn't look like we make any checks like that currently
<rogpeppe> niemeyer: even in RemoveMachine
<rogpeppe> niemeyer: which is definitely wrong
<rogpeppe> niemeyer: if we can't set a machine to dying unless it has no units, do we actually need Dying and Dead for a machine? we could just remove it.
<rogpeppe> niemeyer: (assuming removal will fail if the machine has units)
<niemeyer> rogpeppe: I don't understand how that constraint changes things with regards to the machine lifecycle
<niemeyer> rogpeppe: It's still not alright to remove state under agents that are running
<rogpeppe> niemeyer: true
<niemeyer> -               if hasUnits {
<niemeyer> -                       return fmt.Errorf("machine has units")
<niemeyer> -               }
<niemeyer> -               return t.RemoveMachine(key)
<niemeyer> RemoveMachine was actually implemented like that
<rogpeppe> niemeyer: in cmd/juju ?
<niemeyer> rogpeppe: In state
<niemeyer> This was lost with the implementation of lifecycle support
<rogpeppe> niemeyer: ah. that's racy of course, but still. i'm surprised we didn't have tests for that behaviour
<niemeyer> rogpeppe: No, it actually wasn't
<fwereade> niemeyer, rogpeppe: I think removal is the PA's responsibility, so we do need Dead
<rogpeppe> niemeyer: ah, 'cos it was in a retry loop?
<niemeyer> rogpeppe: Yep
<niemeyer> fwereade: Yeah, the constraint is a minor red-herring that we do need still
<rogpeppe> could someone remind me again why agents can remove their own entities?
<rogpeppe> s/can/can't/
<niemeyer> rogpeppe: A unit container doesn't magically go away because it removed the entity, same thing for a machine
<niemeyer> rogpeppe: While the resource still exists, the respective management object should also exist
<rogpeppe> niemeyer: ok, that's a useful maxim
<fwereade> niemeyer, rogpeppe: and I'm not even certain that Dying is *really* meaningless -- but I don't know what a machine agent's responsibility is other than to set itself to Dead
<niemeyer> fwereade: Agreed, it can easily receive attributes that have to be accomplished before death
<fwereade> niemeyer, rogpeppe: given that we agree that it shouldn't have units when it's set to Dying
<rogpeppe> fwereade: it provides a hook for the machine agent to do something more sophisticated later
<niemeyer> We haven't really invested much in the machine management so far
<niemeyer> fwereade: Well, actually..
<fwereade> niemeyer, rogpeppe: so I *think* that for machines we have constraints on EnsureDying, and a Machiner that just sets itself to Dead at an opportune moment, and that can be extended in future
<niemeyer> fwereade: Removing units!
<fwereade> niemeyer, ah, we don't agree that?
<fwereade> niemeyer, the units must be removed before the machine is set to Dying, mustn't they?
<fwereade> niemeyer, so, yeah, the machiner won't want to set itself to dead while it still has units deployed
<fwereade> niemeyer, but it won't become Dying until that process is underway
<niemeyer> fwereade: It's a good question, actually
<niemeyer> fwereade: But that's perhaps an easier first step
<fwereade> niemeyer, would you expand? sounds like I'm missing something :)
<niemeyer> fwereade: In theory it would be fine to set a machine to dying when all units are also dying
<fwereade> niemeyer, ah, yes, good point
<fwereade> niemeyer, I don't think that hurts us much, though, we still need to wait to recall all our units before we can become dead
<fwereade> niemeyer, whoops no that's crack
<fwereade> niemeyer, they won't be removed from state until they're recalled
<niemeyer> fwereade: Sure, but how's that an issue?
<fwereade> niemeyer, with the current model, we need to wait for that to happen before we can make the machine Dying, so we dodge that problem
<niemeyer> fwereade: Right
<fwereade> niemeyer, if we allow dying machine with dying units, then we need to hang around
<niemeyer> fwereade: Which is easier, agreed
<niemeyer> fwereade: Yep
<niemeyer> We don't have to do that now
<fwereade> niemeyer, you know, I think we do, for the uniter with subordinates
<fwereade> niemeyer, maybe I'm wrong, but I think that setting a unit to dying should also set all its subordinates to dying
<niemeyer> fwereade: +1
<niemeyer> fwereade: Or,
<niemeyer> fwereade: It can reject, but that sounds a bit weird
<niemeyer> fwereade: Setting subordinates to dying is probably the right thing to do
<Aram> niemeyer, bread with sweet crispy bacon goodness.
<fwereade> niemeyer, I'm pretty sure that's wrong -- the only other way to get rid of the subordinates is to kill the relation, isn't it?
<niemeyer> fwereade: Indeed
<fwereade> niemeyer, cool
<niemeyer> Aram: Poor bread! ;-)
 * rogpeppe should probably have some breakfast too
<mramm> good morning all
<fwereade> mramm, heyhey
<TheMue> mramm: Morning
<mramm> who all is planning to come to this morning's team meeting?
<mramm> I will add you to the invite
<fwereade> mramm, I am
<dimitern> mramm: me too
<Aram> me too
<Aram> mramm, can you add me
<Aram> ?
<Aram> fwereade, dimitern ^ ^
<dimitern> Aram: i'm not sure how?
<Aram> ping mark
<mramm> done
<Aram> thx
<mramm> dimitern: you are loved
<dimitern> :)
<Aram> mac os x sucks
<mramm> aram: http://mergy.org/2012/12/irecognition-i-am-no-longer-apples-target-market/
<fwereade> niemeyer, btw, there was something else I vaguely wanted to mention re LXC: AIUI, we *can* run, eg, precise i386 containers on quantal amd64... and this strikes me as an interesting complication for constraints (and the Deployer...) that I had not hitherto considered
<fwereade> niemeyer, but I don't think I'm in a position to anticipate it properly atm, and I guess I can't usefully worry about it until we have some actual containers to use
<fwereade> niemeyer, just running the host OS will be a lot simpler and get us a long way
<niemeyer> fwereade: Agreed. I think the very first thing we should do is to not even worry about containers, but just do plain co-location
<fwereade> niemeyer, +1
<niemeyer> fwereade: I did consider the use of containers across releases before, but I did the same as you, and thought "Yeah, we should support that.", but discarded any kind of deep thinking as premature
<fwereade> niemeyer, yeah, I'd only really considered it re local provider before
<fwereade> niemeyer, rogpeppe, TheMue: https://codereview.appspot.com/6862053 is a rough proposal for making state testing a little clearer: it's not ready to go in, because if we're going to use it we should do so consistently, but I think the general idea can be judged as it stands
<rogpeppe> fwereade: looking
<fwereade> niemeyer, rogpeppe, TheMue: some tests it helps with more than others ofc
<TheMue> fwereade: Will have a look
<rogpeppe> fwereade: have you changed any of the test semantics in that CL?
<fwereade> rogpeppe, I am confident that everything that was tested before is still tested, but the details have changed in a couple of places
<rogpeppe> fwereade: i'd be happier to check this large CL if the only changes, for this round at least, were just changing the existing tests to use the new primitives.
<rogpeppe> fwereade: otherwise i feel i'm trying to check two things at once, and there are quite a lot of changes.
<fwereade> rogpeppe, true -- the other side of it is that I tried to do that, but some of the tests were too painful to maintain like that :)
<rogpeppe> fwereade: how about a CL that just changes the tests that you feel are worth changing, and reserve refactoring the other ones for another CL (maybe several)?
<fwereade> rogpeppe, if the general idea is roughly accepted then when it's ready I imagine I will do it as a series of single-test-file CLs on top of one that introduces the vocabulary
<rogpeppe> fwereade: ah, ok
<rogpeppe> fwereade: in which case i'm +0.5. i can see it makes the tests briefer, but it's also less clear what's going on in each individual test.
<fwereade> rogpeppe, so not entirely without pain -- but I *think* it's easier to mentally check equivalence between X and Z directly than between X and Y and Y and Z in two steps
<fwereade> rogpeppe, that is interesting to me
<fwereade> rogpeppe, because my experience has been that the sheer number of characters in play is the biggest obstacle to understanding a test I'm not familiar with
<rogpeppe> fwereade: maybe you're right. and the test primitives do have quite intuitive mappings to underlying operations
<fwereade> rogpeppe, and ISTM that there's enough repetition that learning a small simple vocabulary once is a clear win, even when some of the verbs only save a line
<rogpeppe> fwereade: yeah.
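A hypothetical illustration of the sort of test vocabulary being discussed (names and shapes are invented for the example, not taken from the CL): each verb does the busywork needed to get a test straight to the call it cares about.

    package sketch

    import "testing"

    // fakeState stands in for the real state object the suite would hold.
    type fakeState struct {
        services map[string]bool
    }

    type suite struct {
        st fakeState
    }

    // addService is a vocabulary verb: set up a service so the test can move
    // straight on to exercising, say, removal.
    func (s *suite) addService(t *testing.T, name string) {
        t.Helper()
        if s.st.services == nil {
            s.st.services = make(map[string]bool)
        }
        s.st.services[name] = true
    }

    // assertRemoved is another verb: one line in the test, one clear assertion.
    func (s *suite) assertRemoved(t *testing.T, name string) {
        t.Helper()
        if s.st.services[name] {
            t.Fatalf("service %q still present", name)
        }
    }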
<TheMue> I'm off for now, will return later.
<niemeyer> fwereade: Sent a few comments
<fwereade> niemeyer, cool, thanks
<niemeyer> fwereade: I'm being asked for lunch, so will have to continue afterwards
<fwereade> niemeyer, np, enjoy :)
<rogpeppe> niemeyer: here's a bare-minimum API CL, just to get us off the ground. hopefully we can agree on at least this much as a starting point. https://codereview.appspot.com/6888048
<niemeyer> rogpeppe: Looking
<niemeyer> rogpeppe: That's a great bootstrap
<rogpeppe> niemeyer: cool
<rogpeppe> niemeyer: it's still entirely compatible with what i did before, of course :-) :-)
<niemeyer> rogpeppe: I bet :)
<niemeyer> fwereade: What is a deployer.Context?
<niemeyer> fwereade: I'm a bit confused by the fact there's a DeployerName
<niemeyer> fwereade: I see some debate there as well, so maybe we'd benefit from a quick exchange on it
<fwereade> niemeyer, I'm really sorry -- cath's aunt is just about to arrive -- I will try to grab you later if that's ok?
<niemeyer> fwereade: Oh, surely
<niemeyer> fwereade: I'll see if I can comment on the CL itself, actually, in a way that does not bring confusing
<niemeyer> confusion
<fwereade> niemeyer, cheers :)
<fwereade> niemeyer, lovely, thanks
<rogpeppe> niemeyer: DeployerName is the entity that the deployer is working on behalf of. in practice it will be the machine's entity name or a principal unit's entity name.
<niemeyer> rogpeppe: I understand that much.. I don't understand why it's part of the interface it's in
<rogpeppe> niemeyer: it's so that the deployer can know if it should remove a deployed unit from the state when it disappears from the watcher list
<rogpeppe> niemeyer: i think
<rogpeppe> niemeyer: i chatted with fwereade a bit ago about this
<fwereade> niemeyer, sorry, back briefly -- the main reason is that I don't want to pass the deployer name (and maybe state info) into every context method
<fwereade> niemeyer, it seemed more obfuscatory than helpful to do so, despite the slightly odd arrangement
<fwereade> niemeyer, the Context is the thing that knows all about how to deploy; the Deployer just tells it what to deploy
<fwereade> niemeyer, maybe it would be cleaner to just drop it from the interface and duplicate it in the Deployer
<niemeyer> fwereade: I'm still catching up on the details of how the interfaces are used, so I don't know for sure, but I'll add a note in either case just to let you know of how it felt
<niemeyer_> fwereade: Review is out
<niemeyer_> fwereade: It's great step forward, thanks
<fwereade> niemeyer_, awesome, cheers, all looks good -- really off for now :)
<fwereade> niemeyer_, (and, yeah, Manager works for me, it's just a name I basically dismiss out of hand :))
<rogpeppe> dimitern: you've got a review: https://codereview.appspot.com/6877054/
<dimitern> rogpeppe: thanks!
<dimitern> rogpeppe: in fact, I want to get rid of the interface altogether, but first I'll finish the refactoring of the tests
<rogpeppe> dimitern: +1
<dimitern> I'm making them more granular with a few ensureExists, ensureNotExist and remove - taking interface{} and calling the appropriate underlying API
<rogpeppe> dimitern: in general it's best to avoid functions/methods that use interface{}. why do you need it here?
<dimitern> rogpeppe: I'd be glad if you take a look once I propose the changes (tomorrow probably)
<dimitern> I have this http://paste.ubuntu.com/1415180/
<dimitern> rogpeppe: ^^
<rogpeppe> dimitern: i'm not convinced that's much gain over small simple statically typed functions tbh
<dimitern> rogpeppe: well, if I don't have this then I end up with these huge test cases (Add/Has/Get/Remove in one, to be idempotent)
<rogpeppe> dimitern: i'm not arguing against helper functions, just that it's better IMHO to go with static typing where you can. something like this, for example? http://paste.ubuntu.com/1415205/
<dimitern> rogpeppe: ok, can you explain why this is better? I'm not arguing, just curious
<rogpeppe> dimitern: then each function is evidently trivial
<rogpeppe> dimitern: it's better because the compiler will tell you when you get it wrong
<dimitern> rogpeppe: I see, so it's better to write more code if it's statically typed vs. less, more generic code with runtime type checks
<dimitern> more, but simpler, that is
<rogpeppe> dimitern: yeah. but in this case it might actually be less code in some cases too
<rogpeppe> dimitern: ensureNotExist is 23 lines but as separate functions (not counting blank lines) it's 20
<dimitern> rogpeppe: ok, I like it :) I'll change it
<rogpeppe> dimitern: every time i see "interface{}" i have to do a double-take, and wonder what's going on and what might be really happening at runtime.
<dimitern> rogpeppe: like unhandled cases and unknown types, in addition - probably some run-time reflection overhead
<rogpeppe> dimitern: definitely. although performance is no issue here, of course.
<rogpeppe> dimitern: BTW your ensureExists function is a bit odd for the name it's given, especially paired with ensureNotExist. ensureExists makes sure that something exists; ensureNotExist checks if something doesn't currently exist.  i'd probably call it "add" :-)
<dimitern> rogpeppe: I was unsure what's the proper name myself :) but yeah, add or create seems more appropriate given what it does
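A rough illustration of the two shapes under discussion, with hypothetical types standing in for the goose test double (not the code from either paste): the interface{} helper needs a runtime type switch, while the statically typed helpers are each evidently trivial and misuse fails to compile.

    package sketch

    import "fmt"

    type FlavorDetail struct{ Id string }

    type double struct {
        flavors map[string]FlavorDetail
    }

    func newDouble() *double {
        return &double{flavors: make(map[string]FlavorDetail)}
    }

    // interface{}-style helper: one entry point, but correctness is only checked
    // at runtime via the type switch.
    func (d *double) ensureExists(entity interface{}) error {
        switch e := entity.(type) {
        case FlavorDetail:
            d.flavors[e.Id] = e
            return nil
        default:
            return fmt.Errorf("unsupported entity type %T", entity)
        }
    }

    // Statically typed alternatives: trivially small, and the compiler catches misuse.
    func (d *double) addFlavor(f FlavorDetail) {
        d.flavors[f.Id] = f
    }

    func (d *double) hasFlavor(id string) bool {
        _, ok := d.flavors[id]
        return ok
    }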
<rogpeppe> niemeyer_: i've just been thinking about how we might manage our migration to using the API
<rogpeppe> niemeyer_: i think we can do it quite nicely in an incremental way and reap some of the benefits early on
<rogpeppe> niemeyer_: my thought is to implement jujud server, and get it running, even though we have no (well, one!) API calls yet.
<rogpeppe> niemeyer_: then add an APIState field to juju.Conn, and have juju.NewConn connect to the API state too.
<rogpeppe> fwereade: how does that sound to you?
<rogpeppe> mramm: i think this is similar to the approach you were suggesting, and i think it's actually not much problem to do.
<mramm> rogpeppe: yea, that's what I was thinking about in general
<mramm> having an incremental migration path for clients
<rogpeppe> then we can gradually migrate anything that uses juju.Conn to use calls in Conn.APIState (juju status might be a nice early candidate, as it would be significantly sped up by the API)
<rogpeppe> finally, state/api can be renamed to state, and the existing State API can be made a private part of it.
<rogpeppe> anyway, it's eod for me now
<rogpeppe> mramm, all: g'night
<niemeyer_> rogpeppe: What would be APIState?
<niemeyer_> rogpeppe: Either way, big +1 on the overall idea of having it working end-to-end from the get-go
<rogpeppe> niemeyer_: APIState would be the type i've defined as api.State
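A rough sketch of the incremental shape being proposed; only the APIState field and the api.State name come from the discussion, everything else is an assumed placeholder. Conn keeps its direct state connection, gains an API connection, and call sites migrate one at a time.

    package sketch

    type State struct{}    // stands in for the existing direct *state.State
    type APIState struct{} // stands in for api.State, the API client connection

    type Conn struct {
        State    *State    // legacy path, used by code not yet migrated
        APIState *APIState // new path; juju status could switch early for speed
    }

    func NewConn(environName string) (*Conn, error) {
        c := &Conn{State: &State{}, APIState: &APIState{}}
        // The real thing would open the mongo-backed state as today and would
        // additionally dial the jujud API server to populate APIState.
        return c, nil
    }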
<mramm> I will likely be about 5 minutes late for our team meeting... Can somebody else start up the google hangout?
<davecheney> hello, is it meeting time ?
<wallyworld> davecheney: i started a hangout https://plus.google.com/hangouts/_/01030a125de04a2af6a02a552ef3090109b438d5?authuser=0&hl=en
<wallyworld> mramm: ^^^^
<davecheney> wallyworld: can you add me as a super friend on g+, https://plus.google.com/u/1/103860539990722557379/posts
<wallyworld> sure
<davecheney> two secs
<davecheney> sound is AFU
<mramm> kk
#juju-dev 2012-12-07
<davecheney> wallyworld: is there a page on the wiki describing the days the company is closed this year ?
<wallyworld> davecheney: not sure, i know there was something about the 28th. let me see if i have anything in my email
<wallyworld> davecheney: this is all i know - it says company closed between xmas and new year https://wiki.canonical.com/PeopleAndCulture/Policies/Leave
<davecheney> i love how out of date this information is
<wallyworld> but there's a separate page somewhere with reference to the 28th being a free day
<davecheney> it makes it really useful
<wallyworld> yeah
<davecheney> there is an up to date pdf on the page
<davecheney> thanks
<wallyworld> davecheney: i think that pdf is the one with the free day etc
<TheMue> Morning
<fwereade> TheMue, rogpeppe: heyhey
<TheMue> fwereade: Hi
<TheMue> fwereade: The improved firewaller is in at https://codereview.appspot.com/6875053/
<fwereade> TheMue, cool, looking
<TheMue> fwereade: Cheers
<fwereade> TheMue, what happens when a machine has a unit removed after initGlobalPorts and createMachineData?
<fwereade> s/and/and before/
<TheMue> fwereade: This is caught by the error check for an unassigned machine.
<fwereade> TheMue, and what happens in response?
<fwereade> TheMue, I don't get how you can distinguish "this unit is not assigned to the machine I'm looking at: keep it around until we find the machine it is assigned to" from "this unit is not assigned anywhere: drop it"
<TheMue> fwereade: Today the unit is skipped, but you're right, it should be dropped from those temporarily stored units.
<fwereade> TheMue, but then the first machine you process will trash all the others, won't it?
<TheMue> fwereade: Why?
<fwereade> TheMue, because it will go through all the valid data saying "nope, not mine, trash it" -- won't it?
<TheMue> fwereade: Each machine data, when created, scans the remaining units from the initial scan and checks whether each unit is assigned to the machine the data is representing.
<TheMue> fwereade: No, I only should trash those without any assigned machine id. Today I just skip it.
<fwereade> TheMue, yes
<fwereade> TheMue, and you *have* to
<TheMue> fwereade: Line 228.
<fwereade> TheMue, because that unit might be assigned to another machine
<TheMue> fwereade: So far it's by accident as good as it is. ;)
<fwereade> TheMue, ok, what I'm afraid I'm saying is that I think you have rearranged the problems but not fixed them
<TheMue> fwereade: I hit this during testing, when units were created while init ran in the background and those units were not yet assigned.
<fwereade> TheMue, the fundamental problem is that you cannot effectively take a delta between the initial-scan state and the state at the point where you've started all the watches you need
<TheMue> fwereade: machineData and unitData are properly initialized before returning to the loop. I had masses of runs with debugging statements. They showed that, depending on who is faster, init, adding or mixed, the state afterwards is always correct.
<TheMue> fwereade: Yes, you never can, because scanning it initially takes time, especially when state is large, and during this time state can be changed.
<TheMue> fwereade: So this CL fetches its initial state as good as it can and at the time machineData and unitData are created it looks if those initial units match to them.
<fwereade> TheMue, so how are you in a position to assume that the initial-changes state that you build up corresponds to the globalUnitPorts state you built up before?
<fwereade> TheMue, ISTM that a removed unit will stay in globalUnitPorts forever
<TheMue> fwereade: Yes, that's the only problem I still have. I have no kind of safe signal that says "Now you can clean it up, guy."
<fwereade> TheMue, yeah
<fwereade> TheMue, and this is an inescapable consequence of the decision to separate the initial scan from the creation of the watchers, isn't it?
<TheMue> fwereade: One way could be a kind of timer checking those entries from time to time if they are removed.
<fwereade> TheMue, it's just another layer of paper over the problem
<TheMue> fwereade: Sure.
<TheMue> fwereade: We need the initial scan to look if the current open ports are those we need when the Firewaller starts.
<TheMue> fwereade: And the initial scan needs time.
<TheMue> fwereade: And during this time the state can (and will) change.
<fwereade> TheMue, understood -- but why don't you also create the *Datas at that point?
<fwereade> TheMue, they hold the watchers which have properties that allow us to avoid this problem
<TheMue> fwereade: Thought about it, niemeyer disliked it.
<TheMue> fwereade: But here it's the same problem.
<TheMue> fwereade: Starting those watchers still produces a gap while scanning.
<fwereade> TheMue, the initial events from those watchers *give* you the initial state, don't they?
<TheMue> fwereade: At least if I start the right ones which may possibly not happen because the state changes while I'm scanning to know which watchers I have to start.
<fwereade> TheMue, er, don't you just need to start the machines watch and handle the initial event?
<TheMue> fwereade: That's what the firewaller has done in his first "release", just that.
<TheMue> fwereade: :/
<fwereade> TheMue, I've never seen the FW setting up initial UD state by consuming the first event from the unit watcher before starting the watchLoop
<fwereade> TheMue, did it really do that?
<fwereade> TheMue, AFAICS the only way to do this reliably is to follow that model
<rogpeppe> fwereade, TheMue: morning!
<TheMue> fwereade: Please rephrase the UD sentence.
<TheMue> rogpeppe: Morning.
<fwereade> TheMue, if everything in the data tree makes sure it's initialized via initial event before starting its watch loop
<fwereade> TheMue, then all you need to do the global port open/close (just once) at the end of the first machinesw.Changes event
<fwereade> s/need to/need is to/
<fwereade> no that s/ is crack
<fwereade> whoops, no, it's what I meant
<fwereade> TheMue, er, am I making any sense?
<TheMue> fwereade: I have to think about it more.
<TheMue> fwereade: It would avoid possibly remaining removed units in that one map, yes.
<TheMue> fwereade: But it would totally change the way initialization works, and I'd have to check how this affects the instance mode, because here we're talking about the global mode.
<TheMue> fwereade: *sigh*
<fwereade> TheMue, yeah, I'm sorry :(
<TheMue> fwereade: Fridays discussion ended with this proposal, but seems it falls too short.
<fwereade> TheMue, I was always trying to communicate that separating watcher setup from initial scanning was intrinsically broken -- sory I haven't been very effective at that
<TheMue> fwereade: Funnily it works really fine, it only has the I-dont-know-if-those-left-units-are-removed-during-startup-in-global-mode-problem.
<fwereade> TheMue, which is exactly the bug we're concerned about
<fwereade> TheMue, fwiw, you've also introduced a goroutine-unsafety bug in unitData.watchLoop
<fwereade> TheMue, ud.ports is used on the main goroutine
<TheMue> fwereade: Could you point me to the line?
<fwereade> TheMue, old: line 542 removed; new: lines 612, 615 changed
<TheMue> fwereade: IMHO the goroutine is started after initialization (btw opposite to the former solution).
<fwereade> TheMue, sorry, rephrase please?
<TheMue> fwereade: Seems to be independent, sorry. I have a deeper look. Thought you meant something different.
<dimitern> jam, wallyworld, rogpeppe: PTAL https://codereview.appspot.com/6877054/
<TheMue> fwereade: But I see the problem, yes. I stumbled over the same name, ports, being used as a field (where it's just used by the Firewaller itself) and as the local variable. Shit, it was intended, but not lucilly.
<TheMue> luckilly
<TheMue> Hehe, this word doesn't exist.
<TheMue> Just lucky.
<fwereade> TheMue, sorry, I'm still a bit confused :)
<TheMue> fwereade: In the old version the unitData has a field named ports and in watchLoop() a variable named ports. I thought that this has been a mistake, but that has been intended.
<TheMue> fwereade: So my "correction" has been wrong.
<fwereade> TheMue, yeah, it's definitely required, but it demands close reading -- probably worth a comment for future reference
<TheMue> fwereade: Yes.
<fwereade> TheMue, otherwise people will definitely try the same "correction" in future :)
<TheMue> fwereade: Maybe also a better naming. Dunno, will think about it. First I'll mark it again as WIP.
<fwereade> TheMue, cool, thanks
<TheMue> fwereade: Done
<TheMue> fwereade: So the problem now is, that following your/our approach is a larger change.
<TheMue> fwereade: And starting the datas based on the initial event is independent of instance or global mode.
<TheMue> fwereade: The latter is only interesting for marking ports as opened and managing the ref count.
<TheMue> fwereade: Which can be handled during this startup phase.
<TheMue> fwereade: So, before entering the firewaller loop, init is called waiting for the first machine watcher event.
<TheMue> fwereade: It starts the machines, but then? Hmm, how do I know that all unit watchers and service watchers are set up based on it?
<TheMue> fwereade: All machines have to notify the init via a channel when they are done and the same with units and services, bottom up.
<TheMue> fwereade: Does that make sense?
<fwereade> TheMue, sorry, I'm still processing
 * TheMue thinks of a concurrently initializing tree which at least has to signal the firewallers init that it's done so that it can continue and enter the loop.
<fwereade> TheMue, the trouble is partly that I haven't ever analysed the non-global mode
<TheMue> fwereade: Hehe
<fwereade> TheMue, but I *think* that the "initial scan" can be perfectly synchronous if the MD creation goes something like...
<fwereade> w := m.WatchUnits()
<fwereade> units := <-w.Changes()
<fwereade> for _, u := range units {
<fwereade>     // create unit data similarly
<fwereade> }
<fwereade> go md.loop(w)
<fwereade> TheMue, (very poorly sketched ofc)
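Fleshing that out slightly, still hypothetical and much simplified relative to the real firewaller types: the point is that consuming the watcher's first Changes event synchronously *is* the initial scan, so there is no window between scanning and watching.

    package sketch

    // unitsWatcher's first event reports the complete current set of units; later
    // events report changes to that set.
    type unitsWatcher interface {
        Changes() <-chan []string
    }

    type machineData struct {
        units map[string]bool
    }

    // newMachineData builds the machine's unit data from the watcher's initial
    // event on the caller's goroutine, then starts the watch loop for deltas.
    func newMachineData(w unitsWatcher) *machineData {
        md := &machineData{units: make(map[string]bool)}
        for _, name := range <-w.Changes() {
            md.units[name] = true
            // a real implementation would create the per-unit data here, in the
            // same style, before any further events can be processed
        }
        go md.loop(w)
        return md
    }

    func (md *machineData) loop(w unitsWatcher) {
        for names := range w.Changes() {
            _ = names // handle subsequent unit additions and removals here
        }
    }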
<TheMue> fwereade: When are the ud loops started?
<fwereade> TheMue, exactly the same model as the MD ones
<TheMue> fwereade: Then it's very similar to my thought, only expressed in code. ;)
<fwereade> TheMue, so, after you've created all your MDs (synchronously) in the first event
<fwereade> TheMue, sorry -- it's the thought I've been trying to communicate for a week or 2 ;p
<TheMue> fwereade: *rofl*
<TheMue> fwereade: To come back to the mode question.
<TheMue> fwereade: IMHO independent of the mode, the startup of the watchers is always the same. What differs is how the firewaller itself (the type Firewaller) handles that information based on the mode.
<TheMue> fwereade: And here instance and global mode have to be handled differently, yes.
<fwereade> TheMue, yes, indeed -- in global mode you can do a flush-everything at the end of the first machine event, and in instance mode just flush each machine as it's handled
<TheMue> fwereade: +1
<fwereade> TheMue, I *think* this also means you can drop everything global* except globalPortRef
<TheMue> fwereade: A ref != 0 means true, yes
<fwereade> TheMue, you just need to be a little bit careful about updating it for initial events in create methods *only* before the first init run has completed
<fwereade> TheMue, and subsequently doing so only as a consequence of flushes
<fwereade> TheMue, ...or maybe I'm overcomplicating it
<TheMue> fwereade: Could you please explain your thoughts more.
<fwereade> TheMue, I think that I misspoke -- you never need to touch the refcounts while you're doing the "initial scan", they can be handled purely on the main goroutine as they currently are
<TheMue> fwereade: Am I right you're thinking of a kind of "init" argument set to true during startup and later being false and internally work as described above only when it's true?
<fwereade> TheMue, I *was*, but I don't think it's needed by anything except the main FW loop, so it can just be a var therein
<fwereade> TheMue, sorry, bbiab
<TheMue> fwereade: Hmm, I need the refcounts at the end of the startup to compare the needed ports to the currently open ports.
<fwereade> TheMue, you can build the refcounts during the initial global flush, can't you?
<TheMue> fwereade: Pardon?
<fwereade> TheMue, sorry, back
<TheMue> fwereade: NP
<fwereade> TheMue, when you first do the initial global mode flush, won't you have a complete set of MDs and UDs available?
<fwereade> TheMue, so you can just build the refcounts from that data
<fwereade> TheMue, I think?
<TheMue> fwereade: So you think of an explicit flush at the end of the init? Otherwise I have somewhere to remember that it's the first time.
<fwereade> TheMue, yeah, exactly -- I'm saying that I think the only place you need that information is within the main FW loop
<fwereade> TheMue, I could well be wrong though -- I'm hardly conversant with all the details of the FW
<TheMue> fwereade: In the loop? As opposed to a firewaller field?
<fwereade> TheMue, yeah -- I can't see when else you'd need it -- but you know the FW better than I do
<TheMue> fwereade: We're still talking about the ref count? Just to make sure.
<fwereade> TheMue, maybe "initial global mode flush" is the wrong term?
<fwereade> TheMue, yeah, we're talking about the refcounts
<fwereade> TheMue, they can be built up in initGlobalMode
<TheMue> fwereade: I need access to it in the method flushGlobalPorts().
<fwereade> TheMue, but *that* only needs to be called, once, at the end of the first machines change
<fwereade> TheMue, doesn't initGlobalMode call flushGlobalPorts? I may be confused
<TheMue> fwereade: Not today, but I would do it at the end of the init after the change we talked about and when global mode is active.
<fwereade> TheMue, ah, no, I'm on crack, sorry
<fwereade> TheMue, so "flush" was the wrong term
<TheMue> fwereade: But it's also needed later, it is the alternative to flushInstancePorts()
<TheMue> fwereade: You smoke too much. *lol* And maybe the wrong stuff.
<TheMue> fwereade: *scnr*
<fwereade> TheMue, what I'm trying to say is that if you call initGlobalMode after handling the first machines change
<fwereade> TheMue, you can use all the MD/UDs that have been built up -- and never have to hit state at all
<fwereade> TheMue, it's a simple loop through all the units to count up the port refs
<fwereade> TheMue, and then do the usual open/close as you currently do at the end of initGlobalMode
<TheMue> fwereade: Yes, that's right, here I would update the refcounts based on the datas, compare it to the open ports and handle the needed opens and closes.
<fwereade> TheMue, ...and then you're done, because you *know* that any changes from that state are already being watched for
<fwereade> TheMue, and will be handled sanely once you hit the main loop select again
<TheMue> fwereade: Bingo!
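To pin the refcount idea down, a hedged sketch under assumed types (the real firewaller's unitData and port handling differ, and service exposure is deliberately ignored here): once the first machines event has been handled and all the machineData/unitData exist, the refcounts can be built purely from in-memory data and compared against the ports currently open in the environment.

    package sketch

    type Port struct {
        Protocol string
        Number   int
    }

    type unitData struct {
        ports []Port
    }

    // initGlobalMode builds the port refcounts from the already-initialised unit
    // data and returns what must be opened and closed to reconcile with the
    // environment's currently open ports.
    func initGlobalMode(unitds map[string]*unitData, currentlyOpen []Port) (refs map[Port]int, toOpen, toClose []Port) {
        refs = make(map[Port]int)
        for _, ud := range unitds {
            for _, p := range ud.ports {
                refs[p]++
            }
        }
        open := make(map[Port]bool)
        for _, p := range currentlyOpen {
            open[p] = true
            if refs[p] == 0 {
                toClose = append(toClose, p) // open but no longer wanted
            }
        }
        for p := range refs {
            if !open[p] {
                toOpen = append(toOpen, p) // wanted but not yet open
            }
        }
        return refs, toOpen, toClose
    }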
<fwereade> TheMue, the only wrinkle about which I am still uncertain is service exposure
<TheMue> fwereade: If that changes during init?
<fwereade> TheMue, just in general -- I know there's code to handle it but I've never needed to tease out the details
<fwereade> TheMue, but, yes, the same general strategy needs to be used, I think
<TheMue> fwereade: We're watching the expose flag to see if a service is exposed. That's covered today in init (what will change) and during runtime (by the watches).
<fwereade> TheMue, yeah -- my point is that you need to unify init and watcher setup in the same way we discussed for MD/UD
<TheMue> fwereade: Yep, the whole chain MD/UD/SD.
<fwereade> TheMue, I think it's pretty easy so long as you're careful to make sure all your *Data creation is done on the main goroutine
<fwereade> TheMue, including for subsequent events
<TheMue> fwereade: Yes, sounds reasonable.
<fwereade> TheMue, cool :)
<TheMue> fwereade: So I think I'll exactly do this now.
<fwereade> TheMue, great
<fwereade> TheMue, tyvm for your time on this, I'm sorry it's such a hassle
<TheMue> fwereade: No, it's absolutely great to do this kind of design review.
<TheMue> fwereade: I have to thank you.
<fwereade> TheMue, always a pleasure :)
<fwereade> btw TheMue, or rogpeppe, I need another review on https://codereview.appspot.com/6864050/ if either of you has a moment
<rogpeppe> fwereade: looking
<rogpeppe> fwereade: you've got a review
<rogpeppe> fwereade: BTW i don't *think* it's possible for the transaction mechanism to determine which op failed.
<rogpeppe> fwereade: because the asserts are executed in a distributed way
<fwereade> rogpeppe, I'm still not sure it's a good idea -- but it should at least be able to figure out the doc, right?
<rogpeppe> fwereade: i'm not sure
<rogpeppe> fwereade: because it can't do the check and write the doc in the same moment
<rogpeppe> fwereade: if someone else happens to execute the transaction, how do we get that information back to the entity that started that transaction?
<fwereade> rogpeppe, you could be right
<fwereade> rogpeppe, I suspect it would still be possible, but almost certainly not worth the effort :)
<rogpeppe> fwereade: i'm not sure actually. seeing when a transaction fails and writing some information about it is non-atomic, and it needs to be atomic to work.
<fwereade> rogpeppe, maybe so, I certainly don't claim to understand all the nuances of txn's implementation
<rogpeppe> fwereade: neither me :-)
<fwereade> rogpeppe, TheMue: I have a couple of responses to the reviews, would appreciate your thoughts
<fwereade> rogpeppe, (I'm not disagreeing that a comment is called for, btw, I'm explaining the logic in case I'm on crack)
<TheMue> fwereade: Already writing.
<fwereade> morning Aram
<Aram> yo
<rogpeppe> fwereade: replied
<rogpeppe> Aram: hiya
<Aram> hey there
<fwereade> rogpeppe, TheMue, cheers guys
<TheMue> fwereade: YW
<TheMue> Aram: Hiay
<TheMue> Hiya
<Aram> hello.
<TheMue> Grmpf, fingers too fast.
<fwereade> oh, bugger, doc appointment, bb after lunch
<wallyworld> dimitern: hi, i just commented on your mp
<dimitern> wallyworld: hey, 10x
<wallyworld> sorry about the delay, i was eating dinner and having a few tasty drinks :-(
<wallyworld> :-) i mean
<dimitern> wallyworld: no worries :)
<dimitern> wallyworld: the FlavorDetail used internally is enough to return lists of either Entity or FlavorDetail with the HTTP API
<wallyworld> dimitern: ok, so long as the current goose nova client can unmarshal the data as returned by the double, then i'm happy :-)
<dimitern> wallyworld: certainly
<wallyworld> dimitern: i'll plug in the live/local tests once this lands and we can see how it all pans out
<dimitern> wallyworld: yep, can't wait already
<dimitern> rogpeppe: when you can, could you look at my CL please?
<rogpeppe> wallyworld: i've just sent another review of your errors CL
<wallyworld> dimitern: yeah, when you have worked on a branch for a little while, you just want it to land!
<rogpeppe> wallyworld: perhaps we could have a chat about it some time
<rogpeppe> dimitern: which one?
<wallyworld> rogpeppe: ok, let me take a look
<dimitern> rogpeppe: https://codereview.appspot.com/6877054/
<rogpeppe> dimitern: looking
<wallyworld> rogpeppe: happy to chat when you have a moment, after you finish reviewing. just ping me
<rogpeppe> dimitern: sorry, "d" is shorthand for "delete this line" (ed(1) syntax :-])
<dimitern> rogpeppe: I see :)
<niemeyer> Hello all
<rogpeppe> niemeyer: mornin'
<dimitern> niemeyer: hiya
<rogpeppe> dimitern: you've got another review
<rogpeppe> wallyworld: ping
<dimitern> rogpeppe: thanks!
<wallyworld> rogpeppe: mumble?
<rogpeppe> wallyworld: what *is* mumble??
<wallyworld> oh, maybe google hangout?
<wallyworld> rogpeppe: mumble is a chat program, like skype
<rogpeppe> wallyworld: i'd be happy to use mumble if i knew what it was :-)
<wallyworld> except free, open source etc
<rogpeppe> wallyworld: ah
<rogpeppe> wallyworld: i'll give it a go. presumably i can apt-get it?
<wallyworld> yeah, believe so. but you need a password etc for the server - it's on the wiki
<wallyworld> i'll see if i can dig up a link
<wallyworld> rogpeppe: https://wiki.canonical.com/StayingInTouch/Voice/Mumble
<rogpeppe> wallyworld: ok, i'm on mumble
<rogpeppe> wallyworld: which space do you use?
<wallyworld> rogpeppe: goto cloud engineering kitchen?
<dimitern> rogpeppe: settings links to []nova.Link{} prevents AddFlavor/Server from generating them (it does that when links == nil). Not setting them will cause DeepEquals to fail and will force me (unnecessarily, IMHO) to create the full list of links every time I need to compare what's in the state vs. what I set there.
<dimitern> s/settings/setting
 * niemeyer => reboot
<niemeyer> fwereade: ping
<fwereade> niemeyer, pon
<fwereade> er, pong
<niemeyer> fwereade: Heya :)
<niemeyer> fwereade: pon is great, though
<fwereade> niemeyer, how's it going?
<niemeyer> fwereade: We should totally switch over
<niemeyer> pin.. pon
<fwereade> niemeyer, yeah, think how much typing we'd save :)
<fwereade> niemeyer, thanks for your reviews last night, very helpful
<niemeyer> fwereade: No worries.. that's actually what I'm pinging about.. is there anything you'd like to chat on to unblock?
<fwereade> niemeyer, I think I'm good, thanks
<fwereade> niemeyer, I'm still marshalling my thoughts re testing methods
<niemeyer> fwereade: Coolio
<fwereade> niemeyer, I guess the differing perspective may be that I see the testing methods as implying "do all required busywork to get to the point where I can call, say, RemoveService"
<fwereade> niemeyer, but the current uses are not very compelling
<fwereade> niemeyer, I'm fine dropping AddAnonymousService, but I reserve the right to get it out and wave it around if and when I spot a situation that deserves it, even if it probably never deserves to go on a shared test case
<fwereade> s/case/suite/
<fwereade> ;)
<niemeyer> fwereade: Hehe :)
<niemeyer> fwereade: Sounds good :)
<TheMue> lunchtime, biab
<niemeyer> rogpeppe: Regarding this comment:
<niemeyer> """
<niemeyer> one possibility would be to allow forcing the version number only, and have
<niemeyer> version.Current always report the current series.
<niemeyer> """
<niemeyer> rogpeppe: I don't get the idea
<niemeyer> rogpeppe: version.Current.Series already does report the current series
<rogpeppe> on a call, back in mo
<niemeyer> rogpeppe: Suppa
<rogpeppe> niemeyer: version.Current.Series doesn't report the current series if the version is forced, i think
<rogpeppe> niemeyer: i may be wrong, let me check
<rogpeppe> niemeyer: no, that's right
<niemeyer> rogpeppe: Ah, I see what you meant
<niemeyer> rogpeppe: Hacking the hack sounds a bit hackish :-)
<rogpeppe> niemeyer: actually we'd be making the hack slightly smaller
<rogpeppe> niemeyer: by making it change the version number only
<niemeyer> rogpeppe: I guess so
<niemeyer> rogpeppe: I'll have a look at this
<fwereade> brb cleaner cleaning
<niemeyer> rogpeppe: Actually, on a second thought, I don't understand why that matters for that particular CL
<rogpeppe> niemeyer: aren't you trying to verify that an instance with the given series is actually started?
<niemeyer> Yes, and where's current series picked from?
<niemeyer> Ah, from the system
<niemeyer> That's slightly unexpected (by me)
<rogpeppe> niemeyer: where else could it come from?
<niemeyer> It basically means what we get out of the agent tools is reporting the machine, not the actual running tools
<niemeyer> Not sure which way is more useful, though
<rogpeppe> niemeyer: they're both potentially useful
<niemeyer> Yeah
<rogpeppe> niemeyer: but it's hard to bake the current series into the compiled executable
<niemeyer> rogpeppe: Well.. I'm coding a change right now to *avoid* doing it, so it shouldn't be so hard.. 8)
<niemeyer> rogpeppe: Either way, we'd be guessing in both directions.. so we don't have to make any changes to that right now
<niemeyer> I'll just do that one side-change as an independent CL and see how it looks
<niemeyer> After lunch, though
<niemeyer> biab
<fwereade> dimitern, wallyworld, jam: is http://paste.ubuntu.com/1416964/ expected? I just updated goose...
<dimitern> fwereade: well, that seems you're missing some openstack env vars
<fwereade> dimitern, surely the tests should pass on a clean system?
<dimitern> fwereade: if you don't pass -live to the tests it should not matter
<fwereade> dimitern, I'm pretty certain my env vars should not be able to cause tests to either fail or succeed :)
<fwereade> dimitern, I didn't, though
<fwereade> dimitern, I concede the point re -live though :)
<dimitern> fwereade: hmm.. well, I don't know if wallyworld is around - those are his last changes
<fwereade> dimitern, bah, I never know whether to dive in and fix myself, or revert, or flag and move on
<dimitern> fwereade: as a workaround until fixed, you'll need to set these to non-empty strings: OS_USERNAME, OS_PASSWORD, OS_TENANT_NAME, OS_REGION_NAME
<dimitern> fwereade: if tests are failing I think revert is the best bet
<fwereade> dimitern, yeah, probably sensible :)
<fwereade> rogpeppe, TheMue: https://codereview.appspot.com/6864050 comments addressed I think
<fwereade> (fwiw, does anyone know best practice for reversing a merge into trunk? absent better ideas I'm just going to copy the old version over the top and ask for a trivial LGTM on trust)
<fwereade> bah, it's not the latest
<rogpeppe> fwereade: last time i did it, i used patch with a revert diff
<rogpeppe> reverse diff
<rogpeppe> fwereade: LGTM  BTW
<fwereade> rogpeppe, yeah, I'm *sure* bzr would do it for me if I knew what I were doing :)
<TheMue> fwereade: Review is in.
<rogpeppe> niemeyer: next stage in bottom-up API implementation: https://codereview.appspot.com/6913043
<niemeyer> rogpeppe: Cheers
<niemeyer> dimitern: Can you give a hand there and propose a fix to avoid the issue for the moment?
<dimitern> niemeyer: what fix?
<niemeyer> dimitern: One that unbreaks tests
<dimitern> niemeyer: ok, I'll take a look
<niemeyer> dimitern: Thanks a lot
<fwereade> niemeyer, dimitern: I just proposed a branch with that commit straight-up reverted -- https://codereview.appspot.com/6907049
<fwereade> niemeyer, dimitern: I'll go ahead and reject it if a fix is incoming
<niemeyer> fwereade: I'm assuming that a trivial temporary fix for this would be under 10 lines of code.. if it seems any controversial, +1 on reverting
<dimitern> fwereade: I have no idea what the fix should be yet, looking..
<fwereade> rogpeppe, fwiw, `bzr diff -r-2..-3` gave me a patch that I could just apply
<rogpeppe> fwereade: yeah, that's what i was suggesting
<fwereade> niemeyer, gut says it's a bit of trivial test setup
<rogpeppe> fwereade: sorry if i wasn't clear enough
<fwereade> rogpeppe, not to worry
<dimitern> fwereade: i cannot even compile the tests from trunk - go test -i in environs/openstack says: ../../state/open.go:72: undefined: mgo.DialWithInfo; mgo.DialInfo
<fwereade> dimitern, if it looks like it'll be complex then maybe just give my CL a quick check and let me know if it's sane
<fwereade> dimitern, go get -u labix.org/v2/mgo/...
<niemeyer> dimitern: Update your mgo
<niemeyer> What fwereade says
<niemeyer> dimitern: How long since you last ran tests, btw? :-)
<dimitern> fwereade: ok I'll check out the CL in the mean time
<fwereade> niemeyer, I think he's been on goose :)
<niemeyer> Ah, that'd explain it
<dimitern> niemeyer: a week probably - been working on goose mostly, not juju-core
<niemeyer> fwereade: Can we do that on our side maybe, if we know what to do? Is it just a matter of putting those vars in the env?
<fwereade> niemeyer, no idea, it probably is
<fwereade> niemeyer, I was trying to avoid distracting myself... didn't actually manage that, though :)
<niemeyer> fwereade: If it is, +1 on adding them next to a TODO: This is wrong as the test shouldn't have to hack the environment to work out-of-the-box. Please fix at next chance.
<niemeyer> fwereade: LOL
<dimitern> fwereade: the CL looks fine to me
<fwereade> dimitern, I'll see if there's an easy fix -- we can race :)
<dimitern> fwereade: I don't think the revert was premature - ListServers() for example in the test I have has some arguments in the provider, but not in goose, so I think it's half-baked anyway
<fwereade> dimitern, yeah, looking at the tests I haz a bit of a confuse, it looks like it *ought* to work
<rogpeppe> it's going to be interesting keeping goose and juju-core in sync
<niemeyer> Slightly unfortunate but if the APIs are diverging, there's no other way
<niemeyer> rogpeppe: Feels pretty easy
<niemeyer> rogpeppe: We've been doing that with every other package for quite  awhile
<rogpeppe> niemeyer: true, but the external packages weren't perhaps changing quite as fast
<niemeyer> rogpeppe: Sure, but the idea is still the same.. merge only when both as ready to go in
<niemeyer> s/as/are
<niemeyer> rogpeppe: https://codereview.appspot.com/6907050
<niemeyer> dimitern: It doesn't look like the API is broken, though?
<niemeyer> dimitern: It seems to compile fine here
<dimitern> niemeyer: maybe I'm not getting something right then
<fwereade> dimitern, ah, no, it's just a broken string (and tests which appear to run twice), plus whatever the deal is with local_test which I don't really get at all
<dimitern> fwereade: I think the local tests are supposed to use the test doubles
<fwereade> dimitern, yeah, but none of those tests actually hit anything let alone the test double
<dimitern> fwereade: no idea, wallyworld would know I haven't looked at them yet
<rogpeppe> niemeyer: looking
<fwereade> dimitern, ISTM that they're failing to construct an env and trying to get the provider from that
<niemeyer> fwereade: I have a fix.. pushing
<fwereade> dimitern, it looks like that's what we do in ec2, though, although I can't figure out *why*
<dimitern> fwereade: probably it needs both tenant and region, which is not needed for ec2?
<fwereade> dimitern, it doesn't need an env at all :/
<fwereade> dimitern, it's creating one for absolutely no reason
<dimitern> fwereade: I see, so it shouldn't?
<rogpeppe> niemeyer: LGTM
<fwereade> dimitern, AFAICT it should just be doing Provider("openstack").PrivateAddress()
<fwereade> dimitern, PrivateAddress has to work with no credentials or it's worthless
<niemeyer> fwereade, dimitern: https://codereview.appspot.com/6912047
<fwereade> niemeyer, LGTM as far as it goes, but it might be nice to drop the redundant Test() func in config_test.go
<niemeyer> fwereade: Yep, there's no point in having credentials there
<fwereade> niemeyer, and frankly local_test.go should just be deleted
<niemeyer> fwereade: DOing it
<dimitern> niemeyer: LGTM
<fwereade> niemeyer, LGTM
<niemeyer> dimitern, fwereade: Thanks. I've sent an improvement to the comment as well, to point out that what was done there is a temporary hack to get things building, but definitely not the right thing to do.
<dimitern> ok
<fwereade> niemeyer, lovely, cheers
<fwereade> niemeyer, ah, sorry, I thought you were submitting - LGTM again, I guess, but really almost all that file is useless and confusing
<fwereade> niemeyer, hard to fault the implementer, though, it looks exactly like the ec2 version -- I think I'll propose my own on top when you've submitted
<niemeyer> fwereade: Yes, my attempt is not to fix anything beyond doing the smallest change possible to help unrelated work continue
<fwereade> niemeyer, actually it's not quite as simple as it looks
<fwereade> niemeyer, yeah, sensible
<niemeyer> fwereade: What's that?
<fwereade> niemeyer, I think the *Address tests should have their own metadataHost patching
<fwereade> niemeyer, and I don't think any of the other tests have any justification for doing so
<niemeyer> fwereade: Agreed on both counts
<fwereade> niemeyer, ok, it is pretty easy, I'll dash it off a bit later
<rogpeppe> niemeyer: i was in the process of implementing jujud server (to serve the API) then realised that perhaps it might be good as another machine worker. what do you think?
<rogpeppe> fwereade: ^
<fwereade> rogpeppe, I don't *think* so
<fwereade> rogpeppe, but would you expand?
<rogpeppe> fwereade: i started doing it separately and then realised that it had almost everything in common
<fwereade> rogpeppe, hum, interesting -- if you can pull it out cleanly then that would probably be a Good Thing then
<rogpeppe> fwereade: it looks at state and does stuff to it. almost the same as any other worker.
<rogpeppe> fwereade: the only thing i'd want to do (to make things consistent) is to remove the --ca-cert argument from jujud
<rogpeppe> fwereade: and have it pull the cert from a known file within data-dir
<fwereade> rogpeppe, +1 to that for sure
<rogpeppe> fwereade: then the API worker can look in data-dir too to get the server cert and key
<rogpeppe> fwereade: hmm, maybe it's crack though actually
<rogpeppe> fwereade: they are almost identical *now*
<fwereade> rogpeppe, I'm still a bit confused by it, it is true, but laura is shouting persistently in the other room so I am finding it tricky to concentrate
<rogpeppe> fwereade: but in the future, the agents will need to talk to the API server, not the db
<fwereade> rogpeppe, ah, yes, good point
<rogpeppe> fwereade: yeah, so the upgrader loop must be different, because they're talking to two different kinds of state server.
<rogpeppe> fwereade: i mean the state-reconnect loop
<fwereade> rogpeppe, maybe best to keep them separate for now then
<rogpeppe> fwereade: definitely
<rogpeppe> fwereade: thanks for the feedback
<fwereade> rogpeppe, it's always nice to be thanked for saying "er... dunno" :)
<rogpeppe> fwereade: it could be worse, you might be a teddy bear
<rogpeppe> fwereade: it will be a little weird though, 'cos i think the API server agent will need to listen to itself...
<fwereade> rogpeppe, haha
<rogpeppe> fwereade: the worker does need an entity name though, as it needs to store its password in state and in a directory. how about just "server" for the entity name, and having State.SetAPIServerPassword ?
<rogpeppe> fwereade: oh bugger, that doesn't work
<rogpeppe> anyone wanna have a look at this CL? https://codereview.appspot.com/6913043
<fwereade> rogpeppe, niemeyer: can I get a trivial LGTM on https://codereview.appspot.com/6907051 please? I ran all the tests :)
<niemeyer> fwereade: Looking
<niemeyer> fwereade: That seems rather controversial
<fwereade> niemeyer, :p
<niemeyer> fwereade: LGTM :-)
<fwereade> niemeyer, cheers
<rogpeppe> fwereade: how did that compile before?
<fwereade> rogpeppe, it didn't, niemeyer owes us cookies :p
<niemeyer> rogpeppe: Probably didn't.. the guy was incompetent
<niemeyer> :-)
<rogpeppe> ha ha
<fwereade> rogpeppe, niemeyer: the trouble is I didn't spot it when I did the full test run for my previous submit
<rogpeppe> niemeyer: that probably cancels out a few cookies i owe you :-)
<niemeyer> rogpeppe: I'm not sure.. I should write down a footnote about the effects of the rule on build fixes
 * niemeyer laughs in an evil way
<niemeyer> Reminds me of that game where we make up rules as we go
<niemeyer> We should play that at some point
<fwereade> rogpeppe, niemeyer, I don't suppose there's a way to get go test to put the bad output, that I care about, at the end where I see it more obviously?
<niemeyer> fwereade: Can I get a second review on this easy one: https://codereview.appspot.com/6907050
<fwereade> niemeyer, looking
<niemeyer> fwereade: Hmm.. not that I know of
<fwereade> niemeyer, just a NO YUO or something, that's all I ask :)
<niemeyer> That's the game, btw: http://en.wikipedia.org/wiki/Mao_(card_game)
<fwereade> niemeyer, can you briefly explain the context? code looks good but I don't quite understand what it helps us do
<niemeyer> fwereade: It's a pre-req of..
<niemeyer> fwereade: https://codereview.appspot.com/6868070/
<fwereade> niemeyer, ah, ok
<fwereade> niemeyer, LGTM
<niemeyer> fwereade: Cheers
<fwereade> I think I'm done for the day
<fwereade> dimitern, I think we're all going down to get ribs at la rive, join us if you fancy it, they're pretty good :)
<dimitern> fwereade: cool, when?
<fwereade> dimitern, imminently :)
<dimitern> fwereade: :) be there in 20m
<fwereade> sweet, see you shortly then :)
<fwereade> happy weekends everyone
<dimitern> from me too :) happy weekend!
<fwereade> niemeyer, (don't quite know how I failed to get to the deployer rework today :( -- it'll be there soon though :))
<niemeyer> fwereade: All good
<niemeyer> fwereade, rogpeppe: I'll be off on Monday, btw, in exchange for the 28th that I missed.. Back Tuesday and on
<rogpeppe> fwereade: have a good one!
<rogpeppe> niemeyer: ok, have a great weekend
<niemeyer> fwereade, rogpeppe: A great weekend to all as well
<niemeyer> rogpeppe: api-server LGTM
<rogpeppe> niemeyer: cool, thanks
<rogpeppe> niemeyer: next up is adding a conventional Stop method, then adding a jujud server command.
<niemeyer> rogpeppe: Sweet, sounds sensible
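[Aside: a rough sketch of the "conventional Stop method" shape being discussed: ask the worker's run loop to exit, then wait for it and return its error. juju-core workers of this era build this on the tomb package; the channel-based version below only illustrates the idea, and all names are invented for the example.]

```go
// Illustrative only: the conventional "Stop = kill + wait" worker shape.
package example

import "time"

type Server struct {
	stop chan struct{} // closed to ask the run loop to exit
	done chan error    // carries the run loop's final error
}

func NewServer() *Server {
	s := &Server{stop: make(chan struct{}), done: make(chan error, 1)}
	go func() { s.done <- s.run() }()
	return s
}

func (s *Server) run() error {
	for {
		select {
		case <-s.stop:
			return nil // clean shutdown requested
		case <-time.After(time.Second):
			// ... serve API requests, watch state, etc.
		}
	}
}

// Stop asks the run loop to exit, waits for it, and reports its error.
func (s *Server) Stop() error {
	close(s.stop)
	return <-s.done
}
```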
<rogpeppe> niemeyer: i'm wondering how we should manage the api server's state password
<rogpeppe> niemeyer: it probably needs its own entity management stuff in state, which i'm not too keen on doing, as it seems a lot of work for not much gain
<rogpeppe> niemeyer: i've been toying with the idea of making the state server part of the machine agent
<niemeyer> rogpeppe: How do you mean
<niemeyer> ?
<rogpeppe> niemeyer: it can't *quite* be "just another worker" but maybe not far off
<niemeyer> rogpeppe: Sorry, let me be more specific
<niemeyer> rogpeppe: What state password would the API server need?
<rogpeppe> niemeyer: i think we probably want to keep mongodb password access
<rogpeppe> niemeyer: even after the API server is done
<niemeyer> rogpeppe: I see
<niemeyer> rogpeppe: The API server is a bit of an interesting case
<rogpeppe> niemeyer: and so we still need to go through the same initial password dance that we do now
<rogpeppe> niemeyer: yeah, it is. but it's perhaps not as special as we might initially think.
<niemeyer> rogpeppe: It can be, depending which way we go
<rogpeppe> niemeyer: yeah
<niemeyer> rogpeppe: We could opt to use a unix socket, for example
<niemeyer> Which isn't supported by mgo right now
<niemeyer> rogpeppe: I'm not sure about how much that'd make things easier/nicer, though
<rogpeppe> niemeyer: i don't think that works so well, as i think we want to be able to have a state server fan out to several mongos, perhaps
<rogpeppe> niemeyer: and the state server might implement some caching, so it might make sense to have more state servers than mongods
<niemeyer> rogpeppe: Yeah, that's an interesting scheme to keep up the sleeve, even if we don't do it right away
<rogpeppe> niemeyer: i'm thinking that in fact, if we choose it to be, the state server is actually very similar to a normal worker.
<rogpeppe> niemeyer: except that, eventually, it will act on the mongo state rather than the api state
<niemeyer> rogpeppe: Agreed
<niemeyer> rogpeppe: Sounds like a good way to put it
<rogpeppe> niemeyer: and thinking that way, perhaps it makes sense to make it just another worker in the machine agent's arsenal.
<niemeyer> rogpeppe: Yep
<rogpeppe> niemeyer: there will be an interesting little dance for machine agents that also happen to be api servers, but i *think* it's possible
<rogpeppe> niemeyer: the upgrade logic will want to talk to the api that's being served by the same machine agent, for example, but that will probably work ok.
<niemeyer> rogpeppe: Yeah, it might work alright
<rogpeppe> right, time for me to go too
<rogpeppe> night all! have great weekending.
#juju-dev 2013-12-02
<axw> hey thumper, have a good week away?
<thumper> axw: yeah, I did
<thumper> quite relaxing
<axw> cool
<axw> good week for it ;)
<axw> it was a little bit crazy last week, but all turned out well
<thumper> cool
<thumper> I'm trying to get the team's prioritized list
<thumper> so we can focus on what to do next
<thumper> I'm finishing off the kvm support
<axw> thumper: I'm pretty close to having synchronous bootstrap done I think; I've got some updates to do per jam's suggestion of toning down the output
 * thumper nods
<thumper> cool
<axw> thumper: waiting on some feedback from fwereade regarding state changes for API-based destroy-environment
<axw> when that's done I can finish off destroy-env for the manual provider
<thumper> that'll be good to finish off
<thumper> axw: I think it would be great to create an "ask ubuntu" question around "how do I use juju with digital ocean", and answer yourself demonstrating the manual bootstrap / manual provisioning
<axw> thumper: I also looked into manual provisioning into non-manual env. There's one particularly difficult bit, which is that agents/units/etc. address each other with private addresses, which isn't likely to work with an external machine
<thumper> hmm...
 * thumper nods
<axw> thumper: SGTM, tho Mark Baker (I think it was him) wanted me to get in touch before publicising widely
<thumper> axw: what about crazy shit like running a sshuttle
<thumper> on the bootstrap node
<axw> yeah something like that may be necessary
<axw> tho I'm starting to think that it's a lost cause, and we should just handle it with cross-environment
<axw> haven't gotten too deep into it yet
<thumper> hmm...
<axw> wallyworld_: you can just do "defer os.Chdir(cwd)" - no need to change it now though
<wallyworld_> ah bollocks
<wallyworld_> the tests didn't break regardless, just thought i'd be complete
<wallyworld_> i'll fix as a driveby
<axw> cool
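[Aside: the "defer os.Chdir(cwd)" idiom axw mentions, as a minimal standalone test sketch. This is not the juju-core test in question; the temp-dir setup is only there to make the example self-contained.]

```go
// Sketch: a test that changes directory and restores it however it exits.
package example

import (
	"io/ioutil"
	"os"
	"testing"
)

func TestInTempDir(t *testing.T) {
	cwd, err := os.Getwd()
	if err != nil {
		t.Fatal(err)
	}
	defer os.Chdir(cwd) // restore the original working directory on return

	dir, err := ioutil.TempDir("", "example")
	if err != nil {
		t.Fatal(err)
	}
	defer os.RemoveAll(dir)

	if err := os.Chdir(dir); err != nil {
		t.Fatal(err)
	}
	// ... test body runs relative to dir ...
}
```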
<wallyworld_> thumper: initial framework https://code.launchpad.net/~wallyworld/juju-core/ssh-keys-plugin/+merge/197310
<thumper> wallyworld_: ack
 * thumper nips to the supermarket for dinner stuff...
<jam1> axw: "agents address each other with private addresses", in the case of AWS I *think* we use the DNS name, which appropriately resolves private/public based on where you make the request. But I'll agree that you don't guarantee routing. Assuming a flat network for agent communication probably isn't going to change this cycle, but we may get there eventually.
<axw> jam1: it may not be a difficult problem to solve, but I first came up against this when the agents were deployed with stateserver/apiserver set to the private address
<axw> that was on EC2
<jam1> axw: yeah, code has been changing in that area. It would appear something is resolving that address and passing around the IP, rather than the DNS name.
<axw> jam: I don't think it was the IP - it was an internal DNS name
<axw> couldn't be resolved outside EC2
<axw> bbs, getting lunch
<axw> jam: still working out the kinks, but synchronous bootstrap will look something like this in the near future: http://paste.ubuntu.com/6507781/
<jam> axw_: looks pretty good.
<axw> the apt stuff should be at the beginning obviously :)
<dimitern> morning
<rogpeppe> mornin' all
<jam> jamespage: poke about mongo/mongoexport/mongodump/etc
<jamespage> jam: morning
<jam> jamespage: I hope you had a good weekend
<jamespage> yes thanks - how was yours?
<jam> pretty good. It was Thanksgiving into UAE National Day, and Expo 2020 celebration, so the weekend gaps were all a bit silly. My son will end up with a 5-day weekend
<jamespage> jam: nice
<jam> jamespage: so I responded to your questions about what tools we need to include. I'd like to know more from your end what the costs of it are so that we can find a good balance point.
<jam> morning fwereade, I had a question for you when you're available
<fwereade> jam, heyhey
<jam> fwereade: namely, the 14.04 priorities/schedule stuff
<fwereade> jam, I couldn't see anything by mramm, so I just copied them straight into the old schedule doc
<fwereade> jam, https://docs.google.com/a/canonical.com/spreadsheet/ccc?key=0At5cjYKYHu9odDJTenFhOGE2OE16SERZajE5XzZlRVE&usp=drive_web#gid=2
<jam> fwereade: yay, I was hoping they would end up there
<jamespage> jam: just reading
<jam> k
<jamespage> jam: OK - I see what you are saying - I'll respond on list so everyone can see
<axw> fwereade: re https://codereview.appspot.com/28880043/diff/20001/state/state_test.go#newcode567, there's code in AddService/AddMachine* to assert environment is alive
<axw> and accompanying tests
<axw> actually that is the test - maybe I misunderstood you
<fwereade> axw, yeah -- but I didn't see a test for what happens if the env was alive as the txn was being calculated, but became not-alive before the txn was applied
<fwereade> axw, look in export_test.go -- in particular, SetBeforeHook, I think -- for a mechanism that lets you change state at that point
<fwereade> axw, there are a few other tests that use it... but not many, the capability is relatively recent
<axw> fwereade: thanks. just so I understand- you mean to check what happens between the initial environment.Life() check, and when the transaction is executed?
<fwereade> axw, yeah, exactly
<axw> okey dokey, I will add that in
<fwereade> axw, cheers
<axw> fwereade: and this one: https://codereview.appspot.com/28880043/diff/40001/state/environ.go#newcode84  -- I thought I was following your earlier advice ("I kinda hate this, but we're stuck with it without a schemachange for a total-service-refcount somewhere..."), but I suppose I've misunderstood you
<axw> jam: I'd appreciate another look at https://codereview.appspot.com/30190043/ if you have some time to spare later
<fwereade> axw, indeed, I was a bit worried I might have misunderstood myself there
<axw> heh :)
<fwereade> axw, I don't *think* I ever said that you'd have to remove everything before an env could be destroyed, but I might have said something else so unclearly that the distinction was minimal
<fwereade> axw, it's always been manual machines that are the real problem
<axw> indeed
<axw> they are not much like what's preexisting
<fwereade> axw, yeah
<fwereade> axw, I think we're massaging them into shape though
<axw> fwereade: how about I just drop that comment and stick with the cleanup. did you have any thoughts on cleaning up machine docs?
<axw> i.e. destroying machines as environment->dying
<fwereade> axw, I feel that bit should be reasonably short-circuitable
<fwereade> axw, ie just kill all the instances and be done with it
<fwereade> axw, the only reason not to is, again, the manual machines
<axw> yeah, except for manual machines
<fwereade> jinx ;p
<axw> fwereade: so, one option is to have it schedule a new cleanup for machines that *doesn't* imply --force
<axw> and that will wait for all units to clean up (as a result of services being cleaned up)
<fwereade> axw, so yeah I could maybe be convinced that *if* there are manual machines we should destroy all the others
<fwereade> axw, well, there's not actually much point waiting
<fwereade> axw, I am starting to think that actually destroy-machine should always act like --force anyway
<axw> fwereade: I was just thinking about leaving units in a weird state, but maybe we don't care?
<axw> doesn't matter for non-manual of course
<fwereade> axw, if you don't want a machine, you don't want it, and if the whole thing's being decommissioned there's no point caring about the units in a weird state
<fwereade> axw, yeah, doesn't apply to manual machines ;p
<axw> yeah... except if you want to reuse that machine
<axw> heh
<axw> ok, I'll play with that some more tomorrow
<fwereade> axw, destroy-machine implies no reuse, I think
<fwereade> axw, whereas manual... often implies reuse
<axw> fwereade: at worst, users can destroy-machine the ones they care about
<axw> then wait for things to clean up
<axw> then do destroy-env
<fwereade> jam, seeding a thought: can we get rudimentary versions into the API by 1.18? can chat about it more after standup
<axw> assuming destroy-machine doesn't change to always --force
<fwereade> axw, yeah, indeed
<fwereade> axw, manual machines again are the reason not to always --force
<fwereade> blasted things ;)
 * fwereade has to pop out a mo, bbs
 * axw has to go
<axw> adios
<jam> have a good evening axw
<axw> thanks jam
<jam> TheMue: rogpeppe, mgz, standup? https://plus.google.com/hangouts/_/calendar/am9obi5tZWluZWxAY2Fub25pY2FsLmNvbQ.mf0d8r5pfb44m16v9b2n5i29ig
<jam> wallyworld_: ^^
<TheMue> jam: i'm already in the qa timeout each day
<jam> TheMue: absolutely, just didn't know if you were coming to ours, so you're welcome if you want
<TheMue> jam: so i think it's better to take part on thursdays, there's more than status
<TheMue> jam: right now i'm in an interesting testing, maybe tomorrow
<jam> TheMue: have a good day, then.
<TheMue> jam: thanks, u2
<TheMue> jam: short info about current status, built a "Tailer" for filtered tailing of any ReadSeeker and writing to a Writer. looks good, right now testing it.
<TheMue> jam: it's part of the debug-log api
 * TheMue => lunch
<natefinch> jam: I can try reaching out to the github.com/canonical guy through linked in... we're like 3rd degree relations. Seems like he keeps it up to date there.
<jam> natefinch: go for it. I'll forward you the email I tried sending earlier
<natefinch> jam: cool
 * rogpeppe3 goes for lunch
<sinzui> jam I still see my blocking bug stuck in triaged: https://launchpad.net/juju-core/+milestone/1.16.4 Is bug 1253643 a duplicate of bug 1252469
<_mup_> Bug #1253643: juju destroy-machine is incompatible in trunk vs 1.16 <compatibility> <regression> <juju-core:Fix Committed by jameinel> <juju-core 1.16:Fix Committed by jameinel> <https://launchpad.net/bugs/1253643>
<_mup_> Bug #1252469: API incompatability: ERROR no such request "DestroyMachines" on Client <terminate-machine> <juju-core:Triaged> <https://launchpad.net/bugs/1252469>
<jam> sinzui: that gets bumped to 1.16.5 because we moved the DestroyMachines code out of 1.16.4
<jam> I'll fix it
<sinzui> jam, then the bug is closed :) I will start motions for the release of 1.16.4
<jam> sinzui: right, the *key* thing is NEC is already using something (1.16.4.1) which we really want to make 1.16.4, and then make the next release 1.16.5
<jam> I forgot to check the milestone page, as another of the bugs gets bumped as well
<sinzui> uhg
<jam> sinzui: did we end up re-landing bug #1227952 ?
<_mup_> Bug #1227952: juju get give a "panic: index out of range" error <regression> <goyaml:Fix Committed by dave-cheney> <juju-core:Fix Committed by dave-cheney> <juju-core 1.16:Fix Committed by sinzui> <https://launchpad.net/bugs/1227952>
<sinzui> relanding? I don't know. I can check the version
<sinzui> in deps
<jam> sinzui: I pivoted to remove everything in lp:juju-core/1.16 back to 1.16.3 so that we could land the things that were critical to NEC, and then get the non-critical stuff out in the next one
<jam> I think bug #1227952 needs to target 1.16.5
<_mup_> Bug #1227952: juju get give a "panic: index out of range" error <regression> <goyaml:Fix Committed by dave-cheney> <juju-core:Fix Committed by dave-cheney> <juju-core 1.16:Fix Committed by sinzui> <https://launchpad.net/bugs/1227952>
<sinzui> jam, ah, the missing info from the Wednesday conversation. okay. A fine strategy
<jam> sinzui: as soon as you give me the go ahead I have lp:~jameinel/juju-core/preparation-for-1.16.5 to bring back all the stuff
<sinzui> understood
<benji> marcoceppi: It is a work-in-progress, but I would like your initial impressions of lp:~benji/+junk/prooflib-first-cut
<marcoceppi> benji: I thought proof was already free of all sys.exit calls already?
<benji> marcoceppi: I haven't looked for sys.exit specifically yet, so far I have concentrated on project structure
<marcoceppi> benji: why separate this from charm-tools?
<marcoceppi> I'm confused as to the goal
<marcoceppi> benji: care to jump on a hangout?
<benji> the goal is to create a library that third-parties can consume
<benji> I have a call in a couple of minutes, but I can do it after that (say 30 minutes or so from now)
<marcoceppi> benji: I'm confused as to why charm-tools can't be that library?
<marcoceppi> sounds good!
<mattyw> stupid question: I'm trying to query the mongodb that gets started when I run make check - but I can't work out where to get the right creds to start a mongo shell connection
<rogpeppe3> mattyw: what's "make check"?
<rogpeppe> mgz: ping
<mattyw> rogpeppe, running all the tests
<mgz> rogpeppe: pong sorry, not paying attention
<rogpeppe> mgz: np, am currently in a call
<mgz> yell after if you still need me
<mgz> will be around at least another couple of hours
<rogpeppe> niemeyer: ping
<niemeyer> rogpeppe: pongus
<rogpeppe> niemeyer: i wondered if you might have a moment to join us in a hangout - i'm trying to resolve some issues after restoring a mongodb
<niemeyer> Sure thing
<niemeyer> Link?
<rogpeppe> niemeyer: https://plus.google.com/hangouts/_/calendar/am9obi5tZWluZWxAY2Fub25pY2FsLmNvbQ.mf0d8r5pfb44m16v9b2n5i29ig?authuser=1
<natefinch> Fixing tests with <-time.After(x)  ...bad or truly terrible?
<hatch> Hey all, is there a whitelist of characters for service names in core? I need to set up proper validation in the GUI https://bugs.launchpad.net/juju-gui/+bug/1252578
<_mup_> Bug #1252578: GUI shows invalid service name as valid when deploying ghost <juju-gui:Triaged> <https://launchpad.net/bugs/1252578>
<hatch> found the regex
<sinzui> abentley, i am having a bad day tearing down envs. Local required manual deletions of lxc dirs and symlinks. I got two timeouts destroying azure
<abentley> sinzui: Ugh.
<abentley> sinzui: I am starting a test run on the new instances. (juju-ci-2 env)
<sinzui> fab
<abentley> sinzui: Is it possible that you recently destroyed an aws environment?
<sinzui> abentley, in the last 15 minutes I destroyed test-aws
<sinzui> and test-hp
<sinzui> abentley, I don't think these can collide
<abentley> sinzui: This log is baffling: http://162.213.35.54:8080/job/aws-upgrade-deploy/71/console
<abentley> sinzui: It bootstraps successfully, and later reports it's not bootstrapped.
<sinzui> hmm, maybe the control-buckets are the same.
<abentley> sinzui: Most likely explanation is it was torn down after bootstrapping by one of us or our pet machines.
<abentley> sinzui: Ends with yakitori?
<sinzui> abentley, yep
<abentley> sinzui: I think that's the explanation, then.
<sinzui> abentley, can you instrument CI destroying aws and undermining my acceptance test?
<sinzui> well
<sinzui> I can teardown now and find some more japanese food
<abentley> I didn't understand the request.  But I can certainly change the control bucket.
<sinzui> abentley, I was wondering if you wanted to make ci destroy aws and then I would see if the env went missing
<abentley> sinzui: Okay.
<abentley> sinzui: Done.
<sinzui> abentley, that was it
<abentley> sinzui: Okay, changing control bucket here.
<sinzui> and I will change mine too
<abentley> sinzui: 1.16.4 on the local provider no longer ooms, but the deploy goes funny: http://162.213.35.54:8080/job/local-upgrade-deploy/75/console
<abentley> sinzui: So the upgrade went fine, but the deploy seems to use the 1.16.3 tools instead of the 1.16.4.
<sinzui> abentley, we could use --show-log or --debug to see more info about what was selected
<marcoceppi> Who knows the most about the openstack provider?
<abentley> sinzui: Okay.  I can say from the existing log that 1.16.3.1 was selected on that run.
<natefinch> marcoceppi: I wouldn't say I know a lot about the openstack provider, but maybe I can help?
<sinzui> abentley, looking at before the destroy step, we can see that 1.16.4.1 was selected, but I don't see any agents upgraded during the wait phase
<abentley> sinzui: As the agents reach the expected value, they disappear from the list.  So when 0 disappears from "<local> 1.16.3.1: 1, 2, mysql/0, wordpress/0", we can infer that it was upgraded.
<abentley> sinzui: When /var/lib/jenkins/ci-cd-scripts2/wait_for_agent_update.py exits without an error, that indicates that all the agents have upgraded.  I can change it to print that out.
<sinzui> hmm, why are the two upgrade commands different in this log
<sinzui> abentley, this command assumes that 1.16.4 is at the location specified in the tools url: juju upgrade-juju -e local --version 1.16.4.1
<thumper> morning
<sinzui> ^ I think that might assume the bucket was populated from the previous effort
<abentley> sinzui: They are for different reasons.  the first tests upgrades.
<thumper> mramm: back in the land of stars and stripes?
<mramm> yes
<abentley> sinzui: The second is to force the agents to match the client.  It was intended for the case where  the agent is newer than the client.
<sinzui> abentley, sure, but --version requires a binary at the tools url
<abentley> sinzui: sure, but shouldn't 1.16.4 deploy its binary to the tools url?
<sinzui> abentley, only --upload-tools will put the tool in the location. I think --version just pulls from that location
<natefinch> morning thumper
<thumper> o/ natefinch
<thumper> mramm: our 1:1 has moved from 9am to 11am for me due to summer time here and not there
<thumper> we can move it earlier if you like
<mramm> ok
<mramm> right now I have an interview
<abentley> sinzui: So how does 1.16.3 get into the tools url?  Remember we're talking local provider here.
<mramm> thumper: but I can move it forward for next week
<thumper> mramm: sure
<sinzui> abentley, local-bucket > streams > aws
<abentley> sinzui: So do we need to specify a testing tools-url for local-provider then?
<sinzui> abentley, I think so, I have in the past with limited success.
 * sinzui ponders trying lxc with aws testing
<abentley> sinzui: In the old environment, I have http://10.0.3.1:8040/tools
<thumper> sinzui: fyi, we are going to want to test the local provider with lxc and kvm shortly
<sinzui> thumper, I was planning that for 1.18, but given the topic of jamespage's reply to the 1.16.4 announcement, I think we need to quickly increment the minor numbers and release 1.18 today
 * thumper tilts his head
<thumper> sinzui: must have been a personal reply
<thumper> didn't see it
<sinzui> thumper, do you have a reply to "Adding new modes to the provisioner and new plugins feels outside the scope of stable updates to me"
<thumper> what is the rationale there?
<thumper> I guess that makes sense
<thumper> are we needing a release?
<sinzui> I learned that jam had planned an alternate 1.16.4 to my own. I made the release with jam's plan, but james thinks we should be doing a 1.18 with these changes
<sinzui> abentley, thumper I am tempted to create a stable series and branch from 1.16 and just release stables from it with selective merges
<abentley> sinzui: Does this shed any light? http://162.213.35.54:8080/job/local-upgrade-deploy/76/console
<abentley> sinzui: That sounds like it could work, but I think it would be better if the dev team was doing that.  Or better yet, landing fixes to stable and merging them into devel.
<sinzui> abentley, I am sure that strategy would avoid the porting pain we have had in the last 30 days
<abentley> sinzui: agreed.
<thumper> sinzui: what is the push to get the plugins out?
<thumper> is it external need?
<thumper> if that is the driver,
<thumper> then +1 on a 1.18 from me
<thumper> if it is to just get testing
<thumper> why not 1.17?
<thumper> we haven't done 1.17 yet have we?
<sinzui> thumper, no way, 1.17 is really broken. we haven't seen hp-cloud work in weeks
<thumper> really?
<thumper> WTF?
<thumper> broken how?
<sinzui> thumper, we can do a dev release with 2071 from Nov 4.
<thumper> why is it so broken?
<thumper> shouldn't that be a "stop the line" type issue?
<sinzui> we cannot tell exactly. Charms don't deploy on it for testing
 * thumper shakes his head
<thumper> what about canonistack?
<thumper> is that working?
<sinzui> 2071 always works, 2072 always fails. canonistack passes
<sinzui> abentley, about the log. I am still not surprised. I think 1.16.3 was pulled from aws because --upload-tools was not used.
<thumper> the only thing that leaps to mind with that has to do with installing stuff in cloud-init leaving apt in a weird state
<thumper> that revision is nothing openstack specific
<thumper> so we are left to look at the differences in the actual cloud images
<sinzui> abentley, let me see if I can get control of my local and try with aws testing location
<thumper> sinzui: got logs?
<sinzui> sure do
<sinzui> thumper, https://bugs.launchpad.net/juju-core/+bug/1255242
<_mup_> Bug #1255242: upgrade-juju on HP cloud broken in devel <ci> <hp-cloud> <regression> <upgrade-juju> <juju-core:Triaged> <https://launchpad.net/bugs/1255242>
<sinzui> thumper, abentley ^ about this bug. I ponder if Hp got too fast for juju. I see similar errors on Hp when I am making juju go as fast as possible. When I wait a few minutes for the services to come up before add-relation, I get a clean deployment
<abentley> sinzui: I can change the script to always supply --upload-tools for the local provider, if that's what we want.
<thumper> sinzui: weird
<sinzui> abentley, I think that might be the case. let me finish my test with local + aws testing
<abentley> sinzui: Sure.  BTW, here's an example of our automatic downgrade to 1.16.3 to match a 1.16.3 client: http://162.213.35.54:8080/job/aws-upgrade-deploy/73/console
<sinzui> abentley, use --upload-tools for the second deploy phase. my aws testing hacks didn't help
<abentley> sinzui: Can I also use it for the first deploy phase?  I use the same script for both deploys.
<sinzui> abentley, if the first juju is stable and the second is proposed, then it is okay
<abentley> sinzui: Okay, I will use it for both deploys.
<abentley> sinzui: Okay, I applied --upload-tools to bootstrap, but for some reason, 1.16.3.1 was selected again.  Did you mean I should apply it to upgrade-juju?
<sinzui> abentley, for the second case this is a disaster. 1.16.4 should only upload tools for itself.
<sinzui> abentley, the test is to verify proposed stable can bootstrap itself
<sinzui> abentley, The local upgrade and bootstrap just works for me.
<sinzui> abentley, http://pastebin.ubuntu.com/6511274/
<abentley> sinzui: I am not installing the deb, so that I don't have to worry about version conflicts.
<sinzui> abentley, but is this a case where we are using the extracted juju? Since we didn't install juju, the 1.16.3 tools are all that is available
<abentley> sinzui: Instead, I am just using the binary directly.
<sinzui> bingo
<sinzui> this is tricky
<abentley> sinzui: So there's a ./usr/lib/juju-1.16.4/bin/jujud in a directory in the workspace.  Any way to convince juju to use that?
<sinzui> GOPATH?
<abentley> I don't know enough about what GOPATH means.  Is it for resources as well as executables ?
<sinzui> GOPATH=./usr/lib/juju-1.16.4/ indicates where to find bins and srcs
 * sinzui can try this now in fact
<sinzui> abentley, I think GOPATH works are in thunderbirds
<abentley> in thunderbirds?
<natefinch> that's the signal!  Delta team, go go!
<natefinch> >.>
<natefinch> <.<
<sinzui> abentley, "thunders are go" http://pastebin.ubuntu.com/6511330/
<thumper> :)
 * sinzui likes supermarionation
<natefinch> this does not surprise me.
 * abentley liked Team America: World Police, but hasn't seen much of the original stuff.
<sinzui> my son has a youtube subscription to follow Captain Scarlet
<abentley> sinzui: Did I do it wrong? http://162.213.35.54:8080/job/local-upgrade-deploy/81/console
<natefinch> thumper: my tests for mongo HA require two thread.Sleep() equivalents (<-time.After(time.Second))... it's  because I'm starting and stopping mongo servers, and the code that does it is asynchronous.... so sometimes mongo hasn't finished starting yet.  Thoughts?  I could spend some time making the mongo start function synchronous.. but it's just a test mechanism, so I'm not sure how much time to put into it.
<thumper> natefinch: I'd suggest making the start synchronous, it shouldn't be too hard no?
<thumper> we do this now with the upstart start method
<thumper> so we go: start, are you started?
<sinzui> abentley, I explicitly call the juju bin instead of PATH resolution
<thumper> wait a bit, and ask again
<natefinch> thumper: yeah, that's what I was thinking of doing.  Fair enough.
<thumper> we have short attempts
<sinzui> abentley, but I see you put the bin as the first element in PATH
<abentley> sinzui: Also, I run 'which' to make sure I have the right one.
<thumper> but I think making it synchronous is the most obvious thing, hide the waiting from the caller, make the tests simple to read and understand
<thumper> I don't think you ever regret making tests better and more understandable
<sinzui> abentley, is it possible sudo got root's PATH
<thumper> within reason
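[Aside: a minimal sketch of the synchronous-start approach thumper describes: kick the process off, then ask "are you started?" in short attempts up to a deadline, instead of scattering <-time.After sleeps through the tests. startProcess and isRunning are placeholders, not real juju-core or mongo calls.]

```go
// Sketch: make an asynchronous start synchronous by polling until ready.
package example

import (
	"fmt"
	"time"
)

func startProcess() { /* kick off mongod (or similar) asynchronously */ }

func isRunning() bool { /* e.g. try to dial the server */ return true }

// StartSynchronously starts the process and waits until it answers,
// retrying in short attempts until the timeout expires.
func StartSynchronously(timeout time.Duration) error {
	startProcess()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if isRunning() {
			return nil
		}
		time.Sleep(100 * time.Millisecond) // short attempt, then ask again
	}
	return fmt.Errorf("process did not start within %v", timeout)
}
```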
<abentley> sinzui: I use -E to preserve the environment.
<sinzui> abentley, Doesn't work for me
<sinzui> GOPATH=~/Work/juju-core_1.16.4 PATH=~/Work/juju-core_1.16.4/bin:$PATH sudo -E juju --version
<sinzui> 1.16.3-trusty-amd64
<sinzui> abentley, but this works because I removed all of the historic PATH:
<sinzui> $ GOPATH=~/Work/juju-core_1.16.4 PATH=~/Work/juju-core_1.16.4/bin juju --version
<sinzui> 1.16.4-trusty-amd64
<abentley> sinzui: Weird.
<sinzui> yeah
<thumper> -E doesn't pass in PATH
<thumper> IIRC
<thumper> I use "sudo $(which juju) --version"
<thumper> mramm: I'm in the hangout... just hanging out
<bigjools> hello.  I have a dead machine but juju still thinks it's there.  How can I remove it?  terminate-service/destroy-machine all fail to work because they seem to want to contact the agent on the dead machine.
<davecheney> bigjools: i don't think you can at the moment
<bigjools> davecheney: so my env is fucked?
<davecheney> you could try building 1.17-trunk from source
<bigjools> ok
<davecheney> bigjools: we don't tell the customers their environment is fucked
<davecheney> but, yes
<bigjools> :)
<bigjools> I think that when you see a "hook failed" message it ought to suggest running "juju resolved"
<bigjools> someone who will remain nameless decided to use "nova delete" instead
<davecheney> that's a paddlin'
<wallyworld_> thumper: wrt https://codereview.appspot.com/35800044/, william had some issues. i've responded. i feel like there's value in this work. do we need to discuss?
#juju-dev 2013-12-03
<bigjools> error: invalid service name "tarmac-1.4"
<bigjools> yay
<davecheney> probably the dot
<davecheney> pretty much only safe to use a-z, 0-9 and hyphen
<thumper> wallyworld_: looking
<wallyworld_> k
<bigjools> davecheney: but why... :/
<bigjools> I tried with tarmac-14 and it still complained :(
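[Aside: a rough illustration of the rule davecheney is describing. Service names are lowercase letters, digits and hyphens, must start with a letter, and no hyphen-separated segment may consist only of digits -- which would explain why "tarmac-14" is rejected as well as "tarmac-1.4". The regexp below is a close approximation, not the authoritative pattern hatch found in juju-core.]

```go
// Approximate service-name check, for illustration only.
package main

import (
	"fmt"
	"regexp"
)

var validService = regexp.MustCompile(`^[a-z][a-z0-9]*(-[a-z0-9]*[a-z][a-z0-9]*)*$`)

func main() {
	for _, name := range []string{"tarmac-1.4", "tarmac-14", "tarmac14", "wordpress"} {
		fmt.Printf("%-14q valid=%v\n", name, validService.MatchString(name))
	}
}
```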
<wallyworld_> thumper: i didn't mention the listener because it is orthogonal to the management of keys in state. that aside, i'll continue with the current work then
<thumper> wallyworld_: I think it is worth the few days effort now to make it easier
<wallyworld_> me too
<thumper> the fact that it gives you a break to do something slightly different is a bonus
<wallyworld_> if we had the road map all sorted it would be easier to know exactly what to do next
<wallyworld_> thumper: i just ran into that @%#^%!&@!^& issue where different values of c are used in different phases of the tests. ffs
<thumper> how?
<thumper> storing c?
<wallyworld_> no, constructing a stub in the SetupTest, where the stub took c *gc.C as the arg and then called c.Assert
<wallyworld_> the c.Assert failed and the wrong goroutine was stopped
<wallyworld_> so the test said it passed
<thumper> heh
<thumper> oops
<wallyworld_> well, oops smoops
<wallyworld_> the same value of c should be used
<wallyworld_> also, our commands are hard to test sometimes
<wallyworld_> cause they live in a main package
<wallyworld_> so can't easily be mocked out from another package
<wallyworld_> without refactoring the Run()
<thumper> true that...
<thumper> school run
 * thumper is hanging out for synchronous bootstrap...
<thumper> I'm waiting for canonistack to come back saying it is up
<thumper> no idea what the state is right now
<thumper> all I know is that it has started an instance
<jam1> thumper: sinzui: We delivered a 1.16.4.1 to NEC, that I'd *really* like to be called an official 1.16.4 so we don't have to tilt our heads every time we look at it. Beyond that we're in a situation where we've got enough work queued up in the pipeline that we're afraid to release trunk, and I'd really like to get us out of that situation.
<thumper> jam: why afraid?
<thumper> OH FF!!!!@!
<thumper> loaded invalid environment configuration: required environment variable not set for credentials attribute: User
<thumper> that is on the server side, worked client side...
<thumper> GRR...
 * thumper cringes...
 * thumper needs to talk this through with someone
<thumper> axw, wallyworld_: either of you available to be a talking teddybear?
<axw> thumper: sure, brb
<wallyworld_> ok. i've never been a teddy bear before
<axw> back..
<thumper> https://plus.google.com/hangouts/_/76cpibimp3dvc1rugac12elqh8?hl=en
<thumper> mramm: ta
<mramm> you are most welcome
 * thumper downloads failing machine log
<thumper> for investigation
<axw> jam: are you happy for me to land my synchronous bootstrap branch after addressing fwereade's comments? do you have anything else to add?
<jam> axw: I haven't actually had a chance to look at it, though I never actually audited the patch, either. I just tested it out.
<axw> ok
<jam> so if someone like william has actually done code review, I'm fine with it landing
<jam> just don't count it as "I've looked closely and I'm happy with everything in there" :)
<jam> axw: if you have particular bits that you're more unsure of, you can point me to them, and I can give you directed feedback
<axw> jam: no problems
<jam> I'm just distracted with other things
<axw> ok
<axw> jam: I think it's ok, just wanted to make sure you didn't want to have a look before I went ahead
<fwereade> man, it creeps me out when my phone tells me I've sent a new review of something *before* the new page renders in the browser
<jam>  fwereade :)
<jam> morning
<fwereade> jam, heyhey
<fwereade> jam, so, about 1.16.4 et al... I quite like the idea of making our current projected 1.16.5 into 1.18, and getting that out asap, with a view to the rest of the CLI API landing in 1.20 as soon as it's ready
<fwereade> jam, drawbacks?
<jam> fwereade: well we do have actual bugfixes for 1.16 as well, and I'm concerned about future precedent (what happens the next time we have to do a stop-the-line fix for a client using an older version of juju) but whatever we pick today can work
<jam> fwereade: the fact that it has been ~1month since we did any kind of a release is surprising for me
<fwereade> jam, well, the issue here is that the "fix" was really more of a "feature"
<jam> and also makes me wonder if we've got all the balance points correct.
<jam> fwereade: necessary work for a stable series that took precedence over other feature work
<jam> fwereade: fwiw we did the same in 1.16.2 (maas-agent-name is terribly incompatible)
<fwereade> jam, ha, yes
<jam> 1.16.3 is a genuine bugfix, as it is a single line
<jam> fwereade: so a gut feeling is that we are being inconsistent about the size and meaning of an X+2 stable release
<jam> 1.16 vs 1.18 is going to be tiny, but 1.18 vs 1.20 is going to be massive
<jam> thats ok
<jam> but it is a drawback
<jam> fwereade: I do *like* the idea of stable being bugfixes only, but that assumes we have the features implemented that someone on a stable series actually needs
<fwereade> jam, well, it won't be any bigger than today's 1.16->1.18 already would be
<jam> fwereade: sure, but 1.12 -> 1.14 -> 1.16 -> ? were all similarly sized
<fwereade> jam, I do agree that's not nice in itself
<fwereade> jam, OTOH "size of change" is a bit subjective anyway and I'm not sure it's a great yardstick against which to measure minor version changes
<jam> fwereade: its very much a gut feeling about "hmmm, this seems strange" more than any sort of concrete thing. I'd rather have regular cadenced .X+2 releases (1/month?, 1/3months?) and then figure out the rest as we go
<jam> I *do* feel that 1.17 is building up a lot of steam pressure as we keep piling stuff into it and haven't done a release
<jam> as in, we're more likely to break lots of stuff because people haven't been able to give any incremental feedback
<fwereade> jam, yeah, I'd like to get a release off trunk soon
<jam> fwereade: at which point, if we had 1.17.0 released, and then we needed something for NEC, what would you prefer ?
<jam> it feels really odd to jump from 1.16.3 => 1.18 over the 1.17 series, though obviously we can do it
<jam> but 1.17 would then contain "all of" 1.18
<fwereade> jam, well the problem is that 1.17 (if it existed) would contain a bunch of things that aren't in 1.18
<fwereade> jam, so it's a good thing it doesn't ;)
<jam> fwereade: well 1.18 would contain lots of things not in 1.17 either
<jam> sorry
<fwereade> jam, ;p
<jam> 1.17 would contain lots of things not in 1.18
<jam> as in, they are modestly unrelated
<fwereade> jam, yeah
<jam> and we only didn't have 1.17.0 because the release didn't go out the week before
<jam> (probably because of test suite failures)
<jam> CI
<jam> I don't think I ever heard a "we can't release 1.17.0 because of X" from sinzui
<fwereade> jam, honestly, in general, I would prefer that specific client fixes be individual hotfixes based on what they were already running -- but, yes, in this case it didn't come out that way
<jam> fwereade: why would you rather have client-specific branches?
<jam> it seems far more maintenance than getting the actual code out as a marked & tagged release
<jam> fwereade: as in, NEC comes back and does "juju status" wtf version are they running ?
<fwereade> jam, because it minimizes the risk/disruption at the client end
<jam> if we give them a real release
<jam> then they have an obvious match
<jam> and the guy in #juju-core can reason about it
<fwereade> jam, that problem already exists because we fucked up and let --upload-tools out into general use
 * fwereade makes grumpy face
<jam> fwereade: well, we can tell it is at least a custom build, and what it is based upon
<fwereade> jam, well, not really
<fwereade> jam, eg they weren't using a custom build, but they did have a .build version, because --upload-tools
<jam> fwereade: sure, but we can tell that it is version X + epsilon (in this case epsilon = 0), that is still better than a pure hotfix that doesn't match any release at all
<fwereade> jam, ok, my preference is predicated on the idea that we would be able to bring clients back into the usual stream once we have a real release that includes the bits they need, and that is not necessarily realistic
<jam> fwereade: well it isn't realistic if we don't actually release their bits :)
<fwereade> jam, the trouble is that we have no information about epsilon
<fwereade> jam, ha
<jam> fwereade: and there is the "will NEC use 1.18" ? Only if forced ?
<fwereade> jam, yeah, that's the question
<fwereade> jam, I do agree that always having clients on officially released versions is optimal
<jam> fwereade: so we could have done, "we're going to need some big patches for them, bump our current 1.18, and prepare a new 1.18 based on 1.16"
<jam> a concern in the future is "what if we already have 1.18.0 out the door" because they are using an old stable
<jam> IIRC, IS is still using 1.14 in a lot of places
<jam> but  I *think* we got them off of 1.13
<jam> well, 1.13.2 from trunk before it was actually released, etc.
<jam> (I remember doing help with them and 'fixing' something that was in 1.13.2 but not in their build of it :)
<jam> fwereade: so also, "tools-url" lets you put whatever you want into "official" versions as well
<fwereade> jam, sure, tools-url lets you lie if you're so inclined -- but --upload-tools forces a lie on you whether you want it or not
<jam> fwereade: "want it or not" you did ask for it :)
<fwereade> jam, well, not so much... if you just want to upload the local jujud you don't *really* also want funky versions
<rogpeppe1> mornin' all
<wallyworld_> fwereade: when you are free i'd like a quick chat, maybe you can ping me when convenient. i may have to pop out for a bit at some point so if i don't respond i'm not ignoring you
<fwereade> wallyworld_, sure, 5 mins?
<wallyworld_> ok
<fwereade> rogpeppe1, heyhey, I think you have a couple of branches LGTMed and ready to land
<rogpeppe1> fwereade: ah, yes. i can think of at least one.
<rogpeppe1> fwereade: BTW i spent some time with Peter Waller yesterday afternoon trying to get his juju installation up again
<fwereade> rogpeppe1, ah, thank you
<rogpeppe1> fwereade: an interesting and illuminating exercise
 * fwereade peers closely at rogpeppe1, tries to determine degree of deadpan
<rogpeppe1> fwereade: gustavo's just added a feature to the txn package to help recover in this kind of situaton
<rogpeppe1> fwereade: actually said totally straight
<rogpeppe1> fwereade: we found lots of places that referred to transactions that did not exist
 * fwereade looks nervous
<rogpeppe1> fwereade: that's probably because we don't run in safe mode (we don't ensure an fsync before we assume a transaction has been successfully written)
<rogpeppe1> fwereade: so when we ran out of disk space, this kind of thing can happen
<fwereade> wtf, I could have sworn I double-checked that *months* ago
<rogpeppe1> fwereade: we should probably run at least the transaction operations safely
<fwereade> rogpeppe1, surely we should run safely full stop
<rogpeppe1> fwereade: probably
<rogpeppe1> fwereade: it'll be interesting to see how much it slows things down
<fwereade> rogpeppe, indeed
<rogpeppe> fwereade: yeah, i don't see any calls to SetSafe in the code
<jam> rogpeppe: any understanding of why it isn't the default ?
<rogpeppe> jam: i'm just trying to work out what the default actually is
<jam> rogpeppe: you can use go and call Session.Safe()
<rogpeppe> jam: yeah, i'm just doing that
<jam> rogpeppe: "session.SetMode(Strong, true)" is called in DialWithInfo
<jam> but I don't think that changes Safe
<rogpeppe> jam: i think that's orthogonal to the safety mode
<rogpeppe> jam: the safety mode we use is the zero one
<jam> newSession calls "SetSafe(&Safe{})"
<jam> but what does that actually trigger by default?
<jam> rogpeppe: "if safe.WMode == "" => w = safe.WMode"
<rogpeppe> jam: huh?
<jam> rogpeppe: mgo session.go, ensureSafe() when creating a new session
<jam> it starts with "Safe{}" mode
<jam> then...
<jam> I'm not 100% sure
<jam> I thought gustavo claimed it was Write mode safe by default
<jam> but I'm not quite seeing that
<rogpeppe> jam: ah, you mean safe.WMode != ""
<jam> rogpeppe: oddly enough, SetSafe calls ensureSafe which means you can never decrease the safe mode from the existing value
<rogpeppe> jam: i'm not sure
<rogpeppe> jam: i think that safeOp might make a difference there
<jam> rogpeppe: so, we can set it the first time, but it looks like if we've ever set safe before (which happens when you call newSession automatically) then it gets set to the exact value
<jam> but after that point
<jam> it *might* be that the safe value must be greater than the existing one
<jam> the actual logic is hard for me to sort out
<rogpeppe> jam: yeah, i think it does - the comparison logic is only triggered if safeOp is non-nil
<jam> rogpeppe: sure, comparison only if safeOp is non nil, but mgo calls SetSafe as part of setting up the session
<rogpeppe> jam: SetSafe sets safeOp to nil before calling ensureSafe
<jam> which means by the time a *user* gets to it
<jam> ah
<jam> k
<jam> I missed that, fair point
<jam> rogpeppe: so by default mgo does read back getLastError
<jam> (as SetSafe(Safe{}) at least does the getLastError check)
<jam> however, it doesn't actually set W or WMode
<rogpeppe> jam: yeah, i think we should have WMode: "majority"
<rogpeppe> jam: and FSync: true
<jam> rogpeppe: inconsequential until we have HA, but I agree
<rogpeppe> jam: the latter is still important
<jam> rogpeppe: I would say if we have WMode: majority we may not need fsync, it depends on the catastrophic failure you have to worry about.
<rogpeppe> jam: i'm not sure.
<rogpeppe> jam: it depends if we replicate the logs to the HA servers too
<rogpeppe> jam: if we do, then all are equally likely to run out of disk space at the same time
<rogpeppe> jam: we really need a better story for that situation anyway though
<jam> rogpeppe: running out of disk space should be something we address, but I don't think it relates to FSync concern for us, does it?
<rogpeppe> jam: if you don't fsync, you don't know when you've run out of disk space
<rogpeppe> jam: so for instance you can have a transaction that gets added and then set on the relevant document, but the transaction can be lost
<rogpeppe> jam: which is the situation that seems to have happened in peter's case
<jam> rogpeppe: but we only ran out of disk space because of the log exploding, right? the mgo database doesn't grow all that much on its own
<jam> so if we fix logs, then we'll avoid that side of it
<rogpeppe> jam: well...
<rogpeppe> jam: there was another interesting thing that happened yesterday, and i have no current explanation
<rogpeppe> jam: the mongo database had been moved onto another EBS device (one with more space)
<rogpeppe> jam: with a symlink from /var/lib/juju to that
<rogpeppe> jam: when we restarted mongo, it started writing data very quickly
<rogpeppe> jam: and the size grew from ~3GB to ~14GB in a few minutes
<rogpeppe> jam: before we stopped it
<rogpeppe> jam: we fixed it by doing a mongodump/mongorestore
<rogpeppe> jam: (the amount of data when dumped was only 70MB)
<jam> rogpeppe: I have a strong feeling you were already in an error state that was running out of control (for whatever reason). My 10k node setup was on the order of 600MB, IIRC
<rogpeppe> jam: quite possibly. i've no idea what kind of error state would cause that though
<jam> rogpeppe: mgo txns that don't exist causing jujud to try to fix something that isn't broken over and over ?
<jam> I don't really know either, I'm surprised dump restore fixed it
<jam> as that sounds like its a bug in mongo
<rogpeppe> jam: i'm pretty sure it wasn't anything to do with juju itself.
<rogpeppe> jam: i'm not sure it *could* grow the transaction collection that fast
<rogpeppe> jam: it's possible that it's some kind of mongo bug
<jam> rogpeppe: so I think I would set the write concern to at least 1, and the Journal value to True, rather than FSync.
<rogpeppe> jam: what are the implications of the J value? the comment is somewhat obscure to me.
<jam> rogpeppe: "write the data to the journal" vs "fsync the whole db"
<jam> rogpeppe: http://docs.mongodb.org/manual/reference/command/getLastError/#dbcmd.getLastError
<jam> "In most cases, use the j option to ensure durability..." as the doc under "fsync"
 * TheMue just uses the synchronous bootstrap the first time. feels better. but the bootstrap help text still tells it's asynchronous
<jam> rogpeppe: but I'd go for a patch that after Dial immediately calls EnsureSafe(Safe{"majority", J=True})
<fwereade> TheMue, well spotted, would you quickly fix it please? ;p
<fwereade> jam, +1
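[Aside: roughly what jam's suggestion would look like in code. mgo's Session.EnsureSafe and the Safe struct fields WMode and J are real API; the import path shown is the labix.org one in use at the time, and dialSafely is an invented wrapper for the example.]

```go
// Sketch: raise the session's safety level right after dialing, so writes
// wait for a majority ack and a journal commit.
package example

import (
	"labix.org/v2/mgo"
)

func dialSafely(addr string) (*mgo.Session, error) {
	session, err := mgo.Dial(addr)
	if err != nil {
		return nil, err
	}
	// EnsureSafe can only raise the safety level, which is what we want:
	// getLastError with w="majority" and j=true on every write.
	session.EnsureSafe(&mgo.Safe{WMode: "majority", J: true})
	return session, nil
}
```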
<TheMue> fwereade: yep, will do
<fwereade> TheMue, <3
<TheMue> but I need a sed specialist ;) how do I prefix all lines of a file with a given string?
<jam> TheMue: well you could do: "sed -i.back -e 's/\(.*\)/STUFF\1/'"
<jam> but I'm not sure if sed is the best fit for it
<TheMue> jam: thx. I'm also open for other ideas, otherwise I'll use my editor
<jam> TheMue: if you have tons of stuff, sed is fine for it, with vim gg^VG ^ISTUFF<ESC>
<jam> (go top, block insert, Go bottom, Insert all, write STUFF, ESC to finish)
<TheMue> cool
<TheMue> will try after proposal
<TheMue> fwereade: dunno if my english is good enough: https://codereview.appspot.com/36520043/
<fwereade> jam, TheMue, standup
<wallyworld_> jam: fwereade: i'm back if you wanted to discuss the auth keys plugin. or not. ping me if you do.
<mgz> wallyworld_: you could rejoin hangout
<natefinch> wallyworld_: we're still in the hangout if you want to pop back in
 * mgz wins!
 * natefinch is too slow
<natefinch> :)
 * dimitern lunch
<TheMue> Anyone interested in reviewing my Tailer, the first component for the debug log API: https://codereview.appspot.com/36540043
<TheMue> It's intended to do a filtered tailing of any ReadSeeker (like a File) into a Writer
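[Aside: a very rough sketch of the filtering part of such a tailer -- copy lines from an io.ReadSeeker to an io.Writer when they pass a filter. This is not TheMue's implementation (which also follows the file as it grows and seeks back from the end); it only shows the general shape.]

```go
// Sketch: filtered copy of lines from a ReadSeeker to a Writer.
package example

import (
	"bufio"
	"io"
)

// TailFiltered copies every line for which filter returns true from r to w.
func TailFiltered(r io.ReadSeeker, w io.Writer, filter func(line []byte) bool) error {
	if _, err := r.Seek(0, 0); err != nil { // a real tailer would start near the end
		return err
	}
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := scanner.Bytes()
		if filter(line) {
			if _, err := w.Write(append(line, '\n')); err != nil {
				return err
			}
		}
	}
	return scanner.Err()
}
```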
* ChanServ changed the topic of #juju-dev to: https://juju.ubuntu.com | On-call reviewer: see calendar | Bugs: 8 Critical, 240 High - https://bugs.launchpad.net/juju-core/
<rogpeppe> dimitern: ping
<TheMue> rogpeppe: quick look on https://codereview.appspot.com/36540043 ?
<rogpeppe> TheMue: will do
<TheMue> rogpeppe: thanks
<dimitern> rogpeppe, pong
<rogpeppe> dimitern: i'm wondering about the upgrade-juju behaviour
<rogpeppe> dimitern: in particular: when was the checking for version consistency introduced?
<dimitern> rogpeppe, yeah?
<dimitern> rogpeppe, recently
<rogpeppe> dimitern: after 1.16?
<dimitern> rogpeppe, yes
<rogpeppe> dimitern: the other thing is: does it ignore dead machines when it's checking?
<dimitern> rogpeppe, take a look at SetEnvironAgentVersion
<dimitern> rogpeppe, it just checks tools versions, not the life
<rogpeppe> dimitern: hmm, i think that's probably wrong then
<rogpeppe> dimitern: if an agent is dead, i think we probably don't care about its version
<dimitern> rogpeppe, perhaps we can unset agent version from dead machines anyway
<rogpeppe> dimitern: but it's a good thing it's not released yet, because that logic won't prevent peter waller from upgrading his environment currently
<rogpeppe> dimitern: i don't think that's necessary
<rogpeppe> dimitern: i think setting life should set life only
<rogpeppe> dimitern: and it's just possible that the agent version info could be useful to someone, somewhere, i guess
<dimitern> rogpeppe, well, it's not just that actually
<dimitern> rogpeppe, upgrade-juju does the check of version constraints before trying to change it
<dimitern> rogpeppe, so in fact it will have helped in peter's case not to upgrade to more than 1.14
<rogpeppe> dimitern: what do mean by the version constraints?
<dimitern> rogpeppe, "next stable" logic
<dimitern> rogpeppe, (or current, failing that)
<jamespage> sinzui, hey - about to start working on 1.16.4 - I see some discussion on my observation about whether this is really a stable release - what's the outcome? what do I need todo now?
<sinzui> jamespage, I was just writing to the list to summarise jam's argument
<sinzui> jamespage, from the dev's perspective this is a stable release because it addresses issues with how juju is currently used. Some papercuts are improvements/features, but they are always 100% compatible.
<jamespage> sinzui, we are going to struggle with a minor release exception if that is the case
<sinzui> jamespage, devel and minor version increments are big features and version incompatibilities
 * TheMue has to step out for his office appointment, will return later
<jamespage> sinzui, I'll discuss with a few SRU people
<dimitern> rogpeppe, and anyway the case you're describing is very unusual - dead machines with inconsistent agent versions - that never would've happened if the usual upgrade process is followed
<jamespage> sinzui, I did look at the changes - I think the plugin I could probably swing with as its isolated from the rest of the codebase
<jamespage> sinzui, the provisioner safe-mode feels less SRU'able
<rogpeppe> dimitern: why not?
<sinzui> jamespage, I am keen to do a release, I can make this 1.18.0 in a couple of hours. The devs are a little more reticent.
<dimitern> rogpeppe, well, unless you force it ofc
<dimitern> rogpeppe, due to the version constraints checks
<rogpeppe> dimitern: the machines have been around for a long time - their instances were manually destroyed from the aws console AFAIK
<jamespage> sinzui, fwiw I'm trying to SRU all 1.16.x point releases to saucy as evidence that juju-core is ready for a MRE for trusty
<rogpeppe> dimitern: those checks aren't in 1.16 though, right?
<dimitern> rogpeppe, no
<rogpeppe> dimitern: "no they aren't" or "no that's wrong" ?
<dimitern> rogpeppe, sorry, no they aren't
<rogpeppe> dimitern: ok, cool
<sinzui> jamespage, That is admirable. If the devs were producing smaller features to release a stable each month, would that cause pain?
<rogpeppe> dimitern: it might be a bit of a problem that one broken machine can prevent upgrade of a whole environment, but... can we manually override by specifying --version ?
 * sinzui thinks enterprise customers get juju from the location CTS points to, so rapid increments are always fine
 * jamespage thinks about sinzui's suggestion
<mgz> jamespage: the extra plugin was actually trying to do a bug fix in a non-intrusive way... unfortunately that does mean packaging changes instead which isn't really what you want for a minor version
<jamespage> mgz, I guess my query is about whether a feature that allows you to backup/restore a juju environment should be landing on a stable release branch
<jamespage> mgz, (I appreciate the way the plugin was done does isolate it from the rest of the codebase - which avoids regression potentials)
<dimitern> my main fuse tripped and it trips again when I turn it back on, unless I stop one of the other ones, so now I have no power on any outlet in the living room and had to do some trickery to get it to work from the bedroom :/
<jamespage> mgz, hey - any plans on bug 1241674
<_mup_> Bug #1241674: juju-core broken with OpenStack Havana for tenants with multiple networks <cts-cloud-review> <openstack-provider> <juju-core:Triaged> <https://launchpad.net/bugs/1241674>
<jamespage> its what I get most frequently asked about these days
<mgz> jamespage: yeah, I should post summary to that bug
<mgz> so then those people who ask have something to read
<jamespage> please do
<fwereade> sounds like dimitern is having persistent power problems and we might not see him again today
<rogpeppe> fwereade: ping
<fwereade> rogpeppe, pong
<rogpeppe> fwereade: would you be free for a little bit
<fwereade> rogpeppe, maybe, but I'll have to drop off to talk to green in max 20 mins
<rogpeppe> fwereade: that would be fine
<fwereade> rogpeppe, consider me at your service then
<rogpeppe> fwereade: https://plus.google.com/hangouts/_/calendar/am9obi5tZWluZWxAY2Fub25pY2FsLmNvbQ.mf0d8r5pfb44m16v9b2n5i29ig?authuser=1
<rogpeppe> fwereade: (with peter waller)
<sinzui> jamespage, I replied to juju 1.16.4 conversation on the list. I think you may want to correct or elaborate on what I wrote
<rogpeppe> this is really odd
<rogpeppe> niemeyer: ping
<mgz> rogpeppe: I refreshed a branch you already reviewed for the update-bootstrap tweaks btw
<rogpeppe> mgz: ok, will have a look
<rogpeppe> mgz: am currently still trying to sort out this broken environment
<rogpeppe> mgz: have you looked at mgo/txn at all, BTW?
<mgz> alas no :)
<mgz> just enough to add some operations to state
<mgz> didn't try and understand how it was actually working
<jam> jamespage: unfortunately I cleared my traceback a bit, but I will say the "provisioner-safe-mode" is like *the key* bit that NEC actually needs, the rest is automation around stuff they can do manually.
<jam> jamespage: is there a reason cloud-archive:tools is still reporting 1.16.0?
<jam> sinzui: ^^
<jamespage> jam: yes the SRU only just went into saucy - its waiting for processing in the cloud-tools staging PPA right now
<jamespage> along with a few other things
<jam> jamespage: k, jpds is having a problem with keyserver stuff and that is fixed in 1.16.2
<sinzui> jamespage, is there anything I should be doing to speed that up?
<jamespage> I'll poke smoser for review
<jam> jamespage: "the SRU" of which version?
<jamespage> 1.16.3
<rogpeppe> jam: any idea what might be going on here? http://paste.ubuntu.com/6515237/
<jam> jamespage: great
<rogpeppe> jam: this is on the broken environment i mentioned in the standup
<jam> rogpeppe: context?
<jam> thx
<rogpeppe> jam: note all the calls to txn.flusher.recurse
<rogpeppe> jam: i *think* that indicates something's broken with transactions (which wouldn't actually be too surprising in this case)
<jam> rogpeppe: the 'active' frame is the top one, right?
<rogpeppe> jam: yes
<niemeyer> rogpeppe: Heya
<niemeyer> rogpeppe: So, problem sovled?
<rogpeppe> niemeyer: i'm not sure it is, unfortunately
<niemeyer> rogpeppe: Haven't seen any replies since you've mailed him about it
<rogpeppe> niemeyer: i've been working with him to try and bring things up again.
<rogpeppe> niemeyer: i *thought* it was all pretty much working,
<rogpeppe> niemeyer: but there appears to be something still up with the transaction queues
<rogpeppe> niemeyer: this is the stack trace i'm seeing on the machine agent: http://paste.ubuntu.com/6515237/
<rogpeppe> niemeyer: note the many calls to the recurse method
<rogpeppe> niemeyer: it seems that nothing is actually making any progress
<niemeyer> rogpeppe: Seems to be trying to apply transactions
<rogpeppe> niemeyer: it does, but none seem to be actually being applied
<niemeyer> rogpeppe: That's a side effect of having missing transactions
<niemeyer> rogpeppe: Missing transaction documents, that is
<niemeyer> rogpeppe: It'll refuse to make progress because the system was corrupted
<rogpeppe> niemeyer: i thought the PurgeMissing call was supposed to deal with that
<niemeyer> rogpeppe: So it cannot make reasonable progress
<niemeyer> rogpeppe: Yes, it is
<rogpeppe> niemeyer: so, we did that and it seemed to succeed
<niemeyer> rogpeppe: Did it clean everything up?
<niemeyer> rogpeppe: on the rigth database, etc
<rogpeppe> niemeyer: yes, i believe so - it made a lot more progress (no errors about missing transactions any more)
<niemeyer> rogpeppe: Tell him to kill the transaction logs completely, run purge-txns again
<rogpeppe> niemeyer: ok
<niemeyer> rogpeppe: Drop both txns and txns.log
<niemeyer> rogpeppe: and txns.stash
<rogpeppe> niemeyer: ok, trying that
<niemeyer> rogpeppe: After that, purge-txns will cause a full cleanup
<rogpeppe> niemeyer: when you say "drop", is that a specific call, or is it just something like db.txns.remove(nil)
<rogpeppe> ?
<niemeyer> > db.test.drop()
<niemeyer> true
<niemeyer> >
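For reference, a minimal Go sketch of the same cleanup done through mgo rather than the mongo shell, assuming a local state server, an illustrative database name, and the mgo/txn PurgeMissing helper being referred to above; collection names for the purge are made up for the example.

    package main

    import (
        "log"

        "labix.org/v2/mgo"
        "labix.org/v2/mgo/txn"
    )

    func main() {
        // Dial address and database name are assumptions for the sketch.
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()
        db := session.DB("juju")

        // Driver equivalent of db.<name>.drop() in the mongo shell.
        for _, name := range []string{"txns", "txns.log", "txns.stash"} {
            if err := db.C(name).DropCollection(); err != nil {
                log.Printf("dropping %s: %v", name, err)
            }
        }

        // With the transaction collections gone, purge references to the
        // now-missing transactions from the data collections.
        runner := txn.NewRunner(db.C("txns"))
        if err := runner.PurgeMissing("machines", "services", "units"); err != nil {
            log.Fatal(err)
        }
    }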
<niemeyer> rogpeppe: Be mindful.. there is no protection against doing major damage
<rogpeppe> niemeyer: i am aware of that
<rogpeppe> niemeyer: there is a backup though
<niemeyer> rogpeppe: Yeah, I'm actually curious about one thing:
<niemeyer> rogpeppe: the db dump I got.. was that the backup, or was that the one being executed live?
<rogpeppe> niemeyer: that was a backup made at my instigation
<sinzui> Bug #1257371 is a regression that breaks bootstrapping on aws and canonistack
<_mup_> Bug #1257371: bootstrap fails because Permission denied (publickey) <bootstrap> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1257371>
<rogpeppe> niemeyer: i.e. after the problems had started to occur
<rogpeppe> e
<rogpeppe> r
<niemeyer> rogpeppe: Right, I'm pretty sure trying to run the system on that state would great quite a bit of churn in the database
<niemeyer> rogpeppe: s/great/create/
<niemeyer> rogpeppe: Depending on the retry strategies...
<niemeyer> rogpeppe: This might explain why the database was growing
<niemeyer> rogpeppe: and might also explain why the system is in that state you see now
<rogpeppe> niemeyer: ok. let's hope this strategy works then
<rogpeppe> niemeyer: just about to drop. wish me luck :-)
<niemeyer> rogpeppe: The transactions may all be fine now.. but if you put a massive number of runners trying to finalize a massive number of pending and dependent transactions at once, it won't be great
<niemeyer> rogpeppe: The traceback you pasted seems to corroborate with that theory too
<rogpeppe> niemeyer: collections dropped
<rogpeppe> niemeyer: it's currently purged >10000 transactions
<niemeyer> rogpeppe: There you go..
<niemeyer> rogpeppe: No wonder it was stuck
<rogpeppe> niemeyer: it's still going...
<niemeyer> rogpeppe: That's definitely not the database I have here, by the way
<niemeyer> rogpeppe: I did check the magnitude of proper transactions to be applied
<rogpeppe> niemeyer: indeed not - i think they've all been started since this morning
<rogpeppe> niemeyer: there were only a page or so this morning
<niemeyer> rogpeppe: Well, a page of missing
<niemeyer> rogpeppe: The problem now is a different one
<rogpeppe> niemeyer: ah yes
<niemeyer> rogpeppe: These are not missing or bad transactions
<niemeyer> rogpeppe: They're perfectly good transactions that have been attempted continuously and in parallel, but unable to be applied because the system was wedged with a few transactions that were lost
<niemeyer> rogpeppe: Then, once the system was restored to a good state, there was that massive amount of pending transactions to be applied.. and due to how juju is trying to do stuff from several fronts, there was an attempt to flush the queues concurrently
<niemeyer> rogpeppe: Not great
<niemeyer> rogpeppe: At the same time, a good sign that the txn package did hold the mess back instead of creating havoc
<rogpeppe> niemeyer: yeah
<rogpeppe> niemeyer: 34500 now
<niemeyer> rogpeppe: Gosh
<niemeyer> rogpeppe: How come it was running for so long?
<niemeyer> rogpeppe: What happens when juju panics?  I guess we have upstart scripts that put it back alive?
<rogpeppe> niemeyer: it *should* all be ok
<rogpeppe> niemeyer: the main problem with panics is that when they recur continually, the logs fill up
<rogpeppe> niemeyer: and that was the indirect cause of what we're seeing now
<niemeyer> rogpeppe: Well, that's not the only problem.. :)
<rogpeppe> niemeyer: indeed
<rogpeppe> niemeyer: 5 whys
<niemeyer> rogpeppe: "OMG, things are broken! Fix it!" => "Try it again!" => "OMG, things are broken! Fix it!" => "Once more!" => .....
<niemeyer> rogpeppe: That's how we end up with tens of thousands of pending transactions :)
<rogpeppe> niemeyer: well to be fair, we only applied one fix today
<niemeyer> rogpeppe: Hmm.. how do you mean?
<rogpeppe> niemeyer: we ran PurgeMissing
<niemeyer> rogpeppe: Sorry, I'm missing the context
<niemeyer> rogpeppe: I don't get the hook of "to be fair"
<rogpeppe> niemeyer: ah, i thought you were talking about human intervention
<rogpeppe> niemeyer: but perhaps you're talking about what the agents were doing
<niemeyer> rogpeppe: No, I'm talking about the fact the system loops continuously doing more damage when we explicitly say in code that we cannot continue
<rogpeppe> niemeyer: right
<rogpeppe> niemeyer: it's an interesting question as to what's the best approach there
<rogpeppe> niemeyer: i definitely think that some kind of backoff or retry limit would be good
<niemeyer> rogpeppe: Yeah, I think we should enable that in our upstart scripts
<niemeyer> rogpeppe: This is a well known practice, even in systems that take the fail-and-restart approach to heart
<niemeyer> rogpeppe: (e.g. erlang)
<niemeyer> rogpeppe: (or, Erlang OTP, more correctly)
<rogpeppe> niemeyer: yeah
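A rough Go sketch of the bounded retry/backoff idea being discussed (the actual change proposed would live in the upstart job, but the shape is the same): cap the number of attempts and grow the delay between them, rather than restarting a wedged agent in a tight loop.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // startWithBackoff retries start up to attempts times, doubling the delay
    // between tries, instead of hammering a broken system forever.
    func startWithBackoff(start func() error, attempts int, delay time.Duration) error {
        for i := 1; i <= attempts; i++ {
            err := start()
            if err == nil {
                return nil
            }
            fmt.Printf("attempt %d failed: %v; retrying in %s\n", i, err, delay)
            time.Sleep(delay)
            delay *= 2
        }
        return errors.New("giving up after repeated failures")
    }

    func main() {
        failing := func() error { return errors.New("agent panicked") }
        fmt.Println(startWithBackoff(failing, 3, time.Second))
    }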
<rogpeppe> niemeyer: hmm, 70000 transactions purged so far. i'm really quite surprised there are that many
<niemeyer> rogpeppe: Depending on how far that goes, it might be wise to start from that backup instead of that crippled live system
<rogpeppe> niemeyer: latest is that it has probably fixed the problem
<rogpeppe> niemeyer: except...
<rogpeppe> niemeyer: that now amazon has rate-limited the requests because we'd restarted too often (probably)
<rogpeppe> niemeyer: so hopefully that will have resolved by the morning
<niemeyer> rogpeppe: Gosh..
<rogpeppe> niemeyer: lots of instance id requests because they've got a substantial number of machines in the environment which are dead (with missing instances)
<rogpeppe> niemeyer: and if we get a missing instance, we retry because amazon might be lying due to eventual consistency
<rogpeppe> niemeyer: so we make more requests than we should
<niemeyer> rogpeppe: Right
 * rogpeppe is done for the day
<rogpeppe> g'night all
<smoser> hey.
<smoser> before i write an email to juju-dev
<smoser> can someone tell me real quick if there is some plan (or existing path) that a charm can indicate that it can or cannot run in a lxc container
<smoser> and if so, any modules that it might need access to or devices (or kernel version or such)
<natefinch> smoser: I don't think we have any such thing today.... I don't know of a plan to include such a thing.
<smoser> thanks.
<thumper> morning
<thumper> also, WTF?
<thumper> anyone got a working environment up?
<thumper> I get: ERROR <nil> when I go 'juju add-machine'
<thumper> anyone else confirm?
<natefinch> doh
<natefinch> thumper: lemme give it a try, half a sec, need to switch to trunk
<thumper> kk
<thumper> oh, and yay
<thumper> with the kvm local provider I can create nested kvm
 * thumper wants to try lxc in kvm in kvm
<thumper> heh...
<natefinch> just keep nesting until something breaks
<thumper> also means I can fix the kvm provisioner code without needing to start canonistack
<natefinch> awesome
<thumper> natefinch: I've heard from robie that three deep causes problems
<thumper> but I've not tested
<thumper> also, memory probably an issue...
<thumper> the outer kvm would need more ram for the inner kvm to work properly
<natefinch> where's your sense of adventure?
<thumper> but that too would allow me to test the hardware characteristics
 * thumper has 16 gig of ram
<thumper> lets do this
<natefinch> :D
<thumper> after I've fixed the bug that is...
<thumper> kvm container provisioner is panicking
<natefinch> no one on warthogs wants to talk about google compute engine evidently...
<thumper> heh
 * thumper goes to write a stack trace function for loggo
<natefinch> some day juju status will return
<natefinch> and then I can try add machine
<natefinch> thumper: add machine works for me on trunk/ec2
<thumper> no error?
<natefinch> correct
<thumper> it could well be linked to the kvm stuff
<thumper> ta, I'll keep digging
<natefinch> welcome
<hazinhell> natefinch, what about it?
<hazinhell> gce that is
<thumper> I was wondering why my container in a container was taking so long to start
<thumper> it seems the host is downloading the cloud image
<thumper> what we really want is a squid cache on the host machine
<thumper> who knows squit?
<thumper> squid
<natefinch> ......crickets
 * thumper hangs his head
<thumper> damn networking
<thumper> so, this kinda works...
 * thumper wonders where the "br0" is coming from...
 * thumper thinks...
<thumper> ah
<thumper> DefaultKVMBridge
 * thumper tweaks the local provider to make eth0 bridged
 * thumper wonders how crazy this is getting
<natefinch> thumper: I can't see br0 without thinking "You mad bro?"
<thumper> heh
<natefinch> and usually, if I'm looking at br0, I'm mad
<thumper> :)
<hazinhell> thumper, varnish ftw ;-)
<hazinhell> thumper, are we setting up lxc on a different bridge than kvm?
<thumper> hazinhell: varnish?
<thumper> hazinhell: well, lxc defaults to lxcbr0 and kvm to virbr0
<thumper> the config wasn't setting one
<thumper> and for a container inside the local provider we need to have bridged eth0
<hazinhell> thumper, varnish over squid for proxy.
<thumper> hazinhell: docs?
<hazinhell> thumper, varnish-cache.org.. but if you're using one of the apt proxies, afaik only squid is setup for that
<thumper> hazinhell: what I wanted was a local cache of the lxc ubuntu-cloud image and the kvm one
<thumper> to make creating container locally faster
<thumper> as a new kvm instance needs to sync the images
<thumper> to start an lxc or kvm container
<hazinhell> thumper, lxc already caches
<thumper> hazinhell: not for this use case
<thumper> because it is a new machine
<thumper> hazinhell: consider this ...
<thumper> laptop host
<thumper> has both kvm and lxc images cached
<thumper> boot up kvm local provider
<thumper> start a machine
<thumper> uses cache
<thumper> then go "juju add-machine kvm:1"
<thumper> machine 1, the new kvm instance, then syncs the kvm image
<thumper> this goes to the internet to get it
<thumper> I want a cache on the host
<thumper> similarly if the new machine 1 wants an lxc image
<hazinhell> ah.. nesting with cache access
<thumper> it goes to the internet to sync image
<thumper> ack
<thumper> so squid cache on the host to make it faster
<hazinhell> thumper, what about mount the host cache over
<thumper> for new machines starting containers
<hazinhell> thumper, read mount
<thumper> sounds crazy :)
<hazinhell> it does.. you need some supervision tree to share the read mounts down the hierarchy
<hazinhell> s/supervision/
<hazinhell> thumper,
<hazinhell> thumper, you could just do the host object storage (provider storage) and link the cache into that
<thumper> surely a cache on the host would be less crazy
<hazinhell> thumper, the host already has the cache, a mount of that directly into the guests, allows all the default tools to see it without any extra work on juju's part
<hazinhell> doing a network endpoint, means you have to interject some juju logic to pull from that endpoint into the local disk cache
<hazinhell> and you end up with wasted space
<hazinhell> its kinda of a shame we can't use the same for both..
<hazinhell> ie lxc is a rootfs and kvm is basically a disk image.
<hazinhell> hmm
<hazinhell> sadly can't quite loop dev the img and mount it into the cache, lxc wants a tarball there, would have to set it up as a container rootfs.
<thumper> hmm...
<thumper> hmm...
<hazinhell> thumper, read mount sound good?
<hazinhell> thumper, or something else come to mind?
<thumper> hazinhell: busy fixing the basics at the moment
<hazinhell> ack
 * hazinhell returns to hell
<thumper> anyone have an idea why my kvm machine doesn't have the networking service running?
 * thumper steps back a bit
#juju-dev 2013-12-04
<wallyworld__> thumper: yo
<thumper> wallyworld__: hey
<thumper> wallyworld__: how's it going today?
<wallyworld__> thumper: notice how i didn't ping :-)
<wallyworld__> going ok, deep into some refactoring
<thumper> wallyworld__: as you may have seen, I've put the kvm broker review up
<thumper> wallyworld__: yeah...
<wallyworld__> thumper: funny that, i have a question
<thumper> wallyworld__: I'm now getting containers to return hardware characteristics
<wallyworld__> quick hangout?
<wallyworld__> https://plus.google.com/hangouts/_/76cpj9mtcok4cua6di82i0o3ms?hl=en
 * thumper joins
<thumper> wallyworld__: https://codereview.appspot.com/36980043/
<wallyworld__> \o/
<thumper> wallyworld__: it's a start
<wallyworld__> yes
<thumper> thanks for the review
<thumper> I'll tweak and land tomorrow morning
<thumper> night all
<axw> gtg to my daughter's school orientation, bbl
<axw> back
<rogpeppe> mornin' all
<axw> morning
<jam> smoser: there was a discussion about allowing charms to add constraints (like mem, etc), I could see that being extended to support stuff like "!lxc". However, that probably isn't on the roadmap for this cycle.
<axw> jam: I forget, do we need to support old CLI with new server?
<axw> jam: just wondering if I can remove secrets pushing API
<axw> server-side
<jam> axw: the old CLI didn't push secrets via PI
<jam> API
<jam> so yes, but we don't have to keep that bit
<jam> axw: we have to allow the old CLI direct DB access
<axw> jam: ah, it's only on trunk isn't it?
<axw> no.. I broke the last release
<jam> axw: ? you broke the new CLI connecting to the old server
<jam> IIRC
<jam> but we can drop that bit
<jam> right now we are trying and if it fails just continuing
<axw> ok
<jam> with synchronous bootstrap, we don't even have to try anymore :)
<axw> jam: well, we should still push secrets for old servers, right?
<axw> or do we not care?
<axw> assume they're already set up?
<jam> axw: even if there was a 1.17.0 that had async bootstrap and we were pushing via the API as a dev release we don't have to push secrets to it in 1.17.1
<jam> axw: we don't care if it is only a dev release
<jam> thats why we call them *dev*
<axw> jam: sorry, I mean, can we drop the code that pushes secrets to existing installations
<jam> it is our way of "lets get this out there, without committing to supporting migration to/from it"
<axw> of non-dev
<jam> axw: so for pushing secrets directly to the DB, I'm not sure if we can drop it.
<jam> "maybe"
<jam> we had talked about "if you bootstrap with 1.16 and then *never do anything with it* and then try to connect with 1.17" that might be broken, but we're not sure we care
<axw> I really don't think we should
<jam> (you can destroy-env & rebootstrap because you have nothing in your env, or you can connect with the 1.16 that you bootstrapped, etc)
<axw> indeed
<jam> axw: so *I* wouldn't immediately say "when connecting via direct DB access don't pass secrets"
<jam> axw: it isn't worth poking that code just to remove it if we don't have to touch it at all
<jam> axw: I'd much rather focus on "we never connect to the DB for a 1.18 client and server"
<jam> so that instead of having a "and now we don't push secrets" we end up with a "and now we never connect to the DB"
<jam> axw: so *probably* we could drop it, but I'd rather get to the point where we can drop NewConnFromName completely
<axw> understood
<axw> hmm
<axw> jam: the reason why I'd like to remove it altogether, is then we can get rid of the idea of secrets
<axw> but I can leave it for now I guess
<jam> axw: we can't. we still have to not put the secrets into cloud-init, and only pass them once we connect to the bootstrap node
<jam> axw: the reason we have them today is to not put them into cloud-init
<axw> jam: that bit's irrelevant, we never put any config into cloud-init anymore
<axw> (with synchronous bootstrap)
<axw> cloud-init now does ssh keys, and that's it
<jam> axw: so IMO, changing that isn't our first priority right now. I do like that, but I'd be fine if getting rid of the notion of secrets was in 1.20
<axw> mmkay
<jam> axw: vs, we actually have to finish stuff like upgrade-charm and status or we can't remove direct DB access in 1.20
<jam> axw: and, of course, time spent with sinzui to make sure CI is happy is time very well spent.
<axw> jam: indeed. not sure what's going on there :/
<axw> I can't reproduce the issues on either canonistack or ec2
<jam> axw: I haven't been able to either. there was a comment about connection flakiness for them. Certainly how did we get a machine up and running and then not have SSH configured.
<jam> axw: one thing I was wondering
<jam> how do we pick the ssh private key to connect with?
<jam> If I set "authorized-keys: foobar" in my environments.yaml how do you match that with eg ~/.ssh/id_rsa_ec2key
<axw> jam: by default I think it takes id_dsa.pub, id_rsa.pub and identity.pub from ~/.ssh
<axw> jam: not sure what you mean by "how do you match that"
<axw> I think ssh will just cycle through all the possible private keys on ~/.ssh?
<jam> axw: no
<jam> it cycles through everything in your ssh-agent
<jam> I think
<jam> but for direct keys you can configure it in ~/.ssh/config or supply it as "ssh -i PATHTOKEY" or a couple of other ways.
<jam> but I'm pretty sure it doesn't just try keys on disk
<jam> I could be wrong, and I've overspecified my ssh config
<axw> no actually I think you're right
<jam> but I was wondering if what sinzui and aaron were seeing was because of keys not getting picked up correctly.
<axw> it'll also try those defaults I mentioned
<axw> jam: well, that bit hasn't changed at all, so I don't understand how that'd be the case
<jam> axw: right, the specific 2/3 private bits
<jam> axw: they weren't having Juju drive SSH before
<jam> axw: in the test suite
<axw> ah, true.
<axw> :)
<jam> they were copying logs of via scp at one point
<jam> but that command *might* be configured specially
<jam> the fact that they get "Permission denied (publickey)" sometimes hints that *something* is doing that.
<jam> axw: for Canonistack, I've heard another wrinkle
<jam> if you give a Canonistack instance a floating IP
<jam> that becomes a world-routable IP, but it is *not* routable within the cloud
<jam> so machine A on canonistack has to talk to machine B in canonistack via the private address, and *not* the floating ip
<jam> axw: so you might try "use-floating-ip: true" in your canonistack config and see if that breaks bootstrap for you
<axw> hurngh
<jam> it shouldn't from local
<jam> but it might from cstack => cstack
<axw> no, but maybe from jenkins
<jam> axw: ah, also
<jam> you can't directly connect to CStack machines
 * jam => lightbulb
<axw> yeah, gotta sshuttle
<rogpeppe> anyone know anything about amazon request limits?
<jam> axw: actually most people configure their SSH to bounce via chinstrap
<jam> axw: ProxyCommand ssh chinstrap.canonical.com nc -q0 %h %p
<rogpeppe> i'm still trying to get this guy's environment up, and we're seeing "Request limit exceeded" errors
<jam> axw: so for CStack abentley and sinzui both probably have that, which would let them SSH to a machine, but would *not* let them Dial a machine
<axw> jam: doesn't explain the permission denied error tho
<jam> rogpeppe: more than 100/5s is going to trigger their limits, but I don't know what the actual values are (xplod charm was causing amazon request limit exceeded problems before we fixed it)
<jam> axw: It would explain not being able to connect, but yes, doesn't explain perm denied
<axw> I think you're probably onto something with the private keys/them not using ssh before
<jam> axw: connect as "ubuntu" user?
<axw> jam: yep
<rogpeppe> jam: we're still seeing this error after a night of inactivity
<jam> axw: I *think* user@host:port is openssh specific
<jam> rogpeppe: whose inactivity :)
<jam> rogpeppe: everything shut down?
<rogpeppe> jam: well, we stopped jujud-machine-0
<jam> rogpeppe: i
<rogpeppe> jam: i wonder if the other instances are making ec2 requests
<jam> if this was 1.13.2 the User and Machine agents had provider creds, IIRC
<rogpeppe> jam: everything is on 1.16.3 AFAIK
<jam> so they could have been doing it
<jam> rogpeppe: did you actually get it all moved over?
<rogpeppe> jam: yes
<rogpeppe> jam: but now we're getting this problem
<rogpeppe> jam: one issue after another :-(
<axw> wow, that ping thread is getting on a bit
<rogpeppe> axw: which ping thread?
<axw> rogpeppe: warthogs
<rogpeppe> axw: ha fun
<rogpeppe> axw: i don't look at warthogs much
<TheMue> rogpeppe: any chance to take a look at my CL today?
<TheMue> rogpeppe: oh, and hello btw
<rogpeppe> TheMue: i started on it yesterday, will continue today
<TheMue> rogpeppe: ah, looking forward, thx
<rogpeppe> TheMue: the main issue i have with it so far is that it always reads from the very start of the file
<rogpeppe> TheMue: and for very big files (and log files can be very big) that's a significant waste of resources
<TheMue> rogpeppe: I have to admit I stole an algo of you of the newsgroup :)
<rogpeppe> TheMue: ha ha
<TheMue> rogpeppe: but doing the initial reading from the end indeed seems better
<rogpeppe> TheMue: when was that from?
<TheMue> rogpeppe: oh, pretty old, would have to look again
<TheMue> rogpeppe: but I liked the clean approach
<jam> TheMue, rogpeppe: one thought would be that the api could pass in a possible "bytes from start" to give some context as to where it was looking, which might be a negative number to mean from the end of the file ?
<rogpeppe> jam: i think it would be more intuitive if the api passed in number of lines of context
<jam> rogpeppe: it would, but it is also *very* hard to do efficiently, vs if you had a byte offset hint
<jam> it could even just be a hint
<rogpeppe> jam: or even a start date
<rogpeppe> jam: it's not too hard
<rogpeppe> jam: i've done it before, and tail(1) does it without too much difficulty
<TheMue> rogpeppe: but even then you would have to find it in the file first. ok, a binary search could help.
<rogpeppe> TheMue: you can't do binary search
<rogpeppe> TheMue: but you can read backwards
<rogpeppe> TheMue: actually, you could binary search if you're looking for a start date
<TheMue> rogpeppe: that's what I meant
<rogpeppe> the difficulty with a start date is clock skew
<rogpeppe> log lines won't necessarily be in strict date order
<rogpeppe> but that might not be too much of a problem in practice
<TheMue> rogpeppe: will talk to frankban if that inaccuracy would be ok there
<dimitern> rogpeppe, fwereade_, mgz, jam, others interested: PutCharm proposal document for comments https://docs.google.com/a/canonical.com/document/d/1TxnOCLPDqG6y3kCzmUGIkDr0tywXk1XQnHx7G6gO5tI/edit#
<rogpeppe> and as for client vs server clocks, you could probably ask for a given duration before the last log message
<TheMue> rogpeppe: hey, that's a nice approach, like it
<TheMue> rogpeppe: with different operating modes I still can keep the full scan if wanted
<rogpeppe> dimitern: looking
<jam> TheMue: I think the GUI probably wants "as many old lines as is comfortable to put into the UI" so strictly restricting by date might be unwanted
<jam> consider "it failed 2 days ago"
<jam> or "it looks failed, when did it fail" ?
<jam> So *hinting* to make finding the right value easier sounds good, but assuming it is just a hint is probably worthwhile
<rogpeppe> jam: i think that's probably more of an argument for being able to move back in time
<jam> I guess you could just seed "estimated size of line" in the code, and then update that estimation for a given request as you read through stuff and filter it
<jam> rogpeppe: sure, but what estimate is 'good' for 100 lines
<rogpeppe> jam: i don't think we need to estimate in bytes
<jam> rogpeppe: I mean we don't have the API hint, but we use an internal hint
<jam> given that you might have a really noisy machine that *isn't* in the filter
<jam> or the machine you are reading is noisy itself
<rogpeppe> jam: are you suggesting this as an optimisation?
<jam> rogpeppe: right
<jam> I think we don't need to expose it to the UI, and we do already have "num lines" in both the CLI and in what the GUI would like
<jam> that said
<rogpeppe> jam: i'm not entirely sure i see how it helps
<jam> rogpeppe: I think we can land an unoptimized version as long as we can think of a way to make it better when we need to.
<rogpeppe> jam: could you explain?
<TheMue> jam: yep, wanted is number of lines
<jam> rogpeppe: so we want to get (eg) 100 lines of filtered context for the UI. We start at some place near the end, read and determine how many filtered lines are in that space, and then jump back an estimate based on the number of lines we've found so far.
<jam> you could make the default start 1MB from the end of the file, which might catch 90% of the actual cases
<jam> but the specific tweaks would all be things that we'd actually need real world testing to fine tune
<jam> so lets not optimize too much until we have evidence that it is a problem
<rogpeppe> jam: so you're suggesting that we might or might not return the number of lines requested by the user?
<rogpeppe> jam: personally i prefer to avoid heuristics when we can
<jam> rogpeppe: I'm saying we start with an estimate of where those lines might be, and keep looking until we find them, potentially hitting the beginning of the file
<jam> rogpeppe: seeking from the end ==> heuristic about how much you should read in one chunk, etc
<rogpeppe> one mo, am just going to check if this guy's environment is actually working
<jam> if we do "something" that is ~ reasonable, we get most of the benefit and can drive the rest of the work by actual content
<jam> TheMue: so I guess my point is, we know the log file can get to multiple GB, so a small amount of "try to get the answer near the end of the file" is worth implementing. But don't do a lot of work to optimize the code until we actually know it is a problem
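A minimal sketch of the "try to get the answer near the end of the file" approach: read fixed-size chunks backwards from the end until enough newlines have been seen, then return the tail. The chunk size, the file path, and the absence of any per-entity filtering are simplifications for illustration.

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "os"
    )

    // tailLines returns roughly the last n lines of the file by reading
    // fixed-size chunks backwards from the end until enough newlines are found.
    func tailLines(path string, n int) ([]string, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()

        info, err := f.Stat()
        if err != nil {
            return nil, err
        }
        const chunk = 64 * 1024
        var buf []byte
        offset := info.Size()
        for offset > 0 && bytes.Count(buf, []byte{'\n'}) <= n {
            size := int64(chunk)
            if offset < size {
                size = offset
            }
            offset -= size
            part := make([]byte, size)
            if _, err := f.ReadAt(part, offset); err != nil && err != io.EOF {
                return nil, err
            }
            buf = append(part, buf...)
        }
        lines := bytes.Split(bytes.TrimRight(buf, "\n"), []byte{'\n'})
        if len(lines) > n {
            lines = lines[len(lines)-n:]
        }
        out := make([]string, len(lines))
        for i, l := range lines {
            out[i] = string(l)
        }
        return out, nil
    }

    func main() {
        lines, err := tailLines("/var/log/juju/all-machines.log", 100)
        if err != nil {
            fmt.Println(err)
            return
        }
        for _, line := range lines {
            fmt.Println(line)
        }
    }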
<TheMue> jam: btw, how often do we log rotate?
<jam> TheMue: we don't yet
<jam> IIRC natefinch started on something, but he wasn't able to get all files so stopped trying
<TheMue> jam: ouch
<jam> (he could get all-machines.log to be better, but juju itself kept the log file handle open, so it just kept writing to the rotated place)
<jam> TheMue: bug #1191651
<_mup_> Bug #1191651: Juju logs don't rotate. <canonical-webops> <canonistack> <logging> <pyjuju:Triaged> <juju-core:Triaged> <https://launchpad.net/bugs/1191651>
<jam> or bug #1078213
<_mup_> Bug #1078213: juju-machine-agent.log/logs are not logrotated <amd64> <apport-bug> <canonical-webops> <canonistack> <logging> <precise> <juju-core:Triaged> <juju (Ubuntu):Triaged> <https://launchpad.net/bugs/1078213>
<TheMue> jam: ic, and it indeed needs a solution, especially for larger environments
<jam> TheMue: sure, and it will probably also end up interfering with debug-log, which we'll want to sort out, but it can be an exercise in the future for now
<jam> though I think just rotating all-machine.log would be a big win today
<TheMue> jam: yep
<jam> even if we aren't rotating everything
<jam> all-machines.log is something we can do 'easily' because rsyslog already has hooks for rotating its log files
<jam> vs jujud that would need a SIGHUP or something to be added
<TheMue> jam: btw, did you look at the CL so far?
<jam> TheMue: only in brief, I didn't review it.
<TheMue> jam: ah, ok
<rogpeppe> oh darn, another problem encountered.
<rogpeppe> 2013-12-04 10:19:20 ERROR juju.provisioner provisioner_task.go:342 cannot start instance for machine "77": cannot set up groups: cannot authorize securityGroup: The permission '36226792-3--1--1' has already been authorized on the specified group (InvalidPermission.Duplicate)
<rogpeppe> anyone seen the above error before?
<jam> rogpeppe: I saw it once. After I had a permission group and I added ICMP to it. Then destroyed and rebootstrapped
<jam> because destroy doesn't delete perm groups
<jam> it saw a group that already existed and tried to reconfigure it
<jam> apparently in a duplicate fashion
 * rogpeppe goes to look at that logic
<jam> rogpeppe: I *fixed* it by deleting all the low-numbered juju-ENV-0,1,2 etc groups
<rogpeppe> these guys are really seeing the worst of juju
<jam> in this case, you'd want to delete juju-ENV-77
<jam> juju-ENV-machine-77, I think
<rogpeppe> jam: hmm, i guess it might just be an eventual consistency issue
<rogpeppe> jam: we revoke first, but perhaps the authorize hasn't seen the initial revoke, so it gives the duplicate error
<rogpeppe> jam: FYI i just looked at all their security groups and there's no security group for machine 77
<jam> mgz: rogpeppe: TheMue: standup ?
<rogpeppe> jam: i think it must be the global group
<jam> https://plus.google.com/hangouts/_/calendar/am9obi5tZWluZWxAY2Fub25pY2FsLmNvbQ.mf0d8r5pfb44m16v9b2n5i29ig
<jam> could be
<mgz> there in a sec
 * dimitern gives up on the hangout
<jam> mgz: I approved your https://code.launchpad.net/~gz/juju-core/1.16-juju-update-bootstrap-tweak/+merge/196950 but you might want to hold off for a tick
<jam> IsStateServer got renamed to IsManager in the old 1.16.4 stuff
<jam> so it conflicts there
<mgz> wait, we moved the rename back to 1.16 in the end?
<jam> mgz: that is part of destroy-machine --force, which is still targetted to the 1.16 series (as in something that customers do need, so we're putting it into the current stable series)
<mgz> >_<
<mgz> I'll merge and fixup conflicts
<jam> mgz: it hasn't landed yet, I accidentally proposed against trunk
<mgz> okay, so if I win the race, *you* have to fix it  up? :0
<mgz> that was not the correct smilie...
<jam> mgz: that is true, but I just marked mine approved
<mgz> :D
<jam> mgz: well, frank's stuff is in the queue now, so you have a chance
<jam> depending on what the bot finds first :)
<jam> mgz: mine landed first :)
<mgz> I let you win :P
<jam> I didn't realize you could do that, what does it look like on your end (Insert horizontal rule)
<TheMue> mgz, jam: geeks :D
 * TheMue => lunch
<axw> hey natefinch. got my xps 15 today - did you have any issues loading 13.10 onto it?
<axw> I'm getting weird udev issues at startup :\
<axw> about to go back to 12.04 and then upgrade...
<natefinch> axw: I just transferred my hard drive from my old machine to this one
<natefinch> axw: but that worked fine :D
<jam> morning natefinch, I hope you're feeling better
<natefinch> jam: got a little extra sleep, feel like a new man... well, not exactly, but not bad :)
<mgz> a new-ish man
<natefinch> jam: certainly better than I would have if I'd gotten up at 5:20
<axw> natefinch: aha. I was thinking of doing that as a last resort :)
<natefinch> axw: sorry, it never occurred to me that installation would be a problem
<axw> natefinch: no problems, I'm somewhat used to this :(
<axw> sad to say
<rogpeppe> lunch
<dimitern> rogpeppe, jam, any comments on the PutCharm doc?
<dimitern> mgz, if you want to take a look as well?
<rogpeppe> dimitern: i will have; juggling a few things currently.
<dimitern> rogpeppe, sure, take your time
<mgz> dimitern: sure, looking
<mgz> geh, google doc links are annoying to transfer across... :)
<mgz> dimitern: left a couple of notes
<niemeyer> rogpeppe: Any news from that issue?
<rogpeppe> niemeyer: we got it all working, but they've decided that juju is not for them in the future, sadly
<niemeyer> rogpeppe: Due to that issue, or did they bring up any other reason?
<rogpeppe> niemeyer: a number of issues
<niemeyer> rogpeppe: Okay, well.. at least we have feedback to work on then
<rogpeppe> niemeyer: yes - i am putting together a summary
<niemeyer> rogpeppe: Thanks a lot
<rogpeppe> niemeyer: i asked if they could summarise their juju experience for us too
<rogpeppe> niemeyer: they've put a lot of time into it
<dimitern> mgz, cheers
<sinzui> jamespage, I think devs are willing to forgo an SRU for saucy for the sake of 1.16.4.
<jam> sinzui, jamespage: so 1.16.4 is set in stone (IMO) because it is actually being used in the wild. If we decide we have to revert things/move it to 1.18/whatever we can do so as necessary
<jamespage> jam, sinzui: so I guess you would like me to upload it to trusty and do the backports dance right?
<jam> jamespage: I think we want to make binaries available, I'm not 100% sure how that has to happen
<sinzui> jamespage, yes please :)
<jamespage> sinzui, jam: OK _ but as we probably can't sru this point release, it can't go into cloud-tools
<jamespage> sinzui, jam: uploaded to trusty and backporting for the stable PPA now
<sinzui> Thank you jamespage
<abentley> jam: It appears that bug #1257481 applies only when hyphen is used as a separator.  Can you confirm?
<_mup_> Bug #1257481: juju destroy-environment destroys other environments <ci> <destroy-environment> <juju-core:In Progress by jameinel> <https://launchpad.net/bugs/1257481>
<mgz> abentley: yeah, that seems right
<mgz> we match machines on juju-ENVNAME-*
<mgz> so, be more creative in your naming :0
<jam> abentley: yes
<mgz> jam: we can just make the pattern better, right?
<mgz> in fact, I think it *was* better in pyjuju
<jam> mgz: see the branch associated with the bug
<mgz> jam: I see no associated branch
<jam> mgz: "juju-ENVNAME-machine-\d*"
<jam> mgz: because I fat-fingered it to 1256481
<jam> ugh
<mgz> jam: you li... right
<mgz> reviewing, at any rate :)
<jam> mgz: I'm proposing one with the right associated bugs, but the code is the same
<jam> then again, it is *one* line :)
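As an illustration of that one-line fix (environment and instance names made up for the example): anchoring the pattern on "-machine-<number>" stops one environment's name from matching another environment whose name it merely prefixes.

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        env := "prod"
        // Old behaviour: anything starting with "juju-<env>-" matched.
        oldPattern := regexp.MustCompile(`^juju-` + regexp.QuoteMeta(env) + `-`)
        // Fixed behaviour: only this environment's machine names match.
        newPattern := regexp.MustCompile(`^juju-` + regexp.QuoteMeta(env) + `-machine-\d+$`)

        for _, name := range []string{
            "juju-prod-machine-0",     // belongs to "prod"
            "juju-prod-web-machine-0", // belongs to a different env, "prod-web"
        } {
            fmt.Printf("%-24s old=%v new=%v\n", name, oldPattern.MatchString(name), newPattern.MatchString(name))
        }
    }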
<mgz> jam: really does seem like we want a test
<jam> mgz: how would you write such a test?
<jam> mgz: we do still have 1 test that calls AllMachines
<jam> and I did manual testing on HP and Canonistack
<jam> I can do a whitebox test of what the regex is
<jam> but that doesn't really help much
<jam> you really need a test in an environment, because the thing interpreting the regex is HP/Canonistack/Havana/etc
<jam> mgz: and it doesn't help that our "go test -live" is actually broken right now (something about not having AgentVersion in the config)
<mgz> jam: can write a local live test, hook in some extra instances, and assert only sanely named ones are returned by Instances
<mgz> doesn't need to be a live live test
<mgz> ...but I do see why you want that
<jam> mgz: so we can, though it sounds very close to testing that I implemented exactly that; if you have a good way to phrase it that isn't *too* tied to implementation, I do think it is good "I didn't screw this up when I refactored code"
<mgz> manually nova boot another server with a name that would have matched under the old pattern, but won't under this?
<mgz> that way we have a test that fails on the actual bug reported
<jam> mgz: reasonable point, please leave it in the review so I remember to do it
<mgz> jam: are you putting up a new -cr or should I use that one?
<rogpeppe> dimitern, fwereade_: i've added my original thoughts for how we might do charm uploads to the PutCharm proposal.
<rogpeppe> dimitern, fwereade_: it seems a bit simpler to me, but there's probably a good reason why it would not work
<dimitern> rogpeppe, thanks, i'm reading it now
<dimitern> rogpeppe, and commenting
<abentley> jam: Thanks for the quick fix.  Looks like it will be really hard to cause a bogus match from here on out.
<natefinch> niemeyer: got a second?  Hopefully quick question about replica sets
<niemeyer> natefinch: Yo, yep
<natefinch> niemeyer: I'm writing tests for my code that configures replica sets, so it brings up some mongo instances with mongod --replset foo etc, but I need a good way to test if they're fully up and ready before adding them to the replica set.  Suggestions?  Would Ping() work?
<natefinch> niemeyer: my thought was to put a direct dial and then a ping into a loop, and wait til it succeeds or passes a deadline... but it seems never to succeed
<mgz> can I have stamp on codereview.appspot.com/37210044 merge of 1.16 into trunk please?
<abentley> jam: It appears that when you create an instance on openstack, you can attach metadata to it.  And when you list instances, you can retrieve that metadata.  This could be a sure-fire way of ensuring you delete only the instances you created, e.g. by storing the env name as metadata.
<natefinch> niemeyer: nevermind, I think I just figured it out... forgot to use a monotonic session.
<niemeyer> natefinch: Have you seen the LiveServers method?  That's one way too
<niemeyer> natefinch: Oh, *before* adding to the RS.. okay, nevermind
<natefinch> niemeyer: right, the problem is that I was trying to add them to the replicaset before they were fully up, so mongo would complain that it couldn't add a majority of the servers to the set
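A hedged sketch of the readiness loop natefinch describes: dial the new mongod, switch the session to a mode that doesn't insist on a replica-set primary, and ping until it answers or a deadline passes. The address and timeouts are placeholders.

    package main

    import (
        "fmt"
        "log"
        "time"

        "labix.org/v2/mgo"
    )

    // waitUntilUp polls a mongod until it responds to a ping, or gives up.
    func waitUntilUp(addr string, timeout time.Duration) error {
        deadline := time.Now().Add(timeout)
        for time.Now().Before(deadline) {
            session, err := mgo.DialWithTimeout(addr, 5*time.Second)
            if err == nil {
                // Monotonic mode doesn't wait for a replica-set primary, which
                // doesn't exist yet while the set is still being configured.
                session.SetMode(mgo.Monotonic, true)
                err = session.Ping()
                session.Close()
                if err == nil {
                    return nil
                }
            }
            time.Sleep(500 * time.Millisecond)
        }
        return fmt.Errorf("%s did not come up within %s", addr, timeout)
    }

    func main() {
        if err := waitUntilUp("localhost:37017", time.Minute); err != nil {
            log.Fatal(err)
        }
    }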
<rogpeppe> fwereade_, jam, dimitern: any idea how the problem mentioned by iri- in #juju might be happening?
<niemeyer> natefinch: Understood
<rogpeppe> fwereade_, jam, dimitern: (iri- is peter waller, BTW, the person I've been trying to sort out the environment of)
<mgz> rogpeppe: can I have a stamp on the trunk merge of 1.16 please? cr 37210044
<rogpeppe> mgz: looking
<rogpeppe> mgz: LGTM
<mgz> ta!
<bac> hi rogpeppe or fwereade_, when bootstrapping juju is there a way to get the same effect of 'default-series' in the environment file via the command-line?  --series does not seem to be it.
<rogpeppe> bac:  i think it has to be in the environment config
<bac> rogpeppe: that was my conclusion but i'd hoped i'd overlooked something.
<fwereade_> bac, it's not ideal, but you can always change default-series with set-env after you've bootstrapped
<bac> fwereade_: and that will affect your bootstrap node?
<fwereade_> bac, no, it won't
<bac> yeah, that would be scary.
<fwereade_> bac, it'll affect what charm series is inferred when you deploy, and what series machines added without --series get, but that should be it
<bac> ok, thanks
<fwereade_> bac, cheers
<jcastro> sinzui, I have a community update on juju to stream today, looking at the milestone for 1.16.5 it's 1 bug away, is it safe for me to say that we'll have .5 out before the holidays?
<sinzui> NO way
<sinzui> jcastro, I reported that bug weeks ago and no one is working on it. I think 1.16.5 can never happen because it breaks cli compatibility
<jcastro> ok so I can just reiterate that .4 is where it's at?
<sinzui> I think we should hope for 1.17 this month with 1.18 in January
<sinzui> damn it. abentley there is an extra leading / in the azure release files that is not in the testing files. All the files were pushed to the wrong location
 * sinzui blows a gasket.
<abentley> Crap!
<sinzui> abentley, I am pushing the files to the correct location. We can talk while everything goes up
<TheMue> rogpeppe: you'll get a new push of my CL tomorrow, I'm now changing it to reading from the end first
<rogpeppe> TheMue: cool
<rogpeppe> TheMue: you got my email?
<TheMue> rogpeppe: oh, eh, yes (right now, didn't look at mail before) :D
<abentley> sinzui: relevant? http://162.213.35.54:8080/job/azure-upgrade-deploy/71/console
<rogpeppe> TheMue: quite fun that i was still able to find the code that i remembered...
<TheMue> rogpeppe: great, my approach looks initially simpler, but with less error control and less generic
<TheMue> rogpeppe: will combine both
<TheMue> rogpeppe: but have to step out now, visitors
<rogpeppe> TheMue: ok, see ya
<sinzui> abentley, yes, in fact, it answers what is in my head,
<jamespage> sinzui, all the juju-core 1.16.4 packages are built in the juju-packagers PPA btw
<sinzui> jamespage, I saw thanks
 * rogpeppe is done for the day
<rogpeppe> g'night all
<thumper> mramm: you around for a quick hangout?
<mramm> yea
<mramm> can I have 5 min
<thumper> sure
 * thumper didn't notice the response
<thumper> mramm: if you need a few more minutes, I'll go make a coffee
<mramm> yea
<mramm> need a few more
<thumper> ok
 * thumper goes to make coffee
<natefinch> plugged in a new USB gigabit ethernet adapter today, and my laptop froze up twice today, which it's never done before.  Coincidence?
<thumper> natefinch: probably not :-)
<natefinch> thumper: I didn't really think so :)   Dang... 'cause it's really much more pleasant on ethernet than wifi where my desk is (there's like 3 walls between me and the router, and my signal blows)
<thumper> natefinch: we have brick or plaster interrior walls, and that kills the signal.
<thumper> ran a cat 6 cable from the office to the dining room
<thumper> and have a second access point there
<thumper> mramm: for when you are ready https://plus.google.com/hangouts/_/72cpim91vapctd3ad0g1v02aa0?hl=en
<natefinch> I actually am planning to run a cable from the basement under my office... there's a phone jack in this room that is sort of hilariously useless, which I plan to replace with an ethernet jack.  Putting an access point in here is a good idea, though.  hadn't thought of that
<natefinch> You know one of the things I like best about Canonical?  Sometimes stuff actually gets addressed when I complain.
<natefinch> niemeyer: one (hopefully last) mongo question if you have second?
<jcastro> heya thumper
<jcastro> what's the TLDR on manual provisioning lately?
<thumper> jcastro: mostly working
<thumper> jcastro: hazmat is using it quite a lot I think
<thumper> jcastro: axw is working on making it destroy itself properly
<thumper> which is involving moving destroy environment into the API
<thumper> which is mostly done I think
<thumper> apart from that, I think it is working
<thumper> but not really documented
<thumper> o/ waigani
<waigani> thumper: hello :)
<hazinhell> jcastro, its awesome.. after i get out of hell, i've got a cool plugin that you'll like
<hazinhell> is there still on-going work on api
<hazinhell> deploy is listed as is progress but dimeter is on holiday for a while
<hazinhell> mostly just looking for putcharm in the api
<hazinhell> jcastro, manual still has some rough edges, there's the manual-provider tag on the bugs
<hazinhell> but it works pretty well outside of the rough edges imo.
<jcastro> thumper, it's partially documented: https://juju.ubuntu.com/docs/config-manual.html
<jcastro> hazinhell, that's good to hear!
<hatch> hey does 1.16.4 have the fix for manual provider?
#juju-dev 2013-12-05
<thumper> hatch: which fix?
<thumper> hatch: although probably not
<thumper> hatch: you can probably expect a 1.18 soonish
<hatch> thumper there was one related to mongodb that was fixed but not sure if it was going in 16 or 18
<thumper> yeah, not sure sorry
<thumper> axw may know when he gets online
<hatch> no problem thanks though
<wallyworld> thumper: meeting?
<thumper> oh yeah...
<thumper> wallyworld: my popup reminder conveniently ended up behind emacs
<rick_h__> come on, surely emacs will do reminders :P
<thumper> rick_h__: probably, but it isn't hooked up to my google calendar!
<thumper> axw: sorted out the screen issue?
<axw> thumper: all sorted
<axw> sorry I'm late, will make it up later
<axw> thumper: I got my new laptop - thought I had it sorted last night :/
<thumper> what's the new laptop?
<axw> xps 15
<thumper> nice?
<thumper> with the high res screen?
<axw> thumper: yup, the one I just turned down to 1080p because I can hardly read anything :(
<thumper> I have some dude taking out my power meter and replacing it with a smart meter
<thumper> no power in the house
<axw> at 3200x1800 that is
<thumper> so on battery and using mobile 3G
<thumper> wow, that is good
<thumper> how's the battery?
<thumper> and how heavy?
<axw> thumper: dunno yet, hardly used it - just been frigging around getting my preferred OS on it ;)
<axw> ~2kg
<thumper> heh
<thumper> I'm curious to know when you have hammered it a bit
<axw> I'll let you know when I do :)
<axw> the keyboard is a bit puny compared to what I'm used to
<axw> getting used to it though
<_thumper_> bugger!
<axw> thumper, wallyworld: is there a reason why we don't generate new SSH keys for environments?
<wallyworld> axw: we use the ssh key of the user who created the environment
<wallyworld> i guess that was deemed sufficient
<axw> wallyworld: yeah I know that's what we do now, just wondering why we shouldn't just create a new key
<axw> mmk
<wallyworld> not sure, that design decision was way before my time
<axw> nps
<axw> wallyworld: https://bugs.launchpad.net/juju-core/+bug/1257371/comments/9
<_mup_> Bug #1257371: bootstrap fails because Permission denied (publickey) <bootstrap> <regression> <juju-core:In Progress by axwalk> <https://launchpad.net/bugs/1257371>
<axw> some people might not have a default ssh key, which buggers things up - that's why I ask
<wallyworld> ah, just read the bug
<wallyworld> i can see why we don't generate a new key
<wallyworld> s/dont't/ might not want to
<wallyworld> it would be another key to manage
<wallyworld> when we could just add existing user keys (or import using ssh-import-id)
<wallyworld> but if a user doesn't have a key already....
<wallyworld> axw: are you adding code to allow a key to be specified if not in ~/.ssh/id_rsa.pub?
<axw> wallyworld: just considering options atm
<wallyworld> ok
<axw> wallyworld: that's one option, or there's the option of generating one automatically
<axw> or both
<wallyworld> generating one would be more user/bot friendly
<axw> yeah that's what I'm thinking
<wallyworld> i'd +1 that approach :-)
<wallyworld> maybe run it by jam or william as well
<axw> yeah, I'll just send an email to juju-dev now
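A minimal sketch of the "generate a key automatically" option being weighed up here, using the x/crypto ssh helpers rather than whatever juju would actually use; the key size and the comment string are illustrative, and persisting the private key is left out.

    package main

    import (
        "crypto/rand"
        "crypto/rsa"
        "fmt"
        "log"
        "strings"

        "golang.org/x/crypto/ssh"
    )

    func main() {
        // Generate a fresh RSA key for the environment.
        key, err := rsa.GenerateKey(rand.Reader, 2048)
        if err != nil {
            log.Fatal(err)
        }
        // Emit the authorized_keys line that would go into the environment config.
        pub, err := ssh.NewPublicKey(&key.PublicKey)
        if err != nil {
            log.Fatal(err)
        }
        line := strings.TrimSpace(string(ssh.MarshalAuthorizedKey(pub))) + " juju-client-key"
        fmt.Println(line)
    }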
<thumper> wallyworld: http://pastebin.ubuntu.com/6523153/
<thumper> wallyworld: and I logged in and checked, and it did indeed have 4G of disk, 1G of ram and 2 cores
<thumper> \o/
<wallyworld> \o/
<wallyworld> awesome
 * thumper proposes
<thumper> wallyworld:  https://codereview.appspot.com/37610043
 * thumper goes to write some emails
 * wallyworld looks
 * wallyworld ->doctor, bbiab
<davecheney> doh
<davecheney> so close
<davecheney> lucky(~/src/launchpad.net/juju-core) % juju ssh 0
<davecheney> WARNING discarding API open error: <nil>
<davecheney> ERROR environment has no access-key or secret-key
<davecheney> ^ what does this mean
<davecheney> any ideas ?
<davecheney> axw: thanks for the tip
<davecheney> that fixed it
<axw> davecheney: cool
<axw> seems to me it's another win for using gc-built jujud
<axw> the size, that is
<davecheney> axw: oh, i'm not passing -Os
<davecheney> maybe that would help
<axw> davecheney: maybe at the expense of performance
<davecheney> man, even with sync bootstrap juju deploy the first time is still very slow
<davecheney> ie, i type juju depoy mysql m1
<davecheney> and it still took 2 minutes to return and tell me i spelt the command wrong
<axw> huh. I didn't think it did much before parsing the command line...
<davecheney> i think we still connect to the state too early in the subcommand
<davecheney> axw: brace yourself
<davecheney> lucky(~/src/launchpad.net/juju-core) % ls -alh ~/bin/juju{,d}
<davecheney> -rwxr-xr-x 1 dfc dfc 36M Dec  5 17:21 /home/dfc/bin/juju
<davecheney> -rwxr-xr-x 1 dfc dfc 40M Dec  5 17:21 /home/dfc/bin/jujud
<axw> :o
<davecheney> ding ding ding!
<davecheney> http://ec2-54-253-189-54.ap-southeast-2.compute.amazonaws.com/wp-admin/install.php
<axw> very nice :)
<davecheney> if only it wasn't wordpress
<davecheney> we're not allowed to mention wordpress
<rogpeppe1> mornin' all
<dimitern> morning!
<dimitern> we should decide what to do about putcharm today
<dimitern> otherwise the discussion can drag on too long
<dimitern> fwereade_, i'm not sure i understand the icon question for local charms
<dimitern> fwereade_, why do we need GET support?
<fwereade_> dimitern, ah sorry -- did the mail I forwarded lack the original context?
<dimitern> fwereade_, the context is there, but I still have no clue how to do that
<dimitern> fwereade_, I don't know how it works with the charmworld now
<fwereade_> dimitern, I think the charmworld bit is a red herring tbh
<dimitern> fwereade_, can the icon be part of the charm package?
<fwereade_> dimitern, the question is maybe best phrased as "can we somehow serve individual charm files to authenticated clients"
<fwereade_> dimitern, yeah, icon.svg
<dimitern> fwereade_, hmm..
<dimitern> fwereade_, i suppose, if we duplicate part of charmworld in the api server yes
<fwereade_> dimitern, if we do go through an unpack/repack step for every local charm we do at least have all that data available somewhere already, and it'd just be a matter of putting it in the right place
<dimitern> fwereade_, we can have GET /charms/xyz/icon.svg
<fwereade_> dimitern, and I don't think we need to worry about store charms so much because they can already get the data from elsewhere
<fwereade_> dimitern, although it would probably be best to make this functionality available for all charms
<dimitern> fwereade_, i was thinking of removing the data after repackaging - why bother - it's already in the provider storage
<dimitern> fwereade_, so it gets hairer by the minute :)
<dimitern> fwereade_, proxy any charm files from the provider storage or charm store through the api
<thumper> dimitern: hey, when do you go on leave?
<dimitern> fwereade_, this seems like a potential DDOS target - pinging the api server with random urls of valid charm, just so it has to fetch and unpack them
<dimitern> thumper, hey, on 23rd
<fwereade_> dimitern, well we *do* want charm access to be authenticated
<thumper> dimitern: ah, for some reason canonical admin is all fubared
<thumper> dimitern: how's the upload charm stuff going?
<fwereade_> dimitern, we dropped the ball on that because signed urls are an openstack extension iirc
<dimitern> thumper, oh? it was looking fine last time i looked
<thumper> dimitern: I'm not surprised it is screwed, it was mramm that told me
<dimitern> fwereade_, nevertheless it's a ddos target of a sort - an authenticated client written badly can bring down the api server like that perhaps
<fwereade_> dimitern, but we ought to be clawing it back if possible -- in general we shouldn't just be making the contents of environment storage available to anyone who asks
<fwereade_> dimitern, more so than *any* endpoint?
<fwereade_> dimitern, if we're ok with one-time tokens for PUTs, we could surely do the same with GETs
<dimitern> fwereade_, well, a bit more so, because it requires some busy work of fetching and unpacking a charm just to serve a file
<dimitern> fwereade_, i'm not sure what approach did we decide to pick
<fwereade_> dimitern, quite, we should definitely not do that
<fwereade_> dimitern, and your work maybe involves extracting all the charm files anyway
<fwereade_> dimitern, hence the potential connection in my mind
<dimitern> fwereade_, if we need the get stuff, rogpeppe1's proposal seems to fit better than mine
<dimitern> fwereade_, have POST/PUT/GET support for charm urls prefix, like /charms/ and handle everything there
<dimitern> fwereade_, potentially caching the unpacked charms locally forever
<fwereade_> dimitern, well we can't do that exactly
<fwereade_> dimitern, HA
<rogpeppe1> fwereade_: why is HA a problem there?
<fwereade_> dimitern, we presumably want them, and everything else that doesn't have to be provider-level, in gridfs storage
<dimitern> fwereade_, do what? cache?
<jam> fwereade_: you could write it all into mongo
<jam> rogpeppe1: dimiter was saying 'locally'
<fwereade_> dimitern, which machine unpacked then?
<jam> as in /tmp, I believe
<rogpeppe1> jam: i think that was with respect to caching
<rogpeppe1> jam: which would be fine, i think
<rogpeppe1> jam: although you'd want to keep a handle on the size of the cache
<dimitern> fwereade_, doesn't matter which machine
<fwereade_> dimitern, well, we need the files on all the machines, right?
<dimitern> fwereade_, whichever api server serves a charm, it checks its local cache and populates it before serving
<rogpeppe1> jam: and actually, it might not be worth caching - you don't need to unpack for a GET
<dimitern> fwereade_, no
<dimitern> rogpeppe1, you need specific files inside the package, like the icon.svg
<jam> rogpeppe1: so the request from fwereade_ was that you could serve just the icon.svg, for example, and that is not trivial to proxy without unpacking
<rogpeppe1> ah, i hadn't realised that
 * rogpeppe1 goes to look at the gridfs docs
<rogpeppe1> jam: actually, i don't think it would be too hard
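(A rough sketch of what's being discussed here, not juju's actual handler: charm archives are zip files, so a GET under a /charms/ prefix can stream a single member such as icon.svg straight out of the archive without unpacking the whole charm. The URL layout and the bundle path are hypothetical stand-ins.)

    package main

    import (
        "archive/zip"
        "fmt"
        "io"
        "net/http"
        "strings"
    )

    // serveCharmFile copies one member of a charm zip archive to w.
    func serveCharmFile(bundlePath, member string, w io.Writer) error {
        r, err := zip.OpenReader(bundlePath)
        if err != nil {
            return err
        }
        defer r.Close()
        for _, f := range r.File {
            if f.Name == member {
                rc, err := f.Open()
                if err != nil {
                    return err
                }
                defer rc.Close()
                _, err = io.Copy(w, rc)
                return err
            }
        }
        return fmt.Errorf("%q not found in %s", member, bundlePath)
    }

    func main() {
        http.HandleFunc("/charms/", func(w http.ResponseWriter, r *http.Request) {
            // e.g. GET /charms/wordpress/icon.svg -> ["wordpress", "icon.svg"]
            parts := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/charms/"), "/", 2)
            if r.Method != "GET" || len(parts) != 2 {
                http.Error(w, "bad request", http.StatusBadRequest)
                return
            }
            // Where the archive lives is provider/storage specific; /tmp is a stand-in.
            bundle := "/tmp/" + parts[0] + ".charm"
            if err := serveCharmFile(bundle, parts[1], w); err != nil {
                http.Error(w, err.Error(), http.StatusNotFound)
            }
        })
        http.ListenAndServe(":8080", nil)
    }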
<dimitern> fwereade_, rogpeppe1, although I still think going with a pure http-based interface will screw us up badly when we need to think of role-based auth
<rogpeppe1> dimitern: how's that?
<jam> dimitern: basic auth is username + password
<dimitern> i know that
<jam> I think oauth is an HTTP header
<jam> and you can always have X-Juju-Authorization-Token: XYZ
<rogpeppe1> jam: +1
<dimitern> but what I don't know is whether it would be enough / like the api login
<rogpeppe1> dimitern: it doesn't necessarily need to be
<rogpeppe1> dimitern: did you see my reply to Gary's comment?
<fwereade_> rogpeppe1, jam, dimitern: well, I want us to remain open to the possibility of alternative auth mechanisms
<rogpeppe1> fwereade_: definitely
<dimitern> rogpeppe1, so using the same user/pass for login and for basic http auth
<jam> fwereade_: I think HTTP headers are just as turing complete as anything else
<rogpeppe1> jam: actually, they're not quite
<fwereade_> rogpeppe1, jam, dimitern: so issuing tokens from the API feels like it gets round that potential complexity completely
<fwereade_> ?
<rogpeppe1> jam: you can't do a multi-stage login protocol
<dimitern> fwereade_, exactly
<dimitern> fwereade_, but then we don't solve the get issue
<rogpeppe1> fwereade_: we could issue auth tokens from the API
<dimitern> fwereade_, and having to implement a get handler + a bunch of apis, rather than a bunch of http handlers only seems simpler
<jam> rogpeppe1: you could with round trips, and I'm pretty sure basic PUT can return an "please finish Auth" request.
<rogpeppe1> fwereade_: which don't have any specific relationship to charms
<rogpeppe1> jam: ah, ok
<rogpeppe1> jam: sounds complex though.
<dimitern> jam, PUT can return 401 as POST or GET
<jam> rogpeppe1: that said, client sides may not implement the 100 Continue stuff, which would mean they try to upload the whole content and then get a "hey, you need to auth, please upload again"
<rogpeppe1> fwereade_: in the future, i'm thinking of a method, say Client.AuthToken that returns an authentication token that can be used to authenticate future URL operations.
<rogpeppe1> fwereade_: but for the time being, i don't think it's necessary.
<dimitern> rogpeppe1, actually it is i think
<dimitern> rogpeppe1, we can always generate a token and return it as a result of login
<dimitern> rogpeppe1, then we can use this token as a session key for url requests
<dimitern> rogpeppe1, and have a call to renew a token perhaps
<rogpeppe1> dimitern: that's not a bad idea - i don't think it's *necessary* but it would work well
<rogpeppe1> dimitern: for the time being we could just return the username and password
<dimitern> rogpeppe1, you mean ask for?
<dimitern> rogpeppe1, not return
<rogpeppe1> dimitern: i mean that the auth token returned by Login would just embed the user name and password that was passed to Login
<rogpeppe1> dimitern: the client would treat it as an opaque identifier
<dimitern> rogpeppe1, not in plain text though
<rogpeppe1> dimitern: so if we changed to using a more sophisticated scheme in the future, the client would not need to change
<rogpeppe1> dimitern: why not?
<dimitern> rogpeppe1, because it's a security leak
<rogpeppe1> dimitern: how so?
<dimitern> rogpeppe1, returning the "usernamepassword" in plain text from the server?
<rogpeppe1> dimitern: yeah
<dimitern> rogpeppe1, and then using that as a token?
<rogpeppe1> dimitern: yup
<dimitern> rogpeppe1, why?
<rogpeppe1> dimitern: how is it a security leak?
<dimitern> rogpeppe1, basic auth does that already - user/pass auth
<rogpeppe1> dimitern: sure, and that's essentially what we're doing now
<rogpeppe1> dimitern: how is it a security leak?
<dimitern> rogpeppe1, my point is that a token is probably either a part of the url (unsafe) or a header (with ssl probably safe)
<rogpeppe1> dimitern: it would be part of the header (and the url is actually safe too with https, i believe)
<dimitern> rogpeppe1, unsafe, meaning if the token is not opaque, like you're suggesting
<rogpeppe1> dimitern: we're talking about an authentication token, right? any time that leaks, it's a security leak regardless of whether it's plain text or not
<rogpeppe1> dimitern: i'm just suggesting we go with an ultra-simple approach to start with, which is sufficient for our current needs, and also capable for the future.
<dimitern> rogpeppe1, no it's not
<rogpeppe1> dimitern: why not?
<jam> dimitern: rogpeppe1, fwereade_, TheMue, davecheney: weekly standup
<dimitern> rogpeppe1, an opaque token, which is time-sensitive is better, even if leaked the damage is minimized
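(For reference, a minimal sketch of the approach rogpeppe1 floats above: Login returns a token that, for now, just embeds the user/password, and the client never looks inside it, so the server can later swap in a real time-limited session token without any client change. All names here are made up.)

    package main

    import (
        "encoding/base64"
        "fmt"
        "strings"
    )

    // makeToken returns an opaque-looking token; today it only wraps the
    // credentials, but the format can change without breaking clients.
    func makeToken(user, password string) string {
        return base64.StdEncoding.EncodeToString([]byte(user + ":" + password))
    }

    // parseToken is the server-side inverse of makeToken.
    func parseToken(tok string) (user, password string, err error) {
        raw, err := base64.StdEncoding.DecodeString(tok)
        if err != nil {
            return "", "", err
        }
        parts := strings.SplitN(string(raw), ":", 2)
        if len(parts) != 2 {
            return "", "", fmt.Errorf("malformed token")
        }
        return parts[0], parts[1], nil
    }

    func main() {
        tok := makeToken("admin", "sekrit")
        fmt.Println("X-Juju-Authorization-Token:", tok)
        u, p, _ := parseToken(tok)
        fmt.Println(u, p)
    }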
<jam> mramm: standup if you're around
<jam> mgz: ^^
<jam> https://plus.google.com/hangouts/_/calendar/bWFyay5yYW1tLWNocmlzdGVuc2VuQGNhbm9uaWNhbC5jb20.8sj9smn017584lljvp63djdnn8
<mgz> I'm there
<davecheney> jam: bad news, -O2/s has no appreciable effect on binary size
<davecheney> well, -O2 made it 5% bigger
<davecheney> -rwxr-xr-x 1 dfc dfc 19M Dec  5 22:20 /home/dfc/bin/juju
<davecheney> -rwxr-xr-x 1 dfc dfc 22M Dec  5 22:20 /home/dfc/bin/jujud
<davecheney> ^ jam stripped
<davecheney> but I don't know if it is safe to do that with gccgo binaries
<davecheney> i'd guess it is probably safer than gc binaries
<davecheney> but i don't have enough experience
 * dimitern lunch
 * davecheney bed
<jam> davecheney: thanks for the reference point, sleep well
<davecheney> jam: antonio put me up to this, i swear
<jamespage> jam: urgh - I just got the --upload-tools thing when running from trusty with gccgo and trying to manage precise
<jamespage> esp if libgo is not statically linked....
<jamespage> gah
<jam> jamespage: so as I understand from davecheney, it *is* intended that final binaries built with gccgo would be statically linked to avoid 'what version is on what platform' problems
<jamespage> jam: not sure about the statically linked thing
<jamespage> that's something the security team were keen to avoid
<jamespage> libgo contains a core set of SSL libraries
<jamespage> so not having to rebuild for every security vulnerability is a +
<TheMue> rogpeppe1: *carefulPing* I changed the tailer now that it starts N lines before the end. only downside is that those N lines aren't already filtered, so that the initially returned lines can be less.
<rogpeppe1> TheMue: presumably you could filter as you go back through the file?
<TheMue> rogpeppe1: should be possible too, only a bit more logic to create the right strings to check
<TheMue> rogpeppe1: and those may span two or more read buffers
<TheMue> rogpeppe1: but will try it after hangout (in 2 mins)
<rogpeppe1> TheMue: yes, you may need to use something more like the code i sent you
<TheMue> rogpeppe1: a bit more of it, yes
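(An illustrative sketch of the behaviour under discussion, not TheMue's actual tailer: return the last N lines that pass a filter. Scanning forwards with a sliding window gives the same result while dodging the "line spans two read buffers" problem of a true backwards read, at the cost of reading the whole file.)

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // lastMatching returns the last n lines of the file that pass filter.
    func lastMatching(path string, n int, filter func(string) bool) ([]string, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()
        var kept []string
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            line := sc.Text()
            if filter(line) {
                kept = append(kept, line)
                if len(kept) > n {
                    kept = kept[1:]
                }
            }
        }
        return kept, sc.Err()
    }

    func main() {
        lines, err := lastMatching("/var/log/juju/machine-0.log", 10, func(s string) bool {
            return strings.Contains(s, "ERROR")
        })
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            return
        }
        for _, l := range lines {
            fmt.Println(l)
        }
    }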
<TheMue> rogpeppe1: btw, how about the storm in your hometown? here it's ok so far, but it's still supposed to grow
<rogpeppe1> TheMue: pretty windy today
<TheMue> rogpeppe1: yeah, we have some douglas firs in front of the house and they are swinging fine :)
<TheMue> rogpeppe1: so, off for some minutes, hangout
<dimitern> rogpeppe1, why is there a mutex in apiserver Login?
<rogpeppe1> dimitern: because it might be called concurrently
<dimitern> rogpeppe1, it seems I can reuse the same Login call for http basic auth
<dimitern> rogpeppe1, except for the loggedIn flag
<rogpeppe1> dimitern: i'd abstract out a separate function called by both
<rogpeppe1> dimitern: i did that in the example http handler code i posted a while ago
<dimitern> rogpeppe1, yeah, my thoughts exactly
<rogpeppe1> dimitern: authUser in https://codereview.appspot.com/22100045/diff/20001/state/apiserver/admin.go
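(The shape of that reuse, sketched with hypothetical names rather than the code in the CL above: one credential check shared by the API Login call and an HTTP basic-auth wrapper around handlers such as the charm upload. r.BasicAuth needs Go 1.4+; older Go would parse the Authorization header by hand.)

    package main

    import "net/http"

    // checkCreds is the single shared piece; both the websocket API Login
    // and the HTTP handlers delegate to it. The lookup here is a placeholder.
    func checkCreds(user, password string) bool {
        return user == "admin" && password == "sekrit"
    }

    // basicAuth wraps an HTTP handler with basic auth backed by checkCreds.
    func basicAuth(h http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            user, pass, ok := r.BasicAuth()
            if !ok || !checkCreds(user, pass) {
                w.Header().Set("WWW-Authenticate", `Basic realm="juju"`)
                http.Error(w, "unauthorized", http.StatusUnauthorized)
                return
            }
            h(w, r)
        }
    }

    func main() {
        http.HandleFunc("/charms", basicAuth(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok\n"))
        }))
        http.ListenAndServe(":8080", nil)
    }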
<sinzui> natefinch, I put you in an n+1 config yesterday. I built the windows installer juju 1.16.4 and verified I could bootstrap and deploy from windows with it.
<natefinch> sinzui: awesome
<sinzui> natefinch, I learned I can use wine+inno to make the installer. I don't have windows so I started an instance in aws. I learned about rdp and making ssh work. I think I know enough now to test this regularly
<natefinch> sinzui: awesome.  I hoped wine would work, but hadn't tried it myself.
<abentley> sinzui: I think I will upgrade CI's system juju to 1.16.4.  Any reason not to?
<sinzui> abentley, +1 to upgrade
<rogpeppe1> TheMue: you have a review
<TheMue> rogpeppe1: thx
<mgz> dstroppa: so, do you have lbox installed and working?
<dstroppa> mgz: I'm getting an 'undefined: syscall.TCGETS' when I execute 'go get launchpad.net/lbox'
<mgz> >_<
<mgz> maaacs...
<mgz> from goetveld/rietveld/terminal_darwin.go?
<mgz> that should be worked around in latest release...
<dstroppa> goetveld/rietveld/terminal.go
<mgz> I have terminal_darwin.go and terminal_linux.go
<dstroppa> here http://bazaar.launchpad.net/~goetveld/goetveld/trunk/files I can only see terminal.go
<mgz> uuu
<mgz> dstroppa: can you branch lp:~gophers/goetveld/trunk instead?
<mgz> I'm not sure why we have two, at different revisions
<mgz> dstroppa: or just try again now, I've changed it on launchpad
<mgz> so go get should now work
<dstroppa> mgz: go get works
<dstroppa> and now I got lbox
<mgz> phew :)
<mgz> okay, so now go to your gojoyent branch, and run: (adjust the path as needed)
<mgz> `~/go/bin/lbox propose --for lp:~juju/gojoyent/for_review -cr -v -wip`
<mgz> which does some stuff, then should bring up an editor for you to write a review message in
<mgz> it'll also ask for creds for g+ and launchpad as part of it... painful but hopefully not too bad
<dstroppa> Branch is not clean (I got some file that are not pushed yet as I'm still working on it)
<dstroppa> shall I move them or can I work around it?
<mgz> shelve those bits then run it again
<mgz> `bzr shelve --all -m "wip whatever"`
<mgz> and `bzr unshelve` when you want it back
<dstroppa> sh: sensible-editor: command not found
<mgz> heh
<mgz> set EDITOR to something?
<mgz> this is the flakiest tool...
<dstroppa> error: Change summary is empty.
<dstroppa> even though I added both summary and desc
<mgz> dstroppa: the code opens a tempfile, execs your editor with that name, waits for it to exit, seeks to 0, then reads
<mgz> is there anything there that would get upset for you?
<mgz> or maybe just a blank line at the top or something daft?
<dstroppa> I believe no
<dstroppa> it opens up vi
<dstroppa> I edit the file (replace the <enter…> with my text)
<dstroppa> save and close
<mgz> feel free to poke at lbox text.go to work out why it's being fussy
<mgz> vim's what I use so I'd expect it to be fine for you as well
<mgz> dstroppa: any luck?
<dstroppa> tried vi and vim, same error
<dstroppa> looking at text.go, but can't see anything in particular
<mgz> could just add a log statement to dump the file name/contents before that error, see if that makes anything clearer
<dstroppa> mgz: looks like it's still reading the template
<dstroppa> even though the temp file is saved
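(A sketch of the flow mgz describes, with made-up names. One thing worth noting: some editors save by writing a new file and renaming it over the old one, in which case the still-open descriptor keeps pointing at the original template; re-opening the file by name after the editor exits avoids that, and could explain still seeing the template.)

    package main

    import (
        "fmt"
        "io/ioutil"
        "os"
        "os/exec"
    )

    // editText writes a template to a temp file, runs $EDITOR on it,
    // seeks back to 0 and reads the result - the flow described above.
    func editText(template string) (string, error) {
        f, err := ioutil.TempFile("", "lbox-")
        if err != nil {
            return "", err
        }
        defer os.Remove(f.Name())
        defer f.Close()
        if _, err := f.WriteString(template); err != nil {
            return "", err
        }
        editor := os.Getenv("EDITOR")
        if editor == "" {
            editor = "vi"
        }
        cmd := exec.Command(editor, f.Name())
        cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            return "", err
        }
        // If the editor replaced the file by rename, this fd still sees the
        // old contents; ioutil.ReadFile(f.Name()) would be the robust version.
        if _, err := f.Seek(0, 0); err != nil {
            return "", err
        }
        data, err := ioutil.ReadAll(f)
        fmt.Fprintf(os.Stderr, "debug: read back %d bytes\n", len(data))
        return string(data), err
    }

    func main() {
        text, err := editText("<enter your change summary here>\n")
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            return
        }
        fmt.Print(text)
    }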
<hazinhell> so is safe mode the default?
<hatch> on juju 1.16.4-precise has anyone else been experiencing 'unauthorized mongo access' errors?
<jam> hatch: from where, to where? from the 'juju' CLI or from agents, or ? (It hasn't been reported before, but it is worth investigating)
<hatch> jam so I just did a apt-get upgrade and now when I type `sudo juju bootstrap` I get that error
<jam> hatch: so this is for local provider ?
<hatch> correct
<jam> (otherwise you wouldn't be using sudo, presumably)
<jam> hatch: can you try "sudo juju bootstrap --debug" and then paste bin the result (it might have secrets if you want to be careful)
<hatch> sure one sec
<jam> hatch: also, what version of 'mongodb' is on your machine? (dpkg -l mongodb-server)
<hatch> jam I think here are the relevant bits https://gist.github.com/hatched/3e8a13af98250c236c9d
<hatch> 1:2.2.4-0ubuntu1~ub
<jam> hatch: what series are you on? (precise/raring/etc)
<hatch> precise
<hatch> these issues appeared after the update to .4
<jam> so I'm not saying to do this *yet* but we do have mongodb-2.4.6 in the cloud-archive: sudo add-apt-repository cloud-archive:tools
<jam> hatch: the actual changes of 1.16.3 vs 1.16.4 don't seem like they would be causing db auth problems, but I'll dig a bit
<hatch> I saw some emails going by that talked about a juju specific mongodb
<hatch> did those land in .4?
<jam> hatch: I just "juju destroy-environment; and then juju bootstrap" locally and it worked on precise, so it *might* be the newer mongodb
<jam> hatch: no
<jam> that is going to be a while in the future
<hatch> ahh ok ok
<jam> hatch: so I know we didn't intend to change anything about how we connect to the db in 1.16.4 vs 1.16.3, and I confirmed the diff doesn't appear to contain a change there.
<hatch> very odd...
<jam> hatch: are you bootstrapping with a DB already configured or something like that ? (I wouldn't think you would get this far)
<hatch> jam I'm going to say no....
<hatch> I've tried deleting the local/ and environments/ but that doesn't appear to help
<jam> hatch: well if you delete those, then you can't "juju destroy-environment" properly, which means you possibly *do* have a stale DB that we aren't aware of (maybe)
<hatch> oh hmm
<hatch> any idea on how I would wipe it clean?
<jam> hatch: just to try it "juju destroy-environment -e local; sudo juju bootstrap -e local"
<jam> or whatever you named it
<hatch> same
<hatch> issue
<jam> hatch: I'll try to install 2.2.4 here, but so far I have no luck reproducing your issue  (next step is to install the 1.16.4 binary that was built, because it might be flawed somehow vs building from trunk)
<hatch> jam is there a way I could clear out my mongodb besides the destroy-environment call?
<jam> hatch: btw, do you know about paste.canonical.com? it requires Auth tokens so is generally 'safe enough' when pasting private stuff.
<jam> hatch: well, "create mongo journal dir: /home/jameinel/.juju/local/db/journal" so I'm 95% sure that deleting ~/.juju/local would have done that
<hatch> ahh ok that was my thought process as well
<hatch> jam yeah I usually don't use that paste because it doesn't allow edits
<jam> hatch: can you paste your local config ?
<hatch> sure
<hatch> https://gist.github.com/hatched/72ce29821d4c56c76280
<hatch> nothing changed in there since the update
<jam> hatch: sure, you could *try* just commenting out the admin-secret
<jam> as we should generate one for you if it isn't present
<jam> but I haven't been able to reproduce the problem
<hatch> ok trying
<hatch> nope :/
<hatch> darn
<jam> hatch: I didn't expect much there, as there is no reason to expect the value there is invalid. I also can't reproduce on Precise with the jujud binary installed from the ppa and mongo from that ppa as well.
<jam> hatch: maybe you can pastebin more of the log? In case there was a config earlier that is wierd
<jam> weird
<jam> natefinch: I'm about to head to bed, any chance you can give this a look ? I'm skeptical that this is caused by 1.16.4 vs 1.16.3, but I'd really like to know what a root cause is
<hatch> thanks for the help jam - maybe what I'll do is remove juju and mongo and start frash
<hatch> fresh*
<hatch> maybe something got messed up in the update
<jam> or maybe wallyworld or thumper depending on who logs in first (we're almost around to start-of-day NZ I believe)
<jam> hatch: I would like to understand what is broken, but I can also understand you just want to get it working
<hatch> well the good news is that I don't need to test any new GUI work until tomorrow :)
<hatch> haha so I have until then
<jam> its nearly 8am in NZ, so thumper should be around soon. and rogpeppe2 is often known as a very helpful fella :)
<jam> though its past his EOD, I believe
<rogpeppe2> jam: it is :-)
<natefinch> jam: I'm here
<natefinch> jam: took a jaunt to the Lexington, MA office, since they wanted to see Ubuntu on my laptop (high DPI screen)
<jam> natefinch: k, brief summary, hatch upgraded to 1.16.4 from the ppa, and now when he tries to bootstrap it gives him unauthorized failures
<jam> I'm unable to reproduce, even using Precise + mongo-2.2.4 + juju-1.16.4 from the PPA
<jam> natefinch: sounds fun
<jam> how far away is it for you?
<hatch> natefinch this is the error https://gist.github.com/hatched/3e8a13af98250c236c9d
<natefinch> 1/2 hour, not bad at all.  Had lunch with David Pitkin, old colleague from our younger startup days
<natefinch> hatch: hmm.. interesting
<hatch> I've tried deleting the environments/ and local/ as well as doing a destroy-environment and no luck
<jam> natefinch: a thought occurs, could it be something with apparmor ?
<jam> it doesn't seem like it should be, but that unauthorized message looks odd to me
<natefinch> yeah, I was looking at the unauthorized message
<jam> looks like it is stock mongo failures
<jam> http://stackoverflow.com/questions/13850191/mongodb-set-user-password-to-access-to-db
<jam> hatch: other things to check "what is ps -ef | grep mongod" look like
<jam> what happens when you "juju destroy-environment -e local" and then check ps again
<jam> (i'm wondering if you have a rogue mongodb process that for some reason isn't being stopped like it should)
<jam> anyway, sleepytime
<hatch> jam no change, 1 mongo process running
<hatch> jam thanks for your help! have a good night
<natefinch> hatch: I'm doing some investigating on this end.  trying to see if we changed some of the mongo login code or otherwise changed our form of access to mongo
<natefinch> hatch - saw you and jam talking about mongo versions... just curious, you don't have to do it yet, but would it be a problem to upgrade 2.4.6?   I'm not convinced it'll fix anything, just want to get the lay of the land.
<hatch> natefinch nope no problem at all, it's just my testing box
<hatch> normal repos don't show a 2.4.6 version
<natefinch> hatch: hmmm I must have a ppa for mongo
<natefinch> 2.4.6 is their most recent stable version (from just a few months ago)
<hatch> looking for another ppa
<hatch> natefinch here is my dpkg output https://gist.github.com/hatched/b2e72169891505be29d4 for mongo*
<natefinch> ahh yeah, that's what it was, mongodb-10gen
<hatch> so I'm all up to date then it looks like
<natefinch> yep, seems like it.  good
<hatch> basically I'm at the point where it looks like a R&R on Juju and Mongo is in order
<hatch> since no one else is having these issues heh
<natefinch> yeah, it's weird.  It would be good for us to understand what happened so we can make sure it won't happen again, though.
<hatch> yeah well I'm available to help but I will need to get it back working for tomorrow :)
<natefinch> I know the feeling
<natefinch> certainly, if there's a point where you need to bail and reset everything, just let me know and go for it.
<hatch> I can probably use ec2 tomorrow/Monday to give us some extra time
<natefinch> hatch: if we can't get it figured out, I'll talk to jam, and we'll decide if we will gain anything by making life more difficult for you :)
<hatch> haha
<hatch> sounds like a plan
<natefinch> hatch: very strange.... the code really looks totally right.... we create the db  with your admin secret as the password... unless the admin secret changed in the middle of bootstrap somehow, which seems impossible
<hatch> heh yeah
<natefinch> hatch: obviously *something* is going on
<thumper> morning
<natefinch> thumper: morning
<hatch> natefinch when I ps -ef | grep mongod there is a mongod instance running
<hatch> assuming that's supposed to be there
<hatch>  /usr/bin/mongod --auth --dbpath=/home/hatch/.juju/local/db
<natefinch> hatch: yeah, it should be running.  That looks correct.
<natefinch> thumper: hatch is having a problem bootstrapping the local provider after upgrading from the ppa to 1.16.4.  He's getting an unauthorized error from mongo during bootstrap
<hatch> https://gist.github.com/hatched/3e8a13af98250c236c9d this is the error
<thumper> weird...
<natefinch> hatch: had you had mongo or juju local running before the upgrade?
<hatch> natefinch nope
<hatch> I'm rebooting the machine now
<natefinch> k
<hatch> maybe that'll fix it heh
<thumper> hatch: are you bootstrapping using sudo?
<natefinch> thumper: it won't let you bootstrap without sudo
<hatch> thumper yes `sudo juju bootstrap` and 'local' is my default
<thumper> hatch: if you go 'which juju' what does it say?
<hatch>  /usr/bin/juju
<hatch> ok the machine has been rebooted
<hatch> and....
<hatch> *drumroll*
<hatch> .....longer drumrolll
<natefinch> heh
<natefinch> if that works, I'll eat my hat
<hatch> bootstrapped
<thumper> haha
<hatch> ....w t h
<natefinch> haha
<natefinch> good thing I don't actually have a hat
<hatch> lol
<thumper> natefinch: I'll buy one for you
<natefinch> thumper: that's very kind of you
<thumper> natefinch: np
<hatch> so...wow...why? heh
<hatch> I even killed and restarted mongo
<thumper> ours is not to reason why, ours is just to reboot
<hatch> haha
<hatch> damn....that took so long to debug
<hatch> I guess we should have just listened to the IT Crowd
<hatch> "did you turn it off and back on again"
<natefinch> step 1: reboot
<hatch> well thanks everyone for their help haha
<natefinch> I'm glad it's fixed.  I can't even... I don't...
<hatch> well honestly....I even killed and restarted mongo and it didn't help
<hatch> so I have no idea what else was stuck in there causing it to break
<natefinch> Yeah, that's the thing, I'd think killing mongo would have to be the same as rebooting
<thumper> hmm...
 * thumper shrugs
<natefinch> Whatever, I'm just glad I could be here to fix it for you ;)
<thumper> wallyworld: I'm heading to the gym, if you are happy with my changes, can I get you to approve the MP so it lands?
<thumper> cheers
<wallyworld> thumper: done
#juju-dev 2013-12-06
<thumper> wallyworld: ta
<wallyworld> np
<wallyworld> thumper: one other thing i was going to do but too late probably
<thumper> yeah?
<wallyworld> the fmt.Sprintf() thing could be wrapped around the entire arg string
<wallyworld> not just the one param
<wallyworld> saves string concat etc
<wallyworld> a bit neater maybe
<thumper> not easily...
<thumper> and I'm not sure it'd end up being neater
<thumper> but not a big deal
<wallyworld> or even just the chubks
<wallyworld> chunks
<wallyworld> yeah, not a big deal
<thumper> hazinhell: still in hell?
<thumper> hazinhell: keen to come to Dunedin?
<hazinhell> thumper, yes and hell yes ;-)
 * thumper nods
<hazinhell> thumper, i've got some family thing i'm trying to work out that's got some minor overlap, but shouldn't be an issue
<thumper> ok
<davecheney> sinzui still around ?
<axw> thumper: https://bugs.launchpad.net/juju-core/+bug/1258132
<_mup_> Bug #1258132: [manual] bootstrap fails due to juju-db not starting <juju-core:Triaged> <https://launchpad.net/bugs/1258132>
<axw> did you know about the reload-config option? consider it for local provider?
<thumper> axw: yeah...
<thumper> no, what is it?
<axw> tells upstart to reload config from /etc/init
<axw> I'm not sure if that's actually solving it, or just adding enough of a delay
<axw> i.e. is it synchronous, or is it just scheduling another update that'll race
<axw> anyway, was just wondering if you knew
<thumper> axw: well, there is some race there that I've not solved yet
<thumper> would be great if this did solve it
<thumper> abentley has a machine that fails about 50% of the time
<thumper> so we could always use him as a tester :)
 * thumper fixes another test isolation failure
<thumper> GAH, WTF
<thumper> hitting so many weird test failures
<thumper> intermittent ones
<thumper> this one in manual provider
<thumper> environ_test.go:121:
<thumper>     c.Assert(err, gc.ErrorMatches, "exit code 99")
<thumper> ... error string = "failed to write input: write |1: broken pipe (output: \"JUJU-RC: 99\")"
<thumper> ... regex string = "exit code 99"
<thumper> huh?
<thumper> and getting the provisioner not always making machines
<thumper> two different tests failed due to that
<thumper> in different places
<axw> what the
<axw> thumper: I'm changing some code in that vicinity, will look into it later today
<thumper> kk
<axw> thumper: my hypothesis is that upstart is seeing the file half-way through writing it (in place)
<axw> gonna see if I can force reproduce it
<thumper> axw: sounds possible
<axw> in that case a touch would be sufficient
<axw> reload-configuration would reload everything
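(If the half-written-file hypothesis is right, a common fix is to write the upstart conf to a temp file in the same directory and rename it into place: rename within a filesystem is atomic, so anything watching /etc/init only ever sees a complete file. A generic sketch, not the actual juju upstart code.)

    package main

    import (
        "io/ioutil"
        "log"
        "os"
        "path/filepath"
    )

    // writeFileAtomically writes data to a temp file in the target's
    // directory and renames it over path, so readers never see a partial file.
    func writeFileAtomically(path string, data []byte, perm os.FileMode) error {
        dir := filepath.Dir(path)
        tmp, err := ioutil.TempFile(dir, ".tmp-")
        if err != nil {
            return err
        }
        defer os.Remove(tmp.Name()) // no-op if the rename succeeds
        if _, err := tmp.Write(data); err != nil {
            tmp.Close()
            return err
        }
        if err := tmp.Chmod(perm); err != nil {
            tmp.Close()
            return err
        }
        if err := tmp.Close(); err != nil {
            return err
        }
        return os.Rename(tmp.Name(), path)
    }

    func main() {
        conf := []byte("description \"juju agent\"\nstart on runlevel [2345]\n")
        if err := writeFileAtomically("/tmp/jujud-machine-0.conf", conf, 0644); err != nil {
            log.Fatal(err)
        }
    }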
<davecheney> ummm
<davecheney> % juju deploy cs:precise/haproxy-67
<davecheney> ERROR cannot get charm: charm not found: cs:precise/haproxy-67
<davecheney> ^ what has happened here
<davecheney> this revision absolutely does exist
<thumper> your syntax is wrong
<thumper> surely
<thumper> haproxy-67 isn't the name of a charm
<davecheney> In all cases, a versioned charm URL will be expanded as expected (for example,
<davecheney> mysql-33 becomes cs:precise/mysql-33).
<davecheney> do I need to leave off the cs:precise ?
<davecheney> % juju deploy haproxy-67
<davecheney> ERROR cannot get charm: charm not found: cs:precise/haproxy-67
<davecheney> thumper: this is sort of serious
<davecheney> cts have hit a bug and need to use an older revision of the charm
 * thumper looks
<davecheney> oh god
<davecheney> this is this fucking revno SHIT
<davecheney> thumper: sorry
<davecheney> its not a juju problem
<davecheney> the charm revision is unrelated to the bzr revision
<wallyworld> thumper: got time for a quick hangout?
<thumper> wallyworld: ah... yeah
<thumper> was just writing an email
<wallyworld> ok, ping me when done
<thumper> how short?
<wallyworld> i can wait
<thumper> school run is in 20 min
<thumper> can you wait 30 minutes?
<wallyworld> ok
<thumper> ok
<thumper> thanks
<wallyworld> fwereade_: wtf are you doing awake?
<thumper> wallyworld: https://plus.google.com/hangouts/_/7acpilj94h6vnkfr0qilf02c5k?hl=en
<axw> davecheney: isn't the bootstrap error message logged higher up?
<davecheney> axw: in a word, no
<davecheney> or at least not without -v
<davecheney> axw: https://bugs.launchpad.net/juju-core/+bug/1258246
<_mup_> Bug #1258246: azure: juju bootstrap aborted for no apparent reason <azure-provider> <bootstrap> <ci> <intermittent-failure> <openstack-provider> <juju-core:In Progress by dave-cheney> <https://launchpad.net/bugs/1258246>
<axw> davecheney: and with the additional log, don't you just see another "exit status 1" log message?
<davecheney> axw: dunno
<davecheney> i don't have control over that environment
<axw> davecheney: I think the problem is that the error message we're showing isn't useful, not that it's not there
<axw> but I may be mistaken
<axw> ERROR exit status 1
<axw> I think is from the ssh process
<davecheney> i think that is coming from the jenkins builder
<davecheney> axw: i'm suspecting the underlying cause is a timeout
<wallyworld> bigjools: we need your advice. https://plus.google.com/hangouts/_/7acpilj94h6vnkfr0qilf02c5k?hl=en
<wallyworld> btw it's work naked day today
<thumper> citation needed
<wallyworld> http://www.google.com.au/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CC8QqQIwAA&url=http%3A%2F%2Fwww.news.com.au%2Fbusiness%2Fworklife%2Ffriday-6-december-is-work-in-the-nude-day%2Fstory-e6frfm9r-1226776444086&ei=rDihUuPvMoXskgWp44GQAQ&usg=AFQjCNExndNUWYGv4BBhrcIseRATwSXiJg&bvm=bv.57155469,d.dGI
<wallyworld> bah. google sucks http://www.news.com.au/business/worklife/friday-6-december-is-work-in-the-nude-day/story-e6frfm9r-1226776444086
<thumper> wallyworld: I'll just go make a coffee to sup while doing your reviews
<thumper> axw: do you have any you'd like me to look at too?
<wallyworld> ok
<axw> thumper: ummm no not right now
<axw> thanks
<axw> I may do later on
<axw> thumper: I'm going to look at fixing https://bugs.launchpad.net/juju-core/+bug/1167616 -- keeps tripping me up in tests
<_mup_> Bug #1167616: state.SetEnvironConfig should take old and new config values <config> <state> <juju-core:Triaged> <https://launchpad.net/bugs/1167616>
<axw> actually, I'm not even sure what's right there. hrm
<wallyworld> thumper: the code doesn't allow 2 keys with the same comment to be added. but i guess they could be added manually. there's no go code for fingerprint that i know of so may have to shell out to get it
<thumper> hmm...
<thumper> Perhaps we just document the limitation
<thumper> for now
<wallyworld> i did think about using fingerprint
<wallyworld> i'll have another look at the ssh package to see if there's anthing i missed
<thumper> kk
<jam> hatch: there is a 2.4.6 in the cloud-archive:tools archive (vs the stable ppa)
<hatch> jam goodmorning :)
<hatch> it's all working now
<jam> hatch: so from the traceback, you had a mongod that was serving your .juju/local/db directory
<jam> which because of how it got shutdown
<jam> wasn't getting stopped by "juju destroy-environment"
<jam> "some other thing was"
<jam> and then every time you bootstrapped
<jam> we would try to start a new one
<jam> which was probably failing, but we didn't notice because hey, there is something listening on 37017 for us
<jam> hatch: *probably* if you had killed the mongod it would have cleaned things up, too.
<jam> as there *shouldn't* be a mongod running if nothing is bootstrapped
<jam> hatch: anyway, I'm glad you got it working.
<hatch> ahhh - so is this something that we could 'fix' - maybe simply by creating a better error message?
<axw_> rogpeppe2: thanks for the review. just replied to your comment; would appreciate another look before I land
<rogpeppe2> axw_: ok
<rogpeppe2> axw_: the overwrite will come, i think, from the fact that we call Update on *all* the attributes
<rogpeppe2> axw_: so "set-environment foo=a" will also update bar to the value that was previously seen
<axw_> rogpeppe2: no, because it doesn't change between the load and the store
<axw_> wait
<axw_> no you're right
<axw_> :)
<rogpeppe2> axw_: :-)
<rogpeppe2> axw_: we could change settings so that it only made the update if things had actually changed, but i think that has more likelihood of leading to problems in practice
<axw_> rogpeppe2: I did start down the road of doing reflect.DeepEquals to check if things had changed, but was wary of changing things too much there
<rogpeppe2> axw_: i thought about that, don't think it actually helps that much
<rogpeppe2> axw_: we can still end up with an invalid config
<rogpeppe2> axw_: that's why i think that something like what i suggested is the only decent way forward, as it allows the config to be verified as a whole, while still allowing atomic delta changes
<rogpeppe2> axw_: but as i said, i'm ok with what you've done for the time being - it is an improvement, and it unblocks the other CL
<axw_> rogpeppe2: and that would need to hook into the provider right? along the lines of the prechecker?
<axw_> rogpeppe2: yep thanks, I'll land as is and leave the bug open
<rogpeppe2> axw_: yeah - the function would need to call EnvironProvider.Verify
<rogpeppe2> axw_: to my shame, i haven't actually looked at the prechecker
<axw_> rogpeppe2: that's ok, it's not actually switched on yet :)
<axw_> but it does essentially that, for machine/container additions
<rogpeppe2> mgz: how close are you to landing the status changes?
<rogpeppe> mgz: i find myself wanting better status info in the juju-restore command
<dimitern> rogpeppe, mgz, fwereade_, TheMue, wallyworld, standup
<dimitern> mgz, ?
 * dimitern lunch
<dimitern> rogpeppe, you've got a review on https://codereview.appspot.com/37850043/
<rogpeppe> dimitern: thanks!
<rogpeppe> dimitern: FWIW I aim to remove the current state.AddMachine entirely
<rogpeppe> dimitern: it's only ever used in tests
<dimitern> rogpeppe, ah, then perhaps rename it to AddTestingMachine? :)
<rogpeppe> dimitern: i don't want to do that in this CL because there are hundreds of calls to it
<rogpeppe> dimitern: hence i left it alone
<dimitern> rogpeppe, so it merits mentioning in the CL description as well then
<rogpeppe> dimitern: good point - i'll do that
<dimitern> rogpeppe, i'd appreciate if you can take a look at this (still wip, just checking i'm on the right track): (only the file upload part done and tested, no state operations yet) https://codereview.appspot.com/38370043
<rogpeppe> dimitern: looking
<rogpeppe> dimitern: reviewed
<dimitern> rogpeppe, cheers!
<dpb1> Is this an old error that is now reappearing?  dpb@helo:trunk$ juju deploy --repository dev/charms local:raring/landscape-devenv
<dpb1> ERROR cannot bundle charm: symlink "." links out of charm: "../precise/landscape-devenv/"
<dimitern> rogpeppe, i can't see a way to verify the content type of the uploaded file before reading it
<dimitern> dpb1, i've never seen this one - is it from juju-core or pyjuju?
<dpb1> dimitern: juju-core.  I've seen it before with other types of symlinks in the repository path.  Thing is, I've been launching units like this just fine for multiple weeks on juju-core, and now it just breaks.
<dimitern> dpb1, is it a charm dir or an archive?
<rogpeppe>  dimitern: it's in Part.Header
<dimitern> rogpeppe, i'll try that
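(A sketch of what that looks like, with made-up handler names: the declared type of an uploaded part is available in Part.Header, so it can be checked before any of the body is read.)

    package main

    import (
        "fmt"
        "io"
        "io/ioutil"
        "net/http"
    )

    func charmsUploadHandler(w http.ResponseWriter, r *http.Request) {
        mr, err := r.MultipartReader()
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        for {
            part, err := mr.NextPart()
            if err == io.EOF {
                break
            }
            if err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            // Check the declared content type before touching the body.
            if ct := part.Header.Get("Content-Type"); ct != "application/zip" {
                http.Error(w, fmt.Sprintf("unexpected content type %q", ct), http.StatusBadRequest)
                return
            }
            // ...store the archive; discarded here to keep the sketch short.
            io.Copy(ioutil.Discard, part)
            part.Close()
        }
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        http.HandleFunc("/charms", charmsUploadHandler)
        http.ListenAndServe(":8080", nil)
    }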
<dpb1> dimitern: this symlink points to an charm directory.  let me put up a paste
<dpb1> dimitern: http://paste.ubuntu.com/6530321/
<dimitern> dpb1, can you paste a find . instead in the charms dir please?
<dpb1> dimitern: http://paste.ubuntu.com/6530335/
<dimitern> dpb1, weird.. the error comes from checkSymlinkTarget in charm/bundle.go and that's been there for quite a while
<dpb1> I'm racking my brain to think about what changed.  so far I have nothing.
<dimitern> dpb1, maybe the symlinks are somehow strange? i can see // in the targets, instead of / - have you changed that?
<dpb1> checking
<dpb1> there is a trailing slash, let me try removing that
<dpb1> dimitern: no, the trailing slash is there from ls -lF, nm  The symlink looks normal and hasn't changed.
<dimitern> dpb1, what's your go version?
<dpb1> dimitern: you mean juju-core version?
<dpb1> 1.16.4-saucy-amd64
<dimitern> dpb1, no, go version
<dpb1> go version go1.1.2 linux/amd64
<dimitern> dpb1, or you're using the binary and not building from source?
<dpb1> binary
<dpb1> yes, from the ppa
<dimitern> sorry, i don't seem to grok how it worked before.. it shouldn't have, and you should have gotten the same error earlier
<dpb1> dimitern: ok  I'll keep digging into it
<dimitern> dpb1, sorry i couldn't be more helpful
<dpb1> dimitern: thx for the effort.  I'm stumped as well. :)
<rogpeppe> does anyone know how to get tip bootstrapping correctly?
<rogpeppe> i get "cannot find bootstrap tools: XML syntax error on line 9: element <hr> closed by </body>"
<rogpeppe> i thought that would have been fixed a while ago
<rogpeppe> this bug says it should have been fixed: https://bugs.launchpad.net/juju-core/+bug/1254401
<_mup_> Bug #1254401: error reading from streams.canonical.com <bootstrap> <juju-core:Triaged> <https://launchpad.net/bugs/1254401>
<rogpeppe> i guess i'll just have to make do with uploading tools
<mgz> you have to upload-tools, yeah
<mgz> rogpeppe: I think you can also supply tools-metadata-url
<rogpeppe> mgz: you're alive! :-)
<mgz> if you've put the tools+config bits in local swift say
<mgz> rogpeppe: yeah, sorry I wasn't around earlier
<rogpeppe> mgz: ah, do you know what tools-metadata-url might work?
<rogpeppe> mgz: np, just wondered if anything was up
<mgz> sec, checking
<mgz> rogpeppe: https://swift.canonistack.canonical.com/v1/AUTH_526ad877f3e3464589dc1145dfeaac60/juju-dist/tools may work, checking
<rogpeppe> mgz: even from ec2?
<mgz> hm, good question
<mgz> well, to do tip you'd want to build/sync-tools/juju-metadata regardless, to your own s3/swift bucket
<mgz> heh, and generate-tools is borked by the same bug
<mgz> so, I guess --upload-tools *is* the only sane option
<rogpeppe> mgz: :-(
<rogpeppe> mgz: how can we have left tip broken for so long? this is not good.
 * rogpeppe is tempted to escalate the bug to Critical
<mgz> well, it's mostly that we have testing stuff that doesn't work in a place trunk expects to be sane
<rogpeppe> ha, bootstrap is totally broken for me currently
<rogpeppe> or for any client where it takes more than 5 seconds to connect to the bootstrap machine
<rogpeppe> (it takes me 15 seconds to connect to any ec2 instance because of my horribly broken dns)
<rogpeppe> https://bugs.launchpad.net/juju-core/+bug/1258607
<_mup_> Bug #1258607: bootstrap doesn't work if Dial takes more than 5 seconds <juju-core:New> <https://launchpad.net/bugs/1258607>
 * rogpeppe is done
<rogpeppe> happy weekends all
<mgz> and have a nice weekend rog
<rogpeppe> mgz: you too mate
#juju-dev 2013-12-08
<fwereade_> wallyworld, ping
<wallyworld> hey
<fwereade_> wallyworld, heyhey
<thumper> o/
<fwereade_> wallyworld, can we chat about the credentials stuff in about 5 mins? I just rolled a ciggie
<thumper> fwereade_: if you have a few minutes, would like to talk about juju-run
<wallyworld> sure
<fwereade_> thumper, I would love to, can I do ian first please? ;p
<thumper> fwereade_: yes, do ian first
 * thumper watches
<fwereade_> wallyworld, so, I was wondering about the auth in particular there
<fwereade_> wallyworld, it's currently only useful for environ provisioners, right?
<fwereade_> thumper, ok, wallyworld just disappeared
<thumper> fwereade_: as in you are done with him, or expecting him back?
<fwereade_> thumper, well, I expect he'll come back but I can't see him in here
<fwereade_> thumper, so I'm happy to talk juju-run instead for now
<thumper> fwereade_: hangout?
<thumper> fwereade_: https://plus.google.com/hangouts/_/7ecpi8fbhg89tcdmnrdsfs8lng?hl=en
<wallyworld> fwereade_: sorry, network died
<fwereade_> wallyworld, sorry, just joined tim
<wallyworld> ok, i'll wait
<fwereade_> wallyworld, heyhey
<wallyworld> hi
<fwereade_> wallyworld, https://plus.google.com/hangouts/_/7acpi6csd2a2uo69fjve75rg3k?hl=en
#juju-dev 2014-12-01
<thumper> davecheney: command doesn't depend on version.Version
<thumper> I was thinking about adding some automatic deprecation checks
<thumper> based on version
<thumper> but I don't want to introduce the dependency
<thumper> so I'll just go for 'deprecated bool'
<thumper> oh...
<thumper> you suggestion had me thinking...
 * thumper will ponder for a bit
<davecheney> thumper: wanna do the 1 on 1 now
<thumper> davecheney: sure
<davecheney> no reason to delay
<davecheney> i'll see you in the hangout
<thumper> kk
<thumper> waigani: one to one?
<waigani> thumper: ah, I'm in the wrong chan
<waigani> thumper: my google cal is crashing for me - can you paste me the link please?
<menn0> thumper, waigani: auto multi-env filtering: http://reviews.vapour.ws/r/551/diff/
<thumper> menn0: awesome, looking now
<thumper> menn0: hey, the diff fits on one page \o/
<davecheney> +1
<menn0> cool :)
<waigani> thumper, menn0, davecheney: we have tests! http://reviews.vapour.ws/r/474/
<davecheney> type ListMode bool
<davecheney> var ( FullKeys     ListMode = true Fingerprints ListMode = false
<davecheney> seriously
<davecheney> this is an api parameter
<davecheney> it has two states
<davecheney> FullKeys or Fingerprints
<davecheney> it is not possible to add a new state
<davecheney> and when the gui calls it
<davecheney> it has to say { Mode: true }
<davecheney> What the F
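(The alternative davecheney seems to be arguing for, sketched: a named string type with explicit values serialises readably over the API, {"Mode": "fingerprints"} rather than {"Mode": true}, and leaves room for a third mode later. Names below are illustrative, not the actual juju params.)

    package main

    import "fmt"

    type ListMode string

    const (
        FullKeys     ListMode = "full"
        Fingerprints ListMode = "fingerprints"
    )

    // ListKeysArgs is what a GUI or CLI client would send over the wire.
    type ListKeysArgs struct {
        Mode ListMode `json:"mode"`
    }

    func main() {
        fmt.Printf("%+v\n", ListKeysArgs{Mode: Fingerprints})
    }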
<waigani> menn0: http://reviews.vapour.ws/r/551/
<waigani> menn0: just realised I didn't click 'publish'
<axw> wallyworld_: I'm working this afternoon, as I've got a school assembly to go to on Thursday morning and a school workshop on Friday morning. let me know if you need anything...
<wallyworld_> axw: will do, thanks
<wallyworld_> i'm going to ping andres later to set up a meeting for this week
<davecheney> 2/win11
<wallyworld_> axw: thanks for reviewing the gwacl mp, i just went to do it and it had been done \o/
<axw> wallyworld_: no worries
<wallyworld_> dimitern: thanks for looking at those networking bugs
<dimitern> wallyworld_, np, I can do more if we had the lshw dump
<wallyworld_> yeah, hopefully he can provide it
<jam1> voidspace: standup ?
<voidspace> oops
<voidspace> jam1: omw
<hazmat> axw, the lshw /block dev question on the list brings up an interesting question.. if i do have say extra physical block devs on one unit but not others, would i see them? ie. storage spec is at a service level. thanks for the review and catch on azure regions
<hazmat> axw, hmm.. the microsoft guy pointed out a new api which does a limited form of the static map we're doing currently for azure http://msdn.microsoft.com/en-us/library/azure/gg441293.aspx
<hazmat> fwiw
<hazmat> jam, ping
<hazmat> dimitern, out of curiosity what is the logic for juju-br0 selection?
 * hazmat is looking at the lshw output
<dimitern> hazmat, the first network device or the one with network:0 index in the lshw xml dump
<hazmat> dimitern, aha.. default 0 behavior perhaps.. eth2 is mellanox with id="network"
<dimitern> hazmat, can I see the dump?
<hazmat> dimitern, its in the bug
<hazmat> dimitern, https://bugs.launchpad.net/juju-core/+bug/1395908/+attachment/4267056/+files/1395908.tar.gz
<mup> Bug #1395908: LXC containers in pending state due to juju-br0 misconfiguration <lxc> <oil> <juju-core:Triaged> <https://launchpad.net/bugs/1395908>
<dimitern> hazmat, these are just logs, no lshw xml output
<hazmat> hmm
<hazmat> dimitern, right bug, wrong attachment.. https://bugs.launchpad.net/juju-core/+bug/1395908/+attachment/4271935/+files/lshw.tar.gz
<mup> Bug #1395908: LXC containers in pending state due to juju-br0 misconfiguration <lxc> <oil> <juju-core:Triaged> <https://launchpad.net/bugs/1395908>
<dimitern> hazmat, ah, I see the issue - the logic should skip network devices marked with disabled="true"
<wallyworld_> hazmat: you free for the meeting about zone constraints later in your day?
<hazmat> wallyworld_, yeah.. just rearrange the cal a bit for it
<hazmat> er..i just rearranged
<wallyworld_> hazmat: thanks, hard to find a time to suit aus, eng and us
<hazmat> wallyworld_, np. thanks for setting it up.
<wallyworld_> sure
<voidspace> dimitern: do you think IP address must have a Refresh method? We don't need it currently. (We did need it for Subnet.)
<dimitern> voidspace, no need for Refresh
<dimitern> voidspace, there won't be a life field; the only change that can happen is to set the state to "unavailable" or "allocated"
<voidspace> dimitern: yep, thanks
<dimitern> voidspace, but those transitions can be handled internally with proper asserts
<voidspace> dimitern: so UpdateState should take an IPAddress (rather than a string value) and assert that the state has not changed in the transaction that sets the new state?
<dimitern> voidspace, yeah, so if that assert fails, we can reload the doc from state and retry
 * fwereade_ lunch, walk
<dimitern> voidspace, I'd suggest using SetStatus or SetState rather than UpdateState (for consistency with other state objects)
<voidspace> dimitern: if we're just going to retry, why not just set the new state directly
<voidspace> dimitern: (yes on name)
<voidspace> dimitern: all we need to assert is that the ip address still exists
<voidspace> dimitern: I would only check state if we're going to fail because of stale data
<dimitern> voidspace, we need to retry in principle, because we assume stale data
<voidspace> dimitern: if we're only setting a new state - and we will force that to succeed by retrying - why assert the old state matches?
<voidspace> dimitern: we don't care if it matches, we're setting the new state *anyway*
<dimitern> voidspace,  if the address state goes only forward (like life) we need to ensure we won't set a state which is impossible
<voidspace> dimitern: can it only go forward?
<voidspace> dimitern: I guess it should only go from "" to "allocated" or "unavailable"
<voidspace> dimitern: but in which case we shouldn't retry, but fail
<dimitern> voidspace, shouldn't we allow Unknown -> Allocated|Unavailable  only?
<voidspace> dimitern: right - so we *shouldn't* retry then, we should fail
<voidspace> and assert that State is Unknown
<dimitern> voidspace, let's think how it'll go
<voidspace> we pick a new address - initially created with State Unknown
<voidspace> then we attempt to allocate and it goes to Allocated or Unavailable
<dimitern> voidspace, new address -> set to unknown (i.e. "reserve" a doc for it until we decide the final state for that address)
<voidspace> dimitern: until we start using this model for "other addresses"
<dimitern> voidspace, existing unknown address -> set to allocated or unavailable, but fail if it's already set
<voidspace> initially that's the only use case
<dimitern> voidspace, *or* don't fail if the state is already the one we want to set
<voidspace> ah, ok - so the assert is State == Unknown || NewState
<dimitern> voidspace, yeah, I think so
<voidspace> dimitern: do you know of an existing assert that is similar, so I can see how to implement that?
<voidspace> I can just grep through state to see though
<dimitern> voidspace, it's easy - you can combine assert
<dimitern> voidspace, just a sec
<dimitern> voidspace, unknownOrSame := bson.D{{"state", bson.D{{"$nin", []string{"unknown", newState}}}}}
<voidspace> dimitern: thanks!
<dimitern> $nin=not in list
<dimitern> voidspace, np
<voidspace> dimitern: so that asserts that the state *is* one of the listed ones?
<dimitern> voidspace, yes -- we have a cached view of the data, and as long as this holds we're good, otherwise (assert fails) reload and redo
<dimitern> (if it makes sense to)
<voidspace> cool, thanks
<voidspace> "not in list" is an odd name for an assert that does the opposite (assert that the specified field *is* one of the ones in the list)
<dimitern> voidspace, ah, sorry :)
<dimitern> voidspace, you're right - it should've been "$in"
<voidspace> dimitern: :-)
<voidspace> dimitern: and SetState should be an IPAddress method rather than State.SetIPAddressState ?
<voidspace> we will always have an IPAddress when we need to use it, so it makes sense - I'd just assumed a State method though
<voidspace> (for no good reason it seems)
<dimitern> voidspace, yes, that's the usual case
<dimitern> voidspace, well - we create an "unknown" address, then update it to allocated or unavailable
<voidspace> dimitern: yep
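(Putting dimitern's assert into a concrete shape, with "$in" as voidspace points out, and with the collection name and document id as stand-ins rather than juju's real state internals.)

    package main

    import (
        "fmt"

        "gopkg.in/mgo.v2/bson"
        "gopkg.in/mgo.v2/txn"
    )

    // setStateOps updates an IP address document's state, but only if the
    // current state is still "unknown" or already the requested state.
    func setStateOps(docID, newState string) []txn.Op {
        unknownOrSame := bson.D{{"state", bson.D{{"$in", []string{"unknown", newState}}}}}
        return []txn.Op{{
            C:      "ipaddresses", // assumed collection name
            Id:     docID,
            Assert: unknownOrSame,
            Update: bson.D{{"$set", bson.D{{"state", newState}}}},
        }}
    }

    func main() {
        fmt.Printf("%+v\n", setStateOps("192.168.1.42", "allocated"))
    }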
<axw> hazmat: well, I had intended for storage specification to be at the unit level (specifiable at service level, which would cascade/merge down)
<axw> hazmat: as for the Azure API, doesn't look like it has much useful info in it
<axw> I mean, useful to us...
 * axw goes afk again
<hazmat> axw, g'night
<perrito666> brb
<abentley> sinzui: Industrial testing on AWS has been failing because there's no subnet for AZ us-east-1e: https://pastebin.canonical.com/121381/
<mattyw> has anyone seen fwereade_ ?
<fwereade_> mattyw, I'm around, did I miss a ping?
<mattyw> fwereade_, you might have done - but I'll forgive it
<natefinch> perrito666: standup?
<perrito666> ngetting there trying to find my headphones
<voidspace> dimitern: should we require IP address type to be provided in the IPAddressInfo (argument to AddIPAddress)
<voidspace> dimitern: or should we infer it from the actual address provided?
<voidspace> dimitern: I'm inclined to infer it and remove it from IPAddressInfo
<dimitern> voidspace, hmm.. why not take a network.Address instead?
<voidspace> dimitern: ah, fair point
<dimitern> voidspace, it has all the needed fields already, we just need to convert it from network.Address into a doc
<voidspace> dimitern: ok, better do some reworking
<voidspace> dimitern: yep, good point
<voidspace> dimitern: should I still validate the IP address?
<voidspace> dimitern: I guess not
<voidspace> dimitern: the only thing that prevents is *creating* an address in anything but an Unknown state
<voidspace> dimitern: but if we ever have that use case it's easy to just call SetState after creation
<dimitern> voidspace, I think we should validate it
<voidspace> dimitern: ok
<dimitern> voidspace, before storing it in state at least
<voidspace> dimitern: yep
<dimitern> voidspace, in the network package there are some validation methods I think you can use
<voidspace> dimitern: net.ParseIP is easy enough
<voidspace> I don't think I need anything more than that, but I can take a look
<dimitern> voidspace, ok - I've just looked and there are no special validation methods for network.Address
<voidspace> dimitern: I couldn't find any either... :-)
<sinzui> natefinch, Can you get people to look into bug 1397376 and bug 1397995?
<mup> Bug #1397376: maas provider: 1.21b3 removes ip from api-endpoints <api> <cloud-installer> <landscape> <maas-provider> <juju-core:Triaged> <juju-core 1.21:Triaged> <https://launchpad.net/bugs/1397376>
<mup> Bug #1397995: 1.21b3: juju status isn't filtering by service <landscape> <regression> <status> <juju-core:Triaged> <juju-core 1.21:Triaged> <https://launchpad.net/bugs/1397995>
<sinzui> natefinch, bug 1396981 is also critical, but needs discussion about scope. I am going to update the bug with my understanding of the issue
<mup> Bug #1396981: Upgrade fails with tools sha mismatch because product json is renamed <ci> <regression> <upgrade-juju> <juju-core:Triaged by wallyworld> <https://launchpad.net/bugs/1396981>
 * perrito666 notices that google tells him that is going to upgrade his calendar ... nothing changes
<natefinch> sinzui: ok
<natefinch> fwereade_: I was thinking, since I just used the skeleton provider to create the outline for the gce provider, maybe I should just copy it to make a real skeleton provider..... since I had to do a lot of work to get the skeleton provider to compile.
<fwereade_> natefinch, +1000
<fwereade_> natefinch, it seriously sucks that I got distracted and never merged that
<fwereade_> natefinch, getting it into the source tree would prevent it from being such a horrible hassle again
<natefinch> fwereade_: that's what I was thinking.  I'll make a separate branch for it, shouldn't be too much work
<fwereade_> natefinch, tyvm
<alexisb> fwereade_, sorry, will be there shortly
<fwereade_> alexisb, np
<voidspace> dimitern: ooh, if we take a network.Address it doesn't allow us to set a MachineId, SubnetId or InterfaceId
<voidspace> dimitern: shall I provide setters for those?
<voidspace> dimitern: especially subnet id we want
<dimitern> voidspace, hmm.. we can also take them as args
<voidspace> dimitern: for SubnetId that makes sense
<voidspace> dimitern: as we will always (initially) want it
<voidspace> dimitern: MachineId and InterfaceId that will be a nuisance as we will rarely want to set them initially
<dimitern> voidspace, yeah, and the rest (esp. machine id) won't be known at first
<voidspace> dimitern: shall I add setter methods for those two?
<dimitern> voidspace, let's have setters yeah
<voidspace> dimitern: cool
<dimitern> voidspace, cheers
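(Roughly the shape being agreed on here, with a stand-in Address type rather than juju's network.Address: the address and subnet id come in up front and the value is validated with net.ParseIP, the machine and interface ids get setters, and the record starts out in the "unknown" state. A sketch only.)

    package main

    import (
        "fmt"
        "net"
    )

    // Address stands in for network.Address in this sketch.
    type Address struct {
        Value string
        Type  string
    }

    type IPAddress struct {
        addr        Address
        subnetID    string
        machineID   string
        interfaceID string
        state       string
    }

    // AddIPAddress validates the address and creates it in the "unknown" state.
    func AddIPAddress(addr Address, subnetID string) (*IPAddress, error) {
        if net.ParseIP(addr.Value) == nil {
            return nil, fmt.Errorf("invalid IP address %q", addr.Value)
        }
        return &IPAddress{addr: addr, subnetID: subnetID, state: "unknown"}, nil
    }

    // The machine and interface are usually not known at creation time.
    func (a *IPAddress) SetMachineID(id string)   { a.machineID = id }
    func (a *IPAddress) SetInterfaceID(id string) { a.interfaceID = id }

    func main() {
        addr, err := AddIPAddress(Address{Value: "10.0.0.4", Type: "ipv4"}, "subnet-0")
        if err != nil {
            fmt.Println(err)
            return
        }
        addr.SetMachineID("0")
        fmt.Printf("%+v\n", *addr)
    }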
<voidspace> dimitern: hmmm... Dave Cheney said that we shouldn't add new dependencies to state
<voidspace> dimitern: this PR adds network
<voidspace> dimitern: I assume that "low-level" dependencies like that are permitted
<dimitern> voidspace, good point
<dimitern> voidspace, but I think we already import network in state
<voidspace> dimitern: nope, it's showing in the diff
<dimitern> voidspace, we do in some tests and in network.go
<dimitern> voidspace, and in address.go
<voidspace> ah...
<voidspace> dimitern: right, so it's not actually a new dependency
<voidspace> dimitern: I was just looking state/state.go
<voidspace> :-)
<dimitern> voidspace, :) whew..
<dimitern> voidspace, well, for one this sort of dependency makes sense to me
<voidspace> it's not a structural dependency which is the real problem
<voidspace> and state has to deal with networking concepts
<dimitern> exactly
<alexisb> natefinch, IBM interlock
<alexisb> jam, on and ready when you are
<voidspace> g'night all
<perrito666> gnight
<voidspace> o/
<sinzui> perrito666, natefinch , do either of you have time to review http://reviews.vapour.ws/r/554/diff/#
<perrito666> sinzui: done, nate is most likely not here
<sinzui> thank you perrito666
<perrito666> yw
<thumper> morning folks
<perrito666> thumper: hi
<thumper> sinzui: got a minute?
<thumper> sinzui: I'd like to discuss https://bugs.launchpad.net/juju-core/+bug/1397376
<mup> Bug #1397376: maas provider: 1.21b3 removes ip from api-endpoints <api> <cloud-installer> <landscape> <maas-provider> <juju-core:Triaged> <juju-core 1.21:Triaged> <https://launchpad.net/bugs/1397376>
<sinzui> hi thumper, I do
<thumper> sinzui: standup is about to start, can we talk after that?
<sinzui> yep
<thumper> cool
<sinzui> hazmat, are you preparing the branch to inc the gwacl dep in juju-core master? Should I get the other developers on it.
<thumper> sinzui: https://plus.google.com/hangouts/_/canonical.com/bug-1397376
<bac> hey marcoceppi, we talked a while back about you releasing a new version of amulet.  can you schedule that for soonish?
<davecheney> mwhudson: short version, there may be an announcement about the 1.5 cycle today
<davecheney> but people are distracted with fixing 1.4 for release
<mwhudson> davecheney: cool, thanks for chasing
<davecheney> regarding spinning wheels on arm64 deveopment
<davecheney> i understand
<davecheney> i'm in the same boat
<thumper> katco: you around?
<katco> thumper: yup, what's up?
<thumper> katco: seen https://bugs.launchpad.net/juju-core/+bug/1397995 ?
<mup> Bug #1397995: 1.21b3: juju status isn't filtering by service <landscape> <regression> <status> <juju-core:Triaged> <juju-core 1.21:Triaged> <https://launchpad.net/bugs/1397995>
<thumper> katco: thought of you because of your recent status changes
<katco> thumper: yeah i saw that; likely it's related
<katco> thumper: it looks like there was some kind of interplay between versions, but i just skimmed.
<thumper> katco: can you look actively at it?
<thumper> katco: it is blocking 1.21 release
<katco> thumper: sure, just wrapping up latest set of review changes for leadership
<thumper> kk
<thumper> thanks
<katco> thumper: thanks for pinging
<wallyworld_> katco: hi, can we delay by 15 minutes? i'm in another meeting
<katco> wallyworld_: sure thing
<wallyworld_> ty
<wallyworld_> katco: hi, i'm finally free
<katco> wallyworld_: ok, hopping on
<thumper> mwhudson: want to catch up now or soonish?
<mwhudson> thumper: would about 20 mins work?
<thumper> sure
<mwhudson> cool
<mwhudson> thumper: long 20 mins sorry, ready now
<thumper> mwhudson: our 1:1 hangout?
<mwhudson> thumper: sure, trying to find it now :)
<mwhudson> ok, there
<wallyworld_> hazmat: you've got bug 1389422 as in progress on 1.22. i'm just updating 1.21 branch to pull in your new gwacl branch. i can do the same for 1.22 if you want
<mup> Bug #1389422: azure instance-types and regions missing <azure-provider> <constraints> <Go Windows Azure Client Library:Fix Committed by hazmat> <juju-core:In Progress by hazmat> <juju-core 1.20:Triaged> <juju-core 1.21:In Progress by wallyworld> <https://launchpad.net/bugs/1389422>
#juju-dev 2014-12-02
<jw4> thumper: do you have some direction for me regarding feature flags?  We'd like to begin landing the Actions CLI behind a feature toggle until it's complete.
<jw4> thumper: I'm planning on using an envar say DEV_ACTIONS_CLI or something
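(A minimal sketch of that kind of env-var toggle; DEV_ACTIONS_CLI is just the name jw4 floats above, not a real juju flag, and the feature-flag branch thumper mentions may end up looking quite different.)

    package main

    import (
        "fmt"
        "os"
    )

    // actionsCLIEnabled reports whether the in-development command is switched on.
    func actionsCLIEnabled() bool {
        return os.Getenv("DEV_ACTIONS_CLI") != ""
    }

    func main() {
        if !actionsCLIEnabled() {
            fmt.Fprintln(os.Stderr, "actions CLI is not enabled")
            os.Exit(1)
        }
        fmt.Println("juju actions: ...")
    }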
<hazmat> wallyworld_, that would be great
<wallyworld_> np, done
<wallyworld_> or will be real soon :-)
<wallyworld_> hazmat: are some of those instance types only available in certain regions?
<wallyworld_> cause Juju is complaining there's missing cost data in regions for G1 etc
<hazmat> wallyworld_, bummer
<hazmat> wallyworld_, bummer.. needs better unit tests..
<hazmat> wallyworld_, the G series are not publicly available, i put in the definitions for them.. but only setup mappings for the g5s for a  partner request
<hazmat> they're in a public beta atm, but pricing is not known
<hazmat> but a partner was interested in using them
<wallyworld_> what default would be sensible you think?
<wallyworld_> or would a default even be worthwhile
<wallyworld_> if i put in a high figure, that would more or less require explicit instance type selection by the user
<wallyworld_> that probably could be ok i think
<hazmat> wallyworld_, sounds reasonable
<wallyworld_> ok, will do thanks
<hazmat> wallyworld_, g series are mostly monster machines.
<hazmat> so i imagine the price is.. well if you have to ask ;-)
<wallyworld_> lol
<thumper> jw4: oh hai
<thumper> jw4: the feature flag branch needs tweaking and can land shortly after
<thumper> didn't realise anyone else was waiting on it
<davecheney> mwhudson: 1.5 discussion will start in the next 48 hours
<davecheney> probably sooner
<mwhudson> davecheney: \o/
<mwhudson> in other news, i've found yet another gccgo bug
<jw4> thumper: no worries - we just started talking about it too
<wallyworld_> axw: i had to make a fix to gwacl to ensure all instance types had costs. I fixed the tests also. these changes are in rolesizes.go and rolesizes_test.go. However, I got sick of the non-standard formatting cause my editor gofmts on save, so the branch has had vanilla gofmt run over it so the diff is large. all tests run, the real changes are in the files i just mentioned. are you able to take a look?
<wallyworld_> https://code.launchpad.net/~wallyworld/gwacl/ensure-all-roles-have-costs/+merge/243346
<wallyworld_> ignore my crappy branch name
<jw4> thumper: I will be interested to see it when the branch is ready to land
<thumper> jw4: ack
<davecheney> mwhudson: sorry, there is no prize for gccgo bugs
<axw> wallyworld_: ok
<wallyworld_> ty and sorry :-)
<wallyworld_> hopefully you can ignore the white space
<axw> wallyworld_: if you haven't already, can you please either delete the format target in Makefile, or change it to "go fmt"?
<mwhudson> davecheney: heh heh
<wallyworld_> axw: yeah did that :-)
<axw> wallyworld_ hazmat: do we even care about actual costs? I think we only care about relative costs... I wonder if we could just do "1, 2, 3, ..." in order of role size
<wallyworld_> axw: i made the costs arbitrary and large
<wallyworld_> and in order of role size
<axw> wallyworld_: I guess I should just actually look at the MP ;)
<wallyworld_> axw: that way, G1 etc are not selected by default, and require an explicit instance type constraint
<axw> right. goodo
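Roughly what wallyworld_ describes above, as an illustrative sketch only (the real table lives in gwacl's rolesizes.go and is shaped differently): the costs only need to preserve relative ordering, since instance selection picks the cheapest role size that satisfies the constraints, so giving the unpriced G-series deliberately huge values keeps them out of the default choice.

    // Hypothetical fragment; values are arbitrary, only the ordering matters.
    package rolesizes

    var roleSizeCost = map[string]uint64{
        "ExtraSmall": 1,
        "Small":      2,
        "Medium":     3,
        "Large":      4,
        "ExtraLarge": 5,
        // G-series: public pricing unknown, so use large values that still
        // increase with role size; an explicit instance-type constraint is
        // then needed to get one.
        "G1": 1000,
        "G2": 2000,
        "G3": 3000,
        "G4": 4000,
        "G5": 5000,
    }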
<perrito666> I try to run my tests with -gocheck.f="myfilter" but whenever I use -gocheck.something the only run test is dependenciesTest.TestDependenciesTsvFormat
<perrito666> anybody hit that before?
<davecheney> perrito666: are you still stuck on that dependency problem ?
<perrito666> davecheney: nope, I just tried to run one test alone of a suite
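A minimal sketch of how a gocheck suite hooks into go test and what the -gocheck.f filter matches against; the package and names below are hypothetical, not from juju.

    // mypkg_test.go (hypothetical)
    package mypkg_test

    import (
        "testing"

        gc "launchpad.net/gocheck"
    )

    // Hook the gocheck runner into the standard testing framework.
    func Test(t *testing.T) { gc.TestingT(t) }

    type mySuite struct{}

    var _ = gc.Suite(&mySuite{})

    func (s *mySuite) TestSomething(c *gc.C) {
        c.Assert(1+1, gc.Equals, 2)
    }

The filter is a regular expression matched against the suite name, the test name, or "Suite.TestName", e.g. go test -gocheck.f "mySuite.TestSomething", so an over-broad pattern (or running it from a directory whose only matching suite is the dependencies one) can select something other than the test you meant.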
<perrito666> davecheney: tx for your help the other day, I realized too late you were not working that day, apologies
<mwhudson> davecheney: ah no, it's probably just a go tool bug
<mwhudson> davecheney: even tinier non-prize
<davecheney> perrito666: meh, if i'm not working, i'll ignore you
<davecheney> you don't need to apologise
<perrito666> :p I am a very nice person, why would you ignore me?
 * perrito666 is offended
<mwhudson> yes
<mwhudson> davecheney: do you remember enough of the pain to say if this makes sense? http://paste.ubuntu.com/9338212/
<davecheney> mwhudson: i remember that link order was super fragile
<davecheney> we occasionally see builds blow up on power64 when link order is wrong
<davecheney> i can't say that what you have proposed is wrong
<mwhudson> davecheney: yeah, that's the bug i'm poking at
<davecheney> but it does make sense that that is the cause
<mwhudson> davecheney: the problem seems to be that i added code to not pass the target of two actions for the same a.p
<mwhudson> but there are two different a.p's for the package being tested when you run go test
<mwhudson> (which makes sense, maybe, now that i think about it: one has the internal _test.go files in it)
<wallyworld_> katco: i asked for tests, let me know if you have any questions
<davecheney> mwhudson: juju uses 'external' tests a lot more than the std lib
<davecheney> and we're also leaning heavily on the export_test.go escape hatch to get access to internal functions during tests
<davecheney> i'm sure this has aggravated the issue
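The export_test.go escape hatch mentioned here is the usual Go pattern for letting an external test package (package foo_test) reach unexported identifiers; a minimal sketch with hypothetical names:

    // foo/foo.go
    package foo

    func internalHelper(x int) int { return x * 2 }

    // foo/export_test.go: compiled only alongside the package's tests,
    // so it can re-export internals without widening the real API.
    package foo

    var InternalHelper = internalHelper

    // foo/foo_test.go: the "external" test package.
    package foo_test

    import (
        "testing"

        "example.com/foo" // hypothetical import path
    )

    func TestInternalHelper(t *testing.T) {
        if foo.InternalHelper(2) != 4 {
            t.Fatal("unexpected result")
        }
    }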
<jw4> thanks thumper - I see your PR
<thumper> hmm... is juju/cmd hooked up to review board?
<thumper> https://github.com/juju/cmd/pull/10
<thumper> menn0: you might like that one :-) ^^^^
<menn0> thumper: looking...
<thumper> menn0: obviously the supercommand code wasn't complicated enough...
<thumper> menn0: I've just noticed that I never got around to updating the doc strings for the new functions
<menn0> thumper: i'll ignore that then :)
<menn0> thumper: reviews for other repos don't go to RB then?
<thumper> menn0: some do, some don't
<thumper> I don't know which do
<menn0> thumper: well this hasn't so I'll do the review on GH
<menn0> thumper: so I should ignore the incorrect docstring for RegisterSuperAlias?
<thumper> yes
<ericsnow> axw: patch to add AZ to instanceData: http://reviews.vapour.ws/r/557/
<ericsnow> axw: let me know if that's what you had in mind
<ericsnow> axw: with that my unit-get patch will get a lot simpler :)
<axw> ericsnow: yep, pretty much. I don't think Instance needs to get a new AvailabilityZone method; we should just return it from StartInstance
<axw> I guess it doesn't hurt to have it there though
<axw> ericsnow: oh right, it's needed for upgrades. never mind me
<axw> ah hm, not really if it's a ZonedEnviron ... anyway.
<menn0> thedac: review done
<menn0> thedac: sorry, that was supposed to be for thumper
<davecheney> mwhudson: the patch attached to this issue appears to be a patch, if it's not a patch, then remove the patch tag from this issue ...
<davecheney> thanks, i guess
<huwshimi> Hi, I was just wanting to try a few things with 1.21 and wondering which ppa to use?
<axw> huwshimi: I think you want https://launchpad.net/~juju/+archive/ubuntu/devel
<huwshimi> axw: A brilliant thanks, I'll take a look
<huwshimi> *Ah
<dimitern> morning
<axw> fwereade_: any chance you could take a look at https://github.com/juju/charm/pull/77 some time soon? I expect we can iterate on this for changes due to mandatory constraints
<fwereade_> axw, sorry, queued it up
<axw> fwereade_: thanks
<dimitern_> fwereade_, wallyworld_, TheMue, hey guys, any of you willing to review a fix for bug 1395908 ? http://reviews.vapour.ws/r/558/
<mup> Bug #1395908: LXC containers in pending state due to juju-br0 misconfiguration <lxc> <oil> <juju-core:In Progress by dimitern> <juju-core 1.20:In Progress by dimitern> <juju-core 1.21:In Progress by dimitern> <https://launchpad.net/bugs/1395908>
<wallyworld_> dimitern_: i've been called to dinner but can look afterwards if not done by then
<dimitern_> wallyworld_, no worries
<fwereade_> dimitern_, LGTM, very clean, ty
<mattyw> morning all
<dimitern_> fwereade_, thanks!
<dimitern_> morning matty
<mattyw> dimitern_, morning
<mattyw> fwereade_, morning, would you have a few minutes to take a look at http://reviews.vapour.ws/r/553/
<dimitern_> fwereade_, when you have 5m, I have a backport for the same bug fix - http://reviews.vapour.ws/r/559/ (for 1.21) - no other changes
<fwereade_> dimitern_, LGTM again :)
<dimitern_> fwereade_, cheers :)
<voidspace> Nobody has posted to juju or canonical-tech about rocket...
<fwereade_> voidspace, it was on cloud though
 * fwereade_ popping out, bbs
<voidspace> fwereade_: ah, I may not be on that list then...
<voidspace> jam1: dimitern_: standup?
<dimitern_> voidspace, omw, sorry
<fwereade_> menn0, ping
<menn0> fwereade_: hi
<fwereade_> menn0, remind/reassure me: did we manage to close off the paths whereby non-state server agents can end up connecting to state servers of an earlier version than themselves?
<menn0> fwereade_: you mean while an upgrade is in progress?
<fwereade_> menn0, yeah
<menn0> fwereade_: yes, that was already done before I touched the upgrade code
<menn0> fwereade_: state servers don't advertise the new version until they themselves have upgrades
<menn0> upgraded
<fwereade_> menn0, I'm just looking at the bit I wrote in the uniter for spinning on CodeNotImplemented from a particular call while waiting for the state server to be upgraded
<fwereade_> menn0, and suspecting I can drop it
<fwereade_> menn0, cool
<fwereade_> menn0, thanks
<menn0> fwereade_: and given that an upgrade can't start unless all the hosts in the env are on the current or next version there's really no way a non-state server can get ahead of the state servers
<natefinch> fwereade_: do you mind if I tweak this a little?  I'm not really a fan of the cast here, since it can panic: http://bazaar.launchpad.net/~fwereade/juju-core/provider-skeleton/view/head:/provider/skeleton/config.go#L125
<wallyworld_> natefinch: here's a fix for the critical bug on master http://reviews.vapour.ws/r/560/
<natefinch> fwereade_: I'd much rather do it once in validateConfig and just store the values in a normal old struct
<natefinch> wallyworld_: looking
<wallyworld_> ty
<fwereade_> natefinch, so, in *theory* it won't panic because it should have been validated on creation
<fwereade_> natefinch, but *hell yes* put things in structs if you can
<natefinch> fwereade_: kk will do.  I know it shouldn't panic, but it gives me the heebie jeebies
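The difference being discussed is roughly the one below: a bare type assertion on the attribute map panics on a missing or wrongly typed value, while validating once with the comma-ok form and keeping the result in a plain struct cannot. The attribute and field names are illustrative, not taken from the skeleton provider.

    package skeleton

    import "fmt"

    type environConfig struct {
        apiKey string
    }

    // validateConfig does the type checking exactly once, up front.
    func validateConfig(attrs map[string]interface{}) (*environConfig, error) {
        raw, ok := attrs["api-key"]
        if !ok {
            return nil, fmt.Errorf("api-key is required")
        }
        apiKey, ok := raw.(string) // comma-ok: a bad type is an error, not a panic
        if !ok {
            return nil, fmt.Errorf("api-key: expected string, got %T", raw)
        }
        return &environConfig{apiKey: apiKey}, nil
    }

    // The style being replaced panics if the value is missing or not a string:
    //   apiKey := attrs["api-key"].(string)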
<natefinch> man, %q was just the best idea ever
<mattyw> hi folks, I'm consistently getting errors on FAIL: status_test.go:2702: StatusSuite.TestFilterSubordinateButNotParent. Anyone else? It seems to have been happening for a while for me so it must just be me
<wallyworld_> jam1: do you know the history surrounding the change in api endpoints as per bug 1397376? it's claimed the change to only return a single address for api-endpoints is a regression, and it sure seems that way. also the use of dns name rather than ip address
<mup> Bug #1397376: maas provider: 1.21b3 removes ip from api-endpoints <api> <cloud-installer> <landscape> <maas-provider> <juju-core:New> <juju-core 1.21:New> <https://launchpad.net/bugs/1397376>
<wallyworld_> mattyw: katco should be able to help with that
<natefinch> wallyworld_: lgtm
<wallyworld_> natefinch: tyvm
<mattyw> wallyworld_, ack, thanks very much
<wallyworld_> np, she should be on soon
<dimitern_> wallyworld_, your fix landed, but the bot still refuses new PRs
<dimitern_> __JFDI__ doesn't seem to work either :/
<wallyworld_> dimitern_: i think it takes a while for the fix to go through CI
<dimitern_> wallyworld_, ah, my mistake - it did work with __JFDI__
<wallyworld_> great
<mattyw> fwereade_, ping?
<fwereade_> mattyw, pong, in meeting, responding slowly
<dimitern_> fwereade_, last backport - for 1.20; same bug, no changes - will you approve it? http://reviews.vapour.ws/r/561/
<dimitern_> voidspace, can you have a look instead?  http://reviews.vapour.ws/r/561/
 * dimitern is afk for a while
<voidspace> dimitern: sure
<voidspace> dimitern: LGTM
<dimitern> voidspace, thanks!
<fwereade_> can anyone think of a reason not to make "machineenvironmentworker.MachineEnvironmentWorker" into something a bit more readable like "proxy.Worker"?
<perrito666> is anyone very very savvy about github.com/juju/cmd?
<dimitern> fwereade_, +100
<dimitern> perrito666, rogpeppe or thumper I think
<perrito666> dimitern: tx, I figured it out, testing subcommands was confusing me a bit
<rogpeppe> fwereade_: why "proxy"?
<rogpeppe> fwereade_: i'd certainly use less stuttering
<jw4> wallyworld_: how can I know definitively when CI is accepting new builds?  CI must be using some check that I can run myself before $$merge$$?
<perrito666> fwereade_: +1 to have names that can be pronounced without taking a breath at the point
<fwereade_> rogpeppe, because it's all about setting proxy-related env vars, and writing out files describing what proxies to use ;)
<dimitern> jw4, I use this bookmark https://bugs.launchpad.net/juju-core/+bugs?field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.importance%3Alist=CRITICAL&field.tag=ci+regression+&field.tags_combinator=ALL
<fwereade_> rogpeppe, but maybe I missed some detail?
<jw4> dimitern: that's basically what I use too, but CI is still rejecting builds even though that reports zero bugs
<dimitern> jw4, it was blocked earlier, but not anymore - a PR of mine landed less than an hour ago
<rogpeppe> fwereade_: is there any reason to export the worker type at all from that package?
<fwereade_> rogpeppe, fair point
<jw4> dimitern: hmm; mine was rejected after yours was accepted
<rogpeppe> fwereade_: i'd suggest proxyworker as the package identifier
<rogpeppe> fwereade_: and define proxyworker.New as the creator of the worker
<fwereade_> rogpeppe, "worker/proxyworker" is a little bit yucky, but yeah, that sounds sane
<rogpeppe> fwereade_: and use "worker" as the internal type
<rogpeppe> fwereade_: i think it's fairly idiomatic - e.g. http/httputil
<fwereade_> rogpeppe, yeah
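A sketch of the naming rogpeppe suggests, in the net/http/httputil spirit: an unexported worker type behind a proxyworker.New constructor. The small Worker interface stands in for juju's real worker interface; everything here is illustrative.

    package proxyworker

    // Worker stands in for the worker interface the rest of juju expects.
    type Worker interface {
        Kill()
        Wait() error
    }

    // The concrete type stays unexported; callers only ever write
    // proxyworker.New(...), with no stuttering in the name.
    type proxyWorker struct {
        // proxy settings, file paths, watchers, etc.
    }

    // New returns a worker that keeps proxy environment variables and
    // on-disk proxy configuration up to date.
    func New( /* dependencies */ ) Worker {
        return &proxyWorker{}
    }

    func (w *proxyWorker) Kill()       {}
    func (w *proxyWorker) Wait() error { return nil }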
<jw4> wallyworld_, dimitern yeah - again : Does not match ['fixes-1396981']
<jw4> CI must be using a different query - I just want to know what that is so I don't have to keep guessing when CI gets unblocked
<dimitern> jw4, they must've changed it
<dimitern> mgz_, sinzui, any clue why the bot is still blocked on bug 1396981 ?
<mup> Bug #1396981: Upgrade fails because product json is renamed <ci> <regression> <upgrade-juju> <juju-core:Fix Committed by wallyworld> <https://launchpad.net/bugs/1396981>
<jw4> dimitern: oh well - I've waited almost a week to land this anyway - I can wait longer :)
<sinzui> dimitern, 1. the bug is not Fix released, 2. we need to remove the rule for frozen index.json.
<sinzui> dimitern, I think master will be open in about 30 minutes
<sinzui> jam can you get someone to look into Bug #1398406? the gwacl change broke azure deploys for CI
<mup> Bug #1398406: Azure provider attempts to deploy with unsupported "D1" RoleSize <azure-provider> <bootstrap> <ci> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1398406>
<dimitern> sinzui, great, thanks!
<dimitern> sinzui, jam is not here today, I can have a look perhaps
<sinzui> thank you dimitern
<dimitern> sinzui, btw, lest I forget - will there be a  1.20.14 release? I have just fixed an issue with maas there - bug 1395908
<mup> Bug #1395908: LXC containers in pending state due to juju-br0 misconfiguration <lxc> <oil> <juju-core:Fix Committed by dimitern> <juju-core 1.20:Fix Committed by dimitern> <juju-core 1.21:Fix Committed by dimitern> <https://launchpad.net/bugs/1395908>
<sinzui> dimitern, we need to do one for the gwacl change...but the change broke tests
<dimitern> sinzui, the gwacl one I'm looking at now?
<sinzui> dimitern, I was just looking at that bug and thinking about how I might verify the change works with the QA maas
<sinzui> dimitern, yes. bug 1398406 is fallout from bug 1389422
<mup> Bug #1398406: Azure provider attempts to deploy with unsupported "D1" RoleSize <azure-provider> <bootstrap> <ci> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1398406>
<mup> Bug #1389422: azure instance-types and regions missing <azure-provider> <constraints> <Go Windows Azure Client Library:Fix Committed by hazmat> <juju-core:In Progress by wallyworld> <juju-core 1.20:Triaged> <juju-core 1.21:Fix Committed by wallyworld> <https://launchpad.net/bugs/1389422>
<dimitern> sinzui, well, if it's a virtual maas it can be tested by tweaking one or a few kvm domain configs (vms) to have disabled NICs
<sinzui> thank you dimitern,
<dimitern> sinzui, that's how I tested it locally, I can give you more details should you need them
<sinzui> :)
<dimitern> sinzui, with the fix juju will log "node "XXX" skipping disabled network interface "XYZ" for disabled devices; luckily by default only the primary interface is enabled (as discovered by lshw) so you should see these right after a node is acquired from maas (e.g. during bootstrap --debug)
<dimitern> sinzui, and after a few "skipping disabled network interface.." log lines (depending on how many NICs are there) there will be a final "node "<maas-uuid>" primary network interface is "XYZ" (e.g. "eth2", which will appear in /etc/network/interfaces under the juju-br0 "bridge-ports" setting)
<jw4> dimitern: (sinzui ) this updated query takes into account Fix Committed and not Fix Released... : https://api.launchpad.net/devel/juju-core?ws.op=searchTasks&status%3Alist=Triaged&status%3Alist=In+Progress&status%3Alist=Fix+Committed&importance%3Alist=Critical&tags%3Alist=regression&tags%3Alist=ci&tags_combinator=All
<jw4> sinzui: that should be the right query for me to check status before merging right?
<dimitern> jw4, do you have a bugs.launchpad.net link with the same filters?
<perrito666> natefinch: ericsnow can we jump into the standup since we are in it?
<jw4> dimitern, no but I think I can whip it up...
<dimitern> jw4, cheers!
<dimitern> sinzui, is bug 1398406 only for 1.21 or also for trunk and 1.20 ?
<mup> Bug #1398406: Azure provider attempts to deploy with unsupported "D1" RoleSize <azure-provider> <bootstrap> <ci> <regression> <juju-core:In Progress by dimitern> <https://launchpad.net/bugs/1398406>
<sinzui> dimitern, the Lp only claims the merge was in 1.21
 * sinzui checks log
<perrito666> natefinch: ?
<sinzui> dimitern, I don't see the dep change in master. just 1.21
<dimitern> sinzui, right; so after some analysis I think the issue is MS renamed "D1" to "Standard_D1" etc.
<dimitern> sinzui, I'll propose a fix, can it be tested in the same CI scenario easily?
<sinzui> yes
<sinzui> dimitern, this is the command that failed. the constraint is important, the env can be any azure: juju --show-log bootstrap -e azure-deploy-precise-amd64 --constraints mem=2G
<dimitern> sinzui, right, I'll try it locally as well
<dimitern> thanks
<sinzui> jw4, that call is right. I use
<sinzui>  juju-ci-tools/check_blockers.py master 1234
<sinzui> where the number is random
<jw4> sinzui: perfect
<jw4> dimitern: check_blockers.py is the best route, but I think the query would look like: https://bugs.launchpad.net/juju-core/+bugs?field.searchtext=&orderby=-importance&search=Search&field.status%3Alist=NEW&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&field.
<jw4> importance%3Alist=CRITICAL&assignee_option=any&field.assignee=&field.bug_reporter=&field.bug_commenter=&field.subscriber=&field.structural_subscriber=&field.tag=ci+regression+&field.tags_combinator=ALL&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on&field.has_blueprints.used=&field.
<jw4> has_blueprints=on&field.has_no_blueprints.used=&field.has_no_blueprints=on
<mgz_> jw4: it's possible to get a shorter version of that url :)
<sinzui> dimitern, jw4 abentley: bug 1398406 is wrongly on master (and the wrong milestone), I am going to target it to the 1.21 branch and beta4 to get master unblocked
<mup> Bug #1398406: Azure provider attempts to deploy with unsupported "D1" RoleSize <azure-provider> <bootstrap> <ci> <regression> <juju-core:In Progress by dimitern> <https://launchpad.net/bugs/1398406>
<jw4> mgz_: lol - you mean with a shortner or by removing all the empty fields ?
<dimitern> jw4, thanks!
<jw4> mgz_: I keep trimming the empty fields by hand, but the search tool puts them back in whenever I tweak the params
<dimitern> sinzui, cool, thank you
<mgz_> jw4: :)
<jw4> mgz_: (dimitern) how about : https://bugs.launchpad.net/juju-core/+bugs?field.searchtext=&orderby=-importance&field.status%3Alist=NEW&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&field.importance%3Alist=CRITICAL&field.tag=ci+regression+&field.tags_combinator=ALL
<jw4> tiny bit shorter
<mgz_> that looks reasonable
<dimitern> thanks again jw4
<jw4> dimitern: :)
<natefinch> ericsnow, perrito666: sorry, chrome crashed again, and I can't get back in.  I think we mostly were done.  I'll be back in like 5 minutes.
<perrito666> natefinch: we noticed
<natefinch> btw, is it possible to downgrade back to trusty?  Because this is just ridiculous
<ericsnow> natefinch: ping me when you are ready for our 1-on-1
<natefinch> ericsnow: will do]
<perrito666> natefinch: mm, I am not sure, it should be possible but it requires some crafting by hand and some things might break
<perrito666> there are not that many changes in utopic
<perrito666> natefinch: you could downgrade the browser
<natefinch> I don't even know that it changed the browser.  I'm running Chrome, not Chromium
<perrito666> natefinch: chances are that the browser updated and it might also be broken in trusty, chrome comes from a ppa iirc
<alexisb> hey there team, I see katco has picked up one critical bug; do we have an assignee for this one:
<alexisb> https://bugs.launchpad.net/juju-core/+bug/1397376
<mup> Bug #1397376: maas provider: 1.21b3 removes ip from api-endpoints <api> <cloud-installer> <landscape> <maas-provider> <juju-core:Triaged> <juju-core 1.21:Triaged> <https://launchpad.net/bugs/1397376>
<natefinch> alexisb: looking
<natefinch> sinzui: about that api-endpoints bug.... it sounds like the scripts were relying on behavior that was not guaranteed, but simply happened to be somewhat reliable in their particular environments.  I'm with Tim in that I'm not sure this is a regression.
<sinzui> natefinch, They are relying on behaviour that worked most of the time, now doesn't work most of the time
<sinzui> natefinch, I think we need to know if anyone has built tools on the unguaranteed behaviour. If deployer assumes it, we have a problem
<sinzui> hazmat, juju 1.21 api-endpoints is more likely to return a dns address instead of an ip address. Some people assume it only returned an IP address. does deployer make such assumptions?
<natefinch> sinzui: isn't deployer on the server?  api-endpoints is a client-only command,  I don't think deployer could call it if it wanted to.  Also, again, deployer is on the server, so it shouldn't need to get the api endpoint, right?
<sinzui> natefinch, no, deployer drives your client
<natefinch> or am I crazy about how deployer works
 * natefinch is crazy, ok. :)
<natefinch> I've never used the deployer, sorry :)
<sinzui> well, it supplants your client with a bulk api client
<sinzui> natefinch, a deployer file/bundle is a summary of an end state of an env. deployer/quickstart walks juju through the steps to put the env into that state
<voidspace> dimitern: ping
<dimitern> voidspace, pong
<voidspace> dimitern: I'm stuck on a failing test
<voidspace> dimitern: this is for IPAddress.SetState
<voidspace> dimitern: the Assert seems to prevent State being changed at all
<voidspace> dimitern: this is the code https://github.com/voidspace/juju/compare/state-ipaddresses
<voidspace> dimitern: this is the error message: http://pastebin.ubuntu.com/9346641/
<dimitern> voidspace, looking
<voidspace> dimitern: the failing test is TestIPAddressSetState
<voidspace> dimitern: it fails on line 1306 - where we attempt to change the state (actually calling SetState)
<voidspace> dimitern: I've checked other Asserts in our code and can't see the difference
<voidspace> dimitern: the Assert itself is line 96 of the diff (first block - the new ipaddresses.go file)
<dimitern> voidspace, unknownOrSame?
<voidspace> dimitern: yes
<voidspace> dimitern: to change the state, the existing state must be AddressStateUnknown, or the same state as the one we're setting
<voidspace> so the state must be unknownOrSame
<voidspace> which was your name I believe... yesterday :-)
<dimitern> voidspace, yeah
<dimitern> voidspace, hmm.. weird..
<voidspace> dimitern: the test creates a new address, checks that the state is AddressStateUnknown and then attempts to set the state to AddressStateAllocated
<voidspace> dimitern: and that fails with "transaction aborted"
<dimitern> voidspace, how about if you use getCollection + defer like in other places, rather than using C: ipaddressesC
<voidspace> dimitern: ok, I'll try that
<dimitern> voidspace, ah, another thing I just noticed
<voidspace> dimitern: hmm... other places uses st.runTransaction with C
<dimitern> voidspace, state is omitempty, which means it can be null in addition to any valid state
<dimitern> voidspace, it shouldn't be omitempty
<voidspace> dimitern: ok
<voidspace> dimitern: that might be it
<voidspace> StateUnknown is ""
<dimitern> voidspace, we have an "unknown" .. yeah :)
<voidspace> dimitern: getting rid of omitempty worked!
<voidspace> dimitern: thanks :-)
<dimitern> voidspace, sweet!
<voidspace> I knew there was a reason we kept you around ;-)
<dimitern> :D
<dimitern> indeed
<voidspace> :-)
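The gotcha dimitern spotted, sketched with assumed field names: mgo's bson marshaller drops a field tagged omitempty when it holds the zero value, and AddressStateUnknown is the empty string, so the stored document never contains the field and a transaction Assert expecting state == "" can never match.

    // Sketch only; not the real ipaddresses document.
    type ipaddressDoc struct {
        Value string `bson:"value"`

        // Buggy: with omitempty an AddressStateUnknown ("") value is never
        // written, so an Assert on {"state": ""} fails and the txn aborts.
        //   State string `bson:"state,omitempty"`

        // Fixed: always store the field, even when it is the empty string.
        State string `bson:"state"`
    }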
<dimitern> ok, eod for me
<dimitern> g'night all
<voidspace> dimitern: g'night
<dimitern> voidspace, o/
<mattyw_> ericsnow, ping?
<ericsnow> mattyw_: hey
<mattyw_> ericsnow, hey hey, what's the current process for proposing a branch with a prereq?
<ericsnow> mattyw_: Do it like normal and then use "rbt post -u --parent ..." afterward
<mattyw_> ericsnow, perfect thanks very much
<ericsnow> mattyw_: np :)
<mattyw_> cmars, http://reviews.vapour.ws/r/563/
<voidspace> g'night all
<jw4> on call reviewer today available to PTAL?  Very small change - fully tested with juju-core (small fix to juju-core to follow this update) http://reviews.vapour.ws/r/565/
<natefinch> jw4: what have we learned?
<natefinch> <chorus> never use regular expressions </chorus>
<katco> natefinch: that is a rather firm stance to take
<natefinch> katco: perhaps only slightly too firm.  Never use regular expressions longer than 10 characters?
<katco> natefinch: i can at least consider that statement :)
<katco> natefinch: very useful: http://www.colm.net/open-source/ragel/
<jw4> natefinch: lol!
<natefinch> jw4: haha, did we not even have a positive testcase for that?
<natefinch> wait wait wait
<jw4> natefinch: blush
<natefinch> why is IsValidUUIDString not just
<jw4> katco: interesting link
<natefinch> _, err := utils.UUIDFromString(mystring)
<katco> jw4: it's a very neat project
<natefinch> return err == nil
<jw4> natefinch: that's actually how this was caught
<jw4> the check was passing but the UUIDFromString was panicking
<natefinch> in fact... why do we even have an IsValidUUIDString?
<jw4> natefinch: no answer except that we just started using that method too
<jw4> natefinch: we could switch to the _,err approach and remove the bool version
<natefinch> jw4: please do... there's no sense having two implementations which can get out of sync
<jw4> natefinch: I'll have a juju core PR to follow this, and now a names package follow up too
<natefinch> jw4: cool, sorry to give you more work, but it's definitely the better fix
<jw4> natefinch: no problem - I'm semi blocked on CI anyway
<perrito666> I am all for regexes as long as you have a test battery that covers every possible case for them :p
 * perrito666 gets a friend to bring an en_US kb from ny and jumps of happiness like mario with a new coin
<jw4> natefinch: the regex is useful because it's an easy way to verify that the 15th char is a '4' and the 20th char is one of '8','9','a','b'
<jw4> natefinch: we could do that check manually but the regex actually seems simpler
<jw4> natefinch: utils.UUIDFromString uses the regex to verify those special characters, and then the rest is just simple hex decoding
<natefinch> jw4: you think that 73 character regex is simpler than  s[14] == '4' && switch s[19] { case  '8','9','a','b':    default: error } ?
<jw4> natefinch: ;-p
<jw4> natefinch: okay
<natefinch> jw4: also, what's up with the 4 and 89ab thing anyway?
<jw4> natefinch: something to do with valid UUID per a spec somewhere
<jw4> natefinch: fwereade_ gave me a link but I forget it
<jw4> natefinch: the 4 is a version number
<perrito666> mm uuid validator, I seem to recall we have code that checks that
<natefinch> ahh yeah, I see some comments about version 4
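Both halves of this exchange, sketched together: the single-implementation check natefinch asks for (assuming github.com/juju/utils exposes UUIDFromString, as quoted above), and the version/variant bytes jw4 describes spelled out without the long regex. Offsets assume the canonical 36-character form, lower-cased first as perrito666 suggests a little further down.

    package uuidcheck // hypothetical

    import (
        "strings"

        "github.com/juju/utils"
    )

    // isValidUUID delegates to the one parsing implementation instead of
    // keeping a separate regexp in sync with it.
    func isValidUUID(s string) bool {
        _, err := utils.UUIDFromString(s)
        return err == nil
    }

    // hasV4Markers is the manual check: in xxxxxxxx-xxxx-4xxx-Nxxx-...,
    // s[14] is the version nibble ('4') and s[19] is the RFC 4122 variant
    // nibble ('8', '9', 'a' or 'b').
    func hasV4Markers(s string) bool {
        s = strings.ToLower(s)
        if len(s) != 36 || s[14] != '4' {
            return false
        }
        switch s[19] {
        case '8', '9', 'a', 'b':
            return true
        }
        return false
    }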
<Makyo> sinzui, ping
<jw4> natefinch: btw, we also have to validate length if we remove the regex.  I'll update the PR with those changes
<sinzui> hi Makyo
<jw4> perrito666: other than in the utils package?
<perrito666> jw4: nope, same ugly regexp
<Makyo> sinzui, was told to get in touch with you re: juju beta PPA
<natefinch> jw4: is there a reason we're rolling our own and not using something like https://godoc.org/code.google.com/p/go-uuid/uuid
<perrito666> I recall having a fake uuid breaking and trying to decipher that regex to figure out wtf until I just read the rfc
<sinzui> Makyo, https://launchpad.net/~juju/+archive/ubuntu/devel ?
<jw4> perrito666: fwereade_ pointed me to that package - we can ask him about go-uuid
<Makyo> sinzui, ah, thanks, we didn't know the location.
<perrito666> jw4: I know natefinch will not agree with me, but I would definitely go for something more in the side of http://pastebin.ubuntu.com/9348308/
<perrito666> if I actually had to use that regex
<perrito666> and properly name the blocks
<perrito666> aaaand of course a nice doc explaining it a bit more
<jw4> perrito666: I'm okay with that - if natefinch agrees I will go that route
<natefinch> jw4: until such a time as we can use something more complete, that's fine... except you don't need sprintf
<jw4> natefinch: +1
<perrito666> duh, the sprintf was from the first implementation I wrote
<perrito666> but I found it clearer by just concatenating the strings
<jw4> perrito666: yeah
<jw4> perrito666: natefinch: also - I'm guessing we want to allow upper case letters too
<perrito666> jw4: It would be wiser to de-capitalize the uuid before matching
<jw4> perrito666: kk
<perrito666> can I trigger the compilation of a _windows file on linux??
 * perrito666 knows the answer and starts a windows+
<jw4> GO_ARCH=windows?
<natefinch> jw4: so, yes, but I think you have to have the go tools built to run that way
<jw4> natefinch: I see.  FWIW, I'm coming back to a non-regex approach again! :)
<natefinch> jw4: http://dave.cheney.net/2013/07/09/an-introduction-to-cross-compilation-with-go-1-1
<jw4> natefinch: cool.
<natefinch> jw4: once the tools are built `GOOS=windows go build` will work
<perrito666> if you build them, they will compile...
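For the record, the _windows suffix under discussion is an implicit build constraint: the file is only compiled when GOOS=windows. With the Go releases current at the time (1.3/1.4) the windows toolchain had to be built once first, per the post natefinch links above; after that, cross-compiling the package is a one-liner. The file below is hypothetical.

    // foo_windows.go: the _windows suffix means this file is compiled
    // only when GOOS=windows; no build tag comment is needed.
    package foo

    func platformTempDir() string { return `C:\Windows\Temp` }

    // From a Linux box, once the windows toolchain exists:
    //   GOOS=windows GOARCH=amd64 go build ./...
    // builds (but of course cannot run) the package, which is enough to
    // catch compile errors in the _windows files.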
<jw4> natefinch, perrito666 : http://reviews.vapour.ws/r/565/
<perrito666> jw4: I lgtmd but you need davecheney to bless it
<perrito666> jw4: I proposed the change :p I was not going to be against it
<perrito666> we have tests failing because my windows is in spanish :s
<perrito666> is reviewboard on for utils?
<perrito666> it is, and it changes the description, how nice
<perrito666> http://reviews.vapour.ws/r/567/
<perrito666> ok EOD for me, cheers everyone
 * menn0 is sick of CI being blocked
<menn0> is anyone looking at bug 1398473?
<mup> Bug #1398473: backup output format changed <backup-restore> <ci> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1398473>
<perrito666> menn0: lemme check
<menn0> perrito666: either the test needs to change or the output filename
<menn0> perrito666: the other blocker is also backup related
<menn0> perrito666: bug 1398448
<mup> Bug #1398448: ERROR while creating backup archive: while bundling state-critical files: write to tar file failed: open /var/log/juju/all-machines.log: no such file or directory <backup-restore> <ci> <ha> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1398448>
<menn0> perrito666: seems more serious
<perrito666> ericsnow: did you merge the deprecation of backup before restore is merged?
<ericsnow> perrito666: yes
<perrito666> ericsnow: old restore will not restore the new backups
<ericsnow> perrito666: why not?
<perrito666> ericsnow: are you dumping with oplog?
<ericsnow> perrito666: the oplog doesn't matter
<perrito666> ericsnow: ?
<ericsnow> perrito666: the dumped oplog will just get ignored, giving exactly the same result as if dumped without the oplog
<perrito666> ericsnow: except that you lose everything that happened between the beginning and end of juju backups create
<perrito666> old backup would stop the state server while backing up
<ericsnow> perrito666: I see
<perrito666> yes, I know, its a PITA
<perrito666> :)
<ericsnow> perrito666: however, restore will land in 1.22 so it won't matter, right?
<ericsnow> perrito666: is the backup archive (with oplog) breaking tests?
<ericsnow> perrito666: that bug is related to the filename, not the archive content
<perrito666> ericsnow: no that is just a side effect I noticed while looking the other issue
<ericsnow> perrito666: k
<perrito666> ericsnow: I believe a change in the test is in order, unless there is some strict rule on the file name
<ericsnow> perrito666: looks like the test has some hard-coded regex for the filename that no longer works
<perrito666> ericsnow: it has
<perrito666> it expects it to have the old name
<perrito666> sinzui: is this considered a break on backwards compatibility?
<menn0> perrito666, ericsnow: if users of the old backup system have scripts which expect the old filename the change will break them
<ericsnow> menn0: fair enough
<menn0> perrito666, ericsnow: not sure how worried we should be about that though
<perrito666> ericsnow: I presume you changed the script to call backups create, right?
<menn0> wallyworld_: ^^^
<ericsnow> perrito666: right
<perrito666> ericsnow: you could add a mv at the end
<perrito666> but, apart from that
<perrito666> there is an underlying problem there
<ericsnow> perrito666: I was thinking the same thing, though I don't like it :P
<perrito666> ericsnow: although that script should be nuked before 1.22
<ericsnow> perrito666: we have to keep the script around due to backward-compatibility
<menn0> ericsnow: so what's the difference between the old and new style filenames?
<sinzui> perrito666, menn0, ericsnow : I don't think we promised a naming convention for the backup file. The promise is a new file.
<perrito666> it will produce a different format than previously and an unknowing user might try to restore one of these into an old juju (because of reasons)
<ericsnow> menn0: the new filename has a .tar.gz suffix instead of .tgz, and now includes the env UUID in the filename
<menn0> perrito666: that's a good point. given that the format has changed a new file name style is probably a good thing
<ericsnow> menn0: actually, the env UUID isn't included in the filename
<ericsnow> menn0: so just the suffix changed
<menn0> ericsnow: ok
<menn0> ericsnow: I agree that tar.gz is marginally better :)
<perrito666> so this bug is on sinzui's side
<menn0> so if the filename is staying the same let's change the test to expect the new filename
<perrito666> although its our fault
<perrito666> for not communicating properly the change
<menn0> perrito666: we have access to the CI tests and can fix them too...
<ericsnow> menn0: I'm all for just fixing the regex in the CI test :)
<perrito666> menn0: we do, that is a particularly convoluted test though :) because it tests an ugly plugin of juju
<ericsnow> FWIW, I don't think there's a whole lot we can do about someone trying to restore a 1.22 backup using 1.20 or 1.21
<perrito666> I am not sure for the case of backup, but it also might be expecting certain output from stdout
<menn0> sinzui: remind me what the process for changing CI tests is
<perrito666> ericsnow: we could print a huge warning in the script, that is it
<sinzui> perrito666, abentley I agree that "juju-backup-20141202-141348.tar.gz" is still clearly a backup. We will loosen the regex
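A sketch of the loosened check sinzui describes, written here in Go (the real check lives in the Python juju-ci-tools): accept both the old .tgz and the new .tar.gz suffix around the timestamped name shown above.

    package backups // hypothetical

    import "regexp"

    // Matches juju-backup-20141202-141348.tgz and
    // juju-backup-20141202-141348.tar.gz alike.
    var backupName = regexp.MustCompile(`^juju-backup-\d{8}-\d{6}\.(tgz|tar\.gz)$`)

    func looksLikeBackup(filename string) bool {
        return backupName.MatchString(filename)
    }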
<ericsnow> Doing so should still work fine, with the (very?) remote chance of introducing DB inconsistencies
<perrito666> ericsnow: well, there will not be inconsistencies in CI, just in practice
<abentley> sinzui: It is a regression, though.
<sinzui> abentley, yep, I have targeted the bug and started a fix
<abentley> sinzui: If it breaks my code, it can just as easily break someone else's.
<ericsnow> perrito666: and probably very very rarely
<menn0> ericsnow, perrito666: now what about the other issue relating to all-machines.log?
<perrito666> ericsnow: I agree, in the current status of users for juju, although it is technical debt
<perrito666> menn0: checking
<menn0> ericsnow, perrito666: bug 1398448
<mup> Bug #1398448: ERROR while creating backup archive: while bundling state-critical files: write to tar file failed: open /var/log/juju/all-machines.log: no such file or directory <backup-restore> <ci> <ha> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1398448>
<sinzui> abentley, there is no promise of a specific naming convention. The new name is very similar and We...specifically me was probably too specific.
 * ericsnow looks
<perrito666> menn0: that is odd, I have personally done that process a lot of times
<ericsnow> abentley: shouldn't /var/log/juju/all-machines.log always be there when the CI tests run?
<perrito666> ericsnow: albeit, we might just let it pass if a log file is not there
<ericsnow> perrito666: we already do that for a couple other files
<ericsnow> perrito666: but I'd expect all-machines.log to be there
<perrito666> ericsnow: I think logs are not that critical as config and db
<perrito666> ericsnow: although yes, it should be there
<abentley> sinzui: I think that it is a regression to change even behaviour that was not promised.  Users will depend on anything they can get their hands on.
<abentley> sinzui: But it's your call.
<perrito666> abentley: I dont think this qualifies as a change in behavior
<sinzui> abentley, no one told me a naming convention. I picked a test that satisfied myself
<abentley> perrito666: Then why is our test broken?
<perrito666> we make a backup and download a file in the same format, the name changes but that is an accident
<perrito666> abentley: you made a reasonable assumption on the file name, yet it is a bit extremist to call it a regression
<perrito666> the compression is the same, the format is the same, it is compatible
<perrito666> abentley: ftr, we suck for not spec-ing that and I do apologize for it
<perrito666> I am working on a spec
<abentley> perrito666: Yes, and the same could be said if you had changed the commandline arguments.
<abentley> perrito666: But just like this change, changing the commandline arguments would break scripts.
<jw4> perrito666: thanks
<abentley> perrito666: I'm not upset that you changed it, I'm just concerned to make sure we don't break our customers.
<perrito666> abentley: I am trying to rubber duck with you, I am not yet decided on if we break or not our customers
<abentley> perrito666: You've already said the regex makes a "reasonable assumption" about the filename format.  Do you think it is unlikely that any of our customers made the same assumption?
<perrito666> abentley: yes, I was re-reading myself
<wallyworld_> menn0: hi, did you need me?
<perrito666> abentley: I concede, it is a reasonable assumption and we might need to maintain it.
<abentley> perrito666: Okay, well sinzui has overruled me, so I won't waste everyone's time discussing it further.  EOD for me.
<perrito666> oh you can override abentley :p that is a cool feature
<perrito666> have a nice EOD
<perrito666> ericsnow: sinzui well its your call guys
<perrito666> I am EOD too
<Ice-x> does anyone know where I can buy dvd movies with a low price on the internet ?
<ericsnow> perrito666: goodnight
<menn0> perrito666: ciao
<jw4> davecheney: ptal http://reviews.vapour.ws/r/565/
<jw4> davecheney: perrito666 said I should get your review on that.
<thumper> Ice-x: wrong channel
<mwhudson> thumper, davecheney, alexisb: https://groups.google.com/d/msg/golang-dev/2ZUi792oztM/UCA2V7Ul3nkJ
<davecheney> jw4: /me looks
<jw4> davecheney: ta
<jw4> thanks davecheney :)  Comments duly noted
<jw4> davecheney: actually - my preference would be to use https://code.google.com/p/go-uuid/ and not even mess with it ourselves - the only benefit is that our version is presumably lighterweight
<davecheney> jw4: if someone else has already written a package to do that
<davecheney> use it
<davecheney> juju clearly has no policy against external dependencies
<jw4> davecheney: I think I floated that idea and fwereade_ or someone pointed me to the utils package instead (sorry if I'm falsely accusing you fwereade_ )
<davecheney> jw4: meh
<davecheney> what's better than having one of something ?
<davecheney> having more than one of something!!
<jw4> davecheney: lol
<davecheney> code reuse involves less heroism
<ericsnow> abentley: so about all-machines.log in CI, why would it be missing?
<menn0> ericsnow: it could be a logging regression
<ericsnow> menn0: perhaps; do we have any other CI tests that rely on all-machines.log
<ericsnow> ?
<menn0> ericsnow: where the log file isn't being created on some state servers?
<menn0> ericsnow: just a guess
<ericsnow> menn0: yeah
<ericsnow> menn0: could be a permissions issue too
<menn0> ericsnow: i'm just spinning up an HA env on ec2 now to check
<ericsnow> menn0: cool
 * ericsnow remembers abentley is EOD already
<katco> wallyworld_: can someone stamp http://reviews.vapour.ws/r/569/
<katco> oops, that's to anyone, not just wallyworld_
<wallyworld_> katco: sure, looking
<katco> wallyworld_: ty sir
<wallyworld_> katco: land away
<katco> wallyworld_: sweet. what the heck is the fixes syntax (or where can i find it)?
<wallyworld_> $$fixes12345$$
<katco> ty
<wallyworld_> oops
<wallyworld_> $$fixes-12345$$
<wallyworld_> katco: ^^^
<katco> it's too late! the damage is done!!
<katco> oh, the horror!
<wallyworld_> sorry
<wallyworld_> it will be rejected
<katco> i'm just kidding around :)
<wallyworld_> i know
<menn0> ericsnow: perms on all-machines.log could be the problem. on a new state server the logs dir looks like:
<menn0> -rw------- 1 syslog adm    99219 Dec  2 22:54 all-machines.log
<menn0> -rw------- 1 syslog syslog   883 Dec  2 22:52 ca-cert.pem
<menn0> -rw------- 1 syslog syslog   589 Dec  2 22:52 logrotate.conf
<menn0> -rwx------ 1 syslog syslog    83 Dec  2 22:52 logrotate.run*
<menn0> -rw------- 1 syslog syslog 95478 Dec  2 22:54 machine-0.log
<menn0> -rw------- 1 syslog syslog   895 Dec  2 22:52 rsyslog-cert.pem
<menn0> -rw------- 1 syslog syslog   891 Dec  2 22:52 rsyslog-key.pem
<katco> it will be interesting to see if this completely fixes andreas's issue
<ericsnow> menn0: I recall thumper saying something about log permissions recently
<ericsnow> menn0: which is what made me think of it
<menn0> ericsnow: but the backup runs as root right?
<ericsnow> menn0: I expect the issue of permissions partly depends on the user running the CI tests
<ericsnow> menn0: I don't know how they are set up
<ericsnow> menn0: I'm less convinced it's a permissions issue (the error message is "no such file or directory")
<menn0> ericsnow: I'm pretty sure these tests run against a separate ec2 env basically from the end user's perspective
<menn0> ericsnow: i've just got that ec2 env up
<menn0> ericsnow: there's no all-machines.log on the 2 new state servers created by ensure-availability
<menn0> ericsnow: so that could be the problem
<ericsnow> menn0: ah, yep, and from that perspective it's the API server process that is running the backup
<menn0> ericsnow: ok the all-machines.log is there now on the new machines
<menn0> ericsnow: it just takes a while to appear
<menn0> ericsnow: that's probably the issue
<ericsnow> menn0: so if one of them runs the backup we'd get the failure we're seeing
<menn0> ericsnow: which state server runs the backup? the master or the API server that handled the request?
<ericsnow> menn0: so a race condition of sorts?
<ericsnow> menn0: the one that handled the request
<menn0> ericsnow: yeah, probably a race condition
<menn0> ericsnow: i've looked at the test and it waits for all machines to be "started" after ensure-availability
<menn0> ericsnow: but when I looked all machines were in the started state
<menn0> ericsnow: but the all-machines.log wasn't there yet
<ericsnow> menn0: but it showed up after a little while, right?
<menn0> ericsnow: yep
<ericsnow> menn0: so in practice this isn't likely to bite users
<menn0> ericsnow: looking at the logs for the rsyslog worker, there might be at least a 20s gap
<menn0> ericsnow: it would suck though if something wasn't quite right with rsyslog meaning that backup couldn't be taken
<ericsnow> menn0: agreed
<menn0> ericsnow: maybe the backup system should be tolerant of missing log files
<ericsnow> menn0: we already have precedent
<ericsnow> menn0: what would be the importance of the logs on the restored host
<ericsnow> menn0: do we have code that relies on all-machines.log?
<ericsnow> menn0: did the /var/log/juju directory exist even when the log file was missing?
<menn0> ericsnow: yes /var/log/juju was there with the local machine-N.log and the key files for logging
<menn0> ericsnow: I don't think anything in Juju relies on all-machines.log except the debug-log feature
<menn0> ericsnow: I don't think it's at all essential for the backup
<ericsnow> menn0: okay, good
<menn0> wallyworld_: do you agree that it isn't essential that all-machines.log is included in a backup?
<wallyworld_> menn0: do we include other logs?
<wallyworld_> i can see that it would be useful, but not strictly necessary to get a system up and running again
<ericsnow> menn0: k
<ericsnow> menn0: I'm working up a patch to allow for missing log files
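A minimal sketch of the tolerance being described, with hypothetical names (juju's real backups code is organised differently): when bundling log files, skip a path that does not exist yet, such as all-machines.log on a freshly added state server, instead of aborting the whole backup.

    package backups // hypothetical

    import (
        "log"
        "os"
    )

    // addLogFiles adds each path via tarAdd, warning about and skipping
    // any file that is not there rather than failing the backup.
    func addLogFiles(tarAdd func(path string) error, paths []string) error {
        for _, p := range paths {
            if _, err := os.Stat(p); os.IsNotExist(err) {
                log.Printf("WARNING: skipping missing log file %q", p)
                continue
            }
            if err := tarAdd(p); err != nil {
                return err
            }
        }
        return nil
    }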
<ericsnow> wallyworld_, menn0: we also back up machine-0.log
<katco> wallyworld_: (when you get a chance) regarding your comment about tests for apiserver/leadership regarding permissions: can i just stub out an authorizer that returns a failure and ensure the code behaves correctly?
<wallyworld_> so in that case why not include all-machines.log
<wallyworld_> katco: i think so - from memory, the other similar tests use a fakeAuthorizer, but i'd need to check to be sure
<ericsnow> wallyworld_: all-machines.log does get backed up
<katco> wallyworld_: cool, ty.
<wallyworld_> ericsnow: ok
<ericsnow> wallyworld_: the problem is where it didn't exist yet in the HA case
<wallyworld_> doesn't it always exist?
<ericsnow> wallyworld_: apparently it doesn't show up immediately on the extra hosts under HA
<ericsnow> wallyworld_: menn0 saw something like a 20s gap
<ericsnow> wallyworld_: and that was enough for CI to fail
<wallyworld_> oh, i see
<wallyworld_> so why doesn't backup use the primary state server?
<ericsnow> wallyworld_: we use whichever one cmd.Command.NewAPIRoot gives us
<ericsnow> wallyworld_: you're suggesting we force that to machine 0?
<wallyworld_> ericsnow: the state server itself that receives the request could redirect to the primary (not necessarily machine 0). i'm just thinking out loud
<ericsnow> wallyworld_: yeah, menn0 hinted at the same thing
<menn0> wallyworld_: that seems pretty fragile
<wallyworld_> why fragile?
<menn0> what happens if the master changes
<menn0> if state servers have come and go
<menn0> what is machine-0 anyway?
<wallyworld_> the cluster knows which is master
<wallyworld_> surely?
<wallyworld_> i didn't say machine 0
<wallyworld_> i said NOT necessarily machine 0
<menn0> the cluster does know who master is
<wallyworld_> right
<wallyworld_> so if the machine that receives the request is not master, it redirects to master
<menn0> it seems like a lot of extra complexity just to get a non-essential log file
<menn0> which will be there most of the time anyway
<menn0> it's only not there for the first 20s ish after a new state server comes up
<wallyworld_> trouble is, people may want it (I can't make that call), and the backup needs to be consistent
<wallyworld_> it can't sometimes include all-machines and sometimes not
<menn0> I guess the oldest state server (which may or may not be the master) will have the most complete all-machines.log
<wallyworld_> i'm not saying we must include it etc - just offering an argument of why we should
<ericsnow> wallyworld_: agreed that the contents of a backup archive should be consistent
<wallyworld_> there's an argument we could leave it out
<menn0> wallyworld_: the problem is, I've seen the mongo master change even when the master doesn't go away
<wallyworld_> menn0: ericsnow: so, i think we need to ask stakeholders - do they want all-machines, and can we exclude it
<wallyworld_> hopefully they'll say they don't want it - they just want a backup so that a system can be restored if needed
<wallyworld_> for now, to remove the regression, we can ignore all-machines if it doesn't exist
<wallyworld_> or just exclude always
<ericsnow> wallyworld_: why was it included in the first place (along with machine-0.log)?
<wallyworld_> don't know for sure, i think everything in logs was just included
<menn0> wallyworld_: I don't think we're talking about excluding all-machines.log are we
<menn0> wallyworld_: just not aborting the backup if isn't there
<wallyworld_> menn0: well, we could exclude it so that we always have a consistent backup
<menn0> wallyworld_: i.e. it'll get included if it's there but otherwise there will just be a warning or something but the backup continues
<menn0> wallyworld_: ok fair enough
<wallyworld_> i'd rather not have it there sometimes and not others
<wallyworld_> that's IMHO
<menn0> wallyworld_: fair enough
<ericsnow> menn0: I'd be nervous about a backup archive having it only some of the time
<menn0> wallyworld_: if we do decide it should be there I think it still shouldn't cause the backup to abort if it's not there
<wallyworld_> agreed
<menn0> wallyworld_: not being able to backup if rsyslogd isn't working would be kinda shite
<wallyworld_> yep, i hope stakeholders just say we can exclude it
<menn0> wallyworld_: so how about ericsnow makes an immediate change to make the log optional so we can get CI unblocked
<wallyworld_> menn0: yep, that was my suggestion above
<menn0> wallyworld_: and then once we've heard more from stakeholders there might be a further change
<menn0> wallyworld_: ok. i missed that.
<wallyworld_> and ericsnow needs to ask nate to engage stakeholders to see what we need to do
<wallyworld_> we need to explain to them the implications etc so they understand the issues
<menn0> ericsnow: i've just noticed there's another cause for this CI test failure too
<menn0> ericsnow: ERROR unable to get DB names: EOF
<menn0> ericsnow: that's from a different run. http://juju-ci.vapour.ws:8080/job/functional-ha-backup-restore/1165/console
<ericsnow> wallyworld_: will do
<menn0> ericsnow: looks unrelated to the all-machines.log issue
<wallyworld_> ty
<ericsnow> menn0: I'll take a look in a sec
<menn0> ericsnow: ok, i'll leave you to it
<ericsnow> menn0: k
<menn0> sinzui, wallyworld_: do you think we can change the all-machines.log backup CI blocker to no longer be a regression? it's really more of a race condition that's only likely to happen in the test.
<wallyworld_> menn0: i do but it's not my ultimate call
<ericsnow> wallyworld_, menn0: http://reviews.vapour.ws/r/572/
<wallyworld_> ericsnow: why machine0Log?
<wallyworld_> it won't be always machine0 surely?
<ericsnow> wallyworld_: right
#juju-dev 2014-12-03
<davecheney> thumper: nagging nag, did you talk to smoser about power64 ?
<ericsnow> wallyworld_: it's left over from the original backup script
<ericsnow> wallyworld_: I'm in the process of sorting out all those hard-coded paths
<wallyworld_> ok
<wallyworld_> we must have been lucky this hasn't failed then
<wallyworld_> in an HA environment
<katco> wallyworld_: for your perusal: http://reviews.vapour.ws/r/519/
<wallyworld_> rightio
<ericsnow> wallyworld_: right
<thumper> davecheney: yes, been escelated
<thumper> for some spelling of that
<axw> wallyworld_: seen the Azure bug?
<wallyworld_> axw: yeah, otp, but i was going to ask you to pick it up. i trusted kapil to have tested it but obviously not :-(
<axw> will do
<wallyworld_> ty
<thumper> davecheney: I noticed that the juju/utils package tests panic with gccgo
<thumper> davecheney: but no idea why
<thumper> I couldn't grok the panic
<thumper> subpackages are fine, just the main one
<thumper> wallyworld_: do you know which juju projects have the autolander?
<menn0> ericsnow: it's pretty awful that everything is hardcoded to be able machine-0 but if that's going to get resolved soon then I guess that's ok
<davecheney> thumper: thanks
<menn0> s/able/about/
<davecheney> thumper: paste.ubuntu ?
 * thumper looks  for it again
<ericsnow> menn0: agreed
<thumper> davecheney: http://paste.ubuntu.com/9351357/
<thumper> ericsnow: do you know which juju projects are auto-landed?
<ericsnow> thumper: you talking about the bot or for reviewboard?
<thumper> ericsnow: bot
<thumper> ericsnow: although knowing about reviewboard would also help
<ericsnow> thumper: RB is currently just core and utils
<thumper> ericsnow: it seemed that juju/cmd didn't have the rb hookup
<davecheney> thumper: sorry, my internets shat themselves
<thumper> ericsnow: ah...
<davecheney> you were saying ?
<thumper> davecheney: http://paste.ubuntu.com/9351357/
<davecheney> ta
<thumper> I think I'll just use the github magic merge button for juju/utils
<davecheney> thumper: first thougth is you have the wrong gccgo
<thumper> davecheney: I've not changed it...
<davecheney> pretty much everyone has the wrong gccgo
<ericsnow> thumper: for the bot I think it's just core
<davecheney> thumper: do this
<davecheney> go test -c $PKG
<davecheney> gdb --args ./$PKG.test
<davecheney> r
<thumper> davecheney: can I use '.' for $PKG?
<davecheney> yes
<davecheney> so if you're in juju/cmd
<davecheney> the bin will be called
<davecheney> ./cmd.test
<thumper> davecheney: but also need to specify the compiler right?
<davecheney> yes
<ericsnow> wallyworld_, menn0: fix landed for 1398448
<davecheney> -c basically says "don't throw away the test binary afterwards, and give it a predictable name"
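Put together, the sequence described here (with the compiler flag thumper asks about) looks roughly like this; -compiler is the standard go tool flag for selecting gccgo, and the paths are illustrative.

    cd $GOPATH/src/github.com/juju/utils
    go test -compiler gccgo -c .      # keep the test binary, named utils.test
    gdb --args ./utils.test
    (gdb) r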
<menn0> ericsnow: awesome
<thumper> davecheney: all passed with gdb
<menn0> ericsnow: it'll take a while to filter down to the relevant CI jobs
<ericsnow> menn0: for that "unable to get DB names" issue, looks like the mgo session dropped or something (hence EOF)
<davecheney> thumper: so here is a fun thing
<davecheney> gccgo development has moved on since 4.9.2
<ericsnow> menn0: I'm working on making it at least a little more robust
<davecheney> what are out changes of getting gccgo 5.0 backported to trusty ?
<davecheney> s/out changes/our chances/
<thumper> nfi, but we can ask
<davecheney> basically we're going to have to stick with the trunk of gccgo if we want to have any support from upstream
<menn0> ericsnow: sounds good
<davecheney> thumper: this is on a branch ?
<davecheney> i'll try to repro
<thumper> davecheney: master
<davecheney> ok, that makes it easy
<davecheney> github.com/juju/cmd ?
<thumper> davecheney: no, utils
 * davecheney smacks forehead
<davecheney> http://paste.ubuntu.com/9351435/
<davecheney> thumper: ummm
<davecheney> what happened here
<davecheney> oh, wait
<davecheney> sorry, local problem
<wallyworld_> ericsnow: ty, sorry just got out of meeting
<ericsnow> wallyworld_: no worries
<wallyworld_> thumper: no, many are supposed to, you mean on github or lp?
<wallyworld_> i think the tarmac bot has died
<thumper> wallyworld_: I just don't know which ones
<thumper> wallyworld_: my general approach is to try $$merge$$ and if nothing happens for a few minutes, do it manually
<wallyworld_> we need to follow up there, it's fallen into a hole
<davecheney> thumper: ok, repro pretty easy
<thumper> wallyworld_: github lander not lp
<davecheney> % env GOMAXPROCS=42 ./utils.test
<davecheney> Segmentation fault (core dumped)
<davecheney> wheeee
<thumper> davecheney: is the fix as obvious?
<davecheney> now I should be able to get it to shit itself under gdb
<davecheney> urgh, looks like a gc bug
<davecheney> or a crash in libunwind
<davecheney> http://paste.ubuntu.com/9351492/
<thumper> interesting
<davecheney> thumper: what release are you running ?
<thumper> trusty
<davecheney> same
 * thumper afk for a bit
<mwhudson> davecheney: oh, i've seen that one
<mwhudson> davecheney: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64001
<davecheney> mwhudson: do you want me to work on a repro ?
<davecheney> it's related to the http -> tls -> asn code path
<mwhudson> davecheney: a small-ish repro would be great, yes
<mwhudson> davecheney: yeah, but the crash seems dementedly unrelated to that
<davecheney> you're right about the stack split
<davecheney> that paste above is pretty clear where it's happening
<mwhudson> yeah, but not why
<davecheney> something is making libunwind shit itself
<mwhudson> well no
<davecheney> ?
<mwhudson> at least not in my case
<mwhudson> it's in the __morestack splitting code
<mwhudson> and $rsp is bogus
<mwhudson> sometimes unaligned, sometimes unmapped
<mwhudson> i guess if you're really lucky it's valid but pointing at some random part of the heap so you get random corruption
<davecheney> http://paste.ubuntu.com/9351539/
<davecheney> mwhudson can you do this
<mwhudson> i saw at least three distinct failure modes fwiw
<davecheney> env GOGC=1 gdb --args go get github.com/lxc/lxd
<mwhudson> that was one of them
<davecheney> yeah, i've seen some others
<davecheney> i think the https one is the easiest to make a repro
<mwhudson> it worked once...
<mwhudson> but then
<mwhudson> Program received signal SIGBUS, Bus error.
<mwhudson> __morestack () at ../../../src/libgcc/config/i386/morestack.S:529
<davecheney> oh
<davecheney> interesting
<mwhudson> (gdb) p $rsp
<mwhudson> $1 = (void *) 0xffffedf45360
<davecheney> blergh
<davecheney> mwhudson: ok, i'll try to make a stand alone repro that blows up
<davecheney> hopefully ian can figure out what is failing
<davecheney> 'cos it's way above my ken
<mwhudson> yeah
<mwhudson> i spent an hour or so poking at it when i was in austin and got ~nowhere
<wallyworld_> katco: standup?
<davecheney> mwhudson: do you know if 4.9.2 is available in a trusty-backports ?
<katco> wallyworld_: shoot sorry brt
<mwhudson> davecheney: i do not know
<davecheney> poop
<davecheney> mwhudson: http://paste.ubuntu.com/9351595/
<davecheney> world's smallest repro
<davecheney> mwhudson: will you be my mule and put this stuff on the gcc bugzilla ?
<thumper> menn0: updated https://github.com/juju/cmd/pull/10/files
<thumper> davecheney: that is pretty small
 * thumper rushes to get the cmd branch in before anastasiamac
<mwhudson> davecheney: !
<axw> wallyworld_: sorry I was wrong, D1 fails for me too. I put in a rubbish name and saw "D1", but that's hard-coded in D1
<axw> err hard coded in Juju
<mwhudson> davecheney: sure, will en-bugzilla
<wallyworld_> axw: no worries, glad it's reproducible
<menn0> thumper: looking
<davecheney> mwhudson: so that one liner can blow up simply
<davecheney> i'm trying to dissect it down to the bit that confuses the gc
<davecheney> oh
<davecheney> i know what it could be
<davecheney> asn1 probably looks like a random field of pointers to the gc
<mwhudson> oh
<mwhudson> it fails with GOGC=off though?
<davecheney> should do
<davecheney> repro is super fiddly
<davecheney> oh, interestingly
<davecheney> that asplodes as well
<davecheney> which means you were right about the stack split
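
A minimal sketch of the kind of one-line https repro being described; the actual paste is not preserved here. It assumes a plain net/http GET is enough to drive the http -> tls -> asn1 path that was crashing under gccgo with an aggressive collector (GOGC=1):

    package main

    import "net/http"

    func main() {
        // The response body is ignored; the TLS handshake alone exercises
        // crypto/tls and encoding/asn1, the code path named above.
        if _, err := http.Get("https://example.com/"); err != nil {
            panic(err)
        }
    }
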
<anastasiamac> thumper: u r making me cry but it's k - whoever has the last laugh and all.... :-P
<menn0> thumper: done. there's a few doc issues but otherwise LGTM.
<menn0> thumper: note that some of the comments are directly against the last commit which Github makes a little tricky to get to.
<thumper> menn0: ta
<thumper> ah... didn't fix the docstrings on those alias methods
<thumper> bugger
<davecheney> anastasiamac: i like your attitude to code review
<davecheney> don't get mad, get even
<anastasiamac> davecheney: i get mad at drivers not coders ;-)
<anastasiamac> davecheney: plus thumper is scary
<anastasiamac> davecheney: wouldn't want to get mad at him...
<thumper> pfft
<davecheney> yeah, he's fluffy
<davecheney> like a kitten
<thumper> fluffy now with the facial fuzz
<axw> wallyworld_: https://code.launchpad.net/~axwalk/gwacl/rolesizes-fix-dg-names/+merge/243478 please
<wallyworld_> looking
<wallyworld_> axw: where are the Aliases used?
<axw> wallyworld_: nowhere yet, they'll be used in Juju
<axw> wallyworld_: I'll extend environs/instances/InstanceType to have an "Aliases []string" field
<wallyworld_> axw: ok, do we have the ExtraLarge etc ones the right way around?
<axw> wallyworld_: yes
<wallyworld_> rightio
<axw> wallyworld_: from Azure:
<axw> 2014-12-02 11:37:29 ERROR juju.cmd supercommand.go:323 failed to bootstrap environment: cannot start bootstrap instance: POST request failed: BadRequest - Value 'D1' specified for parameter 'RoleSize' is invalid. Allowed values are 'ExtraSmall,Small,Medium,Large,ExtraLarge,A5,A6,A7,A8,A9,Basic_A0,Basic_A1,Basic_A2,Basic_A3,Basic_A4,Standard_D1,Standard_D2,Standard_D3,Standard_D4,Standard_D11,Standard_D12,Standard_D13,Standard_D14'. (http code
<axw>  400: Bad Request)
<wallyworld_> ok, and the same for the G ones I guess
<wallyworld_> those names look like a dog's breakfast
<thumper> waigani_: very complex review ... not!  http://reviews.vapour.ws/r/575/
<thumper> wallyworld_: is the landing bot unblocked?
<wallyworld_> should be
<wallyworld_> for master?
<wallyworld_> i landed a fix last night
<thumper> jw4: FAIL: action_test.go:144: actionSuite.TestFindActionTagsByPrefix
<thumper> jw4: intermittent?
<mwhudson> davecheney: __morestack is called a _lot_
<mwhudson> it turns out
<thumper> WTH!!!
<thumper> that action test fails for me all the time...
<axw> wallyworld_: I always forget. the bot does gwacl?
<wallyworld_> axw: i merged by hand, i think tarmac is dead
<axw> ok
 * thumper throws his hands up and leaves the office
<thumper> how the fuck did this test land?
<thumper> menn0: state/action.go:278
<thumper> menn0: and apiserver/action/action_test.go:144
<davecheney> mwhudson: yeah, which is odd
<davecheney> 'cos there is no escape analysis in gccgo
<davecheney> so there should be _less_ stack pressure
 * thumper heads to the 'uper duper market
<mwhudson> davecheney: so i was wrong in my initial bug comment, __generic_morestack is returning junk
<mwhudson> davecheney: also, it really doesn't fail with my random gccgo tip build
<axw> wallyworld_: D1 fails to provision still, but it's a different error now (Compute.OverconstrainedAllocationRequest)
<axw> this may take a little while...
<wallyworld_> balls
<wallyworld_> sure that's not just a transient azure snafu
<axw> wallyworld_: I get the same error on West US and Southeast Asia, for both D1 and D2 multiple times
<axw> trying through the management console now
<wallyworld_> hmmm
<thumper> menn0: ping
<thumper> jw4: ping
<thumper> wallyworld_: can I get you to test something for me please?
<wallyworld_> sure
<thumper> wallyworld_: on master, run the tests in apiserver/actions plz?
<thumper> I get a fail that I can't see how it landed
<thumper> it is me
<thumper> grr
<thumper> fuckity fuck fuck
<thumper> I'm bringing in jw4's utils change...
<wallyworld_> ok, just gotta shelve, sec
<wallyworld_> OK: 8 passed
<wallyworld_> PASS
<wallyworld_> ok      github.com/juju/juju/apiserver/action   5.622s
<wallyworld_> thumper: ^^^^
<thumper> ta
<wallyworld_> let me check that my master is up to date, i think it is
<thumper> it is
<thumper> I know what it is
<wallyworld_> kk
<menn0> thumper: sorry, I was out giving programming lessons
<thumper> menn0: to?
<menn0> thumper: still need me to do something?
<menn0> thumper: a friend's son (he's 11 but very keen)
 * thumper nods
<menn0> doing an hour a week with him
<thumper> menn0: got a minute to hangout?
<menn0> thumper: yep
<thumper> menn0: standup hangout
<menn0> thumper: there now
<anastasiamac> 4421 lines in one file :-(
<anastasiamac> painful...
<wallyworld_> jam1: ping
<bradm> davecheney: hey
<davecheney> bradm: hey
<bradm> davecheney: this might be easier than going back and forth via RT :)
<davecheney> bradm: yeah
<bradm> davecheney: so I checked the squid config on batuan, you did not have github on it.
<bradm> davecheney: but you do now. :)
<davecheney> excellent
<davecheney> thanks
<davecheney> i'll try now
<bradm> davecheney: give it a go, all those things you listed should be working.
<davecheney> bradm: what do I do about getting mercurial on rugby ?
<bradm> davecheney: is the packaged version ok?
<davecheney> for <reasons> juju uses all three of git, bzr and hg
<davecheney> yup
<bradm> davecheney: do you need it in a chroot?  or just in the base OS?
<davecheney> base os
<davecheney> don't care about that chroot stuff
<davecheney> in this case i'm just a luser
<bradm> you put your hg in your bzr in your git?  then you should cvs it, and then rcs that.
<bradm> you'd never lose anything then.
<davecheney> thanks for the tip
<davecheney> did i mention we also use more than one code review system
 * davecheney stabs self in face
<bradm> davecheney: mercurial package is installed.
<davecheney> danka
<bradm> davecheney: how's that look?  need anything else to get you going?
<davecheney> squid.internal gives some public ip
<davecheney> can I check the proxy settings with you ?
<bradm> sure
<bradm> what are you using?
<davecheney> http_proxy=http://squid.internal:8123
<davecheney> https_proxy=$http_proxy
<davecheney> export http_proxy https_proxy
<bradm> try http_proxy=http://batuan.canonical.com:3128
<bradm> and the https_proxy bit too
<bradm> squid.internal is mostly used by UK hosts, I only mentioned it because I wasn't sure what you had
<davecheney> bradm: thanks
<davecheney> all working now
<bradm> davecheney:  perfect!  let us know if you have any further issues
<davecheney> bradm: dfc@rugby:~$ gcc
<davecheney> The program 'gcc' is currently not installed. To run 'gcc' please ask your administrator to install the package 'gcc'
<davecheney> sorry, i didn't even think to check this
<davecheney> could I get build-essential and gdb pls
<bradm> davecheney: running
<bradm> davecheney: done.
<davecheney> thanks
 * axw pulls hair out
<axw> wallyworld_: now I'm finding that some locations don't have some role sizes. gotta filter that out too...
<axw> le sigh
<wallyworld_> ffs
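
A rough sketch of the filtering axw mentions: keep only the role sizes that the chosen location actually offers. The function and names are illustrative, not gwacl's real API:

    package main

    import "fmt"

    // filterRoleSizes returns the candidate role sizes that also appear in
    // the list the location reports as available.
    func filterRoleSizes(candidates, available []string) []string {
        offered := make(map[string]bool, len(available))
        for _, name := range available {
            offered[name] = true
        }
        var result []string
        for _, name := range candidates {
            if offered[name] {
                result = append(result, name)
            }
        }
        return result
    }

    func main() {
        fmt.Println(filterRoleSizes(
            []string{"Standard_D1", "Standard_G1"},
            []string{"ExtraSmall", "Small", "Standard_D1"},
        ))
        // Prints: [Standard_D1]
    }
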
<wallyworld_> fwereade_: if you did have time, here's the work to generate the server cert on each state server. i've removed the cert from state entirely. http://reviews.vapour.ws/r/552/
<axw> wallyworld_: https://code.launchpad.net/~axwalk/gwacl/listlocations/+merge/243495 - another small one, please
<wallyworld_> looking
<TheMue> morning
<TheMue> jam1: a few seconds, missed to install the plugin
<TheMue> jam1: dimitern: so, in the hangout
<dimitern> TheMue, what hangout?
<TheMue> dimitern: I thought you would be part of the 1:1:1 ;)
<dimitern> TheMue, ah, I though yours was yesterday
<dimitern> thought
<TheMue> dimitern: jam1 moved it due to his holiday
<dimitern> TheMue, I don't have the link - can you add me to the guests?
<TheMue> dimitern: yep
<TheMue> dimitern: just invited you
<dimitern> TheMue, yeah, thanks - hmm.. it came to my phone though
<TheMue> dimitern: *lol* that's the magic of the google cloud
<voidspace> dimitern: when you get a chance old boy
<voidspace> dimitern: http://reviews.vapour.ws/r/564/
<dimitern> voidspace, sure, will have a look shortly
<voidspace> dimitern: "PickNewAddress" (or whatever we call it). Subnet method or State method?
<voidspace> dimitern: I think Subnet method. State has a big enough API already.
<voidspace> coffee - brb
<jam1> dimitern: TheMue: standup ?
<dimitern> jam1, omw
<TheMue> jam1: coming, just finished 1:1
<rogpeppe> voidspace: ping
<perrito666> good morning
<wallyworld_> jam1: after your standup, can you ping me back? i need to ask about bug 1397376
<mup> Bug #1397376: maas provider: 1.21b3 removes ip from api-endpoints <api> <cloud-installer> <fallout> <landscape> <maas-provider> <juju-core:Triaged> <juju-core 1.21:Triaged> <https://launchpad.net/bugs/1397376>
<jam1> wallyworld_: we'll want to make sure dimitern is in that conversation, as there seems to be some strong disagreement about whether things should be talking DNS names or IP addresses.
<wallyworld_> sure
<wallyworld_> customers want ip addresses i think
<wallyworld_> and there are claims only returning one address is a regression
<wallyworld_> since it used to return multiple
<jam1> wallyworld_: IIRC from the maas discussion, the IP address can change while the DNS name would stay consistent
<wallyworld_> and the api is called api-endpoints after all
<jam1> so the MaaS guys asked us to talk in terms of DNS names
<wallyworld_> oh, i see
<axw> rogpeppe: hey. would there be any reason for charm.Meta to have bson tags, if we duplicate the structure in state and put bson tags there?
<jam1> wallyworld_: api-endpoints did return only 1, then it started returning 2, then we reverted back to 1
<perrito666> davecheney: did you mean http://golang.org/pkg/os/#Create ? in your email about chmod?
<dimitern> wallyworld_, jam1, I can bring this up at today's maas cross team call
<rogpeppe> axw: yeah, i think we're storing it directly in mongo in the charm store
<axw> ok
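
For context on the bson-tag question: struct tags control the field names mgo uses when a struct is stored directly in mongo. The type below is a hypothetical stand-in, not the real charm.Meta:

    package main

    import (
        "fmt"

        "gopkg.in/mgo.v2/bson"
    )

    // meta is an illustrative document type; the bson tags pick the mongo
    // field names, and omitempty drops empty fields.
    type meta struct {
        Name        string `bson:"name"`
        Summary     string `bson:"summary"`
        Description string `bson:"description,omitempty"`
    }

    func main() {
        data, err := bson.Marshal(meta{Name: "wordpress", Summary: "blog"})
        if err != nil {
            panic(err)
        }
        var out bson.M
        if err := bson.Unmarshal(data, &out); err != nil {
            panic(err)
        }
        fmt.Println(out) // e.g. map[name:wordpress summary:blog]
    }
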
<wallyworld_> dimitern: that would be good. i fear (perhaps unnecessarily) that other providers would work better with ip addresses
<wallyworld_> eg openstack autopilot
<wallyworld_> see comment #8
<wallyworld_> so it seems we want one behaviour for maas, and another for openstack
<wallyworld_> but if the provider gives us the correct info for machine addresses, it all should just work
<dimitern> wallyworld_, afaik maas did not return dns names before due to a bug - it was supposed to
<wallyworld_> and we need the preferred address as Addresses[0]
<wallyworld_> as per comment #9
<wallyworld_> dimitern: so it seems then that with a maas that works correctly, the bug becomes a matter of ensuring that the preferred ip address is in Addresses[0] so that's the one printed
<dimitern> wallyworld_, or we can just try to resolve dns names before saving
<wallyworld_> for maas
<jam1> dimitern: wallyworld_: so I believe the bug *here* is that MaaS guys are saying "use the DNS name" but the Autopilot guys are saying "but I don't want to have to add MaaS as my DNS source"
<jam1> It makes some sense for inside the MaaS cloud as everything is run by MaaS, but for client machines
<jam1> they are fairly likely to just know the MaaS endpoint, and not be configured to talk to "foo.maas"
<wallyworld_> client machines would want ip address i think
<jam1> wallyworld_: AIUI, api-info is intended to supersede api-endpoints
<jam1> as it can give information on stuff like CA Cert
<wallyworld_> that could well be true, i'm not across the detail on this bit of the system, hence asking you guys :-)
<wallyworld_> but
<wallyworld_> we do need to consider backwards compatibility, no?
<wallyworld_> dimitern: jam1: so can i leave this bug in your guys' capable hands? :-)
<dimitern> wallyworld_, sure thing :)
<wallyworld_> ty :-) \o/
<wallyworld_> dimitern: andrew is working the azure bug - it got complicated, so he has to add in extra apis to query location as not all locations support all the role sizes :-(
<dimitern> wallyworld_, sweet! I underestimated that one badly
<jam1> dimitern: I'm just making coffee, but I'll be at the next meeting.
<wallyworld_> dimitern: me too. i thought that the gwacl changes made by kapil were all good to go, bad assumption :-)
<wallyworld_> as it turned out
<wallyworld_> azure is complicated
<dimitern> yeah, and often broken as well
<wallyworld_> yep, through no fault of ours it seems many times
<dimitern> fwereade_, jamespage, gnuoy, networking call?
<jamespage> dimitern, yes - sorry - still in stockholm
<bac> hi axw, you still around?
<axw> bac: heya, yes I am
<bac> axw: good, i know it is late for you.  i saw a problem with azure yesterday that i wanted to tell you about.
<axw> bac: is it this one? https://bugs.launchpad.net/juju-core/1.21/+bug/1398406
<mup> Bug #1398406: Azure provider attempts to deploy with unsupported "D1" RoleSize <azure-provider> <bootstrap> <ci> <regression> <juju-core 1.21:In Progress by axwalk> <https://launchpad.net/bugs/1398406>
<bac> axw: no.
<axw> okey dokey
<voidspace> rogpeppe: pong
<voidspace> rogpeppe: sorry, only just seen your ping
<rogpeppe> voidspace: np
<bac> axw: it involved having two environments up with the same credentials but different storage for the state servers.  i used destroy-environment on one and it took down both.
<rogpeppe> voidspace: how much do you know about USSO?
<axw> eep
<voidspace> rogpeppe: well, I've never heard of the acronym
<bac> axw: i have the remnants on azure but have not been able to create a minimal reproduction of the problem
<voidspace> rogpeppe: so not a good start...
<rogpeppe> voidspace: just saw your name on the identityprovider source code and thought you might be able to help us...
<bac> axw: i was using 1.20.12-utopic-amd64.
<axw> bac: I suspect azure is not using the env UUID to separate things... I will see if I can figure it out. did you raise a bug already?
<voidspace> rogpeppe: heh, I worked a lot on identityprovider - but not actually on the identity protocols in the end
<bac> axw: i did not since i could not reproduce
<voidspace> rogpeppe: and I don't recall ever hearing USSO, so it might have been added after I left
<axw> fair enough
<rogpeppe> voidspace: ubuntu single sign on
<axw> bac: thanks, I'll take a look at the code to see if I can think up a repro
<bac> axw: the problem was quite costly as it destroyed our CI environment
<axw> :(
<voidspace> rogpeppe: ah...
<bac> axw: if you'd like to look at the jenv files or azure storage i've kept them
<voidspace> rogpeppe: oh, it depends on the question then
<voidspace> rogpeppe: it's been a while though...
<axw> bac: that would be good to have
<bac> axw: and i'm happy to file a bug
<rogpeppe> voidspace: perhaps you could join us in a hangout for a few moments?
<voidspace> rogpeppe: sure
<bac> axw: do you have access to chinstrap? i'd like to not sanitize the jenv files so i don't want to attach them to the bug report
<axw> bac: yes I do
<bac> axw: cool, i'll put it there and reference it.
<bac> axw: thanks and good night
<axw> bac: thanks
<axw> (and sorry about your CI env :()
<axw> PTAL, I tweaked the isLimitedRoleSize code a bit
<axw> err
<axw> wallyworld_: ^^
<wallyworld_> looking
<wallyworld_> axw: just a typo, might be good for one more live test before landing, just to be sure with the tweaks
<axw> wallyworld_: thanks. sure
 * axw dinners first
<wallyworld_> np, ty
<dimitern> voidspace, whew.. no meetings for a while, so back to your review
<voidspace> dimitern: cool, thanks
<dimitern> voidspace, reviewed, please ping me if something is unclear
<voidspace> dimitern: sure, thanks
<perrito666> Is there a create version that supports permissions?
<perrito666> what exactly will _posix tag match?
<perrito666> build* tag
<wallyworld_> fwereade_: if you did get a chance or the inclination to look at http://reviews.vapour.ws/r/552/ that would be great, but you're busy so i'll bother someone else tomorrow if needed
<fwereade_> wallyworld_, thanks, I'll try
<wallyworld_> sure, no hassle if not
<axw> wallyworld_: the Azure fixes still need backporting to 1.21, I'll do that tomorrow
<wallyworld_> axw: sure, np
<wallyworld_> thanks for fixing
<bac> axw: i filed bug 1398820, not sure if you saw it
<mup> Bug #1398820: destroying Azure environment took down other environment <juju-core:New> <https://launchpad.net/bugs/1398820>
<mbruzek1> wallyworld_ ping
<bac> sinzui: thanks for the insight on that bug.  i'm trying to reproduce with azure and azure-1 now.
<anastasiamac> mbruzek1: it's 00.35am where wallyworld is :)
<wallyworld_> but sadly he is here
<mbruzek1> ping retracted go to sleep Ian
<wallyworld_> mbruzek1: how can i help?
<anastasiamac> wallyworld_: OMG!!
<wallyworld_> like you can talk :-)
<sinzui> bac: abentley and I are trying to remember the issue we found about a year ago.
<wallyworld_> same time for you
<anastasiamac> wallyworld_: I had 89 comments on my tiny PR... how can I sleep?
<wallyworld_> all/mostly trivial
<anastasiamac> wallyworld_: true. in fact, I was hoping u could review tomorrow in hopes of landing before ur holiday :)
<abentley> sinzui, bac: https://bugs.launchpad.net/juju-core/+bug/1257481
<mup> Bug #1257481: juju destroy-environment destroys other environments <ci> <destroy-environment> <juju-core:Fix Released by jameinel> <juju-core 1.16:Fix Released by jameinel> <juju-core (Ubuntu):Fix Released> <juju-core (Ubuntu Saucy):New> <https://launchpad.net/bugs/1257481>
<wallyworld_> anastasiamac: sure
<anastasiamac> wallyworld_: thnx ;-) m EOD-ing now
<wallyworld_> as well you should
<bac> thanks abentley.
<anastasiamac> wallyworld_: u2
<mattyw> sinzui, landing is blocked on 1398448 - but it looks like a fix has landed for it?
 * sinzui checks test
<sinzui> mattyw, ha ha, that bug is indeed fix released, but is replaced by another bug 1398837
<mup> Bug #1398837: cannot extract configuration from backup file: "var/lib/juju/agents/machine-0/agent.conf <backup-restore> <ci> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1398837>
<mattyw> sinzui, so still blocked - but for a different reason?
<sinzui> mattyw, yes, sorry.
<mattyw> sinzui, no problem - is someone working on that bug?
<sinzui> mattyw, not yet, it was reported when abentley was checking on the other blocking bug. I am sending out an email about the many hot bugs targeted to 3 milestones
<mattyw> sinzui, ok - I'm not volunteering as I'm going to be out for a few days - but just wanted to make sure they weren't being ignored
<mattyw> sinzui, so I'm not being helpful - but I'm being supportive ;)
<sinzui> understood Makyo
<sinzui> understood mattyw
<natefinch> ooh ooh, can I be unhelpful but supportive, too? ;)
<mattyw> natefinch, I've taken that job, you'll have to be helpful but unsupportive
<mattyw> or you can be unhelpful and unsupportive
<natefinch> I can't believe you fools!  here's the fix.
<perrito666> natefinch: /ignore sinzui ?
<natefinch> I didn't even know /ignore was a thing...
 * natefinch is an IRC n00b
<perrito666> odd, you seem old enough to have been young when IRC was a thing
 * perrito666 hides
<natefinch> thou dost injure me, perrito666
<natefinch> I used IRC back in the late 90's, and figured everyone had moved on since then.  But, you know, linux people seem to pine for 1999 for some reason.
<perrito666> natefinch: you do realize that if I did not spend so much time watching tv and movies that joke would have been completely lost on non-native English speakers
<dimitern> wallyworld_, jam1, the dns issue with api-endpoints for maas is not maas-related, it stems most likely from the changes in address selection logic (i.e. prefer hostnames to public/cloud-local IPs - the latter being most common with maas)
<katco> sinzui: ping
<sinzui> hi katco
<katco> sinzui: good morning :)
<katco> sinzui: i'd like to provide andreas with some binaries for tip of 1.21
<katco> sinzui: what's the best-practice for doing so? i want to make sure he's testing what will eventually be b4
<sinzui> katco, http://juju-ci.vapour.ws:8080/job/publish-revision/1254/ lists the binaries we built and tested
<sinzui> katco, , but there are no streams for these
<katco> sinzui: that's probably _perfect_
<katco> sinzui: i haven't wrapped my head around simple streams yet, but i suspect he could use these binaries to update his existing testing environment?
<ericsnow> perrito666: FYI, that blocker is due to the hard-coded "machine-0" in the old restore :P
<sinzui> katco, --upload-tools is the only option.
<katco> sinzui: hm. he can probably build a local environment if pressed.
<sinzui> katco, our testing streams are volatile, they are master at the moment. they were 1.21-beta4 for about 3 hours when those packages were made
<sinzui> katco, yes, the local env will be fine he just needs the state-server and two services
<katco> sinzui: i'm pretty confident in that use-case. but i want to make sure it works for his more complex example
<perrito666> ericsnow: well, i told you that there was no guarantee of compatibility with the new backup
<ericsnow> perrito666: I'm expecting that the new restore will break in the same way (under HA)
<bac> sinzui: i am able to reproduce bug 1398820 and have added more information.
<mup> Bug #1398820: destroying Azure environment took down other environment <azure-provider> <destroy-environment> <juju-core:Triaged> <https://launchpad.net/bugs/1398820>
<perrito666> ericsnow: most likely, but it is waaaay easier to fix
<ericsnow> perrito666: easier to test at least :)
<bac> sinzui: due to documented loss of user data (the blunder cost us 10 engineer hours and denied 12 engineers the use of our landing automation for five hours) i'd suggest it be marked critical.  a work-around after-the-fact is not very useful.
<perrito666> ericsnow: indeed
<sinzui> thank you bac
<voidspace> dimitern: one of your suggestions
<voidspace> dimitern: adding the following to the comment for the State.IPAddress method "with the given value."
<voidspace> dimitern: you don't think it's entirely obvious that one returned will represent the value you pass in?
<dimitern> voidspace, yeah, I really meant to say "missing full-stop"
<voidspace> dimitern: heh, you got me on that one.
<voidspace> dimitern: cool
<dimitern> voidspace, in general doc comments should be proper sentences when possible I think
<voidspace> dimitern: ok, I've never been quite sure.
<voidspace> dimitern: I'll stick to that from now on. Easy enough.
<dimitern> voidspace, cheers
<alexisb> dimitern, ping
<dimitern> alexisb, hey
<dimitern> alexisb, just replying to your mail btw
<alexisb> hey there dimitern
<alexisb> nws, no rush on that
<alexisb> I saw some of your irc chatter earlier; are you working this bug:
<alexisb> https://bugs.launchpad.net/juju-core/+bug/1397376
<mup> Bug #1397376: maas provider: 1.21b3 removes ip from api-endpoints <api> <cloud-installer> <fallout> <landscape> <maas-provider> <juju-core:Triaged> <juju-core 1.21:Triaged> <https://launchpad.net/bugs/1397376>
<alexisb> ??
<dimitern> alexisb, I did investigate the issue - it seems like a juju-specific regression (if you can call it that, since it was never documented nor claimed anywhere api-endpoints should return IPs instead of hostnames)
<alexisb> dimitern, given it is currently marked as critical and blocking the 1.21 release we need to decide on a path forward and get it resolved
<dimitern> alexisb, there are several possible solutions
<dimitern> alexisb, and I prefer to have the same one across all providers, i.e. always return a usable IP as first endpoint (if we only have a hostname try resolving it first)
<dimitern> alexisb, I'll add a comment to the bug
<alexisb> dimitern, thank you
<dimitern> alexisb, and propose a fix + backports
<alexisb> make sure to make it clear if you are looking for a response from the bug commiter, etc
<dimitern> sure
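
A minimal sketch of the approach dimitern describes (resolve a hostname endpoint to an IP before reporting it, falling back to the hostname when resolution fails); the function is illustrative, not the actual fix:

    package main

    import (
        "fmt"
        "net"
    )

    // resolveEndpoint returns host:port with the host replaced by its first
    // resolved IP; if the host is already an IP or cannot be resolved, the
    // endpoint is returned unchanged.
    func resolveEndpoint(endpoint string) string {
        host, port, err := net.SplitHostPort(endpoint)
        if err != nil {
            return endpoint
        }
        if net.ParseIP(host) != nil {
            return endpoint
        }
        ips, err := net.LookupIP(host)
        if err != nil || len(ips) == 0 {
            return endpoint
        }
        return net.JoinHostPort(ips[0].String(), port)
    }

    func main() {
        fmt.Println(resolveEndpoint("localhost:17070"))
    }
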
<voidspace> dimitern: instead of enforcing that InterfaceId or MachineId can only be set once
<voidspace> dimitern: how about we only do it once?
<voidspace> dimitern: if we do that then the check is unnecessary. If we *need* to do it more than once then the code is actually a hindrance.
<voidspace> dimitern: so at *best* the code is useless... IMO
 * fwereade_ swears at the jujud tests and kicks things for a bit
<dimitern> voidspace, how can you guarantee it's only once
<voidspace> dimitern: by only doing it in one place
<voidspace> dimitern: by understanding the code
<voidspace> dimitern: if we attempt to do it more than once and it fails juju will have problems due to a code path that stops
<voidspace> dimitern: so we need to do that *anyway*
<voidspace> dimitern: I can add an assert, it's easy enough
<voidspace> dimitern: I'm just not convinced it's very useful, and may actually be the opposite of useful (at which point we just take it out again)
<alexisb> fwereade_, I see you are having a good evening ;)
<fwereade_> alexisb, a delight, as always :)
<alexisb> :)
<dimitern> voidspace, ok, let's think about it for a bit
<dimitern> voidspace, the reason we have similar "hard states", e.g. once a machine is provisioned it can't be "unprovisioned", is because these steps happen at different times and more importantly in different workers
<voidspace> dimitern: right
<voidspace> dimitern: in our case an IP address will be requested *for* a machine
<voidspace> dimitern: so nothing else will attempt to use it
<voidspace> dimitern: and once allocated the MachineId will be set
<voidspace> dimitern: so there's no use case for changing it, but nor is it possible that something will try
<dimitern> voidspace, yeah, it seems reasonable to bind setting state to allocated with setting the machine id
<dimitern> voidspace,  how about having a AllocateTo(machineId, interfaceId) method ?
<voidspace> dimitern: instead of individual setters - ok.
<voidspace> dimitern: to add the asserts I need to get rid of omitempty it would seem
<voidspace> I *would* need to get rid of omitempty
<dimitern> voidspace, it asserts the state is unknown and machine / interface ids are empty before setting state to allocated + mid +iid
<dimitern> voidspace, yeah - omitempty should go on both ids
<voidspace> dimitern: so you still want the asserts...
<voidspace> to protect us against something that can never happen... :-p
<voidspace> I should sprinkle anti-polar-bear dust around the code as well just in case
<dimitern> voidspace, imagine multiple workers trying AllocateTo() in parallel
<voidspace> dimitern: but they'll be given different ip addresses
<dimitern> voidspace, true
<voidspace> technically there's a race between fetching the full set of existing addresses and generating a new one
<dimitern> voidspace, ok, at least let's have an error when it's already allocated
<voidspace> ok :-)
<dimitern> voidspace, indeed, that's why we need to assert the txn-revno of ipaddressesC hasn't changed since we fetched them :) - good point
<voidspace> a fair compromise and that will protect against the race
<voidspace> we should do that in the code that generates the ip address
<dimitern> voidspace, but that's relevant to the picking algorithm
<dimitern> voidspace, yep
<katco> voidspace, dimitern: not sure if this is how fwereade_ intended this to be used, but the new leasing stuff will provide a sort of "environment mutex"
<katco> where you could say, "i'm mucking in the ip addresses, give me that lease"
<dimitern> voidspace, so AllocateTo() returns nil on success or - say ErrAlreadyAllocated with message "IP "1.2.3.4" is already allocated to machine "1""
<katco> and then when others would like to know about ip addresses they could grab that lease
<dimitern> katco, oh GIL ? :)
<dimitern> or we can call it JIL instead :D
<katco> not familiar with that term
<dimitern> katco, good to know
<fwereade_> katco, I suspect that will not be a very happy pairing of components
<dimitern> katco, that's python's global interpreter lock
<katco> ahh ok
<katco> fwereade_: thanks for chiming in :) i retract my suggestion! :p
<fwereade_> katco, no worries :)
<fwereade_> dimitern, voidspace: I'm not really up to date with what you're dicussing, btw
<fwereade_> dimitern, voidspace: but txn-revno is a very big hammer to be using in general
<dimitern> voidspace, and the other method can be SetUnavailable(value) - returning some (doesn't have to be typed) error when the address is already unavailable
<dimitern> fwereade_, we're discussing how to do the address allocation algorithm for containers
<dimitern> fwereade_, creating a new placeholder doc in the ipaddresses collection with state "unknown" (similarly to local charm uploads), then allocating it (or failing) and changing the state to "allocated" or "unavailable" depending on the result
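
A sketch of the transaction shape being discussed for AllocateTo: assert the address document is still unallocated and set the state plus both ids in one step, so a concurrent second caller fails the assert. The collection and field names are assumptions, not the final juju/state schema:

    package main

    import (
        "fmt"

        "gopkg.in/mgo.v2/bson"
        "gopkg.in/mgo.v2/txn"
    )

    // allocateOps marks an address as allocated to a machine and network
    // interface only if it is still in the "unknown" state.
    func allocateOps(addr, machineId, interfaceId string) []txn.Op {
        return []txn.Op{{
            C:      "ipaddresses",
            Id:     addr,
            Assert: bson.D{{"state", "unknown"}, {"machineid", ""}},
            Update: bson.D{{"$set", bson.D{
                {"state", "allocated"},
                {"machineid", machineId},
                {"interfaceid", interfaceId},
            }}},
        }}
    }

    func main() {
        fmt.Printf("%+v\n", allocateOps("10.0.0.3", "1", "eth0"))
    }
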
<perrito666> natefinch: stand up?
<voidspace> dimitern: using txn-revno would also prevent *another* address being allocated
<natefinch> perrito666: momentarily
<dimitern> voidspace, correct
<fwereade_> voidspace, dimitern: I'm probably being dense, but txn-revno on what doc exactly?
<voidspace> fwereade_: ipaddresses - a new one
<dimitern> fwereade_, voidspace, on the new ipaddresses collection Michael is doing now
<fwereade_> voidspace, so the documents in that collection will be one-to-one with machines?
<dimitern> fwereade_, it will contain all machine addresses juju knows about
<fwereade_> dimitern, not just one doc with all addresses, surely?
<dimitern> fwereade_, yes - machines and network interfaces
<fwereade_> dimitern, that will be a hell of a bottleneck
<dimitern> fwereade_, wow, that hadn't occurred to me *lol*
<fwereade_> dimitern, even without txn-revno, all writes will be serialized
<dimitern> fwereade_, right, so txn-revno is no good then
<dimitern> fwereade_, we can't just fetch all addresses, pick a random one in the same range that doesn't yet exist, then create it asserting the collection hasn't changed
<katco> fwereade_: forgive my ignorance, but doesn't this really sound like an environmental mutex would be helpful? what am i missing?
<dimitern> katco, how is that mutex supposed to work?
<fwereade_> dimitern, voidspace: I am most likely misunderstanding the problem -- but I'm not following who needs to allocate these addresses?
<fwereade_> dimitern, voidspace: surely in the general case we are asking the provider for new addresses, aren't we?
<voidspace> fwereade_: no
<katco> dimitern: well, the lease service is a fifo stack; 1 exists per state machine, and only 1 person can own a lease for a namespace at a time
<voidspace> fwereade_: for "reasons"
<voidspace> fwereade_: let me remember specifically
<dimitern> fwereade_, no, we're asking for a specific address
<voidspace> fwereade_: with ec2 we can request they give us a new address - but they don't tell us what it is
<voidspace> fwereade_: so we then have to look on the network interface and find "the new address"
<voidspace> fwereade_: and if we do that in parallel (adding several machines to an interface) we have race conditions working out which one is for which machine
<dimitern> fwereade_, yeah - ec2 is quite unhelpful and most providers allow you to request a specific address to allocate
<voidspace> fwereade_: but both ec2 and maas (and openstack but we only initially care about ec2 and maas)
<fwereade_> voidspace, dimitern: oh, what fun
<voidspace> fwereade_: will allow us to pick an address and request that specific one
<voidspace> fwereade_: so we're generating and storing allocated addresses
<dimitern> fwereade_, and keeping track which machine uses them, or marking them as unavailable (i.e. something else uses it, we'll try another but remember this)
<fwereade_> voidspace, dimitern: ok, so, to try to restate in my own words
<fwereade_> voidspace, dimitern: an example would be: you have machine N, with a bunch of containers N/lxc/A,B,C
<fwereade_> voidspace, dimitern: something running in the agent for machine N observes that there are 3 new addresses available, and needs to assign them to the appropriate containers?
<fwereade_> voidspace, dimitern: or have I completely misunderstood the context?
 * fwereade_ suspects he completely has
<voidspace> fwereade_: it's more that we notice there are three new containers, we need to create three new addresses
<voidspace> fwereade_: more likely we will try to create one address three times
<dimitern> fwereade_, yeah - that thing will be the container provisioner most likely, but it won't notice the new addresses, it will request them and allocate them to each container, so that the address updater (after this happens) can see the machine has 4 IPs (as seen via the cloud api) but only 1 IP is for the machine (already allocated) the rest are for the containers
<voidspace> fwereade_: if we aren't picking the addresses ourselves, but we have a worker that requests an address and then has to work out what the new one is
<voidspace> fwereade_: if that happens three times in parallel - when the worker checks the network interface and sees three new addresses, how does it know which is the "new one"
<fwereade_> dimitern, that cannot be the container provisioner, can it?
<fwereade_> dimitern, we don't want to give it the cloud credentials necessary to ask for the addresses
<dimitern> fwereade_, it won't talk to the cloud directly, obviously the api will be used
<dimitern> :)
<fwereade_> dimitern, ok cool :)
<fwereade_> dimitern, not 100% sure it's the container provisioner then? but, ehh, that's an irrelevant detail at this point I guess
<dimitern> voidspace, fwereade_, so effectively: the apiserver will be creating ipaddresses docs, calling the cloud to allocate them and assigning them to their machineid
<dimitern> voidspace, fwereade_, that's a detail yes, we'll get there
<fwereade_> dimitern, so, what we have at the moment is that the machiner calls SetMachineAddresses with all the addresses it knows about -- and presumably we'll keep on doing roughly the same thing? just doing so more often, and not in the machiner?
<marcoceppi> bac: building amulet now
<marcoceppi> will be released in about 20 mins
<bac> thanks marcoceppi
<dimitern> fwereade_, IIRC SetMachineAddresses discovers what's on the machine, not via the cloud api
<fwereade_> dimitern, yes
<dimitern> fwereade_, so it will be the same
<dimitern> fwereade_, the address updater is problematic and has to be fixed carefully to separate container addresses from host addresses when refreshing the instance info from the cloud
<fwereade_> dimitern, and we also get a stream of SetAddresses~s coming from the instanceupdater as well
<fwereade_> dimitern, how *can* we separate them? what's the difference between a provider address that was allocated for a container, and one that was allocated for the host machine?
<dimitern> fwereade_, yeah - which in turn also calls SetAddresses at the end via the apiserver
<katco> sinzui: we have confirmation that 1397995 is fixed in v1.21. discussions are ongoing about expected behavior, but that's just a spec thing, not a bug.
<dimitern> fwereade_, ha! there's the beauty of it :)
<sinzui> katco, \o/
<katco> sinzui: tyvm for your efforts and assistance. you did a great job! :)
<dimitern> fwereade_, because we pre-allocate the address we know what it is and for which machine id (or container) it's allocated before the container starts
<fwereade_> dimitern, ok, so we (could) have a provider id for each address that comes in via the instanceupdater?
<dimitern> fwereade_, I think the least painful way to resolve all these is to change state.Machine.SetAddresses() internally to do the separation considering what's in ipaddressesC
<dimitern> fwereade_, by provider id you mean instance id in this case?
<fwereade_> dimitern, no? I think I mean a provider id for the address
<dimitern> fwereade_, there's no specific id - the IP itself is the key - at least in maas, openstack and ec2
<fwereade_> dimitern, instance id is not a problem, is it? we know which *host* machine already because it's 1:1 with instances
<fwereade_> dimitern, I am suspicious that there is an implicit assumption that those addresses won't change
<fwereade_> dimitern, which does not gel with what I understand ec2 in particular is likely to do to you if you suspend your instances for a bit
<dimitern> fwereade_, it depends on how the instance was started (i.e. the termination behavior) - we don't do suspend/resume ourselves
<fwereade_> dimitern, ok, I am still questioning the assumption that addresses won't change
<fwereade_> dimitern, eg when an AZ falls over for a bit I would like it if we were able to absorb that when everything was running again
<voidspace> evilnickveitch: ping
<evilnickveitch> voidspace, hey
<voidspace> evilnickveitch: hey
<voidspace> evilnickveitch: you just sent me an email about juju docs
<voidspace> evilnickveitch: I suspect it was intended for someone else...
<evilnickveitch> voidspace, oops - sorry! I am blaming gmail autocomplete
<voidspace> evilnickveitch: :-)
<dimitern> fwereade_, the address won't change *at will*
<dimitern> fwereade_, it will happen due to some specific user action - i.e. stopping an instance and restarting it perhaps - juju should be able to detect this and know how to handle it
<dimitern> fwereade_, an address that was allocated (reserved, in-use, whatever) cannot be reused by any other node on the same subnet (except for some weird clustering/balancing/etc. on L2)
<dimitern> fwereade_, a machine can get a new (allocated) address and become unreachable under the old one, but the old one won't just appear on some machine (as long as the machine is up)
<fwereade_> dimitern, no arguments there... the problem STM to be that i-abc123 can reasonably return 2 non-overlapping sets of IPs in two consecutive requests to the provider
<fwereade_> dimitern, and if we don't have some notion of what the underlying provider id of a given IP is, we can't cleanly map the first set onto the second
<fwereade_> dimitern, (that said, I believe that the underlying provider ids do not necessarily exist -- I'm not saying you should find them, but I am fixating on the nasty edge cases that I think I can see)
<fwereade_> dimitern, in general it's not outside the bounds of probability that we might see, for a single machine
<dimitern> fwereade_, I understand the problem and appreciate you bringing it up - I'll make sure we do have tests for such cases
<fwereade_> dimitern, SA: a; SMA: b; SA: c, d; SMA: e,f
<fwereade_> dimitern, cool
<fwereade_> dimitern, have I just been massively derailing you then? :(
<dimitern> fwereade_, shouldn't we prefer addresses discovered on the machine rather than the ones coming via the cloud api?
<fwereade_> dimitern, it still sort of feels like the problem you're talking about is "how do we reconcile these streams of addresses from two different sources"
<fwereade_> dimitern, hmm
<fwereade_> dimitern, I *think* the cloud api is more likely to give us usable addresses, isn't it?
<fwereade_> dimitern, like how in address-get we want to return the advertise address rather than the bind address?
<dimitern> fwereade_, no, it's very useful as I didn't have much time to think in detail about how to solve the 2 address sources problem
<fwereade_> dimitern, I assert <wave hands vigorously> that what the cloud tells us is more likely to correspond to the addresses which will cause the traffic to be delivered to the right place
<dimitern> fwereade_, well, if the cloud *thinks* node A uses IP1, but eth0 on A says IP2, the latter will be usable (all other things equal - i.e. no restrictive firewalling/routing for IP1 vs IP2)
<dimitern> fwereade_, yeah
<fwereade_> dimitern, my concern is exactly that all other things will not be equal
<dimitern> fwereade_, at some point we'll need a way to actually *verify* a given node's IP is usable (inbound and out)
<fwereade_> dimitern, eg when you expose something to the public internet, you need to advertise the public address that will get the packets to the right place, which is not necessarily the one you're binding to at all
<fwereade_> dimitern, that's a nice-to-have where it's practical
<dimitern> fwereade_, until then I guess trusting the cloud makes more sense
<fwereade_> dimitern, but I'm not sure it's something we can depend on in all cases
<fwereade_> dimitern, in particular
<fwereade_> dimitern, these thorny sorts of questions
<fwereade_> dimitern, are why I'm keen that we continue to record the stuff that gets set via SA/SMA unchanged
<dimitern> fwereade_, the "just-in-" case :)
<fwereade_> dimitern, and we recalculate our best guess at reality, taking *both* of those things into account, in response to a change in either
<voidspace> dimitern: when you have a minute (shouldn't take any longer) can you check the implementation of AllocateTo
<voidspace> dimitern: http://reviews.vapour.ws/r/564/
<voidspace> dimitern: all issues resolved
<dimitern> fwereade_, right, but the ips known to be allocated should be used when needed, while the SA/SMA ones only when they change
<dimitern> voidspace, sure, will look shortly
<dimitern> fwereade_, (the IPs in ipaddressesC I mean)
<fwereade_> dimitern, true -- the providers that *do* tell us what IPs they've allocated should probably be trusted over and above the SA/SMA ones
<dimitern> fwereade_, what if the cloud says node A's IP is now b (used to be a)? Should we also ensure SMA reports b as well? (I mean reconfigure the interface on the machine)
<fwereade_> dimitern, I have literally no idea, I'm afraid :(
<fwereade_> dimitern, I suspect "it depends" but I don't really know what on
<dimitern> fwereade_, and also how often/when this is possible
<dimitern> voidspace, reviewed
<dimitern> voidspace, I'm ok with moving subnet method tests from state_test.go into a new subnets_test.go as a follow-up
<alexisb> dimitern, you joining the networking call?
<voidspace> dimitern: ok
<voidspace> dimitern: thanks, still a few things to fix
<voidspace> dimitern: do we *need* DeepEquals for comparing structs?
<voidspace> dimitern: I wasn't sure
<natefinch> equals will work for simple structs that don't have, for example, maps or slices in them
<voidspace> natefinch: and if it doesn't work it should complain, right?
<voidspace> natefinch: as it worked I didn't feel the need to use DeepEquals
<natefinch> yep
<voidspace> cool
<voidspace> natefinch: so no reason to prefer jc.DeepEquals over gc.Equals if it's not required?
<natefinch> voidspace: there's probably no reason not to use DeepEquals, the difference in speed is going to be negligible..... trying to think if there might be a reason why Equals would be preferred...
<voidspace> natefinch: heh, your answer is the opposite of what I asked :-)
<natefinch> voidspace: my point is, jc.DeepEquals should always work, so there may be no reason to ever use gc.Equals unless you're doing a very large number of comparisons.  If one way always works, and one way usually works... why not just always use the way that always works?
<voidspace> natefinch: because I already have a way that works and I need a reason to swap...
<voidspace> natefinch: I understand what you're saying though
<natefinch> voidspace: if it works, it works, don't muck with it ;)
<voidspace> my thoughts exactly :-p
<voidspace> I'll start using jc.DeepEquals just to not have to ever have this conversation again...
<natefinch> haha
<dimitern> voidspace, it's good to use DeepEquals for structs, otherwise you might end up comparing .String() or .GoString() values
<voidspace> dimitern: ah, so there is a reason
<voidspace> dimitern: when would that happen?
<dimitern> voidspace, it depends on how the Equals checker is implemented I guess
<dimitern> voidspace, can't really say without experimenting a bit
<voidspace> ah
<dimitern> voidspace, I tend to use DeepEquals (the jc version, not the gc one as it prints more comprehensive error messages with deeply nested structs / long maps, etc.)
<dimitern> voidspace, ..for anything than a simple type
<voidspace> dimitern: better failure messages is a good reason to prefer it
<dimitern> voidspace, but maybe it's just me being paranoid :)
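
A small sketch of the trade-off in this exchange: gc.Equals compares with ==, so it cannot usefully compare structs that contain slices or maps, while jc.DeepEquals compares them field by field. The checkers are normally used via c.Check inside a gocheck suite; calling them directly here just makes the difference visible, and the struct is illustrative:

    package main

    import (
        "fmt"

        jc "github.com/juju/testing/checkers"
        gc "gopkg.in/check.v1"
    )

    type fakeSubnet struct {
        CIDR  string
        VLANs []int
    }

    func main() {
        a := fakeSubnet{CIDR: "10.0.0.0/24", VLANs: []int{1, 2}}
        b := fakeSubnet{CIDR: "10.0.0.0/24", VLANs: []int{1, 2}}
        names := []string{"obtained", "expected"}

        ok, _ := jc.DeepEquals.Check([]interface{}{a, b}, names)
        fmt.Println("DeepEquals:", ok) // true: slices compared element-wise

        ok, msg := gc.Equals.Check([]interface{}{a, b}, names)
        // false: a struct containing a slice is not comparable with ==,
        // so the Equals checker reports an error instead.
        fmt.Println("Equals:", ok, msg)
    }
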
<voidspace> dimitern: when adding a new test file is there anything else I need to do?
<voidspace> dimitern: it doesn't look like my tests are being run
<voidspace> as in, I inject a failure and nothing fails
<dimitern> voidspace, yeah :)
<dimitern> voidspace, register the suite
<voidspace> gc.Suite(...)
<voidspace> or something else?
<dimitern> var _ = gc.Suite(&machineSuite{})
<voidspace> that's there
<dimitern> voidspace, and no tests run still?
<voidspace> doesn't look like it
<voidspace> package is state_test, file is state/subnet_test.go
<voidspace> hmmm... maybe it should be subnets_test.go
<voidspace> dimitern: does each test file need to match a source file?
<dimitern> voidspace, yeah, but that shouldn't matter
<voidspace> dimitern: you're on late
<dimitern> voidspace, btw - see this http://paste.ubuntu.com/9356541/
<voidspace> dimitern: fair enough :-)
<voidspace> hmmm... but the ipaddresses_test is definitely being run
<dimitern> voidspace, even though I added a .GoString() method on network.HostPort to shorten the output considerably (from network.HostPort{Value: 1.2.3.4, Port:42, ...} to 1.2.3.4:42)
<dimitern> voidspace, hmm.. let me see if something else was needed
<dimitern> voidspace, I'm on since 7am :)
<dimitern> really need to stop soon
<dimitern> voidspace, can you paste your complete subnets_test.go ?
<voidspace> dimitern: http://pastebin.ubuntu.com/9356566/
<voidspace> dimitern: there's a deliberate error in assertSubnet inside TestAddSubnet (first test)
<voidspace> that should fail
<dimitern> voidspace, you don't need to embed ConnSuite, do you?
<voidspace> dimitern: StateSuite does, which is what I copied for this and IPAddressSuite
<voidspace> if we can get rid of it then cool
<dimitern> voidspace, hmm.. no sorry - you need it
<voidspace> we need s.State
<dimitern> voidspace, yeah
<voidspace> I've renamed to subnets_test.go
<voidspace> that didn't help
<dimitern> voidspace, why do you have that s.policy.GetConstraintsValidator in SetUpTest?
<dimitern> voidspace, HA!  found it
<dimitern> voidspace, a single character is needed :)
<voidspace> go on
<dimitern> voidspace, var _ = gc.Suite(&SubnetSuite{}) - instead of var _ = gc.Suite(SubnetSuite{}) - all methods have pointer receivers
<voidspace> gah
<voidspace> of course
<voidspace> it registers an empty one
<voidspace> thanks
<dimitern> no worries
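
A minimal sketch of the one-character bug dimitern spotted: gocheck discovers test methods on the value you register, so a suite whose methods have pointer receivers must be registered as a pointer (gc.Suite(&SubnetSuite{})); registering the bare value silently runs zero tests. gc.Run is used here only to make the difference visible outside a test binary:

    package main

    import (
        "fmt"

        gc "gopkg.in/check.v1"
    )

    type SubnetSuite struct{}

    // TestAddSubnet has a pointer receiver, like the suite methods in the
    // file being debugged above.
    func (s *SubnetSuite) TestAddSubnet(c *gc.C) {
        c.Check(1+1, gc.Equals, 2)
    }

    func main() {
        // The value: the pointer-receiver method is not in its method set,
        // so no tests are found.
        fmt.Println(gc.Run(SubnetSuite{}, nil)) // OK: 0 passed
        // The pointer: the test is found and runs.
        fmt.Println(gc.Run(&SubnetSuite{}, nil)) // OK: 1 passed
    }
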
<dimitern> i'll be off then :)
<voidspace> sorry
<voidspace> dimitern: g'night
<dimitern> voidspace, g'night
<voidspace> I'm off too
<voidspace> g'night all
<ericsnow> natefinch: could you take a look at http://reviews.vapour.ws/r/577/
<ericsnow> natefinch: (and http://reviews.vapour.ws/r/573/ too) :)
<ericsnow> natefinch: that first one fixes the blocker
<natefinch> ericsnow: cool, looking
<thumper> jw4: ping
<thumper> jw4: I actually have a branch that fixes one of the actions bits https://github.com/juju/juju/pull/1261/files
<thumper> jw4: may clash with your dependency update too
<thumper> jw4: interesting that both our solutions were identical - completely identical :-)
<natefinch> ericsnow: a couple small concerns, but nothing too hard to fix.
<ericsnow> natefinch: k
<ericsnow> natefinch: I've addressed your comments
<natefinch> ericsnow: looking
<ericsnow> natefinch: keep in mind that cmd/plugins/juju-restore/restore.go is going away, so it probably isn't worth nitpicking there :)  For that file I just want to be sure I have the logic right since there isn't any simple way to test it
<thumper> ericsnow: I did a review too, nothing to add
<thumper> ericsnow: I'd wait for natefinch's LGTM before landing though
<ericsnow> thumper: thanks :)
<ericsnow> thumper: will do
<natefinch> ericsnow: I just gave it a shipit :)
<perrito666> \o/
<ericsnow> natefinch: thanks
 * thumper awaits an unblocked laner
<thumper> lander
<natefinch> thumper: while you're waiting, do you have some time to talk about provider configuration?   I'm porting fwereade_'s skeleton provider from 10-months-ago juju to today-juju, and some of the internals bug me.  He thought you might have some insight.
<thumper> sure
<thumper> natefinch: make a hangout
<natefinch>  thumper: https://plus.google.com/hangouts/_/canonical.com/moonstone?authuser=1
<ericsnow> davecheney: I was just reading through your functional options talk (very interesting)
<ericsnow> davecheney: do you think there are any spots that could stand to benefit from functional options in juju?
<ericsnow> davecheney: would it be a good approach for pulling in the different paths that need to be backed up (I think mattyw was hinting at this a while back)
<davecheney> ericsnow: sure
<davecheney> i think there are a few places we could use it
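
A tiny sketch of the functional-options pattern being referred to, applied hypothetically to ericsnow's example of choosing which paths a backup includes; the names are illustrative, not juju's actual backups API:

    package main

    import "fmt"

    // config holds the knobs; each Option mutates it.
    type config struct {
        paths []string
    }

    type Option func(*config)

    // WithPath adds an extra path to back up.
    func WithPath(p string) Option {
        return func(c *config) { c.paths = append(c.paths, p) }
    }

    // NewBackup applies the options over sensible defaults.
    func NewBackup(opts ...Option) *config {
        c := &config{paths: []string{"/var/lib/juju"}}
        for _, opt := range opts {
            opt(c)
        }
        return c
    }

    func main() {
        b := NewBackup(WithPath("/etc/init"), WithPath("/home/ubuntu/.ssh"))
        fmt.Println(b.paths)
    }
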
<jw4> thumper: thanks!
<davecheney> ok, time for breakfast
<davecheney> then, REVIEWING!
<thumper> ericsnow: did you see that your branch failed early?
<ericsnow> thumper: yeah, fixing it now
<thumper> kk
 * thumper wants to land stuff
<thumper> anastasiamac: I'm reviewing your mega branch
<thumper> anastasiamac: done
<ericsnow> anyone know how to force a CI test to run (functional-ha-backup-restore, specifically)?
<ericsnow> thumper: my fix landed but I hesitate to mark the bug as committed until the CI test (functional-ha-backup-restore) runs
<thumper> ericsnow: I think they happen automatically
<jw4> fix committed won't clear CI til it's marked as fix released
<thumper> ericsnow: and marking the bug fix committed doesn't release the bot
<ericsnow> thumper: right
<ericsnow> thumper, jw4: thanks for clarifying
<perrito666> interesting, windows does not run update as an atomic process
 * perrito666 has an install running on a machine next to him that installed 170 packages, failed on the 171st and just rolled back the whole thing... this whole thing took 4 hrs
<jw4> thumper: https://bugs.launchpad.net/juju-core/+bugs?field.searchtext=&orderby=-importance&field.status%3Alist=NEW&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&field.importance%3Alist=CRITICAL&field.tag=ci+regression+&field.tags_combinator=ALL
<jw4> thumper: when that list is empty CI will be unblocked
 * thumper sighs
 * thumper waits
<jw4> thumper: I personally use https://api.launchpad.net/devel/juju-core?ws.op=searchTasks&status%3Alist=Triaged&status%3Alist=In+Progress&status%3Alist=Fix+Committed&importance%3Alist=Critical&tags%3Alist=regression&tags%3Alist=ci&tags_combinator=All
<thumper> impatiently
<thumper> handy link
 * jw4 has had code to land for a week now
<perrito666> davecheney: do you feel answered by my email? I was not sure if you answered the review with an email or added a review and then deleted it
<perrito666> natefinch: fwereade_ ericsnow you should all have a mail with the shared draft for b&r specs
<anastasiamac> thumper: thnx for the review!!!
<wallyworld_> davecheney: i think you're ocr? you able to take a look at http://reviews.vapour.ws/r/552/ for me?
<davecheney> wallyworld_: looking
<davecheney> perrito666: could you summarise the state of play for me
<wallyworld_> ty
<davecheney> i'm unclear what or if there is a problem
<davecheney> wallyworld_: on a scale of 1 to "don't be picky dave", how important is this PR ?
<wallyworld_> davecheney: i would like to land but if there are issues, record them
<davecheney> understood
<davecheney> wallyworld_: URGH params.StateServingInfo
<davecheney> that fucking type
<davecheney> such a mess
#juju-dev 2014-12-04
<wallyworld_> davecheney: in a meeting, be with you soon
<davecheney> wallyworld_: done
<wallyworld_> davecheney: tyvm, will look after my meeting
<thumper> sinzui: how long does ci take to go through its tests?
<thumper> sinzui: I'm wondering how long before we know if ericsnow's branch fixes the problem
<davecheney> about 18 minutes atm
<thumper> davecheney: that is just to land, I'm talking about the 'ci' tag being taken off the bug so it unblocks landings
<perrito666> davecheney: I could but I dont understand what that means
<davecheney> perrito666: i don't either
<davecheney> is something broken ?
<davecheney> i'm completely lost as to where we left our discussion
<perrito666> lol, ok, you answered a review request for a wrapper around chmod with an email asking something
<perrito666> rings a bell?
<davecheney> right,
<davecheney> yes, why do we need this ?
<perrito666> davecheney: so I answered that mail but in summary, chmod, even though it "runs" on windows, does not work
<davecheney> ok, but why do we need to chmod files ?
<perrito666> davecheney: yes, we do. some time ago I did a check with the cloudbase guys to see what was not working on windows workloads and that was one of the outstanding issues
<davecheney> ok, my question is can we solve this problem by removing the cause ?
<davecheney> chmod might be busted on windows
<davecheney> but if we can remove the requirement to change permissions
<davecheney> then that also solves the problem
<perrito666> well it's used for charms and it seems necessary
<davecheney> how is it used for charms ?
<davecheney> come on man, work with me here
<perrito666> davecheney: sorry I am re-reading the code as we speak, hold
<perrito666> davecheney: I don't fully see all the implications but, except for one case where it is actually there for windows, it seems that we can remove the chmods or do them only on linux
<davecheney> to me that sounds like a better solution
<perrito666> davecheney: I'll have a talk with fwereade tomorrow and see if he remembers why we decided to do this instead of nuking all appearances of chmod. I'll be glad to see this go (and with windows tests in place it might never come back)
<davecheney> perrito666: so this is a charm hook that needs to chmod a file ?
<perrito666> davecheney: the uses of Chmod=?
<davecheney> yup
<davecheney> i'm still digging for the why
<perrito666> there are around 26; from what I see almost half are tests simulating lack of permissions or similar situations, the rest are giving more permissions to certain files, there is one case in environs/config.go that corrects a possibly wrong permission on a file, and the rest I would have to look at
<perrito666> we use chmod a lot for my taste
<perrito666> so, looking at the existing chmods, most if not all are of no consequence for windows. I worry a bit about future uses now that we actually need to care about windows workloads; if only os.Chmod would panic the way file.Chmod does, I would not fear that it might silently crawl under our radar
<perrito666> I might be just over engineering
<sinzui> thumper trunk looks to be in bad shape. Lots of tests failed
<davecheney> perrito666: i'm only interested in charm hooks that use chmod
<davecheney> imo there shuld be none
<sinzui> thumper, I am retesting those that look like cloud failures
<thumper> sinzui: ta
<sinzui> thumper, but ha-backup-restore is not fixed, the bug has mutated. I add the new error message https://bugs.launchpad.net/juju-core/+bug/1398837
<mup> Bug #1398837: cannot extract configuration from backup file: "var/lib/juju/agents/machine-0/agent.conf <backup-restore> <ci> <regression> <juju-core:In Progress by ericsnowcurrently> <https://launchpad.net/bugs/1398837>
 * thumper groans
<ericsnow> thumper, sinzui: I already have http://reviews.vapour.ws/r/573/ that should address that EOF issue
<sinzui> hurray
<ericsnow> it just needs a review
<thumper> ericsnow: I'm looking at it
<ericsnow> thumper: thanks
<thumper> ericsnow: but I don't see how your change improves anything
<thumper> ericsnow: can you explain?
<ericsnow> thumper: if we notice a dropped connection we reconnect and try again
<thumper> ericsnow: a minor change then land it
<ericsnow> thumper: k
<davecheney> ericsnow: http://reviews.vapour.ws/r/573/ not lgtm, yet
<ericsnow> k
<davecheney> 12noon ppl
<davecheney> oh
<davecheney> the meeting is actually at 2pm
<davecheney> ignore me
<davecheney> or one pm
<davecheney> calendars are hard, let's go shopping
<perrito666> davecheney: I would really like, but most imports are closed to make people buy local for christmas
<perrito666> :p
<ericsnow> davecheney: I replied to your comment
<ericsnow> davecheney: basically, I'm pretty sure that code should stick around
<wallyworld_> axw: standup?
<ericsnow> davecheney: but I'll do it cleaner
<davecheney> ok
<wallyworld_> axw: oops sorry, you're not here
<thumper> davecheney: http://reviews.vapour.ws/r/531/diff/# needs a full review
<davecheney> thumper: on it
<thumper> cheers
<thumper> anastasiamac: any ETA on your blocking branch? I don't really want to put my machine branch up for review until I have that merged in
<anastasiamac> thumper: today?..
 * anastasiamac fingers crossed
<ericsnow> davecheney: I've updated http://reviews.vapour.ws/r/573/
 * thumper crosses fingers too
<davecheney> don't forget to add your things to the agenda
<davecheney> https://docs.google.com/a/canonical.com/document/d/1eeHzbtyt_4dlKQMof-vRfplMWMrClBx32k6BFI-77MI/edit
<davecheney> if you don't, then I get to talk for the whole time
<davecheney> and you probably want to avoid that
<davecheney> ericsnow: ok
<davecheney> ericsnow: review done
<davecheney> i'm not happy with the specialised logic inside getBackupTargetDatabases
<davecheney> the DBSession interface should describe what it needs
<davecheney> if it needs a .Copy method
<davecheney> then the mock needs to implement that as well
<ericsnow> davecheney: it has to match mgo's Session.Copy which returns *mgo.Session, making the interface method kind of goofy
<ericsnow> davecheney: but I agree with you :)
<davecheney> ericsnow: perhaps this is not the right place for a mock then
<davecheney> however unpleasant that is
<ericsnow> davecheney: you may be right :(
<ericsnow> davecheney: in the meantime for the sake of opening the landing bot...
<davecheney> ericsnow: please raise a bug against the next milestone
<ericsnow> davecheney: k
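For context, a minimal sketch of the interface problem described above: because mgo's Session.Copy returns the concrete *mgo.Session type, an interface that mirrors it is forced to return that concrete type too, which is what makes mocking awkward. DBSession here is a hypothetical name.

    import mgo "gopkg.in/mgo.v2"

    // DBSession describes what the backup code needs from a session.
    // Copy is forced to return *mgo.Session (not DBSession) to stay
    // compatible with the real mgo type, so a mock cannot substitute
    // its own session type here.
    type DBSession interface {
        Copy() *mgo.Session
        Close()
    }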
<davecheney> I need a second reviewer for http://reviews.vapour.ws/r/573/
<davecheney> thumper: wallyworld_ menn0 ?
<wallyworld_> sure
<waigani> my cal says we have the team meeting now, but no one is here?
<waigani> ah, just my crappy connection
<perrito666> thumper: I finally found someone that would do an affogato en in argentina :) I only had to go 90Km away to get it
<thumper> haha
<thumper> davecheney: I replied to you comments
<ericsnow> wallyworld_: I'm pretty sure the dropped session (the io.EOF) is due to the replicaset functionality of HA
<thumper> davecheney: one key change with putting the feature flags into juju/utils is that the flags themselves had to be agnostic of any particular environment variable
<thumper> so it can't really return a map unless we pass the key in
<thumper> which we could make a helper to do...
<wallyworld_> ericsnow: ok, np. i'm not across it enough. is only one retry enough?
<ericsnow> wallyworld_: I just updated the patch to retry 10 times
<ericsnow> wallyworld_: but still only for io.EOF (which mgo returns when the connection drops)
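A minimal sketch of the retry being described, assuming a hypothetical doRequest callback: the operation is retried a fixed number of times, but only when the error is io.EOF, which is what mgo returns when the underlying connection has dropped.

    import (
        "io"
        "time"
    )

    // withRetry runs doRequest, retrying up to 10 times when the
    // connection drops (reported by mgo as io.EOF). Any other error,
    // or success, is returned immediately.
    func withRetry(doRequest func() error) error {
        var err error
        for attempt := 0; attempt < 10; attempt++ {
            if err = doRequest(); err != io.EOF {
                return err
            }
            time.Sleep(time.Second) // give the connection a moment to recover
        }
        return err
    }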
<thumper> davecheney: I'm beginning to think that we should have the feature flag init in each of the main blocks
<thumper> davecheney: then we wouldn't need the test for osenv
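A minimal sketch of a feature-flag helper that stays agnostic of any particular environment variable, as described above: each main block passes in the variable it cares about. The package and function names here are hypothetical.

    package featureflag

    import (
        "os"
        "strings"
    )

    // FromEnv parses the comma-separated flags held in the named
    // environment variable; callers (each main block) decide which
    // variable that is.
    func FromEnv(envVar string) map[string]bool {
        flags := make(map[string]bool)
        for _, f := range strings.Split(os.Getenv(envVar), ",") {
            if f = strings.ToLower(strings.TrimSpace(f)); f != "" {
                flags[f] = true
            }
        }
        return flags
    }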
<wallyworld_> ericsnow: so why not wait until replicaset is up before doing backup?
<wallyworld_> don't we do that elsewhere?
<ericsnow> wallyworld_: I don't know about elsewhere, but I wouldn't mind waiting until the replicaset is up
<ericsnow> wallyworld_: wouldn't that but in the CI test script though?
<ericsnow> s/but/be/
<wallyworld_> our code needs to wait
<wallyworld_> if backup invoked, it needs to ensure replicaset is up before proceeding
<ericsnow> wallyworld_: k
<wallyworld_> like i think state server might do when starting
<wallyworld_> that's IMHO
<wallyworld_> but seems like the right thing to do
<ericsnow> wallyworld_: okay
<ericsnow> wallyworld_: I'm not well versed in the HA stuff but what you're saying makes sense
<wallyworld_> ericsnow: nate knows all about HA etc :-)
<ericsnow> wallyworld_: lovely :P
<ericsnow> wallyworld_: I'm pretty sure y'all don't want to wait until Nate is back online before I get this pushed :)
<natefinch> I'm here :)
<natefinch> why would I not be here? :)
<ericsnow> magic
<wallyworld_> ericsnow: i'd rather not land a bad solution
<ericsnow> lol
<natefinch> haha
<ericsnow> natefinch: wallyworld_ is suggesting that we wait for replicaset to finish coming up before running backup
<ericsnow> natefinch: how do we test for that?
<ericsnow> wallyworld_: agreed
<natefinch> ericsnow: so, iirc, it's non-trivial
<ericsnow> natefinch: not the right answer :P
<axw> wallyworld_: I'm still going to be out for a while longer. I'll be working late tonight
<wallyworld_> axw: np at all
<natefinch> ericsnow: michael did a bunch of work on that relatively recently, you should talk to him in the morning
<ericsnow> natefinch: k
<natefinch> ericsnow: I think there were updates to the replicaset code and/or tests in order to determine that.  IIRC it was something like ping until we get something reasonable back.
<ericsnow> natefinch: got it
<ericsnow> natefinch: keep in mind that for backup we can do the check on the API server side, so we have access to state
<natefinch> ericsnow: yeah, the replicaset tests assume DB access too.  There's just no actual flag saying "replicasets are up"
<wallyworld_> axw: ping
<sebas5384> https://gist.github.com/anonymous/de06b097d25d690b684f after seeing this log i'm pretty sure i'm not able to use kvm
<sebas5384> hehe
<ericsnow> wallyworld_: I've updated http://reviews.vapour.ws/r/573/
<wallyworld_> ok, looking, wow, must be late for you
<ericsnow> wallyworld_: is that what you meant about checking if HA is ready?
<ericsnow> wallyworld_: well, I hate leaving CI blocked
<wallyworld_> ericsnow: looks ok i think, but backup doesn't seem to call WaitUntilReady() and haEnabled() always returns true
<ericsnow> wallyworld_: isn't HA always enabled (even if not utilized), i.e. the --replset option is always passed to mongod
<ericsnow> wallyworld_: do you think WaitUntilReady would be more appropriate than IsReady?
<wallyworld_> true, so why the haEnabled() function? and what about older environments? or are they upgraded to ha ?
<wallyworld_> i think they are upgraded
<ericsnow> wallyworld_: the new backups only applies to 1.22+
<ericsnow> wallyworld_: and haEnabled gets patched to return false in the tests (since we don't use HA there)
<ericsnow> wallyworld_: is "haEnabled" the wrong name? (perhaps "replSetEnabled"?
<wallyworld_> so then that's a bit misleading, the block of code should be extracted from create
<wallyworld_> and put inside a func
<wallyworld_> and that func should be what's patched
<ericsnow> wallyworld_: fair enough
<wallyworld_> but isn't that what WaitUntilReady is for?
<wallyworld_> if you just call WaitUntilReady, it should all be fine, just patch WaitUntilReady?
<wallyworld_> and then IsReady doesn't need to be exported
<ericsnow> wallyworld_: it depends on whether we want backup to fail immediately if HA isn't ready or if we make users wait
<wallyworld_> oh right i see
<ericsnow> wallyworld_: currently it fails immediately, but it sounds like you would rather we take the waiting approach
<wallyworld_> for now just do what's needed to unblock so you can go to bed
<ericsnow> wallyworld_: I could drop the WaitUntilReady func
<ericsnow> k
<wallyworld_> yes, drop that for now
<wallyworld_> just the minimum, but then come back and fix if needed
<wallyworld_> i'd just like to see the code extracted
<wallyworld_> so the haEnabled() can be dropped
<ericsnow> k
<ericsnow> doing it now
<wallyworld_> ty
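A minimal sketch of the "extract it into a func and patch that" suggestion. Names are hypothetical; the point is that the readiness check lives behind a package-level variable so tests can swap it out instead of patching an haEnabled helper.

    // replicaSetReady is a package-level variable so tests can replace it.
    var replicaSetReady = func() error {
        // the real implementation would poll the mongo replica set status here
        return nil
    }

    func create() error {
        if err := replicaSetReady(); err != nil {
            return err // e.g. "HA not ready; try again later"
        }
        // ... proceed with creating the backup ...
        return nil
    }

    // In tests: replicaSetReady = func() error { return nil }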
<axw> wallyworld_: pong
<wallyworld_> axw: quick hangout?
<axw> sure
<wallyworld_> 1:1 one
<axw> ok
<axw> wallyworld_: hypothetically the openstack provider could do bad things if your environment name contains regexp meta characters
<wallyworld_> axw: do we allow that? i thought env names were constrained
<wallyworld_> to valid chars
<axw> maybe... trying to find where
<axw> wallyworld_: seems we just check that it doesn't contain "/"
<wallyworld_> oh
<wallyworld_> in that case i need to do more in the azure one also
<axw> I'll see what openstack allows for machine names...
<wallyworld_> azure is ok
<wallyworld_> "alphanumeric characters and underscores are valid in the name"
<axw> hmm, can't find any info about it on openstack...
<axw> possibly this? https://github.com/openstack/nova/blob/master/nova/api/validation/parameter_types.py#L61
<axw> which includes .
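A minimal sketch of guarding against this: quote the environment name before building any regexp from it. The naming pattern used here is purely illustrative, not the provider's actual scheme.

    import (
        "fmt"
        "regexp"
    )

    // matchesEnv reports whether serverName belongs to the environment,
    // with regexp metacharacters in envName neutralised by QuoteMeta so
    // they cannot change the meaning of the pattern.
    func matchesEnv(envName, serverName string) (bool, error) {
        pattern := fmt.Sprintf(`^juju-%s-machine-\d+$`, regexp.QuoteMeta(envName))
        return regexp.MatchString(pattern, serverName)
    }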
<ericsnow> wallyworld_: updated http://reviews.vapour.ws/r/573/ (and added tests)
<wallyworld_> looking
<wallyworld_> ericsnow: +1, but i just realised, the CI script might still fail
<ericsnow> FWIW, I plan on following up with mfoord in the morning
<ericsnow> wallyworld_: how so?
<wallyworld_> as it will still see an error
<wallyworld_> it will try to backup and get a "not ready, try again later" error
<ericsnow> wallyworld_: oh, duh
<wallyworld_> as opposed to a EOF error
<ericsnow> wallyworld_: dang it
<wallyworld_> i think we can ask that the script be changed
<wallyworld_> maybe
<wallyworld_> or else we will need that retry loop
<wallyworld_> but you go to bed, i'll follow up
<ericsnow> wallyworld_: and change it to a CI bug rather than a core bug?
<wallyworld_> maybe, i'll have to ask
<wallyworld_> i can see both sides of the argument
<ericsnow> wallyworld_: k, I'll get the merge started
<wallyworld_> sure, tyvm
<wallyworld_> and then we can add to it if needed to put in the wait until ready
<ericsnow> wallyworld_: that WaitUntilReady function is still in the commit history ;)
<wallyworld_> indeed :-)
<ericsnow> wallyworld_: because I'm sneaky like that :)
<wallyworld_> lol
<ericsnow> wallyworld_: okay, it's running the merge CI right now
<wallyworld_> ty :-)
<ericsnow> wallyworld_: I'll leave it in your hands (thanks!)
<wallyworld_> night night
<davecheney> wallyworld_: have a cheeky glass of red for me
<wallyworld_> wish i could
<wallyworld_> i'm still working
<wallyworld_> i was saying good bye to eri
<wallyworld_> c
<wallyworld_> axw: here's the gwacl branch https://code.launchpad.net/~wallyworld/gwacl/prefix-service-match/+merge/243620
<wallyworld_> i need to modify juju to pass in a separator
<axw> looking
<wallyworld_> axw: i might do as you suggest, i can just abandon the gwacl branch
<axw> wallyworld_: I kinda wish those convenience functions in gwacl weren't there
<wallyworld_> yeah
<axw> that one in particular is trouble, obviously
<wallyworld_> indeed
<davecheney> anyone else need a review ?
<wallyworld_> axw: http://reviews.vapour.ws/r/580/
<axw> looking
<wallyworld_> davecheney: i do, but andrew can look as we've discussed befre hand
<davecheney> wallyworld_: done, fwiw
<wallyworld_> thanks dave :-)
<axw> wallyworld_: done also
<wallyworld_> thanks
<wallyworld_> i checked the azure doc, service names don't allow meta chars
<wallyworld_> but i'll add the quoting to be sure
<axw> wallyworld_: they don't *now*, but who's stopping them from changing that?
<wallyworld_> sure
<wallyworld_> jam1: you ok for storage meeting a bit later?
<jam1> wallyworld_: yep
<wallyworld_> jam1: great
<wallyworld_> jam1: axw: i may be a minute or 2 later as i have to drive my wife to the city and am not sure of traffic
<wallyworld_> i should be back on time
<axw> okey dokey
<axw> wallyworld_: when you have a moment, can you see if this makes sense to you? https://docs.google.com/a/canonical.com/document/d/1-9ZPfdgpkj2R9mBG_tlSclGGyK3tRpMf2L4C37mzYD8/edit#bookmark=id.g0p2mahykmz
<TheMue> morning
<dimitern> morning all
<voidspace> dimitern: morning
<dimitern> morning voidspace, TheMue
<TheMue> dimitern: voidspace: o/
<voidspace> TheMue: hiya
<dimitern> jam1, voidspace, standup?
<jam1> dimitern: just working on a meeting, will be there soon
<dimitern> ok
<voidspace> dimitern: omw
 * fwereade_ out for a bit
<perrito666> morning
<voidspace> perrito666: morning
<TheMue> perrito666: heya
<jam1> wallyworld_: axw: are we back in https://plus.google.com/hangouts/_/canonical.com/juju-storage
<jam1> ?
<axw> omw
 * dimitern out for a 1h
<voidspace> perrito666: if you get a chance care to look at this one: http://reviews.vapour.ws/r/583/
<voidspace> perrito666: you're probably more familiar with this code than others, but it's a simple change
<voidspace> perrito666: restores (and tests) some work by ericsnow
<perrito666> voidspace: looking
<voidspace> perrito666: thanks
<perrito666> voidspace: only nits not worth mentioning, LGTM, but my LGTM is not worth much since you will need david tomorrow to ship it :p so look for a better source of lgtmness
<voidspace> perrito666: ok, cool - thanks
<voidspace> anyone else who fancies an easy review
<voidspace> http://reviews.vapour.ws/r/583/
<voidspace> I'm off to lunch
<natefinch> sinzui: where are we on those blockers?
<sinzui> natefinch, still taking stock. I just reported https://bugs.launchpad.net/juju-core/+bug/1399229
<mup> Bug #1399229: win client cannot get status after bootstrap <ci> <regression> <status> <windows> <juju-core:Triaged> <https://launchpad.net/bugs/1399229>
<perrito666> sinzui: can we get a --debug run on that?
<sinzui> perrito666, I am trying. I need to setup another win machine first
<sinzui> natefinch, I suspect ha is brittle, but we have so many hot issues, I am going to get you better info on the current bugs rather than investigate ha
<sinzui> natefinch, the ha issues are connect shutdown/refused talking to the api server, which is the current problem with the backup-restore test, so maybe we already have a bug tracking the problem
<sinzui> perrito666, damn it. We cannot get a debug with a matching server because the testing streams are for 1.21. I will try a bootstrap anyway and hope for a reproduction
<perrito666> sinzui: hold, I have both windows and a fake stream
<perrito666> tell me how to try that
<sinzui> perrito666, The test is just a bootstrap into aws, proof that it can bring up and talk to a state-server
<perrito666> sinzui: ok, that is trunk?
 * perrito666 goes again
<sinzui> perrito666, yes
<perrito666> ok, firing up windows
<perrito666> can you believe it? compiling juju in your machine and inside a vm is a bad idea
<sinzui> dimitern, do we really need to backport bug 1397376 to 1.20? the stakeholders are pointing to 1.20 as an example that works for them
<mup> Bug #1397376: maas provider: 1.21b3 removes ip from api-endpoints <api> <cloud-installer> <fallout> <landscape> <maas-provider> <juju-core:In Progress by dimitern> <juju-core 1.20:In Progress by dimitern> <juju-core 1.21:In Progress by dimitern> <https://launchpad.net/bugs/1397376>
<perrito666> ok sinzui please dont hate me for the stupid thing I am about to ask
<perrito666> where should I store environments.yaml in windows?
<dimitern> sinzui, I wasn't sure, let me check if the logic differs in 1.20
<sinzui> perrito666, /users/<you>/.juju/environments.yaml and you need a .ssh/id_rsa
<katco> perrito666: there should be an environmental variable for /users/<you>.... forget what it is
<katco> perrito666: perhaps %HOME%
<dimitern> sinzui, confirmed - 1.20 is not affected by the same address ordering issue
<sinzui> \o/
<perrito666> asdkjhaskdjhna ksdhjask windows will not allow me to name something .juju from the gui
<perrito666> oh you have to name it .juju.
<perrito666> and windows removes the dot at the end
<natefinch> perrito666: you need to turn off the "hide extensions" thingy, I think... I've never had a problem using wacky extensions in windows before.
<perrito666> natefinch: the issue was not with the extension, it's that the file began with .
<perrito666> so, no, windows is not looking for %HOME%/.juju
<natefinch> oh yeah, you can't make a file in the UI that starts with a ..... sorry
<natefinch> er with a .  that is
<jw4> perrito666: %USERPROFILE%\.juju\environments.yaml ?
<perrito666>  %UserProfile%
<perrito666> looking
<natefinch> I thought we made juju home on windows just "Juju"
<perrito666> natefinch: let see
<natefinch> perrito666: yeah it's %APPDATA%/Juju
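A minimal sketch of resolving the juju home directory per platform, matching what was worked out above (%APPDATA%\Juju on Windows, ~/.juju elsewhere); the helper name is hypothetical.

    import (
        "os"
        "path/filepath"
        "runtime"
    )

    // jujuHome returns the directory that holds environments.yaml.
    func jujuHome() string {
        if runtime.GOOS == "windows" {
            return filepath.Join(os.Getenv("APPDATA"), "Juju")
        }
        return filepath.Join(os.Getenv("HOME"), ".juju")
    }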
<alexisb> dimitern, you still around?
<natefinch> alexisb: my chrome just froze, will be back on the call in a minute....
<perrito666> bootstraping
<perrito666> sinzui: jw4 katco natefinch tx for your help
<dimitern> alexisb, yes
<katco> perrito666: anytime perrito666
<alexisb> on the cross team call, was curious on status on the api endpoints bugs
<sinzui> natefinch, axw targeted https://bugs.launchpad.net/juju-core/+bug/1388860 to 1.20.14. It is a backport of a fix. He suggests it because there is another bug reporting that the issue does indeed affect the 1.20 series. Can you get someone to look into the backport now, and if it is ugly, then maybe we shouldn't...just go ahead with the release.
<mup> Bug #1388860: ec2 says     agent-state-info: 'cannot run instances: No default subnet for availability zone:       ''us-east-1e''. (InvalidInput)' <deploy> <ec2-provider>
<mup> <network> <juju-core:Fix Committed by axwalk> <juju-core 1.20:Triaged> <juju-core 1.21:Fix Released by axwalk> <https://launchpad.net/bugs/1388860>
<dimitern> alexisb, i'm working on a fix now
<dimitern> alexisb, it turned out to be more complicated to test than to fix :) - I should be ready later today or tomorrow morning
<perrito666> sinzui: natefinch http://pastebin.ubuntu.com/9368819/
<sinzui> perrito666, try again, that is aws's sucky mirrors
<perrito666> sinzui: I am about ot
<perrito666> to
<sinzui> perrito666, oh..
<sinzui> perrito666, you can tell juju to not do updates and upgrades!
 * perrito666 had to step away a moment to go order food, which requires standing on a bench in the backyard, the only place in the house with cell reception
<sinzui> perrito666, you can improve your chances of success:
<sinzui>     enable-os-upgrade: false
<sinzui>     enable-os-refresh-update: false
<sinzui> ^ we do that with canonistack because its network is unreliable
<perrito666> sinzui: is the test running that way?
<alexisb> dimitern, awesome, thank you for turning that around so quickly
<natefinch> man, it really sucks we're still using launchpad for bugs... it makes everything so much more manual
<perrito666> I would like to stay as close as possible to the tests
<natefinch> sinzui: do you happen to have a link to the PR that fixed https://bugs.launchpad.net/juju-core/+bug/1388860 for 1.21?
<mup> Bug #1388860: ec2 says     agent-state-info: 'cannot run instances: No default subnet for availability zone:       ''us-east-1e''. (InvalidInput)' <deploy> <ec2-provider>
<mup> <network> <juju-core:Fix Committed by axwalk> <juju-core 1.20:Triaged> <juju-core 1.21:Fix Released by axwalk> <https://launchpad.net/bugs/1388860>
<dimitern> alexisb, no worries
<sinzui> perrito666, no, but I think it is somewhat irrelevant because we have fresh images in aws, charms need updates and upgrades, but this failure is about talking to the state server
<perrito666> sinzui: running second bootstrap with your options added
<sinzui> natefinch, I do not, I will look
<voidspace> I have a fix for issue 1398837
<voidspace> waiting for a review
<natefinch> sinzui: I found it
<voidspace> bug 1398837
<mup> Bug #1398837: cannot extract configuration from backup file: "var/lib/juju/agents/machine-0/agent.conf <backup-restore> <ci> <regression> <juju-core:In Progress by ericsnowcurrently> <https://launchpad.net/bugs/1398837>
<natefinch> sinzui: looks to be a trivial change
<sinzui> natefinch, I see the change, but not the actual pull request
<natefinch> https://github.com/juju/juju/search?q=1388860&type=Issues&utf8=%E2%9C%93
<natefinch> amazing, a search that actually works the way you'd expect :)
<sinzui> natefinch, https://github.com/juju/juju/commit/9e1f40588eb6befcc543ae64e15cf7d8b11fd090
<sinzui> natefinch, I was slow, I looked at the date and read the commits
<natefinch> sinzui: that's the one.  Super simple.  Want me to backport it?
<sinzui> natefinch, please do. pretty please
<perrito666> odd, ECONNREFUSED
<ericsnow> sinzui: regarding 1398837, part of the problem is that the failing test does not wait long enough for HA to be ready before trying to run backup
<sinzui> ericsnow, okay, what time should we set?
<ericsnow> sinzui: I'm not sure how long it takes for HA to get ready, but I see that voidspace's patch has a timeout of 60 seconds
<sinzui> what!
<ericsnow> voidspace: thanks for taking that over, by the way
<sinzui> ericsnow, I think we wait 10-20 minutes
<perrito666> sinzui: uff, /proc/self/fd/9: 9: exec: varlibjujutoolsmachine-0/jujud: not found
<perrito666> someone broke paths
<voidspace> ericsnow: no problem, care to review it?
<ericsnow> sinzui: that test has a total runtime of about 15 minutes
<sinzui> perrito666, \o/, our old nemesis, win paths
<ericsnow> voidspace: sure, though it looks strangely familiar :)
<perrito666> sinzui: I wonder how in the universe a windows client affects that path
<sinzui> ericsnow, I am looking at the lib now and hoping for an easy timeout change
<ericsnow> sinzui: k, thanks
<voidspace> ericsnow: heh
<voidspace> ericsnow: sinzui: in my investigations I could find *no* deterministic way to tell when a replicaset is ready
<sinzui> perrito666, the win module is convoluted. it switched between native path separators and posix depending on the code *knowing* that it will be executed against the state server
<sinzui> voidspace, :(
<voidspace> ericsnow: sinzui: even connecting separately to all of them and waiting until the configuration from *all members* reports that they're all ready wasn't enough after a reconfigure
<voidspace> ericsnow: sinzui: at which point I gave up
<ericsnow> voidspace: I wanted to ask you about that (the IsReady function I added)
<voidspace> ericsnow: sinzui: this fix will definitely help *sometimes*
<perrito666> nnnnnailed
<perrito666> sinzui: https://github.com/juju/juju/commit/ad420d9
<voidspace> ericsnow: I have been down this road and this will help sometimes
<voidspace> ericsnow: but sometimes they report ready and the next operation can still fail
<voidspace> ericsnow: although your problem is with initiation - my problem was with reconfigure
<voidspace> ericsnow: so it's likely to be better
<voidspace> ericsnow: the Initiate function could call WaitUntilReady
<ericsnow> voidspace: ah, cool
<voidspace> ericsnow: I didn't make that change as I was focussed on fixing the specific problem
<perrito666> sinzui: Ill try to fix it
<ericsnow> voidspace: ack
<ericsnow> natefinch, perrito666: standup?
<natefinch> ericsnow: trying... google doesn't like me
<voidspace> google is not your friend...
<perrito666> natefinch: use firefox
<natefinch> it was trying to join as my gmail account
<voidspace> ericsnow: why did you remove WaitUntilReady?
<voidspace> ericsnow: a timeout in the script won't help without a retry loop
<ericsnow> voidspace: we weren't using it so wallyworld_ asked me to remove it
<ericsnow> voidspace: right, a timeout won't help but a sleep will :)
<voidspace> ericsnow: ah, it was wallyworld_ who asked me to put it back in
<voidspace> this morning
<ericsnow> voidspace: yeah, I told him it was in the commit history still :)
<voidspace> ericsnow: a retry loop is better than a sleep, surely?
<voidspace> ericsnow: I'm agnostic on it - up to sinzui really
<ericsnow> voidspace: sure, but that's up to sinzui
<voidspace> whether we fix it in juju or they fix it in their test harness
<voidspace> ericsnow: ok
<voidspace> adding WaitUntilReady to Initiate slows the test suite down a lot
<voidspace> I'll tell you by how much when it actually finishes!
<ericsnow> voidspace: I noticed that each test I added for IsReady added about 12 seconds to the tests
<voidspace> well, from 244 seconds to 365 - including one failure (probably needs a mock)
<voidspace> ericsnow: heh, ouch
<voidspace> ericsnow: my tests for WaitUntilReady are fast because they all mock IsReady...
<ericsnow> voidspace: nice
<voidspace> and use a timeout of 1 second...
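A minimal sketch of the WaitUntilReady shape being discussed: it polls an IsReady-style check until a deadline, which is why tests that mock the check and pass a one-second timeout stay fast. The signature is an assumption, not the actual juju code.

    import (
        "fmt"
        "time"
    )

    // WaitUntilReady polls isReady until it reports true or the timeout
    // expires. Tests can pass a stubbed isReady and a short timeout.
    func WaitUntilReady(isReady func() (bool, error), timeout time.Duration) error {
        deadline := time.Now().Add(timeout)
        for {
            ready, err := isReady()
            if err != nil {
                return err
            }
            if ready {
                return nil
            }
            if time.Now().After(deadline) {
                return fmt.Errorf("replica set still not ready after %s", timeout)
            }
            time.Sleep(time.Second)
        }
    }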
<sinzui> voidspace, the test sets the timeout for ha to 1200...could something else be timing out before then?
<sinzui> well obviously something is
<voidspace> sinzui: backup was set to hard fail if the replicaset was not ready
<ericsnow> voidspace: IsReady may need a tweak (it will return an error for anything but io.EOF)
<voidspace> sinzui: so it's not a timeout you need as much as a *retry* if it fails for that reason
<voidspace> ericsnow: that sounds correct to me
<voidspace> ericsnow: what other error would you expect?
<sinzui> voidspace, retry of what, status, ha?
<voidspace> sinzui: of the backup itself I think
<voidspace> sinzui: that's where the failure was IIUC
<sinzui> oh, backup, doh
<ericsnow> voidspace: "dial tcp 172.31.3.149:17070: connection refused"
<voidspace> ericsnow: ah
<voidspace> ericsnow: what's the actual error type?
<voidspace> ericsnow: or should I do a match for "connection refused"?
<ericsnow> voidspace: not sure, sinzui noted it in bug 1398837
<mup> Bug #1398837: cannot extract configuration from backup file: "var/lib/juju/agents/machine-0/agent.conf <backup-restore> <ci> <regression> <juju-core:In Progress by ericsnowcurrently> <https://launchpad.net/bugs/1398837>
<sinzui> ericsnow, the bug is really about the test failing. if we mark it fix released, we replace it with another bug saying that backup tests still fail
<ericsnow> sinzui: +1
<voidspace> ericsnow: hmmm... that bug actually reports that restore fails
<voidspace> ericsnow: are you sure that a WaitUntilReady in create will fix that?
<ericsnow> voidspace: are you sure? "ERROR:root:Command '['juju', 'backup']' returned non-zero exit status 1"
<voidspace> ericsnow: ah, that's further down the bug report
<voidspace> "the bug has mutated"
<ericsnow> voidspace: the original restore error was caused by backup though
<voidspace> ericsnow: ok, fair enough
<perrito666> natefinch: you really need to work in your unmuting skills
<ericsnow> voidspace: it's all because I stuck a "juju backups create" call in the backup plugin script a couple weeks back
<ericsnow> voidspace: however, these are real issues that need addressing at some point so might as well be now
<voidspace> ericsnow: so this would be my fix for the connection refused issue
<voidspace> ericsnow: http://pastebin.ubuntu.com/9369537/
<voidspace> no, hang on
<voidspace> if errors.Cause(err) == io.EOF || (err != nil && strings.Contains(err.Error(), "connection refused")) {
<ericsnow> sinzui: so should we apply Michael's fix (http://reviews.vapour.ws/r/583/) or will you be able to add retries/sleep to the CI test script (the HA one) around the backup call?
<sinzui> ericsnow, I am still looking at extending the timeout
<ericsnow> voidspace: I so hate testing for strings in err.Error() :(
<ericsnow> sinzui: I'm not sure a timeout will help
<ericsnow> sinzui: it has to wait somehow for HA to be ready before running backup (or retry the backup if it fails)
<ericsnow> sinzui: voidspace's fix would probably help, but I'd rather not do that *just* for the sake of the CI test if we can help it
<sinzui> ericsnow, yep I see ensure-availability as the issue, I will report a new bug about this, close the backup bug and hope the patch works
<ericsnow> voidspace: but you're probably right
<voidspace> heh
<ericsnow> voidspace: I was relying just on io.EOF for IsReady because of the precedent elsewhere in the replicaset code (I don't have a very good knowledge of the problem-space otherwise)
<ericsnow> voidspace: davecheney had suggested checking for other kinds of failures but I didn't find any examples of that elsewhere in juju so I stuck with just io.EOF
<ericsnow> voidspace: so I'm good with checking for "connection refused" (my dislike of checking err.Error() aside)
<voidspace> right
<voidspace> cool
<voidspace> I've pushed it and we can let the PR lie until we get a definite decision
<voidspace> I'm returning to IP address stuff
<ericsnow> voidspace: k
<ericsnow> sinzui: thanks!
<ericsnow> sinzui: regardless, good came of the bug (though at the cost of CI being blocked for an extra day)
<ericsnow> voidspace: I'll put up a separate patch just for the "connection refused" part of that so that it's not conflated with the WaitUntilReady part
<voidspace> ericsnow: ah, I pushed that
<voidspace> ericsnow: it doesn't do any harm is my thinking...
<ericsnow> voidspace: that's okay
<perrito666> sinzui: natefinch found it, running tests now for proposal
<sinzui> voidspace, ericsnow I reported https://bugs.launchpad.net/juju-core/+bug/1399277 about the ha issue, I add a line for beta4, because I /think/ this will help, but we can discuss it as out of scope
<mup> Bug #1399277: ensure-availability is not reliable <ci> <ha> <regression> <juju-core:In Progress> <juju-core 1.21:Triaged> <https://launchpad.net/bugs/1399277>
<ericsnow> voidspace: thanks!
<ericsnow> sinzui: ^
<ericsnow> sinzui: it applies to 1.22 as well, right? (where we are seeing the CI backups failures)
<sinzui> ericsnow, yess 1.22 is really hurting
<ericsnow> sinzui: :(
<arosales_> rick_h_: to confirm where should we log bugs for jujucharms.com?
<rick_h_> https://github.com/CanonicalLtd/jujucharms.com/issues arosales_
<rick_h_> arosales_: updated bug link is landed and will be in monday release
<arosales_> rick_h_: thanks
<rick_h_> arosales_: np, ty for the bug reports
<arosales_> np, thanks for responding to them :-)
<perrito666> uff this patch is awfully hard to revert
<mgz_> perrito666: ...subsequent changes?
<perrito666> mgz_: yes most likely
<natefinch> Can someone review my backport?  It's a very small change: http://reviews.vapour.ws/r/584/
<natefinch> ahh the old "on call reviewers are both done for the day by 8am"
<natefinch> mgz_: can you look? ^
<mgz_> natefinch: on it
<mgz_> natefinch: is the indent of that switch off or is that just reviewboard messing with me?
<natefinch> mgz_: that's how gofmt likes it.
<natefinch> just double checked
<natefinch> personally, I prefer the cases to be indented, but it does then cause double indent for the stuff under the case, so.... yeah.  *shrug*
<ericsnow> natefinch: I reviewed that patch
<mgz_> I think I may just be misreading the html output... it looks like the returns are differently indented, but there are change-indent marks that implies otherwise
<mgz_> lgtm otherwise
<natefinch> mgz_: oh I see what you mean
<natefinch> mgz_: the left hand side code is only under a single "if", the right hand side is under an "if" and a switch, so it is indented more.
<mgz_> yup
<natefinch> I honestly didn't even see the indent marks at first.  It's nice that it doesn't mark the whole thing as just different when it's just the different indent.
<ericsnow> natefinch: we should keep track of these (things people *like* about RB)
<ericsnow> natefinch: all I ever hear is what annoys people (which is typical) :P
<perrito666> ericsnow: it's easy, everything that is not annoying we like
<natefinch> the whole "make a ton of changes and then publish as one action" is pretty fantastic....
<ericsnow> natefinch: agreed
<ericsnow> perrito666: heh
<natefinch> ericsnow: being able to expand all files with a single click is also pretty awesome.  I hate that I can't do that in github's diffs
<voidspace> natefinch: what do you think of this as a conversion function - dotted quad (IP address) to decimal?
<voidspace> natefinch:
<voidspace> <natefinch> the whole "make a ton of changes and then publish as one
<voidspace> oops...
<voidspace> natefinch: http://pastebin.ubuntu.com/9370769/
<ericsnow> voidspace: FYI: http://reviews.vapour.ws/r/585/ (check other "conn dropped" conditions)
<voidspace> natefinch: just wondering if it has any chance of passing review
<voidspace> ericsnow: LGTM
<voidspace> natefinch: the address is already validated, so no need to handle that potential error case
<voidspace> although it should work fine anyway as ParseInt would fail
<voidspace> and the *caller* will be constructing a sensible error message from that failure
<voidspace> ooh, bug
<voidspace> they need zero padding
<voidspace> grr...
<voidspace> right, g'night all
<natefinch> voidspace: net.IP has some interesting stuff
<voidspace> natefinch: ah, I'll check
<voidspace> natefinch: I need int representations
<voidspace> natefinch: it may be that I don't need to write them myself
<natefinch> voidspace: I always try to write as little as possible myself :)
<voidspace> natefinch: it doesn't have a ToDecimal
<voidspace> or equivalent
<voidspace> nor the reverse
<voidspace> I need both
<natefinch> voidspace: I'll look around to see if something's already out there
<natefinch> voidspace: don't want to keep you from EOD
<voidspace> ok
<voidspace> natefinch: thanks, if you see something an email would be awesome
<natefinch> voidspace: will do
<voidspace> natefinch: they're not hard functions to write - just converting via string functions is a little icky
<voidspace> natefinch: decimal to dotted quad is less icky
<voidspace> g'night
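A minimal sketch of both conversions using net and encoding/binary rather than string manipulation; whether this fits the juju code in question is an open assumption.

    import (
        "encoding/binary"
        "fmt"
        "net"
    )

    // ipToUint32 converts a dotted-quad IPv4 address to its decimal form.
    func ipToUint32(s string) (uint32, error) {
        ip := net.ParseIP(s).To4()
        if ip == nil {
            return 0, fmt.Errorf("%q is not a valid IPv4 address", s)
        }
        return binary.BigEndian.Uint32(ip), nil
    }

    // uint32ToIP is the reverse conversion, back to dotted-quad form.
    func uint32ToIP(n uint32) string {
        b := make([]byte, 4)
        binary.BigEndian.PutUint32(b, n)
        return net.IP(b).String()
    }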
<perrito666> for the fantastic chance to unlock CI http://reviews.vapour.ws/r/586/
<perrito666> it is extremely trivial
<perrito666> has been tested on windows and linux
<natefinch> perrito666: will this work if deploying a unit to a windows machine?
<perrito666> natefinch: I dont know, I cannot deploy one of those
<perrito666> but if not, I am ok with it since it is a regression
<perrito666> and also https://bugs.launchpad.net/juju-core/+bug/1399322
<mup> Bug #1399322: ToolsDir should be series based <windows> <juju-core:New> <https://launchpad.net/bugs/1399322>
<perrito666> I opened a bug to have someone solve that doubt
<natefinch> perrito666: let's ship it
<perrito666> aghh what was the _fixes thing?
<ericsnow> perrito666: $$fixes-XXXXX$$
<perrito666> tx ericsnow
<perrito666> sent
<natefinch> I think the comments just need "fixes-123456" in them, and then $$whatever$$ works
<perrito666> ok, since the fix is hopefully being merged, I am going to step down for a moment
<thumper> morning folks
<rick_h_> thumper: morning, shot you an email about rescheduling next week if you get time to peek
<thumper> rick_h_: hey
<thumper> rick_h_: can you do two days earlier and an hour later?
<natefinch> man I hate mongo
<thumper> sinzui: what's the status of CI?
<rick_h_> thumper: I can
<rick_h_> thumper: can you email that back and let urulama respond as well?
<thumper> kk
<rick_h_> ty
<sinzui> thumper, master is blocked by 2 regressions, both with fixes due soon
 * thumper sighs
<thumper> grr...
<thumper> gc.Not is so fucked
<anastasiamac> thumper: could u cast ur eyes over my changes for block commands
<thumper> anastasiamac: sure
<anastasiamac> thumper: the PR is so huge now, m thinking to break it into smaller pieces
<anastasiamac> thumper: block functionality itself
<anastasiamac> and than a PR for each command
<thumper> as long as your first commands are add and remove machine I don't care :-)
<anastasiamac> thumper: but if it's all good as it is now, I'd rather commit my monster without breaking it apart
<anastasiamac> thumper: :)
<anastasiamac> thumper: lets he how it reviews to u atm
<thumper> I'm already going to have to do a mega-conflict merge
<anastasiamac> /s/he/see
<anastasiamac> thumper: :-(
<thumper> that's fine
<thumper> I'm used to it
<anastasiamac> thumper: but better u than me :)
<thumper> sure
<thumper> :)
<anastasiamac> thumper: m off to sort kids and co...
<thumper> kk
<thumper> I think I'll fetch a coffee and my almond croissant before starting this review
<thumper> it may take a while
<menn0> ericsnow: ping
<ericsnow> menn0: o/
<menn0> ericsnow: howdy.
<menn0> ericsnow: what's the state of play with the ensure-availability/backup issue?
<menn0> ericsnow: looks like you have a Ship It for PR 1271
<menn0> ericsnow: is that PR not particularly important?
<ericsnow> menn0: last I was aware, sinzui was going to see about tweaking the function-ha-backup-restore test so that HA is up before backup runs (or something along those lines)
<menn0> ericsnow: ok cool
<ericsnow> menn0: that PR isn't important for unblocking CI
<menn0> ericsnow: there's also voidpspace's PR 1269
<menn0> ericsnow: which seems relevant but hasn't been merged yet
<ericsnow> menn0: it was just something we noticed would be good to do at some point (so it can wait until things are unblocked)
<menn0> ericsnow: ok so we're really just waiting on sinzui's change to get CI unblocked
<ericsnow> menn0: yeah, Michael's patch (http://reviews.vapour.ws/r/583/) should help but I think the current behavior is correct for users (so I'd rather we not land that patch just for the sake of CI)
<sinzui> menn0, not quite, because I don't know how to make ci know when ha is ready
<ericsnow> menn0: yeah, and he had said he might change that bug to be a CI bug IIRC, which would unblock us
<ericsnow> sinzui: can you just throw a sleep in there before calling backup or put that call to backup in a retry loop
<ericsnow> ?
<sinzui> ericsnow, sleeping for 5 minutes then running backup can still fail.
<ericsnow> sinzui: yuck
<sinzui> ericsnow, I need to poll something that means we are ready
<ericsnow> sinzui: how long does it take for HA to be ready?
<ericsnow> sinzui: I'm not the best resource for finding that polling solution but I'll give it some thought
<ericsnow> natefinch: do you have any ideas off-hand on how sinzui might poll for HA-ready (from a script)
<sinzui> ericsnow, it is variable. Our code calls a method named wait_for_ha() and we don't start backup until it returns
<sinzui> ericsnow, We read status http://pastebin.ubuntu.com/9372831/
<sinzui> ericsnow, menn0 is there something else to read about juju *really* being ready
<natefinch> sinzui: I wish I knew.  I believe that can sometimes pass and mongo can still somehow not be ready-ready.
<sinzui> could is "juju run" something on the state-server or the other voting machines?
<sinzui> s/could is/could 1/
 * sinzui gives up
<sinzui> well, I think testing got lucky. I think the replica-set was ready this time
<sinzui> natefinch, ericsnow I will add a 5m sleep if you think it will fix the issue 80% of the time
<ericsnow> sinzui: If it's close enough that it just succeeded as-is, then I'd expect such a sleep (hacky as it is) would help a bunch
 * sinzui adds hack
<ericsnow> sinzui: I would not expect a lot of variability in how long it takes for HA to come up
<menn0> sinzui, ericsnow, natefinch: wouldn't voidspaces change also greatly reduce the odds of the replicaset not being ready?
<sinzui> menn0, I think so
<ericsnow> sinzui: of course that doesn't solve the problem of accurately/reliably introspecting the HA status, but that's the point of bug 1399277, right?
<ericsnow> menn0: if I were a user I would rather it fail when the replicaset isn't ready than have it wait
<ericsnow> menn0: however, I'd expect the odds of this issue affecting actual users to be remote
<ericsnow> menn0: I'd rather we didn't apply voidspace's patch just for the sake of CI (more -0 than -1)
<natefinch> ericsnow: what, you don't think a lot of people will do "juju bootstrap && juju ensure-availability && juju backup"? :)
<ericsnow> natefinch: :)
<menn0> ericsnow: ok, well if backup reports a clear error immediately about the replicaset not being ready can't the test detect that and retry a few times
<menn0> natefinch: you can't be too careful :)
<ericsnow> natefinch: however, there's a chance this could bite someone if they did " juju ensure-availability && juju backup" on an existing env, no?
<ericsnow> menn0: that was what voidspace suggested when we discussed it
<ericsnow> sinzui, menn0: the error message is "HA not ready; try again later"
<ericsnow> (apiserver/backups/create.go)
<natefinch> ericsnow: I'm not too concerned if someone does "juju ensure-availability && juju backup" and the backup fails with "HA not ready" (or something similar).
<ericsnow> natefinch: right
<ericsnow> natefinch: that's the point of what I did yesterday
<natefinch> ericsnow: well that's great.  I'm fine with there being some times when backup can't be done.
<natefinch> Gotta run
<ericsnow> sinzui: would it be reasonable to drop the "ci" tag from bug 1399277?
<mup> Bug #1399277: ensure-availability is not reliable <ci> <ha> <regression> <juju-core:In Progress> <juju-core 1.21:Triaged> <https://launchpad.net/bugs/1399277>
<sinzui> no, the test has to consistently pass
<ericsnow> sinzui: I meant since it's more of a CI bug but having a more reliable way to know when HA is ready is still something we want to get
<sinzui> ericsnow, hell no. enterprises script this out like we do
<ericsnow> sinzui: good point
<sinzui> ericsnow, This issue is new, so I think something bad has happened. Regardless, I am adding a sleep
<ericsnow> sinzui: what changed is I updated the "juju backup" plugin to call "juju backups create"
<ericsnow> sinzui: but that was a few weeks ago so if this issue is new as of a matter of days then yeah
<sinzui> ericsnow, yeah :( If I add a 5 minutes sleep, the test suite also sleeps. I need to do more work
<ericsnow> sinzui: :(
<ericsnow> sinzui: what about a loop around the backup call that checks for the "HA not ready; try again later" message?
<sinzui> ericsnow, :/ doable but maybe awkward, not all jujus have this problem. We test 1.18 and 1.20 too
<perrito666> sinzui: can you not query status to determine if the ha servers are ready? or they are marked ready before replicaset is actually ready?
<sinzui> perrito666, we do! status said has vote, so we started backup
<ericsnow> perrito666: they are already doing that: http://pastebin.ubuntu.com/9372831/
<perrito666> ok so status is lying
<perrito666> ok,EOD my brain is fried
<perrito666> sinzui: before I leave, where can I see https://bugs.launchpad.net/juju-core/+bug/1399229 job? does it run on the same CI?
<mup> Bug #1399229: win client cannot get status after bootstrap <ci> <regression> <status> <windows> <juju-core:Fix Released by hduran-8> <https://launchpad.net/bugs/1399229>
<perrito666> being windows
<sinzui> perrito666, this is the job, and your commit is the top result http://juju-ci.vapour.ws:8080/job/win-client-deploy/
<perrito666> :D
<perrito666> I go in peace then, have a nice night everybody
#juju-dev 2014-12-05
<ericsnow> wallyworld_: thanks for the review on http://reviews.vapour.ws/r/549/ (I finally got around to responding)
<wallyworld_> sure, i'll go look
<ericsnow> wallyworld_: thanks
<axw> wallyworld_: thanks for doing the backport - was about to do that
<wallyworld_> axw: np, i got blocked on something else so thought i may as well
<wallyworld_> ericsnow: what's happening with the current CI blocker?
<ericsnow> wallyworld_: sinzui was going to add a 5 minute sleep before running backup in the script (or something like that)
<wallyworld_> ok, thanks
<wallyworld_> i guess we still added the wait for ha?
<ericsnow> wallyworld_: I don't know what's going to happen with that blocker bug
<ericsnow> wallyworld_: voidspace's patch did not go in
<wallyworld_> oh ok
<wallyworld_> was it rejected?
<ericsnow> wallyworld_: incidentally, that failing test passed at least once today, so we must be relatively close on when HA is becoming ready
<wallyworld_> mongo sorta sucks that it seems to lie to us about if it is ready
<ericsnow> wallyworld_: yep
<ericsnow> wallyworld_: that's the gist of the current blocker bug, if I understand correctly
<wallyworld_> we need to find a solution
<ericsnow> nate was talking about getting in touch with the mongo folks (I'll follow up tomorrow)
<ericsnow> sounds like he has had some contact with them before about replica set stuff
<ericsnow> wallyworld_: on that upload patch are you okay with me leaving the cmd tests as is (they do mock out the API client)
<wallyworld_> yep
<ericsnow> k
<axw> wallyworld_: can't hear you anymore...
<thumper> axw: FWIW, I was mostly joking about exa, yotta and zeta bytes...
<wallyworld_> thumper: 640k is enough for anybody
<axw> thumper: heh :)
<thumper> for sure
<axw> thumper: doesn't really hurt
<axw> if someone tries it, it's likely to fail for the foreseeable future, but... may as well do the 10 minute job now and forget about it
<axw> wallyworld_: sent out the email. I've got a workshop to go to at the school, heading off in a few
<wallyworld_> axw: sure, tyvm
<huwshimi> hatch: Ugh, now I have to mock classList.remove and classList.add because we don't have a dom to work with :(
<hatch> huwshimi: wrong channel? :)
<huwshimi> erm
<bodie_> watching the first end-to-end run of an Action right now :)
<bodie_> if anyone would like to play with / break it, we have a branch available under "actions" at https://github.com/juju-actions/juju/
<bodie_> landing a commit in a moment that should make "do", "fetch", and "defined" functional, along with the charm-side stuff
<bodie_> also some good content in the wiki at that fork
<dimitern> bodie_, sweet \o/ I'll give it a try later today
<bodie_> dimitern, awesome :) there's a Phoronix testing suite charm we have linked in our wiki, but marcoceppi knows more about how to use it
<dimitern> bodie_, cool, thanks - I'll give you a shout if something is unclear :)
<bodie_> dimitern, great, and feel free to open issues if you manage to break something
<dimitern> bodie_, sure, no worries
<voidspace> ericsnow: ping
<dimitern> voidspace, jam1, standup?
<voidspace> dimitern: ok
<voidspace> dimitern: jam1 will be off won't he
<dimitern> voidspace, ah, it's friday - yeah
<voidspace> TheMue: stdup?
<perrito666> morning
<voidspace> perrito666: o/
<dimitern> jamespage, ping
<dimitern> jamespage, hey, a friendly reminder to send me that mail with request-address/interface use cases when you have time please :)
<jamespage> on my list for today
<dimitern> jamespage, thanks!
<perrito666> how is the blocking status on CI?
<rick_h_> heh, surprised no one's written an irc bot yet :) !ci-status
 * perrito666 goes figure out if he is hearing gunshots or fireworks brb
<perrito666> back
<voidspace> ericsnow: ping
<dimitern> voidspace, are you working on bug 1399277 as well?
<mup> Bug #1399277: ensure-availability is not reliable <ci> <ha> <regression> <juju-core:In Progress> <https://launchpad.net/bugs/1399277>
<voidspace> dimitern: yes
<voidspace> dimitern: we hope this branch is the fix http://reviews.vapour.ws/r/583/
<voidspace> dimitern: but it needs a review
<voidspace> dimitern: and I'm not convinced it's sufficient but it will definitely help
<voidspace> dimitern: I got a "ShipIt" but I added a change to go from "all healthy" to "majority healthy" which needs review
<voidspace> dimitern: and as this changes ericsnow's code I'd really like him to see it before I merge it
<dimitern> voidspace, ok, I'll have a look
<voidspace> dimitern: PickNewAddress is done bar the tests, it's quite a funky little algorithm
<voidspace> dimitern: several possible places for fence post errors, so needs careful checking
<dimitern> voidspace, great!
<dimitern> voidspace, you have a review btw
<dimitern> voidspace, eww... my first comment got awkwardly formatted
<voidspace> dimitern: I understand
<voidspace> dimitern: and I don't know a better way than string comparison
<voidspace> dimitern: I agree it's icky
<voidspace> dimitern: suggestions welcomedc
<voidspace> *welcomed
<voidspace> dimitern: ericsnow has another PR that does the same thing but hides it in a function - I think I'll just merge that
<dimitern> voidspace, this is an error coming from mgo, right?
<voidspace> dimitern: right
<sinzui> voidspace, I have also added a 5m sleep after status reports all machines has-vote. that change did improve the pass rate.
<voidspace> sinzui: cool
<voidspace> sinzui: only "improve", or "fixes the problem"?
<voidspace> shame to add five minutes to the run time
<sinzui> improved
<voidspace> :-/
<sinzui> an automatic retest passed. So we passed slowly
<voidspace> if a five minute sleep doesn't solve this problem then a half hour sleep wouldn't - there's some other issue
<voidspace> that at least is progress
<sinzui> dimitern, how goes your work with bug 1397376
<mup> Bug #1397376: maas provider: 1.21b3 removes ip from api-endpoints <api> <cloud-installer> <fallout> <landscape> <maas-provider> <juju-core:In Progress by dimitern> <juju-core 1.21:In Progress by dimitern> <https://launchpad.net/bugs/1397376>
<dimitern> sinzui, I'm about to propose a fix for 1.22, just fixing a final test
<sinzui> \o/
<dimitern> sinzui, it took a lot of time to ensure I don't break something; live tested on maas, local, canonistack, ec2
<sinzui> understood
<dimitern> voidspace, ah, too bad then - go with string comparison :)
<ericsnow> voidspace: o/
<voidspace> ericsnow: hey, hi
<voidspace> ericsnow: I made a change to IsReady
<voidspace> ericsnow: I'd like your agreement before I merge
<voidspace> ericsnow: http://reviews.vapour.ws/r/583/
<ericsnow> voidspace: k
 * ericsnow takes a look
<voidspace> ericsnow: with the current implementation, if one state server in an HA environment goes down
<voidspace> ericsnow: and *then* the user tries to backup
<voidspace> ericsnow: IsReady will always report false and they can't backup
<voidspace> ericsnow: so I changed IsReady to check for majority healthy rather than all healthy
<voidspace> ericsnow: (majority healthy is the requirement a mongo replicaset has to be functioning)
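A minimal sketch of the majority-healthy check being described; MemberStatus here is a simplified stand-in for the replicaset status type, not the actual juju struct.

    // MemberStatus is a simplified stand-in for a replica set member's status.
    type MemberStatus struct {
        Healthy bool
    }

    // majorityHealthy reports whether a strict majority of members are up,
    // which is all a mongo replica set needs in order to keep functioning.
    func majorityHealthy(members []MemberStatus) bool {
        healthy := 0
        for _, m := range members {
            if m.Healthy {
                healthy++
            }
        }
        return healthy > len(members)/2
    }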
<perrito666> mm, tests dont like to be ctrl+z
<natefinch> voidspace: why are we using health at all?    It's not at all clear what "up" means..... plus, there's a bug in the logic, because the member you're getting the status from always excludes its own value for health, so it'll default to false.
<natefinch> voidspace: there's member status which seems much more detailed about what each state actually means.
<alexisb> ericsnow, voidspace ping
<voidspace> alexisb: pong
<ericsnow> alexisb: hey
<alexisb> hey guys, do you guys know where the April PyCon is going to be held?
<voidspace> natefinch: interesting, that doesn't seem to happen in practise
<alexisb> ie what location
<voidspace> natefinch: so I think you're wrong about defaulting to false
<ericsnow> alexisb: montreal
<voidspace> alexisb: montreal
<alexisb> ok thanls
<alexisb> thanks
<voidspace> alexisb: the proposed sprint date clashes with the pycon sprints
<voidspace> alexisb: or is that deliberate?
<alexisb> yep that is what I am working on
<voidspace> alexisb: cool
<voidspace> natefinch: if you were right, the current implementation would *always* return false
<voidspace> natefinch: so backup could *never* work - which isn't what we're seeing
<ericsnow> sinzui, voidspace: the current failure (connection refused) with the HA backup CI test may not be HA-related
<ericsnow> sinzui, voidspace: see http://reviews.vapour.ws/r/590/
<ericsnow> so, an API client is only good for a single request (and then disconnects)?
<ericsnow> anyone ^
<voidspace> alexisb: I have a talk and poster session on juju submitted to pycon
<voidspace> alexisb: no idea if either has been (or will be) accepted yet
<alexisb> voidspace, awesome
<sinzui> ericsnow, I don't know, but I have hope that the 5m sleep gives juju a moment to gather its wits
<voidspace> hehe
<ericsnow> sinzui: I think it helped
<voidspace> ericsnow: I'm merging  your PR with the better error checking for IsReady with mine
<voidspace> just running tests
<natefinch> voidspace: hmm that's true.  Weird. I wonder why it doesn't work that way... since the docs definitely say the value doesn't exist for the member you get status from, and therefore the bool should default to false
<sinzui> ericsnow, agreed, the test is still retried, but at least its odds of success are better
<ericsnow> voidspace: I don't think we need to switch to WaitUntilReady to fix the current failures
<voidspace> natefinch: my defence is that *that* code (in juju) pre-exists my PR which just waits for it
<natefinch> voidspace: http://docs.mongodb.org/v2.4/reference/command/replSetGetStatus/
<ericsnow> voidspace: though the fix to IsReady stands on its own
<voidspace> ericsnow: I think it's an improvement and we *definitely* need to move to checking for majority healthy rather than all
<natefinch> voidspace: oh yeah, totally... I just wondered if you knew why it was there.  I guess I should ask git who to ask
<voidspace> ericsnow: otherwise if one state server goes down you can't backup!
<voidspace> natefinch: it's ericsnow :-)
<voidspace> natefinch: I did look in detail at the CurrentConfig and CurrentStatus meanings, but it was a little while ago now
<ericsnow> natefinch: I just did what we were doing elsewhere in juju
<voidspace> maybe mgo does helpful magic for us
<ericsnow> natefinch: actually, the code I added just calls other code that predates mine
<ericsnow> natefinch: that code is what does the check IIRC
<voidspace> natefinch: in that doc I see "self" included in members with meaningful data
<voidspace> i.e. health and state
<voidspace> with "self": true
<ericsnow> perrito666: here's that fix to try out: http://reviews.vapour.ws/r/590/
<perrito666> ericsnow: going
<voidspace> natefinch: "The members field holds an array that contains a document for every member in the replica set."
<voidspace> natefinch: what are you reading?
<natefinch> voidspace: replSetGetStatus.members.health
<natefinch> The health value is only present for the other members of the replica set (i.e. not the member that returns rs.status.) This field conveys if the member is up (i.e. 1) or down (i.e. 0.)
<voidspace> natefinch: hah, ah right
<voidspace> natefinch: indeed, I concur with your reading
<perrito666> ericsnow: you will need a bit of patience, this test takes a bit to run
<voidspace> natefinch: so we should be checking for Member.self and assuming health: 1 for that member
<voidspace> else you couldn't connect to it
<ericsnow> perrito666: I figured as much
<ericsnow> perrito666: thanks
<voidspace> but looks like we don't need to in practise
<voidspace> natefinch: ah, but see the MemberStatus struct
<voidspace> natefinch: and specifically the Healthy field
<voidspace>                                     
<voidspace>         // Healthy reports whether the member is up. It is true for the
<voidspace>         // member that the request was made to.
<voidspace> natefinch: so we specifically don't need to do that
<voidspace> ericsnow: so this is my latest version: http://reviews.vapour.ws/r/583/diff/#
<voidspace> ericsnow: my feeling is that it's only an improvement
<voidspace> ericsnow: and so we should merge
<natefinch> voidspace: yeah, but I wrote that, and the code does not seem to back it up (assuming the docs on mongodb are more likely to be accurate than my code comments)
<voidspace> natefinch: yet it seems to be true in practise
<ericsnow> voidspace: agreed
<voidspace> ericsnow: I'm going to hit merge then, we'll see if it helps
<ericsnow> voidspace: thanks for doing that
<voidspace> ericsnow: we'll see...
<voidspace> I need a break, biab
<ericsnow> does anyone know if an API client is only good for a single request (and then disconnects)?
<perrito666> sinzui: mgz_ abentley "ImportError: No module named boto" <-- what is boto?
<sinzui> perrito666, it is a python lib for talking to aws
<sinzui> perrito666, sudo apt-get install boto
<ericsnow> voidspace: BTW, the "connection refused" condition for isConnectionNotAvailable should probably be dropped
<perrito666> sinzui: oh ok, I thought it was some internal module from ci
<sinzui> perrito666, sudo apt-get install python-boto
<ericsnow> voidspace: that message came from the API, not from mongo
<ericsnow> cmars: could I get a review on http://reviews.vapour.ws/r/590/? (you're OCR, right?)
<ericsnow> cmars: if I understood right, it should help unblock CI
<cmars> ericsnow, yep, i'll take a look, was just reading 583
<mgz_> perrito666: what sinzui said, but I think it's python-boto
<perrito666> mgz_: it is
<ericsnow> cmars: ta
<cmars> ericsnow, ok, awesome. i know nothing about HA & replica sets, so i'll probably have lots of stupid questions
<ericsnow> cmars: well, my patch is unrelated to HA :)
<cmars> ericsnow, even better :)
<ericsnow> cmars: and it's small
<cmars> ericsnow, why can't you use an API client for more than a single request?
<ericsnow> cmars: I haven't gotten any answer on that yet :P
<ericsnow> cmars: I find it surprising if true (which is what perrito666 explained to me recently)
<cmars> ericsnow, hmm. let me look a bit at NewAPIClient just to see if this is mostly harmless. would i be likely to see any difference in tests passing in my dev env with vs without this patch?
<ericsnow> cmars: in the interest of getting CI unblocked I figured I'd take the chance that the patch helped (since it doesn't hurt)
 * cmars wishes CI could test PR branches for these kinds of experiments
<cmars> but i don't want to be terribly difficult either
<ericsnow> cmars: no worries
<mgz_> cmars: it's doable
<cmars> let me look over that api client just a few minutes & I'll let you know.
<ericsnow> cmars: I definitely want an answer to the API client question either way :)
<ericsnow> cmars: thanks
<ericsnow> mgz_: that would be awesome
<cmars> mgz_, thanks. is it difficult?
<ericsnow> mgz_: from what perrito666 told me, it's a pain setting things up to run CI tests yourself
<mgz_> ericsnow: it's not super fun, the other option is we can send alternative branches through
<cmars> mgz_, wasn't trying to be difficult, i was thinking of the github pull request jenkins plugin. i think it'd be able to pull something like this off (though maybe it is tricky to fit into the existing setup?)
<mgz_> cmars: sure but if you're talking about actual ci runs rather than just the gating, that's too much to do on every push to a proposed branch
<cmars> mgz_, ah, that's true, it'd have to be a targeted approach and maybe that's where it falls down
<ericsnow> natefinch: on second thought, using member.Healthy in IsReady is strictly my fault
<ericsnow> natefinch: I made the decision based on the doc comments in MemberStatus
<ericsnow> natefinch: so I'm sure it could stand improvement :)
<ericsnow> mgz_: for Python core development the CI is set up to allow running against a branch on a dev repo on request
<ericsnow> mgz_: that works well
<cmars> ericsnow, is the backups httpClient reusing the API websocket http connection to upload and download files?
<ericsnow> cmars: nope
<perrito666> mgz_: hit this before? http://pastebin.ubuntu.com/9384377/
<cmars> ericsnow, where does it get its httpClient initialized?
<mgz_> perrito666: got your ec2 creds sourced?
<ericsnow> cmars: pretty sure it's api/http.go
<perrito666> mgz_: nope, I don't recall it being required before, though it makes sense
<cmars> ericsnow, ah, ok thanks
<perrito666> cmars: ericsnow if you can find out why the client fails to be re-used you each get a beer on me next sprint
<mgz_> perrito666: you can also have the creds in the environment.yaml name you pass
<ericsnow> cmars: that code follows the precedent of charm download and tools upload
<perrito666> mgz_: I am passing:  WORKSPACE=$(pwd) ./assess_recovery.py --ha-backup /home/hduran/gocode/bin  perritoec2
<perrito666> perritoec2 being my ec2 env
<mgz_> perrito666: so, if perritoec2 had the creds, I'd expect that exception not to happen
<cmars> perrito666, do you see client re-use issues in other facades or just backups?
<mgz_> I always just source my creds instead of putting them in the yaml
<perrito666> mgz_: it has them, since it works :p
<perrito666> cmars: I have not actually tried, but most likely only backups facade
<perrito666> cmars: i first encountered this error while trying to retry actions waiting for upgrade to finish
<ericsnow> mgz_, perrito666: ^^^ in the restore patch
<ericsnow> (http://reviews.vapour.ws/r/298/)
<ericsnow> perrito666: oh, right (pre-dating restore)
<perrito666> ericsnow: ?
<perrito666> ericsnow: I am trying your patch again now
<ericsnow> perrito666: thanks
<cmars> ericsnow, LGTM'd it. let's give it a shot
<ericsnow> cmars: k
<ericsnow> cmars: opening a bug on the api client re-use thing first
<cmars> ericsnow, perfect
<perrito666> ericsnow: we should write a facade call empirical test to make successful and unsuccessful facade calls and see how the client behaves there
<natefinch> ericsnow> natefinch: I made the decision based on the doc comments in MemberStatus
<natefinch> ericsnow: and that's what you get for basing your implementation on my shoddy documentation ;)
<ericsnow> natefinch:  :)
<ericsnow> natefinch: and it was quite late and I was working on unblocking CI so perhaps I wasn't as rigorous for the sake of urgency :)
<perrito666> ericsnow: well my test run failed for the second time without relation to your patch :p running a third one
<ericsnow> perrito666: k :(
<ericsnow> natefinch: want me to open a bug on IsReady?
<natefinch> ericsnow: I don't know for sure that it needs a bug... it's entirely possible that Health is perfectly acceptable and does The Right Thing... I'm just wary of trusting vague docs :)
<perrito666> :p MayBeReadyOrJustAlways0
<ericsnow> natefinch: well, I'll open a bug just so we make sure we at least look into it :)
<ericsnow> natefinch: we can easily close it
<natefinch> ericsnow: good point
<ericsnow> natefinch: https://bugs.launchpad.net/juju-core/+bug/1399730 (feel free to comment on the bug)
<mup> Bug #1399730: replicaset.IsReady should check MemberStatus.State rather than Healthy. <tech-debt> <juju-core:New> <https://launchpad.net/bugs/1399730>
<ericsnow> natefinch: in fact, feel free to pick up the bug ;)
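For readers following along, here is a minimal self-contained sketch of the distinction bug 1399730 is about. The type and constant names below are illustrative stand-ins, not the exact juju/replicaset API: a member can report healthy (up) while it is still recovering, so checking the member's state is the stricter readiness test.

    package main

    import "fmt"

    // Simplified mirror of the replica-set status fields under discussion;
    // the real juju/replicaset types may differ.
    type MemberState int

    const (
        StartupState MemberState = iota
        PrimaryState
        SecondaryState
        RecoveringState
    )

    type MemberStatus struct {
        Id      int
        Self    bool        // true for the member the request was made to
        Healthy bool        // the up (1) / down (0) health flag from rs.status()
        State   MemberState // PRIMARY, SECONDARY, RECOVERING, ...
    }

    // isReadyByHealth is the behaviour being questioned: it only checks "up".
    func isReadyByHealth(members []MemberStatus) bool {
        for _, m := range members {
            if !m.Healthy {
                return false
            }
        }
        return true
    }

    // isReadyByState is what the bug suggests instead: a member that is up
    // but still recovering is not yet a usable PRIMARY or SECONDARY.
    func isReadyByState(members []MemberStatus) bool {
        for _, m := range members {
            if m.State != PrimaryState && m.State != SecondaryState {
                return false
            }
        }
        return true
    }

    func main() {
        members := []MemberStatus{
            {Id: 1, Self: true, Healthy: true, State: PrimaryState},
            {Id: 2, Healthy: true, State: RecoveringState},
        }
        fmt.Println(isReadyByHealth(members)) // true
        fmt.Println(isReadyByState(members))  // false
    }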
<rogpeppe> is there any way from the command line to get juju to print the environment UUID?
<ericsnow> rogpeppe: the only one I know of is a little indirect: "juju backups create" :)
<rogpeppe> ericsnow: i'm not sure i'm gonna use that one :)
<ericsnow> rogpeppe: the env UUID in the backup metadata
<ericsnow> rogpeppe: yeah, I figured :)
<rogpeppe> i guess it might be good as part of juju status
<ericsnow> rogpeppe: +1
<rick_h_> rogpeppe: does the juju api endpoints command do it?
<rogpeppe> rick_h_: i don't think so
<rogpeppe> rick_h_: that just prints the api endpoint addresses
<rick_h_> k
<ericsnow> sinzui: both my and voidspace's fixes have landed so I'm hopeful for the HA backups CI test result
<sinzui> rock
<ericsnow> sinzui: keep in mind that our fixes don't really address the underlying concern we have with knowing for sure that HA is ready
<sinzui> ericsnow, understood. I hope all our work improves the test's success.
<ericsnow> sinzui: me too!
<sinzui> ericsnow, CI found the new commits and has started the tests.
<ericsnow> sinzui: cool
<ericsnow> alexisb: we're still on in 10 minutes, no?
<alexisb> ericsnow, yep
<ericsnow> alexisb: k
<perrito666> ericsnow: pre-good news, the test passes
<ericsnow> perrito666: cool
<ericsnow> perrito666: and it fails without the patch?
<perrito666> ericsnow: I did not try, but I assumed so
<perrito666> I am running the same code as CI
<ericsnow> perrito666: cool
<ericsnow> perrito666: CI should be running the test in a little while
<perrito666> ericsnow: lets hope with the same results
<cmars> ericsnow, just LGTM'd 549
<ericsnow> cmars: thanks!
<alexisb> katco, mind if I take a few minutes to chat with natefinch before we meet?
<katco> alexisb: not at all!
<LinStatSDR> Hello All.
<katco> alexisb: ty for asking :)
<ericsnow> cmars: would you mind reviewing a couple more for me (http://reviews.vapour.ws/r/591/ and http://reviews.vapour.ws/r/578/)?
<cmars> ericsnow, already on 591
<ericsnow> cmars: http://reviews.vapour.ws/r/557/ too if you feel up to it :)
<ericsnow> cmars: awesome
<ericsnow> sinzui: how soon do you think we'll know on the CI results for functional-ha-backup-restore?
<sinzui> ericsnow, about an hour. We are in the dull period where CI is publishing streams for the cloud-based tests
<ericsnow> sinzui: k
<voidspace> g'night all
<voidspace> have a happy weekend
<ericsnow> voidspace: thanks for all your help
<voidspace> EOW
<voidspace> ericsnow: see you on monday
<ericsnow> sinzui: FWIW, perrito666 verified that with my patch the CI test passed on his EC2 instance
<sinzui> fab
<alexisb> ok katco ready and joining the hangout
<ericsnow> cmars: thanks for those reviews
<ericsnow> sinzui: looks like the test made it past the spot where it's been failing :)
<cmars> ericsnow, fingers crossed :)
<ericsnow> sinzui: so the test passed and I'm confident that it will pass in re-test
<ericsnow> sinzui: should that blocker bug be marked as fixed-released or should the CI tag be dropped?
<sinzui> ericsnow, I am marking it as fix released
<ericsnow> sinzui: okay, cool
<ericsnow> sinzui: thanks for all the help
<ericsnow> sinzui: I think that sleep really helped
<katco> alexisb: http://www.pvponline.com/comic/2007/12/06/kringus-risen-part-4
<alexisb> lol nice!
<katco> alexisb: we call our cat kringus because he likes to get in the tree :)
<ericsnow> cmars: I'm guessing you got those errors from a header file in the mongo source...
<ericsnow> cmars: are they exposed in mgo at all?
<ericsnow> cmars: or are those from errno?
<cmars> ericsnow, I grepped 'connection' out of the go stdlib :)
<cmars> just for a sample
<cmars> ericsnow, mgo doesn't have anything specific to connection errors, other than when trying to dial
<ericsnow> cmars: okay, I'll go with the ones you listed (see http://golang.org/pkg/syscall/ and http://golang.org/src/pkg/syscall/zerrors_linux_amd64.go)
<natefinch> ericsnow: beware using anything from syscall.  It should set off warning bells in your head if you ever import it.  It is explicitly not cross-platform compatible.  Sometimes the methods & types will be the same across platforms, and sometimes they won't be.  We need to at least make sure that everything compiles on all our platforms (by which of course I mean Windows)
<ericsnow> natefinch: good point
<ericsnow> natefinch: in this case the errnos are set up for most (all?) platforms go supports, including windows
<ericsnow> natefinch: the errno constants are all I need
<natefinch> ericsnow: ok, just wanted to make sure.  Syscall always gives me the heebie jeebies :)
<ericsnow> natefinch: hey, good thinking
<ericsnow> cmars: I've updated http://reviews.vapour.ws/r/591/ with those errnos
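For context, a rough sketch of the errno-matching approach being described. The helper names and the particular errno set are illustrative, not juju's actual isConnectionNotAvailable; the unwrapping below mirrors the error layers net.Dial typically returns on Linux.

    package main

    import (
        "fmt"
        "net"
        "os"
        "syscall"
    )

    // errnoOf digs the syscall.Errno out of the wrappers net.Dial returns.
    func errnoOf(err error) (syscall.Errno, bool) {
        if op, ok := err.(*net.OpError); ok {
            err = op.Err
        }
        if sys, ok := err.(*os.SyscallError); ok {
            err = sys.Err
        }
        errno, ok := err.(syscall.Errno)
        return errno, ok
    }

    // isConnectionError reports whether err maps to one of the
    // connection-related errnos; this particular set is illustrative.
    func isConnectionError(err error) bool {
        errno, ok := errnoOf(err)
        if !ok {
            return false
        }
        switch errno {
        case syscall.ECONNREFUSED, syscall.ECONNRESET, syscall.ECONNABORTED:
            return true
        }
        return false
    }

    func main() {
        // Dialling a port nothing listens on should give ECONNREFUSED.
        _, err := net.Dial("tcp", "127.0.0.1:1")
        fmt.Println(err)
        fmt.Println(isConnectionError(err))
    }

As natefinch points out, syscall is generally a cross-platform hazard, but these particular errno constants are defined by the syscall package on every platform juju builds for, so code like the above at least compiles everywhere.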
<ericsnow> cmars: do you think you'll have time to look at http://reviews.vapour.ws/r/557/?
<ericsnow> cmars: it's a large patch, but most of it is mechanical
<ericsnow> cmars: I tried to break it down helpfully in the description
<cmars> ericsnow, thanks. yep, i'll take a look at 557 next
<ericsnow> cmars: awesome! thanks for grinding through these :)
<cmars> ericsnow, i don't think i'm qualified to approve 557, sorry
<ericsnow> cmars: no worries :)
<ericsnow> cmars: thanks for taking a look
<ericsnow> cmars: I'll see if I can talk axw into it on Monday :)
<natefinch> ericsnow: I wonder if we shouldn't just put availabilityZone into hardware characteristics
<ericsnow> natefinch: I considered that but from my cursory introduction to that corner of juju it didn't seem to fit quite right there
<ericsnow> natefinch: what would be the motivation?
<ericsnow> natefinch: other than reducing the churn in that patch :)
<natefinch> ericsnow: reducing churn is a noble goal.  But it also just seems like something you might want to pass around with hardware characteristics.
<ericsnow> natefinch: I suppose, but it still doesn't seem like a good fit IMHO
<natefinch> ericsnow: at least as good a fit as "tags".  AZ can't change for the hardware any more than the architecture can.  Where the hardware is seems like a valid piece of info about the hardware.
<ericsnow> natefinch: FWIW, on IRC axw gave me a non-binding LGTM on the approach after a cursory review
<natefinch> lol
<ericsnow> natefinch: fair enough
<natefinch> ericsnow: I think it's worth bringing it up to others first... I don't want to muck up other parts of the code if changing HW chars has other specific meaning, but from a pure code point of view, it's a lot better than adding yet another return value to all these functions.
<ericsnow> natefinch: sounds good
<ericsnow> natefinch: at one point I believe I had the AZ as stored on the Instance (and expected with an AvailabilityZone method)
<ericsnow> s/expected/exported
<natefinch> ericsnow: I thought of that too... I think HWC is slightly better.. but not 100%
<ericsnow> natefinch: agreed; instance.Instance is more about stuff that I expect can change
<cmars> i'm done. have a good weekend juju people
#juju-dev 2014-12-06
<jam1> fjam
#juju-dev 2014-12-07
<mwhudson> menn0, waigani: email standup then?
<waigani> mwhudson: no one's on hangout so I'm guessing so
<mwhudson> i am!
<mwhudson> but ok
<mwhudson> suits me to not have the call tbh
<waigani> oh, ha I just logged out
<waigani> okay cool, let's do email standup
<menn0> mwhudson, waigani: sounds good
<menn0> waigani: looking at your megawatcher branch now
<waigani> menn0: is this your latest: https://github.com/mjs/juju/commit/dca522e6f5bcc369fbf195cff6aa8cd5bdf2db30
<menn0> waigani: yep that's the one
<waigani> menn0: and are all the changes in that commit?
<menn0> yep
<menn0> waigani: just finished reviewing your megawatcher branch
<waigani> menn0: thanks
<menn0> waigani: looks good overall. you're still going to add tests for services, units, relations etc right?
<waigani> menn0: yes. I said that in the email. I just needed a sanity check before putting any more time into it.
<menn0> waigani: cool cool.
<menn0> waigani: I like the way the tests look now
<waigani> menn0: "// XXX do me" - you need to come up with a better placeholder
<waigani> menn0: moving txn to its own file makes sense. using reflection and cycling through the envCollections set on each transaction is going to add overhead, but it looks necessary. And reflection is only needed in the odd case where the doc is not bson.D/M. Overall looks neater and makes sense to me.
<menn0> waigani: glad you like it
<menn0> waigani: actually, in most cases we insert docs using structs so reflection will be the normal case
<menn0> waigani: but the overhead of a little bit of reflection will be nothing compared to the I/O of the actual transaction
<menn0> waigani: I'll do some comparisons of the test run over the state package anyway to make sure I haven't made things a whole lot worse
<waigani> menn0: considering this touches basically all transactions, it would be worth checking
 * menn0 nods
#juju-dev 2015-11-30
<mup> Bug #1521017 opened: adding a ssh key to juju does not take a file path <juju-core:New> <https://launchpad.net/bugs/1521017>
<mup> Bug #1521020 opened: juju authorized-keys import fails without any output <juju-core:New> <https://launchpad.net/bugs/1521020>
<mup> Bug #1521017 changed: adding a ssh key to juju does not take a file path <juju-core:New> <https://launchpad.net/bugs/1521017>
<mup> Bug #1521020 changed: juju authorized-keys import fails without any output <juju-core:New> <https://launchpad.net/bugs/1521020>
<davecheney> https://github.com/godbus/dbus/issues/45
<davecheney> no love
<davecheney> guess we'll be fixing that one
<davecheney> quit
<mup> Bug #1521017 opened: adding a ssh key to juju does not take a file path <juju-core:New> <https://launchpad.net/bugs/1521017>
<mup> Bug #1521020 opened: juju authorized-keys import fails without any output <juju-core:New> <https://launchpad.net/bugs/1521020>
<menn0> wallyworld or axw: I suspect this might make you happy: http://reviews.vapour.ws/r/3269/
<wallyworld> yay
<wallyworld> menn0: i did go to review your PR from last week but it already had a shipit by the time i got there
<menn0> wallyworld: yep, no worries. already landed. this new one depended on that one.
<wallyworld> menn0: lgtm
<menn0> wallyworld: cheers
<axw> wallyworld: BTW, I've renamed params.RelationUnitsChange to params.RemoteRelationUnitsChange, because we'll use tokens and so on
<wallyworld> ok
<axw> wallyworld: so I think there shouldn't be a concern about overlap with the existing RelationUnitsChange any more
<wallyworld> that sounds right
<menn0> axw: could you have a quick look at this one please: http://reviews.vapour.ws/r/3270/
<axw> menn0: sure
<menn0> axw: there's another upgradesteps cleanup PR ready straight after this one
<axw> menn0: LGTM
<menn0> axw: cheers
<menn0> axw: last cleanup PR here: http://reviews.vapour.ws/r/3271/
<axw> looking
<axw> menn0: done
<menn0> axw: thanks
<frobware> jam: rebooting... be there in a bit...
<voidspace> dimitern: looks like I finally connected...
<dimitern> voidspace, welcome ;)
<voidspace> dimitern: shall I just recreate that branch based on a clean patch?
<voidspace> dimitern: no need for you to do it
<dimitern> voidspace, yes please - that will be easiest I think
<voidspace> doing it
<voidspace> dimitern: http://reviews.vapour.ws/r/3273/
<dimitern> voidspace, cheers, looking
<voidspace> dimitern: it's already had a "Ship It" as a previous PR, so no need to look again particularly closely
<voidspace> dimitern: I'm doing a full test run here
<voidspace> dimitern: and we need to talk about implementation strategy for listing spaces on bootstrap
<voidspace> dimitern: we can topic that at/after standup though
<dimitern> voidspace, LGTM
<dimitern> voidspace, sure
<voidspace> dimitern: ta
<voidspace> dimitern: waiting for the test run here to finish before I hit $$merge$$
<dimitern> voidspace, ack
<voidspace> anyone seen this failure with the lxd client test?
<voidspace> ERROR juju.utils cannot find network interface "lxcbr0": no such network interface
<voidspace> this is on maas-spaces feature branch, so there may already be a fix on master that we haven't picked up
<voidspace> jam: fwereade: dimitern: standup?
<dimitern> frobware, voidspace, fwereade, omw - 1m
<voidspace> frobware: I need coffee
<frobware> voidspace, me too. 5 mins?
<voidspace> frobware: sounds good
<voidspace> frobware: ready when you are
<voidspace> frobware: I get the same failures on master, so no *requirement* for an urgent rebase
<frobware> voidspace, ack
<perrito666> hello all btw
<mup> Bug #1519527 opened: juju 1.25.1:  lxc units all have the same IP address <openstack> <sts> <uosci> <juju-core:New> <MAAS:Triaged by mpontillo> <MAAS 1.9:Triaged by mpontillo> <MAAS trunk:Triaged by mpontillo> <juju-core (Ubuntu):New> <https://launchpad.net/bugs/1519527>
<mup> Bug #1521217 opened: TestWorkerAcceptsBrokenRelease fails <ci> <intermittent-failure> <test-failure> <juju-core:Incomplete> <juju-core cross-model-relations:Triaged> <https://launchpad.net/bugs/1521217>
<mup> Bug #1521220 opened: TestShortPollIntervalExponent fails <ci> <test-failure> <juju-core:Triaged> <https://launchpad.net/bugs/1521220>
<mup> Bug #1521220 changed: TestShortPollIntervalExponent fails <ci> <test-failure> <juju-core:Triaged> <https://launchpad.net/bugs/1521220>
<cherylj> dimitern: is the issue with bug 1519527 just that in 1.25.1 juju uses some new MAAS functionality which is not working as expected?
<mup> Bug #1519527: juju 1.25.1:  lxc units all have the same IP address - changed to claim_sticky_ip_address <openstack> <sts> <uosci> <juju-core:New> <MAAS:Triaged by mpontillo> <MAAS 1.9:Triaged by mpontillo> <MAAS trunk:Triaged by mpontillo> <juju-core (Ubuntu):New> <https://launchpad.net/bugs/1519527>
<dimitern> cherylj, that's my understanding (proved by mpontillo as well)
<dimitern> or at least observed to not work as expected
<cherylj> dimitern: what about other levels of MAAS?  does the functionality exist on older levels, but works there?
<dimitern> cherylj, yes, it exists in 1.8 and is confirmed to work there as expected, so it's a maas 1.9beta2+ regression
<cherylj> dimitern: thanks!
<mup> Bug #1521220 opened: TestShortPollIntervalExponent fails <ci> <test-failure> <juju-core:Triaged> <https://launchpad.net/bugs/1521220>
<cherylj> dimitern: sorry to bug you again :)  a note from mgz indicated that you might be working on bug 1516989?
<mup> Bug #1516989: juju status <service_name> broken <sts> <juju-core:Triaged by cherylj> <juju-core (Ubuntu):New> <https://launchpad.net/bugs/1516989>
<beisner> hi cherylj, dimitern - i just re-deployed with MAAS 1.9b2 + Juju 1.25.0, and all lxc units get unique and working IP addresses.  With 1.9b2 + 1.25.1, all lxc units get the same IP.  While it may  be an underlying MAAS bug, there is definitely a regression in 1.25.1 Juju.  Will be adding a generic non-openstack reproducer as soon as this next deploy wraps up.
<dimitern> cherylj, I haven't started yet, but I plan to start tomorrow on that one
<dimitern> beisner, it's still a maas issue
<dimitern> beisner, hey btw :)
<beisner> dimitern, i don't disagree.
<cherylj> dimitern: okay, I can pass it off to onyx since they're on bug squad
<dimitern> beisner, you can verify by trying to create a device + parent and then claim-sticky-ip-address for it
<beisner> dimitern, but i do know that if 1.25.1 lands, my lab will be borked.  i presume others who are less-automated will eventually see it too.
<cherylj> beisner: yeah, we don't plan on moving 1.25.1 out of proposed until the MAAS fix has been released
<beisner> cherylj, much appreciated
<rick_h_> cherylj: beisner <3 ty both for working through that.
<frobware> dimitern, voidspace: http://reviews.vapour.ws/r/3275/  - I'm still doing some manual testing against MAAS 1.8/1.9 and precise, trustu, vivid & wily but if you could take an initial look would be appreciated.
<katco> /
<katco> /
<dimitern> frobware, sure, looking
<voidspace> frobware: cool
<beisner> o/ rick_h_  yw, happy to help exercise these things
<dimitern> frobware, ping
<frobware> dimitern, pong
<dimitern> frobware, I'm a bit confused - do we use bash or python or both?
<frobware> dimitern, boht
<frobware> both
<frobware> dimitern, I could do 90% python. in fact I did for a while. but we will always need the bash shim to run the python script.
<frobware> dimitern, want to HO?
<dimitern> frobware, right
<dimitern> frobware, got it
<dimitern> frobware, can do a HO, but the script looks fine
<frobware> dimitern, let's do 10 mins anyway. would be good to talk about it.
<dimitern> frobware, ok
<frobware> dimitern, standup HO?
<dimitern> frobware, sure, omw
<mpontillo> cherylj: dimitern: I'm currently fixing a separate issue in MAAS IP allocation which is blocking me from fully triaging, but from what I saw last week, it's a MAAS issue
<cherylj> thanks, mpontillo
<mpontillo>  cherylj, I think dimitern and Andreas were going to re-setup their MAAS setup from scratch just to be sure though; did that happen?
<cherylj> mpontillo: I'm not sure.  dimitern?
<voidspace> dimitern: ping if you're still around
<dimitern> mpontillo, I have rc2 installed from scratch in lxc, will test with it tomorrow as I'm fixing the related juju bug
<dimitern> cherylj, ^^
<dimitern> voidspace, pong
<frobware> cherylj, I have a LGTM on 1516891 - is 1.25.2 going to be cut today?
<dimitern> frobware, cherylj, I'd wait for that until tomorrow TBO
<cherylj> frobware: I'm not sure.  There are a couple bugs we're looking at for that cutoff
<dimitern> TBH even
<frobware> cherylj, if the answer is "possibly not" I might implement some changes dimitern and I just talked about and do some additional testing tomorrow.
<cherylj> frobware: that should be fine
<voidspace> dimitern: I may not need you...
<jam> voidspace: how could you say such a thing
<jam> we all need dimitern
<dimitern> :D
<voidspace> :-)
<mup> Bug #1521267 opened: After upgrade juju status no longer works <juju-core:New> <https://launchpad.net/bugs/1521267>
<mup> Bug #1521267 changed: After upgrade juju status no longer works <juju-core:New> <https://launchpad.net/bugs/1521267>
<mup> Bug #1521267 opened: After upgrade juju status no longer works <juju-core:New> <https://launchpad.net/bugs/1521267>
<mup> Bug #1521267 changed: After upgrade juju status no longer works <juju-core:New> <https://launchpad.net/bugs/1521267>
<mup> Bug #1519403 opened: 1.24 upgrade does not set environ-uuid <juju-core:New for thumper> <https://launchpad.net/bugs/1519403>
<mup> Bug #1519403 changed: 1.24 upgrade does not set environ-uuid <juju-core:New for thumper> <https://launchpad.net/bugs/1519403>
<mup> Bug #1519403 opened: 1.24 upgrade does not set environ-uuid <juju-core:New for thumper> <https://launchpad.net/bugs/1519403>
<natefinch> well that only took 4 hours of retries.
<perrito666> natefinch: ?
<natefinch> connecting to freenode
<perrito666> ah, yes DoS
<perrito666> could anyone run go test -gocheck.list=true github.com/juju/juju/cmd/jujud/agent/... and paste me their output?
<davecheney> ping, anyone on call reviewer today ? http://reviews.vapour.ws/r/3266/
<perrito666> Ill review it
<perrito666> davecheney:  ship it
<davecheney> perrito666: ta
<thumper> morning
<natefinch> morning thumper
<alexisb> morning thumper
<thumper> davecheney: hey, look master is cursed due to the race build
<thumper> davecheney: I'm hoping we actually caught something new and not a mistake
<thumper> oh
<thumper> http://reports.vapour.ws/releases/3375/job/run-unit-tests-race/attempt/630
<thumper> not a race
<thumper> just a different failure
<mup> Bug #1517992 changed: juju-upgrade to 1.24.7 leaves juju state server unreachable <juju-core:Won't Fix> <https://launchpad.net/bugs/1517992>
<mup> Bug #1521327 opened: API client talking to 1.22 server failed: method Service(1).ServicesDeploy is not implemented <api> <ci> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1521327>
<katco> wwitzel3: ericsnow: sorry i'm taking so long; nothing for you today. we'll have lots to discuss in tomorrow's standup
<mup> Bug #1521354 opened: juju should check common failure conditions before upgrading <juju-core:Triaged> <https://launchpad.net/bugs/1521354>
<menn0> cherylj, davecheney: did you notice that the race detector CI job is failing due to a shell script issue: http://data.vapour.ws/juju-ci/products/version-3370/run-unit-tests-race/build-627/consoleText
<wwitzel3> katco: ok, np
<mup> Bug #1521354 changed: juju should check common failure conditions before upgrading <juju-core:Triaged> <https://launchpad.net/bugs/1521354>
<mup> Bug #1521354 opened: juju should check common failure conditions before upgrading <juju-core:Triaged> <https://launchpad.net/bugs/1521354>
<davecheney> thumper: yup, sadly not a real failure, just fat fingers
<mup> Bug #1513492 opened: add-machine with vsphere triggers machine-0: panic: juju home hasn't been initialized <add-machine> <panic> <vsphere> <juju-core:Triaged by s-matyukevich> <juju-core 1.25:Fix Released> <https://launchpad.net/bugs/1513492>
<davecheney> thumper: xenial, go 1.5 build is not passing
<davecheney> s/not/now
<thumper> \o/
<mup> Bug #1513492 changed: add-machine with vsphere triggers machine-0: panic: juju home hasn't been initialized <add-machine> <panic> <vsphere> <juju-core:Triaged by s-matyukevich> <juju-core 1.25:Fix Released> <https://launchpad.net/bugs/1513492>
<mup> Bug #1513492 opened: add-machine with vsphere triggers machine-0: panic: juju home hasn't been initialized <add-machine> <panic> <vsphere> <juju-core:Triaged by s-matyukevich> <juju-core 1.25:Fix Released> <https://launchpad.net/bugs/1513492>
<davecheney> thumper: juju/container contains its own on-disk lock ....
<davecheney> thumper: in addition to juju/utils.fslock ...
<thumper> ugh
<thumper> hang on
<thumper> I added that
<thumper> wait
<thumper> wat?
<davecheney> yup
<thumper> a different lock type?
<davecheney> slightly different
<davecheney> but not different enough
<davecheney> thumper: https://github.com/juju/juju/pull/3859
<mup> Bug #1521327 changed: API client talking to 1.22 server failed: method Service(1).ServicesDeploy is not implemented <api> <ci> <regression> <juju-ci-tools:New> <juju-core:Won't Fix> <https://launchpad.net/bugs/1521327>
<thumper> davecheney: there are issues with this, but I'm otp right now
<davecheney> mkay
 * davecheney waits
<mup> Bug #1521327 opened: API client talking to 1.22 server failed: method Service(1).ServicesDeploy is not implemented <api> <ci> <regression> <juju-ci-tools:New> <juju-core:Won't Fix> <https://launchpad.net/bugs/1521327>
<mup> Bug #1521327 changed: API client talking to 1.22 server failed: method Service(1).ServicesDeploy is not implemented <api> <ci> <regression> <juju-ci-tools:New> <juju-core:Won't Fix> <https://launchpad.net/bugs/1521327>
<perrito666> sinzui: ping
<davecheney> thumper: ping, still otp ?
<thumper> yes
<davecheney> mkay
<wallyworld> anastasiamac: axw: perrito666: 1 minute late, finishing another meeting
 * perrito666 clocks wallyworld to make sure its 1 minute
<sinzui> perrito666: otp
<thumper> davecheney: off the phone now, gimmie 5 minutes?
<davecheney> sure
<sinzui> hi perrito666
<perrito666> sinzui: hi, I just came back from a short week off and am curious about the mongo3 package :)
<sinzui> perrito666: still working on it. I am right now looking at "testbed auxverb failed with exit code 255"
<perrito666> lovely
<sinzui> perrito666: I think it will be a few more days to see packages being used in tests. We are still waiting on arm hardware, so we won't be sure we are done until it is available to us (which should be this week)
<perrito666> k, tx a lot, just needed a status update
<thumper> davecheney: 1:1 hangout?
<davecheney> thumper: why
<thumper> fine, I"ll just put it here
<davecheney> thanks
<thumper> I believe that the lock in container prefers flock, and falls back to fslock on windows
<thumper> and was done because there were too many failures with fslock during container template creation
<thumper> I'm going to email juju-dev about replacing fslock
<thumper> I think we are spending too much time trying to fix a broken system
<thumper> instead of investing a little effort into making a good, OS agnostic, dies with the process, lock
<thumper> email coming soon from me about that
<davecheney> thumper: no argument there
<davecheney> i've said I want to use the linux facility to do this
<davecheney> sure, it's linux only
<davecheney> but really, so is juju to a first approximation
<davecheney> wrt most of that PR
<thumper> What I want is an OS abstraction
<davecheney> i argue it's still fine to move that code out of container
<davecheney> thumper: sure
<thumper> davecheney: your branch doesn't change the package name of the windows build file
<davecheney> but if you want it to be os agnostic that will come with a reduced readter set
<thumper> it'll fail for windows
<davecheney> right, thanks
<davecheney> i'll fix that
<davecheney> reduced 'feature' set
<thumper> I'm fine with that
<davecheney> things like fslock.IsLocked are racy
<davecheney> it cannot be used safely
<sinzui> davecheney: the xenial unit tests vote. you rock
<davecheney> for the same reason there is no os.IsExists
<davecheney> sinzui: thanks
<sinzui> davecheney: is it difficult to do the same for 1.25 which will get a few more releases?
<thumper> davecheney: some of the fslock "features" were added because they could, not because they should
<thumper> also
<davecheney> thumper: precisely
<thumper> they were added to work around the problem of the lock not being released when the process dies
<davecheney> i want to kill them with fire
 * sinzui wants wily unit tests on master passing more than 1.25 on xenial
<davecheney> sinzui: no idea
<davecheney> but I can look at the failures
 * thumper heading out for dogwalk
#juju-dev 2015-12-01
<davecheney> axw: do you think you could backport https://github.com/juju/juju/pull/3841 to 1.25 ?
<davecheney> to see if we can get the xenial tests passing on the 1.25 branch ?
<davecheney> thumper: http://reviews.vapour.ws/r/3278/ fixed and verified it builds under windows, linux, and darwin
<axw> davecheney: 1.25 doesn't have LXD to begin with
<axw> so must be a different issue
<davecheney> ok
<davecheney> thanks
<thumper> davecheney: I'm pretty sure that I was trying to get us away from fslocks in the linux code at least
<thumper> but then I hit places where people were caring about the message
<thumper> and I rage quit
<thumper> however
<thumper> I think if we had a proper lock that went away with the process
<thumper> we could simplify quite a bit of code
<davecheney> i don't think we want, lock.Lock("message")
<davecheney> we want lock.TryLock(timeout)
<davecheney> thumper: i just said to axw that I think at least some cases of fslock / filelock
<davecheney> we don't want locking
<davecheney> just atomic replacement
<davecheney> axw, what were the two case of fslock you knew of ?
<axw> env cache I think (thumper? is that right), and serialising hook execution on a machine
<axw> i.e. so two units (different processes) cannot execute hooks at the same time
<thumper> I don't think it makes any sense to have a Lock function that doesn't lock. TryLock makes much more sense
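For context, a minimal sketch of the kind of lock being argued for here: flock(2) is held on the open file descriptor, so the kernel drops it automatically when the process dies. It is Unix-only, which is the reduced feature set mentioned above, and all names are illustrative rather than whatever API was eventually adopted.

    // +build linux darwin

    package flock

    import (
        "errors"
        "os"
        "syscall"
        "time"
    )

    // FLock holds an exclusive flock(2) on path until Unlock is called or
    // the owning process exits.
    type FLock struct {
        f *os.File
    }

    var ErrTimeout = errors.New("flock: timed out waiting for lock")

    // TryLock polls flock(LOCK_EX|LOCK_NB) until it succeeds or the timeout
    // expires; the kernel releases the lock if the holder dies.
    func TryLock(path string, timeout time.Duration) (*FLock, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0600)
        if err != nil {
            return nil, err
        }
        deadline := time.Now().Add(timeout)
        for {
            err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
            if err == nil {
                return &FLock{f: f}, nil
            }
            if err != syscall.EWOULDBLOCK {
                f.Close()
                return nil, err
            }
            if time.Now().After(deadline) {
                f.Close()
                return nil, ErrTimeout
            }
            time.Sleep(50 * time.Millisecond)
        }
    }

    // Unlock drops the lock by closing the descriptor.
    func (l *FLock) Unlock() error {
        return l.f.Close()
    }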
<davecheney> boom, and all work stops, https://bugs.launchpad.net/juju-core/+bug/1471941
<mup> Bug #1471941: windows unit tests fail because handles are not available <blocker> <ci> <intermittent-failure> <regression> <unit-tests> <windows> <juju-core:Triaged> <https://launchpad.net/bugs/1471941>
<thumper> client side cache, template creation, uniter lock
<davecheney> thumper: those all sound like atomic replacement
<thumper> also I believe something in reboot
<thumper> and maybe actions?
<davecheney> axw: serialised hook execution
<davecheney> can we not make that ownership of the uniter socket ?
<axw> davecheney: different unit agents on the same machine
<axw> davecheney: only one unit agent should be executing a hook at the same time
<davecheney> why not just open a known unix socket on localhost
<davecheney> if the host doesn't support unix sockets, tcp will do
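A toy sketch of that socket idea: whichever process manages to bind the well-known local address holds the machine-wide lock, and the kernel releases the address if that process dies. The socket name is made up, abstract unix sockets (the leading "@") are Linux-only, and TCP on a fixed localhost port is the fallback davecheney mentions.

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    // acquire blocks until it can bind the well-known address, i.e. until
    // no other process on this machine holds it. An abstract unix socket
    // leaves no stale file on disk and is freed by the kernel when the
    // holder exits.
    func acquire(addr string, retry time.Duration) net.Listener {
        for {
            l, err := net.Listen("unix", addr)
            if err == nil {
                return l // we now hold the "lock"
            }
            time.Sleep(retry) // someone else holds it; wait and retry
        }
    }

    func main() {
        l := acquire("@juju-hook-execution-lock", 100*time.Millisecond)
        defer l.Close() // release; also released implicitly if we crash

        fmt.Println("holding the hook-execution lock")
        time.Sleep(time.Second) // ... run the serialised hook here ...
    }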
<thumper> davecheney: most places it is process serialisation, not just atomic replacement
<davecheney> if hook execution is per machine
<thumper> note that juju run goes through the uniter execution code
<thumper> although I think that axw's fix makes it more in sync with a hook type
<davecheney> thumper: sinzui https://bugs.launchpad.net/juju-core/+bug/1471941
<mup> Bug #1471941: windows unit tests fail because handles are not available <blocker> <ci> <intermittent-failure> <regression> <unit-tests> <windows> <juju-core:Triaged> <https://launchpad.net/bugs/1471941>
<davecheney> why did this bug suddently become blocking ?
<sinzui> davecheney: when it suddenly started failing reliably a few hours ago
<sinzui> The bug has a long history, but it recently became a problem in master http://reports.vapour.ws/releases/issue/559acfbe749a565572d289f1
<davecheney> WARNING: Error cleaning up temporaries:
<davecheney> ERRORS LOGGED AT WARNING!!
<thumper> heh
<davecheney> sinzui: i see the bug
<davecheney> given it will require a third party to land any fix
<davecheney> can I implore you to not make it blocking ?
<sinzui> davecheney: how do we encourage someone to fix it, can we revert?
<sinzui> davecheney: or is there something I can do to mitigate the issue so that other regressions don't appear in the windows tests?
<davecheney> sinzui: the problem is whatever is writing that log file hasn't shut down in time
<davecheney> and windows won't let you unlink open files
<davecheney> my solution is to fix gocheck to not fail on a warning
<sinzui> davecheney: could a reboot to recycle space and mem help the log file to shutdown in time?
 * sinzui dreads rebooting every day
<thumper> hmm... we could make it easier by making the tests clean up after themselves
<thumper> but the way to force this is to do what dave says
<thumper> and have gocheck fail if it can't delete the files
<davecheney> thumper: gocheck fails now
<davecheney> but it calls it a warning :grrr:
<sinzui> davecheney: 1.25 looks like it will pass. I can try rebooting the machine hoping to recover something. the machine looks clean. I will then retest the last master rev.
<thumper> sinzui: what is the 1.20 version that we tell people to upgrade through?
<sinzui> thumper: 1.20.14 is final
<thumper> sinzui: ta
<sinzui> davecheney: thumper : the windows unit tests just failed again. Looking at the history of this job against master, I think something since commit da5ec85 has upset the test suite.
<thumper> sinzui: don't suppose you have a link to that commit?
 * thumper looks at the git cli
<thumper> sinzui: really?
<thumper> that commit doesn't really do much
<thumper> Merge pull request #3841 from axw/lp1520380-remove-lxd-containertype
<thumper> sinzui: that one right?
<sinzui> thumper: that commit was the last pass. since then master fails
<thumper> oh, something since that
<thumper> that makes more sense
<cherylj> wallyworld: ping?
<wallyworld> hey
<cherylj> wallyworld: quick question for you on the unit agent / unit workload status
<wallyworld> sure
<cherylj> was the idea that the old status docs from before the split happened would just become the unit agent status docs?
<cherylj> and there was no need to update them during an upgrade?
<wallyworld> yes
<cherylj> waigani_: ^^
<wallyworld> hence the #charm suffix for workload status
<cherylj> wallyworld: that's what I thought, but wanted to sanity check
<wallyworld> sure
<cherylj> thanks!
<waigani_> cherylj, wallyworld: okay cool, will do
<wallyworld> given that, what upgrade step is missing?
<cherylj> creating the status doc for u#<unit>#charm
<waigani_> wallyworld: http://reviews.vapour.ws/r/3279/
<wallyworld> hmm, i had thought that we treated missing as unknown
<davecheney> sinzui: i'll take another look
<wallyworld> that's how it was originally done i think
<cherylj> wallyworld: I can look at the history
<wallyworld> maybe not, can't recall now
<cherylj> but now it fails with a "not found"
<wallyworld> ok, clearly a problem then
<wallyworld> it's been a while so i could be misremembering
<cherylj> wallyworld: so, is the solution to treat it as unknown?  or insert a doc for each during upgrade?
<cherylj> I was assuming that it should be inserted as an upgrade step
<wallyworld> a doc is cleaner
<cherylj> ok
<waigani_> so that's what I've currently got: a new agent status with status unknown inserted on upgrade
<waigani_> wallyworld: any chance if you could take a quick look at the upgrade step, see if it makes sense?
<waigani_> link above
<wallyworld> looking
<waigani_> cheers
<wallyworld> waigani_: comment doesn't make sense, u#name is not invalid
<waigani_> wallyworld: but it's invalid for a unit
<wallyworld> not for an agent
<wallyworld> we need both
<waigani_> wallyworld: that's what this upgrade step does
<waigani_> wallyworld: do you mean it's just the wording of the comment?
<wallyworld> ok, i'll read the code, the comment was just confusing
<wallyworld> waigani_: comment is plain wrong, u#name is not invalid. units are not misidentified. the issue is that in settings, there is an entry for u#name only, when there should also be one for u#name#charm
<davecheney> sinzui: http://reports.vapour.ws/releases/2859/job/run-unit-tests-win2012-amd64/attempt/838
<davecheney> this file, base_windows.go doesn't appear to exist on master
<davecheney> i'm probably looking in the wrong place
<davecheney> sinzui: nope, not on master
<davecheney> how come this bug is blocking master ?
<sinzui> davecheney: because only master is failing because of it
<davecheney> github.com/juju/juju/testing/base_windows.go
<davecheney> ^ this file does not exist on windows
<davecheney> sorry
<davecheney> in master
<davecheney> at least that I can see
<davecheney> is this for another branch ?
<davecheney> this file exists in juju-1.25.1
<davecheney> but not on master
<thumper> davecheney: that attempt above is on a feature branch
<thumper> the resources one
<davecheney> sinzui: this doesn't sound like it's on master
<davecheney> can you please unblock master
<sinzui> davecheney: this issue is simple. The windows unit tests passed a few commits ago. Now they do not. The job passes on other branches, ergo, something has changed master to make the failure consistent. The issue matches an existing bug seen intermittently in other branches. We escalated the bug a few hours ago
<wallyworld> waigani_: you need to take another look - it's not correct as is
<thumper> sinzui: got a link to a failing run on master?
<davecheney> sinzui: please can you point me to the jenkins run and i'll look into it
<thumper> menn0: I think this may be your one
<thumper> it is an upgrade featuretest
<sinzui> uhhh, thumper, the issue linked in the bug description points to everything, and all the recent failures are in master
<davecheney> sinzui: i'm not very familiar with how to navigate reports.sw
<waigani_> wallyworld: okay, will do
<thumper> menn0: ping
<davecheney> i'd appreciate it if you could provide me with a link to the jenkins failure
<wallyworld> waigani_: cherylj  suggested the correct approach
<sinzui> davecheney: thumper  http://reports.vapour.ws/releases/issue/559acfbe749a565572d289f1 from the bug will help. the last six failures are recent and affect master
<menn0> thumper: pong
<thumper> menn0: I believe the blocker above is due to the re-enabled upgrade feature tests
<thumper> menn0: see http://reports.vapour.ws/releases/3376/job/run-unit-tests-win2012-amd64/attempt/1657
<davecheney> sinzui: is this the failure you are talking about ? http://juju-ci.vapour.ws:8080/job/run-unit-tests-win2012-amd64/1659/console
<menn0> sigh
 * menn0 looks
<sinzui> davecheney: yes
<davecheney> thank you for confirming
<davecheney> ok, this is a different issue
<davecheney> this is the failure i see
<davecheney> upgrade_test.go:109:
<davecheney>     go func() { c.Check(a.Run(nil), jc.ErrorIsNil) }()
<davecheney> ... value *errors.Err = &errors.Err{message:"failed to create C:/Juju/bin/juju-run.exe symlink", cause:(*os.LinkError)(0xc084427f00), previous:(*os.LinkError)(0xc084427f00), file:"github.com/juju/juju/cmd/jujud/agent/machine.go", line:1775} ("failed to create C:/Juju/bin/juju-run.exe symlink: symlink c:\\users\\admini~1\\appdata\\local\\temp\\tmpivnvca\\gogo\\tmp-juju-testtxo382\\check-5796262061010820532\\7\\var\\lib\\juju\\tools\\machin
<davecheney> ... error stack:
<davecheney> http://paste.ubuntu.com/13590291/
<davecheney> sorry for the fat finger
<davecheney> something is trying to create a symlink on windows
<davecheney> ain't nobody got time for that
<menn0> thumper: i'll take a look
<menn0> davecheney: i'll take a look at that one too since I can guess what it's a bout
<menn0> about
<thumper> menn0: ta
<davecheney> menn0: want me to create a new bug ?
<menn0> should make a nice change from the massive and horrible merge I've been working on all day
<thumper> menn0: almost certainly, some of these shouldn't be running on windows...
<davecheney> menn0: all yours, https://bugs.launchpad.net/juju-core/+bug/1521446
<mup> Bug #1521446: featuretests: failure on windows due to ill informed symlink <juju-core:New> <https://launchpad.net/bugs/1521446>
<menn0> davecheney, thumper: are these both with master?
<thumper> the feature test failures are
<mup> Bug #1521446 opened: featuretests: failure on windows due to ill informed symlink <juju-core:New> <https://launchpad.net/bugs/1521446>
<thumper> menn0: any test that is deailing with running upgrade steps can be skipped on windows IMO
<thumper> as they will never be executed on windows
<thumper> menn0: I believe this is the source of the issue
<menn0> thumper: fair enough... but I think they *did* run on windows previously
<menn0> thumper: and I'm surprised the symlink issue isn't happening on Linux
 * thumper shrugs
<thumper> I've not looked deeply at it
<thumper> just looked at the errors that were being spewed
 * menn0 looks
<thumper> and the commits that happened since the last bless
<thumper> and put 1 and 1 together
<thumper> I may have gotten 3
<thumper> but this appears to be the issue
<sinzui> thumper: davecheney I updated the bug description; https://bugs.launchpad.net/juju-core/+bug/1471941
<mup> Bug #1471941: windows unit tests fail because handles are not available <blocker> <ci> <intermittent-failure> <regression> <unit-tests> <windows> <juju-core:Triaged> <https://launchpad.net/bugs/1471941>
<davecheney> for extremely large values of 1
<davecheney> sinzui: thank you
<davecheney> sinzui: i think that is incorrect
<davecheney> this failure does not occur on master
<davecheney> that file is not present
<davecheney> this cannot be the same issue
<sinzui> davecheney: interesting, because http://reports.vapour.ws/releases/issue/559acfbe749a565572d289f1 master was tested
<sinzui> davecheney: The last run in that issue is http://reports.vapour.ws/releases/3376/job/run-unit-tests-win2012-amd64/attempt/1657 and it says master. Do you suspect a stale tarfile?
<davecheney> sinzui: is it possible when discussing these failures to refer to the jenkins build
<davecheney> that is how I know how to diagnose those failures
<davecheney> is there a link on the reviews.ws page to the jenkins failure ?
<davecheney> nm
<davecheney> that page has enough details
<sinzui> davecheney: http://reports.vapour.ws/releases/3376/job/run-unit-tests-win2012-amd64/attempt/1657 has a link that says "jenkins"
<davecheney> sinzui: http://paste.ubuntu.com/13590518/
<davecheney> this looks like the failure from that build
<davecheney> sinzui: thanks
<davecheney> which is a different bug
<davecheney> sigh
<sinzui> davecheney: I am not particularly interested in this bug or that bug. We are certain that master and no other branch tested today fails, the failure is in the windows job, and the failure is consistent in 7 tries.
 * thumper is making juju jump through flaming upgrade hoops
<thumper> wallyworld: http://reviews.vapour.ws/r/3281/diff/#
<wallyworld> looking
<mup> Bug # changed: 1497809, 1497810, 1509747, 1519994
<mup> Bug #1521453 opened: worker/upgrader: test failure on windows due to file being in use <juju-core:New> <https://launchpad.net/bugs/1521453>
<menn0> davecheney: regarding that symlink error... do you have the full error text? it's truncated in the backscroll above
<wallyworld> thumper: just a wording quibble
<menn0> davecheney: never mind... the error thumper asked me to look at also has the symlink failure. I think they're the same thing. Not being able to delete some log file is just a side effect.
<thumper> wallyworld: hmm... I think we should change "We"
 * menn0 doesn't have time for this. Skipping this test on windows. As thumper indicated, this test doesn't matter unless we plan to support state servers on windows.
<thumper> wallyworld: how about "Please upgrade to the latest 1.25 release"
<davecheney> menn0: +1
<davecheney> ain't going to be a state server for a long time
<davecheney> along with the complete sellout of FOSS principles
<davecheney> so skipping the test sounds like a reasonable middle ground
<menn0> davecheney: http://reviews.vapour.ws/r/3282/
<menn0> davecheney: actually, hold the phone. I just noticed this: https://bugs.launchpad.net/juju-core/+bug/1446885
<mup> Bug #1446885: Skipped cmd/jujud/agent/upgrade_test.go tests on windows <skipped-test> <test-failure> <windows> <juju-core:Triaged by menno.smits> <https://launchpad.net/bugs/1446885>
<menn0> davecheney: *now* PTAL: http://reviews.vapour.ws/r/3282/diff/
<davecheney> menn0: looking
<davecheney> menn0: why not
<davecheney>  // +build !windows
<davecheney> just skip 'em all
<davecheney> meh, ship it
<menn0> davecheney: I did it this way as there may well be non-state-server related upgrade tests in the future
<davecheney> sure, i didn't mean to overcomplicate things
<menn0> davecheney: merging nw
<menn0> now
<menn0> and now for more pain...
<davecheney> \o/
<davecheney> menn0: there are only two tests in that suite
<davecheney> the skip should have probably gone to setupTest or something
<davecheney> whatever
<menn0> davecheney: nope, the way I did it was intentional
<davecheney> fair enough
<davecheney> zero ducks
<davecheney> menn0: here is a small review in return
<davecheney> http://reviews.vapour.ws/r/3283/
<waigani_> cherylj, wallyworld: how does this look: http://reviews.vapour.ws/r/3279
<waigani_> cherylj, wallyworld: sorry for the misunderstanding. I thought we had unit statuses disguised as agent statuses - not simply that unit statuses were missing.
<menn0> davecheney: unreserved Ship It :)
<davecheney> ta
<davecheney> interestingly the race only happens on go 1.2
<davecheney> something in later versions obscures it
<menn0> that's gotta be a fluke
<davecheney> even with a long stress time
<davecheney> it's a race, sure and certain
<davecheney> what is likely is that in go 1.2, the machiner worker starts _after_ the test has completed
<davecheney> but in later versions the machiner worker runs immediately
<davecheney> the race is on the patch'ed net.GetInterfaces method
<davecheney> PatchValue will be the first against the wall when the revolution comes
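A small illustration of the class of race being described (all names are made up): the test patches a package-level function variable while a worker goroutine may still be reading it, so whether the detector fires depends entirely on how the scheduler interleaves the two.

    package demo

    import "net"

    // Production code indirects through a variable so tests can patch it.
    var interfaces = net.Interfaces

    // startWorker models the machiner worker reading the patched variable
    // at some point after it starts.
    func startWorker() chan struct{} {
        done := make(chan struct{})
        go func() {
            defer close(done)
            interfaces() // unsynchronised read
        }()
        return done
    }

    // patchForTest models what a test's PatchValue amounts to: a plain
    // unsynchronised write to the same package-level variable.
    func patchForTest() {
        interfaces = func() ([]net.Interface, error) { return nil, nil }
    }

    // race shows the ordering dependence: whether the write lands before
    // or after the worker's read is down to the scheduler, which is why
    // the failure shows up on some Go versions and not others.
    func race() {
        done := startWorker()
        patchForTest()
        <-done
    }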
<natefinch> why do we have a utils repo?
<natefinch> why are all those packages not just in top level repos by themselves?
<davecheney> natefinch: +1 from me
<davecheney> if they cannot justify a repo of their own
<davecheney> kill 'em with fire
<natefinch> yep
<wallyworld> waigani_: sorry, was at doctor, reviewed, a couple of small changes to make
<waigani_> wallyworld: cool, just finished dinner, will update now
<voidspace> dimitern: jam: frobware: fwereade: standup?
<dimitern> omw
<perrito666> oh great, my bug is caused by a lock, where have I seen this discussion before...
<dimitern> frobware, ping
<dimitern> frobware, now that I have working IPv6 setup on MAAS, I found a small issue with the bridge script in 1.25 - get_gateway needs to handle the case where you have multiple default gateways (e.g. fc00::.. and fe80::.. - the latter is always added)
<dimitern> only needed change was to pipe the ip route list exact default output through "head -n1" first before piping that through cut
<frobware> dimitern, ack
<voidspace> dimitern: ping
<dimitern> voidspace, pong
<voidspace> dimitern: inside the discoverspacesWorker I need access to both the Spaces facade and the Subnets facade
<voidspace> dimitern: can I create a single API struct with two facades
<voidspace> dimitern: or should I have two separate API structs
<voidspace> I need two FacadeCallers either way, I just wonder if it's good practice to concatenate them into a single type
<dimitern> voidspace, why 2 facades?
<voidspace> dimitern: Subnets and Spaces are separate facades and I need to create both spaces and subnets
<dimitern> voidspace, both of these are client facades, not agent
<voidspace> dimitern: do we have a different facade I should be using?
<voidspace> dimitern: or should I just use an agent facade
<dimitern> voidspace, nope I think we need a new facade including both of these methods, ideally also reusing the methods implementation in the 2 client and the new agent facades, using apiserver.common mixins
<voidspace> dimitern: is there a difference in how agent facades are implemented?
<voidspace> this card just got bigger if we can't use the existing api :-)
<fwereade> voidspace, facades are client-specific
<fwereade> voidspace, they're responsible for auth and params translation -- and if they're the only thing that needs the underlying capabilities they're often implemented there too
<voidspace> fwereade: time to go duplicate some stuff then ;-)
<fwereade> voidspace, if you have capabilities needed by multiple facades, great, move them into (a subpackage of) apiserver/common
<voidspace> yep
<voidspace> I'll do that first in a clean branch, then come back to the worker
<dimitern> voidspace, yeah, what fwereade said; also we can reuse existing code, with some refactoring - see apiserver.common.StatusSetter for example how to do it
<voidspace> dimitern: fwereade: thanks
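A rough sketch of the composition pattern fwereade and dimitern are pointing at: capabilities shared by several facades live in a common package (in the spirit of the apiserver/common mixins such as StatusSetter) and the new agent facade embeds them. Every name below is illustrative, not the facade that was eventually written.

    package discoverspaces

    // SpaceBackend and SubnetBackend stand in for the state operations the
    // shared capabilities need.
    type SpaceBackend interface {
        AddSpace(name, providerID string, subnetIDs []string) error
    }

    type SubnetBackend interface {
        AddSubnet(cidr, spaceName string) error
    }

    // SpaceCreator is a shared capability, analogous to an apiserver/common mixin.
    type SpaceCreator struct{ st SpaceBackend }

    func NewSpaceCreator(st SpaceBackend) *SpaceCreator { return &SpaceCreator{st: st} }

    func (s *SpaceCreator) CreateSpace(name, providerID string, subnetIDs []string) error {
        return s.st.AddSpace(name, providerID, subnetIDs)
    }

    // SubnetAdder is another shared capability.
    type SubnetAdder struct{ st SubnetBackend }

    func NewSubnetAdder(st SubnetBackend) *SubnetAdder { return &SubnetAdder{st: st} }

    func (s *SubnetAdder) CreateSubnet(cidr, spaceName string) error {
        return s.st.AddSubnet(cidr, spaceName)
    }

    // DiscoverSpacesAPI is the new agent-side facade, composed from the
    // shared capabilities rather than the worker talking to two client facades.
    type DiscoverSpacesAPI struct {
        *SpaceCreator
        *SubnetAdder
    }

    func NewDiscoverSpacesAPI(spaces SpaceBackend, subnets SubnetBackend) *DiscoverSpacesAPI {
        return &DiscoverSpacesAPI{
            SpaceCreator: NewSpaceCreator(spaces),
            SubnetAdder:  NewSubnetAdder(subnets),
        }
    }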
<fwereade> voidspace, fwiw, it is likely worth just finishing the worker in terms of (an) interface(s) that will be implemented by the client
<fwereade> voidspace, the context switch will probably hurt more
<fwereade> ?
<voidspace> just looking at the code
<voidspace> fwereade: you're probably right
<mup> Bug #1521610 opened: Upgrade hung when moving from 1.18.4.3 to 1.24.7 <juju-core:New> <https://launchpad.net/bugs/1521610>
<dimitern> it's a whole new experience being able to run make check *and* at the same time live test on 3 versions of virtual maas (in lxc) bootstrapping juju in kvms
<TheMue> dimitern: seems you're really happy about your new machine. ;)
<dimitern> TheMue, :) oh yeah!
<TheMue> dimitern: I've got a MBP here, i7, 2.5 GHz, 16 Gigs, and 512 Gigs SSD. only hard part is switching from German to US keyboard :)
<dimitern> TheMue, sounds nice! :) how's erlang treating you?
<TheMue> dimitern: aaahh, pretty nice again. but after that longer pause and doing go for so long I have to change some habits back again.
<TheMue> dimitern: but right now I'm doing API design for the clients, will use JSON via WebSockets
<frobware> dimitern, voidspace, dooferlad: maas interlock?
 * fwereade has found his test bug and is really annoyed about it
<dimitern> frobware, oops sorry, got carried away with maas testing here
<dimitern> TheMue, ;) sounds interesting
 * fwereade is coming to the realisation that go 1.5 is going to make clock tests decidedly unfun
<perrito666> fwereade: oh, you have a version of go that makes them fun?
<dimitern> which was good, because I've discovered yet another issue, this time with precise
<fwereade> perrito666, 1.2 doesn't seem to exhibit a particularly unhelpful scheduling behaviour
<perrito666> that, for me, is a few kilometers left of the fun mark
<fwereade> perrito666, haha
 * fwereade is worrying that we'll need to have setup tombs, or something... crib/tomb? cf cradle/grave?
<fwereade> the infuriating thing is that the test that I *know* fails can be easily modified to handle it
<perrito666> wonderful upgrade has a stale lock in agentconfig
<fwereade> but I *also* know that my other tests are now vulnerable too even if it hasn't shown up yet
<mgz> fwereade: any idea if deleting charms out swift mid-upgrade is likely to hose anything?
<fwereade> mgz, hmm, we have something that copies them into gridfs, don't we?
<mgz> we do... this is a bad idea for some old IS deployments
<fwereade> mgz, if that's already done it should be fine; if it's not I would be nervous about deleting them
<fwereade> mgz, ha, right, fat charms
<mgz> one of which is now very slowly copying fat charms onto local disk mid upgrade from 1.18 to 1.24
<fwereade> mgz, right
<fwereade> feck
<fwereade> mgz, I have no idea how it'll react if they're not there
<mgz> or were there when it did a list at the start but then vanish when it comes to copying...
<fwereade> mgz, but if you replace them with a byte of garbage each, I don't think anything depends on their *content* until we deploy new units that expect to use them
<fwereade> mgz, I do still consider that pretty serious breakage
<mgz> hm, I like that though
<mgz> we can see the version numbers in swift I believe
<fwereade> mgz, so long as there's some other way to get the charms into gridfs before they're needed it feels least likely to confuse everything
<fwereade> mgz, ...unless we're smart enough to check the hashes at some point in that process
<mgz> pretty sure we're dumb
<mgz> if it panics because the charm contents don't match when it fetches them I'll cry
<voidspace> dimitern: ping
<voidspace> dimitern: so it's good that we have a new facade for creating spaces, because creating them from the provider is slightly different
<voidspace> dimitern: i.e. we need the provider id and we generate the juju name in the apiserver
<voidspace> dimitern: so it's not an exact duplicate of the existing one
<voidspace> dimitern: this also means that the space collection in state will need to grow a ProviderId field
<voidspace> frobware: see above - the "simple" task I'm on of importing spaces has grown a bit
<voidspace> frobware: we need an entirely new api facade for adding spaces and the state representation of spaces need to change
<voidspace> frobware: not quite just the "add a simple worker" we anticipated yesterday
<frobware> voidspace, :)
<voidspace> frobware: just a heads up...
<frobware> voidspace, thanks.
<dimitern> voidspace, hey
<dimitern> voidspace, yes, that sounds good
<voidspace> dimitern: what does "IsPublic" mean for spaces - this is to do with mapping public IP addresses in AWS, right?
<voidspace> dimitern: we don't need this parameter for the new AddSpace
<dimitern> voidspace, public needs to be enforced for subnets added to a public space (to be themselves public)
<dimitern> voidspace, it expresses intended use, not AWS-specific thing
<voidspace> dimitern: ok, how does that correspond to spaces we discover from MAAS
<voidspace> where we are modelling the outside world, so are not in a position to "enforce" anything
<voidspace> and the concept of public/non-public is not part of the world we are modelling here
<voidspace> (i.e. even if it isn't intended to be AWS specific - it is really... ;-)
<dimitern> voidspace, in MAAS I think all spaces are private by default
<voidspace> ok, I'll just set IsPublic to false
<dimitern> voidspace, +1
<natefinch> ericsnow: seems like revision should just be part of the ResourceInfo.  why isn't it?  And where would it come from if it's not there?
<ericsnow> natefinch: agreed
<ericsnow> natefinch: it's not that way in the spec though :/
<natefinch> ericsnow: hmm, I see it's generated at upload time
<natefinch> ericsnow: so now we have meta metadata :/
<frobware> dimitern, voidspace: OpenStack / juju HO
<dimitern> frobware, omw
<voidspace> frobware: omw
<natefinch> ericsnow: still not sure why we can't store it all in the same place, even if they're separate in other peoples' systems
<natefinch> fwereade: why do our APIs use 'string' instead of a specific Tag type?   It would make them a lot more clear, I'd think, with basically no extra work.
<fwereade> natefinch, I think you're right, I glanced off that earlier today
<mup> Bug #1471941 changed: windows unit tests fail is upgrade steps it will never really do <blocker> <ci> <intermittent-failure> <regression> <unit-tests> <windows> <juju-core:Fix Released by menno.smits> <https://launchpad.net/bugs/1471941>
<mup> Bug #1521699 opened: windows unit tests fail because handles are not available <ci> <intermittent-failure> <windows> <juju-core:Triaged> <https://launchpad.net/bugs/1521699>
<katco> ericsnow: natefinch: wwitzel3: i think i've looked through all the code eric's got up. how we doing?
<wwitzel3> katco: I'm going through the show-resources command PR
<wwitzel3> katco: almost done
<ericsnow> katco: I'm game whenever
<katco> ericsnow: have you addressed any of the open comments?
<natefinch> I just finished up the 4th PR
<ericsnow> katco: working on it
<natefinch> er, review
<katco> ericsnow: kk
<katco> natefinch: great reviews btw
<natefinch> katco: I just like giving ericsnow a hard time ;)
 * ericsnow makes mental note
<katco> ;p
<katco> natefinch: i don't agree with everything you put, but they were thoughtful comments
<natefinch> katco: yeah, I figured :)
<wwitzel3> katco: ready whenever everyone else is, just wrapping up
<katco> wwitzel3: k... lunch maybe?
<wwitzel3> k
<katco> natefinch: ericsnow: now? lunch first?
<ericsnow> katco: I'm good either way
<natefinch> katco: half hour?  I'm lunching currently
<katco> natefinch: let's call it an hour so i can get some food and take a walk :)
<wwitzel3> ok, so top of the hour, moonstone?
<natefinch> okie dokie
<ericsnow> katco: k
<katco> wwitzel3: ericsnow: natefinch: sounds like a plan
<katco> sure, a. just... feel free to drop off there.
<wwitzel3> *group high five*
<katco> no need to appear when you're typed.
 * katco lunches
<perrito666> what is causing the current curse of master?
<thumper> wasn't cursed when I checked just before
<perrito666> http://reports.vapour.ws/releases <-- says that master's last bless was 6 days ago
<thumper> perrito666: interesting
<thumper> http://juju.fail doesn't show any blocking bugs
<menn0> thumper, perrito666: you can see the last cursed build here: http://reports.vapour.ws/releases/3378
<cherylj> can I get a quick review?  http://reviews.vapour.ws/r/3286/
<thumper> menn0: ha... xenial, and lease
<thumper> I bet it caught a real intermittent bug
<menn0> seems there's just one xenial bug that needs fixing. I thought davechen1y might have been looking at that (could be wrong)
<mup> Bug #1521777 opened: Allow for upgrades to 2.0 <juju-core:Triaged> <https://launchpad.net/bugs/1521777>
<davechen1y> menn0: thanks for fixing the windows blocker yesterday
#juju-dev 2015-12-02
<menn0> davechen1y: np
<mup> Bug #1519190 changed: worker/addresser: FAIL: worker_test.go:260: workerEnabledSuite.TestWorkerAcceptsBrokenRelease <juju-core:Opinion> <https://launchpad.net/bugs/1519190>
<mup> Bug #1519190 opened: worker/addresser: FAIL: worker_test.go:260: workerEnabledSuite.TestWorkerAcceptsBrokenRelease <juju-core:Opinion> <https://launchpad.net/bugs/1519190>
<davechen1y> thumper: it feels like the fslock debate is crawling towards stalemate
<davechen1y> thumper: what do you want to see from the discussion ?
<thumper> sorry davechen1y, head down with a customer fire right now
<davechen1y> axw: http://reviews.vapour.ws/r/3238/
<davechen1y> can you please review my last comment
<davechen1y> thumper: my condolences
<axw> davechen1y: looking
<axw> davechen1y: I'm not really sure what you mean by that. It's valid for "createAliveLock" to be entered twice
<axw> davechen1y: (just not concurrently)
<davechen1y> axw: createAliveFile starts a goroutine whose job is to scribble in some state file
<davechen1y> the sanity check fails because that goroutine is being started more than once
<davechen1y> this is why the original data race existed
<davechen1y> and I think the bug is still here, which wasn't the data race, that was just a side effect
<axw> davechen1y: no, before you could have two of those goroutines running at the same time. now you cannot, because of the wait group
<davechen1y> axw: i don't believe that is true
<davechen1y> see the sanity check
<davechen1y> why is createAliveFile being entered more than once
<davechen1y> why is createAliveFile being entered more than once ?
<axw> davechen1y: the sanity check shows that it was entered more than once, not that they were both in there at the same time
<axw> davechen1y: if you close a channel twice, sequentially, it still panics
<davechen1y> yes
<davechen1y> axw: ok, so I should be able to fix the sanityCheck by creating a new sanityCheck channel in Unlock ?
<axw> davechen1y: yep, that would make sense: recreate channel before removing the file
<axw> directory, whatever
<davechen1y> i think Unlock is the right place to refresh the sanity check
<davechen1y> i'm testing that now
<davechen1y> i don't trust this code
<davechen1y> i want triple underpants on it
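A stripped-down sketch of the sanity-check fix being agreed here (illustrative only, not the real fslock code): closing a channel twice panics even when the closes are sequential, so Unlock has to recreate the channel before the next Lock can close it again.

    type aliveLock struct {
        sanity chan struct{}
    }

    func newAliveLock() *aliveLock {
        return &aliveLock{sanity: make(chan struct{})}
    }

    // Lock closes sanity as a guard: entering twice without an intervening
    // Unlock panics, which is the "entered more than once" signal above.
    func (l *aliveLock) Lock() {
        close(l.sanity)
        // ... start the goroutine that maintains the alive file ...
    }

    // Unlock refreshes the channel before cleaning up, so sequential
    // Lock/Unlock/Lock cycles don't double-close it.
    func (l *aliveLock) Unlock() {
        l.sanity = make(chan struct{})
        // ... remove the alive file/directory ...
    }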
<cherylj> thumper: for allowing 2.0 upgrades, should I also allow people who've installed a 1.26 alpha to upgrade?  or are we just going to assume they need to destroy the env and bootstrap with 2.0?
<thumper> cherylj: there are two problems here
<thumper> one is getting a version out for 1.25 that'll work
<thumper> people with 1.26 will have a 2.0 client to upgrade
<thumper> so we need to make sure it lands in master too
<thumper> yes I think 1.26 alpha/beta would be fine for 2.0 upgrades
<cherylj> thumper: sounds good.    I also updated the check to actually query the AgentVersion, not just version.Current
<cherylj> there will probably be a follow on change to show users that a 2.0 upgrade is available, but never select it for an automatic upgrade.
<cherylj> well, not automatic.  Just never select it when upgrading a 1.25 env, unless explicitly specified
<thumper> oh ffs
<thumper> menn0: you around still?
<thumper> I have questions
<thumper> ugh
<axw> wallyworld: I think we need to get rid of the short-circuit removal of relations involving remote services. unless they're not registered
 * thumper headdesks
<axw> wallyworld: otherwise I don't think we can guarantee they'll be unregistered
<axw> wallyworld: so Destroy should just mark them Dead, and then the worker will see that, unregister, then remove
<wallyworld> axw: sounds reasonable at first glance
<menn0> thumper: still here
<thumper> menn0: https://github.com/howbazaar/juju-121-upgrades
<thumper> menn0: and a quick hangout?
<thumper> menn0: 1:1 hangout?
<menn0> thumper: ok
<davechen1y> lucky(~/src/github.com/juju/juju) % ls /tmp
<davechen1y> check-5577006791947779410  hugo_cache          juju-tools295324784  juju-tools-released484109868  test-mgo611590862
<davechen1y> config-err-g4oT3p          juju-exec001679756  juju-tools327758394  juju-tools-released817795995  test-mgo998542473
<davechen1y> deps.svg                   juju-exec158644967  juju-tools689131141  t                             tmux-1000
<davechen1y> fileNq80up                 juju-exec845043821  juju-tools849695094  test-mgo169368628             unity_support_test.0
<davechen1y> hsperfdata_root            juju-exec855581256  juju-tools891530041  test-mgo562608033
<davechen1y> here are all the things juju leaks per test run
<davechen1y> :(
<mup> Bug #1501490 changed: juju-local can't bootstrap as root user <bootstrap> <lxd> <juju-core:Expired> <https://launchpad.net/bugs/1501490>
 * thumper heads for the alcohol
<wallyworld> axw: when you have time, ptal http://reviews.vapour.ws/r/3291/
<axw> wallyworld: sure
<wallyworld> ta
<menn0> fwereade: ping
<dimitern> dooferlad, frobware, jam, fwereade, morning :) reviews on http://reviews.vapour.ws/r/3285/ will be appreciated
<frobware> dimitern, looking
<dimitern> frobware, ta!
<menn0> fwereade: when you have a chance, here's 32 pages of diffs :) http://reviews.vapour.ws/r/3292/
<menn0> fwereade: seriously though, see your email
<frobware> dimitern, what's the [1:] for on the noDevicesWarning string?
<dimitern> frobware, so that it doesn't start with a \n
<dimitern> frobware, thanks for the review
<fwereade> katco, ericsnow, wwitzel3: in merging master into machine-dep-engine, it looks like component bits have leaked into worker/meterstatus and uniter.Paths; was this intentional?
<frobware> dimitern, I reopened http://reviews.vapour.ws/r/3285/ as I found a go vet error.
<mup> Bug #1522001 opened: worker/instancepoller: intermittent data race <juju-core:New> <https://launchpad.net/bugs/1522001>
<dimitern> frobware, yeah, I caught this while forward porting
<dimitern> frobware, it will be fixed in master, and I'll propose a 1.25 fix
<frobware> dimitern, thx. I noticed as I rebased to push my changes.
<perrito666> bbl
<mup> Bug #1522025 opened: juju upgrade-juju should confirm it has enough disk space to complete <upgrade-juju> <juju-core:New> <https://launchpad.net/bugs/1522025>
<mup> Bug #1522025 changed: juju upgrade-juju should confirm it has enough disk space to complete <upgrade-juju> <juju-core:New> <https://launchpad.net/bugs/1522025>
<mup> Bug #1522025 opened: juju upgrade-juju should confirm it has enough disk space to complete <upgrade-juju> <juju-core:New> <https://launchpad.net/bugs/1522025>
<mup> Bug #1519527 changed: juju 1.25.1:  lxc units all have the same IP address - changed to claim_sticky_ip_address <openstack> <sts> <uosci> <MAAS:Triaged by mpontillo> <MAAS 1.9:Triaged by mpontillo> <MAAS trunk:Triaged by mpontillo> <https://launchpad.net/bugs/1519527>
<mup> Bug #1519527 opened: juju 1.25.1:  lxc units all have the same IP address - changed to claim_sticky_ip_address <openstack> <sts> <uosci> <MAAS:Triaged by mpontillo> <MAAS 1.9:Triaged by mpontillo> <MAAS trunk:Triaged by mpontillo> <https://launchpad.net/bugs/1519527>
<mup> Bug #1519527 changed: juju 1.25.1:  lxc units all have the same IP address - changed to claim_sticky_ip_address <openstack> <sts> <uosci> <MAAS:Triaged by mpontillo> <MAAS 1.9:Triaged by mpontillo> <MAAS trunk:Triaged by mpontillo> <https://launchpad.net/bugs/1519527>
<dooferlad> dimitern/frobware/voidspace: PTAL: http://reviews.vapour.ws/r/3294/
<mup> Bug #1520380 changed: worker/provisioner: unit tests fail on xenial <juju-core:Fix Released> <https://launchpad.net/bugs/1520380>
<frobware> dimitern, dooferlad, voidspace: PTAL @ http://reviews.vapour.ws/r/3298/
<dimitern> alexisb, hey
<mup> Bug #1511138 opened: Bootstrap with the vSphere provider fails to log into the virtual machine <bootstrap> <cloud-init> <vsphere> <juju-core:Fix Committed by s-matyukevich> <juju-core 1.25:Fix Released> <https://launchpad.net/bugs/1511138>
<mup> Bug #1513982 opened: Juju can't find daily image streams from cloud-images.ubuntu.com/daily <streams> <juju-core:Fix Committed by s-matyukevich> <juju-core 1.25:Fix Released by s-matyukevich> <https://launchpad.net/bugs/1513982>
<frobware> dimitern, voidspace, dooferlad: PTAL @ http://reviews.vapour.ws/r/3300/
<mgz> abentley: I have some hacks in assess_jes_deploy.py on master to add logging, I believe your changes should obsolete this, but be aware when merging
<abentley> mgz: My changes to assess_jes_deploy landed on Monday, IIRC.
<mgz> did you run the thingy as well? maybe they didn't conflict and I actually need to make a working branch for 'em then
<voidspace> frobware: looking
<voidspace> frobware: the new api is *almost* the same as the existing facade, but annoyingly needs space provider ids not names
<voidspace> frobware: and it seems that creating a new facade is probably easier than versioning the existing one and adding the extra info...
<voidspace> frobware: if that's just a forward port of your previous branch then LGTM
<perrito666> bbl
<natefinch> wwitzel3: I presume you're making the upload API client something like Upload(service, name string, resource io.Reader) error ?
<katco> natefinch: why not keep it decoupled? specify the interface you expect and we can adapt w/e wwitzel3 makes to that interface
<natefinch> katco: I figure we might as well start off with a perfect match if there's no reason not to... if we later need an adapter, we can always do that, but not even trying to make them the same seems like it just adds unnecessary complexity to the code.  I'd be looking at the code wondering why the two interfaces were different, and the answer would be ¯\_(ツ)_/¯
<wwitzel3> natefinch: the answer would be because they don't need to be the same? but yes, they are passed in, in the order of the command in the spec. Service, name, blob
<katco> natefinch: the reason to take in a func or interface is to guard against changes in the future; if the api layers change, your code doesn't have to. only the adapter
<natefinch> katco: yeah, I was going to take an interface... just might as well make the interface match to start with.  If I see someone's written an adapter, and all the adapter does is change the order of the arguments, I'm gonna be annoyed that I had to go read that code
<katco> natefinch: agreed for order, but not arity
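Purely illustrative of the client-side shape natefinch is proposing (not a confirmed juju API):

    import "io"

    type UploadClient interface {
        // Upload sends a resource blob for the named service, matching the
        // argument order of the command in the spec: service, name, blob.
        Upload(service, name string, blob io.Reader) error
    }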
<davecheney> quick review please ? https://github.com/juju/juju/pull/3882
<katco> davecheney: lgtm
<davecheney> katco: ta!
<menn0> sinzui, mgz: the machine-dep-engine feature branch CI runs are never actually happening. it's always "Blocked by parallel-streams-publish-aws, build-binary-vivid-amd64, build-revision"
<menn0> what's that about?
<menn0> abentley too: ^^
<mgz> menn0: it needs master merged to get new version
<mgz> menn0: read http://reports.vapour.ws/releases/3380
<abentley> menn0: Build-revision isn't completing successfully, because it hasn't been updated to a new version.
<menn0> mgz, abentley: I merged master into the feature branch last night. so it should work next attempt?
<mgz> yup, at least that initial step
<abentley> menn0: Yes.  If the version in the code is now greater than 1.26-alpha2.
<menn0> mgz, abentley: it should be. it was master from yesterday.
<menn0> mgz, abentley: Thanks for explaining. I didn't know we had that guard, but it makes sense now that I think about it.
<sinzui> abentley: I think you fell off #juju. To catch you up: we lost access to HP. We now have access back; in the meantime I started to move the last machines from there. quickstart still sucks. I have a hack in place to use gce. I quadrupled our cpus in gce to deploy landscape tests there.
<abentley> sinzui: ack.
<perrito666> katco: is that kb plastic or rubber?
<katco> perrito666: plastic i think
<niedbalski> thumper, ping
<thumper> niedbalski: hey
<thumper> niedbalski: I'm downloading the files to my laptop now
<thumper> niedbalski: although the current logs show more problems than just the db bits
<thumper> which sucks
<sinzui> davecheney: I have mixed news. the run-unit-tests-race is now on the new machine and can complete in less than 40 minutes. I think the new speed or the new host has uncovered errors. CI is retesting now. I may make the job non-voting until we get the test stable again
<katco> sinzui: davecheney: the race flag will only fail tests if it experiences the race, correct?
<sinzui> katco: In my experience, a failure of a test, or a panic will fail the test
<katco> sinzui: right, but the test won't fail if it contains a race condition unless that rc is triggered... even with the -race flag. at least that's my understanding
<sinzui> katco: if go test returns an exit code greater than 0 it will fail. That previous run failed because of panics and fails, I don't see a report of races
<katco> sinzui: right, i understand what you're saying. what i'm trying to add is that i think whether the tests fail because of races is as non-deterministic as the races themselves. so you may see wildly variable results.
<sinzui> katco: yeah.
<katco> sinzui: i.e. the test may not stabilize for a long while
<mwhudson> davecheney: do you remember what the difference between GOARM=6 and GOARM=7 is?
<davecheney> https://github.com/juju/juju/pull/3886
<davecheney> anyone
<davecheney> mwhudson: the extra 8 floating point regs in VFPv3
<davecheney> what do they call it, D16 or something
<mwhudson> oh right
<davecheney> go defaults to goarm=6
<davecheney> that is the right choice
<mwhudson> ok
<mwhudson> it's what ubuntu builds with too for armhf, apparently
<davecheney> armv6 vs v7 makes a huge difference in kernel space 'cos the cache model is completely different
<mwhudson> sounds like that's ok?
<davecheney> in user space, bugger all difference
<mwhudson> k
<davecheney> yes, that is the best choice
<mwhudson> thanks
<davecheney> https://github.com/juju/juju/pull/3886
<davecheney> anyone, please
#juju-dev 2015-12-03
<thumper> davecheney: shipit
<sinzui> waigani: Your PR was used to test the new juju-core-slave, so you have extra comments, sorry. You might care about the last run, which looks like a legitimate run: https://github.com/juju/juju/pull/3883
<waigani> sinzui: okay, looking ...
<waigani> sinzui: fixed, repushing
<sinzui> waigani: great I will watch to be sure your branch gets a fair test
<waigani> sinzui: thanks :)
<davecheney> thumper: danka
<davecheney> mwhudson: go 1.5.2 was just released
<mwhudson> davecheney: i saw it was close
<davecheney> well, they've tagged the branch
<davecheney> the binary artifacts will be coming soon
<mwhudson> cool
 * mwhudson runs away, will look later on
<thumper> poo
<davecheney> thumper: ?
<thumper> found a bug in the migration steps for 1.21
<perrito666> another one?
<thumper> perrito666: yeah
 * thumper goes to make coffee before he tears this apart
<mup> Bug #1522025 changed: juju upgrade-juju should confirm it has enough disk space to complete <upgrade-juju> <juju-core:New> <https://launchpad.net/bugs/1522025>
<mup> Bug #1522025 opened: juju upgrade-juju should confirm it has enough disk space to complete <upgrade-juju> <juju-core:New> <https://launchpad.net/bugs/1522025>
<mup> Bug #1522025 changed: juju upgrade-juju should confirm it has enough disk space to complete <upgrade-juju> <juju-core:New> <https://launchpad.net/bugs/1522025>
 * thumper is not enjoying reading this code
<wallyworld> davecheney: you ocr? got time to look at http://reviews.vapour.ws/r/3293 ? several of the files are new test charms or param renames
<natefinch> thumper: I've been having that feeling a lot lately
<wallyworld> thumper: can i pull master and run up a multi-env controller without a feature flag yet?
<thumper> wallyworld: no
<wallyworld> damn
<thumper> because I'm fixing bugs :)
<thumper> not features
<wallyworld> thumper: we're starting to need it for x-model work, i guess i could merge in your feature branch
<thumper> um...
<thumper> could do
<wallyworld> or just use flag
<thumper> oh fucking awesome
<wallyworld> i'll use feature flag
 * thumper headdesks
 * thumper headdesks
 * thumper headdesks
 * thumper headdesks
 * thumper headdesks
 * thumper headdesks
 * thumper headdesks
 * thumper headdesks
<wallyworld> lol
 * thumper headdesks
<wallyworld> ouch
<thumper> a test is passing because it isn't doing the asserts it thinks it's doing
<natefinch> lol
<wallyworld> hope it's not my code
<thumper> fuck fuck fuckity fuck
 * thumper digs deeper
<wallyworld> davecheney: so about any reviews?
<thumper> well, I added a line to make the test fail
<wallyworld> which test?
<thumper> state upgradeSuite.TestMigrateUnitPortsToOpenedPorts
<wallyworld> thanks davecheney :-)
<wallyworld> axw: thanks for review, questions answered plus a couple of changes, ptal when you are free
<axw> wallyworld: thanks, looking
<axw> wallyworld: shipit
<wallyworld> axw: awesome, tyvm
 * frobware wonders why in late 2015 he has a /boot partition which is 100MB in size because that's just too damn small...
<frobware> dimitern, voidspace, dooferlad: I'm a little hosed atm. No working desktop. an upgrade gone wrong - lack of disk space seems to be root cause.
<dimitern> frobware, oh sh*t :/
<dimitern> frobware, best of luck fixing it then
<dooferlad> frobware: crumbs. Ask for help if you need it.
<frobware> dooferlad, trying to chroot off a usb boot ...
<dooferlad> frobware: have you tried a boot repair USB stick or a recovery mode with networking support boot?
<dooferlad> chrooting seems painful
<frobware> dooferlad, my genuine /boot has no working kernel. do you know if a 'boot repair' disk will add one?
<dooferlad> frobware: ah, no. It seems to cover bad grub options, file system corruption and other bootloader related stuff, but not a missing kernel
<frobware> dooferlad, mkinitramfs failed (ENOSPC) so ...
<dooferlad> frobware: you should be OK deleting old kernels and related junk from /boot and retrying that. As long as one kernel is left that works...
<frobware> I have working kernels
<frobware> correction: I have NO working kernels.
<dooferlad> dimitern: can I have another review of http://reviews.vapour.ws/r/3294/ please?
<dimitern> dooferlad, sure - looking
<dimitern> dooferlad, replied
<voidspace> frobware: oh, ouch
<voidspace> frobware: dimitern: dooferlad: this PR adds a providerId field to spaces and a providerId parameter to AddSpace
<voidspace> http://reviews.vapour.ws/r/3307/
<voidspace> it's a pre-req of my next couple of PRs
<voidspace> dimitern: passing spaces through as constraints to maas will obviously need to use ProviderId and not Name
<voidspace> dimitern: but if you make that change then the CLI command to add space will no longer work for maas spaces - you'll *need* space discovery
<voidspace> (which is coming)
<voidspace> so maybe don't make that change yet
<dimitern> voidspace, looking
<dimitern> voidspace, space add is not fully implemented
<dimitern> voidspace, so it should be fine there, and for space create we have the card to not allow it when SupportsSpaces is false
<voidspace> dimitern: ah, sorry - it was space create I meant
<voidspace> dimitern: and we won't actually need "space add" I don't think
<voidspace> dimitern: we'll really just need "space import" or whatever to update all definitions
<voidspace> dimitern: a subsequent PR of mine will be adding SupportsSpaceDiscovery to NetworkingEnviron
<voidspace> dimitern: so that will be the switch to disable space / subnet create
<voidspace> dimitern: frobware: fwereade: dooferlad: bike shed on name, agent API facade for space and subnet creation (agent not client)
<voidspace> dimitern: frobware: fwereade: dooferlad: "networking" ?
<fwereade> voidspace, name it for the worker
<voidspace> it's a bit too generic but it's also true
<voidspace> fwereade: ok, that's easy enough
<fwereade> voidspace, that's the granularity we're looking for
<voidspace> discoverspaces
<voidspace> fwereade: cool, thanks
<voidspace> time for a rename then :-)
<dimitern> voidspace, right - agreed about space import
<dimitern> voidspace, +1 for SupportsSpaceManagement instead FWIW
<dimitern> voidspace, reviewed btw
<voidspace> management means nothing
<voidspace> it's a weasel word
<voidspace> dimitern: thanks for the review
<dimitern> voidspace, well, "discovery" is a bit misleading as it implies we can't do discovery in other clouds *ever*
<voidspace> dimitern: it doesn't imply that
<voidspace> dimitern: when we *can* they'll return true
<voidspace> that's precisely the point of it
<voidspace> when we can support it for a provider the result of that method will change
<voidspace> dimitern: you want me to change BackingSubnetInfo.ProviderId to string? I didn't touch that in my PR
<dimitern> voidspace, ok, but then we can't forbid space create because discovery is not supported
<voidspace> dimitern: "create" means create it on the provider
<voidspace> dimitern: so we can forbid it where we can't create it on the provider
<dimitern> voidspace, not BackingSubnetInfo specifically
<voidspace> dimitern: that's the first issue you raised
<voidspace> BackingSubnetInfo specifically...
<dimitern> voidspace, but st.Subnet.ProviderId() returns network.Id
<dimitern> voidspace, well it doesn't but it should (and BackingSubnetInfo represents st.SubnetInfo)
<voidspace> dimitern: I think network.Id would probably be better here
<dimitern> ..which also uses string aargh..
<voidspace> I can fix that too
<voidspace> it just adds work
<dimitern> voidspace, I'd like to use network.Id for anything user-facing w.r.t. provider ids
<voidspace> ok
<voidspace> I'll include those changes here
<dimitern> voidspace, doesn't have to happen now, if we store the correct value/type on the state doc
<voidspace> state/doc is using string
<voidspace> dimitern: should we use network.Id or string in mongo?
<dimitern> voidspace, yeah, as it should, but the related method that returns the provider id field value should cast it to network.Id
<voidspace> dimitern: threading network.Id through our code is fine
<voidspace> dimitern: I might as well do it now
<dimitern> voidspace, in mongo we use only native types or locally defined types in state
<dimitern> voidspace, ok, cheers
<voidspace> ok
<dimitern> voidspace, thanks
<voidspace> changed them, now to fix the 17 million things that break with type errors... :-D
<voidspace> at least the compiler tells me where they are...
<voidspace> dimitern: so state.SubnetInfo.ProviderId should also be network.Id
<dimitern> voidspace, yes
<voidspace> dimitern: cool
<voidspace> dimitern: so *really* ProvisioningInfo.SubnetsToZones should be map[network.Id][]string
<voidspace> dimitern: instead of map[string][]string
<voidspace> dimitern: but that would be an incompatible API change, so I'm going to leave it
<dimitern> voidspace, nope
<voidspace> dimitern: what do you mean by nope? :-)
<dimitern> voidspace, field types for api structs like mongo docs need to use types defined in the params (or state) package
<voidspace> nope I'm wrong, nope I shouldn't leave it, nope I'm right
<voidspace> so it shouldn't be network.Id
<dimitern> voidspace, not on the params
<voidspace> ok
<voidspace> dimitern: thanks
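A tiny sketch of the convention agreed above (type and field names are stand-ins, not the real state code): keep the raw string in the mongo doc, and have the accessor cast it so callers see the typed id.

    type Id string // stands in for network.Id

    type subnetDoc struct {
        ProviderId string `bson:"providerid,omitempty"`
    }

    type Subnet struct {
        doc subnetDoc
    }

    // ProviderId returns the provider-assigned id as the typed value.
    func (s *Subnet) ProviderId() Id {
        return Id(s.doc.ProviderId)
    }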
<perrito666> morning
<voidspace> perrito666: morning
 * frobware celebrates the restoration of /boot with lunch
<voidspace> frobware: \o/
<frobware> voidspace, was pretty easy after I realised that the first couple of times I did the restore I was restoring to /boot inside /, which was on LVM, but /boot should have been a physical partition... I just forgot to mount it during the chroot. :)
<frobware> and during lunch I think we'll do a complete system backup. :)
<voidspace> cool
<voidspace> dimitern: ping
<dimitern> voidspace, pong
<voidspace> dimitern: one issue you said:
<voidspace> dimitern: After adding the index, here if err == nil, we should still try fetching the space back to see if it got added (see how AddNetworkInterface handles the similar case).
<voidspace> dimitern: do you mean AddSubnet rather than AddNetworkInterface?
<voidspace> dimitern: because if you mean AddNetworkInterface I can't find that code
<voidspace> dimitern: is it to handle where adding can fail silently, because the providerid isn't unique?
<dimitern> voidspace, AddNetworkInterface is in state/machine.go
<voidspace> dimitern: yes
<voidspace> dimitern: but there's no code as you describe in that method...
<dimitern> voidspace, the switch at the end - case nil: ?
<voidspace> whereas I *think* what you mean is the case in AddSubnet but I want to check I'm understanding you
<voidspace> dimitern: oh right, I missed that
<voidspace> dimitern: yuck, that's horrible code
<dimitern> voidspace, but yeah - AddSubnet is actually a better (simpler) example
<voidspace> how did I miss that
<dimitern> voidspace, since the check is the same - providerid unique when != ""
<voidspace> but anyway, I'll follow the AddSubnet pattern if it's doing the same thing
<dimitern> voidspace, I'm refactoring AddNetworkInterface to be less icky hopefully :)
<voidspace> ah, I grepped for AddNetworkInterface and found it in SetInstanceInfo
<voidspace> and was reading that code
<voidspace> cool
<voidspace> SetInstanceInfo *calls* AddNetworkInterface, but it isn't AddNetworkInterface... :-)
<mup> Bug #1522409 opened: juju machine 0 watcher unable to connect to units <sts> <juju-core:New> <https://launchpad.net/bugs/1522409>
<ericsnow> food for thought: state/State has 195 *exported* methods
<frobware> dimitern, voidspace, dooferlad: PTAL @ http://reviews.vapour.ws/r/3308/
<dimitern> dooferlad, you've got another review btw
<frobware> dimitern, voidspace, dooferlad: too many distractions ^ that was a merge of the wrong branch in my repo
<dimitern> frobware, if it's a straight forward revert, just land it I think
<dimitern> frobware, except - I'll add a comment
<dimitern> frobware, the head -n1 added with my recent PR is not there when getting the gateway
<dimitern> frobware, ah, sorry - it is actually
<frobware> dimitern, it's still there post the revert?
<dimitern> frobware, it is (in the expectedBlaBalCloudConfig)
<dimitern> (the one with the bridge script)
<frobware> dimitern, I was confused why my cherry-pick for the bond change was not clean - now I know why... sigh.
<dimitern> frobware, well, good catch :)
<dimitern> voidspace, ping
<voidspace> dimitern: pong
<dimitern> voidspace, in case you're having issues with the ProviderId index (as I'm discovering now)
<voidspace> dimitern: I am...
<voidspace> dimitern: well, sort of
<voidspace> it's causing unrelated bindings tests to fail
<voidspace> but a worse problem right now is that my space tests don't appear to be running
<dimitern> voidspace, I *think* the correct solution is to only verify uniqueness if ProviderId != ""
<voidspace> dimitern: I thought that's what sparse did
<dimitern> voidspace, well, yeah, but since ProviderId is also omitempty, it will be missing
<dimitern> if not set
<dimitern> "" != {$exists, false} as it happens
<voidspace> ok
<voidspace> dimitern: that's exactly how ProviderId on Subnets is defined though
<dimitern> voidspace, well, it might be different for subnet.. but for NICs I'm having a bunch of issues, still debugging the Add.. call
<voidspace> omitempty but with a unique, sparse index
<voidspace> dimitern: ok
<voidspace> I'm adding it to space - but just following what we did with subnets
<dimitern> voidspace, well, that sounds sensible for subnets, but after spending >1h fiddling with indices, uniqueness, etc. I'm beginning to think having ProviderId defined like that for NICs is not a good idea
<voidspace> dimitern: heh, fun
<dimitern> e.g. imagine machine-0 with eth0 with aa:bb:cc:dd:ee:f0 for mac on subnet "0.1.2.0/24"; there's no trouble also having eth0 (same mac and all), but on subnet "2001:db8::/64"
<voidspace> dammit SpacesSuite is not being run at all
<dimitern> voidspace, sh*t! really?
<voidspace> well, not on my machine right now
<voidspace> it maybe a general problem
<dimitern> voidspace, it might be due to the order of go test args (as I've discovered: -check.v -check.f *and* -race and/or -cover never finds the -check.f)
<voidspace> dimitern: no args except package
<voidspace> (I was originally trying with -check.f - but even without it I'm not seeing my deliberately failing tests)
<voidspace> same with subnets test
<voidspace> what the hell is going on
<dimitern> voidspace, state.ConnSuite logs the beginning and end of SetUpTest btw - e.g. with -check.vv
<dimitern> voidspace, I found also -check.list useful
<voidspace> thanks
<voidspace> dimitern: soooo... changing state/package_test.go from the state package to the state_test package seems to have changed things
<voidspace> now I get build failures
<dimitern> voidspace, wow - and here's davecheney who changed that recently :)
<voidspace> dimitern: that change means that most state tests don't run
<dimitern> voidspace, it gets weirder - grep for MgoTestPackage in state/
<dimitern> voidspace, uh, no actually no surprises there
<voidspace> heh
<dimitern> voidspace, that's a bug right there - https://github.com/juju/juju/pull/3806/files#diff-d4970ae51fd6268f6325442698dbe02a
<voidspace> yeah, wrong package :-)
<voidspace> oops...
<dimitern> voidspace, that was 10 days ago!
<voidspace> it took me about an hour or so just now to track it down
<dimitern> I wonder how nobody found out
<voidspace> couldn't work out why my tests weren't running
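For background: gocheck suites only run if a *_test.go file in the package hooks them into go test, which is why a wrong package name in package_test.go can silently skip everything. This is the generic hookup pattern (a sketch; the real state package wires in coretesting.MgoTestPackage, per the grep above):

    package state_test

    import (
        stdtesting "testing"

        gc "gopkg.in/check.v1"
    )

    func TestPackage(t *stdtesting.T) {
        gc.TestingT(t)
    }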
<alexisb> happy belated birthday wwitzel3 !
<dimitern> I'm filing a bug
<voidspace> dimitern: thanks
<wwitzel3> alexisb: thanks :)
<voidspace> dimitern: frobware: meeting?
<mup> Bug #1522484 opened: state package tests no longer run since PR #3806 <blocker> <ci> <regression> <unit-tests> <juju-core:In Progress by dimitern> <https://launchpad.net/bugs/1522484>
<voidspace> dimitern: I'm going to run state tests on maas-spaces
<frobware> dimitern, voidspace, dooferlad: ok to rebase our maas-spaces when the state fix lands?
<dimitern> voidspace, sgmt
<voidspace> dimitern: I have a bunch of endpoint_bindings tests fail - and I can't tell if they're caused by my changes or not
<dimitern> sgtm even
<dooferlad> frobware: go for it
<dimitern> frobware, I'd rather deal with the conflicts than let more regressions slip by :)
<voidspace> dimitern: and it looks like that unique index on provider id is causing me issues
<dimitern> voidspace, I'll have a look there and propose a fix if needed
<voidspace> I add three spaces with empty provider ids and then AllSpaces only returns one space
<voidspace> dimitern: cool
<dimitern> damn :/
<voidspace> adding that check in AddSpace to see if silent failure is the cause of that
<voidspace> we'll see
<voidspace> if it is the problem I'll need to check we don't have the same problem with subnets which use the same pattern
<voidspace> maybe we have that problem but just haven't hit it because we always have a providerid
<dimitern> voidspace, I'm reading https://docs.mongodb.org/v2.4/core/index-sparse/ now - it seems unique+sparse might not be what we need actually
<voidspace> right
<voidspace> dimitern: it sounds exactly right
<voidspace> dimitern:
<voidspace> You can specify a sparse and unique index, that rejects documents that have duplicate values for a field, but allows multiple documents that omit that key.
<dimitern> voidspace, ok, then we should not use that index I think
<dimitern> voidspace, we can verify while adding that there's no existing space with the given non-empty provider id
<voidspace> dimitern: eh, it sounds exactly like what we want
<voidspace> hmmm...
<voidspace> dimitern: I don't have the failing tests on maas-spaces, so introduced by my changes
<dimitern> voidspace, AIUI you can have both doc{id:"foo", providerid:"bar"} and doc{id:"foo1"}, but then you can't have doc{id:"foo2"} as well
<voidspace> dimitern: see their example
<voidspace> at the bottom of that page
<voidspace> you can
<dimitern> voidspace, hmm well it sounds reasonable..
<dimitern> but I can't seem to make it work for NICs
<voidspace> it certainly *seems* like what we want
<voidspace> dimitern: are you using omitempty?
<dimitern> voidspace, yep
<voidspace> ok
<dimitern> ah, well.. perhaps it's because I'm over-complicating things as usual
<dimitern> voidspace, in cases where we need uniqueness on multiple fields, what does work fine and is easy to verify is a compound _id field, like a global key
<mup> Bug #1517258 changed: juju 1.24.7 precise: container failed to start and was destroyed <oil> <juju-core:Triaged> <https://launchpad.net/bugs/1517258>
<dimitern> but still coupled with individual fields having values used in the compound key - e.g. 8e5a492b-7631-409b-8134-7997678d731e:m#4/lxc/4#n#juju-public + EnvUUID + MachineId + NetworkName (on ports)
<voidspace> dimitern: http://stackoverflow.com/questions/28183109/dealing-with-mongodb-unique-sparse-compound-indexes
<voidspace> it seems like unique and sparse don't work together well for compound indices
<voidspace> so it's probably currently broken for subnetsC
<dimitern> voidspace, yeah, that's what I've (re)discovered I guess
<dimitern> voidspace, not necessarily, as both spaces and subnets + provider id act similarly; but it's definitely broken for network interfaces docs
<dimitern> (I mean there are no compound indices effectively, as env-uuid+field_name is always set)
<voidspace> ah, subnetsC isn't compound
<voidspace> dimitern: I was trying to make spacesC compound on env-uuid
<voidspace> I'll have to leave that as a TODO
<dimitern> frobware, voidspace, btw I see no failures in state/ on maas-spaces tip after changing package_test.go's package to state_test
<voidspace> dimitern: yeah, I tried it too and saw no failure
<dimitern> so rebase might wait
<voidspace> I think most of my failures were actually caused by the index problem
<voidspace> about to find out
<dimitern> voidspace, yeah, sorry for steering you down that rat hole :(
<voidspace> well, it's what we *should* do
<voidspace> not your fault mongodb is broken
<dimitern> :) cheers
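For reference, the sparse+unique index discussed above, expressed with mgo (a sketch; the collection and field names are illustrative): it rejects duplicate non-empty provider ids but allows any number of docs that omit the field entirely.

    import mgo "gopkg.in/mgo.v2"

    func ensureProviderIdIndex(spaces *mgo.Collection) error {
        return spaces.EnsureIndex(mgo.Index{
            Key:    []string{"providerid"},
            Unique: true,
            Sparse: true,
        })
    }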
<dimitern> voidspace, what I've just noticed is even weirder - I see exactly the same number of tests passing (1245) in state/ for almost the same time (~198s) both with and without the change in package_test.go
<dimitern> voidspace, when I run go test -check.f in state/
<natefinch> ericsnow: I made a slight modification of the apiClient code, to remove the need for the resourceAPIClient type.... let me know what you think: http://reviews.vapour.ws/r/3310/diff/#  (feel free to ignore the upload stuff in there, since it's still a WIP).  I know you made changes around that code too, but wanted to get your opinion of what I did.
<voidspace> that is odd
<dimitern> voidspace, however I suspect it's different for go test github.com/juju/juju/state -check.v
<natefinch> ericsnow: it also makes the function you have to write for every command completely trivial, which is nice.  Just a little adapter
<ericsnow> natefinch: looking
<natefinch> ericsnow: I haven't updated tests yet, obviously.
<dimitern> voidspace, same result with go test github.com/juju/juju/state -check.v and unpatched package_test
<dimitern> (albeit without any output)
<voidspace> dimitern: I wonder if we have some black box state tests then
<voidspace> and we need both
<voidspace> dimitern: e.g. compat_test.go is in the state package
<voidspace> dimitern: allwatcher_internal_test is as well
<dimitern> voidspace, yeah, same for the upgrades
<beisner> hi mgz, cherylj - is 1.25.2 avail in a ppa anywhere atm?
<cherylj> beisner: no, not yet
<dimitern> heh what do you know.. go test ./... -check.list almost managed to bring the quad core i7 to its knees
<voidspace> dimitern: PR updated, all those issues addressed by the way
<dimitern> voidspace, awesome! having a look now..
<voidspace> dimitern: small change, touches 25 files...
<dimitern> voidspace, does it still work if the spacesC and subnetsC index was defined as Key: []string{"env-uuid", "provider"}, Unique: true, Sparse: true?
<voidspace> dimitern: no
<voidspace> dimitern: causes failures
<voidspace> dimitern: the sparse is then ignored
<dimitern> voidspace, won't we have the issue where you can't have a space with no provider id in 2 different environments?
<dimitern> voidspace, or with the same provider id for that matter
<voidspace> dimitern: yes
<voidspace> dimitern: but we already have that with subnets and a TODO to fix it...
<voidspace> not good enough?
<dimitern> voidspace, ok, fair enough
<voidspace> how about we add a card for fixing both to the iteration backlog
<dimitern> voidspace, just checking the options ;)
<voidspace> cool
<dimitern> voidspace, +1
<dimitern> voidspace, lgtm
<voidspace> thanks
<voidspace> I thought that would take an hour at the start of the day
<voidspace> took a bit longer...
<dimitern> voidspace, gotta love mongo, right? :D
<voidspace> right, back to the api
<voidspace> dimitern: yeah... and our testing infrastructure
<dimitern> ok time to call it a day
<voidspace> dimitern: o/
<dimitern> voidspace, :)
<frobware> voidspace, dooferlad, dimitern: Strike 2 - http://reviews.vapour.ws/r/3311/ - this is the correct branch for the change I reverted earlier.
<voidspace> frobware: as it's just a forward port, ShipIt!
<frobware> voidspace, ty
<frobware> rick_h_, hadn't forgotten your HA question - just got a little sidetracked with a broken desktop...
<rick_h_> frobware: I forgot the question now heh
<frobware> rick_h_, HA for OpenStack. Reference platform is 28 nodes. Need to look at the charm to see if that's always the case.
<frobware> rick_h_, if so think I'll enter the h/w business. :)
<rick_h_> frobware: :/ ok I think this was more can we do HA on Juju vs openstack on guimaas
<frobware> rick_h_, my 28 came from: https://wiki.ubuntu.com/ServerTeam/OpenStackHA
<rick_h_> frobware: right, I don't think I was asking about openstack HA
<rick_h_> frobware: but "can I deploy openstack (6? 8?) nodes and have Juju be HA there
<frobware> rick_h_, so not being at the beginning of the call I think I've missed the context.
<rick_h_> frobware: to test upgrades of that deployment
<frobware> rick_h_, ok gotcha
<rick_h_> frobware: how small can I get a functioning OS and have Juju HA state servers is what I'd like to see
<rick_h_> frobware: and then investigate growth of the test case from there
<mup> Bug #1522544 opened: workers restart endlessly <juju-core:New> <https://launchpad.net/bugs/1522544>
<thumper> morning folks
<natefinch> morning thumper.  Happy friday
<lazypower> o/ thumper
<natefinch> wwitzel3: ping me via email when/if your branch is ready and I'll see what I can do to integrate with it.
<alexisb> thumper, when do you head to the airport?
<thumper> alexisb: tomorrow lunch time
<alexisb> thumper, aw ok
<natefinch> back later
<thumper> ok, I have a fix for this 1.21 upgrade step
<thumper> why do we have no reviewers today?
<thumper> http://reviews.vapour.ws/r/3313/diff/#
<thumper> someone plz
<thumper> cherylj, waigani, or davecheney? ^^^
<cherylj> thumper: tal now
<thumper> ta
<thumper> ok, time for coffee... again
<wallyworld> axw: perrito666: a minute late
<perrito666> k
#juju-dev 2015-12-04
<cherylj> wallyworld: ping?
<wallyworld> cherylj: hey give me a sec
<cherylj> np
<wallyworld> cherylj: hey. btw do you have a link to the 2.0 todo list that has been started for the sprint?
<cherylj> wallyworld: I don't know that one has been started yet
<wallyworld> ah, ok, np. next week then
<cherylj> wallyworld: I know that when you guys were working the bootstack upgrade issues, you came across the problem in this bug:  https://bugs.launchpad.net/juju-core/+bug/1459033
<mup> Bug #1459033: Invalid binary version, version "1.23.3--amd64" or "1.23.3--armhf" <constraints> <maas-provider> <juju-core:Triaged> <https://launchpad.net/bugs/1459033>
<wallyworld> yes
<cherylj> but I cannot find / remember what the cause of that was determined to be
<wallyworld> they had bad data in their tools metadata collection in state
<wallyworld> at some point, juju i think must have returned "" for an unknown series lookup
<wallyworld> and so when tools were being imported, wily or xenial tools got imported for that bad version
<cherylj> did you have to manually recover the db?
<wallyworld> yeah, i went in and deleted the tools metadata
<wallyworld> this causes juju to go out to streams.canonical.com
<wallyworld> to fetch tools instead of using any cached values
<wallyworld> and the newly downloaded tools are then cached again
<cherylj> okay, cool.  Thanks!
<wallyworld> but hard to reproduce
<wallyworld> i don't know how old that "" series issue is
<cherylj> I can take a look
<wallyworld> there's a lot of moving parts
<wallyworld> may be hard to pin down a definitive "this is it" release
<wallyworld> it should be an upgrade check
<wallyworld> the upgrader checks for bad metadata and deletes it
<wallyworld> axw: here's the pr http://reviews.vapour.ws/r/3317/
<axw> wallyworld: ta, looking
<axw> wallyworld: reviewed
<wallyworld> axw: good pickups on the other issues; i've explained one point, hopefully it makes sense
<axw> wallyworld: ok, looking again
<axw> wallyworld: alternatively just "ServiceOffers" if URL is the canonical identifier
<axw> wallyworld: your call tho
<wallyworld> that might work
<wallyworld> shit , just saw a typo
<wallyworld> formatOfferedServiceDetailss
<wallyworld> will fix that
<axw> wallyworld: responded
<wallyworld> looking
<axw> wallyworld: going to go have lunch and start packing, will check back in a little while
<wallyworld> ok, np, i'll push changes
<axw> wallyworld: under what circumstances? re  "even if the query(filter) succeeds and the Error above is nil, converting the data from a particular query result item may have an error."
<axw> wallyworld: just wondering if it's actually worthwhile departing from the conventional all-or-nothing per result
<wallyworld_> axw: it gets the query result and then does things like look up the service and/or charm details (can't recall exactly)
<wallyworld_> so that op per record could fail
<axw> wallyworld_: does it make sense to include that in the result at all then? what're you going to do with that? you didn't specifically ask for an item, you just said "give me all the things that this filter matches"
<wallyworld_> the doc in offered services collection does match. but composing the result errors
<wallyworld_> we could ignore such errors i guess
<axw> wallyworld_: yep... why would the user care about that?
<wallyworld_> and pretend the item doesn't exist
<wallyworld_> i'll rework it
<axw> wallyworld_: it's just not clear to me how the user can action on that error
<axw> wallyworld_: it's not due to an error in input, it's a server-side error
<axw> wallyworld_: if you like, defer to a follow up
<wallyworld_> as person making an offer, i would want to know if one of my offers went bad somehow
<wallyworld_> let's land now and iterate next week? there's a fair bit of other cruft to fix also
<wallyworld_> this is a good start though
<axw> wallyworld_: then I think we need to move the error inside the details, rather than outside the details. that would be more like the errors we have in machine status, I think
<axw> wallyworld_: sure
<wallyworld_> hmm, ok, i could move inside
<wallyworld_> i'll land as is and we can think a bit. i need to go pack etc
<axw> wallyworld_: do it later, I'll take a once over now
<wallyworld_> ok
<axw> wallyworld_: couple small fixes please, then shipit
<wallyworld> axw: thanks. my eyes get sore looking at all those params structs. they all start to merge into a big mess
<axw> wallyworld: :)
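Purely illustrative of the two result shapes weighed above (not the real params structs): the per-item error can sit beside the details, or inside them, the way machine status carries its errors.

    type Error struct {
        Message string
    }

    type OfferedServiceDetails struct {
        ServiceURL string
        Error      *Error // error inside the details, per axw's suggestion
    }

    type OfferedServiceResult struct {
        Result *OfferedServiceDetails
        Error  *Error // or outside, per the original proposal
    }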
<TheMue> Heya, old team. Next week in OAK/SFO?
<voidspace> dimitern: standup?
<dimitern> voidspace, omw - having some HO issues
<mup> Bug #1522484 changed: state package tests no longer run since PR #3806 <blocker> <ci> <regression> <unit-tests> <juju-core:Fix Released by dimitern> <https://launchpad.net/bugs/1522484>
<mup> Bug #1522484 opened: state package tests no longer run since PR #3806 <blocker> <ci> <regression> <unit-tests> <juju-core:Fix Released by dimitern> <https://launchpad.net/bugs/1522484>
<mup> Bug #1522484 changed: state package tests no longer run since PR #3806 <blocker> <ci> <regression> <unit-tests> <juju-core:Fix Released by dimitern> <https://launchpad.net/bugs/1522484>
<fwereade> dimitern, axw, rogpeppe: you have all contributed to this logic, so you may have insight: ISTM that the broken/closed logic in api/apiclient.go is really rather likely to deadlock on Close(); which would match a bug I've seen; any comments/recollections?
<dimitern> fwereade, I don't recall much around that code, i have to do a refresher
<fwereade> dimitern, axw, rogpeppe2: and ISTM that http://paste.ubuntu.com/13665970/ would fix it
<dimitern> fwereade, that looks like it should have been like this to begin with :)
<fwereade> dimitern, it's just about the interaction of 2 channels (s.closed/s.broken) and how the heartbeatMonitor goroutine doesn't always close broken; but Close always waits for broken to be closed
<rogpeppe2> fwereade: looking
<rogpeppe2> fwereade: what's the difference between those two pieces of code?
<rogpeppe2> fwereade: i ca't see that using defer makes a difference
<fwereade> rogpeppe2, ...goddammit
<fwereade> rogpeppe2, this is why it's a good idea to talk to you about these things ;)
<rogpeppe2> fwereade: :)
<fwereade> rogpeppe2, hadn't picked up that we always ping until we fail
 * fwereade wonders if a ping could hang somehow...
<rogpeppe2> uiteam: support removing multiple entities at once: https://github.com/CanonicalLtd/charmstore-client/pull/150
<dimitern> voidspace, ping
<frobware> dooferlad, your python observations on add-juju-bridge.py. I think I'm going to land on master as-is because I'm going to work on the multiple bridge / multiple NICs straight away and I can a) fix them there and b) don't want to invalidate any of the manual testing. Ok?
<dooferlad> frobware: sure
<dimitern> frobware, voidspace, I found out why David is having that issue - I'm about to propose a PR that fixes it by allowing Subnets() to be called without an instanceId (returning all subnets)
<frobware> dimitern, great & thanks.
<voidspace> dimitern: pong
<voidspace> dimitern: without an instanceId... interesting
<dimitern> voidspace, yes - like "gimme all subnets there are"
<voidspace> dimitern: sure, what will use that?
<dimitern> voidspace, with the new api that's easy, and in fact already implemented
<dimitern> voidspace, in Spaces()
<voidspace> dimitern: for maas, yes
<dimitern> voidspace, yes, only for maas and until we have import in place
<dimitern> voidspace, this will allow David et al. to try spaces in constraints
<voidspace> dimitern: ok, cool
<rogpeppe2> uiteam: now updated to allow a -n flag: https://github.com/CanonicalLtd/charmstore-client/pull/150
 * rogpeppe2 lunches
<voidspace> dimitern: frobware: dooferlad: wife ill in bed, I'm looking after the boy
<voidspace> hopefully she'll be rested and up soon
<frobware> voidspace, ack
<dimitern> voidspace, sure - speedy recovery!
<dooferlad> voidspace: hope things improve soon :-(
<frobware> dimitern, dooferlad, voidspace: PTAL @ http://reviews.vapour.ws/r/3319/
<dimitern> frobware, looking
<dimitern> frobware, LGTM
<frobware> dimitern, once that lands on master I plan to rebase maas-spaces as I can do the multi nic / multi bridge with that change in place
<dimitern> frobware, great
 * dimitern steps out for ~1h
 * frobware lunches
<fwereade> rogpeppe2, re that heartbeatMonitor
<rogpeppe2> fwereade: yes?
<fwereade> rogpeppe2, the only reason I can see for it to block on Close is if Ping somehow hangs; and while I can't explain exactly why that would happen, it doesn't seem *intrinsically* implausible if the state server is misbehaving somehow
<fwereade> rogpeppe2, (I'm mainly asking you because I think you wrote v1? anyway)
<rogpeppe> fwereade: yeah, i'm responsible for the design of most of that code, i think
<rogpeppe> fwereade: have you got a reproducible test case?
<fwereade> rogpeppe, anything obviously insane about timing the ping out and stopping? AFAICS none of the clients need to block until it's *actually* stopped?
<fwereade> rogpeppe, sadly not
<fwereade> rogpeppe, I am experimenting with messing around with connectioons and it's been flawless for me so far
<rogpeppe> fwereade: is there a bug report?
<fwereade> rogpeppe, but there's a bug open -- and that includes a panicking state server somewhere -- yeah, https://bugs.launchpad.net/juju-core/+bug/1522544
<mup> Bug #1522544: workers restart endlessly <juju-core:Triaged by fwereade> <https://launchpad.net/bugs/1522544>
<rogpeppe> fwereade: how would a hung-up heartbeatMonitor cause endless restarts?
<rogpeppe> fwereade: i'd've thought it would have the opposite effect
<rogpeppe> fwereade: i.e. it *can't* restart when needed
<fwereade> rogpeppe, the worker that wraps it never finishes and is never restarted; so the conn resource doesn't get replaced, and everybody keeps gamely trying to connect with the old one
<fwereade> rogpeppe, after all, it might just have been some transient error ;)
<rogpeppe> fwereade: hmm
<fwereade> rogpeppe, (but ofc they all just start up, try to do something, fall down)
<fwereade> rogpeppe, anyway
<rogpeppe> fwereade: before putting a timeout on the ping, i would make sure that that actually is happening
<rogpeppe> fwereade: for example by getting a *full* stack trace of all goroutines when this is happening
<rogpeppe> fwereade: if the connection to the state machine has gone, the Ping *should* exit
<fwereade> rogpeppe, yeah, I'm not worried about that
<rogpeppe> fwereade: if it doesn't then it may be a bug in the rpc package
<fwereade> rogpeppe, I'm worried about the worst case of what a confused state server might induce I guess
<fwereade> rogpeppe, ehh, just a thought -- the interesting thing I suppose is that the same failure could have happened silently before; it'd just manifest as an agent blocked for some reason, and when there's no clear cause it's all too easy to bounce it and move on
<rogpeppe> fwereade: all the client requests *should* return regardless of the server state
<rogpeppe> fwereade: because we close the connection
<fwereade> rogpeppe, do you recall the rationale for waiting for a failed ping, rather than just exiting on close? I don't think a successful ping-failure implies anything useful about whether any other clients are using the conn
<rogpeppe> fwereade: and if the connection's closed the response reader should close
<rogpeppe> fwereade: no, i think it would be just fine to return after reading on s.closed
<rogpeppe> fwereade: i don't think that'll make any difference though
<rogpeppe> fwereade: because sending an API request after the client is closed will immediately return an error without actually doing anything
<rogpeppe> fwereade: actually i do see at least one rationale
<rogpeppe> fwereade: which is that State.Close waits on s.broken before returning
<fwereade> rogpeppe, I'm thinking of scenarios like "a panicking apiserver has somehow deadlocked itself mid-rpc"
<rogpeppe> fwereade: thus ensuring that the heartbeat monitor is cleaned up
<rogpeppe> fwereade: even then, it should cause a client to hang up
<rogpeppe> fwereade: i mean, feel free to experiment
<rogpeppe> fwereade: but i think that if you try making an rpc server that never replies on any request, you should still be able to close the client ok
<rogpeppe> fwereade: (if not it's a bug)
<fwereade> rogpeppe, yeah, I think you're right
<fwereade> rogpeppe, cheers :)
<rogpeppe> fwereade: np
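A rough sketch of the closed/broken interaction under discussion (illustrative only, not the actual api/apiclient.go code; pingPeriod is a placeholder): the heartbeat goroutine closes broken when it exits, and Close waits on broken after signalling closed, so a Ping that never returns is exactly where a hang would show up.

    import "time"

    const pingPeriod = 3 * time.Second // placeholder interval

    type conn struct {
        closed chan struct{} // closed by Close
        broken chan struct{} // closed by heartbeatMonitor when it exits
    }

    // heartbeatMonitor pings until a ping fails or Close is called, then
    // closes broken so Close can return.
    func (c *conn) heartbeatMonitor(ping func() error) {
        defer close(c.broken)
        for {
            if err := ping(); err != nil {
                return
            }
            select {
            case <-time.After(pingPeriod):
            case <-c.closed:
                return
            }
        }
    }

    // Close signals the monitor and then waits for it to finish; if a ping
    // hangs, this wait is where the deadlock manifests.
    func (c *conn) Close() error {
        close(c.closed)
        <-c.broken
        return nil
    }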
<voidspace> frobware: you rebased yet?
<mup> Bug #1522861 opened: Panic in ClenupOldMetrics <juju-core:New> <https://launchpad.net/bugs/1522861>
<frobware> voidspace, just about to. make check works ok on maas-spaces.
<frobware> voidspace, I forget (doh!) what we agreed for rebasing. Just push, or PR and push?
<frobware> dimitern, ^^
<voidspace> frobware: I'd say just push
<voidspace> frobware: you can't merge the PR anyway, so no point (IMO)
<frobware> voidspace, exactly.
<frobware> voidspace, couldn't remember what I did last time...
<voidspace> you PR'd then pushed
<frobware> voidspace, for the record, diff against master looks like: http://pastebin.ubuntu.com/13669884/
<dimitern> frobware, +1 for just push
<frobware> dimitern, dooferlad, voidspace: rebased. http://pastebin.ubuntu.com/13669966/
<dooferlad> frobware: cool, thanks!
<dimitern> frobware, sweet!
<natefinch> mgz:  you around?
<mgz> natefinch: yo
<natefinch> mgz: something wonky with my feature branch here: http://juju-ci.vapour.ws:8080/job/github-merge-juju-charm/20/console
<natefinch> + /var/lib/jenkins/juju-ci-tools/git_gate.py --project gopkg.in/juju/charm.nate-minver
<natefinch> should be --project gopkg.in/juju/charm.v6-unstable
<natefinch> probably the first time we've ever tried a feature branch on a gopkg.in library
<natefinch> mgz: not really sure how we're supposed to make that work.  I can understand what the CI code is doing... I don't know how to tell it "pretend this is charm.v6-unstable"
<mgz> yeah, I don't think gopkg.in and feature branches are compatible
<mgz> well, they are, it's just via the go mechanism of rewriting all your imports
<mgz> you can't name a branch something other than v6-unstable then import it as that via gopkg.in
<natefinch> well, I mean, it works fine using godeps from juju/juju.... because godeps just sets the right commit... but yeah, I guess there's no real way to do that solely from the gopkg.in branch directly
<natefinch> like, you can do go get gopkg.in/juju/charm.v6-unstable and then git checkout minver and it'll work fine
<natefinch> er nate-minver
<mgz> you can get git to give you any rev it can find in the repo
<natefinch> right
<mgz> that doesn't mean it's in any relevent history
<natefinch> I think we need feature branch support in CI.  Otherwise we get into the case where a change to one of these versioned branches breaks juju/juju ... just like that email I sent a week ago or so
<mgz> we can't really hack around the way gopkg.in works
<mgz> ci on github.com/ stuff is fine
<mgz> but we'd need to actually sed imports to make gopkg.in work I think, and that's... not very productive
<natefinch> sure you can.  I just said how: You do go get gopkg.in/juju/charm.v6-unstable and then git checkout nate-minver
<natefinch> (and then godeps dependencies.tsv)
<natefinch> that's how I do development on my feature branch
<natefinch> let me double check that that actually works from scratch
<natefinch> yep, totally works
<mgz> hm, the simple case works with just mv on the dir yeah
<natefinch> yeah, you just need the code from the branch to be in the directory go expects
<natefinch> so the current CI code could work if we added a simple mv statement... though again, it has to understand what the code "expects" to be called, which now has to live outside the branch name.
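
The manual workaround natefinch describes (go get the package under its gopkg.in name, then check out the feature branch in place so the code still sits at the import path go expects) could be scripted; a rough sketch, assuming a standard GOPATH layout and that the nate-minver branch lives in the same repository:

```go
// Sketch of the manual workaround described above: fetch the package under
// its gopkg.in name, then check out the feature branch in place. Error
// handling is minimal and the branch name is the one from this discussion.
package main

import (
	"log"
	"os"
	"os/exec"
	"path/filepath"
)

func run(dir, name string, args ...string) {
	cmd := exec.Command(name, args...)
	cmd.Dir = dir
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("%s %v: %v", name, args, err)
	}
}

func main() {
	gopath := os.Getenv("GOPATH") // assumes GOPATH is set
	run("", "go", "get", "-d", "gopkg.in/juju/charm.v6-unstable")
	pkgDir := filepath.Join(gopath, "src", "gopkg.in", "juju", "charm.v6-unstable")
	run(pkgDir, "git", "checkout", "nate-minver")
}
```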
<mgz> natefinch: can we just naming scheme it?
<mgz> how many dots is too many dots?
<natefinch> mgz: whatever is easiest for you guys. *I* don't care what the name of my branch is
<natefinch> mgz: I'm pretty sure that gopkg.in only cares about that first dot
<natefinch> mgz: we could do gopkg.in/juju/charm.v6-unstable.nate-minver
<natefinch> lol almost semantic versioning at that point
<mgz> charm.v6-unstable.featurename seems somewhat workable... right
<natefinch> mgz: would you like me to poke at the CI code to get something like that working? I need it to get my min juju version stuff tested and landed
<mgz> lp:juju-ci-tools git_gate.py I'd add a flag to do magic dot handling
<natefinch> good lord bzr is slow
<natefinch> mgz: thanks, I'll look at it
<mgz> you should just be able to mangle the directory variable in go_test()
<natefinch> mgz: can you rename my branch?
<mgz> natefinch: I can
<natefinch> lol, writing python just broke sublime
<natefinch> bbiab
<dimitern> frobware, voidspace, dooferlad, any of you guys still around?
<frobware> dimitern, yep
<dimitern> frobware, before I go today I need a review on the PR I'm proposing to unblock David
<dimitern> frobware, doing a final live test on maas 1.9 now and I'll publish it
<frobware> dimitern, OK
<frobware> dimitern, it's on the maas-spaces branch so we're pretty contained
<dimitern> frobware, yep
<frobware> dimitern, the alternative is to just email a patch to David if you're unsure that you want to land it.
<dimitern> frobware, oh boy :/ I've just discovered it won't work because of yet another maas bug
<dimitern> frobware, http://paste.ubuntu.com/13673162/
<dimitern> frobware, well, I can still propose it and land it, as it's useful but it won't unblock David until this is resolved
<frobware> dimitern, why does that fail?
<dimitern> frobware, no idea - looking at the maas logs now
<dimitern> frobware, PR: http://reviews.vapour.ws/r/3322/
<dimitern> frobware, I don't know why it includes already merged commits though :/
<dimitern> my changes are only in provider/maas/environ.go and environ_whitebox_test.go
<frobware> dimitern, is that because of my rebase?
<dimitern> frobware, well, I did a rebase of my origin/maas-spaces onto upstream/maas-spaces
<dimitern> frobware, and then rebased the last 2 commits on top of that
<dimitern> anyway, I need to start packing..
<frobware> dimitern, your commit touches a lot of files
<frobware> dimitern, that's the single patch for David?
<dimitern> frobware, most of the changed files come from voidspace's ProviderId for subnets/spaces
<frobware> dimitern, are you merging from his branch?
<dimitern> frobware, I guess I'll redo it cleanly starting from fresh upstream/maas-spaces
<dimitern> frobware, no, I was rebasing.. well I did something wrong obviously
<voidspace> dimitern: I'm here, sort of
<frobware> dimitern, the review is split over two pages for me
<dimitern> voidspace, false alarm it seems :)
<dimitern> frobware, yeah, I'll redo it cleanly, but not now
<voidspace> dimitern: how come that PR has my "already landed" changes in it
<dimitern> voidspace, a git late-friday-evening mystery :)
<voidspace> dimitern: heh
#juju-dev 2015-12-06
<davecheney> ping anyone around who can review https://github.com/juju/juju/pull/3899
<davecheney> thanks
#juju-dev 2016-12-05
<perrito666> morning everyone
<redir_sprint> morning perrito666
<katco> mgz: https://github.com/juju/juju/pull/6632
<katco> mgz: http://juju-ci.vapour.ws:8080/job/github-merge-juju/9749/artifact/artifacts/lxd-err.log
<katco> mgz: http://juju-ci.vapour.ws:8080/job/github-merge-juju/9750/artifact/artifacts/lxd-err.log
<katco> mgz: (same underlying error)
<katco> mgz: http://juju-ci.vapour.ws:8080/job/github-merge-juju/9743/artifact/artifacts/lxd-err.log
<sinzui> katco: mgz: anastasiamac: natefinch https://docs.google.com/document/d/1PMOOW7DDYGtL_JyJST1i4RtXh36Wws5v1Nic1iKrL5U/edit#
<katco> sinzui: ta
<hoenir> could someone take a look on https://github.com/juju/juju/pull/6523 ?
<hoenir> please.
<rick_h> hoenir: the team is sprinting this week. I'll see if I can get someone to look between sessions.
<rick_h> hoenir: sorry for the delays
<frobware> sinzui, mgz: is our production go version 1.6.3?
<katco> hoenir: o/ sorry for the delay
<mup> Bug #1647329 opened: [1.25.8] SSL failure during deployer bundle deployment <cdo-qa> <juju-core:New> <https://launchpad.net/bugs/1647329>
<mgz> frobware: see for instance http://reports.vapour.ws/releases/4625/job/build-binary-xenial-amd64/attempt/1172
<mgz> frobware: Get:3 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 golang-1.6-go amd64 1.6.2-0ubuntu5~16.04 [20.2 MB]
<mup> Bug #1647329 changed: [1.25.8] SSL failure during deployer bundle deployment <cdo-qa> <juju-core:New> <https://launchpad.net/bugs/1647329>
<frobware> voidspace: voila! http://178.62.20.154/~aim/go1.6.3_1.0.0-15_amd64.deb
<frobware> sinzui: does it happen for every CI run now?
<mup> Bug #1646524 changed: Excessive logging in juju-db <canonical-bootstack> <juju-core:New> <https://launchpad.net/bugs/1646524>
<mup> Bug #1646863 changed: Cannot ssh to unit when Docker is setup on both client and unit machine <juju:New> <https://launchpad.net/bugs/1646863>
<mup> Bug #1494661 opened: Old logs are not compressed or removed <canonical-bootstack> <uosci> <juju:Triaged> <juju-core:Triaged> <juju-core 1.25:Triaged> <https://launchpad.net/bugs/1494661>
<frobware> voidspace, macgreagoir: PTAL @ https://github.com/juju/juju/pull/6654
<frobware> voidspace, macgreagoir: because I've reached all bridging quotas. :)
<macgreagoir> frobware: -1 for adding whitespace! ;-)
<beisner> hi rick_h, if i want to query the charm store for a list of currently-supported series, do you have magic handy? :-)
<thumper> axw: I don't think the bot accepts anything but $$merge$$ now
<thumper> I stand corrected
<rick_h> beisner: hmm, no I can't think of any api for that or one we could abuse for it.
<beisner> rick_h, trying to get ahead of https://github.com/juju/charmstore/issues/695 in our gate to catch future travelers who propose updated metadata in openstack charms.
<rick_h> beisner: looking
<rick_h> beisner: yea, I think the best thing might just be adding an api request to the bug to be added. It'll dump out what it's got defined.
<beisner> rick_h, cool.  i don't mind parsing a json blob if there's somewhere that spits that out
<rick_h> beisner: yea, sensible enough
<mup> Bug #1646524 opened: Excessive logging in juju-db <canonical-bootstack> <juju-core:New> <https://launchpad.net/bugs/1646524>
<mup> Bug #1646524 changed: Excessive logging in juju-db <canonical-bootstack> <juju-core:Won't Fix> <https://launchpad.net/bugs/1646524>
<mgz> menn0: is this a thing we care about?
<mgz> 2016-12-02 06:54:00 ERROR juju.worker.dependency engine.go:547 "migration-master" manifold worker returned unexpected error: opening target log stream: cannot get discharge from "https://10.18.70.186:17070/auth": cannot start interactive session: interaction required but not possible
<mup> Bug #1646524 opened: Excessive logging in juju-db <canonical-bootstack> <juju-core:New> <https://launchpad.net/bugs/1646524>
<menn0> mgz: yes that sounds like something we should care about. file a bug please.
<menn0> mgz: it may be to do with babbageclunk's recent work
<menn0> mgz: or it might be a more generic issue
<mup> Bug #1646524 changed: Excessive logging in juju-db <canonical-bootstack> <juju-core:Won't Fix> <https://launchpad.net/bugs/1646524>
<perrito666> axw_: http://www.gstatic.com/tv/thumb/movieposters/13779/p13779_p_v8_ah.jpg
<beisner> hi rick_h - raised this as a separate issue/request tracker @ https://github.com/juju/charmstore/issues/699
#juju-dev 2016-12-06
<mup> Bug #1510787 opened: juju upgrade-charm errors when it shouldn't <juju:Triaged> <juju-core:New> <postgresql (Juju Charms Collection):Won't Fix> <https://launchpad.net/bugs/1510787>
<mup> Bug #1510787 changed: juju upgrade-charm errors when it shouldn't <juju:Triaged> <juju-core:New> <postgresql (Juju Charms Collection):Won't Fix> <https://launchpad.net/bugs/1510787>
<mup> Bug #1510787 opened: juju upgrade-charm errors when it shouldn't <canonical-is> <juju:Triaged> <juju-core:New> <postgresql (Juju Charms Collection):Won't Fix> <https://launchpad.net/bugs/1510787>
<hoenir> good morning guys and gals
<hoenir> what's the shizzle my nizzle?
<perrito666> ??
<hoenir> :))
<hoenir> just joking
<hoenir> around.
<mgz> natefinch: (and others interested) for example, a failure:
<mgz> http://reports.vapour.ws/releases/4628/job/run-unit-tests-race/attempt/2113
<babbageclunk> anastasiamac: ha, thanks!
<anastasiamac> babbageclunk: ha? :D
<redir> alexisb: https://github.com/juju/juju/wiki/Debugging-Races
<mup> Bug #1647675 opened: Tab completion doesn't work with Juju 2.0 <juju-core:New> <https://launchpad.net/bugs/1647675>
<mup> Bug #1647675 changed: Tab completion doesn't work with Juju 2.0 <juju:Invalid> <https://launchpad.net/bugs/1647675>
<thumper> perrito666:  https://github.com/juju/juju/pull/6659
<voidspace> macgreagoir: ping
<voidspace> jam: https://pastebin.canonical.com/172627/
<voidspace> macgreagoir: please move your  network-get document to a google doc so I can review it!
<voidspace> macgreagoir: hard to review a pastebin
<voidspace> macgreagoir: also could you add  a header explaining the intent of the doc
<voidspace> macgreagoir: it's not clear to me if this a description of the way things are or the way things should be (it looks like it's doing both- I think!)
<macgreagoir> voidspace: There's a google doc linked to the card as an external link already. As per note, pastebin link is just because I find it easier to read, and I thought others might too.
<macgreagoir> voidspace: I hope it describes the way things are now, and begins to note some possible improvements.
<voidspace> macgreagoir: ah! I missed that, thanks.
<macgreagoir> nw, sorry it wasn't clearer!
<voidspace> macgreagoir: a few comments on the google doc
<voidspace> macgreagoir: I'm not trying to be arsey, I'm just struggling to understand the purpose of this doc
<macgreagoir> Cheers
<voidspace> macgreagoir: so feel free to be as blunt as you want in replies to my comments :-)
<macgreagoir> I will apply all my usual charm!
<voidspace> macgreagoir: I sat down with rick_h and we read through it together
<voidspace> macgreagoir: I understand the purpose of the document now :-)
<macgreagoir> voidspace: Cheers, I see your comments there. I'll take a look and respond/update.
<rick_h> macgreagoir: can we get more real examples of "command issued, output received" ? something more like if you were to build help/spec for the feature as it works currently?
<rick_h> macgreagoir: very helpful with the notes and comments being analyzed
<macgreagoir> rick_h: Will do
<mgz> jam: https://github.com/juju/juju/pull/6663
#juju-dev 2016-12-07
<mup> Bug #1647897 opened: Juju bootstrap proxy support <juju-core:New> <https://launchpad.net/bugs/1647897>
<gsamfira> hello folks. We have about 120 nodes constantly being deployed and redeployed by juju in a CI
<gsamfira> this environment has been online for a while now
<gsamfira> as a result, some collections have grown quite a bit
<gsamfira> one of them is: presence.beings
<gsamfira> juju:PRIMARY> db.presence.beings.count()
<gsamfira> 1646080
<gsamfira> the other was logs.logs (which had about 14,000,000 entries)
<perrito666> gsamfira: hey
<gsamfira> hey perrito666
<perrito666> gsamfira: logs should be rotated, something is wrong there
<gsamfira> I don't think juju got a chance to rotate
<gsamfira> the db was under huge load
<perrito666> gsamfira: but the rotation is done by juju
<perrito666> and by rotation I mean just deletion
<gsamfira> mostly because of a bunch of queries against presence.beings
<gsamfira> which had no index on model-uuid
<gsamfira> and did a COLLSCAN for every query
<perrito666> gsamfira: this is 2.x?
<gsamfira> I can imagine. But juju kind of stopped working, erroring out with i/o error while talking to mongo
<gsamfira> 2.0.1
<gsamfira> i/o error and i/o timeout
<gsamfira> I had to drop all connections to the state machine port, go into the database and create an index in presence.beings
<gsamfira> db.presence.beings.createIndex({"model-uuid": 1})
<gsamfira> also on txns
<gsamfira> db.txns.createIndex({"s": 1})
<gsamfira> just to get rid of a lot of spam in syslog about COLLSCANs
<perrito666> thumper: babbageclunk any of you might be interested in this?
<gsamfira> also dropped all logs and saved 1.3 GB of disk space
<perrito666> gsamfira:  lol
<thumper> o/
<perrito666> so, for how long has this been running?
<gsamfira> the state machine is a 16 CPU core, 32 GB of RAM VM hosted on RAID10 10kRPM SAS disks
<thumper> gsamfira: logs is a capped collection so size bound
<thumper> presence is bollocks
<mup> Bug #1647897 changed: Juju bootstrap proxy support <juju-core:Invalid> <https://launchpad.net/bugs/1647897>
<thumper> and grows forever
<gsamfira> also might be worth mentioning we have enabled HA on this particular setup
<gsamfira> thumper: ouch
<gsamfira> 2 low hanging fruit we can implement are simply creating indexes on stuff we query
<gsamfira> especially if we query those frequently
<gsamfira> the environment has been up for a couple of months I think
<gsamfira> lemme check
<gsamfira> since october. Roughly 2 months
<gsamfira> perrito666: ^
<perrito666> gsamfira: tx that is a nice point of data
<gsamfira> http://paste.ubuntu.com/23592735/ <-- this might also be of interest
<gsamfira> this is a CI environment, a lot of units get torn down and spun up again
<gsamfira> so there is a lot of traffic
<gsamfira> perrito666:  juju:PRIMARY> db.statuseshistory.count()
<gsamfira> 1810513
<gsamfira> so you get an idea
<gsamfira> :)
<perrito666> gsamfira: that is quite a reasonable size for status :) because its also capped but that one seems to be working
<gsamfira> yup. The environment seems stable again after creating the indexes and cleaning the logs
<perrito666> gsamfira: that is good to know :) a quick workaround and low hanging fruit all together
<gsamfira> I'll create a patch for the indexes later this week
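
For reference, creating those same indexes from Go with mgo (the driver juju uses) would look roughly like the sketch below; the collection and key names come straight from the conversation, while the dial string and database name are assumptions.

```go
// Sketch of ensuring the indexes discussed above via mgo. The collection and
// key names come from the conversation; session handling is minimal and not
// how juju's schema code actually wires this up.
package main

import (
	"log"

	"gopkg.in/mgo.v2"
)

func ensureIndexes(db *mgo.Database) error {
	// presence.beings is queried by model-uuid, so index that field.
	if err := db.C("presence.beings").EnsureIndex(mgo.Index{
		Key: []string{"model-uuid"},
	}); err != nil {
		return err
	}
	// txns is queried by state ("s"), which otherwise forces a COLLSCAN.
	return db.C("txns").EnsureIndex(mgo.Index{Key: []string{"s"}})
}

func main() {
	session, err := mgo.Dial("localhost:37017") // dial string is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
	if err := ensureIndexes(session.DB("juju")); err != nil {
		log.Fatal(err)
	}
}
```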
<mgz> voidspace: http://streams.canonical.com/juju/tools/agent/2.0.2/
<mup> Bug #1357760 changed: ensure-availability (aka HA) should work with manual provider <cloud-installer> <ha> <landscape> <manual-provider> <manual-story> <juju:Triaged> <juju-core:Won't Fix> <https://launchpad.net/bugs/1357760>
<mup> Bug #1493058 changed: ensure-availability fails on GCE <docteam> <enable-ha> <gce-provider> <ha> <jujuqa> <juju:Invalid by thumper> <juju-core:Won't Fix> <juju-core 1.24:Won't Fix> <juju-core 1.25:Won't Fix> <https://launchpad.net/bugs/1493058>
<mup> Bug #1512569 changed: UniterSuite.TestRebootNowKillsHook fails with: uniter still alive <ci> <test-failure> <unit-tests> <juju:Triaged> <juju-core:Won't Fix> <https://launchpad.net/bugs/1512569>
<mgz> alexisb: can you backport your fix for bug 1631369 for 2.0 branch as well?
<mup> Bug #1631369: ExpireSuite.TestClaim_ExpiryInFuture_TimePasses took way too long <ci> <intermittent-failure> <regression> <unit-tests> <juju:In Progress by alexis-bruemmer> <https://launchpad.net/bugs/1631369>
<alexisb> mgz, sure
<axw_> perrito666: I've created https://github.com/juju/juju/tree/feature-persistent-storage
<perrito666> axw_: ack
<perrito666> axw_: we should propose against that I presume
<axw_> perrito666: yup, so we don't mess up 2.1
<perrito666> we dont mess up things, we awesome-ize them
<mup> Bug #1557726 changed: Restore fails on some openstacks like prodstack <backup-restore> <jujuqa> <openstack-provider> <juju:Fix Released by hduran-8> <juju 2.0:Fix Released by reedobrien> <juju-core:Won't Fix> <https://launchpad.net/bugs/1557726>
<natefinch> thumper: https://bugs.launchpad.net/juju/+bug/1648063
<mup> Bug #1648063: kill-controller removes machines from migrated model <model-migration> <juju:Triaged> <https://launchpad.net/bugs/1648063>
<voidspace> jam: from within a hook context (debug-hooks) is there a way to get a list of all the bindings defined for the charm?
<voidspace> frobware: ^^^ do you know?
<jam> voidspace: generally it will be the things listed under "provides" or "requires" in charm-metadata.yaml
<frobware> voidspace: https://github.com/frobware/testcharms
<voidspace> jam: referring to the charm is what I was hoping to avoid
<voidspace> jam: but thanks
<voidspace> :-)
<rick_h> voidspace: :/ had hoped but nothing here of use https://jujucharms.com/docs/2.0/authors-hook-environment
<voidspace> rick_h: it would be a nice tool to have
<voidspace> rick_h: however...
<voidspace> rick_h: I have now torn down that environment and frobware has a test charm that I can deploy locally
<voidspace> rick_h: that defines useful stuff explicitly for playing with bindings
<rick_h> voidspace: yea, all good. Just noting that I can't find anything useful for what you were asking
<voidspace> rick_h: thanks for lookinh
<voidspace> rick_h: *looking
<voidspace> rick_h: but frobware has useful tools for playing with network-get
<natefinch> Review me? 2 line change: https://github.com/juju/juju/pull/6670
<thumper> babbageclunk: https://github.com/juju/juju/pull/6669
<natefinch> perrito666: http://reports.vapour.ws/releases/4631/job/functional-backup-restore/attempt/4900
<mup> Bug #1625624 changed: juju 2 doesn't remove openstack security groups <ci> <landscape> <openstack-provider> <sts> <juju:Fix Committed by gz> <juju 2.0:Fix Committed by gnuoy> <juju-core:Fix Released by gnuoy> <https://launchpad.net/bugs/1625624>
#juju-dev 2016-12-08
<Mmike> Hi, lads. What user/password combination do I use to connect to mongodb on the state servers, I've tried admin/machine-XX as user and oldpassword/sstatepassword from agent.conf files, but no dice.
<rick_h> Mmike: check out https://www.mail-archive.com/juju-dev@lists.ubuntu.com/msg04089.html
<Mmike> rick_h, thnx, looking
<mgz> frobware: lp:juju-ci-tools assess_endpoint_bindings.py for the demonstration
<mgz> see the create_test_charms() function
<redir> alexisb see also https://github.com/y0ssar1an/q
<voidspace> jam: http://leftoversalad.com/c/015_programmingpeople/
<natefinch> anastasiamac: can you re-review?  I updated a bunch of tests to pass.  Thanks! https://github.com/juju/juju/pull/6661
<anastasiamac> natefinch: looking
<anastasiamac> natefinch: LTM
<mgz> axw: you approved already but got macgreagoir to make a few changes, https://github.com/juju/juju/pull/6674
<mgz> anastasiamac: I think bug 1600257 is the badger
<mup> Bug #1600257: Broken bash completion with old ppa packages present <bash-completion> <packaging> <verification-done> <juju-core:Invalid> <juju-core (Ubuntu):Fix Released> <juju-core (Ubuntu Xenial):Fix Released> <https://launchpad.net/bugs/1600257>
<mup> Bug #1648130 opened: ERROR id not found error obscures cause <cdo-qa> <OpenStack Tempest Charm:Invalid> <juju-core:New> <https://launchpad.net/bugs/1648130>
<babbageclunk> menn0: Sorry, thought you said you were waiting for it to start.
<babbageclunk> axw_: If you want to take a look - https://github.com/juju/juju/pull/6678
<mup> Bug #1648130 changed: ERROR id not found error obscures cause <cdo-qa> <usability> <OpenStack Tempest Charm:Invalid> <juju:Triaged> <juju-core:Won't Fix> <https://launchpad.net/bugs/1648130>
<axw_> babbageclunk: sorry not going to be able to finish reviewing today
<babbageclunk> axw_: no worries - I'll bug menn0 to do it instead
<babbageclunk> axw_: (Although there's not a big rush)
<mup> Bug #1648600 opened: juju 1.x charm store channel support <uosci> <OpenStack Charm Test Infra:New> <juju-core:New> <https://launchpad.net/bugs/1648600>
#juju-dev 2016-12-09
<mgz> macgreagoir: this one is for you http://reports.vapour.ws/releases/4639/job/run-unit-tests-race/attempt/2124#highlight
<macgreagoir> mgz: You're too kind!
<mup> Bug #1510689 opened: juju upgrade --upload-tools tries to upload tools agents that are not permitted by the state server <bug-squad> <simplestreams> <upgrade-juju> <upload-tools> <juju-core:Triaged> <https://launchpad.net/bugs/1510689>
<macgreagoir> mgz: https://github.com/juju/juju/pull/6681
<mgz> macgreagoir: thanks
<macgreagoir> Thank you!
<mup> Bug #1648600 changed: juju 1.x charm store channel support <uosci> <OpenStack Charm Test Infra:New> <juju-core:Won't Fix> <https://launchpad.net/bugs/1648600>
<sinzui> voidspace: https://pastebin.canonical.com/173238/
<sinzui> voidspace: I used https://jujucharms.com/docs/stable/reference-hook-tools to learn about the commands I could run
<babbageclunk> thumper: review? (or poke menn0 to do it :) https://github.com/juju/juju/pull/6678
<babbageclunk> thumper: Should be pretty quick, hopefully
<thumper> babbageclunk: ack, looking
<babbageclunk> thumper: thanks!
<macgreagoir> anastasiamac mgz: https://github.com/juju/juju/pull/6683
<natefinch> mgz: https://github.com/juju/juju/pull/6684
<hoenir_> can someone take a look at this PR? https://github.com/juju/juju/pull/6523
<jam> frobware: https://github.com/juju/juju/pull/6690
#juju-dev 2016-12-10
<mup> Bug #1632159 changed: juju 1.25.6 HA - Causing unit state issues. <canonical-bootstack> <juju:Expired> <juju-core:Expired> <juju-core 1.25:Expired> <https://launchpad.net/bugs/1632159>
#juju-dev 2016-12-11
<anastasiamac> menn0: it looks like it's just u and me today... shall we skip standup?
<menn0> anastasiamac: I guess so. given we just had the sprint, we're up to date on everything.
<menn0> anastasiamac: i'm just getting through the remaining migrations work. very little distractions today so it's going well.
<menn0> anastasiamac: how's jetlag?
<anastasiamac> menn0: no jetlag \o/
<anastasiamac> landing with sunrise was amazing :)
<menn0> anastasiamac: I tried to sleep at the right time on the flights back but struggled. Still, I managed to stay awake yesterday and had a productive day.
<menn0> anastasiamac: picked up a christmas tree, mowed the lawn and had friends around before collapsing in a heap at about 10pm.
<anastasiamac> menn0: timing is everything \o/ and now u've had productive sprint, ready for xmas with model migration almost there
<anastasiamac> menn0: u might even get a chance to help with housing/flags so that i can fix this bug for 2.1!
<anastasiamac> menn0: m sure that it'll make ur year :)
<menn0> anastasiamac: happy to :) which days are you working this week?
<anastasiamac> menn0: mon, tue, wed then HOLIDAYS \o/
<menn0> anastasiamac: ok, let's go through that on wednesday. i'll make a calendar entry now.
<anastasiamac> menn0: well, actually in that case, don't worry about it..
<menn0> anastasiamac: how come?
<anastasiamac> menn0: i will not be able to put up PR and land it on wednesday..
<anastasiamac> menn0: and the release for 2.1 will happen while we r all away
<menn0> anastasiamac: I didn't realise there was a PR required? I thought you just wanted to understand how that stuff worked.
<anastasiamac> menn0: to make the cut, pr will need to land before i go
<menn0> anastasiamac: what PR though?
<anastasiamac> menn0: ? why would i want to understand it with no tangible outcome?
<menn0> anastasiamac: what do you want to change?
<anastasiamac> menn0: i have a bug that needs fixing and this is the way that I think i need to fix it..
<anastasiamac> PR doesn't exist yet... i need to understand stuff before changing the code :D
<menn0> anastasiamac: that wasn't clear to me. sorry. ok, well let's talk this afternoon then.
<anastasiamac> menn0: there is a worker that deals with agent termination directly
<anastasiamac> and it should not
<anastasiamac> not only that, it actually fails on some series... xenial off the top of my memory
<anastasiamac> menn0: this afternoon?.. but model migration... alexis will never fogive me..
<menn0> anastasiamac: well this bug needs fixing too right?
<anastasiamac> menn0: well, all bugs need fixing :)
<anastasiamac> menn0: but model migration needs to b feature complete...
<menn0> anastasiamac: I'm sure I can spare an hour
<menn0> :)
<menn0> anastasiamac: now is not good but can I ping you later?
<anastasiamac> menn0: k.. let's see how we go... afternoon seems to b quite far for now.. i might b able to catch up on stuff by then too
<menn0> anastasiamac: k
#juju-dev 2017-12-04
<axw_> wallyworld: so renaming caasprovisioner -> caasoperatorprovisioner?
<axw_> wallyworld: and adding caasunitprovisioner
<wallyworld> um, yeah, i think so
<axw> wallyworld: https://github.com/juju/juju/pull/8166
<wallyworld> looking
<wallyworld> axw: lgtm
<axw> wallyworld: https://github.com/juju/juju/pull/8169
<wallyworld> ok
<wallyworld> axw: i've got one up also but there's a feature test failure i'm looking at - there's an issue with directory names
<axw> ok
<balloons> wpk, did you get a chance to look at the spaces test PR? https://github.com/juju/juju/pull/8165
<bdx> https://bugs.launchpad.net/juju/+bug/1736022
<mup> Bug #1736022: failed to bridge devices: bridge activaction error: bridge activation failed: Killed old client process <juju> <juju:New> <MAAS:New> <https://launchpad.net/bugs/1736022>
<bdx> any idea whats going on there^
<bdx> ?
<dnegreira> Hi everyone, I am interested in developing the ability to use LXD as a provider for juju, in order to be able to use an existing remote LXD server. I have read an article from 2015 saying that it is not possible due to LXD's lack of an interface to manage storage and firewalling. I have been following the development of LXD closely and I can see that the interface to manage the storage is there
<dnegreira> already, but not the firewalling. I am willing to write some code so that LXD has some interface to manage firewalling towards the containers, and I have already spoken with them. Besides the firewalling and storage parts, is there anything else that I should be looking into in order to be able to use LXD as a remote for juju?
#juju-dev 2017-12-05
<redir> O/
<thumper> hi redir
<thumper> dnegreira: hi there
<thumper> dnegreira: juju already has a lxd provider
<axw> wallyworld: back
<wallyworld> axw: i'm just testing another iteration - i may have found the issue, i'll ping in a bit if needed
<axw> okey dokey
<axw> I'll finish up my PR then
<wallyworld> sgtm
<axw> wallyworld: is your PR ready for review again? did you sort out the test failure?
<thumper> wallyworld: well... watchers based on a single tailer is working in my controller
<thumper> needs more testing
<thumper> and some real tests added
<thumper> but good start
<thumper> I need to make sure it is robust in failure cases
<axw> thumper: tailable iterator?
<thumper> axw: not at this stage
<thumper> kept the basic implementation
<axw> cool cool, just curious what you meant
<thumper> but now with how it is separated
<thumper> it would be easier to change to use that
<axw> great
<thumper> axw: I've added a worker to the state pool that does the tailing
<thumper> it publishes to a pubsub.SimpleHub owned by the state pool
<thumper> and the other state objects the pool creates just use a watcher that listens to the hub
<thumper> so a slow state won't slow down a fast state
<axw> thumper: sounds good
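
The shape thumper is describing, one tailing worker publishing change events that many per-model watchers consume, is what keeps a slow State from slowing a fast one. A toy sketch of that decoupling follows; it uses a hand-rolled hub rather than the real pubsub.SimpleHub API, purely to illustrate the flow.

```go
// Sketch of the shape described above: a single worker tails the txn log and
// publishes change events to a hub; each State-level watcher just subscribes,
// so a slow consumer cannot slow down the tailer. This is a toy hub, not the
// real pubsub.SimpleHub API.
package watcherhub

import "sync"

type change struct {
	collection string
	id         interface{}
}

type hub struct {
	mu   sync.Mutex
	subs []chan change
}

func (h *hub) subscribe() <-chan change {
	h.mu.Lock()
	defer h.mu.Unlock()
	ch := make(chan change, 16) // buffered so a slow watcher doesn't block publish
	h.subs = append(h.subs, ch)
	return ch
}

func (h *hub) publish(c change) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs {
		select {
		case ch <- c:
		default: // drop rather than block the tailer; real code would mark the watcher broken
		}
	}
}
```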
<thumper> axw: how would you feel if we put a pool attribute on the state instance?
<thumper> there are places where we are creating pools just to iterate models
<thumper> which seems ungood
<thumper> even if they are rare code paths
<axw> thumper: I would rather we didn't. do you have an example on hand
<axw> ?
<thumper> that's ok, they are pretty rare
<axw> thumper: IMO we should just pass a pool into the callers instead
<thumper> hmm...
<thumper> one is in state/cleanups.go
<thumper> another when destroying a controller
<thumper> also in state
<axw> thumper: hmmmm actually, lemme think on it
<thumper> it isn't urgent
<thumper> but areas where we should clean up a bit
<axw> thumper: I think I mentioned this before, but what I want to have is a state.Manager which essentially does what StatePool does now, but it will be the one and only place where you can obtain a State object
<axw> thumper: i.e. no more state.Open
<thumper> right
<thumper> perhaps the pool becomes that thing
<axw> thumper: if we do that, then a State could  reasonably internally have a Manager reference
 * thumper nods
<axw> atm the controller State doesn't come from a StatePool, so feels a bit awkward
<axw> and I really don't want the controller model's State to create or own the StatePool, that feels backwards to me
<axw> thumper: also, I want to remove the State's Close method. they should just be accessible from the manager/pool, and that should take care of closing its mongo session and workers internally
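
A hypothetical sketch of the Manager idea being floated here, based only on this conversation; the method set and the State placeholder are illustrative, not a real juju API.

```go
// Hypothetical sketch of the proposed state.Manager: the one place State
// objects come from, replacing direct state.Open calls. Everything here is
// illustrative, not real juju code.
package state

// State stands in for the existing state.State type.
type State struct{}

type Manager interface {
	// Model returns (creating on first use, then reusing) the State for the
	// given model, much as StatePool.Get does today.
	Model(modelUUID string) (*State, error)

	// Controller returns the controller model's State, so it too comes from
	// the manager rather than from a bare state.Open.
	Controller() (*State, error)

	// Close releases every State the manager owns, closing mongo sessions
	// and workers internally; individual States no longer expose Close.
	Close() error
}
```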
<wallyworld> axw: yeah, ready for another look
<axw> wallyworld: I take it the worker wasn't running hooks and such because of the missing ":latest" ?
<wallyworld> axw: not that reason - it was because cmd/jujud/main was specially looking for a rpc service listening for "Jujuc.Main" and i had renamed the struct for the hook commands from Jujuc to HookCmmnd
<axw> wallyworld: ahhh
<wallyworld> i wanted to get rid of jujuc
<axw> yay reflection magic
<wallyworld> as it doesn't make sense anymore
 * axw nods
<axw> wallyworld: can you please add a NOTE comment on the Jujuc struct
<wallyworld> axw: i just finished deploying my gitlab charm - app status is set to active as it should be so it all works end-end
<axw> wallyworld: sweet :)
<wallyworld> will add comment
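
For context on why the rename bit: with Go's net/rpc the client looks methods up by the registered receiver type's name, so the string "Jujuc.Main" is tied to the struct being called Jujuc. A generic sketch of that coupling (not the real jujud/jujuc wiring):

```go
// Generic net/rpc sketch of the coupling described above: the client looks
// up the method by the registered type's name ("Jujuc.Main"), so renaming
// the receiver struct breaks the lookup even though everything compiles.
package main

import (
	"log"
	"net"
	"net/rpc"
)

type MainArgs struct{ Command string }
type MainReply struct{ Code int }

// Jujuc is the receiver; rpc.Register exposes its methods as "Jujuc.<Method>".
type Jujuc struct{}

func (j *Jujuc) Main(args MainArgs, reply *MainReply) error {
	reply.Code = 0
	return nil
}

func main() {
	if err := rpc.Register(&Jujuc{}); err != nil {
		log.Fatal(err)
	}
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go rpc.Accept(l)

	client, err := rpc.Dial("tcp", l.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	var reply MainReply
	// Renaming the Jujuc struct would make this lookup fail at runtime.
	if err := client.Call("Jujuc.Main", MainArgs{Command: "config-get"}, &reply); err != nil {
		log.Fatal(err)
	}
}
```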
<thumper> axw: understood
<wallyworld> axw: i added juju-log hook command, just a simple port. not urgent https://github.com/juju/juju/pull/8172
<axw> wallyworld: LGTM
<wallyworld> ta
<babbageclunk> wallyworld: might be a silly question: but can the API server send requests to the agents?
<wallyworld> babbageclunk: it's all one way from agent -> api server
<wallyworld> agents do start watchers which are long running connections
<babbageclunk> wallyworld: ah - I was confused by the code in rpc/conn.loop that handled the header being a request or a response. But I think that's shared between the client and server, isn't it/
<babbageclunk> ?
<babbageclunk> in rpc/server.go
<wallyworld> maybe, not sure. i'm not 100% across that code
<wallyworld> i could guess but that would be bad
<babbageclunk> ok, I think I get it.
<wallyworld> ok
<wallyworld> axw: this adds some more needed apis, no rush https://github.com/juju/juju/pull/8173
<wallyworld> axw: thanks for review, i replied to comment
<wallyworld> axw: i'll need to debug a bit - it seems when we invoke the config-get hook tool, we always get empty config - the tool runs without error so it may be that the config is not being recorded in state. have you seen that behaviour at all?
<axw> wallyworld: nope, but I had logging in there which I tested
<axw> wallyworld: i.e. the "application config changed" log message
<wallyworld> axw: yeah, in a hook, i call hookenv.config() and it's empty sadly
<wallyworld> debug time
<axw> wallyworld: hookenv.config() is showing the config for me
<wallyworld> axw: a reactive charm?
<axw> wallyworld: I deployed gitlab, then ran "juju config gitlab gitlab_image=foo"
<axw> wallyworld: yup, not that it should matter
<wallyworld> hmmm
<wallyworld> for me hookenv.config() is empty
<wallyworld> i'm not setting any config explicitly
<wallyworld> i'm relying on defaults
<wallyworld> which i thought were sent as config values
<axw> wallyworld: ah. it does appear to be empty by default, which seems like a bug
<wallyworld> maybe not
<wallyworld> axw: yeah, i'm thinking a bug in caas models
<axw> wallyworld: I suspect we're not populating the application's settings properly
<wallyworld> axw: there's also an issue with container-spec-set that i'm seeing
<wallyworld> fixing that one first
<axw> wallyworld: fixed in my PR
<axw> wallyworld: assuming you mean the order being reversed
<axw> wallyworld: https://github.com/juju/juju/pull/8174
<wallyworld> let me look
<wallyworld> axw: yeah, ooops, that's the bug
<wallyworld> axw: lgtm. i've got a charm which should hopefully work (once we get the config stuff fixed)
<axw> wallyworld: cool
<axw> wallyworld: can't add units yet, that's the next step
<axw> wallyworld: starting to look into that
<wallyworld> axw: we can debug tomorrow - when i deploy my charm, it is setting the spec now, but the container spec watcher is not notifying, so the unit worker is not creating the pod
<axw> wallyworld: probably because you have no units?
<axw> wallyworld: I have it working here
<axw> wallyworld: I had to make changes in state so that units can be added to a CAAS app
<wallyworld> axw: ah right, i forgot about that; i thought we created one but forgot we don't
<wallyworld> so it's just the config issue
<wallyworld> axw: also, i'm mulling over whether we should provide the reactive key value store via a hook tool (similar semantics to model-config where a single CLI cmd is used to set or reset values). that way we get caching via the context and also the connection boilerplate back to the controller all taken care of easily
<axw> wallyworld: yeah, that does sound like a good reason to use a hook tool. or at least going through the operator agent as an intermediary; it could still be an API
<axw> CLI is probably fine tho
<wallyworld> CLI is easiest to hook up at the moment i think
<wallyworld> less messing about with  python etc
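
A hypothetical sketch of what such a hook tool could look like, with model-config-style set/unset/get semantics. The flag names and the local JSON file standing in for the context-cached, controller-backed store are assumptions, not an existing juju command.

```go
// Hypothetical hook-tool sketch with model-config-like semantics: one command
// that sets, unsets, or prints values. A local JSON file stands in for the
// real context-cached, controller-backed store.
package main

import (
	"encoding/json"
	"flag"
	"fmt"
	"io/ioutil"
	"log"
	"strings"
)

const storePath = "/tmp/unit-kv.json" // placeholder for the real store

func load() map[string]string {
	m := map[string]string{}
	if data, err := ioutil.ReadFile(storePath); err == nil {
		_ = json.Unmarshal(data, &m)
	}
	return m
}

func save(m map[string]string) {
	data, _ := json.Marshal(m)
	if err := ioutil.WriteFile(storePath, data, 0600); err != nil {
		log.Fatal(err)
	}
}

func main() {
	set := flag.String("set", "", "set key=value")
	unset := flag.String("unset", "", "remove key")
	flag.Parse()

	m := load()
	switch {
	case *set != "":
		kv := strings.SplitN(*set, "=", 2)
		if len(kv) != 2 {
			log.Fatal("expected key=value")
		}
		m[kv[0]] = kv[1]
		save(m)
	case *unset != "":
		delete(m, *unset)
		save(m)
	default:
		for _, key := range flag.Args() {
			fmt.Println(m[key]) // print requested values, like config-get
		}
	}
}
```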
<mup> Bug #1727507 changed: [2.3] rationalize offer commands <docteam> <juju:New> <https://launchpad.net/bugs/1727507>
<mup> Bug #1721555 changed: juju run tries to reach manually added machines through private IPs instead of public <juju-core:Won't Fix> <https://launchpad.net/bugs/1721555>
<mup> Bug #1733968 changed: virtual IP addresses should not be registered <sts> <juju:New> <juju-core:Won't Fix> <https://launchpad.net/bugs/1733968>
<hml> wallyworld: i have a pr up - just ignore for now, it's going to change :-)
<wallyworld> hml: is your PR ready for review?
<hml> wallyworld: pushing the updates now
<wallyworld> ok
#juju-dev 2017-12-06
 * thumper loads up his controller
<thumper> hopefully this won't actually kill my laptop
<thumper> I'm trying to replicate the death spiral we see with some controllers with lots of models
<hml> thumper: ha, youâre brave?
<thumper> well... I'm up to 300 models and 60 total units
<thumper> starting to see rises in txns.log reads
<thumper> since I'm on an SSD, no doubt my profiling will be different
<thumper> but slowly adding stuff to see it get stressed
<hml> wallyworld: i pushed the pr changes, but there is something wrong with the tests, but real world does what it should on packages.
<hml> wallyworld: iâm EOD, look at tests in the morning
<wallyworld> ok, ty
<thumper> currently: goroutine profile: total 166500
<thumper> axw: the new hook execution locking behaviour is 2.3 not 2.2.6 right?
<axw> thumper: correct
 * thumper nods
 * thumper slowly adds more models and units
<thumper> perhaps unsurprisingly, my machine is starting to get a little sluggish
<thumper> wallyworld: I've got a problem, got a minute?
<wallyworld> sure
<thumper> 1:1 HO
<mwhudson> an overloaded controller _and_ hangouts? brave man :)
<axw> wallyworld: the state.Unit.ConigSettings method adds defaults, the application method does not
<axw> ConfigSettings*
<wallyworld> axw: yeah, just discovering that
<wallyworld> just otp with tum
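
The difference being chased here, Unit.ConfigSettings layering charm defaults under explicit settings while the application-level method does not, comes down to a merge like this simplified sketch (the types are stand-ins for the real state code):

```go
// Simplified sketch of the defaults handling discussed above: start from the
// charm's config.yaml defaults and let explicitly set values override them,
// so a charm relying on defaults still sees a populated config.
package state

func configWithDefaults(defaults, set map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(defaults)+len(set))
	for k, v := range defaults {
		out[k] = v // charm-declared defaults first
	}
	for k, v := range set {
		out[k] = v // explicit settings win
	}
	return out
}
```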
<cmars> i'm running into this bug with the manual provider: https://bugs.launchpad.net/juju/+bug/1736582
<mup> Bug #1736582: Cannot bootstrap manual provider with Juju 2.2.6 <juju:New> <https://launchpad.net/bugs/1736582>
<cmars> is there a workaround?
<thumper> hmm...
<thumper> when I restarted the controller
<thumper> it would have hit the provider asking for machines for each model in very short order
<thumper> starting all the provisioners in parallel has the potential to DDOS the provider
<thumper> which is what we did for lxd
<wallyworld> oops, we should fix that
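
One way to avoid that thundering herd on controller restart is to cap how many per-model provisioners may talk to the provider at once; a minimal sketch with a simple semaphore, where startProvisioner and the model list are illustrative:

```go
// Sketch of one way to avoid the thundering herd described above: cap how
// many per-model provisioners may hit the provider concurrently.
package workers

import "sync"

func startAll(modelUUIDs []string, maxConcurrent int, startProvisioner func(string)) {
	sem := make(chan struct{}, maxConcurrent)
	var wg sync.WaitGroup
	for _, uuid := range modelUUIDs {
		wg.Add(1)
		go func(uuid string) {
			defer wg.Done()
			sem <- struct{}{}        // block while maxConcurrent others are starting
			defer func() { <-sem }() // release the slot when this one is done
			startProvisioner(uuid)
		}(uuid)
	}
	wg.Wait()
}
```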
<wallyworld> thumper: i would expect model migrations to copy across charms - it that something that is done out of band? or is it in the export yaml?
<wallyworld> i'm guessing out of band?
<wallyworld> or is that just resources that gets streamed across
<babbageclunk> wallyworld: charm binaries are done out of band - they're not in the export yaml
<wallyworld> babbageclunk: thanks. so that means we potentially have a situation for a short time where an application has an orphaned charm reference. do we not mark the model as ready until all blobs are copied?
<babbageclunk> wallyworld: that's right - the binaries (agents, charms and resources) are transferred in the import phase, and the model isn't marked as active until the success phase (when all the agents are switched over)
<wallyworld> babbageclunk: thanks! i need to update a test then due to to a change i'm making
<babbageclunk> wallyworld: no worries (he says, too late)
<wallyworld> never too late
 * babbageclunk still has time, it's never too late to change his mind, it's - never - too late - to change his mind
<wallyworld_> axw: for when you are free https://github.com/juju/juju/pull/8179
<axw> sure, just finishing up my PR then will take a look
<wallyworld_> no rush
<axw> wallyworld: https://github.com/juju/juju/pull/8180
<wallyworld> axw: looking. i'm just pushing a change to fix a feature test failure
<axw> okey dokey
<wallyworld> done
<wallyworld> axw: our test had a comment saying "this is wrong" referring to the source attr value, but it wasn't fixed, just a comment left behind
<axw> wallyworld: lol :/
<wallyworld> yeah
<axw> wallyworld: so we don't know who owns the juju namespace?
<wallyworld> axw: no, "they" are trying to find out
<wallyworld> no one knows
<axw> ok
<wallyworld> axw: did you use the latest version of my charm? i added extra config attrs which are set in the pod spec
<axw> wallyworld: nope, and I hacked up the charm a bit to set the spec
<axw> maybe yours does that now?
<wallyworld> yeah, since yesterday
<wallyworld> i'll test once all this lands
<jam> axw: did you see bug #1729880, someone is wanting to run a patched version of Juju but it is in a multi-architecture deployment
<mup> Bug #1729880: juju 2.2.4 and 2.2.6 actions dissappear when state is changed from running to complete <juju:Fix Committed by ecjones> <https://launchpad.net/bugs/1729880>
<jam> I thought you had a script that would allow cross-compiling and uploading alternative agent binaries.
<jam> fwiw, all he really needs is the controller to be patched, he could use official binaries for the rest of the agents (I think).
<jam> but probably our bug about an upgraded controller creating new units/machines with the *controller* agent version rather than the *model* agent version.
<axw> jam: I didn't. stepping out shortly, but I'll be back later to respond
<axw> jam: FYI, it's https://github.com/axw/juju-tools
<jam> axw: have a good night
<anastasiamac> wallyworld: ping
<wallyworld> yo
<anastasiamac> wallyworld: pm'ed
<balloons> hml, can I get a +1? https://github.com/juju/juju/pull/8182
 * hml looking
<hml> balloons: lgtm
<babbageclunk> thumper: ping? Do you have a moment for some advice?
<thumper> babbageclunk: sure
<babbageclunk> thumper: in 1:1?
<thumper> yep
<wallyworld> hml: pop into stand up HO? just want to touch base about the config stuff
<hml> wallyworld: on my way
#juju-dev 2017-12-07
 * thumper sighs
<thumper> why isn't my shit working...
<rick_h> thumper: try less shit?
<rick_h> :P
<thumper> thanks
<thumper> very helpful
<thumper> veebers: got time for a quick HO?
<veebers> thumper: sure
<thumper> veebers: 1:1 ?
<veebers> sure, omw
<thumper> babbageclunk:  for your amusement https://github.com/juju/juju/pull/8185
<thumper> hmm... where'd wallyworld go?
<thumper> I feel that should have been shortened to
<thumper> where's wally?
<babbageclunk> heh
 * thumper is at EOD
<thumper> see y'all tomorrow
<soumaya> Hi ... I am getting error during bootstrapping on openstack with https auth endpoint
<soumaya> here is the log ....
<soumaya> 02:45:30 INFO  juju.provider.openstack provider.go:144 opening model "controller" 02:45:30 DEBUG juju.provider.openstack provider.go:798 authentication failed: authentication failed caused by: requesting token: failed executing the request https://192.168.23.222:5000/v3/auth/tokens caused by: Post https://192.168.23.222:5000/v3/auth/tokens: x509: cannot validate certificate for 192.168.23.222 because it doesn't contain any IP SANs 
<soumaya> anybody can help me on this ???
<thumper> grumble
<balloons> can I get a review of the version bump? https://github.com/juju/juju/pull/8187
<babbageclunk> wallyworld: can you take a look at my WIP branch? Putting in one more test now. https://github.com/juju/juju/pull/8184
<wallyworld> babbageclunk: sure, already started but didn't get to the end before something shiny distracted me
<babbageclunk> wallyworld: oh awesome.
<babbageclunk> wallyworld: actually, can I ask you a quick opinion? in 1:1...
<wallyworld> sure
<balloons> 2.3 is here!
 * balloons goes to celebrate
<wallyworld> balloons: awesome, ty
<wallyworld> babbageclunk: left a few comments, looks ok so far i think
#juju-dev 2017-12-08
<babbageclunk> wallyworld: awesome, thanks
<thumper> I have found a bunch of failures where people were creating state pools unnecessarily in tests
<thumper> I'm going to land those fixes independently
<thumper> babbageclunk, axw or wallyworld: https://github.com/juju/juju/pull/8189
<wallyworld> looking
<thumper> wallyworld: I'm trying to reduce the surface of the big pull request for the watchers
 * wallyworld nods
<babbageclunk> thumper: lgtm
<babbageclunk> doh, sorry
<thumper> all good
 * thumper afk briefly to collect daughter
<thumper> axw: any thoughts on https://bugs.launchpad.net/juju/+bug/1736582 ?
<mup> Bug #1736582: Cannot bootstrap manual provider with Juju 2.2.6 <juju:New> <https://launchpad.net/bugs/1736582>
<thumper> babbageclunk: another https://github.com/juju/juju/pull/8190
<thumper> or wallyworld: ^^
<axw> thumper: sorry missed your q. from casey's latest comment, sounds like we should perform a check to see if the mongo port is firewalled
 * thumper nods
<wallyworld> thumper: looking
<thumper> wallyworld: thanks, this one is boring too
<wallyworld> axw: here's that pr we talked about https://github.com/juju/juju/pull/8191
<axw> wallyworld: ok, will take a look later after lunch
<wallyworld> yup, no rush
<wallyworld> thumper: lgtm, interfaces good
<thumper> wallyworld: ta
<axw> wallyworld: how come you only pick out one port?
<wallyworld> let me look
<axw> wallyworld: in EnsureService
<axw> wallyworld: I think it makes sense to have ingress just point at the first container port in the spec (we can have that as a convention), but the service can still expose all ports?
<wallyworld> axw: i guess no good reason, just the current assumption that we only support one container per pod
<wallyworld> yeah
<wallyworld> agreed, that is easily fixed
<axw> cool
<axw> wallyworld: though I'm not sure if there's a way to determine which one is "first" once you fetch the service definition. is order preserved?
<axw> wallyworld: i.e. in ExposeService, is the order of ports there the same as what was specified in EnsureService
<wallyworld> axw: thanks for review. i think the order is preserved. at least frm what i have seen
<axw> wallyworld: ok cool
<axw> wallyworld: I think it would be best to do the conversion to int32 internally
<wallyworld> yup, done
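
The outcome agreed above (the service exposes every container port, converted to int32 internally, and ingress points at the first port in the spec by convention) reduces to something like this simplified sketch; the types are stand-ins rather than the real k8s or juju caas types.

```go
// Sketch of the EnsureService port handling agreed above: expose every
// container port on the service, converting to int32 internally, and treat
// the first declared port as the ingress target by convention.
package caas

type ContainerPort struct {
	Name string
	Port int // as provided in the pod spec
}

type ServicePort struct {
	Name string
	Port int32
}

func servicePorts(containerPorts []ContainerPort) (all []ServicePort, ingress *ServicePort) {
	for _, p := range containerPorts {
		all = append(all, ServicePort{Name: p.Name, Port: int32(p.Port)})
	}
	if len(all) > 0 {
		ingress = &all[0] // convention: ingress targets the first declared port
	}
	return all, ingress
}
```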
<wallyworld> axw: awesome, go 1.10 beta1 changes go fmt behaviour :-/
<axw> wallyworld: ah well, we'd better all upgrade then
<wallyworld> and the builders :-)
<axw> oh right, we check that don't we
<wallyworld> axw: here's the caas firewaller all completed https://github.com/juju/juju/pull/8193
<axw> wallyworld: I think that's going to be a monday job, my brain is dead
<axw> I'll make a start on it anyway
<wallyworld> no worries, no rush
<wallyworld> i'm just happy to get it all working
<wallyworld> i'm off to have dinner, have a good weekebd
<axw> wallyworld: yeah, that's a solid effort - wasn't expecting to see it so soon
<axw> wallyworld: you too
<wallyworld> axw: thanks for the fix. LGTM that should have been caught in CI. clearly there's something missing from the upgrade tests
<axw> wallyworld: ta
<axw> yep
<wallyworld> jam: did you have any thoughts on that network info issue
