/srv/irclogs.ubuntu.com/2013/11/11/#juju-dev.txt

<davecheney> services:  00:14
<davecheney>   gccgo1:  00:14
<davecheney>     charm: local:raring/gccgo-12  00:14
<davecheney>     exposed: false  00:14
<davecheney>     units:  00:14
<davecheney>       gccgo1/0:  00:14
<davecheney>         agent-state: installed  00:14
<davecheney> nice  00:14
<davecheney> the agent now tells you when it is done installing  00:14
<davecheney> it used to say 'pending' until it hit started  00:14
<thumper> morning  00:23
<thumper> wallyworld_: hey there  00:23
<wallyworld_> yello  00:23
<thumper> wallyworld_: got time to chat?  00:27
<wallyworld_> ok  00:27
* thumper fires up a hangout  00:27
<thumper> wallyworld_: https://plus.google.com/hangouts/_/76cpj4l2lgncclri44ngapjg78?hl=en  00:29
<bigjools> if jam is awake, he's going to get an awesome view of a re-entering soyuz in about 40 minutes.  01:53
=== axw_ is now known as axw
<thumper> axw_: around?  02:58
* thumper has a headache  02:58
<thumper> perhaps more coffee needed  02:58
<thumper> jam: ping  03:01
<thumper> axw__: the real axw?  03:01
<axw__> thumper: indeed, my ISP is rubbish lately :(  03:02
=== axw__ is now known as axw
<thumper> axw: can I get you on a hangout?  03:02
<axw> thumper: certainly, just a minute  03:02
=== axw_ is now known as axw
=== thumper is now known as thumper-afk
<jam> hey wallyworld_, you around for 1:1 ?  06:04
<wallyworld_> sure  06:04
<jam> bigjools: damn, wish I knew about that. I do wake up around that time, I'm just not at my computer yet to see your message.  06:05
<bigjools> jam: they re-enter over the middle east every time, so you get another in about 3 months  06:05
<bigjools> jam: not sure if you can see the plasma trail though, but you'll definitely see a burning thing hurtling through the atmosphere  06:06
=== thumper-afk is now known as thumper
<thumper> fwereade: ping  06:52
<fwereade> thumper, pong, if you're still round  08:29
<thumper> fwereade: I'm back around  08:50
<thumper> fwereade: hangout?  08:50
<fwereade> thumper, sure  08:50
<rogpeppe> mornin' all  08:53
<axw> morning rogpeppe  08:55
<rogpeppe> axw: hiya  08:55
<mgz> right, feeling a good bit less dodgy after the weekend  09:02
<rogpeppe1> mgz: were you dodgy before? sorry to hear that.  09:32
<mgz> rogpeppe1: just generally under the weather, can now talk without croaking again  09:35
<rogpeppe1> mgz: that's good :-)  09:35
<jam> TheMue: standup ?  10:57
<jam> https://plus.google.com/hangouts/_/calendar/am9obi5tZWluZWxAY2Fub25pY2FsLmNvbQ.mf0d8r5pfb44m16v9b2n5i29ig  10:57
<TheMue> jam: ouch, missed it  10:59
<mattyw> jam, axw I've merged my branch with trunk if you want to take another look: https://code.launchpad.net/~mattyw/juju-core/gocheck_included_in_build_fix/+merge/192411  11:07
<jam> thanks mattyw, I marked it approved to land again.  11:12
<mattyw> jam, thanks very much  11:13
<mattyw> fwereade, could you give me a shout when you have a spare 10 minutes? whenever is good for you  11:43
<fwereade> mattyw, hey dude, would you try again in about 1.5 hours please? that's my best guess :(  11:43
<mattyw> fwereade, no problem, thanks  11:45
<jam> mattyw: fwiw, your earlier gocheck patch landed  12:03
<mattyw> jam, thanks very much for your help  12:04
* TheMue => lunch  12:18
<axw> mattyw: sorry, I missed the merge failure. thanks for fixing.  12:20
<mattyw> axw, no problem, thanks for reviewing  12:22
<jam> fwereade: I'm back for a bit, but I should go do homework. Can I touch back with you in 30 min?  12:58
<fwereade> jam, sure, I'm still digging  12:59
<fwereade> dimitern, jam: hey, there was a bug with the unit agent bouncing as it departed relations; did we ever resolve that one?  13:08
<fwereade> dimitern, jam: because if we didn't, I'm starting to wonder whether that's implicated in the immortal relations we're seeing  13:09
<dimitern> fwereade, I'm not sure we did fix it  13:12
<fwereade> dimitern, cheers  13:14
<jam> fwereade: I'm back if we want to chat now  13:37
<jam> fwereade: I don't think I followed that bug, so I don't know if it is fixed or not  13:37
<fwereade> jam, it's not, I've just verified it  13:43
<jam> fwereade: as in you triggered the unit agent to bounce while tearing down  13:43
<jam> ?  13:43
<fwereade> jam, there's an error in uniter.Filter  13:43
<fwereade> jam, any time a relation gets removed it bounces the unit agent  13:44
<fwereade> jam, trying to figure out if that could cause what we're seeing  13:44
<fwereade> jam, it's certainly not intended behaviour  13:44
<jam> fwereade: well, bouncing an agent during normal operation doesn't sound very good.  13:44
<jam> Would it come back up if things were set to dying?  13:45
<fwereade> it comes up fine  13:46
<jam> fwereade: but does it come back up without finishing what it was trying to do?  13:46
<fwereade> jam, (so we didn't notice it for a while)  13:46
<jam> I know we had that for some other teardown event, where the process would die and then come back thinking all was fine (destroy-machine of a manually provisioned machine, I think)  13:47
<fwereade> jam, and I think it *usually* does the right thing, because the relation can't actually *be* removed until the unit agent has handled it...  13:47
<fwereade> jam, *but* there's some funky sequence-breaking for juju-info relations  13:47
<fwereade> jam, so I need to figure out wtf is going on more-or-less from scratch there  13:47
<jam> fwereade: I don't see the string "juju-info" in Uniter  13:49
<fwereade> jam, IsImplicit  13:50
<jam> fwereade: it does seem to have special handling of Dying in worker/uniter/uniter.go  13:50
<jam> (set it do dying, but if that fails check if it is implicit)  13:51
<jam> set it *to* dying  13:51
<jam> anyway, I need to go grab dinner for my son, if you need anything you can leave a note and I'll try to check later. (Or email)  13:52
<fwereade> jam, will do  13:54
<hazmat> fwereade, any time a relation gets removed it bounces the unit agent -> that explains another bug report..  14:48
<hazmat> namely config-changed executing post relation-broken  14:49
<fwereade> hazmat, ha!  14:49
<fwereade> hazmat, well spotted  14:49
<fwereade> hazmat, I expected that to be a quick fix but it'll only be a quick*ish* fix -- can't quite driveby it, I'm making sure I get destroy-machine --force done first  14:49
<hazmat> fwereade, sounds good.. the machine one is priority.. the config-change/broken affected adam_g with ostack charm dev, not in the field per se.  14:50
<hazmat> fwereade, fwiw filed it as bug 1250106  14:52
<fwereade> hazmat, cheers  14:52
<TheMue> dimitern: ping  15:08
<dimitern> TheMue, pong  15:15
<TheMue> dimitern: just wanted to ask you about the background of machinePinger in apiserver/admin.go  15:17
<dimitern> TheMue, yeah?  15:17
<TheMue> dimitern: it wraps presence.Pinger, only Stop() is redefined to call Kill() at the end  15:17
<TheMue> dimitern: can you tell me more about the reason behind it?  15:17
<dimitern> TheMue, yes, so all resources in the apiserver need a Stop() method that will stop them  15:18
<dimitern> TheMue, the pinger on the other hand does not stop immediately when you call Stop() on it; if you take a look at its implementation you'll see that Kill() is what we need to call, that's why Stop() is redefined to call Kill() on a pinger  15:19
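
For context, a minimal sketch of the wrapper dimitern describes above. This is not quoted from apiserver/admin.go; the import path and method bodies are assumptions pieced together from this conversation, using only the presence.Pinger Stop()/Kill() methods mentioned here.

    package apiserver

    import "launchpad.net/juju-core/state/presence" // import path assumed for this era of juju-core

    // machinePinger adapts the presence pinger to the apiserver's notion of a
    // resource (anything with a Stop method), while ensuring that stopping the
    // resource also marks the presence slot dead immediately.
    type machinePinger struct {
        *presence.Pinger
    }

    // Stop stops the pinger and then kills it, so the agent no longer shows as
    // alive once the API connection's resources are torn down.
    func (p *machinePinger) Stop() error {
        if err := p.Pinger.Stop(); err != nil {
            return err
        }
        return p.Pinger.Kill()
    }
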
<fwereade> dimitern, why would we Kill()?  15:23
<fwereade> dimitern, I don't think a connection dropping is reason enough to start shouting that the unit's down  15:23
<dimitern> fwereade, because Stop is not guaranteed to stop it immediately  15:23
<fwereade> dimitern, that's the point of pinger  15:23
<TheMue> fwereade: ah, just wanted to ask after reading the code  15:23
<dimitern> fwereade, well, I remember discussing it with rogpeppe1 back then when I implemented it  15:24
<fwereade> dimitern, we don't want to raise the alarm as soon as we get some indication something *might* be wrong  15:24
<fwereade> dimitern, we only want to do that when we *know* it's bad  15:24
<dimitern> fwereade, i'm not sure I quite get you  15:24
<dimitern> fwereade, the Stop() method is the last thing called in a resource when a connection is already dropped  15:25
<fwereade> dimitern, in particular, an agent restarting to upgrade should *not* kill its pinger  15:25
<fwereade> dimitern, because anything trusting pinger state to be a canary for errors might react to it  15:25
<rogpeppe1> fwereade: on balance, i think i agree - calling Stop means we could bounce the agent without losing the ping presence  15:25
<TheMue> fwereade: I can imagine what you mean, but how to differentiate?  15:25
<dimitern> fwereade, I agree this is a corner case  15:26
<fwereade> TheMue, well, "never kill" is a lot better than "always kill"  15:26
<dimitern> fwereade, if it's not what's desired we can change it to use Stop instead  15:26
<TheMue> fwereade: hehe, ok  15:26
<fwereade> dimitern, rogpeppe1, TheMue: cool, cheers  15:26
<dimitern> fwereade, I was concerned with the fastest detection of a stalled/dropped connection  15:26
<fwereade> dimitern, rogpeppe1, TheMue: I think the only time to Kill the pinger is when the unit's dead  15:27
<fwereade> TheMue, make sure you test that live though  15:27
<fwereade> TheMue, and test it hard  15:27
<fwereade> TheMue, ...and actually... bugger  15:28
<TheMue> fwereade: the hard tests looked fine so far, but I now have to see how I do a "simple" hiccup  15:28
<fwereade> TheMue, dimitern, rogpeppe1: am I right in thinking that the replacement presence module broke the (effective) idempotency of a ping?  15:28
<rogpeppe1> fwereade: what replacement presence module?  15:28
<fwereade> rogpeppe1, niemeyer's mongo version  15:29
<rogpeppe1> fwereade: hmm, let me have a look  15:29
<fwereade> rogpeppe1, TheMue, dimitern: if it's not safe to have N pingers for the same node, I think we might have to Kill() anyway :(((  15:29
<dimitern> fwereade, sounds reasonable  15:30
<dimitern> fwereade, and not such a big improvement to have stop vs kill anyway  15:30
<dimitern> fwereade, what of bouncing agents - they are down while restarting, so it's not unusual  15:31
<rogpeppe1> // Never, ever, ping the same slot twice.  15:31
<rogpeppe1> // The increment below would corrupt the slot.  15:31
<fwereade> dimitern, they should not be *reported* as down  15:31
<fwereade> dimitern, if they get reported as down as part of normal operation then the reporting is... unhelpful, at best ;)  15:31
<fwereade> rogpeppe1, well, damn  15:31
<dimitern> fwereade, i agree  15:31
<fwereade> rogpeppe1, that'll need to be fixed for HA anyway  15:31
<dimitern> fwereade, but if the agent is being restarted it *is* down while it starts again, no?  15:32
<TheMue> s/"down"/"indifferent"/g ;)  15:32
<rogpeppe1> fwereade: i *think* that means that Stop is currently broken  15:32
<fwereade> dimitern, "down" means "whoa, something's really screwed up, go and fix it"  15:32
<dimitern> fwereade, really?  15:32
<dimitern> fwereade, didn't occur to me before :)  15:32
<dimitern> fwereade, I always thought of it as an intermediate state  15:33
<fwereade> dimitern, the intent was that any agent showing "down" should be reporting a real problem  15:33
<TheMue> dimitern: the bug I'm working on has it after killing a machine the hard way  15:34
<dimitern> fwereade, ah, ok then - so my assumption was based on our already flawed implementation :)  15:34
<fwereade> dimitern, yeah -- good fix, thanks ;p  15:34
<rogpeppe1> fwereade: do you know if we might be able to change things to use a more recent mongo version?  15:35
<rogpeppe1> fwereade: 'cos that could fix things in one easy swoop (and backwardly compatibly)  15:36
<fwereade> rogpeppe1, with $xor?  15:39
<rogpeppe1> fwereade: $or, but yes  15:39
<rogpeppe1> fwereade: (xor wouldn't be idempotent...)  15:39
<fwereade> rogpeppe1, I fear it would be impractical given the trouble we've had with mongo already  15:39
<fwereade> rogpeppe1, d'oh  15:39
<rogpeppe1> fwereade: it may be worth investigating - we should probably change to using a more recent version of mongo before 14.04 anyway  15:40
<rogpeppe1> fwereade: and perhaps most of the required procedures/mechanisms are already in place from the last time  15:41
<rogpeppe1> fwereade: so it *may* not be as difficult this time  15:41
<fwereade> rogpeppe1, yeah... I have no idea what it'd actually take, though -- mgz, can you opine here?  15:41
<TheMue> fwereade: regarding the machinePinger and our discussion last week, what do you think now? my current tests are fine and kill 3 minutes after the last ping.  15:41
<fwereade> TheMue, the presence problems are freaking me out now  15:42
* TheMue can imagine what fwereade means without knowing that term ;)  15:43
<fwereade> TheMue, as discussed just above -- more than one pinger is a problem  15:43
<fwereade> TheMue, so if an agent reconnected, somehow leaving a zombie connection lying around... we'd break presence state for some *other* agent  15:44
<TheMue> fwereade: so the machine and all units would optimally share one presence pinger?  15:45
<fwereade> TheMue, I don't see how that'd help?  15:45
<fwereade> TheMue, we want to know, for each agent, whether it's reasonable to assume it's active  15:46
<TheMue> fwereade: just tried to find different words  15:46
<TheMue> fwereade: yeah, so the "physical pinging" would carry additional "logical pinging" aka machine or unit id  15:46
<TheMue> *loudThinking*  15:47
<fwereade> TheMue, rogpeppe1: pre-HA, would it be plausible/helpful to kill each old agent connection when a new one was made for that agent?  15:47
<TheMue> fwereade: doesn't feel good  15:47
<fwereade> TheMue, rogpeppe1: given HA, I think we need a presence module that works with multiple pingers regardless though... right?  15:47
<TheMue> fwereade: yep  15:48
<rogpeppe1> fwereade: i'm not quite sure if that follows  15:48
<fwereade> rogpeppe1, if an agent reconnects to a different api server soon enough after disconnecting from another, do we not risk double-pings?  15:49
* rogpeppe1 thinks  15:50
<TheMue> fwereade: double pings in the sense of "two are waiting, only one gets, so the other one reacts wrong"?  15:51
<rogpeppe1> fwereade: yes, that's probably right  15:51
<rogpeppe1> fwereade: if the network error is asynchronous and instant  15:52
<fwereade> TheMue, in the sense of "we end up writing to the wrong agent's slot and ARRRGH"  15:52
<rogpeppe1> fwereade: so even if we're only executing pings explicitly for an agent, the ping can be in progress when the connection is made to another api server and another ping made  15:52
<fwereade> rogpeppe1, it feels possible, at least  15:53
<fwereade> rogpeppe1, I wouldn't want to bet anything on it not happening  15:53
<rogpeppe1> fwereade: it would be more possible if we didn't wait some time after bouncing  15:53
<rogpeppe1> fwereade: as it is, i think it's pretty remote  15:53
<rogpeppe1> fwereade: there's definitely more possibility if we're running the pings as an async process within the API server  15:54
<rogpeppe1> fwereade: i think we can probably make the presence package more robust without changing its basic representation.  15:56
<rogpeppe1> fwereade: by adding a transaction when starting to ping that verifies that no one else is pinging that same id.  15:57
<fwereade> rogpeppe1, isn't the whole point of presence that it *doesn't* involve transactions?  15:58
<rogpeppe1> fwereade: i was thinking a single transaction to initiate a pinger might be ok - none of the other operations require a transaction  15:59
<rogpeppe1> fwereade: i.e. one transaction for the entire lifetime of the pinger  15:59
<rogpeppe1> fwereade: there may be a cleverer way of doing it that doesn't rely on a transaction.  16:00
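
A rough sketch of the single-transaction guard rogpeppe1 is proposing, written against mgo/txn. The helper name, collection name, and document shape are invented for illustration; only the txn.Op/txn.DocMissing usage reflects the real mgo/txn API.

    package presence

    import (
        "labix.org/v2/mgo/bson" // import paths assumed for 2013-era mgo
        "labix.org/v2/mgo/txn"
    )

    // claimPresenceKey is a hypothetical helper run once when a pinger starts:
    // it asserts that no claim document for this key exists yet, so a second
    // pinger for the same agent fails fast instead of corrupting the shared slot.
    func claimPresenceKey(runner *txn.Runner, key, pingerID string) error {
        ops := []txn.Op{{
            C:      "presence.claims", // invented collection name
            Id:     key,               // e.g. the agent's global key
            Assert: txn.DocMissing,    // abort if the key is already claimed
            Insert: bson.M{"pinger-id": pingerID},
        }}
        return runner.Run(ops, "", nil)
    }
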
<fwereade> rogpeppe1, I'm not quite seeing it myself  16:02
<rogpeppe1> fwereade: we could always use a little bit of javascript instead of + too. if((x / (1<<slot)) % 2 == 0){x += 1<<slot}  16:03
<rogpeppe1> fwereade: assuming mongo has a modulus operator  16:03
<fwereade> rogpeppe1, that feels a bit more plausible  16:04
<rogpeppe1> fwereade: that's probably the most unintrusive fix, but may not be great performance-wise  16:04
<fwereade> rogpeppe1, bah, v8 is 2.4 as well, isn't it?  16:05
<rogpeppe1> fwereade: v8?  16:05
<fwereade> rogpeppe1, sexy fast javascript engine  16:05
<rogpeppe1> fwereade: ah, no idea sorry  16:05
<rogpeppe1> fwereade: i'd be slightly surprised if it made a huge difference for stuff that simple  16:06
<rogpeppe1> fwereade: but if it does, then we should do it, because all transactions use js.  16:06
<rogpeppe1> fwereade: so it could speed up our bottom line  16:06
<fwereade> rogpeppe1, I guess that's one to benchmark at some point in the future, doesn't feel like a priority at this stage  16:09
<rogpeppe1> fwereade: we could do with *some* benchmarks :-)  16:10
<fwereade> rogpeppe1, sure, but I think we're currently better off focusing on what we can fix ourselves without swapping out the underlying db  16:12
<rogpeppe1> fwereade: yeah  16:12
<rogpeppe1> fwereade: but i'd like to see at least one benchmark of presence performance so that we know that it's plausible given the number of pings/second that we already know might happen.  16:13
<fwereade> rogpeppe1, I *think* we currently know that presence as it is is not the bottleneck -- but yeah, if we're changing it, we should check the changes don't screw us at scale  16:16
<rogpeppe1> fwereade: BTW, I may be wrong about transactions using js - I had that recollection, but can't now find any evidence for it.  16:36
<fwereade> rogpeppe1, I think if they use $where, and possibly a couple of other bits, they still use the JS engine  16:41
<rogpeppe1> fwereade: no occurrence of $where that i can see  16:42
* fwereade is stupid, because he didn't think about force-destroying state servers, and grumpy because he just copied the form of DestroyMachines despite his initial discomfort and already regrets it  17:03
<jam> fwereade, rogpeppe1: note that there *is* an abstraction between the unit that is pinging and the actual Pinger. When you start a pinger you get a unique ID and then record the Unit => Pinger ID mapping. So it is conceivable that whenever you reconnect you just always require a new PingerID so you can't get double pings.  17:14
<fwereade> jam,     p := presence.NewPinger(u.st.presence, u.globalKey())  17:15
<fwereade> ...?  17:15
<jam> so while you might have 2 things saying "mysql/0 is alive", they are writing to different slots.  17:15
<fwereade> jam, ah ok  17:16
<fwereade> jam, hmm  17:16
<jam> fwereade: fieldKey, fieldBit I believe  17:16
<jam> globalKey gets mapped into an "integer field"  17:16
<fwereade> jam, I am deep in thought about something else so I can't pay proper attention now, can we chat tomorrow please?  17:16
<jam> fwereade: np  17:17
<jam> but there is a Beings.sequence that gets updated by 1 every time you call Pinger.prepare  17:17
<jam> (which has an issue of garbage accreting over time, but at least you don't get double pings)  17:18
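
Roughly, the mapping jam is describing: every call to Pinger.prepare takes the next value of a global sequence, and that number selects both an integer field in the per-period presence document and a bit within it. The divisor and key format below are illustrative assumptions, not a quote of the presence package.

    package presence

    import "fmt"

    // slotFor sketches how a pinger's sequence number could pick the
    // (field key, bit) pair it increments each period; because every new
    // pinger gets a fresh sequence number, two pingers for the same agent
    // end up incrementing different bits rather than the same slot.
    func slotFor(seq int64) (fieldKey string, fieldBit uint64) {
        fieldKey = fmt.Sprintf("f%x", seq/63) // which integer field in the period document
        fieldBit = 1 << uint64(seq%63)        // which bit within that field
        return fieldKey, fieldBit
    }
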
<rogpeppe1> jam: thank you for reminding me of that  17:20
<jam> rogpeppe1: yeah, it does help a bit for this case (which I'm sure is why it was done, because otherwise double pings to the same slot destroy the whole record)  17:20
<jam> because double increment ==> bad bad stuff  17:20
<rogpeppe1> jam: so in fact we can have two agents pinging at the same time without risk of overflow. not sure what happens about the being info in that case though.  17:21
<jam> if you didn't need pure density  17:21
<jam> you could inc by 2  17:21
<jam> rogpeppe1: I'm pretty sure it just shows alive  17:21
<rogpeppe1> jam: i think it'll show status for only one of them - probably the last one started, but let me check  17:21
<jam> rogpeppe1: yeah I think you're right  17:22
<jam> if cur < seq { delete(w.beingKey, cur)  17:23
<jam> line 411  17:23
<jam> I actually really like the idea of putting in at least a little buffering, so a double ping doesn't make everything look offline. but we could play around a few ways with that.  17:25
<rogpeppe1> jam: i'm not quite sure what you mean there  17:27
<rogpeppe1> jam: does a double ping make everything look offline?  17:27
<jam> rogpeppe1: for example if you changed the sequence generator to "inc 2" instead of "inc 1"  17:27
<jam> rogpeppe1: right now, if all pingers are active  17:27
<jam> then all bits get set  17:27
<jam> and the ping code uses "inc $bit"  17:27
<jam> which means if you double increment your bit  17:27
<jam> it overflows  17:27
<rogpeppe1> jam: ah, i see  17:27
<jam> and if all pingers are active  17:28
<jam> they all overflow  17:28
<jam> and then...  17:28
<jam> none are set  17:28
<rogpeppe1> jam: but with the unique ids, it should never be able to happen, should it?  17:28
<jam> so if you only used every other bit then a single overflow can't cascade  17:28
<rogpeppe1> jam: i see what you mean now  17:28
<jam> rogpeppe1: your estimation of "should never be able to happen" seems to be a different probability than mine :)  17:28
<jam> "never" is a strong word  17:28
<jam> under a properly executing system it shouldn't happen  17:29
<rogpeppe1> jam: can you see a way that it can happen with the current code?  17:29
<jam> but that isn't what you defensively code against  17:29
<rogpeppe1> jam: given that each new pinger gets a unique id  17:29
<jam> rogpeppe1: so if the Pinger was running agent side, sent an API request and then connected to another API server and sent it again.  17:29
<jam> I think the way we have it set up, using the atomic increment to get unique ids, means we're ok  17:30
<rogpeppe1> jam: sounds like you're assuming something other than the current code there? (i.e. something that doesn't check not to update the same id twice in the same time slot)  17:31
<jam> nothing actually checks to not update the slot  17:31
<rogpeppe1> jam: line 599?  17:31
<jam> rogpeppe1: so I think with the code we have, we're reasonably safe. I think the design is such that it wouldn't be too hard for a bug in the code to break something in the future  17:32
<jam> i'm not a big fan of code design that escalates bugs  17:33
<rogpeppe1> jam: yeah; doubling the space probably isn't too bad, and we can at least have some kind of record that things aren't working  17:33
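
To make the overflow concrete: each active pinger owns one bit of a shared integer and "pings" by incrementing it, so incrementing the same bit twice carries into a neighbouring pinger's bit. A toy illustration in plain Go arithmetic (not the actual mongo update):

    package main

    import "fmt"

    func main() {
        var slot uint64      // the shared per-period field, one bit per pinger
        const myBit = 1 << 3 // bit assigned to this pinger's sequence number

        slot += myBit // first ping: bit 3 set, as intended
        slot += myBit // accidental double ping: bit 3 clears and the carry sets bit 4

        fmt.Printf("%b\n", slot) // prints 10000 - a different pinger now looks alive

        // Handing out sequence numbers in steps of two ("inc 2"), as suggested
        // above, would leave every other bit unowned, so a single carry could
        // not spill into a live pinger's slot.
    }
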
<mattyw> does anyone know if make check gets run on merge now?  17:59
<fwereade> mattyw, sorry, I don't know  18:05
<fwereade> does anyone have any idea what's going on with tests for JobManageState vs JobManageEnviron in state.Machine?  18:16
<fwereade> we seem to use one or the other at random  18:16
<rogpeppe1> fwereade: in state/machine_test.go?  18:35
<rogpeppe1> fwereade: i expect it's just random  18:36
<fwereade> rogpeppe1, heh :)  18:36
<rogpeppe1> fwereade: i see only two occurrences of JobManageState in state/machine_test.go, and they look reasonable there  18:43
* rogpeppe1 finishes for the day  18:48
<fwereade> rogpeppe1, sorry phone -- enjoy your evening  19:01
* thumper digs through the emails  19:59
* thumper puts his head down to see if he can get a couple of hours of solid coding prior to the gym  20:49
* fwereade ponders the sheer awfulness of writing tests that try to set up state  20:52
* fwereade is going to go and write something a *bit* less crazy  20:53
<thumper> \o/  20:54
* fwereade was about to give up in disgust already, but was heartened by thumper's joy  21:01
<thumper> fwereade: it is well worth the effort to work out how to make tests easier to write  21:01
<fwereade> thumper, yeah, indeed, it's the tangledness of the existing charms stuff that's putting me off  21:02
<fwereade> thumper, all I wanted to do was add one fricking api method  21:02
<thumper> I've just realized that I need to tease apart my kvm bits now  21:02
<thumper> before it gets too entangled  21:02
<thumper> as I was just about to move more shit around  21:02
<thumper> it is about to get crazy :)  21:02
<fwereade> thumper, ok, I am *not* going to do it *now*, because landing this is more important... but I *am* going to sack off my other responsibilities as much as possible tomorrow so I can deuglify some of this  21:04
<thumper> :)  21:04
<thumper> wallyworld_: I have three merge proposals that are all pretty trivial  22:13
<thumper> https://code.launchpad.net/~thumper/juju-core/fix-add-machine-test/+merge/194753 https://code.launchpad.net/~thumper/juju-core/container-interface/+merge/194757 and https://code.launchpad.net/~thumper/juju-core/container-userdata/+merge/194759  22:15
<wallyworld_> thumper: looking  22:51
<fwereade> wallyworld_, thumper: https://codereview.appspot.com/24790044 would be nice if you have time -- churnier than I'd like, but better than not churning, I think  23:10
<wallyworld_> fwereade: looking  23:11
<fwereade> wallyworld_, cheers  23:11
* fwereade sleep now  23:11
<wallyworld_> nighty night  23:11
