/srv/irclogs.ubuntu.com/2013/11/11/#juju-dev.txt

<davecheney> services:  00:14
<davecheney>   gccgo1:  00:14
<davecheney>     charm: local:raring/gccgo-12  00:14
<davecheney>     exposed: false  00:14
<davecheney>     units:  00:14
<davecheney>       gccgo1/0:  00:14
<davecheney>         agent-state: installed  00:14
<davecheney> nice  00:14
<davecheney> the agent now tells you when it is done installing  00:14
<davecheney> it used to say 'pending' until it hit started  00:14
<thumper> morning  00:23
<thumper> wallyworld_: hey there  00:23
<wallyworld_> yello  00:23
<thumper> wallyworld_: got time to chat?  00:27
<wallyworld_> ok  00:27
* thumper fires up a hangout  00:27
<thumper> wallyworld_: https://plus.google.com/hangouts/_/76cpj4l2lgncclri44ngapjg78?hl=en  00:29
<bigjools> if jam is awake, he's going to get an awesome view of a re-entering soyuz in about 40 minutes.  01:53
=== axw_ is now known as axw
<thumper> axw_: around?  02:58
* thumper has a headache  02:58
<thumper> perhaps more coffee needed  02:58
<thumper> jam: ping  03:01
<thumper> axw__: the real axw?  03:01
<axw__> thumper: indeed, my ISP is rubbish lately :(  03:02
=== axw__ is now known as axw
<thumper> axw: can I get you on a hangout?  03:02
<axw> thumper: certainly, just a minute  03:02
=== axw_ is now known as axw
=== thumper is now known as thumper-afk
<jam> hey wallyworld_, you around for 1:1 ?  06:04
<wallyworld_> sure  06:04
<jam> bigjools: damn, wish I knew about that. I do wake up around that time, I'm just not at my computer yet to see your message.  06:05
<bigjools> jam: they re-enter over the middle east every time, so you get another in about 3 months  06:05
<bigjools> jam: not sure if you can see the plasma trail though, but you'll definitely see a burning thing hurtling through the atmosphere  06:06
=== thumper-afk is now known as thumper
<thumper> fwereade: ping  06:52
<fwereade> thumper, pong, if you're still round  08:29
<thumper> fwereade: I'm back around  08:50
<thumper> fwereade: hangout?  08:50
<fwereade> thumper, sure  08:50
<rogpeppe> mornin' all  08:53
<axw> morning rogpeppe  08:55
<rogpeppe> axw: hiya  08:55
<mgz> right, feeling a good bit less dodgy after the weekend  09:02
<rogpeppe1> mgz: were you dodgy before? sorry to hear that.  09:32
<mgz> rogpeppe1: just generally under the weather, can now talk without croaking again  09:35
<rogpeppe1> mgz: that's good :-)  09:35
<jam> TheMue: standup ?  10:57
<jam> https://plus.google.com/hangouts/_/calendar/am9obi5tZWluZWxAY2Fub25pY2FsLmNvbQ.mf0d8r5pfb44m16v9b2n5i29ig  10:57
<TheMue> jam: ouch, missed it  10:59
<mattyw> jam, axw I've merged my branch with trunk if you want to take another look: https://code.launchpad.net/~mattyw/juju-core/gocheck_included_in_build_fix/+merge/192411  11:07
<jam> thanks mattyw, I marked it approved to land again.  11:12
<mattyw> jam, thanks very much  11:13
<mattyw> fwereade, could you give me a shout when you have a spare 10 minutes? whenever is good for you  11:43
<fwereade> mattyw, hey dude, would you try again in about 1.5 hours please? that's my best guess :(  11:43
<mattyw> fwereade, no problem, thanks  11:45
<jam> mattyw: fwiw, your earlier gocheck patch landed  12:03
<mattyw> jam, thanks very much for your help  12:04
* TheMue => lunch  12:18
<axw> mattyw: sorry, I missed the merge failure. thanks for fixing.  12:20
<mattyw> axw, no problem, thanks for reviewing  12:22
<jam> fwereade: I'm back for a bit, but I should go do homework. Can I touch back with you in 30 min?  12:58
<fwereade> jam, sure, I'm still digging  12:59
<fwereade> dimitern, jam: hey, there was a bug with the unit agent bouncing as it departed relations; did we ever resolve that one?  13:08
<fwereade> dimitern, jam: because if we didn't, I'm starting to wonder whether that's implicated in the immortal relations we're seeing  13:09
<dimitern> fwereade, I'm not sure we did fix it  13:12
<fwereade> dimitern, cheers  13:14
<jam> fwereade: I'm back if we want to chat now  13:37
<jam> fwereade: I don't think I followed that bug, so I don't know if it is fixed or not  13:37
<fwereade> jam, it's not, I've just verified it  13:43
<jam> fwereade: as in you triggered the unit agent to bounce while tearing down  13:43
<jam> ?  13:43
<fwereade> jam, there's an error in uniter.Filter  13:43
<fwereade> jam, any time a relation gets removed it bounces the unit agent  13:44
<fwereade> jam, trying to figure out if that could cause what we're seeing  13:44
<fwereade> jam, it's certainly not intended behaviour  13:44
<jam> fwereade: well, bouncing an agent during normal operation doesn't sound very good.  13:44
<jam> Would it come back up if things were set to dying?  13:45
<fwereade> it comes up fine  13:46
<jam> fwereade: but does it come back up without finishing what it was trying to do?  13:46
<fwereade> jam, (so we didn't notice it for a while)  13:46
<jam> I know we had that for some other teardown event, where the process would die and then come back thinking all was fine (destroy-machine of a manually provisioned machine, I think)  13:47
<fwereade> jam, and I think it *usually* does the right thing, because the relation can't actually *be* removed until the unit agent has handled it...  13:47
<fwereade> jam, *but* there's some funky sequence-breaking for juju-info relations  13:47
<fwereade> jam, so I need to figure out wtf is going on more-or-less from scratch there  13:47
<jam> fwereade: I don't see the string "juju-info" in Uniter  13:49
<fwereade> jam, IsImplicit  13:50
<jam> fwereade: it does seem to have special handling of Dying in worker/uniter/uniter.go  13:50
<jam> (set it do dying, but if that fails check if it is implicit)  13:51
<jam> set it *to* dying  13:51
<jam> anyway, I need to go grab dinner for my son, if you need anything you can leave a note and I'll try to check later. (Or email)  13:52
<fwereade> jam, will do  13:54
<hazmat> fwereade, any time a relation gets removed it bounces the unit agent -> that explains another bug report..  14:48
<hazmat> namely config-changed executing post relation-broken  14:49
<fwereade> hazmat, ha!  14:49
<fwereade> hazmat, well spotted  14:49
<fwereade> hazmat, I expected that to be a quick fix but it'll only be a quick*ish* fix -- can't quite driveby it, I'm making sure I get destroy-machine --force done first  14:49
<hazmat> fwereade, sounds good.. the machine one is priority.. the config-change/broken affected adam_g with ostack charm dev, not in the field per se.  14:50
<hazmat> fwereade, fwiw filed it as bug 1250106  14:52
<fwereade> hazmat, cheers  14:52
<TheMue> dimitern: ping  15:08
<dimitern> TheMue, pong  15:15
<TheMue> dimitern: just wanted to ask you about the background of machinePinger in apiserver/admin.go  15:17
<dimitern> TheMue, yeah?  15:17
<TheMue> dimitern: it wraps presence.Pinger, only Stop() is redefined to call Kill() at the end  15:17
<TheMue> dimitern: can you tell me more about the reason behind it?  15:17
<dimitern> TheMue, yes, so all resources in the apiserver need a Stop() method that will stop them  15:18
<dimitern> TheMue, the pinger on the other hand does not stop immediately when you call Stop() on it; if you take a look at its implementation you'll see that Kill() is what we need to call, that's why Stop() is redefined to call Kill() on a pinger  15:19
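
For context, a minimal sketch of the wrapper dimitern describes above. This is not quoted from apiserver/admin.go; the import path and method bodies are assumptions pieced together from this conversation, using only the presence.Pinger Stop()/Kill() methods mentioned here.

    package apiserver

    import "launchpad.net/juju-core/state/presence" // import path assumed for this era of juju-core

    // machinePinger adapts the presence pinger to the apiserver's notion of a
    // resource (anything with a Stop method), while ensuring that stopping the
    // resource also marks the presence slot dead immediately.
    type machinePinger struct {
        *presence.Pinger
    }

    // Stop stops the pinger and then kills it, so the agent no longer shows as
    // alive once the API connection's resources are torn down.
    func (p *machinePinger) Stop() error {
        if err := p.Pinger.Stop(); err != nil {
            return err
        }
        return p.Pinger.Kill()
    }
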
<fwereade> dimitern, why would we Kill()?  15:23
<fwereade> dimitern, I don't think a connection dropping is reason enough to start shouting that the unit's down  15:23
<dimitern> fwereade, because Stop is not guaranteed to stop it immediately  15:23
<fwereade> dimitern, that's the point of pinger  15:23
<TheMue> fwereade: ah, just wanted to ask after reading the code  15:23
<dimitern> fwereade, well, I remember discussing it with rogpeppe1 back then when I implemented it  15:24
<fwereade> dimitern, we don't want to raise the alarm as soon as we get some indication something *might* be wrong  15:24
<fwereade> dimitern, we only want to do that when we *know* it's bad  15:24
<dimitern> fwereade, i'm not sure I quite get you  15:24
<dimitern> fwereade, the Stop() method is the last thing called in a resource when a connection is already dropped  15:25
<fwereade> dimitern, in particular, an agent restarting to upgrade should *not* kill its pinger  15:25
<fwereade> dimitern, because anything trusting pinger state to be a canary for errors might react to it  15:25
<rogpeppe1> fwereade: on balance, i think i agree - calling Stop means we could bounce the agent without losing the ping presence  15:25
<TheMue> fwereade: I can imagine what you mean, but how to differentiate?  15:25
<dimitern> fwereade, I agree this is a corner case  15:26
<fwereade> TheMue, well, "never kill" is a lot better than "always kill"  15:26
<dimitern> fwereade, if it's not what's desired we can change it to use Stop instead  15:26
<TheMue> fwereade: hehe, ok  15:26
<fwereade> dimitern, rogpeppe1, TheMue: cool, cheers  15:26
<dimitern> fwereade, I was concerned with the fastest detection of a stalled/dropped connection  15:26
<fwereade> dimitern, rogpeppe1, TheMue: I think the only time to Kill the pinger is when the unit's dead  15:27
<fwereade> TheMue, make sure you test that live though  15:27
<fwereade> TheMue, and test it hard  15:27
<fwereade> TheMue, ...and actually... bugger  15:28
<TheMue> fwereade: the hard tests looked fine so far, but I now have to see how I do a "simple" hiccup  15:28
<fwereade> TheMue, dimitern, rogpeppe1: am I right in thinking that the replacement presence module broke the (effective) idempotency of a ping?  15:28
<rogpeppe1> fwereade: what replacement presence module?  15:28
<fwereade> rogpeppe1, niemeyer's mongo version  15:29
<rogpeppe1> fwereade: hmm, let me have a look  15:29
<fwereade> rogpeppe1, TheMue, dimitern: if it's not safe to have N pingers for the same node, I think we might have to Kill() anyway :(((  15:29
<dimitern> fwereade, sounds reasonable  15:30
<dimitern> fwereade, and not such a big improvement to have stop vs kill anyway  15:30
<dimitern> fwereade, what of bouncing agents - they are down while restarting, so it's not unusual  15:31
<rogpeppe1> // Never, ever, ping the same slot twice.  15:31
<rogpeppe1> // The increment below would corrupt the slot.  15:31
<fwereade> dimitern, they should not be *reported* as down  15:31
<fwereade> dimitern, if they get reported as down as part of normal operation then the reporting is... unhelpful, at best ;)  15:31
<fwereade> rogpeppe1, well, damn  15:31
<dimitern> fwereade, i agree  15:31
<fwereade> rogpeppe1, that'll need to be fixed for HA anyway  15:31
<dimitern> fwereade, but if the agent is being restarted it *is* down while it starts again, no?  15:32
<TheMue> s/"down"/"indifferent"/g ;)  15:32
<rogpeppe1> fwereade: i *think* that means that Stop is currently broken  15:32
<fwereade> dimitern, "down" means "whoa, something's really screwed up, go and fix it"  15:32
<dimitern> fwereade, really?  15:32
<dimitern> fwereade, didn't occur to me before :)  15:32
<dimitern> fwereade, I always thought of it as an intermediate state  15:33
<fwereade> dimitern, the intent was that any agent showing "down" should be reporting a real problem  15:33
<TheMue> dimitern: the bug I'm working on has it after killing a machine the hard way  15:34
<dimitern> fwereade, ah, ok then - so my assumption was based on our already flawed implementation :)  15:34
<fwereade> dimitern, yeah -- good fix, thanks ;p  15:34
<rogpeppe1> fwereade: do you know if we might be able to change things to use a more recent mongo version?  15:35
<rogpeppe1> fwereade: 'cos that could fix things in one easy swoop (and backwardly compatibly)  15:36
<fwereade> rogpeppe1, with $xor?  15:39
<rogpeppe1> fwereade: $or, but yes  15:39
<rogpeppe1> fwereade: (xor wouldn't be idempotent...)  15:39
<fwereade> rogpeppe1, I fear it would be impractical given the trouble we've had with mongo already  15:39
<fwereade> rogpeppe1, d'oh  15:39
<rogpeppe1> fwereade: it may be worth investigating - we should probably change to using a more recent version of mongo before 14.04 anyway  15:40
<rogpeppe1> fwereade: and perhaps most of the required procedures/mechanisms are already in place from the last time  15:41
<rogpeppe1> fwereade: so it *may* not be as difficult this time  15:41
<fwereade> rogpeppe1, yeah... I have no idea what it'd actually take, though -- mgz, can you opine here?  15:41
<TheMue> fwereade: regarding the machinePinger and our discussion last week, what do you think now? my current tests are fine and kill 3 minutes after the last ping.  15:41
<fwereade> TheMue, the presence problems are freaking me out now  15:42
* TheMue can imagine what fwereade means without knowing that term ;)  15:43
<fwereade> TheMue, as discussed just above -- more than one pinger is a problem  15:43
<fwereade> TheMue, so if an agent reconnected, somehow leaving a zombie connection lying around... we'd break presence state for some *other* agent  15:44
<TheMue> fwereade: so the machine and all units would optimally share one presence pinger?  15:45
<fwereade> TheMue, I don't see how that'd help?  15:45
<fwereade> TheMue, we want to know, for each agent, whether it's reasonable to assume it's active  15:46
<TheMue> fwereade: just tried to find different words  15:46
<TheMue> fwereade: yeah, so the "physical pinging" would carry additional "logical pinging" aka machine or unit id  15:46
<TheMue> *loudThinking*  15:47
<fwereade> TheMue, rogpeppe1: pre-HA, would it be plausible/helpful to kill each old agent connection when a new one was made for that agent?  15:47
<TheMue> fwereade: doesn't feel good  15:47
<fwereade> TheMue, rogpeppe1: given HA, I think we need a presence module that works with multiple pingers regardless though... right?  15:47
<TheMue> fwereade: yep  15:48
<rogpeppe1> fwereade: i'm not quite sure if that follows  15:48
<fwereade> rogpeppe1, if an agent reconnects to a different api server soon enough after disconnecting from another, do we not risk double-pings?  15:49
* rogpeppe1 thinks  15:50
<TheMue> fwereade: double pings in the sense of "two are waiting, only one gets, so the other one reacts wrong"?  15:51
<rogpeppe1> fwereade: yes, that's probably right  15:51
<rogpeppe1> fwereade: if the network error is asynchronous and instant  15:52
<fwereade> TheMue, in the sense of "we end up writing to the wrong agent's slot and ARRRGH"  15:52
<rogpeppe1> fwereade: so even if we're only executing pings explicitly for an agent, the ping can be in progress when the connection is made to another api server and another ping made  15:52
<fwereade> rogpeppe1, it feels possible, at least  15:53
<fwereade> rogpeppe1, I wouldn't want to bet anything on it not happening  15:53
<rogpeppe1> fwereade: it would be more possible if we didn't wait some time after bouncing  15:53
<rogpeppe1> fwereade: as it is, i think it's pretty remote  15:53
<rogpeppe1> fwereade: there's definitely more possibility if we're running the pings as an async process within the API server  15:54
<rogpeppe1> fwereade: i think we can probably make the presence package more robust without changing its basic representation.  15:56
<rogpeppe1> fwereade: by adding a transaction when starting to ping that verifies that no one else is pinging that same id.  15:57
<fwereade> rogpeppe1, isn't the whole point of presence that it *doesn't* involve transactions?  15:58
<rogpeppe1> fwereade: i was thinking a single transaction to initiate a pinger might be ok - none of the other operations require a transaction  15:59
<rogpeppe1> fwereade: i.e. one transaction for the entire lifetime of the pinger  15:59
<rogpeppe1> fwereade: there may be a cleverer way of doing it that doesn't rely on a transaction.  16:00
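
A rough sketch of the single-transaction guard rogpeppe1 is proposing, written against mgo/txn. The helper name, collection name, and document shape are invented for illustration; only the txn.Op/txn.DocMissing usage reflects the real mgo/txn API.

    package presence

    import (
        "labix.org/v2/mgo/bson" // import paths assumed for 2013-era mgo
        "labix.org/v2/mgo/txn"
    )

    // claimPresenceKey is a hypothetical helper run once when a pinger starts:
    // it asserts that no claim document for this key exists yet, so a second
    // pinger for the same agent fails fast instead of corrupting the shared slot.
    func claimPresenceKey(runner *txn.Runner, key, pingerID string) error {
        ops := []txn.Op{{
            C:      "presence.claims", // invented collection name
            Id:     key,               // e.g. the agent's global key
            Assert: txn.DocMissing,    // abort if the key is already claimed
            Insert: bson.M{"pinger-id": pingerID},
        }}
        return runner.Run(ops, "", nil)
    }
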
<fwereade> rogpeppe1, I'm not quite seeing it myself  16:02
<rogpeppe1> fwereade: we could always use a little bit of javascript instead of + too. if((x / (1<<slot)) % 2 == 0){x += 1<<slot}  16:03
<rogpeppe1> fwereade: assuming mongo has a modulus operator  16:03
<fwereade> rogpeppe1, that feels a bit more plausible  16:04
<rogpeppe1> fwereade: that's probably the most unintrusive fix, but may not be great performance-wise  16:04
<fwereade> rogpeppe1, bah, v8 is 2.4 as well, isn't it?  16:05
<rogpeppe1> fwereade: v8?  16:05
<fwereade> rogpeppe1, sexy fast javascript engine  16:05
<rogpeppe1> fwereade: ah, no idea sorry  16:05
<rogpeppe1> fwereade: i'd be slightly surprised if it made a huge difference for stuff that simple  16:06
<rogpeppe1> fwereade: but if it does, then we should do it, because all transactions use js.  16:06
<rogpeppe1> fwereade: so it could speed up our bottom line  16:06
<fwereade> rogpeppe1, I guess that's one to benchmark at some point in the future, doesn't feel like a priority at this stage  16:09
<rogpeppe1> fwereade: we could do with *some* benchmarks :-)  16:10
<fwereade> rogpeppe1, sure, but I think we're currently better off focusing on what we can fix ourselves without swapping out the underlying db  16:12
<rogpeppe1> fwereade: yeah  16:12
<rogpeppe1> fwereade: but i'd like to see at least one benchmark of presence performance so that we know that it's plausible given the number of pings/second that we already know might happen.  16:13
<fwereade> rogpeppe1, I *think* we currently know that presence as it is is not the bottleneck -- but yeah, if we're changing it, we should check the changes don't screw us at scale  16:16
<rogpeppe1> fwereade: BTW, I may be wrong about transactions using js - I had that recollection, but can't now find any evidence for it.  16:36
<fwereade> rogpeppe1, I think if they use $where, and possibly a couple of other bits, they still use the JS engine  16:41
<rogpeppe1> fwereade: no occurrence of $where that i can see  16:42
* fwereade is stupid, because he didn't think about force-destroying state servers, and grumpy because he just copied the form of DestroyMachines despite his initial discomfort and already regrets it  17:03
<jam> fwereade, rogpeppe1: note that there *is* an abstraction between the unit that is pinging and the actual Pinger. When you start a pinger you get a unique ID and then record the Unit => Pinger ID mapping. So it is conceivable that whenever you reconnect you just always require a new PingerID so you can't get double pings.  17:14
<fwereade> jam,     p := presence.NewPinger(u.st.presence, u.globalKey())  17:15
<fwereade> ...?  17:15
<jam> so while you might have 2 things saying "mysql/0 is alive", they are writing to different slots.  17:15
<fwereade> jam, ah ok  17:16
<fwereade> jam, hmm  17:16
<jam> fwereade: fieldKey, fieldBit I believe  17:16
<jam> globalKey gets mapped into an "integer field"  17:16
<fwereade> jam, I am deep in thought about something else so I can't pay proper attention now, can we chat tomorrow please?  17:16
<jam> fwereade: np  17:17
<jam> but there is a Beings.sequence that gets updated by 1 every time you call Pinger.prepare  17:17
<jam> (which has an issue of garbage accreting over time, but at least you don't get double pings)  17:18
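
Roughly, the mapping jam is describing: every call to Pinger.prepare takes the next value of a global sequence, and that number selects both an integer field in the per-period presence document and a bit within it. The divisor and key format below are illustrative assumptions, not a quote of the presence package.

    package presence

    import "fmt"

    // slotFor sketches how a pinger's sequence number could pick the
    // (field key, bit) pair it increments each period; because every new
    // pinger gets a fresh sequence number, two pingers for the same agent
    // end up incrementing different bits rather than the same slot.
    func slotFor(seq int64) (fieldKey string, fieldBit uint64) {
        fieldKey = fmt.Sprintf("f%x", seq/63) // which integer field in the period document
        fieldBit = 1 << uint64(seq%63)        // which bit within that field
        return fieldKey, fieldBit
    }
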
<rogpeppe1> jam: thank you for reminding me of that  17:20
<jam> rogpeppe1: yeah, it does help a bit for this case (which I'm sure is why it was done, because otherwise double pings to the same slot destroy the whole record)  17:20
<jam> because double increment ==> bad bad stuff  17:20
<rogpeppe1> jam: so in fact we can have two agents pinging at the same time without risk of overflow. not sure what happens about the being info in that case though.  17:21
<jam> if you didn't need pure density  17:21
<jam> you could inc by 2  17:21
<jam> rogpeppe1: I'm pretty sure it just shows alive  17:21
<rogpeppe1> jam: i think it'll show status for only one of them - probably the last one started, but let me check  17:21
<jam> rogpeppe1: yeah I think you're right  17:22
<jam> if cur < seq { delete(w.beingKey, cur)  17:23
<jam> line 411  17:23
<jam> I actually really like the idea of putting in at least a little buffering, so a double ping doesn't make everything look offline. but we could play around a few ways with that.  17:25
<rogpeppe1> jam: i'm not quite sure what you mean there  17:27
<rogpeppe1> jam: does a double ping make everything look offline?  17:27
<jam> rogpeppe1: for example if you changed the sequence generator to "inc 2" instead of "inc 1"  17:27
<jam> rogpeppe1: right now, if all pingers are active  17:27
<jam> then all bits get set  17:27
<jam> and the ping code uses "inc $bit"  17:27
<jam> which means if you double increment your bit  17:27
<jam> it overflows  17:27
<rogpeppe1> jam: ah, i see  17:27
<jam> and if all pingers are active  17:28
<jam> they all overflow  17:28
<jam> and then...  17:28
<jam> none are set  17:28
<rogpeppe1> jam: but with the unique ids, it should never be able to happen, should it?  17:28
<jam> so if you only used every other bit then a single overflow can't cascade  17:28
<rogpeppe1> jam: i see what you mean now  17:28
<jam> rogpeppe1: your estimation of "should never be able to happen" seems to be a different probability than mine :)  17:28
<jam> "never" is a strong word  17:28
<jam> under a properly executing system it shouldn't happen  17:29
<rogpeppe1> jam: can you see a way that it can happen with the current code?  17:29
<jam> but that isn't what you defensively code against  17:29
<rogpeppe1> jam: given that each new pinger gets a unique id  17:29
<jam> rogpeppe1: so if the Pinger was running agent side, sent an API request and then connected to another API server and sent it again.  17:29
<jam> I think the way we have it set up, using the atomic increment to get unique ids, means we're ok  17:30
<rogpeppe1> jam: sounds like you're assuming something other than the current code there? (i.e. something that doesn't check not to update the same id twice in the same time slot)  17:31
<jam> nothing actually checks to not update the slot  17:31
<rogpeppe1> jam: line 599?  17:31
<jam> rogpeppe1: so I think with the code we have, we're reasonably safe. I think the design is such that it wouldn't be too hard for a bug in the code to break something in the future  17:32
<jam> i'm not a big fan of code design that escalates bugs  17:33
<rogpeppe1> jam: yeah; doubling the space probably isn't too bad, and we can at least have some kind of record that things aren't working  17:33
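
To make the overflow concrete: each active pinger owns one bit of a shared integer and "pings" by incrementing it, so incrementing the same bit twice carries into a neighbouring pinger's bit. A toy illustration in plain Go arithmetic (not the actual mongo update):

    package main

    import "fmt"

    func main() {
        var slot uint64      // the shared per-period field, one bit per pinger
        const myBit = 1 << 3 // bit assigned to this pinger's sequence number

        slot += myBit // first ping: bit 3 set, as intended
        slot += myBit // accidental double ping: bit 3 clears and the carry sets bit 4

        fmt.Printf("%b\n", slot) // prints 10000 - a different pinger now looks alive

        // Handing out sequence numbers in steps of two ("inc 2"), as suggested
        // above, would leave every other bit unowned, so a single carry could
        // not spill into a live pinger's slot.
    }
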
<mattyw> does anyone know if make check gets run on merge now?  17:59
<fwereade> mattyw, sorry, I don't know  18:05
<fwereade> does anyone have any idea what's going on with tests for JobManageState vs JobManageEnviron in state.Machine?  18:16
<fwereade> we seem to use one or the other at random  18:16
<rogpeppe1> fwereade: in state/machine_test.go?  18:35
<rogpeppe1> fwereade: i expect it's just random  18:36
<fwereade> rogpeppe1, heh :)  18:36
<rogpeppe1> fwereade: i see only two occurrences of JobManageState in state/machine_test.go, and they look reasonable there  18:43
* rogpeppe1 finishes for the day  18:48
<fwereade> rogpeppe1, sorry phone -- enjoy your evening  19:01
* thumper digs through the emails  19:59
* thumper puts his head down to see if he can get a couple of hours of solid coding prior to the gym  20:49
* fwereade ponders the sheer awfulness of writing tests that try to set up state  20:52
* fwereade is going to go and write something a *bit* less crazy  20:53
<thumper> \o/  20:54
* fwereade was about to give up in disgust already, but was heartened by thumper's joy  21:01
<thumper> fwereade: it is well worth the effort to work out how to make tests easier to write  21:01
<fwereade> thumper, yeah, indeed, it's the tangledness of the existing charms stuff that's putting me off  21:02
<fwereade> thumper, all I wanted to do was add one fricking api method  21:02
<thumper> I've just realized that I need to tease apart my kvm bits now  21:02
<thumper> before it gets too entangled  21:02
<thumper> as I was just about to move more shit around  21:02
<thumper> it is about to get crazy :)  21:02
<fwereade> thumper, ok, I am *not* going to do it *now*, because landing this is more important... but I *am* going to sack off my other responsibilities as much as possible tomorrow so I can deuglify some of this  21:04
<thumper> :)  21:04
<thumper> wallyworld_: I have three merge proposals that are all pretty trivial  22:13
<thumper> https://code.launchpad.net/~thumper/juju-core/fix-add-machine-test/+merge/194753 https://code.launchpad.net/~thumper/juju-core/container-interface/+merge/194757 and https://code.launchpad.net/~thumper/juju-core/container-userdata/+merge/194759  22:15
<wallyworld_> thumper: looking  22:51
<fwereade> wallyworld_, thumper: https://codereview.appspot.com/24790044 would be nice if you have time -- churnier than I'd like, but better than not churning, I think  23:10
<wallyworld_> fwereade: looking  23:11
<fwereade> wallyworld_, cheers  23:11
* fwereade sleep now  23:11
<wallyworld_> nighty night  23:11
