[00:14] <davecheney> services:
[00:14] <davecheney>   gccgo1:
[00:14] <davecheney>     charm: local:raring/gccgo-12
[00:14] <davecheney>     exposed: false
[00:14] <davecheney>     units:
[00:14] <davecheney>       gccgo1/0:
[00:14] <davecheney>         agent-state: installed
[00:14] <davecheney> nice
[00:14] <davecheney> the agent now tells you when it is done installing
[00:14] <davecheney> it used to say 'pending' until it hit started
[00:23] <thumper> morning
[00:23] <thumper> wallyworld_: hey there
[00:23] <wallyworld_> yello
[00:27] <thumper> wallyworld_: got time to chat?
[00:27] <wallyworld_> ok
[00:27]  * thumper fires up a hangout
[00:29] <thumper> wallyworld_: https://plus.google.com/hangouts/_/76cpj4l2lgncclri44ngapjg78?hl=en
[01:53] <bigjools> if jam is awake, he's going to get an awesome view of a re-entering soyuz in about 40 minutes.
[02:58] <thumper> axw_: around?
[02:58]  * thumper has a headache
[02:58] <thumper> perhaps more coffee needed
[03:01] <thumper> jam: ping
[03:01] <thumper> axw__: the real axw?
[03:02] <axw__> thumper: indeed, my ISP is rubbish lately :(
[03:02] <thumper> axw: can I get you on a hangout?
[03:02] <axw> thumper: certainly, just a minute
[06:04] <jam> hey wallyworld_, you around for 1:1 ?
[06:04] <wallyworld_> sure
[06:05] <jam> bigjools: damn, wish I knew about that. I do wake up around that time, I just am not at my computer yet to see your message.
[06:05] <bigjools> jam: they re-enter over the middle east every time, so you get another in about 3 months
[06:06] <bigjools> jam: not sure if you can see the plasma trail though, but you'll definitely see a burning thing hurtling through the atmosphere
[06:52] <thumper> fwereade: ping
[08:29] <fwereade> thumper, pong, if you're still around
[08:50] <thumper> fwereade: I'm back around
[08:50] <thumper> fwereade: hangout?
[08:50] <fwereade> thumper, sure
[08:53] <rogpeppe> mornin' all
[08:55] <axw> morning rogpeppe
[08:55] <rogpeppe> axw: hiya
[09:02] <mgz> right, feeling a good bit less dodgy after the weekend
[09:32] <rogpeppe1> mgz: were you dodgy before? sorry to hear that.
[09:35] <mgz> rogpeppe1: just generally under the weather, can talk without croaking again now
[09:35] <rogpeppe1> mgz: that's good :-)
[10:57] <jam> TheMue: standup ?
[10:57] <jam> https://plus.google.com/hangouts/_/calendar/am9obi5tZWluZWxAY2Fub25pY2FsLmNvbQ.mf0d8r5pfb44m16v9b2n5i29ig
[10:59] <TheMue> jam: ouch, missed it
[11:07] <mattyw> jam, axw I've merged my branch with trunk if you want to take another look: https://code.launchpad.net/~mattyw/juju-core/gocheck_included_in_build_fix/+merge/192411
[11:12] <jam> thanks mattyw, I marked it approved to land again.
[11:13] <mattyw> jam, thanks very much
[11:43] <mattyw> fwereade, could you give me a shout when you have a spare 10 minutes? whenever is good for you
[11:43] <fwereade> mattyw, hey dude, would you try again in about 1.5 hours please? that's my best guess :(
[11:45] <mattyw> fwereade, no problem, thanks
[12:03] <jam> mattyw: fwiw, your earlier gocheck patch landed
[12:04] <mattyw> jam, thanks very much for your help
[12:18]  * TheMue => lunch
[12:20] <axw> mattyw: sorry, I missed the merge failure. thanks for fixing.
[12:22] <mattyw> axw, no problem, thanks for reviewing
[12:58] <jam> fwereade: I'm back for a bit, but I should go do homework. Can I touch back with you in 30 min ?
[12:59] <fwereade> jam, sure, I'm still digging
[13:08] <fwereade> dimitern, jam: hey, there was a bug with the unit agent bouncing as it departed relations; did we ever resolve that one?
[13:09] <fwereade> dimitern, jam: because if we didn't I'm starting to wonder whether that's implicated in the immortal relations we're seeing
[13:12] <dimitern> fwereade, I'm not sure we did fix it
[13:14] <fwereade> dimitern, cheers
[13:37] <jam> fwereade: I'm back if we want to chat now
[13:37] <jam> fwereade: I don't think I followed that bug, so I don't know if it is fixed or not
[13:43] <fwereade> jam, it's not, I've just verified it
[13:43] <jam> fwereade: as in you triggered the unit agent to bounce while tearing down
[13:43] <jam> ?
[13:43] <fwereade> jam, there's an error in uniter.Filter
[13:44] <fwereade> jam, any time a relation gets removed it bounces the unit agent
[13:44] <fwereade> jam, trying to figure out if that could cause what we're seeing
[13:44] <fwereade> jam, it's certainly not intended behaviour
[13:44] <jam> fwereade: well, bouncing an agent during normal operation doesn't sound very good.
[13:45] <jam> Would it come back up if things were set to dying?
[13:46] <fwereade> it comes up fine
[13:46] <jam> fwereade: but does it come back up without finishing what it was trying to do?
[13:46] <fwereade> jam, (so we didn't notice it for a while)
[13:47] <jam> I know we had that for some other teardown event. Where the process would die, and then come back thinking all was fine (destroy-machine of a manually provisioned machine, I think)
[13:47] <fwereade> jam, and I think it *usually* does the right thing, because the relation can't actually *be* removed until the unit agent has handled it...
[13:47] <fwereade> jam, *but* there's some funky sequence-breaking for juju-info relations
[13:47] <fwereade> jam, so I need to figure out wtf is going on more-or-less from scratch there
[13:49] <jam> fwereade: I don't see "juju-info" the string in Uniter
[13:50] <fwereade> jam, IsImplicit
[13:50] <jam> fwereade: it does seem to have special handling of Dying in worker/uniter/uniter.go
[13:51] <jam> (set it do dying, but if that fails check if it is implicit)
[13:51] <jam> set it *to* dying
[13:52] <jam> anyway, I need to go grab dinner for my son, if you need anything you can leave a note and I'll try to check later. (Or email)
[13:54] <fwereade> jam, will do
[14:48] <hazmat> fwereade, any time a relation gets removed it bounces the unit agent -> that explains another bug report..
[14:49] <hazmat> namely config-changed executing post relation-broken
[14:49] <fwereade> hazmat, ha!
[14:49] <fwereade> hazmat, well spotted
[14:49] <fwereade> hazmat, I expected that to be a quick fix but it'll only be a quick*ish* fix -- can't quite driveby it, I'm making sure I get destroy-machine --force done first
[14:50] <hazmat> fwereade, sounds good.. the machine one is priority.. the config-change/broken affected adam_g with ostack charm dev, not in the field per se.
[14:52] <hazmat> fwereade, fwiw filed it as bug 1250106
[14:52] <fwereade> hazmat, cheers
[15:08] <TheMue> dimitern: ping
[15:15] <dimitern> TheMue, pong
[15:17] <TheMue> dimitern: just wanted to ask you about the background of machinePinger in apiserver/admin.go
[15:17] <dimitern> TheMue, yeah?
[15:17] <TheMue> dimitern: it wraps presence.Pinger, only Stop() is redefined to call Kill() at the end
[15:17] <TheMue> dimitern: can you tell me more about the reason behind it?
[15:18] <dimitern> TheMue, yes, so all resources in the apiserver need a Stop() method that will stop them
[15:19] <dimitern> TheMue, the pinger on the other hand does not stop immediately when you call Stop() on it, if you take a look at its implementation you'll see that Kill() is what we need to call, that's why Stop() is redefined to call Kill() on a pinger
[15:23] <fwereade> dimitern, why would we Kill()?
[15:23] <fwereade> dimitern, I don't think a connection dropping is reason enough to start shouting that the unit's down
[15:23] <dimitern> fwereade, because Stop is not guaranteed to stop it immediately
[15:23] <fwereade> dimitern, that's the point of pinger
[15:23] <TheMue> fwereade: ah, just wanted to ask after reading the code
[15:24] <dimitern> fwereade, well, I remember discussing it with rogpeppe1 back then when I implemented it
[15:24] <fwereade> dimitern, we don't want to raise the alarm as soon as we get some indication something *might* be wrong
[15:24] <fwereade> dimitern, we only want to do that when we *know* it's bad
[15:24] <dimitern> fwereade, i'm not sure I quite get you
[15:25] <dimitern> fwereade, the Stop() method is the last thing called in a resource when a connection is already dropped
[15:25] <fwereade> dimitern, in particular, an agent restarting to upgrade should *not* kill its pinger
[15:25] <fwereade> dimitern, because anything trusting pinger state to be a canary for errors might react to it
[15:25] <rogpeppe1> fwereade: on balance, i think i agree - calling Stop means we could bounce the agent without losing the ping presence
[15:25] <TheMue> fwereade: I can imagine what you mean, but how to differentiate?
[15:26] <dimitern> fwereade, I agree this is a corner case
[15:26] <fwereade> TheMue, well, "never kill" is a lot better than "always kill"
[15:26] <dimitern> fwereade, if it's not what's desired we can change it to use Stop instead
[15:26] <TheMue> fwereade: hehe, ok
[15:26] <fwereade> dimitern, rogpeppe1, TheMue: cool, cheers
[15:26] <dimitern> fwereade, I was concerned with the fastest detection of a stalled/dropped connection
[15:27] <fwereade> dimitern, rogpeppe1, TheMue: I think the only time to Kill the pinger is when the unit's dead
[15:27] <fwereade> TheMue, make sure you test that live though
[15:27] <fwereade> TheMue, and test it hard
[15:28] <fwereade> TheMue, ...and actually... bugger
[15:28] <TheMue> fwereade: the hard tests looked fine so far, but I now have to see how I do a "simple" hiccup
[15:28] <fwereade> TheMue, dimitern, rogpeppe1: am I right in thinking that the replacement presence module broke the (effective) idempotency of a ping?
[15:28] <rogpeppe1> fwereade: what replacement presence module?
[15:29] <fwereade> rogpeppe1, niemeyer's mongo version
[15:29] <rogpeppe1> fwereade: hmm, let me have a look
[15:29] <fwereade> rogpeppe1, TheMue, dimitern: if it's not safe to have N pingers for the same node, I think we might have to Kill() anyway :(((
[15:30] <dimitern> fwereade, sounds reasonable
[15:30] <dimitern> fwereade, and not such a big improvement to have stop vs kill anyway
[15:31] <dimitern> fwereade, what of bouncing agents - they are down while restarting, so it's not unusual
[15:31] <rogpeppe1> 		// Never, ever, ping the same slot twice.
[15:31] <rogpeppe1> 		// The increment below would corrupt the slot.
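The hazard behind the comment rogpeppe1 quotes can be illustrated with plain integer arithmetic; this is a sketch of the failure mode under the assumption (from the discussion) that presence marks a pinger alive by $inc-ing 1<<slot into a shared integer field:

```go
package main

import "fmt"

func main() {
	const slot = 3
	// One bit per pinger in a shared integer field; a ping is an
	// unconditional increment of 1<<slot. Pinging the same slot
	// twice in one time slice carries the bit into slot+1,
	// corrupting a *different* pinger's state.
	var field uint64
	field += 1 << slot // first ping: bit 3 set
	field += 1 << slot // second ping: bit 3 clears, bit 4 set!
	fmt.Printf("bit %d set: %v\n", slot, field&(1<<slot) != 0)
	fmt.Printf("bit %d set: %v\n", slot+1, field&(1<<(slot+1)) != 0)
}
```

Running this prints `bit 3 set: false` and `bit 4 set: true`: the double-pinged agent looks dead and its neighbour looks (wrongly) alive.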
[15:31] <fwereade> dimitern, they should not be *reported* as down
[15:31] <fwereade> dimitern, if they get reported as down as part of normal operation then the reporting is... unhelpful, at best ;)
[15:31] <fwereade> rogpeppe1, well, damn
[15:31] <dimitern> fwereade, i agree
[15:31] <fwereade> rogpeppe1, that'll need to be fixed for HA anyway
[15:32] <dimitern> fwereade, but if the agent is being restarted it *is* down while it starts again, no?
[15:32] <TheMue> s/"down"/"indifferent"/g ;)
[15:32] <rogpeppe1> fwereade: i *think* that means that Stop is currently broken
[15:32] <fwereade> dimitern, "down" means "whoa, something's really screwed up, go and fix it"
[15:32] <dimitern> fwereade, really?
[15:32] <dimitern> fwereade, didn't occur to me before :)
[15:33] <dimitern> fwereade, I always thought of it as an intermediate state
[15:33] <fwereade> dimitern, the intent was that any agent showing "down" should be reporting a real problem
[15:34] <TheMue> dimitern: the bug I'm working on has it after killing a machine the hard way
[15:34] <dimitern> fwereade, ah, ok then - so my assumption was based on our already flawed implementation :)
[15:34] <fwereade> dimitern, yeah -- good fix, thanks ;p
[15:35] <rogpeppe1> fwereade: do you know if we might be able to change things to use a more recent mongo version?
[15:36] <rogpeppe1> fwereade: 'cos that could fix things in one easy swoop (and backwardly compatibly)
[15:39] <fwereade> rogpeppe1, with $xor?
[15:39] <rogpeppe1> fwereade: $or, but yes
[15:39] <rogpeppe1> fwereade: (xor wouldn't be idempotent...)
[15:39] <fwereade> rogpeppe1, I fear it would be impractical given the trouble we've had with mongo already
[15:39] <fwereade> rogpeppe1, d'oh
[15:40] <rogpeppe1> fwereade: it may be worth investigating - we should probably change to using a more recent version of mongo before 14.04 anyway
[15:41] <rogpeppe1> fwereade: and perhaps most of the required procedures/mechanisms are already in place from the last time
[15:41] <rogpeppe1> fwereade: so it *may* not be as difficult this time
[15:41] <fwereade> rogpeppe1, yeah... I have no idea what it'd actually take, though -- mgz, can you opine here?
[15:41] <TheMue> fwereade: regarding the machinePinger and our discussion last week, what do you think now? my current tests are fine and kill 3 minutes after the last ping.
[15:42] <fwereade> TheMue, the presence problems are freaking me out now
[15:43]  * TheMue can imagine what fwereade means without knowing that term ;)
[15:43] <fwereade> TheMue, as discussed just above -- more than one pinger is a problem
[15:44] <fwereade> TheMue, so if an agent reconnected, somehow leaving a zombie connection lying around... we'd break presence state for some *other* agent
[15:45] <TheMue> fwereade: so the machine and all units would optimally share one presence pinger?
[15:45] <fwereade> TheMue, I don't see how that'd help?
[15:46] <fwereade> TheMue, we want to know, for each agent, whether it's reasonable to assume it's active
[15:46] <TheMue> fwereade: just tried to find different words
[15:46] <TheMue> fwereade: yeah, so the "physical pinging" would carry additional "logical pinging" aka machine or unit id
[15:47] <TheMue> *loudThinking*
[15:47] <fwereade> TheMue, rogpeppe1: pre-HA, would it be plausible/helpful to kill each old agent connections when a new one was made for that agent?
[15:47] <TheMue> fwereade: doesn't feel good
[15:47] <fwereade> TheMue, rogpeppe1: given HA, I think we need a presence module that works with multiple pingers regardless though... right?
[15:48] <TheMue> fwereade: yep
[15:48] <rogpeppe1> fwereade: i'm not quite sure if that follows
[15:49] <fwereade> rogpeppe1, if an agent reconnects to a different api server soon enough after disconnecting from another, do we not risk double-pings?
[15:50]  * rogpeppe1 thinks
[15:51] <TheMue> fwereade: double pings in the sense of "two are waiting, only one gets, so the other one reacts wrong"?
[15:51] <rogpeppe1> fwereade: yes, that's probably right
[15:52] <rogpeppe1> fwereade: if the network error is asynchronous and instant
[15:52] <fwereade> TheMue, in the sense of "we end up writing to the wrong agent's slot and ARRRGH"
[15:52] <rogpeppe1> fwereade: so even if we're only executing pings explicitly for an agent, the ping can be in progress when the connection is made to another api server and another ping made
[15:53] <fwereade> rogpeppe1, it feels possible, at least
[15:53] <fwereade> rogpeppe1, I wouldn't want to bet anything on it not happening
[15:53] <rogpeppe1> fwereade: it would be more possible if we didn't wait some time after bouncing
[15:53] <rogpeppe1> fwereade: as it is, i think it's pretty remote
[15:54] <rogpeppe1> fwereade: there's definitely more possibility if we're running the pings as an async process within the API server
[15:56] <rogpeppe1> fwereade: i think we can probably make the presence package more robust without changing its basic representation.
[15:57] <rogpeppe1> fwereade: by adding a transaction when starting to ping that verifies that no one else is pinging that same id.
[15:58] <fwereade> rogpeppe1, isn't the whole point of presence that it *doesn't* involve transactions?
[15:59] <rogpeppe1> fwereade: i was thinking a single transaction to initiate a pinger might be ok - none of the other operations require a transaction
[15:59] <rogpeppe1> fwereade: i.e. one transaction for the entire lifetime of the pinger
[16:00] <rogpeppe1> fwereade: there may be a cleverer way of doing it that doesn't rely on a transaction.
[16:02] <fwereade> rogpeppe1, I'm not quite seeing it myself
[16:03] <rogpeppe1> fwereade: we could always use a little bit of javascript instead of + too. if((x / (1<<slot)) % 2 == 0){x += 1<<slot}
[16:03] <rogpeppe1> fwereade: assuming mongo has a modulus operator
[16:04] <fwereade> rogpeppe1, that feels a bit more plausible
[16:04] <rogpeppe1> fwereade: that's probably the most unintrusive fix, but may not be great performance-wise
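The guarded update rogpeppe1 sketches at 16:03 is idempotent where the plain increment is not. A rough Go translation of the same check (the real fix would live server-side in mongo, so this is only a model of the arithmetic):

```go
package main

import "fmt"

// pingSlot applies the guarded update from the snippet above:
// only add 1<<slot if that bit is not already set, so repeating
// a ping in the same time slot is a no-op rather than an overflow.
func pingSlot(field uint64, slot uint) uint64 {
	if (field>>slot)%2 == 0 {
		field += 1 << slot
	}
	return field
}

func main() {
	var field uint64
	field = pingSlot(field, 3)
	field = pingSlot(field, 3) // second ping: no change
	fmt.Println(field == 1<<3) // true: bit 3 set exactly once
}
```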
[16:05] <fwereade> rogpeppe1, bah, v8 is 2.4 as well, isn't it?
[16:05] <rogpeppe1> fwereade: v8?
[16:05] <fwereade> rogpeppe1, sexy fast javascript engine
[16:05] <rogpeppe1> fwereade: ah, no idea sorry
[16:06] <rogpeppe1> fwereade: i'd be slightly surprised if it made a huge difference for stuff that simple
[16:06] <rogpeppe1> fwereade: but if it does, then we should do it, because all transactions use js.
[16:06] <rogpeppe1> fwereade: so it could speed up our bottom line
[16:09] <fwereade> rogpeppe1, I guess that's one to benchmark at some point in the future, doesn't feel like a priority at this stage
[16:10] <rogpeppe1> fwereade: we could do with *some* benchmarks :-)
[16:12] <fwereade> rogpeppe1, sure, but I think we're currently better off focusing on what we can fix ourselves without swapping out the underlying db
[16:12] <rogpeppe1> fwereade: yeah
[16:13] <rogpeppe1> fwereade: but i'd like to see at least one benchmark of presence performance so that we know that it's plausible given the number of pings/second that we already know might happen.
[16:16] <fwereade> rogpeppe1, I *think* we currently know that presence as it is is not the bottleneck -- but yeah, if we're changing it, we should check the changes don't screw us at scale
[16:36] <rogpeppe1> fwereade: BTW, I may be wrong about transactions using js - I had that recollection, but can't now find any evidence for it.
[16:41] <fwereade> rogpeppe1, I think if they use $where, and possibly a couple of other bits, they still use the JS engine
[16:42] <rogpeppe1> fwereade: no occurrence of $where that i can see
[17:03]  * fwereade is stupid, because he didn't think about force-destroying state servers, and grumpy because he just copied the form of DestroyMachines despite his initial discomfort and already regrets it
[17:14] <jam> fwereade, rogpeppe1: note that there *is* an abstraction between the unit that is pinging and the actual Pinger. When you start a pinger you get a unique ID and then record the Unit => Pinger ID mapping. So it is conceivable that whenever you reconnect you just always require a new PingerID so you can't get double pings.
[17:15] <fwereade> jam,     p := presence.NewPinger(u.st.presence, u.globalKey())
[17:15] <fwereade>  ...?
[17:15] <jam> so while you might have 2 things saying "mysql/0 is alive", they are writing to different slots.
[17:16] <fwereade> jam, ah ok
[17:16] <fwereade> jam, hmm
[17:16] <jam> fwereade: fieldKey, fieldBit I believe
[17:16] <jam> globalKey gets mapped into an "integer field"
[17:16] <fwereade> jam, I am deep in thought about something else so I can't pay proper attention now, can we chat tomorrow please?
[17:17] <jam> fwereade: np
[17:17] <jam> but there is a Beings.sequence that gets updated by 1 every time you call Pinger.prepare
[17:18] <jam> (which has an issue for garbage accreting over time, but at least you don't get double pings)
[17:20] <rogpeppe1> jam: thank you for reminding me of that
[17:20] <jam> rogpeppe1: yeah, it does help a bit for this case (which I'm sure is why it was done because otherwise double pings to the same slot destroy the whole record)
[17:20] <jam> because double increment ==> bad bad stuff
[17:21] <rogpeppe1> jam: so in fact we can have two agents pinging at the same time without risk of overflow. not sure what happens about the being info in that case though.
[17:21] <jam> if you didn't need pure density
[17:21] <jam> you could inc by 2
[17:21] <jam> rogpeppe1: I'm pretty sure it just shows alive
[17:21] <rogpeppe1> jam: i think it'll show status for only one of them - probably the last one started, but let me check
[17:22] <jam> rogpeppe1: yeah I think you're right
[17:23] <jam> if cur < seq { delete(w.beingKey, cur)
[17:23] <jam> line 411
[17:25] <jam> I actually really like the idea of putting in at least a little buffering, so a double ping doesn't make everything look offline. but we could play around a few ways with that.
[17:27] <rogpeppe1> jam: i'm not quite sure what you mean there
[17:27] <rogpeppe1> jam: does a double ping make everything look offline?
[17:27] <jam> rogpeppe1: for example if you changed the sequence generate to "inc 2" instead of "inc 1".
[17:27] <jam> rogpeppe1: right now, if all pingers are active
[17:27] <jam> then all bits get set
[17:27] <jam> and the ping code uses "inc $bit"
[17:27] <jam> which means if you double increment your bit
[17:27] <jam> it overflows
[17:27] <rogpeppe1> jam: ah, i see
[17:28] <jam> and if all pingers are active
[17:28] <jam> they all overflow
[17:28] <jam> and then...
[17:28] <jam> none are set
[17:28] <rogpeppe1> jam: but with the unique ids, it should never be able to happen, should it?
[17:28] <jam> so if you only used every-other-bit then an single overflow can't cascade
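jam's every-other-bit idea can be modelled the same way: with pinger slots spaced two bits apart, a double increment spills into an unused padding bit instead of a neighbour's slot, so a single corruption can't cascade. A sketch under that assumption (not the real presence layout):

```go
package main

import "fmt"

func main() {
	// Allocate pinger slots on even bits only (0, 2, 4, ...),
	// leaving odd bits as overflow padding.
	const slotA, slotB = 2, 4
	var field uint64
	field += 1 << slotB // neighbour pings normally
	field += 1 << slotA // slot A pings once...
	field += 1 << slotA // ...and again: carry lands in unused bit 3
	// slot A now looks dead (detectably wrong), but slot B survives
	// instead of being corrupted by the carry.
	fmt.Println(field&(1<<slotA) != 0, field&(1<<slotB) != 0)
}
```

Here the output is `false true`: the double-pinged slot is lost, but the overflow stops at the padding bit rather than flipping the neighbour.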
[17:28] <rogpeppe1> jam: i see what you mean now
[17:28] <jam> rogpeppe1: your estimation of "should never be able to happen" seems to be a different probability than mine :)
[17:28] <jam> "never" is a strong word
[17:29] <jam> under a properly executing system it shouldn't happen
[17:29] <rogpeppe1> jam: can you see a way that it can happen with the current code?
[17:29] <jam> but that isn't what you defensively code against
[17:29] <rogpeppe1> jam: given that each new pinger gets a unique id
[17:29] <jam> rogpeppe1: so if the Pinger was running agent side, sent an API request and then connected to another API server and sent it again.
[17:30] <jam> I think the way we have it set up, we use the atomic increment to get unique ids mean we're ok
[17:31] <rogpeppe1> jam: sounds like you're assuming something other than the current code there? (i.e. something that doesn't check not to update the same id twice in the same time slot)
[17:31] <jam> nothing actually checks to not update the slot
[17:31] <rogpeppe1> jam: line 599?
[17:32] <jam> rogpeppe1: so I think with the code we have, we're reasonably safe. I think the design is such that it wouldn't be too hard for a bug in the code to break something in the future
[17:33] <jam> i'm not a big fan of code design that escalates bugs
[17:33] <rogpeppe1> jam: yeah; doubling the space probably isn't too bad, and we can at least have some kind of record that things aren't working
[17:59] <mattyw> does anyone know if make check gets run on merge now?
[18:05] <fwereade> mattyw, sorry, I don't know
[18:16] <fwereade> does anyone have any idea what's going on with tests for JobManageState vs JobManageEnviron in state.Machine?
[18:16] <fwereade> we seem to use one or the other at random
[18:35] <rogpeppe1> fwereade: in state/machine_test.go?
[18:36] <rogpeppe1> fwereade: i expect it's just random
[18:36] <fwereade> rogpeppe1, heh :)
[18:43] <rogpeppe1> fwereade: i see only two occurrences of JobManageState in state/machine_test.go, and they look reasonable there
[18:48]  * rogpeppe1 finishes for the day
[19:01] <fwereade> rogpeppe1, sorry phone -- enjoy your evening
[19:59]  * thumper digs through the emails
[20:49]  * thumper puts his head down to see if he can get a couple of hours of solid coding prior to the gym
[20:52]  * fwereade ponders the sheer awfulness of writing tests that try to set up state
[20:53]  * fwereade is going to go and write something a *bit* less crazy
[20:54] <thumper> \o/
[21:01]  * fwereade was about to give up in disgust already, but was heartened by thumper's joy
[21:01] <thumper> fwereade: it is well worth the effort to work out how to make tests easier to write
[21:02] <fwereade> thumper, yeah, indeed, it's the tangledness of the existing charms stuff that's putting me off
[21:02] <fwereade> thumper, all I wanted to do was add one fricking api method
[21:02] <thumper> I've just realized that I need to tease apart my kvm bits now
[21:02] <thumper> before it gets too entangled
[21:02] <thumper> as I was just about to move more shit around
[21:02] <thumper> it is about to get crazy :)
[21:04] <fwereade> thumper, ok, I am *not* going to do it *now*, because landing this is more important... but I *am* going to sack off my other responsibilities as much as possible tomorrow so I can deuglify some of this
[21:04] <thumper> :)
[22:13] <thumper> wallyworld_: I have three merge proposals that are all pretty trivial
[22:15] <thumper> https://code.launchpad.net/~thumper/juju-core/fix-add-machine-test/+merge/194753 https://code.launchpad.net/~thumper/juju-core/container-interface/+merge/194757 and https://code.launchpad.net/~thumper/juju-core/container-userdata/+merge/194759
[22:51] <wallyworld_> thumper: looking
[23:10] <fwereade> wallyworld_, thumper: https://codereview.appspot.com/24790044 would be nice if you have time -- churnier than I'd like, but better than not churning, I think
[23:11] <wallyworld_> fwereade: looking
[23:11] <fwereade> wallyworld_, cheers
[23:11]  * fwereade sleep now
[23:11] <wallyworld_> nighty night