[00:17] who is supposed to call unit.UnassignFromMachine() ? [00:17] is it the UA on the way out the door ? [00:17] the MA on observing the death of the UA ? [00:18] is it juju remove-unit ? [00:25] davecheney: Nobody calls it at the moment.. the unit dies and should be removed [00:25] davecheney: And morning! [00:29] hey hey [00:29] for the moment, i stuck it in juju remove-unit [00:29] but that may not be correct [00:31] niemeyer: correct my logic: the MA is responsible for removing units that have reached dead from the state [00:32] davecheney: Right.. remove-unit should not unassign or remove [00:34] niemeyer: right, then we have a problem [00:34] niemeyer: https://bugs.launchpad.net/juju-core/+bug/1067127 [00:38] davecheney: Not entirely surprising [00:38] davecheney: We're just now getting to the end of the watcher support for lifecycle, unit dying, etc [00:39] niemeyer: cool [00:39] so, i can add unassignfrommachine to remove-unit [00:39] davecheney: That said, it would be useful to understand what's the missing spot there [00:39] davecheney: I bet it's something simple [00:39] davecheney: But I can't tell if it's in anyone's plate yet [00:39] davecheney: No, that's not the way to go [00:39] niemeyer: let's talk about it this evening [00:40] davecheney: The unit doesn't have to be unassigned, ever [00:40] davecheney: Because it's being removed with the assignment [00:40] niemeyer: can you say that another way [00:40] niemeyer: currently we have two actions, EnsureDying and UnassignFromMachine [00:41] davecheney: The machiner should remove the unit once it's dead [00:41] davecheney: That's probably the missing link [00:41] niemeyer: i agree [00:41] davecheney: So there's no point in unassigning it [00:41] * davecheney reads worker/uniter [00:41] worker/machiner [00:41] davecheney: It should be removed, and then its assignment is gone, whatever it was [00:41] i agree, the machiner should be responsible for calling Unassign [00:41] it is a Machine after all :) [00:42] davecheney: Nobody has to call unassign :) [00:43] niemeyer: well, then there is a bug [00:43] see above [00:43] when you say 'nobody has to call unassign' [00:43] you mean, no person, ie, nobody typing juju remove-unit ? [00:50] davecheney: It should be removed, and then its assignment is gone, whatever it was [00:50] davecheney: The machiner should remove the unit once it's dead [00:50] davecheney: That's probably the missing link [00:50] niemeyer: ok, thanks, understood [00:51] davecheney: No unassignment in the picture [07:02] davecheney, fwereade: morning! [07:02] wrtp, davecheney, heyhey [07:08] morning [07:08] 70 working charms [07:08] whoo hoo! [07:14] morning [07:14] davecheney: cheers [07:14] davecheney: that's very good news [07:39] TheMue, morning [07:39] fwereade: hiya [08:32] fwereade: i'm not sure about the idea of making all hooks in a container mutually exclusive [08:33] wrtp, go on [08:34] fwereade: it seems a bit like unwarranted interaction between independent charms [08:35] fwereade: for instance, one charm might be very slow to execute certain hooks, making another one slow to react [08:36] fwereade: in fact, if one charm's hook hangs up for a while, it would lock out all other charms in the same container, which seems... dubious [08:36] fwereade: isn't apt-get supposed to work if run concurrently with itself? [08:39] wrtp, you don't think that, say, a subordinate might try to make concurrent tweaks to a setting file being changed by the principal?
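
A minimal sketch of the arrangement described above (around 00:41): the machiner removes a unit outright once it reports Dead, and the assignment disappears with the document, so nothing ever needs to call UnassignFromMachine. The Life values, the unit type and removeUnit here are stand-ins for illustration, not the real state or worker/machiner API.

package main

import "fmt"

// Life mirrors the lifecycle states discussed above (alive, dying, dead).
type Life int

const (
	Alive Life = iota
	Dying
	Dead
)

// unit is a stand-in for state.Unit; only the calls needed here are sketched.
type unit struct {
	name string
	life Life
}

func (u *unit) Life() Life { return u.life }

// removeUnit stands in for whatever state call finally deletes the unit
// document; deleting the document takes its machine assignment with it,
// which is why nothing ever calls UnassignFromMachine explicitly.
func removeUnit(u *unit) error {
	fmt.Printf("removing %s (and, with it, its assignment) from state\n", u.name)
	return nil
}

// processUnits is the machiner's side of the bargain: anything Dead gets
// removed; anything Alive or Dying is left for the unit agent to wind down.
func processUnits(units []*unit) error {
	for _, u := range units {
		if u.Life() != Dead {
			continue
		}
		if err := removeUnit(u); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	units := []*unit{{"wordpress/0", Alive}, {"wordpress/1", Dead}}
	if err := processUnits(units); err != nil {
		fmt.Println("error:", err)
	}
}
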
[08:40] fwereade: i think that would be extremely dodgy behaviour [08:40] fwereade: just because it runs in the same container doesn't mean a subordinate has a right to delve into the inner workings of its principal [08:41] fwereade: if it's a setting file not in the charm directory, then it's not so dodgy, but it's fundamentally flawed if there's no locking, because the principal might not be making the change in a hook context. [08:43] wrtp, yeah I'd been thinking of the second case [08:43] wrtp, but please expand on the changes that might be made outside a hook context [08:44] fwereade: it's perfectly reasonable that a charm might trigger some changes in a hook that don't execute synchronously with the hook [08:45] fwereade: for instance, it might have started a local server that manages some changes for it, and the hook might just be telling that server about the changes. [08:45] fwereade: that's an implementation detail, and not a technique we should preclude [08:46] fwereade: my view is that charms should be views as independent concurrent entities [08:46] wrtp, I dunno, I still have a strong instinct that the Right Thing is to explicitly declare all activity outside either a hook context or an error state (in which you're expected to ssh in) is unsafe [08:46] s/views/viewed/ [08:46] fwereade: huh? [08:46] fwereade: so you can't run a server? [08:47] fwereade: isn't that kinda the whole point? [08:47] wrtp, there is juju activity and service activity [08:47] wrtp, the service does what it does, regardless of juju [08:48] fwereade: it seemed to me like you were talking about the subordinate mucking with service setting files [08:48] wrtp, yes, but not stuff inside the charm directory... the settings of the actual service [08:48] fwereade: ok, so that's service activity, right? [08:48] wrtp, ensuring a logging section has particular content, or something [08:48] wrtp, that is juju activity... acting on the service's config [08:49] fwereade: i think it's a grey area [08:49] wrtp, IME it is rare for services to wantonly change their own config files at arbitrary times [08:49] fwereade: "rare" isn't good enough. [08:50] fwereade: we want charms to be able to work with one another regardless of how they're implemented [08:50] fwereade: and it seems to me like it's perfectly reasonable for a charm to start a service which happens to manage another service. [08:51] wrtp, how does allowing parallel hook execution do anything except make it harder for charms to work reliably together? [08:52] fwereade: it means the failure of one charm is independent on the failure of another [08:52] s/on the/of the/ [08:52] fwereade: and in that sense, it makes it easier to charms to work reliably together [08:53] wrtp, sorry, I may be slow today: how is hook failure relevant? [08:53] wrtp, having them execute in parallel makes it *more* likely that hooks will fail due to inappropriately parallelised operations [08:53] fwereade: if i write a command in a hook that happens to hang for a long time (say 15 minutes, trying to download something), then that should not block out any other charms [08:54] fwereade: i think that if you write a subordinate charm, it's your responsibility to make it work correctly when other things are executing concurrently. [08:55] wrtp, and if you write a principal charm, it's also your responsibility to know everything that every subordinate charm might do so yu can implement your side of the locking correctly? [08:55] fwereade: no. i think that kind of subordinate behaviour is... 
insubordinate :-) [08:56] fwereade: i think that we should not think of subordinates as ways to muck directly with the operations of other charms. [08:56] fwereade: if you want that, then you should change the other charms directly. [08:58] fwereade: ISTM that if you've got two things concurrently changing the same settings file (whether running in mutually exclusive hooks or not) then it's a recipe for trouble. [08:58] wrtp, the point is to eliminate the concurrency... [08:58] wrtp, by mandating that if yu want to make a change you must do it in a hook, and serialising hook executions across all units, we do that [08:59] wrtp, the other drawbacks may indeed sink the idea [08:59] wrtp, but I'm pretty sure that doing this gives us a much lower chance of weird and hard-to-repro interactions [08:59] fwereade: yeah, but if you're a principal and you change a settings file, you might be warranted in expecting that it's the same when the next hook is called. [09:00] fwereade: for instance you might just *write* the settings file, rather than read it in and modify it [09:00] wrtp, (it also depends on adding juju-run so that we *can* run commands in a hook context at arbitrary times) [09:01] moin. [09:01] Aram: hiya [09:01] Aram, heyhey [09:03] wrtp, well, it is true that a hook never knows what hook (if any) ran last [09:03] fwereade: i don't think we should be making it easier to write the kind of charms that this would facilitate [09:04] wrtp, what's the solution to the apt issues then? [09:04] fwereade: so is it true that apt does not work if called concurrently? [09:04] wrtp, it appears to be [09:05] fwereade: i would not be averse to providing an *explicit* way to get mutual exclusion across charms in a container [09:06] fwereade: so you could do, e.g.: juju-acquire; apt-get...; juju-release [09:08] fwereade: last thing i saw: [10:04:39] wrtp, it appears to be [09:09] fwereade: it would be better if apt-get was fixed though - that seems to be the root of this suggestion. [09:09] wrtp, sorry -- I was composing something like "that sounds potentially good, please suggest it in a reply" [09:09] wrtp, I still find it hard to believe that apt-get is the only possible legitimate source of concurrency issues [09:10] fwereade_: of course it's not - but we're in a timesharing environment - it's all concurrent and people need to deal with that. [09:10] wrtp, and if it's not then everybody has to carve out their own exceptions for the things they know and care about [09:10] wrtp, and I have a strong suspicion that everyone will figure out that the Best Practice is to grab the lock at the start of the hook and release it at the end [09:11] fwereade_: i think that trying to pretend that in this fluffy juju world everything is sequential and lovely, is going to create systems that are very fragile [09:11] fwereade_: that may well be true for install hooks at any rate [09:11] fwereade_: i'm not so sure about other hooks [09:13] wrtp, I am not trying to "pretend" anything... I am saying we can implement things one way, or another way, and that I think one way might be good. you seem to be asserting that even if we do things sequentially they still won't be sequential [09:14] wrtp, it's not about pretending it's about making a choice [09:14] fwereade_: yeah. 
i'm saying that hook sequencing doesn't necessarily make the actions of a charm sequential [09:14] wrtp, if we only pretend to make choices I agree we'll be screwed there ;) [09:14] wrtp, wait, you have some idea that any charm knows anything about what happened before it was run? [09:15] wrtp, s/any charm/any hook/ [09:15] fwereade_: i think it's reasonable for a charm to assume ownership of some system file. [09:16] wrtp, ok, but that still implies nothing about what the last hook to modify that file was, at any given time [09:17] fwereade_: it means that you know that whatever change was made, it was made by your hooks [09:17] wrtp, I don't remotely care about hook *ordering* in this context... is that the perspective you're considering? [09:17] fwereade_: no, not at all [09:18] wrtp, wait, you were just telling me that "rare" isn't good enough when considering the possibility of, say, a service changing its own config... ISTM that it follows that we must have some magical system which is safe from any and all concurrent modifications (or, really, that every charm author has to build compatible pieces of such a system) [09:19] wrtp, or we have a simple solution, which is, don't run two hooks at a time [09:19] fwereade_: or... don't have one charm that modifies the same things as another. keep out of each others' hair. [09:19] wrtp, so, no apt then [09:20] wrtp, and nothing else that doesn't like consurrent modifications [09:20] fwereade_: apt needs to be fixed. or we need to provide a workaround for that, in particular. [09:20] wrtp, *or* a vast distributed multi-author locking algorithm using new hook commands [09:21] fwereade_: "vast distributed multi-author" ?? [09:21] wrtp, every single charm author has to do the locking dance right [09:22] fwereade_: only if you you're changing something that others might change concurrently. [09:22] fwereade_: i think this all comes down to how we see the role of subordinates [09:22] wrtp, requiring that charm authors have perfect precognition doesn't strike me as helpful ;p [09:24] fwereade_: have you looked at what subordinate charms are out there now, and whether any potentially suffer from these issues? (ignore apt-get issues for the moment) [09:24] wrtp, no, because these sorts of issues are by their very nature subtle and hidden [09:25] fwereade_: i'm not so sure. i think it should be fairly evident if a subordinate is doing stuff that may interact badly with a principal. [09:25] wrtp, I think that if it were that clear, everybody would have spotted the apt problem and worked around it in every single charm [09:26] fwereade_: i assume that hardly anyone uses subordinates yet tbh [09:27] fwereade_: i don't mean evident from behaviour, but evident from what the purpose of the subordinate charm is [09:27] wrtp, right -- my position is that the reason that apt is the only problem we've seen is likely to be because we don't use many yet [09:29] wrtp, IMO it is consistent with the general feel of juju to make it easier, not harder, for charms to play together [09:30] fwereade_: IMO it's also consistent with juju to make independent components that have independent failure modes [09:30] wrtp, we provide a consistent snapshot of remote state in a hook context -- why mess that up by explicitly encouraging inconsistency in local state? [09:30] fwereade_: because we *can't* provide a consistent snapshot of local state? 
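
As an aside on the explicit mutual-exclusion idea floated above (juju-acquire; apt-get...; juju-release): one plausible way to build it is an advisory flock on a per-container lock file shared by all charm hooks in that container. This is only a sketch under that assumption; the lock path and the withContainerLock helper are hypothetical, and no such tool exists in juju at this point.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// lockPath is a hypothetical per-container lock file; every charm hook in
// the container that wants exclusion would flock the same path.
const lockPath = "/tmp/juju-container.lock"

// withContainerLock runs fn while holding an exclusive advisory lock, so,
// for example, two hooks running apt-get at once serialise on it.
func withContainerLock(fn func() error) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return fn()
}

func main() {
	// Roughly what "juju-acquire; apt-get ...; juju-release" would boil down to.
	err := withContainerLock(func() error {
		cmd := exec.Command("apt-get", "install", "-y", "some-package")
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		return cmd.Run()
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "install failed:", err)
		os.Exit(1)
	}
}

The appeal of the explicit form is that only the long-running or conflict-prone step is serialised, rather than every hook in the container; the drawback, as argued above, is that every charm author has to remember to use it.
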
[09:30] wrtp, and yet you seem to consider that adding a class of subtle and hard-to-detect concurrency-based failures is consistent with this goal [09:31] wrtp, we can either have a hook which is equivalent to logging into a machine yourself, or logging into a machine with N concurrent administrators [09:31] wrtp, all making changes at the same time [09:32] wrtp, I don't see how the second scenario is more robust [09:33] fwereade_: if one of those N concurrent adminstrators hangs for ages, the others can continue uninterrupted. i think that's a very useful property. [09:34] wrtp, I think that's a very situation-specific property and not worth introducing this class of bug for [09:34] fwereade_: it means that if i decide to install a subordinate charm, the principal service can carry on regardless. [09:34] wrtp, I feel if we ever do something like this it should be a release/reacquire pair around the long-running operations [09:34] wrtp, making people have to lock by default seems really unhelpful to me [09:36] fwereade_: tbh, i'm very surprised that apt-get doesn't work concurrently by default. i haven't managed to find any bug reports so far. [09:36] fwereade_: it seems to take out file locks [09:36] wrtp, so plausibly 2 things are installing things with overlapping dependencies? [09:39] fwereade_: we could always provide a version of apt-get that *is* exclusive... [09:48] wrtp, I dunno, it feels to me like we'll end up with a bunch of special cases sooner or later [09:48] wrtp, can we take it to the lists for further discussion? need to pop out to baker before it closes [09:49] fwereade_: sure, i'll try and draft a reply [11:01] Good morning! [11:01] hello [11:01] hi [11:02] Anyone has the calls active already, or should I? [11:03] niemeyer: feel free to start, imho none has done it yet [11:03] COol, starting it up [11:04] https://plus.google.com/hangouts/_/2a0ee8de20f9362c47ab06b9b5635551d4959416?authuser=0&hl=en [11:04] no camera today [11:04] not sure why [11:05] the mac says it can see the device [11:05] but no green light :( [11:05] wrtp: ping [11:05] niemeyer: pong [11:05] wrtp: Party time [11:05] niemeyer: am just sorting out the hangout laptop [11:06] I hate this technical shit [11:59] lunch [12:19] back [12:54] fwereade_: Sent a more carefully considered comment on the lock-stepping issue [13:05] How are folks doing this fine morning? [13:06] fwereade_: ping [13:06] mramm: Heya [13:06] I'm about to go over Mark S's open stack design summit keynote with him (and kapil and clint) [13:06] mramm: All good 'round here [13:06] mramm: Brilliant, good luck there [13:06] I think we have a really good story to tell around openstack upgrades thanks to the cloud archive [13:07] and the look and feel of the juju gui is impressive [13:08] fwereade_: When you're back and you have a moment, I'd appreciate talking a bit about https://codereview.appspot.com/6687043 [13:09] fwereade_: Both about the logic in EnterScope, and about the fact the CL seems to include things I've reviewed elsewhere [13:09] mramm: What's the cloud archive? [13:09] mramm: Good to hear re. GUI [13:10] it's just a package archive [13:10] with all the new stuff, backported and tested against the LTS [13:10] mramm: LOL [13:10] mramm: So we manage to stick the word "cloud" on package archives? 
;-) [13:10] managed [13:10] it's all "cloud" stuff in the archive [13:11] yes [13:11] gotta make things cloudy [13:13] niemeyer, pong, sorry I missed you [13:13] wrtp: no problem [13:13] Erm [13:13] fwereade_: no problem [13:13] niemeyer, haha [13:14] fwereade_, wrtp: I was actually about to ask something else [13:14] niemeyer: go on [13:14] niemeyer, 043 was meant to be a prereq for 046, i didn't realise I'd skipped it until yesterday [13:14] fwereade_, wrtp: I think it'd make sense to have the interface of juju.Conn exposing at least similar functionality to what we have in the command line [13:14] niemeyer: are you talking about your Deploy bug? [13:15] niemeyer, yes, I think I like that idea [13:15] No, I'm talking about https://codereview.appspot.com/6700048 [13:15] fwereade_, wrtp: We've been going back and forth on what we have in juju.Conn, and the state we're in right now is quite neat [13:15] niemeyer: ah, i think that's a tricky one. [13:15] fwereade_, wrtp: But the decision to put something there or not is a bit ad-hoc at the moment [13:16] niemeyer: i *do* think it's perhaps a bit confusing that RemoveUnit in Conn isn't anything like RemoveUnit in State. [13:16] wrtp: Agreed, and I have a proposal: DestroyUnit [13:16] niemeyer: in state? [13:16] wrtp: No, in juju.Conn [13:16] Ideally that'd be the name of the command-line thing as well, but that's too late [13:17] niemeyer: hmm [13:17] We do have destroy-service and destroy-environment, though [13:17] niemeyer, honestly I would prefer us to change Add/Remove in state to Create/Delete, say, and save the meanings of those verbs for the user-facing add/removes [13:17] niemeyer: Destroy sounds more drastic than Remove tbh [13:17] fwereade_: Remove vs. Delete feels awkward [13:17] wrtp: It's mean to be drastric [13:17] drastic [13:17] meant [13:17] I can't spell [13:18] niemeyer: ah, i thought the command-line remove-unit just set dying. [13:18] niemeyer, well, the trouble is we have this awkward remove-unit verb, which doesn't really mean remove at all [13:18] wrtp: and dying does what? :) [13:18] niemeyer: then again, i suppose... yeah [13:18] fwereade_: We can obsolete the command, and have destroy-unit [13:19] fwereade_: (supporting the old name, of course) [13:20] niemeyer, I'm -0.5 on the add/destroy pairing but it doesn't seem all that bad [13:20] fwereade_: we already have add-service, destroy-service, no? [13:20] wrtp: We don't have add-service, yet [13:20] wrtp: We may have some day.. [13:20] niemeyer: good point [13:21] We do have AddService, though, so the pairing is already there in some fashion at least [13:21] wrtp, we also have terminate-machine rather than destroy-machine [13:21] I quite like destroy precisely because it's drastic, and because it avoids the add/remove/dying conflict [13:21] fwereade_: +1 on destroy-machine too [13:21] wrtp, niemeyer: and in general I am in favour of making the commands more consistent [13:22] fwereade_: +1 too [13:22] fwereade_: destroy-service, destroy-unit, destroy-machine, destroy-environment..
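
To make the destroy-vs-remove split discussed here concrete: a destroy-unit command (or a juju.Conn DestroyUnit) would only mark the unit Dying and leave the agents to wind it down, while the State-level removal deletes the document once it has reached Dead. A rough sketch under that reading, using placeholder types rather than the real juju.Conn or state API:

package main

import (
	"errors"
	"fmt"
)

// Life is the usual lifecycle; a placeholder for the state package's version.
type Life int

const (
	Alive Life = iota
	Dying
	Dead
)

// Unit is a stand-in for state.Unit with just the lifecycle calls sketched.
type Unit struct {
	Name string
	life Life
}

// EnsureDying starts the unit on its way out; the agents notice and run the
// departure hooks before it ever reaches Dead.
func (u *Unit) EnsureDying() {
	if u.life == Alive {
		u.life = Dying
	}
}

// Remove is the State-level deletion, only sensible once the unit is Dead.
func (u *Unit) Remove() error {
	if u.life != Dead {
		return errors.New("unit is not dead")
	}
	return nil
}

// DestroyUnit is roughly what a juju.Conn call behind destroy-unit would do:
// no unassignment and no deletion, just kick off the lifecycle.
func DestroyUnit(u *Unit) {
	u.EnsureDying()
	fmt.Printf("%s is dying; the machiner removes it once it is dead\n", u.Name)
}

func main() {
	DestroyUnit(&Unit{Name: "mysql/0", life: Alive})
}
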
[13:22] I'm happy with that, at least [13:22] destroy for destructive actions seems good [13:22] wrtp, niemeyer: any quibbles I may have over the precise verb are drowned out by my approval for consistency [13:22] niemeyer: sounds like a plan [13:23] niemeyer, wrtp: destroy-relation [13:23] wrtp, fwereade_: Awesome, let's document and move in that direction [13:23] I'll add a comment to Dave's branch [13:23] fwereade_: +1 [13:23] niemeyer, great, thanks [13:23] fwereade_: i don't mind about remove-relation actually - it doesn't feel like that much of a destructive operation. [13:23] wrtp: It actually is [13:23] wrtp, strong disagreement [13:23] * niemeyer has to take the door.. biab [13:23] fwereade_: ok, cool [13:25] niemeyer, re the review -- if I were you I'd just drop that one, you've seen it all already in the one without the prereq [13:25] niemeyer, I will try to figure out exactly where I am and whether I've introduced anything that deserves a test, then I should have the fixed one-you've-seen ready to repropose soon [13:36] interesting; this test didn't *fail*, but it did take over 2 minutes to execute on my machine: http://paste.ubuntu.com/1283102/ [13:37] i'm not sure if i'm being pathological there or not [13:55] wrtp: Would be very useful to know where the time is being spent [13:55] fwereade_: Awesome, thanks [13:55] niemeyer: i'm looking into it right now. [13:55] fwereade_: Can we speak about EnterScope when have a moment? [13:56] niemeyer, any time [13:56] niemeyer, now? [13:56] fwereade_: Let's do it [13:56] niemeyer: quick check before you do that [13:56] wrtp: Sure [13:56] niemeyer: should there be any watcher stuff running in a normal state unit test? [13:56] niemeyer: (i'm seeing hundreds of "watcher: got changelog document" debug msgs [13:57] ) [13:57] wrtp: The underlying watcher starts on state opening [13:57] niemeyer: ah [13:57] wrtp: If you're creating hundreds of machines, that's expected [13:57] niemeyer: i see 4600 such messages initially [13:57] niemeyer, actually, I'm just proposing -wip [13:58] niemeyer, not quite sure it's ready, have ended up a bit confused by the branches [13:58] niemeyer, but it does have an alternative approach to EnterScope [13:58] niemeyer, that I am not quite sure whether I should do as it is, or loop over repeatedly until I get so many aborteds that I give up [13:59] wrtp: You'll get as many messages as changes [13:59] niemeyer, https://codereview.appspot.com/6678046 [13:59] fwereade_: Cool [14:01] fwereade_: Invite sent [14:10] hmm, i see the problem, i think [14:26] wrtp: Found it? [14:26] niemeyer: the problem is that all the goroutines try to assign to the same unused machine at once, but only one succeeds; then the all try with the next one etc etc [14:26] s/the all/they all/ [14:27] niemeyer: i think i've got a solution [14:27] niemeyer: i'm not far off trying it out [14:28] niemeyer: my solution is to read in batches, and then try to assign to each machine in the batch in a random order. [14:28] wrtp: What about going back to the approach we had before? [14:29] niemeyer: which was? [14:29] wrtp: Create a machine and assign to it [14:29] niemeyer: what if we don't need to create a machine? 
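
A toy model of the contention wrtp describes above: every assigner queries the same ordered set of unused machines and tries the lowest id first, so only one transaction per machine can succeed and everyone else aborts and re-queries. Purely illustrative; a mutex stands in for MongoDB's per-document guarantees, and none of this is the real AssignToUnusedMachine code.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// machines is a toy pool of unused machines.
type machines struct {
	mu   sync.Mutex
	used []bool
}

// firstUnused mimics the query for unused machines: like an ordered query,
// it always returns the lowest free id, so every assigner targets the same
// machine at the same time.
func (m *machines) firstUnused() (int, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for id, used := range m.used {
		if !used {
			return id, true
		}
	}
	return -1, false
}

// claim mimics the assignment transaction: it aborts if the machine was
// taken between the read and the write.
func (m *machines) claim(id int) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.used[id] {
		return false
	}
	m.used[id] = true
	return true
}

func main() {
	m := &machines{used: make([]bool, 100)}
	var aborts int64
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				id, ok := m.firstUnused()
				if !ok {
					return
				}
				if m.claim(id) {
					return
				}
				// Lost the race: re-query and try the next lowest id.
				atomic.AddInt64(&aborts, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("aborted assignment attempts:", aborts)
}

Run with 100 units over 100 machines, the abort count grows roughly quadratically, which is the slowdown the 2-minute test run hints at; randomising the pick order spreads the assigners out and shrinks it.
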
[14:30] niemeyer: this is in AssignToUnusedMachine, which doesn't create machines [14:31] wrtp: My understanding is that we had an approach to allocate machines that was simple, and worked deterministically [14:31] niemeyer: and the approach we used before is inherently racy if someone else *is* using AssignToUnusedMachine [14:31] niemeyer: that's fine (modulo raciness), but that doesn't fix the issue i'm seeing in this test (which we may, of course, decide is pathological and not worth fixing) [14:32] wrtp: The only bad case was that if someone created a machine specifically for a service while someone else attempted to pick a random machine, the random one could pick the machine just allocated for the specific service [14:32] niemeyer: so in that case we should loop, right? [14:33] wrtp: I'm not sure [14:35] niemeyer: actually, we *do* create a machine and then assign the unit to that machine [14:35] niemeyer: and that's the cause of the bug that dfc is seeing (i now realise) [14:37] wrtp: Indeed, sounds plausible [14:37] niemeyer: in the case i'm dealing with currently, we have a big pool of machines already created, all unused, and we're trying to allocate a load of units over them. [14:37] niemeyer: that seems like a reasonable scenario actually. [14:38] wrtp: Agreed [14:38] niemeyer: so i think it's worth trying to make that work ok. [14:38] wrtp: +1 [14:38] niemeyer: so... do you think my proposed solution is reasonable? [14:40] wrtp: It seems to reduce the issue, but still feels racy and brute-forcing [14:40] niemeyer: alternatives are: - read *all* the machines, then choose them in random order; - add a random value to the machine doc and get the results in a random order [14:40] niemeyer: yeah, i know what you mean [14:41] niemeyer: there's probably a way of doing it nicely, though i haven't come up with one yet [14:43] wrtp: I think we could introduce the concept of a lease [14:43] niemeyer: interesting way forward, go on. [14:44] wrtp: When a machine is created, the lease time is set to, say, 30 minutes [14:44] wrtp: AssignToUnused never picks up machines that are within the lease time [14:45] niemeyer: that doesn't solve the big-pool-of-already-created-machines problem AFAICS [14:45] niemeyer: which is, admittedly, a different issue [14:45] wrtp: Hmm, good point [14:47] wrtp: You know what.. I think we shouldn't do anything right now other than retrying [14:47] niemeyer: and ignore the time issue? [14:48] wrtp: Yeah [14:48] niemeyer: the random-selection-from-batch isn't much code and will help the problem a lot [14:48] wrtp: It makes the code more complex and bug-prone for a pretty unlikely scenario [14:49] niemeyer: ok. it's really not that complex, though, but i see what you're saying. [14:50] wrtp: I recall you saying that before spending a couple of days on the last round on unit assignment too :-) [14:50] niemeyer: i've already written this code :-) [14:51] niemeyer: and it's just an optimisation that fairly obviously doesn't affect correctness. [14:52] wrtp: I don't think it's worth it.. it's increasing complexity and the load of the system in exchange for a reduction in the chance of conflicts in non-usual scenarios [14:53] wrtp: We'll still have conflicts, and we still have to deal with the problem [14:53] wrtp: People adding 200 machines in general will do add-machine -n 200 [14:53] wrtp: and we should be able to not blow our own logic out with conflicts in those cases [14:55] niemeyer: i'm thinking of remove-service followed by add-service [14:56] wrtp: Ok? 
[14:57] niemeyer: sure. [14:57] niemeyer: i'll scale back my test code :-) [14:57] wrtp: Sorry, I was asking what you were thinking [14:57] wrtp: What about remove-service follows by add-service? [14:57] followed [14:58] niemeyer: if someone does remove-service, then two add-services concurrently, they'll see this issue. [14:58] niemeyer: that doesn't seem that unusual a scenario [14:59] niemeyer: i mean two "deploy -n 100"s of course [14:59] niemeyer: assuming the original service had 200 units. [14:59] wrtp: If someone does destroy-service, they'll put units to die.. if they run add-service twice immediately, they'll create two new machines [14:59] wrtp: What's the problem with that? [14:59] niemeyer: if someone does destroy-service, then waits, the machines lie idle with no units after a while, yes? [15:00] wrtp: Sorry, what's the scenario again? Different scenarios are not "of course" the same [15:00] niemeyer: here's the scenario i'm thinking of: [15:02] juju deploy -n 200 somecharm; juju remove-service somecharm; sleep 10000; juju deploy -n 100 othercharm & juju deploy -n 100 anothercharm [15:02] wrtp: I don't understand why we're talking about deploy + remove-service [15:02] wrtp: What's the difference between that and add-machine -n 200? [15:03] niemeyer: because that leaves a load of machines allocated but unused, no? [15:03] wrtp: add-machine -n 200? [15:03] [15:53:33] wrtp: People adding 200 machines in general will do add-machine -n 200 [15:03] wrtp: Yes, what's the difference? [15:03] niemeyer: but they are more likely to remove a service and add another one, i think [15:03] wrtp: Doesn't matter to the allocation algorithm, does it? [15:04] niemeyer: "juju deploy -n 200 foo" doesn't have the issue [15:04] niemeyer: if the machines are not currently allocated [15:05] wrtp: Agreed.. that's why I'm saying the whole problem is not important.. [15:05] wrtp: I still don't get what you're trying to say with deploy+remove-service+sleep [15:06] wrtp: Isn't that an expensive way to say add-machine -n 200? [15:06] niemeyer: i'm trying to show a moderately plausible scenario that would exhibit the pathological behaviour we're seeing here. [15:06] niemeyer: yeah, sure. [15:06] wrtp: Okay, phew.. [15:06] wrtp: So how is add-machine -n 200 + deploy -n 200 an issue? [15:07] niemeyer: it's only an issue if you've got two concurrent deploys. [15:08] wrtp: Okay, so we should just ensure that these cases actually work by retrying, until we sort a real solution out in the future that actually prevents the conflict [15:08] niemeyer: sounds reasonable. === TheMue is now known as TheMue-AFK [15:09] niemeyer: AssignToUnusedMachine does currently retry as it stands actually. [15:09] wrtp: So how is Dave stumbling upon issues? [15:09] niemeyer: the problem is in AssignUnit, but there's a trivial fix, i think [15:09] wrtp: Coo [15:09] ll [15:10] niemeyer: currently AssignUnused calls Unit.AssignToMachine(m) but it should call Unit.assignToMachine(m, true) [15:10] niemeyer: yeah, i was surprised when my test didn't fail. [15:10] wrtp: I'm not sure this solves the issue [15:11] niemeyer: no? [15:12] niemeyer: i *think* it solves the case of AssignUnit racing against itself [15:12] niemeyer: it doesn't solve the problem of AssignUnit racing against AssignToUnusedMachine [15:12] niemeyer: if we want to solve that, we'll need to loop, i think. 
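
A rough sketch of the retry approach being settled on here: if the assignment transaction aborts because a concurrent assigner claimed the machine first, pick another machine and try again rather than failing. The error value and function shapes below are placeholders, not the real state package.

package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// errMachineTaken stands in for the transaction aborting because the chosen
// machine is no longer unused (txn.ErrAborted in the real code).
var errMachineTaken = errors.New("machine already has a unit assigned")

// assignToUnusedMachine pretends to run the assignment transaction against
// an asserted-unused machine; some of the time a concurrent assigner wins.
func assignToUnusedMachine(unit string) (machine int, err error) {
	if rand.Intn(2) == 0 {
		return 0, errMachineTaken
	}
	return rand.Intn(100), nil
}

// assignUnit keeps retrying while it loses races with concurrent assigners.
// The point of asserting "unused" inside the transaction is that a stolen
// machine produces a clean abort here instead of a double assignment.
func assignUnit(unit string) (int, error) {
	for {
		m, err := assignToUnusedMachine(unit)
		if err == errMachineTaken {
			continue // lost the race; pick another machine and try again
		}
		return m, err
	}
}

func main() {
	m, err := assignUnit("wordpress/3")
	if err != nil {
		fmt.Println("assignment failed:", err)
		return
	}
	fmt.Printf("wordpress/3 assigned to machine %d\n", m)
}
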
[15:13] niemeyer: (but that's not the problem that dave is seeing) [15:14] niemeyer: erk [15:14] niemeyer: no, you're right [15:19] niemeyer: i'm thinking of something like this: http://paste.ubuntu.com/1283247/ [15:21] wrtp: This doesn't feel great.. allocating a machine and having it immediately stolen is pretty awkward [15:21] wrtp: If we want to solve this stuff for real, I suggest two different fronts: [15:22] niemeyer: AddMachineWithUnit ? [15:22] 1) Introduce a lease time on AddMachine that prevents someone else from picking it up non-explicitly [15:23] 2) Do a variant of your suggestion that picks the highest and the smallest id of all unused machines, and picks the first one >= a random id in the middle [15:24] wrtp: -1 I think.. this would mean we'll have to do a bunch of transaction merging that right now are totally independent [15:24] niemeyer: ok [15:25] niemeyer: do we have a way of getting an agreed global time for leaseholding? [15:25] niemeyer: presumably the presence stuff does that currently [15:25] niemeyer: hmm, maybe mongo provides access to the current time [15:26] wrtp: Yeah [15:27] wrtp: Although, ideally we'd not even load that time [15:28] niemeyer: i'm thinking we shouldn't need to [15:29] wrtp: if the machine is created with a bson.MongoTimestamp, that's automatically set [15:29] wrtp: It needs to be the second field, though, IIRC [15:29] niemeyer: weird [15:29] wrtp: Yeah, it's a bit of an internal time type [15:30] wrtp: it'd be nicer to use a normal time, actually [15:30] I don't recall if there's a way to create it with "now", though [15:30] * niemeyer checks [15:32] Nothing great [15:33] I talked to Eliot before about $now.. I think it'll come, but doesn't exist yet [15:34] niemeyer: ah [15:34] Anyway, will think further about that over lunch [15:34] niemeyer: cool [15:34] niemeyer: for the time being, perhaps it's best just to do the loop? [15:35] niemeyer: as it's a quick fix for a current bug [15:35] wrtp: Yeah, that's what I think we should do [15:36] wrtp: The real solution is involved and will steal our time [15:36] niemeyer: agreed [15:36] niemeyer: also, what we will have will be correct, just not very efficient. [15:37] wrtp: Yeah, but those are edge cases really.. the cheap answer is "don't allocate tons of machines and then do tons of assignments in parallel" [15:37] wrtp: Which isn't hard to avoid [15:37] niemeyer: yeah [15:38] niemeyer: concurrent deploys are inefficient. we can live with that for the time being. [15:38] Cool, lunch time.. biab [15:38] wrtp: Concurrent deploys with spare machines, specifically [15:39] niemeyer: all concurrent deploys will suffer from the someone-stole-my-new-machine problem, i think. [16:45] niemeyer: this seems to work ok. https://codereview.appspot.com/6713045 [16:47] wrtp: Looking [16:49] wrtp: Nice [16:49] niemeyer, hmm, is it OK to go from Dying straight to removed without passing through Dead? [16:49] wrtp: How long does it take to run? [16:49] niemeyer, blast sorry can't talk now [16:49] niemeyer: <2s [16:49] niemeyer: one mo, i'll check [16:50] niemeyer: 0.753s to run the state tests with just that one test. 
[16:50] wrtp: Beautiful, thanks [16:50] wrtp: LGTM [16:50] niemeyer: thanks [16:51] niemeyer: it was surprisingly difficult to provoke the race before applying the fix [16:51] fwereade_: If nothing else, I see how it might be okay in cases we have tight control on [16:51] fwereade_: We can talk more once you're back [16:52] wrtp: Those are very useful tests to hold [16:52] niemeyer: agreed [16:54] niemeyer: any chance of getting some feedback on https://codereview.appspot.com/6653050/ ? [16:55] wrtp: I was reviewing that when I stopped to review your request here [16:55] niemeyer: ah brilliant, thanks! [17:01] wrtp: Why does it reset the admin password on tear down of ConnSuite? [17:01] niemeyer, well, it was what we were discussing earlier... that it seemed sensible for the last unit to leave a relation scope to be the one to finally remove it, and that we should do it in a transaction [17:02] niemeyer: because every time we connect to the state, the admin password gets set. [17:02] wrtp: Where's that done? [17:03] niemeyer: in juju.NewConn [17:03] niemeyer: actually, in Bootstrap [17:04] niemeyer: and then juju.NewConn resets it, as is usual [17:04] wrtp: 135 // because the state might have been reset [17:04] 136 // by the test independently of JujuConnSuite. [17:05] wrtp: Is that done when the password change fails? [17:05] wrtp: I mean, where do we reset and put it in such a non-working state [17:05] niemeyer: one mo, i'll check the code again [17:06] wrtp: Cheers [17:06] niemeyer: ah, yes, it's when we have tests that Bootstrap then Destroy [17:07] niemeyer: any test that does a Destroy will cause the SetAdminPassword call to fail [17:08] wrtp: Hmm.. [17:08] wrtp: I'm pondering about what it means.. won't the follow up tear down fail too [17:08] ? [17:08] niemeyer: no, it doesn't. can't quite remember why though, let me check again. [17:09] wrtp: It feels a bit wild [17:09] wrtp: You've just worked on that and can't remember.. neither of us will have any idea of that stuff in a bit :( [17:09] niemeyer: i know what you mean [17:10] niemeyer: the interaction between JujuConnSuite, MgoSuite and the dummy environ isn't ideal [17:10] wrtp: What actually fails if we don't reset the password there? [17:10] niemeyer: lots of tests need the server to be restarted then [17:11] niemeyer: nothing fails - tests just get slower [17:11] wrtp: That's good [17:12] niemeyer: when someone calls Environ.Destroy, it calls mgo.Reset, but the JujuConn.State variable remains pointing to the old connection. [17:12] wrtp: Right [17:13] wrtp: Okay, I'll just suggest a comment there [17:13] niemeyer: sounds good [17:14] / Bootstrap will set the admin password, and render non-authorized use [17:14] / impossible. s.State may still hold the right password, so try to reset [17:14] / the password so that the MgoSuite soft-resetting works. If that fails, [17:14] / it will still work, but it will take a while since it has to kill the whole [17:14] / database and start over. [17:15] Ah, will add a note about when it happens too [17:18] wrtp: LGTM [17:19] wrtp: Pleasantly straightforward [17:19] niemeyer: great, thanks [17:20] niemeyer: yeah, when i realised that all tests were going to need to connect with authorisation, i thought the changes would be worse than they ended up. [17:30] fwereade_: I see [17:31] fwereade_: I think it sounds reasonable in that case [17:31] fwereade_: Is there anyone else that might be responsible for taking the unit from dying => dead => remove? [17:32] right, submitted.
good way to end the day. see y'all tomorrow. [17:34] wrtp: Indeed, have a good one! [19:12] niemeyer, it's the relation I'm pondering taking directly Dying->gone [19:13] niemeyer, I *think* it's ok, because the last thing to be doing anything with it should be that last relation [19:13] fwereade__: Makes sense.. have you seen my comments on it? [20:11] niemeyer, sorry, I'm not sure which comments.. I haven't seen any comments of yours less than ~ 1 day old on the CLs I'm thinking of [20:14] fwereade__: My apologies, I meant on IRC, right above [20:14] fwereade_: I see [20:14] fwereade_: I think it sounds reasonable in that case [20:14] fwereade_: Is there anyone else that might be responsible for taking the unit from dying => dead => remove? [20:14] niemeyer, it's not the unit, it's the relation [20:15] fwereade__: Ah, sorry, yes, s/unit/relation [20:15] niemeyer, the other thing that might have to do it is the client, if no units are in the relation yet [20:15] niemeyer, I'm actually starting to feel less keen on the idea [20:16] niemeyer, I'm starting to think that it would be better to set it to Dead and add a cleanup doc for it [20:16] niemeyer, we can do it in one transaction but don't have to get overly clever [20:17] fwereade__: What's the benefit? [20:18] niemeyer, we get (1) consistent lifecycle progress and (2) a single transaction that the unit agent uses to wash its hands of a dying relation [20:19] fwereade__: Actually, hmm [20:19] fwereade__: Well, before we derail.. [20:20] fwereade__: Both don't look like very strong points.. we're exchanging simple and deterministic termination for a hand-off of responsibility [20:21] fwereade__: There's perhaps an alternative that might offer a middle ground solving some of your concerns, though [20:22] niemeyer, (the big one is "LeaveScope will be less complicated") [20:22] niemeyer, but go on please [20:23] fwereade__: Sorry, I spoke too soon, I think the idea would introduce further races down the road [20:23] niemeyer, ha, no worries [20:24] niemeyer, anyway I'm hardly married to the idea, I'll take it round the block another time and try to simplify a bit more [20:24] fwereade__: I don't see any simple alternatives.. [20:25] fwereade__: Adding a cleanup document would mean persisting service and associated relations for an undetermined amount of time [20:25] fwereade__: Even in the good cases [20:26] niemeyer, really? a Dead relation has decreffed its service... I don't think there's anything blocking service removal at that point [20:26] niemeyer, that's almost the whole point of it being dead [20:27] niemeyer, if anything else is reacting in any way to a dead relation I think they're doing it wrong [20:28] fwereade__: It would be the first time we're keeping dead stuff around referencing data that does not exist [20:29] fwereade__: This feels pretty bad [20:29] fwereade__: Find(relation)... oh, sorry, your service is gone [20:29] fwereade__: Worse.. Find(relation).. oh, look, your service is live, again, but it's a different service!
[20:30] fwereade__: The purpose of Dead as we always covered was to implement clean termination, not to leave old unattended data around [20:32] niemeyer, fair enough, as I said I'm happy to take it round the block again -- I had seen it as just one more piece of garbage in the same vein as all its unit settings, but mileage clearly varies [20:36] niemeyer, just to sync up on perspective: would you agree that we should, where possible, be making all related state changes in a single transaction, and only falling back to a CA when dictated by potentially large N? [20:37] fwereade__: Settings have no lifecycle support, and an explicit free pass in the case of relation unit settings because we do want to keep them up after scope-leaving for reasons we discussed [20:38] niemeyer, yep, ok, I am not actually arguing for it any more, I think I have gone into socratic mode largely for my own benefit [20:39] fwereade__: Regarding CA use, yes, it feels like a last resort we should use when that's clearly the best way forward [20:40] fwereade__: Again, it sounds sensible in the case of settings precisely because we have loose control of when to remove [20:40] niemeyer, thanks, but in fact I have a more general statement: we should be making state changes as single transactions where possible, and exceptions need very strong justifications [20:40] niemeyer, because I am suddenly fretting about interrupted deploys [20:41] fwereade__: I'm finding a bit hard to agree with the statement open as it is because I'm not entirely sure about what I'd be agreeing wiht [20:41] niemeyer, it feels like maybe we should actually be adding the service, with N units and all its peer relations, in one go [20:41] fwereade__: The end goal is clear, though: our logic should continue to work reliably even when things explode in the middle [20:42] niemeyer, ok, thank you, that is a much better statement of the sentiment I am trying to express [20:42] fwereade__: In some cases, we may be forced to build up a single transaction [20:42] fwereade__: In other cases, it may be fine to do separate operations because they will independently work correctly even if there's in-between breakage [20:43] fwereade__: and then, just to put the sugar in all of that, we have to remember that our transaction mechanism is read-committed [20:43] fwereade__: We can see mid-state [20:43] fwereade__: even if we have some nice guarantees that it should complete eventually [20:44] niemeyer, I have been trying to keep that at the forefront of my mind but I bet there are some consequences I've missed somewhere ;) [20:44] fwereade__: That's great to know [20:45] fwereade__: We most likely have issues here and there, but if nothing else we've been double-checking [20:46] niemeyer, so, to again consider specifically the extended LeaveScope I'm looking at now [20:46] fwereade__: I often consider the order in which the operations are done, and the effect it has on the watcher side, for example [20:46] fwereade__: Ok [20:47] niemeyer, (huh, that is not something I had properly considered... are they ordered as the input {}Op?) [20:47] fwereade__: Yep [20:47] fwereade__: We've been getting it right, I think :) [20:48] fwereade__: E.g. 
add to principals after unit is in [20:48] fwereade__: I like to think it's not a coincidence :-) [20:48] niemeyer, cool, but that's one of those totally unexamined assumptions I think I've been making, but could easily casually break in pursuit of aesthetically pleasing code layout or something ;) [20:48] niemeyer, good to be reminded [20:49] fwereade__: So, LeaveScope [20:49] fwereade__: What do you think? [20:50] niemeyer, you know, I'm not sure any more, I need to write some code :/ [20:50] niemeyer, thank you, though, this has helped some things to fall into place [20:51] fwereade__: Okay, since we're here with state loaded in our minds, this is my vague understanding of what we probably need: [20:52] 1) If the relation is live, run a transaction doing the simplest, asserting that the relation is still alive [20:54] 2) If there are > 1 units in the relation we're observing, run a transaction asserting that this is not the last unit, and just pop out the scope [20:55] 3) If there is exactly 1 unit remaining, or 2 was aborted, remove relation and scope, unasserted [20:56] niemeyer, yeah, that matches my understanding [20:56] Actually, sorry, (3) has to assert the scope doc exists [20:56] Otherwise we may havoc the system in some edge cases [20:57] fwereade__: ^ [20:58] niemeyer, it was the refcount checks I had been thinking of when you said unasserted [20:59] niemeyer, but then actually, hmm: by (3) a failed assertion should be reason enough to blow up, unless a refresh reveals that someone else already deleted it... right? [20:59] fwereade__: Right [21:00] niemeyer, we can't do anything sophisticated with the knowledge, it's always going to be an error: may as well assert for everything even in (3) [21:00] niemeyer, at least we fail earlier if state does somehow become corrupt [21:00] fwereade__: Hmm [21:00] fwereade__: Sounds like the opposite [21:00] fwereade__: If we assert what we care about, we can tell how to act [21:00] fwereade__: If we assert just on existence of scope doc, which is the only thing we care about, we're know exactly what happened if it fails [21:01] fwereade__: Even if we don't load anything else [21:01] fwereade__: We don't care about refcounts, in theory [21:01] fwereade__: If it's 1, there's only 1.. if someone removed that one, and it wasn't us, that's okay too [21:02] fwereade__: 1 should never become 2 unless we have a significant bug [21:02] fwereade__: Makes sense? [21:03] niemeyer, so if we assert lots, fail early, and recover if it turns out that the relation was removed by someone else, I think we're fine, and in the case of such a significant bug at least we haven't made *more* nonsensical changes to the system ;) [21:04] fwereade__: My point is that we don't have to "recover if it turns out ..." [21:04] niemeyer, yeah, fair enough, I see that side too [21:04] fwereade__: Otherwise, agreed regarding assert lots [21:05] fwereade__: In fact, we're doing in-memory Life = Dead, which sounds pretty dangerous in that place [21:06] fwereade__: We need to make sure to not use an in-memory value we got from elsewhere in that > 1 logic. 
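
A sketch of how the three LeaveScope cases laid out above might translate into mgo/txn operations, with refresh-and-rebuild on txn.ErrAborted left to the caller as discussed. The collection names, field names and document keys are guesses for illustration, not the actual state schema.

package main

import (
	"fmt"

	"labix.org/v2/mgo/bson"
	"labix.org/v2/mgo/txn"
)

// Hypothetical collection names and life value standing in for the schema.
const (
	relationsC = "relations"
	scopesC    = "relationscopes"
	alive      = 0
)

// leaveScopeOps builds one attempt's operations. The caller runs them with a
// txn.Runner; on txn.ErrAborted it refreshes the relation from the database
// and rebuilds the ops with fresh values, never trusting in-memory state.
func leaveScopeOps(relationID, scopeKey string, life, unitCount int) []txn.Op {
	switch {
	case life == alive:
		// Case 1: relation still alive; decrement and leave scope,
		// asserting it has not started dying under us.
		return []txn.Op{{
			C:      relationsC,
			Id:     relationID,
			Assert: bson.D{{"life", alive}},
			Update: bson.D{{"$inc", bson.D{{"unitcount", -1}}}},
		}, {
			C:      scopesC,
			Id:     scopeKey,
			Assert: txn.DocExists,
			Remove: true,
		}}
	case unitCount > 1:
		// Case 2: dying but not the last unit; pop out of scope, asserting
		// someone else will still be around to clean up.
		return []txn.Op{{
			C:      relationsC,
			Id:     relationID,
			Assert: bson.D{{"unitcount", bson.D{{"$gt", 1}}}},
			Update: bson.D{{"$inc", bson.D{{"unitcount", -1}}}},
		}, {
			C:      scopesC,
			Id:     scopeKey,
			Assert: txn.DocExists,
			Remove: true,
		}}
	default:
		// Case 3: last unit out (or case 2 aborted); remove the scope doc
		// and the relation itself, asserting only that our scope doc exists.
		return []txn.Op{{
			C:      scopesC,
			Id:     scopeKey,
			Assert: txn.DocExists,
			Remove: true,
		}, {
			C:      relationsC,
			Id:     relationID,
			Remove: true,
		}}
	}
}

func main() {
	ops := leaveScopeOps("wordpress:db mysql:server", "r#0#wordpress/0", alive, 2)
	fmt.Println("built", len(ops), "operations")
}

Asserting only on the scope doc in case 3 keeps the failure narrow, as niemeyer argues: if it aborts, either someone else already removed the scope (fine) or something is genuinely wrong.
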
[21:06] niemeyer, sorry, I am suddenly at sea [21:06] fwereade__: EnsureDead does in-memory .doc.Life = Dead [21:07] fwereade__: Picking a count and life from an external relation doc and saying "Oh, if it's dying, it surely has no more than 1 units" will bite [21:09] fwereade__: Because someone else may have inc'd before it became Dying [21:10] niemeyer, I may have misunderstood: but my reading was that it would be ok to use the in-memory values to pick a transaction to start off with, but that we should refresh on ErrAborted [21:10] niemeyer, and use those values to figure out what to do next [21:13] fwereade__: It depends on how you build the logic really [21:13] niemeyer, I think I know roughly what I'm doing... time will tell :) [21:13] fwereade__: If we load a value from the database that says life=dying and units=1, you don't have to run a transaction that says >1 because you know it'll fail [21:14] fwereade__: If you have a value in memory you got from elsewhere that says the same thing, you can't trust it [21:14] niemeyer, yes, this is true, there are inferences I can draw once it's known to be dying [21:14] fwereade__: That was the only point I was making in the last few lines [21:14] niemeyer, cool [21:15] fwereade__: I'll go outside to exercise a tad while I can.. back later [21:15] niemeyer, enjoy [21:15] fwereade__: Have a good evening in case we don't catch up [21:15] niemeyer, and you :) [21:15] fwereade__: Cheers === hazmat is now known as kapilt === kapilt is now known as hazmat