[00:48] <_mup_> ensemble/expose-provisioning r226 committed by jim.baker@canonical.com
[00:48] <_mup_> Work around the fact that the provisioning agent's state is stored in class attributes
[09:03] <kim0> Morning everyone o/
[13:49] <niemeyer> Good mornings!
[14:03] <niemeyer> 1 WEEK!
[14:05] <TeTeT> niemeyer: what for? UDS?
[14:07] <niemeyer> TeTeT: Yeah :)
[16:12] <jimbaker>  indeed, one week it is!
[16:30] <_mup_> ensemble/unit-agent-resolved r262 committed by kapil.thangavelu@canonical.com
[16:30] <_mup_> unit lifecycle resolving unit relations workflows
[17:23] <jimbaker> this is interesting. if i run repeatedly with -u a test i have that looks at service watching for exposed services in the provisioning agent, then eventually it fails with an InternalTopologyError of service not found
[17:24] <jimbaker> looking at the topo dump, the services dict is empty at that point
[17:24] <jimbaker> i don't think this happens in my new code, but investigating further
[17:28] <jimbaker> ahh, never mind... just the way the watch is run, this can happen. good
[17:41] <_mup_> ensemble/expose-provisioning r227 committed by jim.baker@canonical.com
[17:41] <_mup_> Fixed watch on service unit names so that the watch can run after the service topo has changed; tested 500+ tries
[17:45] <niemeyer> jimbaker: Hmmm
[17:45] <niemeyer> jimbaker: Is that related to the watch running after the test is done?
[17:46] <niemeyer> jimbaker: This may be related to the issue I was discussing with hazmat on Friday
[17:46] <jimbaker> niemeyer, yes, it is quite possible that the watch is running after the completion of the test
[17:46] <niemeyer> jimbaker: Ok
[17:47] <niemeyer> jimbaker: hazmat is working to avoid that kind of issue
[17:47] <hazmat> niemeyer, regarding that i realized the plan formulated on friday doesn't work, since there are no watches necessarily firing when sync is invoked, and there may be unrelated background deferreds
[17:47] <niemeyer> hazmat: We covered that case on friday
[17:49] <niemeyer> hazmat: There shouldn't be unrelated background deferreds lingering without a connection to the test ever
[17:49] <hazmat> niemeyer, my understanding is that an alteration of the client sync to do a round-trip poke to flush communication channels, in addition to serialized watch execution, was the plan. but the issue here isn't the watch firing at all
[17:49] <niemeyer> hazmat: It is an early mistake we made in the design of watches that they allow that to happen
[17:50] <niemeyer> hazmat: What's the issue?
[17:50] <hazmat> its the get/exists information retrieval that establishes the watch, the processing of that initial data being async to the caller is what causes the problem
[17:51] <niemeyer> hazmat: I'm not following you
[17:51] <hazmat> there is no background watch callback execution per se then
[17:51] <niemeyer> hazmat: We went through that on friday, precisely
[17:51] <niemeyer> hazmat: I explained why the "initial data being async" isn't the problem
[17:51] <niemeyer> hazmat: and you agreed
[17:52] <niemeyer> hazmat: We took a while to get there as well, with examples and so on
[17:52] <niemeyer> hazmat: If you think this isn't true anymore, I'll need some concrete examples
[17:53] <hazmat> niemeyer, my understanding of the solution that was formulated on friday was that it would be an alteration of the zk client to serialize watch callback executions, and sync would ensure none are firing currently and none pending (via roundtrip to zk).
[17:54] <hazmat> further reflection on that, and its not apparent that it solves the actual issue encountered here
[17:54] <niemeyer> hazmat: Yes: none are firing currently, and no writes execute within the roundtrip of sync
[17:54] <hazmat> because there are no watches firing in the common case
[17:55] <hazmat> instead the watch callback (ours not zookeeper) is being invoked with initial data reflecting the current state
[17:55] <niemeyer> hazmat: I don't understand.. the problem we were solving is watches firing in the background.  If there are no watches firing, there is no problem.
[17:55] <hazmat> niemeyer, there is a problem that the processing of initial data (that was retrieved from establishing the watch) is processed async to the caller
[17:56] <niemeyer> hazmat: Oh my
[17:56] <niemeyer> hazmat: You keep repeating that, but I already explained several times why this isn't the problem.
[17:57] <niemeyer> hazmat: and you agreed on friday.. so again, I'll need some more concrete examples if you have some new information.
[18:00] <_mup_> ensemble/unit-agent-resolved r263 committed by kapil.thangavelu@canonical.com
[18:00] <_mup_> unit relation lifecycles can stop/start execution independent of related unit watching.
[18:04] <hazmat> niemeyer, i agreed there were commutability problems with multiple async watch creators relying on state, resulting in sequencing operations. i still think those are mostly immaterial since at all times operations attempt to reflect current state. i liked the alternative solution and said i would think about it more; i have, and afaics it doesn't solve the original issue.
[18:08] <niemeyer> hazmat: The original issue and the lack of ordering are actually the same issue.
[18:08] <hazmat> niemeyer, as for examples that show this, the one i just brought up embodies the original issue, and it isn't solved by this. calling sync and doing the round trip would only guarantee no zk client watch callback activity, but that's not related to our watch api invoking its callback with data that's independent of the zk watch firing
[18:08] <niemeyer> hazmat: and as we discussed on Friday, I do not agree that it's not relevant.  I don't want us to have to think about ordering in tests.
[18:09] <niemeyer> hazmat: You're thinking about a very narrow problem and trying to come up with a solution for it.
[18:09] <niemeyer> hazmat: The issue is general.. we have been looking at watches running in the background since we started using that convention
[18:09] <niemeyer> hazmat: We have to solve the actual issue, rather than a one-off issue
[18:10] <hazmat> niemeyer, we constantly think about ordering in tests, except with this problem we can't rely on ordering
[18:10] <niemeyer> hazmat: No, we never think about them, ever!
[18:10] <niemeyer> hazmat: We never had to say "Oh, I will create that object first so that the watch of the 3 other objects fire at the time this deferred fires"
[18:13] <hazmat> niemeyer, not sure how that relates.. we set up state and behavior in tests, and wait for the behavior to fire on the state. our ordering problem is that we can't wait for behavior to be in place before we modify state. at all times behavior execution reflects current state.
[18:14] <niemeyer> hazmat: Sorry, can't parse that sentence
[18:15] <niemeyer> hazmat: Whenever we add a watch, the code of that watch will run in background right now when the watch fires
[18:15] <niemeyer> hazmat: and that's the problem..
[18:16] <niemeyer> hazmat: Solve this problem, and we don't need to "wait for first fire"
[18:16] <niemeyer> hazmat: Solve first fire, and you need to ensure that *the data was in the state when it first fires*.
[18:16] <niemeyer> hazmat: Otherwise what will happen is that the first fire is dumb, because the state you wanted wasn't there
[18:18] <hazmat> niemeyer, they are separate issues in one case we have a zk watch firing, in the other we don't, at all times the ensemble api callback considers current state.
[18:18] <niemeyer> hazmat: No, they are not separate issues.  One is a partial solution to the actual problem which requires hand-crafting ordering.  The other is the actual problem.
[18:21] <niemeyer> hazmat: FWIW, I have nothing against waiting for the first state.. it sounds great, actually.
[18:22] <niemeyer> hazmat: But when we start to tweak ordering to enable tests to pass because there was a watch running in the background, that will be an issue.
[18:24] <hazmat> niemeyer, why doesn't the lack of a currently executing serialized watch in the zk client invalidate the usage? we could call sync immediately after lifecycle.start(), for example, and it would return while there are still ongoing zk non-watch operations on the zk state.
[18:25] <niemeyer> hazmat: "Why doesn't (...) invalidate the usage?" of what?
[18:25] <hazmat> niemeyer, of the proposed solution we had friday
[18:26] <niemeyer> hazmat: Which ongoing zk non-watch operations?
[18:28] <hazmat> niemeyer, in this case, the establishment of unit relation workflows and lifecycles, that in turn establish watches, and fire hooks based on the current state.
[18:29] <hazmat> niemeyer, in other cases its a bit simpler
[18:29] <niemeyer> hazmat: I don't see how that's special in any sense?
[18:29] <niemeyer> hazmat: If we wait on the end of those operations, and do a sync, it should work
[18:30] <niemeyer> hazmat: wait on the non-watch operations, that is
[18:30] <niemeyer> hazmat: the sync should wait for the rest to get done
[18:30] <niemeyer> hazmat: This is twisted 101, and we haven't been following it appropriately.. an operation which will end in the future must return a deferred which fires when the operation is done executing.
[18:31] <hazmat> like that of watch_hook_debug, the background activity is establishing the debug setup. or in debug_log its establishing internal logging. neither of those involve further watches, but they do involve get roundtrips.
[18:31] <niemeyer> hazmat: watches are special, because we can't inherently wait for them because they are started out of band
[18:31] <niemeyer> hazmat: That's why we need sync()
[18:31] <niemeyer> hazmat: They should return deferreds, which should be waited upon by whoever called them
[18:32] <hazmat> niemeyer, they do currently but only for the establishment of the watch, not the processing of the data retrieved that resulted from setting the watch.
[18:33] <niemeyer> hazmat: Huh, how come?
[18:33] <niemeyer> hazmat: Aren't they doing the equivalent of
[18:33] <niemeyer> yield self._client.get(...)?
[18:33] <niemeyer> hazmat: That effectively waits for the data to come back
[18:33] <hazmat> niemeyer, yes, the data is retrieved but typically they do ...
[18:34] <hazmat> d_get, d_watch = self._client.get_and_watch() 
[18:34] <hazmat> d_get.addCallback(lambda x: ensemble_api_callback(x)) 
[18:34] <niemeyer> hazmat: Do you have an example?
[18:34] <niemeyer> hazmat: I mean, a pointer in the code
[18:34] <hazmat> ah
[18:35] <niemeyer> hazmat: In those cases, d_get should typically be returned
[18:35] <niemeyer> hazmat: So that it gets chained
[18:35] <niemeyer> hazmat: Since the result of the current function becomes the result of the get
[18:35] <niemeyer> hazmat: and so on
[18:35] <hazmat> niemeyer, ensemble.state.service.ServiceUnitState.watch_hook_debug
[18:35] <niemeyer> hazmat: Looking
[18:36] <hazmat> that should have yield exists_d at the end of it
[18:36] <niemeyer>         exists = yield exists_d
[18:36] <niemeyer> hazmat: and it does
[18:37] <hazmat> niemeyer, er.. i meant yield callback_d
[18:38] <niemeyer> hazmat: Nope, that's the watch, not the data as you mentioned above
[18:38] <niemeyer> hazmat: Well, hmmm.. I see your point
[18:40] <niemeyer> hazmat: But that's problematic, because the real watcher is being added to the chain.. we can't yield it
[18:40] <hazmat> niemeyer, that's the simple case, the more complicated case is embodied in unit_lifecycle.start, process_service_changes, relation_workflow.start->relation_lifecycle.start
[18:40] <niemeyer> hazmat: Nor return it
[18:40] <hazmat> niemeyer, the real watcher isn't in the chain, just the establishment of the callback
[18:40] <niemeyer> hazmat: This will never fire
[18:40] <niemeyer> hazmat: True
[18:41] <niemeyer> hazmat: Agreed this is a separate issue from synchronizing with watches
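The fix the two converge on above, chaining the initial-data deferred so callers wait on the processing of the initial data rather than only on watch establishment, can be sketched roughly as follows. This is not the real ensemble code: asyncio stands in for the Twisted deferreds of the day, and all names here are hypothetical.

```python
import asyncio

# All names are hypothetical; asyncio stands in for Twisted deferreds.
processed = []

async def fake_get_and_watch():
    """Simulated client get that also establishes a watch (one ZK round trip)."""
    await asyncio.sleep(0)            # stand-in for the network round trip
    return {"debug_hooks": True}

async def watch_hook_debug(callback):
    # The bug under discussion: scheduling callback(initial_data) as a
    # detached background task and returning once the watch exists leaves
    # the caller unable to wait for the initial data to be processed.
    # The fix sketched here: process the initial data before this coroutine
    # completes, so awaiting watch_hook_debug() covers the whole setup.
    initial_data = await fake_get_and_watch()
    callback(initial_data)

async def main():
    await watch_hook_debug(processed.append)
    # Nothing related to the initial data is left running in the background.
    assert processed == [{"debug_hooks": True}]

asyncio.run(main())
```

The watch event itself (the out-of-band future firing) is still not awaited here; as discussed above, that part needs sync() rather than chaining.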
[18:42] <hazmat> niemeyer, this problem exists in a more chained fashion (initial data callbacks, processing resulting in more watches, with additional data callbacks) in the lifecycle.start case
[18:43] <niemeyer> hazmat: Sure, that's related to the problem of watch ordering that we've been debating
[18:43] <niemeyer> hazmat: Just solving the initial data problem will not solve the fact watches will run in the background
[18:43] <niemeyer> hazmat: We have to address both of them
[18:43] <_mup_> ensemble/expose-provisioning r228 committed by jim.baker@canonical.com
[18:43] <_mup_> Properly clean up port policy after a service is no longer exposed
[18:43] <hazmat> niemeyer, in neither scenario i've posited is an actual zk watch firing.
[18:43] <niemeyer> hazmat: The end goal is straightforward: yield whatever() + client.sync(), should never leave stuff hanging in the background
[18:44] <niemeyer> hazmat: You said "processing resulting in more watches"
[18:44] <niemeyer> hazmat: If that's not a zk watch, what is it?
[18:44] <hazmat> niemeyer, more watches being established, but with those ensemble watch api callbacks being invoked with the initial data from establishing the watch
[18:45] <hazmat> not from the watch itself firing
[18:46] <niemeyer> hazmat: Ok, this sounds like the same problem still then?
[18:46] <hazmat> niemeyer, it is just nested
[18:47] <niemeyer> hazmat: deferred chaining FTW
[18:48] <hazmat> and the initial execution of those is also the basis in this case for scheduling hook executions... it bears more thinking about.. 
[18:48] <hazmat> i'm gonna grab a quick lunch before the standup
[18:48] <niemeyer> hazmat: Cool, have a good one
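The nested case hazmat raises (lifecycle.start establishing watches whose initial-data callbacks establish further watches) yields to the same deferred-chaining treatment: if each level awaits the next, a caller of start() knows no initial-data processing is still pending. A rough sketch under the same assumptions (asyncio, hypothetical names):

```python
import asyncio

# Hypothetical sketch of nested watch setup: each level awaits the next,
# so the whole chain completes before the outermost start() returns.
events = []

async def establish_watch(name, on_initial):
    await asyncio.sleep(0)              # simulated ZK round trip
    await on_initial({"node": name})    # process initial data before returning

async def relation_lifecycle_start():
    async def on_initial(data):
        events.append(("relation", data["node"]))
    await establish_watch("relation-0", on_initial)

async def unit_lifecycle_start():
    async def on_initial(data):
        events.append(("unit", data["node"]))
        await relation_lifecycle_start()   # nested watch setup, also awaited
    await establish_watch("unit-0", on_initial)

asyncio.run(unit_lifecycle_start())
# Both levels completed before run() returned; nothing in the background.
assert events == [("unit", "unit-0"), ("relation", "relation-0")]
```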
[19:09] <_mup_> ensemble/expose-provisioning r229 committed by jim.baker@canonical.com
[19:09] <_mup_> Fix corner case on policy removal
[19:18] <jimbaker> back in 10 min
[19:19] <hazmat> jimbaker, sounds good
[19:32] <hazmat> jimbaker, ready?
[19:32] <jimbaker> bcsaller, hazmat, niemeyer - standup ?
[19:32] <niemeyer> Yep
[19:32] <bcsaller> yeah
[20:08] <hazmat> niemeyer, one other thing regarding the resolved work... it looks like unit relations which have errors will currently automatically recover (start executing hooks again) if their containing unit is transitioned from stopped to started again
[20:09] <hazmat> niemeyer, at the moment we don't ever execute stop except when we're killing the unit.. but i'm concerned about what the correct behavior would look like.. i'm also looking at introducing a separate error state for unit relation workflows, but it has the same behavior on unit.start (relation automatically recovers)
[20:10] <hazmat> the question is do we want that behavior
[20:10] <hazmat> or force all errors to be explicitly recovered.. 
[20:12] <niemeyer> hazmat: I think it's cool to auto-recover them in that situation
[20:12] <niemeyer> hazmat: Stopping implies killing all relations
[20:12] <hazmat> niemeyer, cool
[20:13] <niemeyer> hazmat: So it's a bit like restarting the system rather than fixing it
[20:13] <hazmat> niemeyer, yeah.. its a significant context switch
[20:17] <niemeyer> +    def set_exposed_flag(self):
[20:17] <niemeyer> bcsaller: This seems to be part of the config-state-manager branch
[20:17] <bcsaller> merged trunk at some point
[20:17] <niemeyer> bcsaller: Has a merge leaked in or something?
[20:18] <niemeyer> bcsaller: Unless I'm doing something wrong, which wouldn't be surprising, this doesn't seem to be in trunk
[20:18] <niemeyer> bcsaller: Launchpad seems to agree with me as well, given the diff there
[20:18] <bcsaller> I thought that was from one of the first expose branches... I can look into it, but I didn't pull from anywhere else 
[20:19] <niemeyer> bcsaller: I hope it's not our trunk in a bad state somehow :(
[20:19] <bcsaller> 'else' == 'merge from anywhere but trunk'
[20:19] <bcsaller> yeah
[20:20] <niemeyer> bcsaller: What's the latest commit in your trunk?
[20:20] <bcsaller> hmm, so when I grep local trunk it doesn't show up either... maybe it was botched at one point and fixed
[20:20] <bcsaller> 214: refactor to YAMLState
[20:20] <niemeyer> bcsaller: Hmm
[20:20] <niemeyer> bcsaller: annotate would show it, if that was the case
[20:20]  * niemeyer tries
[20:20] <bcsaller> 213: Jim Baker 2011-04-27 [merge] merged expose-ensemble-commands [r=niemeyer][f=767407]
[20:20] <bcsaller> which is where I thought that came from 
[20:24] <hazmat> indeed it does
[20:25] <niemeyer> Something pretty strange there
[20:26] <hazmat> niemeyer, that method is on trunk
[20:26] <niemeyer> hazmat: I know.. but the diff is still showing it as being introduced by this branch
[20:26] <hazmat> hmm.. strange indeed, diff got confused perhaps
[20:27] <hazmat> niemeyer, another interesting resolved issue: the on-disk state and the in-memory state won't be directly recoverable anymore without some state-specific semantics for recovering from on-disk state. ie a restarted unit agent with a relation in an error state would require special semantics around loading from disk to ensure that the in-memory process state (watching and scheduling but not executing) matches the recovery transition actions (which just restart hook execution, but assume the watch continues). this functionality was added to better allow for the behavior that, while down due to a hook error, the relation continues to schedule pending hooks
[20:27] <niemeyer> Definitely
[20:28]  * hazmat notes the same in the code
[20:28] <niemeyer> hazmat, bcsaller: It's most probably just a criss-cross that is confusing the diffing
[20:30] <niemeyer> hazmat, bcsaller: Hmm.. nope
[20:30] <niemeyer> hazmat, bcsaller: I think there's an actual issue
[20:31] <niemeyer> bcsaller: Your branch has the code duplicated
[20:31] <niemeyer> bcsaller: Probably a conflict which wasn't solved properly
[20:31] <hazmat> hmm.. yeah. merge conflict issue
[20:31] <bcsaller> hmm
[20:32] <bcsaller> I'll see if I can fix it after the fact then. I don't recall this happening 
[20:32] <niemeyer> bcsaller: You can.. just remove the incorrect lines
[20:32] <niemeyer> bcsaller: Just be careful to remove the right ones, so that you're avoiding a change rather than moving code in trunk
[20:35] <hazmat> bzr diff --using=meld -r 213 path_to_file.py
[20:35] <hazmat> should help
[20:36] <niemeyer> hazmat: Re. the recover state, yeah, that's tricky
[20:41] <niemeyer> hazmat: Maybe we should stop trying to be magic about crashes
[20:41] <niemeyer> hazmat: and simply introduce a e.g. "stuck" state
[20:41] <niemeyer> Hmmm..
[20:42] <niemeyer> We have to talk more about this
[20:42] <hazmat> niemeyer, its fine if we introduce additional hooks for an on disk loader
[20:42] <niemeyer> Because it involves partitioning situations and such
[20:42] <hazmat> i mean per state hooks
[20:42] <niemeyer> (which suck, for sure :)
[20:42] <hazmat> i've noted the problem in workflow.py, sans partitioning problems, for future, there's a bug tracking this functionality though afaicr
[20:43] <niemeyer> hazmat: You mean starting to dump the pending relation changes too? 
[20:43] <hazmat> niemeyer, yup
[20:43] <niemeyer> hazmat: and what about the changes lost while in the crashed state?
[20:43] <hazmat> niemeyer, good question 
[20:44] <niemeyer> hazmat: I think we need a simpler strategy for "catching up" after a reestablished unit agent
[20:44] <hazmat> niemeyer, i think we'll need some sort of read through of the relation and diff against the old known state to create the minimal hook execution needed to transition the local to the global state
[20:44] <hazmat> sorry.. bad wording.. read through of all the related units
[20:45] <hazmat> which might also work for partitioning, if we enable this for connection reconnects
[20:46] <niemeyer> hazmat: Maybe.. or maybe we just need a simpler and well-understood behavior on a reestablishment
[20:46] <niemeyer> hazmat: Even if it involves slightly diverging behavior from normal operation
[20:46] <niemeyer> hazmat: This feels like a good topic for a session at UDS
[20:46] <hazmat> niemeyer, without the diff and its application via formula hooks, its hard to see what state/hook execution guarantees we offer to formulas
[20:47] <hazmat> niemeyer, sounds good
[20:47] <niemeyer> hazmat: Agreed, the answer doesn't seem straightforward yet
[20:47] <niemeyer> hazmat: But what I'm proposing isn't that we allow changes to simply be ignored
[20:48] <niemeyer> hazmat: Maybe there are some guarantees we can give to the formula which would suffice
[20:48] <niemeyer> hazmat: E.g. "Oh, if the zk session is reestablished, the -relation-changed hook is necessarily run." 
[20:48] <niemeyer> hazmat: I don't know.. that may be crack, but that's the kind of idea I'm talking about
[20:50] <hazmat> hmmm... but joins and departs could also be there... its worth thinking about.. but i think implementing the application of the state diff via hooks is a pretty good (if more complicated to implement) strategy, we'd get the merge of redundant events, and maintain all the same guarantees.. but we can talk at uds regarding it
[20:50] <niemeyer> hazmat: Joins and departs can always be acknowledged post-mortem
[20:50] <niemeyer> hazmat: The nodes are effectively gone
[20:51] <hazmat> niemeyer, not combined, separate. the combined case is handled by the merge, which strips them as redundant.
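The "merge of redundant events" idea hazmat describes might look roughly like the sketch below: diff the last-known set of related units against the current state to produce a minimal catch-up schedule of hook events. The function, field names, and event tuples are all hypothetical, not ensemble's actual API. A unit that both joined and departed while the agent was down appears in neither snapshot, so it naturally produces no events, which is the redundancy-stripping being discussed.

```python
# Hypothetical sketch: compare the last on-disk snapshot of related units
# against the current ZK state and emit the minimal hook events needed to
# catch up. A join+depart pair for the same unit collapses to nothing,
# since such a unit is absent from both snapshots.

def catch_up_events(old_units, new_units):
    """old_units/new_units: dicts mapping unit name -> settings version."""
    events = []
    for unit in sorted(set(new_units) - set(old_units)):
        events.append(("joined", unit))
        events.append(("changed", unit))   # a join implies an initial changed
    for unit in sorted(set(old_units) - set(new_units)):
        events.append(("departed", unit))
    for unit in sorted(set(old_units) & set(new_units)):
        if old_units[unit] != new_units[unit]:
            events.append(("changed", unit))
    return events

old = {"wordpress/0": 3, "wordpress/1": 1}
new = {"wordpress/0": 4, "wordpress/2": 1}
# wordpress/1 departed; wordpress/2 joined; wordpress/0 settings changed.
print(catch_up_events(old, new))
# -> [('joined', 'wordpress/2'), ('changed', 'wordpress/2'),
#     ('departed', 'wordpress/1'), ('changed', 'wordpress/0')]
```

This says nothing about niemeyer's harder questions below (crash during the state write, events lost while crashed); it only illustrates the diff-and-replay shape of the proposal.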
[20:51] <niemeyer> hazmat: Complex and smart logic is the kind of thing which yields major-EBS-fail kind of problem.  Not saying we shouldn't do it, but it would be awesome to have something simple and predictable.
[20:53] <hazmat> niemeyer, sure as a general rule, simple beats complex, but done right this allows much simpler reasoning about state for formula authors.
[20:54] <niemeyer> hazmat: Agreed, we're having some trouble to come up with complex-and-done-right, though, which is an indication that the simpler-and-predictable should still be on the table.
[20:57] <hazmat> niemeyer, the notion that anytime as a formula author i might have arbitrary events happen, and all i get is a change hook call, and have to refetch all interesting state and diff within the formula to determine what happened, is a problematic burden for formula authors imo...  sometimes complexity is justifiable for features.. if we can get those features in a simpler way, that sounds great as well.
[20:58] <niemeyer> hazmat: You seem to be judging a proposal which I haven't made.
[20:58] <niemeyer> hazmat: I'm suggesting we should still be looking for a simpler solution, rather than coming up with very complex logic which is error prone.
[20:59] <niemeyer> hazmat: If we can't come up with the simpler approach, sure, let's go the difficult way
[20:59] <niemeyer> hazmat: But it feels like we haven't explored the problem well enough yet
[20:59] <hazmat> niemeyer, no i'm advocating for one i made, there isn't any other proposal on the table, i don't see another option for maintaining the same guarantees to formula authors... if there are simpler alternatives we should definitely consider them.
[21:01] <niemeyer> hazmat: "<hazmat> niemeyer, the notion that anytime as a formula author i might have arbitrary events happen, and i all get is a change hook call"
[21:01] <niemeyer> hazmat: That's what I was talking about
[21:01] <hazmat> niemeyer, <niemeyer> hazmat: E.g. "Oh, if the zk session is reestablished, the -relation-changed hook is necessarily run." 
[21:01] <hazmat> i realize that wasn't a proposal, just an exploration, but part of my comments were addressing that
[21:02] <niemeyer> hazmat: Yes, that's the *kind* of solution
[21:02] <niemeyer> niemeyer> hazmat: I don't know.. that may be crack, but that's the kind of idea I'm talking about
[21:02] <niemeyer> hazmat: In your proposal, what happens if the agent crashes while the state is being written?
[21:04] <niemeyer> hazmat: and what is the ordering of events, to ensure that the given a relation modification occurring, the event isn't lost?
[21:05] <niemeyer> hazmat: That's the kind of trickery we'll be facing
[21:05] <hazmat> niemeyer, if the agent crashes and the change is not on disk, then its part of the state diff to be applied to the loaded state 
[21:06] <hazmat> there's definitely stuff to be worked out.
[21:06] <niemeyer> hazmat: I understand that, but the questions above apply irrespective of it
[21:07] <hazmat> agreed, we should discuss it more at uds, i should get back to trying to finish resolved
[21:07] <_mup_> ensemble/config-state-manager r206 committed by bcsaller@gmail.com
[21:07] <_mup_> resolve merge conflict
[21:08] <niemeyer> hazmat: My point is simply that it's far from easy to come up with the precise behavior to recover the exact agent state from disk, and it won't be fun to debug an improperly recovered agent which drops arbitrary events for whatever reason.
[21:08] <niemeyer> hazmat: If that's what we have to do, let's do it, but I'd really like to explore some simpler alternatives on the way to it.
[21:12] <_mup_> ensemble/unit-agent-resolved r264 committed by kapil.thangavelu@canonical.com
[21:12] <_mup_> unit relations transition to an explicit error state on a hook failure, and are recovered automatically if the unit restarts.
[21:19] <_mup_> ensemble/config-state-manager r207 committed by bcsaller@gmail.com
[21:19] <_mup_> resolve merge conflict in tests
[21:28] <_mup_> ensemble/unit-agent-resolved r265 committed by kapil.thangavelu@canonical.com
[21:28] <_mup_> resolved recovery from relation error state, re-enables hook execution only (watches still running).
[21:50] <_mup_> ensemble/unit-agent-resolved r266 committed by kapil.thangavelu@canonical.com
[21:50] <_mup_> if a unit relation is already running, and its not resolved, the resolved setting is cleared.
[21:52] <hazmat> hmm.. i need to split this branch
[22:13] <_mup_> ensemble/unit-agent-resolved r267 committed by kapil.thangavelu@canonical.com
[22:13] <_mup_> pull the resolved watch in the unit agent for a future branch.
[22:19] <_mup_> ensemble/expose-provisioning r230 committed by jim.baker@canonical.com
[22:19] <_mup_> More testing
[22:24] <_mup_> ensemble/trunk-merge r189 committed by kapil.thangavelu@canonical.com
[22:24] <_mup_> merge trunk
[22:25] <_mup_> ensemble/expose-provisioning r231 committed by jim.baker@canonical.com
[22:25] <_mup_> Cleanup
[22:26] <_mup_> ensemble/resolved-state-api r202 committed by kapil.thangavelu@canonical.com
[22:26] <_mup_> merge trunk and pep8ify it.
[22:34] <_mup_> ensemble/expose-provisioning r232 committed by jim.baker@canonical.com
[22:34] <_mup_> PEP8
[22:49] <_mup_> ensemble/formula-upgrade r269 committed by kapil.thangavelu@canonical.com
[22:49] <_mup_> doc cleanup
[22:52] <hazmat> jimbaker,  test_watch_exposed_flag_waits_on_slow_callbacks seems to hang for me regularly.
[23:07] <hazmat> bcsaller, the merge diff on your branch looks strange, now it looks like it removes ensemble-expose on trunk
[23:09] <hazmat> niemeyer, is there a hook into mup for reporting builds from an external system like jenkins?
[23:09] <niemeyer> hazmat: It should be very easy to send arbitrary data to mup
[23:10] <niemeyer> hazmat: Using the same mechanism we use for sending commit reports
[23:10] <niemeyer> hazmat: It's a line based protocol with a username/password prefix
[23:10] <hazmat> niemeyer, cool
[23:10] <niemeyer> hazmat: echo + socat should be able to do it even via shell
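As an illustration of the line-based report niemeyer describes: a jenkins-side notifier could send a single credential-prefixed line over TCP. The host, port, separator, and message format below are assumptions, not the real mup protocol, and the throwaway local server only stands in for mup so the sketch runs standalone.

```python
import socket
import threading

def report_build(host, port, user, password, message):
    # One line per report, prefixed with username/password (format assumed).
    line = "%s:%s %s\n" % (user, password, message)
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("utf-8"))

# Throwaway local server standing in for mup, so the sketch is runnable.
received = []

def accept_one(server):
    conn, _ = server.accept()
    with conn:
        received.append(conn.recv(1024).decode("utf-8"))

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
thread = threading.Thread(target=accept_one, args=(server,))
thread.start()
report_build("127.0.0.1", server.getsockname()[1],
             "jenkins", "secret", "ensemble r269: build OK")
thread.join()
server.close()
print(received[0].strip())
```

The shell equivalent niemeyer suggests would just pipe the same line through socat instead of opening the socket from Python.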
[23:13] <niemeyer> biab