/srv/irclogs.ubuntu.com/2011/05/02/#ubuntu-ensemble.txt

[00:48] <_mup_> ensemble/expose-provisioning r226 committed by jim.baker@canonical.com
[00:48] <_mup_> Work around the fact that the provisioning agent's state is stored in class attributes
=== kim0|vacation is now known as kim0
[09:03] <kim0> Morning everyone o/
[13:49] <niemeyer> Good mornings!
[14:03] <niemeyer> 1 WEEK!
[14:05] <TeTeT> niemeyer: what for? UDS?
[14:07] <niemeyer> TeTeT: Yeah :)
[16:12] <jimbaker> indeed, one week it is!
=== niemeyer is now known as niemeyer_lunch
[16:30] <_mup_> ensemble/unit-agent-resolved r262 committed by kapil.thangavelu@canonical.com
[16:30] <_mup_> unit lifecycle resolving unit relations workflows
=== niemeyer_lunch is now known as niemeyer
[17:23] <jimbaker> this is interesting. if i run repeatedly with -u a test i have that looks at service watching for exposed services in the provisioning agent, then eventually it fails with an InternalTopologyError of service not found
[17:24] <jimbaker> looking at the topo dump, the services dict is empty at that point
[17:24] <jimbaker> i don't think this happens in my new code, but investigating further
[17:28] <jimbaker> ahh, never mind... just the way the watch is run, this can happen. good
[17:41] <_mup_> ensemble/expose-provisioning r227 committed by jim.baker@canonical.com
[17:41] <_mup_> Fixed watch on service unit names so that the watch can run after the service topo has changed; tested 500+ tries
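
The failure jimbaker hit and the fix in r227 point at a watch callback that has to tolerate the service having left the topology by the time the callback runs. A minimal sketch of that shape, with invented names (the surrounding agent code and the InternalTopologyError import path are assumptions, not the actual ensemble source):

    # Sketch only: get_service, get_all_unit_states and the broad except are
    # stand-ins; real code would catch InternalTopologyError specifically.
    from twisted.internet.defer import inlineCallbacks

    @inlineCallbacks
    def on_service_units_changed(state_manager, service_name, change):
        try:
            service = yield state_manager.get_service(service_name)
        except Exception:
            # The service may have been removed between the watch firing and
            # this callback running (the InternalTopologyError case seen in
            # the test), so there is simply nothing left to inspect.
            return
        units = yield service.get_all_unit_states()
        # ... check exposed flags / open ports for the surviving units ...
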
[17:45] <niemeyer> jimbaker: Hmmm
[17:45] <niemeyer> jimbaker: Is that related to the watch running after the test is done?
[17:46] <niemeyer> jimbaker: This may be related to the issue I was discussing with hazmat on Friday
[17:46] <jimbaker> niemeyer, yes, it is quite possible that the watch is running after the completion of the test
[17:46] <niemeyer> jimbaker: Ok
[17:47] <niemeyer> jimbaker: hazmat is working to avoid that kind of issue
[17:47] <hazmat> niemeyer, regarding that i realized the plan formulated on friday doesn't work, since there aren't necessarily any watches firing when sync is invoked, and there may be unrelated background deferreds
[17:47] <niemeyer> hazmat: We covered that case on friday
[17:49] <niemeyer> hazmat: There shouldn't be unrelated background deferreds lingering without a connection to the test ever
[17:49] <hazmat> niemeyer, my understanding is that the plan was an alteration of the client sync to do a round trip poke to flush communication channels, in addition to serialized watch execution. but the issue here isn't the watch firing at all
[17:49] <niemeyer> hazmat: It is an early mistake we made in the design of watches that they allow that to happen
[17:50] <niemeyer> hazmat: What's the issue?
[17:50] <hazmat> its the get/exists information retrieval that establishes the watch, the processing of that initial data being async to the caller is what causes the problem
[17:51] <niemeyer> hazmat: I'm not following you
[17:51] <hazmat> there is no background watch callback execution per se then
[17:51] <niemeyer> hazmat: We went through that on friday, precisely
[17:51] <niemeyer> hazmat: I explained why the "initial data being async" isn't the problem
[17:51] <niemeyer> hazmat: and you agreed
[17:52] <niemeyer> hazmat: We took a while to get there as well, with examples and so on
[17:52] <niemeyer> hazmat: If you think this isn't true anymore, I'll need some concrete examples
[17:53] <hazmat> niemeyer, my understanding of the solution that was formulated on friday was that it would be an alteration of the zk client to serialize watch callback executions, and sync would ensure none are firing currently and none pending (via roundtrip to zk).
[17:54] <hazmat> further reflection on that, and it's not apparent that it solves the actual issue encountered here
[17:54] <niemeyer> hazmat: Yes, none are fired currently, no writes within the roundtrip of sync execute
[17:54] <hazmat> because there are no watches firing in the common case
[17:55] <hazmat> instead the watch callback (ours not zookeeper) is being invoked with initial data reflecting the current state
[17:55] <niemeyer> hazmat: I don't understand.. the problem we were solving is watches firing in the background.  If there are no watches firing, there is no problem.
[17:55] <hazmat> niemeyer, there is a problem that the processing of initial data (that was retrieved from establishing the watch) is processed async to the caller
[17:56] <niemeyer> hazmat: Oh my
[17:56] <niemeyer> hazmat: You keep repeating that, but I already explained several times why this isn't the problem.
[17:57] <niemeyer> hazmat: and you agreed on friday.. so again, I'll need some more concrete examples if you have some new information.
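
A rough sketch of the Friday proposal as restated here, assuming a hypothetical wrapper and txzookeeper-style methods: watch callbacks are serialized through one lock, and sync() does a server round trip and then takes the lock, so nothing is firing and nothing is pending when it returns.

    # Illustrative only; SerializedClient and its method names are invented.
    from twisted.internet.defer import DeferredLock, inlineCallbacks

    class SerializedClient(object):
        """Wraps a ZooKeeper client so watch callbacks never run concurrently."""

        def __init__(self, client):
            self._client = client
            self._watch_lock = DeferredLock()

        def _serialized(self, callback):
            # Every watch callback runs under the same lock, one at a time.
            def wrapper(event):
                return self._watch_lock.run(callback, event)
            return wrapper

        def exists_and_watch(self, path, callback):
            exists_d, watch_d = self._client.exists_and_watch(path)
            watch_d.addCallback(self._serialized(callback))
            return exists_d

        @inlineCallbacks
        def sync(self):
            # The server round trip flushes pending watch notifications; taking
            # the lock afterwards guarantees no callback is still mid-flight.
            yield self._client.sync()
            yield self._watch_lock.run(lambda: None)
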
[18:00] <_mup_> ensemble/unit-agent-resolved r263 committed by kapil.thangavelu@canonical.com
[18:00] <_mup_> unit relation lifecycles can stop/start execution independent of related unit watching.
[18:04] <hazmat> niemeyer, i agreed there were commutability problems with multiple async watch creators relying on state resulting in sequencing operations. i still think those are mostly immaterial since at all times operations attempt to reflect current state. I liked the alternative solution, and said i would think about it more, i have and afaics it doesn't solve the original issue.
[18:08] <niemeyer> hazmat: The original issue and the lack of ordering are actually the same issue.
[18:08] <hazmat> niemeyer, as for examples that show this, the one i just brought up embodies the original issue, and it isn't solved by this. calling sync and doing the round trip would only guarantee no zk client watch callback activity, but that's not related to our watch api invoking its callback with data that's independent of the zk watch firing
[18:08] <niemeyer> hazmat: and as we discussed on Friday, I do not agree that it's not relevant.  I don't want us to have to think about ordering in tests.
[18:09] <niemeyer> hazmat: You're thinking about a very narrow problem and trying to come up with a solution for it.
[18:09] <niemeyer> hazmat: The issue is general.. we have been looking at watches running in the background since we started using that convention
[18:09] <niemeyer> hazmat: We have to solve the actual issue, rather than a one-off issue
[18:10] <hazmat> niemeyer, we constantly think about ordering in tests, except with this problem we can't rely on ordering
[18:10] <niemeyer> hazmat: No, we never think about them, ever!
[18:10] <niemeyer> hazmat: We never had to say "Oh, I will create that object first so that the watch of the 3 other objects fire at the time this deferred fires"
[18:13] <hazmat> niemeyer, not sure how that relates.. we setup state and behavior in tests, and wait for the behavior to fire on the state. our ordering problem is that we can wait for behavior to be in place before we modify state. at all times behavior execution reflects current state.
[18:13] <hazmat> s/can/can't
[18:14] <niemeyer> hazmat: Sorry, can't parse that sentence
[18:15] <niemeyer> hazmat: Whenever we add a watch, the code of that watch will run in background right now when the watch fires
[18:15] <niemeyer> hazmat: and that's the problem..
[18:16] <niemeyer> hazmat: Solve this problem, and we don't need to "wait for first fire"
[18:16] <niemeyer> hazmat: Solve first fire, and you need to ensure that *the data was in the state when it first fires*.
[18:16] <niemeyer> hazmat: Otherwise what will happen is that the first fire is dumb, because the state you wanted wasn't there
[18:18] <hazmat> niemeyer, they are separate issues in one case we have a zk watch firing, in the other we don't, at all times the ensemble api callback considers current state.
[18:18] <niemeyer> hazmat: No, they are not separate issues.  One is a partial solution to the actual problem which requires hand-crafting ordering.  The other is the actual problem.
[18:21] <niemeyer> hazmat: FWIW, I have nothing against waiting for the first state.. it sounds great, actually.
[18:22] <niemeyer> hazmat: But when we start to tweak ordering to enable tests to pass because there was a watch running in the background, that will be an issue.
[18:24] <hazmat> niemeyer, why doesn't the lack of a currently executing serialized watch in the zk client invalidate the usage? we could call sync immediately after lifecycle.start() for example, and it would return while there are still ongoing zk non-watch operations on the zk state.
niemeyerhazmat: "Why doesn't (...) invalidate the usage?" of what?18:25
hazmatniemeyer, of the proposed solution we had friday18:25
niemeyerhazmat: Which ongoing zk non-watch operations?18:26
hazmatniemeyer, in this case, the establishment of unit relation workflows and lifecycles, that in turn establish watches, and fire hooks based on the current state.18:28
hazmatniemeyer, in other cases its a bit simpler18:29
niemeyerhazmat: I don't see how that's special in any sense?18:29
niemeyerhazmat: If we wait on the end of those operations, and do a sync, it should work18:29
niemeyerhazmat: wait on the non-watch operations, that is18:30
niemeyerhazmat: the sync should wait for the rest to get done18:30
niemeyerhazmat: This is twisted 101, and we haven't been following it appropriately.. an operation which will end in the future must return a deferred which fires when the operation is done executing.18:30
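
A minimal, generic illustration of that rule, with invented names; only the shape of the API matters here.

    from twisted.internet.defer import inlineCallbacks

    def write_flag_broken(client, path):
        # Fire-and-forget: the caller gets no deferred back, so a test that
        # calls this and then asserts on the node races the background write.
        client.set(path, "exposed")

    @inlineCallbacks
    def write_flag(client, path):
        # Returns a deferred that fires only once the write has completed, so
        # "yield write_flag(...)" leaves nothing running in the background.
        yield client.set(path, "exposed")
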
[18:31] <hazmat> like that of watch_hook_debug, the background activity is establishing the debug setup. or in debug_log its establishing internal logging. neither of those involve further watches, but they do involve get roundtrips.
[18:31] <niemeyer> hazmat: watches are special, because we can't inherently wait for them because they are started out of band
[18:31] <niemeyer> hazmat: That's why we need sync()
[18:31] <niemeyer> hazmat: They should return deferreds, which should be waited upon by whoever called them
[18:32] <hazmat> niemeyer, they do currently but only for the establishment of the watch, not the processing of the data retrieved that resulted from setting the watch.
[18:33] <niemeyer> hazmat: Huh, how come?
[18:33] <niemeyer> hazmat: Aren't they doing the equivalent of
[18:33] <niemeyer> yield self._client.get(...)?
[18:33] <niemeyer> hazmat: That effectively waits for the data to come bak
[18:33] <niemeyer> back
[18:33] <hazmat> niemeyer, yes, the data is retrieved but typically they do ...
[18:34] <hazmat> d_get, d_watch = self._client.get_and_watch()
[18:34] <hazmat> d_get.addCallback(lambda x: ensemble_api_callback(x))
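
Filled out slightly (hypothetical class and names, based only on the two lines above), the pattern being described looks like this; the caller's yield covers establishing the watch and reading the initial data, but not the ensemble-level processing of that data:

    # Hypothetical expansion for illustration; not the actual ensemble code.
    from twisted.internet.defer import inlineCallbacks, maybeDeferred

    class SomethingState(object):

        def __init__(self, client, path):
            self._client = client
            self._path = path

        @inlineCallbacks
        def watch_something(self, ensemble_api_callback):
            d_get, d_watch = self._client.get_and_watch(self._path)
            initial = yield d_get                 # watch established, data read
            d_watch.addCallback(self._on_change)  # future zk watch events
            # The api-level callback gets the initial data, but callback_d is
            # never yielded or returned: the caller's own yield on this method
            # has already been satisfied, so a test can run its assertions
            # while this processing is still in flight.
            callback_d = maybeDeferred(ensemble_api_callback, initial)

        def _on_change(self, change_event):
            pass  # re-establish the watch, dispatch the change, etc.
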
[18:34] <niemeyer> hazmat: Do you have an example?
[18:34] <niemeyer> hazmat: I mean, a pointer in the code
[18:34] <hazmat> ah
[18:35] <niemeyer> hazmat: In those cases, d_get should typically be returned
[18:35] <niemeyer> hazmat: So that it gets chained
[18:35] <niemeyer> hazmat: Since the result of the current function becomes the result of the get
[18:35] <niemeyer> hazmat: and so on
[18:35] <hazmat> niemeyer, ensemble.state.service.ServiceUnitState.watch_hook_debug
[18:35] <niemeyer> hazmat: Looking
[18:36] <hazmat> that should have yield exists_d at the end of it
[18:36] <niemeyer>         exists = yield exists_d
[18:36] <niemeyer> hazmat: and it does
[18:37] <hazmat> niemeyer, er.. i meant yield callback_d
[18:38] <niemeyer> hazmat: Nope, that's the watch, not the data as you mentioned above
[18:38] <niemeyer> hazmat: Well, hmmm.. I see your point
[18:40] <niemeyer> hazmat: But that's problematic, because the real watcher is being added to the chain.. we can't yield it
[18:40] <hazmat> niemeyer, that's the simple case, the more complicated case is embodied in unit_lifecycle.start, process_service_changes, relation_workflow.start->relation_lifecycle.start
[18:40] <niemeyer> hazmat: Nor return it
[18:40] <hazmat> niemeyer, the real watcher isn't in the chain, just the establishment of the callback
[18:40] <niemeyer> hazmat: This will never fire
[18:40] <niemeyer> hazmat: True
[18:41] <niemeyer> hazmat: Agreed this is a separate issue from synchronizing with watches
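
Sketching the direction agreed on here against the hypothetical example above: the deferred covering the initial-data processing gets yielded, while the real watcher stays out of the yielded chain, since waiting on the zk watch itself could block forever.

    # The hypothetical watch_something method above, reworked so the caller
    # waits for the initial processing (imports as in the earlier sketch).
    @inlineCallbacks
    def watch_something(self, ensemble_api_callback):
        d_get, d_watch = self._client.get_and_watch(self._path)
        initial = yield d_get
        callback_d = maybeDeferred(ensemble_api_callback, initial)
        # Attach the watcher for future changes only after the initial callback
        # has run, and never yield d_watch itself -- that would wait for the
        # next zk event, which may never come.
        callback_d.addCallback(
            lambda _: d_watch.addCallback(self._on_change))
        yield callback_d  # the caller now waits for the initial processing
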
[18:42] <hazmat> niemeyer, this problem exists in a more chained fashion (initial data callbacks, processing resulting in more watches, with additional data callbacks) in the lifecycle.start case
[18:43] <niemeyer> hazmat: Sure, that's related to the problem of watch ordering that we've been debating
[18:43] <niemeyer> hazmat: Just solving the initial data problem will not solve the fact watches will run in the background
[18:43] <niemeyer> hazmat: We have to address both of them
[18:43] <_mup_> ensemble/expose-provisioning r228 committed by jim.baker@canonical.com
[18:43] <_mup_> Properly clean up port policy after a service is no longer exposed
[18:43] <hazmat> niemeyer, in neither scenario i've posited is an actual zk watch firing.
[18:43] <niemeyer> hazmat: The end goal is straightforward: yield whatever() + client.sync(), should never leave stuff hanging in the background
[18:44] <niemeyer> hazmat: You said "processing resulting in more watches"
[18:44] <niemeyer> hazmat: If that's not a zk watch, what is it?
[18:44] <hazmat> niemeyer, more watches being established, but with those ensemble watch api callbacks being invoked with the initial data from establishing the watch
[18:45] <hazmat> not from the watch itself firing
[18:46] <niemeyer> hazmat: Ok, this sounds like the same problem still then?
[18:46] <hazmat> niemeyer, it is just nested
[18:47] <niemeyer> hazmat: deferred chaining FTW
[18:48] <hazmat> and the initial execution of those is also the basis in this case for scheduling hook executions... it bears more thinking about..
[18:48] <hazmat> i'm gonna grab a quick lunch before the standup
[18:48] <niemeyer> hazmat: Cool, have a good one
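
The end goal niemeyer states at 18:43, sketched as a test idiom; the fixture names (lifecycle, client, agent, build_fixture) are assumptions.

    # Sketch of the intended test shape, not an actual ensemble test.
    from twisted.internet.defer import inlineCallbacks
    from twisted.trial.unittest import TestCase

    class ExposeTest(TestCase):

        @inlineCallbacks
        def test_exposed_service_is_noticed(self):
            lifecycle, client, agent = yield self.build_fixture()  # assumed helper
            yield lifecycle.start()  # fires only when its own work has finished
            yield client.sync()      # flushes any watch activity still in flight
            # Nothing is left running in the background, so this assertion
            # cannot race a callback that is still processing initial data.
            self.assertTrue(agent.is_exposed("wordpress"))
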
[19:09] <_mup_> ensemble/expose-provisioning r229 committed by jim.baker@canonical.com
[19:09] <_mup_> Fix corner case on policy removal
[19:18] <jimbaker> back in 10 min
[19:19] <hazmat> jimbaker, sounds good
[19:32] <hazmat> jimbaker, ready?
[19:32] <jimbaker> bcsaller, hazmat, niemeyer - standup ?
[19:32] <niemeyer> Yep
[19:32] <bcsaller> yeah
[20:08] <hazmat> niemeyer, one other thing regarding the resolved work... it looks like unit relations which have errors will currently automatically recover (start executing hooks again) if their containing unit is transitioned from stopped to started again
[20:09] <hazmat> niemeyer, at the moment we don't ever execute stop except when we're killing the unit.. but i'm concerned what the correct behavior would look like.. i'm also looking at introducing a separate error state for unit relation workflows, but it has the same behavior on unit.start (relation automatically recovers)
[20:10] <hazmat> the question is do we want that behavior
[20:10] <hazmat> or force all errors to be explicitly recovered..
[20:12] <niemeyer> hazmat: I think it's cool to auto-recover them in that situation
[20:12] <niemeyer> hazmat: Stopping implies killing all relations
[20:12] <hazmat> niemeyer, cool
[20:13] <niemeyer> hazmat: So it's a bit like restarting the system rather than fixing it
[20:13] <hazmat> niemeyer, yeah.. its a significant context switch
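
A toy sketch of the behavior being settled on, using an invented transition table rather than the real workflow engine: a unit restart drives its relation workflows out of the error state automatically, while an error on a running unit still needs an explicit resolve.

    # Invented, minimal state machine for illustration only.
    RELATION_TRANSITIONS = {
        # (current_state, event) -> new_state
        ("up", "hook_error"): "error",
        ("error", "resolve"): "up",      # explicit operator recovery
        ("error", "unit_start"): "up",   # unit restart auto-recovers relations
        ("up", "unit_stop"): "down",
        ("down", "unit_start"): "up",
    }

    def transition(state, event):
        return RELATION_TRANSITIONS.get((state, event), state)

    assert transition("error", "unit_start") == "up"  # auto-recovery on restart
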
[20:17] <niemeyer> +    def set_exposed_flag(self):
[20:17] <niemeyer> bcsaller: This seems to be part of the config-state-manager branch
[20:17] <bcsaller> merged trunk at some point
[20:17] <niemeyer> bcsaller: Has a merge leaked in or something?
[20:18] <niemeyer> bcsaller: Unless I'm doing something wrong, which wouldn't be surprising, this doesn't seem to be in trunk
[20:18] <niemeyer> bcsaller: Launchpad seems to agree with me as well, given the diff there
[20:18] <bcsaller> I thought that was from one of the first expose branches... I can look into it, but I didn't pull from anywhere else
[20:19] <niemeyer> bcsaller: I hope it's not our trunk in a bad state somehow :(
[20:19] <bcsaller> 'else' == 'merge from anywhere but trunk'
[20:19] <bcsaller> yeah
[20:20] <niemeyer> bcsaller: What's the latest commit in your trunk?
[20:20] <bcsaller> hmm, so when I grep local trunk it doesn't show up either... maybe it was botched at one point and fixed
[20:20] <bcsaller> 214: refactor to YAMLState
[20:20] <niemeyer> bcsaller: Hmm
[20:20] <niemeyer> bcsaller: annotate would show it, if that was the case
[20:20] * niemeyer tries
[20:20] <bcsaller> 213: Jim Baker 2011-04-27 [merge] merged expose-ensemble-commands [r=niemeyer][f=767407]
[20:20] <bcsaller> which is where I thought that came from
[20:24] <hazmat> indeed it does
[20:25] <niemeyer> Something pretty strange there
[20:26] <hazmat> niemeyer, that method is on trunk
[20:26] <niemeyer> hazmat: I know.. but the diff is still showing it as being introduced by this branch
[20:26] <hazmat> hmm.. strange indeed, diff got confused perhaps
[20:27] <hazmat> niemeyer, another interesting resolved issue: the on disk state and the in memory state won't be directly recoverable anymore without some state specific semantics to recovering from on disk state, ie a restarted unit agent with a relation in an error state would require special semantics around loading from disk to ensure that the in-memory process state (watching and scheduling but not executing) matches the recovery transition actions (which just restart hook execution, but assume the watch continues).. this functionality was added to better allow for the behavior that, while down due to a hook error, the relation continues to schedule pending hooks
[20:27] <niemeyer> Definitely
[20:28] * hazmat notes the same in the code
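
A rough sketch of the state-specific loading being described; the file format, state names, and the execute_hooks flag are all assumptions made for illustration.

    # Illustrative only; the real workflow/lifecycle API will differ.
    import yaml

    def load_relation_workflow(path, lifecycle):
        """Restore a relation workflow from its on-disk snapshot."""
        with open(path) as f:
            snapshot = yaml.safe_load(f)
        state = snapshot.get("state", "down")
        if state == "up":
            # Normal case: watching, scheduling, and executing hooks.
            lifecycle.start()
        elif state == "error":
            # While in error the relation kept watching and scheduling pending
            # hooks without executing them; recovery must restore exactly that
            # in-memory shape rather than simply restarting hook execution.
            lifecycle.start(execute_hooks=False)
        return state
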
[20:28] <niemeyer> hazmat, bcsaller: It's most probably just a criss-cross that is confusing the diffing
[20:30] <niemeyer> hazmat, bcsaller: Hmm.. nope
[20:30] <niemeyer> hazmat, bcsaller: I think there's an actual issue
[20:31] <niemeyer> bcsaller: Your branch has the code duplicated
[20:31] <niemeyer> bcsaller: Probably a conflict which wasn't solved properly
[20:31] <hazmat> hmm.. yeah. merge conflict issue
[20:31] <bcsaller> hmm
[20:32] <bcsaller> I'll see if I can fix it after the fact then. I don't recall this happening
[20:32] <niemeyer> bcsaller: You can.. just remove the incorrect lines
[20:32] <niemeyer> bcsaller: Just be careful to remove the right ones, so that you're avoiding a change rather than moving code in trunk
[20:35] <hazmat> bzr diff --using=meld -r 213 path_to_file.py
[20:35] <hazmat> should help
[20:36] <niemeyer> hazmat: Re. the recover state, yeah, that's tricky
[20:41] <niemeyer> hazmat: Maybe we should stop trying to be magic about crashes
[20:41] <niemeyer> hazmat: and simply introduce a e.g. "stuck" state
[20:41] <niemeyer> Hmmm..
[20:42] <niemeyer> We have to talk more about this
[20:42] <hazmat> niemeyer, its fine if we introduce additional hooks for an on disk loader
[20:42] <niemeyer> Because it involves partitioning situations and the suck
[20:42] <hazmat> i mean per state hooks
[20:42] <niemeyer> such
[20:42] <niemeyer> (which suck, for sure :)
[20:42] <hazmat> i've noted the problem in workflow.py, sans partitioning problems, for future, there's a bug tracking this functionality though afaicr
[20:43] <niemeyer> hazmat: You mean starting to dump the pending relation changes too?
[20:43] <hazmat> niemeyer, yup
[20:43] <niemeyer> hazmat: and what about the changes lost while in the crashed state?
[20:43] <hazmat> niemeyer, good question
[20:44] <niemeyer> hazmat: I think we need a simpler strategy for "catching up" after a reestablished unit agent
[20:44] <hazmat> niemeyer, i think we'll need some sort of read through of the relation and diff against the old known state to create the minimal hook execution needed to transition the local to the global state
[20:44] <hazmat> sorry.. bad wording.. read through of all the related units
[20:45] <hazmat> which might also work for partitioning, if we enable this for connection reconnects
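
A toy sketch of that diff-and-catch-up idea (data shapes and hook names are assumptions): compare the last known related units against what is currently in zookeeper and emit only the minimal set of hook events.

    # Illustrative only; not the ensemble hook scheduler.
    def catch_up_events(old_units, current_units):
        """Both arguments map unit name -> settings version."""
        events = []
        for name in sorted(set(old_units) - set(current_units)):
            events.append((name, "departed"))
        for name in sorted(set(current_units) - set(old_units)):
            events.append((name, "joined"))
            events.append((name, "changed"))  # a new unit's settings are news too
        for name in sorted(set(old_units) & set(current_units)):
            if old_units[name] != current_units[name]:
                events.append((name, "changed"))
        return events

    # A unit that joined and departed while the agent was down never appears at
    # all, which is the "merge strips redundant events" behavior mentioned below.
    print(catch_up_events({"db/0": 3, "web/1": 7}, {"db/0": 4, "web/2": 1}))
    # [('web/1', 'departed'), ('web/2', 'joined'), ('web/2', 'changed'), ('db/0', 'changed')]
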
[20:46] <niemeyer> hazmat: Maybe.. or maybe we just need a simpler and well-understood behavior on a reestablishment
[20:46] <niemeyer> hazmat: Even if it involves slightly diverging behavior from normal operation
[20:46] <niemeyer> hazmat: This feels like a good topic for a session at UDS
[20:46] <hazmat> niemeyer, without the diff and its application via formula hooks, its hard to see what state/hook execution guarantees we offer to formulas
[20:47] <hazmat> niemeyer, sounds good
[20:47] <niemeyer> hazmat: Agreed, the answer doesn't seem straightforward yet
[20:47] <niemeyer> hazmat: But what I'm proposing isn't that we allow changes to simply be ignored
[20:48] <niemeyer> hazmat: Maybe there are some guarantees we can give to the formula which would suffice
[20:48] <niemeyer> hazmat: E.g. "Oh, if the zk session is reestablished, the -relation-changed hook is necessarily run."
[20:48] <niemeyer> hazmat: I don't know.. that may be crack, but that's the kind of idea I'm talking about
[20:50] <hazmat> hmmm... but joins and departs could also be there... its worth thinking about.. but i think implementing the application of the state diff via hooks is a pretty good (if more complicated to implement) strategy, we'd get the merge of redundant events, and maintain all the same guarantees.. but we can talk at uds regarding
[20:50] <niemeyer> hazmat: Joins and departs can always be acknowledged post-mortem
[20:50] <niemeyer> hazmat: The nodes are effectively gone
[20:51] <hazmat> niemeyer, not combined, separate. the combined case is handled by the merge which strips as redundant.
[20:51] <niemeyer> hazmat: Complex and smart logic is the kind of thing which yields major-EBS-fail kind of problem.  Not saying we shouldn't do it, but it would be awesome to have something simple and predictable.
[20:53] <hazmat> niemeyer, sure as a general rule, simple beats complex, but done right this allows much simpler reasoning about state for formula authors.
[20:54] <niemeyer> hazmat: Agreed, we're having some trouble to come up with complex-and-done-right, though, which is an indication that the simpler-and-predictable should still be on the table.
[20:57] <hazmat> niemeyer, the notion that anytime as a formula author i might have arbitrary events happen, and i all get is a change hook call, and have to refetch all interesting state and diff within the formula to determine what happened is a problematic burden for formula authors imo...  sometimes complexity is justifiable for features.. if we can get those features in a simpler way, that sounds great as well.
[20:58] <niemeyer> hazmat: You seem to be judging a proposal which I haven't made.
[20:58] <niemeyer> hazmat: I'm suggesting we should still be looking for a simpler solution, rather than coming up with very complex logic which is error prone.
[20:59] <niemeyer> hazmat: If we can't come up with the simpler approach, sure, let's go the difficult way
[20:59] <niemeyer> hazmat: But it feels like we haven't explored the problem well enough yet
[20:59] <hazmat> niemeyer, no i'm advocating for one i made, there isn't any other proposal on the table, i don't see another option for maintaining the same guarantees to formula authors... if there are simpler alternatives we should definitely consider them.
[21:01] <niemeyer> hazmat: "<hazmat> niemeyer, the notion that anytime as a formula author i might have arbitrary events happen, and i all get is a change hook call"
[21:01] <niemeyer> hazmat: That's what I was talking about
[21:01] <hazmat> niemeyer, <niemeyer> hazmat: E.g. "Oh, if the zk session is reestablished, the -relation-changed hook is necessarily run."
[21:01] <hazmat> i realize that wasn't a proposal just an exploration, but part of my comments where addressing that
[21:01] <hazmat> s/where/were
[21:02] <niemeyer> hazmat: Yes, that's the *kind* of solution
[21:02] <niemeyer> niemeyer> hazmat: I don't know.. that may be crack, but that's the kind of idea I'm talking about
[21:02] <niemeyer> hazmat: In your proposal, what happens if the agent crashes while the state is being written?
[21:04] <niemeyer> hazmat: and what is the ordering of events, to ensure that, given a relation modification occurring, the event isn't lost?
[21:05] <niemeyer> hazmat: That's the kind of trickery we'll be facing
[21:05] <hazmat> niemeyer, if the agent crashes and the change is not on disk, then its part of the state diff to be applied to the loaded state
[21:06] <hazmat> there's definitely stuff to be worked out.
[21:06] <niemeyer> hazmat: I understand that, but the questions above apply irrespective of it
[21:07] <hazmat> agreed, we should discuss it more at uds, i should get back to trying to finish resolved
[21:07] <_mup_> ensemble/config-state-manager r206 committed by bcsaller@gmail.com
[21:07] <_mup_> resolve merge conflict
[21:08] <niemeyer> hazmat: My point is simply that it's far from easy to come up with the precise behavior to recover the exact agent state from disk, and it won't be fun to debug an improperly recovered agent which drops arbitrary events for whatever reason.
[21:08] <niemeyer> hazmat: If that's what we have to do, let's do it, but I'd really like to explore some simpler alternatives on the way to it.
[21:12] <_mup_> ensemble/unit-agent-resolved r264 committed by kapil.thangavelu@canonical.com
[21:12] <_mup_> unit relation's transition to an explicit error state on a hook failure, and our recovered automatically if the unit restarts.
[21:19] <_mup_> ensemble/config-state-manager r207 committed by bcsaller@gmail.com
[21:19] <_mup_> resolve merge conflict in tests
[21:28] <_mup_> ensemble/unit-agent-resolved r265 committed by kapil.thangavelu@canonical.com
[21:28] <_mup_> resolved recovery from relation error state, renables hook execution only (watches still running).
[21:50] <_mup_> ensemble/unit-agent-resolved r266 committed by kapil.thangavelu@canonical.com
[21:50] <_mup_> if a unit relation is already running, and its not resolved, the resolved setting is cleared.
[21:52] <hazmat> hmm.. i need to split this branch
[22:13] <_mup_> ensemble/unit-agent-resolved r267 committed by kapil.thangavelu@canonical.com
[22:13] <_mup_> pull the resolved watch in the unit agent for a future branch.
[22:19] <_mup_> ensemble/expose-provisioning r230 committed by jim.baker@canonical.com
[22:19] <_mup_> More testing
[22:24] <_mup_> ensemble/trunk-merge r189 committed by kapil.thangavelu@canonical.com
[22:24] <_mup_> merge trunk
[22:25] <_mup_> ensemble/expose-provisioning r231 committed by jim.baker@canonical.com
[22:25] <_mup_> Cleanup
[22:26] <_mup_> ensemble/resolved-state-api r202 committed by kapil.thangavelu@canonical.com
[22:26] <_mup_> merge trunk and pep8ify it.
[22:34] <_mup_> ensemble/expose-provisioning r232 committed by jim.baker@canonical.com
[22:34] <_mup_> PEP8
[22:49] <_mup_> ensemble/formula-upgrade r269 committed by kapil.thangavelu@canonical.com
[22:49] <_mup_> doc cleanup
[22:52] <hazmat> jimbaker, test_watch_exposed_flag_waits_on_slow_callbacks seems to hang for me regularly.
[23:07] <hazmat> bcsaller, the merge diff on your branch looks strange, now it looks like it removes ensemble-expose on trunk
[23:09] <hazmat> niemeyer, is there a hook into mup for reporting builds from an external system like jenkins?
[23:09] <niemeyer> hazmat: It should be very easy to send arbitrary data to mup
[23:10] <niemeyer> hazmat: Using the same mechanism we use for sending commit reports
[23:10] <niemeyer> hazmat: It's a line based protocol with a username/password prefix
[23:10] <hazmat> niemeyer, cool
[23:10] <niemeyer> hazmat: echo + socat should be able to do it even via shell
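
A minimal sketch of what that could look like from Python instead of shell; the host, port, and exact line format (credentials prefix plus message) are guesses, since only "line-based with a username/password prefix" is stated here.

    # Hypothetical mup reporter; host, port and line format are assumptions.
    import socket

    def report_to_mup(message, user, password, host="mup.example.net", port=7777):
        """Send one line to the mup bot over its line-based protocol."""
        line = "%s:%s %s\n" % (user, password, message)
        sock = socket.create_connection((host, port))
        try:
            sock.sendall(line.encode("utf-8"))
        finally:
            sock.close()

    # e.g. report_to_mup("jenkins build #42 passed", "ci", "secret")
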
[23:13] <niemeyer> biab
