[07:04] Good morning. [07:09] TheMue, heyhey [07:09] fwereade: Hiya === TheMue_ is now known as TheMue [07:32] good morning gentle ment [07:32] gentlemen [07:32] davecheney, heyhey [07:33] TheMue, ok, I am looking in detail at the "firewaller bug" [07:33] TheMue, have you played around with that at all yourself? [07:33] davecheney: Good mroning [07:34] fwereade: No, only discussed it so far with Aram. [07:34] TheMue, can you recall anything specific about how he was characterising it? [07:34] https://codereview.appspot.com/6856120/ [07:34] ^ version III of cross series bootstapping [07:34] fwereade: It only happened after has change of the firewaller. So maybe it's better to wait until Monday to talk with him about those changes. [07:35] TheMue, so the answer to the question I asked is "no"? :) [07:35] fwereade: He talked about a kind of race situation, but only in global mode. [07:36] TheMue, ah, hmm, ok, it looks like I'm stuck on something different then [07:36] TheMue, I think I fixed that one [07:36] fwereade: Hey, not so fast. I need my time to write here in the bottom line while you're asking new questions above. ;) [07:36] TheMue, sorry :) [07:37] fwereade: NP, had a smiley at the end. :D [07:38] fwereade: His change after the watcher changed has been to extract a larger part of the firewallers loop code into an own method. But there it works differently, because the lifecycle states of the units returned by the watcher has to be checked. [07:39] TheMue, ok, I don't *think* that is currently an issue [07:39] TheMue, I'm pretty certain I have tweaked the firewaller such that it's responding to global mode events nicely [07:39] TheMue, now I'm confused that a unit watch appears not to be firing when it should [07:39] fwereade: Is it already submitted or a CL? [07:40] TheMue, thank you for the context though [07:40] TheMue, it's Arams 120 CL [07:40] TheMue, that niemeer asked me to finish off and land [07:40] fwereade: You're welcome, I'm interested in it too as the Firewaller has initially been my baby. ;) [07:41] fwereade: Do you please have the URL for me, to quickly jump in? [07:41] TheMue, I haven't even pushed it yet, just a mo [07:41] fwereade: OK [07:42] TheMue, hm, sorry, it'll be inconvenient to push right now [07:43] fwereade: OK, I think I get a notification when you'll push it, so then I can jump in again. [07:45] TheMue, I'm just pushed to lp:~fwereade/juju-core/aram-firewaller-sketch if you want to take a look (and, ideally, paste me the output of TestGlobalModeRestartPorts on your machine?) [07:45] TheMue, it has a load of random debug prints pooed in there [07:46] fwereade: Yes, one moment. [08:03] TheMue, heyyyyyyyyy I know what this looks like [08:03] TheMue, maybe [08:04] TheMue, if a client asks for a watch on a channel and isn't receiving, it just blocks the whole watcher [08:16] fwereade: So, after the mgo/mongodb update test is running and TestGlobalModeRestartPorts passes w/o any messages. Strange. [08:16] TheMue, gaah TestGlobalModeRestartPortCounts [08:16] TheMue, sorry [08:17] fwereade: Passes too, also sorry. ;) [08:19] TheMue, ok, that's weird :) [08:20] fwereade: Indeed. [08:51] davecheney, fwereade, TheMue: morning! [08:51] rogpeppe: Hi [08:51] rogpeppe, heyhey [08:52] morning all [08:54] davecheney: glad to hear the TLS update worked for you... [08:57] rogpeppe: it's fn' awesome [08:57] so much faster than ssh [08:58] davecheney: i'm a bit surprised by that actually [08:58] davecheney: the network latency should be approximately similar [08:58] much less round trips [08:58] davecheney: to set up the connection? [08:59] davecheney: i didn't realise ssh was inefficient that way [09:05] 3 round trips for tcp handshake, then a dozen for ssh handshake, then 2 for tcp channel, then the mgo setup on top of that [09:06] i'd bet pesos to dollars that tls handshaking is more effective than ssh tunneling [09:08] davecheney: TLS handshake is standard diffie hellman, i think, which is one round trip. there may be more too though, to negotiate the crypto. [09:09] depends on the size of the cert [09:09] but it is much more efficient [09:09] so, on high latency the difference is remarkable [09:10] also, shit: https://code.google.com/p/go/source/detail?r=697f36fec52ceaabc2208d28918fc34787b617bb [09:10] spent three evenings working on this one [09:23] davecheney: ah bugger. what a pity. have you worked out what the problem was yet? [09:25] issue 599 [09:25] 64bit atomics need to be 8 byte aligned [09:26] i should have remembered earlier [09:34] rogpeppe, TheMue: I had believed that the watcher would panic if you tried to unwatch something you hadn't watched... is that not the case? [09:35] fwereade: you're talking about state/watcher? [09:35] rogpeppe, yeah [09:36] fwereade: doesn't look like it from a brief glance at the source [09:36] fwereade: looks like it's idempotent [09:36] rogpeppe, indeed, I just could have sworn it was like that once [09:36] rogpeppe, probably just fever dreams [09:36] fwereade: it would be an easy change to make [09:37] rogpeppe, indeed, I may mention it to niemeyer [09:38] fwereade: One moment, daughter is here, will be back in a few seconds. [09:38] rogpeppe, TheMue: could I get a trivial +1 on https://codereview.appspot.com/6856122 please? [09:39] rogpeppe, that panicing would have made it *much* quicker to find that issue, I think [09:39] fwereade: *click* [09:41] fwereade: it's a pity that a document id is an interface{} [09:42] fwereade: in fact, where do we use a document id that's not a string? [09:43] fwereade: Has the Unwatch() changed to take an id instead of a Life. [09:44] fwereade: ha, the answer is "nowhere" [09:44] fwereade: probably that's only since you've changed machine id to string [09:45] fwereade: that would've caught the problem too [09:45] fwereade: no, it was just a mistake [09:45] s/fwereade/TheMue/ [09:45] rogpeppe: OK [09:47] rogpeppe, also, relations have an "Id" field that we use internally but a mgo "_id" called Key [09:47] rogpeppe, so, yes from the watcher POV we could still use strings [09:48] rogpeppe, but it's muddy [09:48] fwereade: all the tests run fine with strings [09:48] TheMue, the Unwatch never needed a life [09:48] fwereade: but i see what you mean [09:48] rogpeppe, cool [09:48] fwereade: could you make a test for this? [09:48] rogpeppe, I broached the subject with niemeyer, he seemed -1 on changing it [09:48] fwereade: fairy nuff [09:49] fwereade: Yep, I took a look and saw that it only hasn't been catched by the compiler due to the interface{}. [09:49] rogpeppe, that's an interesting philosophical question [09:49] rogpeppe, if I were to write a test for this I would also feel obliged to write a similar one for every watcher [09:49] fwereade: ha [09:50] fwereade: maybe you should [09:50] fwereade: maybe they're broken similarly... [09:56] rogpeppe: https://codereview.appspot.com/6856120/ [09:58] davecheney: looking [10:00] davecheney: is configTest.series ever going to be set to something other than version.Current.Series? [10:00] davecheney: (or "") [10:00] rogpeppe: yes [10:01] davecheney: how so? [10:01] bootstrapping from quantal -> precise [10:02] davecheney: so is the plan to add an argument to config.New? [10:03] davecheney: thinking about it, i'm not convinced that default-series should default to anything [10:04] davecheney: but i'm open to contrary arguments [10:04] rogpeppe: id' say it should always default to precise [10:04] (insert current LTS release) [10:04] this is all in aide of https://codereview.appspot.com/6851081/ [10:04] davecheney: ignoring version.Current.Series? [10:04] after a very long talk with gustavo [10:04] i still have no idea how to test this [10:04] other than the sledgehammer approaches I have previously suggested [10:06] davecheney: i don't think the configTest.series parameter is necessary. you can just test against whatever the default should be, in configTest.check. [10:06] s/parameter/field/ [10:07] davecheney: i'll have a think about how you might want to test cross-series bootstrapping [10:07] thanks rogpeppe, i'm really stuck [10:21] brb [10:22] hey, can I depend on deterministic ordering in slices? (when testing building a slice by appending) [10:23] and what's the best way to test this, given []Type as a result and having another - what type of assert is best? [10:25] dimitern: yes, slice are ordered [10:25] dimitern: yes [10:26] and DeepEquals [10:26] dimitern: assuming the types are not opaque [10:26] rogpeppe, davecheney: ok, 0x [10:27] dimitern: have you read this article? http://research.swtch.com/godata [10:29] rogpeppe: what do you mean opaque? no I'll read it now [10:29] dimitern: i mean you shouldn't use DeepEqual if the values have unexported fields [10:30] rogpeppe: right, that's the case [10:30] rogpeppe: but does it matter, since the're all in the same package? [10:30] dimitern: it depends what kind of equality you're interested in [10:31] dimitern: what are you actually trying to assert? [10:32] rogpeppe: well, I have AllFlavors() call, which returns []Flavor and error, and I called AddFlavor() twice with fl1, fl2, and now it seems []Flavor{fl1,fl2} != the result, but maybe I'm returning them in reverse order [10:33] dimitern: if your flavors live in a map [10:33] possibly [10:33] dimitern: i didn't realise a Flavor had unexported fields [10:33] davecheney: yes, I have map[string]Flavor [10:33] so if you're doing something lie [10:34] for k, v := range flavors { if k == "something" { append(...) } } [10:34] then the order your iterate over flavors is random [10:34] rogpeppe: it's a local type, Flavor has entity *nova.Entity and detail *nova.FlavorDetail - one or both can be specified [10:35] davecheney: ok, is there something similar to pythons's dict.values(), without iterating over the map? [10:35] dimitern: it sounds like DeepEqual might be appropriate in this particular case. [10:35] rogpeppe: isn't there a SliceEquals or something ? [10:36] for set equality ? [10:36] rogpeppe: tried DeepEquals, but does not work if the order is different [10:36] davecheney: not sure, I'll try [10:36] davecheney: i don't think so [10:37] shitter [10:37] dimitern: one conventional way of doing this is to sort the slices before comparing them [10:37] rogpeppe: how? [10:37] davecheney: no, SliceEquals does not exists [10:37] dimitern: look at the sort package, define Less, Swap and Len methods on a new type sortedFlavors []Flavor [10:38] dimitern: (one line each) [10:38] rogpeppe: ok, I'll check it out [10:38] dimitern: then sort each slice, by type-converting it to sortedFlavors, then use whatever equality operation you feel is appropriate [10:40] rogpeppe: that seems like a lot of work to compare 2 slices - how about just iterating over them and checking each [10:41] dimitern: it's probably more work to do that, tbh [10:42] dimitern: defining the sort methods is mechanical and easy [10:42] dimitern: because you don't just want to declare two slices [10:42] dimitern: you want to make sure they're exactly the same [10:43] rogpeppe: I'm looking at the sort pkg now [10:43] dimitern: regardless of order [10:44] dimitern: tbh, if you're adding your own flavors and checking them, it's probably overkill to check the whole Flavor [10:44] rogpeppe: the docs for Soft says: The sort is not guaranteed to be stable. - what's that? [10:44] dimitern: it means that equal items won't necessarily end up in the original order [10:44] dimitern: you could just check some aspect of it that you know about [10:44] dimitern: e.g. a name [10:45] rogpeppe: exactly, I have only 2 items and they can have at most 2 orderings [10:45] dimitern: in which case, you could just compare and swap, then DeepEqual [10:45] dimitern: same difference [10:45] rogpeppe: yep, that seems the easiest [10:46] dimitern: if even, just compare the first element of each slice, then swap one if they're different, then compare both slices [10:46] s/if even/or even/ [10:46] rogpeppe: yeah, this works and since it's a simple case I'll leave it to this [11:00] mgz: or w7z: ? [11:01] hey dimitern [11:01] mgz: hey :) i was not sure which one you're watching - mumble? [11:02] nearly there, left my headset somewhere silly... [11:05] wallyworld: ^^ [11:05] ok [11:17] rogpeppe, TheMue: I've updated https://codereview.appspot.com/6856122 -- it should still be pretty trivial [11:17] fwereade: Already lookin' ;) [11:17] rogpeppe, TheMue: the watcher package change means that the MUW tests fail reasonably comprehensibly when the dont-unwatch-life-values fix [11:18] ...is not there [11:19] fwereade: Aaaaah, that's what missing, just wanted to ask. [11:21] fwereade: LGTM [11:22] rogpeppe, awesome, thanks [11:24] rogpeppe, I'm printing the channels because it really helped me to be able to just print suspicious-looking channels when I watched, and that was very useful in the early stages when I was still zeroing in on a fix [11:24] rogpeppe, it may not be so useful now, i guess [11:24] rogpeppe, thanks [11:24] fwereade: i think that if you're debugging, you'll add prints where necessary. an isolated channel pointer print isn't much help really. [11:26] rogpeppe, the isolated panic is kinda handy though, because you can just put your prints in state/watcher and not have to worry about understanding the watcher package [11:26] fwereade: true, the channel is an external thing, isn't it. fair enough, why not leave it? [11:27] rogpeppe, cheers [11:28] TheMue, any comments? [11:28] fwereade: You'll get your LGTM in a few moments. ;) [11:29] TheMue, <3 [11:31] fwereade: You've got it. [11:35] TheMue, the channel pointer print is the difference between (1) debugging by sprinking a few prints in state/watcher.go (or whatever the client is) and (2) opening up and coming to understand the watcher package well enough to determine that you need to print the channels around the pointw where I panic [11:35] TheMue, and (2) also includes all of (1) [11:35] TheMue, as rogpeppe pointed out, the channel is a watch param just like coll and id [11:36] TheMue, revno is not interesting here, so I didn't print that [11:36] TheMue, but the specific channel *is* interesting, because the watcher can have N watches for a given key [11:36] TheMue, and the challenge becomes figuring out which watch is unmatched [11:37] TheMue, do you still object to panicing with the chan? [11:38] niemeyer, heyhey [11:39] Good morning! [11:39] fwereade: OK, reasonable. [11:39] TheMue, cheers [11:39] niemeyer: Hiya. [11:40] niemeyer: morning! [11:40] fwereade, TheMue, rogpeppe: Heyas! [11:54] mgz, dimitern, wallyworld: Something landed recently in trunk broke the build: https://pastebin.canonical.com/79484/ [11:54] wallyworld: I think it was your change to return 'nil' for objects when there has been an error. [11:55] yes :-( [11:55] If we want to keep 'nil', we need to change the return to "*Object" rather than "Object" [11:55] since pointers can be nil [11:55] but a struct cannot. [11:55] hmmm. i guess that's the correct thing to do [11:55] wallyworld: but also note, I'm not sure that adding the 'if' statement *improves* the code. [11:55] wallyworld: I tried sending an email this morning, and tried again just now. [11:55] i was doing what the code review called for [11:56] let me check, i missed the email [11:56] wallyworld: Oh, I understand that, I'm disagreeing with rogpeppe not you. I approved your change, but questioned the motive. [11:56] maybe i need to return an empty struct? [11:56] even with that, we should be careful to run 'go test ./...' before committing on trunk, since we don't have a bot yet. [11:56] yeah, sorry [11:56] i thought i had but clearly not [11:57] wallyworld: IMO, returning an empty struct is just as bad as a partial one. Since they have an object that is 'valid'. [11:57] In which case, just return what you have, and set err like you were doing. [11:57] Or, we change everything to pointers. [11:57] anyway, i'm not really working today [11:57] :L) [11:57] your preference? [11:57] :) [11:57] jam: was that the named-return-variable discussion? [11:57] wallyworld: *mine* is to not repeat the if err != nil, check in the function, that the callers are going to have to do anyway. [11:58] rogpeppe: sorry, I do want to discuss this, but my son needs to play Lego Batman... :) [11:58] Can we chat maybe later, or on Monday/Tues? [11:58] jam: nananananananana batman! [11:58] jam: sure [11:58] jam: i'll fix the build and we can discuss next week [12:15] wallyworld: If in doubt, I'd suggest returning a pointer [12:16] wallyworld: That's appropriate in most cases [12:16] wallyworld: Of course, and returning nil consistently when err != nil [12:16] niemeyer: for now, i'm just reverting the changes to fix the breakages, and will tackle properly next week once we have consensus [12:16] since it's late here now and i'm tired [12:16] wallyworld: Sure, I'm just saying this is a well known concept [12:17] wallyworld: If you go over existing code, both ours and upstream, you'll see a clear pattern [12:17] jam makes a good point though about the functions doing their own if err != nil checks [12:17] wallyworld: it's definitely conventional to return a zero value when err != nil, pointer or not. that was the reason for my comments. [12:17] agree pointer is best [12:18] Yes, and returning partial data when err != nil is not okay [12:18] i wonder, if err != nil, why does the returned value need to be zero? [12:18] wouldn't it be ignored anyway? [12:19] There are very rare exceptions to this rule, which confirms the rule.. Those are related to cases where it's important to observe the partial values, and the partial values are deterministic in such cases. [12:19] it seems like it's just adding complexity [12:19] lots of apis in other languages say, if there's an error, such and uch is undefined [12:19] wallyworld: It's one more layer to prevent bad usage [12:20] wallyworld: yes, that's not the case here [12:20] wallyworld: We consistently return a zero value when errors are found [12:20] ok. it costs to do that though [12:20] wallyworld: I haven't spent much so far :) [12:21] well, each api call needs repetitive blocks of code "if err != nil ..." [12:21] makes it messy, especially if the caller also does "if err != nul..." [12:21] wallyworld: We do err != nil all the time [12:21] wallyworld: It's pretty clear and cheap [12:21] so we have 2 places doing the error checking - the caller and the callee [12:22] wallyworld: you only need that if you're returning something that may not be zero [12:22] wallyworld: We have tons of places doing error checking [12:22] wallyworld: in the case we're talking about, we were returning a value that may or may not have been partially filled in [12:22] what if the semantics of the call say - "if there's an error, don't use the returned vsalue" [12:22] wallyworld: That's not even on the list of things that actually burned my own time over the past few years [12:23] wallyworld: That still wouldn't change the fact that there's a strong convention that is useful as it prevents crack from flowing through [12:24] ok, no problem. it does seem like code duplication for little gain, but that's just IMO :-) [12:24] s/code duplication/excessive boiler plate [12:24] wallyworld: I'd appreciate seeing an example [12:24] wallyworld: Of what you claim is code duplication [12:24] ok, i'll do a pastebin [12:25] wallyworld: Doing if err != nil { return nil, err }, or doing if err != nil { return foo, err } [12:25] wallyworld: Doesn't really cost much, and the former one is both less error prone, and more clear in intention [12:26] niemeyer: in this case, i think it was between: if err != nil { return nil, err }; return foo, nil; and a plain return. [12:27] https://pastebin.canonical.com/79488/ [12:27] see how the caller and calle both do the if err != nil [12:27] the function has no need to [12:27] if just clutters the code [12:27] why not just return data, err [12:28] who cares if data is partially populated, err will tell that something is wrong and not to use it [12:28] wallyworld: pastebin.ubuntu.com is a better one for that [12:28] wallyworld: As it's public [12:28] ok, old habits [12:28] wallyworld: Let me grab my second factor auth key :) [12:28] sorry [12:28] wallyworld: No worries [12:29] the private one is in my browser history [12:29] so, my view is that normally, the function semantics of such things normally are such that if an exception is raised, or an error returned, disregard any result data [12:30] wallyworld: It's hard to follow that example, as it contains completely invalid code [12:30] sorry, it's pseudo code cut and pasted from real code [12:30] wallyworld: Yeah, but it's completely broken, badly intended, and referring to variables that don't even exist [12:31] sure, but that doesn'tmatter for the sake of the argument [12:31] wallyworld: You won't convince me about how I'm wrong about things being error prone and trivial with such an example :-) [12:31] that take awaya is [12:31] if err != nil { [12:31] return nil, err [12:31] } [12:31] return resp.Images, nil [12:31] the above is from the function [12:31] resp doesn't exist on that context [12:31] why does it need to do the if err != nil [12:32] why not just return resp.IMages, err [12:32] and let the client do if err != nil [12:32] wallyworld: Sorry, I honestly can't suggest anything.. the example is completely broken. I'm arguing that this is trivial to write, and that it makes code less error prone. Hard to even make the point if the example is completely crackful. [12:33] here's the full function, i was hoping you could just see the intent from the pseudo code, http://pastebin.ubuntu.com/1398973/ [12:34] imagine a module with dozens of such functions, all doing if err != nil unnecessarily, sure adds a lot of bloat and boilerplate [12:34] when the caller just goes ahead and does the err != nil dance anyway [12:34] wallyworld: that's a pretty unusual case actually [12:35] wallyworld: because it's not returning something that something else returned. [12:35] wallyworld: it's returning something that something else might or might not have filled in [12:36] yes true [12:36] and the err tells us if the fill in worked [12:36] wallyworld: agreed. we *could* just allow arbitrary broken return values when err != nil [12:37] wallyworld: That's fine, as long as there's a promise from the underlying function to never touch the resp value if any errors are returned [12:37] that's a pretty standard semantic [12:37] wallyworld: but i think it's nicer if we do things more predictably [12:37] wallyworld: I'd be slightly surprised if that promise is in place, though [12:37] i agree it's nicer perhaps, but at the cost of a fair bit of bloat [12:37] wallyworld: Because it means you cannot return any errors after resp has been touched [12:38] wallyworld: That seems the heart of your concern [12:38] wallyworld: tbh, i think the "bloat" makes the code easier to read [12:38] wallyworld: And honestly bloat to me is something else.. [12:38] i guess i see cut and paste boilerplate as bloat [12:38] wallyworld: it means that i *know* instantly the set of possible values returned by ListImagesDetail [12:39] wallyworld: I see logic that is hard to read, poorly engineered, hard to maintain, as bloat [12:39] wallyworld: Error prone as well.. totally bloatful [12:39] sure, there's more than one type of bloat [12:40] i'll point jam to the scrollback when he is next in the office, since he also shared my concerns [12:40] wallyworld: That trivial line, makes things lighter to process.. I know I don't have to worry about the return value having bad data, and I know any callers couldn't *possibly* use a partial value coming out of it. It empties brain space rather than filling it up. [12:41] s/line/lines :-P [12:42] thanks for the discussion though [12:42] interesting points of view [12:42] wallyworld: Exactly my point.. 10 lines of code may be easier to read than 1. [12:43] maybe, depends on the reader i guess [12:43] wallyworld: Depends on the lines of code [12:44] or more so one's unerlying expections about the semantcs of the call [12:44] as a caller, i would never dream of using returned data if err != nil [12:44] wallyworld: I'm glad to hear that. [12:44] so it never would enter my head to expect the function to have to do anything about it [12:45] wallyworld: That still doesn't mean that you won't do that anyway, by mistake. [12:45] wallyworld: Either way, if the point isn't easy to agree with, I ask that we at least sustain that strong convention that we use both internally, and externally in the core Go development team. [12:46] sure, no problem [12:46] i was trying to understand the underlying rationale [12:46] var foo []Foo [12:46] foo, err = Func() [12:46] if err == ErrSomething { [12:47] // Ah, okay, that error is expected here. [12:47] } [12:47] Oops.. foo has bad data now. [12:47] yes, and so don't use it [12:47] wallyworld: Yeah, you're very smart and will never do that. I'm kind of dumb, and tend to make mistakes. [12:48] hey, i didn't mean it like that [12:48] tests are also part of the solution too [12:50] wallyworld: They surely are. Not rare we assert the fact we get a zero value, for example. [12:51] niemeyer, aram's firewaller change is up again at https://codereview.appspot.com/6843128 [12:51] anyways, i'll make the necessary changes to use pointers [12:51] fwereade: I'll get on it right now [12:51] niemeyer, the firewaller bug was a MachineUnitsWatcher bug, fixed this morning [12:51] so the pattern will work as intended [12:51] fwereade: Oh, what was it about? [12:51] niemeyer, we were unwatching the wrong key [12:51] wallyworld: Thanks a lot [12:51] wallyworld: and thanks for explaining your POV [12:51] np, thanks for the discussion [12:51] niemeyer, state/watcher now panics if you try any funny business like that [12:52] fwereade: Holy crap, the wrong key huh [12:53] niemeyer, `for _, unit := range map[string]Life{...} {` [12:53] fwereade: Woah [12:53] niemeyer, quite so [12:53] fwereade: I think I did that before too.. will pay more attention to ranges [12:53] niemeyer, took a while to track it down, it doesn't exactly leap out at you when the type is defined elsewhere [12:54] lunchtime, biab [12:56] fwereade: It didn't help that Unwatch takes an interface to support keys of multiple types [12:56] TheMue: Enjoy [12:56] niemeyer, yeah, indeed [12:56] niemeyer, rogpeppe pointed out that all our _ids are actually strings now [12:57] niemeyer, although I'm still really bothered by id/_id on Relation [12:57] fwereade: Well, we could have saved ourselves the trouble if the watcher was at least verifying basic typing [12:57] fwereade: string/int/int64... [12:58] fwereade: Do you have an idea to workaround it? [12:58] niemeyer, only that we could change it to demand string ids right now and I think everything would still work [12:59] niemeyer, but tbh the panicing seemed to do the trick well enough [12:59] fwereade: Have you gone through the firewaller logic, or just reinstated the original branch? [12:59] fwereade: I don't understand how that would solve the id/_id case [12:59] niemeyer, I have made slight enhancements to what existed but have not 100% verified everything from the ground up -- but 3 of us have already done so AIUI [12:59] niemeyer, oh, it wouldn't at all [13:00] niemeyer, it's just that I'm reluctant to start making pronouncements about "id"s wen I am still regularly tripped up by the relation id/key thing [13:01] fwereade: You could invert the relationship again by doing an extra lookup [13:02] niemeyer, the id/key one? sorry, I don;t follow [13:02] fwereade: That said, how to guarantee uniqueness would still bother [13:02] fwereade: Yes [13:02] fwereade: I'm talking about the reason why we have key in _id [13:03] niemeyer, ah, ok, I see -- we can't enforce uniqueness on additional fields then? [13:03] fwereade: We can create additional unique indexes on another field [13:03] fwereade: But it's an extra index, and the txn package works against _id [13:04] fwereade: We could disable the auto-index creation, create an additional unique index on key, and change the txn package to support other fields [13:05] fwereade: But, where's the gold at the other side of the rainbow? [13:05] niemeyer, I'm sorry, I don't see where the txn package changes come in [13:06] fwereade: The Id field is mapped on _id [13:06] fwereade: All of the transactioned operations, inserts, updates, removes, are done against _id [13:07] niemeyer, ok -- but both "id" and "_id" would still have to be unique, so surely we could still do the same against _ids that looked like "0" or "198" just as easily as ones that look like "wordpress:db mysql:server", couldn't we? [13:08] niemeyer, and we look up by both key and id, so we really ought to be indexing on key anyway [13:08] niemeyer, sorry, "id" [13:08] niemeyer, fwereade, TheMue: i'm going for lunch now. i will be working this afternoon, but maybe not online all the time as i'll be on the road. [13:09] rogpeppe: Cool, have a good one [13:09] fwereade: Uniqueness against 0 or 198 don't mean anything.. we create those numbers to be unique [13:09] niemeyer, the only other change I can think of is in the check-by-prefix in ServiceRelationsWatcher, and that's just a field switch... right? [13:09] s/don't/doesn't [13:10] niemeyer, will the txn package somehow be able to insert docs with duplicate keys in the face of uniqueness constraints? [13:12] fwereade: Well, the underlying db doesn't allow it [13:13] niemeyer, ah! but it will break txn.DocMissing? [13:14] fwereade: DocExists, DocMissing, yes [13:15] * fwereade sees now [13:15] niemeyer, yeah, probably worth the localised confusion then :) [13:19] fwereade: Hah, funny.. the fictitious err-related example I mentioned above using a zero-value on err != nil exists on this branch. :-) [13:25] niemeyer, oh, poo, what did I do? [13:25] fwereade: Uh, nothing bad? :) [13:25] niemeyer, ah? sorry, I thought you were pointing out a mistake [13:26] fwereade: No, I was pointing out earlier that it's important to be consistent on the zero-value on errors convention [13:26] niemeyer, definitely -- and I guess aram or I did so? :) [13:26] fwereade: and as one example I used a snippet, (HH:46, above) [13:26] fwereade: That snippet exists almost as-is in the branch [13:27] fwereade: It would be a bug if the convention wasn't respected [13:27] niemeyer, ah, yes, indeed [13:30] fwereade: Very nice [13:31] niemeyer, cheers :) [13:31] fwereade: Thanks a lot [13:31] niemeyer, np at all [13:41] niemeyer, ah HA: there is *also*, it appears, a firewaller bug, that I had hitherto failed to repro [13:41] niemeyer, I shall meditate upon this [13:41] fwereade: Hm [13:41] fwereade: How did you figure? [13:42] niemeyer, I saw a test failure :/ [13:42] fwereade: Oh [13:42] niemeyer, in TestGlobalModeRestartPortCount, as flagged :( [13:44] niemeyer, I'm just about to investigate after I propose a trivial fix to UnitsWatcher -- I mistakenlythought I didn't want to notify Dead units in the initial set, but I was wrong [13:44] fwereade: Oh? [13:45] fwereade: Ah, I think I see [13:45] niemeyer, they get lost from view if they're dead when we start up [13:45] niemeyer, generally we need to handle dead ones [13:45] niemeyer, I *could* do it by reconciling against what's installed but htat actually just feels like needless complexity [13:46] fwereade: Agreed, I think it's fine to report [13:46] niemeyer, we can do it differently if you don;t like Deployer (which I will get to one day ;p) [13:46] niemeyer, cool [13:46] fwereade: We can make it respect the Dead-is-the-last-thing-seen idea [13:46] niemeyer, ah, I can't reconcile anyway [13:46] niemeyer, exactly, it is still respected [13:48] niemeyer, https://codereview.appspot.com/6846133 [13:49] fwereade: We don't send dead statuses, strictly speaking [13:50] * niemeyer thinks [13:50] fwereade: "Once a unit observed as Dead has been reported" perhaps [13:51] niemeyer, SGTM, thanks [13:53] fwereade: LGTM, cheers [13:54] niemeyer, "as Dead or removed"? [13:55] fwereade: +1 [13:55] niemeyer, cheers [14:05] * niemeyer => lunch [14:09] niemeyer: Enjoy [14:14] *: Any experiences with OAuth packages around? [14:17] TheMue: which version? [14:17] *: Will start with goauth2 by Andrew. [14:17] TheMue: istr that oauth2 is quite different from oauth, but otherwise i know very little [14:18] rogpeppe: Ah, good hint. The Py version uses a package named oauth. Will take a look which version it supports. [14:21] rogpeppe: Seems to be v1. [14:50] TheMue, may I share your wisdom re firewaller? [14:51] TheMue, I can't quite figure out whether the bug is in the FW or the test [14:51] fwereade: I'm listening. [14:52] TheMue, ok, so, the test is problematic [14:52] TheMue, in that it starts a new FW, and asserts that the opened ports are the same as before [14:53] fwereade: Which test do you refer to? [14:53] TheMue, and so that test actually finds the expected state before the FW has even had a chance to open or close anything [14:53] TheMue, TestGlobalModeRestartPortCount [14:53] TheMue, the problematic one [14:53] fwereade: OK, open it in the editor. [14:54] TheMue, ok, so the last 4 blocks are a problem [14:54] TheMue, the assert in the first one is likely to pass regardless of state [14:57] TheMue, and when the test subsequently passes, the actual timing of the changes is a bit suspect -- initial events are still being handled after port 8080 is closed [14:57] TheMue, and *at the moment* this is passing reliably for me, butthis is more likely a happy accident of logging than anything else [14:58] TheMue, I can make the test 100% reliable, with event handling happening at the times I expect, by opening another port somewhere before starting the FW; and waiting for *that* port to be opened [14:59] fwereade: I'm gong through it step by step now. [15:01] TheMue, http://paste.ubuntu.com/1399817/ and http://paste.ubuntu.com/1399818/ might help you to visualize [15:01] * rogpeppe is looking at a beautiful frosty sunset while cruising down the A1. the joys of mobile connectivity. [15:01] TheMue, === signals event in the main loop; +++ signals port refcount ++, --- signals ports refcount -- [15:02] fwereade: Ah, good. [15:02] TheMue, the first is what we get normally; the second is what we get if I force the test to wait for the correct ports [15:03] TheMue, I *think* it should always look like the second [15:05] TheMue, *but* I don't see how that's possible in general, because (as you can see) the events coming in willy-nilly seem to cause surprising refcounts in the first [15:07] TheMue, ie we never see a refcount of 2 for tcp:80 in the first paste, and ISTM that we should [15:07] fwereade: Yes, I'm wondering. [15:10] fwereade: That's what Aram called the race situation of overlaying changes incrementing and decrementing the refcount at once. [15:10] TheMue, I would have expected the initial state to be built from initial events, so I'm having a little trouble following how the state built up in initGlobalPorts fits in with the watchery state [15:11] fwereade: The initial one is build by taking a look at the state. [15:12] TheMue, yeah, that is clear -- but AFAICT it doesn't necessarily correspond with the state picture that will be built up by the watchers [15:12] TheMue, and I can't see how you reconcile them [15:14] fwereade: One moment, I have to look on my own how we've done it. [15:18] fwereade: It's the first filter part of flushGlobalPorts() [15:18] fwereade: Here toOpen and toClose are only set if they are not only opened or closed. [15:19] fwereade: And in closing, when the refcount is 0, it will be closed. [15:20] fwereade: One has to look exactly when globalPortOpen (a map of port to bool) and globalPortRef (a map of port to int) is used. [15:24] TheMue, ok, but globalPortOpen is initialized in initGlobalPorts while globalPortsRef is not [15:25] fwereade: Yes, because otherwise each port would be counted twice, once by the state reading, once by the intial events. [15:26] TheMue, ok -- and it's not ever possible to get a machine thinking it should close a port that was never open? [15:27] TheMue, (according to globalPortsRef, that is) [15:27] fwereade: So while the initGlobalPorts() just ensures that the ports aren't closed and immediatelly reopened the initial events ensure the correct counting (so far the idea). [15:29] TheMue, ok, so there's no way an initial event can have the effect of closing a port? but that can't be true [15:29] fwereade: "not ever possible" are big words. ;) Have to check it. But closing in global mode is done when ref == 0 and a close event is raised. [15:29] TheMue, sure, I can see that [15:30] TheMue, ok, thought experiment [15:31] fwereade: I'm thinking about enqued event while the state scanning is in progress. [15:31] TheMue, doing that would make me much more comfortable [15:31] TheMue, but let's see: 1 machine, with 2 units on it [15:32] TheMue, one of the units has port 80 open [15:32] TheMue, a FW starts up and scans this state correctly [15:33] TheMue, port 80 is true in globalPortsOpen, and is 0 in globalPortRef [15:33] TheMue, and the FW then starts up its watches [15:33] fwereade: Exactly, and now the initial watcher event is retrieved. [15:34] TheMue, it gets an event for the machine, and starts a watch for its units [15:34] TheMue, it then gets an event with the two units in [15:34] TheMue, and starts port watches for each [15:35] TheMue, at this precise moment, the unitds think they have no open ports, right? [15:36] TheMue, so what happens when we get the first port event depends on which port watch happens to fire first [15:37] fwereade: Important is, what the fw thinks. And it thinks, hey, I have the port open, but I don't know how often. But it gets the number by the incoming events. [15:37] fwereade: But you're right, the watcher for the machines units is started after the machine event. And so the state could have been changed inbetween. [15:38] TheMue, ok, actually, it can be simpler I think [15:38] fwereade: The machine watcher is started before (!) the initial state scanning, so here it's better, [15:38] TheMue, exactly: if the unit with 80 open closes it after the inital scan and before the watch, port 80 will never be mentioned and never be closed [15:39] TheMue, it'll be open forever with refcount 0 [15:40] fwereade: As a first thought I would say you're right. [15:40] fwereade: The delay until the machined is started ... [15:41] fwereade: It seems like initGlobalMode() should start the needed machined immediately. [15:41] TheMue, if we could do that, that would be great [15:42] TheMue, it could be fiddly though [15:42] fwereade: And the initial event of the machines watcher has to compare if the started ones are still correct. [15:43] fwereade: Yes, not trivial. [15:43] TheMue, I don't see any way for it to be correct without doing so though [15:43] fwereade: Based on our current analysis I have to agree, yes. [15:47] niemeyer, ok, I think TheMue and I are in agreement that there is a fundamental race in the firewaller, which is not trivial to resolve [15:47] fwereade: Oh [15:48] niemeyer, I am not intrinsically opposed to diving in and fixing it, but I am reluctant to allow it to delay subordinates [15:48] fwereade: What's it? [15:49] niemeyer, basically it (1) builds up a bunch of reference state and (2) starts a bunch of watches to fill in the details of that state [15:49] niemeyer, but neglects to take account of possible changes in between (1) and (2) [15:49] fwereade: Ok [15:49] fwereade: How could this take place? [15:51] niemeyer, well, initGlobalMode just does a straight scan of state; the main loop then starts building up a huge tree of watchers, starting from the machines and adding units and services as events dictate [15:51] fwereade: Okay [15:52] niemeyer, I don't think it'll be *toooo* hard to do it right, but it will definitely be fiddly [15:52] fwereade: Sorry, I still don't perceive the issue [15:52] fwereade: Can you walk through the error happening? [15:53] niemeyer, one machine with one unit; port 80 is open on the unit and in the environment [15:53] niemeyer, FW does initial scan [15:53] niemeyer, unit closes port 80 [15:53] fwereade: Ok [15:53] fwereade: Ok [15:54] niemeyer, Fw gets its initial machines event, and gets/watches unit state [15:54] fwereade: Okay [15:54] niemeyer, the initial state reported is "no ports open" [15:54] Hmm [15:54] * niemeyer looks at the code [15:55] niemeyer, we're watching a unit we got via the machine's units event, not the one we originally looked at to determine that 80 was open [15:56] fwereade: The set of ports that are actually opened is obtained within the unit watch, when the even fires, right? [15:56] Oh, no [15:56] niemeyer, ok, what happens is: [15:56] niemeyer, the firewaller never sees a change to port 80, and hence never closes it [15:57] niemeyer, ...that's it [15:59] niemeyer, there's a refcounting mechanism that works fine (I think) when state doesn't change during that critical window [16:00] fwereade: Where is the window in the code? [16:01] niemeyer, from the end of fw.initGlobalMode() to the end of the last event on the ports channel that is traceable to the initial machines event [16:01] niemeyer, I think [16:01] niemeyer, ok, they all are [16:02] niemeyer, the last initial event that is itself a conseqence only of initial events, sharing the ultimate ancestor of the first machines event [16:02] niemeyer, I don;t think I made that any clearer, did I? [16:03] fwereade: No, but I blame my question [16:03] niemeyer, ok, looking just at the main fw loop [16:04] niemeyer, first of all we set up initial state, and then we start paying attention to a pre-started global machines watch [16:04] fwereade: Perhaps I should try to point out how I originally imagined this would work [16:04] niemeyer, changes on that are safe [16:04] fwereade: So you can fix my assumption with reality [16:04] niemeyer, ok, go for it [16:06] fwereade: There's an initial pass at the beginning which verifies what is *actually* open in the environment [16:06] niemeyer, ie according to the provider? [16:06] fwereade: Yes [16:07] niemeyer, ok, ty, carry on, this is accurate [16:08] fwereade: Okay, I think I see the problem actually [16:09] fwereade: No, wait, okay [16:10] fwereade: Nevermind, I do see the issue.. closing is indeed a problem because the initial state isn't connected to the initial referenced state [16:10] fwereade: Open works under the same circumstances, though [16:10] fwereade: Makes sense? [16:10] niemeyer, agreed, open is safe [16:11] niemeyer, except, I *think* there is another issue [16:11] niemeyer, but I need to think it through a bit [16:12] niemeyer, gaah it seems not to be flowing [16:14] niemeyer, the trouble is fundamentally that (1) changes can come in and change state both before and during the watch-scan, and this is confusing and (2) there is a failing test that demonstrates weird refcounts while this is happening, even if it doesn't always fail, and which is fixed by delaying the external changes until we;re sure the FW has finished what it's doing [16:15] fwereade: I think the issue is rather simple, actually [16:15] niemeyer, I am sadly not able to actually hold the whole thing in my head atm, so I can't come up with a good explanation of why the test sometimes fails: adding logging seems to have fixed it [16:16] niemeyer, I am talking here in a very narrow context [16:16] * TheMue is playing with the idea of always start the watcher goroutines based on the initial state before retrieving the first machines watcher events. [16:16] fwereade: I suggest trying to address the actual issue we do know about first [16:16] niemeyer, that issue only came to light because of the more fuzzily specified one I just mentioned [16:17] niemeyer, there's a test that fails sometimes [16:17] fwereade: Cool, I'm glad you did look at this [16:17] niemeyer, there's an issue in the watcher [16:17] fwereade: Hm? [16:17] niemeyer, the two may be connected [16:17] niemeyer, I am not equals to the task of concretely demonstrating such a connection without handwaving [16:18] niemeyer, well, I tried to repro it, and I could, and I found the MUW unwatch bug [16:18] fwereade: That's fine, I'll try to interpret the hand-wavy frequency :) [16:18] niemeyer, in the following test runs, the firewaller appeared to be fine [16:18] niemeyer, some time after submitting, I saw the test fail [16:19] niemeyer, it is irritatingly elusive, and currently appears to be passing 100% for me with no changes otherthan logging [16:19] niemeyer, but sometimes it fails every 2 or 3 goes [16:19] fwereade: Okay, but what's the failure about? [16:20] niemeyer, the failure is about whether a port gets closed in the environment or not [16:20] fwereade: Isn't that the bug we're discussing above? [16:21] niemeyer, not exactly -- the test doesn't exercise that precise situation [16:21] niemeyer, this is a port that should close but sometimes doesn't [16:21] fwereade: That's exactly the case we're talking about above [16:22] niemeyer, ...ok, they are more closely connected than I thought :/ [16:23] niemeyer, I'd approached the conclusion from 2 separate directions, didn't realise they were the same place [16:23] niemeyer, so [16:23] niemeyer, I am comfortable that this is the bug [16:24] niemeyer, I have a patch for the test, to cause it to avoid exercising the bug [16:24] niemeyer, I am not *really* comfortable taking on "fix the firewaller" at this point, especially because I don't feel I know it as well as others [16:25] fwereade: I suspect the issue is simple to solve [16:25] fwereade: and to describe [16:26] niemeyer, yeah, it may be [16:26] fwereade: machined.ports starts empty [16:26] fwereade: That's it really [16:27] niemeyer, I'm a bit suspicious of unitd.ports starting empty too [16:27] fwereade: All the deltas are potentially wrong because of that [16:28] niemeyer, yeah, that does sound simpler to deal with [16:28] fwereade: I don't think that's a problem [16:28] fwereade: unitd.ports reflects what we learned from the state itself [16:28] fwereade: Which is always right [16:28] niemeyer, hold on, where does machined.ports come from if not from unitds' ports? [16:28] niemeyer, nah, it starts out empty [16:28] fwereade: Exactly [16:29] fwereade: We compute what we want machined.ports to be out of unitd.ports, which we learn from the state [16:29] niemeyer, ok, but [16:29] niemeyer, when a machine has >1 unit [16:29] fwereade: and diff against the previous value of machined.ports to know what to apply to the env [16:30] niemeyer, we cannot calculate the machined's ports without knowing all its unitds' ports, right? [16:31] fwereade: We know all unitd ports.. that's why we cache it [16:31] niemeyer, but we don't actually set it when we create it [16:31] niemeyer, it sits around empty until the main loop first handles a ports event for that unit [16:32] fwereade: Ah, interesting [16:32] me looks [16:32] niemeyer, so when a machine has 2 units, and we flush the machine's ports based on the first unit's ports only (because we haven't yet handled the second one's initial event) we have a problem [16:34] niemeyer, if I promise I will take a deeper look at it, can I patch the proximate problem by tweaking the test to skirt the problem? [16:34] fwereade: You can do that either way [16:34] fwereade: The problem is there and isn't going away until someone fixes it [16:34] fwereade: Your branch is orthogonal [16:35] niemeyer, true, but we don't want test failures dirtying up our lives [16:36] niemeyer, it was this bug, I think, that originally blocked it [16:36] niemeyer, ok, let me put it differently [16:36] fwereade: Yes, but the point of blocking is that no one could tell what the bug actually was [16:36] niemeyer, ah! and we now have progress? [16:36] fwereade: So whether it was orthogonal or not was at stake [16:36] niemeyer, that makes sense [16:36] niemeyer, got you [16:37] TheMue: Are you following? [16:37] niemeyer: Yes, the whole time. [16:37] TheMue: Awesome, thanks [16:37] TheMue: Can we have a fix for the two issues next week? [16:38] niemeyer: I think so, as it's now clearer. [16:38] niemeyer, TheMue: well, awesome :) [16:39] TheMue: Two different branches: 1) machined.ports must be properly initialized; that must be done differently for the two modes [16:39] TheMue: 2) Creating new machined's must ensure all unitd's within it are initialized before returning to the main loop [16:39] fwereade: Makes sense? Anything to add? [16:40] niemeyer, I think that covers it [16:41] TheMue: Questions? [16:41] * TheMue feels comfortable with it too. [16:42] niemeyer, I will propose a trivial .Skip on that test for now [16:42] niemeyer, I think it's more valuable to have it preserved in a state that more-or-less exposes the issues than it is to hide them away by dodging the problem [16:42] fwereade: +1 [16:48] * rogpeppe2 needs to do some navigation now. i may not return today. if not, have a great weekend everyone! [16:51] rogpeppe2: Enjoy your weekend [16:52] TheMue: thanks. lots of obscure carols being sung this weekend - we're going to a Festival of Village Carols... [16:52] * TheMue will step out in a few moments too. Tomorrow our little Vanessa has her next archery tournament. I'm excited. [16:53] rogpeppe2: That's pre-christmas time … :D [16:53] TheMue: it's the 1st tomorrow! [16:53] TheMue: just scrapes into advent [16:53] niemeyer, https://codereview.appspot.com/6849126 [16:53] niemeyer, for form's sake :) [16:54] rogpeppe2: Yep, have been to our local Christmas market yesterday. Has been quite good. [17:00] rogpeppe2: Have fun there [17:00] fwereade: +1 [17:43] So, I'm stepping out. Have a nice weekend everyone.