[00:01] thumper: can you join #juju on canonical ? [00:01] need to talk about some stuff [00:02] thumper: oh wait, i'm not in #juju [00:02] use #eco [00:34] davecheney: hi, doctor running late as usual. i've fixed william's issues with by bootstrap branch but wanted to understand your issues. i can't really see what the problem is. so long as tools are uploaded to public buckets when ppa is released, things just work. and stricter matching is only done on major.minor, so point releases can still be used [00:35] wallyworld__: when you say "public bucket" that only works for ec2 [00:35] private openstack clouds act as if they have no tools an fall back to using s3 for _release_ (not devel) tools only [00:36] your change would prevent them accepting the final fallback toolset and their bootstrap will fail [00:36] i 100% agree that mixing toolsets is a mistake [00:36] but that genie is out of the bottle and people are using it now [00:39] davecheney: i just got called in, will catch up in a bit [01:24] bigjools: would you mind landing my gwacl branch? pretty sure I don't have rights to [01:24] davecheney: you weren't there [01:24] davecheney: was at the gym, sorry [01:24] axw: ah, I can fix that [01:26] axw: land away! :) [01:26] thumper: imma there [01:26] #eco on canonical [01:26] oh [01:26] bigjools: thanks. next question: how? :) I haven't landed without lbox before [01:27] axw: just change the status of the MP to "approved" and the bot does the rest [01:27] * bigjools grumbles about lbox [01:27] bigjools: ah right, just didn't show up before cos I wasn't in the group. of course. [01:27] thanks! [01:28] yup [01:50] thumper: i might even bring forward 1.13.3 tagging to today [01:50] if I can do the upgrade testing [01:50] 'cos wallyworld wants to break everything [01:50] (wallyworld is sans internets, he'll be here soon) [01:53] axw: my gwacl bot seems to be dead... fixing [01:53] bigjools: thanks [01:53] axw: ah it's not [01:53] did you set a commit message? [01:54] I think so... checking [01:54] nope [01:54] dah [01:54] thanks [01:54] fixed [01:54] tarmac silently ignores MPs with no commit msg [01:54] bot runs every ten minutes [01:55] I need to get landing moved over to the one that jam runs [01:55] okey dokey, thanks bigjools [02:00] davecheney: is this wallyworld's tool selection work? [02:00] wot [02:00] hi wallyworld [02:00] hey [02:00] wallyworld: davecheney says you want to break the wold [02:00] world [02:00] well, not really [02:01] only for private clouds that choose not to set up a public bucket for tools [02:01] wallyworld: lp:~axwalk/juju-core/juju-metadata-generate-tools [02:01] and rely on sync tools pulling down possibly outdated tools from s3 [02:01] wallyworld: so how is it breaking? [02:02] now, juju will fall back to using any tools where the major version number matches [02:02] eg 1.3 client will use 1.2 tools [02:02] but the change now restricts the matching to require major.minor match [02:03] patch updates are used though [02:03] so 1.3.0 client will use 1.3.2 tools [02:03] I don't think that is reasonable [02:04] william does [02:04] I'd say only use patch updates for non-dev releases [02:04] we break things from point release to point release on dev envs [02:05] we can't guarantee 1.28 will be compatible with 1.26 [02:05] it's not an issue in practice [02:06] juju release = upload tools and then upload ppa [02:06] so there are always tools [02:07] for devs on release+1, they should be using --upload-tools [02:08] thumper: william is the one driving this change [02:08] sure [02:08] I guess that's fine [02:08] i'm +0 on it - i can see pros and cons [02:13] wallyworld: correction, juju release uploads tools for *known* cloulds, the CPC clouds [02:13] there exists an unknown set of non public clouds [02:15] who currently (inadvertendtly) rely on the automatic sync-tools behavior that bootstrap does for them [02:15] i 100% agree this is wrong and bad [02:15] but if we take it away from them, they'll lynch us [03:20] wallyworld: I suggest we skip the package review stuff with axw as you are sprinting [03:20] thumper: sounds good to me [03:20] wallyworld: please let axw know [03:20] will do [05:50] 15:48 error: invalid value "maas-name=mongodb.lab.boston.cts.canonical.com" for flag --constraints: unknown constraint "maas-name" [05:50] ^ i think i already know the answer [05:50] but what do I tell jpds ? [06:07] bigjools: can you response please [06:07] davecheney: OTP [06:07] bigjools: when you are free [06:08] but looks like a juju error [06:08] and juju-core doesn't support maas flags [06:09] oh cock [06:09] yes James [06:09] thumper: bigjools: i got my coffee machine back! :-D [06:09] \o/ [06:09] * jtv cheers for wallyworld_ [06:09] it's heating up right now :-D [06:09] and I ran out of beans [06:10] Go over to wallyworld_'s right now — I hear he has working equipment [06:10] I heard the cord was cut [06:10] * wallyworld_ is the coffee nazi - no beans for you [06:11] boom bomm ching [06:11] davecheney: wtf is "maas-name" anyway? [06:11] bigjools: constraints the units to the maas tag [06:11] that's maas-tag I thought [06:11] maas-tags even [06:11] https://juju.ubuntu.com/docs/charms-constraints.html [06:11] and only on pyjuju [06:11] docs says maas-name [06:12] maas docs say maas-tags.... [06:12] haha [06:12] is he using juju-core or pyjuju? [06:12] * davecheney starts to cry [06:12] juju-core [06:12] he's fucked then [06:13] how charming [06:13] until fwereade_ gets provider-specific constraints in (I think --to was being talked about) [06:13] he's not going to be charming [06:15] jpds: short version, we haven't implemented provider specific constriants yet [06:15] the docs are wrong [06:15] they probably refer to pyjuju [06:15] bigjools: That did exist in pyjuju. [06:16] exactly [06:27] jpds: fwiw, https://bugs.launchpad.net/juju-core/+bug/1170337 [06:27] <_mup_> Bug #1170337: maas provider: missing support for maas-specific constraints [06:27] also, https://bugs.launchpad.net/juju-core/+bug/1170337 [06:27] <_mup_> Bug #1170337: maas provider: missing support for maas-specific constraints [06:27] shit [06:28] Keep calm. [06:28] https://bugs.launchpad.net/juju-core/+bug/1217717 [06:28] <_mup_> Bug #1217717: docs: charm constraints page refers to unsupported 'maas-name' constraint [06:37] axw: did we get gwacl updated on the build bot to match your "critical update gwacl" patch? [06:37] the bot doesn't (yet) actually look at dependency.tsv [06:37] jam: no, but it doesn't affect the juju-core tests (it's tested in gwacl) [06:38] axw: sure, but we probably actually want the bot to be building with the version of gwacl we are suggesting. I'll go update it. [06:38] (It also helps that if there was an accidental incompatibility break we would notice it.) [06:38] jam: fair enough, thank you === tasdomas_afk is now known as tasdomas [07:09] Guys, my bootstrap node is dying. [07:10] /var/log/juju/machine-0.log keeps aying: juju environ.go:37 worker: loaded invalid environment configuration: maas-oauth: expected string, got nothing [07:11] Which is a lie. [07:19] jpds, do you know the juju version you used to bootstrap, and the jujud version you're runnning? [07:20] fwereade_: Good morning. I'm slightly concerned that we really do have a bug we need to deal with in ServiceGet. Specifically, the charm config types are described in the schema (description, type, value), and on the server side we are doing the work to put the data into the right type. But we lose that type when we transmit it via JSON. [07:20] fwereade_: ppa:juju/devel 1.13.2. [07:21] fwereade_: so do we need to implement a pass over the data in the client side of ServiceGet? [07:22] fwereade_: But yeah, I've just bootstrap'ed it and nothing works, not even status. [07:25] jpds, would you find the jujud on the bootstrap node (/var/lib/juju/tools/machine-0?) and run `jujud version`, just to check, please [07:25] jam, blech [07:27] fwereade_: 1.13.2-precise-amd64 [07:28] jpds, hmm, would you pastebin me /var/log/cloud-init-output.log please? [07:28] mornin' all [07:28] jam: can you give a specific case when it would be a problem? [07:29] jam, I can accept that's a bug, but I'm reluctant to fix ServiceGet which itself is pretty insane [07:29] rogpeppe: The fact that the metadata tells you what type you have, and the existing code spends the effort to return things of that type sound like the code has made some expectations. [07:30] jam, ISTM that it's an unholy combination of service info and charm info, and that there is literally no possible justification for smooshing that information into the same API call [07:30] fwereade_: +1 [07:30] jam, if we didn't do that mixing, the only problem would be precision, right? [07:31] fwereade_: well, some code may rely on the fact that we expect an int64 for an int attribute value [07:31] fwereade_: so I can confirm the precision issue (I added a test that proves it). The other issue is that you could easily write code that does "var stuff int64 = result.Config["skill-level"]["value"]" [07:31] jam: but i'd still like to see at least one place where this is actually a problem [07:31] rogpeppe: the main problem is that it is code that *we* didn't write. [07:32] jam: ah, you're worrying about 3rd party code using the API? [07:32] rogpeppe: I'm concerned that we have code that clearly spent a lot of time to get types right that we are throwing away [07:33] fwereade_, rogpeppe: I'm willing to just push it forward, file a bug and get on with life, but it was clearly some effort into doing it the way it is done today. [07:33] fwereade_: /msg. [07:33] jam: when we could represent integers in the marshalled form of the data, it made sense to represent an int as an int [07:34] jam: but given that we're now using JSON, i don't see that is necessary. [07:34] rogpeppe, I thought we *could* represent ints regardless [07:34] rogpeppe, jam, I'm starting to be convinced here [07:36] fwereade_: we *could* but it's awkward to do [07:36] rogpeppe, jam, although I still sorta lean towards a "ServiceGet is hopeless rubbish, but it's not just ServiceGet we need to worry about, we should actually fix our marshalling" [07:36] fwereade_: what's bad about ServiceGet, for the record? [07:36] rogpeppe, we've already encountered this problem in environment config [07:37] rogpeppe, it's mixing together charm stuff and service stuff for no clear reason [07:37] rogpeppe: the specific bad of ServiceGet is that it mixes the Charm.Config object with the Service.Constraints data [07:37] which should really be 2 different calls. [07:38] fwereade_, rogpeppe: bug #1217742, I'll document it in the code and reference that bug if we find it is really important to fix. [07:38] <_mup_> Bug #1217742: ServiceGet returns integer values as float64 [07:38] fwereade_: most of our marshalling uses "structs" that will enforce types [07:38] ServiceGet is a bit special in that it is passing around user-data that is in a map[string]interface{} [07:39] fwereade_: i don't really mind the fact that it's one call where it could be two, if most API usage gets both things at the same time [07:39] fwereade_: i'm not sure it makes it hopeless rubbish [07:39] jam: yeah [07:39] jam, I'm not so bothered about service config vs constraints, getting and (beinga ble to) set all service info at once is fine by me really... the big problem is total asymetry [07:40] jam: i'm trying to think of other places that pass around a free interface{} [07:40] jam: we could potentially just use UseNumber always [07:41] rogpeppe, jam: we had that problem with ports in env config [07:41] hmm, why does Resolved return settings? [07:41] rogpeppe, jam, we will probably have it again when unit agents need to use service config [07:41] rogpeppe: we could change it to UseNumber, but we'd still need to implement the custom unmarshalling in ServiceGet to cast the Number into the correct types. [07:42] rogpeppe, I have *no* freaking idea [07:42] rogpeppe, because it seemed like a good idea at the time, regardless of layering/sanity concerns [07:42] rogpeppe, just like ServiceGet ;p [07:42] fwereade_: ports aren't in typed structs? [07:42] I guess EnvironConfig isn't type [07:42] typed [07:42] jam, environment config resists that sort of thing pretty hard [07:43] jam: we could just document that the Go API has changed [07:43] fwereade_: i was looking over some of the camlistore source at the weekend, and it's interesting to see how it manages configuration stuff. it's got quite a neat little type to manage it. [07:44] rogpeppe, oh yes? [07:45] fwereade_: http://godoc.org/github.com/bradfitz/camlistore/pkg/jsonconfig [07:45] * jam goes to grab lunch, will be back soon. [07:46] fwereade_: the cunning bit is Validate [07:46] fwereade_: in particular, lookForUnknownKeys works out which keys could be in the config by knowing what keys have been looked up previously [07:47] fwereade_: which is actually kinda neat if you think about it [07:47] fwereade_: i'm not that keen on the way it stores extra keys in the map though. [07:49] fwereade_: anyway, interesting to see how other code deals with similar issues. [07:49] rogpeppe, thanks, I'll try to take a proper look shortly :) [07:53] a review of this would be much appreciated please. it's large but it's just moving code. https://codereview.appspot.com/13269045/ [07:58] jam, rogpeppe, I have a theory for why jpds's environment is broken [07:59] fwereade_: oh yes? [07:59] jam, rogpeppe: the upgrader api is trying to look in the environment for tools before we've sent the environ config secrets over [08:00] jam, rogpeppe: and for bonus points ServerError is panicking because schema.error_ is unhashable and can't be used with ServerError [08:01] fwereade_: hmm, the upgrader used to wait until it had got a valid environment config [08:02] fwereade_: i guess things will probably sort themselves out eventually [08:02] jam, rogpeppe: leaving aside that it should *never* have had access to an environ config, yes ;p [08:02] rogpeppe, not sure it is actually doing so [08:02] fwereade_: well, sure [08:03] rogpeppe: I have a 43MB log file with a time range of 30 minutes - full of tracebacks. :) [08:03] jpds, how does juju status fail exactly? [08:03] fwereade_: It didn't return. [08:04] I just bootstrap'ed, it installed and immediately started freaking out. [08:04] jpds: could you paste a representative example of some of the log file (including at least two iterations of the traceback)? [08:05] rogpeppe: They're on https://chinstrap.canonical.com/~jpds/juju-debug/ [08:05] jpds, is it remotely possible that you killed the status before mongod was running on the server, and haven't run it since? [08:05] jpds: perhaps from the very start up until it starts to look repetititive [08:05] rogpeppe: The full all-machines.log is there. [08:05] fwereade_: you made a comment on my last CL regarding the GUI team. whom can i ask for it best? [08:06] fwereade_: I may have Ctrl-C'ed it - I remember I ran status before the machine came up.. [08:06] jpds, would you try juju status again please? [08:07] jpds, when that can connect to mongo, the first thing it'll do is hand over any missing secrets [08:07] hmm, this looks highly suspicious: "panic: runtime error: hash of unhashable type schema.error_" [08:07] jpds, this is still definitely a bug, but it might be easily resolved that way [08:07] rogpeppe, yeah, that was what led me in that direction [08:08] rogpeppe, regardless of tools returning an error, that nukes the whole process [08:08] rogpeppe, if ServerError didn't panic, I think the upgrader would nicely fail and get retried [08:08] fwereade_: The status is just sitting there and the log is just getting more errors. [08:08] rogpeppe, oh except, wait, we're still using allFatal for some reason [08:09] jpds, would you run it with --debug and see if anything interesting pops up? [08:09] fwereade_: 2013-08-28 08:09:28 INFO juju.state open.go:68 opening state; mongo addresses: ["angha.maas:37017"]; entity "" [08:09] And nothing. [08:10] Log continues to complain about: maas-oauth: expected string, got nothing [08:10] fwereade_: hmm, schema should really return an error that's a pointer type [08:11] 678646 [08:11] rogpeppe, maybe so, but ServerError should also not be assuming that something's hashable just because it implements error [08:11] good thing those are one-time passwords :) [08:11] fwereade_: that is true too [08:11] jam, haha [08:12] fwereade_: that's my fault entirely. [08:12] fwereade_: though I guess if you give away 2 of them, you can figure out the sequence, or something along those lines (given setting one up is start it, sync it, and enter 2 sequential values) [08:12] rogpeppe, not to worry, we can fix it [08:12] rogpeppe: we should probably be decreasing "discarding action method" from INFO to at least DEBUG, fwiw [08:13] jpds, can you definitely resolve angha.maas from where you're running? [08:14] jam: i have an old branch somewhere that would let us discard that message entirely [08:14] jam: at the least it could only log exported methods, because they're the ones where problems are likely to lie [08:14] fwereade_: Oh, that's interesting; fixed DNS and now it works. [08:14] jpds, because it's starting to look like that's the underlying problem -- the environ errors are an expected consequence of no-secrets-yet, and the panics are an unexpected one, but both will be sorted out once we hand over the secrets [08:14] jpds, sweet [08:15] And the server looks happier. [08:15] rogpeppe, fwereade_: so one thing we certainly want to fix, is ServerError(unknownError) shouldn't ever panic. How do we detect that a type is unhashable and can't be looked up in a map? [08:15] jpds, secrets handover happens first time you connect to an environment [08:15] I wouldn't normally expect it to fail like that on DNS though... [08:15] fwereade_: I see. [08:15] jam: i think we could probably use a switch instead. [08:16] jam: alternatively we could recover from it [08:16] rogpeppe, *is* there a way to detect it other than to try, panic, recover? [08:16] fwereade_: perhaps with reflect [08:16] rogpeppe, eww ;p [08:17] Maybe the node should just sit there until status is run? It's kind of failing up with log files otherwise... [08:18] fwereade_: actually i can't see an easy way even with reflect [08:18] jpds, I think that if we fixed the ServerError panic things would look a lot better [08:18] jpds, panic tracebacks are very heavyweight [08:19] jpds, but there is indeed another bug to fix to prevent it just repeatedly erroring and trying again regardless [08:21] jam, rogpeppe: I think *that* one would be fixed by doing a WaitForEnviron in Upgrader.Tools [08:22] rogpeppe: would the ,ok syntax work, or it still panics on non-hashable types [08:23] jam, rogpeppe, jpds; I'd be a bit reluctant to make *everything* wait for bootstrap to complete [08:23] fwereade_: i'm not entirely sure about that [08:24] jam: the ,ok syntax would not work [08:24] jam, rogpeppe, jpds: better for just the things that need an environment to wait really [08:24] jam: but a switch would [08:24] sigh, my debug-log keeps filling up with "panic: runtime error: hash of unhashable type schema.error_" error still. [08:24] rogpeppe, jam, a switch seems sane to me, just dodge the whole issue [08:24] fwereade_: yeah [08:25] rogpeppe: we have a switch just after it, but I don't quite know what a switch for unhashable types would look like [08:25] My google-fu for "golang unhashable" isn't very helpful [08:25] jam, the point is just that a switch won't try to hash it [08:25] jam: comparing unhashable types is ok if they're different underlying types [08:26] jam: equality comparison compares the types first, then the value [08:26] * rogpeppe goes to makes sure of chapter and verse [08:26] rogpeppe: so it is *usually* a good idea to avoid hardcoding a switch/series of if/else statements if you can put that information into a map so that it is O(1) lookup instead of O(n) and usually maps are easier to add data to at runtime for types that you may not have known about yet. [08:27] So I'd still like to keep the "we have a map for registered types," that we can look in [08:27] jpds, that is... more upsettin [08:27] jpds, is the server still generating those panics after a successful status, or is debug-log itself screwy? [08:28] jpds, there was a bug with rsyslog resyncing state it didn't need to, and I think it was fixed, but I'm not sure when [08:28] wallyworld_, ^^ [08:28] fwereade_: I'm pretty sure that fix is in 1.13.2. You can confirm by checking /etc/rsyslogd/25-juju.log There should be a couple of lines with &~ inbetween them. [08:29] It fixed it in our testing at least. [08:29] And it only triggered when you had deployed a unit to machine-0 [08:29] and we haven't gotten to the point of deploying yet. [08:30] fwereade_: tail'ing the log on the bootstrap node to make sure debug-log isn't just playing catch up. [08:30] jpds, look at machine-0.log, that wouldn't demonstrate the problem even if it were still there [08:31] jam: i was thinking of something like this: http://paste.ubuntu.com/6035835/ [08:31] jam: but it will indeed be O(n) [08:31] fwereade_: ping [08:32] rogpeppe: and not extensible at runtime [08:32] jam: i think on balance i'm preferring the recovery approach [08:32] jam: is that something we'd want? [08:32] rogpeppe: that's where I'm getting to as well. [08:32] rogpeppe: it is often a nice pattern to be able to add things you didn't think about ahead of time as data rather than code. [08:32] jam: at runtime? [08:32] fwereade_: Yeah, that looks OK, it's just playing ping pong. [08:33] rogpeppe: it would allow packages that define an error to register that error mapping into a code [08:33] jpds, ok, cool [08:33] rogpeppe: then the places that define the error define their code [08:33] jam: they can do that anyway, by returning an error with a Code method [08:33] rather than having to have "common/errors.go" know about all possible errors. [08:34] jam: common/errors.go is really for legacy errors [08:34] jam: or at least, that was the intention [08:34] TheMue, pong [08:35] fwereade_: seen my question above regarding the GUI team? [08:35] jam: i guess it comes down to dependencies [08:35] fwereade_: whom can i ask best to be sure about the change? [08:36] TheMue, frankban seems to be right here and would be a good person [08:36] fwereade_: thanks [08:36] jam: and responsibility - we either have lots more things import state/api/params (for the error codes) or something has to be responsible for registering error types [08:36] rogpeppe: it is the Factory pattern which is often quite good, especially for mapping things of X to things of Y. It happens that go specifics mean it may not be great here. [08:36] TheMue, but if he weren't you could probably just pop into #juju-gui and see who's around ;) [08:37] fwereade_: oh, didn't know about that channel. then it's more easy ;) [08:37] fwereade_: after that i'll take a look at the ssl issue. here i may come back to you with questions. [08:37] How do I forcibly move a machien from dying to dead? (It was never alive and I didn't want it to be) - already terminate-machine'd it. [08:39] jam: i'm not quite sure how what you mean there [08:40] jpds, if it's dying, it should have 0 units, and the provisioner ought to clear it up for you... [08:40] jpds, but we did also see a bug in which the provisioner *sometimes* misses a machine that needs cleanup [08:40] jpds, so you may have encountered that [08:41] It's still "Allocated to root" in MAAS. [08:41] jpds, and in status it appears as dying, with an instance id? [08:42] fwereade_: Yes. [08:42] jpds, if the machine is still coming up you just have to wait for it to appear, and the machine agent will figure out that it's no longer wanted and remove itself [08:44] rogpeppe: ugh. we have no direct tests of common.ServerError, at least not in the common package. [08:44] jam: yeah, the test is in state/apiserver. it obviously wasn't moved over when we moved the code. [08:44] jpds, there's room for us to improve that story though [08:45] jpds, I imagine you wouldn't be mentioning it if it weren't annoying, so I'll write some bugs [08:45] jpds, thanks [08:46] fwereade_: Yeah, problem was the machine is a VM with no power settings, so it was never going to boot... [08:47] jpds, ha, I see [08:51] jam, fwereade_: https://codereview.appspot.com/13336043 [08:51] fwereade_: Let me know if you file bugs so I can track them. [08:52] rogpeppe, I'd prefer a code, ok sort of return really [08:53] rogpeppe, it seems a little bit magic [08:53] rogpeppe, and in-band signalling is just generally a Bad Thing ;) [08:54] fwereade_: you'd prefer something like this? http://paste.ubuntu.com/6035888/ [08:54] rogpeppe: for the opposite method I'm submitting it now [08:54] rogpeppe, yeah, I think so, if you have no objections [08:54] waiting for lbox to finish [08:55] rogpeppe, the magic doesn't feel painful now [08:55] fwereade_: ok, fair enough [08:55] rogpeppe: I traced back to who introduced the singleton error map, which is quite a while ago (rev 957 or so) [08:55] jam: i did that [08:55] and lbox fails with "failed to load data" [08:55] jam: it's all my fault :-) [08:56] jpds: bug #1217760 for the bad panic bug [08:56] <_mup_> Bug #1217760: apiserver.common.ServerError needs to handle unhashable errors [08:58] jam: Thanks. [08:59] rogpeppe: https://codereview.appspot.com/13338043 [08:59] hi noodles775 (/wave) [08:59] jam: o/ [09:00] rogpeppe: at this point, I don't *really* care which fix goes in, though we'll want to associate whatever lands with the bug # [09:01] jam, fwereade_: alternative updated: https://codereview.appspot.com/13336043 [09:02] jam, fwereade_: i don't mind much either, although i quite like the more compact nature of the switch. (and using recover in that kind of way has reasonable precedent) [09:03] s/switch/map lookup/ [09:05] jpds, https://bugs.launchpad.net/juju-core/+bug/1217781 [09:05] <_mup_> Bug #1217781: machine destruction depends on machine agents [09:05] rogpeppe: I like the map lookup, the fact that you have to recover tells me it is something to be scared of [09:05] *but* the fact [09:06] jam: for a similar kind of use, see http://golang.org/src/pkg/os/exec/exec.go?s=5901:5926#L122 [09:07] jam: i think it's reasonable [09:08] fwereade_: Ta. [09:08] rogpeppe: I defer to a 3rd party for final decision about the tastefulness of a defer() {recover()} it looks la lot like a "try: except:" in python which you should almost always universally avoid. [09:08] as it suppresses all errors. [09:08] jam: there's only one error possible there [09:08] such as fwereade_^^ [09:08] jam: your code could be a little more compact if you stored the singleton codes in a table and iterated over them [09:09] fwereade_: yeah, i defer to you too [09:09] rogpeppe: we have a switch today, might as well add some bits to it. [09:09] * rogpeppe likes tables [09:10] rogpeppe, jam, I was fine with a switch, but I thought we were leaning towards maps being nice? if we agree that maps are nice I'm fine with the recover [09:10] fwereade_: to be clear, we both have a solution to the problem and not enough motive to decide between, just pick one and we'll go about our lives in a merry fashion. https://codereview.appspot.com/13338043/ and https://codereview.appspot.com/13336043/ [09:10] fwereade_: map is nice, recover is ugly, etc. :) [09:11] jam, rogpeppe: rogpeppe wins by the highly scientific toss-a-coin method [09:11] lol [09:12] jam, rogpeppe: a bit annoying that euros don't have heads on them, that gave me pause for a moment [09:14] (jam, if I wanted to debate further, I'd contend that recover is more unfamiliar than intrinsically ugly, and that when the scope is so very narrow I don't see any problems with it) [09:16] rogpeppe, jam: separately [09:16] fwereade_: having to write a function to do a safe map lookup is... unfortunate. [09:16] rogpeppe: go for it [09:16] rogpeppe, jam: I contend that ca-private-key is Just Another Secret [09:17] rogpeppe, jam: and should be communicated just like all the other secrets [09:17] fwereade_: I take it this is a different discussion. [09:17] fwereade_: how can we do that? [09:17] fwereade_: we can't talk to the server over TLS until it has that secret [09:17] jam: exactly [09:17] rogpeppe, jam: *ca*-private-key [09:18] fwereade_: um, do we even keep the ca private key around? [09:18] rogpeppe, jam: the fact that the server private key is vulnerable is itself separate I think [09:18] fwereade_: i don't think we need to send the ca-private-key over the wire at all, do we? [09:19] fwereade_: it does feel like the server private key going into the system is a pretty big flaw in the overall security. The other option is that it would be generated on the fly, but then how do we validate it is correct? [09:19] rogpeppe, we will for HA if we want to actually give each server its own cert [09:19] fwereade_: if we do, then yes [09:19] fwereade_: note also that it wouldn't be terrible to pass in some entropy to the machines we are starting (Dustin Kirkland has done some work around this area) [09:20] jam: i think the right solution is to send the private key in, but then have the server change its own key and send the new public key back on first connection. [09:20] rogpeppe, jam: the server private key only becomes vulnerable once we're running user code, though [09:21] rogpeppe, jam: so I think it's fine if we generate a new one once the ca-pk is up there [09:21] rogpeppe, managing the restarting dance is going to be interesting though [09:21] jam: i think that's a good idea [09:23] fwereade_, rogpeppe: I think we need someone with real security background to actually do the modeling of what threats we are going to try and handle and which ones we aren't. [09:23] rogpeppe, hmm, how do we revoke the bad one though? [09:24] jam: that sounds like a good idea [09:24] It is true that this handles the "services running on machine-0 (even in containers)" could get access to user data and see the private key that was initialized with. [09:24] But someone that gets that data and snoops the first connection [09:24] can get the ongoing security details [09:24] I think [09:24] jam, we don't run user code until after the first connect [09:24] jam, if the cloud itself is compromised it's out of scope [09:24] fwereade_: so I agree that we've closed the "charms running on machine-0" hole [09:25] fwereade_: but we haven't closed other cases [09:25] jam: if someone can MitM the initial data, we're stuffed, i think [09:25] and we might be able to if someone who actually knows this stuff thinks about it. [09:25] Dustin might be a great person to bounce ideas off of. [09:25] jam: it would be good to get a second opinion, definitely [09:25] rogpeppe: note that SSH does it via cloud-init-output. and the boot log. [09:26] jam: sorry, SSH does what? [09:26] rogpeppe: when cloud init finishes it writes out the fingerprint of the SSH key it generate [09:26] generated [09:26] so that the person who started the bootstrap [09:26] can read the log [09:26] and then when connecting knows that the fingerprint matches the generated vaule [09:26] rogpeppe, fwereade_: so we *could* just generate the server public and private key on first boot [09:26] jam: what's this guarding against? [09:26] in the cloud-init code [09:27] and then have the last bit of cloud-init report the public key [09:27] which we can then verify without having to do a "first connect" song and dance for that bit. [09:27] jam: compromise of the initial StartInstance data? [09:27] rogpeppe: having to upload the private key at all [09:27] jam: how do we know we're connecting to the same machine we thought we were starting? [09:27] rogpeppe: so rather than uploading something which gets replaced, we just have it generated in a location that we can read [09:28] jam: how do we verify that we're talking to the right location? [09:28] rogpeppe: isn't that where you ask the provider for the machine that you started? [09:28] as in, you started it, it gave you an instance id, and you're asking it to give you the IP for that machine as well as the log for that machine. [09:29] jam, assuming you really get the address for that machine once it's (potentially) gone through DNS [09:29] jam: this presumably assumes that you can get a log from the machine without talking to the machine itself, right? [09:30] rogpeppe: you get the log from the provider [09:30] jam: can we assume that's available from al providers? [09:30] all [09:30] jam, rogpeppe: in ec2 we originally decided against it because it makes bootstrap take longer, and yeah, I'm not sure we can depend on it everywhere [09:30] rogpeppe: openstack has "nova console-log" [09:30] I'm pretty sure EC2 has it [09:30] jam, quite a lot of delay on ec2 at least [09:30] because it is how people boot a machine and add the ssh key to their known hosts without having to manually verify the ssh fingerprint [09:31] fwereade_: how does it make bootstrap much longer? [09:31] because of the time to generate a key? [09:31] fwereade_: because it hasn't actually made it any longer to get to a *usable* state [09:31] given that we have to wait for whatever song and dance to finish anyway. [09:31] jam, literally just the wall clock time before `juju bootstrap` returns [09:32] fwereade_: if you have to wait for jujud to be started before you can connect, you have to wait for jujud to be started, it iisn't much longer if you are waiting for the key to be generated at startup. [09:32] fwereade_: you don't have to wait for it, you ask for it at the same time you would have done it during first connect. [09:32] jam, we're talking about different things [09:32] jam, it was a use perception argument [09:32] fwereade_: to be fair, users have wanted us to wait for bootstrap to return for other reasons :) [09:32] jam, "oo, that was quick, how nice" [09:32] jam, yeah, indeed, I am not sure it holds much actual water [09:33] fwereade_: My argument is that it can be functionally identical to what we have today. [09:33] if we wait in bootstrap [09:33] or we wait in first connect [09:33] first connect already has to wait [09:33] jam: i think it's different [09:33] it just happens to *also* not have the public key for the TLS connection. [09:33] until the machine finishes first-startup and we read it from the console-log [09:33] jam: you can call bootstrap, it returns. then some time later you can connect quickly. [09:34] jam, well, yeah, rogpeppe's use case has *some* utility [09:34] jam: what happens if your network connection goes down half way through bootstrap? [09:34] rogpeppe: except we can't today [09:34] because we are waiting for it to start up [09:34] and under the proposed situation [09:34] it is no different [09:34] just the "first connection" reads the console log first [09:34] rogpeppe: doesn't matter. Again, *Do it the way we do today* [09:34] fwereade_: i can certainly say that i've made use of that functionality [09:34] you just read the public key in a different way [09:34] rogpeppe, likewise [09:34] instead of assuming you can write it up, just read it back [09:35] fwereade_: well, I've gone and gotten a coffee because I was waiting for it, but that doesn't mean it was *helpful* functionality :) [09:35] jam: i'd be slightly concerned that we start fundamentally relying on provider-specific security for the log data [09:35] rogpeppe: don't we already? [09:35] jam, rogpeppe: the other possibility is to extend the hardware-characteristics hack [09:36] rogpeppe, fwereade_: we can always add some sort of "secret nonce" to the instance data that we validate on connect, but that doesn't have to be a giant cert file [09:36] jam, rogpeppe: but that's probably also dependent on clou-specific bits [09:36] fwereade_: don't we already have a bootstrap nonce to know that the machine we started is the one we asked for? [09:37] I realize some of that today is written into the DB for the server-side code to do the validation [09:37] but we could save that locally [09:37] and then have an actually unique bootstrap node nonce [09:37] instead of the hard-coded one [09:37] jam, no -- the bootstrap nonce is just a hardcoded value *because* there's no risk of double-starting the bootstrap node [09:38] jam, not to say it's a bad idea [09:38] jam, it's probably good actually [09:38] jam: i'm not sure we do currently need to rely on provider-level security for anything other than starting an instance and making sure that the user data isn't compromised along the way. [09:38] fwereade_: well we have bug #1212177 today, which potentially gives an opening for double-bootstrapping machine-0 [09:38] <_mup_> Bug #1212177: bootstrap detection stops too early [09:39] rogpeppe: so what security is required in reading the log? [09:39] we aren't writing the private key there [09:39] (i don't think the client needs it) [09:39] just the public key [09:39] jam: we need to know definitively that we're reading the unaltered log from the correct machine [09:40] jam, ha, I'd missed that one [09:41] rogpeppe: but if you write up the secret key, you definitely need to make sure you are connecting to the write IP address and someone didn't start another instance with the same private details. [09:41] write IP => right IP [09:43] jam: i think that if someone manages to intercept the initial user data and compromises the provider DNS lookup, i think we're stuffed any which way [09:43] rogpeppe: What I really want to get out of this is that there is more than one way to solve this, and none of us are security experts. So while we can put forth some ideas, if we really care about having good security we should ask someone who knows what the likely and unlikely threats are and see how our proposals stack up in practice. [09:43] rogpeppe, jam: what if we were to create a bootstrap-only CA and finish bootstrap by sending back down the real one generated on the server? [09:45] fwereade_: that's pretty much what i had in mind [09:45] fwereade_: isn't that the same problem of bootstrap having to wait for the server to finish booting? [09:46] and in that case it is *bootstrap* waiting. [09:46] jam: i *think* that fwereade_ was thinking of doing that on first connection [09:46] jam: first connection == "finish bootstrap" [09:46] jam, no, we do it as today: but we use the bootstrap CA only for the first connection, and as part of that we get the real one [09:46] jam: that's what i was thinking of anyway [09:46] fwereade_: CA vs cert [09:47] it doesn't sound like we need a bootstrap CA [09:47] just a cert [09:47] and then it generates a CA and a signed cert for that connection. [09:47] And then how we get that private CA to sign the keys for other HA nodes [09:49] jam, can we revoke that effectively though? [09:49] fwereade_: so yes, it sounds like a workable solution, but I'd like someone who actually knows this stuff to actually validate our thoughts. [09:50] fwereade_: revoke what? The initial cert? [09:50] jam, yeah; if we have completely separate CA certs we don't need to worry about the old one being misused [09:50] why do we care to revoke an ephemeral cert? [09:50] jam, because if it's vulnerable then bad peoples might use it [09:50] after first connect, we regenerate everything anyway, and we stop recording that the original one was what we wanted to connect to. [09:51] jam, ok, cool, I must have misread you [09:51] fwereade_: so a CA is one cert that is used to sign other certs so that you can check that the cert is valid without having to have the public key for every possible cert. [09:51] that sounds like *way* overkill for an initial cert that we are going to be replacing. [09:52] fwereade_: so my proposal is that on "bootstrap" we generate a cert and call it "bootstrap-cert". We then connect and only allow connections to someone whose public key matches that bootstrap cert. [09:52] And on the jujud side [09:52] it generates a new CA and a new cert for that machine. [09:52] and we do the "here is the new CA for all machines in this environment" [09:52] jam, ah, got you, for bootstrap we just send up a cert and key, and we get *back* a CA cert that was used to sign the new server cert [09:52] perfect [09:53] fwereade_: if we *do* generate stuff on server side, we should look into the entropy stuff that Dustin worked on. [09:53] which is to take client entropy and feed it into the startup [09:53] jam, yes indeed [09:53] jam: it's a bit awkward to bypass CA certs entirely for TLS connection [09:53] because otherwies 0 entropy on newly started machines generate bad certs. :) [09:53] jam, quite so [09:53] jam: we already generate stuff server side, so we already need entropy [09:54] rogpeppe, jam, dustin was talking at iom about providing an entropy service [09:54] fwereade_: that sounds like a good plan. [09:54] fwereade_: but more is always better. [09:54] rogpeppe, doesn't it [09:54] fwereade_: and I'm pretty sure that service is what he was talking about charming [09:55] and found it difficult. [09:55] rogpeppe, but I'm not sure we can depend on that being available, so sending it up from the client will work today I guess [09:55] fwereade_: but regardless of that, he had an early thread talking about possibilities. [09:55] and one is to just call out to /dev/random on the client and write up ~100 bytes or so to the server [09:55] to seed its random number generator. [09:55] jam: yeah, that sounds like a good plan [09:56] http://blog.dustinkirkland.com/2012/10/entropy-or-lack-thereof-in-openstack.html and http://blog.dustinkirkland.com/2012/10/seed-devurandom-through-metadata-ec2.html [09:56] jam: it helps more if you're not starting juju from a cloud instance itself [09:56] http://blog.dustinkirkland.com/2012/10/seed-devurandom-through-metadata-ec2.html mentions that *today* you can seed /dev/urandom by just writing to it [09:56] and we already use cloud-init [09:57] so we just need to add [09:57] the AddFile(/dev/urandom, "$SOME_RANDOM_BYTES") [09:57] jpds: fwereade_: the rsync bug should have been fixed last release or the one before [09:59] wallyworld_, just for units, or also for machine 0? it seemed like jpds was seeing repeats from machine-0.log [09:59] fwereade_: the bug that was filed was about adding a unit to machine-0 caused a loop [09:59] fwereade_: if we're seeing a different issue about cycling on itself, then it would need a different fix [10:00] jam, repeated trash from the past was the symptom, though, so maybe it's the same thing happening [10:01] jam, AddFile(/dev/urandom, "$SOME_RANDOM_BYTES") sounds good [10:02] fwereade_: should have been for everything afaik. i didn't do the fix myself (hence i didn't really test it) [10:03] if there might still be an issue it should be investigated [10:03] fwereade_: the fix landed August 20th, so it should definitely be in 1.13.2, so if that isn't fixing it, we need a different fix. [10:03] fwereade_: assuming that AddFile will work in that context, yeah [10:03] fwereade_: bug #1217808 so we don't forget about it. [10:03] <_mup_> Bug #1217808: juju should seed entropy into started instances [10:04] rogpeppe: http://blog.dustinkirkland.com/2012/10/seed-devurandom-through-metadata-ec2.html [10:04] jam, you just beat me there, thanks [10:04] I'll do the debug-log spam one [10:04] jam: ah, cool [10:04] fwereade_: original bug for syslog is bug #1211147 [10:04] <_mup_> Bug #1211147: Deploying service to bootstrap node causes debug-log to spew messages [10:07] fwereade_: bug triage clarification, easy vs bitesize. Easy means a core dev focusing on it shouldn't have much problem, and bitesize is for 3rd parties, or is it the other way around [10:08] jam, bitesize means the actual change is small but you may need context to know what it is [10:08] jam, easy means you don't need so much context but does not necessarily connote "small" [10:08] fwereade_: unfortunately that doesn't stick easily in ones head. [10:09] jam, eh, I picked it because it seemed obvious, I have no specific attachment to those words [10:10] jam, if there's something more standard that we should use to indicate those distinct characteristics I'm fine with that [10:10] brb [10:33] fwereade_, jam: here's a possible way of doing things that doesn't rely on provider logs and does not require the API server to change its server key: http://paste.ubuntu.com/6036133/ [10:34] jam: it uses a bootstrap key, but only for signing - it never actually needs to serve the connection using the bootstrap key [10:35] rogpeppe: given you can sign a key from multiple other certs, why not have the "bootstrap key" be effectively a CA, and then the newly generate CA and key itself are signed by the key passed in? [10:37] jam: ah yes! if i understand correctly, i think that's a great idea [10:37] jam: then for the first connection, we use the bootstrap cert as the only accepted CA, but then we change the accepted CA down one level. [10:38] rogpeppe: essentially it changes that first connect from using TLSInsecuryKey [10:38] right [10:38] jam: that's much nicer, thanks! [10:39] jam: and it means we just need one call in the API (CACert) and to change the initial connection code to change the CA cert to that [10:39] rogpeppe: and the server side to generate the new CA and cert and sign them, but it doesn't change a lot of the actual interchange [10:40] jam: yeah. [10:41] rogpeppe: fwereade_: the bot seems to be stuffed right now, getting errors in CompareAndSwap, any ideas of what might fix it? [10:41] jam: link? [10:42] rogpeppe: recent failures trying to submit merge requests, one secd [10:42] rogpeppe: https://code.launchpad.net/~thumper/juju-core/container-address/+merge/182271 [10:42] rogpeppe, jam: nice [10:42] jam, doesn't ring a bell [10:43] I didn't poke anything to upset the bot recently... [10:43] jam: the problem looks as if notifyWorker.commonWatcher is nil [10:43] mgz: I don't trust you. I have my eyes on you :) [10:50] allenap, ping [10:51] jam: ah, i've found part of the bug, at any rate [10:52] rogpeppe: what's that? [10:52] jam: worker/machiner/machiner.go:53 [10:52] jam: it's the old statically typed nil gotcha [10:53] rogpeppe: I don't quite understand. An err that isn't actually nil? [10:53] rogpeppe: my line 53 is "m.Watch()" [10:53] jam: *return* m.Watch(), right? [10:55] rogpeppe: as in it should be return "m.Watch(), nil" ? [10:55] jam: no, it should be: w, err := m.Watch(); if err != nil { return nil, err}; return w, nil [10:56] jam: alternatively we could make m.Watch return api.NotifyWatcher, or make SetUp return *watcher.NotifyWatcher [10:57] rogpeppe: I thought that was when you return an err that is actually nil but then it gets perceived as not nil, is that correct? [10:57] jam: the problem is that SetUp is returning non-nil, non-nil [10:58] but the other bug is that we are returning an error and a nil pointer for *watcher.NotifyWatcher but that ends up in an interface that doesn't realize it is nil? [10:58] rogpeppe: because it is actually returning nil but that is being put into an interface which makes it look not-nil ? [10:58] jam: exactly [10:59] jam: i think api.Machine.Watch should return api.NotifyWatcher [11:01] jam: i'll propose a fix [11:01] rogpeppe: that *might* be an import loop. At least it looks a lot like when I added the NotifyWatcher interfaces and fwereade_ asked me to return the concrete type that happens to implement the interface. [11:01] jam: yeah, it's an import loop, but i'll just move the interface definition into watcher itself [11:02] api/watcher, that is [11:02] jam, rogpeppe: that nil-not-nil behaviour is starting to really get up my nose [11:02] fwereade_: yeah, it's occasionally very annoying [11:02] fwereade_: it is definitely a giant gotcha for people that don't think about it a lot [11:02] fwereade_: and it means you have to put lots of the "if err != nil" statements when it doesn't look like you actually need it. [11:03] jam, and you only really do think about it when it hits you [11:03] fwereade_: i think it might be amenable to a go vet check actually [11:05] fwereade_: FWIW i proposed a very long time ago that there should only be one nil, but that has its own problems [11:06] rogpeppe, yeah, I might have read that discussion [11:07] rogpeppe: as in interface{nil} is not possible? or that it == nil or ? [11:08] jam: that typeof((*Foo)(nil)) != *Foo [11:09] jam: in general, typeof(x) == typeof((interface{})(x)) [11:09] jam: and "one nil to rule them all: would break that [11:09] rogpeppe: as in, that should just return "nil type" ? [11:11] jam: if you wanted to get rid of the problem, you'd make it, i think, so that interface{}(anyNil) == nil [11:11] if I could just easily ask an interface if the thing contained within it was nil, I'd be happy. the == nil thing has bitten me a few times [11:11] natefinch: reflect.ValueOf(x).IsNil() :-) [11:11] natefinch: i agree though [11:12] That's why I said "easily" I don't consider reflect to be easy :) [11:12] natefinch: :-) [11:12] natefinch: i'm not sure when it would ever be appropriate to do that though, through reflect or not. [11:12] natefinch: the underlying issue is that typed nil is just as valid a value as typed 0 or typed "" [11:13] natefinch: so the language is just being consistent [11:13] rogpeppe: but type Foo string; f := Foo(""); if f == "" {} works today ,doesn't it? [11:13] or do you have to do "if f == Foo("")" ? [11:13] jam: yes, you do [11:13] rogpeppe: I know... it's just annoying when you have a function that returns a pointer wrapped by an interface, and the pointer knows that it's nil and therefore not valid, but the interface doesn't know that [11:14] jam: they're different types and therefore can never be equal [11:14] natefinch: yes, that's the classic place to get the gotcha [11:15] jam: even if Foo is type Foo string [11:15] rogpeppe: yep. First time I hit it, it took me all day to figure out wtf was going on [11:16] natefinch: i do think it might be reasonable to get go vet to help here [11:16] natefinch: i think it does some type-base analysis already [11:16] based [11:17] rogpeppe: yeah, I've had some good help from go vet... though I haven't studied exactly what it finds [11:17] rogpeppe: for your earlier conversation, I think the newly proposed key details is http://paste.ubuntu.com/6036251/ [11:17] natefinch: in particular, it could check if you're returning a statically typed value from a function that returns an interface and an error [11:18] natefinch: the place it has bitten a lot in juju-core is actually the "error" interface when you have a custom error type that is a pointer. [11:18] rogpeppe: could or does? [11:18] natefinch: could [11:18] natefinch: i'm pretty sure it doesn't currently [11:18] rogpeppe: ok [11:18] rogpeppe: does that matter? error itself is problematic because it is an interface [11:19] jam: i'm not sure how to avoid too many false positives for the error case itself [11:20] jam: the place i've seen the problem hit most subtly is when you do return f() [11:21] seems like the only reliable way to avoid the nil problem is to have "if foo == nil { return nil } else { return foo } everywhere you return an interface :/ [11:21] jam: from a function that returns an error [11:21] rogpeppe, natefinch: http://play.golang.org/p/fvTyl7V1EY [11:22] natefinch: that has been my experience [11:22] natefinch: well, anywhere you return a statically typed value that may be nil from an interface, yes [11:22] rogpeppe: "from an" or "as an" ? [11:23] jam: "as an", sorry, yeah [11:23] jam: or "from a function that's returning an" [11:26] rogpeppe: do you agree with http://paste.ubuntu.com/6036251/ [11:27] jam: am looking [11:29] btw, I'm reading Clean Code, and it's coming across quite java-y... especially the naming. createPluralMessageDependentParts? This is from the *good* example? :/ [11:30] not that I don't think there's a lot of good stuff... but I could tell from the first chapter it was written by a java guy [11:32] rogpeppe, natefinch: https://plus.google.com/hangouts/_/f497381ca4d154890227b3b35a85a985b894b471 [11:32] TheMue: ^^ [11:32] TheMue, standup [11:33] fwereade_: pong [11:33] allenap, https://code.launchpad.net/~allenap/juju-core/makefile-stuff/+merge/181113 has LGTMs, does it need something else? [11:34] store CA cert locally for future use, discarding bootstrap key [11:34] fwereade_: sexy times, I think. [11:35] http://paste.ubuntu.com/6036298/ [11:36] fwereade_: Yeah, I need to fix something that jam brought up, then I'll land it. (I don't know why, but email from codereview.a.c goes straight to archive in Gmail and I don't see it; it's probably a bad filter, but it's the reason I haven't been very responsive to reviews on this branch.) [11:37] allenap, no worries, thanks [11:43] rogpeppe: fwiw, your branch got bumped because of the nil stuff. I don't know if your branch *triggers* it, or it is a race condition. Other branches have landed since. [11:44] jam: thanks. i'll propose a fix first, i guess [12:00] jam: Do you think it's worth always doing `go test -i $PROJECT/...` before running the tests for real? (wrt Makefile) [13:05] jam: does this look reasonable to you? http://paste.ubuntu.com/6036547/ [13:05] fwereade_: ^ [14:09] jam, fwereade_: a fix for the worker nil-in-interface problem: https://codereview.appspot.com/13326044 [14:12] fwereade_: i'm not sure what you mean by your comment here: https://codereview.appspot.com/13269045/diff/3001/environs/testing/polling_test.go#oldcode80 [14:12] fwereade_: what should have moved where? [14:14] rogpeppe, *something* is all -- it just seemed to be an environ test entirely in terms of provider [14:15] fwereade_: it's a test for environs/testing.PatchAttemptStrategies, no? [14:15] fwereade_: and the test is in environs/testing, which seems reasonable to me [14:16] rogpeppe, so maybe that's the thing that should move, or maybe it's fine as it is -- i subscribe to the latter really, but that one file made me pause more than the others [14:16] fwereade_: ok, thanks [14:16] fwereade_: FWIW i think we should keep the subdirectories of provider entirely for provider impls [14:16] fwereade_: (that's why i created it) [14:17] fwereade_: "all" is the only exception, which seems kinda ok [14:19] rogpeppe, ehh, that kinda forces us to cram all the possible utilty functions into one package [14:19] rogpeppe, if it cam to it I'd rather have provider basically empty and a util subdir [14:20] rogpeppe, but we're making progress regardless [14:20] rogpeppe, if the packages are themselves clean it's a hell of a lot easier to rearrange them [14:20] fwereade_: i know what you mean. [14:21] fwereade_: but we need somewhere to put the actual provider implementations, and i really don't want to mix that name space with the other utility packages again [14:21] fwereade_: we could have providerutils, i suppose [14:33] man, juju really needs better feedback... I can never tell if a command is hanging or if it's just working, but slowly [14:34] (it's pretty much always the latter, but as a user... I can't tell) === tasdomas is now known as tasdomas_afk [14:43] every time I do juju bootstrap, I immediately follow it with juju status, and juju status hangs forever. Tried with both azure and AWS. Anyone seen that? [14:45] natefinch: i use --verbose or --debug generally [14:45] lemme retry with that... I always forget [14:46] this repeated every 1/3rd of a second: 2013-08-28 14:45:42 ERROR juju open.go:89 state: connection failed, will retry: dial tcp 137.116.116.136:37017: connection refused [14:46] jam: just looking at https://codereview.appspot.com/12949047 [14:46] jam: i'm not sure we should always default to simplify [14:46] I think I've seen that before under similar circumstances... where it just keeps retrying for forever [14:47] natefinch: if that's happening, you need to ssh to the machine and look in the logs to see what's actually gone on [14:48] rogpeppe: huh ok... of course I tried juju ssh which fails with the same error. Have to do it the hard way I guess [14:48] natefinch: yeah, juju ssh requires an environment to connect to [14:49] rogpeppe: I'm spoiled by how easy juju makes things :) [14:51] * rogpeppe goes for lunch [15:07] natefinch: if it is "forever" then it is a bug. if it is "a really long time" then it is just because instances take a long time to start up. For Azure I've heard it can be 10 minutes before cloud-init is all finished. [15:07] natefinch: the one thing I've seen over in the HP Cloud is that bootstrap can start and the machine only has a private IP address, and if 'status' sees that and thinks it should connect there, it will never connect (and the machine later gets a public address). [15:08] We *should* have HP Cloud such that we wait for a pub address before we continue (thanks to noodles775) [15:08] jam: I waited 10 minutes. I'd say that's close enough to forever :) [15:08] natefinch: You can look at whatever overview page Azure provides (I don't know it) [15:08] to make sure the machine is booting. [15:09] jam: yeah... I just killed it, I'm restarting the process with debugging so I can see what's going on [15:09] natefinch: and it is also possible something broke and the machine is up but jujud didn't get itself properly configured and running. [15:10] natefinch: this is one of our mistakes. By default juju gives no output, so all of *us* run with -v, which is slightly too much output, but we really need something more than 0 [15:10] jam: I'll leave it going for longer this time, I might have killed it just before it finished booting up [15:10] jam: yeah, I have that in an email that I was going to send out. There's not enough user feedback, especially for someone new to the product. [15:11] i believe fwereade_, thumper, and myself agreed on informative messages by default (no -v) [15:11] +1 [15:11] and a -q if you really want nothing. [15:11] jam: yeah, I much prefer that way [15:12] there is an argument that you shouldn't tell the user you just did what they asked you to do, but the delays involve mean you sit around wondering "what is going on" [15:13] I'd like to have bootstrap return the text from juju status, honestly. I always type juju status after running bootstrap, and I can't be the only one [15:13] natefinch: that bit is that we don't want to wait around until status is ready. And was discussed *heavily* [15:14] But I personally agree that having bootstrap block your terminal is no worse than the next thing you want to do blocking your terminal [15:14] jam: well, so that's interesting... what information do we have immediately after bootstrap returns? There has to be something that tells bootstrap "ok this was a success" [15:14] but bootstrap and deploy etc are all async by design [15:14] If bootstrap is async, why does it take 2 minutes to return? [15:21] natefinch: it doesn't on ec2, it does take a little while to process and get everything requested [15:21] hmm.. maybe my azure setup is just borked [15:21] but not 2 min [15:22] natefinch: azure has been reported to be slow [15:22] jam: ahh... juju status finally returned after 11 minutes of connection refused [15:23] natefinch: so there is definitely a "lets investigate and see if we can make it better" you can compare that with canonistack pretty easily [15:23] jam: with an error about no reachable servers [15:23] natefinch: no reachable servers happens if we don't see anything BUILDING within 10s or so [15:23] jam: maybe someone messed that up and made it 10 minutes? [15:24] jam: jam http://pastebin.ubuntu.com/6036957/ [15:35] jam: ahh yeah, DefaultDialOpts() in state/open.go uses a 10 minute timeout [15:44] natefinch: that is the time we expect an instance to start in [15:44] vs the time we expect to see a machine *start* building [15:47] jam: it's not really ok to let a command take 10 minutes to return [15:48] natefinch: you asked for a command to wait for the instance to be started, it takes that long for it to start [15:48] jam I asked for the status. The status can be "not started yet, or perhaps in the process of starting" [15:49] jam: I never expect a command to wait for 10 minutes. Prtty much ever, unless I specifically tell it to [15:50] jam: I'd expect "timed out waiting for environment to finish bootstrapping, try again in a few minutes" after at most a 30 second timeout [15:53] natefinch: to be fair things other than azure don't take that long [15:54] jam: well, to be fair, we explicitly set the timeout at 10 minutes :) [15:58] jam: also, something must be broken since it's been 45 minutes and I still can't get status, even though thew azure management page says the instance is up and running fine, and I can ping it at that IP address. Weird. [15:58] natefinch: can you ssh in? [15:58] I don't doubt something might be broken [15:58] and cloud-init-output.log is the best way to find out [15:58] it might just be my configuration though [15:58] I'm trying to figure out how to ssh in without using juju ssh... not sure what key it thinks it's using [15:58] natefinch: it is just ssh to the host address [15:59] we just look it up for you with "juju ssh" [15:59] arrrgh, need help by a brz-diff-patch-freak [15:59] jam: right, but juju ssh has the same connection problem [15:59] jam: maybe you can help me [15:59] natefinch: juju ssh can't work because we connect to the db to find the address. [15:59] natefinch: but if azure says it is up and running, grab the ip address and ssh into it [15:59] ssh ubuntu@host [16:00] Permission denied (publickey) [16:02] jam: question is here: how to apply a (reverting) patch, when one of the files has been renamed since then [16:03] TheMue: bzr merge -r AFTER..BEFORE ? [16:03] TheMue: so if rev 100 committed something; bzr merge -r 100..99 [16:07] jam: ah, hmm, ok, i created a diff and tried to merge it with patch -p0 -R < mydiff.patch [16:07] jam: but will try it now that way, thanks [16:08] natefinch: I know there was a bug with cloud-init on azure not getting ssh access, but I thought it was just for units other than machine-0 [16:08] you might check the launchpad bugs [16:08] jam: ok, thanks [16:19] jam: ah, great, worked. exactly what i wanted. thx again === TheRealMue is now known as TheMue [16:32] jam: wow, you're fast [16:55] rogpeppe, jam, mgz, TheMue, fwereade_: anyone have an opinion on whether or not we should abstract away runtime.GOOS so we can test OS-specific code? Specifically looking at a function I'm writing to get the user's "HOME" directory, which is different on linux vs. windows [16:56] natefinch: hmm, interesting question [16:56] natefinch: in the end, path/filepath is different on linux vs windows so we can't get away without having tests that might fail on one platform or another [16:57] natefinch: i suggest we concentrate all the OS-specific stuff in one place [16:57] rogpeppe: right, that's a good point [16:57] natefinch: (assuming we do have any code that needs tagging specifically for windows) [16:58] natefinch: regarding non-unix clients this sounds ok [16:58] rogpeppe: in theory most of the filepath stuff should abstract us away from the path differences, as long as we use the built-in separators and not hardcode slashes etc [16:58] natefinch: that's my thinking too [16:59] natefinch: the difficulty is when we have paths that can be either local or remote [17:00] natefinch: another approach is to keep all paths in portable form and to transform with filepath.FromSlash at the last moment possible [17:01] rogpeppe: seems like we shouldn't be messing with specific paths too often... or are there places where that's done a lot? [17:01] natefinch: i'm thinking in particular of the stuff around environs/cloudinit [17:02] natefinch: but that's probably not too bad until we actually want to deploy on windows, which tbh would probably require a whole new init script, so not too bad. [17:02] rogpeppe: yeah, totally [17:06] rogpeppe: still not sure what the answer is - put windows-specific tests in a windows-specific file, so the tests will run on windows but not linux? I wonder if they'll actually ever get run that way :/ [17:06] natefinch: hopefully there won't be any windows-specific test necessary [17:06] s/test/tests/ [17:07] natefinch: if we do need such tests, then they would need to be in a windows-specific file, yes [17:07] rogpeppe: "HOME" on windows is "HOMEDRIVE" + "HOMEPATH" ... I'm writing a function to abstract away the difference so the rest of the code doesn't have to know... but I need a test for that function [17:08] natefinch: is it not possible to write in a portable way? [17:08] natefinch: even though it will only run under linux [17:08] windows even [17:08] rogpeppe: the problem is, that I have to switch on runtime.GOOS, so on Linux, the windows path never gets run [17:08] actually [17:09] rogpeppe: one exported function does the switch, calls two internal functions, one per OS... I can then test each internal function separately [17:09] natefinch: that was my thought too [17:10] rogpeppe: sometimes the obvious is too obvious ;) [17:10] natefinch: i think it's worth keeping away from build tags until we can definitely no longer avoid them [17:10] rogpeppe: yep [17:23] fwereade_, jam, natefinch, TheMue: small patch to testing.LoggingSuite: https://codereview.appspot.com/13351043 [17:23] rogpeppe: looking [17:27] rogpeppe: there's no way to redirect the log output to both the normal output and something you can test? [17:27] natefinch: well that wouldn't be testing the functionality i need to test, would it? [17:27] rogpeppe: oh, right :) [17:31] rogpeppe: if it were me, I'd remove the INFO part of the string matching... that seems like it's an internal detail to Infof, and not actually something this code cares about [17:31] rogpeppe: (in the tests, that is) [17:31] natefinch: actually, it's something that's specifically printed by the LoggingSuite code [17:31] natefinch: the error level, that is [17:32] natefinch: specifically this line: w.c.Output(3, fmt.Sprintf("%s %s %s", level, module, message)) [17:32] natefinch: so i think it's worth leaving in [17:33] rogpeppe: oh yeah... ok, I didn't realize that was actually in this package. That's good, then [17:33] natefinch: cool [17:33] LGTM'd [17:36] rogpeppe: LGTM by me too [17:36] TheMue, natefinch: thanks. landing. [17:37] rogpeppe: yw [17:38] so, i'm leaving, see you tomorrow [17:53] rogpeppe: arg.... was using this to get the drive off the path passed to my SetHome() method: http://golang.org/pkg/path/filepath/#VolumeName but it only does the right thing on Windows [17:53] natefinch: ha, of course [17:54] natefinch: perhaps we really should go the build tag route [17:54] natefinch: and have windows-specific files and tests [17:58] natefinch: I would say write the code so that it looks at $HOMEPATH and $HOMEDRIVE from the env [17:58] and then write a test that forces them into the env [17:58] note that you need that anyway [17:58] because *my* homepath may not be on C: [17:58] and Users [17:58] vs Documents and Settings [17:58] vs whatever [17:58] natefinch: that code can run on any system. [17:59] natefinch: then I would write another test that doesn't poke at stuff, but asserts that we get a value for "FindHome()" or whatever we want to call it. [17:59] It won't really matter *what* that is, as long as it is valid. [17:59] And you *could* do a test that has a OS switch [18:00] so you test that we just use $HOME on linux, and when available on Windows, and $HOMEDRIVE\$HOMEPATH when $HOME isn't [18:01] morning people [18:01] rogpeppe: ping [18:01] thumper: hangout? [18:01] rogpeppe: sounds good, got one handy? [18:02] thumper: usual standup hangout should do [18:02] ok [18:02] thumper: https://plus.google.com/hangouts/_/f497381ca4d154890227b3b35a85a985b894b471?authuser=1 [18:03] jam: the code is pretty simple, and does the right thing on the right OS, it's just hard to test it correctly on the wrong OS [18:04] jam: because the code uses stdlib functions that work differently on different OSes (like path.Join) and filepath.VolumeName [18:05] rogpeppe, jam: http://pastebin.ubuntu.com/6037498/ [18:07] jam: notable, filepath.VolumeName returns an empty string on linux regardless, which means the setHomeWin function ends up setting the full path into HOMEPATH [18:16] jam: also, I disagree with looking at %HOME% on a Windows OS. Hopefully whoever set it would keep it in lockstep with HOMEPATH, but if not... that's really non of juju's business. HOMEPATH is what the OS itself uses as the user's home directory. Anything else is application-specific [18:16] natefinch: I use $HOME, I've used work machines that set $HOME to a shared network drive so it is shared across machiens [18:17] jam: my point is... HOME has no meaning on Windows. Any meaning is specific to the application that sets it up. iTunes could set up HOME to mean the iTunes music library root or something. [18:18] natefinch: I fully disagree based on past experience [18:18] certainly *my* .ssh directory is in $HOME [18:19] I would say that if HOME is set, we should preferentially use it. [18:21] jam: I don't know how we can figure out which one the user wants us to use without asking them. When I run a native windows application, I expect it to put stuff in my windows HOMEPATH. If I happen to have $HOME set up somewhere else because I have cygwin installed... I'd be pretty annoyed that juju put stuff in there and not the windows default. We have an application specific environment variable to override the OS defaul [18:21] t, it's JUJU_HOME. [18:26] jam: we have to be respectful of the defaults on each OS, so that the application feels like a native application, and not just a bad port. That [18:27] think about it the other way.... would you want juju to put your .ssh folder under $HOMEPATH if it was set on linux? === tasdomas_afk is now known as tasdomas === tasdomas is now known as tasdomas_afk [18:30] natefinch: clearly not given that $HOMEDRIVE wouldn't be set [18:30] :) [18:30] I can go set it right now :) [18:32] we [18:32] we're not making an application to run under cygwin. We're making an application to run under windows. [18:33] natefinch: well, we're making an app to run under cmd.exe, which is a slighty different beast. There really isn't a "standard place to put your ssh keys on Windows". as such if someone has cygwin installed, they likely *do* have ssh keys in $HOME/.ssh/* that would actually be useful for connecting to another machien. [18:34] it is possible that they also have putty somewhere, and we should try to find that as well [18:34] natefinch: *if* we had an explicit gui, then I would expect that to be much more windows-centric about where things are stored, and it would isolate itself and expect you to configure it via menus, and store its config in the Registry.But that isn't what we have. [18:34] (yet) [18:36] natefinch: I would accept that we shoudrn't write there by default, but we shouldn't write to HOMEPATH either [18:36] we should be writing to APPDATA [18:39] jam: yes and no. .juju should definitely be in appdata... but .ssh is a different beast. in theory it's supposed to be cross-application, though obviously that's not standard on windows [18:40] natefinch: it wouldn't be '.ssh' anyway if it was windows. (windows won't let you create a '.' dir in explorer) [18:41] jam: yes that's true (I hate that restriction in windows btw) [18:43] jam: the other problem with app data is that it's hard to find in explorer, and we expect people to edit the environment.yaml [18:43] natefinch: juju edit config to launch a default editon? [18:43] juju edit-config? [18:43] jam: I didn't know that existed :) [18:44] natefinch: it doesn't, but it could [18:44] jam: ahh. yeah, that would be very good [18:53] back in about 20 minutes, have to drop off a preschool application === natefinch is now known as natefinch-afk [19:02] * rogpeppe is done for the day === natefinch-afk is now known as natefinch [19:34] jam: what do you think we should do for friday with this Windows stuff? My inclination is to put the info in app data like it probably ought to be, and work on making it more accessible later [21:28] blurgh... [21:29] just had to clean up a long vomit stretch in the hall [21:29] I wish the dog would stop eating the cat food [21:29] who'd think it would upset her so much [22:29] good morning vietnam [22:30] thumper, I see your long vomit stretch from the dog, and raise you two projectile vomiting twins [22:30] oh dear [23:15] morning [23:25] o/ davecheney [23:28] Ave Caesar [23:30] back in the rolling valleys of South Wales? [23:31] bigjools: it's grand [23:31] boyo [23:31] * davecheney goes back to upgrade testing