[00:31] thumper: lachie is sick and i have to go get him from school. i should be back in time for standup but if not i won't be far away [00:57] ack === mwhudson is now known as zz_mwhudson [04:43] thumper, wallyworld: would it be terrible if rsyslog configuration weren't set up during machine init, but by the machine agent? that would mean losing logging for failed agent startup, but would mean simpler code [04:44] hmmm [04:44] i'd perhaps need to see the rationale, pros, cons etc [04:44] i think logging failed agent startup is important [04:45] s/is/is potentially [04:45] well, to start with we'll probably need to support changing syslog protocols, from udp to tls/tcp [04:46] that could be done as an upgrade step, but would be cleaner if we handle it as a worker [04:46] so then if we have a worker, we *can* ditch the code that initialises the logging during machine init - threading attributes through API server and provisioner [04:46] but it does mean losing that logging [04:47] we could have a jujud invocation that connects to state and does one-shot configuration, but I doubt that's any more useful than just relying on a working machine agent [04:48] could we do both? still log agent startup? [04:48] yes we can do both still, this is just a question of usefulness vs. code complexity [04:49] to support TLS, I've had to add additional fields to several structs in the API to support env and container provisioners [04:49] i'm very hesitant to want to throw away logging, especially for startup of services where key errors might be logged [04:49] that'll only grow if we continue on this path [04:50] well, the other thing is, the logs will still be there on the machine [04:50] just not in debug-log [04:50] but... not ideal [04:50] yeah [04:51] i'd email the list - there's sure to be a range of opinions :-) [04:51] okey dokey [04:52] cause i can see arguments either way [04:53] and potential changes to the api deserve wider input [05:08] axw: thanks for the review. All makes sense. I'm just stuck on one bug. I've emailed you the details. [05:09] * axw looks [05:10] waigani: the bad code is "return s.configValidator, configValidatorErr" [05:10] you should be returning a pointer. [05:11] ahhhhh [05:11] dur [05:11] if the receiver is *T, then you must assign a *T [05:11] axw: yep. Thank you. [05:15] axw: I now get: invalid indirect of s.configValidator (type mockConfigValidator) [05:17] eh? [05:17] sounds like you're doing * instead of & [05:21] ah, got you now [05:41] axw: interestingly, we shouldn't actually lose anything, as when you first start up the reader, it would read and transmit the entire file [05:42] axw: however the ordering of the events in the all-machines.log will be out of whack [05:42] axw: will respond to the email in more detail later [05:42] * thumper is done for the day, but back for a meeting at 8:30pm === thumper is now known as thumper-afk [05:46] thumper-afk: I meant if the machine agent never comes up at all. Thanks, will await you reply :) [07:41] axw: sent you an email [07:46] axw: just ran TestSetEnvironAgentVersionExcessiveContention by itself, same failure [07:47] waigani: ok. best to go back to trunk before your changes and see if it's any different [07:49] axw: hmmmm, trunk passes. no panics, no fails. [07:50] I've screwed something up... [08:22] mornin' all === wrtp is now known as rogpeppe [08:27] morning rogpeppe [08:27] axw: hiya [08:27] rogpeppe: would you mind taking a look at this? I have a LGTM, but it's a sensitive area so I'd like two :) https://codereview.appspot.com/66920043/ [08:28] I find the config defaults code a bit confusing [08:42] axw: will do [08:45] axw: what was the specific motivation behind the change? [08:46] rogpeppe: to have a default of Omit for existing environments, and a non-Omit default for new environments [08:46] axw: that is, it seems reasonable, but i'm trying to think of an example where the default value isn't the same as the "always-optional" default value [08:48] axw: just trying to understand further: what attribute do we want that for? [08:49] rogpeppe: if you don't use Omit in alwaysOptional (or "" for strings), then New will fail if the attribute doesn't exist in the old config [08:49] rogpeppe: for a new attr, syslog-tls [08:49] which is a bool === mthaddon` is now known as mthaddon [08:54] rogpeppe: also, I don't think the bootstrap timeout options were doing anything. the defaults were all being overwritten with Omit [09:00] hey jam, sorry my alarm clock betrayed me :/ [09:01] * rogpeppe wishes this IRC client had decent notification functionality [09:03] dimitern: I was wondering about that. I hope you enjoyed your trip back to Malta [09:03] axw: i'm not sure why the bootstrap timeout options were Omit anyway [09:04] rogpeppe: I'm in the 1:1 whenever you're available [09:04] jam, oh, it was good yes [09:04] jam: just joining [09:06] rogpeppe: because they were never specified before, so lack of attr fails the config schema checker [09:06] (unless Omit) [09:07] dimitern: can you ping me to check my notification settings [09:07] axw, thanks for the review, I'll propose a follow-up with your suggestions as I already landed that one [09:07] jam, ping? [09:07] thanks [09:08] dimitern: okey dokey [09:37] hi fwereade, when you have a moment can you take a look at https://codereview.appspot.com/64580044. I'll add the mongo txn stuff to SetEnvironConfig in a separate branch. [09:38] waigani, will do, I'm kinda pitifully trying to write some actual code but I know I have to circle round on reviews again today too [09:38] fwereade: no rush, I'm off to bed now. Thanks [09:55] axw: i still don't quite see - why couldn't the value for bootstrap-timeout not be DefaultBootstrapSSHTimeout, similar to state-port, for example? [09:56] rogpeppe: my understanding is that if it's not schema.Omit, then you can't have the attribute missing from the config [09:56] * axw looks at state-port [09:57] maybe I'm full of crap [09:57] axw: i don't think that's true - "alwaysOptional" implies the attr is always optional, regardless of what the value is [09:59] axw: i can't currently think of a situation where it makes sense to have an entry in both alwaysOptional and in the initial schema.Defaults map created by allDefaults [10:02] rogpeppe: do the default values in alwaysOptional get added to a config that specifies NoDefaults? [10:02] * axw is looking at noDefaultsChecker [10:03] axw: yes [10:04] hey all [10:04] mgz: yo! [10:04] rogpeppe: ok, then that's a problem. I want an old environment to continue on with either no syslog-tls or syslog-tls=false [10:04] rogpeppe: but all new config's should have syslog-tls=true [10:04] hey mgz [10:05] axw: ah, ok, that's a good specific case [10:05] rogpeppe: I think I had tested that originally, then confused myself with the bootstrap timeout params which shouldn't be like that anyway [10:05] axw: why do we want to do that, BTW? [10:06] rogpeppe: can't upgrade to TLS because our certs have Key Usage set incorrectly [10:07] rogpeppe: sorry, obtuse - there's a mail thread, one sec [10:07] axw: couldn't we just ignore the attribute if the cert doesn't support TLS? [10:08] rogpeppe: can't tell from the client side [10:08] rogpeppe: the client just has the CA cert [10:09] axw: ah, which only signs the root cert, not the actual server cert, right? [10:09] rogpeppe: the CA cert signs the server cert, but the server cert is the one with the invalid Key Usage [10:10] axw: erm [10:10] axw: isn't the CA cert the thing that specifies the allowed key usage? [10:11] rogpeppe: they all have key usages. my understanding is the CA can just choose not to sign a request with specific key usages [10:12] rogpeppe: https://codereview.appspot.com/66930043/ fixes the problem [10:12] for new environments only of course [10:14] interesting. i hadn't seen ExtKeyUsage before. [10:15] * rogpeppe looks at the docs [10:44] rogpeppe, axw: hey, also, in the cert package, I note we have `SerialNumber: new(big.Int)` which looks like a violation of the spec [10:44] rogpeppe, axw: I'm pretty sure SerialNumber is meant to be unique [10:45] fwereade: quite probable, although i'm not sure it's a security hole [10:46] rogpeppe, I think it's quite likely to be a problem all the same, as soon as something that *does* care about the spec sees two different certs from the same CA with the same serial number [10:47] rogpeppe, it's not so much security hole as an indication that we're being funky and should not be trusted [10:48] * fwereade wonders if there's a good complement to "security hole" that implies false positive rather than false negative [10:48] er, I guess you could read it either way [10:49] fwereade: we should probably just make it a 128-bit random number [10:49] rogpeppe, +1 [10:50] rogpeppe: standup ? [11:21] axw, fwiw.. the rsyslog priv separation thing has been fixed for 2 years.. debian and ubuntu have been carrying around old versions.. trusty gets things back to a current version, another workaround is to install the updated rsyslog pkg. [11:22] which also gives tls on relp (reliable store forward).. probably more than juju cares about. [11:23] rogpeppe, its pretty common re extended key usage.. clientauth, serverauth.. [11:23] * hazmat has been knee deep in x509 the last week [12:11] * rogpeppe doesn't like x509 [12:14] rogpeppe, ping.. i've been hearing reports about issues with the state watcher stopped issue recurring [12:14] rogpeppe, what's interesting is that its not just the admin api client, but all clients watches that go belly up [12:14] rogpeppe, this is on 1.17.3.. cjohnston 's machine-0.log when it happens.. http://paste.ubuntu.com/6986643/ [12:15] cjohnston, this is on canonistack? which region? [12:15] cjohnston, and what instance type is the state server? [12:16] oh.. nm. ic lyc01 [12:16] cpu1-ram1-disk10-ephemeral20 | 1GB RAM | 1 VCPU | 10.0GB Disk [12:33] hazmat: interesting. [12:33] rogpeppe, anything else he could get off that env before he shuts it down that would be helpful for debugging? [12:34] hazmat: does his environment recover from the issue? [12:35] cjohnston, ^ can you run status or deployer again and it works? [12:36] hazmat, cjohnston: it looks like it might be a problem with mongo - the port it's getting a timeout error from is mongod's port [12:37] hmm, but that's somewhat odd [12:37] hazmat: 'status' meaning 'juju status' ? [12:38] cjohnston: yes [12:38] juju status works [12:39] cjohnston: ok, so it seems like this is only a transient error, which is something [12:40] http://paste.ubuntu.com/6986741/ [12:40] cjohnston: thanks. that all looks healthy. [12:40] cjohnston: did you find out about the issue from a GUI error? [12:41] rogpeppe, from a deployer error, he's been reproducing this consistently for a week [12:41] http://paste.ubuntu.com/6986666/ [12:41] cjohnston: you can reliably repro the issue? [12:42] rogpeppe: yesterday I it took 3 tries to reproduce [12:49] cjohnston: you're running the released version of 1.17.3, right? [12:49] 1.17.3-0ubuntu1 [12:54] rogpeppe: do you want anything else from this deployment before I tear it down? [12:54] cjohnston: i don't think so, thanks [12:57] rogpeppe: someone else on my team just got the same error [12:57] rogpeppe: would anything from psivaa's deployment help? [12:57] cjohnston: how have you been reproducing the error? [12:58] rogpeppe: just deploying [12:58] we have a deployment script that deploys our entire setup [13:01] cjohnston, i'd be curious if you beef up your state server instance size if its still reproducible [13:01] fwereade: yeah I noticed the serial, but frankly I don't know what the right thing to do there is. since we only generate one server cert, it doesn't matter at the moment (until we do HA...) [13:01] * fwereade starts counting down in a meaningful sort of way [13:01] hazmat: thanks for the tip about rsyslog updates [13:02] axw, np. thanks for turning around the ssl support there.. i'll still need to chat about the incantations for a manual upgrade are [13:02] psivaa: what size if your bootstrap node [13:02] cjohnston: i dint specify any constraints. so it's the default size. let me confirm [13:02] hazmat: no worries, I'll send you an email with what needs to happen. [13:03] axw, thanks [13:03] cjohnston: hardware: arch=amd64 cpu-cores=1 mem=512M [13:03] even smaller [13:10] wallyworld: how does passing data-dir to syslog config help? data-dir and log-dir are independent [13:13] hazmat: i wondered about that, but it seems odd that a localhost connection would be timing out [13:13] hazmat: even if the machine is heavily loaded [13:19] cjohnston: please file a bug (if you have not done so already) and attach, if possible, instructions for the most reliable way you can find for reproducing the issue [13:30] rogpeppe: do you want a new bug I'm guessing and not re-open the already existing one? [13:31] cjohnston: yes - this seems like a different issue to me [13:32] ack === jcsackett_ is now known as jcsackett [13:47] hi is there any technical doc explaining how juju bootstrap is performed, i'm a bit lost digging in the code ? [13:48] I tried --debug flag, bug things stay unclear === deej` is now known as deej [14:05] fwereade, ping? [14:14] cjohnston: rogpeppe: the issue occurs even with 2G memory of the bootstrap node. (arch=amd64 cpu-cores=1 mem=2048M root-disk=10240M) [14:15] psivaa: if you could post a reliable way that we can reproduce the issue to the bug, that would help greatly, thanks (i haven't checked - maybe someone's done that already) [14:16] cjohnston: ^ would you add that also pls? === psivaa is now known as psivaa-afk-bbl [14:26] niemeyer: ping [14:26] rogpeppe: Heya [14:26] niemeyer: hiya [14:26] niemeyer: just wondering if there's a chance you could have another look at https://codereview.appspot.com/5874049/ ? [14:27] niemeyer: it un-breaks the currently broken time stamp behaviour of gocheck [14:27] niemeyer: (currently gocheck always prints the same time stamp on every line because the time calculation overflows) [14:27] niemeyer: and the output format is as we agreed some time agi [14:27] ago [14:28] rogpeppe: Thanks, will have a look === psivaa-afk-bbl is now known as psivaa [15:55] rogpeppe: should the bug be against -core or -deployer ? [15:55] cjohnston: -core [15:55] ty [16:07] rogpeppe: hazmat bug #1284183 [16:07] <_mup_> Bug #1284183: jujuclient.EnvError: [16:08] cjohnston: thanks! [16:09] rogpeppe, any ideas to cause.. re client level workarounds.. auto-reconnect and ignore? [16:10] hazmat: client level workaround is probably just to reconnect [16:10] hazmat: which is probably something that you should be prepared to do anyway, as it's always possible the ws connection may be terminated [16:11] hazmat: i haven't had time to investigate cause yet, i'm afraid [16:11] hazmat: i have a suspicion or two, but nothing worth mentioning [16:11] rogpeppe, sure.. at a high level i'm trying to move deployer to be delta based, so just rerun.. wrapping every api interaction with auto-retry seems less than idael [16:12] i mean its not a normal occurance imo.. and lacking transactions... auto retry of arbitrary partial seems questionable. [16:13] robbiew, cjohnston, it might be useful to pickup some mongodb logs from an instance when this occurs [16:13] hazmat: huh? [16:13] hazmat: it's not a normal occurrence, except that failure kinda is normal. you'll see this behaviour with HA if there's a server failure, for example [16:14] oh...nevermind ;) [16:14] robbiew, sorry [16:14] robbiew: not the first time :-) [16:15] hazmat: i don't know of a smooth way of coping with server failure here. we should really be making all our ops idempotent. [16:15] hazmat: many of them are in fact, i think. [16:16] rogpeppe, not really.. deploy twice.. error for transmission, error for duplicate service on second.. add-machine twice.. two new machines. [16:17] rogpeppe, that second case is why gui needed a set number of units api.. add-unit/remove-unit are delta ops not target goal ops [16:17] hazmat: add-machine is a good example of one that's not. but error for duplicate service still gives you an idempotent op [16:19] hazmat: it would be nice if the error message distinguished "duplicate service with all the same attributes" and "duplicate service with some different attributes" [16:20] hazmat: or perhaps even if add-service just succeeded if the service already exists with all the same attrs, but that's definitely more arguable. [16:25] rogpeppe, so basically .. do the go thing.. and attach specific error handling and retry logic to every api op... that's fair although burdensome imo (my end goal as i stated is to just be able to do the right thing if people rerun, because i want them to know when the backend turns up errors) ... anyways given the frequency i'd be a bit more inclined to say we should fix the server to be a bit more reliable. [16:26] hazmat: i'm definitely not saying that this isn't a bug - it does sound like it, and it shouldn't be happening [16:26] rogpeppe, which still means getting to root cause there.... i'm hoping the mongo logs might turn up something. [16:26] hazmat: i'm just saying that in the general case, this kind of thing will be able to happen, and can't think of a decent way around it. [16:26] hazmat: definitely [16:28] cjohnston, mongo is logging to syslog it looks like.. if you can reproduce, could you grab a copy to chinstrap [16:28] ack.. psivaa ^ [16:29] hazmat: unfortunately i don't have much time to spend on it currently [16:31] rogpeppe, no worries.. neither do i, but i'm hoping/looking to get a better understanding of root cause, and hopefully at least be able to instruct workarounds. [16:31] actually doing the retry/reconnect for watchers is probably the safest bet atm [16:33] hazmat: yes [16:34] hazmat: FWIW my strongest suspicion is on the mgo package - that's the only place that's setting up tcp timeouts AFAIK. [16:34] pwd [16:35] cjohnston: hazmat: https://launchpadlibrarian.net/167465339/syslog [16:36] rogpeppe, yeah.. possibly ^ Feb 24 13:48:41 juju-ci-oxide-machine-0 mongod.37017[5150]: Mon Feb 24 13:48:41.067 [conn2] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [127.0.0.1:49239] [16:37] hazmat: yeah - looks like it might not be keeping its deadlines up to date === hatch_ is now known as hatch [17:32] quick and easy review anyone? https://codereview.appspot.com/67820048 << rogpeppe, mgz, natefinch, fwereade ? [17:32] dimitern: looking [17:32] natefinch, ta! [17:37] dimitern: quick (unrelated) query [17:38] you say add tests under api/client for the state compat branch [17:38] mgz, yeah? [17:38] but all I see in state/api/client_test.go is a ref back to the tests under state/apiserver/client/ [17:39] unlike the other sections which have dedicated client tests inside their package [17:39] is it now.. let me check [17:39] we have bug 1217282 [17:39] <_mup_> Bug #1217282: api.Client tests should be in api not state/apiserver/client/ [17:40] hmm.. ok, so i must've been thinking of the agent api client tests [17:40] okay, I'll commit my current fixups for now [17:40] then forget what I said :) [17:40] the bug is one of these "let's fix the world when we have time" [17:44] dimitern: thanks. have pushed up the last changes the branch needs, addresses some of your points and fixes a silly error in the reintroduced legacy api [17:45] dimitern: reviewed [17:46] natefinch, thanks [17:46] mgz, i've lgtmed it [17:47] ta! [17:47] blast! i've just realized i'm the ocr today - on to the review queue [17:48] rogpeppe: extracting out simpleworker as requested: https://codereview.appspot.com/67080043/ [17:57] natefinch: reviewed [18:14] rogpeppe: fwiw, I downgraded to 1.16.6 and was able to do a full deployment. not sure if it is related to the downgrade or not [18:15] cjohnston: my suspicion is that it's something to do with changes to a juju-core dependency [18:16] cjohnston: that's a very useful data point, BTW, thanks. [18:16] :-) === zz_mwhudson is now known as mwhudson [18:57] umm. does related-list return self in a peer relation? [19:04] hazmat: good question. i don't know i'm afraid. [19:04] * rogpeppe is done for the day [19:04] g'night all === mwhudson is now known as zz_mwhudson [19:08] rogpeppe, g'night [20:05] natefinch: hey [20:05] thumper: hoiwdy [20:06] natefinch: was looking through Dimiter's branch that you just reviewed [20:06] and I saw a few things that concerned me [20:06] especially around the arg passing [20:06] the tests for the extra args aren't written how they will be actually executed [20:06] testing "-r -o foo" isn't the same as "-r", "-o", "foo" which is how they are passed into init [20:06] I have a gut feeling that the code is broken [20:07] but that could just be because I see the tests being wrong [20:10] thumper: hmmm... you may be right. I know that exec doesn't handle that case in the way one might think (where "-o -r" is not the same as "-o", "-r" but I figured we ewre doing some of our own parsing here. I might be wrong [20:10] I know that the gnuflag library barfs if it gets options it doesn't know about [20:11] so I was expecting to see either custom arg handling [20:11] or some errors [20:11] and saw neither [20:14] thumper: yeah, it looks like we're just passing the args as-is, so multiple flags in the same string won't work === zz_mwhudson is now known as mwhudson [21:48] natefinch: why do i get a mongo not supported error message when i'm just trying to bootstrap? === Guest70848 is now known as wallyworld [21:49] wallyworld: what does mongod --version return for you? [21:50] it is only 2.2.0, but i'm not running services locally to bootstrap an aws env [21:50] although it has worked for local provider till now :-) [21:51] wallyworld: hmm... it should only check when you bootstrap local provider. And yeah, it's going to complain if you don't have at least 2.2.2 I think [21:51] natefinch: ah, i'm a idiot [21:51] i forgot to switch to aws provider [21:51] ignore my drivel, sorry [21:51] wallyworld: haha ok [21:52] but the version check is new? [21:52] yep [21:52] cool :-) [21:52] time to upgrade mongo [21:52] but i've been updating from the repos [21:53] i'll have to see why i'm still on 2.2.0 [21:53] wallyworld: yeah, that's odd [21:53] i'll figure it out. at least we check [21:57] wallyworld: yeah, I added it, figuring it's better to fail early and visibly than to maybe fail further down the road in some random way when an old mongo doesn't work the way we expect. [21:57] wallyworld: I should add in a check for TLS support, too, since we require that. [21:57] yep, indeed. a good thing for sure [21:57] that would be good [21:57] i wonder when that mac address bug will be fixed [21:58] hopefully before trusty