/srv/irclogs.ubuntu.com/2014/02/24/#juju-dev.txt

wallyworldthumper: lachie is sick and i have to go get him from school. i should be back in time for standup but if not i won't be far away00:31
thumperack00:57
=== mwhudson is now known as zz_mwhudson
axwthumper, wallyworld: would it be terrible if rsyslog configuration weren't set up during machine init, but by the machine agent? that would mean losing logging for failed agent startup, but would mean simpler code04:43
wallyworldhmmm04:44
wallyworldi'd perhaps need to see the rationale, pros, cons etc04:44
wallyworldi think logging failed agent startup is important04:44
wallyworlds/is/is potentially04:45
axwwell, to start with we'll probably need to support changing syslog protocols, from udp to tls/tcp04:45
axwthat could be done as an upgrade step, but would be cleaner if we handle it as a worker04:46
axwso then if we have a worker, we *can* ditch the code that initialises the logging during machine init - threading attributes through API server and provisioner04:46
axwbut it does mean losing that logging04:46
axwwe could have a jujud invocation that connects to state and does one-shot configuration, but I doubt that's any more useful than just relying on a working machine agent04:47
wallyworldcould we do both? still log agent startup?04:48
axwyes we can do both still, this is just a question of usefulness vs. code complexity04:48
axwto support TLS, I've had to add additional fields to several structs in the API to support env and container provisioners04:49
wallyworldi'm very hesitant to want to throw away logging, especially for startup of services where key errors might be logged04:49
axwthat'll only grow if we continue on this path04:49
axwwell, the other thing is, the logs will still be there on the machine04:50
axwjust not in debug-log04:50
axwbut... not ideal04:50
wallyworldyeah04:50
wallyworldi'd email the list - there's sure to be a range of opinions :-)04:51
axwokey dokey04:51
wallyworldcause i can see arguments either way04:52
wallyworldand potential changes to the api deserve wider input04:53
waiganiaxw: thanks for the review. All makes sense. I'm just stuck on one bug. I've emailed you the details.05:08
* axw looks05:09
axwwaigani: the bad code is "return s.configValidator, configValidatorErr"05:10
axwyou should be returning a pointer.05:10
waiganiahhhhh05:11
waiganidur05:11
axwif the receiver is *T, then you must assign a *T05:11
waiganiaxw: yep. Thank you.05:11
waiganiaxw: I now get: invalid indirect of s.configValidator (type mockConfigValidator)05:15
axweh?05:17
axwsounds like you're doing * instead of &05:17
waiganiah, got you now05:21
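
The pointer-receiver problem axw describes can be reproduced in a few lines. This is a minimal sketch, not waigani's actual test code: the `ConfigValidator` interface, the `suite` type and the `validator` method are hypothetical names made up for illustration; only `mockConfigValidator` and the field name `configValidator` come from the conversation.

```go
package main

import "fmt"

// ConfigValidator is a hypothetical interface standing in for the real one.
type ConfigValidator interface {
	Validate(cfg map[string]string) error
}

// mockConfigValidator counts calls. Validate has a pointer receiver, so only
// *mockConfigValidator (not mockConfigValidator) satisfies ConfigValidator.
type mockConfigValidator struct {
	calls int
}

func (m *mockConfigValidator) Validate(cfg map[string]string) error {
	m.calls++
	return nil
}

// suite is a hypothetical test suite holding the mock by value.
type suite struct {
	configValidator mockConfigValidator
}

// validator must return a pointer to the mock:
//   return s.configValidator, nil   // does not compile: method has pointer receiver
//   return *s.configValidator, nil  // does not compile: invalid indirect of a non-pointer
func (s *suite) validator() (ConfigValidator, error) {
	return &s.configValidator, nil
}

func main() {
	s := &suite{}
	v, _ := s.validator()
	_ = v.Validate(nil)
	fmt.Println(s.configValidator.calls) // prints 1
}
```

Returning `&s.configValidator` also means the calls are recorded on the suite's own mock rather than on a copy.
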
thumperaxw: interestingly, we shouldn't actually lose anything, as when you first start up the reader, it would read and transmit the entire file05:41
thumperaxw: however the ordering of the events in the all-machines.log will be out of whack05:42
thumperaxw: will respond to the email in more detail later05:42
* thumper is done for the day, but back for a meeting at 8:30pm05:42
=== thumper is now known as thumper-afk
axwthumper-afk: I meant if the machine agent never comes up at all. Thanks, will await your reply :)05:46
waiganiaxw: sent you an email07:41
waiganiaxw: just ran TestSetEnvironAgentVersionExcessiveContention by itself, same failure07:46
axwwaigani: ok. best to go back to trunk before your changes and see if it's any different07:47
waiganiaxw: hmmmm, trunk passes. no panics, no fails.07:49
waiganiI've screwed something up...07:50
wrtpmornin' all08:22
=== wrtp is now known as rogpeppe
axwmorning rogpeppe08:27
rogpeppeaxw: hiya08:27
axwrogpeppe: would you mind taking a look at this? I have a LGTM, but it's a sensitive area so I'd like two :)   https://codereview.appspot.com/66920043/08:27
axwI find the config defaults code a bit confusing08:28
rogpeppeaxw: will do08:42
rogpeppeaxw: what was the specific motivation behind the change?08:45
axwrogpeppe: to have a default of Omit for existing environments, and a non-Omit default for new environments08:46
rogpeppeaxw: that is, it seems reasonable, but i'm trying to think of an example where the default value isn't the same as the "always-optional" default value08:46
rogpeppeaxw: just trying to understand further: what attribute do we want that for?08:48
axwrogpeppe: if you don't use Omit in alwaysOptional (or "" for strings), then New will fail if the attribute doesn't exist in the old config08:49
axwrogpeppe: for a new attr, syslog-tls08:49
axwwhich is a bool08:49
=== mthaddon` is now known as mthaddon
axwrogpeppe: also, I don't think the bootstrap timeout options were doing anything. the defaults were all being overwritten with Omit08:54
dimiternhey jam, sorry my alarm clock betrayed me :/09:00
* rogpeppe wishes this IRC client had decent notification functionality09:01
jamdimitern: I was wondering about that. I hope you enjoyed your trip back to Malta09:03
rogpeppeaxw: i'm not sure why the bootstrap timeout options were Omit anyway09:03
jamrogpeppe: I'm in the 1:1 whenever you're available09:04
dimiternjam, oh, it was good yes09:04
rogpeppejam: just joining09:04
axwrogpeppe: because they were never specified before, so lack of attr fails the config schema checker09:06
axw(unless Omit)09:06
jamdimitern: can you ping me to check my notification settings09:07
dimiternaxw, thanks for the review, I'll propose a follow-up with your suggestions as I already landed that one09:07
dimiternjam, ping?09:07
jamthanks09:07
axwdimitern: okey dokey09:08
waiganihi fwereade, when you have a moment can you take a look at https://codereview.appspot.com/64580044. I'll add the mongo txn stuff to SetEnvironConfig in a separate branch.09:37
fwereadewaigani, will do, I'm kinda pitifully trying to write some actual code but I know I have to circle round on reviews again today too09:38
waiganifwereade: no rush, I'm off to bed now. Thanks09:38
rogpeppeaxw: i still don't quite see - why couldn't the value for bootstrap-timeout not be DefaultBootstrapSSHTimeout, similar to state-port, for example?09:55
axwrogpeppe: my understanding is that if it's not schema.Omit, then you can't have the attribute missing from the config09:56
* axw looks at state-port09:56
axwmaybe I'm full of crap09:57
rogpeppeaxw: i don't think that's true - "alwaysOptional" implies the attr is always optional, regardless of what the value is09:57
rogpeppeaxw: i can't currently think of a situation where it makes sense to have an entry in both alwaysOptional and in the initial schema.Defaults map created by allDefaults09:59
axwrogpeppe: do the default values in alwaysOptional get added to a config that specifies NoDefaults?10:02
* axw is looking at noDefaultsChecker10:02
rogpeppeaxw: yes10:03
mgzhey all10:04
rogpeppemgz: yo!10:04
axwrogpeppe: ok, then that's a problem. I want an old environment to continue on with either no syslog-tls or syslog-tls=false10:04
axwrogpeppe: but all new configs should have syslog-tls=true10:04
axwhey mgz10:04
rogpeppeaxw: ah, ok, that's a good specific case10:05
axwrogpeppe: I think I had tested that originally, then confused myself with the bootstrap timeout params which shouldn't be like that anyway10:05
rogpeppeaxw: why do we want to do that, BTW?10:05
axwrogpeppe: can't upgrade to TLS because our certs have Key Usage set incorrectly10:06
axwrogpeppe: sorry, obtuse - there's a mail thread, one sec10:07
rogpeppeaxw: couldn't we just ignore the attribute if the cert doesn't support TLS?10:07
axwrogpeppe: can't tell from the client side10:08
axwrogpeppe: the client just has the CA cert10:08
rogpeppeaxw: ah, which only signs the root cert, not the actual server cert, right?10:09
axwrogpeppe: the CA cert signs the server cert, but the server cert is the one with the invalid Key Usage10:09
rogpeppeaxw: erm10:10
rogpeppeaxw: isn't the CA cert the thing that specifies the allowed key usage?10:10
axwrogpeppe: they all have key usages. my understanding is the CA can just choose not to sign a request with specific key usages10:11
axwrogpeppe: https://codereview.appspot.com/66930043/ fixes the problem10:12
axwfor new environments only of course10:12
rogpeppeinteresting. i hadn't seen ExtKeyUsage before.10:14
* rogpeppe looks at the docs10:15
fwereaderogpeppe, axw: hey, also, in the cert package, I note we have `SerialNumber: new(big.Int)` which looks like a violation of the spec10:44
fwereaderogpeppe, axw: I'm pretty sure SerialNumber is meant to be unique10:44
rogpeppefwereade: quite probable, although i'm not sure it's a security hole10:45
fwereaderogpeppe, I think it's quite likely to be a problem all the same, as soon as something that *does* care about the spec sees two different certs from the same CA with the same serial number10:46
fwereaderogpeppe, it's not so much a security hole as an indication that we're being funky and should not be trusted10:47
* fwereade wonders if there's a good complement to "security hole" that implies false positive rather than false negative10:48
fwereadeer, I guess you could read it either way10:48
rogpeppefwereade: we should probably just make it a 128-bit random number10:49
fwereaderogpeppe, +1 10:49
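
rogpeppe's suggestion of a 128-bit random serial number could look roughly like this. It is a sketch under assumptions, not the fix that landed: `newSerialNumber` is a hypothetical helper name, and the value is drawn uniformly from [0, 2^128), so a caller that needs a strictly positive serial would have to reject the (vanishingly unlikely) zero.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// newSerialNumber returns a cryptographically random serial number
// in the range [0, 2^128), in place of the constant new(big.Int).
func newSerialNumber() (*big.Int, error) {
	limit := new(big.Int).Lsh(big.NewInt(1), 128) // 2^128
	return rand.Int(rand.Reader, limit)
}

func main() {
	serial, err := newSerialNumber()
	if err != nil {
		panic(err)
	}
	fmt.Println(serial.BitLen() <= 128)
}
```

The result would be assigned to the `SerialNumber` field of the certificate template, so that two certificates from the same CA no longer share a serial.
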
jamrogpeppe: standup ?10:50
hazmataxw, fwiw.. the rsyslog priv separation thing has been fixed for 2 years.. debian and ubuntu have been carrying around old versions.. trusty gets things back to a current version, another workaround is to install the updated rsyslog pkg.11:21
hazmatwhich also gives tls on relp (reliable store forward).. probably more than juju cares about.11:22
hazmatrogpeppe, its pretty common re extended key usage.. clientauth, serverauth..11:23
* hazmat has been knee deep in x509 the last week11:23
* rogpeppe doesn't like x50912:11
hazmatrogpeppe, ping.. i've been hearing reports about issues with the state watcher stopped issue recurring12:14
hazmatrogpeppe, what's interesting is that it's not just the admin api client, but all clients' watches that go belly up12:14
hazmatrogpeppe, this is on 1.17.3.. cjohnston 's machine-0.log when it happens.. http://paste.ubuntu.com/6986643/12:14
hazmatcjohnston, this is on canonistack? which region?12:15
hazmatcjohnston, and what instance type is the state server?12:15
hazmatoh.. nm. ic lyc0112:16
cjohnstoncpu1-ram1-disk10-ephemeral20 | 1GB RAM | 1 VCPU | 10.0GB Disk12:16
rogpeppehazmat: interesting.12:33
hazmatrogpeppe, anything else he could get off that env before he shuts it down that would be helpful for debugging?12:33
rogpeppehazmat: does his environment recover from the issue?12:34
hazmatcjohnston, ^ can you run status or deployer again and it works?12:35
rogpeppehazmat, cjohnston: it looks like it might be a problem with mongo - the port it's getting a timeout error from is mongod's port12:36
rogpeppehmm, but that's somewhat odd12:37
cjohnstonhazmat: 'status' meaning 'juju status' ?12:37
rogpeppecjohnston: yes12:38
cjohnstonjuju status works12:38
rogpeppecjohnston: ok, so it seems like this is only a transient error, which is something12:39
cjohnstonhttp://paste.ubuntu.com/6986741/12:40
rogpeppecjohnston: thanks. that all looks healthy.12:40
rogpeppecjohnston: did you find out about the issue from a GUI error?12:40
hazmatrogpeppe, from a deployer error, he's been reproducing this consistently for a week12:41
cjohnstonhttp://paste.ubuntu.com/6986666/12:41
rogpeppecjohnston: you can reliably repro the issue?12:41
cjohnstonrogpeppe: yesterday it took 3 tries to reproduce12:42
rogpeppecjohnston: you're running the released version of 1.17.3, right?12:49
cjohnston1.17.3-0ubuntu1 12:49
cjohnstonrogpeppe: do you want anything else from this deployment before I tear it down?12:54
rogpeppecjohnston: i don't think so, thanks12:54
cjohnstonrogpeppe: someone else on my team just got the same error12:57
cjohnstonrogpeppe: would anything from psivaa's deployment help?12:57
rogpeppecjohnston: how have you been reproducing the error?12:57
cjohnstonrogpeppe: just deploying12:58
cjohnstonwe have a deployment script that deploys our entire setup12:58
hazmatcjohnston, i'd be curious if you beef up your state server instance size if its still reproducible13:01
axwfwereade: yeah I noticed the serial, but frankly I don't know what the right thing to do there is. since we only generate one server cert, it doesn't matter at the moment (until we do HA...)13:01
* fwereade starts counting down in a meaningful sort of way13:01
axwhazmat: thanks for the tip about rsyslog updates13:01
hazmataxw, np. thanks for turning around the ssl support there.. i'll still need to chat about what the incantations for a manual upgrade are13:02
cjohnstonpsivaa: what size is your bootstrap node13:02
psivaacjohnston: i didn't specify any constraints. so it's the default size. let me confirm13:02
axwhazmat: no worries, I'll send you an email with what needs to happen.13:02
hazmataxw, thanks13:03
psivaacjohnston:  hardware: arch=amd64 cpu-cores=1 mem=512M13:03
cjohnstoneven smaller13:03
axwwallyworld: how does passing data-dir to syslog config help? data-dir and log-dir are independent13:10
rogpeppehazmat: i wondered about that, but it seems odd that a localhost connection would be timing out13:13
rogpeppehazmat: even if the machine is heavily loaded13:13
rogpeppecjohnston: please file a bug (if you have not done so already) and attach, if possible, instructions for the most reliable way you can find for reproducing the issue13:19
cjohnstonrogpeppe: do you want a new bug I'm guessing and not re-open the already existing one?13:30
rogpeppecjohnston: yes - this seems like a different issue to me13:31
cjohnstonack13:32
=== jcsackett_ is now known as jcsackett
chris38hi is there any technical doc explaining how juju bootstrap is performed, i'm a bit lost digging in the code ?13:47
chris38I tried --debug flag, but things stay unclear13:48
=== deej` is now known as deej
mattywfwereade, ping?14:05
psivaacjohnston: rogpeppe: the issue occurs even with 2G memory of the bootstrap node. (arch=amd64 cpu-cores=1 mem=2048M root-disk=10240M)14:14
rogpeppepsivaa: if you could post a reliable way that we can reproduce the issue to the bug, that would help greatly, thanks (i haven't checked - maybe someone's done that already)14:15
psivaacjohnston: ^ would you add that also pls?14:16
=== psivaa is now known as psivaa-afk-bbl
rogpeppeniemeyer: ping14:26
niemeyerrogpeppe: Heya14:26
rogpeppeniemeyer: hiya14:26
rogpeppeniemeyer: just wondering if there's a chance you could have another look at https://codereview.appspot.com/5874049/ ?14:26
rogpeppeniemeyer: it un-breaks the currently broken time stamp behaviour of gocheck14:27
rogpeppeniemeyer: (currently gocheck always prints the same time stamp on every line because the time calculation overflows)14:27
rogpeppeniemeyer: and the output format is as we agreed some time agi14:27
rogpeppeago14:27
niemeyerrogpeppe: Thanks, will have a look14:28
=== psivaa-afk-bbl is now known as psivaa
cjohnstonrogpeppe: should the bug be against -core or -deployer ?15:55
rogpeppecjohnston: -core15:55
cjohnstonty15:55
cjohnstonrogpeppe: hazmat bug #1284183 16:07
_mup_Bug #1284183: jujuclient.EnvError: <Env Error - Details:  {   u'Error': u'watcher was stopped', u'RequestId': 9, u'Response': {   }} <juju-core:New> <https://launchpad.net/bugs/1284183>16:07
rogpeppecjohnston: thanks!16:08
hazmatrogpeppe, any ideas to cause..  re client level workarounds.. auto-reconnect and ignore?16:09
rogpeppehazmat: client level workaround is probably just to reconnect16:10
rogpeppehazmat: which is probably something that you should be prepared to do anyway, as it's always possible the ws connection may be terminated16:10
rogpeppehazmat: i haven't had time to investigate cause yet, i'm afraid16:11
rogpeppehazmat: i have a suspicion or two, but nothing worth mentioning16:11
hazmatrogpeppe, sure.. at a high level i'm trying to move deployer to be delta based,  so just rerun.. wrapping every api interaction with auto-retry seems less than ideal16:11
hazmati mean its not a normal occurrence imo.. and lacking transactions... auto retry of arbitrary partial seems questionable.16:12
hazmatrobbiew,  cjohnston, it might be useful to pickup some mongodb logs from an instance when this occurs16:13
robbiewhazmat: huh?16:13
rogpeppehazmat: it's not a normal occurrence, except that failure kinda is normal. you'll see this behaviour with HA if there's a server failure, for example16:13
robbiewoh...nevermind ;)16:14
hazmatrobbiew, sorry16:14
rogpepperobbiew: not the first time :-)16:14
rogpeppehazmat: i don't know of a smooth way of coping with server failure here. we should really be making all our ops idempotent.16:15
rogpeppehazmat: many of them are in fact, i think.16:15
hazmatrogpeppe, not really.. deploy twice.. error for transmission, error for duplicate service on second.. add-machine twice.. two new machines.16:16
hazmatrogpeppe, that second case is why gui needed a set number of units api.. add-unit/remove-unit are delta ops not target goal  ops16:17
rogpeppehazmat: add-machine is a good example of one that's not. but error for duplicate service still gives you an idempotent op16:17
rogpeppehazmat: it would be nice if the error message distinguished "duplicate service with all the same attributes" and "duplicate service with some different attributes"16:19
rogpeppehazmat: or perhaps even if add-service just succeeded if the service already exists with all the same attrs, but that's definitely more arguable.16:20
hazmat rogpeppe, so basically .. do the go thing.. and attach specific error handling and retry logic to every api op... that's fair although burdensome imo (my end goal as i stated is to just be able to do the right thing if people rerun, because i want them to know when the backend turns up errors) ... anyways given the frequency i'd be a bit more inclined to say we should fix the server to be a bit more reliable.16:25
rogpeppehazmat: i'm definitely not saying that this isn't a bug - it does sound like it, and it shouldn't be happening16:26
hazmatrogpeppe,  which still means getting to root cause there.... i'm hoping the mongo logs might turn up something.16:26
rogpeppehazmat: i'm just saying that in the general case, this kind of thing will be able to happen, and can't think of a decent way around it.16:26
rogpeppehazmat: definitely16:26
hazmatcjohnston, mongo is logging to syslog it looks like.. if you can reproduce, could you grab a copy to chinstrap16:28
cjohnstonack.. psivaa ^16:28
rogpeppehazmat: unfortunately i don't have much time to spend on it currently16:29
hazmatrogpeppe, no worries.. neither do i, but i'm hoping/looking to get a better understanding of root cause, and hopefully at least be able to instruct workarounds.16:31
hazmatactually doing the retry/reconnect for watchers is probably the safest bet atm16:31
rogpeppehazmat: yes16:33
rogpeppehazmat: FWIW my strongest suspicion is on the mgo package - that's the only place that's setting up tcp timeouts AFAIK.16:34
rogpeppepwd16:34
psivaacjohnston: hazmat: https://launchpadlibrarian.net/167465339/syslog16:35
hazmatrogpeppe, yeah.. possibly ^ Feb 24 13:48:41 juju-ci-oxide-machine-0 mongod.37017[5150]: Mon Feb 24 13:48:41.067 [conn2] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [127.0.0.1:49239]16:36
rogpeppehazmat: yeah - looks like it might not be keeping its deadlines up to date16:37
=== hatch_ is now known as hatch
dimiternquick and easy review anyone? https://codereview.appspot.com/67820048 << rogpeppe, mgz, natefinch, fwereade ?17:32
natefinchdimitern: looking17:32
dimiternnatefinch, ta!17:32
mgzdimitern: quick (unrelated) query17:37
mgzyou say add tests under api/client for the state compat branch17:38
dimiternmgz, yeah?17:38
mgzbut all I see in state/api/client_test.go is a ref back to the tests under state/apiserver/client/17:38
mgzunlike the other sections which have dedicated client tests inside their package17:39
dimiternis it now.. let me check17:39
mgzwe have bug 1217282 17:39
_mup_Bug #1217282: api.Client tests should be in api not state/apiserver/client/ <tech-debt> <juju-core:Triaged> <https://launchpad.net/bugs/1217282>17:39
dimiternhmm.. ok, so i must've been thinking of the agent api client tests17:40
mgzokay, I'll commit my current fixups for now17:40
dimiternthen forget what I said :)17:40
dimiternthe bug is one of these "let's fix the world when we have time"17:40
mgzdimitern: thanks. have pushed up the last changes the branch needs, addresses some of your points and fixes a silly error in the reintroduced legacy api17:44
natefinchdimitern: reviewed17:45
dimiternnatefinch, thanks17:46
dimiternmgz, i've lgtmed it17:46
mgzta!17:47
dimiternblast! i've just realized i'm the ocr today - on to the review queue17:47
natefinchrogpeppe: extracting out simpleworker as requested: https://codereview.appspot.com/67080043/17:48
rogpeppenatefinch: reviewed17:57
cjohnstonrogpeppe: fwiw, I downgraded to 1.16.6 and was able to do a full deployment. not sure if it is related to the downgrade or not18:14
rogpeppecjohnston: my suspicion is that it's something to do with changes to a juju-core dependency18:15
rogpeppecjohnston: that's a very useful data point, BTW, thanks.18:16
cjohnston:-)18:16
=== zz_mwhudson is now known as mwhudson
hazmatumm. does related-list return self in a peer relation?18:57
rogpeppehazmat: good question. i don't know i'm afraid.19:04
* rogpeppe is done for the day19:04
rogpeppeg'night all19:04
=== mwhudson is now known as zz_mwhudson
hazmatrogpeppe, g'night19:08
thumpernatefinch: hey20:05
natefinchthumper: howdy20:05
thumpernatefinch: was looking through Dimiter's branch that you just reviewed20:06
thumperand I saw a few things that concerned me20:06
thumperespecially around the arg passing20:06
thumperthe tests for the extra args aren't written how they will be actually executed20:06
thumpertesting "-r -o foo" isn't the same as "-r", "-o", "foo" which is how they are passed into init20:06
thumperI have a gut feeling that the code is broken20:06
thumperbut that could just be because I see the tests being wrong20:07
natefinchthumper: hmmm... you may be right.  I know that exec doesn't handle that case in the way one might think (where "-o -r" is not the same as "-o", "-r") but I figured we were doing some of our own parsing here. I might be wrong20:10
thumperI know that the gnuflag library barfs if it gets options it doesn't know about20:10
thumperso I was expecting to see either custom arg handling20:11
thumperor some errors20:11
thumperand saw neither20:11
natefinchthumper: yeah, it looks like we're just passing the args as-is, so multiple flags in the same string won't work20:14
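
thumper's point about the tests can be shown with a small sketch. It uses Go's standard `flag` package as a stand-in for gnuflag (an assumption: the two libraries differ in detail), and the flags `-r` and `-o` plus the `parse` helper are made up for the example. A single string holding several flags is treated as one unknown flag name, whereas the same flags passed as separate elements parse normally.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parse defines two hypothetical flags, -r (bool) and -o (string),
// and parses args the way a command's init step would receive them.
func parse(args []string) (r bool, o string, err error) {
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the output
	fs.BoolVar(&r, "r", false, "hypothetical boolean flag")
	fs.StringVar(&o, "o", "", "hypothetical string flag")
	err = fs.Parse(args)
	return r, o, err
}

func main() {
	// How the arguments actually arrive: one element per token.
	r, o, err := parse([]string{"-r", "-o", "foo"})
	fmt.Println(r, o, err == nil)

	// How the tests passed them: one string containing everything.
	_, _, err = parse([]string{"-r -o foo"})
	fmt.Println(err != nil)
}
```
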
=== zz_mwhudson is now known as mwhudson
Guest70848natefinch: why do i get a mongo not supported error message when i'm just trying to bootstrap?21:48
=== Guest70848 is now known as wallyworld
natefinchwallyworld: what does mongod --version return for you?21:49
wallyworldit is only 2.2.0, but i'm not running services locally to bootstrap an aws env21:50
wallyworldalthough it has worked for local provider till now :-)21:50
natefinchwallyworld: hmm... it should only check when you bootstrap local provider. And yeah, it's going to complain if you don't have at least 2.2.2 I think21:51
wallyworldnatefinch: ah, i'm an idiot21:51
wallyworldi forgot to switch to aws provider21:51
wallyworldignore my drivel, sorry21:51
natefinchwallyworld: haha ok21:51
wallyworldbut the version check is new?21:52
natefinchyep21:52
wallyworldcool :-)21:52
wallyworldtime to upgrade mongo21:52
wallyworldbut i've been updating from the repos21:52
wallyworldi'll have to see why i'm still on 2.2.021:53
natefinchwallyworld: yeah, that's odd21:53
wallyworldi'll figure it out. at least we check21:53
natefinchwallyworld: yeah, I added it, figuring it's better to fail early and visibly than to maybe fail further down the road in some random way when an old mongo doesn't work the way we expect.21:57
natefinchwallyworld: I should add in a check for TLS support, too, since we require that.21:57
wallyworldyep, indeed. a good thing for sure21:57
wallyworldthat would be good21:57
wallyworldi wonder when that mac address bug will be fixed21:57
wallyworldhopefully before trusty21:58
