[00:36] thumper: cherylj
[00:36] machine-0: 2016-02-09 00:36:41 INFO juju.apiserver apiserver.go:302 [9] user-admin@local API connection terminated after 7.391830549s, active connections: 5
[00:36] machine-0: 2016-02-09 00:36:41 INFO juju.apiserver apiserver.go:302 [A] user-admin@local API connection terminated after 2.046902769s, active connections: 4
[00:37] machine-0: 2016-02-09 00:36:41 INFO juju.apiserver apiserver.go:302 [C] user-admin@local API connection terminated after 91.680242ms, active connections: 3
[00:37] ^ this is what I can do
[00:41] davechen1y: active connections is new current count?
[00:43] yup
[00:43] cherylj: 2016-02-08 22:30:16 ERROR Command '('juju', '--debug', 'deploy', '-m', 'maas-1_9-deploy-trusty-amd64', 'local:trusty/dummy-sink')' returned non-zero exit status 1
[00:43] where can I find the source for this charm ? assuming the charm matters
[00:44] davechen1y: https://private-fileshare.canonical.com/~cherylj/dummy-charms/
[00:44] There's a tar file there, and some copied commands that the CI test uses
[00:44] ta
[00:45] is this environement a ha environment ?
[00:45] but I imagine the charm doesn't matter
[00:45] davechen1y: probably not, but you can double check by looking at the test run
[00:45] yeah, i deployed some of my usual favorites
[00:45] i'm bloody sick of the tools versino checker
[00:45] is that on the top of a list for someone to fix ?
[00:46] checking every 3 seconds is stupid
[00:46] 15 minutes would be sufficient
[00:46] davechen1y: no, there are other more serious failures that everyone's working on
[00:46] unfortunatley
[00:46] unfortunately, even. Because that one is darn annoying
[00:46] davechen1y: no, the env is not HA
[00:47] juju debug-log pauses if you run it after calling enable-ha
[00:47] brilliant
[00:48] so does juju ssh
[00:48] that's wonderful
[00:48] juju enable-ha
[00:48] puts your environment into catatonia until that process finishes
[00:49] doesn't surprise me. everything dies / pauses when you enable-ha
[00:54] cherylj: so what's happened is
[00:55] the addresses of the additional mongo servers have been added to the list of state servers
[00:55] but those additional mongos are up
[00:55] sorry, not up
[00:55] they are still going through the cloud=init dance
[00:55] so you have a 2/3 chance that the apiserver running on machine-0 will try to connect to those
[00:56] 2016-02-09 00:56:04 DEBUG juju.worker.peergrouper desired.go:116 machine "0" is already voting
[00:56] 2016-02-09 00:56:04 DEBUG juju.worker.peergrouper desired.go:123 machine "2" is not ready (has status: true)
[00:56] 2016-02-09 00:56:04 DEBUG juju.worker.peergrouper desired.go:123 machine "3" is not ready (has status: true)
[00:56] yet the api server still tries to dial it :)
[00:56] 2016-02-09 00:55:49 DEBUG juju.mongo open.go:117 connection failed, will retry: dial tcp 10.251.20.20:37017: getsockopt: connection refused
[00:56] 2016-02-09 00:55:49 DEBUG juju.mongo open.go:117 connection failed, will retry: dial tcp 10.241.59.50:37017: getsockopt: connection refused
[00:56] 2016-02-09 00:55:50 DEBUG juju.mongo open.go:117 connection failed, will retry: dial tcp 10.251.20.20:37017: getsockopt: connection refused
[00:56] 2016-02-09 00:55:50 DEBUG juju.mongo open.go:117 connection failed, will retry: dial tcp 10.241.59.50:37017: getsockopt: connection refused
[00:56] 2016-02-09 00:55:50 DEBUG juju.mongo open.go:117 connection failed, will retry: dial tcp 10.251.20.20:37017: getsockopt: connection refused
[00:56] 2016-02-09 00:55:50 DEBUG juju.mongo open.go:117 connection failed, will retry: dial tcp 10.241.59.50:37017: getsockopt: connection refused
[00:57] davechen1y: but the two bugs that you're looking at aren't using ha
[00:58] right
[00:58] i won't get distracted
[00:58] i was just using enable-ha to try to get the apiserver to explode
[00:58] and enter it's restart behaviour
[00:59] davechen1y: ah, I was confused
[01:00] i shouldn't go poking around in juju
[01:00] there be dragons
[01:01] ain't that the truth.
[01:03] and it's failed
[01:03] you'll love this
[01:03] Attempt 63 to download tools from https://10.251.11.185:17070/tools/2.0-alpha2.1-precise-amd64...
[01:03] + curl -sSfw tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s --noproxy * --insecure -o /var/lib/juju/tools/2.0-alpha2.1-precise-amd64/tools.tar.gz https://10.251.11.185:17070/tools/2.0-alpha2.1-precise-amd64
[01:03] curl: (7) couldn't connect to host
[01:03] tools from https://10.251.11.185:17070/tools/2.0-alpha2.1-precise-amd64 downloaded: HTTP 000; time 0.001s; size 0 bytes; speed 0.000 bytes/s + echo Download failed, retrying in 15s
[01:03] Download failed, retrying in 15s
[01:03] machine-2 is trying to bootstrap
[01:03] it needs tools from machine-0
[01:03] machine-0's agent is trying to connect to the replica set
[01:04] the replica set is down because it's trying to ensure ha
[01:04] so, no tools
[01:04] no bootstrap
[01:04] no ha
[01:04] no tools
[01:04] etc
[01:05] hmm
[01:05] this is even weirder
[01:06] this is another case of machine-0 not listening on port 17017
[01:06] 17070
[01:06] I need to look into this
[01:06] if that port is not open, no api server
[01:39] 2016-02-09 01:36:02 INFO juju.apiserver apiserver.go:302 [1] machine-0 API connection terminated after 3m35.591602712s, active connections: 0
[01:39] 2016-02-09 01:36:02 INFO juju.apiserver apiserver.go:325 closed listening socket "[::]:17070" with final error:
[01:39] gets more and more interesting
[01:40] the api server shuts down, but never starts back up again
[01:46] OH MY GODS
[01:46] the bit where we send data to bash via ssh
[01:46] after this line
[01:46] 2016-02-09 01:43:36 DEBUG juju.utils.ssh ssh.go:249 using OpenSSH ssh client
[01:46] we're sending ONE CHARACTER AT A TIME
[01:46] axw: ping
[01:47] davechen1y: pong
[01:47] say whaaat
[01:48] davechen1y: where's hte code that's doing that?
[01:48] https://bugs.launchpad.net/juju-core/+bug/1543388
[01:48] Bug #1543388: bootstrapping talks to the remote machine one character at a time
[01:48] i was watching the bootstrap
[01:48] and i'm like why is top using 48% cpu
[01:48] so I straced it
[01:48] read(0, "e", 1) = 1
[01:48] read(0, "H", 1) = 1
[01:48] read(0, "t", 1) = 1
[01:48] read(0, "D", 1) = 1
[01:48] read(0, "N", 1) = 1
[01:48] read(0, "y", 1) = 1
[01:48] read(0, "q", 1) = 1
[01:48] read(0, "5", 1) = 1
[01:48] read(0, "g", 1) = 1
[01:48] read(0, "H", 1) = 1
[01:48] read(0, "/", 1) = 1
[01:48] read(0, "S", 1) = 1
[01:48] read(0, "o", 1) = 1
[01:48] read(0, "5", 1) = 1
[01:48] read(0, "t", 1) = 1
[01:49] read(0, "6", 1) = 1
[01:49] read(0, "C", 1) = 1
[01:49] :/
[01:49] Bug #1543388 opened: bootstrapping talks to the remote machine one character at a time
[01:55] Bug #1543388 changed: bootstrapping talks to the remote machine one character at a time
[01:58] Bug #1543388 opened: bootstrapping talks to the remote machine one character at a time
[02:04] can tomb.Kill block ?
=== natefinch-afk is now known as natefinch
[02:06] davechen1y: it does acquire a lock, so, in theory, yes.
[02:06] davechen1y: but otherwise, it just closes the dying channel
[02:06] * thumper takes a deep breath and dives into importing units
[02:13] * natefinch looks up the time formatting date for the 1000th time
[02:16] natefinch: https://github.com/juju/juju/blob/master/apiserver/apiserver.go#L96
[02:16] so here's the thing
[02:16] the only way processCertChanges can exit is if someone called tomb.Kill
[02:16] and hte only thing that can is cl.Close
[02:16] so this code calls tomb.Kill twice, then tomb.Done for good measure ...
[02:16] seems like overkill
[02:16] wallyworld: can you take a look at my enable-ha change: http://reviews.vapour.ws/r/3782/
[02:17] sure
[02:17] thanks!
[02:17] cherylj: did axw ping you about possibly making a non blocking channel send?
[02:17] davechen1y: yep... also, it doubly bad, because someone might see cl.tomb.Kill(cl.processCertChanges()) and think they don't need those other lines, but that line won't ever kill the tomb
[02:18] wallyworld: no, not yet
[02:18] was maybe an alternative to increasing the channel buffer size
[02:19] but the buffer size increase might be acceptable for now until manifolds are done for those workers
[02:19] okay, can take a look later. I'm going to be afk for ~40 mins or so, but I'll be back
[02:19] ok
[02:20] cherylj: wallyworld: wasn't going to bother, since we need to change it again. non-blocking send wouldn't work here anyway, it's not just a notify chan
[02:20] ok, just wanted to double check, ty :-)
[02:20] channels with buffers that are not 0 or 1 are pretty suspect, in my experience
[02:21] unless it's directly next to the things populating the channel, I tend to agree
[02:24] this sounds like one of those cases where you could have an arbitrary number of sends on the channel before any reads from the channel, in which case any buffer size could be insufficient
[02:28] not arbitrary - is equal to the number of allowed controllers (7) plus a known number of local addresses
[02:28] rick_h___: noted, about the name=file for resources.
[02:28] it's a short term fix until workers migrated to dep engine
[02:29] wallyworld: ahh, the fact that it's a short term fix makes a big difference.
[02:29] for 2.0, sadly dep engine not going into 1.25
[02:30] so not sure what to do there
[02:30] wallyworld: well, we're not going to support 1.25 for very long, right? ;)
[02:30] 2 years :-(
[02:30] natefinch: wahhahahahahaha
[02:30] wallyworld: I know, I was joking
[02:30] but 1.25 is blocked right now, so need to get 1.25.4 out
[02:30] o/ gallows humor
[02:31] wallyworld: just make the channel buffer 128... larger powers of 2 are always better
[02:31] ok
[02:31] i'll add a comment
[02:32] wallyworld: definitely comment why the 10 is 10.
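Editor's sketch of the buffer-size argument above, in Go. It is not the Juju code: `trySend`, `sendAll`, and the value of `localAddrs` are hypothetical names and numbers chosen for illustration. It shows why a non-blocking send drops values once the buffer is full (so it only suits pure notification channels), and why a buffer sized to a known upper bound on sends avoids both blocking and dropping.

```go
package main

import "fmt"

// Illustrative bound: 7 allowed controllers (from the discussion) plus an
// assumed 3 local addresses. The real count of local addresses is not given.
const (
	maxControllers = 7
	localAddrs     = 3
)

// trySend performs a non-blocking send and reports whether the value was queued.
func trySend(ch chan<- string, v string) bool {
	select {
	case ch <- v:
		return true
	default:
		return false // buffer full and no receiver ready: the value is dropped
	}
}

// sendAll queues every value with no receiver running and returns how many
// values were dropped.
func sendAll(ch chan string, values []string) (dropped int) {
	for _, v := range values {
		if !trySend(ch, v) {
			dropped++
		}
	}
	return dropped
}

func main() {
	updates := []string{"addr-a", "addr-b"} // two sends before any receive

	small := make(chan string, 1)
	fmt.Println("buffer 1 dropped:", sendAll(small, updates))

	sized := make(chan string, maxControllers+localAddrs)
	fmt.Println("buffer 10 dropped:", sendAll(sized, updates))
}
```

With a plain blocking send, the buffer-of-1 case would block on the second value instead of dropping it, which is the failure mode when the receiver cannot run.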
[02:32] axw: 2016-02-09 02:27:09 DEBUG juju.utils.ssh ssh.go:249 using OpenSSH ssh client
[02:32] what happens after this line
[02:32] somethign in bootstrap
[02:33] but it's mute until the other side starts to output things
[02:33] davechen1y: umm. could be one of a few things, that debug message gets printed whenever an ssh client is created
[02:33] davechen1y: first we ssh to each of the possible controller addresses
[02:34] davechen1y: then (if you're uploading tools), copy tools across via ssh
[02:34] davechen1y: then run the cloud-config rendered as a bash script
[02:34] I think that's it
[02:35] davechen1y: AFAICR, we just open "ssh" with the script as a bytes.Buffer piped to the ssh process's stdin
[02:35] davechen1y: could be that ssh is in an interactive mode? looking for escape codes?
[02:35] 13:34 < axw> davechen1y: then (if you're uploading tools), copy tools across via ssh
[02:35] ^ it'll be this
[02:36] i'm spelunking in the code now
[02:36] short version is the openssh impl doens't buffer stdout/stdin
[02:36] or something
[02:38] davechen1y: actually, gross as it is, we just add the contents of the tools to the bash script (base64 encoded or something)
[02:38] so the 2nd and 3rd steps are just one
[02:43] rick_h___: you around?
[03:00] axw: that's fine
[03:00] i knew we bas64'd them
[03:00] the problem is something is unbuffered there and it's sending one character at at time over ssh
[03:00] which is going to turn each byte into about 400
[03:00] mabye 200
[03:00] but it's a lot
[03:00] and the cpu on both sides is non trivial
[03:05] davechen1y: yep. I had a look, nothing obvious. what were you stracing exactly? bash on the remote side? ssh on the client?
[03:07] remote side
[03:07] bash is hitting 50% cpu
[03:07] results are in that ticket
[03:09] davechen1y: ok, will take another look later
[03:20] umm, https://github.com/juju/juju/blob/master/api/apiclient.go#L554
[03:20] natefinch: axw https://github.com/juju/juju/blob/master/api/apiclient.go#L554
[03:21] this construction is unsafe
[03:23] davechen1y: you mean because two calls could race?
[03:24] davechen1y: if so yeah.. pretty sure it's always one thing's responsibility to close though.
[03:24] so in theory, but not in practice (unless we're doing something dumb, which I wouldn't rule out)
[03:26] this would have to be hit way harder than we could
[03:26] but it's entirely possible to hit this
[03:26] https://bugs.launchpad.net/juju-core/+bug/1543404
[03:26] Bug #1543404: unsafe double channel close idiom
[03:27] that code in apiclient doesn't look like it's intended to be threadsafe, so hopefully we're not trying to use it from multiple goroutines
[03:28] lol panics
[03:29] https://github.com/juju/juju/blob/master/api/apiclient.go#L435
[03:34] sooo, http://paste.ubuntu.com/14999670/
[03:34] no matter what logging I add, i cannot get line 71 to output something
[03:35] all I can think of is somehow tools are being cached
[03:35] and i'm not pushing up what I think i'm pushig up
[03:36] davechen1y: log.Infof, not logger?
[03:38] OH FOR FUCKS SAKE
[03:38] thanks
[03:38] * davechen1y wonders what log was in this scope ...
[03:39] heh, np. One of those things your brain just can't see if you were the one that wrote it.
[03:43] sooo, amazong just gave me a machien without a public ip
[03:43] has that ever happened to anyone ?
[03:44] it has no public ip or public dns
[03:46] natefinch, wallyworld, so should I do 10 or 128?
[03:46] :)
[03:46] Bug #1543404 opened: unsafe double channel close idiom
[03:46] 128 according to nate
[03:46] cherylj: I was joking
[03:46] ha, ok :)
[03:47] cherylj: I do fear that 10 will just error out less often... but *shrug* Seems better than 1 :/
[03:47] natefinch: in practice, I see it firing twice
[03:47] if that helps :)
[03:47] 10 has some science behnd it
[03:48] no, not science, it's http://i.imgur.com/24Jw4gM.gif
[03:48] lol, educate dguess then
[03:48] cherylj: 1 should be enough
[03:48] based on knowledge of the system
[03:48] hehe. I love that gif
[03:49] davechen1y: I've seen that it's not
[03:49] if you just want to hand off the value between producer and consumer without either blocking
[03:49] it was 1 before
[03:49] and that was the problem
[03:49] cherylj: shit
[03:49] that's more serious
[03:49] it was sending twice
[03:49] if it was already buffered
[03:49] yeah
[03:49] what about making the recieve side timeout
[03:50] I'm not sure I see how we would do that. The receiver is blocked waiting on a lock that the sender is holding
[03:50] davechen1y: it's a clusterfuck that is getting rewritten soonish
[03:50] davechen1y: thus, the bigger buffer is just a stopgap
[03:50] and a band aid for 1.25 :)
[03:50] whee 1.25 cluserfuck, keeps for 2 years even under adverse conditions
[03:53] but, we should explore the other option menn0 suggested for 1.25 - where there's some other synchronization between certupdater and apiserver
[04:06] natefinch: still around?
[04:07] wanting to verify that the assign units collection is a transitory collection
[04:07] meaning that once all units have been assigned to machines, the length of that collection should be zero
[04:12] thumper: cherylj http://reviews.vapour.ws/r/3784/
[04:13] thumper: that is correct.
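Editor's sketch relating to the double channel close report (bug #1543404) earlier in the log. The code at the linked apiclient.go lines is not reproduced here, so the "racy pattern" in the comment is an assumption about the general shape the bug title describes. The `closer` type and `closeConcurrently` helper are hypothetical; the sketch shows one common way to make a close safe under concurrent callers, using `sync.Once`.

```go
package main

import (
	"fmt"
	"sync"
)

// A check-then-close pattern of the following shape is racy when called from
// more than one goroutine, because two callers can both take the default
// branch and the second close panics:
//
//	select {
//	case <-ch:
//	default:
//		close(ch)
//	}

// closer guards the close with sync.Once so concurrent Close calls are safe.
type closer struct {
	once   sync.Once
	closed chan struct{}
}

func newCloser() *closer { return &closer{closed: make(chan struct{})} }

// Close may be called any number of times from any goroutine.
func (c *closer) Close() {
	c.once.Do(func() { close(c.closed) })
}

// isClosed reports whether Close has been called.
func (c *closer) isClosed() bool {
	select {
	case <-c.closed:
		return true
	default:
		return false
	}
}

// closeConcurrently calls Close from n goroutines and reports whether any of
// them panicked.
func closeConcurrently(c *closer, n int) (panicked bool) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() {
				if r := recover(); r != nil {
					mu.Lock()
					panicked = true
					mu.Unlock()
				}
			}()
			c.Close()
		}()
	}
	wg.Wait()
	return panicked
}

func main() {
	c := newCloser()
	fmt.Println("panicked:", closeConcurrently(c, 100))
	fmt.Println("closed:", c.isClosed())
}
```

If a single owner is always responsible for closing, as suggested in the discussion, the guard is unnecessary; it only matters when that ownership cannot be guaranteed.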
[04:13] axw: ta
[04:16] Bug #1543408 opened: WatchControllerStatusChanges needs unit tests
[04:18] thumper: yes it should be zero
[04:19] thumper: when we assign a unit to a machine we also remove that unit from the unit assignment collection
[04:19] saw that, just wanted to confirm
[04:19] ta
[04:39] wallyworld: :/ I've been working on credentials support for the clouds
[04:40] oh, damn
[04:40] sorry, i thought you were doing --config
[04:40] wallyworld: I did, then th other. never mind. I've done other clouds as well
[04:40] i started adding joyent support and noticed we needed to do a bit of work
[04:42] wallyworld: axw: PTAL http://reviews.vapour.ws/r/3787/
[04:43] axw: so mine just does maas and joyent. with maas, i made the maas-server come from the cloud endpoint attribute in clouds.yaml
[04:43] wallyworld: yeah, I did the same in my branch. reviewing now
[04:43] axw: i'm just resolving a conflict and pushing
[04:44] anastasiamac_: not sure why you would move controllerserver to jujuclient. it's not a client-side thing.
[04:49] axw: that's my fault, i misread a question and thought we were renaming controllerserver to just controller at the top lovel to hold server side controller stuff
[04:49] i don't like the name controllerserver
[04:50] nor do I
[04:50] environmentserver made some sense, controllerserver does not
[04:50] it was the best i could think of at the time :-/
[04:50] i don't really like environmentserver either
[04:51] wallyworld: not suggesting we go back, but there was some connection to the two words before
[04:51] axw: atm we have controller and controllerserver at the top level. we choose just one i think
[04:51] a controller's a controller, adding server to the end doesn't change anything
[04:52] wallyworld: I think go with "controller". the things that *were* in controller have been moved to jujuclient
[04:52] oh, ok, we'll fix that
[04:52] i like controller also
[04:53] not sure if it should stay a top level package, but ok for now
[05:00] wallyworld: reviewed
[05:00] ty, noticed your config one, looking at that
[05:00] wallyworld: I'll do azure, cloudsigma, and vsphere now
[05:00] ok, i promise i won't :-)
[05:01] :)
[05:01] axw: with the manta url - that's going away as soon as storage is gone
[05:01] wallyworld: yep, as per comment
[05:02] fair enough, i'll leave in comment till then, i was hoping storage would be gone this week or next
[05:43] wallyworld: replied to comment about Apply
[05:43] ok
[05:45] axw: fair point, i think, seems like the tests need updating which i'll look at. are you happy with the modified todo for the private key stuff?
[05:45] wallyworld: didn't read yet, one sec
[05:47] wallyworld: yep. you probably wouldn't want to enter it interactively, but we can read the file during interactive entry of the filename, and add the value
[05:47] wallyworld: then your credentials.yaml is protected from changes on disk.
[05:47] maybe not obvious though
[05:47] not sure, we can leave it for now
[05:47] covered the 99% case I think
[05:47] i think the key on disk will be pretty static
[05:47] yeah, all uses use file path afaik
[05:57] axw: that should be good to go now
[05:58] wallyworld: thanks, shipit
[05:58] tyvm
[05:59] wallyworld: just testing azure, should be ready to propose the rest very shortly
[05:59] awesome
[05:59] wallyworld: there's still an issue with azure, another case where we need to be able to specify multiple endpoints
[06:00] wallyworld: in azure there's separate endpoints for storage and everything else, and they're not necessarily derivable
[06:00] damn
[06:00] wallyworld: pretty sure we're going to have to extend the clouds.yaml format
[06:00] seems so
[06:01] do we *need* azure storage long term?
[06:01] wallyworld: yes. for volume support, and also some more basic operations like specifying where the VM image should live
[06:02] storage-endpoint then i guess
[06:02] axw: are the storage endpoints well know like the auth ones?
[06:02] can we add them to publoc cloud.yaml
[06:03] wallyworld: yes, for azure public cloud. for azure stack you'd specify your own
[06:03] axw: well seems like we should just update our public cloud yaml and cloud metadata struct them
[06:04] wallyworld: you mean with a new storage-endpoint field?
[06:04] yeah
[06:05] if it's not derivable
[06:05] wallyworld: I'm on the fence as to whether it should be specific to storage, rather than having a flexible map of :. storage-endpoint is probably fine though
[06:06] given it's optional, it keeps the default yaml nice and simple
[06:06] for other clouds that don't need it
[06:06] we can always get feedback and tweak
[06:06] wallyworld: ok, sounds fine
[06:07] can someone help me figure out why go test isn't actually testing anything in a particular directory? http://paste.ubuntu.com/15000293/
[06:07] wallyworld: added a card to Next
[06:07] no suite registered?
[06:08] the test ran in the merge bot, and failed, obviously
[06:08] but not when I do it locally
[06:08] sigh, hate that
[06:08] cherylj: have you changed anything? changed a test file from package to package_test perhaps?
[06:08] axw: no, I didn't change any test files
[06:09] wallyworld: axw: updated move \o/ PTAL?
[06:09] looks like even on master I see the same issue
[06:09] I change a test to fail, and it happily thinks there's nothing to test
[06:11] anastasiamac_: one sec
[06:11] * anastasiamac_ waiting :D
[06:12] ugh, the suite_test.go was for peergrouper_test, and nothing else was
[06:12] cherylj: m blind but where is package_test.go?
[06:12] wonder how long it's been like that
[06:12] in worker/peergrouper...
[06:12] anastasiamac_: guess they're using suite_test.go, rather than package_test
[06:13] could be the problem?..
[06:13] anastasiamac_: LGTM, thanks
[06:13] axw: \o/
[06:13] wow - 2 in one day!!
[06:15] wallyworld: http://reviews.vapour.ws/r/3790/ -- here's the rest
[06:15] ta, looking
[06:27] axw: reviewed, you may want to rebase first as my branch is almost landed and it will probably conflict in fallback public clouds yaml
[06:27] wallyworld: thanks, yep, will do
[06:27] bbiab, school pickup
[06:44] wallyworld, axw can one of you review the test changes I had to make? http://reviews.vapour.ws/r/3782/
[06:45] I'm going to add in the unit tests for the change and am tracking that work via bug 1543408
[06:45] Bug #1543408: WatchControllerStatusChanges needs unit tests
[06:45] cherylj: just the last diff?
[06:45] or all of it?
[06:45] axw: yes, the last diff
[06:45] I copied the mock watcher that was there and modified it to be a strings watcher
[06:46] had to pull suite_test.go into the peergrouper package. I'll see about making the tests all external when I write the tests for the functional change.
[06:46] it wasn't a no-op this time, so I skipped it
[06:49] cherylj: ignoring the error from ControllerInfo in state watcher seems a bit alarming. why not just error out there?
[06:49] (sorry, couldn't help myself and looked at the rest)
[06:50] axw: if we're getting an error there, we'll get that error elsewhere and things will get restarted
[06:50] that was my thinking anyway
[06:50] cherylj: so it's not harmful to error out there as well right?
[06:50] there's not really a way to return an error there, from what I saw.
[06:50] oh. /me looks again
[06:50] cherylj: have you moved to Australia, or do you just hate sleep? :-)
[06:50] I miss sleep
[06:51] we used to be buddies
[06:51] now he's all emaciated because I don't feed him
[06:51] That's no way to treat your buddies. :-)
[06:51] cherylj: right, there's not, sorry.
[06:51] cherylj: LGTM
[06:52] thanks, axw!
[06:53] and now, SLEEEEEP
[06:53] be nice to your buddies!
[06:54] cherylj: BTW, thanks for some of those recent bug fixes.
[06:55] cherylj: i have a strong suspcion that bootstrap is not making it to waitForInitalisatin
[06:55] I think it's bombing out _WAY_ earlier
[06:56] davechen1y: shh... cherylj is sleeping \o/
[07:15] the new bootstrap syntax has already infected my brain. switching back and forth between 1.25 and 2.0 is going to be fun
[07:19] anastasiamac_: looks like the controller.yaml file isn't quite right - it looks like it is storing model information as well as controllers
[07:19] it also doesn't have the local. prefix
[07:19] for the controller name
[07:27]