[00:00] and a degree of latency can be used to smooth out spikes in state change [00:00] based on knowledge of the model interactions [00:01] the issue is unless it's a work event, a state change event can be stale with network partitions or transient disconnects, if things end up pushing forward an invalid/old state into the application tier.. ie the current fetch is probably still needed for the app / charm model [00:02] on design choices and scalability i actually had to do some work recently to have juju support provisioning many machines with a single provider api call.. [00:02] you mean to add a bulk call to allow more than one machine to be provisioned with a single api call, not one per machine? [00:03] yes [00:03] i've been pushing for bulk api calls since the api was first mooted [00:03] wallyworld, i'm not entirely clear things have changed with the move to core.. take ec2 for example.. standard cloud best practice would be multi-zone .. not something we can actually do with core... or take our instance type hardcodes which are old (we promote more expensive less powerful instances than what's current best practice). [00:03] and overlapping, and we can't specify instance-type [00:04] wallyworld, for the bulk provisioning i ended up seeding cloud-init data with something that dialed back home, and got the actual machine specific provisioning script. [00:04] agreed. we could and should change all that. i know people want to [00:04] ah that would work [00:05] yeah.. works well, this work was all manual provider based though not in core per se. but the pattern works nicely [00:05] the whole hard coded instance type thing - that's just for ec2 because of what the lib provided. with openstack, we don't have that problem [00:06] we do need to fix ec2 but have had too many other things taking a higher priority [00:06] wallyworld, we should ideally host that on cloud-images, not internal to the src. [00:06] oh yes, you think? [00:06] :-) [00:06] :-) [00:06] so there's a lot of us that want to get this stuff sorted out badly [00:07] i'm still hopeful we can do it [01:15] ubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ rm -rf ~/.juju [01:15] ubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ ~/bin/juju init [01:15] A boilerplate environment configuration file has been written to /home/ubuntu/.juju/environments.yaml. [01:15] Edit the file to configure your juju environment and run bootstrap. [01:15] ubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ ~/bin/juju status [01:15] ERROR Unable to connect to environment "". [01:15] Please check your credentials or use 'juju bootstrap' to create a new environment. [01:15] Error details: [01:15] control-bucket: expected string, got nothing [01:15] this sounds wrong [01:15] juju init defaults to amazon [01:15] why does it say the environment is "" [01:25] nfi [01:25] but definitely a bug [01:29] this is a fresh install [01:29] i [01:29] i'll poke around some more [01:29] i smell JUJU_HOME in there somewhere [01:29] * davecheney throws a chair [01:30] no, JUJU_ENV [01:30] ubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ export JUJU_ENV="amazon" [01:30] ubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ ~/bin/juju status [01:30] ERROR Unable to connect to environment "amazon".
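To make the bulk provisioning exchange above ([00:02]-[00:05]) concrete, here is a minimal Go sketch of the "one provider call for many machines" shape being argued for. Every name in it (BulkProvisioner, MachineParams, a StartInstances that takes a slice) is hypothetical and only illustrates the idea; it is not juju-core's actual provider interface.

```go
// Hypothetical sketch only: a provider call that provisions a whole
// batch of machines in one API round trip, instead of one call per
// machine. None of these names come from juju-core.
package bulk

import "fmt"

// MachineParams describes one machine to start.
type MachineParams struct {
	Series       string
	InstanceType string
}

// BulkProvisioner is the assumed provider-side interface.
type BulkProvisioner interface {
	// StartInstances provisions every requested machine with a single
	// provider API call and returns their instance ids.
	StartInstances(params []MachineParams) ([]string, error)
}

// ProvisionMany builds n identical requests and issues them together.
func ProvisionMany(p BulkProvisioner, n int) ([]string, error) {
	reqs := make([]MachineParams, n)
	for i := range reqs {
		reqs[i] = MachineParams{Series: "precise", InstanceType: "m1.small"}
	}
	ids, err := p.StartInstances(reqs) // one call, n machines
	if err != nil {
		return nil, fmt.Errorf("bulk provisioning failed: %v", err)
	}
	return ids, nil
}
```

hazmat's cloud-init trick (seed generic user data that dials back for the machine-specific script) is the complementary piece that makes a single batched call practical, since the per-machine differences move out of the provisioning request itself.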
[01:30] bingo [01:31] thumper: ~/.juju/current-environment [01:31] where did that come from [01:31] that is being consulted always [01:31] switch [01:31] but I never used switch [01:32] then it probably shouldn't be there [01:32] but if it's not there [01:32] the env comes up as "" [01:32] that's a bug [01:32] it should fall back to the default environment [01:33] * thumper thinks [01:33] ah... [01:33] yes [01:33] what happens [01:33] is that if the string is empty [01:33] which you would have got if you didn't specify -e or have an env var [01:33] it is treated "specially" [01:33] hmm, [01:33] we should remove the special [01:33] and have a sane fallback [01:33] always been like that [01:33] the correct behavior is to fall back to the order in the yaml file [01:33] now we are just outputting it [01:33] no actually [01:34] fall back to the default if specified [01:34] if no default specified [01:34] there is a default [01:34] then use the one if only one specified [01:34] this is from juju init [01:34] otherwise error [01:34] all this bullshit is too complicated [01:34] but it is using the default specified [01:34] juju switch was a bad idea [01:34] but it is getting it by the special case of "" [01:34] no [01:34] this isn't about switch [01:34] this is how it worked before [01:34] I just added an extra thing in [01:34] we weren't outputting it before [01:34] ok [01:35] whoever added the outputting should have changed the behaviour [01:35] but didn't [01:35] lemmie see if I can put this in an issue [01:35] having "" special cased is dumb [01:36] ok [01:36] thanks [01:36] i'll log a bug after lunch [02:45] thumper, I have a branch that I can release as 1.17.1. It is the last r2248 + mgz's openstack and mgo fixes [02:45] thumper, It doesn't pass on canonistack though. [02:46] I cannot get stable juju to work on canonistack today, so maybe I should release this mashup as 1.17.1 anyway [02:46] sinzui: what's the issue on canonistack? [02:46] that has been flakey on occasion [02:48] After a successful bootstrap, the client can never talk to the env. That is true for 1.16.5 and 1.17.1 from inside canonistack, outside with public ips and outside with sshuttle vpns [02:48] hmmm [02:49] wallyworld, The cloud health tests show that canonistack has been barely usable for several days [02:49] sinzui: did the sshuttle connect? [02:49] yes it did [02:49] if everything else works, probably good to release then. i assume hp cloud works [02:49] but status and deploy always timeout [02:50] Hp is indeed very healthy with my branch [02:51] rick_h_: sinzui: sounds like a firewall issue, they might be blocking everything but port 22 even for the public IPs or something [02:51] or, public IPs aren't routable, but sshuttle is using the chinstrap bounce to connect [02:51] jam, from inside canonistack? [02:51] sinzui: so when you have sshuttle connected, do you *also* have a public IP assigned? [02:52] sinzui: 2248 is an older revision from last week? [02:52] because we won't end up using shuttle if we think we have an address outside the 10.* space [02:52] I'm just theorizing, though. [02:52] jam the health check is an hourly deployment using stable on each cloud. canonistack is ill http://162.213.35.54:8080/job/test-cloud-canonistack/ ...
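For the empty-environment-name bug thumper and davecheney dig into above ([01:31]-[01:36]), here is a hedged sketch of the fallback order they argue for, instead of special-casing "". The function and its arguments are made up for illustration; this is not the juju-core implementation.

```go
// Sketch of the environment-name fallback discussed above; all names
// here are illustrative, not juju-core code.
package envname

import (
	"fmt"
	"os"
)

// Resolve picks an environment name in this order: the explicit -e
// flag, the JUJU_ENV variable, ~/.juju/current-environment (written by
// "juju switch"), the default: entry in environments.yaml, and finally
// the only defined environment if there is exactly one.
func Resolve(flagEnv, currentEnv, defaultEnv string, defined []string) (string, error) {
	if flagEnv != "" {
		return flagEnv, nil
	}
	if env := os.Getenv("JUJU_ENV"); env != "" {
		return env, nil
	}
	if currentEnv != "" {
		return currentEnv, nil
	}
	if defaultEnv != "" {
		return defaultEnv, nil
	}
	if len(defined) == 1 {
		return defined[0], nil
	}
	return "", fmt.Errorf("no environment specified and no default set (%d defined)", len(defined))
}
```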
[02:53] latest health check works, quick release :-) [02:54] ...but since we are not seeing resources deleted properly from trunk tests, we know that some of the failures can be caused by the cruft left behind [02:54] sinzui: i see 2248 was before local provider sudo changes [02:55] I have been manually deleting instances, security groups, and networks all day to give CI a chance to pass [02:55] it would be good if 2249 could be the rev we use [02:55] It is [02:55] unless that is broken [02:55] wallyworld, FUCK NO [02:55] 2249 eliminates polling for status [02:55] wallyworld, 2249+ does not pass [02:55] oh [02:55] * wallyworld is sad [02:56] i'll have to take a look then [02:57] i tested on ec2 [02:57] wallyworld, I know. I really wanted to release tip. local is not healthy in trunk since CI was tainted by the destroy-environment problems, I ask other people to test. Local doesn't just work, for anyone who has ever used local before [02:58] my change in 2249 was related to juju status only [02:58] i'll check that it is ok though [02:58] wallyworld, I have been testing for 3 days. People want a release... and they want their favourite features in it [02:59] oh, sorry, didn't intend to push for 2249 to be included. i was just concerned i may have broken something [02:59] I am burned out trying to make a release when CI gave an answer last week [03:00] wallyworld, I want to release every week, which will help get the good work out into the wild [03:00] yeah. but we need to stop breaking stuff [03:02] wallyworld, Juju trunk was very good last week. On the last day a lot of branches landed, just broken tests everywhere [03:02] i think they were mainly the local provider changes [03:03] if i understand correctly, it seems like the newer code doesn't like a previous dirty disk [03:03] wallyworld, Yeah, but since it cannot destroy itself (possibly a bug in the very version I propose releasing) the disk will always be dirty [03:04] so the clean up this week will need to be able to deal with that [03:04] i think the issue may be understood, so if 2248 is released and the final polish applied this week to trunk, 1.17.2 at the end of the week hopefully :-) [03:04] or next week [03:06] wallyworld, this bug causes subsequent runs of tests to fail. https://bugs.launchpad.net/juju-core/+bug/1272558 [03:06] <_mup_> Bug #1272558: destroy-environment shutdown machines instead [03:06] * wallyworld looks [03:07] We keep hitting resource limits, or instance already exists failures from machines that were shutdown instead of destroyed. [03:07] oh, and it didn't help that our trusty amis expired. [03:09] sinzui: that is an interesting bug. i had a quick look at the Openstack provider, and StopInstances does seem to call "delete server". so more investigation required [03:09] ah, juju-test-cloud-canonistack-machine-0 got shutdown instead of destroyed. I expect the next health check to fail [03:10] wallyworld, the bug also affects aws and azure though.
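Bug #1272558 above is about machines ending up SHUTDOWN rather than destroyed. As a hedged illustration of the behaviour the bug expects (not the actual provider code), a teardown helper should delete each instance and then confirm the provider no longer reports it, so quota-eating cruft cannot accumulate; the Client interface and both of its method names are hypothetical.

```go
// Hedged sketch, not juju-core code: terminate instances and verify
// they are really gone, instead of leaving them powered off.
package teardown

import (
	"fmt"
	"time"
)

// Client is a hypothetical provider client.
type Client interface {
	DeleteServer(id string) error         // terminate, not shut down
	ServerExists(id string) (bool, error)
}

// DestroyInstances deletes every instance, then polls until each one
// disappears or the deadline passes.
func DestroyInstances(c Client, ids []string, timeout time.Duration) error {
	for _, id := range ids {
		if err := c.DeleteServer(id); err != nil {
			return fmt.Errorf("deleting %q: %v", id, err)
		}
	}
	deadline := time.Now().Add(timeout)
	for _, id := range ids {
		for {
			exists, err := c.ServerExists(id)
			if err != nil {
				return err
			}
			if !exists {
				break
			}
			if time.Now().After(deadline) {
				return fmt.Errorf("instance %q still present after %s", id, timeout)
			}
			time.Sleep(5 * time.Second)
		}
	}
	return nil
}
```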
[03:10] very odd [03:10] yeah, i just thought i'd see what openstack did out of interest [03:10] Azure can take hours to delete a network when I do it from the console too [03:10] azure seems to take a looooong time to do *anything* [03:11] sinzui: so i assume that for example you could do a "nova list" and the old machines would be shown still and have a status "shutdown" [03:12] The durations shown here are consistent with previous weeks: http://162.213.35.54:8080/ [03:13] we do run azure tests in parallel to keep all CI tests to about 30 minutes, but that also puts us at risk of exceeding our 20 cpu limit [03:13] wallyworld, exactly what I see [03:13] wallyworld, Hp is the only cloud/substrate not affected. [03:14] hmmm. i'm not intimately familiar with destroy-env code. i wonder if maybe something changed recently [03:14] we have some investigating to do [03:16] sinzui: i'll make sure the issue is known at the next core standup and we'll ensure someone is assigned to fix as a matter of priority [03:17] wallyworld, That is appreciated. [03:17] we really need to get some more closed loop feedback from CI -> devs [03:18] cause i reckon not many devs even know the address of the CI dashboard [03:18] and/or pay attention to the status [03:19] so you poor folks cop stuff we break without timely action to fix it [03:19] and then have to push shit uphill to get a release out [03:20] i'll offer my opinion and hopefully it will be shared and we can implement some workflow to improve the situation [03:21] sinzui: i'll make sure you get feedback to let you know the outcome of the above [03:22] wallyworld, Jenkins has a bad UI. We are creating a report site that explains what was tested. http://162.213.35.69:6543/ [03:22] wallyworld, You do need to log in to see the report of the revision I created [03:22] so i do [03:23] WTF is going on! [03:23] damnit [03:23] this was working before [03:23] sinzui: E403 after logging in [03:23] The overall PASS status is there because I manually tested local, then hacked the PASS status. Canonistack should have damned the whole rev though [03:24] wallyworld, did you check all the boxes? [03:24] thumper: do you have any knowledge of the destroy-env issue sinzui mentions in the scrollback? [03:24] sinzui: ah, no :-) [03:24] i thought they were for information [03:24] oh, arosales reported the same thing. control-reload to force the pages I think [03:24] wallyworld: where is the scrollback, it is long [03:25] thumper: https://bugs.launchpad.net/juju-core/+bug/1272558 [03:25] <_mup_> Bug #1272558: destroy-environment shutdown machines instead [03:26] sinzui: ah it works now [03:26] sinzui: try with --force [03:27] thumper: i'd need to read the code. i wonder why not having --force shuts down instances instead of deleting them [03:27] thumper, destroy-environment didn't return an error when it failed. It didn't tell us it needed to use force [03:27] hmm... [03:27] this was axw's area, not sure [03:27] I think it had to do with moving to the api, but can't confirm [03:28] wallyworld: destroy-environment tries to be nice first [03:28] I think [03:28] thumper, I suspect the issue is older than last week, but something made it more visible. [03:28] sinzui: is it just the bootstrap machine left behind?
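thumper's remark above that "destroy-environment tries to be nice first" and sinzui's complaint that it never said --force was needed suggest a pattern like the sketch below. Everything here (the Environ interface and both method names) is a hypothetical stand-in, not the real destroy-environment code path.

```go
// Hedged sketch of "be nice first, then force"; hypothetical names.
package destroy

import "fmt"

type Environ interface {
	// DestroyViaAPI asks the running environment to clean itself up.
	DestroyViaAPI() error
	// DestroyViaProvider tears everything down with raw provider
	// calls, which is roughly what --force amounts to.
	DestroyViaProvider() error
}

func Destroy(e Environ, force bool) error {
	if force {
		return e.DestroyViaProvider()
	}
	if err := e.DestroyViaAPI(); err != nil {
		// The CI failures above happened partly because this error was
		// swallowed; at minimum it should be surfaced so the operator
		// knows --force is needed.
		return fmt.Errorf("clean destroy failed (consider --force): %v", err)
	}
	return nil
}
```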
[03:28] well, no, I have never seen a machine SHUTDOWN before last week [03:29] i was thinking perhaps that if the logic was moved behind the api, the code that does the destroy is running on the bootstrap machine itself and there may be an issue destroying it [03:29] wallyworld, most of the time, but we have seen the service machines shutdown too. [03:30] and that the other nodes could still be destroyed [03:30] ok, was just a guess :-) [03:31] something seems all fucked up [03:32] local provider in trunk is giving weird arse lxc errors [03:32] machine-0: 2014-01-28 03:31:00 ERROR juju.container.lxc lxc.go:102 lxc container creation failed: error executing "lxc-create": + '[' amd64 == i686 ']'; + '[' amd64 '!=' i386 -a amd64 '!=' amd64 -a amd64 '!=' armhf -a amd64 '!=' armel ']'; + '[' amd64 '!=' i386 -a amd64 '!=' amd64 -a amd64 '!=' armhf -a amd64 '!=' armel ']'; + '[' amd64 = amd64 -a amd64 '!=' amd64 -a amd64 '!=' i386 ']'; + '[' amd64 = i386 -a amd64 '!=' i386 ']'; + '[' amd64 [03:32] = armhf -o amd64 = armel ']'; + '[' released '!=' daily -a released '!=' released ']'; + '[' -z /var/lib/lxc/tim-testlocal-machine-1 ']'; ++ id -u; + '[' 0 '!=' 0 ']'; + config=/var/lib/lxc/tim-testlocal-machine-1/config; + '[' -z /usr/lib/x86_64-linux-gnu/lxc ']'; + type ubuntu-cloudimg-query; ubuntu-cloudimg-query is /usr/bin/ubuntu-cloudimg-query; + type wget; wget is /usr/bin/wget; + cache=/var/cache/lxc/cloud-precise; + mkdir -p [03:32] /var/cache/lxc/cloud-precise; + '[' -n '' ']'; ++ ubuntu-cloudimg-query precise released amd64 --format '%{url}\n'; failed to get https://cloud-images.ubuntu.com/query/precise/server/released-dl.current.txt; + url1=; container creation template for tim-testlocal-machine-1 failed; Error creating container tim-testlocal-machine-1 [03:32] WTH [03:33] * thumper moves from current work back to trunk [03:33] still on precise? [03:36] saucy [03:36] wallyworld: having issues with trusty? [03:36] thumper: i tried local on trusty last thing friday and it didn't work [03:36] but didn't look into it [03:37] andrew and i had a quick look [03:37] nothing jumped out as being wrong but we didn't deep dive [03:37] hang on... [03:37] I moved back to trunk and now it is working [03:37] ... [03:37] trusty as host, precise containers [03:38] I didn't touch this area though [03:38] so pretty confused right now [03:38] isn't that always the way [03:44] ah fark [03:44] I think I know what it is [03:44] I have a fake https-proxy set [03:44] to "rubbish" [03:45] and I bet lxc is trying to download the latest server using the proxy [03:45] ha ha ha [03:45] that is kinda funny [03:45] in a terrible way [03:45] the proxy stuff works :-) [03:46] hehe, that's it [03:46] I guess the proxy works [03:47] * thumper proposes [03:47] actually [03:47] I may break this up as I broke something before [03:54] wallyworld: https://codereview.appspot.com/57590043/ simple fix for my fubar [03:54] * wallyworld looks [03:58] thumper: environs/config/config.go - are those new methods related to this mp? am i missing something? [03:58] wallyworld: ah, they may be used in the next [03:59] but I thought they were for that one [03:59] sorry [03:59] I'm just proposing the next [03:59] np, thought i was being dumb [04:05] wallyworld: and if you feel like it: Rietveld: https://codereview.appspot.com/57600043 [04:05] looking [04:13] thumper: so we could later on move existing clients to use the new common fasçard?
[04:13] could [04:13] and façade [04:14] bah, can't spell [04:18] thumper: not sure if you agree, i find !a || !b easier to read than !(a && b), especially if the latter is split over two lines, where by !a i mean a != foo etc [04:34] wallyworld: I don't care that much [04:34] yeah, me either, was just a thought [04:35] took me a couple of scans to grok it [04:35] cause of the line break and ( [04:36] * thumper nods [04:36] happy to change if you think it'll make a difference [04:45] thumper: i didn't lgtm because we're missing a test for jujud [04:45] * thumper sighs [04:45] sorry [04:45] and how do you suggest we test it? [04:45] yeah [04:46] there are some existing examples in machine_test [04:46] basically start a jujud and check for an expected result [04:46] similar code to the worker test itself [04:46] hmm... [04:46] ok [04:46] but a cut down version [04:46] just to test the wiring up of it all [04:47] not a blocker, but it would be good not to have the "first" bool required [04:47] or maybe it is essential here, not sure. but other workers don't seem to need it, but i could be misremembering [04:48] it just complicates things [04:48] oh balls, gotta run - Belinda's car door is stuck and i have to drive to help her out [04:48] i'll check back a bit later [06:12] oh bollocks [06:12] ... obtained []charm.CharmRevision = []charm.CharmRevision{charm.CharmRevision{Revision:23, Sha256:"6645c56965290fc0097ea9962a926e04b8c5b1483f2871dce9e33e9613e36dbd", Err:error(nil)}, charm.CharmRevision{Revision:23, Sha256:"6645c56965290fc0097ea9962a926e04b8c5b1483f2871dce9e33e9613e36dbd", Err:error(nil)}, charm.CharmRevision{Revision:23, Sha256:"6645c56965290fc0097ea9962a926e04b8c5b1483f2871dce9e33e9613e36dbd", Err:error(nil)}} [06:12] ... expected []charm.CharmRevision = []charm.CharmRevision{charm.CharmRevision{Revision:23, Sha256:"2c9f01a53a73c221d5360207e7bb2f887ff83c32b04e58aca76c4d99fd071ec7", Err:error(nil)}, charm.CharmRevision{Revision:23, Sha256:"2c9f01a53a73c221d5360207e7bb2f887ff83c32b04e58aca76c4d99fd071ec7", Err:error(nil)}, charm.CharmRevision{Revision:23, Sha256:"2c9f01a53a73c221d5360207e7bb2f887ff83c32b04e58aca76c4d99fd071ec7", Err:error(nil)}} [06:13] #gccgo [08:05] fwereade, hey, i've updated https://codereview.appspot.com/53210044/, can you take a look whether it's good to land? [08:06] dimitern, sure, thanks [08:19] dimitern, reviewed, bbs === _mup__ is now known as _mup_ [08:22] fwereade, ta [09:38] lucky(~) % juju destroy-environment ap-southeast-2 -y [09:38] ERROR state/api: websocket.Dial wss://ec2-54-206-142-42.ap-southeast-2.compute.amazonaws.com:17070/: dial tcp 54.206.142.42:17070: connection refused [09:38] but it did destroy the environment [10:07] davecheney: I believe it tries to contact the environment to check if there are any manually registered machines before nuking it all from the client side, but even if it fails, it still nukes from client side [10:09] but why did it fail ? [10:09] was the bootstrap machine already nuked ? [10:46] dimitern: standup ? [10:58] fwereade, updated https://codereview.appspot.com/53210044/ [11:26] rogpeppe: want to talk now?
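wallyworld's readability point at [04:18] is just De Morgan's law; a tiny illustration with made-up variable names:

```go
// The two forms are logically equivalent; the expanded one keeps each
// comparison on its own term, which survives line-wrapping better.
package readability

func changedNegated(gotAddr, wantAddr string, gotPort, wantPort int) bool {
	return !(gotAddr == wantAddr && gotPort == wantPort)
}

func changedExpanded(gotAddr, wantAddr string, gotPort, wantPort int) bool {
	return gotAddr != wantAddr ||
		gotPort != wantPort
}
```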
[11:26] natefinch: good plan, yes [11:26] natefinch: i just went back into the hangout [11:26] rogpeppe: cool brt [11:26] rogpeppe, btw, re SOAP, I remember enjoying http://wanderingbarque.com/nonintersecting/2006/11/15/the-s-stands-for-simple/ [11:28] fwereade: https://codereview.appspot.com/56020043 you had asked for a deprecation warning (when supplying -e to destroy-environment), care to check the spelling and see if you like how I worded it? [11:30] dimitern, re {placeholder:false}, might a {$ne {placeholder:true}} work as a general replacement? [11:30] jam, ack [11:31] jam, LGTM [11:41] fwereade, and pendingupload as well [11:45] fwereade, i'll try it on my local mongo and if it works I could change it [11:46] fwereade: thanks [12:00] dimitern, reviewed [12:03] fwereade, tyvm [12:03] oh *fuck* these jujud tests [12:04] and also that provisioner one [12:06] * fwereade will sort out the provisioner one but needs a volunteer for the jujud one [12:06] fwereade: delete them. Tests are for inferior programmers anyway. [12:06] natefinch, well volunteered! [12:06] fwereade: which jujud one? [12:07] natefinch, there's a class of machine agent test failure [12:07] natefinch, happens in a few different ways now I think [12:07] natefinch, we try to test that a machine agent works by testing the side-effects of particular jobs [12:07] natefinch, but the MA isn't set up quite right, and so the api job barfs, and kills all the others [12:08] fwereade: sounds like fun [12:08] natefinch, and it's a matter of luck whether they managed to express their side effect in time [12:08] natefinch, btw, you don't actually have to volunteer, HA is at least as important [12:08] natefinch: http://play.golang.org/p/WwGvP5RUbM [12:09] natefinch: it could probably go in its own package somewhere [12:09] natefinch: perhaps in utils [12:10] fwereade: if I was less behind in the HA stuff I'd be happy to volunteer.... but if someone else can do it, that would probably be better for the schedule [12:11] natefinch: does that make sense as a primitive? [12:12] * rogpeppe needs to lunch [12:13] rogpeppe: looks good though I'm not sure I'm entirely happy about genericizing it with the interface{} ... [12:21] TheMue: morning, I wanted to check on the status of the debug-log work. I know rogpeppe brought up some ideas for potentially more efficient ways to handle communication. [12:21] TheMue: anything I can/should note on our tracking card for this during our standup? [12:23] rick_h_: yep, roger and william agreed on this new approach and I'm now finishing the outline (there are some todos left) [12:23] TheMue: cool, thanks for the update. [12:23] rick_h_: most looks good so far because this new approach avoids many of the problems of the old one [12:24] TheMue: always a good thing, glad to hear it [12:30] fwereade: did you get a chance to talk with Tim about planning for capetown? [12:49] jam, he got a message from me about it, but only while I was briefly midnightly awake and he was at the gym [12:49] fwereade: so it's all sorted out, then :) [12:51] natefinch: i know what you mean, but it's general enough that it might be useful for other things. it's a bit more mechanism than i'd like to see in cmd/jujud directly, and as a separate package i wouldn't really want it to depend on the agent package
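fwereade's suggestion above ([11:30], [11:41]) of querying with $ne instead of requiring placeholder:false would look roughly like this with the mgo driver. The collection and the exact field names are taken from the conversation itself, not checked against the juju state schema, so treat this purely as an illustration of the query shape.

```go
// Illustrative only: select documents that are neither placeholders
// nor pending upload, whether or not those fields are set at all.
package charmquery

import (
	"labix.org/v2/mgo"
	"labix.org/v2/mgo/bson"
)

func notPlaceholder(charms *mgo.Collection) *mgo.Query {
	return charms.Find(bson.D{
		{"placeholder", bson.D{{"$ne", true}}},
		{"pendingupload", bson.D{{"$ne", true}}},
	})
}
```

The advantage of $ne over an explicit placeholder:false match is that documents missing the field entirely are still selected, which is what makes it usable as a general replacement.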
[12:52] jam, mu -- I have pushed mramm to figure out what else we may need, but at least thumper is aware of where he needs to add things that he thinks of [12:52] natefinch: you could always write a thin type-safe wrapper on top if the interface{} thing gets you down [12:53] rogpeppe, can you remember what emitter of NotProvisionedError justifies returning true from isFatal in the machine agent? [12:54] rogpeppe, if *that particular* MA is not provisioned, sure, that's a problem [12:54] * rogpeppe looks to see what emits NotProvisionedError [12:54] rogpeppe, and the fact that other workers are emitting it is I think *also* a problem [12:55] rogpeppe, but they should surely be restricting their problems to themselves [12:57] fwereade: yeah, that may well be problematic. I think we should probably use a more specific error, probably defined in cmd/jujud, and transform from NotProvisionedError into that at the appropriate place only [12:57] fwereade: and take NotProvisionedError out of the list of fatal errors [12:58] fwereade: i *think* that the specific case that we were thinking about there is the one returned by apiRootForEntity [13:00] rogpeppe, excellent, that matches my rough analysis [13:00] rogpeppe, thanks [13:00] fwereade: cool [13:20] would appreciate a quick look at https://codereview.appspot.com/57740043 -- it's the flaky provisioner tests [13:21] the jujud ones demand a bit more obsessive/paranoid care [13:30] * fwereade succumbs to rage against the machine agent, goes for walk [13:35] the battle of los agents [13:52] anyone here know about lxc? [13:52] we're seeing this error from lxc-start on a machine: [13:52] 2014-01-28 13:35:27 ERROR juju.provisioner provisioner_task.go:399 cannot start instance for machine "3/lxc/5": error executing "lxc-start": command get_init_pid failed to receive response [14:07] natefinch, btw, a thought: it's *vital* that we start (non-bootstrap) state DBs only in response to info from the API -- it's only the API conn that does the nonce check and is therefore safe from edge case failures in the provisioner accidentally starting two instances for one machine [14:07] natefinch, I know that's what we're doing anyway, but it's another reason not to mess around with cloudinit [14:07] fwereade: ahh, yeah, good point [14:08] rogpeppe2: you may be right about the interface there, but I still am hesitant to add a bunch of code and some complexity for a feature we don't even support right now. It's like, either do this whole watching thing, or just call a function inline. [14:10] fwereade, jam, I created a branch from the last rev that CI blessed then merged mgz's openstack/mgo fixes. [14:11] natefinch: i'm not sure what you mean by "just call a function inline" there [14:11] fwereade, jam. since I created a new branch, I created series 1.18 and moved 1.17.1 to that series. https://launchpad.net/juju-core/1.18 [14:12] rogpeppe2: given what fwereade said above... are we not just going to be calling the API to see if we should be a state server when MachineAgent.Run is called [14:12] fwereade, jam, I am not happy with this situation. If either of you are not happy, we should talk about our options for a 1.17.1 release
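The fix fwereade and rogpeppe converge on above ([12:53]-[13:00]) would look something like the sketch below: a jujud-specific fatal error, produced only where a not-provisioned condition really is fatal to the agent, with the generic error dropped from the fatal list. All names and the stub check are illustrative, not the actual cmd/jujud code.

```go
// Hedged sketch of the error-translation idea; illustrative names only.
package agenterrors

import "errors"

// errAgentNotProvisioned is the specific, agent-fatal error that would
// live in cmd/jujud.
var errAgentNotProvisioned = errors.New("this agent's machine is not provisioned")

// isNotProvisioned stands in for the real NotProvisionedError check.
func isNotProvisioned(err error) bool { return false } // stub for illustration

// translateEntityError is the one place (the agent's own API entity
// lookup, a la apiRootForEntity) where not-provisioned is upgraded to
// a fatal error.
func translateEntityError(err error) error {
	if isNotProvisioned(err) {
		return errAgentNotProvisioned
	}
	return err
}

// isFatal no longer treats every worker's not-provisioned error as
// fatal; only the agent's own machine counts.
func isFatal(err error) bool {
	return err == errAgentNotProvisioned
}
```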
[14:13] natefinch: we can't do that if there's only one state server machine agent [14:13] natefinch: that's kind of the whole point of the design i've suggested [14:14] natefinch: i don't think the SharedValue stuff is unreasonable complexity (it's been well tested in the past, under another guise) [14:15] natefinch: i'd be happy to commit it with tests if you don't fancy doing that [14:15] rogpeppe2: that's fine, it just seems like we're doing all that instead of writing func SetUpStateServer() [14:16] natefinch: we're moving towards a design that enables future stuff, rather than cluttering the existing design with more stuff. [14:17] rogpeppe2: it doesn't seem like clutter when it's code we'll need either way. The future stuff we may never get to. [14:18] rogpeppe2: can you explain this a bit more: we can't do that if there's only one state server machine agent [14:18] natefinch: perhaps you could sketch some pseudocode for what you think SetUpStateServer should do? [14:18] rogpeppe2: that's probably what is tripping me up [14:18] natefinch: if there's only one state server machine agent, there's no API to connect to [14:19] rogpeppe2: if there's no API to connect to, mongo can't sync its data from anywhere either [14:19] rogpeppe2: I don't think this code applies to the bootstrap node [14:19] natefinch: there is no such thing as the "bootstrap node" in an HA environment [14:19] natefinch: well, that's not entirely true [14:19] sinzui: thanks for that [14:20] natefinch: but the bootstrap node is only important at bootstrap time [14:20] I don't think I have any cunning solutions either unfortunately [14:21] rogpeppe2: other than the bootstrap node, there must already be a state server in existence when new state servers come up [14:21] natefinch: not necessarily [14:22] natefinch: i want to allow for the possibility of going from 3 servers to 1. [14:22] rogpeppe2: then we're boned anyway, because they can't sync the mongo data [14:22] natefinch: not necessarily [14:23] rogpeppe2: btw HA, where does the all-machines.log reside then? [14:23] natefinch: the peer group logic i've written should allow it [14:23] TheMue: that's a good question and one we haven't resolved yet [14:24] rogpeppe2: hehe, ok [14:24] TheMue: we perhaps sync it to all state server nodes, but we need to think about it [14:25] rogpeppe2: afk a sec, sorry, brb [14:26] rogpeppe2: yes, may be the right solution. also we still have no logrotate, don't we? [14:32] rogpeppe2: going from 3 to 1 and putting manage environ on existing machines are not things we need to deliver right now, and I don't think there's any throw away code we'd need to write to deliver what is required without adding this code. I guess if you want to commit the shared value stuff with tests, that's fine with me... I just don't really want to take on *any* additional code burden right now. [14:33] TheMue: yes, there is no logrotate [14:34] natefinch: perhaps you could paste some pseudocode with your idea for your suggested solution [14:35] rogpeppe2, TheMue: unless there's a serious problem with the approach I think we should just fix the rsyslog conf to push to all state servers [14:35] TheMue, yes, there is no logrotate and we really need it [14:35] fwereade: +1000 [14:35] TheMue, don't suppose you're currently bored [14:35] ;p [14:36] fwereade: what happens when a new state server comes up? [14:36] fwereade: (do we lose all the previous log on that state server?)
[14:37] rogpeppe2, I think it's ok that new state servers won't have logs from before they existed, if that's what you mean? [14:37] fwereade: it is [14:37] fwereade: i'm not sure that's really acceptable, tbh [14:37] fwereade: but if you think it is, i'll go with it [14:37] rogpeppe2: in machineagent.Run: open API, check this machine's jobs. If manageEnviron, install & run mongo upstart script. (if we can't assume mongo is installed, throw in an apt-get install in there). [14:38] rogpeppe2: I know it's naive, but I'm not understanding under what circumstances it'll fail in the stuff we need to support for 14.04 [14:42] rogpeppe2, https://codereview.appspot.com/54950046 -- I don't think this is necessarily the *best* solution, but it involves no API changes and AFAICT it resolves the jujud flakiness -- opinions? [14:42] * fwereade needs to eat something, would be most grateful to return and see a review of https://codereview.appspot.com/57740043 as well [14:44] rogpeppe2: sorry if I'm asking the same thing over and over. There's obviously something I keep misunderstanding. [14:44] natefinch: one mo, i'm just doing a sketch, so you can see how my suggestion actually simplifies the existing code [14:45] rogpeppe2: that's fine, and thank you for helping me understand. [14:53] fwereade: I've got the slight feeling that the topic of debug logging will accompany me for some time. ;) [15:24] mgz, is the bot wedged? https://code.launchpad.net/~waigani/juju-core/remove-local-shared-storage/+merge/202789 was approved yesterday -- but I'm sure I saw it doing something earlier today [15:26] natefinch: here's the kind of thing i'm thinking of: http://paste.ubuntu.com/6832569/ [15:26] fwereade: I'll have a look [15:26] natefinch: (try a diff against cmd/jujud/machine.go) [15:29] rogpeppe2: looking [15:30] rogpeppe2, mgz, natefinch Do either of you have time to review my branch to inc juju to 1.17.2? https://codereview.appspot.com/57750043 [15:32] sinzui, LGTM [15:32] thank you fwereade [15:37] fwereade: the bot is cycling through and looking for proposals fine [15:37] fwereade: waigani just didn't set a commit message [15:37] we can do that and it will land [15:53] fwereade: i'd be interested in what you think about my suggestion to nate above, machine agent changes (http://paste.ubuntu.com/6832569/) [15:54] mgz, doh, sorry [15:57] natefinch: does it make some sort of sense? it's somewhat more code, but i think it separates concerns better, and there are no special-case hacks [15:59] rogpeppe2, I think that looks nice [15:59] rogpeppe2: I still don't understand the *why*. When we first start up in MachineAgent.Run, we can call the API right then and determine if we need to run mongo, and do so right then. I'm not sure what we get by making a continuous watcher thingy, since we don't currently support changing a machine from a non-state server to a state server.
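natefinch's "just do it inline" version from [14:37] and [15:59] would look roughly like this. Every type and helper here is a hypothetical stand-in for the real machine agent, and it deliberately ignores the machine-0 bootstrap case that the rest of the discussion turns on.

```go
// Hedged sketch of the inline approach: on startup, ask the API for
// this machine's jobs and ensure mongod is running if it manages the
// environment. Illustrative names only.
package inline

type Job string

const JobManageEnviron Job = "JobManageEnviron"

type apiConn interface {
	MachineJobs(machineID string) ([]Job, error)
}

func runStateServerIfNeeded(openAPI func() (apiConn, error), machineID string, ensureMongo func() error) error {
	// This is rogpeppe's objection: on the very first state server
	// there is no API to open yet, so this path needs a special case.
	st, err := openAPI()
	if err != nil {
		return err
	}
	jobs, err := st.MachineJobs(machineID)
	if err != nil {
		return err
	}
	for _, job := range jobs {
		if job == JobManageEnviron {
			// Install/start the mongod upstart job before launching
			// any state workers.
			return ensureMongo()
		}
	}
	return nil
}
```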
[16:00] natefinch: we can't open the API if we're supposed to be running the API [16:00] natefinch: unless you add special case hacks for machine 9 [16:00] rogpeppe2, surely we *can*, though [16:00] machine 0 even [16:00] fwereade: well, we *can*, and that's what my suggestion does [16:00] rogpeppe2: yes, one special case hack for THE special case in the system [16:00] natefinch, every special case I see for machine 0 makes me sad [16:00] fwereade: I agree [16:01] natefinch, this reduces that special case to setting up the agent conf so that machine 0 alone already knows it's meant to run the state worker [16:01] natefinch, everything else gets it via the api [16:01] natefinch, (ultimately via the api) [16:01] natefinch: in your case you have to have a completely separate path for opening the API in MachineAgent.Run, and then you'll start the APIWorker which then needs to open it again [16:01] fwereade: maybe I'm missing that part because of all the magic watcheryness. [16:03] natefinch, rogpeppe2: yeah, my opinions are predicated on the watching all being sane, and the agent conf all being properly goroutine safe, etc [16:03] fwereade: of course [16:03] natefinch, rogpeppe2: but I think it's a suitable channel for this sort of information [16:04] rogpeppe2: it seems like you're arguing about the contents of needsStateWorker, which doesn't seem to be in the code you're talking about [16:05] natefinch: needsStateWorker is just a "does machine jobs contain JobStateWorker?" [16:05] natefinch: (or however that info is stored in the config) [16:06] rogpeppe2: right, fine. Ok. Why do we need 150 lines of magic watcherness rather than just checking the jobs in machineAgent.run? [16:06] natefinch: because we need to watch that stuff *anyway* [16:07] rogpeppe2: aren't we already doing that? [16:07] natefinch: because we need to save the addresses [16:07] natefinch: i don't think so [16:07] fwereade: aren't the addresses already in the config? [16:07] fwereade: ahh, I guess if the config changes [16:08] fwereade: who changes the config and how? [16:08] natefinch, :179 [16:09] natefinch, I think it depends on infrastructure that isn't written yet ( rogpeppe2 ?) but the shape of it looks sane to me [16:10] fwereade: the infrastructure that's not written yet is outlined in newConfigWatcher [16:10] fwereade, natefinch: it's pretty bog standard stuff - just watch that stuff and change the config appropriately [16:11] rogpeppe2, indeed, I was just checking there wasn't something I'd completely missed :) [16:11] natefinch: so, the config changes because that watcher changes it, because something that it's watching that needs to go into the config has changed [16:12] natefinch: FWIW i've been wanting to move towards this kind of structure in the machine agent for ages [16:12] natefinch: and i'd much prefer to do it now rather than twist the structure more [16:14] rogpeppe2, fwereade: I think what was confusing me was that the worker functions look like they're just called once, but they're runners/workers so they keep getting called over and over. 
I'm not entirely sure why I couldn't put ensureMongoServer inside StateWorker or something [16:14] natefinch: the other problem with "just" connecting to the API in MachineAgent.Run is that you have to be careful to allow the agent to be stopped, and all that logic rapidly becomes quite complex (and duplicates logic that's already there elsewhere) [16:14] natefinch: you definitely could do that [16:14] natefinch: but not in the current code [16:17] natefinch: well, actually, it would probably work [16:18] natefinch: but you'd still need the config watching stuff [16:18] natefinch: and tbh i don't really like the current twistiness with ensureStateWorker [16:19] would someone please review https://codereview.appspot.com/57740043 so I can deflake the bot a bit more? [16:20] rogpeppe2: I'm definitely not a fan of passing around an anonymous function that passes through another anonymous function [16:21] rogpeppe2: I believe that you and William know what the heck you're talking about, and ignore whatever last 2% I'm missing. [16:22] rogpeppe2: er rather I'm going to have to ignore what I'm missing. [16:22] rogpeppe2: my sleep's been pretty terrible the last few days which isn't helping anything [16:22] natefinch: np at all === rogpeppe2 is now known as rogpeppe [16:24] fwereade: LGTM [16:31] rogpeppe2, cheers [16:43] fwereade: according to the Go oracle, there are only three places that call agent.Config.Write - BootstrapCommand.Run, MachineAgent.APIWorker and UnitAgent.APIWorkers [16:43] fwereade: this corresponds to my intuition [16:44] fwereade: and means that the correct place to put the config writing code *is* in the APIWorker [17:15] fwereade, final look at https://codereview.appspot.com/53210044/ ? updated as suggested [17:15] bbiab [19:12] thumper: got that errgo thing all written yet? :) [19:13] natefinch: no, but will do a little today :) [19:13] fwereade: around? === _mup__ is now known as _mup_ [19:38] natefinch: what is the simplest way to insert a value at the start of a slice? [19:39] thumper: s = append([]type{ val }, s...) [19:39] hmm... [19:40] I was hoping there was a nicer way, but that is what I'll do [19:41] thumper: yeah, it's not great, but there's no real magic you can do with it [19:41] * thumper nods [20:11] thumper: It occurs to me, if prepending is something you're doing a lot of, it's probably better to just append, and treat the back as the front, if you know what I mean [20:11] natefinch: I do, but most of the rest of the operations are starting from the most recent [20:12] I want the equivalent of push_front [20:12] yeah, well, push_front is always ugly, so, there you go :) [20:13] thumper: unless you use a linked list.... and you should never use a linked list :) [20:13] :) [21:09] do we have any size recommendations on state-server size ? [21:10] ie 4 core / 16gb for 500 nodes env? [21:11] * hazmat rereads notes from jam scale test in nov [21:14] thumper: 2 things. 1. you got the loggo user name? 2. i hate all the code churn just to relocate a friggin project 3. don't forget to update the bot 4. 
i can't count [21:14] wallyworld: yes I got the loggo name on github, agree on the churn [21:15] wallyworld: hai by the way [21:15] hi :-) [21:15] i'm about to take lachie to his first day of high school, will be bbiab [21:15] i'll look at your worker code review once you add the test :-) [21:15] there's a couple of tools out there for package name rewriting [21:16] hazmat: sure, but it's sad it actually *changes the code* [21:16] all that code churn [21:16] sucks [21:16] as opposed to just updating a dependencies file [21:16] like in python [21:28] natefinch, Did you say I should remove the mongodb upstart script from /etc/init ? [21:30] sinzui: yeah... I don't think it actually breaks anything, but I think we removed it from the servers we deploy, IIRC. [21:35] sinzui: frees up some disk space and memory etc. [21:36] thumper, heyhey [21:36] fwereade: hey, how are you doing? [21:36] thumper, not bad [21:36] thumper, landed some actual code today, would you believe? [21:36] since we are seeing the port taken in CI. I like the thought that there is no reason for it to be up if there is no test running [21:36] fwereade: wow === gary_poster is now known as gary_poster|away [23:02] sinzui: ping [23:03] hi davecheney [23:03] sinzui: are we doing a hangout now ? [23:03] oh hey
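Finally, writing out the slice push_front exchange from [19:38]-[20:13]: the literal prepend reallocates and copies on every insert, while natefinch's alternative is to append normally and read the slice from the back when most-recent-first order is wanted. The helper names below are made up.

```go
// The two options from the discussion, spelled out.
package sliceops

// pushFront is the literal answer: allocate a new slice starting with
// val and copy the rest after it.
func pushFront(s []string, val string) []string {
	return append([]string{val}, s...)
}

// appendAndReadBackwards is the alternative: append as usual and walk
// the slice from the end when "most recent first" is needed.
func appendAndReadBackwards(s []string, val string) []string {
	return append(s, val)
}

func mostRecentFirst(s []string) []string {
	out := make([]string, 0, len(s))
	for i := len(s) - 1; i >= 0; i-- {
		out = append(out, s[i])
	}
	return out
}
```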