[00:00] and a degree of latency can be used to smooth out spikes in state change [00:00] based on knowledge of the model interactions [00:01] the issue is unless it's a work event, a state change event can be stale with network partitions or transient disconnects, if things end up pushing forward an invalid/old state into the application tier.. ie the current fetch is probably still needed for the app / charm model [00:02] on design choices and scalability i actually had to do some work recently to have juju support provisioning many machines with a single provider api call.. [00:02] you mean to add a bulk call to allow more than one machine to be provisioned with a single api call, not one per machine? [00:03] yes [00:03] i've been pushing for bulk api calls since the api was first mooted [00:03] wallyworld, i'm not entirely clear things have changed with the move to core.. take ec2 for example.. standard cloud best practice would be multi-zone .. not something we can actually do with core... or take our instance type hardcodes which are old (we promote more expensive less powerful instances than what's current best practice). [00:03] and overlapping, and we can't specify instance-type [00:04] wallyworld, for the bulk provisioning i ended up seeding cloud-init data with something that dialed back home, and got the actual machine specific provisioning script. [00:04] agreed. we could and should change all that. i know people want to [00:04] ah that would work [00:05] yeah.. works well, this work was all manual provider based though not in core per se. but the pattern works nicely [00:05] the whole hard coded instance type thing - that's just for ec2 because of what the lib provided. with openstack, we don't have that problem [00:06] we do need to fix ec2 but have had too many other things taking a higher priority [00:06] wallyworld, we should ideally host that on cloud-images, not internal to the src. [00:06] oh yes, you think? [00:06] :-) [00:06] :-) [00:06] so there's a lot of us that want to get this stuff sorted out badly [00:07] i'm still hopeful we can do it [01:15] ubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ rm -rf ~/.juju [01:15] ubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ ~/bin/juju init [01:15] A boilerplate environment configuration file has been written to /home/ubuntu/.juju/environments.yaml. [01:15] Edit the file to configure your juju environment and run bootstrap. [01:15] ubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ ~/bin/juju status [01:15] ERROR Unable to connect to environment "". [01:15] Please check your credentials or use 'juju bootstrap' to create a new environment. [01:15] Error details: [01:15] control-bucket: expected string, got nothing [01:15] this sounds wrong [01:15] juju init defaults to amazon [01:15] why does it say the environment is "" [01:25] nfi [01:25] but definitely a bug [01:29] this is a fresh install [01:29] i [01:29] i'll poke around some more [01:29] i smell JUJU_HOME in there somewhere [01:29] * davecheney throws a chair [01:30] no, JUJU_ENV [01:30] ubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ export JUJU_ENV="amazon" [01:30] ubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ ~/bin/juju status [01:30] ERROR Unable to connect to environment "amazon".
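To make the bulk provisioning exchange above ([00:02]-[00:05]) concrete, here is a minimal Go sketch of the "one provider call for many machines" shape being argued for. Every name in it (BulkProvisioner, MachineParams, a StartInstances that takes a slice) is hypothetical and only illustrates the idea; it is not juju-core's actual provider interface.

```go
// Hypothetical sketch only: a provider call that provisions a whole
// batch of machines in one API round trip, instead of one call per
// machine. None of these names come from juju-core.
package bulk

import "fmt"

// MachineParams describes one machine to start.
type MachineParams struct {
	Series       string
	InstanceType string
}

// BulkProvisioner is the assumed provider-side interface.
type BulkProvisioner interface {
	// StartInstances provisions every requested machine with a single
	// provider API call and returns their instance ids.
	StartInstances(params []MachineParams) ([]string, error)
}

// ProvisionMany builds n identical requests and issues them together.
func ProvisionMany(p BulkProvisioner, n int) ([]string, error) {
	reqs := make([]MachineParams, n)
	for i := range reqs {
		reqs[i] = MachineParams{Series: "precise", InstanceType: "m1.small"}
	}
	ids, err := p.StartInstances(reqs) // one call, n machines
	if err != nil {
		return nil, fmt.Errorf("bulk provisioning failed: %v", err)
	}
	return ids, nil
}
```

hazmat's cloud-init trick (seed generic user data that dials back for the machine-specific script) is the complementary piece that makes a single batched call practical, since the per-machine differences move out of the provisioning request itself.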
[01:30] bingo [01:31] thumper: ~/.juju/current-environment [01:31] where did that come from [01:31] that is being consulted always [01:31] switch [01:31] but I never used switch [01:32] then it probably shouldn't be there [01:32] but if it's not there [01:32] the env comes up as "" [01:32] that's a bug [01:32] it should fall back to the default environment [01:33] * thumper thinks [01:33] ah... [01:33] yes [01:33] what happens [01:33] is that if the string is empty [01:33] which you would have got if you didn't specify -e or have an env var [01:33] it is treated "specially" [01:33] hmm, [01:33] we should remove the special [01:33] and have a sane fallback [01:33] always been like that [01:33] the correct behavior is to fall back to the order in the yaml file [01:33] now we are just outputting it [01:33] no actually [01:34] fall back to the default if specified [01:34] if no default specified [01:34] there is a default [01:34] then use the one if only one specified [01:34] this is from juju init [01:34] otherwise error [01:34] all this bullshit is too complicated [01:34] but it is using the default specified [01:34] juju switch was a bad idea [01:34] but it is getting it by the special case of "" [01:34] no [01:34] this isn't about switch [01:34] this is how it worked before [01:34] I just added an extra thing in [01:34] we weren't outputting it before [01:34] ok [01:35] whoever added the outputting should have changed the behaviour [01:35] but didn't [01:35] lemmie see if I can put this in an issue [01:35] having "" special cased is dumb [01:36] ok [01:36] thanks [01:36] i'll log a bug after lunch [02:45] thumper, I have a branch that I can release as 1.17.1. It is the last r2248 + mgz's openstack and mgo fixes [02:45] thumper, It doesn't pass on canonistack though. [02:46] I cannot get stable juju to work on canonistack today, so maybe I should release this mashup as 1.17.1 anyway [02:46] sinzui: what's the issue on canonistack? [02:46] that has been flakey on occasion [02:48] After a successful bootstrap, the client can never talk to the env. That is true for 1.16.5 and 1.17.1 from inside canonistack, outside with public ips and outside with sshuttle vpns [02:48] hmmm [02:49] wallyworld, The cloud health tests show that canonistack has been barely usable for several days [02:49] sinzui: did the sshuttle connect? [02:49] yes it did [02:49] if everything else works, probably good to release then. i assume hp cloud works [02:49] but status and deploy always timeout [02:50] Hp is indeed very healthy with my branch [02:51] rick_h_: sinzui: sounds like a firewall issue, they might be blocking everything but port 22 even for the public IPs or something [02:51] or, public IPs aren't routable, but sshuttle is using the chinstrap bounce to connect [02:51] jam, from inside canonistack? [02:51] sinzui: so when you have sshuttle connected, do you *also* have a public IP assigned? [02:52] sinzui: 2248 is an older revision from last week? [02:52] because we won't end up using shuttle if we think we have an address outside the 10.* space [02:52] I'm just theorizing, though. [02:52] jam the health check is an hourly deployment using stable on each cloud. canonistack is ill http://162.213.35.54:8080/job/test-cloud-canonistack/ ...
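For the empty-environment-name bug thumper and davecheney dig into above ([01:31]-[01:36]), here is a hedged sketch of the fallback order they argue for, instead of special-casing "". The function and its arguments are made up for illustration; this is not the juju-core implementation.

```go
// Sketch of the environment-name fallback discussed above; all names
// here are illustrative, not juju-core code.
package envname

import (
	"fmt"
	"os"
)

// Resolve picks an environment name in this order: the explicit -e
// flag, the JUJU_ENV variable, ~/.juju/current-environment (written by
// "juju switch"), the default: entry in environments.yaml, and finally
// the only defined environment if there is exactly one.
func Resolve(flagEnv, currentEnv, defaultEnv string, defined []string) (string, error) {
	if flagEnv != "" {
		return flagEnv, nil
	}
	if env := os.Getenv("JUJU_ENV"); env != "" {
		return env, nil
	}
	if currentEnv != "" {
		return currentEnv, nil
	}
	if defaultEnv != "" {
		return defaultEnv, nil
	}
	if len(defined) == 1 {
		return defined[0], nil
	}
	return "", fmt.Errorf("no environment specified and no default set (%d defined)", len(defined))
}
```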
[02:53] latest health check works, quick release :-) [02:54] ...but since we are not seeing resources deleted properly from trunk tests, we know that some of the failures can be caused by the cruft left behind [02:54] sinzui: i see 2248 was before local provider sudo changes [02:55] I have been manually deleting instances, security groups, and networks all day to give CI a chance to pass [02:55] it would be good if 2249 could be the rev we use [02:55] It is [02:55] unless that is broken [02:55] wallyworld, FUCK NO [02:55] 2249 eliminates polling for status [02:55] wallyworld, 2249+ does not pass [02:55] oh [02:55] * wallyworld is sad [02:56] i'll have to take a look then [02:57] i tested on ec2 [02:57] wallyworld, I know. I really wanted to release tip. local is not healthy in trunk since CI was tainted by the destroy-environment problems, I ask other people to test. Local doesn't just work, for anyone who has ever used local before [02:58] my change in 2249 was related to juju status only [02:58] i'll check that it is ok though [02:58] wallyworld, I have been testing for 3 days. People want a release... and they want their favourite features in it [02:59] oh, sorry, didn't intend to push for 2249 to be included. i was just concerned i may have broken something [02:59] I am burned out trying to make a release when CI gave an answer last week [03:00] wallyworld, I want to release every week, which will help get the good work out into the wild [03:00] yeah. but we need to stop breaking stuff [03:02] wallyworld, Juju trunk was very good last week. On the last day a lot of branches landed, just broken tests everywhere [03:02] i think they were mainly the local provider changes [03:03] if i understand correctly, it seems like the newer code doesn't like a previous dirty disk [03:03] wallyworld, Yeah, but since it cannot destroy itself (possibly a bug in the very version I propose releasing) the disk will always be dirty [03:04] so the clean up this week will need to be able to deal with that [03:04] i think the issue may be understood, so if 2248 is released and the final polish applied this week to trunk, 1.17.2 at the end of the week hopefully :-) [03:04] or next week [03:06] wallyworld, this bug causes subsequent runs of tests to fail. https://bugs.launchpad.net/juju-core/+bug/1272558 [03:06] <_mup_> Bug #1272558: destroy-environment shutdown machines instead [03:06] * wallyworld looks [03:07] We keep hitting resource limits, or instance already exists failures from machines that were shutdown instead of destroyed. [03:07] oh, and it didn't help that our trusty amis expired. [03:09] sinzui: that is an interesting bug. i had a quick look at the Openstack provider, and StopInstances does seem to call "delete server". so more investigation required [03:09] ah, juju-test-cloud-canonistack-machine-0 got shutdown instead of destroyed. I expect the next health check to fail [03:10] wallyworld, the bug also affects aws and azure though.
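Bug #1272558 above is about machines ending up SHUTDOWN rather than destroyed. As a hedged illustration of the behaviour the bug expects (not the actual provider code), a teardown helper should delete each instance and then confirm the provider no longer reports it, so quota-eating cruft cannot accumulate; the Client interface and both of its method names are hypothetical.

```go
// Hedged sketch, not juju-core code: terminate instances and verify
// they are really gone, instead of leaving them powered off.
package teardown

import (
	"fmt"
	"time"
)

// Client is a hypothetical provider client.
type Client interface {
	DeleteServer(id string) error         // terminate, not shut down
	ServerExists(id string) (bool, error)
}

// DestroyInstances deletes every instance, then polls until each one
// disappears or the deadline passes.
func DestroyInstances(c Client, ids []string, timeout time.Duration) error {
	for _, id := range ids {
		if err := c.DeleteServer(id); err != nil {
			return fmt.Errorf("deleting %q: %v", id, err)
		}
	}
	deadline := time.Now().Add(timeout)
	for _, id := range ids {
		for {
			exists, err := c.ServerExists(id)
			if err != nil {
				return err
			}
			if !exists {
				break
			}
			if time.Now().After(deadline) {
				return fmt.Errorf("instance %q still present after %s", id, timeout)
			}
			time.Sleep(5 * time.Second)
		}
	}
	return nil
}
```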
[03:10] very odd [03:10] yeah, i just thought i'd see what openstack did out of interest [03:10] Azure can take hours to delete a network when I do it from the console too [03:10] azure seems to take a looooong time to do *anything* [03:11] sinzui: so i assume that for example you could do a "nova list" and the old machines would be shown still and have a status "shutdown" [03:12] The durations shown here are consistent with previous weeks: http://162.213.35.54:8080/ [03:13] we do run azure tests in parallel to keep all CI tests to about 30 minutes, but that also puts us at risk of exceeding our 20 cpu limit [03:13] wallyworld, exactly what I see [03:13] wallyworld, Hp is the only cloud/substrate not affected. [03:14] hmmm. i'm not intimately familiar with destroy-env code. i wonder if maybe something changed recently [03:14] we have some investigating to do [03:16] sinzui: i'll make sure the issue is known at the next core standup and we'll ensure someone is assigned to fix as a matter of priority [03:17] wallyworld, That is appreciated. [03:17] we really need to get some more closed loop feedback from CI -> devs [03:18] cause i reckon not many devs even know the address of the CI dashboard [03:18] and/or pay attention to the status [03:19] so you poor folks cop stuff we break without timely action to fix it [03:19] and then have to push shit uphill to get a release out [03:20] i'll offer my opinion and hopefully it will be shared and we can implement some workflow to improve the situation [03:21] sinzui: i'll make sure you get feedback to let you know the outcome of the above [03:22] wallyworld, Jenkins has a bad UI. We are creating a report site that explains what was tested. http://162.213.35.69:6543/ [03:22] wallyworld, You do need to log in to see the report of the revision I created [03:22] so i do [03:23] WTF is going on! [03:23] damnit [03:23] this was working before [03:23] sinzui: E403 after logging in [03:23] The overall PASS status is there because I manually tested local, then hacked the PASS status. Canonistack should have damned the whole rev though [03:24] wallyworld, did you check all the boxes? [03:24] thumper: do you have any knowledge of the destroy-env issue sinzui mentions in the scrollback? [03:24] sinzui: ah, no :-) [03:24] i thought they were for information [03:24] oh, arosales reported the same thing. control-reload to force the pages I think [03:24] wallyworld: where is the scrollback, it is long [03:25] thumper: https://bugs.launchpad.net/juju-core/+bug/1272558 [03:25] <_mup_> Bug #1272558: destroy-environment shutdown machines instead [03:26] sinzui: ah it works now [03:26] sinzui: try with --force [03:27] thumper: i'd need to read the code. i wonder why not having --force shuts down instances instead of deleting them [03:27] thumper, destroy-environment didn't return an error when it failed. It didn't tell us it needed to use force [03:27] hmm... [03:27] this was axw's area, not sure [03:27] I think it had to do with moving to the api, but can't confirm [03:28] wallyworld: destroy-environment tries to be nice first [03:28] I think [03:28] thumper, I suspect the issue is older than last week, but something made it more visible. [03:28] sinzui: is it just the bootstrap machine left behind?
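thumper's remark above that "destroy-environment tries to be nice first" and sinzui's complaint that it never said --force was needed suggest a pattern like the sketch below. Everything here (the Environ interface and both method names) is a hypothetical stand-in, not the real destroy-environment code path.

```go
// Hedged sketch of "be nice first, then force"; hypothetical names.
package destroy

import "fmt"

type Environ interface {
	// DestroyViaAPI asks the running environment to clean itself up.
	DestroyViaAPI() error
	// DestroyViaProvider tears everything down with raw provider
	// calls, which is roughly what --force amounts to.
	DestroyViaProvider() error
}

func Destroy(e Environ, force bool) error {
	if force {
		return e.DestroyViaProvider()
	}
	if err := e.DestroyViaAPI(); err != nil {
		// The CI failures above happened partly because this error was
		// swallowed; at minimum it should be surfaced so the operator
		// knows --force is needed.
		return fmt.Errorf("clean destroy failed (consider --force): %v", err)
	}
	return nil
}
```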
[03:28] well, no, I have never seen a machine SHUTDOWN before last week [03:29] i was thinking perhaps that if the logic was moved behind the api, the code that does the destroy is running on the bootstrap machine itself and there may be an issue destroying it [03:29] wallyworld, most of the time, but we have seen the service machines shutdown too. [03:30] and that the other nodes could still be destroyed [03:30] ok, was just a guess :-) [03:31] something seems all fucked up [03:32] local provider in trunk is giving weird arse lxc errors [03:32] machine-0: 2014-01-28 03:31:00 ERROR juju.container.lxc lxc.go:102 lxc container creation failed: error executing "lxc-create": + '[' amd64 == i686 ']'; + '[' amd64 '!=' i386 -a amd64 '!=' amd64 -a amd64 '!=' armhf -a amd64 '!=' armel ']'; + '[' amd64 '!=' i386 -a amd64 '!=' amd64 -a amd64 '!=' armhf -a amd64 '!=' armel ']'; + '[' amd64 = amd64 -a amd64 '!=' amd64 -a amd64 '!=' i386 ']'; + '[' amd64 = i386 -a amd64 '!=' i386 ']'; + '[' amd64 [03:32] = armhf -o amd64 = armel ']'; + '[' released '!=' daily -a released '!=' released ']'; + '[' -z /var/lib/lxc/tim-testlocal-machine-1 ']'; ++ id -u; + '[' 0 '!=' 0 ']'; + config=/var/lib/lxc/tim-testlocal-machine-1/config; + '[' -z /usr/lib/x86_64-linux-gnu/lxc ']'; + type ubuntu-cloudimg-query; ubuntu-cloudimg-query is /usr/bin/ubuntu-cloudimg-query; + type wget; wget is /usr/bin/wget; + cache=/var/cache/lxc/cloud-precise; + mkdir -p [03:32] /var/cache/lxc/cloud-precise; + '[' -n '' ']'; ++ ubuntu-cloudimg-query precise released amd64 --format '%{url}\n'; failed to get https://cloud-images.ubuntu.com/query/precise/server/released-dl.current.txt; + url1=; container creation template for tim-testlocal-machine-1 failed; Error creating container tim-testlocal-machine-1 [03:32] WTH [03:33] * thumper moves from current work back to trunk [03:33] still on precise? [03:36] saucy [03:36] wallyworld: having issues with trusty? [03:36] thumper: i tried local on trusty last thing friday and it didn't work [03:36] but didn't look into it [03:37] andrew and i had a quick look [03:37] nothing jumped out as being wrong but we didn't deep dive [03:37] hang on... [03:37] I moved back to trunk and now it is working [03:37] ... [03:37] trusty as host, precise containers [03:38] I didn't touch this area though [03:38] so pretty confused right now [03:38] isn't that always the way [03:44] ah fark [03:44] I think I know what it is [03:44] I have a fake https-proxy set [03:44] to "rubbish" [03:45] and I bet lxc is trying to download the latest server using the proxy [03:45] ha ha ha [03:45] that is kinda funny [03:45] in a terrible way [03:45] the proxy stuff works :-) [03:46] hehe, that's it [03:46] I guess the proxy works [03:47] * thumper proposes [03:47] actually [03:47] I may break this up as I broke something before [03:54] wallyworld: https://codereview.appspot.com/57590043/ simple fix for my fubar [03:54] * wallyworld looks [03:58] thumper: environs/config/config.go - are those new methods related to this mp? am i missing something? [03:58] wallyworld: ah, they may be used in the next [03:59] but I thought they were for that one [03:59] sorry [03:59] I'm just proposing the next [03:59] np, thought i was being dumb [04:05] wallyworld: and if you feel like it: Rietveld: https://codereview.appspot.com/57600043 [04:05] looking [04:13] thumper: so we could later on move existing clients to use the new common fasçard?
[04:13] could [04:13] and façade [04:14] bah, can't spell [04:18] thumper: not sure if you agree, i find !a || !b easier to read than !(a && b), especially if the latter is split over two lines, where by !a i mean a != foo etc [04:34] wallyworld: I don't care that much [04:34] yeah, me either, was just a thought [04:35] took me a couple of scans to grok it [04:35] cause of the line break and ( [04:36] * thumper nods [04:36] happy to change if you think it'll make a difference [04:45] thumper: i didn't lgtm because we're missing a test for jujud [04:45] * thumper sighs [04:45] sorry [04:45] and how do you suggest we test it? [04:45] yeah [04:46] there are some existing examples in machine_test [04:46] basically start a jujud and check for an expected result [04:46] similar code to the worker test itself [04:46] hmm... [04:46] ok [04:46] but a cut down version [04:46] just to test the wiring up of it all [04:47] not a blocker, but it would be good not to have the "first" bool required [04:47] or maybe it is essential here, not sure. but other workers don't seem to need it, but i could be misremembering [04:48] it just complicates things [04:48] oh balls, gotta run - Belinda's car door is stuck and i have to drive to help her out [04:48] i'll check back a bit later [06:12] oh bollocks [06:12] ... obtained []charm.CharmRevision = []charm.CharmRevision{charm.CharmRevision{Revision:23, Sha256:"6645c56965290fc0097ea9962a926e04b8c5b1483f2871dce9e33e9613e36dbd", Err:error(nil)}, charm.CharmRevision{Revision:23, Sha256:"6645c56965290fc0097ea9962a926e04b8c5b1483f2871dce9e33e9613e36dbd", Err:error(nil)}, charm.CharmRevision{Revision:23, Sha256:"6645c56965290fc0097ea9962a926e04b8c5b1483f2871dce9e33e9613e36dbd", Err:error(nil)}} [06:12] ... expected []charm.CharmRevision = []charm.CharmRevision{charm.CharmRevision{Revision:23, Sha256:"2c9f01a53a73c221d5360207e7bb2f887ff83c32b04e58aca76c4d99fd071ec7", Err:error(nil)}, charm.CharmRevision{Revision:23, Sha256:"2c9f01a53a73c221d5360207e7bb2f887ff83c32b04e58aca76c4d99fd071ec7", Err:error(nil)}, charm.CharmRevision{Revision:23, Sha256:"2c9f01a53a73c221d5360207e7bb2f887ff83c32b04e58aca76c4d99fd071ec7", Err:error(nil)}} [06:13] #gccgo [08:05] fwereade, hey, i've updated https://codereview.appspot.com/53210044/, can you take a look whether it's good to land? [08:06] dimitern, sure, thanks [08:19] dimitern, reviewed, bbs === _mup__ is now known as _mup_ [08:22] fwereade, ta [09:38] lucky(~) % juju destroy-environment ap-southeast-2 -y [09:38] ERROR state/api: websocket.Dial wss://ec2-54-206-142-42.ap-southeast-2.compute.amazonaws.com:17070/: dial tcp 54.206.142.42:17070: connection refused [09:38] but it did destroy the environment [10:07] davecheney: I believe it tries to contact the environment to check if there are any manually registered machines before nuking it all from the client side, but even if it fails, it still nukes from client side [10:09] but why did it fail ? [10:09] was the bootstrap machine already nuked ? [10:46] dimitern: standup ? [10:58] fwereade, updated https://codereview.appspot.com/53210044/ [11:26] rogpeppe: want to talk now?
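wallyworld's readability point at [04:18] is just De Morgan's law; a tiny illustration with made-up variable names:

```go
// The two forms are logically equivalent; the expanded one keeps each
// comparison on its own term, which survives line-wrapping better.
package readability

func changedNegated(gotAddr, wantAddr string, gotPort, wantPort int) bool {
	return !(gotAddr == wantAddr && gotPort == wantPort)
}

func changedExpanded(gotAddr, wantAddr string, gotPort, wantPort int) bool {
	return gotAddr != wantAddr ||
		gotPort != wantPort
}
```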
[11:26] natefinch: good plan, yes [11:26] natefinch: i just went back into the hangout [11:26] rogpeppe: cool brt [11:26] rogpeppe, btw, re SOAP, I remember enjoying http://wanderingbarque.com/nonintersecting/2006/11/15/the-s-stands-for-simple/ [11:28] fwereade: https://codereview.appspot.com/56020043 you had asked for a deprecation warning (when supplying -e to destroy-environment), care to check the spelling and see if you like how I worded it? [11:30] dimitern, re {placeholder:false}, might a {$ne {placeholder:true}} work as a general replacement? [11:30] jam, ack [11:31] jam, LGTM [11:41] fwereade, and pendingupload as well [11:45] fwereade, i'll try it on my local mongo and if it works I could change it [11:46] fwereade: thanks [12:00] dimitern, reviewed [12:03] fwereade, tyvm [12:03] oh *fuck* these jujud tests [12:04] and also that provisioner one [12:06] * fwereade will sort out the provisioner one but needs a volunteer for the jujud one [12:06] fwereade: delete them. Tests are for inferior programmers anyway. [12:06] natefinch, well volunteered! [12:06] fwereade: which jujud one? [12:07] natefinch, there's a class of machine agent test failure [12:07] natefinch, happens in a few different ways now I think [12:07] natefinch, we try to test that a machine agent works by testing the side-effects of particular jobs [12:07] natefinch, but the MA isn't set up quite right, and so the api job barfs, and kills all the others [12:08] fwereade: sounds like fun [12:08] natefinch, and it's a matter of luck whether they managed to express their side effect in time [12:08] natefinch, btw, you don't actually have to volunteer, HA is at least as important [12:08] natefinch: http://play.golang.org/p/WwGvP5RUbM [12:09] natefinch: it could probably go in its own package somewhere [12:09] natefinch: perhaps in utils [12:10] fwereade: if I was less behind in the HA stuff I'd be happy to volunteer.... but if someone else can do it, that would probably be better for the schedule [12:11] natefinch: does that make sense as a primitive? [12:12] * rogpeppe needs to lunch [12:13] rogpeppe: looks good though I'm not sure I'm entirely happy about genericizing it with the interface{} ... [12:21] TheMue: morning, I wanted to check on the status of the debug-log work. I know rogpeppe brought up some ideas for potentially more efficient ways to handle communication. [12:21] TheMue: anything I can/should note on our tracking card for this during our standup? [12:23] rick_h_: yep, roger and william agreed on this new approach and I'm now finishing the outline (there are some todos left) [12:23] TheMue: cool, thanks for the update. [12:23] rick_h_: most looks good so far because this new approach avoids many of the problems of the old one [12:24] TheMue: always a good thing, glad to hear it [12:30] fwereade: did you get a chance to talk with Tim about planning for capetown? [12:49] jam, he got a message from me about it, but only while I was briefly midnightly awake and he was at the gym [12:49] fwereade: so it's all sorted out, then :) [12:51] natefinch: i know what you mean, but it's general enough that it might be useful for other things. it's a bit more mechanism than i'd like to see in cmd/jujud directly, and as a separate package i wouldn't really want it to depend on the agent package
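fwereade's suggestion above ([11:30], [11:41]) of querying with $ne instead of requiring placeholder:false would look roughly like this with the mgo driver. The collection and the exact field names are taken from the conversation itself, not checked against the juju state schema, so treat this purely as an illustration of the query shape.

```go
// Illustrative only: select documents that are neither placeholders
// nor pending upload, whether or not those fields are set at all.
package charmquery

import (
	"labix.org/v2/mgo"
	"labix.org/v2/mgo/bson"
)

func notPlaceholder(charms *mgo.Collection) *mgo.Query {
	return charms.Find(bson.D{
		{"placeholder", bson.D{{"$ne", true}}},
		{"pendingupload", bson.D{{"$ne", true}}},
	})
}
```

The advantage of $ne over an explicit placeholder:false match is that documents missing the field entirely are still selected, which is what makes it usable as a general replacement.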
[12:52] jam, mu -- I have pushed mramm to figure out what else we may need, but at least thumper is aware of where he needs to add things that he thinks of [12:52] natefinch: you could always write a thin type-safe wrapper on top if the interface{} thing gets you down [12:53] rogpeppe, can you remember what emitter of NotProvisionedError justifies returning true from isFatal in the machine agent? [12:54] rogpeppe, if *that particular* MA is not provisioned, sure, that's a problem [12:54] * rogpeppe looks to see what emits NotProvisionedError [12:54] rogpeppe, and the fact that other workers are emitting it is I think *also* a problem [12:55] rogpeppe, but they should surely be restricting their problems to themselves [12:57] fwereade: yeah, that may well be problematic. I think we should probably use a more specific error, probably defined in cmd/jujud, and transform from NotProvisionedError into that at the appropriate place only [12:57] fwereade: and take NotProvisionedError out of the list of fatal errors [12:58] fwereade: i *think* that the specific case that we were thinking about there is the one returned by apiRootForEntity [13:00] rogpeppe, excellent, that matches my rough analysis [13:00] rogpeppe, thanks [13:00] fwereade: cool [13:20] would appreciate a quick look at https://codereview.appspot.com/57740043 -- it's the flaky provisioner tests [13:21] the jujud ones demand a bit more obsessive/paranoid care [13:30] * fwereade succumbs to rage against the machine agent, goes for walk [13:35] the battle of los agents [13:52] anyone here know about lxc? [13:52] we're seeing this error from lxc-start on a machine: [13:52] 2014-01-28 13:35:27 ERROR juju.provisioner provisioner_task.go:399 cannot start instance for machine "3/lxc/5": error executing "lxc-start": command get_init_pid failed to receive response [14:07] natefinch, btw, a thought: it's *vital* that we start (non-bootstrap) state DBs only in response to info from the API -- it's only the API conn that does the nonce check and is therefore safe from edge case failures in the provisioner accidentally starting two instances for one machine [14:07] natefinch, I know that's what we're doing anyway, but it's another reason not to mess around with cloudinit [14:07] fwereade: ahh, yeah, good point [14:08] rogpeppe2: you may be right about the interface there, but I still am hesitant to add a bunch of code and some complexity for a feature we don't even support right now. It's like, either do this whole watching thing, or just call a function inline. [14:10] fwereade, jam, I created a branch from the last rev that CI blessed then merged mgz's openstack/mgo fixes. [14:11] natefinch: i'm not sure what you mean by "just call a function inline" there [14:11] fwereade, jam. since I created a new branch, I created series 1.18 and moved 1.17.1 to that series. https://launchpad.net/juju-core/1.18 [14:12] rogpeppe2: given what fwereade said above... are we not just going to be calling the API to see if we should be a state server when MachineAgent.Run is called [14:12] fwereade, jam, I am not happy with this situation. If either of you are not happy, we should talk about our options for a 1.17.1 release
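The fix fwereade and rogpeppe converge on above ([12:53]-[13:00]) would look something like the sketch below: a jujud-specific fatal error, produced only where a not-provisioned condition really is fatal to the agent, with the generic error dropped from the fatal list. All names and the stub check are illustrative, not the actual cmd/jujud code.

```go
// Hedged sketch of the error-translation idea; illustrative names only.
package agenterrors

import "errors"

// errAgentNotProvisioned is the specific, agent-fatal error that would
// live in cmd/jujud.
var errAgentNotProvisioned = errors.New("this agent's machine is not provisioned")

// isNotProvisioned stands in for the real NotProvisionedError check.
func isNotProvisioned(err error) bool { return false } // stub for illustration

// translateEntityError is the one place (the agent's own API entity
// lookup, a la apiRootForEntity) where not-provisioned is upgraded to
// a fatal error.
func translateEntityError(err error) error {
	if isNotProvisioned(err) {
		return errAgentNotProvisioned
	}
	return err
}

// isFatal no longer treats every worker's not-provisioned error as
// fatal; only the agent's own machine counts.
func isFatal(err error) bool {
	return err == errAgentNotProvisioned
}
```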
[14:13] natefinch: we can't do that if there's only one state server machine agent [14:13] natefinch: that's kind of the whole point of the design i've suggested [14:14] natefinch: i don't think the SharedValue stuff is unreasonable complexity (it's been well tested in the past, under another guise) [14:15] natefinch: i'd be happy to commit it with tests if you don't fancy doing that [14:15] rogpeppe2: that's fine, it just seems like we're doing all that instead of writing func SetUpStateServer() [14:16] natefinch: we're moving towards a design that enables future stuff, rather than cluttering the existing design with more stuff. [14:17] rogpeppe2: it doesn't seem like clutter when it's code we'll need either way. The future stuff we may never get to. [14:18] rogpeppe2: can you explain this a bit more: we can't do that if there's only one state server machine agent [14:18] natefinch: perhaps you could sketch some pseudocode for what you think SetUpStateServer should do? [14:18] rogpeppe2: that's probably what is tripping me up [14:18] natefinch: if there's only one state server machine agent, there's no API to connect to [14:19] rogpeppe2: if there's no API to connect to, mongo can't sync its data from anywhere either [14:19] rogpeppe2: I don't think this code applies to the bootstrap node [14:19] natefinch: there is no such thing as the "bootstrap node" in an HA environment [14:19] natefinch: well, that's not entirely true [14:19] sinzui: thanks for that [14:20] natefinch: but the bootstrap node is only important at bootstrap time [14:20] I don't think I have any cunning solutions either unfortunately [14:21] rogpeppe2: other than the bootstrap node, there must already be a state server in existence when new state servers come up [14:21] natefinch: not necessarily [14:22] natefinch: i want to allow for the possibility of going from 3 servers to 1. [14:22] rogpeppe2: then we're boned anyway, because they can't sync the mongo data [14:22] natefinch: not necessarily [14:23] rogpeppe2: btw HA, where does the all-machines.log reside then? [14:23] natefinch: the peer group logic i've written should allow it [14:23] TheMue: that's a good question and one we haven't resolved yet [14:24] rogpeppe2: hehe, ok [14:24] TheMue: we perhaps sync it to all state server nodes, but we need to think about it [14:25] rogpeppe2: afk a sec, sorry, brb [14:26] rogpeppe2: yes, may be the right solution. also we still have no logrotate, don't we? [14:32] rogpeppe2: going from 3 to 1 and putting manage environ on existing machines are not things we need to deliver right now, and I don't think there's any throw away code we'd need to write to deliver what is required without adding this code. I guess if you want to commit the shared value stuff with tests, that's fine with me... I just don't really want to take on *any* additional code burden right now. [14:33] TheMue: yes, there is no logrotate [14:34] natefinch: perhaps you could paste some pseudocode with your idea for your suggested solution [14:35] rogpeppe2, TheMue: unless there's a serious problem with the approach I think we should just fix the rsyslog conf to push to all state servers [14:35] TheMue, yes, there is no logrotate and we really need it [14:35] fwereade: +1000 [14:35] TheMue, don't suppose you're currently bored [14:35] ;p [14:36] fwereade: what happens when a new state server comes up? [14:36] fwereade: (do we lose all the previous log on that state server?)
[14:37] rogpeppe2, I think it's ok that new state servers won't have logs from before they existed, if that's what you mean? [14:37] fwereade: it is [14:37] fwereade: i'm not sure that's really acceptable, tbh [14:37] fwereade: but if you think it is, i'll go with it [14:37] rogpeppe2: in machineagent.Run: open API, check this machine's jobs. If manageEnviron, install & run mongo upstart script. (if we can't assume mongo is installed, throw in an apt-get install in there). [14:38] rogpeppe2: I know it's naive, but I'm not understanding under what circumstances it'll fail in the stuff we need to support for 14.04 [14:42] rogpeppe2, https://codereview.appspot.com/54950046 -- I don't think this is necessarily the *best* solution, but it involves no API changes and AFAICT it resolves the jujud flakiness -- opinions? [14:42] * fwereade needs to eat something, would be most grateful to return and see a review of https://codereview.appspot.com/57740043 as well [14:44] rogpeppe2: sorry if I'm asking the same thing over and over. There's obviously something I keep misunderstanding. [14:44] natefinch: one mo, i'm just doing a sketch, so you can see how my suggestion actually simplifies the existing code [14:45] rogpeppe2: that's fine, and thank you for helping me understand. [14:53] fwereade: I've got the slight feeling that the topic of debug logging will accompany me for some time. ;) [15:24] mgz, is the bot wedged? https://code.launchpad.net/~waigani/juju-core/remove-local-shared-storage/+merge/202789 was approved yesterday -- but I'm sure I saw it doing something earlier today [15:26] natefinch: here's the kind of thing i'm thinking of: http://paste.ubuntu.com/6832569/ [15:26] fwereade: I'll have a look [15:26] natefinch: (try a diff against cmd/jujud/machine.go) [15:29] rogpeppe2: looking [15:30] rogpeppe2, mgz, natefinch Do either of you have time to review my branch to inc juju to 1.17.2? https://codereview.appspot.com/57750043 [15:32] sinzui, LGTM [15:32] thank you fwereade [15:37] fwereade: the bot is cycling through and looking for proposals fine [15:37] fwereade: waigani just didn't set a commit message [15:37] we can do that and it will land [15:53] fwereade: i'd be interested in what you think about my suggestion to nate above, machine agent changes (http://paste.ubuntu.com/6832569/) [15:54] mgz, doh, sorry [15:57] natefinch: does it make some sort of sense? it's somewhat more code, but i think it separates concerns better, and there are no special-case hacks [15:59] rogpeppe2, I think that looks nice [15:59] rogpeppe2: I still don't understand the *why*. When we first start up in MachineAgent.Run, we can call the API right then and determine if we need to run mongo, and do so right then. I'm not sure what we get by making a continuous watcher thingy, since we don't currently support changing a machine from a non-state server to a state server.
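natefinch's "just do it inline" version from [14:37] and [15:59] would look roughly like this. Every type and helper here is a hypothetical stand-in for the real machine agent, and it deliberately ignores the machine-0 bootstrap case that the rest of the discussion turns on.

```go
// Hedged sketch of the inline approach: on startup, ask the API for
// this machine's jobs and ensure mongod is running if it manages the
// environment. Illustrative names only.
package inline

type Job string

const JobManageEnviron Job = "JobManageEnviron"

type apiConn interface {
	MachineJobs(machineID string) ([]Job, error)
}

func runStateServerIfNeeded(openAPI func() (apiConn, error), machineID string, ensureMongo func() error) error {
	// This is rogpeppe's objection: on the very first state server
	// there is no API to open yet, so this path needs a special case.
	st, err := openAPI()
	if err != nil {
		return err
	}
	jobs, err := st.MachineJobs(machineID)
	if err != nil {
		return err
	}
	for _, job := range jobs {
		if job == JobManageEnviron {
			// Install/start the mongod upstart job before launching
			// any state workers.
			return ensureMongo()
		}
	}
	return nil
}
```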
[16:00] natefinch: we can't open the API if we're supposed to be running the API [16:00] natefinch: unless you add special case hacks for machine 9 [16:00] rogpeppe2, surely we *can*, though [16:00] machine 0 even [16:00] fwereade: well, we *can*, and that's what my suggestion does [16:00] rogpeppe2: yes, one special case hack for THE special case in the system [16:00] natefinch, every special case I see for machine 0 makes me sad [16:00] fwereade: I agree [16:01] natefinch, this reduces that special case to setting up the agent conf so that machine 0 alone already knows it's meant to run the state worker [16:01] natefinch, everything else gets it via the api [16:01] natefinch, (ultimately via the api) [16:01] natefinch: in your case you have to have a completely separate path for opening the API in MachineAgent.Run, and then you'll start the APIWorker which then needs to open it again [16:01] fwereade: maybe I'm missing that part because of all the magic watcheryness. [16:03] natefinch, rogpeppe2: yeah, my opinions are predicated on the watching all being sane, and the agent conf all being properly goroutine safe, etc [16:03] fwereade: of course [16:03] natefinch, rogpeppe2: but I think it's a suitable channel for this sort of information [16:04] rogpeppe2: it seems like you're arguing about the contents of needsStateWorker, which doesn't seem to be in the code you're talking about [16:05] natefinch: needsStateWorker is just a "does machine jobs contain JobStateWorker?" [16:05] natefinch: (or however that info is stored in the config) [16:06] rogpeppe2: right, fine. Ok. Why do we need 150 lines of magic watcherness rather than just checking the jobs in machineAgent.run? [16:06] natefinch: because we need to watch that stuff *anyway* [16:07] rogpeppe2: aren't we already doing that? [16:07] natefinch: because we need to save the addresses [16:07] natefinch: i don't think so [16:07] fwereade: aren't the addresses already in the config? [16:07] fwereade: ahh, I guess if the config changes [16:08] fwereade: who changes the config and how? [16:08] natefinch, :179 [16:09] natefinch, I think it depends on infrastructure that isn't written yet ( rogpeppe2 ?) but the shape of it looks sane to me [16:10] fwereade: the infrastructure that's not written yet is outlined in newConfigWatcher [16:10] fwereade, natefinch: it's pretty bog standard stuff - just watch that stuff and change the config appropriately [16:11] rogpeppe2, indeed, I was just checking there wasn't something I'd completely missed :) [16:11] natefinch: so, the config changes because that watcher changes it, because something that it's watching that needs to go into the config has changed [16:12] natefinch: FWIW i've been wanting to move towards this kind of structure in the machine agent for ages [16:12] natefinch: and i'd much prefer to do it now rather than twist the structure more [16:14] rogpeppe2, fwereade: I think what was confusing me was that the worker functions look like they're just called once, but they're runners/workers so they keep getting called over and over. 
I'm not entirely sure why I couldn't put ensureMongoServer inside StateWorker or something [16:14] natefinch: the other problem with "just" connecting to the API in MachineAgent.Run is that you have to be careful to allow the agent to be stopped, and all that logic rapidly becomes quite complex (and duplicates logic that's already there elsewhere) [16:14] natefinch: you definitely could do that [16:14] natefinch: but not in the current code [16:17] natefinch: well, actually, it would probably work [16:18] natefinch: but you'd still need the config watching stuff [16:18] natefinch: and tbh i don't really like the current twistiness with ensureStateWorker [16:19] would someone please review https://codereview.appspot.com/57740043 so I can deflake the bot a bit more? [16:20] rogpeppe2: I'm definitely not a fan of passing around an anonymous function that passes through another anonymous function [16:21] rogpeppe2: I believe that you and William know what the heck you're talking about, and ignore whatever last 2% I'm missing. [16:22] rogpeppe2: er rather I'm going to have to ignore what I'm missing. [16:22] rogpeppe2: my sleep's been pretty terrible the last few days which isn't helping anything [16:22] natefinch: np at all === rogpeppe2 is now known as rogpeppe [16:24] fwereade: LGTM [16:31] rogpeppe2, cheers [16:43] fwereade: according to the Go oracle, there are only three places that call agent.Config.Write - BootstrapCommand.Run, MachineAgent.APIWorker and UnitAgent.APIWorkers [16:43] fwereade: this corresponds to my intuition [16:44] fwereade: and means that the correct place to put the config writing code *is* in the APIWorker [17:15] fwereade, final look at https://codereview.appspot.com/53210044/ ? updated as suggested [17:15] bbiab [19:12] thumper: got that errgo thing all written yet? :) [19:13] natefinch: no, but will do a little today :) [19:13] fwereade: around? === _mup__ is now known as _mup_ [19:38] natefinch: what is the simplest way to insert a value at the start of a slice? [19:39] thumper: s = append([]type{ val }, s...) [19:39] hmm... [19:40] I was hoping there was a nicer way, but that is what I'll do [19:41] thumper: yeah, it's not great, but there's no real magic you can do with it [19:41] * thumper nods [20:11] thumper: It occurs to me, if prepending is something you're doing a lot of, it's probably better to just append, and treat the back as the front, if you know what I mean [20:11] natefinch: I do, but most of the rest of the operations are starting from the most recent [20:12] I want the equivalent of push_front [20:12] yeah, well, push_front is always ugly, so, there you go :) [20:13] thumper: unless you use a linked list.... and you should never use a linked list :) [20:13] :) [21:09] do we have any size recommendations on state-server size ? [21:10] ie 4 core / 16gb for 500 nodes env? [21:11] * hazmat rereads notes from jam scale test in nov [21:14] thumper: 2 things. 1. you got the loggo user name? 2. i hate all the code churn just to relocate a friggin project 3. don't forget to update the bot 4. 
i can't count [21:14] wallyworld: yes I got the loggo name on github, agree on the churn [21:15] wallyworld: hai by the way [21:15] hi :-) [21:15] i'm about to take lachie to his first day of high school, will be bbiab [21:15] i'll look at your worker code review once you add the test :-) [21:15] there's a couple of tools out there for package name rewriting [21:16] hazmat: sure, but it's sad it actually *changes the code* [21:16] all that code churn [21:16] sucks [21:16] as opposed to just updating a dependencies file [21:16] like in python [21:28] natefinch, Did you say I should remove the mongodb upstart script from /etc/init ? [21:30] sinzui: yeah... I don't think it actually breaks anything, but I think we removed it from the servers we deploy, IIRC. [21:35] sinzui: frees up some disk space and memory etc. [21:36] thumper, heyhey [21:36] fwereade: hey, how are you doing? [21:36] thumper, not bad [21:36] thumper, landed some actual code today, would you believe? [21:36] since we are seeing the port taken in CI. I like the thought that there is no reason for it to be up if there is no test running [21:36] fwereade: wow === gary_poster is now known as gary_poster|away [23:02] sinzui: ping [23:03] hi davecheney [23:03] sinzui: are we doing a hangout now ? [23:03] oh hey
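Finally, writing out the slice push_front exchange from [19:38]-[20:13]: the literal prepend reallocates and copies on every insert, while natefinch's alternative is to append normally and read the slice from the back when most-recent-first order is wanted. The helper names below are made up.

```go
// The two options from the discussion, spelled out.
package sliceops

// pushFront is the literal answer: allocate a new slice starting with
// val and copy the rest after it.
func pushFront(s []string, val string) []string {
	return append([]string{val}, s...)
}

// appendAndReadBackwards is the alternative: append as usual and walk
// the slice from the end when "most recent first" is needed.
func appendAndReadBackwards(s []string, val string) []string {
	return append(s, val)
}

func mostRecentFirst(s []string) []string {
	out := make([]string, 0, len(s))
	for i := len(s) - 1; i >= 0; i-- {
		out = append(out, s[i])
	}
	return out
}
```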