/srv/irclogs.ubuntu.com/2014/01/28/#juju-dev.txt

wallyworldand a degree of latency can be used to smooth out spikes in state change00:00
wallyworldbased on knowledge of the model interactions00:00
hazmatthe issue is unless it's a work event, a state change event can be stale with network partitions or transient disconnects, if things end up pushing forward an invalid/old state into the application tier.. ie the current fetch is probably still needed for the app / charm model00:01
hazmaton design choices and scalability i actually had to do some work recently to have juju support provisioning  many machines with a single provider api call..00:02
wallyworldyou mean to add a bulk call to allow more than one machine to be provisioned with a single api call, not one per machine?00:02
hazmatyes00:03
wallyworldi've been pushing for bulk api calls since the api was first mooted00:03
hazmatwallyworld, i'm not entirely clear things have changed with the move to core.. take ec2 for example..  standard cloud best practice would be multi-zone .. not something we can actually do with core... or take our instance type hardcodes which are old (we promote more expensive less powerful instances than what's current best practice).00:03
hazmatand overlapping, and we can't specify instance-type00:03
hazmatwallyworld, for the bulk provisioning i ended up seeding cloud-init data with something that dialed back home, and got the actual machine specific provisioning script.00:04
wallyworldagreed. we could and should change all that. i know people want to00:04
wallyworldah that would work00:04
hazmatyeah.. works well, this work was all manual provider based though not in core per se. but the pattern works nicely00:05
wallyworldthe whole hard coded instance type thing - that's just for ec2 because of what the lib provided. with openstack, we don't have that problem00:05
wallyworldwe do need to fix ec2 but have had too many other things taking a higher priority00:06
hazmatwallyworld, we should ideally host that on cloud-images, not internal to the src.00:06
wallyworldoh yes, you think?00:06
wallyworld:-)00:06
hazmat:-)00:06
wallyworldso there's a lot of us that want to get this stuff sorted out badly00:06
wallyworldi'm still hopeful we can do it00:07
davecheneyubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ rm -rf ~/.juju01:15
davecheneyubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ ~/bin/juju init01:15
davecheneyA boilerplate environment configuration file has been written to /home/ubuntu/.juju/environments.yaml.01:15
davecheneyEdit the file to configure your juju environment and run bootstrap.01:15
davecheneyubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ ~/bin/juju status01:15
davecheneyERROR Unable to connect to environment "".01:15
davecheneyPlease check your credentials or use 'juju bootstrap' to create a new environment.01:15
davecheneyError details:01:15
davecheneycontrol-bucket: expected string, got nothing01:15
davecheneythis sounds wrong01:15
davecheneyjuju init defaults to amazon01:15
davecheneywhy does it say the environment is ""01:15
thumpernfi01:25
thumperbut definitely a bug01:25
davecheneythis is a fresh install01:29
davecheneyi01:29
davecheneyi'll poke around some more01:29
davecheneyi smell JUJU_HOME in there somewhere01:29
* davecheney throws a chair01:29
davecheneyno, JUJU_ENV01:30
davecheneyubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ export JUJU_ENV="amazon"01:30
davecheneyubuntu@ip-10-248-60-212:~/src/launchpad.net/juju-core$ ~/bin/juju status01:30
davecheneyERROR Unable to connect to environment "amazon".01:30
davecheneybingo01:30
davecheneythumper: ~/.juju/current-environment01:31
davecheneywhere did that come from01:31
davecheneythat is being consulted always01:31
thumperswitch01:31
davecheneybut I never used switch01:31
thumperthen it probably shouldn't be there01:32
davecheneybut if it's not there01:32
davecheneythe env comes up as ""01:32
thumperthat's a bug01:32
thumperit should fall back to the default environment01:32
* thumper thinks01:33
thumperah...01:33
thumperyes01:33
thumperwhat happens01:33
thumperis that if the string is empty01:33
thumperwhich you would have got if you didn't specify -e or have an env var01:33
thumperit is treated "specially"01:33
davecheneyhmm,01:33
thumperwe should remove the special01:33
thumperand have a sane fallback01:33
thumperalways been like that01:33
davecheneythe correct behavior is to fall back to the order in the yaml file01:33
thumpernow we are just outputting it01:33
thumperno actually01:33
thumperfall back to the default if specified01:34
thumperif no default specified01:34
davecheneythere is a default01:34
thumperthen use the one if only one specified01:34
davecheneythis is from juju init01:34
thumperotherwise error01:34
davecheneyall this bullshit is too complicated01:34
thumperbut it is using the default specified01:34
davecheneyjuju switch was a bad idea01:34
thumperbut it is getting it by the special case of ""01:34
thumperno01:34
thumperthis isn't about switch01:34
thumperthis is how it worked before01:34
thumperI just added an extra thing in01:34
thumperwe weren't outputting it before01:34
davecheneyok01:34
thumperwhoever added the outputting should have changed the behaviour01:35
thumperbut didn't01:35
davecheneylemmie see if I can put this in an issue01:35
thumperhaving "" special cased is dumb01:35
davecheneyok01:36
davecheneythanks01:36
davecheneyi'll log a bug after lunch01:36
sinzuithumper, I have a branch that I can release as 1.17.1. It is the last r2248 + mgz's openstack and mgo fixes02:45
sinzuithumper, It doesn't pass on canonistack though.02:45
sinzuiI cannot get stable juju to work on canonistack today, so maybe I should release this mashup as 1.17.1 anyway02:46
wallyworldsinzui: what's the issue on canonistack?02:46
wallyworldthat has been flakey on occasion02:46
sinzuiAfter a successful bootstrap, the client can never talk to the env. That is true for 1.16.5 and 1.17.1 from inside canonistack, outside with public ips and outside with sshuttle vpns02:48
wallyworldhmmm02:48
sinzuiwallyworld, The cloud health tests show that canonistack has been barely usable for several days02:49
rick_h_sinzui: did the sshuttle connect?02:49
sinzuiyes it did02:49
wallyworldif everything else works, probably good to release then. i assume hp cloud works02:49
sinzuibut status and deploy always timeout02:49
sinzuiHp is indeed very healthy with my branch02:50
jamrick_h_: sinzui: sounds like a firewall issue, they might be blocking everything but port 22 even for the public IPs or something02:51
jamor, public IPs aren't routable, but sshuttle is using the chinstrap bounce to connect02:51
sinzuijam, from inside canonistack?02:51
jamsinzui: so when you have sshuttle connected, do you *also* have a public IP assigned?02:51
wallyworldsinzui: 2248 is an older revision from last week?02:52
jambecause we won't end up using sshuttle if we think we have an address outside the 10.* space02:52
jamI'm just theorizing, though.02:52
sinzuijam the health check is an hourly deployment using stable on each cloud. canonistack is ill http://162.213.35.54:8080/job/test-cloud-canonistack/ ...02:52
wallyworldlatest health check works, quick release :-)02:53
sinzui...but since we are not seeing resources deleted properly from trunk tests, we know that some of the failures can be caused by the cruft left behind02:54
wallyworldsinzui: i see 2248 was before local provider sudo changes02:54
sinzuiI have been manually deleting instances, security groups, and networks all day to give CI a chance to pass02:55
wallyworldit would be good if 2249 could be the rev we use02:55
sinzuiIt is02:55
wallyworldunless that is broken02:55
sinzuiwallyworld, FUCK NO02:55
wallyworld2249 eliminates polling for status02:55
sinzuiwallyworld, 2249+ does not pass02:55
wallyworldoh02:55
* wallyworld is sad02:55
wallyworldi'll have to take a look then02:56
wallyworldi tested on ec202:57
sinzuiwallyworld, I know. I really wanted to release tip. local is not healthy in trunk since CI was tainted by the destroy-environment problems, I ask other people to test. Local doesn't just work, for anyone who has ever used local before02:57
wallyworldmy change in 2249 was related to juju status only02:58
wallyworldi'll check that it is ok though02:58
sinzuiwallyworld, I have been testing for 3 days. People want a release... and they want their favourite features in it02:58
wallyworldoh, sorry, didn't intend to push for 2249 to be included. i was just concerned i may have broken something02:59
sinzuiI am burned out trying to make a release when CI gave an answer last week02:59
sinzuiwallyworld, I want to release every week, which will help get the good work out into the wild03:00
wallyworldyeah. but we need to stop breaking stuff03:00
sinzuiwallyworld, Juju trunk was very good last week. On the last day a lot of branches landed just broken tests every where03:02
wallyworldi think they were mainly the local provider changes03:02
wallyworldif i understand correctly, it seems like the newer code doesn't like a previous dirty disk03:03
sinzuiwallyworld, Yeah, but since it cannot destroy itself (possibly a bug in the very version I propose releasing) the disk will always be dirty03:03
wallyworldso the clean up this week will need to be able to deal with that03:04
wallyworldi think the issue may be understood, so if 2248 is released and the final polish applied this week to trunk, 1.17.2 at the end of the week hopefully :-)03:04
wallyworldor next week03:04
sinzuiwallyworld, this bug causes subsequent runs of tests to fail. https://bugs.launchpad.net/juju-core/+bug/127255803:06
_mup_Bug #1272558: destroy-environment shutdown machines instead <ci> <destroy-machine> <intermittent-failure> <juju-core:Triaged> <https://launchpad.net/bugs/1272558>03:06
* wallyworld looks03:06
sinzuiWe keep hitting resource limits, or instance already exists failures from machines that were shutdown instead of destroyed.03:07
sinzuioh, and it didn't help that our trusty amis expired.03:07
wallyworldsinzui: that is an interesting bug. i had a quick look at the Openstack provider, and StopInstances does seem to call "delete server". so more investigation required03:09
sinzuiah, juju-test-cloud-canonistack-machine-0 got shutdown instead of destroyed. I expect the next health check to fail03:09
sinzuiwallyworld, the bug also affects aws and azure though.03:10
sinzuivery odd03:10
wallyworldyeah, i just thought i'd see what openstack did out of interest03:10
sinzuiAzure can take hours to delete a network when I do it from the console too03:10
wallyworldazure seems to take a looooong time to do *anything*03:10
wallyworldsinzui: so i assume that for example you could do a "nova list" and the old machines would be shown still and have a status "shutdown"03:11
sinzuiThe durations shown here are consistent with previous weeks: http://162.213.35.54:8080/03:12
sinzuiwe do run azure tests in parallel to keep all CI tests to about 30 minutes, but that also puts us at risk of exceeding our 20 cpu limit03:13
sinzuiwallyworld, exactly what I see03:13
sinzuiwallyworld, Hp is the only cloud/substrate not affected.03:13
wallyworldhmmm. i'm not intimately familiar with destroy-env code. i wonder if maybe something changed recently03:14
wallyworldwe have some investigating to do03:14
wallyworldsinzui: i'll make sure the issue is known at the next core standup and we'll ensure someone is assigned to fix as a matter of priority03:16
sinzuiwallyworld, That is appreciated.03:17
wallyworldwe really need to get some more closed loop feedback from CI -> devs03:17
wallyworldcause i reckon not many devs even know the address of the CI dashboard03:18
wallyworldand/or pay attention to the status03:18
wallyworldso you poor folks cop stuff we break without timely action to fix it03:19
wallyworldand then have to push shit uphill to get a release out03:19
wallyworldi'll offer my opinion and hopefully it will be shared and we can implement some workflow to improve the situation03:20
wallyworldsinzui: i'll make sure you get feedback to let you know the outcome of the above03:21
sinzuiwallyworld, Jenkins has a bad UI. We are creating a report site that explains what was tested. http://162.213.35.69:6543/03:22
sinzuiwallyworld, You do need to log in to see the report of the revision I created03:22
wallyworldso i do03:22
thumperWTF is going on!03:23
thumperdamnit03:23
thumperthis was working before03:23
wallyworldsinzui: E403 after logging in03:23
sinzuiThe overall PASS status is there because I manually tested local, then hacked the PASS status. Canonistack should have damned the whole rev though03:23
sinzuiwallyworld, did you check all the boxes?03:24
wallyworldthumper: do you have any knowledge of the destroy-env issue sinzui  mentions in the scrollback?03:24
wallyworldsinzui: ah, no :-)03:24
wallyworldi thought they were for information03:24
sinzuioh, arosales reported the same thing. control-reload to force the pages I think03:24
thumperwallyworld: where is the scrollback, it is long03:24
wallyworldthumper: https://bugs.launchpad.net/juju-core/+bug/127255803:25
_mup_Bug #1272558: destroy-environment shutdown machines instead <ci> <destroy-machine> <intermittent-failure> <juju-core:Triaged> <https://launchpad.net/bugs/1272558>03:25
wallyworldsinzui: ah it works now03:26
thumpersinzui: try with --force03:26
wallyworldthumper: i'd need to read the code. i wonder why not having --force shuts down instances instead of deleting them03:27
sinzuithumper, destroy-environment didn't return an error when it failed. It didn't tell us it needed to use use force03:27
thumperhmm...03:27
thumperthis was axw's area, not sure03:27
thumperI think it had to do with moving to the api, but can't confirm03:27
thumperwallyworld: destroy-environment tries to be nice first03:28
thumperI think03:28
sinzuithumper, I suspect the issue is older than last week, but something made it more visible.03:28
wallyworldsinzui: is it just the bootstrap machine left behind?03:28
sinzuiwell, no, I have never seen a machine SHUTDOWN before last week03:28
wallyworldi was thinking perhaps that if the logic was moved behind the api, the code that does the destroy is running on the bootstrap machine itself and there may be an issue destroying it03:29
sinzuiwallyworld, most of the time, but we have seen the service machines shutdown too.03:29
wallyworldand that the other nodes could still be destroyed03:30
wallyworldok, was just a guess :-)03:30
thumpersomething seems all fucked up03:31
thumperlocal provider in trunk is giving weird arse lxc errors03:32
thumpermachine-0: 2014-01-28 03:31:00 ERROR juju.container.lxc lxc.go:102 lxc container creation failed: error executing "lxc-create": + '[' amd64 == i686 ']'; + '[' amd64 '!=' i386 -a amd64 '!=' amd64 -a amd64 '!=' armhf -a amd64 '!=' armel ']'; + '[' amd64 '!=' i386 -a amd64 '!=' amd64 -a amd64 '!=' armhf -a amd64 '!=' armel ']'; + '[' amd64 = amd64 -a amd64 '!=' amd64 -a amd64 '!=' i386 ']'; + '[' amd64 = i386 -a amd64 '!=' i386 ']'; + '[' amd6403:32
thumper = armhf -o amd64 = armel ']'; + '[' released '!=' daily -a released '!=' released ']'; + '[' -z /var/lib/lxc/tim-testlocal-machine-1 ']'; ++ id -u; + '[' 0 '!=' 0 ']'; + config=/var/lib/lxc/tim-testlocal-machine-1/config; + '[' -z /usr/lib/x86_64-linux-gnu/lxc ']'; + type ubuntu-cloudimg-query; ubuntu-cloudimg-query is /usr/bin/ubuntu-cloudimg-query; + type wget; wget is /usr/bin/wget; + cache=/var/cache/lxc/cloud-precise; + mkdir -p03:32
thumper/var/cache/lxc/cloud-precise; + '[' -n '' ']'; ++ ubuntu-cloudimg-query precise released amd64 --format '%{url}\n'; failed to get https://cloud-images.ubuntu.com/query/precise/server/released-dl.current.txt; + url1=; container creation template for tim-testlocal-machine-1 failed; Error creating container tim-testlocal-machine-103:32
thumperWTH03:32
* thumper moves from current work back to trunk03:33
wallyworldstill on precise?03:33
thumpersaucy03:36
thumperwallyworld: having issues with trusty?03:36
wallyworldthumper: i tried local on trusty last thing friday and it didn't work03:36
wallyworldbut didn't look into it03:36
wallyworldandrew and i had a quick look03:37
wallyworldnothing jumped out as being wrong but we didn't deep dive03:37
thumperhang on...03:37
thumperI moved back to trunk and now it is working03:37
thumper...03:37
wallyworldtrusty as host, precise containers03:37
thumperI didn't touch this area though03:38
thumperso pretty confused right now03:38
wallyworldisn't that always the way03:38
thumperah fark03:44
thumperI think I know what it is03:44
thumperI have a fake https-proxy set03:44
thumperto "rubbish"03:44
thumperand I bet lxc is trying to download the latest server using the proxy03:45
wallyworldha ha ha03:45
thumperthat is kinda funny03:45
thumperin a terrible way03:45
wallyworldthe proxy stuff works :-)03:45
thumperhehe, that's it03:46
thumperI guess the proxy works03:46
* thumper proposes03:47
thumperactually03:47
thumperI may break this up as I broke something before03:47
thumperwallyworld: https://codereview.appspot.com/57590043/  simple fix for my fubar03:54
* wallyworld looks03:54
wallyworldthumper: environs/config/config.go - are those new methods related to this mp? am i missing something?03:58
thumperwallyworld: ah, they may be used in the next03:58
thumperbut I thought they were for that one03:59
thumpersorry03:59
thumperI'm just proposing the next03:59
wallyworldnp, thought i was being dumb03:59
thumperwallyworld: and if you feel like it: Rietveld: https://codereview.appspot.com/5760004304:05
wallyworldlooking04:05
wallyworldthumper: so we could later on move existing clients to use the new common fasçard?04:13
thumpercould04:13
thumperand façade04:13
wallyworldbah, can't spell04:14
wallyworldthumper: not sure if you agree, i find !a || !b easier to read than !(a && b), especially if the latter is split over two lines, where by !a i mean a != foo etc04:18
thumperwallyworld: I don't care that much04:34
wallyworldyeah, me either was just a thought04:34
wallyworldtook me a couple of scans to grok it04:35
wallyworldcause of the line break and (04:35
* thumper nods04:36
thumperhappy to change if you think it'll make a difference04:36
wallyworldthumper: i didn't lgtm because we're missing a test for jujud04:45
* thumper sighs04:45
wallyworldsorry04:45
thumperand how do you suggest we test it?04:45
wallyworldyeah04:45
wallyworldthere's some existing examples in machine_test04:46
wallyworldbasically start a jujud and check for an expected result04:46
wallyworldsimilar code to the worker test itself04:46
thumperhmm...04:46
thumperok04:46
wallyworldbut a cut down version04:46
wallyworldjust to test the wiring up of it all04:46
wallyworldnot a blocker, but it would be good not to have the "first" bool required04:47
wallyworldor maybe it is essential here, not sure. but other workers don't seem to need it, but i could be mis remembering04:47
wallyworldit just complicates things04:48
wallyworldoh balls, gotta run - Belinda's car door is stuck and i have to drive to help her out04:48
wallyworldi'll check back a bit later04:48
davecheneyoh bollocks06:12
davecheney... obtained []charm.CharmRevision = []charm.CharmRevision{charm.CharmRevision{Revision:23, Sha256:"6645c56965290fc0097ea9962a926e04b8c5b1483f2871dce9e33e9613e36dbd", Err:error(nil)}, charm.CharmRevision{Revision:23, Sha256:"6645c56965290fc0097ea9962a926e04b8c5b1483f2871dce9e33e9613e36dbd", Err:error(nil)}, charm.CharmRevision{Revision:23, Sha256:"6645c56965290fc0097ea9962a926e04b8c5b1483f2871dce9e33e9613e36dbd", Err:error(nil)}}06:12
davecheney... expected []charm.CharmRevision = []charm.CharmRevision{charm.CharmRevision{Revision:23, Sha256:"2c9f01a53a73c221d5360207e7bb2f887ff83c32b04e58aca76c4d99fd071ec7", Err:error(nil)}, charm.CharmRevision{Revision:23, Sha256:"2c9f01a53a73c221d5360207e7bb2f887ff83c32b04e58aca76c4d99fd071ec7", Err:error(nil)}, charm.CharmRevision{Revision:23, Sha256:"2c9f01a53a73c221d5360207e7bb2f887ff83c32b04e58aca76c4d99fd071ec7", Err:error(nil)}}06:12
davecheney#gccgo06:13
dimiternfwereade, hey, i've updated https://codereview.appspot.com/53210044/, can you take a look whether it's good to land?08:05
fwereadedimitern, sure, thanks08:06
fwereadedimitern, reviewed,bbs08:19
=== _mup__ is now known as _mup_
dimiternfwereade, ta08:22
davecheneylucky(~) % juju destroy-environment ap-southeast-2 -y09:38
davecheneyERROR state/api: websocket.Dial wss://ec2-54-206-142-42.ap-southeast-2.compute.amazonaws.com:17070/: dial tcp 54.206.142.42:17070: connection refused09:38
davecheneybut it did destroy the environment09:38
jamdavecheney: I believe it tries to contact the environment to check if there are any manually registered machines before nuking it all from the client side, but even if it fails, it still nukes from client side10:07
davecheneybut why did it fail ?10:09
davecheneywas the bootstrap machine already nuked?10:09
jamdimitern: standup ?10:46
dimiternfwereade, updated https://codereview.appspot.com/53210044/10:58
natefinchrogpeppe: want to talk now?11:26
rogpeppenatefinch: good plan, yes11:26
rogpeppenatefinch: i just went back into the hangout11:26
natefinchrogpeppe: cool brt11:26
fwereaderogpeppe, btw, re SOAP, I remember enjoying http://wanderingbarque.com/nonintersecting/2006/11/15/the-s-stands-for-simple/11:26
jamfwereade: https://codereview.appspot.com/56020043 you had asked for a deprecation warning (when supplying -e to destroy-environment), care to check the spelling and see if you like how I worded it?11:28
fwereadedimitern, re {placeholder: false}, might {placeholder: {$ne: true}} work as a general replacement?11:30
fwereadejam, ack11:30
fwereadejam, LGTM11:31
dimiternfwereade, and pendingupload as well11:41
dimiternfwereade, i'll try it on my local mongo and if it works I could change it11:45
jamfwereade: thanks11:46
fwereadedimitern, reviewed12:00
dimiternfwereade, tyvm12:03
fwereadeoh *fuck* these jujud tests12:03
fwereadeand also that provisioner one12:04
* fwereade will sort out the provisioner one but needs a volunteer for the jujud one12:06
natefinchfwereade: delete them.  Tests are for inferior programmers anyway.12:06
fwereadenatefinch, well volunteered!12:06
natefinchfwereade: which jujud one?12:06
fwereadenatefinch, there's a class of machine agent test failure12:07
fwereadenatefinch, happens in a few different ways now I think12:07
fwereadenatefinch, we try to test that a machine agent works by testing the side-effects of particular jobs12:07
fwereadenatefinch, but the MA isn't set up quite right, and so the api job barfs, and kills all the others12:07
natefinchfwereade: sounds like fun12:08
fwereadenatefinch, and it's a matter of luck whether they managed to express their side effect in time12:08
fwereadenatefinch, btw, you don't actually have to volunteer, HA is at least as important12:08
rogpeppenatefinch: http://play.golang.org/p/WwGvP5RUbM12:08
rogpeppenatefinch: it could probably go in its own package somewhere12:09
rogpeppenatefinch: perhaps in utils12:09
natefinchfwereade: if I was less behind in the HA stuff I'd be happy to volunteer.... but if someone else can do it, that would probably be better  for the schedule12:10
rogpeppenatefinch: does that make sense as a primitive?12:11
* rogpeppe needs to lunch12:12
natefinchrogpeppe: looks good though I'm not sure I'm entirely happy about genericizing it with the interface{} ...12:13
rick_h_TheMue: morning, I wanted to check on the status of the debug-log work. I know rogpeppe brought up some ideas for potentially more efficient ways to handle communication.12:21
rick_h_TheMue: anything I can/should note on our tracking card for this during our standup?12:21
TheMuerick_h_: yep, roger and william agreed on this new approach and I'm now finishing the outline (there are some todos left)12:23
rick_h_TheMue: cool, thanks for the update.12:23
TheMuerick_h_: most looks good so far because this new approach avoids many of the problems of the old one12:23
rick_h_TheMue: always a good thing, glad to hear it12:24
jamfwereade: did you get a chance to talk with Tim about planning for capetown?12:30
fwereadejam, he got a message from me about it, but only while I was briefly midnightly awake and he was at the gym12:49
jamfwereade: so its all sorted out, then :)12:49
rogpeppenatefinch: i know what you mean, but it's general enough that it might be useful for other things. it's a bit more mechanism than i'd like to see in cmd/jujud directly, and as a separate package i wouldn't really want it to depend on the agent package12:51
fwereadejam, mu -- I have pushed mramm to figure out what else we may need, but at least thumper is aware of where he needs to add things that he thinks of12:52
rogpeppenatefinch: you could always write a thin type-safe wrapper on top if the interface{} thing gets you down12:52
fwereaderogpeppe, can you remember what emitter of NotProvisionedError justifies returning true from isFatal in the machine agent?12:53
fwereaderogpeppe, if *that particular* MA is not provisioned, sure, that's a problem12:54
* rogpeppe looks to see what emits NotProvisionedError12:54
fwereaderogpeppe, and the fact that other workers are emitting it is I think *also* a problem12:54
fwereaderogpeppe, but they should surely be restricting their problems to themselves12:55
rogpeppe fwereade: yeah, that may well be problematic. I think we should probably use a more specific error, probably defined in cmd/jujud, and transform from NotProvisionerError into that at the appropriate place only12:57
rogpeppefwereade: and take NotProvisionedError out of the list of fatal errors12:57
rogpeppefwereade: i *think* that the specific case that we were thinking about there is the one returned by apiRootForEntity12:58
fwereaderogpeppe, excellent, that matches my rough analysis13:00
fwereaderogpeppe, thanks13:00
rogpeppefwereade: cool13:00
fwereadewould appreciate a quick look at https://codereview.appspot.com/57740043 -- it's the flaky provisioner tests13:20
fwereadethe jujud ones demand a bit more obsessive/paranoid care13:21
* fwereade succumbs to rage against the machine agent, goes for walk13:30
mgzthe battle of los agents13:35
rogpeppe2anyone here know about lxc?13:52
rogpeppe2we're seeing this error from lxc-start on a machine:13:52
rogpeppe22014-01-28 13:35:27 ERROR juju.provisioner provisioner_task.go:399 cannot start instance for machine "3/lxc/5": error executing "lxc-start": command get_init_pid failed to receive response13:52
fwereadenatefinch, btw, a thought: it's *vital* that we start (non-bootstrap) state DBs only in response to info from the API -- it's only the API conn that does the nonce check and is therefore safe from edge case failures in the provisioner accidentally starting two instances for one machine14:07
fwereadenatefinch, I know that's what we're doing anyway, but it's another reason not to mess around with cloudinit14:07
natefinchfwereade: ahh, yeah, good point14:07
natefinchrogpeppe2: you may be right about the interface there, but I still am hesitant to add a bunch of code and some complexity for a feature we don't even support right now.  It's like, either do this whole watching thing, or just call a function inline.14:08
sinzuifwereade, jam, I created a branch from the last rev that CI blessed then merged mgz openstack/mgo fixes.14:10
rogpeppe2natefinch: i'm not sure what you mean by "just call a function inline" there14:10
sinzuifwereade, jam. since I created a new branch, I created series 1.18 and moved 1.17.1 to that series. https://launchpad.net/juju-core/1.1814:11
natefinchrogpeppe2: given what fwereade said above... are we not just going to be calling the API to see if we should be a state server when MachineAgent.Run is called14:11
sinzuifwereade, jam, I am not happy with this situation. If either of you are not happy, we should talk about our options for a 1.17.1 release14:12
rogpeppe2natefinch: we can't do that if there's only one state server machine agent14:13
rogpeppe2natefinch: that's kind of the whole point of the design i've suggested14:13
rogpeppe2natefinch: i don't think the SharedValue stuff is unreasonable complexity (it's been well tested in the past, under another guise)14:14
rogpeppe2natefinch: i'd be happy to commit it with tests if you don't fancy doing that14:15
natefinchrogpeppe2: that's fine, it just seems like we're doing all that instead of writing func SetUpStateServer()14:15
rogpeppe2natefinch: we're moving towards a design that enables future stuff, rather than cluttering the existing design with more stuff.14:16
natefinchrogpeppe2: it doesn't seem like clutter when it's code we'll need either way. The future stuff we may never get to.14:17
natefinchrogpeppe2: can you explain this a bit more: we can't do that if there's only one state server machine agent14:18
rogpeppe2natefinch: perhaps you could sketch some pseudocode for what you think SetUpStateServer should do?14:18
natefinchrogpeppe2: that's probably what is tripping me up14:18
rogpeppe2natefinch: if there's only one state server machine agent, there's no API to connect to14:18
natefinchrogpeppe2: if there's no API to connect to, mongo can't sync its data from anywhere either14:19
natefinchrogpeppe2: I don't think this code applies to the bootstrap node14:19
rogpeppe2natefinch: there is no such thing as the "bootstrap node" in an HA environment14:19
rogpeppe2natefinch: well, that's not entirely true14:19
mgzsinzui: thanks for that14:19
rogpeppe2natefinch: but the bootstrap node is only important at bootstrap time14:20
mgzI don't think I have any cunning solutions either unfortunately14:20
natefinchrogpeppe2: other than the bootstrap node, there must already be a state server in existence when new state servers come up14:21
rogpeppe2natefinch: not necessarily14:21
rogpeppe2natefinch: i want to allow for the possibility of going from 3 servers to 1.14:22
natefinchrogpeppe2: then we're boned anyway, because they can't sync the mongo data14:22
rogpeppe2natefinch: not necessarily14:22
TheMuerogpeppe2: btw HA, where does the all-machines.log reside then?14:23
rogpeppe2natefinch: the peer group logic i've written should allow it14:23
rogpeppe2TheMue: that's a good question and one we haven't resolved yet14:23
TheMuerogpeppe2: hehe, ok14:24
rogpeppe2TheMue: we perhaps sync it to all state server nodes, but we need to think about it14:24
natefinchrogpeppe2: afk a sec, sorry, brb14:25
TheMuerogpeppe2: yes, may be the right solution. also we still have no logrotate, do we?14:26
natefinchrogpeppe2: going from 3 to 1 and putting manage environ on existing machines are not things we need to deliver right now, and I don't think there's any throw away code we'd need to write to deliver what is required without adding this code.  I guess if you want to commit the shared valuestuff with tests, that's fine with me... I just don't really want to take on *any* additional code burden right now.14:32
natefinchTheMue: yes, there is no logrotate14:33
rogpeppe2natefinch: perhaps you could paste some pseudocode with your idea for your suggested solution14:34
fwereaderogpeppe2, TheMue: unless there's a serious problem with the approach I think we should just fix the rsyslog conf to push to all state servers14:35
fwereadeTheMue, yes, there is no logrotate and we really need it14:35
rogpeppe2fwereade: +100014:35
fwereadeTheMue, don't suppose you're currently bored14:35
fwereade;p14:35
rogpeppe2fwereade: what happens when a new state server comes up?14:36
rogpeppe2fwereade: (do we lose all the previous log on that state server?)14:36
fwereaderogpeppe2, I think it's ok that new state servers won't have logs from before they existed, if that's what you mean?14:37
rogpeppe2fwereade: it is14:37
rogpeppe2fwereade: i'm not sure that's really acceptable, tbh14:37
rogpeppe2fwereade: but if you think it is, i'll go with it14:37
natefinchrogpeppe2: in machineagent.Run:  open API, check this machine's jobs.  If manageEnviron, install & run mongo upstart script.  (if we can't assume mongo is installed, throw in an apt-get install in there).14:37
natefinchrogpeppe2: I know it's naive, but I'm not understanding under what circumstances it'll fail in the stuff we need to support for 14.0414:38
fwereaderogpeppe2, https://codereview.appspot.com/54950046 -- I don't think this is necessarily the *best* solution, but it involves no API changes and AFAICT it resolves the jujud flakiness -- opinions?14:42
* fwereade needs to eat something, would be most grateful to return and see a review of https://codereview.appspot.com/57740043 as well14:42
natefinchrogpeppe2: sorry if I'm asking the same thing over and over.  There's obviously something I keep misunderstanding.14:44
rogpeppe2natefinch: one mo, i'm just doing a sketch, so you can see how my suggestion actually simplifies the existing code14:44
natefinchrogpeppe2: that's fine, and thank you for helping me understand.14:45
TheMuefwereade: I've got the slight feeling that the topic of debug logging will accompany me for some time. ;)14:53
fwereademgz, is the bot wedged? https://code.launchpad.net/~waigani/juju-core/remove-local-shared-storage/+merge/202789 was approved yesterday -- but I'm sure I saw it doing something earlier today15:24
rogpeppe2natefinch: here's the kind of thing i'm thinking of: http://paste.ubuntu.com/6832569/15:26
mgzfwereade: I'll have a look15:26
rogpeppe2natefinch: (try a diff against cmd/jujud/machine.go)15:26
natefinchrogpeppe2: looking15:29
sinzuirogpeppe2, mgz, natefinch  Do either of you have time to review my branch to inc juju to 1.17.2?  https://codereview.appspot.com/5775004315:30
fwereadesinzui, LGTM15:32
sinzuithank you fwereade15:32
mgzfwereade: the bot is cycling through and looking for proposals fine15:37
mgzfwereade: waigani just didn't set a commit message15:37
mgzwe can do that and it will land15:37
rogpeppe2fwereade: i'd be interested in what you think about my suggestion to nate above, machine agent changes (http://paste.ubuntu.com/6832569/)15:53
fwereademgz, doh, sorry15:54
rogpeppe2natefinch: does it make some sort of sense? it's somewhat more code, but i think it separates concerns better, and there are no special-case hacks15:57
fwereaderogpeppe2, I think that looks nice15:59
natefinchrogpeppe2: I still don't understand the *why*.  When we first start up in MachineAgent.Run, we can call the API right then and determine if we need to run mongo, and do so right then.  I'm not sure what we get by making a continuous watcher thingy, since we don't currently support changing a machine from a non-state server to a state server.15:59
rogpeppe2natefinch: we can't open the API if we're supposed to be running the API16:00
rogpeppe2natefinch: unless you add special case hacks for machine 916:00
fwereaderogpeppe2, surely we *can*, though16:00
rogpeppe2machine 0 even16:00
rogpeppe2fwereade: well, we *can*, and that's what my suggestion does16:00
natefinchrogpeppe2: yes, one special case hack for THE special case in the system16:00
fwereadenatefinch, every special case I see for machine 0 makes me sad16:00
natefinchfwereade: I agree16:00
fwereadenatefinch, this reduces that special case to setting up the agent conf so that machine 0 alone already knows it's meant to run the state worker16:01
fwereadenatefinch, everything else gets it via the api16:01
fwereadenatefinch, (ultimately via the api)16:01
rogpeppe2natefinch: in your case you have to have a completely separate path for opening the API in MachineAgent.Run, and then you'll start the APIWorker which then needs to open it again16:01
natefinchfwereade: maybe I'm missing that part because of all the magic watcheryness.16:01
fwereadenatefinch, rogpeppe2: yeah, my opinions are predicated on the watching all being sane, and the agent conf all being properly goroutine safe, etc16:03
rogpeppe2fwereade: of course16:03
fwereadenatefinch, rogpeppe2: but I think it's a suitable channel for this sort of information16:03
natefinchrogpeppe2: it seems like you're arguing about the contents of needsStateWorker, which doesn't seem to be in the code you're talking about16:04
rogpeppe2natefinch: needsStateWorker is just a "does machine jobs contain JobStateWorker?"16:05
rogpeppe2natefinch: (or however that info is stored in the config)16:05
natefinchrogpeppe2: right, fine.  Ok.  Why do we need 150 lines of magic watcherness rather than just checking the jobs in machineAgent.run?16:06
rogpeppe2natefinch: because we need to watch that stuff *anyway*16:06
natefinchrogpeppe2: aren't we already doing that?16:07
rogpeppe2natefinch: because we need to save the addresses16:07
rogpeppe2natefinch: i don't think so16:07
natefinchfwereade: aren't the addresses already in the config?16:07
natefinchfwereade: ahh, I guess if the config changes16:07
natefinchfwereade: who changes the config and how?16:08
fwereadenatefinch, :17916:08
fwereadenatefinch, I think it depends on infrastructure that isn't written yet ( rogpeppe2 ?) but the shape of it looks sane to me16:09
rogpeppe2fwereade: the infrastructure that's not written yet is outlined in newConfigWatcher16:10
rogpeppe2fwereade, natefinch: it's pretty bog standard stuff - just watch that stuff and change the config appropriately16:10
fwereaderogpeppe2, indeed, I was just checking there wasn't something I'd completely missed :)16:11
rogpeppe2natefinch: so, the config changes because that watcher changes it, because something that it's watching that needs to go into the config has changed16:11
rogpeppe2natefinch: FWIW i've been wanting to move towards this kind of structure in the machine agent for ages16:12
rogpeppe2natefinch: and i'd much prefer to do it now rather than twist the structure more16:12
natefinchrogpeppe2, fwereade: I think what was confusing me was that the worker functions look like they're just called once, but they're runners/workers so they keep getting called over and over.  I'm not entirely sure why I couldn't put ensureMongoServer inside StateWorker or something16:14
rogpeppe2natefinch: the other problem with "just" connecting to the API in MachineAgent.Run is that you have to be careful to allow the agent to be stopped, and all that logic rapidly becomes quite complex (and duplicates logic that's already there elsewhere)16:14
rogpeppe2natefinch: you definitely could do that16:14
rogpeppe2natefinch: but not in the current code16:14
rogpeppe2natefinch: well, actually, it would probably work16:17
rogpeppe2natefinch: but you'd still need the config watching stuff16:18
rogpeppe2natefinch: and tbh i don't really like the current twistiness with ensureStateWorker16:18
fwereadewould someone please review https://codereview.appspot.com/57740043 so I can deflake the bot a bit more?16:19
natefinchrogpeppe2: I'm definitely not a fan of passing around an anonymous function that passes through another anonymous function16:20
natefinchrogpeppe2: I believe that you and William know what the heck you're talking about, and ignore whatever last 2% I'm missing.16:21
natefinchrogpeppe2: er rather I'm going to have to ignore what I'm missing.16:22
natefinchrogpeppe2: my sleep's been pretty terrible the last few days which isn't helping anything16:22
rogpeppe2natefinch: np at all16:22
=== rogpeppe2 is now known as rogpeppe
rogpeppe2fwereade: LGTM16:24
fwereaderogpeppe2, cheers16:31
rogpeppe2fwereade: according to the Go oracle, there are only three places that call agent.Config.Write - BootstrapCommand.Run, MachineAgent.APIWorker and UnitAgent.APIWorkers16:43
rogpeppe2fwereade: this corresponds to my intuition16:43
rogpeppe2fwereade: and means that the correct place to put the config writing code *is* in the APIWorker16:44
dimiternfwereade, final look at https://codereview.appspot.com/53210044/ ? updated as suggested17:15
dimiternbbiab17:15
natefinchthumper: got that errgo thing all written yet? :)19:12
thumpernatefinch: no, but will do a little today :)19:13
thumperfwereade: around?19:13
=== _mup__ is now known as _mup_
thumpernatefinch: what is the simplest way to insert a value at the start of a slice?19:38
natefinchthumper: s = append([]type{ val }, s...)19:39
thumperhmm...19:39
thumperI was hoping there was a nicer way, but that is what I'll do19:40
natefinchthumper: yeah, it's not great, but there's no real magic you can do with it19:41
* thumper nods19:41
natefinchthumper: It occurs to me, if prepending is something you're doing a lot of, it's probably better to just append, and treat the back as the front, if you know what I mean20:11
thumpernatefinch: I do, but most of the rest of the operations are starting from the most recent20:11
thumperI want the equivalent of push_front20:12
natefinchyeah, well, push_front is always ugly, so, there you go :)20:12
natefinchthumper: unless you use a linked list.... and you should never use a linked list :)20:13
thumper:)20:13
hazmatdo we have any recommendations on state-server size?21:09
hazmatie 4 core / 16gb for 500 nodes env?21:10
* hazmat rereads notes from jam scale test in nov21:11
wallyworldthumper: 2 things. 1. you got the loggo user name? 2. i hate all the code churn just to relocate a friggin project 3. don't forget to update the bot 4. i can't count21:14
thumperwallyworld: yes I got the loggo name on github, agree on the churn21:14
thumperwallyworld: hai by the way21:15
wallyworldhi :-)21:15
wallyworldi'm about to take lachie to his first day of high school, will be bbiab21:15
wallyworldi'll look at your worker code review once you add the test :-)21:15
hazmatthere's a couple of tools out there for package name rewriting21:15
wallyworldhazmat: sure, but it's sad it actually *changes the code*21:16
wallyworldall that code churn21:16
wallyworldsucks21:16
wallyworldas opposed to just updating a depedencies file21:16
wallyworldlike in python21:16
sinzuinatefinch, Did you say I should remove the mongodb upstart script from /etc/init ?21:28
natefinchsinzui: yeah... I don't think it actually breaks anything, but I think we removed it from the servers we deploy, IIRC.21:30
natefinchsinzui: frees up some disk space and memory etc.21:35
fwereadethumper, heyhey21:36
thumperfwereade: hey, how are you doing?21:36
fwereadethumper, not bad21:36
fwereadethumper, landed some actual code today, would you believe?21:36
sinzuisince we are seeing the port taken in CI, I like the thought that there is no reason for it to be up if there is no test running21:36
thumperfwereade: wow21:36
=== gary_poster is now known as gary_poster|away
davecheneysinzui: ping23:02
sinzuihi davecheney23:03
davecheneysinzui: are we doing a hangout now ?23:03
davecheneyoh hey23:03

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!