[00:34] <jcw4> wallyworld_: looking at the error messages I was wondering if it was an issue with stale .a files on the test machine... is that possible?
[00:34] <wallyworld_> jcw4: maybe, but i can reproduce locally using compiler=gccgo
[00:34] <wallyworld_> i get lots of segfaults as well
[00:34] <jcw4> wallyworld_: ah.. I misunderstood... I thought you couldn't repro locally.
[00:34] <wallyworld_> but i can get the test failure
[00:35] <jcw4> wallyworld_: is there any way to debug if you don't have a ppc machine?
[00:35] <wallyworld_> i can't reproduce unless i just run one test at a time
[00:35] <wallyworld_> you have to know how gccgo works i think
[00:35] <wallyworld_> i have no idea :-(
[00:35] <jcw4> I see
[00:35] <jcw4> :)
[00:35] <wallyworld_> davecheney: you around?
[00:37] <davecheney> wallyworld_: ack
[00:38] <wallyworld_> davecheney: bug 1365480 is blocking ci. it appears to be a gccgo issue because it fails to run the hooks used to mock out method calls
[00:38] <mup> Bug #1365480: ppc64el unit tests fail in many ways <ci> <ppc64el> <regression> <juju-core:Triaged by wallyworld> <https://launchpad.net/bugs/1365480>
[00:38] <wallyworld_> i have no idea how to fix
[00:39] <wallyworld_> this is failing to work
[00:39] <wallyworld_> 	cleanup := s.srv.Service.Nova.RegisterControlPoint(
[00:39] <wallyworld_> 		"addFloatingIP",
[00:39] <wallyworld_> 		func(sc hook.ServiceControl, args ...interface{}) error {
[00:39] <wallyworld_> 			return fmt.Errorf("failed on purpose")
[00:39] <wallyworld_> 		},
[00:39] <wallyworld_> 	)
[00:39] <wallyworld_> the register func uses the stackframe to figure out what to do
[00:40] <wallyworld_> i guess it's broken again - i think it was broken before at some point?
[00:40] <wallyworld_> there's several tests affected
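The mechanism wallyworld_ describes above (a test registers a control-point hook, and the service finds it at run time by inspecting the caller's stack frame) can be sketched roughly as follows. This is an illustrative sketch with assumed names and signatures, not the actual goose/juju hook code, but it shows why a compiler that reports stack frames differently, as gccgo apparently does here, can make a registered hook silently never fire:

    package hooks

    import (
        "runtime"
        "strings"
        "sync"
    )

    // ControlHook is the kind of callback the tests above register.
    type ControlHook func(args ...interface{}) error

    // Registry holds hooks keyed by the name of the service method that
    // should trigger them.
    type Registry struct {
        mu    sync.Mutex
        hooks map[string]ControlHook
    }

    func NewRegistry() *Registry {
        return &Registry{hooks: make(map[string]ControlHook)}
    }

    // RegisterControlPoint installs a hook and returns a cleanup func,
    // mirroring the call pasted in the transcript.
    func (r *Registry) RegisterControlPoint(name string, h ControlHook) func() {
        r.mu.Lock()
        defer r.mu.Unlock()
        r.hooks[name] = h
        return func() {
            r.mu.Lock()
            defer r.mu.Unlock()
            delete(r.hooks, name)
        }
    }

    // ProcessControlHook is called from inside service methods such as
    // addFloatingIP. It asks the runtime for its caller and uses the bare
    // function name to look up a hook. If the compiler reports different
    // frame information, no hook is found and the test's fake error never
    // happens, which matches the failures being discussed.
    func (r *Registry) ProcessControlHook(args ...interface{}) error {
        pc, _, _, ok := runtime.Caller(1)
        if !ok {
            return nil
        }
        fn := runtime.FuncForPC(pc)
        if fn == nil {
            return nil
        }
        name := fn.Name() // e.g. "novaservice.(*Nova).addFloatingIP"
        name = name[strings.LastIndex(name, ".")+1:]
        r.mu.Lock()
        h, found := r.hooks[name]
        r.mu.Unlock()
        if found {
            return h(args...)
        }
        return nil
    }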
[00:42] <davecheney> yeah, it breaks a bit
[00:43] <davecheney> has the version of gccgo on the builder machine changed ?
[00:43] <wallyworld_> nfi
[00:43] <wallyworld_> i was running gccgo (Ubuntu 4.9.1-10ubuntu2) 4.9.1
[00:43] <wallyworld_> the build machine had gccgo (Ubuntu 4.9.1-12ubuntu2) 4.9.1
[00:44] <wallyworld_> i updated to -12 locally
[00:44] <davecheney> ok, and it only repro's on ppc ?
[00:44] <wallyworld_> i can repro locally using -compiler=gccgo
[00:44] <wallyworld_> but i get LOTS of segfaults
[00:44] <wallyworld_> i have to specify each test one at a time
[00:44] <wallyworld_> and yes, ci fails when running on ppc
[00:55] <wallyworld_> davecheney: am i able to ask you to look into this a bit? i have no idea where to start with regard to gccgo
[00:59] <davecheney> wallyworld_: can I fix it on monday ?
[01:00] <wallyworld_> davecheney: it's blocking landings sadly, unless we can get the regression tag removed
[01:01] <davecheney> i recommend removing the regression tag
[01:01] <davecheney> if this is a compiler fix
[01:01] <davecheney> we can't do that at critical level
[01:01] <wallyworld_> sinzui: bug 1365480 looks like a gccgo issue, is there anyway we can remove the regression tag?
[01:01] <mup> Bug #1365480: ppc64el unit tests fail in many ways <ci> <ppc64el> <regression> <juju-core:Triaged by wallyworld> <https://launchpad.net/bugs/1365480>
[01:01] <wallyworld_> davecheney can do a compiler fix but not till monday
[01:01] <sinzui> wallyworld_, you're mad
[01:02] <davecheney> i can look at it on monday
[01:02] <davecheney> i can't promise a fix
[01:02] <sinzui> wallyworld_, the old version of juju works, and now it doesn't.
[01:03] <wallyworld_> sinzui: i can prove that code which has not been touched for ages fails because gccgo does not register the monkey patch being applied
[01:03] <sinzui> wallyworld_, we can retest an older revision, maybe the one that passed. If the test fails like the new revision then we know something other than juju changed
[01:04] <wallyworld_> gccgo can be fragile when it comes to looking at the call stack
[01:04] <wallyworld_> which is how the monkey patching stuff works
[01:04] <wallyworld_> fragile = different to the standard golang gc compiler
[01:05] <sinzui> I will retest the last passing revision, if it fails the same way then you are vindicated
[01:05] <wallyworld_> sinzui: was gccgo updated recently?
[01:05] <wallyworld_> on the test vm?
[01:05] <sinzui> wallyworld_, We would see that in the first test that failed
[01:05] <sinzui> wallyworld_, you lose, http://juju-ci.vapour.ws:8080/job/run-unit-tests-trusty-ppc64el/1213/console clearly states that gcc was already the latest version and that no packages were installed for the test
[01:08] <wallyworld_> and yet the tests that are failing have not changed and the failure is clearly due to gccgo not executing monkey patched code that the tests rely on to pass
[01:08] <wallyworld_> i put a panic in the code and it did not trigger
[01:08] <sinzui> wallyworld_, there is a difference, but it is not seen in the installs...
[01:09] <sinzui> The passing one has
[01:09] <sinzui> go version xgcc (Ubuntu 4.9.1-10ubuntu3) 4.9.1 linux/ppc64
[01:09] <sinzui> The failing one has
[01:09] <sinzui> go version xgcc (Ubuntu 4.9.1-12ubuntu2) 4.9.1 linux/ppc64
[01:09] <wallyworld_> yes, that's what i used to have here till i upgraded
[01:09] <wallyworld_> i'm on utopic now and it doesn't give me an option to go back to -10
[01:10] <wallyworld_> wait
[01:10] <wallyworld_> yes it does
[01:10] <sinzui> wallyworld_, I can look into this after I avert the disaster that really cannot be averted
[01:10] <wallyworld_> ok
[01:10] <wallyworld_> i'll try testing with -10
[01:11] <davecheney> ok, this is not good
[01:11] <davecheney> -12 must be the new version in proposed which fixes a different bug
[01:12] <sinzui> FU&CKI
[01:13] <sinzui> wallyworld_, even after using s3cmd to sync the tools that are on aws, I still get different filesizes from streams.canonical.com
[01:13] <wallyworld_> i can't seem to get apt to allow me to downgrade to -10 to test
[01:13] <wallyworld_> wot :-(
[01:14] <wallyworld_> sinzui: that is not good :-(
[01:14] <davecheney> wallyworld_: juju bootstrap && juju deploy cs:ubuntu
[01:15] <sinzui> wallyworld_, Am I experiencing this because i finally reported the versioning issue as a bug https://bugs.launchpad.net/juju-core/+bug/1365633
[01:15] <mup> Bug #1365633: cannot rebuild replacement tools for streams <ci> <juju-core:Triaged> <https://launchpad.net/bugs/1365633>
[01:15] <wallyworld_> looking
[01:15] <sinzui> wallyworld_, We have lived with this since February, I reported the bug and now I need the fix
[01:16] <sinzui> wallyworld_, tools that should be identical are not, and I cannot give them extra version information to differentiate their origin to avoid confusion or outright malign intent
[01:17] <wallyworld_> sinzui: simplestreams supports versioning using dates
[01:17] <sinzui> wallyworld_, that is not helping the users
[01:17] <wallyworld_> new tools tarballs with different names could be uploaded
[01:18] <wallyworld_> and new metadata with a newer date added
[01:18] <sinzui> wallyworld_, I am going to remake this data, and now I can expect users to complain that tools of the same name dont match
[01:18] <wallyworld_> the tarball name used to matter before simplestreams but it doesn't now
[01:18] <sinzui> wallyworld_, but these tools from two different machines that should be the same have different sums
[01:19] <wallyworld_> the tools tarball could be called juju-1.20.6-release1-precise-amd64 and juju-1.20.6-release2-precise-amd64.
[01:20] <wallyworld_> which one to use comes from the simplestreams metadata
[01:21] <wallyworld_> the latter one would be in the metadata with a later date, so that would be picked up if juju asks for which tools to use for series/arch/release
[01:21] <wallyworld_> maybe i'm missing something
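In other words, the choice between two tarballs with different file names comes down to the version stamps carried in the metadata, not the names themselves. A minimal sketch of that selection rule, using hypothetical names and paths rather than the real juju simplestreams code:

    package main

    import "fmt"

    // toolsItem stands in for one simplestreams metadata entry.
    type toolsItem struct {
        Path    string // e.g. "tools/juju-1.20.6-release2-precise-amd64.tgz"
        Version string // date stamp, e.g. "20140905"; lexical order matches chronological order
    }

    // pickLatest returns the entry with the newest version stamp; the file
    // name itself never enters into the decision.
    func pickLatest(items []toolsItem) toolsItem {
        latest := items[0]
        for _, it := range items[1:] {
            if it.Version > latest.Version {
                latest = it
            }
        }
        return latest
    }

    func main() {
        items := []toolsItem{
            {Path: "tools/juju-1.20.6-release1-precise-amd64.tgz", Version: "20140830"},
            {Path: "tools/juju-1.20.6-release2-precise-amd64.tgz", Version: "20140905"},
        }
        fmt.Println(pickLatest(items).Path) // the release2 tarball wins on date
    }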
[01:21] <sinzui> wallyworld_, That would help. when I tested alternate names for tools, the metadata command ignored them :(
[01:22] <wallyworld_> that may be a limitation of that command :-(
[01:22] <wallyworld_> which needs to be fixed
[01:22] <sinzui> I have done evil things to preserve the greater good
[01:23] <wallyworld_> that command i think from memory does use the filename to suck stuff in
[01:23] <wallyworld_> it could be made smarter
[01:23] <sinzui> yeah, the convention is convenient for many people copying tools.
[01:24] <wallyworld_> or made so it can be called from a script, passing in the required tarball and params
[01:24] <sinzui> wallyworld_, maybe...
[01:24] <wallyworld_> sinzui: we are moving to a shared tarball across series
[01:24] <wallyworld_> ie one tarball only for precise/trusty/utopic
[01:25] <wallyworld_> since they are the same
[01:25] <sinzui> I have just reconciled the diffs from what was last in the CPCs and my own machine to make a json that describes what was there and what I am now uploading.
[01:25] <wallyworld_> so the filename will become less relevant
[01:26] <sinzui> wallyworld_, I would like to do that. The number of tools we make and publish do take a lot of time
[01:26] <wallyworld_> yes indeed :-(
[01:26] <wallyworld_> it's sorta happening now as part of moving tools into mongo storage
[01:27] <wallyworld_> and removing the need for cloud storage
[01:27] <sinzui> wallyworld_, I think if this command worked for azure, we might have prevented my misadventure
[01:27] <sinzui> juju metadata validate-tools --juju-version 1.20.7
[01:28] <wallyworld_> oh, azure doesn't currently support custom metadata
[01:29] <wallyworld_> i think because there's no central storage we can use, from memory
[01:29] <sinzui> wallyworld_, but joyent does. I don't understand? isn't the command getting the json and answering the version question?
[01:29] <wallyworld_> like we have for aws and hp cloud
[01:30] <wallyworld_> it's been ages since i looked at that stuff - from memory it's because there's no support for a custom search path on azure, i can't recall why
[01:30] <wallyworld_> i'll have to go digging in the code
[01:30] <wallyworld_> and even if no custom tools location is supported, i would think the metadata command should still work
[01:31] <wallyworld_> don't know why it doesn't :-(
[01:31] <sinzui> wallyworld_, oh yes, now I understand. I faced some of that using their python sdk
[01:32] <sinzui> wallyworld_, We add md5 and shasum metadata to each tool we upload to azure and manta because we wrote our own rsync tools to do what real storage systems do
[01:32] <wallyworld_> ok
[01:33] <sinzui> manta still sucks though. there is a 5 minute period where we make 1000+ calls to look up the sums because it doesn't support bulk queries
[01:33] <sinzui> well swift doesn't either, but the web/xml interface does
[01:34] <wallyworld_> 1000+ !!
[01:34] <wallyworld_> sinzui: we will soon not need cloud storage for juju
[01:35] <sinzui> wallyworld_, indeed...part of the tools problem is that each machine is downloading tools from one or more sources and that allows for mismatches
[01:35] <wallyworld_> yeah, so soon all machines will get tools from the state server
[01:35] <wallyworld_> the tools are loaded into the state server on bootstrap
[01:39] <sinzui> wallyworld_, I am 1. starting a rebuild of the last good master rev, and 2. looking for the old packages to revert one of the machines to
[01:40] <wallyworld_> ty
[01:40] <wallyworld_> i've updated the bug with my thoughts
[01:42] <sinzui> wallyworld_, I might be able to go back to what was in place on Aug 31 http://ports.ubuntu.com/pool/universe/g/gcc-4.9/
[01:43] <wallyworld_> sinzui: that would be great. you may also find that gcc-base and other packages need downgrading
[01:44] <sinzui> wallyworld_, yeah, that is what makes this hard
[01:44] <wallyworld_> indeed :-(
[02:08] <thumper> wallyworld_: I'm going to see if I can fix this bug: https://bugs.launchpad.net/juju-core/+bug/1348477
[02:08] <mup> Bug #1348477: userAuthenticatorSuite.TearDown failure <ci> <intermittent-failure> <regression> <test-failure> <juju-core:Triaged by cmars> <https://launchpad.net/bugs/1348477>
[02:08] <thumper> wallyworld_: I have a plan
[02:09] <wallyworld_> thumper: awesome, can we catch up in a sec, i'm otp with axw
[02:27] <sinzui> wallyworld_, you are vindicated by the replay of the passing tarball
[02:28] <sinzui> wallyworld_, I am too tired to install the old packages. Maybe I shouldn't because I am not awake enough to know that this is stupid
[02:28] <wallyworld_> sinzui: \o/ does that mean we can remove the regression tag and unblock landings?
[02:28] <wallyworld_> sinzui: we do need to fix the compiler still
[02:28] <wallyworld_> dave can look at that on monday
[02:28] <sinzui> wallyworld_, I am going to take the test's voting rights away. if it starts passing, then we can assume the code and the compiler are in agreement and restore the vote
[02:29] <wallyworld_> great, sounds good
[02:29] <sinzui> I can do this now, and then add the real source for the bug
[02:29] <wallyworld_> sinzui: is there an eta then on landings being unblocked?
[02:30] <sinzui> wallyworld_, I will lower the priority of the bug because obviously we cannot do anything now that it is out of our power...let me fix the vote first
[02:30] <wallyworld_> ok, thank you :-)
[02:31] <sinzui> oh, actually. I cannot go to sleep until this test completes
[02:31] <wallyworld_> :-(
[02:32] <sinzui> wallyworld_, on the other hand the apiserver.metrics might actually have problems. but without a safe compiler, we wont know
[02:32] <wallyworld_> thumper: did you want to talk about your plan?
[02:32] <thumper> wallyworld_: yeah, cause it isn't working
[02:33] <wallyworld_> ok, see you in onyx standup hangout?
[02:33] <thumper> ok
[03:24] <thumper> https://github.com/juju/juju/pull/683 anyone? refactoring work still from this week's mega branch being broken up
[03:24] <thumper> bug fix coming for auth failed
[03:26] <thumper> wallyworld_: https://github.com/juju/juju/pull/685
[03:26] <wallyworld_> looking
[03:35] <katco> wallyworld_: hey thanks for landing all my branches :) take away the fun part why don't ya!
[03:43] <katco> and now i'm off to bed. night all.
[03:51] <thumper> axw: could I get you to cast your eyes over https://github.com/juju/juju/pull/642 again?
[03:51] <axw> looking
[03:51] <thumper> axw: I've updated it based on recent changes and your suggestions
[03:54] <axw> thumper: line 20 can be dropped I think
[03:54] <thumper> sure
[03:54] <thumper> will do
[03:55] <thumper> and pushed
[03:56] <axw> thumper: reviewed, thank you
[03:56] <thumper> nm
[05:43] <axw> wallyworld_: https://github.com/axw/juju/compare/state-tools-take2 if you're interested in seeing the core changes
[05:43] <axw> fixing tests again now
[05:43] <wallyworld_> sure, looking
[05:44] <axw> wallyworld_: apiserver/common/tools.go and apiserver/tools.go are probably of most interest
[05:44] <axw> wallyworld_: also cmd/jujud/bootstrap.go
[05:44] <wallyworld_> kk, just got a phone call, will look soon
[06:14] <wallyworld_> axw: ToolsStorager NOOOOOOOOOOO
[06:15] <axw> heh
[06:15] <wallyworld_> not funny :)
[06:15] <axw> ToolsStorageProvider? it's really a very minor thing, I don't really care
[06:16] <axw> Getter is just as horrible to me
[06:16] <wallyworld_> is there already a "ToolStorage", can't recall
[06:16] <axw> yes, but this is a thing that has a ToolsStorage method
[06:17] <wallyworld_> otherwise ToolsStorageProvider
[06:17] <axw> ok
[06:17] <wallyworld_> sorry, i HATE that particular Go idiom
[07:08] <wallyworld_> fwereade: do you have a moment?
[07:23] <mattyw> morning all
[07:35] <TheMue> morning
[07:36] <axw> wallyworld_: did you find anything obviously wrong, apart from that name?
[07:36] <axw> morning TheMue
[07:36] <wallyworld_> axw: no, looked ok. i got distracted a bit by a bug report, let me just give it one more look
[07:37] <axw> no worries
[07:37] <axw> wallyworld_: doesn't need to be too deep, just wanted a glance over before I get too stuck into fixing tests
[07:37] <axw> which reminds me, tests
[07:38] <wallyworld_> axw: nothing jumped out, but i didn't go over the find in storage logic too closely
[07:39] <axw> ok
[07:39] <axw> thanks
[08:18] <wallyworld_> dimitern: hi there
[08:20] <dimitern> wallyworld_, hey
[08:20] <wallyworld_> dimitern: i backported your fix for allowing maas to disable network config to 1.20. the 1.20 branch is a little different to trunk. could you please review my back port? and type $$merge$$ if you are happy as i have to head to soccer https://github.com/juju/juju/pull/687
[08:20] <dimitern> wallyworld_, sure, looking
[08:21] <wallyworld_> thank you
[08:21]  * wallyworld_ heads out to soccer
[09:59] <TheMue> dimitern: heya, mind another look at https://github.com/juju/juju/pull/626 ?
[10:00] <TheMue> dimitern: it now also covers the simulation and testing of a V0 machiner API.
[10:00] <dimitern> TheMue, cheers, will have a look
[10:01] <TheMue> dimitern: great, thanks
[10:45] <TheMue> dimitern, voidspace: hangout?
[10:46] <voidspace> TheMue: omw
[11:29] <voidspace> dimitern: after changing TIME_WAIT I haven't seen the tests fail...
[11:30] <voidspace> dimitern: not conclusive, but they were failing regularly before
[11:30] <voidspace> dimitern: I'll go to 2MB rate limit (used to fail every time) and see if they now pass
[11:37] <dimitern> voidspace, good news then :)
[11:39] <voidspace> dimitern: ah no, fail :-/
[11:40] <dimitern> voidspace, too bad.. but hey, it's some progress at least
[11:55] <TheMue> so, back from lunch
[11:56] <TheMue> dimitern: thanks for review
[11:56] <TheMue> dimitern: it's only the test for the providers that I don't want to change
[11:57] <TheMue> dimitern: simply so that all providers, also future ones, always follow the same approach
[11:57] <dimitern> TheMue, well, I really don't like passing an opaque array of booleans
[11:57] <TheMue> dimitern: I recognized it as an advantage the moment I added the testing for the V0
[11:58] <TheMue> dimitern: and I don't like to do everything the same way but only ...
[11:58] <TheMue> dimitern: these exceptions always make it more difficult for later maintainers
[11:59] <TheMue> dimitern: but I could change it so that I define the standard behavior as a const (ok, it's a var), so the tests read better
[11:59] <dimitern> TheMue, it will be difficult for anyone to see what [16]bool{true,true,true,false,false,...} actually means
[11:59] <dimitern> TheMue, that sounds better, yes
[11:59] <TheMue> var ExpectedStandardBehavior = [16]bool { ... }
[12:00] <dimitern> TheMue, btw why [16]bool and not []bool ?
[12:00] <TheMue> dimitern: OK, that's a compromise for me
[12:00] <TheMue> dimitern: hey, we all love Go for its type safety. so why open a door to pass too few or too many values?
[12:01] <TheMue> dimitern: only to save two chars?
[12:04] <dimitern> TheMue, ok, as long as the [16]bool is hidden behind a var, I'm fine for the time being
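A rough sketch of the compromise agreed here, with invented names: the fixed-size array keeps the compile-time length check TheMue wants, while a named var plus a small helper keeps test call sites readable instead of an opaque literal of sixteen booleans.

    package provider_test

    // expectedStandardBehavior records the default answer for each of the 16
    // checks the provider tests exercise; unspecified entries are false.
    var expectedStandardBehavior = [16]bool{
        true, true, true, // the first three checks pass for a standard provider
    }

    // withBehaviors returns a copy of the standard behaviour with the given
    // check indices flipped on, so a test states only what differs.
    func withBehaviors(indices ...int) [16]bool {
        b := expectedStandardBehavior // arrays copy by value in Go
        for _, i := range indices {
            b[i] = true
        }
        return b
    }

A provider test that diverges only in, say, check 5 would then pass withBehaviors(5) rather than spelling out the whole array.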
[12:05] <gsamfira> hello folks. If anyone has some time, can I get a review on: https://github.com/juju/utils/pull/27/ ?
[12:07] <TheMue> dimitern: will hide it
[13:30] <perrito666> natefinch: fetching aurics, brt
[14:04] <perrito666> ericsnow: wwitzel3 do we?
[14:32] <voidspace> so I can confirm that CurrentStatus will report members in PrimaryState/SecondaryState even when primary renegotiation is happening and the replica set is unstable
[14:44] <voidspace> although it looks like it sets Uptime to 0 when that happens
[14:47] <voidspace> who wrote the replicaset code?
[14:47] <voidspace> It's part of juju not mgo
[15:10] <natefinch> voidspace: I wrote the replicaset code.
[15:11] <voidspace> natefinch: ok
[15:11] <voidspace> natefinch: I've butchered the applyRelSetConfig code
[15:11] <voidspace> natefinch: I don't think the loop inside that does quite what it looks like it does
[15:11] <voidspace> natefinch: however I've got rid of it anyway, so my question is now moot
[15:11] <natefinch> voidspace: heh ok
[15:12] <voidspace> natefinch: I have a new WaitForMajorityHealthy function which we can use to tell when the replica set is stable
[15:12] <voidspace> natefinch: so far it's mostly working - except for the times when it doesn't...
[15:12] <sinzui> alexisb, I am going to delay 1.21-alpha1 until Monday. There are too many changes to write up as release notes in a single day. I honestly don't know what features are in this release and how to explain to users how to use them
[15:12] <natefinch> voidspace: That was definitely not the finest code in the world.  I wish there were better ways to do pretty much everything in that code... mostly around querying mongo for "WTF are you doing right now?"
[15:12] <voidspace> natefinch: it's the fact that you change cmd to "Ping"
[15:12] <alexisb> sinzui, understood, no one is pining for it today
[15:13] <voidspace> natefinch: which is only useful if you re-enter the block "if err == io.EOF"
[15:13] <alexisb> sinzui, you and I and Ian need to sync on release roadmap for 1.21 though
[15:13] <voidspace> natefinch: which almost certainly isn't what Ping returns
[15:13] <voidspace> natefinch: and even if Ping is successful we retry the loop instead of breaking
[15:13] <voidspace> natefinch: as there's no check for err == nil
[15:14] <voidspace> natefinch: if my function is reliable, it will look like this instead
[15:14] <voidspace> natefinch: http://pastebin.ubuntu.com/8260487/
[15:15] <natefinch> voidspace: hmm yeah that's not good. That  code has been tweaked by a lot of people who were trying to make it more reliable... it's quite possible there were some screw ups along the way.  A lot of it was trial and error trying to figure out what mongo will do at any particular time.
[15:15] <sinzui> alexisb, agreed
[15:16] <natefinch> voidspace: can you show me waitformajorityhealthy?  That's the key part that I had difficulty writing myself.
[15:17] <natefinch> voidspace: also, when does session.Run return EOF?  We should comment why that's an ok error to get
[15:17] <voidspace> natefinch: http://pastebin.ubuntu.com/8260518/
[15:17] <voidspace> natefinch: I should add back a comment about that
[15:17] <voidspace> natefinch: it's when changing the config causes primary re-negotiation so existing connections are dropped
[15:18] <voidspace> natefinch: it's fine - we just need to refresh
[15:18] <voidspace> natefinch: which WaitFor... does
[15:18] <voidspace> natefinch: this is currently not stable - I'm sometimes seeing WaitFor... timeout, so I need to add some debugging
[15:18] <voidspace> this is what I'm doing now
[15:18] <voidspace> it *mostly* works
[15:21] <natefinch> voidspace: thanks for putting in time on this, it'll make our code a lot more robust, and hopefully fix a lot of mongo related errors in the tests
[15:21] <voidspace> maybe... :-/
[15:21] <voidspace> it's been dead end after dead end so far
[15:22] <voidspace> this looks really promising, but I'm still seeing timeouts
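For context, the approach voidspace describes (apply the replica set config, treat io.EOF as the expected connection drop during primary renegotiation, refresh the session, then poll replSetGetStatus until a majority reports healthy) looks roughly like the sketch below. Types and helper names are assumptions for illustration, not the pastebinned juju code:

    package replicahealth

    import (
        "fmt"
        "io"
        "time"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    type memberStatus struct {
        Name   string `bson:"name"`
        Health int    `bson:"health"`
        State  int    `bson:"state"` // 1 = PRIMARY, 2 = SECONDARY
    }

    type replSetStatus struct {
        Members []memberStatus `bson:"members"`
    }

    // applyConfigAndWait runs a replSetReconfig-style command and then waits
    // for a majority of members to report healthy. io.EOF from the command is
    // tolerated because a config change can drop the connection while the
    // primary renegotiates; we refresh the session and keep polling.
    func applyConfigAndWait(session *mgo.Session, cmd bson.D, timeout time.Duration) error {
        if err := session.Run(cmd, nil); err != nil && err != io.EOF {
            return err
        }
        deadline := time.Now().Add(timeout)
        for {
            session.Refresh()
            var status replSetStatus
            err := session.Run(bson.D{{"replSetGetStatus", 1}}, &status)
            if err == nil {
                healthy := 0
                for _, m := range status.Members {
                    if m.Health == 1 && (m.State == 1 || m.State == 2) {
                        healthy++
                    }
                }
                if healthy > len(status.Members)/2 {
                    return nil
                }
            }
            if time.Now().After(deadline) {
                return fmt.Errorf("timed out waiting for a healthy majority (last error: %v)", err)
            }
            time.Sleep(time.Second)
        }
    }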
[15:35] <natefinch> sinzui: is amazon sick today?  one of my PR's failed in a weird way: http://juju-ci.vapour.ws:8080/job/github-merge-juju/546/console
[15:37] <sinzui> natefinch, that indeed looks like aws failed to provide an instance
[15:38] <sinzui> natefinch, I saw messages yesterday that clearly stated there weren't any instances of the size requested for the AZ :(
[15:39] <natefinch> sinzui: I suppose AWS could just be busy
[15:55] <perrito666> mattyw: hey, are you around?
[16:00] <mattyw> perrito666, yep
[16:07] <hazmat> sinzui, that's a bug imo, juju should recover and try a different az
[16:08] <hazmat> although that's different than what natefinch's build says
[16:09] <perrito666> mattyw: did you see axw's last pr?
[16:10] <mattyw> perrito666, removing the call to setadminmongopassword?
[16:10] <perrito666> yup, I applied that and ran with and without your patch
[16:11] <perrito666> that seems to at least fix half of the errors, yet the error related to presence is still there
[16:11] <mattyw> perrito666, my patch?
[16:11] <perrito666> http://paste.ubuntu.com/8227111/
[16:11] <perrito666> "patch"
[16:12] <mattyw> perrito666, does axw's branch make use of the change that thumper landed overnight?
[16:12] <perrito666> yes
[16:12] <perrito666> https://github.com/juju/juju/pull/688
[16:43] <natefinch> how the hell are you supposed to use juju run?  I can't for the life of me figure out how to get it to do anything but say "unrecognized args <stuff in the command to run>"
[16:43] <wesleymason> natefinch: juju run --service <servicename> 'command here'
[16:43] <wesleymason> for example
[16:44] <natefinch> in quotes?
[16:44] <wesleymason> yeah, in single quotes so bash/zsh etc. doesn't interpolate first
[16:44] <wesleymason> recommended anyway
[16:45] <natefinch> ahh that was it.  I was trying with -- to keep it from parsing flags.... we really need better help on that command
[16:46] <natefinch> or like ONE example would be nice
[16:46] <wesleymason> +1
[16:47] <natefinch> I'll work on that.  bad help is a pet peeve of mine
[16:56] <voidspace> natefinch: do you know how to debug "no reachable servers" errors?
[16:57] <natefinch> voidspace: when initiating the replicaset?
[16:57] <voidspace> natefinch: no, after applying a config change or during a Dial
[16:57] <voidspace> natefinch: but in both cases I have a replicaset with several members
[16:59] <natefinch> voidspace: either they're all still trying to come up, or the addresses are internal to the cloud, not public...
[16:59] <voidspace> natefinch: it's during tests, so not a cloud issue
[16:59] <voidspace> natefinch: and I'd like to know *how* to tell whether or not they're trying to come up
[16:59] <natefinch> voidspace: I wish I knew
[16:59] <voidspace> natefinch: as I've waited five minutes and CurrentStatus is failing
[16:59] <voidspace> because of the connection error
[16:59] <voidspace> hah, right
[16:59] <natefinch> niemeyer: ^^
[17:01] <natefinch> niemeyer: we're trying to make our code more robust with respect to Mongo, especially when initiating a replicaset and when bringing up instances of mongo during testing.  We get what appear to be random failures where sometimes they either never come up or take a really long time, or initiating takes a really long time.  Part of the problem is that we don't really know how to figure out what state mongo is in... all we
[17:01] <natefinch> can do is dial and see if it responds within a timeout.  Is there some better way we can do this?
[17:02] <voidspace> I'm seeing a lot of errors like:
[17:02] <voidspace> [LOG] 6:43.772 DEBUG juju.testing tls.Dial(127.0.0.1:35846) failed with dial tcp 127.0.0.1:35846: connection refused
[17:02] <voidspace> Even with session.Refresh() and waiting for (up to) five minutes
[17:03] <niemeyer> natefinch: Yes, you can always ask the server for its status
[17:03] <voidspace> niemeyer: how specifically?
[17:03] <voidspace> calling CurrentStatus(session) is failing with connection refused
[17:04] <niemeyer> voidspace: http://docs.mongodb.org/manual/reference/command/replSetGetStatus/
[17:04] <voidspace> niemeyer: that's precisely what CurrentStatus is doing
[17:04] <niemeyer> voidspace: If the connection is refused, you know the status :)
[17:04] <voidspace> niemeyer: any idea *why* sometimes our connections die like that and just don't come back
[17:04] <niemeyer> voidspace: Okay, that's not what Nate said above
[17:04] <niemeyer> voidspace: Hmm
[17:05] <niemeyer> voidspace: Die with connection refused?
[17:05] <voidspace> [LOG] 6:43.772 DEBUG juju.testing tls.Dial(127.0.0.1:35846) failed with dial tcp 127.0.0.1:35846: connection refused
[17:05] <natefinch> niemeyer: sorry... what I mean is - we tell it to initiate... and then can never get it to respond
[17:05] <niemeyer> voidspace: The TCP port is not open..
[17:06] <niemeyer> natefinch: Look at the logs
[17:06] <niemeyer> natefinch: I've never seen anything similar before
[17:06] <niemeyer> natefinch: the test suite of mgo routinely shoot servers down and bring them back up
[17:06] <natefinch> niemeyer: it's the single most common failure for our tests - mongo going away and never coming back
[17:07] <natefinch> niemeyer: it's quite likely we're just doing something wrong, we just don't know what that is.
[17:07] <niemeyer> natefinch: That makes no sense.. a connection refusal is a TCP port not open, which in general means MongoDB is not even running
[17:07] <niemeyer> natefinch: I'd look at the logs to see why
[17:07] <natefinch> niemeyer: it's not always connection refusal... that's the problem this time, often times the dial will just time out eventually
[17:08] <niemeyer> natefinch: Heh..
[17:08] <voidspace> that particular failure was during a call to instance.MustDialDirect() - *after* waiting for CurrentStatus to report all members up
[17:08] <niemeyer> natefinch: First thing to do is make up your mind about what the symptom is :)
[17:08] <voidspace> well, I just did another test run and got the same symptom
[17:08] <voidspace> [LOG] 6:43.764 DEBUG juju.testing tls.Dial(127.0.0.1:37222) failed with dial tcp 127.0.0.1:37222: connection refused
[17:09] <niemeyer> Yeah, that's a server down.. the logs will say why
[17:10] <natefinch> voidspace: I think you'll need to hack the code a little to prevent gocheck from cleaning up the mongo directory, so you can look at the logs
[17:10] <voidspace> niemeyer: do you know where the logs should be? I've got a horrible feeling we redirect mongo logging somewhere useless.
[17:10] <voidspace> natefinch: ah, right
[17:10] <voidspace> natefinch: when we start mongo don't we get it to log to standard out so we can parse the logs...
[17:10] <voidspace> natefinch: meaning we get no logs
[17:11] <voidspace> natefinch: or does it log to the directory as well?
[17:11] <niemeyer> voidspace, natefinch: -check.work will prevent it from being removed, and display it as well
[17:11] <voidspace> niemeyer: cool, thanks
[17:11] <natefinch> niemeyer: oh, awesome, thanks
[17:11] <niemeyer> voidspace: But I don't know where the logs are being sent to
[17:11] <natefinch> voidspace: I'm pretty sure mongo's logs are still written to disk, but I honestly don't remember
[17:15] <voidspace> we're still using the launchpad version of gocheck of course
[17:15] <voidspace> wasn't there a thread about that?
[17:16] <voidspace> yeah, looks like we're about to update
[17:18] <natefinch> niemeyer: is that check.work flag available on launchpad's gocheck?  I can't find docs on the flags it takes
[17:18] <niemeyer> natefinch: -gocheck.work, likely
[17:19] <niemeyer> natefinch: -help on the test binary, or just passing a wrong flag, will print the options
[17:19] <voidspace> I don't think it is available
[17:19] <voidspace> we're at the latest revision of launchpad
[17:22] <voidspace> natefinch: copying the gopkg.in one over the top of the launchpad one seems to work though
[17:22] <voidspace> :-p
[17:33] <natefinch> voidspace: heh, we're lucky we always rename the package, otherwise that wouldn't work
[17:35] <voidspace> right
[17:35] <wwitzel3> woo, I have passing tests!
[17:39] <natefinch> anyone know why I'd get "cannot open ports 80-80/tcp on machine 5 due to conflict" when I re-ran my install hook?  Shouldn't open-port be idempotent?
[17:44] <gsamfira> natefinch: there was a discussion on the mailing list about this a while back. Subject was "Port ranges - restricting opening and closing ranges". Not sure of the conclusion on that though
[17:45] <gsamfira> https://lists.ubuntu.com/archives/juju-dev/2014-August/003131.html
[21:57] <perrito666> anyone know the difference between using net.Listen("tcp", "localhost:0") and net.Listen("tcp", ":0") ?
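For reference, a minimal sketch of the difference perrito666 asks about: "localhost:0" resolves localhost and listens only on the loopback interface, while ":0" listens on the wildcard address, i.e. every interface; in both cases port 0 asks the kernel for a free ephemeral port.

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        // Loopback only: unreachable from other hosts.
        loopback, err := net.Listen("tcp", "localhost:0")
        if err != nil {
            panic(err)
        }
        defer loopback.Close()

        // Wildcard: listens on every interface of the machine.
        anyAddr, err := net.Listen("tcp", ":0")
        if err != nil {
            panic(err)
        }
        defer anyAddr.Close()

        fmt.Println(loopback.Addr()) // e.g. 127.0.0.1:41234
        fmt.Println(anyAddr.Addr())  // e.g. [::]:41235
    }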