[00:03] <hazmat> smoser, interesting.. coreos guys rewrote cloudinit in go..
[00:04] <smoser> i hadn't seen that.
[00:06] <hazmat> smoser, its very limited subset and assumes coreos /systemd https://github.com/coreos/coreos-cloudinit
[00:07] <hazmat> its a bit much for them to call it  cloudinit... its almost zero feature set overlap
[00:16] <perrito666> did anyone see fwereade after this am? (and when I say AM I mean GMT-3 AM)
[00:18] <davecheney> perrito666: its unusual to see him online at this time
[00:19] <perrito666> davecheney: I know, he just said that he was taking a plane and returning later and then I got disconnected
[00:26] <davecheney> perrito666: ok, you probably know more than i then
[00:27] <perrito666> heh tx davecheney
[00:38] <hazmat> hmm.. odd /bin/sh: 1: exec: /var/lib/juju/tools/unit-mysql-0/jujud: not found
[00:41] <sinzui> hazmat, looks like the last message in juju-ci-machine-0's log. Jujud just disappeared 2 weeks ago. Since that machine is the gateway into the ppc testing, we left it where it was
[00:42] <sinzui> thumper, I can hangout now
[00:42] <hazmat> sinzui, its odd its there.. the issue is deployer/simple.go
[00:43] <hazmat> it removes the symlink on failure, but afaics that method never failed, the last line is install the upstart job, and the job is present on disk.
[00:44] <thumper> sinzui: just munching
[00:44] <thumper> with you shortly
[00:44]  * sinzui watches ci
[00:44] <hazmat> sinzui, ie its resolvable with sudo ln -s /var/lib/juju/tools/1.18.1-precise-amd64/  /var/lib/juju/tools/unit-owncloud-0
[00:45] <hazmat> hmm.. its as though the removeOnErr was firing
[00:45] <hazmat> even on success
[00:47]  * sinzui nods
[00:49] <thumper> sinzui: https://plus.google.com/hangouts/_/76cpik697jvk5a93b3md4vcuc8?hl=en
[00:50] <sinzui> wallyworld, jam: looks like all the upgrade test are indeed fixed. I disabled the local-upgrade test for thumper. I will retest when I have the time or when the next rev lands
[00:50] <wallyworld> \o/
[00:50] <thumper> sinzui: do local upgrade and local deploy run on the same machine?
[00:50] <thumper> sinzui: can't hear you
[00:50] <wallyworld> sinzui: so if thumper actually pulls his finger out, we could release 1.19.0 real soon now?
[00:53] <hazmat> deployer worker is a bit strange .. does it use a tombstone to communicate back to the runner?
[00:56] <hazmat> thumper, when you have a moment i'd like to chat as well..
[00:56] <thumper> hazmat: ack
[00:59] <wallyworld> hazmat: the deployer worker is similar to most others, it is created by machine agent but wrapping it inside a worker.NewSimpleWorker
[01:00] <hazmat> wallyworld, ah. thanks
[01:01] <wallyworld> np. that worker stuff still confuses me each time i have to re-read the code
[01:04] <hazmat> the pattern is a bit different
[01:04] <hazmat> trying to figure out why i'd get 2014-04-15 00:00:42 INFO juju runner.go:262 worker: start "1-container-watcher"  .. when there are no containers.. basically my manual provider + lxc seems a bit busted with 1.18
[01:04] <hazmat> also trying to figure out if on a simpleworker erroring, if the runner will just ignore it and move on.
[01:04] <hazmat> with no log
[01:05] <hazmat> the nutshell being deploy workloads gets that jujud not found
[01:05]  * hazmat instruments
[01:08] <thumper> hazmat: whazzup?
[01:11] <hazmat> thumper, trying to debug 1.18 with lxc + manual
[01:11] <hazmat> thumper, mostly in the backlog
[01:11] <sinzui> Wow.
[01:13] <sinzui> abentley replaced the mysql + wordpress charms with dummy charms that instrument and report what juju is up to. They have taken 2-4 minutes off of all the tests
[01:13] <sinzui> Azure deploy in under 20 minutes
[01:14] <sinzui> AWS is almost as fast as HP Cloud
[01:16] <davecheney> sinzui: \o/
[01:17] <waigani> wallyworld: should I patch envtools.BundleTools in a test suite e.g. coretesting? Or should I copy the mocked function to each package that is failing and patch there?
[01:18] <waigani> wallyworld: it's just there seem to be a lot of tests that are all affected/fixed by this patch
[01:18] <wallyworld> use s.PatchValue
[01:18] <waigani> wallyworld: yep I am
[01:18] <waigani> but should I do it in a more generic suite?
[01:18] <wallyworld> so if the failures are clustered in a particular suite, you can use that in SetUpTest
[01:19] <wallyworld> not sure it's worth doing a fixture for a one liner
[01:19] <waigani> wallyworld: that is what I'm doing now, but already I've done that in about 4 packages, with more to go
[01:19] <waigani> wallyworld: oh okay, you mean just patch in each individual test?
[01:20] <wallyworld> possibly, depends on where the failures are
[01:20] <waigani> okay, I'll do it the verbose way and we can cleanup in review if needed
[01:20] <wallyworld> but if the failures are in a manageable number of suites, doing the patch in setuptest makes sense
[01:21] <waigani> okay
[01:28] <thumper> what the actual fuck!
[01:33] <sinzui> wallyworld, CI hates the unit-tests on precise. Have you seen these tests fail consistently in pairs before? http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/run-unit-tests-amd64-precise/617/console
[01:34] <sinzui> ^ The last three runs on different precise instances have the same failure
[01:34] <thumper> sinzui: I have some binaries copying to the machine
[01:35] <wallyworld> sinzui: i haven't seen those. and one of them, TestOpenStateWithoutAdmin, is the test added in the branch i landed for john to make upgrades work
[01:35] <sinzui> thank you thumper.
[01:35] <wallyworld> so it seems there's a mongo/precise issue
[01:36] <wallyworld> thumper: were you running some tests in a precise vm?
[01:37] <thumper> wallyworld: I have a real life precise machine
[01:37] <thumper> wallyworld: that it works fine
[01:37] <thumper> on
[01:37] <thumper> I've hooked up loggo to the mgo internals logging
[01:37] <thumper> so we can get internal mongo logging out of the bootstrap command
[01:37] <thumper> uploading some binaries now
[01:37] <wallyworld> hmm. so what's different on jenkins then to cause the tests to fail
[01:37] <thumper> not sure
[01:38] <thumper> same version of mongo
[01:38] <thumper> my desktop is i386
[01:38] <thumper> ci is amd64
[01:38] <thumper> that is all I can come up with so far
[01:38] <wallyworld> if that is the cause then we're doomed
[01:38] <thumper> :-)
[01:38] <thumper> FSVO doomed
[01:38] <wallyworld> yeah :-)
[01:39] <thumper> the error is that something inside mgo is explicitly closing the socket
[01:39] <thumper> when we ask to set up the replica set
[01:39] <wallyworld> thumper: so, one thing it could be - HA added an admin db
[01:39] <thumper> hence the desire for more logging
[01:39] <thumper> wallyworld: my binaries work locally
[01:39] <thumper> and copying up
[01:39] <thumper> if that is the case
[01:39] <thumper> and my binaries work
[01:39] <wallyworld> and the recently added test which i reference above tests that we can ignore unauth access to that db
[01:39] <thumper> it could be that
[01:40]  * thumper nods
[01:40] <wallyworld> and that test fails
[01:40] <thumper> still copying that file
[01:40]  * thumper waits...
[01:40]  * wallyworld waits too....
[01:40] <thumper> and here I was wanting to sleep
[01:40] <thumper> not feeling too flash
[01:40] <wallyworld> :-(
[01:41] <hazmat> thumper, sinzui fwiw.  my issue was user error around series. i have trusty containers but had registered them as precise, machine agent deployed fine, unit agents didn't like it though. unsupported usage mode.
[01:41] <thumper> haha
[01:41] <hazmat> thumper, conceivably the same happens when you dist-upgrade a machine
[01:41] <sinzui> thumper, wallyworld: the machines that run the unit tests are amd64 m1.larges for precise and trusty. We know 95% of users deploy to amd64
[01:41] <thumper> hmm...
[01:42] <thumper> sinzui: right...
[01:42] <sinzui> we saw numbers that showed a very small number were i386, we assume those are clients, not services
[01:42]  * thumper nods
[01:42] <thumper> wallyworld: can I get you to try the aws reproduction?
[01:42] <thumper> wallyworld: are you busy with anything else?
[01:43] <wallyworld> i am but i can
[01:43] <wallyworld> what's up with aws?
[01:43] <thumper> just trying to replicate the issues that we are seeing on CI with the local provider not bootstrapping
[01:44] <thumper> it works on trusty for me
[01:44] <thumper> and precise/i386
[01:44] <thumper> but we should check real precise amd64
[01:44] <wallyworld> ok, so you want to spin up an aws precise amd64 and try there
[01:45] <thumper> right
[01:45] <wallyworld> okey dokey
[01:45] <thumper> install juju / juju-local
[01:45] <wallyworld> yarp
[01:45] <thumper> probably need to copy local 1.19 binaries
[01:45] <thumper> to avoid building on aws
[01:45] <wallyworld> right
[01:51] <thumper> ugh...
[01:51] <thumper> man I'm confused
[01:57] <thumper> wallyworld: sinzui: using my extra logging http://paste.ubuntu.com/7253010/
[01:57] <thumper> so not a recent fix issue
[01:58] <wallyworld> thumper: we should just disable the replica set stuff
[01:58] <wallyworld> it has broken so much
[01:58] <thumper> perhaps worth doing for the local provider at least
[01:58] <thumper> we are never going to want HA on local
[01:58] <thumper> it makes no sense
[01:59] <sinzui> closed explicitly? That's like the computer says no
[01:59] <thumper> sinzui: ack
[02:01]  * thumper has a call now
[02:06] <sinzui> axw, Is there any more I should say about azure availability sets? https://docs.google.com/a/canonical.com/document/d/1BXYrLC78H3H9Cv4e_4XMcZ3mAkTcp6nx4v1wdN650jw/edit
[02:12] <axw> sinzui: otp
[02:18] <wallyworld> thumper: sinzui: i'm going to test this patch to disable the mongo replicaset setup for local provider https://pastebin.canonical.com/108522/
[02:19] <wallyworld> this should revert local bootstrap to be closer to how it was prior to HA stuff being added
[02:19] <wallyworld> and hence it should remove the error in thumper's log above hopefully
[02:20] <axw> sinzui: can I have permissions to add comments?
[02:21] <thumper> sinzui: this line is a bit suspect 2014-04-15 02:20:44 DEBUG mgo server.go:297 Ping for 127.0.0.1:37019 is 15000 ms
[02:21] <thumper> sinzui: locally I have 0ms
[02:22] <sinzui> sorry axw I gave all canonical write access as I intended
[02:22] <axw> sinzui: ta
[02:22]  * sinzui looks in /etc
[02:23] <axw> sinzui: availability-sets-enabled=true by default; I'll update the notes
[02:27] <thumper> wallyworld: that patch is wrong
[02:27] <wallyworld> i know
[02:27] <wallyworld> found that out
[02:27] <wallyworld> doing it differently
[02:28] <thumper> wallyworld: jujud/bootstrap.go line 165, return there if local
[02:28] <wallyworld> yep
[02:37] <axw> sinzui: I updated the azure section, would you mind reading over it to see if it makes sense to you?
[02:43] <sinzui> Thank you axw. Looks great
[02:44] <thumper> sinzui: wallyworld, axw: bootstrap failure with debug mgo logs: http://paste.ubuntu.com/7253155/
[02:44] <thumper> sinzui: I don't know enough to be able to interpret the errors
[02:44] <thumper> sinzui: perhaps we need gustavo for it
[02:44] <sinzui> thanks for playing thumper
[02:45] <wallyworld> sinzui: can you re-enable local provider tests in CI? i will do a branch to try and fix it and then when landed CI can tell us if it works
[02:45] <thumper> sinzui: I'm done with the machine now
[02:45] <sinzui> I will re-enable the tests
[02:46] <wallyworld> thanks
[02:46] <wallyworld> let's see if the next branch i land works
[02:47] <sinzui> thumper, wallyworld . I think you had decided to disable HA on local...and how would I do HA with local...Does that other machine get proper access to my local machine that probably has died with me at the keyboard
[02:47] <thumper> sinzui: you wouldn't do HA with the local provider
[02:47] <thumper> :)
[02:48] <wallyworld> sinzui: we are trying to set up replicaset and other stuff which is just failing with local and for 1.19 at least, i can't see why we would want that
[02:48] <sinzui> :)
[02:48] <wallyworld> so to get 1.19 out, we can disable and think about it later
[02:49] <sinzui> wallyworld, really, I don't think we ever need to offer HA for local provider.
[02:49] <wallyworld> maybe for testing
[02:49] <wallyworld> but i agree with you
[02:49] <wallyworld> i was being cautious in case others were attached to the idea
[03:45] <wallyworld> axw: this should make local provider happy again on trunkhttps://codereview.appspot.com/87830044
[03:57] <axw> wallyworld: was afk, looking now
[04:01] <wallyworld> ta
[04:05] <axw> wallyworld: reviewed
[04:05] <wallyworld> ta
[04:05] <wallyworld> axw: everyone hates that we use local provider checks in jujud
[04:06] <wallyworld> been a todo for a while to fix
[04:06] <axw> yeah, I kind of wish we didn't have to disable replicasets at all though
[04:06] <axw> I know they're not needed, but if they just worked it would be nice to not have a separate code path
[04:07] <wallyworld> axw: yeah. we could for 1.19.1, but we need 1.19 out the door and HA still isn't quite ready anyway
[04:08] <wallyworld> it is indeed a bandaid. nate added another last week also
[04:09] <axw> wallyworld: yep, understood
[04:09] <wallyworld> makes me sad too though
[05:12] <sinzui> wallyworld, Your hack solved local. The last probable issue is the broken unit tests for precise. I reported bug 1307836
[05:12] <_mup_> Bug #1307836: Ci unititests fail on precise <ci> <precise> <test-failure> <juju-core:Triaged> <https://launchpad.net/bugs/1307836>
[05:13] <wallyworld> sinzui: yeah, i just saw that but didn't think you'd be awake
[05:13] <sinzui> I don't want to be awake
[05:14] <wallyworld> i didn't realise we still had the precise issue :-(
[05:14] <wallyworld> i'll look at the logs
[05:14] <wallyworld> hopefully we'll have some good news when you wake up
[05:22] <sinzui> wallyworld, azure-upgrade hasn't passed yet. It may not because azure is unwell this hour. We don't need to worry about a failure for azure. I can ask for a retest when the cloud is better
[05:23] <wallyworld> righto
[05:23]  * sinzui finds pillow
[05:23] <wallyworld> good night
[05:32] <davecheney>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
[05:32] <davecheney>  7718 ubuntu    20   0 2513408 1.564g  25152 S  45.2 19.6   2:41.51 juju.test
[05:32] <davecheney> memory usage for Go tests is out of control
[05:47] <wallyworld> jam1: you online?
[05:47] <jam1> morning wallyworld
[05:47] <jam1> I am
[05:47] <wallyworld> g'day
[05:47] <wallyworld> jam1: so with you branch, and one i did, CI is happy for upgrades
[05:47] <wallyworld> but
[05:47] <wallyworld> a couple of tests fail under precise
[05:48] <wallyworld> there's the one you added for your branch, plus TestInitializeStateFailsSecondTime
[05:48] <jam1> wallyworld: links to failing tests ?
[05:48] <wallyworld> the error says that a connection to mongo is unauth
[05:48] <wallyworld> http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/run-unit-tests-amd64-precise/621/consoleFull
[05:48] <jam1> wallyworld: and are you able to see the local provider fail with replica set stuff, because neither Tim or I could reproduce it.
[05:49] <wallyworld> yeah, i saw it
[05:49] <wallyworld> and fixed
[05:49] <wallyworld> i had to disable HA for local provider
[05:49] <jam1> and while we don't have to have replica set local, I'd prefer consistency and the ability to test out HA locally if we could
[05:49] <wallyworld> sure
[05:49] <wallyworld> but to get 1.19 out the door i went for a quick fix
[05:49] <wallyworld> which we can revisit in 1.19.1
[05:50] <wallyworld> curtis was ok with that
[05:50] <jam1> wallyworld: so I certainly had a WTF about why I was able to create a machine in "admin" but not able to delete it without logging in as the admin I just created.
[05:50] <jam1> wallyworld: so it seems like some versions of Mongo don't have that security hole
[05:50] <jam1> but I can't figure out how to log in as an actual admin, but I can try digging into the TestInitialize stuff a bit more for my test.
[05:51] <wallyworld> so we are using a different mongo on precise vs trusty?
[05:51] <jam1> wallyworld: 2.4.6 vs 2.4.9
[05:51] <wallyworld> ok, i didn't realise that
[05:51] <jam1> Trusty is the one that lets you do WTF stuff.
[05:51] <wallyworld> :-(
[05:51] <wallyworld> there are 2 failing tests
[05:51] <wallyworld> maybe more, i seem to recall previous logs showing more
[05:52] <wallyworld> but the latest run had 2 failures only
[05:52] <wallyworld> the other one was TestInitializeStateFailsSecondTime
[05:53] <wallyworld> jam1: i gotta run to an appointment soon, but will check back when i return. if we can get this sorted, we can at least release 1.19.0 asap and deal with the workarounds for 1.19.1
[05:54] <jam1> wallyworld: is your code landed?
[05:54] <wallyworld> yep
[05:54] <jam1> k
[05:54] <wallyworld> happy to revert it if we can find a fix
[05:54] <jam1> I'll pick it up
[05:55] <wallyworld> thanks, i can look also but only found out about precise tests just before and sadly i gotta duck out
[05:56] <jam1> hmm... LP failing to load for me right now
[05:56] <davecheney> wallyworld: Ci is running an ancient version of mongo
[05:56] <davecheney> that won't help
[06:02] <jam1> davecheney: sinzui: I would think we should run mongo 2.4.6 which is the one you get from the cloud-archive:tools
[06:04] <davecheney> jam1: agreed
[06:07] <jam1> davecheney: are they running 2.2.4 from the PPA?
[06:09] <davecheney> jam1: good point, 2.0 was all that shipped in precise
[06:09] <jam1> I'm just trying to find a way to reproduce, and I thought there was a 2.4.0 out there for a while, but I can't find it
[06:10] <jam1> and it isn't clear *what* version they are running.
[06:10] <davecheney> jam1: Get:40 http://ppa.launchpad.net/juju/stable/ubuntu/ precise/main mongodb-clients amd64 1:2.2.4-0ubuntu1~ubuntu12.04.1~juju1 [20.1 MB]
[06:10] <davecheney> Get:41 http://ppa.launchpad.net/juju/stable/ubuntu/ precise/main mongodb-server amd64 1:2.2.4-0ubuntu1~ubuntu12.04.1~juju1 [5,135 kB]
[06:10] <davecheney> this is our fault
[06:10] <davecheney> remember that old ppa
[06:10] <jam1> yep, thanks for pointing me to it
[06:10] <jam1> well, I can at least test with it.
[06:10] <davecheney> so, that isn't the cloud archive
[06:10] <davecheney> :emoji concerned face
[06:10] <jam1> At one point we probably wanted to maintain compat with 2.2.4, but I'm not *as* concerned with it anymore.
[06:12] <davecheney> 2.2.4 never shipped in any main archive
[06:12] <davecheney> i don't think we have a duty of compatibility
[06:12] <davecheney> https://bugs.launchpad.net/juju-core/+bug/1307289/comments/1
[06:12] <davecheney> if anyone cares
[06:12] <davecheney> btw, go test ./cmd/juju{,d}
[06:12] <davecheney> takes an age because the test setup is constantly recompiling the tools
[06:27] <davecheney> why are the cmd/juju tests calling out to bzr ?
[06:33] <davecheney>  FAIL: publish_test.go:75: PublishSuite.SetUpTest
[06:33] <davecheney> publish_test.go:86:
[06:33] <davecheney>     c.Assert(err, gc.IsNil)
[06:33] <davecheney> ... value *errors.errorString = &errors.errorString{s:"error running \"bzr init\": exec: \"bzr\": executable file not found in $PATH"} ("error running \"bzr init\": exec:
[06:33] <davecheney>  \"bzr\": executable file not found in $PATH")
[06:33] <davecheney> what is this shit ?
[06:44] <rogpeppe> mornin' all
[06:47] <davecheney> https://bugs.launchpad.net/juju-core/+bug/1307865
[06:47] <davecheney> this seems like an obvious failure
[06:47] <davecheney> why does it only happen sporadically ?
[06:47] <rogpeppe> davecheney: that's been the case for over a year (tests running bzr)
[06:48] <davecheney> rogpeppe: fair enough
[06:48] <rogpeppe> davecheney: i agree, that does seem odd
[06:50] <jam1> rogpeppe: do we have thoughts on how we would have a Provider work that didn't have storage? I know we don't particularly prefer the HTTP Storage stuff that we have.
[06:51] <rogpeppe> jam1: we'd need to provide something to the provider that enabled it to fetch tools from the mongo-based storage
[06:51] <jam1> rogpeppe: so we'd have to do away with "provider-state" file as well, right?
[06:51] <rogpeppe> jam1: other than that, i don't think providers rely much on storage, do they?
[06:51] <jam1> rogpeppe: we use it for charms
[06:52] <rogpeppe> jam1: so... provider-state is *supposed* to be an implementation detail of a given provider
[06:52] <jam1> sure
[06:52] <jam1> it is in the "common code" path, but you wouldn't have to use it/could make that part optional
[06:53] <rogpeppe> jam1: we don't really rely on it much these days
[06:53] <jam1> rogpeppe: we'd want bootstrap to cache the API creds and then we rely on it very little
[06:53] <jam1> you'd lose the fallback path
[06:54] <rogpeppe> jam1: yeah, and we don't want to lose that entirely
[06:54] <rogpeppe> jam1: for a provider-state replacement, i'd like to see the fallback path factored out of the providers entirely
[06:54] <jam1> well, it only works because there is a "known location" we can look in that is reasonably reliable. If a cloud doesn't provide its own storage, then any other location is just guesswork
[06:55] <jam1> anyway, switching machines now
[06:55] <rogpeppe> jam1: ok
[06:56] <rogpeppe> axw: looking at http://paste.ubuntu.com/7252280/, in the first status machines 3 and 4 are up AFAICS.
[06:57] <rogpeppe> axw: and that's the status that i am presuming that ensure-availability was acting on
[06:57] <axw> rogpeppe: in the first one, yes, but how do you know when they went down?
[06:58] <axw> rogpeppe: my point was it could have changed since you did "juju status"
[06:58] <rogpeppe> axw: there was a very short time between the first status and calling ensure-availability. i don't see any particular reason for it to have gone down in that time period, although of course i can't be absolutely sure
[06:58] <axw> right, that's why I asked about the log. I'm really only guessing
[06:59] <rogpeppe> axw: luckily i still have all the machines up, so i can check the log
[06:59] <axw> rogpeppe: I see no reason why the agent would have gone down after calling ensure-availability either
[06:59] <axw> cool
[07:00] <rogpeppe> axw: it would necessarily go down after calling ensure-availability, because mongo reconfigures itself and agents get thrown out
[07:01] <axw> rogpeppe: for *all* machines? not just the shunned ones?
[07:01] <rogpeppe> axw: yeah
[07:01] <rogpeppe> axw: we could really do with some logging in ensure-availability to give us some insight into why it's making the decisions it is
[07:02] <axw> yeah, fair enough
[07:16] <rogpeppe> axw: here's the relevant log: http://paste.ubuntu.com/7252375/
[07:17] <rogpeppe> axw: the relevant EnsureAvailability call is the second one, i think
[07:17] <rogpeppe> axw: it's surprising that the connection goes down so quickly after that call
[07:17] <axw> rogpeppe: wrong pastebin?
[07:18] <rogpeppe> axw: ha, yes: http://paste.ubuntu.com/7253848/
[07:38] <axw> rogpeppe: machine-3's API workers have dialled to machine-0's API server ...
[07:38] <axw> rogpeppe: not saying that's the cause, but it's strange I think
[07:39] <rogpeppe> axw: that's not ideal, but it's understandable
[07:39] <rogpeppe> axw: one change i want to make is to make every environ manager machine dial the API server only on its own machine
[07:39] <axw> yep
[07:42] <jam> axw: rogpeppe: right, we originally only wrote "localhost" into the agent.conf. I think the bug is that the connection caching logic is overwriting that ?
[07:42] <rogpeppe> jam: yeah - each agent watches the api addresses and caches them
[07:43] <jam> rogpeppe: I thought when we spec'd the work we were going to explicitly skip overwritting when the agents were "localhost"
[07:43] <rogpeppe> jam: but also, the first API address received by a new agent is not going to be localhost
[07:43] <jam> rogpeppe: well, the thing that monitors it could just do if self.IsMaster() => localhost
[07:44] <rogpeppe> jam: i don't remember that explicitly
[07:44] <jam> or not run the address poller if IsMaster
[07:44] <jam> sorry
[07:44] <jam> IsManager
[07:44] <jam> not Master
[07:44] <rogpeppe> jam: i don't think it's IsMaster - i think it's is-environ-manager
[07:44] <rogpeppe> jam: right
[07:45] <rogpeppe> jam: i've been thinking about whether to run the address poller if we're an environ manager
[07:45] <rogpeppe> s/poller/watcher/
[07:45] <rogpeppe> jam: my general feeling is that it is probably worth it anyway
[07:46] <rogpeppe> jam: because machines can lose their environment manager status
[07:46] <rogpeppe> jam: even though we don't fully support that yet
[07:47] <jam> rogpeppe: won't they get bounced under that circumstance?
[07:47] <jam> anyway, we can either simplify it by what we write in agent.conf, or we could detect that we are IsManager and if so force localhost at api.Open time.
[07:48] <rogpeppe> jam: they'll get bounced, but if they do we want them to know where the other API hosts are
[07:48] <rogpeppe> jam: i was thinking of going for your latter option above
[07:51] <axw> rogpeppe: I can't really see much from the logs, I'm afraid. there is one interesting thing: "dialled mongo successfully" just after FullStatus and before EnsureAvailability
[07:51] <rogpeppe> axw: i couldn't glean much from them either
[07:51] <rogpeppe> axw: i'm just doing a branch that adds some logging to EnsureAvailability
[07:52] <rogpeppe> axw: then i'll try the live tests again to see if i can see what's going on
[07:55] <axw> rogpeppe: any idea why agent-state shows up as "down" just after I bootstrap? should FullStatus be forcing a resynchronisation of state?
[07:55] <rogpeppe> axw: i think it's because the presence data hasn't caught up
[07:55] <axw> rogpeppe: oh. I wonder if that's it? FullStatus may be reporting wrong agent state in your test too
[07:55] <rogpeppe> axw: we should definitely look into that
[07:56] <rogpeppe> axw: i think that FullStatus probably sees the same agent state that the ensure availability function is seeing
[07:56] <axw> rogpeppe: yeah, true
[08:09] <axw> rogpeppe: https://codereview.appspot.com/88030043
[08:09] <rogpeppe> axw: nice one! looking.
[08:10] <axw> jam: I've reverted your change from last night that eats admin login errors; this CL adds machine-0 to the admin db if it isn't there already
[08:11] <jam> axw: any chance that we could get the port from mongo rather than passing it in?
[08:11] <axw> rogpeppe: this is just the bare minimum, will follow up with maybeInitiateMongoServer, etc.
[08:11] <axw> jam: can do, but it requires parsing and I thought it may as well get passed in since it's already known to the caller
[08:12] <jam> axw: well we can have mongo start on port "0" and dynamically allocate, rather than our current ugly hack of allocating a port, and then closing it and hoping we don't race.
[08:12] <axw> jam: I assume you are referring to the EnsureAdminUserParams.Port field
[08:12] <jam> axw: if it is clumsy to parse, then we can pass it in.
[08:12] <axw> oh I see what you mean
[08:13] <axw> umm. dunno. I will take a look
[08:13] <jam> we *can* just start on port 37017, but that means other goroutines will also think that mongo is up, and for noauth stuff, we really want as little as possible to connect to it.
[08:14] <jam> axw: I always get thrown off by "upstart.NewService" because *no we don't want to create a new upstart service*
[08:14] <jam> but that is just "create a new memory representation of an upstart service"
[08:15] <axw> jam: heh yeah, it is a misleading name
[08:15] <jam> axw: I'm not sure why upstart specifically throws me off.
[08:15] <jam> as I certainly know the pattern.
[08:16] <jam> axw: can "defer cmd.Process.Kill()" do bad things if the process has already died ?
[08:16] <jam> axw: is it possible to do EnsureAdminUser as an upgrade step rather than doing it on all boots?
[08:17] <axw> jam: if the pid got reused very quickly, yes I think so
[08:17] <jam> axw: I'm not particularly worried about PID reuse that fast
[08:17] <axw> jam: not really feasible as an upgrade step, as they require an API connection
[08:17] <jam> I'm more wondering about a panic because the PID didn't exist
[08:17] <axw> then there's all sorts of horrible interactions with workers dying and restarting all the others, etc.
[08:19] <axw> jam: I'm pretty certain it's safe, but I'll double check
[08:19] <wallyworld> jam: hi, any update on the precise tests failures?
[08:22] <axw> jam: late Kill does not cause a panic
[08:23] <jam> wallyworld: they pass with mongo 2.4.6 from cloud-archive:tools, they fail with 2.2.4 from ppa:juju/stable
[08:23] <jam> on all machines that matter we use cloud-archive:tools
[08:23] <jam> wallyworld: so CI should be using that one
[08:23] <wallyworld> great, so we can look to release 1.19
[08:23] <jam> wallyworld: and axw has a patch that replaces my change anyway.
[08:23] <jam> wallyworld: the replicaset failure isn't one that I could reproduce...
[08:24] <jam> since it is flaky
[08:24] <wallyworld> hmmm. i hate those
[08:24] <wallyworld> CI could reproduce it
[08:24] <jam> wallyworld: it is *possible* we just need to wait longer, but I hate those as well :)
[08:24] <axw> jam: this is what happens if you try to use "--port 0" in mongod: http://paste.ubuntu.com/7254007/
[08:25] <jam> axw: bleh.... ok
[08:25] <jam> I don't think we want to use the "default mongo port of 27017" so we might as well use our own since we know we just stopped the machine
[08:25] <jam> stopped the service
[08:33] <rogpeppe> axw: reviewed
[08:34] <axw> thanks
[08:37] <rogpeppe> jam: using info.StatePort seems right to me (at least in production).
[08:38] <jam> rogpeppe: for "bring this up in single user mode so we can poke at secrets and then restart it" I'd prefer it was more hidden than that, but I can live with StatePort being good-enough.
[08:39] <rogpeppe> jam: if there's someone sitting on localhost waiting for the fraction of a second during which we add the admin user, i think the user is probably not going to be happy anyway
[08:40] <rogpeppe> jam: note that the vulnerability is *only* to processes running on the local machine
[08:40] <rogpeppe> jam: and if there are untrusted processes running on the bootstrap machine, they're in trouble anyway
[08:41] <jam> rogpeppe: I'm actually more worried about the other goroutines in the existing process waking up, connecting, thinking to do work, and then getting shut down again.
[08:42] <jam> rogpeppe: more from a cleanliness than a "omg we broke security" perspective
[08:42] <rogpeppe> jam: what goroutines would those be?
[08:43] <jam> rogpeppe: so this is more about "lets not force ourselves to think critically about everything we are doing and be extra careful that we never run something we thought we weren't". Vs "just don't expose something we don't want exposed so we can trust nothing can be connected to it."
[08:44] <rogpeppe> jam: AFAIK there are only two goroutines that connect to the state - the StateWorker (which we're in, and which hasn't started anything yet) and the upgrader (which requires an API connection, which we can't have yet because the StateWorker hasn't come up yet)
[08:45] <rogpeppe> jam: even if we *are* allowed to connect to the mongo, i don't think we can do anything nasty accidentally
[08:46] <rogpeppe> jam: well, i suppose we could if were malicious
[08:47] <axw> rogpeppe: I tested by upgrading from 1.18.1. that's good enough right?
[08:47] <rogpeppe> axw: i think so, yeah
[09:26] <waigani> wallyworld: branch is up: https://codereview.appspot.com/87130045 :)
[09:27] <waigani> lbox didn't update the description on codereview, but did on lp??
[09:27] <waigani> anyway, bedtime for me.
[09:27] <waigani> night all
[09:44] <natefinch> morning all
[09:47] <jam> morning natefinch
[09:48] <rogpeppe> axw: https://codereview.appspot.com/88080043
[09:48] <rogpeppe> axw: a bit of a refactoring of EnsureAvailability - hope you approve
[09:48] <wallyworld> waigani: ok
[09:49] <axw> rogpeppe: cooking dinner, will take a look a bit later
[09:52] <rogpeppe> jam, natefinch, mgz: review of above would be appreciated
[09:54] <natefinch> rogpeppe: sure
[10:00] <rogpeppe> natefinch: have you pushed your latest revision of 041-moremongo ?
[10:01] <rogpeppe> natefinch: (i want to merge it with trunk, but i don't want us to duplicate that work, as wallyworld's recent changes produce fairly nasty conflicts)
[10:01] <wallyworld> rogpeppe: if you can fix local provider, feel free to revert my work
[10:01] <wallyworld> i only landed it to get 1.19 out the door
[10:02] <wallyworld> and local provider + HA (mongo replicasets) = fail :-(
[10:02] <rogpeppe> wallyworld: it seemed to work ok for me actually
[10:02] <wallyworld> not for me or CI sadly
[10:03] <natefinch> rogpeppe: it's pushed now
[10:03] <rogpeppe> wallyworld: how did it fail?
[10:03] <wallyworld> CI has been broken for days
[10:03] <wallyworld> mongo didn't start
[10:03] <wallyworld> hence machine agent didn't come up
[10:03] <rogpeppe> wallyworld: what was the error from mongo?
[10:04] <jamespage> sinzui, I think I just got an ack to use 1.16.6 via SRU to support the MRE for juju-core
[10:04] <wallyworld> um, can't recall exactly, it will be in the CI logs
[10:04] <jamespage> sinzui, I'll push forwards getting it into proposed this week
[10:04] <wallyworld> my local dir is now blown away
[10:04] <rogpeppe> wallyworld: np, just interested
[10:05] <wallyworld> sorry, i should have taken better notes
[10:05] <wallyworld> rogpeppe: i think that there wasn't much in the mongo logs from memory, tim had to enable extra logging
[10:05] <wallyworld> he was debugging why stuff fails on precise
[10:06] <wallyworld> but we know now that's due to 2.2.6 vs 2.4.9
[10:11] <natefinch> rogpeppe: are there tests for that EnsureAvailability code?
[10:11] <rogpeppe> natefinch: yes
[10:11] <natefinch> rogpeppe:  cool
[10:11] <rogpeppe> natefinch: the semantics are unaffected, so the tests remain the same
[10:12] <natefinch> rogpeppe:  awesome, that's what I figured.
[10:19] <axw> rogpeppe: reviewed. thanks, it's a little clearer now
[10:22] <rogpeppe> axw: thanks a lot
[10:29] <jam> wallyworld: rogpeppe: The error I saw in CI was when Initiate went to do a replicaSet operation, it would get an Explicitly Closed message.
[10:29] <jam> Note, though, that CI has been testing with mongo 2.2.4 for quite some time.
[10:30] <jam> (and still is today, AFAIK, though I'm trying to push to get them to upgrade)
[10:30] <rogpeppe> jam: interesting
[10:30] <jam> rogpeppe: https://bugs.launchpad.net/juju-core/+bug/1306212
[10:30] <_mup_> Bug #1306212: juju bootstrap fails with local provider <bootstrap> <ci> <local-provider> <regression> <juju-core:In Progress by jameinel> <https://launchpad.net/bugs/1306212>
[10:30] <wallyworld> yes, i do recall that was one of the errors
[10:30] <jam> 2014-04-10 04:57:43 INFO juju.replicaset replicaset.go:36 Initiating replicaset with config replicaset.Config{Name:"juju", Version:1, Members:[]replicaset.Member{replicaset.Member{Id:1, Address:"10.0.3.1:37019", Arbiter:(*bool)(nil), BuildIndexes:(*bool)(nil), Hidden:(*bool)(nil), Priority:(*float64)(nil), Tags:map[string]string(nil), SlaveDelay:(*time.Duration)(nil), Votes:(*int)(nil)}}} 2014-04-10 04:58:18 ERROR juju.cmd supercommand.go:299 cannot initiat
[10:31] <jam> rogpeppe: natefinch: I wrote this patch https://code.launchpad.net/~jameinel/juju-core/log-mongo-version/+merge/215656 to help us debug that sort of thing if anyone wants to review it
[10:31] <wallyworld> although i'm running 2.4.9 locally and still had issues
[10:31] <jam> wallyworld: interesting, as neither myself nor tim were able to reproduce it
[10:31] <jam> and I tried 2.4.9 on Trusty and 2.4.6 on Precise
[10:32] <jam> local bootstrap always just worked
[10:32] <rogpeppe> natefinch: i've merged trunk now - you can pull from lp:~rogpeppe/juju-core/natefinch-041-moremongo
[10:32] <wallyworld> all i know is that it didn't work, and then i disabled --replSet in the upstart script and it worked
[10:32] <jam> though... hmmm. I did run into godeps issues once, so it is possible juju bootstrap wasn't actually the trunk I thought it was.
[10:32] <wallyworld> and that also then fixed CI
[10:33] <natefinch> jam: I think I've seen the explicitly closed bug once or twice.
[10:33] <jam> natefinch: CI has apparently been seeing it reliably for 4+ days
[10:34] <jam> wallyworld: CI passed local-deploy in r 2628 http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/local-deploy/
[10:34] <jam> and now even with axw's 2629 patch
[10:35] <natefinch> jam: google brings up a 2012 convo with gustavo about it where the culprit seemed to be load on mongo, but not definitively.  We should mention it to him
[10:35] <jam> natefinch: given this is during bootstrap, there should be 0 load on mongo
[10:35] <wallyworld> jam: 2628 was my patch to make it work
[10:36] <jam> wallyworld: certainly, just mentioning CI saw it and was happy again
[10:36] <wallyworld> don't worry, i was watching it :-)
[10:36] <jam> natefinch: so I just checked the previous 6 attempts, and all of them failed with replica set: Closed explicitly.
[10:37] <natefinch> rogpeppe: thanks
[10:37] <jam> natefinch: note that 2.2.4 failed other tests that use TestInitiate
[10:37] <natefinch> jam: the important part was: we should talk to Gustavo
[10:37] <natefinch> (where we probably means me :)
[10:37] <jam> with being unable to handle admin logins.
[10:37] <natefinch> jam: interesting
[10:38] <natefinch> jam: any chance we can abandon 2.2.4?
[10:38]  * natefinch loves dropping support for things
[10:38] <jam> natefinch: hopefully. It shouldn't be used in the field. It is only in our ppa, which means only Quantal gets it.
[10:38] <jam> http://docs.mongodb.org/v2.4/reference/method/db.addUser/
[10:38] <jam> says it was changed in 2.4
[10:39] <jam> and is superseded in 2.6 by createUser
[10:39] <jam> natefinch: and the mgo docs say we should be using UpsertUser: http://godoc.org/labix.org/v2/mgo#Database.UpsertUser
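The compatibility problem in code terms: which user-management command juju should issue depends on the server version (per the docs linked above, addUser's semantics changed in 2.4 and createUser supersedes it in 2.6; mgo's UpsertUser exists to paper over this). A hedged, stdlib-only sketch of the dispatch - illustrative only, real code would call mgo's Database.UpsertUser:

```go
package main

import "fmt"

// pickUserCommand returns the mongo user-management command
// appropriate for a given server version, following the doc
// references above. Illustrative only; not how mgo does it
// internally.
func pickUserCommand(major, minor int) string {
	switch {
	case major > 2 || (major == 2 && minor >= 6):
		return "createUser" // supersedes addUser in 2.6
	default:
		return "addUser" // pre-2.6 (semantics changed in 2.4)
	}
}

func main() {
	for _, v := range [][2]int{{2, 2}, {2, 4}, {2, 6}} {
		fmt.Printf("%d.%d -> %s\n", v[0], v[1], pickUserCommand(v[0], v[1]))
	}
}
```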
[10:40] <mgz> we drop quantal support... tomorrow
[10:40] <jam> natefinch: seems like mongo's security model is unstable over 2.2/2.4/2.6 which doesn't bode very well for us managing compatibility
[10:40] <mgz> no, end of the week
[10:40] <jam> mgz: well it wouldn't be hard to just put 2.4.6 into ppa:juju/stable
[10:40] <jam> regardless of Q
[10:41] <natefinch> jam: that seems wise
[10:41] <jam> mgz: that would also "fix" CI, because they seem to install it from the PPA as well
[10:41] <mgz> well, we'd have upgrade questions like that
[10:41] <mgz> but yeah
[10:41] <jam> jamespage: ^^ is it possible to get 2.4.6 into ppa:juju/stable ?
[10:43] <natefinch> rogpeppe: if you want to work on that moremongo branch, I can try to get that localhost stateinfo branch in a testable state.
[10:43] <rogpeppe> natefinch: ok
[10:43] <rogpeppe> natefinch: what more needs to be done in the moremongo branch?
[10:43] <jamespage> jam: context?
[10:44] <jam> jamespage: CI and Quantal users will install MongoDB from ppa:juju/stable, but it is currently 2.2.4 which is "really old" now.
[10:44] <jam> So if we could just grab the one in cloud-archive:tools (2.4.6) it would make our lives more consistent.
[10:44] <jam> I believe that is the version in Saucy, and Trusty has 2.4.9
[10:54] <jamespage> jam: I've pushed a no-change backport of 2.4.6 for 12.04 and 12.10 into https://launchpad.net/~james-page/+archive/juju-stable-testing
[10:54] <jamespage> just to see if it works
[10:54] <jamespage> I have a suspicion that its not a no-change backport
[10:54] <jam> jamespage: we only really need it for P for the CI guys
[10:55] <jam> since Q is going EOL
[11:06] <jam> jamespage: we can potentially just point them at cloud-archive:tools if it is a problem
[11:28] <jamespage> jam: that might be better
[11:28] <jamespage> jam: that way they will get the best mongodb that has been released with ubuntu
[11:30] <jam> jamespage: well, we need them to be testing against the version that we'll be installing more than "just the best", but given that we install from there ourselves it seems to fit.
[11:30] <jam> There is a question about U
[11:30] <jam> given that it won't be in cloud-tools
[11:30] <jam> so we may have to do another PPA trick
[11:32] <jam> natefinch: the recent failure of rogpeppe's branch in TestAddRemoveSet is interesting. It seems to be spinning on: attempting Set got error: replSetReconfig command must be sent to the current replica set primary.
[11:33] <jam> context: https://code.launchpad.net/~rogpeppe/juju-core/548-destroy-environment-fix/+merge/215697
[11:34] <jamespage> jam: what about U?
[11:35] <jam> jamespage: in the version after Trusty, how do we install the "best for U". For Q we had to use the ppa:juju/stable because P was the only thing in cloud-archive:tools
[11:35] <jam> which then got out of date
[11:35] <jam> We didn't have to for S because the "best" was the same thing as in cloud-archive:tools
[11:35] <jamespage> jam: the best for U will be in U
[11:36] <jam> jamespage: well, it wasn't in Q
[11:36] <jam> and when V comes out, it may no longer be the best for U, right?
[11:36] <jamespage> jam: that's probably because 18 months ago this was all foobar
[11:36] <jamespage> go juju did not exist in any meaningful way
[11:37] <jam> jamespage: sure. I can just see that 2.6 is released upstream, and we may encounter another "when do we get 2.6 in Ubuntu" where the threshold is at an inconvenient point
[11:37] <jamespage> jam: you must maintain 2.4 compat as that's whats in 14.04
[11:42] <rogpeppe> jam, natefinch, mgz: how about this? http://paste.ubuntu.com/7254781/
[11:43] <mgz> rogpeppe: seems reasonable
[11:43] <mgz> I prefer interpreted values to raw dumps of fields in status
[11:44] <mgz> as it's the funny mix between for-machines markup and textual output for the user
[11:44] <natefinch> rogpeppe: when does a machine get into n, n?
[11:45] <rogpeppe> natefinch: when it's deactivated by the peergrouper worker
[11:45] <natefinch> but why would that happen?
[11:45] <rogpeppe> natefinch: ok, here's an example:
[11:46] <rogpeppe> natefinch: we have an active server (wantvote, hasvote)
[11:46] <rogpeppe> natefinch: it dies
[11:46] <rogpeppe> natefinch: we run ensure-availability
[11:46] <rogpeppe> natefinch: which sees that the machine is inactive, and marks it as !wantsvote
[11:47] <rogpeppe> natefinch: the peergrouper worker sees that the machine no longer wants the vote, and removes its vote
[11:47] <rogpeppe> natefinch: and sets its hasvote status to n
[11:47] <rogpeppe> natefinch: so our machine now has status (!wantsvote, !hasvote)
[11:48] <rogpeppe> natefinch: if we then run ensureavailability again, that machine is now a candidate for having its environ-manager status removed
[11:48] <rogpeppe> natefinch: alternatively, the machine might come back up again
[11:48] <natefinch> I see, so hasvote is actual replicaset status, and wants vote is what we want the replicaset status to be
[11:48] <rogpeppe> natefinch: yes
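The walkthrough above can be modeled in a few lines. These are toy types and function names, not juju's real state or worker code; they just encode the transitions rogpeppe describes:

```go
package main

import "fmt"

// Toy model of the two flags: wantsVote is the desired state set by
// ensure-availability, hasVote is the actual replica set state
// reconciled by the peergrouper worker.
type machine struct {
	alive     bool
	wantsVote bool
	hasVote   bool
}

// ensureAvailability demotes dead voting members (wantsVote -> false)
// and reports whether a (!wantsVote, !hasVote) machine is now a
// candidate for losing its environ-manager status.
func ensureAvailability(m *machine) (removable bool) {
	if !m.alive {
		m.wantsVote = false
	}
	return !m.wantsVote && !m.hasVote
}

// peergrouper brings hasVote in line with wantsVote, as the worker
// does against the actual replica set.
func peergrouper(m *machine) {
	m.hasVote = m.wantsVote
}

func main() {
	m := &machine{alive: true, wantsVote: true, hasVote: true}
	m.alive = false                    // the active server dies
	fmt.Println(ensureAvailability(m)) // marks !wantsVote; vote still held
	peergrouper(m)                     // removes its vote
	fmt.Println(ensureAvailability(m)) // now a removal candidate
}
```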
[11:49] <natefinch> sorry gotta run, forgot it's tuesday
[11:49] <jam> mgz: so the branch up for review (which is approved) actually has the errors as a prereq
[11:50] <mgz> jam: yeah, I was sure there was something like that
[11:54] <rogpeppe> natefinch: i've dealt with a bunch more conflicts merging trunk and pushed the result: ~rogpeppe/juju-core/natefinch-041-moremongo
[12:20] <rogpeppe> natefinch: ping
[12:32]  * rogpeppe goes for lunch
[12:42] <natefinch> rogpeppe: sorry, just got back
[12:54] <axw> rogpeppe natefinch: I'll continue looking at HA upgrade - upstart rewriting and MaybeInitiateMongoServer in the machine agent. Let me know if there's anything else I should look at
[12:55] <natefinch> axw: that seems like a good thing to do for now.  the rewriting should work as-is, once we remove the line that bypasses it
[12:56] <axw> natefinch: it doesn't quite, because the replset needs to be initiated too
[12:56] <axw> natefinch: and that's slightly complicated because that requires the internal addresses from the environment
[12:58] <natefinch> axw: you should be able to get the addresses off the instance and pass it into SelectPeerAddress, and get the right one.  That's what jujud/bootstrap.go does.  Should work in the agent, too, I'd think
[12:59] <axw> natefinch: yep, the only problem is getting the Environ. the bootstrap agent gets a complete environ config handed to it; the machine agent needs to go to state
[13:00] <axw> natefinch: anyway, I will continue on with that. if you think of something else I can look at next, maybe just send me an email
[13:00] <natefinch> axw: will do, and thanks
[13:00] <axw> nps
[13:29] <rogpeppe> natefinch: that's ok
[13:30] <rogpeppe> natefinch: how's localstateinfo coming along?
[13:35] <rogpeppe> mgz, jam, natefinch: trivial (two line) code review anyone? fixes a sporadic test failure. https://codereview.appspot.com/88130044
[13:35] <natefinch> rogpeppe: haven't gotten far this morning.  My wife should be back any minute to take the baby off my hands, which will make things go faster
[13:35] <rogpeppe> natefinch: k
[13:36] <jam> rogpeppe: shouldn't there be an associated test sort of change ?
[13:36] <natefinch> rogpeppe: how does that change fix the test failure?
[13:37] <rogpeppe> jam: the reason for the fix is a test failure
[13:37] <rogpeppe> jam: i can add another test, i guess
[13:37] <natefinch> ideally a test that otherwise always fails :)
[13:37] <jam> rogpeppe: so this is that sometimes during teardown we would hit this and then not restart because it was the wrong type ?
[13:38] <rogpeppe> jam: the test failure was this: http://paste.ubuntu.com/7255340/
[13:39] <rogpeppe> jam: i'm actually not quite sure why it is sporadic
[13:40] <natefinch> I see, we always expect it to be errterminateagent, but we were masking that along with other failures
[13:40] <rogpeppe> natefinch: yes
[13:41] <natefinch> rogpeppe: how does the defer interact with locally scoped err variables inside if statements etc?
[13:41] <natefinch> maybe that's the problem?  It's modifying the outside err, but we're returning a different one
[13:41] <rogpeppe> natefinch: the return value is assigned to before returning
[13:41] <natefinch> ahgh right
[13:42] <rogpeppe> natefinch: from http://golang.org/ref/spec#Return_statements: "A "return" statement that specifies results sets the result parameters before any deferred functions are executed."
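The interaction being discussed is easy to demonstrate in isolation. In this self-contained sketch (hypothetical names, not the uniter's actual code), a deferred filter on a named result sees the final error even when a shadowed `err` was assigned inside an inner scope, because `return err` sets the result parameter before the defer runs:

```go
package main

import (
	"errors"
	"fmt"
)

var errTerminate = errors.New("terminate agent")

func doStep() error { return errors.New("step broke") }

// run uses a named result so the deferred filter can inspect and
// rewrite the final error. Per the spec line quoted above, "return"
// assigns the result parameter before deferred functions execute,
// so shadowing err in an inner scope doesn't hide it from the defer.
func run(fail bool) (err error) {
	defer func() {
		if err != nil && err != errTerminate {
			err = fmt.Errorf("run: %v", err)
		}
	}()
	if fail {
		if err := doStep(); err != nil { // locally scoped err
			return err // assigns the named result; defer sees it
		}
	}
	return errTerminate // passes through the filter untouched
}

func main() {
	fmt.Println(run(true))  // wrapped by the deferred filter
	fmt.Println(run(false)) // errTerminate passes through unchanged
}
```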
[13:44] <jam> rogpeppe: so it looks like you only run into it if you get ErrTerminate before init actually finishes
[13:46] <rogpeppe> jam: i'm not sure why the unit isn't always dead for this test on entry to Uniter.init
[13:46] <rogpeppe> jam: tbh i don't want to take up the rest of my afternoon grokking the uniter tests - i'll leave this alone until i have some time.
[13:47] <rogpeppe> jam: (i agree that it indicates a lacking test in this area)
[13:52] <jam> rogpeppe: so LGTM for the change, though it does raise the question that if we wrapped errors without dropping context it might have worked as  well :)
[13:52] <rogpeppe> jam: yeah, i know
[13:53] <rogpeppe> jam: but i'd much prefer it if we have a wrap function that explicitly declares the errors that can pass back
[13:53] <rogpeppe> jam: then we can actually see what classifiable errors the function might be returning
[13:56] <rogpeppe> jam: there are 9 possible returned error branches in that function - it's much easier to modify the function if you know which of those might be relied on for specific errors
[13:56] <natefinch> rogpeppe: that would be pretty useful in a defer statement, since it would then be right next to the function definition, as well.
[13:56] <rogpeppe> natefinch: perhaps
[13:56] <rogpeppe> natefinch: tbh i'm not keen on ErrorContextf in general
[13:57] <rogpeppe> natefinch: it just adds context that the caller already knows
[13:57] <natefinch> rogpeppe: yes, I wouldn't have it change the message, just filter the types.  I don't want to have to troll through the code in a function to figure out what errors it can return
[13:58] <rogpeppe> natefinch: the doc comment should state what errors it can return
[13:59] <rogpeppe> natefinch: and i'd put a return errgo.Wrap(err) on each error return
[14:00] <rogpeppe> natefinch: (errgo.Wrap(err, errgo.Is(worker.ErrTerminateAgent) for paths where we care about the specific error)
[14:01] <rogpeppe> natefinch: i know what you mean about having the filter near the top of the function though
[14:17] <natefinch> rogpeppe: btw for want/hasvote, what about : non-member, pending-removal, pending-add, member?  I feel like inactive and active sound too ephemeral, like it could change at any minute, when in fact, it's likely to be a very stable state.   But maybe I'm over thinking it.
[14:18] <jam> natefinch: fwiw I like your terms better
[14:18] <rogpeppe> natefinch: those terms aren't actually right, unfortunately.
[14:19] <rogpeppe> natefinch: there's no way of currently telling if a machine's mongo is a member of the replica set
[14:20] <rogpeppe> natefinch: even if a machine has WantVote=false, HasVote=false, it may still be a member
[14:21] <rogpeppe> natefinch: basically, every state server machine will be a member unless it's down
[14:22] <rogpeppe> natefinch: how about "activated" and "deactivated" instead of "active" and "inactive" ?
[14:24] <natefinch> rogpeppe: isn't the intended purpose that those with y/y that they're in the replicaset?  I guess if it doesn't reflect the replicaset, what does it reflect?
[14:24] <rogpeppe> natefinch: the intended purpose of those with y/y is that they are *voting* members of the replica set
[14:25] <natefinch> I see
[14:25] <rogpeppe> natefinch: we can have any number of non-voting members
[14:25] <rogpeppe> natefinch: (and that's important)
[14:29] <natefinch> member-status: non-voting, pending-unvote, pending-vote, voting?    I know unvote is not a word, but pending-non-voting is too long and confusing.
[14:31] <sinzui> jamespage, I sent a reply about 1.16.4.
[14:32] <jamespage> sinzui, so the backup/restore bits are not actually in the 1.16 branch?
[14:33] <sinzui> jamespage, no backup
[14:33] <jamespage> hmm
[14:33] <natefinch> rogpeppe: check my last msg
[14:33] <sinzui> restore aka update-bootstrap worked for customers who had the bash script
[14:35] <sinzui> jamespage, by not installing juju-update-bootstrap, I think we can show that no new code was introduced to the system
[14:36] <rogpeppe> natefinch: "not voting", "adding vote", "removing vote", "voting" ?
[14:36] <perrito666> jamespage: sinzui I assigned myself https://bugs.launchpad.net/juju-core/+bug/1305780?comments=all just fyi
[14:36] <_mup_> Bug #1305780: juju-backup command fails against trusty bootstrap node <backup-restore> <juju-core:Triaged by hduran-8> <https://launchpad.net/bugs/1305780>
[14:36] <natefinch> rogpeppe: sure, that's good
[14:37] <rogpeppe> natefinch: although i'm not entirely happy with the vote/voting difference
[14:37] <rogpeppe> natefinch: how about: "no vote", "adding vote", "removing vote", "has vote" ?
[14:37] <sinzui> jam, wallyworld: the precise unit tests are now running with mongo from ctools. I have a set of failures. They are different from before. CI is automatically retesting. I am hopeful
[14:38] <natefinch> rogpeppe: "voting, pending removal" "not voting, pending add"?  That makes it a little more clear that even though the machine is not going to have the vote in a little bit, it actually still does right now
[14:38] <natefinch> (and vice versa)
[14:39] <rogpeppe> natefinch: i think that it's reasonable to assume that if something says "removing x" that x is currently there to be removed
[14:39] <rogpeppe> natefinch: likewise for adding
[14:39] <natefinch> rogpeppe: fair enough
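The naming settled on above is a simple mapping over the (wantsVote, hasVote) pair; a sketch of how status rendering might look (hypothetical helper, not juju's actual status code):

```go
package main

import "fmt"

// voteStatus renders the (wantsVote, hasVote) pair with the wording
// agreed above. "adding"/"removing" imply the current state: a
// machine shown as "removing vote" still holds its vote right now.
func voteStatus(wantsVote, hasVote bool) string {
	switch {
	case wantsVote && hasVote:
		return "has vote"
	case wantsVote && !hasVote:
		return "adding vote"
	case !wantsVote && hasVote:
		return "removing vote"
	default:
		return "no vote"
	}
}

func main() {
	fmt.Println(voteStatus(true, true))
	fmt.Println(voteStatus(true, false))
	fmt.Println(voteStatus(false, true))
	fmt.Println(voteStatus(false, false))
}
```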
[14:43] <rogpeppe> ha, i've just discovered that if you call any of the Patch functions in SetUpSuite, the teardown functions never get called.
[14:44] <mgz> rogpeppe: heh, yeah, another reason teardown is generally dangerous
[14:44] <hazmat> jam, thanks for the scale testing reports
[14:44] <rogpeppe> mgz: i think we should change CleanUpSuite so it just works if you do a suite-level patch
[14:44] <natefinch> rogpeppe: whoa, really?  I assumed they'd do the right thing
[14:44] <rogpeppe> natefinch: uh uh
[14:45] <natefinch> rogpeppe, mgz: yes, definitely.   Totally unintuitive otherwise
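The pitfall rogpeppe found can be reduced to a toy harness (this is a sketch of the failure mode, not the real juju/testing CleanupSuite): if the cleanup stack is reset in SetUpTest and only unwound in TearDownTest, a patch registered in SetUpSuite is discarded before any teardown ever runs its restore function.

```go
package main

import "fmt"

var setting = "real"

// suite is a toy cleanup harness: patch pushes a restore func,
// setUpTest resets the stack, tearDownTest unwinds it. There is
// deliberately no suite-level teardown, mirroring the bug above.
type suite struct{ cleanups []func() }

func (s *suite) patch(target *string, val string) {
	old := *target
	*target = val
	s.cleanups = append(s.cleanups, func() { *target = old })
}

// setUpTest resets the per-test stack - silently dropping any
// restore funcs registered at suite level before the first test.
func (s *suite) setUpTest() { s.cleanups = nil }

func (s *suite) tearDownTest() {
	for i := len(s.cleanups) - 1; i >= 0; i-- {
		s.cleanups[i]()
	}
	s.cleanups = nil
}

func main() {
	s := &suite{}
	s.patch(&setting, "fake") // as if called from SetUpSuite
	s.setUpTest()             // per-test reset: restore func is lost
	s.tearDownTest()          // nothing left to unwind
	fmt.Println(setting)      // still patched - never restored
}
```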
[14:45] <rogpeppe> i won't do it right now, but i'll raise a bug
[14:46] <rogpeppe> where are we supposed to raise bugs for github.com/juju/testing ?
[14:46] <rogpeppe> on the github page, or on juju-core?
[14:49] <natefinch> last I heard we were keeping bugs on launchpad
[14:51] <natefinch> (not my idea)
[14:52] <rogpeppe> natefinch: done. https://bugs.launchpad.net/juju-core/+bug/1308101
[14:52] <_mup_> Bug #1308101: juju/testing: suite-level Patch never gets restored <juju-core:New> <https://launchpad.net/bugs/1308101>
[14:53] <natefinch> rogpeppe: I very well may have made that mistake myself recently.
[14:53] <rogpeppe> natefinch: that was what caused me to investigate
[14:53] <rogpeppe> natefinch: i knew it was an error, but i thought it would get torn down at the end of the first test
[14:54] <rogpeppe> natefinch: i wondered how it was working at all
[14:54] <natefinch> rogpeppe: yeah, of the two likely behaviors, never getting torn down is definitely the worse of the two
[14:54] <natefinch> rogpeppe: but also the one least likely to be obvious
[14:55] <rogpeppe> natefinch: yup
[14:55] <alexisb> jam, sinzui any news on the bootstrap issue? https://bugs.launchpad.net/juju-core/+bug/1306212
[14:55] <_mup_> Bug #1306212: juju bootstrap fails with local provider <bootstrap> <ci> <local-provider> <regression> <juju-core:In Progress by jameinel> <https://launchpad.net/bugs/1306212>
[14:56] <sinzui> alexisb, thumper and wallyworld landed a hack to remove HA from local to make tests pass...
[14:56] <sinzui> alexisb, I think devs hope to fix the real bug...
[14:56] <natefinch> alexisb, sinzui:  jam looked into it some, and it may be due to an old version of mongo (2.2.x) that we don't really need to support anyway... cloud archive has 2.4.6 I believe, which may solve the problem
[14:56] <sinzui> natefinch, already updated the test
[14:57] <alexisb> sinzui, natefinch: can we add those updates to the bug?
[14:57] <sinzui> natefinch, This is the current run with mongo from ctools: http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/run-unit-tests-amd64-precise/633/console
[14:58] <alexisb> sinzui, any other critical bugs blocking the 19.0 release?
[14:58] <natefinch> sinzui: that looks good
[14:59] <sinzui> alexisb, I don't think of the HA removal from local as a hack. It is insane to attempt HA for a local env. I may close the bug instead of deferring it to the next release
[14:59] <natefinch> sinzui: but that's with wallyworld's hack, right?  We'll need to remove that hack at some point, and it would be good not to use an old version of mongo anyway.
[14:59] <sinzui> natefinch, there is a previous run with failure, but different failures than before. CI chose to retest assuming the usual intermittent failure
[15:00] <alexisb> sinzui, understood
[15:00] <natefinch> sinzui: the devs discussed it this morning.  Having HA on local may not actually give "HA", but it can be useful to show how juju works with HA, like you can kill a container and watch juju recover, re-ensure and watch a new one spin up, etc
[15:00] <sinzui> natefinch, wallyworld's hack was about not try to do HA setup for local.
[15:01] <natefinch> sinzui: it's basically just like the rest of local.... it's not actually *useful* for much other than demos and getting your feet wet.... but it's really useful for that.
[15:01] <sinzui> natefinch, alexisb unit tests are all pass
[15:01] <sinzui> \0/
[15:02] <sinzui> natefinch, alexisb azure is ill today and the azure tests failed. I am asking for a retest. the current revision will probably be blessed for release today
[15:02] <alexisb> sinzui, awesome!
[15:02] <natefinch> sinzui: what version of mongo is that running?
[15:03]  * natefinch doesn't know what ctools means
[15:04]  * sinzui reads the test log
[15:05] <natefinch> sinzui: ahh, I see, I missed it somehow, looks like 2.4.6
[15:05] <sinzui> natefinch,  1:2.4.6-0ubuntu5~ctools1
[15:05] <natefinch> sinzui: happy
[15:06] <jamespage> sinzui, I'm not averse to introducing a new feature - the plugins are well isolated but afaict its not complete in the codebase
[15:06] <jamespage> sinzui, if that is the case then I agree not shipping the update-bootstrap plugin does make sense
[15:06] <jamespage> otherwise afaict I have no real way of providing backup/restore to 1.16 users right?
[15:07] <sinzui> jamespage, They aren't complete since we know that a script is needed to get tar and mongodump to do the right thing
[15:08] <jamespage> sinzui, OK - I'll drop it then
[15:11] <natefinch> sinzui: is that the version of mongo we were running before wallyworld's hack?  I'd like to know if the version of mongo is the deciding factor
[15:11] <sinzui> natefinch, for the unittests, that version of mongo is the fix
[15:11] <sinzui> natefinch, for the local deploy, wallyworld's hack was the fix
[15:12] <sinzui> and jam's fix for upgrades fixed all upgrdes
[15:12] <natefinch> ok
[15:13] <natefinch> oh right, it was the version of mongo for upgrades that was changing how we add/remove users.
[15:18] <sinzui> natefinch, CI has mongo from ctools though. all I did for the test harness was ensure that precise hosts add the same PPA as CI itself
[15:21] <natefinch> sinzui: ok, I thought someone had mentioned this morning that CI adds the juju/stable PPA for mongo, but I may have misunderstood or they may have been wrong
[15:22] <sinzui> natefinch, I added the juju stable ppa, ensured the ctools archive is added, and then manually installed it before we run make install-dependencies
[15:23] <natefinch> sinzui: I believe you know what your tools are running better than some random juju dev :)
[15:24] <natefinch> (and my memory thereof)
[15:26] <sinzui> Azure is very ill.
[15:26] <sinzui> The best I can do is manually retest hoping that I catch azure whe it is better
[15:28] <natefinch> poor azure
[15:31] <rogpeppe> natefinch: ping
[15:32] <natefinch> rogpeppe: yo
[15:32] <rogpeppe> natefinch: i've just pushed lp:~rogpeppe/juju-core/natefinch-041-moremongo/
[15:32] <rogpeppe> natefinch: all tests pass
[15:32] <rogpeppe> natefinch: could you pull it and re-propose -wip, please?
[15:32] <natefinch> rogpeppe: sure
[15:33] <rogpeppe> natefinch: i guess i could make a new CL, but it seems nicer to use the current one
[15:33] <natefinch> rogpeppe: yeah
[15:34] <natefinch> rogpeppe: wiping
[15:34] <natefinch> rogpeppe: that should be wip-ing
[15:35] <natefinch> rogpeppe: done
[15:35] <rogpeppe> natefinch: ta
[15:52] <rogpeppe> natefinch: i've pushed again. could you pull and then do a proper propose, please?
[15:53] <rogpeppe> natefinch: and then i think we can land it
[15:53] <natefinch> rogpeppe: sure
[15:53] <natefinch> rogpeppe: one sec, running tests on the other branch
[15:56] <jam1> natefinch: sinzui: so I don't know if changing mongo would have made CI happy without wallyworld's hack. wallyworld was the only one who has reproduced the replicaset Closed failure, and he did so on trusty running 2.4.9, so it seems like it *could* still be necessary.
[15:56] <jam1> natefinch: the concern is that the code is actually not different in Local, so if it is failing there it *should* be failing elsewhere
[15:56] <jam1> and maybe we just aren't seeing it yet
[15:56] <natefinch> jam1: yep, also a good reason not to have local be different
[15:57] <jam1> natefinch: my stance on local HA: it doesn't provide actual HA, it is good for demos, I like the common codebase, but I'm willing to slip it if we have to.
[15:57] <sinzui> jam1 I think you are forgetting the unit test failed days before upgrade failed and days before local deploy failed.
[15:57] <natefinch> jam1: yeah.  perfect is the enemy of good
[15:57] <mgz> so, I fixed the test suite hang... but still don't understand why it actually did that.
[15:58] <sinzui> I went to sleep with unit tests and azure failing (the latter is azure, not code)
[15:58] <rogpeppe> i cannot get this darn branch to pass tests in the bot: https://code.launchpad.net/~rogpeppe/juju-core/548-destroy-environment-fix/+merge/215697
[15:58] <rogpeppe> it's failed on replicaset.TestAddRemoveSet three times in a row now, and the changes it makes cannot have anything to do with that
[15:58] <rogpeppe> let's try just one more time
[15:59] <sinzui> jam, I changed the db used in http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/run-unit-tests-amd64-precise/ #632 I got a pass in #633. The errors were different between dbs
[15:59] <jam1> sinzui: so TestInitialize failing was the mongo version problem. The fact that adding --replicaSet to the mongo startup line caused local to always fail with Closed is surprising, but might be a race/load issue that local triggers that others don't.
[15:59] <jam1> natefinch: a thought, we know it takes a little while for replicaset to recover
[15:59] <jam1> IIRC, mongo defaults to having a 15s timeout
[16:00] <jam1> natefinch: is it possible that Load pushes local over the 15s timeout?
[16:01] <rogpeppe> natefinch: could you push your 043-localstateinfo branch please, so I can try to get the final branch ready for proposal?
[16:01] <sinzui> jam1 I changed the db for unit tests because it didn't match the db used by local/CI. the one from ctools is the only one I will trust now for precise
[16:02] <jam1> sinzui: right, the only place we actually use ppa:juju/stable is for Quantal and Raring
[16:02] <natefinch> jam: yes, possible. Mongo can be sporadically really slow
[16:02] <jam1> I forgot about R
[16:02] <natefinch> rogpeppe: reproposed moremongo
[16:02] <jam1> but I don't think 2.4 landed until Saucy
[16:02] <rogpeppe> natefinch: thanks
[16:02] <rogpeppe> natefinch: i'll approve it
[16:03] <sinzui> jam1, the makefile disagrees with your statement
[16:03] <mgz> R is no longer supported
[16:03] <sinzui> jam1 the makefile doesn't know about ctools
[16:03] <jam1> sinzui: so what I mean is, when you go "juju bootstrap" and Q and R is the target, we add the juju ppa
[16:03] <jam1> sinzui: ctools doesn't have Q and R builds
[16:03] <jam1> only P
[16:03] <natefinch> rogpeppe:  pushed
[16:03] <rogpeppe> natefinch: ta
[16:04] <jam1> sinzui: but as mgz points out, Q is almost dead, and R is dead, so we can punt
[16:04] <rogpeppe> natefinch: how's it going, BTW?
[16:04] <sinzui> jam1, good, because I don't test with r (obsolete) and q (obsolete in 3 days)
[16:04] <jam1> sinzui: otherwise the better fix is to get 2.4.6 into the ppa
[16:04] <sinzui> jam1, +1
[16:04] <sinzui> jam1, I was not aware the versions were different until this morning
[16:05] <jam1> sinzui: I wasn't that aware either
[16:05] <jam1> sinzui: I just saw the failing and davecheney pointed out CI was using an "old" version, which I tracked down
[16:06] <sinzui> I saw it too, but I haven't gotten enough sleep to see how the old version was selected
[16:06] <jam1> sinzui: it is nice to see so much blue on http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/
[16:07] <sinzui> jam1. I see CI has started the next revision while I was trying to get lucky with azure.
[16:07] <natefinch> rogpeppe: I think most tests pass on 043-localstateinfo now.  saw some failures from state, but they looked sporadic, haven't checked them out yet
[16:08] <rogpeppe> natefinch: cool. "most" ?
[16:08] <mgz> rogpeppe: can I request a re-look at https://codereview.appspot.com/87540043
[16:08] <rogpeppe> mgz: looking
[16:09] <mgz> rogpeppe: I don't like that that fixed the hang... pretty sure it means the test is relying on actually dialing 0.1.2.3 and that failing
[16:09] <natefinch> rogpeppe: I didn't let the worker tests finish because I was impatient and they were taking forever, so possible there are failures there too
[16:09] <natefinch> rogpeppe:  state just passed for me
[16:09] <rogpeppe> natefinch: cool
[16:10] <sinzui> jam1, everyone. This is a first,  lp:juju-core r2630 passed 7 minutes after CI cursed the rev because I was forcing the retest of azure.
[16:11] <rogpeppe> mgz: the code in AddStateServerMachine is kinda kooky
[16:11] <sinzui> jam1. I will start preparation for the release while the current rev is tested. I will use the new rev if it gets a natural blessing
[16:11] <mgz> AddStateServerMachine should probably just be removed
[16:11] <rogpeppe> mgz: probably.
[16:11] <mgz> it's not a very useful or used helper
[16:12] <mgz> and its doc comment is wonky
[16:12] <mgz> I probably shouldn't have touched it, as stuff passes without poking there
[16:12] <mgz> but it came up in my grep
[16:13] <natefinch> rogpeppe:  worker tests pass except for a peergrouper test - workerJujuConnSuite.TestStartStop got cannot get replica set status: cannot get replica set status: not running with --replSet
[16:13] <rogpeppe> it should probably be changed to SetStateServerAddresses
[16:13] <rogpeppe> mgz: as that's the reason it was originally put there
[16:13] <mgz> right, something like that
[16:14] <rogpeppe> mgz: but let's just land it and think about that stuff later
[16:14] <mgz> I'll try it on the bot
[16:15] <rogpeppe> mgz: bot is chewing on it...
[16:19] <natefinch> rogpeppe: peergrouper failure was sporadic too, somehow.  All tests pass on that branch.
[16:25] <sinzui> alexisb, I will release in a few hours. CI is testing a new rev that I expect to pass without intervention. I can release the previous rev, which passed with extra retests
[16:25] <alexisb> sinzui, awesome, thank you very much!
[16:26] <mattyw> folks, has anyone seen this error before when trying to deploy a local charm? juju resumer.go:68 worker/resumer: cannot resume transactions: not okForStorage
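(Aside on mattyw's error: "not okForStorage" is MongoDB's name for a document that can't be persisted, typically because a field name contains a dot or starts with a dollar sign. A minimal sketch of that rule, as a hypothetical standalone checker rather than actual juju or mgo code:)

```go
package main

import (
	"fmt"
	"strings"
)

// okForStorage mirrors the rule behind MongoDB's okForStorage check:
// stored document keys may not contain '.' and may not start with '$'.
// This is an illustrative helper; in juju the check happens inside the
// mongo/mgo layer, not in code like this.
func okForStorage(keys []string) error {
	for _, k := range keys {
		if strings.HasPrefix(k, "$") {
			return fmt.Errorf("key %q starts with '$'", k)
		}
		if strings.Contains(k, ".") {
			return fmt.Errorf("key %q contains '.'", k)
		}
	}
	return nil
}

func main() {
	fmt.Println(okForStorage([]string{"name", "charm-url"})) // passes
	fmt.Println(okForStorage([]string{"cs:precise/mysql.1"})) // rejected: dotted key
}
```

A charm URL or config key leaking into a document's field names (rather than its values) is the kind of thing that trips this.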
[16:26] <rogpeppe> natefinch: any chance we might get it landed today?
[16:27] <rogpeppe> natefinch: i'm needing to stop earlier today
[16:27] <alexisb> go rogpeppe go! :)
[16:27] <rogpeppe> alexisb: :-)
[16:28]  * rogpeppe has spent at least 50% of today dealing with merge conflicts
[16:31] <natefinch> rogpeppe: yes
[16:33] <rogpeppe> natefinch: cool
[16:38] <rogpeppe> natefinch: BTW the bot is running 041-moremongo tests right now... fingers crossed
[16:45] <rogpeppe> natefinch: i've just realised that it would be quite a bit nicer to have a func stateInfoFromServingInfo(info params.StateServingInfo) *state.Info, and just delete agent.Config.StateInfo
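(The shape rogpeppe is describing: derive the mongo connection info from the serving info the agent already holds, instead of keeping a separate agent.Config.StateInfo. A sketch with trimmed-down stand-in types; the real params.StateServingInfo and state.Info in juju-core carry more fields:)

```go
package main

import "fmt"

// Stand-ins for params.StateServingInfo and state.Info, reduced to the
// fields this sketch needs.
type StateServingInfo struct {
	StatePort int
	Cert      string
}

type StateInfo struct {
	Addrs  []string
	CACert string
}

// stateInfoFromServingInfo derives dialing info for the local state
// server from the serving info, so the agent config need not store a
// StateInfo separately.
func stateInfoFromServingInfo(info StateServingInfo) *StateInfo {
	return &StateInfo{
		Addrs:  []string{fmt.Sprintf("localhost:%d", info.StatePort)},
		CACert: info.Cert,
	}
}

func main() {
	st := stateInfoFromServingInfo(StateServingInfo{StatePort: 37017, Cert: "cert-pem"})
	fmt.Println(st.Addrs[0]) // localhost:37017
}
```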
[16:45] <natefinch> rogpeppe: yeah
[16:46] <rogpeppe> natefinch: n'er mind, we'll plough on, i think.
[16:46] <natefinch> rogpeppe: that was my thinking :)
[16:47] <natefinch> awww, I think my old company finally cancelled my MSDN subscription
[16:56] <rogpeppe> natefinch: i'm needing to stop in 20 mins or so. any chance of that branch being proposed before then?
[16:56] <natefinch> rogpeppe: https://codereview.appspot.com/88200043
[16:56] <rogpeppe> natefinch: marvellous :-)
[16:57] <natefinch> :)
[17:05] <rogpeppe> natefinch: reviewed
[17:06] <natefinch> rogpeppe: btw, I had to rename params to attrParams because params is a package that needed to get used in the same function
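(The rename was needed because a local variable named after an imported package shadows that package for the rest of the scope. A minimal illustration using the stdlib strings package in place of juju's params package; countItems is a made-up function for the example:)

```go
package main

import (
	"fmt"
	"strings"
)

// countItems shows the shadowing natefinch hit: once a local variable
// named "strings" is declared, the imported strings package is hidden
// for the rest of the function, just as a variable named "params" hid
// juju's params package. Renaming the variable (e.g. attrParams)
// restores access to both.
func countItems(csv string) int {
	items := strings.Split(csv, ",") // package still visible here
	strings := items                 // from here on, strings.Split is unreachable
	return len(strings)
}

func main() {
	fmt.Println(countItems("a,b,c")) // 3
}
```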
[17:06] <rogpeppe> natefinch: i know
[17:07] <natefinch> rogpeppe: oh, I misunderstood the comment, ok
[17:07] <rogpeppe> natefinch: i just suggested standard capitalisation
[17:07] <natefinch> rogpeppe: yep, cool
[17:09] <natefinch> rogpeppe: why is test set password not correct anymore?  It still does that, I think?
[17:09] <rogpeppe> natefinch: oh, i probably missed it
[17:11] <rogpeppe> natefinch: you're right, i did
[17:11] <natefinch> rogpeppe: cool
[17:26] <rogpeppe> natefinch: 41-moremongo is merged...
[17:27] <natefinch> rogpeppe: awesome
[17:28] <natefinch> 43-localstateinfo should be being merged now
[17:28] <natefinch> rogpeppe: what's left?
[17:28] <rogpeppe> natefinch: it's actually retrying mgz's apiaddresses_use_hostport
[17:29] <rogpeppe> natefinch: i'm trying to get tests passing on my final integration branch
[17:29] <natefinch> rogpeppe: nice
[17:29] <rogpeppe> natefinch: currently failing because of the StateInfo changes (somehow we have a StateServingInfo with a 0 StatePort)
[17:30] <rogpeppe> natefinch: would you be able to take it over from me for the rest of the day
[17:30] <rogpeppe> ?
[17:30] <natefinch> rogpeppe: yeah definitely
[17:30] <rogpeppe> natefinch: it needs a test that peergrouper is called (i'm already mocking out peergrouper.New)
[17:31] <natefinch> rogpeppe: what's the branch name?
[17:32] <rogpeppe> natefinch: i haven't pushed it yet, one mo
[17:33] <rogpeppe> natefinch: bzr push --remember lp:~rogpeppe/juju-core/540-enable-HA
[17:34] <rogpeppe> natefinch: there are some debugging relics in there that need to be removed too
[17:34] <rogpeppe> natefinch: in particular, revno 2355 (cmd/jujud: print voting and jobs status of machines) needs to be reverted and proposed separately as discussed in the standup
[17:35] <natefinch> rogpeppe: ok
[17:39] <rogpeppe> natefinch: i'd prioritise the other branches though
[17:39] <rogpeppe> natefinch: i have to go now
[17:39] <rogpeppe> g'night all
[17:39] <natefinch> rogpeppe: g'night
[18:05] <sinzui> natefinch, Do you have a moment to review https://codereview.appspot.com/88170045
[18:06] <natefinch> sinzui: done
[18:06] <sinzui> thank you natefinch
[18:34] <BradCrittenden> sinzui: would you have a moment for a google hangout?
[18:35] <sinzui> bac: yes
[18:35] <bac> sinzui: cool.  let me set one up and invite you in a couple of minutes
[18:39] <bac> sinzui: https://plus.google.com/hangouts/_/canonical.com/daily-standup
[23:09] <davecheney> good morning worker ants
[23:16] <perrito666> davecheney: my window says otherwise
[23:21] <davecheney> perrito666: one of us is wrong
[23:21] <davecheney> i'll roshambo you for it
[23:21]  * perrito666 turns a very strong light on outside and says good morning to davecheney 
[23:22] <davecheney> perrito666: it helps with the jet lag
[23:22] <perrito666> davecheney: I traveled under 20km today, I don't have that much jetlag :p