[00:50] <davechen1y> mwhudson: yes scond is the putative condition bits for arm instructions
[00:50] <davechen1y> the s is silent
[00:50] <mwhudson> heh
[00:51] <mwhudson> 5g is ... something
[00:51] <mwhudson> oops, not that any more
[01:41] <menn0> thumper: how are you getting on with that bug?
[01:41] <menn0> davechen1y: we should talk about the certupdater before my EOD
[01:58] <davechen1y> menn0: how about now ?
[01:58] <davechen1y> thumper: https://docs.google.com/document/d/1UPYcjMHV2YEHNAVjMWQAh2Htj-pPl2Bt0_POBGLQURE/edit
[01:58] <davechen1y> so, this sprint we had a dedicated bug squad
[01:59] <davechen1y> and the number of bugs fixed went down ...
[01:59] <davechen1y> i'm not sure that is the result that was intended
[02:01] <mwhudson> urgh, the 32 bit arm abi is less nice than the 64 bit one
[02:02] <davechen1y> when you say "less nice"
[02:02] <davechen1y> do you mean lemon juice is "less nice" than wine
[02:02] <davechen1y> menn0: want to talk in the standup hangout ?
[02:03] <mwhudson> davechen1y: well, i wouldn't say the 64 bit one is comparable to wine
[02:03] <menn0> davechen1y: sounds good
[02:04] <davechen1y> why is tim still in the standup
[02:04] <davechen1y> thumper: why are you still in this hangout ?
[02:13] <thumper> davechen1y: the bugs fixed went down over the second iteration, which wasn't this one, but the last one
[02:13] <thumper> menn0: just back
[02:14] <thumper> menn0: also thinking, that despite our best efforts and analysis, we may still have some weird conditions where workers are not being shut down
[02:14] <thumper> menn0: this would explain a number of cases...
[02:14] <thumper> menn0: although, if the pinger is getting stuck, that would explain one or two
[02:16] <davechen1y> menn0: // Accept waits for and returns the next connection to the listener.
[02:16] <davechen1y> func (cl *changeCertListener) Accept() (c net.Conn, err error) { cl.m.Lock(); defer cl.m.Unlock()
[02:18] <menn0> thumper: a worker getting stuck would explain a lot of things
[02:19] <thumper> menn0: we are seeing it on the ppc64 tests still, and some of the unit agents didn't restart...
[02:19] <thumper> I'm looking at one that didn't now
[02:20] <menn0> thumper: ok
[02:21] <menn0> davechen1y: looking at the code for tls.NewListener
[02:21] <menn0> davechen1y: i now remember what my plan was
[02:21] <thumper> definitely a bug in the unit agent
[02:22] <thumper> at 11:00:23 most workers die for shutdown
[02:22] <thumper> these two lines are next to each other
[02:22] <thumper> 2015-07-02 11:00:23 DEBUG juju.worker runner.go:242 killing "rsyslog"
[02:22] <thumper> 2015-07-02 11:00:25 INFO juju.worker runner.go:261 start "rsyslog"
[02:23] <thumper> then we can see that jujuc commands are being fired
[02:23] <menn0> davechen1y: i was thinking that instead of wrapping a tls.Listener we should copy the tiny amount of code from it and integrate it into certChangeListener
[02:23] <thumper> config-changed hook being fired
[02:23] <menn0> certChangeListener could have an extra method to swap out the config
[02:23] <menn0> and the config would need to be protected by a lock
[02:24] <menn0> thumper: ok so something ain't dying
[02:24] <thumper> bingo
[02:24] <menn0> thumper: the unit agent has few workers so it shouldn't be hard to figure out which
[02:24] <thumper> the uniter is running the config-changed hook
[02:25] <menn0> thumper: but this doesn't necessarily explain the upgrade issue we were looking at
[02:25] <thumper> no...
[02:25] <thumper> that is most likely something else too
[02:25] <menn0> thumper: although they could be caused by a related change
[02:25]  * thumper nods
[02:26] <thumper> oh fuck...
[02:27]  * menn0 waits for good/bad news
[02:31] <thumper> well...
[02:31] <thumper> this is one problem
[02:31] <thumper> when the apiservers come back on line
[02:31] <thumper> the unit agent connects
[02:31] <thumper> one of the first things that fires is config-changed
[02:31] <thumper> even though it probably didn't
[02:31] <thumper> now while the config-changed hook is being fired
[02:31] <thumper> we get told to upgrade
[02:31] <thumper> everything dies
[02:32] <thumper> except the uniter
[02:32] <thumper> which carries on for a while
[02:32] <thumper> then doesn't stop
[02:32] <thumper> the uniter is told to die...
[02:32] <thumper> can't tell if it actually does
[02:33] <thumper> certainly doesn't finish
[02:33] <thumper> the agent doesn't restart
[02:33] <menn0> right
[02:33] <thumper> seems to happen on a subset of units
[02:33] <thumper> certainly most restart
[02:34] <thumper> this screams race somewhere
[02:34] <menn0> thumper: this is probably what caused all those config-changed hook errors.
[02:34] <thumper> and may well be the cause of these other problems
[02:34] <thumper> I have to dig in a little...
[02:34] <menn0> b/c the unit agent doesn't shut down the API connection so the apiserver doesn't stop
[02:34] <thumper> to work that one out
[02:35]  * menn0 is going to check if this could be the cause of the other bug he's looking at
[02:35] <thumper> yeah... all seems closely related
[02:35] <menn0> i might make the uniter intentionally hang to simulate this and see what happens when an upgrade is requested
[03:00] <davechen1y> menn0: so
[03:00] <davechen1y> the approach I discussed on the call has one drawback
[03:00] <davechen1y> using that lock where I said
[03:01] <davechen1y> means the cert update worker will _block_ until one connection is received by the apiserver
[03:01] <davechen1y> i'm not sure how much of a problem this will be in practice
[03:02] <menn0> davechen1y: it is a problem b/c it will prevent the agent from shutting down if there's been no connections received
[03:13] <davechen1y> nope, that will be fine
[03:13] <davechen1y> because we run the http.Server in a goroutine
[03:13] <davechen1y> and signal to it by closing the underlying tcp socket
[03:14] <menn0> davechen1y: ok, as long as you're sure
[03:15] <davechen1y> i'm not sure of anything anymore
[03:15] <davechen1y> that's why i run the race detector
[03:15] <davechen1y> i'll see how the patch shapes up
[03:15] <davechen1y> we can debate it on reviewboard
[03:15] <menn0> thumper: so a stuck uniter doesn't prevent the state server from upgrading, but it does prevent the unit agent from upgrading (expected)
[03:16] <menn0> davechen1y: sounds good
[03:22] <thumper> menn0: yes...
[03:22]  * thumper goes to make some notes on the bug
[03:45] <davechen1y> thumper: http://paste.ubuntu.com/11813756/
[03:45] <davechen1y> new race
[03:45] <davechen1y> :(
[03:46] <thumper> :(
[03:46] <thumper> wat?
[03:46] <thumper> really new?
[03:46] <mwhudson> fire up your engines!
[03:47] <davechen1y> i think so
[03:48] <thumper> previous write by: sync.(*Mutex).Lock()
[03:48] <thumper> fucking what?
[03:49] <thumper> davechen1y: if a pointer is used as a bool check
[03:49] <thumper> nil doesn't count as true does it?
[03:49] <thumper> pointer / interface
[03:50] <thumper> oh...
[03:50] <thumper> damn it
[03:50] <thumper> it is a switch
[03:51] <thumper> meh
[03:52] <davechen1y> github.com/juju/juju/environs/configstore.(*memInfo).clone()
[03:52] <davechen1y> in code review
[03:52] <davechen1y> _anything_ with a clone method gets a hairy eyeball
[03:53] <davechen1y> it's a smell
[04:05] <thumper> menn0: mysql-hacluster/0 did not upgrade because the config-changed hook did not complete
[04:06] <menn0> thumper: ok so that's one problem
[04:08] <menn0> thumper: do you think it's a uniter bug or a problem with the hook it's running?
[04:08] <thumper> that one is the underlying hook AFAICT
[04:08] <menn0> thumper: or perhaps the hook is running a tool which calls back to juju?
[04:09] <thumper> but all that code is handled by the uniter itself
[04:09] <thumper> so we'd see it
[04:14] <menn0> thumper: so I figured out why we saw those api blocked / "upgrade in progress" log messages
[04:14] <menn0> thumper: it's due to some new stuff that ian added
[04:15] <menn0> thumper: it'll do that until the upgrader has done its first check for a new version
[04:15] <menn0> thumper: so that we don't allow API requests when the agent is about to restart
[04:15] <menn0> thumper: the message is a little confusing though
[04:16] <thumper> ah...
[04:16] <thumper> is that how we manage to have the apiserver not too busy that second time up?
[04:16] <thumper> so it can restart?
[04:16] <menn0> could be
[04:16] <thumper> that would actually explain it
[04:17] <menn0> have you tried to verify that the apiserver actually can get stuck
[04:17] <menn0> ?
[04:17] <thumper> just bootstrapping an environment
[04:17] <menn0> cool
[04:17] <thumper> I have figured that none of the code around this has changed
[04:17] <thumper> so I'm using 1.24.2 for better logging
[04:17] <thumper> and just using the juju upgrade-juju --upload-tools so it increments the buildnumber
[04:18] <thumper> ERROR while stopping machine agent: exec ["stop" "--system" "juju-agent-tim-testlocal"]: exit status 1 (stop: Method "Get" with signature "ss" on interface "org.freedesktop.DBus.Properties" doesn't exist)
[04:18] <thumper> boo hiss
[04:21] <menn0> yeah that happens all the time in my trusty VM
[04:25] <davechen1y> thumper: menn0 https://github.com/juju/juju/pull/2716
[04:25] <davechen1y> for debate
[04:25]  * menn0 looks
[04:25] <davechen1y> this is the smallest possible change i can make
[04:28] <menn0> davechen1y: I don't think this is enough
[04:29] <menn0> davechen1y: if you look at crypto/tls you can see that Conn ends up with the Config
[04:29] <menn0> davechen1y: and it uses it beyond Accept
[04:32] <menn0> davechen1y: so if anything changes the config after it could get a surprise
[04:41] <menn0> davechen1y: oh I see... it's not so bad because Handshake is also blocked by the lock
[04:42] <menn0> davechen1y: but that's kind of a hack
[04:45] <menn0> davechen1y: do you mind if I put up a counter-PR?
[04:45] <menn0> i'm thinking we don't use tls.Listener at all
[04:55] <davechen1y> menn0: sure
[04:55] <menn0> davechen1y: almost done :)
[04:56] <menn0> davechen1y: do you happen to know a test that failed the race detector
[05:02] <davechen1y> go test -race ./cmd/jujud/agent
[05:02] <davechen1y> litmus test
[05:07] <davechen1y> thumper: what's the story with getting a voting race test ?
[05:12] <davechen1y> thumper:  menn0 have to run out to the bank to get greenbacks for next week
[05:13] <menn0> davechen1y: i'm just pushing this PR now
[05:14] <thumper> davechen1y: flick of a switch when we are at zero
[05:15] <davechen1y> thumper: so there is a non voting test at the moment ?
[05:15] <davechen1y> how can I see the results that it sees today ?
[05:15] <thumper> davechen1y: correct
[05:15]  * thumper looks
[05:15] <menn0> davechen1y: https://github.com/juju/juju/pull/2717
[05:16] <menn0> davechen1y: with this approach each connection gets its own tls.Config
[05:16] <thumper> davechen1y: http://reports.vapour.ws/releases/2847/job/run-unit-tests-race/attempt/150
[05:17] <menn0> davechen1y: so there's no way the cert updates can affect existing connections
[05:17] <menn0> davechen1y: also: there's no need for the lock to be held the whole time
[05:17] <menn0> davechen1y: (until the first connection)
[05:19] <menn0> davechen1y: do you understand why processCertChanges wanted to always hold the lock?
[05:19] <menn0> davechen1y: that seems like a serious bug
[05:22] <thumper> menn0: looks fine to me...
[05:23] <thumper> menn0: I'm assuming it works
[05:23] <menn0> thumper: tests pass .... haven't tried to use it
[05:24] <menn0> thumper, davechen1y: so ales added a nasty bug a few days ago
[05:24] <thumper> oh?
[05:25] <menn0> thumper, davechen1y: look at rev c147767de940cd07db3330ac3a19f9f3547d4be1
[05:25] <thumper> menn0: c'mon, post a link to the revision at least
[05:26] <menn0> thumper, davechen1y: https://github.com/juju/juju/commit/c147767de940cd07db3330ac3a19f9f3547d4be1
[05:26] <menn0> with that in place i'm surprised the apiserver even works in master
[05:27] <thumper> because the lock is already held?
[05:27] <menn0> yep
[05:27] <menn0> just checking now
[05:30] <menn0> thumper: well I have no idea how that works
[05:30] <menn0> thumper: but everything seems ok
[05:31] <thumper> what type of lock is it?
[05:31] <menn0> sync.Mutex
[05:38] <thumper> perhaps because process change is never being called
[05:40] <menn0> thumper: i've put log messages in and I can see that it is
[05:40] <thumper> ?!?
[05:41] <thumper> i thought that would have deadlocked
[05:41] <menn0> same!
[05:43] <menn0> thumper: it does deadlock
[05:43] <thumper> umm...
[05:43] <thumper> you said that you see the logs
[05:44] <menn0> thumper: I just added more logs
[05:44] <thumper> you put the logs before the lock didn't you?
[05:44] <menn0> thumper: the cert update starts to happen and then it gets stuck b/c it can't get the lock again
[05:45] <menn0> thumper: the api server keeps working b/c Handshake isn't even called on changeCertConn (which is the other place the lock is grabbed)
[05:45] <menn0> thumper: so there is no need for changeCertConn to even exist
[05:46] <menn0> thumper: my PR gets rid of changeCertConn anyway
[05:48] <thumper> menn0: care to explain to me how your change works to fix things
[05:48] <thumper> ?
[05:49] <menn0> thumper: sure
[05:50] <menn0> thumper: so it means that each connection ends up with a private copy of the tls.Config
[05:50] <menn0> thumper: so that when the certificate is updated, it doesn't affect existing connections
[05:50] <menn0> thumper: and new connections see the new config with the updated cert
[05:50] <thumper> well that makes sense
[05:51] <menn0> thumper: there's no race because the tls.Configs aren't being shared
[05:51] <thumper> shipit
[05:51] <menn0> thumper: ok cool
[05:51] <thumper> I'm not getting to a reproduction step here.
[05:51] <menn0> thumper: just updating the commit message
[05:51] <thumper> and my brain is fuzzed
[05:51] <thumper> and it is friday
[05:52] <menn0> thumper: i've had very little success with these bugs myself
[05:52] <menn0> thumper: they are so hard to repro
[05:52]  * thumper nods
[05:52] <thumper> have a good weekend folks
[05:52] <menn0> thumper: debug level logs with the extra logging you added would help a lot
[05:52] <thumper> davechen1y: safe travels
[05:52] <thumper> menn0: yeah...
[05:53] <thumper> there are several extra places we should add logging ...
[05:59] <menn0> davechen1y: thumper is happy but I'd appreciate your feedback on this: http://reviews.vapour.ws/r/2095/
[06:00] <menn0> davechen1y: i'm EOD now but will leave IRC up and check back later
[06:49] <davechen1y> menn0: kk
[06:57] <davechen1y> menn0: ship it, that's excellent
[08:35] <mup> Bug #1471138 opened: destroy-environment should be able to destroy all sub-environments <juju-core:New> <https://launchpad.net/bugs/1471138>
[09:29] <menn0> davechen1y: sweet thanks
[10:17]  * fwereade bbiab
[10:47] <rogpeppe1> ha, has anyone realised that we don't allow single-character user names
[10:47] <rogpeppe1> ?
[10:47] <rogpeppe1> i wonder if that was deliberate
[12:50] <perrito666> TheMue: hb
[13:07] <dimitern> perrito666, he's off today
[13:07] <perrito666> well he'll read that when he returns hopefully
[13:08] <dimitern> :)
[13:19] <fwereade> perrito666, do you need unblocking on http://reviews.vapour.ws/r/1851/ ?
[13:20] <perrito666> fwereade: mm, why is that still open? let me check
[13:31] <dimitern> 2015-07-03 13:19:53 ERROR juju.cmd supercommand.go:430 tools upload failed: 400 ({"Tools":null,"DisableSSLHostnameVerification":false,"Error":{"Message":"cannot get environment config: invalid series \"wily\"","Code":""}})
[13:32] <dimitern> are we going to have to deal with such issues on *every* ubuntu release?
[13:32] <perrito666> dimitern: yes
[13:32] <perrito666> at least based in historical data
[13:33] <dimitern> :(
[13:33] <dimitern> tgif at least
[13:34] <perrito666> fwereade: I was under the impression that it had been merged; perhaps it was merged to previous versions. I'll address that next week (most, if not all, of your comments make sense)
[13:44] <mup> Bug #1471231 opened: debugLogDBIntSuite teardown fails <ci> <unit-tests> <juju-core:Incomplete> <juju-core db-log:Triaged> <https://launchpad.net/bugs/1471231>
[13:56] <mup> Bug #1471237 opened: Mongo causes juju to fail: bad record MAC <mongodb> <reliability> <juju-core:Triaged> <mongodb (Ubuntu):Confirmed> <https://launchpad.net/bugs/1471237>
[13:56] <mup> Bug #1471241 opened: RelationSuite teardown fails <ci> <intermittent-failure> <unit-tests> <juju-core:Triaged> <https://launchpad.net/bugs/1471241>
[14:00] <fwereade> perrito666, and also, can I help at all with http://reviews.vapour.ws/r/1979/ ?
[14:01] <perrito666> fwereade: not really, I have to go over all of that again and submit another patch, but I really don't want to context switch now; it has been a terrible week for me in terms of the task at hand
[14:02] <fwereade> perrito666, no worries at all, just checking in
[14:03] <fwereade> perrito666, when you get back to it just let me know if anything needs clarification
[14:26] <mup> Bug #1471241 changed: RelationSuite teardown fails <ci> <intermittent-failure> <unit-tests> <juju-core:Triaged> <https://launchpad.net/bugs/1471241>
[14:26] <mup> Bug #1471242 opened: juju upgrade-juju to 1.24.2 fails with "invalid series: willy" <juju-core:New> <https://launchpad.net/bugs/1471242>
[15:15] <voidspace> dimitern: http://reviews.vapour.ws/r/2097/
[15:30] <dimitern> voidspace, awesome! LGTM
[15:31] <voidspace> dimitern: wow, that was quick
[15:31] <voidspace> dimitern: thanks
[15:38] <dimitern> voidspace, well, it's also friday :)
[16:14] <rogpeppe1> wwitzel3: ping
[17:12] <mup> Bug #1471308 opened: TestOpenStateWorksForJobManageEnviron fails intermittently on windows <ci> <intermittent-failure> <test-failure> <windows> <juju-core:Triaged> <https://launchpad.net/bugs/1471308>
[17:57]  * fwereade is finally happy enough with http://reviews.vapour.ws/r/2078/ and would very much appreciate reviews
[20:30] <mup> Bug #1471332 opened: Upgrade fails on windows machines <ci> <upgrade-juju> <windows> <juju-core:Triaged> <juju-core 1.24:Triaged> <https://launchpad.net/bugs/1471332>
[20:57] <fwereade> gaah
[20:57] <fwereade> I'm inside state
[20:57] <perrito666> fwereade: how did you get inside it?
[20:57]  * perrito666 takes a crowbar and goes to the rescue
[20:58] <fwereade> what's the quickest cleanest way to either find out what machine I'm running on, or otherwise come up with another stable id that effectively maps to that
[20:58] <fwereade> ?
[20:58] <mgz> fwereade: I'm like 20 mins in to reading your giant branch
[20:58] <fwereade> mgz, sorry it's so big but there was a lot to untangle
[20:58] <fwereade> mgz, I'm pretty sure it's there now though
[20:58] <mgz> yeaj, it's all pretty sane so far at least
[20:58] <fwereade> mgz, cool
[20:59] <fwereade> mgz, I pushed a much better description recently fwiw
[20:59] <fwereade> mgz, not sure if 20 mins ago or not
[20:59] <fwereade> mgz, probably more actually
[20:59] <perrito666> fwereade: where are you exactly?
[20:59] <perrito666> fwereade: to answer to your question
[21:00] <fwereade> perrito666, I'm in a state method which I want to set up the presence watcher and lease/leadership stuff
[21:00] <fwereade> perrito666, my lease client needs a unique id for the machine it's running on
[21:00] <perrito666> I see
[21:00] <fwereade> perrito666, deriving it from current state server id would be great
[21:01] <fwereade> perrito666, but, heh, how do we know which we are?
[21:01] <fwereade> perrito666, if I weren't inside state I could pass it in via agent config
[21:01] <fwereade> perrito666, but that really has no place in state
[21:04] <perrito666> fwereade: State{}.serverTag I think
[21:27] <fwereade> perrito666, I'm pretty sure that's the environ the state server's running in
[21:27] <fwereade> perrito666, but not unique to a particular machine
[21:28] <perrito666> look at IsStateServer