[00:50] mwhudson: yes scond is the putative condition bits for arm instructions [00:50] the s is silent [00:50] heh [00:51] 5g is ... something [00:51] oops, not that any more === kadams54 is now known as kadams54-away [01:41] thumper: how are you getting on with that bug? [01:41] davechen1y: we should talk about the certupdater before my EOD [01:58] menn0: how about now ? [01:58] thumper: https://docs.google.com/document/d/1UPYcjMHV2YEHNAVjMWQAh2Htj-pPl2Bt0_POBGLQURE/edit [01:58] so, this sprint we had a dedicaed bug squad [01:59] and the number of bugs fixed went down ... [01:59] i'm not sure that is the result that was intended [02:01] urgh, the 32 bit arm abi is less nice than the 64 bit one [02:02] when you say "less nice" [02:02] to you mean lemon juice is "less nice" than wine [02:02] menn0: want to talk in the standup hangout ? [02:03] davechen1y: well, i wouldn't say the 64 bit one is comparable to wine [02:03] davechen1y: sounds good [02:04] why is tim still in the standup [02:04] thumper: why are you still in this hangout ? [02:13] davechen1y: the bugs fixed went down over the second iteration, which wasn't this one, but the last one [02:13] menn0: just back [02:14] menn0: also thinking, that despite our best efforts and analysys, we may still have some weird conditions where workers are not being shut down [02:14] menn0: this would explain a number of cases... [02:14] menn0: although, if the pinger is getting stuck, that would explain one or two [02:16] menn0: // Accept waits for and returns the next connection to the listener. [02:16] func (cl *changeCertListener) Accept() (c net.Conn, err error) { cl.m.Lock() defer cl.m.Unlock() [02:18] thumper: a worker getting stuck would explain a lot of things [02:19] menn0: we are seeing it on the ppc64 tests still, and some of the unit agents didn't restart... [02:19] I'm looking at one that didn't now [02:20] thumper: ok [02:21] davechen1y: looking at the code for tls.NewListener [02:21] davechen1y: i now remember what my plan was [02:21] definitely a bug in the unit agent [02:22] at 11:00:23 most workers die for shutdown [02:22] these two lines are next to each other [02:22] 2015-07-02 11:00:23 DEBUG juju.worker runner.go:242 killing "rsyslog" [02:22] 2015-07-02 11:00:25 INFO juju.worker runner.go:261 start "rsyslog" [02:22] 2 [02:23] then we can see that jujuc commands are being fired [02:23] davechen1y: i was thinking that instead of wrapping a tls.Listener we should copy the tiny amount of code from it and integrate it into certChangeListener [02:23] config-changed hook being fired [02:23] certChangeListener could have an extra method to swap out the config [02:23] and the config would need to be protected by a lock [02:24] thumper: ok so something ain't dying [02:24] bingo [02:24] thumper: the unit agent has few workers so it shouldn't be hard to figure out which [02:24] the uniter is running the config-changed hook [02:25] thumper: but this doesn't necessarily explain the upgrade issue we were looking at [02:25] no... [02:25] that is most likely something else too [02:25] thumper: although they could be caused by a related change [02:25] * thumper nods [02:26] oh fuck... [02:27] * menn0 waits for good/bad news [02:31] well... [02:31] this is one problem [02:31] when the apiservers come back on line [02:31] the unit agent connects [02:31] one of the first thing that fires is config-changed [02:31] even though it probably didn't [02:31] now while the config-changed hook is being fired [02:31] we get told to upgrade [02:31] everything dies [02:32] except the uniter [02:32] which carries on for a while [02:32] then doesn't stop [02:32] the uniter is told to die... [02:32] can't tell if it actually does [02:33] certainly doesn't finish [02:33] the agent doesn't restart [02:33] right [02:33] seems to happen on a subset of units [02:33] certainly most restart [02:34] this screams race somewhere [02:34] thumper: this is probably what caused all those config-changed hook errors. [02:34] and may well be the cause of these other problems [02:34] I have to dig in a little... [02:34] b/c the unit agent doesn't shut down the API connection so the apiserver doesn't stop [02:34] to work that one out [02:35] * menn0 is going to check if this could be the cause of the other bug he's looking at [02:35] yeah... all seems closely related [02:35] i might make the uniter intentional hang to simulate and see what happens when an upgrade is requested [03:00] ,menn0 so [03:00] the approach I discussed on the call has one drawback [03:00] using that lock where I said [03:01] means the cert update worker will _block_ until one connetion is received by the apiserver [03:01] i'm not sure how much of a problem this will be in practice [03:02] davechen1y: it is a problem b/c it will prevent the agent from shutting down if there's been no connections received [03:13] nope, that will be fine [03:13] because we run the http.Server in a goroutine [03:13] and signal to it by closing hte underlyning tcp socket [03:14] davechen1y: ok, as long as you're sure [03:15] i'm not sure of anything anymore [03:15] that's why i run the race detector [03:15] i'll see how the patch shapes up [03:15] we can debate it on reviewboard [03:15] thumper: so a stuck uniter doesn't prevent the the state server from upgrading, but it does prevent the unit agent from upgrading (expected) [03:16] davechen1y: sounds good [03:22] menn0: yes... [03:22] * thumper goes to make some notes on the bug [03:45] thumper: http://paste.ubuntu.com/11813756/ [03:45] new race [03:45] :( [03:46] :( [03:46] wat? [03:46] really new? [03:46] fire up your engines! [03:47] i think so [03:48] previous write by: sync.(*Mutex).Lock() [03:48] fucking what? [03:49] davechen1y: if a pointer is used as a bool check [03:49] nil doesn't count as true does it? [03:49] pointer / interface [03:50] oh... [03:50] damn it [03:50] it is a switch [03:51] meh [03:52] github.com/juju/juju/environs/configstore.(*memInfo).clone() [03:52] in code review [03:52] _anything_ with a clone method gets a hairy eyeball [03:53] it's a smell [04:05] menn0: mysql-hacluster/0 did not upgrade because the config-changed hook did not complete [04:06] thumper: ok so that's one problem === kadams54 is now known as kadams54-away [04:08] thumper: do you think it's a uniter bug or a problem with the hook it's running? [04:08] that one is the underlying hook AFAICT [04:08] thumper: or perhaps the hook is running a tool which calls back to juju? [04:09] but all that code is handled by the uniter itself [04:09] so we'd see it [04:14] thumper: so I figured out why we saw those api blocked / "upgrade in progress" log messages [04:14] thumper: it's due to some new stuff that ian added [04:15] thumper: it'll do that until the upgrader has done its first check for a new version [04:15] thumper: so that we don't allow API requests when the agent is about to restart [04:15] thumper: the message is a little confusing though [04:16] ah... [04:16] is that how we manage to have the apiserver not too busy that second time up? [04:16] so it can restart? [04:16] could be [04:16] that would actually explain it [04:17] have you tried to verify that the apiserver actually can get stuck [04:17] ? [04:17] just bootstrapping an environment [04:17] cool [04:17] I have figured that none of the code around this has changed [04:17] so I'm using 1.24.2 for better logging [04:17] and just using the juju upgrade-juju --upload-tools so it increments the buildnumber [04:18] ERROR while stopping machine agent: exec ["stop" "--system" "juju-agent-tim-testlocal"]: exit status 1 (stop: Method "Get" with signature "ss" on interface "org.freedesktop.DBus.Properties" doesn't exist) [04:18] boo hiss [04:21] yeah that happens all the time in my trusty VM [04:25] thumper: menn0 https://github.com/juju/juju/pull/2716 [04:25] for debate [04:25] * menn0 looks [04:25] this is the smallest possible change i can make [04:28] davechen1y: I don't think this is enough [04:29] davechen1y: if you look at crypto/tls you can see that Conn ends up with the the Config [04:29] davechen1y: and it uses it beyond Accept [04:32] davechen1y: so if anything changes the config after it could get a surprise [04:41] davechen1y: oh I see... it's not so bad because Handshake is also blocked by the lock [04:42] davechen1y: but that's kind of a hack [04:45] davechen1y: do you mind if I put up a counter-PR? [04:45] i'm thinking we don't use tls.Listener at all [04:55] menn0: sure [04:55] davechen1y: almost done :) [04:56] davechen1y: do you happen to know a test that failed the race detector [05:02] go test -race ./cmd/jujud/agent [05:02] litmus test [05:07] thumper: what's the story with getting a voting race test ? [05:12] thumper: menn0 have to run out to the bank to get greenbacks for next week [05:13] davechen1y: i'm just pushing this PR now [05:14] davechen1y: flick of a switch when we are at zero [05:15] thumper: so there is a non voting test at the moment ? [05:15] how can I see the results that it sees today ? [05:15] davechen1y: correct [05:15] * thumper looks [05:15] davechen1y: https://github.com/juju/juju/pull/2717 [05:16] davechen1y: with this approach each connect gets it own tls.Config [05:16] davechen1y: http://reports.vapour.ws/releases/2847/job/run-unit-tests-race/attempt/150 [05:17] davechen1y: so there's no way the cert updates can affect existing connections [05:17] davechen1y: also: there's no need for the lock to be held the whole time [05:17] davechen1y: (until the first connection) [05:19] davechen1y: do you understand why processCertChanges wanted to always hold the lock? [05:19] davechen1y: that seems like a serious bug [05:22] menn0: looks fine to me... [05:23] menn0: I'm assuming it works [05:23] thumper: tests pass .... haven't tried to use it [05:24] thumper, davechen1y: so ales made added a nasty bug a few days ago [05:24] oh? [05:25] thumper, davechen1y: look at rev c147767de940cd07db3330ac3a19f9f3547d4be1 [05:25] menn0: c'mon, post a link to the revision at least [05:26] thumper, davechen1y: https://github.com/juju/juju/commit/c147767de940cd07db3330ac3a19f9f3547d4be1 [05:26] with that in place i'm surprise the apiserver even works in master [05:27] because the lock is already held? [05:27] yep [05:27] just checking now [05:30] thumper: well I have no idea how that works [05:30] thumper: but everything seems ok [05:31] what type of lock is it? [05:31] sync.Mutex [05:38] perhaps because process change is never being caleld [05:40] thumper: i've put log messages in and I can see that it is [05:40] ?!? [05:41] i thought that would have deadlocked [05:41] same! [05:43] thumper: it does deadlock [05:43] umm... [05:43] you said that you see the lgos [05:44] logs [05:44] thumper: I just added more logs [05:44] you put the logs before the lock didn't you? [05:44] thumper: the cert update goes to happen and then it gets stuck b/c it can't get the lock again [05:45] thumper: the api server keeps working b/c Handshake isn't even called on changeCertConn (which is the other place the lock is grabbed) [05:45] thumper: so there is no need for changeCertConn to even exist [05:46] thumper: my PR gets rid of changeCertConn anyway [05:48] menn0: care to explain to me how your change works to fix things [05:48] ? [05:49] thumper: sure [05:50] thumper: so it means that each connection ends up with a private copy of the tls.Config [05:50] thumper: so that when the certificate is updated, it doesn't affect existing connections [05:50] thumper: and new connections see the new config with the updated cert [05:50] well that makes sense [05:51] thumper: there's no race because the tls.Config's aren't being shared [05:51] shipit [05:51] thumper: ok cool [05:51] I'm not getting to a reproduction step here. [05:51] thumper: just updating the commit message [05:51] and my brain is fuzzed [05:51] and it is friday [05:52] thumper: i've had very little success with these bugs myself [05:52] thumper: they are so hard to repro [05:52] * thumper nods [05:52] have a good weekend folks [05:52] thumper: debug level logs with the extra logging you added would help a lot [05:52] davechen1y: safe travels [05:52] menn0: yeah... [05:53] there are several extra places we should add logging ... [05:59] davechen1y: thumper is happy but I'd appreciate your feedback on this: http://reviews.vapour.ws/r/2095/ [06:00] davechen1y: i'm EOD now but will leave IRC up and check back later [06:49] menn0: kk [06:57] menn0: ship it, that's excellent === meetingology` is now known as meetingology === wesleyma` is now known as wesleymason [08:35] Bug #1471138 opened: destroy-environment should be able to destroy all sub-environments [09:29] davechen1y: sweet thanks [10:17] * fwereade bbiab [10:47] ha, has anyone realised that we don't allow single-character user names [10:47] ? [10:47] i wonder if that was deliberate === kadams54 is now known as kadams54-away === liam_ is now known as Guest30118 [12:50] TheMue: hb [13:07] perrito666, he's off today [13:07] well he'll read that when he returns hopefuly [13:08] :) [13:19] perrito666, do you need unblocking on http://reviews.vapour.ws/r/1851/ ? [13:20] fwereade: mm, why is that still open? let me check [13:31] 2015-07-03 13:19:53 ERROR juju.cmd supercommand.go:430 tools upload failed: 400 ({"Tools":null,"DisableSSLHostnameVerification":false,"Error":{"Message":"cannot get environment config: invalid series \"wily\"","Code":""}}) [13:32] are going to have to deal with such issues on *every* ubuntu release? [13:32] dimitern: yes [13:32] at least based in historical data [13:33] :( [13:33] tgif at least [13:34] fwereade: I was under the impression that it had been merged, perhas it was merged to previous versions, Ill address that next week (most, if not all, your comments make sense) [13:44] Bug #1471231 opened: debugLogDBIntSuite teardown fails [13:56] Bug #1471237 opened: Mongo causes juju to fail: bad record MAC [13:56] Bug #1471241 opened: RelationSuite teardown fails [14:00] perrito666, and also, can I help at all with http://reviews.vapour.ws/r/1979/ ? [14:01] fwereade: not really, I have to go all over that again and submit another patch but I really dont want to context switch now it has been a terrible week for me in terms of the task at hand [14:02] perrito666, no worries at all, just checking in [14:03] perrito666, when you get back to it just let me know if anything needs clarification [14:26] Bug #1471241 changed: RelationSuite teardown fails [14:26] Bug #1471242 opened: juju upgrade-juju to 1.24.2 fails with "invalid series: willy" [15:15] dimitern: http://reviews.vapour.ws/r/2097/ [15:30] voidspace, awesome! LGTM [15:31] dimitern: wow, that was quick [15:31] dimitern: thanks [15:38] voidspace, well, it's also friday :) [16:14] wwitzel3: ping [17:12] Bug #1471308 opened: TestOpenStateWorksForJobManageEnviron fails intermittently on windows [17:57] * fwereade is finally happy enough with http://reviews.vapour.ws/r/2078/ and would very much appreciate reviews === kadams54 is now known as kadams54-away === kadams54-away is now known as kadams54 [20:30] Bug #1471332 opened: Upgrade fails on windows machines === kadams54 is now known as kadams54-away [20:57] gaah [20:57] I'm inside state [20:57] fwereade: how did you get inside it? [20:57] * perrito666 takes a crowbar and goes to the rescue [20:58] what's the quickest cleanest way to either find out what machine I'm running on, or otherwise come up with another stable id that effectively maps to that [20:58] ? [20:58] fwereade: I'm like 20 mins in to reading your giant branch [20:58] mgz, sorry it's so big but there was a lot to untangle [20:58] mgz, I'm pretty sure it's there now though [20:58] yeaj, it's all pretty sane so far at least [20:58] mgz, cool [20:59] mgz, I pushed a much better description recently fwiw [20:59] mgz, not sure if 20 mins ago or not [20:59] mgz, probably more actually [20:59] fwereade: where are you exactly? [20:59] fwereade: to answer to your question [21:00] perrito666, I'm in a state method which I want to set up the presence watcher and lease/leadership stuff [21:00] perrito666, my lease client needs a unique id for the machine it's running on [21:00] I see [21:00] perrito666, deriving it from current state server id would be great [21:01] perrito666, but, heh, how do we know which we are? [21:01] perrito666, if I weren't inside state I could pass it in via agent config [21:01] perrito666, but that really has no place in state [21:04] fwereade: State{}.serverTag I think === kadams54 is now known as kadams54-away [21:27] perrito666, I'm pretty sure that's the environ the state server's running in [21:27] perrito666, but not unique to a particular machine [21:28] look at IsStateServer