davechen1y | mwhudson: yes scond is the putative condition bits for arm instructions | 00:50 |
---|---|---|
davechen1y | the s is silent | 00:50 |
mwhudson | heh | 00:50 |
mwhudson | 5g is ... something | 00:51 |
mwhudson | oops, not that any more | 00:51 |
=== kadams54 is now known as kadams54-away | ||
menn0 | thumper: how are you getting on with that bug? | 01:41 |
menn0 | davechen1y: we should talk about the certupdater before my EOD | 01:41 |
davechen1y | menn0: how about now ? | 01:58 |
davechen1y | thumper: https://docs.google.com/document/d/1UPYcjMHV2YEHNAVjMWQAh2Htj-pPl2Bt0_POBGLQURE/edit | 01:58 |
davechen1y | so, this sprint we had a dedicaed bug squad | 01:58 |
davechen1y | and the number of bugs fixed went down ... | 01:59 |
davechen1y | i'm not sure that is the result that was intended | 01:59 |
mwhudson | urgh, the 32 bit arm abi is less nice than the 64 bit one | 02:01 |
davechen1y | when you say "less nice" | 02:02 |
davechen1y | to you mean lemon juice is "less nice" than wine | 02:02 |
davechen1y | menn0: want to talk in the standup hangout ? | 02:02 |
mwhudson | davechen1y: well, i wouldn't say the 64 bit one is comparable to wine | 02:03 |
menn0 | davechen1y: sounds good | 02:03 |
davechen1y | why is tim still in the standup | 02:04 |
davechen1y | thumper: why are you still in this hangout ? | 02:04 |
thumper | davechen1y: the bugs fixed went down over the second iteration, which wasn't this one, but the last one | 02:13 |
thumper | menn0: just back | 02:13 |
thumper | menn0: also thinking, that despite our best efforts and analysys, we may still have some weird conditions where workers are not being shut down | 02:14 |
thumper | menn0: this would explain a number of cases... | 02:14 |
thumper | menn0: although, if the pinger is getting stuck, that would explain one or two | 02:14 |
davechen1y | menn0: // Accept waits for and returns the next connection to the listener. | 02:16 |
davechen1y | func (cl *changeCertListener) Accept() (c net.Conn, err error) { cl.m.Lock() defer cl.m.Unlock() | 02:16 |
menn0 | thumper: a worker getting stuck would explain a lot of things | 02:18 |
thumper | menn0: we are seeing it on the ppc64 tests still, and some of the unit agents didn't restart... | 02:19 |
thumper | I'm looking at one that didn't now | 02:19 |
menn0 | thumper: ok | 02:20 |
menn0 | davechen1y: looking at the code for tls.NewListener | 02:21 |
menn0 | davechen1y: i now remember what my plan was | 02:21 |
thumper | definitely a bug in the unit agent | 02:21 |
thumper | at 11:00:23 most workers die for shutdown | 02:22 |
thumper | these two lines are next to each other | 02:22 |
thumper | 2015-07-02 11:00:23 DEBUG juju.worker runner.go:242 killing "rsyslog" | 02:22 |
thumper | 2015-07-02 11:00:25 INFO juju.worker runner.go:261 start "rsyslog" | 02:22 |
thumper | 2 | 02:22 |
thumper | then we can see that jujuc commands are being fired | 02:23 |
menn0 | davechen1y: i was thinking that instead of wrapping a tls.Listener we should copy the tiny amount of code from it and integrate it into certChangeListener | 02:23 |
thumper | config-changed hook being fired | 02:23 |
menn0 | certChangeListener could have an extra method to swap out the config | 02:23 |
menn0 | and the config would need to be protected by a lock | 02:23 |
menn0 | thumper: ok so something ain't dying | 02:24 |
thumper | bingo | 02:24 |
menn0 | thumper: the unit agent has few workers so it shouldn't be hard to figure out which | 02:24 |
thumper | the uniter is running the config-changed hook | 02:24 |
menn0 | thumper: but this doesn't necessarily explain the upgrade issue we were looking at | 02:25 |
thumper | no... | 02:25 |
thumper | that is most likely something else too | 02:25 |
menn0 | thumper: although they could be caused by a related change | 02:25 |
* thumper nods | 02:25 | |
thumper | oh fuck... | 02:26 |
* menn0 waits for good/bad news | 02:27 | |
thumper | well... | 02:31 |
thumper | this is one problem | 02:31 |
thumper | when the apiservers come back on line | 02:31 |
thumper | the unit agent connects | 02:31 |
thumper | one of the first thing that fires is config-changed | 02:31 |
thumper | even though it probably didn't | 02:31 |
thumper | now while the config-changed hook is being fired | 02:31 |
thumper | we get told to upgrade | 02:31 |
thumper | everything dies | 02:31 |
thumper | except the uniter | 02:32 |
thumper | which carries on for a while | 02:32 |
thumper | then doesn't stop | 02:32 |
thumper | the uniter is told to die... | 02:32 |
thumper | can't tell if it actually does | 02:32 |
thumper | certainly doesn't finish | 02:33 |
thumper | the agent doesn't restart | 02:33 |
menn0 | right | 02:33 |
thumper | seems to happen on a subset of units | 02:33 |
thumper | certainly most restart | 02:33 |
thumper | this screams race somewhere | 02:34 |
menn0 | thumper: this is probably what caused all those config-changed hook errors. | 02:34 |
thumper | and may well be the cause of these other problems | 02:34 |
thumper | I have to dig in a little... | 02:34 |
menn0 | b/c the unit agent doesn't shut down the API connection so the apiserver doesn't stop | 02:34 |
thumper | to work that one out | 02:34 |
* menn0 is going to check if this could be the cause of the other bug he's looking at | 02:35 | |
thumper | yeah... all seems closely related | 02:35 |
menn0 | i might make the uniter intentional hang to simulate and see what happens when an upgrade is requested | 02:35 |
davechen1y | ,menn0 so | 03:00 |
davechen1y | the approach I discussed on the call has one drawback | 03:00 |
davechen1y | using that lock where I said | 03:00 |
davechen1y | means the cert update worker will _block_ until one connetion is received by the apiserver | 03:01 |
davechen1y | i'm not sure how much of a problem this will be in practice | 03:01 |
menn0 | davechen1y: it is a problem b/c it will prevent the agent from shutting down if there's been no connections received | 03:02 |
davechen1y | nope, that will be fine | 03:13 |
davechen1y | because we run the http.Server in a goroutine | 03:13 |
davechen1y | and signal to it by closing hte underlyning tcp socket | 03:13 |
menn0 | davechen1y: ok, as long as you're sure | 03:14 |
davechen1y | i'm not sure of anything anymore | 03:15 |
davechen1y | that's why i run the race detector | 03:15 |
davechen1y | i'll see how the patch shapes up | 03:15 |
davechen1y | we can debate it on reviewboard | 03:15 |
menn0 | thumper: so a stuck uniter doesn't prevent the the state server from upgrading, but it does prevent the unit agent from upgrading (expected) | 03:15 |
menn0 | davechen1y: sounds good | 03:16 |
thumper | menn0: yes... | 03:22 |
* thumper goes to make some notes on the bug | 03:22 | |
davechen1y | thumper: http://paste.ubuntu.com/11813756/ | 03:45 |
davechen1y | new race | 03:45 |
davechen1y | :( | 03:45 |
thumper | :( | 03:46 |
thumper | wat? | 03:46 |
thumper | really new? | 03:46 |
mwhudson | fire up your engines! | 03:46 |
davechen1y | i think so | 03:47 |
thumper | previous write by: sync.(*Mutex).Lock() | 03:48 |
thumper | fucking what? | 03:48 |
thumper | davechen1y: if a pointer is used as a bool check | 03:49 |
thumper | nil doesn't count as true does it? | 03:49 |
thumper | pointer / interface | 03:49 |
thumper | oh... | 03:50 |
thumper | damn it | 03:50 |
thumper | it is a switch | 03:50 |
thumper | meh | 03:51 |
davechen1y | github.com/juju/juju/environs/configstore.(*memInfo).clone() | 03:52 |
davechen1y | in code review | 03:52 |
davechen1y | _anything_ with a clone method gets a hairy eyeball | 03:52 |
davechen1y | it's a smell | 03:53 |
thumper | menn0: mysql-hacluster/0 did not upgrade because the config-changed hook did not complete | 04:05 |
menn0 | thumper: ok so that's one problem | 04:06 |
=== kadams54 is now known as kadams54-away | ||
menn0 | thumper: do you think it's a uniter bug or a problem with the hook it's running? | 04:08 |
thumper | that one is the underlying hook AFAICT | 04:08 |
menn0 | thumper: or perhaps the hook is running a tool which calls back to juju? | 04:08 |
thumper | but all that code is handled by the uniter itself | 04:09 |
thumper | so we'd see it | 04:09 |
menn0 | thumper: so I figured out why we saw those api blocked / "upgrade in progress" log messages | 04:14 |
menn0 | thumper: it's due to some new stuff that ian added | 04:14 |
menn0 | thumper: it'll do that until the upgrader has done its first check for a new version | 04:15 |
menn0 | thumper: so that we don't allow API requests when the agent is about to restart | 04:15 |
menn0 | thumper: the message is a little confusing though | 04:15 |
thumper | ah... | 04:16 |
thumper | is that how we manage to have the apiserver not too busy that second time up? | 04:16 |
thumper | so it can restart? | 04:16 |
menn0 | could be | 04:16 |
thumper | that would actually explain it | 04:16 |
menn0 | have you tried to verify that the apiserver actually can get stuck | 04:17 |
menn0 | ? | 04:17 |
thumper | just bootstrapping an environment | 04:17 |
menn0 | cool | 04:17 |
thumper | I have figured that none of the code around this has changed | 04:17 |
thumper | so I'm using 1.24.2 for better logging | 04:17 |
thumper | and just using the juju upgrade-juju --upload-tools so it increments the buildnumber | 04:17 |
thumper | ERROR while stopping machine agent: exec ["stop" "--system" "juju-agent-tim-testlocal"]: exit status 1 (stop: Method "Get" with signature "ss" on interface "org.freedesktop.DBus.Properties" doesn't exist) | 04:18 |
thumper | boo hiss | 04:18 |
menn0 | yeah that happens all the time in my trusty VM | 04:21 |
davechen1y | thumper: menn0 https://github.com/juju/juju/pull/2716 | 04:25 |
davechen1y | for debate | 04:25 |
* menn0 looks | 04:25 | |
davechen1y | this is the smallest possible change i can make | 04:25 |
menn0 | davechen1y: I don't think this is enough | 04:28 |
menn0 | davechen1y: if you look at crypto/tls you can see that Conn ends up with the the Config | 04:29 |
menn0 | davechen1y: and it uses it beyond Accept | 04:29 |
menn0 | davechen1y: so if anything changes the config after it could get a surprise | 04:32 |
menn0 | davechen1y: oh I see... it's not so bad because Handshake is also blocked by the lock | 04:41 |
menn0 | davechen1y: but that's kind of a hack | 04:42 |
menn0 | davechen1y: do you mind if I put up a counter-PR? | 04:45 |
menn0 | i'm thinking we don't use tls.Listener at all | 04:45 |
davechen1y | menn0: sure | 04:55 |
menn0 | davechen1y: almost done :) | 04:55 |
menn0 | davechen1y: do you happen to know a test that failed the race detector | 04:56 |
davechen1y | go test -race ./cmd/jujud/agent | 05:02 |
davechen1y | litmus test | 05:02 |
davechen1y | thumper: what's the story with getting a voting race test ? | 05:07 |
davechen1y | thumper: menn0 have to run out to the bank to get greenbacks for next week | 05:12 |
menn0 | davechen1y: i'm just pushing this PR now | 05:13 |
thumper | davechen1y: flick of a switch when we are at zero | 05:14 |
davechen1y | thumper: so there is a non voting test at the moment ? | 05:15 |
davechen1y | how can I see the results that it sees today ? | 05:15 |
thumper | davechen1y: correct | 05:15 |
* thumper looks | 05:15 | |
menn0 | davechen1y: https://github.com/juju/juju/pull/2717 | 05:15 |
menn0 | davechen1y: with this approach each connect gets it own tls.Config | 05:16 |
thumper | davechen1y: http://reports.vapour.ws/releases/2847/job/run-unit-tests-race/attempt/150 | 05:16 |
menn0 | davechen1y: so there's no way the cert updates can affect existing connections | 05:17 |
menn0 | davechen1y: also: there's no need for the lock to be held the whole time | 05:17 |
menn0 | davechen1y: (until the first connection) | 05:17 |
menn0 | davechen1y: do you understand why processCertChanges wanted to always hold the lock? | 05:19 |
menn0 | davechen1y: that seems like a serious bug | 05:19 |
thumper | menn0: looks fine to me... | 05:22 |
thumper | menn0: I'm assuming it works | 05:23 |
menn0 | thumper: tests pass .... haven't tried to use it | 05:23 |
menn0 | thumper, davechen1y: so ales made added a nasty bug a few days ago | 05:24 |
thumper | oh? | 05:24 |
menn0 | thumper, davechen1y: look at rev c147767de940cd07db3330ac3a19f9f3547d4be1 | 05:25 |
thumper | menn0: c'mon, post a link to the revision at least | 05:25 |
menn0 | thumper, davechen1y: https://github.com/juju/juju/commit/c147767de940cd07db3330ac3a19f9f3547d4be1 | 05:26 |
menn0 | with that in place i'm surprise the apiserver even works in master | 05:26 |
thumper | because the lock is already held? | 05:27 |
menn0 | yep | 05:27 |
menn0 | just checking now | 05:27 |
menn0 | thumper: well I have no idea how that works | 05:30 |
menn0 | thumper: but everything seems ok | 05:30 |
thumper | what type of lock is it? | 05:31 |
menn0 | sync.Mutex | 05:31 |
thumper | perhaps because process change is never being caleld | 05:38 |
menn0 | thumper: i've put log messages in and I can see that it is | 05:40 |
thumper | ?!? | 05:40 |
thumper | i thought that would have deadlocked | 05:41 |
menn0 | same! | 05:41 |
menn0 | thumper: it does deadlock | 05:43 |
thumper | umm... | 05:43 |
thumper | you said that you see the lgos | 05:43 |
thumper | logs | 05:44 |
menn0 | thumper: I just added more logs | 05:44 |
thumper | you put the logs before the lock didn't you? | 05:44 |
menn0 | thumper: the cert update goes to happen and then it gets stuck b/c it can't get the lock again | 05:44 |
menn0 | thumper: the api server keeps working b/c Handshake isn't even called on changeCertConn (which is the other place the lock is grabbed) | 05:45 |
menn0 | thumper: so there is no need for changeCertConn to even exist | 05:45 |
menn0 | thumper: my PR gets rid of changeCertConn anyway | 05:46 |
thumper | menn0: care to explain to me how your change works to fix things | 05:48 |
thumper | ? | 05:48 |
menn0 | thumper: sure | 05:49 |
menn0 | thumper: so it means that each connection ends up with a private copy of the tls.Config | 05:50 |
menn0 | thumper: so that when the certificate is updated, it doesn't affect existing connections | 05:50 |
menn0 | thumper: and new connections see the new config with the updated cert | 05:50 |
thumper | well that makes sense | 05:50 |
menn0 | thumper: there's no race because the tls.Config's aren't being shared | 05:51 |
thumper | shipit | 05:51 |
menn0 | thumper: ok cool | 05:51 |
thumper | I'm not getting to a reproduction step here. | 05:51 |
menn0 | thumper: just updating the commit message | 05:51 |
thumper | and my brain is fuzzed | 05:51 |
thumper | and it is friday | 05:51 |
menn0 | thumper: i've had very little success with these bugs myself | 05:52 |
menn0 | thumper: they are so hard to repro | 05:52 |
* thumper nods | 05:52 | |
thumper | have a good weekend folks | 05:52 |
menn0 | thumper: debug level logs with the extra logging you added would help a lot | 05:52 |
thumper | davechen1y: safe travels | 05:52 |
thumper | menn0: yeah... | 05:52 |
thumper | there are several extra places we should add logging ... | 05:53 |
menn0 | davechen1y: thumper is happy but I'd appreciate your feedback on this: http://reviews.vapour.ws/r/2095/ | 05:59 |
menn0 | davechen1y: i'm EOD now but will leave IRC up and check back later | 06:00 |
davechen1y | menn0: kk | 06:49 |
davechen1y | menn0: ship it, that's excellent | 06:57 |
=== meetingology` is now known as meetingology | ||
=== wesleyma` is now known as wesleymason | ||
mup | Bug #1471138 opened: destroy-environment should be able to destroy all sub-environments <juju-core:New> <https://launchpad.net/bugs/1471138> | 08:35 |
menn0 | davechen1y: sweet thanks | 09:29 |
* fwereade bbiab | 10:17 | |
rogpeppe1 | ha, has anyone realised that we don't allow single-character user names | 10:47 |
rogpeppe1 | ? | 10:47 |
rogpeppe1 | i wonder if that was deliberate | 10:47 |
=== kadams54 is now known as kadams54-away | ||
=== liam_ is now known as Guest30118 | ||
perrito666 | TheMue: hb | 12:50 |
dimitern | perrito666, he's off today | 13:07 |
perrito666 | well he'll read that when he returns hopefuly | 13:07 |
dimitern | :) | 13:08 |
fwereade | perrito666, do you need unblocking on http://reviews.vapour.ws/r/1851/ ? | 13:19 |
perrito666 | fwereade: mm, why is that still open? let me check | 13:20 |
dimitern | 2015-07-03 13:19:53 ERROR juju.cmd supercommand.go:430 tools upload failed: 400 ({"Tools":null,"DisableSSLHostnameVerification":false,"Error":{"Message":"cannot get environment config: invalid series \"wily\"","Code":""}}) | 13:31 |
dimitern | are going to have to deal with such issues on *every* ubuntu release? | 13:32 |
perrito666 | dimitern: yes | 13:32 |
perrito666 | at least based in historical data | 13:32 |
dimitern | :( | 13:33 |
dimitern | tgif at least | 13:33 |
perrito666 | fwereade: I was under the impression that it had been merged, perhas it was merged to previous versions, Ill address that next week (most, if not all, your comments make sense) | 13:34 |
mup | Bug #1471231 opened: debugLogDBIntSuite teardown fails <ci> <unit-tests> <juju-core:Incomplete> <juju-core db-log:Triaged> <https://launchpad.net/bugs/1471231> | 13:44 |
mup | Bug #1471237 opened: Mongo causes juju to fail: bad record MAC <mongodb> <reliability> <juju-core:Triaged> <mongodb (Ubuntu):Confirmed> <https://launchpad.net/bugs/1471237> | 13:56 |
mup | Bug #1471241 opened: RelationSuite teardown fails <ci> <intermittent-failure> <unit-tests> <juju-core:Triaged> <https://launchpad.net/bugs/1471241> | 13:56 |
fwereade | perrito666, and also, can I help at all with http://reviews.vapour.ws/r/1979/ ? | 14:00 |
perrito666 | fwereade: not really, I have to go all over that again and submit another patch but I really dont want to context switch now it has been a terrible week for me in terms of the task at hand | 14:01 |
fwereade | perrito666, no worries at all, just checking in | 14:02 |
fwereade | perrito666, when you get back to it just let me know if anything needs clarification | 14:03 |
mup | Bug #1471241 changed: RelationSuite teardown fails <ci> <intermittent-failure> <unit-tests> <juju-core:Triaged> <https://launchpad.net/bugs/1471241> | 14:26 |
mup | Bug #1471242 opened: juju upgrade-juju to 1.24.2 fails with "invalid series: willy" <juju-core:New> <https://launchpad.net/bugs/1471242> | 14:26 |
voidspace | dimitern: http://reviews.vapour.ws/r/2097/ | 15:15 |
dimitern | voidspace, awesome! LGTM | 15:30 |
voidspace | dimitern: wow, that was quick | 15:31 |
voidspace | dimitern: thanks | 15:31 |
dimitern | voidspace, well, it's also friday :) | 15:38 |
rogpeppe1 | wwitzel3: ping | 16:14 |
mup | Bug #1471308 opened: TestOpenStateWorksForJobManageEnviron fails intermittently on windows <ci> <intermittent-failure> <test-failure> <windows> <juju-core:Triaged> <https://launchpad.net/bugs/1471308> | 17:12 |
* fwereade is finally happy enough with http://reviews.vapour.ws/r/2078/ and would very much appreciate reviews | 17:57 | |
=== kadams54 is now known as kadams54-away | ||
=== kadams54-away is now known as kadams54 | ||
mup | Bug #1471332 opened: Upgrade fails on windows machines <ci> <upgrade-juju> <windows> <juju-core:Triaged> <juju-core 1.24:Triaged> <https://launchpad.net/bugs/1471332> | 20:30 |
=== kadams54 is now known as kadams54-away | ||
fwereade | gaah | 20:57 |
fwereade | I'm inside state | 20:57 |
perrito666 | fwereade: how did you get inside it? | 20:57 |
* perrito666 takes a crowbar and goes to the rescue | 20:57 | |
fwereade | what's the quickest cleanest way to either find out what machine I'm running on, or otherwise come up with another stable id that effectively maps to that | 20:58 |
fwereade | ? | 20:58 |
mgz | fwereade: I'm like 20 mins in to reading your giant branch | 20:58 |
fwereade | mgz, sorry it's so big but there was a lot to untangle | 20:58 |
fwereade | mgz, I'm pretty sure it's there now though | 20:58 |
mgz | yeaj, it's all pretty sane so far at least | 20:58 |
fwereade | mgz, cool | 20:58 |
fwereade | mgz, I pushed a much better description recently fwiw | 20:59 |
fwereade | mgz, not sure if 20 mins ago or not | 20:59 |
fwereade | mgz, probably more actually | 20:59 |
perrito666 | fwereade: where are you exactly? | 20:59 |
perrito666 | fwereade: to answer to your question | 20:59 |
fwereade | perrito666, I'm in a state method which I want to set up the presence watcher and lease/leadership stuff | 21:00 |
fwereade | perrito666, my lease client needs a unique id for the machine it's running on | 21:00 |
perrito666 | I see | 21:00 |
fwereade | perrito666, deriving it from current state server id would be great | 21:00 |
fwereade | perrito666, but, heh, how do we know which we are? | 21:01 |
fwereade | perrito666, if I weren't inside state I could pass it in via agent config | 21:01 |
fwereade | perrito666, but that really has no place in state | 21:01 |
perrito666 | fwereade: State{}.serverTag I think | 21:04 |
=== kadams54 is now known as kadams54-away | ||
fwereade | perrito666, I'm pretty sure that's the environ the state server's running in | 21:27 |
fwereade | perrito666, but not unique to a particular machine | 21:27 |
perrito666 | look at IsStateServer | 21:28 |
Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!