/srv/irclogs.ubuntu.com/2015/07/03/#juju-dev.txt

davechen1ymwhudson: yes scond is the putative condition bits for arm instructions00:50
davechen1ythe s is silent00:50
mwhudsonheh00:50
mwhudson5g is ... something00:51
mwhudsonoops, not that any more00:51
=== kadams54 is now known as kadams54-away
menn0thumper: how are you getting on with that bug?01:41
menn0davechen1y: we should talk about the certupdater before my EOD01:41
davechen1ymenn0: how about now ?01:58
davechen1ythumper: https://docs.google.com/document/d/1UPYcjMHV2YEHNAVjMWQAh2Htj-pPl2Bt0_POBGLQURE/edit01:58
davechen1yso, this sprint we had a dedicated bug squad01:58
davechen1yand the number of bugs fixed went down ...01:59
davechen1yi'm not sure that is the result that was intended01:59
mwhudsonurgh, the 32 bit arm abi is less nice than the 64 bit one02:01
davechen1ywhen you say "less nice"02:02
davechen1ydo you mean lemon juice is "less nice" than wine02:02
davechen1ymenn0: want to talk in the standup hangout ?02:02
mwhudsondavechen1y: well, i wouldn't say the 64 bit one is comparable to wine02:03
menn0davechen1y: sounds good02:03
davechen1ywhy is tim still in the standup02:04
davechen1ythumper: why are you still in this hangout ?02:04
thumperdavechen1y: the bugs fixed went down over the second iteration, which wasn't this one, but the last one02:13
thumpermenn0: just back02:13
thumpermenn0: also thinking, that despite our best efforts and analysis, we may still have some weird conditions where workers are not being shut down02:14
thumpermenn0: this would explain a number of cases...02:14
thumpermenn0: although, if the pinger is getting stuck, that would explain one or two02:14
davechen1ymenn0: // Accept waits for and returns the next connection to the listener.02:16
davechen1yfunc (cl *changeCertListener) Accept() (c net.Conn, err error) { cl.m.Lock() defer cl.m.Unlock()02:16
menn0thumper: a worker getting stuck would explain a lot of things02:18
thumpermenn0: we are seeing it on the ppc64 tests still, and some of the unit agents didn't restart...02:19
thumperI'm looking at one that didn't now02:19
menn0thumper: ok02:20
menn0davechen1y: looking at the code for tls.NewListener02:21
menn0davechen1y: i now remember what my plan was02:21
thumperdefinitely a bug in the unit agent02:21
thumperat 11:00:23 most workers die for shutdown02:22
thumperthese two lines are next to each other02:22
thumper2015-07-02 11:00:23 DEBUG juju.worker runner.go:242 killing "rsyslog"02:22
thumper2015-07-02 11:00:25 INFO juju.worker runner.go:261 start "rsyslog"02:22
thumper202:22
thumperthen we can see that jujuc commands are being fired02:23
menn0davechen1y: i was thinking that instead of wrapping a tls.Listener we should copy the tiny amount of code from it and integrate it into certChangeListener02:23
thumperconfig-changed hook being fired02:23
menn0certChangeListener could have an extra method to swap out the config02:23
menn0and the config would need to be protected by a lock02:23
menn0thumper: ok so something ain't dying02:24
thumperbingo02:24
menn0thumper: the unit agent has few workers so it shouldn't be hard to figure out which02:24
thumperthe uniter is running the config-changed hook02:24
menn0thumper: but this doesn't necessarily explain the upgrade issue we were looking at02:25
thumperno...02:25
thumperthat is most likely something else too02:25
menn0thumper: although they could be caused by a related change02:25
* thumper nods02:25
thumperoh fuck...02:26
* menn0 waits for good/bad news02:27
thumperwell...02:31
thumperthis is one problem02:31
thumperwhen the apiservers come back on line02:31
thumperthe unit agent connects02:31
thumperone of the first things that fires is config-changed02:31
thumpereven though it probably didn't02:31
thumpernow while the config-changed hook is being fired02:31
thumperwe get told to upgrade02:31
thumpereverything dies02:31
thumperexcept the uniter02:32
thumperwhich carries on for a while02:32
thumperthen doesn't stop02:32
thumperthe uniter is told to die...02:32
thumpercan't tell if it actually does02:32
thumpercertainly doesn't finish02:33
thumperthe agent doesn't restart02:33
menn0right02:33
thumperseems to happen on a subset of units02:33
thumpercertainly most restart02:33
thumperthis screams race somewhere02:34
menn0thumper: this is probably what caused all those config-changed hook errors.02:34
thumperand may well be the cause of these other problems02:34
thumperI have to dig in a little...02:34
menn0b/c the unit agent doesn't shut down the API connection so the apiserver doesn't stop02:34
thumperto work that one out02:34
* menn0 is going to check if this could be the cause of the other bug he's looking at02:35
thumperyeah... all seems closely related02:35
menn0i might make the uniter intentionally hang to simulate and see what happens when an upgrade is requested02:35
davechen1ymenn0: so03:00
davechen1ythe approach I discussed on the call has one drawback03:00
davechen1yusing that lock where I said03:00
davechen1ymeans the cert update worker will _block_ until one connection is received by the apiserver03:01
davechen1yi'm not sure how much of a problem this will be in practice03:01
menn0davechen1y: it is a problem b/c it will prevent the agent from shutting down if there's been no connections received03:02
davechen1ynope, that will be fine03:13
davechen1ybecause we run the http.Server in a goroutine03:13
davechen1yand signal to it by closing the underlying tcp socket03:13
menn0davechen1y: ok, as long as you're sure03:14
davechen1yi'm not sure of anything anymore03:15
davechen1ythat's why i run the race detector03:15
davechen1yi'll see how the patch shapes up03:15
davechen1ywe can debate it on reviewboard03:15
menn0thumper: so a stuck uniter doesn't prevent the state server from upgrading, but it does prevent the unit agent from upgrading (expected)03:15
menn0davechen1y: sounds good03:16
thumpermenn0: yes...03:22
* thumper goes to make some notes on the bug03:22
davechen1ythumper: http://paste.ubuntu.com/11813756/03:45
davechen1ynew race03:45
davechen1y:(03:45
thumper:(03:46
thumperwat?03:46
thumperreally new?03:46
mwhudsonfire up your engines!03:46
davechen1yi think so03:47
thumperprevious write by: sync.(*Mutex).Lock()03:48
thumperfucking what?03:48
thumperdavechen1y: if a pointer is used as a bool check03:49
thumpernil doesn't count as true does it?03:49
thumperpointer / interface03:49
thumperoh...03:50
thumperdamn it03:50
thumperit is a switch03:50
thumpermeh03:51
davechen1ygithub.com/juju/juju/environs/configstore.(*memInfo).clone()03:52
davechen1yin code review03:52
davechen1y_anything_ with a clone method gets a hairy eyeball03:52
davechen1yit's a smell03:53
thumpermenn0: mysql-hacluster/0 did not upgrade because the config-changed hook did not complete04:05
menn0thumper: ok so that's one problem04:06
=== kadams54 is now known as kadams54-away
menn0thumper: do you think it's a uniter bug or a problem with the hook it's running?04:08
thumperthat one is the underlying hook AFAICT04:08
menn0thumper: or perhaps the hook is running a tool which calls back to juju?04:08
thumperbut all that code is handled by the uniter itself04:09
thumperso we'd see it04:09
menn0thumper: so I figured out why we saw those api blocked / "upgrade in progress" log messages04:14
menn0thumper: it's due to some new stuff that ian added04:14
menn0thumper: it'll do that until the upgrader has done its first check for a new version04:15
menn0thumper: so that we don't allow API requests when the agent is about to restart04:15
menn0thumper: the message is a little confusing though04:15
thumperah...04:16
thumperis that how we manage to have the apiserver not too busy that second time up?04:16
thumperso it can restart?04:16
menn0could be04:16
thumperthat would actually explain it04:16
menn0have you tried to verify that the apiserver actually can get stuck04:17
menn0?04:17
thumperjust bootstrapping an environment04:17
menn0cool04:17
thumperI have figured that none of the code around this has changed04:17
thumperso I'm using 1.24.2 for better logging04:17
thumperand just using the juju upgrade-juju --upload-tools so it increments the buildnumber04:17
thumperERROR while stopping machine agent: exec ["stop" "--system" "juju-agent-tim-testlocal"]: exit status 1 (stop: Method "Get" with signature "ss" on interface "org.freedesktop.DBus.Properties" doesn't exist)04:18
thumperboo hiss04:18
menn0yeah that happens all the time in my trusty VM04:21
davechen1ythumper: menn0 https://github.com/juju/juju/pull/271604:25
davechen1yfor debate04:25
* menn0 looks04:25
davechen1ythis is the smallest possible change i can make04:25
menn0davechen1y: I don't think this is enough04:28
menn0davechen1y: if you look at crypto/tls you can see that Conn ends up with the Config04:29
menn0davechen1y: and it uses it beyond Accept04:29
menn0davechen1y: so if anything changes the config after it could get a surprise04:32
menn0davechen1y: oh I see... it's not so bad because Handshake is also blocked by the lock04:41
menn0davechen1y: but that's kind of a hack04:42
menn0davechen1y: do you mind if I put up a counter-PR?04:45
menn0i'm thinking we don't use tls.Listener at all04:45
davechen1ymenn0: sure04:55
menn0davechen1y: almost done :)04:55
menn0davechen1y: do you happen to know a test that failed the race detector04:56
davechen1ygo test -race ./cmd/jujud/agent05:02
davechen1ylitmus test05:02
davechen1ythumper: what's the story with getting a voting race test ?05:07
davechen1ythumper:  menn0 have to run out to the bank to get greenbacks for next week05:12
menn0davechen1y: i'm just pushing this PR now05:13
thumperdavechen1y: flick of a switch when we are at zero05:14
davechen1ythumper: so there is a non voting test at the moment ?05:15
davechen1yhow can I see the results that it sees today ?05:15
thumperdavechen1y: correct05:15
* thumper looks05:15
menn0davechen1y: https://github.com/juju/juju/pull/271705:15
menn0davechen1y: with this approach each connection gets its own tls.Config05:16
thumperdavechen1y: http://reports.vapour.ws/releases/2847/job/run-unit-tests-race/attempt/15005:16
menn0davechen1y: so there's no way the cert updates can affect existing connections05:17
menn0davechen1y: also: there's no need for the lock to be held the whole time05:17
menn0davechen1y: (until the first connection)05:17
menn0davechen1y: do you understand why processCertChanges wanted to always hold the lock?05:19
menn0davechen1y: that seems like a serious bug05:19
thumpermenn0: looks fine to me...05:22
thumpermenn0: I'm assuming it works05:23
menn0thumper: tests pass .... haven't tried to use it05:23
menn0thumper, davechen1y: so ales added a nasty bug a few days ago05:24
thumperoh?05:24
menn0thumper, davechen1y: look at rev c147767de940cd07db3330ac3a19f9f3547d4be105:25
thumpermenn0: c'mon, post a link to the revision at least05:25
menn0thumper, davechen1y: https://github.com/juju/juju/commit/c147767de940cd07db3330ac3a19f9f3547d4be105:26
menn0with that in place i'm surprised the apiserver even works in master05:26
thumperbecause the lock is already held?05:27
menn0yep05:27
menn0just checking now05:27
menn0thumper: well I have no idea how that works05:30
menn0thumper: but everything seems ok05:30
thumperwhat type of lock is it?05:31
menn0sync.Mutex05:31
thumperperhaps because process change is never being called05:38
menn0thumper: i've put log messages in and I can see that it is05:40
thumper?!?05:40
thumperi thought that would have deadlocked05:41
menn0same!05:41
menn0thumper: it does deadlock05:43
thumperumm...05:43
thumperyou said that you see the lgos05:43
thumperlogs05:44
menn0thumper: I just added more logs05:44
thumperyou put the logs before the lock didn't you?05:44
menn0thumper: the cert update goes to happen and then it gets stuck b/c it can't get the lock again05:44
menn0thumper: the api server keeps working b/c Handshake isn't even called on changeCertConn (which is the other place the lock is grabbed)05:45
menn0thumper: so there is no need for changeCertConn to even exist05:45
menn0thumper: my PR gets rid of changeCertConn anyway05:46
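The deadlock discussed above comes down to Go's sync.Mutex not being reentrant: a second Lock on a mutex that is already held blocks until someone unlocks it. The sketch below is an illustration only. `certHolder`, `newLockedHolder` and `tryUpdate` are hypothetical names, and the assumption that the lock is taken up front and never released is inferred from the conversation, not from the commit itself. It uses `TryLock` (Go 1.18 and later) so the example reports the failure instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// certHolder is a hypothetical stand-in for something that guards a
// certificate with a sync.Mutex.
type certHolder struct {
	mu   sync.Mutex
	cert string
}

// newLockedHolder mimics the assumed bug: the lock is taken up front and
// not released.
func newLockedHolder() *certHolder {
	h := &certHolder{}
	h.mu.Lock()
	return h
}

// tryUpdate reports whether the update could take the lock. A plain Lock()
// here would block forever while the lock is still held, because
// sync.Mutex is not reentrant.
func (h *certHolder) tryUpdate(cert string) bool {
	if !h.mu.TryLock() {
		return false
	}
	defer h.mu.Unlock()
	h.cert = cert
	return true
}

func main() {
	h := newLockedHolder()
	fmt.Println("update while lock is held:", h.tryUpdate("new-cert"))
	h.mu.Unlock()
	fmt.Println("update after unlock:", h.tryUpdate("new-cert"))
}
```

This matches what menn0 observed: the log line before the lock is printed, and then the cert update gets stuck.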
thumpermenn0: care to explain to me how your change works to fix things05:48
thumper?05:48
menn0thumper: sure05:49
menn0thumper: so it means that each connection ends up with a private copy of the tls.Config05:50
menn0thumper: so that when the certificate is updated, it doesn't affect existing connections05:50
menn0thumper: and new connections see the new config with the updated cert05:50
thumperwell that makes sense05:50
menn0thumper: there's no race because the tls.Config's aren't being shared05:51
thumpershipit05:51
menn0thumper: ok cool05:51
thumperI'm not getting to a reproduction step here.05:51
menn0thumper: just updating the commit message05:51
thumperand my brain is fuzzed05:51
thumperand it is friday05:51
menn0thumper: i've had very little success with these bugs myself05:52
menn0thumper: they are so hard to repro05:52
* thumper nods05:52
thumperhave a good weekend folks05:52
menn0thumper: debug level logs with the extra logging you added would help a lot05:52
thumperdavechen1y: safe travels05:52
thumpermenn0: yeah...05:52
thumperthere are several extra places we should add logging ...05:53
menn0davechen1y: thumper is happy but I'd appreciate your feedback on this: http://reviews.vapour.ws/r/2095/05:59
menn0davechen1y: i'm EOD now but will leave IRC up and check back later06:00
davechen1ymenn0: kk06:49
davechen1ymenn0: ship it, that's excellent06:57
=== meetingology` is now known as meetingology
=== wesleyma` is now known as wesleymason
mupBug #1471138 opened: destroy-environment should be able to destroy all sub-environments <juju-core:New> <https://launchpad.net/bugs/1471138>08:35
menn0davechen1y: sweet thanks09:29
* fwereade bbiab10:17
rogpeppe1ha, has anyone realised that we don't allow single-character user names10:47
rogpeppe1?10:47
rogpeppe1i wonder if that was deliberate10:47
=== kadams54 is now known as kadams54-away
=== liam_ is now known as Guest30118
perrito666TheMue: hb12:50
dimiternperrito666, he's off today13:07
perrito666well he'll read that when he returns hopefully13:07
dimitern:)13:08
fwereadeperrito666, do you need unblocking on http://reviews.vapour.ws/r/1851/ ?13:19
perrito666fwereade: mm, why is that still open? let me check13:20
dimitern2015-07-03 13:19:53 ERROR juju.cmd supercommand.go:430 tools upload failed: 400 ({"Tools":null,"DisableSSLHostnameVerification":false,"Error":{"Message":"cannot get environment config: invalid series \"wily\"","Code":""}})13:31
dimiternare we going to have to deal with such issues on *every* ubuntu release?13:32
perrito666dimitern: yes13:32
perrito666at least based on historical data13:32
dimitern:(13:33
dimiterntgif at least13:33
perrito666fwereade: I was under the impression that it had been merged, perhaps it was merged to previous versions, I'll address that next week (most, if not all, your comments make sense)13:34
mupBug #1471231 opened: debugLogDBIntSuite teardown fails <ci> <unit-tests> <juju-core:Incomplete> <juju-core db-log:Triaged> <https://launchpad.net/bugs/1471231>13:44
mupBug #1471237 opened: Mongo causes juju to fail: bad record MAC <mongodb> <reliability> <juju-core:Triaged> <mongodb (Ubuntu):Confirmed> <https://launchpad.net/bugs/1471237>13:56
mupBug #1471241 opened: RelationSuite teardown fails <ci> <intermittent-failure> <unit-tests> <juju-core:Triaged> <https://launchpad.net/bugs/1471241>13:56
fwereadeperrito666, and also, can I help at all with http://reviews.vapour.ws/r/1979/ ?14:00
perrito666fwereade: not really, I have to go all over that again and submit another patch, but I really don't want to context switch now; it has been a terrible week for me in terms of the task at hand14:01
fwereadeperrito666, no worries at all, just checking in14:02
fwereadeperrito666, when you get back to it just let me know if anything needs clarification14:03
mupBug #1471241 changed: RelationSuite teardown fails <ci> <intermittent-failure> <unit-tests> <juju-core:Triaged> <https://launchpad.net/bugs/1471241>14:26
mupBug #1471242 opened: juju upgrade-juju to 1.24.2 fails with "invalid series: willy" <juju-core:New> <https://launchpad.net/bugs/1471242>14:26
voidspacedimitern: http://reviews.vapour.ws/r/2097/15:15
dimiternvoidspace, awesome! LGTM15:30
voidspacedimitern: wow, that was quick15:31
voidspacedimitern: thanks15:31
dimiternvoidspace, well, it's also friday :)15:38
rogpeppe1wwitzel3: ping16:14
mupBug #1471308 opened: TestOpenStateWorksForJobManageEnviron fails intermittently on windows <ci> <intermittent-failure> <test-failure> <windows> <juju-core:Triaged> <https://launchpad.net/bugs/1471308>17:12
* fwereade is finally happy enough with http://reviews.vapour.ws/r/2078/ and would very much appreciate reviews17:57
=== kadams54 is now known as kadams54-away
=== kadams54-away is now known as kadams54
mupBug #1471332 opened: Upgrade fails on windows machines <ci> <upgrade-juju> <windows> <juju-core:Triaged> <juju-core 1.24:Triaged> <https://launchpad.net/bugs/1471332>20:30
=== kadams54 is now known as kadams54-away
fwereadegaah20:57
fwereadeI'm inside state20:57
perrito666fwereade: how did you get inside it?20:57
* perrito666 takes a crowbar and goes to the rescue20:57
fwereadewhat's the quickest cleanest way to either find out what machine I'm running on, or otherwise come up with another stable id that effectively maps to that20:58
fwereade?20:58
mgzfwereade: I'm like 20 mins in to reading your giant branch20:58
fwereademgz, sorry it's so big but there was a lot to untangle20:58
fwereademgz, I'm pretty sure it's there now though20:58
mgzyeah, it's all pretty sane so far at least20:58
fwereademgz, cool20:58
fwereademgz, I pushed a much better description recently fwiw20:59
fwereademgz, not sure if 20 mins ago or not20:59
fwereademgz, probably more actually20:59
perrito666fwereade: where are you exactly?20:59
perrito666fwereade: to answer to your question20:59
fwereadeperrito666, I'm in a state method which I want to set up the presence watcher and lease/leadership stuff21:00
fwereadeperrito666, my lease client needs a unique id for the machine it's running on21:00
perrito666I see21:00
fwereadeperrito666, deriving it from current state server id would be great21:00
fwereadeperrito666, but, heh, how do we know which we are?21:01
fwereadeperrito666, if I weren't inside state I could pass it in via agent config21:01
fwereadeperrito666, but that really has no place in state21:01
perrito666fwereade: State{}.serverTag I think21:04
=== kadams54 is now known as kadams54-away
fwereadeperrito666, I'm pretty sure that's the environ the state server's running in21:27
fwereadeperrito666, but not unique to a particular machine21:27
perrito666look at IsStateServer21:28
