/srv/irclogs.ubuntu.com/2013/06/16/#juju-dev.txt

=== liam_ is now known as Guest63322
<jam> fwereade: isn't it your weekend? :)  11:31
<jam> but really, great job on triaging  11:31
<wallyworld_> jam: looks like the landing bot doesn't like my branch. all tests pass locally  12:06
<wallyworld_> have you seen the failures before with the bot?  12:06
<jam> wallyworld_: see earlier bugs, it seems there is a thread that stays alive "sometimes", which on the bot is about 3 in 5 test runs  12:06
<wallyworld_> ok, i'll try re-approving  12:07
<wallyworld_> 3rd time lucky  12:07
<jam> wallyworld_: yeah, I'm trying to debug it now, inserting logging in mgo, or something.  12:09
<wallyworld_> good luck, doesn't sound straightforward  12:09
<jam> wallyworld_: well it looks like what is actually happening is an RPC failure  12:10
<jam> which causes the suite to fail to tear down correctly  12:10
<jam> [LOG] 16.33282 DEBUG juju rpc/jsoncodec: <- error: read tcp 127.0.0.1:52276: use of closed network connection (closing true) [LOG] 16.33383 DEBUG juju rpc/jsoncodec: <- error: EOF (closing false) ... Panic: Fixture has panicked (see related PANIC)  12:10
<wallyworld_> would be nice to make the teardown more robust at least  12:11
<jam> wallyworld_: well it looks like a test should be failing given the "use of closed network connection" part  12:11
<jam> though that failure *might* be: http://code.google.com/p/go/issues/detail?id=4704  12:11
* wallyworld_ looks  12:12
<jam> which is only relevant to the patched version of go used in the packaged go binaries  12:12
<jam> but why would *that* be nondeterministic?  12:12
<jam> (for danilo's patch, it was exactly 3 submissions to succeed)  12:12
<jam> and naturally running it directly.... doesn't fail  12:13
<wallyworld_> too bad we aren't running with 1.0.3 packaged  12:13
<jam> wallyworld_: I'm using 1.0.3 from the ppa, which are you meaning?  12:14
<jam> that is the one tarmac is using  12:14
<jam> which may not have the patch, which is why I don't think it is issue 4704 from above  12:14
<wallyworld_> i just assumed we were running < 1.0.3 since the bug report linked above seemed to say it was fixed in 1.0.3  12:14
<jam> wallyworld_: right, but supposedly there was a patch to 1.0.2 which might still have been applied to this code.  12:15
<jam> Alternatively, we are always getting the connection closed, and it only sometimes causes a stale connection to mgo  12:15
<wallyworld_> ok. i was just guessing without having all the facts :-)  12:16
<jam> wallyworld_: I don't have many more facts than you. Another pair of eyes is appreciated.  12:16
<wallyworld_> i'll see if anything jumps out  12:17
<jam> wallyworld_: the code under test is doing stuff with timeouts also, which always sounds non-deterministic  12:17
<wallyworld_> yeah, could just be the bot vm is slow or something  12:18
<jam> wallyworld_: well it is 10 seconds of 1 socket still marked as alive, so it probably isn't tearing down yet, but maybe the connection is closing earlier because the bot is slow  12:18
<jam> wallyworld_: yeah, post-commit hook fired  12:25
<jam> yay  12:25
<wallyworld_> \o/  12:25
<jam> wallyworld_: so it seems I just need to auto-requeue all requests 3 times. :)  12:25
<wallyworld_> lol  12:25
<jam> note that when we get this failure, it trips 3 different tests.  12:25
<wallyworld_> yes, i noticed that  12:25
<jam> I wonder if some other test is leaving it unclean and the rpc stuff is bogus?  12:25
<wallyworld_> maybe, would not be surprised  12:26
<jam> wallyworld_: of course I *just* triggered it locally, but I'm pretty sure it will succeed next time.  12:27
<jam> but it failed and ran only 1 test  12:27
<jam> so it is at least local to the test  12:27
<wallyworld_> yeah  12:27
<jam> wallyworld_: so if I add a last line: c.Fatal("failing this test")  12:28
<jam> then it always triggers  12:28
<jam> and I see the "rpc/jsoncodec EOF" stuff  12:28
<jam> the one difference is "(closing false)" in my method vs "(closing true)"  12:28
<wallyworld_> which test do you add that to?  12:28
<jam> wallyworld_: so I *think* the TestManageEnviron  12:29
<jam> in cmd/jujud  12:29
<jam> machine_test.go  12:29
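
The trick being described is simply forcing the suspect test to fail at its very end, so the suite always tears down along the failure path and its logs can be compared with the intermittent bot failure. A minimal sketch of the shape (the suite wiring and import alias here are assumptions for illustration; only the c.Fatal call and the TestManageEnviron name come from the discussion above):

    // machine_test_sketch.go -- illustrative only, not the real juju test body
    package sketch_test

    import (
            "testing"

            gc "launchpad.net/gocheck"
    )

    func Test(t *testing.T) { gc.TestingT(t) }

    type MachineSuite struct{}

    var _ = gc.Suite(&MachineSuite{})

    func (s *MachineSuite) TestManageEnviron(c *gc.C) {
            // ... the existing test steps would run here ...

            // Forcing a failure at the end makes the suite tear down the same
            // way every time, so the rpc/jsoncodec log lines can be compared
            // with the intermittent failure seen on the landing bot.
            c.Fatal("failing this test")
    }
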
<jam> wallyworld_: So looking at the test, I think m1.Watch is setting up a watcher which is then polling on the server side, and we kill it at some point (w.Stop() is deferred)  12:29
<jam> but on the *server* side we just get a closed connection.  12:29
<wallyworld_> a client watcher should be able to be killed and the server should notice at some point and just deal with it  12:31
<jam> ERROR juju agent died with no error  12:31
<jam> wallyworld_: right, when I add the Fatalf it doesn't panic the fixture  12:31
<jam> it doesn't leave a connection alive  12:31
<jam> the point was that we end up with a very similar log file  12:31
<wallyworld_> jam: weird commit message when i pull trunk and do bzr log - claims the committer was mathew scott and not me  12:39
<wallyworld_> for the branch that just merged  12:40
<jam> wallyworld_: 'bzr log --long' it has both of you, but MScott happens to come first for some reason.  12:40
<wallyworld_> lp also shows m scott  12:41
<wallyworld_> why is he associated with my commit?  12:41
<jam> wallyworld_: if you look at the log, you merged his changes, Tarmac just includes everyone  12:41
<jam> rev 1276.1.8 has you merging his branch  12:41
<jam> so your patch brings in his changes  12:41
<wallyworld_> i merged "trunk" but it was off ~juju not go-bot  12:41
<wallyworld_> i didn't explicitly merge his code  12:42
<jam> wallyworld_: did someone 'land' code to ~juju before it moved to go-bot?  12:42
<jam> I pulled and pushed right when I switched  12:42
<wallyworld_> not that i can recall or am aware of  12:42
<jam> but maybe something was landed there after I switched it  12:42
<wallyworld_> maybe  12:42
<wallyworld_> doesn't matter, was just curious  12:43
<jam> if people were using 'lbox submit' it probably would still land to the old location, because it was the known location.  12:43
<jam> so I guess  12:43
<wallyworld_> makes sense  12:43
<jam> wallyworld_: thanks for bringing in accidentally missed changes :)  12:43
<wallyworld_> anytime :-)  12:43
<jam> wallyworld_: I may have found it.... the connection that is being created is inside mgo itself. Which runs a background pinger against the server (to make sure the server is still alive)  13:11
<jam> guess what the pinger delay is?  13:11
<wallyworld_> 500ms?  13:11
<jam> wallyworld_: 10s  13:11
<wallyworld_> ah, right  13:11
<jam> which is *exactly* the amount of time we wait for all connections to clean up  13:11
<wallyworld_> lol  13:11
<jam> so if pinger starts at any time close to when we start tearing down  13:11
<jam> we wait 10s  13:11
<jam> but it sleeps for 10.xxx seconds  13:11
<jam> and lo and behold, that thread never goes away  13:12
<wallyworld_> good catch  13:12
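
The shape of the collision, as a small self-contained model (not mgo's actual code; only the 10-second figures come from the discussion above): a background goroutine that sleeps for its whole ping interval cannot notice a shutdown request until the sleep ends, so a teardown that waits at most the same interval can give up first.

    // Rough model of the problem, not mgo's code.
    package main

    import (
            "fmt"
            "time"
    )

    // Both mgo's pinger delay and the test teardown wait are about 10s here.
    const pingDelay = 10 * time.Second

    func pinger(stop, done chan struct{}) {
            defer close(done)
            // Simulates the goroutine only getting scheduled just after the
            // teardown has already started waiting.
            time.Sleep(100 * time.Millisecond)
            for {
                    time.Sleep(pingDelay) // sleeps the full interval before checking anything
                    select {
                    case <-stop:
                            return
                    default:
                            // a real pinger would contact the server here
                    }
            }
    }

    func main() {
            stop := make(chan struct{})
            done := make(chan struct{})
            go pinger(stop, done)

            // Teardown: ask the pinger to stop, then wait at most pingDelay for it.
            close(stop)
            select {
            case <-done:
                    fmt.Println("pinger exited in time")
            case <-time.After(pingDelay):
                    // The pinger is still inside its 10.x second sleep, so to the
                    // test suite this goroutine looks leaked.
                    fmt.Println("gave up: pinger goroutine still alive")
            }
    }
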
<jam> wallyworld_: inserting traceback logging  13:12
<jam> and then tracking through what address wasn't  13:12
<jam> dying  13:12
<jam> and then checking its traceback for why it was still around  13:13
<jam> I was *very* fortunate to be able to trigger it locally  13:13
<wallyworld_> yes indeed  13:13
<jam> once-in-a-while  13:13
<jam> I'm guessing Tarmac is slow enough  13:13
<jam> that the 10s delay  13:13
<jam> ends up being 10+s more often  13:13
<wallyworld_> yep  13:13
<jam> widening the window of failure  13:13
<wallyworld_> 10s seems like way too long to wait inside a test  13:14
<wallyworld_> too bad we can't tell mgo not to ping  13:14
<jam> wallyworld_: well mgo keeps a connection in a sleep loop  13:14
<wallyworld_> since for a test we don't care  13:14
<jam> wallyworld_: ah, I think there is a goroutine scheduling issue as well. As newServer has fired off a side thread telling it that it wants to ping  13:17
<jam> and it has already run pinger(false) # do not loop  13:17
<wallyworld_> fun  13:18
<jam> right, so it has checked and probably knows that we want to shut down, but fires off the goroutine, which runs at some point in the (near) future  13:18
<jam> which is just after we started closing  13:18
<jam> but the first thing it does is sleep for 10s  13:18
<jam> I'm thinking about just inserting a 'check for closed' before we hit sleep  13:19
<wallyworld_> that would be the normal way to do these types of things  13:19
<jam> doesn't seem to matter  13:20
<jam> looks like it is getting closed in the middle of those 10s  13:20
<jam> wallyworld_: 7+ years since I wrote a lot of C++ code, and I still type "for i = 0; i < 10; ++i" :)  13:21
<jam> (go doesn't support ++var)  13:21
<wallyworld_> yeah, Go frustrates me a lot like that  13:22
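
For the record, Go's increment operator only exists in postfix form and only as a statement, not an expression, so the C++ habit above is rejected by the compiler:

    package main

    import "fmt"

    func main() {
            // "for i := 0; i < 10; ++i" does not compile in Go: there is no
            // prefix increment, and i++ is a statement rather than an
            // expression, so it cannot be used in a value context either.
            for i := 0; i < 10; i++ {
                    fmt.Println(i)
            }
    }
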
<jam> so I tried changing the single sleep into a loop of sleeping, so that we can break out earlier. No luck yet.  13:24
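
The change being tried is roughly this shape (a sketch with invented type and field names, not mgo's actual code): slice the one long sleep into short sleeps that re-check a closed flag, so the goroutine can exit soon after Close() rather than up to a full interval later.

    package sketch

    import (
            "sync"
            "time"
    )

    // Invented type for illustration; not mgo's actual server structure.
    type server struct {
            sync.Mutex
            closed bool
    }

    const pingDelay = 10 * time.Second

    // sleepUnlessClosed sleeps in short slices, re-checking the closed flag
    // each time, so the pinger goroutine can notice shutdown within ~100ms of
    // Close() instead of only after the full pingDelay.
    func (s *server) sleepUnlessClosed() {
            deadline := time.Now().Add(pingDelay)
            for time.Now().Before(deadline) {
                    s.Lock()
                    closed := s.closed
                    s.Unlock()
                    if closed {
                            return
                    }
                    time.Sleep(100 * time.Millisecond)
            }
    }
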
<jam> wallyworld_: at 0.6376 we get a 'closing server' call, at 0.6378 we see it kill the currently active socket, at 0.648 we see it *create* a new socket which then never dies... why didn't it think it was done...  14:02
<wallyworld_> why would it create a new socket after killing the active one?  14:03
<jam> (why, if it just killed its socket and is trying to shut down, doesn't it treat the server as closed instead of trying to connect again)  14:03
<jam> wallyworld_: exactly  14:03
<wallyworld_> is this in the mgo code?  14:03
<jam> wallyworld_: that's what I'm looking at right now, yes.  14:03
<wallyworld_> i guess we need to find a workaround until any upstream patch gets applied  14:04
<jam> wallyworld_: well, we can probably just set the loop to 15s and be safe  14:04
<jam> but I'd still like to understand this bit.  14:04
<jam> (I can easily run patched mgo on the Tarmac bot :)  14:05
<wallyworld_> great  14:05
<jam> note the 15s is *I think* :)  14:05
<wallyworld_> but who knows if this will bite us in production  14:05
<jam> given it is creating a new connection  14:05
<jam> I wonder if server.closed is getting unset somehow  14:05
<wallyworld_> plausible  14:05
<jam> like, it has been deallocated and thus the default value is nil? (can't really be deallocated if it isn't unreferenced)  14:06
<jam> so I could see that if we get the Close() call, but then call AcquireSocket immediately after  14:06
<jam> we *won't* call liveSocket.Close()  14:07
<jam> because we close each socket in the server.Close() call.  14:07
<jam> so possibly this thread will never die  14:07
<jam> because it doesn't actually think the server is closed.  14:07
<wallyworld_> seems like a stock-standard coding error  14:11
<jam> wallyworld_: current guess: server is not closed when we call Dial  14:14
<jam> server is closed by the time connect returns  14:14
<wallyworld_> would explain what is seen i think  14:15
<jam> wallyworld_: 85.92296 AcquireSocket, 85.93311 closing server, 85.93321 killing old connection, 85.94373 newSocket  14:18
<jam> so we ask for a new socket, and while that is pending, we call Close  14:19
<jam> Close doesn't see the connection we are creating right now  14:19
<wallyworld_> cause it's not fully created yet  14:20
<jam> yep  14:20
<wallyworld_> may just use a mutex  14:20
<wallyworld_> maybe  14:20
<jam> there are mutexes around  14:20
<jam> AcquireSocket releases the mutex just before calling Connect  14:21
<jam> presumably to not block waiting for connect  14:21
<jam> and grabs it again before appending to live sockets  14:21
<jam> probably needs to check for 'self.closed' before appending the new socket  14:21
<wallyworld_> or gate acquire and close with a mutex  14:22
<wallyworld_> not sure without looking at the code  14:22
<jam> wallyworld_: I think just after Connect, when we re-acquire the lock to add it to our liveSockets, we need to check that we aren't actually in closing state.  14:22
<wallyworld_> sounds ok  14:23
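
The fix being sketched out here, in rough form (all type, field, and function names below are invented for illustration, and mgo's Connect is modelled with a plain net.Dial; this is not mgo's actual code): dial without holding the lock, then re-check the closed flag once the lock is re-acquired, and if Close() ran in the meantime, close the new socket instead of tracking it.

    package sketch

    import (
            "errors"
            "net"
            "sync"
    )

    // Invented structure for illustration only.
    type server struct {
            mu          sync.Mutex
            closed      bool
            addr        string
            liveSockets []net.Conn
    }

    // AcquireSocket dials without holding the lock (so other callers aren't
    // blocked on the dial), then re-checks closed before tracking the socket,
    // which is the check the race above was missing.
    func (s *server) AcquireSocket() (net.Conn, error) {
            s.mu.Lock()
            if s.closed {
                    s.mu.Unlock()
                    return nil, errors.New("server closed")
            }
            addr := s.addr
            s.mu.Unlock()

            conn, err := net.Dial("tcp", addr) // Close() may run concurrently here
            if err != nil {
                    return nil, err
            }

            s.mu.Lock()
            if s.closed {
                    // The server was closed while we were dialing; Close() never
                    // saw this socket, so we must shut it down ourselves or it
                    // outlives the teardown.
                    s.mu.Unlock()
                    conn.Close()
                    return nil, errors.New("server closed")
            }
            s.liveSockets = append(s.liveSockets, conn)
            s.mu.Unlock()
            return conn, nil
    }

    // Close marks the server closed and shuts down every socket it knows about.
    func (s *server) Close() {
            s.mu.Lock()
            s.closed = true
            socks := s.liveSockets
            s.liveSockets = nil
            s.mu.Unlock()
            for _, c := range socks {
                    c.Close()
            }
    }
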
<jam> wallyworld_: yeah, I just got "Connect returned after server was closed" but *didn't* get AcquireSocket called with closed server.  14:27
<wallyworld_> good  14:27
<thumper> wallyworld_: morning...  23:37
<wallyworld_> hi  23:37
<thumper> wallyworld_: can I get a +1 on https://codereview.appspot.com/10235047/  23:40
<thumper> wallyworld_: although I should propose again  23:40
<thumper> if you wait a few minutes, I'll do that  23:40
<wallyworld_> against the new trunk you mean  23:40
<thumper> no, to have the test I added in the review  23:40
<wallyworld_> ok  23:40
* thumper is busy submitting a later pipe  23:40
<thumper> wallyworld_: sync up chat sometime?  23:41
<wallyworld_> sure, give me a little time to propose some code  23:41
<wallyworld_> just finishing some stuff  23:41
<thumper> sure, no urgency  23:42
<thumper> I have heaps to get on with  23:42
<wallyworld_> thumper: looks like you already have your 2nd +1  23:49
<thumper> wallyworld_: yeah, just saw that too  23:49
<thumper> approving and merging now  23:49
<thumper> dfc hasn't come online yet though :)  23:49
<thumper> hi mramm  23:50
<mramm> hey hey  23:50
<thumper> mramm: working on a sunday?  23:52
<mramm> not really  23:52
<mramm> just thought I'd pop in and check on IRC  23:52
<mramm> check my e-mail  23:52
<mramm> while cooking dinner  23:52
<thumper> well, IRC is probably where you left it :)  23:53
<thumper> no problems  23:53
<thumper> all ticking along  23:53
<thumper> horrible weather here, cold and wet  23:54
<thumper> supposed to get snow on thursday  23:54
<wallyworld_> arosales: did you get the meeting invite i sent?  23:56
