[11:31] <jam> fwereade: isn't it your weekend ? :)
[11:31] <jam> but really, great job on triaging
[12:06] <wallyworld_> jam: looks like the landing bot doesn't like my branch. all tests pass locally
[12:06] <wallyworld_> have you seen the failures before with the bot?
[12:06] <jam> wallyworld_: see earlier bugs, it seems there is a thread that stays alive "sometimes", which on the bot is about 3 in 5 test runs
[12:07] <wallyworld_> ok, i'll try re-approving
[12:07] <wallyworld_> 3rd time lucky
[12:09] <jam> wallyworld_: yeah, I'm trying to debug it now, inserting logging in mgo, or something.
[12:09] <wallyworld_> good luck, doesn't sound straightforward
[12:10] <jam> wallyworld_: well it looks like what is actually happening is an RPC failure
[12:10] <jam> which causes the suite to fail to tear down correctly
[12:10] <jam> [LOG] 16.33282 DEBUG juju rpc/jsoncodec: <- error: read tcp 127.0.0.1:52276: use of closed network connection (closing true) [LOG] 16.33383 DEBUG juju rpc/jsoncodec: <- error: EOF (closing false) ... Panic: Fixture has panicked (see related PANIC)
[12:11] <wallyworld_> would be nice to make the teardown more robust at least
[12:11] <jam> wallyworld_: well it looks like a test should be failing given the "use of closed network connection" part
[12:11] <jam> though that failure *might* be: http://code.google.com/p/go/issues/detail?id=4704
[12:12]  * wallyworld_ looks
[12:12] <jam> which is only relevant to the patched version of go used in the packaged go binaries
[12:12] <jam> but why would *that* be nondeterministic?
[12:12] <jam> (for danilo's patch, it was exactly 3 submissions to succeed)
[12:13] <jam> and naturally running it directly.... doesn't fail
[12:13] <wallyworld_> too bad we aren't running with 1.0.3 packaged
[12:14] <jam> wallyworld_: I'm using 1.0.3 from the ppa, which are you meaning?
[12:14] <jam> that is the one tarmac is using
[12:14] <jam> which may not have the patch, which is why I don't think it is issue 4704 from above
[12:14] <wallyworld_> i just assumed we were running < 1.0.3 since the bug report linked above seemed to say it was fixed in 1.0.3
[12:15] <jam> wallyworld_: right, but supposedly there was a patch to 1.0.2 which might still have been applied to this code.
[12:15] <jam> Alternatively, we are always getting the connection closed, and it only sometimes causes a stale connection to mgo
[12:16] <wallyworld_> ok. i was just guessing without having all the facts :-)
[12:16] <jam> wallyworld_: I don't have many more facts than you. Another pair of eyes is appreciated.
[12:17] <wallyworld_> i'll see if anything jumps out
[12:17] <jam> wallyworld_: the code under test is doing stuff with timeouts also, which always sounds non-deterministic
[12:18] <wallyworld_> yeah, could just be the bot vm is slow or something
[12:18] <jam> wallyworld_: well it is 10 seconds of 1 socket still marked as alive, so it probably isn't tearing down yet, but maybe the connection is closing earlier because the bot is slow
[12:25] <jam> wallyworld_: yeah, post commit hook fired
[12:25] <jam> yay
[12:25] <wallyworld_> \o/
[12:25] <jam> wallyworld_: so it seems I just need to auto-requeue all requests 3 times. :)
[12:25] <wallyworld_> lol
[12:25] <jam> note that when we get this failure, it trips 3 different tests.
[12:25] <wallyworld_> yes, i noticed that
[12:25] <jam> I wonder if some other test is leaving it unclean and the rpc stuff is bogus?
[12:26] <wallyworld_> maybe, would not be surprised
[12:27] <jam> wallyworld_: of course I *just* triggered it locally, but I'm pretty sure it will succeed next time.
[12:27] <jam> but it failed and ran only 1 test
[12:27] <jam> so it is at least local to the test
[12:27] <wallyworld_> yeah
[12:28] <jam> wallyworld_: so if I add a last line: c.Fatal("failing this test")
[12:28] <jam> then it always triggers
[12:28] <jam> and I see the "rpc/jsoncodec EOF" stuff
[12:28] <jam> the one difference is "(closing false)" in my method vs "(closing true)"
[12:28] <wallyworld_> which test do you add that to?
[12:29] <jam> wallyworld_: so I *think* the TestManageEnviron
[12:29] <jam> in cmd/jujud
[12:29] <jam> machine_test.go
[12:29] <jam> wallyworld_: So looking at the test, I think m1.Watch is setting up a watcher which is then polling on the server side, and we kill it at some point (w.Stop() is deferred)
[12:29] <jam> but on the *server* side we just get a closed connection.
[12:31] <wallyworld_> a client watcher should be able to be killed and the server should notice at some point and just deal with it
[12:31] <jam> ERROR juju agent died with no error
[12:31] <jam> wallyworld_: right, when I add the Fatalf it doesn't panic the fixture
[12:31] <jam> it doesn't leave a connection alive
[12:31] <jam> the point was that we end up with a very similar log file
[12:39] <wallyworld_> jam: weird commit message when i pull trunk and do bzr log - claims the committer was mathew scott and not me
[12:40] <wallyworld_> for the branch that just merged
[12:40] <jam> wallyworld_: 'bzr log --long' it has both of you, but MScott happens to come first for some reason.
[12:41] <wallyworld_> lp also shows m scott
[12:41] <wallyworld_> why is he associated with my commit?
[12:41] <jam> wallyworld_: if you look at the log, you merged his changes, Tarmac just includes everyone
[12:41] <jam> rev 1276.1.8 has you merging his branch
[12:41] <jam> so your patch brings in his changes
[12:41] <wallyworld_> i merged "trunk" but it was off ~juju not go-bot
[12:42] <wallyworld_> i didn't explicitly merge his code
[12:42] <jam> wallyworld_: did someone 'land' code to ~juju before it moved to go-bot?
[12:42] <jam> I pulled and pushed right when I switched
[12:42] <wallyworld_> not that i can recall or am aware of
[12:42] <jam> but maybe something was landed there after I switched it
[12:42] <wallyworld_> maybe
[12:43] <wallyworld_> doesn't matter, was just curious
[12:43] <jam> if people were using 'lbox submit' it probably would still land to the old location, because it was the known location.
[12:43] <jam> so I guess
[12:43] <wallyworld_> makes sense
[12:43] <jam> wallyworld_: thanks for bringing in accidentally missed changes :)
[12:43] <wallyworld_> anytime :-)
[13:11] <jam> wallyworld_: I may have found it.... the connection that is being created is inside mgo itself. Which runs a background pinger against the server (to make sure the server is still alive)
[13:11] <jam> guess what the pinger delay is?
[13:11] <wallyworld_> 500ms?
[13:11] <jam> wallyworld_: 10s
[13:11] <wallyworld_> ah, right
[13:11] <jam> which is *exactly* the amount of time we wait for all connections to clean up
[13:11] <wallyworld_> lol
[13:11] <jam> so if pinger starts at any time close to when we start tearing down
[13:11] <jam> we wait 10s
[13:11] <jam> but it sleeps for 10.xxx seconds
[13:12] <jam> and lo and behold, that thread never goes away
[13:12] <wallyworld_> good catch
[13:12] <jam> wallyworld_: inserting traceback logging
[13:12] <jam> and then tracking through what address wasn't
[13:12] <jam> dying
[13:13] <jam> and then checking its traceback for why it was existing
[13:13] <jam> I was *very* fortunate to be able to trigger it locally
[13:13] <wallyworld_> yes indeed
[13:13] <jam> once-in-a-while
[13:13] <jam> I'm guessing Tarmac is slow enough
[13:13] <jam> that the 10s delay
[13:13] <jam> ends up being 10+s more often
[13:13] <wallyworld_> yep
[13:13] <jam> widening the window of failure
[13:14] <wallyworld_> 10s seems like way too long to wait inside a test
[13:14] <wallyworld_> too bad we can't tell mgo not to ping
[13:14] <jam> wallyworld_: well mgo keeps a connection in a sleep loop
[13:14] <wallyworld_> since for a test we don't care
[13:17] <jam> wallyworld_: ah, I think there is a goroutine scheduling issue as well. As newServer has fired off a side thread telling it that it wants to ping
[13:17] <jam> and it has already run pinger(false) # do not loop
[13:18] <wallyworld_> fun
[13:18] <jam> right, so it has checked and probably knows that we want to shut down, but fires off the goroutine, which runs at some point in (near) future
[13:18] <jam> which is just after we started closing
[13:18] <jam> but the first thing it does is sleep for 10s
[13:19] <jam> I'm thinking about just inserting a 'check for closed' before we hit sleep
[13:19] <wallyworld_> that would be the normal way to do these types of things
[13:20] <jam> doesn't seem to matter
[13:20] <jam> looks like it is getting closed in the  middle of those 10s
[13:21] <jam> wallyworld_: 7+ years since I wrote a lot of C++ code, and I still type "for i = 0; i < 10; ++i" :)
[13:21] <jam> (go doesn't support ++var)
[13:22] <wallyworld_> yeah, Go frustrates me a lot like that
[13:24] <jam> so I tried changing the single sleep into a loop of sleeping, so that we can break out earlier. No luck yet.
[14:02] <jam> wallyworld_: at 0.6376 we get a 'closing server' call, at 0.6378 we see it kill the currently active socket, at 0.648 we see it *create* a new socket which then never dies... why didn't it think it was done...
[14:03] <wallyworld_> why would it create a new socket after killing the active one?
[14:03] <jam> (why if it just killed its socket and is trying to shut down, doesn't it think the server is closed and tries to connect again)
[14:03] <jam> wallyworld_: exactly
[14:03] <wallyworld_> is this in the mgo code?
[14:03] <jam> wallyworld_: that's what I'm looking at right now, yes.
[14:04] <wallyworld_> i guess we need to find a workaround until any upstream patch gets applied
[14:04] <jam> wallyworld_: well, we can probably just set the loop to 15s and be safe
[14:04] <jam> but I'd still like to understand this bit.
[14:05] <jam> (I can easily run patched mgo on Tarmac bot :)
[14:05] <wallyworld_> great
[14:05] <jam> note the 15s is *I think* :)
[14:05] <wallyworld_> but who knows if this will bite us in production
[14:05] <jam> given it is creating a new connection
[14:05] <jam> I wonder if server.closed is getting unset somehow
[14:05] <wallyworld_> plausible
[14:06] <jam> like, it has been deallocated and thus the default value is nil? (can't really be deallocated if it isn't unreferenced)
[14:06] <jam> so I could see that if we get the Close() call, but then call AcquireSocket immediately after
[14:07] <jam> we *won't* call liveSocket.Close()
[14:07] <jam> because we close each socket in the server.Close() call.
[14:07] <jam> so possibly this thread will never die
[14:07] <jam> because it doesn't actually think the server is closed.
[14:11] <wallyworld_> seems like a stock standard coding error
[14:14] <jam> wallyworld_: current guess. server is not closed when we call Dial
[14:14] <jam> server is closed by the time connect returns
[14:15] <wallyworld_> would explain what is seen i think
[14:18] <jam> wallyworld_: 85.92296 AcquireSocket, 85.93311 closing server, 85.93321 killing old connection 85.94373 newSocket
[14:19] <jam> so we ask for a new socket, while that is pending, we call Close
[14:19] <jam> Close doesn't see the connection we are creating right now
[14:20] <wallyworld_> cause it's not fully created yet
[14:20] <jam> yep
[14:20] <wallyworld_> may just use a mutex
[14:20] <wallyworld_> maybe
[14:20] <jam> there are mutexes around
[14:21] <jam> AcquireSocket releases the mutex just before calling Connect
[14:21] <jam> presumably to not block waiting for connect
[14:21] <jam> and grabs it again before appending to live sockets
[14:21] <jam> probably needs to check for 'self.closed' before appending the new socket
[14:22] <wallyworld_> or gate acquire and close with a mutex
[14:22] <wallyworld_> not sure without looking at the code
[14:22] <jam> wallyworld_: I think just after Connect when we re-acquire the lock to add it to our liveSockets, we need to check that we aren't actually in closing state.
[14:23] <wallyworld_> sounds ok
[14:27] <jam> wallyworld_: yeah, I just got "Connect returned after server was closed" but *didn't* get AcquireSocket called with closed server.
[14:27] <wallyworld_> good
[23:37] <thumper> wallyworld_: morning...
[23:37] <wallyworld_> hi
[23:40] <thumper> wallyworld_: can I get a +1 on https://codereview.appspot.com/10235047/
[23:40] <thumper> wallyworld_: although I should propose again
[23:40] <thumper> if you wait a few minutes, I'll do that
[23:40] <wallyworld_> against the new trunk you mean
[23:40] <thumper> no, to have the test I added in the review
[23:40] <wallyworld_> ok
[23:40]  * thumper is busy submitting a later pipe
[23:41] <thumper> wallyworld_: sync up chat sometime?
[23:41] <wallyworld_> sure, give me a little time to propose some code
[23:41] <wallyworld_> just finishing some stuff
[23:42] <thumper> sure, no urgency
[23:42] <thumper> I have heaps to get on with
[23:49] <wallyworld_> thumper: looks like you already have your 2nd +1
[23:49] <thumper> wallyworld_: yeah, just saw that too
[23:49] <thumper> approving and merging now
[23:49] <thumper> dfc hasn't come online yet though :)
[23:50] <thumper> hi mramm
[23:50] <mramm> hey hey
[23:52] <thumper> mramm: working on a sunday?
[23:52] <mramm> not really
[23:52] <mramm> just thought I'd pop in and check on IRC
[23:52] <mramm> check my e-mail
[23:52] <mramm> while cooking dinner
[23:53] <thumper> well, IRC is probably where you left it :)
[23:53] <thumper> on problems
[23:53] <thumper> all ticking along
[23:54] <thumper> horrible weather here, cold and wet
[23:54] <thumper> supposed to get snow on thursday
[23:56] <wallyworld_> arosales: did you get the meeting invite i sent?