=== liam_ is now known as Guest63322
[11:31] fwereade: isn't it your weekend? :)
[11:31] but really, great job on triaging
[12:06] jam: looks like the landing bot doesn't like my branch. all tests pass locally
[12:06] have you seen the failures before with the bot?
[12:06] wallyworld_: see earlier bugs, it seems there is a thread that stays alive "sometimes", which on the bot is about 3 in 5 test runs
[12:07] ok, i'll try re-approving
[12:07] 3rd time lucky
[12:09] wallyworld_: yeah, I'm trying to debug it now, inserting logging in mgo, or something.
[12:09] good luck, doesn't sound straightforward
[12:10] wallyworld_: well it looks like what is actually happening is an RPC failure
[12:10] which causes the suite to fail to tear down correctly
[12:10] [LOG] 16.33282 DEBUG juju rpc/jsoncodec: <- error: read tcp 127.0.0.1:52276: use of closed network connection (closing true)
[12:10] [LOG] 16.33383 DEBUG juju rpc/jsoncodec: <- error: EOF (closing false)
[12:10] ... Panic: Fixture has panicked (see related PANIC)
[12:11] would be nice to make the teardown more robust at least
[12:11] wallyworld_: well it looks like a test should be failing given the "use of closed network connection" part
[12:11] though that failure *might* be: http://code.google.com/p/go/issues/detail?id=4704
[12:12] * wallyworld_ looks
[12:12] which is only relevant to the patched version of go used in the packaged go binaries
[12:12] but why would *that* be nondeterministic?
[12:12] (for danilo's patch, it was exactly 3 submissions to succeed)
[12:13] and naturally running it directly.... doesn't fail
[12:13] too bad we aren't running with 1.0.3 packaged
[12:14] wallyworld_: I'm using 1.0.3 from the ppa, which are you meaning?
[12:14] that is the one tarmac is using
[12:14] which may not have the patch, which is why I don't think it is issue 4704 from above
[12:14] i just assumed we were running < 1.0.3 since the bug report linked above seemed to say it was fixed in 1.0.3
[12:15] wallyworld_: right, but supposedly there was a patch to 1.0.2 which might still have been applied to this code.
[12:15] Alternatively, we are always getting the connection closed, and it only sometimes causes a stale connection to mgo
[12:16] ok. i was just guessing without having all the facts :-)
[12:16] wallyworld_: I don't have many more facts than you. Another pair of eyes is appreciated.
[12:17] i'll see if anything jumps out
[12:17] wallyworld_: the code under test is doing stuff with timeouts also, which always sounds non-deterministic
[12:18] yeah, could just be the bot vm is slow or something
[12:18] wallyworld_: well it is 10 seconds of 1 socket still marked as alive, so it probably isn't tearing down yet, but maybe the connection is closing earlier because the bot is slow
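The 10 seconds mentioned here is the window the test suite allows for background connections to die before the fixture panics. A minimal sketch of that kind of teardown check, written only to illustrate the behaviour being described; the names waitForZeroSockets and openSockets are invented for illustration and are not juju's actual test API.

package example

import (
	"fmt"
	"time"
)

// openSockets stands in for however the real suite counts live mgo sockets.
var openSockets = func() int { return 0 }

// waitForZeroSockets polls for up to 10 seconds, then reports any sockets
// that are still alive so the fixture can fail with a useful message.
func waitForZeroSockets() error {
	deadline := time.Now().Add(10 * time.Second)
	for time.Now().Before(deadline) {
		if openSockets() == 0 {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("%d socket(s) still alive after 10s", openSockets())
}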
[12:25] wallyworld_: yeah, post commit hook fired
[12:25] yay
[12:25] \o/
[12:25] wallyworld_: so it seems I just need to auto-requeue all requests 3 times. :)
[12:25] lol
[12:25] note that when we get this failure, it trips 3 different tests.
[12:25] yes, i noticed that
[12:25] I wonder if some other test is leaving it unclean and the rpc stuff is bogus?
[12:26] maybe, would not be surprised
[12:27] wallyworld_: of course I *just* triggered it locally, but I'm pretty sure it will succeed next time.
[12:27] but it failed and ran only 1 test
[12:27] so it is at least local to the test
[12:27] yeah
[12:28] wallyworld_: so if I add a last line: c.Fatal("failing this test")
[12:28] then it always triggers
[12:28] and I see the "rpc/jsoncodec EOF" stuff
[12:28] the one difference is "(closing false)" in my method vs "(closing true)"
[12:28] which test do you add that to?
[12:29] wallyworld_: so I *think* the TestManageEnviron
[12:29] in cmd/jujud
[12:29] machine_test.go
[12:29] wallyworld_: So looking at the test, I think m1.Watch is setting up a watcher which is then polling on the server side, and we kill it at some point (w.Stop() is deferred)
[12:29] but on the *server* side we just get a closed connection.
[12:31] a client watcher should be able to be killed and the server should notice at some point and just deal with it
[12:31] ERROR juju agent died with no error
[12:31] wallyworld_: right, when I add the Fatalf it doesn't panic the fixture
[12:31] it doesn't leave a connection alive
[12:31] the point was that we end up with a very similar log file
[12:39] jam: weird commit message when i pull trunk and do bzr log - claims the committer was mathew scott and not me
[12:40] for the branch that just merged
[12:40] wallyworld_: 'bzr log --long' it has both of you, but MScott happens to come first for some reason.
[12:41] lp also shows m scott
[12:41] why is he associated with my commit?
[12:41] wallyworld_: if you look at the log, you merged his changes, Tarmac just includes everyone
[12:41] rev 1276.1.8 has you merging his branch
[12:41] so your patch brings in his changes
[12:41] i merged "trunk" but it was off ~juju not go-bot
[12:42] i didn't explicitly merge his code
[12:42] wallyworld_: did someone 'land' code to ~juju before it moved to go-bot?
[12:42] I pulled and pushed right when I switched
[12:42] not that i can recall or am aware of
[12:42] but maybe something was landed there after I switched it
[12:42] maybe
[12:43] doesn't matter, was just curious
[12:43] if people were using 'lbox submit' it probably would still land to the old location, because it was the known location.
[12:43] so I guess
[12:43] makes sense
[12:43] wallyworld_: thanks for bringing in accidentally missed changes :)
[12:43] anytime :-)
[13:11] wallyworld_: I may have found it.... the connection that is being created is inside mgo itself, which runs a background pinger against the server (to make sure the server is still alive)
[13:11] guess what the pinger delay is?
[13:11] 500ms?
[13:11] wallyworld_: 10s
[13:11] ah, right
[13:11] which is *exactly* the amount of time we wait for all connections to clean up
[13:11] lol
[13:11] so if pinger starts at any time close to when we start tearing down
[13:11] we wait 10s
[13:11] but it sleeps for 10.xxx seconds
[13:12] and lo and behold, that thread never goes away
[13:12] good catch
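A rough sketch of the pinger loop being described, reconstructed from the conversation rather than from mgo's source: the goroutine does an uninterruptible 10-second Sleep before each ping, so if it is scheduled just as teardown starts it outlives the entire window the suite is willing to wait. All names below are illustrative stand-ins.

package example

import "time"

type server struct{}

// ping stands in for the real liveness round-trip to the MongoDB server.
func (s *server) ping() error { return nil }

// pinger runs as its own goroutine. The crucial detail from the discussion:
// it sleeps for a fixed 10 seconds before each ping, and nothing can wake it
// early, so it stays alive through the suite's 10-second teardown wait.
func (s *server) pinger(loop bool) {
	const delay = 10 * time.Second // same length as the teardown window
	for {
		time.Sleep(delay) // uninterruptible once it starts
		if s.ping() != nil {
			return
		}
		if !loop {
			return
		}
	}
}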
[13:12] wallyworld_: inserting traceback logging
[13:12] and then tracking through what address wasn't
[13:12] dying
[13:13] and then checking its traceback for why it was existing
[13:13] I was *very* fortunate to be able to trigger it locally
[13:13] yes indeed
[13:13] once-in-a-while
[13:13] I'm guessing Tarmac is slow enough
[13:13] that the 10s delay
[13:13] ends up being 10+s more often
[13:13] yep
[13:13] widening the window of failure
[13:14] 10s seems like way too long to wait inside a test
[13:14] too bad we can't tell mgo not to ping
[13:14] wallyworld_: well mgo keeps a connection in a sleep loop
[13:14] since for a test we don't care
[13:17] wallyworld_: ah, I think there is a goroutine scheduling issue as well. As newServer has fired off a side thread telling it that it wants to ping
[13:17] and it has already run pinger(false) # do not loop
[13:18] fun
[13:18] right, so it has checked and probably knows that we want to shut down, but fires off the goroutine, which runs at some point in the (near) future
[13:18] which is just after we started closing
[13:18] but the first thing it does is sleep for 10s
[13:19] I'm thinking about just inserting a 'check for closed' before we hit sleep
[13:19] that would be the normal way to do these types of things
[13:20] doesn't seem to matter
[13:20] looks like it is getting closed in the middle of those 10s
[13:21] wallyworld_: 7+ years since I wrote a lot of C++ code, and I still type "for i = 0; i < 10; ++i" :)
[13:21] (go doesn't support ++var)
[13:22] yeah, Go frustrates me a lot like that
[13:24] so I tried changing the single sleep into a loop of sleeping, so that we can break out earlier. No luck yet.
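The "loop of sleeping" idea, sketched as a hypothetical change rather than the actual mgo patch: sleep in short slices and re-check a closed flag each time, so the goroutine can exit as soon as teardown flips it. As the rest of the conversation shows, this alone wasn't enough, because a new socket could still be created after Close. The type and function names here are invented for illustration.

package example

import (
	"sync/atomic"
	"time"
)

type server struct {
	closed int32 // set to 1 by Close; read atomically by the pinger
}

// sleepUnlessClosed replaces a single long time.Sleep with short slices,
// bailing out as soon as the server is marked closed. It reports whether
// the full duration elapsed without the server closing.
func sleepUnlessClosed(s *server, total time.Duration) bool {
	const slice = 100 * time.Millisecond
	for waited := time.Duration(0); waited < total; waited += slice {
		if atomic.LoadInt32(&s.closed) != 0 {
			return false
		}
		time.Sleep(slice)
	}
	return true
}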
[14:02] wallyworld_: at 0.6376 we get a 'closing server' call, at 0.6378 we see it kill the currently active socket, at 0.648 we see it *create* a new socket which then never dies... why didn't it think it was done...
[14:03] why would it create a new socket after killing the active one?
[14:03] (if it just killed its socket and is trying to shut down, why doesn't it think the server is closed instead of trying to connect again)
[14:03] wallyworld_: exactly
[14:03] is this in the mgo code?
[14:03] wallyworld_: that's what I'm looking at right now, yes.
[14:04] i guess we need to find a workaround until any upstream patch gets applied
[14:04] wallyworld_: well, we can probably just set the loop to 15s and be safe
[14:04] but I'd still like to understand this bit.
[14:05] (I can easily run patched mgo on Tarmac bot :)
[14:05] great
[14:05] note the 15s is *I think* :)
[14:05] but who knows if this will bite us in production
[14:05] given it is creating a new connection
[14:05] I wonder if server.closed is getting unset somehow
[14:05] plausible
[14:06] like, it has been deallocated and thus the default value is nil? (it can't really be deallocated if it's still referenced)
[14:06] so I could see that if we get the Close() call, but then call AcquireSocket immediately after
[14:07] we *won't* call liveSocket.Close()
[14:07] because we close each socket in the server.Close() call.
[14:07] so possibly this thread will never die
[14:07] because it doesn't actually think the server is closed.
[14:11] seems like a stock standard coding error
[14:14] wallyworld_: current guess. server is not closed when we call Dial
[14:14] server is closed by the time connect returns
[14:15] would explain what is seen i think
[14:18] wallyworld_: 85.92296 AcquireSocket, 85.93311 closing server, 85.93321 killing old connection, 85.94373 newSocket
[14:19] so we ask for a new socket, while that is pending, we call Close
[14:19] Close doesn't see the connection we are creating right now
[14:20] cause it's not fully created yet
[14:20] yep
[14:20] may just use a mutex
[14:20] maybe
[14:20] there are mutexes around
[14:21] AcquireSocket releases the mutex just before calling Connect
[14:21] presumably to not block waiting for connect
[14:21] and grabs it again before appending to live sockets
[14:21] probably needs to check for 'self.closed' before appending the new socket
[14:22] or gate acquire and close with a mutex
[14:22] not sure without looking at the code
[14:22] wallyworld_: I think just after Connect when we re-acquire the lock to add it to our liveSockets, we need to check that we aren't actually in closing state.
[14:23] sounds ok
[14:27] wallyworld_: yeah, I just got "Connect returned after server was closed" but *didn't* get AcquireSocket called with closed server.
[14:27] good
[23:37] wallyworld_: morning...
[23:37] hi
[23:40] wallyworld_: can I get a +1 on https://codereview.appspot.com/10235047/
[23:40] wallyworld_: although I should propose again
[23:40] if you wait a few minutes, I'll do that
[23:40] against the new trunk you mean
[23:40] no, to have the test I added in the review
[23:40] ok
[23:40] * thumper is busy submitting a later pipe
[23:41] wallyworld_: sync up chat sometime?
[23:41] sure, give me a little time to propose some code
[23:41] just finishing some stuff
[23:42] sure, no urgency
[23:42] I have heaps to get on with
[23:49] thumper: looks like you already have your 2nd +1
[23:49] wallyworld_: yeah, just saw that too
[23:49] approving and merging now
[23:49] dfc hasn't come online yet though :)
[23:50] hi mramm
[23:50] hey hey
[23:52] mramm: working on a sunday?
[23:52] not really
[23:52] just thought I'd pop in and check on IRC
[23:52] check my e-mail
[23:52] while cooking dinner
[23:53] well, IRC is probably where you left it :)
[23:53] no problems
[23:53] all ticking along
[23:54] horrible weather here, cold and wet
[23:54] supposed to get snow on thursday
[23:56] arosales: did you get the meeting invite i sent?
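Circling back to the race pinned down between 14:14 and 14:27: AcquireSocket drops the server lock while it dials, so a concurrent Close neither sees the in-flight connection nor is seen by it, and the new socket ends up in liveSockets on a server that is already closed. Below is a schematic version of that interleaving and of the fix discussed above (re-check the closed flag after re-acquiring the lock, before appending), paraphrased from the conversation rather than taken from mgo's source; every name is a stand-in.

package example

import (
	"errors"
	"sync"
)

type socket struct{}

func (s *socket) Close() {}

// dial stands in for the real (potentially slow) network connect.
func dial() (*socket, error) { return &socket{}, nil }

type server struct {
	mu          sync.Mutex
	closed      bool
	liveSockets []*socket
}

func (s *server) AcquireSocket() (*socket, error) {
	s.mu.Lock()
	if s.closed {
		s.mu.Unlock()
		return nil, errors.New("server closed")
	}
	s.mu.Unlock() // released so Close() isn't blocked behind a slow dial

	sock, err := dial() // Close() may run while this is in flight
	if err != nil {
		return nil, err
	}

	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		// The fix described in the conversation: the server was closed while
		// we were dialing, so close the new socket instead of tracking it
		// forever in liveSockets.
		sock.Close()
		return nil, errors.New("server closed")
	}
	s.liveSockets = append(s.liveSockets, sock)
	return sock, nil
}

func (s *server) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closed = true
	for _, sock := range s.liveSockets {
		sock.Close() // a socket still being dialed is not in this list yet
	}
	s.liveSockets = nil
}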