jam | fwereade: isn't it your weekend ? :) | 11:31 |
---|---|---|
jam | but really, great job on triaging | 11:31 |
wallyworld_ | jam: looks like the landing bot doesn't like my branch. all tests pass locally | 12:06 |
wallyworld_ | have you seen the failures before with the bot? | 12:06 |
jam | wallyworld_: see earlier bugs, it seems there is a thread that stays alive "sometimes", which on the bot is about 3 in 5 test runs | 12:06 |
wallyworld_ | ok, i'll try re-approving | 12:07 |
wallyworld_ | 3rd time lucky | 12:07 |
jam | wallyworld_: yeah, I'm trying to debug it now, inserting logging in mgo, or something. | 12:09 |
wallyworld_ | good luck, doesn't sound straightforward | 12:09 |
jam | wallyworld_: well it looks like what is actually happening is an RPC failure | 12:10 |
jam | which causes the suite to fail to tear down correctly | 12:10 |
jam | [LOG] 16.33282 DEBUG juju rpc/jsoncodec: <- error: read tcp 127.0.0.1:52276: use of closed network connection (closing true) [LOG] 16.33383 DEBUG juju rpc/jsoncodec: <- error: EOF (closing false) ... Panic: Fixture has panicked (see related PANIC) | 12:10 |
wallyworld_ | would be nice to make the teardown more robust at least | 12:11 |
jam | wallyworld_: well it looks like a test should be failing given the "use of closed network connection" part | 12:11 |
jam | though that failure *might* be: http://code.google.com/p/go/issues/detail?id=4704 | 12:11 |
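
A minimal sketch (not from the log) of how Go produces that error class: a `Read` blocked on a `net.Conn` returns "use of closed network connection" once another goroutine closes the connection, which is the kind of message showing up in the jsoncodec output quoted above.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Loopback listener; the dial succeeds even though we never Accept,
	// because the OS completes the TCP handshake into the backlog.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}

	done := make(chan error, 1)
	go func() {
		buf := make([]byte, 1)
		_, readErr := conn.Read(buf) // blocks until the conn is closed
		done <- readErr
	}()

	conn.Close()
	// Prints an error containing "use of closed network connection".
	fmt.Println(<-done)
}
```
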
* wallyworld_ looks | 12:12 | |
jam | which is only relevant to the patched version of go used in the packaged go binaries | 12:12 |
jam | but why would *that* be nondeterministic? | 12:12 |
jam | (for danilo's patch, it was exactly 3 submissions to succeed) | 12:12 |
jam | and naturally running it directly.... doesn't fail | 12:13 |
wallyworld_ | too bad we aren't running with 1.0.3 packaged | 12:13 |
jam | wallyworld_: I'm using 1.0.3 from the ppa, which are you meaning? | 12:14 |
jam | that is the one tarmac is using | 12:14 |
jam | which may not have the patch, which is why I don't think it is issue 4704 from above | 12:14 |
wallyworld_ | i just assumed we were running < 1.0.3 since the bug report linked above seemed to say it was fixed in 1.0.3 | 12:14 |
jam | wallyworld_: right, but supposedly there was a patch to 1.0.2 which might still have been applied to this code. | 12:15 |
jam | Alternatively, we are always getting the connection closed, and it only sometimes causes a stale connection to mgo | 12:15 |
wallyworld_ | ok. i was just guessing without having all the facts :-) | 12:16 |
jam | wallyworld_: I don't have many more facts than you. Another pair of eyes is appreciated. | 12:16 |
wallyworld_ | i'll see if anything jumps out | 12:17 |
jam | wallyworld_: the code under test is doing stuff with timeouts also, which always sounds non-deterministic | 12:17 |
wallyworld_ | yeah, could just be the bot vm is slow or something | 12:18 |
jam | wallyworld_: well it is 10 seconds of 1 socket still marked as alive, so it probably isn't tearing down yet, but maybe the connection is closing earlier because the bot is slow | 12:18 |
jam | wallyworld_: yeah, post commit hook fired | 12:25 |
jam | yay | 12:25 |
wallyworld_ | \o/ | 12:25 |
jam | wallyworld_: so it seems I just need to auto-requeue all requests 3 times. :) | 12:25 |
wallyworld_ | lol | 12:25 |
jam | note that when we get this failure, it trips 3 different tests. | 12:25 |
wallyworld_ | yes, i noticed that | 12:25 |
jam | I wonder if some other test is leaving it unclean and the rpc stuff is bogus? | 12:25 |
wallyworld_ | maybe, would not be surprised | 12:26 |
jam | wallyworld_: of course I *just* triggered it locally, but I'm pretty sure it will succeed next time. | 12:27 |
jam | but it failed and ran only 1 test | 12:27 |
jam | so it is at least local to the test | 12:27 |
wallyworld_ | yeah | 12:27 |
jam | wallyworld_: so if I add a last line: c.Fatal("failing this test") | 12:28 |
jam | then it always triggers | 12:28 |
jam | and I see a "rpc/jsoncodec EOF stuff | 12:28 |
jam | the one difference is "(closing false)" in my method vs "(closing true)" | 12:28 |
wallyworld_ | which test do you add that to? | 12:28 |
jam | wallyworld_: so I *think* the TestManageEnviron | 12:29 |
jam | in cmd/jujud | 12:29 |
jam | machine_test.go | 12:29 |
jam | wallyworld_: So looking at the test, I think m1.Watch is setting up a watcher which is then polling on the server side, and we kill it at some point (w.Stop() is deferred) | 12:29 |
jam | but on the *server* side we just get a closed connection. | 12:29 |
wallyworld_ | a client watcher should be able to be killed and the server should notice at some point and just deal with it | 12:31 |
jam | ERROR juju agent died with no error | 12:31 |
jam | wallyworld_: right, when I add the Fatalf it doesn't panic the fixture | 12:31 |
jam | it doesn't leave a connection alive | 12:31 |
jam | the point was that we end up with a very similar log file | 12:31 |
wallyworld_ | jam: weird commit message when i pull trunk and do bzr log - claims the committer was mathew scott and not me | 12:39 |
wallyworld_ | for the branch that just merged | 12:40 |
jam | wallyworld_: 'bzr log --long' it has both of you, but MScott happens to come first for some reason. | 12:40 |
wallyworld_ | lp also shows m scott | 12:41 |
wallyworld_ | why is he associated with my commit? | 12:41 |
jam | wallyworld_: if you look at the log, you merged his changes, Tarmac just includes everyone | 12:41 |
jam | rev 1276.1.8 has you merging his branch | 12:41 |
jam | so your patch brings in his changes | 12:41 |
wallyworld_ | i merged "trunk" but it was off ~juju not go-bot | 12:41 |
wallyworld_ | i didn't explicitly merge his code | 12:42 |
jam | wallyworld_: did someone 'land' code to ~juju before it moved to go-bot? | 12:42 |
jam | I pulled and pushed right when I switched | 12:42 |
wallyworld_ | not that i can recall or am aware of | 12:42 |
jam | but maybe something was landed there after I switched it | 12:42 |
wallyworld_ | maybe | 12:42 |
wallyworld_ | doesn't matter, was just curious | 12:43 |
jam | if people were using 'lbox submit' it probably would still land to the old location, because it was the known location. | 12:43 |
jam | so I guess | 12:43 |
wallyworld_ | makes sense | 12:43 |
jam | wallyworld_: thanks for bringing in accidentally missed changes :) | 12:43 |
wallyworld_ | anytime :-) | 12:43 |
jam | wallyworld_: I may have found it.... the connection that is being created is inside mgo itself. Which runs a background pinger against the server (to make sure the server is still alive) | 13:11 |
jam | guess what the pinger delay is? | 13:11 |
wallyworld_ | 500ms? | 13:11 |
jam | wallyworld_: 10s | 13:11 |
wallyworld_ | ah, right | 13:11 |
jam | which is *exactly* the amount of time we wait for all connections to clean up | 13:11 |
wallyworld_ | lol | 13:11 |
jam | so if pinger starts at any time close to when we start tearing down | 13:11 |
jam | we wait 10s | 13:11 |
jam | but it sleeps for 10.xxx seconds | 13:11 |
jam | and lo and behold, that thread never goes away | 13:12 |
wallyworld_ | good catch | 13:12 |
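
A minimal sketch of the timing hazard just described (names and structure are assumptions, not mgo's actual code): a goroutine that sleeps for its full 10-second delay before doing anything can never be gone by the time a 10-second "wait for connections to clean up" window expires.

```go
package main

import (
	"fmt"
	"time"
)

const pingDelay = 10 * time.Second // same length as the teardown wait

// pinger sleeps for the full delay before each ping, so once started it
// stays alive for at least ~10s, however the race between teardown and
// the goroutine scheduler falls.
func pinger(loop bool) {
	for {
		time.Sleep(pingDelay) // uninterruptible sleep
		fmt.Println("ping server")
		if !loop {
			return
		}
	}
}

func main() {
	go pinger(false) // fired off just as teardown begins

	// Teardown waits up to 10s for background goroutines to go away.
	// The pinger sleeps 10.xxx seconds, so the deadline is always missed.
	deadline := time.Now().Add(10 * time.Second)
	for time.Now().Before(deadline) {
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("teardown deadline reached; the pinger has at best only just woken up")
}
```
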
jam | wallyworld_: inserting traceback logging | 13:12 |
jam | and then tracking through what address wasn't | 13:12 |
jam | dying | 13:12 |
jam | and then checking its traceback for why it still existed | 13:13 |
jam | I was *very* fortunate to be able to trigger it locally | 13:13 |
wallyworld_ | yes indeed | 13:13 |
jam | once-in-a-while | 13:13 |
jam | I'm guessing Tarmac is slow enough | 13:13 |
jam | that the 10s delay | 13:13 |
jam | ends up being 10+s more often | 13:13 |
wallyworld_ | yep | 13:13 |
jam | widening the window of failure | 13:13 |
wallyworld_ | 10s seems like way too long to wait inside a test | 13:14 |
wallyworld_ | too bad we can't tell mgo not to ping | 13:14 |
jam | wallyworld_: well mgo keeps a connection in a sleep loop | 13:14 |
wallyworld_ | since for a test we don't care | 13:14 |
jam | wallyworld_: ah, I think there is a goroutine scheduling issue as well, as newServer has fired off a side thread telling it that it wants to ping | 13:17 |
jam | and it has already run pinger(false) # do not loop | 13:17 |
wallyworld_ | fun | 13:18 |
jam | right, so it has checked and probably knows that we want to shut down, but fires off the goroutine, which runs at some point in (near) future | 13:18 |
jam | which is just after we started closing | 13:18 |
jam | but the first thing it does is sleep for 10s | 13:18 |
jam | I'm thinking about just inserting a 'check for closed' before we hit sleep | 13:19 |
wallyworld_ | that would be the normal way to do these types of things | 13:19 |
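
A sketch of that "check for closed before we hit sleep" idea, with assumed type and field names rather than mgo's real ones: test the flag under the lock before each sleep, so a shutdown requested just before the goroutine gets scheduled doesn't commit it to a full 10-second nap. (As the next lines note, this alone didn't cure it, because the close can also arrive mid-sleep.)

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const pingDelay = 10 * time.Second

// server is a stand-in type for the sketch; "closed" plays the role of
// the shutdown flag discussed above.
type server struct {
	mu     sync.Mutex
	closed bool
}

func (s *server) pinger(loop bool) {
	for {
		s.mu.Lock()
		closed := s.closed
		s.mu.Unlock()
		if closed {
			return // shutting down: skip the sleep and exit immediately
		}
		time.Sleep(pingDelay)
		fmt.Println("ping")
		if !loop {
			return
		}
	}
}

func main() {
	s := &server{}
	s.mu.Lock()
	s.closed = true // shutdown happens before the pinger goroutine runs
	s.mu.Unlock()

	done := make(chan struct{})
	go func() { s.pinger(false); close(done) }()
	<-done // returns promptly: the closed check runs before the sleep
	fmt.Println("pinger exited without sleeping")
}
```
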
jam | doesn't seem to matter | 13:20 |
jam | looks like it is getting closed in the middle of those 10s | 13:20 |
jam | wallyworld_: 7+ years since I wrote a lot of C++ code, and I still type "for i = 0; i < 10; ++i" :) | 13:21 |
jam | (go doesn't support ++var) | 13:21 |
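
For the record, a tiny illustration of that quirk: Go only has the postfix form, and only as a statement, so neither `++i` nor `x := i++` compiles.

```go
package main

import "fmt"

func main() {
	// Go's increment is postfix-only and a statement, not an expression:
	// "++i" and "x := i++" are both compile errors.
	for i := 0; i < 10; i++ {
		fmt.Println(i)
	}
}
```
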
wallyworld_ | yeah, Go frustrates me a lot like that | 13:22 |
jam | so I tried changing the single sleep into a loop of sleeping, so that we can break out earlier. No luck yet. | 13:24 |
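
A sketch of that experiment: replace the single 10-second sleep with a loop of short sleeps, so a close arriving mid-delay is noticed within ~100ms. The stop channel and names are assumptions for illustration; as the log goes on to show, this wasn't the real fix either.

```go
package main

import (
	"fmt"
	"time"
)

// sleepOrStop sleeps for total in 100ms slices, giving up early if stop
// is closed. It returns true only if the full delay elapsed.
func sleepOrStop(total time.Duration, stop <-chan struct{}) bool {
	const slice = 100 * time.Millisecond
	for waited := time.Duration(0); waited < total; waited += slice {
		select {
		case <-stop:
			return false // close requested mid-sleep: bail out early
		default:
		}
		time.Sleep(slice)
	}
	return true
}

func main() {
	stop := make(chan struct{})
	go func() {
		time.Sleep(300 * time.Millisecond)
		close(stop) // simulate the server being closed mid-delay
	}()
	fmt.Println("slept full delay:", sleepOrStop(10*time.Second, stop)) // false
}
```
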
jam | wallyworld_: at 0.6376 we get a 'closing server' call, at 0.6378 we see it kill the currently active socket, at 0.648 we see it *create* a new socket which then never dies... why didn't it think it was done... | 14:02 |
wallyworld_ | why would it create a new socket after killing the active one? | 14:03 |
jam | (why, if it just killed its socket and is trying to shut down, doesn't it realise the server is closed instead of trying to connect again) | 14:03 |
jam | wallyworld_: exactly | 14:03 |
wallyworld_ | is this in the mgo code? | 14:03 |
jam | wallyworld_: that's what I'm looking at right now, yes. | 14:03 |
wallyworld_ | i guess we need to find a workaround until any upstream patch gets applied | 14:04 |
jam | wallyworld_: well, we can probably just set the loop to 15s and be safe | 14:04 |
jam | but I'd still like to understand this bit. | 14:04 |
jam | (I can easily run patched mgo on Tarmac bot :) | 14:05 |
wallyworld_ | great | 14:05 |
jam | note the 15s is *I think* :) | 14:05 |
wallyworld_ | but who knows if this will bite us in production | 14:05 |
jam | given it is creating a new connection | 14:05 |
jam | I wonder if server.closed is getting unset somehow | 14:05 |
wallyworld_ | plausible | 14:05 |
jam | like, it has been deallocated and thus the default value is nil? (though it can't really be deallocated while it's still referenced) | 14:06 |
jam | so I could see that if we get the Close() call, but then call AcquireSocket immediately after | 14:06 |
jam | we *won't* call liveSocket.Close() | 14:07 |
jam | because we close each socket in the server.Close() call. | 14:07 |
jam | so possibly this thread will never die | 14:07 |
jam | because it doesn't actually think the server is closed. | 14:07 |
wallyworld_ | seems like a stock standard coding error | 14:11 |
jam | wallyworld_: current guess. server is not closed when we call Dial | 14:14 |
jam | server is closed by the time connect returns | 14:14 |
wallyworld_ | would explain what is seen i think | 14:15 |
jam | wallyworld_: 85.92296 AcquireSocket, 85.93311 closing server, 85.93321 killing old connection, 85.94373 newSocket | 14:18 |
jam | so we ask for a new socket, while that is pending, we call Close | 14:19 |
jam | Close doesn't see the connection we are creating right now | 14:19 |
wallyworld_ | cause it's not fully created yet | 14:20 |
jam | yep | 14:20 |
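
An illustrative model of the race in that timeline, using the function names from the chat but not mgo's actual source: the lock is dropped while dialing, so a Close that runs inside that window only sees the sockets that already exist, and the freshly dialed socket is then appended to a server that has already been closed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type socket struct{ closed bool }

type server struct {
	mu          sync.Mutex
	closed      bool
	liveSockets []*socket
}

// AcquireSocket releases the lock around the (slow) dial, mirroring the
// behaviour described above.
func (s *server) AcquireSocket() *socket {
	s.mu.Lock()
	// The closed flag is (presumably) checked here, before dialing.
	s.mu.Unlock()

	sock := connect() // lock not held during the dial

	s.mu.Lock()
	s.liveSockets = append(s.liveSockets, sock) // may run after Close()
	s.mu.Unlock()
	return sock
}

// Close marks the server closed and closes the sockets it knows about,
// which cannot include a socket still being dialed.
func (s *server) Close() {
	s.mu.Lock()
	s.closed = true
	for _, sock := range s.liveSockets {
		sock.closed = true
	}
	s.liveSockets = nil
	s.mu.Unlock()
}

func connect() *socket {
	time.Sleep(10 * time.Millisecond) // stand-in for the TCP dial
	return &socket{}
}

func main() {
	s := &server{}
	acquired := make(chan *socket)
	go func() { acquired <- s.AcquireSocket() }()
	time.Sleep(time.Millisecond) // let AcquireSocket start its dial
	s.Close()                    // runs while the dial is in flight
	sock := <-acquired
	// The new socket was never seen by Close, so nothing will ever close it.
	fmt.Println("server closed:", s.closed, "new socket closed:", sock.closed)
}
```
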
wallyworld_ | may just use a mutex | 14:20 |
wallyworld_ | maybe | 14:20 |
jam | there are mutexes around | 14:20 |
jam | AcquireSocket releases the mutex just before calling Connect | 14:21 |
jam | presumably to not block waiting for connect | 14:21 |
jam | and grabs it again before appending to live sockets | 14:21 |
jam | probably needs to check for 'self.closed' before appending the new socket | 14:21 |
wallyworld_ | or gate acquire and close with a mutex | 14:22 |
wallyworld_ | not sure without looking at the code | 14:22 |
jam | wallyworld_: I think just after Connect when we re-acquire the lock to add it to our liveSockets, we need to check that we aren't actually in closing state. | 14:22 |
wallyworld_ | sounds ok | 14:23 |
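
A sketch of the fix being discussed, reusing the types from the model above (again, assumed names; what mgo actually does upstream may differ): once Connect returns and the lock is re-taken, check the closed flag, and if the server shut down in the meantime, close the new socket instead of quietly adding it to liveSockets.

```go
// Sketch only: a corrected AcquireSocket for the model sketched above.
// The second return value reports whether a socket was actually acquired.
func (s *server) AcquireSocketChecked() (*socket, bool) {
	s.mu.Lock()
	if s.closed {
		s.mu.Unlock()
		return nil, false // already shut down: refuse to dial at all
	}
	s.mu.Unlock()

	sock := connect() // the lock is still released around the dial

	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		// Close() ran while we were dialing: close the new socket
		// ourselves rather than appending it to liveSockets, so no
		// connection is left that nobody will ever close.
		sock.closed = true
		return nil, false
	}
	s.liveSockets = append(s.liveSockets, sock)
	return sock, true
}
```
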
jam | wallyworld_: yeah, I just got "Connect returned after server was closed" but *didn't* get AcquireSocket called with closed server. | 14:27 |
wallyworld_ | good | 14:27 |
thumper | wallyworld_: morning... | 23:37 |
wallyworld_ | hi | 23:37 |
thumper | wallyworld_: can I get a +1 on https://codereview.appspot.com/10235047/ | 23:40 |
thumper | wallyworld_: although I should propose again | 23:40 |
thumper | if you wait a few minutes, I'll do that | 23:40 |
wallyworld_ | against the new trunk you mean | 23:40 |
thumper | no, to have the test I added in the review | 23:40 |
wallyworld_ | ok | 23:40 |
* thumper is busy submitting a later pipe | 23:40 | |
thumper | wallyworld_: sync up chat sometime? | 23:41 |
wallyworld_ | sure, give me a little time to propose some code | 23:41 |
wallyworld_ | just finishing some stuff | 23:41 |
thumper | sure, no urgency | 23:42 |
thumper | I have heaps to get on with | 23:42 |
wallyworld_ | thumper: looks like you already have your 2nd +1 | 23:49 |
thumper | wallyworld_: yeah, just saw that too | 23:49 |
thumper | approving and merging now | 23:49 |
thumper | dfc hasn't come online yet though :) | 23:49 |
thumper | hi mramm | 23:50 |
mramm | hey hey | 23:50 |
thumper | mramm: working on a sunday? | 23:52 |
mramm | not really | 23:52 |
mramm | just thought I'd pop in and check on IRC | 23:52 |
mramm | check my e-mail | 23:52 |
mramm | while cooking dinner | 23:52 |
thumper | well, IRC is probably where you left it :) | 23:53 |
thumper | no problems | 23:53 |
thumper | all ticking along | 23:53 |
thumper | horrible weather here, cold and wet | 23:54 |
thumper | supposed to get snow on thursday | 23:54 |
wallyworld_ | arosales: did you get the meeting invite i sent? | 23:56 |