jam | fwereade: isn't it your weekend ? :) | 11:31 |
---|---|---|
jam | but really, great job on triaging | 11:31 |
wallyworld_ | jam: looks like the landing bot doesn't like my branch. all tests pass locally | 12:06 |
wallyworld_ | have you seen the failures before with the bot? | 12:06 |
jam | wallyworld_: see earlier bugs, it seems there is a thread that stays alive "sometimes", which on the bot is about 3 in 5 test runs | 12:06 |
wallyworld_ | ok, i'll try re-approving | 12:07 |
wallyworld_ | 3rd time lucky | 12:07 |
jam | wallyworld_: yeah, I'm trying to debug it now, inserting logging in mgo, or something. | 12:09 |
wallyworld_ | good luck, doesn't sound straightforward | 12:09 |
jam | wallyworld_: well it looks like what is actually happening is an RPC failure | 12:10 |
jam | which causes the suite to fail to tear down correctly | 12:10 |
jam | [LOG] 16.33282 DEBUG juju rpc/jsoncodec: <- error: read tcp 127.0.0.1:52276: use of closed network connection (closing true) [LOG] 16.33383 DEBUG juju rpc/jsoncodec: <- error: EOF (closing false) ... Panic: Fixture has panicked (see related PANIC) | 12:10 |
wallyworld_ | would be nice to make the teardown more robust at least | 12:11 |
jam | wallyworld_: well it looks like a test should be failing given the "use of closed network connection" part | 12:11 |
jam | though that failure *might* be: http://code.google.com/p/go/issues/detail?id=4704 | 12:11 |
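
A minimal sketch (not from the log) of how Go produces that error class: a `Read` blocked on a `net.Conn` returns "use of closed network connection" once another goroutine closes the connection, which is the kind of message showing up in the jsoncodec output quoted above.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Loopback listener; the dial succeeds even though we never Accept,
	// because the OS completes the TCP handshake into the backlog.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}

	done := make(chan error, 1)
	go func() {
		buf := make([]byte, 1)
		_, readErr := conn.Read(buf) // blocks until the conn is closed
		done <- readErr
	}()

	conn.Close()
	// Prints an error containing "use of closed network connection".
	fmt.Println(<-done)
}
```
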
* wallyworld_ looks | 12:12 | |
jam | which is only relevant to the patched version of go used in the packaged go binaries | 12:12 |
jam | but why would *that* be nondeterministic? | 12:12 |
jam | (for danilo's patch, it was exactly 3 submissions to succeed) | 12:12 |
jam | and naturally running it directly.... doesn't fail | 12:13 |
wallyworld_ | too bad we aren't running with 1.0.3 packaged | 12:13 |
jam | wallyworld_: I'm using 1.0.3 from the ppa, which are you meaning? | 12:14 |
jam | that is the one tarmac is using | 12:14 |
jam | which may not have the patch, which is why I don't think it is issue 4704 from above | 12:14 |
wallyworld_ | i just assumed we were running < 1.0.3 since the bug report linked above seemed to say it was fixed in 1.0.3 | 12:14 |
jam | wallyworld_: right, but supposedly there was a patch to 1.0.2 which might still have been applied to this code. | 12:15 |
jam | Alternatively, we are always getting the connection closed, and it only sometimes causes a stale connection to mgo | 12:15 |
wallyworld_ | ok. i was just guessing without having all the facts :-) | 12:16 |
jam | wallyworld_: I don't have many more facts than you. Another pair of eyes is appreciated. | 12:16 |
wallyworld_ | i'll see if anything jumps out | 12:17 |
jam | wallyworld_: the code under test is doing stuff with timeouts also, which always sounds non-deterministic | 12:17 |
wallyworld_ | yeah, could just be the bot vm is slow or something | 12:18 |
jam | wallyworld_: well it is 10 seconds of 1 socket still marked as alive, so it probably isn't tearing down yet, but maybe the connection is closing earlier because the bot is slow | 12:18 |
jam | wallyworld_: yeah, post commit hook fired | 12:25 |
jam | yay | 12:25 |
wallyworld_ | \o/ | 12:25 |
jam | wallyworld_: so it seems I just need to auto-requeue all requests 3 times. :) | 12:25 |
wallyworld_ | lol | 12:25 |
jam | note that when we get this failure, it trips 3 different tests. | 12:25 |
wallyworld_ | yes, i noticed that | 12:25 |
jam | I wonder if some other test is leaving it unclean and the rpc stuff is bogus? | 12:25 |
wallyworld_ | maybe, would not be surprised | 12:26 |
jam | wallyworld_: of course I *just* triggered it locally, but I'm pretty sure it will succeed next time. | 12:27 |
jam | but it failed and ran only 1 test | 12:27 |
jam | so it is at least local to the test | 12:27 |
wallyworld_ | yeah | 12:27 |
jam | wallyworld_: so if I add a last line: c.Fatal("failing this test") | 12:28 |
jam | then it always triggers | 12:28 |
jam | and I see a "rpc/jsoncodec EOF stuff | 12:28 |
jam | the one difference is "(closing false)" in my method vs "(closing true)" | 12:28 |
wallyworld_ | which test do you add that to? | 12:28 |
jam | wallyworld_: so I *think* the TestManageEnviron | 12:29 |
jam | in cmd/jujud | 12:29 |
jam | machine_test.go | 12:29 |
jam | wallyworld_: So looking at the test, I think m1.Watch is setting up a watcher which is then polling on the server side, and we kill it at some point (w.Stop() is deferred) | 12:29 |
jam | but on the *server* side we just get a closed connection. | 12:29 |
wallyworld_ | a client watcher should be able to be killed and the server should notice at some point and just deal with it | 12:31 |
jam | ERROR juju agent died with no error | 12:31 |
jam | wallyworld_: right, when I add the Fatalf it doesn't panic the fixture | 12:31 |
jam | it doesn't leave a connection alive | 12:31 |
jam | the point was that we end up with a very similar log file | 12:31 |
wallyworld_ | jam: weird commit message when i pull trunk and do bzr log - claims the committer was mathew scott and not me | 12:39 |
wallyworld_ | for the branch that just merged | 12:40 |
jam | wallyworld_: 'bzr log --long' it has both of you, but MScott happens to come first for some reason. | 12:40 |
wallyworld_ | lp also shows m scott | 12:41 |
wallyworld_ | why is he associated with my commit? | 12:41 |
jam | wallyworld_: if you look at the log, you merged his changes, Tarmac just includes everyone | 12:41 |
jam | rev 1276.1.8 has you merging his branch | 12:41 |
jam | so your patch brings in his changes | 12:41 |
wallyworld_ | i merged "trunk" but it was off ~juju not go-bot | 12:41 |
wallyworld_ | i didn't explicitly merge his code | 12:42 |
jam | wallyworld_: did someone 'land' code to ~juju before it moved to go-bot? | 12:42 |
jam | I pulled and pushed right when I switched | 12:42 |
wallyworld_ | not that i can recall or am aware of | 12:42 |
jam | but maybe something was landed there after I switched it | 12:42 |
wallyworld_ | maybe | 12:42 |
wallyworld_ | doesn't matter, was just curious | 12:43 |
jam | if people were using 'lbox submit' it probably would still land to the old location, because it was the known location. | 12:43 |
jam | so I guess | 12:43 |
wallyworld_ | makes sense | 12:43 |
jam | wallyworld_: thanks for bringing in accidentally missed changes :) | 12:43 |
wallyworld_ | anytime :-) | 12:43 |
jam | wallyworld_: I may have found it.... the connection that is being created is inside mgo itself. Which runs a background pinger against the server (to make sure the server is still alive) | 13:11 |
jam | guess what the pinger delay is? | 13:11 |
wallyworld_ | 500ms? | 13:11 |
jam | wallyworld_: 10s | 13:11 |
wallyworld_ | ah, right | 13:11 |
jam | which is *exactly* the amount of time we wait for all connections to clean up | 13:11 |
wallyworld_ | lol | 13:11 |
jam | so if pinger starts at any time close to when we start tearing down | 13:11 |
jam | we wait 10s | 13:11 |
jam | but it sleeps for 10.xxx seconds | 13:11 |
jam | and lo and behold, that thread never goes away | 13:12 |
wallyworld_ | good catch | 13:12 |
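
A minimal sketch of the timing hazard just described (names and structure are assumptions, not mgo's actual code): a goroutine that sleeps for its full 10-second delay before doing anything can never be gone by the time a 10-second "wait for connections to clean up" window expires.

```go
package main

import (
	"fmt"
	"time"
)

const pingDelay = 10 * time.Second // same length as the teardown wait

// pinger sleeps for the full delay before each ping, so once started it
// stays alive for at least ~10s, however the race between teardown and
// the goroutine scheduler falls.
func pinger(loop bool) {
	for {
		time.Sleep(pingDelay) // uninterruptible sleep
		fmt.Println("ping server")
		if !loop {
			return
		}
	}
}

func main() {
	go pinger(false) // fired off just as teardown begins

	// Teardown waits up to 10s for background goroutines to go away.
	// The pinger sleeps 10.xxx seconds, so the deadline is always missed.
	deadline := time.Now().Add(10 * time.Second)
	for time.Now().Before(deadline) {
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("teardown deadline reached; the pinger has at best only just woken up")
}
```
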
jam | wallyworld_: inserting traceback logging | 13:12 |
jam | and then tracking through what address wasn't | 13:12 |
jam | dying | 13:12 |
jam | and then checking its traceback for why it still existed | 13:13 |
jam | I was *very* fortunate to be able to trigger it locally | 13:13 |
wallyworld_ | yes indeed | 13:13 |
jam | once-in-a-while | 13:13 |
jam | I'm guessing Tarmac is slow enough | 13:13 |
jam | that the 10s delay | 13:13 |
jam | ends up being 10+s more often | 13:13 |
wallyworld_ | yep | 13:13 |
jam | widening the window of failure | 13:13 |
wallyworld_ | 10s seems like way too long to wait inside a test | 13:14 |
wallyworld_ | too bad we can't tell mgo not to ping | 13:14 |
jam | wallyworld_: well mgo keeps a connection in a sleep loop | 13:14 |
wallyworld_ | since for a test we don't care | 13:14 |
jam | wallyworld_: ah, I think there is a goroutine scheduling issue as well, as newServer has fired off a side thread telling it that it wants to ping | 13:17 |
jam | and it has already run pinger(false) # do not loop | 13:17 |
wallyworld_ | fun | 13:18 |
jam | right, so it has checked and probably knows that we want to shut down, but fires off the goroutine, which runs at some point in (near) future | 13:18 |
jam | which is just after we started closing | 13:18 |
jam | but the first thing it does is sleep for 10s | 13:18 |
jam | I'm thinking about just inserting a 'check for closed' before we hit sleep | 13:19 |
wallyworld_ | that would be the normal way to do these types of things | 13:19 |
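
A sketch of that "check for closed before we hit sleep" idea, with assumed type and field names rather than mgo's real ones: test the flag under the lock before each sleep, so a shutdown requested just before the goroutine gets scheduled doesn't commit it to a full 10-second nap. (As the next lines note, this alone didn't cure it, because the close can also arrive mid-sleep.)

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const pingDelay = 10 * time.Second

// server is a stand-in type for the sketch; "closed" plays the role of
// the shutdown flag discussed above.
type server struct {
	mu     sync.Mutex
	closed bool
}

func (s *server) pinger(loop bool) {
	for {
		s.mu.Lock()
		closed := s.closed
		s.mu.Unlock()
		if closed {
			return // shutting down: skip the sleep and exit immediately
		}
		time.Sleep(pingDelay)
		fmt.Println("ping")
		if !loop {
			return
		}
	}
}

func main() {
	s := &server{}
	s.mu.Lock()
	s.closed = true // shutdown happens before the pinger goroutine runs
	s.mu.Unlock()

	done := make(chan struct{})
	go func() { s.pinger(false); close(done) }()
	<-done // returns promptly: the closed check runs before the sleep
	fmt.Println("pinger exited without sleeping")
}
```
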
jam | doesn't seem to matter | 13:20 |
jam | looks like it is getting closed in the middle of those 10s | 13:20 |
jam | wallyworld_: 7+ years since I wrote a lot of C++ code, and I still type "for i = 0; i < 10; ++i" :) | 13:21 |
jam | (go doesn't support ++var) | 13:21 |
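
For the record, a tiny illustration of that quirk: Go only has the postfix form, and only as a statement, so neither `++i` nor `x := i++` compiles.

```go
package main

import "fmt"

func main() {
	// Go's increment is postfix-only and a statement, not an expression:
	// "++i" and "x := i++" are both compile errors.
	for i := 0; i < 10; i++ {
		fmt.Println(i)
	}
}
```
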
wallyworld_ | yeah, Go frustrates me a lot like that | 13:22 |
jam | so I tried changing the single sleep into a loop of sleeping, so that we can break out earlier. No luck yet. | 13:24 |
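
A sketch of that experiment: replace the single 10-second sleep with a loop of short sleeps, so a close arriving mid-delay is noticed within ~100ms. The stop channel and names are assumptions for illustration; as the log goes on to show, this wasn't the real fix either.

```go
package main

import (
	"fmt"
	"time"
)

// sleepOrStop sleeps for total in 100ms slices, giving up early if stop
// is closed. It returns true only if the full delay elapsed.
func sleepOrStop(total time.Duration, stop <-chan struct{}) bool {
	const slice = 100 * time.Millisecond
	for waited := time.Duration(0); waited < total; waited += slice {
		select {
		case <-stop:
			return false // close requested mid-sleep: bail out early
		default:
		}
		time.Sleep(slice)
	}
	return true
}

func main() {
	stop := make(chan struct{})
	go func() {
		time.Sleep(300 * time.Millisecond)
		close(stop) // simulate the server being closed mid-delay
	}()
	fmt.Println("slept full delay:", sleepOrStop(10*time.Second, stop)) // false
}
```
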
jam | wallyworld_: at 0.6376 we get a 'closing server' call, at 0.6378 we see it kill the currently active socket, at 0.648 we see it *create* a new socket which then never dies... why didn't it think it was done... | 14:02 |
wallyworld_ | why would it create a new socket after killing the active one? | 14:03 |
jam | (why, if it just killed its socket and is trying to shut down, doesn't it realise the server is closed instead of trying to connect again) | 14:03 |
jam | wallyworld_: exactly | 14:03 |
wallyworld_ | is this in the mgo code? | 14:03 |
jam | wallyworld_: that's what I'm looking at right now, yes. | 14:03 |
wallyworld_ | i guess we need to find a workaround until any upstream patch gets applied | 14:04 |
jam | wallyworld_: well, we can probably just set the loop to 15s and be safe | 14:04 |
jam | but I'd still like to understand this bit. | 14:04 |
jam | (I can easily run patched mgo on Tarmac bot :) | 14:05 |
wallyworld_ | great | 14:05 |
jam | note the 15s is *I think* :) | 14:05 |
wallyworld_ | but who knows if this will bite us in production | 14:05 |
jam | given it is creating a new connection | 14:05 |
jam | I wonder if server.closed is getting unset somehow | 14:05 |
wallyworld_ | plausible | 14:05 |
jam | like, it has been deallocated and thus the default value is nil? (though it can't really be deallocated while it's still referenced) | 14:06 |
jam | so I could see that if we get the Close() call, but then call AcquireSocket immediately after | 14:06 |
jam | we *won't* call liveSocket.Close() | 14:07 |
jam | because we close each socket in the server.Close() call. | 14:07 |
jam | so possibly this thread will never die | 14:07 |
jam | because it doesn't actually think the server is closed. | 14:07 |
wallyworld_ | seems like a stock standard coding error | 14:11 |
jam | wallyworld_: current guess. server is not closed when we call Dial | 14:14 |
jam | server is closed by the time connect returns | 14:14 |
wallyworld_ | would explain what is seen i think | 14:15 |
jam | wallyworld_: 85.92296 AcquireSocket, 85.93311 closing server, 85.93321 killing old connection, 85.94373 newSocket | 14:18 |
jam | so we ask for a new socket, while that is pending, we call Close | 14:19 |
jam | Close doesn't see the connection we are creating right now | 14:19 |
wallyworld_ | cause it's not fully created yet | 14:20 |
jam | yep | 14:20 |
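
An illustrative model of the race in that timeline, using the function names from the chat but not mgo's actual source: the lock is dropped while dialing, so a Close that runs inside that window only sees the sockets that already exist, and the freshly dialed socket is then appended to a server that has already been closed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type socket struct{ closed bool }

type server struct {
	mu          sync.Mutex
	closed      bool
	liveSockets []*socket
}

// AcquireSocket releases the lock around the (slow) dial, mirroring the
// behaviour described above.
func (s *server) AcquireSocket() *socket {
	s.mu.Lock()
	// The closed flag is (presumably) checked here, before dialing.
	s.mu.Unlock()

	sock := connect() // lock not held during the dial

	s.mu.Lock()
	s.liveSockets = append(s.liveSockets, sock) // may run after Close()
	s.mu.Unlock()
	return sock
}

// Close marks the server closed and closes the sockets it knows about,
// which cannot include a socket still being dialed.
func (s *server) Close() {
	s.mu.Lock()
	s.closed = true
	for _, sock := range s.liveSockets {
		sock.closed = true
	}
	s.liveSockets = nil
	s.mu.Unlock()
}

func connect() *socket {
	time.Sleep(10 * time.Millisecond) // stand-in for the TCP dial
	return &socket{}
}

func main() {
	s := &server{}
	acquired := make(chan *socket)
	go func() { acquired <- s.AcquireSocket() }()
	time.Sleep(time.Millisecond) // let AcquireSocket start its dial
	s.Close()                    // runs while the dial is in flight
	sock := <-acquired
	// The new socket was never seen by Close, so nothing will ever close it.
	fmt.Println("server closed:", s.closed, "new socket closed:", sock.closed)
}
```
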
wallyworld_ | may just use a mutex | 14:20 |
wallyworld_ | maybe | 14:20 |
jam | there are mutexes around | 14:20 |
jam | AcquireSocket releases the mutex just before calling Connect | 14:21 |
jam | presumably to not block waiting for connect | 14:21 |
jam | and grabs it again before appending to live sockets | 14:21 |
jam | probably needs to check for 'self.closed' before appending the new socket | 14:21 |
wallyworld_ | or gate acquire and close with a mutex | 14:22 |
wallyworld_ | not sure without looking at the code | 14:22 |
jam | wallyworld_: I think just after Connect when we re-acquire the lock to add it to our liveSockets, we need to check that we aren't actually in closing state. | 14:22 |
wallyworld_ | sounds ok | 14:23 |
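
A sketch of the fix being discussed, reusing the types from the model above (again, assumed names; what mgo actually does upstream may differ): once Connect returns and the lock is re-taken, check the closed flag, and if the server shut down in the meantime, close the new socket instead of quietly adding it to liveSockets.

```go
// Sketch only: a corrected AcquireSocket for the model sketched above.
// The second return value reports whether a socket was actually acquired.
func (s *server) AcquireSocketChecked() (*socket, bool) {
	s.mu.Lock()
	if s.closed {
		s.mu.Unlock()
		return nil, false // already shut down: refuse to dial at all
	}
	s.mu.Unlock()

	sock := connect() // the lock is still released around the dial

	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		// Close() ran while we were dialing: close the new socket
		// ourselves rather than appending it to liveSockets, so no
		// connection is left that nobody will ever close.
		sock.closed = true
		return nil, false
	}
	s.liveSockets = append(s.liveSockets, sock)
	return sock, true
}
```
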
jam | wallyworld_: yeah, I just got "Connect returned after server was closed" but *didn't* get AcquireSocket called with closed server. | 14:27 |
wallyworld_ | good | 14:27 |
thumper | wallyworld_: morning... | 23:37 |
wallyworld_ | hi | 23:37 |
thumper | wallyworld_: can I get a +1 on https://codereview.appspot.com/10235047/ | 23:40 |
thumper | wallyworld_: although I should propose again | 23:40 |
thumper | if you wait a few minutes, I'll do that | 23:40 |
wallyworld_ | against the new trunk you mean | 23:40 |
thumper | no, to have the test I added in the review | 23:40 |
wallyworld_ | ok | 23:40 |
* thumper is busy submitting a later pipe | 23:40 | |
thumper | wallyworld_: sync up chat sometime? | 23:41 |
wallyworld_ | sure, give me a little time to propose some code | 23:41 |
wallyworld_ | just finishing some stuff | 23:41 |
thumper | sure, no urgency | 23:42 |
thumper | I have heaps to get on with | 23:42 |
wallyworld_ | thumper: looks like you already have your 2nd +1 | 23:49 |
thumper | wallyworld_: yeah, just saw that too | 23:49 |
thumper | approving and merging now | 23:49 |
thumper | dfc hasn't come online yet though :) | 23:49 |
thumper | hi mramm | 23:50 |
mramm | hey hey | 23:50 |
thumper | mramm: working on a sunday? | 23:52 |
mramm | not really | 23:52 |
mramm | just thought I'd pop in and check on IRC | 23:52 |
mramm | check my e-mail | 23:52 |
mramm | while cooking dinner | 23:52 |
thumper | well, IRC is probably where you left it :) | 23:53 |
thumper | no problems | 23:53 |
thumper | all ticking along | 23:53 |
thumper | horrible weather here, cold and wet | 23:54 |
thumper | supposed to get snow on thursday | 23:54 |
wallyworld_ | arosales: did you get the meeting invite i sent? | 23:56 |