Nick | Message | Time |
---|---|---|
lifeless | wgrant: since you're here | 04:40 |
lifeless | wgrant: I don't hold with a strict ops/code split for bugs | 04:40 |
lifeless | wgrant: it doesn't make sense unless the eyeballs thinking about stuff are also partitioned. | 04:40 |
lifeless | And they aren't. | 04:40 |
lifeless | And shouldn't be. | 04:40 |
lifeless | I may well in future start agitating for LP ops stuff to be *on LP*, but that wouldn't make sense unless LP is a /lot/ more robust than it is now. | 04:41 |
lifeless | also, I wish 'str' was PEP8 | 04:42 |
wgrant | lifeless: I guess. | 04:46 |
wgrant | What's un-PEP8 about it? | 04:46 |
lifeless | startswith | 04:48 |
lifeless | starts_with <- pep8 | 04:48 |
wgrant | Ah, right. | 04:48 |
wgrant | Also, why isn't it 'string'? :( | 04:49 |
lifeless | characters were expensive in the bad ol days | 04:49 |
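
The gripe here is that `str` method names such as `startswith` predate PEP 8's lowercase_with_underscores convention for functions and methods. A minimal illustration of the two spellings; the `starts_with` wrapper is purely hypothetical and not part of the stdlib:

```python
text = "launchpad-foundations"

# The actual str API: no underscore, dating from before PEP 8.
print(text.startswith("launchpad"))  # True


def starts_with(s, prefix):
    """Hypothetical PEP 8-style spelling, shown only for contrast."""
    return s.startswith(prefix)


print(starts_with(text, "launchpad"))  # True
```
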
=== MTeck is now known as MTecknology | ||
lifeless | sigh | 06:44 |
lifeless | File "/home/robertc/launchpad/lp-branches/working/lib/canonical/testing/layers.py", line 507, in setUp | 06:44 |
lifeless | >< | 06:44 |
lifeless | time.sleep(0.1) | 06:44 |
wgrant_ | I would really like to know why that happens. | 06:44 |
lifeless | why folk put time.sleep calls in layer setups ? | 06:46 |
wgrant_ | Well, and why bin/test is so hard to kill. | 06:48 |
wgrant_ | Often resulting in that traceback, then a hang. | 06:48 |
lifeless | I suspect the zope test running reinvocation stuff is broken subtly | 06:50 |
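
The traceback above points at a polling loop inside a test layer's setUp. A sketch of that general pattern, not the real canonical/testing/layers.py code, showing why a Ctrl-C tends to land in `time.sleep(0.1)`:

```python
import time


class ExampleLayer:
    """Illustrative test layer that waits for an external service to come up."""

    @classmethod
    def setUp(cls, timeout=30.0):
        deadline = time.time() + timeout
        while not cls._service_is_ready():  # hypothetical readiness check
            if time.time() > deadline:
                raise RuntimeError("service did not start in time")
            # An interrupt during setUp usually arrives while the process is
            # blocked here, which is why tracebacks end at time.sleep(0.1).
            time.sleep(0.1)

    @classmethod
    def _service_is_ready(cls):
        return False  # placeholder; a real layer would probe a port or pidfile
```
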
lifeless | grah - fugly - <script>LP.client.cache['context'] = ... | 06:53 |
lifeless | ok, I hate chromium | 06:53 |
lifeless | show source should /show the source/ not re-request | 06:53 |
lifeless | I just hit the filebug issue myself | 06:54 |
lifeless | on launchpad-foundations | 06:54 |
beuno_ | lifeless, firefox does that as well, AFAIK | 06:54 |
lifeless | w/no apport glue | 06:54 |
lifeless | beuno_: firefox used to DTRT | 06:54 |
lifeless | hmm, something is really wrong, pages are on a go-slow | 06:54 |
beuno_ | well, it used to be fast as well :) | 06:55 |
=== beuno_ is now known as beuno | ||
lifeless | I wonder if we have a request backlog or something causing high perceived time | 06:55 |
lifeless | effectively lowering the service time | 06:55 |
wgrant_ | lifeless: I'm still getting those truncated responses during most runs. | 07:16 |
wgrant_ | So something is up. | 07:16 |
=== wgrant_ is now known as wgrant | ||
wgrant | Not something I've seen before, though :/ | 07:16 |
lifeless | wgrant: grah | 07:18 |
lifeless | hmm | 07:18 |
lifeless | 7am | 07:18 |
lifeless | I'm seeing trouble too, but nothing that I can obviously identify | 07:18 |
lifeless | oops counts are slightly high, but not freakishly so except for a spike ~ 24 hours ago | 07:19 |
lifeless | wgrant: I got the connecting error on +filebug | 07:19 |
lifeless | want to know the weird thing | 07:19 |
lifeless | the bug got filed | 07:19 |
wgrant | lifeless: That's a different issue, then. Sounds like yours was post-redirect? | 07:20 |
lifeless | yes | 07:21 |
lifeless | not that you could tell | 07:21 |
lifeless | wgrant: I'm saying that I think it's the same issue | 07:21 |
wgrant | Hmm. | 07:22 |
lifeless | that we're seeing two things : OOPSes w/stuff, and something networkish which is borking responses and causing 'could not connect' errors | 07:22 |
wgrant | Do you know how long it took? | 07:22 |
lifeless | a while | 07:22 |
lifeless | here is an explanation for a truncated page: | 07:22 |
lifeless | an HTTP/1.0 page with no Content-Length had its network socket disconnect | 07:22 |
lifeless | we have two datacentres | 07:23 |
wgrant | httplib.IncompleteRead: IncompleteRead(15646 bytes read, 1272188 more expected) | 07:23 |
wgrant | is what I'm getting, FWIW. | 07:23 |
lifeless | the front ends are in one place. | 07:23 |
lifeless | the appservers are in both. | 07:23 |
wgrant | Wait, LP is split over both? | 07:23 |
lifeless | the database is in the same one as the FE's, AIUI. | 07:23 |
wgrant | I didn't realise it was in the second at all. | 07:23 |
lifeless | wampee and the other are, AIUI | 07:23 |
lifeless | anyhow | 07:24 |
lifeless | my working theory is connectivity issues between the DC's | 07:24 |
lifeless | this would: | 07:24 |
lifeless | - increase SQL time (packet retransmits) | 07:24 |
lifeless | - truncate pages partly transmitted w/out error (HTTP/1.0 We Loves You) | 07:24 |
lifeless | - truncate pages with content-length with error, signalled by a socket shutdown only (again, 1.0 we loves you) | 07:25 |
lifeless | - cause random timeouts if enough packets on the same tcp link drop | 07:26 |
wgrant | Hmm. | 07:26 |
lifeless | particularly if the 'failure to connect' error has a time < 2 * the retry interval for TCP | 07:26 |
lifeless | which I don't remember offhand | 07:26 |
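
One way to see the difference between the two truncation modes described above is to read the body and compare it against Content-Length. A rough sketch using Python 2's httplib (the module in the IncompleteRead traceback quoted earlier); the commented-out invocation at the end is only an example:

```python
import httplib


def fetch(host, path):
    conn = httplib.HTTPSConnection(host)
    conn.request("GET", path)
    resp = conn.getresponse()
    declared = resp.getheader("content-length")
    try:
        body = resp.read()
    except httplib.IncompleteRead as e:
        # Content-Length was sent but the socket closed early: the
        # "truncate with error" case (IncompleteRead(15646 bytes read, ...)).
        return e.partial, declared, "incomplete"
    if declared is None:
        # HTTP/1.0-style response with no Content-Length: a premature socket
        # shutdown just looks like a short but "successful" body.
        return body, declared, "unverifiable"
    return body, declared, "ok" if len(body) == int(declared) else "short"


# Example (hypothetical invocation):
# print(fetch("bugs.edge.launchpad.net", "/launchpad-foundations/+filebug")[2])
```
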
lifeless | elmo: are thou perchance around? | 10:31 |
wgrant | lifeless: Is it just me, or is Launchpad in general *really really* slow at the moment? | 10:43 |
lifeless | yes | 10:44 |
lifeless | I think its a real issue | 10:44 |
wgrant | /people took 2.97s with 29 queries. | 10:44 |
wgrant | But took like 20s to make it to the browser. | 10:44 |
lifeless | yes | 10:46 |
lifeless | Like I said, I think its cross DC fuckage | 10:46 |
lifeless | will ring elmo soon | 10:46 |
elmo | you'll need a little more evidence than that | 10:46 |
lifeless | elmo: hi cool. | 10:48 |
elmo | replication lag is > 300s, so we're on wildcherry, but that's recent and hardly "cross DC fuckage" | 10:48 |
lifeless | elmo: uhm, ugh | 10:48 |
elmo | we're also down to two edge servers because edge rollout has failed twice | 10:48 |
lifeless | elmo: well, I was using an abbreviation for the discussion earlier. | 10:48 |
lifeless | elmo: here are the symptoms I'm aware of: | 10:48 |
lifeless | - apis and web requests are getting truncated responses | 10:48 |
lifeless | - things feel slow | 10:48 |
lifeless | - a fair number of requests get the 'could not connect to launchpad' error page (btw: what is the trigger for showing that) | 10:49 |
elmo | lifeless: so nothing in nagios jumps out as being obviously wrong | 10:51 |
elmo | if anything the system looks lightly loaded - despite being all on wildcherry, the load there is small - appservers are busy but not extraordinarily so | 10:52 |
lifeless | elmo: I looked at the hourly oops rates for edge and nothing was obviously bong there | 10:52 |
lifeless | elmo: does the load balancer show any sort of deep queuing going on perhaps? or apache? | 10:52 |
lifeless | elmo: the | 10:52 |
lifeless | 'could not connect to launchpad' page seems like a particularly relevant clue | 10:52 |
lifeless | elmo: what triggers it being shown ? | 10:53 |
elmo | I don't know | 10:53 |
elmo | there's no particular queues on haproxy | 10:53 |
elmo | and apache is fine | 10:53 |
lifeless | I *think* its shown when <the thing that shows it> gets no HTTP response header in 30 seconds from a server | 10:54 |
wgrant | Except it also shows sometimes after almost exactly 10s. | 10:54 |
lifeless | wgrant: odd | 10:54 |
wgrant | (this was in the +filebug stuff yesterday) | 10:54 |
elmo | both apache and haproxy have custom 503 pages | 10:54 |
elmo | and squid talks to haproxy | 10:54 |
wgrant | Is there a way to distinguish between the two? | 10:54 |
lifeless | is squid in the loop for authenticated requests? | 10:55 |
elmo | nope | 10:55 |
elmo | wgrant: they point at different files at least | 10:55 |
lifeless | the error I saw has no branding | 10:55 |
lifeless | just black on white text | 10:55 |
wgrant | offline-unplanned.html and offline-unplanned-haproxy.html? | 10:56 |
elmo | wgrant: right | 10:56 |
wgrant | The haproxy one has a comment. | 10:56 |
wgrant | So it should be easy to tell which is which. | 10:56 |
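
Since the two unplanned-outage pages differ (the haproxy one reportedly carries an HTML comment), a client-side check could grep the body of a 503 for a marker. A sketch; the marker string is a placeholder because the actual comment text isn't shown in the log:

```python
import urllib2

# Placeholder: the haproxy offline page reportedly contains an HTML comment;
# the exact text isn't known here.
HAPROXY_MARKER = "<!--"


def classify_503(url):
    try:
        urllib2.urlopen(url)
        return "no error"
    except urllib2.HTTPError as e:
        if e.code != 503:
            return "HTTP %d" % e.code
        body = e.read()
        return "haproxy 503" if HAPROXY_MARKER in body else "apache 503"


# Example (hypothetical):
# classify_503("https://bugs.edge.launchpad.net/launchpad-foundations/+filebug")
```
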
elmo | lifeless: like I say, AFAICS neither apache nor haproxy should do that for any of the main websites | 10:58 |
elmo | what was the URL you had failing? | 10:58 |
lifeless | https://bugs.edge.launchpad.net/launchpad-foundations/+filebug | 10:58 |
elmo | I'm also not seeing many 5*'s in the apache log | 10:59 |
wgrant | Er. | 10:59 |
wgrant | mtr shows some latency | 10:59 |
wgrant | Fluctuating. | 10:59 |
wgrant | In the DC. | 10:59 |
wgrant | 50-200ms | 10:59 |
elmo | wgrant: is fine from here (and I'm a couple of hundred miles away atm) | 11:00 |
lifeless | elmo: what are the two edge servers we have left? | 11:00 |
elmo | lifeless: potassium and palladium | 11:00 |
wgrant | elmo: It's settled down now, but for a while there was 200ms between chenet and nutmeg. | 11:00 |
wgrant | And the rest of the route was pretty stable. | 11:00 |
elmo | wgrant: that's just chenet deprioritizing pings | 11:01 |
wgrant | Bah. | 11:01 |
elmo | it's a busy firewall... | 11:01 |
lifeless | its happier now than it was 3 hours ago | 11:02 |
lifeless | elmo: to humour me, which two edge servers do we have still live? and which DC are they in? | 11:02 |
elmo | 11:00 < elmo> lifeless: potassium and palladium | 11:03 |
lifeless | thanks, it's late :) | 11:03 |
elmo | lifeless: they're in Goswell Road | 11:03 |
lifeless | is that the same one as apache / haproxy? | 11:03 |
elmo | no, different one | 11:04 |
lifeless | ok, I *know* I don't have enough data here, but my instincts are jumping on the inter dc link | 11:04 |
lifeless | What can we do to rule it out | 11:04 |
elmo | there's absolutely no evidence of a problem with the link | 11:05 |
elmo | our London stuff is relatively well spread out, if there were problems, more than just launchpad would be showing up and it would be in nagios | 11:05 |
elmo | the link is up, there's no problems with automated ping testing or manual testing, it's nowhere near capacity | 11:05 |
lifeless | was there anything ~ 3 hours back ? | 11:06 |
elmo | checking | 11:07 |
elmo | nothing shows in router logs or bandwidth graphs | 11:09 |
elmo | 3-4 hours ago is the sunday apache logrotate | 11:09 |
elmo | if you want to gut-blame something, I'd say that's a much better target | 11:09 |
lifeless | interesting | 11:10 |
lifeless | how long does that take? | 11:10 |
lifeless | or does it have a tendency to go skewiff | 11:11 |
elmo | the logrotate itself? not very long but it does an apache reload/restart which sometimes goes mental on busy webservers | 11:11 |
lifeless | ah | 11:13 |
lifeless | I got back a partial page on one of the bugs I just filed as test | 11:15 |
lifeless | it cuts off right on the target table | 11:15 |
lifeless | elmo: so I don't quite know where to go; there *was* a serious persistent problem, it's a lot better now but still a good fraction of requests go into lala land, don't oops, but take ages (like 15 seconds, but reported as At least 98 queries/external actions issued in 0.99 seconds | 11:24 |
lifeless | ) | 11:24 |
elmo | i don't know where to go either, and I really need to run - unless we have some actionable next steps, I'd like to defer this | 11:26 |
lifeless | I think we have to | 11:27 |
lifeless | defer it | 11:27 |
elmo | ok, sorry I haven't been of much help | 11:27 |
elmo | I'm still available on my mobile if things get worse again | 11:27 |
lifeless | thanks | 11:29 |
lifeless | elmo: https://xmlrpc.edge.launchpad.net/bazaar/: Unable to handle http code 503: Service Unavailable | 12:02 |
lifeless | elmo: I've seen this a couple of times, someone in #bzr is asking now. | 12:02 |
lifeless | wgrant: grep -r 'except:' . | 12:39 |
lifeless | wgrant: then weep. | 12:42 |
lifeless | night all | 12:45 |
wgrant | lib/lp/buildmaster/manager.py: except: | 12:49 |
wgrant | Eep. | 12:49 |
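
The `grep -r 'except:'` hits above are bare except clauses, which swallow everything including SystemExit and KeyboardInterrupt. A minimal sketch of the problem and the narrower alternative:

```python
import logging


def risky():
    raise KeyboardInterrupt  # stand-in for the user trying to stop the process


def bad():
    try:
        risky()
    except:            # bare except also catches KeyboardInterrupt/SystemExit,
        pass           # which is one reason a process becomes hard to kill


def better():
    try:
        risky()
    except Exception:  # lets KeyboardInterrupt and SystemExit propagate
        logging.exception("operation failed")
```
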
jelmer | 'morning lifeless, wgrant | 12:53 |
wgrant | (do we not have a zero-tolerance policy for this sort of thing?) | 12:56 |
wgrant | Morning jelmer. | 12:56 |
wgrant | Just got another 502 from edge. | 13:04 |
wgrant | Definitely from Apache. | 13:04 |
wgrant | Er. | 13:20 |
wgrant | I do believe I just got a truncated copy of the WADL. | 13:20 |
wgrant | Yes. | 13:20 |
wgrant | That is a little odd, since that's served fairly dumbly. | 13:20 |
StevenK | wgrant: Don't you have uni work to do on a Sunday night? :-P | 13:24 |
wgrant | Yes :( | 13:25 |
wgrant | But that doesn't stop my inbox from filling up with cron errors. | 13:25 |
* jelmer waves to StevenK | 13:34 | |
StevenK | jelmer: O hai! | 13:39 |
=== Ursinha is now known as Ursinha-afk | ||
lifeless | morning | 20:24 |
beuno | good morning lifeless | 20:25 |
lifeless | hi beuno | 20:50 |
=== Daviey_ is now known as Daviey | ||
mwhudson | morning | 22:25 |
lifeless | mwhudson: hu | 22:26 |
lifeless | mwhudson: hi | 22:26 |
lifeless | mwhudson: a) we have a small firefight going on, and b) I can has reviews? All ones you'll like, I swear. topic of launchpad-reviews | 22:26 |
mwhudson | lifeless: boo, and uh ok | 22:26 |
mwhudson | lifeless: what's the firefight? i saw identi.ca/launchpadstatus | 22:27 |
lifeless | url is in the internal channel | 22:28 |
mwhudson | k | 22:29 |
wgrant | Morning lifeless. | 22:43 |
wgrant | Morning mwhudson. | 22:43 |
wgrant | Any news? | 22:44 |
lifeless | wgrant: some | 22:44 |
lifeless | elmo has been helping, but hes on a m*f* train. | 22:44 |
lifeless | when he surfaces again, we'll continue | 22:44 |
lifeless | wgrant: the only current theory I have is that a failed shutdown of the appservers leaves opstats working but everything else bork bork borked | 22:46 |
wgrant | Data I have: Pretty consistently broken from 02:00Z yesterday to 08:00Z, possibly OK for 3-4 hours, broken from 13:00Z to 18:00Z, possibly OK for a couple of hours, then broken since. | 22:47 |
lifeless | there is a bug about the failed shutdowns, but we'll need one about this failure mode of opstats | 22:47 |
lifeless | we have 2 appservers which are out of rotation but the process is running | 22:47 |
thumper | morning | 22:51 |
lifeless | hi thumper | 22:51 |
lifeless | wgrant: this theory doesn't quite explain the partial pages | 22:55 |
lifeless | wgrant: but if the broken appservers sometimes work, but haproxy kills it all when it decides the servers bong, that would explain it. | 22:56 |
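
If that theory holds, a per-appserver probe comparing the lightweight status page against a real page would show it. A sketch only: the hostnames are placeholders, and it assumes the status endpoint lives at /+opstats as the discussion of opstats suggests:

```python
import urllib2

APPSERVERS = ["appserver1.example.internal:8080"]  # placeholder hostnames


def probe(host):
    def ok(path):
        try:
            urllib2.urlopen("http://%s%s" % (host, path), timeout=10)
            return True
        except Exception:
            return False
    # A box where the status page answers but a real page does not matches the
    # "opstats working but everything else borked" failure mode described above.
    return {"opstats": ok("/+opstats"), "real page": ok("/people")}


for host in APPSERVERS:
    print("%s %s" % (host, probe(host)))
```
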
wallyworld | morning | 22:57 |
wgrant | lifeless: Possibly. | 22:58 |
wgrant | lifeless: Also of note is the partial WADL that I got last night. So even really simple requests get truncated. | 22:58 |
lifeless | wgrant: that comes from the appservers still | 22:58 |
wgrant | It does. | 22:59 |
wgrant | But it should be really fast. | 22:59 |
lifeless | wgrant: and has been known to take minutes to generate. | 22:59 |
wgrant | Oh hmmmmmmm. | 22:59 |
lifeless | not everytime, of course. Just Some Times. | 22:59 |
wgrant | It's almost always truncated after ~15700 bytes. | 23:00 |
wgrant | It fluctuates, but that may just be header size differences. | 23:00 |
wgrant | Regardless of the file, it's within a kilobyte of that number. | 23:00 |
lifeless | elmo: ^ | 23:00 |
lifeless | that's interesting | 23:00 |
lifeless | have you tried a loop on the wadl? if so what url are you using? | 23:00 |
wgrant | I haven't. | 23:01 |
wgrant | It needs some Accept header... let's see. | 23:01 |
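
A simple loop for the WADL check being discussed might look like the sketch below. The URL is just the obvious candidate, and the Accept value is the standard WADL media type, which may not be exactly what the Launchpad appservers expect:

```python
import urllib2

URL = "https://api.edge.launchpad.net/beta/"  # assumed WADL location
ACCEPT = "application/vnd.sun.wadl+xml"       # standard WADL media type; the exact
                                              # value LP wants may differ

lengths = []
for _ in range(20):
    req = urllib2.Request(URL, headers={"Accept": ACCEPT})
    lengths.append(len(urllib2.urlopen(req).read()))

print("min=%d max=%d" % (min(lengths), max(lengths)))
# Consistent truncation around ~15700 bytes would show up as a suspiciously
# uniform, too-small minimum here.
```
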
thumper | wallyworld: morning | 23:18 |
wallyworld | thumper: looks like you had an interesting time on friday :-( | 23:19 |
thumper | wallyworld: FSVO interesting | 23:19 |
thumper | wallyworld: we should have a call | 23:19 |
wallyworld | thumper: ok, anytime | 23:19 |
wallyworld | thumper: https://code.edge.launchpad.net/~wallyworld/launchpad/link-bugs-in-merge-proposal/+merge/34826 | 23:36 |
thumper | wallyworld: http://yuiblog.com/assets/pdf/cheatsheets/css.pdf | 23:43 |