/srv/irclogs.ubuntu.com/2010/09/12/#launchpad-dev.txt

lifelesswgrant: since you're here04:40
lifelesswgrant: I don't hold with a strict ops/code split for bugs04:40
lifelesswgrant: it doesn't make sense unless the eyeballs thinking about stuff are also partitioned.04:40
lifelessAnd they aren't.04:40
lifelessAnd shouldn't be.04:40
lifelessI may well in future start agitating for LP ops stuff to be *on LP*, but that wouldn't make sense unless LP is a /lot/ more robust than it is now.04:41
lifelessalso, I wish 'str' were PEP804:42
wgrantlifeless: I guess.04:46
wgrantWhat's un-PEP8 about it?04:46
lifelessstartswith04:48
lifelessstarts_with <- pep804:48
wgrantAh, right.04:48
wgrantAlso, why isn't it 'string'? :(04:49
lifelesscharacters were expensive in the bad ol days04:49
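
The gripe above is that str's method names predate PEP 8's preferred lowercase_with_underscores style for methods. A tiny illustration; the snake_case spellings are hypothetical and do not exist on str:

    # What the stdlib actually provides (pre-PEP 8 naming):
    "launchpad".startswith("launch")    # True
    "launchpad".isupper()               # False

    # PEP 8-style names would read like this; these do NOT exist on str:
    # "launchpad".starts_with("launch")
    # "launchpad".is_upper()
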
=== MTeck is now known as MTecknology
lifelesssigh06:44
lifeless  File "/home/robertc/launchpad/lp-branches/working/lib/canonical/testing/layers.py", line 507, in setUp06:44
lifeless><06:44
lifeless    time.sleep(0.1)06:44
wgrant_I would really like to know why that happens.06:44
lifelesswhy folk put time.sleep calls in layer setups ?06:46
wgrant_Well, and why bin/test is so hard to kill.06:48
wgrant_Often resulting in that traceback, then a hang.06:48
lifelessI suspect the zope test runner reinvocation stuff is broken subtly06:50
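
The time.sleep(0.1) in that traceback is the usual layer-setup pattern: waiting for a subprocess or port to come up by sleeping in a loop. A rough sketch of what such code generally looks like; wait_for and the condition are illustrative names, not Launchpad APIs:

    import time

    def wait_for(condition, timeout=30.0, interval=0.1):
        """Poll condition() until it returns True or the timeout expires."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if condition():
                return
            time.sleep(interval)    # the sleep(0.1) seen at layers.py line 507
        raise RuntimeError("timed out waiting for condition")
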
lifelessgrah - fugly - <script>LP.client.cache['context'] = ...06:53
lifelessok, I hate chromium06:53
lifelessshow source should /show the source/ not re-request06:53
lifelessI just hit the filebug issue myself06:54
lifelesson launchpad-foundations06:54
beuno_lifeless, firefox does that as well, AFAIK06:54
lifelessw/no apport glue06:54
lifelessbeuno_: firefox used to DTRT06:54
lifelesshmm, something is really wrong, pages are on a go-slow06:54
beuno_well, it used to be fast as well  :)06:55
=== beuno_ is now known as beuno
lifelessI wonder if we have a request backlog or something causing high perceived time06:55
lifelesseffectively lowering the measured service time relative to what users actually wait06:55
wgrant_lifeless: I'm still getting those truncated responses during most runs.07:16
wgrant_So something is up.07:16
=== wgrant_ is now known as wgrant
wgrantNot something I've seen before, though :/07:16
lifelesswgrant: grah07:18
lifelesshmm07:18
lifeless7am07:18
lifelessI'm seeing trouble too, but nothing that I can obviously identify07:18
lifelessoops counts are slightly high, but not freakishly so except for a spike ~ 24 hours ago07:19
lifelesswgrant: I got the 'could not connect' error on +filebug07:19
lifelesswant to know the weird thing07:19
lifelessthe bug got filed07:19
wgrantlifeless: That's a different issue, then. Sounds like yours was post-redirect?07:20
lifelessyes07:21
lifelessnot that you could tell07:21
lifelesswgrant: I'm saying that I think it's the same issue07:21
wgrantHmm.07:22
lifelessthat we're seeing two things: OOPSes w/stuff, and something networkish which is borking responses and causing 'could not connect' errors07:22
wgrantDo you know how long it took?07:22
lifelessa while07:22
lifelesshere is an explanation for a truncated page:07:22
lifelessan HTTP/1.0 page with no Content-Length had its network socket disconnect07:22
lifelesswe have two datacentres07:23
wgranthttplib.IncompleteRead: IncompleteRead(15646 bytes read, 1272188 more expected)07:23
wgrantis what I'm getting, FWIW.07:23
lifelessthe front ends are in one place.07:23
lifelessthe appservers are in both.07:23
wgrantWait, LP is split over both?07:23
lifelessthe database is in the same one as the FE's, AIUI.07:23
wgrantI didn't realise it was in the second at all.07:23
lifelesswampee and the other are, AIUI07:23
lifelessanyhow07:24
lifelessmy working theory is connectivity issues between the DC's07:24
lifelessthis would:07:24
lifeless - increase SQL time (packet retransmits)07:24
lifeless - truncate pages partly transmitted w/out error (HTTP/1.0 We Loves You)07:24
lifeless - truncate pages with content-length with error, signalled by a socket shutdown only (again, 1.0 we loves you)07:25
lifeless - cause random timeouts if enough packets on the same tcp link drop07:26
wgrantHmm.07:26
lifelessparticularly if the 'failure to connect' error has a time < 2 * the retry interval for TCP07:26
lifelesswhich I don't remember offhand07:26
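
The httplib.IncompleteRead above is the detectable variant of this: the server advertised a Content-Length and the socket closed early. The HTTP/1.0-without-Content-Length case is the silent one, since end-of-body is signalled only by the socket closing. A rough sketch of what the detectable case looks like from a client; the host and path are just examples:

    import httplib

    conn = httplib.HTTPConnection("launchpad.example")  # example host only
    conn.request("GET", "/people")
    response = conn.getresponse()
    try:
        body = response.read()
    except httplib.IncompleteRead as e:
        # Content-Length promised more than arrived, as in
        # "IncompleteRead(15646 bytes read, 1272188 more expected)" above.
        body = e.partial
    # With HTTP/1.0 and no Content-Length there is no exception at all:
    # a truncated body is indistinguishable from a complete one.
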
lifelesselmo: art thou perchance around?10:31
wgrantlifeless: Is it just me, or is Launchpad in general *really really* slow at the moment?10:43
lifelessyes10:44
lifelessI think it's a real issue10:44
wgrant /people took 2.97s with 29 queries.10:44
wgrantBut took like 20s to make it to the browser.10:44
lifelessyes10:46
lifelessLike I said, I think it's cross-DC fuckage10:46
lifelesswill ring elmo soon10:46
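
The gap wgrant reports is the interesting number: roughly 3s of server-side render time against roughly 20s of wall clock. A quick, hedged way to capture the wall-clock side for comparison; the URL is only an example:

    import time
    import urllib2

    start = time.time()
    body = urllib2.urlopen("https://launchpad.net/people").read()
    print "wall clock: %.2fs for %d bytes" % (time.time() - start, len(body))
    # Compare against the "took N.NNs with M queries" figure the page itself
    # reports; a large gap points at queueing or the network, not rendering.
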
elmoyou'll need a little more evidence than that10:46
lifelesselmo: hi cool.10:48
elmoreplication lag is > 300s, so we're on wildcherry, but that's recent and hardly "cross DC fuckage"10:48
lifelesselmo: uhm, ugh10:48
elmowe're also down to two edge servers because edge rollout has failed twice10:48
lifelesselmo: well, I was using an abbreviation for the discussion earlier.10:48
lifelesselmo: here are the symptoms I'm aware of:10:48
lifeless - apis and web requests are getting truncated responses10:48
lifeless - things feel slow10:48
lifeless - a fair number of requests get the 'could not connect to launchpad' error page (btw: what is the trigger for showing that)10:49
elmolifeless: so nothing in nagios jumps out as being obviously wrong10:51
elmoif anything the system looks lightly loaded - despite being all on wildcherry, the load there is small - appservers are busy but not extraordinarily so10:52
lifelesselmo: I looked at the hourly oops rates for edge and nothing was obviously bong there10:52
lifelesselmo: does the load balancer show any sort of deep queuing going on perhaps? or apache?10:52
lifelesselmo: the10:52
lifeless'could not connect to launchpad' page seems like a particularly relevant clue10:52
lifelesselmo: what triggers it being shown ?10:53
elmoI don't know10:53
elmothere's no particular queues on haproxy10:53
elmoand apache is fine10:53
lifelessI *think* it's shown when <the thing that shows it> gets no HTTP response header in 30 seconds from a server10:54
wgrantExcept it also shows sometimes after almost exactly 10s.10:54
lifelesswgrant: odd10:54
wgrant(this was in the +filebug stuff yesterday)10:54
elmoboth apache and haproxy have custom 503 pages10:54
elmoand squid talks to haproxy10:54
wgrantIs there a way to distinguish between the two?10:54
lifelessis squid in the loop for authenticated requests?10:55
elmonope10:55
elmowgrant: they point at different files at least10:55
lifelessthe error I saw has no branding10:55
lifelessjust black on white text10:55
wgrantoffline-unplanned.html and offline-unplanned-haproxy.html?10:56
elmowgrant: right10:56
wgrantThe haproxy one has a comment.10:56
wgrantSo it should be easy to tell which is which.10:56
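
Given that the two custom 503 bodies are separate files and the haproxy one carries a comment, a rough client-side check could look like the sketch below; the marker string is a guess at whatever distinguishes the haproxy variant, not the actual file contents:

    import urllib2

    try:
        urllib2.urlopen("https://launchpad.net/").read()
    except urllib2.HTTPError as e:
        body = e.read()                 # HTTPError doubles as the response
        # Hypothetical marker; substitute whatever comment the haproxy
        # offline-unplanned page actually contains.
        if "haproxy" in body.lower():
            print "503 came from haproxy"
        else:
            print "503 came from apache"
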
elmolifeless: like I say, AFAICS neither apache nor haproxy should do that for any of the main websites10:58
elmowhat was the URL you had failing?10:58
lifelesshttps://bugs.edge.launchpad.net/launchpad-foundations/+filebug10:58
elmoI'm also not seeing many 5*'s in the apache log10:59
wgrantEr.10:59
wgrantmtr shows some latency10:59
wgrantFluctuating.10:59
wgrantIn the DC.10:59
wgrant50-200ms10:59
elmowgrant: is fine from here (and I'm a couple of hundred miles away atm)11:00
lifelesselmo: what are the two edge servers we have left?11:00
elmolifeless: potassium and palladium11:00
wgrantelmo: It's settled down now, but for a while there was 200ms between chenet and nutmeg.11:00
wgrantAnd the rest of the route was pretty stable.11:00
elmowgrant: that's just chenet deprioritizing pings11:01
wgrantBah.11:01
elmoit's a busy firewall...11:01
lifelessit's happier now than it was 3 hours ago11:02
lifelesselmo: to humour me, which two edge servers do we have still live? and which DC are they in?11:02
elmo11:00 < elmo> lifeless: potassium and palladium11:03
lifelessthanks, it's late :)11:03
elmolifeless: they're in Goswell Road11:03
lifelessis that the same one as apache / haproxy?11:03
elmono, different one11:04
lifelessok, I *know* I don't have enough data here, but my instincts are jumping on the inter dc link11:04
lifelessWhat can we do to rule it out11:04
elmothere's absolutely no evidence of a problem with the link11:05
elmoour London stuff is relatively well spread out, if there were problems, more than just launchpad would be showing up and it would be in nagios11:05
elmothe link is up, there's no problems with automated ping testing or manual testing, it's nowhere near capacity11:05
lifelesswas there anything ~ 3 hours back ?11:06
elmochecking11:07
elmonothing shows in router logs or bandwidth graphs11:09
elmo3-4 hours ago is the sunday apache logrotate11:09
elmoif you want to gut-blame something, I'd say that's a much better target11:09
lifelessinteresting11:10
lifelesshow long does that take?11:10
lifelessor does it have a tendency to go skewiff11:11
elmothe logrotate itself?  not very long but it does an apache reload/restart which sometimes goes mental on busy webservers11:11
lifelessah11:13
lifelessI got back a partial page on one of the bugs I just filed as test11:15
lifelessit cuts off right on the target table11:15
lifelesselmo: so I don't quite know where to go; there *was* a serious persistent problem, it's a lot better now but still a good fraction of requests go into lala land, don't oops, but take ages (like 15 seconds, but reported as    At least 98 queries/external actions issued in 0.99 seconds11:24
lifeless)11:24
elmoi don't know where to go either, and I really need to run - unless we have some actionable next steps, I'd like to defer this11:26
lifelessI think we have to11:27
lifelessdefer it11:27
elmook, sorry I haven't been of much help11:27
elmoI'm still available on my mobile if things get worse again11:27
lifelessthanks11:29
lifelesselmo: https://xmlrpc.edge.launchpad.net/bazaar/: Unable to handle http code 503: Service Unavailable12:02
lifelesselmo: I've seen this a couple of times, someone in #bzr is asking now.12:02
lifelesswgrant: grep -r 'except:' .12:39
lifelesswgrant: then weep.12:42
lifelessnight all12:45
wgrantlib/lp/buildmaster/manager.py:        except:12:49
wgrantEep.12:49
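
The trouble with a bare except: is that it also catches SystemExit and KeyboardInterrupt, which is one plausible contributor to processes being hard to kill. A minimal sketch with illustrative names; dispatch_build is a stand-in, not the real manager.py code:

    import logging

    logging.basicConfig()
    log = logging.getLogger("buildmaster")

    def dispatch_build():
        # Illustrative stand-in for whatever the real code is protecting.
        raise ValueError("boom")

    # Bad: also swallows SystemExit, KeyboardInterrupt and GeneratorExit,
    # so Ctrl-C and clean shutdown requests get eaten.
    try:
        dispatch_build()
    except:
        log.exception("dispatch failed")

    # Better: catch only real errors; interrupts still propagate.
    try:
        dispatch_build()
    except Exception:
        log.exception("dispatch failed")
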
jelmer'morning lifeless, wgrant12:53
wgrant(do we not have a zero-tolerance policy for this sort of thing?)12:56
wgrantMorning jelmer.12:56
wgrantJust got another 502 from edge.13:04
wgrantDefinitely from Apache.13:04
wgrantEr.13:20
wgrantI do believe I just got a truncated copy of the WADL.13:20
wgrantYes.13:20
wgrantThat is a little odd, since that's served fairly dumbly.13:20
StevenKwgrant: Don't you have uni work to do on a Sunday night? :-P13:24
wgrantYes :(13:25
wgrantBut that doesn't stop my inbox from filling up with cron errors.13:25
* jelmer waves to StevenK13:34
StevenKjelmer: O hai!13:39
=== Ursinha is now known as Ursinha-afk
lifelessmorning20:24
beunogood morning lifeless20:25
lifelesshi beuno20:50
=== Daviey_ is now known as Daviey
mwhudsonmorning22:25
lifelessmwhudson: hu22:26
lifelessmwhudson: hi22:26
lifelessmwhudson: a) we have a small firefight going on, and b) I can has reviews? All ones you'll like, I swear. topic of launchpad-reviews22:26
mwhudsonlifeless: boo, and uh ok22:26
mwhudsonlifeless: what's the firefight?  i saw identi.ca/launchpadstatus22:27
lifelessurl is in the internal channel22:28
mwhudsonk22:29
wgrantMorning lifeless.22:43
wgrantMorning mwhudson.22:43
wgrantAny news?22:44
lifelesswgrant: some22:44
lifelesselmo has been helping, but he's on a m*f* train.22:44
lifelesswhen he surfaces again, we'll continue22:44
lifelesswgrant: the only current theory I have is that a failed shutdown of the appservers leaves opstats working but everything else bork bork borked22:46
wgrantData I have: Pretty consistently broken from 02:00Z yesterday to 08:00Z, possibly OK for 3-4 hours, broken from 13:00Z to 18:00Z, possibly OK for a couple of hours, then broken since.22:47
lifelessthere is a bug about the failed shutdowns, but we'll need one about this failure mode of opstats22:47
lifelesswe have 2 appservers which are out of rotation but the process is running22:47
thumpermorning22:51
lifelesshi thumper22:51
lifelesswgrant: this theory doesn't quite explain the partial pages22:55
lifelesswgrant: but if the broken appservers sometimes work, and haproxy kills it all when it decides the servers are bong, that would explain it.22:56
wallyworldmorning22:57
wgrantlifeless: Possibly.22:58
wgrantlifeless: Also of note is the partial WADL that I got last night. So even really simple requests get truncated.22:58
lifelesswgrant: that comes from the appservers still22:58
wgrantIt does.22:59
wgrantBut it should be really fast.22:59
lifelesswgrant: and has been known to take minutes to generate.22:59
wgrantOh hmmmmmmm.22:59
lifelessnot every time, of course. Just Some Times.22:59
wgrantIt's almost always truncated after ~15700 bytes.23:00
wgrantIt fluctuates, but that may just be header size differences.23:00
wgrantRegardless of the file, it's within a kilobyte of that number.23:00
lifelesselmo: ^23:00
lifelessthats interesting23:00
lifelesshave you tried a loop on the wadl? if so what url are you using?23:00
wgrantI haven't.23:01
wgrantIt needs some Accept header... let's see.23:01
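
A sketch of the loop being discussed: fetch the WADL repeatedly and record the body size, to see whether failures cluster around the ~15700-byte mark. Both the URL and the Accept value below are assumptions (wgrant was still checking the header), not confirmed values:

    import time
    import urllib2

    URL = "https://api.edge.launchpad.net/1.0/"      # assumed WADL endpoint
    ACCEPT = "application/vd.sun.wadl+xml"           # assumed media type

    for i in range(20):
        request = urllib2.Request(URL, headers={"Accept": ACCEPT})
        body = urllib2.urlopen(request).read()
        print "%2d: %d bytes" % (i, len(body))       # truncations clustered ~15700
        time.sleep(1)
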
thumperwallyworld: morning23:18
wallyworldthumper: looks like you had an interesting time on friday :-(23:19
thumperwallyworld: FSVO interesting23:19
thumperwallyworld: we should have a call23:19
wallyworldthumper: ok, anytime23:19
wallyworldthumper: https://code.edge.launchpad.net/~wallyworld/launchpad/link-bugs-in-merge-proposal/+merge/3482623:36
thumperwallyworld: http://yuiblog.com/assets/pdf/cheatsheets/css.pdf23:43
