[04:40] <lifeless> wgrant: since you're here
[04:40] <lifeless> wgrant: I don't hold with a strict ops/code split for bugs
[04:40] <lifeless> wgrant: it doesn't make sense unless the eyeballs thinking about stuff are also partitioned.
[04:40] <lifeless> And they aren't.
[04:40] <lifeless> And shouldn't be.
[04:41] <lifeless> I may well in future start agitating for LP ops stuff to be *on LP*, but that wouldn't make sense unless LP is a /lot/ more robust than it is now.
[04:42] <lifeless> also, I wish 'str' was PEP8
[04:46] <wgrant> lifeless: I guess.
[04:46] <wgrant> What's un-PEP8 about it?
[04:48] <lifeless> startswith
[04:48] <lifeless> starts_with <- pep8
[04:46] <wgrant> Ah, right.
[04:49] <wgrant> Also, why isn't it 'string'? :(
[04:49] <lifeless> characters were expensive in the bad ol days
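For context: the built-in str methods predate PEP 8, which is why they use run-together lowercase names. A minimal sketch of the contrast (the starts_with wrapper is hypothetical, purely to show the PEP 8 spelling):

    # Built-in str methods keep their pre-PEP 8 names:
    "launchpad".startswith("launch")  # True

    # New code following PEP 8 would spell it snake_case:
    def starts_with(s, prefix):
        return s.startswith(prefix)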
[06:44] <lifeless> sigh
[06:44] <lifeless>   File "/home/robertc/launchpad/lp-branches/working/lib/canonical/testing/layers.py", line 507, in setUp
[06:44] <lifeless> ><
[06:44] <lifeless>     time.sleep(0.1)
[06:44] <wgrant_> I would really like to know why that happens.
[06:46] <lifeless> why folk put time.sleep calls in layer setups?
[06:48] <wgrant_> Well, and why bin/test is so hard to kill.
[06:48] <wgrant_> Often resulting in that traceback, then a hang.
[06:50] <lifeless> I suspect the zope test running reinvocation stuff is broken subtly
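The sleep in that traceback is usually the body of a readiness-polling loop in a layer setUp. A minimal sketch of the pattern, with a hypothetical is_ready callable standing in for whatever the layer actually waits on:

    import time

    def wait_until_ready(is_ready, timeout=30.0, interval=0.1):
        """Poll is_ready() until it returns True or the timeout expires."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if is_ready():
                return
            time.sleep(interval)  # the line the interrupt traceback ends on
        raise RuntimeError("service did not come up within %s seconds" % timeout)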
[06:53] <lifeless> grah - fugly - <script>LP.client.cache['context'] = ...
[06:53] <lifeless> ok, I hate chromium
[06:53] <lifeless> show source should /show the source/ not re-request
[06:54] <lifeless> I just hit the filebug issue myself
[06:54] <lifeless> on launchpad-foundations
[06:54] <beuno_> lifeless, firefox does that as well, AFAIK
[06:54] <lifeless> w/no apport glue
[06:54] <lifeless> beuno_: firefox used to DTRT
[06:54] <lifeless> hmm, something is really wrong, pages are on a go-slow
[06:55] <beuno_> well, it used to be fast as well  :)
[06:55] <lifeless> I wonder if we have a request backlog or something causing high perceived time
[06:55] <lifeless> effectively lowering the service rate
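A backlog does inflate perceived latency even when per-request work is unchanged; a toy illustration with made-up rates:

    # If requests arrive faster than the appservers drain them,
    # the queue (and hence perceived latency) grows over time:
    arrival_rate = 50.0  # requests/second, assumed
    drain_rate = 40.0    # requests/second, assumed

    def backlog_after(seconds):
        return max(0.0, (arrival_rate - drain_rate) * seconds)

    print(backlog_after(60))  # 600.0 queued requests after one minute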
[07:16] <wgrant_> lifeless: I'm still getting those truncated responses during most runs.
[07:16] <wgrant_> So something is up.
[07:16] <wgrant> Not something I've seen before, though :/
[07:18] <lifeless> wgrant: grah
[07:18] <lifeless> hmm
[07:18] <lifeless> 7am
[07:18] <lifeless> I'm seeing trouble too, but nothing that I can obviously identify
[07:19] <lifeless> oops counts are slightly high, but not freakishly so except for a spike ~ 24 hours ago
[07:19] <lifeless> wgrant: I got the 'could not connect' error on +filebug
[07:19] <lifeless> want to know the weird thing
[07:19] <lifeless> the bug got filed
[07:20] <wgrant> lifeless: That's a different issue, then. Sounds like yours was post-redirect?
[07:21] <lifeless> yes
[07:21] <lifeless> not that you could tell
[07:21] <lifeless> wgrant: I'm saying that I think it's the same issue
[07:22] <wgrant> Hmm.
[07:22] <lifeless> that we're seeing two things: OOPSes w/stuff, and something networkish which is borking responses and causing 'could not connect' errors
[07:22] <wgrant> Do you know how long it took?
[07:22] <lifeless> a while
[07:22] <lifeless> here is an explanation for a truncated page:
[07:22] <lifeless> an HTTP/1.0 page with no Content-Length had its network socket disconnect
[07:23] <lifeless> we have two datacentres
[07:23] <wgrant> httplib.IncompleteRead: IncompleteRead(15646 bytes read, 1272188 more expected)
[07:23] <wgrant> is what I'm getting, FWIW.
[07:23] <lifeless> the front ends are in one place.
[07:23] <lifeless> the appservers are in both.
[07:23] <wgrant> Wait, LP is split over both?
[07:23] <lifeless> the database is in the same one as the FE's, AIUI.
[07:23] <wgrant> I didn't realise it was in the second at all.
[07:23] <lifeless> wampee and the other are, AIUI
[07:24] <lifeless> anyhow
[07:24] <lifeless> my working theory is connectivity issues between the DC's
[07:24] <lifeless> this would:
[07:24] <lifeless>  - increase SQL time (packet retransmits)
[07:24] <lifeless>  - truncate pages partly transmitted w/out error (HTTP/1.0 We Loves You)
[07:25] <lifeless>  - truncate pages with content-length with error, signalled by a socket shutdown only (again, 1.0 we loves you)
[07:26] <lifeless>  - cause random timeouts if enough packets on the same tcp link drop
[07:26] <wgrant> Hmm.
[07:26] <lifeless> particularly if the 'failure to connect' error has a time < 2 * the retry interval for TCP
[07:26] <lifeless> which I don't remember offhand
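The IncompleteRead wgrant pasted above fits this picture: with a Content-Length the client can at least detect the short body, while HTTP/1.0 with no Content-Length makes a mid-body disconnect look like a normal end of response. A sketch in Python 2's httplib (host and path are placeholders):

    import httplib

    def fetch(host, path):
        conn = httplib.HTTPConnection(host)
        conn.request("GET", path)
        resp = conn.getresponse()
        # With HTTP/1.0 and no Content-Length, a mid-body disconnect is
        # indistinguishable from end-of-body: read() just returns less data.
        try:
            return resp.read()
        except httplib.IncompleteRead as e:
            # Raised only when Content-Length promised more bytes than
            # arrived, so this flavour of truncation is detectable.
            print("truncated: %d bytes read, %d more expected"
                  % (len(e.partial), e.expected))
            return e.partial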
[10:31] <lifeless> elmo: art thou perchance around?
[10:43] <wgrant> lifeless: Is it just me, or is Launchpad in general *really really* slow at the moment?
[10:44] <lifeless> yes
[10:44] <lifeless> I think it's a real issue
[10:44] <wgrant>  /people took 2.97s with 29 queries.
[10:44] <wgrant> But took like 20s to make it to the browser.
[10:46] <lifeless> yes
[10:46] <lifeless> Like I said, I think it's cross DC fuckage
[10:46] <lifeless> will ring elmo soon
[10:46] <elmo> you'll need a little more evidence than that
[10:48] <lifeless> elmo: hi cool.
[10:48] <elmo> replication lag is > 300s, so we're on wildcherry, but that's recent and hardly "cross DC fuckage"
[10:48] <lifeless> elmo: uhm, ugh
[10:48] <elmo> we're also down to two edge servers because edge rollout has failed twice
[10:48] <lifeless> elmo: well, I was using an abbreviation for the discussion earlier.
[10:48] <lifeless> elmo: here are the symptoms I'm aware of:
[10:48] <lifeless>  - apis and web requests are getting truncated responses
[10:48] <lifeless>  - things feel slow
[10:49] <lifeless>  - a fair number of requests get the 'could not connect to launchpad' error page (btw: what is the trigger for showing that)
[10:51] <elmo> lifeless: so nothing in nagios jumps out as being obviously wrong
[10:52] <elmo> if anything the system looks lightly loaded - despite being all on wildcherry, the load there is small - appservers are busy but not extraordinarily so
[10:52] <lifeless> elmo: I looked at the hourly oops rates for edge and nothing was obviously bong there
[10:52] <lifeless> elmo: does the load balancer show any sort of deep queuing going on perhaps? or apache?
[10:52] <lifeless> elmo: the 'could not connect to launchpad' page seems like a particularly relevant clue
[10:53] <lifeless> elmo: what triggers it being shown?
[10:53] <elmo> I don't know
[10:53] <elmo> there's no particular queues on haproxy
[10:53] <elmo> and apache is fine
[10:54] <lifeless> I *think* it's shown when <the thing that shows it> gets no HTTP response header in 30 seconds from a server
[10:54] <wgrant> Except it also shows sometimes after almost exactly 10s.
[10:54] <lifeless> wgrant: odd
[10:54] <wgrant> (this was in the +filebug stuff yesterday)
[10:54] <elmo> both apache and haproxy have custom 503 pages
[10:54] <elmo> and squid talks to haproxy
[10:54] <wgrant> Is there a way to distinguish between the two?
[10:55] <lifeless> is squid in the loop for authenticated requests?
[10:55] <elmo> nope
[10:55] <elmo> wgrant: they point at different files at least
[10:55] <lifeless> the error I saw has no branding
[10:55] <lifeless> just black on white text
[10:56] <wgrant> offline-unplanned.html and offline-unplanned-haproxy.html?
[10:56] <elmo> wgrant: right
[10:56] <wgrant> The haproxy one has a comment.
[10:56] <wgrant> So it should be easy to tell which is which.
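Given that, a quick way to attribute a 503 (the "<!--" marker is an assumption based on the comment mentioned above; the haproxy page just needs any distinctive string):

    import urllib2

    def which_503(url):
        try:
            urllib2.urlopen(url)
            return "no error page served"
        except urllib2.HTTPError as e:
            body = e.read()
            # Assumed marker: the haproxy variant reportedly carries an
            # HTML comment that the apache one lacks.
            if "<!--" in body:
                return "haproxy (offline-unplanned-haproxy.html)"
            return "apache (offline-unplanned.html)"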
[10:58] <elmo> lifeless: like I say, AFAICS neither apache nor haproxy should do that for any of the main websites
[10:58] <elmo> what was the URL you had failing?
[10:58] <lifeless> https://bugs.edge.launchpad.net/launchpad-foundations/+filebug
[10:59] <elmo> I'm also not seeing many 5*'s in the apache log
[10:59] <wgrant> Er.
[10:59] <wgrant> mtr shows some latency
[10:59] <wgrant> Fluctuating.
[10:59] <wgrant> In the DC.
[10:59] <wgrant> 50-200ms
[11:00] <elmo> wgrant: is fine from here (and I'm a couple of hundred miles away atm)
[11:00] <lifeless> elmo: what are the two edge servers we have left?
[11:00] <elmo> lifeless: potassium and palladium
[11:00] <wgrant> elmo: It's settled down now, but for a while there was 200ms between chenet and nutmeg.
[11:00] <wgrant> And the rest of the route was pretty stable.
[11:01] <elmo> wgrant: that's just chenet deprioritizing pings
[11:01] <wgrant> Bah.
[11:01] <elmo> it's a busy firewall...
[11:02] <lifeless> it's happier now than it was 3 hours ago
[11:02] <lifeless> elmo: to humour me, which two edge servers do we have still live? and which DC are they in?
[11:03] <elmo> 11:00 < elmo> lifeless: potassium and palladium
[11:03] <lifeless> thanks, it's late :)
[11:03] <elmo> lifeless: they're in Goswell Road
[11:03] <lifeless> is that the same one as apache / haproxy?
[11:04] <elmo> no, different one
[11:04] <lifeless> ok, I *know* I don't have enough data here, but my instincts are jumping on the inter dc link
[11:04] <lifeless> What can we do to rule it out
[11:05] <elmo> there's absolutely no evidence of a problem with the link
[11:05] <elmo> our London stuff is relatively well spread out, if there were problems, more than just launchpad would be showing up and it would be in nagios
[11:05] <elmo> the link is up, there's no problems with automated ping testing or manual testing, it's nowhere near capacity
[11:06] <lifeless> was there anything ~ 3 hours back ?
[11:07] <elmo> checking
[11:09] <elmo> nothing shows in router logs or bandwidth graphs
[11:09] <elmo> 3-4 hours ago is the Sunday apache logrotate
[11:09] <elmo> if you want to gut-blame something, I'd say that's a much better target
[11:10] <lifeless> interesting
[11:10] <lifeless> how long does that take?
[11:11] <lifeless> or does it have a tendency to go skew-whiff
[11:11] <elmo> the logrotate itself?  not very long but it does an apache reload/restart which sometimes goes mental on busy webservers
[11:13] <lifeless> ah
[11:15] <lifeless> I got back a partial page on one of the bugs I just filed as test
[11:15] <lifeless> it cuts off right on the target table
[11:24] <lifeless> elmo: so I don't quite know where to go; there *was* a serious persistent problem, it's a lot better now but still a good fraction of requests go into lala land, don't oops, but take ages (like 15 seconds, but reported as "At least 98 queries/external actions issued in 0.99 seconds")
[11:26] <elmo> I don't know where to go either, and I really need to run - unless we have some actionable next steps, I'd like to defer this
[11:27] <lifeless> I think we have to
[11:27] <lifeless> defer it
[11:27] <elmo> ok, sorry I haven't been of much help
[11:27] <elmo> I'm still available on my mobile if things get worse again
[11:29] <lifeless> thanks
[12:02] <lifeless> elmo: https://xmlrpc.edge.launchpad.net/bazaar/: Unable to handle http code 503: Service Unavailable
[12:02] <lifeless> elmo: I've seen this a couple of times, someone in #bzr is asking now.
[12:39] <lifeless> wgrant: grep -r 'except:' .
[12:42] <lifeless> wgrant: then weep.
[12:45] <lifeless> night all
[12:49] <wgrant> lib/lp/buildmaster/manager.py:        except:
[12:49] <wgrant> Eep.
[12:53] <jelmer> 'morning lifeless, wgrant
[12:56] <wgrant> (do we not have a zero-tolerance policy for this sort of thing?)
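Why that grep makes you weep: a bare except swallows KeyboardInterrupt and SystemExit along with real bugs, which is also one way processes become hard to kill. A sketch of the difference (dispatch_build is a hypothetical stand-in, not the actual buildmaster code):

    import logging

    logging.basicConfig()
    log = logging.getLogger("buildmaster")

    def dispatch_build():
        raise ValueError("boom")  # hypothetical failure

    try:
        dispatch_build()
    except:            # catches *everything*, including KeyboardInterrupt,
        pass           # and hides the real error entirely

    try:
        dispatch_build()
    except Exception:  # lets KeyboardInterrupt/SystemExit propagate
        log.exception("build dispatch failed")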
[12:56] <wgrant> Morning jelmer.
[13:04] <wgrant> Just got another 502 from edge.
[13:04] <wgrant> Definitely from Apache.
[13:20] <wgrant> Er.
[13:20] <wgrant> I do believe I just got a truncated copy of the WADL.
[13:20] <wgrant> Yes.
[13:20] <wgrant> That is a little odd, since that's served fairly dumbly.
[13:24] <StevenK> wgrant: Don't you have uni work to do on a Sunday night? :-P
[13:25] <wgrant> Yes :(
[13:25] <wgrant> But that doesn't stop my inbox from filling up with cron errors.
[13:34]  * jelmer waves to StevenK
[13:39] <StevenK> jelmer: O hai!
[20:24] <lifeless> morning
[20:25] <beuno> good morning lifeless
[20:50] <lifeless> hi beuno
[22:25] <mwhudson> morning
[22:26] <lifeless> mwhudson: hi
[22:26] <lifeless> mwhudson: a) we have a small firefight going on, and b) I can has reviews? All ones you'll like, I swear. topic of launchpad-reviews
[22:26] <mwhudson> lifeless: boo, and uh ok
[22:27] <mwhudson> lifeless: what's the firefight? I saw identi.ca/launchpadstatus
[22:28] <lifeless> url is in the internal channel
[22:29] <mwhudson> k
[22:43] <wgrant> Morning lifeless.
[22:43] <wgrant> Morning mwhudson.
[22:44] <wgrant> Any news?
[22:44] <lifeless> wgrant: some
[22:44] <lifeless> elmo has been helping, but he's on a m*f* train.
[22:44] <lifeless> when he surfaces again, we'll continue
[22:46] <lifeless> wgrant: the only current theory I have is that a failed shutdown of the appservers leaves opstats working but everything else bork bork borked
[22:47] <wgrant> Data I have: Pretty consistently broken from 02:00Z yesterday to 08:00Z, possibly OK for 3-4 hours, broken from 13:00Z to 18:00Z, possibly OK for a couple of hours, then broken since.
[22:47] <lifeless> there is a bug about the failed shutdowns, but we'll need one about this failure mode of opstats
[22:47] <lifeless> we have 2 appservers which are out of rotation but the process is running
[22:51] <thumper> morning
[22:51] <lifeless> hi thumper
[22:55] <lifeless> wgrant: this theory doesn't quite explain the partial pages
[22:56] <lifeless> wgrant: but if the broken appservers sometimes work, but haproxy kills it all when it decides the servers are bong, that would explain it.
[22:57] <wallyworld> morning
[22:58] <wgrant> lifeless: Possibly.
[22:58] <wgrant> lifeless: Also of note is the partial WADL that I got last night. So even really simple requests get truncated.
[22:58] <lifeless> wgrant: that comes from the appservers still
[22:59] <wgrant> It does.
[22:59] <wgrant> But it should be really fast.
[22:59] <lifeless> wgrant: and has been known to take minutes to generate.
[22:59] <wgrant> Oh hmmmmmmm.
[22:59] <lifeless> not every time, of course. Just Some Times.
[23:00] <wgrant> It's almost always truncated after ~15700 bytes.
[23:00] <wgrant> It fluctuates, but that may just be header size differences.
[23:00] <wgrant> Regardless of the file, it's within a kilobyte of that number.
[23:00] <lifeless> elmo: ^
[23:00] <lifeless> thats interesting
[23:00] <lifeless> have you tried a loop on the wadl? if so what url are you using?
[23:01] <wgrant> I haven't.
[23:01] <wgrant> It needs some Accept header... let's see.
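A sketch of such a loop (the endpoint is assumed to be the edge API root, and the exact WADL Accept value should be double-checked — something like application/vnd.sun.wadl+xml):

    import urllib2

    URL = "https://api.edge.launchpad.net/1.0/"  # assumed endpoint
    ACCEPT = "application/vnd.sun.wadl+xml"      # assumed media type

    for i in range(20):
        req = urllib2.Request(URL, headers={"Accept": ACCEPT})
        try:
            body = urllib2.urlopen(req).read()
            print("%2d: %d bytes" % (i, len(body)))
        except Exception as e:
            print("%2d: failed: %s" % (i, e))

Comparing the byte counts across iterations would show whether truncation really clusters around the ~15700-byte mark.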
[23:18] <thumper> wallyworld: morning
[23:19] <wallyworld> thumper: looks like you had an interesting time on friday :-(
[23:19] <thumper> wallyworld: FSVO interesting
[23:19] <thumper> wallyworld: we should have a call
[23:19] <wallyworld> thumper: ok, anytime
[23:36] <wallyworld> thumper: https://code.edge.launchpad.net/~wallyworld/launchpad/link-bugs-in-merge-proposal/+merge/34826
[23:43] <thumper> wallyworld: http://yuiblog.com/assets/pdf/cheatsheets/css.pdf