Nick | Message | Time |
---|---|---|
lifeless | wgrant: since you're here | 04:40 |
lifeless | wgrant: I don't hold with a strict ops/code split for bugs | 04:40 |
lifeless | wgrant: it doesn't make sense unless the eyeballs thinking about stuff are also partitioned. | 04:40 |
lifeless | And they aren't. | 04:40 |
lifeless | And shouldn't be. | 04:40 |
lifeless | I may well in future start agitating for LP ops stuff to be *on LP*, but that wouldn't make sense unless LP is a /lot/ more robust than it is now. | 04:41 |
lifeless | also, I wish 'str' was PEP8 | 04:42 |
wgrant | lifeless: I guess. | 04:46 |
wgrant | What's un-PEP8 about it? | 04:46 |
lifeless | startswith | 04:48 |
lifeless | starts_with <- pep8 | 04:48 |
wgrant | Ah, right. | 04:48 |
wgrant | Also, why isn't it 'string'? :( | 04:49 |
lifeless | characters were expensive in the bad ol days | 04:49 |
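
The gripe here is that `str` method names such as `startswith` predate PEP 8's lowercase_with_underscores convention for functions and methods. A minimal illustration of the two spellings; the `starts_with` wrapper is purely hypothetical and not part of the stdlib:

```python
text = "launchpad-foundations"

# The actual str API: no underscore, dating from before PEP 8.
print(text.startswith("launchpad"))  # True


def starts_with(s, prefix):
    """Hypothetical PEP 8-style spelling, shown only for contrast."""
    return s.startswith(prefix)


print(starts_with(text, "launchpad"))  # True
```
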
=== MTeck is now known as MTecknology | ||
lifeless | sigh | 06:44 |
lifeless | File "/home/robertc/launchpad/lp-branches/working/lib/canonical/testing/layers.py", line 507, in setUp | 06:44 |
lifeless | >< | 06:44 |
lifeless | time.sleep(0.1) | 06:44 |
wgrant_ | I would really like to know why that happens. | 06:44 |
lifeless | why folk put time.sleep calls in layer setups ? | 06:46 |
wgrant_ | Well, and why bin/test is so hard to kill. | 06:48 |
wgrant_ | Often resulting in that traceback, then a hang. | 06:48 |
lifeless | I suspect the zope test running reinvocation stuff is broken subtly | 06:50 |
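
The traceback above points at a polling loop inside a test layer's setUp. A sketch of that general pattern, not the real canonical/testing/layers.py code, showing why a Ctrl-C tends to land in `time.sleep(0.1)`:

```python
import time


class ExampleLayer:
    """Illustrative test layer that waits for an external service to come up."""

    @classmethod
    def setUp(cls, timeout=30.0):
        deadline = time.time() + timeout
        while not cls._service_is_ready():  # hypothetical readiness check
            if time.time() > deadline:
                raise RuntimeError("service did not start in time")
            # An interrupt during setUp usually arrives while the process is
            # blocked here, which is why tracebacks end at time.sleep(0.1).
            time.sleep(0.1)

    @classmethod
    def _service_is_ready(cls):
        return False  # placeholder; a real layer would probe a port or pidfile
```
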
lifeless | grah - fugly - <script>LP.client.cache['context'] = ... | 06:53 |
lifeless | ok, I hate chromium | 06:53 |
lifeless | show source should /show the source/ not re-request | 06:53 |
lifeless | I just hit the filebug issue myself | 06:54 |
lifeless | on launchpad-foundations | 06:54 |
beuno_ | lifeless, firefox does that as well, AFAIK | 06:54 |
lifeless | w/no apport glue | 06:54 |
lifeless | beuno_: firefox used to DTRT | 06:54 |
lifeless | hmm, something is really wrong, pages are on a go-slow | 06:54 |
beuno_ | well, it used to be fast as well :) | 06:55 |
=== beuno_ is now known as beuno | ||
lifeless | I wonder if we have a request backlog or something causing high perceived time | 06:55 |
lifeless | effectively lowering the service time | 06:55 |
wgrant_ | lifeless: I'm still getting those truncated responses during most runs. | 07:16 |
wgrant_ | So something is up. | 07:16 |
=== wgrant_ is now known as wgrant | ||
wgrant | Not something I've seen before, though :/ | 07:16 |
lifeless | wgrant: grah | 07:18 |
lifeless | hmm | 07:18 |
lifeless | 7am | 07:18 |
lifeless | I'm seeing trouble too, but nothing that I can obviously identify | 07:18 |
lifeless | oops counts are slightly high, but not freakishly so except for a spike ~ 24 hours ago | 07:19 |
lifeless | wgrant: I got the connecting error on +filebug | 07:19 |
lifeless | want to know the weird thing | 07:19 |
lifeless | the bug got filed | 07:19 |
wgrant | lifeless: That's a different issue, then. Sounds like yours was post-redirect? | 07:20 |
lifeless | yes | 07:21 |
lifeless | not that you could tell | 07:21 |
lifeless | wgrant: I'm saying that I think it's the same issue | 07:21 |
wgrant | Hmm. | 07:22 |
lifeless | that we're seeing two things : OOPSes w/stuff, and something networkish which is borking responses and causing 'could not connect' errors | 07:22 |
wgrant | Do you know how long it took? | 07:22 |
lifeless | a while | 07:22 |
lifeless | here is an explanation for a truncated page: | 07:22 |
lifeless | an HTTP/1.0 page with no Content-Length had its network socket disconnect | 07:22 |
lifeless | we have two datacentres | 07:23 |
wgrant | httplib.IncompleteRead: IncompleteRead(15646 bytes read, 1272188 more expected) | 07:23 |
wgrant | is what I'm getting, FWIW. | 07:23 |
lifeless | the front ends are in one place. | 07:23 |
lifeless | the appservers are in both. | 07:23 |
wgrant | Wait, LP is split over both? | 07:23 |
lifeless | the database is in the same one as the FE's, AIUI. | 07:23 |
wgrant | I didn't realise it was in the second at all. | 07:23 |
lifeless | wampee and the other are, AIUI | 07:23 |
lifeless | anyhow | 07:24 |
lifeless | my working theory is connectivity issues between the DC's | 07:24 |
lifeless | this would: | 07:24 |
lifeless | - increase SQL time (packet retransmits) | 07:24 |
lifeless | - truncate pages partly transmitted w/out error (HTTP/1.0 We Loves You) | 07:24 |
lifeless | - truncate pages with content-length with error, signalled by a socket shutdown only (again, 1.0 we loves you) | 07:25 |
lifeless | - cause random timeouts if enough packets on the same tcp link drop | 07:26 |
wgrant | Hmm. | 07:26 |
lifeless | particularly if the 'failure to connect' error has a time < 2 * the retry interval for TCP | 07:26 |
lifeless | which I don't remember offhand | 07:26 |
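
One way to see the difference between the two truncation modes described above is to read the body and compare it against Content-Length. A rough sketch using Python 2's httplib (the module in the IncompleteRead traceback quoted earlier); the commented-out invocation at the end is only an example:

```python
import httplib


def fetch(host, path):
    conn = httplib.HTTPSConnection(host)
    conn.request("GET", path)
    resp = conn.getresponse()
    declared = resp.getheader("content-length")
    try:
        body = resp.read()
    except httplib.IncompleteRead as e:
        # Content-Length was sent but the socket closed early: the
        # "truncate with error" case (IncompleteRead(15646 bytes read, ...)).
        return e.partial, declared, "incomplete"
    if declared is None:
        # HTTP/1.0-style response with no Content-Length: a premature socket
        # shutdown just looks like a short but "successful" body.
        return body, declared, "unverifiable"
    return body, declared, "ok" if len(body) == int(declared) else "short"


# Example (hypothetical invocation):
# print(fetch("bugs.edge.launchpad.net", "/launchpad-foundations/+filebug")[2])
```
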
lifeless | elmo: are thou perchance around? | 10:31 |
wgrant | lifeless: Is it just me, or is Launchpad in general *really really* slow at the moment? | 10:43 |
lifeless | yes | 10:44 |
lifeless | I think its a real issue | 10:44 |
wgrant | /people took 2.97s with 29 queries. | 10:44 |
wgrant | But took like 20s to make it to the browser. | 10:44 |
lifeless | yes | 10:46 |
lifeless | Like I said, I think its cross DC fuckage | 10:46 |
lifeless | will ring elmo soon | 10:46 |
elmo | you'll need a little more evidence than that | 10:46 |
lifeless | elmo: hi cool. | 10:48 |
elmo | replication lag is > 300s, so we're on wildcherry, but that's recent and hardly "cross DC fuckage" | 10:48 |
lifeless | elmo: uhm, ugh | 10:48 |
elmo | we're also down to two edge servers because edge rollout has failed twice | 10:48 |
lifeless | elmo: well, I was using an abbreviation for the discussion earlier. | 10:48 |
lifeless | elmo: here are the symptoms I'm aware of: | 10:48 |
lifeless | - apis and web requests are getting truncated responses | 10:48 |
lifeless | - things feel slow | 10:48 |
lifeless | - a fair number of requests get the 'could not connect to launchpad' error page (btw: what is the trigger for showing that) | 10:49 |
elmo | lifeless: so nothing in nagios jumps out as being obviously wrong | 10:51 |
elmo | if anything the system looks lightly loaded - despite being all on wildcherry, the load there is small - appservers are busy but not extraordinarily so | 10:52 |
lifeless | elmo: I looked at the hourly oops rates for edge and nothing was obviously bong there | 10:52 |
lifeless | elmo: does the load balancer show any sort of deep queuing going on perhaps? or apache? | 10:52 |
lifeless | elmo: the | 10:52 |
lifeless | 'could not connect to launchpad' page seems like a particularly relevant clue | 10:52 |
lifeless | elmo: what triggers it being shown ? | 10:53 |
elmo | I don't know | 10:53 |
elmo | there's no particular queues on haproxy | 10:53 |
elmo | and apache is fine | 10:53 |
lifeless | I *think* its shown when <the thing that shows it> gets no HTTP response header in 30 seconds from a server | 10:54 |
wgrant | Except it also shows sometimes after almost exactly 10s. | 10:54 |
lifeless | wgrant: odd | 10:54 |
wgrant | (this was in the +filebug stuff yesterday) | 10:54 |
elmo | both apache and haproxy have custom 503 pages | 10:54 |
elmo | and squid talks to haproxy | 10:54 |
wgrant | Is there a way to distinguish between the two? | 10:54 |
lifeless | is squid in the loop for authenticated requests? | 10:55 |
elmo | nope | 10:55 |
elmo | wgrant: they point at different files at least | 10:55 |
lifeless | the error I saw has no branding | 10:55 |
lifeless | just black on white text | 10:55 |
wgrant | offline-unplanned.html and offline-unplanned-haproxy.html? | 10:56 |
elmo | wgrant: right | 10:56 |
wgrant | The haproxy one has a comment. | 10:56 |
wgrant | So it should be easy to tell which is which. | 10:56 |
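
Since the two unplanned-outage pages differ (the haproxy one reportedly carries an HTML comment), a client-side check could grep the body of a 503 for a marker. A sketch; the marker string is a placeholder because the actual comment text isn't shown in the log:

```python
import urllib2

# Placeholder: the haproxy offline page reportedly contains an HTML comment;
# the exact text isn't known here.
HAPROXY_MARKER = "<!--"


def classify_503(url):
    try:
        urllib2.urlopen(url)
        return "no error"
    except urllib2.HTTPError as e:
        if e.code != 503:
            return "HTTP %d" % e.code
        body = e.read()
        return "haproxy 503" if HAPROXY_MARKER in body else "apache 503"


# Example (hypothetical):
# classify_503("https://bugs.edge.launchpad.net/launchpad-foundations/+filebug")
```
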
elmo | lifeless: like I say, AFAICS neither apache nor haproxy should do that for any of the main websites | 10:58 |
elmo | what was the URL you had failing? | 10:58 |
lifeless | https://bugs.edge.launchpad.net/launchpad-foundations/+filebug | 10:58 |
elmo | I'm also not seeing many 5*'s in the apache log | 10:59 |
wgrant | Er. | 10:59 |
wgrant | mtr shows some latency | 10:59 |
wgrant | Fluctuating. | 10:59 |
wgrant | In the DC. | 10:59 |
wgrant | 50-200ms | 10:59 |
elmo | wgrant: is fine from here (and I'm a couple of hundred miles away atm) | 11:00 |
lifeless | elmo: what are the two edge servers we have left? | 11:00 |
elmo | lifeless: potassium and palladium | 11:00 |
wgrant | elmo: It's settled down now, but for a while there was 200ms between chenet and nutmeg. | 11:00 |
wgrant | And the rest of the route was pretty stable. | 11:00 |
elmo | wgrant: that's just chenet deprioritizing pings | 11:01 |
wgrant | Bah. | 11:01 |
elmo | it's a busy firewall... | 11:01 |
lifeless | its happier now than it was 3 hours ago | 11:02 |
lifeless | elmo: to humour me, which two edge servers do we have still live? and which DC are they in? | 11:02 |
elmo | 11:00 < elmo> lifeless: potassium and palladium | 11:03 |
lifeless | thanks, it's late :) | 11:03 |
elmo | lifeless: they're in Goswell Road | 11:03 |
lifeless | is that the same one as apache / haproxy? | 11:03 |
elmo | no, different one | 11:04 |
lifeless | ok, I *know* I don't have enough data here, but my instincts are jumping on the inter dc link | 11:04 |
lifeless | What can we do to rule it out | 11:04 |
elmo | there's absolutely no evidence of a problem with the link | 11:05 |
elmo | our London stuff is relatively well spread out, if there were problems, more than just launchpad would be showing up and it would be in nagios | 11:05 |
elmo | the link is up, there's no problems with automated ping testing or manual testing, it's nowhere near capacity | 11:05 |
lifeless | was there anything ~ 3 hours back ? | 11:06 |
elmo | checking | 11:07 |
elmo | nothing shows in router logs or bandwidth graphs | 11:09 |
elmo | 3-4 hours ago is the sunday apache logrotate | 11:09 |
elmo | if you want to gut-blame something, I'd say that's a much better target | 11:09 |
lifeless | interesting | 11:10 |
lifeless | how long does that take? | 11:10 |
lifeless | or does it have a tendency to go skewiff | 11:11 |
elmo | the logrotate itself? not very long but it does an apache reload/restart which sometimes goes mental on busy webservers | 11:11 |
lifeless | ah | 11:13 |
lifeless | I got back a partial page on one of the bugs I just filed as test | 11:15 |
lifeless | it cuts off right on the target table | 11:15 |
lifeless | elmo: so I don't quite know where to go; there *was* a serious persistent problem, it's a lot better now but still a good fraction of requests go into lala land, don't oops, but take ages (like 15 seconds, but reported as At least 98 queries/external actions issued in 0.99 seconds | 11:24 |
lifeless | ) | 11:24 |
elmo | i don't know where to go either, and I really need to run - unless we have some actionable next steps, I'd like to defer this | 11:26 |
lifeless | I think we have to | 11:27 |
lifeless | defer it | 11:27 |
elmo | ok, sorry I haven't been of much help | 11:27 |
elmo | I'm still available on my mobile if things get worse again | 11:27 |
lifeless | thanks | 11:29 |
lifeless | elmo: https://xmlrpc.edge.launchpad.net/bazaar/: Unable to handle http code 503: Service Unavailable | 12:02 |
lifeless | elmo: I've seen this a couple of times, someone in #bzr is asking now. | 12:02 |
lifeless | wgrant: grep -r 'except:' . | 12:39 |
lifeless | wgrant: then weep. | 12:42 |
lifeless | night all | 12:45 |
wgrant | lib/lp/buildmaster/manager.py: except: | 12:49 |
wgrant | Eep. | 12:49 |
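
The `grep -r 'except:'` hits above are bare except clauses, which swallow everything including SystemExit and KeyboardInterrupt. A minimal sketch of the problem and the narrower alternative:

```python
import logging


def risky():
    raise KeyboardInterrupt  # stand-in for the user trying to stop the process


def bad():
    try:
        risky()
    except:            # bare except also catches KeyboardInterrupt/SystemExit,
        pass           # which is one reason a process becomes hard to kill


def better():
    try:
        risky()
    except Exception:  # lets KeyboardInterrupt and SystemExit propagate
        logging.exception("operation failed")
```
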
jelmer | 'morning lifeless, wgrant | 12:53 |
wgrant | (do we not have a zero-tolerance policy for this sort of thing?) | 12:56 |
wgrant | Morning jelmer. | 12:56 |
wgrant | Just got another 502 from edge. | 13:04 |
wgrant | Definitely from Apache. | 13:04 |
wgrant | Er. | 13:20 |
wgrant | I do believe I just got a truncated copy of the WADL. | 13:20 |
wgrant | Yes. | 13:20 |
wgrant | That is a little odd, since that's served fairly dumbly. | 13:20 |
StevenK | wgrant: Don't you have uni work to do on a Sunday night? :-P | 13:24 |
wgrant | Yes :( | 13:25 |
wgrant | But that doesn't stop my inbox from filling up with cron errors. | 13:25 |
* jelmer waves to StevenK | 13:34 | |
StevenK | jelmer: O hai! | 13:39 |
=== Ursinha is now known as Ursinha-afk | ||
lifeless | morning | 20:24 |
beuno | good morning lifeless | 20:25 |
lifeless | hi beuno | 20:50 |
=== Daviey_ is now known as Daviey | ||
mwhudson | morning | 22:25 |
lifeless | mwhudson: hu | 22:26 |
lifeless | mwhudson: hi | 22:26 |
lifeless | mwhudson: a) we have a small firefight going on, and b) I can has reviews? All ones you'll like, I swear. topic of launchpad-reviews | 22:26 |
mwhudson | lifeless: boo, and uh ok | 22:26 |
mwhudson | lifeless: what's the firefight? i saw identi.ca/launchpadstatus | 22:27 |
lifeless | url is in the internal channel | 22:28 |
mwhudson | k | 22:29 |
wgrant | Morning lifeless. | 22:43 |
wgrant | Morning mwhudson. | 22:43 |
wgrant | Any news? | 22:44 |
lifeless | wgrant: some | 22:44 |
lifeless | elmo has been helping, but hes on a m*f* train. | 22:44 |
lifeless | when he surfaces again, we'll continue | 22:44 |
lifeless | wgrant: the only current theory I have is that a failed shutdown of the appservers leaves opstats working but everything else bork bork borked | 22:46 |
wgrant | Data I have: Pretty consistently broken from 02:00Z yesterday to 08:00Z, possibly OK for 3-4 hours, broken from 13:00Z to 18:00Z, possibly OK for a couple of hours, then broken since. | 22:47 |
lifeless | there is a bug about the failed shutdowns, but we'll need one about this failure mode of opstats | 22:47 |
lifeless | we have 2 appservers which are out of rotation but the process is running | 22:47 |
thumper | morning | 22:51 |
lifeless | hi thumper | 22:51 |
lifeless | wgrant: this theory doesn't quite explain the partial pages | 22:55 |
lifeless | wgrant: but if the broken appservers sometimes work, but haproxy kills it all when it decides the servers bong, that would explain it. | 22:56 |
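
If that theory holds, a per-appserver probe comparing the lightweight status page against a real page would show it. A sketch only: the hostnames are placeholders, and it assumes the status endpoint lives at /+opstats as the discussion of opstats suggests:

```python
import urllib2

APPSERVERS = ["appserver1.example.internal:8080"]  # placeholder hostnames


def probe(host):
    def ok(path):
        try:
            urllib2.urlopen("http://%s%s" % (host, path), timeout=10)
            return True
        except Exception:
            return False
    # A box where the status page answers but a real page does not matches the
    # "opstats working but everything else borked" failure mode described above.
    return {"opstats": ok("/+opstats"), "real page": ok("/people")}


for host in APPSERVERS:
    print("%s %s" % (host, probe(host)))
```
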
wallyworld | morning | 22:57 |
wgrant | lifeless: Possibly. | 22:58 |
wgrant | lifeless: Also of note is the partial WADL that I got last night. So even really simple requests get truncated. | 22:58 |
lifeless | wgrant: that comes from the appservers still | 22:58 |
wgrant | It does. | 22:59 |
wgrant | But it should be really fast. | 22:59 |
lifeless | wgrant: and has been known to take minutes to generate. | 22:59 |
wgrant | Oh hmmmmmmm. | 22:59 |
lifeless | not everytime, of course. Just Some Times. | 22:59 |
wgrant | It's almost always truncated after ~15700 bytes. | 23:00 |
wgrant | It fluctuates, but that may just be header size differences. | 23:00 |
wgrant | Regardless of the file, it's within a kilobyte of that number. | 23:00 |
lifeless | elmo: ^ | 23:00 |
lifeless | that's interesting | 23:00 |
lifeless | have you tried a loop on the wadl? if so what url are you using? | 23:00 |
wgrant | I haven't. | 23:01 |
wgrant | It needs some Accept header... let's see. | 23:01 |
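
A simple loop for the WADL check being discussed might look like the sketch below. The URL is just the obvious candidate, and the Accept value is the standard WADL media type, which may not be exactly what the Launchpad appservers expect:

```python
import urllib2

URL = "https://api.edge.launchpad.net/beta/"  # assumed WADL location
ACCEPT = "application/vnd.sun.wadl+xml"       # standard WADL media type; the exact
                                              # value LP wants may differ

lengths = []
for _ in range(20):
    req = urllib2.Request(URL, headers={"Accept": ACCEPT})
    lengths.append(len(urllib2.urlopen(req).read()))

print("min=%d max=%d" % (min(lengths), max(lengths)))
# Consistent truncation around ~15700 bytes would show up as a suspiciously
# uniform, too-small minimum here.
```
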
thumper | wallyworld: morning | 23:18 |
wallyworld | thumper: looks like you had an interesting time on friday :-( | 23:19 |
thumper | wallyworld: FSVO interesting | 23:19 |
thumper | wallyworld: we should have a call | 23:19 |
wallyworld | thumper: ok, anytime | 23:19 |
wallyworld | thumper: https://code.edge.launchpad.net/~wallyworld/launchpad/link-bugs-in-merge-proposal/+merge/34826 | 23:36 |
thumper | wallyworld: http://yuiblog.com/assets/pdf/cheatsheets/css.pdf | 23:43 |