[00:27] <wgrant> lifeless: Are the appserver -> restricted librarian firewall rules completely sorted?
[00:27] <wgrant> We are having 502s which could be caused by them.
[02:03] <lifeless> wgrant: I don't know
[02:03] <lifeless> abel said he was still seeing a failure if he pushed past 5 concurrent uploads, so I assume that we haven't figured it all out.
[02:03] <lifeless> wgrant: gather oopses!
[02:04] <wgrant> lifeless: There are no OOPSes.
[02:04] <lifeless> https://edge.launchpad.net/sprints/uds-karmic/+temp-meeting-export <- why is this being hit :<
[02:04] <wgrant> They're proxy timeouts.
[02:04] <lifeless> restricted librarian isn't proxied
[02:04] <wgrant> Yay, c.l.security is finally being split.
[02:04] <wgrant> lifeless: Appserver connection timeouts, these are.
[02:05] <wgrant> "Sorry, we couldn't connect to the Launchpad server."
[02:05] <wgrant> On an action that would be accessing the restricted librarian.
[02:05] <wgrant> And it's intermittent.
[02:05] <lifeless> AIUI, that error can't be related.
[02:05] <lifeless> however, I may not understand the error
[02:05] <lifeless> What server group ?
[02:05] <wgrant> Hm.
[02:05] <lifeless> edge/lpnet ?
[02:06] <lifeless> file a bug, let's gather data.
[02:06] <lifeless> it may well be related, but no assumptions
[02:06] <lifeless> wtf
[02:06] <lifeless> BugTask LEFT JOIN Bug
[02:06] <lifeless> makes no sense
[02:06] <wgrant> Looks like prod.
[02:07] <wgrant> lifeless: If there are no timeouts on librarian connections, and the connections are being dropped instead of rejected, why couldn't it be related?
[02:08] <lifeless> well
[02:08] <lifeless> what does the error actually mean?
[02:08] <lifeless> does it mean 'got no SYN-ACK'
[02:08] <lifeless> or does it mean 'got no HTTP response in X time' ?
[02:09] <wgrant> I understand that it means the proxy didn't get a response from the appserver in a timely manner.
[02:09] <wgrant> Which probably means the appserver was waiting for something.
[02:09] <wgrant> Which, given last week's happenings, and the fact that other stuff times out, is quite possibly the librarian.
[02:09] <lifeless> if it means no HTTP response in X time, then yes, it can be related.
[02:09] <lifeless> but it also means we should be seeing OOPSes
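The distinction being drawn here, "got no SYN-ACK" versus "got no HTTP response in X time", can be sketched in a few lines of stdlib Python. The server below is a stand-in for a stuck appserver (it completes the TCP handshake but never answers); the function names are illustrative, not Launchpad code:

```python
import socket
import threading

def start_silent_server():
    # Accepts the TCP handshake but never writes an HTTP response,
    # emulating an appserver stuck waiting on a backend.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    def serve():
        conn, _ = srv.accept()
        threading.Event().wait(2)  # stay silent, then hang up
        conn.close()
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()[1]

def classify_failure(port):
    # Distinguish a connect failure from a response timeout.
    try:
        sock = socket.create_connection(("127.0.0.1", port), timeout=0.5)
    except OSError:
        return "no SYN-ACK"
    try:
        sock.settimeout(0.5)
        sock.sendall(b"GET / HTTP/1.0\r\n\r\n")
        sock.recv(1024)
        return "got a response"
    except socket.timeout:
        return "connected, but no HTTP response in time"
    finally:
        sock.close()
```

Only the second failure mode involves the appserver at all, which is why it is the one that can be librarian-related.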
[02:10] <lifeless> what pageids ?
[02:10] <wgrant> Even if there was no SQL executed afterwards?
[02:10] <wgrant> Um, it was on bug submission.
[02:10] <wgrant> So possibly BugTarget:+filebug-guided or something like that.
[02:10] <lifeless> wgrant: yes, soft oops are generated if the request is > $time
[02:11] <wgrant> lifeless: Ah, I didn't know if that also depended on SQL statements.
[02:11] <lifeless> so
[02:11] <lifeless> there's lazr.restful.utils.timeout or whatever it is
[02:11] <lifeless> which does a thread based timeout enforcer
[02:11] <lifeless> and there is the check in the storm tracer
[02:11] <lifeless> I plan to move all these checks to requesttimeline.
[02:12] <lifeless> or possibly something separate but connected.
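The actual lazr.restful enforcer isn't reproduced here, but the "thread based timeout enforcer" idea is roughly this: run the work on a worker thread and stop waiting once the budget is spent. Names are made up for the sketch:

```python
import threading

class RequestTimeout(Exception):
    """Raised when a request exceeds its wall-clock budget."""

def run_with_deadline(func, seconds):
    # Run func on a worker thread; give up waiting after `seconds`.
    # Classic limitation: the worker keeps running in the background;
    # a thread-based enforcer can only stop *waiting* for it.
    outcome = {}

    def worker():
        try:
            outcome["result"] = func()
        except Exception as exc:
            outcome["error"] = exc

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(seconds)
    if t.is_alive():
        raise RequestTimeout("no result within %.1fs" % seconds)
    if "error" in outcome:
        raise outcome["error"]
    return outcome["result"]
```

The storm-tracer check mentioned alongside it is complementary: it cancels at the SQL statement level, while this cuts off the whole request.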
[02:15] <lifeless> gandwana
[02:17] <wgrant> It's having lots of +filebug timeouts ?
[02:17] <lifeless> first one is sql
[02:17] <lifeless> death-by-a-thousand-LFA lookups
[02:18] <lifeless> potassium looks similar
[02:19] <lifeless> it's awful o'clock to be calling the escalation phone just now
[02:19] <wgrant> What needs escalating?
[02:19] <lifeless> this issue
[02:19] <lifeless> if it's not fixed
[02:20] <lifeless> 771 queries for +filebug
[02:20] <lifeless> with apport data
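"Death-by-a-thousand-LFA lookups" is the classic N+1 query pattern: one SELECT per LibraryFileAlias instead of one SELECT for the batch. The Launchpad code isn't shown here; the table and column names below are invented, with sqlite3 standing in for PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lfa (id INTEGER PRIMARY KEY, filename TEXT)")
conn.executemany("INSERT INTO lfa VALUES (?, ?)",
                 [(i, "attachment-%d" % i) for i in range(771)])

statements = []
conn.set_trace_callback(statements.append)  # record every issued statement

def fetch_one_by_one(ids):
    # The pathology: one SELECT per LFA -> 771 statements for 771 rows.
    return [conn.execute("SELECT filename FROM lfa WHERE id = ?",
                         (i,)).fetchone()[0] for i in ids]

def fetch_bulk(ids):
    # The usual fix: a single SELECT ... WHERE id IN (...) for the set.
    marks = ",".join("?" * len(ids))
    rows = conn.execute(
        "SELECT id, filename FROM lfa WHERE id IN (%s)" % marks, list(ids))
    by_id = dict(rows.fetchall())
    return [by_id[i] for i in ids]
```

With many apport attachments and many subscribers, the per-row version multiplies out to statement counts like the 771 seen on +filebug.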
[02:21] <wgrant> Just tried some other restricted download stuff.
[02:21] <wgrant> Got a failure from one prod appserver -- not sure which.
[02:21] <lifeless> download or upload
[02:21] <wgrant> Download.
[02:21] <lifeless> we only had upload enabled on the firewall
[02:21] <lifeless> this might explain it
[02:21] <lifeless> well
[02:21] <lifeless> maybe not
[02:22] <wgrant> Download has been used for ages, though.
[02:22] <lifeless> we only *corrected a missing rule* for upload
[02:22] <wgrant> Ah.
[02:26] <wgrant> So, since StreamOrRedirectLibraryFileAlias failed at least once, the firewall is probably the problem.
[02:27] <lifeless> have you seen that ?
[02:28] <lifeless> was there an oops?
[02:28] <wgrant> No OOPS. Just a plaintext "There was a problem fetching the contents of this file. Please try again in a few minutes."
[02:28] <lifeless> oh, feng shui ?
[02:28] <wgrant> No.
[02:29] <wgrant> This is displayed by the appserver proxy view.
[02:29] <wgrant> When LibrarianServerError is raised by getFileContents.
[02:30] <lifeless> I have to go
[02:30] <lifeless> please - file a bug
[02:30] <lifeless> lets get all the data we can
[02:30] <wgrant> OK.
[02:30] <wgrant> Thanks.
[02:30] <lifeless> also it sounds like LibrarianServerError should be filing OOPSes
[02:30] <lifeless> if you wanted to fix that we could CP it to get more data.
[02:31] <wgrant> It sounds like it might be better to just not catch it at all.
[02:35] <lifeless> it should generate oops, if the best way to do that is to not catch it - fine.
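The fix being agreed on, record an OOPS rather than silently swallowing LibrarianServerError into a friendly error page, is roughly the following pattern. `report_oops` and `oops_log` are stand-ins for Launchpad's real OOPS machinery, not its API:

```python
class LibrarianServerError(Exception):
    """The librarian backend could not be reached."""

oops_log = []  # stand-in for the real OOPS reporting machinery

def report_oops(exc):
    oops_log.append(type(exc).__name__)

def get_file_contents(fetch):
    try:
        return fetch()
    except LibrarianServerError as exc:
        # Previously the error was caught and turned into a plaintext
        # apology with no OOPS.  Recording it (or simply not catching
        # it, so the publisher's error handling reports it) keeps the
        # diagnostic data.
        report_oops(exc)
        raise
```

Not catching the exception at all, as suggested above, achieves the same end as long as some outer handler files the OOPS.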
[02:36]  * lifeless is gone, back in a few hours.
[03:22] <wgrant> sinzui: Is OOPS-1714K1846 another of the openid_identity_url LocationErrors?
[03:22]  * sinzui looks
[03:23] <wgrant> The user has OpenID issues.
[03:23] <wgrant> But it may be unrelated.
[03:23] <sinzui> Yes it is
[03:23] <wgrant> It works fine on edge, oddly.
[03:23] <wgrant> And I don't see what's changed on edge.
[03:23] <sinzui> I see two views definitely provide the attr
[03:24] <wgrant> (in this case, post-rollout the SSO account mapped to the wrong account)
[03:24] <wgrant> s/wrong account/wrong person/
[03:24] <sinzui> wgrant, that may be the case
[03:24] <sinzui> wgrant, this is the TB: http://pastebin.ubuntu.com/491936/
[03:25] <wgrant> Huh.
[03:26] <sinzui> ah we hit the XRDS code
[03:26] <wgrant> Oh, right.
[03:26] <wgrant> That's why it's only on prod.
[03:26] <wgrant> Of course.
[03:26] <sinzui> This is something that the foundations team may need to explain
[03:27] <wgrant> Now, there were some changes relating to OpenID on account merges last cycle.
[03:27] <wgrant> And the diff is huge, so I didn't even skim it. /me reads.
[03:28] <wgrant> Grrrrar.
[03:28] <wgrant> Branch is private.
[03:28]  * wgrant diffs manually.
[05:02] <lifeless> back
[05:02] <lifeless> wgrant: how goes it, any more data?
[05:20] <wgrant> lifeless: Nothing.
[05:20] <wgrant> And I didn't file a bug, since if all goes well that view will disappear soon.
[05:21] <wgrant> (once your stuff is active)
[05:21] <wgrant> Or do you want a bug about the probably-not-bug +filebug issue?
[05:40] <lifeless> the upload and download ports to the appserver need to be open regardless
[05:41] <lifeless> because; in-appserver stuff uses the restricted librarian to get at content sometimes
[05:41] <wgrant> They do, yes.
[05:41] <wgrant> But it's not a bug.
[05:41] <wgrant> It's an operational issue.
[05:41] <lifeless> and uploads of all sorts are proxied via the appserver
[05:41] <lifeless> wgrant: 'meh'
[08:02] <wgrant> OOPS-1715S302
[08:05] <wgrant> lifeless: You're not still around?
[08:06] <lifeless> sigh, context manager fail
[08:06] <lifeless> yes
[08:06] <wgrant> What's the OOPS?
[08:07] <wgrant> I got that the first couple of times before the "Please try again" started appearing on staging.
[08:10] <lifeless> LaunchpadTimeoutError: Statement: 'SELECT DISTINCT SourcePackagePublishingHistory.archive, SourcePackagePublishingHistory.component, SourcePackagePublishingHistory.datecreated,
[08:10] <lifeless> QueryCanceledError('canceling statement due to statement timeout\\n',)
[08:10] <lifeless> SQL time: 10494 ms
[08:10] <lifeless> Non-sql time: 175 ms
[08:10] <lifeless> Total time: 10669 ms
[08:10] <lifeless> Statement Count: 43
[08:10] <wgrant> Hm, so probably unrelated.
[08:10] <lifeless> it's on staging
[08:11] <lifeless> different librarian
[08:11] <wgrant> It is.
[08:11] <wgrant> But I still got the same error later.
[08:11] <wgrant> So it's not prod-specific.
[08:11] <wgrant> Is the staging librarian also on asuka, or not?
[08:11] <lifeless> I think so
[08:11] <wgrant> Urgh.
[08:11] <lifeless> let me check
[08:11] <wgrant> So... not firewall, in that case.
[08:11] <wgrant> I could try dogfood, which I know is the one machine.
[08:12] <lifeless> yes, asuka
[08:12] <wgrant> If the failed request caused an OOPS, it should have been just after OOPS-1715S304.
[08:12] <wgrant> Is it obvious?
[08:13] <lifeless>  LaunchpadTimeoutError: Statement: 'SELECT BinaryPackagePublishingHistory.archive, BinaryPackagePublishingHistory.binarypackagerelease, BinaryPackagePublishingHistory.component,
[08:13] <lifeless> thats 5
[08:13] <wgrant> I didn't think I caused a third, but maybe I did.
[08:13] <lifeless>   LaunchpadTimeoutError: Statement: '(SELECT "_259ce".name, Person.displayname, EmailAddress.email FROM Person JOIN Account ON Account.id = Person.account JOIN EmailAddress ON EmailAddress.person = Person.id JOIN TeamParticipation ON
[08:13] <lifeless> thats 6
[08:13] <lifeless> anon
[08:14] <wgrant> Probably not, then (but that looks like an auth query... how would that be timing out so early?)
[09:18] <wgrant> lifeless: The proxy timeouts go away if I remove most of the attachments from the uploaded blob, or if I file it against a project with only a couple of subscribers.
[09:19] <lifeless> heh
[09:19] <wgrant> Next test: Specifying a biggish team as the initial assignee, to emulate the lots of subscribers that Ubuntu has.
[09:19] <lifeless> thought so
[09:20] <wgrant> But that should still be an SQL timeout :/
[09:20] <lifeless> and they all have been that I've seen, so far.
[09:20] <wgrant> Oh look.
[09:21] <wgrant> Setting assignee=ubuntumembers when filing the bug also makes it die like that.
[09:21] <wgrant> But that should still be an SQL timeout. So why does it not appear as one...
[09:22]  * wgrant creates a few hundred people locally.
[09:30] <wgrant> Uh.
[09:30] <wgrant> Would you like some queries?
[09:30] <wgrant> That request has plenty.
[09:38] <lifeless> heh
[10:16] <lifeless> james_w: https://edge.launchpad.net/python-fixtures/trunk/0.2
[14:56] <james_w> thanks lifeless
[20:19] <lifeless> james_w: please let me know how you like/dislike it.
[20:20] <james_w> I'll give it a go now
[20:20] <james_w> I assume testresources will become a layer on top of fixtures now?
[20:21] <lifeless> yeah
[20:21] <lifeless> going to look at jmls remaining testrepository patches
[20:21] <lifeless> then package up fixtures
[20:22] <lifeless> then start working back along the stack, harmonising things
[20:22] <james_w> excellent
[20:22] <lifeless> I was surprised, 0.1 had 49 downloads.
[20:23]  * jelmer cheers on lifeless
[20:24] <james_w> the existence of fixtures fixture and testfixtures is unfortunate
[20:24] <lifeless> yes
[20:24] <lifeless> I thought hard before wedging in there
[20:25] <lifeless> I also looked at their designs
[20:26] <lifeless> probably want to subsume fixture, functionality-wise, in a couple of releases
[20:28] <lifeless> and testfixtures, ah yes
[20:28] <lifeless> sugar but not AFAICT fundamentally solving it
[20:31] <lifeless> actually, revisiting, testfixtures is pretty neat
[20:31] <lifeless> but the API for compare isn't quite disconnected enough for little ol me