[00:41] I wonder if my with patch will be past buildbot in time for the deploy [00:41] * lifeless has a cunning plan to add as many things as possible [00:41] lifeless: Will we have an mthaddon in time? [00:42] also a good question [00:42] I was assuming not, otherwise I would have been landing stuff more aggressively today. [00:42] staging takes 20 minutes [00:42] if he shows up at ~8 as normal [00:42] then yes [00:42] otherwise no [00:43] Hmm, the docs say the tree should be prepped 2.5 hours before :) [00:43] yes [00:43] and we have such a tree [00:43] But I guess you could try to get another pushed out, and just use the old one if it's not there in time. [00:43] as long as we don't bork it we're fine [00:43] the 2.5 hours - if you are reading the losa docs - is time for 2 staging attempts + discussion [00:43] Oh. [00:44] staging as in staging, not staging, right. I was wondering why you were talking about staging, and how you'd managed to get it to update in 20 minutes. [00:46] * wgrant lunches. [00:46] But it isn't even midday yet? [00:47] Another day, another firefox update. [01:07] StevenK: it is past midday [01:07] well past :) [01:08] thumper: Timezone fail :-) [01:08] StevenK: Shh. [01:09] wgrant: Do you have dinner at 4pm, too? [01:09] StevenK: No :( [01:11] thumper: Is the EnumChoiceWidget suitable for the bugtask table too? [01:12] wgrant: it should be, perhaps with a little tweaking [01:12] wgrant: in the same way the InlineEditPickerWidget should be for selecting a person [01:12] thumper: Why does one have InlineEdit and the other not? [01:13] wgrant: because I never got around to renaming it, and that is what it was originally called [01:13] Ah, good. Was hoping I wasn't missing some difference. [01:16] nah... [01:16] I'm just fixing a few tests on my blueprint-magic branch [01:16] which widgetizes the blueprint page as a proof of concept [01:16] or I should say [01:16] Excellent. [01:16] another proof of use [01:16] Awww. Here I was hoping it removed blueprints. [01:17] I like blueprints [01:17] Blueprints are annoying [01:17] personally I think merging blueprints and bugs is wrong [01:44] * thumper runs blueprint-magic through ec2 [02:46] lifeless: what benefit does colocation provide us? [02:55] wgrant: we need to support the protocol: more metadata, streaming fetch of N branches at onces, multiple heads etc [02:55] wgrant: + [possibly] get rid of stacking and massively simplify things [02:55] I guess. [02:56] Get rid of stacking? [02:57] wgrant: think about ideal loom behaviour [02:57] wgrant: pushing 200 vim patches == pain; pushing 1 collection of branches == nice [02:58] StevenK: stacking is the source of many bugs and slowdowns in bzr [03:00] I like it for LP development [03:03] StevenK: you like the performance [03:04] StevenK: if it was faster than it is now, would you really whine? [03:05] lifeless: TBH, with stacking I don't mind bzr push performance, and I'm happy about the disk space win for crowberry. If it was faster without losing the win for crowberry, that would be awesome. [03:08] it has the potential to be smaller [03:08] we don't delete branches [03:09] and the minimum size for a stacked branch is the size of one inventory - which doesn't compress well [03:09] if all those branches were combined, the incremental overhead per branch could be a lot lower [03:09] the question is whether the baseline overhead would be more or less [03:09] But we can't do that today, right? 
[03:10] no [03:10] its a nontrivial discussion [03:10] and we have other fish to fry [03:10] Right. [03:11] lifeless: Why does it need a full inventory? Because it starts a new compression group? [03:12] (my knowledge of 2a and above is sorely lacking) [03:12] And above? [03:12] There's another format after 2a? [03:13] development-subtree, for one. But it's not exactly very different. [03:19] wgrant: it has to be able to generate a delta [03:19] wgrant: for anything in it [03:19] Sigh. Minutes after I say I'm happy with push performance, I'm stuck waiting for it. [03:20] lifeless: And it can't use just a delta on top of the stacked-on CHK tree? [03:20] I should probably read how CHK actually works :) [03:20] wgrant: fetch operations are single repo always [03:20] wgrant: consider: client A, servers B and C with a firewall between B and C [03:21] wgrant: if sftp to B and C worked but bzr+ssh didn't it would be unpleasant [03:21] wgrant: so the way we did it is to say that a repository must: [03:21] - for any rev R it has: [03:22] - be able to return the content of the texts in R [in a repo specific format - e.g. fulltext, delta against some ancestor, whatever] [03:22] - be able to describe the content of R as a delta against the immediate ancestors of R on all sides [03:23] - on pushes the server says "I am missing the parents of revisions X,Y,Z" [03:24] Oh, right. [03:25] we could, in theory, have a partial CHK tree for a given rev [03:25] so far we haven't implemented that [03:25] hmm, new timeout [03:25] SourcePackage:+index [03:26] What's the bad query? [03:26] dunno yet [03:27] Ah, there. [03:27] wgrant: https://code.launchpad.net/~stevenk/launchpad/derive-common-ancestor/+merge/52796 -- given it's our work, I'm not asking for a reviewer, but look it over? [03:28] lifeless: Ouch, 9s in 3 repeated queries. [03:28] Rather one triplicated query. [03:29] It seems to be exactly the same query. [03:29] (looking at https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1894B1679, queries 20, 40 and 41) [03:30] and there is AND SourcePackagePublishingHistory.status IN (2) [03:30] on q 19 [03:30] So there is. [03:30] so maybe 4 calls [03:31] I would normally say that two of them are probably TAL guarding the display of the third... but these are full queries. [03:31] Might try to get tracebacks from DF. [03:31] Or go TAL-diving, but that is more tedious. [03:32] I'm doing some houseworky stuff atm [03:32] but if you wanted to identify the call sites, that would be awesome [03:32] Sure, just trying not to collide with you. [03:33] We can't get full tracebacks from qas? [03:33] StevenK: I have a patch to give OOPSes a traceback for every query,. [03:33] But it makes them unparseable, so it's not really usable on qas. [03:34] So I pretend that Julian doesn't exist and use it on mawson. [03:34] Haha [03:34] * StevenK ponders ringing Subaru [03:34] Oh? [03:35] They've had my car for 5 and a half hours now. Surely they're done servicing it. [03:36] :( [03:36] wgrant: Can haz opinion on MP? Or is it on your list? [03:36] StevenK: Looking. [03:37] mawson will take about forever to update. [03:37] Just one or two eons [03:37] + 'derived': dervied_changelog, [03:37] typo [03:37] Sigh [03:38] All instances fixed, thanks. [03:38] You want to factory out the madness into something like get_ancestry(SPR) [03:39] Otherwise that looks good. [03:39] I do? I figured _updateBaseVersion() was self-contained enough. 
[03:40] You're duplicating the set(Changelog(spr.changelog.read()).versions) [03:41] Also, what does debian.changelog do if the changelog is unparsable? [03:42] Returns an empty list [03:42] Which is fine by me [03:43] It's not going to raise exceptions in any case? [03:43] wgrant: I'm happy to write a test for that case. [03:43] That would be handy. [03:44] wgrant: get_ancestry in DSD or SPR? [03:44] StevenK: SPR is big enough already. [03:44] Keep it in DSD until we need it elsewhere, I think. [03:54] wgrant: http://pastebin.ubuntu.com/578173/ [03:55] StevenK: Looks reasonable. [03:58] No manual entry for subunit-stats [03:59] Subunit has to be the most poorly documented set of scripts ever [04:04] StevenK: I give you perl [04:04] StevenK: seriously, subunit-stats --help. [04:04] help2man, kthxbye [04:04] patches appreciated kthxdeal [04:05] lifeless: So, I've found the sources of the queries. [04:06] lifeless: They are very fast on DF when caches are hot. [04:06] Well, not very fast, but <200ms. [04:06] But very fast on DF is still 2 seconds. [04:07] lifeless: SourcePackage.summary and SourcePackage.published_by_pocket. A couple of calls to each. [04:07] Can you get a plan from a staging? [04:07] sure [04:09] lifeless: Thanks for the review. [04:09] I hope next week that the OOPS counts will be low enough that I can sensibly go through and tear out the old OOPS reporting stuff. [04:09] And then clean up the exception handling. [04:10] I'm using query 11 from https://lp-oops.canonical.com/oops.py/?oopsid=1894B1679#statementlog [04:10] Now that we know (hopefully) everywhere it's needed. [04:10] note that its not a hold/cold issue because its consistnetly slow in the oops [04:11] Indeed, I noticed that. [04:11] sadly, tis fast on qas [04:11] Hmmmmmmm. [04:11] want me to check staging ? [04:11] Worth a try, I guess :/ [04:12] oh but [04:12] there is also the deserialiation overhead [04:12] same results on staging [04:12] :( [04:13] no, not that [04:13] 72 rows [04:14] tagged it dba [04:14] Thanks. [04:14] we need to start capturing db hostnames [04:15] still, we should fix [04:15] no need to do 3 lookps === MTecknology is now known as MTeck-InPain [05:02] Huzzah, I have my car back [05:23] stub: hi [05:23] yo [05:23] lifeless: https://dev.launchpad.net/Database/LivePatching [05:23] stub: I saw - looking good [05:23] stub: I am drafting a 'ReliabileDBDeployments' LEP too [05:24] stub: which will frame whatever work we need to invest in this [05:24] Ok. Do you want to incorporate what I put together? [05:24] I was very pleased to see LivePatching this morning. [05:25] stub: I think the are complementary - the L-P page is how and implementation strategies [05:25] stub: the LEP will be what, goals, constraints, requirements [05:25] s/the/they/ [05:25] Ok. I'll update that wiki document if I think of anything new or get feedback then. [05:26] excellent [05:26] stub: we have some queries running slow on prod slaves, but fast on qastaging/staging [05:27] stub: two so far that I know of : the duplicate bug detection FTI queries (the ones you said we can't do realtime) and the one in https://bugs.launchpad.net/launchpad/+bug/732398 [05:27] <_mup_> Bug #732398: SourcePackage:+index timeout < https://launchpad.net/bugs/732398 > [05:27] wgrant: So I'm worried the extra overhead (sometimes needing 3x as many db patches, extra code to support 'old' and 'new' schemas) could deter devs. You disagree and think you would make use of the process? 
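A rough sketch of the get_ancestry(SPR) helper wgrant suggests back around 03:38-03:54, built on python-debian's Changelog class from the quoted set(Changelog(spr.changelog.read()).versions) expression. The helper name and the spr attribute layout are illustrative, not the real Launchpad API:

    from debian.changelog import Changelog

    def get_ancestry(spr):
        """Return the set of versions named in an SPR's changelog.

        Per the discussion above, debian.changelog yields no versions for
        an unparsable changelog, so this degrades to an empty set.
        """
        try:
            changelog = Changelog(spr.changelog.read())
        except Exception:
            # Hedge: the discussion says unparsable changelogs return an
            # empty list rather than raising, but be defensive anyway.
            return set()
        return set(changelog.versions)

    # The common-ancestor computation is then a plain set intersection:
    #   shared = get_ancestry(derived_spr) & get_ancestry(parent_spr)
    #   base_version = max(shared) if shared else None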
[05:27] stub: We can push things out more quickly and without hideous amounts of downtime. [05:27] Sounds like we need to pull some RAM out of prod.... [05:27] ;) [05:28] Why wouldn't people make use of it, even if it is slightly more cumbersome? [05:28] (I note that the page doesn't define what a light patch is, though.) [05:28] stub: if you could get an explain analyze on all three db's for the query in https://bugs.launchpad.net/launchpad/+bug/732398/comments/1 - that would be awesome [05:28] <_mup_> Bug #732398: SourcePackage:+index timeout < https://launchpad.net/bugs/732398 > [05:29] wgrant: I'm just a born devil's advocate. There is more overhead in this process, and I'm interested in if the extra overhead will overcome the desire to get stuff rolled out 'now' rather than 'next cycle' [05:29] bah, brb [05:30] am I back? [05:30] lifeless: you're back [05:30] cool [05:30] You're not gone. [05:30] so - can has analyze ? [05:31] stub: and then, I'd like to talk fti briefly, possibly voice, possibly here [05:31] Interestingly enough, as soon as I said "you're not gone", freenode lagged for 30s. [05:31] stub: (short story, I want to know what I'm missing on the query - its plan and behaviour on staging seem totally fine) [05:32] wgrant: had to bounce wifi to stop openid trashing all my open tabs when I restarted chromium [05:32] Hah. [05:32] and I had to restart chromium because it had forgotten about a popup window which was permanently stuck in the foreground [05:33] Software sucks :( [05:33] (not a browser window, right mouse context window) [05:34] Cold, that query ran in 500ms or less on all prod servers [05:34] stub: argh [05:34] stub: so, we have a diagnostic challenge [05:34] Because it took 3s hot. [05:34] lifeless: Somehow get more information about locks [05:34] wgrant: let's be precise [05:35] wgrant: our timeline which records query serialisation, queuing, deserialisation and upcasting to objects and any time given to another worker thread, showed 3 seconds. [05:35] True. [05:36] Interestingly, the fastest one (launchpad_prod_1) had a slightly different plan [05:36] Sorry - launchpad_prod_2 [05:36] perhaps we should capture the plan for any query over 1 second [05:36] into the timeline [05:38] there is already a per thread timeline bug [05:38] bug 243554 [05:38] <_mup_> Bug #243554: oops report should record information about the running environment < https://launchpad.net/bugs/243554 > [05:39] * lifeless retitles [05:40] stub: so the same query is run three times in that page [05:40] In this case, I think the plan is a red herring and just an artifact of different statistics. The costs of the different parts of the plans are close enough to identical. [05:40] stub: https://launchpad.net/ubuntu/lucid/+source/chromium-browser/+index - and it's timing out now [05:41] OOPS-1895M373 [05:41] triggering an lpnet sync [05:42] Just reran the previous query - slowest was 318ms [05:42] Is that with or without status IN (2)? [05:43] stub: would lock contention explain the same query being slow 3 times in a row ? [05:43] no [05:44] Well... maybe. [05:44] https://lp-oops.canonical.com/oops.py/?oopsid=1894B1679#repeatedstatements [05:44] 4th row [05:44] 3 calls to it, average time 2911ms [05:44] If it is a slow process like the publisher it could be locking rows in the same set returned by the slow query in different transactions [05:45] ok, so let's see the times for these oopses [05:45] Oh...
3 queries in one transaction, no - not lock contention [05:45] they are spread over the day [05:47] locks shouldn't be blocking selects anyway. [05:48] time of day for the oopses: 1937 1420 1937 0846 1201 1335 1258 1056 [05:48] wgrant: what time range does the publisher run in ? [05:48] lifeless: Primary archive? Normally 03-40 [05:49] ok, not that then [05:49] But it should release locks a good 10-15 minutes before it finishes. [05:49] High replication load possibly - look for corresponding lag spikes [05:49] all the oopses have exactly the same pattern [05:49] stub: where is the replication lag graph ? [05:49] * stub is looking for it [05:49] stub: and do we have one right now ? [05:50] https://launchpad.net/ubuntu/lucid/+source/chromium-browser/+index is the page I hit to generate an oops [05:50] Not lagged atm. [05:51] stub: then thats likely not it, cause its still timing out on prod :> [05:51] Graph is here anyway: https://lpstats.canonical.com/graphs/ProductionDBReplicationLag/ [05:51] Seems to time out on the master too. [05:51] OOPS-1895ED372 [05:52] So the same query with different parameters from the bug is still not having any problems. [05:52] Anyone have the actual query currently timing out handy yet? [05:53] stub: the one I linked is the one timing out [05:53] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1895M373 [05:54] http://pastebin.com/f8Vdi5p0 [05:54] 3.7 seconds in prod [05:55] When I run it, slowest 536ms. [05:55] * stub checks the pastebin matches [05:56] yup [05:56] ok thats strange [05:56] stub: try this [05:56] stub: run it, *not explain*, and press 'end' [05:56] make sure you have \timing on [05:56] The query is returning 5k rows and building 10k objects - would this be appserver time hidden from our metrics? [05:57] hidden from an explain analyze anyhow [05:57] That will give crap results atm.... I'll open some local psql shells. [05:57] erm... remote shells [05:58] Ooh... perhaps there are some silly large text fields in there? [05:58] SPR is pretty fat. [05:58] Because of SPR.copyright. [05:59] There shouldn't be that many rows, though :/ [05:59] I'd expect a few dozen at most. [05:59] count(*) says 72 rows for me [05:59] doing select count(*) from (SE...) as _tmp; [06:00] stub: how are you measuring the 5K ? [06:00] Sorry - I was looking at the estimate, not the actual returned count. [06:00] Ahh [06:00] I was scared that our kernel developers were even more insane than I thought. [06:01] wgrant: can you put the functions into the bug ? [06:02] lifeless: SUre. [06:03] so [06:03] theory is the analyse is not showing us the cost of a file processing of the rows in some fashion [06:04] It certainly is taking a lot longer to get the results to the client than it is to get the query plan. [06:05] yeah [06:05] one mystery solved [06:05] another item for the db performance tips page [06:06] Just seeing bug queries causing large temporary files in the logs [06:06] Oh hah. [06:06] I guess linux might have an enormous copyright file. [06:07] As well as a few uploads. [06:09] We should really do away with that column. [06:09] stub: Can you see how big a column is across an entire table? [06:09] I'm about to do that. [06:10] max(length(foo)) stuff [06:10] It could well be a few times the size of the rest of the table. [06:10] As a bonus, the data is not used by anything yet. 
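A minimal sketch of the "max(length(foo)) stuff" StevenK asks about at 06:09, for measuring how much each wide text column contributes to a table. psycopg2, the DSN and the column list are illustrative assumptions, not taken from the Launchpad tree:

    import psycopg2

    # Columns to audit on sourcepackagerelease; the names are examples
    # only. Note that length() counts characters, not bytes.
    COLUMNS = ['copyright', 'changelog_entry']

    def column_sizes(conn, table, columns):
        sizes = {}
        cur = conn.cursor()
        for col in columns:
            # Identifiers cannot be bound as query parameters, hence the
            # interpolation; acceptable for a one-off admin check.
            cur.execute("SELECT max(length(%s)), sum(length(%s)) FROM %s"
                        % (col, col, table))
            sizes[col] = cur.fetchone()  # (largest value, total size)
        return sizes

    conn = psycopg2.connect('dbname=launchpad_dev')
    print(column_sizes(conn, 'sourcepackagerelease', COLUMNS))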
[06:11] I selected into a temp table [06:11] \dt+ foo2 [06:11] List of relations [06:11] Schema | Name | Type | Owner | Size | Description [06:12] ------------+------+-------+-------+-------+------------- [06:12] pg_temp_30 | foo2 | table | ro | 96 kB | [06:12] its only 96kB apparently [06:12] Did you select all the SPR changelogs into a temp table? [06:12] Or is that the result of the problematic select? [06:13] thats the result of the problematic select [06:13] I find that difficult to believe. [06:13] But I guess it's possible. [06:13] with id as id2, component as c2 etc etc [06:13] launchpad_qastaging=> select count(*) from foo2; [06:13] count [06:13] ------- [06:13] 72 [06:14] 1.1 mb is the largest text field in the current slow query [06:14] (the one from the last oops) [06:14] * stub checks the entire table. [06:14] stub: whats the sum of that field ? [06:15] DF is being a bit slow at summing all the lengths. [06:15] 81MB... [06:15] Ouch. Which SPR is that? [06:16] stub: whats the length() function return ? [06:16] That is the sum of copywrite... so multiple [06:16] I got noddy small values [06:16] Hah. [06:16] 2.5GB of linux copyright files on DF. [06:16] but the text is big [06:16] sum of the length? 84466050 [06:16] no [06:16] 81MB [06:16] I mean hte function [06:16] select max(length(changelog_entry)) from foo2; [06:16] max [06:16] ------ [06:16] 3418 [06:17] 1.1MB [06:17] what does the 3418 mean ? [06:17] is that pages? sectors? [06:17] Its bytes [06:17] changelog_entry? [06:17] Sorry - characters [06:17] You mean changelog? [06:17] wgrant: changelog is an int [06:17] changelog_entry is different. [06:17] Er. [06:17] copyright, not changelog [06:17] So UTF-8 might theoretically be 4x as many bytes? [06:18] changelog_entry is always going to be small; it's only the latest. [06:18] I've checked all text fields on the table. The problem is copyright as you suspected. [06:18] select max(length(copyright)) from foo2; [06:18] max [06:18] --------- [06:18] 1126214 [06:18] (1 row) [06:18] ok, confusion sorted [06:18] http://paste.ubuntu.com/578202/ === Ursinha is now known as Ursinha-afk [06:19] right [06:19] dropping copyright fixes it [06:19] what do we have this in there for ? [06:19] Nothing at all. [06:19] It's populated, but not used. [06:19] drop the column definition from launchpad ? [06:19] We could possibly drop it from the class immediately, and just set it on upload, and then migrate it out of the DB later. [06:19] that will stop storm querying it [06:19] Right. [06:20] So yeah, that column needs to be split into a separate table. Code only fix might be to remove that field from the main Storm class, and have a separate Storm class with the extra column and use that only where necessary. [06:20] stub: s/table/librarian/, I think. [06:20] query can show all the rows in 300ms with that field removed. [06:20] wgrant: If there is no need for it to be in the DB, sure. [06:21] wgrant: want to do this one? [06:21] wgrant: If it is being shown inline on the page, I guess that is a performance call (pull it from the librarian and render it will be slower than from the db... unless it is ajax, and then search engines won't see it) [06:21] lifeless: I'll drop the column from the Storm definition now. [06:21] wgrant: cool [06:21] stub: its not used at all [06:21] stub: premature optimisation years ago [06:21] stub: It may plausibly be displayed in the page. [06:21] k [06:21] But it isn't yet. [06:21] And if someone wants to, they can damn well pull it from the librarian. 
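A sketch of the code-only fix stub suggests at 06:20: keep the wide copyright column out of the main Storm class so ordinary queries never fetch it, with a thin second class for the rare path that needs it. Illustrative only, not the actual Launchpad model definitions:

    from storm.locals import Int, Storm, Unicode

    class SourcePackageRelease(Storm):
        __storm_table__ = 'sourcepackagerelease'
        id = Int(primary=True)
        version = Unicode()
        # copyright deliberately omitted: over 1MB per row for some packages.

    class SourcePackageReleaseCopyright(Storm):
        """Maps the same table; used only where the copyright text is wanted."""
        __storm_table__ = 'sourcepackagerelease'
        id = Int(primary=True)
        copyright = Unicode()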
[06:22] iframe embedding ftw [06:22] And if someone wants to display multiple on a single page, they are probably wrong. [06:22] So we don't have to consider that case. [06:23] So asuming ascii text, and Python decoding it into Unicode, that is 324MB of RAM needed before it even gets past psycopg2. This sort of thing will certainly be driving our memory footprint. [06:23] or they can just hand out librarian urls [06:23] I wonder if the uploader can update SPR. I guess I'm about to find out. [06:23] Given Python won't clean up, and that is per thread. [06:24] stub: ok, next one up is fti [06:24] I've been concerned about SPR.copyright for well over a year now, but never had the data to confirm that it was a problem. [06:24] wgrant: data is wonderful isn't it [06:24] I prefer a bucket of sand [06:24] ;) [06:25] nah... nah... fti... can't hear you... nah... nah [06:25] stub: I just put up a security.py cleanup for your review that also gets it to run under 3 seconds. [06:25] So the query I looked at the other day had 32 lines of multiple boolean operations going on. I'm surprised it doesn't timeout just generating the plan for that, let alone get around to running the query [06:26] jtv: Ta. [06:26] stub: the plan is simple :) [06:26] stub: its an 800ms query on qastaging [06:26] stub: Just testing if you really couldn't hear anyone. :) https://code.launchpad.net/~jtv/launchpad/faster-security/+merge/52804 [06:28] stub: http://pastebin.com/7DuF11Z8 [06:31] * stub finds the bug 726175 [06:31] <_mup_> Bug #726175: DistributionSourcePackage:+filebug Timeout trying to file bug due to FTI timeout < https://launchpad.net/bugs/726175 > [06:33] And 3.6 seconds on production [06:34] stub: doing the same show-the-results test? [06:34] By accident, yes.... [06:34] :) [06:34] * stub waits for his terminal to return to normal [06:34] so 3.6 is << 15 === MTeck-InPain is now known as MTecknology [06:38] So width isn't the problem here - no insane targetnamecaches or statusexplanation [06:39] stub: right, and it performs tolerably interactively [06:40] 65% of the index is data, which is a little bloated but not bad (launchpad_prod_2 here) [06:40] using package git not installed failed to install upgrade ErrorMessage: subprocess installed post-installation script returned error exit status 1 as a search term the search completes right now on prod [06:41] at /ubuntu/+filebug [06:41] huwshimi: hey [06:41] huwshimi: I have a challenge for you [06:42] huwshimi: when an ajax/api request is made, I'd like to put the time in the top right - you know where - into a journal of items [06:42] The limit isn't helping at all - all the rows are going to be pulled to calculate the rank so things can be ordered. [06:44] sure [06:45] it would be nice to set a cap on the relevance but I don't htink our tsearch setup is ready for that [06:45] lifeless: so you'd like to figure out how to get the time for the roundtrip? [06:46] huwshimi: and glue it all together :) [06:46] huwshimi: its not urgent [06:46] huwshimi: but its the next step in visibility to devs of perofrmance [06:46] http://paste.ubuntu.com/578204/ is the plan I'm looking at btw. [06:46] stub: yeah, thats what i see too [06:47] stub: except mine is a little faster [06:47] Sort (cost=41.08..41.09 rows=1 width=650) (actual time=554.699..554.699 rows=1 loops=1) [06:47] lifeless: So wtf is the query doing for the first 1.4 seconds? [06:47] The fti stuff starts 1412 ms in. [06:48] stub: search me [06:49] Oh.... 
remember I mentioned those big temporary tables in the pg logs? [06:49] stub: I think you may have *meant* to, but I don't remember a discussion [06:49] stub: or you did tell me and I've lost it ;) [06:49] He mentioned them. [06:49] lifeless: I'm not sure about YUI, but with other ajax frameworks you can add hooks to every ajax event. And then it's just a matter of recording the time the ajax event fires and then comparing it to the time when the ajax event completes. I'll take a look some time [06:49] (13:06:15) stub: Just seeing bug queries causing large temporary files in the logs [06:50] Gnargh gina. [06:50] stub: ah thaks [06:50] wgrant: ? [06:50] stub: you think this is it ? [06:50] lifeless: The one thing I've learnt though is that everything is an order of magnitude harder with YUI than other frameworks :) [06:51] huwshimi: what framework should we be using? [06:51] StevenK: I'm removing SPR.copyright. [06:51] stub: I may be missing the point since I just jumped in, but all it's got at the point where the 1.4 seconds are lost is 1 tuple [06:51] (See all the way at the bottom of the plan) [06:51] lifeless: Do you really want to go into that? :) [06:51] wgrant: Why? I think some people want it. [06:51] huwshimi: sure, why not? [06:51] StevenK: I'm removing it from the class for now. It will be moved into the librarian later. [06:52] stub: launchpad_qastaging=> SELECT BugTask. ... assignee | bug | bugwatch | date_assigned | date_closed | date_confirmed | date_fix_committed | date_fix_released | date_incomplete | date_inprogress | date_left_closed | date_left_new | date_triaged | datecreated | distribution | distroseries | id | importance | milestone | owner | product | productseries | sourcepackagename | status | statusexplanation | t [06:52] ----------+--------+----------+---------------+-------------+----------------+--------------------+-------------------+-----------------+-----------------+------------------+---------------+--------------+----------------------------+--------------+--------------+--------+------------+-----------+---------+---------+---------------+-------------------+--------+-------------------+----------------- [06:52] | 715778 | | | | | | | | | | | | 2011-02-09 14:06:41.463232 | | | 836597 | 5 | | 3439478 | 24742 | | | 10 | | Linaro GCC [06:52] (1 row) [06:52] Time: 861.816 ms [06:52] jtv: thats a relative offset, not absolute [06:52] jtv: because its in the nested loop [06:53] jtv: so its at 1413.054 + 0.040 and takes 0.042 [06:53] lifeless: it's in the same nested loop where the other part starts at 1.4 seconds [06:53] Don't they both count from the start of the loop? [06:53] no [06:54] lifeless: OK, I think there are a bunch of issues with YUI. Its development is really slow and can not keep up with the pace of other frameworks. With the way things are going at Yahoo I also worry that there will get to a point where it may stop being developed by Yahoo at all. [06:54] or at least, I've only been able to make sense from plans if I assume the answer is no [06:55] lifeless: The community around YUI is tiny and I don't think a serious community effort would be made to keep it going (I could be wrong and by the community taking it over it might help drastically) [06:55] * jtv retouches his fading formerly-yellow note saying "read actual documentation" [06:55] huwshimi: so thats a bunch of risks [06:55] lifeless: All of this means that there are a lot of half baked features in YUI. 
[06:56] not really an argument for where to aim *at* [06:56] jtv: From the plan, it seems we start looking at the fti index at 1.4 seconds in. Before that, the only thing that was done was an index scan on bugtask_product__bug__key that completed less than 1ms in. [06:56] lifeless: Some things people seem to really like. Like the testing framework [06:56] (looking at the start times in actual time=) [06:57] I can't see it being a temporary file issue since the plan is reporting memory sorting. [06:57] lifeless: But for many things it seems that you write more code and spend more time getting around YUI's drawbacks than you should. [06:58] stub: so… some weird combinatorial behaviour in the planner for those booleans? Try removing one and see if the time halves. :) [06:58] lifeless: Also, I'm not a YUI expert, from discussions with sinzui, he has very similar feelings about a lot of this and he might be a better person to talk to. [06:58] It might be spending 1.4seconds working out wtf that obscene boolean calculates down to... [06:58] I can time that... [06:59] lifeless: Another side of YUI having such a small community is that the plugin ecosystem is tiny. [06:59] stub: that last line is a per-found-bug inner loop [06:59] stub: it can /only/ execute after the fti starts spitting out rows [06:59] lifeless: We would save a lot of development time if we didn't have to reinvent the wheel for a lot of stuff. [07:00] huwshimi: so, there is a slightly larger discussion to have if we decide that a different framework would be better [07:01] lifeless: And most of the plugins that do exist were developed a number of years ago and are not maintained or don't work with new version of YUI [07:01] huwshimi: but I want to be clear that it is a discussion we can have if you want [07:01] lifeless: I would be *very* open to that discussion. [07:02] huwshimi: then I suggest you start it up :) - talk to sinzui, talk to rockstar [07:02] huwshimi: take it to the canonical-rhinos list if those two folk are agreeable [07:02] feel free to cc me [07:02] lifeless: my concern is that we have a lot of existing YUI dependant code [07:02] huwshimi: we had a lot of slow code 6 months ago [07:03] lifeless: Haha [07:03] huwshimi: but we're down to 2.7seconds for our 99th percentile request time [07:06] huwshimi: so big things can be don [07:06] e [07:06] huwshimi: and we've much less yui code than slow code [07:07] Project windmill build #32: FAILURE in 1 hr 8 min: https://hudson.wedontsleep.org/job/windmill/32/ [07:07] Heh. It hasn't failed in more than a week, and it fails while we are discussing its demise... [07:07] Although it looks like it could be a real failure. [07:07] lifeless: Is there a particular reason you're bringing this up (I'm just kind of surprised that you've raised the topic)? [07:08] huwshimi: you whinged [07:08] lifeless: Haha ok [07:08] huwshimi: my job can be summarised as: [07:08] - figure out what makes delivering on our goals hard for our devs [07:08] - and arrange for it to be fixed [07:10] stub: to make sure we are looking at the same thing [07:10] http://paste.ubuntu.com/578212/ [07:11] lifeless: I've very aware that just because I have opinions on (or prefer) things it doesn't mean that I'm right. I wouldn't want to duplicate a bunch of effort in rewriting/training etc. without it being worth it. 
[07:11] huwshimi: indeed, so thats part of the discussion that you need to have [07:13] lifeless: As we're moving to a more javascript heavy version of the site this is probably a good time to discuss it [07:14] lifeless: So the difference between the qastaging query plan and the production query plan is the qastaging query plan starts doing the fti index scan 230ms in, and the production query plan starts doing the fti index scan 1412ms in. That seems to account for most of the difference. [07:15] stub: ok [07:15] stub: how do we do we figure out the 15000seconds case ? [07:16] So on #postgres, seems unanimous you see this when PG is waiting for disk [07:20] stub: -> #postgres then [07:35] Hah. [07:36] The test suite breaks if you have debian-keyring installed. [07:36] \o/ [07:36] Because dpkg-source then has keys to verify some of the packages in the gina test archive. [07:36] It by default allows an unverifiable signature, but refuses to unpack a bad one. [07:36] win [07:43] \o/ the with query works [07:44] wgrant: https://bugs.launchpad.net/launchpad/+bug/221938 [07:44] <_mup_> Bug #221938: Email interface crashes when an attachment file name contains a slash < https://launchpad.net/bugs/221938 > [07:44] lifeless: Currently fixing Windmill breakage (real bug, but it won't affect any production data). [07:44] bah, no wally === mthaddon changed the topic of #launchpad-dev to: Launchpad down/read-only from 09:00-10:30UTC for a code update | https://dev.launchpad.net/ | firefighting: - | On call reviewer: abentley | https://code.launchpad.net/launchpad-project/+activereviews [08:10] :( [08:10] ec2 mail fails if your default bzr email address isn't @canonical.com. [08:11] My instances mail me at my debian.org address just fine [08:11] You're probably not using the canonical.com SMTP server, though. [08:11] I am, and my from address is @ubuntu.com [08:11] So it's fine [08:12] Hmm. [08:12] It stopped working for me today, and I changed my default email address last night... [08:13] wgrant: Happy to share headers if it will help. [08:14] StevenK: bazaar.conf would be handy. [08:17] Hmm, when did smtp.c.c start being youngberry ... [08:17] debian.org might be special cased given the background of our admins... [08:18] wgrant: https://pastebin.canonical.com/44501/ is the relevant bit [08:19] StevenK: Your default bzr email address looks a lot like ubuntu.com [08:19] Which is what I said my From address was ... [08:19] Oh, mail you *at* your debian.org address. [08:19] Right. [08:19] Fail. [08:20] wgrant: You can't talk via smtp.c.c if your From address isn't canonical.com or ubuntu.com. If you want to use something else as your From address, use a different SMTP server. [08:21] StevenK: Sure. But my email address is @canonical.com for my branches... ec2 must not use locations.conf, or it uses some other path. [08:22] I didn't think your bazaar config was copied to the instances ... [08:23] It does some evil stuff. [08:23] Not copying the file directly, but some of its values. [08:28] wgrant: what are you fighting [08:28] lifeless: Hm? [08:29] js [08:29] whining [08:29] whats up [08:29] Bug #732442 [08:29] <_mup_> Bug #732442: disable_existing_builds compares series name to display name < https://launchpad.net/bugs/732442 > [08:29] Given up for now, will ask wallyworld tomorrow. [08:29] stub: hey, so the analyze might have helped ? [08:29] wgrant: oh right [08:29] wgrant: ;( [08:30] I say it's qa-ok, although that is sort of cheating. 
[08:31] lifeless: https://code.launchpad.net/~wgrant/launchpad/unuse-spr-copyright/+merge/52808 may interest you. [08:38] stub: how often do we do full backups ? [08:41] lifeless: daily dumps [08:41] lifeless: No, analyze didn't change anything. [08:45] lifeless: So waiting a while, I still see the initial query with a high startup time and a query immediately after a lower startup time. So I think we must be seeing the effects of shuffling data between disk cache and the pg shared memory area. [08:46] stub: thats plausible [08:46] lifeless: Currently, the shared memory area is set to 5GB on all the production boxes, which is the high side of best practice and when people start seeing degraded performance. [08:46] stub: can we change this during this downtime ? [08:46] stub: ah [08:46] stub: how can we test to see whether we would suffer [08:47] lifeless: I'm considering bumping it up to 7GB on one of the slaves. Even though we will be going 'too high' in most peoples opinions, we do have an unusual load so best practice might not apply here. [08:47] We have run with 8GB before, no particular ill effects, so I don't think it will hurt. And we can then check out differences in performance between the two slaves. [08:47] +1 [08:48] But still, even 8GB will be lower than the hotset of data. [08:48] whats it set to on staging? [08:48] probably about 3... [08:48] * stub checks [08:49] Of course, it could be counter intuitive and lowering the value might work better ;) [08:49] Hah [08:49] 2GB on staging [08:49] good morning [08:49] Hm, no. [08:49] stupid ec2. === almaisan-away is now known as al-maisan [09:00] morning all [09:01] Morning jam. [09:02] man, living in a country where you don't speak the language causes all sorts of web confusion [09:02] every site wants to default me to Dutch [09:02] stupid geoip :) [09:05] maybe google is just on crack. Because I manage to get to the "Settings" page, set my language as English, looks ok [09:06] go back to account settings, and everything is back in dutch [09:06] yeah that's annoying when travelling to sprints [09:09] signing out and back in again seemed to help in the end [09:09] but yeha [09:09] yeah [09:09] wgrant: are you watching the upgrade? I'll be happy to start testing/monitoring once things are up [09:10] jam: I'm watching. [09:10] * wgrant opens the graph. === adeuring1 is now known as adeuring [09:27] wgrant: looks like bzr-sftp still isn't wanting to shut down cleanly. getting the "cannot shutdown reactor that isn't running" [09:28] Maybe that is the code that wants to always cleanly shut down, waiting for the last connection to exit [09:28] jam: There were 5 connections remaining when it was forcibly killed. [09:28] wgrant: k, where do you get this info? Maybe I'm in the wrong channels? [09:30] jam: #launchpad-ops is where it happens, and you seem to be there. [09:30] But I checked the connection count myself. [09:30] how did you check it? [09:30] Yeah, I'm there, but I haven't been following as much as I should :) [09:32] jam: Certainly machines can see bazaar.launchpad.net:8022. [09:33] wgrant: "certain" machines ? [09:33] Um, yes, that. [09:33] wgrant: devpad doesn't appear to be one of them [09:38] wgrant: do you know the rsync module for the crowberry-sftp-log subdir? [09:39] jam: I did... let me check. [09:39] Could be sftp-logs [09:39] logs-sftp [09:39] crowberry::sftp-logs/ => unknown module [09:40] 20:39:26 < wgrant> logs-sftp [09:41] yeah, just found that myself [09:41] Great. [09:43] Hi adeuring! 
[09:44] hi henninge [09:44] jam: Looking OK? [09:44] Did you change anything on the branch since you last pushed it? [09:44] adeuring: ^ [09:44] wgrant: so far, haven't gotten to the end. the sftp service started slightly before the forking service, and was already getting connection requsetts [09:44] henninge: I just merged devel yesterday evening. nothing else yet [09:45] yes, I saw that [09:45] jam: We do have slight ordering issues in both directions :( [09:45] adeuring: You used "sharing status" throughout while I had been talking about "sharing details". [09:45] wgrant: as long as both are up when we consider the site 'live' I'm not really worried :) [09:46] Response times are still fine from here. [09:46] adeuring: any particular reason for that naming? [09:46] No graphs yet. [09:46] henninge: no particular reason. I'll change it to details [09:46] adeuring: I can do that. [09:46] henninge: ok [09:47] adeuring: I finished the dummy template. [09:47] henninge: sounds great [09:47] adeuring: you will have to add the conditions and such from the view. [09:47] henninge: ok [09:48] adeuring: I will push that when I am done with the renaming. === allenap changed the topic of #launchpad-dev to: Launchpad down/read-only from 09:00-10:30UTC for a code update | https://dev.launchpad.net/ | firefighting: - | On call reviewer: allenap | https://code.launchpad.net/launchpad-project/+activereviews [09:48] * allenap assumes abentley is no longer reviewing. [09:51] In a slightly related topic, is Firefox history editable? === Ursinha-afk is now known as Ursinha [09:51] (I have a whole bunch of edge URLs there that need to die. Slowly.) [09:52] allenap: https://code.launchpad.net/~stevenk/launchpad/derive-common-ancestor/+merge/52796 (and thanks!) [09:52] StevenK: Got it. [09:53] allenap: afk for dinner, if you have questions queue them up here and I'll answer them when I can. [09:54] StevenK: Cool. [09:57] wgrant: is it just that you're machine is allowed through the firewall to :8022? [09:58] jam: I guess it must be. I presumed carob could too, but I guess not. [09:58] wgrant: well 'w3m http://bazaar.launchpad.net:8022' didn't do much for me [09:58] could be a proxy issue [09:58] Oh, it's certainly not going to work externally :) [09:59] no, from carob [09:59] but I can do wget just fine [09:59] so good enough, I guess [09:59] 134 conn [09:59] still says "unavailable" though, which is odd [09:59] I though spiv had a fix for that [10:00] wgrant: I'm a bit surprised about IP addresses that have 10+ connections active [10:00] (wget | sort) [10:01] jam: spiv couldn't reproduce it [10:03] just jumped to 246 conns... [10:05] wgrant: but I'm not seeing any failures, yet [10:05] https://code.launchpad.net/~jml/launchpad/what-is-in-the-web-ui/+merge/52594 up for review [10:05] note that conns includes ones that aren't authenticated, IIRC [10:05] jam: 246? That's not good. [10:08] not many access failures in the log, though [10:09] wgrant: 259 [10:09] but no failures in the other logs [10:09] adeuring: I will have to wait for the roll-out to finish before I can push my changes. :( [10:09] henninge: no problem [10:09] henninge: Hm? It should all be back now. [10:09] oh, already? [10:09] cool [10:10] henninge: Codehosting has been back for like 25 minutes. [10:10] jam: How are your connection times? Up to 8s here ;/ [10:10] oh, I just didn't try. 
thanks wgrant [10:12] adeuring: pushed [10:12] henninge: I'll look [10:13] adeuring: "pull" is the right term ;-) === mthaddon changed the topic of #launchpad-dev to: https://dev.launchpad.net/ | firefighting: - | On call reviewer: allenap | https://code.launchpad.net/launchpad-project/+activereviews [10:48] wgrant: https://code.launchpad.net/~jameinel/lp-production-configs/disable-forking/+merge/52818 [10:49] jam: LOSAs review those. [10:52] wgrant: https://lpstats.canonical.com/graphs/CodehostingPerformance/20110310/20110311/nocache/ [10:52] the 900s spike is going to kill the graph for a long time to come... :( [10:52] Heh, yes. [11:05] allenap: Thank you for the review! [11:05] StevenK: You're welcome :) [11:07] allenap: How did you find the trailing whitespace? [11:07] StevenK: I load the diff into my editor, and I've set it to show trailing whitespace as big red blocks. I can't miss it :) [11:08] allenap: Is your editor vim? [11:08] StevenK: The other one. [11:08] Heh === henninge_ is now known as henninge === al-maisan is now known as almaisan-away [11:48] I'm trying to set up lp on a new virtual host, but it is giving me failures during "make-lp-user". [11:48] https://pastebin.canonical.com/44511/ [11:48] any ideas? [11:50] Ah,. [11:50] Well then. [11:51] Lucid? [11:51] yes [11:51] Did you use rocketfuel-setup? [11:51] yep [11:52] and then tweaked after for being in a VM [11:52] but not much changed there [11:52] You ran launchpad-database-setup? [11:52] make schema should have failed without it, but it's possible you did enough manually to unbreak it. [11:53] There should be "Launchpad configuration" bits at the end of postgresql.conf. [11:53] my internet connection is up and down rightn ow [11:54] wgrant: running it directly [11:54] I did very little manually [11:54] but I didn't see a Launchpad configuration bit at the end. [11:55] and I did have earlier problems connecting to postgres [11:55] jam: Try running it again, I guess. [11:55] It is meant to change the default search path. [11:55] yeah, running it now, but have to wait for "make schema" to finish [11:56] wgrant: looks like that did it. Thanks! [11:57] jam: Great. [11:59] 2011-03-10 09:40:34+0000 [SSHChannel session (0) on SSHService ssh-connection on ProtocolWrapper,36,80.11.180.42] Forking returned pid: 18870, path: /tmp/lp-forking-service-child-kwWa4B [11:59] echan, but yeah. [12:00] wgrant: that was the first death? [12:01] jam: It's the first one that's still alive at the end. [12:01] wgrant: if you grep around there, you can see what user it was [12:02] Indeed, but that was less than a minute after the service started... [12:03] wgrant: yay, even have codehosting serving on 5022. Though I wonder if there is a way to do that without hacking the source code. [12:03] so I don't have to worry about accidentally committing that [12:04] jam: You could possibly create a config overlay and run it with that. [12:04] wgrant: wouldn't integrate well with 'make run_codehosting' I imagine [12:05] jam: 'LPCONFIG=mycustomconfig make run_codehosting' should work. [12:06] Morning, all. [12:31] leonardr: Hi. [12:32] wgrant, hey [12:32] leonardr: What is the recommended way to use a keyringed launchpadlib from a cron job? [12:33] wgrant: you need to store the credential in an unencrypted file, and pass it in as credentials_file [12:33] see section 5 of https://lists.launchpad.net/launchpad-users/msg06239.html [12:34] :( [12:34] OK. 
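A minimal sketch of the keyring-free cron setup leonardr describes at 12:33: store the credential in a plain file and pass it in as credentials_file. The script name and path are placeholders, and the exact login_with() signature may vary between launchpadlib versions:

    from launchpadlib.launchpad import Launchpad

    # The first (interactive) run writes the credential to this file;
    # later cron runs reuse it without touching the desktop keyring.
    lp = Launchpad.login_with(
        'my-cron-script', service_root='production',
        credentials_file='/home/me/.lp-cron-credentials')
    print(lp.me.name)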
[12:53] Project windmill build #33: STILL FAILING in 1 hr 10 min: https://hudson.wedontsleep.org/job/windmill/33/ [13:19] Hey all, what the update frequency for the downloads stats ? [13:20] as in https://launchpad.net/bzr/+download for example === almaisan-away is now known as al-maisan [13:34] hi mrevell [13:35] hi bac === mrevell is now known as mrevell-lunch [13:52] anyone else here use Empathy? I'm trying to use it, since it integrates with the OS (and is the suggested default) [13:52] but it only really shows about 10-20 lines of actual content per page [13:52] with all the extra formatting [13:52] Is there a way to make all the bubbles smaller? [14:00] henninge, adeuring -- ping for standup [14:01] bigjools: Hi. === jcsackett changed the topic of #launchpad-dev to: https://dev.launchpad.net/ | firefighting: - | On call reviewer: allenap, jcsackett | https://code.launchpad.net/launchpad-project/+activereviews [14:05] allenap: thanks for the review. I had posted some updates as you were doing it (to docs, mostly) and have now fixed the zcml problem [14:06] jml: Cool, I'll have a look. [14:08] jml: Did you consider parsing the zcml directly? [14:08] allenap: to what end? [14:09] allenap: I mean, yes I did, but then I thought that all I'd be doing is constructing objects much like the adapters that Zope provides in the first place. [14:09] jml: Because it might end up being less hacky... I haven't given it much thought, but wondered if you had. [14:10] allenap: it might fix some of the hackiness in format_page_adapter, but not much else. It would come at a cost of parsing ZCML myself [14:10] jml: I guess the only thing you might avoid is filtering through the mro to find the real view. [14:10] allenap: yeah, exactly. [14:10] hmm. [14:11] jml: Yeah, and processing includes. Okay. I don't think the mro thing is too bad actually. [14:11] another approach would be overriding the browser:page handler [14:11] but I'm not sure that would be less hacky, since our custom handler is specified in zcml also [14:12] so I'd have to find that ZCML and somehow exclude it. [14:13] also, Launchpad has 462 different types of page. === mrevell-lunch is now known as mrevell [14:20] henninge, I'm wondering about this "deactivate translation imports on a per package basis" card. Is there a bug for this? [14:24] allenap or jcsackett, could you take a look at https://code.launchpad.net/~leonardr/lazr.restful/operation-must-be-versioned/+merge/52858? [14:24] And I have one that's particularly relevant to allenap: https://code.launchpad.net/~jtv/launchpad/bug-730460-job-class/+merge/52857 [14:24] allenap: i'll take leonardr, you take jtv? [14:24] Sounds good. [14:25] leonardr: this the one that's been causing you and sinzui so much pain? [14:26] deryck: hm ... [14:26] oh! [14:26] deryck: no, there is not. [14:27] deryck: It came up in a discussion and is something that still needs to be done before the feature can really go live. [14:27] henninge, why do we need per-package disabling? [14:27] deryck: but it should be quite simple [14:27] we just need to find the place in the code where it needs to be done. [14:27] henninge, can't you already disable by changing the template name? [14:28] deryck: once a sourcepackage has translation sharing set up, imports from package uploads must stop. [14:28] deryck: it should only import the template from then on. [14:28] deryck: I don't know what you mean about changing thetemplate name. [14:28] deryck: I will file a bug and explain there. [14:29] henninge, ah! I get you now. 
I misunderstood. [14:29] henninge, yes, please file a bug and tag it with the story. [14:35] wgrant: I thought you said PQM was way faster now. [14:39] deryck: bug 732612 [14:39] henninge, thanks! [14:39] deryck: updated card [14:40] henninge, thanks, again! :-) [14:40] jcsackett: no, i'm only working a half day today so i'm not even touching that one [14:41] leonardr: sounds wise. :-) [14:42] oops, forgot to push the latest version [14:42] it's pushed now [14:44] abentley, bug 719521 is fix released, yes? [14:44] leonardr: i gather this branch is part of helping us not randomly bork the 1.0 version of the webservice? [14:45] jcsackett: exactly [14:45] deryck: I don't think so. [14:45] abentley, no? The card is in done-done. [14:45] deryck: I can move it back if you'd like :-) [14:45] abentley, heh. well, I don't know. :-) What's left to do? [14:46] deryck: the code is in place, but the configs need updating and the cron script needs to be configured before unlinking will actually split translations. [14:47] abentley, ok, the cron script is the same for linking, right? [14:47] deryck: right. [14:47] leonardr: r=me. [14:47] abentley, ok, so I will close the bug as fix released. Since the coding is done. We have a card for the configs. and the cronscript is an RT away. [14:48] jcsackett: just in time! [14:48] :0() [14:48] deryck: from an end-user perspective, no fix is released, so we could get pushback. [14:48] replate that with ":-)" [14:49] deryck: but of course it's your call. [14:49] abentley, thanks, and I can live with that risk. also, bug 696009 is fix released, too? [14:50] I suppose jml was thinking låünchpäd [14:50] deryck: yes. [14:50] abentley, great, many thanks for the chat. [14:50] jtv: huh? [14:50] deryck: np [14:51] jml: wasn't that you? About sticking umlauts on the name? [14:51] jtv: oh, right. === al-maisan is now known as almaisan-away [14:51] "It's not bad, for a two-umlaut site" === almaisan-away is now known as al-maisan [14:51] :D [14:52] Låüñčhpäđ [14:52] abentley, sorry, one more. bug 706005 is released? [14:52] jtv: lauñçħpad? [14:53] Funny how most of the diacritics I can think to stick on there are actually more or less appropriate… the first "a" really is like "å," the "n" really sounds like "ñ," and so on. [14:53] abentley: what's the "h-like" letter? [14:53] deryck: yes, but not deployed, like 719521. [14:54] Ah, forgot one… Łåüñčħpäđ [14:54] abentley, gotcha. thanks. [14:54] jtv: No idea, I've just seen it in names. [14:55] You clearly hang out at cooler clubs than I do. [14:56] deryck: should I be using YUI attributes or normal Javascript object properties by default? [14:56] allenap: I'm going to have to call it a day… can we continue the review offline? [14:56] And by "we" I mean you. [14:57] jtv: Sure, no worries. Have a good evening :) [14:57] Thanks. :) [14:58] abentley, YUI attrs. Since it makes it obvious when you're doing get and set on an attr. [14:58] and not silently adding something that didn't previously exist. [15:00] deryck: coming from python, the risk of silently adding something that didn't previously exist doesn't seem very serious. [15:02] abentley, yeah, maybe it's not in javascript, too. but behavior is not always clear. do I check for undefined or null if I want to be sure, for example? Using YUI attrs offers a consistent API, among other benefits. [15:06] henninge, adeuring -- I filed bug 732633 about your work, and made it team assigned. please link branches there. 
[15:06] ok [15:06] wgrant: never mind, I'm behind an lp-production-configs branch. [15:08] gah. can't save two cards with the same bug id again. I thought this was fixed. [15:08] adeuring, because of bug ^^ your card lacks the bug link. sorry. [15:10] 732633 [15:12] abentley, and I made bug 732639 for the js work if you want to link branches there. [15:13] deryck: okay. [15:19] WOA, stop adding tests to Launchpad. "13373 tests run..." :-) [15:21] sinzui: when we were talking about modifying distroseries (to clean up this registrant/owner thing), you said I had two options: migrate the field owner to registrant or add a registrant field that would return the content of the owner field. [15:21] sinzui: At first I thought the most clean way to do this was to migrate the field but after looking at the code I'm not so sure because this object would then differ from most of the others ... any thought on this? [15:23] rvba: firstly, distroseries and productseries, being series should be the same in this case, I suspect milestones and releases should be the same. 1, the real owner is always the project/distro owner. The creator is the registrant and has no power. So all four objects use owner when we mean registrant [15:25] ^ well the 'firstly' and '1' never got answered in that sentence [15:35] sinzui: should I be waiting for a secondly or a 2. then :-) ? [15:35] sorry, did I miss a message? I had display issues so I left the channel for a moment [15:37] rvba I think the distroseries issue is larger than you started with. You are looking at a group of objects that could have different owners in 2005, but, by 2007, the project/distro became the owner in the permissions rules, so we reused the owner field as the registrant [15:38] sinzui: so I guess your advice is to fix the interface then [15:38] creating a registrant field returning the content of owner [15:39] sinzui: or do you think I should engage in refactoring the schemas to create a proper registrant field [15:39] rvba: storm does not require that the field name be the same as the db column. Several objects differ. The simple fix might be to rename the owner => registrant in the interface and model for the objects, and ensure that column=owner [15:40] * sinzui looks for example [15:40] sinzui: I get it, seems like a very simple fix [15:40] yes. I might have done it years ago if I had this conversation. [15:41] once this is done, it's really nothing to migrate the data properly and rename the field in the database for consistency [15:42] well, I guess :-) [15:44] great, the models look good, they all specify dbname="owner". rvba: I think you can change the interface and model attribute names owner => registrant of distroseries, productseries, and productrelease [15:44] henninge: do you expect the sharing details page to control translation permissions or translation group? [15:45] sinzui: all right ... so I think it's most clean to migrate the db column as well [15:45] rvba: So you will need to update all the callsites and templates, but since this information is not very important, you will not need to change many. [15:46] abentley: we need to set the permission to "Closed" if the project is unmaintained. [15:46] rvba: you certainly can change the column name [15:46] sinzui: ok, that is pretty much what I did for the distributions so I think I'll manage thx [15:46] henninge: how would we do that from the sharing details page?
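A sketch of the rename sinzui describes around 15:39-15:44: expose the attribute as registrant while the database column stays owner, since Storm lets the property name differ from the column name. The class shown is illustrative; the real series models may instead use the older SQLObject-style ForeignKey(dbName='owner', ...) spelling:

    from storm.locals import Int, Reference, Storm, Unicode

    class ProductSeries(Storm):
        __storm_table__ = 'productseries'
        id = Int(primary=True)
        name = Unicode()
        # Attribute renamed; the underlying column is still 'owner'.
        registrant_id = Int(name='owner')
        registrant = Reference(registrant_id, 'Person.id')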
[15:47] sinzui: you'll get to be the reviewer on this though :) [15:48] okay [15:49] abentley: actually, thinking about it, I am not sure that we need to do it through ajax. [15:49] sinzui:thanks a lot [15:49] abentley: it's just that projects are created with "OPEN" permissions by default and that needs to be changed if the project is not using Launchpad. [15:50] I am happy to help [15:50] henninge: so if a project is unmaintained, then it's owned by Registry Admins, and the logged-in user probably can't change the setting. [15:50] abentley: so it's "if product.translation_usage != LAUNCHPAD: permission =CLOSED" [15:50] true. [15:51] so it's probably something that we need to change in the product creation itself. [15:51] has leonardr been about? I really hope not [15:51] henninge: oh dear. You add the card :-P [15:52] sinzui: did anybody ever answer your question about reviewing translation imports? [15:52] abentley: gee, thanks ;) [15:53] abentley: henninge about the previous conversation. Users often ask for the project back or ask me to change settings once they realise that giving the project to ~registry excludes them from configuring it... [15:54] maybe we can do it in a way that it is owned by the creating user until all is done? [15:55] abentley: henninge: I proposed they we let anyone or trusted users be permitted to configure unconfigured projects or those owned by ~registry. The conversation went into audit logs and new showing permissions in the UI. So that was way out of scope. I stopped working on the bug [15:55] well, I at least managed to track down the cause of one of my critical failures today (bug #732481). Always funny when releasing code X triggers a bug in code Y. [15:56] anyway, EOD for now. Maybe I'll be back on to finish it later tonight. [15:56] henninge: as to my question, no, it was not answered. I am trying to answer the translations questions, but I am very slow at providing answers [15:56] mthaddon: if you are still around, can you post what you did to make HAPROXY use a GET request instead of a HEAD request, we probably want to revert it once we get bug #732481 fixed. [15:56] sinzui: ok, let me see if I can shed some light on that [15:57] henninge: If we did that, it would cover the case where the upstream project is new. [15:58] abentley: yes, that is the one I am mostly thinking of. In the other case they'd need the help of the project's owner. [15:58] henninge: which might be ~registry. [15:58] oh [15:58] abentley: henninge: we did want project registration to be guided, take the user through all the configuration screens. Such a process should make the ~registry the owner at the end [15:59] abentley: in that case we'd have to make the current user the owner temporarily. [16:00] but how to know when "temporarily" ends? [16:00] henninge: I don't know. [16:01] henninge: we should be able to write a script to ensure all the ~registry-owned projects already have the right settings. [16:15] gary_poster: if a tarball in dists is not mentioned in versions.cfg, I can completely remove it right? CherryPy has a zip file but is not in versions.cfg. [16:16] bigjools, that *should* be true, but fwiw, versions.cfg is about enforcing the versions, not about enforcing their presence or absence. One sec, there is an easy way to doublecheck... [16:18] bigjools, there may be a nicer way, but ./bin/buildout -vvv | grep 'Getting required' will get you a list of everything that buildout wants. 
you could grep the same output for CherryPy if you like [16:19] excellent, thanks [16:19] and fwiw, no, we don't use it [16:19] (having just tasted my own medicine) [16:19] :) [16:20] I was gonna leave 2 versions of everything "just in case", but bugger it, I am going to be ruthless. [16:20] bzr is used for a reason [16:20] :-) being ruthless will require being careful, but I admire the idea [16:21] and now, just after a rollout, is a good time to do it [16:21] as long as both db-devel and trunk build locally, I'm happy [16:21] exactly! [16:22] gary_poster: here's a weird one - Jinja 2.2 is required in versions.cfg but it doesn't exist [16:23] Jinja2 even [16:23] oh bugger [16:23] yes it does :) [16:23] bigjools, as I said, versions.cfg is just about enforcing version numbers. presence or absence is a dependency tree based on setup.py declarations. So it sounds like you can remove it from versions.cfg. buildout will complain if it doesn [16:24] oh well there you go :-) [16:24] 2.4.1 and 2.5.5 are there though, I wonder why they're not used [16:24] someone was unduly optimistic, I'm guessing :-) === deryck is now known as deryck[lunch] [16:25] very likely :) [16:26] and other words that are spelled right [16:26] heh [16:32] henninge: should ~rosetta-admins really own translation teams? I see from many teams listed on the +subscribe page that I own https://launchpad.net/gedit-plugins/+subscribe [16:33] henninge: When I see odd teams like this, they are often because someone made ~registry own a team, which is a no-no. So I either delete the team or make someone else an owner. [16:34] sinzui: no, I don't think we should be the owner. [16:34] they should be owned by someone from that team [16:35] okay, I will put that on my todo list [16:35] thank your [16:35] thank you! [16:36] who has filed the most bugs that are still open? http://paste.ubuntu.com/578421/ (yeah, I'm easily distractable today) [16:37] before I look, I think it is mpt [16:37] sinzui: you win [16:37] Project windmill build #34: STILL FAILING in 1 hr 9 min: https://hudson.wedontsleep.org/job/windmill/34/ === al-maisan is now known as almaisan-away [16:49] sinzui: """mars 08 16:46:20 It is not possible BTW to change a series owner. We removed the field from edit forms""" ... but the /ubuntu/hoary/+reassign is still there (and tested) ... is there something I don't understand or is this wrong? [16:53] jcsackett, the difference is that the upload processor is the lp_queue user and the ftp server lp_upload [16:53] So basically we are questioning what exactly getUtility(IGPGHandler).getVerifiedSignatureResilient() verifies [16:53] but you earlier had the gpg conf copied from one to the other, right? so our simplest solution is hosed? :-P [16:54] jcsackett: lp_upload had no gpg.conf at all [16:54] we copied lp_queue's [16:54] maxb: yes [17:01] okay, i don't see anything obvious in getVerifiedSignature. there's a fair bit of cruft in there XXX notice-wise, but nothing obviously weird. [17:02] in lib/lp/archiveuploader/dscfile.py look for its usage in there - it's no different to the ftp server [17:02] * bigjools scratches head [17:02] abentley: how can I figure out if a translation sharing job is pending for a source package? [17:03] sinzui: can you think of anything that would make IGPGHandler.getVerifiedSignatureResilient return an error in one account but not another? [17:03] bigjools: It couldn't possibly be simply a difference in where stderr points in the two processes, could it?
[17:04] maxb: maybe, why would that have an effect? [17:04] The only problem here is an unnecessary warning, right? [17:04] Could that just be harmlessly disappearing into an ignored logfile in the upload processor? [17:05] adeuring: You'll need to write a new query. [17:05] abentley: ok... so, other question: where are these jobs created? (I need a straing point to look around ;) [17:05] ...starting point.. [17:06] maxb: the stderr goes to the log in both, AFAIK [17:06] adeuring: Are you sure you don't want me to point out similar queries instead? [17:06] abentley: well, that would be fine too :) [17:07] hrm. So I tried running a verify locally in an interactive python interpreter and got none of that output [17:08] maxb: the pastebin output is from dput though [17:08] adeuring: see registry.model.packagingjob.PackagingJob.iterReady() and translations.model.translationpackagingjob.TranslationPackagingJob.iterReady for inspiration. [17:08] on the server side, it needs to see getVerifiedSignatureResilient throwing exceptions to return errors down the ftp session [17:09] abentley: thanks [17:09] *blink* [17:09] bigjools: Oh, I've completely misunderstood the issue :-) [17:09] maxb: yeah :) [17:09] adeuring: But instead of checking whether the jobs are ready_jobs, you want to check whether their status is "pending". [17:09] ok [17:10] adeuring: Actually, it's JobStatus.WAITING [17:11] thanks [17:11] bigjools: gah, ok. So we need to trap whatever the exception is that is supposedly bubbling out of PoppyFileWriter.close() and breaking things? [17:11] maxb: yup. [17:11] adeuring: you probably also want to look for running jobs. [17:11] And that's not being logged anywhere useful at all currently? [17:12] right [17:12] trying to get to the logs right now, they're a bit out of date... [17:12] jcsackett: bigjools: the keyserver issue I was looking at was that keys from keyserver.ubuntu.com were taking 24 hours to get to canonical's keyserver [17:12] dig. not related. [17:12] yeah [17:15] rvba, mars was confused at the time. remember that the owner field on series has no power, users think it does so they want to change it. It is really the registrant. Some users think the registrant has power, but it does not. We should never let users edit the historical record of who registered it. === beuno is now known as beuno-lunch [17:16] sinzui: I'm sorry "mars = march" in french, this was a quote from what you told me yesterday. [17:16] rvba: just in case you ask, the driver of a series is called a release manager. That user can edit the series and can create milestones. The project owner delegates power over the series to the RM so that the user/team can accomplish their task [17:17] sinzui: so for the distroseries, productseries, and productrelease it should not be possible to reassign them right? [17:18] correct [17:18] sinzui: all right ... I'll have a few tests to refactor then ;-) ... [17:19] I meant I'll have _quite_ a few tests to refactor [17:23] rvba: do we have tests that show changing the owner of a series? [17:24] sinzui: ./lib/lp/registry/stories/distroseries/xx-reassign-distroseries.txt [17:25] * sinzui looks [17:25] rvba: that ancient test can be deleted [17:26] sinzui: so I figured if series don't have an owner ;-) [17:28] rvba: a story test is a user acceptance test. The narrative is written from the user perspective and shows a simple path to accomplish a task. That test does not state who the user is, what his task is, how he accomplishes it, or how he knows he is done.
When you see stories like this, or testing error conditions, there is a good chance you can delete instead of refactor [17:29] ok [17:31] allenap: could you please take a look at https://code.launchpad.net/~jml/launchpad/reported-by-me-121646/+merge/51148 [17:31] allenap: would appreciate someone familiar w/ bugs taking a look at it. [17:31] sinzui: I'll have to go someplace now ... but I think I'm ok to continue, I've made the structural changes (interface, model, sql patch) and then I correct things bit by bit as tests fail. Good to know that I can delete tests also ;-). [17:32] Have a good evening [17:34] thanks a lot for your support. I think I'll have something for you to review when you log in tomorrow. === deryck[lunch] is now known as deryck [17:39] sinzui: hi, did you get my email about the ftp server bug? [17:54] bigjools: I did and I subscribed to the bug [17:55] sinzui: great, not sure out of you and tim who's best suited to fix it [17:55] we will talk about it in a few hours [17:55] something is arsed up with the logging which causes gpg verification to fail [17:55] jml might be able to help === beuno-lunch is now known as beuno [19:16] what [19:16] I don't help, I *strategize* [19:16] thats very strategic of you [19:17] lifeless: hi [19:17] ola [19:17] there are a bunch of critical bugs for loggerhead that are fix committed [19:18] what has to happen to get them fix released? [19:18] I'm not sure [19:18] hmm. [19:18] we could do what we do with lp [19:18] and say 'once lp has deployed it, we're done' [19:19] encourage the packagers of loggerhead to package trunk [19:19] or we could do what we do with e.g. lazr.restful [19:19] and cut releases regularly [19:21] lifeless: we have a regular release for lazr.restful? [19:21] jml: we make a release when we fix something [19:21] oh right. [19:22] if someone files a private bug to a project where im administrator of, why cant i view the bug? [19:22] Ronnie: are you the bug supervisor as well ? [19:22] lifeless: how can i see that? [19:23] its on the front page for the project IIRC [19:24] lifeless: the project-group where im admin of, is maintainer and driver [19:24] Ronnie: whats the project group and project in question [19:24] https://staging.launchpad.net/ubuntu-nl-artwork [19:24] sinzui: i think you and i got irritated by pqm at the same time. [19:25] and bug: https://api.staging.launchpad.net/1.0/bugs/728920 [19:25] the bug is created with a script we're testing [19:27] jcsackett: have you submitted a fix [19:28] Ronnie: https://bugs.staging.launchpad.net/ubuntu-nl-artwork [19:28] sinzui: not yet submitted, but https://code.launchpad.net/~jcsackett/launchpad/resolve-conflicts has no text conflicts and the conflicted test file passes. [19:28] Ronnie: see the bug supervisor is a team, not you [19:29] i'm not 100% certain it's a good resolution, as i wasn't entirely certain what some of the changes were doing. [19:29] Ronnie: have you added yourself to the team ? [19:29] https://staging.launchpad.net/~ubuntu-nl-artwork/+members (name=Ronnie) [19:29] jcsackett: if db-devel does not conflict with stable, someone posted a fix without saying so [19:30] oh [19:30] im also the owner of the team [19:30] I'm fixing that too [19:30] Ronnie: I know it shows you as an admin, but https://staging.launchpad.net/~ronnie.vd.c/+participation [19:30] sinzui: no, there were conflicts when i merged stable into a checkout of db-devel. [19:30] jcsackett: I'll look [19:30] jml: you mean pqm conflicts?
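Circling back to the translation-sharing-job question abentley answered a little earlier (use `PackagingJob.iterReady()` as inspiration, but filter on `JobStatus.WAITING` and also consider running jobs), a rough sketch of what that check might look like. The import paths and column names are assumptions modelled on the names quoted in the log, not Launchpad's exact model.

```python
# Hypothetical sketch only: attribute names and import paths are assumptions
# based on the iterReady() implementations mentioned above.
from lp.registry.model.packagingjob import PackagingJob
from lp.services.job.interfaces.job import JobStatus
from lp.services.job.model.job import Job


def has_pending_sharing_job(store, productseries, sourcepackagename):
    """Is a translation sharing job waiting (or running) for this package?"""
    pending_statuses = (JobStatus.WAITING, JobStatus.RUNNING)
    result = store.find(
        PackagingJob,
        PackagingJob.job_id == Job.id,
        Job._status.is_in(pending_statuses),       # _status is an assumption
        PackagingJob.productseries_id == productseries.id,
        PackagingJob.sourcepackagename_id == sourcepackagename.id)
    return not result.is_empty()
```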
[19:30] Ronnie: you may not actually be in the team [19:30] jcsackett: yeah. I'll take a look at your branch. [19:30] Ronnie: can other folk in that team see the bug ? [19:30] jml: sinzui: okay. it's the garbo file i'm not sure i cleaned up right, but again, test_garbo passes. [19:31] I got distracted while make schema was running [19:31] jcsackett: yeah. I'll check to see if your fixes match mine [19:31] jml: cool. [19:31] jcsackett: did you just union the changes? [19:31] jml: yeah, i just whacked all the conflict gunk. [19:31] most naive fix possible, but tests pass, so... [19:31] jcsackett: ok. that's what I did. And, as you say, the tests pass. [19:31] jcsackett: sorry for misunderstanding. I am struggling to concentrate this week. Too many issues in my head [19:32] jcsackett: go ahead and land it, r=jml [19:32] jml: dig. [19:32] jml: skip ec2? [19:32] jcsackett: I would. [19:32] jcsackett: thanks for resolving it, those mails are a pain. [19:33] jcsackett: yes, submit to pqm [19:34] okay, I will now fix leonardr's bugtask branch [19:34] http://webnumbr.com/.join(launchpad-oops-bugs.all,launchpad-timeout-bugs.all,launchpad-critical-bugs.all) [19:36] \o/ no critical bugs [19:36] * sinzui opens the champagne. [19:36] sinzui: heh, not yet :) [19:37] lifeless: another member of that team cant access the bug [19:38] ok [19:38] then its probably your script thats at fault [19:38] the rules for visibility on private bugs are: [19:38] - if you are subscribed, you can see it [19:38] - if you aren't, you can't. [19:39] the default for private bugs filed through the web ui is to subscribe the security team or failing that the bug maintenance team [19:39] new_bug = launchpad.bugs.createBug(title=mail['Subject'], description=message, private=True, target=forum_project, tags=tags) [19:40] lifeless: thats the line that creates the bug [19:40] lifeless: Ronnie: has the bug supervisor changed? Lp really does not let you change it once set. Changing the bug supervisor will not change the permission on existing bugs. eg, the old supervisor still has access, and the new one does not [19:40] This scenario is a common way users shoot themselves in the foot [19:41] sinzui: i didnt change the team settings for months now [19:41] and we're testing the script today [19:41] so first step [19:41] I'll file a bug for you manually [19:41] see if you can see it [19:42] bug 728921 [19:42] lifeless: nope, private [19:43] ow wait, i used the api link, moment [19:43] lifeless: i can see the bug you made [19:43] great [19:43] so its the api made bug not having subscribers [19:48] really! surely such an issue would have been reported years ago [19:48] * sinzui looks [19:50] Ronnie: lifeless bug 398846 says it is low. I think that is insane since you can make it private at the same time [19:50] lifeless: High at least, do you think it is critical [19:50] https://bugs.launchpad.net/launchpad/+bug/398846 [19:51] I see I glanced at the problem from a permissions perspective [19:53] is there some way to create the bug non private, add a bug supervisor and then make the bug private? [19:55] Ronnie: the usercode running the script can subscribe the security team [19:57] I've commented on the bug [19:58] bug.subscribe(person=yoursecurityteam) [19:58] you can probably get the security team from the api [19:58] lifeless: ill try [20:04] jcsackett: Do you need a mentored review for jml's branch from earlier? [20:05] allenap: no, i'm a graduate.
:-) [20:05] jcsackett: Woohoo :) [20:07] allenap: it's already in EC2, but I would appreciate a quick glance if you've got the time [20:07] jml: Sure. [20:07] allenap: mostly because I'm not familiar with searchTasks and maybe there are booby traps. [20:11] jml: jcsackett: before you land that patch [20:11] are you aware you are going to cause timeouts ? [20:11] lifeless: The beauty of that portlet is that few people will notice ;) [20:12] allenap: /everyone/ will notice [20:12] lifeless: no, I'm not. [20:12] allenap: and we should care deeply about it [20:13] lifeless: why is it going to cause timeouts? [20:13] jml: because 'reported by me' is an expensive query that times out regularly on Person:+bugs [20:13] lifeless: I'm being silly. But in this case, the portlet with links is rendered in the initial request, and the stats are calculated and slotted into place after the fact. So, only the people who read the numbers will notice. But, with my serious hat on, yes, we should care. [20:14] allenap: folk will notice, and think less of LP; folk will notice and ask in #launchpad; we'll have added another bug to the critical pile we're trying to reduce. [20:14] lifeless: the search never times out [20:14] lifeless: at least, not for me. [20:15] bug 421901 [20:15] jml: you're adding the overhead of the search to the overhead of the bug counts portlet [20:16] lifeless: it's a different query [20:18] lifeless: that worked: new_bug.subscribe(person=project.owner_link) [20:19] \o/ pqm success. [20:19] Ronnie: cool [20:19] thx all [20:20] jml: reported by in ubuntu is up to 4 seconds of overhead. [20:20] well that was a huge waste of time [20:21] jml: you've added 14 bugs in ubuntu, so for you the query will be fast [20:21] at the other end of the scale, keybuk has filed nearly 2000 bugs in ubuntu [20:21] jml: I may be wrong [20:22] jml: we will need to qa carefully though - particularly by taking over canonical-scott on qastaging and making sure that the portlet doesn't time out [20:22] bug 711071 [20:22] is the existing bug about the portlet timing out [20:22] too late. I've cancelled the branch. I've got too much on at the moment to go through that for what was an opportunistic bug fix. [20:23] jml: I understand [20:23] jml: to help you assess things in future - if you add work to a page, assume its nontrivial. [20:24] jml: that includes adding links to person objects we don't already show, aggregate and non-aggregate stats, tables, and branding [20:26] jcsackett: ^ when reviewing - you need to also think about the performance impact of a change : not /really really deeply/ - but just : 'has performance been considered? If not, what could go wrong?' [20:27] lifeless: yeah, i saw searchTasks and thought about it, but didn't think that was an expensive query. [20:27] i'll remember next time. [20:27] jcsackett: 4 seconds worst case isn't /terrible/ [20:27] jcsackett: but the portlet in the distro context is 10 seconds already. [20:27] jml: Is it worth leaving the link in, without the stats? [20:27] 10+4 ... [20:27] * jcsackett nods. [20:27] yeah, that's hitting our timeout threshold. [20:27] not to mention sort of sucking. :-P [20:28] its a shame that we have things like the portlet which are already so slow [20:28] we're making progress though [20:29] allenap: for me, it would be an incremental improvement over having to type my nick in on the advanced search page every time I want to do it [20:30] lifeless: The non-personal bug counts could be cached quite readily, and refreshed out of band.
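A hedged sketch of Ronnie's workaround as discussed above: a private bug created over the API gets no default subscribers, so the script has to subscribe someone explicitly before anyone can see it. The `createBug` and `subscribe` calls are the ones quoted in the log; the application name, title, description and tags are illustrative placeholders.

```python
from launchpadlib.launchpad import Launchpad

# Illustrative values; in the real script these come from a parsed email.
title = 'Private report from the forum'
description = 'Details of the private report.'
tags = ['forum-import']

launchpad = Launchpad.login_with('forum-bug-filer', 'staging')
forum_project = launchpad.projects['ubuntu-nl-artwork']

new_bug = launchpad.bugs.createBug(
    title=title, description=description, private=True,
    target=forum_project, tags=tags)

# Private bugs filed via the API have no default subscribers, so subscribe
# a team that should have access. Subscribing the project owner is what
# worked on staging; a security team would do as well.
new_bug.subscribe(person=forum_project.owner)
```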
[20:30] allenap: but visually, it would look like a bug [20:30] allenap: like we forgot to add the count. [20:30] allenap: indeed, as long as we inject them rather than doing on-miss [20:30] allenap: -but- we can make it massively faster just using aggregate queries [20:30] allenap: see what I did for series + tag counts [20:30] jml: Yeah, that's a good point. [20:31] allenap: we should make the miss case faster before going out of band: its more efficient for when we do need out of band / denormalisation in the future [20:31] lifeless: when we lower the timeout, do we automatically add exceptions for the pages that are still over it? [20:32] jml: if we have things go completely out of whack [20:32] jml: we don't default to doing that [20:32] lifeless: hmm ok. [20:33] lifeless: the kernel team remonstrated with me about how our timeout lowering is blocking them from doing critical things [20:34] possibly https://bugs.launchpad.net/launchpad/+bug/732398 ? [20:35] which wgrant is landing a branch for today [20:35] lifeless: maybe. they are mostly from scripts they run, I gather. [20:35] if they have a particular thing that has /stopped/ working they should pop into #launchpad and we'll see what we can do [20:36] lifeless: they feel quite strongly that it is wrong to stop a fraction of users from doing their work so we can lower our timeouts. [20:38] that isn't the intention, and if they come talk to us when a particular thing stops working, we'll look and see whats up [20:38] yesterday's oops report shows 531 timeouts [20:38] and 6.3 million non-monitoring web pages served [20:39] thats 0.009% failure rate [20:40] lifeless: that includes API calls? [20:41] yes [20:41] lifeless: hmm. so either they are very unlucky or were speaking from stale data. [20:41] the page ids with # in them [20:41] https://bugs.launchpad.net/launchpad-project/+bugs?field.tag=timeout [20:41] are api calls [20:42] jml: they need to come communicate when the problem is happening. its /cheap/ for us to workaround *if* its a simple 'this is on the edge' [20:42] jml: in the past things have been taking 30+ seconds to complete. [20:42] lifeless: OK. I'll pass that on. [20:42] jml: which -regretfully- there is no sensible way for us to work around [20:42] lifeless: another question, which characters are allowed for tags? [20:42] lifeless: they are working on a two week cadence now [20:43] jml: I hold the failure rate below 0.05% [20:43] lifeless: if they do come to us for help and you aren't around, is it clear what we should do? [20:43] Ronnie: lowercase [20:43] lowercase and dash [20:44] jml: the CHR should evaluate the problem and decide what to do [20:44] jml: I don't think there is a cut and dried precanned answer [20:44] jml: in previous discussions with e.g. andy, they have not considered the costs of us leaving the timeout high [20:44] lifeless: ok. so that's a "no", at least for the moment. [20:45] jml: I have pretty huge overlap with the bulk of the kernel team - them being mostly US based AFAIK [20:45] lifeless: yeah. it's understandable. having the timeout high is a cost borne by the commons [20:45] jml: and by then - slower limits = choppier queuing = more latency on their own requests [20:45] s/then/them/ [20:48] lifeless: yeah, tragedy of the commons. [20:49] jml: I've said in previous mails to the list that I'm happy for any dev to request a timeout exception - and for losas to Just Do Them [20:50] jml: all I ask is that it be <= 20 seconds, and there must be a critical timeout bug associated with it.
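For reference, a quick check of the failure-rate arithmetic lifeless quotes above (531 timeouts against 6.3 million non-monitoring pages), and how much headroom that leaves under the 0.05% target he states.

```python
# Numbers taken straight from the log above; only the rounding is mine.
timeouts = 531
pages_served = 6300000  # 6.3 million non-monitoring pages

rate = timeouts / float(pages_served)
print('failure rate: {0:.4%}'.format(rate))                  # ~0.0084%, the ~0.009% quoted
print('headroom vs 0.05% target: {0:.1f}x'.format(0.0005 / rate))  # ~5.9x under target
```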
[20:51] lifeless: ok, thanks. [20:51] I'm happy for losas to add them off their own bat as well, of course [20:51] lifeless: fwiw, apw reported that the site feels faster in the morning. [20:51] its very much a 'look after the site' kind of thing - exactly what ops should be doing [20:51] jml: naturally, we deployed nearly a week's worth of timeout fixes [20:52] jml: its the downtick on http://webnumbr.com/.join(launchpad-oops-bugs.all,launchpad-timeout-bugs.all,launchpad-critical-bugs.all) [20:52] lifeless: not quite what I meant. During mornings in general. [20:52] oh [20:52] lifeless: before the US awakes. [20:52] thats probably in dc queuing [20:52] lifeless: that was my guess. [20:52] we are out of the terrible phase there [20:52] but still have a ways to go [20:52] (insofar as anecdata needs explanations) [20:52] we're bringing up 50 more appserver instances [20:53] probably not next week, as we're still a losa short [20:53] maybe even two [20:53] jml: i officially love the word "anecdata" [20:54] jcsackett: it's one of my favourites. [20:55] thumper: Revision 12563 can not be deployed: needstesting [20:55] lifeless: which one is that? [20:56] Strip any path information off email attachments when storing in the librarian [20:56] lifeless: I thought I marked that as untestable [20:56] https://bugs.launchpad.net/launchpad/+bugs?field.tag=qa-needstesting doesn't show it [20:57] thumper: ah, just recently [20:58] * lifeless refreshes the report [20:58] lifeless: well... in the last half hour [20:58] thumper: yeah, I'm just seeing if we can fix things for oem [20:58] 9:17 according to the email [20:59] thumper: abentley: I am looking at https://answers.launchpad.net/launchpad/+question/148614 . I have seen this before. I think the svn repo is corrupt and we cannot complete the import. I do not see this kind of problem described in https://help.launchpad.net/VcsImportRequests [21:00] sinzui: oh god [21:01] sinzui: get them to request another import [21:01] sinzui: as this one uses CSCVS [21:01] sinzui: bzr-svn is much more robust [21:01] that'll most likely fix it [21:01] oh, that's right. I was surprised to see cscvs but did not think getting a new one was the right thing to do [21:01] noted [21:01] thanks thumper [21:07] lifeless: Does garbo-hourly run regularly on qas? [21:08] wgrant: we did start it running, I don't know if its still running [21:08] wgrant: check with losa [21:09] sinzui: you can simulate getting a new one using bzr-svn on your local machine, if you like. [21:09] abentley: good to know [21:10] lifeless: Can you do a quick 'SELECT COUNT(*) FROM sourcepackagerelease WHERE changelog IS NULL' on qas, so we can check? [21:10] not exactly quick. [21:11] Really? [21:11] 586610 [21:11] Thanks. [21:15] This seems like it deserves an incident-level response... [21:15] Project windmill build #35: STILL FAILING in 1 hr 14 min: https://hudson.wedontsleep.org/job/windmill/35/ [21:15] Particularly since we can't fix it today. [21:15] Because ELOSA. [21:16] wgrant: I agree [21:16] This should have been a drop-everything situation 8 hours ago :/ [21:16] I realise that we have no EU maintenance teams, but... [21:18] benji: I filled out the deploy report [21:18] thanks! === bac` is now known as bac [21:55] huwshimi: hi [21:55] jml: Hey there. Sorry I got distracted :( [21:56] huwshimi: np [21:56] huwshimi: skype?
[21:56] jml: yes [21:59] huwshimi: http://pastebin.ubuntu.com/578349/ [22:04] Project windmill build #36: STILL FAILING in 48 min: https://hudson.wedontsleep.org/job/windmill/36/ [22:04] wallyworld: Hi. [22:05] hello [22:05] wallyworld: Bug #732442 [22:05] That's the Windmill failure. [22:05] It's a real bug, but doesn't affect production data. [22:06] wallyworld: I looked at it last night, but it's non-trivial. [22:07] wgrant: thanks. that was on my todo list to follow up. i'll fix it. [22:07] good that the production issue is fixed [22:17] wallyworld: hi, mumble? [22:17] thumper: ok [22:33] thumper: https://code.launchpad.net/~wallyworld/launchpad/inline-recipe-distro-series-edit [22:34] thumper: https://code.launchpad.net/~wallyworld/launchpad/inline-recipe-distro-series-edit/+merge/52940 [22:36] thumper: http://people.canonical.com/~ianb/distroseries-checkbox.png === jcsackett changed the topic of #launchpad-dev to: https://dev.launchpad.net/ | firefighting: - | On call reviewer: allenap | https://code.launchpad.net/launchpad-project/+activereviews [22:45] thumper: https://code.launchpad.net/~wallyworld/launchpad/inline-multicheckbox-widget/+merge/52943 === lifeless changed the topic of #launchpad-dev to: https://dev.launchpad.net/ | firefighting: - | On call reviewer: - | https://code.launchpad.net/launchpad-project/+activereviews [22:57] thumper: http://people.canonical.com/~ianb/distroseries-popup.png [22:58] wallyworld: does it need a popup? [22:58] wallyworld: why not just edit in-place? [22:59] lifeless: because you need to show all the choices [22:59] lifeless: the html only shows the selected ones [22:59] lifeless: just a design decision [22:59] wallyworld: hmm, I guess I'm thinking down the track [23:00] lifeless: we aren't using this pattern anywhere else [23:00] wallyworld: e.g. youtubes 'save this video' ui [23:00] lifeless: if we decide it is worth a change, we can look at it then [23:00] thumper: of course [23:00] lifeless: it should be implemented as a widget, so easy to fix [23:04] benji: I'm volunteering to be the next 'rm' [23:07] -fr [23:07] elmo: yes, thats my plan [23:16] gary_poster: with_ is deployed and working. [23:16] gary_poster: we're likely to change the api to make storm upstream happy, but changing a few callsites will be easy enough [23:25] lunch is calling