[00:00] mwhudson: unfortunately they don't get propagated [00:00] jelmer: oh really? [00:00] how completely useless [00:01] mwhudson: I discussed them with some git developers at LCA, and we came to the conclusion that git notes wouldn't be useful for this sort of thing. [00:01] oh ok, i got the idea somehow that they were a new thing [00:01] They are relatively new, only about a year I think. [00:02] I ended up implementing bzr-git roundtripping using extra metadata in commit messages (revision properties, revision id) and a file-id table in the tree. [00:02] ah ok === almaisan-away is now known as al-maisan === al-maisan is now known as almaisan-away [00:12] jelmer: so [00:12] /home/robertc/launchpad/lp-branches/working/lib/lp/scripts/utilities/importfascist.py:187: DeprecationWarning: please use 'debian' instead of 'debian_bundle' [00:12] module = original_import(name, globals, locals, fromlist, level) [00:12] jelmer: when is that getting stabbed ? [00:12] jelmer: i see the kdebase import failed :/ [00:12] lifeless: yes that is nice to separate the policy bits of testresources from the mechanism [00:13] lifeless: I thought I already head [00:13] *had [00:14] jelmer: I [00:14] 'm on latest devel [00:14] latest sourcedeps, upgraded my packages [00:14] hmm, I'll hit apt harder there were some hold- backs [00:15] lifeless: it doesn't tell you where that DeprecationWarning is coming from? [00:15] mwhudson: yeah :-( [00:15] jelmer: no, its a little opaque [00:16] jelmer: any idea on this one? [00:16] lifeless: Have you run update-sourcecode recently? [00:16] jelmer: never ? [00:16] seems that it's some kind of remote server problem [00:16] mwhudson: yeah, it's the same one as the len(tview) != len_tview one [00:17] jelmer: oh ok [00:17] poolie: small request for you [00:18] poolie: I'd like to put a rule in the feature flags ruleset on production [00:18] ok [00:18] which will say 'beta users team membership means is_edge should evaluate true' [00:18] by all means [00:19] so keeping the same behaviour, but changing the guts to use the flags mechanism rather than being hardcoded? [00:19] yeah [00:19] yes i was thinking of doing that soon too [00:19] for now is_edge (and is_lpnet) need to union the two things [00:19] as a transition, I think. [00:19] e.g. [00:20] if the appserver is configured as edge [00:20] then is_edge is true and is_lpnet is false [00:20] if the appserver is configured as lpnet [00:20] so last thursday [00:20] then the flags rule can override that [00:20] which seems like a long time ago [00:20] i got a readonly web view of them [00:20] to say 'actually, this is edge, nyarh' [00:20] and i was trying to work out a nice way to make them editable [00:21] poolie: I quite like what sinzui has organised for jabber ids with bac [00:21] which is, you can add or delete, but not edit in place; it makes it kindof simple [00:22] I dunno - we can have something pretty crude - even a big textfield you parse and diff would do :). I don't know this part of the LP machinery as yet. [00:26] yeah, something like that [00:27] mwhudson: getting that particular bug fixed is high on my todo list, but it's non-trivial [00:27] it wasn't a blocker, that was just what i got up to before i stopped [00:27] jelmer: it's fixable on the bzr-svn end? 
[00:27] mwhudson: yeah, it's a bzr-svn bug [00:28] oh ok [00:28] mwhudson: the problem is that because of odd operations in the repo bzr-svn ends up with an invalid base text to apply the delta it receives from the server against [00:28] jelmer: is it like the hash reconstruction fun you had with bzr-git? [00:28] mwhudson: It's just for the file fulltext, doesn't depend on the particular serialization that bzr-svn uses [00:29] oh ok [00:29] sounds like fun :) [00:29] s/bzr-svn/svn/ [01:54] jelmer: still up ? [01:54] or StevenK ? [01:54] I have this happening : [01:55] robertc 6840 0.0 0.2 26224 4276 pts/1 T 10:52 0:00 \_ /usr/bin/perl -w /usr/bin/debuild --no-conf -S -k0x5D147547 [01:55] robertc 6868 0.0 0.0 5572 772 pts/1 T 10:52 0:00 \_ tee ../biscuit_1.0-4_source.build [01:55] robertc 6929 0.0 0.0 19404 1920 pts/1 T 10:52 0:00 \_ /bin/bash /usr/bin/debsign -k0x5D147547 biscuit_1.0-4_source.changes [01:55] robertc 6968 0.0 0.0 5596 772 pts/1 T 10:52 0:00 \_ stty 400:1:bf:a20:3:1c:7f:15:4:0:1:0:11:13:1a:0:12:f:17:16:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 [01:55] ^ hung test [02:21] jelmer: the stty is looping: [02:21] ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted) [02:21] --- SIGTTOU (Stopped (tty output)) @ 0 (0) --- [02:21] -forever- [02:21] --- SIGTTOU (Stopped (tty output)) @ 0 (0) --- [02:35] man, my vim in this vm is _borked_ [02:35] its the cause, whatever is going on. [03:17] mwhudson: did you have any thoughts on https://bugs.edge.launchpad.net/launchpad-foundations/+bug/618019 ? [03:17] <_mup_> Bug #618019: OOPS may be underrepresenting storm/sql time [03:19] lifeless: not particularly [03:19] lifeless: i know that object construction time can be significant [03:19] calling it 'sql time' doesn't seem quite fair though [03:20] well, I'm saying we don't know [03:20] I'd like to be able to say ORM time and sql time [03:20] but I have a sinking feeling tha time spent getting stuff out of the sql socket is being accrued as nonsql time [03:35] mwhudson: ould factory.makePerson be reusing Person db id's ? [03:35] lifeless: no [03:35] I have a -weird- interaction then [03:35] there might be db-non marked dirty issues [03:35] lifeless: details++ pls ;) [03:35] yes [03:35] typing [03:35] lp/registry/browser/tests/coc-views.txt [03:35] line 37 [03:36] makes a new person [03:36] grabs a view [03:36] and checks that the instructions are show [03:36] n [03:36] I'm seeing [03:36] - 1. Register an OpenPGP key. [03:36] - 2. Download the current Code of Conduct. [03:36] - 3. Sign it! [03:36] which means the view things the principle has signed the coc [03:36] I think [03:38] lifeless: Wouldn't that suggest the opposite? [03:38] wgrant: the - means the line is missing [03:39] Oh. [03:39] Not a bullet. [03:39] I see. [03:39] now, for hilarity [03:39] I can't find those instructions in the source [03:39] What are the changes in your branch? [03:39] wgrant: its the registry branch [03:40] The one where you're caching that field? 
[03:40] yes [03:40] lifeless: the instructions are in codeofconduct-list.pt [03:40] nothing obvious in a preview merge [03:40] oh, the threw my gre out [03:41] mwhudson: thanks [03:41] it also seems to think that they have registered an openpgpg key [03:42] which is -really weird- as I didn't cache that [03:42] ah no, its all in the not: is_ubuntu_coc_signer clause [03:42] so the symptoms are 'a new person from factory.makePerson has their is_ubuntu_coc_signer set true [03:47] yes, I added a print of the is_ubuntu_coc_signer right after makePerson [03:47] -> True [03:48] lifeless: Is it caching (eg. does it work in make harness), or is your query broken? [03:48] my query ? [03:48] wgrant: this isn't going through _all_members [03:48] wgrant: thats a -very- explicit code path; [03:48] lifeless: But you refactored the property itself. [03:48] yes [03:49] wgrant: oh, right, I see what you're asking. Thanks. [03:49] I haven't seen this AND thing before. [03:49] thats easy to tes [03:49] Ah. [03:49] There's the problem. [03:49] You're not constraining the Person... [03:50] You're just saying Person.id. [03:50] yeah, its the calculation itself [03:50] You mean self.id. [03:50] Ah, except that it's static. [03:51] yeah, the refactoring is borked [03:51] very good catch [03:53] in both cases in fact. [03:53] not enough attention to detail; fixing. [03:53] Hah. [03:53] it needs to LeftJoin on the _all_members case [03:53] to be fair though, it was the second attribute, I was still figuring things out ;) [03:54] Hm, is the _all_members case really broken? [03:54] yes [03:54] I'm not sure how Storm will SQLify that. [03:54] OK. [03:54] people that haven't signed a coc at all will be excluded [03:54] because their columns would be NULL [03:54] But it's in an Exists... [03:54] oh [03:54] I think I need more caffeine. This isn't me :) [03:55] right. square one. [03:55] Heh. [03:55] the _all_members case does look right. [03:55] I think it's OK, yes. [03:55] the property is naffed [03:55] we can't use Person.id there [03:55] so [03:55] You can. [03:55] You just have to constrain it in the property itself. [03:56] we're not querying on the Person table [03:56] and querying both tables would be nuts [03:56] True. [03:56] brb. [03:58] mwhudson: wgrant: thank you for your help. [04:04] taking a break; -> new house stuff and talking to bigjools late tonight [04:04] if you need me, SMS/ring the aussie mobile [07:07] getting there with this branch [07:11] The cache-the-world one? [07:15] cache person [07:15] got some wtf failures in soyuz tests [07:16] also in my cachedproperty branch which I'm using as a prereq now [07:37] wgrant: for instance [07:37] File "lib/lp/soyuz/browser/tests/archive-views.txt", line 1279, in archive-views.txt [07:37] Failed example: [07:37] view = create_initialized_view( [07:37] ubuntu_team.archive, name="+copy-packages", [07:37] form={ [07:37] 'field.destination_archive': '', [07:37] 'field.destination_series': '', [07:37] }) [07:37] ... [07:37] raise ComponentLookupError(objects, interface, name) [07:37] ComponentLookupError: ((None, ), , '+copy-packages') [07:38] has me scratching [08:07] lifeless: I'm also seeing a hanging test, FWIW [08:07] StevenK: in soyuz? same symptoms? how are you running it ? 
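
A hedged, self-contained illustration of the refactoring bug wgrant spots above, using made-up Storm tables and column names rather than Launchpad's real schema: comparing against the class-level column Person.id builds a clause that never mentions this particular person, so the result is effectively static for every instance; the property has to constrain on self.id.

    from storm.locals import Bool, Int, Store, Storm

    class SignedCodeOfConduct(Storm):
        # Illustrative table, not Launchpad's real schema.
        __storm_table__ = 'signedcodeofconduct'
        id = Int(primary=True)
        owner_id = Int()
        active = Bool()

    class Person(Storm):
        __storm_table__ = 'person'
        id = Int(primary=True)

        @property
        def is_ubuntu_coc_signer(self):
            store = Store.of(self)
            # Buggy refactoring:  SignedCodeOfConduct.owner_id == Person.id
            # compares two columns and is not tied to this instance at all.
            return not store.find(
                SignedCodeOfConduct,
                SignedCodeOfConduct.owner_id == self.id,
                SignedCodeOfConduct.active == True).is_empty()
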
[08:07] lifeless: I threw a branch at ec2 land [08:07] ec2test@domU-12-31-39-0E-60-31:~$ tail -n 1 /var/www/current_test.log && date [08:07] time: 2010-08-16 05:14:30.759898Z [08:08] Mon Aug 16 07:07:44 UTC 2010 [08:08] lifeless: strace is being as unhelpful as I feared [08:08] \o/ [08:09] so [08:09] what does ps tell you ? [08:09] e.g. ps fux [08:09] All that shows is the shell, the ps process and the librarian [08:10] hmm, what user are you :) [08:10] oh, no terminal [08:10] ec2test [08:10] ps faux ? [08:10] the stty process is what was hurting me [08:10] (and its a known 'feature' of stty [08:10] No stty process [08:11] what do you have? [08:12] lifeless: http://pastebin.ubuntu.com/478711/ [08:12] I'm suspecting the test runner has fallen over [08:12] well [08:12] did you see the patch from maris [08:13] shutdown stomps on shutdown [08:13] so falling over is possible [08:13] finishing and not shutting down is also possible [08:13] I seriously doubt the test suite finished [08:14] subunit-ls < test.log [08:14] sorry [08:14] Minusing the last test ran versus what time the instance started is 80 minutes. [08:14] We're not that quick :-) [08:15] subunit-stats < test.log [08:15] true [08:15] It only ran 1744 tests [08:16] One did fail, though, which is odd [08:18] may be interrupted [08:18] if the test kills the runner thats how it shows up [08:18] error: lp.soyuz.tests.test_buildpackagejob.TestBuildPackageJob.test_providesInterfaces [ [08:18] _StringException: lost connection during test 'lp.soyuz.tests.test_buildpackagejob.TestBuildPackageJob.test_providesInterfaces' [08:18] lifeless: Like so, from subunit-filter ? [08:19] yes [08:19] that tells you what nuked things [08:19] assuming no buffering [08:19] Indeed [08:20] [which is false - I *know* we have buggering in place in the test supervisor] [08:20] * StevenK smirks [08:20] s/gg/ff/ [08:21] lifeless: So, what do you suggest? Kill the test runner and run make check by hand on the instance? [08:22] well [08:22] is it repeatable? [08:23] This branch has done this twice, but I don't know which test killed it the first time since I was sleeping [08:24] running the tests while watching isn't a silly idea [08:25] you could try running just that test [08:28] Running just that test locally doesn't fail [08:28] But, like you say, I think buffering is screwing us [08:29] you can run it and the next 20 [08:29] grab any previous ec2 result [08:29] and use subunit-ls to get a list of the tests [08:29] pick that one and 20 or 30 after [08:29] put them in a file and use bin/test --load-list filename [08:29] The ordering doesn't change? [08:30] its tolerably stable [08:31] btw [08:31] 1500 line text files are not 'tests' [08:31] just saying [08:32] Wha? [08:32] * StevenK is missing contextg [08:32] s/g$// [08:33] tests/archive-views.txt [08:33] its blowing up, spectacularly, for me [08:34] File "/home/robertc/launchpad/lp-sourcedeps/eggs/zope.component-3.9.3-py2.6.egg/zope/component/_api.py", line 111, in getMultiAdapter [08:34] raise ComponentLookupError(objects, interface, name) [08:34] ComponentLookupError: ((None, ), , '+copy-packages') [08:34] line 1279 [08:34] lifeless: That test predates me by a while [08:35] sure [08:35] not blaming you :) [08:35] has anyone seen one of these sorts of failures before [08:35] and can suggest how to debug the thing ? [08:35] He can! [08:35] Um, I have, but I can't remember what I did [08:36] lifeless: normally you see that kind of error when ZCA hasn't been initialised properly... 
I'm assuming you've not modified the test so you haven't changed the layer it runs with? [08:36] * StevenK grumbles at the letter he just got from Medibank Private [08:36] noodles775: no, I haven't changed anything in soyuz [08:37] noodles775: this is my registry branch, which adds some caching to Person (and only Person) [08:37] hullo [08:37] hey bigjools [08:37] lifeless: ah, so it's only failing in your branch? I'll take a look at the MP. [08:37] hey lifeless, epic fail at getting up early I'm afraid [08:38] bigjools: thats ok [08:38] bigjools: Still flu-ey? [08:38] bigjools: Read as, "Did the weekend help?" [08:38] not quite full strength but better, thanks [08:38] noodles775: I'm pushing the latest now, but the shape is unchanged [08:39] noodles775: https://code.edge.launchpad.net/~lifeless/launchpad/registry/+merge/32067 [08:40] Ta. [08:40] its pushing now (bit slow because Lynne is eating all the wifi slots with a machine migration) [08:40] no worries... I'm just running the doc firstbefore merging anyway. [08:41] *sigh* and need to run make schema first. [08:41] the cachedproperty changes are from a different branch [08:41] which is approved and ec2ing itself [08:42] * StevenK stares at bin/test --load-list on his instance === almaisan-away is now known as al-maisan === henninge_ is now known as henninge [08:51] lifeless: Your suggestion was to run bin/test --load-list . That has set up the layers and then done nothing else [08:52] you probably want -vv there too :) [08:53] It just spat out 'Killed', so I'm now *very* curious what is going on [08:55] lifeless: Probably :-) [08:56] lifeless: the cache you've added on IPerson.archive is changing the test: http://pastebin.ubuntu.com/478724/ [08:56] (sorry, doc :) L. [08:56] noodles775: thanks [08:57] np. [08:57] (waiting for pakcets to look at the pastebin) [08:57] ec2test@domU-12-31-39-0E-60-31:~$ dmesg | grep -c 'oom-killer' [08:57] 6 [08:57] lifeless: ^ [08:57] -epic- packet loss on local wifi :( [08:57] The plot thins :-( [08:57] hah [08:58] 8693 ec2test 20 0 3575m 3.2g 11m R 100 45.1 4:35.38 /usr/bin/python [08:58] Fuuuuuuuuuun [08:58] thats less that optimal [08:58] bin/test is blaming buildd-slavescanner.txt [08:58] noodles775: so what does it /mean/ ? 'no archive found when one should be found' ? [08:59] lifeless: I'm guessing it means that ubuntu_team.archive was accessed before the doc added an archive for them, so the value of None is cached... [09:00] The line of the error you see is where it tries to adapt ubuntu_team.archive (which is None) and fails. [09:00] As the paste shows, the archive is found if you kill the cache (and the initialized view returned as expected) [09:01] So, convert it to a unit test and the problem goes away :) [09:07] noodles775: yeah, I need to find where though - so that I can fix the code to invalidate automatically (there are many more failures, this is just the first bit of fallout that was completly bizarre) [09:09] bigjools: so [09:09] Right, Python got to 6.7g resident before the oom-killer stepped in [09:10] bigjools: I don't konw what bits needed expansion in the incident report [09:10] Hello all [09:10] bigjools: so I suggest you ask me stuff ;) [09:10] lifeless: so an explanation of how you reached your conclusion would be a good start [09:10] noodles775: ArchiveSet.new() caches the value [09:10] bigjools: which conclusion ? [09:10] lifeless: aha. [09:11] noodles775: yes, aha indeed. 
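
The failure noodles775 tracks down above comes from caching a property value of None before the archive exists. A minimal sketch of that failure mode, assuming a simple instance-attribute cache (this is not Launchpad's actual cachedproperty implementation, and the dict stands in for the Archive table):

    ARCHIVES = {}  # stands in for the Archive table

    class cachedproperty(object):
        """Cache the first value the wrapped method returns, per instance."""

        def __init__(self, func):
            self.func = func
            self.attr = '_cached_' + func.__name__

        def __get__(self, obj, cls=None):
            if obj is None:
                return self
            if not hasattr(obj, self.attr):
                setattr(obj, self.attr, self.func(obj))
            return getattr(obj, self.attr)

    class Person(object):
        def __init__(self, name):
            self.name = name

        @cachedproperty
        def archive(self):
            return ARCHIVES.get(self.name)

    team = Person('ubuntu-team')
    print(team.archive)        # None, and None is now cached
    ARCHIVES['ubuntu-team'] = 'a PPA'
    print(team.archive)        # still None: the doctest's new archive is invisible
    del team._cached_archive   # explicit invalidation, roughly what the pasted fix does
    print(team.archive)        # 'a PPA'
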
[09:11] lifeless: "Soyuz went into a busy loop on the 'bohrium' builder" [09:11] noodles775: I suspect its this : if purpose == ArchivePurpose.PPA and owner.archive is not None: [09:13] lifeless: yep. Isn't this why we don't normally cache model attributes? (sorry, I've probably not caught up on some email discussion saying why what you're doing is ok). [09:13] bigjools: wgrant and cody talked about that. [09:14] bigjools: cody looked at the log on devpad and saw multi-times-per-second logged events saying that bohrium was being disabled [09:14] bigjools: I thought I linked the example log entry in the report [09:14] lifeless: that's just one snippet, it doesn't show repeated attempts at anything [09:15] noodles775: see several threads of mine about this on the list :) short story: caching is *a* way to get 'things that look cheap are cheap' more widespread, and bigger picture solutions require r&d [09:15] bigjools: it was spitting that out a lot, or so I was told [09:16] ok [09:16] bigjools: other evidence about a busy loop on bohrium is that attempts to update it from both psql and the lp webapp and airlock all failed [09:16] * wgrant was just working on what Cody said was hoppening in the logs. [09:16] lifeless: in the traceback it's calling "requestAbort" [09:16] specificaly just update builder set thing=False where name=bohrium got stuck waiting for a lock [09:17] that was a sign that *something* was busy updating that row....and updating the row.... and updating the row [09:17] we *don't know* if it was one long transaction, or many short ones. [09:17] bigjools: That codepath then immediately sets builderok=false, then commits. With no other options. [09:18] noodles775: pm [09:18] wgrant: nope, it calls slave.abort() [09:18] bigjools: True. Which fails, then invokes the exception handler which prints the log message. [09:18] which is why we get an XMLRPC fault [09:18] Which sets builderok=false, which commits. [09:18] Missed the exception handler bit, sorry. [09:18] ah ok [09:18] But we know that the handler was called, since the log entry is there. [09:19] noodles775: does that pasted thing look reasonable to you ? [09:19] it makes that entire doctest pass [09:19] In fact, I think that's just about all we *know* happened. [09:20] we also knew that other builds were not happening [09:20] True. [09:20] (or were being updated/processed so slowly that it was equivalent to not happening) [09:20] Now, I haven't seen the logs, but from what I heard there was nothing about the commit failing. [09:21] So possible something kept setting builderok=true again, or the Twisted evil is swallowing exceptions. [09:21] lifeless: so what that means is that we never got back into the reactor, so there was a loop somewhere [09:21] lifeless: was the log spewing stuff or stuck on that line? [09:22] lifeless: hrm... it looks like, yes, it would get the doctest passing, but it's adding a dependence elsewhere on the knowledge that a property is cached... would something like @cachedproperty(unless_equal=None) be silly? Hrm [09:22] bigjools: I didn't look : when I got here it had been plausibly analysed > 1 hour before, without escalation: so I discussed enough to determine that it needed (IMO) escalation, and handed off to elmo [09:23] noodles775: yes, because it would mean that /participants of a team with many people without PPAs would trigger lots of tiny queries [09:23] lifeless: Right. [09:23] lifeless: So I can't think of any other solution immediately than what you've pasted. 
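
noodles775 floats @cachedproperty(unless_equal=None) above; a rough sketch of that hypothetical variant (the keyword and helper name are illustrative, not an existing Launchpad API) makes lifeless's objection concrete: every access for an owner with no PPA skips the cache and issues another query.

    def cachedproperty_unless(sentinel):
        """Like a cached property, but never cache the sentinel value."""
        def decorator(func):
            attr = '_cached_' + func.__name__
            def getter(self):
                if hasattr(self, attr):
                    return getattr(self, attr)
                value = func(self)
                if value is not sentinel:
                    setattr(self, attr, value)
                # The sentinel (here, "no PPA") is recomputed on every access,
                # so a /participants page over a large team still issues one
                # query per member without an archive.
                return value
            return property(getter)
        return decorator
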
[09:24] archive.txt does this [09:24] * StevenK tries to figure out why a doctest is causing python to eat 6.8g of RAM [09:24] mm, does something similar [09:24] lifeless: ok so I am looking at the log now, it's repeating that log section over and over [09:26] lifeless: I have a suspiscion that it happened when the Enablement guys yanked it from the pool [09:26] assuming they did so on a Saturday [09:26] actually, Friday night [09:26] you say potato, I say pohtahto [09:27] b-m was stuck in this loop from 22:17 to 08:47 [09:27] no you don't :) [09:27] I mean tz issues :) [09:28] that's UTC [09:28] yeah [09:28] so 10am sat [09:28] therefore it was Friday night everywhere ;) [09:28] Oho. @write_transaction retries.... [09:28] forever? [09:28] Only three times, apparently. [09:28] was gonna say ... [09:28] But it's still utterly wrong to use it. [09:29] ? [09:29] What with the whole talking to the builder thing. [09:29] what do you guys think of the idea of making the buildd manager an API client [09:29] you need to expand on that [09:29] lifeless: CRACK [09:29] and removing its DB usage. [09:29] total crack [09:29] why ? [09:29] bigjools: Retrying transactions that have external effects seems... unwise. [09:29] ... [09:29] Although it's probably OK enough here. [09:30] lifeless: the API is S L O W [09:30] bigjools: not in any way that matters for the buildd manager [09:30] dude [09:30] no [09:31] it does matter [09:31] of coure performance matters [09:31] you're just making things twice as hard for yourself and putting load on appservers for no good reason [09:31] but nothing the buildd manager *does* has any reason to be slow with the API as it stands today. [09:31] no, I'm running an idea up the flagpole [09:31] if its a bad idea, fine. [09:31] the b-m is *very* busy issuing queries [09:31] but I want to understand *why* [09:32] first of all, you need to justify why you think it would be better with the API [09:32] sure [09:33] you've got a highly concurrent task [09:33] which twisted is great at [09:33] but you're doing transactions, which will - unless you are _very_ careful - last the length of a conceptual task like 'check on builder Y' [09:34] the longer transactions are, the more contention you put on busy rows in the DB, like the builder table. [09:34] secondly [09:34] first one easily fixed by moving transaction boundaries [09:35] (it already does partial commits) [09:35] our *entire* stack above the storm layer is built on the model of global, magical, transaction objects which lookup information in thread locals [09:35] [it may be easy to fix, but it needs to be done and maintained with care: using the API you'd have that for free, so its less effort *in that regard*] [09:35] back to the second angle [09:35] twisted runs everything in the reactor thread [09:36] so you have to play silly buggers with our stack to move transaction objects out of context and back in, and also, because the stores and transactions are tied to model objets, possibly do that to the model objects too [09:37] if you don't, you run a high risk of bugs where you do unrelated things in a single transaction [09:38] note that moving transaction boundaries won't work well if you want to use our regular model objects for flow control or anything like that. [09:40] lifeless: I don't see why things are moving around as you suggest [09:40] I think that those two things together make the job of writing the buildd manager in twistd harder than if it was an API client. 
Of course, I'm not the one writing it: I'm expressing an opinion and asking what you think. [09:40] the b-m would bring the appservers to their knees [09:41] why? [09:41] I'm not trying to be silly or difficult - I really don't see why it would. [09:41] because it issues shitloads of queries [09:41] ask stub about the load it generates [09:41] what is it doing with these queries ? [09:42] seeing if there's a new build on each builder, checking status of each builder etc etc [09:42] if the webservice were a thin, performant agile layer I would consider agreeing with you, but it's not [09:43] even if it were you'd still be putting tremendous load on the appservers [09:43] and I think we should use those to service external requests, not DC ones [09:43] so, in terms of its performance; I've been looking and its bad in a couple of specific ways [09:44] it batches stuff we shouldn't batch : but clients have control over that, so we can easily avoid it. [09:44] and it potato programs as you traverse objects (the exact same terrible pattern that makes some pages (and some API calls) extremely slow) [09:45] the former issue would be avoidable; the latter issue is nearly entirely irrelevant in the DC (particularly for a long lived process like the buildd manager) [09:45] in terms of appserver load [09:46] completely separately from this discussion [09:46] I want to separate out APIs and WebUI appservers [09:47] I don't think its good or appropriate to mix up human-facing work with machine-driven work: makes servicing humans well in times of overload harder. [09:47] this idea is raw: I haven't run it up the flagpole and had it examined it; its an open-concept [09:47] ok [09:47] anyhow, if that were done, the load on the appservers might be huge, but it wouldn't affect browser using users. [09:49] anyway I am trying to see reasons why the b-m would be stuck in a slave.abort() loop [09:50] it's not exiting from the Deferred at all, otherwise we'd see other builds getting dispatched in the log [09:50] Anyhow, this is a diversion: I asked your opinion, which is that its a bad idea because it would be harder to make it perform well, and even if done to perform well it has a high risk of adversely affecting the API appservers. [09:50] I may come back to this concept in the future, but I have what I wanted for now:) [09:51] noodles775: actually, hah, it didn't fix it - I've got some more digging to do. [09:54] what happens if you removeSecurityProxy on a non proxies object ? [09:54] ah, pass though. nice [09:58] right, scatter removeSecurityProxy into the caching layer, and its good. [10:01] bigjools: is there anything else I can tell you about saturday to help [10:01] danilos: hi [10:01] lifeless, hi [10:02] is there anything I can do you help with your page performance analysis ? 
your plea for help was rather heartfelt :) [10:03] lifeless, heh, thanks for the offer :) not sure right now, but I just wanted to indicate a few issues that seem as if they are staying under your radar :) [10:03] lifeless, it's basically: "here's a few things we are having problems with, I know different people are working on it, but I am sure you are the best person to keep that all in mind" [10:04] danilos: I've filed a bug - thats related - https://bugs.edge.launchpad.net/launchpad-foundations/+bug/618019 [10:04] <_mup_> Bug #618019: OOPS may be underrepresenting storm/sql time [10:04] \o/ I think my registry branch will pass now [10:04] danilos: I think this is related because if we're undercounting DB time, it would certainly explain some things ;) [10:05] lifeless: Doesn't mean it won't break anything :/ [10:05] lifeless, right, but this particular case doesn't seem to be that [10:05] danilos: do you have a profile for it ? [10:05] lifeless, for instance, we get long rendering time on local instances with that many objects where DB time is very stable (i.e. we just add 2000 rows to the DB) [10:05] lifeless, no, sorry, it's the next step we've got to take [10:06] no worries [10:06] losa ping [10:06] :) [10:06] lifeless, (I've got KCacheGrind email tagged as "important" :) [10:06] :( [10:06] bah [10:06] :) is what I meant [10:06] heh [10:07] danilos: if it is python time, there are a few things we can do to fix it, but which one will make sense will need the data for where the time is going [10:07] lifeless, also, our first suspicion was storm "objectification", but that turned out to be pretty quick [10:07] lifeless, of course, we haven't done anything other than get a gut feel about the situation [10:07] (we could do a C extension, we can cut some fat out, we can rearrange the layers) [10:07] wgrant: of course it will break something. [10:09] lifeless, gary was mentioning switching to chameleon (as a faster TAL renderer), looking into fmt:url and why it takes so long (it was around 2ms per call for us), etc. anyway, I don't want to make another guess without profiling first :) [10:10] \o/ data [10:10] bigjools: ok, I'm going to go do family time and stuff [10:11] bigjools: if there are other things I can answer to help, or if you want me to start looking at the code as another of eyeballs, just shout. [10:11] lifeless: sorry was on a call [10:12] lifeless: if you could eyeball the code that would be great. It's getting stuck inside a Deferred somehow [10:12] I don't know why it would be repeatedly calling slave.abort [10:12] well, it's calling builder.rescueIfLost() [10:12] The log says it's also repeatedly calling Builder.updateStatus... [10:12] And there's only one place that's called :/ [10:13] wgrant: I only see it calling rescueIfLost [10:13] bigjools: The traceback sucks. But grep for the error text. [10:13] wgrant: updateStatus is not in the traceback [10:14] http://pastebin.ubuntu.com/477761/ [10:14] bigjools: updateStatus delegates to updateBuilderStatus. [10:14] Which is right at the top. [10:14] d'oh [10:14] NFI why the traceback stops there, though. [10:14] right [10:15] I still feel it's something to do with Enablement yanking the builder [10:15] Very probably. [10:15] But even DB contention doesn't explain this entirely. [10:15] It explains three calls. [10:15] Not thousands. [10:15] Unless they're both mutually timing each other out. 
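
A hedged sketch of the @write_transaction behaviour discussed above, with the retry bounded at three attempts, using the generic transaction package; the real Launchpad decorator differs in detail and is presumably pickier about which errors it retries. wgrant's point is visible in the shape of it: any external side effect inside the wrapped function, such as an XML-RPC call to a builder, gets repeated on every retry.

    import transaction

    RETRIES = 3  # "Only three times, apparently."

    def write_transaction(function):
        def wrapper(*args, **kwargs):
            for attempt in range(RETRIES):
                try:
                    result = function(*args, **kwargs)
                    transaction.commit()
                    return result
                except Exception:
                    # This sketch retries on any exception; whatever the real
                    # decorator retries on, the builder RPC above it is
                    # re-executed each time around.
                    transaction.abort()
                    if attempt == RETRIES - 1:
                        raise
        return wrapper
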
[10:16] lifeless: you said that Builder:+edit is slow, I can't find an oops on bohrium in the oops reports [10:17] it was on sat [10:18] only three requests, and I think they were all elmo [10:18] lifeless: yeah I looked in the reports and I can only find one Unauthorized response [10:18] Can someone run the buildd-slavescanner.txt test on devel and see if python starts taking up gobs and gobs of RAM? [10:18] bigjools: how did you look - grep or something else ? [10:18] since I'm a buildd admin I'll play later [10:19] lifeless: I used my email client to search my OOPS reports emails [10:19] bigjools: they only report the top 25 [10:19] by volume [10:20] lifeless: ok then I leave it up to you to put the OOPS on that bug :) [10:20] hmm, elmo linked it [10:20] sec [10:20] ah cool [10:21] bigjools: OOPS-1687L750 I think [10:21] got it, ta [10:46] bigjools: Do you have a few minutes to talk ddebs? [10:46] wgrant: not now, sorry, I'm hellish busy [10:46] Heh, OK. [11:14] hello [11:15] hello [11:16] hello [11:19] lifeless, regarding the deprecation warning you were seeing earlier [11:19] lifeless, I spent some time trying to fix the opacity of the message. [11:20] lifeless, and then gave up. import handlers and warn-on-import warnings are tricky beasts. [11:23] Is the legendary lazr.importguardian any better in this respect? [11:24] wgrant, I'd be surprised if it was. [11:25] were. [11:25] :( [11:26] as far as I can tell, to do it right you'd have to monkeypatch warning.warn and a couple of other methods and change the stacklevel argument [11:26] OR [11:26] just do stack introspection in the warninghandler [11:27] Yay. [11:27] but I'm increasingly unsure what our warning handler actually wins us. [11:39] lifeless, So, are feature flags available to use now? I appear to have missed the news on this one. [11:41] gmb: yes [caveats may apply] [11:42] lifeless, !caveats ;) [11:42] gmb: the caveats are that there is no UI for admining them yet, and little polish - not many scope types etc [11:42] gmb: but that doesn't matter much [11:42] lifeless, Righto. Thanks. Good to know. [11:42] gmb: as a consumer of flags, you just write your code guarded by a flag [11:42] the admin stuff that is missing will affect QA and production only [11:42] QA to see how your patch looks on/off [11:43] Okay. [11:44] production to configure the rules you request (which you can do anytime: because it defaults off you can land your patch, tweak some more, land that, and then ask losas to turn on the feature flag) [11:48] Right === lifeless changed the topic of #launchpad-dev to: Launchpad Development Channel | week 1 of 10.09 | PQM is OPEN | firefighting: - | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting | On-call review in irc://irc.freenode.net/#launchpad-reviews [11:49] anyone familiar with ARM know how many architecture variations there are, roughly? [11:49] do you mean specific SoC's ? [11:49] that sort of thing yeah - they each need a separate distroarchseries [11:50] lots, I reckon. [11:50] and I heard that our current set of supported architectures will balloon [11:50] #linaro might be a better place to ask. 
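
A minimal sketch of the consumer-side pattern lifeless describes to gmb above: code guarded by a flag that defaults off, so the branch can land and be tweaked, and only later be switched on by a production rule. The flag store here is a plain dict, and the helper and flag name are illustrative, not Launchpad's real feature-flag API.

    FEATURE_RULES = {}  # stands in for the production ruleset the LOSAs edit

    def get_feature_flag(name):
        return FEATURE_RULES.get(name)  # unset flags default to None, i.e. off

    def profile_page(user):
        if get_feature_flag('registry.new_profile.enabled'):
            return 'new profile page for %s' % user   # flag-guarded code path
        return 'old profile page for %s' % user       # unchanged default

    print(profile_page('gmb'))                           # old profile page
    FEATURE_RULES['registry.new_profile.enabled'] = 'on'
    print(profile_page('gmb'))                           # new profile page
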
[11:50] jml: yes, that's my approx guess too :) [11:50] * bigjools was not aware of that channel, aha [11:51] yes, 'lots' is definitely the answer [11:51] 1 or more per silicon vendor AIUI [11:55] the reason I ask is because people want arch checkboxes on the "register a distroseries" page but that's probably not possible [11:56] there is a big effort going on [11:56] to make the kernel and libc (IIUC) the only things that vary, so that it would be more pocket-like, or something. [11:56] and by vary I mean 'loadable modules' [11:58] jml: I haven't landed your patches yet. [11:58] jml: however I did release a new project :P [11:58] lifeless, I had noticed both things. [11:59] jml: and your patches are now very high up the pile [11:59] lifeless, grats on the new project. I'm looking forward to giving it a try. [11:59] I appreciate your patience and regret the time its taken to get to them [12:00] lifeless, np :) [12:00] Morning, all [12:00] hi deryck [12:08] deryck, good morning === al-maisan is now known as almaisan-away === jelmer_ is now known as jelmer [12:56] StevenK: you should email the list about this memory problem [12:56] StevenK: I'm seeing the same death on my ec2 land calls [12:56] so we're essentially in stop-the-line mode [12:57] or someone should [12:57] ec2 instances dying at 1744 tests run in the log, memory epxlosion and OOM killing [12:57] however, its mightnight, so me, I'm-a-sleeping [12:57] jml: tag, you're it. ^ :) [12:59] hmm ok. === mrevell is now known as mrevell-lunch [13:16] jml: I have tracked it down, and I can share notes, if you wish [13:17] StevenK, that'd be great, thanks. [13:17] hi [13:18] jml: The troublesome test is lib/lp/soyuz/doc/buildd-slavescanner.txt, the troublesome line in the test is 51, and I have a traceback: http://paste.ubuntu.com/478766/ [13:18] jml: Evidently, I did kill the test horribly to get that traceback, so it might not help much [13:19] StevenK, what happens when you run it locally? [13:19] poolie, hello [13:19] (just saying hi) [13:20] jml: It consumes gobs of memory before I get sufficently nervous and kill it [13:20] You know, that's interesting. It's almost the same place the hang happened over the weekend. [13:20] well, that's a good sign. [13:21] (Imagine how much worse this would be if it behaved nicely locally) [13:23] my first hunch is that it's a bug in z.testing.testrunner. [13:28] jml, + for testr failing --list [13:29] jml, any progress on the suite OOM error? Why do you suspect zc.testing.testrunner? [13:29] StevenK, I don't see the error when running on stable. [13:29] mars, because of the traceback. [13:30] I think you hacked on that code for subunit. At least it should be familiar territory [13:31] not the traceback formatter. [13:31] but familiar enough. [13:31] oddly enough, this interrupts me during some fairly deep surgery on ec2test/remote.py [13:32] jml: I've posted to -dev about it, so you can follow up when you wish [13:32] jml, StevenK, can you reproduce this on a local copy of devel? [13:32] mars: I can [13:32] mars, I'm trying that right now. [13:33] mars, but I always try stable first. [13:33] I get a lot more nervous on my desktop than a random EC2 instance [13:33] hehe [13:33] oh, awakening the OOM serial killer bit [13:33] StevenK, computers are for burning! [13:33] browsers and editors beware [13:34] For instance, I let Python consume 6.8GiB of RAM before the oom-killer stepped in, but on my desktop, I killed it myself after 2.1GiB [13:34] I *can* reproduce this on devel, it seems. 
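
lifeless asks StevenK below whether he knows about ulimit; the same cap can be set from inside Python when chasing a leak like this locally, so the runaway test dies with MemoryError instead of waking the oom-killer. A small sketch:

    import resource

    TWO_GIB = 2 * 1024 ** 3
    # Cap the process address space before running the suspect test; any
    # allocation past the limit raises MemoryError rather than grinding the
    # desktop into swap until the kernel's oom-killer steps in.
    resource.setrlimit(resource.RLIMIT_AS, (TWO_GIB, TWO_GIB))
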
[13:34] jml: Says you :-P === matsubara-afk is now known as matsubara [13:35] although please forgive me while my desktop grinds :) [13:36] this is good news! [13:36] argh - zope testrunner explicitly forbids running under PDB. PDB would catch the original exception before formatException() got to it. [13:36] jml ? [13:36] well, it means the problem is in http://paste.ubuntu.com/478825/ [13:37] and if I were a betting man, I'd say in r11345 [13:37] I have to agree [13:37] * jml verifies [13:38] ......... [13:38] Aha. [13:38] fwiw, this is the stack trace I got when I interrupted. [13:38] http://paste.ubuntu.com/478827/ [13:38] That would explain *everything*. [13:39] And why I couldn't work out what was happening over the weekend. [13:39] Taking that CP into account, the problem is obvious... [13:39] bigjools: ^^ [13:39] StevenK: you know about ulimit? [13:40] ... and science verifies it is indeed that revision. [13:41] wgrant, it's not obvious to me (other than "signals are hard") [13:41] heh [13:41] jml: Note that one of the cases results in an infinite loop. [13:41] It might not be the problem that's causing the excessive memory usage. But it is what caused the crisis over the weekend. [13:41] * jml looks atthe diff === mrevell-lunch is now known as mrevell [13:42] If you stick a try/except in a 'while True' loop, you want to ensure that all the paths have a way to escape... [13:42] I seem to remember recommending not to use an infinite loop. [13:43] poolie: I do [13:43] * bigjools did not use an infinite loop [13:43] ahh. [13:43] bigjools: Why is there an infinite loop, then? [13:43] fuck nose [13:43] Heh. [13:44] bigjools, I'm looking at the code. It's clear that the intent is not to be an infinite loop [13:44] has this revision passed buildbot yet? [13:44] but you know what they say about good intentions. [13:44] bigjools, no. [13:44] ooookay then [13:44] that explains the buildbot failures [13:44] so, the memory thing is because builder.failBuilder(str(reason)) is appending stuff to a list? [13:44] sigh [13:45] so, where's the loop coming from [13:45] I don't think it appends stuff to a list. [13:45] bigjools: The codepath that was problematic over the weekend. [13:46] After calling failBuilder... it logs, then returns to the top of the loop. [13:46] wgrant: that is now obvious [13:46] bigjools, ah, this is why two buildbot slaves fell over from Friday onwards, eh? [13:46] but how is it returning to the top of the loop [13:46] bigjools: How not? [13:46] bigjools, it just continues after the except [13:46] there's a "return" [13:46] bigjools: Only the second except and the else have a return. [13:46] bigjools, not in the failBuilder clause. [13:47] ooooooooooooooo ffffffffffffffuuuuuuuuuuuuuuuuuuuuuck [13:47] Haha. [13:47] fwiw, the loop could be written as 'for i in range(MAX_EINTR_RETRIES)' [13:47] jml: please feel free to do that. I tried and the code became less readable. [13:47] bigjools, I'll have a stab. [13:47] cheers [13:47] jml: in the meantime I'll land a fix [13:48] This explains why a look at db-devel yielded no reasonable explanation for the weekend's behaviour. [13:48] yes :/ [13:48] god DAMN [13:49] jml: It's possible that the logging goes to a StringIO... that's probably likely, in fact. [13:49] bigjools, so you will have to land a roll-back revision before we can restart the builders then [13:50] wgrant, that would make sense. [13:50] mars: eh? [13:51] bigjools, I assume the tests in the revision caused the OOM errors? And the OOM killed our CI build farm. 
So the revision needs to be backed out. [13:51] bigjools, http://paste.ubuntu.com/478837/ [13:51] mars: why can't I just submit a fix? [13:52] bigjools, or that, if you think it is fast to do [13:52] I do! [13:52] jml: thanks [13:52] bigjools, I can land that now, if you'd like. [13:52] (it also has the return fix) [13:52] jml: I'll do it [13:53] I need to CP this branch as well [13:53] ok [13:53] easier if I do the whole thing [13:53] no arguments from me :) [13:53] Is there a good reason not to just have a single return at the end? [13:54] wgrant, I reckon that would be better, yes. [13:54] Slightly less explicit. But also slightly less likely to destroy everything. [13:54] depends how close you want to keep the returns to the code that depends on them [13:54] for readability [13:55] mars, the readability issue here is that the control flow is jumpy [13:55] yep [13:56] so I can see exactly what happened on Saturday now [13:56] we got a xmlrpclib.Fault when an Enablement machine was pulled [13:56] and the loop ensued :/ [13:57] bigjools, fwiw, the test passes locally with the fix. [13:57] It's somewhat more complicated than that, I think. Since it tried to abort. [13:57] yeah [13:57] But that's the main bit of it. [13:57] yup [13:58] Now I don't have to go insane wondering what I missed :) [13:58] perhaps ec2 test should run against stable by default, rather than devel. [13:58] jml: http://pastebin.ubuntu.com/478840/ [13:58] salgado, don't worry about deactivating the Lucid EC2 AMI. The OOM test failures were unrelated to the Lucid upgrade. [13:58] jml: however with that change, test_updateBuilderStatus_catches_repeated_EINTR fails [13:59] mars, yeah, just saw the backlog. :) [13:59] meh. I shouldn't have been so quick to delete my branch with that change :) [14:02] jml: I see why. This is why I didn't use range() :) [14:03] bigjools, why? [14:03] jml: it's not running the code at the bottom of the exception any more [14:03] except in the case of reason[0] != errno.EINTR: [14:03] bigjools, the bottom of which exception? [14:03] except socket.error [14:04] oh, you mean in the case where MAX_EINTR_RETRIES is actually reached [14:04] yes [14:05] * jml tries a fix [14:05] http://pastebin.ubuntu.com/478842/ works [14:08] jml: do you see my point about it being less readable now? [14:08] but maybe http://pastebin.ubuntu.com/478843/ really is the patch with the best cleanness / robustness trade-off. [14:09] jml: that has the same problem [14:10] bigjools, there are so many problems, which one do you mean? [14:10] jml: the one where it doesn't run handleTimeout() when we hit MAX_EINTR_RETRIES [14:10] bigjools, it does. run the tests :) [14:10] jml: oh sorry I can't read [14:14] jml: ok I'll land that with your blessing? [14:14] bigjools, please. [14:14] and we need to cowboy cesium until the CP goes in [14:19] last iteration: http://pastebin.ubuntu.com/478848/ [14:27] jml: land that afterwards if you wouldn't mind, I'm already mid-process for the other change [14:33] ok buildd-manager was restarted with the patch [14:34] bigjools, sure thing. [15:12] EdwinGrubbs: btw, whatever happened with that ml oops? === almaisan-away is now known as al-maisan [15:37] barry: well, it was just spam, so I am now looking at adding the full text of the email to the oops, so that we don't have to track down a losa to see it. [15:38] EdwinGrubbs: +1. i also suggest that we track down that rt and try to work with IS to improve the incoming mta (exim i believe) anit-spam defenses. 
really, stuff like this should rarely if ever actually hit lp [15:39] s/anit/anti/ [15:54] barry: I tried searching rt for requests concerning spam or filtering for mailman, but no luck. Do you have any other information before I create a new request? [15:55] EdwinGrubbs: unfortunately no. i'm sure i no longer have a link to the rt. probably can't hurt to just open a new issue [15:55] ok [15:59] bigjools, jml, I see your change landed on devel - so are we good to restart the buildbot farm? [15:59] mars: yep, thanks [15:59] * bigjools dons brown paper bag [16:00] losa ping, could we please restart the buildmaster and the downed slaves? [16:01] losas, I have no idea what state the lucid buildslaves are in - they probably need a process tree cleanup :/ [16:01] mars, fwiw, I've split TestOnMergeRunner into three different classes [16:01] mars: erm, you mean lpbuildbot (buildmaster is the PPA/archive build master) [16:01] jml, wow, 3? [16:01] mthaddon, yep! Thanks [16:01] jml, mumble? [16:02] mars, yeah. one to handle running generic stuff in a daemon; one as an object that represents the merge request; one as the thing that knows how to run tests and gather results [16:02] that middle object simplified a lot of stuff, I think. [16:02] sinzui, sure. [16:02] mthaddon, the lpbuildbot farm has a buildmaster as well - wonderful confusion :) [16:08] mars: I often get confused by this nomenclature [16:08] I clicked on a graph earlier on and it took me 10 seconds to realised why the thing didn't make sense [16:08] lol [16:09] we should just call it the bbmaster or something [16:09] bbmaster, bbslaves [16:16] jelmer: ping [16:18] EdwinGrubbs, pong [16:21] jelmer: I was wondering if you had a chance to use the feature on edge that lets you create project from a source package. I was starting to QA it but I see there are a lot of meta packages and packages where ubuntu is the upstream, so they don't need a project. [16:23] jelmer: I'm also curious, what is the biggest benefit you get after you link a project to a package? [16:23] EdwinGrubbs: Because of the way the sorting in the needs-packaging list works all of the packages that can be linked at the beginning of the list already have been, that's why there are so many native packages there. [16:24] If you skip a few pages in the list there should be some more packages that don't have an upstream project registered. [16:24] * jelmer looks for one [16:24] e.g. https://edge.launchpad.net/ubuntu/maverick/+source/soprano [16:25] sinzui, http://www.jjg.net/elements/pdf/elements.pdf [16:25] jelmer: what's the first thing you do with a project after it has been linked to take advantage of it? [16:26] EdwinGrubbs: usually I register the upstream branch, and sometimes I add the homepage. [16:26] jelmer: and do you use the upstream branch to automate some of the package building process? [16:28] EdwinGrubbs: not necessarily, usually it's just so I can get a local copy of the upstream source code or perhaps work on a patch against upstream [16:30] jelmer: so, it just makes it easier to get the project in bzr, instead of switching to git or svn to work on a patch? Sorry if this is a lot of questions. I'm trying to improve my understanding of the process. [16:31] EdwinGrubbs: Yep, exactly. It gives me an easy way to track all of my in-progress patches (since they're all nicely listed on my branches page) independent of what project it's for. 
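
A hedged reconstruction of the shape of the buildd-manager bug fixed above, not the actual code (the method and constant names are taken loosely from the discussion): in a "while True" retry loop, an except branch that neither returns nor re-raises silently restarts the loop, which is how one xmlrpclib.Fault turned into an all-weekend spin. Writing it with range(MAX_EINTR_RETRIES), as jml suggests, bounds the damage even when a path forgets its return.

    import errno
    import socket

    MAX_EINTR_RETRIES = 10  # illustrative constant

    def update_builder_status(slave, builder, logger):
        for attempt in range(MAX_EINTR_RETRIES):
            try:
                return slave.status()
            except socket.error as reason:
                if reason.args[0] != errno.EINTR:
                    builder.failBuilder(str(reason))
                    logger.warning('Disabling builder: %s' % reason)
                    # This 'return' is the line whose absence turned the
                    # original while-True version into an infinite loop.
                    return
                # EINTR: the call was interrupted, go around and retry.
        raise socket.error(errno.EINTR, 'Retried %d times' % MAX_EINTR_RETRIES)
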
[16:32] ahhh [16:32] EdwinGrubbs: No problem; this is just how I use it though, I imagine it's quite different for other people. [16:32] jelmer: who else would be good to talk to? [16:34] EdwinGrubbs: Specifically in relation to linking upstream projects and Ubuntu source packages? [16:35] jelmer: yes [16:35] EdwinGrubbs: Ubuntu developers that use bug watches (as registering the upstream bug tracker is required to do bug watching IIUC) and forward bugs from ubuntu to upstream. [16:35] ok [16:36] EdwinGrubbs: The early adopters of daily build packages might also be good candidates, as they need both the upstream source branches and ubuntu packaging branches to build their packages. [16:37] I'm not sure if the actual link between the upstream and the ubuntu source package is of benefit to them though, just that they use both the upstream project and the packaging in ubuntu. [16:37] sinzui, hi, is the partial fix for bug 607879 really testable, or should we mark it as qa-untestable for now? [16:38] <_mup_> Bug #607879: ~person/+participation timeouts [16:38] Ursinha, it only reduced the query count [16:39] sinzui, so not really testable [16:40] Yes, We can only see query counts go down. [16:42] jelmer: thanks for the info [16:43] sinzui, I'll change the bug tag then. Thanks [16:43] thanks === salgado is now known as salgado-lunch === Ursinha is now known as Ursinha-brb [17:18] rockstar: does sourceforge let us mirror their branches? I remember this being a problem a long time ago, but I can't remember if it was resolved. [17:18] EdwinGrubbs, I believe so. === matsubara is now known as matsubara-lunch === Ursinha-brb is now known as Ursinha === salgado-lunch is now known as salgado [18:17] * jml hates writing tests after writing the code === matsubara-lunch is now known as matsubara [18:30] EdwinGrubbs: I think in the case of irda-utils you probably wanted to import /trunk rather than the root of the repository. [19:00] benji, ping [19:02] benji (or gary_poster), jtv has been working on a fake librarian for testing. He registers it using zope.component.provideUtility, but (simply by unregistering the utility it provides) fails to restore the original ILibraryFileAliasSet utility [19:03] I got stuck on this a few months ago and change my implementation since I could not solve the deregistration issue [19:03] jtv was just trying to talk to me about this when his connection died [19:06] sinzui, talking to jtv about it in private message [19:06] sinzui: thanks… sorry for dropping out there; pidgin died [19:08] jelmer: oops [19:12] moin === EdwinGrubbs is now known as Edwin-lunch [19:17] is the issue StevenK reported with ec2 runs OOMing fixed ? [19:19] lifeless, yes [19:20] has it landed in devel - if I just ec2 land, will my stuff go through (assuming my bits are good) [19:21] yes, mine just went thrrough [19:21] \o/ [19:21] time to fire up the engines boys! [19:23] hello what [19:23] lifeless, the fix has yet to reach stable, according to lpbuildbot. [19:24] I have just added a new option: "ec2 test --dont" [19:24] jml: long as its in devel [19:25] jml, 'ec2 test --dont' ? [19:25] mars, yeah. it sets up the instance ready to run the test suite, and then it doesn't. [19:25] jml, heh, I've been instructing people to run 'ec2 demo' [19:26] mars, yeah, but that does more stuff. 
[19:26] or "ec2 test -o '-t somejunk'" [19:26] "ec2 test --postmortem -o '-t somejunk'" [19:26] that leaves the instance running [19:27] mars, also, it doesn't conveniently tell you the command that ec2 would run if it were going to run the tests. [19:27] true [19:27] jml, ec2 test --not-really :) [19:27] yeah. I might rename it. [19:30] --setup-only works too [19:30] jml, in your list mail about the OOM problem, what do you mean by "Test suite failures often really do mean production failures" ? [19:31] no one has mentioned a cowboy or production issues in the thread [19:31] mars, oh. well, there was both :) [19:32] the sequence went something like: [19:32] * critical production issue A [19:32] * cowboy fix for issue A [19:33] * land fix for issue A [19:33] * critical production issue B caused by fix for issue A; critical test suite failure caused by fix for issue A [19:34] and then you can pick up the rest from there. [19:34] nice [19:36] impressive - that one block of code knocked out production and our development pipeline [19:36] I don't think that has happened before [19:38] rockstar, hi, is https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1686EA3556 really supposed to be an oops? [19:40] Ursinha, yes. If we're not catching it, then there's a bug there. [19:40] rockstar, I mean, is that supposed to fail as an oops or just be caught to display a nice error message to the user? [19:40] Ursinha, the latter. The bug is that we're not catching it. [19:41] rockstar, right. mind if I file a bug for that? [19:45] mars, well, by far the most common pattern is to run tests before applying a change to production. [19:45] jml, heh, I was just writing a reply asking about that. So why did we not do so this time? [19:46] jml, and could we have just run '-m lp.soyuz' and felt confident enough with that? [19:46] mars, I don't know exactly. If I had to guess I'd say that it would have taken too long. [19:47] mars, I'm not 100% sure that would have caught the failure. [19:47] I am just wondering if there is a natural suite partition there [19:48] mars, do you mean as code currently stands now, or as it ought to be? [19:48] as it ought to be [19:48] mars, probably. [19:49] there is no way to run 'soyuz-not-app-server-code', or 'soyuz-just-the-app-server', but we can run 'soyuz' [19:49] which should be much faster that 'soyuz+bugs+code+..." [19:49] I'll try timing it, see what happens [19:50] mars, I'm not sure if you know, but lots of the code that runs that particular production system lives in paths that do not have 'soyuz' in them at all. [19:50] jml, I did not know that [19:51] lp.buildmaster being the one that I'm most certain about. [19:52] yes, was just looking at that === al-maisan is now known as almaisan-away [19:54] dammit. testing this refactoring shows that I am basically incapable of writing Python without unit tests. [19:57] jml: :) [19:57] I just fixed my fourth simple name error. [19:57] jml, pyflakes? [19:58] mars, doesn't help with instance variables. 
[20:02] jml: interestingly I suggested a range() approach too, back when reviewing the EINTR change [20:04] lifeless, well, a little paranoia is healthy, and all that [20:04] lifeless, I notice that Twisted & bzrlib both have until_no_eintr-style helpers and that both use infinite loops [20:04] yeah [20:04] I haven't read your mail yet [20:05] I grant that the likelihood of a call generating eintr forever is pretty small [20:05] in a call with the isd guys about logging & stuff [20:05] jml: lamont argues that forever is the right answer [20:05] but something in me bristles all the same [20:07] lifeless, for the moment, I disagree. Everything needs a circuit breaker. [20:07] also, it somehow became 8pm. [20:08] one more test run... [20:11] jml: I suggested that perhaps the buildd manager become an API client [20:11] jml: bigjools felt that this was a terrible idea [20:12] lifeless, I don't think it's a terrible idea, but I'm not sure how it would help. [20:12] lifeless, or rather, the first thing that needs to happen to the buildd manager is to clean up a bunch of needlessly complex code. [20:13] jml: it would have meant that there wasn't a deadlock on the builder row in the DB, so the disabling would have worked; the b-m might still have broken, but less cascade would have happened [20:13] I worry about the thread-locals model of zope with the context model of twisted [20:13] lifeless, oh huh. [20:13] lifeless, we managed to get it quite stable with the old authserver [20:13] lifeless, but it was a trial. [20:14] lifeless, switching to internal xmlrpc helped a lot. [20:14] perhaps its only a worry [20:14] but I would be happier with the scheduler being just a thing that talks to webservices and local processes [20:16] two thoughts [20:16] 1. there are no good twisted clients for the API [20:16] didn't jamesw make one? [20:16] lifeless, it's a prototype [20:17] aren't they all? [20:17] 2. having some kind of deferred-returning calls for the db stuff makes integration points _really_ obvious, which is nice. [20:17] one of the problems is this [20:17] model objects know they are in a transaction [20:18] but we don't want long lived transactions [20:18] anyhow [20:18] * jml is off. [20:18] g'night. [20:18] gnight === lifeless changed the topic of #launchpad-dev to: Launchpad Development Channel | Performance tuesday! | Week 1 of 10.09 | PQM is OPEN | firefighting: - | https://dev.launchpad.net/ | Get the code: https://dev.launchpad.net/Getting | On-call review in irc://irc.freenode.net/#launchpad-reviews [20:39] sinzui, I see that bug 612408 is marked as Fix Released but I still see occurrences of that OOPS on lpnet [20:43] It is not fix released... [20:44] damn it, the branch was marked fix committed when we knew we needed several branches to fix this. [20:44] Ursinha, it is in progress. We are waiting for jcsackett's branch to land [20:44] I updated the status and milestone [20:45] sinzui, right. So, if you want to avoid QA bot to change this bug, just add the [incr] tag to your commit msg, or if using ec2 land, there's the --incremental option [20:45] sinzui, so it won't close your bug until you land a fix without the incremental tag [20:46] Ursinha: i think that's info more directed at me, and my apologies if i screwed up QA process. [20:46] deryck: sorry about the wrong project, the back-link seemed to leave project-wide bug filing state all confused :) [20:46] Ursinha: thanks for mentioned the incr tag. 
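
One hedged reading of the "deferred-returning calls for the db stuff" idea in the earlier lifeless/jml exchange about mixing Zope's thread-local transactions with Twisted: push each blocking transactional job onto a worker thread and hand the reactor a Deferred, so no transaction stays open across reactor callbacks. A sketch under those assumptions; run_in_transaction is a made-up helper, not Launchpad's actual buildd-manager code.

    import transaction
    from twisted.internet import threads

    def run_in_transaction(function, *args, **kwargs):
        """Run blocking, transactional work off the reactor thread.

        Returns a Deferred that fires with the function's result; the
        transaction is committed or aborted in the worker thread, never
        in the reactor thread.
        """
        def job():
            try:
                result = function(*args, **kwargs)
            except Exception:
                transaction.abort()
                raise
            transaction.commit()
            return result

        return threads.deferToThread(job)
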
[20:46] jcsackett, ah, no, my fault I haven't announced that properly, I'm afraid [20:46] lifeless, dude, no worries. Just doing CHR duties today. [20:46] Ursinha: either way, thanks for the info. :-) [20:46] jcsackett, my pleasure :) === matsubara is now known as matsubara-afk [20:48] deryck: have we had any feedback on +filebug ? [20:48] anyone come and loved us or hated us ? [20:49] lifeless, not that I know of. No feedback at all. Neither for or against. [20:49] no news is good news, I hope. [20:49] well thats something [20:49] certainly no noose is good nes [20:50] (gotta love mel brooks) [20:50] sinzui, did you update the bug? I can't see it here... [20:51] ah, just showed up [20:51] sinzui, jcsackett, thanks [20:51] Ursinha, https://bugs.edge.launchpad.net/launchpad-registry/+bug/612408 [20:51] sinzui, thanks, it just updated here. don't know why it took so long, though [20:52] I see In progress, 10.09, oops qa-bad [20:53] sinzui, now I see the same too [20:54] Ursinha: farming the oops reports ? [20:54] lifeless, yes sir [20:54] Ursinha: how are we placed for the merge workflow ? [20:55] Ursinha: anything I can help with ? [20:55] which reminds me, time to check the rt ticket status [20:55] lifeless, sure, you can take the rollback part if you want. I had no time to tackle that last week, am working on writing more tests to it [20:56] I won't today, because its performance day, but lets talk about that ~ this time tomorrow, and I'll likely make the same offer again :- but actually do it :) [20:56] lifeless, thanks [20:57] we're only waiting on the tagger and rt 40482:live-schema staging environment [20:57] AFAIK [21:03] flacoste: hi [21:04] hi lifeless [21:04] we're on, I think ? [21:04] lifeless: we are! [21:07] Later on, all. === almaisan-away is now known as al-maisan === Edwin-lunch is now known as EdwinGrubbs [21:41] sinzui: ping [21:41] sinzui: I need a slow milestones url page or bug # please :) [21:42] flacoste says landscape/+milestones ? [21:44] lifeless, yes, that is often a good one *if* you can see all their private bugs [21:44] project groups are a little different that projects and distros. [21:44] flacoste: https://devpad.canonical.com/~stub/ppr/lpnet/latest-daily-timeout-candidates.html [21:45] sinzui: I will arrange to be able to :) [21:45] jkakar: ping [21:50] morning [22:05] lifeless: https://wiki.canonical.com/Launchpad/Sprints/BugJamDecember2010 === Ursinha is now known as Ursinha-afk === Ursinha-afk is now known as Ursinha [22:11] sinzui: do you have an oops report for the landscape one ? [22:11] sinzui: or a bug number ? [22:11] lifeless: Hiya. [22:11] sinzui: performance day, flacoste thought that looking at this pain point might make you happy :) [22:11] jkakar: hiya [22:12] jkakar: going to be doing some performance work on milestone views, and apparently the landscape private bugs are a contributing factor in its pain-level [22:12] flacoste: 2010-12-13 to 2010-12-24 is a bit more than a week [22:12] jkakar: but I'd need to be able to see them to see the issues [22:12] jkakar: so, I was wondering, if I'm not already, if I could be in the relevant group for a week or two [22:13] lifeless: I'm happy to add you to the landscape team, but unfortunately you'll get a lot of mail. [22:14] jkakar: thats ok, I'll treat it the same way I treat launchpad bugs :) [22:14] lifeless: Added. [22:15] thanks [22:15] lifeless: And by "same way I treat launchpad bugs" that means we should expect merge proposals from you, right? ;b [22:15] perhaps... 
[22:16] Hehe
=== salgado is now known as salgado-afk
[22:24] https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1689EB3060 is a sample from https://edge.launchpad.net/landscape-project/+milestones
[22:24] 305 16ms statements
[22:25] 190.0ms for the longest statement
[22:25] I think I can do something for this
[22:31] hello, happy performance day
[22:33] thank you
[22:35] lifeless, there is no current bug number for the Milestone timeout. It reappeared last week after the cache was removed.
[22:38] matsubara-afk: ping
[22:39] lifeless, this is the bug that was tracking milestones with lots of private bugs: https://bugs.edge.launchpad.net/launchpad-registry/+bug/447418
[22:40] ^ it was closed after the oopses disappeared. I had changed the security on the bugs to lp.View after verifying them once.
[22:43] sinzui: thanks
[22:44] lifeless, this is also the bug where I discovered the number of assigned users is also a factor. We format links to them. Ubuntu has a lot of assignees :(
[22:45] sinzui: I will look, journal what I find, and see if I can help
[22:45] thanks
[22:45] would you like me to edit that bug
[22:45] or make a new one ?
[22:46] Let's reopen it since it has made a few occurrences this week
[22:46] ok, I'll do so
[22:48] lifeless, was pondering creating a memcached rule for person/pillar link formatters. 250 assignees is a lot of icon lookups
[22:49] damn, a failure in the cacheproperty branch >< ah well, ec2 knows best :)
[22:49] sinzui: What does an icon lookup entail ?
[22:49] librarian calls
[22:49] wha?
[22:49] we render them on the appserver?
[22:50] we look for an icon, then insert a link to the librarian icon for the link
[22:50] that's just a query then
[22:50] yes
[22:50] any reason we can't prepopulate it ?
[22:51] Since we link to users on every page, should all assignees, bug/branch/question commenters be prepopulated?
[22:53] I suspect so
[22:53] only teams have icons, only teams can be private (not rendered). Teams do not change their icons very often.
[22:53] sure
[22:53] at the lowest level though, a lookup is a lookup, and postgresql is as good as memcache at doing those very quickly
[22:53] Since we would need a separate memcache mechanism, we could build once with a knowable key that we can invalidate on change
[22:54] but unlike memcache we can do them all at once, whereas with memcache we have a serialised interface (due to our appserver structure)
[22:54] so it will be faster to do this in postgresql
[22:55] (I think)
[22:55] memcached is really slow when making lots of requests.
[22:55] oh, that reminds me. Last time I looked at the bugtask badge decoration code, we were looking for a mentoring icon. We should stop that
[22:55] yeah this page is definitely death by sql
[22:56] It is faster than lots of pg requests I am told
[22:56] SQL time: 7505 ms
[22:56] Non-sql time: 7849 ms
[22:56] Total time: 15354 ms
[22:56] Statement Count: 390
[22:56] sinzui: Yes, but so is a snail.
[22:56] sinzui: single-pg-request < many memcache requests < many pgsql requests
[22:56] I have not yet sent my email outlining the death by many menus I am seeing hiding in our oopses
[22:56] ooh, interesting
[22:57] so wgrant, going to join in performance tuesday today ?
[22:58] If beat-my-project-team-into-doing-work doesn't take up too much time.
[22:58] wgrant: \o/
[22:58] wgrant: I can fedex a bat from a local friend, if needed ?
[22:58] Did the buildd-manager stuff get sorted out last night?
[22:58] Heh.
[22:58] yes
[22:59] lifeless, I need to start a dinner. I promised my team I would send an email outlining the cost of using the existing menu implementation with a suggestion of how to address it.
[22:59] sinzui: start your dinner then :)
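(On the icon-lookup discussion above: a minimal sketch of the "prepopulate it all at once in postgresql" idea, assuming a psycopg2-style DB-API cursor; the table and column names are guesses, not the real Launchpad schema.)

    def prefetch_person_icons(cursor, person_ids):
        # One round trip for every assignee on the page, instead of an
        # icon query per formatted person/pillar link.  'Person' and
        # 'icon' are assumed names for this sketch.
        if not person_ids:
            return {}
        cursor.execute(
            "SELECT id, icon FROM Person WHERE id = ANY(%s)",
            (list(person_ids),))
        # Map person id -> icon (librarian reference); the link
        # formatter can read from this dict during rendering with no
        # further per-person queries.
        return dict(cursor.fetchall())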
[22:59] I'm a little concerned that a rev seems to have been CPd without making it through the test suite. But I guess it was urgent enough that it may have been cowboyed.
[22:59] wgrant: it was
[23:00] Ah, good.
[23:04] so, https://edge.launchpad.net/launchpad-foundations/+milestone/10.08 seems to be a regular milestone page with all the features present
=== al-maisan is now known as almaisan-away
[23:09] does create_initialized_view process the template etc?
[23:11] lifeless: what do you mean?
[23:12] lifeless: it creates the view and initializes it
[23:12] Can someone check buildd-manager logs? It's been stalled for a few minutes.
[23:12] I don't know what 'initializes it' entails
[23:12] I want to make sure that I capture all the SQL done by the view & template rendering
[23:12] what's the most tasteful way to do that?
[23:12] lifeless: it calls the initialize method
[23:12] lifeless: which sets up any fields and widgets
[23:13] lifeless: it does not render the view
[23:13] ok
[23:13] so what should I do to accomplish my goal
[23:13] lifeless: that is either __call__ or render
[23:13] not sure which is the best way
[23:13] probably to use a browser test case
[23:13] and get the test browser to load the page
[23:14] ok
[23:15] So, we now have two or three untested buildd-manager CPs, and it appears to have fallen over.
[23:15] losa ping
[23:15] Well, someone could check the logs.
[23:16] yes
[23:16] someone experienced, who knows that bit, and who can act to fix it, would be best, no ?
[23:16] True, but LOSAs are unlikely...
[23:16] Oh, I guess some might be back.
[23:17] I know some are back :)
[23:35] Well, apparently not.
[23:37] wgrant: I have one in a private channel - sorry. But they are feeling flat-out - multiple issues at once etc
[23:38] Ah.
[23:41] wgrant: trying to figure out what is going on with the builders
[23:41] wgrant: i was led to believe you might have a working theory?
[23:42] mbarnett: No idea. What are the logs saying?
[23:44] wgrant: still poking around. last i saw was a "no route to host" error
[23:44] trying to get more info now
[23:44] i think the buildd-master may be well and truly hung
[23:45] it is logging nothing, and the running process doesn't seem to be actually doing a single thing.
[23:45] Gonna give it another minute and see if it wakes up. If not, it will be time to hit it with things.
[23:46] lifeless: What's the best course of action to debug that? strace, then SIGINT and hope we get a traceback?
[23:46] strace
[23:46] well
[23:46] ps first
[23:46] see what state the process is in
[23:46] then strace, which may 'fix' it
[23:47] strace gives NOTHING
[23:47] if that doesn't, attach gdb and get a backtrace
[23:47] lp_buildd@cesium:/srv/launchpad.net/production-logs$ strace -p 10680
[23:47] Process 10680 attached - interrupt to quit
[23:47] select(39, [9 38], [], [], NULL
[23:47] that probably means it's dropped the ball on a deferred
[23:47] there was an error in the logs when the process hung
[23:47] :( All the CPs are tested.
[23:47] "no route to host"
[23:47] mbarnett: What was the full line?
[23:48] wgrant: the full line from the strace?
[23:48] mbarnett: From the 'no route to host' line.
[23:49] http://pastebin.ubuntu.com/479111/
[23:49] wgrant: ^
[23:49] Well,
[23:49] This is the same codepath that died last time.
[23:49] There's nothing after that log entry?
[23:49] awesomesauce
[23:49] nope
[23:49] that's it
[23:49] WTF.
[23:50] Does that still have the two cowboys, or is it properly CPd now?
[23:50] CP'd
[23:50] actually, let me verify
[23:50] So, we have replaced an infinite loop with a hang. Awesome.
[23:51] nope, it is cowboy'd
[23:51] with http://pastebin.ubuntu.com/478843/
[23:51] That's the only cowboy?
[23:52] there is another that is not listed on the production status page
[23:52] Should be to the same piece of code.
[23:52] * wgrant tries to reproduce locally.
[23:53] http://pastebin.ubuntu.com/479112/
[23:53] that is also on there
[23:53] * mbarnett goes to update the production status page
[23:53] Oh, wait, it isn't quite the same codepath as last time. But it's right next to it.
[23:54] \o/
[23:54] Hm, there's only a test diff?
[23:54] it isn't the code tests breaking the translation tests
[23:54] Nothing in lib/lp/buildmaster/model/builder.py itself?
[23:55] * thumper confirms by running the other half
[23:56] i completely don't see how a failure there should hang the buildd-manager
[23:57] it might stop that builder from working forever more
[23:57] wgrant: 478843 is a cowboy to builder.py
[23:57] 478843?
[23:57] wgrant: http://pastebin.ubuntu.com/478843/
[23:57] mwhudson: Of course it shouldn't.
[23:57] Oh.
[23:58] But this is buildd-manager.
[23:58] Well, I can't see what's going on. May be good to gdb out a backtrace.
[23:59] (and this time I think I'm actually taking all the cowboys into account...)
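(On the create_initialized_view question earlier: since it only calls initialize(), capturing the SQL from template rendering as well means forcing render()/__call__, or loading the page through the test browser. A minimal sketch of the first option, assuming the Launchpad test environment; the imports, helper names, and view name here are best-effort guesses rather than the exact API.)

    from lp.testing import TestCaseWithFactory
    from lp.testing.views import create_initialized_view

    class TestMilestoneIndexQueries(TestCaseWithFactory):
        # (A database test layer assignment is omitted here for brevity.)

        def test_render_runs_the_template(self):
            milestone = self.factory.makeMilestone()
            # create_initialized_view() only calls initialize(): fields
            # and widgets are set up, but the template has not run yet.
            view = create_initialized_view(milestone, name='+index')
            # Forcing render() executes the template too, so any SQL it
            # issues is included in whatever statement capture is active.
            view.render()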