[00:09] <wallyworld_> thumper: should i ask sk to review those 2 daily recipe build list page branches?
[01:04] <poolie> bugclient now fails with
[01:04] <poolie> AttributeError: 'Entry' object has no attribute 'markAsDuplicate'
[01:04] <poolie> did this recently change in the api?
[01:05] <lifeless> what api version are you using?
[01:06] <lifeless> ?11958
[01:10] <james_w> lifeless, api/1.0> why not?
[01:11] <lifeless> james_w: ?
[01:11] <poolie> bug 11958
[01:11] <_mup_> Bug #11958: Unable to show hidden files <nautilus (Ubuntu):Fix Released by seb128> <https://launchpad.net/bugs/11958>
[01:11] <lifeless> oh, that was a typo
[01:11] <lifeless> pasted to the wrong window
[01:12] <james_w> lifeless, <lifeless> salgado-afk: please don't put the blueprints in api/1.0
[01:12] <lifeless> james_w: right. Because 1.0 is supported for many more years.
[01:12] <lifeless> but jml has the issue tracker thing in train for the next year.
[01:13] <james_w> why not stop the default being to add features to old versions then?
[01:13] <lifeless> there's a bug open on that
[01:13] <lifeless> IIRC
[01:13] <james_w> ok
[01:13] <lifeless> if there isn't, there should be - this has been discussed.
[01:14] <lifeless> we shouldn't 'support' things we haven't finished.
[01:14] <lifeless> we should do best effort etc
[01:14] <lifeless> but we need to allow room for mistakes.
[01:15] <poolie> hello james_w
[01:21] <james_w> hi poolie
[01:26] <lifeless> poolie: hi, so what api version are you using?
[01:26] <poolie> i don't know
[01:27] <lifeless> poolie: I ask, because it's possible some things that were accidentally exposed have been shuffled, but we're meant to be very conservative about removing stuff.... trying to estimate whether to dig deeper or not.
[01:27] <poolie> i seem to be just taking the default?
[01:28] <poolie> <bug at https://api.edge.launchpad.net/1.0/bugs/415936>
[01:28] <_mup_> Bug #415936: Merge into new branch produces strange log <Bazaar:New> <https://launchpad.net/bugs/415936>
[01:28] <poolie> so 1.0, i guess
[01:29] <lifeless> hmm, you're using edge too :)
[01:29] <lifeless> could you switch to LPNET_SERVICE_ROOT :)
[01:30] <poolie> heh, it's a hangover from that once being the only place to get it
[01:30] <poolie> much better to switch things on/off on the server
[01:30] <poolie> sure
[01:30] <lifeless> we can't sadly
[01:30] <lifeless> lp API clients start with a POST
[01:30] <poolie> i mean in future
[01:30] <lifeless> and blow up if that's redirected or otherwise handled by non-edge.
[01:31] <poolie> sure
[01:31] <lifeless> uhm, so 1.0
[01:31] <lifeless> let's see
[01:31] <poolie> we could change lplib to make EDGE_SERVICE_ROOT == LPNET_SERVICE_ROOT
[01:31] <poolie> but that might just cause confusion
[01:31] <lifeless> we're going to do something like that
[01:31] <lifeless> perhaps with a deprecation warning
[01:31] <poolie> hm that name doesn't exist
[01:32] <poolie> ok, i have it
[01:33] <lifeless> ah cool
[01:33] <lifeless> what was it ?
[01:34] <lifeless> ok, so markAsDuplicate is in beta
[01:34] <lifeless> not in 1.0 or devel
[01:35] <lifeless> I can check the changelog but my guess is that you've moved from a launchpadlib that used beta by default to one that uses 1.0 by default
[01:35] <lifeless> the current idiom is to use duplicate_of = xxx
[01:35] <poolie> hm
[01:35] <poolie> something like that
[01:35] <lifeless> you can pass beta as the api version you want
[01:35] <poolie> though, this is on my maverick machine, running the packaged lplib
[01:35] <lifeless> or update your code
[01:36] <poolie> i doubt that changed in the last few weeks?
[01:36] <lifeless> that would be unusual.
[01:36] <lifeless> could you file a bug? Probably a mistake then.
[01:36] <LPCIBot> Yippie, build fixed!
[01:36] <LPCIBot> Project devel build (238): FIXED in 3 hr 45 min: https://hudson.wedontsleep.org/job/devel/238/
[01:36] <LPCIBot> * Launchpad Patch Queue Manager: [r=abentley][ui=none][no-qa] New "images" command for bin/ec2 to
[01:36] <LPCIBot> display all current test images.
[01:36] <LPCIBot> * Launchpad Patch Queue Manager: [r=flacoste][ui=none][bug=677305] Downgrade bzr to 2.2.0
[01:38] <poolie> hm, i see i was repeating myself and defaulting to edge in two places
[01:38] <lifeless> you might find https://dev.launchpad.net/PolicyAndProcess/OptionalReviews interesting
[01:39] <poolie> i've followed some of the mail about it
[01:40] <poolie> ok, so even on 1.0 non-edge, it still fails
[01:40] <lifeless> markAsDuplicate isn't in 1.0
[01:40] <poolie> "stable interfaces in python are hard"
[01:40] <lifeless> perhaps it should be ;) - thus a bug is needed.
[01:40] <poolie> i'm pretty sure this was working a week or two ago
[01:41] <poolie> maybe a bit more than this, but within the last couple of months
[01:41] <poolie> i will file
[01:44] <poolie> the shorter Optional Reviews report: "data, bitches!"
[01:44] <poolie> i think that's great
[01:44] <poolie> to be devil's advocate
[01:45] <poolie> i think the previous experience is not so much showing everything needs review
[01:45] <poolie> but rather that people will use these to route around a broken review process
[01:45] <poolie> cf john's mail
[01:45] <poolie> but perhaps things have now changed
[01:45] <poolie> s//probably
[01:50] <lifeless> well
[01:51] <lifeless> I think that route around is better than stockpiling
[01:52] <thumper> wallyworld_: yes, fling them StevenK's way
[01:52] <thumper> wallyworld_: I'm just about to head and collect the girls from school
[01:52] <wallyworld_> thumper: ok
[01:52] <thumper> wallyworld_: we should have a chat after that
[01:52] <wallyworld_> thumper: i'll be here
[01:54] <poolie> lifeless: i think in the first version of this code i did assign to 'duplicate_of', but..
[01:54] <poolie> that didn't work
[01:54] <poolie> i can't remember if the change was not saved, or if it gave an error
[02:01] <lifeless> wgrant: yo
[02:01] <lifeless> poolie: you need to call obj.lp_save()
[02:02] <poolie> i know that, and i think it wasn't enough
[02:02] <poolie> imbw
[02:02] <poolie> anyhow, bug 680339
[02:02] <_mup_> Bug #680339: 'Entry' object has no attribute 'markAsDuplicate' <Launchpad itself:New> <https://launchpad.net/bugs/680339>
[02:03] <lifeless> thank you
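Put together, the duplicate_of assignment (in place of markAsDuplicate) plus the lp_save() call discussed above might be sketched like this; mark_duplicate is an illustrative helper name, not something from the log:

```python
def mark_duplicate(lp, bug_number, master_bug_number):
    """Mark one bug as a duplicate of another over the 1.0 web service API.

    `lp` is assumed to be a launchpadlib Launchpad object, e.g. from
    Launchpad.login_with('example-client', 'production', version='1.0')
    ('example-client' being an illustrative consumer name).
    """
    bug = lp.bugs[bug_number]
    bug.duplicate_of = lp.bugs[master_bug_number]
    bug.lp_save()  # attribute changes stay local until lp_save() is called
```

Without the lp_save() the assignment never reaches the server, which would match the "change was not saved" symptom poolie half-remembers.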
[02:16] <wallyworld_> thumper: back in 15 mins - have to drop the kid to his McJob
[02:17] <thumper> wallyworld_: ok
[02:20] <poolie> thumper, wallyworld_, we're going to do a bzr 2.2.2 release at the end of this week
[02:20] <poolie> that should address the problems you hit in 2.2.1
[02:21] <thumper> poolie: ok
[02:21] <poolie> feedback welcome, as always
[02:22] <wgrant> lifeless: Hi.
[02:23] <wgrant> What's broken?
[02:24] <lifeless> wgrant: cesium - removing cowboys
[02:25] <lifeless> wondering if you know the rev that landed the fix
[02:25] <lifeless> wgrant: how was your exam?
[02:25] <wgrant> lifeless: Exam was rather better than expected.
[02:25] <wgrant> cesium... what was the cowboy? The builder disabling thing?
[02:25] <spm> wgrant: all done? wooo!
[02:26] <wgrant> Indeed.
[02:30] <wgrant> lifeless: Unless there was more than just logging and not disabling builders, devel r11938 looks to be the one.
[02:35] <wallyworld_> thumper: call now?
[02:37] <lifeless> wgrant: there was
[02:37] <lifeless> disabling the failure checking
[02:38] <lifeless> http://pastebin.com/Kr44BkbD
[02:39] <wgrant> lifeless: 11938 should render that pointless.
[02:39] <wgrant> Though I'd, er, check with someone else.
[02:41] <lifeless> wgrant: it looks complete to me
[02:43] <lifeless> the only gap is that first hunk
[02:44] <wgrant> Yeah, that's what I meant.
[02:44] <wgrant> I think.
[03:57] <lifeless> jamesh: do you remember how to tell zope to use a different thread count?
[03:57] <jamesh> lifeless: there is an option in the .conf file
[03:57] <jamesh> I think it might even be called thread_count
[03:59] <jamesh> this is the ZConfig .conf file
[03:59] <lifeless> is that 'launchpad.conf' ?
[04:05] <jamesh> yes
[04:10] <jamesh> lifeless: looking at one of the ancient launchpad trees on my system, one of the launchpad.conf files under configs/ has "threads 16" at the top level
[04:10] <lifeless> thanks!
[04:10] <lifeless> also, terrible idea :)
[04:11] <jamesh> I remembered that we had bumped the count up for the demo.launchpad.net instance
[04:11] <jamesh> so checked that config
[04:12] <jamesh> that instance wasn't under heavy load, but it was easy for one user to block everyone else with the default thread count
[04:18] <lifeless> yeah,
[04:18] <lifeless> we're in the process of (with measurement) dropping down to 1 thread per appserver
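The setting jamesh dug up would sit at the top level of a launchpad.conf, something like this (ZConfig syntax; only the recalled directive is shown, the rest of the file is elided):

```
# configs/<instance>/launchpad.conf (ZConfig): top-level thread count,
# as found in the old demo.launchpad.net config mentioned above.
threads 16
```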
[04:32] <poolie> lifeless: interesting! i don't recall you posting about it
[04:32] <poolie> (not that you have to, i suppose)
[04:32] <lifeless> in various perf tuesday mails
[04:32] <poolie> on the grounds that, if they're not wasting time, you wouldn't get more than one python thread to run anyhow?
[04:32] <lifeless> we're seeing things that thread starvation is the best explanation for
[04:32] <lifeless> and yes
[04:32] <lifeless> that too
[04:33] <poolie> ok, i do see a mention in passing
[04:35] <lifeless> jamesh: yes, thats it
[04:35] <lifeless> jamesh: thanks for finding it
[04:36] <lifeless> poolie: we could reasonably expect total time/database time threads to be a reasonable figure
[04:36] <lifeless> but it's simpler to let the OS manage things at that point
[04:37] <lifeless> + if we do decide to debug something only one user will be impacted
[04:37] <lifeless> poolie: I did some more analysis in a ubuntu-one thread
[04:39] <poolie> agree about letting the OS do it
[04:40] <poolie> istm the main drawback is there are things that python knows are shareable in memory, that the OS doesn't
[04:40] <poolie> like modules
[04:40] <poolie> and for lp they tend to be large
[04:40] <poolie> but maybe this is not a sufficiently important factor
[04:42] <lifeless> it's a couple hundred MB
[04:42] <lifeless> 1.6GB on a fully loaded machine.
[04:42] <lifeless> it could be better
[04:43] <lifeless> but we should get a win regardless, or so says the theory.
[04:46] <lifeless> yay
[04:46] <lifeless> https://launchpad.net/~lifeless/+commentedbugs working
[04:47] <poolie> timed out for me..
[04:47] <lifeless> hahrugle
[04:48] <poolie> in 'select count from bugtask....'
[04:48] <poolie> but it's a lovely sounding url :)
[04:48] <lifeless> what revno
[04:48] <poolie> 11952
[04:48] <lifeless> damn
[04:48] <lifeless> At least 42 queries/external actions issued in 2.16 seconds
[04:48] <lifeless> r11952
[04:50] <lifeless> poolie: it's working consistently for me
[04:50] <lifeless> poolie: what's the OOPS id?
[04:52] <poolie> https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1788K350
[04:52] <poolie> works for me now
[04:53] <poolie> is this a new feature, or just a newly-faster feature?
[04:53] <poolie> worked this time
[04:53] <lifeless> top OOPS since the monthly rollout
[04:53] <poolie> ah i see
[04:53] <poolie> maybe it was 'want an affecting bugs list' i was thinking of
[05:59] <spm> wgrant: btw. I'm bemused and entertained at your dedication. Finish final exam, hop on IRC. ISTR that post my final exam at Uni, a group of us went to the Brekky Creek Hotel in Brisvegas for a Steak lunch and afternoon of entertaining the regulars in the beer garden with (badly) Monty Python songs sung.
[06:01] <wgrant> spm: Well, a few of us finished early, went to pub, consumed beer, returned to the exam venue to see everyone else, then went back to my office.
[06:01] <wgrant> And then IRC, yes :P
[06:02] <spm> bwhahaha
[06:16] <lifeless> spm: hey
[06:16] <lifeless> sorry, ECHAN
[06:42] <poolie> spm, ah, the brekky creek
[06:42] <poolie> is it a known problem that all api calls against staging are giving 500 errors?
[06:43] <poolie> and indeed the web ui too
[06:44] <stub> Hayfever or cold :-( Suspect today is another sick day.
[06:44] <spm> yeah, staging is borked atm. been fixing the borked prod rollout first tho
[06:45] <poolie> heh, that's probably a good choice
[06:45] <poolie> :) thanks spm
[08:41] <adeuring> good morning
[08:41] <bigjools> morning
[08:42] <bigjools> how was the exam wgrant?
[08:54] <lifeless> is there someone around that can qa https://bugs.launchpad.net/rosetta/+bug/669831 ?
[08:54] <_mup_> Bug #669831: obsolete translations exported to the branch <code-integration> <qa-needstesting> <Launchpad Translations:Fix Committed by danilo> <https://launchpad.net/bugs/669831>
[09:09] <mrevell> Morning
[09:12] <bigjools> lifeless: I've landed all code to cover the cowboys on cesium, please roll it out (or I can)
[09:12] <lifeless> bigjools: it's done
[09:12] <bigjools> lifeless: ah great, you checked or Picarded it?
[09:12] <lifeless> spm deployed
[09:12] <lifeless> and noone has screamed yet
[09:13] <lifeless> I figured I'd chat with you briefly :)
[09:13] <bigjools> heh
[09:13]  * bigjools looks at log
[09:14] <bigjools> lifeless: I thought it was in the nodowntime set?
[09:15] <lifeless> bigjools: it was taken out when it blew up
[09:15] <lifeless> when things get cowboyed we remove them from nodowntime.
[09:16] <bigjools> so it's back in?  I thought I said I was going to approve that first ...
[09:16] <lifeless> we may be crossing wires
[09:16] <lifeless> back at uds it wasn't in, and we discussed what it would take to be in
[09:16] <lifeless> that you were cc'd on a discussion and gave your blessing.
[09:17] <lifeless> then the big branch blew builders away
[09:17] <lifeless> so it was cowboy-fixed and removed [to stop deploys uncowboying it]
[09:17] <lifeless> we cross checked that you'd landed fixes [and they were to be deployed] today, and so uncowboyed it (which implies putting it back in nodowntime]
[09:18] <lifeless> bigjools: if you had wanted a further cross-check with you, I'm really sorry - I didn't realise.
[09:19] <bigjools> lifeless: np, I talked with Tom not you
[09:19] <bigjools> as it happens it's fine to roll, but that's because I had landed everything
[09:19] <lifeless> right, we looked first ;)
[09:19] <lifeless> in future, if you want a cross-check, could you note that on LPS against the cowboy, or the DeploymentException ?
[09:20] <bigjools> yes, but one of the cowboys is not landed, deliberately.  How did you reconcile that? :)
[09:20] <lifeless> bigjools: knowledge.
[09:20] <bigjools> :)
[09:20] <lifeless> I had discussed the fault with you.
[09:20] <lifeless> so I knew you put it in while you diagnosed the root cause.
[09:20] <bigjools> we need to figure out wtf the builders are taking >30 seconds to accept connections
[09:21] <lifeless> I also included that cowboy as a ready to go patch in a mail to losas
[09:21] <lifeless> so if there is a submarine present there
[09:21] <lifeless> its at-hand to recowboy
[09:22] <bigjools> one of the side effects of that cowboy is that jobs will "stick" on genuinely failed builders
[09:22] <bigjools> I have to keep checking the log
[09:30] <bigjools> lifeless: so, did we talk about figuring out the massive slave connection delays?
[09:31] <lifeless> not as such
[09:36] <henninge> hey jtv! ;)
[09:36] <jtv> hi henninge!
[09:36] <henninge> jtv: let's be chatty ;)
[09:37] <jtv> henninge: I don't think this connection will support voice chat.
[09:38] <henninge> jtv: oic
[09:38] <henninge> you are also not on #translations
[09:38] <jtv> just a mo'
[09:40] <lifeless> jtv: hi
[09:40] <jtv> hi
[09:40] <lifeless> https://devpad.canonical.com/~lpqateam/qa_reports/deployment-stable.html
[09:41] <jtv> uh-oh
[09:41] <lifeless> are you able to qa  669831 on [qa]staging ?
[09:41] <jtv> henninge: hang on, lifeless wants something :)
[09:41] <lifeless> no panic
[09:41] <jtv> lifeless: very slow connection, so will be a bit slow to respond
[09:41] <lifeless> but 11960 would remove another cowboy in the datacentre
[09:42] <lifeless> jtv: like I say
[09:42] <lifeless> no panic
[09:42] <henninge> jtv, lifeless: I can do that
[09:42] <lifeless> awesome
[09:42] <lifeless> ok, gnight all
[09:42] <jtv> g'night
[09:42] <henninge> lifeless: good night
[09:42]  * jtv runs around a bit and screams for a while, just because he was told not to panic
[09:42] <jtv> it's the principle of the thing
[09:53] <bigjools> have you considered working on Soyuz?
[10:06] <wgrant> bigjools: Better than expected.
[10:07] <bigjools> wgrant: excellent, but shouldn't you be out celebrating now
[10:07] <wgrant> Did that earlier :P
[10:07] <wgrant> But yes, probably.
[10:08] <bigjools> you party animal
[10:08] <bigjools> did you get a chance to look at the expiration query?
[10:08] <wgrant> Heh.
[10:08] <wgrant> Will look now.
[10:11] <wgrant> bigjools: Where's the latest version of that query?
[10:11] <bigjools> the one I pasted :)
[10:11] <wgrant> k
[10:12] <bigjools> http://pastebin.ubuntu.com/535167/
[10:12] <wgrant> Yup.
[10:21] <wgrant> bigjools: Are the LFC, DS and DAS joins in the EXCEPT completely useless, or am I stupid and blind?
[10:22] <bigjools> one sec
[10:24] <wgrant> bigjools: http://pastebin.ubuntu.com/535491/
[10:25] <wgrant> Should be equivalent, except with the retention condition fixed.
[10:25] <wgrant> (now it will exclude if the file is unremoved or Obsolete, rather than Published or Obsolete)
[10:25] <wgrant> Which should make p-d-r less sad.
[10:27] <wgrant> Hmm.
[10:29] <wgrant> bigjools: Did the domination thingy help dogfood at all? I guess it pales in comparison with the file list generation :/
[10:29] <bigjools> wgrant: yes, file lists take ~4 hours
[10:29] <bigjools> for just maverick release pocket :/
[10:29] <bigjools> something has regressed a lot
[10:29] <bigjools> it used to take ~30 mins
[10:29] <wgrant> I will poke it in the eye in a few weeks.
[10:30] <bigjools> DF is slow, but ...
[10:30] <wgrant> Heh.
[10:43] <bigjools> wgrant: sql looks good, I am trying it on DF
[10:47] <wgrant> bigjools: So, about those people who feel like they need to use obsolete PPAs...
[10:47] <bigjools> OEM
[10:48] <wgrant> How did I guess :(
[10:49] <bigjools> now, I am struggling to understand wtf builders sometimes take in excess of 180 seconds to accept a connection from the manager
[10:50]  * bigjools -> caffeine
[10:51] <wgrant> bigjools: Load graphs from non-virt builders pls.
[10:51] <wgrant> It might at least give us some idea of if that's the issue.
[11:14] <bigjools> jml: remember my dirty reactor failure in my b-m tests?
[11:16] <bigjools> I added debug output on the Deferreds and it's caused by distribution_mirrorprober.  There's a test isolation error... :/
[11:17] <wgrant> The reactor is shared‽
[11:17] <bigjools> in tests
[11:17] <wgrant> My interrobang stands.
[11:17]  * jml briefly steps in from the sick room
[11:18] <bigjools> man flu?
[11:18] <jml> wgrant: yeah, there's exactly one reactor.
[11:18] <jml> wgrant: it's arguably the biggest flaw in Twisted
[11:18] <wgrant> jml: This sounds... like Zope.
[11:19] <bigjools> it would be kinda hard to have more than one
[11:19] <jml> bigjools: I know about those tests. my testtools-experiment branch fixes those delayed calls.
[11:19] <jml> bigjools: but it's blocked on landing by some weird-ass 500 from the librarian
[11:19] <bigjools> jml: yay.  So, what can I do about it in my branch?
[11:20] <jml> bigjools: those tests need to be completely rewritten... let me find you my workaround
[11:20] <bigjools> and why are they leaking to my test?
[11:20] <jml> bigjools: because the distributionmirror_prober tests aren't using trial
[11:20] <bigjools> ah ...
[11:20] <jml> bigjools: so the calls are going on to the reactor, and when your tests are cleaned up ... bang
[11:21] <bigjools> that's kinda bad from a test isolation PoV :/
[11:21] <bigjools> wgrant:
[11:21] <bigjools>  total_files | space_saved
[11:21] <bigjools> -------------+--------------
[11:21] <bigjools>       184899 | 307432516434
[11:22] <jml> bigjools: yes. the problem is the mutable global state that is the reactor
[11:22] <bigjools> indeed
[11:22] <jml> bigjools: got any suggestions on how to make it better?
[11:22] <wgrant> bigjools: Mm, not too implausible.
[11:22] <bigjools> jml: I'd make Trial clean the reactor when starting a test?
[11:22] <jml> bigjools: it can't do that. there might be in-process Twisted-using fixtures
[11:23] <jml> bigjools: http://pastebin.ubuntu.com/535507/ <- should work around the problem
[11:23] <bigjools> jml: at the very start of the test there should be no fixtures yet though?
[11:23] <jml> bigjools: not if they are shared between tests
[11:23] <bigjools> aieeee
[11:24] <bigjools> jml: I'll poke your workaround in, thanks
[11:24] <jml> wgrant: incidentally, it doesn't sound like Zope to me.
[11:24] <wgrant> jml: Opaque global state.
[11:24] <wgrant> The Zope Way.
[11:24] <bigjools> jml: if it's complaining about them when the test ends, how can a fixture's Deferreds never get in the way then?
[11:25] <bigjools> if it's shared between tests that is
[11:25] <jml> bigjools: good point. in Trial, there's some historical crap back from when we thought setUpClass would be a good idea
[11:26] <jml> (it's a terrible idea, we got rid of it, Guido added it back to Python again, and so the circle of crap continues)
[11:26] <bigjools> :/
[11:26] <bigjools> so if someone uses setUpClass with a Deferred, your tests will always fail
[11:26] <jml> bigjools: in testtools, I guess we could clean the reactor before tests
[11:27] <jml> bigjools: no, Trial postpones checking of those things until tearDownClass runs.
[11:27] <jml> bigjools: *that's* the historical crap.
[11:27] <bigjools> oy
[11:28] <jml> as I said, clearing it out in testtools would help
[11:29] <bigjools> yeah
[11:29] <jml> although it wouldn't help that much.
[11:29] <jml> "some test before this one was bonkers"
[11:29] <jml> it's still an improvement over the current situation
[11:30] <bigjools> I won't ask why the mirrorprober tests are not using Trial then.... :)
[11:30] <jml> I have NFI
[11:30] <jml> if I have my way, they'll be using testtools before the year is out.
[11:30] <bigjools> \o/
[11:31] <bigjools> hmm I need to book some holiday
[11:31] <jml> anyway, all this excitement is threatening my delicate constitution
[11:31] <bigjools> jml: Berocca
[11:31] <jml> bigjools: it's a cold, not a hangover. :P
[11:31] <bigjools> jml: :)  it still works
[11:31] <bigjools> the fizzy stuff
[11:32] <bigjools> get better soon anyway, go get some rest
[11:32]  * jml watches The Wire instead.
[11:32] <bigjools> a friend of mine swears his PS3 is medicinal
[11:32] <jml> heh
[12:04] <deryck> Morning, all.
[12:23] <bigjools> wgrant: so, I figured out the problem with the log parser
[12:25] <wgrant> bigjools: !!
[12:25] <wgrant> What is it?
[12:26] <bigjools> wgrant: it reads in gzip files in their entirety :/
[12:26] <bigjools> see lp/services/apachelogparser/base.py
[12:26] <bigjools> get_fd_and_file_size()
[12:26] <bigjools> le heavy sigh
[12:27] <wgrant> bigjools: Hahaha.
[12:28] <wgrant> I thought gzip stored the uncompressed size in the header...
[12:28] <wgrant> Yes, it's in the footer.
[12:29] <wgrant> Not sure how we can access that from Python, though...
[12:29] <bigjools> but do any of the python modules read that?
[12:33] <wgrant> It's limited to 4 bytes, so it probably doesn't.
[12:33] <wgrant> But let's see.
[12:33] <bigjools> http://stackoverflow.com/questions/1704458/get-uncompressed-size-of-a-gz-file-in-python
[12:34] <wgrant> It looks like we probably have no choice but to read in chunks.
[12:36] <bigjools> did you notice the last answer on that page :)
[12:36] <wgrant> Suggesting len(fd.read())?
[12:37] <wgrant> Oh.
[12:37] <wgrant> Hahahah.
[12:37] <wgrant> I rarely look at the author...
[12:38] <bigjools> I am tempted to grab the last 4 bytes
[12:39] <wgrant> But 2**32...
[12:39] <wgrant> I guess we can make it explode if it ever tries to write a bytes_read greater than that.
[12:39] <wgrant> Since something is pretty broken if we have a 4GiB log file, I guess.
[12:40] <bigjools> they are in the region of 1.2G uncompressed
[12:41] <wgrant> Oh.
[12:41] <wgrant> That is inconvenient...
[12:41] <wgrant> Also, that's huge.
[12:41] <bigjools> it's fine
[12:42] <wgrant> It's far too close for my liking, but OK.
[12:43] <bigjools> we can put a limit on it
[12:44] <wgrant> Yes, but a limit within a couple of orders of magnitude of the current value seems like a really bad idea.
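The "grab the last 4 bytes" idea above is the gzip ISIZE trailer field: the uncompressed length modulo 2**32, which is exactly why the 4 GiB caveat matters. A minimal sketch:

```python
import os
import struct

def gzip_uncompressed_size(path):
    """Return the uncompressed size recorded in a gzip file's trailer.

    The ISIZE field is only 4 bytes, so the value is modulo 2**32: wrong
    for logs of 4 GiB or more (the caveat discussed above), and for a
    multi-member archive it covers only the final member.
    """
    with open(path, 'rb') as f:
        f.seek(-4, os.SEEK_END)
        (isize,) = struct.unpack('<I', f.read(4))
    return isize
```

Unlike len(fd.read()), this touches only the trailer, so the parser would not have to decompress the whole file into memory just to learn its size.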
[12:47]  * bigjools tries dirty hack on DF
[13:04] <bigjools> score
[13:04] <wgrant> It works?
[13:06] <bigjools> yes
[13:07] <wgrant> So I didn't break it after all :D
[13:07] <bigjools> seems so :)
[13:07] <bigjools> 2010-11-23 13:03:04 INFO    Parsed 5000000 lines resulting in 16085 download stats.
[13:08] <wgrant> Hmm.
[13:08] <bigjools> I lied, one file is 2.9G
[13:09] <wgrant> Ow.
[13:10] <wgrant> bigjools: Could you sum all the BPRDC.count?
[13:10] <wgrant> Just to see that it actually handles most of the lines.
[13:11] <wgrant> Although I guess lots of those lines will be Packages/Sources, so it might not be too similar.
[13:12] <bigjools> 6880
[13:12] <bigjools> it's still going though
[13:12] <bigjools> it will take a while
[13:12] <bigjools> and it's hammering DF
[13:13] <bigjools> good time for lunch, see you later
[13:13]  * wgrant sleeps.
[13:13] <wgrant> Thanks for fixing that.
[13:15] <bigjools> my pleasure
[14:00] <jelmer> 'morning benji, abentley
[14:00] <abentley> jelmer: morning.
[14:05] <benji> morning jelmer, or afternoon as it were ;)
[14:10] <jitu> how to add https://launchpad.net/~falk-t-j/+archive/lucid/+build/2018840   in repository?
[14:10] <jitu> anybody to help?
[14:12] <bigjools> jitu: follow the instructions here https://launchpad.net/~falk-t-j/+archive/lucid
[14:12] <bigjools> where it says "Adding this PPA to your system"
[14:14] <jitu> checking...
[14:15] <jitu> bigjools, thnx
[14:19] <deryck> mars, hi.  You around?
[14:24] <deryck> Or maybe gary_poster could help me.  gary_poster, ping?
[14:24] <gary_poster> hey deryck.  what's up
[14:24] <deryck> Hi gary_poster.  Does this revno look like the right way that I would disable windmill tests to you?  http://bazaar.launchpad.net/~deryck/launchpad/rockstar-js-refresh/revision/11726
[14:25]  * gary_poster is skeptical he will know, but is looking
[14:25] <gary_poster> deryck: yeah, that looks like a very reasonable approach to me, especially if you have evidence it works ;-) .
[14:26] <deryck> gary_poster, heh.  that's the problem, I don't. ;)  Started an ec2 test run last night that disappeared.  I assume something hung and the test was killed....
[14:26] <gary_poster> :-P
[14:27] <deryck> so I started another run, but was looking for some confirmation that the patch looked right. :-)
[14:27] <gary_poster> deryck: I can assert that I believe that should have worked
[14:27] <deryck> gary_poster, good enough.  Thanks!
[14:27] <gary_poster> :-) np
[14:27] <deryck> :-)
[14:27] <deryck> FWIW, my current test run seems to be going well.  The tests started up much faster than earlier attempts I had.
[14:28] <mars> deryck, you may need '!(MailmanLayer|WindmillLayer)' instead, but I don't know for sure.
[14:29] <mars> empirical evidence needed
[14:29] <deryck> mars, gary_poster, also, I opened Bug #680497 about missing test coverage.  If this didn't belong on Foundations, sorry.  I wasn't sure.
[14:29] <_mup_> Bug #680497: jstests for LP JavaScript client are not running automatically <Launchpad Foundations:New> <https://launchpad.net/bugs/680497>
[14:29] <deryck> mars, ok, I'll watch the run closely and see.  Thanks!
[14:30] <deryck> if I see a windmill test go, I'll kill it ASAP and try again :-)
[14:30] <gary_poster> deryck: Foundations: yeah, close enough :-) .  I might add the web group.
[14:30] <deryck> ok, cool
[14:31] <gary_poster> deryck, mars: mars makes a good point.  I'm not sure if the layer thing is an and or an or.  ./bin/test --help should say.  looking
[14:31] <deryck> heh "is an and or an or" is hard to parse on two cups of coffee only
[14:32] <deryck> the hung run suggests my patch might not have worked.
[14:32] <gary_poster> deryck, :-) mars is right.  "--layer" is a logical or.  that is, it will run include all layers that are not Windmill ORed with all layers that are not Mailman, resulting in all layers.
[14:33] <gary_poster> s/run include/include
[14:33] <deryck> ah
[14:33] <deryck> gary_poster, so I need the form mars suggested then, right?
[14:33] <gary_poster> yes
[14:33] <deryck> ok, cool. an ec2 run killin' I shall go....
[14:33] <gary_poster> :-) k
[14:33] <deryck> thanks mars and gary_poster!
[14:33] <gary_poster> np
[14:34]  * deryck dreads seeing the ec2 bill this month
[14:49] <bigjools> jml: are you too ill to help a bit with the buildd-manager timeouts?  My fix hasn't worked.
[14:51] <abentley> bigjools: Has lamont installed the new buildd (rev 74)?
[14:52] <bigjools> abentley: you should ask him, not me ;)
[14:53] <abentley> bigjools: because if he has, that's a bad sign, but if he hasn't, it could help with the timeouts.
[15:14] <jml> bigjools: I can try. Wassup?
[15:14] <bigjools> jml: that timeout stuff I added has had no effect :(
[15:15] <bigjools> here's an example of the sequence:
[15:15] <jml> bigjools: ok. so maybe we misdiagnosed the problem?
[15:15] <bigjools> 2010-11-23 14:46:27+0000 [QueryProtocol,client] Resuming hassium (http://hassium..
[15:15] <bigjools> 2010-11-23 14:46:32+0000 [-] Asking builder on http://hassium.ppa:8221/filecache to ensure it has file chroot-ubuntu-lucid-i386.tar.bz2
[15:15] <bigjools> 2010-11-23 14:46:54+0000 [Uninitialized] Scanning hassium failed with: TCP connection timed out: 110: Connection timed out
[15:15] <bigjools> the timeout is 22 seconds later
[15:15] <bigjools> not even the default 30
[15:16] <bigjools> so I reckon misdiagnosis is quite probable
[15:16] <jml> bigjools: is your timeout stuff actually on the relevant machines?
[15:18] <bigjools> jml: yes, I had Tom do a paranoia grep
[15:18] <bigjools> I'm at a bit of a loss here
[15:19] <jml> poking at the code now
[15:19] <bigjools> we probably need some more debugging logging
[15:19] <jml> yes. and the ability to switch it on and off in runtime
[15:20] <jml> bigjools: do you have a traceback with that error?
[15:21] <bigjools> jml: no
[15:21] <jml> bigjools: isn't that unexpected?
[15:21] <bigjools> it's one of the "known" errors so it doesn't print the traceback.
[15:21] <jml>             BuildSlaveFailure, CannotBuild, BuildBehaviorMismatch,
[15:21] <jml>             CannotResumeHost, BuildDaemonError, CannotFetchFile):
[15:21] <jml> which one?
[15:23] <bigjools> it'll be getting re-raised somewhere
[15:24] <jml> bigjools: we should also remove some of the unnecessary layers of stack toot-sweet. We have to trawl through this code frequently enough that it's a noticeable cost.
[15:24] <bigjools> wtf is a toot-sweet?
[15:25] <jml> sorry, an expression from childhood. "very quickly"
[15:25] <bigjools> "tout de suite"
[15:25] <bigjools> :)
[15:26] <bigjools> but yeah, agree
[15:26] <jml> bigjools: I learnt it from folk with Middlesex accents.
[15:26] <jml> anyhow
[15:27] <bigjools> hah
[15:27] <jml> bigjools: my reading of the code shows that it's failing in startBuild (builder.py) after the call to resumeSlaveHost succeeds
[15:28] <jml> because resumeSlaveHost prepends crap to the error message, and we don't see that in the logs
[15:28] <jml> likewise, the error can't be coming from any of the eb_foo in startBuild, because they also mutate the error message
[15:28] <bigjools> correct
[15:28] <jml> ergo, it comes from resume_done
[15:28] <jml> or something after startBuild
[15:29] <bigjools> it will be the first call that tries to dispatch the chroot
[15:29] <bigjools> bearing in mind they are >600M I wonder if that is poking a subtle bug
[15:30] <jml> there doesn't seem to be anything after startBuild that does anything particularly interesting
[15:30] <jml> bigjools: do we know what type of job this is?
[15:30] <bigjools> binarypackage
[15:32] <jml> hmm
[15:34] <bigjools> something is catching that error and re-raising it
[15:34] <jml> that's, well, odd.
[15:34] <bigjools> as a known exception - otherwise we'd see a traceback
[15:34] <bigjools> actually - I wonder if it contains any trace info
[15:35] <jml> because, afaict, it's being raised by the second line in dispatchBuildToSlave in binarypackagebuildbehavior
[15:35] <jml> d = self._builder.slave.cacheFile(logger, chroot)
[15:35] <jml> which has absolutely no errback added to it until _scanFailed in manager.py
[15:36] <jml> (again, my reading only, not actually tested)
[15:36] <jml> actually, I'm going to say some stuff with triple-x markers so the conversation can be more readily grepped for actions
[15:37] <jml> XXX: change _scanFailed to have a different prefix to the log message for unexpected errors
[15:38] <jml> XXX: collapse some of the unnecessary indirection in builder.py (e.g. updateStatus, updateBuilderStatus; updateBuild; _dispatchBuildCandidate)
[15:38] <jml> bigjools: the other big question is why 22s
[15:39] <jml> bigjools: what are the values for timeouts in production?
[15:39] <bigjools> 180s
[15:39] <bigjools> I simply cannot fathom where 22 comes from
[15:39] <jml> perhaps we're using an unexpected config setting?
[15:39] <jml> perhaps it's 20 + noise?
[15:39] <bigjools> the delay on the failures in the log is not consistent either
[15:39] <jml> bigjools: what's the smallest?
[15:39] <bigjools> that's the smallest I've seen, but it's hard to grep
[15:40] <jml> since we've got blocking code in prod, it's to be expected that there'll be variation
[15:40] <bigjools> yep
[15:41] <bigjools> jml: those eb_ functions are not doing much any more
[15:42] <jml> bigjools: I can believe it
[15:42] <jml> hmm.
[15:43] <bigjools> they are used when doing the resume op :)
[15:43] <jml> bigjools: what's in the logs roughly 3 minutes before that error message?
[15:44] <bigjools> not a lot
[15:44] <jml> bigjools: do we get the same three lines every time the error happens?
[15:44] <bigjools> sorta
[15:44] <bigjools> they're obviously spread around the file
[15:44] <bigjools> but it's always the chroot dispatch as far as I've seen
[15:45] <jml> hmm.
[15:45] <jml> grepping twisted code reveals it's definitely a TCPTimedOutError
[15:45] <bigjools> I wonder if the slave is disconnecting before it's replied?
[15:46]  * jml looks up what ETIMEDOUT means
[15:49] <bigjools> jml: heh, you know what, the timeout on the slave will still be 30 seconds, right?
[15:49] <jml> bigjools: why so?
[15:49] <bigjools> it's also twisted
[15:49] <jml> hmm.
[15:50] <jml> I don't think listening works in the same way
[15:50] <bigjools> it depends on what stupidity it has
[15:50] <jml> how is the buildd launched in production?
[15:51] <bigjools> it's part of init.d
[15:53] <jml> Timeout while attempting connection. The server may be too busy to accept new connections. Note that for IP sockets the timeout may be very long when syncookies are enabled on the server.
[16:00] <jml> hah
[16:00] <bigjools> is that a victorious hah?
[16:01] <jml> maybe.
[16:01] <bigjools> show me the beans
[16:01] <jml> so, if I understand correctly, the timeout passed to connectTCP does not in fact control the UNIX-level socket timeout
[16:01] <bigjools> it's a callLater to cancel the Deferred I think
[16:01] <jml> exactly
[16:02] <jml> but the TimeoutError that generates is different to TCPTimedOutError
[16:02] <bigjools> yep
[16:02] <jml> which is a mapping for ETIMEDOUT
[16:02] <bigjools> interesting
[16:03] <jml> IIUC, the call to socket.connect_ex() in t/internet/tcp.py is timing out
[16:04] <jml> unfortunately, I don't know whether the timeout is being set client-side or server-side, somewhere in Twisted, somewhere in Python or somewhere deeper
[16:05] <bigjools> maybe the stack trace will tell us when I get that cowboy in
[16:05] <bigjools> but I suspect not :/
[16:05] <jml> bigjools: well, the timeout is being set in a different call
[16:05]  * jml flicks through APUE
[16:12] <jml> nope
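[Editorial note: the distinction jml and bigjools land on — a timeout argument that merely cancels a pending Deferred versus the kernel's own ETIMEDOUT — has a stdlib analogue. The sketch below uses `concurrent.futures` rather than Twisted, purely to illustrate that an application-level timer gives up waiting without affecting the OS-level operation underneath; it is not the Twisted code path itself.]

```python
import concurrent.futures
import time

def slow_reply():
    # Stands in for a connection attempt the kernel has not failed yet.
    time.sleep(0.5)
    return "connected"

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_reply)
    try:
        # Application-level timer, like connectTCP's timeout argument:
        # it abandons the wait, but the underlying work keeps running.
        future.result(timeout=0.05)
    except concurrent.futures.TimeoutError:
        print("application timeout fired")
```

A kernel-level ETIMEDOUT, by contrast, surfaces from the socket call itself, which is why it arrives wrapped in a different exception class.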
[16:30]  * bigjools otp for a bit
[16:58] <lifeless> morning
[16:58] <lifeless> bigjools: how did cesium go ?
[16:59] <lifeless> henninge: hi; how did you go on 669831?
[17:03] <lifeless> bigjools: ah, found your mail.
[17:07] <bigjools> lifeless: yeah, badly
[17:08] <lifeless> I've suggested a feature flag to you, to let us get rid of the cowboy but keep the code path easily off/on as needed.
[17:08] <lifeless> (in mail)
[17:11] <bigjools> lifeless: +1 to the flag.  But only if I don't work out the problem this week.
[17:12] <lifeless> sure
[17:12] <bigjools> maybe I'll do it along with an attempted fix actually
[17:12] <bigjools> lifeless: the thing that we discovered earlier is that the timeouts I changed are not firing, something else is.
[17:13] <bigjools> we see a TCPTimedOutError
[17:13] <bigjools> if it were the timeout value firing it would be a Deferred cancelled error
[17:13] <bigjools> the question is, where the heck is generating that
[17:14] <lifeless> do you generate an OOPS ?
[17:15] <lifeless> if so, its backtrace might help
[17:15] <bigjools> lifeless: there's no trace on the exception :(
[17:15] <bigjools> it should get logged, but there's nada
[17:16] <lifeless> grah
[17:16] <bigjools> well - there IS a traceback, it's just one line
[17:16] <lifeless> is it an OOPS ?
[17:16] <lifeless> or something else
[17:16] <bigjools> no, I don't generate an oops
[17:16] <lifeless> whats logging the error for you ?
[17:16] <bigjools> because they are routine failures
[17:16] <bigjools> my code!
[17:17] <bigjools> see _scanFailed() in lib/lp/buildmaster/manager.py
[17:17] <lifeless> so there's a more sophisticated error checker you can create
[17:17] <lifeless> have you seen Release It!  - the book ?
[17:17] <bigjools> it uses failure.getTraceback()
[17:17] <bigjools> I have not
[17:17] <lifeless> ok, and failure.getTraceback() is neutered for some reason ?
[17:18] <bigjools> 2010-11-23 16:39:28+0000 [Uninitialized] Traceback (most recent call last):
[17:18] <bigjools> 2010-11-23 16:39:28+0000 [Uninitialized] Failure: twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out.

[17:18] <bigjools> that's it
[17:18] <lifeless> how frustrating
[17:19] <bigjools> somewhat
[17:20] <bigjools> it's thrown when we get a errno.ETIMEDOUT
[17:20] <lifeless> righto, the pattern is Circuit Breaker
[17:20] <lifeless> its kindof the ultimate variant of what you've implemented
[17:21] <lifeless> its not relevant right now
[17:21] <lifeless> but I'm going to explain why I was thinking of it
[17:21] <lifeless> it has the idea of the thing it measures being ok, being in trouble, and being dead.
[17:22] <lifeless> *if* we got the full traceback and were only logging one line
[17:22] <lifeless> I was going to suggest logging the full line on the transition from in-trouble->dead
[17:22] <lifeless> s/full line/full thing/
[17:22] <lifeless> however, thats not the issue we're facing, so it was just a short side discussion.
[17:22] <bigjools> that's what it's trying to do, effectively
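[Editorial note: the Circuit Breaker pattern from "Release It!" that lifeless describes — a measured thing that is ok, in trouble, or dead, with full detail logged only on the trouble-to-dead transition — can be sketched minimally as below. State names and thresholds are illustrative assumptions, not Launchpad code.]

```python
import time

class CircuitBreaker:
    """Minimal Circuit Breaker sketch: closed (ok), open (dead), half-open (probing)."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None
        self.trips = []  # full failure detail, recorded only when we trip

    def record_failure(self, detail):
        self.failures += 1
        if self.state == "closed" and self.failures >= self.threshold:
            # in-trouble -> dead: log the full detail exactly once, here.
            self.state = "open"
            self.opened_at = self.clock()
            self.trips.append(detail)

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def allow(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_after:
                self.state = "half-open"  # let a single probe through
                return True
            return False
        return True
```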
[17:24] <lifeless> yah
[17:24] <lifeless> so
[17:24] <lifeless> we're getting a short tb
[17:25] <lifeless> IIRC Failure does that when the error is thrown locally and the frame above is the reactor
[17:25] <bigjools> there's only 2 places the Twisted code itself can throw that error
[17:25] <lifeless> so could we be dealing with non twisted code doing a regular socket call and throwing, in a callback from some twisted code.
[17:25] <bigjools> twisted/internet/tcp.py
[17:26] <bigjools> doConnect() calls self.failIfNotConnected(error.getConnectError ....
[17:26] <bigjools> what I don't get is that none of that seems to be asynchronous
[17:27] <bigjools> lifeless: bear in mind that this code works absolutely flawlessly on dogfood
[17:27] <bigjools> even with more builders added
[17:29] <lifeless> I think- I'd need to check some lower level code - but I think that that is:
[17:29] <lifeless> 'read the last error from the socket'
[17:30] <lifeless> failIfNotConnected looks like a plausible issue
[17:32] <lifeless> so getConnectError generates a stackless exception object
[17:36] <lifeless> and twisted.python.failure when given a regular Exception with no frame data bails
[17:36] <lifeless>         elif not isinstance(self.value, Failure):
[17:36] <lifeless>             # we don't do frame introspection since it's expensive,
[17:36] <lifeless>             # and if we were passed a plain exception with no
[17:36] <lifeless>             # traceback, it's not useful anyway
[17:36] <lifeless>             f = stackOffset = None
[17:36] <lifeless> thats why you're not getting anything useful.
[17:37] <lifeless> *I think*
[17:37] <lifeless> jml: ^ plausible?
[17:38] <lifeless> the failure.Failure construction call is passing in a simple Exception object
[17:39] <lifeless> that has no traceback and thus doesn't get one
[17:39] <lifeless> if we passed *no* exception in, sys.exc_inf would have been called, which gets a tb object.
[17:39] <bigjools> the err is an errno
[17:40] <bigjools> right
[17:40] <lifeless> I'm surprised at the claim that getting a stack from scratch is more expensive than what sys.exc_info does (when the exception is triggered). But perhaps it is.
[17:40] <bigjools> it might be worth a quick cowboy
[17:40] <lifeless> (I mean, I know it has overhead and I avoid doing it casually myself)
[17:41] <lifeless> but this function does one and not the other which is pretty surprising to me
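[Editorial note: the behaviour lifeless traced — a Failure built from an exception that was constructed but never raised has no frame data, and so prints the two-line "traceback" seen above — is demonstrable in plain Python without Twisted:]

```python
import sys

# An exception object that is constructed but never raised carries no
# traceback, which is all Failure has to work with in the quoted branch.
constructed = OSError("TCP connection timed out")
print(constructed.__traceback__)   # None: no frame data to render

# Raise-and-catch, by contrast, gives sys.exc_info() a real traceback object.
try:
    raise OSError("TCP connection timed out")
except OSError:
    exc_type, exc_value, tb = sys.exc_info()
print(tb is not None)              # True
```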
[17:42] <bigjools> what I need to work out is why we get ETIMEDOUT so quickly
[17:42] <bigjools> TCP sockets take minutes to time out (hours?) by default
[17:43] <lifeless> depends on the state
[17:43] <lifeless> during connection its 30 seconds (IIRC)
[17:43] <bigjools> I've never heard of that
[17:43] <bigjools> actually, hmmn
[17:44] <persia> 30 seconds of retry for a packet that won't send, but much more for just an open socket without traffic.
[17:44] <lifeless> also 30 seconds or so for a SYN that isn't acked
[17:44] <bigjools> right
[17:44] <lifeless> but blackholes
[17:45] <lifeless> bigjools: http://www.faqs.org/docs/iptables/tcpconnections.html
[17:45] <lifeless> Table 4-2. Internal states
[17:45] <lifeless> bah
[17:46] <lifeless> thats the firewall side
[17:46] <bigjools> SYN_SENT	2 minutes
[17:46] <lifeless> which has timeouts set *above* what tcp needs
[17:47] <lifeless> anyhow, default syn timeout is 30 seconds
[17:48] <jml> lifeless: yeah, the stack analysis is correct. see #twisted for some follow-up discussion.
[17:48]  * jml gone again
[17:49] <lifeless> jml: I see discussion on the socket, not on the poor stack info
[17:49] <lifeless> jml: thanks though
[17:52] <bigjools> lifeless, did you see the comment " it's very easy to run out of listen queue in a python server with many short-lived connections"
[17:54] <lifeless> bigjools: do you make multiple outstanding SYN attempts to a single build slave
[17:54] <lifeless> s/you/we/
[17:56] <bigjools> lifeless: the code has no knowledge of SYN attempts, it's using xmlrpc.Proxy
[17:56] <lifeless> ok
[17:56] <bigjools> but it will try and connect every 5 seconds
[17:56] <lifeless> does it make multiple outstanding xmlrpc calls ?
[17:56] <bigjools> actually, 15, I changed it
[17:56] <bigjools> no
[17:56] <lifeless> then that comment is irrelevant
[17:57] <lifeless> if we made multiple requests at once
[17:57] <lifeless> then we could exceed the default listen size (8)
[17:59] <bigjools> there must be some other difference between dogfood and production
[17:59] <bigjools> I have not seen this problem *a single time* on DF
[17:59] <bigjools> and now I have to go to dinner
[18:00] <bigjools> I'll catch up with you again later
[18:34] <sinzui> flacoste, mumble
[19:21] <sinzui> flacoste, https://bugs.launchpad.net/launchpad-registry/+bug/621778
[19:21] <_mup_> Bug #621778: Register project from source package should include homepage URL <qa-ok> <Launchpad Registry:Triaged by jelmer> <https://launchpad.net/bugs/621778>
[19:36] <lifeless> thumper: how was pyconz
[19:45] <lifeless> finally
[19:45] <lifeless> we're getting back into shape
[19:45] <lifeless> Time Out Counts by Page ID
[19:45] <lifeless> Hard	Soft	Page ID
[19:45] <lifeless> 77	3783	Archive:+index
[19:45] <lifeless> 54	162	BugTask:+index
[19:45] <lifeless> 26	298	Distribution:+bugs
[19:45] <lifeless> 25	99	ProjectGroupSet:CollectionResource:#project_groups
[19:45] <lifeless> 17	46	DistroSeries:+queue
[19:45] <lifeless> 15	10	Person:+commentedbugs
[19:45] <lifeless> 12	256	POFile:+translate
[19:45] <lifeless> 9	2	Person:+bugs
[19:45] <lifeless> 6	4	DistributionSourcePackage:+changelog
[19:45] <lifeless> 6	1	Person:+related-software
[20:00] <lifeless> -> hospital for allergy vaccination, back in ~3 hours.
[20:00] <lifeless> on mobile if needed
[21:45] <flacoste> thumper: field.setTaggedValue('has_structured_doc', True)
[21:46] <flacoste> widget.context.queryTaggedValue('has_structured_docstring')
[21:48] <flacoste> field = Attribute('The <a>link</a>')
[21:48] <flacoste> field.setTaggedValue('has_structured_doc', True)
[21:49] <flacoste> field = has_structured_doc(Attribute(...)
[21:49] <flacoste> field = exported(has_structured_doc(Attribute(...)))
[23:08] <lifeless> back
[23:18] <poolie> hi lifeless, flacoste
[23:18] <lifeless> hiya
[23:20] <lifeless> flacoste: question for you
[23:20] <lifeless> flacoste: if you're still around
[23:20] <flacoste> lifeless: will be gone after i hit send on that email
[23:20] <flacoste> but shoot
[23:20] <flacoste> hi poolie
[23:21] <poolie> lifeless: i think jam would like someone (maybe not you) to state that creating bin/bzr is/isn't the most tasteful way to do it
[23:21] <flacoste> poolie: i say it is
[23:21] <poolie> jam: ^^
[23:21] <flacoste> poolie, jam: try talking to gary tomorrow for help
[23:21] <lifeless> flacoste: rt 41361 - I mailed you
[23:21] <flacoste> lifeless: yes, i saw that
[23:21] <poolie> that's what i said too :)
[23:22] <lifeless> jam: creating bin/py is the most tasteful way to do it.
[23:22] <lifeless> jam: I just don't know the machinery in that stack yet.
[23:22] <lifeless> flacoste: did I make sense? I didn't see a reply.
[23:22] <flacoste> lifeless: i understand it, i'll have to see how it fits resource wise
[23:23] <flacoste> since it needs some amount of coordination on our side
[23:23] <lifeless> flacoste: very small :) - I've done the heavy lifting.
[23:23] <flacoste> well
[23:23] <lifeless> flacoste: anyhow, shoo.
[23:23] <flacoste> it requires doing measurements
[23:23] <flacoste> and saying +1 / -1
[23:23] <flacoste> that's not very small in my book
[23:23] <lifeless> flacoste: that doesn't need coordination
[23:24] <lifeless> flacoste: that can be done weeks or months later, if needed.
[23:24] <flacoste> it needs somebody on the lp side to work with the losa
[23:24] <flacoste> hmm, ok
[23:24] <flacoste> but
[23:24] <flacoste> actually
[23:24] <flacoste> is it that important to do just one
[23:24] <flacoste> if we don't assess and then deploy it across the board?
[23:25] <lifeless> its important to get some headroom
[23:25] <lifeless> we can't do this on all the servers [yet] - not enough ram
[23:25] <thumper> flacoste: why google docs???
[23:25] <thumper> my normal gmail login can't see it
[23:25] <lifeless> if this works we can probably get headroom without more hardware.
[23:25] <flacoste> thumper: i'm sending a normal email
[23:25]  * thumper throws hands up
[23:25] <thumper> flacoste: thanks
[23:26] <lifeless> thumper: a logout/login can help.
[23:26] <lifeless> thumper: also turn on MultiSession
[23:26] <thumper> I don't know what multi session is
[23:26] <lifeless> thumper: there's a link in my facebook feed :)
[23:27] <lifeless> thumper: when I whinged about logins obliterating each other on google apps/gmail
[23:27] <flacoste> lifeless: but my gut feeling is that completing RFWATD takes precedence
[23:27] <lifeless> flacoste: thats crucial as well.
[23:27] <lifeless> :)
[23:27] <lifeless> they're all critical! :>
[23:28] <lifeless> flacoste: seriously, gnight.