[02:28] * thumper is back
[06:25] * thumper EODs
[08:32] good morning
[09:02] Good morning
[10:28] jelmer: hey
[10:28] thumper: Hi!
[10:28] jelmer_: what's the status of bzr-git and dulwich?
[10:29] jelmer_: I'd like to rollout updates
[10:33] thumper: I've fixed the issue with http last week and we now import HEAD branches again, we should be able to rollout current dulwich and bzr-git's roundtrip branch.
[10:33] so tips of both?
[10:33] thumper: (did I mention 'bzr serve --git' works in a native bzr branch now?)
[10:33] jelmer_: I think I saw your 'dent about it
[10:34] jelmer_: how efficient is it?
[10:34] thumper: It's very quick on small (small as in inventory) branches, very slow on large inventory branches
[10:34] jelmer_: do you know why it is slow?
[10:35] jelmer_: fixable?
[10:35] thumper: Creating the Tree objects from inventories on the fly is very slow and O(size-of-tree)
[10:35] * thumper nods
[10:35] thumper: We need to cache those Tree objects, should be able to do so once jam's work on packs lands.
[10:36] thumper: I'm merging the roundtrip branch into trunk atm with roundtripping itself disabled for the moment.
[10:37] (it's got a lot of other improvements but I don't want to be tied to the current syntax for roundtripped data)
[10:37] ok
[10:37] jelmer_: just send me an email when it's ready with revnos for lp:dulwich and lp:bzr-git
[10:37] jelmer_: and I'll get them updated
[10:41] thumper: will do
[10:48] jelmer_: thanks
[11:06] Morning, all.
[11:07] jelmer_: bzr serve --git crashes for me :(
[11:07] AttributeError: 'BzrBackend' object has no attribute 'object_store'
[11:07] bzr-git/dulwich 0.5.0.
[11:16] morning deryck
[11:21] wgrant: you need a newer dulwich, bzr-git from lp:~jelmer/bzr-git/roundtrip
=== barry` is now known as barry_
=== mrevell is now known as mrevell-lunch
=== jelmer_ is now known as Guest6486
=== mrevell-lunch is now known as mrevell
=== Ursinha_ is now known as Ursinha
[14:15] mrevell: it looks like RT 39801 has been done, meaning that our help and dev wikis are now editable only by members of ~launchpad-doc (for spam protection). I'm planning to a) announce this in the appropriate places, b) document it likewise, and c) approve the currently pending team members at https://edge.launchpad.net/~launchpad-doc/+members#active. Thoughts?
[14:17] kfogel, Thanks for handling that. Your plan looks ideal to me. I have two questions: do you think it's worthwhile contacting each existing member of the team directly to explain the new meaning of their membership? Also, do you think there's now a role for the team mailing list? I think "Yes" to the first and I'm not sure about the second.
[14:19] mrevell: I agree. "Yes" to the first, but for the second, let's just direct people at #launchpad-dev and the other usual places if they have questions.
[14:20] mrevell: oh, looks like I'll need to be a team administrator. Really, anyone in ~launchpad could be an admin. Can we do that?
[14:22] kfogel, ~launchpad is now an admin of that team
[14:23] mrevell: fast action, sir.
[14:24] kfogel, :) As for contacting members directly, to tell them of the team's altered role, I don't think everyone in the team is a member of the ML
[14:24] kfogel, but the membership is fairly small so we can easily contact them directly
[14:25] mrevell: I was just going to mail them individually.
[14:25] cool
[14:42] jml or allenap, ping, would either of you be available to help debug a possible hang in some Unix IPC code using the subprocess module? I have a few questions about a 40 LoC block I'm studying.
[14:43] mars: I'll have a look.
[14:43] thanks allenap, I'm pasting it now
[14:44] allenap, http://pastebin.ubuntu.com/434980/
[14:46] line 13, comment is lying
[14:47] that's where you end up if the TIMEOUT runs out
[14:47] maxb, ok, that's why I asked other people :)
[14:48] maxb, so STDOUT could still be open, but the select() has timed out?
[14:48] yes
[14:48] ok
[14:48] allenap, ^
[14:48] * mars changes the comment
[14:49] * maxb wonders why the code uses os.read(blah.fileno())
[14:49] mars: Yes, I noticed that. What were your questions?
[14:49] mars: Is it working?
[14:50] maxb, allenap, fwiw, I'm trying to figure out why the test suite is hanging with the ec2 testrunner. This code should kill the entire test suite. But it might not be.
[14:50] allenap, I'm wondering if something in this timeout code is buggy in such a way that the test suite could hang, but this code doesn't catch and kill it.
[14:51] allenap, my XXX comments are where I would start. They ask basic Unix programming questions that I have, that I can not answer.
[14:52] Without reading "Advanced Unix Programming". I need the fix a bit faster than that.
[14:53] hmm
[14:54] maxb, if what you say is true... During a hang, the test suite has stopped printing output. That means the select() is timing out, right?
[14:54] yes
[14:55] so that narrows the bug down to that branch of the conditional.
[14:56] so the hang is in this code path, or this code is running correctly, and the fault happens after this entire script has exited (test_on_merge.py exits correctly, but fails to send mail or something).
[14:57] maxb, allenap, is the XXX on line 17 a valid concern?
[14:57] maxb: subprocess.Popen._communicate() does the fileno() thing. mars: Might be worth looking there and copy that select() loop as closely as possible.
[14:58] allenap, yeah, I was wondering why they didn't use .communicate().
[14:58] mars: If proc.poll() is not None then the process has definitely terminated.
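A minimal sketch of the poll() semantics described just above, assuming Python 3 and a throwaway child process (the sleeping one-liner is an illustration, not the test suite from the pastebin). Popen.poll() returns None while the child is still running and its exit status once it has terminated:

```python
import subprocess
import sys

# Hypothetical child: a Python one-liner that sleeps briefly, then exits 0.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.5)"])

still_running = proc.poll()  # None: the child has not terminated yet
proc.wait()                  # block until the child exits and is reaped
exit_status = proc.poll()    # not None any more: the exit status (0 here)
```

Note that this only says the direct child is gone; anything that child spawned may still be alive and may still hold the stdout pipe open.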
[14:59] mars: .communicate() only returns the result at the end. Don't want to wait that long while running the test suite :)
[14:59] Line 17 could be hit if the test process exited but a subprocess spawned by the test process still retained the open stdout file descriptor
[15:00] allenap, ah, "is not None" means it's absolutely dead. Ok.
[15:00] I just re-read the docs, you are correct.
[15:00] maxb, oh, that is something
[15:00] it *does* do that
[15:01] the process tree on a hung server has a few defunct processes
[15:01] well, maybe
=== Guest6486 is now known as jelmer_____
[15:03] mars: Yeah, line 28 will kill the child, but if its children are still alive and holding a reference to proc.stdout then line 31 will hang.
[15:03] here, more info
[15:04] maxb, allenap, process tree from a hung server. Everything is still alive! http://pastebin.ubuntu.com/434986/
[15:04] the code never got to the killem() line :(
[15:05] mars: The firefox has not been collected by the parent :-/
[15:05] kill_hung_process_with_a_series_of_brutish_instruments() was never called. It will eventually SIGKILL the process in question.
[15:06] allenap, yes, on my local system I had a sort of the same thing. I had to send SIGHUP to Python itself in order for it to get collected.
[15:06] kill -1 python-parent-process-id
[15:06] mrevell, which project did you have the milestone problem with?
[15:08] sinzui, launchpad, malone, rosetta, launchpad-registry and launchpad-foundations, so far
[15:08] always 10.10?
[15:08] ie 10.11 is always fine
[15:08] allenap, firefox was started by the windmill testrunner. I don't see it in the tree. That means that it died or was killed.
[15:09] sinzui, Yeah. Always 10.10. Everything else from 10.06 to 10.12 have been fine.
[15:10] mrevell, I see them, how did you create 10.10 milestones?
[15:10] allenap, maxb, that would leave the zombie processes. So would the process tree I posted somehow lead the timeout code to hang?
[15:10] sinzui, I refreshed the page so that the error message was replaced by a link to a new 10.1 milestone (which I hadn't created but appeared instead of the 10.10 milestone), then went in and changed the name of the 10.1 milestone to 10.10
[15:11] okay thanks.
[15:12] What is the process actually being Popen-ed here?
[15:13] maxb, line 1 of http://pastebin.ubuntu.com/434986/.
[15:15] Did the [[[print ("\nA test appears to be hung. There has been no output for"]]] actually occur?
[15:15] allenap, maxb, here is the current code, uncommented http://pastebin.ubuntu.com/434992/. This makes no sense: line 19 and 23 must have run. The processes are still alive!
[15:15] And just what is kill_hung_process_with_a_series_of_brutish_instruments?
[15:16] maxb, that function is a rewrite I did of lines 19 through 23 of http://pastebin.ubuntu.com/434992/
[15:17] maxb, just for clarity. The code I just pasted is what is actually run by the server, but it was too dense for me to understand. So I rewrote and commented it before posting.
[15:17] What is killem?
[15:17] hmm
[15:17] It would be interesting to add some logging to see what PIDs it's *actually* sending signals to
[15:18] if we had console output for the log to write to :/
[15:18] or python standard logging installed and running in this script...
[15:19] maxb, I can add logging if that would help.
[15:19] Well, I'm a bit baffled, so I'm clutching on to the fact that if the process is still running, a SIGKILL can't have really happened.
[15:20] maxb, here is the killem() function: http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/annotate/head:/test_on_merge.py#L189
[15:20] maxb, allenap, killem() runs os.killpg(), not os.kill(). Does that matter?
[15:20] (I suspect it might?)
[15:22] hmmm
[15:22] column 3 of http://pastebin.ubuntu.com/434986/ is the process group ID
[15:22] oh
[15:22] line 6
[15:24] firefox is part of a different process group. But that doesn't make sense.
Could the windmill testrunner have been the target of the process group kill?
[15:26] maxb, I think you are right. The best next step is probably to add some logging to the code to see what it is killing, and why.
[15:26] Clearly we have an issue if the kill code is expecting the entire tree to be just one process group, but it isn't
[15:27] mars: Where does the select() loop run? Is it in the test runner? If so, then it's running in pid 15177 and 20962.
[15:28] mars: Ah, it's in test_on_merge.py
[15:28] allenap, sorry, I don't understand? the select() loop is in test_on_merge.py, which... hey, if test_on_merge.py is still running, shouldn't it be in the process tree as well?
[15:29] allenap, the first column of http://pastebin.ubuntu.com/434986/ is the PPID. Notice that it is '1' for a few of those?
[15:29] hahahahahaha
[15:29] mars: Yes, it should!
[15:29] By killing the process group in this way, the supervisor script is killing itself :-)
[15:29] blah
[15:30] maxb: Is it?
[15:30] I tried typing out a few key lines in an interactive python
[15:31] and that's what it seems to indicate
[15:31] proc = Popen('sleep 3600', stdin=PIPE, stdout=PIPE, stderr=STDOUT, shell=True)
[15:31] os.killpg(os.getpgid(proc.pid), 9)
[15:33] maxb: But isn't it doing os.killpg(proc.pid)?
[15:33] maxb: Scratch that.
[15:33] Doh.
[15:34] maxb: There's a comment in killem saying "Note that bin/test sets its process to a process group leader".
[15:34] allenap, os.killpg(os.getpgid(pid), signal)
[15:34] allenap: oh, ok
[15:34] hrm
[15:34] maxb: The process tree bears that out I think.
[15:38] mars, maxb: The Popen call has shell=True; killem() is killing the shell.
[15:39] * allenap has to restart router
=== deryck is now known as deryck[lunch]
[15:52] allenap, back?
[15:52] nope, still away
=== barry_ is now known as barry
[15:59] mars: Hi. If you said anything between my last message and "allenap, back?" then I missed it.
[15:59] :)
[15:59] nope!
[15:59] allenap, thought of something
[15:59] possible cause then
[15:59] so normally, killing the shell would take everything with it
[16:00] but I have observed that Python can get deadlocked(?) on the windmill process. Python won't respond to a SIGTERM even. You have to SIGHUP it.
[16:01] So, if we have the windmill/Python deadlock, the suite hangs. test_on_merge says "Oops, better kill it!", and kills the shell. But Python ignores the shell's SIGTERM, and keeps running, along with everything in the tree under it.
[16:02] allenap, maxb ^ does that make sense?
[16:03] mars: That makes a lot of sense. So, perhaps send HUP or QUIT before KILL.
[16:04] mars: Alternatively, get the Popen call to work without shell=True
[16:06] allenap, should HUP walk the process tree maybe? This code assumes that killing the shell is enough. Obviously that is not a thorough approach.
[16:06] mars: Or make cmdline = "exec " + cmdline
[16:06] mars: Maybe it should, but see if this works first.
[16:07] exec would terminate the test_on_merge code path, wouldn't it?
[16:09] mars: No, it would mean that the test process replaces the shell used to invoke it, so that killpg kills the test process group.
[16:09] oh!
[16:09] yes
[16:09] allenap, awesome, thank you
[16:09] mars: But, I'm not sure the shell=True bit is necessary anyway.
[16:10] allenap, well, you were wondering if the entire select() loop was needed instead of just using .communicate(), so I can't say if any of this code should stay.
[16:11] allenap, they may have used their own select() loop to save .communicate() from buffering too much output.
[16:11] there is a warning in the module docs about that
[16:13] mars: No, you definitely shouldn't use communicate(); but Popen._communicate() does have an implementation of a similar select() loop that I thought was worth studying. For example, it has an exception handler around select() to catch EINTR. Actually, other than that it's very similar.
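The ideas in this exchange (no shell=True, a select() loop with a timeout, HUP before KILL, signalling the whole process group) can be sketched roughly as follows. This is an illustration under assumptions, not the code in test_on_merge.py: the function name run_with_output_timeout and its parameters are hypothetical, it assumes Python 3 on a POSIX system, and it puts the child in its own session with start_new_session=True rather than relying on bin/test making itself a group leader.

```python
import os
import select
import signal
import subprocess
import time


def run_with_output_timeout(argv, timeout, grace=1.0):
    """Run argv; if it prints nothing for `timeout` seconds, kill its group.

    Hypothetical helper. Returns (returncode, timed_out, output_bytes).
    """
    proc = subprocess.Popen(
        argv,                      # a list, so no intermediate shell
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        start_new_session=True,    # child calls setsid(): pgid == proc.pid
    )
    fd = proc.stdout.fileno()
    chunks = []
    timed_out = False
    while True:
        # Python 3.5+ retries select() on EINTR by itself (PEP 475).
        readable, _, _ = select.select([fd], [], [], timeout)
        if not readable:
            timed_out = True       # no output within `timeout` seconds
            break
        chunk = os.read(fd, 4096)
        if not chunk:              # EOF: every holder of the pipe closed it
            break
        chunks.append(chunk)
    if timed_out:
        # Escalate gently: HUP, then TERM, then KILL, to the whole group.
        for sig in (signal.SIGHUP, signal.SIGTERM, signal.SIGKILL):
            try:
                os.killpg(proc.pid, sig)
            except ProcessLookupError:
                break              # the group is already gone
            deadline = time.monotonic() + grace
            while proc.poll() is None and time.monotonic() < deadline:
                time.sleep(0.05)
            if proc.poll() is not None:
                break
    # Close without draining: reading to EOF could block if a grandchild
    # still holds the write end of the pipe open.
    proc.stdout.close()
    return proc.wait(), timed_out, b"".join(chunks)
```

Because the supervisor stays outside the child's process group, killpg cannot take the supervisor down with it, which was the worry raised earlier in the discussion. A process that deliberately moves itself into yet another group (as firefox appeared to) would still escape this sketch.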
[16:13] mars: In any case, it doesn't look like the problem lies in that direction anyway.
[16:13] right
[16:14] well, this is one huge step closer to getting things working again
[16:19] maxb, allenap, thank you for all the help!
[16:19] mars: You're welcome. I hope it works now :)
=== matsubara is now known as matsubara-lunch
=== deryck[lunch] is now known as deryck
=== gary_poster is now known as gary-lunch
=== beuno is now known as beuno-lunch
[17:12] mrevell: I think I'd like to actually deactivate the launchpad-doc@ mailing list entirely. Its original purpose is now obsolete. Any objections?
[17:13] leonardr, something you may find interesting: http://factoryjoe.com/blog/2010/05/16/combing-openid-and-oauth-with-openid-connect/
[17:13] kfogel, None at all
[17:13] mrevell: thanks. Also, should I update https://help.launchpad.net/DocTeam to say it's obsolete?
[17:13] kfogel, Please, if you don't mind.
[17:13] mrevell: no problem.
=== abentley is now known as abentley-lunch
=== matsubara-lunch is now known as matsubara
[18:04] Night all
=== gary-lunch is now known as gary_poster
=== beuno-lunch is now known as beuno
=== EdwinGrubbs_ is now known as foo1000
=== abentley-lunch is now known as abentley
=== matsubara is now known as matsubara-afk
[21:30] OMG. I made staging much faster than edge.
[21:38] how?
[21:50] sinzui: how??
[21:57] thumper, I used memcached tales directives on milestone and portlets. I may have fixed another issue doing this, https://staging.launchpad.net/libpng/main/+index loads faster on staging than edge for me
[21:58] sinzui: please write it up for the list - I'm not sure how to use the memcached tales directives
[21:59] sinzui: oh you did already
[21:59] thumper, I will if my branch lands. I am sure engineers will love all the broken browser tests that caching created
[21:59] for me
[21:59] gary_poster, ^
[22:00] sinzui, yay! :-)
[22:00] sinzui, that was for memcached
[22:00] thumper, I have not.
written up what I did, yet, I just submitted the review. I want to cache some of our tales formatters if it proves to be faster than db lookups
[22:00] gary_poster :)
[22:02] sinzui, I wonder how many of those requests are for breadcrumbs...
[22:02] sinzui, probably not as big a payoff. What you found is huge.
[22:02] mars, indeed that too crossed my mind. the header of pages can be cached. project/person displayname changes are rare.
[23:17] gary_poster: leonardr: ping - mod_compress
[23:17] lifeless: pong, hi
[23:17] I'm a little surprised that you didn't just fix apacge
[23:17] blah, apache
[23:17] I wanted to check my facts at the source :)
[23:19] lifeless: nothing we would consider to be 'fixing apache' is acceptable to apache upstream
[23:19] lifeless: :-) I'd be +1 on fixing it in apache, but from what I understood of Roy Fielding's response to the bug, the proper fix is a new filter.
[23:19] leonardr: oh! that's surprising
[23:19] see https://issues.apache.org/bugzilla/show_bug.cgi?id=39727#c31 for example
[23:20] leonardr: couldn't we have made a new filter? Or did I misunderstand Roy Fielding's suggestion?
[23:20] roy specifically says
[23:20] If mod_deflate modifies
[23:20] ETag on the way out, then its corresponding later requests must
[23:20] be reverse-modified (etags and request content) on the way back.
[23:21] which is completely consistent with my view, and the source of the issue [that mod_compress or whatever we're using *is violating* that MUST]
[23:21] Ah, I was misremembering just a bit. This was the line I was trying to remember:
[23:21] The best solution is to implement transfer-encoding as an
[23:21] http protocol filter module.
[23:22] well, that's TE
[23:22] yeah
[23:22] which is different
[23:22] yeah
[23:22] that's what we tried to do, but the intermediaries stripped it
[23:22] that was the misremembering part
[23:22] roy also rejects the solution i thought would work:
[23:22] Preprocessing all incoming conditional headers to remove
[23:22] a -gzip suffix before the request is processed won't work.
[23:22] In a chain of Apache servers, we won't know which server
[23:22] set the suffix and how many caches have stored the modified
[23:22] ETag versus the unmodified ETag.
[23:25] so
[23:25] a later comment addresses that, though you have to be prescient to parse it
[23:26] (have each server uniquely add its suffix, and have the sysadmins be responsible for ensuring a matching back-path)
[23:26] I think that Apache would accept a patch which strips the -gzip, when an option is set.
[23:27] there are lots of special-case vs general case situations in surrogates vs http as a whole
[23:27] there is already a patch that strips the -gzip always
[23:27] ok
[23:27] I suggest we: get that applied to our apaches
[23:27] say in the review for that patch that we need it, and discuss what's required to get it in mainline
[23:28] ok, what's the process for patching our apache?
[23:29] so, lifeless, I have to go half an hour ago, but is this the position:
[23:29] also, in #squiddev I'm asking hno what he thinks the situation is
[23:29] - the gzip suffix could eventually be customized
[23:29] [henrik nordstrom from that bug report]
[23:29] (per server)
[23:30] - at that point the underlying concerns could be addressed, because that specific server's suffix could be targeted
[23:31] leonardr: in a chroot/vm of hardy which is what we have deployed, do an apt-get source apache2, apply the patch using whatever patch system its building with, and make sure it builds, then file an RT ticket, and include the debdiff
[23:31] - meanwhile we have a hard-coded suffix.
We could have a flag to remove the suffix, whatever it is; at this point it is particularly easy, because it is hardcoded
[23:31] gary_poster: precisely. making it customised might be a way to get apache upstream to let useful code into their code base
[23:32] lifeless: ok, thank you for the clarification. I'm happy with that if the LOSAs are happy with that (as I expect they would be).
[23:32] thank you
[23:32] need to run
[23:32] U1 has patched their apache
[23:32] with a different patch, but similar sort of situation [though theirs was a simple backport]
[23:33] gary_poster: ciao
[23:33] implementing this is outside my area of competence, but if you and the losas are happy with changing apache to handle this problem, that's good news for me
[23:33] losa: ^ cross-check please
[23:39] thumper: I'm curious (as a test framework writer) what prompted lp:~thumper/launchpad/fix-factory-ids-in-tests
[23:55] lifeless: a test failed in launchpad because it depended on exact values returned by the factory and miscellaneous refactoring changed that
[23:55] so i changed the factory to return different values and ran the entire test suite and filed a bug with the failure
[23:55] s
[23:56] ah
[23:56] I guess I meant
[23:57] 'why change from using the unique stuff'
[23:57] not 'why did some tests fail'
[23:59] mwhudson: ^
[23:59] i haven't looked at the branch itself
[23:59] leonardr/lifeless: I really don't want to carry a patch to apache forever; I accepted the U1 patch because it's an upstream patch
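For reference, the "strip the -gzip" idea discussed above amounts to something like the sketch below. It is only an illustration of the transformation, written in Python rather than as an Apache patch; the function name strip_gzip_suffix is hypothetical, and it assumes the suffix sits inside the quoted entity-tag (as in "abc-gzip") and is the single hard-coded string "-gzip". It does nothing about the chain-of-servers objection quoted from the bug report.

```python
import re

# Assumption: one hard-coded suffix, appended inside the quoted entity-tag.
GZIP_SUFFIX = "-gzip"

# Matches one entity-tag: optional weak marker, then a quoted opaque string.
_ENTITY_TAG = re.compile(r'(W/)?"([^"]*)"')


def strip_gzip_suffix(header_value):
    """Return an If-None-Match style value with the suffix removed.

    Hypothetical helper: '"abc-gzip", W/"def"' becomes '"abc", W/"def"'.
    """
    if header_value.strip() == "*":
        return header_value        # wildcard carries no entity-tag to rewrite

    def fix(match):
        weak = match.group(1) or ""
        opaque = match.group(2)
        if opaque.endswith(GZIP_SUFFIX):
            opaque = opaque[: -len(GZIP_SUFFIX)]
        return '%s"%s"' % (weak, opaque)

    return _ENTITY_TAG.sub(fix, header_value)
```

A per-server configurable suffix, as floated in the discussion, would replace the GZIP_SUFFIX constant with a setting so that only that server's own tags are rewritten.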