* thumper is back | 02:28 | |
* thumper EODs | 06:25 | |
adeuring | good morning | 08:32 |
---|---|---|
mrevell | Good morning | 09:02 |
thumper | jelmer: hey | 10:28 |
jelmer_ | thumper: Hi! | 10:28 |
thumper | jelmer_: what's the status of bzr-git and dulwich? | 10:28 |
thumper | jelmer_: I'd like to rollout updates | 10:29 |
jelmer_ | thumper: I've fixed the issue with http last week and we now import HEAD branches again, we should be able to rollout current dulwich and bzr-git's roundtrip branch. | 10:33 |
thumper | so tips of both? | 10:33 |
jelmer_ | thumper: (did I mention 'bzr serve --git' works in a native bzr branch now?) | 10:33 |
thumper | jelmer_: I think I saw your 'dent about it | 10:33 |
thumper | jelmer_: how efficient is it? | 10:34 |
jelmer_ | thumper: It's very quick on small (small as in inventory) branches, very slow on large inventory branches | 10:34 |
thumper | jelmer_: do you know why it is slow? | 10:34 |
thumper | jelmer_: fixable? | 10:35 |
jelmer_ | thumper: Creating the Tree objects from inventories on the fly is very slow and O(size-of-tree) | 10:35 |
* thumper nods | 10:35 | |
jelmer_ | thumper: We need to cache those Tree objects, should be able to do so once jam's work on packs lands. | 10:35 |
jelmer_ | thumper: I'm merging the roundtrip branch into trunk atm with roundtripping itself disabled for the moment. | 10:36 |
jelmer_ | (it's got a lot of other improvements but I don't want to be tied to the current syntax for roundtripped data) | 10:37 |
thumper | ok | 10:37 |
thumper | jelmer_: just send me an email when it's ready with revnos for lp:dulwich and lp:bzr-git | 10:37 |
thumper | jelmer_: and I'll get them updated | 10:37 |
jelmer_ | thumper: will do | 10:41 |
thumper | jelmer_: thanks | 10:48 |
deryck | Morning, all. | 11:06 |
wgrant | jelmer_: bzr serve --git crashes for me :( | 11:07 |
wgrant | AttributeError: 'BzrBackend' object has no attribute 'object_store' | 11:07 |
wgrant | bzr-git/dulwich 0.5.0. | 11:07 |
bigjools | morning deryck | 11:16 |
jelmer_ | wgrant: you need a newer dulwich, bzr-git from lp:~jelmer/bzr-git/roundtrip | 11:21 |
=== barry` is now known as barry_ | ||
=== mrevell is now known as mrevell-lunch | ||
=== jelmer_ is now known as Guest6486 | ||
=== mrevell-lunch is now known as mrevell | ||
=== Ursinha_ is now known as Ursinha | ||
kfogel | mrevell: it looks like RT 39801 has been done, meaning that our help and dev wikis are now editable only by members of ~launchpad-doc (for spam protection). I'm planning to a) announce this in the appropriate places, b) document it likewise, and c) approve the currently pending team members at https://edge.launchpad.net/~launchpad-doc/+members#active. Thoughts? | 14:15 |
mrevell | kfogel, Thanks for handling that. Your plan looks ideal to me. I have two questions: do you think it's worthwhile contacting each existing member of the team directly to explain the new meaning of their membership? Also, do you think there's now a role for the team mailing list? I think "Yes" to the first and I'm not sure about the second. | 14:17 |
kfogel | mrevell: I agree. "Yes" to the first, but for the second, let's just direct people at #launchpad-dev and the other usual places if they have questions. | 14:19 |
kfogel | mrevell: oh, looks like I'll need to be a team administrator. Really, anyone in ~launchpad could be an admin. Can we do that? | 14:20 |
mrevell | kfogel, ~launchpad is now an admin of that team | 14:22 |
kfogel | mrevell: fast action, sir. | 14:23 |
mrevell | kfogel, :) As for contacting members directly, to tell them of the team's altered role, I don't think everyone in the team is a member of the ML | 14:24 |
mrevell | kfogel, but the membership is fairly small so we can easily contact them directly | 14:24 |
kfogel | mrevell: I was just going to mail them individually. | 14:25 |
mrevell | cool | 14:25 |
mars | jml or allenap, ping, would either of you be available to help debug a possible hang in some Unix IPC code using the subprocess module? I have a few questions about a 40 LoC block I'm studying. | 14:42 |
allenap | mars: I'll have a look. | 14:43 |
mars | thanks allenap, I'm pasting it now | 14:43 |
mars | allenap, http://pastebin.ubuntu.com/434980/ | 14:44 |
maxb | line 13, comment is lying | 14:46 |
maxb | that's where you end up if the TIMEOUT runs out | 14:47 |
mars | maxb, ok, that's why I asked other people :) | 14:47 |
mars | maxb, so STDOUT could still be open, but the select() has timed out? | 14:48 |
maxb | yes | 14:48 |
mars | ok | 14:48 |
mars | allenap, ^ | 14:48 |
* mars changes the comment | 14:48 | |
* maxb wonders why the code uses os.read(blah.fileno()) | 14:49 | |
allenap | mars: Yes, I noticed that. What were your questions? | 14:49 |
allenap | mars: Is it working? | 14:49 |
mars | maxb, allenap, fwiw, I'm trying to figure out why the test suite is hanging with the ec2 testrunner. This code should kill the entire test suite. But it might not be. | 14:50 |
mars | allenap, I'm wondering if something in this timeout code is buggy in such a way that the test suite could hang, but this code doesn't catch and kill it. | 14:50 |
mars | allenap, my XXX comments are where I would start. They ask basic Unix programming questions that I have, that I can not answer. | 14:51 |
mars | Without reading "Advanced Unix Programming". I need the fix a bit faster than that. | 14:52 |
mars | hmm | 14:53 |
mars | maxb, if what you say is true... During a hang, the test suite has stopped printing output. That means the select() is timing out, right? | 14:54 |
maxb | yes | 14:54 |
mars | so that narrows the bug down to that branch of the conditional. | 14:55 |
mars | so the hang is in this code path, or this code is running correctly, and the fault happens after this entire script has exited (test_on_merge.py exits correctly, but fails to send mail or something). | 14:56 |
mars | maxb, allenap, is the XXX on line 17 a valid concern? | 14:57 |
allenap | maxb: subprocess.Popen._communicate() does the fileno() thing. mars: Might be worth looking there and copy that select() loop as closely as possible. | 14:57 |
mars | allenap, yeah, I was wondering why they didn't use .communicate(). | 14:58 |
allenap | mars: If proc.poll() is not None then the process has definitely terminated. | 14:58 |
allenap | mars: .communicate() only returns the result at the end. Don't want to wait that long while running the test suite :) | 14:59 |
maxb | Line 17 could be hit if the test process exited but a subprocess spawned by the test process still retained the open stdout file descriptor | 14:59 |
mars | allenap, ah, "is not None" means it's absolutely dead. Ok. | 15:00 |
mars | I just re-read the docs, you are correct. | 15:00 |
mars | maxb, oh, that is something | 15:00 |
mars | it *does* do that | 15:00 |
mars | the process tree on a hung server has a few defunct processes | 15:01 |
mars | well, maybe | 15:01 |
=== Guest6486 is now known as jelmer_____ | ||
allenap | mars: Yeah, line 28 will kill the child, but if its children are still alive and holding a reference to proc.stdout then line 31 will hang. | 15:03 |
mars | here, more info | 15:03 |
mars | maxb, allenap, process tree from a hung server. Everything is still alive! http://pastebin.ubuntu.com/434986/ | 15:04 |
mars | the code never got to the killem() line :( | 15:04 |
allenap | mars: The firefox has not been collected by the parent :-/ | 15:05 |
mars | kill_hung_process_with_a_series_of_brutish_instruments() was never called. It will eventually SIGKILL the process in question. | 15:05 |
mars | allenap, yes, on my local system I had a sort of the same thing. I had to send SIGHUP to Python itself in order for it to get collected. | 15:06 |
mars | kill -1 python-parent-process-id | 15:06 |
sinzui | mrevell, which project did you have the milestone problem with? | 15:06 |
mrevell | sinzui, launchpad, malone, rosetta, launchpad-registry and launchpad-foundations, so far | 15:08 |
sinzui | always 10.10? | 15:08 |
sinzui | ie 10.11 is always fine | 15:08 |
mars | allenap, firefox was started by the windmill testrunner. I don't see it in the tree. That means that it died or was killed. | 15:08 |
mrevell | sinzui, Yeah. Always 10.10. Everything else from 10.06 to 10.12 have been fine. | 15:09 |
sinzui | mrevell, I see them, how did you create 10.10 milestones? | 15:10 |
mars | allenap, maxb, that would leave the zombie processes. So would the process tree I posted somehow lead the timeout code to hang? | 15:10 |
mrevell | sinzui, I refreshed the page so that the error message was replaced by a link to a new 10.1 milestone (which I hadn't created but appeared instead of the 10.10 milestone), then went in and changed the name of the 10.1 milestone to 10.10 | 15:10 |
sinzui | okay thanks. | 15:11 |
maxb | What is the process actually being Popen-ed here? | 15:12 |
mars | maxb, line 1 of http://pastebin.ubuntu.com/434986/. | 15:13 |
maxb | Did the [[[print ("\nA test appears to be hung. There has been no output for"]]] actually occur? | 15:15 |
mars | allenap, maxb, here is the current code, uncommented http://pastebin.ubuntu.com/434992/. This makes no sense: line 19 and 23 must have run. The processes are still alive! | 15:15 |
maxb | And just what is kill_hung_process_with_a_series_of_brutish_instruments? | 15:15 |
mars | maxb, that function is a rewrite I did of lines 19 through 23 of http://pastebin.ubuntu.com/434992/ | 15:16 |
mars | maxb, just for clarity. The code I just pasted is what is actually run by the server, but it was too dense for me to understand. So I rewrote and commented it before posting. | 15:17 |
maxb | What is killem? | 15:17 |
mars | hmm | 15:17 |
maxb | It would be interesting to add some logging to see what PIDs it's *actually* sending signals to | 15:17 |
mars | if we had console output for the log to write to :/ | 15:18 |
mars | or python standard logging installed and running in this script... | 15:18 |
mars | maxb, I can add logging if that would help. | 15:19 |
maxb | Well, I'm a bit baffled, so I'm clutching on to the fact that if the process is still running, a SIGKILL can't have really happened. | 15:19 |
mars | maxb, here is the killem() function: http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/annotate/head:/test_on_merge.py#L189 | 15:20 |
mars | maxb, allenap, killem() runs os.killpg(), not os.kill(). Does that matter? | 15:20 |
mars | (I suspect it might?) | 15:20 |
mars | hmmm | 15:22 |
mars | column 3 of http://pastebin.ubuntu.com/434986/ is the process group ID | 15:22 |
mars | oh | 15:22 |
mars | line 6 | 15:22 |
mars | firefox <defunct> is part of a different process group. But that doesn't make sense. Could the windmill testrunner have been the target of the process group kill? | 15:24 |
mars | maxb, I think you are right. The best next step is probably to add some logging to the code to see what it is killing, and why. | 15:26 |
maxb | Clearly we have an issue if the kill code is expecting the entire tree to be just one process group, but it isn't | 15:26 |
allenap | mars: Where does the select() loop run? Is it in the test runner? If so, then it's running in pid 15177 and 20962. | 15:27 |
allenap | mars: Ah, it's in test_on_merge.py | 15:28 |
mars | allenap, sorry, I don't understand? the select() loop is in test_on_merge.py, which... hey, if test_on_merge.py is still running, shouldn't it be in the process tree as well? | 15:28 |
mars | allenap, the first column of http://pastebin.ubuntu.com/434986/ is the PPID. Notice that it is '1' for a few of those? | 15:29 |
maxb | hahahahahaha | 15:29 |
allenap | mars: Yes, it should! | 15:29 |
maxb | By killing the process group in this way, the supervisor script is killing itself :-) | 15:29 |
mars | blah | 15:29 |
allenap | maxb: Is it? | 15:30 |
maxb | I tried typing out a few key lines in an interactive python | 15:30 |
maxb | and that's what it seems to indicate | 15:31 |
maxb | proc = Popen('sleep 3600', stdin=PIPE, stdout=PIPE, stderr=STDOUT, shell=True) | 15:31 |
maxb | os.killpg(os.getpgid(proc.pid), 9) | 15:31 |
allenap | maxb: But isn't it doing os.killpg(proc.pid)? | 15:33 |
allenap | maxb: Scratch that. | 15:33 |
allenap | Doh. | 15:33 |
allenap | maxb: There's a comment in killem saying "Note that bin/test sets its process to a process group leader". | 15:34 |
mars | allenap, os.killpg(os.getpgid(pid), signal) | 15:34 |
maxb | allenap: oh, ok | 15:34 |
maxb | hrm | 15:34 |
allenap | maxb: The process tree bears that out I think. | 15:34 |
allenap | mars, maxb: The Popen call has shell=True; killem() is killing the shell. | 15:38 |
* allenap has to restart router | 15:39 | |
=== deryck is now known as deryck[lunch] | ||
mars | allenap, back? | 15:52 |
mars | nope, still away | 15:52 |
=== barry_ is now known as barry | ||
allenap | mars: Hi. If you said anything between my last message and "allenap, back?" then I missed it. | 15:59 |
mars | :) | 15:59 |
mars | nope! | 15:59 |
mars | allenap, thought of something | 15:59 |
mars | possible cause then | 15:59 |
mars | so normally, killing the shell would take everything with it | 15:59 |
mars | but I have observed that Python can get deadlocked(?) on the windmill process. Python won't respond to a SIGTERM even. You have to SIGHUP it. | 16:00 |
mars | So, if we have the windmill/Python deadlock, the suite hangs. test_on_merge says "Oops, better kill it!", and kills the shell. But Python ignores the shell's SIGTERM, and keeps running, along with everything in the tree under it. | 16:01 |
mars | allenap, maxb ^ does that make sense? | 16:02 |
allenap | mars: That makes a lot of sense. So, perhaps send HUP or QUIT before KILL. | 16:03 |
allenap | mars: Alternatively, get the Popen call to work without shell=True | 16:04 |
mars | allenap, should HUP walk the process tree maybe? This code assumes that killing the shell is enough. Obviously that is not a thorough approach. | 16:06 |
allenap | mars: Or make cmdline = "exec " + cmdline | 16:06 |
allenap | mars: Maybe it should, but see if this works first. | 16:06 |
mars | exec would terminate the test_on_merge code path, wouldn't it? | 16:07 |
allenap | mars: No, it would mean that the test process replaces the shell used to invoke it, so that killpg kills the test process group. | 16:09 |
mars | oh! | 16:09 |
mars | yes | 16:09 |
mars | allenap, awesome, thank you | 16:09 |
allenap | mars: But, I'm not sure the shell=True bit is necessary anyway. | 16:09 |
mars | allenap, well, you were wondering if the entire select() loop was needed instead of just using .communicate(), so I can't say if any of this code should stay. | 16:10 |
mars | allenap, they may have used their own select() loop to save .communicate() from buffering too much output. | 16:11 |
mars | there is a warning in the module docs about that | 16:11 |
allenap | mars: No, you definitely shouldn't use communicate(); but Popen._communicate() does have an implementation of a similar select() loop that I thought was worth studying. For example, it has an exception handler around select() to catch EINTR. Actually, other than that it's very similar. | 16:13 |
allenap | mars: In any case, it doesn't look like the problem lies in that direction anyway. | 16:13 |
mars | right | 16:13 |
mars | well, this is one huge step closer to getting things working again | 16:14 |
mars | maxb, allenap, thank you for all the help! | 16:19 |
allenap | mars: You're welcome. I hope it works now :) | 16:19 |
=== matsubara is now known as matsubara-lunch | ||
=== deryck[lunch] is now known as deryck | ||
=== gary_poster is now known as gary-lunch | ||
=== beuno is now known as beuno-lunch | ||
kfogel | mrevell: I think I'd like to actually deactivate the launchpad-doc@ mailing list entirely. Its original purpose is now obsolete. Any objections? | 17:12 |
mars | leonardr, something you may find interesting: http://factoryjoe.com/blog/2010/05/16/combing-openid-and-oauth-with-openid-connect/ | 17:13 |
mrevell | kfogel, None at all | 17:13 |
kfogel | mrevell: thanks. Also, should I update https://help.launchpad.net/DocTeam to say it's obsolete? | 17:13 |
mrevell | kfogel, Please, if you don't mind. | 17:13 |
kfogel | mrevell: no problem. | 17:13 |
=== abentley is now known as abentley-lunch | ||
=== matsubara-lunch is now known as matsubara | ||
mrevell | Night all | 18:04 |
=== gary-lunch is now known as gary_poster | ||
=== beuno-lunch is now known as beuno | ||
=== EdwinGrubbs_ is now known as foo1000 | ||
=== abentley-lunch is now known as abentley | ||
=== matsubara is now known as matsubara-afk | ||
sinzui | OMG. I made staging much faster than edge. | 21:30 |
cody-somerville | how? | 21:38 |
thumper | sinzui: how?? | 21:50 |
sinzui | thumper, I used memcached tales directives on milestone and portlets. I may have fixed another issue doing this, https://staging.launchpad.net/libpng/main/+index loads faster on staging than edge for me | 21:57 |
thumper | sinzui: please write it up for the list - I'm not sure how to use the memcached tales directives | 21:58 |
thumper | sinzui: oh you did already | 21:59 |
sinzui | thumper, I will if my branch lands. I am sure engineers will love all the broken browser tests that caching created | 21:59 |
sinzui | for me | 21:59 |
mars | gary_poster, ^ | 21:59 |
gary_poster | sinzui, yay! :-) | 22:00 |
gary_poster | sinzui, that was for memcached | 22:00 |
sinzui | thumper, I have not written up what I did yet; I just submitted the review. I want to cache some of our tales formatters if it proves to be faster than db lookups | 22:00 |
sinzui | gary_poster :) | 22:00 |
mars | sinzui, I wonder how many of those requests are for breadcrumbs... | 22:02 |
mars | sinzui, probably not as big a payoff. What you found is huge. | 22:02 |
sinzui | mars, indeed that too crossed my mind. the header of pages can be cached. project/person displayname changes are rare. | 22:02 |
lifeless | gary_poster: leonardr: ping - mod_compress | 23:17 |
gary_poster | lifeless: pong, hi | 23:17 |
lifeless | I'm a little surprised that you didn't just fix apacge | 23:17 |
lifeless | blah, apache | 23:17 |
lifeless | I wanted to check my facts at the source :) | 23:17 |
leonardr | lifeless: nothing we would consider to be 'fixing apache' is acceptable to apache upstream | 23:19 |
gary_poster | lifeless: :-) I'd be +1 on fixing it in apache, but from what I understood of Roy Fielding's response to the bug, the proper fix is a new filter. | 23:19 |
lifeless | leonardr: oh! thats surprising | 23:19 |
leonardr | see https://issues.apache.org/bugzilla/show_bug.cgi?id=39727#c31 for example | 23:19 |
gary_poster | leonardr: couldn't we have made a new filter? Or did I misunderstand Roy Fielding's suggestion? | 23:20 |
lifeless | roy specifically says | 23:20 |
lifeless | If mod_deflate modifies | 23:20 |
lifeless | ETag on the way out, then its corresponding later requests must | 23:20 |
lifeless | be reverse-modified (etags and request content) on the way back. | 23:20 |
lifeless | which is completely consistent with my view, and the source of the issue [that mod_compress or whatever we're using *is violating* that MUST] | 23:21 |
gary_poster | Ah, I was misremembering just a bit. This was the line I was trying to remember: | 23:21 |
gary_poster | The best solution is to implement transfer-encoding as an | 23:21 |
gary_poster | http protocol filter module. | 23:21 |
lifeless | well, thats TE | 23:22 |
leonardr | yeah | 23:22 |
lifeless | which is different | 23:22 |
gary_poster | yeah | 23:22 |
leonardr | that's what we tried to do, but the intermediaries stripped it | 23:22 |
gary_poster | that was the misremembering part | 23:22 |
leonardr | roy also rejects the solution i thought would work: | 23:22 |
leonardr | Preprocessing all incoming conditional headers to remove | 23:22 |
leonardr | a -gzip suffix before the request is processed won't work. | 23:22 |
leonardr | In a chain of Apache servers, we won't know which server | 23:22 |
leonardr | set the suffix and how many caches have stored the modified | 23:22 |
leonardr | ETag versus the unmodified ETag. | 23:22 |
lifeless | so | 23:25 |
lifeless | a later comment addresses that, though you have to be prescient to parse it | 23:25 |
lifeless | (have each server uniquely add its suffix, and have the sysadmins be responsible for ensuring a matching back-path) | 23:26 |
lifeless | I think that Apache would accept a patch which strips the -gzip, when an option is set. | 23:26 |
lifeless | there are lots of special-case vs general case situations in surrogates vs http as a whole | 23:27 |
leonardr | there is already a patch that strips the -gzip always | 23:27 |
lifeless | ok | 23:27 |
lifeless | I suggest we: get that applied to our apaches | 23:27 |
lifeless | say in the review for that patch that we need it, and discuss whats required to get it in mainline | 23:27 |
leonardr | ok, what's the process for patching our apache? | 23:28 |
gary_poster | so, lifeless, I have to go half an hour ago, but is this the position: | 23:29 |
lifeless | also, in #squiddev I'm asking hno what he thinks the situation is | 23:29 |
gary_poster | - the gzip suffix could eventually be customized | 23:29 |
lifeless | [henrik nordstrom from that bug report] | 23:29 |
gary_poster | (per server) | 23:29 |
gary_poster | - at that point the underlying concerns could be addressed, because that specific server's suffix could be targeted | 23:30 |
lifeless | leonardr: in a chroot/vm of hardy which is what we have deployed, do an apt-get source apache2, apply the patch using whatever patch system its building with, and make sure it builds, then file an RT ticket, and include the debdiff | 23:31 |
gary_poster | - meanwhile we have a hard-coded suffix. We could have a flag to remove the suffix, whatever it is; at this point it is particularly easy, because it is hardcoded | 23:31 |
lifeless | gary_poster: precisely. making it customised might be a way to get apache upstream to let useful code into their code base | 23:31 |
gary_poster | lifeless: ok, thank you for the clarification. I'm happy with that if the LOSAs are happy with that (as I expect they would be). | 23:32 |
gary_poster | thank you | 23:32 |
gary_poster | need to run | 23:32 |
lifeless | U1 has patched their apache | 23:32 |
lifeless | with a different patch, but similar sort of situation [though theirs was a simple backport] | 23:32 |
lifeless | gary_poster: ciao | 23:33 |
leonardr | implementing this is outside my area of competence, but if you and the losas are happy with changing apache to handle this problem, that's good news for me | 23:33 |
lifeless | losa: ^ cross-check please | 23:33 |
lifeless | thumper: I'm curious (as a test framework writer) what prompted lp:~thumper/launchpad/fix-factory-ids-in-tests | 23:39 |
mwhudson | lifeless: a test failed in launchpad because it depended on exact values returned by the factory and miscellaneous refactoring changed that | 23:55 |
mwhudson | so i changed the factory to return different values and ran the entire test suite and filed a bug with the failure | 23:55 |
mwhudson | s | 23:55 |
lifeless | ah | 23:56 |
lifeless | I guess I meant | 23:56 |
lifeless | 'why change from using the unique stuff' | 23:57 |
lifeless | not 'why did some tests fail' | 23:57 |
lifeless | mwhudson: ^ | 23:59 |
mwhudson | i haven't looked at the branch itself | 23:59 |
elmo | leonardr/lifeless: I really don't want to carry a patch to apache forever; I accepted the U1 patch because it's an upstream patch | 23:59 |
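
Note on the select() discussion between mars, maxb and allenap (14:46 to 14:59): the point that a select() timeout and a closed stdout are two different outcomes can be sketched roughly as below. This is not the code from the pastebin or from test_on_merge.py; `watch_output` and its return values are hypothetical names, and the output is collected in memory only so the sketch is easy to check. On Python 3.5 and later, select() and os.read() are retried automatically after EINTR, so the explicit handler allenap mentions is not reproduced here.

```python
import os
import select
import subprocess
import sys


def watch_output(proc, timeout):
    """Read proc's stdout until EOF or until it goes quiet.

    Returns (reason, data), where reason is "timeout" if select() gave up
    waiting (stdout may well still be open) or "eof" if every holder of the
    pipe's write end has closed it.
    """
    fd = proc.stdout.fileno()
    chunks = []
    while True:
        readable, _, _ = select.select([fd], [], [], timeout)
        if not readable:
            # No output within `timeout` seconds: the child may be hung.
            return "timeout", b"".join(chunks)
        data = os.read(fd, 4096)
        if not data:
            # An empty read means EOF, not a timeout.
            return "eof", b"".join(chunks)
        chunks.append(data)


if __name__ == "__main__":
    child = subprocess.Popen(
        [sys.executable, "-c", "print('hello')"], stdout=subprocess.PIPE
    )
    reason, output = watch_output(child, timeout=10)
    child.wait()
    print(reason, output)
```

After a "timeout" result, `proc.poll() is None` says the direct child is still running; as maxb points out, "eof" can be delayed well past the child's exit if a grandchild still holds the pipe open.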
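
Note on allenap's `"exec " + cmdline` suggestion (16:06 to 16:09): a minimal sketch of the idea, under stated assumptions. `spawn_in_own_group` and `kill_group` are hypothetical helpers, not functions from test_on_merge.py. The log says bin/test makes itself a process group leader; this sketch instead uses Popen's `start_new_session=True` to get a separate group, which is a swapped-in technique chosen to keep the example self-contained. The guard in `kill_group` reflects maxb's observation that killing the child's group can kill the supervisor if both share one group.

```python
import os
import signal
import subprocess


def spawn_in_own_group(cmdline):
    """Start cmdline through the shell, letting it replace the shell."""
    # "exec " makes the shell replace itself with the command, so proc.pid
    # is the command itself rather than an intermediate shell.
    # start_new_session=True (an assumption of this sketch) puts the child
    # in a fresh session and process group, separate from the supervisor.
    return subprocess.Popen(
        "exec " + cmdline,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        start_new_session=True,
    )


def kill_group(proc, sig=signal.SIGTERM):
    """Signal the child's whole process group, but never our own."""
    pgid = os.getpgid(proc.pid)
    if pgid == os.getpgrp():
        raise RuntimeError("child shares our process group; refusing to signal it")
    os.killpg(pgid, sig)


if __name__ == "__main__":
    proc = spawn_in_own_group("sleep 3600")
    kill_group(proc)
    proc.wait()
    print("child exit status:", proc.returncode)
```

Whether SIGTERM is enough is a separate question: mars reports a Python process that only responded to SIGHUP, so a real supervisor would probably escalate through several signals before SIGKILL.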