/srv/irclogs.ubuntu.com/2010/05/17/#launchpad-dev.txt

* thumper is back02:28
* thumper EODs06:25
adeuringgood morning08:32
mrevellGood morning09:02
thumperjelmer: hey10:28
jelmer_thumper: Hi!10:28
thumperjelmer_: what's the status of bzr-git and dulwich?10:28
thumperjelmer_: I'd like to rollout updates10:29
jelmer_thumper: I've fixed the issue with http last week and we now import HEAD branches again, we should be able to rollout current dulwich and bzr-git's roundtrip branch.10:33
thumperso tips of both?10:33
jelmer_thumper: (did I mention 'bzr serve --git' works in a native bzr branch now?)10:33
thumperjelmer_: I think I saw your 'dent about it10:33
thumperjelmer_: how efficient is it?10:34
jelmer_thumper: It's very quick on small (small as in inventory) branches, very slow on large inventory branches10:34
thumperjelmer_: do you know why it is slow?10:34
thumperjelmer_: fixable?10:35
jelmer_thumper: Creating the Tree objects from inventories on the fly is very slow and O(size-of-tree)10:35
* thumper nods10:35
jelmer_thumper: We need to cache those Tree objects, should be able to do so once jam's work on packs lands.10:35
jelmer_thumper: I'm merging the roundtrip branch into trunk atm with roundtripping itself disabled for the moment.10:36
jelmer_(it's got a lot of other improvements but I don't want to be tied to the current syntax for roundtripped data)10:37
thumperok10:37
thumperjelmer_: just send me an email when it's ready with revnos for lp:dulwich and lp:bzr-git10:37
thumperjelmer_: and I'll get them updated10:37
jelmer_thumper: will do10:41
thumperjelmer_: thanks10:48
deryckMorning, all.11:06
wgrantjelmer_: bzr serve --git crashes for me :(11:07
wgrantAttributeError: 'BzrBackend' object has no attribute 'object_store'11:07
wgrantbzr-git/dulwich 0.5.0.11:07
bigjoolsmorning deryck11:16
jelmer_wgrant: you need a newer dulwich, bzr-git from lp:~jelmer/bzr-git/roundtrip11:21
=== barry` is now known as barry_
=== mrevell is now known as mrevell-lunch
=== jelmer_ is now known as Guest6486
=== mrevell-lunch is now known as mrevell
=== Ursinha_ is now known as Ursinha
kfogelmrevell: it looks like RT 39801 has been done, meaning that our help and dev wikis are now editable only by members of ~launchpad-doc (for spam protection).  I'm planning to a) announce this in the appropriate places, b) document it likewise, and c) approve the currently pending team members at https://edge.launchpad.net/~launchpad-doc/+members#active.  Thoughts?14:15
mrevellkfogel, Thanks for handling that. Your plan looks ideal to me. I have two questions: do you think it's worthwhile contacting each existing member of the team directly to explain the new meaning of their membership? Also, do you think there's now a role for the team mailing list? I think "Yes" to the first and I'm not sure about the second.14:17
kfogelmrevell: I agree.  "Yes" to the first, but for the second, let's just direct people at #launchpad-dev and the other usual places if they have questions.14:19
kfogelmrevell: oh, looks like I'll need to be a team administrator.  Really, anyone in ~launchpad could be an admin.  Can we do that?14:20
mrevellkfogel, ~launchpad is now an admin of that team14:22
kfogelmrevell: fast action, sir.14:23
mrevellkfogel, :) As for contacting members directly, to tell them of the team's altered role, I don't think everyone in the team is a member of the ML14:24
mrevellkfogel, but the membership is fairly small so we can easily contact them directly14:24
kfogelmrevell: I was just going to mail them individually.14:25
mrevellcool14:25
marsjml or allenap, ping, would either of you be available to help debug a possible hang in some Unix IPC code using the subprocess module?  I have a few questions about a 40 LoC block I'm studying.14:42
allenapmars: I'll have a look.14:43
marsthanks allenap, I'm pasting it now14:43
marsallenap, http://pastebin.ubuntu.com/434980/14:44
maxbline 13, comment is lying14:46
maxbthat's where you end up if the TIMEOUT runs out14:47
marsmaxb, ok, that's why I asked other people :)14:47
marsmaxb, so STDOUT could still be open, but the select() has timed out?14:48
maxbyes14:48
marsok14:48
marsallenap, ^14:48
* mars changes the comment14:48
* maxb wonders why the code uses os.read(blah.fileno())14:49
allenapmars: Yes, I noticed that. What were your questions?14:49
allenapmars: Is it working?14:49
marsmaxb, allenap, fwiw, I'm trying to figure out why the test suite is hanging with the ec2 testrunner.  This code should kill the entire test suite.  But it might not be.14:50
marsallenap, I'm wondering if something in this timeout code is buggy in such a way that the test suite could hang, but this code doesn't catch and kill it.14:50
marsallenap, my XXX comments are where I would start.  They ask basic Unix programming questions that I have, that I can not answer.14:51
marsWithout reading "Advanced Unix Programming".  I need the fix a bit faster than that.14:52
marshmm14:53
marsmaxb, if what you say is true...   During a hang, the test suite has stopped printing output.  That means the select() is timing out, right?14:54
maxbyes14:54
marsso that narrows the bug down to that branch of the conditional.14:55
marsso the hang is in this code path, or this code is running correctly, and the fault happens after this entire script has exited (test_on_merge.py exits correctly, but fails to send mail or something).14:56
marsmaxb, allenap, is the XXX on line 17 a valid concern?14:57
allenapmaxb: subprocess.Popen._communicate() does the fileno() thing. mars: Might be worth looking there and copy that select() loop as closely as possible.14:57
marsallenap, yeah, I was wondering why they didn't use .communicate().14:58
allenapmars: If proc.poll() is not None then the process has definitely terminated.14:58
allenapmars: .communicate() only returns the result at the end. Don't want to wait that long while running the test suite :)14:59
maxbLine 17 could be hit if the test process exited but a subprocess spawned by the test process still retained the open stdout file descriptor14:59
marsallenap, ah, "is not None" means it's absolutely dead.  Ok.15:00
marsI just re-read the docs, you are correct.15:00
marsmaxb, oh, that is something15:00
marsit *does* do that15:00
marsthe process tree on a hung server has a few defunct processes15:01
marswell, maybe15:01
=== Guest6486 is now known as jelmer_____
allenapmars: Yeah, line 28 will kill the child, but if its children are still alive and holding a reference to proc.stdout then line 31 will hang.15:03
marshere, more info15:03
marsmaxb, allenap, process tree from a hung server.  Everything is still alive!  http://pastebin.ubuntu.com/434986/15:04
marsthe code never got to the killem() line :(15:04
allenapmars: The firefox has not been collected by the parent :-/15:05
marskill_hung_process_with_a_series_of_brutish_instruments() was never called.  It will eventually SIGKILL the process in question.15:05
marsallenap, yes, on my local system I had a sort of the same thing.  I had to send SIGHUP to Python itself in order for it to get collected.15:06
marskill -1 python-parent-process-id15:06
sinzuimrevell, which project did you have the milestone problem with?15:06
mrevellsinzui, launchpad, malone, rosetta, launchpad-registry and launchpad-foundations, so far15:08
sinzuialways 10.10?15:08
sinzuiie 10.11 is always fine15:08
marsallenap, firefox was started by the windmill testrunner.  I don't see it in the tree.  That means that it died or was killed.15:08
mrevellsinzui, Yeah. Always 10.10. Everything else from 10.06 to 10.12 have been fine.15:09
sinzuimrevell, I see them, how did you create 10.10 milestones?15:10
marsallenap, maxb, that would leave the zombie processes.  So would the process tree I posted somehow lead the timeout code to hang?15:10
mrevellsinzui, I refreshed the page so that the error message was replaced by a link to a new 10.1 milestone (which I hadn't created but appeared instead of the 10.10 milestone), then went in and changed the name of the 10.1 milestone to 10.1015:10
sinzuiokay thanks.15:11
maxbWhat is the process actually being Popen-ed here?15:12
marsmaxb, line 1 of http://pastebin.ubuntu.com/434986/.15:13
maxbDid the [[[print ("\nA test appears to be hung. There has been no output for"]]] actually occur?15:15
marsallenap, maxb, here is the current code, uncommented http://pastebin.ubuntu.com/434992/.  This makes no sense:  line 19 and 23 must have run.  The processes are still alive!15:15
maxbAnd just what is kill_hung_process_with_a_series_of_brutish_instruments?15:15
marsmaxb, that function is a rewrite I did of lines 19 through 23 of http://pastebin.ubuntu.com/434992/15:16
marsmaxb, just for clarity.  The code I just pasted is what is actually run by the server, but it was too dense for me to understand.  So I rewrote and commented it before posting.15:17
maxbWhat is killem?15:17
marshmm15:17
maxbIt would be interesting to add some logging to see what PIDs it's *actually* sending signals to15:17
marsif we had console output for the log to write to :/15:18
marsor python standard logging installed and running in this script...15:18
marsmaxb, I can add logging if that would help.15:19
maxbWell, I'm a bit baffled, so I'm clutching on to the fact that if the process is still running, a SIGKILL can't have really happened.15:19
marsmaxb, here is the killem() function: http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/annotate/head:/test_on_merge.py#L18915:20
marsmaxb, allenap, killem() runs os.killpg(), not os.kill().  Does that matter?15:20
mars(I suspect it might?)15:20
marshmmm15:22
marscolumn 3 of http://pastebin.ubuntu.com/434986/ is the process group ID15:22
marsoh15:22
marsline 615:22
marsfirefox <defunct> is part of a different process group.  But that doesn't make sense.  Could the windmill testrunner have been the target of the process group kill?15:24
marsmaxb, I think you are right.  The best next step is probably to add some logging to the code to see what it is killing, and why.15:26
maxbClearly we have an issue if the kill code is expecting the entire tree to be just one process group, but it isn't15:26
allenapmars: Where does the select() loop run? Is it in the test runner? If so, then it's running in pid 15177 and 20962.15:27
allenapmars: Ah, it's in test_on_merge.py15:28
marsallenap, sorry, I don't understand?  the select() loop is in test_on_merge.py, which... hey, if test_on_merge.py is still running, shouldn't it be in the process tree as well?15:28
marsallenap, the first column of http://pastebin.ubuntu.com/434986/ is the PPID.  Notice that it is '1' for a few of those?15:29
maxbhahahahahaha15:29
allenapmars: Yes, it should!15:29
maxbBy killing the process group in this way, the supervisor script is killing itself :-)15:29
marsblah15:29
allenapmaxb: Is it?15:30
maxbI tried typing out a few key lines in an interactive python15:30
maxband that's what it seems to indicate15:31
maxbproc = Popen('sleep 3600', stdin=PIPE, stdout=PIPE, stderr=STDOUT, shell=True)15:31
maxbos.killpg(os.getpgid(proc.pid), 9)15:31
allenapmaxb: But isn't it doing os.killpg(proc.pid)?15:33
allenapmaxb: Scratch that.15:33
allenapDoh.15:33
allenapmaxb: There's a comment in killem saying "Note that bin/test sets its process to a process group leader".15:34
marsallenap, os.killpg(os.getpgid(pid), signal)15:34
maxballenap: oh, ok15:34
maxbhrm15:34
allenapmaxb: The process tree bears that out I think.15:34
allenapmars, maxb: The Popen call has shell=True; killem() is killing the shell.15:38
* allenap has to restart router15:39
=== deryck is now known as deryck[lunch]
marsallenap, back?15:52
marsnope, still away15:52
=== barry_ is now known as barry
allenapmars: Hi. If you said anything between my last message and "allenap, back?" then I missed it.15:59
mars:)15:59
marsnope!15:59
marsallenap, thought of something15:59
marspossible cause then15:59
marsso normally, killing the shell would take everything with it15:59
marsbut I have observed that Python can get deadlocked(?) on the windmill process.  Python won't respond to a SIGTERM even.  You have to SIGHUP it.16:00
marsSo, if we have the windmill/Python deadlock, the suite hangs.  test_on_merge says "Oops, better kill it!", and kill the shell.  But Python ignores the shell's SIGTERM, and keeps running, along witheverything in the tree under it.16:01
marsallenap, maxb ^ does that make sense?16:02
allenapmars: That makes a lot of sense. So, perhaps send HUP or QUIT before KILL.16:03
allenapmars: Alternatively, get the Popen call to work without shell=True16:04
marsallenap, should HUP walk the process tree maybe?  This code assume that killing the shell is enough.  Obvious that is not a thorough approach.16:06
allenapmars: Or make cmdline = "exec " + cmdline16:06
allenapmars: Maybe it should, but see if this works first.16:06
marsexec would terminate the test_on_merge code path, wouldn't it?16:07
allenapmars: No, it would mean that the test process replaces the shell used to invoke it, so that killpg kills the test process group.16:09
marsoh!16:09
marsyes16:09
marsallenap, awesome, thank you16:09
allenapmars: But, I'm not sure the shell=True bit is necessary anyway.16:09
marsallenap, well, you were wondering if the entire select() loop was needed instead of just using .communicate(), so I can't say if any of this code should stay.16:10
marsallenap, they may have used their own select() loop to save .communicate() from buffering too much output.16:11
marsthere is a warning in the module docs about that16:11
allenapmars: No, you definitely shouldn't use communicate(); but Popen._communicate() does have an implementation of a similar select() loop that I thought was worth studying. For example, it has an exception handler around select() to catch EINTR. Actually, other than that it's very similar.16:13
allenapmars: In any case, it doesn't look like the problem lies in that direction anyway.16:13
marsright16:13
marswell, this is one huge step closer to getting things working again16:14
marsmaxb, allenap, thank you for all the help!16:19
allenapmars: You're welcome. I hope it works now :)16:19
=== matsubara is now known as matsubara-lunch
=== deryck[lunch] is now known as deryck
=== gary_poster is now known as gary-lunch
=== beuno is now known as beuno-lunch
kfogelmrevell: I think I'd like to actually deactivate the launchpad-doc@ mailing list entirely.  Its original purpose is now obsolete.  Any objections?17:12
marsleonardr, something you may find interesting: http://factoryjoe.com/blog/2010/05/16/combing-openid-and-oauth-with-openid-connect/17:13
mrevellkfogel, None at all17:13
kfogelmrevell: thanks.  Also, should I update https://help.launchpad.net/DocTeam to say it's obsolete?17:13
mrevellkfogel, Please, if you don't mind.17:13
kfogelmrevell: no problem.17:13
=== abentley is now known as abentley-lunch
=== matsubara-lunch is now known as matsubara
mrevellNight all18:04
=== gary-lunch is now known as gary_poster
=== beuno-lunch is now known as beuno
=== EdwinGrubbs_ is now known as foo1000
=== abentley-lunch is now known as abentley
=== matsubara is now known as matsubara-afk
sinzuiOMG. I made staging much faster that edge.21:30
cody-somervillehow?21:38
thumpersinzui: how??21:50
sinzuithumper, I used memcached tales directives on milestone and portlets. I may have fixed another issue doing this, https://staging.launchpad.net/libpng/main/+index loads faster on staging then edge for me21:57
thumpersinzui: please write it up for the list - I'm not sure how to use the memcached tales directives21:58
thumpersinzui: oh you did already21:59
sinzuithumper, I will if my branch lands. I am sure engineers will love all the broken browser tests that caching created21:59
sinzuifor me21:59
marsgary_poster, ^21:59
gary_postersinzui, yay! :-)22:00
gary_postersinzui, that was for memcached22:00
sinzuithumper, I have not. written up what I did, yet, I just submitted the review. I want to cache some of our tales formatters if it proves to be faster than db lookups22:00
sinzuigary_poster :)22:00
marssinzui, I wonder how many of those requests are for breadcrumbs...22:02
marssinzui, probably not as big a payoff.  What you found is huge.22:02
sinzuimars, indeed that too crossed my mind. the header of pages can be cached. project/person displayname changes are rare.22:02
lifelessgary_poster: leonardr: ping - mod_compress23:17
gary_posterlifeless: pong, hi23:17
lifelessI'm a little surprised that you didn't just fix apacge23:17
lifelessblah, apache23:17
lifelessI wanted to check my facts at the source :)23:17
leonardrlifeless: nothing we would consider to be 'fixing apache' is acceptable to apache upstream23:19
gary_posterlifeless: :-) I'd be +1 on fixing it in apache, but from what I understood of Roy Fielding's response to the bug, the proper fix is a new filter.23:19
lifelessleonardr: oh! thats surprising23:19
leonardrsee https://issues.apache.org/bugzilla/show_bug.cgi?id=39727#c31 for example23:19
gary_posterleonardr: couldn't we have made a new filter?  Or did I misunderstand Roy Fielding's suggestion?23:20
lifelessroy specifically says23:20
lifelessIf mod_deflate modifies23:20
lifelessETag on the way out, then its corresponding later requests must23:20
lifelessbe reverse-modified (etags and request content) on the way back.23:20
lifelesswhich is completely consistent with my view, and the source of the issue [that mod_compress or whatever we're using *is violating* that MUST]23:21
gary_posterAh, I was misrembering just a bit.  This was the line I was trying to remember:23:21
gary_posterThe best solution is to implement transfer-encoding as an23:21
gary_posterhttp protocol filter module.23:21
lifelesswell, thats TE23:22
leonardryeah23:22
lifelesswhich is different23:22
gary_posteryeah23:22
leonardrthat's what we tried to do, but the intermediaries stripped it23:22
gary_posterthat was the misremembering part23:22
leonardrroy also rejects the solution i thought would work:23:22
leonardrPreprocessing all incoming conditional headers to remove23:22
leonardra -gzip suffix before the request is processed won't work.23:22
leonardrIn a chain of Apache servers, we won't know which server23:22
leonardrset the suffix and how many caches have stored the modified23:22
leonardrETag versus the unmodified ETag.23:22
lifelessso23:25
lifelessa latter comment addreses that, though you have to be prescient to parse it23:25
lifeless(have each server uniquely add its suffix, and have the sysadmins be responsible for ensuring a matching back-path)23:26
lifelessI think that Apache would accept a patch which strips the -gzip, when an option is set.23:26
lifelessthere are lots of special-case vs general case situations in surrogates vs http as a whole23:27
leonardrthere is already a patch that strips the -gzip always23:27
lifelessok23:27
lifelessI suggest we: get that applied to our apaches23:27
lifeless say in the review for that patch that we need it, and discuss whats required to get it in mainline23:27
leonardrok, what's the process for patching our apache?23:28
gary_posterso, lifeless, I have to go half an hour ago, but is this the position:23:29
lifelessalso, in #squiddev I'm asking hno what he thinks the situation is23:29
gary_poster- the gzip suffix could eventually be customized23:29
lifeless[henrik nordstrom from thast bug report]23:29
gary_poster(per server)23:29
gary_poster- at that point the underlying concerns could be addressed, because that specific server's suffix could be targeted23:30
lifelessleonardr: in a chroot/vm of hardy which is what we have deployed, do an apt-get source apache2, apply the patch using whatever patch system its building with, and make sure it builds, then file an RT ticket, and include the debdiff23:31
gary_poster- meanwhile we have a hard-coded suffix.  We could have a flag to remove the suffix, whatever it is; at this point it is particularly easy, because it is hardcoded23:31
lifelessgary_poster: precisely. making it customised might be a way to get apache upstream to let useful code into their code base23:31
gary_posterlifeless: ok, thank you for the clarification.  I'm happy with that if the LOSAs are happy with that (as I expect they would be).23:32
gary_posterthank you23:32
gary_posterneed to run23:32
lifelessU1 has patched their apache23:32
lifelesswith a different patch, but similar sort of situation [though theirs was a simple backport]23:32
lifelessgary_poster: ciao23:33
leonardrimplementing this is outside my area of competence, but if you and the losas are happy with changing apache to handle this problem, that's good news for me23:33
lifelesslosa: ^ cross-check please23:33
lifelessthumper: I'm curious (as a test framework writer) what prompted lp:~thumper/launchpad/fix-factory-ids-in-tests23:39
mwhudsonlifeless: a test failed in launchpad because it depended on exact values returned by the factory and miscellaneous refactoring changed that23:55
mwhudsonso i changed the factory to return different values and ran the entire test suite and filed a bug with the failure23:55
mwhudsons23:55
lifelessah23:56
lifelessI guess I meant23:56
lifeless'why change from using the unique stuff'23:57
lifelessnot 'why did some tests fail'23:57
lifelessmwhudson: ^23:59
mwhudsoni haven't looked at the branch itself23:59
elmoleonardr/lifeless: I really don't want to carry a patch to apache forever; I accepted the U1 patch because it's an upstream patch23:59

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!