[02:28]  * thumper is back
[06:25]  * thumper EODs
[08:32] <adeuring> good morning
[09:02] <mrevell> Good morning
[10:28] <thumper> jelmer: hey
[10:28] <jelmer_> thumper: Hi!
[10:28] <thumper> jelmer_: what's the status of bzr-git and dulwich?
[10:29] <thumper> jelmer_: I'd like to rollout updates
[10:33] <jelmer_> thumper: I've fixed the issue with http last week and we now import HEAD branches again, we should be able to rollout current dulwich and bzr-git's roundtrip branch.
[10:33] <thumper> so tips of both?
[10:33] <jelmer_> thumper: (did I mention 'bzr serve --git' works in a native bzr branch now?)
[10:33] <thumper> jelmer_: I think I saw your 'dent about it
[10:34] <thumper> jelmer_: how efficient is it?
[10:34] <jelmer_> thumper: It's very quick on small (small as in inventory) branches, very slow on large inventory branches
[10:34] <thumper> jelmer_: do you know why it is slow?
[10:35] <thumper> jelmer_: fixable?
[10:35] <jelmer_> thumper: Creating the Tree objects from inventories on the fly is very slow and O(size-of-tree)
[10:35]  * thumper nods
[10:35] <jelmer_> thumper: We need to cache those Tree objects, should be able to do so once jam's work on packs lands.
[10:36] <jelmer_> thumper: I'm merging the roundtrip branch into trunk atm with roundtripping itself disabled for the moment.
[10:37] <jelmer_> (it's got a lot of other improvements but I don't want to be tied to the current syntax for roundtripped data)
[10:37] <thumper> ok
[10:37] <thumper> jelmer_: just send me an email when it's ready with revnos for lp:dulwich and lp:bzr-git
[10:37] <thumper> jelmer_: and I'll get them updated
[10:41] <jelmer_> thumper: will do
[10:48] <thumper> jelmer_: thanks
[11:06] <deryck> Morning, all.
[11:07] <wgrant> jelmer_: bzr serve --git crashes for me :(
[11:07] <wgrant> AttributeError: 'BzrBackend' object has no attribute 'object_store'
[11:07] <wgrant> bzr-git/dulwich 0.5.0.
[11:16] <bigjools> morning deryck
[11:21] <jelmer_> wgrant: you need a newer dulwich, bzr-git from lp:~jelmer/bzr-git/roundtrip
[14:15] <kfogel> mrevell: it looks like RT 39801 has been done, meaning that our help and dev wikis are now editable only by members of ~launchpad-doc (for spam protection).  I'm planning to a) announce this in the appropriate places, b) document it likewise, and c) approve the currently pending team members at https://edge.launchpad.net/~launchpad-doc/+members#active.  Thoughts?
[14:17] <mrevell> kfogel, Thanks for handling that. Your plan looks ideal to me. I have two questions: do you think it's worthwhile contacting each existing member of the team directly to explain the new meaning of their membership? Also, do you think there's now a role for the team mailing list? I think "Yes" to the first and I'm not sure about the second.
[14:19] <kfogel> mrevell: I agree.  "Yes" to the first, but for the second, let's just direct people at #launchpad-dev and the other usual places if they have questions.
[14:20] <kfogel> mrevell: oh, looks like I'll need to be a team administrator.  Really, anyone in ~launchpad could be an admin.  Can we do that?
[14:22] <mrevell> kfogel, ~launchpad is now an admin of that team
[14:23] <kfogel> mrevell: fast action, sir.
[14:24] <mrevell> kfogel, :) As for contacting members directly, to tell them of the team's altered role, I don't think everyone in the team is a member of the ML
[14:24] <mrevell> kfogel, but the membership is fairly small so we can easily contact them directly
[14:25] <kfogel> mrevell: I was just going to mail them individually.
[14:25] <mrevell> cool
[14:42] <mars> jml or allenap, ping, would either of you be available to help debug a possible hang in some Unix IPC code using the subprocess module?  I have a few questions about a 40 LoC block I'm studying.
[14:43] <allenap> mars: I'll have a look.
[14:43] <mars> thanks allenap, I'm pasting it now
[14:44] <mars> allenap, http://pastebin.ubuntu.com/434980/
[14:46] <maxb> line 13, comment is lying
[14:47] <maxb> that's where you end up if the TIMEOUT runs out
[14:47] <mars> maxb, ok, that's why I asked other people :)
[14:48] <mars> maxb, so STDOUT could still be open, but the select() has timed out?
[14:48] <maxb> yes
[14:48] <mars> ok
[14:48] <mars> allenap, ^
[14:48]  * mars changes the comment
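A minimal sketch of the distinction maxb is drawing, in plain Python (the timeout value and child command are made up, not the actual test_on_merge.py code): an empty result from select() only means the child was quiet for the whole timeout, while a readable descriptor that yields b"" from os.read() means genuine EOF.

```python
import os
import select
import subprocess

TIMEOUT = 0.5  # seconds of silence before we suspect a hang (hypothetical)

# A child that stays quiet longer than the timeout.
proc = subprocess.Popen(["sleep", "2"], stdout=subprocess.PIPE)

ready, _, _ = select.select([proc.stdout], [], [], TIMEOUT)
if not ready:
    # select() timed out: no output for TIMEOUT seconds, but stdout is
    # still open -- the child may be hung, or just quiet.
    status = "timeout"
else:
    # Readable: either real output, or EOF if os.read() returns b"".
    data = os.read(proc.stdout.fileno(), 1024)
    status = "eof" if data == b"" else "output"

proc.kill()
proc.wait()
```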
[14:49]  * maxb wonders why the code uses os.read(blah.fileno())
[14:49] <allenap> mars: Yes, I noticed that. What were your questions?
[14:49] <allenap> mars: Is it working?
[14:50] <mars> maxb, allenap, fwiw, I'm trying to figure out why the test suite is hanging with the ec2 testrunner.  This code should kill a hung test suite, but it might not be doing that.
[14:50] <mars> allenap, I'm wondering if something in this timeout code is buggy in such a way that the test suite could hang, but this code doesn't catch and kill it.
[14:51] <mars> allenap, my XXX comments are where I would start.  They raise basic Unix programming questions that I have but cannot answer.
[14:52] <mars> Without reading "Advanced Unix Programming".  I need a fix a bit faster than that.
[14:53] <mars> hmm
[14:54] <mars> maxb, if what you say is true...   During a hang, the test suite has stopped printing output.  That means the select() is timing out, right?
[14:54] <maxb> yes
[14:55] <mars> so that narrows the bug down to that branch of the conditional.
[14:56] <mars> so the hang is in this code path, or this code is running correctly, and the fault happens after this entire script has exited (test_on_merge.py exits correctly, but fails to send mail or something).
[14:57] <mars> maxb, allenap, is the XXX on line 17 a valid concern?
[14:57] <allenap> maxb: subprocess.Popen._communicate() does the fileno() thing. mars: Might be worth looking there and copy that select() loop as closely as possible.
[14:58] <mars> allenap, yeah, I was wondering why they didn't use .communicate().
[14:58] <allenap> mars: If proc.poll() is not None then the process has definitely terminated.
[14:59] <allenap> mars: .communicate() only returns the result at the end. Don't want to wait that long while running the test suite :)
[14:59] <maxb> Line 17 could be hit if the test process exited but a subprocess spawned by the test process still retained the open stdout file descriptor
[15:00] <mars> allenap, ah, "is not None" means it's absolutely dead.  Ok.
[15:00] <mars> I just re-read the docs, you are correct.
[15:00] <mars> maxb, oh, that is something
[15:00] <mars> it *does* do that
[15:01] <mars> the process tree on a hung server has a few defunct processes
[15:01] <mars> well, maybe
[15:03] <allenap> mars: Yeah, line 28 will kill the child, but if its children are still alive and holding a reference to proc.stdout then line 31 will hang.
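The scenario allenap describes can be reproduced in a few lines (a sketch, not the real code): the immediate child exits, but a backgrounded grandchild inherited the write end of the stdout pipe, so a read on it stays blocked until the grandchild goes away.

```python
import subprocess
import time

# The shell exits immediately, but the backgrounded sleep inherits the
# write end of the stdout pipe and holds it open after the shell dies.
proc = subprocess.Popen("sleep 1 &", shell=True, stdout=subprocess.PIPE)
proc.wait()  # the shell itself is already gone

start = time.time()
data = proc.stdout.read()   # blocks until the *grandchild* closes the pipe
elapsed = time.time() - start
```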
[15:03] <mars> here, more info
[15:04] <mars> maxb, allenap, process tree from a hung server.  Everything is still alive!  http://pastebin.ubuntu.com/434986/
[15:04] <mars> the code never got to the killem() line :(
[15:05] <allenap> mars: The firefox has not been collected by the parent :-/
[15:05] <mars> kill_hung_process_with_a_series_of_brutish_instruments() was never called.  It will eventually SIGKILL the process in question.
[15:06] <mars> allenap, yes, on my local system I had a sort of the same thing.  I had to send SIGHUP to Python itself in order for it to get collected.
[15:06] <mars> kill -1 python-parent-process-id
[15:06] <sinzui> mrevell, which project did you have the milestone problem with?
[15:08] <mrevell> sinzui, launchpad, malone, rosetta, launchpad-registry and launchpad-foundations, so far
[15:08] <sinzui> always 10.10?
[15:08] <sinzui> ie 10.11 is always fine
[15:08] <mars> allenap, firefox was started by the windmill testrunner.  I don't see it in the tree.  That means that it died or was killed.
[15:09] <mrevell> sinzui, Yeah. Always 10.10. Everything else from 10.06 to 10.12 have been fine.
[15:10] <sinzui> mrevell, I see them, how did you create 10.10 milestones?
[15:10] <mars> allenap, maxb, that would leave the zombie processes.  So would the process tree I posted somehow lead the timeout code to hang?
[15:10] <mrevell> sinzui, I refreshed the page so that the error message was replaced by a link to a new 10.1 milestone (which I hadn't created but appeared instead of the 10.10 milestone), then went in and changed the name of the 10.1 milestone to 10.10
[15:11] <sinzui> okay thanks.
[15:12] <maxb> What is the process actually being Popen-ed here?
[15:13] <mars> maxb, line 1 of http://pastebin.ubuntu.com/434986/.
[15:15] <maxb> Did the [[[print ("\nA test appears to be hung. There has been no output for"]]] actually occur?
[15:15] <mars> allenap, maxb, here is the current code, uncommented: http://pastebin.ubuntu.com/434992/.  This makes no sense: lines 19 and 23 must have run.  The processes are still alive!
[15:15] <maxb> And just what is kill_hung_process_with_a_series_of_brutish_instruments?
[15:16] <mars> maxb, that function is a rewrite I did of lines 19 through 23 of http://pastebin.ubuntu.com/434992/
[15:17] <mars> maxb, just for clarity.  The code I just pasted is what is actually run by the server, but it was too dense for me to understand.  So I rewrote and commented it before posting.
[15:17] <maxb> What is killem?
[15:17] <mars> hmm
[15:17] <maxb> It would be interesting to add some logging to see what PIDs it's *actually* sending signals to
[15:18] <mars> if we had console output for the log to write to :/
[15:18] <mars> or python standard logging installed and running in this script...
[15:19] <mars> maxb, I can add logging if that would help.
[15:19] <maxb> Well, I'm a bit baffled, so I'm clutching on to the fact that if the process is still running, a SIGKILL can't have really happened.
[15:20] <mars> maxb, here is the killem() function: http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/annotate/head:/test_on_merge.py#L189
[15:20] <mars> maxb, allenap, killem() runs os.killpg(), not os.kill().  Does that matter?
[15:20] <mars> (I suspect it might?)
[15:22] <mars> hmmm
[15:22] <mars> column 3 of http://pastebin.ubuntu.com/434986/ is the process group ID
[15:22] <mars> oh
[15:22] <mars> line 6
[15:24] <mars> firefox <defunct> is part of a different process group.  But that doesn't make sense.  Could the windmill testrunner have been the target of the process group kill?
[15:26] <mars> maxb, I think you are right.  The best next step is probably to add some logging to the code to see what it is killing, and why.
[15:26] <maxb> Clearly we have an issue if the kill code is expecting the entire tree to be just one process group, but it isn't
[15:27] <allenap> mars: Where does the select() loop run? Is it in the test runner? If so, then it's running in pid 15177 and 20962.
[15:28] <allenap> mars: Ah, it's in test_on_merge.py
[15:28] <mars> allenap, sorry, I don't understand?  the select() loop is in test_on_merge.py, which... hey, if test_on_merge.py is still running, shouldn't it be in the process tree as well?
[15:29] <mars> allenap, the first column of http://pastebin.ubuntu.com/434986/ is the PPID.  Notice that it is '1' for a few of those?
[15:29] <maxb> hahahahahaha
[15:29] <allenap> mars: Yes, it should!
[15:29] <maxb> By killing the process group in this way, the supervisor script is killing itself :-)
[15:29] <mars> blah
[15:30] <allenap> maxb: Is it?
[15:30] <maxb> I tried typing out a few key lines in an interactive python
[15:31] <maxb> and that's what it seems to indicate
[15:31] <maxb> proc = Popen('sleep 3600', stdin=PIPE, stdout=PIPE, stderr=STDOUT, shell=True)
[15:31] <maxb> os.killpg(os.getpgid(proc.pid), 9)
[15:33] <allenap> maxb: But isn't it doing os.killpg(proc.pid)?
[15:33] <allenap> maxb: Scratch that.
[15:33] <allenap> Doh.
[15:34] <allenap> maxb: There's a comment in killem saying "Note that bin/test sets its process to a process group leader".
[15:34] <mars> allenap, os.killpg(os.getpgid(pid), signal)
[15:34] <maxb> allenap: oh, ok
[15:34] <maxb> hrm
[15:34] <allenap> maxb: The process tree bears that out I think.
[15:38] <allenap> mars, maxb: The Popen call has shell=True; killem() is killing the shell.
[15:39]  * allenap has to restart router
[15:52] <mars> allenap, back?
[15:52] <mars> nope, still away
[15:59] <allenap> mars: Hi. If you said anything between my last message and "allenap, back?" then I missed it.
[15:59] <mars> :)
[15:59] <mars> nope!
[15:59] <mars> allenap, thought of something
[15:59] <mars> possible cause then
[15:59] <mars> so normally, killing the shell would take everything with it
[16:00] <mars> but I have observed that Python can get deadlocked(?) on the windmill process.  Python won't respond to a SIGTERM even.  You have to SIGHUP it.
[16:01] <mars> So, if we have the windmill/Python deadlock, the suite hangs.  test_on_merge says "Oops, better kill it!", and kills the shell.  But Python ignores the shell's SIGTERM, and keeps running, along with everything in the tree under it.
[16:02] <mars> allenap, maxb ^ does that make sense?
[16:03] <allenap> mars: That makes a lot of sense. So, perhaps send HUP or QUIT before KILL.
[16:04] <allenap> mars: Alternatively, get the Popen call to work without shell=True
[16:06] <mars> allenap, should HUP walk the process tree maybe?  This code assumes that killing the shell is enough.  Obviously that is not a thorough approach.
[16:06] <allenap> mars: Or make cmdline = "exec " + cmdline
[16:06] <allenap> mars: Maybe it should, but see if this works first.
[16:07] <mars> exec would terminate the test_on_merge code path, wouldn't it?
[16:09] <allenap> mars: No, it would mean that the test process replaces the shell used to invoke it, so that killpg kills the test process group.
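A sketch of the exec trick (the command line is a hypothetical stand-in, and start_new_session stands in for bin/test making itself a process group leader): with "exec " prefixed, the command replaces /bin/sh, so proc.pid names the real process and killpg reaches the right group.

```python
import os
import signal
import subprocess

cmdline = "sleep 60"  # hypothetical stand-in for the real test command

# "exec" makes the command replace /bin/sh, so proc.pid is the command
# itself rather than an intermediate shell.  start_new_session puts it
# in its own process group, much as bin/test does for itself.
proc = subprocess.Popen("exec " + cmdline, shell=True,
                        start_new_session=True)

# killpg now signals the command's own group, not a shell wrapper.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
proc.wait()
```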
[16:09] <mars> oh!
[16:09] <mars> yes
[16:09] <mars> allenap, awesome, thank you
[16:09] <allenap> mars: But, I'm not sure the shell=True bit is necessary anyway.
[16:10] <mars> allenap, well, you were wondering if the entire select() loop was needed instead of just using .communicate(), so I can't say if any of this code should stay.
[16:11] <mars> allenap, they may have used their own select() loop to save .communicate() from buffering too much output.
[16:11] <mars> there is a warning in the module docs about that
[16:13] <allenap> mars: No, you definitely shouldn't use communicate(); but Popen._communicate() does have an implementation of a similar select() loop that I thought was worth studying. For example, it has an exception handler around select() to catch EINTR. Actually, other than that it's very similar.
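For reference, a sketch of the EINTR-safe select() wrapper allenap is describing (on Python 3.5+, PEP 475 makes the interpreter retry interrupted syscalls automatically, so this mainly matters on older interpreters):

```python
import errno
import os
import select

def select_eintr(rlist, timeout):
    """select() that retries when interrupted by a signal (EINTR),
    in the spirit of the loop in Popen._communicate()."""
    while True:
        try:
            return select.select(rlist, [], [], timeout)
        except OSError as e:
            if e.errno == errno.EINTR:
                continue
            raise

# Tiny demo: a pipe with a byte already written is immediately readable.
r, w = os.pipe()
os.write(w, b"x")
ready, _, _ = select_eintr([r], 1.0)
```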
[16:13] <allenap> mars: In any case, it doesn't look like the problem lies in that direction anyway.
[16:13] <mars> right
[16:14] <mars> well, this is one huge step closer to getting things working again
[16:19] <mars> maxb, allenap, thank you for all the help!
[16:19] <allenap> mars: You're welcome. I hope it works now :)
[17:12] <kfogel> mrevell: I think I'd like to actually deactivate the launchpad-doc@ mailing list entirely.  Its original purpose is now obsolete.  Any objections?
[17:13] <mars> leonardr, something you may find interesting: http://factoryjoe.com/blog/2010/05/16/combing-openid-and-oauth-with-openid-connect/
[17:13] <mrevell> kfogel, None at all
[17:13] <kfogel> mrevell: thanks.  Also, should I update https://help.launchpad.net/DocTeam to say it's obsolete?
[17:13] <mrevell> kfogel, Please, if you don't mind.
[17:13] <kfogel> mrevell: no problem.
[18:04] <mrevell> Night all
[21:30] <sinzui> OMG. I made staging much faster than edge.
[21:38] <cody-somerville> how?
[21:50] <thumper> sinzui: how??
[21:57] <sinzui> thumper, I used memcached tales directives on milestone and portlets. I may have fixed another issue doing this: https://staging.launchpad.net/libpng/main/+index loads faster on staging than edge for me
[21:58] <thumper> sinzui: please write it up for the list - I'm not sure how to use the memcached tales directives
[21:59] <thumper> sinzui: oh you did already
[21:59] <sinzui> thumper, I will if my branch lands. I am sure engineers will love all the broken browser tests that caching created for me
[21:59] <mars> gary_poster, ^
[22:00] <gary_poster> sinzui, yay! :-)
[22:00] <gary_poster> sinzui, that was for memcached
[22:00] <sinzui> thumper, I have not written up what I did yet; I just submitted the review. I want to cache some of our tales formatters if it proves to be faster than db lookups
[22:00] <sinzui> gary_poster :)
[22:02] <mars> sinzui, I wonder how many of those requests are for breadcrumbs...
[22:02] <mars> sinzui, probably not as big a payoff.  What you found is huge.
[22:02] <sinzui> mars, indeed that too crossed my mind. the header of pages can be cached. project/person displayname changes are rare.
[23:17] <lifeless> gary_poster: leonardr: ping - mod_compress
[23:17] <gary_poster> lifeless: pong, hi
[23:17] <lifeless> I'm a little surprised that you didn't just fix apache
[23:17] <lifeless> I wanted to check my facts at the source :)
[23:19] <leonardr> lifeless: nothing we would consider to be 'fixing apache' is acceptable to apache upstream
[23:19] <gary_poster> lifeless: :-) I'd be +1 on fixing it in apache, but from what I understood of Roy Fielding's response to the bug, the proper fix is a new filter.
[23:19] <lifeless> leonardr: oh! that's surprising
[23:19] <leonardr> see https://issues.apache.org/bugzilla/show_bug.cgi?id=39727#c31 for example
[23:20] <gary_poster> leonardr: couldn't we have made a new filter?  Or did I misunderstand Roy Fielding's suggestion?
[23:20] <lifeless> roy specifically says
[23:20] <lifeless> If mod_deflate modifies
[23:20] <lifeless> ETag on the way out, then its corresponding later requests must
[23:20] <lifeless> be reverse-modified (etags and request content) on the way back.
[23:21] <lifeless> which is completely consistent with my view, and the source of the issue [that mod_compress or whatever we're using *is violating* that MUST]
[23:21] <gary_poster> Ah, I was misremembering just a bit.  This was the line I was trying to remember:
[23:21] <gary_poster> The best solution is to implement transfer-encoding as an
[23:21] <gary_poster> http protocol filter module.
[23:22] <lifeless> well, thats TE
[23:22] <leonardr> yeah
[23:22] <lifeless> which is different
[23:22] <gary_poster> yeah
[23:22] <leonardr> that's what we tried to do, but the intermediaries stripped it
[23:22] <gary_poster> that was the misremembering part
[23:22] <leonardr> roy also rejects the solution i thought would work:
[23:22] <leonardr> Preprocessing all incoming conditional headers to remove
[23:22] <leonardr> a -gzip suffix before the request is processed won't work.
[23:22] <leonardr> In a chain of Apache servers, we won't know which server
[23:22] <leonardr> set the suffix and how many caches have stored the modified
[23:22] <leonardr> ETag versus the unmodified ETag.
[23:25] <lifeless> so
[23:25] <lifeless> a later comment addresses that, though you have to be prescient to parse it
[23:26] <lifeless> (have each server uniquely add its suffix, and have the sysadmins be responsible for ensuring a matching back-path)
[23:26] <lifeless> I think that Apache would accept a patch which strips the -gzip, when an option is set.
[23:27] <lifeless> there are lots of special-case vs general case situations in surrogates vs http as a whole
[23:27] <leonardr> there is already a patch that strips the -gzip always
[23:27] <lifeless> ok
[23:27] <lifeless> I suggest we: get that applied to our apaches
[23:27] <lifeless>  say in the review for that patch that we need it, and discuss whats required to get it in mainline
[23:28] <leonardr> ok, what's the process for patching our apache?
[23:29] <gary_poster> so, lifeless, I have to go half an hour ago, but is this the position:
[23:29] <lifeless> also, in #squiddev I'm asking hno what he thinks the situation is
[23:29] <gary_poster> - the gzip suffix could eventually be customized
[23:29] <lifeless> [henrik nordstrom from that bug report]
[23:29] <gary_poster> (per server)
[23:30] <gary_poster> - at that point the underlying concerns could be addressed, because that specific server's suffix could be targeted
[23:31] <lifeless> leonardr: in a chroot/vm of hardy which is what we have deployed, do an apt-get source apache2, apply the patch using whatever patch system its building with, and make sure it builds, then file an RT ticket, and include the debdiff
[23:31] <gary_poster> - meanwhile we have a hard-coded suffix.  We could have a flag to remove the suffix, whatever it is; at this point it is particularly easy, because it is hardcoded
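A hypothetical illustration of the suffix-stripping idea (not the actual Apache patch): mod_deflate rewrites an origin ETag such as "abc123" to "abc123-gzip", so a conditional request coming back through the chain carries a tag the origin no longer recognises unless something strips the suffix first.

```python
# Hypothetical sketch only -- not the patch under discussion.
SUFFIX = "-gzip"

def strip_gzip_suffix(etag):
    """Strip a trailing -gzip from a quoted ETag value, if present."""
    if etag.endswith(SUFFIX + '"'):
        return etag[:-len(SUFFIX) - 1] + '"'
    return etag
```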
[23:31] <lifeless> gary_poster: precisely. making it customised might be a way to get apache upstream to let useful code into their code base
[23:32] <gary_poster> lifeless: ok, thank you for the clarification.  I'm happy with that if the LOSAs are happy with that (as I expect they would be).
[23:32] <gary_poster> thank you
[23:32] <gary_poster> need to run
[23:32] <lifeless> U1 has patched their apache
[23:32] <lifeless> with a different patch, but similar sort of situation [though theirs was a simple backport]
[23:33] <lifeless> gary_poster: ciao
[23:33] <leonardr> implementing this is outside my area of competence, but if you and the losas are happy with changing apache to handle this problem, that's good news for me
[23:33] <lifeless> losa: ^ cross-check please
[23:39] <lifeless> thumper: I'm curious (as a test framework writer) what prompted lp:~thumper/launchpad/fix-factory-ids-in-tests
[23:55] <mwhudson> lifeless: a test failed in launchpad because it depended on exact values returned by the factory and miscellaneous refactoring changed that
[23:55] <mwhudson> so i changed the factory to return different values and ran the entire test suite and filed a bug with the failures
[23:56] <lifeless> ah
[23:56] <lifeless> I guess I meant
[23:57] <lifeless> 'why change from using the unique stuff'
[23:57] <lifeless> not 'why did some tests fail'
[23:59] <lifeless> mwhudson: ^
[23:59] <mwhudson> i haven't looked at the branch itself
[23:59] <elmo> leonardr/lifeless: I really don't want to carry a patch to apache forever; I accepted the U1 patch because it's an upstream patch