[06:55] <jam> mmrazik: feel free to ping me when you come back online
[06:56] <jam> I'm guessing paramiko + tarmac is actually trying to prompt for a password/etc which is why it is hanging. (Since you probably have credentials set for openssh that paramiko might not know about.)
[06:56] <jam> we can debug that if we want, but given it looks like the connection is being retried and is *still* failing, I think we need to dig deeper.
[06:57] <jam> So I'd like to debug it in a bit more hands-on way.
[07:10] <mmrazik> jam: ok. Let me try to re-create the setup. Will take me a few minutes.
[08:01] <mmrazik> jam: so this is where I am right now: http://pastebin.ubuntu.com/1190380/
[08:01] <mmrazik> jam: but I'll be on a phone for a while now
[08:02] <mgz> morning!
[08:03] <vila> hi mgz
[08:07] <mgz> hey vila
[08:16] <jam> mmrazik: k, I'll be doing lunch and digging into some lp stuff for a bit, but I'll try to be responsive when you get back.
[08:16] <mmrazik> jam: is there something I should try now?
[08:16] <mmrazik> I'm a bit stuck TBH
[08:17] <jam> mmrazik: do you know what version of tarmac you are running (just to try to set things up similarly here)
[08:35] <jam> mmrazik: line numbers in your traceback don't quite match up to tarmac trunk, but you might be able to do something like: http://paste.ubuntu.com/1190405/
[08:36] <jam> ah, you might need to do both branches, 1 sec
[08:39] <jam> http://paste.ubuntu.com/1190407/
[08:39] <jam> mmrazik: ^^ should re-open both branches, creating new connections. at least as a stop-gap. I'd like to fix bzrlib, though, if you don't mind helping me investigate.
[08:43] <mmrazik> jam: I'm on it now
[08:43] <mmrazik> jam: regarding tarmac -- it's unfortunately a custom tarmac extension I didn't even write
[08:43] <mmrazik> let me check if it is somewhere on bzr
[08:43] <mmrazik> but the setup is fairly complex and requires jenkins
[08:44] <mmrazik> the extension is some jenkins pre-commit logic and that is also why it fails. It waits for the jenkins job to finish and only then commits.
[08:44] <mmrazik> jam: I think the easiest way to reproduce will be to create some custom "sleep 420" pre-commit hook
[08:45] <jam> mmrazik: is this the same one that sidnei was looking at recently?
[08:45] <jam> (not sure which team you're on)
[08:45] <mmrazik> jam: I don't know but we are different teams (and this one was written by yet another team)
[08:45] <mmrazik> for me this tarmac stuff is almost end of life and I want to get rid of it
[08:46] <mmrazik> it's just some legacy I had to maintain
[08:46] <jam> mmrazik: what are you switching to?
[08:46] <mmrazik> jam: more jenkins driven approach. where the logic is in jenkins.
[08:46] <mmrazik> it also scales better because jenkins can schedule build slaves
[08:46] <mmrazik> right now tarmac must be running on the same node where the jenkins job runs
[08:47] <mmrazik> anyway... going to patch tarmac with the patch you provided
[08:49] <mmrazik> patched/running
[08:50] <mmrazik> jam: I believe it is this one: https://code.launchpad.net/~didrocks/tarmac/tarmac-jenkins
[08:50] <mmrazik> but as I said there should be a simpler way to reproduce
[08:57] <jam> mmrazik: seeing if I can reproduce it trivially.
[09:00] <mmrazik> jam: the tarmac patch you provided didn't help :-/
[09:01] <mmrazik> http://pastebin.ubuntu.com/1190430/
[09:02] <mmrazik> AFAICT it now fails in the "source.bzr_branch = source.bzr_branch.bzrdir.open_branch()" line, which I just added
[09:08] <jam> mmrazi|otp: http://paste.ubuntu.com/1190441/
[09:08] <jam> is another patch you can try when you get back.
[09:09] <jam> mgz: poke
[09:32] <mmrazi|otp> jam: running it
[10:43] <mmrazik> jam: still no luck :-/ http://pastebin.ubuntu.com/1190559/
[10:43] <jam> mmrazik: the traceback shows it isn't the new code: source.bzr_branch = source.bzr_branch.bzrdir.open_branch()
[10:44] <mmrazik> jam... argh... sorry. I didn't apply it correctly
[10:44] <mmrazik> jam: yes. Just looking at it
[10:54] <mmrazik> jam: looks better now. There is still a stacktrace but I think it's because the tarmac user is not allowed to push into the branch
[10:58] <mmrazik> I'm now trying with the real thing
[11:02] <mmrazik> jam: ack. it works with the tarmac patch.
[11:02] <jam> mmrazik: so that at least gets you up and running again.
[11:02] <jam> I'm trying to see if I can reproduce here. The 5-min wait to test is a bit annoying.
[11:02] <mmrazik> jam: yep. Many thanks for the help.
[11:02] <jam> I think I tried paramiko, and found it hangs at the point of reconnect.
[11:02] <jam> which might be what you saw.
[11:03] <mmrazik> let me know if you need some more help with this
[11:06] <jam> well, I should know in about 200 more seconds if it reproduces locally.
[11:11] <jam> mmrazik: :( it doesn't reproduce here, the retry works: http://paste.ubuntu.com/1190598/
[11:11] <jam> (that is seconds *10)
[11:11] <jam> at 5 min it gets the 'you're disconnected' from the server.
[11:12] <jam> at 35s, the client notices, and retries the connection.
[11:12] <jam> and successfully gets Branch.last_revision()
[11:13] <mgz> hm. I wonder what's different.
[11:13] <mmrazik> :-/
[11:38] <jam> mgz: well offhand I wouldn't expect EPIPE from a *socket* object, but the traceback clearly looks like it is failing while retrying, not failing in the initial request (and then failing to retry)
[11:46] <jam> mgz: hmm.. right now I'm running on Windows, which uses actual pipes, rather than socketpair. I wonder if that matters.
[11:46] <jam> mgz: can you run this on your machine: http://paste.ubuntu.com/1190654/
[11:50] <jam> and maybe you as well mmrazik ^^
[11:51] <mgz> jam: sure
[11:51] <jam> I can see that if I run BZR_SSH=paramiko, I don't see the stderr 'you have been disconnected' message.
[11:53] <mmrazik> jam: running
[11:53] <mmrazik> so far so good. just numbers
[11:53] <mmrazik> oh..
[11:53] <mmrazik> that's expected :)
[11:53] <jam> mmrazik: well, expected for 350s :)
[11:54] <jam> mgz, mmrazik: weird, when running with paramiko, we end up looping on a socket.sendall trying to send 119 bytes, and we just keep failing.
[11:54]  * mmrazik shour read the code before copy&pasting&running something
[11:54] <mmrazik> s/shour/should/
[11:54] <jam> it gives us a "sent 0 bytes" in response, but doesn't actually give an error.
[11:54] <jam> I think we should probably have a check for 'if bytes sent == 0: EOF'
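
The check jam describes can be sketched as a send loop that stops as soon as send() reports zero bytes written, instead of looping forever. This is a minimal illustration in modern Python 3, not the bzrlib code: `send_all` and the `ConnectionReset` class here are hypothetical stand-ins defined only for this example.

```python
class ConnectionReset(Exception):
    """Hypothetical stand-in for the error raised when the peer is gone."""


def send_all(sock, data):
    """Send all of data on sock, treating a zero-byte send as EOF.

    sock only needs a send(bytes-like) -> int method, so a real
    socket.socket or any wrapper object with the same method works.
    """
    view = memoryview(data)
    sent_total = 0
    while sent_total < len(view):
        sent = sock.send(view[sent_total:])
        if sent == 0:
            # No progress and no exception: assume the other end is
            # closed rather than retrying the same bytes indefinitely.
            raise ConnectionReset("send() wrote 0 bytes; treating as EOF")
        sent_total += sent
    return sent_total
```

Whether a single zero-byte result should be fatal, or tolerated a couple of times first, is exactly the question raised in the discussion that follows; the sketch takes the strict option.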
[11:58] <mmrazik> jam: the code can reproduce the error
[11:58] <mmrazik> http://pastebin.ubuntu.com/1190681/
[11:58] <jam> mmrazik: so... progress of a sort.
[12:03] <jam> bug #1047309
[12:04] <jam> mgz, jelmer: can you think if sock.send() can legitimately say "I couldn't send any content right now" without raising EINTR?
[12:04] <jam> I realize it returns the number of bytes written, but if it can't write *any* bytes, should we treat that as EOF immediately or should we try a couple times.
[12:07] <jelmer> jam: couldn't there be a buffer that's full, or something like that?
[12:09] <jam> jelmer: man send says: http://paste.ubuntu.com/1190698/
[12:09] <jam> it will block until it can send what you asked
[12:09] <jam> unless you are in non-blocking mode
[12:09] <jam> but then send should fail with EWOULDBLOCK
[12:10] <jam> MSG_NOSIGNAL (since Linux 2.2)
[12:10] <jam>        Requests not to send SIGPIPE on errors on stream oriented sockets when the other end breaks the
[12:10] <jam>        connection.  The EPIPE error is still returned.
[12:10] <jam> interesting.
[12:11] <jam> and we use blocking sockets (because when you set nonblocking it causes the smart server tests to fail)
[12:12] <jam> mmrazik|otp: ok, in this particular case, it looks like it is getting EPIPE during the first send, not during the retry, so I think our code just isn't handling EPIPE as a connection reset
[12:12] <jam> I'll try to dig some more.
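
The idea of handling EPIPE as a connection reset can be sketched as two small steps: translate the low-level OSError into a reset exception, then reconnect and retry once. This is only an illustration in Python 3 under stated assumptions: `ConnectionReset`, `RESET_ERRNOS`, `translate_errors` and `call_with_retry` are hypothetical names, and the set of errno values is a guess at what counts as a dropped connection, not what the actual fix uses.

```python
import errno


class ConnectionReset(Exception):
    """Hypothetical marker: the peer went away, reconnecting may help."""


# errno values treated as a dropped connection (an assumption of this sketch)
RESET_ERRNOS = frozenset([errno.EPIPE, errno.ECONNRESET, errno.ECONNABORTED])


def translate_errors(func, *args):
    """Call func, converting dropped-connection OSErrors to ConnectionReset."""
    try:
        return func(*args)
    except OSError as e:
        if e.errno in RESET_ERRNOS:
            raise ConnectionReset(str(e)) from e
        raise  # anything else is not a reset and is passed through


def call_with_retry(connect, request, retries=1):
    """Run request(conn); on ConnectionReset, reconnect and try again."""
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return translate_errors(request, conn)
        except ConnectionReset:
            if attempt == retries:
                raise  # out of retries, let the caller see the reset
            conn = connect()  # open a fresh connection before retrying
```

Without the translation step, an EPIPE raised by the first send never reaches the retry logic, which matches the symptom of a traceback with no retry at all.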
[12:12] <jam> mgz: can you confirm that it fails for you?
[12:22] <mgz> onesec
[12:23] <mgz> okay, running remotely, will tell you when it returns
[12:28] <mgz> 30
[12:28] <mgz> Connection Timeout: disconnecting client after 300.0 seconds
[12:29] <mgz> and traceback at loop end.
[12:29] <mgz> same as mmrazik.
[12:36] <jam> mgz: k, I think I know the bug, and i'll put up a fix, can you run the fixed code in a sec.
[12:39] <jam> mgz, mmrazik: If you are comfortable running bzr from source: lp:///~jameinel/bzr/2.5-conn-reset-socket-pipe-1047325
[12:39] <jam> it doesn't have a test, but it should fix the problem
[12:39] <jam> (if it is that we aren't retrying at all.)
[12:39] <mgz> sure, I'll test that.
[12:39] <mgz> probably just want the builddeps on this..
[13:18] <jam> mgz: did you get a chance to test the branch?
[13:18] <jam> I also have: https://code.launchpad.net/~jameinel/bzr/2.5-unending-sendall-1047309/+merge/123268
[13:18] <jam> up for review.
[13:20] <mgz> jam:
[13:20] <mgz> Connection Timeout: disconnecting client after 300.0 seconds
[13:20] <mgz> 31
[13:20] <mgz> 32
[13:20] <mgz> 33
[13:20] <mgz> 34
[13:20] <mgz> ConnectionReset calling 'Branch.last_revision_info', retrying
[13:28] <jam> mgz: did it print the revision_id at the end?
[13:28] <mgz> so, will review other branch, and that fix looks good
[13:29] <mgz> jam: yup, the lack of traceback was the main thing :)
[13:30] <jam> mgz: I think the fix is good, I'd like a test for it, so if you have ideas, I'm listening.
[13:31] <jam> I might get to it over the weekend, and then we should do 2.5.2
[13:32] <mgz> I do wonder about if we've got the exception wrapping at the right level
[13:33] <mgz> there are some tests that try to check connection reset stuff, but are a little unreliable as terminating a connection from one thread in a process to another thread is not actually the same as what really happens
[13:34] <mgz> the short answer is you replace the underlying call to raise an exception we've observed it raising and make sure it propagates wrapped up netly
[13:34] <mgz> *neatly
[13:34] <mgz> but a more real world test would be grand...
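
The approach mgz outlines (replace the underlying call so it raises the observed exception, then check that it comes out wrapped) might look like the following with Python 3's unittest.mock. Everything here is a hypothetical sketch: `write_request` stands in for whatever code is under test and `ConnectionReset` is a local class, neither is taken from bzrlib.

```python
import errno
import unittest
from unittest import mock


class ConnectionReset(Exception):
    """Hypothetical wrapped error, standing in for the library's own."""


def write_request(sock, data):
    """Hypothetical code under test: wraps EPIPE raised by the socket."""
    try:
        sock.sendall(data)
    except OSError as e:
        if e.errno == errno.EPIPE:
            raise ConnectionReset("connection closed during send") from e
        raise


class TestEpipeIsWrapped(unittest.TestCase):

    def test_epipe_becomes_connection_reset(self):
        # Make the underlying call raise the error seen in the traceback.
        sock = mock.Mock()
        sock.sendall.side_effect = OSError(errno.EPIPE, "Broken pipe")
        with self.assertRaises(ConnectionReset):
            write_request(sock, b"hello")
        sock.sendall.assert_called_once_with(b"hello")

    def test_other_errors_propagate_unchanged(self):
        # Errors that are not a dropped connection must not be wrapped.
        sock = mock.Mock()
        sock.sendall.side_effect = OSError(errno.ENOSPC, "No space left")
        with self.assertRaises(OSError) as ctx:
            write_request(sock, b"hello")
        self.assertNotIsInstance(ctx.exception, ConnectionReset)
```

A test like this only pins down the wrapping behaviour; it does not exercise a real dropped connection, which is the limitation mgz points out.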
[13:34] <awilkins> Gah, why did I ever set things up with NTLM auth (answer : because most of my users are noobs and it's easier when it works...)
[13:35] <awilkins> In the position where I have a tree that SVN can check out fine (anonymously) but Bazaar can't branch it (fails the NTLM auth)
[13:38] <awilkins> Does Bazaar just use PyCurl if it's installed?
[13:38] <awilkins> Hmm, maybe not
[13:38] <mgedmin> every time I see "Aborting commit due to empty commit message." I feel that I ♥git
[13:38] <mgedmin> you're missing an opportunity here with that interactive roadblock
[13:46] <mgz> mgedmin: I'm not sure what you're referring to, but every time, and you've never got as far as just sending a patch?
[14:21] <awilkins> Every time I see a commit without a log message, I feel that I ☠☢☹ the annoying sod that committed it.
[15:09] <jml> you guys are going to force me to implement 'bzr branches --merged' aren't you?
[15:11] <mgz> are we?
[15:13] <fullermd> Look, I never _said_ I'd kill your puppy if you didn't...
[23:05] <mark06> is it possible to make bazaar recognize mac newlines?
[23:06] <mark06> it's considering the whole file changed when no newline conversion happened in fact