[06:00] wgrant: one problem (possibly not the only problem), is the the semantics of dirty headers are still weird (I could have implemented this better in bzrlib)... each group of dirty headers is only associated with the first patch, where really all of the following patches before a blankline should be associated. [06:03] so currently if the diff looks like [dirty, dirty, patch, patch, patch, dirty, ...etc] we get [{dirty: foo, patch: }, patch1, patch2] [06:04] where really we should probably have [{dirty: foo, patches: [, , etc]}] [06:06] bzrlib of course only sees dirty lines for the first patch it parses [06:07] I think I do have an interim fix though which should demangle the mail diffs. [06:12] blr: Why wouldn't it be [{dirty: foo, patch: patch0}, patch1, patch2, {dirty: bar, patch: patch3}]? [06:13] wgrant: if the first 3 patches are from the same file? [06:13] blr: Three patches can't be from the same file, but the hunks could be. [06:17] hmm, sorry you're correct, I think the diff_text in the test is actually in an impossible state. [06:18] wgrant: would you mind looking at test_codereviewcomment.py [06:18] presumably the 'baz' and 'foo' patch headers would always have a dirty header [06:22] and in practice would always be preceded with a newline [06:23] blr: In which branch? [06:23] I don't see a baz in test_codereviewcomment in devel. [06:28] wgrant: hmm second patch in the diff_text [06:32] wgrant: http://pastebin.ubuntu.com/11840023/ [06:33] wgrant: that first dirty header has always been in the sample data, but presumably it is incorrect? [06:33] blr: What's wrong with it? [06:34] ah they're supposed to represent directories [06:35] in which case it should be bar.py and baz and foo should have their own dirty headers? [06:37] I don't think we generate diffs with patches in succession like that without headers. [06:39] That diff looks believable to me. [06:39] A directory was added, which is only indicated by a === line. [06:40] In our parsing, that will appear as a dirty line on the patch that follows it. [06:40] Now, technically it has no relationship with that patch, but it doesn't particularly matter if we associate them. [06:42] wgrant: in what case would the second and third patches not have dirty headers (or a blankline)? [06:44] blr: I don't know if we'd generate them like that (though git might), but there's no reason we can't parse them AFAICS. [06:45] wgrant: sure, we are parsing them like that now, it just didn't seem to align with reality [06:45] blr: Parsing them like what? [06:45] I've added another patch preceeded by a blankline and a new dirty header in this branch [06:46] Ah, it was misinterpreting the blank line case? [06:46] wgrant: I mean, we can parse a diff in that state. [06:47] well, we didn't have test data with a blankline/new header prior to this, which accounts for some of the failings in the tests. [06:48] Right. [06:48] wgrant: just wanted to ascertain if there were circumstances where we might generate a diff like that. [06:48] Yep, I understand now. [06:48] git may, not sure. [06:48] We should test the various cases. [06:48] sorry my brain is a little fried after looking at this all day [06:48] No gap, blank gap, dirty gap, dirty and blank gap. [06:48] Oh I know the feeling, [06:48] I'm probably not making much sense :) [06:48] right [06:49] wgrant: anyhow, I'm parsing elmo's big diff and it seems to be fine now [06:50] blr: Great, small bzrlib fix? [06:51] no, I think bzrlib is actually fine.. the problem there was a malformed hunk header in my test which made me erroneously think the problem was in bzr [06:53] wgrant: our comment count includes the blanklines before dirty headers which bzrlib throws away [06:57] blr: But if bzrlib throws them away then how can we know if they were there? [06:57] (unless you do what Colin suggested, and keep track of which lines bzr emits yourself) [06:58] wgrant: we consistently render a blankline before a dirty header [06:58] well, all but the first [07:06] Perhaps this should all be refactored, but this fix does appear to work, so perhaps we should land it and revisit it. [07:06] blr: So you just assume there's always a blank line between patches? [07:07] wgrant: I've written two tests using elmo's diff, not sure I should commit them however, that logic is basically tested in other tests, but I did want to parse a real (large) diff [07:07] well, that does appear to be the case. [07:07] blr: Yeah, no point having such a huge diff (of proprietary code, no less) [10:28] blr: There's always a blank line between bzrlib-generated patches, but not between git-generated patches. [10:28] blr: So we can't assume there's a blank line there. [10:29] blr: And we probably shouldn't insert one into git patches either. [10:41] wgrant: Shall I take it from your review that you would like me to fix IPersonSet.getByEmail(None) to return None? [10:41] AttributeError: 'NoneType' object has no attribute 'lower' [10:57] cjwatson: Oh, hm, nevermind then. [10:58] OK. That was why I left the wrapper in place. Though I'll have another look since we seem to be unnecessarily running things through safe_fix_maintainer again when we could pass in an unadorned e-mail address instead. [16:22] unnecessarily running things through safe_fix_maintainer again> ignore that, looking at the wrong branch. So I'll just land this. [21:13] morning [21:16] cjwatson: then perhaps we should not count them when generating the comment dict [21:19] cjwatson: we also have no test using a git generated diff [21:36] so afaict, we either need to change the behaviour of the caller, or add some heuristics to detect a bzr (unified) vs git (coimbined) diff. [21:38] not certain if bzrlib.patches can parse a combined diff actually.. I suppose I should test that :) [22:24] blr: The comment dict is a map of physical diff lines to comments. [22:24] blr: We can't not count them. [22:24] The parser which is used to generate the email must be able to cope with the line being there or the line not being there. [22:30] wgrant: I think it will have to be special cased for git anyway. [22:32] blr: Why? [22:32] git diffs are not unified diffs? [22:32] They conform to every definition of that term that I know of. [22:32] How do they differ? [22:34] wgrant: hunk headers differ [22:35] blr: In what way? The only difference I know of is that the --- and +++ lines don't carry timestamps. [22:36] git combined diffs can have hunk headers comparing two or more files [22:36] I'm not sure bzrlib is going to parse those [22:36] I have yet to check though, it may happily. [22:37] have a look at http://git-scm.com/docs/git-diff under "combined diff format" [22:38] blr: Does the way we use pygit2 actually give those to us? [22:38] not certain [22:40] It would be pretty hostile for it to just give them to us without us asking. [22:41] wgrant: how would feel about some code then which checks for the presence of a blank line preceding a hunk header before passing the diff to bzrlib? [22:43] bzrlib parses git patches justfine. [22:43] tested that many times. [22:43] great, that simplifies things. [22:44] all we need to do is to be able to cope with arbitrary stuff in between the bits bzrlib gives us. [22:44] which is easy because bzrlib preserves the order [22:44] at worst it skips some lines [22:45] anyway, back a bit later, gaming [22:49] blr: If LP parses the patch before passing it to bzrlib then it might as well do the whole thing itself. [22:49] bzrlib needs to preserve the blank line(s), or we need to take Colin's approach. [22:54] yep, okay, I'll have a look. === mwhudson_ is now known as mwhudson === mwhudson is now known as Guest82160 [23:13] wgrant: "if line.startswith('=== ') or not line.strip():" should do it? (or clause is new) === Guest82160 is now known as mwhudson [23:21] blr: That won't work. [23:21] blr: It will, for example, consider any blank line within a hunk to be part of the next dirty head. [23:23] hmm, in the context of this iterator, not certain how I can determine if the blank is followed by a dirtyheader [23:25] also with git diffs I think there's a line starting with "diff" [23:25] cjwatson: and also "index"? [23:28] Yes. [23:28] There is an arbitrary set of junk between the end of one patch and the start of the next. [23:30] unfortunately there's nothing denoting the end of a patch, so presumably we will always have to regex the line [23:32] Is the junk just any line that doesn't start with '+', '-', '@' or ' '? [23:34] (or you could count the lines described in the hunk header. not sure which is messier) [23:35] junk is anything preceding '--- ' [23:37] begiunning is set when the first patch line is encountered [23:37] so this should work: if beginning and line.startswith(dirty_headers) or not line.strip() [23:38] where dirty_headers is a tuple of things we care about [23:38] or maybe it should just all be preserved [23:39] "or not line.strip()" will catch lines in the middle of a hunk. [23:39] not if beginning is False? [23:39] Junk can't just be anything preceding '--- ', because the entire previous hunk precedes it.