[05:49] <AfC> Can anyone comment on the state of bzr-git? I have _one_ last project that's not in Bazaar yet that is in Git. It'd be nice to somehow convert it into a bzr branch, but I had no luck with tailor (for months of trying)
[05:56] <lifeless> jelmer of bzr-svn fame is hacking on it now
[05:56] <lifeless> its not up to converting yet, only data inspection
[06:05] <AfC> lifeless: ok, thanks Robert
[06:06] <AfC> I'm going to go ahead and just do a big initial import for now. Maybe I can use --file-ids or whatever later to recover the history
[06:09] <Peng> Why not keep it in git for the moment?
[06:10] <AfC> Peng: you're asking that HERE, in #bzr?
[06:10] <AfC> {sigh}
[06:11] <AfC> Peng: but also because I can hardly remember how to use Git and am not really that interested in relearning.
[06:12] <AfC> s/could/could not/
[06:14] <Peng> Heh.
[06:14] <Peng> I have no experience with git. :P
[06:14] <Peng> Maybe I should be glad.
[06:15] <Peng> But if bzr-git is progressing quickly, I was just thinking that it shouldn't be too bad to use it for a little while, especially when otherwise you risk losing the history.
[06:15] <AfC> Peng: fair enough
[06:16] <AfC> Peng: nah, I need to get on with collaborating with someone. I'll wait until there is a way to graft the two branches together.
[06:16] <AfC> It's all cosmetic, of course
[06:16] <AfC> Just feelgood factor that you want to recover, mostly
[07:08] <keir> lifeless, ping
[08:05] <lifeless> keir: pong
[08:09] <keir> lifeless, hey
[08:09] <keir> lifeless, did you start on the 4k fanout?
[08:11] <lifeless> keir: putting the finishing touches on bisection
[08:11] <lifeless> its 19 roundtrips on a 200MB index
[08:11] <lifeless> to get down to a 4K size
[08:12] <keir> i was thinking about this
[08:12] <keir> for big indices, why not pad them out to 4k blocks?
[08:12] <lifeless> a 4K prelude on the index will give about 16 times granularity, or log(16, 2) = 4 fewer round trips
[08:12] <keir> then we can have a fan out table which selects down to 4k nicely
[08:13] <lifeless> hmm, right now I just want to get enough legs on this toy format to survive while the real one comes together
[08:13] <keir> of course :)
[08:16] <keir> so in a 200mb index, that's ~2.5m keys, right?
[08:16] <keir> assuming something like 80 bytes per key/value/refs
[08:17] <lifeless> 800K keys
[08:17] <lifeless> (I gave you a 200M index to play with :))
[08:18] <lifeless> for the current toy format of course
[08:18] <keir> yes
[08:18] <keir> i am using the old 100mb 0.tix
[08:18] <lifeless> oh, was it only 100M. ll
[08:18] <lifeless> lol
[08:19] <lifeless> 200MB for it + the rev index and inv index
[08:19] <keir> wait, i found 115k keys in that one...
[08:19] <keir> i wonder if my parsing code is wrong
[08:19] <keir> most key/val/refs are ~90 bytes, so 115k keys makes sense
[08:20] <keir> wait
[08:20] <keir> i think i dropped a 0
[08:20] <lifeless> I think that index is a little unusual because it has converted data
[08:20] <lifeless> our native indices have longer keys
[08:21] <keir> are you always reading 4k at a time?
[08:21] <lifeless> o
[08:21] <lifeless> no
[08:22] <keir> less?
[08:22] <lifeless> minimum of get_recommended_page_size
[08:22] <lifeless> which transport supplies
[08:22] <keir> aah, i see
[08:22] <lifeless> a single readv may hit many locations, each of which is fanned out to that figure if its smaller
[08:22] <lifeless> so on http we'll read 64k minimum
[08:23] <lifeless> but something like ftp may well choose to read 200K or more, because of the insane effort needed to issue what amounts to a readv
[08:23] <lifeless> so its not truly pages in the toy index
[08:23] <lifeless> read read <->
[08:23] <lifeless> we read <->
[08:23] <keir> so really, a 4k fanout/preamble may be too small
[08:23] <lifeless> transport expands that
[08:24] <lifeless> we get back [......] 
[08:24] <lifeless> if the edges of that are not already parsed, we strip up to the first \n
[08:24] <lifeless> giving row\nrow\n....
[08:24] <lifeless> we parse those
[08:24] <lifeless> mark the range as parsed
[08:24] <lifeless> and the low and high key found in the range
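A minimal sketch of the trimming step described above, assuming newline-terminated rows whose key is the text before the first space. The row layout and the name `parse_range` are illustrative assumptions, not the GraphIndex code.

```python
def parse_range(data, start, at_file_start=False, at_file_end=False):
    """Trim partial rows from a chunk that was read at byte offset `start`.

    Returns ((range_start, range_end), low_key, high_key, rows), or None
    when the chunk holds no complete row.
    """
    begin = 0
    if not at_file_start:
        # The chunk probably starts mid-row: skip past the first newline.
        first_nl = data.find(b"\n")
        if first_nl == -1:
            return None
        begin = first_nl + 1
    end = len(data)
    if not at_file_end:
        # Drop the trailing partial row; keep everything up to the last newline.
        end = data.rfind(b"\n") + 1
    rows = data[begin:end].split(b"\n")
    if rows and rows[-1] == b"":
        rows.pop()  # split() leaves an empty piece after the final newline
    if not rows:
        return None
    # Hypothetical row layout: the key is everything before the first space.
    low_key = rows[0].split(b" ", 1)[0]
    high_key = rows[-1].split(b" ", 1)[0]
    return (start + begin, start + end), low_key, high_key, rows
```

The returned byte range and low/high keys are what a caller would record as "parsed", so later lookups can tell whether a key falls inside a region that has already been read.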
[08:25] <lifeless> the bisection code to drive this is on the commits list
[08:25] <lifeless> bisect_multi_bytes(content_lookup, size, keys)
[08:25] <lifeless> content_lookup is a callable that takes a list of (location, key) tuples
[08:26] <lifeless> and returns those tuples with an added status: one of (-1, +1, False, result)
[08:26] <lifeless> where -1 and +1 are 'lower than this location' and higher than..
[08:26] <lifeless> False is 'cannot be in this index'
[08:26] <lifeless> and result is 'return this to the caller'
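The contract described above can be sketched roughly as follows. This is an illustration under assumptions, not the code posted to the commits list: the `((location, key), status)` return shape and the toy `make_lookup` helper (one bare key per newline-terminated row) are hypothetical.

```python
def bisect_multi_bytes(content_lookup, size, keys):
    """Bisect for many keys at once, one batched lookup per round."""
    found = []
    delta = size // 2
    pending = [(delta, key) for key in keys]
    while pending:
        answers = content_lookup(pending)
        if delta > 1:
            delta //= 2
        pending = []
        for (location, key), status in answers:
            if status is False:
                continue                      # key cannot be in this index
            if status == -1:
                pending.append((location - delta, key))
            elif status == 1:
                pending.append((location + delta, key))
            else:
                found.append((key, status))   # a result for the caller
    return found


def make_lookup(data):
    """Toy content_lookup over sorted, newline-terminated keys (assumption)."""
    def row_at(pos):
        start = data.rfind(b"\n", 0, pos) + 1
        end = data.find(b"\n", pos)
        return start, end, data[start:end]

    def content_lookup(requests):
        out = []
        for location, key in requests:
            start, end, row = row_at(location)
            if key == row:
                status = ("hit", start)
            elif key < row:
                # Absent if this is the first row or the previous row sorts lower.
                absent = start == 0 or row_at(start - 1)[2] < key
                status = False if absent else -1
            else:
                # Absent if this is the last row or the next row sorts higher.
                absent = end + 1 >= len(data) or row_at(end + 1)[2] > key
                status = False if absent else 1
            out.append(((location, key), status))
        return out
    return content_lookup
```

Because every pending key is passed to `content_lookup` in a single call, each round can map onto one readv, which is what keeps the round-trip count down on slow transports.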
[08:28] <keir> ok
[08:28] <lifeless> I'm putting the final bits of the content_lookup callable on GraphIndex at the moment
[08:28] <keir> i see that the hash based fan out is nice
[08:29] <keir> then merging with bzr.dev?
[08:29] <lifeless> then profile for regressions on local operations
[08:29] <lifeless> then profile for regressions on network operations
[08:29] <lifeless> then send in a [MERGE] to the list for debate
[08:29] <lifeless> so this won't add any prelude
[08:29] <keir> ok
[08:30] <lifeless> adding a prelude will simply provide the first 4 left-right jumps within the index at the front, cheaply
[08:31] <keir> only first 4?
[08:31] <lifeless> 4 keys/K
[08:31] <lifeless> 4*4 == 16
[08:31] <lifeless> log(16,2) == 4
[08:32] <keir> index is 4000 single bytes or 2000 shorts?
[08:32] <lifeless> so this makes a 64K index achieve 2 round trips lookups
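The arithmetic in the last few lines can be checked with a short sketch. The figure of 4 keys per KiB is the estimate given above; the variable names are made up for illustration.

```python
import math

prelude_bytes = 4 * 1024
keys_per_kib = 4                       # estimate quoted in the discussion
fanout = keys_per_kib * (prelude_bytes // 1024)

# Each doubling of the fan-out replaces one bisection step.
bisection_steps_saved = math.log2(fanout)

# In a 64K index, 16 prelude entries narrow a lookup to a 4K region,
# so one read for the prelude plus one read for the region gives 2 round trips.
index_bytes = 64 * 1024
region_bytes = index_bytes // fanout
```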
[08:32] <lifeless> this is the toy index
[08:32] <lifeless> strings
[08:32] <lifeless> current format, not sure where you are getting bytes/shorts concepts
[08:32] <lifeless> the prelude I'm thinking for this format, is trivial: a list of pseudo keys
[08:33] <lifeless> and the byte offset of the start of the first key that sorts after the pseudo key
[08:33] <keir> i was thinking the prelude would be the fan out.
[08:34] <lifeless> I think we're talking past each other to some degree
[08:34] <lifeless> the stuff I talked about with you the other day was for the format you're working on, with topological grouping
[08:34] <keir> it's late here, i probably shouldn't be bothering you!
[08:34] <lifeless> this index is linear sorted
[08:34] <keir> lifeless, yes, i realize that
[08:34] <lifeless> ok cool
[08:34] <keir> lifeless, the hash indexing works for linear ordering too
[08:35] <keir> which is neat
[08:35] <lifeless> so given keys AA AB AC BA BB BC
[08:35] <keir> using a hash index you can just glue it on any ordering
[08:35] <lifeless> well, you need a complete hash table
[08:35] <lifeless> I'm planning on using the sorted facility of this index to just improve on regular bisection- basically the same as the git fan-out prelude
[08:36] <lifeless> with the keys above, a prelude might look like
[08:36] <keir> AA <loc> BB <loc>
[08:36] <lifeless> '' 0, 'B' 30
[08:36] <keir> yes
[08:37] <keir> that's exactly how my other code works
[08:37] <lifeless> that is, if I'm looking for a key between '' and 'B', I know its between 0 and 30
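A small sketch of that lookup, assuming the prelude is held as a sorted list of `(pseudo_key, offset)` pairs whose first entry is the empty key. The function name and the 60-byte index size are hypothetical.

```python
import bisect

def prelude_range(prelude, index_size, key):
    """Return the (start, end) byte range of the index that could hold `key`."""
    pseudo_keys = [pseudo for pseudo, _ in prelude]
    # Last prelude entry whose pseudo key sorts at or before `key`.
    # The leading b"" entry guarantees this index is never negative.
    i = bisect.bisect_right(pseudo_keys, key) - 1
    start = prelude[i][1]
    end = prelude[i + 1][1] if i + 1 < len(prelude) else index_size
    return start, end

prelude = [(b"", 0), (b"B", 30)]
print(prelude_range(prelude, 60, b"AB"))
print(prelude_range(prelude, 60, b"BB"))
```

Ordinary bisection then only needs to run inside the returned range rather than over the whole file.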
[08:38] <lifeless> ok
[08:38] <lifeless> so, the reason I want to do this rather than a generalised full-index hash table
[08:38] <lifeless> is that this is very simple to code;
[08:38] <lifeless> take all the keys in order
[08:39] <lifeless> bisect through them and pull out key, location pairs
[08:39] <lifeless> until I've got 4K of data.
[08:39] <lifeless> stop.
[08:40] <lifeless> if I want to, shrink the prelude keys to the smallest unique string at the location I picked up, allowing it to be smaller
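One possible reading of those steps, as a sketch: bisect the sorted rows breadth-first, emit a shortened pseudo key at each midpoint, and stop when the byte budget is used up. The serialised cost of an entry (`key offset\n`) and both function names are assumptions, and `rows` is assumed to be non-empty.

```python
from collections import deque

def shorten(prev_key, key):
    """Shortest prefix of `key` that still sorts after `prev_key`."""
    for n in range(1, len(key) + 1):
        if key[:n] > prev_key:
            return key[:n]
    return key

def build_prelude(rows, budget=4096):
    """rows: non-empty sorted list of (key, offset). Returns (pseudo_key, offset) pairs."""
    picked = []
    used = 0
    queue = deque([(0, len(rows))])
    while queue:
        lo, hi = queue.popleft()
        if hi - lo < 2:
            continue
        mid = (lo + hi) // 2
        pseudo = shorten(rows[mid - 1][0], rows[mid][0])
        offset = rows[mid][1]
        cost = len(pseudo) + len(str(offset)) + 2   # assumed "key offset\n" layout
        if used + cost > budget:
            break
        picked.append((pseudo, offset))
        used += cost
        queue.append((lo, mid))
        queue.append((mid, hi))
    # The empty pseudo key always maps to the first row (not counted in budget).
    return [(b"", rows[0][1])] + sorted(picked)
```

With the six keys from the example at 10 bytes per row and a budget that fits a single entry, this yields the `'' 0, 'B' 30` prelude quoted earlier.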
[08:40] <keir> yes, that's also how my code works. it does it recursively until the top level is 4k
[08:41] <lifeless> cool
[08:41] <keir> ok i lie slightly; it does it bottom up
[08:41] <keir> but same idea
[08:44] <lifeless> :)
[09:36] <keir> lifeless, i just did some excel work. i have the following proposal.
[09:36] <keir> store a 4k preamble which is a histogram of the number of keys in that bin
[09:36] <keir> where each bin is a 1 byte uchar
[09:36] <keir> then store the usual tag/offset jazz after
[09:37] <keir> this way it fits into 4k
[09:37] <keir> for roughly up to 800k keys
[09:38] <lifeless> what is a bin precisely
[09:38] <lifeless> is it positional
[09:38] <lifeless> a key prefix
[09:38] <lifeless> ?
[09:39] <keir> lop the first byte off of each hash
[09:39] <keir> sorry
[09:39] <keir> 3 bytes
[09:40] <keir> grr
[09:40] <keir> 12 bytes
[09:40] <keir> bites
[09:40] <keir> bits
[09:40] <keir> that gives you 1 position in the 4k table (each entry 1 byte)
[09:41] <keir> the 'table' is really a histogram of the number of keys which fell in that bin
[09:41] <keir> which you count by taking the first 12 bits of each hash, indexing into the table, and incrementing
[09:41] <lifeless> I don't understand; which hash? and what does this do for us?
[09:42] <keir> in two round trips most of the time you'll have the exact location
[09:42] <keir> the benefit is that by having the client do the cumulative sum to get the offset into the tag part of the hash table, we can store huge tables
[09:43] <lifeless> if I understand this correctly
[09:44] <lifeless> then I can paraphrase this as:
[09:46] <lifeless> 'store a table of hash:list of locations at the front of the file. To allow the table to become very big, store a summary of the table at the front of the table, size limited to 4K.
[09:46] <keir> yes
[09:47] <lifeless> the summary of the table lists the number of locations stored against every combination of X bits of hash, to allow direct access to the serialised sparse table once the 4K summary is read.
[09:48] <keir> yes
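A sketch of the summary-plus-table idea as paraphrased above. No hash function has been chosen at this point in the discussion, so `zlib.crc32` is only a stand-in, and the 8-byte fixed record size is an assumption.

```python
import zlib

BINS = 4096        # 12 bits of hash select a bin, one byte per bin = 4K summary
RECORD_SIZE = 8    # assumed fixed size of one tag:offset record, in bytes

def bin_of(key):
    # Top 12 bits of a 32-bit hash; crc32 is a stand-in, not a decision.
    return (zlib.crc32(key) & 0xFFFFFFFF) >> 20

def build_summary(keys):
    """Histogram of keys per bin, one unsigned byte per bin."""
    counts = [0] * BINS
    for key in keys:
        counts[bin_of(key)] += 1
    if max(counts) > 255:
        raise ValueError("a bin overflowed its 1-byte counter")
    return bytes(counts)

def table_slice(summary, key):
    """Byte offset and length, within the tag:offset table, of the key's bin."""
    b = bin_of(key)
    start = sum(summary[:b]) * RECORD_SIZE   # cumulative sum done by the reader
    length = summary[b] * RECORD_SIZE
    return start, length
```

Storing counts rather than cumulative offsets is what lets each bin fit in one byte; the reader pays for that with a sum over at most 4096 small integers.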
[09:48] <lifeless> it sounds like you are making good progress on figuring out how to have a very small 'find a key' logic, while still allowing arbitrary sorts for data locality
[09:48] <keir> exactly
[09:49] <lifeless> I'm fairly uninterested in whacking this into the toy format as yet
[09:49] <lifeless> but very interested in it being in the real format
[09:49] <keir> ok
[09:49] <keir> so for the 800mb case
[09:49] <lifeless> 800K case? :)
[09:50] <keir> excel tells me it'll be 4k of preamble and ~13mb of tag:offset pairs
[09:50] <keir> 800k keys rather :)
[09:50] <keir> the nice thing about this is that i can whack it on top of the toy format no problem
[09:52] <keir> and then the average bin will be still something around 4k
[09:52] <keir> as in, you'll have to grab 4k of data
[09:52] <lifeless> hmmm
[09:52] <lifeless> tell you what, I'll finish this bisection approach off
[09:53] <keir> yes
[09:53] <keir> i'll keep working on excel
[09:53] <keir> and mail the list
[09:53] <lifeless> and think seriously about this hash approach
[09:53] <lifeless> what hash function were you thinking of using?
[09:53] <keir> sha, just because it's Proven
[09:53] <keir> but maybe it's too slow
[09:53] <lifeless> its very slow
[09:54] <lifeless> also its cryptographically secure
[09:54] <lifeless> which is unneeded here
[09:54] <lifeless> as we support collisions
[09:55] <keir> yes
[09:55] <keir> crc32!
[09:55] <keir> ftw!
[09:57] <lifeless> well
[09:57] <lifeless> that would work, and is bound in python
[09:58] <keir> we'd probably want more bits
[09:58] <lifeless> why
[09:58] <lifeless> 800K << 2.6M
[10:00] <lifeless> I wonder if hashlittle is available in the stdlib
[10:04] <lifeless> so
[10:04] <lifeless> wikipedias' list of hash functions
[10:05] <lifeless> suggests that adler32 is about the fastest thing out there
[10:05] <lifeless> crc32 ~= md5
[10:05] <lifeless> at 3.6* adler
[10:05] <lifeless> and sha is 6 * adler
[10:05] <lifeless> so I'd use adler up to say 500K keys
[10:06] <keir> neat
[10:06] <keir> is adler in python?
[10:07] <lifeless> yes
[10:07] <lifeless> in the zlib module
[10:07] <lifeless> and go md5 at 500K keys
[10:07] <lifeless> and stay at md5
[10:07] <lifeless> this is only data lookup, things are verified by sha as they are reconstructed, so we don't care about hostile modifications
[10:09] <keir> "Adler-32 has a weakness for short messages with few hundred bytes, because the checksums for these messages have a poor coverage of the 32 available bits."
[10:09] <keir> this is bad for us
[10:09] <keir> probably crc32?
[10:09] <keir> i suppose we can just try
[10:09] <lifeless> test, test and more testing :)
[10:15] <keir> is there a function in your code which gives me the offset of a key given the key? i.e. during the building phase
[10:19] <lifeless> no
[10:19] <keir> ok
[10:19] <lifeless> because the location cannot be determined without knowing the number of key references before the key + the length of the keys before the key
[10:20] <lifeless> so we wait until finish() is invoked to calculate this
[10:20] <keir> of course.
[10:21] <keir> it seems entirely reasonable to add the hash index as an index on the index... in a separate file
[10:22] <keir> cool, now that i look back at git's index, this way is nicer
[10:22] <lifeless> if you update the 'size of offset needed' logic to understand that there will be a hash table there, it should fit in very nicely
[10:22] <keir> by storing the histogram rather than the cumulative sum, we can store fewer bits per hash table entry
[10:23] <keir> and nicely enough, given the number of keys, the hash table (including both 4k start and rest) is fixed size
[10:23] <lifeless> but i'm not against extra files if that really helps
[10:23] <lifeless> actually the hash table probably isn't fixed size - but it is bounded.
[10:23] <lifeless> collisions will share hash names
[10:24] <keir> hmm, this is true
[10:24] <keir> i had originally envisioned sha hashes so i was thinking no collisions
[10:24] <lifeless> even sha can collide
[10:24] <lifeless> its true we don't know how to *make* it collide
[10:25] <lifeless> but its a fallacy to say it won't :)
[10:25] <keir> actually that ruins everything
[10:25] <lifeless> hmm?
[10:25] <lifeless> collisions don't ruin anything here
[10:26] <lifeless> the number of bytes in the hash table is:
[10:26] <keir> the whole trick with the histogram relies on your bins being exactly nentries*fixed size
[10:26] <lifeless> right
[10:26] <keir> ooh, i see
[10:26] <keir> you just duplicate the tag
[10:26] <lifeless> and you have that
[10:26] <lifeless> nah no need
[10:26] <keir> duh
[10:26] <lifeless> here:
[10:26] <lifeless> trivial format for the table:
[10:26] <lifeless> HASH LOCREF[ LOCREF...] 
[10:27] <keir> (well, the table format is pretty trivial!)
[10:27] <lifeless> oops
[10:27] <lifeless> HASH LOCREF[ LOCREF...] \n
[10:28] <keir> i am against delimiters...
[10:28] <keir> in this part of the table
[10:28] <keir> i'd go like this:
[10:28] <lifeless> well
[10:28] <lifeless> if you don't duplicate the hash
[10:28] <keir> TAG OFFSET TAG OFFSET TAG OFFSET
[10:28] <lifeless> then a collision moves the table up
[10:28] <keir> up?
[10:29] <lifeless> the idea of the table summary is to say 'if your hash starts with 010110' then sum up the counts for all hashes in the summary before that
[10:29] <lifeless> and you can tell where in the table to read from to read the data for all hashes that start with 010110
[10:29] <keir> yes
[10:30] <lifeless> by up I mean 'earlier'
[10:30] <keir> what i'm suggesting, is to dupe the tag and increment the bin anyway
[10:30] <keir> then the reading end needs to know that there may be duped tags
[10:30] <lifeless> if its a fixed size, then take the sum of the bins, multiply by the size of a keyref + your hash size and you know where to read
[10:30] <keir> i don't see how to do the offset calc when there are delimiters
[10:31] <lifeless> if its not a fixed size
[10:31] <lifeless> then it will never be further in the file, but it may be earlier
[10:31] <lifeless> but
[10:31] <keir> i think delimiters are a bad idea in this context
[10:32] <keir> hopefully collisions will be rare enough that it's worth duping the content
[10:32] <keir> of course, we'll see in testing
[10:32] <lifeless> anyhow, you can calculate the upper bound of collisions in the reader
[10:32] <lifeless> 4 bins, with 10 keys each - at minimum there are 4 unique hashes, at most 40
[10:32] <lifeless> but its better than that
[10:32] <keir> ah yes
[10:33] <lifeless> if you record 'hash, refs' rather than 'refs' in the bin summary
[10:33] <lifeless> then you can handle any number of collisions and still predict exact location
[10:34] <keir> but then our summary will be either larger or less selective
[10:34] <lifeless> right
[10:35] <lifeless> so its a tradeoff between larger table when there are collisions and larger summary when there aren't
[10:44] <keir> hmm, the keys in 0.tix should be unique right?
[10:44] <lifeless> yes
[10:51] <keir> wow, lots and lots of crc32 collisions
[10:51] <keir> i'm getting 50% collisions
[10:51] <keir> i guess that's what happens when you load 800k items
[10:54] <lifeless> yah
[10:54] <lifeless> don't use crc32
[10:54] <lifeless> :)
[10:55] <lifeless> as its meant to be ~ the same as md5
[10:55] <lifeless> I'd check what distribution you're getting
[10:56] <keir> sorry
[10:56] <keir> bug
[10:56] <keir> in 717k i get 50 collisions
[10:56] <keir> for crc32
[10:56] <keir> 4000 with adler32
[10:57] <lifeless> what cpu do you have?
[10:57] <keir> this is just doing the dumbest thing throwing all the keys into a dict() keyed on hash
[10:57] <lifeless> 4K/717K is nice and low - 0.5%
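The "throw everything into a dict keyed on hash" test might look something like this in current Python. It is a sketch, not keir's script: the candidate list and the truncation of md5 to 32 bits are assumptions.

```python
import hashlib
import zlib

def count_collisions(keys, hash_func):
    """Number of keys that share a hash value with an earlier key."""
    seen = {}
    for key in keys:
        seen.setdefault(hash_func(key), []).append(key)
    return sum(len(group) - 1 for group in seen.values())

# Candidate 32-bit hashes; md5 is truncated to its first 4 bytes.
HASHES = {
    "crc32": lambda k: zlib.crc32(k) & 0xFFFFFFFF,
    "adler32": lambda k: zlib.adler32(k) & 0xFFFFFFFF,
    "md5/32": lambda k: int.from_bytes(hashlib.md5(k).digest()[:4], "big"),
}

def compare(keys):
    return {name: count_collisions(keys, func) for name, func in HASHES.items()}
```

Running `compare()` over the keys parsed from an index file gives one collision count per hash; the counts depend entirely on the key set used.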
[10:57] <keir> pokey old centrino
[10:58] <lifeless> ok
[10:58] <lifeless> print "0x%u" % hash('asdfg')
[10:58] <lifeless> 0x3709412585253919148
[10:58] <lifeless> I don't think we can use this hash
[10:58] <keir> it can hash all of them pretty fast even on this machine
[10:58] <lifeless> but I'm curious what result you get
[10:59] <keir> 0x-848537172
[10:59] <lifeless> hah! so much for unsigned
[10:59] <lifeless> anyhow, different number
[11:00] <keir> bin = int(sha.sha(key).hexdigest()[:4] , 16)
[11:00] <keir> i'm getting 65k collisions with that one
[11:01] <keir> that should be a 32 bit hash right?
[11:01] <lifeless> 4 bits per char
[11:01] <lifeless> 4 chars
[11:01] <keir> duh
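Each hex character carries 4 bits, so `[:4]` keeps only 16 bits of the digest; a 32-bit value needs 8 characters. A sketch using `hashlib` (the modern replacement for the old `sha` module), with a hypothetical helper name:

```python
import hashlib

def sha1_prefix(key, bits=32):
    """Integer made from the leading `bits` bits of the SHA-1 digest."""
    hex_chars = bits // 4          # 4 bits per hex character
    return int(hashlib.sha1(key).hexdigest()[:hex_chars], 16)
```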
[11:02] <keir> 54
[11:02] <lifeless> really, use md5
[11:02] <lifeless> seriously faster
[11:02] <keir> just checking collisions
[11:03] <lifeless> sure
[11:03] <keir> 61
[11:03] <keir> wow, so far the winner is the fastest and dumbest: crc32 ftw!
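The counts of 50, 54 and 61 are close to what a birthday-style estimate predicts for any well-mixed 32-bit hash, which would suggest the three results are equivalent within noise. A sketch of that estimate (expected number of colliding pairs, assuming uniformly distributed hash values):

```python
def expected_collisions(n_keys, hash_bits):
    """Expected number of colliding pairs for a uniform hash of `hash_bits` bits."""
    pairs = n_keys * (n_keys - 1) / 2
    return pairs / 2 ** hash_bits

estimate = expected_collisions(717000, 32)   # roughly 60
```

By the same estimate, the 4000 collisions seen with adler32 are far above what a uniform 32-bit hash would give, which is consistent with its known weakness on short inputs.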
[11:03] <lifeless> :)
[11:03] <lifeless> have you tried adler ?
[11:03] <keir> 4000
[11:04] <lifeless> hmm, I remember one of adler/crc has endianness/word size issues
[11:04] <keir> by authors admission, is broken below 128 chars due to design
[11:04] <lifeless> we'll want to check
[11:04] <keir> it's adler
[11:04] <keir> it's unsuitable
[11:04] <lifeless> k
[11:05] <lifeless> brb
[11:12] <lifeless> back
[11:15] <james_w> lifeless: I was asking about disk safe filenames for the patches/threads/loom plugin. I guess I could use a random string and store the mapping.
[11:17] <keir> lifeless, so if we have 12 bits to pick inside the 4k preamble, we're left with an awkward 20 bit tag. i suggest we use a 44 bit hash; 12 bits for preamble, 32 for tag
[11:17] <keir> then our records are 8 bytes
[11:17] <keir> which are very cache performant (aligning to 4 bytes good!)
[11:18] <keir> 0 collisions on 717k
[11:18] <lifeless> james_w: a bit more context would be useful
[11:18] <keir> (8 bytes being tag:offset)
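A sketch of the proposed layout: 44 bits of hash split into a 12-bit bin selector and a 32-bit tag, with each record packed as 8 bytes of tag:offset. The source hash is not named here, so md5 is an assumption, as are the big-endian packing and the 32-bit offset (which would cap the index at 4 GiB).

```python
import hashlib
import struct

def split_hash(key):
    """Return (bin_index, tag): 12 bits for the preamble, 32 bits for the tag."""
    # Take 6 digest bytes (48 bits) and keep the top 44.
    h44 = int.from_bytes(hashlib.md5(key).digest()[:6], "big") >> 4
    bin_index = h44 >> 32          # top 12 bits pick the bin in the 4K preamble
    tag = h44 & 0xFFFFFFFF         # remaining 32 bits are stored in the record
    return bin_index, tag

def pack_record(tag, offset):
    """One fixed-size 8-byte record: 32-bit tag followed by 32-bit offset."""
    return struct.pack(">II", tag, offset)
```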
[11:20] <keir> i'm off to bed
[11:20] <lifeless> night!
[11:20] <keir> it's been a fun thought experiment
[11:20] <lifeless> we're making really good progress I think
[11:20] <keir> i'm pretty convinced this is the right approach
[11:21] <keir> super simple, compact
[11:21] <james_w> lifeless: I don't know if you saw on the list (extra revisions in sprout), but I am looking at doing something like the loom plugin. This means that I need to create new branch. Each 'thread' has a name, so I was going to name the hidden branches the same, so I wanted to know if there was a function to check if a string was safe for writing to disk.
[11:21] <james_w> you suggested that using user supplied input wasn't a good idea.
[11:21] <lifeless> I don't understand whats being written to disk
[11:21] <lifeless> or what you mean by hidden branch
[11:22] <james_w> for each thread I create a hidden branch, which is just a branch that is stored in .bzr to suggest to the user that they shouldn't modify it manually. This involves creating a new directory beneath .bzr that is a branch.
[11:23] <lifeless> I suggest using some mapping
[11:23] <james_w> each of these branches has a name, which I was going to use for the name of the directory, so that given a name I can find the branch.
[11:23] <lifeless> and thinking about renames
[11:23] <lifeless> are there relationships between these branches
[11:24] <lifeless> bbiab
[11:24] <james_w> yes, I am working on a single linked list assumption at the moment.
[11:24] <james_w> or a stack might be a better term.
[11:24] <james_w> If i remove that you can get git workflow, but it is more work.
[01:11] <Peng> lifeless (since keir /quit): You might be using the hash functions for different purposes, but what about Mercurial's hash functions? Either their old GNU diff-based one or their new lyhash (by Leonid Yuriev)? http://hg.intevation.org/mercurial/crew/rev/d0c48891dd4a?style=gitweb
[04:37] <schierbeck> hello lads
[09:03] <s|k> hi
[09:17] <keir> why are the new bzr packs still ~5x larger than git packs?
[09:18] <luks> bzr packs are really just knits written into bigger files
[09:18] <luks> but the structure of data in bzr and git is very different, anyway
[09:19] <keir> the bzr packs don't compress at all either...
[09:20] <keir> hmm
[09:20] <keir> git is snapshot based
[09:20] <keir> but internally just uses heuristics to store the diffs
[09:20] <luks> knits are gzipped
[09:20] <luks> gzipped text deltas
[09:20] <keir> that the complete linux kernel history is 100mb, is pretty nice!
[09:20] <keir> (for git)
[09:21] <keir> interesting, i'll have to look into this
[09:22] <keir> the tiny size for git repos is really nice for branching projects -- tiny download time!
[09:24] <keir> is there docs of what a knit is?
[09:24] <keir> i don't see it clearly explained in the developers/ dir in docs
[09:27] <keir> i see, so text deltas rather than a more efficient xdiff type binary encoding
[09:34] <fullermd> And occasional fulltext copies.  That adds up size on long histories.
[09:36] <keir> git does that too
[09:36] <keir> i still don't see how it's a 5x size difference though
[09:36] <fullermd> Not in the same sense, AIUI.
[09:37] <luks> annotations add a lot, too. or are they dropped from packs already?
[09:39] <keir> hmm. annotations are very important for some projects (gnome, etc)
[09:40] <fullermd> That doesn't mean storing them is necessary.
[09:40] <keir> well, the issue gnome had with svn was that annotate was very slow compared to cvs
[09:40] <keir> back in the day it was a blocker to svn conversion
[09:43] <fullermd> Well, annotations are also important to things like annotation-based merges.
[09:44] <fullermd> What both end up meaning is that annotations should be get-able "fast enough".  Which means either finding a blazing fast way to derive them on the fly, or caching them.
[09:44] <fullermd> With knits, they're cached at commit time, which has a lot of drawbacks.  Space usage, CPU and I/O time to add/retrieve them on all operations.
[09:45] <fullermd> Makes it very hard to use a non-line-based delta algorithm, since annotations are line-based (my brain rebels at reading a byte-wise 'annotate' output)
[09:45] <fullermd> Related, it means they're hard-coded based on the 'diff' algorithm used at commit-time.  That can suck.
[09:46] <fullermd> AIUI, the plan with packs is to add the capability to gen and store an external cache when things come needed, independent of the actual text store.
[09:47] <keir> sensible
[09:55] <keir> is there a fast mpdiff implementation already?
[09:55] <keir> anyone used bzr-git?
[09:55] <keir> i can't get it to work at all
[09:55] <keir> and there is no docs whatsoever
[09:55] <keir> bzr branch ../git ---> 'GitRepository' object has no attribute '_format'
[09:57] <fullermd> I thought I heard it wasn't up to branching.
[09:58] <keir> there is not a single example command showing how to use any of its functionality, so i'm guessing
[09:58] <keir> so bzr-git is, at the moment, entirely useless?
[09:58] <fullermd> I think lifeless said it's only "there" enough to look at history (presumably 'log' and the like)
[10:45] <lifeless> keir: bzr-git can show history but is not complete enough to pull data across
[10:45] <lifeless> luks: knits have annotations, we won't be changing the knits repo format, so that won't change; packs have knits though
[10:45] <lifeless> erm, packs have the deltas, and they are not annotated
[10:46] <luks> well, that's what I meant
[10:46] <luks> I just wasn't sure if packs still have annotations
[10:46] <lifeless> ok
[10:46] <lifeless> they don't
[10:47] <lifeless> packs are something like 10% smaller than bzr's knit based repositories
[10:47] <keir> lifeless, why are they so much larger than git repos?
[10:47] <jelmer> 'morning *
[10:48] <keir> because libxdiff (what git uses) is 5x better than line diff + gzip?
[10:48] <lifeless> keir: two reasons that I know of; one is that we know we can do better on size - john and aaron have done lots of experiments on this - but we haven't moved their research into a disk format [yet].
[10:49] <lifeless> in fact thats a general enough statement that it covers the second reason too :)
[10:49] <lifeless> hi jelmer
[10:49] <keir> lifeless, ok :)
[10:49] <lifeless> Peng: thanks; I wasn't aware of lyhash
[10:49] <Peng> lifeless: :)
[10:50] <lifeless> keir: so I would expect the next iteration of packs to bring in some of their improvements which will increase performance more, decrease size etc.
[10:51] <lifeless> keir: I'm curious though where you are getting the size comparison from?
[10:51] <lifeless> keir: is it a tree with the same history in both formats?
[10:51] <keir> lifeless, order of magnitude comparisons of git tree vs bzr tree
[10:51] <keir> same source size
[10:51] <keir> similar history size
[10:51] <lifeless> well
[10:51] <lifeless> similar can cover a multitude of sins
[10:51] <keir> true
[10:52] <keir> i was trying to convert the git repo into bzr packs
[10:52] <keir> to do a proper comparison
[10:52] <lifeless> yah, thats not ready yet
[10:52] <lifeless> oh, also, a thought - if you were doing du -sh .bzr
[10:52] <lifeless> there is the obsolete_packs directory
[10:53] <keir> it's empty
[10:53] <lifeless> ok
[10:53] <lifeless> :)
[10:53] <keir> all the size is in the single .pack
[10:53] <keir> 54mb
[10:53] <lifeless> how old was the project when it was in knit format ?
[10:53] <keir> compared to the git 15mb one
[10:53] <keir> this is packs.packs
[10:53] <keir> knit is similar size
[10:54] <keir> .bzr in packs is 61mb
[10:54] <lifeless> uhm, I wasn't clear
[10:54] <keir> in knits it's 71mb
[10:54] <lifeless> old knit generation code generated bigger knits
[10:54] <lifeless> for two fairly dumb in hindsight reasons
[10:54] <lifeless> the first is the python GZipFile uses zlib compression level 9 by default for some insane reason
[10:55] <keir> hmm
[10:55] <lifeless> this generates bigger output in a non trivial % of cases
[10:55] <lifeless> I saved a surprising amount of disk in the moz repo tests by converting that to -1, not to mention a bucketload of time
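Python's `gzip.GzipFile` does default to `compresslevel=9`, while level `-1` asks zlib for its own default. A sketch for comparing output sizes per level on a given text; the helper name is made up, and which level wins depends on the data.

```python
import zlib

def compressed_sizes(data, levels=(-1, 1, 6, 9)):
    """Compressed size in bytes of `data` at each zlib level."""
    return {level: len(zlib.compress(data, level)) for level in levels}

sample = b"some repetitive line of text\n" * 1000
sizes = compressed_sizes(sample)
```

Timing the same calls (for example with `timeit`) would show the CPU cost side of the trade-off mentioned above.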
[10:56] <lifeless> the second reason is that we started with a full text every 50 commits no matter what
[10:56] <lifeless> (per file history)
[10:56] <keir> ah, i see
[10:56] <lifeless> and the inventory file - which is a) big and b) changes on every commit
[10:56] <lifeless> lets just say that it became a dominating factor in several repositories
[10:57] <lifeless> now we use a heuristic that says 'if the size of the deltas == the size of the file, write a new full text'
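The heuristic, as described, could be sketched like this. Reading '==' as "has reached or exceeded" is an interpretation, and the function and parameter names are hypothetical.

```python
def should_write_fulltext(delta_sizes_since_fulltext, new_delta_size, fulltext_size):
    """True when the delta chain would cost as much to read as the whole file."""
    chain_bytes = sum(delta_sizes_since_fulltext) + new_delta_size
    return chain_bytes >= fulltext_size
```

Compared with the earlier fixed rule of a fulltext every 50 commits, this ties the decision to how much the file actually changed, so a large file with small edits (such as the inventory) gets fulltexts much less often.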
[10:57] <lifeless> oh, and theres another reason too
[10:57] <lifeless> we delta against history always
[10:57] <Peng> To get more efficient knits, what, re-clone?
[10:58] <lifeless> not against best size/closest match/introduction to repo
[10:58] <lifeless> which means if you imagine a Y graph
[10:58] <lifeless> if at the point just before the split you are due-for-a-full-text
[10:58] <lifeless> both arms will get a fulltext, one each.
[10:59] <lifeless> Peng: by diffing against history its trivial to be sure that you have all the deltas needed to reconstruct a text during branching
[10:59] <lifeless> because you can take all the inventories; grab the texts they want as either fulltext-or-delta, and you're done
[11:00] <lifeless> if you have arbitrary deltas, you need to grab other data that you dont want in general, replace the local delta for the thing that referenced it with another, then you're done
[11:00] <lifeless> this is one reason pack text indices have two pointers
[11:00] <lifeless> one to another text for delta parent
[11:00] <keir> aah, a history-parent and delta parent
[11:01] <lifeless> one to a list of parents for file graph
[11:01] <lifeless> so that we can e.g. delta against last-added, but still tell what we need during fetch easily
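A sketch of an index entry carrying both pointers, to show why they are separate. The class, field and function names are illustrative assumptions, not the pack index format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass(frozen=True)
class TextIndexEntry:
    key: str
    delta_parent: Optional[str]       # text this one is a delta against; None for a fulltext
    graph_parents: Tuple[str, ...]    # parents in the per-file history graph

def delta_chain(index: Dict[str, TextIndexEntry], key: str) -> List[str]:
    """Keys needed to rebuild `key`, nearest first, ending at a fulltext."""
    chain = []
    current: Optional[str] = key
    while current is not None:
        chain.append(current)
        current = index[current].delta_parent
    return chain

index = {
    "a": TextIndexEntry("a", None, ()),
    "b": TextIndexEntry("b", "a", ("a",)),
    # "c" descends from "b" in history but is stored as a delta against "a".
    "c": TextIndexEntry("c", "a", ("b",)),
}
```

Here `delta_chain(index, "c")` follows only the delta pointers, so a fetch can work out what it needs to reconstruct a text without walking the history graph, while `graph_parents` still records the real ancestry.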
[11:01] <bialix> luks: ping
[11:01] <luks> hi!
[11:01] <lifeless> if someone digs into the 'pack' command (not the autopack, the manual 'please make me fast daddy' command)
[11:02] <lifeless> then they can fix existing repositories to be considerably smaller, especially now that packs are not annotated.
[11:02] <lifeless> keir: did I just hear a penny dropping ? :)
[11:03] <lifeless> hi bialix
[11:03] <lifeless> anyway, I've some stuff to do here, hope this discussion helps - the 5x thing is definitely a problem, but I'm quite sure you won't see it *that bad* if you start with packs and git and build up a given graph programmatically
[11:04] <luks> hi again :)
[11:04] <bialix> hi, luks, lifeless
[11:05] <bialix> luks, I thinking about separate file for strings, but it seems to few strings at this moment
[11:05] <bialix> ^too few
[11:05] <luks> I'm fine with it as it is
[11:06] <luks> we can move them later if the list will start growing
[11:06] <bialix> ok for me
[11:06] <luks> what still worries me are error messages from bzrlib
[11:06] <luks> but I guess that's not easily fixable for now
[11:06] <bialix> yeah
[11:07] <bialix> duplicate them inside QBzr is not best idea
[11:07] <luks> yep
[11:07] <bialix> luks, I saw you change some plurals-related things in qdiff
[11:08] <luks> yes, I realized I need a plural in the title
[11:08] <luks> and I couldn't figure out how to do it with pygettext
[11:08] <luks> so I switched it to xgettext
[11:08] <bialix> it's may be something strange with my standalone QBzr, but it seems that this code is not working on my win32
[11:08] <luks> hmm
[11:09] <luks> what's the problem?
[11:09] <keir> lifeless, i was looking into this because i was thinking about the compressed keys issue -- if we compress keys into groups of 16, we get a 4x saving on space for storing the keys
[11:09] <bialix> I don't see this messages at all
[11:09] <keir> lifeless, but i wanted to check if the size of the keys was even on the cards compared to pack sizes
[11:10] <bialix> luks: may be it's something related to bug #148409
[11:10] <ubotu> Launchpad bug 148409 in qbzr "qdiff --inline: option has no effect?" [Undecided,New]  https://launchpad.net/bugs/148409
[11:12] <luks> bialix, I think I'm confused now
[11:12] <luks> you mean you don't see the title in qdiff at all?
[11:13] <bialix> luks: because my bad english, or because my bug report?
[11:13] <luks> neither, I just don't see how are these two related
[11:14] <luks> the "X files" title should appear only if you specify more than 2 files to diff on the command line
[11:14] <bialix> no, title in qdiff is present, I don't see "X files" message
[11:15] <bialix> mmm, and when I specify no files?
[11:15] <luks> then you get only "QBzr - Diff"
[11:15] <bialix> aha
[11:15] <luks> and if you open it from qlog then I believe you get "QBzr - Diff - <revision ID>"
[11:16] <luks> maybe this should be unified
[11:16] <luks> it made sense to me back then, but it looks confusing now that I think about it again :)
[11:18] <bialix> here screenshot http://bialix.com/qbzr/qdiff-2-files.png with 2 files in command line
[11:18] <luks> more than two
[11:19] <bialix> yep
[11:19] <bialix> my bad
[11:20] <bialix> but --inline does not work anyway
[11:22] <luks> I know, it used to work, but then I added this new diff viewer and didn't bother to implement it there
[11:22] <luks> but I'm about to rewrite it again, because the diff gets misaligned sometimes
[11:22] <bialix> well, ok, I just want to know it's not win32-specific
[11:22] <luks> no, it's not
[11:23] <bialix> btw, I have some strange bug with diff on my ru.po
[11:23] <bialix> somewhat something breaks html rendering in diff and I saw raw html code
[11:26] <luks> can you mail me the patch that does it?
[11:26] <bialix> here is screenshot: http://bialix.com/qbzr/qdiff-html-bug.png
[11:26] <luks> ouch
[11:26] <bialix> it's from my i18n branch
[11:26] <bialix> it's already on launchpad server
[11:27] <bialix> bzr qdi -c136
[11:27] <luks> I see
[11:28] <luks> I used an ugly hack there, but it doesn't play well with utf-8
[11:33] <luks> hm, maybe not
[11:41] <bialix> luks: good night
[11:41] <luks> night