/srv/irclogs.ubuntu.com/2007/10/06/#bzr.txt

=== Alien_Freak [n=sfaci2@cs-user23.wireless.uic.edu] has joined #bzr
=== cprov is now known as cprov-out
=== ubotu [n=ubotu@ubuntu/bot/ubotu] has joined #bzr
=== AfC [n=andrew@ip67-91-236-171.z236-91-67.customer.algx.net] has joined #bzr
=== NfNit|oop [i=codyc@cpe-70-112-28-217.austin.res.rr.com] has joined #bzr
=== epn [n=whatever@c-69-252-219-78.hsd1.nm.comcast.net] has joined #bzr
=== keir_ [n=keir@206-248-131-150.dsl.teksavvy.com] has joined #bzr
=== BasicOSX [n=BasicOSX@216.243.156.81] has joined #bzr
=== BasicMac [n=BasicOSX@216.243.156.81] has joined #bzr
=== BasicMac is now known as BasicOSX
=== BasicMac [n=BasicOSX@warden.real-time.com] has joined #bzr
=== NamNguyen [n=namnt@cm38.delta196.maxonline.com.sg] has joined #bzr
=== thumper [n=tim@125-236-193-95.adsl.xtra.co.nz] has joined #bzr
=== bitmonk [n=justizin@adsl-75-55-127-69.dsl.sfldmi.sbcglobal.net] has joined #bzr
=== yminsky [n=yminsky@user-0cevcqv.cable.mindspring.com] has joined #bzr
=== BasicOSX [n=BasicOSX@warden.real-time.com] has joined #bzr
=== beuno [n=beuno@44-111-231-201.fibertel.com.ar] has joined #bzr
=== beuno_ [n=beuno@44-111-231-201.fibertel.com.ar] has joined #bzr
=== beuno [n=beuno@44-111-231-201.fibertel.com.ar] has joined #bzr
AfCCan anyone comment on the state of bzr-git? I have _one_ last project that's not in Bazaar yet that is in Git. It'd be nice to somehow convert it into a bzr branch, but I had no luck with tailor (for months of trying)05:49
lifelessjelmer of bzr-svn fame is hacking on it now05:56
lifelessits not up to converting yet, only data inspection05:56
AfClifeless: ok, thanks Robert06:05
AfCI'm going to go ahead and just do a big initial import for now. Maybe I can use --file-ids or whatever later to recover the history06:06
PengWhy not keep it in git for the moment?06:09
AfCPeng: you're asking that HERE, in #bzr?06:10
AfC{sigh}06:10
AfCPeng: but also because I can hardly remember how to use Git and am not really that interested in relearning.06:11
=== AfC just tried to figure out how to `revert` a file and could not decide whether it was `reset` or `reset --hard` or what. Git is horrendous.
PengHeh.06:14
PengI have no experience with git. :P06:14
PengMaybe I should be glad.06:14
=== Peng wanders off.
PengBut if bzr-git is progressing quickly, I was just thinking that it shouldn't be too bad to use it for a little while, especially when otherwise you risk losing the history.06:15
=== Peng wanders off.
AfCPeng: fair enough06:15
AfCPeng: nah, I need to get on with collaborating with someone. I'll wait until there is a way to graft the two branches together.06:16
AfCIt's all cosmetic, of course06:16
AfCJust feelgood factor that you want to recover, mostly06:16
=== BasicOSX [n=BasicOSX@warden.real-time.com] has joined #bzr
=== bitmonk [n=justizin@adsl-76-212-13-68.dsl.pltn13.sbcglobal.net] has joined #bzr
=== keir [n=keir@bas15-toronto12-1168012681.dsl.bell.ca] has joined #bzr
keirlifeless, ping07:08
=== orospakr [n=orospakr@bas4-ottawa23-1088826449.dsl.bell.ca] has joined #bzr
=== beuno [n=beuno@44-111-231-201.fibertel.com.ar] has joined #bzr
lifelesskeir: pong08:05
=== g0ph3r [n=g0ph3r@p57A09E60.dip0.t-ipconnect.de] has joined #bzr
keirlifeless, hey08:09
keirlifeless, did you start on the 4k fanout?08:09
lifelesskeir: putting the finishing touches on bisection08:11
lifelessits 19 roundtrips on a 200MB index08:11
lifelessto get down to a 4K size08:11
keiri was thinking about this08:12
keirfor big indices, why not pad them out to 4k blocks?08:12
lifelessa 4K prelude on the index will give about 16 times granularity, or log(16, 2) = 4 fewer round trips08:12
keirthen we can have a fan out table which selects down to 4k nicely08:12
lifelesshmm, right now I just want to get enough legs on this toy format to survive while the real one comes together08:13
keirof course :)08:13
keirso in a 200mb index, that's ~2.5m keys, right?08:16
keirassuming something like 80 bytes per key/value/refs08:16
lifeless800K keys08:17
lifeless(I gave you a 200M index to play with :))08:17
lifelessfor the current toy format of course08:18
keiryes08:18
keiri am using the old 100mb 0.tix08:18
lifelessoh, was it only 100M. lol08:18
lifeless200MB for it + the rev index and inv index08:19
keirwait, i found 115k keys in that one...08:19
keiri wonder if my parsing code is wrong08:19
keirmost key/val/refs are ~90 bytes, so 115k keys makes sense08:19
keirwait08:20
keiri think i dropped a 008:20
lifelessI think that index is a little unusual because it has converted data08:20
lifelessour native indices have longer keys08:20
keirare you always reading 4k at a time?08:21
lifelessno08:21
keirless?08:22
lifelessminimum of get_recommended_page_size08:22
lifelesswhich transport supplies08:22
keiraah, i see08:22
lifelessa single readv may hit many locations, each of which is fanned out to that figure if its smaller08:22
lifelessso on http we'll read 64k minimum08:22
lifelessbut something like ftp may well choose to read 200K or more, because of the insane effort needed to issue what amounts to a readv08:23
lifelessso its not truly pages in the toy index08:23
lifelesswe read <->08:23
keirso really, a 4k fanout/preamble may be too small08:23
lifelesstransport expands that08:23
lifelesswe get back [......] 08:24
lifelessif the edges of that are not already parsed, we strip up to the first \n08:24
lifelessgiving row\nrow\n....08:24
lifelesswe parse those08:24
lifelessmark the range as parsed08:24
lifelessand the low and high key found in the range08:24
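The transport behaviour described above (each requested range widened to `get_recommended_page_size`, overlapping requests coalesced before the readv goes out) can be sketched roughly as follows. The function name and the exact widening policy are invented for illustration; bzrlib's real readv handling differs in detail:

```python
def expand_offsets(offsets, page_size, total_size):
    """Widen each (start, length) read request to at least page_size
    bytes and coalesce overlapping ranges -- a rough sketch of what
    the transport layer does to a readv request."""
    if not offsets:
        return []
    widened = []
    for start, length in sorted(offsets):
        if length < page_size:
            # Centre the widened window on the original request.
            start = max(0, start - (page_size - length) // 2)
            length = page_size
        widened.append((start, min(total_size, start + length)))
    # Merge ranges that now touch or overlap.
    merged = [widened[0]]
    for start, end in widened[1:]:
        if start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With a 64-byte minimum page, two nearby 10-byte probes collapse into one read, while distant probes stay separate.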
lifelessthe bisection code to drive this is on the commits list08:25
=== abentley [n=abentley@bas8-toronto63-1088754407.dsl.bell.ca] has joined #bzr
lifelessbisect_multi_bytes(content_lookup, size, keys)08:25
lifelesscontent_lookup is a callable that takes a list of (location, key) tuples08:25
lifelessand returns those tuples with an added status: one of (-1, +1, False, result)08:26
lifelesswhere -1 and +1 are 'lower than this location' and higher than..08:26
lifelessFalse is 'cannot be in this index'08:26
lifelessand result is 'return this to the caller'08:26
keirok08:28
lifelessI'm putting the final bits of the content_lookup callable on GraphIndex at the moment08:28
keiri see that the hash based fan out is nice08:28
keirthen merging with bzr.dev?08:29
lifelessthen profile for regressions on local operations08:29
lifelessthen profile for regressions on network operations08:29
lifelessthen send in a [MERGE] to the list for debate08:29
lifelessso this won't add any prelude08:29
keirok08:29
lifelessadding a prelude will simply provide the first 4 left-right jumps within the index at the front, cheaply08:30
keironly first 4?08:31
lifeless4 keys/K08:31
lifeless4*4 == 1608:31
lifelesslog(16,2) == 408:31
keirindex is 4000 single bytes or 2000 shorts?08:32
lifelessso this makes a 64K index achieve 2-round-trip lookups08:32
lifelessthis is the toy index08:32
lifelessstrings08:32
lifelesscurrent format, not sure where you are getting bytes/shorts concepts08:32
=== BasicOSX [n=BasicOSX@warden.real-time.com] has joined #bzr
lifelessthe prelude I'm thinking of for this format is trivial: a list of pseudo keys08:32
lifelessand the byte offset of the start of the first key that sorts after the pseudo key08:33
keiri was thinking the prelude would be the fan out.08:33
lifelessI think we're talking past each other to some degree08:34
lifelessthe stuff I talked about with you the other day was for the format you're working on, with topological grouping08:34
keirit's late here, i probably shouldn't be bothering you!08:34
lifelessthis index is linear sorted08:34
keirlifeless, yes, i realize that08:34
lifelessok cool08:34
keirlifeless, the hash indexing works for linear ordering too08:34
keirwhich is neat08:35
lifelessso given keys AA AB AC BA BB BC08:35
keirusing a hash index you can just glue it on any ordering08:35
lifelesswell, you need a complete hash table08:35
lifelessI'm planning on using the sorted facility of this index to just improve on regular bisection- basically the same as the git fan-out prelude08:35
lifelesswith the keys above, a prelude might look like08:36
keirAA <loc> BB <loc>08:36
lifeless'' 0, 'B' 3008:36
keiryes08:36
keirthat's exactly how my other code works08:37
lifelessthat is, if I'm looking for a key between '' and 'B', I know its between 0 and 3008:37
lifelessok08:38
lifelessso, the reason I want to do this rather than a generalised full-index hash table08:38
lifelessis that this is very simple to code;08:38
lifelesstake all the keys in order08:38
lifelessbisect through them and pull out key, location pairs08:39
lifelessuntil I've got 4K of data.08:39
lifelessstop.08:39
lifelessif I want to, shrink the prelude keys to the smallest unique string at the location I picked up, allowing it to be smaller08:40
keiryes, that's also how my code works. it does it recursively until the top level is 4k08:40
lifelesscool08:41
keirok i lie slightly; it does it bottom up08:41
keirbut same idea08:41
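The build procedure lifeless describes (walk the sorted keys, pull out key/location pairs, shrink each sampled key to the smallest string that still sorts correctly) might look roughly like this. The function names and the every-Nth sampling policy are made up for illustration; the real builder would instead stop once the serialised prelude reached ~4K:

```python
import bisect

def shrink(prev_key, key):
    # Smallest prefix of `key` that still sorts strictly after the
    # key preceding it in the index.
    for i in range(1, len(key) + 1):
        if key[:i] > prev_key:
            return key[:i]
    return key

def build_prelude(entries, every=2):
    """entries: sorted (key, byte offset) pairs.  Sample every
    `every`-th entry and shrink the sampled keys."""
    prelude = [("", 0)]
    for i in range(every, len(entries), every):
        prev_key = entries[i - 1][0]
        key, offset = entries[i]
        prelude.append((shrink(prev_key, key), offset))
    return prelude

def byte_range(prelude, key, index_size):
    # If the key is in the index at all, it lies in this byte range.
    keys = [k for k, _ in prelude]
    i = bisect.bisect_right(keys, key) - 1
    hi = prelude[i + 1][1] if i + 1 < len(prelude) else index_size
    return prelude[i][1], hi
```

This matches the `'' 0, 'B' 30` example earlier: a key between `''` and `'B'` is known to lie between bytes 0 and 30 before any further round trip.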
lifeless:)08:44
=== BasicMac [n=BasicOSX@warden.real-time.com] has joined #bzr
=== jrydberg_ [n=Johan@c80-216-246-123.bredband.comhem.se] has joined #bzr
=== bitmonk [n=justizin@adsl-76-212-13-68.dsl.pltn13.sbcglobal.net] has joined #bzr
=== allenap [n=allenap@87-194-166-60.bethere.co.uk] has joined #bzr
=== fog [n=fog@debian/developer/fog] has joined #bzr
keirlifeless, i just did some excel work. i have the following proposal.09:36
keirstore a 4k preamble which is a histogram of the number of keys in that bin09:36
keirwhere each bin is a 1 byte uchar09:36
keirthen store the usual tag/offset jazz after09:36
keirthis way it fits into 4k09:37
keirfor roughly up to 800k keys09:37
lifelesswhat is a bin precisely09:38
lifelessis it positional09:38
lifelessa key prefix09:38
lifeless?09:38
keirlop the first byte off of each hash09:39
keirsorry09:39
keir3 bytes09:39
keirgrr09:40
keir12 bytes09:40
keirbites09:40
keirbits09:40
=== keir needs sleep
keirthat gives you 1 position in the 4k table (each entry 1 byte)09:40
keirthe 'table' is really a histogram of the number of keys which fell in that bin09:41
keirwhich you count by taking the first 12 bits of each hash, indexing into the table, and incrementing09:41
lifelessI don't understand; which hash? and what does this do for us?09:41
keirin two round trips most of the time you'll have the exact location09:42
keirthe benefit is that by having the client do the cumulative sum to get the offset into the tag part of the hash table, we can store huge tables09:42
lifelessif I understand this correctly09:43
lifelessthen I can paraphrase this as:09:44
lifeless'store a table of hash:list of locations at the front of the file. To allow the table to become very big, store a summary of the table at the front of the table, size limited to 4K.09:46
keiryes09:46
lifelessthe summary of the table lists the number of locations stored against every combination of X bits of hash, to allow direct access to the serialised sparse table once the 4K summary is read.09:47
keiryes09:48
lifelessit sounds like you are making good progress on figuring out how to have a very small 'find a key' logic, while still allowing arbitrary sorts for data locality09:48
keirexactly09:48
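The scheme just paraphrased -- a 4K preamble of one-byte bin counts keyed by the top 12 bits of each key's hash, with the client doing the cumulative sum to find its slice of the tag:offset table -- could be sketched like this. The hash choice (crc32) and the 8-byte record size are placeholders; which hash to actually use is still being debated in the log:

```python
import zlib

BINS = 4096          # 12-bit bins, one unsigned byte each -> 4K preamble
RECORD = 8           # assumed fixed-size tag:offset record, in bytes

def bin_of(key):
    # Top 12 bits of a 32-bit hash select the histogram bin.
    return (zlib.crc32(key) & 0xFFFFFFFF) >> 20

def build_histogram(keys):
    counts = [0] * BINS
    for key in keys:
        counts[bin_of(key)] += 1
    return counts

def record_slice(counts, key):
    """Byte range of `key`'s bin within the tag:offset table that
    follows the preamble: cumulative sum of the earlier bins."""
    b = bin_of(key)
    start = sum(counts[:b]) * RECORD
    return start, start + counts[b] * RECORD
```

Reading the 4K preamble is round trip one; reading the (usually small) record slice is round trip two, hence "in two round trips most of the time you'll have the exact location".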
lifelessI'm fairly uninterested in whacking this into the toy format as yet09:49
lifelessbut very interested in it being in the real format09:49
keirok09:49
keirso for the 800mb case09:49
lifeless800K case? :)09:49
keirexcel tells me it'll be 4k of preamble and ~13mb of tag:offset pairs09:50
keir800k keys rather :)09:50
keirthe nice thing about this is that i can whack it on top of the toy format no problem09:50
keirand then the average bin will be still something around 4k09:52
keiras in, you'll have to grab 4k of data09:52
lifelesshmmm09:52
lifelesstell you what, I'll finish this bisection approach off09:52
keiryes09:53
keiri'll keep working on excel09:53
keirand mail the list09:53
lifelessand think seriously about this hash approach09:53
lifelesswhat hash function were you thinking of using?09:53
keirsha, just because it's Proven09:53
keirbut maybe it's too slow09:53
lifelessits very slow09:53
lifelessalso its cryptographically secure09:54
lifelesswhich is unneeded here09:54
lifelessas we support collisions09:54
keiryes09:55
keircrc32!09:55
keirftw!09:55
lifelesswell09:57
=== mvo [n=egon@p54A64932.dip.t-dialin.net] has joined #bzr
lifelessthat would work, and is bound in python09:57
keirwe'd probably want more bits09:58
lifelesswhy09:58
lifeless800K << 2.6M09:58
lifelessI wonder if hashlittle is available in the stdlib10:00
lifelessso10:04
lifelessWikipedia's list of hash functions10:04
lifelesssuggests that adler32 is about the fastest thing out there10:05
lifelesscrc32 ~= md510:05
lifelessat 3.6* adler10:05
lifelessand sha is 6 * adler10:05
lifelessso I'd use adler up to say 500K keys10:05
keirneat10:06
keiris adler in python?10:06
lifelessyes10:07
lifelessin the zlib module10:07
lifelessand go md5 at 500K keys10:07
lifelessand stay at md510:07
lifelessthis is only data lookup, things are verified by sha as they are reconstructed, so we don't care about hostile modifications10:07
keir"Adler-32 has a weakness for short messages of a few hundred bytes, because the checksums for these messages have a poor coverage of the 32 available bits."10:09
keirthis is bad for us10:09
keirprobably crc32?10:09
keiri suppose we can just try10:09
lifelesstest, test and more testing :)10:09
keiris there a function in your code which gives me the offset of a key given the key? i.e. during the building phase10:15
lifelessno10:19
=== asak [n=alexis@201-1-2-93.dsl.telesp.net.br] has joined #bzr
keirok10:19
lifelessbecause the location cannot be determined without knowing the number of key references before the key + the length of the keys before the key10:19
lifelessso we wait until finish() is invoked to calculate this10:20
keirof course.10:20
keirit seems entirely reasonable to add the hash index as an index on the index... in a separate file10:21
keircool, now that i look back at git's index, this way is nicer10:22
lifelessif you update the 'size of offset needed' logic to understand that there will be a hash table there, it should fit in very nicely10:22
keirby storing the histogram rather than the cumulative sum, we can store fewer bits per hash table entry10:22
keirand nicely enough, given the number of keys, the hash table (including both 4k start and rest) is fixed size10:23
lifelessbut i'm not against extra files if that really helps10:23
lifelessactually the hash table probably isn't fixed size - but it is bounded.10:23
lifelesscollisions will share hash names10:23
keirhmm, this is true10:24
keiri had originally envisioned sha hashes so i was thinking no collisions10:24
lifelesseven sha can collide10:24
lifelessits true we don't know how to *make* it collide10:24
lifelessbut its a fallacy to say it won't :)10:25
keiractually that ruins everything10:25
lifelesshmm?10:25
lifelesscollisions don't ruin anything here10:25
lifelessthe number of bytes in the hash table is:10:26
keirthe whole trick with the histogram relies on your bins being exactly nentries*fixed size10:26
lifelessright10:26
keirooh, i see10:26
keiryou just duplicate the tag10:26
lifelessand you have that10:26
lifelessnah no need10:26
keirduh10:26
lifelesshere:10:26
lifelesstrivial format for the table:10:26
lifelessHASH LOCREF[ LOCREF...] 10:26
keir(well, the table format is pretty trivial!)10:27
lifelessoops10:27
lifelessHASH LOCREF[ LOCREF...] \n10:27
keiri am against delimiters...10:28
keirin this part of the table10:28
keiri'd go like this:10:28
lifelesswell10:28
lifelessif you don't duplicate the hash10:28
keirTAG OFFSET TAG OFFSET TAG OFFSET10:28
lifelessthen a collision moves the table up10:28
keirup?10:28
=== luks [n=lukas@unaffiliated/luks] has joined #bzr
lifelessthe idea of the table summary is to say 'if your hash starts with 010110' then sum up the counts for all hashes in the summary before that10:29
lifelessand you can tell where in the table to read from to read the data for all hashes that start with 01011010:29
keiryes10:29
lifelessby up I mean 'earlier'10:30
keirwhat i'm suggesting, is to dupe the tag and increment the bin anyway10:30
keirthen the reading end needs to know that there may be duped tags10:30
lifelessif its a fixed size, then take the sum of the bins, multiply by the size of a keyref + your hash size and you know where to read10:30
keiri don't see how to do the offset calc when there are delimiters10:30
lifelessif its not a fixed size10:31
lifelessthen it will never be further in the file, but it may be earlier10:31
lifelessbut10:31
keiri think delimiters are a bad idea in this context10:31
=== Zindar [n=erik@stockholm.ardendo.se] has joined #bzr
keirhopefully collisions will be rare enough that it's worth duping the content10:32
keirof course, we'll see in testing10:32
lifelessanyhow, you can calculate the upper bound of collisions in the reader10:32
lifeless4 bins, with 10 keys each - at minimum there are 4 unique hashes, at most 4010:32
lifelessbut its better than that10:32
keirah yes10:32
lifelessif you record 'hash, refs' rather than 'refs' in the bin summary10:33
lifelessthen you can handle any number of collisions and still predict exact location10:33
keirbut then our summary will be either larger or less selective10:34
lifelessright10:34
lifelessso its a tradeoff between larger table when there are collisions and larger summary when there aren't10:35
keirhmm, the keys in 0.tix should be unique right?10:44
lifelessyes10:44
keirwow, lots and lots of crc32 collisions10:51
keiri'm getting 50% collisions10:51
keiri guess that's what happens when you load 800k items10:51
lifelessyah10:54
lifelessdon't use crc3210:54
lifeless:)10:54
lifelessas its meant to be ~ the same as md510:55
lifelessI'd check what distribution you're getting10:55
keirsorry10:56
keirbug10:56
keirin 717k i get 50 collisions10:56
keirfor crc3210:56
keir4000 with adler3210:56
lifelesswhat cpu do you have?10:57
keirthis is just doing the dumbest thing throwing all the keys into a dict() keyed on hash10:57
lifeless4K/717K is nice and low - 0.5%10:57
keirpokey old centrino10:57
lifelessok10:58
lifelessprint "0x%u" % hash('asdfg')10:58
lifeless0x370941258525391914810:58
lifelessI don't think we can use this hash10:58
keirit can hash all of them pretty fast even on this machine10:58
lifelessbut I'm curious what result you get10:58
keir0x-84853717210:59
lifelesshah! so much for unsigned10:59
lifelessanyhow, different number10:59
keirbin = int(sha.sha(key).hexdigest()[:4] , 16)11:00
keiri'm getting 65k collisions with that one11:00
keirthat should be a 32 bit hash right?11:01
lifeless4 bits per char11:01
lifeless4 chars11:01
keirduh11:01
keir5411:02
lifelessreally, use md511:02
lifelessseriously faster11:02
keirjust checking collisions11:02
lifelesssure11:03
keir6111:03
keirwow, so far the winner is the fastest and dumbest: crc32 ftw!11:03
lifeless:)11:03
lifelesshave you tried adler ?11:03
keir400011:03
lifelesshmm, I remember one of adler/crc has endianness/word size issues11:04
keirby the author's admission, it's broken below 128 chars due to design11:04
lifelesswe'll want to check11:04
keirit's adler11:04
keirit's unsuitable11:04
lifelessk11:04
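A collision check along the lines keir describes (throw every key at a hash and count repeats) is easy to reproduce with the stdlib. The helper names are invented, and `md5_32` stands in for the 4-hex-character sha slice from the log, which was in fact only a 16-bit hash -- hence the 65k collisions:

```python
import hashlib
import zlib

def collisions(keys, hash_fn):
    """Count keys whose hash collides with an earlier key's hash."""
    seen = set()
    colliding = 0
    for key in keys:
        h = hash_fn(key)
        if h in seen:
            colliding += 1
        else:
            seen.add(h)
    return colliding

def crc32_hash(key):
    return zlib.crc32(key) & 0xFFFFFFFF

def adler32_hash(key):
    return zlib.adler32(key) & 0xFFFFFFFF

def md5_32(key):
    # First 32 bits of md5; truncating further (e.g. to 16 bits, as the
    # hexdigest()[:4] slice in the log did) inflates collisions sharply.
    return int(hashlib.md5(key).hexdigest()[:8], 16)
```

Truncating a good hash to 16 bits makes birthday collisions unavoidable at these key counts, which is what the 65k-collision result was really measuring.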
lifelessbrb11:05
lifelessback11:12
james_wlifeless: I was asking about disk safe filenames for the patches/threads/loom plugin. I guess I could use a random string and store the mapping.11:15
=== arjenAU [n=arjen@ppp215-29.static.internode.on.net] has joined #bzr
keirlifeless, so if we have 12 bits to pick inside the 4k preamble, we're left with an awkward 20 bit tag. i suggest we use a 44 bit hash; 12 bits for preamble, 32 for tag11:17
keirthen our records are 8 bytes11:17
keirwhich are very cache performant (aligning to 4 bytes good!)11:17
keir0 collisions on 717k11:18
lifelessjames_w: a bit more context would be useful11:18
keir(8 bytes being tag:offset)11:18
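keir's 44-bit proposal (12 bits select the preamble bin, the remaining 32 become the tag in a fixed 8-byte tag:offset record) can be sketched as follows; md5 and big-endian packing are arbitrary choices for the example, not a decided format:

```python
import hashlib
import struct

def split_hash(key):
    """Hypothetical 44-bit split: top 12 bits pick the preamble bin,
    the next 32 bits are the tag stored in the record table."""
    h = int(hashlib.md5(key).hexdigest()[:11], 16)   # 11 hex chars = 44 bits
    return h >> 32, h & 0xFFFFFFFF                   # (bin, tag)

def pack_record(tag, offset):
    # 8-byte record: 32-bit tag + 32-bit offset, aligned as discussed.
    return struct.pack(">II", tag, offset)
```

A fixed 8-byte record is what lets the reader turn a cumulative bin count directly into a byte offset with one multiplication.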
keiri'm off to bed11:20
lifelessnight!11:20
keirit's been a fun thought experiment11:20
lifelesswe're making really good progress I think11:20
keiri'm pretty convinced this is the right approach11:20
keirsuper simple, compact11:21
james_wlifeless: I don't know if you saw on the list (extra revisions in sprout), but I am looking at doing something like the loom plugin. This means that I need to create a new branch. Each 'thread' has a name, so I was going to name the hidden branches the same, so I wanted to know if there was a function to check if a string was safe for writing to disk.11:21
james_wyou suggested that using user supplied input wasn't a good idea.11:21
lifelessI don't understand whats being written to disk11:21
lifelessor what you mean by hidden branch11:21
=== Zindar_ [n=erik@stockholm.ardendo.se] has joined #bzr
james_wfor each thread I create a hidden branch, which is just a branch that is stored in .bzr to suggest to the user that they shouldn't modify it manually. This involves creating a new directory beneath .bzr that is a branch.11:22
lifelessI suggest using some mapping11:23
james_weach of these branches has a name, which I was going to use for the name of the directory, so that given a name I can find the branch.11:23
lifelessand thinking about renames11:23
lifelessare there relationships between these branches11:23
lifelessbbiab11:24
james_wyes, I am working on a single linked list assumption at the moment.11:24
james_wor a stack might be a better term.11:24
james_wIf i remove that you can get git workflow, but it is more work.11:24
=== pbor [n=urk@host140-91-dynamic.11-79-r.retail.telecomitalia.it] has joined #bzr
=== i386 [n=james@203-158-59-54.dyn.iinet.net.au] has joined #bzr
=== mwhudson [n=mwh@62-31-157-102.cable.ubr01.azte.blueyonder.co.uk] has joined #bzr
=== i386 [n=james@ppp239-169.static.internode.on.net] has joined #bzr
Penglifeless (since keir /quit): You might be using the hash functions for different purposes, but what about Mercurial's hash functions? Either their old GNU diff-based one or their new lyhash (by Leonid Yuriev)? http://hg.intevation.org/mercurial/crew/rev/d0c48891dd4a?style=gitweb01:11
=== fleeb [n=chatzill@fleeb.dslbr.toad.net] has joined #bzr
=== AfC [n=andrew@ip67-91-236-171.z236-91-67.customer.algx.net] has joined #bzr
=== yminsky [n=yminsky@user-0cevcqv.cable.mindspring.com] has joined #bzr
=== Zindar_ [n=erik@stockholm.ardendo.se] has left #bzr []
=== pbor [n=urk@host140-91-dynamic.11-79-r.retail.telecomitalia.it] has joined #bzr
=== herzel104 [i=herzel@gateway/tor/x-494dd13d5dfbe90d] has joined #bzr
=== jeremyb_ [n=jeremy@unaffiliated/jeremyb] has joined #bzr
=== quatauta [n=quatauta@pD9E260C8.dip.t-dialin.net] has joined #bzr
=== bitmonk [n=justizin@adsl-76-212-13-68.dsl.pltn13.sbcglobal.net] has joined #bzr
=== seanhodges [n=sean@90.240.81.130] has joined #bzr
=== schierbeck [n=daniel@dasch.egmont-kol.dk] has joined #bzr
=== Demitar [n=demitar@c-212-031-182-147.cust.broadway.se] has joined #bzr
=== phanatic [n=phanatic@3e70d9be.adsl.enternet.hu] has joined #bzr
=== AfC [n=andrew@63.116.222.46] has joined #bzr
=== orutherfurd [n=orutherf@dsl092-164-022.wdc2.dsl.speakeasy.net] has joined #bzr
=== NamNguyen [n=NamNguye@cm38.delta196.maxonline.com.sg] has joined #bzr
=== schierbeck [n=daniel@dasch.egmont-kol.dk] has joined #bzr
schierbeckhello lads04:37
=== phanatic [n=phanatic@3e70d9be.adsl.enternet.hu] has joined #bzr
=== cfbolz [n=cfbolz@p54AB9E54.dip0.t-ipconnect.de] has joined #bzr
=== cprov [n=cprov@canonical/launchpad/cprov] has joined #bzr
=== fog [n=fog@debian/developer/fog] has left #bzr []
=== Gwaihir [n=Gwaihir@ubuntu/member/gwaihir] has joined #bzr
=== yminsky [n=yminsky@user-0cevcqv.cable.mindspring.com] has joined #bzr
=== keir [n=keir@bas15-toronto12-1168010516.dsl.bell.ca] has joined #bzr
=== bitmonk [n=justizin@adsl-76-212-13-68.dsl.pltn13.sbcglobal.net] has joined #bzr
=== luks [n=lukas@unaffiliated/luks] has joined #bzr
=== Vernius_ [n=tomger@p508AEB28.dip.t-dialin.net] has joined #bzr
=== Mez [n=Mez@ubuntu/member/mez] has joined #bzr
=== zyga [n=zyga@ubuntu/member/zyga] has joined #bzr
=== yminsky [n=yminsky@user-0cevcqv.cable.mindspring.com] has left #bzr []
=== zyga [n=zyga@ubuntu/member/zyga] has joined #bzr
=== s|k [n=bjorn@c-69-181-8-54.hsd1.ca.comcast.net] has joined #bzr
s|khi09:03
=== sevrin [n=sevrin@ns1.clipsalportal.com] has joined #bzr
keirwhy are the new bzr packs still ~5x larger than git packs?09:17
luksbzr packs are really just knits written into bigger files09:18
luksbut the structure of data in bzr and git is very different, anyway09:18
keirthe bzr packs don't compress at all either...09:19
keirhmm09:20
keirgit is snapshot based09:20
keirbut internally just uses heuristics to store the diffs09:20
luksknits are gzipped09:20
luksgzipped text deltas09:20
keirthat the complete linux kernel history is 100mb, is pretty nice!09:20
keir(for git)09:20
keirinteresting, i'll have to look into this09:21
keirthe tiny size for git repos is really nice for branching projects -- tiny download time!09:22
keiris there docs of what a knit is?09:24
keiri don't see it clearly explained in the developers/ dir in docs09:24
keiri see, so text deltas rather than a more efficient xdiff type binary encoding09:27
fullermdAnd occasional fulltext copies.  That adds up size on long histories.09:34
keirgit does that too09:36
keiri still don't see how it's a 5x size difference though09:36
fullermdNot in the same sense, AIUI.09:36
luksannotations add a lot, too. or are they dropped from packs already?09:37
keirhmm. annotations are very important for some projects (gnome, etc)09:39
fullermdThat doesn't mean storing them is necessary.09:40
keirwell, the issue gnome had with svn was that annotate was very slow compared to cvs09:40
keirback in the day it was a blocker to svn conversion09:40
=== mm_202 [n=mm_202@216.106.29.84] has joined #bzr
=== mm_202 [n=mm_202@216.106.29.84] has left #bzr ["Goodbye."]
fullermdWell, annotations are also important to things like annotation-based merges.09:43
fullermdWhat both end up meaning is that annotations should be get-able "fast enough".  Which means either finding a blazing fast way to derive them on the fly, or caching them.09:44
fullermdWith knits, they're cached at commit time, which has a lot of drawbacks.  Space usage, CPU and I/O time to add/retrieve them on all operations.09:44
fullermdMakes it very hard to use a non-line-based delta algorithm, since annotations are line-based (my brain rebels at reading a byte-wise 'annotate' output)09:45
fullermdRelated, it means they're hard-coded based on the 'diff' algorithm used at commit-time.  That can suck.09:45
fullermdAIUI, the plan with packs is to add the capability to gen and store an external cache when they become needed, independent of the actual text store.09:46
keirsensible09:47
keiris there a fast mpdiff implementation already?09:55
keiranyone used bzr-git?09:55
keiri can't get it to work at all09:55
keirand there is no docs whatsoever09:55
keirbzr branch ../git ---> 'GitRepository' object has no attribute '_format'09:55
=== dholmster [i=dholm@195.198.115.93] has joined #bzr
fullermdI thought I heard it wasn't up to branching.09:57
keirthere is not a single example command showing how to use any of its functionality, so i'm guessing09:58
keirso bzr-git is, at the moment, entirely useless?09:58
fullermdI think lifeless said it's only "there" enough to look at history (presumably 'log' and the like)09:58
=== hstuart [n=hstuart@0x503e9965.virnxx12.adsl-dhcp.tele.dk] has joined #bzr
=== zyga [n=zyga@ubuntu/member/zyga] has joined #bzr
=== cprov [n=cprov@canonical/launchpad/cprov] has joined #bzr
lifelesskeir: bzr-git can show history but is not complete enough to pull data across10:45
lifelessluks: knits have annotations, we won't be changing the knits repo format, so that won't change; packs have knits though10:45
lifelesserm, packs have the deltas, and they are not annotated10:45
lukswell, that's what I meant10:46
luksI just wasn't sure if packs still have annotations10:46
lifelessok10:46
lifelessthey don't10:46
lifelesspacks are something like 10% smaller than bzr's knit based repositories10:47
keirlifeless, why are they so much larger than git repos?10:47
jelmer'morning *10:47
keirbecause libxdiff (what git uses) is 5x better than line diff + gzip?10:48
lifelesskeir: two reasons that I know of; one is that we know we can do better on size - john and aaron have done lots of experiments on this - but we haven't moved their research into a disk format [yet] .10:48
lifelessin fact thats a general enough statement that it covers the second reason too :)10:49
lifelesshi jelmer10:49
keirlifeless, ok :)10:49
lifelessPeng: thanks; I wasn't aware of lyhash10:49
Penglifeless: :)10:49
lifelesskeir: so I would expect the next iteration of packs to bring in some of their improvements which will increase performance more, decrease size, etc.10:50
lifelesskeir: I'm curious though where you are getting the size comparison from?10:51
lifelesskeir: is it a tree with the same history in both formats?10:51
keirlifeless, order of magnitude comparisons of git tree vs bzr tree10:51
keirsame source size10:51
keirsimilar history size10:51
lifelesswell10:51
lifelesssimilar can cover a multitude of sins10:51
keirtrue10:51
keiri was trying to convert the git repo into bzr packs10:52
keirto do a proper comparison10:52
lifelessyah, thats not ready yet10:52
lifelessoh, also, a thought - if you were doing du -sh .bzr10:52
lifelessthere is the obsolete_packs directory10:52
keirit's empty10:53
lifelessok10:53
lifeless:)10:53
keirall the size is in the single .pack10:53
keir54mb10:53
lifelesshow old was the project when it was in knit format ?10:53
keircompared to the git 15mb one10:53
keirthis is packs.packs10:53
keirknit is similar size10:53
keir.bzr in packs is 61mb10:54
lifelessuhm, I wasn't clear10:54
keirin knits it's 71mb10:54
lifelessold knit generation code generated bigger knits10:54
lifelessfor two fairly dumb in hindsight reasons10:54
lifelessthe first is the python GzipFile uses zlib compression level 9 by default for some insane reason10:54
keirhmm10:55
lifelessthis generates bigger output in a non trivial % of cases10:55
lifelessI saved a surprising amount of disk in the moz repo tests by converting that to -1, not to mention a bucketload of time10:55
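The compression-level point is easy to demonstrate with the stdlib: level 9 costs much more CPU than the default (-1 maps to Z_DEFAULT_COMPRESSION, level 6) for usually marginal, and occasionally negative, size gains. The sample text here is invented; a minimal check:

```python
import zlib

text = b"def example():\n    return 42\n" * 200

level9 = zlib.compress(text, 9)    # what GzipFile historically defaulted to
default = zlib.compress(text, -1)  # Z_DEFAULT_COMPRESSION (level 6)

# Both round-trip losslessly; which of the two is smaller varies
# with the input, which is why level 9 is not a free win.
assert zlib.decompress(level9) == text
assert zlib.decompress(default) == text
```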
lifelessthe second reason is that we started with a full text every 50 commits no matter what10:56
lifeless(per file history)10:56
keirah, i see10:56
lifelessand the inventory file - which is a) big and b) changes on every commit10:56
lifelesslets just say that it became a dominating factor in several repositories10:56
lifelessnow we use a heuristic that says 'if the size of the deltas == the size of the file, write a new full text'10:57
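That heuristic fits in a few lines; the function name and the exact comparison are illustrative, not bzrlib's code:

```python
def needs_fulltext(chain_delta_sizes, new_delta_size, fulltext_size):
    """Sketch of the snapshot heuristic described above: once the
    delta chain for a text would cost as much as the text itself,
    store a new fulltext instead of extending the chain."""
    return sum(chain_delta_sizes) + new_delta_size >= fulltext_size
```

Compared with "a fulltext every 50 commits no matter what", this adapts to how quickly each file actually churns.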
lifelessoh, and theres another reason too10:57
lifelesswe delta against history always10:57
PengTo get more efficient knits, what, re-clone?10:57
lifelessnot against best size/closest match/introduction to repo10:58
lifelesswhich means if you imagine a Y graph10:58
lifelessif at the point just before the split you are due-for-a-full-text10:58
lifelessboth arms will get a fulltext, one each.10:58
lifelessPeng: by diffing against history its trivial to be sure that you have all the deltas needed to reconstruct a text during branching10:59
lifelessbecause you can take all the inventories; grab the texts they want as either fulltext-or-delta, and you're done10:59
lifelessif you have arbitrary deltas, you need to grab other data that you dont want in general, replace the local delta for the thing that referenced it with another, then you're done11:00
lifelessthis is one reason pack text indices have two pointers11:00
lifelessone to another text for delta parent11:00
keiraah, a history-parent and delta parent11:00
lifelessone to a list of parents for file graph11:01
=== bialix [i=chatzill@77.109.22.134] has joined #bzr
lifelessso that we can e.g. delta against last-added, but still tell what we need during fetch easily11:01
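[An illustrative sketch (all names hypothetical) of why a pack text index record carries two kinds of pointers, as lifeless describes: one delta parent used purely for reconstruction, plus the per-file graph parents used to decide what to fetch.]

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TextIndexRecord:
    key: str
    delta_parent: Optional[str]      # text this one is stored as a delta against
    file_graph_parents: tuple = ()   # per-file ancestry, consulted during fetch

def reconstruction_chain(index, key):
    """Follow delta parents back to a fulltext (delta_parent is None)."""
    chain = []
    while key is not None:
        chain.append(key)
        key = index[key].delta_parent
    return chain

index = {
    "v1": TextIndexRecord("v1", None),
    "v2": TextIndexRecord("v2", "v1", ("v1",)),
    "v3": TextIndexRecord("v3", "v2", ("v2",)),
}
print(reconstruction_chain(index, "v3"))  # -> ['v3', 'v2', 'v1']
```

Keeping the two pointers separate is what lets the delta parent be chosen freely (e.g. last-added) without obscuring the history graph needed at fetch time.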
bialixluks: ping11:01
lukshi!11:01
lifelessif someone digs into the 'pack' command (not the autopack, the manual 'please make me fast daddy' command)11:01
lifelessthen they can fix existing repositories to be considerably smaller, especially now that packs are not annotated.11:02
lifelesskeir: did I just hear a penny dropping ? :)11:02
lifelesshi bialix11:03
lifelessanyway, I've some stuff to do here, hope this discussion helps - the 5x thing is definitely a problem, but I'm quite sure you won't see it *that bad* if you start with packs and git and build up a given graph programmatically11:03
=== bialix [i=chatzill@77-109-18-147.dynamic.peoplenet.ua] has joined #bzr
lukshi again :)11:04
bialixhi, luks, lifeless11:04
bialixluks, I'm thinking about a separate file for strings, but it seems there are too few strings at this moment11:05
luksI'm fine with it as it is11:05
lukswe can move them later if the list will start growing11:06
bialixok for me11:06
lukswhat still worries me are error messages from bzrlib11:06
luksbut I guess that's not easily fixable for now11:06
bialixyeah11:06
bialixduplicating them inside QBzr is not the best idea11:07
luksyep11:07
bialixluks, I saw you change some plurals-related things in qdiff11:07
luksyes, I realized I need a plural in the title11:08
luksand I couldn't figure out how to do it with pygettext11:08
luksso I switched it to xgettext11:08
bialixit may be something strange with my standalone QBzr, but it seems that this code is not working on my win3211:08
lukshmm11:08
lukswhat's the problem?11:09
keirlifeless, i was looking into this because i was thinking about the compressed keys issue -- if we compress keys into groups of 16, we get a 4x saving on space for storing the keys11:09
bialixI don't see these messages at all11:09
keirlifeless, but i wanted to check if the size of the keys was even on the cards compared to pack sizes11:09
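[A sketch of the group-of-16 key compression keir mentions: store the first key of each group in full, and each later key as (length of prefix shared with the previous key, remaining suffix). Long common prefixes in revision ids make this pay off; the exact on-disk layout here is hypothetical.]

```python
import os

def compress_group(keys):
    """Prefix-compress a sorted group of keys (e.g. 16 at a time)."""
    out = [keys[0]]
    for prev, cur in zip(keys, keys[1:]):
        n = len(os.path.commonprefix([prev, cur]))
        out.append((n, cur[n:]))
    return out

def decompress_group(compressed):
    keys = [compressed[0]]
    for n, suffix in compressed[1:]:
        keys.append(keys[-1][:n] + suffix)
    return keys

# Revision-id-like keys share a long prefix, so most entries shrink to
# a couple of bytes of suffix.
keys = ["pqm@pqm.ubuntu.com-200710060%02d" % i for i in range(16)]
assert decompress_group(compress_group(keys)) == keys
```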
bialixluks: maybe it's something related to bug #14840911:10
ubotuLaunchpad bug 148409 in qbzr "qdiff --inline: option has no effect?" [Undecided,New]  https://launchpad.net/bugs/14840911:10
luksbialix, I think I'm confused now11:12
luksyou mean you don't see the title in qdiff at all?11:12
bialixluks: because of my bad English, or because of my bug report?11:13
luksneither, I just don't see how these two are related11:13
luksthe "X files" title should appear only if you specify more than 2 files to diff on the command line11:14
bialixno, title in qdiff is present, I don't see "X files" message11:14
bialixmmm, and when I specify no files?11:15
luksthen you get only "QBzr - Diff"11:15
bialixaha11:15
luksand if you open it from qlog then I believe you get "QBzr - Diff - <revision ID>"11:15
luksmaybe this should be unified11:16
luksit made sense to me back then, but it looks confusing now that I think about it again :)11:16
bialixhere's a screenshot http://bialix.com/qbzr/qdiff-2-files.png with 2 files on the command line11:18
luksmore than two11:18
bialixyep11:19
bialixmy bad11:19
bialixbut --inline does not work anyway11:20
luksI know, it used to work, but then I added this new diff viewer and didn't bother to implement it there11:22
luksbut I'm about to rewrite it again, because the diff gets misaligned sometimes11:22
bialixwell, ok, I just want to know it's not win32-specific11:22
luksno, it's not11:22
bialixbtw, I have some strange bug with diff on my ru.po11:23
bialixsomehow something breaks HTML rendering in diff and I saw raw HTML code11:23
lukscan you mail me the patch that does it?11:26
bialixhere is screenshot: http://bialix.com/qbzr/qdiff-html-bug.png11:26
luksouch11:26
bialixit's from my i18n branch11:26
bialixit's already on the Launchpad server11:26
bialixbzr qdi -c13611:27
luksI see11:27
luksI used an ugly hack there, but it doesn't play well with utf-811:28
lukshm, maybe not11:33
bialixluks: good night11:41
luksnight11:41
=== zyga [n=zyga@ubuntu/member/zyga] has joined #bzr
=== Demitar [n=demitar@c-212-031-182-147.cust.broadway.se] has joined #bzr

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!