=== Alien_Freak [n=sfaci2@cs-user23.wireless.uic.edu] has joined #bzr | ||
=== cprov is now known as cprov-out | ||
=== ubotu [n=ubotu@ubuntu/bot/ubotu] has joined #bzr | ||
=== AfC [n=andrew@ip67-91-236-171.z236-91-67.customer.algx.net] has joined #bzr | ||
=== NfNit|oop [i=codyc@cpe-70-112-28-217.austin.res.rr.com] has joined #bzr | ||
=== epn [n=whatever@c-69-252-219-78.hsd1.nm.comcast.net] has joined #bzr | ||
=== keir_ [n=keir@206-248-131-150.dsl.teksavvy.com] has joined #bzr | ||
=== BasicOSX [n=BasicOSX@216.243.156.81] has joined #bzr | ||
=== BasicMac [n=BasicOSX@216.243.156.81] has joined #bzr | ||
=== BasicMac is now known as BasicOSX | ||
=== BasicMac [n=BasicOSX@warden.real-time.com] has joined #bzr | ||
=== NamNguyen [n=namnt@cm38.delta196.maxonline.com.sg] has joined #bzr | ||
=== thumper [n=tim@125-236-193-95.adsl.xtra.co.nz] has joined #bzr | ||
=== bitmonk [n=justizin@adsl-75-55-127-69.dsl.sfldmi.sbcglobal.net] has joined #bzr | ||
=== yminsky [n=yminsky@user-0cevcqv.cable.mindspring.com] has joined #bzr | ||
=== BasicOSX [n=BasicOSX@warden.real-time.com] has joined #bzr | ||
=== beuno [n=beuno@44-111-231-201.fibertel.com.ar] has joined #bzr | ||
=== beuno_ [n=beuno@44-111-231-201.fibertel.com.ar] has joined #bzr | ||
=== beuno [n=beuno@44-111-231-201.fibertel.com.ar] has joined #bzr | ||
AfC | Can anyone comment on the state of bzr-git? I have _one_ last project that's not in Bazaar yet that is in Git. It'd be nice to somehow convert it into a bzr branch, but I had no luck with tailor (for months of trying) | 05:49 |
lifeless | jelmer of bzr-svn fame is hacking on it now | 05:56 |
lifeless | its not up to converting yet, only data inspection | 05:56 |
AfC | lifeless: ok, thanks Robert | 06:05 |
AfC | I'm going to go ahead and just do a big initial import for now. Maybe I can use --file-ids or whatever later to recover the history | 06:06 |
Peng | Why not keep it in git for the moment? | 06:09 |
AfC | Peng: you're asking that HERE, in #bzr? | 06:10 |
AfC | {sigh} | 06:10 |
AfC | Peng: but also because I can hardly remember how to use Git and am not really that interested in relearning. | 06:11 |
=== AfC just tried to figure out how to `revert` a file and could decide whether it was `reset` or `reset --hard` or what. Git is horrendous. | ||
AfC | s/could/could not/ | 06:12 |
Peng | Heh. | 06:14 |
Peng | I have no experience with git. :P | 06:14 |
Peng | Maybe I should be glad. | 06:14 |
Peng | But if bzr-git is progressing quickly, I was just thinking that it shouldn't be too bad to use it for a little while, especially when otherwise you risk losing the history. | 06:15 |
=== Peng wanders off. | ||
AfC | Peng: fair enough | 06:15 |
AfC | Peng: nah, I need to get on with collaborating with someone. I'll wait until there is a way to graft the two branches together. | 06:16 |
AfC | It's all cosmetic, of course | 06:16 |
AfC | Just feelgood factor that you want to recover, mostly | 06:16 |
=== BasicOSX [n=BasicOSX@warden.real-time.com] has joined #bzr | ||
=== bitmonk [n=justizin@adsl-76-212-13-68.dsl.pltn13.sbcglobal.net] has joined #bzr | ||
=== keir [n=keir@bas15-toronto12-1168012681.dsl.bell.ca] has joined #bzr | ||
keir | lifeless, ping | 07:08 |
=== orospakr [n=orospakr@bas4-ottawa23-1088826449.dsl.bell.ca] has joined #bzr | ||
=== beuno [n=beuno@44-111-231-201.fibertel.com.ar] has joined #bzr | ||
lifeless | keir: pong | 08:05 |
=== g0ph3r [n=g0ph3r@p57A09E60.dip0.t-ipconnect.de] has joined #bzr | ||
keir | lifeless, hey | 08:09 |
keir | lifeless, did you start on the 4k fanout? | 08:09 |
lifeless | keir: putting the finishing touches on bisection | 08:11 |
lifeless | its 19 roundtrips on a 200MB index | 08:11 |
lifeless | to get down to a 4K size | 08:11 |
keir | i was thinking about this | 08:12 |
keir | for big indices, why not pad them out to 4k blocks? | 08:12 |
lifeless | a 4K prelude on the index will give about 16 times granularity, or log(16, 2) = 4 fewer round trips | 08:12 |
keir | then we can have a fan out table which selects down to 4k nicely | 08:12 |
lifeless | hmm, right now I just want to get enough legs on this toy format to survive while the real one comes together | 08:13 |
keir | of course :) | 08:13 |
keir | so in a 200mb index, that's ~2.5m keys, right? | 08:16 |
keir | assuming something like 80 bytes per key/value/refs | 08:16 |
lifeless | 800K keys | 08:17 |
lifeless | (I gave you a 200M index to play with :)) | 08:17 |
lifeless | for the current toy format of course | 08:18 |
keir | yes | 08:18 |
keir | i am using the old 100mb 0.tix | 08:18 |
lifeless | oh, was it only 100M. ll | 08:18 |
lifeless | lol | 08:18 |
lifeless | 200MB for it + the rev index and inv index | 08:19 |
keir | wait, i found 115k keys in that one... | 08:19 |
keir | i wonder if my parsing code is wrong | 08:19 |
keir | most key/val/refs are ~90 bytes, so 115k keys makes sense | 08:19 |
keir | wait | 08:20 |
keir | i think i dropped a 0 | 08:20 |
lifeless | I think that index is a little unusual because it has converted data | 08:20 |
lifeless | our native indices have longer keys | 08:20 |
keir | are you always reading 4k at a time? | 08:21 |
lifeless | o | 08:21 |
lifeless | no | 08:21 |
keir | less? | 08:22 |
lifeless | minimum of get_recommended_page_size | 08:22 |
lifeless | which transport supplies | 08:22 |
keir | aah, i see | 08:22 |
lifeless | a single readv may hit many locations, each of which is fanned out to that figure if its smaller | 08:22 |
lifeless | so on http we'll read 64k minimum | 08:22 |
lifeless | but something like ftp may well choose to read 200K or more, because of the insane effort needed to issue what amounts to a readv | 08:23 |
lifeless | so its not truly pages in the toy index | 08:23 |
lifeless | read read <-> | 08:23 |
lifeless | we read <-> | 08:23 |
keir | so really, a 4k fanout/preamble may be too small | 08:23 |
lifeless | transport expands that | 08:23 |
lifeless | we get back [......] | 08:24 |
lifeless | if the edges of that are not already parsed, we strip up to the first \n | 08:24 |
lifeless | giving row\nrow\n.... | 08:24 |
lifeless | we parse those | 08:24 |
lifeless | mark the range as parsed | 08:24 |
lifeless | and the low and high key found in the range | 08:24 |
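A minimal sketch of that parsing step, assuming one row per line with '\0'-separated fields as in the toy format (the function name and return shape are invented):

```python
def parse_range(offset, data):
    """Trim a readv result to whole rows; return the parsed byte range
    with its low/high keys, plus the rows. Assumes the range contains
    at least one complete row."""
    if offset != 0:
        first = data.index('\n') + 1           # strip up to the first \n
        data, offset = data[first:], offset + first
    end = data.rindex('\n') + 1                # drop any trailing partial row
    rows = data[:end].split('\n')[:-1]         # row\nrow\n... -> [row, row]
    keys = [row.split('\0', 1)[0] for row in rows]
    # mark (offset, offset + end) as parsed, with the low and high key found
    return (offset, offset + end, keys[0], keys[-1]), rows
```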
lifeless | the bisection code to drive this is on the commits list | 08:25 |
=== abentley [n=abentley@bas8-toronto63-1088754407.dsl.bell.ca] has joined #bzr | ||
lifeless | bisect_multi_bytes(content_lookup, size, keys) | 08:25 |
lifeless | content_lookup is a callable that takes a list of (location, key) tuples | 08:25 |
lifeless | and returns those tuples with an added status: one of (-1, +1, False, result) | 08:26 |
lifeless | where -1 and +1 are 'lower than this location' and higher than.. | 08:26 |
lifeless | False is 'cannot be in this index' | 08:26 |
lifeless | and result is 'return this to the caller' | 08:26 |
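A toy, in-memory content_lookup honouring the four statuses just described (the real callable lives on GraphIndex and does readv plus caching; everything here is illustrative):

```python
def make_content_lookup(index_bytes):
    def content_lookup(location_keys):
        result = []
        for location, key in location_keys:
            # find the whole row containing `location`
            start = index_bytes.rfind('\n', 0, location) + 1
            end = index_bytes.find('\n', location)
            if end == -1:
                end = len(index_bytes)
            row = index_bytes[start:end]
            row_key = row.split('\0', 1)[0]
            if key < row_key:
                status = -1            # lower than this location
            elif key > row_key:
                status = +1            # higher than this location
            else:
                status = (key, row)    # a result: hand it to the caller
            # a fully-parsed region proving absence would yield False instead
            result.append((location, key, status))
        return result
    return content_lookup

# driven as: bisect_multi_bytes(make_content_lookup(data), len(data), keys)
```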
keir | ok | 08:28 |
lifeless | I'm putting the final bits of the content_lookup callable on GraphIndex at the moment | 08:28 |
keir | i see that the hash based fan out is nice | 08:28 |
keir | then merging with bzr.dev? | 08:29 |
lifeless | then profile for regressions on local operations | 08:29 |
lifeless | then profile for regressions on network operations | 08:29 |
lifeless | then send in a [MERGE] to the list for debate | 08:29 |
lifeless | so this won't add any prelude | 08:29 |
keir | ok | 08:29 |
lifeless | adding a prelude will simply provide the first 4 left-right jumps within the index at the front, cheaply | 08:30 |
keir | only first 4? | 08:31 |
lifeless | 4 keys/K | 08:31 |
lifeless | 4*4 == 16 | 08:31 |
lifeless | log(16,2) == 4 | 08:31 |
keir | index is 4000 single bytes or 2000 shorts? | 08:32 |
lifeless | so this makes a 64K index achieve 2-round-trip lookups | 08:32 |
lifeless | this is the toy index | 08:32 |
lifeless | strings | 08:32 |
lifeless | current format, not sure where you are getting bytes/shorts concepts | 08:32 |
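The arithmetic behind those figures, as a quick check (4K pages assumed):

```python
import math

# plain bisection down to a 4K page of a 200MB index:
math.log(200 * 1024 * 1024 / 4096.0, 2)   # ~15.6 halvings, the same order
                                          # as the 19 round trips quoted above

# a 4K prelude at ~4 keys per KB holds 16 pointers (4 * 4 == 16), so the
# first log(16, 2) == 4 left/right jumps come free with a single read.
# For a 64K index: trip 1 reads the prelude and narrows to 64K/16 == 4K,
# trip 2 reads that page -- hence "2-round-trip lookups".
```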
=== BasicOSX [n=BasicOSX@warden.real-time.com] has joined #bzr | ||
lifeless | the prelude I'm thinking of for this format is trivial: a list of pseudo keys | 08:32 |
lifeless | and the byte offset of the start of the first key that sorts after the pseudo key | 08:33 |
keir | i was thinking the prelude would be the fan out. | 08:33 |
lifeless | I think we're talking past each other to some degree | 08:34 |
lifeless | the stuff I talked about with you the other day was for the format you're working on, with topological grouping | 08:34 |
keir | it's late here, i probably shouldn't be bothering you! | 08:34 |
lifeless | this index is linear sorted | 08:34 |
keir | lifeless, yes, i realize that | 08:34 |
lifeless | ok cool | 08:34 |
keir | lifeless, the hash indexing works for linear ordering too | 08:34 |
keir | which is neat | 08:35 |
lifeless | so given keys AA AB AC BA BB BC | 08:35 |
keir | using a hash index you can just glue it on any ordering | 08:35 |
lifeless | well, you need a complete hash table | 08:35 |
lifeless | I'm planning on using the sorted facility of this index to just improve on regular bisection- basically the same as the git fan-out prelude | 08:35 |
lifeless | with the keys above, a prelude might look like | 08:36 |
keir | AA <loc> BB <loc> | 08:36 |
lifeless | '' 0, 'B' 30 | 08:36 |
keir | yes | 08:36 |
keir | that's exactly how my other code works | 08:37 |
lifeless | that is, if I'm looking for a key between '' and 'B', I know its between 0 and 30 | 08:37 |
lifeless | ok | 08:38 |
lifeless | so, the reason I want to do this rather than a generalised full-index hash table | 08:38 |
lifeless | is that this is very simple to code; | 08:38 |
lifeless | take all the keys in order | 08:38 |
lifeless | bisect through them and pull out key, location pairs | 08:39 |
lifeless | until I've got 4K of data. | 08:39 |
lifeless | stop. | 08:39 |
lifeless | if I want to, shrink the prelude keys to the smallest unique string at the location I picked up, allowing it to be smaller | 08:40 |
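A sketch of that build step, assuming `entries` is the sorted list of (key, byte_offset) pairs known at finish() time (the size estimate and return shape are invented); the optional shrinking pass would then replace each picked key with its shortest unique prefix:

```python
def build_prelude(entries, budget=4096):
    picked, size = {}, 0
    spans = [(0, len(entries))]            # breadth-first bisection, so
    while spans and size < budget:         # coverage stays even when the
        lo, hi = spans.pop(0)              # 4K budget runs out
        if hi - lo < 2:
            continue
        mid = (lo + hi) // 2
        if mid not in picked:
            key, offset = entries[mid]
            picked[mid] = (key, offset)
            size += len(key) + len(str(offset)) + 2   # "key offset\n"
        spans.append((lo, mid))
        spans.append((mid, hi))
    return [picked[i] for i in sorted(picked)]        # in key order
```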
keir | yes, that's also how my code works. it does it recursively until the top level is 4k | 08:40 |
lifeless | cool | 08:41 |
keir | ok i lie slightly; it does it bottom up | 08:41 |
keir | but same idea | 08:41 |
lifeless | :) | 08:44 |
=== BasicMac [n=BasicOSX@warden.real-time.com] has joined #bzr | ||
=== jrydberg_ [n=Johan@c80-216-246-123.bredband.comhem.se] has joined #bzr | ||
=== bitmonk [n=justizin@adsl-76-212-13-68.dsl.pltn13.sbcglobal.net] has joined #bzr | ||
=== allenap [n=allenap@87-194-166-60.bethere.co.uk] has joined #bzr | ||
=== fog [n=fog@debian/developer/fog] has joined #bzr | ||
keir | lifeless, i just did some excel work. i have the following proposal. | 09:36 |
keir | store a 4k preamble which is a histogram of the number of keys in that bin | 09:36 |
keir | where each bin is a 1 byte uchar | 09:36 |
keir | then store the usual tag/offset jazz after | 09:36 |
keir | this way it fits into 4k | 09:37 |
keir | for roughly up to 800k keys | 09:37 |
lifeless | what is a bin precisely | 09:38 |
lifeless | is it positional | 09:38 |
lifeless | a key prefix | 09:38 |
lifeless | ? | 09:38 |
keir | lop the first byte off of each hash | 09:39 |
keir | sorry | 09:39 |
keir | 3 bytes | 09:39 |
keir | grr | 09:40 |
keir | 12 bytes | 09:40 |
keir | bites | 09:40 |
keir | bits | 09:40 |
=== keir needs sleep | ||
keir | that gives you 1 position in the 4k table (each entry 1 byte) | 09:40 |
keir | the 'table' is really a histogram of the number of keys which fell in that bin | 09:41 |
keir | which you count by taking the first 12 bits of each hash, indexing into the table, and incrementing | 09:41 |
lifeless | I don't understand; which hash? and what does this do for us? | 09:41 |
keir | in two round trips most of the time you'll have the exact location | 09:42 |
keir | the benefit is that by having the client do the cumulative sum to get the offset into the tag part of the hash table, we can store huge tables | 09:42 |
lifeless | if I understand this correctly | 09:43 |
lifeless | then I can paraphrase this as: | 09:44 |
lifeless | 'store a table of hash:list of locations at the front of the file. To allow the table to become very big, store a summary of the table at the front of the table, size limited to 4K. | 09:46 |
keir | yes | 09:46 |
lifeless | the summary of the table lists the number of locations stored against every combination of X bits of hash, to allow direct access to the serialised sparse table once the 4K summary is read. | 09:47 |
keir | yes | 09:48 |
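A sketch of the proposal as paraphrased, with crc32 standing in for the still-undecided hash and a fixed 8-byte record assumed:

```python
import zlib

BINS = 4096   # 2**12 one-byte bins -> exactly a 4K summary
RECORD = 8    # assumed fixed-size table entry, e.g. 4-byte tag + 4-byte offset

def build_summary(keys):
    bins = [0] * BINS
    for key in keys:
        h = zlib.crc32(key) & 0xffffffff
        bins[h >> 20] += 1        # top 12 of 32 bits select the bin
    return bins                   # nb: a one-byte uchar caps a bin at 255

def table_byte_range(key, bins, table_start):
    b = (zlib.crc32(key) & 0xffffffff) >> 20
    start = table_start + sum(bins[:b]) * RECORD   # client-side cumulative sum
    return start, start + bins[b] * RECORD         # byte range to readv next
```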
lifeless | it sounds like you are making good progress on figuring out how to have a very small 'find a key' logic, while still allowing arbitary sorts for data locality | 09:48 |
keir | exactly | 09:48 |
lifeless | I'm fairly uninterested in whacking this into the toy format as yet | 09:49 |
lifeless | but very interested in it being in the real format | 09:49 |
keir | ok | 09:49 |
keir | so for the 800mb case | 09:49 |
lifeless | 800K case? :) | 09:49 |
keir | excel tells me it'll be 4k of preamble and ~13mb of tag:offset pairs | 09:50 |
keir | 800k keys rather :) | 09:50 |
keir | the nice thing about this is that i can whack it on top of the toy format no problem | 09:50 |
keir | and then the average bin will be still something around 4k | 09:52 |
keir | as in, you'll have to grab 4k of data | 09:52 |
lifeless | hmmm | 09:52 |
lifeless | tell you what, I'll finish this bisection approach off | 09:52 |
keir | yes | 09:53 |
keir | i'll keep working on excel | 09:53 |
keir | and mail the list | 09:53 |
lifeless | and think seriously about this hash approach | 09:53 |
lifeless | what hash function were you thinking of using? | 09:53 |
keir | sha, just because it's Proven | 09:53 |
keir | but maybe it's too slow | 09:53 |
lifeless | its very slow | 09:53 |
lifeless | also its cryptographically secure | 09:54 |
lifeless | which is unneeded here | 09:54 |
lifeless | as we support collisions | 09:54 |
keir | yes | 09:55 |
keir | crc32! | 09:55 |
keir | ftw! | 09:55 |
lifeless | well | 09:57 |
=== mvo [n=egon@p54A64932.dip.t-dialin.net] has joined #bzr | ||
lifeless | that would work, and is bound in python | 09:57 |
keir | we'd probably want more bits | 09:58 |
lifeless | why | 09:58 |
lifeless | 800K << 2.6M | 09:58 |
lifeless | I wonder if hashlittle is available in the stdlib | 10:00 |
lifeless | so | 10:04 |
lifeless | Wikipedia's list of hash functions | 10:04 |
lifeless | suggests that adler32 is about the fastest thing out there | 10:05 |
lifeless | crc32 ~= md5 | 10:05 |
lifeless | at 3.6* adler | 10:05 |
lifeless | and sha is 6 * adler | 10:05 |
lifeless | so I'd use adler up to say 500K keys | 10:05 |
keir | neat | 10:06 |
keir | is adler in python? | 10:06 |
lifeless | yes | 10:07 |
lifeless | in the zlib module | 10:07 |
lifeless | and go md5 at 500K keys | 10:07 |
lifeless | and stay at md5 | 10:07 |
lifeless | this is only data lookup, things are verified by sha as they are reconstructed, so we don't care about hostile modifications | 10:07 |
keir | "Adler-32 has a weakness for short messages with few hundred bytes, because the checksums for these messages have a poor coverage of the 32 available bits." | 10:09 |
keir | this is bad for us | 10:09 |
keir | probably crc32? | 10:09 |
keir | i suppose we can just try | 10:09 |
lifeless | test, test and more testing :) | 10:09 |
keir | is there a function in your code which gives me the offset of a key given the key? i.e. during the building phase | 10:15 |
lifeless | no | 10:19 |
=== asak [n=alexis@201-1-2-93.dsl.telesp.net.br] has joined #bzr | ||
keir | ok | 10:19 |
lifeless | because the location cannot be determined without knowing the number of key references before the key + the length of the keys before the key | 10:19 |
lifeless | so we wait until finish() is invoked to calculate this | 10:20 |
keir | of course. | 10:20 |
keir | it seems entirely reasonable to add the hash index as an index on the index... in a separate file | 10:21 |
keir | cool, now that i look back at git's index, this way is nicer | 10:22 |
lifeless | if you update the 'size of offset needed' logic to understand that there will be a hash table there, it should fit in very nicely | 10:22 |
keir | by storing the histogram rather than the cumulative sum, we can store fewer bits per hash table entry | 10:22 |
keir | and nicely enough, given the number of keys, the hash table (including both 4k start and rest) is fixed size | 10:23 |
lifeless | but i'm not against extra files if that really helps | 10:23 |
lifeless | actually the hash table probably isn't fixed size - but it is bounded. | 10:23 |
lifeless | collisions will share hash names | 10:23 |
keir | hmm, this is true | 10:24 |
keir | i had originally envisioned sha hashes so i was thinking no collisions | 10:24 |
lifeless | even sha can collide | 10:24 |
lifeless | its true we don't know how to *make* it collide | 10:24 |
lifeless | but its a fallacy to say it won't :) | 10:25 |
keir | actually that ruins everything | 10:25 |
lifeless | hmm? | 10:25 |
lifeless | collisions don't ruin anything here | 10:25 |
lifeless | the number of bytes in the hash table is: | 10:26 |
keir | the whole trick with the histogram relies on your bins being exactly nentries*fixed size | 10:26 |
lifeless | right | 10:26 |
keir | ooh, i see | 10:26 |
keir | you just duplicate the tag | 10:26 |
lifeless | and you have that | 10:26 |
lifeless | nah no need | 10:26 |
keir | duh | 10:26 |
lifeless | here: | 10:26 |
lifeless | trivial format for the table: | 10:26 |
lifeless | HASH LOCREF[ LOCREF...] | 10:26 |
keir | (well, the table format is pretty trivial!) | 10:27 |
lifeless | oops | 10:27 |
lifeless | HASH LOCREF[ LOCREF...] \n | 10:27 |
keir | i am against delimiters... | 10:28 |
keir | in this part of the table | 10:28 |
keir | i'd go like this: | 10:28 |
lifeless | well | 10:28 |
lifeless | if you don't duplicate the hash | 10:28 |
keir | TAG OFFSET TAG OFFSET TAG OFFSET | 10:28 |
lifeless | then a collision moves the table up | 10:28 |
keir | up? | 10:28 |
=== luks [n=lukas@unaffiliated/luks] has joined #bzr | ||
lifeless | the idea of the table summary is to say 'if your hash starts with 010110' then sum up the counts for all hashes in the summary before that | 10:29 |
lifeless | and you can tell where in the table to read from to read the data for all hashes that start with 010110 | 10:29 |
keir | yes | 10:29 |
lifeless | by up I mean 'earlier' | 10:30 |
keir | what i'm suggesting, is to dupe the tag and increment the bin anyway | 10:30 |
keir | then the reading end needs to know that there may be duped tags | 10:30 |
lifeless | if its a fixed size, then take the sum of the bins, multiply by the size of a keyref + your hash size and you know where to read | 10:30 |
keir | i don't see how to do the offset calc when there are delimiters | 10:30 |
lifeless | if its not a fixed size | 10:31 |
lifeless | then it will never be further in the file, but it may be earlier | 10:31 |
lifeless | but | 10:31 |
keir | i think delimiters are a bad idea in this context | 10:31 |
=== Zindar [n=erik@stockholm.ardendo.se] has joined #bzr | ||
keir | hopefully collisions will be rare enough that it's worth duping the content | 10:32 |
keir | of course, we'll see in testing | 10:32 |
lifeless | anyhow, you can calculate the upper bound of collisions in the reader | 10:32 |
lifeless | 4 bins, with 10 keys each - at minimum there are 4 unique hashes, at most 40 | 10:32 |
lifeless | but its better than that | 10:32 |
keir | ah yes | 10:32 |
lifeless | if you record 'hash, refs' rather than 'refs' in the bin summary | 10:33 |
lifeless | then you can handle any number of collisions and still predict exact location | 10:33 |
keir | but then our summary will be either larger or less selective | 10:34 |
lifeless | right | 10:34 |
lifeless | so its a tradeoff between larger table when there are collisions and larger summary when there aren't | 10:35 |
keir | hmm, the keys in 0.tix should be unique right? | 10:44 |
lifeless | yes | 10:44 |
keir | wow, lots and lots of crc32 collisions | 10:51 |
keir | i'm getting 50% collisions | 10:51 |
keir | i guess that's what happens when you load 800k items | 10:51 |
lifeless | yah | 10:54 |
lifeless | don't use crc32 | 10:54 |
lifeless | :) | 10:54 |
lifeless | as its meant to be ~ the same as md5 | 10:55 |
lifeless | I'd check what distribution you're getting | 10:55 |
keir | sorry | 10:56 |
keir | bug | 10:56 |
keir | in 717k i get 50 collisions | 10:56 |
keir | for crc32 | 10:56 |
keir | 4000 with adler32 | 10:56 |
lifeless | what cpu do you have? | 10:57 |
keir | this is just doing the dumbest thing throwing all the keys into a dict() keyed on hash | 10:57 |
lifeless | 4K/717K is nice and low - 0.5% | 10:57 |
keir | pokey old centrino | 10:57 |
lifeless | ok | 10:58 |
lifeless | print "0x%u" % hash('asdfg') | 10:58 |
lifeless | 0x3709412585253919148 | 10:58 |
lifeless | I don't think we can use this hash | 10:58 |
keir | it can hash all of them pretty fast even on this machine | 10:58 |
lifeless | but I'm curious what result you get | 10:58 |
keir | 0x-848537172 | 10:59 |
lifeless | hah! so much for unsigned | 10:59 |
lifeless | anyhow, different number | 10:59 |
keir | bin = int(sha.sha(key).hexdigest()[:4] , 16) | 11:00 |
keir | i'm getting 65k collisions with that one | 11:00 |
keir | that should be a 32 bit hash right? | 11:01 |
lifeless | 4 bits per char | 11:01 |
lifeless | 4 chars | 11:01 |
keir | duh | 11:01 |
keir | 54 | 11:02 |
lifeless | really, use md5 | 11:02 |
lifeless | seriously faster | 11:02 |
keir | just checking collisions | 11:02 |
lifeless | sure | 11:03 |
keir | 61 | 11:03 |
keir | wow, so far the winner is the fastest and dumbest: crc32 ftw! | 11:03 |
lifeless | :) | 11:03 |
lifeless | have you tried adler ? | 11:03 |
keir | 4000 | 11:03 |
lifeless | hmm, I remember one of adler/crc has endianness/word size issues | 11:04 |
keir | by the author's admission, it's broken below 128 chars due to design | 11:04 |
lifeless | we'll want to check | 11:04 |
keir | it's adler | 11:04 |
keir | it's unsuitable | 11:04 |
lifeless | k | 11:04 |
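A rough reconstruction of the collision test (keir's actual script isn't shown), including the 32-bit slicing the exchange above settles on:

```python
import zlib
from hashlib import md5, sha1

def count_collisions(keys, hash_fn):
    buckets = {}
    for key in keys:
        buckets.setdefault(hash_fn(key), set()).add(key)
    return sum(len(b) - 1 for b in buckets.values() if len(b) > 1)

crc32_h = lambda k: zlib.crc32(k) & 0xffffffff
adler_h = lambda k: zlib.adler32(k) & 0xffffffff
md5_32 = lambda k: md5(k).digest()[:4]    # first 32 bits of md5
sha_32 = lambda k: sha1(k).digest()[:4]   # 32 bits; hexdigest()[:4] was
                                          # only 16 (4 bits per hex char)
```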
lifeless | brb | 11:05 |
lifeless | back | 11:12 |
james_w | lifeless: I was asking about disk safe filenames for the patches/threads/loom plugin. I guess I could use a random string and store the mapping. | 11:15 |
=== arjenAU [n=arjen@ppp215-29.static.internode.on.net] has joined #bzr | ||
keir | lifeless, so if we have 12 bits to pick inside the 4k preamble, we're left with an awkward 20 bit tag. i suggest we use a 44 bit hash; 12 bits for preamble, 32 for tag | 11:17 |
keir | then our records are 8 bytes | 11:17 |
keir | which are very cache performant (aligning to 4 bytes good!) | 11:17 |
keir | 0 collisions on 717k | 11:18 |
lifeless | james_w: a bit more context would be useful | 11:18 |
keir | (8 bytes being tag:offset) | 11:18 |
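The proposed record layout, sketched with struct (names are hypothetical): the top 12 bits of a 44-bit hash select a preamble bin, and only the remaining 32-bit tag is stored, beside a 32-bit offset:

```python
import struct

def split_hash(h44):
    return h44 >> 32, h44 & 0xffffffff      # (12-bit bin, 32-bit tag)

def pack_record(tag, offset):
    return struct.pack('>II', tag, offset)  # 8 bytes, 4-byte aligned

def unpack_record(data):
    return struct.unpack('>II', data)       # -> (tag, offset)
```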
keir | i'm off to bed | 11:20 |
lifeless | night! | 11:20 |
keir | it's been a fun thought experiment | 11:20 |
lifeless | we're making really good progress I think | 11:20 |
keir | i'm pretty convinced this is the right approach | 11:20 |
keir | super simple, compact | 11:21 |
james_w | lifeless: I don't know if you saw on the list (extra revisions in sprout), but I am looking at doing something like the loom plugin. This means that I need to create a new branch. Each 'thread' has a name, so I was going to name the hidden branches the same, so I wanted to know if there was a function to check if a string was safe for writing to disk. | 11:21 |
james_w | you suggested that using user supplied input wasn't a good idea. | 11:21 |
lifeless | I don't understand whats being written to disk | 11:21 |
lifeless | or what you mean by hidden branch | 11:21 |
=== Zindar_ [n=erik@stockholm.ardendo.se] has joined #bzr | ||
james_w | for each thread I create a hidden branch, which is just a branch that is stored in .bzr to suggest to the user that they shouldn't modify it manually. This involves creating a new directory beneath .bzr that is a branch. | 11:22 |
lifeless | I suggest using some mapping | 11:23 |
james_w | each of these branches has a name, which I was going to use for the name of the directory, so that given a name I can find the branch. | 11:23 |
lifeless | and thinking about renames | 11:23 |
lifeless | are there relationships between these branches | 11:23 |
lifeless | bbiab | 11:24 |
james_w | yes, I am working on a single linked list assumption at the moment. | 11:24 |
james_w | or a stack might be a better term. | 11:24 |
james_w | If i remove that you can get git workflow, but it is more work. | 11:24 |
=== pbor [n=urk@host140-91-dynamic.11-79-r.retail.telecomitalia.it] has joined #bzr | ||
=== i386 [n=james@203-158-59-54.dyn.iinet.net.au] has joined #bzr | ||
=== mwhudson [n=mwh@62-31-157-102.cable.ubr01.azte.blueyonder.co.uk] has joined #bzr | ||
=== i386 [n=james@ppp239-169.static.internode.on.net] has joined #bzr | ||
Peng | lifeless (since keir /quit): You might be using the hash functions for different purposes, but what about Mercurial's hash functions? Either their old GNU diff-based one or their new lyhash (by Leonid Yuriev)? http://hg.intevation.org/mercurial/crew/rev/d0c48891dd4a?style=gitweb | 01:11 |
=== fleeb [n=chatzill@fleeb.dslbr.toad.net] has joined #bzr | ||
=== AfC [n=andrew@ip67-91-236-171.z236-91-67.customer.algx.net] has joined #bzr | ||
=== yminsky [n=yminsky@user-0cevcqv.cable.mindspring.com] has joined #bzr | ||
=== Zindar_ [n=erik@stockholm.ardendo.se] has left #bzr [] | ||
=== pbor [n=urk@host140-91-dynamic.11-79-r.retail.telecomitalia.it] has joined #bzr | ||
=== herzel104 [i=herzel@gateway/tor/x-494dd13d5dfbe90d] has joined #bzr | ||
=== jeremyb_ [n=jeremy@unaffiliated/jeremyb] has joined #bzr | ||
=== quatauta [n=quatauta@pD9E260C8.dip.t-dialin.net] has joined #bzr | ||
=== bitmonk [n=justizin@adsl-76-212-13-68.dsl.pltn13.sbcglobal.net] has joined #bzr | ||
=== seanhodges [n=sean@90.240.81.130] has joined #bzr | ||
=== schierbeck [n=daniel@dasch.egmont-kol.dk] has joined #bzr | ||
=== Demitar [n=demitar@c-212-031-182-147.cust.broadway.se] has joined #bzr | ||
=== phanatic [n=phanatic@3e70d9be.adsl.enternet.hu] has joined #bzr | ||
=== AfC [n=andrew@63.116.222.46] has joined #bzr | ||
=== orutherfurd [n=orutherf@dsl092-164-022.wdc2.dsl.speakeasy.net] has joined #bzr | ||
=== NamNguyen [n=NamNguye@cm38.delta196.maxonline.com.sg] has joined #bzr | ||
=== schierbeck [n=daniel@dasch.egmont-kol.dk] has joined #bzr | ||
schierbeck | hello lads | 04:37 |
=== phanatic [n=phanatic@3e70d9be.adsl.enternet.hu] has joined #bzr | ||
=== cfbolz [n=cfbolz@p54AB9E54.dip0.t-ipconnect.de] has joined #bzr | ||
=== cprov [n=cprov@canonical/launchpad/cprov] has joined #bzr | ||
=== fog [n=fog@debian/developer/fog] has left #bzr [] | ||
=== Gwaihir [n=Gwaihir@ubuntu/member/gwaihir] has joined #bzr | ||
=== yminsky [n=yminsky@user-0cevcqv.cable.mindspring.com] has joined #bzr | ||
=== keir [n=keir@bas15-toronto12-1168010516.dsl.bell.ca] has joined #bzr | ||
=== bitmonk [n=justizin@adsl-76-212-13-68.dsl.pltn13.sbcglobal.net] has joined #bzr | ||
=== luks [n=lukas@unaffiliated/luks] has joined #bzr | ||
=== Vernius_ [n=tomger@p508AEB28.dip.t-dialin.net] has joined #bzr | ||
=== Mez [n=Mez@ubuntu/member/mez] has joined #bzr | ||
=== zyga [n=zyga@ubuntu/member/zyga] has joined #bzr | ||
=== yminsky [n=yminsky@user-0cevcqv.cable.mindspring.com] has left #bzr [] | ||
=== zyga [n=zyga@ubuntu/member/zyga] has joined #bzr | ||
=== s|k [n=bjorn@c-69-181-8-54.hsd1.ca.comcast.net] has joined #bzr | ||
s|k | hi | 09:03 |
=== sevrin [n=sevrin@ns1.clipsalportal.com] has joined #bzr | ||
keir | why are the new bzr packs still ~5x larger than git packs? | 09:17 |
luks | bzr packs are really just knits written into bigger files | 09:18 |
luks | but the structure of data in bzr and git is very different, anyway | 09:18 |
keir | the bzr packs don't compress at all either... | 09:19 |
keir | hmm | 09:20 |
keir | git is snapshot based | 09:20 |
keir | but internally just uses heuristics to store the diffs | 09:20 |
luks | knits are gzipped | 09:20 |
luks | gzipped text deltas | 09:20 |
keir | that the complete linux kernel history is 100mb, is pretty nice! | 09:20 |
keir | (for git) | 09:20 |
keir | interesting, i'll have to look into this | 09:21 |
keir | the tiny size for git repos is really nice for branching projects -- tiny download time! | 09:22 |
keir | is there docs of what a knit is? | 09:24 |
keir | i don't see it clearly explained in the developers/ dir in docs | 09:24 |
keir | i see, so text deltas rather than a more efficient xdiff type binary encoding | 09:27 |
fullermd | And occasional fulltext copies. That adds up size on long histories. | 09:34 |
keir | git does that too | 09:36 |
keir | i still don't see how it's a 5x size difference though | 09:36 |
fullermd | Not in the same sense, AIUI. | 09:36 |
luks | annotations add a lot, too. or are they dropped from packs already? | 09:37 |
keir | hmm. annotations are very important for some projects (gnome, etc) | 09:39 |
fullermd | That doesn't mean storing them is necessary. | 09:40 |
keir | well, the issue gnome had with svn was that annotate was very slow compared to cvs | 09:40 |
keir | back in the day it was a blocker to svn conversion | 09:40 |
=== mm_202 [n=mm_202@216.106.29.84] has joined #bzr | ||
=== mm_202 [n=mm_202@216.106.29.84] has left #bzr ["Goodbye."] | ||
fullermd | Well, annotations are also important to things like annotation-based merges. | 09:43 |
fullermd | What both end up meaning is that annotations should be get-able "fast enough". Which means either finding a blazing fast way to derive them on the fly, or caching them. | 09:44 |
fullermd | With knits, they're cached at commit time, which has a lot of drawbacks. Space usage, CPU and I/O time to add/retrieve them on all operations. | 09:44 |
fullermd | Makes it very hard to use a non-line-based delta algorithm, since annotations are line-based (my brain rebels at reading a byte-wise 'annotate' output) | 09:45 |
fullermd | Related, it means they're hard-coded based on the 'diff' algorithm used at commit-time. That can suck. | 09:45 |
fullermd | AIUI, the plan with packs is to add the capability to gen and store an external cache when they become needed, independent of the actual text store. | 09:46 |
keir | sensible | 09:47 |
keir | is there a fast mpdiff implementation already? | 09:55 |
keir | anyone used bzr-git? | 09:55 |
keir | i can't get it to work at all | 09:55 |
keir | and there is no docs whatsoever | 09:55 |
keir | bzr branch ../git ---> 'GitRepository' object has no attribute '_format' | 09:55 |
=== dholmster [i=dholm@195.198.115.93] has joined #bzr | ||
fullermd | I thought I heard it wasn't up to branching. | 09:57 |
keir | there is not a single example command showing how to use any of its functionality, so i'm guessing | 09:58 |
keir | so bzr-git is, at the moment, entirely useless? | 09:58 |
fullermd | I think lifeless said it's only "there" enough to look at history (presumably 'log' and the like) | 09:58 |
=== hstuart [n=hstuart@0x503e9965.virnxx12.adsl-dhcp.tele.dk] has joined #bzr | ||
=== zyga [n=zyga@ubuntu/member/zyga] has joined #bzr | ||
=== cprov [n=cprov@canonical/launchpad/cprov] has joined #bzr | ||
lifeless | keir: bzr-git can show history but is not complete enough to pull data across | 10:45 |
lifeless | luks: knits have annotations, we won't be changing the knits repo format, so that won't change; packs have knits though | 10:45 |
lifeless | erm, packs have the deltas, and they are not annotated | 10:45 |
luks | well, that's what I meant | 10:46 |
luks | I just wasn't sure if packs still have annotations | 10:46 |
lifeless | ok | 10:46 |
lifeless | they don't | 10:46 |
lifeless | packs are something like 10% smaller than bzr's knit based repositories | 10:47 |
keir | lifeless, why are they so much larger than git repos? | 10:47 |
jelmer | 'morning * | 10:47 |
keir | because libxdiff (what git uses) is 5x better than line diff + gzip? | 10:48 |
lifeless | keir: two reasons that I know of; one is that we know we can do better on size - john and aaron have done lots of experiments on this - but we haven't moved their research into a disk format [yet]. | 10:48 |
lifeless | in fact thats a general enough statement that it covers the second reason too :) | 10:49 |
lifeless | hi jelmer | 10:49 |
keir | lifeless, ok :) | 10:49 |
lifeless | Peng: thanks; I wasn't aware of lyhash | 10:49 |
Peng | lifeless: :) | 10:49 |
lifeless | keir: so I would expect the next iteration of packs to bring in some of their improvements which will increase performance more, decrease size, etc. | 10:50 |
lifeless | keir: I'm curious though where you are getting the size comparison from? | 10:51 |
lifeless | keir: is it a tree with the same history in both formats? | 10:51 |
keir | lifeless, order of magnitude comparisons of git tree vs bzr tree | 10:51 |
keir | same source size | 10:51 |
keir | similar history size | 10:51 |
lifeless | well | 10:51 |
lifeless | similar can cover a multitude of sins | 10:51 |
keir | true | 10:51 |
keir | i was trying to convert the git repo into bzr packs | 10:52 |
keir | to do a proper comparison | 10:52 |
lifeless | yah, thats not ready yet | 10:52 |
lifeless | oh, also, a thought - if you were doing du -sh .bzr | 10:52 |
lifeless | there is the obsolete_packs directory | 10:52 |
keir | it's empty | 10:53 |
lifeless | ok | 10:53 |
lifeless | :) | 10:53 |
keir | all the size is in the single .pack | 10:53 |
keir | 54mb | 10:53 |
lifeless | how old was the project when it was in knit format ? | 10:53 |
keir | compared to the git 15mb one | 10:53 |
keir | this is packs.packs | 10:53 |
keir | knit is similar size | 10:53 |
keir | .bzr in packs is 61mb | 10:54 |
lifeless | uhm, I wasn't clear | 10:54 |
keir | in knits it's 71mb | 10:54 |
lifeless | old knit generation code generated bigger knits | 10:54 |
lifeless | for two fairly dumb in hindsight reasons | 10:54 |
lifeless | the first is the python GZipFile uses zlib compression level 9 by default for some insane reason | 10:54 |
keir | hmm | 10:55 |
lifeless | this generates bigger output in a non trivial % of cases | 10:55 |
lifeless | I saved a surprising amount of disk in the moz repo tests by converting that to -1, not to mention a bucketload of time | 10:55 |
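For reference, Python's GzipFile really does default to compresslevel=9; the change described amounts to passing zlib's default instead (a Python 2 sketch, matching the era):

```python
import gzip
from StringIO import StringIO

buf = StringIO()
# compresslevel=-1 selects zlib's Z_DEFAULT_COMPRESSION (roughly level 6),
# markedly faster than GzipFile's default of 9 and often no larger
f = gzip.GzipFile(fileobj=buf, mode='wb', compresslevel=-1)
f.write('knit delta text\n')
f.close()
```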
lifeless | the second reason is that we started with a full text every 50 commits no matter what | 10:56 |
lifeless | (per file history) | 10:56 |
keir | ah, i see | 10:56 |
lifeless | and the inventory file - which is a) big and b) changes on every commit | 10:56 |
lifeless | lets just say that it became a dominating factor in several repositories | 10:56 |
lifeless | now we use a heuristic that says 'if the size of the deltas == the size of the file, write a new full text' | 10:57 |
lifeless | oh, and theres another reason too | 10:57 |
lifeless | we delta against history always | 10:57 |
Peng | To get more efficient knits, what, re-clone? | 10:57 |
lifeless | not against best size/closest match/introduction to repo | 10:58 |
lifeless | which means if you imagine a Y graph | 10:58 |
lifeless | if at the point just before the split you are due-for-a-full-text | 10:58 |
lifeless | both arms will get a fulltext, one each. | 10:58 |
lifeless | Peng: by diffing against history its trivial to be sure that you have all the deltas needed to reconstruct a text during branching | 10:59 |
lifeless | because you can take all the inventories; grab the texts they want as either fulltext-or-delta, and you're done | 10:59 |
lifeless | if you have arbitrary deltas, you need to grab other data that you dont want in general, replace the local delta for the thing that referenced it with another, then you're done | 11:00 |
lifeless | this is one reason pack text indices have two pointers | 11:00 |
lifeless | one to another text for delta parent | 11:00 |
keir | aah, a history-parent and delta parent | 11:00 |
lifeless | one to a list of parents for file graph | 11:01 |
=== bialix [i=chatzill@77.109.22.134] has joined #bzr | ||
lifeless | so that we can e.g. delta against last-added, but still tell what we need during fetch easily | 11:01 |
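An illustrative index entry and the fulltext heuristic as described (field names are invented; bzrlib's real GraphIndex entries differ in detail):

```python
# each text row carries two kinds of pointer
entry = {
    'key': ('file-id', 'revision-id'),
    'delta_parent': ('file-id', 'prev-revision'),    # compression parent,
                                                     # or None for a fulltext
    'graph_parents': [('file-id', 'parent-rev-1'),   # file-graph parents,
                      ('file-id', 'parent-rev-2')],  # used by merge/fetch
}

def should_store_fulltext(fulltext_size, cumulative_delta_size):
    # the heuristic quoted above: once the chain of deltas costs as much
    # as the text itself, start a fresh fulltext
    return cumulative_delta_size >= fulltext_size
```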
bialix | luks: ping | 11:01 |
luks | hi! | 11:01 |
lifeless | if someone digs into the 'pack' command (not the autopack, the manual 'please make me fast daddy' command) | 11:01 |
lifeless | then they can fix existing repositories to be considerably smaller, especially now that packs are not annotated. | 11:02 |
lifeless | keir: did I just hear a penny dropping ? :) | 11:02 |
lifeless | hi bialix | 11:03 |
lifeless | anyway, I've some stuff to do here, hope this discussion helps - the 5x thing is definitely a problem, but I'm quite sure you won't see it *that bad* if you start with packs and git and build up a given graph programmatically | 11:03 |
=== bialix [i=chatzill@77-109-18-147.dynamic.peoplenet.ua] has joined #bzr | ||
luks | hi again :) | 11:04 |
bialix | hi, luks, lifeless | 11:04 |
bialix | luks, I'm thinking about a separate file for strings, but it seems to few strings at this moment | 11:05 |
bialix | ^too few | 11:05 |
luks | I'm fine with it as it is | 11:05 |
luks | we can move them later if the list will start growing | 11:06 |
bialix | ok for me | 11:06 |
luks | what still worries me are error messages from bzrlib | 11:06 |
luks | but I guess that's not easily fixable for now | 11:06 |
bialix | yeah | 11:06 |
bialix | duplicating them inside QBzr is not the best idea | 11:07 |
luks | yep | 11:07 |
bialix | luks, I saw you change some plurals-related things in qdiff | 11:07 |
luks | yes, I realized I need a plural in the title | 11:08 |
luks | and I couldn't figure out how to do it with pygettext | 11:08 |
luks | so I switched it to xgettext | 11:08 |
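The kind of plural-aware call involved (illustrative, not QBzr's actual code); xgettext extracts ngettext calls into plural-capable catalogs, which is the pygettext limitation luks describes:

```python
from gettext import ngettext

def diff_window_title(n):
    # "%d file" / "%d files" get proper plural handling per locale
    return "QBzr - Diff - " + ngettext("%d file", "%d files", n) % n
```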
bialix | it may be something strange with my standalone QBzr, but it seems that this code is not working on my win32 | 11:08 |
luks | hmm | 11:08 |
luks | what's the problem? | 11:09 |
keir | lifeless, i was looking into this because i was thinking about the compressed keys issue -- if we compress keys into groups of 16, we get a 4x saving on space for storing the keys | 11:09 |
bialix | I don't see this messages at all | 11:09 |
keir | lifeless, but i wanted to check if the size of the keys was even on the cards compared to pack sizes | 11:09 |
bialix | luks: may be it's something related to bug #148409 | 11:10 |
ubotu | Launchpad bug 148409 in qbzr "qdiff --inline: option has no effect?" [Undecided,New] https://launchpad.net/bugs/148409 | 11:10 |
luks | bialix, I think I'm confused now | 11:12 |
luks | you mean you don't see the title in qdiff at all? | 11:12 |
bialix | luks: because my bad english, or because my bug report? | 11:13 |
luks | neither, I just don't see how are these two related | 11:13 |
luks | the "X files" title should appear only if you specify more than 2 files to diff on the command line | 11:14 |
bialix | no, title in qdiff is present, I don't see "X files" message | 11:14 |
bialix | mmm, and when I specify no files? | 11:15 |
luks | then you get only "QBzr - Diff" | 11:15 |
bialix | aha | 11:15 |
luks | and if you open it from qlog then I believe you get "QBzr - Diff - <revision ID>" | 11:15 |
luks | maybe this should be unified | 11:16 |
luks | it made sense to me back then, but it looks confusing now that I think about it again :) | 11:16 |
bialix | here screenshot http://bialix.com/qbzr/qdiff-2-files.png with 2 files in command line | 11:18 |
luks | more than two | 11:18 |
bialix | yep | 11:19 |
bialix | my bad | 11:19 |
bialix | but --inline does not work anyway | 11:20 |
luks | I know, it used to work, but then I added this new diff viewer and didn't bother to implement it there | 11:22 |
luks | but I'm about to rewrite it again, because the diff gets misaligned sometimes | 11:22 |
bialix | well, ok, I just want to know it's not win32-specific | 11:22 |
luks | no, it's not | 11:22 |
bialix | btw, I hve some strange bug with diff on my ru.po | 11:23 |
bialix | somehow something breaks html rendering in diff and I saw raw html code | 11:23 |
luks | can you mail me the patch that does it? | 11:26 |
bialix | here is screenshot: http://bialix.com/qbzr/qdiff-html-bug.png | 11:26 |
luks | ouch | 11:26 |
bialix | it's from my i18n branch | 11:26 |
bialix | it's already on the launchpad server | 11:26 |
bialix | bzr qdi -c136 | 11:27 |
luks | I see | 11:27 |
luks | I used an ugly hack there, but it doesn't play well with utf-8 | 11:28 |
luks | hm, maybe not | 11:33 |
bialix | luks: good night | 11:41 |
luks | night | 11:41 |
=== zyga [n=zyga@ubuntu/member/zyga] has joined #bzr | ||
=== Demitar [n=demitar@c-212-031-182-147.cust.broadway.se] has joined #bzr |