Have you looked at the MurmurHash implementation in HBase, Tim? It
has good characteristics: it is faster than Jenkins and produces a
32-bit or 64-bit hash. See
http://sites.google.com/site/murmurhash/. The conversion to
Java was done by Andrzej. It is way cheaper than generating UUIDs, and
the result is much smaller.
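
For the 8-character case, something along these lines might do. It is
only a rough sketch: a standalone 32-bit MurmurHash2 written out by
hand, not the class that ships in HBase. The names ShortKeys and
shortKey are made up for illustration, and the seed of 0 is an
arbitrary choice.

```java
import java.nio.charset.StandardCharsets;

final class ShortKeys {
    /** 32-bit MurmurHash2 over the whole array, reading little-endian words. */
    static int murmur2(byte[] data, int seed) {
        final int m = 0x5bd1e995;
        final int r = 24;
        int len = data.length;
        int h = seed ^ len;
        int i = 0;
        // Mix four bytes at a time into the hash.
        while (len - i >= 4) {
            int k = (data[i] & 0xff)
                  | (data[i + 1] & 0xff) << 8
                  | (data[i + 2] & 0xff) << 16
                  | (data[i + 3] & 0xff) << 24;
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
            i += 4;
        }
        // Fold in the remaining one to three bytes, if any.
        switch (len - i) {
            case 3: h ^= (data[i + 2] & 0xff) << 16; // fall through
            case 2: h ^= (data[i + 1] & 0xff) << 8;  // fall through
            case 1: h ^= (data[i] & 0xff);
                    h *= m;
        }
        // Final avalanche so the last bytes affect every output bit.
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    /** Hypothetical helper: an 8-character hex key derived from a longer natural key. */
    static String shortKey(String naturalKey) {
        byte[] bytes = naturalKey.getBytes(StandardCharsets.UTF_8);
        return String.format("%08x", murmur2(bytes, 0));
    }

    public static void main(String[] args) {
        System.out.println(shortKey("some-long-natural-key"));
    }
}
```

Keep in mind that 32 bits will collide once the table gets big enough,
so either check for an existing row before writing or go with the
64-bit variant.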
St.Ack
On Sat, Mar 13, 2010 at 12:01 AM, Tim Robertson
<[EMAIL PROTECTED]> wrote:
> Along similar lines... (sorry for hijacking thread)
>
> I assume that this is even more applicable to key choice, given the way keys
> participate in indexes? I have been using UUIDs, but that is way overkill for
> my needs. What are others using? Is there a convenient way of generating
> (e.g.) 8-character strings?
>
>
>
>
> On Fri, Mar 12, 2010 at 9:15 PM, Kay Kay <[EMAIL PROTECTED]> wrote:
>
>> Some of our current experiences go along similar lines, where we saw
>> ~20-30% RAM savings by using abbreviations in the key space.
>>
>> But the biggest advantage actually came from defining the right schema and
>> column families, as per the query pattern of the jobs. We keep the column
>> families to no more than 5 and have relatively *thin* columns, but revisit
>> the schema with more tables if that gets stretched, as applicable of course.
>>
>>
>>
>>
>> On 3/12/10 12:02 PM, Lars Francke wrote:
>>
>>>> Will I save a lot of space (especially if I have many small columns)?
>>>>
>>>>
>>> I don't have any hard numbers for you but I tested it and I remember
>>> that on a dataset of about 10-20 GB I could save about 200-500 MB
>>> (this was with compression enabled) just by not using descriptive
>>> string qualifiers that weren't data themselves. A lot of small columns
>>> for me too (mostly counters). I use a simple mapping of short byte
>>> arrays to strings so that it is still very easy to use in the client.
>>>
>>> I asked that very same question a few months ago on IRC but I think
>>> nobody answered so I'd be interested in what others do as well.
>>>
>>> Cheers,
>>> Lars
>>>
>>>
>>
>>
>
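
On the qualifier mapping Lars describes above (short byte arrays on
disk, readable names in the client), a minimal sketch could look like
the following. The class name QualifierCodec and the example column
names are invented here, and this is not necessarily how Lars did it.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class QualifierCodec {
    private final Map<String, String> longToShort = new HashMap<>();
    private final Map<String, String> shortToLong = new HashMap<>();

    /** Registers a descriptive name and the short code stored in the table. */
    QualifierCodec register(String descriptive, String code) {
        if (longToShort.containsKey(descriptive) || shortToLong.containsKey(code)) {
            throw new IllegalArgumentException(
                "duplicate mapping: " + descriptive + " -> " + code);
        }
        longToShort.put(descriptive, code);
        shortToLong.put(code, descriptive);
        return this;
    }

    /** Descriptive name to the short qualifier bytes that get written. */
    byte[] encode(String descriptive) {
        String code = longToShort.get(descriptive);
        if (code == null) {
            throw new IllegalArgumentException("unknown column: " + descriptive);
        }
        // ISO-8859-1 maps each char below 256 to exactly one byte and back.
        return code.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Short qualifier bytes read from the table back to the descriptive name. */
    String decode(byte[] qualifier) {
        String code = new String(qualifier, StandardCharsets.ISO_8859_1);
        String descriptive = shortToLong.get(code);
        if (descriptive == null) {
            throw new IllegalArgumentException("unknown qualifier: " + code);
        }
        return descriptive;
    }
}
```

The client code keeps using names like "page_views" while only the
two-byte code is stored with every cell.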