Reasoning behind ptr_uint64 and hash_bytes

Daniel Näslund

#14883

April 9, 2018

What's the reasoning behind the selection of the (presumable) prime constant for ptr_uint64 and the usage of FNV for hashing bytes? Selecting hash functions looks like dark magic to me. My only exposure to hash functions has been the multiplicative function from K&R and Joshua Blochs general advice from Effective Java.

Store some constant nonzero value, say 17, in an int variable called result.
Compute an int hashcode c for each field f that defines equals:
If the field is a boolean, compute (f ? 1 : 0)
If the field is a byte, char, short, int, compute (int) f
If the field is a long, compute (int) (f ^ (f >>> 32))
If the field is a float, compute Float.floatToIntBits(f)
If the field is a double, compute Double.doubleToLongBits(f), then hash the resulting long as in above.
If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If the value of the field is null, return 0.
If the field is an array, treat it as if each element is a separate field. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
Combine the hashcode c into result as follows: result = 31 * result + c;

Any suggestions on resources for learning more about hashing?

Edited by Daniel Näslund on April 9, 2018, 4:13am Reason: Initial post

Daniel Näslund

#14885

April 9, 2018

I have found Peter Kawinskis hash function benchmark and Bob Jenkins list of resources for hash functions but none of them describes clearly to me how to select a hash function depending on the data to be hashed.

I skimmed Knuth Vol 3, chapter 6.4 about Hashing but that didn't really make things make clearer. He says that choosing a hash function is data dependent but he really doesn't say much on how to go about selecting a hash function.

Daniel Näslund

#14886

April 9, 2018

How is ptr_hash better than just right shifting N bits to get rid of the zeroes due to alignment?

Per Vognsen

#14887

April 9, 2018

I don't know of a single resource which is going to cover all the different aspects of this. Almost all the theoretical analysis of hash tables and other hashing-based data structures relies on universal hash functions, which are randomly keyed hash functions where you can establish some uniform distribution and independence properties. Our hash function is the multiply-shift hash function the Wikipedia page analyzes here: https://en.wikipedia.org/wiki/Uni...shing#Avoiding_modular_arithmetic

Like most people, we're not randomizing our key but setting it to a fixed value, so their probabilistic analysis does not work for our case, but it might give you an inkling of why multiply-shift is a common ingredient in hash functions.

I started to write something much longer and tutorial-like, but I can't really do it justice in a forum post, and I'm also not enough of an expert to speak authoritatively on the subject.

As for why we don't just shift off the lesser bits. First, keep in mind that hash_ptr can be used with any input. Even if the malloc only returns 16-byte aligned pointers, if you have an array of data structures that are say 128 bytes in length, then pointers to the array's entries are going to have the same lowest 7 bits. That's just a simple example. You don't want to make assumptions about your input values if you can avoid it. From the previous cases, you might suggest not to use any of the less significant bits, but only the most significant bits. But then you have a similar problem. If all the pointers are allocated out of a contiguous arena, almost all their MSBs are going to be identical, and if most of their LSBs are the same as well due to accidental alignment, then they might only differ by some relatively small range of middle bits, and your hash function doesn't know what those are in advance.

Edited by Per Vognsen on April 9, 2018, 5:40am

Daniel Näslund

#14899

April 9, 2018

Thank you for your answer. The link to the universal hashing wikipedia page was enlightening.

I digged myself a deep hole searching for hash code implementations. Here's what I found:

* Division based hash codes are simple to implement but requires the table size to be a prime number.
* Multiplication based hash codes don't require the table size to be a prime number.
* Certain factors are better than others when using multiplicative hashing.
* There's a bit of folklore surrounding hash functions. Dan Bernsteins djb2 has been popular for a long time but few people understand why the constants are selected as they are. The PNV hash code is another simple algorithm that is easy to implement but that few understand.
* Multiply-shift hash codes is a recent (1997) invention.
* In later years CityHash, MurmurHash and SpookyHash has been developed. They are complicated compared to the earlier hash codes.
* Dan Bernstein created SipHash which is supposed to be immune (or atleast more resistant) to DOS attacks against hash tables.

As for hashing pointers, the GNU libstdc++ library just casts the pointer to an int value. I wonder how they get around the bias introduced by alignment and memory allocated in arenas?

Per Vognsen

#14904

April 10, 2018

As for hashing pointers, the GNU libstdc++ library
just casts the pointer to an int value. I wonder how they
get around the bias introduced by alignment and memory
allocated in arenas?

My guess is they use prime sizes for their hash tables, not powers of two. I tried searching through their source code but the policy-based template design makes it hard to find out where anything is happening. I see "primes" mentioned in the default hashing policy, so I'm guessing that's it. The main advantage of prime-sized hash tables is that you're not restricted to 2x growth and mod p does a decent baseline job of scattering values. We exploit the power-of-two sizing to compute hash % cap more efficiently as hash & (cap - 1). A 32-bit DIV on Skylake has 26 cycles of latency and 64-bit DIV is listed as 35-88 cycles in Agner Fog's tables, which is significant if you're hitting L1, L2 or even L3 cache (L1 is 3 cycles, L2 is 14 cycles, L3 is 42 cycles).

https://github.com/gcc-mirror/gcc...v3/include/tr1/hashtable_policy.h

Edited by Per Vognsen on April 10, 2018, 2:07am