dannas
Daniel Näslund
10 posts

#14883
Reasoning behind ptr_uint64 and hash_bytes 11 months, 2 weeks ago Edited by Daniel Näslund on April 9, 2018, 4:13 a.m. Reason: Initial post
What's the reasoning behind the selection of the (presumable) prime constant for ptr_uint64 and the usage of FNV for hashing bytes? Selecting hash functions looks like dark magic to me. My only exposure to hash functions has been the multiplicative function from K&R and Joshua Blochs general advice from Effective Java.
Store some constant nonzero value, say 17, in an int variable called result. Any suggestions on resources for learning more about hashing? 
dannas
Daniel Näslund
10 posts

#14885
Reasoning behind ptr_uint64 and hash_bytes 11 months, 2 weeks ago
I have found Peter Kawinskis hash function benchmark and Bob Jenkins list of resources for hash functions but none of them describes clearly to me how to select a hash function depending on the data to be hashed.
I skimmed Knuth Vol 3, chapter 6.4 about Hashing but that didn't really make things make clearer. He says that choosing a hash function is data dependent but he really doesn't say much on how to go about selecting a hash function. 
dannas
Daniel Näslund
10 posts

#14886
Reasoning behind ptr_uint64 and hash_bytes 11 months, 2 weeks ago
How is ptr_hash better than just right shifting N bits to get rid of the zeroes due to alignment?

pervognsen
Per Vognsen
49 posts
/ 1 project

#14887
Reasoning behind ptr_uint64 and hash_bytes 11 months, 2 weeks ago Edited by Per Vognsen on April 9, 2018, 5:40 a.m.
I don't know of a single resource which is going to cover all the different aspects of this. Almost all the theoretical analysis of hash tables and other hashingbased data structures relies on universal hash functions, which are randomly keyed hash functions where you can establish some uniform distribution and independence properties. Our hash function is the multiplyshift hash function the Wikipedia page analyzes here: https://en.wikipedia.org/wiki/Uni...shing#Avoiding_modular_arithmetic
Like most people, we're not randomizing our key but setting it to a fixed value, so their probabilistic analysis does not work for our case, but it might give you an inkling of why multiplyshift is a common ingredient in hash functions. I started to write something much longer and tutoriallike, but I can't really do it justice in a forum post, and I'm also not enough of an expert to speak authoritatively on the subject. As for why we don't just shift off the lesser bits. First, keep in mind that hash_ptr can be used with any input. Even if the malloc only returns 16byte aligned pointers, if you have an array of data structures that are say 128 bytes in length, then pointers to the array's entries are going to have the same lowest 7 bits. That's just a simple example. You don't want to make assumptions about your input values if you can avoid it. From the previous cases, you might suggest not to use any of the less significant bits, but only the most significant bits. But then you have a similar problem. If all the pointers are allocated out of a contiguous arena, almost all their MSBs are going to be identical, and if most of their LSBs are the same as well due to accidental alignment, then they might only differ by some relatively small range of middle bits, and your hash function doesn't know what those are in advance. 
dannas
Daniel Näslund
10 posts

#14899
Reasoning behind ptr_uint64 and hash_bytes 11 months, 1 week ago
Thank you for your answer. The link to the universal hashing wikipedia page was enlightening.
I digged myself a deep hole searching for hash code implementations. Here's what I found: * Division based hash codes are simple to implement but requires the table size to be a prime number. * Multiplication based hash codes don't require the table size to be a prime number. * Certain factors are better than others when using multiplicative hashing. * There's a bit of folklore surrounding hash functions. Dan Bernsteins djb2 has been popular for a long time but few people understand why the constants are selected as they are. The PNV hash code is another simple algorithm that is easy to implement but that few understand. * Multiplyshift hash codes is a recent (1997) invention. * In later years CityHash, MurmurHash and SpookyHash has been developed. They are complicated compared to the earlier hash codes. * Dan Bernstein created SipHash which is supposed to be immune (or atleast more resistant) to DOS attacks against hash tables. As for hashing pointers, the GNU libstdc++ library just casts the pointer to an int value. I wonder how they get around the bias introduced by alignment and memory allocated in arenas? 
pervognsen
Per Vognsen
49 posts
/ 1 project

#14904
Reasoning behind ptr_uint64 and hash_bytes 11 months, 1 week ago Edited by Per Vognsen on April 10, 2018, 2:07 a.m.
My guess is they use prime sizes for their hash tables, not powers of two. I tried searching through their source code but the policybased template design makes it hard to find out where anything is happening. I see "primes" mentioned in the default hashing policy, so I'm guessing that's it. The main advantage of primesized hash tables is that you're not restricted to 2x growth and mod p does a decent baseline job of scattering values. We exploit the poweroftwo sizing to compute hash % cap more efficiently as hash & (cap  1). A 32bit DIV on Skylake has 26 cycles of latency and 64bit DIV is listed as 3588 cycles in Agner Fog's tables, which is significant if you're hitting L1, L2 or even L3 cache (L1 is 3 cycles, L2 is 14 cycles, L3 is 42 cycles). https://github.com/gccmirror/gcc...v3/include/tr1/hashtable_policy.h 