Timeline for Hash map without collision check
Current License: CC BY-SA 3.0
9 events
| when | what | by | license | comment |
|---|---|---|---|---|
| Mar 7, 2018 at 11:31 | comment added | Jules | | @amon - when selecting a hash function and a hash table size, collision resistance is better if the constant factors in the hash function and the size of the table are coprime. An easy way to achieve this is to choose a prime table size. Java chose the other way: it recommends using prime numbers as factors when producing hashes -- although this isn't enforced, and the default implementation of Object doesn't do it, most other types in the JDK with a defined hashCode method do. |
| Mar 7, 2018 at 11:14 | comment added | rwong | | The problem that Berin Loritsch mentions is that, for user-defined types, the hash functions implemented by a "typical / median" programmer are usually very bad. When that inadequacy is combined with a "nice" bucket count, the result is subtle performance bugs, sometimes causing unexpected failures. That is why it is better to pack all of the relevant fields of the user-defined type into an array and use a well-written hash function on that array. Unfortunately this is widely perceived as too slow, and it might indeed be. |
| Mar 5, 2018 at 19:37 | comment added | amon | | Btw, the OpenJDK HashMap implementation enforces power-of-two table sizes, and Oracle documents the default capacity as the very non-prime 16. |
| Mar 5, 2018 at 19:37 | comment added | amon | | @BerinLoritsch You're not wrong, but this doesn't hold universally true either. Java, Python, … are unique because every object is hashable, often with a suboptimal hash function. There, a prime number leads to better bucket distribution. But with an evenly distributed hash function over a 2^n sized hash space, a power-of-two sized hash table and dynamic resizing by a factor of 2 is strictly superior. This guarantees amortized O(1) inserts and avoids memory fragmentation. If the bucket count doesn't divide the hash space evenly, you'll get a bias towards the lower-numbered buckets. |
| Mar 5, 2018 at 19:15 | comment added | Berin Loritsch | | NOTE: when the number of buckets is an even power of two, you end up using far fewer buckets than you planned. A lot of that is due to the default implementation of getting a hash code for an object in many languages (which is the memory address of the object). That's why most hashmap implementations use a prime number for the number of buckets. Trust me, lesson learned a long time ago when I created a custom concurrency-safe hashmap in Java. Peer review and actual usage reports showed me the error of my ways. |
| Mar 5, 2018 at 17:47 | comment added | amon | | @Jules oh, you're right, I misread the question -- it seems to be about using the hash as the key, not as the hash table's usual hash function (which can then be simplified to the identity function). |
| Mar 5, 2018 at 17:17 | comment added | Jules | | The point of doing something like this is to use the hash as a proxy for the actual key, so if the key is large and/or expensive to access (e.g. because it's stored on disk while your hash index is stored in RAM), you can compare the hashes as a proxy for comparing the keys themselves. In this circumstance, storing the full hash instead of the key itself in the table is a useful structure. It's rare, but reasons to do it exist. |
| Mar 4, 2018 at 23:53 | history edited | amon | CC BY-SA 3.0 | I know my powers of two, I swear! |
| Mar 4, 2018 at 23:31 | history answered | amon | CC BY-SA 3.0 | |
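
As a footnote to the comments above from Berin Loritsch, amon and Jules about prime versus power-of-two bucket counts, here is a minimal sketch of the two bucket-indexing strategies they contrast. The class and method names are illustrative, not taken from the JDK; only the high-bit folding step mirrors what OpenJDK's HashMap does before masking.

```java
// Sketch of the two bucket-index strategies discussed in the comments above.
// Class and method names are illustrative only; they do not come from the JDK.
public final class BucketIndexDemo {

    // Power-of-two table: the bucket is just the low bits of the hash.
    // Fast (a single AND), but it ignores the high bits, so hashes that differ
    // only in the high bits collide. OpenJDK's HashMap compensates by
    // XOR-folding the high bits into the low bits before masking.
    static int powerOfTwoBucket(int hash, int tableSize) {  // tableSize must be 2^n
        int spread = hash ^ (hash >>> 16);                  // fold high bits down
        return spread & (tableSize - 1);
    }

    // Prime-sized table: the bucket is the hash modulo a prime. Slower (a
    // division), but a prime table size is coprime to any constant factor in
    // the hash function, which pulls all bits into the bucket index.
    static int primeBucket(int hash, int primeTableSize) {
        return Math.floorMod(hash, primeTableSize);         // floorMod handles negative hashes
    }

    public static void main(String[] args) {
        // Hashes that differ only in the high bits: a plain mask maps them all
        // to bucket 0, while spread-then-mask and the prime modulus do not.
        int[] hashes = {0x10000, 0x20000, 0x30000};
        for (int h : hashes) {
            System.out.printf("hash=%#x  masked=%d  spread+masked=%d  mod 31=%d%n",
                    h, h & 15, powerOfTwoBucket(h, 16), primeBucket(h, 31));
        }
    }
}
```

The trade-off is the one the comments describe: the mask costs a single AND but only looks at the low bits of the hash, while the prime modulus costs a division but is coprime to any constant factor the hash function uses.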
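And a hedged sketch of the structure Jules describes in the Mar 5, 2018 at 17:17 comment: storing a wide hash in place of the key and skipping the collision check entirely. Everything here (the class name, the wideHash helper, the 64-bit polynomial hash) is a hypothetical stand-in for illustration, not an existing API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a "map" that stores a wide hash of the key instead
// of the key itself, as described in the comment by Jules. Two keys with the
// same 64-bit hash are silently treated as the same key, so there is no
// collision check -- the caller accepts that small (but nonzero) risk.
public final class HashKeyedMap<V> {

    private final Map<Long, V> table = new HashMap<>();

    // Stand-in for a wide, high-quality hash (e.g. SipHash or a truncated
    // cryptographic digest). A 64-bit polynomial hash is used here just to
    // keep the sketch self-contained; real uses would hash the raw key bytes.
    private static long wideHash(String key) {
        long h = 1125899906842597L;                 // arbitrary odd seed
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);
        }
        return h;
    }

    public void put(String key, V value) {
        table.put(wideHash(key), value);            // the key itself is never stored
    }

    public V get(String key) {
        return table.get(wideHash(key));            // no equals() check on the original key
    }

    public static void main(String[] args) {
        HashKeyedMap<String> m = new HashKeyedMap<>();
        m.put("/var/data/block-0001", "offset 0");
        m.put("/var/data/block-0002", "offset 4096");
        System.out.println(m.get("/var/data/block-0001"));  // prints: offset 0
    }
}
```

Two keys that happen to share the same 64-bit hash would silently alias each other; accepting that risk appears to be the "without collision check" idea the comments are discussing.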