Hashing algorithm for millions of variable length strings (URLs)

Question

The goal is to distribute approximately 100 million variable length strings, average length 100 characters, uniformly among 100 million buckets. Perfection not required, just no egregious clumping. The strings are URLs. They tend to begin much the same way and end much the same way (e.g. h t t p:// or w w w. and ".c o m", ".e d u", "a s p x" and so forth) and show their greatest variation in the latter half of the string, except for the final few characters.

What's a good algorithm that would accept the string and the number of slots as inputs, and return a number between 0 and SlotCount-1, and satisfy the uniform-distribution requirement?

Since you mention no other requirement (such as quick checks for existence etc.), why not just go in a cyclic fashion, putting the incoming string in the next slot and so on? — Paresh
– Paresh, Commented Mar 16, 2013 at 20:04
Sorry, I did not make a quick test for existence explicit. It is indeed a requirement. — mrblint
– mrblint, Commented Mar 16, 2013 at 22:59

vonbrand · Accepted Answer · 2013-03-16 23:31:06Z

2

Any string hashing algorithm designed for ASCII text should work fine. Take a look at the Wikipedia article, consider if the set could be manipulated by an adversary to create pile-ups (algoritmic vulnerabilities). If that isn't an issue, at strchr and partow several "classical" hash algorithms are given and compared. Get a subset of your data and use the parameters suggested there for your own comparison. And please tell us the results.

answered Mar 16, 2013 at 23:31

vonbrand

14.3k3 gold badges42 silver badges52 bronze badges

Add a comment |

Stack Exchange Network

Hashing algorithm for millions of variable length strings (URLs)

1 Answer 1

Hot Network Questions

Hashing algorithm for millions of variable length strings (URLs)

1 Answer 1

Related

Hot Network Questions