Creating a hashcode for use in a database (ie not using GetHashCode)

Question

I have recently been instructed in the ways of GetHashCode() and in particular "Consumers of GetHashCode cannot rely upon it being stable over time or across appdomains" (From an Eric Lippert blog article).

Unfortuantely I have been using this in a database to try to speed up lookups (by inserting the result of GetHashCode rather than doing searches on text strings). I am now aware that this is a very bad thing to do.

So I'm left wondering what there is that I can do instead. Is there anything that given a string will be guaranteed to return a sensibly collision resistant integer that I can use for lookups?

I could write something myself but I was hoping that there would be something built in that I could use without having to go for stuff in the cryptographic libraries which feels a bit heavyweight.

Why do you want that in the first place? Let the database do what it's meant for. — SLaks
– SLaks, Commented Oct 31, 2011 at 13:57
@SLaks: It may be me having picked some old wives tales up about databases but I thought that if you were trying to look up a 50 character string that it would be slower than looking up an int based on that string. Thinking about it you are probably right that if I index the column that it will do what I want and then some. I am still interested in if there is an answer though since getting a checksum or similar I would have thought was a generaly useful thing to do. — Chris
– Chris, Commented Oct 31, 2011 at 14:12
@Chris, “getting a checksum or similar” is not a generally useful, it's useful for some specific situations. And in each situation you have different requirements for the checksum/hashcode, so you should probably use different algorithm. — svick
– svick, Commented Oct 31, 2011 at 21:49
It is not possible to get a "shorthand" for a long string by computing any sort of "code" (short of literally compressing it). Where would the missing data go? A 32-bit int, for example, will be able to distinguish 2-32 different strings. A 32 character string can come in at least 36^32 different varieties -- and that's for a short string! — Kirk Woll
– Kirk Woll, Commented Oct 31, 2011 at 21:59
@KirkWoll: Yup. I wasn't looking for compression, just a hashing algorithm but one that doesn't potentially vary by architecture, etc so I am expecting collisions. I think I was just mistakenly trying to implement a kind of hashbucket system myself which I have now had thoroughly pointed out is wrong. ;-) — Chris
– Chris, Commented Nov 1, 2011 at 9:14

Jim Mischel · Accepted Answer · 2011-10-31 22:20:57Z

I would encourage you to consider what the others have said: let the database do what it's good at. Creating a hash code in order to optimize lookups is an indication that the indexes on your table aren't what they should be.

That said, if you really need a hash code:

You don't say if you want a 32-bit or 64-bit hash code. This one will create a 64-bit hash code for a string. It's reasonably collision-resistant.

public static long ComputeHashCode(string url)
{
    const ulong p = 1099511628211;

    ulong hash = 14695981039346656037;

    for (int i = 0; i < url.Length; ++i)
    {
        hash = (hash ^ url[i]) * p;
    }

    // Wang64 bit mixer
    hash = (~hash) + (hash << 21);
    hash = hash ^ (hash >> 24);
    hash = (hash + (hash << 3)) + (hash << 8);
    hash = hash ^ (hash >> 14);
    hash = (hash + (hash << 2)) + (hash << 4);
    hash = hash ^ (hash >> 28);
    hash = hash + (hash << 31);

    if (hash == (ulong)UNKNOWN_RECORD_HASH)
    {
        ++hash;
    }
    return (long)hash;
}

Note that this is a hash code and the likelihood of a collision is pretty small if you have up to a few billion records. Rule of thumb: you have a 50% chance of collision when the number of items exceeds the square root of your hash code's range. This hash code has a range of 2^64, so if you have 2^32 items, your chance of a collision is about 50%.

See http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=792 and http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table for more information.

What is the UNKNOWN_RECORD_HASH constant (I assume its a const) in this?
And I am marking this as the correct answer because you told me to stop being so silly and also answered my question. :)
UNKNOWN_RECORD_HASH is the value I use to indicate that a record doesn't have a hash code. I think it's equal to 0 in my system, but you can set it to any constant value. The check is there to prevent the method from ever generating the unknown record hash value.

svick · Accepted Answer · 2011-10-31 21:54:26Z

1

As SLaks pointed out in a comment, looking up data is what databases are good at.

If you need fast lookups, create an index on the column. At the very least, you won't have to deal with collisions anymore.

answered Oct 31, 2011 at 21:54

svick

246k54 gold badges405 silver badges535 bronze badges

1 Comment

Chris Over a year ago

I think you're right. Now I just need to go and fix all my horrendous code. ;-)

svick · Accepted Answer · 2011-11-01 11:56:00Z

1

Are you using a MSSQL Database? The T-SQL Checksum function does exactly that.

edited Nov 1, 2011 at 11:56

svick

246k54 gold badges405 silver badges535 bronze badges

answered Oct 31, 2011 at 13:59

Zeph

1,73815 silver badges30 bronze badges

Collectives™ on Stack Overflow

Creating a hashcode for use in a database (ie not using GetHashCode)

3 Answers 3

3 Comments

1 Comment

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

3 Comments

1 Comment

Comments

Linked

Related