I have the following two strings:

var string1 = "MHH2016-05-20MASTECH HOLDINGS, INC. Financialshttp://finance.yahoo.com/q/is?s=mhhEDGAR Online FinancialsHeadlines";

var string2 = "CVEO2016-06-22Civeo upgraded by Scotia Howard Weilhttp://finance.yahoo.com/q/ud?s=CVEOBriefing.comHeadlines";

These two strings are clearly different; however, the hash I compute for them with the GetHashCode method comes out the same.

var hash = 0;
var total = 0;
foreach (var x in string1) //string2
{
    //hash = x * 7;
    hash = x.GetHashCode();
    Console.WriteLine("Char: " +  x + " hash: " + hash + " hashed: " + (int) x);
    total += hash;
}

Total ends up being 620438779 for both strings. Is there another method that will return a more unique hash code? I need the hash code to be unique based on the characters in the string. Although the strings are different and the code works as written, the per-character values of these two strings just happen to add up to the same total. How can I improve this code to make the results more unique?

  • You do realize, don't you, that you can't guarantee a unique hash code for all possible strings? A hash code is 32 bits, meaning that there are 4 billion (and change) possible values. Each of your two strings is more than 120 characters long. The number of possible 120-character strings using the 96 printable ASCII characters is much, much larger. Collisions are inevitable. There is no such thing as a unique hash code in the general case. Making the hash code larger will reduce the chance of collision, but will not eliminate it. Commented Jun 27, 2016 at 0:41
  • Your question implies that you're trying to use hash codes as unique identifiers. This is an incredibly bad idea, and doomed to fail. The answer by @AlexD explains why. Commented Jun 27, 2016 at 0:46
  • @JimMischel yes I am aware of this now but thanks Commented Jun 27, 2016 at 1:50
  • Possible duplicate of Generate unique number based on string input in Javascript Commented Jan 28, 2018 at 21:52
  • Old question, I know, see my 3 years earlier question and answer: stackoverflow.com/questions/15377161/… Commented Jan 28, 2018 at 21:53

3 Answers

string.GetHashCode is indeed inappropriate for real hashing:

Warning

A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value. For this reason:

  • Do not serialize hash code values or store them in databases.
  • Do not use the hash code as the key to retrieve an object from a keyed collection.
  • Do not use the hash code instead of a value returned by a cryptographic hashing function. For cryptographic hashes, use a class derived from the System.Security.Cryptography.HashAlgorithm or System.Security.Cryptography.KeyedHashAlgorithm class.
  • Do not test for equality of hash codes to determine whether two objects are equal. (Unequal objects can have identical hash codes.) To test for equality, call the ReferenceEquals or Equals method.

and it carries a high likelihood of duplicates.
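As an aside on the loop in the question: it adds up the hash code of each character, and a sum does not depend on the order of its terms, so any two strings made of the same characters in a different order produce the same total. A minimal sketch of that effect follows; the class and method names are made up for illustration.

```csharp
using System;

public static class SumHashDemo
{
    // Mirrors the question's loop: add up the hash code of every character.
    public static int SumOfCharHashes(string s)
    {
        var total = 0;
        foreach (var c in s)
        {
            total += c.GetHashCode();
        }
        return total;
    }

    public static void Main()
    {
        // Same characters in a different order: the totals are equal.
        Console.WriteLine(SumOfCharHashes("ab") == SumOfCharHashes("ba")); // True

        // The strings themselves are still different.
        Console.WriteLine("ab" == "ba"); // False
    }
}
```

Longer strings can also collide without being permutations of each other, as soon as their per-character values happen to add up to the same number.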

Consider HashAlgorithm.ComputeHash. The sample is slightly changed to use SHA256 instead of MD5, as @zaph suggested:

using System.Security.Cryptography;
using System.Text;

static string GetSha256Hash(SHA256 shaHash, string input)
{
    // Convert the input string to a byte array and compute the hash.
    byte[] data = shaHash.ComputeHash(Encoding.UTF8.GetBytes(input));

    // Create a new StringBuilder to collect the bytes
    // and create a string.
    StringBuilder sBuilder = new StringBuilder();

    // Loop through each byte of the hashed data 
    // and format each one as a hexadecimal string.
    for (int i = 0; i < data.Length; i++)
    {
        sBuilder.Append(data[i].ToString("x2"));
    }

    // Return the hexadecimal string.
    return sBuilder.ToString();
}
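A possible way to call such a method on the two strings from the question is sketched below. To keep the example self-contained it repeats the method in a shorter form that uses BitConverter.ToString instead of a StringBuilder loop; the class name Sha256Usage is an assumption made for the example.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class Sha256Usage
{
    // Compact stand-in for GetSha256Hash: lowercase hex of the SHA-256 digest
    // computed over the UTF-8 bytes of the input.
    public static string GetSha256Hash(SHA256 shaHash, string input)
    {
        byte[] data = shaHash.ComputeHash(Encoding.UTF8.GetBytes(input));
        return BitConverter.ToString(data).Replace("-", "").ToLowerInvariant();
    }

    public static void Main()
    {
        var string1 = "MHH2016-05-20MASTECH HOLDINGS, INC. Financialshttp://finance.yahoo.com/q/is?s=mhhEDGAR Online FinancialsHeadlines";
        var string2 = "CVEO2016-06-22Civeo upgraded by Scotia Howard Weilhttp://finance.yahoo.com/q/ud?s=CVEOBriefing.comHeadlines";

        using (SHA256 sha = SHA256.Create())
        {
            string hash1 = GetSha256Hash(sha, string1);
            string hash2 = GetSha256Hash(sha, string2);

            Console.WriteLine(hash1);          // 64 hex characters (32 bytes)
            Console.WriteLine(hash2);
            Console.WriteLine(hash1 == hash2); // whether the two digests match
        }
    }
}
```

A 256-bit digest makes an accidental collision extremely unlikely, but it is still a fixed-size value, so it does not make collisions impossible in principle.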

7 Comments

for the complete example see msdn.microsoft.com/en-us/library/…
@lexx9999 I think the link in the post points to the same algorithm already.
As I was reading it, it did not include GetMd5Hash/ VerifyMd5Hash
This would result in slow performance; security hashing is not meant to be used for avoiding collisions.
@SirajMansour Cryptographic hashes are indeed designed for avoiding collisions. On my iPhone I can compute a SHA-256 hash on a 1MB file in 0.950 mSec, is that fast enough? BTW, SHA-256 is slightly faster than MD5 on my phone.
using System.Security.Cryptography;
using System.Text;

string data = "test";
byte[] hash;
using (MD5 md5 = MD5.Create())
{
    // ComputeHash returns the 16-byte MD5 digest of the UTF-8 encoded input.
    hash = md5.ComputeHash(Encoding.UTF8.GetBytes(data));
}

hash is a 16-byte array, which you could in turn convert to a hex string or a Base64-encoded string for storage.
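The two conversions could look roughly like this. It is only a sketch: the helper names Md5Of, ToHex and ToBase64 are invented for the example, and MD5 is kept solely to match the snippet above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class HashEncodingDemo
{
    // Hypothetical helper: the 16-byte MD5 digest of the UTF-8 encoded input.
    public static byte[] Md5Of(string data)
    {
        using (MD5 md5 = MD5.Create())
        {
            return md5.ComputeHash(Encoding.UTF8.GetBytes(data));
        }
    }

    // BitConverter.ToString yields "09-8F-..."; strip the dashes and lowercase it.
    public static string ToHex(byte[] hash) =>
        BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();

    public static string ToBase64(byte[] hash) => Convert.ToBase64String(hash);

    public static void Main()
    {
        byte[] hash = Md5Of("test");
        Console.WriteLine(ToHex(hash));    // 32 hex characters
        Console.WriteLine(ToBase64(hash)); // 24 Base64 characters, including padding
    }
}
```

Hex is easier to read and compare by eye; Base64 is shorter when the value only needs to be stored.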

EDIT:

What's the purpose of that hash code?

From hash(x) != hash(y) you can derive x!=y, but

from hash(x) == hash(y) you canNOT derive x==y in general!
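In code, that rule might be applied as in the sketch below: a stored hash is used as a cheap first check, and equal hashes are confirmed against the original data. The HashedText type is hypothetical, and SHA-256 is used here rather than MD5.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical type that keeps a text together with its precomputed hash.
public sealed class HashedText
{
    public string Text { get; }
    public string Hash { get; }

    public HashedText(string text)
    {
        Text = text;
        using (SHA256 sha = SHA256.Create())
        {
            Hash = Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(text)));
        }
    }

    public bool SameAs(HashedText other)
    {
        // Different hashes prove that the texts differ.
        if (Hash != other.Hash) return false;

        // Equal hashes only suggest equality, so confirm with the original data.
        return string.Equals(Text, other.Text, StringComparison.Ordinal);
    }
}

public static class HashCompareDemo
{
    public static void Main()
    {
        var a = new HashedText("MHH2016-05-20");
        var b = new HashedText("CVEO2016-06-22");
        var c = new HashedText("MHH2016-05-20");

        Console.WriteLine(a.SameAs(b)); // False
        Console.WriteLine(a.SameAs(c)); // True
    }
}
```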

12 Comments

This would result in slow performance; security hashing is not meant to be used for avoiding collisions.
@somerandomdude, as with every hash function, you would have to compare the original data if you want to be absolutely sure. You can try other hash algorithms, but you must always expect collisions. That's what "from hash(x) == hash(y) you canNOT derive x==y in general!" means.
MD5 should not be used, it is insecure, use a SHA-2 hash such as SHA-256. Generally SHA-256 is minimally slower than MD5.
@zaph they're designed for security purposes. Although it would serve the goal of avoiding collisions, that doesn't mean it's the right tool.
I am not ranting against crypto hash functions or the fact that they're proven to work where they're supposed to. I am just saying that if the sole purpose is randomness, depending on the performance needs crypto hash-algos are not always the answer. Now you can argue back and be aggressive as you want, but try to accept other opinions. It doesn't hurt.

I had a similar problem, which I solved as shown below. Using a test program, I checked up to 50 million texts, all of them in English, and found no collisions. Of course, given the size of the input space, this approach cannot produce a unique long value for every string of arbitrary length. For strings with a constrained, meaningful form, such as the names of tables and columns, it can be used. If there is a better solution, please let me know.

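using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public class ContextIdGenerator
{
    public long CreateTableId(string schemaName, string tableName) => CreateId($"{schemaName}.{tableName}", false, Encoding.ASCII);
    public long CreateTableColumnId(string schemaName, string tableName, string columnName) => CreateId($"{schemaName}.{tableName}.{columnName}", false, Encoding.ASCII);

    // Maps a context string to a long derived from its MD5 hash.
    // When sensitive is false, the text is upper-cased first so that case is ignored.
    public long CreateId(string context, bool sensitive, Encoding encoding)
    {
        if (string.IsNullOrEmpty(context)) return 0;
        if (!sensitive) context = context.ToUpperInvariant();
        using var hasher = MD5.Create();
        // Encode the text as-is so that sensitive == true preserves case.
        var bytes = encoding.GetBytes(context);
        var hBytes = hasher.ComputeHash(bytes);

        // Weight byte i by 10^(i+1) and sum; with 16 bytes the total stays within the range of long.
        return hBytes.Select((q, i) => Convert.ToInt64(q * Math.Pow(10, i + 1))).Sum();
    }
}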

The test code is shown below (ASCII case):

private static Random random = new Random();
public static string RandomString(int length)
{
    const string chars = ".ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    return new string(Enumerable.Repeat(chars, length)
        .Select(s => s[random.Next(s.Length)]).ToArray());
}

private void TestButton_Click(object sender, EventArgs e)
{
    Enabled = false;
    Dictionary<long, string> db = new();
    var A = new ContextIdGenerator();
    var errorCounter = 0;
    var counter = 0;
    while (counter <= 50000000)
    {
        Application.DoEvents();
        var len = random.Next(20, 30);
        var text = RandomString(len);
        var keyId = A.CreateId(text, false, Encoding.ASCII);
        counter++;
        if (keyId == 0) continue;
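        // TryAdd returns false when the id is already present.
        if (!db.TryAdd(keyId, text))
        {
            // The same text generated twice is not a collision.
            if (string.Equals(db[keyId], text, StringComparison.OrdinalIgnoreCase))
                continue;
            errorCounter++;
        }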
    }
    Enabled = true;
    MessageBox.Show($"ErrorCount : {errorCounter}");
}
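One alternative worth considering, offered only as a sketch and not as a replacement the answer's author endorsed, is to take the first 8 bytes of a SHA-256 digest directly as the 64-bit id instead of summing weighted bytes. The class name Sha256Id is invented for the example, and BinaryPrimitives assumes .NET Core 2.1 or later (or the System.Memory package). The result is still a 64-bit value, so collisions remain possible.

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;

public static class Sha256Id
{
    // Hypothetical helper: a 64-bit id taken from the first 8 bytes of a SHA-256 digest.
    public static long CreateId(string context, bool caseSensitive)
    {
        if (string.IsNullOrEmpty(context)) return 0;
        if (!caseSensitive) context = context.ToUpperInvariant();

        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(context));

            // Reading big-endian explicitly keeps the id identical across platforms.
            return BinaryPrimitives.ReadInt64BigEndian(digest);
        }
    }

    public static void Main()
    {
        // Case-insensitive ids ignore differences in letter case.
        Console.WriteLine(CreateId("dbo.Users", false) == CreateId("DBO.USERS", false)); // True

        Console.WriteLine(CreateId("dbo.Users.Name", false));
    }
}
```

Using the digest bytes directly keeps all 64 bits of the truncated hash, whereas a weighted decimal sum can map different byte sequences to the same total.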
