2

In SQL server, we can easily generate a hash on an entire row using "checksum":

select 
checksum(*), *    
from mytable

Is there a way to do the same thing using the hashbytes function? Something like:

select 
hashbytes('md5', *), *    
from mytable

2 Answers 2

2

Not easily I'm afraid. And in most cases you shouldn't really want to. But if you must, generating a XML document for each row and hashing the XML document is probably the easiest way to go:

SELECT  HASHBYTES('MD5', X.value)
FROM    Data
CROSS APPLY (
    SELECT value = 
        CAST((SELECT Data.* FROM   (SELECT 1) AS X (y) FOR XML AUTO) AS VARCHAR(MAX))
) AS X

EDIT: I've did some testing, applying the CAST(... AS VARCHAR(MAX)) as I've done in the updated statement above improves performance about 15%. More importantly, it prevents truncation errors on wide tables. Concatenating the columns yourself is even after this fix still about 15% faster. Using CONCAT() (available from MSSQL 2012) function makes your live a little easier though as you can omit applying ISNULL and CAST functions to each column.

Sign up to request clarification or add additional context in comments.

1 Comment

Clever, and it has built-in protection against column content overlap as well. Whether it performs acceptably is another matter, though.
1

If writing out the columns is no objection, you can simply concatenate the columns and hash that:

SELECT HASHBYTES('md5', 
    ISNULL(CONVERT(VARBINARY(MAX), column1), 0x) +
    ISNULL(CONVERT(VARBINARY(MAX), column2), 0x) + 
    ...
)

From SQL Server 2012 onwards, we can achieve the same more compactly using CONCAT (thanks to @MWillemse for the tip):

SELECT HASHBYTES('md5', CONCAT(column1, column2, ...))

Be aware that this can cause collisions if your columns are "the same": hashing AA,BB,CC yields the same result as AAB,BBC,C or even AABBCC,NULL,NULL. If this is a concern, you'll have to design your own hashing scheme to minimize the odds of this (by including the column names in the hash, for example). Also, this is a binary checksum: hashing A,B,C is not the same as a,b,c. Again, if this is a concern you'll have to tweak the data before hashing it (UPPER).

Finally, MD5 is no longer considered cryptographically secure, so only use this for things like checksum indexes where you need less collision probability than with CHECKSUM. If you need a cryptographic signature of a row, this is too simplistic.

1 Comment

It might be worth adding a separator between the columns, e.g. '|~'. +1 for using CONCAT

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.