(NOT) Using CHECKSUM or BINARY_CHECKSUM for Changing Dimensions

Let me start out by saying that CHECKSUM or BINARY_CHECSUM should not be used to determine if a row has changed in a database.  One might take a BINARY_CHECKSUM of a row when it is inserted into a dimension or a fact table and then later come back using the business key and the BINARY_CHECKSUM to determine whether the row had changed.

Here is a quote from Wikipedia about what checksums are for:

A checksum is a form of redundancy check, a simple way to protect the integrity of data by detecting errors in data that are sent through space (telecommunications) or time (storage). It works by adding up the basic components of a message, typically the asserted bits, and storing the resulting value. Anyone can later perform the same operation on the data, compare the result to the authentic checksum, and (assuming that the sums match) conclude that the message was probably not corrupted. (http://en.wikipedia.org/wiki/Checksum)

There is a nuance to checking the same data to see if it has changed because of error during transmission or changed due to an error in storage mechanism as opposed to checking different copies of data to see if they are the same.

Let me show an example.

First we’ll make a table of all the uppercase letters of the alphabet:

create table #letters (letter char(1))

insert into #letters (letter) values (‘A’)
insert into #letters (letter) values (‘B’)
insert into #letters (letter) values (‘C’)
insert into #letters (letter) values (‘D’)
insert into #letters (letter) values (‘E’)
insert into #letters (letter) values (‘F’)
insert into #letters (letter) values (‘G’)
insert into #letters (letter) values (‘H’)
insert into #letters (letter) values (‘I’)
insert into #letters (letter) values (‘J’)
insert into #letters (letter) values (‘K’)
insert into #letters (letter) values (‘L’)
insert into #letters (letter) values (‘M’)
insert into #letters (letter) values (‘N’)
insert into #letters (letter) values (‘O’)
insert into #letters (letter) values (‘P’)
insert into #letters (letter) values (‘Q’)
insert into #letters (letter) values (‘R’)
insert into #letters (letter) values (‘S’)
insert into #letters (letter) values (‘T’)
insert into #letters (letter) values (‘U’)
insert into #letters (letter) values (‘V’)
insert into #letters (letter) values (‘W’)
insert into #letters (letter) values (‘X’)
insert into #letters (letter) values (‘Y’)
insert into #letters (letter) values (‘Z’)

Next, lets query the table for a three way cartesian join of the letters and a BINARY_CHECKSUM of the concatenation of those three letters.  I’ll sort by the BINARY_CHECKSUM.

select L1.Letter+L2.Letter+L3.Letter lettergroup, binary_checksum(L1.Letter+L2.Letter+L3.Letter) binchecksum
from #letters L1
inner join #letters L2
on 1=1
inner join #letters L3
on 1=1
order by 2

You’ll see here that there are many combinations of three letters that have the same BINARY_CHECKSUM.

You can even run this query for a little more information from our contrived but illustratory sample:

select binchecksum, count(*)
from
(
select L1.Letter+L2.Letter+L3.Letter lettergroup, binary_checksum(L1.Letter+L2.Letter+L3.Letter) binchecksum
from #letters L1
inner join #letters L2
on 1=1
inner join #letters L3
on 1=1
) ss
group by binchecksum
order by count(*) desc

This query shows us that there are a total of only 7,016 total distinct BINARY_CHECKSUM values across a total of 17,576 (26*3) rows.  That’s not even half!

By the way, this happened at a client site I was working at and we had to completely re-write our updating procedures to stop using BINARY_CHECKSUM to determine when a row needed to be updated in a datamart.  For me, the moral of the story was to NEVER use BINARY_CHECKSUM to determine if a row has changed and needed updating.  If you decide to use it, try it out first to see if you are going to encounter some problems.  I just have a hunch you will.

Advertisements

5 Responses to (NOT) Using CHECKSUM or BINARY_CHECKSUM for Changing Dimensions

  1. Tan Nguyen says:

    You offer a good argument, but what is the alternative?

  2. Mark Garner says:

    Tan,

    You’ll have to compare each column one by one.

    Just the nature of the beast.

    Mark

  3. Ken L. says:

    if you had a primary key (PK) or unique key (UK) for each row, checksum works perfectly fine.
    you can join by the PK or UK and check the checksum for differences.

    If you table has a few columns then it is ok to compare column one by one, but when you have a many columns, it will slow down .

  4. mercerd says:

    interesting material, where such topics do you find? I will often go

  5. Peter says:

    Check the following link for the difference between checksum and binary_checksum and the way types and bytes for strings are handled:

    http://decipherinfosys.wordpress.com/2007/05/18/checksum-functions-in-sql-server-2005/

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: