(NOT) Using CHECKSUM or BINARY_CHECKSUM for Changing Dimensions

October 31, 2007

Let me start out by saying that CHECKSUM or BINARY_CHECSUM should not be used to determine if a row has changed in a database.  One might take a BINARY_CHECKSUM of a row when it is inserted into a dimension or a fact table and then later come back using the business key and the BINARY_CHECKSUM to determine whether the row had changed.

Here is a quote from Wikipedia about what checksums are for:

A checksum is a form of redundancy check, a simple way to protect the integrity of data by detecting errors in data that are sent through space (telecommunications) or time (storage). It works by adding up the basic components of a message, typically the asserted bits, and storing the resulting value. Anyone can later perform the same operation on the data, compare the result to the authentic checksum, and (assuming that the sums match) conclude that the message was probably not corrupted. (http://en.wikipedia.org/wiki/Checksum)

There is a nuance to checking the same data to see if it has changed because of error during transmission or changed due to an error in storage mechanism as opposed to checking different copies of data to see if they are the same.

Let me show an example.

First we’ll make a table of all the uppercase letters of the alphabet:

create table #letters (letter char(1))

insert into #letters (letter) values (‘A’)
insert into #letters (letter) values (‘B’)
insert into #letters (letter) values (‘C’)
insert into #letters (letter) values (‘D’)
insert into #letters (letter) values (‘E’)
insert into #letters (letter) values (‘F’)
insert into #letters (letter) values (‘G’)
insert into #letters (letter) values (‘H’)
insert into #letters (letter) values (‘I’)
insert into #letters (letter) values (‘J’)
insert into #letters (letter) values (‘K’)
insert into #letters (letter) values (‘L’)
insert into #letters (letter) values (‘M’)
insert into #letters (letter) values (‘N’)
insert into #letters (letter) values (‘O’)
insert into #letters (letter) values (‘P’)
insert into #letters (letter) values (‘Q’)
insert into #letters (letter) values (‘R’)
insert into #letters (letter) values (‘S’)
insert into #letters (letter) values (‘T’)
insert into #letters (letter) values (‘U’)
insert into #letters (letter) values (‘V’)
insert into #letters (letter) values (‘W’)
insert into #letters (letter) values (‘X’)
insert into #letters (letter) values (‘Y’)
insert into #letters (letter) values (‘Z’)

Next, lets query the table for a three way cartesian join of the letters and a BINARY_CHECKSUM of the concatenation of those three letters.  I’ll sort by the BINARY_CHECKSUM.

select L1.Letter+L2.Letter+L3.Letter lettergroup, binary_checksum(L1.Letter+L2.Letter+L3.Letter) binchecksum
from #letters L1
inner join #letters L2
on 1=1
inner join #letters L3
on 1=1
order by 2

You’ll see here that there are many combinations of three letters that have the same BINARY_CHECKSUM.

You can even run this query for a little more information from our contrived but illustratory sample:

select binchecksum, count(*)
from
(
select L1.Letter+L2.Letter+L3.Letter lettergroup, binary_checksum(L1.Letter+L2.Letter+L3.Letter) binchecksum
from #letters L1
inner join #letters L2
on 1=1
inner join #letters L3
on 1=1
) ss
group by binchecksum
order by count(*) desc

This query shows us that there are a total of only 7,016 total distinct BINARY_CHECKSUM values across a total of 17,576 (26*3) rows.  That’s not even half!

By the way, this happened at a client site I was working at and we had to completely re-write our updating procedures to stop using BINARY_CHECKSUM to determine when a row needed to be updated in a datamart.  For me, the moral of the story was to NEVER use BINARY_CHECKSUM to determine if a row has changed and needed updating.  If you decide to use it, try it out first to see if you are going to encounter some problems.  I just have a hunch you will.


Building create scripts for user permissions

October 5, 2007

I needed to use T-SQL (as opposed to using the Sql Server object model and .NET) to write out CREATE scrips for logins to a database as well as the roles to which they were assigned.  I decided to use the Catalog Security Views to accomplish this task.

Here is a script that will create CREATE USER scripts for all users in the database:

select ‘CREATE USER [‘ + database_principal_name + ‘] FOR LOGIN [‘ + server_login_name + ‘] WITH DEFAULT_SCHEMA=[dbo]’ as sql
from (
select dp.name as database_principal_name, sl.name server_login_name
from sys.database_principals dp inner join sys.sql_logins sl on dp.sid = sl.sid where dp.type=‘S’and dp.name not in (‘dbo’, ‘guest’, ‘INFORMATION_SCHEMA’, ‘sys’)
) ss

Here is a script that will create statements to add users to roles:

select ‘EXEC sp_addrolemember N”’ + role_name + ”’, N”’ + login_name + ”” as sql from
(
select dp_login.name as login_name, dp_role.name as role_name from sys.database_role_members drm
inner join
(
select * from sys.database_principals dp where dp.type=‘S’ and dp.name not in (‘dbo’, ‘guest’, ‘INFORMATION_SCHEMA’, ‘sys’)) dp_login
on drm.member_principal_id = dp_login.principal_id
inner join
(select * from sys.database_principals dp where dp.type=‘R’ ) dp_role
on drm.role_principal_id = dp_role.principal_id ) ss

I haven’t yet tried to script out table level permissions but I’ll bet that a similar approach would work.  Also, I tried to make the statements as readable as possible, but they don’t look very good in the blog window.  I think if you’ll take them and paste them into SSMS then they will be easier to read.

In a subsequent post, I’ll talk about how you can use this kind of script to build an automated delivery system for datamarts.

Mark