Automated Testing for Datamarts

September 27, 2006

Another tenet of Agile development is automated testing of code.  We can apply this idea to our datamart as well.

What kinds of things need to be tested in a data warehouse?  We don’t need to test transactions as this is the responsibility of the ETL system.  What we do need to test is the quality of the data in the mart.  This includes both measures in the fact table and data in the dimension tables.

There are two different times that we need to test our datamart.  We need to test before our ETL load and also after.  This allows us to first test the veracity of the structure of the datamart.  We can then run the regular or standard ETL process into the fact or dimension table and then re-run the test with the new expected results.  These two sets of tests should be run on known and static data.  One set scripted out as insert statements for the fact table or dimension table and another set in an ODS or flat file for use in the ETL process run that is tested.

Why would we want to all the trouble of creating runnig all these tests?  Having automated tests allows us to refactor and expand the datamart with more confidense.  As we move through more and more sprints or development iterations, our test set grows giving us more confidense that new development doesn’t have adverse effects on already deployed functionality.

Some of the obvious tests we might run include sums of measures in the fact table sliced by many different dimensions, row counts in the fact table according to dimension slices and row counts of dimension tables by different dimension attributes.

The most important thing about testing the datamart is that the number of tests that are run during the testing phase continues to grow.  As time passes, we should have more tests testing more functionality.

The tool that I have found that comes the closest to being able to perform these types of tests on SQL Server is called TSQLUnit.  This tool places about 3 or 4 tables and 5 or 6 stored procedures in your database.  You then call one of the stored procedures to run the tests.  It loops through the database catalog and calls stored procedures with the prefix of ‘tsu_’.  These are stored procedures that are written to perform the tests.  The test results are then stored in one of the tables.

While this to me isn’t ideal, it is a really good start to testing a datamart.  What would be better, in my opinion, would be a console application that does all the testing out of an XML file that defines queries and answers and runs all the tests and then reports answers.

Let me know what you think about TSQLUnit or comments about tests for a datamart. 


BuildDB – Open Source Tool for the Agile SQL2005 Database

September 24, 2006

I’d like to announce the beginning of a new open source project to fill a hole in the set of tools needed for an Agile Database / Datamart built on SQL 2005.

The name of the project is BuildDB and it is hosted by Google Project Hosting under the GNU General Public License 2.0.

In consultation with a friend and colleague, Greg Graham, I wrote a tool that will build a SQL 2005 database from script files using a command line console app.  The app is written in Visual Studio 2005 on the .NET Framework 2.0.  The project is definitely not finished and I hope that maybe a person or two will step up and help me test it and maybe even do some coding.

It works like this:

BuildDB [TargetServerName] [TargetDBName] [ScriptsRoot]

Under the ScriptsRoot directory, the application looks for these directories: “Logins”, “Users and Roles”, “Tables”, “User Defined Functions”, “Views”, “Stored Procedures”, “Triggers” and “Static Data”.

The tool will go through these directories and create the database and then run the scripts in the “Static Data” directory allowing the opportunity of inserting data right from the get-go.

There are already some known issues with the application and also some enhancements that I’ll be working on in the near future.  Even with these issues and enhancements outstanding the tool is pretty useful but maybe incapable of building a super complicated database.  If someone tries the tool and finds any other issues that I don’t have identified in the Issues List, please add it or get in touch with me via this blog.

Developer Edition to the Rescue

September 18, 2006

In an earlier post I had mentioned the problem that SSIS packages couldn’t be executed in a way that was “in-process” – in other words, my build would have to create a SQL job that executed a package that I had deployed to the SSIS server.  I would use sp_start_job to start the job and the problem is that it just starts the job and reports whether the job was started successfully.  The build moves on to the next step.  This would make it hard to load shared (conformed) dimensions first and then load the fact table second.

This doesn’t turn out to be a problem because we can use the Developer Edition of SQL Server and install SSIS on our local machines and use dtexec on the command line.

I don’t know why I didn’t think of this earlier, but it completely solves the problem.

We didn’t want to install SQL on our local machines for data security reasons and this doesn’t cause a problem.

Problem solved!

The Continuously Integrated Datamart

September 9, 2006

The first effort for creating an automated build for the data warehouse has turned out to need some re-thinking. 

The original vision was for dropping the datamart database and then recreating that database, playing in some create scripts for the tables and then deploying and running some SSIS packages to populate the data.  This turns out to be un-realistic.

There are tables in the warehouse that cannot be deleted.  Type two slowly changing dimensions (SCDs) and fact tables that keep history are two examples.  These tables can’t be re-created because the historical data isn’t in the ODS anymore.  This requires a change in thinking.

The source for a warehouse build is not just the create scripts and the ODS – this isn’t enough.  We have to think of the current warehouse as the source for the revised version.  This means modifying and updating the current datamart instead of completely rebuilding it during a build.

What does this mean for our Agile Warehouse – one that is continuously integrated?

We have to employ some database change management techniques so that we can build in one automated process.

The iteration lifecycle from a high level would include:

  1. Complete development on the dev database
  2. Restore the test environment to match the structure of the production environment
  3. Merge in changes from the dev environment to test environment
  4. Run automated tests on test environment
  5. If automated tests are successful and the current changes are to be moved to production, run the same sync and test process against the production environment and backup the test environment for the next iteration.  If the automated tests are not successful, then more development will be done on dev and the process starts over

This creates the framework for the ongoing lifecycle of development.  In my next post, I’ll talk a little about how you can build the technology infrastructure for making this environment a reality.

Automating SSAS Deployment

September 6, 2006

One of the last pieces of our complete deployment we have left to automate is Analysis Services.  This turns out to be pretty simple.

Microsoft has created a tool called Analysis Services Deployment Wizard.  It can be found in your start menu under SQL 2005, Analysis Services, Deployment Wizard.  This takes the asdatabase file in your bin directory and creates an XMLA script that creates the SSAS database.  It can be found at:

[Install Drive]\Program Files\Microsoft SQL Server\90\Tools\Binn\VSShell\Common7\IDE\Microsoft.AnalysisServices.Deployment.exe.

It can be called from the command line without showing the GUI wizard piece.

Now the only thing is that we need to play that script against our server.  Microsoft has also written this tool for us already.  There is a sample app that comes with SQL server and it can be found in [Install Drive]\Program Files\Microsoft SQL Server\90\Samples\Analysis Services\Administrator\ascmd\CS.  Open the solution and compile and you’ll have a tool that will play XMLA scripts against the server.  It has a simple command line interface too.

Next we’ll have to create some XMLA scripts that we can play that will process our dimensions and cubes.

Remote SSIS Execution

September 1, 2006

Jeff Shaver brought up a problem he is experienceing in a comment to my last post and instead of posting another comment in that entry, I decided to go ahead and post an article about how you can remotely execute SSIS packages.

Just to recap, we have now automated PowerDesigner so that we can get the create table statements out of it, automated the check in of these new scripts, then run the newest create table scripts against our destination server.  At this point we are ready to fill the datamart.  The next thing we do is deploy our SSIS package that fills the datamart by using DTUtil. Read more about this here.

The next thing to do is to call the SSIS package that we just deployed.

We have found a few different ways to do this.  One way is to create a job on the server through T-SQL and add the package execution as a step in the the job and then turn around and execute the job using sp_start_job.  The problem with this method is that sp_start_job doesn’t wait until the job is finished.  It only starts the job and reports whether the job began successfully.  This is OK, but the problem is that if you have some tasks you need to do after the job is completed, you don’t know when the job is done.  We haven’t implemented it yet, but we have discussed having the package perform some sort of a callback – maybe through a database record write or maybe a web service call – something of this nature.

Another way to execute the package would be to use xp_cmdshell to call DTExec on the command line.  This has its own set of problems.  First of all, there is a security problem that has to be addressed if you decide to turn on xp_cmdshell.  Second, if the job hangs, the xp_cmdshell is also hung.  I haven’t tried this method out myself to see if the job is executed in-process and blocks the thread until the job is completed or not.  I’m sure someone who has tried this can fill us in on that – although I’m pretty sure it will stay in-process.

A third way of doing the job (as Jeff suggested) would be to write a web service that loads up the SSIS package via SSIS objects and runs the package from there.  I’m pretty sure this will work but I didn’t work on it long enough to see it all the way through.  I also thought about using a CLR stored procedure to host the work – a slam dunk if you could – but it looks like you can’t use the assemblies that we need inside a CLR stored proc. 😦

One other way I want to mention in passing is that you can call DTExec from your build machine and execute a package located on a server.  I assume that this runs the package on the build server and not on the server so this could be a performance problem.  I did have one problem trying this – I got an error saying that I didn’t have the right version of SSIS to do a Data Conversion Task.  I’m assuming that means that when you run packages from the command line on a machine with BIDS, it runs as if you have standard edition instead of enterprise.  Not sure, but that is my hunch.

So a compromise is in order – you have to choose your poison.

We have decided, for now, to go with the job creation method.  We’ll see how that goes and change it later if we need.

Thanks to Jeff for the inquiry.