Google Code Upgrade

December 20, 2006

Google Code project hosting has been upgraded!  I just noticed that there is now a place for storing downloads, as well as a new Wiki tab.  The lack of a downloads area was my only complaint about Google Code up to this point, and it is now fixed.

I have placed the latest executable in the download section for each of the two projects that I’m hosting.

So now you don’t have to download the source (although you are welcome to) and compile it to use the applications.

(This goes for both NDataUnit and BuildDB)

Mark


NDataUnit – Database Unit Testing Tool

October 22, 2006

Unit testing is something that can really help in the development cycle for a data warehouse.  Having a robust suite of tests can help us refactor and create new features with confidence that we aren’t breaking existing features.  Obviously at the outset we won’t have a large set of tests, but as time passes, we should continually be creating simple, small, and fast tests that as a whole test every facet of the warehouse we can think to test.

So with much fanfare (drumroll please) I’d like to announce another open source tool that I have started:

NDataUnit

NDataUnit is written in C# under the GNU GPL 2.0 license and is intended for anyone who needs to run tests against data and compare the returned results to expected results.  NDataUnit is a console app that can be called from NAnt or MSBuild.  It takes a test definition file (XML) and some command line arguments (Server, Database, and Output File), runs the tests described in the file (SQL statements), and compares the results against those listed in the test definition file.  When the application completes, if you specify an output file, a log of your test results is written to it.  NDataUnit also returns the number of errors encountered during processing as a DOS ErrorLevel.  This means that if all the tests were successful, it returns 0 and NAnt will continue processing.  This allows for complete integration with NAnt and an automated build.
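
To make this concrete, here is a sketch of what a test definition might look like.  The element and attribute names below are illustrative assumptions, not the tool’s actual schema; see the project site for the real format.

    <Tests>
      <!-- Each test is a SQL statement plus the result we expect it to return -->
      <Test Name="FactRowCount">
        <Sql>SELECT COUNT(*) FROM FactSales</Sql>
        <Expected>15000</Expected>
      </Test>
      <Test Name="TotalSalesAmount">
        <Sql>SELECT SUM(SalesAmount) FROM FactSales</Sql>
        <Expected>1250000.00</Expected>
      </Test>
    </Tests>

A build could then call something like:

    NDataUnit MyServer MyWarehouse TestResults.log

and because the ErrorLevel is the number of failed tests, a non-zero result fails the NAnt or MSBuild task automatically.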

NDataUnit is also hosted on Google Code like BuildDB.

One way to make use of this tool and start unit testing your warehouse would be to take two subsets of your ODS data, load one of them into a structural copy of your ODS, and run the other through the ETL process into the star schema.  Then script out the star schema (or fact table) data as insert statements that can be run during your automated build.

After this is complete, you should be able to build your datamart from scripts (now including test data).  At this point the datamart is in a known state, which means we should be able to write tests that verify it.  For instance, we could count the number of distinct values in a foreign key column of the fact table, or sum the total value in a measure column.  Many tests could be written to ensure the datamart is in the known state we expect.
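
For example, here is the kind of SQL a known-state test might run (the FactSales table and column names are hypothetical):

    -- Count the distinct foreign key values pointing at one dimension
    SELECT COUNT(DISTINCT CustomerKey) FROM FactSales;

    -- Sum the total value in a measure column
    SELECT SUM(SalesAmount) FROM FactSales;

Each query’s expected result, computed once from the scripted test data, goes into the test definitions.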

After we test that the datamart is in the known state, we would proceed on to our regular transactional ETL process.  We would run the ETL process against the subset of data we identified earlier, which is now in a structural copy of the ODS.  Since we are once again using known data, we should be able to write tests to ensure that our ETL process is running correctly.

We can approach this in two ways:

  • Write tests that compare the test-data ODS to the newly loaded datamart.
  • Write tests that look for known values in the datamart after ETL.

It would make sense to have a mixture of these two types of tests.  Each one can help find different kinds of problems.
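
As a sketch of the first approach, a test could compare an aggregate in the test-data ODS copy against the same aggregate in the newly loaded datamart (the database and table names here are hypothetical):

    -- A non-zero difference means the ETL process dropped or distorted rows
    SELECT (SELECT SUM(OrderAmount) FROM TestODS.dbo.Orders)
         - (SELECT SUM(SalesAmount) FROM DataMart.dbo.FactSales) AS Difference;

The expected result for such a test is simply 0.  For the second approach, the expected value is a constant computed by hand from the known test data.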

I hope that this is only the beginning of NDataUnit.  I’m very open to having help with enhancing the application by either letting me know what it needs to do, or someone stepping up and getting their hands dirty in the code.

Let me know what you think.


Automated Testing for Datamarts

September 27, 2006

Another tenet of Agile development is automated testing of code.  We can apply this idea to our datamart as well.

What kinds of things need to be tested in a data warehouse?  We don’t need to test transactions as this is the responsibility of the ETL system.  What we do need to test is the quality of the data in the mart.  This includes both measures in the fact table and data in the dimension tables.

There are two different times that we need to test our datamart: before our ETL load and after it.  This allows us first to verify the structure and baseline content of the datamart.  We can then run the regular ETL process into the fact or dimension tables and re-run the tests with new expected results.  Both sets of tests should be run against known and static data: one set scripted out as insert statements for the fact or dimension tables, and another set kept in an ODS or flat file for use in the ETL run being tested.
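
For instance, a single row-count test could appear in both sets with different expected results (the names and numbers are made up for illustration):

    -- Before ETL: only the scripted insert statements are in the fact table
    SELECT COUNT(*) FROM FactSales;  -- expected: 500

    -- After ETL: the scripted rows plus the rows loaded from the test data
    SELECT COUNT(*) FROM FactSales;  -- expected: 650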

Why would we want to go to all the trouble of creating and running all these tests?  Having automated tests allows us to refactor and expand the datamart with more confidence.  As we move through more and more sprints or development iterations, our test set grows, giving us more confidence that new development doesn’t have adverse effects on already deployed functionality.

Some of the obvious tests we might run include sums of measures in the fact table sliced by different dimensions, row counts in the fact table according to dimension slices, and row counts of dimension tables by different dimension attributes.
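
A sliced test of that kind might look like the following (the schema is hypothetical); because the underlying test data never changes, every row of the result is a known, expected value:

    -- Sum of a measure sliced by one dimension attribute
    SELECT d.Region, SUM(f.SalesAmount) AS TotalSales
    FROM FactSales f
    JOIN DimCustomer d ON f.CustomerKey = d.CustomerKey
    GROUP BY d.Region;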

The most important thing about testing the datamart is that the number of tests that are run during the testing phase continues to grow.  As time passes, we should have more tests testing more functionality.

The tool that I have found that comes the closest to being able to perform these types of tests on SQL Server is called TSQLUnit.  This tool places about 3 or 4 tables and 5 or 6 stored procedures in your database.  You then call one of the stored procedures to run the tests.  It loops through the database catalog and calls stored procedures with the prefix of ‘tsu_’.  These are stored procedures that are written to perform the tests.  The test results are then stored in one of the tables.
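
The idea behind such a test procedure is roughly the following (a generic sketch of the pattern, not TSQLUnit’s exact API; see its documentation for the real conventions):

    CREATE PROCEDURE tsu_FactRowCount AS
    BEGIN
        DECLARE @actual INT;
        SELECT @actual = COUNT(*) FROM FactSales;
        IF @actual <> 15000
            -- TSQLUnit records failures in its own tables; RAISERROR stands in here
            RAISERROR('FactRowCount: expected 15000 rows, found %d', 16, 1, @actual);
    END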

While this isn’t ideal in my eyes, it is a really good start to testing a datamart.  What would be better, in my opinion, would be a console application that reads an XML file defining queries and expected answers, runs all the tests, and then reports the results.

Let me know what you think about TSQLUnit, or share your comments about tests for a datamart.


BuildDB – Open Source Tool for the Agile SQL2005 Database

September 24, 2006

I’d like to announce the beginning of a new open source project to fill a hole in the set of tools needed for an Agile Database / Datamart built on SQL 2005.

The name of the project is BuildDB and it is hosted by Google Project Hosting under the GNU General Public License 2.0.

In consultation with a friend and colleague, Greg Graham, I wrote a command line console app that will build a SQL 2005 database from script files.  The app is written in Visual Studio 2005 on the .NET Framework 2.0.  The project is definitely not finished, and I hope that maybe a person or two will step up and help me test it and maybe even do some coding.

It works like this:

BuildDB [TargetServerName] [TargetDBName] [ScriptsRoot]

Under the ScriptsRoot directory, the application looks for these directories: “Logins”, “Users and Roles”, “Tables”, “User Defined Functions”, “Views”, “Stored Procedures”, “Triggers” and “Static Data”.

The tool will go through these directories to create the database and then run the scripts in the “Static Data” directory, giving you the opportunity to insert data right from the get-go.
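
A ScriptsRoot prepared for the tool might look something like this (the file names are hypothetical; one object per script file works well):

    C:\MyDatabase\Scripts\
        Logins\
        Users and Roles\
        Tables\Customers.sql
        User Defined Functions\
        Views\vwActiveCustomers.sql
        Stored Procedures\uspGetCustomer.sql
        Triggers\
        Static Data\StateCodes.sql

which would then be built with:

    BuildDB MyServer MyDatabase C:\MyDatabase\Scripts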

There are already some known issues with the application, as well as some enhancements that I’ll be working on in the near future.  Even with these issues and enhancements outstanding, the tool is pretty useful, though maybe incapable of building a super complicated database.  If you try the tool and find any issues that I haven’t identified in the Issues List, please add them or get in touch with me via this blog.