Unit testing is something that can really help in the development cycle for a data warehouse. Having a robust suite of tests can help us refactor and create new features with confidence that we aren’t breaking existing features. Obviously at the outset we won’t have a large set of tests, but as time passes, we should continually be creating simple, small, and fast tests that as a whole test every facet of the warehouse we can think to test.
So with much fanfare (drumroll please) I’d like to announce another open source tool that I have started:
NDataUnit
NDataUnit is written in C# under the GNU 2.0 license and is intended for anyone who needs to perform tests on data and compare the returned results to expected results. NDataUnit is a console app that can be called from NAnt or MSBuild and it takes a test definition file (XML) and some command line arguments (Server, Database and Output File) and runs the tests described (SQL Statements) and compares the results against those listed in the test definition file. When the application is completed, if you specify, a log of the results of your tests are contained in a log file. Also, NDataUnit will return as a DOS ErrorLevel how many errors there were during processing. This means that if all the tests were successful, it will return 0 and NAnt will continue processing. This allow for complete integration with NAnt and an automated build.
NDataUnit is also hosted on Google Code like BuildDB.
One way to make use of this tool and start unit testing your warehouse would be to take two subsets of your ODS data and load one of them into a structural copy of your ODS and the other you should run through the ETL process into the star schema. Then script out the star schema or fact table data into insert statements that can be run during our automated build.
After this is complete, you should be able to build your datamart from scripts (now including test data). At this point we have the datamart in a known state. This means we should be able to write tests to ensure that it is in the known state. For instance we could sum up the number of distict values in a foreign key column in the fact table. We could also sum up the total value in a measure column. Many tests could be written to ensure that it is in the known state we expect.
After we test the datamart to be in the known state, we would proceede on to our regular transactional ETL process. We would run the ETL process against the subset of data we identified earlier that is now in a structural copy of the ODS. Since we are once again using known data, we should be able to write tests to ensure that our ETL process is running correctly.
We can approach this in two ways:
- Write tests that test the test-data ODS to the newly loaded datamart.
- Write tests that look from known values in the datamart after ETL.
It would make sense to have a mixture of these two types of tests. Each one can help find different kinds of problems.
I hope that this is only the beginning of NDataUnit. I’m very open to having help with enhancing the application by either letting me know what it needs to do, or someone stepping up and getting their hands dirty in the code.
Let me know what you think.