Ordering Data Inside ETL

April 27, 2010

Ever wondered if it mattered what order your data was in inside an ETL load?

I hadn’t really considered it until this morning.  I had already used the technique of dropping indexes during ETL and then creating them afterward, but this was new.

I had a load that took 50 seconds.  The target table had a clustered index as well as a unique index used to enforce the primary key.  The ETL load came in with no particular order.

The first thing I tried was to order the ETL in ascending order – the same order as the clustered index – to see if it mattered.  I suspected it would, and it did.  The time went from 50 s to 33 s.  This was good – roughly a third of the time gone just by ordering the data.  I wondered whether ordering the data backwards from the clustered index would help or hurt, and to my surprise it helped, lowering the time further to 28 s.

Now I wondered how much lower I could go by dropping the indexes before the ETL and then rebuilding them afterward.

I re-ran the non-ordered ETL using the drop-and-rebuild method and got 28 s.  So if all I had done was drop and rebuild the indexes, I would have gone from 50 to 28 s.  Not bad.  But when I ordered the load descending, it dropped even further to 14 s.  The last test, of course, was to order it back to ascending and see which ordering worked best for the drop-and-rebuild method.  It turned out that ascending worked better and dropped the time down to 13 s.
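For reference, here is a rough T-SQL sketch of the drop-and-rebuild pattern with an ordered load.  All of the table, column, and index names are made up for illustration – the real load ran through an SSIS package, so treat this as the equivalent plain-SQL shape rather than what I actually ran.

    -- Drop the unique index backing the primary key first, then the clustered index,
    -- so the nonclustered index isn't rebuilt a second time when the clustered index goes away.
    ALTER TABLE dbo.FactSales DROP CONSTRAINT PK_FactSales;
    DROP INDEX CIX_FactSales ON dbo.FactSales;

    -- Load from staging in the same order as the clustered key (ascending won in my tests).
    INSERT INTO dbo.FactSales (SalesKey, CustomerKey, Amount)
    SELECT SalesKey, CustomerKey, Amount
    FROM dbo.StageSales
    ORDER BY SalesKey;

    -- Rebuild the clustered index first, then the unique index that enforces the PK.
    CREATE CLUSTERED INDEX CIX_FactSales ON dbo.FactSales (SalesKey);
    ALTER TABLE dbo.FactSales
        ADD CONSTRAINT PK_FactSales PRIMARY KEY NONCLUSTERED (SalesKey);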

This was a surprise to me.  With the indexes in place, loading in the opposite order from the clustered index was better.  With drop-and-rebuild, loading in the same order as the clustered index was better.  I think the moral of the story isn’t specifically whether ascending or descending is better – I suppose it could be different in other cases.  Mostly, just remember that order can really matter, and when you are trying to squeeze every ounce of performance out of ETL, check the order of your data.

Anyone on the SQL CAT team want to comment on why backwards works better when indexes are left intact?


Unknown Member in Analysis Services

October 22, 2009

Analysis Services has a built-in member for each dimension called (by default) “Unknown.”  It exists to simplify dealing with facts that are unknown for a particular dimension.  If a fact row arrives with a null surrogate key – say, after failing a lookup in the SSIS package – Analysis Services assigns it to this special Unknown member and moves on.

There are three settings involved.  First, the dimension itself has a property called UnknownMember that describes the usage of this unknown member; it can be set to Visible, Hidden, or None.  Next, the dimension attribute whose usage is set to Key has a property called NullProcessing that can be set to UnknownMember.  This tells the dimension what to do when it comes across a null surrogate key.  Third, in the Dimension Usage screen of the cube, each dimension in use has a setting under Advanced that once again describes NullProcessing; this can also be set to control behavior when processing a null dimension surrogate key in the fact table.  Here is a link to a description of all the options.  It is a reference for Analysis Services Scripting, but it was the only place I could find these options described.  http://msdn.microsoft.com/en-us/library/ms127041.aspx

I think this unknown member is a very convenient inclusion by the Analysis Services team, but I think I’ll pass on using it.  There is some syntactic sugar in MDX that lets you refer to a member called UNKNOWNMEMBER, which seems nice, but what this scenario does not give you is an unknown member in the relational store.  If you don’t ever plan on querying the relational store, then the only place you will need an unknown member is Analysis Services.  You can then pass unknown members to the fact table as null and let AS process accordingly.

I like to leave the relational store in as queryable a state as possible.  Report writers might later have a reason to use it, and having nulls in the fact table surrogate keys will cause problems: they will have to use a LEFT JOIN and then derive an unknown member at query time.
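With made-up table and column names, that query-time workaround ends up looking something like this in every report query:

    SELECT ISNULL(c.CustomerName, 'Unknown') AS CustomerName,
           SUM(f.Amount)                     AS Amount
    FROM dbo.FactSales AS f
    LEFT JOIN dbo.DimCustomer AS c
        ON c.CustomerKey = f.CustomerKey      -- f.CustomerKey may be NULL
    GROUP BY ISNULL(c.CustomerName, 'Unknown');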

I think in this situation, creating an unknown member in the dimension with a surrogate key of 1, 0, or -1 (a special number of your choosing) is a good solution.
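Here is a minimal sketch of that approach, again with hypothetical table and column names: seed the dimension with an explicit unknown row, then map failed lookups to it during the fact load (in SSIS this would typically be a Derived Column after the Lookup component).

    -- Seed the dimension with an explicit unknown member,
    -- assuming CustomerKey is an identity column.
    SET IDENTITY_INSERT dbo.DimCustomer ON;
    INSERT INTO dbo.DimCustomer (CustomerKey, CustomerName) VALUES (-1, 'Unknown');
    SET IDENTITY_INSERT dbo.DimCustomer OFF;

    -- In the fact load, send failed lookups to -1 instead of NULL.
    INSERT INTO dbo.FactSales (CustomerKey, Amount)
    SELECT ISNULL(c.CustomerKey, -1), s.Amount
    FROM dbo.StageSales AS s
    LEFT JOIN dbo.DimCustomer AS c
        ON c.CustomerBusinessKey = s.CustomerBusinessKey;

Report writers can then use a plain INNER JOIN with no NULL handling at all.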

To make this work, you’ll have to turn off the unknown member in the dimension, change NullProcessing on the key attribute of the dimension, and change NullProcessing in the Dimension Usage screen of the cube.  But I think you’ll find this is a good compromise when the relational store needs to be as queryable as possible.


PerformancePoint 2007 with SQL 2008

October 14, 2009

I have been using the Monitoring Server Deployment Guide for PerformancePoint 2007 (excerpted from Nick Barclay and Adrian Downes’ book) as a guide for installing and deploying PPS Monitoring.  There is a lot of documentation on getting PPS 2007 to work with SQL 2008, much of it pointing to CU9 for SQL 2005.

PPS uses the SQL 2005 client components for database and Analysis Services access – PPS was released well before SQL 2008.  When it comes time to hook up to Analysis Services 2008 and SQL 2008, it can seem odd to install a SQL update on the PPS server even if SQL is on another box, but it is needed to update the client tools PPS uses to access the data back end.

The Deploying Monitoring Server documentation says that CU9 for SQL 2005 is required.  Well – this is not the complete truth.  Specifically, you need a minimum client version for accessing SQL 2008.  For SNAC, it is 2005.90.3282.0 with a file date of 05-Aug-2008.  The CU9 website doesn’t specifically point out the version numbers of the other required clients, ADOMD and ASOLEDB9.

In any case, applying SQL 2005 CU9 is unnecessary if you use the SQL Server 2005 Feature Pack from December 2008.  This collection of various and sundry tools has newer versions than those included in CU9, and they are sufficient for running PPS.  The SNAC version included in the December Feature Pack is 2005.90.4035.0 from 25-Nov-2008.


Installing on Server 2008 R2

September 29, 2009

I have been messing with setting up some new virtual machines.  I was trying to use the latest and greatest version of each tool and ran into problems.

As it turns out, the lack of compatibility between Server 2008 R2 and other tools is pretty well documented.  I tried to install SQL 2008 on Server 2008 R2 and got some weird errors like “server can’t access the window handle.”  I haven’t thought about window handles in a long time – not since Windows 3.11 programming in C++.

Anyways – I think you can fix all of this by slipstreaming in service packs of the tools you are installing, but I gave up and just went back to Server 2008 SP2.

I had problems installing MOSS 2007 on Server 2008 R2 as well.  During installation, I got an error saying it didn’t have permission to create the configuration database.  Very strange.  But installing on Server 2008 SP2 went just fine.


SQL 2008 SP1

April 8, 2009

Service Pack 1 for SQL 2008 is ready for download.  There isn’t much new functionality in SP1.  Microsoft has made an effort to release a functionality-complete piece of software, hence no new stuff.

One thing I saw that I thought was cool: Report Builder 2.0 one-click is now included.  I did think it was strange that if you clicked “Report Builder” in Reporting Services Manager you got Report Builder 1.0.  Anyways, that won’t happen anymore.

Here is the link:

http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=66ab3dbb-bf3e-4f46-9559-ccc6a4f9dc19


RDBMS Doomed?

February 13, 2009

Here is a great article from Read Write Web that discusses the place and uses for a Key/Value database such as SSDS, SimpleDB, or Google App Engine.  Jason Massie argues that the DBA profession will level off in 5-7 years and then start to fall off in 10 because of the prevalence of cloud-based Key/Value databases.

During the last couple of releases of SQL Server, Microsoft has been focused on BI.  Why?  Everyone has lots of data.  Everyone knows that there is a lot of value in that data, but they need to get at it.  BI is one of the top agenda items for CIOs according to Gartner.

What does this have to do with SSDS and SimpleDB?

SSDS and SimpleDB have basically no reporting or analytics capabilities.  Microsoft has mentioned that the future for SSDS includes OLAP-like capabilities.  Even if this were available and shipped today, would companies be ready to move their entire data ecosystem to Microsoft, Amazon, or Google, rewrite everything out of the RDBMS and into Key/Value programming, then rework ETL and data warehouse infrastructure to work on a non-existent (so far, Microsoft says) architecture?

They sure aren’t going to move all of their OLTP systems to the cloud right now, because the data is then trapped in a no-reporting storage mechanism.  Google App Engine only allows 1000 result objects per query.  Amazon only allows queries lasting 5 seconds.  Can companies’ appetite for analytics be satiated by that?  I don’t think so.

This seems a very long way off.  Even if one of these Key/Value databases HAD reporting and analytics, it seems a minimum of 5 to 7 years for any sizable amount of migration to take place.  And this would be if there WERE analytics and reporting.  There aren’t.

Many companies have large amounts of data and money in VSAM/COBOL systems that everyone said would have to be replaced a decade ago and that are still running fine.  Those systems aren’t going anywhere for quite a while, and they need analytics.  Those analytics will need to be done on-site on RDBMS/OLAP platforms.

I suppose for a San Francisco-based, startup-focused DBA there might be a noticeable movement to the cloud, but for the rest of the world, it is going to take longer than 5 years to notice a plateau in the need for RDBMS DBAs.

I will, however, concede that at some point there will be a draw to the cloud.  But there are many hurdles to overcome before that can happen in earnest.

One change might be a cloud-hosted RDBMS.  Certainly there are some out there somewhere, but the hot topic right now is Key/Value and how it enables large distributed systems.  For a company looking to save money, a hosted RDBMS makes sense, but probably not SSDS and Key/Value.


Float Data Types

January 29, 2009

This is a really big distinction that is only mentioned in passing in BOL (check here): approximate numerics are just that – approximate.  Because float and real are binary floating-point types, many decimal fractions have no exact representation in them.  You might put a value like 0.1 into one and later find that what is stored is very slightly off, and that a handful of them added together don’t quite equal what you expect.

If you need an exact representation of a number that isn’t an integer, you need to use numeric or decimal.  These are functionally the same, so it doesn’t matter which you choose.
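If you want to see the difference for yourself, here is a small illustrative script – nothing specific to any particular table, just adding 0.1 ten times in each type and comparing the result to 1.0:

    DECLARE @f float         = 0;
    DECLARE @d decimal(10,1) = 0;
    DECLARE @i int           = 0;

    WHILE @i < 10
    BEGIN
        SET @f = @f + 0.1;
        SET @d = @d + 0.1;
        SET @i = @i + 1;
    END;

    SELECT CASE WHEN @f = 1.0 THEN 'equal' ELSE 'not equal' END AS float_sum,    -- 'not equal'
           CASE WHEN @d = 1.0 THEN 'equal' ELSE 'not equal' END AS decimal_sum;  -- 'equal'

The float accumulates a tiny error on every addition, while the decimal stays exact.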