Avoiding the Union All

April 9, 2008

LineageIDs are assigned to any column being passed through a dataflow.  This is the id that SSIS uses to track where a column is sourced.  When you add an OLE DB Source to your data flow, every column that enters the data flow there is assigned a LineageID there.  New columns can be created and LineageIDs assigned at other places as well.

For instance, the Union all.  SSIS treats the output columns of a Union All as new columns in a dataflow and assigns new LineageIDs.  This is good because any colunm going through a Union All is technically not any of the input columns – it is a new column of the combined inputs.

A common use of the Union All is to bring rows back together after a Lookup task to lookup dimension keys where rows that didn’t find a lookup match are directed to the error output.  These error output rows are sent to a derived colunm task to have the dimension key set to the ‘NA’ member or whatever the ‘Unknown’ dimension member is.

This is all fine and dandy until you have quite a few Union All tasks in the data flow.  I found that once I had 20 or so in the flow, the performance of BIDS hit the floor.  It seemed as if BIDS couldn’t deal with all of the LineageIDs – or maybe more secifically all of the columns derived in the package.  Now instead of haveing 200 columns, we had 2000 to 3000 unique LineageIDs.

With suggestion of a colleague of mine, Yuri Bykov, we took out all of the Union All statements and instead set the error row output to ignore on the Lookup.  That flows right to a derived column task that check to see if the column that the lookup task added to the data flow is null.  If it is null and the column flowing through that was the column used to look up into the dimension, we know that the lookup failed and we can set the dimension key to ‘Unknown’ or whichever member it needs to be.

This really improved the performance of BIDS and I think even the performance of the package.

Advertisements

OLE DB Source Metadata Refresh

April 4, 2008

There is a particular behavior on the OLE DB Source that I’d like to highlight.  I have had the need to change the data type of a column in a source query recently and found that the OLE DB Source doesn’t flush and refresh its metadata.

Let’s say I have a varchar 5 field and it really is suposed to be a dollar amount.  I could go in to the package and add a Data Conversion task – which is a good option, but also I could go in to my source query and change it there by performing a cast([my_varchar_column] as money).  What I have found is that you have to comment out the column and click OK to get out of the OLE DB Source dialog, then go back and un-comment the column so that it is back in the metadata collection with the new data type.

I know that performing cast operations in a souce query may not be a best practice, but I had a table with a few hundred columns and a whole bunch of them needed to have their data type changed.  I could either write a source query (using the catalog) that had all the right data types built in, or build a set of data conversion tasks (like maybe 50) that do the same thing.  In the interest of time, I did the former.