Avoiding the Union All

LineageIDs are assigned to any column being passed through a dataflow.  This is the id that SSIS uses to track where a column is sourced.  When you add an OLE DB Source to your data flow, every column that enters the data flow there is assigned a LineageID there.  New columns can be created and LineageIDs assigned at other places as well.

For instance, the Union all.  SSIS treats the output columns of a Union All as new columns in a dataflow and assigns new LineageIDs.  This is good because any colunm going through a Union All is technically not any of the input columns – it is a new column of the combined inputs.

A common use of the Union All is to bring rows back together after a Lookup task to lookup dimension keys where rows that didn’t find a lookup match are directed to the error output.  These error output rows are sent to a derived colunm task to have the dimension key set to the ‘NA’ member or whatever the ‘Unknown’ dimension member is.

This is all fine and dandy until you have quite a few Union All tasks in the data flow.  I found that once I had 20 or so in the flow, the performance of BIDS hit the floor.  It seemed as if BIDS couldn’t deal with all of the LineageIDs – or maybe more secifically all of the columns derived in the package.  Now instead of haveing 200 columns, we had 2000 to 3000 unique LineageIDs.

With suggestion of a colleague of mine, Yuri Bykov, we took out all of the Union All statements and instead set the error row output to ignore on the Lookup.  That flows right to a derived column task that check to see if the column that the lookup task added to the data flow is null.  If it is null and the column flowing through that was the column used to look up into the dimension, we know that the lookup failed and we can set the dimension key to ‘Unknown’ or whichever member it needs to be.

This really improved the performance of BIDS and I think even the performance of the package.


3 Responses to Avoiding the Union All

  1. John Welch says:

    The Union All is a partially blocking transform, so it’s copying each buffer to a new buffer in memory. In some cases, this can improve performance, because it introduces a new thread of execution, but it also increases memory pressure. With 20 of these, you may have been encountering disk caching.

  2. Mark Garner says:

    Thanks for your observation. Actually, I was having performance problems in BIDS itself – during design time. The data flow performance was ok I guess, but honestly I couldn’t get anything done in BIDS to find out. Scrolling around in the package took a minute or two for the screen to refresh! Blech!

  3. Mark Tassin says:

    What about when doing a Type 2 SCD lookup? i.e. I have a lookup that does a lookup where a given date is between the effective and ianctive dates on a customer dimension. If it doesn’t find a match, it passes it to a second lookup that does a match using the current version of the record. That second lookup I have configured as you mention above. But without Union All, how would I recombine the streams of data?

    Or should I instead run the other lookups and changes as separate paths and never recombine them? Won’t each lookup generating two paths also cause trouble once I have 3 or 4 SCD lookups (generating 8-16 different yet nearly identitcal paths)?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: