Ordering Data Inside ETL

April 27, 2010

Ever wondered if it mattered what order your data was in inside an ETL load?

I hadn’t really considered it until this morning.  I have already considered and used the technique of dropping indexes during ETL and then creating them afterwords.  But this was new.

I had a load that took 50 seconds.  It had a clustered index as well as a unique index that was used to enforce a primary key.  The ETL load came in in no order.

The first thing I tried was to order the ETL in ascending order – which is the same as the order of the clustered index – to see if it mattered.  I suspected it would and it did.  the time went from 50 s. to 33 s.  This was good – nearly half the time just by ordering the data.  I wondered if ordering the data backwards from the clustered index would help or hurt, and to my surprise, it helped and lowered the time further to 28 s.

Now I wondered how much lower I could go by dropping the indexes before ETL and the rebuilding them afterwords.

I re-ran the non-ordered ETL using the drop and rebuild method and got 28 s.  So if all I had done was just drop and rebuild indexes, I would have gone from 50 to 28 s.  Not bad.  But when I ordered the load descending it dropped even further to 14 s.  The last test of course was the order it back to ascending and see which ordering worked best for the drop and rebuild method.  It turned out that ordering ascending worked better and dropped the time down to 13 s.

This was a surprise to me.  With the indexes built, the opposite order to the clustered index was better.  When you drop and reload, the same order was better.  I think the moral of the story isn’t specifically whether ascending or descending is better – I suppose it could be different in other cases.  Mostly just remember that the order can really matter and when you are trying to squeeze every ounce of performance out of ETL, check out the order of your ETL. 

Anyone on the SQL CAT team want to comment on why backwards works better when indexes are left intact?


Avoiding the Annoying in The File Connection Manager

May 19, 2008

Let’s say you want to delete a file as a part of your SSIS package and let’s say that the file is created during the executions of the package.  Simple – you use the File System Task.  You just add the task to the package and create a file connection manager to point to the file…only the file doesn’t exist yet.  The file connection manager editor won’t let you leave the file name pointing to a file that doesn’t exist.

Simple fix – All you have to do is use the properties window in SSIS instead of the dialog that comes up when you double click the connection manager.  Doesn’t look like DelayValidation is required here.

Duplicate ErrorCode Columns

May 2, 2008

A colleague of mine was selecting data from a table that contained a column called ErrorCode.  The OLE DB Source in SSIS was giving an error that said:

The package contains two objects with the duplicate name of “output column “ErrorCode” (13)” and “output column “ErrorCode” (279)”

She was having trouble figuring out why it was complaining since the table she was pulling from had only one column named “ErrorCode”.

If you remember, data transformations can add columns to the data flow that describe error conditions.  One of those columns is “ErrorCode”.  This was the problem.  SSIS wanted to reserve ErrorCode for the column that it adds and that conflicted with the column that was in the table.  I wasn’t able to make this happen when I added a column called “ErrorCode” to a select statement – only when it was in the table from which the statment is selected.

We went to the source table and changed the name from ErrorCode to something else.  This fixed the problem.  The other way to fix it is to use the AS clause in SQL to change the name in the source query.  Changing the output name in the Columns tab in the OLE DB Source didn’t fix the problem so you’ll have to go a little deeper to resolve the conflict.


Avoiding the Union All

April 9, 2008

LineageIDs are assigned to any column being passed through a dataflow.  This is the id that SSIS uses to track where a column is sourced.  When you add an OLE DB Source to your data flow, every column that enters the data flow there is assigned a LineageID there.  New columns can be created and LineageIDs assigned at other places as well.

For instance, the Union all.  SSIS treats the output columns of a Union All as new columns in a dataflow and assigns new LineageIDs.  This is good because any colunm going through a Union All is technically not any of the input columns – it is a new column of the combined inputs.

A common use of the Union All is to bring rows back together after a Lookup task to lookup dimension keys where rows that didn’t find a lookup match are directed to the error output.  These error output rows are sent to a derived colunm task to have the dimension key set to the ‘NA’ member or whatever the ‘Unknown’ dimension member is.

This is all fine and dandy until you have quite a few Union All tasks in the data flow.  I found that once I had 20 or so in the flow, the performance of BIDS hit the floor.  It seemed as if BIDS couldn’t deal with all of the LineageIDs – or maybe more secifically all of the columns derived in the package.  Now instead of haveing 200 columns, we had 2000 to 3000 unique LineageIDs.

With suggestion of a colleague of mine, Yuri Bykov, we took out all of the Union All statements and instead set the error row output to ignore on the Lookup.  That flows right to a derived column task that check to see if the column that the lookup task added to the data flow is null.  If it is null and the column flowing through that was the column used to look up into the dimension, we know that the lookup failed and we can set the dimension key to ‘Unknown’ or whichever member it needs to be.

This really improved the performance of BIDS and I think even the performance of the package.

OLE DB Source Metadata Refresh

April 4, 2008

There is a particular behavior on the OLE DB Source that I’d like to highlight.  I have had the need to change the data type of a column in a source query recently and found that the OLE DB Source doesn’t flush and refresh its metadata.

Let’s say I have a varchar 5 field and it really is suposed to be a dollar amount.  I could go in to the package and add a Data Conversion task – which is a good option, but also I could go in to my source query and change it there by performing a cast([my_varchar_column] as money).  What I have found is that you have to comment out the column and click OK to get out of the OLE DB Source dialog, then go back and un-comment the column so that it is back in the metadata collection with the new data type.

I know that performing cast operations in a souce query may not be a best practice, but I had a table with a few hundred columns and a whole bunch of them needed to have their data type changed.  I could either write a source query (using the catalog) that had all the right data types built in, or build a set of data conversion tasks (like maybe 50) that do the same thing.  In the interest of time, I did the former.

Yea for the scroll bar!

February 25, 2008

One of the things that irked me about Integration Services (BIDS) is that the Execute SQL Task didn’t have a scroll bar for the SQL Text.  If you had a long piece of SQL, the only way to scroll was to use your cursor and arrow keys.  How annoying.

This has been fixed in 2008 (Katmai).


SQL Task Scroll Bar

Null inside a Script Component

January 9, 2008

Hope everyone had a good holiday season.  I did.

Something I’m dealing with today has me stumped.  I searched google (ahem – live search) to see if there was any information out there and I didn’t find anything.

I have a script componenet that I’m using as a transformation and I’m adding a colunm to the data flow inside of it and the new column is an integer (DT_I4).  Inside the script I’m testing to see if a varchar field has a number in it (hence why I’m using a script component – there is no IsNumeric() in SSIS) and setting it to NULL if the incoming data isn’t a number.

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)Dim Result As Int32
   If System.Int32.TryParse(Row.VarcharColumn, Result) Then
      Row.IntegerColumn = Result
      Row.IntegerColumn = vbNull
   End If
End Sub

So here is the deal.  When I set the IntegerColumn = vbNull, it comes out in the data flow as 1.  If I set it = to Nothing, it comes out in the data flow as 0Does anyone know how to set an integer in the data flow = NULL inside a script transformation taskBy the way – the work around I came up with was to set the integer column = 98989898 ( or some other rediculous integer) and then use a derived column task to transform that back to null in the next step of the data flow.Anyone have thoughts?


UPDATE: A colleague of mine, Yuri Bykov, has come up with an answer.  Check the comment below.  Thanks, Yuri!