Ordering Data Inside ETL

April 27, 2010

Ever wondered if it mattered what order your data was in inside an ETL load?

I hadn’t really considered it until this morning. I had already considered and used the technique of dropping indexes during ETL and then creating them afterward. But this was new.

I had a load that took 50 seconds. The table had a clustered index as well as a unique index that was used to enforce a primary key. The ETL load arrived in no particular order.

The first thing I tried was to order the ETL in ascending order – the same as the order of the clustered index – to see if it mattered. I suspected it would, and it did: the time went from 50 s. to 33 s. This was good – about a third off just by ordering the data. I wondered whether ordering the data backwards from the clustered index would help or hurt, and to my surprise it helped, lowering the time further to 28 s.
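For illustration, here is roughly what the ordered load looked like. The table and column names below are hypothetical stand-ins, not the actual schema:

-- dbo.Staging and dbo.Target are hypothetical; assume dbo.Target has a
-- clustered index on LoadKey. The ORDER BY is the only difference
-- between the ordered and unordered test runs.
INSERT INTO dbo.Target (LoadKey, Payload)
SELECT LoadKey, Payload
FROM dbo.Staging
ORDER BY LoadKey ASC;  -- ORDER BY LoadKey DESC for the backwards test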

Now I wondered how much lower I could go by dropping the indexes before the ETL and then rebuilding them afterward.

I re-ran the non-ordered ETL using the drop-and-rebuild method and got 28 s. So if all I had done was drop and rebuild the indexes, I would have gone from 50 to 28 s. Not bad. But when I ordered the load descending, it dropped even further to 14 s. The last test, of course, was to order it back to ascending and see which ordering worked best with the drop-and-rebuild method. It turned out that ascending worked better and dropped the time down to 13 s.
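The drop-and-rebuild runs followed the usual pattern; a rough sketch, again with hypothetical object names:

-- Drop the PK (backed by a unique index) and the clustered index first.
ALTER TABLE dbo.Target DROP CONSTRAINT PK_Target;
DROP INDEX CIX_Target_LoadKey ON dbo.Target;

-- ... run the (ordered) INSERT from above ...

-- Rebuild afterward.
CREATE CLUSTERED INDEX CIX_Target_LoadKey ON dbo.Target (LoadKey);
ALTER TABLE dbo.Target ADD CONSTRAINT PK_Target
    PRIMARY KEY NONCLUSTERED (TargetId);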

This was a surprise to me. With the indexes in place, the opposite order to the clustered index was better; with drop and rebuild, the same order was better. I think the moral of the story isn’t specifically whether ascending or descending is better – I suppose it could be different in other cases. Mostly, just remember that the order can really matter, and when you are trying to squeeze every ounce of performance out of ETL, check the order of your data.

Anyone on the SQL CAT team want to comment on why backwards works better when indexes are left intact?


Fragmentation Stats on an Index (Part 2)

November 26, 2007

In my last post I was discussing how and why clicking on the Fragmentation tab on the properties page of an index can take a while to return results. I identified it as a DBCC command both in sp_who2 and by querying

SELECT * FROM sys.dm_exec_requests
CROSS APPLY sys.dm_exec_sql_text(sql_handle)

I guess the point I was trying to make was that index fragmentation stats don’t seem to be cached anywhere (why would they be, since they are changing all the time) and must be calculated each time you click on the Fragmentation tab – or run DBCC SHOWCONTIG – or query sys.dm_db_index_physical_stats. This means that on deep tables these queries can take some time. I’m not working with an amazing disk subsystem, and the server I’m working on has only two processors. So for me – for sure – they take a while.
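For reference, the DMV query behind that tab looks something like this; the database and table names here are just placeholders:

-- Placeholders throughout. 'LIMITED' scans the least data; 'SAMPLED'
-- and 'DETAILED' read progressively more pages and take longer.
SELECT index_id, index_type_desc, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(
        DB_ID('MyDatabase'), OBJECT_ID('dbo.MyBigTable'),
        NULL, NULL, 'LIMITED');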

If you don’t mind waiting for SSMS to finish the query to see the results, that’s great. I had accidentally clicked on the tab and needed SSMS to release the thread so that it would operate properly. My point was to show how to deal with SSMS hanging on this page when your data set is large.

Thanks to Simon Worth for pointing this out. He is completely correct: DBCC SHOWCONTIG has been deprecated in favor of sys.dm_db_index_physical_stats. SSMS is issuing a SELECT against sys.dm_db_index_physical_stats and not really calling DBCC SHOWCONTIG. I wouldn’t be surprised if, under the covers, both DBCC SHOWCONTIG and a call to sys.dm_db_index_physical_stats do the exact same operations to figure out the fragmentation stats. Microsoft has just put a new front end on DBCC SHOWCONTIG so that it looks more like a regular T-SQL SELECT statement.

Thanks, Simon.


Fragmentation Stats on an Index

November 8, 2007

I’m working with very large data sets, and I accidentally clicked on the “Fragmentation” tab on the properties page for the clustered index of a table. Management Studio did not handle this very gracefully. Everything locked up and Management Studio became unusable.

What Management Studio is doing is issuing a DBCC on the index to see what the fragmentation stats are. What makes this bad is that the table has 100,000,000 rows in it. Needless to say, this takes a while to complete.

I decided that I wasn’t willing to wait for the DBCC to finish, so I used a KILL statement to end the process. Everything came back to life and I was able to dismiss the index properties dialog.
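If you hit the same thing, finding and killing the session is straightforward; the SPID below is just an example:

-- Find the offending session (look for the fragmentation request
-- coming from SSMS), then end it. 53 is an example SPID.
EXEC sp_who2;
KILL 53;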

I wonder if it would be nice for SSMS to have some idea whether the DBCC will take a while and, if it will, maybe run it on a separate thread so that it doesn’t hang the rest of the application. Just a thought.


Using the Missing Index DMV

March 20, 2007

I ran across an older post from the Microsoft SQL Server Query Optimization Team. It is a script that creates a database, some supporting tables, and a job that uses a couple of stored procs in the new database to periodically check the Missing Index Dynamic Management View and pull the index suggestions down into a log. Every time an index is suggested by the DMV, it is logged. If the suggestion is already there, the script increments a counter and logs exactly when it noticed the suggestion.

Click here for the blog post.

Sometimes it suggests slightly crazy indexes, but any help with index optimization is welcome.
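If you just want a quick, one-off look at the suggestions rather than the full logging setup, a bare-bones query against the missing index DMVs looks something like this:

-- A minimal, unlogged peek at the same DMVs the linked script polls.
SELECT d.[statement] AS table_name,
       d.equality_columns, d.inequality_columns, d.included_columns,
       s.user_seeks, s.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
     ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s
     ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;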