Ever wondered if it mattered what order your data was in inside an ETL load?
I hadn’t really considered it until this morning. I have already considered and used the technique of dropping indexes during ETL and then creating them afterwords. But this was new.
I had a load that took 50 seconds. It had a clustered index as well as a unique index that was used to enforce a primary key. The ETL load came in in no order.
The first thing I tried was to order the ETL in ascending order – which is the same as the order of the clustered index – to see if it mattered. I suspected it would and it did. the time went from 50 s. to 33 s. This was good - nearly half the time just by ordering the data. I wondered if ordering the data backwards from the clustered index would help or hurt, and to my surprise, it helped and lowered the time further to 28 s.
Now I wondered how much lower I could go by dropping the indexes before ETL and the rebuilding them afterwords.
I re-ran the non-ordered ETL using the drop and rebuild method and got 28 s. So if all I had done was just drop and rebuild indexes, I would have gone from 50 to 28 s. Not bad. But when I ordered the load descending it dropped even further to 14 s. The last test of course was the order it back to ascending and see which ordering worked best for the drop and rebuild method. It turned out that ordering ascending worked better and dropped the time down to 13 s.
This was a surprise to me. With the indexes built, the opposite order to the clustered index was better. When you drop and reload, the same order was better. I think the moral of the story isn’t specifically whether ascending or descending is better – I suppose it could be different in other cases. Mostly just remember that the order can really matter and when you are trying to squeeze every ounce of performance out of ETL, check out the order of your ETL.
Anyone on the SQL CAT team want to comment on why backwards works better when indexes are left intact?