Editorials

Data Warehouse Load Optimization

Optimizing your data warehouse load processes can be a difficult problem to crack. The more normalization there is in your warehouse, such as with a snowflake schema, the harder it can be to import large volumes of data. Yet that is the main purpose of a data warehouse: to store and mine large volumes of data. The reason a normalized schema is harder to optimize is that you will often be required to perform joins to match up foreign keys in related tables. Here are a few techniques that can help.

First, use staging tables. You probably already do this. The key reason is that you can bulk load your data into staging tables very quickly, because limiting the indexes reduces the overhead of maintaining them while importing data. Even if you do require indexes on your staging tables, you can sometimes drop them during the import and re-create them once all the data has been loaded. Usually this is faster than maintaining the indexes during the insert phase.
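
As a rough sketch of that pattern (the table, index, and file path here are made up for illustration), the load might look something like this:

```sql
-- Hypothetical staging load: drop the index, bulk load, then rebuild.
-- Table, index, and file names are placeholders only.

-- 1. Drop the nonclustered index so the bulk insert does not have to maintain it.
DROP INDEX IF EXISTS IX_StageSales_OrderDate ON dbo.StageSales;

-- 2. Bulk load the extract file into the staging table.
BULK INSERT dbo.StageSales
FROM 'D:\Loads\sales_extract.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,   -- skip the header row
    TABLOCK                -- allows minimal logging when conditions are met
);

-- 3. Re-create the index once, after all the rows are in place.
CREATE NONCLUSTERED INDEX IX_StageSales_OrderDate
    ON dbo.StageSales (OrderDate)
    INCLUDE (CustomerCode, Amount);
```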

Second, use natural keys. If a natural key is available from an external source, then when you bring the data into your staging table, you can bring that key along as well. If you can use the natural key when mapping to the different dimension tables, the number of joins may be reduced, resulting in a faster import into the fact and dimension tables.
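
For example (a sketch with made-up table and column names), carrying the source system's customer code into staging means the fact load can resolve the warehouse surrogate key with a single join:

```sql
-- Hypothetical fact load: StageSales carries the natural key (CustomerCode),
-- so one join to DimCustomer is enough to pick up the surrogate key.
INSERT INTO dbo.FactSales (CustomerKey, OrderDate, Amount)
SELECT d.CustomerKey,
       s.OrderDate,
       s.Amount
FROM dbo.StageSales AS s
JOIN dbo.DimCustomer AS d
    ON d.CustomerCode = s.CustomerCode;   -- match on the natural key
```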

Third, use indexes created specifically for, and during, your import process. There are many kinds of indexes that can speed up the processing of large-volume imports. Filtered indexes are a prime example, since they isolate the specific categories of data you wish to manage. Covering indexes are very effective as well. A combination of a filtered index and a covering index can produce a marked improvement as you move data from staging into your warehouse schema.
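
A minimal sketch, again with hypothetical names: a filtered index that also covers the columns the load query reads, so pulling the unprocessed rows out of staging never has to touch the rest of the table:

```sql
-- Hypothetical filtered, covering index on the staging table.
-- The WHERE clause limits the index to rows not yet loaded, and the
-- INCLUDE list covers every column the load query needs.
CREATE NONCLUSTERED INDEX IX_StageSales_Unprocessed
    ON dbo.StageSales (OrderDate)
    INCLUDE (CustomerCode, Amount)
    WHERE Processed = 0;

-- A query shaped like this can be answered entirely from the index above.
SELECT OrderDate, CustomerCode, Amount
FROM dbo.StageSales
WHERE Processed = 0;
```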

Fourth, use a cursor. WHAT!? Yes…use a cursor. In this case, instead of using a cursor to process individual records, the cursor is used to process groups of data. It may be a simple For Loop in an SSIS package, or an actual cursor in a script or stored procedure. The intention is to reduce the scope of the data being managed at any one time. If you are summarizing millions of records, you will often find huge performance benefits by processing smaller groups of thousands, or hundreds of thousands, rather than processing the entire set of millions at once. Usually the benefit comes from not exhausting the server's faster resources: the work stays in memory, and pages are not swapped out to disk to make room for the next batch.
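
Here is a sketch of the idea in T-SQL (the table names and the Processed flag are assumptions): the cursor walks one load date at a time, and each pass summarizes only that group:

```sql
-- Hypothetical batched load: summarize one day of staging data per pass
-- instead of processing millions of rows in a single statement.
DECLARE @LoadDate date;

DECLARE load_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT DISTINCT OrderDate
    FROM dbo.StageSales
    WHERE Processed = 0;

OPEN load_cursor;
FETCH NEXT FROM load_cursor INTO @LoadDate;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Summarize just this group into the fact table.
    INSERT INTO dbo.FactSalesDaily (OrderDate, CustomerKey, TotalAmount)
    SELECT s.OrderDate, d.CustomerKey, SUM(s.Amount)
    FROM dbo.StageSales AS s
    JOIN dbo.DimCustomer AS d
        ON d.CustomerCode = s.CustomerCode
    WHERE s.OrderDate = @LoadDate
      AND s.Processed = 0
    GROUP BY s.OrderDate, d.CustomerKey;

    -- Mark the group as done so it is not picked up again.
    UPDATE dbo.StageSales
    SET Processed = 1
    WHERE OrderDate = @LoadDate;

    FETCH NEXT FROM load_cursor INTO @LoadDate;
END

CLOSE load_cursor;
DEALLOCATE load_cursor;
```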

Fifth, as always, optimize TempDB; large sorts, hash joins, and index rebuilds all spill into it during a heavy load.
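
One common step, sketched below with placeholder sizes and paths that would need to match your own server, is to pre-size TempDB and give it several equally sized data files on fast storage:

```sql
-- Hypothetical TempDB tuning: pre-size the primary data file and add a
-- second, equally sized file. Sizes, growth, and paths are placeholders.
ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, SIZE = 8GB, FILEGROWTH = 1GB);

ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev2,
              FILENAME = 'T:\TempDB\tempdev2.ndf',
              SIZE = 8GB,
              FILEGROWTH = 1GB);
```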

Sixth, make sure your log files are on disks optimized for sequential access. Preferably, keep them on a disk subsystem separate from the rest of your database files.
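
If a log file needs to be relocated to its own drive, the move looks roughly like this (database name, logical file name, and path are assumptions):

```sql
-- Hypothetical relocation of the warehouse log file to a dedicated log drive.
ALTER DATABASE Warehouse SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- Point the metadata at the new location, then physically move the .ldf
-- file to L:\Logs before bringing the database back online.
ALTER DATABASE Warehouse
    MODIFY FILE (NAME = Warehouse_log, FILENAME = 'L:\Logs\Warehouse_log.ldf');

ALTER DATABASE Warehouse SET ONLINE;
```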

Finally, throw hardware at it. Nothing can replace the value of memory. The more memory you have on the server, the less disk thrashing and swapping out of cache occurs.

What other things would you recommend? Why not leave a comment with your favorite?

Cheers,

Ben