Editorials

ROW_NUBER vs. GROUP BY

Last Thursday I published an editorial demonstrating one use of the ROW_NUMBER function as a technique to get the last value for each group contained within a set of groups, A Little ROW_NUMBER Magic. See the example below in figure 1 as executed against my instance of AdventureWorks2012.

Figure 1 – ROW_NUMBER Grouping Query

WITH MyPay
AS
(
    SELECT      BusinessentityId
               ,RateChangeDate
               ,Rate
               ,PayFrequency
               ,ModifiedDate
               ,ROW_NUMBER() OVER 
                    (
                        PARTITION BY BusinessEntityID
                        ORDER BY ModifiedDate DESC
                    ) as RowId
    FROM        AdventureWorks2012.HumanResources.EmployeePayHistory 
)
SELECT      *
FROM        MyPay
WHERE       RowId = 1

Eilenblogger posted a comment with a different solution, which was the method we used before ROW_NUMBER was made available in SQL Server 2005. I modified his example slightly so that the difference is easily visible. I’ll call this the GROUP BY example for clarification.

Figure 2 – GROUP BY Query

WITH MyPay
AS
(
    SELECT      BusinessentityId
               ,Max(RateChangeDate) as RateChangeDate
    FROM        AdventureWorks2012.HumanResources.EmployeePayHistory 
    GROUP BY    BusinessEntityID
)
SELECT      *
FROM        MyPay
JOIN        AdventureWorks2012.HumanResources.EmployeePayHistory eph 
            ON MyPay.BusinessEntityID = eph.BusinessEntityID
            AND MyPay.RateChangeDate = eph.RateChangeDate

We both reviewed the query plan for either query, and it made sense that the group by method should be the faster in performance. I had done that comparison back in 2005 and found that ROW_NUMBER performed best for the scenarios I was testing. As promised in the comments, I did go back and do a comparison for performance between the two implementations. This editorial contains the results of that reseach. Because of limited space, I am summarizing the results.

Even though the query plans were considerably different, I was not able to get any measurable difference between the two implementations. I compared CPU execution time, total duration, and read count by turning on the advanced options in SSMS for SET STATISTICS TIME and SET STATISTICS IO on. Running multiple executions of both queries were essentially the same.

I then tried the same thing by changing the Max Degrees of Parallelism and testing performance multiple times. Again, both queries were essentially comparable. This left me still wanting more. So, I added some records to the table being queried. Originally the table had only 324 rows, and returned only 290 rows when our query returned only one row per BusinessEntityID. I added records to the table so that the number of unique BusinessEntitityID values was still 290, but had nearly 2 million records. That meant that the query then needed to find the last record per BusinessEntityID out of a list of nearly 2 million records, returning only 290 records.

With these additional records, I finally had something that I could measure. The GROUP BY implementation was far superior, regardless of Max Degrees of Parallelism. It consistently ran at about 500ms. The ROW_NUMBER implementation ran at 4,000ms+. Changing Parallelism did reduce performance time to 3,500ms.

Another thing you might look into for something like this would be indexing. In this case, the best index was already part of the primary key of the table. The table had a unique primary key of BusinessEntityID and RateChangeDate. So, it seemed like there may be some unfair advantage to the GROUP BY implementation. To test this theory I modified both queries to use the ModifiedDate column instead of RateChangeDate, because my table didn’t have any index for ModifiedDate. Even with 2 million records, the results were virtually the same.

Here are some things I take away from this exercise

  • The SQL Server engine has changes from one version to the next. Things that may have been true in one version may not be true in another version
  • The size of tables (columns and number of rows), indexes available, and parallelism can make a difference in performance
  • If you have a query that runs good now, it may not run well later if the data changes dramatically
  • Don’t be afraid to challenge things that you have proven in the past

Cheers,

Ben