I really like the ROW_NUMBER() function in SQL Server as a means of getting the first or last instance of a record for each specific group within record set containing more than one group. Ok, let me break that down a little bit.
It’s pretty easy to get the last record of the line items for a single purchase order. But, what if you wanted to get the last line item for all purchase orders? You can; they are all in the same table. But, without using ROW_NUMBER() the task is a little more complicated. Since this editorial is about ROW_NUMBER, I won’t take the time to show you how we used to do it.
For my example today I am using the EmployeePayHistory table found in the AdventureWorks2012 database, because it demonstrates the problem, and I have a copy of that database handy. The table has BusinessEntityID, RateChangeDate, Rate, PayFrequency and ModifiedDate columns. The first thing I wanted to do was to determine if there were any groups within that table having multiple records. To find that out, I did a GROUP BY query on BusinessEntityID.
SELECT BusinessEntityID, Count(1) as Records FROM AdventureWorks2012.HumanResources.EmployeePayHistory GROUP BY BusinessentityId HAVING count(1) > 1
This query returned 13 BusinessEntityID records, each having 3 unique entries. I wanted to see what those specific records looked like, so I used the results of the previous input in a new query, joining back to the detail table to get the records for only those 13 BusinessEntities.
WITH MyEntities AS ( SELECT BusinessentityId, Count(1) as Records FROM AdventureWorks2012.HumanResources.EmployeePayHistory GROUP BY BusinessentityId HAVING count(1) > 1 ) SELECT eph.* FROM MyEntities e JOIN AdventureWorks2012.HumanResources.EmployeePayHistory eph ON e.BusinessEntityID = eph.BusinessEntityID ORDER BY eph.BusinessEntityID, eph.RateChangeDate
Following is a table with the first BusinessEntityID
BusinessEntityID | RateChangeDate | Rate | PayFrequency | ModifiedDate |
---|---|---|---|---|
4 | 2002-01-05 | 8.62 | 2 | 2001-12-22 |
4 | 2004-07-01 | 23.72 | 2 | 2004-06-16 |
4 | 2006-01-15 | 29.8462 | 2 | 2006-01-01 |
The same kind of history was demonstrated for the other 12 Business Entities. I now want to write a query that returns only one record for every Business Entity in the detail table, regardless of how many records may be present for a single Business Entity. Using ROW_NUMBER() it is easy to get the first or last record for each individual Business Entity. In my query I want to get the last record for each BusinessEntityID based on the RateChangeDate. From the table above, I want to get the third record, having the last RateChangeDate.
WITH MyPay AS ( SELECT BusinessentityId ,RateChangeDate ,Rate ,PayFrequency ,ModifiedDate ,ROW_NUMBER() OVER ( PARTITION BY BusinessEntityID ORDER BY RateChangeDate DESC ) as RowId FROM AdventureWorks2012.HumanResources.EmployeePayHistory ) SELECT BusinessentityId ,RateChangeDate ,Rate ,PayFrequency ,ModifiedDate FROM MyPay WHERE RowId = 1
This query returns the last record for each Business Entity, based on the RateChangeDate. I enforce this behavior by Partitioning by each BusinessEntityID. Using this partition causes the ROW_NUMBER function to start counting from 1 each time the BusinessEntityID changes. In the ROW_NUMBER function I also sort the records with the ORDER BY RateChangeDate DESC clause, so the data is sorted by the latest to the oldest RateChangeDate. I wrap all of that magic in a CTE I call MyPay, which I may then use to write my query, returning only the records I want with the where clause WHERE RowId = 1, the value of my ROW_NUMBER function in the CTE.
Your final results are now 290 records in my database, 26 less than the total number of records in the table.
There you have it, a way to get max out of a group within a set of groups.
Cheers,
Ben