Top Few per Group – Using T-SQL

Ever needed to get the most recent 2 transactions for every customer in a list using a T-SQL statement?  This sounds like an easy task, but upon further examination the query is a little harder than it seems.  One solution, a very ugly solution I might add, would be to cursor through all the customers and run a select statement with an order by clause to get the most recent transactions.  I bet there are other possible solutions, but they all seem to take multiple passes.  A lot of people hate continued posts, but this one is kind of long so I’m going to do just that.  Sorry for the click-through.  I’m not making money on ad pageviews!  🙂

Maybe everybody else has already figured this out, but I finally came up with what I think is a very elegant solution.  If you use the RANK() function and ROW_NUMBER() function, it gets pretty easy.  First, let’s start with an example that is in BOL.  This query retrieves sales numbers for customers by zip code and demonstrates all the different kind of ranking functions SQL Server has.  I have taken out a couple of columns for clarity.  I didn’t need them for this demonstration.  If you’d like to see the original, please click here.

SELECT c.FirstName, c.LastName
    ,ROW_NUMBER() OVER (ORDER BY a.PostalCode) AS RowNumber
    ,RANK() OVER (ORDER BY a.PostalCode) AS Rank
    ,s.SalesYTD, a.PostalCode
FROM Sales.SalesPerson s
    INNER JOIN Person.Contact c
        ON s.SalesPersonID = c.ContactID
    INNER JOIN Person.Address a
        ON a.AddressID = c.ContactID
WHERE TerritoryID IS NOT NULL
    AND SalesYTD <> 0

Here are the results:

FirstName LastName RowNumber Rank SalesYTD PostalCode
Maciej Dusza 1 1 $4,557,045.05 98027
Shelley Dyck 2 1 $5,200,475.23 98027
Linda Ecoffey 3 1 $3,857,163.63 98027
Carla Eldridge 4 1 $1,764,938.99 98027
Carol Elliott 5 1 $2,811,012.72 98027
Jauna Elson 6 6 $3,018,725.49 98055
Michael Emanuel 7 6 $3,189,356.25 98055
Terry Eminhizer 8 6 $3,587,378.43 98055
Gail Erickson 9 6 $5,015,682.38 98055
Mark Erickson 10 6 $3,827,950.24 98055
Martha Espinoza 11 6 $1,931,620.18 98055
Janeth Esteves 12 6 $2,241,204.04 98055
Twanna Evans 13 6 $1,758,385.93 98055

I have modified the query so that I can get the customers with the two highest sales number for a zip code:

SELECT * FROM
(
    SELECT c.FirstName, c.LastName
        ,ROW_NUMBER() OVER (ORDER BY a.PostalCode, s.SalesYTD desc) AS RowNumber
        ,RANK() OVER (ORDER BY a.PostalCode) AS Rank
        ,s.SalesYTD, a.PostalCode
    FROM Sales.SalesPerson s
        INNER JOIN Person.Contact c
            ON s.SalesPersonID = c.ContactID
        INNER JOIN Person.Address a
            ON a.AddressID = c.ContactID
    WHERE TerritoryID IS NOT NULL
        AND SalesYTD <> 0
) ss

You’ll see here that I’ve added a couple of things.  First I added another sort item to the RowNumber column.  This changes the RowNumber column so that the order is now not just by PostalCode, it is also by SalesYTD descending.  The seond thing I added was an surrounding SELECT statement so that I could reference the RowNumber and Rank columns in the WHERE clause.  I’m not sure why, but you can’t reference ROW_NUMBER() or RANK() in the where clause.  This fixes that.  I have also used a Common Table Expression to accomplish the same thing.  Use whatever you like.  Here are the results:

FirstName LastName RowNumber Rank SalesYTD PostalCode
Shelley Dyck 1 1 $5,200,475.23 98027
Maciej Dusza 2 1 $4,557,045.05 98027
Linda Ecoffey 3 1 $3,857,163.63 98027
Carol Elliott 4 1 $2,811,012.72 98027
Carla Eldridge 5 1 $1,764,938.99 98027
Gail Erickson 6 6 $5,015,682.38 98055
Mark Erickson 7 6 $3,827,950.24 98055
Terry Eminhizer 8 6 $3,587,378.43 98055
Michael Emanuel 9 6 $3,189,356.25 98055
Jauna Elson 10 6 $3,018,725.49 98055
Janeth Esteves 11 6 $2,241,204.04 98055
Martha Espinoza 12 6 $1,931,620.18 98055
Twanna Evans 13 6 $1,758,385.93 98055

You’ll see now that we have instructed the ROW_NUMBER() function now to order by PostalCode as well as SalesYTD, the solution takes shape.  We see that if we restrict the WHERE clause to only include rows where RowNumber is equal to or only one greater than Rank, we have our solution.  Here is the final SQL statement:

SELECT * FROM
(
    SELECT c.FirstName, c.LastName
        ,ROW_NUMBER() OVER (ORDER BY a.PostalCode, s.SalesYTD desc) AS RowNumber
        ,RANK() OVER (ORDER BY a.PostalCode) AS Rank
        ,s.SalesYTD, a.PostalCode
    FROM Sales.SalesPerson s
        INNER JOIN Person.Contact c
            ON s.SalesPersonID = c.ContactID
        INNER JOIN Person.Address a
            ON a.AddressID = c.ContactID
    WHERE TerritoryID IS NOT NULL
        AND SalesYTD <> 0
) ss
WHERE ss.RowNumber = ss.Rank or ss.RowNumber = ss.Rank + 1

And here are the final results:

FirstName LastName RowNumber Rank SalesYTD PostalCode
Shelley Dyck 1 1 $5,200,475.23 98027
Maciej Dusza 2 1 $4,557,045.05 98027
Gail Erickson 6 6 $5,015,682.38 98055
Mark Erickson 7 6 $3,827,950.24 98055

These two functions put together allow us to make this kind of request a very easy task.  It is also a batch process that SQL Server can handle with efficiency.

Advertisements

13 Responses to Top Few per Group – Using T-SQL

  1. Mike Morin says:

    Hi Mark –

    Thanks for posting on this subject. I agree that it is a tricky problem and like you, I prefer more elegant single-statement approach. I think the solution you suggest is a great example of how to use new syntax found in SS2005, though it might be a little more power then needed. When faced with this exact issue a while back, and I actually found the answer in an MS Access KB article! I basically followed the article’s instructions, lifted the SQL generated, and tidied up a bit. Here it is for the Northwind database, but you can apply it to any situation once you see the form. The subquery is what limits the list to N choices per group, and its where clause keeps it lined up with the categories in the main query. The main query controls the categories and products you want to select. In this example, the IN clause returns a list of quantities as though they were IDs, but it could just as easily return ProductIDs and you could set any criteria and sort you choose in the subquery. I’m pasting the code snippet I keep with me all the time because it is such a twisty problem, no one should have to solve it from scratch every time! Thanks for your post!

    –Query to select Top N Values Per Group
    –from: http://support.microsoft.com/kb/q153747/

    use northwind
    go

    –picks the top 3 products by quantity for each category
    SELECT c.CategoryName, p.ProductName, p.UnitsInStock
    FROM Categories c
    INNER JOIN Products p ON c.CategoryID = p.CategoryID
    WHERE p.UnitsInStock IN (
    SELECT TOP 3 [UnitsInStock]
    FROM Products
    WHERE [CategoryID] = c.[CategoryID]
    ORDER BY [UnitsInStock] DESC
    )
    ORDER BY c.CategoryName, p.UnitsInStock DESC

  2. Mark Garner says:

    You absolutely can do it that way. Don’t forget that for each category you have to perform a sub-query to join to to fulfill the request. Each one of those sub-selects will have to go and get all of the records in the Products table that are from that category, sort them, and then take the top three. If you have a small database like Northwind, it works great. But if you have tables of a few hundred million records each, the time it takes to perform the sub-select and sort those results a hundred million times is going to take a lot longer than if you perform one single batch operation on the large table. The way to complete the request in the smallest amount of time will be to perform one singe query that does all the sorting and filtering all at once on the database instead of a few hundred million sub-queries.

  3. Eric Desch says:

    Hi Mark,

    Great stuff! I wanted to make it dynamic in terms of how many from each group to return…top 50, top 10 or whatever so I altered the last line to read:
    WHERE ss.RowNumber = ss.Rank or ss.RowNumber < ss.Rank + @howMany

    Works like a charm.
    Many Thanks!

  4. bebandit says:

    Sweet!
    This is exactly what we needed. Worked like a charm.
    Thanks!

  5. bebandit says:

    This has helped me multiple times. Thanks!

  6. Mark,
    You rule. Thank you for publishing your solution. I was trying to limit previews of artwork in our online artist’s gallery to four pieces per artist even if they had more approved to show in the database. Your solution, plus the additional comment from Eric Desch, worked perfectly. I never would have solved this so easily on my own. Many thanks, again!

  7. Here was my final SQL code:

    SELECT tblGallery.ArtistID, tblGallery.Name, tblGallery.ArtworkID, tblGallery.Title, tblGallery.Preview FROM (
    SELECT tblArtists.ArtistID, tblArtists.Name, tblArtwork.ArtworkID, tblArtwork.Title, tblArtists.ArtFolder + ‘/’ + tblArtwork.Preview AS Preview, RANK() OVER (ORDER BY tblArtists.ArtistID) AS ‘iRANK’, ROW_NUMBER() OVER(ORDER BY tblArtists.ArtistID, tblArtwork.ArtworkID DESC) AS ‘iROW_NUMBER’
    FROM tblArtists INNER JOIN tblArtwork ON tblArtists.ArtistID = tblArtwork.ArtistID
    WHERE (tblArtwork.GalleryChoice = 1) AND (tblArtists.Active = 1) AND (tblArtwork.Active = 1)) AS tblGallery
    WHERE tblGallery.iROW_NUMBER = tblGallery.iRANK or tblGallery.iROW_NUMBER < tblGallery.iRANK + 4
    ORDER BY tblGallery.Name, tblGallery.Title

  8. Shane says:

    select COL1, COL2
    from
    (select COL1, COL2, ROW_NUMBER() OVER (PARTITION BY COL1 ORDER BY COL2) as ROWER
    from TABLE
    –WHERE BLAH BLAH
    ) xxx
    where ROWER < @NUM_ROWS_PER_GROUP+1

  9. amrit says:

    Hi All,

    The solutions provided by Mark and Mike Morin differs in functionality when there are ties at the Nth position. Mike Morin’s solution will include all the records having tie at Nth position but Mike’s solution will always show exactly N rows ignoring the ties at Nth position. I am selecting 3 toppers of each class, but if there is a tie at 3rd position then I need all the students having a tie at the 3rd position. So Mike Morin’s query work in my case.

  10. Well done. I’ve been struggling to find a clean, top (n) recordset per merchant that was scalable for a rather large DB.

    This code was easy to follow and adjust for use on our website.

  11. mohan says:

    Excellent one… Thank you so much…

  12. Fischer says:

    Nice one. I used it to get the a given percentage from each just now. I just joined the above code with a select clause that had the total of each in it (totalRowsForGrouping) and used Erics “ss.RowNumber < ss.Rank + @howMany" and changed @howmany with @percentage / 100 * x.totalRowsForGrouping.

  13. Mainul says:

    Worked like a charm. Thanks

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: