Editorials

Dealing With Duplicates

One common query is to find and remove duplicate records from a set of data. Duplicate data often occurs in databases containing a primary key based on a sequential ID, as is common in many databases. While having a sequential ID is not wrong, if there is not alternate key, or unique index based on the natural key values of a table, duplicate may occur without error, and your data become corrupt.

An example of this would be a schema as follows:

CREATE TABLE PurchaseOrder (

PurchaseOrderId INT NOT NULL IDENTITY(1,1)

,CustomerId INT NO NULL

,CONSTRAINT PK_PurchaseOrder PRIMARY KEY CLUSTERED (PurchaseOrderId)

)


CREATE TABLE PurchaseOrderItem(

PurchaseOrderItemId INT NOT NULL IDENTITY(1,1)

,PurchaseOrderId INT NO NULL

,LineItemNo INT NOT NULL

,PartId INT NOT NULL

,Quantity INT NOT NULL

,…

,CONSTRAINT PK_PurchaseOrderItem PRIMARY KEY CLUSTERED (PurchaseOrderItemId)

,CONSTRAINT FK_PurchaseOrderItem_PurchaseOrder

FOREIGN KEY (PurchaseOrderId)

REFERENCES PurchaseOrder (PurchaseOrderId)

)

You should note that there is a primary key on PurchaseOrderItem based on the identity column, PurchaseOrderItemId, which will be unique for every new row added to the table. However, there is a natural key contained in the table. The combination of PurchaseOrderId and LineItemNo must be unique. A purchase order may only have one line 1, and following. Without our unique constraint on these two columns your database allows for duplicate records for the natural key. You can do one of two things in this situation. 1) Change the primary key to PurchaseOrderId, LineItemNo. Or 2) create a unique index on PurchaseOrderId, LineItemNo. If you opt for the second solution, you may now use the unique key as an alternate key for referencing rows in the PurchaseOrderItem table.

For those databases with a schema allowing duplicate rows, you need a method to identify, and maybe even remove duplicate entries. In our purchase order example, I am going to assume that the latest entry has the most accurate data. Following is a query that will delete all but the latest entry whenever duplicates are found.

;WITH RankedPOItems

AS

(

SELECT

PurchaseOrderId

,LineItemNo

,PurchaseOrderItemId

,ROW_NUMBER() OVER (

PARTITION BY PurchaseOrderId, LineItemNumber

ORDER BY PurchaseOrderId, LineItemNumber, PurchaseOrderitemId DESC)

AS RowId

)

DELETE POI

FROM RankedPOItems RPOI

JOIN PurchaseOrderitems POI ON POI.PurchaseOrderItem = RPOI.PurchaseOrderItem

WHERE RPOI.RowId > 1

The CTE returns a set of all PurchaseOrderItems, and assigns a sequential value to each row, restarting the count when the PurchaseOrderId and LineItemNunmber change. Because it is sorted in descending order of the PurchaseOrderItmeId column, the last entry for a natural key value will be 1, and preceding values > 1. Therefore, you can delete every record where the calculated RowId is > 1.

There are many other queries to find and remove duplicates. This one is probably the easiest to understand.

Cheers,

Ben