Removing Duplicates Using CTE

-

Overview

Imports from a variety of sources and direct entry of records can often result in duplicate records within database tables. These duplicates make many operations unreliable and threaten overall data integrity and. One effective strategy for removing duplicates is to use a CTE (Common Table Expression) and a PARTITION clause.

Example

The student table contains duplicate records that we will remove using a CTE.

Create example table

CREATE TABLE Student(
  FirstName         VARCHAR(20)
  ,LastName         VARCHAR(20)
  ,PhoneNumber      VARCHAR(20)
  ,Advisor          VARCHAR(20)
  ,DateAdded        Date
 );

Add example rows


INSERT INTO Student(FirstName,
                      LastName,
                      PhoneNumber,
                      Advisor,
                      DateAdded)
VALUES ('Alan', 'Jones',  '802 555 3214', 'Mr. Smith','2015-12-17'),
       ('Sarah', 'Berry',  '802 333 5555', 'Ms. Beel','2014-05-07'),
       ('Craig', 'Probst', '802 222 0342', 'Ms. Hemsworth','2015-04-16'),
       ('Alan', 'Jones',  '802 555 3214', 'Mr. Parsley','2016-11-15');


Resulting Students table

For this example we will consider a record to be a duplicate if it has the same first name, last name and phone number as another record. By this definition, the second and last rows for Alfred Jones are duplicates.

FirstName LastName PhoneNumber Advisor DateAdded
Sarah Berry 802 333 5555 Ms. Beel 5/7/2014
Alfred Jones 802 555 3214 Mr. Smith 12/17/2014
Craig Probst 802 222 0342 Ms. Hemsworth 4/16/2015
Alfred Jones 802 555 3214 Mr. Parsley 11/15/2016

Remove Duplicates

WITH toDeleteCTE AS --create CTE
(
    SELECT FirstName,
           LastName,
           PhoneNumber,
           Advisor,
           DateAdded
           --add row number column partitioned over all columns that define a row as unique
           ROW_NUMBER() OVER
                (PARTITION BY FirstName,
                               LastName,
                               PhoneNumber
                ) AS rn
    FROM Student
)--Delete every row after the first within each partition
DELETE FROM toDeleteCTE WHERE rn > 1

Result Table

The duplicate row has been found using our own definition of equivalence and removed from the source table.

FirstName LastName PhoneNumber Advisor DateAdded
Sarah Berry 802 333 5555 Ms. Beel 2014-05-07
Alfred Jones 802 555 3214 Mr. Smith 2015-12-17
Craig Probst 802 222 0342 Ms. Hemsworth 2015-04-16

Notes

As the code snippet above demonstrates, CTEs point to source tables and deletion of rows from the CTE results in deletion from the source table.

Additional Resources