Delete Duplicate Records in SQL

In relational databases, duplicate records can occur due to various reasons such as data entry errors, system glitches, or incorrect import processes. Duplicates can affect performance, produce misleading results, and lead to data integrity issues. This article explores how to identify and delete duplicate rows in SQL using different methods.

Delete Duplicate Records in SQL

Getting Started

In SQL, a duplicate row means that all (or a subset of) columns in two or more rows have the same values. In the below table, rows with ID 1, 3, and 4 are duplicates based on the Name and Email fields.

ID	Name	Email
1	John	john@example.com
2	Alice	alice@example.com
3	John	john@example.com
4	John	john@example.com

Identifying Duplicates

You can use a GROUP BY clause along with HAVING COUNT(*) > 1 to find duplicates:

 SELECT Name, Email, COUNT(ID)  
 FROM Users  
 GROUP BY Name, Email  
 HAVING COUNT(ID) > 1;

This query will return the duplicate combinations of Name and Email.

Deleting Duplicate Records

There are several ways to delete duplicates in SQL, depending on your SQL dialect (e.g., MySQL, PostgreSQL, SQL Server, Oracle). Below are common approaches.

Approach 1: Using ROW_NUMBER()

If your database supports window functions (like PostgreSQL, SQL Server, Oracle, or MySQL 8+), use the ROW_NUMBER() function to identify and delete duplicate records in SQL.

SQL Queries To Delete Duplicate Records In SQL

 WITH CTE AS (  
  SELECT *,  
      ROW_NUMBER() OVER (PARTITION BY Name, Email ORDER BY ID) AS RowNumber  
  FROM Users  
 )  
 DELETE FROM CTE WHERE RowNumber > 1;

Explanation:

ROW_NUMBER() assigns a unique number to each row within a duplicate group.
Only the first occurrence (rn = 1) is kept.
All others (rn > 1) are deleted.

Approach 2: Delete with GROUP BY and MIN()

If you don’t have access to window functions, you can still remove duplicates using GROUP BY with a subquery.

SQL Queries To Delete Duplicates in SQL

 DELETE FROM Users  
 WHERE ID NOT IN (  
  SELECT MIN(ID) FROM Users GROUP BY Name, Email  
 );

Explanation:

GROUP BY selects the smallest ID (or any rule) per duplicate group.
All other rows are deleted.

Approach 3: Using a Temporary Table

In systems like MySQL < 8.0 or SQLite, use a temporary table. This is slower for large datasets but works reliably when other methods aren't available.

 CREATE TABLE temp_users AS  
 SELECT MIN(ID) as ID, Name, Email  
 FROM Users  
 GROUP BY Name, Email;  
 DELETE FROM Users;  
 INSERT INTO Users (ID, Name, Email)  
 SELECT * FROM temp_users;

Precautions

Backup your data before making any deletions.
Run a SELECT query first to verify which rows will be deleted.

Use a transaction to ensure safe rollbacks:

 BEGIN TRANSACTION;  
 -- deletion logic here  
 ROLLBACK; -- or COMMIT;

Consider whether all columns should match or just a subset.

Best Practices

Always define a PRIMARY KEY or UNIQUE constraint to prevent future duplicates.
Use data validation at the application level before inserting data.
Regularly audit your database for anomalies and duplicates.

Summary

Delete duplicates in SQL is a common data maintenance task. Whether you use ROW_NUMBER(), GROUP BY, or temporary tables, the key is to identify the logic that defines a "duplicate" and delete accordingly. Always proceed with caution and validate the data before deleting.

Thanks

Delete Duplicate Records in SQL