In relational databases, duplicate records can occur due to various reasons such as data entry errors, system glitches, or incorrect import processes. Duplicates can affect performance, produce misleading results, and lead to data integrity issues. This article explores how to identify and delete duplicate rows in SQL using different methods.
Delete Duplicate Records in SQL
Getting Started
In SQL, a duplicate row means that all (or a subset of) columns in two or more rows have the same values. In the below table, rows with ID 1, 3, and 4 are duplicates based on the Name and Email fields.
ID | Name | |
---|---|---|
1 | John | john@example.com |
2 | Alice | alice@example.com |
3 | John | john@example.com |
4 | John | john@example.com |
Identifying Duplicates
You can use a GROUP BY clause along with HAVING COUNT(*) > 1 to find duplicates:
SELECT Name, Email, COUNT(ID)
FROM Users
GROUP BY Name, Email
HAVING COUNT(ID) > 1;
This query will return the duplicate combinations of Name and Email.
Deleting Duplicate Records
There are several ways to delete duplicates in SQL, depending on your SQL dialect (e.g., MySQL, PostgreSQL, SQL Server, Oracle). Below are common approaches.
Approach 1: Using ROW_NUMBER()
If your database supports window functions (like PostgreSQL, SQL Server, Oracle, or MySQL 8+), use the ROW_NUMBER()
function to identify and delete duplicate records in SQL.
WITH CTE AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY Name, Email ORDER BY ID) AS RowNumber
FROM Users
)
DELETE FROM CTE WHERE RowNumber > 1;
Explanation:
ROW_NUMBER()
assigns a unique number to each row within a duplicate group.- Only the first occurrence (
rn = 1
) is kept. - All others (
rn > 1
) are deleted.
Approach 2: Delete with GROUP BY and MIN()
If you don’t have access to window functions, you can still remove duplicates using GROUP BY with a subquery.
SQL Queries To Delete Duplicates in SQL DELETE FROM Users
WHERE ID NOT IN (
SELECT MIN(ID) FROM Users GROUP BY Name, Email
);
Explanation:
GROUP BY
selects the smallestID
(or any rule) per duplicate group.- All other rows are deleted.
Approach 3: Using a Temporary Table
In systems like MySQL < 8.0 or SQLite, use a temporary table. This is slower for large datasets but works reliably when other methods aren't available.
CREATE TABLE temp_users AS
SELECT MIN(ID) as ID, Name, Email
FROM Users
GROUP BY Name, Email;
DELETE FROM Users;
INSERT INTO Users (ID, Name, Email)
SELECT * FROM temp_users;
Precautions
- Backup your data before making any deletions.
- Run a SELECT query first to verify which rows will be deleted.
- Use a transaction to ensure safe rollbacks:
BEGIN TRANSACTION; -- deletion logic here ROLLBACK; -- or COMMIT;
- Consider whether all columns should match or just a subset.
Best Practices
- Always define a PRIMARY KEY or UNIQUE constraint to prevent future duplicates.
- Use data validation at the application level before inserting data.
- Regularly audit your database for anomalies and duplicates.
Summary
Delete duplicates in SQL is a common data maintenance task. Whether you use ROW_NUMBER(), GROUP BY, or temporary tables, the key is to identify the logic that defines a "duplicate" and delete accordingly. Always proceed with caution and validate the data before deleting.
Thanks