Summary: in this tutorial, you will learn how to use the GROUP BY
clause or ROW_NUMBER()
function to find duplicate values in a table.
Technically, you use the UNIQUE
constraints to enforce the uniqueness of rows in one or more columns of a table. However, sometimes you may find duplicate values in a table due to the poor database design, application bugs, or uncleaned data from external sources. Your job is to identify these duplicate values in effective ways.
To find the duplicate values in a table, you follow these steps:
- First, define criteria for duplicates: values in a single column or multiple columns.
- Second, write a query to search for duplicates.
If you want to also delete the duplicate rows, you can go to the deleting duplicates from a table tutorial.
Let’s set up a sample table for the demonstration.
Setting up a sample table
First, create a new table named t1
that contains three columns id
, a
, and b
.
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (
id INT IDENTITY(1, 1),
a INT,
b INT,
PRIMARY KEY(id)
);
Code language: SQL (Structured Query Language) (sql)
Then, insert some rows into the t1
table:
INSERT INTO
t1(a,b)
VALUES
(1,1),
(1,2),
(1,3),
(2,1),
(1,2),
(1,3),
(2,1),
(2,2);
Code language: SQL (Structured Query Language) (sql)
The t1
table contains the following duplicate rows:
(1,2)
(2,1)
(1,3)
Code language: SQL (Structured Query Language) (sql)
Your goal is to write a query to find the above duplicate rows.
Using GROUP BY clause to find duplicates in a table
This statement uses the GROUP BY
clause to find the duplicate rows in both a
and b
columns of the t1
table:
SELECT
a,
b,
COUNT(*) occurrences
FROM t1
GROUP BY
a,
b
HAVING
COUNT(*) > 1;
Code language: SQL (Structured Query Language) (sql)
Here is the result:
How it works:
- First, the
GROUP BY
clause groups the rows into groups by values in botha
andb
columns. - Second, the
COUNT()
function returns the number of occurrences of each group (a,b). - Third, the
HAVING
clause keeps only duplicate groups, which are groups that have more than one occurrence.
To return the entire row for each duplicate row, you join the result of the above query with the t1 table using a common table expression (CTE):
WITH cte AS (
SELECT
a,
b,
COUNT(*) occurrences
FROM t1
GROUP BY
a,
b
HAVING
COUNT(*) > 1
)
SELECT
t1.id,
t1.a,
t1.b
FROM t1
INNER JOIN cte ON
cte.a = t1.a AND
cte.b = t1.b
ORDER BY
t1.a,
t1.b;
Code language: SQL (Structured Query Language) (sql)
Here is the output:
Generally, the query for finding the duplicate values in one column using the GROUP BY
clause is as follows:
SELECT
col,
COUNT(col)
FROM
table_name
GROUP BY
col
HAVING
COUNT(col) > 1;
Code language: SQL (Structured Query Language) (sql)
The query for finding the duplicate values in multiple columns using the GROUP BY
clause :
SELECT
col1,col2,...
COUNT(*)
FROM
table_name
GROUP BY
col1,col2,...
HAVING
COUNT(*) > 1;
Code language: SQL (Structured Query Language) (sql)
Using ROW_NUMBER() function to find duplicates in a table
The following statement uses the ROW_NUMBER()
function to find duplicate rows based on both a
and b
columns:
WITH cte AS (
SELECT
a,
b,
ROW_NUMBER() OVER (
PARTITION BY a,b
ORDER BY a,b) rownum
FROM
t1
)
SELECT
*
FROM
cte
WHERE
rownum > 1;
Code language: SQL (Structured Query Language) (sql)
Here is the result:
How it works:
First, the ROW_NUMBER()
distributes rows of the t1 table into partitions by values in the a
and b
columns. The duplicate rows will have repeated values in the a
and b
columns, but different row numbers as shown in the following picture:
Second, the outer query removes the first row in each group.
Generally, This statement uses the ROW_NUMBER()
function to find the duplicate values in one column of a table:
WITH cte AS (
SELECT
col,
ROW_NUMBER() OVER (
PARTITION BY col
ORDER BY col) row_num
FROM
t1
)
SELECT * FROM cte
WHERE row_num > 1;
Code language: SQL (Structured Query Language) (sql)
In this tutorial, you have learned how to use the GROUP BY
clause or ROW_NUMBER()
function to find duplicate values in SQL Server.