Different ways to SQL delete duplicate rows from a SQL table

This article explains the process of performing the SQL delete activity for duplicate rows in a SQL table.

Introduction

We should follow certain best practices while designing objects in SQL Server. For example, a table should have primary keys, identity columns, clustered and non-clustered indexes, and constraints to ensure data integrity and performance. Even if we follow these best practices, we might still face issues such as duplicate rows. We might also get such data in intermediate tables during a data import, and we want to remove duplicate rows before actually inserting them into the production tables.

Suppose your SQL table contains duplicate rows and you want to remove them. We face these issues many times. It is also a best practice to use the relevant keys and constraints to eliminate the possibility of duplicate rows; however, if duplicate rows already exist in the table, we need to follow specific methods to clean them up. This article explores the different methods to remove duplicate data from a SQL table.

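As a hedged illustration of that best practice, a unique index on the columns that define a duplicate stops new duplicates from being inserted at all. The sketch below assumes the sample Employee table created in the next step; the index name is an assumption, and the index can only be created after any existing duplicates have been removed.

CREATE UNIQUE INDEX UX_Employee_FirstName_LastName_Country
    ON dbo.Employee ([FirstName], [LastName], [Country]);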

Let’s create a sample Employee table and insert a few records in it.

CREATE TABLE Employee
    ( 
    [ID] INT IDENTITY(1,1), 
    [FirstName] VARCHAR(100), 
    [LastName] VARCHAR(100), 
    [Country] VARCHAR(100)
    ) 
    GO 
    
    INSERT INTO Employee ([FirstName],[LastName],[Country])
    VALUES ('Raj','Gupta','India'),
           ('Raj','Gupta','India'),
           ('Mohan','Kumar','USA'),
           ('James','Barry','UK'),
           ('James','Barry','UK'),
           ('James','Barry','UK')

In the table, we have a few duplicate records, and we need to remove them.

SQL delete duplicate rows using Group By and Having clause

In this method, we use the SQL GROUP BY clause to identify the duplicate rows. The GROUP BY clause groups the data by the specified columns, and we can use the COUNT function to check how many times each row occurs.

For example, execute the following query, and we get the records that occur more than once in the Employee table.

SELECT [FirstName], 
    [LastName], 
    [Country], 
    COUNT(*) AS CNT
FROM [SampleDB].[dbo].[Employee]
GROUP BY [FirstName], 
      [LastName], 
      [Country]
HAVING COUNT(*) > 1;

Sample data

In the output above, we have two sets of duplicate records.

  • Emp ID 1 (Raj Gupta) has two occurrences in the Employee table
  • Emp ID 4 (James Barry) has three occurrences in the Employee table

We need to keep a single row for each employee and remove only the duplicate rows from the table. For example, the record for Emp ID 1 appears twice in the table, and we want to remove only one of those occurrences.

We use the SQL MAX function to calculate the maximum ID for each set of duplicate rows.

SELECT *
    FROM [SampleDB].[dbo].[Employee]
    WHERE ID NOT IN
    (
        SELECT MAX(ID)
        FROM [SampleDB].[dbo].[Employee]
        GROUP BY [FirstName], 
                 [LastName], 
                 [Country]
    );

In the following screenshot, we can see that the above SELECT statement excludes the maximum ID of each group and returns the remaining rows, which are the ones we want to delete.

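Conversely, running the subquery on its own shows the rows that will be kept, i.e. the maximum ID of each group:

SELECT MAX(ID) AS MaxRecordID
    FROM [SampleDB].[dbo].[Employee]
    GROUP BY [FirstName], 
             [LastName], 
             [Country];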

identify duplicate data
DELETE FROM [SampleDB].[dbo].[Employee]
    WHERE ID NOT IN
    (
        SELECT MAX(ID) AS MaxRecordID
        FROM [SampleDB].[dbo].[Employee]
        GROUP BY [FirstName], 
                 [LastName], 
                 [Country]
    );

Once you execute the delete statement, run a SELECT on the Employee table, and we get the following records, which no longer contain duplicate rows.

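If you would rather keep the first occurrence of each duplicate instead of the last, the same pattern can be written with MIN in place of MAX; this variant is a sketch and not part of the original walkthrough:

DELETE FROM [SampleDB].[dbo].[Employee]
    WHERE ID NOT IN
    (
        SELECT MIN(ID) AS MinRecordID
        FROM [SampleDB].[dbo].[Employee]
        GROUP BY [FirstName], 
                 [LastName], 
                 [Country]
    );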

SQL Delete duplicate rows

SQL delete duplicate rows using Common Table Expressions (CTE)

We can use Common Table Expressions, commonly known as CTEs, to remove duplicate rows in SQL Server. They have been available since SQL Server 2005.

We use the SQL ROW_NUMBER function, which assigns a unique sequential row number to each row.

In the following CTE, the PARTITION BY clause partitions the data by the [FirstName], [LastName] and [Country] columns and generates a row number for each row.

WITH CTE([firstname], 
    [lastname], 
    [country], 
    duplicatecount)
AS (SELECT [firstname], 
           [lastname], 
           [country], 
           ROW_NUMBER() OVER(PARTITION BY [firstname], 
                                          [lastname], 
                                          [country]
           ORDER BY id) AS DuplicateCount
    FROM [SampleDB].[dbo].[employee])
SELECT *
FROM CTE;

In the output, any row with a [DuplicateCount] value greater than 1 is a duplicate row.

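To list only the duplicate rows, you can filter the same CTE on that column; this is simply the query above with a WHERE clause added:

WITH CTE([FirstName], 
    [LastName], 
    [Country], 
    DuplicateCount)
AS (SELECT [FirstName], 
           [LastName], 
           [Country], 
           ROW_NUMBER() OVER(PARTITION BY [FirstName], 
                                          [LastName], 
                                          [Country]
           ORDER BY ID) AS DuplicateCount
    FROM [SampleDB].[dbo].[Employee])
SELECT *
FROM CTE
WHERE DuplicateCount > 1;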

Remove Duplicate Rows using Common Table Expressions (CTE)

We can remove the duplicate rows using the following CTE.

WITH CTE([FirstName], 
    [LastName], 
    [Country], 
    DuplicateCount)
AS (SELECT [FirstName], 
           [LastName], 
           [Country], 
           ROW_NUMBER() OVER(PARTITION BY [FirstName], 
                                          [LastName], 
                                          [Country]
           ORDER BY ID) AS DuplicateCount
    FROM [SampleDB].[dbo].[Employee])
DELETE FROM CTE
WHERE DuplicateCount > 1;

It removes the rows with a [DuplicateCount] value greater than 1.

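Because this delete is destructive, a cautious pattern, in line with the advice in the conclusion to test before touching production data, is to run it inside an explicit transaction, inspect the number of affected rows, and only then commit. The sketch below rolls back by default.

BEGIN TRANSACTION;

WITH CTE AS
(
    SELECT ROW_NUMBER() OVER(PARTITION BY [FirstName], 
                                          [LastName], 
                                          [Country]
                             ORDER BY ID) AS DuplicateCount
    FROM [SampleDB].[dbo].[Employee]
)
DELETE FROM CTE
WHERE DuplicateCount > 1;

-- @@ROWCOUNT still refers to the DELETE at this point
SELECT @@ROWCOUNT AS RowsDeleted;

ROLLBACK TRANSACTION;   -- replace with COMMIT TRANSACTION once the count looks right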

RANK function to SQL delete duplicate rows

We can also use the SQL RANK function to remove duplicate rows. Because the ranking is ordered by the unique ID column, the RANK function assigns a distinct rank to each row within its partition, including the duplicate rows.

In the following query, we use the RANK function with the PARTITION BY clause. The PARTITION BY clause prepares a subset of the data for the specified columns and assigns a rank within that partition.

SELECT E.ID, 
    E.firstname, 
    E.lastname, 
    E.country, 
    T.rank
FROM [SampleDB].[dbo].[Employee] E
  INNER JOIN
(
 SELECT *, 
        RANK() OVER(PARTITION BY firstname, 
                                 lastname, 
                                 country
        ORDER BY id) rank
 FROM [SampleDB].[dbo].[Employee]
) T ON E.ID = t.ID;

duplicate rows using RANK function

In the screenshot, you can see that we need to remove the rows having a rank greater than one. Let's remove those rows using the following query.

DELETE E
    FROM [SampleDB].[dbo].[Employee] E
         INNER JOIN
    (
        SELECT *, 
               RANK() OVER(PARTITION BY firstname, 
                                        lastname, 
                                        country
               ORDER BY id) rank
        FROM [SampleDB].[dbo].[Employee]
    ) T ON E.ID = t.ID
    WHERE rank > 1;
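
After the delete, you can re-run the earlier GROUP BY and HAVING check; if the clean-up worked, it returns no rows:

SELECT [FirstName], 
    [LastName], 
    [Country], 
    COUNT(*) AS CNT
FROM [SampleDB].[dbo].[Employee]
GROUP BY [FirstName], 
      [LastName], 
      [Country]
HAVING COUNT(*) > 1;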

Use SSIS package to SQL delete duplicate rows

SQL Server Integration Services provides various transformations and operators that help both administrators and developers reduce manual effort and optimize tasks. An SSIS package can remove duplicate rows from a SQL table as well.

Use Sort operator in an SSIS package for removing duplicate rows

We can use a Sort operator to sort the values in a SQL table. You might ask how sorting the data can remove duplicate rows.

SSIS Toolbox

Let’s create the SSIS package to show this task.

  • In SQL Server Data Tools, create a new Integration Services package. In the new package, add an OLE DB source connection
  • Open the OLE DB source editor, configure the source connection, and select the source table

    OLE DB Source editor

  • Click on Preview data, and you can see that we still have duplicate data in the source table

    Preview query data

  • Add a Sort operator from the SSIS toolbox for the SQL delete operation and join it with the source data

    Sort operator

To configure the Sort operator, double-click it and select the columns that contain duplicate values. In our case, the duplicate values are in the [FirstName], [LastName] and [Country] columns.

We can use either ascending or descending sort types for the columns; the default sort method is ascending. In the Sort Order column, we can choose the order in which the columns are sorted. Sort order 1 indicates the column that will be sorted first.

Select the available columns

On the bottom left side, notice a checkbox Remove rows with duplicate sort values.

It will remove the duplicate rows from the source data for us. Tick this checkbox and click OK. It performs the SQL delete activity within the SSIS package.

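Conceptually, the Sort transformation with this option checked does for the data flow what a DISTINCT insert into a staging table does in T-SQL. A rough sketch of that set-based equivalent is shown below; the staging table name Employee_Distinct is an assumption used only for illustration.

SELECT DISTINCT [FirstName], 
       [LastName], 
       [Country]
INTO [SampleDB].[dbo].[Employee_Distinct]
FROM [SampleDB].[dbo].[Employee];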

Select the firstname, lastname and country columns

Once we click OK, it returns to the data flow tab, and we can see the following SSIS package.

SSIS package

We could add a SQL Server destination to store the data after removing the duplicate rows; here, we only want to verify that the Sort operator removes the duplicates for us.

Add a SQL Multicast Transformation from the SSIS toolbox as shown below.

Add a SQL Multicast Transformation from the SSIS toolbox

To view the distinct data, right-click on the connector between Sort and Multicast. Click on Enable Data Viewer.

Enable Data Viewer

The overall SSIS package looks like below.

Overall SSIS package

Execute the package to perform the SQL delete operation. It opens the Sort output data viewer on the data flow task. In this data viewer, you can see the distinct data after the duplicate values have been removed.

Sort output data viewer at Data flow task

Close the data viewer, and the SSIS package shows that it executed successfully.

SQL Delete SSIS package execution

Conclusion

In this article, we explored the process of deleting duplicate rows in SQL using various methods such as T-SQL, CTEs, and an SSIS package. You can use whichever method you are most comfortable with. However, I would suggest not running these procedures and packages directly on production data; you should test them in a lower environment first.

Translated from: https://www.sqlshack.com/different-ways-to-sql-delete-duplicate-rows-from-a-sql-table/
