起因是公司的crm录入不规范,有重复数据。
之后考虑到需要手动处理,首先需要自动找出重复的数据
查重要求:
存在多个不允许重复的字段,任一字段重复,则判断为同一个客户。划分到同一重复组中。
查重sql如下
SELECT
CONCAT('组', dense_rank() OVER (ORDER BY group_key)) AS group_info,
id, name, tel
FROM (
SELECT
id, name, tel,
CONCAT_WS('-',
IF(name IS NULL, '', name),
IF(tel IS NULL, '', '1')
) AS group_key,
COUNT(*) OVER (PARTITION BY CONCAT_WS('-',
IF(name IS NULL, '', name),
IF(tel IS NULL, '', '1')
)) AS group_count
FROM customer
) AS subquery
WHERE group_count > 1
ORDER BY group_key
通过上述sql可以获得一个查重的sql。
那么接下来的问题是如果重复数据大,我们是需要一个分页的。首先我们需要知道重复了多少组。
用以下sql获取:
SELECT COUNT(*) AS total_count
FROM (
SELECT DISTINCT group_info
FROM (
SELECT
CONCAT('组', dense_rank() OVER (ORDER BY group_key)) AS group_info,
id, name, tel
FROM (
SELECT
id, name, tel,
CONCAT_WS('-',
IF(name IS NULL, '', name),
IF(tel IS NULL, '', '1')
) AS group_key,
COUNT(*) OVER (PARTITION BY CONCAT_WS('-',
IF(name IS NULL, '', name),
IF(tel IS NULL, '', '1')
)) AS group_count
FROM customer
) AS subquery
WHERE group_count > 1
ORDER BY group_key
) AS data_counts
) AS distinct_groups;
这里的分页应当按照重复组分页,传统的分页就失效了。所以我们就需要手动计算分页。
把分组字段group_info中的组字去掉,按group_info进行大于,小于判断
SELECT
dense_rank() OVER (ORDER BY group_key) AS group_info,
id, name, tel
FROM (
SELECT
id, name, tel,
CONCAT_WS('-',
IF(name IS NULL, '', name),
IF(tel IS NULL, '', '1')
) AS group_key,
COUNT(*) OVER (PARTITION BY CONCAT_WS('-',
IF(name IS NULL, '', name),
IF(tel IS NULL, '', '1')
)) AS group_count
FROM customer
) AS subquery
WHERE group_count > 1
AND group_info>0 AND group_info<11
ORDER BY group_key