MongoDB Operations (Part 2): Finding and Removing Duplicates in Bulk

TU不秃头

Last modified 2022-12-08 11:41:17

Category: Databases. Tags: mongodb, database

First published 2022-12-08 11:40:49

Article link: https://blog.csdn.net/qq_44780372/article/details/128233853

1. Finding duplicate data

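    # collection is assumed to be an existing pymongo Collection object
    # Group by (tid, author_name, content) and count the documents in each group
    result_list = collection.aggregate([
        {'$group': {'_id': {'tid': '$tid', 'author_name': '$author_name', 'content': '$content'}, 'count': {'$sum': 1}}},
        # Keep only the combinations that occur more than once
        {'$match': {'count': {'$gt': 1}}}
    ])
    for result in result_list:
        print(result)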

In each result, count is the number of documents that share the same tid, author_name and content.
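To make the shape of that output concrete without a running database, here is a minimal in-memory sketch of what the $group and $match stages compute. The sample documents and the helper name find_duplicate_groups are made up for illustration; only the three field names come from the pipeline above.

```python
from collections import Counter

# Hypothetical sample documents; only the field names follow the pipeline above.
docs = [
    {'_id': 1, 'tid': 10, 'author_name': 'alice', 'content': 'hello'},
    {'_id': 2, 'tid': 10, 'author_name': 'alice', 'content': 'hello'},
    {'_id': 3, 'tid': 10, 'author_name': 'bob', 'content': 'hello'},
    {'_id': 4, 'tid': 11, 'author_name': 'alice', 'content': 'hi'},
    {'_id': 5, 'tid': 10, 'author_name': 'alice', 'content': 'hello'},
]


def find_duplicate_groups(documents, keys=('tid', 'author_name', 'content')):
    """In-memory analogue of the $group + $match stages (illustration only)."""
    # $group: count the documents for each combination of the key fields
    counts = Counter(tuple(d.get(k) for k in keys) for d in documents)
    # $match: keep only the combinations seen more than once
    return [
        {'_id': dict(zip(keys, combo)), 'count': n}
        for combo, n in counts.items()
        if n > 1
    ]


duplicate_groups = find_duplicate_groups(docs)
for group in duplicate_groups:
    print(group)
# {'_id': {'tid': 10, 'author_name': 'alice', 'content': 'hello'}, 'count': 3}
```

Documents 1, 2 and 5 form the only group with more than one member, so a single result with count 3 is printed.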

2. Removing duplicates (forEach function)

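// Run in the mongo shell, not in Python
db.collection.aggregate([
{
	// Group duplicates and collect the _id of every document in each group
	'$group':{'_id':{'tid': '$tid', 'author_name': '$author_name', 'content': '$content'},'count':{'$sum':1},'dups':{'$addToSet':'$_id'}}
},
{
	// Keep only the groups with more than one document
	'$match':{'count':{'$gt':1}}
}
],{
	allowDiskUse:true
}).forEach(function(doc){
	// Drop one _id from the array so that one copy survives
	doc.dups.shift();
	// Delete the remaining copies (deleteMany replaces the deprecated remove)
	db.collection.deleteMany({_id:{$in:doc.dups}});
})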

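When I ran this, the forEach callback raised an error on its doc argument. I have not resolved that yet, so section 3 deduplicates by iterating over the results instead.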

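# Explanation of the parameters
(1) Group by author_name, tid and so on, and count the documents in each group. $group returns only the grouping fields (under _id) plus the accumulator fields, so $addToSet is used to collect the _id of every document in the group into the dups array.
(2) $match keeps only the groups whose count is greater than 1.
(3) doc.dups.shift(); removes one _id from the array of duplicates, so the delete statement that follows does not delete every copy.
(4) forEach loops over the groups and deletes documents by _id.
(5) $addToSet adds a value to the array only if the value is not already in it. If the value is already present, $addToSet leaves the array unchanged.
(6) allowDiskUse: true
With a large data set, the aggregation can fail with a memory error: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in
Adding this option avoids the error. With a small data set, you can try running without it.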

Note: forEach and $addToSet must keep their camelCase spelling and cannot be written in all lowercase, because MongoDB is strictly case-sensitive.
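The steps listed above can also be driven from Python rather than the mongo shell. The following is a minimal sketch, assuming collection is a pymongo Collection. The function name remove_duplicates and its keys parameter are hypothetical. It uses the same pipeline, passes allowDiskUse=True as a keyword argument to aggregate, and replaces doc.dups.shift() with a list slice.

```python
def remove_duplicates(collection, keys=('tid', 'author_name', 'content')):
    """Delete all but one document per duplicate group; return how many were deleted.

    `collection` is assumed to be a pymongo Collection (hypothetical helper).
    """
    pipeline = [
        {'$group': {
            # Builds {'tid': '$tid', 'author_name': '$author_name', ...}
            '_id': {k: '$' + k for k in keys},
            'count': {'$sum': 1},
            # Collect the _id of every document in the group
            'dups': {'$addToSet': '$_id'},
        }},
        # Keep only the groups with more than one document
        {'$match': {'count': {'$gt': 1}}},
    ]
    deleted = 0
    for doc in collection.aggregate(pipeline, allowDiskUse=True):
        # Skip the first _id so one copy survives (same role as doc.dups.shift())
        ids_to_delete = doc['dups'][1:]
        result = collection.delete_many({'_id': {'$in': ids_to_delete}})
        deleted += result.deleted_count
    return deleted
```

A call would look like remove_duplicates(MongoClient()['mydb']['posts']), where the database and collection names are placeholders. Because $addToSet does not guarantee any order, which copy survives is arbitrary. If a specific copy must be kept, the iteration approach in section 3 gives more control.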

3. Removing duplicates (deduplicating by iteration)

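    # result_list is a cursor and can be consumed only once:
    # re-run the aggregation from section 1 before this loop
    for result in result_list:
        print(result)
        # Fetch every document in this duplicate group, in a stable order
        docs = list(collection.find(result['_id']).sort('_id', 1))
        # The first document in the group is the one to keep
        print(docs[0])
        # Delete or update every document after the first; here we delete
        for item in docs[1:]:
            print(item)
            collection.delete_one({'_id': item['_id']})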