MongoDB/Java deduplication: removing duplicates from a MongoDB 4.2 database

I am trying to remove duplicates from a MongoDB collection, but every solution I have found fails.

My JSON structure:

{
    "_id" : ObjectId("5d94ad15667591cf569e6aa4"),
    "a" : "aaa",
    "b" : "bbb",
    "c" : "ccc",
    "d" : "ddd",
    "key" : "057cea2fc37aabd4a59462d3fd28c93b"
}

The key value is md5(a+b+c+d).
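As an illustration of how such a key could be produced, here is a minimal sketch assuming Node.js and its built-in crypto module. The makeKey helper is a hypothetical name, and plain string concatenation of the four fields is an assumption about what a+b+c+d means here.

```javascript
const crypto = require("crypto");

// Hypothetical helper: builds the dedup key as md5(a + b + c + d).
// Assumes the four fields are concatenated as strings with no separator.
function makeKey(doc) {
  const joined = String(doc.a) + String(doc.b) + String(doc.c) + String(doc.d);
  return crypto.createHash("md5").update(joined, "utf8").digest("hex");
}
```

Note that concatenating without a separator makes ("ab", "c") and ("a", "bc") hash to the same key, so such records would be treated as duplicates.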

I already have a database with over 1 billion records. I want to remove all duplicates according to key and then create a unique index on it, so that a record whose key is already in the database will not be inserted again.

I have already tried:

db.data.ensureIndex( { key:1 }, { unique:true, dropDups:true } )

But as far as I understand, dropDups was removed in MongoDB 3.0.

I also tried several JavaScript snippets, such as:

var duplicates = [];

db.data.aggregate([
    { $match: {
        key: { "$ne": '' }  // discard selection criteria
    }},
    { $group: {
        _id: { key: "$key" },  // can be grouped on multiple properties
        dups: { "$addToSet": "$_id" },
        count: { "$sum": 1 }
    }},
    { $match: {
        count: { "$gt": 1 }  // duplicates are groups with a count greater than one
    }}
],
{ allowDiskUse: true }  // lets stages that exceed the memory limit spill to disk
).forEach(function(doc) {
    doc.dups.shift();  // skip the first element so that one copy is kept
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId);  // collect all duplicate ids
    });
});

and it fails with:

QUERY [js] uncaught exception: Error: command failed: {
    "ok" : 0,
    "errmsg" : "assertion src/mongo/db/pipeline/value.cpp:1365",
    "code" : 8,
    "codeName" : "UnknownError"
} : aggregate failed

I haven't changed any MongoDB settings; I am working with the defaults.

Solution

This is my input collection dups, with some duplicate data (k with values 11 and 22):

{ "_id" : 1, "k" : 11 }
{ "_id" : 2, "k" : 22 }
{ "_id" : 3, "k" : 11 }
{ "_id" : 4, "k" : 44 }
{ "_id" : 5, "k" : 55 }
{ "_id" : 6, "k" : 66 }
{ "_id" : 7, "k" : 22 }
{ "_id" : 8, "k" : 88 }
{ "_id" : 9, "k" : 11 }

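This query returns the data with the duplicates left out, one document per distinct k. Note that it only produces a result set; the dups collection itself is not modified: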

db.dups.aggregate([
    { $group: {
        _id: "$k",
        dups: { "$addToSet": "$_id" },
        count: { "$sum": 1 }
    }},
    { $project: { k: "$_id", _id: { $arrayElemAt: [ "$dups", 0 ] } } }
])

=>

{ "k" : 88, "_id" : 8 }
{ "k" : 22, "_id" : 7 }
{ "k" : 44, "_id" : 4 }
{ "k" : 55, "_id" : 5 }
{ "k" : 66, "_id" : 6 }
{ "k" : 11, "_id" : 9 }

As you can see, the following duplicate documents are left out of the result:

{ "_id" : 1, "k" : 11 }
{ "_id" : 2, "k" : 22 }
{ "_id" : 3, "k" : 11 }
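
The aggregation above only builds a de-duplicated result set. To actually delete the extra documents, one possible approach is to group the _id values per key, keep the first of each group, and delete the rest in batches. The sketch below assumes the mongo shell, where collection methods are synchronous; removeDuplicates and batchSize are hypothetical names, not part of MongoDB, and the code has not been tried on a collection with a billion documents.

```javascript
// Sketch for the mongo shell (synchronous collection methods assumed).
// removeDuplicates and batchSize are hypothetical names, not a MongoDB API.
function removeDuplicates(coll, field, batchSize) {
  const pipeline = [
    // Collect every _id that shares the same value of `field`.
    { $group: { _id: "$" + field, ids: { $push: "$_id" }, count: { $sum: 1 } } },
    // Keep only the groups that really contain duplicates.
    { $match: { count: { $gt: 1 } } }
  ];
  let batch = [];
  let deleted = 0;

  // Delete the queued ids in one deleteMany call and reset the queue.
  const flush = function () {
    if (batch.length === 0) return;
    deleted += coll.deleteMany({ _id: { $in: batch } }).deletedCount;
    batch = [];
  };

  coll.aggregate(pipeline, { allowDiskUse: true }).forEach(function (group) {
    // Keep the first _id of each group and queue the rest for deletion.
    group.ids.slice(1).forEach(function (id) {
      batch.push(id);
      if (batch.length >= batchSize) flush();
    });
  });
  flush(); // delete whatever is left in the last partial batch
  return deleted;
}
```

In the shell this could be called as removeDuplicates(db.dups, "k", 1000), or with db.data and "key" for the collection in the question. Which copy in each group survives is arbitrary, which is acceptable when the duplicates differ only in _id.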

Get the results in an array:

var arr = db.dups.aggregate([ ...] ).toArray()

arr now holds the array of documents:

[
    {
        "k" : 88,
        "_id" : 8
    },
    {
        "k" : 22,
        "_id" : 7
    },
    {
        "k" : 44,
        "_id" : 4
    },
    {
        "k" : 55,
        "_id" : 5
    },
    {
        "k" : 66,
        "_id" : 6
    },
    {
        "k" : 11,
        "_id" : 9
    }
]
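
Once no duplicate keys remain in the collection, the unique index the question asks for can be created with createIndex (without dropDups), and later inserts of an existing key are then rejected with a duplicate-key error. The following is a minimal sketch assuming the mongo shell; ensureUniqueKey and insertIfNew are hypothetical helper names, and it assumes the thrown error exposes the server error code as err.code, which may differ between shells and drivers.

```javascript
// Sketch for the mongo shell; both helpers are hypothetical, not MongoDB APIs.

// Create a unique index on `field`. Index creation fails while duplicates
// are still present, so this should run only after the cleanup.
function ensureUniqueKey(coll, field) {
  const spec = {};
  spec[field] = 1;
  return coll.createIndex(spec, { unique: true });
}

// Insert `doc` and report whether it was stored. A duplicate-key rejection
// (server error code 11000) is treated as "already present".
function insertIfNew(coll, doc) {
  try {
    coll.insertOne(doc);
    return true;
  } catch (err) {
    // Assumption: the error object carries the server code as err.code.
    if (err.code === 11000) return false;
    throw err; // any other failure is not handled here
  }
}
```

For the collection in the question this could look like ensureUniqueKey(db.data, "key") followed by insertIfNew(db.data, newDoc) for each incoming record.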
