MongoDB Java deduplication: removing duplicates from a MongoDB 4.2 database

Latest recommended article published on 2023-04-23 18:47:48

刘洛希


Tags: mongodb, java, deduplication

I am trying to remove duplicates from MongoDB, but all the solutions I have found fail.

My JSON structure:

    {
        "_id" : ObjectId("5d94ad15667591cf569e6aa4"),
        "a" : "aaa",
        "b" : "bbb",
        "c" : "ccc",
        "d" : "ddd",
        "key" : "057cea2fc37aabd4a59462d3fd28c93b"
    }

The key value is md5(a+b+c+d).

I already have a database with over 1 billion records. I want to remove all duplicates according to key and then add a unique index on it, so that a record whose key is already in the database won't be inserted again.

I already tried

db.data.ensureIndex( { key:1 }, { unique:true, dropDups:true } )

But from what I understand, the dropDups option was removed as of MongoDB 3.0.
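For reference, a minimal sketch of the same index without dropDups. It assumes the duplicates have already been removed, since building a unique index over duplicate values fails with a duplicate key error. The helper name createUniqueKeyIndex is made up for illustration; createIndex is the current name for what ensureIndex used to do.

```javascript
// Sketch: create the unique index only after the duplicates are gone.
// `collection` is assumed to expose createIndex(keys, options), as
// db.data does in the mongo shell. The helper name is hypothetical.
function createUniqueKeyIndex(collection) {
  // dropDups is intentionally absent: it is no longer supported.
  return collection.createIndex({ key: 1 }, { unique: true });
}
```

In the shell this would be called as createUniqueKeyIndex(db.data).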

I also tried several JavaScript snippets like:

    var duplicates = [];
    db.data.aggregate([
        { $match: {
            key: { "$ne": '' }  // skip documents with an empty key
        }},
        { $group: {
            _id: { key: "$key" },  // can be grouped on multiple properties
            dups: { "$addToSet": "$_id" }
        }},
        { $match: {
        }}
    ],
    { allowDiskUse: true }  // let large stages write temporary files to disk
    ).forEach(function(doc) {
        doc.dups.shift();  // First element skipped for deleting
        doc.dups.forEach(function(dupId) {
            duplicates.push(dupId);  // Getting all duplicate ids
        });
    });

and it fails with:

    QUERY [js] uncaught exception: Error: command failed: {
        "ok" : 0,
        "errmsg" : "assertion src/mongo/db/pipeline/value.cpp:1365",
        "code" : 8,
        "codeName" : "UnknownError"
    } : aggregate failed

I haven't changed any MongoDB settings; I am working with the defaults.
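One thing worth ruling out is how the options are passed: aggregate takes the pipeline array and the options document as two separate arguments. The sketch below keeps the two apart. The count field and the final $match on it are an assumption about what the last stage was meant to do, and buildDuplicateQuery is a hypothetical helper name. It does not explain the value.cpp assertion itself, whose cause cannot be told from the message alone.

```javascript
// Sketch: build the pipeline and the options as two separate values.
// The `count` accumulator and the last $match are assumptions added for
// illustration; the helper name is hypothetical.
function buildDuplicateQuery() {
  const pipeline = [
    // Skip documents with an empty key.
    { $match: { key: { $ne: "" } } },
    // One group per key, collecting the _ids that share it.
    { $group: {
        _id: { key: "$key" },
        dups: { $addToSet: "$_id" },
        count: { $sum: 1 }
    } },
    // Keep only keys that occur more than once.
    { $match: { count: { $gt: 1 } } }
  ];
  // Allows stages to spill to disk when they exceed the memory limit.
  const options = { allowDiskUse: true };
  return { pipeline: pipeline, options: options };
}
```

In the shell the result would be used as var q = buildDuplicateQuery(); db.data.aggregate(q.pipeline, q.options).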

Solution

This is my input collection dups, with some duplicate data (k with values 11 and 22):

    { "_id" : 1, "k" : 11 }
    { "_id" : 2, "k" : 22 }
    { "_id" : 3, "k" : 11 }
    { "_id" : 4, "k" : 44 }
    { "_id" : 5, "k" : 55 }
    { "_id" : 6, "k" : 66 }
    { "_id" : 7, "k" : 22 }
    { "_id" : 8, "k" : 88 }
    { "_id" : 9, "k" : 11 }

The following aggregation returns one document per distinct k value; it does not modify the collection itself:

    db.dups.aggregate([
        { $group: {
            _id: "$k",
            dups: { "$addToSet": "$_id" }
        }},
        { $project: { k: "$_id", _id: { $arrayElemAt: [ "$dups", 0 ] } } }
    ])

    { "k" : 88, "_id" : 8 }
    { "k" : 22, "_id" : 7 }
    { "k" : 44, "_id" : 4 }
    { "k" : 55, "_id" : 5 }
    { "k" : 66, "_id" : 6 }
    { "k" : 11, "_id" : 9 }

As you can see, the following duplicate documents are left out of the result:

    { "_id" : 1, "k" : 11 }
    { "_id" : 2, "k" : 22 }
    { "_id" : 3, "k" : 11 }
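Which _id survives for each k is not guaranteed here, because $addToSet leaves the order of the array elements unspecified. If a predictable choice matters, one possible variation is to keep the smallest _id per group with the $min accumulator, which also avoids building an array of every _id. This is a sketch of that variation, not the pipeline from the answer; buildKeepOnePipeline and keepId are names made up for illustration.

```javascript
// Sketch: a pipeline that keeps the smallest _id for each distinct value
// of `field`. Helper and field names are hypothetical.
function buildKeepOnePipeline(field) {
  const keepProjection = { _id: "$keepId" };
  // Expose the grouped value again under its original field name.
  keepProjection[field] = "$_id";
  return [
    // One group per distinct value, remembering only the lowest _id.
    { $group: { _id: "$" + field, keepId: { $min: "$_id" } } },
    { $project: keepProjection }
  ];
}
```

In the shell this would be run as db.dups.aggregate(buildKeepOnePipeline("k")).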

Get the results in an array:

var arr = db.dups.aggregate([ ...] ).toArray()

arr now holds the array of documents:

    [
        {
            "k" : 88,
            "_id" : 8
        },
        {
            "k" : 22,
            "_id" : 7
        },
        {
            "k" : 44,
            "_id" : 4
        },
        {
            "k" : 55,
            "_id" : 5
        },
        {
            "k" : 66,
            "_id" : 6
        },
        {
            "k" : 11,
            "_id" : 9
        }
    ]
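The steps above stop at listing the documents to keep. One possible way to get to an actual deletion, sketched here under assumptions, is a single pass over the documents sorted by the duplicate field, collecting the _id of every document whose value equals the previous one. This is a different technique from the aggregation above: it keeps the first document in sort order. The helper name collectDuplicateIds is made up, and for a very large collection the sort would need an index on the field and the ids would have to be deleted in batches rather than held in one array.

```javascript
// Sketch: collect the _ids of all but the first document for each value
// of `field`. `sortedDocs` is assumed to be sorted by `field` and to
// expose forEach (an array here, a shell cursor in practice).
function collectDuplicateIds(sortedDocs, field) {
  const duplicateIds = [];
  let havePrevious = false;
  let previous;
  sortedDocs.forEach(function (doc) {
    // Strict equality is enough for numbers and strings such as an md5 key.
    if (havePrevious && doc[field] === previous) {
      duplicateIds.push(doc._id);
    } else {
      previous = doc[field];
      havePrevious = true;
    }
  });
  return duplicateIds;
}

// The sample collection from above, sorted by k and then by _id.
const sample = [
  { _id: 1, k: 11 }, { _id: 3, k: 11 }, { _id: 9, k: 11 },
  { _id: 2, k: 22 }, { _id: 7, k: 22 },
  { _id: 4, k: 44 }, { _id: 5, k: 55 }, { _id: 6, k: 66 }, { _id: 8, k: 88 }
];
console.log(collectDuplicateIds(sample, "k")); // [ 3, 9, 7 ]

// Possible shell usage (assumes the dups collection from above):
//   var ids = collectDuplicateIds(db.dups.find().sort({ k: 1, _id: 1 }), "k");
//   db.dups.deleteMany({ _id: { $in: ids } });
```

Once the duplicates are gone, the unique index on the field can be created without the build failing on duplicate values.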
