mongodb java 子查询,MongoDB中的子查询

最新推荐文章于 2022-11-18 10:36:27 发布

邓泽键

最新推荐文章于 2022-11-18 10:36:27 发布

阅读量243

点赞数

文章标签： mongodb java 子查询

本文探讨了在 MongoDB 中如何处理跨集合查询，尤其是针对用户和事件集合进行计数和分组操作的问题。由于 MongoDB 不支持传统 SQL 的 JOIN 操作，作者建议通过在事件集合中嵌入用户信息，如国家字段，来优化查询效率。通过这种方式，可以简化聚合管道，并避免在服务器上执行多次查询。此外，还提供了批量更新事件集合以包含用户国家信息的示例代码。

摘要由CSDN通过智能技术生成

I have two collections in MongoDB, one with users and one with actions. Users look roughly like:

{_id: ObjectId("xxxxx"), country: "UK",...}

and actions like

{_id: ObjectId("yyyyy"), createdAt: ISODate(), user: ObjectId("xxxxx"),...}

I am trying to count events and distinct users split by country. The first half of which is working fine, however when I try to add in a sub-query to pull the country I only get nulls out for country

db.events.aggregate({

$match: {

createdAt: { $gte: ISODate("2013-01-01T00:00:00Z") },

user: { $exists: true }

}

{

$group: {

_id: {

year: { $year: "$createdAt" },

user_obj: "$user"

count: { $sum: 1 }

}

{

$group: {

_id: {

year: "$_id.year",

country: db.users.findOne({

_id: { $eq: "$_id.user_obj" },

country: { $exists: true }

}).country

total: { $sum: "$count" },

distinct: { $sum: 1 }

}

})

解决方案

No Joins in here, just us bears

So MongoDB "does not do joins". You might have tried something like this in the shell for example:

db.events.find().forEach(function(event) {

event.user = db.user.findOne({ "_id": eventUser });

printjson(event)

})

But this does not do what you seem to think it does. It actually does exactly what it looks like and, runs a query on the "user" collection for every item that is returned from the "events" collection, both "to and from" the "client" and is not run on the server.

For the same reasons your 'embedded' statement within an aggregation pipeline does not work like that. Unlike the above the "whole pipeline" logic is sent to the server before execution. So if you did something like this to 'select "UK" users:

db.events.aggregate([

{ "$match": {

"user": {

"$in": db.users.distinct("_id",{ "country": "UK" })

}

}}

])

Then that .distinct() query is actually evaluated on the "client" and not the server and therefore not having availability to any document values in the aggregation pipeline. So the .distinct() runs first, returns it's array as an argument and then the whole pipeline is sent to the server. That is the order of execution.

Correcting

You need at least some level of de-normalization for the sort of query you want to run to work. So you generally have two choices:

Embed your whole user object data within the event data.

At least embed "some" of the user object data within the event data. In this case "country" becasue you are going to use it.

So then if you follow the "second" case there and at least "extend" your existing data a little to include the "country" like this:

{

"_id": ObjectId("yyyyy"),

"createdAt": ISODate(),

"user": {

"_id": ObjectId("xxxxx"),

"country": "UK"

}

Then the "aggregation" process becomes simple:

db.events.aggregate([

{ "$match": {

"createdAt": { "$gte": ISODate("2013-01-01T00:00:00Z") },

"user": { "$exists": true }

}},

{ "$group": {

"_id": {

"year": { "$year": "$createdAt" },

"user_id": "$user._id"

"country": "$user.country"

"count": { "$sum": 1 }

}},

{ "$group": {

"_id": "$_id.country",

"total": { "$sum": "$count" },

"distinct": { "$sum": 1 }

}}

])

We're not normal

Fixing your data to include the information it needs on a single collection where we "do not do joins" is a relatively simple process. Just really a variant on the original query sample above:

var bulk = db.events.intitializeUnorderedBulkOp(),

count = 0;

db.users.find().forEach(function(user) {

// update multiple events for user

bulk.find({ "user": user._id }).update({

"$set": { "user": { "_id": user._id, "country": user.country } }

});

count++;

// Send batch every 1000

if ( count % 1000 == 0 ) {

bulk.execute();

bulk = db.events.intitializeUnorderedBulkOp();

}

});

// Clear any queued

if ( count % 1000 != 0 )

bulk.execute();

So that's what it's all about. Individual queries to a MongoDB server get "one collection" and "one collection only" to work with. Even the fantastic "Bulk Operations" as shown above can still only be "batched" on a single collection.

If you want to do things like "aggregate on related properties", then you "must" contain those properties in the collection you are aggregating data for. It is perfectly okay to live with having data sitting in separate collections, as for instance "users" would generally have more information attached to them than just and "_id" and a "country".

But the point here is if you need "country" for analysis of "event" data by "user", then include it in the data as well. The most efficient server join is a "pre-join", which is the theory in practice here in general.

邓泽键

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
mongodb java 子查询,MongoDB中的子查询

I have two collections in MongoDB, one with users and one with actions. Users look roughly like:{_id: ObjectId("xxxxx"), country: "UK",...}and actions like{_id: ObjectId("yyyyy"), createdAt: ISODate()...
复制链接

扫一扫