mongodb java 子查询,MongoDB中的子查询

I have two collections in MongoDB, one with users and one with actions. Users look roughly like:

{_id: ObjectId("xxxxx"), country: "UK",...}

and actions like

{_id: ObjectId("yyyyy"), createdAt: ISODate(), user: ObjectId("xxxxx"),...}

I am trying to count events and distinct users split by country. The first half of which is working fine, however when I try to add in a sub-query to pull the country I only get nulls out for country

db.events.aggregate({

$match: {

createdAt: { $gte: ISODate("2013-01-01T00:00:00Z") },

user: { $exists: true }

}

},

{

$group: {

_id: {

year: { $year: "$createdAt" },

user_obj: "$user"

},

count: { $sum: 1 }

}

},

{

$group: {

_id: {

year: "$_id.year",

country: db.users.findOne({

_id: { $eq: "$_id.user_obj" },

country: { $exists: true }

}).country

},

total: { $sum: "$count" },

distinct: { $sum: 1 }

}

})

解决方案

No Joins in here, just us bears

So MongoDB "does not do joins". You might have tried something like this in the shell for example:

db.events.find().forEach(function(event) {

event.user = db.user.findOne({ "_id": eventUser });

printjson(event)

})

But this does not do what you seem to think it does. It actually does exactly what it looks like and, runs a query on the "user" collection for every item that is returned from the "events" collection, both "to and from" the "client" and is not run on the server.

For the same reasons your 'embedded' statement within an aggregation pipeline does not work like that. Unlike the above the "whole pipeline" logic is sent to the server before execution. So if you did something like this to 'select "UK" users:

db.events.aggregate([

{ "$match": {

"user": {

"$in": db.users.distinct("_id",{ "country": "UK" })

}

}}

])

Then that .distinct() query is actually evaluated on the "client" and not the server and therefore not having availability to any document values in the aggregation pipeline. So the .distinct() runs first, returns it's array as an argument and then the whole pipeline is sent to the server. That is the order of execution.

Correcting

You need at least some level of de-normalization for the sort of query you want to run to work. So you generally have two choices:

Embed your whole user object data within the event data.

At least embed "some" of the user object data within the event data. In this case "country" becasue you are going to use it.

So then if you follow the "second" case there and at least "extend" your existing data a little to include the "country" like this:

{

"_id": ObjectId("yyyyy"),

"createdAt": ISODate(),

"user": {

"_id": ObjectId("xxxxx"),

"country": "UK"

}

}

Then the "aggregation" process becomes simple:

db.events.aggregate([

{ "$match": {

"createdAt": { "$gte": ISODate("2013-01-01T00:00:00Z") },

"user": { "$exists": true }

}},

{ "$group": {

"_id": {

"year": { "$year": "$createdAt" },

"user_id": "$user._id"

"country": "$user.country"

},

"count": { "$sum": 1 }

}},

{ "$group": {

"_id": "$_id.country",

"total": { "$sum": "$count" },

"distinct": { "$sum": 1 }

}}

])

We're not normal

Fixing your data to include the information it needs on a single collection where we "do not do joins" is a relatively simple process. Just really a variant on the original query sample above:

var bulk = db.events.intitializeUnorderedBulkOp(),

count = 0;

db.users.find().forEach(function(user) {

// update multiple events for user

bulk.find({ "user": user._id }).update({

"$set": { "user": { "_id": user._id, "country": user.country } }

});

count++;

// Send batch every 1000

if ( count % 1000 == 0 ) {

bulk.execute();

bulk = db.events.intitializeUnorderedBulkOp();

}

});

// Clear any queued

if ( count % 1000 != 0 )

bulk.execute();

So that's what it's all about. Individual queries to a MongoDB server get "one collection" and "one collection only" to work with. Even the fantastic "Bulk Operations" as shown above can still only be "batched" on a single collection.

If you want to do things like "aggregate on related properties", then you "must" contain those properties in the collection you are aggregating data for. It is perfectly okay to live with having data sitting in separate collections, as for instance "users" would generally have more information attached to them than just and "_id" and a "country".

But the point here is if you need "country" for analysis of "event" data by "user", then include it in the data as well. The most efficient server join is a "pre-join", which is the theory in practice here in general.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值