Java MongoDB query cursor, MongoDB: cannot use a cursor to iterate through all the data

Update on update:

It was caused by a corrupted data set, not by MongoDB or the driver.

=========================================================================

I'm using the latest Java driver (2.11.3) for MongoDB (2.4.6). I've got a collection with ~250M records and I want to use a cursor to iterate through all of them. However, after 10 minutes or so I get either a false from cursor.hasNext(), or an exception saying that the cursor does not exist on the server.

After that I learned about cursor timeouts and wrapped my cursor.next() in a try/catch. If an exception is thrown, or hasNext() returns false before all records have been iterated, the program closes the cursor, allocates a new one, and then skips right back to where it left off.
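In rough outline, the workaround looked something like the sketch below. This is a reconstruction for illustration, not the actual program; the collection handle and the processing step are placeholders:

    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;

    // Rough reconstruction of the retry-with-skip workaround described above.
    // 'coll' is a DBCollection obtained as usual from the 2.x driver.
    public class RetryWithSkip {
        static void iterateWithRetry(DBCollection coll) {
            long total = coll.count();       // target count; concurrent inserts would change it
            long processed = 0;

            while (processed < total) {
                // skip() is evaluated server-side by walking past 'processed' documents,
                // which gets slower and slower as the offset grows.
                DBCursor cursor = coll.find().skip((int) processed);
                try {
                    while (cursor.hasNext()) {   // may return false early if the cursor times out
                        DBObject doc = cursor.next();
                        // ... process doc ...
                        processed++;
                    }
                } catch (RuntimeException e) {
                    // covers the driver's MongoException as well as the NoSuchElementException
                    // seen later: the cursor is gone, so loop around and allocate a new one
                } finally {
                    cursor.close();
                }
            }
        }
    }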

But later on I read about cursor.skip() performance issues... and the program has just reached ~20M records, where cursor.next() after cursor.skip() threw a java.util.NoSuchElementException. I believe that's because the skip operation timed out, which invalidated the cursor.

Yes, I've read about skip() performance issues and cursor timeout issues... but now I think I'm in a dilemma where fixing one will break the other.

So, is there a way to gracefully iterate through all the data in a huge dataset?

@mnemosyn has already pointed out that I have to rely on range-based queries. But the problem is that I want to split all the data into 16 parts and process them on different machines, and the data is not uniformly distributed over any monotonic key space. If load balancing is desired, there has to be a way to calculate how many keys fall in a particular range and balance them. Since my goal is 16 partitions, I have to find the quartiles of quartiles of the keys (sorry, I don't know if there's a mathematical term for this; in other words, the 15 keys that split the collection into 16 roughly equal parts) and use them to split the data.

Is there a way to achieve this?
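For what it's worth, one way to approximate such split points on MongoDB 2.4 (which predates the $sample aggregation stage) would be a one-off skip() per boundary along an indexed field. This is only an illustrative sketch with placeholder database/collection names, not the approach from the accepted answer, and each skip() is itself expensive on a collection this size, though it runs only once per boundary rather than once per batch:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;
    import org.bson.types.ObjectId;

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: approximate the 15 keys that divide the collection into
    // 16 roughly equal parts by skipping along the _id index.
    public class FindSplitPoints {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost", 27017);
            DBCollection coll = client.getDB("test").getCollection("tweets");

            long total = coll.count();
            List<ObjectId> splitPoints = new ArrayList<ObjectId>();

            for (int i = 1; i < 16; i++) {
                DBObject boundary = coll.find(new BasicDBObject(),
                                              new BasicDBObject("_id", 1))   // project _id only
                                        .sort(new BasicDBObject("_id", 1))
                                        .skip((int) (total * i / 16))
                                        .limit(1)
                                        .next();
                splitPoints.add((ObjectId) boundary.get("_id"));
            }
            // splitPoints now holds 15 boundary keys; worker k would process the
            // half-open _id range between its two neighbouring boundaries.
            client.close();
        }
    }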

I do have some ideas for what to do once the partition boundary keys have been obtained: if the new cursor times out again, I can simply record the latest tweetID and resume with a new range. However, the range query has to be fast enough, otherwise I'll still get timeouts. I'm not confident about this...

Update:

Problem solved! I didn't realise that I don't have to partition the data into contiguous chunks; a round-robin job dispatcher will do. See the comments on the accepted answer.
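The comments themselves aren't reproduced here, but the idea is roughly this: a single reader walks the collection in batches (for example with the $lt pattern sketched in the answer below) and hands each batch to one of 16 workers in rotation, so no up-front key-space statistics are needed. A purely hypothetical illustration of that rotation, with placeholder fetch/process methods:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustration only: one reader, 16 workers, batches handed out in rotation.
    public class RoundRobinDispatch {
        public static void main(String[] args) throws InterruptedException {
            final int WORKERS = 16;
            List<BlockingQueue<List<Object>>> queues = new ArrayList<BlockingQueue<List<Object>>>();

            for (int i = 0; i < WORKERS; i++) {
                final BlockingQueue<List<Object>> q =
                        new LinkedBlockingQueue<List<Object>>(4);   // bounded: back-pressure on the reader
                queues.add(q);
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                process(q.take());                  // worker loop (shutdown handling omitted)
                            }
                        } catch (InterruptedException ignored) {
                        }
                    }
                }).start();
            }

            int next = 0;
            for (List<Object> batch; (batch = fetchNextBatch()) != null; ) {
                queues.get(next).put(batch);                        // hand the batch to the next worker
                next = (next + 1) % WORKERS;                        // round-robin rotation
            }
        }

        // Placeholder: stands in for one $lt-bounded batch query like the one sketched below.
        static List<Object> fetchNextBatch() { return null; }

        // Placeholder: whatever per-batch processing the workers do.
        static void process(List<Object> batch) { }
    }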

Solution

In general, yes. If you have a monotonic field, ideally an indexed one, you can simply walk along it. For instance, if you're using ObjectId values as the primary key, or if you have a CreatedDate or something similar, you can simply sort on that field, issue an $lt query, take a fixed number of elements, and then query again with $lt set to the smallest _id or CreatedDate you encountered in the previous batch.
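A minimal sketch of that batch-walking pattern with the 2.x Java driver might look like this (database, collection, and batch size are placeholders, and error handling is omitted):

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;
    import org.bson.types.ObjectId;

    // Walk the collection in fixed-size batches ordered by _id descending,
    // restarting each batch below the smallest _id seen so far.
    public class BatchWalk {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost", 27017);
            DBCollection coll = client.getDB("test").getCollection("tweets");

            final int BATCH = 10000;
            ObjectId lastSeen = null;     // smallest _id processed so far
            boolean more = true;

            while (more) {
                DBObject query = (lastSeen == null)
                        ? new BasicDBObject()
                        : new BasicDBObject("_id", new BasicDBObject("$lt", lastSeen));

                // Short-lived cursor: at most BATCH documents, so it never idles long
                // enough to hit the server-side timeout.
                DBCursor cursor = coll.find(query)
                                      .sort(new BasicDBObject("_id", -1))
                                      .limit(BATCH);
                try {
                    int n = 0;
                    while (cursor.hasNext()) {
                        DBObject doc = cursor.next();
                        lastSeen = (ObjectId) doc.get("_id");
                        // ... process doc ...
                        n++;
                    }
                    more = (n == BATCH);  // a short batch means we've reached the end
                } finally {
                    cursor.close();
                }
            }
            client.close();
        }
    }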

Be careful about strict vs. non-strict monotonic behavior: if the keys aren't strictly monotonic (i.e. duplicate values are possible), you might have to use $lte and then make sure you don't process the duplicates twice. Since the _id field is unique, a walk along _id is always strict.
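For a non-strict key, a sketch of the same walk using $lte plus a little bookkeeping might look like the following. It assumes a createdDate field and that no single value occurs in more documents than one batch can hold; otherwise this simple scheme cannot make progress:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;

    import java.util.Date;
    import java.util.HashSet;
    import java.util.Set;

    // Walk along a non-unique "createdDate" field, remembering which _ids were already
    // seen at the boundary value so duplicates are not processed twice.
    public class DateWalk {
        static void walkByCreatedDate(DBCollection coll) {
            final int BATCH = 10000;
            Date lastDate = null;                              // boundary value of the previous batch
            Set<Object> boundaryIds = new HashSet<Object>();   // _ids already seen at that value

            while (true) {
                DBObject query = (lastDate == null)
                        ? new BasicDBObject()
                        : new BasicDBObject("createdDate", new BasicDBObject("$lte", lastDate));

                DBCursor cursor = coll.find(query)
                                      .sort(new BasicDBObject("createdDate", -1))
                                      .limit(BATCH);
                int n = 0;
                Date batchMin = null;                          // smallest value in this batch
                Set<Object> minIds = new HashSet<Object>();    // all _ids seen at that value
                try {
                    while (cursor.hasNext()) {
                        DBObject doc = cursor.next();
                        Date d = (Date) doc.get("createdDate");
                        Object id = doc.get("_id");
                        n++;
                        if (batchMin == null || d.before(batchMin)) {
                            batchMin = d;
                            minIds.clear();
                        }
                        if (d.equals(batchMin)) {
                            minIds.add(id);
                        }
                        if (d.equals(lastDate) && boundaryIds.contains(id)) {
                            continue;                          // duplicate from the previous boundary
                        }
                        // ... process doc ...
                    }
                } finally {
                    cursor.close();
                }
                if (n < BATCH) {
                    break;                                     // short batch: end of collection
                }
                lastDate = batchMin;
                boundaryIds = minIds;
            }
        }
    }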

If you don't have such a key, things are a little trickier. You can still iterate 'along an index' (whatever index it is: a name, a hash, a UUID/GUID, etc.). That works just as well, but it's hard to get a snapshot, because you never know whether a result you just found was inserted before or after you started the traversal. Also, documents inserted at a position the traversal has already passed will be missed.
